Building a Boilerplate with AI: What We Learned After Three Rebuilds

Client: Internal - XBG Solutions

2024-12-04

We spent six months and three complete rebuilds building a SvelteKit boilerplate with AI tools. Here's what we learned about token economics, the agreeability problem, and where humans must stay in control.

We decided to build our own SvelteKit boilerplate. Not because we love reinventing wheels, but because we needed to learn how AI-assisted development actually works when you’re building something real.

Six months, three complete rebuilds, $1,000 in API tokens, and 677 passing tests later, we’ve got a boilerplate that saves us 1-2 weeks per project. More importantly, we’ve got lessons about working with AI that directly inform how we help clients adopt these approaches.

This isn’t a success story. It’s an honest account of what broke, what we fixed, and what we’re still figuring out.

Why Build Your Own Boilerplate?

There weren’t many (if any) production-ready boilerplates for SvelteKit projects when we started. Given we’d chosen Svelte with TypeScript as our frontend architecture—future-leaning framework, truly reactive rather than just stateful, tiny bundles that compile away the framework, performance obsessed—we needed something purpose-built.

More critically, we knew this boilerplate would be the foundation for every subsequent client project. It needed to be opinionated enough to speed delivery but not so locked-in that we’d curse our own names later. We’ve been the CTOs inheriting spaghetti code with no extensibility. We’re conscious of not becoming that problem for others.

And if we were building the foundation for rapid MVP development anyway, it was the perfect test case for learning how to build with an LLM.

The Dark Period: Two Months Bug-Chasing Nothing

The first version didn’t work. Not “it needs some fixes” didn’t work—“the app won’t even initialise” didn’t work.

We spent two months bug-chasing. Those were dark months.

The problem wasn’t the code quality in isolation. The problem was we’d jumped from “here’s what we want” straight to “let’s build it.” We’d let the LLM make too many design choices. We were working entirely in Claude Web at that point—smaller context windows than today, fewer ways to get data in, constant copying of project files into .txt documents uploaded to project knowledge.

The model never had a chance of knowing the whole context. Not the whole project, not the evolution timeline of ideas and requirements and design choices and the reasons behind them.

We’d max out context windows mid-conversation and have to start fresh. We’d hit the 4-hour usage limits with only three productive work periods per day and two mandatory 4-hour waits in between. You’d burn a whole day with only a handful of hours of actual progress.

What kept us going? Relief. Once we bit the bullet and decided to start over with a different approach, things rolled more smoothly from day one.

The Second Build: Architecture First, Always

For the second iteration, we defined standards and architectural guardrails before building anything:

  • When is a function a utility and when is it a service?
  • When should logic live in a store versus a standalone service? (See the sketch after this list.)
  • What are our logging and error patterns?
  • Commenting standards?
  • Unit and integration test standards, mocking approaches, coverage goals?
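
To make the first two questions concrete, here's the kind of convention those standards end up encoding. It's a minimal sketch with illustrative names, not code lifted from the boilerplate:

```typescript
import { writable } from 'svelte/store';

// A utility: pure and stateless. No I/O, no app dependencies, trivially unit-testable.
export function formatCurrency(amountCents: number, currency = 'AUD'): string {
  return new Intl.NumberFormat('en-AU', { style: 'currency', currency }).format(amountCents / 100);
}

// A service: owns I/O plus the logging and error patterns the standards define.
export interface Invoice { id: string; totalCents: number; status: 'draft' | 'sent' | 'paid' }

export async function fetchInvoice(id: string): Promise<Invoice> {
  const res = await fetch(`/api/invoices/${id}`);
  if (!res.ok) throw new Error(`Failed to fetch invoice ${id}: ${res.status}`);
  return (await res.json()) as Invoice;
}

// A store: only when UI state genuinely needs to be shared reactively across components.
export const currentInvoice = writable<Invoice | null>(null);
```

The point of writing this down up front is that the LLM never gets to guess which bucket a new piece of code belongs in.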

This mirrors what we now tell clients: LLMs perform like senior developers when prompted well, but they need the hand-holding of juniors. They think too literally. They always want to say “yes”—or in Claude’s case, “brilliant!”

Development with AI needs to reflect real-world development processes. Requirements definitions and user stories. Grooming sessions. Technical design. Iterations from core architecture through to feature delivery.

As a product team that’s definitely taken shortcuts in process over the years to deliver at pace, we got a good lesson in why developers ask the questions and raise the hurdles they do.

The Workflow Split: Web for Design, CLI for Heavy Lifting

We tried various tools to handle the two scales a development project operates at simultaneously:

  1. Building a service or function that delivers a user story
  2. Keeping context of the entire project and architecture while you do it

Generally, LLMs fall one way or the other. They suck at doing both simultaneously. This is fair—it takes a senior developer or architect with a lot of war stories to balance wider and narrower context at the same time whilst talking in terms of requirements and solution design.

We landed on this split:

  • Claude Web for requirements → system design
  • Claude Web for first-attempt service builds
  • Claude Code CLI for corrections and moments needing whole-of-project view

If you had unlimited budget, you’d probably exist in Claude Code CLI from requirements implementation onwards. But in a budget-constrained reality—using an annual Claude subscription for most work and Claude Code tokens for heavy-lifting tasks—this proved a good balance.

Next we’re exploring running Qwen3-Coder locally on small, tightly defined tasks to reduce token burden even further.

The Token Economics Reality

The project cost about $1,000 in tokens. That’s a fraction of what a contractor would charge to build the same boilerplate.

But here’s the catch: we were extremely hands-on. If we had to do it again knowing what we know now, we’d quote a client $45,000 for similar work.

The point isn’t that AI is cheap. The point is that this boilerplate is an investment with returns for years to come via reuse.

The real lesson about token economics: you give a wide prompt—something refactor-level—and you’ll burn through $100 USD in half a day.
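
To see why, here's a back-of-envelope model. The per-token prices below are placeholder assumptions, not current list prices; the shape of the arithmetic is the point, because an agentic refactor re-reads large slabs of the repo on every step and input tokens dominate.

```typescript
// Back-of-envelope session cost. Prices are assumed placeholders, not current Anthropic pricing.
const INPUT_USD_PER_MTOK = 3;   // assumed $ per million input tokens
const OUTPUT_USD_PER_MTOK = 15; // assumed $ per million output tokens

function sessionCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_USD_PER_MTOK + (outputTokens / 1e6) * OUTPUT_USD_PER_MTOK;
}

// A wide refactor: ~40 agent steps, each re-reading ~500k tokens of context and emitting ~4k tokens.
const cost = sessionCostUSD(40 * 500_000, 40 * 4_000);
console.log(cost.toFixed(0)); // ≈ 62 (USD) for one half-day session, before any retries
```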

With the 4-hour limits, you’ll load up context for a piece of analysis, ask a question or two, and get locked out for four hours before you’ve arrived at a usable decision. You need the skill to be selective with context provision and know when to break up a task over a 4-hour pause.

Claude Web burns through tokens quickly. Claude Code CLI burns through dollars quickly. Both require careful task scoping.

The Agreeability Problem

Here’s a critical improvement we made: we added a “Rules of Engagement” document as the first thing in every project’s context.

The key rule:

Being polite is part of your programming, that’s great! Being agreeable is useless and dangerous, and in Australian culture is NOT polite. Being honest, considerate, and accurate is polite.

Therein, when I make a statement, suggestion, etc., think critically and provide considered feedback. Don’t just lead with things to the effect of “you’re absolutely right”.

If you default to being too agreeable, and to openings like “you’re absolutely right”, I will know that you’ve lost the context of the project knowledge and are not following your instructions.

This wasn’t a response to one final-straw incident. It was the realisation that we kept tripping over the same mistake—trying to pull critical thinking out of the LLM. When you start purposefully giving Claude erroneous input and it responds with “Brilliant!”, you start to mistrust everything it says.

Now, if Claude opens a message with “You’re absolutely right!” it’s taken as a break of the rules of engagement—a telltale that the model has slipped into agreeability and isn’t adhering to directions.

Does it work? Absolutely. Claude doesn’t catch itself, but it gives us a clear fail signal. When we suspect excess positivity, we can say: “Are you still observing the rules of engagement and this is actually a good approach, or have you lost context?”

The model quickly pulls itself back into line.

The Testing Revelation

LLMs write meaningless unit tests unless you lay down specific guardrails.

Claude’s default is to over-mock things, resulting in very brittle tests and mocks. Worse, it tests the step-by-step “how” rather than the substantive “what” of a given function.

We established core testing principles:

  1. Test WHAT, not HOW
  2. Minimal, strategic mocking
  3. Test organisation: Group related tests using nested describe blocks, one describe block per function, clear descriptive test names that read like documentation
  4. Test data management: Use factories for complex test data, keep test data minimal with only required fields, use meaningful values that aid debugging
  5. Assertion patterns: Test behaviour not data structure, use Jest matchers for cleaner assertions, group related assertions logically
  6. Error testing: Test both success and failure paths, use appropriate matchers, verify error types and messages

Critically, we provide examples and anti-examples of each principle.
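
Here's a cut-down version of what that looks like, in the spirit of the real guardrails rather than copied from them. It assumes a Vitest/Jest-compatible runner and the illustrative fetchInvoice service sketched earlier:

```typescript
import { describe, it, expect, vi, afterEach } from 'vitest';
import { fetchInvoice } from './invoiceService'; // illustrative service under test

afterEach(() => vi.restoreAllMocks());

describe('fetchInvoice', () => {
  // Anti-example: tests the HOW. Pinned to call counts and exact arguments, it breaks on any refactor.
  it('calls fetch exactly once with the exact URL (brittle)', async () => {
    const spy = vi.spyOn(globalThis, 'fetch').mockResolvedValue(
      new Response(JSON.stringify({ id: 'inv-1', status: 'paid' }))
    );
    await fetchInvoice('inv-1');
    expect(spy).toHaveBeenCalledTimes(1);
    expect(spy).toHaveBeenCalledWith('/api/invoices/inv-1'); // fails the moment we add headers or a base URL
  });

  // Example: tests the WHAT. This is the behaviour callers actually rely on.
  it('returns the invoice for a known id', async () => {
    vi.spyOn(globalThis, 'fetch').mockResolvedValue(
      new Response(JSON.stringify({ id: 'inv-1', status: 'paid' }))
    );
    await expect(fetchInvoice('inv-1')).resolves.toMatchObject({ id: 'inv-1', status: 'paid' });
  });

  // Error paths get tested too: verify the failure behaviour, not the internals.
  it('throws a descriptive error when the API returns 404', async () => {
    vi.spyOn(globalThis, 'fetch').mockResolvedValue(new Response('not found', { status: 404 }));
    await expect(fetchInvoice('missing')).rejects.toThrow(/missing/);
  });
});
```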

The LLM is the contractor who’s just trying to please you all the time, looking to say “I did it! I achieved the thing you asked for!” without any automatic regard for the “why” behind a given task. You need to force the why, and keep restating it.

Where Humans Must Stay in Control

We were extremely hands-on throughout. We defined the architecture. We defined coding and test standards. We worked in iterative loops.

We frequently watch the running printout of the LLM’s logic—whether CLI or web—to ensure it’s not going down rabbit holes. If any part of any response seems out of alignment with architecture or requirements, we get the LLM to critique its own work or explain its approach and logic.

What does “going down a rabbit hole” look like? It’s hard to articulate specific warning signs beyond this: you’re using 20+ years of experience to watch for incongruence and inconsistency.

The strategic lesson for CTOs: treat the LLM agent appropriately in hybrid human-bot development teams. Lean more heavily on documentation and review loops. Tight task scoping is imperative.

Leverage bot-agents for non-IP grunt work and bug-chasing, with their work reviewed by a human pair-programmer. Keep IP-heavy service builds internal and human-led. This also drives engagement amongst your dev team—they get to work on interesting “transform” topics rather than boring, disengaging data transfer and bug hunting.

The Third Version: Heavy Refactor, Not Full Rebuild

The final version was more a heavy refactor of version two than a start-from-zero rebuild.

After building the first app on top of the boilerplate, we realised configs were spread throughout the project rather than centralised. The same values were being set in multiple config and constants files rather than defined once and reused.
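
The fix was the boring one: a single configuration module that everything else imports from. A minimal sketch, assuming a standard SvelteKit/Vite setup (paths and values are illustrative):

```typescript
// src/lib/config/app.ts: the single source of truth for values previously duplicated across files.
export const APP_CONFIG = {
  api: {
    baseUrl: import.meta.env.VITE_API_BASE_URL ?? 'http://localhost:3000',
    timeoutMs: 10_000,
  },
  pagination: {
    defaultPageSize: 25,
  },
} as const;

// Consumers import the constant instead of re-declaring the value locally, e.g.:
//   import { APP_CONFIG } from '$lib/config/app';
//   fetch(`${APP_CONFIG.api.baseUrl}/invoices`, { signal: AbortSignal.timeout(APP_CONFIG.api.timeoutMs) });
```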

We also made a conscious choice to remove Flowbite-Svelte as our components library and go with shadcn-svelte instead. More extensive library. More customisable control and implementation options. And critically, its wide adoption offers more options for the future when it comes time to build the Figma → frontend files automation.

What We’ve Built On Top

We’ve built three frontends on top of the boilerplate already. None are in production yet, but it’s already proved its value: we go quickly from a finished backend to a clickable, deployed MVP.

Each time we save at least 1-2 weeks of frontend architectural development.

The Figma → components and routes automation will be the next significant step up in throughput.

What’s Next in the Automation Pipeline

Based on what we learned here, we’re prioritising:

Most achievable right now: Figma to component builder, especially via the Figma MCP.

Partially underway: Wireframes → Figma scaffolding project. This is about being very conscious of what the model can and can’t do, and identifying where that overlaps with potential time savings for product folk, clients, and UX designers.

Further down the list: Backend boilerplates. These have proved much easier to build with LLMs so far—especially the data → API layers with nice separation of concerns for models, repositories, services, controllers, and routes.
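
For reference, the separation of concerns we mean looks roughly like this. It's deliberately compressed into one file with illustrative names; in a real backend each layer gets its own module:

```typescript
import { randomUUID } from 'node:crypto';

// Model: the shape of the data, nothing else.
interface User { id: string; email: string }

// Repository: the only layer that knows how data is persisted.
class UserRepository {
  private users = new Map<string, User>(); // stand-in for a real database client
  async findById(id: string): Promise<User | undefined> { return this.users.get(id); }
  async save(user: User): Promise<User> { this.users.set(user.id, user); return user; }
}

// Service: business rules only. No HTTP concepts, no SQL.
class UserService {
  constructor(private readonly repo: UserRepository) {}
  async register(email: string): Promise<User> {
    if (!email.includes('@')) throw new Error('Invalid email address');
    return this.repo.save({ id: randomUUID(), email });
  }
}

// Controller/route: translates HTTP in and out, delegates everything else to the service.
export async function postUserHandler(body: { email: string }, service: UserService) {
  const user = await service.register(body.email);
  return { status: 201, json: user };
}
```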

A significant recent learning: adding an MCP layer over the top of the backend project is a quick way to UAT and iterate to confirm core flows work. It also highlights where plain-language descriptions of processes and flows need tighter definitions to remove ambiguity. This pattern of development will inevitably help when handing requirements definitions to frontend agents and humans in the loop.
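
As a rough illustration of that pattern, here's a minimal sketch assuming the @modelcontextprotocol/sdk TypeScript package and a hypothetical register_user flow: expose a backend endpoint as an MCP tool, point Claude at it, and talk your way through UAT. Every ambiguity in how the flow was described shows up immediately in the conversation.

```typescript
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';

// A thin MCP layer over the backend, used purely for conversational UAT of core flows.
const server = new McpServer({ name: 'backend-uat', version: '0.1.0' });

server.tool(
  'register_user', // hypothetical flow: swap in your own endpoints and descriptions
  { email: z.string().email() },
  async ({ email }) => {
    const res = await fetch('http://localhost:3000/api/users', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ email }),
    });
    // Return the raw status and body so gaps in the flow's definition are visible in the chat.
    return { content: [{ type: 'text' as const, text: `${res.status}: ${await res.text()}` }] };
  }
);

await server.connect(new StdioServerTransport());
```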

What We’d Tell Other CTOs

If you’re thinking about adopting AI-assisted development, here’s what matters:

Start with internal infrastructure projects. Building tools for yourselves is the perfect learning ground. You own the timeline, you’re the stakeholder, and the lessons are immediately applicable to client work.

Treat AI like a senior developer who needs junior hand-holding. It can perform at a high level when prompted well, but it needs rigorous standards, architecture documentation, and frequent review loops upfront.

Watch the token economics carefully. $100 can disappear in half a day on a refactor-level prompt. Budget accordingly and learn to scope tasks tightly.

Guard against agreeability. Build in rules that force the LLM to think critically rather than just pleasing you. Make excess positivity a red flag you actively watch for.

Define testing standards explicitly. Left to its own devices, AI writes brittle tests that check implementation rather than behaviour. You need guardrails.

Keep humans in control of requirements, architecture, and IP-heavy work. Let AI handle grunt work and bug-chasing, but have a human pair-programmer review everything. Your dev team will be more engaged working on interesting problems anyway.

The Honest Bits

What’s still rough around the edges? Our ability to swap between different CLI LLMs. That’s our next learning loop—using locally-run LLMs for discrete tasks to lower overall token spend whilst leveraging more powerful and costly LLMs only for heavy-hitting, deep-analysis, and architecture-level topics.

Would we do anything differently? We had three top-to-bottom rebuilds starting from zero lines of code. But we wouldn’t change much—we learned by doing. War stories and “wasted time”, tight-skulled frustration and yelling at a screen are often good teachers.

If we were to do it again with everything we know now, we’d achieve it in one rebuild instead of three. The cost would be about $45,000 realistically—but the point is, this app doesn’t need to be rebuilt. It’s an investment that will have returns for years to come via reuse.

The lessons we learned by being so hands-on deliver as much value as the boilerplate itself will, arguably more.

The Real Payoff

We now save 1-2 weeks per project using this boilerplate. We get to a deployable state on day one of a frontend build, with all underlying frontend architecture already built out. We step right into styling, components, and page development.

More valuable: we’ve learned how to work effectively with AI tools in a way that directly informs how we help clients adopt these approaches. We’ve hit the problems they’ll hit. We’ve found the solutions that actually work.

That’s worth more than any boilerplate.

Want to discuss a similar challenge?

We're always up for a chat about systems, automation, and pragmatic solutions.

Get in Touch