Spec-Driven Development Case Study: 13 Apps in 70 Days

Most articles about AI-assisted development describe a vendor tool, a workshop experiment, or a solo prototype built in an afternoon. This is none of those. It is a practitioner describing a real production build: one system, one developer, seventy days, real money, real compliance. The numbers are public. The limits are honest.

If you want the method itself, start with what Spec-Driven Development is. This is the proof of principle behind it: one real system, seventy days, the method applied at a scale where the failure modes are expensive and hard to recover from.

One crypto fintech, built solo with SDD

13 apps in production
3 APIs, 3 databases
8 shared packages
70 days
~1,650 automated tests
138K lines of TypeScript

The bet

The brief was not a prototype. Build a complete crypto payment platform: a PIX payment gateway for the Brazilian market, an OTC exchange engine, and on-chain settlement on Bitcoin’s Liquid Network. End to end. Production-grade. Real money, real compliance, real deadline. Solo. Seventy days.

That is not a project you improvise through a chat window. Three isolated PostgreSQL databases. Three Fastify APIs. Nine Next.js frontends. A shared component library. A shared authentication layer. Kubernetes behind all of it, in a Turborepo monorepo. Add the regulatory surface: PIX runs through Brazil’s central bank with its own compliance hooks and a bank-direct endpoint over mutual TLS. Add an OTC exchange engine that needs deterministic pricing, asymmetric spreads, and a fallback chain across five market sources. Add a settlement engine that has to handle stuck transactions, deviation-based refunds, and an auto-settle cap that overflows to manual approval when balances exceed a threshold. Each domain has its own failure modes. They compound. A wrong assumption in the pricing layer surfaces later, in the settlement layer, with real money in transit.

The main objections to spec-driven development in 2026 have come from two kinds of experiments: workshop prototypes where teams try the method on toy projects for an afternoon, and small solo builds where an engineer ships something in a few hours without SDD and concludes it was unnecessary overhead. Neither test is the relevant one. A 10-hour solo project with no compliance pressure, no real balances, and no external deadline does not need a 28-file spec corpus. An OTC exchange engine with a two-phase pricing model, five market sources, and a bank-direct BACEN integration very much does. The method is proportional to what you are protecting.

The naive approach to AI-assisted development at this scale is to open a chat window and start describing features. I have done it. It is the fast bricklayer with no blueprint: walls in minutes, staircase into a wall. At the scale of a multi-tenant fintech with compliance hooks, prompt-and-pray does not slow you down gradually. It builds something that looks complete and passes a shallow review, then fails when a real edge case arrives. The code compiles. The tests pass. The assumption that nobody wrote down comes back in production with a customer’s balance attached to it.

So I did not prompt. I specified.

The model maker says plan first

This is not a personal quirk or a contrarian take. Anthropic’s own guidance for Claude Code describes a four-step loop: explore, plan, implement, commit. There is a dedicated plan mode in the product whose entire job is to stop the agent from writing code while it thinks. Their reasoning is direct: “letting Claude jump straight to coding can produce code that solves the wrong problem.”

That is the company that trained the model telling you what most developers skip. They publish a prompt template for interviewing yourself into a specification before any code exists, then starting a fresh session to execute it. Three sentences, from the people who built the tool.

A 2025 randomized trial by METR found that experienced developers working with AI on real codebases were 19 percent slower than without it, despite expecting to be 24 percent faster. They were confidently wrong twice: wrong in their forecast beforehand, and wrong in their perception of the result afterward. The failure mode is not the model. It is capability without direction. The more capable the model, the further a vague instruction carries it in the wrong direction. Capability amplifies direction. It does not supply it.

Across 13 production applications, I ran what Anthropic describes at scale. The mechanism was a standing specification corpus: 28 markdown specs across 12 domains, plus 8 custom slash commands, living in the repository and loading as agent context at the start of every session. Not ad-hoc prompting. The project’s memory, on disk, versioned in git.

Spec first, prompt never

Every capability started as a written specification: requirements, design, tasks. Not a chat message. The agents executed against the spec. I reviewed against acceptance criteria and shipped. When output was wrong, I fixed the spec, not the prompt. Every line of generated code traced back to a decision I had made and could point to.

The standing spec corpus: the agent's memory, on disk

.claude/
- auth/ (domain): roles, scopes, route permissions
- exchange/ (domain): pricing.md, the VWAP and spread contract
- database/ (domain): entities, migrations, schema design
- testing/ (domain): conventions, db, ui
- ui/: design tokens, components
- codebase/: conventions, server actions
- commands/ (8 commands): /new-route, /migrate, /verify...
- skills/ (1 skill): platform conventions

Twenty-eight markdown specs across twelve domains, plus eight custom slash commands that turned the repeatable generation steps into contracts the agent had to follow. The slash commands are the part most people miss. They are not keyboard shortcuts. Each one encodes a complete workflow: route generation, database migration, test scaffolding, deploy verification. The agent executes them the same way, every session. No drift. No reinvention. The slash commands are the repeatable unit of work. The specs are the persistent memory behind them.

The auth domain, for example, is not a single note about “use JWT.” It is a spec covering five roles, nineteen scopes, eighteen granular permissions, the rule that every route declares its required scope at registration time, and the machine-to-machine flow with row-level scoping for client credentials. When the agent generates a new route, it reads that spec first. The scope declaration is not a thing the developer remembers to add. It is a thing the spec requires, and the agent checks.
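The "every route declares its required scope at registration time" rule can be sketched in a few lines. This is a framework-free illustration, not the project's Fastify code: the `RouteRegistry` class, the `RouteDefinition` shape, and the three scope names are all hypothetical. The idea it shows is that the scope is a required field, so a route without one fails both the type check and a runtime guard.

```typescript
// Illustrative sketch: the registry and the scope names are hypothetical.
type Scope = "charges:read" | "charges:write" | "settlements:approve";

interface RouteDefinition {
  method: "GET" | "POST";
  path: string;
  requiredScope: Scope; // not optional: a route without a scope does not type-check
}

class RouteRegistry {
  private routes = new Map<string, RouteDefinition>();

  register(route: RouteDefinition): void {
    // Runtime guard for callers that bypass the type checker (plain JS, casts).
    if (!route.requiredScope) {
      throw new Error(`Route ${route.method} ${route.path} declares no required scope`);
    }
    this.routes.set(`${route.method} ${route.path}`, route);
  }

  // Deny by default: unknown routes and missing scopes are both rejected.
  isAllowed(method: string, path: string, grantedScopes: readonly Scope[]): boolean {
    const route = this.routes.get(`${method} ${path}`);
    if (!route) return false;
    return grantedScopes.indexOf(route.requiredScope) !== -1;
  }
}

const registry = new RouteRegistry();
registry.register({ method: "POST", path: "/charges", requiredScope: "charges:write" });
console.log(registry.isAllowed("POST", "/charges", ["charges:read"])); // false
```

Making the scope part of the registration signature is what turns "remember to add it" into "cannot register without it."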

A single developer does not hold a 13-app fintech in their head. The spec holds it. The developer holds the spec.

The delivery loop

The delivery loop, repeated per capability

Input

A capability to ship (example: the settlement engine)

01 Specify
Requirements, design, and tasks written before a line of code. The spec is the contract, not decoration. Human approval required before advancing.
02 Generate
The agent implements against the spec and writes tests alongside code. No prompting. The spec is the instruction.
03 Verify
Review against acceptance criteria and the pnpm verify gate: Prettier, ESLint zero warnings, strict tsc, tests on a live database.
04 Correct the spec
Wrong output means a wrong or incomplete spec. Fix the document, regenerate. Never patch the code and leave the spec behind.

Output

Merged, tested, traceable to a decision

The fourth step is where the discipline lives. When something was wrong, the instinct is to patch the code and move on. That instinct is poison at scale. Patch the code, leave the spec untouched, and the next time that module is regenerated or extended, the agent rebuilds from the spec and reintroduces the same error. The spec is the source of truth. The code is what falls out of it. Every correction goes into the document first.

The loop repeated for every capability across all thirteen applications: the pricing engine, the settlement engine, the idempotency layer, the auth server, the webhook queue, the on-chain deposit poller, the Kubernetes manifests, the shared UI library. Same loop. Same discipline. Same rule about corrections. The repetition is the point. Consistency at this scale is not a management exercise. It is what keeps a single person from losing track of a system that is too large to hold in memory.

What the specs caught

The pricing engine

The clearest example is the OTC exchange pricing engine. Before any code existed, the pricing behavior lived in a single spec file. The two-phase model: a peg rate fixed to a reference index, active until the VWAP window activates, which requires at least five confirmed trades in a rolling 24-hour window. The fallback chain across five market sources, aggregated by median with outlier filtering. Asymmetric per-pair spreads, applied after VWAP resolution. And the precision rule that everything else depended on: all money arithmetic in 8-decimal integers, satoshi precision, never floating point.

That last rule was not a comment or a coding convention. It was a requirement, written in the spec before the first line of implementation existed, expressed as a test failure condition. Float arithmetic in financial software does not fail loudly. It drifts. Small rounding errors compound across multiple operations, across multiple trades, across multiple customers. By the time the discrepancy is visible in a ledger, it is a support problem, not a test failure. The spec made it a test failure on day one.
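To make the drift concrete, here is a minimal TypeScript sketch. It is not the project's code: the helper names (`sumFloat`, `sumUnits`, `findFloats`) and the payload shape are assumptions for illustration. The only things taken from the text are the rule itself (8-decimal integer units) and the idea that a float in a response should fail a test.

```typescript
// Illustrative sketch only: all names here are hypothetical, not from the project.
const UNITS_PER_COIN = 100_000_000; // 8 decimal places (satoshi precision)

// Adds `step` to a running total `times` times using ordinary floating point.
function sumFloat(step: number, times: number): number {
  let total = 0;
  for (let i = 0; i < times; i++) total += step;
  return total;
}

// The same accumulation in integer base units; exact while totals stay safe integers.
function sumUnits(stepUnits: number, times: number): number {
  let total = 0;
  for (let i = 0; i < times; i++) total += stepUnits;
  return total;
}

// Walks a response payload and returns the path of every non-integer number,
// so a test can fail on the first float that leaks into a money field.
function findFloats(value: unknown, path: string = "$"): string[] {
  const found: string[] = [];
  if (typeof value === "number") {
    if (!Number.isInteger(value)) found.push(path);
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => {
      found.push(...findFloats(item, `${path}[${i}]`));
    });
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      found.push(...findFloats(record[key], `${path}.${key}`));
    }
  }
  return found;
}

console.log(sumFloat(0.1, 10) === 1); // false: ten float additions of 0.1 land just below 1
console.log(sumUnits(10_000_000, 10) === UNITS_PER_COIN); // true: integer units stay exact
```

A test built on something like `findFloats` can assert that the returned list is empty for every pricing response, which is one way to turn "never floating point" into a hard failure rather than a convention.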

Here is a representative fragment of what that spec looked like:

## Precision rule (MUST)
All money arithmetic uses 8-decimal integers (satoshi precision).
No float or double anywhere in the pricing path.
A float in any response field is a test failure, not a lint warning.

## Two-phase model
Phase 1 (peg): price is fixed to the reference index.
Phase 2 (VWAP): rolling 24h VWAP, active once >= 5 confirmed trades exist.
The engine falls back from Phase 2 to Phase 1 when VWAP is unavailable.
Fallback is logged and triggers an ops alert.

## Fallback chain (market sources, in order)
1. Internal VWAP from confirmed trades (primary)
2. Cross-rate from 5 external sources, median-aggregated, outliers removed
3. Last known good price, flagged stale, ops notified

## Acceptance criteria
- Request with < 5 confirmed trades MUST use the peg price, not VWAP.
- When source 1 is unavailable, fallback to source 2 within 200ms.
- All returned amounts MUST be integers. A float anywhere is a test failure.
- Spread changes take effect on the next quote, never retroactively.

## Out of scope (v1)
- Dynamic spread adjustment based on volatility.
- Per-user spread overrides.

The production code mirrors that spec almost line for line. The fallback chain is implemented in the same order. The precision rule is enforced by both the type system and a dedicated test that fails if any field in the pricing response contains a float. The VWAP activation threshold is a named constant that matches the spec. The out-of-scope list stopped the agent from adding features nobody asked for, on two separate occasions. The document is the reason a solo developer could trust an exchange engine with real balances: the hard decisions were made, reviewed, and frozen before the agent touched the arithmetic.
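As a rough illustration of how the two-phase rule and the named threshold constant might look in code, here is a minimal sketch. The type and function names (`ConfirmedTrade`, `Quote`, `quotePrice`) are hypothetical, spreads and the external-source fallback are left out, and the truncating division is an assumption, since the text does not state a rounding rule.

```typescript
// Illustrative sketch: type and function names are hypothetical, not the project's API.
const VWAP_MIN_TRADES = 5;                  // activation threshold from the spec
const VWAP_WINDOW_MS = 24 * 60 * 60 * 1000; // rolling 24-hour window

interface ConfirmedTrade {
  priceUnits: bigint;    // price in 8-decimal integer units
  quantityUnits: bigint; // traded amount in 8-decimal integer units
  confirmedAtMs: number; // confirmation time, epoch milliseconds
}

interface Quote {
  phase: "peg" | "vwap";
  priceUnits: bigint;
}

function quotePrice(trades: ConfirmedTrade[], pegPriceUnits: bigint, nowMs: number): Quote {
  const windowStart = nowMs - VWAP_WINDOW_MS;
  const recent = trades.filter(
    (t) => t.confirmedAtMs > windowStart && t.confirmedAtMs <= nowMs
  );

  // Phase 1: fewer than five confirmed trades in the window means the peg price.
  if (recent.length < VWAP_MIN_TRADES) {
    return { phase: "peg", priceUnits: pegPriceUnits };
  }

  // Phase 2: VWAP = sum(price * quantity) / sum(quantity), all in bigint.
  let notional = 0n;
  let volume = 0n;
  for (const t of recent) {
    notional += t.priceUnits * t.quantityUnits;
    volume += t.quantityUnits;
  }
  if (volume === 0n) {
    // VWAP unavailable: fall back to the peg (logging and alerting omitted here).
    return { phase: "peg", priceUnits: pegPriceUnits };
  }
  // bigint division truncates toward zero; the real rounding rule is not in the source.
  return { phase: "vwap", priceUnits: notional / volume };
}
```

Using `bigint` throughout keeps the price-times-quantity products exact even when they exceed the safe integer range of `number`, and it makes an accidental float a type error rather than a silent coercion.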

Payment idempotency

The same pattern holds in the payment gateway. A spec froze the idempotency rule before a line of the charge endpoint was written:

## Idempotency (MUST)
WHEN a merchant POSTs a charge with a valid Idempotency-Key,
THE SYSTEM SHALL create at most one charge for that key.

WHEN the same Idempotency-Key is replayed within 24h,
THE SYSTEM SHALL return the original charge and create no new row.

## Schema
charges(id, merchant_id, amount_cents, currency, status,
        idempotency_key, created_at)
UNIQUE(merchant_id, idempotency_key)
-- Enforced at the database level. Application logic alone is not sufficient.
-- A retry that hits a constraint violation returns the original charge.

The UNIQUE constraint puts the rule in the database, not in application code, not in a cache, not in a middleware check that a future refactor might remove. Without a spec, that constraint lives in your head, and it falls out on the next regeneration: the agent rebuilds from whatever is written down, nothing written down mentions a unique index, and the new version of the module has no constraint. One customer retry, one double-billed charge. With a spec, the constraint in the document becomes the constraint in the migration, which becomes the constraint in the live schema. That chain is how you enforce correctness across sessions.
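The behavior the spec describes can be sketched without a database. In the example below an in-memory `ChargeTable` stands in for PostgreSQL and its composite key stands in for `UNIQUE(merchant_id, idempotency_key)`; the class, the `createCharge` function, and the `ch_` id format are all hypothetical. The 24-hour replay window is not modeled.

```typescript
// Illustrative sketch: an in-memory stand-in for the charges table. Names are hypothetical.
interface ChargeInput {
  merchantId: string;
  amountCents: number;
  currency: string;
  idempotencyKey: string;
}

interface Charge extends ChargeInput {
  id: string;
  status: "pending";
  createdAtMs: number;
}

type InsertResult =
  | { kind: "inserted" }
  | { kind: "unique_violation"; existing: Charge };

class ChargeTable {
  private rows = new Map<string, Charge>();

  // The composite key mirrors UNIQUE(merchant_id, idempotency_key).
  private static keyOf(merchantId: string, idempotencyKey: string): string {
    return JSON.stringify([merchantId, idempotencyKey]);
  }

  insert(charge: Charge): InsertResult {
    const key = ChargeTable.keyOf(charge.merchantId, charge.idempotencyKey);
    const existing = this.rows.get(key);
    if (existing) return { kind: "unique_violation", existing };
    this.rows.set(key, charge);
    return { kind: "inserted" };
  }

  count(): number {
    return this.rows.size;
  }
}

let nextId = 1;

// Always attempts the insert; a constraint violation returns the original charge.
function createCharge(
  table: ChargeTable,
  input: ChargeInput,
  nowMs: number
): { charge: Charge; created: boolean } {
  const candidate: Charge = { ...input, id: `ch_${nextId}`, status: "pending", createdAtMs: nowMs };
  const result = table.insert(candidate);
  if (result.kind === "unique_violation") {
    return { charge: result.existing, created: false };
  }
  nextId += 1;
  return { charge: candidate, created: true };
}
```

Note that the application code never checks for an existing row before inserting. It relies on the storage layer to reject the duplicate, which is the property the spec asks the real database to provide.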

If I had prompted it

01 "build an exchange pricing engine" gets you float arithmetic and rounding drift on real balances
02 silent scope creep: the agent adds things it thinks are helpful but nobody asked for
03 edge cases like stuck settlements and deviation refunds surface in production
04 idempotency constraint lives in your head, falls out on regeneration, customer double-billed
05 no record of why any decision was made

Because I specified it

01 8-decimal integer math declared in the spec before a line was generated, enforced by a test failure on any float
02 negative scope stopped the agent from building what nobody asked for
03 edge cases written as acceptance criteria, caught at review not in production
04 UNIQUE(merchant_id, idempotency_key) in the schema, enforced by the database
05 every decision traceable in git, forever

The agent's capability was identical in both columns. The spec is the only variable. It is the whole difference.

Across the full platform, that discipline produced what you need for money to move safely. A payment gateway with four PIX providers behind a factory pattern, including a bank-direct BACEN Cob v2 integration over mutual TLS. HMAC-SHA256 signed webhooks with a six-attempt exponential-backoff retry queue. An OAuth 2.1 and OIDC auth server with five roles, nineteen scopes, TOTP 2FA, and machine-to-machine client credentials with row-level scoping. A settlement engine with two-block confirmation, tolerance-based auto-refund on deviations above 10%, a $10K auto-settle cap that overflows to manual approval, a three-phase atomic commit, and crash recovery for settlements stuck mid-flight. The project overview has the full production checklist.
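For readers unfamiliar with signed webhooks, the sketch below shows the general shape of HMAC-SHA256 signing with a bounded exponential backoff, using Node's built-in `crypto` module. It is a generic illustration, not the platform's implementation: the function names, the hex encoding of the signature, and the one-second base delay are assumptions. Only the algorithm and the six-attempt limit come from the text.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative sketch: function names, hex encoding, and the base delay are assumptions.
const MAX_ATTEMPTS = 6;      // six attempts, as described above
const BASE_DELAY_MS = 1_000; // hypothetical base delay

// Sender side: sign the exact bytes of the request body.
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body, "utf8").digest("hex");
}

// Receiver side: recompute the signature and compare in constant time.
function verifySignature(secret: string, body: string, signatureHex: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}

// Delay before the next attempt once `failedAttempt` (1-based) has failed,
// or null when all attempts are used up.
function nextRetryDelayMs(failedAttempt: number): number | null {
  if (failedAttempt >= MAX_ATTEMPTS) return null;
  return BASE_DELAY_MS * 2 ** (failedAttempt - 1);
}
```

With these assumed values the waits after failed attempts one through five are 1, 2, 4, 8, and 16 seconds, and a sixth failure ends the retries.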

Each of those features started from a spec. Each spec had its failure mode described before a line of implementation existed. Each failure mode was caught at the review gate, not in production.

The output in numbers

What spec-driven execution produced

28 spec files
~1,650 automated tests
138K lines of TypeScript
39 database migrations

None of these numbers prove the code is perfect. They prove it was built to a contract, not improvised.

What this proves, and what it does not

It does not generalize to any developer. I have 25 years of experience, including prior work in fintech, aerospace, and enterprise integration. SDD scaled a mature mental model. The specs I wrote were good because I already knew what to put in them: which edge cases matter in payment flows, where float math causes grief, how to structure a fallback chain. There is no evidence here that SDD rescues someone still building that model. The method externalizes expertise. It does not manufacture it.

Speed was measured; quality was not audited. “13 apps in 70 days” says nothing about long-term defect density or maintainability. The CI gate was strict and the tests were real: Prettier, ESLint with zero warnings, strict tsc, and approximately 1,650 Vitest and Playwright tests against a live PostgreSQL database. But I make no claim about the five-year cost of this code. Green tests and tight CI have shipped bad systems before.

An external deadline was doing real work. I have watched SDD produce output under a client deadline and watched personal projects with the same method stall indefinitely. The spec amplifies execution. It does not replace accountability, urgency, or the pressure of a real delivery date.

Production is not a business. Real money moving through hardened infrastructure is not the same as a sustainable company with customers, margins, and a support queue. This case proves an engineering method, not a market.

Single data point, no control group. The right response to this case study is not “SDD always works at this scale.” It is “SDD demonstrably worked at this scale, once, for this operator.” If you run the method and it fails, the failure is also data. One case with a favorable outcome is not the same as a methodology with broad empirical support. I am claiming the mechanism is sound, not that the outcome is guaranteed. The mechanism is this: an AI agent has no memory between sessions, and the spec is the external memory you give it. That is a structural argument, not an empirical one. The case study demonstrates it worked. It does not prove it always will.

None of that weakens the core claim. The specification was the multiplier, not the AI. Without the specs, the agent is a fast bricklayer with no blueprint. With them, it is a senior engineer with perfect recall of every decision you made. That is a thing one person can direct at the scale of a fintech.

Take the method, not just the story

The story is one data point. The method is portable.

Learn the method: What Is Spec-Driven Development?
Write your first spec: How to write a spec an AI can build from
Do it in your editor: Spec-Driven Development with Claude Code
See the project: The crypto fintech in production

The code wrote itself. The specs did not. That is where the seventy days went, and it is why they were enough.