What Is Spec-Driven Development? The Practitioner's Guide

The specification is not documentation. It is the memory your AI agent does not have. Write it, or the agent reinvents your decisions on every run.

I have shipped production code for 25 years. In late 2025 I used spec-driven development to build a complete crypto fintech, 13 apps with 3 APIs, 3 databases, and Kubernetes in production, in 70 days, solo, with AI agents. The case study has the real numbers. This piece is the method.

Here is the thing nobody selling you a tool will say plainly: the company that makes the model you are using already told you to do this. Anthropic’s own guidance for Claude Code is to plan before you code. Most developers skip that step, watch the agent produce something fast and wrong, and blame the model. The model is fine. The process is missing.

Spec-driven development, defined

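Spec-driven development (SDD) is a way of building software with AI in which you write and approve a detailed specification before any code is generated, and that spec becomes the source of truth the agent builds from. The spec is the primary artifact and the code is the consequence, the reverse of how documentation usually works.

That flip is the whole idea. It reads like bureaucracy until you remember who you are writing for now. It is not the next human maintainer. It is a collaborator that forgets everything the moment the session ends.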

The company that makes the model tells you to plan first

Skip the vendor blogs for a second and read the model maker’s own manual. Anthropic’s Claude Code best practices describe a four-step loop: explore, plan, implement, commit. There is a dedicated plan mode in the product whose entire job is to stop the agent from writing code while it thinks. The engineer who built Claude Code says most of his own sessions start in plan mode.

Their reasoning is blunt. In Anthropic’s words, “letting Claude jump straight to coding can produce code that solves the wrong problem.” And on what a good spec contains: “the most useful specs are self-contained: they name the files and interfaces involved, state what is out of scope, and end with an end-to-end verification step that proves the feature works.”

They even publish the prompt. This is Anthropic’s own template for turning an idea into a spec before a line of code exists:

I want to build [brief description]. Interview me in detail using the
AskUserQuestion tool. Ask about technical implementation, UI/UX, edge
cases, concerns, and tradeoffs. Don't ask obvious questions, dig into the
hard parts I might not have considered. Keep interviewing until we've
covered everything, then write a complete spec to SPEC.md.

Then you start a fresh session and execute the spec. That is spec-driven development in three sentences, from the people who trained the model. Everything below is how to do it well.

The problem is memory, not intelligence

Hire the fastest bricklayer alive. Walls in minutes, plumbing in seconds, roof before lunch. Now take away the blueprint. You get a house that stands, with the bathroom where the kitchen goes and a staircase into a wall. Claude Code, Cursor, Copilot: that is the bricklayer. Speed was never the problem. Direction is.

The root cause is not that the agent is dumb. It is that the agent has no memory across sessions. Every conversation starts at zero. It cannot recall the decision you made yesterday or the constraint you agreed on last week.

This is why the failure hides until it is expensive. The code compiles. It is syntactically perfect. It just solves a problem you never fully described, with assumptions you never made. Payment code ships without an idempotency key. A retry double-bills a customer. You patch the code. The next time the agent regenerates that module, the same gap comes back, because the constraint lived in your head, not in the spec.

There is a result here that should stop experienced developers cold. In a 2025 randomized controlled trial, METR watched 16 seasoned open-source developers work 246 real issues on large, mature codebases. The developers expected AI to make them 24 percent faster. They were actually 19 percent slower with it. And afterward they still believed it had sped them up by about 20 percent. The more capable the model, the further a vague instruction carries it in the wrong direction. Capability amplifies direction. It does not supply it.

Prompt first, no spec

01. prompt, generate, notice something is missing
02. reprompt, break something, fix, reprompt
03. repeat until it roughly works
04. 8 to 12 hours for one real feature

Spec first

01. an hour or two defining the spec
02. generate from the spec
03. small adjustments
04. correct on the first real pass

The planning hours are not overhead. They are the regeneration hours you never spend.

A spec is not a prompt, and the difference is the whole game

The word “spec” got stretched until it meant nothing. Half the industry now says it to mean “a detailed prompt.” Clear that up first, because a prompt and a spec fail in different ways, and only one of them is worth defending.

A prompt is an instruction for one turn. A spec is a contract for the whole feature. A PRD tells you what to build for a business. A spec tells the agent how the system must behave, precisely enough to implement without guessing. A design doc explains a decision to humans. A spec is written to be executed.

A spec versus the things people confuse it with

Artifact	Written for	Lifespan	Source of truth?
Prompt	One agent turn	Seconds	No, it decays instantly
PRD	Stakeholders	A release	Partial: the what, not the how
Design doc	Human reviewers	Until it is built	No, it explains, it does not govern
Spec (SDD)	The AI agent	Lives with the feature	Yes, code is generated from it

The spec is the only one an agent executes and the only one that outlives the code it produced.

The four pillars, and what actually goes in each one

SDD is four phases with a human approval gate between each. The agent does not advance to the next phase without your sign-off. That is not ceremony. It is how you catch an error while it is still cheap.

The four pillars

Input

A behavior to build. Start from what the system must do, not from code.

01. Requirements: what
What the system must do, in business language. Observable behaviors, technology independent. Gate before design.
02. Design: how
Architecture, data models, API contracts, technology choices with a reason. Every requirement mapped to a decision. Gate before tasks.
03. Tasks: how much
Break it into units of 2 to 4 hours, each independently testable, with explicit dependencies. Gate before implementation.
04. Implementation: execution
Code follows the spec. Every task verified against its acceptance criteria. What you learn goes back into the spec.

Output

A trail from the why to the how, and code that matches what you actually asked for.

Requirements is where you decide what “done” means, in plain sentences a non-engineer could check. No stack, no libraries, no schema. If it says “React” or “Postgres,” it belongs in the next phase. The output is a list of behaviors and the acceptance criteria that prove each one.

Design is where the engineering lives. Data models, the API contract, the third-party integrations, the edge cases, and the technology choices with the reason attached. Not “use Postgres” but “use Postgres because payment records need ACID guarantees.” The reason is what stops the agent from swapping in something else three sessions later. Every requirement from phase one gets mapped to a design decision here, so nothing silently falls off.
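The "nothing silently falls off" rule can be checked mechanically. This is a minimal sketch, assuming requirement IDs follow an `FR-n` pattern and the design document names the IDs it covers. The function name and the sample text are hypothetical, not part of any SDD toolkit.

```python
import re

# Assumed convention: requirement IDs look like "FR-1", "FR-2", ...
REQ_ID = re.compile(r"\bFR-\d+\b")


def unmapped_requirements(requirements_text: str, design_text: str) -> list:
    """Return requirement IDs that appear in requirements but never in design."""
    required = set(REQ_ID.findall(requirements_text))
    covered = set(REQ_ID.findall(design_text))
    # Sort numerically so FR-10 comes after FR-2.
    return sorted(required - covered, key=lambda rid: int(rid.split("-")[1]))


# Illustrative documents, not the real fintech specs.
requirements = """
FR-1 create at most one charge per Idempotency-Key
FR-2 replay returns the original charge
FR-3 reject non-positive amounts
"""
design = """
UNIQUE(merchant_id, idempotency_key) enforces FR-1 and FR-2.
"""

missing = unmapped_requirements(requirements, design)
print(missing)  # ['FR-3']
```

A non-empty list is a reason to hold the design gate: some requirement has no decision behind it yet.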

Tasks is where you cut the work into pieces small enough to verify. Two to four hours each. If a task cannot be tested on its own, it is too big or too vague. Each task names the files it touches, its dependencies, and how you will know it passed.
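Those task rules are simple enough to lint before you approve the phase. A minimal sketch, assuming tasks are kept as structured records; the `Task` fields and both helper functions are hypothetical names for illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    # Field names are illustrative, not a standard task schema.
    id: str
    hours: float
    files: list
    check: str  # how you will know it passed
    depends_on: list = field(default_factory=list)


def problems(tasks: list) -> list:
    """List tasks that break the sizing, verification, or dependency rules."""
    known = {t.id for t in tasks}
    found = []
    for t in tasks:
        if not 2 <= t.hours <= 4:
            found.append(f"{t.id}: {t.hours}h is outside the 2 to 4 hour range")
        if not t.check.strip():
            found.append(f"{t.id}: no verification step")
        for dep in t.depends_on:
            if dep not in known:
                found.append(f"{t.id}: unknown dependency {dep}")
    return found


def build_order(tasks: list) -> list:
    """Order tasks so every dependency comes first; raises CycleError on a cycle."""
    graph = {t.id: set(t.depends_on) for t in tasks}
    return list(TopologicalSorter(graph).static_order())


tasks = [
    Task("T2", 3, ["api/charges.py"], "replay test passes", depends_on=["T1"]),
    Task("T1", 2, ["db/schema.sql"], "migration applies cleanly"),
    Task("T3", 9, ["everything"], ""),  # too big and unverifiable on purpose
]

print(problems(tasks))         # two complaints, both about T3
print(build_order(tasks[:2]))  # ['T1', 'T2']
```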

Implementation is the only phase where code gets written, and it only starts after the tasks are approved. The gate is the point. In my own kit the gate is a literal one-line file, a .status per feature that reads requirements:approved, then design:approved, then tasks:approved. The agent reads it before it does anything. The rule is flat: a design.md sitting on disk is not approval. Only the status token is. That single constraint stops an eager agent from racing into code off a draft nobody signed.
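Here is roughly how such a gate check could look. The `phase:approved` token format comes from the description above; the parsing rules, the function names, and the choice to treat anything else as "not approved" are assumptions for this sketch, not the actual kit code.

```python
from typing import Optional

PHASES = ["requirements", "design", "tasks", "implementation"]


def approved_phase(status_line: str) -> Optional[str]:
    """Parse a one-line status such as 'design:approved'; None if nothing is approved."""
    phase, _, state = status_line.strip().partition(":")
    if phase in PHASES[:-1] and state == "approved":
        return phase
    return None


def may_start(phase: str, status_line: str) -> bool:
    """A phase may start only once the phase before it (or a later one) is approved."""
    target = PHASES.index(phase)
    if target == 0:
        return True  # requirements need no prior approval
    approved = approved_phase(status_line)
    if approved is None:
        return False  # a draft on disk is not approval
    return PHASES.index(approved) >= target - 1


print(may_start("implementation", "tasks:approved"))   # True
print(may_start("implementation", "design:approved"))  # False
print(may_start("design", "design:draft"))             # False
```

In practice the agent would read the status line from the feature's `.status` file and refuse to proceed when `may_start` returns False.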

The economics are the reason to bother. An error caught in requirements costs minutes. The same error caught in implementation costs days. Caught in production, with real money moving, it costs weeks and an apology. The gates exist to drag every error as far left as it will go.

A real spec, start to finish

Most guides describe a spec and never show you one. Here is a complete one for a hard case: creating a payment charge, where a retry must never double-bill a customer. This is close to what I actually wrote for the fintech.

# Spec: Create a payment charge (POST /v1/charges)
# Status: requirements:approved

## Overview
A merchant creates a charge against a customer. This moves money, so it
must be safe to retry and impossible to double-bill.

## In scope
- Create a charge from an authenticated merchant request.
- Return the charge id and status.
- Guarantee exactly-once billing under client retries.

## Out of scope (v1)
- Refunds (separate spec: 005-refunds).
- Partial captures.
- Multi-currency. All amounts are BRL, stored as integer cents. Never float.

## Functional requirements (EARS)
- FR-1  WHEN a merchant POSTs a charge with a valid Idempotency-Key,
        THE SYSTEM SHALL create at most one charge for that key.
- FR-2  WHEN the same Idempotency-Key is replayed within 24h,
        THE SYSTEM SHALL return the original charge and create no new one.
- FR-3  IF the amount is <= 0,
        THE SYSTEM SHALL reject with 422 "amount must be positive".
- FR-4  IF the merchant is over its rate limit,
        THE SYSTEM SHALL reject with 429 and a Retry-After header.
- FR-5  WHILE a charge is pending,
        THE SYSTEM SHALL NOT allow a second capture.

## Acceptance criteria (examples the agent must satisfy)
- amount=1000, key=abc            -> 201, status=pending
- same key=abc, replayed          -> 200, same charge id, no new row
- amount=0                        -> 422 "amount must be positive"
- amount=-50                      -> 422 "amount must be positive"
- 6th request in 1s, one merchant -> 429, Retry-After: 1

## Non-functional
- p95 latency under 300ms at 200 requests/sec per merchant.
- Every money value is an integer. No floating point anywhere in the path.

## Data
- charges(id, merchant_id, amount_cents, currency, status,
          idempotency_key, created_at)
- UNIQUE(merchant_id, idempotency_key)   # this is what enforces FR-1

## Verification
- Integration test replays one key 50x concurrently; assert exactly one
  row and one ledger entry.
- Load test holds p95 < 300ms at 200 rps.

## Confirm before building
Do not write code until you restate FR-1 through FR-5 and the uniqueness
constraint in your own words. If any acceptance criterion is ambiguous,
ask before implementing.

Read what that spec does. It names the exact behaviors, it says what it will not do, it pins the money type to integers so the agent cannot reach for a float, it puts the double-billing defense in the database as a unique constraint instead of hoping the code remembers, and it ends by telling the agent to prove it understood before it types. That last line is not decoration. It is the cheapest bug prevention you will ever write.
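To make the unique-constraint defense concrete, here is a minimal sketch of FR-1, FR-2, and FR-3 using Python's built-in `sqlite3` as a stand-in database. The table follows the Data section of the spec; `open_db` and `create_charge` are hypothetical names, and the 24-hour replay window, the ledger entry, and the rate limit are deliberately left out.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory database with the charges table from the spec's Data section."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE charges (
            id INTEGER PRIMARY KEY,
            merchant_id TEXT NOT NULL,
            amount_cents INTEGER NOT NULL CHECK (amount_cents > 0),
            currency TEXT NOT NULL DEFAULT 'BRL',
            status TEXT NOT NULL DEFAULT 'pending',
            idempotency_key TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (merchant_id, idempotency_key)
        )
        """
    )
    return db


def create_charge(db, merchant_id: str, amount_cents: int, key: str) -> tuple:
    """Return (http_status, charge_id). A replayed key returns the original charge."""
    if isinstance(amount_cents, bool) or not isinstance(amount_cents, int):
        raise TypeError("amount_cents must be an integer")  # no floats in the money path
    if amount_cents <= 0:
        raise ValueError("amount must be positive")  # FR-3, would map to a 422
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO charges (merchant_id, amount_cents, idempotency_key) "
                "VALUES (?, ?, ?)",
                (merchant_id, amount_cents, key),
            )
        return 201, cur.lastrowid
    except sqlite3.IntegrityError:
        # The unique constraint fired: this merchant already used this key (FR-1, FR-2).
        row = db.execute(
            "SELECT id FROM charges WHERE merchant_id = ? AND idempotency_key = ?",
            (merchant_id, key),
        ).fetchone()
        return 200, row[0]


db = open_db()
first = create_charge(db, "m_1", 1000, "abc")
replay = create_charge(db, "m_1", 1000, "abc")
rows = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
print(first, replay, rows)  # (201, 1) (200, 1) 1
```

The design choice worth noticing is that the code never checks "does this key exist?" before inserting. It inserts and lets the database refuse, so two concurrent retries cannot both pass a check and both write.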

What made that spec work

A good spec has one testable property: hand it to someone who lacks all your context, and they still build the right thing. Four moves get you there.

Write it for a smart kid

Explain the system to a brilliant 12-year-old who asks sharp questions and knows none of your context. You would not say “do that thing with the tasks.” You would say “when someone creates a task, save the title, check that they have permission in that workspace, and notify everyone viewing the list.” The agent needs exactly that level. Not because it is slow, but because, like the kid, it has none of your implicit knowledge. The point of the principle is to drag tacit rules (“check permission”) into the open, where they stop being something the agent has to guess.

Be specific, or the agent guesses

Adjectives are not requirements. “Fast,” “clean,” “secure,” “robust”: each one is an invitation for the agent to invent its own definition.

Vague versus executable

Vague: the agent guesses	Executable: the agent knows
"The system should be fast."	GET /api/v1/tasks responds under 500ms at p95 for lists up to 1,000 tasks.
"Validate the title."	empty -> "Title is required"; 1 char -> "Min 2 characters"; 501 chars -> "Max 500".
"Handle errors gracefully."	On provider timeout, retry 3x with backoff, then queue for manual review.

Every question you answer in the spec is a wrong assumption you kept out of the code.
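The title row of that table translates almost line for line into code, which is the point of writing it that way. A minimal sketch: the three messages are the ones in the table, while the function name and the decision to count whitespace as characters are assumptions a real spec would have to settle.

```python
from typing import Optional

MIN_LEN = 2
MAX_LEN = 500


def validate_title(title: str) -> Optional[str]:
    """Return the error message from the spec, or None when the title is valid."""
    # Assumption: whitespace counts toward length; the table does not say either way.
    if len(title) == 0:
        return "Title is required"
    if len(title) < MIN_LEN:
        return "Min 2 characters"
    if len(title) > MAX_LEN:
        return "Max 500"
    return None


# The spec's examples double as test cases, plus the two boundary values.
cases = {
    "": "Title is required",
    "a": "Min 2 characters",
    "a" * 501: "Max 500",
    "ab": None,
    "a" * 500: None,
}
for title, expected in cases.items():
    assert validate_title(title) == expected
print("all title cases pass")
```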

Write requirements in EARS

Here is the technique almost no guide teaches, and it is the highest-leverage one. Write functional requirements in EARS, the Easy Approach to Requirements Syntax. It is a format from requirements engineering, published in 2009 by Alistair Mavin and colleagues at Rolls-Royce, and it is built for exactly the ambiguity SDD is fighting. Five sentence shapes cover nearly everything, and each one leaves the agent no room to interpret.

The EARS patterns, with real examples

Pattern	Example
Ubiquitous	THE SYSTEM SHALL validate workspace permissions on every task operation.
Event-driven	WHEN a task is completed, THE SYSTEM SHALL record the timestamp and the user.
State-driven	WHILE a task is archived, THE SYSTEM SHALL NOT allow edits.
Unwanted behavior	IF more than 50 subtasks are created, THE SYSTEM SHALL show "Subtask limit reached".
Optional	WHERE notifications are enabled, THE SYSTEM SHALL notify assignees on change.

The structure is the point. WHEN, IF, WHILE, SHALL. It reads like a contract because it is one.
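Because the shapes are this regular, a few regular expressions can flag requirements that do not fit any of them. This is a rough sketch, assuming the keyword spellings used in the table above; the classifier is a hypothetical helper, not an official EARS tool, and it only checks sentence shape, not meaning.

```python
import re
from typing import Optional

# Opening keyword for each pattern; ubiquitous requirements have none.
PATTERNS = [
    ("event-driven", re.compile(r"^WHEN\b.+\bSHALL\b", re.I)),
    ("state-driven", re.compile(r"^WHILE\b.+\bSHALL\b", re.I)),
    ("unwanted behavior", re.compile(r"^IF\b.+\bSHALL\b", re.I)),
    ("optional", re.compile(r"^WHERE\b.+\bSHALL\b", re.I)),
    ("ubiquitous", re.compile(r"^THE SYSTEM SHALL\b", re.I)),
]


def classify(requirement: str) -> Optional[str]:
    """Return the EARS pattern name, or None if the sentence fits no pattern."""
    text = " ".join(requirement.split())  # collapse line breaks and extra spaces
    for name, pattern in PATTERNS:
        if pattern.match(text):
            return name
    return None


print(classify("WHEN a task is completed, THE SYSTEM SHALL record the timestamp and the user."))
# event-driven
print(classify("The system should be fast."))
# None
```

A `None` result is a prompt to rewrite the sentence before the requirements gate, not proof that the requirement is wrong.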

Say what you will not build

Negative scope is the single best defense against an agent “helpfully” building something you never asked for. Write it early, write it flat: no recurring tasks, no calendar integration until v2, no time tracking, subtasks only. Every line you exclude is a data-model detour the agent does not take. The full mechanics of this, plus a copy-paste template, live in the companion piece: how to write a spec an AI agent can build from.

Pick your level of rigor

You do not have to go all in. SDD is a dial, and choosing the setting is what kills the “specs are overkill” argument before it starts. The question is never whether to spec. It is how much this particular work deserves. This taxonomy comes from Birgitta Böckeler’s work at Thoughtworks, and it is the cleanest way to think about it.

How much authority does the spec hold over the code?

Level	The spec is	Best for
Spec-first	A launch pad. Guides the first build, then you let it go.	MVPs, prototypes, one-off features
Spec-anchored	A living document kept in sync with the code as it changes.	Production systems (the sweet spot)
Spec-as-source	The only file a human edits. Code is regenerated from it.	Frontier, still experimental

On the fintech I ran spec-first for MVP features and spec-anchored for anything touching money. Spec-as-source is still a research bet.

SDD is not TDD, BDD, or vibe coding

SDD is not new. It is the latest point on a 30-year line. Kent Beck’s TDD drove code from tests. BDD drove it from behavior examples. SDD drives it from an approved specification. One useful way to see it: TDD is SDD at the unit level. The academic framing of it, in a 2026 paper on the topic, is that the specification becomes the source of truth and the code becomes a generated or verified artifact. Here is where each method keeps its truth.

Where does the truth live?

Method	Truth lives in	Typical failure mode
Vibe coding	The last prompt	Fast, confident, wrong
TDD	Unit tests	Green tests, wrong architecture
BDD	Behavior examples	Scenarios drift from code
SDD	The approved spec	Spec drift if you do not keep it live

I run the direct head-to-head in Spec-Driven Development versus vibe coding.

If you are coming from the vibe-coding side, the timed comparison, same task with and without a spec, is its own article.

A million tokens of context will not save you

This is the sharpest objection of 2026, and almost nobody answers it. If I can fit my whole codebase in the context window, why write a spec?

Because context length and context precision are different problems. A million tokens of code tells the agent what the system currently is. It says nothing about what it should become: your intent, your constraints, the edge cases you care about, the things that are deliberately out of bounds. A bigger window makes the agent better informed about the present and no wiser about the target. Worse, more context is more surface for the agent to pattern-match the wrong precedent from.

A spec is not information delivery. It is a set of decisions. The context window makes the agent aware. The spec makes it aligned. Bigger windows raise the value of a clear spec, because now the limit on quality is not how much the agent can see. It is how clearly you told it what to do.

When not to write a spec

An honest method tells you where it does not apply. SDD has real overhead, and for plenty of work that overhead is pure waste. Anthropic draws the line in one sentence: “if you could describe the diff in one sentence, skip the plan.” I agree. Skip the spec when the payoff is not there.

Should this work get a spec?

Scale 1-5 (5 = best). Winner by weighted score: Production feature, with 55.
Criterion (weight)	One-off script	Exploratory spike	Production feature
Lives longer than a few days (3)	1	2	5
Spans multiple sessions (3)	1	1	5
Multiple complex features or services (2)	1	2	5
Correctness or compliance actually matters (3)	1	1	5
Weighted score	11	16	55

Low score, just prompt and go. High score, the spec pays for itself. A throwaway prototype does not need SDD. A payment system does.
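The weighted scores in that table are plain arithmetic: multiply each rating by its criterion weight and add them up. A short sketch that reproduces the three totals; the dictionary layout and function name are just one way to write it down.

```python
# Criterion weights, in the order the table lists them.
WEIGHTS = {
    "lives longer than a few days": 3,
    "spans multiple sessions": 3,
    "multiple complex features or services": 2,
    "correctness or compliance actually matters": 3,
}

# Ratings on the 1-5 scale, one per criterion, copied from the table.
RATINGS = {
    "one-off script": [1, 1, 1, 1],
    "exploratory spike": [2, 1, 2, 1],
    "production feature": [5, 5, 5, 5],
}


def weighted_score(ratings: list) -> int:
    """Sum of rating times weight across all criteria."""
    return sum(w * r for w, r in zip(WEIGHTS.values(), ratings))


for kind, ratings in RATINGS.items():
    print(kind, weighted_score(ratings))
# one-off script 11
# exploratory spike 16
# production feature 55
```

The range runs from 11 (every criterion rated 1) to 55 (every criterion rated 5), so a score near the bottom means prompt and go.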

Building a one-hour utility? Writing a spec first is the bureaucracy the skeptics warn about. Reach for SDD when the work outlives a single sitting, involves real architecture, or spans multiple sessions, because that is exactly when the agent’s missing memory starts costing you money.

No, this is not waterfall

The difference is precise enough to state. Waterfall’s problem was never planning up front. It was frozen planning: a feedback loop so long a decision made months ago could not answer for what you learned since. SDD specs are living. You revise a requirement and the change propagates, on purpose, through design and tasks. The loop is per phase, not per project.

There is a warning worth keeping, though, and it comes from Böckeler again. Model-driven development tried this in the 2000s, generating code from formal models, and it mostly died: rigid DSLs, giant generators, the wrong level of abstraction. LLMs remove some of that overhead. But the failure modes that killed MDD, spec drift, over-specifying too early, making the thing worse in the name of rigor, are risks SDD can still repeat if you are careless. Keep the spec proportional to the phase, and keep it alive.

What it looks like at scale

I will keep this short, because it has its own case study. But to make the method concrete, here is what SDD produced on one real project.

One crypto fintech, one developer, spec-driven

13 apps in production
3 APIs, 3 databases
70 days
28 spec files

The specs were the multiplier, not the AI. Without them the agent is a fast bricklayer with no blueprint. With them it is a senior engineer with perfect recall of every decision you made.

FAQ

What is spec-driven development in simple terms?

It is a way of building software with AI where you write and approve a detailed specification before any code is generated, and that spec becomes the source of truth the agent builds from.

The shift is that the spec is the primary artifact and the code is the consequence, the reverse of how documentation usually works.

Does Anthropic recommend spec-driven development?

In practice, yes. Anthropic's Claude Code best practices describe an explore, plan, implement, commit loop and a dedicated plan mode, and they publish a prompt template for interviewing you into a spec before any code is written.

Their stated reason: letting the model jump straight to coding can produce code that solves the wrong problem.

Is spec-driven development the same as test-driven development?

They are related, not the same. TDD drives code from tests. SDD drives it from an approved specification. One useful framing is that TDD is SDD at the unit level.

SDD works at a higher altitude, requirements and design and tasks, and it is aimed at guiding agents that have no memory between sessions.

Is SDD just waterfall with a new name?

No. Waterfall's problem was frozen, long-loop planning. SDD specs are living documents: you revise them and the change propagates through design and tasks in a controlled way.

The feedback loop is per phase, not per project.

Does a large context window make specs unnecessary?

No. A big context window tells the agent what the code currently is. It does not tell it what the code should become, or what is deliberately out of scope.

Context length and context precision are different problems. Bigger windows raise the value of a clear spec.

When should I not use spec-driven development?

For one-off scripts, throwaway prototypes, exploratory spikes, or anything you can finish in a single prompt session. Anthropic's own rule of thumb: if you could describe the diff in one sentence, skip the plan.

Use SDD when the work outlives one sitting, spans multiple sessions, or involves real architecture and correctness requirements.

What tools do I need for SDD?

None in particular. SDD is a method, not a product. You can run it with plain markdown files and any capable coding agent.

Toolkits like GitHub Spec Kit (over 80,000 stars), AWS Kiro, and my own pi-sdd-kit codify the workflow, but the method works with a text editor and discipline.

Where to go next

SDD is not about writing more. It is about writing the right things down, because your fastest collaborator forgets everything the second the session ends, and the spec is the only memory it gets.

Start small. Pick one feature. Write the three documents, requirements, design, tasks, and hand them to your agent. The first spec is slow. The second takes half as long. By the third it is muscle memory.

See it at scale: how I built a 13-app crypto fintech in 70 days, solo, with SDD
Go hands-on: spec-driven development with Claude Code
Write your first one: how to write a spec an AI can build from
The tools, reviewed: What Is Kiro? and GitHub Spec Kit
The argument, sharpened: Don’t Code, Specify

The code writes itself now. The spec does not. That is the whole job.