← articles
Spec-Driven DevelopmentClaude CodeAI AgentsSoftware EngineeringAI Coding

Spec-Driven Development with Claude Code: The Hands-On Workflow

How to run Spec-Driven Development inside Claude Code: plan mode, the three-layer context system, steering files, .status gates, and specialized sub-agents that turn a stateless AI into a consistent collaborator.

Claude Code is capable and stateless. That combination is dangerous without a system. Spec-driven development fixes the second problem so you can safely use the first.

This is the hands-on companion to what spec-driven development is: the how for Claude Code specifically. New to SDD? Start there. Already understand why specs matter? This is it.

The method gives Claude Code three things it cannot supply itself: a memory that outlives the session, an approval gate that prevents premature implementation, and a structured review that checks the result against what was actually approved. Every section below adds one piece of that system.

Anthropic already told you to plan first

Before any framework, read the model maker’s own manual. Anthropic’s Claude Code best practices describe a four-step loop: explore, plan, implement, commit. There is a dedicated plan mode whose only job is to stop the agent from writing code while it thinks. Their stated reason is direct: “letting Claude jump straight to coding can produce code that solves the wrong problem.”

Plan mode is not a conceptual nudge. It is a real product feature. In the terminal interface, Shift+Tab enters plan mode. In plan mode, Claude reads files and answers questions without making changes. When you have a plan worth reviewing, Ctrl+G opens it in your text editor so you can edit it directly before Claude proceeds. Then exit plan mode and let it implement.

Anthropic is equally specific about what a good spec contains: “the most useful specs are self-contained: they name the files and interfaces involved, state what is out of scope, and end with an end-to-end verification step that proves the feature works.” That one sentence is the design brief for everything below.

The interview-to-spec technique

Anthropic publishes the prompt that converts an idea into a spec before a single line of code exists. This is their own template:

I want to build [brief description]. Interview me in detail using the
AskUserQuestion tool. Ask about technical implementation, UI/UX, edge
cases, concerns, and tradeoffs. Don't ask obvious questions, dig into
the hard parts I might not have considered. Keep interviewing until
we've covered everything, then write a complete spec to SPEC.md.

Run this in a fresh session. Claude pushes you on edge cases you have not considered yet. When it finishes, start another fresh session to implement. The clean context keeps the implementation focused entirely on the spec, not on the conversation that produced it.

This is the PRD phase in a structured SDD workflow, formalized. The technique is Anthropic’s. The system around it is the method.

When to skip the plan entirely

Anthropic draws the line plainly: “if you could describe the diff in one sentence, skip the plan.” A renamed variable, a log line, a typo fix: none of those need a spec. The ceremony exists for work that outlives a single sitting, spans multiple sessions, or touches real architecture and correctness. Everything else, just prompt and go.

CLAUDE.md is layer one, not the whole system

Claude Code reads CLAUDE.md at the start of every conversation. Anthropic calls it “a special file that Claude reads at the start of every conversation” and provides clear discipline for keeping it useful: “keep it concise. For each line, ask: would removing this cause Claude to make mistakes? If not, cut it. Bloated CLAUDE.md files cause Claude to ignore your actual instructions.”

That last sentence is the failure mode most developers hit. They load CLAUDE.md with coding conventions, tool preferences, team norms, and project background until it is 400 lines long. Claude reads the first third and skips the rest. The conventions that matter most are the ones that get lost.

The fix is not better organization inside CLAUDE.md. It is moving most of that content out of CLAUDE.md entirely, into a context system with three distinct layers that each do one job.

The three-layer context system

A project that outlives a single session needs three layers of context, each with a different lifespan and a different job.

Three layers, three jobs

  1. 01

    CLAUDE.md

    Loaded every session. Contains only what must load before the agent does anything: pointers to the other layers, a one-line reminder to check .status before implementing, and any behavioral rules that apply to every single conversation.

    Keep it under 30 lines.
    If removing it does not cause mistakes, cut it.
  2. 02

    steering/

    Stable product context that rarely changes. What the product is and is not, the tech stack with reasons, code conventions beyond linting, and the architectural rules that are non-negotiable. CLAUDE.md points here; the agent reads what it needs.

    Changes when you make a deliberate architectural decision.
    Never changes for a single feature.
  3. 03

    specs/NNN-feature/

    Per-feature specs that evolve through the pipeline: requirements, design, tasks, and the .status file that gates implementation. The only place implementation work is authorized.

    Each file advances one phase.
    .status is the sole gate. File existence is not approval.
CLAUDE.md handles routing. Steering holds memory. The spec folder holds the current work. None of them can do the other two jobs.

CLAUDE.md is a short set of pointers: “Product context lives in steering/. Feature specs live in specs/NNN-name/. Before implementing anything, read the .status file for that feature.” That pointer structure means Claude reads the right files when they matter, not everything at once. The pointer structure is also what keeps CLAUDE.md short: most content has a better home.

The full directory

Project structure for SDD with Claude Code

  • CLAUDE.mdentry point, loaded every session, routes agent to all context below
  • steering/stable
    • product.mdwhat it is, who uses it, what it explicitly is not
    • tech-stack.mdstack, versions, libraries, and the reason for each
    • conventions.mdAPI structure, error shapes, auth patterns, naming rules
    • principles.mdload-bearing architectural rules, the non-negotiables
  • specs/
    • 001-user-auth/feature
      • requirements.mdEARS-formatted behaviors, acceptance criteria
      • design.mdarchitecture, data models, API contracts
      • tasks.mdimplementable units, 2-4h each, independently testable
      • .statustasks:approved is the only green light for EXEC
  • .claude/
    • commands/slash commands/spec, /tasks, /review for each pipeline stage
  • agents/
    • architect.mddesigns, never implements
    • implementer.mdimplements only after tasks:approved, STOPs on ambiguity
    • reviewer.mdreviews against spec only, not preferences

The depth is deliberate. CLAUDE.md is a routing file. The steering folder holds the product memory. The spec folder holds the current feature. The agents folder holds the sub-agent prompts. You initialize this structure once and work from it for every feature.

The toolkit that codifies this into slash commands is my @felipefontoura/pi-sdd-kit, which implements the same structure for the Pi coding agent. The method works without any kit, with plain markdown and consistent discipline. The structure is the point; the tooling is optional.

Steering: the memory that outlives the session

The four files in steering/ hold the context the agent needs to make good decisions without re-explaining anything every session.

product.md is a two-page answer to what the product does, who uses it, and what it deliberately does not do. An agent that does not know “this is a payment backend for crypto merchants, not a consumer app” will drift toward consumer defaults, add features nobody asked for, and optimize for the wrong things. The negative scope matters as much as the positive.

tech-stack.md states the stack, the versions, and why each major choice was made. Not a dependency list: a rationale. “PostgreSQL because payment records need ACID guarantees” is the sentence that prevents the agent from suggesting SQLite when you add a new module three sessions from now. The reason is what makes the file useful across sessions. Without it, the file is a changelog nobody reads.

conventions.md goes beyond linting rules. It captures patterns: how API routes are structured, how errors are shaped, how authentication is applied at the boundary. The tacit knowledge that lives in experienced developer heads, until you write it down.

principles.md is the shortest file and the hardest to write well. Load-bearing architectural rules in declarative sentences. For the fintech: “All money math is integer arithmetic, never float.” “No direct database access outside the repository layer.” “If you are uncertain whether something is in scope, it is not.” These constraints prevent category errors before the agent generates a single line of code.

These files change rarely. When they do, it is because you made a deliberate architectural decision. Writing it in steering/ is how that decision becomes the agent’s permanent context for every future session.

The gated pipeline

Every feature moves through a defined sequence. IDEA and PLAN are optional for small or well-understood work. The core sequence for any real feature is PRD through REVIEW:

SDD pipeline in Claude Code

Input

A behavior to build: not a line of code, a thing the system must do

  1. PRDProduct requirements document

    Business-language description: what the system must do, who uses it, what is explicitly out of scope. No code. This is where the interview-to-spec technique lives.

  2. SPECTechnical specification

    EARS-formatted requirements, architecture decisions, data models, API contracts. Human gate before proceeding: .status must read requirements:approved, then design:approved.

  3. TASKSTask decomposition

    Implementable units of 2-4 hours each, independently testable, explicit dependencies. Implementation Readiness Check confirms every requirement maps to at least one task. Human gate: .status must read tasks:approved.

  4. EXECImplementation

    Agent reads tasks.md and implements task by task. Only starts after .status reads tasks:approved. Surfaces every ambiguity before acting on it.

  5. REVIEWVerification report

    Each task verified: Claim, Command, Exit code, Verdict PASS or FAIL. A FAIL means the spec was wrong: fix the document, regenerate. Never patch code to hide a spec error.

Output

Implemented feature with a traceable audit trail from the requirement to the verification result

SDD pipeline in Claude Code: flow of 5 steps from "A behavior to build: not a line of code, a thing the system must do" resulting in "Implemented feature with a traceable audit trail from the requirement to the verification result".

The gate between TASKS and EXEC is enforced by a single .status file inside each feature folder. One line. The agent reads it before every implementation step. The state of that file is the only approved signal the agent honors.

This sounds obvious until you watch an eager agent race from a completed spec straight into implementation because the files exist. The gate prevents it. You update .status manually, after you have read and signed off on each phase. That manual step is the human approval the entire method is built around. It is also the hardest habit to maintain under deadline pressure, which is exactly when it matters most.

Three specialized sub-agents

The most effective pattern is splitting work across three narrow sub-agents instead of asking one agent to do everything. Each has one job and one constraint.

The three sub-agents

  1. 01

    Architect

    Reads the PRD and all steering context. Produces requirements.md and design.md: data models, API contracts, technology choices with explicit justifications, and a traceability map from each requirement to a design decision.

    Core constraint: never write implementation code.
    If asked to implement, clarify scope instead.
  2. 02

    Implementer

    Reads tasks.md and .status (must be tasks:approved). Implements task by task, writes tests alongside the code. EARS requirements in the spec are the acceptance criteria, not the Implementer's interpretation of them.

    Core constraint: STOP if you encounter ambiguity.
    Do not assume. Do not infer. Ask.
  3. 03

    Reviewer

    Reads requirements.md, design.md, tasks.md, and the generated code in full. Reviews against the spec only: not general best practice, not stylistic preference. Reports gaps, not opinions.

    Core constraint: review against the spec, not your preferences.
    Do not add requirements. Flag gaps only.
Each agent has one job and one constraint. Mixing them is where the method breaks.

The Reviewer pattern maps directly to Anthropic’s own recommendation for autonomous runs: “before treating a task as done, have a subagent review the diff in a fresh context and report gaps.” Their reasoning is precise: a reviewer running in a fresh subagent context sees only the diff and the criteria you give it, not the reasoning that produced the change. It evaluates the result on its own terms. The Reviewer sub-agent here is that recommendation made explicit, with a constraint that prevents it from becoming a second design session.

The Architect-Implementer split solves a specific failure mode. An agent that both designs and implements has an incentive to design something it already knows how to build. The Architect is constrained from writing code, so design decisions must stand on their own merits. The constraint also makes the design document genuine: it was written by something that cannot shortcut its own recommendations.

GitHub Spec Kit vs pi-sdd-kit: what each gives you

GitHub’s Spec Kit provides slash commands for a spec-driven workflow: /specify to create a spec, /plan to generate a plan, /tasks to decompose. It is a well-designed convention that codifies the right instinct: write a spec before you code. It has broad community adoption and is the right choice if your primary tool is GitHub Copilot.

What each approach gives you

Different scope, not competing tools. Spec Kit is a workflow convention for Copilot. The three-layer structure here is a memory and gate system for Claude Code.
FeatureGitHub Spec Kitpi-sdd-kit structure
Spec-first workflowYes, via /specify /plan /tasksYes, via PRD to TASKS pipeline
Stable product memory across sessionsNot built inYes, steering/ folder with 4 context files
Machine-readable approval gateNot built inYes, .status file with explicit token
Specialized sub-agents with constraintsNot built inYes, Architect / Implementer / Reviewer
Works with GitHub CopilotYes, nativeMethod works with any agent
Different scope, not competing tools. Spec Kit is a workflow convention for Copilot. The three-layer structure here is a memory and gate system for Claude Code.

If you are building a prototype in an afternoon, neither is necessary. A plain markdown spec and plan mode are enough to get the benefit. The structure compounds in value as sessions, features, and team members multiply.

A feature from PRD to REVIEW

Here is a concrete feature moving through the pipeline: PATCH /users/me so an authenticated user can update their display name and timezone. This is simplified from a real endpoint in the fintech.

PRD in five minutes. Feature in business language: “Authenticated users need to update their display name (2-64 chars) and timezone (IANA string, validated server-side). They can update one or both fields in a single request. No other profile fields are in scope for this endpoint.”

The Architect reads the PRD and steering/ (specifically tech-stack.md and principles.md) then produces requirements.md in EARS format:

WHEN an authenticated user sends PATCH /users/me,
THE SYSTEM SHALL validate all provided fields before persisting any change.

IF displayName is provided AND length is less than 2 OR greater than 64,
THE SYSTEM SHALL return 400 with message "displayName must be 2 to 64 characters".

IF timezone is provided AND is not a valid IANA timezone identifier,
THE SYSTEM SHALL return 400 with message "Invalid timezone identifier".

THE SYSTEM SHALL NOT allow unauthenticated requests to this endpoint.

You review. Requirements match the PRD. No ambiguity. You update .status to requirements:approved.

The Architect produces design.md: route handler, validation layer, repository method, response shape, and an explicit note that timezone validation uses the IANA tz database from the existing auth service. You review. The design maps cleanly to every requirement and follows steering/conventions.md. You update .status to design:approved.

Task decomposition produces four units in tasks.md:

  1. Add PATCH /users/me route with authentication guard. Testable: route returns 401 without a valid token.
  2. Implement displayName validation. Testable: 400 on fewer than 2 or more than 64 characters.
  3. Implement timezone validation against IANA database. Testable: 400 on an unknown timezone string.
  4. Implement repository update method. Testable: persists changes, returns the updated user object.

The Implementation Readiness Check confirms every requirement maps to at least one task, every task is independently testable, and there are no undeclared dependencies. .status becomes tasks:approved.

The Implementer reads tasks.md, finds tasks:approved, and works through the list in order: tests alongside, not after. Midway through task 3, it encounters an ambiguity: what HTTP status if the user account is deactivated? It STOPs and asks instead of guessing. You answer: 403. That decision goes back into requirements.md before implementation resumes.

After EXEC, the Reviewer runs the verification report:

Verification report: PATCH /users/me

Every row is a PASS. A FAIL means fix the spec first, then regenerate. Never patch code to paper over a spec error.
ClaimCommandExit codeVerdict
401 without auth tokencurl -X PATCH /users/me0PASS
400 on displayName too shortPATCH /users/me displayName=x (1 char)0PASS
400 on invalid timezonePATCH /users/me timezone=badzone0PASS
403 on deactivated accountPATCH /users/me X-Test-Inactive: 10PASS
200 on valid updatePATCH /users/me displayName=Felipe0PASS
Every row is a PASS. A FAIL means fix the spec first, then regenerate. Never patch code to paper over a spec error.

Every row is a PASS. If any were a FAIL, the fix starts in the spec: the requirement was wrong (update requirements.md) or the implementation missed something (fix tasks.md, regenerate). The spec is the source of truth. Patching code to make a verification row pass while leaving the spec unchanged is how the method quietly dies.

That loop (PRD through REVIEW) is the same for every feature. The discipline compounds: after ten features, the steering files are tight, the agent almost never STOPs for ambiguity, and the Implementation Readiness Check takes two minutes because the patterns are established. The friction front-loads.

A real failure the gate caught

During the fintech build, one of the early charge-creation specs passed PRD and reached the gate at design review. The EARS requirements read correctly. But the design.md the Architect produced was missing the idempotency constraint: the UNIQUE(merchant_id, idempotency_key) index that enforces exactly-once billing at the database level.

The design was technically coherent. It was also wrong. The gate forced a design review before the agent could implement. I caught the missing constraint during that review, not during a production incident.

An eager agent without the gate would have implemented from the design, the index would not exist, and the first retry under load would have created a duplicate charge. On a system handling real BRL transactions, that is not a test scenario. The gate caught it while it was still a text file, not a billing error. Error caught in design costs minutes. Error caught in production, on a payment system, costs weeks and an apology.

This is also why the gate is manual. An automated gate could advance on “design.md is complete.” The human gate forces a reading, which is the only way to catch what a technically valid but wrong document contains.

How CLAUDE.md grows wrong, and how to prune it

The pattern is predictable. One bad session produces one rule. The next bad session produces another. Three months later, CLAUDE.md is 300 lines and the agent ignores half of it. Anthropic’s own failure description: “if Claude keeps doing something you don’t want despite having a rule against it, the file is probably too long and the rule is getting lost.”

CLAUDE.md pruning checklist

  • Required:
    Would removing this cause Claude to make mistakes?If not, cut it. This is Anthropic's own test, verbatim. Apply it to every line, not just the ones you suspect.
  • Required:
    Does Claude already do this correctly without the instruction?Then the instruction is noise. Delete it or convert it to a hook for deterministic enforcement.
  • Required:
    Is this domain knowledge or a workflow step?Move it to a steering file or a .claude/skills/ file. CLAUDE.md is not a knowledge base; it is a behavioral config.
  • Required:
    Does Claude keep ignoring this rule despite the instruction?The file is too long and the rule is buried. Prune first, then restate. A short file with three rules beats a long file with thirty.
  • Required:
    Is this a non-obvious behavioral rule that applies to every session?This one stays. CLAUDE.md is for session-level behavioral instructions that cannot live anywhere else.
Run this checklist after every two weeks of active development. The file should shrink over time as you move context to the right layer.

The right shape for CLAUDE.md in a project using this system: a brief description of the three-layer structure, a pointer to steering/ for product context, a pointer to specs/ for feature work, the instruction to check .status before implementing, and any behavioral rules that apply universally. Under 30 lines. Everything else belongs somewhere else.

FAQ

Does Claude Code have spec-driven development built in?

Claude Code has plan mode built in, and Anthropic's best practices describe the four-step loop (explore, plan, implement, commit). That is the foundation of SDD applied to Claude Code.

What Claude Code does not have built in is the three-layer context system, the .status gate, or the specialized sub-agents. Those are the structure you apply on top, with either plain markdown and discipline or a tool like pi-sdd-kit.

What is plan mode in Claude Code?

Plan mode is a built-in Claude Code feature that prevents the agent from writing code while it explores and plans. Press Shift+Tab in the terminal interface to enter it. In plan mode, Claude reads files and answers questions without making changes.

Press Ctrl+G to open the current plan in your text editor and edit it directly. Then exit plan mode and let Claude implement. Anthropic recommends using plan mode for anything beyond a trivial change, and skipping it when the diff can be described in one sentence.

What is a steering file?

A steering file is a persistent context document the agent reads every session via CLAUDE.md pointers. The four core ones (product.md, tech-stack.md, conventions.md, principles.md) answer the questions the agent would otherwise guess at: what the product is, what stack it runs on, how code is written, and which architectural rules are non-negotiable.

They sit in steering/ and change rarely. When they do change, it is because you made a deliberate architectural decision. Writing it down is how that decision becomes the agent's permanent context for every future session.

Spec Kit vs Claude Code: which workflow wins?

GitHub Spec Kit and Claude Code are not competing tools. Spec Kit is a convention layer for GitHub Copilot: slash commands (/specify, /plan, /tasks) that codify spec-driven workflows inside Copilot.

If you use Claude Code, the three-layer context system and .status gate described in this article are the equivalent structure. What the pi-sdd-kit approach adds over Spec Kit is steering as durable memory, .status as a machine-readable approval gate, and sub-agents with explicit constraints. Different scope, not a competition.

Why does SDD use a .status gate instead of just reviewing the spec?

Because 'I reviewed it' is a mental state, not a signal the agent can read. The .status file gives the agent an explicit, machine-readable token it must check before advancing to the next phase.

The hard rule (file existence does not imply approval) exists because a completed file and an approved file look identical to an agent scanning the directory. The status token removes that ambiguity completely.

What is pi-sdd-kit and do I need it?

pi-sdd-kit is my npm package that codifies the SDD workflow as slash commands for the Pi coding agent: /skill:sdd-prd, /skill:sdd-spec, /skill:sdd-tasks, /skill:sdd-exec, /skill:sdd-review, and the .status gate convention. Published as @felipefontoura/pi-sdd-kit.

You do not need it. The steering/ folder, specs/NNN-feature/ structure, and .status gate work with any agent and any editor using plain markdown. The kit removes ceremony and makes the workflow consistent across projects. It is infrastructure, not a prerequisite.

What is the Implementation Readiness Check?

A pre-EXEC validation that tasks.md is actually ready for implementation: every EARS requirement maps to at least one task, every task has explicit acceptance criteria, every task is independently testable, and no task has an undeclared dependency on another.

It runs before tasks:approved goes into .status. Its job is to distinguish 'I think we are ready' from 'the spec is complete enough to implement without ambiguity.' Those are different things, and catching that gap here is cheaper than catching it during implementation.

How is this different from just using CLAUDE.md with a lot of instructions?

CLAUDE.md is one layer: the entry point that loads every session. SDD adds two more: steering/ for stable product context that rarely changes, and specs/NNN-feature/ for per-feature specs that evolve through the pipeline.

CLAUDE.md tells the agent how to behave. Steering tells it what it is building and why those decisions were made. The feature spec tells it what to build right now. All three are load-bearing. Each fails without the other two.

Where to go next

SDD with Claude Code is a structure you apply to the tool you already have. The memory problem is real, and the three-layer context system (CLAUDE.md, steering, specs) solves it without adding ceremony that slows you down.

The method is explained in full in what spec-driven development is. To see it running at production scale (13 apps, real money, 70 days), the case study has the receipts. For the spec documents themselves (how to write requirements in EARS format, what a complete design section contains, how to decompose tasks), the companion piece is how to write a spec an AI can build from.

The kit is at @felipefontoura/pi-sdd-kit. Plain markdown and consistent approval gates are enough to start without it.

The code writes itself now. The specs still do not. That is the work.