
Spec-Driven Development vs. Vibe Coding: A Practitioner's Honest Comparison

Vibe coding is real and useful. It is also wrong for anything that must be correct. A METR RCT, a payment idempotency head-to-head, and the honest verdict.

Vibe coding is real, it works, and I use it. It also has a failure mode that becomes more expensive the more capable the model becomes. This article is the honest comparison: what each approach is, where each one breaks, and the decision rule for when to use which.

If you are new to SDD, the full methodology is in “What Is Spec-Driven Development?” Come back here for the head-to-head.

What vibe coding actually is

Andrej Karpathy coined the term in a post on X on February 2, 2025. His description was precise: you fully give in to the vibes, embrace exponentials, and forget that the code even exists. You are not writing code. You are describing an intent and accepting what the model produces. The name is accurate. You code by feel.

Karpathy explicitly scoped vibe coding to throwaway weekend projects when he introduced it. That scope matters. Vibe coding was never proposed as a production methodology by its creator. Within a year, it had normalized into something else entirely: the default way most developers interact with AI coding tools. The framing stuck while the original constraint quietly fell off.

The workflow is three steps. You describe what you want in plain language. The model generates code. You nudge it when something looks wrong and generate again. You are not reading every line. You are steering a conversation. The model drives.

This is genuinely useful work. For a throwaway prototype you can get something running in twenty minutes. For a broken script, three prompts often fix it. For understanding what a problem actually looks like before committing to an architecture, vibe coding gives you a fast feedback loop that nothing else matches.

The question is not whether vibe coding works. It is where it stops being safe.

The structural problem vibe coding cannot solve

Vibe coding optimizes for the moment when something first runs. Green terminal output. A page that renders. A function that returns a value. That moment feels like progress. The model’s confidence is contagious.

The feeling starts lying the second your work needs to outlive the session. The AI operates in an eternal present. Every conversation starts from nothing: no recall of the constraint you added at 10pm, the edge case you thought of in the shower, the business rule you inlined as a comment and then deleted. Everything you did not write down gets reinvented. Not consistently.

This is not a problem waiting on a context-length fix. A million-token context window tells the agent what the system currently is. It says nothing about what it should become: your intent, your constraints, the edge cases you care about, the things that are deliberately out of bounds. A bigger window makes the agent better informed about the present and no wiser about the target.

Addy Osmani’s framing captures where the ceiling lives: AI gets you roughly 70 percent of the way fast. The last 30 percent is where production code lives: edge cases, business rules, invariants, security constraints, the things that are not obvious from the happy path. Vibe coding has no mechanism for precision at that depth. It prompts again and hopes the model fills the gap correctly.

Sometimes it does. Often it does not. You will not know which until the code hits something real.

The spec is what supplies the direction the model cannot invent.

The data is worse than you expect

What the research actually found

19% slower with AI (actual result). METR 2025 RCT, experienced devs on their own codebases.

~20% faster (what they believed afterward). Same devs, same trial, wrong conclusion.

246 real issues tested. 16 experienced devs, large mature open-source projects.

~40% of security-sensitive AI code had a known vulnerability. Pearce et al., IEEE S&P 2022.

The confidence is not the competence. Experienced developers felt faster while being slower.

In 2025, METR ran a randomized controlled trial with 16 experienced open-source developers working 246 real issues on large, mature codebases, using frontier models: Cursor Pro with Claude 3.5 and 3.7 Sonnet. The developers expected AI to make them 24 percent faster. They were actually 19 percent slower. And afterward they still believed it had sped them up by about 20 percent.

That last number is the uncomfortable one. The combination of confident output and invisible errors makes the productivity drain invisible from inside the session. You feel fast. You are slower. The gap between the two is precisely what produces wrong-but-confident code that ships. (METR released a follow-up in early 2026 noting that newer models show different results in their ongoing study; the 2025 data remains the most rigorous RCT baseline for that generation of tools, and the underlying mechanism is unchanged.)

The security picture lands separately. Research by Pearce et al., published at the IEEE Symposium on Security and Privacy, found that roughly 40 percent of code generated in security-sensitive settings contained a known vulnerability. The mechanism matters more than the specific number: with AI-generated code, a defect is a gap in the specification. That gap resurfaces every time the code is regenerated, in a new form, until a spec explicitly encodes the constraint.

The critique gets more uncomfortable for one specific reason: the problem worsens as models improve. A weaker model makes a small, obviously wrong thing. A capable model makes a large, coherent, well-architected, subtly wrong thing. Capability amplifies direction. It does not supply it.

Why the same bug keeps coming back

This is the mechanism that most comparisons miss, and it is the core argument for spec-driven work.

With AI-generated code, a defect is not a mistake in the code. It is a gap in the specification. Because generation is non-deterministic, that gap resurfaces in a new form every time the code is regenerated, until a spec explicitly encodes the constraint. You patch the output. You do not patch the process. The next regeneration brings the same class of problem back in a different shape.

Vibe coding has no way to close that gap at the source. Every fix is a local correction to a one-off output. The agent starts each new session from nothing, infers the same missing assumption, and produces the same class of error.

The fix is not better prompting. It is writing down the constraint in a durable artifact the agent can execute from. Writing the spec is the act of patching the process rather than the output. When you patch the process, every future regeneration produces a safe implementation automatically. Every session starts from the same foundation.

This is why the failure mode is invisible from inside the session. You reprompt, the problem disappears, the terminal is green. It feels solved. The gap is still open. It resurfaces the next time you regenerate, or the next time the context shifts, or the next time a different agent touches the same code.

A concrete head-to-head: the charge that double-bills

Take a realistic mid-size task: a payment charge endpoint where a client retry must never bill the customer twice. This exact pattern appears in payment flows I built while shipping a 13-app crypto fintech in 70 days, solo. The case study has the full numbers.

The vibe approach: you write “add a POST /v1/charges endpoint with idempotency.” The model generates sixty lines of handler code in eight seconds. It looks complete. It is not. Idempotency requires state: a record of which keys have already been processed. Without an explicit database constraint, two concurrent retries can race, both read “not found,” and both insert a new charge row. Double-billed customer.
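The race described above can be sketched deterministically. This is a minimal illustration, assuming Python's built-in sqlite3 module and a hypothetical charges table whose idempotency key is stored but not constrained. The helper names, the merchant id, and the key value are made up for the example, and the interleaving is written out by hand rather than produced by real concurrency.

```python
import sqlite3

# Hypothetical schema: the idempotency key is stored but NOT constrained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges ("
    " id INTEGER PRIMARY KEY,"
    " merchant_id TEXT, amount_cents INTEGER, idempotency_key TEXT)"
)

def key_seen(merchant_id, key):
    """Step 1 of a naive handler: application-level lookup of the key."""
    row = conn.execute(
        "SELECT id FROM charges WHERE merchant_id = ? AND idempotency_key = ?",
        (merchant_id, key),
    ).fetchone()
    return row is not None

def insert_charge(merchant_id, amount_cents, key):
    """Step 2 of a naive handler: insert when the lookup found nothing."""
    conn.execute(
        "INSERT INTO charges (merchant_id, amount_cents, idempotency_key)"
        " VALUES (?, ?, ?)",
        (merchant_id, amount_cents, key),
    )

# Two retries of the same request, ordered the way a race would order them:
# both run the lookup before either runs the insert.
retry_a_seen = key_seen("m_1", "key-123")
retry_b_seen = key_seen("m_1", "key-123")
if not retry_a_seen:
    insert_charge("m_1", 5000, "key-123")
if not retry_b_seen:
    insert_charge("m_1", 5000, "key-123")

charge_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
print(charge_count)  # 2: the customer is billed twice
```

Nothing in the handler logic is wrong line by line. The bug lives in the gap between the lookup and the insert, which is why reading the generated code rarely catches it.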

The spec approach writes this before any code exists:

FR-1  WHEN a merchant POSTs a charge with a valid Idempotency-Key,
      THE SYSTEM SHALL create at most one charge for that (merchant_id, key) pair.
FR-2  WHEN the same Idempotency-Key is replayed within 24h,
      THE SYSTEM SHALL return the original charge and create no new one.

Data model:
  charges(id, merchant_id, amount_cents, idempotency_key, status, created_at)
  UNIQUE(merchant_id, idempotency_key)   -- this enforces FR-1 at the database level

The UNIQUE constraint is the entire difference. It is not a code-level check that can be raced around. It is a database-level invariant the agent must implement, and it survives every regeneration. The EARS sentence (WHEN/SHALL) makes the behavior explicit enough that the model cannot invent a shortcut. I used this exact pattern across every payment endpoint in that fintech. Not one double-billed in production.
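Here is a minimal sketch of what that constraint buys, again assuming Python's built-in sqlite3 module. The table follows the data model in the spec, but the column types, the create_charge helper, and the sample merchant ids are illustrative assumptions, not the handler from the fintech project. The 24-hour replay window in FR-2 is not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges ("
    " id INTEGER PRIMARY KEY,"
    " merchant_id TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL,"
    " idempotency_key TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'created',"
    " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    " UNIQUE (merchant_id, idempotency_key))"  # enforces FR-1 in the database
)

def create_charge(merchant_id, amount_cents, key):
    """Insert first; on a duplicate key, return the original charge id."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO charges (merchant_id, amount_cents, idempotency_key)"
                " VALUES (?, ?, ?)",
                (merchant_id, amount_cents, key),
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # The database rejected the duplicate: look up the charge that won.
        row = conn.execute(
            "SELECT id FROM charges WHERE merchant_id = ? AND idempotency_key = ?",
            (merchant_id, key),
        ).fetchone()
        return row[0]

first = create_charge("m_1", 5000, "key-123")
replay = create_charge("m_1", 5000, "key-123")  # client retry, same key
other = create_charge("m_2", 5000, "key-123")   # same key, different merchant
count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
print(first == replay, other != first, count)  # True True 2
```

The helper never checks for the key before inserting. It lets the database decide, so the result does not depend on how two retries interleave: the second insert is rejected and the caller gets the original charge back.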

Vibe coding

  1. Prompt: add a charge endpoint with idempotency. 60 lines in 8 seconds.
  2. Idempotency key stored in a column but not constrained: concurrent retries race and both insert.
  3. Double-billing surfaces in load testing (best case) or in production (common case).
  4. Reprompting fixes the symptom, not the gap. Next regeneration: same class of bug.
  5. Next session: model starts fresh. No memory of the constraint discussion.

Spec-Driven

  1. EARS rule written first: WHEN same key replayed, THE SYSTEM SHALL return the original charge.
  2. UNIQUE(merchant_id, idempotency_key) declared in the data model before code exists.
  3. Agent generates the handler with the concurrency case covered on the first pass.
  4. Constraint lives in the spec. Every regeneration produces a safe implementation.
  5. Next session: agent reads the spec. The invariant is already there.

Same model. Same task. Different outcome. The UNIQUE constraint is the only variable. From Felipe's real 13-app crypto fintech.

The vibe path is not just slower in total. It is slower in a way you cannot see until the code fails. And it fails in a way that comes back every time you regenerate without fixing the spec.

Head-to-head across what actually matters

Vibe coding vs. Spec-Driven Development

| Criterion (weight) | Vibe coding | Spec-Driven |
|---|---|---|
| Speed to first output (2) | 5 | 3 |
| Correctness on real features (3) | 2 | 5 |
| Rework cost (inverted: higher = less rework) (3) | 2 | 5 |
| Scales past one session (3) | 1 | 5 |
| Weighted score | 25 | 51 |

Scale 1-5 (5 = best). Winner by weighted score: Spec-Driven, with 51.

Weighted for production work. For throwaway scripts and single-sitting prototypes, the weights shift. Vibe wins on what matters there.
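The weighted totals are each score multiplied by its criterion weight, then summed. A short sketch of that arithmetic, using only the weights and scores from the table (the variable names are illustrative):

```python
# (weight, vibe coding score, spec-driven score), copied from the table
criteria = {
    "Speed to first output": (2, 5, 3),
    "Correctness on real features": (3, 2, 5),
    "Rework cost (inverted)": (3, 2, 5),
    "Scales past one session": (3, 1, 5),
}

vibe_total = sum(weight * vibe for weight, vibe, _ in criteria.values())
spec_total = sum(weight * spec for weight, _, spec in criteria.values())
print(vibe_total, spec_total)  # 25 51
```

Changing the weights in the first column is how the comparison would be re-run for a different class of work.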

Vibe coding leads on exactly one production dimension: speed to first output. That advantage narrows once you count rework cycles. Speed to correct output tells a different story.

How the methods differ at the level that counts

Vibe coding versus Spec-Driven Development at a glance

| Dimension | Vibe coding | Spec-Driven |
|---|---|---|
| Primary artifact | The last prompt | The approved spec |
| Memory between sessions | None. Starts fresh every time. | The spec file. Agent continues from where you stopped. |
| Correctness guarantee | Whatever the model inferred from the prompt | What the spec explicitly states |
| When a defect recurs | Reprompt and hope. The gap in the spec stays open. | Update the spec. The constraint is sealed for every future run. |
| Right for | Prototypes, scripts, single-session exploration | Production features, money movement, compliance |

The spec is not overhead. It is the fix for the structural problem vibe coding cannot solve.

The skill senior developers are missing

Most developers approached AI tools with the intuition that the bottleneck was implementation speed. If the model writes code faster, I am faster. The METR result breaks that assumption cleanly. Experienced developers on their own codebases were slower. The bottleneck was never implementation. It was precision of intent.

This lands harder on senior developers than anyone expects, and the counterintuitive part is why. Experience means more implicit context: more knowledge of the system’s history, more assumptions about what “correct” means, more decisions that feel obvious and never get written down. Every piece of tacit knowledge is invisible to the model. The more you know, the bigger the gap between what you said and what you meant.

A junior developer describes what they want in more explicit terms, because they are less certain about what is obvious. A senior developer gives the model a rough sketch and expects it to fill in the rest the way another experienced engineer would. The model does not fill in the gaps that way. It pattern-matches against everything it has seen, and the most common pattern is not your specific system.

This is why the METR finding lands hardest on exactly the people who expected AI to help them most. The developers in that study were experienced, working on their own codebases, using frontier models. They still lost ground. The cause is not capability. It is the gap between what was said and what needed to be true.

Anthropic’s own guidance makes this explicit: “letting Claude jump straight to coding can produce code that solves the wrong problem.” Years of experience do not help you prompt faster. They help you think precisely about what needs to be true before the code runs: the constraints, the edge cases, the invariants, what is deliberately out of scope. That precision, locked in your head, is invisible to the model. Writing a spec externalizes it into a durable artifact the agent can actually execute from.

I went deeper on this in “Don’t Code, Specify”: why experienced developers were getting worse results with AI tools, and what shifted when I changed the default workflow.

The honest verdict: vibe when it’s cheap to be wrong, specify when it isn’t

I am not making a case against vibe coding. I use it constantly, and you should too. The error is applying it to the wrong class of problem.

Anthropic draws a practical line in their own guidance: “if you could describe the diff in one sentence, skip the plan.” I use that test constantly. One sentence means low complexity, low risk, probably low stakes. When the description needs three sentences and a list of edge cases, that is the spec.

Most real projects need both modes. Vibe the exploratory work to understand what the problem actually is. Specify everything that enters production. The mistake is treating all work as one category or the other. That is how you get vibe coders with production incidents and spec writers with nothing shipped.

The 70 percent insight points to the seam. Use vibe coding to get to the 70 percent fast, then switch to SDD to close the gap correctly. Sequential, not competing.

FAQ

Is vibe coding bad?

No. Vibe coding is genuinely useful for low-stakes, fast-feedback work: prototypes, scripts, exploratory spikes. The problem is not the approach. It is applying it to work where correctness, edge cases, and multi-session continuity actually matter.

The failure mode is treating vibe coding as the default for all AI-assisted development, including production features where the last 30 percent of correctness is load-bearing.

Did Karpathy invent vibe coding?

Andrej Karpathy coined the term on February 2, 2025, in a post on X. He described it as fully giving in to the vibes and forgetting that the code even exists. The concept itself (accepting AI output without reading every line, steering by feel) was already emerging in practice; he named it and gave it a shape.

He also explicitly scoped it to throwaway weekend projects. That original constraint got lost as the term spread. Most of the problems with vibe coding in production come directly from ignoring the scope Karpathy set when he defined it.

Can I combine vibe coding and SDD?

Yes, and for most real projects you should. Vibe coding is excellent for exploration: you learn what the problem is by building a rough version fast. The spec captures what you discovered and becomes the foundation for the production build.

Vibe to discover. Specify to ship. Sequential, not competing.

Why does the same AI bug keep coming back?

Because you fixed the output, not the spec. With AI-generated code, a defect is a gap in the specification. Generation is non-deterministic, so that gap resurfaces in a new form every time you regenerate, until a spec explicitly encodes the constraint.

The fix is to patch the spec, not just the code. The code is the output. The spec is the source. This is the core mechanism behind the vulnerability recurrence pattern in AI-generated security-sensitive code (Pearce et al., IEEE S&P 2022).

Why are senior devs slower with AI?

Two reasons. First, experienced developers carry more implicit context, which means more assumptions the model can silently violate. The gap between what you meant and what you said is larger when your mental model of the system is complex.

Second, experienced developers tend to work on production systems where the last 30 percent of correctness is load-bearing. Vibe coding stalls exactly there. The METR finding (19 percent slower on their own codebases, with frontier models) is striking because these developers knew their codebases well. The bottleneck was precision of intent, not familiarity with the code.

Where to go next

Vibe coding and SDD answer different questions. Vibe asks: how fast can I get something running? SDD asks: how do I make sure what is running is what I actually meant to build?

Both questions matter. The skill is knowing which one applies.