Harness Engineering: How to Improve Your AI Coding Agent

You cannot touch the model’s weights. You own every other line of code around it. That code is the harness, and it is where almost all of your advantage actually lives.

Every week someone tells me their agent got smarter because they switched to a newer model. Sometimes that is true. Most of the time they changed the one variable they do not control and left untouched the one they do.

I have shipped production code for 25 years. In late 2025 I built a 13-app crypto fintech solo with AI agents in 70 days. The case study has the real numbers. Here is the part nobody puts on the landing page: the model I used was the same model everyone else had. The difference was not the intelligence. It was the scaffolding I wrapped around it. The harness.

This is a guide to improving that harness. Not the theory of what a harness is, which the whole internet already wrote. The practice of making yours measurably better, from someone who runs a minimal one every day.

Agent equals model plus harness

Start with the cleanest definition anyone has published. LangChain put it in four words: Agent = Model + Harness. The model provides intelligence. The harness makes that intelligence useful.

Strip a harness to its parts and you get four, and every serious writeup lands on the same four: an agent loop that keeps calling the model until the work is done, a tool interface the model acts through, context management that decides what the model sees each turn, and control mechanisms that keep the whole thing from running off a cliff.

Notice what is on that list. None of it is the model. All of it is yours. When people say “prompt engineering” they mean one message. When they say “context engineering” they mean one window. Harness engineering is the layer above both: it designs the system that resets context, hands off state between sessions, and verifies the result before it ships. That is the layer with the most headroom left in it, and it is the layer almost nobody is deliberately building.

The industry is improving harnesses backwards

Here is the default move when a harness underperforms. Add to it. Bolt on another MCP server. Inject more instructions into the system prompt. Spawn a subagent to go find context. Wire up a plan mode, a memory service, a router.

Every one of those additions costs you the one resource the model spends before it writes a single line: the context window.

Mario Zechner, who built the pi coding agent, put the frustration plainly: mainstream harnesses became opaque and unstable because their system prompts and injected context kept changing behind the user’s back. You cannot improve what you cannot see. And you cannot see a harness that hides half of what it feeds the model.

The counterintuitive fix, the one that actually works: make the harness smaller and more transparent, not bigger and more clever.

Meet a harness you can actually see

To improve a harness you need one you can hold in your head. That is why I run pi, an open-source coding agent by earendil-works. Not because it has the most features. Because it has the fewest, and it hides none of them.

pi, the whole harness at a glance

4
built-in tools: ~150
words of system prompt: 0
MCP servers by default: BYOK
any provider

A minimal agent scaffold that loops until the model stops calling tools. No hidden context, no invisible subagents, no plan mode. And it still holds its own on Terminal-Bench 2.0.

pi gives the model four tools: read, write, edit, and bash. The system prompt is roughly 150 words, on the theory that a frontier model trained on millions of coding sessions already knows how to be a coding agent and does not need a thousand tokens reminding it. The agent loop is one page of logic: call the model, run the tools it asked for, feed back the results, repeat until it stops asking. That is the whole thing.

This is not a toy. The point of a transparent harness is that when the agent does something dumb, you can see exactly which of the four moving parts caused it, and fix that part. Try that with a harness that injects context you are not allowed to read.

The heavy harness

01System prompt runs to thousands of tokens of scaffolding
02Three MCP servers eat the window before you type
03Context injected where you cannot inspect it
04Subagents spawn invisibly and report back a summary
05You debug the model because you cannot see the harness

The minimal harness

01System prompt is 150 words the model already understood
02Zero MCP servers, plain CLI tools with readable help
03Every token in the window is one you can name
04Subagents are explicit: you call pi from bash and watch it
05You debug the harness because you can read all of it

Same model in both. The harness is the entire difference.

The three axes you actually control

Once you can see the harness, improving it stops being vague. Every worthwhile change moves one of exactly three things: what the model sees, what the model can do, and what survives when the session ends. Master those three and you have engineered your harness. Ignore them and you are just swapping models and hoping.

Where the wins actually are

01
What the model sees
The context window is a hard budget, not a suggestion. Every token spent on boilerplate, raw tool output, or an MCP schema the model never uses is a token not spent on your actual problem. Improving the harness starts with defending the window.
```
Cut the system prompt. Filter tool output at the source.
Drop MCP servers you do not use every day.
```
+
02
What the model can do
Tools are the model's hands. More tools does not mean more capability. It means more surface area and more token cost. A handful of sharp, composable tools plus a shell beats a wall of narrow ones. Reach for progressive disclosure, not a bigger menu.
```
read, write, edit, bash covers most work.
Add a tool only when bash plus a CLI genuinely cannot.
```
+
03
What survives the session
The model forgets everything the moment the context resets. You do not buy memory. You write it: an instructions file, a plan, a progress log, commits. The harness that persists state cleanly is the one that ships large work.
```
AGENTS.md, PLAN.md, a progress log, one commit per feature.
The spec is the memory the agent does not have.
```

Context, tools, memory. Improve these three and the same model gets dramatically more useful. Chase anything else and you are decorating.

The rest of this guide is one section per axis, with the concrete moves I actually make.

Axis one: defend the context window

The context window is the only truly scarce resource in the whole system, and most harnesses spend it like it is free. The two biggest leaks are the system prompt and raw tool output.

The system prompt leak is easy. Read yours. Every sentence that tells a frontier model something it already does correctly is a sentence you are paying for on every single turn, forever. pi’s answer is a 150-word prompt. Yours does not have to be that short, but apply the test: if deleting a line does not make the agent worse, the line is noise. Cut it.

The tool-output leak is bigger and quieter. A single git status in a busy repo can be two thousand tokens. A full test run can be twenty-five thousand. The model does not need all of it, it needs the signal, but a naive harness dumps the whole firehose into the window and calls it context.

This is the highest-return change you can make to a harness and almost nobody makes it, because the tokens leak somewhere you never look: inside the tool results. Put a filter there and every session in every project gets cheaper and sharper at once.

Where the window actually goes

Four leaks, four fixes. None of them require a smarter model. All of them are pure harness work.
What loads	Heavy harness	Minimal harness
System prompt	thousands of tokens, every turn	~150 words the model knew already
MCP servers	13k to 18k tokens each, before you type	none; CLI tools with readable help
Tool output	raw dump, 2k to 25k tokens a call	filtered at the source, 60 to 90% smaller
Old turns	the same file read five times	compacted on a schedule, delta only

Four leaks, four fixes. None of them require a smarter model. All of them are pure harness work.

When the window does fill up, compaction is your release valve. pi runs it automatically as the window approaches its limit, and gives you /compact to trigger a summary of older turns on demand while keeping the recent ones intact. The full history stays on disk as JSONL, so nothing is lost, only the working set shrinks. That is the difference between compaction and amnesia.

Axis two: fewer, sharper tools

The instinct to add tools is the instinct that ruins harnesses. Every tool you register costs tokens for its definition on every turn, and dilutes the model’s choice with one more option to get wrong. The question is never “what tool could I add.” It is “what tool must exist because the shell genuinely cannot do it.”

pi’s answer is blunt and mostly right: read, write, edit, and bash are enough for an effective coding agent. Bash is the master key. It is not one tool. It is every command-line tool you have ever installed, wrapped in documentation the model can read on demand. That is progressive disclosure done right. The model does not carry a thousand tool schemas in the window. It runs --help when it needs one.

This does not mean MCP is wrong. It means MCP is a cost, and costs need to earn their place. If a server is load-bearing for your daily work, keep it. If it is loaded “just in case,” it is a tax you pay on every turn for a capability you rarely use. Improving the harness here is mostly subtraction: audit your tools, keep the ones bash cannot replace, cut the rest.

Does this tool belong in the harness?

Anti-pattern:
Can bash plus an installed CLI already do this?Then the tool is redundant. Delete it and let the model use the shell. The CLI's help text is its schema, loaded on demand.
Required:
Do I invoke this capability in most sessions?If yes, a dedicated tool or MCP server earns its token cost. If no, it is a just-in-case tax on every turn.
Required:
Is the tool's description short and unambiguous?A tool the model misreads is worse than no tool. Sharp names and one-line descriptions beat sprawling schemas.
Required:
When it runs, can I see exactly what it did?A tool whose effects you cannot inspect is a debugging dead end. Prefer explicit and visible over magic and hidden.

When you genuinely need to extend pi, you do it the transparent way: an extension file that registers a tool you can read, or you invoke pi itself from bash as an explicit subagent so you watch it work instead of trusting a summary from a hidden one. I used that path to add the one control I missed from OpenCode: per-phase model routing. A small package I wrote reads a model and a reasoning level off each skill and switches to them the moment the skill loads, so exploring runs a cheap model and building runs a precise one. That is the thing OpenCode calls modes, added as a package I can read rather than a feature I wait for. The pi guide has the setup. Zechner’s rule stuck with me: reaching for a subagent mid-session to go gather context is usually a sign you did not plan the context ahead of time.

Axis three: give the agent a memory

This is the axis that separates a demo from a system. The model is stateless. It forgets everything the instant the session ends. Anthropic has the analogy that makes it concrete.

Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.

That is your agent on every long task. The fix is not a bigger context window or a fancier memory product. It is deliberate, boring, durable files that the next shift reads first. The filesystem, as LangChain puts it, is the most foundational harness primitive there is, and memory is just the harness using it on purpose.

Four artifacts carry the load:

An instructions file the harness loads every session. pi reads AGENTS.md from the project and its parents automatically. Keep it lean: what the project is, the conventions that are not obvious, and pointers to everything else. If removing a line does not cause mistakes, cut it.
A plan that lives on disk, not in the model’s head. pi deliberately has no hidden plan mode. Planning goes in a PLAN.md or TODO.md that survives the session, stays visible, and stays editable by you. A plan the model cannot lose is a plan you can trust.
A progress log the agent updates before the session ends and reads when it starts. Anthropic’s long-running-agent recipe is exactly this: read the progress notes and the git log, run the end-to-end tests, then pick the next unfinished piece. Boring. Reliable.
Commits as memory. One clean commit per feature is not hygiene. It is state. It gives the next shift a readable history and gives you a rollback when a shift goes bad.

The stack I run

Principles are cheap. Here is the running system I use to make them real: three layers, each one killing a different kind of token waste, mapped straight onto the three axes.

Three layers, three kinds of waste

L1
RTK, the tool layer
A CLI proxy that compresses command output before it reaches the model. This is the shipped, load-bearing layer I run every day. It is the fix for axis one at the tool boundary: less useless output, 60 to 90 percent off everyday commands.
```
git status  → a handful of tokens
test run    → the failures that matter
```
+
L2
Compression, the conversation layer
Between turns, collapse what did not change: the same file read five times, the same error printed three times, the plan repeated in every step. The model sees the delta, not the whole transcript again. Axis one, one level up from the tools.
```
Deduplicate reads. Collapse successes to one line.
Keep only what changed.
```
+
L3
Memory, the knowledge layer
Persistent project context loaded at session start, so you never re-explain the stack, the conventions, the decisions. This is axis three wired into the harness instead of into your fingers: the agent starts knowing, not asking.
```
Auto-load decisions and conventions.
The next session starts where the last one ended.
```

RTK cuts tool output, compression cuts repeated history, memory cuts re-explanation. The same three axes, turned into infrastructure that runs whether or not I remember to be disciplined.

RTK is real and I run it today. The compression and memory layers are the ones I am building into a single local-first proxy that sits between any agent and the model, so the whole stack travels with me across pi, Claude Code, or anything that speaks an OpenAI-compatible endpoint. The full architecture and the token math are in the token-cost deep dive. The point for now is smaller and harder to argue with: the biggest wins in a harness are not clever, they are plumbing. Put the filter where the tokens leak and every session gets cheaper at once.

The loop is the last mile

Context, tools, and memory equip a single agent run. The agent loop is what turns those runs into finished work. Keep it simple and make it verify.

A harness loop that actually finishes

Input

A feature to build, stated as a behavior the system must exhibit

READLoad the memory
Session opens by reading the instructions file, the plan, the progress log, and the git history. The new shift learns what the last shift did before touching anything.
LOOPCall, act, feed back
The model calls tools, the harness runs them and returns filtered results, and it repeats until the model stops asking. No arbitrary step cap. The work being done is the stop condition.
VERIFYProve it end to end
Do not trust a self-report. Run the tests. Anthropic found agents verify features reliably once explicitly told to drive real end-to-end checks, including browser automation, rather than eyeballing the diff.
PERSISTCommit and log
One clean commit for the feature, the progress log updated, the plan advanced. The window can now reset with zero loss, because the state lives on disk, not in the context.

Output

A session that ends leaving the next one everything it needs to continue

The trap at this stage is letting the loop declare victory on its own word. A harness that asks the model “did it work?” gets the answer the model wants to give. A harness that runs the test suite and reads the exit code gets the truth. Verification is not a nice-to-have bolted on the end. It is the control mechanism that makes the whole loop safe to run unattended.

The verdict: minimal wins

Put the two philosophies side by side and score them on what actually matters when you are the one operating the agent, not the one selling it.

Heavy harness vs minimal harness

Heavy harness vs minimal harness. Winner: Minimal harness with a weighted score of 53. Scale 1-5 (5 = best).
Criterion (weight)	Kitchen-sink harness	Minimal harness
Transparency (3)	2	5
Token cost (3)	2	5
Control (2)	2	5
Verifiability (2)	3	4
Portability (1)	2	5
Weighted score	24	53

Scale 1-5 (5 = best). Highlighted column: winner by weighted score.

Weighted for the operator, not the demo. The minimal harness does not win because it is trendy. It wins because every axis that lets you improve a harness is an axis where seeing less and paying more is a disadvantage.

The kitchen-sink harness is not built to be improved. It is built to impress on first run and to keep you from looking too closely after. The minimal harness is the opposite. It assumes you will want to change it, so it shows you everything and costs you almost nothing to reason about. That is the whole game.

Do this to your harness this week

You do not need to rebuild anything. Pick the leaks and plug them in order of return.

Harness tune-up, highest return first

Required:
Put a compression filter on your tool outputThe single biggest win. Raw command output is where the window quietly dies. Filter it at the shell before it reaches the model and reclaim 60 to 90 percent on everyday commands.
Required:
Read your system prompt and cut every line the model already obeysYou pay for it on every turn. If deleting a line does not make the agent worse, it was noise.
Required:
Audit your MCP servers and drop the just-in-case onesKeep only what you use most sessions. Everything else is a per-turn tax for a rare payoff. Replace it with a CLI the model runs through bash.
Required:
Write the four memory artifactsAn instructions file, a plan on disk, a progress log, and one commit per feature. This is what lets the agent survive a context reset without losing the thread.
Required:
Make the loop verify with real tests, not self-reportsWire the loop to run the suite and read the exit code. A harness that checks its own work is the one you can leave running.
Optional:
Switch to a harness you can read end to endNot mandatory, but everything above is easier on a transparent, provider-agnostic harness like pi. You cannot tune what you are not allowed to see.

The model is a commodity. The harness is your edge.

Every lab is shipping a smarter model next quarter, and every one of your competitors gets the same upgrade the same day. That is not where your advantage is. It never was.

Your advantage is the harness. It is the only part of the system you fully own, the only part that compounds with your judgment instead of resetting with someone else’s release schedule. A modest model in a sharp, transparent, well-fed harness will out-ship a frontier model in a bloated one, every time, because the bottleneck was never the intelligence. It was everything you wrapped around it.

Stop upgrading the model and hoping. Open the harness, and fix it.

Harness engineering, quick answers

What is harness engineering?

Harness engineering is the practice of designing everything around an AI model that is not the model itself: the agent loop, the tools, the context management, and the control and verification logic.

It sits above prompt engineering, which optimizes one message, and context engineering, which curates one window. Harness engineering designs the whole system that resets context, persists state across sessions, and checks the result.

Is harness engineering the same as context engineering?

No. Context engineering is about what the model sees inside a single context window. Harness engineering is the layer above it.

Context management is one of the four parts of a harness, alongside the agent loop, the tool interface, and the control mechanisms. So context engineering is a component of harness engineering, not a synonym for it.

How do I actually improve my AI coding agent's harness?

Work the three axes you control. Defend the context window by cutting the system prompt and filtering tool output. Keep tools few and sharp, leaning on bash instead of a wall of MCP servers. Give the agent durable memory through an instructions file, a plan, a progress log, and clean commits.

Then make the loop verify its own work with real tests. None of this requires a better model.

Do I need MCP servers to build a good agent?

No. A popular MCP server can cost 13k to 18k tokens of context before you type anything, and you pay that whether or not the model uses it. For most coding work, letting the model run the CLI tools you already have through bash is cheaper and more transparent.

Keep the MCP servers you use in most sessions. Drop the just-in-case ones.

What is pi and why use it as the example?

pi is an open-source, provider-agnostic coding agent by earendil-works. It ships four tools, a roughly 150-word system prompt, and an agent loop you can read on one page.

It is the example here because a minimal, transparent harness is the one you can actually study and improve. When something goes wrong, you can see which moving part caused it, which is impossible in a harness that hides its context.

Will a minimal harness underperform a feature-rich one?

Not on the axes that matter. pi holds its own on Terminal-Bench 2.0 with four tools and a tiny prompt, which suggests verbose scaffolding buys less than it charges.

A minimal harness wins on transparency, token cost, control, and verifiability. Those are precisely the properties that let you keep improving it, which a feature-rich, opaque harness takes away.