Context Engineering Is Just Deciding What the Model Sees
Context engineering is not a new discipline. It is the one job inside your AI agent harness that determines whether the model can actually finish the work.
Context engineering is deciding what the model sees. Everything else is implementation detail.
Every few months the industry mints a new word for a thing that was already true. Context engineering is the latest. The idea is not wrong, and the label is not useless. What is wrong is the way the concept gets taught: as a discipline adjacent to prompt engineering, a softer art, something you do with careful wording and thoughtful structure. That framing misses the part that actually matters.
The context window is a hard budget. Every token that enters it displaces a token you wanted to keep. When the budget runs out, the agent cannot see the beginning of its own plan, loses the error it was chasing, and starts asking for things it already read. That is not a model failure. It is a management failure. Context engineering is the job of managing that budget, and you are almost certainly already failing it.
I have shipped production code for 25 years. In late 2025 I built a 13-app crypto fintech solo with AI agents in 70 days. The one thing I tuned harder than anything else was not the model, not the prompts, and not the tool selection. It was what the agent was allowed to see at any given moment. This is what I learned.
Context engineering is not prompt engineering
People conflate these two things constantly, and the conflation costs them. Prompt engineering is the craft of writing one message well: the right instruction, the right examples, the right framing for a single request. Context engineering operates at the session level. It manages the entire token budget across all turns, deciding what stays, what gets compressed, what gets evicted, and what gets loaded fresh.
Prompt engineering
- Optimizes a single message
- Focuses on instruction clarity
- Scope: one request
- Output: better one-shot result
- Already well-understood
Context engineering
- Manages the full token budget
- Focuses on what the model sees
- Scope: the whole session
- Output: task completion rate
- Where the real gains are
A better prompt in a full window does nothing. The model cannot fit it. You can write the most carefully crafted instruction of your life and the agent will still hallucinate the file structure it already read because the read is buried under twenty thousand tokens of test output it never needed. The budget is the constraint. The prompt is downstream of it.
Where context engineering fits in the harness
Context engineering does not stand alone. It is one of four parts of what harness engineering calls the system around the model: the agent loop, the tool interface, context management, and control. Context management is that third part. It answers one question on every turn: of everything that could go into the window, what actually should?
The four harness layers
- L1: Agent loop
  The orchestration that keeps calling the model until the task is done. It decides when to stop, when to retry, when to hand off. Context engineering does not live here, but the loop determines how many turns burn the budget.
- L2: Tool interface
  Every tool the model can call. The tool schema itself costs tokens before the model types anything. An MCP server with 40 tools loaded can eat 13k to 18k tokens from the schema alone, before a single command runs.
- L3: Context management
  What the model sees on each turn. This is where context engineering lives: compressing tool output, compacting history, loading memory, evicting what no longer matters. The budget decisions happen here.
  budget = window size
  spent = tool output + history + system + schemas
  available = budget - spent
- L4: Control
  The guardrails and checkpoints that keep the agent from running off. Permissions, human-in-the-loop stops, verification before ship. Important, but not the topic today.
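The L3 formula is small enough to write down as code. What follows is a minimal sketch, not part of any real harness: the `ContextBudget` class and its field names are my own, and the `history` and `system` figures are assumed for illustration. The tool output and schema figures reuse the numbers quoted in this article.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token accounting for one turn. All figures are illustrative."""

    window_size: int      # budget: the model's context window, in tokens
    tool_output: int = 0
    history: int = 0
    system: int = 0
    schemas: int = 0

    @property
    def spent(self) -> int:
        # spent = tool output + history + system + schemas
        return self.tool_output + self.history + self.system + self.schemas

    @property
    def available(self) -> int:
        # available = budget - spent (negative means the window overflowed)
        return self.window_size - self.spent


turn = ContextBudget(
    window_size=200_000,
    tool_output=25_000,  # one raw full test run
    history=40_000,      # assumed figure for replayed conversation
    system=8_000,        # assumed boilerplate-heavy system prompt
    schemas=18_000,      # one MCP server at the high end of 13k to 18k
)
print(turn.spent, turn.available)  # 91000 109000
```

Running this accounting on every turn is the cheapest way to see the budget move before the window fills.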
Naming it this way matters. If you call everything beyond one-shot prompting “context engineering,” you lose the distinction between curating the window (context management) and designing the system that manages the window (harness engineering). Both are real, and they are not the same job.
The budget you are already overspending
Here is the uncomfortable part. You do not have to do anything unusual to blow the context window. The default behavior of any AI coding agent blows it for you, automatically, on almost every non-trivial task.
There are four leaks. Each one is real, each one has a measurable token cost, and almost every harness ignores at least two of them.
The four context leaks
- 01: Raw tool output
  Every command the agent runs dumps its full output into the window. A git status in a busy repo is around 2,000 tokens. A full test run is around 25,000. The agent needed the three failing tests, not the twenty-two passing ones.
  git status: ~2,000 tokens
  full test run: ~25,000 tokens
  git log --all: can exceed 10,000 tokens
- 02: Repeated history
  Every turn the entire conversation replays. The file the agent read on turn three is still present on turn fifteen. The error it printed on turn four is printed again on turn seven. The model pays to reread all of it, every single time.
  same file x 5 reads = 5x the token cost
  error printed x 3 = 3x the token cost
- 03: Bloated system prompt
  A long system prompt costs tokens on every turn because it rides at the top of every context. pi, the minimal coding agent, runs a system prompt of around 150 words. Many commercial harnesses run ten times that, filled with instructions the model already knows from training.
- 04: Unused MCP schemas
  Tool schemas load into the window whether the model calls the tools or not. A popular MCP server can consume 13,000 to 18,000 tokens in schema before a single command runs. Load only the tools you actually need for this session.
  one MCP server: 13k to 18k tokens
  two MCP servers: 26k to 36k tokens
  both loaded regardless of usage
The token cost breakdown has the full accounting. The point here is narrower: context engineering starts with knowing which of these four leaks is eating your budget. You cannot fix what you have not named.
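Naming the leak can be as simple as sorting four numbers. Here is a rough sketch of that audit. The `rank_leaks` function and the category keys are hypothetical, and the `repeated_history` and `system_prompt` figures are assumed. In a real harness you would feed it measured token counts from your own sessions.

```python
def rank_leaks(usage: dict[str, int], window_size: int) -> list[tuple[str, int, float]]:
    """Return (leak, tokens, share_of_window), largest cost first."""
    ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    return [(name, tokens, tokens / window_size) for name, tokens in ranked]


usage = {
    "raw_tool_output": 25_000 + 2_000,  # one full test run plus one git status
    "repeated_history": 30_000,         # assumed figure
    "system_prompt": 5_000,             # assumed figure
    "mcp_schemas": 18_000,              # one server at the high end
}

for name, tokens, share in rank_leaks(usage, window_size=200_000):
    print(f"{name:<18} {tokens:>7,} tokens  {share:.1%}")
```

Whatever lands on the first line of that report is the leak to fix first.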
The tactics, tied to each leak
Knowing the leaks is not enough. Here is what to do about each one. These are not theoretical options. They are the actual moves, in order of return.
Compress tool output at the source. The highest-return change to any harness is a filter on tool output. RTK is an open-source CLI proxy that sits between the shell and the model. The agent runs a command, RTK rewrites the output before it reaches the window. Smart filtering, grouping, deduplication. On everyday dev commands it saves 60 to 90 percent. A shell hook makes it invisible to the agent. This is the one to steal today.
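To make the idea concrete, here is a toy filter for test output. This is not how RTK works internally, and I am not claiming it is. It is a minimal sketch of the same principle, assuming pytest-style verbose lines that end in `PASSED` or `FAILED`. The `compress_test_output` name is mine.

```python
def compress_test_output(raw: str) -> str:
    """Keep failure lines verbatim; collapse passing lines into one count."""
    kept: list[str] = []
    passed = 0
    for line in raw.splitlines():
        if " PASSED" in line:
            passed += 1        # the agent rarely needs each passing test
        elif line.strip():
            kept.append(line)  # failures, errors, tracebacks stay intact
    if passed:
        kept.append(f"({passed} passing tests omitted)")
    return "\n".join(kept)


# Assumed sample: 22 passing tests and 3 failing ones.
raw = "\n".join(
    [f"tests/test_api.py::test_case_{i} PASSED" for i in range(22)]
    + [
        "tests/test_auth.py::test_login FAILED",
        "tests/test_auth.py::test_refresh FAILED",
        "tests/test_auth.py::test_logout FAILED",
    ]
)
compressed = compress_test_output(raw)
print(compressed)
print(f"{len(raw)} chars -> {len(compressed)} chars")
```

The filter never touches a failure line. That is the design choice that matters: drop the noise, keep the signal whole.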
Compact history between turns. The conversation layer is the second fix. Between turns, deduplicate repeated file reads, collapse successful steps to a single-line summary, keep recent turns intact and summarize the old ones. The model sees the delta instead of the transcript again. This is not a novel idea. It is the same filter as the tool layer, applied one level up.
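A sketch of that compaction pass, under assumptions of my own: the `Turn` record, its `kind` labels, and the `compact` function are hypothetical, and a real harness would summarize old turns with more care than taking the first line.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    kind: str        # "file_read", "tool", or "message" (hypothetical labels)
    content: str
    path: str = ""   # set only for file reads
    ok: bool = True  # False when the step failed


def compact(history: list[Turn], keep_recent: int = 2) -> list[Turn]:
    """Compact older turns; leave the last `keep_recent` turns untouched."""
    cut = max(len(history) - keep_recent, 0)
    old, recent = history[:cut], history[cut:]

    # Index of the newest read of each file, across the whole history.
    last_read = {t.path: i for i, t in enumerate(history) if t.kind == "file_read"}

    compacted: list[Turn] = []
    for i, turn in enumerate(old):
        if turn.kind == "file_read" and last_read[turn.path] != i:
            continue  # a newer read of this file exists; drop the stale copy
        if turn.kind == "tool" and turn.ok:
            first_line = turn.content.splitlines()[0] if turn.content else ""
            compacted.append(Turn("tool", f"[ok] {first_line}"))
            continue
        compacted.append(turn)  # failures and messages are kept verbatim
    return compacted + recent


history = [
    Turn("file_read", "def login(): ...", path="auth.py"),
    Turn("tool", "ran migrations\napplied 0001\napplied 0002"),
    Turn("tool", "FAILED test_login\nAssertionError", ok=False),
    Turn("file_read", "def login(): ...", path="auth.py"),
    Turn("message", "Fix the assertion in test_login."),
]
for turn in compact(history):
    print(turn.kind, "|", turn.content.splitlines()[0])
```

Note what survives: the failed step stays verbatim, and the recent turns are never rewritten. Only old successes and stale reads get squeezed.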
Keep the system prompt short and specific. pi’s minimal approach is instructive. Around 150 words, nothing the model does not need to know. The theory: a frontier model already understands software engineering, version control, and careful iteration. You do not need to explain them. You need to tell it what is specific to your project. Every line of boilerplate in the system prompt is a tax on every single turn.
Load only the tools this session needs. Every MCP server you load at startup costs tokens whether you use it or not. Session-scoped tool loading is the fix: start with the minimal set, add tools when the agent actually needs them. This is a harness design choice, not a model choice.
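Session-scoped loading can be sketched in a few lines. The `ToolRegistry` class below is hypothetical and does not talk to any real MCP server. The schema sizes in `catalog` are assumed numbers, chosen only to show the accounting.

```python
class ToolRegistry:
    """Hypothetical registry: a schema costs tokens only once it is loaded."""

    def __init__(self, catalog: dict[str, int]) -> None:
        self.catalog = catalog  # tool name -> schema size in tokens
        self.loaded: set[str] = set()

    def load(self, name: str) -> None:
        if name not in self.catalog:
            raise KeyError(f"unknown tool: {name}")
        self.loaded.add(name)

    @property
    def schema_tokens(self) -> int:
        return sum(self.catalog[name] for name in self.loaded)


# Assumed schema sizes, for illustration only.
catalog = {"read_file": 300, "run_shell": 400, "browser": 6_000, "database": 9_000}

registry = ToolRegistry(catalog)
for name in ("read_file", "run_shell"):  # the minimal starting set
    registry.load(name)
print(registry.schema_tokens)  # 700

registry.load("database")      # added only when the agent needs it
print(registry.schema_tokens)  # 9700
print(sum(catalog.values()))   # 15700, the cost of loading everything at startup
```

The browser schema never enters the window in this session, because nothing asked for it.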
Context engineering moves, in order
- Required: Compress tool output with a CLI proxy. RTK at github.com/rtk-ai/rtk. Open source, shell hook makes it transparent. Cuts 60 to 90 percent on everyday dev commands. Start here.
- Required: Compact conversation history between turns. Deduplicate reads, summarize old turns, keep the recent turns intact. The model should read the delta, not the replay.
- Required: Audit the system prompt for boilerplate. Cut everything the model already knows from training. What remains should be project-specific: the conventions, the decisions, the things training cannot know.
- Optional: Load tools session-scoped, not at startup. Only the tools this session actually needs. Add tools when the agent asks, not before. Every schema that loads costs tokens whether the model calls it or not.
- Optional: Inject memory at session start instead of re-explaining. Project decisions and conventions in an AGENTS.md file. The harness loads it automatically. The agent starts knowing instead of asking.
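The last move on that list, memory injection, is mostly file plumbing. A minimal sketch, assuming the harness looks for AGENTS.md at the project root and appends it to the system prompt. The `build_system_prompt` function and the section heading it writes are my own. How a given harness actually discovers and places the file may differ.

```python
from pathlib import Path


def build_system_prompt(base_prompt: str, project_root: Path) -> str:
    """Append AGENTS.md to the base prompt when the file exists."""
    memory_file = project_root / "AGENTS.md"
    if not memory_file.is_file():
        return base_prompt  # no memory file: nothing extra enters the window
    memory = memory_file.read_text(encoding="utf-8").strip()
    return f"{base_prompt}\n\n# Project memory (AGENTS.md)\n{memory}"


# Usage: the current directory stands in for the project root.
prompt = build_system_prompt("You are a coding agent.", Path("."))
print(len(prompt.split()), "words in the system prompt")
```

The memory file rides in the system prompt, so it pays the same per-turn tax as any other line there. Keep it as tight as the prompt itself.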
The table that makes it concrete
Here is what the same coding session looks like with a naive harness versus a context-engineered one. Same model, same task, same tools.
Naive vs. engineered context
| Budget item | Naive harness | Engineered harness |
|---|---|---|
| System prompt | 5,000 to 10,000 tokens of boilerplate | 150 words, project-specific only |
| Tool schemas loaded | all servers, 15k to 30k tokens at startup | session-scoped, 2k to 4k tokens |
| Tool output per call | raw dump, 2k to 25k per command | compressed, 60 to 90% off |
| Conversation history | full replay every turn | delta only, old turns summarized |
| Session memory | re-explain everything on session start | auto-loaded from AGENTS.md |
| Outcome | window exhausted mid-task | budget available for the actual work |
The numbers on the left are not hypothetical. They are what you get with the defaults. You pay for a 200k-token window, and at the top of those ranges (10k of system prompt, 30k of schemas, one 25k test run) you burn about a third of it before the agent types the first useful line.
Context engineering is a means, not a cult
The discourse around context engineering has developed a certain flavor. People write about it as though a well-managed window is the goal. It is not. The goal is finishing the task. The metric is tokens per finished task, not a tidy budget.
This matters because over-curation is also a failure mode. Compress the tool output too aggressively and the model misses the error signal it needed. Summarize the history too early and it loses the plan it was executing. Evict the file it is currently editing and it rewrites it blind. Context engineering is a calibration problem, not a minimization problem. The question is not “how small can I make the window” but “does the model have what it needs to take the next step?”
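One way to encode that calibration is to pin what the next step depends on and evict around it. This is a sketch under my own assumptions: the `Item` record, the `pinned` flag, and the oldest-first policy are illustrative choices, not a description of any particular harness.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    tokens: int
    pinned: bool = False  # e.g. the current plan, the file being edited, the live error


def evict_to_fit(items: list[Item], budget: int) -> list[Item]:
    """Drop the oldest unpinned items until the total fits the budget.

    `items` is ordered oldest first. Pinned items are never evicted, so the
    result can still exceed the budget if pinned content alone is too large.
    """
    total = sum(item.tokens for item in items)
    kept: list[Item] = []
    for item in items:
        if total > budget and not item.pinned:
            total -= item.tokens  # evict: the oldest unpinned item goes first
        else:
            kept.append(item)
    return kept


window = [
    Item("plan", 500, pinned=True),
    Item("old test run", 25_000),
    Item("git status", 2_000),
    Item("auth.py (being edited)", 3_000, pinned=True),
    Item("latest error", 800, pinned=True),
]
for item in evict_to_fit(window, budget=8_000):
    print(item.name, item.tokens)
```

Eviction stops as soon as the window fits. The git status survives here because dropping the old test run was already enough.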
The frame I find useful: context engineering is budget management for a fixed resource. You do not spend zero on groceries because spending zero is virtuous. You spend what you need and avoid spending on things that do not feed you. The session that finishes the task on turn eight with a half-full window is a better result than the session that hits turn thirty with a perfectly compact history and no task done.
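The comparison in that last sentence can be written as a metric. A small sketch with assumed numbers: the `Session` record and the token figures are invented for illustration, and an unfinished session is scored as infinite cost because it bought nothing.

```python
from dataclasses import dataclass


@dataclass
class Session:
    turns: int
    tokens_used: int
    finished: bool


def tokens_per_finished_task(sessions: list[Session]) -> float:
    """Total tokens spent divided by the number of tasks actually completed."""
    total = sum(s.tokens_used for s in sessions)
    done = sum(1 for s in sessions if s.finished)
    return total / done if done else float("inf")


fast = Session(turns=8, tokens_used=100_000, finished=True)   # half-full window, done
tidy = Session(turns=30, tokens_used=60_000, finished=False)  # compact history, not done

print(tokens_per_finished_task([fast]))        # 100000.0
print(tokens_per_finished_task([tidy]))        # inf
print(tokens_per_finished_task([fast, tidy]))  # 160000.0
```

The tidy session used fewer tokens and still scores worse, which is the whole point of measuring per finished task.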
The harness engineering guide is where this connects to the broader system. Context management is one of four parts. It is not the most important part in isolation. It is the part that determines whether the other three parts can function across a real session on a real codebase.
Context engineering is deciding what the model sees. Get that decision right and the model does what it can do. Get it wrong and you are paying for intelligence you will never use.
Common questions
What is context engineering in AI coding agents?
Context engineering is the practice of deciding which tokens go into the model's context window on each turn of an agent session. It is not about writing better prompts. It is about managing the hard budget of the context window: compressing tool output, compacting history, keeping the system prompt tight, and loading only the tools the session actually needs. The goal is that the model has what it needs to take the next step without wasting tokens on noise it will never use.
How is context engineering different from prompt engineering?
Prompt engineering optimizes a single message: the right instruction, the right examples, the right framing for one request. It operates at the scope of one call.
Context engineering operates at the session level. It manages the entire token budget across all turns: what stays in the window, what gets compressed, what gets evicted, and what gets loaded fresh from memory. A better prompt in a full window does nothing. The budget is the constraint, and prompt engineering is downstream of it.
What are the biggest context window leaks in AI coding agents?
There are four main leaks. The loudest is raw tool output: a git status is around 2,000 tokens and a full test run is around 25,000. The agent needed the failing tests, not the passing ones.
The second is repeated history: the full conversation replays on every turn, so a file read on turn three is still present on turn fifteen, paying its token cost again every time.
The third is a bloated system prompt: anything the model already knows from training is a tax on every single turn. Keep it to what is project-specific.
The fourth is unused MCP schemas: every tool server loaded at startup costs tokens whether the model calls it or not. One popular MCP server can eat 13,000 to 18,000 tokens before a single command runs.
What is the fastest way to reduce context window usage in a coding agent?
Compress tool output at the source. RTK (github.com/rtk-ai/rtk) is an open-source CLI proxy that sits between the shell and the model. The agent runs a command, RTK rewrites the output to a compressed form before it reaches the context window. On everyday dev commands it saves 60 to 90 percent. A shell hook makes it completely transparent to the agent. This is the highest-return change and the easiest to wire in.
How does MCP tool schema cost affect the context window?
Every MCP server you load injects its tool schema into the context window at session start, before the agent types anything. A single popular MCP server can consume 13,000 to 18,000 tokens in schema. Load two and you have spent 26,000 to 36,000 tokens before a single command runs. The fix is session-scoped tool loading: start with the minimal set of tools the session actually needs and add more only when the agent asks. Loading all available tools at startup is a common default and a consistent waste.
Is context engineering more important than choosing a better model?
For most real coding sessions, yes. A better model in a full window still fails the task when the relevant file was evicted three turns ago. Context engineering determines whether the model has the information it needs to act. Once you have the context layer working, a model upgrade compounds cleanly on top. Before you fix the context layer, a model upgrade mostly pays for the same waste at a higher price.