How to Cut Your AI Coding Agent's Token Cost

You watch the dollar meter on your subscription. You never watch where the tokens actually die. They die in three places, and you can plug all three without touching the model.

Everyone obsesses over the price per million tokens. It is the wrong number to stare at. The model bills you for every token it reads, and most of the tokens it reads are garbage you would never have pasted in yourself: two hundred passing tests scrolling by to hide the three that failed, the same file read five times in one session, the tech stack re-explained at the top of every new conversation.

That waste is not the model’s fault. It is the harness handing the model a firehose and calling it context. This is a companion to harness engineering, zoomed all the way in on one axis: the token bill. Here is where the tokens go, and the stack I run to stop the bleeding.

The tokens leak in three places

Watch a real session and the waste sorts itself into three layers. Each one leaks a different way, so each one needs a different fix. Bolt on a bigger model and you pay more for the same leaks. Fix the layers and every session gets cheaper at once.

Three leaks, three layers

L1
Tool output, the loudest leak
Every command the agent runs dumps its full output into the window. A git status in a busy repo is two thousand tokens. A full test run is twenty-five thousand. The model needed the three failures, not the two hundred passes.
```
git status  → 2,000 tokens
test run    → 25,000 tokens
```
+
L2
Repeated history, the quiet leak
Across turns the transcript repeats itself. The same file read five times. The same error printed three times. The plan restated in every step. The model re-reads all of it on every single turn, and you pay for the re-read every time.
```
same file × 5 reads
same error × 3 prints
```
+
L3
Re-explanation, the invisible leak
Every new session starts from zero. You re-explain the stack, the conventions, the decisions you already made last week. It is the cheapest leak to ignore and the most expensive over a month, because you pay it with your own time too.
```
session start → re-explain everything
next session  → do it again
```

Loud, quiet, invisible. The loud one is easiest to fix and gives the most back. Start there.

Fix the loud leak first: compress tool output

The highest-return change to any harness is a filter on tool output, because that is where the tokens hemorrhage and nobody looks. The output scrolls past in the terminal, looks normal, and quietly eats a third of your window.

The tool I run for this is RTK, an open-source CLI proxy. It sits between the shell and the model. The agent runs a command, RTK rewrites the output to a compressed equivalent before it reaches the context window, and the model sees signal instead of noise. Smart filtering, grouping, deduplication, truncation. On everyday dev commands it saves 60 to 90 percent, and a shell hook makes it invisible: the agent types git status, the hook rewrites it to rtk git status, and nobody upstream has to change a thing.

What the filter gives back

60-90%
fewer tokens: 2k → ~0
git status: 25k
test-run tokens: 0
workflow changes

No model change, no prompt change. A filter at the one boundary where the tokens actually leak.

Fix the quiet leak: compress the conversation

Tool output is one call. The conversation is the whole history, resent on every turn. When the agent reads the same file five times, all five copies ride along in the context on turn six. When an error printed three times, the model rereads all three. The window fills with its own echoes.

The fix is a compression pass between turns: deduplicate the reads, collapse successful steps to a single line, keep the recent turns intact and summarize the old ones. The model sees the delta, not the transcript again. This is the layer I am still building, so I will not hand you a fake benchmark. The mechanism is the same as the tool layer, one level up: put the filter where the repetition is, and stop paying to reread what did not change.

Same session, two harnesses

Four leaks, four filters. Not one of them is a smarter model. All of them are plumbing.
Where tokens go	Naive harness	Filtered harness
Tool output	raw dump, 2k to 25k a call	compressed at the source, 60 to 90% off
Repeated reads	same file resent every turn	deduplicated, one copy kept
Old turns	full transcript, every turn	summarized, recent turns intact
Session start	re-explain the project by hand	context auto-loaded from memory

Four leaks, four filters. Not one of them is a smarter model. All of them are plumbing.

Fix the invisible leak: give the harness a memory

The third leak costs you twice, in tokens and in your own time. Every fresh session you re-type the stack, the conventions, the decisions. The model starts blank because the harness let it.

The fix is persistent context loaded at session start: the decisions and conventions live in files, and the harness injects the relevant ones before the first message. The agent starts knowing instead of asking. In practice this is an AGENTS.md the harness reads automatically, plus a memory store for the things too big to keep in one file. I wrote the file-level version up in the AGENTS.md guide. The point here is narrower: re-explanation is a token leak like any other, and the fix is to stop doing by hand what a file can do once.

The stack, and the honest status

Three leaks, three layers. Here is what I actually run, and what is still on the bench. I will not sell you the parts I have not shipped.

The three-layer token stack

Required:
RTK at the tool layer, shipped and load-bearingOpen-source CLI proxy, compresses command output 60 to 90 percent. I run it every day. This is the one to steal today.
Optional:
Compression at the conversation layer, buildingDeduplicate reads and summarize old turns between prompts. The mechanism is proven at the tool layer; the conversation-level version is what I am wiring up now.
Optional:
Memory at the knowledge layer, buildingAuto-load project decisions and conventions at session start. AGENTS.md covers the file-level case today; the indexed store is next.
Optional:
One local proxy in front of any agentThe direction: fold all three into a single local-first proxy that sits between any agent and the model, so the stack travels across pi, Claude Code, or anything that speaks an OpenAI-compatible endpoint.

The plan is not clever, and that is the point. A proxy that speaks the same OpenAI-compatible protocol every agent already speaks, doing three boring jobs before the request reaches the model.

What the proxy does to one request

Input

An agent about to send a turn to the model

MEMORYInject what matters
Pull the relevant project decisions and conventions from the store and put them in the system prompt, so the model starts the session already knowing the codebase.
COMPRESSCut the repetition
Deduplicate repeated reads and summarize old turns, so the model pays for the delta instead of rereading the whole transcript.
FORWARDSend the lean payload
Route the trimmed request to the provider. RTK already cut the tool output upstream at the shell, so what arrives is signal.

Output

The same answer, from a fraction of the tokens, on every agent you use

The number to actually watch

Stop watching the price per million tokens. You do not control it, and it drops on its own every few months anyway. Watch the tokens per finished task. That number is pure harness, and it is the one you can cut in half this week.

One more lever lives here: route a cheap model for exploration and a strong one only for the build. I automate that per phase in pi with a small model-handoff package that switches model and reasoning effort when a skill loads, covered in the pi guide. The exploratory turns stop billing you at premium rates for work that never needed them.

Put a filter on your tool output and you have done the highest-return thing available to you. Everything after that is the same move at a different layer: find where the tokens repeat, and stop paying to read them twice.

Token cost, quick answers

What is the fastest way to cut my AI coding agent's token cost?

Compress tool output at the shell before it reaches the model. A git status can be two thousand tokens and a test run twenty-five thousand, and the model needed almost none of it.

A CLI proxy like RTK does this and saves 60 to 90 percent on everyday commands with no change to your workflow. It is the single highest-return change to a harness.

Does a bigger context window solve token waste?

No. A bigger window makes waste affordable, not absent. The leak is still there, you just hit the limit later and pay more per turn until you do.

Fix the leak at the source instead: filter tool output, deduplicate repeated history, and load project memory so you stop re-explaining. Then a normal window is plenty.

What is RTK?

RTK is an open-source CLI proxy that intercepts common shell commands and rewrites their output to a compressed equivalent before it reaches the model's context window, cutting 60 to 90 percent of the tokens on everyday dev commands. A shell hook makes it invisible to the agent.

Where do the tokens actually go in an agent session?

Three places. Tool output that dumps the full firehose when the model needed a line of it. Repeated history that resends the same reads and errors on every turn. And re-explanation, where each new session starts blank and you retype the project context.

Each leak has its own fix: compress the output, compress the conversation, and give the harness a memory.