← lab
post-mortemholds-upagentsinfraorchestration

How four minutes of silence killed my AI agent

An agent finished its work, then the orchestrator killed it before it could save. Liveness was inferred from stdout, so a long silent synthesis read as dead. A 15-line keepalive fixed it.

The agent did everything right. It read the sensors, pulled real data, and wrote a sharp synthesis: three content opportunities with hooks and live market signals. Then, three seconds before it could save that work, the system killed it. The run came back stranded.

Both first instincts were wrong

The write-back failed with a 409, then a 401, then adapter_failed. My first two instincts were both wrong.

  1. Post a comment to keep the run alive. The idea: emit activity so the orchestrator sees a live run. Theater. The stall detector never reads the “last activity” field. I confirmed that in the source before betting on it. The clock that matters is the process heartbeat, and a comment does not touch it.
  2. Blame the flaky data source. A tool had 404’d earlier in the run, so I blamed it. Unrelated. The 404 was a fingerprint block on a datacenter request, a real bug, but not this one.

The failure was a silent chain

Reading the runtime source line by line, the failure was a chain, and every link was invisible on its own:

  • The model spent minutes generating a long synthesis with no tool calls, so the child process emitted no stdout.
  • No stdout meant the adapter forwarded nothing, so the run’s heartbeat aged.
  • A stale heartbeat looks exactly like a dead run, so the orchestrator spawned a duplicate to recover it.
  • The duplicate stole the checkout lock. Now the real run’s write-back was rejected: its actor no longer held the lock. That is the 409, then the 401, then the strand.

The orchestrator could not tell the difference between “this run is dead” and “this run went quiet to think.” It only had one signal, and it read silence as death.

Fifteen lines and one self-inflicted bug

The patch was about fifteen lines: a timer that emits a heartbeat line whenever the process stays quiet past a threshold, reset by any real output, cleared when the child settles. I built a deliberately heavier task to force a long silent synthesis. It ran about ten minutes and closed clean. The old one died at five. The runtime fix went back upstream as a public pull request against the adapter.

On the way, I reloaded the container with docker restart and orphaned the swarm task: zero replicas, the mesh stopped routing, the API served 404s while the app was healthy inside. The right move in swarm is service update --force. Two bugs in one session, both instructive.

Takeaway

Any system that infers “alive” from output will eventually kill a worker that goes quiet to do its real work. Liveness and progress are different signals. If you only have one channel, heartbeat it on a timer that is independent of the work itself, or the system will punish concentration.