phantom_loop: agent-maintained running tallies drift over long runs #6
Description
Observation
In a long multi-tick phantom_loop run (28 iterations over ~3.5 hours), the agent's self-reported running tally at the end of the loop drifted from the authoritative count.
- Self-reported (written by the agent into the progress log on the final tick): 88 total items
- Authoritative (grep of the structured blocks in state.md after finalize): 91 total items
- Drift: 3 items, ~3.3%, over 28 ticks
The authoritative data in the state file was correct. The drift is purely in the agent's self-maintained summary counter that lives in the progress log.
Why it happens
Each tick, the agent was instructed to:
- Append one or more structured blocks to ## Findings
- Append a one-line log entry to ## Progress with a per-tick tally and update a running total
The agent maintains the running total by reading its previous log line and incrementing it. Over 28 ticks, especially the ones where the queue self-expanded mid-run (one item split into three sub-items), the agent eyeballed the total instead of re-counting from the source of truth, and the error compounded.
This is not a phantom_loop correctness bug. The runner, store, state file, and tick scheduling all behaved correctly. The drift lives entirely in agent behavior inside the prompt contract the operator wrote.
Why it's worth filing anyway
The phantom_loop tool description is the main place operators learn how to structure a per-tick contract. Today it says nothing about the trade-off between:
- Self-reported rolling tallies: cheap per tick, drift-prone over long runs
- Deterministic re-counting: one grep/wc -l at tick start gives the ground truth, costs ~50ms
A short guidance note in the tool description would steer operators toward the deterministic pattern by default and save future loops from this class of drift.
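As a rough illustration of the deterministic pattern, the sketch below re-counts block headers from the state file text on every tick and builds the progress line from that count. The header format (### Finding: ...), the function names, and the progress-line wording are all assumptions made for the example. The issue does not specify a block format, and in practice the agent would more likely run grep -c against state.md.

```typescript
// Assumption: each structured block starts with a header line such as
// "### Finding: ..." (hypothetical format, not taken from phantom_loop).
const BLOCK_HEADER = /^### Finding\b/;

// Count block headers by scanning the whole state text, like `grep -c`.
function countBlocks(stateText: string, header: RegExp = BLOCK_HEADER): number {
  return stateText.split("\n").filter((line) => header.test(line)).length;
}

// Build the progress line from the re-counted total.
// `countAtTickStart` is itself a re-count taken before the tick's appends,
// so nothing is ever carried over from a previous log line.
function progressLine(tick: number, countAtTickStart: number, stateText: string): string {
  const total = countBlocks(stateText);
  return `Tick ${tick}: +${total - countAtTickStart} items, ${total} total`;
}

// Example state after a tick in which one queued item was split into sub-items.
const exampleState = [
  "## Findings",
  "### Finding: a",
  "details",
  "### Finding: b",
  "### Finding: c",
  "## Progress",
].join("\n");

console.log(progressLine(2, 1, exampleState)); // Tick 2: +2 items, 3 total
```

Because the total is derived from the file each time, a mid-run queue split changes the count without any bookkeeping on the agent's side.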
Fix options
1. Tool description guidance (smallest)
Add a paragraph to the phantom_loop tool description under the start action, roughly:
When your per-tick contract maintains running tallies or counters (e.g. "N items done, M findings"), derive them deterministically from the state file on each tick rather than incrementing a prior value. Grepping or counting structured blocks is ~50ms and cannot drift. Self-reported incrementing counters drift over long runs, especially when the work queue mutates mid-loop.
Zero code change, pure documentation.
2. State file helper in the loop runner (medium)
Expose a tiny helper (perhaps tally_state_file) as an in-process MCP tool the agent can call per tick: given a glob/regex over the state file, return a count. Makes the deterministic path the path of least resistance.
Cons: new tool surface area, and it's a thin wrapper over grep -c/wc -l the agent already has.
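For scale, a minimal sketch of what such a helper's core might look like is shown below. It is hypothetical: the signature and result shape are assumptions, it takes the state file contents as a string so the example runs without touching the filesystem, and it leaves out MCP tool registration entirely because that API is not described here.

```typescript
// Hypothetical result shape for a tally_state_file-style helper.
type TallyResult =
  | { ok: true; count: number }
  | { ok: false; error: string };

// Count the lines of the state text that match a caller-supplied regex.
// An invalid pattern is reported as an error value instead of throwing,
// so a bad pattern from the agent does not abort the tick.
function tallyStateFile(stateText: string, pattern: string): TallyResult {
  let re: RegExp;
  try {
    re = new RegExp(pattern);
  } catch (err) {
    return { ok: false, error: `invalid pattern: ${(err as Error).message}` };
  }
  const count = stateText.split("\n").filter((line) => re.test(line)).length;
  return { ok: true, count };
}

console.log(tallyStateFile("### A\nbody\n### B", "^### "));
```

The body is a few lines around a line filter, which supports the point that the helper adds little beyond what grep -c already provides.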
3. Post-run self-audit (largest)
On loop finalize, if the agent has been writing log lines in a recognizable "Tick N: ... K items, M findings" format, compare the latest self-reported tally against a regex count of the structured blocks and surface the drift in the finalize notice.
Cons: fragile (requires matching the operator's log format), and it's better to prevent the drift than measure it after the fact.
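A sketch of that comparison, assuming the log format quoted above, might look like the following. The regex and function names are illustrative only, and the regex is exactly the fragile part: it only works if the operator's progress lines actually match it.

```typescript
// Assumption: progress lines follow "Tick N: ... K items, M findings".
const TALLY_LINE = /^Tick \d+:.*?(\d+) items/;

// Return the item total from the last progress line that matches, or null.
function lastReportedTotal(progressText: string): number | null {
  const lines = progressText.split("\n");
  for (let i = lines.length - 1; i >= 0; i--) {
    const m = TALLY_LINE.exec(lines[i] ?? "");
    if (m && m[1] !== undefined) return Number(m[1]);
  }
  return null;
}

// Compare the self-reported total with the authoritative count.
function auditDrift(reported: number, counted: number) {
  const drift = counted - reported;
  const driftPct = counted === 0 ? 0 : (drift / counted) * 100;
  return { reported, counted, drift, driftPct };
}

// The numbers from this issue: 88 self-reported, 91 counted.
const audit = auditDrift(88, 91);
console.log(`drift: ${audit.drift} items (${audit.driftPct.toFixed(1)}%)`);
```

If no line matches, lastReportedTotal returns null and the audit would have to be skipped silently, which is another reason this option is hard to make dependable.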
Recommendation
Option 1 only. This is a prompt-engineering / operator-guidance issue, not a runtime bug. A short paragraph added to the phantom_loop tool description at src/loop/tool.ts (under the start action bullet list) is proportionate to the problem.
Options 2 and 3 are disproportionate and add surface area for a class of error that responsible per-tick contracts don't have in the first place.
Reproduction
- Start a multi-tick loop (e.g. 20+ iterations) where each tick appends one or more structured blocks to a section in state.md and also appends a one-line summary to a separate log section that includes a running total.
- Include at least one case where the work queue self-expands mid-run (a queued item is split into sub-items).
- After the loop reaches done, compare:
  - The running total from the last progress log line
  - A grep count of the structured block headers in state.md
- Expected: the two numbers match. Observed: a small drift (single-digit percent), with the grep count being the higher of the two.
Verification after fix
- Apply the Option 1 doc change to src/loop/tool.ts.
- bun run typecheck and bun test src/loop stay green (no code path changed).
- Start a new loop using the updated tool description and a per-tick contract that follows the guidance (deterministic re-count on each tick).
- At finalize, the agent-reported tally should match the grep count exactly.