phantom_loop: agent-maintained running tallies drift over long runs

## Observation

In a long multi-tick `phantom_loop` run (28 iterations over ~3.5 hours), the agent's self-reported running tally at the end of the loop drifted from the authoritative count.

- **Self-reported** (written by the agent into the progress log on the final tick): `88 total items`
- **Authoritative** (grep of the structured blocks in `state.md` after finalize): `91 total items`
- Drift: **3 items, ~3.3%**, over 28 ticks

The authoritative data in the state file was correct. The drift is purely in the agent's self-maintained summary counter that lives in the progress log.

## Why it happens

Each tick, the agent was instructed to:
1. Append one or more structured blocks to `## Findings`
2. Append a one-line log entry to `## Progress` with a per-tick tally and update a running total

The agent maintains the running total by reading its previous log line and incrementing it. Over 28 ticks, and especially on the ticks where the queue self-expanded mid-run (one item split into three sub-items), the agent estimated the increment instead of re-counting against the source of truth, and the errors accumulated.

This is not a `phantom_loop` correctness bug. The runner, store, state file, and tick scheduling all behaved correctly. The drift lives entirely in agent behavior inside the prompt contract the operator wrote.

## Why it's worth filing anyway

The `phantom_loop` tool description is the main place operators learn how to structure a per-tick contract. Today it says nothing about the trade-off between:

- **Self-reported rolling tallies** — cheap per tick, drift-prone over long runs
- **Deterministic re-counting** — one grep/`wc -l` at tick start gives the ground truth, costs ~50ms

A short guidance note in the tool description would steer operators toward the deterministic pattern by default and save future loops from this class of drift.
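As an illustration of the deterministic pattern, here is a minimal sketch that derives the tally from the state file contents on every tick instead of incrementing a prior value. It assumes each finding is a `### ` header inside the `## Findings` section; that layout and the helper names (`sectionBody`, `countFindings`, `progressLine`) are hypothetical and not part of `phantom_loop`.

```typescript
// Hypothetical layout: each finding is a "### " header inside "## Findings".
// Neither the header level nor these helper names come from phantom_loop.

/** Return the body of a level-2 section, or "" if the section is absent. */
function sectionBody(markdown: string, heading: string): string {
  const lines = markdown.split("\n");
  const start = lines.findIndex((line) => line.trim() === `## ${heading}`);
  if (start === -1) return "";
  const body: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith("## ")) break; // the next level-2 section ends this one
    body.push(line);
  }
  return body.join("\n");
}

/** Count finding blocks by re-reading the state, never by incrementing. */
function countFindings(stateMarkdown: string): number {
  return sectionBody(stateMarkdown, "Findings")
    .split("\n")
    .filter((line) => line.startsWith("### ")).length;
}

/** Build a progress line whose total comes from the recount. */
function progressLine(tick: number, countAtTickStart: number, stateMarkdown: string): string {
  const total = countFindings(stateMarkdown);
  return `Tick ${tick}: +${total - countAtTickStart} items, ${total} total items`;
}
```

Because both the per-tick delta and the total come from a recount, a queue item that splits into sub-items mid-run is reflected in the next tick's numbers without the agent having to track it.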

## Fix options

### 1. Tool description guidance (smallest)

Add a paragraph to the `phantom_loop` tool description under the `start` action, roughly:

> When your per-tick contract maintains running tallies or counters (e.g. "N items done, M findings"), derive them deterministically from the state file on each tick rather than incrementing a prior value. Grepping or counting structured blocks is ~50ms and cannot drift. Self-reported incrementing counters drift over long runs, especially when the work queue mutates mid-loop.

Zero code change, pure documentation.

### 2. State file helper in the loop runner (medium)

Expose a tiny helper (perhaps `tally_state_file`) as an in-process MCP tool the agent can call per tick: given a glob/regex over the state file, return a count. Makes the deterministic path the path of least resistance.

Cons: new tool surface area, and it's a thin wrapper over `grep -c`/`wc -l` the agent already has.
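For a sense of scale, the core of such a helper could be as small as the sketch below. The name `tallyStateFile` mirrors the `tally_state_file` idea above; its signature is an assumption, and reading the file and registering the function as an MCP tool are left out.

```typescript
// Sketch only: the signature is an assumption, not an existing phantom_loop API.

/** Count non-overlapping matches of `pattern` in the state file contents. */
function tallyStateFile(contents: string, pattern: RegExp): number {
  // Force the global flag so every match is counted. Building a new RegExp
  // also leaves the caller's regex (and its lastIndex) untouched.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const globalPattern = new RegExp(pattern.source, flags);
  return (contents.match(globalPattern) ?? []).length;
}
```

With a line-anchored pattern such as `/^### /m`, the result is roughly what `grep -c '^### '` would report for the same file, which underlines how thin the wrapper is.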

### 3. Post-run self-audit (largest)

On loop finalize, if the agent has been writing log lines in a recognizable "Tick N: ... K items, M findings" format, compare the latest self-reported tally against a regex count of the structured blocks and surface the drift in the finalize notice.

Cons: fragile (requires matching the operator's log format), and it's better to prevent the drift than measure it after the fact.
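A rough sketch of that comparison follows. It assumes the progress log contains lines ending in `<K> total items`, as in the observation above; the `auditTally` name and the `DriftReport` shape are hypothetical, and a real implementation would need to cope with whatever format the operator chose.

```typescript
// Assumed log shape: lines containing "<K> total items".
// Real operator formats will differ, which is the fragility noted above.

interface DriftReport {
  reported: number;     // last self-reported total found in the log
  actual: number;       // authoritative count, e.g. from a regex over state.md
  drift: number;        // actual - reported
  driftPercent: number; // |drift| relative to the actual count
}

/** Compare the last self-reported total with the authoritative count. */
function auditTally(progressLog: string, actual: number): DriftReport | null {
  const hits = progressLog.match(/\d+ total items/g);
  if (hits === null) return null; // log format not recognized; nothing to audit
  const reported = parseInt(hits[hits.length - 1], 10);
  const drift = actual - reported;
  // Report 0% when there is no authoritative count to compare against.
  const driftPercent = actual === 0 ? 0 : (Math.abs(drift) / actual) * 100;
  return { reported, actual, drift, driftPercent };
}
```

Fed the figures from the observation (a final log line reporting 88 and an authoritative count of 91), this yields a drift of 3 and a `driftPercent` of about 3.3.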

## Recommendation

**Option 1 only.** This is a prompt-engineering / operator-guidance issue, not a runtime bug. A short paragraph added to the `phantom_loop` tool description at `src/loop/tool.ts` (under the `start` action bullet list) is proportionate to the problem.

Options 2 and 3 are disproportionate: they add surface area for a class of error that a per-tick contract following the guidance would not have in the first place.

## Reproduction

1. Start a multi-tick loop (e.g. 20+ iterations) where each tick appends one or more structured blocks to a section in `state.md` and also appends a one-line summary to a separate log section that includes a running total.
2. Include at least one case where the work queue self-expands mid-run (a queued item is split into sub-items).
3. After the loop reaches `done`, compare:
   - The running total from the last progress log line
   - A grep count of the structured block headers in `state.md`
4. Expected: the two numbers match. Observed: a small drift (single-digit percent), with the grep count higher than the self-reported total.

## Verification after fix

1. Apply the Option 1 doc change to `src/loop/tool.ts`.
2. `bun run typecheck` and `bun test src/loop` stay green (no code path changed).
3. Start a new loop using the updated tool description and a per-tick contract that follows the guidance (deterministic re-count on each tick).
4. At finalize, the agent-reported tally should match the grep count exactly.

