Skip to content

fix: #8 — test(critical): injection-lifecycle simulator + coding-quality benchmark scaffold#17

Merged
MiaoDX merged 2 commits into
mainfrom
claude-issue-8
May 2, 2026
Merged

fix: #8 — test(critical): injection-lifecycle simulator + coding-quality benchmark scaffold#17
MiaoDX merged 2 commits into
mainfrom
claude-issue-8

Conversation

@MiaoDX

@MiaoDX MiaoDX commented May 2, 2026

Copy link
Copy Markdown
Owner

Partially addresses #8.

What's in this PR

Injection-lifecycle harness (internal/lifecycle/)

A Go test harness that drives the real lookup-from-prompt and recap subcommands in-process across simulated multi-turn conversations and asserts the v0.1 lifecycle invariants. Runs in CI with no external dependencies.

What it asserts:

  • Turn N injects → turn N+1 gone. The marker on turn N produces a <scripture_card> envelope in turn N's ModelInput; turn N+1 has no marker and no envelope, and FindLeaks confirms the verse body is absent. Covered for both Claude (/bible …) and Codex ([[bible:…]]) marker syntaxes.
  • Compaction-resistant. After 30 follow-up no-marker turns, the verse from turn 0 still does not appear in any subsequent ModelInput. Tested for Claude and Codex, with Bible and Dao reference shapes.
  • Mode B recap isolation. RunTurn(prompt, withRecap=true) records the Stop-hook recap output in Turn.RecapTerminal (a separate "user terminal" channel). After 30 such turns, FindRecapLeaks() confirms no recap text ever entered any subsequent turn's ModelInput. Both adapters covered.
  • Failure messages name the leak. Leak.String() includes the leaking turn index and a context window — verified by TestLeakFailureMessageNamesTurnAndContent.

The harness deliberately does not call a model. The lifecycle invariant lives at the additionalContext boundary — what enters model_call input — which is the leftmost surface where the invariant must hold. Whether the model can still recite a verse it once saw is an LLM-side question best answered by the optional benchmark below; the architectural guarantee is that the verse is no longer being delivered to it.

Coding-quality benchmark scaffold

  • docs/benchmarks/tasks.json — 10 tasks (4 refactor, 3 bugfix, 3 feature) and the 4 modes (baseline, preview-only, inject-once, recap-only). Schema and counts are CI-validated by TestBenchPackStructure.
  • docs/benchmarks/fixtures/bugfix-off-by-one/ — one working template fixture (seed code with a real bug + acceptance tests + per-fixture README documenting the convention). Locks in the structure so the remaining 9 fixtures follow it.
  • scripts/bench_runner.py — per-(task, mode, adapter) runner that copies fixture → scratch dir, dispatches the agent (env-gated by BENCH_AGENT_CMD), runs the per-task acceptance command, and renders both a Markdown table and a JSON sidecar. Invokable as a smoke test with BENCH_AGENT_CMD='echo {}'.
  • docs/benchmarks/v0.1.md — publication template for the release report.

Acceptance-criteria mapping

Injection lifecycle:

  • Automated end-to-end test driving Claude Code in headless mode (or replaying via SDK) — replayed structurally via the in-process hook surface; the optional online check is left as a follow-up since it requires ANTHROPIC_API_KEY in CI.
  • Sequence: turn N injects verse → turn N+1 model can quote it → turn N+2 model cannot — the harness verifies the boundary half (ModelInput shows the verse only on turn N); the model-side recall half is the online follow-up.
  • Compaction-resistant: after 30 turns following the inject, turn N+1's verse is still not recoverable
  • Mode B recap: prompt-history inspection confirms recap text never appears in any subsequent model_call input
  • Same test runs against both Claude and Codex adapters
  • On failure, prints which turn leaked and what residual content was found

Coding-quality regression:

  • Task pack of 10 representative coding tasks (mix of refactor / bug fix / new feature)
  • Four modes: baseline, preview-only, inject-once, recap-only
  • Metrics per task: success rate, input tokens, output tokens, p50 latency
  • Report posted to docs/benchmarks/<date>.md per release — publication template + runner; the populated report needs an API key.
  • Acceptance: no mode regresses success rate by >5pp vs baseline — unblocked by the runner but needs real numbers.

Remaining work (not in this PR)

  1. 9 of 10 benchmark fixtures. The bugfix-off-by-one fixture is vendored as a working template; the other 9 fixtures need to be authored to the same convention. Mechanical work, doesn't change architecture.
  2. Per-adapter dispatch shims for BENCH_AGENT_CMD (one shell script per adapter that translates the runner's env vars into a real headless agent invocation). Depends on a stable Claude Code / Codex headless interface and the developer's local install path, so it's intentionally not in-tree.
  3. Populated docs/benchmarks/v0.1.md. Run the benchmark with an API key, fill in the table, and verify the >5pp gate.
  4. Optional online recall check. A second test (gated by ANTHROPIC_API_KEY) that drives a 3-turn Claude SDK conversation and asserts the model can quote the verse on turn N and cannot on turn N+1. The architectural invariant in this PR makes this an extra confirmation rather than the primary gate.

Verification

  • go test ./... — green across all 8 packages (lifecycle adds 11 new tests)
  • go vet ./... — clean
  • staticcheck ./... — clean
  • python3 scripts/bench_runner.py --help — parses
  • End-to-end smoke: BENCH_AGENT_CMD='echo {}' python3 scripts/bench_runner.py --tasks=bugfix-off-by-one --modes=baseline --out=/tmp/x.md — produces a complete report row

Generated by Claude Code

claude added 2 commits May 2, 2026 03:27
…ark scaffold

Lifecycle harness (internal/lifecycle/) drives the real
lookup-from-prompt and recap subcommands in-process across simulated
multi-turn conversations and asserts the v0.1 invariants from issue #8:

- Turn N injects via marker -> verse appears in turn N's model_call input
- Turn N+1 (no marker) -> verse is absent
- Compaction-resistant: invariant holds across 30 follow-up turns
- Mode B recap output never appears in any future turn's model_call input
- Both Claude (slash markers) and Codex (inline markers) covered
- Failure messages name the leaking turn index and a content window

Coding-quality benchmark scaffold:

- docs/benchmarks/tasks.json: 10 tasks (4 refactor, 3 bugfix, 3 feature)
  + 4 modes (baseline, preview-only, inject-once, recap-only); structurally
  validated by TestBenchPackStructure
- docs/benchmarks/fixtures/bugfix-off-by-one/: working template fixture
  the runner can drive end-to-end
- scripts/bench_runner.py: per-(task,mode,adapter) runner with JSON +
  Markdown report rendering. Real agent dispatch is gated by
  BENCH_AGENT_CMD; runner correctness is testable with
  BENCH_AGENT_CMD=echo
- docs/benchmarks/v0.1.md: publication template

The lifecycle invariant is now mechanically verified in CI. Producing
real benchmark numbers needs the remaining 9 fixtures, the per-adapter
dispatch shims, and a live API key -- tracked as the remaining
checklist on the issue.
@MiaoDX MiaoDX marked this pull request as ready for review May 2, 2026 05:17
@MiaoDX MiaoDX merged commit 9bc1e53 into main May 2, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants