fix: #8 — test(critical): injection-lifecycle simulator + coding-quality benchmark scaffold by MiaoDX · Pull Request #17 · MiaoDX/verse-driven

MiaoDX · 2026-05-02T03:27:52Z

Partially addresses #8.

What's in this PR

Injection-lifecycle harness (`internal/lifecycle/`)

A Go test harness that drives the real lookup-from-prompt and recap subcommands in-process across simulated multi-turn conversations and asserts the v0.1 lifecycle invariants. Runs in CI with no external dependencies.

What it asserts:

Turn N injects → turn N+1 gone. The marker on turn N produces a <scripture_card> envelope in turn N's ModelInput; turn N+1 has no marker and no envelope, and FindLeaks confirms the verse body is absent. Covered for both Claude (/bible …) and Codex ([[bible:…]]) marker syntaxes.
Compaction-resistant. After 30 follow-up no-marker turns, the verse from turn 0 still does not appear in any subsequent ModelInput. Tested for Claude and Codex, with Bible and Dao reference shapes.
Mode B recap isolation. RunTurn(prompt, withRecap=true) records the Stop-hook recap output in Turn.RecapTerminal (a separate "user terminal" channel). After 30 such turns, FindRecapLeaks() confirms no recap text ever entered any subsequent turn's ModelInput. Both adapters covered.
Failure messages name the leak. Leak.String() includes the leaking turn index and a context window — verified by TestLeakFailureMessageNamesTurnAndContent.

The harness deliberately does not call a model. The lifecycle invariant lives at the additionalContext boundary — what enters model_call input — which is the leftmost surface where the invariant must hold. Whether the model can still recite a verse it once saw is an LLM-side question best answered by the optional benchmark below; the architectural guarantee is that the verse is no longer being delivered to it.

Coding-quality benchmark scaffold

docs/benchmarks/tasks.json — 10 tasks (4 refactor, 3 bugfix, 3 feature) and the 4 modes (baseline, preview-only, inject-once, recap-only). Schema and counts are CI-validated by TestBenchPackStructure.
docs/benchmarks/fixtures/bugfix-off-by-one/ — one working template fixture (seed code with a real bug + acceptance tests + per-fixture README documenting the convention). Locks in the structure so the remaining 9 fixtures follow it.
scripts/bench_runner.py — per-(task, mode, adapter) runner that copies fixture → scratch dir, dispatches the agent (env-gated by BENCH_AGENT_CMD), runs the per-task acceptance command, and renders both a Markdown table and a JSON sidecar. Invokable as a smoke test with BENCH_AGENT_CMD='echo {}'.
docs/benchmarks/v0.1.md — publication template for the release report.

Acceptance-criteria mapping

Injection lifecycle:

Automated end-to-end test driving Claude Code in headless mode (or replaying via SDK) — replayed structurally via the in-process hook surface; the optional online check is left as a follow-up since it requires ANTHROPIC_API_KEY in CI.
Sequence: turn N injects verse → turn N+1 model can quote it → turn N+2 model cannot — the harness verifies the boundary half (ModelInput shows the verse only on turn N); the model-side recall half is the online follow-up.
Compaction-resistant: after 30 turns following the inject, turn N+1's verse is still not recoverable
Mode B recap: prompt-history inspection confirms recap text never appears in any subsequent model_call input
Same test runs against both Claude and Codex adapters
On failure, prints which turn leaked and what residual content was found

Coding-quality regression:

Task pack of 10 representative coding tasks (mix of refactor / bug fix / new feature)
Four modes: baseline, preview-only, inject-once, recap-only
Metrics per task: success rate, input tokens, output tokens, p50 latency
Report posted to docs/benchmarks/<date>.md per release — publication template + runner; the populated report needs an API key.
Acceptance: no mode regresses success rate by >5pp vs baseline — unblocked by the runner but needs real numbers.

Remaining work (not in this PR)

9 of 10 benchmark fixtures. The bugfix-off-by-one fixture is vendored as a working template; the other 9 fixtures need to be authored to the same convention. Mechanical work, doesn't change architecture.
Per-adapter dispatch shims for BENCH_AGENT_CMD (one shell script per adapter that translates the runner's env vars into a real headless agent invocation). Depends on a stable Claude Code / Codex headless interface and the developer's local install path, so it's intentionally not in-tree.
Populated docs/benchmarks/v0.1.md. Run the benchmark with an API key, fill in the table, and verify the >5pp gate.
Optional online recall check. A second test (gated by ANTHROPIC_API_KEY) that drives a 3-turn Claude SDK conversation and asserts the model can quote the verse on turn N and cannot on turn N+1. The architectural invariant in this PR makes this an extra confirmation rather than the primary gate.

Verification

go test ./... — green across all 8 packages (lifecycle adds 11 new tests)
go vet ./... — clean
staticcheck ./... — clean
python3 scripts/bench_runner.py --help — parses
End-to-end smoke: BENCH_AGENT_CMD='echo {}' python3 scripts/bench_runner.py --tasks=bugfix-off-by-one --modes=baseline --out=/tmp/x.md — produces a complete report row

Generated by Claude Code

…ark scaffold Lifecycle harness (internal/lifecycle/) drives the real lookup-from-prompt and recap subcommands in-process across simulated multi-turn conversations and asserts the v0.1 invariants from issue #8: - Turn N injects via marker -> verse appears in turn N's model_call input - Turn N+1 (no marker) -> verse is absent - Compaction-resistant: invariant holds across 30 follow-up turns - Mode B recap output never appears in any future turn's model_call input - Both Claude (slash markers) and Codex (inline markers) covered - Failure messages name the leaking turn index and a content window Coding-quality benchmark scaffold: - docs/benchmarks/tasks.json: 10 tasks (4 refactor, 3 bugfix, 3 feature) + 4 modes (baseline, preview-only, inject-once, recap-only); structurally validated by TestBenchPackStructure - docs/benchmarks/fixtures/bugfix-off-by-one/: working template fixture the runner can drive end-to-end - scripts/bench_runner.py: per-(task,mode,adapter) runner with JSON + Markdown report rendering. Real agent dispatch is gated by BENCH_AGENT_CMD; runner correctness is testable with BENCH_AGENT_CMD=echo - docs/benchmarks/v0.1.md: publication template The lifecycle invariant is now mechanically verified in CI. Producing real benchmark numbers needs the remaining 9 fixtures, the per-adapter dispatch shims, and a live API key -- tracked as the remaining checklist on the issue.

claude added 2 commits May 2, 2026 03:27

Merge remote-tracking branch 'origin/main' into claude-issue-8

58a9de8

MiaoDX marked this pull request as ready for review May 2, 2026 05:17

MiaoDX merged commit 9bc1e53 into main May 2, 2026
1 check passed

This was referenced May 2, 2026

test(critical): injection lifecycle + coding-quality regression #8

Closed

test(bench): vendor 9 remaining benchmark fixtures for issue #8 #19

Closed

test(bench): vendor 9 remaining benchmark fixtures for issue #8 #20

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: #8 — test(critical): injection-lifecycle simulator + coding-quality benchmark scaffold#17

fix: #8 — test(critical): injection-lifecycle simulator + coding-quality benchmark scaffold#17
MiaoDX merged 2 commits into
mainfrom
claude-issue-8

MiaoDX commented May 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MiaoDX commented May 2, 2026

What's in this PR

Injection-lifecycle harness (internal/lifecycle/)

Coding-quality benchmark scaffold

Acceptance-criteria mapping

Remaining work (not in this PR)

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Injection-lifecycle harness (`internal/lifecycle/`)