fix: #8 — test(critical): injection-lifecycle simulator + coding-quality benchmark scaffold#17
Merged
…ark scaffold

Lifecycle harness (internal/lifecycle/) drives the real lookup-from-prompt and recap subcommands in-process across simulated multi-turn conversations and asserts the v0.1 invariants from issue #8:

- Turn N injects via marker -> verse appears in turn N's model_call input
- Turn N+1 (no marker) -> verse is absent
- Compaction-resistant: invariant holds across 30 follow-up turns
- Mode B recap output never appears in any future turn's model_call input
- Both Claude (slash markers) and Codex (inline markers) covered
- Failure messages name the leaking turn index and a content window

Coding-quality benchmark scaffold:

- docs/benchmarks/tasks.json: 10 tasks (4 refactor, 3 bugfix, 3 feature) + 4 modes (baseline, preview-only, inject-once, recap-only); structurally validated by TestBenchPackStructure
- docs/benchmarks/fixtures/bugfix-off-by-one/: working template fixture the runner can drive end-to-end
- scripts/bench_runner.py: per-(task,mode,adapter) runner with JSON + Markdown report rendering. Real agent dispatch is gated by BENCH_AGENT_CMD; runner correctness is testable with BENCH_AGENT_CMD=echo
- docs/benchmarks/v0.1.md: publication template

The lifecycle invariant is now mechanically verified in CI. Producing real benchmark numbers needs the remaining 9 fixtures, the per-adapter dispatch shims, and a live API key -- tracked as the remaining checklist on the issue.
Partially addresses #8.
What's in this PR
Injection-lifecycle harness (`internal/lifecycle/`)

A Go test harness that drives the real `lookup-from-prompt` and `recap` subcommands in-process across simulated multi-turn conversations and asserts the v0.1 lifecycle invariants. Runs in CI with no external dependencies.

What it asserts:
- … `<scripture_card>` envelope in turn N's `ModelInput`; turn N+1 has no marker and no envelope, and `FindLeaks` confirms the verse body is absent. Covered for both Claude (`/bible …`) and Codex (`[[bible:…]]`) marker syntaxes.
- … `ModelInput`. Tested for Claude and Codex, with Bible and Dao reference shapes.
- `RunTurn(prompt, withRecap=true)` records the Stop-hook recap output in `Turn.RecapTerminal` (a separate "user terminal" channel). After 30 such turns, `FindRecapLeaks()` confirms no recap text ever entered any subsequent turn's `ModelInput`. Both adapters covered.
- `Leak.String()` includes the leaking turn index and a context window, verified by `TestLeakFailureMessageNamesTurnAndContent`.

The harness deliberately does not call a model. The lifecycle invariant lives at the `additionalContext` boundary (what enters `model_call` input), which is the leftmost surface where the invariant must hold. Whether the model can still recite a verse it once saw is an LLM-side question best answered by the optional benchmark below; the architectural guarantee is that the verse is no longer being delivered to it.

Coding-quality benchmark scaffold
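
The task pack in this scaffold is described as 10 tasks (4 refactor, 3 bugfix, 3 feature) plus 4 modes, with schema and counts validated by `TestBenchPackStructure`. As a rough illustration of what such a structural check could look like, here is a minimal sketch. The JSON field names (`tasks`, `modes`, `id`, `category`), the `Pack` and `Task` types, and the `validatePack` function are all assumptions made for this example; the actual schema and test body are not shown in this PR description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Pack is a hypothetical shape for docs/benchmarks/tasks.json.
// The field names are assumed for illustration only.
type Pack struct {
	Tasks []Task   `json:"tasks"`
	Modes []string `json:"modes"`
}

// Task is a hypothetical task entry: a unique id and a category.
type Task struct {
	ID       string `json:"id"`
	Category string `json:"category"`
}

// validatePack checks the counts stated in the PR description:
// 4 refactor, 3 bugfix, 3 feature tasks, and exactly the 4 named modes.
func validatePack(raw []byte) error {
	var p Pack
	if err := json.Unmarshal(raw, &p); err != nil {
		return fmt.Errorf("parse pack: %w", err)
	}

	wantCats := map[string]int{"refactor": 4, "bugfix": 3, "feature": 3}
	gotCats := map[string]int{}
	seenIDs := map[string]bool{}
	for _, t := range p.Tasks {
		if t.ID == "" {
			return fmt.Errorf("task with empty id")
		}
		if seenIDs[t.ID] {
			return fmt.Errorf("duplicate task id %q", t.ID)
		}
		seenIDs[t.ID] = true
		if _, ok := wantCats[t.Category]; !ok {
			return fmt.Errorf("task %q: unknown category %q", t.ID, t.Category)
		}
		gotCats[t.Category]++
	}
	for cat, want := range wantCats {
		if gotCats[cat] != want {
			return fmt.Errorf("category %q: got %d tasks, want %d", cat, gotCats[cat], want)
		}
	}

	// Modes are compared as a set, so their order in the file does not matter.
	wantModes := []string{"baseline", "preview-only", "inject-once", "recap-only"}
	gotModes := map[string]bool{}
	for _, m := range p.Modes {
		gotModes[m] = true
	}
	if len(p.Modes) != len(wantModes) || len(gotModes) != len(wantModes) {
		return fmt.Errorf("got %d modes (%d distinct), want %d", len(p.Modes), len(gotModes), len(wantModes))
	}
	for _, m := range wantModes {
		if !gotModes[m] {
			return fmt.Errorf("missing mode %q", m)
		}
	}
	return nil
}
```

A Go test could read the JSON file with `os.ReadFile` and fail on any non-nil error from a check like this, which keeps the pack's shape from drifting as fixtures are added.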
- `docs/benchmarks/tasks.json`: 10 tasks (4 refactor, 3 bugfix, 3 feature) and the 4 modes (`baseline`, `preview-only`, `inject-once`, `recap-only`). Schema and counts are CI-validated by `TestBenchPackStructure`.
- `docs/benchmarks/fixtures/bugfix-off-by-one/`: one working template fixture (seed code with a real bug + acceptance tests + per-fixture README documenting the convention). Locks in the structure so the remaining 9 fixtures follow it.
- `scripts/bench_runner.py`: per-(task, mode, adapter) runner that copies fixture → scratch dir, dispatches the agent (env-gated by `BENCH_AGENT_CMD`), runs the per-task acceptance command, and renders both a Markdown table and a JSON sidecar. Invokable as a smoke test with `BENCH_AGENT_CMD='echo {}'`.
- `docs/benchmarks/v0.1.md`: publication template for the release report.

Acceptance-criteria mapping
Injection lifecycle:

- … `ANTHROPIC_API_KEY` in CI.
- … `ModelInput` shows the verse only on turn N); the model-side recall half is the online follow-up.
- … `model_call` input

Coding-quality regression:

- … `baseline`, `preview-only`, `inject-once`, `recap-only`
- … `docs/benchmarks/<date>.md` per release: publication template + runner; the populated report needs an API key.

Remaining work (not in this PR)
- … `BENCH_AGENT_CMD` (one shell script per adapter that translates the runner's env vars into a real headless agent invocation). Depends on a stable Claude Code / Codex headless interface and the developer's local install path, so it's intentionally not in-tree.
- … `docs/benchmarks/v0.1.md`. Run the benchmark with an API key, fill in the table, and verify the >5pp gate.
- … `ANTHROPIC_API_KEY`) that drives a 3-turn Claude SDK conversation and asserts the model can quote the verse on turn N and cannot on turn N+1. The architectural invariant in this PR makes this an extra confirmation rather than the primary gate.

Verification
- `go test ./...`: green across all 8 packages (lifecycle adds 11 new tests)
- `go vet ./...`: clean
- `staticcheck ./...`: clean
- `python3 scripts/bench_runner.py --help`: parses
- `BENCH_AGENT_CMD='echo {}' python3 scripts/bench_runner.py --tasks=bugfix-off-by-one --modes=baseline --out=/tmp/x.md`: produces a complete report row

Generated by Claude Code