test(bench): vendor 9 remaining benchmark fixtures for issue #8 by MiaoDX · Pull Request #19 · MiaoDX/verse-driven

MiaoDX · 2026-05-02T06:39:30Z

Superseded by #20 (clean branch off main; the rebased history on claude-issue-8 was tangled with the prior squash-merged PR #17).

Generated by Claude Code

…ark scaffold

Lifecycle harness (internal/lifecycle/) drives the real lookup-from-prompt and recap subcommands in-process across simulated multi-turn conversations and asserts the v0.1 invariants from issue #8:

- Turn N injects via marker -> verse appears in turn N's model_call input
- Turn N+1 (no marker) -> verse is absent
- Compaction-resistant: invariant holds across 30 follow-up turns
- Mode B recap output never appears in any future turn's model_call input
- Both Claude (slash markers) and Codex (inline markers) covered
- Failure messages name the leaking turn index and a content window

Coding-quality benchmark scaffold:

- docs/benchmarks/tasks.json: 10 tasks (4 refactor, 3 bugfix, 3 feature) + 4 modes (baseline, preview-only, inject-once, recap-only); structurally validated by TestBenchPackStructure
- docs/benchmarks/fixtures/bugfix-off-by-one/: working template fixture the runner can drive end-to-end
- scripts/bench_runner.py: per-(task,mode,adapter) runner with JSON + Markdown report rendering. Real agent dispatch is gated by BENCH_AGENT_CMD; runner correctness is testable with BENCH_AGENT_CMD=echo
- docs/benchmarks/v0.1.md: publication template

The lifecycle invariant is now mechanically verified in CI. Producing real benchmark numbers needs the remaining 9 fixtures, the per-adapter dispatch shims, and a live API key -- tracked as the remaining checklist on the issue.
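For readers who have not opened internal/lifecycle/, the inject-once invariant from the commit message above can be sketched roughly as follows. This is an illustrative Python rendition, not the harness itself (which is Go): the function name `find_leak`, the `window` parameter, and the representation of each turn's model_call input as a plain string are all assumptions made for the example.

```python
from typing import List, Optional


def find_leak(turn_inputs: List[str], verse: str, injected_turn: int,
              window: int = 20) -> Optional[str]:
    """Return a failure message if the inject-once invariant is violated.

    turn_inputs[i] stands in for the model_call input of turn i
    (hypothetical representation; the real harness may differ).
    """
    # The injecting turn must actually contain the verse.
    if verse not in turn_inputs[injected_turn]:
        return f"turn {injected_turn}: verse missing from the injecting turn"

    # No later turn may contain it.
    for idx in range(injected_turn + 1, len(turn_inputs)):
        pos = turn_inputs[idx].find(verse)
        if pos != -1:
            # Name the leaking turn and show a content window around the hit.
            start = max(0, pos - window)
            end = pos + len(verse) + window
            return f"turn {idx}: verse leaked near {turn_inputs[idx][start:end]!r}"
    return None


# One injecting turn followed by 30 follow-up turns, as in the commit message.
verse = "EXAMPLE-VERSE-TEXT"
turns = [f"user prompt\n{verse}"] + [f"follow-up {i}" for i in range(30)]
assert find_leak(turns, verse, injected_turn=0) is None
```

Returning a message rather than a boolean mirrors the stated goal that failures name the leaking turn index and a content window.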

…pack

Authors the 4 refactor / 3 bugfix / 3 feature fixtures referenced by docs/benchmarks/tasks.json. Each fixture follows the bugfix-off-by-one template: a README.md describing the task, one source file with the seed code, and a test_*.py acceptance suite graded by pytest -q.

Seed-code grading: each fixture's seed already fails the relevant acceptance tests, so a no-op agent run is correctly graded as failing. Refactor seeds pass behavioral tests but fail the structural assertions (extracted helper exists, ambiguous names removed, no positional row indexing); bugfix and feature seeds fail behavior tests directly.

Test infrastructure:

- bench_pack_test.go: TestBenchTemplateFixtureExists is replaced by TestBenchFixturesPresent, which iterates every task in tasks.json and asserts the fixture dir contains README.md + a non-test source file + a test_*.py. Adding a task without vendoring the fixture (or vice versa) now fails CI.
- .gitignore: __pycache__/ and *.pyc, since pytest now runs against the fixtures during local verification.

Closes the second-largest remaining bullet in #8's checklist; the remaining items (per-adapter BENCH_AGENT_CMD shims, populated v0.1.md, optional ANTHROPIC_API_KEY-gated online recall check) are deliberately out-of-tree or API-key-gated and tracked on the issue.
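The fixture-layout check described above could look something like the sketch below. It is a Python approximation for illustration only: the actual check is TestBenchFixturesPresent in bench_pack_test.go, and the helper name `missing_fixture_parts` plus the assumption that seed sources are `.py` files are mine, not the PR's. The tasks.json schema is not shown in this PR, so the sketch takes a fixture directory directly rather than parsing that file.

```python
from pathlib import Path
from typing import List


def missing_fixture_parts(fixture_dir: Path) -> List[str]:
    """List the required pieces that a fixture directory lacks.

    Required layout (per the commit message): README.md, a non-test
    source file, and a test_*.py acceptance suite.
    """
    problems: List[str] = []

    if not (fixture_dir / "README.md").is_file():
        problems.append("README.md")

    # Assumption: seed sources are Python files alongside the tests.
    py_files = list(fixture_dir.glob("*.py"))
    if not any(p.name.startswith("test_") for p in py_files):
        problems.append("test_*.py")
    if not any(not p.name.startswith("test_") for p in py_files):
        problems.append("non-test source file")

    return problems
```

Running this over every fixture directory named by a task and failing on any non-empty result would reproduce the "task without fixture fails CI" behavior in spirit.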

claude added 3 commits May 2, 2026 03:27

Merge remote-tracking branch 'origin/main' into claude-issue-8

58a9de8

MiaoDX marked this pull request as ready for review May 2, 2026 06:39

MiaoDX closed this May 2, 2026

MiaoDX mentioned this pull request May 2, 2026

test(bench): vendor 9 remaining benchmark fixtures for issue #8 #20

Merged

3 tasks


MiaoDX wants to merge 3 commits into main from claude-issue-8
