
test(bench): vendor 9 remaining benchmark fixtures for issue #8 #19

Closed
MiaoDX wants to merge 3 commits into main from claude-issue-8


MiaoDX (Owner) commented May 2, 2026

Superseded by #20 (clean branch off main; the rebased history on claude-issue-8 was tangled with the prior squash-merged PR #17).


Generated by Claude Code

claude added 3 commits May 2, 2026 03:27
…ark scaffold

Lifecycle harness (internal/lifecycle/) drives the real
lookup-from-prompt and recap subcommands in-process across simulated
multi-turn conversations and asserts the v0.1 invariants from issue #8:

- Turn N injects via marker -> verse appears in turn N's model_call input
- Turn N+1 (no marker) -> verse is absent
- Compaction-resistant: invariant holds across 30 follow-up turns
- Mode B recap output never appears in any future turn's model_call input
- Both Claude (slash markers) and Codex (inline markers) covered
- Failure messages name the leaking turn index and a content window
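
The invariants above can be sketched as a small checker. This is a Python illustration only, not the Go harness in internal/lifecycle/; the `Turn` record, its field names, and `find_leaks` are hypothetical names introduced here, and the 40-character window is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int
    model_input: str   # text sent to the model on this turn (hypothetical field)
    has_marker: bool   # True if this turn's prompt carried an inject marker


def find_leaks(turns, verse, window=40):
    """Return one failure message per turn that breaks the inject-once rule."""
    failures = []
    for turn in turns:
        pos = turn.model_input.find(verse)
        present = pos != -1
        if turn.has_marker and not present:
            # Turn N used a marker, so the verse must be in turn N's input.
            failures.append(f"turn {turn.index}: marker present but verse missing")
        elif not turn.has_marker and present:
            # No marker on this turn, so any occurrence of the verse is a leak.
            start = max(0, pos - window)
            end = min(len(turn.model_input), pos + len(verse) + window)
            snippet = turn.model_input[start:end]
            failures.append(f"turn {turn.index}: verse leaked near {snippet!r}")
    return failures
```

Running it over one marked turn followed by 30 unmarked follow-ups mirrors the compaction-resistance case: an empty result means the verse stayed confined to the turn that requested it.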

Coding-quality benchmark scaffold:

- docs/benchmarks/tasks.json: 10 tasks (4 refactor, 3 bugfix, 3 feature)
  + 4 modes (baseline, preview-only, inject-once, recap-only); structurally
  validated by TestBenchPackStructure
- docs/benchmarks/fixtures/bugfix-off-by-one/: working template fixture
  the runner can drive end-to-end
- scripts/bench_runner.py: per-(task,mode,adapter) runner with JSON +
  Markdown report rendering. Real agent dispatch is gated by
  BENCH_AGENT_CMD; runner correctness is testable with
  BENCH_AGENT_CMD=echo
- docs/benchmarks/v0.1.md: publication template
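
A rough sketch of the per-(task, mode, adapter) matrix and the BENCH_AGENT_CMD gate described above. It is not the code in scripts/bench_runner.py: `plan_runs` and `dispatch` are hypothetical names, and passing the prompt as the final command-line argument is an assumption about how an agent command might be invoked.

```python
import itertools
import os
import shlex
import subprocess

# The four modes listed for docs/benchmarks/tasks.json.
MODES = ["baseline", "preview-only", "inject-once", "recap-only"]


def plan_runs(task_ids, adapters, modes=MODES):
    """Enumerate every (task, mode, adapter) combination to execute."""
    return list(itertools.product(task_ids, modes, adapters))


def dispatch(prompt, env=None):
    """Run the agent command if BENCH_AGENT_CMD is set; otherwise skip."""
    env = os.environ if env is None else env
    cmd = env.get("BENCH_AGENT_CMD")
    if not cmd:
        return None  # gated: no agent command configured
    # Assumption: the prompt is appended as the last argument.
    result = subprocess.run(
        shlex.split(cmd) + [prompt],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

With BENCH_AGENT_CMD=echo the dispatch step should simply return the prompt text, which is enough to exercise the planning and reporting path without a live API key.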

The lifecycle invariant is now mechanically verified in CI. Producing
real benchmark numbers needs the remaining 9 fixtures, the per-adapter
dispatch shims, and a live API key -- tracked as the remaining
checklist on the issue.
…pack

Authors the 4 refactor / 3 bugfix / 3 feature fixtures referenced by
docs/benchmarks/tasks.json. Each fixture follows the bugfix-off-by-one
template: a README.md describing the task, one source file with the
seed code, and a test_*.py acceptance suite graded by pytest -q.

Seed-code grading: each fixture's seed already fails the relevant
acceptance tests, so a no-op agent run is correctly graded as failing.
Refactor seeds pass behavioral tests but fail the structural assertions
(extracted helper exists, ambiguous names removed, no positional row
indexing); bugfix and feature seeds fail behavior tests directly.
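
To make the refactor grading concrete, here is one way structural assertions like "extracted helper exists" and "no positional row indexing" could be written with the standard `ast` module. The helper name `price_of`, the variable name `row`, and both functions are hypothetical; the actual fixtures may check structure differently. The sketch assumes Python 3.9 or later, where a subscript's index is stored directly on `node.slice`.

```python
import ast


def defines_function(source, name):
    """True if the module source defines a function with the given name."""
    tree = ast.parse(source)
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in ast.walk(tree)
    )


def uses_positional_index(source, var="row"):
    """True if the source subscripts `var` with an integer literal, e.g. row[2]."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == var
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, int)
        ):
            return True
    return False
```

A seed such as `sum(row[2] for row in rows)` would pass a behavioral test but fail both checks, so a no-op agent run is graded as failing, which is the property the commit message describes.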

Test infrastructure:
- bench_pack_test.go: TestBenchTemplateFixtureExists is replaced by
  TestBenchFixturesPresent, which iterates every task in tasks.json and
  asserts the fixture dir contains README.md + a non-test source file +
  a test_*.py. Adding a task without vendoring the fixture (or vice
  versa) now fails CI.
- .gitignore: __pycache__/ and *.pyc, since pytest now runs against the
  fixtures during local verification.

Closes the second-largest remaining bullet in #8's checklist; the
remaining items (per-adapter BENCH_AGENT_CMD shims, populated v0.1.md,
optional ANTHROPIC_API_KEY-gated online recall check) are deliberately
out-of-tree or API-key-gated and tracked on the issue.

MiaoDX marked this pull request as ready for review May 2, 2026
MiaoDX closed this May 2, 2026

2 participants