test(bench): vendor 9 remaining benchmark fixtures for issue #8 #19
Closed
MiaoDX wants to merge 3 commits into
…ark scaffold

Lifecycle harness (internal/lifecycle/) drives the real lookup-from-prompt and recap subcommands in-process across simulated multi-turn conversations and asserts the v0.1 invariants from issue #8:

- Turn N injects via marker -> verse appears in turn N's model_call input
- Turn N+1 (no marker) -> verse is absent
- Compaction-resistant: invariant holds across 30 follow-up turns
- Mode B recap output never appears in any future turn's model_call input
- Both Claude (slash markers) and Codex (inline markers) covered
- Failure messages name the leaking turn index and a content window

Coding-quality benchmark scaffold:

- docs/benchmarks/tasks.json: 10 tasks (4 refactor, 3 bugfix, 3 feature) + 4 modes (baseline, preview-only, inject-once, recap-only); structurally validated by TestBenchPackStructure
- docs/benchmarks/fixtures/bugfix-off-by-one/: working template fixture the runner can drive end-to-end
- scripts/bench_runner.py: per-(task,mode,adapter) runner with JSON + Markdown report rendering. Real agent dispatch is gated by BENCH_AGENT_CMD; runner correctness is testable with BENCH_AGENT_CMD=echo
- docs/benchmarks/v0.1.md: publication template

The lifecycle invariant is now mechanically verified in CI. Producing real benchmark numbers needs the remaining 9 fixtures, the per-adapter dispatch shims, and a live API key -- tracked as the remaining checklist on the issue.
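For readers who want the inject-once invariant in concrete form, here is a minimal Python sketch. It is not the Go harness under internal/lifecycle/ (which is not shown in this PR view); the names `Turn`, `has_marker`, `model_input`, `find_leak`, and the `window` parameter are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One simulated conversation turn (hypothetical shape, for illustration)."""
    index: int
    has_marker: bool   # True if this turn's prompt carried an injection marker
    model_input: str   # text that would be sent to the model for this turn


def find_leak(turns, verse, window=40):
    """Return a failure message for the first violating turn, or None.

    Invariant sketched here: the verse is present in the model input of a
    marked turn and absent from the model input of every unmarked turn.
    """
    for turn in turns:
        pos = turn.model_input.find(verse)
        if turn.has_marker and pos < 0:
            return f"turn {turn.index}: marker present but verse missing from model input"
        if not turn.has_marker and pos >= 0:
            # Report a window of surrounding content so the leak is easy to locate.
            start = max(0, pos - window)
            end = pos + len(verse) + window
            snippet = turn.model_input[start:end]
            return f"turn {turn.index}: verse leaked into model input near {snippet!r}"
    return None


# One marked turn followed by 30 unmarked follow-ups, mirroring the
# "30 follow-up turns" case from the list above.
turns = [Turn(0, True, "user prompt VERSE-TEXT")]
turns += [Turn(i, False, f"follow-up {i}") for i in range(1, 31)]
print(find_leak(turns, "VERSE-TEXT"))
```

Replacing any follow-up turn's `model_input` with text that contains the verse makes `find_leak` return a message that names that turn's index, which is the shape of failure message the last bullet of the invariant list describes.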
…pack

Authors the 4 refactor / 3 bugfix / 3 feature fixtures referenced by docs/benchmarks/tasks.json. Each fixture follows the bugfix-off-by-one template: a README.md describing the task, one source file with the seed code, and a test_*.py acceptance suite graded by pytest -q.

Seed-code grading: each fixture's seed already fails the relevant acceptance tests, so a no-op agent run is correctly graded as failing. Refactor seeds pass behavioral tests but fail the structural assertions (extracted helper exists, ambiguous names removed, no positional row indexing); bugfix and feature seeds fail behavior tests directly.

Test infrastructure:

- bench_pack_test.go: TestBenchTemplateFixtureExists is replaced by TestBenchFixturesPresent, which iterates every task in tasks.json and asserts the fixture dir contains README.md + a non-test source file + a test_*.py. Adding a task without vendoring the fixture (or vice versa) now fails CI.
- .gitignore: __pycache__/ and *.pyc, since pytest now runs against the fixtures during local verification.

Closes the second-largest remaining bullet in #8's checklist; the remaining items (per-adapter BENCH_AGENT_CMD shims, populated v0.1.md, optional ANTHROPIC_API_KEY-gated online recall check) are deliberately out-of-tree or API-key-gated and tracked on the issue.
3 tasks
Superseded by #20 (clean branch off main; the rebased history on claude-issue-8 was tangled with the prior squash-merged PR #17).

Generated by Claude Code