test(bench): vendor 9 remaining benchmark fixtures for issue #8 #20
Merged
…pack

Authors the 4 refactor / 3 bugfix / 3 feature fixtures referenced by docs/benchmarks/tasks.json. Each fixture follows the bugfix-off-by-one template: a README.md describing the task, one source file with the seed code, and a test_*.py acceptance suite graded by pytest -q.

Seed-code grading: each fixture's seed already fails the relevant acceptance tests, so a no-op agent run is correctly graded as failing. Refactor seeds pass behavioral tests but fail the structural assertions (extracted helper exists, ambiguous names removed, no positional row indexing); bugfix and feature seeds fail behavior tests directly.

Test infrastructure:
- bench_pack_test.go: TestBenchTemplateFixtureExists is replaced by TestBenchFixturesPresent, which iterates every task in tasks.json and asserts the fixture dir contains README.md + a non-test source file + a test_*.py. Adding a task without vendoring the fixture (or vice versa) now fails CI.
- .gitignore: __pycache__/ and *.pyc, since pytest now runs against the fixtures during local verification.

Closes the second-largest remaining bullet in #8's checklist; the remaining items (per-adapter BENCH_AGENT_CMD shims, populated v0.1.md, optional ANTHROPIC_API_KEY-gated online recall check) are deliberately out-of-tree or API-key-gated and tracked on the issue.
Continues #8 (supersedes #19). Picks up the largest mechanical bullet from the remaining-work checklist on #8: vendoring the 9 fixtures referenced by `docs/benchmarks/tasks.json`, whose absence was tracked in the issue comment.

What's in this PR
Nine fixtures, one convention
Each follows the `bugfix-off-by-one` template (PR #17): a `README.md` describing the task in PR-description form, one source file with the seed code, and a `test_*.py` acceptance suite that the runner grades with `pytest -q`.

| Fixture | Seed (task) |
| --- | --- |
| `refactor-cron` | `schedule_job` (extract `parse_cron`) |
| `refactor-config-merge` | (… `deep_merge`) |
| `refactor-csv-rows` | `row[0]`/`row[1]`/`row[2]` indexing (switch to namedtuple) |
| `refactor-rename` | `do` + variable `tmp` (rename, keep public API) |
| `bugfix-utf8-truncate` | `s.encode()[:n]` splits multi-byte codepoints |
| `bugfix-race-counter` | `time.sleep(0)` between read and write |
| `feature-rate-limit` | `TokenBucket` + `rate_limit` decorator stubs raising `NotImplementedError` |
| `feature-flatten-json` | `flatten_json(obj, sep)` stub |
| `feature-cli-flag` | `--json` flag + `render_json` helper |

Seed-code grading is correct
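To make one seed failure concrete, here is a minimal sketch of the kind of defect the `bugfix-utf8-truncate` row describes: slicing the encoded bytes at `n` can cut a multi-byte codepoint in half, so a strict decode raises. The function names `truncate_seed` and `truncate_fixed` are hypothetical, and the fix shown is only one possible approach; the fixture's real source and acceptance tests are not reproduced here.

```python
def truncate_seed(s: str, n: int) -> str:
    """Buggy sketch: cut the UTF-8 bytes at n, then decode strictly."""
    return s.encode("utf-8")[:n].decode("utf-8")


def truncate_fixed(s: str, n: int) -> str:
    """One possible fix: drop a trailing partial codepoint instead of raising."""
    return s.encode("utf-8")[:n].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    text = "héllo"  # 'é' occupies two bytes in UTF-8
    try:
        truncate_seed(text, 2)  # the cut lands in the middle of 'é'
    except UnicodeDecodeError as exc:
        print("seed fails:", exc.reason)
    print(repr(truncate_fixed(text, 2)))
```

Because the input is valid UTF-8 before the cut, the only undecodable bytes are the truncated tail, which is why ignoring decode errors is enough in this sketch.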
Each fixture's seed already fails its acceptance suite, so a no-op agent run grades as a miss. Confirmed locally:

- Refactor seeds pass the behavioral tests but fail the structural assertions (`parse_cron` exists, no `row[0..2]` substrings remain, no module-level `do` / no `\btmp\b` token, `deep_merge` exists).
- Bugfix and feature seeds fail the behavior tests directly (`UnicodeDecodeError` on multi-byte split, lost increments under threads, `NotImplementedError`, `AssertionError` on missing `--json`).

The race-counter seed is an explicit non-atomic read-modify-write with `time.sleep(0)` between the read and the write. This is necessary because plain `self._n += 1`, while not guaranteed atomic, rarely interleaves in practice on CPython (GIL) and would not reliably grade as a bug. With the explicit yield, races land reliably (the test loses ~96% of updates on this machine).

Test infrastructure
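The presence rule enforced by `TestBenchFixturesPresent` (every task has a fixture dir with a `README.md`, a non-test source `.py`, and a `test_*.py`, and no fixture exists without a task) lives in a Go test. As a rough illustration only, the same rule can be sketched in Python. The helper names, the assumed `tasks.json` shape (a list of objects with an `id` field), and the assumption that each fixture dir is named after its task id are all hypothetical and not taken from the repository.

```python
import json
from pathlib import Path


def missing_parts(fixture_dir: Path) -> list[str]:
    """List the required fixture parts that fixture_dir lacks."""
    py_files = list(fixture_dir.glob("*.py")) if fixture_dir.is_dir() else []
    has_test = any(p.name.startswith("test_") for p in py_files)
    has_source = any(not p.name.startswith("test_") for p in py_files)
    missing = []
    if not (fixture_dir / "README.md").is_file():
        missing.append("README.md")
    if not has_source:
        missing.append("non-test source .py")
    if not has_test:
        missing.append("test_*.py")
    return missing


def check_task_pack(tasks_json: Path, fixtures_root: Path) -> dict[str, list[str]]:
    """Map each problematic task id (or orphan dir) to what is wrong with it."""
    # ASSUMPTION: tasks.json is a JSON list of objects with an "id" field.
    task_ids = [task["id"] for task in json.loads(tasks_json.read_text())]
    problems: dict[str, list[str]] = {}
    for task_id in task_ids:
        # ASSUMPTION: the fixture dir is named after the task id.
        missing = missing_parts(fixtures_root / task_id)
        if missing:
            problems[task_id] = missing
    # The "vice versa" direction: a vendored fixture with no task entry.
    for child in sorted(fixtures_root.iterdir()):
        if child.is_dir() and child.name not in task_ids:
            problems[child.name] = ["no entry in tasks.json"]
    return problems
```

An empty result from `check_task_pack` would correspond to the CI check passing; any entry names the task or directory that is out of sync.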
- `internal/lifecycle/bench_pack_test.go`: `TestBenchTemplateFixtureExists` is replaced with `TestBenchFixturesPresent`, which iterates every task in `tasks.json` and asserts the fixture dir contains a `README.md`, a non-test source `.py`, and a `test_*.py`. Adding a task without vendoring the fixture (or vice versa) now fails CI.
- `.gitignore`: adds `__pycache__/` and `*.pyc`, since pytest now runs against the fixtures during local verification.

Acceptance-criteria mapping (issue #8)
This PR closes the "task pack of 10 representative coding tasks" sub-bullet, which was previously satisfied only on paper (10 entries in `tasks.json`, 1 fixture vendored).

Still outstanding on #8, and the same items the prior comment flagged:

- Per-adapter `BENCH_AGENT_CMD` shims (intentionally out-of-tree per the notes on #17, the injection-lifecycle simulator + coding-quality benchmark scaffold PR)
- Populated `docs/benchmarks/v0.1.md` (requires `ANTHROPIC_API_KEY`)
- Optional online recall check (`ANTHROPIC_API_KEY`-gated)

Verification
- `go test ./...`: green across all 10 packages
- `go vet ./...`: clean
- `python3 -m py_compile` on every fixture source: clean
- `pytest -q` on each seed fixture grades as expected (failures land on the bug / missing functionality the agent is asked to fix)
- `BENCH_AGENT_CMD='echo {}' python3 scripts/bench_runner.py --tasks=refactor-cron,bugfix-utf8-truncate --modes=baseline --out=/tmp/x.md`: runner produces a complete report row for the new fixtures

Generated by Claude Code
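For reviewers who want to see why the race-counter seed grades reliably as a bug, the read-modify-write pattern described under "Seed-code grading is correct" can be sketched roughly as below. The class names `RacyCounter` and `LockedCounter`, the `hammer` helper, and the thread counts are hypothetical; the lock-based variant is just one possible fix and is not claimed to be what the fixture's acceptance suite expects.

```python
import threading
import time


class RacyCounter:
    """Seed-style counter: explicit, non-atomic read-modify-write."""

    def __init__(self) -> None:
        self._n = 0

    def increment(self) -> None:
        current = self._n      # read
        time.sleep(0)          # releases the GIL, so another thread may run here
        self._n = current + 1  # write back a possibly stale value

    @property
    def value(self) -> int:
        return self._n


class LockedCounter(RacyCounter):
    """One possible fix: serialize the whole read-modify-write with a lock."""

    def __init__(self) -> None:
        super().__init__()
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            super().increment()


def hammer(counter: RacyCounter, threads: int = 8, per_thread: int = 1000) -> int:
    """Increment the counter from several threads and return the final value."""

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


if __name__ == "__main__":
    print("racy:  ", hammer(RacyCounter()), "of", 8 * 1000)
    print("locked:", hammer(LockedCounter()), "of", 8 * 1000)
```

The racy total varies from run to run and machine to machine, so no specific figure is shown here; the locked variant always reaches the full count because no other thread can enter the critical section during the yield.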