test(bench): vendor 9 remaining benchmark fixtures for issue #8 by MiaoDX · Pull Request #20 · MiaoDX/verse-driven

MiaoDX · 2026-05-02T06:41:14Z

Continues #8 (supersedes #19). Picks up the largest mechanical bullet from the remaining-work checklist on #8: vendor the 9 fixtures referenced by docs/benchmarks/tasks.json whose absence was tracked in the issue comment.

Branch name note: this is the issue-#8 follow-up branch; the previous attempt on claude-issue-8 had a tangled history with PR #17's squash-merge and produced a content-conflict, so this PR is on claude-issue-8-fixtures (single commit off main).

What's in this PR

Nine fixtures, one convention

Each follows the bugfix-off-by-one template (PR #17): README.md describing the task in PR-description form, one source file with the seed code, and a test_*.py acceptance suite that the runner grades with pytest -q.

| id | type | hook the seed exposes |
|---|---|---|
| `refactor-cron` | refactor | parsing inlined in `schedule_job` (extract `parse_cron`) |
| `refactor-config-merge` | refactor | nested if-cascade for 3-layer merge (introduce `deep_merge`) |
| `refactor-csv-rows` | refactor | `row[0]` / `row[1]` / `row[2]` indexing (switch to namedtuple) |
| `refactor-rename` | refactor | function `do` + variable `tmp` (rename, keep public API) |
| `bugfix-utf8-truncate` | bugfix | `s.encode()[:n]` splits multi-byte codepoints |
| `bugfix-race-counter` | bugfix | explicit read-modify-write with `time.sleep(0)` between read and write |
| `feature-rate-limit` | feature | `TokenBucket` + `rate_limit` decorator stubs raising `NotImplementedError` |
| `feature-flatten-json` | feature | `flatten_json(obj, sep)` stub |
| `feature-cli-flag` | feature | working text CLI; needs opt-in `--json` flag + `render_json` helper |
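
To make one of these hooks concrete, here is a minimal sketch of the `bugfix-utf8-truncate` failure mode. The function names `truncate_naive` and `truncate_utf8` are hypothetical and are not taken from the fixture; the second function is only one possible fix, not necessarily the one the acceptance suite expects.

```python
def truncate_naive(s, n):
    """Seed-style bug: slicing the encoded bytes can cut a codepoint in half."""
    # Strict decoding raises UnicodeDecodeError when the slice ends mid-sequence.
    return s.encode("utf-8")[:n].decode("utf-8")


def truncate_utf8(s, n):
    """Truncate to at most n bytes of UTF-8 without splitting a codepoint."""
    # errors="ignore" drops the incomplete trailing sequence left by the slice.
    # (Illustrative fix; the fixture's README may ask for different behaviour.)
    return s.encode("utf-8")[:n].decode("utf-8", errors="ignore")
```

With an input such as `"h\u00e9llo"`, whose second character occupies two bytes, a byte limit of 2 lands inside that character: the naive version raises, while the tolerant version returns the longest prefix that fits.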

Seed-code grading is correct

Each fixture's seed already fails its acceptance suite, so a no-op agent run grades as a miss. Confirmed locally:

Refactor seeds pass behavioral tests but fail the structural assertions (e.g. parse_cron exists, no row[0..2] substrings remain, no module-level do / no \btmp\b token, deep_merge exists).
Bugfix and feature seeds fail the relevant behavior tests directly (UnicodeDecodeError on multi-byte split, lost increments under threads, NotImplementedError, AssertionError on missing --json).
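
The structural assertions can be pictured roughly as follows. This is a sketch under assumptions, not the fixtures' actual test code: `SEED`, `REFACTORED`, `module_functions`, and `structural_problems` are hypothetical names, and the sample source combines the `refactor-rename` and `refactor-csv-rows` hooks into one snippet purely for brevity.

```python
import ast
import re

# Hypothetical seed: behaves correctly but keeps the names and indexing
# that the refactor tasks ask the agent to remove.
SEED = '''
def do(rows):
    tmp = []
    for row in rows:
        tmp.append((row[0], row[1]))
    return tmp
'''

# Hypothetical refactored version with the same behavior.
REFACTORED = '''
def collect_pairs(rows):
    pairs = []
    for record in rows:
        pairs.append((record.name, record.value))
    return pairs
'''


def module_functions(source):
    """Return the names of functions defined at module level."""
    return {node.name for node in ast.parse(source).body
            if isinstance(node, ast.FunctionDef)}


def structural_problems(source):
    """List the structural checks that the given source still fails."""
    problems = []
    if "do" in module_functions(source):
        problems.append("module-level function `do` still present")
    if re.search(r"\btmp\b", source):
        problems.append("`tmp` token still present")
    if re.search(r"row\[[0-2]\]", source):
        problems.append("positional row[0..2] indexing still present")
    return problems
```

Under these assumptions the seed reports all three problems and the refactored source reports none, which is how a no-op run can pass every behavioral test and still grade as a miss.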

The race-counter seed is an explicit non-atomic read-modify-write with time.sleep(0) between the read and the write. This is necessary because plain self._n += 1, although not guaranteed to be atomic, rarely loses an update on CPython (the GIL seldom switches threads in the middle of the statement) and would not reliably grade as a bug. With the explicit yield, races land reliably (the test loses ~96% of updates on this machine).
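
A minimal sketch of that pattern, assuming a counter class shaped like the one described. `RacyCounter`, `LockedCounter`, and `hammer` are hypothetical names, and the lock is only one possible fix, not necessarily what the fixture's README asks for.

```python
import threading
import time


class RacyCounter:
    """Seed-style counter: the read and the write are separate steps."""

    def __init__(self):
        self._n = 0

    def increment(self):
        current = self._n      # read
        time.sleep(0)          # yield, inviting another thread to run
        self._n = current + 1  # write back a possibly stale value

    @property
    def value(self):
        return self._n


class LockedCounter(RacyCounter):
    """One possible fix: hold a lock across the whole read-modify-write."""

    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            super().increment()


def hammer(counter, threads=8, per_thread=500):
    """Call increment() from several threads and return the final count."""
    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value
```

`hammer(LockedCounter())` always returns `threads * per_thread`. `hammer(RacyCounter())` can return anything up to that total, and how far it falls short depends on the machine and interpreter version.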

Test infrastructure

internal/lifecycle/bench_pack_test.go: TestBenchTemplateFixtureExists is replaced with TestBenchFixturesPresent, which iterates every task in tasks.json and asserts the fixture dir contains a README.md, a non-test source .py, and a test_*.py. Adding a task without vendoring the fixture (or vice versa) now fails CI.
.gitignore: adds __pycache__/ and *.pyc since pytest now runs against the fixtures during local verification.
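
The presence check itself lives in Go (TestBenchFixturesPresent), but its logic can be mirrored in Python to show what it enforces. This sketch rests on assumptions: that tasks.json is a list of objects with an `id` key and that each fixture sits in a directory named after that id. Neither is confirmed here, and `missing_fixture_parts` and `check_pack` are hypothetical names. It covers only the task-to-fixture direction.

```python
import json
import tempfile
from pathlib import Path


def missing_fixture_parts(fixture_dir):
    """Return the parts of the fixture convention that fixture_dir lacks."""
    missing = []
    if not (fixture_dir / "README.md").is_file():
        missing.append("README.md")
    py_files = list(fixture_dir.glob("*.py")) if fixture_dir.is_dir() else []
    tests = [p for p in py_files if p.name.startswith("test_")]
    sources = [p for p in py_files if not p.name.startswith("test_")]
    if not sources:
        missing.append("non-test source .py")
    if not tests:
        missing.append("test_*.py")
    return missing


def check_pack(tasks_json, fixtures_root):
    """Map task id -> missing parts, for every task whose fixture is incomplete."""
    # Assumed schema: a JSON list of objects, each with an "id" key.
    tasks = json.loads(tasks_json.read_text())
    report = {}
    for task in tasks:
        missing = missing_fixture_parts(fixtures_root / task["id"])
        if missing:
            report[task["id"]] = missing
    return report
```

An empty report corresponds to the Go test passing; any entry names the task and the pieces that still need to be vendored.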

Acceptance-criteria mapping (issue #8)

This PR closes the "task pack of 10 representative coding tasks" sub-bullet, which was previously satisfied only on paper (10 entries in tasks.json, 1 fixture vendored).

Still outstanding on #8 — and the same items the prior comment flagged:

Per-adapter BENCH_AGENT_CMD shims (intentionally out-of-tree per PR #17's notes)
Populated docs/benchmarks/v0.1.md (requires ANTHROPIC_API_KEY)
Optional online recall check (requires ANTHROPIC_API_KEY)

Verification

go test ./... — green across all 10 packages
go vet ./... — clean
python3 -m py_compile on every fixture source — clean
pytest -q on each seed fixture grades as expected (failures land on the bug / missing functionality the agent is asked to fix)
BENCH_AGENT_CMD='echo {}' python3 scripts/bench_runner.py --tasks=refactor-cron,bugfix-utf8-truncate --modes=baseline --out=/tmp/x.md — runner produces a complete report row for the new fixtures

Generated by Claude Code

…pack

Authors the 4 refactor / 3 bugfix / 3 feature fixtures referenced by docs/benchmarks/tasks.json. Each fixture follows the bugfix-off-by-one template: a README.md describing the task, one source file with the seed code, and a test_*.py acceptance suite graded by pytest -q.

Seed-code grading: each fixture's seed already fails the relevant acceptance tests, so a no-op agent run is correctly graded as failing. Refactor seeds pass behavioral tests but fail the structural assertions (extracted helper exists, ambiguous names removed, no positional row indexing); bugfix and feature seeds fail behavior tests directly.

Test infrastructure:
- bench_pack_test.go: TestBenchTemplateFixtureExists is replaced by TestBenchFixturesPresent, which iterates every task in tasks.json and asserts the fixture dir contains README.md + a non-test source file + a test_*.py. Adding a task without vendoring the fixture (or vice versa) now fails CI.
- .gitignore: __pycache__/ and *.pyc, since pytest now runs against the fixtures during local verification.

Closes the second-largest remaining bullet in #8's checklist; the remaining items (per-adapter BENCH_AGENT_CMD shims, populated v0.1.md, optional ANTHROPIC_API_KEY-gated online recall check) are deliberately out-of-tree or API-key-gated and tracked on the issue.

MiaoDX marked this pull request as ready for review May 2, 2026 06:41

MiaoDX merged commit e9e2f72 into main May 2, 2026
1 check passed

MiaoDX mentioned this pull request May 2, 2026

test(critical): injection lifecycle + coding-quality regression #8

Closed

11 tasks

MiaoDX merged 1 commit into main from claude-issue-8-fixtures
