test(bench): vendor 9 remaining benchmark fixtures for issue #8 #20

Merged: MiaoDX merged 1 commit into main from claude-issue-8-fixtures on May 2, 2026

@MiaoDX (Owner) commented May 2, 2026

Continues #8 (supersedes #19). Picks up the largest mechanical bullet from the remaining-work checklist on #8: vendor the 9 fixtures referenced by docs/benchmarks/tasks.json whose absence was tracked in the issue comment.

Branch name note: this is the issue-#8 follow-up branch; the previous attempt on claude-issue-8 had a tangled history with PR #17's squash-merge and produced a content-conflict, so this PR is on claude-issue-8-fixtures (single commit off main).

What's in this PR

Nine fixtures, one convention

Each follows the bugfix-off-by-one template (PR #17): README.md describing the task in PR-description form, one source file with the seed code, and a test_*.py acceptance suite that the runner grades with pytest -q.

| id | type | hook the seed exposes |
|---|---|---|
| refactor-cron | refactor | parsing inlined in schedule_job (extract parse_cron) |
| refactor-config-merge | refactor | nested if-cascade for 3-layer merge (introduce deep_merge) |
| refactor-csv-rows | refactor | row[0] / row[1] / row[2] indexing (switch to namedtuple) |
| refactor-rename | refactor | function do + variable tmp (rename, keep public API) |
| bugfix-utf8-truncate | bugfix | s.encode()[:n] splits multi-byte codepoints |
| bugfix-race-counter | bugfix | explicit read-modify-write with time.sleep(0) between read and write |
| feature-rate-limit | feature | TokenBucket + rate_limit decorator stubs raising NotImplementedError |
| feature-flatten-json | feature | flatten_json(obj, sep) stub |
| feature-cli-flag | feature | working text CLI; needs opt-in --json flag + render_json helper |
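
To make the bugfix-utf8-truncate row concrete, here is a minimal sketch of the failure mode and one possible repair. The function names truncate_naive and truncate_utf8 are hypothetical and are not taken from the fixture; only the s.encode()[:n] pattern comes from the table above.

```python
def truncate_naive(s: str, n: int) -> str:
    """Mimics the seed pattern: slice the encoded bytes, then decode.

    Raises UnicodeDecodeError when the cut lands inside a multi-byte
    code point.
    """
    return s.encode("utf-8")[:n].decode("utf-8")


def truncate_utf8(s: str, n: int) -> str:
    """Return the longest prefix of s that fits in n UTF-8 bytes.

    One possible fix (an assumption, not the fixture's reference answer).
    """
    if n <= 0:
        return ""
    data = s.encode("utf-8")
    if len(data) <= n:
        return s
    # The input is valid UTF-8, so the only invalid bytes after slicing are
    # the incomplete trailing sequence; errors="ignore" drops exactly those.
    return data[:n].decode("utf-8", errors="ignore")
```

Here "é" occupies two bytes, so cutting "héllo" at two bytes splits it: truncate_naive raises, while truncate_utf8 falls back to the last complete code point.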

Seed-code grading is correct

Each fixture's seed already fails its acceptance suite, so a no-op agent run grades as a miss. Confirmed locally:

  • Refactor seeds pass behavioral tests but fail the structural assertions (e.g. parse_cron exists, no row[0..2] substrings remain, no module-level do / no \btmp\b token, deep_merge exists).
  • Bugfix and feature seeds fail the relevant behavior tests directly (UnicodeDecodeError on multi-byte split, lost increments under threads, NotImplementedError, AssertionError on missing --json).
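
The structural assertions for the refactor fixtures could look roughly like the sketch below. It is illustrative only: the helper names, the sample sources, and the choice of ast plus re are assumptions, not the contents of the vendored test_*.py files.

```python
import ast
import re

# Hypothetical seed and refactored sources, used only for illustration.
SEED_SOURCE = '''
def schedule_job(spec):
    tmp = spec.split()
    return tmp
'''

REFACTORED_SOURCE = '''
def parse_cron(spec):
    return spec.split()

def schedule_job(spec):
    fields = parse_cron(spec)
    return fields
'''


def defines_function(source: str, name: str) -> bool:
    """True if the module defines a top-level function called name."""
    tree = ast.parse(source)
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in tree.body
    )


def has_token(source: str, token: str) -> bool:
    """True if token appears as a whole word (the \\btmp\\b style check)."""
    return re.search(rf"\b{re.escape(token)}\b", source) is not None
```

With checks of this shape, the seed passes any behavioral test of schedule_job but fails on the missing parse_cron and the leftover tmp token, which is what makes a no-op run grade as a miss.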

The race-counter seed is an explicit non-atomic read-modify-write with time.sleep(0) between the read and the write. This is necessary because plain self._n += 1, although not guaranteed atomic, rarely interleaves on CPython under the GIL and would not reliably grade as a bug. With the explicit yield, the race lands reliably (the test loses ~96% of updates on this machine).
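
A minimal sketch of the pattern described above, next to a lock-based variant. The class and function names (RacyCounter, LockedCounter, hammer) are hypothetical and the thread counts are arbitrary; the fixture's actual seed and tests may differ.

```python
import threading
import time


class RacyCounter:
    """Read-modify-write with an explicit yield between read and write."""

    def __init__(self) -> None:
        self._n = 0

    def increment(self) -> None:
        current = self._n      # read
        time.sleep(0)          # give other threads a chance to run
        self._n = current + 1  # write back a possibly stale value

    @property
    def value(self) -> int:
        return self._n


class LockedCounter(RacyCounter):
    """Same steps, but the whole read-modify-write holds a lock."""

    def __init__(self) -> None:
        super().__init__()
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            super().increment()


def hammer(counter: RacyCounter, threads: int = 8, per_thread: int = 500) -> int:
    """Call increment() from several threads and return the final value."""

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value
```

hammer(LockedCounter()) always returns threads * per_thread, while hammer(RacyCounter()) can return less; how many updates are lost depends on the machine and interpreter.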

Test infrastructure

  • internal/lifecycle/bench_pack_test.go: TestBenchTemplateFixtureExists is replaced with TestBenchFixturesPresent, which iterates every task in tasks.json and asserts the fixture dir contains a README.md, a non-test source .py, and a test_*.py. Adding a task without vendoring the fixture (or vice versa) now fails CI.
  • .gitignore: adds __pycache__/ and *.pyc since pytest now runs against the fixtures during local verification.

Acceptance-criteria mapping (issue #8)

This PR closes the "task pack of 10 representative coding tasks" sub-bullet, which was previously satisfied only on paper (10 entries in tasks.json, 1 fixture vendored).

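Still outstanding on #8, and the same items the prior comment flagged: per-adapter BENCH_AGENT_CMD shims, a populated v0.1.md, and the optional ANTHROPIC_API_KEY-gated online recall check.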

Verification

  • go test ./... — green across all 10 packages
  • go vet ./... — clean
  • python3 -m py_compile on every fixture source — clean
  • pytest -q on each seed fixture grades as expected (failures land on the bug / missing functionality the agent is asked to fix)
  • BENCH_AGENT_CMD='echo {}' python3 scripts/bench_runner.py --tasks=refactor-cron,bugfix-utf8-truncate --modes=baseline --out=/tmp/x.md — runner produces a complete report row for the new fixtures

Generated by Claude Code

…pack

Authors the remaining 4 refactor / 2 bugfix / 3 feature fixtures referenced by
docs/benchmarks/tasks.json. Each fixture follows the bugfix-off-by-one
template: a README.md describing the task, one source file with the
seed code, and a test_*.py acceptance suite graded by pytest -q.

Seed-code grading: each fixture's seed already fails the relevant
acceptance tests, so a no-op agent run is correctly graded as failing.
Refactor seeds pass behavioral tests but fail the structural assertions
(extracted helper exists, ambiguous names removed, no positional row
indexing); bugfix and feature seeds fail behavior tests directly.

Test infrastructure:
- bench_pack_test.go: TestBenchTemplateFixtureExists is replaced by
  TestBenchFixturesPresent, which iterates every task in tasks.json and
  asserts the fixture dir contains README.md + a non-test source file +
  a test_*.py. Adding a task without vendoring the fixture (or vice
  versa) now fails CI.
- .gitignore: __pycache__/ and *.pyc, since pytest now runs against the
  fixtures during local verification.

Closes the second-largest remaining bullet in #8's checklist; the
remaining items (per-adapter BENCH_AGENT_CMD shims, populated v0.1.md,
optional ANTHROPIC_API_KEY-gated online recall check) are deliberately
out-of-tree or API-key-gated and tracked on the issue.
@MiaoDX MiaoDX marked this pull request as ready for review May 2, 2026 06:41
@MiaoDX MiaoDX merged commit e9e2f72 into main May 2, 2026
1 check passed