
Gsd/auto bench #11

Merged
PythonFZ merged 20 commits into main from gsd/auto-bench
Mar 9, 2026

Conversation

@PythonFZ
Member

@PythonFZ PythonFZ commented Mar 9, 2026

Summary by CodeRabbit

  • New Features

    • Dedicated CI benchmark workflow that runs after tests on main, publishes historical results to a GitHub Pages dashboard, and provides PR comparison reports (alerts on large regressions, default 150% threshold).
  • Chores

    • Removed legacy local benchmark cache and visualization tooling; added ignore rules to exclude machine-specific benchmark data.
  • Documentation

    • Added planning, requirements, roadmap, validation, and research docs describing the benchmark pipeline and PR feedback process.
  • Tests

    • Adjusted benchmark tests to limit per-run work and improve per-run isolation/cleanup.

PythonFZ and others added 19 commits March 9, 2026 15:43
- Create benchmark.yml with workflow_run trigger on Tests+main
- Single Python 3.13 job with MongoDB+Redis services
- github-action-benchmark auto-push to gh-pages at dev/bench
- Document CI-01 (gh-pages setup) and CI-04 (release strategy)
- Add .benchmarks/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
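The committed `benchmark.yml` is not reproduced in this thread; a minimal skeleton consistent with the commit bullets above might look like the following sketch. Service images, action versions, and the install/run commands are assumptions for illustration, not the actual file contents:

```yaml
# Hypothetical skeleton of benchmark.yml based on the commit description.
name: Benchmark
on:
  workflow_run:
    workflows: ["Tests"]
    types: [completed]
    branches: [main]

jobs:
  benchmark:
    # Only run when the upstream Tests workflow succeeded
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    services:
      mongodb:
        image: mongo
        ports: ["27017:27017"]
      redis:
        image: redis
        ports: ["6379:6379"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install -e . pytest pytest-benchmark
      - run: pytest --benchmark-only --benchmark-json=benchmark_results.json
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: pytest
          output-file-path: benchmark_results.json
          gh-pages-branch: gh-pages
          benchmark-data-dir-path: dev/bench
          auto-push: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
```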
… files

- Remove Run benchmarks, Visualize benchmarks, Upload benchmark results steps
- Delete docs/visualize_benchmarks.py (superseded by gh-pages dashboard)
- Delete .benchmarks/ local cache directory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 05-01-SUMMARY.md with execution results
- Update STATE.md with plan completion and decisions
- Update ROADMAP.md with phase 5 progress
- Mark CI-01 through CI-04 requirements complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pull_request trigger (opened, synchronize) alongside workflow_run
- Add concurrency group to cancel in-progress PR benchmark runs
- Split benchmark-action into main (auto-push) and PR (compare-only) steps
- PR step: summary-always, comment-on-alert, fail-on-alert at 150% threshold
- PR step: auto-push false, save-data-file false to avoid polluting gh-pages
- Document PR behavior and branch protection setup in header comments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
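The PR-only comparison step described in the bullets above could be sketched roughly as follows (a hedged example: the step name and surrounding job are assumptions, while the option names match those of benchmark-action/github-action-benchmark):

```yaml
# Sketch of the compare-only step for pull_request events (not the actual file).
- name: Compare benchmarks against main baseline (PR only)
  if: github.event_name == 'pull_request'
  uses: benchmark-action/github-action-benchmark@v1
  with:
    tool: pytest
    output-file-path: benchmark_results.json
    summary-always: true        # always post a comparison summary
    comment-on-alert: true      # comment on the PR when the threshold trips
    fail-on-alert: true         # fail the check on a large regression
    alert-threshold: "150%"
    auto-push: false            # never write PR results to gh-pages
    save-data-file: false
    github-token: ${{ secrets.GITHUB_TOKEN }}
```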
@coderabbitai

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

This pull request separates benchmarking into a dedicated CI workflow, removes legacy local benchmark artifacts and visualization, updates ignore rules, and adds extensive planning, research, validation, and roadmap documentation for benchmark and PR-feedback phases.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CI Workflow Setup**<br>`.github/workflows/benchmark.yml`, `.github/workflows/tests.yml` | Added a new post-test benchmark workflow triggered via workflow_run (and PR planning), with Redis/MongoDB services and pytest-benchmark execution; removed benchmark, visualization, and upload steps from the tests workflow. |
| **Ignored Artifacts**<br>`.gitignore` | Added `.benchmarks/` and related benchmark artifacts to ignore rules. |
| **Legacy Artifacts Removed**<br>`docs/visualize_benchmarks.py`, `.benchmarks/Darwin-CPython-3.11-64bit/0001_baseline.json`, `.benchmarks/Darwin-CPython-3.11-64bit/0002_perf_analysis.json` | Deleted the local benchmark visualization script and the committed benchmark JSON files / cache. |
| **Tests: Benchmarks Adjustment**<br>`tests/benchmarks/test_bench_write.py` | Introduced per-run isolation and a WRITE_SINGLE_FRAMES cap (constant) to limit CI runtime; switched to UUID-based unique output paths and improved DB-instance teardown. |
| **Planning: Phase 5 (Benchmark Pipeline)**<br>`.planning/phases/05-benchmark-pipeline/*`, `.planning/PROJECT.md`, `.planning/REQUIREMENTS.md`, `.planning/ROADMAP.md`, `.planning/STATE.md` | Added comprehensive Phase 5 planning, requirements, roadmap, state updates, validation, verification, and research artifacts describing the dedicated benchmark workflow, gh-pages integration, and rollout tasks. |
| **Planning: Phase 6 (PR Feedback)**<br>`.planning/phases/06-pr-feedback/*` | Added Phase 6 planning and research for PR benchmark comparisons, concurrency, threshold alerts, and workflow guard logic. |
| **Research & Architecture**<br>`.planning/research/ARCHITECTURE.md`, `.planning/research/FEATURES.md`, `.planning/research/PITFALLS.md`, `.planning/research/STACK.md`, `.planning/research/SUMMARY.md` | Reframed repository research and architecture toward CI benchmarking infrastructure, tooling choices (github-action-benchmark), risks, and deployment guidance. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Tests as Tests Workflow
    participant Runner as CI Runner
    participant Services as Redis/MongoDB
    participant Bench as Benchmark Workflow
    participant GH as github-action-benchmark (gh-pages)
    Tests->>Runner: Complete matrix tests & upload artifacts
    Tests-->>Bench: Trigger workflow_run (on: workflow_run)
    Bench->>Services: Start/health-check containers
    Bench->>Runner: Setup Python, install package
    Bench->>Runner: Run pytest --benchmark-only -> produces JSON
    Runner->>GH: For main -> publish results to gh-pages
    Runner->>GH: For PR -> compare against baseline, post summary/alerts
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I nibbled through the CI hay,
Separated benchmarks to their own way,
No more clutter, dashboards gleam,
Pushed to gh-pages — a rabbit's dream! 🥕📈

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Gsd/auto bench' is vague and does not clearly convey the main purpose of this extensive pull request, which introduces CI benchmark infrastructure, workflow automation, and planning documentation. | Consider a more descriptive title such as 'Set up automated benchmark CI workflow and infrastructure' or 'Implement benchmark pipeline with GitHub Actions' to better reflect the substantial changes being made. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (6)
.planning/REQUIREMENTS.md (1)

21-21: PR-03 may already be implemented.

benchmark.yml line 107 sets fail-on-alert: true, which blocks PR merges on benchmark regressions. This appears to implement PR-03 (fail-on-regression gate). Consider marking this complete or clarifying if additional work is needed for Phase 6.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/REQUIREMENTS.md at line 21, The PR-03 item in REQUIREMENTS.md
appears outdated because benchmark.yml already enforces a fail-on-regression
gate via the setting fail-on-alert: true; update the PR-03 checklist entry to
mark it complete or add clarifying text that references the existing
implementation (benchmark.yml's fail-on-alert setting) and whether any
additional Phase 6 enhancements are required so reviewers know this requirement
is satisfied or what remains.
.github/workflows/benchmark.yml (1)

27-28: Consider: PR benchmarks run independently of test results.

The pull_request trigger runs benchmarks directly without waiting for the Tests workflow to pass. If tests fail, benchmark resources are wasted. This is a design trade-off favoring faster feedback.

To chain PR benchmarks after tests, you could use workflow_run for PRs too, but this adds complexity around accessing PR context from workflow_run. The current approach is simpler and acceptable for most projects.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/benchmark.yml around lines 27 - 28, The current benchmark
workflow uses the pull_request trigger (pull_request: types: [opened,
synchronize]) which runs benchmarks immediately; to run benchmarks only after
tests pass, replace the pull_request trigger with a workflow_run trigger that
listens for the Tests workflow completion (e.g., workflow_run: workflows:
["Tests"] types: [completed]) and conditionally run when
github.event.workflow_run.conclusion == 'success'; when you need PR context in
that workflow, read PR info from github.event.workflow_run.pull_requests (or
fetch via the GitHub API using the run's commit/branch) to avoid losing PR
association.
.planning/phases/06-pr-feedback/06-VALIDATION.md (1)

41-42: Automate the PR-02 threshold check instead of treating it as manual-only.

Verifying a fixed YAML value is a cheap deterministic check, so keeping PR-02 manual weakens the validation contract for no real benefit. A grep/yq-style assertion here would also align better with the sign-off item about automated verification coverage.

Proposed update
```diff
-| 06-01-02 | 01 | 1 | PR-02 | manual | Verify `alert-threshold: "150%"` in YAML | N/A | ⬜ pending |
+| 06-01-02 | 01 | 1 | PR-02 | automated | `rg -n 'alert-threshold:\s*"150%"' .github/workflows/benchmark.yml` | N/A | ⬜ pending |
...
-| Alert threshold is configurable at 150% | PR-02 | Configuration verification in YAML | Read `alert-threshold` value in workflow file |
+| Alert threshold is configurable at 150% | PR-02 | None — static workflow config is directly checkable | Assert `alert-threshold: "150%"` is present in `.github/workflows/benchmark.yml` |
```

Also applies to: 60-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/phases/06-pr-feedback/06-VALIDATION.md around lines 41 - 42, The
PR-02 validation entry currently marked as manual should be changed to an
automated check: update the test matrix row for PR-02 so the "manual" column
becomes "automated"/"auto" and the description notes an automated YAML
assertion; implement a deterministic assertion that verifies the YAML contains
alert-threshold: "150%" (e.g., a yq/grep check in the validation script invoked
by the pipeline) and link that automation as the verification step for test id
PR-02 so the table and pipeline reflect the automated coverage for the
alert-threshold check.
.planning/STATE.md (2)

35-41: Consider standardizing duration formatting.

The duration format "1min" appears twice (lines 39-40). Standard formatting would be "1 min" (with space), "1m", or "1 minute" for better readability in planning documents.

📝 Suggested formatting improvement
```diff
 **Velocity:**
 - Total plans completed: 1 (v0.3.1)
-- Average duration: 1min
-- Total execution time: 1min
+- Average duration: 1 min
+- Total execution time: 1 min
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/STATE.md around lines 35 - 41, Update the duration formatting in
the "Performance Metrics" -> "Velocity" block by replacing the repeated "1min"
strings used for "Average duration" and "Total execution time" with a
standardized, readable format (e.g., "1 min" or "1 minute"); locate the "Average
duration" and "Total execution time" entries and change their values to the
chosen format so both lines match consistently.

10-31: Clarify dual phase numbering system.

The document uses two different phase numbering systems that may cause confusion:

  • Lines 10-13: Progress shows "3 total phases, 1 completed" (milestone-specific)
  • Line 28: "Phase: 5 of 7" (project-wide numbering)

While this dual system is internally consistent (Phase 5 is the first phase of v0.3.1 milestone), future readers may find it confusing. Consider adding a brief note explaining that "total_phases" refers to the current milestone's phases, while the "Phase: X of Y" notation refers to overall project phases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/STATE.md around lines 10 - 31, The document shows two phase
numbering systems (the YAML keys total_phases/completed_phases/percent and the
human-readable "Phase: 5 of 7" line) which can confuse readers; add a single
clarifying sentence or parenthetical near the top (e.g., after the "Project
State" header or directly beneath the Project Reference) that explains that
total_phases/completed_phases refer to the current milestone's phases while the
"Phase: X of Y" line denotes overall project phases, and update any adjacent
wording to make this distinction explicit (reference total_phases,
completed_phases, and the "Phase: 5 of 7" text so reviewers can locate the
change).
.planning/phases/06-pr-feedback/06-01-PLAN.md (1)

136-136: Consider breaking verification script into multiple lines for readability.

The verification script is a 400+ character one-liner with 8 chained grep commands. While functionally correct, splitting it into multiple lines with comments would improve maintainability and make it easier to identify which check failed.

♻️ More readable verification script
```bash
#!/bin/bash
# Verify benchmark.yml has all required PR comparison elements

cd /Users/fzills/tools/asebytes

echo "Checking for pull_request trigger..."
grep -q 'pull_request' .github/workflows/benchmark.yml || exit 1

echo "Checking for summary-always option..."
grep -q 'summary-always: true' .github/workflows/benchmark.yml || exit 1

echo "Checking for fail-on-alert option..."
grep -q 'fail-on-alert: true' .github/workflows/benchmark.yml || exit 1

echo "Checking for alert-threshold..."
grep -q 'alert-threshold' .github/workflows/benchmark.yml || exit 1

echo "Checking for auto-push: false (PR mode)..."
grep -q 'auto-push: false' .github/workflows/benchmark.yml || exit 1

echo "Checking for save-data-file: false..."
grep -q 'save-data-file: false' .github/workflows/benchmark.yml || exit 1

echo "Checking for concurrency group..."
grep -q 'concurrency' .github/workflows/benchmark.yml || exit 1

echo "ALL CHECKS PASSED"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/phases/06-pr-feedback/06-01-PLAN.md at line 136, The one-line
verification command in the automated block should be split into a multi-line
shell script for readability and debuggability: replace the long chained command
(the current automated line containing multiple grep -q checks) with a bash
script that cd's to the repo, runs each grep -q as its own step with a preceding
echo message and a short comment, and uses "|| exit 1" after each grep so
failures surface immediately; keep the same exact grep patterns ('pull_request',
'summary-always: true', 'fail-on-alert: true', 'alert-threshold', 'auto-push:
false', 'save-data-file: false', 'concurrency') and ensure a final echo "ALL
CHECKS PASSED" is printed on success to preserve behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In @.planning/phases/05-benchmark-pipeline/05-VERIFICATION.md:
- Around line 22-25: The verification doc is out of date: update
.planning/phases/05-benchmark-pipeline/05-VERIFICATION.md so its statements
about the benchmark workflow match the current .github/workflows/benchmark.yml
(which now includes a pull_request trigger and a PR-only comparison step) —
either mark the listed items (e.g., the “PR does NOT trigger benchmark workflow”
entry and related lines referenced in the diff) as historical snapshots, or
change their VERIFIED status and text to reflect that benchmark.yml contains a
pull_request trigger and PR-only comparison; apply the same updates to the other
affected sections called out (lines referenced as 35-36, 43-45, 50-53) so the
document consistently matches the branch behavior.

In @.planning/phases/06-pr-feedback/06-CONTEXT.md:
- Around line 17-21: The concurrency group currently uses 'benchmark-${{
github.event.pull_request.number }}' which is undefined for workflow_run events
and causes unrelated runs to collide; change the concurrency group to use a
fallback expression so when github.event.pull_request.number is missing it falls
back to a stable per-run identifier (for example github.run_id or
github.ref/sha). Update the concurrency: group value to incorporate that
fallback (keeping the 'benchmark-' prefix) so both pull_request and workflow_run
triggers produce unique, non-conflicting keys.
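One way to write that fallback (a sketch; the repo's actual group naming may differ):

```yaml
concurrency:
  # github.event.pull_request.number is empty for workflow_run events, so
  # fall back to the always-unique run id to avoid key collisions between
  # unrelated runs while still cancelling superseded PR runs.
  group: benchmark-${{ github.event.pull_request.number || github.run_id }}
  cancel-in-progress: true
```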

In @.planning/REQUIREMENTS.md:
- Line 15: The CI-04 check is marked complete but the benchmark workflow lacks
release/tag triggers; either update the REQUIREMENTS.md entry for CI-04 to
"pending" or modify the benchmark workflow (benchmark.yml) to include release
and tag push triggers (add an on: release: types: [published] and/or on: push:
tags: ['*']) so that release/tag events will run the benchmark and publish the
snapshot to gh-pages; update the CI-04 status in .planning/REQUIREMENTS.md if
you choose not to change the workflow.
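If the workflow route (rather than the documentation-only route) were chosen, the extra triggers could be sketched as follows; this fragment is hypothetical and is not in the current benchmark.yml:

```yaml
# Hypothetical additional triggers for CI-04 (release/tag benchmark snapshots)
on:
  release:
    types: [published]
  push:
    tags: ["*"]
```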

In @.planning/research/ARCHITECTURE.md:
- Around line 112-120: The ARCHITECTURE.md section uses an `auto-push`
conditional of `github.event_name == 'push' && github.ref == 'refs/heads/main'`
which conflicts with the Phase 6 plan that uses `workflow_run` for main-branch
benchmarks and `pull_request` for PRs; update the ARCHITECTURE.md text to match
the Phase 6 approach (or vice versa) by replacing the current `auto-push`
conditional with the `workflow_run`-based logic (`github.event_name ==
'workflow_run'` for main branch runs triggered after Tests and `pull_request`
for PR runs) and explicitly document that `comment-on-alert: true` remains
enabled for PR comparisons so readers can trace the `workflow_run` vs
`pull_request` trigger strategy consistently with the Phase 6 plan.
- Around line 122-125: Decision 4 in ARCHITECTURE.md claims we kept
visualize_benchmarks.py, but the file docs/visualize_benchmarks.py is missing
and docs/README.md still references running it; either restore the script and
its artifact-upload behavior and add it back at docs/visualize_benchmarks.py
(ensuring it accepts benchmark_results.json as the CLI arg and is documented in
docs/README.md), or change Decision 4 text in .planning/research/ARCHITECTURE.md
and remove or update the docs/README.md reference so both documentation and
architecture reflect that the visualization script was removed; update whichever
files you change to keep the repository consistent (Decision 4, docs/README.md,
and presence or absence of docs/visualize_benchmarks.py).

In @.planning/research/FEATURES.md:
- Line 84: Update the stale reference in the FEATURES.md line that mentions
"`--benchmark-json=benchmark_results.json` in CI (tests.yml line 61)`" so it
points to the new workflow location: replace the `tests.yml line 61` reference
with the appropriate `benchmark.yml` workflow (or remove the line-number mention
and reference `benchmark.yml`/the benchmark workflow step), ensuring the text
still mentions `--benchmark-json=benchmark_results.json` as the direct input to
github-action-benchmark `output-file-path`.

In @.planning/ROADMAP.md:
- Around line 87-89: The "Benchmark Pipeline" (Phase 5) table row is missing its
Milestone cell, causing column misalignment; update the Phase 5 row (the line
containing "5. Benchmark Pipeline | 1/1 | Complete   | 2026-03-09 | -") to
include the Milestone value (match the format used by lines with "v0.3.1") so
all pipes align and the table columns are consistent.

---


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c48e66ed-ebfe-4379-a279-afa1e27deda6

📥 Commits

Reviewing files that changed from the base of the PR and between 7b4b9c9 and 5c674d7.

📒 Files selected for processing (25)
  • .benchmarks/Darwin-CPython-3.11-64bit/0001_baseline.json
  • .benchmarks/Darwin-CPython-3.11-64bit/0002_perf_analysis.json
  • .github/workflows/benchmark.yml
  • .github/workflows/tests.yml
  • .gitignore
  • .planning/PROJECT.md
  • .planning/REQUIREMENTS.md
  • .planning/ROADMAP.md
  • .planning/STATE.md
  • .planning/phases/05-benchmark-pipeline/05-01-PLAN.md
  • .planning/phases/05-benchmark-pipeline/05-01-SUMMARY.md
  • .planning/phases/05-benchmark-pipeline/05-CONTEXT.md
  • .planning/phases/05-benchmark-pipeline/05-RESEARCH.md
  • .planning/phases/05-benchmark-pipeline/05-VALIDATION.md
  • .planning/phases/05-benchmark-pipeline/05-VERIFICATION.md
  • .planning/phases/06-pr-feedback/06-01-PLAN.md
  • .planning/phases/06-pr-feedback/06-CONTEXT.md
  • .planning/phases/06-pr-feedback/06-RESEARCH.md
  • .planning/phases/06-pr-feedback/06-VALIDATION.md
  • .planning/research/ARCHITECTURE.md
  • .planning/research/FEATURES.md
  • .planning/research/PITFALLS.md
  • .planning/research/STACK.md
  • .planning/research/SUMMARY.md
  • docs/visualize_benchmarks.py
💤 Files with no reviewable changes (4)
  • .benchmarks/Darwin-CPython-3.11-64bit/0001_baseline.json
  • .benchmarks/Darwin-CPython-3.11-64bit/0002_perf_analysis.json
  • .github/workflows/tests.yml
  • docs/visualize_benchmarks.py

Comment on lines +13 to +14
```yaml
# CI-04: Release/tag events do NOT get a separate benchmark run. Every push to
# main updates the gh-pages dashboard, so releases inherit the latest baseline.
```

⚠️ Potential issue | 🟡 Minor

Contradiction between CI-04 documentation and REQUIREMENTS.md.

The workflow comment states release/tag events do NOT get a separate benchmark run, but .planning/REQUIREMENTS.md line 15 marks CI-04: Release/tag events trigger a benchmark snapshot on gh-pages as complete. One of these needs to be corrected to maintain consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/benchmark.yml around lines 13 - 14, There's a
contradiction between the CI-04 description in the benchmark workflow header and
the CI-04 entry in REQUIREMENTS.md; pick the intended behavior and make them
consistent: either update the CI-04 comment in the benchmark workflow file to
state that release/tag events DO trigger a benchmark snapshot on gh-pages, or
update the CI-04 entry in REQUIREMENTS.md to mark that release/tag events do NOT
get a separate benchmark run. Locate the CI-04 comment in the workflow header
and the CI-04 line in REQUIREMENTS.md and edit the one that is incorrect so both
documents state the same behavior.

Comment on lines +52 to +55
<execution_context>
@/Users/fzills/.claude/get-shit-done/workflows/execute-plan.md
@/Users/fzills/.claude/get-shit-done/templates/summary.md
</execution_context>

⚠️ Potential issue | 🟠 Major

Remove machine-specific absolute paths from the committed plan.

These paths leak the local username/workstation layout and are unusable for anyone else reading or replaying the plan. Please replace them with repo-relative references or plain document identifiers instead of committing /Users/... paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/phases/05-benchmark-pipeline/05-01-PLAN.md around lines 52 - 55,
The execution_context block contains machine-specific absolute paths (the
strings starting with /Users/...) which must be replaced with repo-relative
references or plain document identifiers; edit the execution_context entry to
remove those absolute paths and use repo-relative paths like
./workflows/execute-plan.md and ./templates/summary.md or simple identifiers
(e.g., "workflows/execute-plan" and "templates/summary") so the plan contains no
user-specific filesystem information.


The user explicitly decided against release-triggered benchmark runs -- every main push updates the dashboard, so releases just link to the existing data. CI-04 is satisfied by documentation rather than a separate release workflow trigger.

**Primary recommendation:** Use `workflow_run` trigger with `actions/download-artifact@v4` (cross-workflow download via `run-id`) to decouple benchmarks from the test matrix, then `benchmark-action/github-action-benchmark@v1` with `auto-push: true` to commit results to gh-pages.

⚠️ Potential issue | 🟠 Major

Remove the obsolete artifact-handoff architecture from this research doc.

The current recommendation says to pass benchmark JSON across workflows, but the later “IMPORTANT DECISION” section and the actual cleaned tests.yml both say benchmark.yml runs benchmarks directly. Keeping both models in the same document makes the research guidance self-contradictory and easy to implement incorrectly.

Also applies to: 70-72, 89-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/phases/05-benchmark-pipeline/05-RESEARCH.md at line 13, Remove the
obsolete "artifact-handoff" architecture text from the research doc: delete or
rewrite the "Primary recommendation" paragraph that advises passing benchmark
JSON across workflows (the sentence starting "Use `workflow_run` trigger with
`actions/download-artifact@v4`...") and any other conflicting passages that
describe cross-workflow artifact handoff (also remove the blocks referenced
around lines 70-72 and 89-101). Replace those with a clear single recommendation
that `benchmark.yml` runs benchmarks directly (matching the "IMPORTANT DECISION"
and the cleaned `tests.yml`), and mention using
`benchmark-action/github-action-benchmark@v1` with `auto-push: true` to publish
results to gh-pages if needed; ensure the document no longer presents both
models side-by-side.

Comment on lines +22 to +25
| 1 | Pushing to main triggers a benchmark workflow after tests pass | VERIFIED | benchmark.yml: `workflow_run` on `["Tests"]`, `types: [completed]`, `branches: [main]`, condition `github.event.workflow_run.conclusion == 'success'` |
| 2 | Benchmark results are auto-pushed to gh-pages at /dev/bench/ | VERIFIED | benchmark.yml: `auto-push: true`, `gh-pages-branch: gh-pages`, `benchmark-data-dir-path: dev/bench` |
| 3 | Opening or updating a PR does NOT trigger benchmark workflow | VERIFIED | benchmark.yml only has `workflow_run` trigger with `branches: [main]` -- no `pull_request` trigger |
| 4 | Release/tag behavior is documented (no separate trigger; main pushes cover it) | VERIFIED | benchmark.yml lines 7-8: CI-04 comment explaining releases inherit latest baseline |
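The settings verified in the table can be assembled into a minimal `benchmark.yml` sketch. The trigger, success condition, and publish fields come from the evidence column above; the checkout and benchmark-run steps are assumptions about the surrounding layout:

```yaml
name: Benchmark

on:
  workflow_run:
    workflows: ["Tests"]
    types: [completed]
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    # Only run when the upstream Tests workflow succeeded
    if: github.event.workflow_run.conclusion == 'success'
    steps:
      - uses: actions/checkout@v4
      # ... run pytest-benchmark here, producing benchmark_results.json ...
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          output-file-path: benchmark_results.json
          gh-pages-branch: gh-pages
          benchmark-data-dir-path: dev/bench
          auto-push: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
```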

⚠️ Potential issue | 🟠 Major

This verification report is stale against the workflow shipped in this branch.

The current .github/workflows/benchmark.yml already includes a pull_request trigger and a PR-only comparison step, so statements like “PR does NOT trigger benchmark workflow” and “benchmark.yml only has workflow_run” are no longer true. Please either snapshot this as a historical Phase 5 artifact or update the verified truths/status so the report matches the actual branch contents.

Also applies to: 35-36, 43-45, 50-53

🧰 Tools
🪛 LanguageTool

[uncategorized] ~22-~22: The official name of this software platform is spelled with a capital “H”.
Context: ...pleted], branches: [main], condition github.event.workflow_run.conclusion == 'succe...

(GITHUB)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/phases/05-benchmark-pipeline/05-VERIFICATION.md around lines 22 -
25, The verification doc is out of date: update
.planning/phases/05-benchmark-pipeline/05-VERIFICATION.md so its statements
about the benchmark workflow match the current .github/workflows/benchmark.yml
(which now includes a pull_request trigger and a PR-only comparison step) —
either mark the listed items (e.g., the “PR does NOT trigger benchmark workflow”
entry and related lines referenced in the diff) as historical snapshots, or
change their VERIFIED status and text to reflect that benchmark.yml contains a
pull_request trigger and PR-only comparison; apply the same updates to the other
affected sections called out (lines referenced as 35-36, 43-45, 50-53) so the
document consistently matches the branch behavior.

Comment on lines +17 to +21
- Add `pull_request` trigger (opened + synchronize) to existing `benchmark.yml` — same file handles both main and PR flows
- Main pushes continue using `workflow_run` trigger (existing Phase 5 behavior)
- PR benchmarks run independently — do NOT wait for tests to pass
- Concurrency cancel per PR number: `concurrency: { group: 'benchmark-${{ github.event.pull_request.number }}', cancel-in-progress: true }`
- Use `if` conditions on `github.event_name` to distinguish main vs PR behavior at the step level

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "*.yml" -o -name "*.yaml" | grep -i benchmark | head -20

Repository: zincware/asebytes

Length of output: 94


🌐 Web query:

GitHub Actions github.event.pull_request.number undefined workflow_run trigger

💡 Result:

github.event.pull_request.number is only populated for workflows triggered by pull_request / pull_request_target. When your workflow is triggered by workflow_run, the event payload is a workflow_run payload, so github.event.pull_request is typically undefined. [1]

If the completed run was associated with a PR

Use the PR list attached to the run:

${{ github.event.workflow_run.pull_requests[0].number }}

The workflow-run payload (and the workflow-runs API response) exposes a pull_requests array for runs whose head_sha/head_branch matches an open PR. [2]

If pull_requests is empty (common with forks / some setups)

This is a known limitation: workflow_run.pull_requests can be empty (especially for PRs from forks), so you must pass the PR number yourself (artifact/output) from the first workflow or resolve it via API using the run’s head_sha. [3]

Practical pattern: In workflow A (triggered by pull_request), save ${{ github.event.pull_request.number }} as an artifact; in workflow B (workflow_run), download it and use that PR number. GitHub’s own docs show this “upload PR number then use it in workflow_run” approach. [1]
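A minimal sketch of that upload-then-download handoff, assuming `actions/upload-artifact@v4` and `actions/download-artifact@v4` (v4 supports cross-run download via `run-id` plus a token); step names and file names are illustrative:

```yaml
# Workflow A (triggered by pull_request): persist the PR number
- run: echo "${{ github.event.pull_request.number }}" > pr_number.txt
- uses: actions/upload-artifact@v4
  with:
    name: pr-number
    path: pr_number.txt

# Workflow B (triggered by workflow_run): recover it from run A
- uses: actions/download-artifact@v4
  with:
    name: pr-number
    run-id: ${{ github.event.workflow_run.id }}
    github-token: ${{ secrets.GITHUB_TOKEN }}
- id: pr
  run: echo "number=$(cat pr_number.txt)" >> "$GITHUB_OUTPUT"
```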

Sources

[1] GitHub Docs – Events that trigger workflows (workflow_run example passing PR number via artifact) (docs.github.com)
[2] Workflow run fields (pull_requests array) (github-provider.stackql.io)
[3] GitHub Community – workflow_run.pull_requests empty for fork PRs / limitations (github.com)

Citations:


🌐 Web query:

GitHub Actions concurrency group mixed triggers pull_request workflow_run

💡 Result:

concurrency.group is just a string (or expression) evaluated per workflow run, regardless of which trigger (pull_request, workflow_run, etc.) started it. If the expression resolves to different values across triggers, the runs won’t share a concurrency bucket and therefore won’t cancel/serialize each other. (docs.github.com)

Why this bites with pull_request + workflow_run

  • For pull_request, people often use github.ref / github.head_ref. But github.ref is typically refs/pull/<n>/merge, not the branch name.
  • For workflow_run, the triggered workflow runs using the workflow file from the default branch, and github.ref is commonly that default-branch ref—not the PR ref—so github.ref no longer matches what you got under pull_request. (stackoverflow.com)

A robust pattern: group by PR number (else head branch)

Use the PR number when available; otherwise fall back to the originating branch from the workflow_run payload:

concurrency:
  group: ci-${{
    github.event.pull_request.number
    || github.event.workflow_run.pull_requests[0].number
    || format('{0}:{1}',
        github.event.workflow_run.head_repository.owner.login,
        github.event.workflow_run.head_branch)
    || github.head_ref
    || github.ref_name
  }}
  cancel-in-progress: true

Notes:

  • github.event.workflow_run.head_branch is the branch that triggered the upstream workflow. (stackoverflow.com)
  • workflow_run.pull_requests[0] is convenient when populated, but it may be missing/empty in some cases (notably fork-related scenarios), so keep a fallback (like owner:branch). (github.com)
  • The expression uses standard contexts/expressions supported in workflow syntax. (docs.github.com)

Citations:


🏁 Script executed:

cat -n .github/workflows/benchmark.yml | head -60

Repository: zincware/asebytes

Length of output: 2220


🏁 Script executed:

wc -l .github/workflows/benchmark.yml

Repository: zincware/asebytes

Length of output: 96


Use a fallback expression for the concurrency group to handle mixed triggers safely.

The guidance in lines 19–20 prescribes using only github.event.pull_request.number in the concurrency group, but this workflow also listens to workflow_run. The github.event.pull_request.number context is undefined for workflow_run events, causing all non-PR runs to share the same empty concurrency key and unintentionally cancel each other. Use a fallback that works across both triggers:

Safer pattern
-- Concurrency cancel per PR number: `concurrency: { group: 'benchmark-${{ github.event.pull_request.number }}', cancel-in-progress: true }`
+- Concurrency cancel per event: `concurrency: { group: 'benchmark-${{ github.event.pull_request.number || github.sha }}', cancel-in-progress: true }`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Add `pull_request` trigger (opened + synchronize) to existing `benchmark.yml` — same file handles both main and PR flows
- Main pushes continue using `workflow_run` trigger (existing Phase 5 behavior)
- PR benchmarks run independently — do NOT wait for tests to pass
- Concurrency cancel per PR number: `concurrency: { group: 'benchmark-${{ github.event.pull_request.number }}', cancel-in-progress: true }`
- Use `if` conditions on `github.event_name` to distinguish main vs PR behavior at the step level
- Add `pull_request` trigger (opened + synchronize) to existing `benchmark.yml` — same file handles both main and PR flows
- Main pushes continue using `workflow_run` trigger (existing Phase 5 behavior)
- PR benchmarks run independently — do NOT wait for tests to pass
- Concurrency cancel per event: `concurrency: { group: 'benchmark-${{ github.event.pull_request.number || github.sha }}', cancel-in-progress: true }`
- Use `if` conditions on `github.event_name` to distinguish main vs PR behavior at the step level
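Spelled out as a full block, the safer pattern looks like this; the `github.sha` fallback follows the suggestion above, and any value that is defined for `workflow_run` events would work equally well:

```yaml
# One concurrency key that is defined for both triggers:
# PR runs group by PR number; workflow_run runs fall back to the commit SHA.
concurrency:
  group: benchmark-${{ github.event.pull_request.number || github.sha }}
  cancel-in-progress: true
```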
🧰 Tools
🪛 LanguageTool

[uncategorized] ~21-~21: The official name of this software platform is spelled with a capital “H”.
Context: ...gress: true }` - Use `if` conditions on `github.event_name` to distinguish main vs PR b...

(GITHUB)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/phases/06-pr-feedback/06-CONTEXT.md around lines 17 - 21, The
concurrency group currently uses 'benchmark-${{ github.event.pull_request.number
}}' which is undefined for workflow_run events and causes unrelated runs to
collide; change the concurrency group to use a fallback expression so when
github.event.pull_request.number is missing it falls back to a stable per-run
identifier (for example github.run_id or github.ref/sha). Update the
concurrency: group value to incorporate that fallback (keeping the 'benchmark-'
prefix) so both pull_request and workflow_run triggers produce unique,
non-conflicting keys.

- [x] **CI-01**: gh-pages branch exists with GitHub Pages enabled serving benchmark dashboard
- [x] **CI-02**: Post-matrix benchmark job runs github-action-benchmark for a single Python version (latest)
- [x] **CI-03**: Auto-push to gh-pages only on main branch pushes, not PRs
- [x] **CI-04**: Release/tag events trigger a benchmark snapshot on gh-pages

⚠️ Potential issue | 🟡 Minor

CI-04 marked complete but not implemented.

The requirement states "Release/tag events trigger a benchmark snapshot on gh-pages" but benchmark.yml has no release or push.tags trigger. The workflow comments explicitly state releases do NOT get separate benchmark runs. Either update the requirement status to pending or add the tag trigger to the workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/REQUIREMENTS.md at line 15, The CI-04 check is marked complete but
the benchmark workflow lacks release/tag triggers; either update the
REQUIREMENTS.md entry for CI-04 to "pending" or modify the benchmark workflow
(benchmark.yml) to include release and tag push triggers (add an on: release:
types: [published] and/or on: push: tags: ['*']) so that release/tag events will
run the benchmark and publish the snapshot to gh-pages; update the CI-04 status
in .planning/REQUIREMENTS.md if you choose not to change the workflow.
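If the workflow change (rather than the status downgrade) is the chosen fix, the trigger addition sketched in the prompt would look like this fragment:

```yaml
on:
  release:
    types: [published]
  push:
    tags: ["*"]
```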

Comment on lines +112 to +120
### Decision 3: `auto-push: true` only on main branch pushes

**Why:** On PRs, you want comparison comments but should NOT push results to gh-pages (PR benchmarks are noisy, transient, and would pollute the historical record). Use a conditional:

```yaml
auto-push: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
```

PR runs still get `comment-on-alert: true` which compares against the last stored result and comments on the PR if regression is detected.

```python
# Padded (explicit opt-in via extension suffix)
_RegistryEntry("pattern", "*.h5p", "object", "asebytes.columnar", "PaddedColumnarBackend", ...),
_RegistryEntry("pattern", "*.zarrp", "object", "asebytes.columnar", "PaddedColumnarBackend", ...),

# H5MD (spec-compliant, always padded)
_RegistryEntry("pattern", "*.h5md", "object", "asebytes.h5md", "H5MDBackend", ...),
]
```

**Rationale for `.h5p`/`.zarrp` over `.h5-padded`:** File extensions with hyphens break shell glob patterns and confuse some filesystem tools. Single-suffix extensions are conventional. The `p` suffix is short for "padded" and unambiguous in context. However, the exact naming is a user decision -- what matters architecturally is that padded and ragged are separate registry entries pointing to separate backend classes.

**Alternative considered: parameter-based dispatch** (`ASEIO("data.h5", strategy="padded")`). Rejected because: (1) the registry is extension-based and adding constructor parameters would require registry protocol changes, (2) a file's storage strategy is an intrinsic property of the file, not a user preference at open time, (3) re-opening a file should auto-detect its strategy without user knowledge.

## Patterns to Follow

### Pattern 1: Template Method for Columnar Backends

**What:** `BaseColumnarBackend` implements the full `ReadWriteBackend` contract. Subclasses override only the per-atom-specific methods.

**When:** Any columnar backend (ragged, padded, H5MD).

**Why:** The current ColumnarBackend and H5MDBackend duplicate ~300 lines of scalar column handling, JSON serialization, and postprocessing. A Template Method base class eliminates this.

```python
class BaseColumnarBackend(ReadWriteBackend[str, Any], ABC):
    """Shared logic for all columnar backends."""

    _returns_mutable: bool = True

    # --- Concrete methods (shared) ---

    def _postprocess(self, val, col_name, *, is_per_atom=False):
        """Shared read postprocessing: NaN->None, JSON decode, numpy scalar unwrap."""
        ...

    def _prepare_scalar_column(self, values):
        """Shared scalar column serialization."""
        ...

    def _write_scalar_columns(self, keys, batch_values):
        """Write non-per-atom columns to store."""
        ...

    def _classify_columns(self, data):
        """Determine which columns are per-atom vs scalar."""
        ...

    def _serialize_value(self, val):
        """JSON-encode dicts/lists/strings."""
        ...

    # --- Abstract methods (per-atom strategy) ---

    @abstractmethod
    def _write_per_atom_columns(self, keys, batch_values, n_atoms_list):
        """Write per-atom columns using strategy-specific layout."""
        ...

    @abstractmethod
    def _read_per_atom_value(self, col_name, index):
        """Read a single per-atom value for one frame."""
        ...

    @abstractmethod
    def _read_per_atom_bulk(self, col_name, indices):
        """Read per-atom values for multiple frames."""
        ...
```
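The extension-based dispatch that the registry entries and rationale describe can be sketched with hypothetical stand-ins (`REGISTRY`, `resolve_backend`, and the `*.h5` → ragged default are illustrative, not the real asebytes API):

```python
from fnmatch import fnmatch

# Hypothetical flattening of the registry entries quoted above; the real
# _RegistryEntry objects and backend classes live in asebytes.
REGISTRY = [
    ("*.h5p", "PaddedColumnarBackend"),
    ("*.zarrp", "PaddedColumnarBackend"),
    ("*.h5md", "H5MDBackend"),
    ("*.h5", "ColumnarBackend"),  # assumed ragged default
]


def resolve_backend(path: str) -> str:
    """Pick a backend by matching the filename against registry patterns.

    The storage strategy is derived from the extension alone, so re-opening
    a file needs no user-supplied strategy parameter.
    """
    for pattern, backend in REGISTRY:
        if fnmatch(path, pattern):
            return backend
    raise ValueError(f"no backend registered for {path!r}")
```

This is why re-opening auto-detects the strategy: `resolve_backend("traj.h5p")` picks the padded backend purely from the suffix.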

⚠️ Potential issue | 🟡 Minor

Inconsistency in trigger strategy between documents.

This section describes using github.event_name == 'push' && github.ref == 'refs/heads/main' for conditional auto-push, implying the workflow runs on direct push events to main.

However, .planning/phases/06-pr-feedback/06-01-PLAN.md describes a different approach:

  • Main branch benchmarks trigger via workflow_run (after Tests workflow completes)
  • PR benchmarks trigger via pull_request event
  • Conditions check github.event_name == 'workflow_run' vs github.event_name == 'pull_request'

These are two different architectural approaches. Clarify which approach is being implemented, or update this document to reflect the workflow_run-based approach described in the Phase 6 plan.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/research/ARCHITECTURE.md around lines 112 - 120, The
ARCHITECTURE.md section uses an `auto-push` conditional of `github.event_name ==
'push' && github.ref == 'refs/heads/main'` which conflicts with the Phase 6 plan
that uses `workflow_run` for main-branch benchmarks and `pull_request` for PRs;
update the ARCHITECTURE.md text to match the Phase 6 approach (or vice versa) by
replacing the current `auto-push` conditional with the `workflow_run`-based
logic (`github.event_name == 'workflow_run'` for main branch runs triggered
after Tests and `pull_request` for PR runs) and explicitly document that
`comment-on-alert: true` remains enabled for PR comparisons so readers can trace
the `workflow_run` vs `pull_request` trigger strategy consistently with the
Phase 6 plan.

Comment on lines +122 to +125
### Decision 4: Keep existing visualize_benchmarks.py + artifact uploads

**Why:** The PNGs serve a different purpose (static per-run snapshots in artifact archives). The github-action-benchmark dashboard provides historical trends. Both are valuable. No reason to remove existing functionality.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if visualize_benchmarks.py exists in the repository

fd -t f "visualize_benchmarks.py"

Repository: zincware/asebytes

Length of output: 43


🏁 Script executed:

# Check git history for visualize_benchmarks.py
git log --oneline --all -- "*visualize_benchmarks.py" 2>/dev/null || echo "No git history for this file"

Repository: zincware/asebytes

Length of output: 229


🏁 Script executed:

# Search for references to visualize_benchmarks in the codebase
rg "visualize_benchmarks" --type md --type py --type yaml --type txt

Repository: zincware/asebytes

Length of output: 134


🏁 Script executed:

# Read the ARCHITECTURE.md file to see the exact context of Decision 4
cat -n ".planning/research/ARCHITECTURE.md" | sed -n '115,135p'

Repository: zincware/asebytes

Length of output: 1446


🏁 Script executed:

ls -la docs/ | grep -i visualize

Repository: zincware/asebytes

Length of output: 43


🏁 Script executed:

# Check if docs/visualize_benchmarks.py exists
test -f "docs/visualize_benchmarks.py" && echo "File exists" || echo "File does not exist"

Repository: zincware/asebytes

Length of output: 80


🏁 Script executed:

# Get more context from docs/README.md about the visualize_benchmarks reference
grep -n "visualize_benchmarks" docs/README.md -B 2 -A 2

Repository: zincware/asebytes

Length of output: 225


Fix Decision 4: Visualization script was removed, not kept.

Decision 4 states "Keep existing visualize_benchmarks.py + artifact uploads," but the file docs/visualize_benchmarks.py does not exist in the repository. Additionally, docs/README.md still contains a reference to this non-existent script (uv run docs/visualize_benchmarks.py benchmark_results.json).

Either restore the visualization script if it should be kept, or rewrite Decision 4 to reflect that the visualization script and its documentation have been removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/research/ARCHITECTURE.md around lines 122 - 125, Decision 4 in
ARCHITECTURE.md claims we kept visualize_benchmarks.py, but the file
docs/visualize_benchmarks.py is missing and docs/README.md still references
running it; either restore the script and its artifact-upload behavior and add
it back at docs/visualize_benchmarks.py (ensuring it accepts
benchmark_results.json as the CLI arg and is documented in docs/README.md), or
change Decision 4 text in .planning/research/ARCHITECTURE.md and remove or
update the docs/README.md reference so both documentation and architecture
reflect that the visualization script was removed; update whichever files you
change to keep the repository consistent (Decision 4, docs/README.md, and
presence or absence of docs/visualize_benchmarks.py).

5. **Backend-specific optimizations** (MongoDB TTL, Redis Lua) - These are validated wins (1.9-3.5x) but lower priority than correctness.
| Existing Asset | How It Feeds New Features |
|----------------|--------------------------|
| `--benchmark-json=benchmark_results.json` in CI (tests.yml line 61) | Direct input to github-action-benchmark `output-file-path` |

⚠️ Potential issue | 🟡 Minor

Stale reference: benchmark JSON is no longer generated in tests.yml.

The reference to tests.yml line 61 is outdated. Per the relevant code snippet, benchmark steps have been moved to the new benchmark.yml workflow. Consider updating this line to reference the new location.

📝 Proposed fix
-| `--benchmark-json=benchmark_results.json` in CI (tests.yml line 61) | Direct input to github-action-benchmark `output-file-path` |
+| `--benchmark-json=benchmark_results.json` in CI (benchmark.yml line 81) | Direct input to github-action-benchmark `output-file-path` |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| `--benchmark-json=benchmark_results.json` in CI (tests.yml line 61) | Direct input to github-action-benchmark `output-file-path` |
| `--benchmark-json=benchmark_results.json` in CI (benchmark.yml line 81) | Direct input to github-action-benchmark `output-file-path` |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/research/FEATURES.md at line 84, Update the stale reference in the
FEATURES.md line that mentions "`--benchmark-json=benchmark_results.json` in CI
(tests.yml line 61)`" so it points to the new workflow location: replace the
`tests.yml line 61` reference with the appropriate `benchmark.yml` workflow (or
remove the line-number mention and reference `benchmark.yml`/the benchmark
workflow step), ensuring the text still mentions
`--benchmark-json=benchmark_results.json` as the direct input to
github-action-benchmark `output-file-path`.

Comment on lines +87 to +89
| 5. Benchmark Pipeline | 1/1 | Complete | 2026-03-09 | - |
| 6. PR Feedback | v0.3.1 | 0/1 | Not started | - |
| 7. Dashboard and README | v0.3.1 | 0/? | Not started | - |

⚠️ Potential issue | 🟡 Minor

Table row misalignment in Phase 5 progress entry.

Line 87 is missing the Milestone column value and has columns shifted. Compare to the consistent format in lines 88-89.

📝 Proposed fix
-| 5. Benchmark Pipeline | 1/1 | Complete   | 2026-03-09 | - |
+| 5. Benchmark Pipeline | v0.3.1 | 1/1 | Complete | 2026-03-09 |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| 5. Benchmark Pipeline | 1/1 | Complete | 2026-03-09 | - |
| 6. PR Feedback | v0.3.1 | 0/1 | Not started | - |
| 7. Dashboard and README | v0.3.1 | 0/? | Not started | - |
| 5. Benchmark Pipeline | v0.3.1 | 1/1 | Complete | 2026-03-09 |
| 6. PR Feedback | v0.3.1 | 0/1 | Not started | - |
| 7. Dashboard and README | v0.3.1 | 0/? | Not started | - |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.planning/ROADMAP.md around lines 87 - 89, The "Benchmark Pipeline" (Phase
5) table row is missing its Milestone cell, causing column misalignment; update
the Phase 5 row (the line containing "5. Benchmark Pipeline | 1/1 | Complete   |
2026-03-09 | -") to include the Milestone value (match the format used by lines
with "v0.3.1") so all pipes align and the table columns are consistent.

…terations

write_single tests created the DB once and appended in each benchmark
iteration, causing quadratic slowdown (2h+ hangs on CI). Two fixes:
- Create fresh DB per iteration (matches write_trajectory pattern)
- Cap to 10 frames (per-row overhead is the signal, not throughput)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
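The fix described in this commit message can be sketched without pytest-benchmark; `FakeStore` and the factory are hypothetical stand-ins showing that each iteration writes its capped 10 frames into a fresh, uniquely named store instead of appending onto the previous iterations' rows:

```python
import uuid


class FakeStore:
    """Minimal stand-in for an ASEIO-backed store (hypothetical)."""

    def __init__(self):
        self.rows = []

    def extend(self, frames):
        self.rows.extend(frames)


FRAMES = list(range(10))  # capped at 10 frames, matching the fix


def make_write_single_fn(store_factory):
    """Each benchmark iteration gets a fresh store, so iteration N does
    not append on top of rows left behind by iterations 1..N-1."""

    def fn():
        store = store_factory(f"ws_{uuid.uuid4().hex[:8]}")
        for frame in FRAMES:  # one extend() per frame: per-row overhead is the signal
            store.extend([frame])
        return store

    return fn


created = []


def factory(name):
    created.append(name)  # record the unique name each iteration uses
    return FakeStore()


fn = make_write_single_fn(factory)
sizes = [len(fn().rows) for _ in range(3)]  # what benchmark(fn) would repeat
```

With the old shared-store pattern, `sizes` would have grown 10, 20, 30; with fresh stores it stays constant at 10 per iteration.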
@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/benchmarks/test_bench_write.py (2)

27-30: ⚠️ Potential issue | 🟠 Major

Clean up local benchmark artifacts per invocation.

These UUID-based paths fix cross-run reuse, but every benchmark round now leaves another file/store behind until tmp_path is torn down. On the larger datasets, pytest-benchmark will call fn() many times, so later samples run against a progressively dirtier filesystem and can start burning a lot of disk in CI.

Please add per-iteration teardown for the local targets, or reuse a single cleaned target between iterations so the benchmark stays measuring write cost rather than accumulated temp data.

Also applies to: 39-42, 51-54, 95-99, 110-113, 122-124, 133-137, 157-161, 171-175, 185-189, 235-239, 251-255, 265-268, 278-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_bench_write.py` around lines 27 - 30, The benchmark
helper fn currently creates UUID-named LMDB stores via tmp_path and ASEIO and
never removes them, causing accumulation across iterations; modify fn (and the
other similar helpers noted) to either (A) create a single SAEIO/LMDB target
once outside the benchmark loop and clear/truncate its contents between
iterations, or (B) after db.extend(frames) call the ASEIO close/cleanup API and
remove the created path (tmp_path / f"wt_{name}_lmdb_...") so each invocation
deletes its artifact; ensure you reference the ASEIO instance (db), call its
proper close/flush method, and unlink/rmdir the created filesystem path (or
reuse a shared path created before benchmarking) for all the functions that
create tmp_path/ASEIO stores.

64-72: ⚠️ Potential issue | 🟠 Major

Make MongoDB/Redis teardown failure-safe.

Right now cleanup only runs after benchmark(fn), and the handle is appended to dbs only after the writes finish. If a benchmark iteration fails partway through, the collection/prefix is leaked and can pollute later CI runs.

Please append the handle immediately after construction and wrap benchmark(fn) in try/finally so teardown always runs.

Suggested pattern
     dbs = []

     def fn():
         uri = f"{mongo_uri}/bench_ws_{name}_{uuid.uuid4().hex[:8]}"
         db = ASEIO(uri)
+        dbs.append(db)
         for mol in frames:
             db.extend([mol])
-        dbs.append(db)

-    benchmark(fn)
-    for db in dbs:
-        db.remove()
+    try:
+        benchmark(fn)
+    finally:
+        for db in dbs:
+            db.remove()

Also applies to: 80-88, 200-209, 218-227
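The suggested pattern can be exercised with a hypothetical stand-in (`FakeDB` is illustrative, not the real ASEIO): registering the handle before writing, plus try/finally, guarantees `remove()` runs even when an iteration fails mid-write.

```python
class FakeDB:
    """Hypothetical stand-in for an ASEIO handle."""

    removed = []

    def __init__(self, uri):
        self.uri = uri

    def extend(self, frames):
        if "bad" in self.uri:
            raise RuntimeError("write failed mid-iteration")

    def remove(self):
        FakeDB.removed.append(self.uri)


def run_benchmark(uris):
    dbs = []

    def fn(uri):
        db = FakeDB(uri)
        dbs.append(db)  # register BEFORE writes, so a failure still cleans up
        db.extend([1, 2, 3])

    try:
        for uri in uris:  # stands in for benchmark(fn)'s repeated calls
            fn(uri)
    finally:
        for db in dbs:  # teardown always runs, even after an exception
            db.remove()


try:
    run_benchmark(["ok_1", "bad_2", "ok_3"])
except RuntimeError:
    pass  # the failure propagates, but every created handle was removed
```

Both the successful handle and the one that failed mid-write get cleaned up; with the original append-after-write ordering, `bad_2` would have leaked.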

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_bench_write.py` around lines 64 - 72, The teardown can
leak resources if a benchmark iteration fails because db handles are only
appended to dbs after writes complete and benchmark(fn) isn't guaranteed to run
cleanup; modify the fn closure so it constructs the ASEIO instance and appends
it to dbs immediately after creation (right after db = ASEIO(uri)) and then
perform db.extend(frames); also wrap the call to benchmark(fn) in a try/finally
where the finally iterates over dbs and calls db.remove() to ensure teardown
always runs; apply the same change to the other similar blocks referenced (lines
handling the same pattern).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/benchmarks/test_bench_write.py`:
- Around line 27-30: The benchmark helper fn currently creates UUID-named LMDB
stores via tmp_path and ASEIO and never removes them, causing accumulation
across iterations; modify fn (and the other similar helpers noted) to either (A)
create a single SAEIO/LMDB target once outside the benchmark loop and
clear/truncate its contents between iterations, or (B) after db.extend(frames)
call the ASEIO close/cleanup API and remove the created path (tmp_path /
f"wt_{name}_lmdb_...") so each invocation deletes its artifact; ensure you
reference the ASEIO instance (db), call its proper close/flush method, and
unlink/rmdir the created filesystem path (or reuse a shared path created before
benchmarking) for all the functions that create tmp_path/ASEIO stores.
- Around line 64-72: The teardown can leak resources if a benchmark iteration
fails because db handles are only appended to dbs after writes complete and
benchmark(fn) isn't guaranteed to run cleanup; modify the fn closure so it
constructs the ASEIO instance and appends it to dbs immediately after creation
(right after db = ASEIO(uri)) and then perform db.extend(frames); also wrap the
call to benchmark(fn) in a try/finally where the finally iterates over dbs and
calls db.remove() to ensure teardown always runs; apply the same change to the
other similar blocks referenced (lines handling the same pattern).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f34ee4b-344e-4f93-8e75-ef1597d6ec79

📥 Commits

Reviewing files that changed from the base of the PR and between 5c674d7 and 857dc47.

📒 Files selected for processing (1)
  • tests/benchmarks/test_bench_write.py

@PythonFZ PythonFZ merged commit 874ef0d into main Mar 9, 2026
1 of 5 checks passed
@PythonFZ PythonFZ deleted the gsd/auto-bench branch March 9, 2026 20:24
@PythonFZ PythonFZ restored the gsd/auto-bench branch March 9, 2026 20:25
PythonFZ added a commit that referenced this pull request Mar 9, 2026
This was referenced Mar 9, 2026