feat: --pivot comparison axis for compare/plot (fold one run along a config dim) #137
Merged
…a config dim)

Add `--pivot <dim>` to `benchmem compare` and `benchmem plot`, so a single combined run whose rows differ only in one config dim (e.g. `semantics=legacy|v1`) can be A/B'd directly, without splitting the data into separate run files or forcing N pytest invocations.

The insight: the series axis was always a dim, and the run-file is just the default one. `--pivot` re-points it at a data dim, folding one run so rows that differ only in that dim pair up and its values become the compared series: the A/B a run-file pair gives today, from one run.

It's implemented as a single fold on the shared `load_long_df` spine (`_pivot_to_series`), so the compare table and the compare/scatter plot views inherit it with no change to their logic; a param duplicated into the node id is lifted back out so `test_build[legacy-100]` and `[v1-100]` collapse to one `test_build[100]` row.

Scoped to one series axis per A/B view: errors on multiple runs, and on the scaling/sweep views (there the dim is a normal --x/--facet axis). Composes with --columns/--stat/--facet/--where/--sort/--group-by unchanged.

Named `--pivot` (not the issue's `--by`) to avoid colliding with `--group-by`: --group-by partitions rows, --pivot sets what's compared.

Closes #129

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
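The fold described in this commit can be pictured with a small stand-alone sketch. It is not the project's `_pivot_to_series`: the row layout (`id`, `dims`, `peak`), the helper names, and the assumption that node-id parameters are joined with `-` are all hypothetical, chosen only to show how lifting the pivot value out of the id lets rows pair up.

```python
import re
from collections import defaultdict


def strip_param(node_id, value):
    """Remove one parametrize value from an id like 'test_build[legacy-100]'.

    Assumes (hypothetically) that parameters inside the brackets are joined
    with '-' and that the value itself contains no '-'.
    """
    match = re.fullmatch(r"(.*)\[(.*)\]", node_id)
    if match is None:
        return node_id  # no bracketed parameters, nothing to lift out
    name, params = match.groups()
    parts = params.split("-")
    if str(value) in parts:
        parts.remove(str(value))
    return f"{name}[{'-'.join(parts)}]" if parts else name


def pivot_to_series_sketch(rows, dim):
    """Fold rows along `dim`: its values become the series, and rows that
    differ only in `dim` land under the same collapsed id."""
    folded = defaultdict(dict)
    for row in rows:
        value = row["dims"][dim]
        folded[strip_param(row["id"], value)][value] = row["peak"]
    return dict(folded)


rows = [
    {"id": "test_build[legacy-100]", "dims": {"semantics": "legacy"}, "peak": 120},
    {"id": "test_build[v1-100]", "dims": {"semantics": "v1"}, "peak": 150},
]
folded = pivot_to_series_sketch(rows, "semantics")
```

Here both rows collapse to the single key `test_build[100]`, with `legacy` and `v1` as its two series, which is the pairing a two-file compare would otherwise provide.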
--fail-on generalizes the same way the table did: with run-files it gates runs[0] (base) vs runs[-1] (head); with --pivot it gates the first value of the dim vs the last, folded out of one run, paired per collapsed id. So a single combined run gates itself in CI (legacy→v1 peak growth) with no second file, closing the gap the CodSpeed driver needs.

`find_pivot_regressions` reuses load_long_df (the fold) + _regressions_for (the growth core); the CLI relaxes the "needs two runs" guard when --pivot supplies the two sides, and routes the gate accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
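A rough sketch of the gate along the pivot axis, under stated assumptions: the folded input maps each collapsed id to a `{dim value: peak}` dict, `order` is the dim's values in collection order, and the threshold is a fractional growth limit. The real `--fail-on` threshold format and the signature of `find_pivot_regressions` are not shown in this PR text, so the function below is a hypothetical stand-in.

```python
def pivot_regressions_sketch(folded, order, max_growth):
    """Gate the first dim value (base) against the last (head), per collapsed id.

    `max_growth` is a hypothetical fractional limit, e.g. 0.10 for +10%.
    Returns {collapsed id: growth} for every id that exceeds the limit.
    """
    base, head = order[0], order[-1]
    regressions = {}
    for key, series in folded.items():
        if base not in series or head not in series:
            continue  # unpaired id: nothing to compare
        if series[base] <= 0:
            continue  # growth is undefined for a non-positive base
        growth = (series[head] - series[base]) / series[base]
        if growth > max_growth:
            regressions[key] = growth
    return regressions


folded = {
    "test_build[100]": {"legacy": 100, "v1": 150},
    "test_load[100]": {"legacy": 100, "v1": 105},
}
failing = pivot_regressions_sketch(folded, ["legacy", "v1"], max_growth=0.10)
```

With these made-up numbers only `test_build[100]` is flagged, since its peak grows by 50% from `legacy` to `v1` while `test_load[100]` stays under the limit.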
Two ergonomic warts:

- `plot one.json --pivot DIM` (no --view) defaulted to scaling, which the pivot guard rejects, so the most natural command failed. Default to the compare view when --pivot is set and --view is unset; an explicit --view still wins.
- A pivot that pairs nothing (e.g. a custom pytest.param(id=…) whose label differs from its value, so the value never strips out of the id) silently produced a pile of one-series rows. Warn when no id recurs across pivot values.

Also document the base ordering: the first dim value in parametrize/collection order is the --fail-on base and first row (reorder the list to flip it); the table's (1.0) marks each column's best, as the run-file table does, not the base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
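Both fixes amount to small decision rules, sketched below for the single-run case. The function names, the folded `{collapsed id: {dim value: peak}}` input, and the exact warning text are assumptions for illustration; only the behaviour (explicit `--view` wins, `--pivot` defaults to compare, warn when nothing pairs) comes from the commit message.

```python
import warnings


def default_view(view, pivot):
    """Pick the plot view for one run: an explicit --view always wins,
    --pivot alone selects compare, otherwise scaling stays the default."""
    if view is not None:
        return view
    return "compare" if pivot is not None else "scaling"


def warn_if_unpaired(folded, dim):
    """Warn when no collapsed id recurs across pivot values.

    Returns True if the warning fired. The message wording is illustrative.
    """
    if folded and all(len(series) < 2 for series in folded.values()):
        warnings.warn(f"--pivot {dim} left every row unpaired", stacklevel=2)
        return True
    return False


chosen = default_view(None, "semantics")
```

A custom `pytest.param(id=…)` label would show up here as ids that never collapse onto each other, so every entry in the folded mapping holds a single series and the warning fires.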
Auto-selected scaling (single run, no --pivot) errors when the run has no numeric dim (e.g. an all-categorical semantics=legacy|v1 run) or several. The old message only said "pass x=", missing that a single-run categorical A/B is exactly what --pivot is for. Point at --x / --pivot / two-runs instead: the same guide-don't-fail treatment the pivot view-guard got.

scaling stays the right single-run default: for one run with an inferable numeric dim it's the useful zero-config curve (and folds categoricals into colour/facet); the fix is only its failure mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
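The inference rule and its friendlier failure mode might look roughly like this. It is a sketch only: the `dims` mapping of dim name to observed values, the function name, and the error wording are hypothetical, and the real check may treat numeric types differently.

```python
def infer_x_sketch(dims):
    """Return the single numeric dim to use as the x-axis, or raise an error
    that points at the three ways out (--x, --pivot, two runs)."""
    numeric = [
        name
        for name, values in dims.items()
        if values
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
    ]
    if len(numeric) == 1:
        return numeric[0]
    if numeric:
        reason = f"several numeric dims ({', '.join(numeric)})"
    else:
        reason = "no numeric dim"
    raise ValueError(
        f"cannot infer an x-axis: {reason}; pass --x, use --pivot for a "
        "single-run categorical A/B, or compare two runs"
    )


x_axis = infer_x_sketch({"n": [100, 200], "semantics": ["legacy", "v1"]})
```

An all-categorical run such as `{"semantics": ["legacy", "v1"]}` raises, and the message names `--pivot` rather than only asking for an x value.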
What

Adds `--pivot <dim>` to `benchmem compare` and `benchmem plot`, so a single combined run whose rows differ only in one config dim (e.g. `semantics=legacy|v1`, `solver=…`, `branch=…`) can be A/B'd directly: no splitting into separate run files, no N pytest invocations.

One run now drives the A/B table, the A/B plot, the scaling plot, and an external per-id gate (e.g. CodSpeed, which wants the config value in the node id of one run). Closes #129.

The idea

The series axis (the thing laid side by side and ranked with the `(N.NN)` multiplier) was always a dim. The run-file is just the default one, which is why `compare a.json b.json` ranks one file against another. `--pivot` re-points that axis at a real data dim: its values become the series and it's lifted out of each row's identity so rows differing only in it pair up.

Implemented as a single fold on the shared `load_long_df` spine (`_pivot_to_series`), so the compare table and the compare/scatter plot views, already written in terms of `(series, pairing-key)`, inherit it with no change to their logic. A param duplicated into the opaque node id is lifted back out, so `test_build[legacy-100]` and `test_build[v1-100]` collapse to one `test_build[100]` row.

Scope & semantics

- One series axis per A/B view: errors on multiple runs, and on `scaling`/`sweep` (there the dim is a normal `--x`/`--facet` axis).
- `--fail-on` follows the same axis: normally it gates `runs[0]` vs `runs[-1]`; with `--pivot` it gates the first dim value vs the last, folded out of the one run (`find_pivot_regressions`, reusing the existing growth core). So a combined run gates itself in CI.
- Base ordering: the first dim value in parametrize/collection order is the `--fail-on` base and the first row; reorder the `parametrize` list to flip it. The table's `(1.0)` separately marks each column's best value, exactly as the run-file table does.
- Contrast with `--group-by`: `--group-by` partitions rows into sub-tables (series stays the run-files); `--pivot` sets what is compared. Named `--pivot` rather than the issue's `--by` to avoid the `groupby`/`GROUP BY` collision; docstrings and docs spell out the contrast.
- Composes with `--columns`/`--stat`/`--facet`/`--where`/`--sort`/`--group-by` unchanged.

Ergonomics

- `plot one.json --pivot DIM` (no `--view`) defaults to the compare view; otherwise one run would default to `scaling`, which has no series axis to pivot. An explicit `--view` still wins.
- A pivot that pairs nothing (e.g. a custom `pytest.param(id=…)` whose label differs from its value, so the value never strips out of the id) now warns ("left every row unpaired…") instead of silently showing a pile of one-series rows. Use a plain parametrize value or an `extra_info` dim (which isn't in the id at all) for the pivot axis.
- Auto-selected `scaling` that can't infer an x-axis (no numeric dim, or several) now guides toward `--x` / `--pivot` / two runs rather than just "pass x=".

Tests & checks

New tests across `test_compare.py`, `test_plotting.py`, `test_cli.py`: param-in-id fold, bare `extra_info` dim, multiple-runs error, unknown-dim error, scatter single-run, the view-scope guard, the compare-view default, the no-pairing warning, the scaling-inference guidance, `--fail-on` along the pivot axis (unit + CLI), and the headline acceptance (one run drives both the table and the compare plot). Full suite green; ruff format + check clean; source type-checks.

Docs

- `compare-plot.md`: new "One run, two configs — `--pivot`" section (series-axis model, `--group-by` contrast, `--fail-on`-along-pivot, baseline-ordering note, custom-id limitation); `--pivot` added to the plot-flags rundown.
- `reference.md` renders the CLI help live, so `--pivot` and the updated `--fail-on` help appear automatically.

🤖 Generated with Claude Code