feat: --pivot comparison axis for compare/plot (fold one run along a config dim) #137

Merged
FBumann merged 5 commits into main from feat/by-comparison-axis
Jun 29, 2026

@FBumann FBumann commented Jun 29, 2026

What

Adds --pivot <dim> to benchmem compare and benchmem plot, so a single combined run whose rows differ only in one config dim (e.g. semantics=legacy|v1, solver=…, branch=…) can be A/B'd directly — no splitting into separate run files, no N pytest invocations.

# one combined run; semantics is a param (in the id) AND a dim
pytest benchmarks/ --benchmark-only --benchmark-memory --benchmark-json=build.json

benchmem compare build.json --pivot param:semantics --columns time,peak       # A/B table
benchmem plot    build.json --pivot param:semantics --columns peak             # A/B bars (compare view is the default under --pivot)
benchmem compare build.json --pivot param:semantics --fail-on peak:10%         # gate legacy→v1 in CI, from one run
benchmem plot    build.json --x n --facet semantics --columns peak             # --pivot optional here

One run now drives the A/B table, the A/B plot, the scaling plot, and an external per-id gate (e.g. CodSpeed, which wants the config value in the node id of one run). Closes #129.

The idea

The series axis — the thing laid side by side and ranked with the (N.NN) multiplier — was always a dim. The run-file is just the default one, which is why compare a.json b.json ranks one file against another. --pivot re-points that axis at a real data dim: its values become the series and it's lifted out of each row's identity so rows differing only in it pair up.

Implemented as a single fold on the shared load_long_df spine (_pivot_to_series), so the compare table and the compare/scatter plot views — already written in terms of (series, pairing-key) — inherit it with no change to their logic. A param duplicated into the opaque node id is lifted back out, so test_build[legacy-100] and test_build[v1-100] collapse to one test_build[100] row.
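
As a rough illustration of the fold described above, here is a minimal pure-Python sketch. It is not the project's `_pivot_to_series`: the function name `pivot_to_series`, the row layout (`id`, `params`, `peak`) and the sample numbers are all hypothetical, and it assumes the param value appears as one `-`-separated token of the node id.

```python
def pivot_to_series(rows, dim):
    """Fold rows along `dim` (hypothetical helper, not the library's code).

    The dim's values become the series, and the value is lifted out of the
    node id so rows differing only in `dim` share one pairing key.
    """
    folded = {}
    for row in rows:
        value = str(row["params"][dim])
        name, bracket, rest = row["id"].partition("[")
        # Assumption: id parts are joined with "-" and no part contains "-".
        parts = rest[:-1].split("-") if bracket and rest.endswith("]") else []
        if value in parts:
            parts.remove(value)  # lift the pivot value out of the id
        key = f"{name}[{'-'.join(parts)}]" if parts else name
        # dict insertion order keeps the series in first-seen (collection) order
        folded.setdefault(key, {})[value] = row["peak"]
    return folded


# Made-up sample rows, purely for illustration.
rows = [
    {"id": "test_build[legacy-100]", "params": {"semantics": "legacy", "n": 100}, "peak": 120.0},
    {"id": "test_build[v1-100]", "params": {"semantics": "v1", "n": 100}, "peak": 90.0},
    {"id": "test_build[legacy-1000]", "params": {"semantics": "legacy", "n": 1000}, "peak": 1200.0},
    {"id": "test_build[v1-1000]", "params": {"semantics": "v1", "n": 1000}, "peak": 800.0},
]
folded = pivot_to_series(rows, "semantics")
print(folded)
```

In this sketch the four input rows collapse to two pairing keys, `test_build[100]` and `test_build[1000]`, each holding a `legacy` and a `v1` series.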

Scope & semantics

  • One series axis per A/B view: --pivot errors when given multiple runs (files × dim would be a 2-D matrix the A/B view can't render) and under the scaling/sweep views (there the dim is a normal --x/--facet axis).
  • --fail-on follows the same axis: normally it gates runs[0] vs runs[-1]; with --pivot it gates the first dim value vs the last, folded out of the one run (find_pivot_regressions, reusing the existing growth core). So a combined run gates itself in CI.
  • Ordering is parametrize/collection order (verified, not lexicographic): the first dim value is the --fail-on base and the first row — reorder the parametrize list to flip it. The table's (1.0) separately marks each column's best value, exactly as the run-file table does.
  • Distinct from --group-by: --group-by partitions rows into sub-tables (series stays the run-files); --pivot sets what is compared. Named --pivot rather than the issue's --by to avoid the groupby/GROUP BY collision; docstrings and docs spell out the contrast.
  • Composes with --columns / --stat / --facet / --where / --sort / --group-by unchanged.
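
The gating rule in the list above (first dim value as base, last as head, paired per collapsed id) can be sketched as follows. This is an assumption-laden illustration, not `find_pivot_regressions` itself: the name `pivot_regressions`, the folded-dict layout, and the strict "growth greater than threshold" comparison are all hypothetical.

```python
def pivot_regressions(folded, threshold_pct):
    """Return (id, base, head, growth_pct) for ids that grew past the threshold.

    `folded` maps a collapsed id to {series_value: measurement}, with series
    in collection order, so the first value is the base and the last the head.
    """
    regressions = []
    for key, series in folded.items():
        values = list(series.values())
        if len(values) < 2:
            continue  # unpaired row: nothing to compare against
        base, head = values[0], values[-1]
        if base <= 0:
            continue  # growth is undefined without a positive base
        growth_pct = (head - base) / base * 100.0
        if growth_pct > threshold_pct:  # assumption: strictly greater fails
            regressions.append((key, base, head, growth_pct))
    return regressions


# Made-up peak values for illustration only.
folded_peaks = {
    "test_build[100]": {"legacy": 100.0, "v1": 125.0},    # +25%
    "test_build[1000]": {"legacy": 1000.0, "v1": 1050.0},  # +5%
}
failures = pivot_regressions(folded_peaks, threshold_pct=10.0)
print(failures)
```

Because the base is simply the first series seen, reordering the parametrize list would flip which side is treated as the baseline in this sketch as well.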

Ergonomics

  • plot one.json --pivot DIM (no --view) defaults to the compare view — otherwise one run would default to scaling, which has no series axis to pivot. An explicit --view still wins.
  • A pivot that pairs nothing (a custom pytest.param(id=…) whose label differs from its value, so the value never strips out of the id) now warns ("left every row unpaired…") instead of silently showing a pile of one-series rows. Use a plain parametrize value or an extra_info dim (which isn't in the id at all) for the pivot axis.
  • Auto-selected scaling that can't infer an x-axis (no numeric dim, or several) now guides toward --x / --pivot / two runs rather than just "pass x=".
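
The no-pairing warning from the list above amounts to checking whether any collapsed id recurs across pivot values. A minimal sketch, assuming the same hypothetical folded-dict layout as an id-to-series mapping; the function name `check_pairing` and the warning wording are illustrative, not the tool's actual message.

```python
import warnings


def check_pairing(folded, dim):
    """Warn when no id carries more than one series value after the fold."""
    paired = any(len(series) > 1 for series in folded.values())
    if folded and not paired:
        # Hypothetical wording; the real message is only partially quoted above.
        warnings.warn(f"pivot on {dim!r} left every row unpaired")
    return paired


# A plain parametrize value strips out of the id, so rows pair up.
paired_rows = {"test_build[100]": {"legacy": 120.0, "v1": 90.0}}
# A custom id label ("old"/"new") never matches the value, so nothing pairs.
unpaired_rows = {
    "test_build[old-100]": {"legacy": 120.0},
    "test_build[new-100]": {"v1": 90.0},
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_pairing(paired_rows, "semantics")
    bad = check_pairing(unpaired_rows, "semantics")

print(ok, bad, [str(w.message) for w in caught])
```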

Tests & checks

New tests across test_compare.py, test_plotting.py, test_cli.py: param-in-id fold, bare extra_info dim, multiple-runs error, unknown-dim error, scatter single-run, the view-scope guard, the compare-view default, the no-pairing warning, the scaling-inference guidance, --fail-on along the pivot axis (unit + CLI), and the headline acceptance (one run drives both the table and the compare plot). Full suite green; ruff format + check clean; source type-checks.

Docs

  • compare-plot.md: new "One run, two configs — --pivot" section (series-axis model, --group-by contrast, --fail-on-along-pivot, baseline-ordering note, custom-id limitation); --pivot added to the plot-flags rundown.
  • reference.md renders the CLI help live, so --pivot and the updated --fail-on help appear automatically.

Note: branched off main after #128, so it carries the mypy<2.1 pin. main's own [tool.mypy] numpy-stub issue (python_version = "3.11") is unrelated to this change.

🤖 Generated with Claude Code

…a config dim)

Add `--pivot <dim>` to `benchmem compare` and `benchmem plot`, so a single
combined run whose rows differ only in one config dim (e.g.
`semantics=legacy|v1`) can be A/B'd directly — without splitting the data into
separate run files or forcing N pytest invocations.

The insight: the series axis was always a dim, and the run-file is just the
default one. `--pivot` re-points it at a data dim, folding one run so rows that
differ only in that dim pair up and its values become the compared series — the
A/B a run-file pair gives today, from one run. It's implemented as a single fold
on the shared `load_long_df` spine (`_pivot_to_series`), so the compare table and
the compare/scatter plot views inherit it with no change to their logic; a param
duplicated into the node id is lifted back out so `test_build[legacy-100]` and
`[v1-100]` collapse to one `test_build[100]` row.

Scoped to one series axis per A/B view: errors on multiple runs, and on the
scaling/sweep views (there the dim is a normal --x/--facet axis). Composes with
--columns/--stat/--facet/--where/--sort/--group-by unchanged. Named `--pivot`
(not the issue's `--by`) to avoid colliding with `--group-by`: --group-by
partitions rows, --pivot sets what's compared.

Closes #129

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

read-the-docs-community Bot commented Jun 29, 2026

Documentation build overview

📚 pytest-benchmem | 🛠️ Build #33353023 | 📁 Comparing 9bf34ad against latest (16e7c56)

  🔍 Preview build  

2 files changed
± compare-plot/index.html
± reference/index.html

FBumann and others added 4 commits June 29, 2026 12:04
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
--fail-on generalizes the same way the table did: with run-files it gates
runs[0] (base) vs runs[-1] (head); with --pivot it gates the first value of the
dim vs the last, folded out of one run, paired per collapsed id. So a single
combined run gates itself in CI (legacy→v1 peak growth) with no second file —
closing the gap the CodSpeed driver needs.

`find_pivot_regressions` reuses load_long_df (the fold) + _regressions_for (the
growth core); the CLI relaxes the "needs two runs" guard when --pivot supplies
the two sides, and routes the gate accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two ergonomic warts:

- `plot one.json --pivot DIM` (no --view) defaulted to scaling, which the pivot
  guard rejects — the most natural command failed. Default to the compare view
  when --pivot is set and --view is unset; an explicit --view still wins.

- A pivot that pairs nothing (e.g. a custom pytest.param(id=…) whose label
  differs from its value, so the value never strips out of the id) silently
  produced a pile of one-series rows. Warn when no id recurs across pivot values.

Also document the base ordering: the first dim value in parametrize/collection
order is the --fail-on base and first row (reorder the list to flip it); the
table's (1.0) marks each column's best, as the run-file table does — not the base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Auto-selected scaling (single run, no --pivot) errors when the run has no numeric
dim (e.g. an all-categorical semantics=legacy|v1 run) or several. The old message
only said "pass x=", missing that a single-run categorical A/B is exactly what
--pivot is for. Point at --x / --pivot / two-runs instead — the same
guide-don't-fail treatment the pivot view-guard got.

scaling stays the right single-run default: for one run with an inferable numeric
dim it's the useful zero-config curve (and folds categoricals into colour/facet);
the fix is only its failure mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann merged commit d0672f7 into main Jun 29, 2026
13 checks passed
@FBumann FBumann deleted the feat/by-comparison-axis branch June 29, 2026 10:57

Development

Successfully merging this pull request may close these issues.

feat: --by <dim> comparison axis for compare/plot (fold one run along a config dim)