feat: `--pivot` comparison axis for compare/plot (fold one run along a config dim) by FBumann · Pull Request #137 · fluxopt/pytest-benchmem

FBumann · 2026-06-29T10:02:24Z

What

Adds --pivot <dim> to benchmem compare and benchmem plot, so a single combined run whose rows differ only in one config dim (e.g. semantics=legacy|v1, solver=…, branch=…) can be A/B'd directly — no splitting into separate run files, no N pytest invocations.

```shell
# one combined run; semantics is a param (in the id) AND a dim
pytest benchmarks/ --benchmark-only --benchmark-memory --benchmark-json=build.json

benchmem compare build.json --pivot param:semantics --columns time,peak       # A/B table
benchmem plot    build.json --pivot param:semantics --columns peak             # A/B bars (compare view is the default under --pivot)
benchmem compare build.json --pivot param:semantics --fail-on peak:10%         # gate legacy→v1 in CI, from one run
benchmem plot    build.json --x n --facet semantics --columns peak             # --pivot optional here
```

One run now drives the A/B table, the A/B plot, the scaling plot, and an external per-id gate (e.g. CodSpeed, which wants the config value in the node id of one run). Closes #129.

The idea

The series axis — the thing laid side by side and ranked with the (N.NN) multiplier — was always a dim. The run-file is just the default one, which is why compare a.json b.json ranks one file against another. --pivot re-points that axis at a real data dim: its values become the series and it's lifted out of each row's identity so rows differing only in it pair up.

Implemented as a single fold on the shared load_long_df spine (_pivot_to_series), so the compare table and the compare/scatter plot views — already written in terms of (series, pairing-key) — inherit it with no change to their logic. A param duplicated into the opaque node id is lifted back out, so test_build[legacy-100] and test_build[v1-100] collapse to one test_build[100] row.
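As a rough illustration of the fold described above, the sketch below strips the pivot value out of a pytest node id and pairs rows by the collapsed id. It is not the PR's `_pivot_to_series`: the row layout (`id`, `dims`, `peak`) and the helper names `strip_param` and `fold` are assumptions made for this example, and splitting the id on `-` is a simplification that breaks for parameter values that themselves contain hyphens.

```python
from collections import defaultdict


def strip_param(node_id: str, value: str) -> str:
    """Remove one parametrize value from the bracketed part of a node id."""
    name, sep, rest = node_id.partition("[")
    if not sep or not rest.endswith("]"):
        return node_id  # not a parametrized id: nothing to lift out
    parts = rest[:-1].split("-")
    if value in parts:
        parts.remove(value)  # drops the first occurrence only
    return f"{name}[{'-'.join(parts)}]" if parts else name


def fold(rows, dim):
    """Group rows by collapsed id; each group maps pivot value -> row."""
    paired = defaultdict(dict)
    for row in rows:
        value = row["dims"][dim]
        paired[strip_param(row["id"], value)][value] = row
    return dict(paired)


# Hypothetical rows from one combined run (layout is an assumption).
rows = [
    {"id": "test_build[legacy-100]", "dims": {"semantics": "legacy", "n": "100"}, "peak": 120.0},
    {"id": "test_build[v1-100]", "dims": {"semantics": "v1", "n": "100"}, "peak": 90.0},
]

folded = fold(rows, "semantics")
print(list(folded))                     # ['test_build[100]']
print(list(folded["test_build[100]"]))  # ['legacy', 'v1']
```

Both rows land under the single key `test_build[100]`, with the pivot values `legacy` and `v1` as the series inside it, in first-seen order.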

Scope & semantics

- One series axis per A/B view: errors on multiple runs (files × dim would be a 2-D matrix the A/B view can't render), and on scaling/sweep (there the dim is a normal --x/--facet axis).
- --fail-on follows the same axis: normally it gates runs[0] vs runs[-1]; with --pivot it gates the first dim value vs the last, folded out of the one run (find_pivot_regressions, reusing the existing growth core). So a combined run gates itself in CI.
- Ordering is parametrize/collection order (verified, not lexicographic): the first dim value is the --fail-on base and the first row; reorder the parametrize list to flip it. The table's (1.0) separately marks each column's best value, exactly as the run-file table does.
- Distinct from --group-by: --group-by partitions rows into sub-tables (series stays the run-files); --pivot sets what is compared. Named --pivot rather than the issue's --by to avoid the groupby/GROUP BY collision; docstrings and docs spell out the contrast.
- Composes with --columns / --stat / --facet / --where / --sort / --group-by unchanged.
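The gate along the pivot axis can be pictured with a small sketch: take the first and last pivot value in collection order, and flag each collapsed id whose growth from base to head exceeds the threshold. This is not `find_pivot_regressions` itself. The function names, the nested-dict input, and the strict `>` comparison are assumptions for illustration; only the `peak:10%` spec form comes from the commands above.

```python
def parse_fail_on(spec: str) -> tuple[str, float]:
    """Split a 'column:NN%' spec into (column, fractional threshold)."""
    column, _, pct = spec.partition(":")
    return column, float(pct.rstrip("%")) / 100.0


def pivot_regressions(folded, order, spec):
    """Gate the first pivot value (base) against the last (head), per collapsed id."""
    column, limit = parse_fail_on(spec)
    base, head = order[0], order[-1]
    failures = {}
    for test_id, by_value in folded.items():
        if base not in by_value or head not in by_value:
            continue  # unpaired row: nothing to gate
        growth = by_value[head][column] / by_value[base][column] - 1.0
        if growth > limit:  # assumption: strictly above the threshold fails
            failures[test_id] = growth
    return failures


# Hypothetical folded run: collapsed id -> pivot value -> metrics.
folded = {
    "test_build[100]": {"legacy": {"peak": 100.0}, "v1": {"peak": 125.0}},
    "test_build[1000]": {"legacy": {"peak": 800.0}, "v1": {"peak": 820.0}},
}
order = ["legacy", "v1"]  # parametrize/collection order, not sorted

failures = pivot_regressions(folded, order, "peak:10%")
print(failures)  # {'test_build[100]': 0.25}
```

Reversing `order` to `["v1", "legacy"]` makes `v1` the base, which mirrors the note that reordering the parametrize list flips the comparison.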

Ergonomics

- plot one.json --pivot DIM (no --view) defaults to the compare view; otherwise one run would default to scaling, which has no series axis to pivot. An explicit --view still wins.
- A pivot that pairs nothing (a custom pytest.param(id=…) whose label differs from its value, so the value never strips out of the id) now warns ("left every row unpaired…") instead of silently showing a pile of one-series rows. Use a plain parametrize value or an extra_info dim (which isn't in the id at all) for the pivot axis.
- Auto-selected scaling that can't infer an x-axis (no numeric dim, or several) now guides toward --x / --pivot / two runs rather than just "pass x=".
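The no-pairing check can be sketched as follows: after the fold, a pivot paired nothing when no collapsed id carries more than one pivot value. The helper names, the `(collapsed_id, pivot_value)` input shape, and the warning wording are assumptions for this example, not the plugin's actual code or message.

```python
import warnings


def every_row_unpaired(pairs):
    """pairs: (collapsed_id, pivot_value) tuples produced by the fold."""
    values_by_id = {}
    for collapsed_id, value in pairs:
        values_by_id.setdefault(collapsed_id, set()).add(value)
    # True only when there are rows and no id recurs across pivot values.
    return bool(values_by_id) and all(len(v) < 2 for v in values_by_id.values())


def check_pairing(pairs, dim):
    """Warn (hypothetical wording) when the pivot paired nothing."""
    if every_row_unpaired(pairs):
        warnings.warn(f"pivot on {dim!r} left every row unpaired", stacklevel=2)
        return False
    return True


# Custom ids such as pytest.param("legacy", id="old"): the value "legacy"
# never appears in the id, so nothing strips out and the ids stay distinct.
custom_ids = [("test_build[old-100]", "legacy"), ("test_build[new-100]", "v1")]
# Plain parametrize values: both rows collapse to the same id.
plain_ids = [("test_build[100]", "legacy"), ("test_build[100]", "v1")]

print(every_row_unpaired(custom_ids))  # True
print(every_row_unpaired(plain_ids))   # False
```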

Tests & checks

New tests across test_compare.py, test_plotting.py, test_cli.py: param-in-id fold, bare extra_info dim, multiple-runs error, unknown-dim error, scatter single-run, the view-scope guard, the compare-view default, the no-pairing warning, the scaling-inference guidance, --fail-on along the pivot axis (unit + CLI), and the headline acceptance (one run drives both the table and the compare plot). Full suite green; ruff format + check clean; source type-checks.

Docs

- compare-plot.md: new "One run, two configs — --pivot" section (series-axis model, --group-by contrast, --fail-on-along-pivot, baseline-ordering note, custom-id limitation); --pivot added to the plot-flags rundown.
- reference.md renders the CLI help live, so --pivot and the updated --fail-on help appear automatically.

Note: branched off main after #128, so it carries the mypy<2.1 pin. main's own [tool.mypy] numpy-stub issue (python_version = "3.11") is unrelated to this change.

🤖 Generated with Claude Code

…a config dim)

Add `--pivot <dim>` to `benchmem compare` and `benchmem plot`, so a single combined run whose rows differ only in one config dim (e.g. `semantics=legacy|v1`) can be A/B'd directly, without splitting the data into separate run files or forcing N pytest invocations.

The insight: the series axis was always a dim, and the run-file is just the default one. `--pivot` re-points it at a data dim, folding one run so rows that differ only in that dim pair up and its values become the compared series. That is the A/B a run-file pair gives today, from one run.

It's implemented as a single fold on the shared `load_long_df` spine (`_pivot_to_series`), so the compare table and the compare/scatter plot views inherit it with no change to their logic; a param duplicated into the node id is lifted back out so `test_build[legacy-100]` and `[v1-100]` collapse to one `test_build[100]` row.

Scoped to one series axis per A/B view: errors on multiple runs, and on the scaling/sweep views (there the dim is a normal --x/--facet axis). Composes with --columns/--stat/--facet/--where/--sort/--group-by unchanged.

Named `--pivot` (not the issue's `--by`) to avoid colliding with `--group-by`: --group-by partitions rows, --pivot sets what's compared.

Closes #129

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

read-the-docs-community · 2026-06-29T10:03:44Z

Documentation build overview

📚 pytest-benchmem | 🛠️ Build #33353023 | 📁 Comparing 9bf34ad against latest (16e7c56)

🔍 Preview build

2 files changed

± compare-plot/index.html
± reference/index.html

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

--fail-on generalizes the same way the table did: with run-files it gates runs[0] (base) vs runs[-1] (head); with --pivot it gates the first value of the dim vs the last, folded out of one run, paired per collapsed id. So a single combined run gates itself in CI (legacy→v1 peak growth) with no second file, closing the gap the CodSpeed driver needs.

`find_pivot_regressions` reuses load_long_df (the fold) + _regressions_for (the growth core); the CLI relaxes the "needs two runs" guard when --pivot supplies the two sides, and routes the gate accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two ergonomic warts:

- `plot one.json --pivot DIM` (no --view) defaulted to scaling, which the pivot guard rejects, so the most natural command failed. Default to the compare view when --pivot is set and --view is unset; an explicit --view still wins.
- A pivot that pairs nothing (e.g. a custom pytest.param(id=…) whose label differs from its value, so the value never strips out of the id) silently produced a pile of one-series rows. Warn when no id recurs across pivot values.

Also document the base ordering: the first dim value in parametrize/collection order is the --fail-on base and first row (reorder the list to flip it); the table's (1.0) marks each column's best, as the run-file table does, not the base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
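The difference between the (1.0) marker and the --fail-on base can be made concrete with a small sketch. It assumes the multiplier is each series' value divided by the column's best, and that lower is better for time and peak; neither detail is confirmed by the source, and `multipliers` is a hypothetical helper.

```python
def multipliers(series_values: dict[str, float]) -> dict[str, float]:
    """Rank each series against the column's best (assumed: smallest) value."""
    best = min(series_values.values())
    return {name: round(value / best, 2) for name, value in series_values.items()}


# Hypothetical peak column for one collapsed row, in collection order.
peak = {"legacy": 120.0, "v1": 90.0}

ranks = multipliers(peak)
base = next(iter(peak))  # first value in collection order is the gate's base

print(ranks)  # {'legacy': 1.33, 'v1': 1.0}
print(base)   # legacy
```

Here `v1` carries the (1.0) because it has the lowest peak, while `legacy` stays the base of the gate because it comes first in collection order.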

Auto-selected scaling (single run, no --pivot) errors when the run has no numeric dim (e.g. an all-categorical semantics=legacy|v1 run) or several. The old message only said "pass x=", missing that a single-run categorical A/B is exactly what --pivot is for. Point at --x / --pivot / two-runs instead, the same guide-don't-fail treatment the pivot view-guard got.

scaling stays the right single-run default: for one run with an inferable numeric dim it's the useful zero-config curve (and folds categoricals into colour/facet); the fix is only its failure mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
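A minimal sketch of the inference described above: pick the x-axis only when exactly one dim is numeric, and otherwise raise an error that points at the alternatives. The function name `infer_x`, the flat row layout, and the message wording are assumptions for illustration, not the plugin's implementation.

```python
from numbers import Real


def infer_x(rows, dims):
    """Return the single numeric dim, or raise an error that guides the user."""
    numeric = [
        d for d in dims
        if all(isinstance(r[d], Real) and not isinstance(r[d], bool) for r in rows)
    ]
    if len(numeric) == 1:
        return numeric[0]
    reason = "no numeric dim" if not numeric else f"several numeric dims ({', '.join(numeric)})"
    raise ValueError(
        f"cannot infer an x-axis: {reason}; pass --x DIM, "
        "use --pivot DIM for a categorical A/B, or pass two runs"
    )


# Hypothetical single-run rows: one categorical dim, one numeric dim.
rows = [
    {"semantics": "legacy", "n": 100},
    {"semantics": "v1", "n": 1000},
]
# An all-categorical run, the case --pivot is meant for.
categorical_rows = [{"semantics": "legacy"}, {"semantics": "v1"}]

print(infer_x(rows, ["semantics", "n"]))  # n
```

Calling `infer_x(categorical_rows, ["semantics"])` raises a `ValueError` whose message names --x, --pivot and the two-run option rather than only asking for an x value.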

FBumann and others added 4 commits June 29, 2026 12:04

style: ruff-format the pivot test (drop manual wrap + dead noqa)

44c4e02

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FBumann merged commit d0672f7 into main Jun 29, 2026
13 checks passed

FBumann deleted the feat/by-comparison-axis branch June 29, 2026 10:57

github-actions Bot mentioned this pull request Jun 29, 2026

chore(main): release 0.4.4 #138

Merged

FBumann merged 5 commits into main from feat/by-comparison-axis
