ci: land lib-cmp comparison-bench workflows on main (fidelity + perf) #20

Merged: jackmoxley merged 1 commit into main from ci/lib-cmp-workflows on May 26, 2026

@jackmoxley
Contributor

Lands the two comparison-bench workflows on the default branch so they become UI-dispatchable (the workflow_dispatch 'Run workflow' control only appears for workflows on the default branch) and so the pull_request->main hook fires (GitHub reads dispatch/PR workflow definitions from the default branch).

  • lib-cmp-precision.yml — the Fidelity / LSBε precision shootout (decimal-scaled vs 6 peers; per-cell precision-relative grade, score + %CR two-letter grade, report to the run summary + artifact).
  • lib-cmp-perf.yml — the peer-crate timing comparison (width × scale).

Both are advisory — never required status checks. Trigger: pull_request -> main + workflow_dispatch. They run against the selected ref's tree (release/0.5.0 carries the lib_cmp benches today; they arrive on main with the 0.5.0 release), so on main they stay dormant-but-valid until then. No source changes — workflow files only.

Land the two comparison-bench workflows on the default branch so their
workflow_dispatch UI control appears and the pull_request->main hook fires
(GitHub reads dispatch/PR workflows from the default branch). Both are
advisory (never required status checks):
- lib-cmp-precision.yml: the Fidelity / LSB-epsilon precision shootout.
- lib-cmp-perf.yml: the peer-crate timing comparison (width x scale).
Trigger: pull_request->main + workflow_dispatch. They run against the
selected ref's tree (release/0.5.0 carries the benches today; they arrive
on main with the 0.5.0 release), so on main they stay dormant-but-valid
until then.
Contributor Author

@jackmoxley jackmoxley left a comment

looking good

@codspeed-hq
Contributor

codspeed-hq Bot commented May 26, 2026

Merging this PR will not alter performance

⚡ 4 improved benchmarks
❌ 5 regressed benchmarks
✅ 201 untouched benchmarks
⏩ 9 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| div_ceil | 383.6 ns | 441.9 ns | -13.2% |
| ceil | 318.1 ns | 288.9 ns | +10.1% |
| trunc | 316.7 ns | 375 ns | -15.56% |
| next_power_of_two | 62.2 ns | 91.4 ns | -31.91% |
| clamp | 156.7 ns | 127.5 ns | +22.88% |
| TryFrom_u128 | 125.6 ns | 96.4 ns | +30.26% |
| to_num_i32 | 288.3 ns | 346.7 ns | -16.83% |
| to_f32 | 220.3 ns | 249.4 ns | -11.69% |
| D38_div | 791.9 ns | 675.3 ns | +17.28% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ci/lib-cmp-workflows (6ad9994) with main (a9c0e9b)

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@jackmoxley jackmoxley merged commit 3aa1c14 into main May 26, 2026
55 of 57 checks passed
@jackmoxley jackmoxley deleted the ci/lib-cmp-workflows branch May 26, 2026 22:42
jackmoxley added a commit that referenced this pull request May 28, 2026
The previous `neg_twos_complement` did a two-pass shape:
  1. NOT loop into out[N] (N writes).
  2. `add_assign_fixed(out, [1, 0, …, 0])` (a full N-limb dependent
     carry chain over a second stack array, even though limbs 1..N
     add `0` after limb 0).

At wide N the dependent add chain across every limb dominates: each
overflowing_add reads the previous carry, blocking vectorisation, and
the second stack array is pure overhead.
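For readers without the source at hand, the two-pass shape described above can be sketched roughly as follows. This is an illustration, not the crate's code: the `_sketch` names are hypothetical, and the little-endian `u64` limb layout is an assumption.

```rust
/// Hypothetical stand-in for `add_assign_fixed`: adds `b` into `a` limb by
/// limb (little-endian) and returns the final carry. Every iteration reads
/// the carry produced by the previous one, so the loop is a dependent chain.
fn add_assign_fixed_sketch<const N: usize>(a: &mut [u64; N], b: &[u64; N]) -> bool {
    let mut carry = false;
    for i in 0..N {
        let (s1, c1) = a[i].overflowing_add(b[i]);
        let (s2, c2) = s1.overflowing_add(carry as u64);
        a[i] = s2;
        carry = c1 || c2;
    }
    carry
}

/// Two-pass two's-complement negation: NOT every limb, then add 1.
pub fn neg_two_pass_sketch<const N: usize>(a: &[u64; N]) -> [u64; N] {
    // Pass 1: bitwise NOT into the output (N independent writes).
    let mut out = [0u64; N];
    for i in 0..N {
        out[i] = !a[i];
    }
    // Pass 2: build a second stack array [1, 0, ..., 0] and run the full
    // N-limb carry chain, even though limbs 1..N only ever add 0 + carry.
    let mut one = [0u64; N];
    if let Some(first) = one.first_mut() {
        *first = 1;
    }
    add_assign_fixed_sketch(&mut out, &one);
    out
}
```

The second array and the carry read in every iteration are the overhead the paragraph above refers to.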

Replace with a limb-0 split:
  - `out[0] = !a[0] + 1`, capture the carry `c0`.
  - If `c0 == false` (the overwhelmingly common path), limbs 1..N
    reduce to plain independent `!a[i]` writes — no cross-limb
    dependency chain, the compiler can keep them register-resident
    and vectorise the NOT loop.
  - If `c0 == true` (`a[0] == 0`, so `!a[0] == MAX`), fall back to a dependent
    carry-prop chain through limbs 1..N (the correct, slow path).
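A minimal sketch of the limb-0 split, under the same assumptions (little-endian `u64` limbs; `neg_fused_split_sketch` is a hypothetical name, not the routed kernel itself):

```rust
/// Two's-complement negation with a limb-0 split (illustrative sketch).
pub fn neg_fused_split_sketch<const N: usize>(a: &[u64; N]) -> [u64; N] {
    let mut out = [0u64; N];
    if N == 0 {
        return out;
    }
    // Limb 0: !a[0] + 1. The add carries out only when !a[0] == u64::MAX.
    let (lo, c0) = (!a[0]).overflowing_add(1);
    out[0] = lo;
    if !c0 {
        // Fast path: no carry leaves limb 0, so the remaining limbs are
        // independent NOTs with no cross-limb dependency.
        for i in 1..N {
            out[i] = !a[i];
        }
    } else {
        // Slow path: propagate the carry through limbs 1..N.
        let mut carry = true;
        for i in 1..N {
            let (v, c) = (!a[i]).overflowing_add(carry as u64);
            out[i] = v;
            carry = c;
        }
    }
    out
}
```

For example, negating `[1, 0]` takes the fast path and yields `[u64::MAX, u64::MAX]`, while negating `[0, 1]` takes the carry path and yields `[0, u64::MAX]`.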

Generic over `N`, single kernel — no per-tier copies, no LimbSize
axis, no Scratch-on-Int needed. Constitution rules 1-6 hold: one
generic algorithm, one named file, matcher unchanged, sizing local
to width.

A/B verdict (benches/micro/neg_kernel_ab.rs, 6 inputs covering
tiny / half_wide / mid / high / low / carry_chain):

  D462  (N=24): fused_split ≈ two_pass  (within ±10%, noisy)
  D616  (N=32): fused_split beats two_pass by 1.25-1.83x
  D924  (N=48): fused_split beats two_pass by 1.42-2.42x
  D1232 (N=64): fused_split beats two_pass by 1.54-1.63x

Recovers ranks #23/#27/#28 (D616), #31 (D1232) of the bbc §8.4 wide-
neg cluster; D462 (#13/#17/#19/#20) is a wash at the kernel level
(any remaining gap lives in the call shape, not the kernel).

Bench seam: `__bench_internals::neg_fused_split` (routed kernel),
`neg_two_pass` (previous shape, reference baseline), `neg_fused_open`
(single-pass dependent-chain candidate). All bit-identical, asserted
before timing.

Validation: 6 kernel unit tests + 785 lib tests pass.
`cargo check` (default) + `cargo check --features
wide,x-wide,xx-wide,macros --all-targets` both clean.