ci: land lib-cmp comparison-bench workflows on main (fidelity + perf) by jackmoxley · Pull Request #20 · mootable/decimal-scaled

jackmoxley · 2026-05-26T22:21:13Z

Lands the two comparison-bench workflows on the default branch so they become UI-dispatchable (the workflow_dispatch 'Run workflow' control only appears for workflows on the default branch) and so the pull_request->main hook fires (GitHub reads dispatch/PR workflow definitions from the default branch).

lib-cmp-precision.yml — the Fidelity / LSBε precision shootout (decimal-scaled vs 6 peers; per-cell precision-relative grade, score + %CR two-letter grade, report to the run summary + artifact).
lib-cmp-perf.yml — the peer-crate timing comparison (width × scale).

Both are advisory — never required status checks. Trigger: pull_request -> main + workflow_dispatch. They run against the selected ref's tree (release/0.5.0 carries the lib_cmp benches today; they arrive on main with the 0.5.0 release), so on main they stay dormant-but-valid until then. No source changes — workflow files only.

Land the two comparison-bench workflows on the default branch so their workflow_dispatch UI control appears and the pull_request->main hook fires (GitHub reads dispatch/PR workflows from the default branch).

Both are advisory (never required status checks):
- lib-cmp-precision.yml: the Fidelity / LSB-epsilon precision shootout.
- lib-cmp-perf.yml: the peer-crate timing comparison (width x scale).

Trigger: pull_request->main + workflow_dispatch. They run against the selected ref's tree (release/0.5.0 carries the benches today; they arrive on main with the 0.5.0 release), so on main they stay dormant-but-valid until then.

jackmoxley

looking good

codspeed-hq · 2026-05-26T22:25:52Z

Merging this PR will not alter performance

⚡ 4 improved benchmarks
❌ 5 regressed benchmarks
✅ 201 untouched benchmarks
⏩ 9 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|
| ❌ | `div_ceil` | 383.6 ns | 441.9 ns | -13.2% |
| ⚡ | `ceil` | 318.1 ns | 288.9 ns | +10.1% |
| ❌ | `trunc` | 316.7 ns | 375 ns | -15.56% |
| ❌ | `next_power_of_two` | 62.2 ns | 91.4 ns | -31.91% |
| ⚡ | `clamp` | 156.7 ns | 127.5 ns | +22.88% |
| ⚡ | `TryFrom_u128` | 125.6 ns | 96.4 ns | +30.26% |
| ❌ | `to_num_i32` | 288.3 ns | 346.7 ns | -16.83% |
| ❌ | `to_f32` | 220.3 ns | 249.4 ns | -11.69% |
| ⚡ | `D38_div` | 791.9 ns | 675.3 ns | +17.28% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing ci/lib-cmp-workflows (6ad9994) with main (a9c0e9b)

¹ 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

The previous `neg_twos_complement` used a two-pass shape:
1. A NOT loop into out[N] (N writes).
2. `add_assign_fixed(out, [1, 0, …, 0])`: a full N-limb dependent carry chain over a second stack array, even though limbs 1..N only add `0` after limb 0.

At wide N the dependent add chain across every limb dominates: each overflowing_add reads the previous carry, blocking vectorisation, and the second stack array is pure overhead.

Replace with a limb-0 split:
- `out[0] = !a[0] + 1`, capture the carry `c0`.
- If `c0 == false` (the overwhelmingly common path), limbs 1..N reduce to plain independent `!a[i]` writes: there is no cross-limb dependency chain, so the compiler can keep them register-resident and vectorise the NOT loop.
- If `c0 == true` (`!a[0] == MAX`, i.e. `a[0] == 0`), fall back to a dependent carry-prop chain through limbs 1..N (the correct, slow path).

Generic over `N`, single kernel: no per-tier copies, no LimbSize axis, no Scratch-on-Int needed. Constitution rules 1-6 hold: one generic algorithm, one named file, matcher unchanged, sizing local to width.

A/B verdict (benches/micro/neg_kernel_ab.rs, 6 inputs covering tiny / half_wide / mid / high / low / carry_chain):
- D462 (N=24): fused_split ≈ two_pass (within ±10%, noisy)
- D616 (N=32): fused_split beats two_pass by 1.25-1.83x
- D924 (N=48): fused_split beats two_pass by 1.42-2.42x
- D1232 (N=64): fused_split beats two_pass by 1.54-1.63x

Recovers ranks #23/#27/#28 (D616), #31 (D1232) of the bbc §8.4 wide-neg cluster; D462 (#13/#17/#19/#20) is a wash at the kernel level (any remaining gap lives in the call shape, not the kernel).

Bench seam: `__bench_internals::neg_fused_split` (routed kernel), `neg_two_pass` (previous shape, reference baseline), `neg_fused_open` (single-pass dependent-chain candidate). All bit-identical, asserted before timing.

Validation: 6 kernel unit tests + 785 lib tests pass. `cargo check` (default) + `cargo check --features wide,x-wide,xx-wide,macros --all-targets` both clean.
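As a rough illustration of the limb-0 split described in this commit message, here is a minimal standalone sketch. It is not the crate's code: the `_sketch` function names, the `u64` limb type, and the little-endian limb order (limb 0 least significant) are all assumptions made for the example. The carry out of limb 0 occurs exactly when `!a[0]` is all ones, i.e. when `a[0] == 0`.

```rust
/// Hypothetical sketch of the limb-0 split negation (names and u64 limbs are assumed).
/// Computes the two's-complement negation of a little-endian N-limb integer.
pub fn neg_fused_split_sketch<const N: usize>(a: &[u64; N]) -> [u64; N] {
    let mut out = [0u64; N];
    let Some((&a0, rest)) = a.as_slice().split_first() else {
        return out; // N == 0: nothing to negate
    };
    // Limb 0: !a[0] + 1. This overflows only when !a[0] == u64::MAX (a[0] == 0).
    let (low, c0) = (!a0).overflowing_add(1);
    out[0] = low;
    if !c0 {
        // Common path: no carry leaves limb 0, so every other limb is an
        // independent NOT with no cross-limb dependency.
        for (o, &x) in out[1..].iter_mut().zip(rest) {
            *o = !x;
        }
    } else {
        // Slow path: propagate the carry through limbs 1..N.
        let mut carry = true;
        for (o, &x) in out[1..].iter_mut().zip(rest) {
            let (v, c) = (!x).overflowing_add(carry as u64);
            *o = v;
            carry = c;
        }
    }
    out
}

/// Hypothetical reference baseline in the two-pass shape: NOT everything,
/// then add [1, 0, ..., 0] with a full dependent carry chain.
pub fn neg_two_pass_sketch<const N: usize>(a: &[u64; N]) -> [u64; N] {
    let mut out = [0u64; N];
    for (o, &x) in out.iter_mut().zip(a) {
        *o = !x;
    }
    let mut one = [0u64; N]; // the second stack array the split avoids
    if let Some(first) = one.first_mut() {
        *first = 1;
    }
    let mut carry = false;
    for (o, &addend) in out.iter_mut().zip(&one) {
        let (s1, c1) = o.overflowing_add(addend);
        let (s2, c2) = s1.overflowing_add(carry as u64);
        *o = s2;
        carry = c1 || c2;
    }
    out
}
```

Comparing the two functions on the same inputs, including a carry-chain input such as `[0, 0, 5]`, mirrors the "bit-identical, asserted before timing" check mentioned in the bench seam.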


jackmoxley merged commit 3aa1c14 into main May 26, 2026
55 of 57 checks passed

jackmoxley deleted the ci/lib-cmp-workflows branch May 26, 2026 22:42
