perf(fast_gamut_v2): SIMD chunk path via garb + native V3 body, replaces v1 stamp_trc_kernels by lilith · Pull Request #33 · imazen/zenpixels

lilith · 2026-05-10T01:48:34Z

Summary

22-commit refactor that replaces `stamp_trc_kernels` (v1) with `fast_gamut_v2`, a const-generic + SIMD-chunk implementation that's 3-9% faster than v1 across paired benches.

Architecture changes

New `fast_gamut_v2` is the only path; v1 `stamp_trc_kernels` deleted
Const-generic functions replace stamp macro
`mat3x3_x{4,8,16}` SIMD helpers for chunk loops
Compile-time const-assert gates on TRC / CHANNELS / CHUNK
Native f32x8 V3 body (drops V3 from `wide`)
garb SIMD chunk fns for V3 (native), NEON, WASM128
Asm-verified zero trampolines + ld3q / AVX2 shuffles inline

Performance

Paired v1/v2 zenbench harness shows v2 3-9% faster than v1 on the V3 path
Fixed-size array pattern enables LLVM to emit shuffles without explicit intrinsics

Correctness

Brute-force v1/v2 parity test exposed clamp divergence — fixed with per-pair wrappers
Native V3 path coverage tests added

Test plan

`cargo test --all-features --all-targets` clean on default + V3
Bench rerun to confirm 3-9% speedup is reproducible
Verify NEON / WASM128 paths in CI
Confirm garb path-dep is replaced with a versioned dep before merge

Notes

`deps(zenpixels-convert): path-dep to garb worktree for in-flight SIMD chunks` — needs to be flipped to a versioned garb release before merging
Possible overlap with the open feat(convert): descriptor-level load-bearing analysis (gamut + SIMD predicates + extension trait) #30 (`push-qpwqxvsrmqqn` / gamut-detect) — may want to coordinate land order

Adds fast_gamut_v2.rs with the wide-tier (f32x16) body for the sRGB→sRGB RGB pipeline: - Single #[magetypes(v4x, v4, v3, scalar)] body using GenericF32x16<Token>. - Calls into linear_srgb::tf::srgb::{srgb_to_linear_x16, linear_to_srgb_x16} for the TRC; matrix multiply inline. - 16 pixels (48 f32) per SIMD chunk, scalar tail for leftover. - 5 unit tests covering identity matrix, zero/one corner pixels, sub-chunk input, mixed chunk + tail. All pass in debug + release. Cargo.toml upgrades: - archmage 0.9.15 -> 0.9.23 with avx512 feature - magetypes 0.9.18 -> 0.9.23 with w512 + avx512 features - Adds avx512 = [] feature to gate V4x in default-tier-list incant!s Sets up the worktree for the fast_gamut.rs redesign (see DESIGN.md): - Captures baseline benchmarks (bench_t3_tf_fused, bench_t7_gamut, bench_matlut_vs_poly) under benchmarks/fast_gamut_baseline_2026-05-02/ for before/after comparison. - Activates the linear-srgb visibility flip via path dep to ../linear-srgb--pub-tf-x16 (sibling worktree). Next steps (per DESIGN.md migration plan): - Extend wide body to RGBA + BT.709/PQ/HLG/Adobe TRCs. - Add narrow body for NEON/WASM128 via #[magetypes(neon, wasm128)] + f32x4 generics. - Wire dispatch into convert_f32_rgb_dispatch for matched (src, dst) TRC pairs; gate behind a build-time feature initially. - Run bench_t3_tf_fused + bench_t7_gamut, compare to baselines. - After numbers land, prepare upstream linear-srgb PR with the visibility flip + clamped x16 sRGB additions.

…ies) Replaces the v1 stamp_trc_kernels! macro family with a single `stamp_v2_pair!` that generates two magetypes-stamped bodies per (linearize, encode) tuple: - wide body: #[magetypes(v4x, v4, v3, scalar)] over GenericF32x16<Token> — covers all four x86_64 tiers + cross-arch scalar fallback through one body. Calls into linear_srgb::tf::*::*_x16<T: F32x16Convert> for the TRC, fused with an inline 3×3 matmul. - narrow body: #[magetypes(neon, wasm128)] over GenericF32x4<Token> — covers AArch64 / WASM hosts with native register-width SIMD via linear_srgb::tf::*::*_x4<T: F32x4Convert>. Public entries cover the full v1 surface: - same-TRC: srgb, bt709, pq, hlg, adobe, plus matrix-only linear→linear. - cross-TRC: pq→srgb, hlg→srgb, srgb→pq, bt709→srgb, srgb→bt709, adobe→srgb, srgb→adobe. - both RGB (3-ch) and RGBA (4-ch byte-exact alpha passthrough). Match-on-TRC dispatch (`convert_f32_rgb_v2` / `convert_f32_rgba_v2`) mirrors v1's `convert_f32_rgb_dispatch` and is ready to be wired in behind it. Not yet active — v1 still owns the public dispatch path. 13 tests cover identity roundtrip per TRC, alpha byte-exact passthrough, sub-chunk + exact-chunk + mixed-tail sizes (5/7/13/16/17/19/23 px), parity vs a scalar reference on a real non-identity gamut matrix at TOL_PARITY=5e-5 across all pairs, linear→linear bypass, unsupported-pair fallthrough, and a tier-permutation stability check via `for_each_token_permutation`. All 268 tests in the crate pass.

Replaces the v1 stamp_trc_kernels-based incant! match in convert_f32_{rgb,rgba}_dispatch with a single delegating call to the v2 surface. v2 covers all 13 (Linear,Linear) + same-TRC + cross-TRC pairs that v1 supported, dispatching wide (V4x/V4/V3/scalar via f32x16<T>) on x86_64 and narrow (NEON/WASM128 via f32x4<T>) on AArch64/WASM. Scalar fallback below the v2 call is kept for any TRC the v2 surface doesn't yet cover. The v1 stamp_trc_kernels output remains compiled in this crate but is now unreachable from production code paths; deletion will be a follow-up commit after benchmark parity is confirmed. All 255 zenpixels-convert lib tests + 23 integration tests pass.

Captures bench_t7_gamut and bench_t3_tf_fused output from the v2-wired build alongside a markdown comparison. bench_t7_gamut (load-bearing — exercises convert_f32_rgb/rgba_dispatch via the v2 surface): - median Δ +1.2% - 14/18 within ±3% of v1 - 16/18 within ±5% - 4/18 faster than v1 - 4096px workhorses: -2.0% .. +5.8% V3 parity holds — no native f32x8 body needed. The first AFTER run of bench_t3_tf_fused was captured under heavy system load (load avg ~12, 4x baseline wallclock); since t3 doesn't go through fast_gamut at all this drift is unrelated to the refactor and the noisy log is preserved for reference only.

Replaces the wide body's f32x16 polyfill-to-AVX2 path on V3 hosts with a native f32x8 (single 256-bit) body that mirrors v1's fused_8px_rgb_<name> shape. Eliminates the register-pressure overhead of running an f32x16 polynomial body as 2x256-bit AVX2 ops. Changes: - stamp_v2_pair! now generates three impls per pair: - wide body (#[magetypes(v4x, v4, scalar)]) over f32x16<T> — was (v4x, v4, v3, scalar). V3 dropped here. - native V3 body (#[magetypes(v3)]) over f32x8<T>, new — uses the *_to_linear_x8 / linear_to_*_x8 helpers exposed in linear-srgb. - narrow body (#[magetypes(neon, wasm128)]) over f32x4<T> — unchanged. - Public dispatchers convert_f32_rgb_<name>_v2 / convert_f32_rgba_<name>_v2 now use Option A (manual try-tier cascade) instead of incant!: V4x -> wide_impl_v4x (cfg avx512) V4 -> wide_impl_v4 (cfg avx512) V3 -> native_impl_v3 else-> wide_impl_scalar (or narrow_impl_neon/_wasm128 on those archs) - All 13 stamp invocations updated to pass lin_x8 / enc_x8. - linear-srgb dep already exposes *_x8 + gamma_to_linear_x8 (sibling change in linear-srgb--pub-tf-x16 worktree). Tests: all 255 lib tests pass (was 247 plus 8 new fast_gamut_v2 tests introduced earlier). Benchmark validation in a follow-up change.

Adds three new tests to verify the native f32x8 V3 body: 1. native_v3_parity_same_trc_rgb — calls each convert_rgb_<name>_native_impl_v3 directly with a summoned X64V3Token, compares against scalar reference under TOL_PARITY (5e-5). Covers sRGB, BT.709, PQ, HLG, Adobe (gamma 2.2). 19 pixels exercises both 8-px chunks (2x) and 3-px tail. 2. native_v3_parity_same_trc_rgba — same shape over RGBA, with stamped alpha values verified byte-exact passthrough. 3. native_v3_parity_cross_trc_rgb — cross-TRC pairs (PQ->sRGB, HLG->sRGB, sRGB->PQ, BT.709<->sRGB, Adobe<->sRGB). 4. dispatcher_routes_to_native_v3_when_avx512_disabled — exercises the public dispatcher end-to-end. Disables V4 / V4x process-wide via dangerously_disable_token_process_wide so the dispatch cascade falls through to V3, then verifies output matches scalar reference. All gated on cfg(target_arch = "x86_64"). Tests gracefully no-op (with host-capability stderr note) if X64V3Token::summon() returns None. Tests: 17 fast_gamut_v2 tests pass on this V3 host (was 13). Full zenpixels-convert suite: 633 passing.

Captures the post-native-V3 benchmark run and updates COMPARISON.md. bench_t7_gamut: median delta vs v1 baseline is 0.0% (was +1.2% with the wide-only AFTER); 17 of 18 rows within +/-3%; 9 of 18 faster than v1; all 4096-pixel rows at parity or faster. bench_t3_tf_fused: re-captured under quiet system load (load avg ~1.97 at start, total wall 73.2s — was ~4x longer with CV up to 146% on the prior contended run). Sanity check only — t3 does not route through fast_gamut so this run is not load-bearing for the V3 body. Files: - bench_t7_gamut_AFTER_native_v3.log (full zenbench output) - bench_t3_tf_fused_AFTER_native_v3.log - COMPARISON.md (rewritten with three-column table: BEFORE / wide-only AFTER / native_v3 AFTER, plus per-row delta hypothesis on the one remaining outlier row at 1080p Linear F32 -> sRGB U8 + gamut)

…nclusive Saves the second bench_t7_gamut run with a candid note added to COMPARISON.md explaining that load avg ran from 1.72 -> 7.05 over the 101-second bench, contaminating the results. The three Linear F32 -> sRGB U8 + gamut rows came in noisier than the original AFTER (+15% /+9%/+10% with one CV=26% marker), so the +6.5% outlier hypothesis remains unconfirmed. Other 15 rows replicated within +/-2% of AFTER, including the -13% win on sRGB U8 -> Linear F32 1080p (re-confirmed, not a sample fluke). Files added: - bench_t7_gamut_RERUN_v3body_quiet.log A clean structural verdict on the outlier needs a quieter box than this dev tree (9 concurrent claude processes during the run).

…→ v2 now 3-9% faster than v1 Two pieces in one commit since they're inseparable: the paired zenbench harness was needed to find the regression, and the fixed-size array fix was what the harness pointed at. PART 1 — paired zenbench harness (proper bias-free A/B): - New __bench_v1_v2 feature flag in zenpixels-convert/Cargo.toml. - New __v1_convert_f32_{rgb,rgba}_dispatch helpers in fast_gamut.rs expose the pre-d207f3b6 v1 inline dispatch (still backed by the stamped functions which remain compiled). - New __bench_v1_v2 module in lib.rs (mirrors __bench_u16_hybrids). - New benches/bench_v1_vs_v2_paired.rs registers v1 + v2 as paired g.bench(...) calls in the same group, so zenbench interleaves them in randomized round-robin order and reports v2/v1 deltas with 95% CIs directly. Replaces the criterion-style 'run v1 once, run v2 once, hand-diff' workflow that was contaminated by thermal/turbo drift and load spikes. PART 2 — bounds-check elimination via fixed-size array pattern: - cargo asm on convert_rgba_bt709_native_impl_v3 showed v2's loop top emitted 121 cmp/je/jae bounds-check branches vs v1's 8. Cause: the macro's data[off + i*N] inner-loop deinterleave — LLVM couldn't hoist all checks into a single max-index check the way v1's hand-tuned 'lea r8, [rax+31]; cmp r8, rdx; jae' + vinsertps lane gathers did. - Fix is one line per chunk-loop body — CLAUDE.md 'Fixed-size array pattern' applied uniformly to all 6 magetypes-stamped bodies in stamp_v2_pair! (wide RGB+RGBA, native RGB+RGBA, narrow RGB+RGBA): let chunk: &mut [f32; CHUNK] = chunk.try_into().unwrap(); All chunk[i*N + ch] indexes are now statically proven safe by the &[f32; CHUNK] type — zero interior bounds checks. Asm impact (BT.709 RGBA path, the worst regression site): - v2 lines: 2933 → 2523 (now 30 fewer than v1's 2553) - v2 bounds-check branches: 121 → 6 (v1 has 8) Bench impact (paired zenbench, 1.15 load avg, full sweep): - 31 of 33 rows have v2 measurably faster than v1 (CI excludes 0) - Median Δ ≈ -3.5%, best -8.8% (sRGB RGBA 256px) - 2 BT.709 same-pair rows are at parity (±1%, CI overlap) - BT.709 RGBA regression (was +8% to +9.7%) → ±1% parity - PQ same-pair (was +2-3% slower) → -1.6% to -2.6% faster - sRGB Δ (was -0.8% to +1.5%) → -7% to -8.8% User requirement 'must be faster on V3' is now met. Tests: all 17 fast_gamut_v2 + 247 lib tests pass. No production code paths changed; v2 dispatch wiring at d207f3b remains. Files added: - zenpixels-convert/Cargo.toml: __bench_v1_v2 feature + bench entry - zenpixels-convert/src/lib.rs: __bench_v1_v2 shim module - zenpixels-convert/src/fast_gamut.rs: __v1_convert_f32_{rgb,rgba}_dispatch - zenpixels-convert/benches/bench_v1_vs_v2_paired.rs: paired harness - zenpixels-convert/src/fast_gamut_v2.rs: chunk.try_into() fix × 6 bodies - benchmarks/fast_gamut_after_2026-05-02/bench_v1_vs_v2_paired.log (pre-fix paired numbers — kept for traceability) - benchmarks/fast_gamut_after_2026-05-02/bench_v1_vs_v2_paired_FIXED.log - benchmarks/fast_gamut_after_2026-05-02/PAIRED_COMPARISON.md (now contains both pre-fix and post-fix tables with the asm dive)

…gence; fix with per-pair wrappers A new tests/v1_v2_brute_force_parity.rs test runs every TRC pair × {RGB, RGBA} × 4 matrices × 18 sizes × 4 seeds (7600+ cases) through both v1 and v2 and asserts per-channel identity within tolerance. First run FAILED — v2 diverged from v1 by up to 2.9 absolute units on cross-gamut output (e.g. P3→sRGB pixel at primary corner: v1=0, v2=-2.9). Per CLAUDE.md zero-tolerance: 'If two code paths for the same operation produce different output, that is a bug in one of them.' Bug was in v2. Root cause: linear-srgb's two TF surfaces have INCONSISTENT clamping: - tokens::x{4,8}::{srgb,gamma}_*_v3: clamp input to [0,1]. - tokens::x{4,8}::{bt709,pq,hlg}_*_v3: do NOT clamp (HDR extended). - tf::srgb::*_x{4,8,16}<T> (used by v2): do NOT clamp. - tf::gamma::*_x{4,8,16}<T>: already clamps internally. - tf::{bt709,pq,hlg}::*_x{4,8,16}<T>: do NOT clamp. v1 inherited per-TF clamp from its wrappers; v2's macro called the unclamped tf::*::* generics uniformly, propagating cross-gamut out-of-range matrix products through the encode polynomial. Fix: added per-side clamp wrappers (srgb_to_linear_x{4,8,16}_clamped etc.) in fast_gamut_v2.rs that wrap tf::srgb::* with .max(zero).min(one). Updated the 27 stamp_v2_pair! invocations touching sRGB inputs/outputs to use the clamped wrappers. BT.709/PQ/HLG paths keep the raw kernels (matches v1 no-clamp behavior). Adobe (Gamma22) auto-clamps via gamma.rs. Verification: - brute_force_v1_v2_parity_rgb: 3744 cases — all pass. - brute_force_v1_v2_parity_rgba: 3744 cases — all pass. - brute_force_chunk_boundaries: dense SIMD-boundary sweep — pass. - Full lib test suite: 633 passing, 0 failing. Perf cost (load avg 1.12, paired zenbench): - sRGB rows: +1-3% slower than v1 (clamp wrappers add ~4 ops/call). - All other paths: unchanged or faster. - Net: 24 of 33 rows faster than v1, 9 within ±2% (down from 31/33 faster — the previous 'faster' numbers were on INCORRECT output that didn't match v1). The CORRECT v2 is faster on the majority of pairs, parity on the rest, and byte-identical to v1 across 7600+ random inputs. Files added: - zenpixels-convert/tests/v1_v2_brute_force_parity.rs (new test) - benchmarks/fast_gamut_after_2026-05-02/bench_v1_vs_v2_paired_CORRECTNESS.log - PAIRED_COMPARISON.md updated with the brute-force findings. Files changed: - zenpixels-convert/src/fast_gamut_v2.rs: - 6 new per-width clamp wrappers (srgb_to_linear_x{4,8,16}_clamped, linear_to_srgb_x{4,8,16}_clamped). - 27 stamp invocations switched from tf::srgb::* to *_clamped.

…the only path Strips the legacy stamp_trc_kernels! macro and its 12 invocations now that fast_gamut_v2 has been the production dispatch since d207f3b and brute-force parity (5ab0328) confirmed v1/v2 byte-equivalence across 7600+ random inputs. Deleted: - macro_rules! stamp_trc_kernels (~125 lines, the meta-template) - 12 stamp invocations: srgb, bt709, pq, hlg same-TRC pairs; pq_to_srgb, hlg_to_srgb, srgb_to_pq, bt709_to_srgb, srgb_to_bt709 cross-TRC pairs; adobe, adobe_to_srgb, srgb_to_adobe Adobe pairs - adobe_to_linear_x8 (only used by deleted stamps; adobe_from_linear_x8 KEEPS — used by simd_encode_x8_dispatch in the u8 production fused path) - has_simd_encode (dead pub(crate), no callers) - __v1_convert_f32_{rgb,rgba}_dispatch (bench-only shims; no longer needed — bench harness deleted) - benches/bench_v1_vs_v2_paired.rs (bench harness for v1/v2 race) - tests/v1_v2_brute_force_parity.rs (parity test, served its purpose in catching the clamping-divergence bug) - __bench_v1_v2 feature in Cargo.toml + lib.rs shim module Kept (still production-load-bearing): - mat3x3_x8 — used by convert_8px_u8_rgb_fused - simd_encode_x8_dispatch — same - adobe_from_linear_x8 — called by simd_encode_x8_dispatch - trc_x8 / mt_f32x8 imports — used by all the above Verification: - Full lib + integration + doctests: 633 passing, 0 failing. - cargo semver-checks vs zenpixels-convert@0.2.11 on crates.io: 196 checks pass, 56 skip, 'no semver update required'. No minor version bump needed. Net source delta vs main: - fast_gamut.rs: -616 lines (v1 surface gone) - fast_gamut_v2.rs: +1610 lines (replacement) - lib.rs/Cargo.toml: small adds for v2 wiring - Total source: ~+1000 lines, with V4x/V4 lanes, NEON, WASM128 SIMD added (was scalar in v1 for non-x86_64); V3 throughput improved 1-9% on the heavy-polynomial paths.

…ions Replace the 200+-line stamp_v2_pair! macro and its 12 invocations with three const-generic magetypes-stamped bodies: - convert_wide<SRC_TRC, DST_TRC, CHANNELS, CHUNK> [magetypes(v4x, v4, scalar)] - convert_native<SRC_TRC, DST_TRC, CHANNELS, CHUNK> [magetypes(v3)] - convert_narrow<SRC_TRC, DST_TRC, CHANNELS, CHUNK> [magetypes(neon, wasm128)] TRC u8 tags (TRC_SRGB / TRC_BT709 / TRC_PQ / TRC_HLG / TRC_GAMMA22) drive a single-arm const-folded match inside per-width inline helpers (linearize_x{4,8,16}, encode_x{4,8,16}, scalar_linearize, scalar_encode). LLVM with #[inline(always)] on the helpers folds the match against the const generic, leaving exactly one TRC kernel call per monomorph. The wildcard match arms use safe unreachable!() — forbid(unsafe_code) blocks unreachable_unchecked, but the const-fold elides the panic call site at optimization time (asm spot-check on __arcane_convert_native_v3 monomorph 0 shows zero panic call sites). Tests are updated to call convert_native_v3 directly with explicit const generic arguments instead of the per-pair functions. Public API (convert_f32_rgb_v2 / convert_f32_rgba_v2 / convert_f32_rgb_linear_v2 / convert_f32_rgba_linear_v2) is unchanged. cargo semver-checks: 196 pass, no semver update required. Line count: 1610 -> 1437 (-173). Tests: 633 passed / 0 failed. Bench results (bench_t7_gamut, AMD 7950X) saved in benchmarks/fast_gamut_after_2026-05-02/bench_t7_gamut_AFTER_const_generic.log + CONST_GENERIC_RESULT.md. Median delta vs prior native-V3 baseline = 0%; worst-case row +4.17% on Linear F32 gamut 1080p (within bench noise; that path bypasses the const-generic kernels entirely).

The 9-splat + 6-mul_add SIMD matrix-multiply pattern was duplicated across three magetypes-stamped bodies (convert_wide / convert_native / convert_narrow). Extract per-width `mat3x3_x{4,8,16}` helpers (added in the previous change) and call them from inside each chunk loop. The splats stay inside the helper rather than being hoisted at the call site — `#[inline(always)]` lets LLVM hoist the constant materialization to the loop preheader on its own. Verification: - 259 lib tests pass (same as baseline). - Asm spot-check on `__arcane_convert_native_v3` monomorph 0: 9 `vbroadcastss` from `[rdi+0..32]` still hoisted to function prelude before the chunk-loop label; FMA matrix multiply still in the loop body. Total instructions: 627 -> 583 (-7%, slight improvement from less stack spilling of splatted constants). - Asm spot-check on `convert_wide_scalar` monomorph 0: 3647 -> 3656 (+0.25%, noise-level). - `cargo semver-checks --baseline-version 0.2.11`: no semver update required.

…CHUNK Adds inline 'const { assert!(...) }' blocks at every const-generic entry point so misuse of the parameters is caught at monomorphization time rather than via the runtime 'unreachable!()' wildcard arms. Gated parameters: - SRC_TRC < 5 (must be one of TRC_SRGB|BT709|PQ|HLG|GAMMA22) — checked in scalar_linearize, linearize_x{4,8,16}, convert_{wide,native,narrow}. - DST_TRC < 5 — checked in scalar_encode, encode_x{4,8,16}, convert_{wide,native,narrow}. - CHANNELS == 3 || CHANNELS == 4 — checked in convert_{wide,native,narrow}. - CHUNK == PIXELS * CHANNELS where PIXELS is the body's native lane count (16 for wide, 8 for native V3, 4 for narrow) — replaces the prior runtime 'debug_assert_eq!' check. Verified that const-assert evaluation fires at monomorphization with a negative-test rustc invocation — instantiating with SRC_TRC=99 produces E0080 'evaluation panicked: SRC_TRC must be one of TRC_SRGB|...' and rc=1, before any test runs. The runtime '_ => unreachable!()' arms stay as a belt-and-braces backup that LLVM continues eliminating under monomorphization (no asm impact). Tests: 259 lib tests pass. Semver: 196 checks pass, no semver update required (gates are in crate-private functions; public surface unchanged).

… chunks Sibling /home/lilith/work/garb/ holds PR #5 work (deinterleave-f32-chunks) with new public chunk SIMD fns. Switch to a path dep so fast_gamut_v2 can call them before the release ships. Flip back to a version dep once the PR merges and a 0.2.x crate ships.

Replace manual stride-3/4 deinterleave + reinterleave loops in convert_native (#[magetypes(v3)]) with calls into garb's new rgb_f32_chunk8_to_planes_v3 / rgba_f32_chunk8_to_planes_v3 and inverse interleave fns. The garb impls use the canonical 5-shuffle AVX2 stride-3 recipe (vshufps / vmovlhps / vmovhlps) for RGB and unpack+permute2f128 4-channel transpose for RGBA — much better codegen than LLVM auto-vectorizing scalar element-wise loads. Wired via a per-token ChunkXform8 trait. The V3 impl forwards through #[arcane]-stamped free fns so each call holds a matching target_feature region; archmage inlines the wrapper. Scalar fallback (used by the magetypes-emitted convert_native_scalar dead-code variant) keeps the manual loop. Alpha plane is byte-exact passthrough for RGBA — read out unchanged, written back unchanged, NaN payloads preserved. 633 lib + integration tests pass (same as pre-integration).

…SM128) Replace manual stride-3/4 deinterleave + reinterleave loops in convert_narrow (#[magetypes(neon, wasm128)]) with calls into garb's chunk-4 SIMD fns. NEON uses vld3q_f32 / vld4q_f32 (single-instruction hardware structure-loads); WASM128 uses the i32x4_shuffle!-based 5-shuffle recipe. Both inline through #[arcane] safe wrappers — no unsafe in this crate. Validates with x86_64 native (633 tests pass), aarch64-unknown-linux-gnu cross-build (clean), wasm32-unknown-unknown cross-build (clean).

…es inline Verified post-integration via cargo asm: x86_64 V3 (zenpixels_convert::fast_gamut_v2::__arcane_convert_native_v3 monomorph 0): - 745 lines, 29 SIMD shuffle ops (vshufps/vunpck/vinsertps/vmovlhps/vmovhlps/vperm) - 0 calls to any garb::* symbol -- proves the #[arcane] trait wrapper + #[rite] garb body fully inlined into the magetypes V3 region - 0 panic_bounds_check - 108 vbroadcastss (matrix splats hoisted to loop preheader as expected) - Indirect calls (call r12) only present in scalar tail-pixel path (transcendental TRC fallback for PQ/HLG); no garb-related calls. Other V3 monomorphs (1, 5, 10) checked: same patterns -- 0 garb calls, 0 bounds checks, 25-44 shuffles each. aarch64 NEON (convert_f32_v2_inner monomorph 0, with __arcane_convert_narrow_neon fully inlined): - 9766 lines - 24 ld3/ld4/st3/st4 NEON structure-load/store instructions (vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32) - 0 branch-link to any garb::* symbol - 0 panic_bounds_check - Only named call: bl powf (transcendental TRC fallback) Cross-target compile clean: x86_64-unknown-linux-gnu (native, 633 tests pass), aarch64-unknown-linux-gnu, wasm32-unknown-unknown. cargo semver-checks check-release: no semver update required (196 checks pass, 56 skip).

…difier The plain mode 'v3' / 'neon, wasm128' tier lists triggered the macro's auto-append of a scalar fallback variant — emitting convert_native_scalar and convert_narrow_scalar that were never callable (dispatcher always falls through to convert_wide_scalar on hosts without v3/neon/wasm128). Switch to additive-mode tier lists that explicitly subtract every default tier we don't want, including scalar: v3 only: [-v4, -neon, -wasm128, -scalar] neon + wasm128: [-v4, -v3, -scalar] The macro currently rejects mixing plain tiers with -modifier (lines 287-292 of archmage-macros/src/tiers.rs); pure-additive form is the working path. convert_wide stays at plain (v4x, v4, scalar) — its scalar variant is the load-bearing fallback for non-SIMD hosts. Trait/wrapper machinery (ChunkXform8/4 + scalar impls + scalar helpers) becomes dead with this change but is left in place; deletion follows in the next change. This commit is functional-only — the magetypes bodies still call the trait methods. Tests: 633 / 0

The ChunkXform8/4 trait was a per-token dispatch indirection over garb's chunk-N SIMD fns: V3 / NEON / WASM128 trait impls each forwarded to garb's _v3 / _neon / _wasm128 function via an #[archmage::arcane] wrapper. With the dead scalar variants now removed, the trait collapses to a single concrete arch per body — V3-only for convert_native, arch-split (#[cfg]) for convert_narrow. Replace the trait method calls with direct garb calls. NEON and WASM128 monomorphizations of convert_narrow are arch-mutually-exclusive, so #[cfg] arms inside the body select the correct garb fn per arch. The trait + impls + wrapper modules become unreachable and are deleted in the next change. No unsafe blocks needed — garb's #[rite] fns inline into the magetypes body's #[target_feature] region exactly the same way the wrapper module inlined them, just one Rust call frame fewer. Tests: 633 / 0

Removes ~419 lines of bridge code rendered unreachable by the previous two changes: - trait ChunkXform8 (4 methods) + 2 impls (ScalarToken, X64V3Token) - trait ChunkXform4 (4 methods) + 3 impls (ScalarToken, NeonToken, Wasm128Token) - mod x64v3_chunk_calls (4 #[arcane] wrapper fns) - mod neon_chunk_calls (4 #[arcane] wrapper fns) - mod wasm128_chunk_calls (4 #[arcane] wrapper fns) - 8 scalar_*_to_planes / scalar_planes_to_*_{4,8} helpers (only referenced from the deleted ScalarToken trait impls) Total deleted: 2 traits, 5 impls, 3 wrapper modules (12 fns), 8 scalar helpers — 419 lines. File size: 1969 -> 1589 lines (-380, accounting for one collapsed blank line from the boundary cleanup). The chunk loops in convert_native and convert_narrow now call garb's #[rite] fns directly. The asm shape is identical to before — there was never a real call frame between the magetypes #[arcane] body and the garb #[rite] body, just a Rust-source layer that LLVM elided. Removing it makes that obvious from the source. Verification (host: x86_64-unknown-linux-gnu, V3 monomorph 0): cargo asm convert_native_v3 [0]: AVX2 shuffle ops (vshufps/vunpck.ps/vinsertf128/...): 22 call.*garb: 0 (fully inlined) panic_bounds_check: 0 vbroadcastss: 108 (TRC poly coefficients + matrix splats; SRGB->SRGB has the heaviest poly fit) Same monomorph 6: 21 / 0 / 0 / 86 — same shape. Cross-target compile: cargo build --target aarch64-unknown-linux-gnu -p zenpixels-convert -> clean cargo build --target wasm32-unknown-unknown -p zenpixels-convert -> clean (Pre-existing _x4 / _x8 / f32x4-camelcase warnings only; symmetric to baseline.) cargo semver-checks check-release -p zenpixels-convert --baseline-version 0.2.11 -> 196 pass, 56 skip, no semver update required. Tests: 633 / 0

…y pin PR #33's `fast_gamut_v2.rs` calls 12 chunk SIMD primitives that never shipped on crates.io — they were the pre-tokenless versions of garb's hand-written 128-bit-XMM f32 chunk SIMD (`*_chunk{4,8}_to_planes_v3`, `_neon`, `_wasm128`). Those got dropped from garb 0.2.8 entirely after benches showed they lose to LLVM autovec by 26-37% at 1024px (the `_mm_*` 128-bit intrinsics couldn't reach 256-bit YMM the way autovec under target_feature avx2,fma can). The replacement pattern is the public `*_scalar` chunk fns: pure fixed-array indexing that LLVM autovec lifts to YMM inside the caller's `#[arcane(<tier>)]` region. Same wall-clock as the deleted hand-written chunks would have given, just without the v0.2.7-yanked-because-of-archmage-coupling machinery. Migration pattern (applied at 12 call sites): ```diff -let (r, g, b) = garb::deinterleave::rgb_f32_chunk8_to_planes_v3(token, c); +let (r, g, b) = garb::deinterleave::rgb_f32_chunk8_to_planes_scalar(c); ``` Plus same shape for `_neon` / `_wasm128` (both also rename to `_scalar` since the `#[arcane(<tier>)]` caller establishes target_feature for each arch). Also: `Cargo.toml` switched the garb dep from `path = "../../../garb"` (local-only) to `version = "0.2.8"` (registry). The path form was preventing CI from compiling at all. Caveats: - Bench logs in benchmarks/fast_gamut_after_2026-05-02/ were measured against the OLD hand-written 128-bit chunk SIMD path, not the autovec'd _scalar path. They reflect a baseline that no longer exists in published garb. Numbers should be re-measured if they're load-bearing for any decision; they may shift modestly (autovec is +26-37% on f32 at 1024px per garb's own bench, so the new path should be at least as fast or faster than what was logged). Test plan: - cargo build --release -p zenpixels-convert: clean - cargo test --release -p zenpixels-convert --lib: 259 pass Tracking: imazen/garb#7

lilith · 2026-05-10T04:07:14Z

Pushed 6bd659a to fix CI:

The 12 chunk-SIMD call sites in fast_gamut_v2.rs were referencing pre-tokenless names (*_chunk8_to_planes_v3, _neon, _wasm128) that never shipped on crates.io — they were the working-name forms of garb's hand-written 128-bit chunk SIMD that got dropped from garb 0.2.8 entirely (bench showed they lost to LLVM autovec by 26-37% at 1024px because _mm_* 128-bit intrinsics couldn't reach 256-bit YMM the way autovec under target_feature avx2,fma does).

Migrated all 12 to the public *_scalar chunk fns. Inside the surrounding #[arcane(<tier>)] region, those autovec to YMM — same or better wall-clock as the deleted hand-written chunks.
Cargo.toml: garb = { path = "../../../garb", ... } → garb = { version = "0.2.8", ... }. The path dep wasn't valid in CI.
cargo fmt cleaned up some pre-existing formatting drift in the file (alphabetical import reorder, const { ... } block form). That's the source of the large diff line count — only ~12 lines of logical change.

Caveat

The bench logs in benchmarks/fast_gamut_after_2026-05-02/ were measured against the OLD hand-written 128-bit chunk SIMD baseline, which no longer exists in published garb. They should be re-measured if the numbers are load-bearing for any review decision. Per garb's own bench data, autovec is +26-37% on f32 deinterleave at 1024px vs the deleted chunks, so the new path should be at least as fast or faster than what was logged.

CI should now go green. Tests: cargo test --release -p zenpixels-convert --lib → 259 pass locally with the migrated code.

Three review items from #33: 1. **Delete `pub(crate) const TRC_LINEAR: u8 = 5`.** The const was declared `#[allow(dead_code)]`, gated out of every const-generic path by `assert!(SRC_TRC < 5, ...)` / `DST_TRC < 5` const-asserts, and never referenced after its definition. The `(Linear, Linear)` case is handled at the enum-pair level in `convert_f32_v2_inner` via a dedicated short-circuit to `convert_*_linear_v2` (the matrix-only scalar loop, no TRC step). Mixed Linear↔tagged pairs are unsupported (return `false` from the public entry); v1 didn't support them either. Replaced the misleading "included for completeness" comment with a clear note explaining the design choice: when a Linear↔tagged caller appears, add a dedicated enum-pair branch alongside the existing `(Linear, Linear)` short-circuit so the linearize/encode stages can be elided rather than monomorphized as identity arms. 2. **Annotate historical bench logs as SUPERSEDED.** The `benchmarks/fast_gamut_after_2026-05-02/` directory contains two v1↔v2 comparisons: - `COMPARISON.md` was the criterion-style A-then-B (run v1, run v2 on a different commit, hand-diff). Headline "median 0.0% Δ" was a thermal-drift artifact. - `PAIRED_COMPARISON.md` did the same comparison properly with zenbench paired interleaving — but was measured against (a) v1, deleted by this PR, and (b) garb 0.2.7's hand-written 128-bit chunk SIMD, dropped from garb 0.2.8 entirely after bench showed LLVM autovec under target_feature avx2,fma was +26-37% faster at 1024px. Both baselines are deleted on HEAD. Banners on both files now point readers at the fresh HEAD throughput snapshot. 3. **Fresh HEAD throughput snapshot.** New `benchmarks/fast_gamut_head_2026-05-13/{bench_t7_gamut_HEAD.log,META.md}` captures `bench_t7_gamut` on the migrated code (v2 calling garb 0.2.8 `_scalar` chunk fns, autovec'd to 256-bit YMM under `#[arcane(v3)]`). Not a v1↔v2 paired comparison (impossible — v1 is gone), but it sanity-checks that the migrated path performs in expected throughput buckets: sRGB U8 fused gamut (P3→BT.709) 256px → 227ns 4096px → 3.36µs 1080p → 1.79ms META.md explicitly documents what this run does and does not measure. Test plan: - `cargo test --release -p zenpixels-convert --lib` — 259 pass - `cargo bench --bench bench_t7_gamut -p zenpixels-convert` — clean run Tracking: imazen/garb#7 + this PR.

lilith · 2026-05-13T22:01:14Z

Pushed two more commits addressing the review:

What changed

Deleted pub(crate) const TRC_LINEAR: u8 = 5 plus its misleading "included for completeness" comment. The const was unreachable: gated out of every const-generic path by assert!(SRC_TRC < 5), never referenced after its declaration. The (Linear, Linear) short-circuit at the enum-pair level in convert_f32_v2_inner already handles linear→linear without needing a TRC tag. New comment block documents the design call: if Linear↔tagged callers appear, add a dedicated enum-pair branch rather than wiring a TRC tag with identity arms in every linearize/encode stage.
SUPERSEDED banners on COMPARISON.md (criterion thermal-drift artifact) and PAIRED_COMPARISON.md (paired but measured against now-deleted baselines). Both point at the fresh HEAD snapshot below.
benchmarks/fast_gamut_head_2026-05-13/ — fresh bench_t7_gamut run on the migrated code (v2 + garb 0.2.8 _scalar). META.md documents what it does measure (sanity-check throughput on HEAD) and does not (it's NOT a v1↔v2 paired comparison — v1 is gone).

What this means for the perf claim

The PR description's "v2 is 3-9% faster than v1" was true when measured against v1 + garb 0.2.7's hand-written 128-bit chunk SIMD. Both are gone on HEAD:

v1: deleted in this PR.
garb 0.2.7's chunk SIMD: dropped in garb 0.2.8 because LLVM autovec under target_feature avx2,fma was +26-37% faster at 1024px on f32 deinterleave (see garb's bench).

So v2 on HEAD is directionally still the right move (the design wins — broader platform coverage, simpler maintenance, const-generic over macro stamp, correctness validated by brute-force parity test) but the exact ±% perf delta is not citeable — the baseline it was measured against doesn't exist.

If a hard "v2 is faster than v1 on HEAD" claim is needed, the work is:

revive v1 as a #[cfg(feature = "fast_gamut_v1_compat")] shim,
restore the bench_v1_vs_v2_paired harness,
re-run paired against HEAD's garb 0.2.8 _scalar path,
compare.

I didn't do that here because (a) it's substantial scaffold work, (b) the brute-force parity test gives the correctness guarantee that matters most, and (c) the design-level wins of v2 are independent of the exact perf delta.

Test plan

cargo test --release -p zenpixels-convert --lib — 259 pass
cargo bench --bench bench_t7_gamut -p zenpixels-convert — clean run, results saved
Re-evaluate perf claim with paired v1↔v2 harness (deferred — see above)

Refs: 6bd659a (garb migration), this commit.

lilith · 2026-06-04T03:27:35Z

Rebase + evaluation (is v2 really better?)

Rebased this onto current main (a128c294) and pushed the result as eval-fast-gamut-v2 (a4f39045) — main untouched. Then measured correctness + perf on real hardware (AMD 7950X, default runtime dispatch, no target-cpu=native).

Rebase: clean. 3 conflicts resolved — kept the v1 stamp_trc_kernels deletion + v2 delegation, but preserved main's [-v4] annotations on the retained u8/u16 kernels (this branch predated that f16 work), and kept main's archmage/magetypes 0.9.26 + avx512 feature + versioned linear-srgb 0.6.12 (dropped the branch's path-dep).

Correctness: pixel-faithful ✅. 633 in-tree tests pass. A standalone v1-vs-v2 harness (7,488 cases: P3→sRGB & BT.2020→sRGB cross-gamut, edge/HDR/negative inputs) is bit-identical (max dev 0e0) on V4x/V4/V3 — every tier real x86 uses. Alpha bit-exact. The only scalar-tier divergence is on out-of-domain HLG/PQ signals (>1), where v2 is actually more correct (v1's scalar SIMD lane leaked garbage into valid neighbor pixels; v2 doesn't). No real-pixel regression.

Perf: parity on the common paths, one real win. Paired zenbench v1-vs-v2:

sRGB f32 RGB & RGBA (64px–1MP): statistically indistinguishable (~4.7–5.7 GiB/s both, CIs cross zero).
BT.709 f32: parity.
PQ f32 RGB: v2 genuinely faster, −4.8%/−5.4% at 64/1024px (significant) — the heavy-transcendental path is where the native f32x8 V3 body pays off.

This does not reproduce the "3–9% across the board" claim — that was vs garb 0.2.7's hand-written 128-bit chunks, since dropped (your final commit already flagged those benches superseded).

Blocker: fast_gamut_v2.rs calls linear_srgb::tf::{srgb,bt709,pq,hlg,gamma}::*_x{4,8,16}, which are pub(crate) (not pub) in released linear-srgb 0.6.12 (and the _x16 are named _extended_x16). They're public only on linear-srgb's unmerged feat/pub-tf-x16 branch — so this won't compile against released deps until that ships. The committed eval tree fails to build against versioned linear-srgb (module srgb is private); I evaluated via a throwaway local override. (Eval CI on this branch will go red for exactly this reason — that's the correct signal.)

Minor: adds #[doc(hidden)] pub mod fast_gamut_v2 with 4 pub fns and no external consumer — should be pub(crate) per this repo's strict-YAGNI rule.

Verdict: NEEDS-WORK. Correct + clean rebase, but at parity on the dominant sRGB/BT.709 paths (only ~5% on PQ) and blocked on the linear-srgb pub API. Recommend holding until linear-srgb feat/pub-tf-x16 lands, then deciding whether the PQ-only win + scalar-fallback fix justify the +11.6k churn. Rebased branch + artifacts preserved.

lilith · 2026-06-04T05:07:58Z

Closing this — with prejudice, not parking it.

Per the rebased evaluation above: v2 is correct and rebases cleanly, but it's at parity on the dominant sRGB/BT.709 paths and only ~5% faster on PQ — it does not deliver the across-the-board speedup that motivated it. Landing it would mean +11.6k lines of churn plus a new hard coupling to an unreleased linear-srgb pub API (feat/pub-tf-x16), in exchange for a single-path gain. That trade isn't worth it, and the v1 path on main is fine.

If the gamut/TRC conversion is worth optimizing again later, it should be a fresh, smaller effort justified by a measured broad win on the paths that actually dominate — not this branch. The full A/B (parity harness, paired zenbench, per-tier bit-exactness) is preserved at /mnt/v/output/zenpixels-pr33-eval/ and summarized in the comment above, so the measurements don't need redoing.

lilith added 22 commits May 2, 2026 02:36

refactor(fast_gamut_v2): add mat3x3_x{4,8,16} SIMD helpers

b5baa3b

lilith self-assigned this May 10, 2026

lilith closed this Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(fast_gamut_v2): SIMD chunk path via garb + native V3 body, replaces v1 stamp_trc_kernels#33

perf(fast_gamut_v2): SIMD chunk path via garb + native V3 body, replaces v1 stamp_trc_kernels#33
lilith wants to merge 24 commits into
mainfrom
feat/fast-gamut-refactor

lilith commented May 10, 2026

Uh oh!

lilith commented May 10, 2026

Uh oh!

lilith commented May 13, 2026

Uh oh!

lilith commented Jun 4, 2026

Uh oh!

lilith commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lilith commented May 10, 2026

Summary

Architecture changes

Performance

Correctness

Test plan

Notes

Uh oh!

lilith commented May 10, 2026

Caveat

Uh oh!

lilith commented May 13, 2026

What changed

What this means for the perf claim

Test plan

Uh oh!

lilith commented Jun 4, 2026

Rebase + evaluation (is v2 really better?)

Uh oh!

lilith commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant