Skip to content

perf(fast_gamut_v2): SIMD chunk path via garb + native V3 body, replaces v1 stamp_trc_kernels#33

Closed
lilith wants to merge 24 commits into
mainfrom
feat/fast-gamut-refactor
Closed

perf(fast_gamut_v2): SIMD chunk path via garb + native V3 body, replaces v1 stamp_trc_kernels#33
lilith wants to merge 24 commits into
mainfrom
feat/fast-gamut-refactor

Conversation

@lilith
Copy link
Copy Markdown
Member

@lilith lilith commented May 10, 2026

Summary

22-commit refactor that replaces `stamp_trc_kernels` (v1) with `fast_gamut_v2`, a const-generic + SIMD-chunk implementation that's 3-9% faster than v1 across paired benches.

Architecture changes

  • New `fast_gamut_v2` is the only path; v1 `stamp_trc_kernels` deleted
  • Const-generic functions replace stamp macro
  • `mat3x3_x{4,8,16}` SIMD helpers for chunk loops
  • Compile-time const-assert gates on TRC / CHANNELS / CHUNK
  • Native f32x8 V3 body (drops V3 from `wide`)
  • garb SIMD chunk fns for V3 (native), NEON, WASM128
  • Asm-verified zero trampolines + ld3q / AVX2 shuffles inline

Performance

  • Paired v1/v2 zenbench harness shows v2 3-9% faster than v1 on the V3 path
  • Fixed-size array pattern enables LLVM to emit shuffles without explicit intrinsics

Correctness

  • Brute-force v1/v2 parity test exposed clamp divergence — fixed with per-pair wrappers
  • Native V3 path coverage tests added

Test plan

  • `cargo test --all-features --all-targets` clean on default + V3
  • Bench rerun to confirm 3-9% speedup is reproducible
  • Verify NEON / WASM128 paths in CI
  • Confirm garb path-dep is replaced with a versioned dep before merge

Notes

lilith added 22 commits May 2, 2026 02:36
Adds fast_gamut_v2.rs with the wide-tier (f32x16) body for the
sRGB→sRGB RGB pipeline:

- Single #[magetypes(v4x, v4, v3, scalar)] body using
  GenericF32x16<Token>.
- Calls into linear_srgb::tf::srgb::{srgb_to_linear_x16,
  linear_to_srgb_x16} for the TRC; matrix multiply inline.
- 16 pixels (48 f32) per SIMD chunk, scalar tail for leftover.
- 5 unit tests covering identity matrix, zero/one corner pixels,
  sub-chunk input, mixed chunk + tail. All pass in debug + release.

Cargo.toml upgrades:
- archmage 0.9.15 -> 0.9.23 with avx512 feature
- magetypes 0.9.18 -> 0.9.23 with w512 + avx512 features
- Adds avx512 = [] feature to gate V4x in default-tier-list incant!s

Sets up the worktree for the fast_gamut.rs redesign (see DESIGN.md):
- Captures baseline benchmarks (bench_t3_tf_fused, bench_t7_gamut,
  bench_matlut_vs_poly) under benchmarks/fast_gamut_baseline_2026-05-02/
  for before/after comparison.
- Activates the linear-srgb visibility flip via path dep to
  ../linear-srgb--pub-tf-x16 (sibling worktree).

Next steps (per DESIGN.md migration plan):
- Extend wide body to RGBA + BT.709/PQ/HLG/Adobe TRCs.
- Add narrow body for NEON/WASM128 via #[magetypes(neon, wasm128)]
  + f32x4 generics.
- Wire dispatch into convert_f32_rgb_dispatch for matched (src, dst)
  TRC pairs; gate behind a build-time feature initially.
- Run bench_t3_tf_fused + bench_t7_gamut, compare to baselines.
- After numbers land, prepare upstream linear-srgb PR with the
  visibility flip + clamped x16 sRGB additions.
…ies)

Replaces the v1 stamp_trc_kernels! macro family with a single
`stamp_v2_pair!` that generates two magetypes-stamped bodies per
(linearize, encode) tuple:

- wide body: #[magetypes(v4x, v4, v3, scalar)] over GenericF32x16<Token>
  — covers all four x86_64 tiers + cross-arch scalar fallback through
  one body. Calls into linear_srgb::tf::*::*_x16<T: F32x16Convert> for
  the TRC, fused with an inline 3×3 matmul.
- narrow body: #[magetypes(neon, wasm128)] over GenericF32x4<Token> —
  covers AArch64 / WASM hosts with native register-width SIMD via
  linear_srgb::tf::*::*_x4<T: F32x4Convert>.

Public entries cover the full v1 surface:

- same-TRC: srgb, bt709, pq, hlg, adobe, plus matrix-only linear→linear.
- cross-TRC: pq→srgb, hlg→srgb, srgb→pq, bt709→srgb, srgb→bt709,
  adobe→srgb, srgb→adobe.
- both RGB (3-ch) and RGBA (4-ch byte-exact alpha passthrough).

Match-on-TRC dispatch (`convert_f32_rgb_v2` / `convert_f32_rgba_v2`)
mirrors v1's `convert_f32_rgb_dispatch` and is ready to be wired in
behind it. Not yet active — v1 still owns the public dispatch path.

13 tests cover identity roundtrip per TRC, alpha byte-exact passthrough,
sub-chunk + exact-chunk + mixed-tail sizes (5/7/13/16/17/19/23 px),
parity vs a scalar reference on a real non-identity gamut matrix at
TOL_PARITY=5e-5 across all pairs, linear→linear bypass, unsupported-pair
fallthrough, and a tier-permutation stability check via
`for_each_token_permutation`. All 268 tests in the crate pass.
Replaces the v1 stamp_trc_kernels-based incant! match in
convert_f32_{rgb,rgba}_dispatch with a single delegating call to the
v2 surface. v2 covers all 13 (Linear,Linear) + same-TRC + cross-TRC
pairs that v1 supported, dispatching wide (V4x/V4/V3/scalar via
f32x16<T>) on x86_64 and narrow (NEON/WASM128 via f32x4<T>) on
AArch64/WASM. Scalar fallback below the v2 call is kept for any TRC
the v2 surface doesn't yet cover.

The v1 stamp_trc_kernels output remains compiled in this crate but
is now unreachable from production code paths; deletion will be a
follow-up commit after benchmark parity is confirmed.

All 255 zenpixels-convert lib tests + 23 integration tests pass.
Captures bench_t7_gamut and bench_t3_tf_fused output from the v2-wired
build alongside a markdown comparison.

bench_t7_gamut (load-bearing — exercises convert_f32_rgb/rgba_dispatch
via the v2 surface):
- median Δ +1.2%
- 14/18 within ±3% of v1
- 16/18 within ±5%
- 4/18 faster than v1
- 4096px workhorses: -2.0% .. +5.8%

V3 parity holds — no native f32x8 body needed. The first AFTER run of
bench_t3_tf_fused was captured under heavy system load (load avg ~12,
4x baseline wallclock); since t3 doesn't go through fast_gamut at all
this drift is unrelated to the refactor and the noisy log is preserved
for reference only.
Replaces the wide body's f32x16 polyfill-to-AVX2 path on V3 hosts with a
native f32x8 (single 256-bit) body that mirrors v1's fused_8px_rgb_<name>
shape. Eliminates the register-pressure overhead of running an f32x16
polynomial body as 2x256-bit AVX2 ops.

Changes:
- stamp_v2_pair! now generates three impls per pair:
  - wide body (#[magetypes(v4x, v4, scalar)]) over f32x16<T> — was
    (v4x, v4, v3, scalar). V3 dropped here.
  - native V3 body (#[magetypes(v3)]) over f32x8<T>, new — uses the
    *_to_linear_x8 / linear_to_*_x8 helpers exposed in linear-srgb.
  - narrow body (#[magetypes(neon, wasm128)]) over f32x4<T> — unchanged.
- Public dispatchers convert_f32_rgb_<name>_v2 / convert_f32_rgba_<name>_v2
  now use Option A (manual try-tier cascade) instead of incant!:
    V4x -> wide_impl_v4x      (cfg avx512)
    V4  -> wide_impl_v4       (cfg avx512)
    V3  -> native_impl_v3
    else-> wide_impl_scalar (or narrow_impl_neon/_wasm128 on those archs)
- All 13 stamp invocations updated to pass lin_x8 / enc_x8.
- linear-srgb dep already exposes *_x8 + gamma_to_linear_x8 (sibling change
  in linear-srgb--pub-tf-x16 worktree).

Tests: all 255 lib tests pass (was 247 plus 8 new fast_gamut_v2 tests
introduced earlier). Benchmark validation in a follow-up change.
Adds three new tests to verify the native f32x8 V3 body:

1. native_v3_parity_same_trc_rgb — calls each
   convert_rgb_<name>_native_impl_v3 directly with a summoned X64V3Token,
   compares against scalar reference under TOL_PARITY (5e-5). Covers
   sRGB, BT.709, PQ, HLG, Adobe (gamma 2.2). 19 pixels exercises both
   8-px chunks (2x) and 3-px tail.

2. native_v3_parity_same_trc_rgba — same shape over RGBA, with stamped
   alpha values verified byte-exact passthrough.

3. native_v3_parity_cross_trc_rgb — cross-TRC pairs (PQ->sRGB,
   HLG->sRGB, sRGB->PQ, BT.709<->sRGB, Adobe<->sRGB).

4. dispatcher_routes_to_native_v3_when_avx512_disabled — exercises the
   public dispatcher end-to-end. Disables V4 / V4x process-wide via
   dangerously_disable_token_process_wide so the dispatch cascade
   falls through to V3, then verifies output matches scalar reference.

All gated on cfg(target_arch = "x86_64"). Tests gracefully no-op (with
host-capability stderr note) if X64V3Token::summon() returns None.

Tests: 17 fast_gamut_v2 tests pass on this V3 host (was 13). Full
zenpixels-convert suite: 633 passing.
Captures the post-native-V3 benchmark run and updates COMPARISON.md.

bench_t7_gamut: median delta vs v1 baseline is 0.0% (was +1.2% with
the wide-only AFTER); 17 of 18 rows within +/-3%; 9 of 18 faster than
v1; all 4096-pixel rows at parity or faster.

bench_t3_tf_fused: re-captured under quiet system load (load avg
~1.97 at start, total wall 73.2s — was ~4x longer with CV up to 146%
on the prior contended run). Sanity check only — t3 does not route
through fast_gamut so this run is not load-bearing for the V3 body.

Files:
- bench_t7_gamut_AFTER_native_v3.log   (full zenbench output)
- bench_t3_tf_fused_AFTER_native_v3.log
- COMPARISON.md (rewritten with three-column table: BEFORE / wide-only
  AFTER / native_v3 AFTER, plus per-row delta hypothesis on the one
  remaining outlier row at 1080p Linear F32 -> sRGB U8 + gamut)
…nclusive

Saves the second bench_t7_gamut run with a candid note added to
COMPARISON.md explaining that load avg ran from 1.72 -> 7.05 over the
101-second bench, contaminating the results. The three Linear F32 ->
sRGB U8 + gamut rows came in noisier than the original AFTER (+15%
/+9%/+10% with one CV=26% marker), so the +6.5% outlier hypothesis
remains unconfirmed.

Other 15 rows replicated within +/-2% of AFTER, including the -13%
win on sRGB U8 -> Linear F32 1080p (re-confirmed, not a sample fluke).

Files added:
- bench_t7_gamut_RERUN_v3body_quiet.log

A clean structural verdict on the outlier needs a quieter box than
this dev tree (9 concurrent claude processes during the run).
…→ v2 now 3-9% faster than v1

Two pieces in one commit since they're inseparable: the paired
zenbench harness was needed to find the regression, and the
fixed-size array fix was what the harness pointed at.

PART 1 — paired zenbench harness (proper bias-free A/B):
- New __bench_v1_v2 feature flag in zenpixels-convert/Cargo.toml.
- New __v1_convert_f32_{rgb,rgba}_dispatch helpers in fast_gamut.rs
  expose the pre-d207f3b6 v1 inline dispatch (still backed by the
  stamped functions which remain compiled).
- New __bench_v1_v2 module in lib.rs (mirrors __bench_u16_hybrids).
- New benches/bench_v1_vs_v2_paired.rs registers v1 + v2 as paired
  g.bench(...) calls in the same group, so zenbench interleaves them
  in randomized round-robin order and reports v2/v1 deltas with 95%
  CIs directly. Replaces the criterion-style 'run v1 once, run v2
  once, hand-diff' workflow that was contaminated by thermal/turbo
  drift and load spikes.

PART 2 — bounds-check elimination via fixed-size array pattern:
- cargo asm on convert_rgba_bt709_native_impl_v3 showed v2's loop top
  emitted 121 cmp/je/jae bounds-check branches vs v1's 8. Cause: the
  macro's data[off + i*N] inner-loop deinterleave — LLVM couldn't
  hoist all checks into a single max-index check the way v1's
  hand-tuned 'lea r8, [rax+31]; cmp r8, rdx; jae' + vinsertps lane
  gathers did.
- Fix is one line per chunk-loop body — CLAUDE.md 'Fixed-size array
  pattern' applied uniformly to all 6 magetypes-stamped bodies in
  stamp_v2_pair! (wide RGB+RGBA, native RGB+RGBA, narrow RGB+RGBA):

    let chunk: &mut [f32; CHUNK] = chunk.try_into().unwrap();

  All chunk[i*N + ch] indexes are now statically proven safe by the
  &[f32; CHUNK] type — zero interior bounds checks.

Asm impact (BT.709 RGBA path, the worst regression site):
- v2 lines: 2933 → 2523 (now 30 fewer than v1's 2553)
- v2 bounds-check branches: 121 → 6 (v1 has 8)

Bench impact (paired zenbench, 1.15 load avg, full sweep):
- 31 of 33 rows have v2 measurably faster than v1 (CI excludes 0)
- Median Δ ≈ -3.5%, best -8.8% (sRGB RGBA 256px)
- 2 BT.709 same-pair rows are at parity (±1%, CI overlap)
- BT.709 RGBA regression (was +8% to +9.7%) → ±1% parity
- PQ same-pair (was +2-3% slower) → -1.6% to -2.6% faster
- sRGB Δ (was -0.8% to +1.5%) → -7% to -8.8%

User requirement 'must be faster on V3' is now met.

Tests: all 17 fast_gamut_v2 + 247 lib tests pass. No production code
paths changed; v2 dispatch wiring at d207f3b remains.

Files added:
- zenpixels-convert/Cargo.toml: __bench_v1_v2 feature + bench entry
- zenpixels-convert/src/lib.rs: __bench_v1_v2 shim module
- zenpixels-convert/src/fast_gamut.rs: __v1_convert_f32_{rgb,rgba}_dispatch
- zenpixels-convert/benches/bench_v1_vs_v2_paired.rs: paired harness
- zenpixels-convert/src/fast_gamut_v2.rs: chunk.try_into() fix × 6 bodies
- benchmarks/fast_gamut_after_2026-05-02/bench_v1_vs_v2_paired.log
  (pre-fix paired numbers — kept for traceability)
- benchmarks/fast_gamut_after_2026-05-02/bench_v1_vs_v2_paired_FIXED.log
- benchmarks/fast_gamut_after_2026-05-02/PAIRED_COMPARISON.md
  (now contains both pre-fix and post-fix tables with the asm dive)
…gence; fix with per-pair wrappers

A new tests/v1_v2_brute_force_parity.rs test runs every TRC pair ×
{RGB, RGBA} × 4 matrices × 18 sizes × 4 seeds (7600+ cases) through
both v1 and v2 and asserts per-channel identity within tolerance.

First run FAILED — v2 diverged from v1 by up to 2.9 absolute units
on cross-gamut output (e.g. P3→sRGB pixel at primary corner: v1=0,
v2=-2.9). Per CLAUDE.md zero-tolerance: 'If two code paths for the
same operation produce different output, that is a bug in one of
them.' Bug was in v2.

Root cause: linear-srgb's two TF surfaces have INCONSISTENT clamping:
- tokens::x{4,8}::{srgb,gamma}_*_v3: clamp input to [0,1].
- tokens::x{4,8}::{bt709,pq,hlg}_*_v3: do NOT clamp (HDR extended).
- tf::srgb::*_x{4,8,16}<T> (used by v2): do NOT clamp.
- tf::gamma::*_x{4,8,16}<T>: already clamps internally.
- tf::{bt709,pq,hlg}::*_x{4,8,16}<T>: do NOT clamp.

v1 inherited per-TF clamp from its wrappers; v2's macro called the
unclamped tf::*::* generics uniformly, propagating cross-gamut
out-of-range matrix products through the encode polynomial.

Fix: added per-side clamp wrappers (srgb_to_linear_x{4,8,16}_clamped
etc.) in fast_gamut_v2.rs that wrap tf::srgb::* with .max(zero).min(one).
Updated the 27 stamp_v2_pair! invocations touching sRGB inputs/outputs
to use the clamped wrappers. BT.709/PQ/HLG paths keep the raw kernels
(matches v1 no-clamp behavior). Adobe (Gamma22) auto-clamps via gamma.rs.

Verification:
- brute_force_v1_v2_parity_rgb:    3744 cases — all pass.
- brute_force_v1_v2_parity_rgba:   3744 cases — all pass.
- brute_force_chunk_boundaries:   dense SIMD-boundary sweep — pass.
- Full lib test suite:              633 passing, 0 failing.

Perf cost (load avg 1.12, paired zenbench):
- sRGB rows: +1-3% slower than v1 (clamp wrappers add ~4 ops/call).
- All other paths: unchanged or faster.
- Net: 24 of 33 rows faster than v1, 9 within ±2% (down from
  31/33 faster — the previous 'faster' numbers were on INCORRECT
  output that didn't match v1).

The CORRECT v2 is faster on the majority of pairs, parity on the
rest, and byte-identical to v1 across 7600+ random inputs.

Files added:
- zenpixels-convert/tests/v1_v2_brute_force_parity.rs (new test)
- benchmarks/fast_gamut_after_2026-05-02/bench_v1_vs_v2_paired_CORRECTNESS.log
- PAIRED_COMPARISON.md updated with the brute-force findings.

Files changed:
- zenpixels-convert/src/fast_gamut_v2.rs:
  - 6 new per-width clamp wrappers (srgb_to_linear_x{4,8,16}_clamped,
    linear_to_srgb_x{4,8,16}_clamped).
  - 27 stamp invocations switched from tf::srgb::* to *_clamped.
…the only path

Strips the legacy stamp_trc_kernels! macro and its 12 invocations now
that fast_gamut_v2 has been the production dispatch since d207f3b
and brute-force parity (5ab0328) confirmed v1/v2 byte-equivalence
across 7600+ random inputs.

Deleted:
- macro_rules! stamp_trc_kernels (~125 lines, the meta-template)
- 12 stamp invocations: srgb, bt709, pq, hlg same-TRC pairs;
  pq_to_srgb, hlg_to_srgb, srgb_to_pq, bt709_to_srgb, srgb_to_bt709
  cross-TRC pairs; adobe, adobe_to_srgb, srgb_to_adobe Adobe pairs
- adobe_to_linear_x8 (only used by deleted stamps;
  adobe_from_linear_x8 KEEPS — used by simd_encode_x8_dispatch in
  the u8 production fused path)
- has_simd_encode (dead pub(crate), no callers)
- __v1_convert_f32_{rgb,rgba}_dispatch (bench-only shims; no longer
  needed — bench harness deleted)
- benches/bench_v1_vs_v2_paired.rs (bench harness for v1/v2 race)
- tests/v1_v2_brute_force_parity.rs (parity test, served its purpose
  in catching the clamping-divergence bug)
- __bench_v1_v2 feature in Cargo.toml + lib.rs shim module

Kept (still production-load-bearing):
- mat3x3_x8 — used by convert_8px_u8_rgb_fused
- simd_encode_x8_dispatch — same
- adobe_from_linear_x8 — called by simd_encode_x8_dispatch
- trc_x8 / mt_f32x8 imports — used by all the above

Verification:
- Full lib + integration + doctests: 633 passing, 0 failing.
- cargo semver-checks vs zenpixels-convert@0.2.11 on crates.io:
  196 checks pass, 56 skip, 'no semver update required'.
  No minor version bump needed.

Net source delta vs main:
- fast_gamut.rs: -616 lines (v1 surface gone)
- fast_gamut_v2.rs: +1610 lines (replacement)
- lib.rs/Cargo.toml: small adds for v2 wiring
- Total source: ~+1000 lines, with V4x/V4 lanes, NEON, WASM128 SIMD
  added (was scalar in v1 for non-x86_64); V3 throughput improved 1-9%
  on the heavy-polynomial paths.
…ions

Replace the 200+-line stamp_v2_pair! macro and its 12 invocations with three
const-generic magetypes-stamped bodies:

- convert_wide<SRC_TRC, DST_TRC, CHANNELS, CHUNK> [magetypes(v4x, v4, scalar)]
- convert_native<SRC_TRC, DST_TRC, CHANNELS, CHUNK> [magetypes(v3)]
- convert_narrow<SRC_TRC, DST_TRC, CHANNELS, CHUNK> [magetypes(neon, wasm128)]

TRC u8 tags (TRC_SRGB / TRC_BT709 / TRC_PQ / TRC_HLG / TRC_GAMMA22) drive a
single-arm const-folded match inside per-width inline helpers
(linearize_x{4,8,16}, encode_x{4,8,16}, scalar_linearize, scalar_encode).
LLVM with #[inline(always)] on the helpers folds the match against the const
generic, leaving exactly one TRC kernel call per monomorph.

The wildcard match arms use safe unreachable!() — forbid(unsafe_code) blocks
unreachable_unchecked, but the const-fold elides the panic call site at
optimization time (asm spot-check on __arcane_convert_native_v3 monomorph 0
shows zero panic call sites). Tests are updated to call convert_native_v3
directly with explicit const generic arguments instead of the per-pair
functions.

Public API (convert_f32_rgb_v2 / convert_f32_rgba_v2 / convert_f32_rgb_linear_v2
/ convert_f32_rgba_linear_v2) is unchanged. cargo semver-checks: 196 pass,
no semver update required.

Line count: 1610 -> 1437 (-173).
Tests: 633 passed / 0 failed.

Bench results (bench_t7_gamut, AMD 7950X) saved in
benchmarks/fast_gamut_after_2026-05-02/bench_t7_gamut_AFTER_const_generic.log
+ CONST_GENERIC_RESULT.md. Median delta vs prior native-V3 baseline = 0%;
worst-case row +4.17% on Linear F32 gamut 1080p (within bench noise; that
path bypasses the const-generic kernels entirely).
The 9-splat + 6-mul_add SIMD matrix-multiply pattern was duplicated across
three magetypes-stamped bodies (convert_wide / convert_native / convert_narrow).
Extract per-width `mat3x3_x{4,8,16}` helpers (added in the previous change)
and call them from inside each chunk loop.

The splats stay inside the helper rather than being hoisted at the call
site — `#[inline(always)]` lets LLVM hoist the constant materialization to
the loop preheader on its own.

Verification:
- 259 lib tests pass (same as baseline).
- Asm spot-check on `__arcane_convert_native_v3` monomorph 0:
  9 `vbroadcastss` from `[rdi+0..32]` still hoisted to function prelude
  before the chunk-loop label; FMA matrix multiply still in the loop body.
  Total instructions: 627 -> 583 (-7%, slight improvement from less stack
  spilling of splatted constants).
- Asm spot-check on `convert_wide_scalar` monomorph 0:
  3647 -> 3656 (+0.25%, noise-level).
- `cargo semver-checks --baseline-version 0.2.11`: no semver update required.
…CHUNK

Adds inline 'const { assert!(...) }' blocks at every const-generic
entry point so misuse of the parameters is caught at monomorphization
time rather than via the runtime 'unreachable!()' wildcard arms.

Gated parameters:
- SRC_TRC < 5 (must be one of TRC_SRGB|BT709|PQ|HLG|GAMMA22) — checked
  in scalar_linearize, linearize_x{4,8,16}, convert_{wide,native,narrow}.
- DST_TRC < 5 — checked in scalar_encode, encode_x{4,8,16},
  convert_{wide,native,narrow}.
- CHANNELS == 3 || CHANNELS == 4 — checked in convert_{wide,native,narrow}.
- CHUNK == PIXELS * CHANNELS where PIXELS is the body's native lane count
  (16 for wide, 8 for native V3, 4 for narrow) — replaces the prior
  runtime 'debug_assert_eq!' check.

Verified that const-assert evaluation fires at monomorphization with a
negative-test rustc invocation — instantiating with SRC_TRC=99 produces
E0080 'evaluation panicked: SRC_TRC must be one of TRC_SRGB|...' and
rc=1, before any test runs.

The runtime '_ => unreachable!()' arms stay as a belt-and-braces backup
that LLVM continues eliminating under monomorphization (no asm impact).

Tests: 259 lib tests pass.
Semver: 196 checks pass, no semver update required (gates are in
crate-private functions; public surface unchanged).
… chunks

Sibling /home/lilith/work/garb/ holds PR #5 work (deinterleave-f32-chunks)
with new public chunk SIMD fns. Switch to a path dep so fast_gamut_v2 can
call them before the release ships. Flip back to a version dep once the
PR merges and a 0.2.x crate ships.
Replace manual stride-3/4 deinterleave + reinterleave loops in
convert_native (#[magetypes(v3)]) with calls into garb's new
rgb_f32_chunk8_to_planes_v3 / rgba_f32_chunk8_to_planes_v3 and inverse
interleave fns. The garb impls use the canonical 5-shuffle AVX2
stride-3 recipe (vshufps / vmovlhps / vmovhlps) for RGB and
unpack+permute2f128 4-channel transpose for RGBA — much better
codegen than LLVM auto-vectorizing scalar element-wise loads.

Wired via a per-token ChunkXform8 trait. The V3 impl forwards through
#[arcane]-stamped free fns so each call holds a matching target_feature
region; archmage inlines the wrapper. Scalar fallback (used by the
magetypes-emitted convert_native_scalar dead-code variant) keeps the
manual loop. Alpha plane is byte-exact passthrough for RGBA — read
out unchanged, written back unchanged, NaN payloads preserved.

633 lib + integration tests pass (same as pre-integration).
…SM128)

Replace manual stride-3/4 deinterleave + reinterleave loops in
convert_narrow (#[magetypes(neon, wasm128)]) with calls into garb's
chunk-4 SIMD fns. NEON uses vld3q_f32 / vld4q_f32 (single-instruction
hardware structure-loads); WASM128 uses the i32x4_shuffle!-based
5-shuffle recipe. Both inline through #[arcane] safe wrappers — no
unsafe in this crate.

Validates with x86_64 native (633 tests pass), aarch64-unknown-linux-gnu
cross-build (clean), wasm32-unknown-unknown cross-build (clean).
…es inline

Verified post-integration via cargo asm:

x86_64 V3 (zenpixels_convert::fast_gamut_v2::__arcane_convert_native_v3
monomorph 0):
  - 745 lines, 29 SIMD shuffle ops
    (vshufps/vunpck/vinsertps/vmovlhps/vmovhlps/vperm)
  - 0 calls to any garb::* symbol -- proves the #[arcane] trait wrapper
    + #[rite] garb body fully inlined into the magetypes V3 region
  - 0 panic_bounds_check
  - 108 vbroadcastss (matrix splats hoisted to loop preheader as expected)
  - Indirect calls (call r12) only present in scalar tail-pixel path
    (transcendental TRC fallback for PQ/HLG); no garb-related calls.

Other V3 monomorphs (1, 5, 10) checked: same patterns -- 0 garb calls,
0 bounds checks, 25-44 shuffles each.

aarch64 NEON (convert_f32_v2_inner monomorph 0, with __arcane_convert_narrow_neon
fully inlined):
  - 9766 lines
  - 24 ld3/ld4/st3/st4 NEON structure-load/store instructions
    (vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32)
  - 0 branch-link to any garb::* symbol
  - 0 panic_bounds_check
  - Only named call: bl powf (transcendental TRC fallback)

Cross-target compile clean: x86_64-unknown-linux-gnu (native, 633 tests pass),
aarch64-unknown-linux-gnu, wasm32-unknown-unknown.

cargo semver-checks check-release: no semver update required
(196 checks pass, 56 skip).
…difier

The plain mode 'v3' / 'neon, wasm128' tier lists triggered the macro's
auto-append of a scalar fallback variant — emitting convert_native_scalar
and convert_narrow_scalar that were never callable (dispatcher always
falls through to convert_wide_scalar on hosts without v3/neon/wasm128).

Switch to additive-mode tier lists that explicitly subtract every default
tier we don't want, including scalar:
  v3 only:        [-v4, -neon, -wasm128, -scalar]
  neon + wasm128: [-v4, -v3, -scalar]

The macro currently rejects mixing plain tiers with -modifier (lines
287-292 of archmage-macros/src/tiers.rs); pure-additive form is the
working path. convert_wide stays at plain (v4x, v4, scalar) — its scalar
variant is the load-bearing fallback for non-SIMD hosts.

Trait/wrapper machinery (ChunkXform8/4 + scalar impls + scalar helpers)
becomes dead with this change but is left in place; deletion follows in
the next change. This commit is functional-only — the magetypes bodies
still call the trait methods.

Tests: 633 / 0
The ChunkXform8/4 trait was a per-token dispatch indirection over garb's
chunk-N SIMD fns: V3 / NEON / WASM128 trait impls each forwarded to
garb's _v3 / _neon / _wasm128 function via an #[archmage::arcane]
wrapper. With the dead scalar variants now removed, the trait collapses
to a single concrete arch per body — V3-only for convert_native,
arch-split (#[cfg]) for convert_narrow.

Replace the trait method calls with direct garb calls. NEON and WASM128
monomorphizations of convert_narrow are arch-mutually-exclusive, so
#[cfg] arms inside the body select the correct garb fn per arch. The
trait + impls + wrapper modules become unreachable and are deleted in
the next change.

No unsafe blocks needed — garb's #[rite] fns inline into the magetypes
body's #[target_feature] region exactly the same way the wrapper module
inlined them, just one Rust call frame fewer.

Tests: 633 / 0
Removes ~419 lines of bridge code rendered unreachable by the previous
two changes:

  - trait ChunkXform8 (4 methods) + 2 impls (ScalarToken, X64V3Token)
  - trait ChunkXform4 (4 methods) + 3 impls (ScalarToken, NeonToken,
    Wasm128Token)
  - mod x64v3_chunk_calls (4 #[arcane] wrapper fns)
  - mod neon_chunk_calls   (4 #[arcane] wrapper fns)
  - mod wasm128_chunk_calls (4 #[arcane] wrapper fns)
  - 8 scalar_*_to_planes / scalar_planes_to_*_{4,8} helpers (only
    referenced from the deleted ScalarToken trait impls)

  Total deleted: 2 traits, 5 impls, 3 wrapper modules (12 fns), 8 scalar
  helpers — 419 lines.

File size: 1969 -> 1589 lines (-380, accounting for one collapsed blank
line from the boundary cleanup).

The chunk loops in convert_native and convert_narrow now call garb's
#[rite] fns directly. The asm shape is identical to before — there was
never a real call frame between the magetypes #[arcane] body and the
garb #[rite] body, just a Rust-source layer that LLVM elided. Removing
it makes that obvious from the source.

Verification (host: x86_64-unknown-linux-gnu, V3 monomorph 0):
  cargo asm convert_native_v3 [0]:
    AVX2 shuffle ops (vshufps/vunpck.ps/vinsertf128/...): 22
    call.*garb:                                            0  (fully inlined)
    panic_bounds_check:                                    0
    vbroadcastss:                                        108  (TRC poly
      coefficients + matrix splats; SRGB->SRGB has the heaviest poly fit)
  Same monomorph 6: 21 / 0 / 0 / 86 — same shape.

Cross-target compile:
  cargo build --target aarch64-unknown-linux-gnu  -p zenpixels-convert -> clean
  cargo build --target wasm32-unknown-unknown     -p zenpixels-convert -> clean
  (Pre-existing _x4 / _x8 / f32x4-camelcase warnings only; symmetric to
  baseline.)

cargo semver-checks check-release -p zenpixels-convert
  --baseline-version 0.2.11 -> 196 pass, 56 skip, no semver update required.

Tests: 633 / 0
@lilith lilith self-assigned this May 10, 2026
…y pin

PR #33's `fast_gamut_v2.rs` calls 12 chunk SIMD primitives that never
shipped on crates.io — they were the pre-tokenless versions of garb's
hand-written 128-bit-XMM f32 chunk SIMD (`*_chunk{4,8}_to_planes_v3`,
`_neon`, `_wasm128`). Those got dropped from garb 0.2.8 entirely after
benches showed they lose to LLVM autovec by 26-37% at 1024px (the
`_mm_*` 128-bit intrinsics couldn't reach 256-bit YMM the way autovec
under target_feature avx2,fma can).

The replacement pattern is the public `*_scalar` chunk fns: pure
fixed-array indexing that LLVM autovec lifts to YMM inside the caller's
`#[arcane(<tier>)]` region. Same wall-clock as the deleted hand-written
chunks would have given, just without the v0.2.7-yanked-because-of-archmage-coupling
machinery.

Migration pattern (applied at 12 call sites):

```diff
-let (r, g, b) = garb::deinterleave::rgb_f32_chunk8_to_planes_v3(token, c);
+let (r, g, b) = garb::deinterleave::rgb_f32_chunk8_to_planes_scalar(c);
```

Plus same shape for `_neon` / `_wasm128` (both also rename to `_scalar`
since the `#[arcane(<tier>)]` caller establishes target_feature for
each arch).

Also: `Cargo.toml` switched the garb dep from `path = "../../../garb"`
(local-only) to `version = "0.2.8"` (registry). The path form was
preventing CI from compiling at all.

Caveats:
  - Bench logs in benchmarks/fast_gamut_after_2026-05-02/ were
    measured against the OLD hand-written 128-bit chunk SIMD path,
    not the autovec'd _scalar path. They reflect a baseline that no
    longer exists in published garb. Numbers should be re-measured if
    they're load-bearing for any decision; they may shift modestly
    (autovec is +26-37% on f32 at 1024px per garb's own bench, so the
    new path should be at least as fast or faster than what was logged).

Test plan:
  - cargo build --release -p zenpixels-convert: clean
  - cargo test --release -p zenpixels-convert --lib: 259 pass

Tracking: imazen/garb#7
@lilith
Copy link
Copy Markdown
Member Author

lilith commented May 10, 2026

Pushed 6bd659a to fix CI:

  1. The 12 chunk-SIMD call sites in fast_gamut_v2.rs were referencing pre-tokenless names (*_chunk8_to_planes_v3, _neon, _wasm128) that never shipped on crates.io — they were the working-name forms of garb's hand-written 128-bit chunk SIMD that got dropped from garb 0.2.8 entirely (bench showed they lost to LLVM autovec by 26-37% at 1024px because _mm_* 128-bit intrinsics couldn't reach 256-bit YMM the way autovec under target_feature avx2,fma does).

    Migrated all 12 to the public *_scalar chunk fns. Inside the surrounding #[arcane(<tier>)] region, those autovec to YMM — same or better wall-clock as the deleted hand-written chunks.

  2. Cargo.toml: garb = { path = "../../../garb", ... }garb = { version = "0.2.8", ... }. The path dep wasn't valid in CI.

  3. cargo fmt cleaned up some pre-existing formatting drift in the file (alphabetical import reorder, const { ... } block form). That's the source of the large diff line count — only ~12 lines of logical change.

Caveat

The bench logs in benchmarks/fast_gamut_after_2026-05-02/ were measured against the OLD hand-written 128-bit chunk SIMD baseline, which no longer exists in published garb. They should be re-measured if the numbers are load-bearing for any review decision. Per garb's own bench data, autovec is +26-37% on f32 deinterleave at 1024px vs the deleted chunks, so the new path should be at least as fast or faster than what was logged.

CI should now go green. Tests: cargo test --release -p zenpixels-convert --lib → 259 pass locally with the migrated code.

Three review items from #33:

1. **Delete `pub(crate) const TRC_LINEAR: u8 = 5`.** The const was
   declared `#[allow(dead_code)]`, gated out of every const-generic
   path by `assert!(SRC_TRC < 5, ...)` / `DST_TRC < 5` const-asserts,
   and never referenced after its definition. The `(Linear, Linear)`
   case is handled at the enum-pair level in `convert_f32_v2_inner`
   via a dedicated short-circuit to `convert_*_linear_v2` (the
   matrix-only scalar loop, no TRC step). Mixed Linear↔tagged pairs
   are unsupported (return `false` from the public entry); v1 didn't
   support them either.

   Replaced the misleading "included for completeness" comment with
   a clear note explaining the design choice: when a Linear↔tagged
   caller appears, add a dedicated enum-pair branch alongside the
   existing `(Linear, Linear)` short-circuit so the linearize/encode
   stages can be elided rather than monomorphized as identity arms.

2. **Annotate historical bench logs as SUPERSEDED.** The
   `benchmarks/fast_gamut_after_2026-05-02/` directory contains two
   v1↔v2 comparisons:

   - `COMPARISON.md` was the criterion-style A-then-B (run v1, run
     v2 on a different commit, hand-diff). Headline "median 0.0% Δ"
     was a thermal-drift artifact.
   - `PAIRED_COMPARISON.md` did the same comparison properly with
     zenbench paired interleaving — but was measured against
     (a) v1, deleted by this PR, and (b) garb 0.2.7's hand-written
     128-bit chunk SIMD, dropped from garb 0.2.8 entirely after
     bench showed LLVM autovec under target_feature avx2,fma was
     +26-37% faster at 1024px.

   Both baselines are deleted on HEAD. Banners on both files now
   point readers at the fresh HEAD throughput snapshot.

3. **Fresh HEAD throughput snapshot.** New
   `benchmarks/fast_gamut_head_2026-05-13/{bench_t7_gamut_HEAD.log,META.md}`
   captures `bench_t7_gamut` on the migrated code (v2 calling
   garb 0.2.8 `_scalar` chunk fns, autovec'd to 256-bit YMM under
   `#[arcane(v3)]`). Not a v1↔v2 paired comparison (impossible —
   v1 is gone), but it sanity-checks that the migrated path performs
   in expected throughput buckets:

       sRGB U8 fused gamut (P3→BT.709)    256px → 227ns
                                         4096px → 3.36µs
                                         1080p  → 1.79ms

   META.md explicitly documents what this run does and does not
   measure.

Test plan:
  - `cargo test --release -p zenpixels-convert --lib` — 259 pass
  - `cargo bench --bench bench_t7_gamut -p zenpixels-convert` — clean run

Tracking: imazen/garb#7 + this PR.
@lilith
Copy link
Copy Markdown
Member Author

lilith commented May 13, 2026

Pushed two more commits addressing the review:

What changed

  1. Deleted pub(crate) const TRC_LINEAR: u8 = 5 plus its misleading "included for completeness" comment. The const was unreachable: gated out of every const-generic path by assert!(SRC_TRC < 5), never referenced after its declaration. The (Linear, Linear) short-circuit at the enum-pair level in convert_f32_v2_inner already handles linear→linear without needing a TRC tag. New comment block documents the design call: if Linear↔tagged callers appear, add a dedicated enum-pair branch rather than wiring a TRC tag with identity arms in every linearize/encode stage.

  2. SUPERSEDED banners on COMPARISON.md (criterion thermal-drift artifact) and PAIRED_COMPARISON.md (paired but measured against now-deleted baselines). Both point at the fresh HEAD snapshot below.

  3. benchmarks/fast_gamut_head_2026-05-13/ — fresh bench_t7_gamut run on the migrated code (v2 + garb 0.2.8 _scalar). META.md documents what it does measure (sanity-check throughput on HEAD) and does not (it's NOT a v1↔v2 paired comparison — v1 is gone).

What this means for the perf claim

The PR description's "v2 is 3-9% faster than v1" was true when measured against v1 + garb 0.2.7's hand-written 128-bit chunk SIMD. Both are gone on HEAD:

  • v1: deleted in this PR.
  • garb 0.2.7's chunk SIMD: dropped in garb 0.2.8 because LLVM autovec under target_feature avx2,fma was +26-37% faster at 1024px on f32 deinterleave (see garb's bench).

So v2 on HEAD is directionally still the right move (the design wins — broader platform coverage, simpler maintenance, const-generic over macro stamp, correctness validated by brute-force parity test) but the exact ±% perf delta is not citeable — the baseline it was measured against doesn't exist.

If a hard "v2 is faster than v1 on HEAD" claim is needed, the work is:

  • revive v1 as a #[cfg(feature = "fast_gamut_v1_compat")] shim,
  • restore the bench_v1_vs_v2_paired harness,
  • re-run paired against HEAD's garb 0.2.8 _scalar path,
  • compare.

I didn't do that here because (a) it's substantial scaffold work, (b) the brute-force parity test gives the correctness guarantee that matters most, and (c) the design-level wins of v2 are independent of the exact perf delta.

Test plan

  • cargo test --release -p zenpixels-convert --lib — 259 pass
  • cargo bench --bench bench_t7_gamut -p zenpixels-convert — clean run, results saved
  • Re-evaluate perf claim with paired v1↔v2 harness (deferred — see above)

Refs: 6bd659a (garb migration), this commit.

@lilith
Copy link
Copy Markdown
Member Author

lilith commented Jun 4, 2026

Rebase + evaluation (is v2 really better?)

Rebased this onto current main (a128c294) and pushed the result as eval-fast-gamut-v2 (a4f39045) — main untouched. Then measured correctness + perf on real hardware (AMD 7950X, default runtime dispatch, no target-cpu=native).

Rebase: clean. 3 conflicts resolved — kept the v1 stamp_trc_kernels deletion + v2 delegation, but preserved main's [-v4] annotations on the retained u8/u16 kernels (this branch predated that f16 work), and kept main's archmage/magetypes 0.9.26 + avx512 feature + versioned linear-srgb 0.6.12 (dropped the branch's path-dep).

Correctness: pixel-faithful ✅. 633 in-tree tests pass. A standalone v1-vs-v2 harness (7,488 cases: P3→sRGB & BT.2020→sRGB cross-gamut, edge/HDR/negative inputs) is bit-identical (max dev 0e0) on V4x/V4/V3 — every tier real x86 uses. Alpha bit-exact. The only scalar-tier divergence is on out-of-domain HLG/PQ signals (>1), where v2 is actually more correct (v1's scalar SIMD lane leaked garbage into valid neighbor pixels; v2 doesn't). No real-pixel regression.

Perf: parity on the common paths, one real win. Paired zenbench v1-vs-v2:

  • sRGB f32 RGB & RGBA (64px–1MP): statistically indistinguishable (~4.7–5.7 GiB/s both, CIs cross zero).
  • BT.709 f32: parity.
  • PQ f32 RGB: v2 genuinely faster, −4.8%/−5.4% at 64/1024px (significant) — the heavy-transcendental path is where the native f32x8 V3 body pays off.

This does not reproduce the "3–9% across the board" claim — that was vs garb 0.2.7's hand-written 128-bit chunks, since dropped (your final commit already flagged those benches superseded).

Blocker: fast_gamut_v2.rs calls linear_srgb::tf::{srgb,bt709,pq,hlg,gamma}::*_x{4,8,16}, which are pub(crate) (not pub) in released linear-srgb 0.6.12 (and the _x16 are named _extended_x16). They're public only on linear-srgb's unmerged feat/pub-tf-x16 branch — so this won't compile against released deps until that ships. The committed eval tree fails to build against versioned linear-srgb (module srgb is private); I evaluated via a throwaway local override. (Eval CI on this branch will go red for exactly this reason — that's the correct signal.)

Minor: adds #[doc(hidden)] pub mod fast_gamut_v2 with 4 pub fns and no external consumer — should be pub(crate) per this repo's strict-YAGNI rule.

Verdict: NEEDS-WORK. Correct + clean rebase, but at parity on the dominant sRGB/BT.709 paths (only ~5% on PQ) and blocked on the linear-srgb pub API. Recommend holding until linear-srgb feat/pub-tf-x16 lands, then deciding whether the PQ-only win + scalar-fallback fix justify the +11.6k churn. Rebased branch + artifacts preserved.

@lilith
Copy link
Copy Markdown
Member Author

lilith commented Jun 4, 2026

Closing this — with prejudice, not parking it.

Per the rebased evaluation above: v2 is correct and rebases cleanly, but it's at parity on the dominant sRGB/BT.709 paths and only ~5% faster on PQ — it does not deliver the across-the-board speedup that motivated it. Landing it would mean +11.6k lines of churn plus a new hard coupling to an unreleased linear-srgb pub API (feat/pub-tf-x16), in exchange for a single-path gain. That trade isn't worth it, and the v1 path on main is fine.

If the gamut/TRC conversion is worth optimizing again later, it should be a fresh, smaller effort justified by a measured broad win on the paths that actually dominate — not this branch. The full A/B (parity harness, paired zenbench, per-tier bit-exactness) is preserved at /mnt/v/output/zenpixels-pr33-eval/ and summarized in the comment above, so the measurements don't need redoing.

@lilith lilith closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant