feat(convert): descriptor-level load-bearing analysis (gamut + SIMD predicates + extension trait) by lilith · Pull Request #30 · imazen/zenpixels

lilith · 2026-05-01T16:26:39Z

Summary

Adds a descriptor-level load-bearing analysis layer to zenpixels-convert that codec encoders use to decide whether a buffer fits a narrower descriptor (channel layout / channel type / primaries) without loss. Three layers, bottom-up:

scan module — byte-level SIMD predicates over packed buffers (one per (layout, channel_type) combo).
gamut module — per-pixel bounds-check helpers + in-place transform for primary-conversion lossless detection.
load_bearing module — descriptor-aware orchestration: extension trait on PixelSlice, LoadBearingReport struct, and a combiner that produces a narrower target PixelDescriptor.

scan — SIMD predicates

Layout	`is_opaque`	`is_grayscale`	`alpha_is_binary`	bit-replication
Rgb8	—	yes	—	—
Rgba8 / Bgra8	yes	yes	yes	—
GrayA8	yes	—	yes	—
Rgb16	—	yes	—	yes
Rgba16	yes	yes	yes	yes
GrayA16	yes	—	yes	yes
Gray16	—	—	—	yes
any U16	—	—	—	`bit_replication_lossless_u16(&[u16])`

Plus fused_predicates_rgba8_cg, which runs all three RGBA8 checks in one bandwidth-fused pass: 2.2× faster than three separate passes at 16 MP and 15.3× faster than scalar at 1 MP, measured on a Ryzen 9 7950X. The trait dispatch uses it for Rgba8/Bgra8.
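For readers who want the semantics without the SIMD, here is a minimal scalar sketch of what a fused RGBA8 pass computes. The names FusedRgba8 and fused_predicates_rgba8_scalar are hypothetical and not part of this PR. The sketch also assumes grayscale is judged on every pixel regardless of its alpha value, which the PR text does not spell out.

```rust
/// Hypothetical result type for the sketch; not the PR's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FusedRgba8 {
    pub is_opaque: bool,
    pub is_grayscale: bool,
    pub alpha_is_binary: bool,
}

/// Scalar reference: one pass over packed 4-byte pixels, three answers.
/// Works for both RGBA and BGRA byte order, because alpha sits at offset 3
/// and the grayscale test is an equality check across offsets 0, 1 and 2.
pub fn fused_predicates_rgba8_scalar(pixels: &[u8]) -> FusedRgba8 {
    let mut out = FusedRgba8 {
        is_opaque: true,
        is_grayscale: true,
        alpha_is_binary: true,
    };
    // chunks_exact ignores a trailing partial pixel, if any.
    for px in pixels.chunks_exact(4) {
        let a = px[3];
        out.is_opaque &= a == 0xFF;
        out.alpha_is_binary &= a == 0x00 || a == 0xFF;
        out.is_grayscale &= px[0] == px[1] && px[1] == px[2];
    }
    out
}
```

A scalar version like this is mainly useful as a test oracle for the SIMD tiers; it deliberately has no early exit so that all three flags are always fully computed.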

bit_replication_lossless_u16(&[u16]) is the typed primary; _be16(&[u8]) is a thin wrapper for raw IDAT bytes. Endian-symmetric.
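The bit-replication check reduces to "high byte equals low byte" for every sample, which is also why it is endian-symmetric. A scalar sketch under that reading follows; the function names are hypothetical, and rejecting an odd-length byte buffer is an assumption of this sketch rather than documented behavior.

```rust
/// Scalar reference: true when every 16-bit sample is an 8-bit value
/// replicated into both bytes (0xABAB), so U16 -> U8 narrowing is lossless.
pub fn bit_replication_lossless_u16_scalar(samples: &[u16]) -> bool {
    samples.iter().all(|&s| (s >> 8) == (s & 0xFF))
}

/// Byte-level variant for raw 16-bit sample bytes. The test compares the two
/// bytes of each sample for equality, so byte order does not matter.
pub fn bit_replication_lossless_bytes_scalar(bytes: &[u8]) -> bool {
    let mut pairs = bytes.chunks_exact(2);
    // Assumption: a trailing odd byte cannot form a sample, so reject it.
    let no_tail = pairs.remainder().is_empty();
    no_tail && pairs.all(|p| p[0] == p[1])
}
```

Each sample is tested independently, so a failing sample next to a passing one cannot be masked by its neighbor.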

Generic over magetypes 5-tier dispatch (v4x AVX-512, v4 AVX2, v3 SSE4.2, neon, wasm128) with #[magetypes(define(u8x64), …)] + incant!. Inner loops use partition_slice + chunks_exact where shifted reads aren't needed; manual stride for the shifted-load patterns (with chunks_exact tails).

gamut — bounds-check + in-place transform

pub enum GamutFit { AllInside, OutOfGamut }
pub const DEFAULT_GAMUT_EPSILON: f32 = 5e-4;
pub fn check_fits_in_gamut_linear_f32_rgb (data: &[f32], m: &GamutMatrix, eps: f32) -> GamutFit;
pub fn check_fits_in_gamut_linear_f32_rgba(data: &[f32], m: &GamutMatrix, eps: f32) -> GamutFit;
pub fn fit_and_transform_linear_f32_rgb   (data: &mut [f32], m: &GamutMatrix, eps: f32) -> GamutFit;
pub fn fit_and_transform_linear_f32_rgba  (data: &mut [f32], m: &GamutMatrix, eps: f32) -> GamutFit;

On OutOfGamut, the buffer is left unmodified so callers can keep the wider representation.
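The "unmodified on OutOfGamut" guarantee can be obtained with a read-only check pass followed by a write pass. The sketch below illustrates that shape with hypothetical names (Fit, Matrix3, fit_and_transform_rgb_sketch). It assumes eps is a tolerance around the [0, 1] range and that accepted values are clamped on write; neither detail is confirmed by the PR.

```rust
/// Hypothetical stand-in for the PR's GamutFit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fit {
    AllInside,
    OutOfGamut,
}

/// Row-major 3x3 primaries conversion matrix (hypothetical type).
pub type Matrix3 = [[f32; 3]; 3];

fn mul(m: &Matrix3, p: [f32; 3]) -> [f32; 3] {
    [
        m[0][0] * p[0] + m[0][1] * p[1] + m[0][2] * p[2],
        m[1][0] * p[0] + m[1][1] * p[1] + m[1][2] * p[2],
        m[2][0] * p[0] + m[2][1] * p[1] + m[2][2] * p[2],
    ]
}

pub fn fit_and_transform_rgb_sketch(data: &mut [f32], m: &Matrix3, eps: f32) -> Fit {
    // Pass 1: read-only bounds check with early exit. Nothing is written,
    // so the buffer is untouched if any pixel falls outside the target gamut.
    for px in data.chunks_exact(3) {
        let t = mul(m, [px[0], px[1], px[2]]);
        // The negated form also rejects NaN.
        if t.iter().any(|&c| !(c >= -eps && c <= 1.0 + eps)) {
            return Fit::OutOfGamut;
        }
    }
    // Pass 2: every pixel fits, so rewrite in place, clamping the small
    // tolerated overshoot back into [0, 1] (an assumption of this sketch).
    for px in data.chunks_exact_mut(3) {
        let t = mul(m, [px[0], px[1], px[2]]);
        for (dst, c) in px.iter_mut().zip(t.iter()) {
            *dst = c.clamp(0.0, 1.0);
        }
    }
    Fit::AllInside
}
```

The two-pass form costs a second matrix multiply on the success path; a single-pass variant would need a scratch buffer or an undo step to keep the same guarantee.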

load_bearing — descriptor-aware orchestration

#[non_exhaustive]
pub struct LoadBearingReport {
    pub analyzed: bool,                                 // false → predicates couldn't run
    pub uses_alpha: bool,                               // false → drop alpha
    pub uses_chroma: bool,                              // false → narrow to grayscale
    pub uses_low_bits: bool,                            // false → narrow U16 → U8
    pub alpha_is_binary: Option<bool>,                  // Some(true) → tRNS/binary alpha
    pub uses_gray_bit_depth: Option<GrayBitDepth>,      // Some(One/Two/Four) → sub-byte pack
    pub uses_gamut: Option<ColorPrimaries>,             // Some(narrower) → re-tag + transform
}

impl LoadBearingReport {
    pub const fn fully_load_bearing() -> Self;
    pub const fn unanalyzed() -> Self;
    pub fn apply_to(&self, src: &PixelDescriptor) -> PixelDescriptor;
}

pub trait PixelSliceLoadBearingExt {
    fn determine_load_bearing(&self) -> LoadBearingReport;
    fn determine_load_bearing_reduced_descriptor(&self) -> PixelDescriptor;
    fn try_reduce_to_load_bearing_format(&self) -> Option<(PixelDescriptor, Vec<u8>)>;
}

Each uses_* field answers "is this part of the descriptor load-bearing?" — false/narrower variants mean the descriptor over-promises and can be narrowed losslessly.

analyzed: bool distinguishes "fully load-bearing because we measured it" from "fully load-bearing because we couldn't measure" — codecs that see analyzed: false should skip the optimization path.
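As an illustration of how a codec might gate on analyzed before trusting the other fields, here is a small sketch. ReportSketch is a hypothetical local stand-in that mirrors a subset of the report fields, and Plan and plan_for_rgba16 are invented for the example; a real encoder would read the actual LoadBearingReport.

```rust
/// Hypothetical stand-in mirroring a subset of the report's fields.
#[derive(Debug, Clone, Copy)]
pub struct ReportSketch {
    pub analyzed: bool,
    pub uses_alpha: bool,
    pub uses_chroma: bool,
    pub uses_low_bits: bool,
    pub alpha_is_binary: Option<bool>,
}

/// Hypothetical encoder-side plan for an RGBA16 source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Plan {
    pub color_channels: u8,
    pub keep_alpha: bool,
    pub bits: u8,
    pub binary_alpha: bool,
}

/// Returns None when the predicates could not run, so the caller skips the
/// optimization path and encodes the source as-is.
pub fn plan_for_rgba16(r: &ReportSketch) -> Option<Plan> {
    if !r.analyzed {
        return None;
    }
    Some(Plan {
        color_channels: if r.uses_chroma { 3 } else { 1 },
        keep_alpha: r.uses_alpha,
        bits: if r.uses_low_bits { 16 } else { 8 },
        // Binary alpha only matters if the alpha channel is kept at all.
        binary_alpha: r.uses_alpha && r.alpha_is_binary == Some(true),
    })
}
```

Returning None for the unanalyzed case keeps "measured and fully load-bearing" distinguishable from "not measured" at the call site.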

alpha_is_binary: Option<bool> is a free byproduct of the fused predicate — directly usable for binary-mask alpha encodings (PNG tRNS, GIF transparency).

uses_gray_bit_depth: Option<GrayBitDepth> — Some(One/Two/Four) if sub-byte packing applies, None otherwise. GrayBitDepth only enumerates the narrow options (no Eight sentinel).
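One plausible criterion for sub-byte detection is the PNG-style one: an 8-bit gray value fits in n bits if it equals an n-bit value scaled back up by bit replication. The sketch below uses that criterion as an assumption, since the PR text does not state the exact test. GrayDepthSketch and narrowest_gray_depth are hypothetical names.

```rust
/// Hypothetical stand-in for the narrow-only GrayBitDepth enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GrayDepthSketch {
    One,
    Two,
    Four,
}

/// Returns the narrowest sub-byte depth that represents every sample exactly,
/// or None when full 8-bit gray is needed.
pub fn narrowest_gray_depth(gray8: &[u8]) -> Option<GrayDepthSketch> {
    let (mut one, mut two, mut four) = (true, true, true);
    for &v in gray8 {
        one &= v == 0x00 || v == 0xFF; // 1-bit: black or white only
        two &= v == (v >> 6) * 0x55; // 2-bit: 0x00, 0x55, 0xAA, 0xFF
        four &= v == (v >> 4) * 0x11; // 4-bit: 0x00, 0x11, ..., 0xFF
        if !four {
            // Every 1-bit and 2-bit value is also a 4-bit value, so once the
            // 4-bit test fails no narrower depth can succeed.
            return None;
        }
    }
    if one {
        Some(GrayDepthSketch::One)
    } else if two {
        Some(GrayDepthSketch::Two)
    } else {
        Some(GrayDepthSketch::Four)
    }
}
```

Note that this sketch reports One for an empty buffer; a real implementation may prefer to treat that case separately.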

Gamut narrowing is wired end-to-end: determine_load_bearing runs check_fits_in_gamut_* against Bt709 when source primaries are wider AND source transfer is sRGB AND layout is RGB/RGBA at U8. try_reduce_to_load_bearing_format performs the actual transformation (decode to linear f32, apply the matrix, re-encode with the target transfer) and returns the rewritten buffer at the narrower primaries. Other transfers (PQ, HLG, Linear) and channel types (U16, F32) bail conservatively for v0; a follow-up will dispatch their EOTFs.
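The decode and re-encode steps for the sRGB transfer can be sketched with the standard piecewise sRGB curve. The function names here are hypothetical; this is an illustration of the transfer step only, not necessarily how the PR or the linear-srgb crate implements it.

```rust
/// sRGB-encoded u8 -> linear-light f32 in [0, 1] (standard piecewise curve).
pub fn srgb_u8_to_linear(v: u8) -> f32 {
    let c = v as f32 / 255.0;
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// Linear-light f32 -> sRGB-encoded u8, rounding to nearest.
pub fn linear_to_srgb_u8(l: f32) -> u8 {
    let l = l.clamp(0.0, 1.0);
    let c = if l <= 0.0031308 {
        l * 12.92
    } else {
        1.055 * l.powf(1.0 / 2.4) - 0.055
    };
    // c is in [0, 1], so the sum is in [0.5, 255.5] and the cast truncates
    // to the nearest integer code value.
    (c * 255.0 + 0.5) as u8
}
```

In the gamut path, the primaries matrix would be applied to the three linear values of each pixel between these two calls; alpha, if present, passes through unchanged.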

apply_to reduction order

Channel-type narrowing (U16 → U8 when uses_low_bits == false)
Alpha drop (when uses_alpha == false and the layout has alpha)
Chroma collapse (when uses_chroma == false and the layout has chroma)
Primaries narrowing (when uses_gamut == Some)

apply_to skips transitions that would yield an unrepresentable PixelFormat, e.g. dropping alpha from Bgra8 when there is no Bgr8 enum variant. When try_reduce requests that alpha drop, the transform_to path handles Bgra8 → Rgb8 via channel reorder instead.

Sub-byte gray packing and tRNS encoding are codec concerns and stay in the codec layer — codec encoders read the report fields directly and apply their format-specific encoding.
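The reduction order and the "skip unrepresentable transitions" rule can be sketched on a toy descriptor. Layout, Depth, Desc, Flags and apply_to_sketch are all hypothetical names for illustration. The primaries step is omitted, and leaving Bgra untouched in the chroma step is an assumption of this toy model.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layout { Gray, GrayA, Rgb, Rgba, Bgra }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Depth { U8, U16 }

/// Toy descriptor: layout plus channel type only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Desc { pub layout: Layout, pub depth: Depth }

/// Toy subset of the report fields.
#[derive(Debug, Clone, Copy)]
pub struct Flags { pub uses_alpha: bool, pub uses_chroma: bool, pub uses_low_bits: bool }

pub fn apply_to_sketch(f: &Flags, src: Desc) -> Desc {
    let mut d = src;
    // 1. Channel-type narrowing first.
    if !f.uses_low_bits && d.depth == Depth::U16 {
        d.depth = Depth::U8;
    }
    // 2. Alpha drop. Bgra has no alpha-free counterpart in this toy enum,
    //    so that transition is skipped and the layout is left as-is.
    if !f.uses_alpha {
        d.layout = match d.layout {
            Layout::Rgba => Layout::Rgb,
            Layout::GrayA => Layout::Gray,
            other => other,
        };
    }
    // 3. Chroma collapse on whatever layout step 2 produced.
    if !f.uses_chroma {
        d.layout = match d.layout {
            Layout::Rgb => Layout::Gray,
            Layout::Rgba => Layout::GrayA,
            other => other,
        };
    }
    d
}
```

Because each step matches on the layout left by the previous one, an Rgba source with neither alpha nor chroma in use collapses to Gray in a single call.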

Where codecs hook this in

Codec	Hook	Codec-side decisions consuming the report
zenpng `encode_raw`	After `PixelSlice` accepted, before `analyze_rgba8`	`uses_alpha=false` → drop alpha; `alpha_is_binary=Some(true)` → tRNS lookup; `uses_gray_bit_depth=Some(…)` → sub-byte pack; ≤256 colors → indexed (separate scan, palette ordering by luminance)
zenwebp	Top of encode pipeline	`uses_alpha=false` → drop alpha; `uses_gamut=Some(Bt709)` → drop ICC profile, declare sRGB. WebP has no monochrome/sub-byte modes
zenavif	Before nclx box + YUV subsampling pick	`uses_alpha=false` → drop alpha plane; `uses_chroma=false` → yuv400; `uses_gamut` → nclx narrowing; `uses_low_bits=false` → 8-bit container
zenjxl	Before frame type pick	`uses_alpha=false` → drop extra channel; `uses_chroma=false` → grayscale frame; `uses_low_bits=false` → 8-bit; `uses_gamut` → ICC narrowing
zentiff	Before SubFileType/PhotometricInterpretation pick	`uses_chroma=false` → grayscale; `uses_alpha=false` → 1-channel SamplesPerPixel; `uses_low_bits=false` → BitsPerSample=8

The trait method runs in <1 ms at 1 MP (per benches) — free at every encoder entry.

Dep bumps

archmage 0.9.15 → 0.9.23 (avx512 feature)
magetypes 0.9.18 → 0.9.23 (w512 + avx512)
New zenpixels-convert avx512 feature, no-op for users (acknowledges incant!'s cfg expansion)

Tests

95 new tests, 339 total passing in zenpixels-convert --lib, 0 failures.

9 gamut-fit tests
65 scan tests (5 standalone + 7 layout extensions + 11 fused; both true and false fixtures with explicit inline byte arrays at multiple sizes; runs at every dispatch tier via for_each_token_permutation; includes explicit bit_repl_false_fefe_then_fe00 and bit_repl_true_fefe_then_fcfc proving adjacent pairs don't poison each other)
23 load_bearing tests (Rgba8 / Rgba16 / GrayA8 / Gray8 dispatch; sub-byte 1/2/4-bit detection; alpha_is_binary Some/None semantics; analyzed: false for unsupported types; gamut detection for P3-with-neutral-gray vs P3-with-saturated-red; gamut transformation end-to-end producing a sRGB-tagged buffer; PQ transfer skip; SRGB no-op; fully_load_bearing default)

cargo test -p zenpixels-convert --lib
test result: ok. 339 passed; 0 failed; 0 ignored

Test plan

No new top-level dependencies (archmage / magetypes / linear-srgb already in tree)
Public API additions only — no breaking changes
Tests cover every predicate's true/false paths at every dispatch tier
Bench numbers measured on real hardware, not estimated
Gamut detection runs end-to-end (linearize → bounds-check → transform → re-encode → re-tag)
All non-gamut load-bearing reductions also exercised end-to-end via try_reduce_to_load_bearing_format

…parison bench

Three pieces in one commit since they all flow from the same investigation into where SIMD pulls its weight:

1) analyze_rgba8 fast path

When the flag set requests only the cheap-bool predicates (no palette, no sub-byte gray, no transparent-color tracking), call the fused SIMD predicate scanner and skip the scalar single-pass entirely. Saves the HashMap allocation and per-pixel scalar branches.

2) Gamut downcast v0 (src/gamut.rs)

Detects Display P3 + sRGB transfer cICP (CP=12, TC=13, MC=0), scans every pixel for sRGB-gamut fit (early-exit on first overflow), and on success re-encodes the buffer in sRGB primaries. Wide-gamut metadata (cICP, cHRM, source_gamma) dropped; sRGB chunk emitted with perceptual intent if not set.

Gated on:
- flags.gamut_downcast (off by default)
- compression.effort() >= 7 (this pass is more expensive than the byte-level predicates per CLAUDE.md design)
- bit depth 8 + RGB or RGBA channel layout
- cICP recognized as DisplayP3

Other primaries (BT.2020 / AdobeRGB) and HDR transfers (PQ/HLG) are rejected by SourceGamut::from_cicp.

Bounds-check helper proposed upstream as imazen/zenpixels#30. Once that ships in zenpixels-convert 0.2.12 this module shrinks to source detection + EOTF/OETF coordination.

3) Scalar-vs-SIMD comparison bench + decision

benches/scalar_vs_simd.rs runs each predicate as scalar, magetypes-SIMD, and (for fused) all three (scalar, runtime-branch SIMD, const-generic SIMD) against the same workload at four sizes.

Decision: keep magetypes generic, NO hand-tuned intrinsics.

Numbers (Ryzen 9 7950X), SIMD vs scalar speedup:

Predicate	1 MP success path	16 MP DRAM-bound
is_grayscale_rgba8	11.7x	2.05x
alpha_is_binary_rgba8	4.3x	1.95x
is_grayscale_rgb8	6.8x	1.81x
bit_replication_be16	12.8x	2.30x
fused (3-in-1, CG)	15.3x	4.34x

Per CLAUDE.md (manual intrinsics only when 10%+ over magetypes): in-cache magetypes is 4-15x scalar, which is already excellent, with no obvious 10% gap. DRAM-bound there's no headroom (memory bandwidth is the ceiling, hand intrinsics can't beat it). Closing the manual-intrinsics investigation.

Tests (24 new, all passing):
* 11 src/gamut.rs unit tests
* 2 end-to-end encode tests with real cICP metadata
* 11 fast-path / DowncastFlags wiring tests

Bench logs saved at:
benchmarks/scan_predicates_2026-05-01.{log,meta}
benchmarks/fused_predicates_2026-05-01.{log,meta}
benchmarks/scalar_vs_simd_2026-05-01.{log,meta}

…predicates

Builds on the prior commit: extends scan.rs with U16-typed and GrayAlpha predicates so every common pixel layout has a SIMD path, and adds the descriptor-aware load-bearing module that codecs use as their one-call entry point.

== Predicate coverage ==

Layout	is_opaque	is_grayscale	alpha_is_binary	bit_repl
Rgb8	N/A	yes	N/A	N/A
Rgba8	yes	yes	yes	N/A
Bgra8	(Rgba8 fn)	(Rgba8 fn)	(Rgba8 fn)	N/A [shared dispatch]
GrayA8	yes	N/A	yes	N/A
Rgb16	N/A	yes	N/A	yes
Rgba16	yes	yes	yes	yes
GrayA16	yes	N/A	yes	yes
Gray16	N/A	N/A	N/A	yes
any U16	--	--	--	yes [bit_replication_lossless_u16(&[u16])]

Bgra8 dispatch shares the Rgba8 implementation because the byte positions are equivalent for these tests (alpha at offset 3, chroma at offsets 0/1/2; equality is symmetric in channel order).

== bit_replication_lossless_u16 ==

Endian-agnostic predicate on &[u16]: every sample must satisfy (s >> 8) == (s & 0xFF). Replaces the byte-level _be16 form as the primary API; _be16 retained as a thin wrapper for callers working on raw PNG IDAT bytes.

== Refactor: partition_slice + chunks_exact ==

Where the SIMD pattern doesn't need shifted loads (is_opaque, alpha_is_binary, the new opaque/binary predicates) the inner loops now use u8x64::partition_slice / u16x32::partition_slice for non-overlapping chunks. Scalar tails universally use chunks_exact. Shifted-load patterns (is_grayscale_*, bit_replication_*) keep manual stride for the SIMD outer loop (overlapping reads don't fit partition_slice) but use chunks_exact in the tail. Comment in each flagged module explains the choice.

== load_bearing module ==

#[non_exhaustive]
pub struct LoadBearingReport {
    uses_alpha: bool,
    uses_chroma: bool,
    uses_low_bits: bool,
    uses_gray_bit_depth: GrayBitDepth,
    uses_gamut: Option<ColorPrimaries>,
}

impl LoadBearingReport {
    const fn fully_load_bearing() -> Self;
    fn apply_to(&self, src: &PixelDescriptor) -> PixelDescriptor;
}

pub trait PixelSliceLoadBearingExt {
    fn determine_load_bearing(&self) -> LoadBearingReport;
    fn determine_load_bearing_reduced_descriptor(&self) -> PixelDescriptor;
    fn try_reduce_to_load_bearing_format(&self) -> Option<(PixelDescriptor, Vec<u8>)>;
}

determine_load_bearing dispatches based on (channel_layout, channel_type): it picks the right SIMD predicate per layout and returns the assembled report. Sub-byte gray detection (1/2/4/8) runs as a scalar pass when the buffer is grayscale (or post-chroma-collapse).

apply_to walks the report in dependency order: U16→U8 first (since it reduces the data the next steps see), then alpha drop, chroma collapse, and primaries narrowing. Skips transitions that would yield an unrepresentable PixelFormat (e.g. dropping alpha from Bgra8, since there is no Bgr8 in this enum).

try_reduce_to_load_bearing_format returns None when the buffer is already at its load-bearing minimum, otherwise Some((target, bytes)) with the rewritten contiguous buffer at the narrower format.

Sub-byte gray packing and tRNS encoding are codec concerns and stay out of this module; codec encoders read the report fields directly.

== Tests ==

22 new scan tests covering each new predicate at both true and false fixtures, including the explicit bit_repl_false_fefe_then_fe00 case that proves adjacent pairs don't poison each other.

15 new load_bearing tests covering the trait dispatch for Rgba8, Rgb8, GrayA8, Rgba16, Gray8: opaque-gray reduces to Gray8, RGBA-with-real-color reduces to Rgb8, RGBA-with-alpha-variation reduces to GrayA8, RGBA16 bit-replicated reduces all the way to Rgb8 / Gray8, sub-byte gray detection emits 1/2/4/8 correctly, try_reduce returns None when buffer is minimal.

cargo test -p zenpixels-convert --lib: 331 passed, 0 failed.

…tore avx512 feature

lilith self-assigned this May 1, 2026

lilith force-pushed the push-qpwqxvsrmqqn branch from 8ae6050 to 48ddcf2 Compare May 1, 2026 17:06

lilith changed the title ~~feat(convert): gamut bounds-check helpers for lossless primary downcast~~ feat(convert): gamut bounds-check + SIMD descriptor-level predicates May 1, 2026

lilith force-pushed the push-qpwqxvsrmqqn branch from 48ddcf2 to 1bc0e97 Compare May 1, 2026 22:40

lilith changed the title ~~feat(convert): gamut bounds-check + SIMD descriptor-level predicates~~ feat(convert): descriptor-level load-bearing analysis (gamut + SIMD predicates + extension trait) May 1, 2026

lilith force-pushed the push-qpwqxvsrmqqn branch 3 times, most recently from cb97ee5 to 1fe8580 Compare May 2, 2026 00:17

refactor(convert): per-field Option semantics on LoadBearingReport

362f088

lilith force-pushed the push-qpwqxvsrmqqn branch from 1fe8580 to 362f088 Compare May 2, 2026 00:42

fix(convert): explicit tier lists in fast_gamut.rs incant calls + res…

7fe3386

…tore avx512 feature

lilith force-pushed the push-qpwqxvsrmqqn branch from 1dd027f to 7fe3386 Compare May 2, 2026 04:21

lilith mentioned this pull request May 10, 2026

perf(fast_gamut_v2): SIMD chunk path via garb + native V3 body, replaces v1 stamp_trc_kernels #33

Closed

4 tasks

lilith wants to merge 3 commits into main from push-qpwqxvsrmqqn
