
feat(convert): descriptor-level load-bearing analysis (gamut + SIMD predicates + extension trait)#30

Open
lilith wants to merge 3 commits into main from push-qpwqxvsrmqqn


@lilith lilith commented May 1, 2026

Summary

Adds a descriptor-level load-bearing analysis layer to zenpixels-convert that codec encoders use to decide whether a buffer fits a narrower descriptor (channel layout / channel type / primaries) without loss. Three layers, bottom-up:

  1. scan module — byte-level SIMD predicates over packed buffers (one per (layout, channel_type) combo).
  2. gamut module — per-pixel bounds-check helpers + in-place transform for primary-conversion lossless detection.
  3. load_bearing module — descriptor-aware orchestration: extension trait on PixelSlice, LoadBearingReport struct, and a combiner that produces a narrower target PixelDescriptor.

scan — SIMD predicates

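Layout          is_opaque   is_grayscale   alpha_is_binary   bit-replication
-------------   ---------   ------------   ---------------   ---------------
Rgb8            --          yes            --                --
Rgba8 / Bgra8   yes         yes            yes               --
GrayA8          yes         --             yes               --
Rgb16           --          yes            --                yes
Rgba16          yes         yes            yes               yes
GrayA16         yes         --             yes               yes
Gray16          --          --             --                yes
any U16         --          --             --                bit_replication_lossless_u16(&[u16])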

Plus fused_predicates_rgba8_cg, which runs all three RGBA8 checks in a single pass over the buffer: 2.2× faster than three separate passes at 16 MP and 15.3× faster than scalar at 1 MP (Ryzen 9 7950X). The trait dispatch uses this for Rgba8/Bgra8.
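To make the idea of a fused pass concrete, here is a scalar sketch that evaluates all three RGBA8 predicates from one read of the buffer. The names `FusedRgba8` and `fused_predicates_rgba8_scalar` are hypothetical and only illustrate the semantics; the PR's version is SIMD and its signature may differ.

```rust
/// Results of the three RGBA8 checks, gathered in a single pass.
/// (Hypothetical type for illustration, not the crate's API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FusedRgba8 {
    pub is_opaque: bool,       // every alpha byte is 255
    pub is_grayscale: bool,    // R == G == B for every pixel
    pub alpha_is_binary: bool, // every alpha byte is 0 or 255
}

/// Scalar reference: each pixel is loaded once and feeds all three checks.
/// Trailing bytes that do not form a whole pixel are ignored.
pub fn fused_predicates_rgba8_scalar(data: &[u8]) -> FusedRgba8 {
    let mut out = FusedRgba8 {
        is_opaque: true,
        is_grayscale: true,
        alpha_is_binary: true,
    };
    for px in data.chunks_exact(4) {
        let (r, g, b, a) = (px[0], px[1], px[2], px[3]);
        out.is_opaque &= a == 255;
        out.is_grayscale &= r == g && g == b;
        out.alpha_is_binary &= a == 0 || a == 255;
        // Once every predicate has failed, nothing can change: stop early.
        if !(out.is_opaque || out.is_grayscale || out.alpha_is_binary) {
            break;
        }
    }
    out
}
```

The saving over three separate scans comes from touching memory once, which is why the gap is largest when the buffer no longer fits in cache.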

bit_replication_lossless_u16(&[u16]) is the typed primary; _be16(&[u8]) is a thin wrapper for raw IDAT bytes. Endian-symmetric.
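As a rough scalar reference for what the bit-replication predicate checks, the sketch below tests that the high byte of every 16-bit sample equals its low byte. The `_scalar` function names are hypothetical; only the condition `(s >> 8) == (s & 0xFF)` comes from the PR.

```rust
/// True when every 16-bit sample is an 8-bit value replicated into both
/// bytes (0xABAB), so narrowing U16 -> U8 loses nothing.
pub fn bit_replication_lossless_u16_scalar(samples: &[u16]) -> bool {
    samples.iter().all(|&s| (s >> 8) == (s & 0xFF))
}

/// Byte-level variant for raw sample bytes: compares the two bytes of each
/// sample directly. Because the test is "high byte == low byte", the byte
/// order of the samples does not matter. A trailing odd byte is ignored.
pub fn bit_replication_lossless_bytes_scalar(bytes: &[u8]) -> bool {
    bytes.chunks_exact(2).all(|pair| pair[0] == pair[1])
}
```

The byte-level form must compare bytes only within a sample: in the bytes `FE FE FE 00` the second and third bytes match across a sample boundary, yet the second sample is not replicated.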

Generic over magetypes 5-tier dispatch (v4x AVX-512, v4 AVX2, v3 SSE4.2, neon, wasm128) with #[magetypes(define(u8x64), …)] + incant!. Inner loops use partition_slice + chunks_exact where shifted reads aren't needed; manual stride for the shifted-load patterns (with chunks_exact tails).

gamut — bounds-check + in-place transform

pub enum GamutFit { AllInside, OutOfGamut }
pub const DEFAULT_GAMUT_EPSILON: f32 = 5e-4;
pub fn check_fits_in_gamut_linear_f32_rgb (data: &[f32], m: &GamutMatrix, eps: f32) -> GamutFit;
pub fn check_fits_in_gamut_linear_f32_rgba(data: &[f32], m: &GamutMatrix, eps: f32) -> GamutFit;
pub fn fit_and_transform_linear_f32_rgb   (data: &mut [f32], m: &GamutMatrix, eps: f32) -> GamutFit;
pub fn fit_and_transform_linear_f32_rgba  (data: &mut [f32], m: &GamutMatrix, eps: f32) -> GamutFit;

On OutOfGamut, the buffer is left unmodified so callers can keep the wider representation.
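A minimal sketch of how a check-then-transform pair can honor the "unmodified on OutOfGamut" contract. Everything here is an assumption for illustration: `Matrix3` stands in for `GamutMatrix` (assumed row-major 3×3), `Fit` stands in for `GamutFit`, the accepted range is taken to be `[-eps, 1 + eps]`, and the final clamp is a guess at how residual overshoot within `eps` might be handled.

```rust
/// Stand-in for the crate's matrix type (assumed row-major 3x3).
pub type Matrix3 = [[f32; 3]; 3];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fit {
    AllInside,
    OutOfGamut,
}

fn mul(m: &Matrix3, p: [f32; 3]) -> [f32; 3] {
    [
        m[0][0] * p[0] + m[0][1] * p[1] + m[0][2] * p[2],
        m[1][0] * p[0] + m[1][1] * p[1] + m[1][2] * p[2],
        m[2][0] * p[0] + m[2][1] * p[1] + m[2][2] * p[2],
    ]
}

/// Read-only scan: stops at the first pixel whose transformed value
/// leaves [-eps, 1 + eps] in any channel (NaN also counts as outside).
pub fn check_fits_rgb(data: &[f32], m: &Matrix3, eps: f32) -> Fit {
    for px in data.chunks_exact(3) {
        let t = mul(m, [px[0], px[1], px[2]]);
        if !t.iter().all(|&c| c >= -eps && c <= 1.0 + eps) {
            return Fit::OutOfGamut;
        }
    }
    Fit::AllInside
}

/// Check first, write second, so a failure leaves `data` untouched.
pub fn fit_and_transform_rgb(data: &mut [f32], m: &Matrix3, eps: f32) -> Fit {
    if check_fits_rgb(data, m, eps) == Fit::OutOfGamut {
        return Fit::OutOfGamut;
    }
    for px in data.chunks_exact_mut(3) {
        let t = mul(m, [px[0], px[1], px[2]]);
        for i in 0..3 {
            // Assumption: overshoot within eps is clamped into [0, 1].
            px[i] = t[i].clamp(0.0, 1.0);
        }
    }
    Fit::AllInside
}
```

A two-pass layout costs one extra read on the success path; a single-pass version would need a scratch copy or a rollback to keep the same guarantee.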

load_bearing — descriptor-aware orchestration

#[non_exhaustive]
pub struct LoadBearingReport {
    pub analyzed: bool,                                 // false → predicates couldn't run
    pub uses_alpha: bool,                               // false → drop alpha
    pub uses_chroma: bool,                              // false → narrow to grayscale
    pub uses_low_bits: bool,                            // false → narrow U16 → U8
    pub alpha_is_binary: Option<bool>,                  // Some(true) → tRNS/binary alpha
    pub uses_gray_bit_depth: Option<GrayBitDepth>,      // Some(One/Two/Four) → sub-byte pack
    pub uses_gamut: Option<ColorPrimaries>,             // Some(narrower) → re-tag + transform
}

impl LoadBearingReport {
    pub const fn fully_load_bearing() -> Self;
    pub const fn unanalyzed() -> Self;
    pub fn apply_to(&self, src: &PixelDescriptor) -> PixelDescriptor;
}

pub trait PixelSliceLoadBearingExt {
    fn determine_load_bearing(&self) -> LoadBearingReport;
    fn determine_load_bearing_reduced_descriptor(&self) -> PixelDescriptor;
    fn try_reduce_to_load_bearing_format(&self) -> Option<(PixelDescriptor, Vec<u8>)>;
}

Each uses_* field answers "is this part of the descriptor load-bearing?" — false/narrower variants mean the descriptor over-promises and can be narrowed losslessly.

analyzed: bool distinguishes "fully load-bearing because we measured it" from "fully load-bearing because we couldn't measure" — codecs that see analyzed: false should skip the optimization path.

alpha_is_binary: Option<bool> is a free byproduct of the fused predicate — directly usable for binary-mask alpha encodings (PNG tRNS, GIF transparency).
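The sketch below shows one way an encoder might read these fields when deciding how to store alpha. `Report` is a cut-down local stand-in for `LoadBearingReport`, and `AlphaPlan` / `plan_alpha` are hypothetical names; the point is the ordering of the checks, with `analyzed` consulted first.

```rust
/// Cut-down stand-in for the report, holding only the fields used here.
#[derive(Debug, Clone, Copy)]
pub struct Report {
    pub analyzed: bool,
    pub uses_alpha: bool,
    pub alpha_is_binary: Option<bool>,
}

/// Hypothetical encoder-side decision for the alpha channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlphaPlan {
    KeepFullAlpha,
    DropAlpha,
    BinaryMask,
}

pub fn plan_alpha(r: &Report) -> AlphaPlan {
    // Nothing was measured: trust the descriptor and skip the optimization.
    if !r.analyzed {
        return AlphaPlan::KeepFullAlpha;
    }
    // Alpha is present in the layout but carries no information.
    if !r.uses_alpha {
        return AlphaPlan::DropAlpha;
    }
    // Alpha matters, but only as an on/off mask.
    match r.alpha_is_binary {
        Some(true) => AlphaPlan::BinaryMask,
        _ => AlphaPlan::KeepFullAlpha,
    }
}
```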

uses_gray_bit_depth: Option<GrayBitDepth> is Some(One/Two/Four) if sub-byte packing applies, None otherwise. GrayBitDepth only enumerates the narrow options (no Eight sentinel).
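For intuition, a scalar sub-byte detection pass over 8-bit gray samples could look like the sketch below. The criterion is an assumption, not taken from the PR: it treats a sample as representable in N bits when it is the N-bit value scaled to 8 bits by bit replication (multiples of 255, 85, and 17 for 1, 2, and 4 bits). `GrayDepth` and `narrowest_gray_depth` are hypothetical names.

```rust
/// Local stand-in for the narrow gray depths (no 8-bit variant).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GrayDepth {
    One,
    Two,
    Four,
}

/// Returns the narrowest depth that reproduces every sample exactly after
/// bit-replicated expansion back to 8 bits, or None if 8 bits are needed.
/// An empty buffer trivially fits in one bit.
pub fn narrowest_gray_depth(gray8: &[u8]) -> Option<GrayDepth> {
    let (mut one, mut two) = (true, true);
    for &v in gray8 {
        // 4-bit values expand to multiples of 17 (0x11).
        if v % 17 != 0 {
            return None;
        }
        // 2-bit values expand to multiples of 85 (0x55).
        two &= v % 85 == 0;
        // 1-bit values expand to 0 or 255.
        one &= v == 0 || v == 255;
    }
    if one {
        Some(GrayDepth::One)
    } else if two {
        Some(GrayDepth::Two)
    } else {
        Some(GrayDepth::Four)
    }
}
```

The three sets are nested (every 1-bit value is a 2-bit value, every 2-bit value a 4-bit value), so failing the 4-bit test is enough to stop early.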

Gamut narrowing is wired end-to-end: determine_load_bearing runs check_fits_in_gamut_* against Bt709 when source primaries are wider AND source transfer is sRGB AND layout is RGB/RGBA at U8. try_reduce_to_load_bearing_format performs the actual transformation — decode to linear f32, apply the matrix, re-encode with the target transfer — and returns the rewritten buffer at the narrower primaries. Other transfers (PQ, HLG, Linear) and channel-types (U16, F32) bail conservatively for v0; followup will dispatch their EOTFs.
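The decode, matrix, re-encode sequence for the U8 + sRGB-transfer case can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: `narrow_p3_rgb8_to_srgb` is a hypothetical name, the matrix holds commonly quoted approximate values for linear Display P3 to linear sRGB, and the range test and clamp mirror the `eps` convention described earlier.

```rust
/// Approximate linear Display P3 -> linear sRGB (BT.709 primaries) matrix.
const P3_TO_SRGB: [[f32; 3]; 3] = [
    [1.2249401, -0.2249402, 0.0],
    [-0.0420570, 1.0420571, 0.0],
    [-0.0196376, -0.0786361, 1.0982735],
];

/// sRGB transfer: encoded [0, 1] -> linear light.
fn srgb_eotf(c: f32) -> f32 {
    if c <= 0.04045 { c / 12.92 } else { ((c + 0.055) / 1.055).powf(2.4) }
}

/// Inverse sRGB transfer: linear light -> encoded [0, 1].
fn srgb_oetf(l: f32) -> f32 {
    if l <= 0.0031308 { l * 12.92 } else { 1.055 * l.powf(1.0 / 2.4) - 0.055 }
}

/// Returns the re-encoded RGB8 buffer in sRGB primaries, or None as soon as
/// any channel of any pixel leaves [-eps, 1 + eps] after the matrix.
/// The source slice is never modified.
pub fn narrow_p3_rgb8_to_srgb(src: &[u8], eps: f32) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(src.len());
    for px in src.chunks_exact(3) {
        let lin = [
            srgb_eotf(px[0] as f32 / 255.0),
            srgb_eotf(px[1] as f32 / 255.0),
            srgb_eotf(px[2] as f32 / 255.0),
        ];
        for row in &P3_TO_SRGB {
            let v = row[0] * lin[0] + row[1] * lin[1] + row[2] * lin[2];
            if !(v >= -eps && v <= 1.0 + eps) {
                return None; // out of gamut: caller keeps the wide buffer
            }
            out.push((srgb_oetf(v.clamp(0.0, 1.0)) * 255.0).round() as u8);
        }
    }
    Some(out)
}
```

Neutral grays pass because each matrix row sums to roughly 1, while a fully saturated P3 red lands well outside the sRGB range and is rejected, which matches the two gamut cases named in the test list.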

apply_to reduction order

  1. Channel-type narrowing (U16 → U8 when uses_low_bits == false)
  2. Alpha drop (when uses_alpha == false and the layout has alpha)
  3. Chroma collapse (when uses_chroma == false and the layout has chroma)
  4. Primaries narrowing (when uses_gamut == Some)

apply_to skips transitions that yield an unrepresentable PixelFormat (e.g. dropping alpha from Bgra8, since there is no Bgr8 enum variant). When alpha drop is requested in try_reduce, the transform_to path handles Bgra8 → Rgb8 via channel reorder.
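The reduction order can be sketched on a simplified descriptor. All types below (`Layout`, `Depth`, `Primaries`, `Desc`, `Report`) are local stand-ins rather than the crate's `PixelDescriptor` and `LoadBearingReport`, and leaving `Bgra` untouched in both layout steps is a simplification of the "skip unrepresentable transitions" rule.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layout { Gray, GrayA, Rgb, Rgba, Bgra }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Depth { U8, U16 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Primaries { Bt709, DisplayP3 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Desc { pub layout: Layout, pub depth: Depth, pub primaries: Primaries }

#[derive(Debug, Clone, Copy)]
pub struct Report {
    pub uses_alpha: bool,
    pub uses_chroma: bool,
    pub uses_low_bits: bool,
    pub uses_gamut: Option<Primaries>,
}

pub fn apply_to(r: &Report, src: Desc) -> Desc {
    let mut d = src;
    // 1. Channel-type narrowing.
    if !r.uses_low_bits && d.depth == Depth::U16 {
        d.depth = Depth::U8;
    }
    // 2. Alpha drop. Bgra stays as-is: this model has no Bgr layout.
    if !r.uses_alpha {
        d.layout = match d.layout {
            Layout::Rgba => Layout::Rgb,
            Layout::GrayA => Layout::Gray,
            other => other,
        };
    }
    // 3. Chroma collapse (Bgra again left alone in this simplified model).
    if !r.uses_chroma {
        d.layout = match d.layout {
            Layout::Rgb => Layout::Gray,
            Layout::Rgba => Layout::GrayA,
            other => other,
        };
    }
    // 4. Primaries narrowing.
    if let Some(p) = r.uses_gamut {
        d.primaries = p;
    }
    d
}
```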

Sub-byte gray packing and tRNS encoding are codec concerns and stay in the codec layer — codec encoders read the report fields directly and apply their format-specific encoding.

Where codecs hook this in

Codec     Hook                                                         Codec-side decisions consuming the report
-------   ----------------------------------------------------------   -----------------------------------------
zenpng    encode_raw After PixelSlice accepted, before analyze_rgba8   uses_alpha=false → drop alpha; alpha_is_binary=Some(true) → tRNS lookup; uses_gray_bit_depth=Some(…) → sub-byte pack; ≤256 colors → indexed (separate scan, palette ordering by luminance)
zenwebp   Top of encode pipeline                                       uses_alpha=false → drop alpha; uses_gamut=Some(Bt709) → drop ICC profile, declare sRGB. WebP has no monochrome/sub-byte modes
zenavif   Before nclx box + YUV subsampling pick                       uses_alpha=false → drop alpha plane; uses_chroma=false → yuv400; uses_gamut → nclx narrowing; uses_low_bits=false → 8-bit container
zenjxl    Before frame type pick                                       uses_alpha=false → drop extra channel; uses_chroma=false → grayscale frame; uses_low_bits=false → 8-bit; uses_gamut → ICC narrowing
zentiff   Before SubFileType/PhotometricInterpretation pick            uses_chroma=false → grayscale; uses_alpha=false → 1-channel SamplesPerPixel; uses_low_bits=false → BitsPerSample=8

The trait method runs in <1 ms at 1 MP (per benches), so it is cheap enough to call at every encoder entry.

Dep bumps

  • archmage 0.9.15 → 0.9.23 (avx512 feature)
  • magetypes 0.9.18 → 0.9.23 (w512 + avx512)
  • New zenpixels-convert avx512 feature, no-op for users (acknowledges incant!'s cfg expansion)

Tests

95 new tests, 339 total passing in zenpixels-convert --lib, 0 failures.

  • 9 gamut-fit tests
  • 65 scan tests (5 standalone + 7 layout extensions + 11 fused; both true and false fixtures with explicit inline byte arrays at multiple sizes; runs at every dispatch tier via for_each_token_permutation; includes explicit bit_repl_false_fefe_then_fe00 and bit_repl_true_fefe_then_fcfc proving adjacent pairs don't poison each other)
  • 23 load_bearing tests (Rgba8 / Rgba16 / GrayA8 / Gray8 dispatch; sub-byte 1/2/4-bit detection; alpha_is_binary Some/None semantics; analyzed: false for unsupported types; gamut detection for P3-with-neutral-gray vs P3-with-saturated-red; gamut transformation end-to-end producing a sRGB-tagged buffer; PQ transfer skip; SRGB no-op; fully_load_bearing default)
cargo test -p zenpixels-convert --lib
test result: ok. 339 passed; 0 failed; 0 ignored

Test plan

  • No new top-level dependencies (archmage / magetypes / linear-srgb already in tree)
  • Public API additions only — no breaking changes
  • Tests cover every predicate's true/false paths at every dispatch tier
  • Bench numbers measured on real hardware, not estimated
  • Gamut detection runs end-to-end (linearize → bounds-check → transform → re-encode → re-tag)
  • All non-gamut load-bearing reductions also exercised end-to-end via try_reduce_to_load_bearing_format

@lilith lilith self-assigned this May 1, 2026
lilith added a commit to imazen/zenpng that referenced this pull request May 1, 2026
…parison bench

Three pieces in one commit since they all flow from the same investigation
into where SIMD pulls its weight:

1) analyze_rgba8 fast path
   When the flag set requests only the cheap-bool predicates (no
   palette, no sub-byte gray, no transparent-color tracking), call
   the fused SIMD predicate scanner and skip the scalar single-pass
   entirely. Saves the HashMap allocation and per-pixel scalar branches.

2) Gamut downcast v0 (src/gamut.rs)
   Detects Display P3 + sRGB transfer cICP (CP=12, TC=13, MC=0),
   scans every pixel for sRGB-gamut fit (early-exit on first overflow),
   and on success re-encodes the buffer in sRGB primaries. Wide-gamut
   metadata (cICP, cHRM, source_gamma) dropped; sRGB chunk emitted
   with perceptual intent if not set.

   Gated on:
     - flags.gamut_downcast (off by default)
     - compression.effort() >= 7 (this pass is more expensive than the
       byte-level predicates per CLAUDE.md design)
     - bit depth 8 + RGB or RGBA channel layout
     - cICP recognized as DisplayP3

   Other primaries (BT.2020 / AdobeRGB) and HDR transfers (PQ/HLG) are
   rejected by SourceGamut::from_cicp.

   Bounds-check helper proposed upstream as imazen/zenpixels#30. Once
   that ships in zenpixels-convert 0.2.12 this module shrinks to
   source detection + EOTF/OETF coordination.

3) Scalar-vs-SIMD comparison bench + decision
   benches/scalar_vs_simd.rs runs each predicate as scalar, magetypes-
   SIMD, and (for fused) all three (scalar, runtime-branch SIMD, const-
   generic SIMD) against the same workload at four sizes.

   Decision: keep magetypes generic, NO hand-tuned intrinsics.

   Numbers (Ryzen 9 7950X):
     1 MP success path  SIMD vs scalar speedup
       is_grayscale_rgba8           11.7x
       alpha_is_binary_rgba8         4.3x
       is_grayscale_rgb8             6.8x
       bit_replication_be16         12.8x
       fused (3-in-1, CG)           15.3x

     16 MP DRAM-bound  SIMD vs scalar speedup
       is_grayscale_rgba8            2.05x
       alpha_is_binary_rgba8         1.95x
       is_grayscale_rgb8             1.81x
       bit_replication_be16          2.30x
       fused (3-in-1, CG)            4.34x

   Per CLAUDE.md (manual intrinsics only when 10%+ over magetypes):
   in-cache magetypes is 4-15x scalar — already excellent, no obvious
   10% gap. DRAM-bound there's no headroom (memory bandwidth is the
   ceiling, hand intrinsics can't beat it). Closing the manual-intrinsics
   investigation.

Tests (24 new, all passing):
  * 11 src/gamut.rs unit tests
  * 2 end-to-end encode tests with real cICP metadata
  * 11 fast-path / DowncastFlags wiring tests

Bench logs saved at:
  benchmarks/scan_predicates_2026-05-01.{log,meta}
  benchmarks/fused_predicates_2026-05-01.{log,meta}
  benchmarks/scalar_vs_simd_2026-05-01.{log,meta}
@lilith lilith force-pushed the push-qpwqxvsrmqqn branch from 8ae6050 to 48ddcf2 Compare May 1, 2026 17:06
@lilith lilith changed the title from "feat(convert): gamut bounds-check helpers for lossless primary downcast" to "feat(convert): gamut bounds-check + SIMD descriptor-level predicates" May 1, 2026
…predicates

Builds on the prior commit: extends scan.rs with U16-typed and
GrayAlpha predicates so every common pixel layout has a SIMD path,
and adds the descriptor-aware load-bearing module that codecs use as
their one-call entry point.

== Predicate coverage ==

  Layout       is_opaque   is_grayscale   alpha_is_binary   bit_repl
  -----------  ---------   ------------   ---------------   --------
  Rgb8           N/A         yes            N/A               N/A
  Rgba8          yes         yes            yes               N/A
  Bgra8         (Rgba8 fn)  (Rgba8 fn)     (Rgba8 fn)         N/A    [shared dispatch]
  GrayA8         yes         N/A            yes               N/A
  Rgb16          N/A         yes            N/A               yes
  Rgba16         yes         yes            yes               yes
  GrayA16        yes         N/A            yes               yes
  Gray16         N/A         N/A            N/A               yes
  any U16        --          --             --                yes    [bit_replication_lossless_u16(&[u16])]

  Bgra8 dispatch shares the Rgba8 implementation because the byte
  positions are equivalent for these tests (alpha at offset 3, chroma
  at offsets 0/1/2 — equality is symmetric in channel order).

== bit_replication_lossless_u16 ==

  Endian-agnostic predicate on &[u16]: every sample must satisfy
  (s >> 8) == (s & 0xFF). Replaces the byte-level _be16 form as the
  primary API; _be16 retained as a thin wrapper for callers working
  on raw PNG IDAT bytes.

== Refactor: partition_slice + chunks_exact ==

  Where the SIMD pattern doesn't need shifted loads (is_opaque,
  alpha_is_binary, the new opaque/binary predicates) the inner loops
  now use u8x64::partition_slice / u16x32::partition_slice for non-
  overlapping chunks. Scalar tails universally use chunks_exact.

  Shifted-load patterns (is_grayscale_*, bit_replication_*) keep
  manual stride for the SIMD outer (overlapping reads don't fit
  partition_slice) but use chunks_exact in the tail. Comment in each
  flagged module explains the choice.
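  The "non-overlapping wide chunks plus chunks_exact tail" shape can be sketched with the standard library alone. This is a scalar stand-in under stated assumptions: the 64-byte block mimics one u8x64 vector, and `is_opaque_rgba8_blocked` is a hypothetical name, not the crate's function.

```rust
/// Opaque check over RGBA8 bytes, structured like the SIMD version:
/// whole 64-byte blocks first, then a per-pixel tail.
pub fn is_opaque_rgba8_blocked(data: &[u8]) -> bool {
    const BLOCK: usize = 64; // 16 RGBA8 pixels, the width of one u8x64
    let blocks = data.chunks_exact(BLOCK);
    // Bytes left over after the last whole block.
    let tail = blocks.remainder();
    for block in blocks {
        // AND-reduce the 16 alpha bytes; any value below 255 clears a bit.
        let mut acc = 0xFFu8;
        for px in block.chunks_exact(4) {
            acc &= px[3];
        }
        if acc != 0xFF {
            return false;
        }
    }
    // Tail: whole pixels only; stray bytes past the last pixel are ignored.
    tail.chunks_exact(4).all(|px| px[3] == 0xFF)
}
```

  Shifted-load predicates do not fit this shape because each vector has to read a few bytes past its own block, which is why they keep a manual stride.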

== load_bearing module ==

  #[non_exhaustive]
  pub struct LoadBearingReport {
      uses_alpha: bool,
      uses_chroma: bool,
      uses_low_bits: bool,
      uses_gray_bit_depth: GrayBitDepth,
      uses_gamut: Option<ColorPrimaries>,
  }

  impl LoadBearingReport {
      const fn fully_load_bearing() -> Self;
      fn apply_to(&self, src: &PixelDescriptor) -> PixelDescriptor;
  }

  pub trait PixelSliceLoadBearingExt {
      fn determine_load_bearing(&self) -> LoadBearingReport;
      fn determine_load_bearing_reduced_descriptor(&self) -> PixelDescriptor;
      fn try_reduce_to_load_bearing_format(&self) -> Option<(PixelDescriptor, Vec<u8>)>;
  }

  determine_load_bearing dispatches based on (channel_layout,
  channel_type) — picks the right SIMD predicate per layout, returns
  the assembled report. Sub-byte gray detection (1/2/4/8) runs as a
  scalar pass when the buffer is grayscale (or post-chroma-collapse).

  apply_to walks the report in dependency order: U16→U8 first (since
  it reduces the data the next steps see), then alpha drop, chroma
  collapse, and primaries narrowing. Skips transitions that would
  yield an unrepresentable PixelFormat (e.g. dropping alpha from
  Bgra8 — no Bgr8 in this enum).

  try_reduce_to_load_bearing_format returns None when the buffer is
  already at its load-bearing minimum, otherwise Some((target, bytes))
  with the rewritten contiguous buffer at the narrower format.
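  A minimal sketch of the None / Some((target, bytes)) contract for the Rgba8 case only. `Format` and `try_reduce_rgba8` are hypothetical stand-ins; the real method works on a PixelSlice and covers more layouts.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format { Rgb8, GrayA8, Gray8 }

// Source byte offsets kept per output pixel for each target format.
const KEEP_RGB: &[usize] = &[0, 1, 2];
const KEEP_GRAY_A: &[usize] = &[0, 3];
const KEEP_GRAY: &[usize] = &[0];

/// Returns None when RGBA8 is already the load-bearing minimum (real color
/// and real alpha), otherwise the narrower format and a repacked buffer.
pub fn try_reduce_rgba8(data: &[u8]) -> Option<(Format, Vec<u8>)> {
    let opaque = data.chunks_exact(4).all(|p| p[3] == 255);
    let gray = data.chunks_exact(4).all(|p| p[0] == p[1] && p[1] == p[2]);
    let (format, keep) = match (opaque, gray) {
        (false, false) => return None,
        (true, false) => (Format::Rgb8, KEEP_RGB),
        (false, true) => (Format::GrayA8, KEEP_GRAY_A),
        (true, true) => (Format::Gray8, KEEP_GRAY),
    };
    let mut out = Vec::with_capacity(data.len() / 4 * keep.len());
    for px in data.chunks_exact(4) {
        // Copy only the channels that carry information.
        out.extend(keep.iter().map(|&i| px[i]));
    }
    Some((format, out))
}
```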

  Sub-byte gray packing and tRNS encoding are codec concerns and
  stay out of this module — codec encoders read the report fields
  directly.

== Tests ==

  22 new scan tests covering each new predicate at both true and
  false fixtures, including the explicit bit_repl_false_fefe_then_fe00
  case that proves adjacent pairs don't poison each other.

  15 new load_bearing tests covering the trait dispatch for Rgba8,
  Rgb8, GrayA8, Rgba16, Gray8 — opaque-gray reduces to Gray8,
  RGBA-with-real-color reduces to Rgb8, RGBA-with-alpha-variation
  reduces to GrayA8, RGBA16 bit-replicated reduces all the way to
  Rgb8 / Gray8, sub-byte gray detection emits 1/2/4/8 correctly,
  try_reduce returns None when buffer is minimal.

cargo test -p zenpixels-convert --lib — 331 passed, 0 failed.
@lilith lilith force-pushed the push-qpwqxvsrmqqn branch from 48ddcf2 to 1bc0e97 Compare May 1, 2026 22:40
@lilith lilith changed the title from "feat(convert): gamut bounds-check + SIMD descriptor-level predicates" to "feat(convert): descriptor-level load-bearing analysis (gamut + SIMD predicates + extension trait)" May 1, 2026
@lilith lilith force-pushed the push-qpwqxvsrmqqn branch 3 times, most recently from cb97ee5 to 1fe8580 Compare May 2, 2026 00:17
@lilith lilith force-pushed the push-qpwqxvsrmqqn branch from 1fe8580 to 362f088 Compare May 2, 2026 00:42