feat(convert): descriptor-level load-bearing analysis (gamut + SIMD predicates + extension trait)#30
Open
lilith wants to merge 3 commits into
lilith added a commit to imazen/zenpng that referenced this pull request on May 1, 2026
…parison bench
Three pieces in one commit since they all flow from the same investigation
into where SIMD pulls its weight:
1) analyze_rgba8 fast path
When the flag set requests only the cheap-bool predicates (no
palette, no sub-byte gray, no transparent-color tracking), call
the fused SIMD predicate scanner and skip the scalar single-pass
entirely. Saves the HashMap allocation and per-pixel scalar branches.
2) Gamut downcast v0 (src/gamut.rs)
Detects Display P3 + sRGB transfer cICP (CP=12, TC=13, MC=0),
scans every pixel for sRGB-gamut fit (early-exit on first overflow),
and on success re-encodes the buffer in sRGB primaries. Wide-gamut
metadata (cICP, cHRM, source_gamma) dropped; sRGB chunk emitted
with perceptual intent if not set.
Gated on:
- flags.gamut_downcast (off by default)
- compression.effort() >= 7 (this pass is more expensive than the
byte-level predicates per CLAUDE.md design)
- bit depth 8 + RGB or RGBA channel layout
- cICP recognized as DisplayP3
Other primaries (BT.2020 / AdobeRGB) and HDR transfers (PQ/HLG) are
rejected by SourceGamut::from_cicp.
Bounds-check helper proposed upstream as imazen/zenpixels#30. Once
that ships in zenpixels-convert 0.2.12 this module shrinks to
source detection + EOTF/OETF coordination.
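The per-pixel fit scan described above can be sketched roughly as follows. This is a minimal scalar illustration, not the code in src/gamut.rs: the function name, the tolerance EPS, and the bounds-in-linear-light criterion are all assumptions made for the example. The matrix is the commonly published linear Display P3 to linear sRGB conversion (both D65).

```rust
/// sRGB transfer decode (EOTF): 8-bit code value -> linear light in [0, 1].
fn srgb_to_linear(code: u8) -> f32 {
    let c = f32::from(code) / 255.0;
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// Linear Display P3 -> linear sRGB (BT.709 primaries), both D65.
const P3_TO_SRGB: [[f32; 3]; 3] = [
    [1.2249401, -0.2249402, 0.0],
    [-0.0420569, 1.0420571, 0.0],
    [-0.0196376, -0.0786507, 1.0982884],
];

/// Tolerance for rounding in the matrix rows (an assumption for this sketch).
const EPS: f32 = 1e-4;

/// Hypothetical helper: true when every packed RGB8 pixel, interpreted as
/// Display P3 with the sRGB transfer, lands inside the sRGB gamut.
/// `all` short-circuits, so the scan stops at the first out-of-gamut pixel.
pub fn p3_rgb8_fits_in_srgb(pixels: &[u8]) -> bool {
    pixels.chunks_exact(3).all(|px| {
        let lin = [
            srgb_to_linear(px[0]),
            srgb_to_linear(px[1]),
            srgb_to_linear(px[2]),
        ];
        P3_TO_SRGB.iter().all(|row| {
            let v = row[0] * lin[0] + row[1] * lin[1] + row[2] * lin[2];
            (-EPS..=1.0 + EPS).contains(&v)
        })
    })
}
```

Neutral grays pass because each matrix row sums to approximately 1, while a fully saturated P3 red or green produces negative sRGB components and fails on the first pixel.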
3) Scalar-vs-SIMD comparison bench + decision
benches/scalar_vs_simd.rs runs each predicate as scalar, magetypes-
SIMD, and (for fused) all three (scalar, runtime-branch SIMD, const-
generic SIMD) against the same workload at four sizes.
Decision: keep magetypes generic, NO hand-tuned intrinsics.
Numbers (Ryzen 9 7950X), SIMD vs scalar speedup:
Predicate               1 MP success path   16 MP DRAM-bound
---------------------   -----------------   ----------------
is_grayscale_rgba8      11.7x               2.05x
alpha_is_binary_rgba8   4.3x                1.95x
is_grayscale_rgb8       6.8x                1.81x
bit_replication_be16    12.8x               2.30x
fused (3-in-1, CG)      15.3x               4.34x
Per CLAUDE.md (manual intrinsics only when 10%+ over magetypes):
in-cache magetypes is 4-15x scalar — already excellent, no obvious
10% gap. DRAM-bound there's no headroom (memory bandwidth is the
ceiling, hand intrinsics can't beat it). Closing the manual-intrinsics
investigation.
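For orientation, a scalar baseline of the kind the bench compares against might look like the sketch below. It assumes the three fused RGBA8 checks are is_opaque, is_grayscale, and alpha_is_binary; the struct and function names are hypothetical and are not taken from benches/scalar_vs_simd.rs.

```rust
/// Hypothetical result type for the three RGBA8 predicates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rgba8Facts {
    pub is_opaque: bool,
    pub is_grayscale: bool,
    pub alpha_is_binary: bool,
}

/// Scalar single-pass evaluation of all three predicates over packed RGBA8.
/// Trailing bytes that do not form a whole pixel are ignored.
pub fn fused_predicates_rgba8_scalar(pixels: &[u8]) -> Rgba8Facts {
    let mut facts = Rgba8Facts {
        is_opaque: true,
        is_grayscale: true,
        alpha_is_binary: true,
    };
    for px in pixels.chunks_exact(4) {
        let (r, g, b, a) = (px[0], px[1], px[2], px[3]);
        facts.is_opaque &= a == 255;
        facts.is_grayscale &= r == g && g == b;
        facts.alpha_is_binary &= a == 0 || a == 255;
        // Once every predicate has failed, nothing more can be learned.
        if !(facts.is_opaque || facts.is_grayscale || facts.alpha_is_binary) {
            break;
        }
    }
    facts
}
```

The per-pixel branches in a loop like this are what the SIMD versions replace with wide compares, which is consistent with the larger speedups in cache and the smaller ones once memory bandwidth dominates.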
Tests (24 new, all passing):
* 11 src/gamut.rs unit tests
* 2 end-to-end encode tests with real cICP metadata
* 11 fast-path / DowncastFlags wiring tests
Bench logs saved at:
benchmarks/scan_predicates_2026-05-01.{log,meta}
benchmarks/fused_predicates_2026-05-01.{log,meta}
benchmarks/scalar_vs_simd_2026-05-01.{log,meta}
…predicates
Builds on the prior commit: extends scan.rs with U16-typed and
GrayAlpha predicates so every common pixel layout has a SIMD path,
and adds the descriptor-aware load-bearing module that codecs use as
their one-call entry point.
== Predicate coverage ==
Layout is_opaque is_grayscale alpha_is_binary bit_repl
----------- --------- ------------ --------------- --------
Rgb8 N/A yes N/A N/A
Rgba8 yes yes yes N/A
Bgra8 (Rgba8 fn) (Rgba8 fn) (Rgba8 fn) N/A [shared dispatch]
GrayA8 yes N/A yes N/A
Rgb16 N/A yes N/A yes
Rgba16 yes yes yes yes
GrayA16 yes N/A yes yes
Gray16 N/A N/A N/A yes
any U16 -- -- -- yes [bit_replication_lossless_u16(&[u16])]
Bgra8 dispatch shares the Rgba8 implementation because the byte
positions are equivalent for these tests (alpha at offset 3, chroma
at offsets 0/1/2 — equality is symmetric in channel order).
== bit_replication_lossless_u16 ==
Endian-agnostic predicate on &[u16]: every sample must satisfy
(s >> 8) == (s & 0xFF). Replaces the byte-level _be16 form as the
primary API; _be16 retained as a thin wrapper for callers working
on raw PNG IDAT bytes.
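A scalar sketch of the predicate and its byte-level wrapper, for illustration only. The `_sketch` suffixes mark these as stand-ins rather than the functions in scan.rs, and rejecting odd-length byte buffers is an assumption of this example.

```rust
/// True when every 16-bit sample is an 8-bit value replicated into both
/// bytes, so narrowing to 8 bits loses nothing.
pub fn bit_replication_lossless_u16_sketch(samples: &[u16]) -> bool {
    samples.iter().all(|&s| (s >> 8) == (s & 0xFF))
}

/// Byte-level form for raw 16-bit sample bytes. Because the test is
/// "high byte equals low byte", byte order does not matter.
pub fn bit_replication_lossless_be16_sketch(bytes: &[u8]) -> bool {
    bytes.len() % 2 == 0 && bytes.chunks_exact(2).all(|pair| pair[0] == pair[1])
}

/// Narrow replicated samples to 8 bits by keeping the high byte.
pub fn narrow_u16_to_u8_sketch(samples: &[u16]) -> Vec<u8> {
    samples.iter().map(|&s| (s >> 8) as u8).collect()
}
```

With these definitions, the sample pair 0xFEFE, 0xFE00 is rejected and 0xFEFE, 0xFCFC is accepted, matching the fixtures named in the tests.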
== Refactor: partition_slice + chunks_exact ==
Where the SIMD pattern doesn't need shifted loads (is_opaque,
alpha_is_binary, the new opaque/binary predicates) the inner loops
now use u8x64::partition_slice / u16x32::partition_slice for non-
overlapping chunks. Scalar tails universally use chunks_exact.
Shifted-load patterns (is_grayscale_*, bit_replication_*) keep
manual stride for the SIMD outer (overlapping reads don't fit
partition_slice) but use chunks_exact in the tail. Comment in each
flagged module explains the choice.
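The non-overlapping block plus scalar tail shape can be illustrated with the standard library alone. This is an analogue under stated assumptions, not the magetypes code: partition_slice is not used here, the 64-byte block size merely mirrors u8x64, and the function name is hypothetical.

```rust
/// Std-only analogue of "wide non-overlapping blocks, then a scalar tail"
/// for an opacity check over packed RGBA8.
pub fn is_opaque_rgba8_chunked(pixels: &[u8]) -> bool {
    const BLOCK: usize = 64; // 16 RGBA8 pixels per block
    let mut blocks = pixels.chunks_exact(BLOCK);
    for block in &mut blocks {
        // AND all alpha bytes in the block together, then test once per block.
        let mut acc = 0xFFu8;
        for px in block.chunks_exact(4) {
            acc &= px[3];
        }
        if acc != 0xFF {
            return false;
        }
    }
    // Tail: whatever did not fill a whole block, one pixel at a time.
    blocks.remainder().chunks_exact(4).all(|px| px[3] == 255)
}
```

Shifted-load predicates such as the grayscale check read overlapping windows, which is why they do not fit this shape and keep a manual stride.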
== load_bearing module ==
#[non_exhaustive]
pub struct LoadBearingReport {
    uses_alpha: bool,
    uses_chroma: bool,
    uses_low_bits: bool,
    uses_gray_bit_depth: GrayBitDepth,
    uses_gamut: Option<ColorPrimaries>,
}
impl LoadBearingReport {
    const fn fully_load_bearing() -> Self;
    fn apply_to(&self, src: &PixelDescriptor) -> PixelDescriptor;
}
pub trait PixelSliceLoadBearingExt {
    fn determine_load_bearing(&self) -> LoadBearingReport;
    fn determine_load_bearing_reduced_descriptor(&self) -> PixelDescriptor;
    fn try_reduce_to_load_bearing_format(&self) -> Option<(PixelDescriptor, Vec<u8>)>;
}
determine_load_bearing dispatches based on (channel_layout,
channel_type) — picks the right SIMD predicate per layout, returns
the assembled report. Sub-byte gray detection (1/2/4/8) runs as a
scalar pass when the buffer is grayscale (or post-chroma-collapse).
apply_to walks the report in dependency order: U16→U8 first (since
it reduces the data the next steps see), then alpha drop, chroma
collapse, and primaries narrowing. Skips transitions that would
yield an unrepresentable PixelFormat (e.g. dropping alpha from
Bgra8 — no Bgr8 in this enum).
try_reduce_to_load_bearing_format returns None when the buffer is
already at its load-bearing minimum, otherwise Some((target, bytes))
with the rewritten contiguous buffer at the narrower format.
Sub-byte gray packing and tRNS encoding are codec concerns and
stay out of this module — codec encoders read the report fields
directly.
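The dependency order that apply_to walks can be sketched with simplified stand-in types. Everything below (Layout, Depth, Primaries, Desc, Report) is hypothetical and much smaller than the real PixelDescriptor and LoadBearingReport; in particular, leaving Bgra untouched in every step is a simplification of this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layout { Gray, GrayA, Rgb, Rgba, Bgra }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Depth { U8, U16 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Primaries { DisplayP3, Bt709 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Desc { pub layout: Layout, pub depth: Depth, pub primaries: Primaries }

pub struct Report {
    pub uses_alpha: bool,
    pub uses_chroma: bool,
    pub uses_low_bits: bool,
    pub uses_gamut: Option<Primaries>,
}

impl Report {
    /// Apply reductions in dependency order; skip any step whose target
    /// layout does not exist in this sketch's enum.
    pub fn apply_to(&self, src: Desc) -> Desc {
        let mut out = src;
        // 1. U16 -> U8
        if !self.uses_low_bits {
            out.depth = Depth::U8;
        }
        // 2. Alpha drop (no Bgr variant, so Bgra is left as-is)
        if !self.uses_alpha {
            out.layout = match out.layout {
                Layout::Rgba => Layout::Rgb,
                Layout::GrayA => Layout::Gray,
                other => other,
            };
        }
        // 3. Chroma collapse
        if !self.uses_chroma {
            out.layout = match out.layout {
                Layout::Rgb => Layout::Gray,
                Layout::Rgba => Layout::GrayA,
                other => other,
            };
        }
        // 4. Primaries narrowing
        if let Some(p) = self.uses_gamut {
            out.primaries = p;
        }
        out
    }
}
```

Under these assumptions an opaque, gray, bit-replicated Rgba/U16 descriptor collapses to Gray/U8, while a Bgra descriptor with unused alpha comes back unchanged because the narrower layout is not representable.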
== Tests ==
22 new scan tests covering each new predicate at both true and
false fixtures, including the explicit bit_repl_false_fefe_then_fe00
case that proves adjacent pairs don't poison each other.
15 new load_bearing tests covering the trait dispatch for Rgba8,
Rgb8, GrayA8, Rgba16, Gray8 — opaque-gray reduces to Gray8,
RGBA-with-real-color reduces to Rgb8, RGBA-with-alpha-variation
reduces to GrayA8, RGBA16 bit-replicated reduces all the way to
Rgb8 / Gray8, sub-byte gray detection emits 1/2/4/8 correctly,
try_reduce returns None when buffer is minimal.
cargo test -p zenpixels-convert --lib — 331 passed, 0 failed.
cb97ee5 to 1fe8580
…tore avx512 feature
Summary
Adds a descriptor-level load-bearing analysis layer to zenpixels-convert that codec encoders use to decide whether a buffer fits a narrower descriptor (channel layout / channel type / primaries) without loss. Three layers, bottom-up:
- scan module: byte-level SIMD predicates over packed buffers (one per (layout, channel_type) combo).
- gamut module: per-pixel bounds-check helpers + in-place transform for primary-conversion lossless detection.
- load_bearing module: descriptor-aware orchestration, consisting of an extension trait on PixelSlice, the LoadBearingReport struct, and a combiner that produces a narrower target PixelDescriptor.
scan: SIMD predicates
Predicates: is_opaque, is_grayscale, alpha_is_binary, bit_replication_lossless_u16(&[u16]).
Plus fused_predicates_rgba8_cg, which runs all three RGBA8 checks in one bandwidth-fused pass: 2.2× vs three separate passes at 16 MP, 15.3× vs scalar at 1 MP, on Ryzen 9 7950X. The trait dispatch uses this for RGBA8/Bgra8.
bit_replication_lossless_u16(&[u16]) is the typed primary; _be16(&[u8]) is a thin wrapper for raw IDAT bytes. Endian-symmetric.
Generic over magetypes 5-tier dispatch (v4x AVX-512, v4 AVX2, v3 SSE4.2, neon, wasm128) with #[magetypes(define(u8x64), …)] + incant!. Inner loops use partition_slice + chunks_exact where shifted reads aren't needed; manual stride for the shifted-load patterns (with chunks_exact tails).
gamut: bounds-check + in-place transform
On OutOfGamut, the buffer is left unmodified so callers can keep the wider representation.
load_bearing: descriptor-aware orchestration
Each uses_* field answers "is this part of the descriptor load-bearing?": false / narrower variants mean the descriptor over-promises and can be narrowed losslessly.
- analyzed: bool distinguishes "fully load-bearing because we measured it" from "fully load-bearing because we couldn't measure"; codecs that see analyzed: false should skip the optimization path.
- alpha_is_binary: Option<bool> is a free byproduct of the fused predicate, directly usable for binary-mask alpha encodings (PNG tRNS, GIF transparency).
- uses_gray_bit_depth: Option<GrayBitDepth> is Some(One/Two/Four) if sub-byte packing applies, None otherwise. GrayBitDepth only enumerates the narrow options (no Eight sentinel).
Gamut narrowing is wired end-to-end: determine_load_bearing runs check_fits_in_gamut_* against Bt709 when source primaries are wider AND source transfer is sRGB AND layout is RGB/RGBA at U8. try_reduce_to_load_bearing_format performs the actual transformation (decode to linear f32, apply the matrix, re-encode with the target transfer) and returns the rewritten buffer at the narrower primaries. Other transfers (PQ, HLG, Linear) and channel types (U16, F32) bail conservatively for v0; a follow-up will dispatch their EOTFs.
apply_to reduction order
1. U16→U8 narrowing (uses_low_bits == false)
2. Alpha drop (uses_alpha == false and the layout has alpha)
3. Chroma collapse (uses_chroma == false and the layout has chroma)
4. Primaries narrowing (uses_gamut == Some)
Skips transitions that yield an unrepresentable PixelFormat (e.g. dropping alpha from Bgra8 when there's no Bgr8 enum variant; the transform_to path handles Bgra8 → Rgb8 via channel reorder when alpha drop is requested in try_reduce).
Sub-byte gray packing and tRNS encoding are codec concerns and stay in the codec layer: codec encoders read the report fields directly and apply their format-specific encoding.
Where codecs hook this in
- encode_raw PixelSlice accepted, before analyze_rgba8: uses_alpha=false → drop alpha; alpha_is_binary=Some(true) → tRNS lookup; uses_gray_bit_depth=Some(…) → sub-byte pack; ≤256 colors → indexed (separate scan, palette ordering by luminance)
- uses_alpha=false → drop alpha; uses_gamut=Some(Bt709) → drop ICC profile, declare sRGB. WebP has no monochrome/sub-byte modes
- uses_alpha=false → drop alpha plane; uses_chroma=false → yuv400; uses_gamut → nclx narrowing; uses_low_bits=false → 8-bit container
- uses_alpha=false → drop extra channel; uses_chroma=false → grayscale frame; uses_low_bits=false → 8-bit; uses_gamut → ICC narrowing
- uses_chroma=false → grayscale; uses_alpha=false → 1-channel SamplesPerPixel; uses_low_bits=false → BitsPerSample=8
The trait method runs in <1 ms at 1 MP (per benches), so it is cheap enough to call at every encoder entry.
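As an illustration of how an encoder might consume the report fields listed above, here is a small sketch for the PNG-style decisions. The Report struct is a hypothetical mirror of the fields named in the summary, and AlphaPlan, png_alpha_plan, and png_gray_bits are invented for this example; none of them are zenpng or zenpixels-convert APIs.

```rust
/// Hypothetical mirror of GrayBitDepth: only the narrow options exist.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GrayBitDepth { One, Two, Four }

/// Hypothetical subset of the report fields a PNG-style encoder would read.
#[derive(Debug, Clone, Copy)]
pub struct Report {
    pub analyzed: bool,
    pub uses_alpha: bool,
    pub alpha_is_binary: Option<bool>,
    pub uses_gray_bit_depth: Option<GrayBitDepth>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlphaPlan {
    Keep,
    Drop,
    /// Binary alpha: the codec may try tRNS, but it still has to confirm
    /// that a usable transparent color exists.
    TrnsCandidate,
}

pub fn png_alpha_plan(r: &Report) -> AlphaPlan {
    if !r.analyzed {
        // Nothing was measured, so skip the optimization path.
        return AlphaPlan::Keep;
    }
    if !r.uses_alpha {
        AlphaPlan::Drop
    } else if r.alpha_is_binary == Some(true) {
        AlphaPlan::TrnsCandidate
    } else {
        AlphaPlan::Keep
    }
}

/// Bits per gray sample to pack; only meaningful for 8-bit grayscale output.
pub fn png_gray_bits(r: &Report) -> u8 {
    if !r.analyzed {
        return 8;
    }
    match r.uses_gray_bit_depth {
        Some(GrayBitDepth::One) => 1,
        Some(GrayBitDepth::Two) => 2,
        Some(GrayBitDepth::Four) => 4,
        None => 8,
    }
}
```

Keeping these decisions in the codec, as the description says, means the report stays format-neutral and each encoder maps the same fields onto its own container features.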
Dep bumps
- archmage 0.9.15 → 0.9.23 (avx512 feature)
- magetypes 0.9.18 → 0.9.23 (w512 + avx512)
- zenpixels-convert avx512 feature, no-op for users (acknowledges incant!'s cfg expansion)
Tests
95 new tests, 339 total passing in zenpixels-convert --lib, 0 failures.
- (… for_each_token_permutation; includes explicit bit_repl_false_fefe_then_fe00 and bit_repl_true_fefe_then_fcfc proving adjacent pairs don't poison each other)
- (… alpha_is_binary Some/None semantics; analyzed: false for unsupported types; gamut detection for P3-with-neutral-gray vs P3-with-saturated-red; gamut transformation end-to-end producing an sRGB-tagged buffer; PQ transfer skip; sRGB no-op; fully_load_bearing default)
Test plan
try_reduce_to_load_bearing_format