Support Lloyd-Max quantization by CC-Yeh · Pull Request #448 · trymirai/uzu

CC-Yeh · 2026-05-26T17:15:17Z

M	K	N	LUT NF4	ZP	ScaleBias	LUT vs ZP	LUT vs ScaleBias
1	2048	2048	13.21 +/- 0.17	13.21 +/- 0.21	16.25 +/- 0.30	+0.0%	-18.7%
2	2048	2048	20.66 +/- 0.18	21.25 +/- 0.44	19.77 +/- 0.42	-2.8%	+4.5%
4	2048	2048	39.95 +/- 0.66	40.00 +/- 0.57	36.38 +/- 0.42	-0.1%	+9.8%
1	4096	4096	82.27 +/- 2.41	81.00 +/- 1.48	88.96 +/- 3.30	+1.6%	-7.5%
2	4096	4096	83.30 +/- 0.69	86.27 +/- 2.69	87.93 +/- 2.92	-3.4%	-5.3%
4	4096	4096	159.31 +/- 0.48	159.28 +/- 0.95	147.03 +/- 0.29	+0.0%	+8.3%
1	4096	14336	311.23 +/- 3.57	317.59 +/- 7.47	333.77 +/- 10.88	-2.0%	-6.8%
2	4096	14336	313.65 +/- 10.87	310.81 +/- 2.51	334.17 +/- 15.47	+0.9%	-6.1%
4	4096	14336	546.43 +/- 6.16	543.17 +/- 1.03	502.50 +/- 1.13	+0.6%	+8.7%
1	14336	4096	307.90 +/- 3.87	315.62 +/- 5.92	328.01 +/- 2.63	-2.4%	-6.1%
2	14336	4096	309.49 +/- 3.95	314.63 +/- 3.91	327.04 +/- 4.39	-1.6%	-5.4%
4	14336	4096	584.91 +/- 36.89	545.00 +/- 1.63	503.72 +/- 1.03	+7.3%	+16.1%
1	14336	14336	1102.04 +/- 25.39	1089.21 +/- 15.51	1143.24 +/- 12.77	+1.2%	-3.6%
2	14336	14336	1106.88 +/- 58.83	1068.91 +/- 14.79	1154.25 +/- 56.39	+3.6%	-4.1%
4	14336	14336	2043.22 +/- 33.13	1872.84 +/- 7.52	1742.68 +/- 20.34	+9.1%	+17.2%

Summary

LUT is roughly tied with ZP overall: geomean +0.14% slower, wins 8/15.
LUT beats ZP at M=1, is roughly neutral at M=2, and loses at M=4.
LUT beats ScaleBias in most M=1/M=2 cells, but loses badly at M=4 (+8.3% to +17.2% slower in the table above); overall geomean is +0.71% slower.
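
For readers who want to recompute summary figures like these from a timing table, here is a minimal sketch. It assumes the geomean is taken over per-row ratios of mean times (candidate / baseline) and that a "win" means a strictly lower mean time; the function names are hypothetical and are not part of the PR or the benchmark harness.

```rust
/// Geometric-mean slowdown, in percent, of `candidate` relative to `baseline`.
/// Positive means the candidate is slower on (geometric) average.
/// Assumption: both slices hold per-row mean times in the same unit.
fn geomean_slowdown_pct(candidate: &[f64], baseline: &[f64]) -> f64 {
    assert_eq!(candidate.len(), baseline.len(), "columns must have equal length");
    assert!(!candidate.is_empty(), "need at least one row");
    // Average the log-ratios, then exponentiate: this is the geometric mean
    // of the per-row ratios candidate / baseline.
    let mean_log_ratio = candidate
        .iter()
        .zip(baseline)
        .map(|(c, b)| (c / b).ln())
        .sum::<f64>()
        / candidate.len() as f64;
    (mean_log_ratio.exp() - 1.0) * 100.0
}

/// Number of rows where the candidate's mean time is strictly lower.
fn wins(candidate: &[f64], baseline: &[f64]) -> usize {
    candidate
        .iter()
        .zip(baseline)
        .filter(|(c, b)| c < b)
        .count()
}
```

Passing the LUT NF4 column as `candidate` and the ZP or ScaleBias column as `baseline` gives a figure comparable to the ones quoted here. Rounded table values and run-to-run noise mean the result may not match the summary to the last digit.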

CC-Yeh · 2026-05-26T17:15:31Z

What improved

Normal tg16 used one shared 16-entry threadgroup LUT across all simdgroups.
That was bad for batched rows: about +33% to +48% slower at M=4.
tg16-duplicate did not fix it: still about +38% to +53% slower at M=4.
tg16-vec4 was worse: about +43% to +57% slower at M=4.
tg16-ilp and no-barrier variants did not explain the win.
The winning change was simdgroup separation: each simdgroup gets its own 16-entry LUT slice and
synchronizes with simdgroup_barrier.
That brought the old bad tg16 path down from roughly +33%..+48% slower to roughly +5%..+15%
slower at M=4 in the old standalone LUT experiments, and near parity with ZP in the current
apples-to-apples benchmark.
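
To make the layout concrete, here is a CPU-side model in Rust of the simdgroup-separated LUT. It is only a sketch of the indexing, not the Metal kernel: the `SlicedLut` type and its methods are hypothetical, and on the GPU each simdgroup would fill its own slice in parallel and then call `simdgroup_barrier` instead of running a sequential loop.

```rust
/// Entries in one NF4-style lookup table (4-bit codes).
const LUT_ENTRIES: usize = 16;

/// CPU model of a threadgroup buffer that holds one private
/// 16-entry LUT slice per simdgroup (hypothetical type).
struct SlicedLut {
    storage: Vec<f32>,
}

impl SlicedLut {
    /// Copy the codebook once per simdgroup so that no two simdgroups
    /// read the same threadgroup-memory slice.
    fn new(codebook: &[f32; LUT_ENTRIES], num_simdgroups: usize) -> Self {
        let mut storage = Vec::with_capacity(num_simdgroups * LUT_ENTRIES);
        for _ in 0..num_simdgroups {
            storage.extend_from_slice(codebook);
        }
        Self { storage }
    }

    /// Dequantize one 4-bit code using the slice owned by `simdgroup_index`.
    /// Panics if `simdgroup_index` is out of range.
    fn lookup(&self, simdgroup_index: usize, code: u8) -> f32 {
        let code = (code & 0x0F) as usize; // keep only the low 4 bits
        self.storage[simdgroup_index * LUT_ENTRIES + code]
    }
}
```

The cost of this layout is `num_simdgroups * 16` entries of threadgroup memory instead of 16. In exchange, each simdgroup only has to synchronize with itself.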

Bad variants

FLUTE-ish duplicate/select/shuffle variants were catastrophic.
Duplicate variants reached hundreds of percent slower.
select was often over +1100% slower in M=2/M=4 cells.

CC-Yeh · 2026-05-26T17:16:31Z

A threadgroup codebook in the kernel arguments could hurt performance for non-codebook quantization methods. This needs benchmarking on a different machine; if it turns out to be true, the kernel may need to be split into two.

CC-Yeh · 2026-05-26T17:35:29Z

previous attempts in #394

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69f4ef5c8d

chatgpt-codex-connector · 2026-05-26T18:23:55Z

+                        QuantizationMethod::Codebook => {
+                            let codebook_value =
+                                (*codebook.expect("Codebook quantized QMV requires a codebook").add(val_q)).to_f32();
+                            scale * codebook_value


Reject non-4-bit codebook QMV combinations

QuantizedMatmulQmvFast now accepts quant_method == Codebook for both BITS=4 and BITS=8, but the CPU path indexes the codebook with the unpacked quantized value (val_q) without any bounds check. With BITS=8, val_q can be 0..255, so callers that provide the expected 16-entry NF4 codebook will trigger out-of-bounds reads in qmv and produce undefined behavior. Please fail fast for Codebook unless BITS == 4 (or validate/require a 256-entry codebook).
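
One way the suggested fail-fast check could look is sketched below. It assumes the codebook length is known at validation time. Only `QuantizationMethod::Codebook` comes from the diff above; the other enum variants, `validate_codebook`, and `dequantize_codebook` are hypothetical names for illustration and are not the PR's actual API.

```rust
/// Quantization methods. Only `Codebook` appears in the PR snippet;
/// the other variants are placeholders for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantizationMethod {
    ScaleBias, // hypothetical
    ZeroPoint, // hypothetical
    Codebook,
}

/// Fail fast unless the codebook can be indexed by every `bits`-wide code.
/// With bits = 4 this requires 16 entries, with bits = 8 it requires 256.
fn validate_codebook(
    method: QuantizationMethod,
    bits: u32,
    codebook_len: Option<usize>,
) -> Result<(), String> {
    if method != QuantizationMethod::Codebook {
        return Ok(()); // non-codebook methods do not index a codebook
    }
    let required = 1usize
        .checked_shl(bits)
        .ok_or_else(|| format!("unsupported bit width {bits}"))?;
    match codebook_len {
        None => Err("Codebook quantized QMV requires a codebook".to_string()),
        Some(len) if len < required => Err(format!(
            "codebook has {len} entries, but {bits}-bit codes need {required}"
        )),
        Some(_) => Ok(()),
    }
}

/// Bounds-checked dequantization of one code: returns `None` instead of
/// reading out of bounds when `val_q` exceeds the codebook length.
fn dequantize_codebook(codebook: &[f32], val_q: usize, scale: f32) -> Option<f32> {
    codebook.get(val_q).map(|v| scale * v)
}
```

Validating once up front keeps the inner loop free of per-element checks. The checked lookup is the slower but safer alternative if the length cannot be validated ahead of time.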


support qmv LUT

25ed9f7

CC-Yeh mentioned this pull request May 26, 2026

[Experiment] QMV LUT Dequant #394

Closed

CC-Yeh added 4 commits May 27, 2026 01:38

improve names

3f7b147

Merge branch 'main' into lut

cfefa22

improve

ef88ddb

improve

69f4ef5

CC-Yeh marked this pull request as ready for review May 26, 2026 18:16

CC-Yeh requested review from LuckyIYI, eugenebokhan and uuuvn as code owners May 26, 2026 18:16

chatgpt-codex-connector Bot reviewed May 26, 2026


CC-Yeh marked this pull request as draft May 26, 2026 18:46

lloyd-max draft

6e50362

CC-Yeh changed the title ~~support qmv LUT~~ Support Lloyd-Max quantization May 28, 2026

CC-Yeh wants to merge 6 commits into main from lut
