Support Lloyd-Max quantization #448

Draft
CC-Yeh wants to merge 6 commits into main from lut

CC-Yeh commented May 26, 2026

| M | K | N | LUT NF4 | ZP | ScaleBias | LUT vs ZP | LUT vs ScaleBias |
|---|---|---|---|---|---|---|---|
| 1 | 2048 | 2048 | 13.21 +/- 0.17 | 13.21 +/- 0.21 | 16.25 +/- 0.30 | +0.0% | -18.7% |
| 2 | 2048 | 2048 | 20.66 +/- 0.18 | 21.25 +/- 0.44 | 19.77 +/- 0.42 | -2.8% | +4.5% |
| 4 | 2048 | 2048 | 39.95 +/- 0.66 | 40.00 +/- 0.57 | 36.38 +/- 0.42 | -0.1% | +9.8% |
| 1 | 4096 | 4096 | 82.27 +/- 2.41 | 81.00 +/- 1.48 | 88.96 +/- 3.30 | +1.6% | -7.5% |
| 2 | 4096 | 4096 | 83.30 +/- 0.69 | 86.27 +/- 2.69 | 87.93 +/- 2.92 | -3.4% | -5.3% |
| 4 | 4096 | 4096 | 159.31 +/- 0.48 | 159.28 +/- 0.95 | 147.03 +/- 0.29 | +0.0% | +8.3% |
| 1 | 4096 | 14336 | 311.23 +/- 3.57 | 317.59 +/- 7.47 | 333.77 +/- 10.88 | -2.0% | -6.8% |
| 2 | 4096 | 14336 | 313.65 +/- 10.87 | 310.81 +/- 2.51 | 334.17 +/- 15.47 | +0.9% | -6.1% |
| 4 | 4096 | 14336 | 546.43 +/- 6.16 | 543.17 +/- 1.03 | 502.50 +/- 1.13 | +0.6% | +8.7% |
| 1 | 14336 | 4096 | 307.90 +/- 3.87 | 315.62 +/- 5.92 | 328.01 +/- 2.63 | -2.4% | -6.1% |
| 2 | 14336 | 4096 | 309.49 +/- 3.95 | 314.63 +/- 3.91 | 327.04 +/- 4.39 | -1.6% | -5.4% |
| 4 | 14336 | 4096 | 584.91 +/- 36.89 | 545.00 +/- 1.63 | 503.72 +/- 1.03 | +7.3% | +16.1% |
| 1 | 14336 | 14336 | 1102.04 +/- 25.39 | 1089.21 +/- 15.51 | 1143.24 +/- 12.77 | +1.2% | -3.6% |
| 2 | 14336 | 14336 | 1106.88 +/- 58.83 | 1068.91 +/- 14.79 | 1154.25 +/- 56.39 | +3.6% | -4.1% |
| 4 | 14336 | 14336 | 2043.22 +/- 33.13 | 1872.84 +/- 7.52 | 1742.68 +/- 20.34 | +9.1% | +17.2% |

Summary

  • LUT is roughly tied with ZP overall: geomean +0.14% slower, wins 8/15.
  • LUT beats ZP at M=1, is roughly neutral at M=2, and loses at M=4.
  • LUT beats ScaleBias at M=1/M=2, but loses badly at M=4; overall geomean is +0.71% slower.
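
For readers who want to reproduce the percentage columns, here is a minimal sketch of how a relative difference and a geometric-mean slowdown could be computed from the mean timings. It assumes the percentages are `LUT / baseline - 1` (so negative means LUT is faster), which matches the rows in the table; `relative_percent` and `geomean_ratio` are hypothetical helper names, and only the M=4 rows for LUT vs ZP are used as input.

```rust
/// Relative difference in percent; negative means LUT is faster than the baseline.
fn relative_percent(lut: f64, base: f64) -> f64 {
    (lut / base - 1.0) * 100.0
}

/// Geometric mean of the per-shape ratios LUT / baseline.
/// A result above 1.0 means LUT is slower on average. Expects a non-empty slice.
fn geomean_ratio(pairs: &[(f64, f64)]) -> f64 {
    let log_sum: f64 = pairs.iter().map(|&(lut, base)| (lut / base).ln()).sum();
    (log_sum / pairs.len() as f64).exp()
}

fn main() {
    // (LUT NF4 mean, ZP mean) for the five M=4 rows of the table.
    let m4_lut_vs_zp = [
        (39.95, 40.00),
        (159.31, 159.28),
        (546.43, 543.17),
        (584.91, 545.00),
        (2043.22, 1872.84),
    ];
    let g = geomean_ratio(&m4_lut_vs_zp);
    println!("M=4 geomean, LUT vs ZP: {:+.2}%", (g - 1.0) * 100.0);
    println!("M=2, 2048x2048, LUT vs ZP: {:+.1}%", relative_percent(20.66, 21.25));
}
```

Averaging in log space keeps a 10% slowdown and a 10% speedup roughly symmetric, which is why a geometric mean is the usual choice for summarizing ratios across shapes.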

CC-Yeh commented May 26, 2026

What improved

  • Normal tg16 used one shared 16-entry threadgroup LUT across all simdgroups.
  • That was bad for batched rows: about +33% to +48% slower at M=4.
  • tg16-duplicate did not fix it: still about +38% to +53% slower at M=4.
  • tg16-vec4 was worse: about +43% to +57% slower at M=4.
  • tg16-ilp and no-barrier variants did not explain the win.
  • The winning change was simdgroup separation: each simdgroup gets its own 16-entry LUT slice and
    synchronizes with simdgroup_barrier.
  • That brought the old bad tg16 path down from roughly +33%..+48% slower to roughly +5%..+15%
    slower at M=4 in the old standalone LUT experiments, and near parity with ZP in the current
    apples-to-apples benchmark.
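
The simdgroup-separation layout described above can be modeled on the CPU to show the indexing. This is only a sketch, not the PR's Metal kernel: `LUT_ENTRIES`, `slice_base`, `build_threadgroup_lut`, and `lookup` are hypothetical names, a flat `Vec<f32>` stands in for threadgroup memory, and the codebook values are placeholders rather than the NF4 table.

```rust
const LUT_ENTRIES: usize = 16; // a 4-bit code selects one of 16 codebook entries

/// Start of the slice owned by one simdgroup (hypothetical layout).
fn slice_base(simdgroup_index: usize) -> usize {
    simdgroup_index * LUT_ENTRIES
}

/// Stand-in for threadgroup memory: one private codebook copy per simdgroup.
fn build_threadgroup_lut(codebook: &[f32; LUT_ENTRIES], num_simdgroups: usize) -> Vec<f32> {
    let mut tg = vec![0.0; num_simdgroups * LUT_ENTRIES];
    for sg in 0..num_simdgroups {
        let base = slice_base(sg);
        tg[base..base + LUT_ENTRIES].copy_from_slice(codebook);
    }
    tg
}

/// Lookup performed by a thread of `simdgroup_index` for a 4-bit code.
/// Reads never leave the slice owned by that simdgroup.
fn lookup(tg: &[f32], simdgroup_index: usize, code: u8) -> f32 {
    tg[slice_base(simdgroup_index) + (code & 0x0F) as usize]
}

fn main() {
    // Placeholder codebook values for illustration only.
    let mut codebook = [0.0f32; LUT_ENTRIES];
    for (i, v) in codebook.iter_mut().enumerate() {
        *v = i as f32 - 8.0;
    }
    let tg = build_threadgroup_lut(&codebook, 4);
    println!("simdgroup 3, code 5 -> {}", lookup(&tg, 3, 5));
}
```

In a real kernel each simdgroup would presumably fill its own slice and then synchronize with `simdgroup_barrier`, so no simdgroup has to wait on writes made by another; that part is not modeled here.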

Bad variants

  • FLUTE-ish duplicate/select/shuffle variants were catastrophic.
  • Duplicate variants reached hundreds of percent slower.
  • select was often over +1100% slower in M=2/M=4 cells.

CC-Yeh commented May 26, 2026

Passing the threadgroup codebook as a kernel argument could hurt performance for non-codebook quantization methods. This needs to be benchmarked on a different machine; if it turns out to be true, the kernel may need to be split into two.

CC-Yeh commented May 26, 2026

Previous attempts are in #394.

CC-Yeh marked this pull request as ready for review May 26, 2026 18:16
chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69f4ef5c8d

Comment on lines +89 to +92
QuantizationMethod::Codebook => {
    let codebook_value =
        (*codebook.expect("Codebook quantized QMV requires a codebook").add(val_q)).to_f32();
    scale * codebook_value

P1: Reject non-4-bit codebook QMV combinations

QuantizedMatmulQmvFast now accepts quant_method == Codebook for both BITS=4 and BITS=8, but the CPU path indexes the codebook with the unpacked quantized value (val_q) without any bounds check. With BITS=8, val_q can be 0..255, so callers that provide the expected 16-entry NF4 codebook will trigger out-of-bounds reads in qmv and produce undefined behavior. Please fail fast for Codebook unless BITS == 4 (or validate/require a 256-entry codebook).
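
One way the suggested fail-fast check could look is sketched below. This is not the repository's code: `Method`, `validate_codebook`, and `dequantize_checked` are hypothetical names, and the sketch takes the "validate/require a 256-entry codebook" option by requiring at least `1 << bits` entries, so a 16-entry codebook passes for 4-bit and is rejected for 8-bit.

```rust
/// Hypothetical stand-in for the quantization method enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Method {
    Codebook,
    Other,
}

/// Reject codebook configurations whose table cannot cover every possible code.
fn validate_codebook(method: Method, bits: u32, codebook_len: Option<usize>) -> Result<(), String> {
    if method != Method::Codebook {
        return Ok(()); // non-codebook methods do not index a table
    }
    let required = 1usize << bits; // 16 entries for 4-bit, 256 for 8-bit
    match codebook_len {
        None => Err("Codebook quantized QMV requires a codebook".to_string()),
        Some(len) if len < required => Err(format!(
            "{bits}-bit codebook QMV needs {required} entries, got {len}"
        )),
        Some(_) => Ok(()),
    }
}

/// Bounds-checked dequantization: returns None instead of reading out of range.
fn dequantize_checked(scale: f32, val_q: usize, codebook: &[f32]) -> Option<f32> {
    codebook.get(val_q).map(|entry| scale * entry)
}

fn main() {
    println!("{:?}", validate_codebook(Method::Codebook, 4, Some(16)));
    println!("{:?}", validate_codebook(Method::Codebook, 8, Some(16)));
    println!("{:?}", dequantize_checked(2.0, 3, &[0.0, 1.0, 2.0, 3.0]));
}
```

Validating once at setup keeps the per-element path free of extra branches, while the checked lookup shows the safe alternative to raw pointer arithmetic when a slice is available.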

CC-Yeh marked this pull request as draft May 26, 2026 18:46
CC-Yeh changed the title from "support qmv LUT" to "Support Lloyd-Max quantization" May 28, 2026