
Retune MLX transposed quant matmul on Metal#323

Open
ry2009 wants to merge 1 commit into main from ryan/turboquant-main

Conversation


@ry2009 ry2009 commented Apr 7, 2026

Supersedes #321 on a fresh main-based branch.

What changed

  • keep the 64x64 and wide transposed families on the zero-point path only
  • use an MLX-only WM=4, WN=1 split in qmm_transposed_impl
  • retile QmmTransposed from 32x32 to 160x32

Why

Llama-3.2-3B-Instruct-4bit prefill on Apple was still spending most of its time in transposed quant matmul. The MLX 4-bit path was taking an unfavorable transposed kernel family and tile shape for that workload.

This keeps the change narrow and preserves the current main refactor: zero-point keeps the specialized transposed kernels, while MLX goes through the retuned generic transposed path.

Impact

Matched Apple blocker benchmark: Llama-3.2-3B-Instruct-4bit, 10049 prompt tokens, 32 generated tokens, tiered_q4:256.

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| baseline clean tiered_q4 | 47.33 | 212.34 | 23.33 | 2.68 |
| disable bad transposed family | 47.24 | 212.73 | 23.60 | 2.62 |
| add MLX-only WM=4, WN=1 | 47.16 | 213.11 | 24.28 | 2.58 |
| final 160x32 transposed QMM | 38.99 | 257.73 | 23.95 | 2.48 |

Negative ablations on the same Apple box:

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| 160x32 with old 2x2 split | 40.53 | 247.96 | 23.36 | 2.58 |
| 192x32 | 43.34 | 231.91 | 23.64 | 2.50 |
| 256x32 | 43.71 | 229.94 | 24.01 | 2.68 |

Validation

  • PATH="$HOME/.rustup/toolchains/1.94.0-x86_64-apple-darwin/bin:$PATH" cargo check -p uzu --no-default-features --lib
  • cross-chip benchmark reruns are in progress on this fresh main-based branch


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30e648db73


```diff
 if (use_mlx_quant) {
     if (aligned_n) {
-        qmm_transposed_impl<T, GROUP_SIZE, BITS, true, 32, 32, 32, true>(
+        qmm_transposed_impl<T, GROUP_SIZE, BITS, true, 160, 32, 32, true>(
```


P1: Keep BM at 32 until BlockLoader handles wide row spans

Setting qmm_transposed_impl to BM=160 makes loader_x_t use BlockLoader with n_reads=(BM*BK)/(WM*WN*32)=40 (for BK=32), which is larger than BK. In BlockLoader::load_unsafe/load_safe (quant_matmul.h via mma.h), writes are linearized from (bi,bj) using BK while the destination uses BK_padded stride, so each thread spills into padding and skips real matrix elements. For matrix-matrix calls that hit QuantizedMatmulQmmTransposed (e.g., MLX path and any non-specialized zero-point path), this feeds incorrect Xs tiles to MMA and produces wrong outputs.


Author


Need to either keep BM=32 here or fix BlockLoader, since the perf gains without that fix break the correctness tests.
