
Retune MLX transposed quant matmul on Metal#323

Open
ry2009 wants to merge 1 commit into main from ryan/turboquant-main

Conversation


@ry2009 ry2009 commented Apr 7, 2026

Supersedes #321 on a fresh main-based branch.

What changed

  • keep the 64x64 and wide transposed families on the zero-point path only
  • use an MLX-only WM=4, WN=1 split in qmm_transposed_impl
  • retile QmmTransposed from 32x32 to 160x32

Why

Llama-3.2-3B-Instruct-4bit prefill on Apple was still spending most of its time in transposed quant matmul. The MLX 4-bit path was taking an unfavorable transposed kernel family and tile shape for that workload.

This keeps the change narrow and preserves the current main refactor: zero-point keeps the specialized transposed kernels, while MLX goes through the retuned generic transposed path.

Impact

Matched Apple blocker benchmark: Llama-3.2-3B-Instruct-4bit, 10049 prompt tokens, 32 generated tokens, tiered_q4:256.

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| baseline clean tiered_q4 | 47.33 | 212.34 | 23.33 | 2.68 |
| disable bad transposed family | 47.24 | 212.73 | 23.60 | 2.62 |
| add MLX-only WM=4, WN=1 | 47.16 | 213.11 | 24.28 | 2.58 |
| final 160x32 transposed QMM | 38.99 | 257.73 | 23.95 | 2.48 |

Negative ablations on the same Apple box:

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| 160x32 with old 2x2 split | 40.53 | 247.96 | 23.36 | 2.58 |
| 192x32 | 43.34 | 231.91 | 23.64 | 2.50 |
| 256x32 | 43.71 | 229.94 | 24.01 | 2.68 |

Validation

  • PATH="$HOME/.rustup/toolchains/1.94.0-x86_64-apple-darwin/bin:$PATH" cargo check -p uzu --no-default-features --lib
  • cross-chip benchmark reruns are in progress on this fresh main-based branch


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30e648db73


```diff
 if (use_mlx_quant) {
     if (aligned_n) {
-        qmm_transposed_impl<T, GROUP_SIZE, BITS, true, 32, 32, 32, true>(
+        qmm_transposed_impl<T, GROUP_SIZE, BITS, true, 160, 32, 32, true>(
```


P1: Keep BM at 32 until BlockLoader handles wide row spans

Setting qmm_transposed_impl to BM=160 makes loader_x_t use BlockLoader with n_reads=(BM*BK)/(WM*WN*32)=40 (for BK=32), which is larger than BK. In BlockLoader::load_unsafe/load_safe (quant_matmul.h via mma.h), writes are linearized from (bi,bj) using BK while the destination uses BK_padded stride, so each thread spills into padding and skips real matrix elements. For matrix-matrix calls that hit QuantizedMatmulQmmTransposed (e.g., MLX path and any non-specialized zero-point path), this feeds incorrect Xs tiles to MMA and produces wrong outputs.


Author


Need to either keep BM=32 here or fix BlockLoader, since the perf gains without that fix break the correctness tests.
