Retune MLX transposed quant matmul on Metal by ry2009 · Pull Request #321 · trymirai/uzu

ry2009 · 2026-04-06T19:35:41Z

What changed

- Keep `QmmTransposed64x64` on the zero-point path only.
- Use an MLX-only `WM=4`, `WN=1` split in `qmm_impl`.
- Retile `QmmTransposed` from `32x32` to `160x32`.
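
As a rough illustration of what the retile and the new split change, the sketch below computes the per-simdgroup sub-tile for the old and new configurations. It assumes that `160x32` means 160 rows by 32 columns of the output tile and that `WM`/`WN` count simdgroups along those two axes, following MLX naming conventions; the PR itself does not spell this out. `split_tile` is a hypothetical helper, not a function in uzu.

```python
SIMD_WIDTH = 32  # threads per simdgroup on Apple GPUs


def split_tile(tile_m, tile_n, wm, wn):
    """Split a tile_m x tile_n output tile across wm x wn simdgroups.

    Assumption: WM/WN count simdgroups along the M and N axes of the tile.
    """
    if tile_m % wm or tile_n % wn:
        raise ValueError("tile must divide evenly across simdgroups")
    return {
        "sub_tile": (tile_m // wm, tile_n // wn),  # work owned by one simdgroup
        "simdgroups": wm * wn,
        "threads": wm * wn * SIMD_WIDTH,
    }


old = split_tile(32, 32, wm=2, wn=2)   # previous 32x32 tile, 2x2 split
new = split_tile(160, 32, wm=4, wn=1)  # retuned 160x32 tile, 4x1 split

print("old:", old)
print("new:", new)
```

Under these assumptions both configurations keep four simdgroups per threadgroup; what changes is that each simdgroup owns a taller, full-width slice of the tile.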

Why

Llama-3.2-3B-Instruct-4bit prefill on Apple hardware was still spending most of its time in the transposed quantized matmul. For this workload, the MLX 4-bit path was selecting a poorly suited kernel family and a weak transposed tile shape.

This patch keeps the change set narrow and only moves the MLX path onto the kernel shapes that benchmarked well on Apple.

Impact

Matched Apple blocker benchmark: Llama-3.2-3B-Instruct-4bit, 10049 prompt tokens, 32 generated tokens, tiered_q4:256.

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| baseline clean `tiered_q4` | 47.33 | 212.34 | 23.33 | 2.68 |
| disable `64x64` family | 47.24 | 212.73 | 23.60 | 2.62 |
| add MLX-only `WM=4`, `WN=1` | 47.16 | 213.11 | 24.28 | 2.58 |
| final `160x32` transposed QMM | 38.99 | 257.73 | 23.95 | 2.48 |
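
One way to sanity-check these rows: the prompt throughput column appears to be close to the prompt token count divided by TTFT. That relationship is an inference from the numbers, not something the benchmark harness is confirmed to compute. A minimal check:

```python
PROMPT_TOKENS = 10049  # from the benchmark description

# (TTFT in seconds, reported prompt tok/s) per variant, copied from the table
rows = {
    "baseline clean tiered_q4": (47.33, 212.34),
    "disable 64x64 family": (47.24, 212.73),
    "add MLX-only WM=4, WN=1": (47.16, 213.11),
    "final 160x32 transposed QMM": (38.99, 257.73),
}

rel_errors = {}
for name, (ttft_s, reported) in rows.items():
    derived = PROMPT_TOKENS / ttft_s  # assumed definition of prompt tok/s
    rel_errors[name] = abs(derived - reported) / reported
    print(f"{name}: derived {derived:.2f} tok/s, reported {reported:.2f} tok/s")

max_rel_error = max(rel_errors.values())
```

The derived and reported values agree to well under 0.1%, so TTFT and prompt tok/s here are effectively two views of the same prefill measurement.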

Negative ablations on the same Apple box:

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| `160x32` with old `2x2` split | 40.53 | 247.96 | 23.36 | 2.58 |
| `192x32` | 43.34 | 231.91 | 23.64 | 2.50 |
| `256x32` | 43.71 | 229.94 | 24.01 | 2.68 |

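So the main win comes from the transposed 160x32 tile: TTFT drops from 47.16 s to 38.99 s (about 17%) and prompt throughput rises from 213.11 to 257.73 tok/s (about 21%). The MLX-only 4x1 split adds a smaller but real gain: with the 160x32 tile, it lowers TTFT from 40.53 s (old 2x2 split) to 38.99 s (about 4%).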
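
The attribution above can be reproduced from the TTFT columns of the two tables. The sketch below is only arithmetic on the reported numbers; the dictionary keys are shorthand labels, not identifiers from the codebase.

```python
# TTFT in seconds, copied from the benchmark and ablation tables
ttft = {
    "baseline": 47.33,
    "disable 64x64": 47.24,
    "WM=4, WN=1": 47.16,
    "160x32 + 4x1": 38.99,  # final configuration
    "160x32 + 2x2": 40.53,  # ablation: new tile, old split
    "192x32": 43.34,
    "256x32": 43.71,
}


def reduction_pct(before, after):
    """Percent reduction in TTFT going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Effect of the tile change, holding the 4x1 split fixed
tile_gain = reduction_pct(ttft["WM=4, WN=1"], ttft["160x32 + 4x1"])
# Effect of the 4x1 split, holding the 160x32 tile fixed
split_gain = reduction_pct(ttft["160x32 + 2x2"], ttft["160x32 + 4x1"])
# End-to-end change against the clean baseline
total_gain = reduction_pct(ttft["baseline"], ttft["160x32 + 4x1"])

print(f"tile: {tile_gain:.1f}%  split: {split_gain:.1f}%  total: {total_gain:.1f}%")
```

The larger `192x32` and `256x32` tiles both land above 43 s, so the improvement does not keep growing with tile height on this workload.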

Validation

- `PATH="$HOME/.rustup/toolchains/1.93.0-x86_64-apple-darwin/bin:$PATH" cargo check -p uzu --no-default-features --lib`
- Apple: `cargo build -p cli --release`
- Apple benchmark result files already present on the remote box:
  - llama3b_4bit_tiered256_x12_long32_clean.results.json
  - llama3b_4bit_wm4_r2.results.json
  - llama3b_mlx_qmm_transposed_160x32_r2.json
  - llama3b_mlx_qmm_transposed_160x32_2x2.json
  - llama3b_mlx_qmm_transposed_192x32.json
  - llama3b_mlx_qmm_transposed_256x32.json

ry2009 · 2026-04-07T07:54:13Z

Superseded by #323, which is the same matmul change rebased onto current main for clean benchmarking.

ry2009 · 2026-04-07T07:54:14Z

Closing in favor of #323.

Retune MLX transposed quant matmul on Metal

4f1f629

ry2009 changed the title ~~[codex] Retune MLX transposed quant matmul on Metal~~ Retune MLX transposed quant matmul on Metal Apr 6, 2026

eugenebokhan approved these changes Apr 6, 2026

ry2009 mentioned this pull request Apr 7, 2026

Retune MLX transposed quant matmul on Metal #323

Open

ry2009 closed this Apr 7, 2026

ry2009 wants to merge 1 commit into trymirai:main from ry2009:ryan/turboquant-prototype
