
Retune MLX transposed quant matmul on Metal #321

Closed

ry2009 wants to merge 1 commit into trymirai:main from ry2009:ryan/turboquant-prototype


Conversation


@ry2009 ry2009 commented Apr 6, 2026

What changed

  • keep QmmTransposed64x64 on the zero-point path only
  • use an MLX-only WM=4, WN=1 split in qmm_impl
  • retile QmmTransposed from 32x32 to 160x32
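
The intent of the three changes above can be sketched as a small dispatch function. This is a rough illustration only, under assumed names: `QuantBackend`, `QmmConfig`, and `transposed_qmm_config` are hypothetical and do not match the actual uzu or MLX source.

```rust
// Hypothetical sketch of the retuned transposed quant-matmul dispatch.
// All type and function names here are illustrative, not the real uzu API.
#[derive(Debug, PartialEq)]
pub enum QuantBackend {
    Mlx,
    Other,
}

#[derive(Debug, PartialEq)]
pub struct QmmConfig {
    pub tile_m: u32, // threadgroup tile rows
    pub tile_n: u32, // threadgroup tile cols
    pub wm: u32,     // simdgroup split along M
    pub wn: u32,     // simdgroup split along N
}

pub fn transposed_qmm_config(backend: QuantBackend, has_zero_points: bool) -> QmmConfig {
    if has_zero_points {
        // The 64x64 family stays restricted to the zero-point path.
        return QmmConfig { tile_m: 64, tile_n: 64, wm: 2, wn: 2 };
    }
    match backend {
        // MLX-only retune: 160x32 transposed tile with a WM=4, WN=1 split.
        QuantBackend::Mlx => QmmConfig { tile_m: 160, tile_n: 32, wm: 4, wn: 1 },
        // Other paths keep the previous 32x32 tile and 2x2 split.
        QuantBackend::Other => QmmConfig { tile_m: 32, tile_n: 32, wm: 2, wn: 2 },
    }
}

fn main() {
    let cfg = transposed_qmm_config(QuantBackend::Mlx, false);
    println!("{}x{} WM={} WN={}", cfg.tile_m, cfg.tile_n, cfg.wm, cfg.wn);
}
```

The point of structuring it this way is that the new tile and split never leak outside the MLX non-zero-point path, which is what keeps the change set narrow.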

Why

Llama-3.2-3B-Instruct-4bit prefill on Apple silicon was still spending most of its time in the transposed quant matmul. For this workload, the MLX 4-bit path was selecting a poor kernel family and a weak transposed tile shape.

This patch keeps the change set narrow and only moves the MLX path onto the kernel shapes that benchmarked well on Apple.

Impact

Matched Apple blocker benchmark: Llama-3.2-3B-Instruct-4bit, 10049 prompt tokens, 32 generated tokens, tiered_q4:256.

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| baseline clean tiered_q4 | 47.33 | 212.34 | 23.33 | 2.68 |
| disable 64x64 family | 47.24 | 212.73 | 23.60 | 2.62 |
| add MLX-only WM=4, WN=1 | 47.16 | 213.11 | 24.28 | 2.58 |
| final 160x32 transposed QMM | 38.99 | 257.73 | 23.95 | 2.48 |

Negative ablations on the same Apple box:

| Variant | TTFT (s) | Prompt tok/s | Gen tok/s | Memory (GB) |
| --- | --- | --- | --- | --- |
| 160x32 with old 2x2 split | 40.53 | 247.96 | 23.36 | 2.58 |
| 192x32 | 43.34 | 231.91 | 23.64 | 2.50 |
| 256x32 | 43.71 | 229.94 | 24.01 | 2.68 |

So the main win comes from the transposed 160x32 tile, with the MLX-only 4x1 split adding a smaller but real gain.
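
As a back-of-envelope check of the numbers above (tile geometry from the parameters, percentages from the measured rows):

```rust
fn main() {
    // Tile geometry: a 160x32 threadgroup tile split WM=4, WN=1 gives each
    // of the 4 simdgroups a 160/4 x 32/1 = 40x32 sub-tile.
    let (tile_m, tile_n, wm, wn) = (160u32, 32u32, 4u32, 1u32);
    println!("sub-tile per simdgroup: {}x{}", tile_m / wm, tile_n / wn);

    // Measured deltas from the benchmark table above.
    let prompt_gain = 257.73_f64 / 212.34 - 1.0; // ~21.4% more prompt tok/s
    let ttft_cut = 1.0 - 38.99_f64 / 47.33;      // ~17.6% lower TTFT
    println!(
        "prompt tok/s +{:.1}%, TTFT -{:.1}%",
        prompt_gain * 100.0,
        ttft_cut * 100.0
    );
}
```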

Validation

  • PATH="$HOME/.rustup/toolchains/1.93.0-x86_64-apple-darwin/bin:$PATH" cargo check -p uzu --no-default-features --lib
  • Apple: cargo build -p cli --release
  • Apple benchmark result files already present on the remote box:
    • llama3b_4bit_tiered256_x12_long32_clean.results.json
    • llama3b_4bit_wm4_r2.results.json
    • llama3b_mlx_qmm_transposed_160x32_r2.json
    • llama3b_mlx_qmm_transposed_160x32_2x2.json
    • llama3b_mlx_qmm_transposed_192x32.json
    • llama3b_mlx_qmm_transposed_256x32.json

@ry2009 ry2009 changed the title [codex] Retune MLX transposed quant matmul on Metal Retune MLX transposed quant matmul on Metal Apr 6, 2026

ry2009 commented Apr 7, 2026

Superseded by #323, which is the same matmul change rebased onto current main for clean benchmarking.


ry2009 commented Apr 7, 2026

Closing in favor of #323.

@ry2009 ry2009 closed this Apr 7, 2026
