Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30e648db73
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```diff
 if (use_mlx_quant) {
   if (aligned_n) {
-    qmm_transposed_impl<T, GROUP_SIZE, BITS, true, 32, 32, 32, true>(
+    qmm_transposed_impl<T, GROUP_SIZE, BITS, true, 160, 32, 32, true>(
```
Keep BM at 32 until BlockLoader handles wide row spans
Setting `qmm_transposed_impl` to `BM=160` makes `loader_x_t` use a `BlockLoader` with `n_reads = (BM*BK)/(WM*WN*32) = 40` (for `BK=32`), which is larger than `BK`. In `BlockLoader::load_unsafe`/`load_safe` (`quant_matmul.h` via `mma.h`), writes are linearized from `(bi, bj)` using `BK`, while the destination uses the `BK_padded` stride, so each thread spills into padding and skips real matrix elements. For matrix-matrix calls that hit `QuantizedMatmulQmmTransposed` (e.g., the MLX path and any non-specialized zero-point path), this feeds incorrect `Xs` tiles to the MMA stage and produces wrong outputs.
We need to either keep `BM=32` here or fix `BlockLoader` first; the perf gain without that fix breaks the correctness tests.
Supersedes #321 on a fresh `main`-based branch.

## What changed
- `64x64` and wide transposed kernel families on the zero-point path only
- `WM=4`, `WN=1` split in `qmm_transposed_impl`
- `QmmTransposed` tile from `32x32` to `160x32`

## Why
`Llama-3.2-3B-Instruct-4bit` prefill on Apple was still spending most of its time in transposed quant matmul. The MLX 4-bit path was taking an unfavorable transposed kernel family and tile shape for that workload.

This keeps the change narrow and preserves the current `main` refactor: zero-point keeps the specialized transposed kernels, while MLX goes through the retuned generic transposed path.

## Impact
Matched Apple blocker benchmark: `Llama-3.2-3B-Instruct-4bit`, `10049` prompt tokens, `32` generated tokens, `tiered_q4:256`.
- `tiered_q4` with the `WM=4`, `WN=1` split and the `160x32` transposed QMM tile

Negative ablations on the same Apple box:
- `160x32` with the old `2x2` split
- `192x32`
- `256x32`

## Validation
```shell
PATH="$HOME/.rustup/toolchains/1.94.0-x86_64-apple-darwin/bin:$PATH" cargo check -p uzu --no-default-features --lib
```