UPSTREAM PR #20831: cuda : dynamic MMVQ nwarps for narrow matrices#1327

Open
loci-dev wants to merge 1 commit into `main` from `loci/pr-20831-fix-mmvq-dynamic-nwarps-moe`

Conversation

@loci-dev loci-dev commented Apr 2, 2026

Note

Source pull request: ggml-org/llama.cpp#20831

Fix MMVQ TG regression on MoE models from #19478.

#19478 increased nwarps to 8 on RDNA3/RDNA4 to better utilize memory bandwidth for bs=1 decode. However, nwarps=8 assumes wide weight matrices. MoE expert FFN layers are narrow (512–2048 cols), so most warps have no work but still pay __syncthreads() and shared-memory reduction overhead, causing a net TG regression. This patch dynamically clamps nwarps based on the actual matrix width to avoid this.

R9700 (gfx1201, RDNA4), ROCm 7.2

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
| --- | --- | --- | --- | --- |
| Qwen3.5-122B-A10B Q4_K_M | 4 | 38.72 | 34.74 (-10.3%) | 39.31 |
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.92 | 74.14 (-3.6%) | 84.62 (+10.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 74.97 | 75.32 (+0.5%) | 81.83 (+9.1%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 75.60 | 77.86 (+3.0%) | 82.67 (+9.4%) |
| Qwen3.5-35B-A3B Q8_0 | 2 | 65.39 | 67.37 (+3.0%) | 71.45 (+9.3%) |

Dense (no regression):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
| --- | --- | --- | --- | --- |
| Qwen2.5-72B Q4_K_M | 4 | 10.42 | 10.63 | 10.54 |

Full whitelist sweep — llama-2-7b, 1x R9700, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
| --- | --- | --- | --- |
| Q4_0 | 92.67 | 101.09 | +9.1% |
| Q4_1 | 88.59 | 95.45 | +7.7% |
| Q5_0 | 81.52 | 87.77 | +7.7% |
| Q5_1 | 78.08 | 83.99 | +7.6% |
| Q8_0 | 59.32 | 63.07 | +6.3% |
| Q2_K | 93.39 | 94.43 | +1.1% |
| Q3_K | 91.63 | 91.67 | +0.0% |
| Q4_K | 91.26 | 95.93 | +5.1% |
| Q5_K | 81.65 | 86.79 | +6.3% |
| Q6_K | 72.37 | 75.25 | +4.0% |
| IQ2_XXS | 89.06 | 89.61 | +0.6% |
| IQ2_XS | 86.99 | 87.34 | +0.4% |
| IQ2_S | 84.29 | 84.14 | -0.2% |
| IQ3_XXS | 82.35 | 82.11 | -0.3% |
| IQ3_S | 80.89 | 81.02 | +0.2% |
| IQ4_NL | 93.72 | 99.02 | +5.7% |
| IQ4_XS | 94.49 | 104.83 | +10.9% |

W7900 (gfx1100, RDNA3), ROCm 7.1

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
| --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.33 | 69.68 (-8.7%) | 77.10 (+1.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 70.53 | 69.30 (-1.7%) | 73.05 (+3.6%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 72.31 | 70.12 (-3.0%) | 73.28 (+1.3%) |
| Qwen3.5-35B-A3B Q8_0 | 1 | 69.98 | 66.89 (-4.4%) | 72.67 (+3.8%) |

Full whitelist sweep — llama-2-7b, 1x W7900, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
| --- | --- | --- | --- |
| Q4_0 | 98.49 | 98.72 | +0.2% |
| Q5_0 | 85.76 | 86.47 | +0.8% |
| Q8_0 | 66.79 | 68.02 | +1.8% |
| Q2_K | 92.12 | 92.44 | +0.3% |
| Q3_K_S | 88.57 | 88.86 | +0.3% |
| Q4_K_S | 82.09 | 84.79 | +3.3% |
| Q5_K_S | 77.39 | 77.11 | -0.4% |
| Q6_K | 73.71 | 76.33 | +3.6% |

Note: PR description translated with AI assistance.


loci-review bot commented Apr 2, 2026

No meaningful performance changes were detected across 124163 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from a8215be to 34734bc Compare April 9, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 Compare April 17, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19