UPSTREAM PR #20831: cuda : dynamic MMVQ nwarps for narrow matrices#1327

Open
loci-dev wants to merge 1 commit into `main` from `loci/pr-20831-fix-mmvq-dynamic-nwarps-moe`

Conversation

@loci-dev loci-dev commented Apr 2, 2026

Note

Source pull request: ggml-org/llama.cpp#20831

Fix MMVQ TG regression on MoE models from #19478.

#19478 increased nwarps to 8 on RDNA3/RDNA4 to better utilize memory bandwidth for bs=1 decode. However, nwarps=8 assumes wide weight matrices. MoE expert FFN layers are narrow (512–2048 cols), so most warps have no work but still pay __syncthreads() and shared-memory reduction overhead, causing a net TG regression. This patch dynamically clamps nwarps based on the actual matrix width to avoid this.

R9700 (gfx1201, RDNA4), ROCm 7.2

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
| --- | --- | --- | --- | --- |
| Qwen3.5-122B-A10B Q4_K_M | 4 | 38.72 | 34.74 (-10.3%) | 39.31 |
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.92 | 74.14 (-3.6%) | 84.62 (+10.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 74.97 | 75.32 (+0.5%) | 81.83 (+9.1%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 75.60 | 77.86 (+3.0%) | 82.67 (+9.4%) |
| Qwen3.5-35B-A3B Q8_0 | 2 | 65.39 | 67.37 (+3.0%) | 71.45 (+9.3%) |

Dense (no regression):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
| --- | --- | --- | --- | --- |
| Qwen2.5-72B Q4_K_M | 4 | 10.42 | 10.63 | 10.54 |

Full whitelist sweep — llama-2-7b, 1x R9700, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
| --- | --- | --- | --- |
| Q4_0 | 92.67 | 101.09 | +9.1% |
| Q4_1 | 88.59 | 95.45 | +7.7% |
| Q5_0 | 81.52 | 87.77 | +7.7% |
| Q5_1 | 78.08 | 83.99 | +7.6% |
| Q8_0 | 59.32 | 63.07 | +6.3% |
| Q2_K | 93.39 | 94.43 | +1.1% |
| Q3_K | 91.63 | 91.67 | +0.0% |
| Q4_K | 91.26 | 95.93 | +5.1% |
| Q5_K | 81.65 | 86.79 | +6.3% |
| Q6_K | 72.37 | 75.25 | +4.0% |
| IQ2_XXS | 89.06 | 89.61 | +0.6% |
| IQ2_XS | 86.99 | 87.34 | +0.4% |
| IQ2_S | 84.29 | 84.14 | -0.2% |
| IQ3_XXS | 82.35 | 82.11 | -0.3% |
| IQ3_S | 80.89 | 81.02 | +0.2% |
| IQ4_NL | 93.72 | 99.02 | +5.7% |
| IQ4_XS | 94.49 | 104.83 | +10.9% |

W7900 (gfx1100, RDNA3), ROCm 7.1

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
| --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.33 | 69.68 (-8.7%) | 77.10 (+1.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 70.53 | 69.30 (-1.7%) | 73.05 (+3.6%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 72.31 | 70.12 (-3.0%) | 73.28 (+1.3%) |
| Qwen3.5-35B-A3B Q8_0 | 1 | 69.98 | 66.89 (-4.4%) | 72.67 (+3.8%) |

Full whitelist sweep — llama-2-7b, 1x W7900, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
| --- | --- | --- | --- |
| Q4_0 | 98.49 | 98.72 | +0.2% |
| Q5_0 | 85.76 | 86.47 | +0.8% |
| Q8_0 | 66.79 | 68.02 | +1.8% |
| Q2_K | 92.12 | 92.44 | +0.3% |
| Q3_K_S | 88.57 | 88.86 | +0.3% |
| Q4_K_S | 82.09 | 84.79 | +3.3% |
| Q5_K_S | 77.39 | 77.11 | -0.4% |
| Q6_K | 73.71 | 76.33 | +3.6% |

Note: PR description translated with AI assistance.


loci-review bot commented Apr 2, 2026

No meaningful performance changes were detected across 124163 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from a8215be to 34734bc Compare April 9, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 Compare April 17, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19