UPSTREAM PR #21168: ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels#1319

Open
loci-dev wants to merge 1 commit into main from loci/pr-21168-master
Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21168

Overview

This PR is an LDS load optimization in the MMQ kernels for q4_0 and q4_1.
The activations loading loop has been restructured so that the HIP compiler replaces 8 scalar ds_read_b32 operations with 2 vectorized ds_read_b128 operations. This yields about +10% in prompt processing on the Vega GPU (MI50) and a small speedup on the RX 6800 XT.
The modification is guarded by the GGML_USE_HIP flag. Since the code is duplicated in the vec_dot_q4_0_q8_1_dp4a and vec_dot_q4_1_q8_1_dp4a kernels, it could be refactored into a single function that selects the loading method.

Additional information

| GPU | Model | Test | Before (t/s) | After (t/s) | Δ (t/s) | % Δ |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| MI50 | qwen3 4B Q4_0 | pp512 | 1963.38 | 2210.11 | +246.73 | +12.57% |
| | | pp2048 | 2057.84 | 2297.41 | +239.57 | +11.64% |
| | | pp8192 | 1543.88 | 1675.24 | +131.36 | +8.51% |
| | | tg128 | 125.08 | 124.95 | −0.13 | −0.10% |
| | | pp512 @ d8192 | 1058.48 | 1126.23 | +67.75 | +6.40% |
| | | pp2048 @ d8192 | 1074.21 | 1136.18 | +61.97 | +5.77% |
| | | pp8192 @ d8192 | 901.66 | 952.81 | +51.15 | +5.67% |
| | | tg128 @ d8192 | 100.29 | 101.24 | +0.95 | +0.95% |
| | qwen35 27B Q4_1 | pp512 | 318.62 | 337.35 | +18.73 | +5.88% |
| | | pp2048 | 333.22 | 353.92 | +20.70 | +6.21% |
| | | pp8192 | 296.95 | 324.21 | +27.26 | +9.18% |
| | | tg128 | 25.96 | 26.26 | +0.30 | +1.16% |
| | | pp512 @ d8192 | 273.28 | 289.73 | +16.45 | +6.02% |
| | | pp2048 @ d8192 | 282.95 | 298.37 | +15.42 | +5.45% |
| | | pp8192 @ d8192 | 271.39 | 284.89 | +13.50 | +4.98% |
| | | tg128 @ d8192 | 25.04 | 25.05 | +0.01 | +0.04% |
| RX 6800 XT | qwen3 4B Q4_0 | pp512 | 3279.40 | 3350.35 | +70.95 | +2.17% |
| | | pp2048 | 3212.86 | 3290.28 | +77.42 | +2.41% |
| | | pp8192 | 2352.93 | 2396.06 | +43.13 | +1.83% |
| | | tg128 | 114.15 | 113.86 | −0.29 | −0.25% |
| | | pp512 @ d8192 | 1670.92 | 1689.45 | +18.53 | +1.11% |
| | | pp2048 @ d8192 | 1600.54 | 1621.65 | +21.11 | +1.32% |
| | | pp8192 @ d8192 | 1351.55 | 1365.91 | +14.36 | +1.06% |
| | | tg128 @ d8192 | 89.22 | 89.22 | 0.00 | 0.00% |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. I have used qwen/claude/kimi in the past to build the llama.cpp-gfx906 fork. Since people are using it, I want to bring the best optimizations upstream. This code was rewritten by me, starting from the fork's ideas.

     The current for loop generates ds_read_b32 instructions with the HIP compiler; the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX 6800 XT; it is faster on both.
@loci-review

loci-review bot commented Mar 30, 2026

No meaningful performance changes were detected across 123908 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-bench, build.bin.libmtmd.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from fd3ce9d to 1770118 Compare April 6, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 385b1fc to 06d9e10 Compare April 13, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19
