UPSTREAM PR #21168: ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels#1319

Open
loci-dev wants to merge 1 commit into main from loci/pr-21168-master
Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21168

Overview

This PR is an LDS load optimization in the MMQ kernels for q4_0 and q4_1.
The activations loading loop has been restructured so that the HIP compiler replaces 8 scalar ds_read_b32 operations with 2 vectorized ds_read_b128 operations. This yields about +10% in prompt processing on the Vega GPU (MI50) and a small speedup on the RX 6800 XT.
The modification is guarded by the GGML_USE_HIP flag. Since the code is duplicated in the vec_dot_q4_0_q8_1_dp4a and vec_dot_q4_1_q8_1_dp4a kernels, it could be refactored into a single function that selects the loading method.

Additional information

| GPU | Model | Test | Before (t/s) | After (t/s) | Δ (t/s) | % Δ |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| MI50 | qwen3 4B Q4_0 | pp512 | 1963.38 | 2210.11 | +246.73 | +12.57% |
| | | pp2048 | 2057.84 | 2297.41 | +239.57 | +11.64% |
| | | pp8192 | 1543.88 | 1675.24 | +131.36 | +8.51% |
| | | tg128 | 125.08 | 124.95 | −0.13 | −0.10% |
| | | pp512 @ d8192 | 1058.48 | 1126.23 | +67.75 | +6.40% |
| | | pp2048 @ d8192 | 1074.21 | 1136.18 | +61.97 | +5.77% |
| | | pp8192 @ d8192 | 901.66 | 952.81 | +51.15 | +5.67% |
| | | tg128 @ d8192 | 100.29 | 101.24 | +0.95 | +0.95% |
| | qwen35 27B Q4_1 | pp512 | 318.62 | 337.35 | +18.73 | +5.88% |
| | | pp2048 | 333.22 | 353.92 | +20.70 | +6.21% |
| | | pp8192 | 296.95 | 324.21 | +27.26 | +9.18% |
| | | tg128 | 25.96 | 26.26 | +0.30 | +1.16% |
| | | pp512 @ d8192 | 273.28 | 289.73 | +16.45 | +6.02% |
| | | pp2048 @ d8192 | 282.95 | 298.37 | +15.42 | +5.45% |
| | | pp8192 @ d8192 | 271.39 | 284.89 | +13.50 | +4.98% |
| | | tg128 @ d8192 | 25.04 | 25.05 | +0.01 | +0.04% |
| RX 6800 XT | qwen3 4B Q4_0 | pp512 | 3279.40 | 3350.35 | +70.95 | +2.17% |
| | | pp2048 | 3212.86 | 3290.28 | +77.42 | +2.41% |
| | | pp8192 | 2352.93 | 2396.06 | +43.13 | +1.83% |
| | | tg128 | 114.15 | 113.86 | −0.29 | −0.25% |
| | | pp512 @ d8192 | 1670.92 | 1689.45 | +18.53 | +1.11% |
| | | pp2048 @ d8192 | 1600.54 | 1621.65 | +21.11 | +1.32% |
| | | pp8192 @ d8192 | 1351.55 | 1365.91 | +14.36 | +1.06% |
| | | tg128 @ d8192 | 89.22 | 89.22 | 0.00 | 0.00% |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. I have used qwen/claude/kimi in the past to build the llama.cpp-gfx906 fork. Since people are using it, I want to bring the best optimizations upstream. This code was rewritten by me, starting from the fork's ideas.

     The current for loop generates ds_read_b32 instructions with the HIP compiler; the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX 6800 XT; it is faster on both.
@loci-review

loci-review bot commented Mar 30, 2026

No meaningful performance changes were detected across 123908 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-bench, build.bin.libmtmd.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from fd3ce9d to 1770118 Compare April 6, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 385b1fc to 06d9e10 Compare April 13, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19
