vulkan: faster q6_k matmul #17813
Conversation
In most cases I tested, this change is positive, but there's a regression with coopmat1 on Nvidia:
Master:
PR:
Maybe @jeffbolznv knows more about the cause of this drop. I don't see it when running without coopmat and integer dot.
Does this also happen with the four-at-a-time version b4bae3f?
I'm seeing a failure in
Is this only happening with coopmat? This test is passing fine on Intel, AMD, and llvmpipe with no coopmat.
Yes, I'm only seeing it with the coopmat path. Maybe it's something to do with the tile sizes it uses?
I don't think this is affected by tile sizes unless they're not divisible by 8, but I'll take another look next week. IIRC our smallest tile size is 32x32. Strangely enough, the CI coopmat run on the T4 is fully passing, and that includes the perplexity check whose number of
I did a full rebuild and it's passing now. There's an issue with the build system on Windows where it sometimes uses a stale version of vulkan-shaders-gen (it mixes up debug vs. release); I think that's what bit me. Sorry for the noise; let me see if I can repro the perf difference now.
I'm able to reproduce about a 10% slowdown for DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf with this change on an RTX 4070. I only had time for a quick look, but it seems like the loads are more spread out and each load (across a warp) touches more cachelines, though I'm not totally sure of that. Maybe you can rearrange which threads load which elements?
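A hedged illustration of that cacheline concern (not code from this PR): assuming 128-byte cachelines and a 32-thread warp, the number of lines a single warp-wide load touches depends entirely on how thread index maps to byte offset. The two mappings below are made up for the example, not the shader's actual ones.

```c
#include <stdio.h>

// Count the distinct 128-byte cachelines touched when each of the 32
// threads in a warp loads one 16-bit word at byte offset off(t).
static int cachelines_touched(int (*off)(int)) {
    unsigned long long seen = 0; // bitmap of cacheline indices
    for (int t = 0; t < 32; t++) {
        seen |= 1ULL << (off(t) / 128);
    }
    int n = 0;
    for (; seen; seen &= seen - 1) n++; // popcount of set bits
    return n;
}

// Hypothetical thread-to-offset mappings for illustration only:
static int contiguous(int t) { return 2 * t; }  // thread t -> bytes 2t..2t+1
static int strided(int t)    { return 16 * t; } // thread t -> bytes 16t..16t+1

int main(void) {
    printf("contiguous: %d cacheline(s) per warp load\n",
           cachelines_touched(contiguous)); // prints 1
    printf("strided:    %d cacheline(s) per warp load\n",
           cachelines_touched(strided));    // prints 4
    return 0;
}
```

Rearranging which threads load which elements, as suggested above, amounts to choosing a mapping closer to the contiguous one.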
This basically makes the q6_k mul mat shader dequantize 8 values at a time using larger 16-bit loads, along with other things. I'll do the other quants later in another PR when I have time.
Calculating the indices at the start takes up a lot of instructions, which is why doing 8 values at a time is faster than doing 4. In any case, dequantizing part of a K-quant superblock is still really kludgy; it probably needs some repacking to run fast here.
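For context, here is a minimal C sketch of the q6_K superblock layout and the scalar unpacking it implies, based on the ggml reference format. This is not the PR's shader code; the Vulkan shader performs the same bit arithmetic in GLSL.

```c
#include <stdint.h>

#define QK_K 256 // values per superblock

// q6_K superblock as laid out in ggml: 6 bits per value, split into a
// 4-bit low part (ql) and a 2-bit high part (qh), plus 16 block scales.
typedef struct {
    uint8_t  ql[QK_K / 2];      // low 4 bits of 256 values
    uint8_t  qh[QK_K / 4];      // high 2 bits of 256 values
    int8_t   scales[QK_K / 16]; // one signed 8-bit scale per 16 values
    uint16_t d;                 // raw fp16 bits of the superblock scale
} block_q6_K;

// Scalar reference: reconstruct value i (0..255) as an int in [-32, 31];
// the full dequant multiplies this by d * scales[i / 16].
static int q6_k_unpack(const block_q6_K *b, int i) {
    int half = i / 128;                   // superblock has two 128-value halves
    int l    = i % 128;
    const uint8_t *ql = b->ql + 64 * half;
    const uint8_t *qh = b->qh + 32 * half;
    int lo = (l < 64) ? (ql[l] & 0xF) : (ql[l - 64] >> 4);
    int hi = (qh[l % 32] >> (2 * (l / 32))) & 3;
    return (lo | (hi << 4)) - 32;
}
```

The cost the PR targets is visible here: the ql/qh offsets and shift amounts depend on the value index in a branchy way, so recomputing them per value is expensive. Doing that index math once and then fetching ql and qh with 16-bit loads amortizes it over 8 outputs at a time.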
On my RX 470:
PR:
Master: