Conversation

@netrunnereve (Collaborator)

This makes the q6_K mul_mat shader dequantize 8 values at a time using the larger 16-bit loads, along with some other cleanups. I'll do the other quants later in another PR when I have time.

Calculating the indexes at the start takes up a lot of instructions, which is why doing 8 at a time is faster than doing 4. In any case, dequantizing part of a K-quant superblock is still really kludgy, and it probably needs some repacking to run fast here.
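For reference, here is a minimal CPU-side sketch of the idea in C (my illustration, not the actual GLSL shader; `block_q6_K_f`, `load_u16`, `dequant_q6_K_8` and `dequant_block_q6_K` are hypothetical names, and `d` is kept as a plain float instead of the real fp16 field). It follows the standard q6_K superblock layout and models the packed 16-bit loads, so the index math and scale lookup are done once per group of 8 values instead of once per pair:

```c
#include <stdint.h>
#include <string.h>

#define QK_K 256

// Hypothetical float-d variant of the q6_K superblock, for illustration only
// (the real block stores d as fp16).
typedef struct {
    uint8_t ql[QK_K / 2];      // lower 4 bits of the 256 quants
    uint8_t qh[QK_K / 4];      // upper 2 bits of the 256 quants
    int8_t  scales[QK_K / 16]; // one 8-bit scale per 16 values
    float   d;                 // super-block scale
} block_q6_K_f;

// Models a packed 16-bit load (little-endian, two bytes at once).
static inline uint16_t load_u16(const uint8_t *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

// Dequantize the 8 values y[base + l .. base + l + 7] of one 128-value half.
// base selects the sub-range (0, 32, 64 or 96); l is a multiple of 8 below 32.
// The per-group index math (shifts, offsets, scale) is computed once for all 8.
static void dequant_q6_K_8(float d, const uint8_t *ql, const uint8_t *qh,
                           const int8_t *sc, int base, int l, float *y) {
    const int   qh_shift = (base / 32) * 2;     // which 2-bit field of each qh byte
    const int   ql_off   = base & 32;           // sub-ranges 32 and 96 read ql[l + 32 ..]
    const int   ql_shift = (base & 64) ? 4 : 0; // sub-ranges 64 and 96 use the high nibble
    const float dscale   = d * sc[l / 16 + (base / 32) * 2];

    for (int i = 0; i < 8; i += 2) {
        // one 16-bit load covers two ql bytes and two qh bytes
        const uint16_t ql16 = load_u16(&ql[ql_off + l + i]);
        const uint16_t qh16 = load_u16(&qh[l + i]);
        for (int k = 0; k < 2; ++k) {
            const int qlo = (ql16 >> (8 * k + ql_shift)) & 0xF;
            const int qhi = (qh16 >> (8 * k + qh_shift)) & 0x3;
            y[base + l + i + k] = dscale * (float)((qlo | (qhi << 4)) - 32);
        }
    }
}

// Dequantize a full 256-value superblock: the second 128-value half advances
// ql by 64 bytes, qh by 32 bytes and scales by 8 entries, as in the scalar
// reference dequantization.
static void dequant_block_q6_K(const block_q6_K_f *b, float *y) {
    for (int half = 0; half < 2; ++half)
        for (int base = 0; base < 128; base += 32)
            for (int l = 0; l < 32; l += 8)
                dequant_q6_K_8(b->d, b->ql + 64 * half, b->qh + 32 * half,
                               b->scales + 8 * half, base, l, y + 128 * half);
}
```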

On my RX 470:

PR:

MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 40 runs - 25383.38 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | Vulkan | 100 | pp512 | 1143.10 ± 4.76 |

Master:

MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 38 runs - 27576.82 us/run -  60.13 GFLOP/run -   2.18 TFLOPS
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | Vulkan | 100 | pp512 | 1052.66 ± 5.31 |

netrunnereve requested a review from 0cc4m as a code owner on December 6, 2025 at 03:40
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 6, 2025
@0cc4m (Collaborator) commented Dec 6, 2025

For most cases I tested, this change is positive, but there's a problem with coopmat1 on Nvidia:

ggml_vulkan: Found 1 Vulkan devices:                                                                                                                          
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

Master:
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                650 runs -  1542.20 us/run -  60.13 GFLOP/run -  38.99 TFLOPS
PR:
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                556 runs -  1799.21 us/run -  60.13 GFLOP/run -  33.42 TFLOPS

Master:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 2427.65 ± 23.55 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 2420.86 ± 11.39 |

PR:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 2061.99 ± 2.74 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 2061.28 ± 4.78 |

Maybe @jeffbolznv knows more about the cause of this drop. I don't see it when running without coopmat and integer dot.

@netrunnereve (Collaborator, Author)

> For most cases I tested, this change is positive, but there's a problem with coopmat1 on Nvidia:

Does this also happen with the four-at-a-time version, b4bae3f?

@jeffbolznv (Collaborator)

I'm seeing a failure in MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) with both of the commits.

@netrunnereve (Collaborator, Author) commented Dec 6, 2025

> I'm seeing a failure in MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) with both of the commits.

Is this only happening with coopmat? This test is passing fine on Intel, AMD, and llvmpipe with no coopmat.

@jeffbolznv (Collaborator)

Yes, I'm only seeing it with the coopmat path. Maybe it has something to do with the tile sizes it uses?

@netrunnereve (Collaborator, Author)

I don't think this is affected by the tile sizes unless one of them isn't divisible by 8, but I'll take another look next week. IIRC our smallest tile size is 32x32. Strangely enough, the coopmat CI run on the T4 passes fully, including the perplexity check, whose result of 9.4792 +/- 0.81443 matches the one on master.

@jeffbolznv (Collaborator)

I did a full rebuild and it's passing now. There's an issue with the build system on Windows where it sometimes uses a stale version of vulkan-shaders-gen (it mixes up debug vs. release builds), and I think that's what bit me. Sorry for the noise; let me see if I can repro the perf difference now.

@jeffbolznv (Collaborator)

I'm able to reproduce about a 10% slowdown for DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf with this change on an RTX 4070. I only had time for a quick look, but it seems like the loads are more spread out and each load (across a warp) touches more cachelines. I'm not totally sure of that, though. Maybe you can rearrange which threads load which elements?
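As a rough illustration of that concern (a toy model only, not the shader's real thread-to-element mapping): if each thread of a 32-wide warp owns a contiguous 8-value chunk, the warp's simultaneous 16-bit loads land 8 bytes apart, so each wave of loads spans more 128-byte cachelines than when adjacent threads read adjacent words, even though the total bytes loaded are the same.

```c
#include <stdbool.h>
#include <stdio.h>

#define WARP_SIZE      32
#define CACHELINE_SIZE 128
#define MAX_LINES      64

// Count the distinct cachelines touched by one simultaneous load per thread.
static int cachelines_touched(const int offsets[WARP_SIZE]) {
    bool seen[MAX_LINES] = { false };
    int  count = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        const int line = offsets[t] / CACHELINE_SIZE;
        if (!seen[line]) {
            seen[line] = true;
            ++count;
        }
    }
    return count;
}

int main(void) {
    int packed[WARP_SIZE], chunked[WARP_SIZE];
    for (int t = 0; t < WARP_SIZE; ++t) {
        packed[t]  = 2 * t; // adjacent threads read adjacent 16-bit words
        chunked[t] = 8 * t; // each thread starts its own 8-byte chunk
    }
    printf("packed mapping : %d cacheline(s) per wave\n", cachelines_touched(packed));  // prints 1
    printf("chunked mapping: %d cacheline(s) per wave\n", cachelines_touched(chunked)); // prints 2
    return 0;
}
```

This only shows the direction of the effect; whether it explains the coopmat slowdown would still need profiling.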
