vulkan: faster q6_k matmul #17813
Conversation
In most cases I tested, this change is positive, but there's a regression with coopmat1 on Nvidia:
Master:
PR:
Maybe @jeffbolznv knows more about the cause of this drop. I don't see it when running without coopmat and integer dot.
Does this also happen with the four-at-a-time version b4bae3f?
I'm seeing a failure in
Is this only happening with coopmat? This test is passing fine on Intel, AMD, and llvmpipe with no coopmat.
Yes, I'm only seeing it with the coopmat path. Maybe it's something to do with the tile sizes it uses?
I don't think this is affected by tile sizes unless they're not divisible by 8, but I'll take another look next week. IIRC our smallest tile size is 32x32. Strangely enough, the CI coopmat run on the T4 is fully passing, and that includes the perplexity check whose number of
I did a full rebuild and it's passing now. There's an issue with the build system on Windows where it sometimes uses a stale version of vulkan-shaders-gen (it mixes up debug vs. release); I think that's what bit me. Sorry for the noise; let me see if I can repro the perf difference now.
I'm able to reproduce about a 10% slowdown for DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf with this change on an RTX 4070. I only had time for a quick look, but it seems like the loads are more spread out and each load (across a warp) touches more cachelines, though I'm not totally sure of that. Maybe you can rearrange which threads load which elements?
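A hedged illustration of that cacheline concern (not code from this PR): assuming 128-byte cachelines and a 32-thread warp, the number of lines a single warp-wide load touches depends entirely on how thread index maps to byte offset. The two mappings below are made up for the example, not the shader's actual ones.

```c
#include <stdio.h>

// Count the distinct 128-byte cachelines touched when each of the 32
// threads in a warp loads one 16-bit word at byte offset off(t).
static int cachelines_touched(int (*off)(int)) {
    unsigned long long seen = 0; // bitmap of cacheline indices
    for (int t = 0; t < 32; t++) {
        seen |= 1ULL << (off(t) / 128);
    }
    int n = 0;
    for (; seen; seen &= seen - 1) n++; // popcount of set bits
    return n;
}

// Hypothetical thread-to-offset mappings for illustration only:
static int contiguous(int t) { return 2 * t; }  // thread t -> bytes 2t..2t+1
static int strided(int t)    { return 16 * t; } // thread t -> bytes 16t..16t+1

int main(void) {
    printf("contiguous: %d cacheline(s) per warp load\n",
           cachelines_touched(contiguous)); // prints 1
    printf("strided:    %d cacheline(s) per warp load\n",
           cachelines_touched(strided));    // prints 4
    return 0;
}
```

Rearranging which threads load which elements, as suggested above, amounts to choosing a mapping closer to the contiguous one.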
This basically makes the q6_k mul mat shader dequantize 8 values at a time using larger 16-bit loads, along with other things. I'll do the other quants later in another PR when I have time.
Calculating the indices at the start takes up a lot of instructions, which is why doing 8 values at a time is faster than doing 4. In any case, dequantizing part of a K-quant superblock is still really kludgy; it probably needs some repacking to run fast here.
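For context, here is a minimal C sketch of the q6_K superblock layout and the scalar unpacking it implies, based on the ggml reference format. This is not the PR's shader code; the Vulkan shader performs the same bit arithmetic in GLSL.

```c
#include <stdint.h>

#define QK_K 256 // values per superblock

// q6_K superblock as laid out in ggml: 6 bits per value, split into a
// 4-bit low part (ql) and a 2-bit high part (qh), plus 16 block scales.
typedef struct {
    uint8_t  ql[QK_K / 2];      // low 4 bits of 256 values
    uint8_t  qh[QK_K / 4];      // high 2 bits of 256 values
    int8_t   scales[QK_K / 16]; // one signed 8-bit scale per 16 values
    uint16_t d;                 // raw fp16 bits of the superblock scale
} block_q6_K;

// Scalar reference: reconstruct value i (0..255) as an int in [-32, 31];
// the full dequant multiplies this by d * scales[i / 16].
static int q6_k_unpack(const block_q6_K *b, int i) {
    int half = i / 128;                   // superblock has two 128-value halves
    int l    = i % 128;
    const uint8_t *ql = b->ql + 64 * half;
    const uint8_t *qh = b->qh + 32 * half;
    int lo = (l < 64) ? (ql[l] & 0xF) : (ql[l - 64] >> 4);
    int hi = (qh[l % 32] >> (2 * (l / 32))) & 3;
    return (lo | (hi << 4)) - 32;
}
```

The cost the PR targets is visible here: the ql/qh offsets and shift amounts depend on the value index in a branchy way, so recomputing them per value is expensive. Doing that index math once and then fetching ql and qh with 16-bit loads amortizes it over 8 outputs at a time.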
On my RX 470:
PR:
Master: