forked from LostRuins/koboldcpp
This is more of a request for information than an actual bug report. Running the Vulkan backend in standard KoboldCpp with Flash Attention enabled tanks processing speed, and, somewhat surprisingly, the same thing happens with the hipBLAS build. Is there any workaround for this?

I would like to experiment with KV cache quantization, but quantizing the V cache requires Flash Attention. I haven't tested K cache quantization on hipBLAS yet, but on Vulkan, at least, quantizing the K cache doesn't save any memory at all.

System: RX 6650 XT GPU, 64-bit Windows 10. I mention the OS because, as I understand it, some of this behaves differently on Windows than on Linux.
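For a rough idea of how much memory K-only quantization *should* save, here is a back-of-the-envelope estimate. It is a minimal sketch that assumes ggml's usual storage sizes (f16 = 2 bytes per element; q8_0 and q4_0 store blocks of 32 values plus one fp16 scale, i.e. 34 and 18 bytes per block). The model shape is a hypothetical Llama-2-7B-like configuration, not the model from this report. The estimate ignores padding, compute buffers, and any backend-specific overhead, so it only gives an expectation to compare against the VRAM actually reported after loading.

```python
# Approximate bytes per element for ggml KV cache types (assumed block layouts).
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}

MIB = 1024 ** 2


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, k_type="f16", v_type="f16"):
    """Estimate KV cache size in bytes (K and V tensors only, no overhead)."""
    elems_per_tensor = n_layers * n_ctx * n_kv_heads * head_dim
    return elems_per_tensor * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type])


# Hypothetical model shape (Llama-2-7B-like), 4096-token context.
shape = dict(n_layers=32, n_ctx=4096, n_kv_heads=32, head_dim=128)

for k, v in [("f16", "f16"), ("q8_0", "f16"), ("q8_0", "q8_0"), ("q4_0", "q4_0")]:
    mib = kv_cache_bytes(**shape, k_type=k, v_type=v) / MIB
    print(f"K={k:5} V={v:5} -> {mib:7.1f} MiB")
```

Under these assumptions, quantizing only K to q8_0 shrinks the cache from 2048 MiB to 1568 MiB. A loaded model whose reported VRAM does not drop by roughly that proportion suggests the K cache is not being stored in the quantized type on that backend.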