Flash Attention on hipBLAS #143

@Alice-Cheshire

Description

This is more of a request for info than an actual issue. Running the Vulkan backend in default koboldCPP with Flash Attention enabled completely tanks processing speed. I noticed the same thing happens with the hipBLAS version, which is a bit surprising. Is there any workaround for this? I'd like to experiment with KV cache quantization, but the V cache part requires Flash Attention. I haven't tested K cache quantization on hipBLAS yet, but on Vulkan at least, K cache quantization doesn't even save any memory. (GPU is an RX 6650 XT and I'm running 64-bit Windows 10, since from my understanding, some things work differently between Windows and Linux in this regard.)
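
For reference, a launch along these lines should reproduce the configuration described above. This is a sketch, not a verified repro: it assumes a recent koboldcpp build, `model.gguf` is a placeholder, and flag names and `--quantkv` semantics can differ between versions, so check `--help` on your build.

```sh
# Vulkan backend with Flash Attention and a quantized KV cache.
# --quantkv levels (as of recent builds): 0 = F16, 1 = Q8, 2 = Q4;
# quantizing the V cache requires --flashattention to be enabled.
python koboldcpp.py --model model.gguf --usevulkan --flashattention --quantkv 1

# hipBLAS/ROCm builds expose the AMD GPU through the cublas-style flag:
python koboldcpp.py --model model.gguf --usecublas --flashattention --quantkv 1
```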
