This is something llama.cpp actually supports. I set the GGML_VK_VISIBLE_DEVICES environment variable like this on the Linux command line:
GGML_VK_VISIBLE_DEVICES=0 ./llama-server -hf ggml-org/Qwen3-1.7B-GGUF -c 9000 --port 8081 -cram 0
You can replace the 0 with the index of the device you want to use. This forces llama.cpp onto that device; otherwise, if you also have an NVIDIA GPU, it will default to that one.
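If you are not sure which index belongs to which GPU, you can enumerate them first. As far as I remember, recent builds of llama-server have a --list-devices flag, and vulkaninfo shows the same information from the Vulkan side:
./llama-server --list-devices
vulkaninfo --summary
The index shown there is what goes into GGML_VK_VISIBLE_DEVICES.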
I compiled llama.cpp with the Vulkan backend, so I am running it on a GCN5 AMD iGPU: the Radeon RX Vega 7 in a Ryzen 5 5600H. The iGPU may be slower than the CPU for short question-style prompts, but it frees up the CPU, and on prompts of over 1000 tokens I find its speed comparable to the CPU's, in the 5 to 11 tok/s range, since both are limited by DDR4 memory bandwidth.
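If anyone wants to reproduce the comparison, llama-bench (built alongside llama-server) times prompt processing and generation separately. Something like the following is what I have in mind; the model path is just a placeholder, and -ngl 0 is only a rough CPU baseline, since a CPU-only build is the cleaner comparison:
GGML_VK_VISIBLE_DEVICES=0 ./llama-bench -m Qwen3-1.7B-Q4_K_M.gguf -p 1024 -n 128
./llama-bench -m Qwen3-1.7B-Q4_K_M.gguf -p 1024 -n 128 -ngl 0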
I use Vulkan because it seems to be impossible to use OpenCL or ROCm on these integrated GPUs: ROCm compilation has no supported LLVM target for them.
I am also running Arch Linux. Following the instructions on the build page is not strictly necessary: I just installed vulkan-headers and vulkan-devel, and as long as vulkaninfo works, you do not need to source any shell scripts.
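For reference, the whole setup on Arch boils down to roughly this (package names from the Arch repos, the cmake flag from the llama.cpp Vulkan build docs; if cmake complains about a missing glslc, the shaderc package should provide it):
sudo pacman -S vulkan-headers vulkan-devel
vulkaninfo --summary
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j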
On a related note, I was able to extract more performance out of the dedicated GPU by running several slots (users) on llama-server at the same time. As long as all of their prompts fit within the context size set for llama-server, it is effectively free extra performance. I disabled the KV cache since every prompt is self-contained and does not rely on previous chats; that eliminated cache hits and improved performance. Context size is limited by the amount of RAM.
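A minimal sketch of what I mean, assuming 4 parallel slots (-np is the parallel-slots flag) and keeping -cram 0 from the command above, which as far as I understand disables the server-side prompt cache. Note that the 9000-token context is shared, so each slot gets roughly a quarter of it:
GGML_VK_VISIBLE_DEVICES=0 ./llama-server -hf ggml-org/Qwen3-1.7B-GGUF -c 9000 -np 4 -cram 0 --port 8081
Concurrent requests then land on different slots, for example two curl calls to the OpenAI-compatible endpoint in parallel:
curl http://localhost:8081/v1/completions -H 'Content-Type: application/json' -d '{"prompt": "Hello", "max_tokens": 32}' &
curl http://localhost:8081/v1/completions -H 'Content-Type: application/json' -d '{"prompt": "Hi there", "max_tokens": 32}' &
wait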