Replies: 1 comment
For 5090 you can try #17906 (should be merged soon), it should give a nice PP boost (~20-25%)
First off, hats off to all llama.cpp contributors — it's truly an amazing open-source project that makes local AI accessible.
Now, I'm getting my feet wet with SFT and would like to generate synthetic training data. I have a bunch of prompts that I feed to llama-server, but I'm not sure I'm extracting the best performance possible from my hardware. I've mostly done single-issue inference such as chatting or agentic coding, and since this project moves so fast it's hard to know whether advice given a year ago is still valid today.

I have an AMD 5900X CPU, 96GB RAM, and an NVIDIA 5090 GPU, and I'm running an up-to-date Windows 11 with hardware-accelerated GPU scheduling enabled and CUDA toolkit 12.8. I'm currently using b7488. My current command line is:

```
llama-server -m %models%\gpt-oss-20b-MXFP4.gguf --no-mmap -c 65536 -np 16 -ngl 99 --reasoning-format none --jinja -fa on --top-k 0 --top-p 0 --min-p 0.05 -b 8192 -ub 1024 --cont-batching
```

What I'm observing is that I'm not getting a significant speedup from `-np` beyond 4 or so, with GPU utilization at around 30% as reported by Windows and GPU-Z. If I set `-np 1`, GPU utilization goes up to 60%, but average inference speed is roughly the same or slightly lower. Is this to be expected for this model, or are there some knobs I can turn to make it much better? I've seen `--swa-full` and `--kv-unified`, but I'm not sure whether they're relevant, nor how to use them correctly if so.

For reference, here are some current benchmark results (at about 80-90% GPU utilization):
```
build: 408616a (7488)
llama-bench -m %models%\gpt-oss-20b-MXFP4.gguf --mmap 0 -ngl 99 -fa on -b 2048 -ub 512,1024,2048 -n 1024
```
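Since `llama-bench` only measures single-sequence throughput, it probably doesn't reflect what `-np` does under continuous batching. A sweep with `llama-batched-bench` seems like a more direct way to measure parallel scaling — this is just a sketch, and I'm assuming the tool's `-npp`/`-ntg`/`-npl` flags behave the way I think they do:

```shell
:: Sketch (my assumption): llama-batched-bench reports PP/TG throughput per
:: parallel level, so sweeping -npl should show where -np stops helping.
::   -npp: prompt tokens per sequence
::   -ntg: generated tokens per sequence
::   -npl: parallel sequence counts to test
llama-batched-bench -m %models%\gpt-oss-20b-MXFP4.gguf -c 65536 -ngl 99 -fa on ^
  -b 8192 -ub 1024 -npp 512 -ntg 128 -npl 1,2,4,8,16
```

If throughput flattens at the same `-npl` where my real workload flattens, at least I'd know it's the model/hardware rather than my client.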
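For context on the client side, here's roughly how I keep the server's slots busy — a minimal sketch against llama-server's OpenAI-compatible `/v1/chat/completions` endpoint (the server address, payload fields, and helper names here are my own assumptions, not anything from llama.cpp itself):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address/port

def build_payload(prompt, max_tokens=512):
    # Request body for llama-server's OpenAI-compatible chat endpoint.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def complete(prompt):
    # One blocking request; the server interleaves concurrent requests
    # across its -np slots via continuous batching.
    req = urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def generate_all(prompts, n_parallel=16):
    # Keep n_parallel requests in flight so every server slot stays busy.
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        return list(pool.map(complete, prompts))
```

My understanding is that if the client only ever has one or two requests outstanding, the extra slots from `-np 16` just sit idle, which would also produce flat scaling — so I want to rule that out.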