Replies: 1 comment
For 5090 you can try #17906 (should be merged soon), it should give a nice PP boost (~20-25%)
First off, hats off to all llama.cpp contributors — it's truly an amazing open-source project that makes local AI accessible.
Now, I'm getting my feet wet with SFT and would like to generate synthetic training data. I have a bunch of prompts that I feed to llama-server, but I'm not sure I'm extracting the best performance possible from my hardware. I've mostly done single-issue inference such as chatting or agentic coding, and since this project moves so fast it's hard to know whether advice given a year ago is still valid today.

I have an AMD 5900X CPU, 96GB RAM, and an NVIDIA 5090 GPU, and I'm running an up-to-date Windows 11 with hardware-accelerated GPU scheduling enabled and CUDA toolkit 12.8. I'm currently using b7488. My current command line is:

```
llama-server -m %models%\gpt-oss-20b-MXFP4.gguf --no-mmap -c 65536 -np 16 -ngl 99 --reasoning-format none --jinja -fa on --top-k 0 --top-p 0 --min-p 0.05 -b 8192 -ub 1024 --cont-batching
```

What I'm observing is that I'm not getting a significant speedup from `-np` beyond 4 or so, with GPU utilization at around 30% as reported by Windows and GPU-Z. If I set `-np 1`, GPU utilization goes up to 60%, but average inference speed is roughly the same or slightly lower. Is this to be expected for this model, or are there some knobs I can turn to make it much better? I've seen `--swa-full` and `--kv-unified`, but I'm not sure whether they're relevant, nor how to use them correctly if so.

For reference, here are some current benchmark results (at about 80-90% GPU utilization):
```
build: 408616a (7488)
llama-bench -m %models%\gpt-oss-20b-MXFP4.gguf --mmap 0 -ngl 99 -fa on -b 2048 -ub 512,1024,2048 -n 1024
```
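Since `llama-bench` only measures single-sequence throughput, it probably doesn't reflect what `-np` does under continuous batching. A sweep with `llama-batched-bench` seems like a more direct way to measure parallel scaling — this is just a sketch, and I'm assuming the tool's `-npp`/`-ntg`/`-npl` flags behave the way I think they do:

```shell
:: Sketch (my assumption): llama-batched-bench reports PP/TG throughput per
:: parallel level, so sweeping -npl should show where -np stops helping.
::   -npp: prompt tokens per sequence
::   -ntg: generated tokens per sequence
::   -npl: parallel sequence counts to test
llama-batched-bench -m %models%\gpt-oss-20b-MXFP4.gguf -c 65536 -ngl 99 -fa on ^
  -b 8192 -ub 1024 -npp 512 -ntg 128 -npl 1,2,4,8,16
```

If throughput flattens at the same `-npl` where my real workload flattens, at least I'd know it's the model/hardware rather than my client.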
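For context on the client side, here's roughly how I keep the server's slots busy — a minimal sketch against llama-server's OpenAI-compatible `/v1/chat/completions` endpoint (the server address, payload fields, and helper names here are my own assumptions, not anything from llama.cpp itself):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address/port

def build_payload(prompt, max_tokens=512):
    # Request body for llama-server's OpenAI-compatible chat endpoint.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def complete(prompt):
    # One blocking request; the server interleaves concurrent requests
    # across its -np slots via continuous batching.
    req = urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def generate_all(prompts, n_parallel=16):
    # Keep n_parallel requests in flight so every server slot stays busy.
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        return list(pool.map(complete, prompts))
```

My understanding is that if the client only ever has one or two requests outstanding, the extra slots from `-np 16` just sit idle, which would also produce flat scaling — so I want to rule that out.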