This guide explains the 6 critical vLLM optimizations that can dramatically improve your inference performance and cost efficiency. These optimizations are automatically applied in Terradev's vLLM integration.
Default: 2048 | Optimized: 16384 (throughput) / 4096 (latency)
max-num-batched-tokens is the single biggest throughput lever most teams never touch. The default is optimized for ITL (inter-token latency), not throughput.
# Throughput-optimized
--max-num-batched-tokens 16384
# Latency-optimized
--max-num-batched-tokens 4096

Impact: 8x throughput improvement for batch-heavy workloads
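To get a feel for what the token budget means, here is a rough back-of-the-envelope sketch: the budget caps how many prompt tokens can be scheduled in a single prefill step, so a larger budget lets more requests start at once. The 512-token prompt length is an illustrative assumption, not a measurement.

```python
# Rough sketch: how many prompts of a given length fit into one prefill step
# under different token budgets. The 512-token prompt is an assumption for
# illustration; plug in numbers from your own workload.

def prompts_per_prefill_step(max_num_batched_tokens: int, prompt_tokens: int) -> int:
    return max_num_batched_tokens // prompt_tokens

for budget in (2048, 4096, 16384):
    n = prompts_per_prefill_step(budget, prompt_tokens=512)
    print(f"--max-num-batched-tokens {budget:>5}: ~{n} prompts per prefill step")
```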
Default: 0.90 | Optimized: 0.95
The default gpu-memory-utilization of 0.90 leaves 10% of VRAM idle on a single-instance production deployment for no reason.
--gpu-memory-utilization 0.95

Impact: 5% more VRAM available for larger models/batches
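As a quick sanity check on what that 5% buys, a minimal sketch under assumed hardware: an 80 GB GPU serving a Llama-2-7B-class model with an fp16 KV cache. Both numbers are assumptions; swap in your own.

```python
# What does raising --gpu-memory-utilization from 0.90 to 0.95 buy?
# Assumptions for illustration: an 80 GB GPU and a Llama-2-7B-style KV cache
# (32 layers x 4096 hidden dim, fp16, no GQA).

GPU_VRAM_GB = 80
extra_vram_gb = (0.95 - 0.90) * GPU_VRAM_GB           # ~4 GB handed to vLLM

# KV cache bytes per token: 2 (K and V) * layers * hidden_dim * 2 bytes (fp16)
kv_bytes_per_token = 2 * 32 * 4096 * 2                # 524,288 bytes (~0.5 MB)

extra_tokens = int(extra_vram_gb * 1024**3 / kv_bytes_per_token)
print(f"Extra VRAM: {extra_vram_gb:.1f} GB -> room for ~{extra_tokens:,} more KV-cache tokens")
```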
Default: 256 (V0) / 1024 (V1) | Optimized: 1024 (throughput) / 512 (latency)
max-num-seqs hard-caps your concurrency. Bursty traffic hits this ceiling and queues silently.
# Throughput-optimized
--max-num-seqs 1024
# Latency-optimized
--max-num-seqs 512

Impact: Prevents silent queuing under bursty traffic
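One way to catch that silent queuing is to watch vLLM's Prometheus metrics. A minimal sketch, assuming the server exposes /metrics on localhost:8000 and uses the vllm:num_requests_running / vllm:num_requests_waiting gauge names; verify the exact names against your vLLM version's /metrics output.

```python
# Poll vLLM's metrics endpoint and flag silent queuing. Metric names are the
# ones recent vLLM builds expose; check them against your version.
import urllib.request

def read_gauge(metrics_text: str, name: str) -> float:
    # Return the value of the first sample whose name matches, skipping
    # the "# HELP" / "# TYPE" comment lines.
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.split()[-1])
    return 0.0

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    metrics = resp.read().decode()

running = read_gauge(metrics, "vllm:num_requests_running")
waiting = read_gauge(metrics, "vllm:num_requests_waiting")
print(f"running={running:.0f} waiting={waiting:.0f}")
if waiting > 0:
    print("Requests are queuing -- consider raising --max-num-seqs or adding replicas.")
```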
Default: OFF | Optimized: ON
Prefix caching gives you a free throughput win if any requests share long system prompts or RAG chunks. No downside.
--enable-prefix-caching

Impact: Free throughput improvement for shared prompts
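To make the "shared prefix" case concrete, here is a sketch of the request pattern prefix caching rewards, using the OpenAI-compatible completions endpoint. The port, model name, and prompt text are placeholders taken from the examples in this guide; with prefix caching enabled, the KV cache computed for the shared context on the first request is reused by the later ones.

```python
# Many requests sharing one long prefix (system prompt, RAG context) is the
# pattern prefix caching accelerates. Assumes a vLLM OpenAI-compatible server
# on localhost:8000 serving meta-llama/Llama-2-7b-hf.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

shared_context = "Support policy for ExampleCorp:\n" + ("- policy clause\n" * 200)

for question in ("How do I reset my password?", "How do I close my account?"):
    resp = client.completions.create(
        model="meta-llama/Llama-2-7b-hf",
        prompt=shared_context + "\nQuestion: " + question + "\nAnswer:",
        max_tokens=64,
    )
    print(resp.choices[0].text.strip()[:80])
```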
Default: OFF in V0, ON in V1 | Optimized: ON
If you're on V0, turn chunked prefill on. If you're on V1, verify it's actually on.
--enable-chunked-prefill

Impact: Better prefill performance, especially for long prompts
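For intuition on what chunked prefill changes: a long prompt's prefill is split into chunks bounded by the token budget, so decode steps for other requests can run between chunks instead of stalling behind one monolithic prefill. A toy sketch with made-up prompt lengths:

```python
# Illustrative only: how many prefill chunks a long prompt is broken into when
# each chunk is capped by --max-num-batched-tokens.
import math

def prefill_chunks(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    return math.ceil(prompt_tokens / max_num_batched_tokens)

for prompt_len in (2_000, 32_000, 120_000):
    chunks = prefill_chunks(prompt_len, max_num_batched_tokens=4096)
    print(f"{prompt_len:>7}-token prompt -> {chunks} prefill chunk(s) at budget 4096")
```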
Default: Usually underprovisioned | Optimized: 2 + #GPUs physical cores
vLLM V1 runs a busy loop on the engine core. If you starve it of CPU, your GPU sits idle. You'll see 40% GPU utilization and spend three days blaming the model.
# For 8 GPUs: allocate 10+ CPU cores
resources:
  requests:
    cpu: "10"    # 2 + 8 GPUs
  limits:
    cpu: "16"    # Extra headroom

Impact: Prevents CPU starvation of GPU workloads
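The sizing rule is mechanical enough to encode. A trivial sketch; the headroom added to the limit is an assumption mirroring the 10-request / 16-limit example above.

```python
# CPU request = 2 + number of GPUs; the limit adds headroom on top.
def cpu_resources(num_gpus: int, headroom: int = 6) -> dict:
    request = 2 + num_gpus
    return {"requests": {"cpu": str(request)},
            "limits": {"cpu": str(request + headroom)}}

print(cpu_resources(8))
# {'requests': {'cpu': '10'}, 'limits': {'cpu': '16'}}
```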
# Throughput-optimized configuration
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.95
--enable-prefix-caching
--enable-chunked-prefill
--max-num-seqs 1024

# Latency-optimized configuration
--max-num-batched-tokens 4096
--max-num-seqs 512
--enable-chunked-prefill
--gpu-memory-utilization 0.95
--enable-prefix-caching

# Throughput optimization
terradev vllm optimize -m meta-llama/Llama-2-7b-hf -t throughput
# Latency optimization with 4 GPUs
terradev vllm optimize -m mistralai/Mistral-7B-v0.1 -t latency -g 4
# Output Helm values
terradev vllm optimize -m meta-llama/Llama-2-70b-hf -t throughput -o helm
# Output JSON config
terradev vllm optimize -m codellama/CodeLlama-34b-hf -t latency -o config

# Basic benchmark
terradev vllm benchmark -e http://localhost:8000
# Concurrent load test
terradev vllm benchmark -e http://localhost:8000 -c 10
# With custom prompt
terradev vllm benchmark -e http://localhost:8000 -p "Write a Python function for fibonacci"

FROM vllm/vllm-openai:nightly
# Throughput-optimized vLLM
CMD ["vllm", "serve", "meta-llama/Llama-2-7b-hf", \
     "--max-num-batched-tokens", "16384", \
     "--gpu-memory-utilization", "0.95", \
     "--max-num-seqs", "1024", \
     "--enable-prefix-caching", \
     "--enable-chunked-prefill"]

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-optimized
spec:
  selector:
    matchLabels:
      app: vllm-optimized
  template:
    metadata:
      labels:
        app: vllm-optimized
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:nightly
        command: ["vllm", "serve", "/models/weights"]
        args:
        - --max-num-batched-tokens=16384
        - --gpu-memory-utilization=0.95
        - --max-num-seqs=1024
        - --enable-prefix-caching
        - --enable-chunked-prefill
        resources:
          requests:
            nvidia.com/gpu: "8"
            cpu: "10"    # 2 + #GPUs
          limits:
            nvidia.com/gpu: "8"
            cpu: "16"

| Optimization | Throughput Gain | Latency Impact | Notes |
|---|---|---|---|
| max-num-batched-tokens (16384) | 8x | +10-20% | Biggest throughput lever |
| gpu-memory-utilization (0.95) | 5% | 0% | Free VRAM |
| max-num-seqs (1024) | 2-3x | -5% | Prevents queuing |
| prefix-caching | 1.5-3x | -10% | Shared prompts |
| chunked-prefill | 1.2-2x | -5% | Long prompts |
| CPU cores (2+#GPUs) | 1.5-2x | -15% | Prevents starvation |
- Version Compatibility: These optimizations work best with vLLM ≥0.15.0
- Memory Usage: Higher max-num-batched-tokens uses more RAM
- CPU Allocation: Always allocate 2 + #GPUs physical cores
- Monitoring: Watch GPU utilization; if it stays below 70%, increase CPU allocation (see the sketch after this list)
- Testing: Always benchmark with your specific workload
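For the monitoring point above, a small sketch that samples utilization with nvidia-smi and flags the below-70% case; the sample count and interval are arbitrary choices.

```python
# Sample GPU utilization via nvidia-smi and warn when the average is below 70%,
# which on vLLM V1 often points at a CPU-starved engine core rather than the model.
import subprocess
import time

samples = []
for _ in range(10):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    # One line per GPU; average across GPUs for this sample.
    utils = [float(v) for v in out.split()]
    samples.append(sum(utils) / len(utils))
    time.sleep(1)

avg = sum(samples) / len(samples)
print(f"Average GPU utilization over {len(samples)} samples: {avg:.0f}%")
if avg < 70:
    print("Below 70% -- check CPU requests/limits before blaming the model.")
```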
Check that optimizations are applied:
# Check vLLM server logs for the flags
docker logs vllm-container | grep -E "(max-num-batched|gpu-memory|prefix-caching)"
# Monitor GPU utilization
nvidia-smi dmon -s u -c 10
# Check for queuing
curl http://localhost:8000/metrics | grep vllm

Remember: These optimizations are automatically applied when using Terradev's vLLM integration. Use the CLI commands to generate custom configurations for your specific needs.