vLLM Optimization Guide - 6 Critical Knobs Most Teams Never Touch

This guide explains the 6 critical vLLM optimizations that can dramatically improve your inference performance and cost efficiency. These optimizations are automatically applied in Terradev's vLLM integration.

🔧 The 6 Critical Knobs

1. --max-num-batched-tokens

Default: 2048 | Optimized: 16384 (throughput) / 4096 (latency)

The single biggest throughput lever most teams never touch. The default is tuned for ITL (inter-token latency), not throughput.
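Roughly speaking, with 512-token prompts the default 2048-token budget admits at most four prefills per scheduler step, while 16384 admits around 32; that extra batching headroom is where the throughput gain comes from.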

# Throughput-optimized
--max-num-batched-tokens 16384

# Latency-optimized  
--max-num-batched-tokens 4096

Impact: 8x throughput improvement for batch-heavy workloads

2. --gpu-memory-utilization

Default: 0.90 | Optimized: 0.95

The default leaves 10% of VRAM idle on single-instance prod for no reason.

--gpu-memory-utilization 0.95

Impact: 5% more VRAM available for larger models/batches
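
Before raising the fraction, it is worth confirming how much VRAM is actually free on each card (a minimal check, assuming a standard NVIDIA driver install):

# Total / used / free VRAM per GPU before bumping utilization to 0.95
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv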

3. --max-num-seqs

Default: 256 (V0) / 1024 (V1) | Optimized: 1024 (throughput) / 512 (latency)

This hard caps your concurrency. Bursty traffic hits this ceiling and queues silently.

# Throughput-optimized
--max-num-seqs 1024

# Latency-optimized
--max-num-seqs 512

Impact: Prevents silent queuing under bursty traffic
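
To catch the silent queuing this cap causes, watch the scheduler gauges on vLLM's Prometheus metrics endpoint (metric names as exposed by recent vLLM builds; host and port assume the default local server):

# If num_requests_waiting stays above zero while num_requests_running sits at
# max-num-seqs, the concurrency cap is your bottleneck
curl -s http://localhost:8000/metrics | grep -E "vllm:num_requests_(running|waiting)"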

4. --enable-prefix-caching

Default: OFF | Optimized: ON

This gives you a free throughput win if any requests share long system prompts or RAG chunks. No downside.

--enable-prefix-caching

Impact: Free throughput improvement for shared prompts
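
As a sketch of the case this helps, here are two requests against the OpenAI-compatible endpoint that share the same long system prompt; with prefix caching on, the second request reuses the KV cache computed for that shared prefix (the system prompt here is a placeholder, and the endpoint and model name assume the server from the examples below):

# Both requests share the system prompt, so its KV blocks are computed once
SYSTEM="You are a support agent. Follow the full policy document pasted here..."
for QUESTION in "Reset my password" "Cancel my order"; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"meta-llama/Llama-2-7b-hf\",
         \"messages\": [{\"role\": \"system\", \"content\": \"$SYSTEM\"},
                        {\"role\": \"user\", \"content\": \"$QUESTION\"}]}"
done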

5. --enable-chunked-prefill

Default: OFF in V0, ON in V1 | Optimized: ON

If you're on V0, turn it on. If you're on V1, verify it's actually on.

--enable-chunked-prefill

Impact: Better prefill performance, especially for long prompts
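
A quick way to check it is really in effect is the engine-arguments line vLLM logs at startup (a sketch; the container name matches the verification section below, and the exact log wording varies across versions):

# Expect enable_chunked_prefill=True in the engine-args line logged at startup
docker logs vllm-container 2>&1 | grep -i "chunked_prefill"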

6. CPU Core Allocation

Default: Usually underprovisioned | Optimized: 2 + #GPUs physical cores

V1 runs a busy loop on the engine core; if you starve it of CPU, your GPU sits idle. You'll see 40% GPU utilization and spend three days blaming the model.

# For 8 GPUs: allocate 10+ CPU cores
resources:
  requests:
    cpu: "10"  # 2 + 8 GPUs
  limits:
    cpu: "16"  # Extra headroom

Impact: Prevents CPU starvation of GPU workloads
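
Since the rule is stated in physical cores, check what the host actually exposes before sizing requests (the CPU(s) figure includes hyperthreads; physical cores are sockets times cores per socket):

# Physical cores = Core(s) per socket x Socket(s); CPU(s) counts hyperthreads too
lscpu | grep -E "^(CPU\(s\)|Core\(s\) per socket|Socket\(s\)|Thread\(s\) per core)"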

📊 Optimization Profiles

Throughput-Heavy Production

--max-num-batched-tokens 16384
--gpu-memory-utilization 0.95
--enable-prefix-caching
--enable-chunked-prefill
--max-num-seqs 1024

Latency-Sensitive Production

--max-num-batched-tokens 4096
--max-num-seqs 512
--enable-chunked-prefill
--gpu-memory-utilization 0.95
--enable-prefix-caching
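
On a bare-metal or VM host the latency profile maps directly onto a single vllm serve invocation (a sketch; substitute your own model):

vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 512 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching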

🚀 Terradev CLI Usage

Generate Optimized Configurations

# Throughput optimization
terradev vllm optimize -m meta-llama/Llama-2-7b-hf -t throughput

# Latency optimization with 4 GPUs
terradev vllm optimize -m mistralai/Mistral-7B-v0.1 -t latency -g 4

# Output Helm values
terradev vllm optimize -m meta-llama/Llama-2-70b-hf -t throughput -o helm

# Output JSON config
terradev vllm optimize -m codellama/CodeLlama-34b-hf -t latency -o config

Benchmark Your Endpoint

# Basic benchmark
terradev vllm benchmark -e http://localhost:8000

# Concurrent load test
terradev vllm benchmark -e http://localhost:8000 -c 10

# With custom prompt
terradev vllm benchmark -e http://localhost:8000 -p "Write a Python function for fibonacci"

🐳 Docker Example

FROM vllm/vllm-openai:nightly

# Throughput-optimized vLLM. Use ENTRYPOINT rather than CMD so these args
# replace the base image's own API-server entrypoint instead of being
# appended to it.
ENTRYPOINT ["vllm", "serve", "meta-llama/Llama-2-7b-hf", \
     "--max-num-batched-tokens", "16384", \
     "--gpu-memory-utilization", "0.95", \
     "--max-num-seqs", "1024", \
     "--enable-prefix-caching", \
     "--enable-chunked-prefill"]

☸️ Kubernetes Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-optimized
spec:
  selector:
    matchLabels:
      app: vllm-optimized
  template:
    metadata:
      labels:
        app: vllm-optimized
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:nightly
        command: ["vllm", "serve", "/models/weights"]
        args:
        - --max-num-batched-tokens=16384
        - --gpu-memory-utilization=0.95
        - --max-num-seqs=1024
        - --enable-prefix-caching
        - --enable-chunked-prefill
        resources:
          requests:
            nvidia.com/gpu: "8"
            cpu: "10"  # 2 + #GPUs
          limits:
            nvidia.com/gpu: "8"
            cpu: "16"

📈 Performance Impact

| Optimization | Throughput Gain | Latency Impact | Notes |
| --- | --- | --- | --- |
| max-num-batched-tokens (16384) | 8x | +10-20% | Biggest throughput lever |
| gpu-memory-utilization (0.95) | 5% | 0% | Free VRAM |
| max-num-seqs (1024) | 2-3x | -5% | Prevents queuing |
| prefix-caching | 1.5-3x | -10% | Shared prompts |
| chunked-prefill | 1.2-2x | -5% | Long prompts |
| CPU cores (2+#GPUs) | 1.5-2x | -15% | Prevents starvation |

⚠️ Important Notes

  1. Version Compatibility: These optimizations work best with vLLM ≥0.15.0
  2. Memory Usage: Higher max-num-batched-tokens uses more RAM
  3. CPU Allocation: Always allocate at least 2 + (number of GPUs) physical cores
  4. Monitoring: Watch GPU utilization; if it stays below 70%, increase the CPU allocation
  5. Testing: Always benchmark with your specific workload

🔍 Verification

Check that optimizations are applied:

# Check vLLM server logs for the flags
docker logs vllm-container | grep -E "(max-num-batched|gpu-memory|prefix-caching)"

# Monitor GPU utilization
nvidia-smi dmon -s u -c 10

# Check for queuing
curl http://localhost:8000/metrics | grep vllm

📚 Additional Resources


Remember: These optimizations are automatically applied when using Terradev's vLLM integration. Use the CLI commands to generate custom configurations for your specific needs.