This document explains how to contribute to the vLLM Server project as an AI assistant (agent). Following these guidelines ensures you make effective, safe contributions without breaking the deployment system.
🛑 NEVER run run_vllm_agent.sh or other launch scripts directly on the host machine; start them only through Docker (runMe.sh is the host-side wrapper that does this).
These scripts are designed to run INSIDE the Docker container where:
- The correct Python/VirtualEnv environment is activated
- Model files are mounted at /app/models
- Configuration is generated dynamically
- Environment variables are properly loaded
| Issue | From Host | From Docker |
|---|---|---|
| Python packages | May not be installed | Already in container |
| Model files | Not accessible | Mounted at /app/models |
| Environment | Missing vars.env | Loaded automatically |
| Permissions | Host user | Container root/app user |
| State | Mixed host/container state | Clean container state |
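A helper script can refuse to run when it is not inside the container. The following is a minimal sketch, not part of the project: it assumes the container can be recognised by the `/.dockerenv` marker file that Docker normally creates and by the `/app/models` mount listed in the table. The function name `running_in_container` is hypothetical.

```python
from pathlib import Path


def running_in_container(root: Path = Path("/")) -> bool:
    """Return True if the filesystem under `root` looks like the vLLM container.

    Assumptions (not confirmed by the project): Docker's /.dockerenv marker
    file exists and the model directory is mounted at /app/models.
    """
    has_marker = (root / ".dockerenv").exists()
    has_models = (root / "app" / "models").is_dir()
    return has_marker and has_models
```

A script could call `running_in_container()` first and exit with an error message when it returns `False`. The `root` parameter exists only so the check can be exercised against a temporary directory.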
To test changes, launch the model through Docker:
# Option 1: Use runMe.sh (recommended)
./runMe.sh step-3.5-flash
# Option 2: Direct docker compose
MODEL=step-3.5-flash docker compose up
# Then exec into container to run scripts manually if needed
docker exec -it vllm-container bash
# Now you're inside the container, scripts work correctly
┌─────────────────────────────────────────────────────────────┐
│ Your Client │
└───────────────────────┬─────────────────────────────────────┘
│ REST/OpenAI API
▼
┌─────────────────────────────────────────────────────────────┐
│ LiteLLM Service │
│ Port: 4000 (container) │
│ - Rate limiting │
│ - Request routing │
│ - Usage tracking │
└──────────┬──────────────────────────────────────────────────┘
│ (network)
▼
┌─────────────────────────────────────────────────────────────┐
│ vLLM Service │
│ Port: 8000 (container) │
│ - Model inference │
│ - Tokenization │
│ - GPU management │
└─────────────────────────────────────────────────────────────┘
| Service | Container Name | Port | Purpose |
|---|---|---|---|
| vLLM | vllm-container | 8000 | Actual model inference |
| LiteLLM | litellm | 4000 | API proxy + rate limiting |
| DB | litellm_db | 5432 | PostgreSQL for LiteLLM |
- postgres_data: Persistent database storage
- litellm_config: tmpfs - config is regenerated each run
- Model configs mounted from ./models/ host directory
- Secrets mounted from ./secrets/ host directory
Each model has a YAML file in models/:
models/
├── glm47-flash.yml
├── step-3.5-flash.yml
├── qwen3-next-coder.yml
└── ...
# Comment: Description of what this model is
description: Code generation and analysis
# The vLLM command to run this model
command: |
  vllm serve organization/model-name \
    --flag1 value1 \
    --flag2 value2
# Environment variables for optimization
env:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_ATTENTION_BACKEND: "FLASH_ATTN"
# LiteLLM API parameters
litellm:
  temperature: 0.7
  top_p: 0.9
  timeout: 45
- Launch: User runs ./runMe.sh model-name or MODEL=model-name docker compose up
- Environment: MODEL=model-name is set in container
- Generation: run_vllm_agent.sh calls gen_models_yml.sh
- Collection: gen_models_yml.sh finds models/model-name.yml
- Config Generation: generate_litellm_config.py merges:
  - Global defaults from vars.env
  - Model overrides from models/model-name.yml
  - Template from litellm_config.template.yaml
- Result: config.yaml written to shared volume
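The merge in the Config Generation step can be pictured with a short sketch. It is an illustration only, not the actual generate_litellm_config.py: it assumes vars.env holds plain `KEY=value` lines and that per-model values win over global defaults. The helper names `parse_env_lines` and `merge_settings` are hypothetical.

```python
def parse_env_lines(lines):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def merge_settings(global_defaults, model_overrides):
    """Return a new dict in which model overrides win over global defaults."""
    merged = dict(global_defaults)
    merged.update(model_overrides)
    return merged


# Global defaults as they might appear in vars.env (assumed format)
defaults = parse_env_lines([
    "# In vars.env",
    "LITELLM_MAX_PARALLEL_REQUESTS=5",
    "LITELLM_TIMEOUT=30",
])
# Hypothetical per-model override taken from a model's env: section
overrides = {"LITELLM_TIMEOUT": "90"}
print(merge_settings(defaults, overrides))
```

Copying the defaults before updating keeps the global values intact, so the same defaults can be reused for every model.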
Copy an existing model file as template:
cp models/qwen3-next-coder.yml models/my-new-model.yml
Edit the new file:
description: [Describe what this model is/use case]
command: |
  vllm serve <HUGGINGFACE_REPO_ID> \
    --tokenizer-mode auto \
    --enable-auto-tool-choice \
    --tool-call-parser <parser-name> \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --kv-cache-dtype fp8
env:
  SAFETENSORS_FAST_GPU: "1"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  # Add any model-specific optimizations
litellm:
  temperature: 0.95
  top_p: 0.95
  top_k: 40
  repetition_penalty: 1.1
  max_tokens: 128K
  timeout: 45  # Adjust based on model speed
DO NOT try to run scripts from host. Instead:
# Build and start the model in Docker
./runMe.sh my-new-model --build
# Check logs
sudo docker compose logs -f vllm-node
# Test the API once healthy
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-FAKE"
Update README.md:
- Add model to the table
- Add model card in models/README.md
When you have multiple similar models (variants), use YAML anchors:
# models/qwen3-next-coder-base.yml
# This file defines common settings for all Qwen3-Coder-Next variants
# It's excluded from model selection (named *-base.yml)
&qwen3_coder_variant_anchor
description: Code generation and analysis (Qwen3-Coder-Next family)
command: |
  vllm serve <TO_BE_OVERRIDDEN> \
    --tokenizer-mode auto \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --kv-cache-dtype fp8
env:
  SAFETENSORS_FAST_GPU: "1"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_TRTLLM_ATTENTION: "0"
litellm:
  temperature: 0.95
  top_p: 0.95
  top_k: 40
  repetition_penalty: 1.1
  max_tokens: 128K
  timeout: 45
# models/qwen3-next-coder.yml (Official FP8)
&qwen3_coder_variant_anchor
command: |
  vllm serve unsloth/Qwen3-Coder-Next-FP8-Dynamic \
    --tokenizer-mode auto \
    --enable-auto-tool-choice \
    ...
# All other settings inherited from base
---
# models/qwen3-next-coder-nvfp4.yml (NVFP4 variant)
&qwen3_coder_variant_anchor
command: |
  vllm serve txn545/Qwen3-Coder-Next-NVFP4 \
    --tokenizer-mode auto \
    --enable-auto-tool-choice \
    ...
# All other settings inherited from base
- gen_models_yml.sh collects all *.yml files EXCEPT *-base.yml
- All files are merged into a single models.yml document
- YAML anchors defined in one file can be referenced by others
- The <<: *anchor merge key copies all settings from the base
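The file-selection rule in the first bullet can be sketched as follows. The real logic lives in gen_models_yml.sh, which is a shell script; this Python version only mirrors the stated rule (every `*.yml` except `*-base.yml`) and is an assumption about its behaviour, not a port. The function name `selectable_models` is hypothetical.

```python
from pathlib import Path


def selectable_models(models_dir):
    """List model names from *.yml files, skipping shared *-base.yml files."""
    names = []
    for path in sorted(Path(models_dir).glob("*.yml")):
        if path.name.endswith("-base.yml"):
            continue  # base files only hold shared settings
        names.append(path.stem)
    return names
```

Called on a directory that contains `qwen3-next-coder.yml` and `qwen3-next-coder-base.yml`, the function returns only `qwen3-next-coder`, which matches the name a user would pass to runMe.sh.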
Edit the command: section in the model's .yml file:
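# Change --max-num-seqs from 24 to 16 and add --max-num-batched-tokens.
# Keep comments outside the command: text after a flag breaks the "\" continuation.
command: |
  vllm serve model/repo \
    --max-num-seqs 16 \
    --max-num-batched-tokens 8K
Edit the litellm: section: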
litellm:
  temperature: 0.8  # Change default temperature
  timeout: 60  # Increase timeout for slower models
  max_tokens: 64K  # Restrict max output
Edit vars.env or the model's .yml:
# In vars.env
LITELLM_MAX_PARALLEL_REQUESTS=5
LITELLM_TIMEOUT=30
Or per-model:
env:
  LITELLM_TIMEOUT: 90
  LITELLM_MAX_PARALLEL_REQUESTS: 3
# See what's happening in the container
sudo docker compose logs -f vllm-node
# Show the last 200 log lines
sudo docker compose logs --tail=200 vllm-node
# Check LiteLLM health
curl http://localhost:4000/health/liveliness
# Get container ID
sudo docker compose ps
# Enter container
sudo docker exec -it vllm-container bash
# Now you're inside, can run scripts and check files
ls /app/models/
cat /app/generated_configs/config.yaml
Inside the container:
# Check what model is selected
echo $MODEL
# View generated config
cat /app/generated_configs/config.yaml
# Regenerate config (for testing changes)
python3 /app/generate_litellm_config.py
When using sudo, environment variables are NOT preserved by default:
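# ❌ WRONG - MODEL is set only in the calling shell; sudo's default env_reset policy strips it
MODEL=step-3.5-flash sudo docker compose up
# ✅ CORRECT - pass MODEL as an argument to sudo so it is set for docker compose
sudo MODEL=step-3.5-flash docker compose up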
# ✅ ALSO CORRECT - use runMe.sh (handles this automatically)
./runMe.sh step-3.5-flash
Model configs are mounted as read-only:
volumes:
  - ./models:/app/models:ro  # Note the :ro (read-only)
You cannot modify model files from inside the container. Always edit on the host and restart.
The generated_configs volume is recreated on every startup:
litellm_config:
  driver: tmpfs  # Clear on every restart
Any manual edits to config.yaml are lost on restart.
First test with a minimal model to verify infrastructure works:
./runMe.sh glm47-flash  # Smallest model, fastest to start
# 1. Check container started
sudo docker compose ps
# 2. Check vLLM endpoint
curl http://localhost:8000/v1/models
# 3. Check LiteLLM endpoint
curl http://localhost:4000/health/liveliness
# 4. Test actual generation
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-FAKE" \
  -H "Content-Type: application/json" \
  -d '{"model":"vllm_agent","messages":[{"role":"user","content":"hi"}]}'
# Watch GPU usage
watch -n 1 nvidia-smi
# Check container resource usage
sudo docker stats
| Symptom | Cause | Solution |
|---|---|---|
| Model defaults to glm47-flash | sudo dropping env var | Pass MODEL through sudo explicitly (see the sudo note above) or use runMe.sh |
| Container keeps restarting | Wrong command syntax | Check YAML command: indentation |
| Out of memory crash | Model too big for GPU | Reduce --max-num-seqs or switch to smaller model |
| Port already in use | Previous instance still running | sudo docker compose down first |
| Config changes don't apply | Forgot to rebuild | Use --build flag |
| Cross-file anchor fails | Anchor in different file | Put anchor in main model file, not base |
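While working through the table above, a repeatable smoke test saves retyping the curl command. The following is a minimal standard-library sketch. It assumes the LiteLLM proxy is reachable at `http://localhost:4000` and accepts the placeholder key `sk-FAKE` and the model alias `vllm_agent` used in the curl examples. The function names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url="http://localhost:4000", api_key="sk-FAKE",
                       model="vllm_agent", prompt="hi"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=45):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_chat_request())` once the stack reports healthy should return a JSON body from the proxy; an HTTP error or timeout points back to the symptoms listed in the table. Building and sending are kept separate so the request can be inspected without a running server.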
env:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"  # Faster MoE
  VLLM_ATTENTION_BACKEND: "FLASH_ATTN"  # Optimized attention
  VLLM_USE_DEEP_GEMM: "0"  # Use FlashInfer instead
# More concurrent sequences, larger batches, chunked prefill for large
# prompts, and caching of repeated prefixes
command: |
  vllm serve ... \
    --max-num-seqs 64 \
    --max-num-batched-tokens 16K \
    --enable-chunked-prefill \
    --enable-prefix-caching
# Less VRAM and fewer concurrent requests
command: |
  vllm serve ... \
    --gpu-memory-utilization 0.7 \
    --max-num-seqs 8
Before declaring a model ready:
- Model file created in models/
- command: uses correct HuggingFace repo ID
- env: has optimization flags
- litellm: has appropriate parameters
- Tested with ./runMe.sh model-name --build
- Logs show successful startup
- API responds with completions
- GPU memory stable under load
- Documentation updated in README.md
- Model card added to models/README.md
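The first few checklist items can be pre-checked automatically. The sketch below is an assumption-laden convenience, not a project tool: it scans a model file's text for unindented `command:`, `env:` and `litellm:` keys instead of parsing YAML, so it needs no third-party packages. The name `missing_sections` is hypothetical.

```python
import re

REQUIRED_SECTIONS = ("command", "env", "litellm")


def missing_sections(yaml_text):
    """Return required top-level keys that do not appear in the YAML text.

    This looks for unindented `key:` lines rather than parsing YAML, so it is
    a quick pre-flight check, not a validator.
    """
    found = set(re.findall(r"^([A-Za-z_][\w-]*):", yaml_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

An empty list means all three sections are present. Variant files that inherit sections through a merge key will be reported as incomplete, so treat the result as a hint for those files.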
- Main README: README.md - User-facing documentation
- Model Specs: models/README.md - Model-specific details
- vLLM Docs: https://docs.vllm.ai/
- LiteLLM Docs: https://docs.litellm.ai/
Remember: When in doubt, start from a known-working model and make incremental changes. Test each change by rebuilding and restarting the container.