A production-ready Docker-based deployment system for running multiple vLLM models with LiteLLM proxy integration. This setup provides OpenAI-compatible API endpoints for various large language models optimized for different use cases.
The easiest way to launch a model is using the runMe.sh script:
```shell
# List available models
./runMe.sh

# Launch a specific model
./runMe.sh step-3.5-flash

# Launch with rebuild
./runMe.sh glm47-flash --build

# Launch in detached mode (background)
./runMe.sh qwen3-next-coder -d
```

If you prefer using docker compose directly:
```shell
# Set the model and launch
MODEL=step-3.5-flash sudo docker compose up

# With rebuild
MODEL=step-3.5-flash sudo docker compose up --build

# In detached mode
MODEL=glm47-flash sudo docker compose up -d
```

Note: when using sudo docker compose, you must use `MODEL=name sudo docker compose up` (not `sudo MODEL=name docker compose up`) to ensure the environment variable is passed correctly.
This deployment supports multiple pre-configured models, each optimized for specific use cases:
| Model | Use Case | Context | Concurrency |
|---|---|---|---|
| `glm47-flash` | General-purpose reasoning & tool calling | 128K | Low (16) |
| `step-3.5-flash` | Long-context reasoning & problem-solving | Auto | Medium (24) |
| `step-3.5-flash-hcsw` | High-throughput inference | 8K | High (64) |
| `qwen3-next-coder` | Code generation and analysis | Variable | Medium |
For detailed model specifications and configuration, see models/README.md.
The system consists of three main services:
- vllm-node: Runs the vLLM inference engine with the selected model
- litellm: Provides a unified OpenAI-compatible API proxy with rate limiting and monitoring
- db: PostgreSQL database for LiteLLM's internal state and usage tracking
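As a rough sketch of how these services wire together, a stripped-down compose file might look like the following. This is illustrative only: the service names match the list above, but the images, ports, and settings shown here are simplified assumptions, not the contents of the real docker-compose.yml.

```yaml
services:
  vllm-node:
    build: .                  # image built from the repo's Dockerfile
    ports:
      - "8000:8000"           # direct vLLM endpoint
  litellm:
    image: ghcr.io/berriai/litellm:main-latest   # assumed image/tag
    ports:
      - "4000:4000"           # OpenAI-compatible proxy endpoint
    depends_on:
      - vllm-node
      - db
  db:
    image: postgres:16        # assumed version
    environment:
      POSTGRES_DB: litellm
```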
```
┌─────────────────┐
│   Your Client   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│     LiteLLM     │─────▶│  PostgreSQL  │
│   Port: 4000    │      │   Database   │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐
│   vLLM Engine   │
│   Port: 8000    │
└─────────────────┘
```
Global configuration is stored in vars.env. Common variables:
```shell
# NCCL Configuration
NCCL_P2P_DISABLE=1

# LiteLLM Defaults (optional, can override per-model)
#LITELLM_TEMPERATURE=0.7
#LITELLM_TOP_P=0.8
#LITELLM_MAX_TOKENS=65536
```

Each model has its own configuration file in the models/ directory:

- `models/glm47-flash.yml`
- `models/step-3.5-flash.yml`
- `models/step-3.5-flash-hcsw.yml`
- `models/qwen3-next-coder.yml`
These files define:
- The vLLM command with model-specific flags
- Environment variables for optimization
- LiteLLM API parameters
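The exact schema is documented in models/README.md, but a model file bundles those three pieces together. A hypothetical `models/example.yml` might look like this; only the `command: |` block is attested elsewhere in this README, and the other field names are illustrative assumptions:

```yaml
# Hypothetical model config -- field names other than "command" are illustrative
command: |
  vllm serve stepfun-ai/Step-3.5-Flash-FP8 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.96
environment:
  NCCL_P2P_DISABLE: "1"
litellm:
  temperature: 0.7
  max_tokens: 65536
```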
Sensitive credentials (API keys, tokens) should be stored in the secrets/ directory:
```shell
# Create secrets directory
mkdir -p secrets

# Add secrets as individual files (filename = env var name)
echo "your-hf-token" > secrets/HF_TOKEN
echo "your-api-key" > secrets/ANTHROPIC_API_KEY

# These will be automatically loaded as environment variables
```

See secrets/README.md for more details.
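The loading convention (filename = variable name, file contents = value) is simple enough to sketch. The snippet below is a hypothetical Python equivalent of `scripts/load_secrets_env.sh`, shown purely for illustration; the real loader is a shell script:

```python
import os
from pathlib import Path

def load_secrets(secrets_dir: str = "secrets") -> None:
    """Load every regular file in secrets_dir into the environment:
    the filename becomes the variable name and the file's stripped
    contents become its value. README.md is skipped. (Hypothetical
    re-implementation of scripts/load_secrets_env.sh.)"""
    for path in Path(secrets_dir).iterdir():
        if path.is_file() and path.name != "README.md":
            os.environ[path.name] = path.read_text().strip()
```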
Once launched, the services are available at:
- LiteLLM Proxy: http://localhost:4000 (Recommended - with rate limiting & monitoring)
- Direct vLLM: http://localhost:8000 (Direct access, no proxy features)
```python
from openai import OpenAI

# Use LiteLLM proxy (recommended)
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-FAKE"  # Default API key
)

response = client.chat.completions.create(
    model="vllm_agent",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)

print(response.choices[0].message.content)
```

```shell
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-FAKE" \
  -d '{
    "model": "vllm_agent",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Use the launcher scripts in launchers/ to route Claude Code through your local model:
```shell
# Launch Claude Code using your local vLLM instance
./launchers/local_claude.sh

# Or add to your PATH for easier access
export PATH="$PWD/launchers:$PATH"
local_claude.sh
```

See launchers/local_claude.sh for more options.
```shell
# Foreground (see logs in terminal)
./runMe.sh step-3.5-flash

# Background (detached mode)
./runMe.sh step-3.5-flash -d
```

```shell
# Stop all services
sudo docker compose down

# Stop and remove volumes (clean slate)
sudo docker compose down -v
```

```shell
# All services
sudo docker compose logs -f

# Specific service
sudo docker compose logs -f vllm-node
sudo docker compose logs -f litellm

# Last 100 lines
sudo docker compose logs --tail=100 vllm-node
```

```shell
# Stop current model
sudo docker compose down

# Start with different model
./runMe.sh qwen3-next-coder
```

```shell
# Rebuild container image
./runMe.sh step-3.5-flash --build

# Or with docker compose directly
MODEL=step-3.5-flash sudo docker compose up --build
```

```shell
# vLLM health
curl http://localhost:8000/health

# LiteLLM health
curl http://localhost:4000/health/liveliness

# List loaded models
curl -H "Authorization: Bearer sk-FAKE" http://localhost:8000/v1/models
```

```shell
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Docker container stats
sudo docker stats
```

```shell
# List running containers
sudo docker compose ps

# Check specific container health
sudo docker compose ps vllm-node
```

Problem: Running `MODEL="step-3.5-flash" sudo docker compose up` still launches glm47-flash.
Cause: The sudo command doesn't preserve environment variables by default.
Solutions:
- ✅ Use the runMe.sh script (Recommended):

  ```shell
  ./runMe.sh step-3.5-flash
  ```

- ✅ Put MODEL before sudo:

  ```shell
  MODEL=step-3.5-flash sudo docker compose up
  ```

- ⚠️ Use sudo -E (preserves all env vars):

  ```shell
  sudo -E docker compose up  # Note: This exposes ALL environment variables to sudo, which may be a security concern
  ```
Symptoms: Container crashes with CUDA out of memory errors.
Solutions:
- Switch to a smaller model or high-concurrency variant:

  ```shell
  ./runMe.sh step-3.5-flash-hcsw  # Smaller context window
  ```

- Reduce GPU memory utilization (edit the model's `.yml` file):

  ```yaml
  command: |
    vllm serve ... --gpu-memory-utilization 0.85  # Was 0.96
  ```

- Close other GPU-using processes:

  ```shell
  nvidia-smi  # Check what's using GPU
  ```
Symptoms: Container marked as unhealthy, LiteLLM can't connect.
Diagnosis:
```shell
# Check container logs
sudo docker compose logs vllm-node

# Check if vLLM port is accessible
curl http://localhost:8000/health
```

Common causes:
- Model download in progress (wait longer, check logs)
- Insufficient GPU memory (see OOM solutions above)
- Model configuration error (check model .yml file syntax)
Symptoms: Docker commands fail with permission errors.
Solutions:
- Add your user to the docker group:

  ```shell
  sudo usermod -aG docker $USER
  newgrp docker  # Activate immediately
  ```

- Or continue using sudo:

  ```shell
  ./runMe.sh step-3.5-flash  # Script handles sudo automatically (interactive terminal required if password prompt is needed)
  ```
Cause: Models are downloaded from HuggingFace on first run.
Solution:
- Be patient! Large models can be 10-50GB
- Monitor progress in logs:

  ```shell
  sudo docker compose logs -f vllm-node
  ```

- Pre-download models:

  ```shell
  huggingface-cli download stepfun-ai/Step-3.5-Flash-FP8
  ```
Symptoms: Error: "port is already allocated"
Solution:
```shell
# Check what's using the port
sudo lsof -i :8000
sudo lsof -i :4000

# Stop conflicting service or change ports in docker-compose.yml
```

```
vllm-server/
├── README.md                    # This file
├── docker-compose.yml           # Docker services definition
├── Dockerfile                   # Container image definition
├── runMe.sh                     # Simple model launcher script
├── run_vllm_agent.sh            # Container entrypoint script
├── vars.env                     # Global environment variables
├── generate_litellm_config.py   # LiteLLM config generator
├── litellm_config.template.yaml # LiteLLM template
│
├── models/                      # Model configurations
│   ├── README.md                # Model documentation
│   ├── glm47-flash.yml
│   ├── step-3.5-flash.yml
│   ├── step-3.5-flash-hcsw.yml
│   └── qwen3-next-coder.yml
│
├── scripts/                     # Helper scripts
│   ├── gen_models_yml.sh        # Model config builder
│   └── load_secrets_env.sh      # Secrets loader
│
├── secrets/                     # Sensitive credentials (gitignored)
│   ├── README.md
│   ├── HF_TOKEN                 # HuggingFace token
│   └── ANTHROPIC_API_KEY        # Anthropic API key
│
└── launchers/                   # Client launcher scripts
    ├── local_claude.sh          # Claude Code launcher
    ├── local_codex.sh           # Codex launcher
    └── open_code.sh             # VS Code launcher
```
- API Keys: The default API key is `sk-FAKE` - change this for production use
- Secrets: Never commit secrets to git. Use the `secrets/` directory (gitignored)
- Network: Services are exposed on localhost by default. Configure firewall rules for external access
- Sudo: The runMe.sh script uses sudo when needed. Review Docker group membership for sudo-less operation
When adding a new model:
- Create a new `.yml` file in the `models/` directory
- Follow the existing format (see models/README.md)
- Test with `./runMe.sh your-new-model`
- Update the models table in this README
- Add model card details to models/README.md
- Model Configurations: models/README.md - Detailed model specs and tuning guide
- vLLM Documentation: https://docs.vllm.ai/
- LiteLLM Documentation: https://docs.litellm.ai/
- Docker Compose Documentation: https://docs.docker.com/compose/
See LICENSE file for details.
- Issues: Check logs first (`sudo docker compose logs -f`)
- Documentation: See `models/README.md` for model-specific details
- vLLM Docs: https://docs.vllm.ai/en/latest/
- GPU Issues: Run `nvidia-smi` to check GPU status
Last Updated: February 2026
Compatible With: vLLM v1+, Docker Compose v2+