Run multiple LLM models concurrently on a single GPU using Ray Serve and the KubeRay operator.
Install the KubeRay operator:

```bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
```

Deploy the service:

```bash
kubectl apply -f ray-serve.yaml
```

Query the OpenAI-compatible endpoint:

```bash
# Chat completion
curl -X POST "http://<service-ip>:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
```bash
# List available models
curl "http://<service-ip>:8000/v1/models"
```

Key features:

- Multiple Models: Run several LLMs on one GPU simultaneously
- GPU Efficiency: Uses vLLM sleep mode to share GPU memory
- OpenAI Compatible: Works with any OpenAI-compatible client
- Auto-scaling: Automatically scales based on load
Important: Ensure the node has enough free host RAM to hold the offloaded weights of all models when sleep mode is in use.
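Sleep mode is what makes the GPU sharing possible: an idle engine offloads its weights to host RAM and releases GPU memory until it is woken again. As a rough illustration, here is a minimal sketch of vLLM's sleep-mode API using the offline `LLM` class (not this repo's code; assumes a vLLM build with sleep-mode support and a CUDA GPU):

```python
from vllm import LLM

# Sleep mode must be enabled at construction time.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
print(llm.generate("Hello!")[0].outputs[0].text)

# Level 1 offloads the weights to CPU RAM and discards the KV cache,
# freeing GPU memory for another model; this is why free host RAM is required.
llm.sleep(level=1)

# ... another engine can use the GPU here ...

llm.wake_up()  # copy the weights back to the GPU before serving again
```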
Edit ray-serve.yaml to change models:

```yaml
env_vars:
  MODELS: "model1,model2,model3"  # Your models here
```
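MODELS is a single comma-separated string. As a hypothetical sketch of how such a variable is typically consumed (the actual parsing in engine.py may differ):

```python
import os

# Hypothetical: split the comma-separated MODELS variable into model IDs;
# the real engine may parse it differently.
models = [m.strip() for m in os.environ.get("MODELS", "").split(",") if m.strip()]
print(models)  # e.g. ['model1', 'model2', 'model3']
```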
Or use the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<service-ip>:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
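Because the endpoint is OpenAI-compatible, the same client can also enumerate the deployed models or stream tokens; a short example (assumes the deployment has streaming enabled, as vLLM's OpenAI-compatible server normally does):

```python
from openai import OpenAI

client = OpenAI(base_url="http://<service-ip>:8000/v1", api_key="not-needed")

# Enumerate the models the engine currently serves
for model in client.models.list():
    print(model.id)

# Stream a completion token by token
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```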
Check the deployment:

```bash
# See if it's running
kubectl get rayservice llm-engine

# View logs
kubectl logs <pod-name>

# Access dashboard
kubectl port-forward service/llm-engine-head-svc 8265:8265
```

Prerequisites:

- Kubernetes cluster with GPU
- KubeRay operator
- NVIDIA drivers
- Hugging Face model access
Troubleshooting:

- Out of memory: Reduce the number of models or use smaller models
- Model not loading: Check model names and HF_TOKEN
- Connection issues: Verify service IP and ports
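For connection issues, a quick stdlib-only probe of the endpoint can rule out IP and port problems (a hypothetical helper, not part of the repo; substitute your service IP):

```python
import json
import urllib.request

# Hypothetical connectivity probe; replace <service-ip> with the real IP.
url = "http://<service-ip>:8000/v1/models"
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        data = json.load(resp)
        print("Reachable; models:", [m["id"] for m in data.get("data", [])])
except OSError as err:
    print("Endpoint unreachable:", err)
```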
For local development:

- Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

To run the LLM Engine on a remote Ray cluster for development:
- Create a virtualenv
- Install requirements:

```bash
pip install -r requirements.txt
```

- Run serve (see the connection check sketched after this list):

```bash
serve run --address ray://127.0.0.1:10001 --runtime-env-json='{"env_vars": {"VLLM_USE_V1": "1"}, "pip":["runai-model-streamer"], "working_dir": "./"}' engine:app
```
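Before `serve run`, you can verify that the Ray client port is actually reachable from your machine; a minimal sketch, assuming `ray` is installed locally and port 10001 of the head service is forwarded (e.g. with `kubectl port-forward service/llm-engine-head-svc 10001:10001`):

```python
import ray

# Connect to the remote cluster over the Ray client protocol.
ray.init("ray://127.0.0.1:10001")
print(ray.cluster_resources())  # CPUs/GPUs the cluster exposes
ray.shutdown()
```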