A good workflow would be:
- Provision a GPU instance on Verda cloud
- SSH into the instance
- Run the container inside the instance with the flags generated from the template
> [!IMPORTANT]
> If the template has tensor-parallel-size set to 2, ensure the provisioned instance has the required number of GPUs.
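After SSH-ing in, a quick way to confirm the instance matches the template is to count the visible GPUs. This is a minimal sketch; the SSH user, key path, and instance IP are placeholders for your own Verda instance, not values from this repo:

```bash
# Placeholders: substitute your own SSH user, key, and instance IP
ssh -i ~/.ssh/id_ed25519 ubuntu@<instance-ip>

# Each GPU is printed on its own line; the count should be at least
# the template's tensor-parallel-size (2 for llama-vllm.json)
nvidia-smi -L | wc -l
```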
Example:
For the llama-vllm.json template, first generate the correct flags on your local machine by running:

```bash
go run examples/mapping_usage.go templates/llama-vllm.json
```

which should output:

```text
Loading mappings for engine: vllm
Generated Command:
--model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.95 --max-model-len 8192 --tensor-parallel-size 2 --dtype bfloat16 --max-num-seqs 256 --enable-prefix-caching --enable-chunked-prefill
Formatted Command:
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-model-len 8192 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-num-seqs 256 \
--enable-prefix-caching \
--enable-chunked-prefill
```

> [!TIP]
> For vLLM we use the "vllm/vllm-openai:v0.13.0" Docker image (see deploy.go).
Then, inside the SSH session on the instance, run:

```bash
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:v0.13.0 --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.95 --max-model-len 8192 --tensor-parallel-size 2 --dtype bfloat16 --max-num-seqs 256 --enable-prefix-caching --enable-chunked-prefill
```

> [!TIP]
> To use HuggingFace gated models, provide a $HF_TOKEN via --env "HF_TOKEN=$HF_TOKEN".
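For example, a gated-model run might look like the sketch below. It assumes HF_TOKEN is already set in the SSH session; note that --env has to appear before the image name so Docker reads it as a container option rather than a vLLM flag:

```bash
# HF_TOKEN is assumed to be exported in your shell before running this
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  --env "HF_TOKEN=$HF_TOKEN" \
  vllm/vllm-openai:v0.13.0 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```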
If all goes well, this should launch a vLLM server.
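Once the server is up, you can sanity-check it from inside the instance using the OpenAI-compatible endpoints vLLM exposes on port 8000. A minimal check, assuming the default port mapping from the command above:

```bash
# List the models the server is serving
curl http://localhost:8000/v1/models

# Send a small chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 16
      }'
```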
Once the template works correctly, you can deploy it to the Serverless Containers service.