A good workflow would be:
- Provision a GPU instance on Verda cloud
- SSH into the instance
- Run the container inside the instance with the flags generated from the template
> [!IMPORTANT]
> If the template has tensor-parallel-size set to 2, ensure the provisioned instance has the required number of GPUs.
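After SSH-ing in, a quick way to confirm the instance matches the template is to count the visible GPUs. This is a minimal sketch; the SSH user, key path, and instance IP are placeholders for your own Verda instance, not values from this repo:

```bash
# Placeholders: substitute your own SSH user, key, and instance IP
ssh -i ~/.ssh/id_ed25519 ubuntu@<instance-ip>

# Each GPU is printed on its own line; the count should be at least
# the template's tensor-parallel-size (2 for llama-vllm.json)
nvidia-smi -L | wc -l
```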
Example:
For the llama-vllm.json template, first generate the correct flags on your local machine by running:

```bash
go run examples/mapping_usage.go templates/llama-vllm.json
```

which should output:

```text
Loading mappings for engine: vllm
Generated Command:
--model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.95 --max-model-len 8192 --tensor-parallel-size 2 --dtype bfloat16 --max-num-seqs 256 --enable-prefix-caching --enable-chunked-prefill
Formatted Command:
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-model-len 8192 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-num-seqs 256 \
--enable-prefix-caching \
--enable-chunked-prefill
```

> [!TIP]
> For vLLM we use the "vllm/vllm-openai:v0.13.0" Docker image (see deploy.go).
Then, inside the SSH session on the instance, run:

```bash
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:v0.13.0 --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.95 --max-model-len 8192 --tensor-parallel-size 2 --dtype bfloat16 --max-num-seqs 256 --enable-prefix-caching --enable-chunked-prefill
```

> [!TIP]
> To use HuggingFace gated models, provide a $HF_TOKEN via --env "HF_TOKEN=$HF_TOKEN".
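For example, a gated-model run might look like the sketch below. It assumes HF_TOKEN is already set in the SSH session; note that --env has to appear before the image name so Docker reads it as a container option rather than a vLLM flag:

```bash
# HF_TOKEN is assumed to be exported in your shell before running this
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  --env "HF_TOKEN=$HF_TOKEN" \
  vllm/vllm-openai:v0.13.0 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```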
If all goes well, this should launch a vLLM server.
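Once the server is up, you can sanity-check it from inside the instance using the OpenAI-compatible endpoints vLLM exposes on port 8000. A minimal check, assuming the default port mapping from the command above:

```bash
# List the models the server is serving
curl http://localhost:8000/v1/models

# Send a small chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 16
      }'
```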
Once the template works correctly, you can deploy it to the Serverless Containers service.