This Compose stack runs from the GitHub repo linked here and starts the following services in Docker or Singularity mode:
- vLLM model server (OpenAI-compatible)
- RAG retrieval API (Chroma)
- Indexer (filesystem → Chroma, auto-updates)
- Enhanced Proxy exposing /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models
- Open WebUI (optional) pointing to the Proxy
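For context on how the optional UI is wired, Open WebUI is simply an OpenAI-compatible client pointed at the Proxy. The Compose file already handles this; the standalone sketch below is illustrative only, and the image tag plus the OPENAI_API_BASE_URL / OPENAI_API_KEY variables follow upstream Open WebUI defaults rather than anything specific to this stack:
# run Open WebUI against the Proxy on the host network (the UI listens on its default port 8080)
docker run -d --name open-webui --network host \
  -e OPENAI_API_BASE_URL="http://localhost:${PROXY_PORT}/v1" \
  -e OPENAI_API_KEY="unused" \
  ghcr.io/open-webui/open-webui:main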
See a turnkey demonstration of the workflow running on ACTIVATE at the link below:
Pull the model weights of your choice into a known directory. We recommend using git lfs to fetch the weights, since HTTPS clones pass through most firewalls and downloads are relatively fast:
cd /mymodeldir/
git lfs install
git clone https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
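Before moving on, it is worth confirming that the LFS objects actually downloaded, since a partial clone leaves pointer files behind. A quick check, assuming the clone above:
cd /mymodeldir/Llama-3_3-Nemotron-Super-49B-v1_5
git lfs ls-files    # '*' next to a file means its content is present; '-' means only a pointer
du -sh .            # a 49B-parameter model in BF16 is roughly 100 GB on disk
git lfs pull        # re-fetch any objects that are still pointers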
When running in Singularity mode, the workflow provides a field to pull a prebuilt vLLM Singularity container for you, but you can also pull it manually, for example with the authenticated pw CLI:
cd ~/pw/activate-rag-vllm
pw buckets cp pw://mshaxted/codeassist/vllm.sif ./
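Optionally, sanity-check the image before launching the stack. The inspect and import test below are a sketch; the last line assumes python3 and vllm are available inside the container:
ls -lh vllm.sif                # confirm the image transferred completely
singularity inspect vllm.sif   # print the labels/metadata baked into the image
singularity exec vllm.sif python3 -c "import vllm; print(vllm.__version__)"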
export HF_TOKEN=hf_xyz
export RUNMODE=docker # or singularity
export BUILD=true
export RUNTYPE=all # or vllm only
# run the service
./run.sh
Repository layout:
- docker-compose.yml - stack definition
- Dockerfile.rag - builds the RAG + Indexer + Proxy image
- rag_proxy.py - enhanced OpenAI-compatible proxy with streaming + extra endpoints
- rag_server.py - RAG search API
- indexer.py, indexer_config.yaml - auto indexer for filesystem changes
- docs/ - mount point for your documents
- cache/ - workload-specific data storage
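Model load can take several minutes, so it helps to wait for the Proxy to report healthy before sending requests. A minimal polling sketch using the same PROXY_PORT as in the examples below:
# poll the Proxy's /health endpoint until it responds
until curl -fsS "http://localhost:${PROXY_PORT}/health" > /dev/null 2>&1; do
  echo "waiting for proxy on port ${PROXY_PORT}..."
  sleep 5
done
echo "proxy is up"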
# Health
curl http://localhost:${PROXY_PORT}/health | jq
# Chat (non-stream)
curl -sS http://localhost:${PROXY_PORT}/v1/chat/completions -H 'content-type: application/json' -d '{"model":"'"${MODEL_NAME}"'","messages":[{"role":"user","content":"Summarize the docs."}], "max_tokens":200}' | jq
# Chat (stream)
curl -N http://localhost:${PROXY_PORT}/v1/chat/completions -H 'content-type: application/json' -d '{"model":"'"${MODEL_NAME}"'","messages":[{"role":"user","content":"Hello"}], "stream": true}'
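The remaining Proxy endpoints can be exercised the same way. The sketch below assumes the embeddings route accepts the standard OpenAI payload and that MODEL_NAME is also valid for embeddings; the Proxy may instead route embeddings to a dedicated model:
# List models
curl -sS http://localhost:${PROXY_PORT}/v1/models | jq
# Embeddings
curl -sS http://localhost:${PROXY_PORT}/v1/embeddings -H 'content-type: application/json' -d '{"model":"'"${MODEL_NAME}"'","input":"What does the indexer watch?"}' | jq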