The vLLM backend is a pure Python backend — it contains no C/C++ or HIP/CUDA code. It acts as a thin wrapper that bridges Triton's Python backend interface with the vLLM inference engine. All heavy lifting (inference, paged attention, continuous batching) is performed by the vLLM engine.
| File | Purpose |
|---|---|
| `src/model.py` | `TritonPythonModel` class — the entry point that Triton loads. Receives requests, forwards them to the vLLM AsyncEngine, and streams responses back (sketched below). |
| `src/utils/metrics.py` | vLLM statistics and metrics integration with Triton. |
| `src/utils/request.py` | Request handling utilities for generate and embed operations. |
There is no CMake build or compilation step. The build process (driven by the
Triton server's `build.py`) is:
- Git clone this repository.
- Copy `src/model.py` and `src/utils/` into `/opt/tritonserver/backends/vllm/` (see the sketch after this list).
- Install the vLLM engine separately.
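For illustration, the copy step amounts to roughly the following. This is a hedged sketch: `vllm_backend` stands in for wherever the repository was cloned, and the destination is the conventional backend directory named above.

```python
# Rough sketch of the file-copy step performed by the server's build.py.
# The source path ("vllm_backend") is a placeholder for the cloned repository.
import shutil
from pathlib import Path

repo = Path("vllm_backend")
dest = Path("/opt/tritonserver/backends/vllm")

dest.mkdir(parents=True, exist_ok=True)
shutil.copy2(repo / "src" / "model.py", dest / "model.py")
shutil.copytree(repo / "src" / "utils", dest / "utils", dirs_exist_ok=True)
```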
The Python `model.py` itself is hardware-agnostic — it calls vLLM's Python API
(`AsyncEngineArgs`, `build_async_engine_client_from_engine_args`), and vLLM
internally handles whether it is running on CUDA or ROCm.
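As a rough illustration of that call path, the snippet below builds an engine client from `AsyncEngineArgs`. It is a hedged sketch: the import path of `build_async_engine_client_from_engine_args` has moved between vLLM releases (it sits under the OpenAI entrypoints in recent versions), and the model name is a placeholder. The point is that nothing in this code mentions CUDA or ROCm; device selection happens inside vLLM.

```python
import asyncio

from vllm.engine.arg_utils import AsyncEngineArgs
# Module path may differ between vLLM versions; it is an assumption here.
from vllm.entrypoints.openai.api_server import build_async_engine_client_from_engine_args


async def main():
    engine_args = AsyncEngineArgs(model="facebook/opt-125m")  # placeholder model

    # The helper is an async context manager that yields an engine client.
    # Whether the engine runs on CUDA or ROCm is decided by vLLM internally.
    async with build_async_engine_client_from_engine_args(engine_args) as client:
        print(type(client).__name__)


asyncio.run(main())
```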
Since this backend is pure Python, ROCm support does not require hipification or any C/C++ changes in this repository. The ROCm enablement happens in two places outside this repo:
- vLLM engine — vLLM has its own ROCm support.
- Triton server — the server's own C++ code (shared memory manager, gRPC/HTTP
  endpoints, etc.) has `#ifdef TRITON_ENABLE_ROCM` guards that swap CUDA API
  calls for HIP equivalents. Those changes live in the server repository, not here.