Standalone CLI tool for benchmarking VLLM offline deployments. Validates function calling (tool use), embedding, and vision/OCR capabilities.
pip install -r requirements.txtpython -m vllm_benchmark --base-url <URL> --model <MODEL> [OPTIONS]| Argument | Required | Description |
|---|---|---|
--base-url |
Yes | VLLM server base URL (e.g. http://localhost:8000/v1) |
--model |
Yes | Model name / identifier (e.g. Qwen/Qwen3-VL-8B-Thinking) |
--api-key |
No | API key (see API Key Resolution for fallback order) |
--chat |
No | Run function-calling (chat) benchmark |
--embedding |
No | Run embedding benchmark |
--vision |
No | Run vision/OCR benchmark |
--all |
No | Run all benchmarks |
--model-args |
No | JSON string of extra model kwargs (e.g. '{"temperature": 0}') |
--json |
No | Output results as JSON (for scripting) |
--verbose |
No | Enable debug logging |
At least one benchmark flag (--chat, --embedding, --vision, or --all) is required.
Note: A single vLLM instance serves one model type. Use
--chatand--visionagainst chat/VL models, and--embeddingagainst embedding models. The--allflag is only useful if your model supports all three capabilities (rare). In practice, run benchmarks separately against each deployment.
# Run all benchmarks
python -m vllm_benchmark --base-url http://localhost:8000/v1 --model Qwen/Qwen3-VL-8B-Thinking --all
# Run specific benchmarks
python -m vllm_benchmark --base-url http://localhost:8000/v1 --model my-model --chat --vision
# With API key
python -m vllm_benchmark --base-url http://localhost:8000/v1 --model my-model --all --api-key sk-xxx
# With extra model args
python -m vllm_benchmark --base-url http://localhost:8000/v1 --model my-model --chat --model-args '{"temperature": 0}'
# JSON output (for scripting)
python -m vllm_benchmark --base-url http://localhost:8000/v1 --model my-model --all --json
# Verbose logging
python -m vllm_benchmark --base-url http://localhost:8000/v1 --model my-model --all --verbosevllm serve Qwen/Qwen3-VL-8B-Thinking \
--max-model-len 131072 \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--reasoning-parser deepseek_r1 \
--max-num-seqs 10 \
--limit-mm-per-prompt.video 0 \
--async-scheduling \
--tensor-parallel-size 4All Qwen3.5 models are natively multimodal (vision + text) — there is no separate -VL variant.
vllm serve Qwen/Qwen3.5-4B \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--enable-prefix-caching \
--max-num-seqs 10vllm serve Qwen/Qwen3-Embedding-8B \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code| Flag | Description |
|---|---|
--tool-call-parser |
Parser for structured tool calls. Use hermes for Qwen3, qwen3_coder for Qwen3.5 |
--reasoning-parser |
Parser for thinking/reasoning content. Use deepseek_r1 for Qwen3, qwen3 for Qwen3.5 |
--enable-auto-tool-choice |
Let the model decide when to call tools |
--tensor-parallel-size N |
Distribute the model across N GPUs |
--mm-encoder-tp-mode data |
Data-parallel vision encoder for better throughput (Qwen3.5 VL) |
--mm-processor-cache-type shm |
Shared memory cache for preprocessed multimodal inputs (Qwen3.5 VL) |
--enable-prefix-caching |
Cache common prompt prefixes for faster inference |
--trust-remote-code |
Required for some models (e.g. embedding models) |
--limit-mm-per-prompt.video 0 |
Disable video input processing |
--max-num-seqs N |
Max concurrent sequences |
--max-model-len N |
Max context length |
Tests whether the model can correctly use tools via a LangChain ReAct agent. Sends a timezone query and verifies the model calls the ISO-Datetime-Getter tool.
Tests the embedding model by embedding a sample string and returning the vector dimensions.
Tests vision capabilities by sending a base64-encoded image containing "HELLO WORLD" and verifying the model can extract the text.
The CLI resolves API keys in this order:
--api-keyargumentVLLM_API_KEYenvironment variableOPENAI_API_KEYenvironment variable"EMPTY"(default for VLLM servers without auth)
0— All selected benchmarks passed1— At least one benchmark failed2— CLI argument error