Describe the bug
I ran a sweep test to measure the maximum throughput of the service. guidellm reported 0.5 RPS (while vllm/benchmarks/benchmark_serving.py showed 0.7 RPS when iterating over concurrency levels).
I opened the vLLM Grafana dashboard to observe the load on the service. guidellm reports 0.5-0.55 RPS, but the dashboard shows 13-15 running requests.
Expected behavior
0.5 RPS equals 30 RPM. If the service really sustains 0.5 RPS, shouldn't I see around 30 requests being processed?
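For reference, a minimal Python sketch of the arithmetic behind this expectation; the 300 s window is taken from the --max-seconds flag in the reproduce command below and is used here only for illustration:

reported_rps = 0.5                       # throughput reported by guidellm for the sweep
requests_per_minute = reported_rps * 60  # 30.0, the "30 RPM" referenced above
benchmark_window_s = 300                 # --max-seconds from the reproduce command
expected_requests = reported_rps * benchmark_window_s  # ~150 requests over the full window
print(requests_per_minute, expected_requests)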
Environment
Include all relevant environment information:
- OS: -
- Python version: 3.10
- Docker image: python:3.10-slim
To Reproduce
Exact steps to reproduce the behavior:
> pip install -U git+https://github.com/vllm-project/guidellm.git@1261fe81c57b07ed64333b5d50846699aa5307d4
> export GUIDELLM__PREFERRED_ROUTE="chat_completions" && export GUIDELLM__OPENAI__MAX_OUTPUT_TOKENS=512 && export GUIDELLM_MAX_REQUESTS=1000 && export GUIDELLM__REQUEST_TIMEOUT=600
> guidellm benchmark --target http://localhost:8000 --rate-type sweep --model Qwen/Qwen3-30B-A3B --processor Qwen/Qwen3-30B-A3B --random-seed 2025 --max-seconds 300 --data "prompt_tokens=4096,output_tokens=512,samples=1000" --backend-args '{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}' --output-path "data/benchmarks.json"
Additional context
Add any other context about the problem here. Also include any relevant files.