
--rate-type=sweep does not load vllm with expected count of requests #272

@psydok

Description

Describe the bug
I ran a sweep test to find the maximum throughput of the service. guidellm reported 0.5 RPS (whereas vllm/benchmarks/benchmark_serving.py showed 0.7 RPS when iterating over concurrency levels).
I opened the vLLM Grafana dashboard to observe the load on the service. guidellm reports 0.5-0.55 RPS, but in Grafana I only see 13-15 requests.

Expected behavior
0.5 RPS equals 30 requests per minute. Shouldn't roughly 30 requests per minute be processed if the service sustains 0.5 RPS?
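For reference, the arithmetic behind this expectation, plus one possible reading of the gap: if the Grafana panel is a gauge of requests currently in flight rather than a per-minute counter (an assumption; the panel is not named here), Little's Law (L = λ·W) relates the two numbers. A minimal sketch, with a purely hypothetical 30 s mean per-request latency:

    # Sketch of the arithmetic only; the 30 s mean latency is hypothetical.
    rate_rps = 0.5                          # throughput reported by guidellm
    requests_per_minute = rate_rps * 60     # 30 requests completed per minute
    mean_latency_s = 30.0                   # hypothetical per-request latency
    in_flight = rate_rps * mean_latency_s   # Little's Law: L = lambda * W -> 15
    print(requests_per_minute, in_flight)   # 30.0 15.0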

Environment
Include all relevant environment information:

  1. OS: -
  2. Python version: 3.10
  3. Docker image: python:3.10-slim

To Reproduce
Exact steps to reproduce the behavior:

> pip install -U git+https://github.com/vllm-project/guidellm.git@1261fe81c57b07ed64333b5d50846699aa5307d4
> export GUIDELLM__PREFERRED_ROUTE="chat_completions"
> export GUIDELLM__OPENAI__MAX_OUTPUT_TOKENS=512
> export GUIDELLM_MAX_REQUESTS=1000
> export GUIDELLM__REQUEST_TIMEOUT=600
> guidellm benchmark \
>     --target http://localhost:8000 \
>     --rate-type sweep \
>     --model Qwen/Qwen3-30B-A3B \
>     --processor Qwen/Qwen3-30B-A3B \
>     --random-seed 2025 \
>     --max-seconds 300 \
>     --data "prompt_tokens=4096,output_tokens=512,samples=1000" \
>     --backend-args '{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}' \
>     --output-path "data/benchmarks.json"
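
As an independent cross-check (a sketch, not part of guidellm): the script below fires requests at a fixed 0.5 RPS against the same endpoint and reports both completed-request throughput and the peak number of requests in flight. The model name and payload fields mirror the command above; the prompt text and the httpx dependency are assumptions.

    # Minimal cross-check, independent of guidellm: send requests at a fixed
    # arrival rate and report completed-request throughput (RPS) alongside
    # the peak number of requests in flight. Endpoint/model/payload mirror
    # the repro command above; the prompt text is a stand-in.
    import asyncio
    import time

    import httpx

    TARGET = "http://localhost:8000/v1/chat/completions"
    MODEL = "Qwen/Qwen3-30B-A3B"
    RATE_RPS = 0.5       # constant arrival rate
    NUM_REQUESTS = 30    # roughly one minute of traffic at 0.5 RPS

    in_flight = 0
    peak_in_flight = 0

    async def one_request(client: httpx.AsyncClient) -> None:
        global in_flight, peak_in_flight
        in_flight += 1
        peak_in_flight = max(peak_in_flight, in_flight)
        try:
            await client.post(TARGET, json={
                "model": MODEL,
                "messages": [{"role": "user", "content": "ping"}],
                "max_tokens": 512,
                "chat_template_kwargs": {"enable_thinking": False},
            }, timeout=600.0)
        finally:
            in_flight -= 1

    async def main() -> None:
        async with httpx.AsyncClient() as client:
            start = time.monotonic()
            tasks = []
            for _ in range(NUM_REQUESTS):
                tasks.append(asyncio.create_task(one_request(client)))
                await asyncio.sleep(1.0 / RATE_RPS)  # constant-rate arrivals
            await asyncio.gather(*tasks)
            elapsed = time.monotonic() - start
            print(f"{NUM_REQUESTS / elapsed:.2f} RPS completed, "
                  f"peak in-flight: {peak_in_flight}")

    asyncio.run(main())

If this also lands near 0.5 RPS completed while the in-flight peak stays well below 30, the two dashboards may be measuring different quantities rather than disagreeing.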

Errors

guidellm output: (screenshot)

vLLM Grafana dashboard: (screenshot)

