---
title: VLLM Llm-d InferenceControlTower
colorFrom: pink
colorTo: purple
sdk: docker
pinned: false
---
A lightweight control tower UI + gateway that routes LLM requests across workers with cache-aware and queue-aware logic. It demonstrates routing, retries, rate limits, chaos injection, and observability patterns without requiring a full GPU cluster.
Live demo (HF Spaces): https://huggingface.co/spaces/Subramanyam6/vLLM_llm-d_InferenceControlTower
```
git clone https://huggingface.co/spaces/Subramanyam6/vLLM_llm-d_InferenceControlTower
cd vLLM_llm-d_InferenceControlTower
./run_all.sh
```
This starts the API + UI + the local gRPC simulator + the local llm-d path.
Open http://127.0.0.1:5173.
Default (SIM) mode:
- No local services required.
- Everything runs inside the control tower process.
```
MODE=GRPC ./run_all.sh
```
- Uses local gRPC backend workers.
- UI is locked to gRPC in this mode to avoid misrouted traffic.
```
MODE=LLMD ./run_all.sh
```
- Uses the local llm-d gateway workflow.
- Requests pass through SGLang first, then continue to llm-d (`ENABLE_SGLANG_FRONT=1` by default).
- UI is locked to llm-d in this mode.
Hosted deployments stay lightweight and cost-safe:
- SIM only; gRPC and llm-d are disabled.
Backend flags:
- `DISABLE_GRPC=1` to force SIM fallback
- `ENABLE_LLMD_LOCAL=1` to allow llm-d mode
- `ENABLE_SGLANG_FRONT=1` to route llm-d requests through SGLang first
- `SGLANG_HTTP_URL=http://127.0.0.1:30000`
- `VLLM_HTTP_URL=http://127.0.0.1:8001` (optional OpenAI HTTP target for the gRPC track)
- `PROMETHEUS_METRICS_ENABLED=1` to expose `/metrics` from the API
- `OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318/v1/traces` to send spans to an OTel Collector
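The backend flags are plain environment variables. A minimal sketch of how a gateway process might read them and resolve its effective backend track — the precedence shown here is illustrative, not the actual logic in `app.py`:

```python
import os

def truthy(name: str, default: str = "0") -> bool:
    """Read a boolean-style env flag ('1', 'true', 'yes' count as on)."""
    return os.getenv(name, default).strip().lower() in {"1", "true", "yes"}

def resolve_backend_mode() -> str:
    """Pick the effective backend track from the flags above.

    Hypothetical helper: DISABLE_GRPC wins, then the requested MODE.
    """
    if truthy("DISABLE_GRPC"):
        return "SIM"  # forced SIM fallback
    if os.getenv("MODE") == "LLMD" and truthy("ENABLE_LLMD_LOCAL"):
        return "LLMD"
    if os.getenv("MODE") == "GRPC":
        return "GRPC"
    return "SIM"

os.environ["DISABLE_GRPC"] = "1"
print(resolve_backend_mode())  # DISABLE_GRPC forces SIM
```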
Frontend flags:
- `VITE_ENABLE_LLMD=1`
- `VITE_DISABLE_GRPC=1`
- `VITE_LOCK_MODE=1`
Start local observability services:
```
scripts/observability_stack.sh up
```
Endpoints:
- Prometheus: http://127.0.0.1:9090
- Grafana: http://127.0.0.1:3001 (admin/admin)
- Jaeger: http://127.0.0.1:16686
Run API with tracing + metrics:
```
OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318/v1/traces \
PROMETHEUS_METRICS_ENABLED=1 \
MODE=LLMD ./run_all.sh
```
The API now exposes Prometheus metrics at http://127.0.0.1:8000/metrics.
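The `/metrics` endpoint serves the standard Prometheus text exposition format. A minimal stdlib parser for spot-checking it in a test or script — the `gateway_requests_total` metric name is invented for illustration; the gateway's real metric names may differ:

```python
def parse_prometheus_text(body: str) -> dict:
    """Map 'name{labels}' -> float value, skipping comments and blank lines."""
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore malformed lines
    return samples

sample_body = """\
# HELP gateway_requests_total Total requests seen by the gateway.
# TYPE gateway_requests_total counter
gateway_requests_total{worker="A"} 1042
gateway_requests_total{worker="B"} 998
"""
metrics = parse_prometheus_text(sample_body)
print(metrics['gateway_requests_total{worker="A"}'])  # 1042.0
```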
- `smoke`: 100,000 requests
- `standard`: 1,000,000 requests
- `endurance`: 5,000,000 requests

Profile definitions live in `load/profiles.json`.
```
scripts/run_stress.sh --profile smoke
```
Default stress payload scale is 5 workers (A-E).
Optional overrides:
```
scripts/run_stress.sh --profile standard \
  --backend LLMD \
  --api-url http://127.0.0.1:8000/api/submit \
  --rate 1200 --vus 300 --duration 900s --scale 5
```
The stress workflow is llm-d focused; non-LLMD backends are rejected by `run_stress.sh`.
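A quick sanity check when choosing `--rate` and `--duration` overrides: the sustained rate must be high enough to reach the profile's request total within the run window. A small sketch using the profile totals listed above:

```python
import math

PROFILE_TOTALS = {  # request totals from the stress profiles above
    "smoke": 100_000,
    "standard": 1_000_000,
    "endurance": 5_000_000,
}

def min_duration_s(profile: str, rate_rps: int) -> int:
    """Seconds of sustained load needed to reach the profile's total."""
    return math.ceil(PROFILE_TOTALS[profile] / rate_rps)

# At the example override rate of 1200 req/s:
print(min_duration_s("standard", 1200))  # 834 -> a 900s duration has headroom
```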
Dry-run config validation:
```
scripts/run_stress.sh --profile endurance --dry-run
```
Each run writes:
- `reports/load/<timestamp>/results.json`
- `reports/load/<timestamp>/summary.md`
- `frontend/public/stress/latest.json` (UI latest report)
- `frontend/public/stress/dated/<YYYY-MM-DD>.json` (UI dated report)
If Prometheus is running, `run_stress.sh` also attaches a real observability snapshot to `results.json` under the `observability` field (request rate, latency histogram, and the worker distribution seen by metrics).
results.json includes:
- profile
- total attempted/succeeded/failed
- error rate
- p50/p95/p99 latency
- throughput (req/s)
- backend mode
- target URL
- start/end timestamps
- worker distribution raw counts for A-E (`worker_distribution`)
- worker distribution percentages for A-E (`worker_distribution_pct`)
- raw extras (`other`, `missing`, `counted`) in `worker_distribution_meta`
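A sketch of how the percentage and meta fields can be derived from raw per-worker counts — the field names mirror the `results.json` fields above, but the derivation itself is illustrative:

```python
def distribution_report(raw: dict, workers=("A", "B", "C", "D", "E")) -> dict:
    """Split raw per-worker counts into pct + meta, mirroring results.json."""
    counted = sum(raw.get(w, 0) for w in workers)
    missing = raw.get("missing", 0)
    other = sum(v for k, v in raw.items() if k not in workers and k != "missing")
    pct = {w: round(100 * raw.get(w, 0) / counted, 2) if counted else 0.0
           for w in workers}
    return {
        "worker_distribution": {w: raw.get(w, 0) for w in workers},
        "worker_distribution_pct": pct,
        "worker_distribution_meta": {"other": other, "missing": missing,
                                     "counted": counted},
    }

report = distribution_report({"A": 210, "B": 190, "C": 200, "D": 205,
                              "E": 195, "missing": 10})
print(report["worker_distribution_pct"]["A"])  # 21.0
```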
For non-SIM backends, the API now emits `worker_identity` when upstream identity is available (response body or headers). If identity is unavailable, `selected_worker` may be null and is counted as `missing` in load reports.
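A hedged sketch of that identity-extraction fallback: prefer a response header, fall back to a body field, otherwise return `None` so the report counts it as missing. The `x-worker-id` header and `worker` body key are hypothetical names, not the actual upstream contract:

```python
import json

def extract_worker_identity(headers: dict, body: bytes):
    """Header first, then body field, else None (counted as missing)."""
    ident = headers.get("x-worker-id")
    if ident:
        return ident
    try:
        return json.loads(body).get("worker")
    except (ValueError, AttributeError):
        return None  # unparseable body or non-object JSON

print(extract_worker_identity({}, b'{"worker": "C"}'))  # C
print(extract_worker_identity({}, b"not json"))         # None
```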
summary.md includes:
- run metadata
- key metrics table
- worker distribution table for A-E
- SLO pass/fail section
- top failure reasons
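As an illustration of the key-metrics table in `summary.md`, a small renderer over `results.json`-style fields — the field names such as `p50_ms` are assumptions, not the exact schema:

```python
def render_metrics_table(results: dict) -> str:
    """Render the key-metrics section of summary.md as a markdown table."""
    rows = [
        ("error rate", f"{results['error_rate']:.2%}"),
        ("p50 latency", f"{results['p50_ms']} ms"),
        ("p95 latency", f"{results['p95_ms']} ms"),
        ("p99 latency", f"{results['p99_ms']} ms"),
        ("throughput", f"{results['throughput_rps']} req/s"),
    ]
    lines = ["| metric | value |", "|---|---|"]
    lines += [f"| {k} | {v} |" for k, v in rows]
    return "\n".join(lines)

table = render_metrics_table({"error_rate": 0.0042, "p50_ms": 38,
                              "p95_ms": 121, "p99_ms": 240,
                              "throughput_rps": 1185})
print(table.splitlines()[2])  # | error rate | 0.42% |
```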
Run the API with tracing enabled:
```
OTEL=1 python app.py
```
Run the benchmark:
```
python bench.py
```
Kubernetes deployment assets live in `k8s_mesh/` and `k8s_llmd/`.
- `app.py`: API server + UI static host
- `core.py`: gateway, routing, workers, retries, OpenAI HTTP bridges
- `frontend/`: React UI (Vite)
- `grpc_backend/server.py`: local gRPC inference server
- `load/`: k6 script + load profiles
- `scripts/`: stress runner + report renderer
- `tests/`: unit/integration tests