A reference implementation of token-aware rate limiting for multi-tenant LLM APIs.
Built to explore how fairness, burst control, and real-time observability work in practice.
Problem · How It Works · Live Demo · Architecture · Test Harness · When To Use This
A single 128k-token LLM request consumes roughly 1,000× more GPU compute than a 128-token request, yet naive RPM (requests-per-minute) rate limiters treat the two identically.
This creates two failure modes in multi-tenant systems:
- Starvation — one tenant sends fewer but much larger requests, consuming a disproportionate share of capacity and starving the others
- Invisible unfairness — your metrics show every tenant under its RPM limit while one tenant uses 80% of the actual compute
Token-aware rate limiting (TPM — tokens per minute) solves this. Every major LLM API provider now enforces TPM limits server-side. This project implements the full system from scratch — token bucket algorithm, multi-tenant isolation, fairness measurement, and real-time visualization — as a learning reference and design artifact.
Three things working together:
Each tenant gets a token bucket. The entire check-and-decrement runs as a single Redis Lua script — atomic, no distributed locks, no race conditions.
Tenant requests 5,000 tokens
→ Refill bucket based on elapsed time (continuous, not interval-based)
→ If tokens ≥ 5,000 → decrement, return ALLOWED
→ Else → return DENIED + retry_after_ms (time until enough tokens refill)
Burst capacity lets idle tenants accumulate tokens above their per-minute baseline (configurable multiplier, e.g. 1.5×).
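For intuition, here is the same check-and-decrement as a minimal plain-Python sketch. In TRL the equivalent logic runs as a single Lua script inside Redis so the refill, comparison, and decrement happen atomically; the names and shapes below are illustrative, not the project's actual code:

```python
import time
from dataclasses import dataclass

@dataclass
class Bucket:
    max_tokens: float        # baseline TPM allowance
    burst_multiplier: float  # e.g. 1.5 lets an idle tenant bank 1.5x the baseline
    tokens: float            # current balance
    last_refill: float       # unix timestamp of the last refill

def consume(bucket: Bucket, requested: float, now: float | None = None) -> dict:
    now = time.time() if now is None else now
    refill_rate = bucket.max_tokens / 60.0                  # tokens per second
    capacity = bucket.max_tokens * bucket.burst_multiplier  # burst ceiling
    # Continuous refill: credit tokens for elapsed time, capped at burst capacity.
    elapsed = max(0.0, now - bucket.last_refill)
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_rate)
    bucket.last_refill = now

    if bucket.tokens >= requested:
        bucket.tokens -= requested
        return {"allowed": True, "tokens_remaining": bucket.tokens, "retry_after_ms": 0}

    # Denied: report how long until enough tokens will have refilled.
    deficit = requested - bucket.tokens
    return {
        "allowed": False,
        "tokens_remaining": bucket.tokens,
        "retry_after_ms": int(deficit / refill_rate * 1000),
    }
```

Because refill is continuous rather than window-based, `retry_after_ms` is a precise figure ("wait this long") instead of "wait until the next minute boundary".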
Most rate limiters answer "is this tenant over the limit?" — TRL also answers "is the system fair across tenants?"
Jain's fairness index: J = (Σxᵢ)² / (n · Σxᵢ²) where xᵢ = actual_throughput / allocated_throughput.
- J = 1.0 → perfect fairness (all tenants get their fair share)
- J = 0.25 → one of four tenants is consuming everything
- J > 0.90 → system is healthy
This is computed in real-time and displayed on the dashboard.
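The index is cheap enough to recompute on every dashboard tick. A short sketch matching the formula above (the throughput ratios themselves come from the per-tenant metrics):

```python
def jains_fairness(ratios: list[float]) -> float:
    """Jain's index over x_i = actual_throughput / allocated_throughput."""
    denom = len(ratios) * sum(x * x for x in ratios)
    if denom == 0:
        return 1.0  # no traffic at all counts as trivially fair
    total = sum(ratios)
    return (total * total) / denom

# Four tenants, one consuming everything: the index collapses toward 1/n = 0.25.
print(jains_fairness([1.0, 1.0, 1.0, 1.0]))  # 1.0
print(jains_fairness([4.0, 0.0, 0.0, 0.0]))  # 0.25
```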
WebSocket-powered, updates every second. No build step — vanilla JS + Chart.js.
- Bucket gauges — radial charts showing each tenant's remaining capacity
- Token flow timeline — 5-minute rolling consumption per tenant
- Rejection heatmap — who's getting throttled, and when
- Fairness index — single number that tells you if the system is working
Try it now — open the live demo to see simulated multi-tenant traffic.
No server needed. The dashboard has a built-in demo mode with simulated traffic:
# Option 1: Open directly
open dashboard/index.html?demo
# Option 2: Serve locally
python3 -m http.server 3000 -d dashboard
# Then visit http://localhost:3000?demo

The demo simulates 4 tenants with different tiers; one deliberately "hogs" capacity to show how the rate limiter throttles it while protecting the others.
What you'll see:
| Metric | Meaning |
|---|---|
| Fairness ~0.88 | One tenant is pushing limits — system is compensating |
| High rejections for one tenant | The hog is being throttled (working as intended) |
| Near-zero rejections for others | Other tenants are protected from the hog |
| TPM fluctuation | Realistic bursty traffic patterns |
Client Request
│
▼
┌─────────────────┐ ┌──────────────────────┐
│ FastAPI │────▶│ Redis │
│ /consume │ │ Lua Script (atomic) │
│ /tenants │◀────│ Token Bucket State │
│ /metrics │ └──────────────────────┘
└────────┬────────┘
│
WebSocket /ws
│
▼
┌─────────────────┐
│ Dashboard │
│ Chart.js │
│ Real-time │
└─────────────────┘
| Key Pattern | Type | Purpose |
|---|---|---|
| `bucket:{tenant_id}` | HASH | Current tokens, last refill timestamp, request count |
| `metrics:{tenant_id}:consumed` | SORTED SET | Tokens consumed (scored by timestamp for windowed queries) |
| `metrics:{tenant_id}:rejected` | SORTED SET | Tokens rejected (scored by timestamp) |
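As a rough illustration of why sorted sets fit here, a write plus a rolling-window read could look like this with redis-py; the `"<timestamp>:<tokens>"` member encoding is an assumption for the sketch, not necessarily what TRL stores:

```python
import time
import redis.asyncio as redis

async def record_consumption(r: redis.Redis, tenant_id: str, tokens: int) -> None:
    # Score each event by its timestamp; the member string only needs to be unique.
    now = time.time()
    await r.zadd(f"metrics:{tenant_id}:consumed", {f"{now}:{tokens}": now})

async def consumed_events(r: redis.Redis, tenant_id: str, window_s: int = 300) -> list:
    # Timestamp scores turn "last N seconds" into a single ZRANGEBYSCORE call.
    now = time.time()
    return await r.zrangebyscore(f"metrics:{tenant_id}:consumed", now - window_s, now)
```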
| Method | Path | Description |
|---|---|---|
| POST | `/consume` | `{tenant_id, tokens}` → `{allowed, tokens_remaining, retry_after_ms}` |
| GET | `/tenants` | All tenants with current stats |
| GET | `/tenants/{id}/stats` | Per-tenant stats (configurable time window) |
| PUT | `/tenants/{id}/config` | Update tenant config at runtime |
| GET | `/metrics` | Prometheus exposition format |
| WS | `/ws/dashboard` | Real-time stats feed (1s interval) |
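For example, a minimal async client for `/consume`, assuming the API is served locally (the host and port are placeholders):

```python
import asyncio
import httpx

async def check_quota(tenant_id: str, tokens: int) -> dict:
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        resp = await client.post("/consume", json={"tenant_id": tenant_id, "tokens": tokens})
        # Response body: {"allowed": ..., "tokens_remaining": ..., "retry_after_ms": ...}
        return resp.json()

print(asyncio.run(check_quota("my-app", 5000)))
```

A denied response carries `retry_after_ms`, so a client can sleep for exactly that long instead of retrying blindly.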
from trl import RateLimiter, BucketConfig
from trl.redis_backend import RedisBackend
backend = RedisBackend("redis://localhost:6379/0")
limiter = RateLimiter(backend)
await limiter.initialize()
limiter.register_tenant("my-app", BucketConfig(max_tokens=100_000, burst_multiplier=1.5))
result = await limiter.consume("my-app", tokens=5000)
if result.allowed:
    ...  # proceed with LLM call
else:
    ...  # back off for result.retry_after_ms

from trl.middleware import TokenRateLimitMiddleware
app.add_middleware(TokenRateLimitMiddleware, limiter=limiter)
# Reads X-Tenant-ID header, estimates tokens from request body

The built-in harness simulates multi-tenant traffic and measures system behavior against defined thresholds.
| Scenario | What It Tests | Pass Criteria |
|---|---|---|
| Normal Load | Each tenant at 60% TPM | Rejection < 1%, Fairness > 0.95 |
| Burst | 30s idle → all burst at 200% | Burst served > 130% of baseline, Recovery < 15s |
| Hogging | One tenant at 300%, others at 50% | Hog rejected > 60%, Others rejected < 2% |
| At Capacity | All tenants at 110% | Each serves 90-100% of allocation, Fairness > 0.90 |
| With vs Without | Same traffic, limiter on vs off | Quantified fairness improvement |
# Run with Docker
docker compose up
python3 -m harness.runner --config configs/openclaw_4agents.yaml --scenario all
# Run locally (requires Redis)
pip install -e ".[dev,harness]"
python3 -m harness.runner --config configs/openclaw_4agents.yaml --scenario hogging

| Tier | TPM | Burst | Use Case |
|---|---|---|---|
| free | 10K | 1.2× | Trial |
| basic | 50K | 1.2× | Small apps |
| standard | 100K | 1.5× | Production |
| premium | 500K | 2.0× | High-volume |
| enterprise | 2M | 2.0× | Dedicated |
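One way to wire these tiers into the limiter, reusing `BucketConfig` and `limiter` from the quick-start above (the tier-to-config dictionary below is a sketch, not code shipped by the project):

```python
TIER_CONFIGS = {
    "free":       BucketConfig(max_tokens=10_000,    burst_multiplier=1.2),
    "basic":      BucketConfig(max_tokens=50_000,    burst_multiplier=1.2),
    "standard":   BucketConfig(max_tokens=100_000,   burst_multiplier=1.5),
    "premium":    BucketConfig(max_tokens=500_000,   burst_multiplier=2.0),
    "enterprise": BucketConfig(max_tokens=2_000_000, burst_multiplier=2.0),
}

# Register a tenant under its purchased tier.
limiter.register_tenant("acme-corp", TIER_CONFIGS["standard"])
```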
Use this if you're:
- Learning how token-aware rate limiting works under the hood
- Building a multi-tenant AI product and need a reference for your own implementation
- Preparing for system design interviews (rate limiting is a top-5 question)
- Evaluating fairness properties of different rate limiting strategies
Don't use this if you're:
- Looking for a production API gateway — use LiteLLM (39k stars, 100+ provider support)
- A single developer hitting an LLM API — the provider already rate-limits you server-side
- Running at scale with enterprise SLAs — use a managed solution (Portkey, AWS API Gateway)
| | TRL | LiteLLM | Provider API |
|---|---|---|---|
| Purpose | Reference implementation + learning | Production proxy | Server-side enforcement |
| Token-aware | Yes | Yes | Yes |
| Multi-tenant | Yes | Yes | Per-org |
| Dashboard | Yes (built-in) | Separate UI | Provider console |
| Fairness index | Yes (Jain's) | No | No |
| Deployment | Standalone | Full proxy | N/A |
| Complexity | ~2K lines Python | 800+ contributors | Managed |
| Decision | Choice | Why |
|---|---|---|
| Token bucket state | Redis Lua script | Atomic check-and-decrement without distributed locks |
| Metrics storage | Redis sorted sets | O(log N) time-range queries via ZRANGEBYSCORE |
| Dashboard | Vanilla JS + Chart.js | No build step, ships as static files |
| Fairness metric | Jain's index | Well-studied, single number, range [0,1] |
| Tests | fakeredis with Lua support | Fast unit tests without Docker dependency |
pip install -e ".[dev]"
python3 -m pytest tests/ -v # 25 tests
python3 -m ruff check src/ server/  # lint

MIT
Built by Joaquin