
# TRL Gatekeeper

**Token Rate Limiter**

A reference implementation of token-aware rate limiting for multi-tenant LLM APIs, built to explore how fairness, burst control, and real-time observability work in practice.

[Problem](#the-problem) · [How It Works](#how-it-works) · [Live Demo](#live-demo) · [Architecture](#architecture) · [Test Harness](#test-harness) · [When To Use This](#when-to-use-this)


## The Problem

A single 128k-token LLM request consumes roughly 1,000× more GPU compute than a 128-token request — but naive RPM (requests-per-minute) rate limiters treat them identically.

This creates two failure modes in multi-tenant systems:

  1. Starvation — One tenant sends fewer, larger requests and consumes disproportionate capacity
  2. Invisible unfairness — Your metrics show "all tenants under the RPM limit" while one tenant is using 80% of actual compute

Token-aware rate limiting (TPM — tokens per minute) solves this. Every major LLM API provider now enforces TPM limits server-side. This project implements the full system from scratch — token bucket algorithm, multi-tenant isolation, fairness measurement, and real-time visualization — as a learning reference and design artifact.

## How It Works

Three things working together:

### 1. Atomic Token Bucket (Redis + Lua)

Each tenant gets a token bucket. The entire check-and-decrement runs as a single Redis Lua script — atomic, no distributed locks, no race conditions.

```
Tenant requests 5,000 tokens
  → Refill bucket based on elapsed time (continuous, not interval-based)
  → If tokens ≥ 5,000 → decrement, return ALLOWED
  → Else → return DENIED + retry_after_ms (time until enough tokens refill)
```

Burst capacity lets idle tenants accumulate tokens above their per-minute baseline (configurable multiplier, e.g. 1.5×).
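
The shipped script isn't reproduced here, but the core check-and-decrement looks roughly like this sketch using redis-py's `register_script` (field names, argument order, and return shape are illustrative, not TRL's actual source):

```python
import time

import redis

# Illustrative sketch of the atomic bucket, not TRL's actual script.
TOKEN_BUCKET_LUA = """
local key      = KEYS[1]
local rate     = tonumber(ARGV[1])  -- refill rate, tokens/second
local capacity = tonumber(ARGV[2])  -- baseline * burst multiplier
local now      = tonumber(ARGV[3])  -- caller-supplied clock, seconds
local want     = tonumber(ARGV[4])

local state  = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last   = tonumber(state[2]) or now

-- Continuous refill: credit elapsed time, capped at burst capacity.
tokens = math.min(capacity, tokens + (now - last) * rate)

if tokens >= want then
    redis.call('HSET', key, 'tokens', tokens - want, 'last_refill', now)
    return {1, 0}
end
return {0, math.ceil((want - tokens) / rate * 1000)}  -- denied + retry_after_ms
"""

r = redis.Redis()
consume = r.register_script(TOKEN_BUCKET_LUA)

# 100k TPM with a 1.5x burst cap; ask for 5,000 tokens.
allowed, retry_after_ms = consume(
    keys=["bucket:my-app"],
    args=[100_000 / 60, 150_000, time.time(), 5_000],
)
```

Because the whole read-refill-decrement runs inside one script, two concurrent requests can never both see the same balance and double-spend it.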

### 2. Fairness Measurement (Jain's Index)

Most rate limiters answer "is this tenant over the limit?" — TRL also answers "is the system fair across tenants?"

Jain's fairness index: J = (Σxᵢ)² / (n · Σxᵢ²) where xᵢ = actual_throughput / allocated_throughput.

  • J = 1.0 → perfect fairness (all tenants get their fair share)
  • J = 0.25 → one of four tenants is consuming everything
  • J > 0.90 → system is healthy

This is computed in real time and displayed on the dashboard.
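
The arithmetic is small enough to sanity-check by hand; a minimal Python rendering of the formula above:

```python
# J = (Σx)² / (n · Σx²), with x_i = actual / allocated throughput per tenant.
def jains_index(xs: list[float]) -> float:
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

jains_index([1.0, 1.0, 1.0, 1.0])  # 1.0  -> every tenant gets its fair share
jains_index([4.0, 0.0, 0.0, 0.0])  # 0.25 -> one of four tenants takes everything
```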

### 3. Real-Time Dashboard

WebSocket-powered, updates every second. No build step — vanilla JS + Chart.js.

  • Bucket gauges — radial charts showing each tenant's remaining capacity
  • Token flow timeline — 5-minute rolling consumption per tenant
  • Rejection heatmap — who's getting throttled, and when
  • Fairness index — single number that tells you if the system is working
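
Programmatic consumers can read the same feed the dashboard uses. A minimal client sketch, assuming the third-party `websockets` package and a server on port 8000 (the payload shape is whatever the server pushes):

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

async def watch():
    # /ws/dashboard is the path from the API table below; port 8000 is assumed.
    async with websockets.connect("ws://localhost:8000/ws/dashboard") as ws:
        async for message in ws:
            print(json.loads(message))  # per-tenant stats, pushed every second

asyncio.run(watch())
```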

Try it now — open the live demo to see simulated multi-tenant traffic.

## Live Demo

No server needed. The dashboard has a built-in demo mode with simulated traffic:

```bash
# Option 1: Open directly
open dashboard/index.html?demo

# Option 2: Serve locally
python3 -m http.server 3000 -d dashboard
# Then visit http://localhost:3000?demo
```

The demo simulates 4 tenants with different tiers — one deliberately "hogging" to show how the rate limiter throttles it while protecting others.

What you'll see:

| Metric | Meaning |
|---|---|
| Fairness ~0.88 | One tenant is pushing limits — system is compensating |
| High rejections for one tenant | The hog is being throttled (working as intended) |
| Near-zero rejections for others | Other tenants are protected from the hog |
| TPM fluctuation | Realistic bursty traffic patterns |

## Architecture

```
Client Request
      │
      ▼
┌──────────────────┐     ┌───────────────────────┐
│  FastAPI         │────▶│  Redis                │
│  /consume        │     │  Lua Script (atomic)  │
│  /tenants        │◀────│  Token Bucket State   │
│  /metrics        │     └───────────────────────┘
└────────┬─────────┘
         │
    WebSocket /ws
         │
         ▼
┌──────────────────┐
│  Dashboard       │
│  Chart.js        │
│  Real-time       │
└──────────────────┘
```

### Redis Key Schema

| Key Pattern | Type | Purpose |
|---|---|---|
| `bucket:{tenant_id}` | HASH | Current tokens, last refill timestamp, request count |
| `metrics:{tenant_id}:consumed` | SORTED SET | Tokens consumed (scored by timestamp for windowed queries) |
| `metrics:{tenant_id}:rejected` | SORTED SET | Tokens rejected (scored by timestamp) |
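
Because events are scored by timestamp, a windowed read is one `ZRANGEBYSCORE` call. This sketch assumes members encode the token count (e.g. `"ts:tokens:uuid"`); the actual member encoding may differ:

```python
import time

import redis

r = redis.Redis(decode_responses=True)
now = time.time()

# All consumption events for my-app in the last 60 seconds.
members = r.zrangebyscore("metrics:my-app:consumed", now - 60, now)

# Assumes members look like "ts:tokens:uuid"; adjust to the real encoding.
tokens_last_minute = sum(int(m.split(":")[1]) for m in members)
```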

### API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/consume` | `{tenant_id, tokens}` → `{allowed, tokens_remaining, retry_after_ms}` |
| GET | `/tenants` | All tenants with current stats |
| GET | `/tenants/{id}/stats` | Per-tenant stats (configurable time window) |
| PUT | `/tenants/{id}/config` | Update tenant config at runtime |
| GET | `/metrics` | Prometheus exposition format |
| WS | `/ws/dashboard` | Real-time stats feed (1s interval) |
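
For example, a raw HTTP call against a locally running server (port 8000 is an assumption; use whatever you bind uvicorn to):

```python
import httpx  # or requests; the shape is the same

resp = httpx.post(
    "http://localhost:8000/consume",
    json={"tenant_id": "my-app", "tokens": 5000},
)
body = resp.json()
if body["allowed"]:
    ...  # proceed with the LLM call
else:
    ...  # wait body["retry_after_ms"] / 1000 seconds, then retry
```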

## Usage as a Library

```python
import asyncio

from trl import RateLimiter, BucketConfig
from trl.redis_backend import RedisBackend

async def main():
    backend = RedisBackend("redis://localhost:6379/0")
    limiter = RateLimiter(backend)
    await limiter.initialize()

    limiter.register_tenant("my-app", BucketConfig(max_tokens=100_000, burst_multiplier=1.5))

    result = await limiter.consume("my-app", tokens=5000)
    if result.allowed:
        ...  # proceed with LLM call
    else:
        ...  # back off for result.retry_after_ms

asyncio.run(main())
```

### FastAPI Middleware (Drop-in)

```python
from fastapi import FastAPI

from trl.middleware import TokenRateLimitMiddleware

app = FastAPI()
app.add_middleware(TokenRateLimitMiddleware, limiter=limiter)  # `limiter` from the snippet above
# Reads X-Tenant-ID header, estimates tokens from request body
```
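
Client-side, nothing changes except the header. A hypothetical call through an app mounted behind the middleware (the route and payload are invented for illustration):

```python
import httpx

httpx.post(
    "http://localhost:8000/v1/generate",  # hypothetical route behind the middleware
    headers={"X-Tenant-ID": "my-app"},    # tenant resolved from this header
    json={"prompt": "Summarize the attached report."},
)
```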

## Test Harness

The built-in harness simulates multi-tenant traffic and measures system behavior against defined thresholds.

### Scenarios

| Scenario | What It Tests | Pass Criteria |
|---|---|---|
| Normal Load | Each tenant at 60% TPM | Rejection < 1%, fairness > 0.95 |
| Burst | 30s idle → all burst at 200% | Burst served > 130% of baseline, recovery < 15s |
| Hogging | One tenant at 300%, others at 50% | Hog rejected > 60%, others rejected < 2% |
| At Capacity | All tenants at 110% | Each serves 90-100% of allocation, fairness > 0.90 |
| With vs Without | Same traffic, limiter on vs off | Quantified fairness improvement |

```bash
# Run with Docker
docker compose up
python3 -m harness.runner --config configs/openclaw_4agents.yaml --scenario all

# Run locally (requires Redis)
pip install -e ".[dev,harness]"
python3 -m harness.runner --config configs/openclaw_4agents.yaml --scenario hogging
```

### Tenant Tiers

| Tier | TPM | Burst | Use Case |
|---|---|---|---|
| free | 10K | 1.2× | Trial |
| basic | 50K | 1.2× | Small apps |
| standard | 100K | 1.5× | Production |
| premium | 500K | 2.0× | High-volume |
| enterprise | 2M | 2.0× | Dedicated |
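
One way to wire these tiers into `register_tenant`, assuming `max_tokens` is the per-minute budget as in the library example above (this mapping is illustrative, not shipped config):

```python
from trl import BucketConfig

# Illustrative tier table; TRL's actual config format may differ.
TIERS = {
    "free":       BucketConfig(max_tokens=10_000,    burst_multiplier=1.2),
    "basic":      BucketConfig(max_tokens=50_000,    burst_multiplier=1.2),
    "standard":   BucketConfig(max_tokens=100_000,   burst_multiplier=1.5),
    "premium":    BucketConfig(max_tokens=500_000,   burst_multiplier=2.0),
    "enterprise": BucketConfig(max_tokens=2_000_000, burst_multiplier=2.0),
}

limiter.register_tenant("my-app", TIERS["standard"])
```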

## When To Use This

Use this if you're:

  • Learning how token-aware rate limiting works under the hood
  • Building a multi-tenant AI product and need a reference for your own implementation
  • Preparing for system design interviews (rate limiting is a top-5 question)
  • Evaluating fairness properties of different rate limiting strategies

Don't use this if you're:

  • Looking for a production API gateway — use LiteLLM (39k stars, support for 100+ providers)
  • A single developer hitting an LLM API — the provider already rate-limits you server-side
  • Running at scale with enterprise SLAs — use a managed solution (Portkey, AWS API Gateway)

## How This Compares

| | TRL | LiteLLM | Provider API |
|---|---|---|---|
| Purpose | Reference implementation + learning | Production proxy | Server-side enforcement |
| Token-aware | Yes | Yes | Yes |
| Multi-tenant | Yes | Yes | Per-org |
| Dashboard | Yes (built-in) | Separate UI | Provider console |
| Fairness index | Yes (Jain's) | No | No |
| Deployment | Standalone | Full proxy | N/A |
| Complexity | ~2K lines Python | 800+ contributors | Managed |

## Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Token bucket state | Redis Lua script | Atomic check-and-decrement without distributed locks |
| Metrics storage | Redis sorted sets | O(log N) time-range queries via ZRANGEBYSCORE |
| Dashboard | Vanilla JS + Chart.js | No build step, ships as static files |
| Fairness metric | Jain's index | Well-studied, single number, range [0, 1] |
| Tests | fakeredis with Lua support | Fast unit tests without a Docker dependency |

## Development

```bash
pip install -e ".[dev]"
python3 -m pytest tests/ -v         # 25 tests
python3 -m ruff check src/ server/  # lint
```

## License

MIT


Built by Joaquin
