
# TRL Gatekeeper

**Token Rate Limiter**

A reference implementation of token-aware rate limiting for multi-tenant LLM APIs, built to explore how fairness, burst control, and real-time observability work in practice.

[Problem](#the-problem) · [How It Works](#how-it-works) · [Live Demo](#live-demo) · [Architecture](#architecture) · [Test Harness](#test-harness) · [When To Use This](#when-to-use-this)


## The Problem

A single 128k-token LLM request consumes roughly 1,000× more GPU compute than a 128-token request — but naive RPM (requests-per-minute) rate limiters treat them identically.

This creates two failure modes in multi-tenant systems:

  1. Starvation — One tenant sends fewer, larger requests and consumes disproportionate capacity
  2. Invisible unfairness — Your metrics show "all tenants under the RPM limit" while one tenant is using 80% of actual compute

Token-aware rate limiting (TPM — tokens per minute) solves this. Every major LLM API provider now enforces TPM limits server-side. This project implements the full system from scratch — token bucket algorithm, multi-tenant isolation, fairness measurement, and real-time visualization — as a learning reference and design artifact.

## How It Works

Three things working together:

### 1. Atomic Token Bucket (Redis + Lua)

Each tenant gets a token bucket. The entire check-and-decrement runs as a single Redis Lua script — atomic, no distributed locks, no race conditions.

```
Tenant requests 5,000 tokens
  → Refill bucket based on elapsed time (continuous, not interval-based)
  → If tokens ≥ 5,000 → decrement, return ALLOWED
  → Else → return DENIED + retry_after_ms (time until enough tokens refill)
```

Burst capacity lets idle tenants accumulate tokens above their per-minute baseline (configurable multiplier, e.g. 1.5×).
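
The shipped script isn't reproduced here, but the core check-and-decrement looks roughly like this sketch using redis-py's `register_script` (field names, argument order, and return shape are illustrative, not TRL's actual source):

```python
import time

import redis

# Illustrative sketch of the atomic bucket, not TRL's actual script.
TOKEN_BUCKET_LUA = """
local key      = KEYS[1]
local rate     = tonumber(ARGV[1])  -- refill rate, tokens/second
local capacity = tonumber(ARGV[2])  -- baseline * burst multiplier
local now      = tonumber(ARGV[3])  -- caller-supplied clock, seconds
local want     = tonumber(ARGV[4])

local state  = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last   = tonumber(state[2]) or now

-- Continuous refill: credit elapsed time, capped at burst capacity.
tokens = math.min(capacity, tokens + (now - last) * rate)

if tokens >= want then
    redis.call('HSET', key, 'tokens', tokens - want, 'last_refill', now)
    return {1, 0}
end
return {0, math.ceil((want - tokens) / rate * 1000)}  -- denied + retry_after_ms
"""

r = redis.Redis()
consume = r.register_script(TOKEN_BUCKET_LUA)

# 100k TPM with a 1.5x burst cap; ask for 5,000 tokens.
allowed, retry_after_ms = consume(
    keys=["bucket:my-app"],
    args=[100_000 / 60, 150_000, time.time(), 5_000],
)
```

Because the whole read-refill-decrement runs inside one script, two concurrent requests can never both see the same balance and double-spend it.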

### 2. Fairness Measurement (Jain's Index)

Most rate limiters answer "is this tenant over the limit?" — TRL also answers "is the system fair across tenants?"

Jain's fairness index: J = (Σxᵢ)² / (n · Σxᵢ²) where xᵢ = actual_throughput / allocated_throughput.

  • J = 1.0 → perfect fairness (all tenants get their fair share)
  • J = 0.25 → one of four tenants is consuming everything
  • J > 0.90 → system is healthy

This is computed in real time and displayed on the dashboard.
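
The arithmetic is small enough to sanity-check by hand; a minimal Python rendering of the formula above:

```python
# J = (Σx)² / (n · Σx²), with x_i = actual / allocated throughput per tenant.
def jains_index(xs: list[float]) -> float:
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

jains_index([1.0, 1.0, 1.0, 1.0])  # 1.0  -> every tenant gets its fair share
jains_index([4.0, 0.0, 0.0, 0.0])  # 0.25 -> one of four tenants takes everything
```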

### 3. Real-Time Dashboard

WebSocket-powered, updates every second. No build step — vanilla JS + Chart.js.

  • Bucket gauges — radial charts showing each tenant's remaining capacity
  • Token flow timeline — 5-minute rolling consumption per tenant
  • Rejection heatmap — who's getting throttled, and when
  • Fairness index — single number that tells you if the system is working
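
Programmatic consumers can read the same feed the dashboard uses. A minimal client sketch, assuming the third-party `websockets` package and a server on port 8000 (the payload shape is whatever the server pushes):

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

async def watch():
    # /ws/dashboard is the path from the API table below; port 8000 is assumed.
    async with websockets.connect("ws://localhost:8000/ws/dashboard") as ws:
        async for message in ws:
            print(json.loads(message))  # per-tenant stats, pushed every second

asyncio.run(watch())
```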

Try it now — open the live demo to see simulated multi-tenant traffic.

## Live Demo

No server needed. The dashboard has a built-in demo mode with simulated traffic:

```bash
# Option 1: Open directly
open dashboard/index.html?demo

# Option 2: Serve locally
python3 -m http.server 3000 -d dashboard
# Then visit http://localhost:3000?demo
```

The demo simulates 4 tenants with different tiers — one deliberately "hogging" to show how the rate limiter throttles it while protecting others.

What you'll see:

| Metric | Meaning |
|---|---|
| Fairness ~0.88 | One tenant is pushing limits — system is compensating |
| High rejections for one tenant | The hog is being throttled (working as intended) |
| Near-zero rejections for others | Other tenants are protected from the hog |
| TPM fluctuation | Realistic bursty traffic patterns |

## Architecture

```
Client Request
      │
      ▼
┌──────────────────┐     ┌───────────────────────┐
│  FastAPI         │────▶│  Redis                │
│  /consume        │     │  Lua Script (atomic)  │
│  /tenants        │◀────│  Token Bucket State   │
│  /metrics        │     └───────────────────────┘
└────────┬─────────┘
         │
    WebSocket /ws
         │
         ▼
┌──────────────────┐
│  Dashboard       │
│  Chart.js        │
│  Real-time       │
└──────────────────┘
```

### Redis Key Schema

| Key Pattern | Type | Purpose |
|---|---|---|
| `bucket:{tenant_id}` | HASH | Current tokens, last refill timestamp, request count |
| `metrics:{tenant_id}:consumed` | SORTED SET | Tokens consumed (scored by timestamp for windowed queries) |
| `metrics:{tenant_id}:rejected` | SORTED SET | Tokens rejected (scored by timestamp) |
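
Because events are scored by timestamp, a windowed read is one `ZRANGEBYSCORE` call. This sketch assumes members encode the token count (e.g. `"ts:tokens:uuid"`); the actual member encoding may differ:

```python
import time

import redis

r = redis.Redis(decode_responses=True)
now = time.time()

# All consumption events for my-app in the last 60 seconds.
members = r.zrangebyscore("metrics:my-app:consumed", now - 60, now)

# Assumes members look like "ts:tokens:uuid"; adjust to the real encoding.
tokens_last_minute = sum(int(m.split(":")[1]) for m in members)
```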

### API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/consume` | `{tenant_id, tokens}` → `{allowed, tokens_remaining, retry_after_ms}` |
| GET | `/tenants` | All tenants with current stats |
| GET | `/tenants/{id}/stats` | Per-tenant stats (configurable time window) |
| PUT | `/tenants/{id}/config` | Update tenant config at runtime |
| GET | `/metrics` | Prometheus exposition format |
| WS | `/ws/dashboard` | Real-time stats feed (1s interval) |
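
For example, a raw HTTP call against a locally running server (port 8000 is an assumption; use whatever you bind uvicorn to):

```python
import httpx  # or requests; the shape is the same

resp = httpx.post(
    "http://localhost:8000/consume",
    json={"tenant_id": "my-app", "tokens": 5000},
)
body = resp.json()
if body["allowed"]:
    ...  # proceed with the LLM call
else:
    ...  # wait body["retry_after_ms"] / 1000 seconds, then retry
```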

## Usage as a Library

```python
import asyncio

from trl import RateLimiter, BucketConfig
from trl.redis_backend import RedisBackend

async def main():
    backend = RedisBackend("redis://localhost:6379/0")
    limiter = RateLimiter(backend)
    await limiter.initialize()

    limiter.register_tenant("my-app", BucketConfig(max_tokens=100_000, burst_multiplier=1.5))

    result = await limiter.consume("my-app", tokens=5000)
    if result.allowed:
        ...  # proceed with LLM call
    else:
        ...  # back off for result.retry_after_ms

asyncio.run(main())
```

### FastAPI Middleware (Drop-in)

```python
from fastapi import FastAPI

from trl.middleware import TokenRateLimitMiddleware

app = FastAPI()
app.add_middleware(TokenRateLimitMiddleware, limiter=limiter)  # `limiter` from the snippet above
# Reads X-Tenant-ID header, estimates tokens from request body
```
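
Client-side, nothing changes except the header. A hypothetical call through an app mounted behind the middleware (the route and payload are invented for illustration):

```python
import httpx

httpx.post(
    "http://localhost:8000/v1/generate",  # hypothetical route behind the middleware
    headers={"X-Tenant-ID": "my-app"},    # tenant resolved from this header
    json={"prompt": "Summarize the attached report."},
)
```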

## Test Harness

The built-in harness simulates multi-tenant traffic and measures system behavior against defined thresholds.

### Scenarios

| Scenario | What It Tests | Pass Criteria |
|---|---|---|
| Normal Load | Each tenant at 60% TPM | Rejection < 1%, fairness > 0.95 |
| Burst | 30s idle → all burst at 200% | Burst served > 130% of baseline, recovery < 15s |
| Hogging | One tenant at 300%, others at 50% | Hog rejected > 60%, others rejected < 2% |
| At Capacity | All tenants at 110% | Each serves 90-100% of allocation, fairness > 0.90 |
| With vs Without | Same traffic, limiter on vs off | Quantified fairness improvement |

```bash
# Run with Docker
docker compose up
python3 -m harness.runner --config configs/openclaw_4agents.yaml --scenario all

# Run locally (requires Redis)
pip install -e ".[dev,harness]"
python3 -m harness.runner --config configs/openclaw_4agents.yaml --scenario hogging
```

### Tenant Tiers

| Tier | TPM | Burst | Use Case |
|---|---|---|---|
| free | 10K | 1.2× | Trial |
| basic | 50K | 1.2× | Small apps |
| standard | 100K | 1.5× | Production |
| premium | 500K | 2.0× | High-volume |
| enterprise | 2M | 2.0× | Dedicated |
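
One way to wire these tiers into `register_tenant`, assuming `max_tokens` is the per-minute budget as in the library example above (this mapping is illustrative, not shipped config):

```python
from trl import BucketConfig

# Illustrative tier table; TRL's actual config format may differ.
TIERS = {
    "free":       BucketConfig(max_tokens=10_000,    burst_multiplier=1.2),
    "basic":      BucketConfig(max_tokens=50_000,    burst_multiplier=1.2),
    "standard":   BucketConfig(max_tokens=100_000,   burst_multiplier=1.5),
    "premium":    BucketConfig(max_tokens=500_000,   burst_multiplier=2.0),
    "enterprise": BucketConfig(max_tokens=2_000_000, burst_multiplier=2.0),
}

limiter.register_tenant("my-app", TIERS["standard"])
```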

## When To Use This

Use this if you're:

  • Learning how token-aware rate limiting works under the hood
  • Building a multi-tenant AI product and need a reference for your own implementation
  • Preparing for system design interviews (rate limiting is a top-5 question)
  • Evaluating fairness properties of different rate limiting strategies

Don't use this if you're:

  • Looking for a production API gateway — use LiteLLM (39k stars, support for 100+ providers)
  • A single developer hitting an LLM API — the provider already rate-limits you server-side
  • Running at scale with enterprise SLAs — use a managed solution (Portkey, AWS API Gateway)

## How This Compares

| | TRL | LiteLLM | Provider API |
|---|---|---|---|
| Purpose | Reference implementation + learning | Production proxy | Server-side enforcement |
| Token-aware | Yes | Yes | Yes |
| Multi-tenant | Yes | Yes | Per-org |
| Dashboard | Yes (built-in) | Separate UI | Provider console |
| Fairness index | Yes (Jain's) | No | No |
| Deployment | Standalone | Full proxy | N/A |
| Complexity | ~2K lines Python | 800+ contributors | Managed |

## Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Token bucket state | Redis Lua script | Atomic check-and-decrement without distributed locks |
| Metrics storage | Redis sorted sets | O(log N) time-range queries via ZRANGEBYSCORE |
| Dashboard | Vanilla JS + Chart.js | No build step, ships as static files |
| Fairness metric | Jain's index | Well-studied, single number, range [0, 1] |
| Tests | fakeredis with Lua support | Fast unit tests without a Docker dependency |

## Development

```bash
pip install -e ".[dev]"
python3 -m pytest tests/ -v         # 25 tests
python3 -m ruff check src/ server/  # lint
```

## License

MIT


Built by Joaquin
