Distributed adaptive inference infrastructure for reinforcement-learning-driven LLM routing, online optimization, and production-scale serving experimentation.
EvalRouteOps is a production-oriented AI systems platform for studying how adaptive routing policies optimize tradeoffs between:
- response quality
- latency
- infrastructure cost
- throughput
- reliability
under realistic distributed inference constraints.
The platform combines:
- contextual bandits
- Thompson Sampling
- policy-gradient routing
- adaptive traffic shaping
- Redis-backed distributed workers
- Kubernetes deployment infrastructure
- GPU-aware scheduling manifests
- Prometheus/OpenTelemetry observability
- streaming inference APIs
- large-scale deterministic benchmarking
Full systems and infrastructure report:
- Markdown version: docs/technical_report.md
- PDF version: docs/EvalRouteOps_Technical_Report.pdf
The report covers:
- adaptive routing optimization
- reinforcement-learning-driven inference allocation
- distributed serving infrastructure
- Kubernetes orchestration
- observability systems
- benchmark methodology
- experimental analysis
- infrastructure tradeoff evaluation
EvalRouteOps is designed as intelligent serving infrastructure, not as a chatbot wrapper or prompt-engineering project.
Implemented routing strategies include:
- latency-aware routing
- cost-aware routing
- quality-first routing
- epsilon-greedy contextual bandits
- Thompson Sampling
- policy-gradient routing
- adaptive traffic shaping
- Pareto tradeoff analysis
- oracle benchmarking
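
As an illustration of the bandit-style strategies listed above, here is a minimal Beta-Bernoulli Thompson Sampling router. It is a sketch under stated assumptions, not the project's implementation: the `ThompsonRouter` class, the `simulate` helper, the provider names, and the success probabilities are all hypothetical, and "success" stands in for whatever binary reward a routing experiment would define.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson Sampling over a fixed set of providers (hypothetical sketch)."""

    def __init__(self, providers, seed=0):
        self.rng = random.Random(seed)
        # One Beta(alpha, beta) posterior per provider, starting from a uniform prior.
        self.posteriors = {name: [1.0, 1.0] for name in providers}

    def select(self):
        # Draw one sample from each posterior and route to the largest draw.
        draws = {
            name: self.rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.posteriors.items()
        }
        return max(draws, key=draws.get)

    def update(self, provider, success):
        # A success increments alpha; a failure increments beta.
        index = 0 if success else 1
        self.posteriors[provider][index] += 1.0


def simulate(true_success, rounds=2000, seed=0):
    """Route `rounds` simulated requests and count how often each provider is chosen."""
    router = ThompsonRouter(true_success, seed=seed)
    env = random.Random(seed + 1)  # separate stream for the simulated environment
    counts = {name: 0 for name in true_success}
    for _ in range(rounds):
        choice = router.select()
        counts[choice] += 1
        router.update(choice, env.random() < true_success[choice])
    return counts


if __name__ == "__main__":
    # Illustrative success probabilities, not measured values.
    print(simulate({"fast-small": 0.55, "slow-large": 0.90}))
```

Because both random streams are seeded, repeated runs produce identical routing counts, which keeps the experiment replayable.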
EvalRouteOps supports distributed inference execution through:
- Redis-backed inference queues
- asynchronous inference workers
- provider retry wrappers
- timeout wrappers
- fallback provider chains
- concurrency-limited providers
- streaming provider interfaces
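
The retry, timeout, and fallback wrappers listed above can be composed roughly as follows. This is a minimal `asyncio` sketch, assuming providers are plain async callables that take a prompt; the function names (`call_with_retry`, `call_with_fallback`) and the three demo providers are hypothetical and do not reflect the project's actual interfaces. Concurrency limiting is left out for brevity.

```python
import asyncio


async def call_with_retry(provider, prompt, retries=2, timeout_s=0.5):
    """Call one provider with a per-attempt timeout and a bounded number of retries."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(provider(prompt), timeout=timeout_s)
        except (asyncio.TimeoutError, RuntimeError) as exc:
            last_error = exc
    raise last_error


async def call_with_fallback(providers, prompt, **kwargs):
    """Try (name, provider) pairs in order; move on once a provider exhausts its retries."""
    errors = []
    for name, provider in providers:
        try:
            return name, await call_with_retry(provider, prompt, **kwargs)
        except (asyncio.TimeoutError, RuntimeError) as exc:
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")


# Hypothetical demo providers: one always fails, one is too slow, one succeeds.
async def flaky_provider(prompt):
    raise RuntimeError("upstream unavailable")


async def slow_provider(prompt):
    await asyncio.sleep(1.0)
    return "too late"


async def stable_provider(prompt):
    return f"echo: {prompt}"


if __name__ == "__main__":
    chain = [("flaky", flaky_provider), ("slow", slow_provider), ("stable", stable_provider)]
    print(asyncio.run(call_with_fallback(chain, "hello", retries=1, timeout_s=0.05)))
```

Keeping the timeout inside the retry loop means each attempt gets its own budget, so the worst-case time per provider is roughly `(retries + 1) * timeout_s`.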
Infrastructure components include:
- Docker Compose
- Kubernetes deployment manifests
- horizontal autoscaling (HPA)
- GPU-aware scheduling manifests
The serving stack includes:
- FastAPI routing APIs
- streaming inference APIs
- structured request logging
- Prometheus metrics
- OpenTelemetry tracing
- request timing instrumentation
Current benchmark results:
| Metric | Result |
|---|---|
| Routing simulation scale | 100,000 requests |
| Adaptive routing experiments | 20,000 requests |
| Replay throughput | 7,600+ requests/sec |
| Live API throughput | 58 requests/sec |
| API P95 latency | ~28 ms |
| Automated tests | 53 passing |
| Failure rate (fallback-enabled routing) | 0% |
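
For readers who want to reproduce figures such as throughput and P95 latency from their own runs, the sketch below shows one common way to derive them from raw per-request latencies. It assumes the nearest-rank percentile definition; the source does not state which definition the project's benchmark scripts use, and the `percentile` and `summarize` helpers are hypothetical.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, wall_clock_s):
    """Summarize a load-test run from per-request latencies and total wall-clock time."""
    return {
        "requests": len(latencies_ms),
        "throughput_rps": len(latencies_ms) / wall_clock_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }


if __name__ == "__main__":
    # Illustrative latencies only, not measurements from EvalRouteOps.
    print(summarize([10, 20, 30, 40], wall_clock_s=2.0))
```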

    Client
       |
       v
    FastAPI Serving Layer
       |
       v
    Routing Policies
       |
       v
    Redis Queue Backend
       |
       +----------------------+
       |                      |
       v                      v
    CPU Workers          GPU Workers
       |                      |
       +----------+-----------+
                  |
                  v
       Provider Execution Layer

See:
docs/images/architecture_diagram.md

Set up a development environment:

    python -m venv .venv
    # Windows PowerShell
    .venv\Scripts\Activate.ps1
    # macOS/Linux
    # source .venv/bin/activate
    pip install -e ".[dev]"

Run the tests and lint checks:

    pytest
    ruff check .

Start the API server:

    uvicorn evalrouteops.serving.main:app --reload

Interactive API docs:
http://127.0.0.1:8000/docs

Run the API load test and routing experiments:

    python scripts/run_api_loadtest.py
    python scripts/run_adaptive_traffic_experiment.py
    python scripts/run_policy_gradient_experiment.py

Build the image:

    docker build -t evalrouteops:latest .

Run the distributed stack:

    docker compose up

Compose services include:
- EvalRouteOps API
- Redis backend
- distributed inference worker
Kubernetes manifests are provided in:
k8s/
Supported deployment infrastructure includes:
- API deployment
- distributed workers
- Redis deployment
- horizontal autoscaling
- GPU-aware worker scheduling
Deploy core stack:

    kubectl apply -f k8s/redis-deployment.yaml
    kubectl apply -f k8s/api-deployment.yaml
    kubectl apply -f k8s/worker-deployment.yaml
    kubectl apply -f k8s/services.yaml

Optional GPU workers:

    kubectl apply -f k8s/gpu-worker-deployment.yaml

Optional autoscaling:

    kubectl apply -f k8s/hpa.yaml

Additional documentation is available in:
docs/
Included documentation:
- architecture overview
- benchmark methodology
- deployment documentation
- API reference
- reproducibility guarantees
- benchmark summaries
EvalRouteOps is built around:
- deterministic experimentation
- reproducible benchmarks
- adaptive online optimization
- production-oriented infrastructure
- typed interfaces
- observability-first design
- scalable distributed execution
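
One way to make "deterministic experimentation" checkable is to generate the benchmark workload from a seed and fingerprint it, so two runs can confirm they replayed identical traffic. The sketch below assumes a synthetic workload; the `generate_workload` and `fingerprint` helpers and the request fields are hypothetical and are not taken from the project's benchmark code.

```python
import hashlib
import json
import random


def generate_workload(seed, n):
    """Generate n synthetic requests from a seeded RNG so the workload is repeatable."""
    rng = random.Random(seed)
    return [
        {
            "request_id": i,
            "prompt_tokens": rng.randint(16, 2048),
            "task": rng.choice(["chat", "summarize", "code"]),
        }
        for i in range(n)
    ]


def fingerprint(workload):
    """Hash a canonical JSON encoding of the workload to compare runs cheaply."""
    payload = json.dumps(workload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    first = fingerprint(generate_workload(seed=7, n=1000))
    second = fingerprint(generate_workload(seed=7, n=1000))
    print(first == second)
```

Recording the seed and fingerprint alongside benchmark results lets a later run verify that it replayed the same request stream before comparing metrics.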
EvalRouteOps investigates whether reinforcement-learning-inspired routing strategies can dynamically optimize distributed inference systems under competing objectives:
- latency
- quality
- infrastructure cost
- throughput
- reliability
using adaptive online optimization and scalable serving infrastructure.
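
When objectives compete like this, no single policy is usually best on every axis, so comparisons are often made on the Pareto front: the set of policies that no other policy beats on all objectives at once. The sketch below assumes three of the objectives (maximize quality, minimize latency and cost); the `pareto_front` helper, the field names, and the numbers are hypothetical and are not results from the project.

```python
def dominates(a, b):
    """True if policy a is at least as good as b on every objective and strictly better on one."""
    no_worse = (
        a["quality"] >= b["quality"]
        and a["latency_ms"] <= b["latency_ms"]
        and a["cost"] <= b["cost"]
    )
    strictly_better = (
        a["quality"] > b["quality"]
        or a["latency_ms"] < b["latency_ms"]
        or a["cost"] < b["cost"]
    )
    return no_worse and strictly_better


def pareto_front(policies):
    """Return the policies that are not dominated by any other policy."""
    return [p for p in policies if not any(dominates(q, p) for q in policies)]


# Illustrative numbers only, not measurements from EvalRouteOps.
POLICIES = [
    {"name": "latency_aware", "quality": 0.70, "latency_ms": 20, "cost": 1.0},
    {"name": "cost_aware", "quality": 0.65, "latency_ms": 35, "cost": 0.4},
    {"name": "quality_first", "quality": 0.90, "latency_ms": 80, "cost": 3.0},
    {"name": "dominated_example", "quality": 0.60, "latency_ms": 40, "cost": 1.5},
]

if __name__ == "__main__":
    print([p["name"] for p in pareto_front(POLICIES)])
```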



