vicobarafor/EvalRouteOps

EvalRouteOps

Distributed adaptive inference infrastructure for reinforcement-learning-driven LLM routing, online optimization, and production-scale serving experimentation.

EvalRouteOps is a production-oriented AI systems platform for studying how adaptive routing policies optimize tradeoffs between:

  • response quality
  • latency
  • infrastructure cost
  • throughput
  • reliability

under realistic distributed inference constraints.

The platform combines:

  • contextual bandits
  • Thompson Sampling
  • policy-gradient routing
  • adaptive traffic shaping
  • Redis-backed distributed workers
  • Kubernetes deployment infrastructure
  • GPU-aware scheduling manifests
  • Prometheus/OpenTelemetry observability
  • streaming inference APIs
  • large-scale deterministic benchmarking

Technical Report

Full systems and infrastructure report:

  • Markdown version: docs/technical_report.md
  • PDF version: docs/EvalRouteOps_Technical_Report.pdf

The report covers:

  • adaptive routing optimization
  • reinforcement-learning-driven inference allocation
  • distributed serving infrastructure
  • Kubernetes orchestration
  • observability systems
  • benchmark methodology
  • experimental analysis
  • infrastructure tradeoff evaluation

EvalRouteOps is designed as intelligent serving infrastructure, not as a chatbot wrapper or prompt-engineering project.


Core Capabilities

Adaptive Routing

Implemented routing strategies include:

  • latency-aware routing
  • cost-aware routing
  • quality-first routing
  • epsilon-greedy contextual bandits
  • Thompson Sampling
  • policy-gradient routing
  • adaptive traffic shaping
  • Pareto tradeoff analysis
  • oracle benchmarking
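As a rough illustration of one strategy from the list above, the following is a minimal Beta-Bernoulli Thompson Sampling sketch. It is not the repository's implementation: the `ThompsonRouter` class, the backend names, and the simulated success rates are all assumptions made for the example, and "success" stands in for whatever reward signal the real router uses.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson Sampling over named backends (illustrative only)."""

    def __init__(self, backends, seed=0):
        self.rng = random.Random(seed)
        # [alpha, beta] per backend, starting from a uniform Beta(1, 1) prior.
        self.posteriors = {name: [1.0, 1.0] for name in backends}

    def choose(self):
        # Draw one plausible success rate per backend and route to the best draw.
        draws = {
            name: self.rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.posteriors.items()
        }
        return max(draws, key=draws.get)

    def update(self, backend, success):
        # A success increments alpha, a failure increments beta.
        index = 0 if success else 1
        self.posteriors[backend][index] += 1.0


def simulate(steps=5000, seed=0):
    """Route simulated traffic and count how often each backend is chosen."""
    # Hypothetical backends with made-up success rates.
    true_rates = {"small-model": 0.60, "large-model": 0.85}
    router = ThompsonRouter(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    counts = {name: 0 for name in true_rates}
    for _ in range(steps):
        backend = router.choose()
        counts[backend] += 1
        router.update(backend, env.random() < true_rates[backend])
    return counts


if __name__ == "__main__":
    print(simulate())
```

Because the sampler keeps a full posterior per backend, exploration fades on its own as evidence accumulates, which is the usual reason to prefer it over a fixed epsilon.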

Distributed Inference Infrastructure

EvalRouteOps supports distributed inference execution through:

  • Redis-backed inference queues
  • asynchronous inference workers
  • provider retry wrappers
  • timeout wrappers
  • fallback provider chains
  • concurrency-limited providers
  • streaming provider interfaces
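The wrappers listed above compose naturally as higher-order functions. The sketch below shows one possible shape using only `asyncio`; the `Provider` type alias, the wrapper names, and the two demo providers are hypothetical and are not taken from the EvalRouteOps codebase.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed provider shape for this sketch: an async function from prompt to text.
Provider = Callable[[str], Awaitable[str]]


def with_timeout(provider: Provider, seconds: float) -> Provider:
    async def wrapped(prompt: str) -> str:
        # wait_for cancels the call and raises a timeout error past the deadline.
        return await asyncio.wait_for(provider(prompt), timeout=seconds)
    return wrapped


def with_retry(provider: Provider, attempts: int) -> Provider:
    async def wrapped(prompt: str) -> str:
        last_error = None
        for _ in range(attempts):
            try:
                return await provider(prompt)
            except Exception as exc:  # includes timeouts raised by with_timeout
                last_error = exc
        raise RuntimeError(f"failed after {attempts} attempts") from last_error
    return wrapped


def with_fallback(chain: list) -> Provider:
    async def wrapped(prompt: str) -> str:
        errors = []
        for provider in chain:
            try:
                return await provider(prompt)
            except Exception as exc:
                errors.append(exc)  # remember the failure, try the next provider
        raise RuntimeError(f"all {len(chain)} providers failed: {errors!r}")
    return wrapped


async def slow_primary(prompt: str) -> str:
    await asyncio.sleep(1.0)  # always exceeds the 50 ms timeout used below
    return f"primary: {prompt}"


async def fast_backup(prompt: str) -> str:
    return f"backup: {prompt}"


def build_router() -> Provider:
    primary = with_retry(with_timeout(slow_primary, seconds=0.05), attempts=2)
    return with_fallback([primary, fast_backup])


if __name__ == "__main__":
    print(asyncio.run(build_router()("hello")))
```

Here the primary provider times out twice, the retry wrapper gives up, and the fallback chain hands the request to the backup provider.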

Infrastructure components include:

  • Docker Compose
  • Kubernetes deployment manifests
  • horizontal autoscaling (HPA)
  • GPU-aware scheduling manifests

Serving Layer

The serving stack includes:

  • FastAPI routing APIs
  • streaming inference APIs
  • structured request logging
  • Prometheus metrics
  • OpenTelemetry tracing
  • request timing instrumentation
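To make the logging and timing items concrete, here is a standard-library-only sketch of request timing with one structured (JSON) log record per request. It deliberately avoids the Prometheus and OpenTelemetry client APIs; the `timed_request` helper, the logger name, and the record fields are assumptions for illustration, not the project's actual schema.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("evalrouteops.sketch")  # hypothetical logger name


@contextmanager
def timed_request(route, backend, sink):
    """Time the wrapped block and emit one structured record for it."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the caller still sees the original exception
    finally:
        record = {
            "route": route,
            "backend": backend,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000.0, 3),
        }
        sink.append(record)              # in-memory copy, e.g. for later aggregation
        logger.info(json.dumps(record))  # one JSON object per log line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    records = []
    with timed_request("/route", "small-model", records):
        time.sleep(0.01)  # stand-in for the real inference call
```

In a real deployment the same measurement would typically also feed a latency histogram and a trace span rather than only a log line.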

Experimental Scale

Current benchmarked infrastructure includes:

Metric                                     Result
Routing simulation scale                   100,000 requests
Adaptive routing experiments               20,000 requests
Replay throughput                          7,600+ requests/sec
Live API throughput                        58 requests/sec
API P95 latency                            ~28 ms
Automated tests                            53 passing
Failure rate (fallback-enabled routing)    0%
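For readers unfamiliar with the latency and throughput metrics above, the sketch below shows one common way to compute them from raw measurements. It uses the nearest-rank percentile definition, which may differ from the method used by the project's load-test script; the function names and sample values are illustrative only.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks are not lost to float rounding.
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


def throughput(completed, wall_seconds):
    """Completed requests per second over a measurement window."""
    return completed / wall_seconds


if __name__ == "__main__":
    latencies_ms = [float(i) for i in range(1, 101)]  # made-up samples, 1..100 ms
    print(percentile(latencies_ms, 95))
    print(throughput(completed=580, wall_seconds=10.0))
```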

Benchmark Visualizations

Bandit Cumulative Regret

[Figure: bandit cumulative regret]

Bandit Backend Allocation

[Figure: bandit backend allocation]

Bandit Rolling Quality

[Figure: bandit rolling quality]

Load Sweep Throughput

[Figure: load sweep throughput]
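Two of the quantities plotted in this section, cumulative regret and rolling quality, have simple definitions that can be sketched directly. The version below assumes regret is measured per request against an oracle that always picks the best backend, and that rolling quality is a trailing-window mean; the benchmark scripts may define either differently.

```python
from collections import deque
from itertools import accumulate


def cumulative_regret(oracle_rewards, policy_rewards):
    """Running sum of (oracle reward - policy reward), one entry per request."""
    if len(oracle_rewards) != len(policy_rewards):
        raise ValueError("reward sequences must have the same length")
    per_step = [o - p for o, p in zip(oracle_rewards, policy_rewards)]
    return list(accumulate(per_step))


def rolling_mean(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    recent = deque(maxlen=window)
    total = 0.0
    means = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # the oldest value is evicted by the append below
        recent.append(value)
        total += value
        means.append(total / len(recent))
    return means


if __name__ == "__main__":
    # Made-up rewards purely for illustration.
    print(cumulative_regret([1.0, 1.0, 1.0], [0.5, 1.0, 0.0]))
    print(rolling_mean([1.0, 2.0, 3.0, 4.0], window=2))
```

A learning policy should show a regret curve that flattens over time, since flat segments mean it is matching the oracle's choices.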


Architecture

Client
  |
  v
FastAPI Serving Layer
  |
  v
Routing Policies
  |
  v
Redis Queue Backend
  |
  +----------------------+
  |                      |
  v                      v
CPU Workers         GPU Workers
  |                      |
  +----------+-----------+
             |
             v
     Provider Execution Layer
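The queue and worker stages of the diagram can be mimicked in a single process to show the control flow. This sketch substitutes `asyncio.Queue` for Redis so that it runs without external services; the worker names, the job dictionary shape, and the `needs_gpu` routing rule are assumptions for illustration and do not describe the real queue schema.

```python
import asyncio


async def worker(name, queue, results):
    """Pull jobs until a None sentinel arrives (stand-in for a Redis-backed worker)."""
    while True:
        job = await queue.get()
        if job is None:
            break
        # A real worker would call the provider execution layer here.
        results.append((name, job["id"]))


async def run_pipeline(jobs):
    # Queues are created inside the running event loop.
    cpu_queue = asyncio.Queue()
    gpu_queue = asyncio.Queue()
    results = []
    workers = [
        asyncio.create_task(worker("cpu-worker", cpu_queue, results)),
        asyncio.create_task(worker("gpu-worker", gpu_queue, results)),
    ]
    for job in jobs:
        # Hypothetical routing rule: jobs flagged needs_gpu go to the GPU queue.
        target = gpu_queue if job.get("needs_gpu") else cpu_queue
        await target.put(job)
    # One sentinel per queue tells each worker to shut down.
    await cpu_queue.put(None)
    await gpu_queue.put(None)
    await asyncio.gather(*workers)
    return sorted(results, key=lambda item: item[1])


if __name__ == "__main__":
    demo_jobs = [{"id": 1, "needs_gpu": True}, {"id": 2}, {"id": 3}]
    print(asyncio.run(run_pipeline(demo_jobs)))
```

Swapping the in-process queues for Redis lists or streams would let the API and the workers run as separate processes or pods, which is the arrangement the diagram describes.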

System Architecture

See:

docs/images/architecture_diagram.md

Quickstart

Environment Setup

python -m venv .venv

# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
# source .venv/bin/activate

pip install -e ".[dev]"

Run Tests

pytest
ruff check .

Start API Server

uvicorn evalrouteops.serving.main:app --reload

Interactive API docs:

http://127.0.0.1:8000/docs

Run API Load Test

python scripts/run_api_loadtest.py

Run Adaptive Routing Experiments

python scripts/run_adaptive_traffic_experiment.py
python scripts/run_policy_gradient_experiment.py

Docker Deployment

Build the image:

docker build -t evalrouteops:latest .

Run the distributed stack:

docker compose up

Compose services include:

  • EvalRouteOps API
  • Redis backend
  • distributed inference worker

Kubernetes Deployment

Kubernetes manifests are provided in:

k8s/

Supported deployment infrastructure includes:

  • API deployment
  • distributed workers
  • Redis deployment
  • horizontal autoscaling
  • GPU-aware worker scheduling

Deploy core stack:

kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f k8s/api-deployment.yaml
kubectl apply -f k8s/worker-deployment.yaml
kubectl apply -f k8s/services.yaml

Optional GPU workers:

kubectl apply -f k8s/gpu-worker-deployment.yaml

Optional autoscaling:

kubectl apply -f k8s/hpa.yaml

Documentation

Additional documentation is available in:

docs/

Included documentation:

  • architecture overview
  • benchmark methodology
  • deployment documentation
  • API reference
  • reproducibility guarantees
  • benchmark summaries

Design Principles

EvalRouteOps is built around:

  • deterministic experimentation
  • reproducible benchmarks
  • adaptive online optimization
  • production-oriented infrastructure
  • typed interfaces
  • observability-first design
  • scalable distributed execution

Research Objective

EvalRouteOps investigates whether reinforcement-learning-inspired routing strategies can dynamically optimize distributed inference systems under competing objectives:

  • latency
  • quality
  • infrastructure cost
  • throughput
  • reliability

using adaptive online optimization and scalable serving infrastructure.
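When objectives compete, a single "best" policy rarely exists, so results are often compared by Pareto dominance. The sketch below filters a set of policy results down to the non-dominated ones using three of the five objectives for brevity. The `PolicyResult` type, the policy names, and all numbers are invented for the example and are not benchmark results.

```python
from typing import NamedTuple


class PolicyResult(NamedTuple):
    name: str
    quality: float      # higher is better
    latency_ms: float   # lower is better
    cost: float         # lower is better


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (
        a.quality >= b.quality
        and a.latency_ms <= b.latency_ms
        and a.cost <= b.cost
    )
    strictly_better = (
        a.quality > b.quality
        or a.latency_ms < b.latency_ms
        or a.cost < b.cost
    )
    return no_worse and strictly_better


def pareto_front(results):
    """Keep only the results that no other result dominates."""
    return [
        candidate
        for candidate in results
        if not any(dominates(other, candidate) for other in results)
    ]


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    demo = [
        PolicyResult("quality-first", quality=0.90, latency_ms=120.0, cost=1.0),
        PolicyResult("latency-aware", quality=0.70, latency_ms=40.0, cost=0.5),
        PolicyResult("dominated", quality=0.65, latency_ms=60.0, cost=0.6),
    ]
    print([result.name for result in pareto_front(demo)])
```

The quadratic pairwise comparison is fine for a handful of policies; a sort-based sweep would be preferable for thousands of candidate configurations.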
