# Scalable RAG

A production-ready RAG system demonstrating DevOps and MLOps best practices, including caching, rate limiting, circuit breakers, monitoring, and Kubernetes deployment.

## Features
- Distributed Caching: Redis-based caching for embeddings and query results
- Rate Limiting: Token bucket and sliding window rate limiters
- Circuit Breakers: Fault tolerance for external service calls
- Prometheus Metrics: Comprehensive observability
- Health Checks: Kubernetes-ready liveness and readiness probes
- Horizontal Scaling: Auto-scaling based on load
- Async Support: High-concurrency request handling
## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            Load Balancer                            │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         Kubernetes Cluster                          │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                     RAG API Pods (HPA)                      │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │    │
│  │  │  Pod 1  │  │  Pod 2  │  │  Pod 3  │  │  Pod N  │         │    │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘         │    │
│  └───────┼────────────┼────────────┼────────────┼──────────────┘    │
│          │            │            │            │                   │
│          ▼            ▼            ▼            ▼                   │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                        Service Mesh                         │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │    │
│  │  │ Rate Limiter │  │   Circuit    │  │    Cache     │       │    │
│  │  │              │  │   Breaker    │  │   (Redis)    │       │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
   ┌──────────┐    ┌──────────┐    ┌───────────┐
   │  OpenAI  │    │ Pinecone │    │ Prometheus│
   │   API    │    │  Vector  │    │  Grafana  │
   └──────────┘    └──────────┘    └───────────┘
```
## Project Structure

```
scalable-rag/
├── config.py           # Configuration management
├── cache.py            # Redis caching layer
├── rate_limiter.py     # Rate limiting & circuit breakers
├── monitoring.py       # Prometheus metrics & health checks
├── rag_engine.py       # Core RAG implementation
├── main.py             # FastAPI application
├── Dockerfile          # Container image
├── kubernetes/
│   └── deployment.yaml # K8s manifests
├── requirements.txt    # Dependencies
└── README.md           # Documentation
```
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY=sk-...
export PINECONE_API_KEY=...
export REDIS_URL=redis://localhost:6379

# Start Redis (Docker)
docker run -d -p 6379:6379 redis:alpine

# Run the server
python main.py --port 8000 --reload
```

## Docker

```bash
# Build image
docker build -t scalable-rag .

# Run container
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-... \
  -e PINECONE_API_KEY=... \
  -e REDIS_URL=redis://redis:6379 \
  scalable-rag
```

## Kubernetes

```bash
# Create secrets
kubectl create secret generic rag-secrets \
  --from-literal=openai-api-key=sk-... \
  --from-literal=pinecone-api-key=...

# Deploy
kubectl apply -f kubernetes/deployment.yaml

# Check status
kubectl get pods -l app=scalable-rag
```

## API

### POST /query
```json
{
    "question": "What is machine learning?",
    "k": 5,
    "use_cache": true
}
```

### Health & Monitoring

```
GET /health        # Full health status
GET /health/live   # Liveness probe
GET /health/ready  # Readiness probe
GET /metrics       # Prometheus metrics
GET /status        # System status + circuit breakers
```

### POST /documents

```json
{
    "texts": ["Document 1 content", "Document 2 content"],
    "metadatas": [{"source": "doc1"}, {"source": "doc2"}]
}
```

## Metrics

| Metric | Type | Description |
|---|---|---|
| `rag_requests_total` | Counter | Total requests by endpoint |
| `rag_request_latency_seconds` | Histogram | Request latency |
| `rag_llm_calls_total` | Counter | LLM API calls |
| `rag_llm_tokens_total` | Counter | Token usage |
| `rag_cache_hits_total` | Counter | Cache hits |
| `rag_errors_total` | Counter | Errors by type |
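The exposition format behind these metrics can be sketched without dependencies. The real `monitoring.py` presumably registers them through `prometheus_client`; the tiny `Counter` class below is an illustrative stand-in that renders the same Prometheus text format for two of the metrics in the table.

```python
# Minimal, dependency-free stand-in for a Prometheus counter, shown only to
# illustrate the metric names above; production code would use prometheus_client.
from collections import defaultdict

class Counter:
    """Monotonic counter with optional labels, rendered in Prometheus text format."""
    def __init__(self, name, help_text, labelnames=()):
        self.name, self.help_text, self.labelnames = name, help_text, tuple(labelnames)
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        key = tuple(labels.get(l, "") for l in self.labelnames)
        self.values[key] += amount

    def render(self):
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, value in sorted(self.values.items()):
            labels = ",".join(f'{n}="{v}"' for n, v in zip(self.labelnames, key))
            suffix = f"{{{labels}}}" if labels else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines)

# Two of the metrics from the table above
REQUESTS = Counter("rag_requests_total", "Total requests by endpoint", ["endpoint"])
CACHE_HITS = Counter("rag_cache_hits_total", "Cache hits", ["cache"])

REQUESTS.inc(endpoint="/query")
REQUESTS.inc(endpoint="/query")
CACHE_HITS.inc(cache="embedding")
```

Scraping `GET /metrics` would then return lines like `rag_requests_total{endpoint="/query"} 2.0`.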
### Grafana Dashboard

Import the provided dashboard to visualize:
- Request rate and latency
- Cache hit ratio
- LLM token usage
- Error rates
- Circuit breaker states
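The liveness and readiness probes listed under the API serve different purposes: liveness only asserts the process is running (failing it restarts the pod), while readiness gates traffic on downstream dependencies. A hedged sketch of that split, with hypothetical `check_redis`/`check_vector_store` helpers standing in for whatever `monitoring.py` actually does:

```python
# Illustrative liveness/readiness logic; helper names are assumptions,
# not the project's real API.
def check_redis() -> bool:
    return True  # placeholder: would PING Redis here

def check_vector_store() -> bool:
    return True  # placeholder: would describe the Pinecone index here

def liveness() -> dict:
    # Liveness only asserts the process is up; a failure restarts the pod.
    return {"status": "alive"}

def readiness() -> dict:
    # Readiness fails (HTTP 503) while dependencies are down, so Kubernetes
    # stops routing requests to this pod without restarting it.
    checks = {"redis": check_redis(), "vector_store": check_vector_store()}
    ready = all(checks.values())
    return {"status": "ready" if ready else "not_ready", "checks": checks}
```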
## Rate Limiting

```python
# Token bucket for smooth limiting
rate_limiter = TokenBucketRateLimiter(
    rate=60,      # tokens per minute
    capacity=100  # burst capacity
)

# Sliding window for strict limits
rate_limiter = SlidingWindowRateLimiter(
    limit=60,
    window_seconds=60
)
```
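A token bucket refills continuously at the configured rate and lets short bursts drain the accumulated tokens. A minimal sketch of what `rate_limiter.py`'s `TokenBucketRateLimiter` might look like internally (the constructor signature matches the usage above; everything else is an assumption):

```python
# Hypothetical sketch of a token-bucket limiter; not the project's actual code.
import time

class TokenBucketRateLimiter:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate / 60.0   # configured as tokens per minute; refill per second
        self.capacity = capacity  # burst capacity
        self.tokens = capacity    # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Injecting `clock` keeps the limiter deterministic under test; production code would simply use the default `time.monotonic`.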
## Circuit Breakers

```python
# Protects against cascading failures
circuit = CircuitBreaker(
    name="llm",
    config=CircuitBreakerConfig(
        failure_threshold=5,  # Open after 5 failures
        timeout_seconds=30,   # Try again after 30s
        success_threshold=2   # Close after 2 successes
    )
)
```

## Caching Flow

```
Query → Check Response Cache → Hit? Return
                             → Miss? Continue
        ↓
Check Embedding Cache → Hit? Use cached embedding
                      → Miss? Generate & cache
        ↓
Check Search Cache → Hit? Use cached results
                   → Miss? Search & cache
        ↓
Generate Response → Cache & Return
```
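The response-cache level of this flow can be sketched with an in-process TTL dict standing in for Redis; `cache.py`'s real interface is unknown, so the `TTLCache`, `cache_key`, and `answer` names below are illustrative only. The key idea is a deterministic cache key derived from the canonicalized query, and a full pipeline skip on a hit:

```python
# Hypothetical sketch of the response-cache step; a dict with TTLs stands in
# for Redis, and all names here are assumptions, not the project's API.
import hashlib, json, time

class TTLCache:
    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self.ttl, self.clock, self.store = ttl_seconds, clock, {}

    def get(self, key):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, expires = hit
        if self.clock() >= expires:
            del self.store[key]  # lazily expire stale entries
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)

def cache_key(namespace: str, payload) -> str:
    # Deterministic key: namespace plus a hash of the canonicalized payload
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return f"{namespace}:{digest}"

response_cache = TTLCache()

def answer(question: str, generate) -> str:
    key = cache_key("response", question)
    cached = response_cache.get(key)
    if cached is not None:
        return cached                 # hit: skip embedding, search, and the LLM
    result = generate(question)       # miss: run the full pipeline
    response_cache.set(key, result)
    return result
```

The embedding and search caches in the flow above would follow the same pattern with their own namespaces and TTLs.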
## Configuration

All settings are provided via environment variables:

```bash
# LLM
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=2000

# Vector Store
PINECONE_INDEX=production-rag
PINECONE_ENV=us-east-1

# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
REDIS_URL=redis://localhost:6379

# Rate Limiting
RATE_LIMIT_RPM=60
RATE_LIMIT_TPM=100000
MAX_CONCURRENT=10

# Monitoring
PROMETHEUS_ENABLED=true
PROMETHEUS_PORT=9090
LOG_LEVEL=INFO
```
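`config.py` would read these variables into a typed settings object. A stdlib-only sketch covering a subset of the variables above (the real module may well use `pydantic-settings`; the `Settings` fields and defaults here are assumptions matching the values shown):

```python
# Hypothetical sketch of config.py: typed settings from environment variables.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    llm_model: str
    llm_temperature: float
    cache_enabled: bool
    cache_ttl: int
    redis_url: str
    rate_limit_rpm: int

    @classmethod
    def from_env(cls) -> "Settings":
        env = os.environ.get
        return cls(
            llm_model=env("LLM_MODEL", "gpt-4o-mini"),
            llm_temperature=float(env("LLM_TEMPERATURE", "0.3")),
            cache_enabled=env("CACHE_ENABLED", "true").lower() == "true",
            cache_ttl=int(env("CACHE_TTL", "3600")),
            redis_url=env("REDIS_URL", "redis://localhost:6379"),
            rate_limit_rpm=int(env("RATE_LIMIT_RPM", "60")),
        )
```

A frozen dataclass keeps settings immutable after startup, so a misbehaving request handler cannot silently change them.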
## Testing

```bash
# Run tests
pytest tests/ -v

# Load testing
locust -f tests/locustfile.py --host=http://localhost:8000
```

## Scaling Guidelines

| Users | Pods | Redis | Notes |
|---|---|---|---|
| < 100 | 2 | 1 | Development |
| 100-1K | 3-5 | 1 | Small production |
| 1K-10K | 5-10 | 3 (cluster) | Medium production |
| 10K+ | 10+ | Redis Cluster | Large production |
## Security

- API key management via Kubernetes secrets
- Rate limiting prevents abuse
- Circuit breakers prevent cascading failures
- Health checks enable zero-downtime deploys

## License

MIT License