Shameendra/Scalable_RAG

🚀 Production-Grade Scalable RAG System

A production-ready RAG system demonstrating DevOps and MLOps best practices: caching, rate limiting, circuit breakers, monitoring, and Kubernetes deployment.

🎯 Key Features

  • Distributed Caching: Redis-based caching for embeddings and query results
  • Rate Limiting: Token bucket and sliding window rate limiters
  • Circuit Breakers: Fault tolerance for external service calls
  • Prometheus Metrics: Comprehensive observability
  • Health Checks: Kubernetes-ready liveness and readiness probes
  • Horizontal Scaling: Auto-scaling based on load
  • Async Support: High-concurrency request handling

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Load Balancer                                │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    RAG API Pods (HPA)                       │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │    │
│  │  │  Pod 1  │  │  Pod 2  │  │  Pod 3  │  │  Pod N  │         │    │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘         │    │
│  └───────┼────────────┼────────────┼────────────┼──────────────┘    │
│          │            │            │            │                   │
│          ▼            ▼            ▼            ▼                   │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    Service Mesh                             │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │    │
│  │  │ Rate Limiter │  │   Circuit    │  │    Cache     │       │    │
│  │  │              │  │   Breaker    │  │   (Redis)    │       │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐   ┌──────────┐   ┌───────────┐
    │  OpenAI  │   │ Pinecone │   │ Prometheus│
    │   API    │   │  Vector  │   │  Grafana  │
    └──────────┘   └──────────┘   └───────────┘

📁 Project Structure

scalable-rag/
├── config.py              # Configuration management
├── cache.py               # Redis caching layer
├── rate_limiter.py        # Rate limiting & circuit breakers
├── monitoring.py          # Prometheus metrics & health checks
├── rag_engine.py          # Core RAG implementation
├── main.py                # FastAPI application
├── Dockerfile             # Container image
├── kubernetes/
│   └── deployment.yaml    # K8s manifests
├── requirements.txt       # Dependencies
└── README.md              # Documentation

🚀 Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY=sk-...
export PINECONE_API_KEY=...
export REDIS_URL=redis://localhost:6379

# Start Redis (Docker)
docker run -d -p 6379:6379 redis:alpine

# Run the server
python main.py --port 8000 --reload

Docker

# Build image
docker build -t scalable-rag .

# Run container
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-... \
  -e PINECONE_API_KEY=... \
  -e REDIS_URL=redis://redis:6379 \
  scalable-rag

Kubernetes

# Create secrets
kubectl create secret generic rag-secrets \
  --from-literal=openai-api-key=sk-... \
  --from-literal=pinecone-api-key=...

# Deploy
kubectl apply -f kubernetes/deployment.yaml

# Check status
kubectl get pods -l app=scalable-rag

📡 API Endpoints

Query

POST /query
{
  "question": "What is machine learning?",
  "k": 5,
  "use_cache": true
}

Health Checks

GET /health        # Full health status
GET /health/live   # Liveness probe
GET /health/ready  # Readiness probe

Metrics

GET /metrics       # Prometheus metrics
GET /status        # System status + circuit breakers

Documents

POST /documents
{
  "texts": ["Document 1 content", "Document 2 content"],
  "metadatas": [{"source": "doc1"}, {"source": "doc2"}]
}
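The request bodies above map naturally onto typed request models. As a rough sketch using only stdlib dataclasses (the actual service presumably defines Pydantic models for FastAPI; these class names are illustrative, not taken from the repo):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class QueryRequest:
    """Body for POST /query."""
    question: str
    k: int = 5              # number of chunks to retrieve
    use_cache: bool = True  # skip caches when False

@dataclass
class DocumentsRequest:
    """Body for POST /documents."""
    texts: list
    metadatas: list = field(default_factory=list)
```

Defaults mirror the example payloads, so a client only has to supply `question`.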

📊 Monitoring

Prometheus Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `rag_requests_total` | Counter | Total requests by endpoint |
| `rag_request_latency_seconds` | Histogram | Request latency |
| `rag_llm_calls_total` | Counter | LLM API calls |
| `rag_llm_tokens_total` | Counter | Token usage |
| `rag_cache_hits_total` | Counter | Cache hits |
| `rag_errors_total` | Counter | Errors by type |

Grafana Dashboard

Import the provided dashboard for visualizations:

  • Request rate and latency
  • Cache hit ratio
  • LLM token usage
  • Error rates
  • Circuit breaker states

🛡️ Fault Tolerance

Rate Limiting

# Token bucket for smooth limiting
rate_limiter = TokenBucketRateLimiter(
    rate=60,        # tokens per minute
    capacity=100    # burst capacity
)

# Sliding window for strict limits
rate_limiter = SlidingWindowRateLimiter(
    limit=60,
    window_seconds=60
)
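The repository's `TokenBucketRateLimiter` isn't reproduced here; as a minimal stdlib-only sketch of the token-bucket idea (class name and method are hypothetical), tokens refill continuously and a request is allowed only if enough tokens remain:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Note the unit difference: the snippet above passes `rate=60` tokens per minute, which corresponds to `rate=1.0` per second in this sketch.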

Circuit Breaker

# Protects against cascading failures
circuit = CircuitBreaker(
    name="llm",
    config=CircuitBreakerConfig(
        failure_threshold=5,    # Open after 5 failures
        timeout_seconds=30,     # Try again after 30s
        success_threshold=2     # Close after 2 successes
    )
)
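The three config parameters above imply the usual CLOSED → OPEN → HALF_OPEN state machine. A stdlib-only sketch of that logic (the repo's `CircuitBreaker` is not shown, so this class name and structure are illustrative):

```python
import time

class SimpleCircuitBreaker:
    """CLOSED -> OPEN after `failure_threshold` consecutive failures;
    OPEN -> HALF_OPEN once `timeout_seconds` has elapsed;
    HALF_OPEN -> CLOSED after `success_threshold` successes, or back to OPEN on failure."""

    def __init__(self, failure_threshold=5, timeout_seconds=30.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.timeout_seconds:
                self.state = "HALF_OPEN"  # probe with the next call
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
        return result
```

While OPEN, calls fail fast instead of hammering a downstream service that is already struggling, which is what prevents the cascade.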

Caching Strategy

Query → Check Response Cache → Hit? Return
                            → Miss? Continue
        ↓
        Check Embedding Cache → Hit? Use cached embedding
                             → Miss? Generate & cache
        ↓
        Check Search Cache → Hit? Use cached results
                          → Miss? Search & cache
        ↓
        Generate Response → Cache & Return
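Each stage in this cascade is the classic cache-aside pattern: check the cache, and on a miss compute the value and store it. A minimal sketch with a plain dict standing in for Redis (the real layer in `cache.py` presumably also handles TTLs and serialization):

```python
class CacheAside:
    """Cache-aside: return a cached value if present; otherwise compute,
    store, and return it. A dict stands in for Redis in this sketch."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = compute()       # expensive: embedding, search, or LLM call
        self.store[key] = value
        return value
```

For example, the embedding stage would be roughly `cache.get_or_compute(("embed", text), lambda: embed(text))`, and likewise for the search and response stages with their own key prefixes.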

🔧 Configuration

All settings via environment variables:

# LLM
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=2000

# Vector Store
PINECONE_INDEX=production-rag
PINECONE_ENV=us-east-1

# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
REDIS_URL=redis://localhost:6379

# Rate Limiting
RATE_LIMIT_RPM=60
RATE_LIMIT_TPM=100000
MAX_CONCURRENT=10

# Monitoring
PROMETHEUS_ENABLED=true
PROMETHEUS_PORT=9090
LOG_LEVEL=INFO
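`config.py` itself isn't shown here; as a stdlib-only sketch, reading a subset of these variables with typed defaults might look like this (the helper name and returned keys are illustrative):

```python
import os

def load_settings(env=os.environ):
    """Read a subset of the environment variables above, with typed defaults."""
    return {
        "llm_model": env.get("LLM_MODEL", "gpt-4o-mini"),
        "llm_temperature": float(env.get("LLM_TEMPERATURE", "0.3")),
        "cache_enabled": env.get("CACHE_ENABLED", "true").lower() == "true",
        "cache_ttl": int(env.get("CACHE_TTL", "3600")),
        "rate_limit_rpm": int(env.get("RATE_LIMIT_RPM", "60")),
    }
```

Passing `env` explicitly keeps the loader testable without mutating the process environment.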

🧪 Testing

# Run tests
pytest tests/ -v

# Load testing
locust -f tests/locustfile.py --host=http://localhost:8000

📈 Scaling Guidelines

| Users | Pods | Redis | Notes |
|-------|------|-------|-------|
| < 100 | 2 | 1 | Development |
| 100-1K | 3-5 | 1 | Small production |
| 1K-10K | 5-10 | 3 (cluster) | Medium production |
| 10K+ | 10+ | Redis Cluster | Large production |
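In Kubernetes, these pod counts are typically enforced with a HorizontalPodAutoscaler. A sketch for the medium-production tier, assuming the Deployment is named `scalable-rag` (the actual manifest in `kubernetes/deployment.yaml` may already define this):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scalable-rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scalable-rag
  minReplicas: 5
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU utilization is a reasonable default signal; a custom metric such as request latency or queue depth may track load more faithfully for LLM-bound workloads.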

🔐 Security

  • API key management via Kubernetes secrets
  • Rate limiting prevents abuse
  • Circuit breakers prevent cascade failures
  • Health checks enable zero-downtime deploys

📝 License

MIT License

