The Retrieval-Augmented Generation (RAG) System is an intelligent document querying platform that combines large language models with vector-based information retrieval. The system enables users to upload documents, process them into embeddings, and ask natural language questions that are answered based on the document content.
The goal is to provide a scalable, efficient, and user-friendly interface for document-based question answering that leverages local LLM models for privacy and cost-effectiveness.
- Document ingestion and processing (PDF, TXT, MD, DOCX)
- Automated chunking and embedding generation
- Vector similarity search using Qdrant
- Context-aware response generation using local LLMs
- RESTful API for integration
- Support for multiple document collections
Responsibilities:
- Expose REST endpoints for client applications
- Handle authentication and authorization
- Validate incoming requests
- Return responses in standardized format
- Rate limiting and request throttling
Key Endpoints:
Document Management:
- `POST /api/v1/documents/upload` - Upload document (async, returns task ID)
- `GET /api/v1/documents/list/{collection}` - List documents in collection
- `GET /api/v1/documents/download/{collection}/(unknown)` - Download document
- `DELETE /api/v1/documents/{collection}/(unknown)` - Delete document
Task Management:
- `GET /api/v1/tasks/{task_id}` - Get task status and result
- `POST /api/v1/tasks/{task_id}/revoke` - Cancel/revoke task
- `GET /api/v1/tasks/` - List active tasks
Query Operations:
- `POST /api/v1/query` - Query the RAG system
- `GET /api/v1/query/history` - Get query history
Collection Management:
- `GET /api/v1/collections` - List collections
- `POST /api/v1/collections` - Create collection
- `GET /api/v1/collections/{name}` - Get collection details
- `DELETE /api/v1/collections/{name}` - Delete collection
Model Management:
- `POST /api/v1/models/load` - Load a model
- `PUT /api/v1/models/switch` - Switch active model
- `GET /api/v1/models` - List available models
Batch Operations:
- `GET /api/v1/batches/{id}` - Get batch processing status
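A minimal client-side sketch for the query endpoint. The payload field names (`query`, `collection`, `top_k`) are assumptions for illustration; the authoritative request schema lives in the Pydantic models under `src/api/models/`.

```python
def build_query_payload(question: str, collection: str = "default", top_k: int = 5) -> dict:
    """Body for POST /api/v1/query. Field names are illustrative only."""
    return {"query": question, "collection": collection, "top_k": top_k}

# Example usage (requires a running API server):
#   import httpx
#   r = httpx.post("http://localhost:8000/api/v1/query",
#                  json=build_query_payload("What is RAG?"))
#   print(r.json())
```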
Responsibilities:
- Parse various document formats
- Extract text content
- Clean and normalize text
- Split documents into chunks
- Manage chunk metadata
Supported Formats:
- PDF files
- Plain text (.txt)
- Markdown (.md)
- Word documents (.docx)
- HTML content
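The chunking step above can be sketched as a fixed-size sliding window with overlap. Sizes are illustrative; the actual strategy lives in `src/utils/text_chunker.py` (or LangChain's splitters, per the technology stack).

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks, recording each chunk's start offset
    as metadata so answers can be traced back to their source position."""
    assert chunk_size > overlap, "overlap must be smaller than chunk_size"
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({"text": piece, "start": start})
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap preserves context across chunk boundaries so a sentence split in two still appears whole in at least one chunk.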
Responsibilities:
- Generate vector embeddings for text chunks
- Manage embedding model selection
- Handle batch processing for efficiency
- Cache embeddings to avoid recomputation
- Implement cache locking to prevent race conditions
- Support multiple embedding models
Cache Concurrency Control:
- Uses atomic operations for cache check and update
- Implements distributed locks for multi-instance deployments
- Cache-aside pattern with lock to prevent duplicate embedding generation
- When multiple requests process identical text, only one generates embeddings
Default Model: all-MiniLM-L6-v2 or similar lightweight local model
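The cache-aside-with-lock pattern described above can be sketched in-process as follows. This is a single-node sketch: a multi-instance deployment would swap `threading.Lock` for a distributed lock (e.g. Redis-based), and `embed_fn` stands in for the sentence-transformers model.

```python
import hashlib
import threading

class EmbeddingCache:
    """Cache-aside with a per-key lock: when several requests carry identical
    text, only one computes the embedding; the rest wait and reuse it."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the lock registry itself

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._cache:              # fast path, no lock
            return self._cache[key]
        with self._lock_for(key):           # only one caller computes
            if key not in self._cache:      # re-check after acquiring the lock
                self._cache[key] = self._embed_fn(text)
        return self._cache[key]
```

The re-check after acquiring the lock is what prevents duplicate generation: a waiter that lost the race finds the result already cached.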
Responsibilities:
- Store and index document embeddings
- Perform similarity search
- Manage collections and indexes
- Handle CRUD operations for vectors
- Optimize query performance
Configuration:
- In-memory or persistent storage
- Configurable distance metrics (Cosine, Euclidean, Dot)
- Sharding for scalability
- Replication for high availability
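Qdrant handles indexing and approximate nearest-neighbor search at scale; the cosine distance metric it applies can be illustrated with a brute-force version to show what a similarity query computes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    """Brute-force version of the search Qdrant's index performs efficiently."""
    scored = sorted(
        ((cosine_similarity(query_vec, v), vid) for vid, v in vectors.items()),
        reverse=True,
    )
    return [vid for _, vid in scored[:k]]
```

In production this loop is replaced by a `qdrant-client` search call against an indexed collection; the ranking semantics are the same.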
Responsibilities:
- Process document uploads asynchronously
- Generate embeddings in background
- Manage task queues (documents, embeddings)
- Track task status and results
- Retry failed tasks automatically
- Scale workers independently
Technology:
- Celery 5.3.4: Distributed task queue
- Redis 7: Message broker and result backend
- Queues: Separate queues for documents (I/O) and embeddings (CPU/GPU)
Task Types:
- `process_document_task`: Parse and chunk documents
- `generate_embeddings_task`: Generate and store embeddings
- `delete_document_task`: Delete documents and vectors
Benefits:
- Non-blocking API responses
- Horizontal scalability
- Automatic retry on failures
- Task status tracking
Implementation:
- Location: `src/tasks/`
- Configuration: `src/tasks/celery_app.py`
- Tasks: `src/tasks/document_tasks.py`, `src/tasks/embedding_tasks.py`
- API: `src/api/routes/tasks.py` for status tracking
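Celery provides the automatic-retry behavior natively through task options such as `autoretry_for` and `retry_backoff`; the semantics can be sketched in plain Python to show what "retry with exponential backoff" means for a transient failure:

```python
import time

def run_with_retry(task_fn, max_retries: int = 3, base_delay: float = 0.01):
    """Call task_fn, retrying on any exception with exponential backoff.
    Mirrors the behavior Celery applies to the document/embedding tasks."""
    for attempt in range(max_retries + 1):
        try:
            return task_fn()
        except Exception:
            if attempt == max_retries:
                raise                      # retries exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ... delay
```

With Celery the same policy is declared on the task decorator rather than coded by hand, and failed attempts are visible through the task-status endpoints above.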
Responsibilities:
- Manage local LLM models via Ollama
- Generate context-aware responses
- Handle prompt engineering and templates
- Manage model loading and unloading
- Support streaming responses
Default Model: llama2 or mistral via Ollama
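A request to Ollama's generate endpoint (`POST /api/generate`, default base URL `http://localhost:11434`) can be built as follows; the default model name here is an assumption matching the models listed above:

```python
def build_ollama_request(prompt: str, model: str = "mistral", stream: bool = False) -> dict:
    """Body for Ollama's POST /api/generate endpoint. With stream=True,
    Ollama returns the response incrementally as newline-delimited JSON."""
    return {"model": model, "prompt": prompt, "stream": stream}

# Example usage (requires a running Ollama instance):
#   import httpx
#   r = httpx.post("http://localhost:11434/api/generate",
#                  json=build_ollama_request("Summarize RAG in one sentence."))
#   print(r.json()["response"])
```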
Responsibilities:
- Parse user queries
- Generate query embeddings
- Retrieve relevant documents from vector store
- Construct context for LLM
- Format and return final response
- Handle edge cases and errors
RAG Pipeline:
- Receive user query
- Generate query embedding
- Perform similarity search in Qdrant
- Retrieve top-k relevant chunks
- Construct prompt with context
- Generate response using LLM
- Return formatted answer
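The seven pipeline steps above can be wired together in one function. The collaborators (`embed`, `search`, `generate`) are injected so the sketch runs without live services; the prompt template and the `id`/`text` chunk fields are illustrative, not the system's actual schema.

```python
def answer_query(question: str, embed, search, generate, top_k: int = 5) -> dict:
    """End-to-end RAG pipeline: embed the query, retrieve context, prompt the LLM."""
    q_vec = embed(question)                                  # step 2: query embedding
    chunks = search(q_vec, top_k)                            # steps 3-4: top-k retrieval
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (                                               # step 5: prompt with context
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = generate(prompt)                                # step 6: LLM generation
    return {"answer": answer, "sources": [c["id"] for c in chunks]}  # step 7
```

Returning the chunk IDs alongside the answer lets the API cite which documents the response was grounded in.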
Responsibilities:
- Store original documents
- Manage document metadata
- Track processing status
- Handle file lifecycle
- Implement cleanup policies
- Language: Python 3.11+
- Framework: FastAPI or Flask
- Async Support: asyncio
- Type Checking: mypy
- Vector Store: Qdrant
- Deployment: Docker or local installation
- Client: qdrant-client Python SDK
- LLM Runtime: Ollama
- Models: Llama 2, Mistral, or other open-source models
- API: Ollama HTTP API
- Model: sentence-transformers
- Backend: PyTorch
- Inference: CPU/GPU accelerated
- PDF: PyPDF2 or pdfplumber
- Word: python-docx
- Text: Built-in Python libraries
- Chunking: LangChain or custom implementation
- Task Queue: Celery 5.3.4
- Message Broker: Redis 7
- Result Backend: Redis 7
- Worker Management: Docker Compose or manual scripts
- Environment: python-dotenv
- Logging: structlog
- Validation: pydantic
- Testing: pytest
- HTTP Client: httpx
Each component has a single, well-defined responsibility. The system is divided into distinct layers (API, Service, Data) with clear boundaries.
- Horizontal scaling via containerization
- Efficient vector storage and retrieval
- Asynchronous processing for I/O operations
- Connection pooling for database operations
- All processing runs locally (no external API calls for LLM)
- Optional encryption for stored documents
- Secure authentication and authorization
- Audit logging for sensitive operations
- Embedding caching to avoid recomputation
- Batch processing for efficiency
- Vector similarity search with sub-second latency
- Optimized chunk sizes for balance of context and performance
- Plugin architecture for embedding models
- Support for multiple LLM providers (if needed)
- Configurable chunking strategies
- Modular component design
- Graceful degradation on failures
- Retry mechanisms for transient errors
- Comprehensive error logging
- User-friendly error messages
- Python 3.11 or higher installed
- Sufficient RAM for local LLM inference (minimum 8GB, recommended 16GB+)
- Disk space for document storage and vector database
- Network access for initial model downloads (Ollama)
- Document ingestion: ~1-5 MB/second depending on model and hardware
- Query response: < 3 seconds for typical queries
- Embedding generation: ~1000 chunks/minute on CPU
- Typical document size: 1-50 MB per file
- Query frequency: 1-100 queries per minute
- Concurrent users: 1-50 (depends on hardware)
- Total documents: 1-10,000 documents (scales with Qdrant)
- Single-tenant deployment (designed for self-hosting)
- No built-in multi-user access control (can be added)
- Limited to text-based content (images not processed)
- Processing speed depends on local hardware
- Document upload and processing
- Vector-based similarity search
- RAG-based question answering
- REST API for integration
- Support for common document formats
- Collection management
- Query history (optional)
- Real-time document updates (documents are re-indexed)
- Multi-user access control (can be added later)
- Distributed deployment (single-node design)
- Web UI (API-only implementation)
- Image and video processing
- Multi-modal retrieval
- Advanced NLP features (NER, entity extraction, etc.)
- Chat history and conversation memory (can be added)
simple-rag-system/
├── src/
│ ├── api/ # Presentation Layer
│ │ ├── __init__.py
│ │ ├── main.py # FastAPI application
│ │ ├── routes/ # API route definitions
│ │ │ ├── documents.py
│ │ │ ├── query.py
│ │ │ ├── collections.py
│ │ │ ├── tasks.py # Task status tracking endpoints
│ │ │ └── health.py
│ │ └── models/ # Pydantic models
│ │ ├── document.py
│ │ ├── query.py
│ │ ├── collection.py
│ │ ├── task.py # Task status models
│ │ └── common.py
│ ├── services/ # Service Layer (Business Logic)
│ │ ├── __init__.py
│ │ ├── document_processor.py
│ │ ├── embedding_service.py
│ │ ├── vector_store.py
│ │ ├── llm_service.py
│ │ ├── query_processor.py
│ │ └── storage_manager.py
│ ├── tasks/ # Background Task Processing (IMPLEMENTED)
│ │ ├── __init__.py
│ │ ├── celery_app.py # Celery application configuration
│ │ ├── document_tasks.py # Document processing Celery tasks
│ │ └── embedding_tasks.py # Embedding generation Celery tasks
│ ├── core/ # Core utilities (Infrastructure Layer)
│ │ ├── __init__.py
│ │ ├── config.py # Configuration management
│ │ ├── exceptions.py # Custom exceptions
│ │ ├── logging.py # Logging setup
│ │ └── monitoring.py # Metrics and monitoring
│ └── utils/ # Utility functions (Infrastructure Layer)
│ ├── __init__.py
│ ├── text_chunker.py
│ ├── validators.py
│ └── cache.py # Cache utilities
├── tests/
│ ├── unit/
│ ├── integration/
│ └── e2e/
├── docs/
│ ├── 01-basic-design.md # This document
│ ├── 02-c4-model.md
│ ├── 03-high-level-design.md
│ ├── 04-data-flow.md
│ └── 05-sequence-diagrams.md
├── requirements.txt
├── .env.example
└── README.md
For detailed architectural diagrams and design specifications, refer to:
- C4 Model - System architecture with Context, Container, and Component diagrams
- High-Level Design - Architectural patterns and deployment strategy
- Data Flow - Detailed data flow diagrams
- Sequence Diagrams - Interaction sequences between components