DocBot is a Retrieval-Augmented Generation (RAG) system for intelligent PDF document processing and question answering. Built with FastAPI and Mistral AI, it combines semantic search, keyword matching, and advanced LLM capabilities to provide accurate, cited answers from your documents.
Built by: Darren Jian || Contact: darrenjian28@gmail.com
RAG-System/
├── app/
│ ├── main.py # FastAPI endpoints and server configuration
│ ├── rag_pipeline.py # Orchestrates ingestion and query workflows
│ ├── vector_db.py # Custom vector database with cosine similarity
│ ├── mistral_client.py # Mistral API client (OCR, embeddings, LLM)
│ ├── search.py # Hybrid search (semantic + BM25 keyword)
│ ├── chunking.py # Sentence-based text chunking with NLTK
│ ├── models.py # Pydantic request/response models
│ └── config.py # Configuration management
├── ui/
│ └── index.html # Chat interface for document Q&A
├── vectordb/ # Persisted vector database (created at runtime)
├── uploads/ # Uploaded PDFs (created at runtime)
├── requirements.txt # Python dependencies
├── .env # Environment variables (API keys)
- PDF Ingestion: Upload PDFs with text extraction via PyPDF2
- Hybrid Search: Combines semantic search (embeddings) + keyword search (BM25)
- LLM Generation: Context-aware answer generation with citations
- Intent Detection: Classifies queries to optimize API usage
- Query Transformation: Enhances queries for better retrieval
- Hallucination Detection: Verifies answers against source documents
- Confidence Filtering: Only answers when evidence threshold is met
- Rich Formatting: Markdown-rendered answers with bold, lists, tables
| Component | Technology | Purpose |
|---|---|---|
| Backend | FastAPI | Async API server with auto-docs |
| Server | Uvicorn | ASGI server |
| LLM/Embeddings | Mistral AI | Text extraction, embeddings, generation |
| Vector Database | Custom (NumPy) | Cosine similarity search |
| Keyword Search | BM25 (custom) | Exact term matching |
| Text Processing | NLTK | Sentence tokenization |
| PDF Extraction | PyPDF2 | Digital PDF text extraction |
| Validation | Pydantic | Type-safe models |
| Frontend | Vanilla HTML/CSS/JS | Simple chat interface |
This system was built with specific constraints and use cases in mind:
Dataset Assumptions:
- Digital PDFs: Documents with extractable text layers (not scanned/image-based)
- English Language: NLTK sentence tokenization optimized for English
- Document Size: Typical range 10-500 pages (~50-25,000 chunks)
- Content Type: Text-heavy documents (reports, papers, articles) rather than form-heavy or image-heavy PDFs
Usage Assumptions:
- Scale: Small team or personal use (<10 concurrent users)
- Query Pattern: Asynchronous, research-oriented queries (not real-time chat)
- Workload: Read-heavy (many queries per ingested document)
- Deployment: Single-machine deployment (no distributed system needed)
Technical Bounds:
- Vector Count: <100k vectors total (~500 documents of 200 chunks each)
- Memory: Entire vector database fits in RAM (typically <2GB)
- API Limits: Respects Mistral API rate limits (batching in groups of 100)
- Concurrent Users: Designed for <100 simultaneous queries
Known Constraints:
- No GPU: Runs on CPU-only machines
- Stateless: No conversation history across sessions
- Single Language: No multilingual support
- Storage: Local filesystem only (no cloud storage integration)
These assumptions enable simplicity and educational value. For production systems with different requirements, see Future Improvements.
┌──────────────────────────────────────────────────────────────────────┐
│ RAG SYSTEM PIPELINE │
└──────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ INGESTION PHASE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ User uploads PDF │
│ ↓ │
│ [1] PDF Text Extraction (PyPDF2) │
│ → Extract text page-by-page from digital PDFs │
│ ↓ │
│ [2] Text Chunking (Sentence-based) │
│ → Split into sentences using NLTK │
│ → Preserves semantic boundaries │
│ ↓ │
│ [3] Embedding Generation (Mistral API) │
│ → Convert chunks to 1024-dimensional vectors │
│ → Batched in groups of 100 for large documents │
│ ↓ │
│ [4] Vector Database Storage │
│ → Store vectors with metadata │
│ → Index for BM25 keyword search │
│ ↓ │
│ [5] Persist to Disk │
│ → Save to vectordb/ for future queries │
│ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ QUERY PHASE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ User asks question │
│ ↓ │
│ [1] Intent Detection (Mistral LLM) │
│ → Classify: greeting, chitchat, factual, retrieval_needed │
│ → Skip retrieval for greetings (faster, cheaper) │
│ ↓ │
│ [2] Query Transformation (Mistral LLM) │
│ → Expand abbreviations, add synonyms │
│ → Example: "ML" → "machine learning" │
│ ↓ │
│ [3] Query Embedding (Mistral API) │
│ → Convert enhanced query to vector │
│ ↓ │
│ [4] Hybrid Search │
│ ├─ Semantic Search (Cosine Similarity) │
│ │ → Find top-10 semantically similar chunks │
│ │ → Score: 0.0 to 1.0 │
│ │ │
│ ├─ Keyword Search (BM25) │
│ │ → Find top-10 by keyword matching │
│ │ → TF-IDF based scoring │
│ │ │
│ └─ Combine & Re-rank │
│ → Hybrid score = 0.7×semantic + 0.3×keyword │
│ → Return top-5 results │
│ ↓ │
│ [5] Confidence Check │
│ → If top score < 0.50: "I don't have enough evidence" │
│ ↓ │
│ [6] Answer Generation (Mistral LLM) │
│ → Provide retrieved context + strict constraints │
│ → Generate answer with source citations [1][2] │
│ ↓ │
│ [7] Hallucination Detection (Optional) │
│ → Verify answer is supported by context │
│ → Override if unsupported claims detected │
│ ↓ │
│ [8] Return Response │
│ → Answer + Citations + Confidence + Processing Time │
│ │
└─────────────────────────────────────────────────────────────────────┘
- PDF Text Extraction: PyPDF2 extracts text from digital PDFs page-by-page
- Sentence-Based Chunking: NLTK splits text into sentences, preserving semantic boundaries for precise retrieval
- Embedding Generation: Mistral API converts each sentence to a 1024-dimensional vector
- Dual Indexing: Stores vectors for semantic search + builds BM25 index for keyword search
- Persistence: Saves to disk for fast querying
- Intent Detection: LLM classifies query type to skip retrieval for greetings (saves API calls)
- Query Enhancement: LLM expands abbreviations and adds synonyms for better matching
- Hybrid Search: Combines semantic similarity (finds conceptually related content) with BM25 keyword matching (finds exact terms)
- Re-ranking: Weighted combination (70% semantic, 30% keyword) produces final top-5 results
- Confidence Filtering: Only generates answers when similarity threshold (0.5) is met
- Answer Generation: LLM generates grounded response with citations using retrieved context
- Verification: Optional hallucination check ensures answer is supported by source documents
- Sentence-based (not fixed-size) to preserve semantic coherence
- Each sentence becomes a searchable unit with exact citations
- Filters out very short sentences (< 10 chars) to reduce noise
- Semantic search captures conceptual similarity (good for paraphrases)
- Keyword search finds exact term matches (good for specific codes/names)
- Combination leverages strengths of both approaches
- Built with NumPy for educational purposes
- No external dependencies (Pinecone, Weaviate, etc.)
- Sufficient for small-to-medium scale (<100k vectors)
- Batch size of 100 chunks to prevent API timeouts and rate limits
- Enables reliable processing of large documents (1000+ chunks)
- Provides progress feedback during ingestion
- Prevents memory issues and network failures
| File | Purpose |
|---|---|
main.py |
FastAPI app with endpoints: /ingest, /query, /health, /stats, /reset |
rag_pipeline.py |
Orchestrates ingestion workflow and query processing pipeline |
vector_db.py |
In-memory vector database with cosine similarity search |
mistral_client.py |
Wraps Mistral API for OCR, embeddings, LLM generation, intent classification |
search.py |
Implements BM25 keyword search and hybrid search re-ranking |
chunking.py |
Sentence-based text chunking using NLTK tokenizer |
models.py |
Pydantic models for request/response validation |
config.py |
Settings loaded from .env (API keys) and hardcoded configs |
| Endpoint | Method | Description |
|---|---|---|
/ |
GET | Serve chat UI |
/health |
GET | System health and statistics |
/ingest |
POST | Upload and process PDF |
/query |
POST | Query knowledge base |
/stats |
GET | Detailed system statistics |
/reset |
POST | Clear all data (destructive) |
/documents |
GET | List uploaded PDFs |
/documents/{filename} |
DELETE | Delete specific PDF |
Interactive API Documentation: http://localhost:8000/docs
- Python 3.9+
- Mistral API key (get one here)
-
Clone the repository
git clone https://github.com/darrenjian/RAG-System.git cd RAG-System -
Create virtual environment
python -m venv venv source venv/bin/activate # Windows: venv\Scripts\activate
-
Install dependencies
pip install -r requirements.txt
-
Create environment file
cp .env.example .env
-
Add your Mistral API key
Edit
.env:MISTRAL_API_KEY=your_api_key_here -
Customize model configurations (optional)
Edit
app/config.pyto adjust:- Model selection (
embedding_model,llm_model,ocr_model) - RAG parameters (
chunk_size,top_k_results,similarity_threshold) - Hybrid search weights (
semantic_weight,keyword_weight)
- Model selection (
Development mode (auto-reload on file changes):
python run.pyProduction mode:
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000Server starts at: http://localhost:8000
- Open
http://localhost:8000in your browser - Click "Upload PDF" in the sidebar
- Select a PDF file (digital PDFs work best)
- Wait for processing confirmation
- Ask questions about the document!
Example queries:
- "What is this document about?"
- "Summarize the key findings"
- "What were the Q4 revenue numbers?"
Upload a PDF:
curl -X POST "http://localhost:8000/ingest" \
-F "file=@document.pdf"Query the knowledge base:
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is the main topic?",
"top_k": 5,
"use_hybrid_search": true
}'Check system health:
curl http://localhost:8000/health-
Adaptive Similarity Threshold: Automatically calibrate threshold based on document characteristics and query patterns (e.g., 0.6 for technical docs, 0.4 for general content)
-
Production-Grade Vector DB: Integrate FAISS or Pinecone/Weaviate for scalability to millions of vectors with sub-millisecond search times
-
Multi-Modal PDF Understanding: Extract and query tables, charts, and images using vision models for comprehensive document analysis
-
Conversation Memory: Track context across queries within a session to enable multi-turn dialogue and follow-up questions
-
Testing & Production Readiness: Comprehensive test suite, Docker containerization, CI/CD pipeline, and monitoring infrastructure
- Vector DB Scale: Custom NumPy implementation is O(n) linear search; becomes slow beyond 100k vectors
- BM25 Index Scale: Inverted index grows with vocabulary size; for millions of documents, dedicated search engine (Elasticsearch) would be needed
- No conversation state: Each query is independent; no multi-turn dialogue tracking
Built by Darren Jian