This repository contains a fully local Retrieval-Augmented Generation (RAG) system designed to perform grounded Question Answering (QA) on private documents. It solves the enterprise problem of LLM hallucinations by aggressively retrieving relevant factual context before generation.
The application provides a production-style semantic search architecture, featuring an automated evaluation pipeline, runtime performance telemetry, and a Streamlit UI for interaction—all without relying on external API calls or cloud dependencies.
Standard RAG implementations using only Naive Vector Search (e.g., Cosine Similarity) often fail in production because they miss exact keyword overlap (like model numbers or acronyms). Conversely, pure keyword search fails to understand semantic meaning or synonyms.
This system was built to demonstrate a robust Retrieve-and-Rerank architecture:
- Hybrid Retrieval bridges the gap by combining dense vector embeddings with sparse lexical matches.
- Cross-Encoder Reranking is applied as a second-stage filter to drastically improve the precision of the context injected into the LLM.
- Local Inference via Ollama ensures complete data privacy and zero API costs.
- Document Ingestion: Recursively loads `.pdf` and `.txt` files from the local `data/` directory.
- Chunking: Uses LangChain's `RecursiveCharacterTextSplitter` to create overlapping chunks, preserving context boundaries.
- Embedding/Indexing:
  - Dense: Embedded using `all-MiniLM-L6-v2` and stored in a persistent ChromaDB vector store.
  - Sparse: Indexed in-memory using the BM25 algorithm.
- Retrieval (Hybrid Merge): Queries both indices concurrently and uses Reciprocal Rank Fusion (or deduplication) to combine candidates.
- Reranking: Passes the retrieved subset through an SBERT Cross-Encoder (`ms-marco-MiniLM-L-6-v2`), which scores the query directly against each chunk.
- Answer Synthesis: Validated top-K chunks are injected into a strict generative prompt using a local `llama3.1` model.
- Citations: The UI extracts and displays precise references to source documents and chunk IDs for transparency.
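The hybrid merge step can be sketched in a few lines. The function below is an illustrative Reciprocal Rank Fusion implementation, not the repository's actual `retrieve.py` code, and the chunk IDs are invented:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each chunk's score is sum(1 / (k + rank)) over the lists it appears
    in; k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["chunk_3", "chunk_7", "chunk_1"]    # exact-keyword matches
vector_hits = ["chunk_7", "chunk_2", "chunk_3"]  # semantic matches
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# chunk_7 ranks first: it scores highly in both lists.
```

A chunk that appears in both lists accumulates score from each, which is why hybrid fusion favors candidates that are strong both lexically and semantically.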
- Why BM25? Dense embeddings struggle with exact lexical matches (e.g., "Error Code 404"). BM25 guarantees that exact keywords are retrieved.
- Why Vector Search? BM25 struggles with synonyms (e.g., "vehicle" vs "car"). Vector search captures semantic intent.
- Why Combine Both? Combining them in a Hybrid Retrieve step maximizes initial Recall.
- Why a Cross-Encoder? High recall brings noisy results. A Cross-Encoder acts as an extremely accurate relevance judge to maximize Precision before hitting the LLM context window, lowering token costs and reducing hallucination risk.
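In the actual pipeline the relevance judge is an SBERT Cross-Encoder (`ms-marco-MiniLM-L-6-v2`) scoring each (query, chunk) pair jointly. The sketch below substitutes a toy lexical-overlap scorer so it runs without model downloads, but the rerank loop has the same shape: score every pair, keep the top-k:

```python
def toy_score(query: str, chunk: str) -> float:
    """Stand-in for a Cross-Encoder's predict(): fraction of query terms in chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def rerank(query, chunks, top_k=2, score_fn=toy_score):
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

chunks = [
    "The reset procedure clears error code 404 from the panel.",
    "Shipping and handling times vary by region.",
    "Error code 404 can mean the sensor calibration file is missing.",
]
top = rerank("what does error code 404 mean", chunks)
```

Swapping `toy_score` for a real Cross-Encoder call changes only the scoring function; the filtering logic around it stays identical.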
To validate the architecture, the system ships with automated evaluation scripts against a curated dataset.
- `compare.py`: Benchmarks four retrieval modes (`BM25 Only`, `Vector Only`, `Hybrid`, and `Hybrid + Rerank`). It uses a fast heuristic relevance metric that checks whether expected target keywords appear in the retrieved chunks.
- `eval.py`: A robust LLM-as-a-judge (or local heuristic) evaluation that computes holistic fidelity metrics (such as average Relevance Hit Rate).
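The heuristic relevance check can be approximated as a keyword hit rate. This is a hedged reconstruction of the idea, not `compare.py`'s exact code:

```python
def hit_rate(results, expected_keywords):
    """Fraction of queries whose retrieved chunks contain every expected keyword.

    `results` maps each query to its list of retrieved chunks;
    `expected_keywords` maps each query to the keywords that should
    appear somewhere in those chunks.
    """
    hits = 0
    for query, chunks in results.items():
        blob = " ".join(chunks).lower()
        if all(kw.lower() in blob for kw in expected_keywords[query]):
            hits += 1
    return hits / max(len(results), 1)

results = {"q1": ["Error 404 means missing file"], "q2": ["Unrelated text"]}
expected = {"q1": ["404"], "q2": ["calibration"]}
rate = hit_rate(results, expected)  # 0.5: q1 hits, q2 misses
```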
This system implements realistic software engineering constraints required for production RAG systems:
- Prompt Versioning: Hardcoded string literals have been removed. The generative instructions are externalized in `config/prompts.yaml`, supporting clean iteration and A/B testing of prompt engineering without altering application logic.
- Structured Golden Dataset: Evaluation relies on a dynamic, scalable `data/golden_eval.json` schema rather than hardcoded lists. This allows the benchmarking suite to grow seamlessly.
- CI-Based Quality Gating: The repository includes a GitHub Actions workflow (`.github/workflows/eval.yml`). On every push, it builds the environment, runs the evaluation heuristic against the golden dataset, and fails the build if the retrieval hit rate or generation faithfulness drops below 50%.
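The gate itself reduces to a few lines. The metric names below are assumptions for illustration, not necessarily the workflow's real schema:

```python
THRESHOLD = 0.5  # matches the 50% gate described above

def passes_gate(metrics: dict, threshold: float = THRESHOLD) -> bool:
    """Return True only when both gated metrics clear the threshold."""
    return (metrics["retrieval_hit_rate"] >= threshold
            and metrics["generation_faithfulness"] >= threshold)

# Hypothetical metrics as the eval step might emit them:
metrics = {"retrieval_hit_rate": 0.72, "generation_faithfulness": 0.61}
ok = passes_gate(metrics)
# In CI the script would exit non-zero on failure, e.g. sys.exit(0 if ok else 1),
# which is what fails the GitHub Actions job.
```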
Latency measurements are instrumented natively in the Streamlit UI's telemetry panel. Typical local benchmarks (on standard consumer hardware):
- Indexing: < 5 seconds for smaller documents.
- Retrieval: ~0.1 - 0.3 seconds.
- Reranking: ~0.5 - 1.5 seconds (depends heavily on K).
- LLM Generation: Varies by Ollama model size and context length, usually between 4 - 10 seconds.
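Stage timings like these can be collected with a small context manager. This is a generic sketch rather than the UI's actual telemetry code:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("retrieval"):
    time.sleep(0.01)  # stand-in for the real retrieval call

print(f"retrieval took {timings['retrieval']:.3f}s")
```

Wrapping each stage (retrieval, reranking, generation) in `timed(...)` yields the per-stage breakdown shown in the telemetry panel.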
- Dual-index ingestion (BM25 + ChromaDB Vector)
- Hybrid retrieval with deduplication
- SBERT Cross-Encoder precision reranking
- Explicit generation citations
- Local Ollama-based inference
- Objective evaluation scripting
- UI Runtime Telemetry
- External prompt configuration
- Golden evaluation dataset structure
- CI-based quality gating
- Evaluation Scale: The golden evaluation dataset is currently a curated slice. Expanding this to 50-100 questions is required for comprehensive benchmark tracking.
- Thresholds: The CI evaluation thresholds are currently heuristic-driven and relatively lenient to support the rapid local testing cycle.
- Dependent on Chunking: The system uses standard recursive chunking. Highly structured data (like complex PDF tables) may lose formatting.
- Hardware Bounds: Quality and speed of generation depend heavily on local RAM/GPU when running models like `llama3.1`. The system is currently single-user focused and lacks an asynchronous queue for high-concurrency requests.
- Python 3.10+
- Ollama installed and running locally.
```bash
git clone <your-repo>
cd production-rag-app
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Ensure Ollama is running, then pull the required model:

```bash
ollama serve
ollama pull llama3.1
```

Launch the UI:

```bash
streamlit run app.py
```

Upload documents via the sidebar or place them in `data/`, then click "Build Index".
To run the automated objective benchmarks:
- Baseline Retrieval Tradeoffs: `python compare.py` (outputs to the UI and `outputs/comparison_results.csv`)
- Evaluate Generation: `python eval.py` (outputs to the UI and `outputs/eval_results.json`)
- `app.py`: The main Streamlit user interface and application entry point.
- `ingest.py`: Handles document parsing, chunking, and ChromaDB ingestion.
- `retrieve.py`: Implements BM25, Vector Search, and Hybrid deduplication logic.
- `rerank.py`: Wraps the SBERT Cross-Encoder for precision relevance scoring.
- `qa.py`: Formats the prompt and streams the context into LangChain's local Ollama wrapper.
- `eval.py`: E2E generation evaluation (supports Ragas or local fallbacks).
- `compare.py`: Automated strategy testing framework.
- Implement sophisticated PDF parsing (e.g., LlamaParse) for table and image extraction.
- Expand the evaluation dataset to support 100+ domain-specific question pairs.
- Extract document metadata natively for explicit sub-filtering during ChromaDB retrieval.
- Optimize reranking latency by offloading the CrossEncoder to a dedicated inference server.