Local RAG System with Hybrid Retrieval and Reranking

A. Project Overview

This repository contains a fully local Retrieval-Augmented Generation (RAG) system designed to perform grounded Question Answering (QA) on private documents. It addresses the enterprise problem of LLM hallucinations by retrieving relevant factual context before generation and grounding every answer in it.

The application provides a production-style semantic search architecture, featuring an automated evaluation pipeline, runtime performance telemetry, and a Streamlit UI for interaction—all without relying on external API calls or cloud dependencies.

B. Motivation

Standard RAG implementations using only Naive Vector Search (e.g., Cosine Similarity) often fail in production because they miss exact keyword overlap (like model numbers or acronyms). Conversely, pure keyword search fails to understand semantic meaning or synonyms.

This system was built to demonstrate a robust Retrieve-and-Rerank architecture:

  1. Hybrid Retrieval bridges the gap by combining dense vector embeddings with sparse lexical matches.
  2. Cross-Encoder Reranking is applied as a second-stage filter to drastically improve the precision of the context injected into the LLM.
  3. Local Inference via Ollama ensures complete data privacy and zero API costs.

C. System Architecture

  1. Document Ingestion: Recursively loads .pdf and .txt files from the local data/ directory.
  2. Chunking: Uses LangChain's RecursiveCharacterTextSplitter to create overlapping chunks, preserving context boundaries.
  3. Embedding/Indexing:
    • Dense: Embedded using all-MiniLM-L6-v2 and stored in a persistent ChromaDB vector store.
    • Sparse: Indexed in-memory using the BM25 algorithm.
  4. Retrieval (Hybrid Merge): Queries both indices concurrently and uses Reciprocal Rank Fusion (or deduplication) to combine candidates.
  5. Reranking: Passes the retrieved subset through an SBERT Cross-Encoder (ms-marco-MiniLM-L-6-v2) which scores the query directly against each chunk.
  6. Answer Synthesis: Validated top-K chunks are injected into a strict generative prompt utilizing a local llama3.1 model.
  7. Citations: The UI extracts and displays precise references to source documents and chunk IDs for transparency.
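
The hybrid-merge step (step 4) can be sketched with a minimal Reciprocal Rank Fusion helper. The function below is illustrative and not the exact implementation in retrieve.py; the chunk IDs are made up for the example.

```python
# Minimal sketch of Reciprocal Rank Fusion over the BM25 and vector rankings.
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk IDs (best-first) into one fused ranking.

    A chunk's fused score is the sum of 1 / (k + rank) over every list it
    appears in; k dampens the influence of any single top-ranked hit.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["c3", "c1", "c7"]    # sparse (lexical) ranking
vector_hits = ["c1", "c5", "c3"]  # dense (semantic) ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# "c1" and "c3" appear in both lists, so they float to the top of the fusion.
```

Chunks surfaced by both indices rise to the top, which is why the hybrid merge raises recall without handing the reranker a pile of duplicates.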

D. Retrieval Design Rationale

  • Why BM25? Dense embeddings struggle with exact lexical matches (e.g., "Error Code 404"). BM25 guarantees that exact keywords are retrieved.
  • Why Vector Search? BM25 struggles with synonyms (e.g., "vehicle" vs "car"). Vector search captures semantic intent.
  • Why Combine Both? Combining them in a Hybrid Retrieve step maximizes initial Recall.
  • Why a CrossEncoder? High recall brings noisy results. A Cross-Encoder acts as an accurate relevance judge that maximizes Precision before the context reaches the LLM, shrinking the prompt and reducing hallucination risk.
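
The rerank stage boils down to scoring every candidate against the query and keeping the top-K. In the real pipeline the scorer is the SBERT CrossEncoder (ms-marco-MiniLM-L-6-v2); to keep this sketch self-contained, a trivial keyword-overlap scorer stands in for it.

```python
def rerank(query, chunks, score_fn, top_k=3):
    """Score each candidate chunk against the query and keep the top-K."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

def overlap_score(query, chunk):
    # Stand-in for CrossEncoder.predict on a (query, chunk) pair.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "The vehicle manual covers routine maintenance.",
    "Error Code 404 means the resource was not found.",
    "404 responses occur when a page is missing.",
]
top = rerank("what does error code 404 mean", candidates, overlap_score, top_k=2)
# The chunk sharing the most query terms ("Error Code 404 ...") ranks first.
```

Swapping `overlap_score` for a real cross-encoder changes only the scoring function; the keep-top-K shape of the stage is the same.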

E. Evaluation Strategy

To validate the architecture, the system ships with automated evaluation scripts that run against a curated dataset.

  • compare.py: Benchmarks four retrieval modes (BM25 Only, Vector Only, Hybrid, and Hybrid + Rerank). It uses a fast heuristic relevance metric that checks whether the expected target keywords appear in the retrieved chunks.
  • eval.py: A robust LLM-as-a-judge (or local heuristic) evaluation that computes holistic fidelity metrics (like average Relevance Hit Rate).
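
The heuristic metric in compare.py can be sketched as follows; the function and field names here are assumptions, not the exact ones in the script. A question counts as a hit only when every expected keyword appears somewhere in the retrieved chunks.

```python
def relevance_hit_rate(eval_cases, retrieve):
    """Fraction of questions whose retrieved chunks contain all expected keywords."""
    hits = 0
    for case in eval_cases:
        text = " ".join(retrieve(case["question"])).lower()
        if all(kw.lower() in text for kw in case["expected_keywords"]):
            hits += 1
    return hits / len(eval_cases)

cases = [
    {"question": "what is error 404", "expected_keywords": ["404", "not found"]},
    {"question": "warranty period", "expected_keywords": ["warranty"]},
]
stub_retrieve = lambda q: ["Error 404: resource not found."]  # fixed stub retriever
rate = relevance_hit_rate(cases, stub_retrieve)  # 1 of 2 questions hit -> 0.5
```

Running the same metric across all four retrieval modes is what makes the BM25/Vector/Hybrid/Rerank comparison objective rather than anecdotal.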

F. Production-Ready Engineering

This system implements realistic software engineering constraints required for production RAG systems:

  • Prompt Versioning: Hardcoded string literals have been removed. The generative instructions are externalized in config/prompts.yaml, supporting clean iteration and A/B testing of prompt engineering without altering application logic.
  • Structured Golden Dataset: Evaluation relies on a dynamic, scalable data/golden_eval.json schema rather than hardcoded lists. This allows the benchmarking suite to grow seamlessly.
  • CI-Based Quality Gating: The repository includes a GitHub Actions workflow (.github/workflows/eval.yml). On every push, it builds the environment, runs the evaluation heuristic against the golden dataset, and fails the build if the retrieval hit rate or generation faithfulness drops below 50%.
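
The quality gate in the eval.yml workflow amounts to reading the evaluation metrics and exiting non-zero when either drops below 50%, which fails the GitHub Actions job. The results file name and JSON keys below are assumptions about eval.py's output, shown only to illustrate the gating logic.

```python
import json
import sys

THRESHOLD = 0.5  # 50% floor for both gated metrics

def quality_gate(results_path="outputs/eval_results.json"):
    """Exit non-zero (failing the CI job) if either metric is below threshold."""
    with open(results_path) as f:
        results = json.load(f)
    if (results["retrieval_hit_rate"] < THRESHOLD
            or results["generation_faithfulness"] < THRESHOLD):
        sys.exit(1)  # non-zero exit status fails the GitHub Actions step
```

Because the gate is just an exit code, the workflow needs no special tooling: any step that runs this script will turn red when quality regresses.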

G. Performance & Runtime Notes

Latency measurements are instrumented natively in the Streamlit UI's telemetry panel. Typical local benchmarks (on standard consumer hardware):

  • Indexing: < 5 seconds for smaller documents.
  • Retrieval: ~0.1 - 0.3 seconds.
  • Reranking: ~0.5 - 1.5 seconds (depends heavily on K).
  • LLM Generation: Varies by Ollama model size and context length, usually between 4 - 10 seconds.
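
Per-stage latencies like the ones above can be captured with a small timing context manager; this helper is illustrative, not the exact instrumentation in app.py.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> seconds, rendered in the telemetry panel

@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("retrieval"):
    time.sleep(0.01)  # stand-in for the hybrid retrieve call
```

Wrapping each pipeline stage (retrieval, reranking, generation) in `timed(...)` yields the per-stage breakdown without scattering timer code through the logic.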

H. Failure Handling & Limitations

What's Implemented Now:

  • Dual-index ingestion (BM25 + ChromaDB Vector)
  • Hybrid retrieval with deduplication
  • SBERT CrossEncoder precision reranking
  • Explicit generation citations
  • Local Ollama-based inference
  • Objective evaluation scripting
  • UI Runtime Telemetry
  • External prompt configuration
  • Golden evaluation dataset structure
  • CI-based quality gating

Current Limitations / Future Scaling:

  • Evaluation Scale: The golden evaluation dataset is currently a curated slice. Expanding this to 50-100 questions is required for comprehensive benchmark tracking.
  • Thresholds: The CI evaluation thresholds are currently heuristic-driven and relatively lenient to support the rapid local testing cycle.
  • Chunking Sensitivity: The system uses standard recursive chunking. Highly structured data (like complex PDF tables) may lose formatting.
  • Hardware Bounds: Quality and speed of generation depend heavily on local RAM/GPU when running models like llama3.1. The system is currently single-user focused and lacks an asynchronous queue for high-concurrency requests.

I. How to Run

1. Prerequisites

  • Python 3.10+
  • Ollama installed and running locally.

2. Setup

git clone <your-repo>
cd production-rag-app
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Start Local LLM

Ensure Ollama is running, then pull the required model:

ollama serve
ollama pull llama3.1

4. Run the Application

streamlit run app.py

Upload documents via the sidebar or place them in data/, then click "Build Index".

5. Run Analytics & Evaluation

To run the automated objective benchmarks:

  • Baseline Retrieval Tradeoffs: python compare.py (Outputs to UI and outputs/comparison_results.csv)
  • Evaluate Generation: python eval.py (Outputs to UI and outputs/eval_results.json)

J. Project Structure

  • app.py: The main Streamlit user interface and application entry point.
  • ingest.py: Handles document parsing, chunking, and ChromaDB ingestion.
  • retrieve.py: Implements BM25, Vector Search, and Hybrid deduplication logic.
  • rerank.py: Wraps the SBERT Cross-Encoder for precision relevance scoring.
  • qa.py: Formats the prompt and streams the context into LangChain's local Ollama wrapper.
  • eval.py: E2E generation evaluation (supports Ragas or local fallbacks).
  • compare.py: Automated strategy testing framework.

K. Future Improvements

  • Implement sophisticated PDF parsing (e.g., LlamaParse) for table and image extraction.
  • Expand the evaluation dataset to support 100+ domain-specific question pairs.
  • Extract document metadata natively for explicit sub-filtering during ChromaDB retrieval.
  • Optimize reranking latency by offloading the CrossEncoder to a dedicated inference server.
