Large Language Models (LLMs) are powerful, but when used in isolation they suffer from hallucinations, stale knowledge, and lack of grounding in private or domain-specific data.
At the same time, traditional keyword-based search systems fail to capture semantic intent and often return irrelevant results for complex queries.
Retrieval-Augmented Generation (RAG) systems combine search techniques with LLMs to produce accurate, explainable, and grounded answers.
This repository is a hands-on pipeline and end-to-end guide to building Retrieval-Augmented Generation (RAG) systems.
It is designed to help anyone from beginners to advanced practitioners understand how semantic search and LLMs can be combined to produce grounded, high-quality answers.
📖 What You'll Learn
- How to parse and represent user queries
- Different retrieval techniques: keyword, dense, and hybrid
- How to merge and rerank retrieved results
- Strategies for constructing context for LLMs
- Various RAG patterns: naive, local, conversational, agentic
- Evaluation metrics and production-ready considerations
👥 Who This Is For
- ML/NLP engineers building LLM-powered systems
- Backend engineers integrating retrieval pipelines
- Students and researchers learning RAG architectures
- Anyone aiming to go from naive demos to production-ready systems
🏳️ Level: beginner to intermediate
Retrieval-Augmented Generation (RAG) is an architecture for improving the performance of an AI model by connecting it to external knowledge bases. RAG helps LLMs deliver more relevant, higher-quality responses.
- Advantages of RAG:
- Injects missing knowledge (adds information not in the training data).
- Reduces hallucinations.
- Keeps the model's knowledge up to date and lets it focus on generation.
- Enables source citation.
The repository is structured to reflect this pipeline, with each folder dedicated to a single stage, containing explanations, practical examples, and minimal code demos.
The core pipeline consists of the following stages:
🗂 01 – Indexing
Prepares raw documents for efficient retrieval by transforming them into searchable representations.
Techniques
- Cleaning & normalization
- Metadata extraction
- Chunking (fixed, recursive, semantic, sliding window)
- Sparse indexing (BM25 / TF-IDF)
- Dense embeddings (bi-encoders)
- ANN indexing (FAISS / HNSW)
- Hybrid indexing
Input
- Raw documents (PDF, HTML, TXT, Markdown, DB)
- Optional metadata
Output
- Chunked documents
- Sparse vectors
- Dense embeddings
- Searchable index
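Of the chunking techniques listed above, the sliding-window variant is the simplest to demonstrate. Below is a minimal, character-based sketch (a real indexer would typically chunk by tokens or sentences; the function name `chunk_text` and the defaults are illustrative, not part of any library):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with a sliding-window overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # each window starts `step` characters after the last
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final window already covers the end of the text
    return chunks
```

The overlap preserves context that would otherwise be cut at chunk boundaries, at the cost of some index redundancy.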
🧭 02 – Query Parsing
Transforms raw user input into structured and enriched queries.
Techniques
- Query rewriting (LLM-based)
- Named Entity Recognition (NER)
- Intent classification
- Query expansion
- Keyword extraction
- Spelling normalization
- HyDE (Hypothetical Document Embeddings)
Input
- Raw user query
Output
- Rewritten query
- Entities
- Intent
- Keywords
- Optional synthetic document
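The lightweight end of this stage (normalization plus keyword extraction) can be sketched without any NLP libraries. The `STOPWORDS` set below is a toy placeholder, not a real linguistic resource, and the output dictionary shape is just one possible convention:

```python
import re

# Toy stopword list for illustration; real systems use a curated resource.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "is", "what", "how", "and"}

def parse_query(raw_query):
    """Return a minimal structured query: normalized text plus extracted keywords."""
    # Lowercase and replace punctuation with spaces.
    normalized = re.sub(r"[^\w\s]", " ", raw_query.lower()).strip()
    tokens = normalized.split()
    keywords = [t for t in tokens if t not in STOPWORDS]
    return {"normalized": " ".join(tokens), "keywords": keywords}
```

LLM-based rewriting, NER, and HyDE would layer on top of this structure, each adding fields to the parsed-query object.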
🧮 03 – Query Representation
Converts parsed queries into representations aligned with the knowledge base.
Techniques
- Sparse representation (BM25 vector)
- Dense embeddings
- Token-level embeddings (ColBERT-style)
- Hybrid representation
Input
- Parsed query
- KB type (sparse / dense / hybrid)
Output
- Sparse vector
- Dense embedding
- Dual representation
📚 04 – Retrieval
Fetches relevant documents or passages from the knowledge base.
Techniques
- Lexical retrieval (BM25, Boolean)
- Dense retrieval (bi-encoder + ANN)
- Late interaction (ColBERT)
- Hybrid retrieval
- Metadata filtering
Input
- Query representation
- Indexed knowledge base
Output
- Top-K candidate documents
- Retrieval scores
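As a concrete example of lexical retrieval, here is a compact pure-Python BM25 scorer, assuming documents are already tokenized into word lists. The defaults `k1=1.5`, `b=0.75` are the common Lucene-style choices; production systems would use an inverted index rather than scoring every document:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency for each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the whole corpus
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Dense retrieval replaces this term-matching score with cosine similarity between query and document embeddings, served through an ANN index such as FAISS or HNSW.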
🔀 05 – Fusion
Combines outputs from multiple retrievers for improved robustness.
Techniques
- Reciprocal Rank Fusion (RRF)
- Weighted score aggregation
- Rank merging
- Deduplication
Input
- Multiple ranked lists
Output
- Single merged ranked list
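Reciprocal Rank Fusion is simple enough to show in full: each document earns `1 / (k + rank)` from every list it appears in, and the sums determine the merged order. `k=60` is the constant used in the original RRF paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because RRF only uses ranks, it needs no score normalization, which is why it fuses BM25 and embedding retrievers so robustly.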
🎯 06 – Reranking
Applies precise relevance models to refine ranking.
Techniques
- Cross-encoder models
- MonoT5
- BERT-based ranking
- LLM scoring
Input
- Query
- Top-K retrieved documents
Output
- Relevance scores
- Reordered candidate list
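A cross-encoder needs a trained model, but the reranking step itself is just "score each (query, document) pair precisely, then sort". The sketch below uses a toy word-overlap scorer as a stand-in; in practice `score_fn` would wrap a cross-encoder (e.g. a sentence-transformers `CrossEncoder`) or an LLM scoring call:

```python
def rerank(query, documents, score_fn, top_n=None):
    """Re-order candidate documents by a precise pairwise relevance score."""
    scored = sorted(documents, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n] if top_n else scored

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)
```

The expensive pairwise model is only applied to the Top-K candidates from retrieval, which is what keeps reranking affordable.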
🧱 07 – Context Construction
Builds structured context within token constraints for the LLM.
Techniques
- Smart chunk selection
- Sliding window
- Deduplication
- Context compression
- Token budgeting
- Ordering strategies
Input
- Reranked documents
- Token limit
Output
- Clean structured context
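Token budgeting plus deduplication can be sketched as a greedy packer over the reranked list. The default `count_tokens` here is a whitespace word count purely for illustration; a real system would count with the target model's tokenizer:

```python
def build_context(reranked_chunks, token_limit, count_tokens=lambda s: len(s.split())):
    """Greedily pack the highest-ranked chunks into a token budget, skipping duplicates."""
    selected, seen, used = [], set(), 0
    for chunk in reranked_chunks:
        if chunk in seen:
            continue  # deduplicate exact repeats
        cost = count_tokens(chunk)
        if used + cost > token_limit:
            continue  # over budget; a smaller chunk later may still fit
        selected.append(chunk)
        seen.add(chunk)
        used += cost
    return "\n\n".join(selected)
```

Because the input is already reranked, greedy selection approximates "most relevant content first" under the budget; compression and reordering strategies would slot in before the join.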
✍️ 08 – Prompt Construction
Formats context and instructions into a structured prompt for reliable generation.
Techniques
- System prompts
- Instruction templates
- Citation enforcement
- Guardrails
- Structured output prompts
- Role-based prompting
Input
- Structured context
- User query
- System instructions
Output
- Final LLM-ready prompt
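A minimal template combining a system-style instruction, citation enforcement, and numbered context might look like the sketch below. The template wording is illustrative, not a recommended standard:

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer ONLY from the context below.
If the answer is not in the context, say you don't know.
Cite sources as [1], [2], ... after each claim.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks, question):
    """Number each chunk so the model can cite it, then fill the template."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Numbering the chunks is what makes citation enforcement checkable downstream: the evaluation layer can verify that every cited index exists in the supplied context.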
🤖 09 – Generation
The LLM generates the final grounded response.
Input
- Final prompt
Output
- Generated answer
- 🚨 Learn how LLMs work and build one from scratch with this repo: LLMs from Scratch
📊 10 – Evaluation
Measures system quality and prevents hallucinations.
Techniques
- Recall@K
- MRR / nDCG
- Faithfulness scoring
- Answer relevance
- LLM-as-a-judge
- RAGAS
- A/B testing
Input
- Queries
- Retrieved docs
- Generated answers
- Ground truth (optional)
Output
- Retrieval metrics
- Generation metrics
- Performance reports
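Two of the retrieval metrics above are easy to state precisely. Recall@K is the fraction of relevant documents that appear in the top-K results; MRR averages the reciprocal rank of the first relevant hit across queries:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant hit, over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts for MRR
    return total / len(all_retrieved)
```

Generation-side metrics (faithfulness, answer relevance) usually require an LLM judge or a framework such as RAGAS rather than a closed-form formula.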
Explains the major Retrieval-Augmented Generation (RAG) architectures used in research and production systems.
Vanilla RAG
📌 Overview
The standard and most basic RAG architecture.
Single-pass retrieval followed by generation.
⚙️ Core Idea Retrieve relevant documents → Inject into prompt → Generate answer.
🏗 Architecture Flow
Query
↓
Retriever (Top-K)
↓
Context Construction
↓
LLM Generation
✅ Advantages
- Simple
- Fast
- Easy to implement
- Good baseline
❌ Disadvantages
- Sensitive to retrieval quality
- No feedback loop
- Can hallucinate if retrieval fails
🎯 When to Use
- Prototyping
- Small-scale systems
- Internal knowledge assistants
Modular RAG
📌 Overview Extends vanilla RAG by separating stages and adding reranking and fusion.
⚙️ Core Idea Improve retrieval precision before generation.
🏗 Architecture Flow
Query
↓
Query Parsing
↓
Multiple Retrievers
↓
Fusion (RRF)
↓
Reranker (Cross-Encoder)
↓
Context Builder
↓
LLM
✅ Advantages
- Higher retrieval precision
- Better grounding
- Modular and scalable
❌ Disadvantages
- Higher latency
- More components to maintain
🎯 When to Use
- Production systems
- Enterprise knowledge bases
- High-accuracy requirements
Hybrid RAG
📌 Overview Combines lexical (BM25) and semantic (embeddings) retrieval.
⚙️ Core Idea Exact match + semantic similarity = better recall and precision.
🏗 Architecture Flow
Query
↓
Sparse Retriever (BM25)
Dense Retriever (Embeddings)
↓
Fusion
↓
LLM
✅ Advantages
- Strong recall
- Handles keyword + semantic queries
- Robust across domains
❌ Disadvantages
- Requires maintaining two indices
🎯 When to Use
- Large document collections
- Mixed query types
- Real-world production systems
Multi-Hop RAG
📌 Overview Performs multiple retrieval steps for complex reasoning queries.
⚙️ Core Idea Retrieve → Generate intermediate reasoning → Retrieve again → Final answer.
🏗 Architecture Flow
Query
↓
Retrieve A
↓
Generate intermediate reasoning
↓
Retrieve B
↓
Final Answer
✅ Advantages
- Handles complex reasoning
- Better for analytical questions
❌ Disadvantages
- Higher latency
- More compute cost
🎯 When to Use
- Research assistants
- Analytical QA systems
- Financial/legal reasoning
Self-Reflective RAG
📌 Overview Uses feedback loops to refine retrieval and generation.
⚙️ Core Idea Generate → Evaluate confidence → Re-retrieve if needed.
🏗 Architecture Flow
Query
↓
Retrieve
↓
Generate
↓
Self-Evaluation
↓
If low confidence → Refine query → Retrieve again
✅ Advantages
- Reduces hallucination
- More reliable answers
❌ Disadvantages
- Slower
- Complex orchestration
🎯 When to Use
- High-stakes systems
- Medical / Legal domains
Graph RAG
📌 Overview Uses a knowledge graph instead of flat document retrieval.
⚙️ Core Idea Retrieve structured relationships between entities.
🏗 Architecture Flow
Query
↓
Entity Extraction
↓
Graph Traversal
↓
Subgraph Extraction
↓
LLM Generation
✅ Advantages
- Structured reasoning
- Strong factual consistency
- Multi-hop naturally supported
❌ Disadvantages
- Requires graph construction
- Higher indexing complexity
🎯 When to Use
- Enterprise structured data
- Research knowledge systems
- Relationship-heavy domains
Agentic RAG
📌 Overview Uses an autonomous agent that decides when to retrieve, search, or call tools.
⚙️ Core Idea LLM acts as a planner controlling retrieval actions.
🏗 Architecture Flow
User Query
↓
Planner Agent
↓
Tool Selection:
Retriever
Web Search
Calculator
DB Query
↓
Memory Update
↓
Final Answer
✅ Advantages
- Flexible
- Dynamic tool usage
- Handles complex workflows
❌ Disadvantages
- Hard to control
- Can be unstable without guardrails
🎯 When to Use
- AI assistants
- Autonomous research agents
- Complex task execution systems
Adaptive RAG
📌 Overview Dynamically adjusts retrieval strategy based on query type.
⚙️ Core Idea
Simple queries → small retrieval
Complex queries → multi-hop retrieval
🏗 Architecture Flow
Query
↓
Complexity Classifier
↓
Simple → Vanilla RAG
Complex → Multi-Hop / Agentic RAG
✅ Advantages
- Efficient
- Optimizes cost vs performance
❌ Disadvantages
- Requires accurate query classification
🎯 When to Use
- Cost-sensitive production systems
- Large-scale SaaS assistants
Tool-Augmented RAG
📌 Overview Extends RAG with external tools beyond document retrieval.
⚙️ Core Idea Retrieval + APIs + structured tools.
🏗 Architecture Flow
Query
↓
Retriever
↓
Tool Calls (APIs / DB / Search)
↓
LLM Synthesis
✅ Advantages
- Real-time data access
- Accurate computations
- Broader capability
❌ Disadvantages
- Tool orchestration complexity
🎯 When to Use
- Financial dashboards
- Real-time analytics
- Enterprise assistants
| Architecture | Complexity | Accuracy | Latency | Production Ready |
|---|---|---|---|---|
| Vanilla RAG | Low | Medium | Low | Yes |
| Modular RAG | Medium | High | Medium | Yes |
| Hybrid RAG | Medium | High | Medium | Yes |
| Multi-Hop RAG | High | Very High | High | Advanced |
| Self-Reflective | High | Very High | High | Advanced |
| Graph RAG | High | Very High | Medium | Enterprise |
| Agentic RAG | Very High | Dynamic | Variable | Complex Systems |
| Adaptive RAG | High | Optimized | Optimized | Large Scale |
| Tool-Augmented RAG | High | Very High | Variable | Enterprise |
For most production systems:
- Start with Hybrid + Reranking
- Add Evaluation Layer
- Introduce Adaptive or Agentic layer only if complexity requires it
Although RAG seems like a straightforward way to integrate LLMs with external knowledge, several open research and application challenges remain, outlined below.
- Data Ingestion Complexity: Dealing with the complexity of ingesting extensive knowledge bases involves overcoming engineering challenges. For instance, parallelizing requests effectively, managing retry mechanisms, and scaling infrastructure are critical considerations. Imagine ingesting large volumes of diverse data sources, such as scientific articles, and ensuring efficient processing for subsequent retrieval and generation tasks.
- Efficient Embedding: Ensuring the efficient embedding of large datasets poses challenges like addressing rate limits, implementing robust retry logic, and managing self-hosted models. Consider the scenario where an AI system needs to embed a vast collection of news articles, requiring strategies to handle changing data, syncing mechanisms, and optimizing embedding costs.
- Vector Database Considerations: Storing data in a vector database introduces considerations such as understanding compute resources, monitoring, sharding, and addressing potential bottlenecks. Think about the challenges involved in maintaining a vector database for a diverse range of documents, each with varying levels of complexity and importance.
- Fine-Tuning and Generalization: Fine-tuning RAG models for specific tasks while ensuring generalization across diverse knowledge-intensive NLP tasks is challenging. For instance, achieving optimal performance in question-answering tasks might require different fine-tuning approaches compared to tasks involving creative language generation, requiring careful balance.
- Hybrid Parametric and Non-Parametric Memory: Integrating parametric and non-parametric memory components in models like RAG presents challenges related to knowledge revision, interpretability, and avoiding hallucinations. Consider the difficulty in ensuring that a language model combines its pre-trained knowledge with dynamically retrieved information, avoiding inaccuracies and maintaining coherence.
- Knowledge Update Mechanisms: Developing mechanisms to update non-parametric memory as real-world knowledge evolves is crucial. Imagine a scenario where RAG models need to adapt to changing information in domains like medicine, where new research findings and treatments continually emerge, requiring timely updates for accurate responses.
- Retrieval-Augmented Generation for LLMs: A Survey
- Seven Failure Points in RAG Systems
- HyDE: Hypothetical Document Embeddings
- Rewrite-Retrieve-Read Approach
- Deeplearning.ai Retrieval-Augmented Generation
- HuggingFace RAG Docs: https://huggingface.co/docs/transformers/model_doc/rag
- Github awesome-generative-ai-guide


