
Semantic Search & Retrieval-Augmented Generation
End-to-End Guide


Introduction

Large Language Models (LLMs) are powerful, but when used in isolation they suffer from hallucinations, stale knowledge, and lack of grounding in private or domain-specific data.

At the same time, traditional keyword-based search systems fail to capture semantic intent and often return irrelevant results for complex queries.

Retrieval-Augmented Generation (RAG) addresses both problems by combining search techniques with LLMs to produce accurate, explainable, and grounded answers.

Repository Overview

This repository is a hands-on, end-to-end guide to building Retrieval-Augmented Generation (RAG) systems, organized around a working pipeline.
It is designed to help everyone from beginners to advanced practitioners understand how semantic search and LLMs can be combined to produce grounded, high-quality answers.

What You Will Learn

  • How to parse and represent user queries
  • Different retrieval techniques: keyword, dense, and hybrid
  • How to merge and rerank retrieved results
  • Strategies for constructing context for LLMs
  • Various RAG patterns: naive, local, conversational, agentic
  • Evaluation metrics and production-ready considerations

Who Is This Repository For?

  • ML/NLP engineers building LLM-powered systems
  • Backend engineers integrating retrieval pipelines
  • Students and researchers learning RAG architectures
  • Anyone aiming to go from naive demos to production-ready systems

🏳️ Level: beginner to intermediate


📑 Table of Contents

  1. Introduction

  2. Repository Overview

  3. What is RAG?

  4. RAG Pipeline Overview

  5. RAG Architectures – Complete Guide

  6. 🏁 Summary Comparison

  7. 🎯 Design Recommendation

  8. RAG Challenges

  9. Key References

  10. Optional Learning Resources

  11. 📞 Contact



What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture for improving the performance of an AI model by connecting it to external knowledge bases. RAG helps LLMs deliver more relevant, higher-quality responses.

  • Advantages of RAG:
    • Injects missing knowledge (information not in the training data)
    • Reduces hallucinations
    • Keeps model knowledge up to date and lets the model focus on generation
    • Enables source citation
RAG Pipeline Overview


The repository is structured to reflect this pipeline, with each folder dedicated to a single stage, containing explanations, practical examples, and minimal code demos.

The core pipeline consists of the following stages:

📦 01 – Document Ingestion & Indexing

Prepares raw documents for efficient retrieval by transforming them into searchable representations.

Techniques

  • Cleaning & normalization
  • Metadata extraction
  • Chunking (fixed, recursive, semantic, sliding window)
  • Sparse indexing (BM25 / TF-IDF)
  • Dense embeddings (bi-encoders)
  • ANN indexing (FAISS / HNSW)
  • Hybrid indexing

Input

  • Raw documents (PDF, HTML, TXT, Markdown, DB)
  • Optional metadata

Output

  • Chunked documents
  • Sparse vectors
  • Dense embeddings
  • Searchable index
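The chunking step above can be sketched with a fixed-size sliding window over words. The chunk size and overlap here are illustrative values, not recommendations; production pipelines usually count model tokens rather than words.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks with overlapping windows."""
    words = text.split()
    step = max(chunk_size - overlap, 1)  # guard against non-positive step
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # final window already covers the tail
    return chunks

doc = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(doc, chunk_size=50, overlap=10)
```

The 10-word overlap keeps sentences that straddle a boundary retrievable from at least one chunk.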

🔍 02 – Query Parsing & Understanding

Transforms raw user input into structured and enriched queries.

Techniques

  • Query rewriting (LLM-based)
  • Named Entity Recognition (NER)
  • Intent classification
  • Query expansion
  • Keyword extraction
  • Spelling normalization
  • HYDE (Hypothetical Document Embeddings)

Input

  • Raw user query

Output

  • Rewritten query
  • Entities
  • Intent
  • Keywords
  • Optional synthetic document
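A minimal query-parsing sketch covering just normalization and keyword extraction; the stopword list and the parse-result shape are assumptions for illustration. Real systems layer NER, intent classification, and LLM-based rewriting on top of this.

```python
# Toy stopword list -- an assumption, not a standard resource.
STOPWORDS = {"the", "a", "an", "is", "are", "what", "how", "of", "in", "to"}

def parse_query(raw: str) -> dict:
    """Normalize a raw query and extract candidate keywords."""
    normalized = raw.strip().lower()
    tokens = normalized.split()
    keywords = [t for t in tokens if t not in STOPWORDS]
    return {"raw": raw, "normalized": normalized, "keywords": keywords}

parsed = parse_query("What is the capital of France")
```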

🧮 03 – Query Representation

Converts parsed queries into representations aligned with the knowledge base.

Techniques

  • Sparse representation (BM25 vector)
  • Dense embeddings
  • Token-level embeddings (ColBERT-style)
  • Hybrid representation

Input

  • Parsed query
  • KB type (sparse / dense / hybrid)

Output

  • Sparse vector
  • Dense embedding
  • Dual representation
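A hybrid representation can be sketched as a sparse term-frequency dict (standing in for a BM25 vector) paired with a dense vector. The "dense" vector below is a toy hashing-trick embedding so the example stays self-contained; a real system would call a learned bi-encoder.

```python
from collections import Counter

def sparse_rep(tokens: list[str]) -> dict[str, int]:
    """Term-frequency dict, a stand-in for a weighted sparse vector."""
    return dict(Counter(tokens))

def dense_rep(tokens: list[str], dim: int = 8) -> list[float]:
    """Toy hashed bag-of-words vector, L2-normalized."""
    vec = [0.0] * dim
    for t in tokens:
        vec[hash(t) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

query_tokens = "rag retrieval rag".split()
hybrid = {"sparse": sparse_rep(query_tokens), "dense": dense_rep(query_tokens)}
```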

📚 04 – Retrieval

Fetches relevant documents or passages from the knowledge base.

Techniques

  • Lexical retrieval (BM25, Boolean)
  • Dense retrieval (bi-encoder + ANN)
  • Late interaction (ColBERT)
  • Hybrid retrieval
  • Metadata filtering

Input

  • Query representation
  • Indexed knowledge base

Output

  • Top-K candidate documents
  • Retrieval scores
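Dense retrieval at its core is a nearest-neighbor search; a brute-force cosine-similarity scan over a toy in-memory index is sketched below. The 3-d hand-made embeddings are purely illustrative, and FAISS/HNSW replace this linear scan at scale.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the top-k (doc_id, score) pairs by cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
top_k = retrieve([1.0, 0.05, 0.0], index, k=2)
```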

🔀 05 – Result Fusion / Merging

Combines outputs from multiple retrievers for improved robustness.

Techniques

  • Reciprocal Rank Fusion (RRF)
  • Weighted score aggregation
  • Rank merging
  • Deduplication

Input

  • Multiple ranked lists

Output

  • Single merged ranked list
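Reciprocal Rank Fusion, the first technique listed above, can be implemented in a few lines: each document scores the sum of 1 / (k + rank) over every ranked list it appears in, with k = 60 as the conventional default and ranks 1-based.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sorting by fused score also deduplicates across lists.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = rrf([bm25_hits, dense_hits])
```

Documents ranked highly by both retrievers (d1, d3) float to the top even though neither list agrees on the exact order.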

🎯 06 – Reranking

Applies precise relevance models to refine ranking.

Techniques

  • Cross-encoder models
  • MonoT5
  • BERT-based ranking
  • LLM scoring

Input

  • Query
  • Top-K retrieved documents

Output

  • Relevance scores
  • Reordered candidate list
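The reranking step reduces to scoring each (query, document) pair and reordering. A cross-encoder such as a MonoT5 or BERT ranker would produce the score jointly from both texts; the toy term-overlap scorer below stands in for the model so the example stays self-contained, and only the reordering logic is the point.

```python
def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def rerank(query: str, docs: list[str], score_fn=overlap_score) -> list[str]:
    """Reorder candidates by pairwise relevance score, best first."""
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)

candidates = [
    "pipelines move water",
    "rag pipelines ground llm answers",
    "llm answers can hallucinate",
]
reranked = rerank("rag llm answers", candidates)
```

Swapping `overlap_score` for a real cross-encoder call keeps the rest of the pipeline unchanged.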

🧱 07 – Context Construction

Builds structured context within token constraints for the LLM.

Techniques

  • Smart chunk selection
  • Sliding window
  • Deduplication
  • Context compression
  • Token budgeting
  • Ordering strategies

Input

  • Reranked documents
  • Token limit

Output

  • Clean structured context
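Token budgeting and deduplication from the list above can be sketched as a greedy packer over the reranked chunks. Word count is used as a crude token estimate here; a real system would use the target model's tokenizer.

```python
def build_context(ranked_chunks: list[str], token_budget: int) -> str:
    """Greedily pack highest-ranked chunks within a token budget."""
    selected, used = [], 0
    seen = set()
    for chunk in ranked_chunks:
        if chunk in seen:          # deduplication
            continue
        cost = len(chunk.split())  # crude token estimate (word count)
        if used + cost > token_budget:
            continue               # skip chunks that would overflow
        selected.append(chunk)
        seen.add(chunk)
        used += cost
    return "\n\n".join(selected)

chunks = ["a b c d", "a b c d", "e f g", "h i j k l m n"]
context = build_context(chunks, token_budget=8)
```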

✍️ 08 – Prompt Engineering

Formats context and instructions into a structured prompt for reliable generation.

Techniques

  • System prompts
  • Instruction templates
  • Citation enforcement
  • Guardrails
  • Structured output prompts
  • Role-based prompting

Input

  • Structured context
  • User query
  • System instructions

Output

  • Final LLM-ready prompt
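A minimal prompt-assembly sketch combining a system instruction, citation enforcement, and a guardrail against answering outside the context. The exact wording of the instruction is an illustrative assumption.

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble a grounded-answer prompt with citation enforcement."""
    return (
        "You are a helpful assistant. Answer ONLY from the context below.\n"
        "Cite the source id in [brackets] for every claim. If the answer\n"
        "is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "[doc1] RAG grounds LLM answers in retrieved text.",
    "What does RAG do?",
)
```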

🤖 09 – Generation

The LLM generates the final grounded response.

Input

  • Final prompt

Output

  • Generated answer

🚨 Learn how LLMs work and build one from scratch with this repo: LLMs from Scratch

📊 10 – Evaluation & Monitoring

Measures system quality and prevents hallucinations.

Techniques

  • Recall@K
  • MRR / nDCG
  • Faithfulness scoring
  • Answer relevance
  • LLM-as-a-judge
  • RAGAS
  • A/B testing

Input

  • Queries
  • Retrieved docs
  • Generated answers
  • Ground truth (optional)

Output

  • Retrieval metrics
  • Generation metrics
  • Performance reports
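The first two retrieval metrics above are easy to compute directly from a ranked result list and a set of ground-truth relevant ids:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs found in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
recall = recall_at_k(retrieved, relevant, k=2)
reciprocal_rank = mrr(retrieved, relevant)
```

In practice these are averaged over a query set; frameworks such as RAGAS add generation-side metrics like faithfulness on top.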

RAG Architectures – Complete Guide

This section explains the major Retrieval-Augmented Generation (RAG) architectures used in research and production systems.

1️⃣ Vanilla RAG

📌 Overview The standard and most basic RAG architecture:
single-pass retrieval followed by generation.

⚙️ Core Idea Retrieve relevant documents → Inject into prompt → Generate answer.

🏗 Architecture Flow

Query
↓
Retriever (Top-K)
↓
Context Construction
↓
LLM Generation
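The single-pass flow above, sketched end to end with toy components. `fake_llm` is a placeholder for a real model call and the keyword-overlap retriever is an assumption for illustration, not this repo's implementation.

```python
CORPUS = {
    "d1": "rag grounds llm answers in retrieved documents",
    "d2": "bananas are yellow fruit",
}

def retrieve_top1(query: str) -> str:
    """Toy retriever: pick the doc with the largest term overlap."""
    q_terms = set(query.lower().split())
    best = max(CORPUS, key=lambda d: len(q_terms & set(CORPUS[d].split())))
    return CORPUS[best]

def fake_llm(prompt: str) -> str:
    """Placeholder LLM: echoes the context line it was grounded on."""
    return prompt.split("Context: ")[1].split("\n")[0]

def vanilla_rag(query: str) -> str:
    context = retrieve_top1(query)                      # Retriever (Top-K)
    prompt = f"Context: {context}\nQuestion: {query}"   # Context construction
    return fake_llm(prompt)                             # LLM generation

answer = vanilla_rag("what does rag do for llm answers")
```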

✅ Advantages

  • Simple
  • Fast
  • Easy to implement
  • Good baseline

⚠️ Limitations

  • Sensitive to retrieval quality
  • No feedback loop
  • Can hallucinate if retrieval fails

🎯 When to Use

  • Prototyping
  • Small-scale systems
  • Internal knowledge assistants

2️⃣ Advanced (Modular) RAG

📌 Overview Extends vanilla RAG by separating stages and adding reranking and fusion.

⚙️ Core Idea Improve retrieval precision before generation.

🏗 Architecture Flow

Query
↓
Query Parsing
↓
Multiple Retrievers
↓
Fusion (RRF)
↓
Reranker (Cross-Encoder)
↓
Context Builder
↓
LLM

✅ Advantages

  • Higher retrieval precision
  • Better grounding
  • Modular and scalable

⚠️ Limitations

  • Higher latency
  • More components to maintain

🎯 When to Use

  • Production systems
  • Enterprise knowledge bases
  • High-accuracy requirements

3️⃣ Hybrid RAG

📌 Overview Combines lexical (BM25) and semantic (embeddings) retrieval.

⚙️ Core Idea Exact match + semantic similarity = better recall and precision.

🏗 Architecture Flow

Query
↓
Sparse Retriever (BM25)
Dense Retriever (Embeddings)
↓
Fusion
↓
LLM

✅ Advantages

  • Strong recall
  • Handles keyword + semantic queries
  • Robust across domains

⚠️ Limitations

  • Requires maintaining two indices

🎯 When to Use

  • Large document collections
  • Mixed query types
  • Real-world production systems

4️⃣ Multi-Hop RAG

📌 Overview Performs multiple retrieval steps for complex reasoning queries.

⚙️ Core Idea Retrieve → Generate intermediate reasoning → Retrieve again → Final answer.

🏗 Architecture Flow

Query
↓
Retrieve A
↓
Generate intermediate reasoning
↓
Retrieve B
↓
Final Answer

✅ Advantages

  • Handles complex reasoning
  • Better for analytical questions

⚠️ Limitations

  • Higher latency
  • More compute cost

🎯 When to Use

  • Research assistants
  • Analytical QA systems
  • Financial/legal reasoning

5️⃣ Iterative / Self-Reflective RAG

📌 Overview Uses feedback loops to refine retrieval and generation.

⚙️ Core Idea Generate → Evaluate confidence → Re-retrieve if needed.

🏗 Architecture Flow

Query
↓
Retrieve
↓
Generate
↓
Self-Evaluation
↓
If low confidence → Refine query → Retrieve again

✅ Advantages

  • Reduces hallucination
  • More reliable answers

⚠️ Limitations

  • Slower
  • Complex orchestration

🎯 When to Use

  • High-stakes systems
  • Medical / Legal domains

6️⃣ Graph RAG

📌 Overview Uses a knowledge graph instead of flat document retrieval.

⚙️ Core Idea Retrieve structured relationships between entities.

🏗 Architecture Flow

Query
↓
Entity Extraction
↓
Graph Traversal
↓
Subgraph Extraction
↓
LLM Generation

✅ Advantages

  • Structured reasoning
  • Strong factual consistency
  • Multi-hop naturally supported

⚠️ Limitations

  • Requires graph construction
  • Higher indexing complexity

🎯 When to Use

  • Enterprise structured data
  • Research knowledge systems
  • Relationship-heavy domains

7️⃣ Agentic RAG

📌 Overview Uses an autonomous agent that decides when to retrieve, search, or call tools.

⚙️ Core Idea LLM acts as a planner controlling retrieval actions.

🏗 Architecture Flow

User Query
↓
Planner Agent
↓
Tool Selection:
  • Retriever
  • Web Search
  • Calculator
  • DB Query
↓
Memory Update
↓
Final Answer

✅ Advantages

  • Flexible
  • Dynamic tool usage
  • Handles complex workflows

⚠️ Limitations

  • Hard to control
  • Can be unstable without guardrails

🎯 When to Use

  • AI assistants
  • Autonomous research agents
  • Complex task execution systems

8️⃣ Adaptive RAG

📌 Overview Dynamically adjusts retrieval strategy based on query type.

⚙️ Core Idea Simple queries → small retrieval
Complex queries → multi-hop retrieval

🏗 Architecture Flow

Query
↓
Complexity Classifier
↓
Simple → Vanilla RAG
Complex → Multi-Hop / Agentic RAG
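The complexity-classifier step above can be prototyped with a simple heuristic router; real systems use a trained classifier or an LLM call instead. The marker list and the 12-word threshold are illustrative assumptions.

```python
def route(query: str) -> str:
    """Heuristic router: long or multi-clause queries go to the expensive path."""
    markers = (" and ", " compare ", " versus ", " why ")
    padded = f" {query.lower()} "
    is_complex = len(query.split()) > 12 or any(m in padded for m in markers)
    return "multi_hop" if is_complex else "vanilla"

simple_route = route("capital of France")
complex_route = route("compare RAG and fine-tuning for domain adaptation")
```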

✅ Advantages

  • Efficient
  • Optimizes cost vs performance

⚠️ Limitations

  • Requires accurate query classification

🎯 When to Use

  • Cost-sensitive production systems
  • Large-scale SaaS assistants

9️⃣ Tool-Augmented RAG

📌 Overview Extends RAG with external tools beyond document retrieval.

⚙️ Core Idea Retrieval + APIs + structured tools.

🏗 Architecture Flow

Query
↓
Retriever
↓
Tool Calls (APIs / DB / Search)
↓
LLM Synthesis

✅ Advantages

  • Real-time data access
  • Accurate computations
  • Broader capability

⚠️ Limitations

  • Tool orchestration complexity

🎯 When to Use

  • Financial dashboards
  • Real-time analytics
  • Enterprise assistants

🏁 Summary Comparison

| Architecture | Complexity | Accuracy | Latency | Production Ready |
| --- | --- | --- | --- | --- |
| Vanilla RAG | Low | Medium | Low | Yes |
| Modular RAG | Medium | High | Medium | Yes |
| Hybrid RAG | Medium | High | Medium | Yes |
| Multi-Hop RAG | High | Very High | High | Advanced |
| Self-Reflective | High | Very High | High | Advanced |
| Graph RAG | High | Very High | Medium | Enterprise |
| Agentic RAG | Very High | Dynamic | Variable | Complex Systems |
| Adaptive RAG | High | Optimized | Optimized | Large Scale |
| Tool-Augmented RAG | High | Very High | Variable | Enterprise |

🎯 Design Recommendation

For most production systems:

  • Start with Hybrid + Reranking
  • Add Evaluation Layer
  • Introduce Adaptive or Agentic layer only if complexity requires it

RAG Challenges

Although RAG seems to be a straightforward way to integrate LLMs with external knowledge, several open research and engineering challenges remain:

  1. Data Ingestion Complexity: Dealing with the complexity of ingesting extensive knowledge bases involves overcoming engineering challenges. For instance, parallelizing requests effectively, managing retry mechanisms, and scaling infrastructure are critical considerations. Imagine ingesting large volumes of diverse data sources, such as scientific articles, and ensuring efficient processing for subsequent retrieval and generation tasks.
  2. Efficient Embedding: Ensuring the efficient embedding of large datasets poses challenges like addressing rate limits, implementing robust retry logic, and managing self-hosted models. Consider the scenario where an AI system needs to embed a vast collection of news articles, requiring strategies to handle changing data, syncing mechanisms, and optimizing embedding costs.
  3. Vector Database Considerations: Storing data in a vector database introduces considerations such as understanding compute resources, monitoring, sharding, and addressing potential bottlenecks. Think about the challenges involved in maintaining a vector database for a diverse range of documents, each with varying levels of complexity and importance.
  4. Fine-Tuning and Generalization: Fine-tuning RAG models for specific tasks while ensuring generalization across diverse knowledge-intensive NLP tasks is challenging. For instance, achieving optimal performance in question-answering tasks might require different fine-tuning approaches compared to tasks involving creative language generation, requiring careful balance.
  5. Hybrid Parametric and Non-Parametric Memory: Integrating parametric and non-parametric memory components in models like RAG presents challenges related to knowledge revision, interpretability, and avoiding hallucinations. Consider the difficulty in ensuring that a language model combines its pre-trained knowledge with dynamically retrieved information, avoiding inaccuracies and maintaining coherence.
  6. Knowledge Update Mechanisms: Developing mechanisms to update non-parametric memory as real-world knowledge evolves is crucial. Imagine a scenario where RAG models need to adapt to changing information in domains like medicine, where new research findings and treatments continually emerge, requiring timely updates for accurate responses.

Key References

  1. Retrieval-Augmented Generation for LLMs: A Survey
  2. Seven Failure Points in RAG Systems
  3. HyDE: Hypothetical Document Embeddings
  4. Rewrite-Retrieve-Read Approach

Optional Learning Resources


📞 Contact :
