A production-ready, modular RAG pipeline for answering questions from PDF documents using semantic search and Google Gemini AI.
RAG/
├── src/                      # Modular source code
│   ├── __init__.py           # Package initialization
│   ├── data_loader.py        # PDF loading and chunking
│   ├── embedding.py          # Embedding generation
│   ├── vectorstore.py        # ChromaDB vector storage
│   ├── search.py             # Document retrieval
│   ├── llm.py                # Gemini LLM wrapper
│   └── rag_pipeline.py       # Complete RAG orchestration
├── data/
│   ├── pdf/                  # PDF documents for processing
│   ├── text_files/           # Text files containing ML/DL content
│   └── vector_store/         # Persistent vector database
├── notebook/
│   ├── document.ipynb        # Document processing notebook
│   └── pdf_loader.ipynb      # PDF processing and embedding pipeline
├── example.py                # Usage examples
├── .env                      # Environment variables (API keys)
├── .env.example              # Template for environment variables
├── requirements.txt          # Python dependencies
└── README.md
This project implements a complete RAG (Retrieval-Augmented Generation) pipeline that processes PDF documents and answers questions about them. It combines semantic search with Google Gemini to produce source-grounded answers.
- PDF Processing: Automatic loading and chunking of PDF documents
- Semantic Search: Vector-based similarity search using sentence transformers
- Persistent Storage: ChromaDB for efficient vector storage and retrieval
- AI-Powered Answers: Google Gemini 2.5 Flash for generating accurate, grounded responses
- Modular Design: Clean, maintainable code split into focused modules
- Source Citation: Automatic tracking and citation of source documents
- Configurable: Easy to customize embedding models, chunk sizes, and generation parameters
PDF Documents
↓
Text Chunking (RecursiveCharacterTextSplitter)
↓
Embeddings (all-MiniLM-L6-v2, 384 dimensions)
↓
Vector Store (ChromaDB - Persistent)
↓
Semantic Search (Cosine Similarity)
↓
Answer Generation (Gemini 2.5 Flash)
↓
Final Answer with Citations
langchain-community
langchain-text-splitters
sentence-transformers
chromadb
numpy
scikit-learn
pypdf
google-generativeai
python-dotenv

# Clone the repository
git clone https://github.com/Anonymus-Coder2403/RAG_Pipeline.git
cd RAG_Pipeline
# Create and activate virtual environment
python -m venv .venv
# Windows:
.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env
# Edit .env and add your Gemini API key:
# GEMINI_API_KEY=your_api_key_here

Get your Gemini API key from: https://aistudio.google.com/app/apikey
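
The lookup behind the GEMINI_API_KEY setting can be sketched as follows. The project lists python-dotenv for this job; the stand-in below uses only the standard library so that it runs anywhere. The helper names load_env_file and get_gemini_api_key are hypothetical and are not part of src.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_api_key(env_path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(env_path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY not found: add it to .env or export it")
    return key
```

Failing early with a clear message keeps a missing key from surfacing later as an opaque API error.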
from src import (
    PDFDocumentLoader,
    EmbeddingManager,
    VectorStore,
    RAGRetriever,
    GeminiLLM,
    RAGPipeline,
)
# 1. Load and process documents
loader = PDFDocumentLoader(chunk_size=1000, chunk_overlap=200)
chunks = loader.load_and_split("data/pdf")
# 2. Generate embeddings
embedding_manager = EmbeddingManager()
texts = [doc.page_content for doc in chunks]
embeddings = embedding_manager.generate_embeddings(texts)
# 3. Store in vector database
vector_store = VectorStore()
vector_store.add_documents(chunks, embeddings)
# 4. Set up RAG pipeline
retriever = RAGRetriever(vector_store, embedding_manager)
llm = GeminiLLM()
rag = RAGPipeline(retriever, llm)
# 5. Ask questions!
result = rag.answer("What are news values?", top_k=3)
rag.display_result(result)

If you've already processed documents:
from src import EmbeddingManager, VectorStore, RAGRetriever, GeminiLLM, RAGPipeline
# Connect to existing vector store
embedding_manager = EmbeddingManager()
vector_store = VectorStore()
retriever = RAGRetriever(vector_store, embedding_manager)
llm = GeminiLLM()
rag = RAGPipeline(retriever, llm)
# Ask questions
result = rag.answer("Your question here", top_k=3)
print(result['answer'])

To run the bundled usage examples:
python example.py

PDFDocumentLoader: Handles loading and chunking PDF documents.
Parameters:
- chunk_size (int): Maximum size of text chunks (default: 1000)
- chunk_overlap (int): Overlap between chunks (default: 200)
- separators (list): Custom separators for splitting (default: ["\n\n", "\n", " ", ""])
Methods:
- load_pdfs(pdf_directory): Load all PDFs from a directory
- split_documents(documents): Split documents into chunks
- load_and_split(pdf_directory): Combined load and split operation
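
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-window chunking. It is a deliberate simplification: the project uses RecursiveCharacterTextSplitter, which also tries to break on the configured separators, whereas this hypothetical chunk_text helper cuts purely by character position.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

Each chunk repeats the last chunk_overlap characters of the previous one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.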
EmbeddingManager: Generates embeddings using SentenceTransformers.
Parameters:
- model_name (str): HuggingFace model name (default: "all-MiniLM-L6-v2")
Methods:
- generate_embeddings(texts, show_progress): Generate embeddings for a list of texts
- get_embedding_dimension(): Get the embedding dimension
VectorStore: Manages ChromaDB vector storage.
Parameters:
- collection_name (str): Name of the collection (default: "pdf_documents")
- persist_directory (str): Storage directory (default: "../data/vector_store")
Methods:
- add_documents(documents, embeddings): Add documents to the store
- get_collection_stats(): Get collection statistics
- clear_collection(): Delete all documents
RAGRetriever: Handles semantic search and document retrieval.
Parameters:
- vector_store: VectorStore instance
- embedding_manager: EmbeddingManager instance
Methods:
- retrieve(query, top_k, score_threshold): Retrieve relevant documents
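
The roles of top_k and score_threshold can be illustrated with a small, dependency-free sketch of cosine-similarity ranking. This is not the project's implementation: in the real pipeline ChromaDB performs the search, and how RAGRetriever turns its results into scores is not shown here. The function names and the tiny 2-dimensional vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: list[float],
    doc_vecs: list[list[float]],
    top_k: int = 3,
    score_threshold: float = 0.0,
) -> list[tuple[int, float]]:
    """Return up to top_k (document index, score) pairs, best first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    kept = [pair for pair in scored if pair[1] >= score_threshold]  # drop weak matches
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in rank_documents([1.0, 0.0], docs, top_k=2, score_threshold=0.5):
        print(index, round(score, 3))
```

The threshold is applied before truncating to top_k, so a query with no good match can return fewer than top_k results, or none at all.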
GeminiLLM: Wrapper for the Google Gemini API.
Parameters:
- model_name (str): Gemini model (default: "gemini-2.5-flash")
- temperature (float): Sampling temperature (default: 0.1)
- max_output_tokens (int): Maximum response length in tokens (default: 500)
- top_p (float): Nucleus sampling threshold (default: 0.95)
- top_k (int): Top-k sampling (default: 40)
- api_key (str): Google AI API key (optional)
Methods:
- generate(prompt, max_retries): Generate a response with retry logic
- list_available_models(): List available Gemini models
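
The retry behaviour of generate(prompt, max_retries) is not spelled out above, so the following is only one plausible shape for it: retry with exponential backoff around an arbitrary model call. The call_model callable, the base_delay parameter, and the injectable sleep function are assumptions made for illustration; the real GeminiLLM may retry differently.

```python
import time
from typing import Callable


def generate_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call call_model(prompt), retrying on any exception with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    last_error: Exception | None = None
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except Exception as exc:  # a real wrapper would catch narrower error types
            last_error = exc
            if attempt < max_retries - 1:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"generation failed after {max_retries} attempts") from last_error
```

Passing sleep as a parameter keeps the helper testable without real waiting; in normal use the default time.sleep applies.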
RAGPipeline: Complete RAG pipeline orchestrator.
Parameters:
- retriever: RAGRetriever instance
- llm: GeminiLLM instance
Methods:
- answer(query, top_k): Complete RAG pipeline (retrieve + generate)
- display_result(result): Format and display the result
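
The "retrieve + generate" step can be sketched as below: retrieved chunks are numbered, placed in a prompt, and the model is asked to cite them. Only the 'answer' key of the result is confirmed by the usage examples in this README; the prompt wording, the chunk dictionary layout (text, source, page), the 'sources' key, and the retrieve/generate callables are all hypothetical.

```python
from typing import Callable


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    blocks = [
        f"[{number}] (source: {chunk['source']}, page {chunk['page']})\n{chunk['text']}"
        for number, chunk in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def answer(
    query: str,
    retrieve: Callable[[str, int], list[dict]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> dict:
    """Retrieve top_k chunks, then generate an answer grounded in them."""
    chunks = retrieve(query, top_k)
    if not chunks:
        # Skip the model call entirely when nothing relevant was retrieved.
        return {"answer": "No relevant context was found.", "sources": []}
    sources = [f"{chunk['source']} (page {chunk['page']})" for chunk in chunks]
    return {"answer": generate(build_prompt(query, chunks)), "sources": sources}
```

Returning early on an empty retrieval avoids paying for a generation call that could only produce an ungrounded answer.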
Choose different embedding models from HuggingFace:
- all-MiniLM-L6-v2 (default): Fast, general-purpose (384 dim)
- BAAI/bge-large-en-v1.5: High quality for technical content (1024 dim)
- sentence-transformers/all-mpnet-base-v2: Balanced quality/speed (768 dim)
Available models (check with GeminiLLM.list_available_models()):
- gemini-2.5-flash (default): Fast, cost-effective
- gemini-2.5-pro: Higher quality, slower
- gemini-2.0-flash: Alternative fast model
Chunking:
- chunk_size: 1000 characters (balance between context and specificity)
- chunk_overlap: 200 characters (maintains context across chunks)
Retrieval:
- top_k: 3-5 chunks (more = more context but higher cost)
- score_threshold: 0.0-1.0 (filter low-similarity results)
Generation:
- temperature: 0.1 (low for factual responses)
- max_output_tokens: 500 (adjust based on needs)
Error: GEMINI_API_KEY not found
Solution: Create .env file with GEMINI_API_KEY=your_key_here
Error: 404 models/gemini-xxx not found
Solution: Run GeminiLLM.list_available_models() to see available models
Issue: Retrieval returns low similarity scores
Solutions:
- Check if embedding model matches document type
- Try different embedding model (e.g., BAAI/bge-large-en-v1.5)
- Adjust chunk_size and chunk_overlap
- Add more relevant documents
- Vector Store Persistence: ChromaDB automatically persists data - no need to re-process documents each run
- Batch Processing: Process multiple PDFs at once for efficiency
- Embedding Model Selection:
  - Use all-MiniLM-L6-v2 for general documents
  - Use BAAI/bge-large-en-v1.5 for technical/academic content
- Chunk Size Optimization:
  - Smaller chunks (500-800): More precise retrieval
  - Larger chunks (1000-1500): More context per chunk
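
When weighing chunk sizes, it helps to estimate how many chunks (and therefore embeddings) a document will produce. The sketch below assumes plain fixed-window splitting, where each chunk advances by chunk_size minus chunk_overlap characters; RecursiveCharacterTextSplitter breaks on separators, so its real counts will differ somewhat. The function name estimate_chunk_count is hypothetical.

```python
def estimate_chunk_count(text_length: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Approximate number of chunks for a text of text_length characters."""
    if text_length < 0:
        raise ValueError("text_length must be non-negative")
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if text_length == 0:
        return 0

    step = chunk_size - chunk_overlap
    # Ceiling division of (text_length - chunk_overlap) by step, using integers only.
    windows = -(-(text_length - chunk_overlap) // step)
    return max(1, windows)  # any non-empty text yields at least one chunk


if __name__ == "__main__":
    for size, overlap in [(500, 100), (1000, 200), (1500, 200)]:
        print(size, overlap, estimate_chunk_count(100_000, size, overlap))
```

Under these assumptions, halving chunk_size roughly doubles the number of vectors to embed and store, which is the cost side of the more precise retrieval that smaller chunks offer.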
The notebook/pdf_loader.ipynb notebook walks through the development process, covering:
- PDF loading and text extraction
- Text chunking with RecursiveCharacterTextSplitter
- Embedding generation using sentence-transformers
- ChromaDB vector storage setup
- Gemini LLM integration
- Complete RAG pipeline implementation
This forms the foundation of the entire RAG system, demonstrating how document parsing, vector storage, and LLM context retrieval work together.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Yash Kumar (@Anonymus-Coder2403)
This project uses the following amazing technologies:
- LangChain: Document loading and processing
- Sentence Transformers: State-of-the-art embedding generation
- ChromaDB: Efficient vector database
- Google Gemini: AI-powered answer generation
- Original Notebook: Development notebook with step-by-step process
- Example Script: Ready-to-run usage examples
- API Documentation: Google Gemini API docs
⭐ Star this repo if you found it helpful!
Built with ❤️ for the open-source community