A simple Retrieval-Augmented Generation (RAG) pipeline that lets you ask questions about PDF documents and get AI-generated answers.
Built as a learning project to understand how RAG works — from PDF extraction to vector search to LLM-powered answers.
- Extract text from PDF documents
- Clean and chunk text using a sliding window approach
- Generate embeddings using SentenceTransformers
- Build a FAISS vector index for fast similarity search
- Query your documents and get answers using Ollama (Mistral model)
- Bonus: FastAPI experiments for REST API learning
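The sliding-window chunking step can be sketched as follows; the chunk size and overlap values here are illustrative defaults, not necessarily the ones used in `chunk_pdf.py`:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks using a sliding window.

    Consecutive chunks share `overlap` characters so that sentences cut at a
    chunk boundary still appear whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap is what makes retrieval robust: without it, a fact split across two chunks might never be retrievable as a whole.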
| Technology | Purpose |
|---|---|
| Python | Core language |
| pypdf | PDF text extraction |
| SentenceTransformers | Text embeddings (all-MiniLM-L6-v2) |
| FAISS | Vector similarity search |
| NumPy | Array operations |
| Ollama (Mistral) | Local LLM for generating answers |
| FastAPI | REST API experiments |
```
my1strag/
├── data/                   # Source PDF documents
│   ├── HR_Policy_Document.pdf
│   ├── IT_Security_Policy.pdf
│   └── Project_Guidelines.pdf
├── fastapi_experiments/    # FastAPI learning experiments
│   ├── order_api.py        # Simple order management API
│   └── crud_api.py         # SQL Server CRUD API
├── chunk_pdf.py            # Step 1: Ingest PDFs → chunks → FAISS index
├── query_rag.py            # Step 2: Ask questions → get AI answers
├── pdf_reader_demo.py      # Standalone PDF reading experiment
├── requirements.txt
├── .gitignore
└── README.md
```
- Python 3.9+
- Ollama installed and running
- Mistral model pulled:
```bash
ollama pull mistral
```
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/my1strag.git
cd my1strag

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
```

Place your PDF file in the project folder (or update the path in `chunk_pdf.py`), then run:
```bash
python chunk_pdf.py
```

This will:
- Extract text from the PDF
- Clean and chunk the text
- Generate embeddings
- Save the FAISS index (`vector_indices.faiss`) and the chunks (`chunks.pkl`)
Make sure Ollama is running, then:
```bash
python query_rag.py
```

Type your question when prompted — the system will find the most relevant chunks and generate an answer using Mistral.
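The answer-generation step works roughly like this; the prompt template below is an illustrative assumption (not copied from `query_rag.py`), and the call uses Ollama's standard local HTTP endpoint (`/api/generate`):

```python
import json
import urllib.request


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user question into a grounded prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def ask_mistral(prompt: str, url: str = "http://localhost:11434/api/generate") -> str:
    """Send the prompt to a locally running Ollama server and return the answer."""
    payload = json.dumps({"model": "mistral", "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, `ask_mistral(build_prompt(question, retrieved_chunks))` returns Mistral's answer as a string, where `retrieved_chunks` are the nearest neighbours found in the FAISS index.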
- Process multiple PDFs from the `data/` folder automatically
- Store vectors in a proper vector database (Pinecone, ChromaDB, etc.)
- Build a FastAPI endpoint to expose the RAG pipeline as an API
- Add a simple web UI for asking questions
- Support more file formats (Word, TXT, etc.)
This project is for learning purposes. Feel free to use and modify it.