A simple Retrieval-Augmented Generation (RAG) pipeline that lets you ask questions about PDF documents and get AI-generated answers.
Built as a learning project to understand how RAG works — from PDF extraction to vector search to LLM-powered answers.
- Extract text from PDF documents
- Clean and chunk text using a sliding window approach
- Generate embeddings using SentenceTransformers
- Build a FAISS vector index for fast similarity search
- Query your documents and get answers using Ollama (Mistral model)
- Bonus: FastAPI experiments for REST API learning
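The sliding-window chunking step can be sketched as follows; the chunk size and overlap values here are illustrative defaults, not necessarily the ones used in `chunk_pdf.py`:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks using a sliding window.

    Consecutive chunks share `overlap` characters so that sentences cut at a
    chunk boundary still appear whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap is what makes retrieval robust: without it, a fact split across two chunks might never be retrievable as a whole.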
| Technology | Purpose |
|---|---|
| Python | Core language |
| pypdf | PDF text extraction |
| SentenceTransformers | Text embeddings (all-MiniLM-L6-v2) |
| FAISS | Vector similarity search |
| NumPy | Array operations |
| Ollama (Mistral) | Local LLM for generating answers |
| FastAPI | REST API experiments |
```
my1strag/
├── data/                   # Source PDF documents
│   ├── HR_Policy_Document.pdf
│   ├── IT_Security_Policy.pdf
│   └── Project_Guidelines.pdf
├── fastapi_experiments/    # FastAPI learning experiments
│   ├── order_api.py        # Simple order management API
│   └── crud_api.py         # SQL Server CRUD API
├── chunk_pdf.py            # Step 1: Ingest PDFs → chunks → FAISS index
├── query_rag.py            # Step 2: Ask questions → get AI answers
├── pdf_reader_demo.py      # Standalone PDF reading experiment
├── requirements.txt
├── .gitignore
└── README.md
```
- Python 3.9+
- Ollama installed and running
- Mistral model pulled:
```bash
ollama pull mistral
```
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/my1strag.git
cd my1strag

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
```

Place your PDF file in the project folder (or update the path in `chunk_pdf.py`), then run:
```bash
python chunk_pdf.py
```

This will:
- Extract text from the PDF
- Clean and chunk the text
- Generate embeddings
- Save the FAISS index (`vector_indices.faiss`) and the chunks (`chunks.pkl`)
Make sure Ollama is running, then:
```bash
python query_rag.py
```

Type your question when prompted — the system will find the most relevant chunks and generate an answer using Mistral.
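The answer-generation step works roughly like this; the prompt template below is an illustrative assumption (not copied from `query_rag.py`), and the call uses Ollama's standard local HTTP endpoint (`/api/generate`):

```python
import json
import urllib.request


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user question into a grounded prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def ask_mistral(prompt: str, url: str = "http://localhost:11434/api/generate") -> str:
    """Send the prompt to a locally running Ollama server and return the answer."""
    payload = json.dumps({"model": "mistral", "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, `ask_mistral(build_prompt(question, retrieved_chunks))` returns Mistral's answer as a string, where `retrieved_chunks` are the nearest neighbours found in the FAISS index.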
- Process multiple PDFs from the `data/` folder automatically
- Store vectors in a proper vector database (Pinecone, ChromaDB, etc.)
- Build a FastAPI endpoint to expose the RAG pipeline as an API
- Add a simple web UI for asking questions
- Support more file formats (Word, TXT, etc.)
This project is for learning purposes. Feel free to use and modify it.