
My First RAG - PDF Question Answering System

A simple Retrieval-Augmented Generation (RAG) pipeline that lets you ask questions about PDF documents and get AI-generated answers.

Built as a learning project to understand how RAG works — from PDF extraction to vector search to LLM-powered answers.

Features

  • Extract text from PDF documents
  • Clean and chunk text using a sliding window approach
  • Generate embeddings using SentenceTransformers
  • Build a FAISS vector index for fast similarity search
  • Query your documents and get answers using Ollama (Mistral model)
  • Bonus: FastAPI experiments for REST API learning
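The sliding-window chunking mentioned above can be sketched as follows (the chunk size and overlap values here are illustrative, not necessarily the ones used in chunk_pdf.py):

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks with a sliding window."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.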

Tech Stack

| Technology | Purpose |
|---|---|
| Python | Core language |
| pypdf | PDF text extraction |
| SentenceTransformers | Text embeddings (all-MiniLM-L6-v2) |
| FAISS | Vector similarity search |
| NumPy | Array operations |
| Ollama (Mistral) | Local LLM for generating answers |
| FastAPI | REST API experiments |

Folder Structure

my1strag/
├── data/                       # Source PDF documents
│   ├── HR_Policy_Document.pdf
│   ├── IT_Security_Policy.pdf
│   └── Project_Guidelines.pdf
├── fastapi_experiments/        # FastAPI learning experiments
│   ├── order_api.py            # Simple order management API
│   └── crud_api.py             # SQL Server CRUD API
├── chunk_pdf.py                # Step 1: Ingest PDFs → chunks → FAISS index
├── query_rag.py                # Step 2: Ask questions → get AI answers
├── pdf_reader_demo.py          # Standalone PDF reading experiment
├── requirements.txt
├── .gitignore
└── README.md

Installation

Prerequisites

  • Python 3.9+
  • Ollama installed and running
  • Mistral model pulled: ollama pull mistral

Setup

# Clone the repository
git clone https://github.com/YOUR_USERNAME/my1strag.git
cd my1strag

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate        # Linux/Mac
venv\Scripts\activate           # Windows

# Install dependencies
pip install -r requirements.txt

How to Run

Step 1: Ingest a PDF (build the vector index)

Place your PDF file in the project folder (or update the path in chunk_pdf.py), then run:

python chunk_pdf.py

This will:

  • Extract text from the PDF
  • Clean and chunk the text
  • Generate embeddings
  • Save the FAISS index (vector_indices.faiss) and chunks (chunks.pkl)
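Conceptually, the FAISS index built here performs exact nearest-neighbor search over the chunk embeddings. A minimal NumPy stand-in shows the idea (the actual script uses FAISS; the dimensions and vectors below are illustrative):

```python
import numpy as np

def build_index(embeddings):
    """Store embeddings as a float32 matrix (FAISS's flat index does the same via add())."""
    return np.asarray(embeddings, dtype="float32")

def search(index, query_vec, k=3):
    """Exact L2 search: return indices of the k stored vectors closest to the query."""
    dists = np.linalg.norm(index - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Three toy 4-dim "chunk embeddings"
index = build_index([[1, 0, 0, 0], [0, 1, 0, 0], [0.9, 0.1, 0, 0]])
top = search(index, np.array([1, 0, 0, 0], dtype="float32"), k=2)  # closest chunks first
```

A flat (brute-force) index like this is exact but scans every vector; FAISS also offers approximate indexes that trade a little recall for much faster search on large corpora.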

Step 2: Ask questions

Make sure Ollama is running, then:

python query_rag.py

Type your question when prompted — the system will find the most relevant chunks and generate an answer using Mistral.
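Under the hood, a script like query_rag.py stitches the retrieved chunks into a prompt and sends it to the local Ollama server. A hedged sketch (the function names and prompt wording are illustrative; the endpoint is Ollama's standard /api/generate):

```python
import json
import urllib.request

def build_prompt(question, chunks):
    """Combine retrieved chunks into a grounded prompt (wording is illustrative)."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

def ask_mistral(prompt, url="http://localhost:11434/api/generate"):
    """Call the local Ollama server (requires `ollama serve` with mistral pulled)."""
    body = json.dumps({"model": "mistral", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Grounding the prompt in retrieved context, rather than asking the model directly, is what makes this RAG rather than plain LLM question answering.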

Future Improvements

  • Process multiple PDFs from the data/ folder automatically
  • Store vectors in a proper vector database (Pinecone, ChromaDB, etc.)
  • Build a FastAPI endpoint to expose the RAG pipeline as an API
  • Add a simple web UI for asking questions
  • Support more file formats (Word, TXT, etc.)

License

This project is for learning purposes. Feel free to use and modify it.

About

An experimental Retrieval-Augmented Generation (RAG) project for AI-powered PDF question answering, built with local embeddings, FAISS indexing, and Mistral via Ollama.
