This project is a complete, end-to-end Conversational Retrieval-Augmented Generation (RAG) system. It allows you to build a sophisticated chatbot that can answer questions based on a private collection of PDF documents. The entire pipeline, from data ingestion to a user-friendly web interface, is included.
- End-to-End Data Pipeline: A series of scripts to automatically process raw documents and prepare them for the RAG system.
- Advanced Document Ingestion: Extracts text and tables from PDFs, with an automatic OCR fallback for scanned documents.
- Intelligent Text Cleaning: A multi-step cleaning process to normalize text, remove noise, and filter out low-quality content.
- Content-Aware Chunking: Splits documents into meaningful chunks, keeping tables intact and filtering out "stub" chunks.
- Incremental Processing: All pipeline steps are incremental, only processing new or updated files to save time and resources.
- Conversational AI Core:
- State-of-the-Art RAG Chain: Uses a modern, conversational RAG chain that remembers chat history to answer follow-up questions.
- Powered by Gemini: Leverages Google's Gemini Pro for high-quality, context-aware answer generation.
- Source-Cited Answers: The chatbot returns the source documents it used to generate an answer, providing transparency and trust.
- Web Interface:
- FastAPI Backend: A robust and efficient API to serve the RAG chain.
- Streamlit Frontend: A user-friendly, interactive chat interface for easy interaction with the chatbot.
- Backend: Python, FastAPI
- Frontend: Streamlit
- LLM: Google Gemini 1.5 Flash via
langchain-google-genai - Embeddings:
intfloat/multilingual-e5-large - Vector Store: FAISS
- Orchestration: LangChain
- Data Processing: PyPDF2, pdfplumber, pytesseract, python-docx
├── api/
│ └── app.py # FastAPI application
├── chunks/ # Stores the chunked documents
├── data/ # Raw PDF documents go here
├── embeddings/
│ ├── create_embeddings.py # Script to generate embeddings
│ └── load_to_faiss.py # Loads the Embeddings into FAISS
├── frontend/
│ └── app.py # Streamlit frontend application
├── ingestion/
│ └── load_documents.py # Script to ingest and clean documents
├── processing/
│ └── chunks_documents.py # Script to chunk the cleaned documents
├── query/
│ └── query_faiss.py # CLI tool to test FAISS index
├── rag/
│ ├── config.py # Configuration for the RAG chain
│ ├── memory_buffer.py # Manages conversational memory
│ └── rag_chain.py # Main RAG chain
├── .env # Store API keys (need to be created)
└── .gitignore # Git ignore rules-
Clone the repository:
git clone <https://github.com/Sanjjjayyy/Language_Agnostic_Chatbot> cd <Language Agnostic Chatbot>
-
Set up a Conda environment:
conda create --name rag-chatbot python=3.10 conda activate rag-chatbot
-
Install dependencies:
pip install -r requirements.txt
-
Create the
.envfile: In the root directory of the project, create a file named.envand add your API key:GOOGLE_API_KEY="YOUR_API_KEY_HERE"
The project is divided into two main parts: the data pipeline and the live application.
You must run these scripts in order from the root directory.
-
Add Documents: Place all your PDF and DOCX files into the
data/folder. -
Ingest and Clean Documents:
python -m ingestion.load_documents
-
Chunk the Documents:
python -m processing.chunks_documents
-
Create Embeddings:
python -m embeddings.create_embeddings
-
Build the FAISS Index:
python -m embeddings.load_to_faiss
After this step, your knowledge base is ready.
This requires two separate terminals, both running from the project's root directory.
Terminal 1: Start the Backend API
conda activate rag-chatbot
uvicorn api.app:app --reload --port 8000Wait for the message indicating the application startup is complete.
Terminal 2: Start the Frontend UI
conda activate rag-chatbot
streamlit run frontend/app.pyThis will open a new tab in your web browser with the chatbot interface. You can now start asking questions.