This repository contains a machine learning project designed to process PDF documents, extract text, split it into smaller chunks, generate embeddings using Google’s Generative AI, and store them in a FAISS vector store for fast retrieval. The system enables question answering based on the contents of the document.
- PDF Parsing: Extract text from PDF files using PyMuPDF (fitz).
- Text Splitting: Split large text documents into smaller, manageable chunks using `RecursiveCharacterTextSplitter` from LangChain.
- Embeddings: Generate text embeddings using Google's Generative AI, enabling semantic search.
- Vector Store: Store embeddings in a FAISS vector store for efficient similarity-based retrieval.
- Question Answering: Answer user queries based on the document's content using LangChain's question-answering chain.
- Core Functionality Implemented:
  - PDF text extraction is fully functional using PyMuPDF.
  - Text splitting using LangChain's `RecursiveCharacterTextSplitter` works seamlessly.
  - Embedding generation is integrated with Google's Generative AI API and is fully operational.
  - Vector storage and retrieval using FAISS are implemented and tested.
  - Basic question-answering functionality using LangChain's chain is working.
- Optimizations:
  - Intel's scikit-learn extension is integrated for enhanced performance.
  - Environment setup instructions are provided for easy replication.
The project’s main functionalities include:
- Extracting Text from PDFs:
  - Load a PDF file and extract text content using PyMuPDF.
  - Example:

    ```python
    import fitz

    pdf = fitz.open("document.pdf")
    text = "\n".join(page.get_text() for page in pdf)
    ```
- Splitting Text:
  - Split extracted text into smaller chunks for better embedding generation.
  - Example:

    ```python
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = text_splitter.split_text(text)
    ```
- Generating Embeddings:
  - Convert text chunks into embeddings using Google's Generative AI.
  - Example:

    ```python
    from langchain_community.vectorstores import FAISS
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    # The embedding model name is required; "models/embedding-001" is the standard choice
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key="YOUR_API_KEY")
    vector_store = FAISS.from_texts(chunks, embeddings)
    ```
- Storing Embeddings:
  - Store the generated embeddings in a FAISS vector database for efficient retrieval (a save/reload sketch is shown after this list).
- Question Answering:
  - Use the stored embeddings to answer user queries based on the document content.
  - Example:

    ```python
    from langchain.chains.question_answering import load_qa_chain
    from langchain_google_genai import ChatGoogleGenerativeAI

    # Retrieve the most relevant chunks for the question, then run the QA chain over them
    docs = vector_store.similarity_search("Your question here")
    qa_chain = load_qa_chain(ChatGoogleGenerativeAI(model="gemini-pro"), chain_type="map_reduce")
    result = qa_chain.run(input_documents=docs, question="Your question here")
    ```
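For the Storing Embeddings step, the FAISS index can also be saved to disk and reloaded later so the document does not have to be re-embedded on every run. Below is a minimal sketch; the `faiss_index` directory name, the placeholder chunks, and the embedding model name are assumptions for illustration:

```python
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Build the index from previously split chunks (placeholder texts shown here)
chunks = ["example chunk one", "example chunk two"]
vector_store = FAISS.from_texts(chunks, embeddings)

# Persist the index to a local directory so it can be reused without re-embedding
vector_store.save_local("faiss_index")

# Reload it later; recent langchain-community versions may also require
# allow_dangerous_deserialization=True on load_local
vector_store = FAISS.load_local("faiss_index", embeddings)
```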
To set up the project, follow the steps below:
- Clone the repository:

  ```bash
  git clone https://github.com/Tech-Society-SEC/Chatbot_ML.git
  ```
- Navigate to the project directory:

  ```bash
  cd Chatbot_ML
  ```

- Install the necessary libraries:

  ```bash
  pip install scikit-learn-intelex pymupdf langchain-google-genai langchain-community python-dotenv faiss-cpu
  ```
- Mount Google Drive to access your files (if running in Google Colab):

  ```python
  from google.colab import drive

  drive.mount('/content/drive')
  ```
- Set up the Google API key by creating a `.env` file that stores your key (e.g., a line of the form `GOOGLE_API_KEY=your-key-here`) and loading it:

  ```python
  import os

  from dotenv import load_dotenv

  load_dotenv()
  api_key = os.getenv('GOOGLE_API_KEY')
  ```
- Optimize scikit-learn for better performance:

  ```python
  from sklearnex import patch_sklearn

  patch_sklearn()
  ```
We welcome contributions! Below are some beginner-friendly issues to help you get started:
- Improve Documentation:
  - Add more detailed comments in the code to explain the purpose of each section.
  - Expand the README with examples of common errors and troubleshooting tips.
- Add Unit Tests:
  - Write unit tests for each functionality (e.g., text splitting, embedding generation); a starting point is sketched after this list.
- Enhance Error Handling:
  - Identify potential points of failure (e.g., invalid PDF, missing API key).
  - Add meaningful error messages and fallback mechanisms (see the sketch after this list).
- Create a Simple CLI:
  - Develop a command-line interface for running the system end-to-end (a possible shape is sketched after this list).
  - Include options for loading a PDF, asking a question, and displaying results.
- Optimize Chunk Size:
  - Experiment with different chunk sizes for text splitting (a starting script is sketched after this list).
  - Evaluate how this impacts embedding quality and performance.
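For the unit-test issue, a small pytest module covering the text-splitting step could look like the following sketch; the file name and test cases are only suggestions:

```python
# test_splitting.py -- run with `pytest`
from langchain.text_splitter import RecursiveCharacterTextSplitter


def test_split_respects_chunk_size():
    text = "word " * 1000  # ~5000 characters of dummy text
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = splitter.split_text(text)

    assert len(chunks) > 1
    assert all(len(chunk) <= 1000 for chunk in chunks)


def test_short_text_is_not_split():
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = splitter.split_text("a short document")

    assert chunks == ["a short document"]
```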
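For the error-handling issue, one possible shape is to validate inputs up front and fail with clear messages; the helper functions below are illustrative and not part of the current code:

```python
import os

import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    """Extract text from a PDF, failing with clear messages instead of raw tracebacks."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    try:
        pdf = fitz.open(pdf_path)
    except Exception as exc:  # PyMuPDF raises different errors for corrupt or unsupported files
        raise ValueError(f"Could not open '{pdf_path}' as a PDF: {exc}") from exc
    text = "\n".join(page.get_text() for page in pdf)
    if not text.strip():
        raise ValueError(f"No extractable text found in '{pdf_path}' (scanned, image-only PDF?)")
    return text


def require_api_key() -> str:
    """Fail early if the Google API key is missing from the environment / .env file."""
    api_key = os.getenv("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file before running.")
    return api_key
```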
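For the CLI issue, a thin argparse wrapper around the steps documented above could look like this sketch; the option names, model names, and layout are only suggestions:

```python
import argparse

import fitz  # PyMuPDF
from langchain.chains.question_answering import load_qa_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings


def main() -> None:
    parser = argparse.ArgumentParser(description="Ask a question about a PDF document.")
    parser.add_argument("pdf", help="Path to the PDF file")
    parser.add_argument("question", help="Question to ask about the document")
    parser.add_argument("--chunk-size", type=int, default=1000, help="Chunk size for text splitting")
    args = parser.parse_args()

    # Extract and split the document text
    text = "\n".join(page.get_text() for page in fitz.open(args.pdf))
    splitter = RecursiveCharacterTextSplitter(chunk_size=args.chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)

    # Embed, index, retrieve, and answer
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    store = FAISS.from_texts(chunks, embeddings)
    docs = store.similarity_search(args.question)
    chain = load_qa_chain(ChatGoogleGenerativeAI(model="gemini-pro"), chain_type="map_reduce")
    print(chain.run(input_documents=docs, question=args.question))


if __name__ == "__main__":
    main()
```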
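For the chunk-size issue, a simple starting point is to split the same document with several sizes and compare the resulting chunks; the candidate sizes below are arbitrary, and answer quality would then be compared by asking the same questions against an index built from each setting:

```python
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Use the same extraction step as the rest of the project
text = "\n".join(page.get_text() for page in fitz.open("document.pdf"))

for chunk_size in (250, 500, 1000, 2000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)
    avg_len = sum(len(c) for c in chunks) / max(len(chunks), 1)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg length {avg_len:.0f} chars")
```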
For more details and to access the code, visit the GitHub Repository: https://github.com/Tech-Society-SEC/Chatbot_ML