This repository contains a machine learning project designed to process PDF documents, extract text, split it into smaller chunks, generate embeddings using Google’s Generative AI, and store them in a FAISS vector store for fast retrieval. The system enables question answering based on the contents of the document.
- PDF Parsing: Extract text from PDF files using PyMuPDF (fitz).
- Text Splitting: Split large text documents into smaller, manageable chunks using `RecursiveCharacterTextSplitter` from LangChain.
- Embeddings: Generate text embeddings using Google's Generative AI, enabling semantic search.
- Vector Store: Store embeddings in a FAISS vector store for efficient similarity-based retrieval.
- Question Answering: Answer user queries based on the document's content using LangChain's question-answering chain.
- Core Functionality Implemented:
  - PDF text extraction is fully functional using PyMuPDF.
  - Text splitting using LangChain's `RecursiveCharacterTextSplitter` works seamlessly.
  - Embedding generation is integrated with Google's Generative AI API and is fully operational.
  - Vector storage and retrieval using FAISS are implemented and tested.
  - Basic question-answering functionality using LangChain's chain is working.
- Optimizations:
  - Intel's scikit-learn extension is integrated for enhanced performance.
  - Environment setup instructions are provided for easy replication.
The project’s main functionalities include:
- Extracting Text from PDFs:
  - Load a PDF file and extract text content using PyMuPDF.
  - Example:

    ```python
    import fitz

    pdf = fitz.open("document.pdf")
    text = "\n".join(page.get_text() for page in pdf)
    ```
- Splitting Text:
  - Split extracted text into smaller chunks for better embedding generation.
  - Example:

    ```python
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = text_splitter.split_text(text)
    ```
- Generating Embeddings:
  - Convert text chunks into embeddings using Google's Generative AI.
  - Example:

    ```python
    from langchain_community.vectorstores import FAISS
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    # The embedding model name is required; "models/embedding-001" is the standard choice
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key="YOUR_API_KEY")
    vector_store = FAISS.from_texts(chunks, embeddings)
    ```
- Storing Embeddings:
  - Store the generated embeddings in a FAISS vector database for efficient retrieval (a save/reload sketch is shown after this list).
- Question Answering:
  - Use the stored embeddings to answer user queries based on the document content.
  - Example:

    ```python
    from langchain.chains.question_answering import load_qa_chain
    from langchain_google_genai import ChatGoogleGenerativeAI

    # Retrieve the most relevant chunks for the question, then run the QA chain over them
    docs = vector_store.similarity_search("Your question here")
    qa_chain = load_qa_chain(ChatGoogleGenerativeAI(model="gemini-pro"), chain_type="map_reduce")
    result = qa_chain.run(input_documents=docs, question="Your question here")
    ```
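For the Storing Embeddings step, the FAISS index can also be saved to disk and reloaded later so the document does not have to be re-embedded on every run. Below is a minimal sketch; the `faiss_index` directory name, the placeholder chunks, and the embedding model name are assumptions for illustration:

```python
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Build the index from previously split chunks (placeholder texts shown here)
chunks = ["example chunk one", "example chunk two"]
vector_store = FAISS.from_texts(chunks, embeddings)

# Persist the index to a local directory so it can be reused without re-embedding
vector_store.save_local("faiss_index")

# Reload it later; recent langchain-community versions may also require
# allow_dangerous_deserialization=True on load_local
vector_store = FAISS.load_local("faiss_index", embeddings)
```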
To set up the project, follow the steps below:
- Clone the repository:

  ```bash
  git clone https://github.com/Tech-Society-SEC/Chatbot_ML.git
  ```
- Navigate to the project directory:

  ```bash
  cd Chatbot_ML
  ```

- Install the necessary libraries:

  ```bash
  pip install scikit-learn-intelex pymupdf langchain-google-genai langchain-community python-dotenv faiss-cpu
  ```
- Mount Google Drive to access your files (if running in Google Colab):

  ```python
  from google.colab import drive

  drive.mount('/content/drive')
  ```
- Set up the Google API key by creating a `.env` file that stores your key (e.g., a line of the form `GOOGLE_API_KEY=your-key-here`) and loading it:

  ```python
  import os

  from dotenv import load_dotenv

  load_dotenv()
  api_key = os.getenv('GOOGLE_API_KEY')
  ```
- Optimize scikit-learn for better performance:

  ```python
  from sklearnex import patch_sklearn

  patch_sklearn()
  ```
We welcome contributions! Below are some beginner-friendly issues to help you get started:
- Improve Documentation:
  - Add more detailed comments in the code to explain the purpose of each section.
  - Expand the README with examples of common errors and troubleshooting tips.
- Add Unit Tests:
  - Write unit tests for each functionality (e.g., text splitting, embedding generation); a starting point is sketched after this list.
- Enhance Error Handling:
  - Identify potential points of failure (e.g., invalid PDF, missing API key).
  - Add meaningful error messages and fallback mechanisms (see the sketch after this list).
- Create a Simple CLI:
  - Develop a command-line interface for running the system end-to-end (a possible shape is sketched after this list).
  - Include options for loading a PDF, asking a question, and displaying results.
- Optimize Chunk Size:
  - Experiment with different chunk sizes for text splitting (a starting script is sketched after this list).
  - Evaluate how this impacts embedding quality and performance.
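For the unit-test issue, a small pytest module covering the text-splitting step could look like the following sketch; the file name and test cases are only suggestions:

```python
# test_splitting.py -- run with `pytest`
from langchain.text_splitter import RecursiveCharacterTextSplitter


def test_split_respects_chunk_size():
    text = "word " * 1000  # ~5000 characters of dummy text
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = splitter.split_text(text)

    assert len(chunks) > 1
    assert all(len(chunk) <= 1000 for chunk in chunks)


def test_short_text_is_not_split():
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = splitter.split_text("a short document")

    assert chunks == ["a short document"]
```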
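For the error-handling issue, one possible shape is to validate inputs up front and fail with clear messages; the helper functions below are illustrative and not part of the current code:

```python
import os

import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    """Extract text from a PDF, failing with clear messages instead of raw tracebacks."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    try:
        pdf = fitz.open(pdf_path)
    except Exception as exc:  # PyMuPDF raises different errors for corrupt or unsupported files
        raise ValueError(f"Could not open '{pdf_path}' as a PDF: {exc}") from exc
    text = "\n".join(page.get_text() for page in pdf)
    if not text.strip():
        raise ValueError(f"No extractable text found in '{pdf_path}' (scanned, image-only PDF?)")
    return text


def require_api_key() -> str:
    """Fail early if the Google API key is missing from the environment / .env file."""
    api_key = os.getenv("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file before running.")
    return api_key
```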
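For the CLI issue, a thin argparse wrapper around the steps documented above could look like this sketch; the option names, model names, and layout are only suggestions:

```python
import argparse

import fitz  # PyMuPDF
from langchain.chains.question_answering import load_qa_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings


def main() -> None:
    parser = argparse.ArgumentParser(description="Ask a question about a PDF document.")
    parser.add_argument("pdf", help="Path to the PDF file")
    parser.add_argument("question", help="Question to ask about the document")
    parser.add_argument("--chunk-size", type=int, default=1000, help="Chunk size for text splitting")
    args = parser.parse_args()

    # Extract and split the document text
    text = "\n".join(page.get_text() for page in fitz.open(args.pdf))
    splitter = RecursiveCharacterTextSplitter(chunk_size=args.chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)

    # Embed, index, retrieve, and answer
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    store = FAISS.from_texts(chunks, embeddings)
    docs = store.similarity_search(args.question)
    chain = load_qa_chain(ChatGoogleGenerativeAI(model="gemini-pro"), chain_type="map_reduce")
    print(chain.run(input_documents=docs, question=args.question))


if __name__ == "__main__":
    main()
```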
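For the chunk-size issue, a simple starting point is to split the same document with several sizes and compare the resulting chunks; the candidate sizes below are arbitrary, and answer quality would then be compared by asking the same questions against an index built from each setting:

```python
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Use the same extraction step as the rest of the project
text = "\n".join(page.get_text() for page in fitz.open("document.pdf"))

for chunk_size in (250, 500, 1000, 2000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)
    avg_len = sum(len(c) for c in chunks) / max(len(chunks), 1)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg length {avg_len:.0f} chars")
```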
For more details and to access the code, visit the GitHub Repository: https://github.com/Tech-Society-SEC/Chatbot_ML