pdfBot is a privacy-first, fully local Retrieval-Augmented Generation (RAG) chatbot that allows you to chat interactively with your PDF documents. Built entirely with open-source tools, it runs completely offline, ensuring your sensitive data never leaves your machine.
- 100% Local & Private: No API keys required. Powered by a local Llama 2 model running on CTransformers.
- Conversational UI: A sleek, interactive web interface built with Chainlit.
- Document Processing: Automatically loads, splits, and processes PDF documents placed in a local directory.
- Efficient Retrieval: Uses `HuggingFaceEmbeddings` and FAISS for fast, accurate local vector search.
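The retrieval step boils down to embedding the question and ranking stored chunk vectors by similarity. A minimal, library-free sketch of that idea (the toy vectors and `top_k` helper are illustrative only; the real app delegates this to HuggingFace embeddings and FAISS):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k stored chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional "embeddings" standing in for real 768-dim model outputs.
chunks = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]
print(top_k(query, chunks))  # → [0, 2]: those chunks point nearly the same way
```

FAISS does the same ranking over hundreds of thousands of vectors using optimized index structures instead of a linear scan.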
- Orchestration: LangChain
- UI Framework: Chainlit
- Local LLM Inference: CTransformers
- Vector Store: FAISS
- Embeddings: HuggingFace (`sentence-transformers/all-mpnet-base-v2`)
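Before anything is embedded, each PDF's text is split into overlapping chunks so retrieved passages fit the LLM's context window. A simplified, library-free sketch of that splitting step (chunk sizes are illustrative; the project itself relies on LangChain's text splitters):

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-size character chunks with overlap, so content
    cut at a chunk boundary still appears intact in a neighboring chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 250
print([len(c) for c in split_text(doc)])  # → [100, 100, 90]
```

The overlap is what keeps a sentence that straddles a boundary retrievable as a whole from at least one chunk.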
- Python 3.8+
- Download the LLM: You will need to download the `llama-2-7b-chat.Q5_K_M.gguf` model (or update `app.py` to point to your preferred `.gguf` model) and place it in the root directory. You can find this model on HuggingFace (e.g., in TheBloke's repositories).
- Clone this repository:

  ```bash
  git clone https://github.com/tilakraj0308/pdfbot.git
  cd pdfbot
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv myenv
  source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
  ```

- Install the required dependencies:
  ```bash
  pip install langchain chainlit ctransformers faiss-cpu huggingface-hub sentence-transformers pypdf
  ```

  (Note: use `faiss-gpu` instead of `faiss-cpu` if you have a compatible NVIDIA GPU.)

- Add your documents: Place all the `.pdf` files you want to chat with inside the `data/` directory.

- Create the vector database: Run the data ingestion script. This will process your PDFs, generate embeddings, and save the FAISS database locally in the `db/` folder.

  ```bash
  python data_store.py
  ```

- Start the chatbot: Launch the Chainlit interface to start interacting with your documents.

  ```bash
  chainlit run app.py -w
  ```

  The UI will open in your default web browser (usually at http://localhost:8000).
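At query time, a RetrievalQA-style chain essentially "stuffs" the retrieved chunks into a prompt before handing it to the local Llama 2 model. A simplified, library-free sketch of that step (the template and function names are illustrative, not LangChain internals):

```python
PROMPT_TEMPLATE = """Use the following context to answer the question.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Join the retrieved chunks and fill the QA prompt template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "Contact support to start a refund."],
)
print(prompt)
```

The grounding instruction in the template is what keeps the chatbot's answers tied to your PDFs rather than the model's general training data.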
- `app.py`: The main Chainlit application and LangChain RetrievalQA setup.
- `data_store.py`: Script to load PDFs, chunk text, create embeddings, and build the FAISS index.
- `data/`: Directory where you drop your input PDF files.
- `db/`: Directory where the local FAISS vector database is saved.