A Retrieval-Augmented Generation (RAG) chatbot that answers questions about PDF documents, using Cohere embeddings and reranking, a Pinecone vector index, and Falcon-7B-Instruct for generation.
- PDF Document Processing: Extract and process text from PDF files
- Vector Database Storage: Store document embeddings in Pinecone for efficient retrieval
- Conversational Memory: Maintain chat history for contextual conversations
- Advanced Retrieval: Uses Cohere's reranking for improved document relevance
- Swappable LLM: Uses Falcon-7B-Instruct via Hugging Face Hub by default; other Hugging Face Hub models can be substituted
The system consists of two main components:
- Document Indexing (indexing_script.py): Processes PDFs and stores embeddings
- Chat Interface (chat_script.py): Handles user queries and generates responses
- Python 3.7+
- Valid API keys for:
- Pinecone
- Cohere
- Hugging Face Hub
- Clone the repository:
git clone <repository-url>
cd rag-chatbot
- Install required packages:
pip install -r requirements.txt
Required packages:
langchain
pinecone-client
cohere
PyPDF2
transformers
torch
tqdm
Replace the placeholder API keys in both scripts:
# In both files, update these variables:
pine_cone_key = "your_pinecone_api_key"
cohere_key = "your_cohere_api_key"
huggin_face_api = "your_hugging_face_api_key"
You can also set environment variables:
export PINECONE_API_KEY="your_pinecone_api_key"
export COHERE_API_KEY="your_cohere_api_key"
export HUGGINGFACE_API_TOKEN="your_hugging_face_api_key"
First, process your PDF document and create the vector database:
- Place your PDF file in the project directory (default: "Attention is all you need.pdf")
- Run the indexing script:
python indexing_script.py
This will:
- Extract text from the PDF
- Split text into chunks (500 tokens each)
- Generate embeddings using Cohere
- Store embeddings in Pinecone index named "chatdatabase"
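The chunking step above can be illustrated with a minimal sketch. It is not the code in indexing_script.py: the function name `split_into_chunks` is hypothetical, and list items stand in for the tokens that a real tokenizer would produce. It only shows how a 500-token window with a 25-token overlap walks through a document.

```python
def split_into_chunks(tokens, chunk_size=500, chunk_overlap=25):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Assumption: whitespace-separated words stand in for real tokens.
words = ("attention " * 1200).split()
chunks = split_into_chunks(words)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the last 25 tokens of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still retrievable in one piece.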
Once indexing is complete, run the chat interface:
python chat_script.py
The system will initialize and you can start asking questions about your document.
- Chunk Size: 500 tokens (adjustable in TokenTextSplitter)
- Chunk Overlap: 25 tokens
- Top-K Results: 4 documents retrieved per query
- Reranking: Cohere rerank for improved relevance
- Token Limit: 1000 tokens for conversation history
- Memory Type: ConversationTokenBufferMemory
- Model: tiiuae/falcon-7b-instruct
- Temperature: 0.6
- Max New Tokens: 2000
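To make the conversation-history token limit listed above more concrete, here is a rough sketch of the idea behind a token-buffer memory: once the history exceeds the budget, the oldest messages are dropped first. The class `TokenBufferMemory` and the whitespace-based `count_tokens` are hypothetical stand-ins, not LangChain's ConversationTokenBufferMemory, which counts tokens with the model's own tokenizer.

```python
from collections import deque


def count_tokens(text):
    # Assumption: a whitespace split as a crude stand-in for a real tokenizer.
    return len(text.split())


class TokenBufferMemory:
    """Hypothetical helper that keeps recent messages within a token budget."""

    def __init__(self, max_token_limit=1000):
        self.max_token_limit = max_token_limit
        self.messages = deque()

    def total_tokens(self):
        return sum(count_tokens(text) for _, text in self.messages)

    def add(self, role, text):
        self.messages.append((role, text))
        # Drop the oldest messages until the history fits the budget,
        # always keeping at least the newest message.
        while self.total_tokens() > self.max_token_limit and len(self.messages) > 1:
            self.messages.popleft()


memory = TokenBufferMemory(max_token_limit=1000)
memory.add("user", "What is multi-head attention?")
memory.add("assistant", "It runs several attention functions in parallel.")
print(memory.total_tokens())
```

A lower limit leaves more of the model's context window for retrieved chunks, at the cost of forgetting earlier turns sooner.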
Update the file path in the indexing script:
Your_text_data = get_pdf_text("path/to/your/document.pdf")
Customize the system prompt in chat_script.py:
template = """
Your custom prompt here...
{question}
"""
Replace the Falcon model with another Hugging Face model:
repo_id = "your_preferred_model"
- Create a Pinecone account at pinecone.io
- Create a new index named "chatdatabase"
- Use the starter environment: 'gcp-starter'
- Get API key from Cohere
- The system uses:
- embed-english-light-v2.0 for embeddings
- Default reranker for document reranking
- API Key Errors: Ensure all API keys are valid and have sufficient quota
- Index Not Found: Run the indexing script before the chat script
- Memory Issues: Reduce chunk size or max_token_limit for large documents
- Slow Responses: Consider using a different embedding model or reducing retrieval count
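For the API key errors mentioned above, a small start-up check can report which keys are missing before any service is called. This is only a sketch: the helper `missing_api_keys` is hypothetical, and it assumes the scripts read the three environment variables named in the configuration section, which the scripts may not do as written.

```python
import os

# Variable names taken from the export examples in this README.
REQUIRED_KEYS = ("PINECONE_API_KEY", "COHERE_API_KEY", "HUGGINGFACE_API_TOKEN")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing API keys: " + ", ".join(missing))
```

This only detects absent keys; an invalid or over-quota key will still fail when the corresponding API is first called.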
The code includes basic error handling, but you may want to add:
- API rate limit handling
- Network timeout management
- Input validation
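One possible way to approach rate limits and transient network failures is a retry wrapper with exponential backoff. The sketch below is an assumption about how that could look, not part of the existing scripts; `call_with_retries` is a hypothetical name, and it retries on any exception, which a real implementation would likely narrow to the specific errors raised by each client library.

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff plus a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Delays grow as 1s, 2s, 4s, ... with up to 0.1s of random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)


# Usage (hypothetical): wrap any call that may hit a rate limit.
# answer = call_with_retries(lambda: chain({"question": "What is attention?"}))
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.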
- Batch Processing: Process multiple PDFs by modifying the indexing loop
- Caching: Implement caching for frequently asked questions
- Model Optimization: Use quantized models for faster inference
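As an illustration of the caching idea above, the following sketch memoizes answers keyed on a normalized form of the question. The names `normalize` and `make_cached_answerer` are hypothetical, and `answer_fn` stands in for whatever function produces an answer. Because it ignores chat history, it would only suit standalone questions, not follow-ups that depend on earlier turns.

```python
from functools import lru_cache


def normalize(question):
    """Lowercase and collapse whitespace so trivial variants share a cache entry."""
    return " ".join(question.lower().split())


def make_cached_answerer(answer_fn, maxsize=256):
    """Wrap answer_fn so repeated questions are served from an in-memory cache."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_question):
        return answer_fn(normalized_question)

    def ask(question):
        return cached(normalize(question))

    return ask


# Usage with a placeholder answer function.
ask = make_cached_answerer(lambda q: f"(answer to: {q})")
print(ask("What is attention?"))
print(ask("  what IS   attention? "))  # served from the cache
```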
- Support for multiple document formats (Word, TXT, etc.)
- Web interface using Streamlit or Chainlit
- Multi-language support
- Advanced conversation features
- Document source citations
- Conversation export functionality
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
[Add your license here]
For questions or issues, please create an issue or contact [your-email].
- Built with LangChain
- Vector storage by Pinecone
- Embeddings by Cohere
- Language model from Hugging Face
Note: This is a development version. For production use, implement proper error handling, logging, and security measures.