Dhananjay-prod/Chatpdf

RAG Chatbot with LangChain, Pinecone, and Hugging Face

A Retrieval-Augmented Generation (RAG) chatbot that indexes a PDF document in a vector database and answers questions grounded in the document's content.

Features

  • PDF Document Processing: Extract and process text from PDF files
  • Vector Database Storage: Store document embeddings in Pinecone for efficient retrieval
  • Conversational Memory: Maintain chat history for contextual conversations
  • Advanced Retrieval: Uses Cohere's reranking for improved document relevance
  • Swappable LLM: Uses Falcon-7B-Instruct via Hugging Face Hub by default; other Hugging Face Hub models can be substituted

Architecture

The system consists of two main components:

  1. Document Indexing (indexing_script.py): Processes PDFs and stores embeddings
  2. Chat Interface (chat_script.py): Handles user queries and generates responses

Prerequisites

  • Python 3.7+
  • Valid API keys for:
    • Pinecone
    • Cohere
    • Hugging Face Hub

Installation

  1. Clone the repository:
git clone <repository-url>
cd Chatpdf
  2. Install required packages:
pip install -r requirements.txt

Required Dependencies

langchain
pinecone-client
cohere
PyPDF2
transformers
torch
tqdm

Configuration

API Keys Setup

Replace the placeholder API keys in both scripts:

# In both files, update these variables:
pine_cone_key = "your_pinecone_api_key"
cohere_key = "your_cohere_api_key" 
huggin_face_api = "your_hugging_face_api_key"

Environment Variables (Alternative)

You can also set environment variables:

export PINECONE_API_KEY="your_pinecone_api_key"
export COHERE_API_KEY="your_cohere_api_key"
export HUGGINGFACE_API_TOKEN="your_hugging_face_api_key"
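
If you go the environment-variable route, the scripts need to read those values instead of the hard-coded placeholders. A minimal sketch of how that could look; `load_api_keys` and `REQUIRED_KEYS` are hypothetical names and not part of the repository:

```python
import os

# Names match the export commands above; the helper itself is hypothetical.
REQUIRED_KEYS = ("PINECONE_API_KEY", "COHERE_API_KEY", "HUGGINGFACE_API_TOKEN")


def load_api_keys(env=os.environ):
    """Return the three API keys, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

The returned values can then be assigned to the existing variables, for example `pine_cone_key = load_api_keys()["PINECONE_API_KEY"]`, which keeps secrets out of the source files.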

Usage

Step 1: Document Indexing

First, process your PDF document and create the vector database:

  1. Place your PDF file in the project directory (default: "Attention is all you need.pdf")
  2. Run the indexing script:
python indexing_script.py

This will:

  • Extract text from the PDF
  • Split text into chunks (500 tokens each)
  • Generate embeddings using Cohere
  • Store embeddings in Pinecone index named "chatdatabase"
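
To make the chunking step concrete, here is a rough sketch of a sliding-window splitter with the same chunk size and overlap. It is an illustration only: the real script uses LangChain's TokenTextSplitter, which counts model tokens, whereas this hypothetical `split_into_chunks` works on any pre-tokenized list (for instance whitespace-separated words).

```python
def split_into_chunks(tokens, chunk_size=500, chunk_overlap=25):
    """Split a token list into overlapping windows.

    Each chunk starts (chunk_size - chunk_overlap) tokens after the previous
    one, so consecutive chunks share chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.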

Step 2: Start Chatting

Once indexing is complete, run the chat interface:

python chat_script.py

The system will initialize and you can start asking questions about your document.

Configuration Options

Text Splitting

  • Chunk Size: 500 tokens (adjustable in TokenTextSplitter)
  • Chunk Overlap: 25 tokens

Retrieval Settings

  • Top-K Results: 4 documents retrieved per query
  • Reranking: Cohere rerank for improved relevance
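
The two stages above (fetch the Top-K nearest chunks, then reorder them) can be sketched in plain Python. This is a toy stand-in, not the project's code: Pinecone performs the vector search and Cohere the reranking in the actual scripts, and `retrieve_top_k`, `rerank`, and `score_fn` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, docs, k=4):
    """Stage 1: return the k texts whose vectors are closest to the query.

    docs is a list of (text, vector) pairs.
    """
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def rerank(query, texts, score_fn):
    """Stage 2: reorder candidates with a (usually more expensive) relevance scorer."""
    return sorted(texts, key=lambda t: score_fn(query, t), reverse=True)
```

Lowering `k` reduces the amount of text passed to the reranker and the LLM, at the cost of possibly missing a relevant chunk.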

Memory Settings

  • Token Limit: 1000 tokens for conversation history
  • Memory Type: ConversationTokenBufferMemory
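
For intuition, a token-limited buffer drops the oldest messages once the history exceeds the limit. The sketch below imitates that behaviour under simplifying assumptions: `TokenBufferMemory` is a hypothetical class, and it counts whitespace-separated words rather than real model tokens.

```python
from collections import deque


class TokenBufferMemory:
    """Keep recent chat messages whose combined token count fits a limit."""

    def __init__(self, max_token_limit=1000, count_tokens=lambda s: len(s.split())):
        self.max_token_limit = max_token_limit
        self.count_tokens = count_tokens  # word count stands in for a tokenizer
        self.messages = deque()
        self.total = 0

    def add(self, role, text):
        """Append a message, then evict the oldest ones while over the limit."""
        self.messages.append((role, text))
        self.total += self.count_tokens(text)
        while self.total > self.max_token_limit and self.messages:
            _, old_text = self.messages.popleft()
            self.total -= self.count_tokens(old_text)
```

A smaller limit keeps prompts short but makes the bot forget earlier turns sooner.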

LLM Configuration

  • Model: tiiuae/falcon-7b-instruct
  • Temperature: 0.6
  • Max New Tokens: 2000
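
One way to keep these settings in a single place is a small config dictionary. The sketch below is an assumption about how the values might be wired up, not the repository's code: `LLM_CONFIG` and `build_llm` are hypothetical, and `HuggingFaceHub` is only available from `langchain.llms` in older LangChain releases.

```python
# Values taken from the list above; the dict layout itself is an assumption.
LLM_CONFIG = {
    "repo_id": "tiiuae/falcon-7b-instruct",
    "model_kwargs": {"temperature": 0.6, "max_new_tokens": 2000},
}


def build_llm(api_token, config=LLM_CONFIG):
    """Hypothetical factory that turns the config into a LangChain LLM object."""
    # Imported lazily so the config can be inspected without langchain installed.
    from langchain.llms import HuggingFaceHub

    return HuggingFaceHub(
        repo_id=config["repo_id"],
        model_kwargs=config["model_kwargs"],
        huggingfacehub_api_token=api_token,
    )
```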

Customization

Change PDF Document

Update the file path in the indexing script:

Your_text_data = get_pdf_text("path/to/your/document.pdf")
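
The body of `get_pdf_text` is not shown here, so the following is only a guess at what such a helper might look like with PyPDF2 (listed in the dependencies). `join_page_text` is a hypothetical helper split out so the joining logic can be exercised without a real PDF.

```python
def join_page_text(pages):
    """Concatenate text from page-like objects that expose extract_text()."""
    parts = []
    for page in pages:
        # Image-only or empty pages may yield None or an empty string.
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def get_pdf_text(path):
    """Read every page of the PDF at path and return its text as one string."""
    # Imported lazily; requires PyPDF2 2.x or later for PdfReader.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

Scanned PDFs without a text layer will produce little or no text this way and would need OCR first.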

Modify Prompts

Customize the system prompt in chat_script.py:

template = """
Your custom prompt here...
{question}
"""

Switch LLM Models

Replace the Falcon model with another Hugging Face model:

repo_id = "your_preferred_model"

Pinecone Setup

  1. Create a Pinecone account at pinecone.io
  2. Create a new index named "chatdatabase"
  3. Use the starter environment: 'gcp-starter'

Cohere Setup

  1. Get API key from Cohere
  2. The system uses:
    • embed-english-light-v2.0 for embeddings
    • Default reranker for document reranking

Troubleshooting

Common Issues

  1. API Key Errors: Ensure all API keys are valid and have sufficient quota
  2. Index Not Found: Run the indexing script before the chat script
  3. Memory Issues: Reduce chunk size or max_token_limit for large documents
  4. Slow Responses: Try a lighter embedding model or lower the Top-K retrieval count (default: 4)

Error Handling

The code includes basic error handling, but you may want to add:

  • API rate limit handling
  • Network timeout management
  • Input validation
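
For the rate-limit item, a common pattern is to retry with exponential backoff. The sketch below is one possible shape under stated assumptions: `call_with_retries` is a hypothetical helper, and it catches every `Exception` for brevity, whereas real code should catch only the rate-limit or timeout errors raised by the client library in use.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n (plus jitter) and retry.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

An API call can be wrapped as `call_with_retries(lambda: some_client_call(...))`, where `some_client_call` stands for whichever request is being rate limited.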

Performance Optimization

  • Batch Processing: Process multiple PDFs by modifying the indexing loop
  • Caching: Implement caching for frequently asked questions
  • Model Optimization: Use quantized models for faster inference
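
The caching idea can be prototyped with the standard library. This is a sketch, not part of the project: `make_cached_answerer` is a hypothetical wrapper, and `answer_fn` stands for whatever function runs the full retrieval and generation pipeline.

```python
from functools import lru_cache


def make_cached_answerer(answer_fn, maxsize=128):
    """Wrap answer_fn so repeated questions are served from an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized_question):
        return answer_fn(normalized_question)

    def answer(question):
        # Normalize case and whitespace so trivially different spellings share an entry.
        return _cached(" ".join(question.lower().split()))

    return answer
```

Because the cache key ignores chat history, this only suits stand-alone questions; follow-ups that depend on earlier turns should bypass it.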

Future Enhancements

  • Support for multiple document formats (Word, TXT, etc.)
  • Web interface using Streamlit or Chainlit
  • Multi-language support
  • Advanced conversation features
  • Document source citations
  • Conversation export functionality

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

[Add your license here]

Support

For questions or issues, please create an issue or contact [your-email].


Note: This is a development version. For production use, implement proper error handling, logging, and security measures.

About

An AI-powered chatbot that extracts knowledge from PDFs using LangChain, stores embeddings in Pinecone, and answers questions with Falcon-7B LLM and Cohere reranking for accurate, context-aware responses.
