This project is a Python-based tool that extracts text from PDF documents, answers user queries about that text using OpenAI's language models, and posts the responses to a specified Slack channel.
- PDF Processing: Extracts text from PDF documents and splits it into manageable chunks.
- Natural Language Queries: Users can ask questions related to the content of the PDF.
- OpenAI Integration: Utilizes OpenAI's models for generating responses and embeddings.
- Confidence Handling: Implements logic to handle low-confidence responses.
- Exact Match Response: Returns exact matches from the PDF when queries match exactly, using a greedy token-generation strategy.
- Slack Notifications: Posts responses directly to a specified Slack channel.
- Error Handling and Logging: Includes robust error handling, retry logic, and detailed logging.
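The confidence handling listed above could be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that confidence is measured as the mean log-probability of the generated tokens and that a low-confidence answer is replaced by a fallback message. The function names, the fallback text, and the log-probability interpretation are all hypothetical.

```python
from typing import Sequence

# Hypothetical fallback text; the project's actual wording is not documented here.
FALLBACK = "I'm not confident enough to answer this from the document."


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability of the generated tokens (higher means more confident)."""
    if not token_logprobs:
        return float("-inf")  # no tokens: treat as the lowest possible confidence
    return sum(token_logprobs) / len(token_logprobs)


def apply_confidence(answer: str, token_logprobs: Sequence[float],
                     threshold: float = -1.5) -> str:
    """Return the answer only if its mean log-probability reaches the threshold."""
    if mean_logprob(token_logprobs) < threshold:
        return FALLBACK
    return answer


print(apply_confidence("Paris", [-0.1, -0.3]))        # confident: the answer is kept
print(apply_confidence("Maybe Paris", [-2.0, -3.5]))  # below threshold: fallback text
```

Raising the threshold toward 0 makes the check stricter, so more answers fall back; lowering it lets more answers through.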
- Python 3.x
- Libraries:
  - openai
  - slack_sdk
  - sklearn
  - PyPDF2
You can install the required libraries using:
pip install -r requirements.txt
Before running the application, configure the following parameters in your configuration file or as command-line arguments:
- pdf_path: Path to the PDF document to process.
- questions: Comma-separated list of questions to ask.
- api_key: Your OpenAI API key.
- slack_token: Slack API token for sending messages.
- slack_channel: ID of the Slack channel to post messages to.
- model (optional): Model to use for generating responses (default=gpt-4o-mini).
- embed (optional): Whether to embed the PDF chunks and retrieve them by cosine similarity, which is faster and more cost-efficient (default=true).
- embed_model (optional): Embedding model to use (default=text-embedding-3-small).
- chunk_size (optional): Size of each chunk when splitting the PDF (default=500).
- chunk_overlap (optional): Number of overlapping characters between chunks (default=100).
- confidence_threshold (optional): Confidence threshold for OpenAI responses (default=-1.5, can be fine-tuned).
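To make the chunk_size, chunk_overlap, and embed parameters concrete, here is a minimal sketch of overlapping character chunks and cosine-similarity retrieval. It assumes chunks are measured in characters and uses toy vectors in place of real embeddings. The pure-Python `cosine_similarity` is a stand-in for a library routine, and all function names are hypothetical rather than taken from the project.

```python
import math
from typing import List, Sequence


def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vec: Sequence[float], chunk_vecs: Sequence[Sequence[float]]) -> int:
    """Index of the chunk vector closest to the query (chunk_vecs must be non-empty)."""
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


chunks = split_text("a" * 1200)
print([len(c) for c in chunks])                             # [500, 500, 400]
print(most_similar([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1]]))   # 1
```

With the defaults, each new chunk starts 400 characters after the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.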
- Clone the repository:
git clone https://github.com/yourusername/PDF-QA.git
cd PDF-QA
- Run the script with the desired parameters:
python main.py --questions "Comma-separated list of questions here" --pdf_path "path/to/pdf"
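For orientation, the command-line interface could be declared roughly as below. This is a sketch only: it assumes main.py uses `argparse`. The flag names and defaults mirror the parameters documented in this README, while the types, the helper names, and the choice of which flags are required are assumptions.

```python
import argparse
from typing import List


def str_to_bool(value: str) -> bool:
    """Parse 'true'/'false' style strings, e.g. for the --embed flag."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented parameters as command-line flags."""
    parser = argparse.ArgumentParser(
        description="Answer questions about a PDF and post the answers to Slack.")
    parser.add_argument("--pdf_path", required=True, help="path to the PDF document")
    parser.add_argument("--questions", required=True, help="comma-separated questions")
    # Assumed optional here, since they may also come from a configuration file.
    parser.add_argument("--api_key")
    parser.add_argument("--slack_token")
    parser.add_argument("--slack_channel")
    parser.add_argument("--model", default="gpt-4o-mini")
    parser.add_argument("--embed", type=str_to_bool, default=True)
    parser.add_argument("--embed_model", default="text-embedding-3-small")
    parser.add_argument("--chunk_size", type=int, default=500)
    parser.add_argument("--chunk_overlap", type=int, default=100)
    parser.add_argument("--confidence_threshold", type=float, default=-1.5)
    return parser


def parse_questions(raw: str) -> List[str]:
    """Split the comma-separated --questions value into individual questions."""
    return [q.strip() for q in raw.split(",") if q.strip()]


args = build_parser().parse_args(
    ["--questions", "What is the title?, Who is the author?", "--pdf_path", "report.pdf"])
print(parse_questions(args.questions))  # ['What is the title?', 'Who is the author?']
print(args.model, args.chunk_size)      # gpt-4o-mini 500
```

Note that splitting on commas means a question that itself contains a comma would be broken into two.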
Logs are written to both the console and a log file. Set the logging level in main.py according to whether you are debugging or monitoring.
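A setup that logs to both destinations can be sketched with the standard `logging` module. The logger name, log file name, and message format below are hypothetical; the actual values in main.py may differ.

```python
import logging
import sys


def configure_logging(log_file: str = "pdf_qa.log",
                      level: int = logging.INFO) -> logging.Logger:
    """Send log records to both the console and a file."""
    logger = logging.getLogger("pdf_qa")
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        for handler in (logging.StreamHandler(sys.stderr), logging.FileHandler(log_file)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


logger = configure_logging(level=logging.DEBUG)  # use logging.INFO or higher for monitoring
logger.debug("Loaded PDF and split it into chunks")
```

Passing `logging.DEBUG` shows every message while troubleshooting, whereas `logging.WARNING` keeps routine runs quiet.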