LLM Document Analyzer Pipeline

A modular Python data pipeline that ingests unstructured text from URLs and files (PDF, TXT), processes it, and extracts structured insights using Google's Gemini LLM.

This project avoids heavy orchestration frameworks such as LangChain. It calls the Gemini API directly and retries failed requests with exponential backoff.
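To make the exponential backoff idea concrete, here is a minimal standard-library sketch. It is not the project's code: the project relies on the tenacity library for retries, and the function name `retry_with_backoff` and its parameters are hypothetical, chosen only to show the mechanism.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (capped), then retry.

    Hypothetical helper for illustration; `sleep` is injectable for testing.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so parallel clients do not
            # retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits between attempts are roughly 1, 2, 4, and 8 seconds. A real pipeline would usually retry only on specific errors (for example, rate-limit responses) rather than on every exception.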

Features

Multi-source Ingestion: Parses local .txt and .pdf files (using PyMuPDF) as well as web URLs (using BeautifulSoup).
Automated Chunking: Slices long documents into manageable chunks to respect LLM context limits.
Resilient API Calls: Uses the tenacity library to automatically retry failed LLM API requests (e.g., rate limits) with exponential backoff.
Structured JSON Output: Constrains the LLM to return output that conforms to a strict JSON schema.
Data Export: Saves extracted data (summaries, entities, sentiment, questions) to JSON and CSV, and generates a plain-text summary report.
Graceful Error Handling: Skips broken URLs and missing files without crashing the entire pipeline.
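The chunking step listed above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it splits by character count with a small overlap, while the actual pipeline may split by tokens, sentences, or some other unit. The name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk can then be sent to the LLM separately and the per-chunk results merged afterwards.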

Setup & Installation

Clone this repository.

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate
```

Install dependencies:
```
pip install -r requirements.txt
```
Copy .env.example to .env and add your Gemini API key:
```
GEMINI_API_KEY="your_api_key_here"
```
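As a rough sketch of how the key might be read at startup, the snippet below looks up `GEMINI_API_KEY` and fails early with a clear message if it is missing or still set to the placeholder. It assumes the variable is already present in the process environment; how the project loads the .env file is not shown here (a package such as python-dotenv is a common choice). The function name `get_api_key` is hypothetical.

```python
import os


def get_api_key(env=os.environ):
    """Return the Gemini API key, or raise a clear error if it is missing.

    Hypothetical helper; `env` is any mapping, os.environ by default.
    """
    # Tolerate surrounding whitespace and quotes copied from the .env file.
    key = env.get("GEMINI_API_KEY", "").strip().strip('"')
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "GEMINI_API_KEY is not set; copy .env.example to .env and fill it in"
        )
    return key
```

Failing at startup avoids running ingestion and chunking only to hit an authentication error on the first API call.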

Usage

Run the main orchestrator script:

```
python main.py
```

The script will iterate over the predefined inputs and generate:

output.json: The raw structured JSON from the LLM.
output.csv: The same data formatted in CSV for easy viewing.
summary_report.txt: A clean, human-readable summary of all processed documents.
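The three output files could be produced along these lines. This is a sketch only: the record keys (`source`, `summary`, `entities`, `sentiment`, `questions`) are assumed from the data types named in the feature list, and the function `export_results` is hypothetical rather than taken from report.py.

```python
import csv
import json
from pathlib import Path


def export_results(results, out_dir="."):
    """Write a list of result dicts to JSON, CSV, and a plain-text report."""
    out = Path(out_dir)

    # JSON keeps nested lists (entities, questions) as they are.
    (out / "output.json").write_text(json.dumps(results, indent=2),
                                     encoding="utf-8")

    # CSV is flat, so list fields are joined into a single cell.
    fields = ["source", "summary", "entities", "sentiment", "questions"]
    with open(out / "output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for r in results:
            row = dict(r)
            row["entities"] = "; ".join(r["entities"])
            row["questions"] = "; ".join(r["questions"])
            writer.writerow(row)

    # The report lists one short block per processed document.
    lines = []
    for r in results:
        lines += [f"Source: {r['source']}",
                  f"Sentiment: {r['sentiment']}",
                  f"Summary: {r['summary']}",
                  ""]
    (out / "summary_report.txt").write_text("\n".join(lines), encoding="utf-8")
```

Joining list fields with "; " keeps each document on one CSV row, at the cost of losing the list structure, which remains available in output.json.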
