🧾 Invoice Data Extractor (v2.0.0)

A desktop-based AI invoice processing engine that extracts structured, item-level invoice data from PDF and JPG invoices using a hybrid OCR + LLM pipeline, and exports it directly to Excel.

✔ Tested on 1000+ real-world invoices ✔ Achieved ~91% field-level accuracy ✔ Packaged as a single plug-and-play .exe

🎯 Problem This Solves

Invoice data extraction is messy because:

PDFs vary wildly in layout
Many PDFs are scanned images
Tables break traditional parsers
OCR alone is not enough
Rule-based systems don’t scale

This tool solves that by combining:

Deterministic OCR & preprocessing
LLM-based semantic understanding
Structured output enforcement
Excel-ready data flattening

✨ What’s New in v2.0.0

This release is a major backend upgrade.

🔥 Major Improvements

Groq API integration
- Replaced Gemini with Groq (LLaMA-3.1-8B-Instant)
- Faster inference, lower latency, better consistency
Advanced OCR Pipeline
- OpenCV preprocessing (denoise, CLAHE, thresholding)
- High-DPI PDF to image conversion
- Custom Tesseract OCR configuration
Robust PDF Handling
- pdfplumber text extraction first
- Automatic OCR fallback if text is missing
Flattened Excel Output
- Each invoice item becomes one row
- Invoice-level fields repeated per item
- Suitable for accounting & analytics
Append to Existing Excel
- New invoices can be appended to existing files
- No overwriting, no manual merges

🧠 System Architecture

PDF / JPG
   │
   ├─► pdfplumber (text PDFs)
   │
   └─► OpenCV + Tesseract (scanned PDFs / images)
            │
            ▼
     Cleaned invoice text
            │
            ▼
     Groq LLM (LLaMA-3.1-8B)
            │
            ▼
     Structured JSON (validated)
            │
            ▼
     Flattened rows
            │
            ▼
     Excel (.xlsx)

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

PDFs are first parsed using pdfplumber
If no usable text is found → OCR pipeline is triggered
JPG files always go through OCR

2️⃣ OCR Enhancement (Key to Accuracy)

Each page/image undergoes:

Grayscale conversion
Noise removal
CLAHE contrast enhancement
Adaptive thresholding
Custom Tesseract config for invoice symbols & numbers

This significantly improves OCR quality on:

Scanned invoices
Low-contrast documents
Poorly printed bills

3️⃣ LLM-Based Data Extraction (Groq)

Extracted text is sent to Groq with a strict, structured prompt that enforces:

Exact text extraction (no hallucination)
Fixed fields:
- Company name
- Invoice number
- Invoice date
- FSSAI number
- Item-level details
JSON-only response
Clean numeric values
Standardized date format

Model used: llama-3.1-8b-instant

4️⃣ Response Cleaning & Validation

Markdown fences removed
JSON boundaries detected manually
Invalid responses discarded
Safe fallback handling for malformed outputs

5️⃣ Data Flattening Logic

Each invoice item becomes one Excel row
Invoice-level fields are repeated per item
If no items are found → invoice-level row is still created

This makes the output:

Pivot-friendly
BI-ready
Accounting-friendly

6️⃣ Excel Export (Append-Safe)

If Excel file exists → data is appended
Column mismatches are auto-handled
Weights are converted to kilograms
File is saved using openpyxl

📊 Accuracy & Testing

✅ Tested on 1000+ real invoices
📈 Achieved ~91% accuracy
Includes:
- Printed PDFs
- Scanned PDFs
- Multi-item invoices
- Inconsistent layouts

Accuracy measured on:

Invoice number
Date
Item description
Quantity
Amount
HSN codes (when present)

🧪 Supported Input Formats

Format	Method
PDF (text)	pdfplumber
PDF (scanned)	OCR fallback
JPG	OCR

📤 Output

Excel File (`.xlsx`)

Each row represents one invoice item.

Columns include:

company_name
invoice_number
invoice_date
fssai_number
description
hsn_code
quantity
weight (kg)
rate
amount

🚀 Getting Started

🔽 Download Executable

👉 Download v2.0.0

No Python. No setup. Just run.

🔑 Groq API Key (Required)

Create a Groq account
Generate an API key
Paste it when prompted by the app

🛠 Usage Flow

Launch the .exe
Enter Groq API key
Select one or multiple PDF/JPG invoices
Choose output Excel file (new or existing)
Extraction starts automatically
Excel file is generated/appended

🧪 Build From Source

git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor
pip install -r requirements.txt
cd backend
pyinstaller invoice-backend.spec
cd ..
npm start 
npm run dist

🌱 Future Roadmap

Vendor auto-classification
GST / tax breakup extraction
CSV & ERP exports
Multi-language invoices
Cloud + desktop hybrid mode

🤝 Contributions

This project is actively evolving. Contributions, optimizations, and feedback are welcome.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🧾 Invoice Data Extractor (v2.0.0)

🎯 Problem This Solves

✨ What’s New in v2.0.0

🔥 Major Improvements

🧠 System Architecture

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

2️⃣ OCR Enhancement (Key to Accuracy)

3️⃣ LLM-Based Data Extraction (Groq)

4️⃣ Response Cleaning & Validation

5️⃣ Data Flattening Logic

6️⃣ Excel Export (Append-Safe)

📊 Accuracy & Testing

🧪 Supported Input Formats

📤 Output

Excel File (`.xlsx`)

🚀 Getting Started

🔽 Download Executable

🔑 Groq API Key (Required)

🛠 Usage Flow

🧪 Build From Source

🌱 Future Roadmap

🤝 Contributions

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

🧾 Invoice Data Extractor (v2.0.0)

🎯 Problem This Solves

✨ What’s New in v2.0.0

🔥 Major Improvements

🧠 System Architecture

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

2️⃣ OCR Enhancement (Key to Accuracy)

3️⃣ LLM-Based Data Extraction (Groq)

4️⃣ Response Cleaning & Validation

5️⃣ Data Flattening Logic

6️⃣ Excel Export (Append-Safe)

📊 Accuracy & Testing

🧪 Supported Input Formats

📤 Output

Excel File (.xlsx)

🚀 Getting Started

🔽 Download Executable

🔑 Groq API Key (Required)

🛠 Usage Flow

🧪 Build From Source

🌱 Future Roadmap

🤝 Contributions

Excel File (`.xlsx`)