🧾 Invoice Data Extractor (v2.0.0)

A desktop-based AI invoice processing engine that extracts structured, item-level invoice data from PDF and JPG invoices using a hybrid OCR + LLM pipeline, and exports it directly to Excel.

✔ Tested on 1000+ real-world invoices ✔ Achieved ~91% field-level accuracy ✔ Packaged as a single plug-and-play .exe

🎯 Problem This Solves

Invoice data extraction is messy because:

PDFs vary wildly in layout
Many PDFs are scanned images
Tables break traditional parsers
OCR alone is not enough
Rule-based systems don’t scale

This tool solves that by combining:

Deterministic OCR & preprocessing
LLM-based semantic understanding
Structured output enforcement
Excel-ready data flattening

✨ What’s New in v2.0.0

This release is a major backend upgrade.

🔥 Major Improvements

Groq API integration
- Replaced Gemini with Groq (LLaMA-3.1-8B-Instant)
- Faster inference, lower latency, better consistency
Advanced OCR Pipeline
- OpenCV preprocessing (denoise, CLAHE, thresholding)
- High-DPI PDF to image conversion
- Custom Tesseract OCR configuration
Robust PDF Handling
- pdfplumber text extraction first
- Automatic OCR fallback if text is missing
Flattened Excel Output
- Each invoice item becomes one row
- Invoice-level fields repeated per item
- Suitable for accounting & analytics
Append to Existing Excel
- New invoices can be appended to existing files
- No overwriting, no manual merges

🧠 System Architecture

PDF / JPG
   │
   ├─► pdfplumber (text PDFs)
   │
   └─► OpenCV + Tesseract (scanned PDFs / images)
            │
            ▼
     Cleaned invoice text
            │
            ▼
     Groq LLM (LLaMA-3.1-8B)
            │
            ▼
     Structured JSON (validated)
            │
            ▼
     Flattened rows
            │
            ▼
     Excel (.xlsx)

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

PDFs are first parsed using pdfplumber
If no usable text is found → OCR pipeline is triggered
JPG files always go through OCR

2️⃣ OCR Enhancement (Key to Accuracy)

Each page/image undergoes:

Grayscale conversion
Noise removal
CLAHE contrast enhancement
Adaptive thresholding
Custom Tesseract config for invoice symbols & numbers

This significantly improves OCR quality on:

Scanned invoices
Low-contrast documents
Poorly printed bills

3️⃣ LLM-Based Data Extraction (Groq)

Extracted text is sent to Groq with a strict, structured prompt that enforces:

Exact text extraction (no hallucination)
Fixed fields:
- Company name
- Invoice number
- Invoice date
- FSSAI number
- Item-level details
JSON-only response
Clean numeric values
Standardized date format

Model used: llama-3.1-8b-instant

4️⃣ Response Cleaning & Validation

Markdown fences removed
JSON boundaries detected manually
Invalid responses discarded
Safe fallback handling for malformed outputs

5️⃣ Data Flattening Logic

Each invoice item becomes one Excel row
Invoice-level fields are repeated per item
If no items are found → invoice-level row is still created

This makes the output:

Pivot-friendly
BI-ready
Accounting-friendly

6️⃣ Excel Export (Append-Safe)

If Excel file exists → data is appended
Column mismatches are auto-handled
Weights are converted to kilograms
File is saved using openpyxl

📊 Accuracy & Testing

✅ Tested on 1000+ real invoices
📈 Achieved ~91% accuracy
Includes:
- Printed PDFs
- Scanned PDFs
- Multi-item invoices
- Inconsistent layouts

Accuracy measured on:

Invoice number
Date
Item description
Quantity
Amount
HSN codes (when present)

🧪 Supported Input Formats

Format	Method
PDF (text)	pdfplumber
PDF (scanned)	OCR fallback
JPG	OCR

📤 Output

Excel File (`.xlsx`)

Each row represents one invoice item.

Columns include:

company_name
invoice_number
invoice_date
fssai_number
description
hsn_code
quantity
weight (kg)
rate
amount

🚀 Getting Started

🔽 Download Executable

👉 Download v2.0.0

No Python. No setup. Just run.

🔑 Groq API Key (Required)

Create a Groq account
Generate an API key
Paste it when prompted by the app

🛠 Usage Flow

Launch the .exe
Enter Groq API key
Select one or multiple PDF/JPG invoices
Choose output Excel file (new or existing)
Extraction starts automatically
Excel file is generated/appended

🧪 Build From Source

git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor
pip install -r requirements.txt
cd backend
pyinstaller invoice-backend.spec
cd ..
npm start 
npm run dist

🌱 Future Roadmap

Vendor auto-classification
GST / tax breakup extraction
CSV & ERP exports
Multi-language invoices
Cloud + desktop hybrid mode

🤝 Contributions

This project is actively evolving. Contributions, optimizations, and feedback are welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
backend		backend
frontend		frontend
.gitignore		.gitignore
API Documentation.docx		API Documentation.docx
LICENSE		LICENSE
README.md		README.md
invoice-backend.spec		invoice-backend.spec
main.js		main.js
package-lock.json		package-lock.json
package.json		package.json
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧾 Invoice Data Extractor (v2.0.0)

🎯 Problem This Solves

✨ What’s New in v2.0.0

🔥 Major Improvements

🧠 System Architecture

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

2️⃣ OCR Enhancement (Key to Accuracy)

3️⃣ LLM-Based Data Extraction (Groq)

4️⃣ Response Cleaning & Validation

5️⃣ Data Flattening Logic

6️⃣ Excel Export (Append-Safe)

📊 Accuracy & Testing

🧪 Supported Input Formats

📤 Output

Excel File (`.xlsx`)

🚀 Getting Started

🔽 Download Executable

🔑 Groq API Key (Required)

🛠 Usage Flow

🧪 Build From Source

🌱 Future Roadmap

🤝 Contributions

About

Uh oh!

Releases 3

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🧾 Invoice Data Extractor (v2.0.0)

🎯 Problem This Solves

✨ What’s New in v2.0.0

🔥 Major Improvements

🧠 System Architecture

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

2️⃣ OCR Enhancement (Key to Accuracy)

3️⃣ LLM-Based Data Extraction (Groq)

4️⃣ Response Cleaning & Validation

5️⃣ Data Flattening Logic

6️⃣ Excel Export (Append-Safe)

📊 Accuracy & Testing

🧪 Supported Input Formats

📤 Output

Excel File (.xlsx)

🚀 Getting Started

🔽 Download Executable

🔑 Groq API Key (Required)

🛠 Usage Flow

🧪 Build From Source

🌱 Future Roadmap

🤝 Contributions

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Excel File (`.xlsx`)

Packages