A desktop-based AI invoice processing engine that extracts structured, item-level invoice data from PDF and JPG invoices using a hybrid OCR + LLM pipeline, and exports it directly to Excel.
β Tested on 1000+ real-world invoices
β Achieved ~91% field-level accuracy
β Packaged as a single plug-and-play .exe
Invoice data extraction is messy because:
- PDFs vary wildly in layout
- Many PDFs are scanned images
- Tables break traditional parsers
- OCR alone is not enough
- Rule-based systems donβt scale
This tool solves that by combining:
- Deterministic OCR & preprocessing
- LLM-based semantic understanding
- Structured output enforcement
- Excel-ready data flattening
This release is a major backend upgrade.
-
Groq API integration
- Replaced Gemini with Groq (LLaMA-3.1-8B-Instant)
- Faster inference, lower latency, better consistency
-
Advanced OCR Pipeline
- OpenCV preprocessing (denoise, CLAHE, thresholding)
- High-DPI PDF to image conversion
- Custom Tesseract OCR configuration
-
Robust PDF Handling
pdfplumbertext extraction first- Automatic OCR fallback if text is missing
-
Flattened Excel Output
- Each invoice item becomes one row
- Invoice-level fields repeated per item
- Suitable for accounting & analytics
-
Append to Existing Excel
- New invoices can be appended to existing files
- No overwriting, no manual merges
PDF / JPG
β
βββΊ pdfplumber (text PDFs)
β
βββΊ OpenCV + Tesseract (scanned PDFs / images)
β
βΌ
Cleaned invoice text
β
βΌ
Groq LLM (LLaMA-3.1-8B)
β
βΌ
Structured JSON (validated)
β
βΌ
Flattened rows
β
βΌ
Excel (.xlsx)
- PDFs are first parsed using
pdfplumber - If no usable text is found β OCR pipeline is triggered
- JPG files always go through OCR
Each page/image undergoes:
- Grayscale conversion
- Noise removal
- CLAHE contrast enhancement
- Adaptive thresholding
- Custom Tesseract config for invoice symbols & numbers
This significantly improves OCR quality on:
- Scanned invoices
- Low-contrast documents
- Poorly printed bills
Extracted text is sent to Groq with a strict, structured prompt that enforces:
-
Exact text extraction (no hallucination)
-
Fixed fields:
- Company name
- Invoice number
- Invoice date
- FSSAI number
- Item-level details
-
JSON-only response
-
Clean numeric values
-
Standardized date format
Model used:
llama-3.1-8b-instant
- Markdown fences removed
- JSON boundaries detected manually
- Invalid responses discarded
- Safe fallback handling for malformed outputs
- Each invoice item becomes one Excel row
- Invoice-level fields are repeated per item
- If no items are found β invoice-level row is still created
This makes the output:
- Pivot-friendly
- BI-ready
- Accounting-friendly
- If Excel file exists β data is appended
- Column mismatches are auto-handled
- Weights are converted to kilograms
- File is saved using
openpyxl
-
β Tested on 1000+ real invoices
-
π Achieved ~91% accuracy
-
Includes:
- Printed PDFs
- Scanned PDFs
- Multi-item invoices
- Inconsistent layouts
Accuracy measured on:
- Invoice number
- Date
- Item description
- Quantity
- Amount
- HSN codes (when present)
| Format | Method |
|---|---|
| PDF (text) | pdfplumber |
| PDF (scanned) | OCR fallback |
| JPG | OCR |
Each row represents one invoice item.
Columns include:
- company_name
- invoice_number
- invoice_date
- fssai_number
- description
- hsn_code
- quantity
- weight (kg)
- rate
- amount
π Download v2.0.0
No Python. No setup. Just run.
- Create a Groq account
- Generate an API key
- Paste it when prompted by the app
- Launch the
.exe - Enter Groq API key
- Select one or multiple PDF/JPG invoices
- Choose output Excel file (new or existing)
- Extraction starts automatically
- Excel file is generated/appended
git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor
pip install -r requirements.txt
cd backend
pyinstaller invoice-backend.spec
cd ..
npm start
npm run dist- Vendor auto-classification
- GST / tax breakup extraction
- CSV & ERP exports
- Multi-language invoices
- Cloud + desktop hybrid mode
This project is actively evolving. Contributions, optimizations, and feedback are welcome.