Skip to content

Cherry28831/Invoice-Data-Extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

25 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🧾 Invoice Data Extractor (v2.0.0)

GitHub release (latest by date) MIT License Platform AI Engine

A desktop-based AI invoice processing engine that extracts structured, item-level invoice data from PDF and JPG invoices using a hybrid OCR + LLM pipeline, and exports it directly to Excel.

βœ” Tested on 1000+ real-world invoices βœ” Achieved ~91% field-level accuracy βœ” Packaged as a single plug-and-play .exe


🎯 Problem This Solves

Invoice data extraction is messy because:

  • PDFs vary wildly in layout
  • Many PDFs are scanned images
  • Tables break traditional parsers
  • OCR alone is not enough
  • Rule-based systems don’t scale

This tool solves that by combining:

  • Deterministic OCR & preprocessing
  • LLM-based semantic understanding
  • Structured output enforcement
  • Excel-ready data flattening

✨ What’s New in v2.0.0

This release is a major backend upgrade.

πŸ”₯ Major Improvements

  • Groq API integration

    • Replaced Gemini with Groq (LLaMA-3.1-8B-Instant)
    • Faster inference, lower latency, better consistency
  • Advanced OCR Pipeline

    • OpenCV preprocessing (denoise, CLAHE, thresholding)
    • High-DPI PDF to image conversion
    • Custom Tesseract OCR configuration
  • Robust PDF Handling

    • pdfplumber text extraction first
    • Automatic OCR fallback if text is missing
  • Flattened Excel Output

    • Each invoice item becomes one row
    • Invoice-level fields repeated per item
    • Suitable for accounting & analytics
  • Append to Existing Excel

    • New invoices can be appended to existing files
    • No overwriting, no manual merges

🧠 System Architecture

PDF / JPG
   β”‚
   β”œβ”€β–Ί pdfplumber (text PDFs)
   β”‚
   └─► OpenCV + Tesseract (scanned PDFs / images)
            β”‚
            β–Ό
     Cleaned invoice text
            β”‚
            β–Ό
     Groq LLM (LLaMA-3.1-8B)
            β”‚
            β–Ό
     Structured JSON (validated)
            β”‚
            β–Ό
     Flattened rows
            β”‚
            β–Ό
     Excel (.xlsx)

🧠 How the Extraction Works (Step-by-Step)

1️⃣ PDF / Image Ingestion

  • PDFs are first parsed using pdfplumber
  • If no usable text is found β†’ OCR pipeline is triggered
  • JPG files always go through OCR

2️⃣ OCR Enhancement (Key to Accuracy)

Each page/image undergoes:

  • Grayscale conversion
  • Noise removal
  • CLAHE contrast enhancement
  • Adaptive thresholding
  • Custom Tesseract config for invoice symbols & numbers

This significantly improves OCR quality on:

  • Scanned invoices
  • Low-contrast documents
  • Poorly printed bills

3️⃣ LLM-Based Data Extraction (Groq)

Extracted text is sent to Groq with a strict, structured prompt that enforces:

  • Exact text extraction (no hallucination)

  • Fixed fields:

    • Company name
    • Invoice number
    • Invoice date
    • FSSAI number
    • Item-level details
  • JSON-only response

  • Clean numeric values

  • Standardized date format

Model used: llama-3.1-8b-instant


4️⃣ Response Cleaning & Validation

  • Markdown fences removed
  • JSON boundaries detected manually
  • Invalid responses discarded
  • Safe fallback handling for malformed outputs

5️⃣ Data Flattening Logic

  • Each invoice item becomes one Excel row
  • Invoice-level fields are repeated per item
  • If no items are found β†’ invoice-level row is still created

This makes the output:

  • Pivot-friendly
  • BI-ready
  • Accounting-friendly

6️⃣ Excel Export (Append-Safe)

  • If Excel file exists β†’ data is appended
  • Column mismatches are auto-handled
  • Weights are converted to kilograms
  • File is saved using openpyxl

πŸ“Š Accuracy & Testing

  • βœ… Tested on 1000+ real invoices

  • πŸ“ˆ Achieved ~91% accuracy

  • Includes:

    • Printed PDFs
    • Scanned PDFs
    • Multi-item invoices
    • Inconsistent layouts

Accuracy measured on:

  • Invoice number
  • Date
  • Item description
  • Quantity
  • Amount
  • HSN codes (when present)

πŸ§ͺ Supported Input Formats

Format Method
PDF (text) pdfplumber
PDF (scanned) OCR fallback
JPG OCR

πŸ“€ Output

Excel File (.xlsx)

Each row represents one invoice item.

Columns include:

  • company_name
  • invoice_number
  • invoice_date
  • fssai_number
  • description
  • hsn_code
  • quantity
  • weight (kg)
  • rate
  • amount

πŸš€ Getting Started

πŸ”½ Download Executable

πŸ‘‰ Download v2.0.0

No Python. No setup. Just run.


πŸ”‘ Groq API Key (Required)

  1. Create a Groq account
  2. Generate an API key
  3. Paste it when prompted by the app

πŸ›  Usage Flow

  1. Launch the .exe
  2. Enter Groq API key
  3. Select one or multiple PDF/JPG invoices
  4. Choose output Excel file (new or existing)
  5. Extraction starts automatically
  6. Excel file is generated/appended

πŸ§ͺ Build From Source

git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor
pip install -r requirements.txt
cd backend
pyinstaller invoice-backend.spec
cd ..
npm start 
npm run dist

🌱 Future Roadmap

  • Vendor auto-classification
  • GST / tax breakup extraction
  • CSV & ERP exports
  • Multi-language invoices
  • Cloud + desktop hybrid mode

🀝 Contributions

This project is actively evolving. Contributions, optimizations, and feedback are welcome.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors