🚀 Advanced Invoice Extraction System

A production-ready, multimodal invoice extraction system. It combines multi-engine OCR, layout and graph analysis, and LLM integration to extract structured data from invoices.

📋 Table of Contents

Features
Architecture
Quick Start
Installation
Usage
Configuration
Project Structure
Extracted Fields
Advanced Features
Output Formats
Performance
Troubleshooting
Building from Source
Contributing
License
Acknowledgments
Support
Roadmap
Statistics

✨ Features

Core Capabilities

✅ Multi-format Support: PDF, JPEG, PNG, TIFF
✅ Quality Assessment: Automatic quality detection and adaptive preprocessing
✅ Multi-Engine OCR: Tesseract, DocTR, TrOCR with intelligent routing
✅ Layout Analysis: Zone segmentation, table detection, reading order
✅ Document Graph: Graph Neural Networks for structural reasoning
✅ Multimodal Fusion: Visual + Text + Layout + Graph features
✅ Hybrid Extraction: LLM (Gemini) + Rule-based for best results
✅ Multi-Layer Validation: Arithmetic, format, consistency, plausibility
✅ Multiple Export Formats: Excel, CSV, JSON, PDF reports

Advanced Features

🔥 Adaptive preprocessing based on image quality
🔥 Ensemble OCR with confidence scoring
🔥 Attention-based multimodal fusion
🔥 Graph Neural Network reasoning
🔥 Automatic field detection and entity classification
🔥 Cross-validation between extraction methods
🔥 Comprehensive confidence scoring

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Document Input                           │
│              (PDF, JPEG, PNG, TIFF)                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  INGESTION: Format handling, Quality assessment             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  PREPROCESSING: Adaptive enhancement (denoise, skew, etc)   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  OCR: Multi-engine routing (Tesseract/DocTR/TrOCR)          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYOUT ANALYSIS: Zones, Tables, Reading order              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  GRAPH: Document graph + GNN reasoning                      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  MULTIMODAL: Feature fusion (Visual+Text+Layout+Graph)      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  EXTRACTION: Hybrid LLM + Rule-based                        │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  VALIDATION: Arithmetic, Format, Consistency, Plausibility  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  EXPORT: Excel, CSV, JSON, PDF                              │
└─────────────────────────────────────────────────────────────┘
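The diagram above describes a linear pipeline in which each stage enriches a shared result. The following is a minimal sketch of that idea, not the project's actual orchestrator: `run_stages`, `ingest`, and `ocr` are hypothetical names, and the placeholder stages return hard-coded values purely for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# A stage takes the current context and returns an updated context.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_stages(document: Dict[str, Any],
               stages: List[Tuple[str, Stage]]) -> Dict[str, Any]:
    """Pass a context dict through each stage in order and record what ran."""
    context = dict(document)
    completed: List[str] = []
    for name, stage in stages:
        context = stage(context)
        completed.append(name)
    context["completed_stages"] = completed
    return context


# Hypothetical placeholder stages; real stages would do the actual work.
def ingest(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {**ctx, "pages": 1}


def ocr(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {**ctx, "text": "Invoice No: INV-001"}


result = run_stages({"path": "invoice.pdf"},
                    [("ingestion", ingest), ("ocr", ocr)])
```

Keeping each stage as a plain function of the context makes it straightforward to skip optional stages (for example graph or multimodal processing) by leaving them out of the list.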

🚀 Quick Start

Download Executable

⬇️ Download Invoice Extractor v1.0.1

Get Your API Key

This application requires a Google Gemini API key (free tier available):

📄 Read API Key Setup Guide

Run the Application

Download and extract the executable
Run invoice-extractor.exe
Enter your Google Gemini API key
Select invoice files (PDF, JPG, PNG, TIFF)
Choose output folder
Click "Process PDFs"
Get structured data in Excel format!

📥 Installation

Prerequisites

System Requirements:

Windows 10/11 (64-bit)
4GB RAM minimum (8GB recommended)
2GB free disk space
Internet connection (for LLM API)

Required Software:

Python 3.8+ (for development)
Tesseract OCR
Poppler (for PDF processing)

Install Tesseract OCR

Windows:

# Download installer from:
https://github.com/UB-Mannheim/tesseract/wiki

# Add to PATH:
C:\Program Files\Tesseract-OCR

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

Install Poppler

Windows:

# Download from:
https://github.com/oschwartz10612/poppler-windows

# Add bin folder to PATH

Ubuntu/Debian:

sudo apt-get install poppler-utils

macOS:

brew install poppler

Install Python Dependencies

# Clone repository
git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor

# Install basic dependencies
pip install -r requirements.txt

Optional Dependencies

# For DocTR (Deep Learning OCR)
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "python-doctr[torch]"

# For TrOCR (Transformer OCR)
pip install transformers torch torchvision

# For Advanced Features (GNN, Multimodal)
pip install torch-geometric sentence-transformers

# For Development
pip install matplotlib seaborn jupyter

💻 Usage

Desktop Application (GUI)

Launch the application
Enter your Gemini API key
Click "Upload PDF Files" and select invoices
Click "Browse" to choose output folder
Enter output filename (default: invoice_data.xlsx)
Click "Process PDFs"
Wait for processing to complete
Find extracted data in the specified output folder

Python API (Basic)

from backend.backend import InvoiceExtractionPipeline

# Initialize pipeline
pipeline = InvoiceExtractionPipeline(
    api_key="your-gemini-api-key",
    enable_advanced_features=False,
    use_gpu=False
)

# Process single document
result = pipeline.process_document(
    document_path="invoice.pdf",
    output_folder="./output",
    filename="invoice_data.xlsx"
)

# Check results
if result['success']:
    print(f"Extracted {len(result['extracted_data'])} items")
    print(f"Confidence: {result['confidence']['overall_confidence']:.1%}")
    print(f"Output: {result['output_path']}")

Python API (Batch Processing)

# Process multiple documents
# (reuses the `pipeline` object created in the basic example)
documents = [
    "invoices/invoice1.pdf",
    "invoices/invoice2.pdf",
    "invoices/invoice3.jpg"
]

result = pipeline.process_multiple_documents(
    document_paths=documents,
    output_folder="./output",
    filename="combined_invoices.xlsx"
)

print(f"Processed: {result['successful']}/{result['total_documents']}")
print(f"Total items: {result['total_items']}")
print(f"Failed: {len(result['failed_documents'])}")

Python API (Advanced Features)

# Enable all advanced features
pipeline = InvoiceExtractionPipeline(
    api_key="your-api-key",
    enable_advanced_features=True,
    use_gpu=True,
    config={
        'ocr': {
            'ensemble_enabled': True,
            'engines': ['tesseract', 'doctr']
        },
        'validation': {
            'arithmetic': True,
            'plausibility': True
        }
    }
)

result = pipeline.process_document("invoice.pdf", "./output")

⚙️ Configuration

Configuration Files

The system uses YAML configuration files located in the config/ directory:

default_config.yaml: Main configuration
ocr_engines.yaml: OCR engine settings
extraction_patterns.yaml: Regex patterns for extraction
validation_rules.yaml: Validation rules and thresholds

Key Configuration Options

OCR Settings:

ocr:
  default_engine: "tesseract"
  ensemble_enabled: true
  confidence_threshold: 0.7

Preprocessing:

preprocessing:
  enabled: true
  operations:
    denoise: true
    enhance_contrast: true
    deskew: true

Validation:

validation:
  enabled: true
  checks:
    arithmetic: true
    format: true
    consistency: true
  arithmetic:
    tolerance: 0.01
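To illustrate what an arithmetic check with a tolerance might look like, here is a minimal sketch that compares quantity × rate against the stated amount. It assumes the tolerance is relative (the Custom Configuration example describes 0.05 as "5% tolerance"); the function name `amount_matches` is hypothetical and is not taken from the project's validation module.

```python
import math


def amount_matches(quantity: float, rate: float, amount: float,
                   tolerance: float = 0.01) -> bool:
    """Return True when quantity * rate agrees with amount within a relative tolerance."""
    expected = quantity * rate
    # rel_tol is applied relative to the larger of the two magnitudes.
    return math.isclose(expected, amount, rel_tol=tolerance)
```

With the default of 0.01, a line item of 2 units at 50.00 passes for a stated amount of 100.50 but fails for 110.00.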

Export:

export:
  formats:
    - "excel"
    - "json"
  excel:
    include_confidence: true
    include_metadata: true

Custom Configuration

from backend.backend import InvoiceExtractionPipeline

custom_config = {
    'preprocessing': {
        'target_dpi': 400,
        'denoise': True
    },
    'validation': {
        'arithmetic': {
            'tolerance': 0.05  # 5% tolerance
        }
    }
}

pipeline = InvoiceExtractionPipeline(
    api_key="your-key",
    config=custom_config
)
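A partial dictionary like `custom_config` only makes sense if it is layered over the defaults rather than replacing them. How the pipeline actually merges the two is not documented here, so the following is only a sketch of one common approach, a recursive merge; `deep_merge` is a hypothetical helper, and the `defaults` dictionary is a trimmed stand-in for default_config.yaml.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested override values replace matching defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the default.
            merged[key] = deepcopy(value)
    return merged


defaults = {"validation": {"enabled": True, "arithmetic": {"tolerance": 0.01}}}
overrides = {"validation": {"arithmetic": {"tolerance": 0.05}}}
effective = deep_merge(defaults, overrides)
```

Here `effective` keeps `validation.enabled` from the defaults while taking the overridden tolerance, and `defaults` itself is left unchanged.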

📁 Project Structure

invoice-extractor/
│
├── backend/                      # Backend processing
│   ├── backend.py               # Main pipeline orchestrator
│   ├── ingestion/               # Document loading & quality assessment
│   ├── preprocessing/           # Image enhancement
│   ├── ocr/                     # OCR engines & routing
│   ├── layout_analysis/         # Document structure analysis
│   ├── graph/                   # Document graph & GNN
│   ├── multimodal/              # Feature fusion
│   ├── extraction/              # Data extraction
│   ├── validation/              # Data validation
│   ├── export/                  # Output generation
│   ├── utils/                   # Utilities
│   └── models/                  # Model weights & configs
│
├── frontend/                    # Electron UI
│   ├── index.html              # Main UI
│   ├── renderer.js             # UI logic
│   └── styles.css              # Styling
│
├── config/                      # Configuration files
│   ├── default_config.yaml
│   ├── ocr_engines.yaml
│   ├── extraction_patterns.yaml
│   └── validation_rules.yaml
│
├── main.js                      # Electron main process
├── requirements.txt             # Python dependencies
├── README.md                    # This file
└── LICENSE                      # MIT License

🎯 Extracted Fields

The system extracts the following fields from invoices:

Invoice Details

Invoice Number
Invoice Date (DD/MM/YYYY format)
Due Date (if available)

Line Items

Goods Description
HSN/SAC Code (4-8 digits)
Quantity
Weight (with automatic unit conversion to kg)
Rate (per unit)
Amount
Tax Rate
Tax Amount

Financial Summary

Subtotal
Total Tax (CGST, SGST, IGST)
Discount (if applicable)
Grand Total
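Several of the fields above carry format constraints: DD/MM/YYYY dates, 4-8 digit HSN/SAC codes, and weights normalised to kg. The sketch below shows how such checks could be written. The helper names and the set of supported weight units are assumptions for illustration and are not taken from the project's code.

```python
import re
from datetime import date, datetime

# Conversion factors to kilograms. The list of units is an assumption.
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "t": 1000.0, "lb": 0.45359237}


def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms; raises KeyError for an unknown unit."""
    return value * KG_PER_UNIT[unit.strip().lower()]


def is_valid_hsn(code: str) -> bool:
    """Accept HSN/SAC codes made of 4 to 8 ASCII digits."""
    return re.fullmatch(r"[0-9]{4,8}", code.strip()) is not None


def parse_invoice_date(text: str) -> date:
    """Parse a DD/MM/YYYY date; raises ValueError if the text does not match."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()
```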

🚀 Advanced Features

1. Adaptive Preprocessing

Automatically adjusts preprocessing based on document quality:

# Quality assessment triggers different preprocessing
# High quality (>0.8): Minimal preprocessing
# Medium quality (0.5-0.8): Standard enhancement
# Low quality (<0.5): Aggressive enhancement + TrOCR
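The thresholds above can be expressed as a small routing function. This is a sketch only: `choose_preprocessing` and the returned labels are hypothetical, and it assumes the quality score is normalised to the range 0 to 1, with a score of exactly 0.8 falling in the medium band.

```python
def choose_preprocessing(quality_score: float) -> str:
    """Map a quality score in [0, 1] to a preprocessing level."""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be between 0 and 1")
    if quality_score > 0.8:
        return "minimal"
    if quality_score >= 0.5:
        return "standard"
    # Low-quality scans get the heaviest enhancement (and, per the notes
    # above, would be routed to TrOCR).
    return "aggressive"
```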

2. Ensemble OCR

Combines multiple OCR engines for improved accuracy:

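from backend.backend import InvoiceExtractionPipeline

# Combine Tesseract and DocTR results using weighted voting
pipeline = InvoiceExtractionPipeline(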
    api_key="your-key",
    config={
        'ocr': {
            'ensemble_enabled': True,
            'engines': ['tesseract', 'doctr'],
            'voting_strategy': 'weighted'
        }
    }
)
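What a 'weighted' voting strategy does internally is not specified here. One plausible reading, sketched below, is that each engine's reading of a text region is scored by engine weight times reported confidence, and the reading with the highest total wins. The function name, the input shape, and the default weight of 1.0 are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(candidates: List[Tuple[str, str, float]],
                  engine_weights: Dict[str, float]) -> Tuple[str, float]:
    """Pick the text with the highest weighted confidence.

    candidates: (engine_name, recognised_text, confidence) triples.
    Returns the winning text and its share of the total score.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores: Dict[str, float] = defaultdict(float)
    for engine, text, confidence in candidates:
        scores[text] += engine_weights.get(engine, 1.0) * confidence
    best_text = max(scores, key=scores.get)
    total = sum(scores.values())
    share = scores[best_text] / total if total > 0 else 0.0
    return best_text, share
```

When two engines agree on a reading, their scores add up, so an agreed reading can outvote a single engine that is individually more confident.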

3. Document Graph Reasoning

Uses Graph Neural Networks to understand document structure:

# Enable graph-based reasoning
config = {
    'graph': {
        'enabled': True,
        'gnn_reasoning': True
    }
}
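Before any GNN can reason over a page, the OCR output has to be turned into a graph. The sketch below shows one simple construction, assuming nodes are OCR text boxes and edges link each box to its nearest neighbours by centre distance. It covers only graph construction, not the GNN itself, and `build_knn_graph` is a hypothetical name rather than the project's API.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def center(box: Box) -> Tuple[float, float]:
    """Return the centre point of a bounding box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def build_knn_graph(boxes: List[Box], k: int = 2) -> Dict[int, List[int]]:
    """Link every box to its k nearest boxes, measured between centres."""
    centers = [center(b) for b in boxes]
    graph: Dict[int, List[int]] = {}
    for i, ci in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(ci, centers[j]),
        )
        graph[i] = others[:k]
    return graph
```

The resulting adjacency lists could then be converted to whatever edge format the chosen GNN library expects.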

4. Multimodal Fusion

Combines visual, textual, layout, and graph features:

# Enable multimodal processing
config = {
    'multimodal': {
        'enabled': True,
        'fusion': {
            'method': 'attention'
        }
    }
}
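As a rough illustration of what 'attention' fusion means, the sketch below scores each modality's feature vector against a query vector, turns the scores into weights with a softmax, and returns the weighted sum. It is written in plain Python for readability. The real system presumably uses learned projections in PyTorch, and every name here is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features: Dict[str, List[float]],
                   query: List[float]) -> Tuple[List[float], Dict[str, float]]:
    """Fuse per-modality vectors using scaled dot-product attention weights."""
    dim = len(query)
    if any(len(vec) != dim for vec in features.values()):
        raise ValueError("all feature vectors must match the query length")
    names = list(features)
    scale = math.sqrt(dim)
    scores = [sum(q * f for q, f in zip(query, features[n])) / scale for n in names]
    weights = softmax(scores)
    fused = [sum(w * features[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))
```

Modalities whose features align more closely with the query receive larger weights and therefore contribute more to the fused vector.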

5. Hybrid Extraction

Combines LLM with rule-based extraction:

# Use hybrid extraction
config = {
    'extraction': {
        'method': 'hybrid',
        'fallback_to_rules': True,
        'confidence_threshold': 0.8
    }
}
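The options above suggest a control flow where the LLM result is accepted when its confidence clears the threshold and a rule-based extractor takes over otherwise. The following sketch shows that flow for a single field. The regex, the function names, and the shape of the `llm_extract` callback are assumptions, not the project's actual interfaces or the patterns in extraction_patterns.yaml.

```python
import re
from typing import Callable, Optional, Tuple

# Illustrative pattern only; "Number" is listed before "No" so the longer
# label wins when both could match.
INVOICE_NUMBER_PATTERN = re.compile(
    r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)


def extract_with_rules(text: str) -> Optional[str]:
    """Rule-based extraction of the invoice number."""
    match = INVOICE_NUMBER_PATTERN.search(text)
    return match.group(1) if match else None


def hybrid_extract(text: str,
                   llm_extract: Callable[[str], Tuple[Optional[str], float]],
                   confidence_threshold: float = 0.8,
                   fallback_to_rules: bool = True) -> Tuple[Optional[str], str]:
    """Return (value, source), preferring the LLM when it is confident enough."""
    value, confidence = llm_extract(text)
    if value is not None and confidence >= confidence_threshold:
        return value, "llm"
    if fallback_to_rules:
        rule_value = extract_with_rules(text)
        if rule_value is not None:
            return rule_value, "rules"
    # Nothing better is available: surface the low-confidence LLM value.
    return value, "llm_low_confidence"
```

Returning the source alongside the value makes it possible to cross-check the two methods later or to report which one produced each field.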

📊 Output Formats

Excel Output

Multi-sheet workbook containing:

Invoice Data: Extracted line items
Validation Issues: Detected problems
Confidence Scores: Per-field confidence
Summary: Aggregate statistics

JSON Output

{
  "metadata": {
    "document_path": "invoice.pdf",
    "processed_at": "2025-10-16T12:00:00",
    "processing_time": 12.5
  },
  "extracted_data": [
    {
      "company_name": "ABC Company Ltd",
      "invoice_number": "INV-001",
      "invoice_date": "15/10/2025",
      "items": [...]
    }
  ],
  "validation": {
    "issues": [],
    "passed": true
  },
  "confidence": {
    "overall": 0.95,
    "fields": {...}
  }
}
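Downstream code will often only need a quick pass/fail summary from this report. The sketch below reads the `validation` and `confidence` keys shown in the sample above. `summarize_report` is a hypothetical helper, and the inline sample is a trimmed stand-in for a real output file.

```python
import json
from typing import Any, Dict


def summarize_report(report: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the validation outcome and overall confidence from a report dict."""
    validation = report.get("validation", {})
    confidence = report.get("confidence", {})
    return {
        "passed": bool(validation.get("passed", False)),
        "issue_count": len(validation.get("issues", [])),
        "overall_confidence": confidence.get("overall"),
    }


# Trimmed stand-in for a report file; real output contains more fields.
sample = json.loads(
    '{"validation": {"issues": [], "passed": true}, "confidence": {"overall": 0.95}}'
)
summary = summarize_report(sample)
```

To use it on a real file, load the file with `json.load` and pass the resulting dict to `summarize_report`.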

CSV Output

Simple tabular format for line items.

📈 Performance

Processing Speed

Basic Mode: 5-10 seconds per page
Advanced Mode: 15-30 seconds per page
Batch Processing: Parallel processing supported

Accuracy

High-quality scans: 95%+ accuracy
Medium-quality images: 85-95% accuracy
Low-quality/handwritten: 70-85% accuracy

Resource Usage

CPU: 2-4 cores recommended
RAM: 4GB minimum, 8GB recommended
GPU: Optional (improves speed for advanced features)
Disk: 2GB for models and cache

🐛 Troubleshooting

Common Issues

1. "Tesseract not found"

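# Windows: append Tesseract to the user PATH, then open a new terminal
# (setx does not affect the current session and truncates values over 1024 characters)
setx PATH "%PATH%;C:\Program Files\Tesseract-OCR"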

# Verify installation
tesseract --version

2. "Module not found" errors

# Reinstall dependencies
pip install --upgrade -r requirements.txt

3. "API key invalid"

Verify your Gemini API key at https://makersuite.google.com/app/apikey
Check for extra spaces or characters
Ensure key has proper permissions

4. Low extraction accuracy

Enable preprocessing in configuration
Try ensemble OCR mode
Increase image DPI (300+ recommended)
Enable advanced features

5. "Memory error" during processing

Reduce batch size
Process fewer files at once
Close other applications
Increase system RAM

6. Slow processing

Enable GPU acceleration (if available)
Disable advanced features
Use faster OCR engine (Tesseract)
Process files in smaller batches
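Several of the suggestions above come down to processing fewer files per call. A minimal sketch of splitting a file list into fixed-size batches follows. `chunked` is a hypothetical helper, and the commented usage assumes a `pipeline` object created as in the Usage section.

```python
from typing import Iterator, List, Sequence


def chunked(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Usage sketch (assumes `pipeline` exists as in the Usage section):
# for index, batch in enumerate(chunked(all_paths, 5), start=1):
#     pipeline.process_multiple_documents(
#         document_paths=batch,
#         output_folder="./output",
#         filename=f"invoices_batch_{index}.xlsx",
#     )
```

Writing each batch to its own workbook keeps memory use bounded and means a failure in one batch does not discard the results of the others.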

Debug Mode

Enable debug logging:

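from backend.backend import InvoiceExtractionPipeline

# Turn on verbose logging and keep intermediate artifacts for inspection
pipeline = InvoiceExtractionPipeline(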
    api_key="your-key",
    config={
        'app': {
            'debug': True,
            'log_level': 'DEBUG'
        },
        'debug': {
            'save_intermediate': True
        }
    }
)

Log Files

Check logs in logs/ directory:

app.log: General application logs
error.log: Error traces
processing.log: Processing details

🔨 Building from Source

Prerequisites

Node.js 16+
Python 3.8+
npm or yarn

Build Steps

# 1. Clone repository
git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Install Node dependencies
npm install

# 4. Build backend executable
cd backend
pyinstaller invoice-backend.spec
cd ..

# 5. Build Electron app
npm run build

# 6. Create installer (Windows)
npm run dist

# Output: dist/invoice-extractor-setup-1.0.1.exe

Build Configuration

Edit package.json for build settings:

{
  "build": {
    "appId": "com.invoice.extractor",
    "productName": "Invoice Extractor",
    "win": {
      "target": "nsis",
      "icon": "assets/icon.ico"
    }
  }
}

🤝 Contributing

We welcome contributions! Areas for improvement:

High Priority

Additional OCR engines
Support for more languages
Improved table extraction
Cloud deployment options

Medium Priority

UI/UX improvements
Additional export formats
Performance optimizations
Better error handling

Low Priority

Documentation improvements
Test coverage
Code refactoring

How to Contribute

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Cherry28831

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

🙏 Acknowledgments

This project builds upon excellent open-source tools:

Google Gemini: LLM for intelligent extraction
Tesseract OCR: Open-source OCR engine
PyTorch: Deep learning framework
Electron: Desktop application framework
OpenCV: Computer vision library
ReportLab: PDF generation

Special thanks to all contributors and the open-source community!

📞 Support

Get Help

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: Support available through GitHub

Resources

Documentation: See docs/ folder for detailed guides
API Reference: See source code docstrings
Examples: See examples/ folder
Video Tutorial: Coming soon

🗺️ Roadmap

Version 2.0 (Planned)

Cloud deployment support
Real-time processing API
Mobile app (Android/iOS)
Multi-language support (Hindi, Spanish, French)
Custom model training interface

Version 1.5 (In Progress)

Improved table extraction
Better handwriting recognition
Batch processing optimization
Enhanced validation rules

Version 1.0 (Current)

📊 Statistics

Stars: ⭐ Star this repo if you find it useful!
Downloads: 1000+ (and growing)
Contributors: Open for contributions
Issues: Check our issue tracker

Built with ❤️ for accurate invoice extraction

Made with ☕ and 🧠 by Akshita Shetty!
