Gaur-Ayush-AI-Engineer/intelligent-invoice-extractor

🎯 Objective

Extract structured data from invoice images including:

  • Invoice number
  • Invoice date
  • Line items (product name, quantity, price)

The system is designed to be scalable and configurable - users can easily add or remove fields by modifying the configuration file.

📥 Input Requirements

Supported Input:

  • File Type: Images only (.png, .jpg, .jpeg)
  • Processing: One image at a time
  • Format: Invoice images with clear text and structure

🏗️ Approach

Hybrid OCR + Vision Model

Vision model used: Qwen2.5-VL 7B

Why this approach?

  • Image-only approach had issues: misread characters, missing text, incorrect prices
  • OCR + Vision combination provides better accuracy:
    • OCR extracts exact text with confidence scores
    • Vision model understands layout and spatial relationships
    • Cross-validation between both sources reduces errors

Why Not Fine-Tuning?

  • Considered LoRA fine-tuning for domain-specific optimization
  • Dataset characteristics: the full dataset contained 500 images, but mostly of similar invoice formats/layouts
  • Lack of diversity: insufficient variety in invoice structures, fonts, and layouts for robust fine-tuning
  • Overfitting risk: fine-tuning on a homogeneous dataset would likely cause overfitting to specific invoice types
  • Testing approach: used a validation subset (48 images) to evaluate generalization capability
  • Better approach: prompt engineering with OCR context provides robust generalization across diverse invoice formats
  • Future work: fine-tuning becomes viable with larger, more diverse datasets (1000+ images with varied layouts)

Process:

  1. OCR Preprocessing:

    • EasyOCR extracts all text from the invoice image with bounding boxes and confidence scores
    • Text is organized spatially (rows/columns) to understand table structure
    • Creates structured OCR context showing where each piece of text is located
  2. Vision Analysis:

    • Qwen2.5VL 7B receives both the original image AND the structured OCR text
    • Model can "see" the visual layout while having exact text transcription as reference
    • Cross-references between image and OCR: if OCR says "8" but image shows "B", model uses visual context to decide
    • Uses OCR for precise text extraction and image for layout understanding
  3. Structured Output:

    • Model combines insights from both sources to extract accurate data
    • Generates clean JSON with all requested fields
    • OCR handles exact character recognition, vision handles spatial relationships and context
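
The spatial organization in step 1 can be sketched as follows. This is a minimal illustration, not the repository's actual code: `group_into_rows`, `build_ocr_context`, and the `y_tolerance` default are hypothetical names and values. The only assumption about the OCR engine is the result shape that EasyOCR's `readtext` returns, a list of (bounding box, text, confidence) entries where the bounding box is four [x, y] corner points.

```python
from statistics import mean


def group_into_rows(ocr_results, y_tolerance=10):
    """Group OCR detections into rows by vertical position.

    ocr_results: list of (bbox, text, confidence), where bbox is a list of
    four [x, y] corner points. y_tolerance is a hypothetical pixel threshold.
    """
    items = []
    for bbox, text, confidence in ocr_results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        items.append({"x": min(xs), "y": mean(ys), "text": text, "conf": confidence})

    # Walk detections top to bottom; start a new row whenever the vertical
    # distance to the current row's first item exceeds the tolerance.
    items.sort(key=lambda item: item["y"])
    rows = []
    for item in items:
        if rows and abs(item["y"] - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["items"].append(item)
        else:
            rows.append({"y": item["y"], "items": [item]})

    # Within each row, order cells left to right.
    for row in rows:
        row["items"].sort(key=lambda item: item["x"])
    return rows


def build_ocr_context(rows):
    """Render rows as plain text that can be placed in the model prompt."""
    lines = []
    for index, row in enumerate(rows, start=1):
        cells = " | ".join(f'{item["text"]} ({item["conf"]:.2f})' for item in row["items"])
        lines.append(f"Row {index}: {cells}")
    return "\n".join(lines)


sample = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "Invoice No", 0.98),
    ([[120, 12], [200, 12], [200, 32], [120, 32]], "INV-001", 0.95),
    ([[10, 60], [80, 60], [80, 80], [10, 80]], "Widget", 0.91),
    ([[120, 61], [150, 61], [150, 81], [120, 81]], "2", 0.88),
]
print(build_ocr_context(group_into_rows(sample)))
```

A fixed pixel tolerance is the simplest choice; scaling it to the median text height would likely hold up better across image resolutions.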

⚙️ Adding/Removing Fields

To extract additional fields or modify existing ones:

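Edit config.json (the // comments below only mark the new fields; leave them out of the real file, since standard JSON does not allow comments):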

{
  "output_format": {
    "invoice_number": "string",
    "invoice_date": "YYYY/MM/DD",
    "vendor_name": "string",           // ← Add new field
    "total_amount": "decimal",         // ← Add new field
    "line_items": [
      {
        "name": "product description",
        "qty": "quantity",
        "price": "unit price",
        "total": "line total"          // ← Add new field
      }
    ]
  }
}
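
One way a configuration like this can drive extraction is to embed the `output_format` block directly in the model prompt, so that adding a field to the config changes what the model is asked for without touching code. The sketch below assumes that design; `build_prompt` and the prompt wording are hypothetical and not taken from the repository.

```python
import json


def build_prompt(config, ocr_context):
    """Build an extraction prompt from the configured output format.

    Hypothetical helper: the wording of the instructions is illustrative only.
    """
    schema = json.dumps(config["output_format"], indent=2)
    return (
        "Extract the following fields from the invoice image.\n"
        "Return only JSON matching this structure:\n"
        f"{schema}\n\n"
        "OCR text for reference:\n"
        f"{ocr_context}"
    )


# In practice the config would be loaded with json.load() from config.json.
config = {
    "output_format": {
        "invoice_number": "string",
        "invoice_date": "YYYY/MM/DD",
        "vendor_name": "string",
        "line_items": [{"name": "product description", "qty": "quantity", "price": "unit price"}],
    }
}
print(build_prompt(config, "Row 1: Invoice No | INV-001"))
```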

🔄 System Flow

Complete Workflow:

📁 test_images/          📁 test_json/
├── invoice1.png    →    ├── invoice1.json (ground truth)
├── invoice2.png    →    ├── invoice2.json (ground truth)
└── invoice3.png    →    └── invoice3.json (ground truth)
                              ↓
                    🚀 python batch_test.py
                              ↓
                    📊 Processing Flow:
                    
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Load Image    │ →  │  Extract JSON    │ →  │  Save Output    │
│   (main.py)     │    │  (OCR + Vision)  │    │  to output/     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              ↓
┌──────────────────┐   ┌──────────────────┐    ┌─────────────────┐
│  Load Ground     │ → │   Compare &      │ →  │  Generate       │
│  Truth JSON      │   │  Test Accuracy   │    │  Summary Report │
│(test_accuracy.py)│   │(test_accuracy.py)│    │                 │
└──────────────────┘   └──────────────────┘    └─────────────────┘

File Processing:

  • Input: One image processed at a time
  • Output: Individual JSON file for each image
  • Comparison: Each output compared with corresponding ground truth
  • Result: Comprehensive accuracy report for all files
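
The pairing of each image with its ground-truth file follows from the shared file stem (invoice1.png ↔ invoice1.json). A minimal sketch of that matching step is shown here; `pair_images_with_truth` is a hypothetical helper and not necessarily how batch_test.py does it.

```python
from pathlib import Path

# Image types listed under Input Requirements.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pair_images_with_truth(images_dir, truth_dir):
    """Return (pairs, missing): image/ground-truth pairs and images with no JSON."""
    pairs, missing = [], []
    for image_path in sorted(Path(images_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a supported image
        truth_path = Path(truth_dir) / f"{image_path.stem}.json"
        if truth_path.is_file():
            pairs.append((image_path, truth_path))
        else:
            missing.append(image_path)
    return pairs, missing


if __name__ == "__main__":
    if Path("test_images").is_dir():
        pairs, missing = pair_images_with_truth("test_images", "test_json")
        print(f"{len(pairs)} pairs found, {len(missing)} images without ground truth")
```

Reporting images that lack a ground-truth file, rather than silently skipping them, keeps the accuracy summary honest about how many files were actually scored.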

🚀 Quick Start

Easy Setup - Drop Files and Run:

  1. Prepare Your Data:

    test_images/
    ├── invoice1.png
    ├── invoice2.png
    └── invoice3.png
    
    test_json/
    ├── invoice1.json  (ground truth)
    ├── invoice2.json  (ground truth)
    └── invoice3.json  (ground truth)
    
  2. Run Complete Pipeline:

    python batch_test.py

That's it! The system will:

  • ✅ Extract JSON from each image (done in main.py)
  • ✅ Save outputs to the output/ folder
  • ✅ Compare each output with its ground truth (done in test_accuracy.py)
  • ✅ Generate an accuracy report (done by batch_test.py)

batch_test.py automates the whole test run. It sends each image to main.py, which produces the JSON for that image, and then calls the comparison functionality imported from test_accuracy.py. test_accuracy.py first cleans the output with its preprocessing functions and then compares it with the ground truth.

📁 Project Structure

├── main_script.py              # Core extraction functionality
├── test_accuracy.py            # Accuracy evaluation
├── batch_test.py               # Batch runner (extraction + accuracy report)
├── config.json                 # Field configuration
├── requirements.txt            # Dependencies
├── test_images/               # Sample invoice images
├── test_json/                 # Ground truth data
└── output/                    # Generated results

📊 Validation Results

The system was tested on 48 validation invoice images with the following results:

Metric                        Performance
Average Accuracy              99.0%
Perfect Extractions (100%)    37/48 invoices (77%)
Near-Perfect (95-99%)         11/48 invoices (23%)
Failed Extractions (<90%)     0/48 invoices (0%)

All 48 invoices scored 95% or higher, demonstrating robust performance across various invoice formats.

🔧 Core Functions

main.py:

  • extract_invoice_data(): Main extraction function
  • clean_json_string(): Parse model responses
  • OCR preprocessing and text organization
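
Model responses often wrap the JSON in extra prose or Markdown code fences, which is why a cleaning step is needed before parsing. The sketch below shows one simple way to do this; `parse_model_json` is a hypothetical stand-in and not the repository's clean_json_string(), whose actual logic is not shown here.

```python
import json


def parse_model_json(response_text):
    """Extract and parse the JSON object embedded in a model response.

    Takes the span from the first "{" to the last "}", which discards any
    surrounding prose or code-fence markers, then parses it.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    return json.loads(response_text[start:end + 1])


raw = 'Here is the extracted data:\n{"invoice_number": "INV-001", "line_items": []}\nDone.'
print(parse_model_json(raw))
```

This approach assumes the response contains exactly one top-level JSON object; a response with several objects or a truncated one will raise an error from `json.loads`, which is usually preferable to silently accepting partial data.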

test_accuracy.py:

  • test_single_invoice_direct(): Accuracy testing
  • Smart comparison for numbers and text
  • Handles different formats and minor variations
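
A comparison that tolerates formatting differences might look like the following. This is a sketch under assumptions: `to_number`, `values_match`, the set of stripped currency symbols, and the 0.01 tolerance are all hypothetical and not taken from test_accuracy.py.

```python
import re
from decimal import Decimal

# A value counts as numeric only if the whole cleaned string is a number,
# so identifiers such as "INV-001" are still compared as text.
NUMBER_RE = re.compile(r"^-?\d+(\.\d+)?$")


def to_number(value):
    """Return a Decimal for numeric-looking values, otherwise None."""
    cleaned = re.sub(r"[,$€£\s]", "", str(value))
    if NUMBER_RE.match(cleaned):
        return Decimal(cleaned)
    return None


def values_match(expected, actual, tolerance=Decimal("0.01")):
    """Compare numbers within a tolerance and text case/whitespace-insensitively."""
    expected_number, actual_number = to_number(expected), to_number(actual)
    if expected_number is not None and actual_number is not None:
        return abs(expected_number - actual_number) <= tolerance

    def normalize(value):
        return " ".join(str(value).lower().split())

    return normalize(expected) == normalize(actual)


print(values_match("$1,200.00", 1200))         # numeric comparison
print(values_match("Acme  Corp", "acme corp"))  # text comparison
print(values_match("INV-001", "INV-002"))       # differing identifiers
```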

🚀 Improving Accuracy

For Better Field Extraction:

1. Image Quality:

  • Use high-resolution images (minimum 300 DPI)
  • Ensure good contrast and lighting
  • Avoid blurry or skewed images

2. OCR Optimization:

# Adjust OCR confidence threshold in main_script.py
if confidence > 0.7:  # Increase from 0.5 to 0.7 for higher precision
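
To see what raising the threshold does in practice, the check can be applied as a filter over the OCR results. The sketch below assumes results in the (bounding box, text, confidence) shape that EasyOCR's `readtext` returns; `filter_by_confidence` is a hypothetical helper, and the EasyOCR calls are left as comments so the example runs without the library installed.

```python
def filter_by_confidence(ocr_results, threshold=0.7):
    """Keep only detections whose confidence is above the threshold."""
    return [
        (bbox, text, confidence)
        for bbox, text, confidence in ocr_results
        if confidence > threshold
    ]


# With EasyOCR installed, real results could be obtained like this:
#   import easyocr
#   reader = easyocr.Reader(["en"])
#   results = reader.readtext("test_images/invoice1.png")
results = [
    ([[0, 0], [50, 0], [50, 20], [0, 20]], "INV-001", 0.95),
    ([[0, 30], [50, 30], [50, 50], [0, 50]], "Widget", 0.60),
    ([[0, 60], [50, 60], [50, 80], [0, 80]], "8", 0.40),
]
print(len(filter_by_confidence(results, threshold=0.5)))
print(len(filter_by_confidence(results, threshold=0.7)))
```

A higher threshold removes more noisy detections but can also drop correct low-confidence text, so the vision model then has to recover that text from the image alone.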

3. Model Configuration:

# Adjust model parameters for specific fields
"options": {
    "temperature": 0.01,    # Lower for more consistent extraction
    "num_predict": 6000,    # Higher for complex invoices
}
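
The option names `temperature` and `num_predict` match the generation options of Ollama's REST API. Assuming the model is served through a local Ollama instance (the README does not state this), a request payload carrying these options could be assembled as below. `build_generate_payload` and the model tag `qwen2.5vl:7b` are assumptions for illustration.

```python
import base64


def build_generate_payload(image_bytes, prompt, model="qwen2.5vl:7b"):
    """Assemble a payload for Ollama's /api/generate endpoint (assumed backend)."""
    return {
        "model": model,  # assumed model tag
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {
            "temperature": 0.01,  # lower for more consistent extraction
            "num_predict": 6000,  # higher for complex invoices
        },
    }


payload = build_generate_payload(b"fake image bytes", "Extract the invoice fields as JSON.")
print(payload["options"])

# Sending the request requires a running Ollama server, for example:
#   import requests
#   reply = requests.post("http://localhost:11434/api/generate", json=payload)
#   text = reply.json()["response"]
```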

4. Custom Field Prompts:

// In config.json, add specific instructions for problematic fields
{
  "special_instructions": {
    "total_amount": "Extract the final total amount, usually at bottom right",
    "line_items": "Focus on table structure, extract each row separately"
  }
}

5. Post-Processing Rules:

  • Add validation rules for specific field formats
  • Implement business logic checks (e.g., quantities × prices = totals)
  • Use regex patterns for structured fields like invoice numbers
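
The rules above can be sketched as a single validation pass over the extracted JSON. Everything here is illustrative: `validate_invoice`, the invoice-number pattern, and the 0.01 tolerance are hypothetical, and the field names (`qty`, `price`, `total`, `invoice_date` in YYYY/MM/DD) follow the example config.json.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical pattern: letters, digits, hyphens, and slashes only.
INVOICE_NUMBER_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9\-/]*$")


def validate_invoice(data, tolerance=Decimal("0.01")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    number = str(data.get("invoice_number", ""))
    if not INVOICE_NUMBER_RE.match(number):
        problems.append(f"invoice_number has unexpected format: {number!r}")

    try:
        datetime.strptime(str(data.get("invoice_date", "")), "%Y/%m/%d")
    except ValueError:
        problems.append("invoice_date is not a valid YYYY/MM/DD date")

    # Business logic check: quantity x unit price should equal the line total.
    for index, item in enumerate(data.get("line_items", []), start=1):
        try:
            qty = Decimal(str(item["qty"]))
            price = Decimal(str(item["price"]))
            total = Decimal(str(item["total"]))
        except (KeyError, InvalidOperation):
            problems.append(f"line {index}: missing or non-numeric qty/price/total")
            continue
        if abs(qty * price - total) > tolerance:
            problems.append(f"line {index}: {qty} x {price} does not equal {total}")
    return problems


invoice = {
    "invoice_number": "INV-001",
    "invoice_date": "2024/03/15",
    "line_items": [{"name": "Widget", "qty": "2", "price": "4.50", "total": "9.00"}],
}
print(validate_invoice(invoice))
```

Using `Decimal` rather than `float` avoids rounding artifacts when multiplying prices, which matters when the tolerance is a single cent.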
