GitHub - Gaur-Ayush-AI-Engineer/intelligent-invoice-extractor: Hybrid OCR + Vision model system for automated invoice data extraction.

🎯 Objective

Extract structured data from invoice images including:

Invoice number
Invoice date
Line items (product name, quantity, price)

The system is designed to be scalable and configurable - users can easily add or remove fields by modifying the configuration file.

📥 Input Requirements

Supported Input:

File Type: Images only (.png, .jpg, .jpeg)
Processing: One image at a time
Format: Invoice images with clear text and structure

🏗️ Approach

Hybrid OCR + Vision Model

Vision model used: Qwen2.5-VL 7B

Why this approach?

Image-only approach had issues: misread characters, missing text, incorrect prices
OCR + Vision combination provides better accuracy:
- OCR extracts exact text with confidence scores
- Vision model understands layout and spatial relationships
- Cross-validation between both sources reduces errors

Why Not Fine-Tuning?

- Considered LoRA fine-tuning for domain-specific optimization
- Dataset characteristics: the full dataset contained 500 images, but mostly of similar invoice formats/layouts
- Lack of diversity: insufficient variety in invoice structures, fonts, and layouts for robust fine-tuning
- Overfitting risk: fine-tuning on a homogeneous dataset would likely cause overfitting to specific invoice types
- Testing approach: used a validation subset (48 images) to evaluate generalization capability
- Better approach: prompt engineering with OCR context provides robust generalization across diverse invoice formats
- Future work: fine-tuning becomes viable with larger, more diverse datasets (1000+ images with varied layouts)

Process:

OCR Preprocessing:
- EasyOCR extracts all text from the invoice image with bounding boxes and confidence scores
- Text is organized spatially (rows/columns) to understand table structure
- Creates structured OCR context showing where each piece of text is located
Vision Analysis:
- Qwen2.5VL 7B receives both the original image AND the structured OCR text
- Model can "see" the visual layout while having exact text transcription as reference
- Cross-references between image and OCR: if OCR says "8" but image shows "B", model uses visual context to decide
- Uses OCR for precise text extraction and image for layout understanding
Structured Output:
- Model combines insights from both sources to extract accurate data
- Generates clean JSON with all requested fields
- OCR handles exact character recognition, vision handles spatial relationships and context
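
The spatial-organization step can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: it assumes the detections have the shape EasyOCR's `readtext` returns (a four-corner bounding box, the text, and a confidence), and the names `group_into_rows`, `format_ocr_context`, and `row_tolerance` are hypothetical.

```python
from statistics import mean


def group_into_rows(detections, row_tolerance=10.0):
    """Group OCR detections into rows by the vertical centre of each box."""
    def y_centre(det):
        return mean(point[1] for point in det[0])

    def x_left(det):
        return min(point[0] for point in det[0])

    rows = []
    for det in sorted(detections, key=y_centre):
        # Stay in the current row while the box is within row_tolerance pixels
        # of the row's first (topmost) box; otherwise start a new row.
        if rows and abs(y_centre(det) - y_centre(rows[-1][0])) <= row_tolerance:
            rows[-1].append(det)
        else:
            rows.append([det])
    # Read each row left to right, as a table row would be read.
    return [sorted(row, key=x_left) for row in rows]


def format_ocr_context(rows):
    """Render rows as plain text that can be placed in the model prompt."""
    lines = []
    for number, row in enumerate(rows, start=1):
        cells = " | ".join(f"{text} [{conf:.2f}]" for _, text, conf in row)
        lines.append(f"Row {number}: {cells}")
    return "\n".join(lines)


# Hand-written sample in the (bbox, text, confidence) shape described above.
sample = [
    ([[200, 10], [300, 10], [300, 30], [200, 30]], "INV-001", 0.95),
    ([[10, 12], [120, 12], [120, 32], [10, 32]], "Invoice No", 0.98),
    ([[10, 60], [90, 60], [90, 80], [10, 80]], "Widget", 0.91),
    ([[200, 61], [240, 61], [240, 81], [200, 81]], "2", 0.40),
]
ocr_context = format_ocr_context(group_into_rows(sample))
print(ocr_context)
```

With a real image, the `sample` list would instead come from something like `easyocr.Reader(["en"]).readtext("invoice.png")`. A fixed pixel tolerance is the simplest possible row rule; skewed scans would likely need a tolerance relative to text height.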

⚙️ Adding/Removing Fields

To extract additional fields or modify existing ones:

Edit config.json:

{
  "output_format": {
    "invoice_number": "string",
    "invoice_date": "YYYY/MM/DD",
    "vendor_name": "string",           // ← Add new field
    "total_amount": "decimal",         // ← Add new field
    "line_items": [
      {
        "name": "product description",
        "qty": "quantity",
        "price": "unit price",
        "total": "line total"          // ← Add new field
      }
    ]
  }
}
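
One way a configuration like this could drive extraction is by embedding the `output_format` block in the prompt sent to the model. The sketch below is an assumption about how that might look, not the repository's actual prompt: `build_prompt` and the prompt wording are hypothetical. Note that the `// ←` markers in the example above are annotations only; a config file read with a standard JSON parser must not contain comments.

```python
import json


def build_prompt(config, ocr_context):
    """Build an extraction prompt from the config (hypothetical helper)."""
    schema = json.dumps(config["output_format"], indent=2)
    lines = [
        "Extract the following fields from the invoice image.",
        "Return only valid JSON matching this structure:",
        schema,
    ]
    # Optional per-field hints, if the config defines special_instructions.
    for field, hint in config.get("special_instructions", {}).items():
        lines.append(f"- {field}: {hint}")
    lines.append("OCR text for reference:")
    lines.append(ocr_context)
    return "\n".join(lines)


# Same structure as the example config, without the comment markers.
config = json.loads("""
{
  "output_format": {
    "invoice_number": "string",
    "invoice_date": "YYYY/MM/DD",
    "vendor_name": "string",
    "line_items": [
      {"name": "product description", "qty": "quantity", "price": "unit price"}
    ]
  }
}
""")
prompt = build_prompt(config, "Row 1: Invoice No | INV-001")
print(prompt)
```

Because the schema is serialized straight from the config, adding or removing a field in config.json changes the prompt without touching code.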

🔄 System Flow

Complete Workflow:

📁 test_images/          📁 test_json/
├── invoice1.png    →    ├── invoice1.json (ground truth)
├── invoice2.png    →    ├── invoice2.json (ground truth)
└── invoice3.png    →    └── invoice3.json (ground truth)
                              ↓
                    🚀 python batch_test.py
                              ↓
                    📊 Processing Flow:
                    
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Load Image    │ →  │  Extract JSON    │ →  │  Save Output    │
│   (main.py)     │    │  (OCR + Vision)  │    │  to output/     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              ↓
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Load Ground    │ →  │   Compare &      │ →  │  Generate       │
│  Truth JSON     │    │  Test Accuracy   │    │  Summary Report │
│(test_accuracy.py)│   │(test_accuracy.py)│    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

File Processing:

Input: One image processed at a time
Output: Individual JSON file for each image
Comparison: Each output compared with corresponding ground truth
Result: Comprehensive accuracy report for all files
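
The pairing of each image with its ground-truth file can be sketched as below. This is an illustrative sketch, not the logic in batch_test.py: `pair_images_with_ground_truth` is a hypothetical helper, and matching by file stem (invoice1.png with invoice1.json) is inferred from the directory layout shown above.

```python
from pathlib import Path

# Image types listed under Input Requirements.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pair_images_with_ground_truth(image_dir, truth_dir):
    """Return (pairs, missing): images with a matching JSON, and images without one."""
    pairs, missing = [], []
    for image in sorted(Path(image_dir).iterdir()):
        # Skip non-image files such as .DS_Store.
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        truth = Path(truth_dir) / f"{image.stem}.json"
        if truth.exists():
            pairs.append((image, truth))
        else:
            missing.append(image)
    return pairs, missing
```

Calling `pair_images_with_ground_truth("test_images", "test_json")` would give the list of files to process one at a time, plus any images that cannot be scored because their ground truth is absent.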

🚀 Quick Start

Easy Setup - Drop Files and Run:

Prepare Your Data:

test_images/
├── invoice1.png
├── invoice2.png
└── invoice3.png

test_json/
├── invoice1.json  (ground truth)
├── invoice2.json  (ground truth)
└── invoice3.json  (ground truth)

Run Complete Pipeline:
```
python batch_test.py
```

That's it! The system will:

✅ Extract JSON from each image (main.py)
✅ Save outputs to the output/ folder
✅ Compare each output with its ground truth (test_accuracy.py)
✅ Generate an accuracy report (batch_test.py)

batch_test.py automates the whole run: it sends each image to main.py, which produces the JSON, and then calls the comparison functions imported from test_accuracy.py. Before comparing, test_accuracy.py applies preprocessing functions that clean the extracted output.

📁 Project Structure

├── main.py                     # Core extraction functionality
├── batch_test.py               # Runs extraction and evaluation for all test images
├── test_accuracy.py            # Accuracy evaluation
├── config.json                 # Field configuration
├── requirements.txt            # Dependencies
├── test_images/               # Sample invoice images
├── test_json/                 # Ground truth data
└── output/                    # Generated results

📊 Validation Results

The system was tested on 48 validation invoice images with the following results:

| Metric | Performance |
|---|---|
| Average Accuracy | 99.0% |
| Perfect Extractions (100%) | 37/48 invoices (77%) |
| Near-Perfect (95-99%) | 11/48 invoices (23%) |
| Failed Extractions (<90%) | 0/48 invoices (0%) |

All 48 invoices scored 95% or higher, so none fell below the 90% failure threshold on this validation set.

🔧 Core Functions

main.py:

extract_invoice_data(): Main extraction function
clean_json_string(): Parse model responses
OCR preprocessing and text organization
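
Vision models often wrap their JSON in a Markdown fence or surround it with explanatory text, which is the kind of cleanup a function like clean_json_string() has to deal with. The sketch below shows one common approach under that assumption; it is not the repository's implementation, and `parse_model_response` is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid writing it literally


def parse_model_response(raw):
    """Extract the first JSON object from a model reply and parse it."""
    text = raw.strip()
    # If the reply contains a fenced block, keep only its contents.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Drop any prose before the first "{" and after the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    return json.loads(text[start:end + 1])


reply = "Here is the result:\n" + FENCE + 'json\n{"invoice_number": "INV-001"}\n' + FENCE
print(parse_model_response(reply))
```

Slicing from the first `{` to the last `}` is a heuristic: it still raises `json.JSONDecodeError` when the model emits malformed JSON, which is usually the right signal to retry or log the failure.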

test_accuracy.py:

test_single_invoice_direct(): Accuracy testing
Smart comparison for numbers and text
Handles different formats and minor variations
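
A tolerant comparison of this kind might look like the following. It is a sketch of the general idea only: `to_number`, `values_match`, the currency symbols handled, and the 0.01 tolerance are all assumptions, and the rules in test_accuracy.py may differ.

```python
import re


def to_number(value):
    """Return a float if the value looks like a plain number, else None."""
    cleaned = str(value).strip().replace(",", "").lstrip("$€£").strip()
    if re.fullmatch(r"[-+]?\d+(?:\.\d+)?", cleaned):
        return float(cleaned)
    return None


def values_match(expected, actual, tolerance=0.01):
    """Compare numerically when both sides are numbers, otherwise as normalized text."""
    expected_number, actual_number = to_number(expected), to_number(actual)
    if expected_number is not None and actual_number is not None:
        return abs(expected_number - actual_number) <= tolerance

    def normalize(text):
        # Ignore case and collapse runs of whitespace.
        return " ".join(str(text).lower().split())

    return normalize(expected) == normalize(actual)


print(values_match("1,200.00", 1200))            # formatting differs, value is equal
print(values_match("Blue  Widget", "blue widget"))  # case and spacing differ
print(values_match("INV-001", "INV-002"))        # genuinely different identifiers
```

Requiring the whole string to be numeric keeps identifiers such as invoice numbers from being compared as numbers by accident.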

🚀 Improving Accuracy

For Better Field Extraction:

1. Image Quality:

Use high-resolution images (minimum 300 DPI)
Ensure good contrast and lighting
Avoid blurry or skewed images

2. OCR Optimization:

# Adjust OCR confidence threshold in main.py
if confidence > 0.7:  # Increase from 0.5 to 0.7 for higher precision
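
To see what raising the threshold does in practice, the sketch below applies both values to a small hand-written sample. The detection tuples follow the (bounding box, text, confidence) shape EasyOCR produces; `filter_by_confidence` is a hypothetical helper, not a function from the repository.

```python
def filter_by_confidence(detections, threshold=0.5):
    """Keep only detections whose confidence exceeds the threshold."""
    return [(bbox, text, conf) for bbox, text, conf in detections if conf > threshold]


# Bounding boxes are omitted (None) because only the confidence matters here.
detections = [
    (None, "Invoice No", 0.98),
    (None, "INV-001", 0.95),
    (None, "Widget", 0.62),
    (None, "2", 0.40),
]

for threshold in (0.5, 0.7):
    kept = [text for _, text, _ in filter_by_confidence(detections, threshold)]
    print(threshold, kept)
```

The higher threshold removes more doubtful text from the OCR context, at the cost of leaving the vision model to read those regions from the image alone.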

3. Model Configuration:

# Fine-tune model parameters for specific fields
"options": {
    "temperature": 0.01,    # Lower for more consistent extraction
    "num_predict": 6000,    # Higher for complex invoices
}
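
The option names above match those accepted by Ollama's generate endpoint, so the sketch below assumes the model is served by a local Ollama instance. That is an assumption: the README does not say how the model is hosted, and the model tag `qwen2.5vl:7b`, the helper names, and the timeout are illustrative.

```python
import base64
import json
import urllib.request

# Ollama's default local endpoint (assumed deployment).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes, prompt, model="qwen2.5vl:7b"):
    """Assemble a non-streaming generate request with the options shown above."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {
            "temperature": 0.01,  # near-deterministic output
            "num_predict": 6000,  # room for invoices with many line items
        },
    }


def send(payload, url=OLLAMA_URL, timeout=300):
    """POST the payload and return the model's text reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

`send` is only defined here, not called, since it needs a running server; `build_payload` can be inspected on its own to confirm which options reach the model.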

4. Custom Field Prompts:

// In config.json, add specific instructions for problematic fields
{
  "special_instructions": {
    "total_amount": "Extract the final total amount, usually at bottom right",
    "line_items": "Focus on table structure, extract each row separately"
  }
}

5. Post-Processing Rules:

Add validation rules for specific field formats
Implement business logic checks (e.g., quantities × prices = totals)
Use regex patterns for structured fields like invoice numbers
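
These rules could be sketched as a single validation pass over the extracted JSON. Everything here is illustrative: the invoice-number pattern is a made-up example that would need adapting to real numbering schemes, the date format follows the YYYY/MM/DD convention from config.json, and the `total` field assumes the extended line-item format shown earlier.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical pattern: uppercase letters and digits, with "-" or "/" allowed inside.
INVOICE_NUMBER_PATTERN = re.compile(r"^[A-Z0-9][A-Z0-9\-/]*$")


def line_item_is_consistent(item, tolerance=Decimal("0.01")):
    """Check that qty x price equals total, within a rounding tolerance."""
    try:
        qty = Decimal(str(item["qty"]))
        price = Decimal(str(item["price"]))
        total = Decimal(str(item["total"]))
        return abs(qty * price - total) <= tolerance
    except (KeyError, InvalidOperation):
        return False


def validate_invoice(invoice):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not INVOICE_NUMBER_PATTERN.match(str(invoice.get("invoice_number", ""))):
        problems.append("invoice_number has an unexpected format")
    try:
        datetime.strptime(str(invoice.get("invoice_date", "")), "%Y/%m/%d")
    except ValueError:
        problems.append("invoice_date is not in YYYY/MM/DD format")
    for number, item in enumerate(invoice.get("line_items", []), start=1):
        if not line_item_is_consistent(item):
            problems.append(f"line item {number}: qty x price does not equal total")
    return problems


invoice = {
    "invoice_number": "INV-001",
    "invoice_date": "2024/03/15",
    "line_items": [
        {"name": "Widget", "qty": "2", "price": "4.50", "total": "9.00"},
        {"name": "Gasket", "qty": "3", "price": "1.25", "total": "4.00"},
    ],
}
print(validate_invoice(invoice))
```

`Decimal` is used instead of `float` so that prices such as 4.50 multiply exactly; a failed check is a flag for review rather than proof of an extraction error, since the invoice itself may contain discounts or rounding.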
