Extract structured data from invoice images including:
- Invoice number
- Invoice date
- Line items (product name, quantity, price)
The system is designed to be scalable and configurable - users can easily add or remove fields by modifying the configuration file.
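One way such configurability can work is to generate the extraction prompt from the config, so the code never hard-codes field names. The sketch below assumes a config with an `output_format` key like the one shown in this README; `build_prompt` and the prompt wording are hypothetical and are not taken from main.py.

```python
import json

# Hypothetical config mirroring the "output_format" block in this README.
CONFIG_TEXT = """
{
  "output_format": {
    "invoice_number": "string",
    "invoice_date": "YYYY/MM/DD",
    "line_items": [
      {"name": "product description", "qty": "quantity", "price": "unit price"}
    ]
  }
}
"""


def build_prompt(config: dict, ocr_context: str) -> str:
    """Build an extraction prompt whose schema comes straight from the config."""
    schema = json.dumps(config["output_format"], indent=2)
    return (
        "Extract the following fields from the invoice and reply with JSON only.\n"
        f"Expected JSON structure:\n{schema}\n\n"
        f"OCR text for reference:\n{ocr_context}"
    )


config = json.loads(CONFIG_TEXT)
prompt = build_prompt(config, "INV-001 | 2024/01/15")

# Adding a field to the config changes the prompt without touching the code.
config["output_format"]["vendor_name"] = "string"
prompt_with_vendor = build_prompt(config, "INV-001 | 2024/01/15")
```

With this design, adding or removing a field is a config edit only. The model sees the new schema the next time the prompt is built.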
Supported Input:
- File Type: Images only (.png, .jpg, .jpeg)
- Processing: One image at a time
- Format: Invoice images with clear text and structure
Why this approach?
- Image-only approach had issues: misread characters, missing text, incorrect prices
- OCR + Vision combination provides better accuracy:
  - OCR extracts exact text with confidence scores
  - Vision model understands layout and spatial relationships
  - Cross-validation between both sources reduces errors
- Considered LoRA fine-tuning for domain-specific optimization
- Dataset characteristics: Full dataset contained 500 images, but mostly of similar invoice formats/layouts
- Lack of diversity: Insufficient variety in invoice structures, fonts, and layouts for robust fine-tuning
- Overfitting risk: Fine-tuning on a homogeneous dataset would likely cause overfitting to specific invoice types
- Testing approach: Used a validation subset (48 images) to evaluate generalization capability
- Better approach: Prompt engineering with OCR context provides robust generalization across diverse invoice formats
- Future work: Fine-tuning viable with larger, more diverse datasets (1000+ images with varied layouts)
Process:
1. OCR Preprocessing:
   - EasyOCR extracts all text from the invoice image with bounding boxes and confidence scores
   - Text is organized spatially (rows/columns) to understand table structure
   - Creates structured OCR context showing where each piece of text is located
2. Vision Analysis:
   - Qwen2.5VL 7B receives both the original image AND the structured OCR text
   - Model can "see" the visual layout while having exact text transcription as reference
   - Cross-references between image and OCR: if OCR says "8" but image shows "B", model uses visual context to decide
   - Uses OCR for precise text extraction and image for layout understanding
3. Structured Output:
   - Model combines insights from both sources to extract accurate data
   - Generates clean JSON with all requested fields
   - OCR handles exact character recognition, vision handles spatial relationships and context
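The spatial organization in the OCR preprocessing step can be sketched roughly as follows. The input uses the `(bounding box, text, confidence)` shape that EasyOCR's `readtext()` returns, but the sample detections, the `group_into_rows` helper, and the `row_tolerance` value are assumptions for illustration and are not the code in main.py.

```python
from typing import List, Tuple

# One OCR detection: (four corner points, text, confidence).
Detection = Tuple[List[List[float]], str, float]


def group_into_rows(detections: List[Detection],
                    row_tolerance: float = 10.0,
                    min_confidence: float = 0.5) -> List[List[str]]:
    """Group detections into rows by vertical centre, each row sorted left to right."""
    items = []
    for box, text, conf in detections:
        if conf < min_confidence:
            continue  # drop low-confidence text
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        items.append((sum(ys) / len(ys), min(xs), text))
    items.sort()  # top to bottom

    rows = []      # each row is a list of (x, text)
    row_y = None   # vertical centre of the first item in the current row
    for y, x, text in items:
        if row_y is None or abs(y - row_y) > row_tolerance:
            rows.append([])
            row_y = y
        rows[-1].append((x, text))
    return [[text for _, text in sorted(row)] for row in rows]


# Hypothetical detections, deliberately out of reading order.
sample = [
    ([[200, 50], [260, 50], [260, 70], [200, 70]], "2", 0.98),
    ([[10, 52], [120, 52], [120, 72], [10, 72]], "Widget", 0.95),
    ([[300, 51], [360, 51], [360, 71], [300, 71]], "9.50", 0.91),
    ([[10, 10], [150, 10], [150, 30], [10, 30]], "Invoice INV-001", 0.99),
    ([[10, 90], [60, 90], [60, 110], [10, 110]], "smudge", 0.20),
]
rows = group_into_rows(sample)
ocr_context = "\n".join(" | ".join(row) for row in rows)
print(ocr_context)
```

The resulting `ocr_context` string is the kind of structured text that could be passed to the vision model alongside the image.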
To extract additional fields or modify existing ones:
Edit config.json:
{
  "output_format": {
    "invoice_number": "string",
    "invoice_date": "YYYY/MM/DD",
    "vendor_name": "string",      // ← Add new field
    "total_amount": "decimal",    // ← Add new field
    "line_items": [
      {
        "name": "product description",
        "qty": "quantity",
        "price": "unit price",
        "total": "line total"     // ← Add new field
      }
    ]
  }
}
📁 test_images/          📁 test_json/
├── invoice1.png    →    ├── invoice1.json (ground truth)
├── invoice2.png    →    ├── invoice2.json (ground truth)
└── invoice3.png    →    └── invoice3.json (ground truth)
                    ↓
          🚀 python batch_test.py
                    ↓
📊 Processing Flow:
┌────────────────────┐   ┌────────────────────┐   ┌────────────────────┐
│ Load Image         │ → │ Extract JSON       │ → │ Save Output        │
│ (main.py)          │   │ (OCR + Vision)     │   │ to output/         │
└────────────────────┘   └────────────────────┘   └────────────────────┘
                                   ↓
┌────────────────────┐   ┌────────────────────┐   ┌────────────────────┐
│ Load Ground        │ → │ Compare &          │ → │ Generate           │
│ Truth JSON         │   │ Test Accuracy      │   │ Summary Report     │
│ (test_accuracy.py) │   │ (test_accuracy.py) │   │                    │
└────────────────────┘   └────────────────────┘   └────────────────────┘
- Input: One image processed at a time
- Output: Individual JSON file for each image
- Comparison: Each output compared with corresponding ground truth
- Result: Comprehensive accuracy report for all files
1. Prepare Your Data:
   test_images/
   ├── invoice1.png
   ├── invoice2.png
   └── invoice3.png
   test_json/
   ├── invoice1.json (ground truth)
   ├── invoice2.json (ground truth)
   └── invoice3.json (ground truth)
2. Run Complete Pipeline:
   python batch_test.py
That's it! The system will:
- ✅ Extract JSON from each image (done in main.py)
- ✅ Save outputs to the output/ folder
- ✅ Compare each output with its ground truth (done in test_accuracy.py)
- ✅ Generate an accuracy report (done in batch_test.py)
batch_test.py automates the whole run: it sends each image to main.py, which produces the JSON for it, and then calls the comparison functionality imported from test_accuracy.py. Before comparing, test_accuracy.py runs preprocessing functions that clean the generated output, and the cleaned result is then compared with the ground truth.
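The flow described above can be sketched as a small loop. This is a rough illustration and not the contents of batch_test.py: `run_batch`, `fake_extract`, and `exact_match_score` are hypothetical stand-ins, and the demo runs against a temporary directory instead of real invoices.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def run_batch(image_dir: Path, truth_dir: Path, output_dir: Path,
              extract: Callable[[Path], dict],
              score: Callable[[dict, dict], float]) -> Dict[str, float]:
    """Process images one at a time, save each JSON, and score it against ground truth."""
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        truth_path = truth_dir / f"{image_path.stem}.json"
        if not truth_path.exists():
            continue  # no ground truth to compare against
        extracted = extract(image_path)
        out_path = output_dir / f"{image_path.stem}.json"
        out_path.write_text(json.dumps(extracted, indent=2))
        truth = json.loads(truth_path.read_text())
        results[image_path.name] = score(extracted, truth)
    return results


# Hypothetical stand-ins for the real extraction and comparison steps.
def fake_extract(image_path: Path) -> dict:
    return {"invoice_number": image_path.stem.upper()}


def exact_match_score(extracted: dict, truth: dict) -> float:
    matches = sum(1 for key, value in truth.items() if extracted.get(key) == value)
    return 100.0 * matches / len(truth) if truth else 100.0


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "test_images").mkdir()
    (root / "test_json").mkdir()
    (root / "test_images" / "invoice1.png").write_bytes(b"")
    (root / "test_json" / "invoice1.json").write_text('{"invoice_number": "INVOICE1"}')
    report = run_batch(root / "test_images", root / "test_json", root / "output",
                       fake_extract, exact_match_score)
    saved_output_exists = (root / "output" / "invoice1.json").exists()
    average = sum(report.values()) / len(report)
```

Passing the extraction and scoring functions in as arguments keeps the loop independent of the OCR and model code, which makes it easy to exercise with stubs like the ones here.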
├── main_script.py # Core extraction functionality
├── test_accuracy.py # Accuracy evaluation
├── config.json # Field configuration
├── requirements.txt # Dependencies
├── test_images/ # Sample invoice images
├── test_json/ # Ground truth data
└── output/ # Generated results
The system was tested on 48 validation invoice images with the following results:
| Metric | Performance |
|---|---|
| Average Accuracy | 99.0% |
| Perfect Extractions (100%) | 37/48 invoices (77%) |
| Near-Perfect (95-99%) | 11/48 invoices (23%) |
| Failed Extractions (<90%) | 0/48 invoices (0%) |
Every invoice scored at least 95%, so none fell below the 90% failure threshold, which suggests robust performance on the invoice formats in this validation set.
main.py:
- extract_invoice_data(): Main extraction function
- clean_json_string(): Parse model responses
- OCR preprocessing and text organization
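Parsing a model response usually means stripping whatever surrounds the JSON before decoding it. The sketch below shows one common approach under that assumption; `parse_model_json` is a hypothetical helper and may differ from what clean_json_string() actually does.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def parse_model_json(response: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in a model reply."""
    text = response.replace(FENCE + "json", "").replace(FENCE, "")
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    return json.loads(text[start:end + 1])


# Hypothetical model reply with a preamble and a fenced JSON body.
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"invoice_number": "INV-001", "line_items": []}\n'
    + FENCE
)
parsed = parse_model_json(reply)
```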
test_accuracy.py:
- test_single_invoice_direct(): Accuracy testing
- Smart comparison for numbers and text
- Handles different formats and minor variations
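A tolerant comparison of this kind might look like the sketch below. It assumes that commas are thousands separators and that only top-level scalar fields are compared; `normalize`, `values_match`, and `field_accuracy` are hypothetical names and not the functions in test_accuracy.py.

```python
import re
from decimal import Decimal, InvalidOperation


def normalize(value):
    """Return a Decimal for numeric-looking values, else lower-cased, whitespace-collapsed text."""
    text = str(value).strip()
    numeric = re.sub(r"[,$€£\s]", "", text)  # drop currency symbols and separators
    try:
        return Decimal(numeric)
    except InvalidOperation:
        return " ".join(text.lower().split())


def values_match(extracted, expected) -> bool:
    return normalize(extracted) == normalize(expected)


def field_accuracy(extracted: dict, truth: dict) -> float:
    """Percentage of ground-truth fields whose values match after normalization."""
    if not truth:
        return 100.0
    hits = sum(values_match(extracted.get(key, ""), value) for key, value in truth.items())
    return 100.0 * hits / len(truth)


same_number = values_match("$1,200.50", 1200.5)
same_text = values_match("  ACME  Corp ", "acme corp")
different = values_match("INV-001", "INV-007")
accuracy = field_accuracy(
    {"invoice_number": "inv-001", "total_amount": "45.00"},
    {"invoice_number": "INV-001", "total_amount": 45, "vendor_name": "Acme"},
)
```

Comparing through `Decimal` avoids float rounding surprises, so "45.00" and 45 count as equal while a genuinely different price does not.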
1. Image Quality:
- Use high-resolution images (minimum 300 DPI)
- Ensure good contrast and lighting
- Avoid blurry or skewed images
2. OCR Optimization:
# Adjust OCR confidence threshold in main_script.py
if confidence > 0.7:  # Increase from 0.5 to 0.7 for higher precision
3. Model Configuration:
# Fine-tune model parameters for specific fields
"options": {
    "temperature": 0.01,  # Lower for more consistent extraction
    "num_predict": 6000,  # Higher for complex invoices
}
4. Custom Field Prompts:
// In config.json, add specific instructions for problematic fields
{
  "special_instructions": {
    "total_amount": "Extract the final total amount, usually at bottom right",
    "line_items": "Focus on table structure, extract each row separately"
  }
}
5. Post-Processing Rules:
- Add validation rules for specific field formats
- Implement business logic checks (e.g., quantities × prices = totals)
- Use regex patterns for structured fields like invoice numbers
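The post-processing rules above could be combined into a single validation pass, sketched below. The invoice-number pattern is a made-up example to adapt to your own numbering scheme, `validate_invoice` is a hypothetical helper, and the line items are assumed to carry the optional `total` field from the config example.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical pattern: adapt it to the numbering scheme your invoices use.
INVOICE_NUMBER_RE = re.compile(r"^[A-Z]{2,5}-\d{3,10}$")
# Matches the YYYY/MM/DD format used in config.json.
DATE_RE = re.compile(r"^\d{4}/\d{2}/\d{2}$")


def validate_invoice(data: dict, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    if not INVOICE_NUMBER_RE.match(str(data.get("invoice_number", ""))):
        problems.append("invoice_number does not match the expected pattern")
    if not DATE_RE.match(str(data.get("invoice_date", ""))):
        problems.append("invoice_date is not in YYYY/MM/DD format")
    for index, item in enumerate(data.get("line_items", [])):
        try:
            qty = Decimal(str(item["qty"]))
            price = Decimal(str(item["price"]))
            total = Decimal(str(item["total"]))
        except (KeyError, InvalidOperation):
            problems.append(f"line_items[{index}] has a missing or non-numeric value")
            continue
        # Business logic check: quantity x unit price should equal the line total.
        if abs(qty * price - total) > tolerance:
            problems.append(f"line_items[{index}]: {qty} x {price} != {total}")
    return problems


good = {
    "invoice_number": "INV-001",
    "invoice_date": "2024/01/15",
    "line_items": [{"name": "Widget", "qty": "2", "price": "9.50", "total": "19.00"}],
}
bad = {
    "invoice_number": "INV-001",
    "invoice_date": "15-01-2024",
    "line_items": [{"name": "Widget", "qty": "2", "price": "9.50", "total": "91.00"}],
}
good_problems = validate_invoice(good)
bad_problems = validate_invoice(bad)
```

Returning a list of problems instead of raising on the first failure lets a batch run report every issue in an invoice at once.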