Akshita3104/Invoice-Data-Extractor

🚀 Advanced Invoice Extraction System

A production-ready, multi-modal invoice extraction system with state-of-the-art document understanding capabilities. Built with advanced OCR, machine learning, and LLM integration for accurate data extraction from invoices.


✨ Features

Core Capabilities

  • ✅ Multi-format Support: PDF, JPEG, PNG, TIFF
  • ✅ Quality Assessment: Automatic quality detection and adaptive preprocessing
  • ✅ Multi-Engine OCR: Tesseract, DocTR, TrOCR with intelligent routing
  • ✅ Layout Analysis: Zone segmentation, table detection, reading order
  • ✅ Document Graph: Graph Neural Networks for structural reasoning
  • ✅ Multimodal Fusion: Visual + Text + Layout + Graph features
  • ✅ Hybrid Extraction: LLM (Gemini) + Rule-based for best results
  • ✅ Multi-Layer Validation: Arithmetic, format, consistency, plausibility
  • ✅ Multiple Export Formats: Excel, CSV, JSON, PDF reports

Advanced Features

  • 🔥 Adaptive preprocessing based on image quality
  • 🔥 Ensemble OCR with confidence scoring
  • 🔥 Attention-based multimodal fusion
  • 🔥 Graph Neural Network reasoning
  • 🔥 Automatic field detection and entity classification
  • 🔥 Cross-validation between extraction methods
  • 🔥 Comprehensive confidence scoring

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Document Input                           │
│              (PDF, JPEG, PNG, TIFF)                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  INGESTION: Format handling, Quality assessment             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  PREPROCESSING: Adaptive enhancement (denoise, skew, etc)   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  OCR: Multi-engine routing (Tesseract/DocTR/TrOCR)          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYOUT ANALYSIS: Zones, Tables, Reading order              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  GRAPH: Document graph + GNN reasoning                      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  MULTIMODAL: Feature fusion (Visual+Text+Layout+Graph)      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  EXTRACTION: Hybrid LLM + Rule-based                        │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  VALIDATION: Arithmetic, Format, Consistency, Plausibility  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  EXPORT: Excel, CSV, JSON, PDF                              │
└─────────────────────────────────────────────────────────────┘
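
One way to picture the stage order in the diagram is as a chain of functions that each receive and return a shared context. The sketch below is illustrative only: `run_pipeline` and the stub stages are hypothetical names, not the project's actual classes.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes the shared context dict and returns an updated copy.
Stage = Callable[[Dict], Dict]


def run_pipeline(stages: List[Tuple[str, Stage]], context: Dict) -> Dict:
    """Run stages in order and record which ones completed."""
    context = dict(context, completed=[])
    for name, stage in stages:
        context = stage(context)
        context["completed"] = context["completed"] + [name]
    return context


# Stub stages standing in for the real ingestion/OCR/extraction logic.
def ingest(ctx: Dict) -> Dict:
    return {**ctx, "pages": 1}


def ocr(ctx: Dict) -> Dict:
    return {**ctx, "text": "INVOICE INV-001"}


def extract(ctx: Dict) -> Dict:
    return {**ctx, "invoice_number": ctx["text"].split()[-1]}


result = run_pipeline(
    [("ingestion", ingest), ("ocr", ocr), ("extraction", extract)],
    {"document_path": "invoice.pdf"},
)
print(result["completed"], result["invoice_number"])
```

Keeping every stage behind the same signature makes it straightforward to skip optional stages (graph, multimodal) when advanced features are disabled.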

🚀 Quick Start

Download Executable

⬇️ Download Invoice Extractor v1.0.1

Get Your API Key

This application requires a Google Gemini API key (free tier available):

📄 Read API Key Setup Guide

Run the Application

  1. Download and extract the executable
  2. Run invoice-extractor.exe
  3. Enter your Google Gemini API key
  4. Select invoice files (PDF, JPG, PNG, TIFF)
  5. Choose output folder
  6. Click "Process PDFs"
  7. Get structured data in Excel format!

📥 Installation

Prerequisites

System Requirements:

  • Windows 10/11 (64-bit)
  • 4GB RAM minimum (8GB recommended)
  • 2GB free disk space
  • Internet connection (for LLM API)

Required Software:

  • Python 3.8+ (for development)
  • Tesseract OCR
  • Poppler (for PDF processing)

Install Tesseract OCR

Windows:

# Download installer from:
https://github.com/UB-Mannheim/tesseract/wiki

# Add to PATH:
C:\Program Files\Tesseract-OCR

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

Install Poppler

Windows:

# Download from:
https://github.com/oschwartz10612/poppler-windows

# Add bin folder to PATH

Ubuntu/Debian:

sudo apt-get install poppler-utils

macOS:

brew install poppler

Install Python Dependencies

# Clone repository
git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor

# Install basic dependencies
pip install -r requirements.txt

Optional Dependencies

# For DocTR (Deep Learning OCR); quotes keep shells such as zsh from expanding the brackets
pip install "python-doctr[torch]"

# For TrOCR (Transformer OCR)
pip install transformers torch torchvision

# For Advanced Features (GNN, Multimodal)
pip install torch-geometric sentence-transformers

# For Development
pip install matplotlib seaborn jupyter

💻 Usage

Desktop Application (GUI)

  1. Launch the application
  2. Enter your Gemini API key
  3. Click "Upload PDF Files" and select invoices
  4. Click "Browse" to choose output folder
  5. Enter output filename (default: invoice_data.xlsx)
  6. Click "Process PDFs"
  7. Wait for processing to complete
  8. Find extracted data in the specified output folder

Python API (Basic)

from backend.backend import InvoiceExtractionPipeline

# Initialize pipeline
pipeline = InvoiceExtractionPipeline(
    api_key="your-gemini-api-key",
    enable_advanced_features=False,
    use_gpu=False
)

# Process single document
result = pipeline.process_document(
    document_path="invoice.pdf",
    output_folder="./output",
    filename="invoice_data.xlsx"
)

# Check results
if result['success']:
    print(f"Extracted {len(result['extracted_data'])} items")
    print(f"Confidence: {result['confidence']['overall_confidence']:.1%}")
    print(f"Output: {result['output_path']}")

Python API (Batch Processing)

# Process multiple documents
documents = [
    "invoices/invoice1.pdf",
    "invoices/invoice2.pdf",
    "invoices/invoice3.jpg"
]

result = pipeline.process_multiple_documents(
    document_paths=documents,
    output_folder="./output",
    filename="combined_invoices.xlsx"
)

print(f"Processed: {result['successful']}/{result['total_documents']}")
print(f"Total items: {result['total_items']}")
print(f"Failed: {len(result['failed_documents'])}")

Python API (Advanced Features)

# Enable all advanced features
pipeline = InvoiceExtractionPipeline(
    api_key="your-api-key",
    enable_advanced_features=True,
    use_gpu=True,
    config={
        'ocr': {
            'ensemble_enabled': True,
            'engines': ['tesseract', 'doctr']
        },
        'validation': {
            'arithmetic': True,
            'plausibility': True
        }
    }
)

result = pipeline.process_document("invoice.pdf", "./output")

⚙️ Configuration

Configuration Files

The system uses YAML configuration files located in the config/ directory:

  • default_config.yaml: Main configuration
  • ocr_engines.yaml: OCR engine settings
  • extraction_patterns.yaml: Regex patterns for extraction
  • validation_rules.yaml: Validation rules and thresholds

Key Configuration Options

OCR Settings:

ocr:
  default_engine: "tesseract"
  ensemble_enabled: true
  confidence_threshold: 0.7

Preprocessing:

preprocessing:
  enabled: true
  operations:
    denoise: true
    enhance_contrast: true
    deskew: true

Validation:

validation:
  enabled: true
  checks:
    arithmetic: true
    format: true
    consistency: true
  arithmetic:
    tolerance: 0.01

Export:

export:
  formats:
    - "excel"
    - "json"
  excel:
    include_confidence: true
    include_metadata: true

Custom Configuration

from backend.backend import InvoiceExtractionPipeline

custom_config = {
    'preprocessing': {
        'target_dpi': 400,
        'denoise': True
    },
    'validation': {
        'arithmetic': {
            'tolerance': 0.05  # 5% tolerance
        }
    }
}

pipeline = InvoiceExtractionPipeline(
    api_key="your-key",
    config=custom_config
)
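
A partial dictionary like `custom_config` only makes sense if it is layered over the defaults rather than replacing them. The sketch below shows a recursive merge under that assumption; `deep_merge` is a hypothetical helper, and whether the pipeline merges configuration exactly this way is not confirmed by this README.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict with overrides layered over defaults, key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


defaults = {"validation": {"enabled": True, "arithmetic": {"tolerance": 0.01}}}
custom = {"validation": {"arithmetic": {"tolerance": 0.05}}}

merged = deep_merge(defaults, custom)
print(merged)
```

With a merge of this kind, overriding `tolerance` leaves sibling keys such as `enabled` untouched.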

📁 Project Structure

invoice-extractor/
│
├── backend/                      # Backend processing
│   ├── backend.py               # Main pipeline orchestrator
│   ├── ingestion/               # Document loading & quality assessment
│   ├── preprocessing/           # Image enhancement
│   ├── ocr/                     # OCR engines & routing
│   ├── layout_analysis/         # Document structure analysis
│   ├── graph/                   # Document graph & GNN
│   ├── multimodal/              # Feature fusion
│   ├── extraction/              # Data extraction
│   ├── validation/              # Data validation
│   ├── export/                  # Output generation
│   ├── utils/                   # Utilities
│   └── models/                  # Model weights & configs
│
├── frontend/                    # Electron UI
│   ├── index.html              # Main UI
│   ├── renderer.js             # UI logic
│   └── styles.css              # Styling
│
├── config/                      # Configuration files
│   ├── default_config.yaml
│   ├── ocr_engines.yaml
│   ├── extraction_patterns.yaml
│   └── validation_rules.yaml
│
├── main.js                      # Electron main process
├── requirements.txt             # Python dependencies
├── README.md                    # This file
└── LICENSE                      # MIT License

🎯 Extracted Fields

The system extracts the following fields from invoices:

Company Information

  • Company Name
  • GST Number (GSTIN, 15 alphanumeric characters)
  • PAN Number (10 characters)
  • FSSAI Number (14 digits)
  • Address
  • Phone Number
  • Email
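
The fixed lengths of the identifiers above lend themselves to simple format checks. The sketch below uses commonly cited structural patterns for GSTIN, PAN, and FSSAI numbers; `PATTERNS` and `is_valid_format` are hypothetical names, the patterns are an assumption rather than the contents of `extraction_patterns.yaml`, and they check shape only, not checksums or registration status.

```python
import re

# Structural patterns only; they do not verify checksums or registry status.
PATTERNS = {
    # 2-digit state code + 10-character PAN + entity code + "Z" + check character
    "gstin": re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]"),
    # 5 letters + 4 digits + 1 letter
    "pan": re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]"),
    # 14 digits
    "fssai": re.compile(r"[0-9]{14}"),
}


def is_valid_format(field: str, value: str) -> bool:
    """Return True if value has the expected shape for the given field."""
    cleaned = value.strip().upper().replace(" ", "")
    return PATTERNS[field].fullmatch(cleaned) is not None


print(is_valid_format("gstin", "27AAPFU0939F1ZV"))
print(is_valid_format("pan", "AAPFU0939F"))
```

A check like this is useful as a cheap filter on OCR output before any LLM call, since a misread character usually breaks the pattern.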

Invoice Details

  • Invoice Number
  • Invoice Date (DD/MM/YYYY format)
  • Due Date (if available)
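
Invoices print dates in many layouts, so producing DD/MM/YYYY implies a normalization step. The following is a minimal sketch of one; `INPUT_FORMATS` and `normalize_date` are hypothetical, and the list of accepted input formats is an assumption.

```python
from datetime import datetime
from typing import Optional

# Candidate input formats, tried in order (day-first variants come first).
INPUT_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y", "%d %b %Y", "%Y-%m-%d"]


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as DD/MM/YYYY, or None if no known format matches."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    return None


print(normalize_date("2025-10-15"))
print(normalize_date("15 Oct 2025"))
```

Returning `None` instead of guessing lets a later validation step flag the field rather than silently storing a wrong date.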

Line Items

  • Goods Description
  • HSN/SAC Code (4-8 digits)
  • Quantity
  • Weight (with automatic unit conversion to kg)
  • Rate (per unit)
  • Amount
  • Tax Rate
  • Tax Amount
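
Converting weights to kilograms comes down to parsing a number and a unit, then applying a factor. The sketch below illustrates the idea; `TO_KG`, `WEIGHT_RE`, and `weight_to_kg` are hypothetical names, and the unit list is an assumption rather than the project's actual table.

```python
import re
from typing import Optional

# Conversion factors to kilograms; the unit list is illustrative.
TO_KG = {
    "kg": 1.0, "kgs": 1.0,
    "g": 0.001, "gm": 0.001, "gms": 0.001,
    "quintal": 100.0, "qtl": 100.0,
    "tonne": 1000.0, "mt": 1000.0,
    "lb": 0.45359237, "lbs": 0.45359237,
}

# A number (with optional thousands separators and decimals) followed by a unit.
WEIGHT_RE = re.compile(r"([0-9][0-9,]*\.?[0-9]*)\s*([a-zA-Z]+)")


def weight_to_kg(raw: str) -> Optional[float]:
    """Parse strings such as '500 g' or '2.5 Qtl'; return kilograms or None."""
    match = WEIGHT_RE.fullmatch(raw.strip())
    if not match:
        return None
    number, unit = match.groups()
    factor = TO_KG.get(unit.lower())
    if factor is None:
        return None
    return float(number.replace(",", "")) * factor


print(weight_to_kg("500 g"), weight_to_kg("2.5 Qtl"), weight_to_kg("12 boxes"))
```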

Financial Summary

  • Subtotal
  • Total Tax (CGST, SGST, IGST)
  • Discount (if applicable)
  • Grand Total
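
The financial fields are related arithmetically, which is what makes arithmetic validation possible. The sketch below checks line items and totals with a relative tolerance; the function names are hypothetical, and the assumption that the discount is subtracted after tax is a simplification that may not hold for every invoice.

```python
import math


def line_item_consistent(quantity, rate, amount, tolerance=0.01):
    """True if quantity * rate matches amount within a relative tolerance."""
    return math.isclose(quantity * rate, amount, rel_tol=tolerance)


def totals_consistent(subtotal, total_tax, grand_total, discount=0.0, tolerance=0.01):
    """True if subtotal + tax - discount matches the grand total."""
    expected = subtotal + total_tax - discount
    return math.isclose(expected, grand_total, rel_tol=tolerance)


print(line_item_consistent(10, 25.5, 255.0))
print(totals_consistent(1000.0, 180.0, 1180.0))
print(totals_consistent(1000.0, 180.0, 1300.0))
```

A relative tolerance absorbs rounding differences on large invoices without letting real OCR misreads (a dropped digit, for example) slip through.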

🚀 Advanced Features

1. Adaptive Preprocessing

Automatically adjusts preprocessing based on document quality:

# Quality assessment triggers different preprocessing
# High quality (>0.8): Minimal preprocessing
# Medium quality (0.5-0.8): Standard enhancement
# Low quality (<0.5): Aggressive enhancement + TrOCR
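
The thresholds above can be read as a small decision function. This is a sketch under assumptions: `choose_preprocessing` and the returned keys are hypothetical, and scores of exactly 0.5 and 0.8 are treated as medium quality, which the comments above leave open.

```python
def choose_preprocessing(quality: float) -> dict:
    """Map a quality score in [0, 1] to a preprocessing profile."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    if quality > 0.8:
        return {"profile": "minimal", "use_trocr": False}
    if quality >= 0.5:
        return {"profile": "standard", "use_trocr": False}
    # Low-quality input: strongest enhancement, and route OCR to TrOCR.
    return {"profile": "aggressive", "use_trocr": True}


for score in (0.9, 0.65, 0.3):
    print(score, choose_preprocessing(score))
```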

2. Ensemble OCR

Combines multiple OCR engines for improved accuracy:

pipeline = InvoiceExtractionPipeline(
    api_key="your-key",
    config={
        'ocr': {
            'ensemble_enabled': True,
            'engines': ['tesseract', 'doctr'],
            'voting_strategy': 'weighted'
        }
    }
)
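
A `'weighted'` voting strategy is commonly understood as summing a per-engine weight times the reported confidence for each candidate reading and keeping the highest total. The sketch below illustrates that interpretation; `weighted_vote` and the weight values are hypothetical, not the project's implementation.

```python
from collections import defaultdict
from typing import Dict, Tuple


def weighted_vote(
    candidates: Dict[str, Tuple[str, float]],
    weights: Dict[str, float],
) -> Tuple[str, float]:
    """candidates maps engine -> (text, confidence).

    Returns the winning text and its share of the total weighted score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for engine, (text, confidence) in candidates.items():
        scores[text] += weights.get(engine, 1.0) * confidence
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no candidate has a positive score")
    best = max(scores, key=scores.get)
    return best, scores[best] / total


best, share = weighted_vote(
    {
        "tesseract": ("INV-O01", 0.60),
        "doctr": ("INV-001", 0.90),
        "trocr": ("INV-001", 0.70),
    },
    weights={"tesseract": 1.0, "doctr": 1.2, "trocr": 1.0},
)
print(best, round(share, 3))
```

The returned share can double as an ensemble confidence: a low share signals that the engines disagree and the field deserves a closer look.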

3. Document Graph Reasoning

Uses Graph Neural Networks to understand document structure:

# Enable graph-based reasoning
config = {
    'graph': {
        'enabled': True,
        'gnn_reasoning': True
    }
}
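
Before any GNN can reason over a page, the OCR output has to become a graph. The sketch below shows one simple construction, linking each text box to its nearest right-hand neighbor on the same row, which is how label/value pairs such as "Total:" and "1180.00" end up connected. `Box`, `same_row`, and `build_row_graph` are hypothetical, the GNN itself is not shown, and the project may build its graph differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def same_row(a: Box, b: Box) -> bool:
    """Boxes share a row if their vertical centers are within half a box height."""
    return abs((a.y + a.h / 2) - (b.y + b.h / 2)) <= min(a.h, b.h) / 2


def build_row_graph(boxes: List[Box]) -> Dict[int, List[int]]:
    """Link each box to its nearest right-hand neighbor on the same row."""
    edges: Dict[int, List[int]] = {i: [] for i in range(len(boxes))}
    for i, a in enumerate(boxes):
        right = [
            (b.x - (a.x + a.w), j)
            for j, b in enumerate(boxes)
            if j != i and same_row(a, b) and b.x >= a.x + a.w
        ]
        if right:
            _, nearest = min(right)
            edges[i].append(nearest)  # undirected edge, stored both ways
            edges[nearest].append(i)
    return edges


boxes = [
    Box("Invoice No:", 10, 10, 80, 12),
    Box("INV-001", 100, 10, 60, 12),
    Box("Total:", 10, 40, 40, 12),
    Box("1180.00", 100, 41, 50, 12),
]
edges = build_row_graph(boxes)
print(edges)
```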

4. Multimodal Fusion

Combines visual, textual, and layout features:

# Enable multimodal processing
config = {
    'multimodal': {
        'enabled': True,
        'fusion': {
            'method': 'attention'
        }
    }
}
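
Attention-based fusion, in its simplest form, scores each modality's feature vector against a query, turns the scores into weights with a softmax, and returns the weighted sum. The sketch below shows that mechanism in plain Python; the function names and the fixed query are assumptions, and a trained model would learn the query and projections rather than hard-code them.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    features: Dict[str, List[float]], query: List[float]
) -> Tuple[List[float], Dict[str, float]]:
    """Fuse same-length modality vectors using scaled dot-product attention."""
    names = list(features)
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, features[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * features[name][k] for w, name in zip(weights, names))
        for k in range(dim)
    ]
    return fused, dict(zip(names, weights))


fused, weights = attention_fuse(
    {"visual": [1.0, 0.0], "text": [0.0, 1.0], "layout": [1.0, 1.0]},
    query=[1.0, 1.0],
)
print(fused, weights)
```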

5. Hybrid Extraction

Combines LLM with rule-based extraction:

# Use hybrid extraction
config = {
    'extraction': {
        'method': 'hybrid',
        'fallback_to_rules': True,
        'confidence_threshold': 0.8
    }
}
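
Read together, `fallback_to_rules` and `confidence_threshold` suggest a per-field decision: keep the LLM value when it is confident enough, otherwise fall back to the rule-based value. The sketch below spells out that reading; `merge_field` is hypothetical and the project's actual merge logic may differ.

```python
from typing import Optional, Tuple

Field = Tuple[Optional[str], float]  # (value, confidence)


def merge_field(
    llm: Field, rules: Field, threshold: float = 0.8
) -> Tuple[Optional[str], float, str]:
    """Pick one value for a field and report which extractor supplied it."""
    llm_value, llm_conf = llm
    rule_value, rule_conf = rules
    if llm_value is not None and llm_conf >= threshold:
        return llm_value, llm_conf, "llm"
    if rule_value is not None:
        return rule_value, rule_conf, "rules"
    # Nothing better available: keep the low-confidence LLM value.
    return llm_value, llm_conf, "llm"


print(merge_field(("INV-001", 0.95), ("INV-OO1", 0.60)))
print(merge_field(("INV-001", 0.50), ("INV-002", 0.70)))
```

Recording the source alongside the value makes it possible to report, per field, how often the rules had to take over.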

📊 Output Formats

Excel Output

Multi-sheet workbook containing:

  1. Invoice Data: Extracted line items
  2. Validation Issues: Detected problems
  3. Confidence Scores: Per-field confidence
  4. Summary: Aggregate statistics

JSON Output

{
  "metadata": {
    "document_path": "invoice.pdf",
    "processed_at": "2025-10-16T12:00:00",
    "processing_time": 12.5
  },
  "extracted_data": [
    {
      "company_name": "ABC Company Ltd",
      "invoice_number": "INV-001",
      "invoice_date": "15/10/2025",
      "items": [...]
    }
  ],
  "validation": {
    "issues": [],
    "passed": true
  },
  "confidence": {
    "overall": 0.95,
    "fields": {...}
  }
}

CSV Output

Simple tabular format for line items.
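
A tabular export of this kind can be produced by flattening each invoice's line items and repeating the invoice-level fields on every row. The sketch below works on a dictionary shaped like the JSON output; `line_items_to_csv` is hypothetical, and the item keys (`description`, `quantity`, `amount`) are assumptions because the README elides the item structure.

```python
import csv
import io
from typing import Dict

COLUMNS = ["invoice_number", "invoice_date", "description", "quantity", "amount"]


def line_items_to_csv(document: Dict) -> str:
    """Flatten line items into CSV rows, repeating invoice-level fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for invoice in document.get("extracted_data", []):
        for item in invoice.get("items", []):
            writer.writerow(
                {
                    "invoice_number": invoice.get("invoice_number"),
                    "invoice_date": invoice.get("invoice_date"),
                    **item,
                }
            )
    return buffer.getvalue()


sample = {
    "extracted_data": [
        {
            "invoice_number": "INV-001",
            "invoice_date": "15/10/2025",
            "items": [{"description": "Rice", "quantity": 2, "amount": 100.0}],
        }
    ]
}
csv_text = line_items_to_csv(sample)
print(csv_text)
```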


📈 Performance

Processing Speed

  • Basic Mode: 5-10 seconds per page
  • Advanced Mode: 15-30 seconds per page
  • Batch Processing: Parallel processing supported

Accuracy

  • High-quality scans: 95%+ accuracy
  • Medium-quality images: 85-95% accuracy
  • Low-quality/handwritten: 70-85% accuracy

Resource Usage

  • CPU: 2-4 cores recommended
  • RAM: 4GB minimum, 8GB recommended
  • GPU: Optional (improves speed for advanced features)
  • Disk: 2GB for models and cache

🐛 Troubleshooting

Common Issues

1. "Tesseract not found"

# Windows: Add to PATH (takes effect only in terminals opened afterwards)
setx PATH "%PATH%;C:\Program Files\Tesseract-OCR"

# Verify installation
tesseract --version
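
The same check can be scripted from Python before starting a run. The sketch below looks up the external binaries on PATH with the standard library; `missing_tools` is a hypothetical helper, and `pdftoppm` is used as the marker for Poppler because it ships with the Poppler utilities.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("tesseract", "pdftoppm")) -> List[str]:
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```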

2. "Module not found" errors

# Reinstall dependencies
pip install --upgrade -r requirements.txt

3. "API key invalid"

4. Low extraction accuracy

  • Enable preprocessing in configuration
  • Try ensemble OCR mode
  • Increase image DPI (300+ recommended)
  • Enable advanced features

5. "Memory error" during processing

  • Reduce batch size
  • Process fewer files at once
  • Close other applications
  • Increase system RAM

6. Slow processing

  • Enable GPU acceleration (if available)
  • Disable advanced features
  • Use faster OCR engine (Tesseract)
  • Process files in smaller batches
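
Processing files in smaller batches can be done by slicing the document list and calling `process_multiple_documents` once per slice. The sketch below assumes the call signature shown in the batch example earlier in this README; `chunked`, `process_in_chunks`, the default chunk size, and the per-chunk filenames are hypothetical.

```python
from typing import Iterator, List


def chunked(paths: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def process_in_chunks(pipeline, paths: List[str], output_folder: str, size: int = 5):
    """Run the pipeline on one chunk at a time, writing one workbook per chunk."""
    results = []
    for index, batch in enumerate(chunked(paths, size), start=1):
        results.append(
            pipeline.process_multiple_documents(
                document_paths=batch,
                output_folder=output_folder,
                filename=f"invoices_part{index}.xlsx",
            )
        )
    return results
```

Smaller chunks cap peak memory at the cost of producing several output workbooks instead of one.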

Debug Mode

Enable debug logging:

pipeline = InvoiceExtractionPipeline(
    api_key="your-key",
    config={
        'app': {
            'debug': True,
            'log_level': 'DEBUG'
        },
        'debug': {
            'save_intermediate': True
        }
    }
)

Log Files

Check logs in logs/ directory:

  • app.log: General application logs
  • error.log: Error traces
  • processing.log: Processing details

🔨 Building from Source

Prerequisites

  • Node.js 16+
  • Python 3.8+
  • npm or yarn

Build Steps

# 1. Clone repository
git clone https://github.com/Cherry28831/Invoice-Data-Extractor.git
cd Invoice-Data-Extractor

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Install Node dependencies
npm install

# 4. Build backend executable
cd backend
pyinstaller invoice-backend.spec
cd ..

# 5. Build Electron app
npm run build

# 6. Create installer (Windows)
npm run dist

# Output: dist/invoice-extractor-setup-1.0.1.exe

Build Configuration

Edit package.json for build settings:

{
  "build": {
    "appId": "com.invoice.extractor",
    "productName": "Invoice Extractor",
    "win": {
      "target": "nsis",
      "icon": "assets/icon.ico"
    }
  }
}

🀝 Contributing

We welcome contributions! Areas for improvement:

High Priority

  • Additional OCR engines
  • Support for more languages
  • Improved table extraction
  • Cloud deployment options

Medium Priority

  • UI/UX improvements
  • Additional export formats
  • Performance optimizations
  • Better error handling

Low Priority

  • Documentation improvements
  • Test coverage
  • Code refactoring

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Cherry28831

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

🙏 Acknowledgments

This project builds upon excellent open-source tools:

Special thanks to all contributors and the open-source community!


📞 Support

Get Help

Resources

  • Documentation: See docs/ folder for detailed guides
  • API Reference: See source code docstrings
  • Examples: See examples/ folder
  • Video Tutorial: Coming soon

🗺️ Roadmap

Version 2.0 (Planned)

  • Cloud deployment support
  • Real-time processing API
  • Mobile app (Android/iOS)
  • Multi-language support (Hindi, Spanish, French)
  • Custom model training interface

Version 1.5 (In Progress)

  • Improved table extraction
  • Better handwriting recognition
  • Batch processing optimization
  • Enhanced validation rules

Version 1.0 (Current)

  • Core extraction pipeline
  • Desktop application
  • Multi-format support
  • Validation system
  • Export to Excel/JSON/CSV

📊 Statistics

  • Stars: ⭐ Star this repo if you find it useful!
  • Downloads: 1000+ (and growing)
  • Contributors: Open for contributions
  • Issues: Check our issue tracker

Built with ❤️ for accurate invoice extraction

Made with ☕ and 🧠 by Akshita Shetty!
