ohmybugs/invoice-data-extractor

Invoice Extractor

A FastAPI service that extracts structured data from PDF invoices using OCR and a local LLM. Returns Odoo-compatible JSON.

Features

  • Accepts PDF uploads (text-based and scanned)
  • Extracts text using pdfplumber with OCR fallback (pytesseract)
  • Processes text with a local Ollama LLM to extract structured invoice data
  • Returns JSON compatible with Odoo invoice import
  • Runs entirely locally (no cloud API calls)
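The text-extraction step with OCR fallback could look roughly like the sketch below. It is an illustration only, not the code in pdf_service.py: the function names, the `min_chars` threshold, and the use of pdf2image to rasterize pages are assumptions.

```python
from typing import Callable


def extract_with_pdfplumber(pdf_path: str) -> str:
    """Read the embedded text layer, page by page."""
    import pdfplumber  # imported lazily so the fallback logic can be tested alone

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_ocr(pdf_path: str) -> str:
    """Rasterize each page (requires Poppler) and run Tesseract on the images."""
    import pytesseract
    from pdf2image import convert_from_path  # assumption: pdf2image does the rasterizing

    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)


def extract_text(
    pdf_path: str,
    text_extractor: Callable[[str], str] = extract_with_pdfplumber,
    ocr_extractor: Callable[[str], str] = extract_with_ocr,
    min_chars: int = 20,  # hypothetical threshold for "no usable text layer"
) -> str:
    """Prefer the text layer; fall back to OCR when it is empty or nearly empty."""
    text = text_extractor(pdf_path).strip()
    if len(text) >= min_chars:
        return text
    return ocr_extractor(pdf_path).strip()
```

Passing the extractors as parameters keeps the fallback decision separate from the PDF libraries, so it can be exercised without real PDF files.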

Quick Start with Docker

Run everything with Docker Compose (no local installation required):

# Start all services (app + Ollama)
docker compose up -d

# Wait for model download (first run only, check logs)
docker compose logs -f ollama-pull

# Test the API
curl http://localhost:8000/health
curl -X POST http://localhost:8000/extract -F "file=@invoice.pdf"

# Stop all services
docker compose down

The first run will download the Ollama model (~2GB), which may take a few minutes.

Manual Installation

Requirements

  • Python 3.11+
  • uv (recommended) or pip
  • Ollama with a compatible model
  • Poppler (for PDF processing)
  • Tesseract (for OCR)

macOS

brew install poppler tesseract

Ubuntu/Debian

sudo apt-get install poppler-utils tesseract-ocr

Installation

# Clone the repository
git clone <repository-url>
cd invoice-data-extractor

# Install dependencies
uv sync

# Or with pip
pip install -e .

Ollama Setup

Install Ollama and pull a compatible model:

# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve

# Pull the default model (in another terminal)
ollama pull llama3.2:3b

Usage

Start the Server

uv run uvicorn invoice_extractor.main:app --reload

The server runs at http://localhost:8000.

API Endpoints

Health Check

curl http://localhost:8000/health

Response:

{"status": "healthy"}
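When scripting against the service, it can help to poll the health endpoint until the app answers before sending invoices. The sketch below uses only the standard library; the helper names, the retry count, and the delay are assumptions chosen for illustration.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str) -> bool:
    """Return True when GET <url> answers 200 with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return (
                resp.status == 200
                and isinstance(body, dict)
                and body.get("status") == "healthy"
            )
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,   # hypothetical default
    delay: float = 2.0,   # seconds between probes, hypothetical default
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the endpoint up to `attempts` times, sleeping between failures."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` with no arguments targets the default local address and returns False if the service never responds within the allotted attempts.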

Extract Invoice Data

curl -X POST http://localhost:8000/extract \
  -F "file=@invoice.pdf"

Response:

{
  "vendor_name": "Acme Corp",
  "vendor_vat": "FR12345678901",
  "invoice_number": "INV-2024-001",
  "invoice_reference": null,
  "invoice_date": "2024-01-15",
  "due_date": "2024-02-15",
  "currency": "EUR",
  "subtotal": 100.00,
  "tax_amount": 20.00,
  "total": 120.00,
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 1.0,
      "unit_price": 100.00,
      "amount": 100.00
    }
  ]
}
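A client might load this response into typed objects and sanity-check the totals before importing it anywhere. The sketch below uses standard-library dataclasses; the class names, the choice of which fields are optional, and the rounding tolerance are assumptions, and the project's own Pydantic models in models.py may differ.

```python
import json
import math
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LineItem:
    description: str
    quantity: float
    unit_price: float
    amount: float


@dataclass
class Invoice:
    vendor_name: str
    vendor_vat: Optional[str]         # assumption: may be null
    invoice_number: str
    invoice_reference: Optional[str]  # null in the sample response
    invoice_date: str
    due_date: Optional[str]           # assumption: may be null
    currency: str
    subtotal: float
    tax_amount: float
    total: float
    line_items: list[LineItem]

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Invoice":
        """Build an Invoice from the decoded JSON; unknown keys raise TypeError."""
        fields = dict(data)
        items = [LineItem(**item) for item in fields.pop("line_items", [])]
        return cls(line_items=items, **fields)

    def totals_consistent(self, tol: float = 0.01) -> bool:
        """Check that line amounts sum to subtotal and subtotal + tax equals total."""
        lines_sum = sum(item.amount for item in self.line_items)
        return math.isclose(lines_sum, self.subtotal, abs_tol=tol) and math.isclose(
            self.subtotal + self.tax_amount, self.total, abs_tol=tol
        )


sample = json.loads("""
{"vendor_name": "Acme Corp", "vendor_vat": "FR12345678901",
 "invoice_number": "INV-2024-001", "invoice_reference": null,
 "invoice_date": "2024-01-15", "due_date": "2024-02-15", "currency": "EUR",
 "subtotal": 100.00, "tax_amount": 20.00, "total": 120.00,
 "line_items": [{"description": "Consulting services", "quantity": 1.0,
                 "unit_price": 100.00, "amount": 100.00}]}
""")
invoice = Invoice.from_dict(sample)
print(invoice.vendor_name, invoice.total, invoice.totals_consistent())
```

Since the values come from an LLM, a consistency check like `totals_consistent` is a cheap way to flag extractions that deserve a manual look.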

API Documentation

Interactive API documentation is available at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Configuration

Environment variables (prefix: INVOICE_EXTRACTOR_):

Variable                            Default                  Description
INVOICE_EXTRACTOR_OLLAMA_BASE_URL   http://localhost:11434   Ollama API endpoint
INVOICE_EXTRACTOR_OLLAMA_MODEL      llama3.2:3b              Model to use for extraction

On systems with limited hardware, I found that meissosisai/arc1-mini:latest gives acceptable extraction quality with lower memory and CPU requirements.
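As an illustration of how the prefixed variables map to settings, here is a minimal standard-library sketch. The `Settings` class and `load_settings` function are hypothetical names; the project may well read its configuration differently (for example through a settings library).

```python
import os
from dataclasses import dataclass
from typing import Mapping

ENV_PREFIX = "INVOICE_EXTRACTOR_"


@dataclass(frozen=True)
class Settings:
    # Defaults as documented in this README.
    ollama_base_url: str = "http://localhost:11434"
    ollama_model: str = "llama3.2:3b"


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Read the prefixed variables, falling back to the documented defaults."""
    defaults = Settings()
    return Settings(
        ollama_base_url=environ.get(
            ENV_PREFIX + "OLLAMA_BASE_URL", defaults.ollama_base_url
        ),
        ollama_model=environ.get(ENV_PREFIX + "OLLAMA_MODEL", defaults.ollama_model),
    )
```

For example, `load_settings({"INVOICE_EXTRACTOR_OLLAMA_MODEL": "meissosisai/arc1-mini:latest"})` overrides only the model and keeps the default base URL.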

Development

Run Tests

uv run pytest tests/ -v

Project Structure

src/invoice_extractor/
├── main.py              # FastAPI application and endpoints
├── models.py            # Pydantic models (InvoiceData, ErrorResponse)
└── services/
    ├── pdf_service.py       # PDF text extraction and OCR
    ├── ollama_service.py    # Ollama LLM client
    └── extraction_service.py # Invoice data extraction logic
tests/
└── test_main.py         # API endpoint tests

Error Handling

Status Code   Description
400           Bad request (invalid file type, empty file, no text extracted)
422           Extraction failed (could not parse invoice data)
503           Service unavailable (Ollama not reachable)
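A calling script might translate these status codes into a next step along the following lines. This is a sketch only: the `NextStep` enum and `next_step` function are hypothetical, and treating any 2xx code as success is an assumption, since the table lists only error codes.

```python
from enum import Enum


class NextStep(Enum):
    USE_RESULT = "use the returned JSON"
    FIX_UPLOAD = "check the file: wrong type, empty, or no extractable text"
    REVIEW_MANUALLY = "text was read but invoice data could not be parsed"
    RETRY_LATER = "Ollama is not reachable; retry once it is running"
    UNEXPECTED = "status code not listed in the table"


# Error codes taken from the table above.
_DOCUMENTED = {
    400: NextStep.FIX_UPLOAD,
    422: NextStep.REVIEW_MANUALLY,
    503: NextStep.RETRY_LATER,
}


def next_step(status_code: int) -> NextStep:
    """Map an HTTP status code from /extract to a suggested client action."""
    if 200 <= status_code < 300:  # assumption: any 2xx means success
        return NextStep.USE_RESULT
    return _DOCUMENTED.get(status_code, NextStep.UNEXPECTED)
```

Of the three documented errors, only 503 is worth retrying automatically; 400 and 422 will most likely repeat until the input file changes.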

License

MIT License. See LICENSE for details.
