Invoice Extractor

A FastAPI service that extracts structured data from PDF invoices using text extraction (with OCR for scanned files) and a local LLM. Returns Odoo-compatible JSON.

Features

Accepts PDF uploads (text-based and scanned)
Extracts text using pdfplumber with OCR fallback (pytesseract)
Processes text with a local Ollama LLM to extract structured invoice data
Returns JSON compatible with Odoo invoice import
Runs entirely locally (no cloud API calls)
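
The text-extraction step from the list above can be sketched roughly as follows. This is a minimal sketch, not the service's actual code: the function names, the `min_chars` threshold, and the use of pdf2image (which needs Poppler) to rasterize pages before OCR are all assumptions made for illustration.

```python
import io
from typing import Callable


def extract_text_layer(pdf_bytes: bytes) -> str:
    """Read the embedded text layer with pdfplumber (empty for scanned PDFs)."""
    import pdfplumber  # imported lazily so the fallback logic works without it

    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_text_ocr(pdf_bytes: bytes) -> str:
    """Rasterize each page (pdf2image, requires Poppler) and OCR it with Tesseract."""
    from pdf2image import convert_from_bytes
    import pytesseract

    images = convert_from_bytes(pdf_bytes, dpi=300)
    return "\n".join(pytesseract.image_to_string(image) for image in images)


def extract_text(
    pdf_bytes: bytes,
    text_layer: Callable[[bytes], str] = extract_text_layer,
    ocr: Callable[[bytes], str] = extract_text_ocr,
    min_chars: int = 20,  # assumed threshold for "no usable text layer"
) -> str:
    """Prefer the text layer; fall back to OCR when it is missing or too short."""
    text = text_layer(pdf_bytes).strip()
    if len(text) >= min_chars:
        return text
    return ocr(pdf_bytes).strip()
```

Passing the two extractors as parameters keeps the fallback decision separate from the libraries, so it can be exercised in tests with simple stand-in functions.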

Quick Start with Docker

Run everything with Docker Compose (nothing besides Docker needs to be installed locally):

# Start all services (app + Ollama)
docker compose up -d

# Wait for model download (first run only, check logs)
docker compose logs -f ollama-pull

# Test the API
curl http://localhost:8000/health
curl -X POST http://localhost:8000/extract -F "file=@invoice.pdf"

# Stop all services
docker compose down

The first run downloads the Ollama model (about 2 GB), which may take a few minutes.

Manual Installation

Requirements

Python 3.11+
uv (recommended) or pip
Ollama with a compatible model
Poppler (for PDF processing)
Tesseract (for OCR)

macOS

brew install poppler tesseract

Ubuntu/Debian

sudo apt-get install poppler-utils tesseract-ocr

Installation

# Clone the repository
git clone <repository-url>
cd invoice-extractor

# Install dependencies
uv sync

# Or with pip
pip install -e .

Ollama Setup

Install Ollama and pull a compatible model:

# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve

# Pull the default model (in another terminal)
ollama pull llama3.2:3b
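
Once the model is pulled, one way to confirm that Ollama answers is to call its `/api/generate` endpoint directly. The sketch below uses only the standard library; the helper names are hypothetical and this is not necessarily how `ollama_service.py` talks to Ollama.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the model's text from the "response" field."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

With `ollama serve` running, calling `generate("http://localhost:11434", "llama3.2:3b", "Reply with one word: ready")` should return a short string. The prompt here is only a placeholder.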

Usage

Start the Server

uv run uvicorn invoice_extractor.main:app --reload

The server runs at http://localhost:8000.

API Endpoints

Health Check

curl http://localhost:8000/health

Response:

{"status": "healthy"}
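
In scripts it can be handy to wait for this response before sending invoices, for example right after `docker compose up -d`. A minimal sketch, assuming the endpoint behaves as shown above; the function names, attempt count, and delay are illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as reply:
        return json.loads(reply.read())


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,
    delay: float = 2.0,
    fetch: Callable[[str], dict] = fetch_health,
) -> bool:
    """Poll until the service reports {"status": "healthy"} or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except (OSError, ValueError):
            pass  # server not reachable yet, or the body was not valid JSON
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `fetch` parameter exists so the polling loop can be tested without a running server.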

Extract Invoice Data

curl -X POST http://localhost:8000/extract \
  -F "file=@invoice.pdf"

Response:

{
  "vendor_name": "Acme Corp",
  "vendor_vat": "FR12345678901",
  "invoice_number": "INV-2024-001",
  "invoice_reference": null,
  "invoice_date": "2024-01-15",
  "due_date": "2024-02-15",
  "currency": "EUR",
  "subtotal": 100.00,
  "tax_amount": 20.00,
  "total": 120.00,
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 1.0,
      "unit_price": 100.00,
      "amount": 100.00
    }
  ]
}
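
Because the values come from an LLM, a client may want to sanity-check the numbers before importing them. The sketch below is a heuristic, not part of the service: it assumes the numeric fields are present (not null), that each line's `amount` equals `quantity * unit_price`, and that line amounts add up to `subtotal`. Invoices with discounts or per-line taxes may legitimately fail these checks.

```python
import math


def check_totals(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []
    items = invoice.get("line_items") or []

    # Each line should satisfy quantity * unit_price == amount (within tolerance)
    for index, item in enumerate(items):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["amount"], abs_tol=tolerance):
            problems.append(f"line {index}: quantity * unit_price != amount")

    # Line amounts should add up to the subtotal
    if items:
        items_sum = sum(item["amount"] for item in items)
        if not math.isclose(items_sum, invoice["subtotal"], abs_tol=tolerance):
            problems.append("line item amounts do not add up to subtotal")

    # Subtotal plus tax should give the total
    computed_total = invoice["subtotal"] + invoice["tax_amount"]
    if not math.isclose(computed_total, invoice["total"], abs_tol=tolerance):
        problems.append("subtotal + tax_amount != total")

    return problems
```

For the sample response above, `check_totals` returns an empty list.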

API Documentation

Interactive API documentation is available at:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Configuration

Environment variables (prefix: INVOICE_EXTRACTOR_):

Variable	Default	Description
`INVOICE_EXTRACTOR_OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama API endpoint
`INVOICE_EXTRACTOR_OLLAMA_MODEL`	`llama3.2:3b`	Model to use for extraction

In my testing, meissosisai/arc1-mini:latest gave acceptable extraction quality with lower memory and CPU requirements, which makes it an option for systems with limited hardware.
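
To illustrate how the prefix and defaults fit together, here is a standard-library sketch that reads the two variables from the table. The `Settings` class and `load_settings` function are hypothetical; the service itself may load its configuration differently.

```python
import os
from dataclasses import dataclass

ENV_PREFIX = "INVOICE_EXTRACTOR_"


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    ollama_model: str


def load_settings(environ=None) -> Settings:
    """Read prefixed variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return Settings(
        ollama_base_url=env.get(f"{ENV_PREFIX}OLLAMA_BASE_URL", "http://localhost:11434"),
        ollama_model=env.get(f"{ENV_PREFIX}OLLAMA_MODEL", "llama3.2:3b"),
    )
```

For example, starting the server with `INVOICE_EXTRACTOR_OLLAMA_MODEL=meissosisai/arc1-mini:latest` set in the environment would select the lighter model while keeping the default Ollama endpoint.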

Development

Run Tests

uv run pytest tests/ -v

Project Structure

src/invoice_extractor/
├── main.py              # FastAPI application and endpoints
├── models.py            # Pydantic models (InvoiceData, ErrorResponse)
└── services/
    ├── pdf_service.py       # PDF text extraction and OCR
    ├── ollama_service.py    # Ollama LLM client
    └── extraction_service.py # Invoice data extraction logic
tests/
└── test_main.py         # API endpoint tests

Error Handling

Status Code	Description
400	Bad request (invalid file type, empty file, no text extracted)
422	Extraction failed (could not parse invoice data)
503	Service unavailable (Ollama not reachable)
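
A client can map these status codes to a decision about whether to retry. The sketch below is one possible policy, not something the service prescribes: the `ErrorAdvice` type is hypothetical, and treating only 503 as retryable is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAdvice:
    meaning: str
    retry: bool


# Meanings are taken from the table above; the retry flags are an assumed policy.
ERROR_ADVICE = {
    400: ErrorAdvice("invalid file type, empty file, or no text extracted", retry=False),
    422: ErrorAdvice("could not parse invoice data", retry=False),
    503: ErrorAdvice("Ollama not reachable", retry=True),
}


def advise(status_code: int) -> ErrorAdvice:
    """Translate an HTTP status code from /extract into a client-side decision."""
    if 200 <= status_code < 300:
        return ErrorAdvice("success", retry=False)
    return ERROR_ADVICE.get(status_code, ErrorAdvice("unexpected status", retry=False))
```

A 503 usually clears up once Ollama is running again, so retrying after a short delay is reasonable. A 400 needs a different input file before it can succeed.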

License

MIT License. See LICENSE for details.
