# Invoice Extractor

A FastAPI service that extracts structured data from PDF invoices using OCR and a local LLM. Returns Odoo-compatible JSON.

## Features
- Accepts PDF uploads (text-based and scanned)
- Extracts text using pdfplumber with OCR fallback (pytesseract)
- Processes text with a local Ollama LLM to extract structured invoice data
- Returns JSON compatible with Odoo invoice import
- Runs entirely locally (no cloud API calls)
## Quick Start (Docker)

Run everything with Docker Compose (no local installation required):

```bash
# Start all services (app + Ollama)
docker compose up -d

# Wait for the model download to finish (first run only; check the logs)
docker compose logs -f ollama-pull

# Test the API
curl http://localhost:8000/health
curl -X POST http://localhost:8000/extract -F "file=@invoice.pdf"

# Stop all services
docker compose down
```

The first run will download the Ollama model (~2 GB), which may take a few minutes.
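Because the first run blocks on the model download, a client may want to wait until the API reports healthy before sending invoices. Here is a minimal polling sketch; the `wait_for_health` helper and its parameters are illustrative, not part of the project:

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:8000/health", attempts=60,
                    interval=2.0, fetch=None, sleep=time.sleep):
    """Poll the /health endpoint until it reports healthy.

    `fetch` and `sleep` are injectable for testing; by default a real
    HTTP GET is performed and polling waits `interval` seconds between tries.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return json.loads(resp.read())

    for _ in range(attempts):
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # service not up yet; keep polling
        sleep(interval)
    return False
```

This is handy in CI or smoke-test scripts that start the stack with `docker compose up -d` and need to know when it is safe to send the first request.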
## Requirements

- Python 3.11+
- uv (recommended) or pip
- Ollama with a compatible model
- Poppler (for PDF processing)
- Tesseract (for OCR)
### System dependencies

macOS:

```bash
brew install poppler tesseract
```

Debian/Ubuntu:

```bash
sudo apt-get install poppler-utils tesseract-ocr
```

## Installation

```bash
# Clone the repository
git clone <repository-url>
cd invoice-extractor

# Install dependencies
uv sync

# Or with pip
pip install -e .
```

### Ollama setup

Install Ollama and pull a compatible model:

```bash
# Install Ollama (macOS)
brew install ollama

# Start the Ollama service
ollama serve

# Pull the default model (in another terminal)
ollama pull llama3.2:3b
```

## Running the server

```bash
uv run uvicorn invoice_extractor.main:app --reload
```

The server runs at http://localhost:8000.
## API

### Health check

```bash
curl http://localhost:8000/health
```

Response:

```json
{"status": "healthy"}
```

### Extract invoice data

```bash
curl -X POST http://localhost:8000/extract \
  -F "file=@invoice.pdf"
```

Response:

```json
{
  "vendor_name": "Acme Corp",
  "vendor_vat": "FR12345678901",
  "invoice_number": "INV-2024-001",
  "invoice_reference": null,
  "invoice_date": "2024-01-15",
  "due_date": "2024-02-15",
  "currency": "EUR",
  "subtotal": 100.00,
  "tax_amount": 20.00,
  "total": 120.00,
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 1.0,
      "unit_price": 100.00,
      "amount": 100.00
    }
  ]
}
```

### Interactive docs

Interactive API documentation is available at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
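Before importing an extracted invoice into Odoo, it is worth verifying that the amounts the LLM returned actually add up. A small client-side sanity check, sketched here with the sample response fields above (`totals_consistent` is an illustrative helper, not part of the service):

```python
import json


def totals_consistent(invoice, tolerance=0.01):
    """Check that extracted amounts add up: line items sum to the subtotal,
    and subtotal + tax equals the total. A cheap guard against LLM
    extraction errors before importing into Odoo."""
    line_sum = sum(item["amount"] for item in invoice["line_items"])
    return (abs(line_sum - invoice["subtotal"]) <= tolerance
            and abs(invoice["subtotal"] + invoice["tax_amount"]
                    - invoice["total"]) <= tolerance)


# The numeric fields from the sample /extract response above
sample = json.loads("""{
  "subtotal": 100.00, "tax_amount": 20.00, "total": 120.00,
  "line_items": [{"description": "Consulting services",
                  "quantity": 1.0, "unit_price": 100.00, "amount": 100.00}]
}""")
print(totals_consistent(sample))  # expect True for the example response
```

Rejecting inconsistent extractions early is usually cheaper than cleaning up a bad draft invoice inside Odoo.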
## Configuration

Environment variables (prefix: `INVOICE_EXTRACTOR_`):

| Variable | Default | Description |
|---|---|---|
| `INVOICE_EXTRACTOR_OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |
| `INVOICE_EXTRACTOR_OLLAMA_MODEL` | `llama3.2:3b` | Model to use for extraction |
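A sketch of how these variables resolve at runtime, assuming a plain environment lookup with the documented defaults (the actual service may read them through a settings library instead):

```python
import os


def get_setting(name, default):
    """Read a configuration value, falling back to the documented default."""
    return os.environ.get(f"INVOICE_EXTRACTOR_{name}", default)


ollama_base_url = get_setting("OLLAMA_BASE_URL", "http://localhost:11434")
ollama_model = get_setting("OLLAMA_MODEL", "llama3.2:3b")
```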
I found that `meissosisai/arc1-mini:latest` provides acceptable extraction quality with lower memory and CPU requirements, which makes it a good alternative on constrained hardware.
## Testing

```bash
uv run pytest tests/ -v
```

## Project structure

```
src/invoice_extractor/
├── main.py                     # FastAPI application and endpoints
├── models.py                   # Pydantic models (InvoiceData, ErrorResponse)
└── services/
    ├── pdf_service.py          # PDF text extraction and OCR
    ├── ollama_service.py       # Ollama LLM client
    └── extraction_service.py   # Invoice data extraction logic
tests/
└── test_main.py                # API endpoint tests
```
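The service boundaries above suggest a layered pipeline: the PDF service produces text, the Ollama service produces a completion, and the extraction service glues them together. A sketch of that layering — all class and method names here are hypothetical illustrations, not the project's actual API:

```python
import json


class PdfService:
    """Would wrap pdfplumber, falling back to pytesseract OCR for scans."""

    def extract_text(self, pdf_bytes):
        raise NotImplementedError


class OllamaService:
    """Would send a prompt to the Ollama API and return the completion."""

    def complete(self, prompt):
        raise NotImplementedError


class ExtractionService:
    """Text out of the PDF, structured invoice data out of the LLM."""

    def __init__(self, pdf_service, ollama_service):
        self.pdf_service = pdf_service
        self.ollama_service = ollama_service

    def extract(self, pdf_bytes):
        text = self.pdf_service.extract_text(pdf_bytes)
        completion = self.ollama_service.complete(
            "Extract the invoice fields as JSON:\n" + text
        )
        return json.loads(completion)
```

Keeping the two I/O-heavy dependencies behind small interfaces like this makes the extraction logic trivially testable with stubs, which is presumably why the project splits them into separate service modules.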
## Error responses

| Status Code | Description |
|---|---|
| 400 | Bad request (invalid file type, empty file, no text extracted) |
| 422 | Extraction failed (could not parse invoice data) |
| 503 | Service unavailable (Ollama not reachable) |
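Clients can use the status code to decide whether a retry makes sense: a 503 (Ollama unreachable) is transient, while 400 and 422 mean the request or the document itself is the problem. An illustrative helper (not part of the service):

```python
# Descriptions mirror the error table above
ERRORS = {
    400: "Bad request (invalid file type, empty file, no text extracted)",
    422: "Extraction failed (could not parse invoice data)",
    503: "Service unavailable (Ollama not reachable)",
}


def should_retry(status_code):
    """Return True only for transient failures worth retrying.

    400: the upload itself is invalid -- fix the file, don't retry.
    422: the LLM could not parse the invoice -- resending the same
         file rarely helps without changing the document or model.
    503: Ollama was unreachable -- retry once the service is back up.
    """
    return status_code == 503
```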
## License

MIT License. See LICENSE for details.