Skip to content

vltech55/ledger-extract

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ledger — Schema-Enforced LLM Invoice Extraction

PDF → OCR → Claude function-calling → 9-field JSON-Schema → per-field confidence → human-in-the-loop review → immutable audit log.

Ledger feature poster

Python 3.11 FastAPI Celery Postgres Claude License: MIT

What it does

Ledger ingests invoices (PDF, JPEG, TIFF) via drag-drop, email-in, or S3 cron, runs OCR via pdf2image + pytesseract for scans (or pypdf for digital PDFs), then a Claude function-call extracts a 9-field invoice schema (vendor, invoice number, dates, line items, totals).

Each field carries an LLM confidence × 4 heuristic signals (totals match, ISO dates, ISO currency, positive amounts). High-confidence rows auto-approve into the ledger; the rest queue for human review in a side-by-side OCR + editable-fields UI. Every correction lands in an immutable audit log — replayable to restore prior extraction state.

Features

  • Hybrid OCRpdf2image rasterises scanned pages for Tesseract; pypdf parses digital PDFs natively; text positions preserved so the reviewer sees what the LLM saw.
  • Schema-enforced extraction — Claude function-calling with a frozen prompt-version pin (v1.4.2) and strict 9-field JSON-Schema; malformed outputs cannot leave the worker.
  • Per-field confidence + heuristic gate — auto-approve threshold ≥ 0.85; below that rows queue for human review.
  • HITL review queue — sorted by min_confidence ascending so the most-uncertain extractions surface first; one-click corrections.
  • Immutable audit log — every correction stored with reviewer id, timestamp, before/after JSON, and prompt SHA. Reproduce or roll back any prior extraction.

Screenshots

Documents list — 15 invoices, status pills, min-conf bars Review queue — lowest-confidence first
Review detail — per-field bars + heuristic checks + OCR text Upload — drag-drop, auto-import sources, recent batches
Vendors — by MTD spend, avg confidence, anomaly flags Period analytics — daily intake, auto-approve trend, top vendors

Stack

Layer Tech
Backend Python 3.11, FastAPI, Pydantic 2, SQLAlchemy 2 + asyncpg, Alembic
Queue Celery + Redis broker; idempotent on content-hash
OCR pdf2image + pytesseract (scans), pypdf (digital), Pillow
Extraction Anthropic Claude function-calling, 9-field strict JSON-Schema, frozen prompt v1.4.2
Storage Postgres 16; tables documents, extractions, reviews, audit_log
Frontend Next.js 14, TypeScript, Tailwind, Recharts
Ops Docker Compose, structlog, Tenacity retries

Run locally

git clone https://github.com/phantomdev0826/ledger-extract
cd ledger-extract
cp .env.example .env       # add ANTHROPIC_API_KEY
docker compose up -d --build
docker compose exec backend alembic upgrade head
docker compose exec backend python -m scripts.seed_samples   # 5 sample invoices

Open http://localhost:3000 for the dashboard. Upload more invoices via drag-drop or curl -F file=@invoice.pdf http://localhost:8000/documents.

Architecture

       ┌──────────┐
upload │ drag-drop │
       │ email-in │
       │ S3 cron  │──────┐
       └──────────┘      │
                         ▼
                  ┌──────────────┐         ┌──────────────┐
                  │ Celery task  │────────▶│ pdf2image +  │
                  │ on Redis     │         │ pytesseract  │  (or pypdf for digital)
                  └──────┬───────┘         └──────┬───────┘
                         │                         │
                         │                  ┌──────▼──────┐
                         │                  │  OCR text   │
                         │                  └──────┬──────┘
                         │                         │
                         │                ┌────────▼───────────┐
                         │                │ Claude function-   │
                         │                │ call · v1.4.2 pin  │
                         │                │ 9-field JSON-Schema│
                         │                └────────┬───────────┘
                         │                         │
                         │                  ┌──────▼─────────┐
                         │                  │ confidence × 4 │
                         │                  │ heuristic gate │
                         │                  └──────┬─────────┘
                         │                         │
                ┌────────┴──────┐        ┌────────▼─────────┐
                │ auto-approve? │ ──no──▶│  HITL review     │
                └────────┬──────┘        │  queue           │
                         │ yes           └────────┬─────────┘
                         │                        │
                         ▼                        ▼
                  ┌────────────┐           ┌─────────────┐
                  │ ledger row │           │  audit_log  │
                  └────────────┘           │ (immutable) │
                                            └─────────────┘

Tests

docker compose exec backend pytest

Includes tests for the prompt pinning, JSON-Schema strict-mode rejection, heuristic confidence calculation, and audit-log immutability.

License

MIT

About

Schema-enforced invoice extraction: PDF → OCR → Claude function-call → 9-field JSON-Schema → HITL review → immutable audit log.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors