PDF → OCR → Claude function-calling → 9-field JSON-Schema → per-field confidence → human-in-the-loop review → immutable audit log.
Ledger ingests invoices (PDF, JPEG, TIFF) via drag-drop, email-in, or S3 cron, runs OCR via pdf2image + pytesseract for scans (or pypdf for digital PDFs), then a Claude function-call extracts a 9-field invoice schema (vendor, invoice number, dates, line items, totals).
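The scan-versus-digital split for PDFs could look roughly like the sketch below. It assumes a PDF counts as digital when `pypdf` returns a non-trivial amount of embedded text. That rule, the `MIN_TEXT_CHARS` cutoff, and both function names are illustrative assumptions, not code taken from the repository.

```python
from pathlib import Path

MIN_TEXT_CHARS = 20  # hypothetical cutoff: less embedded text than this is treated as a scan


def needs_ocr(embedded_text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the embedded text layer is too thin to trust."""
    return len(embedded_text.strip()) < min_chars


def extract_text(pdf_path: Path) -> str:
    """Try the digital text layer first, then fall back to Tesseract for scans."""
    # Imported lazily so needs_ocr() stays usable without the OCR stack installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    embedded = "\n".join((page.extract_text() or "") for page in reader.pages)
    if not needs_ocr(embedded):
        return embedded

    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(str(pdf_path), dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```

Trying the text layer first keeps digital PDFs off the slower rasterise-and-OCR path.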
Each field carries an LLM confidence score combined with four heuristic signals (totals match, ISO dates, ISO currency, positive amounts). High-confidence rows auto-approve into the ledger; the rest queue for human review in a side-by-side OCR + editable-fields UI. Every correction lands in an immutable audit log that can be replayed to restore a prior extraction state.
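A minimal sketch of the four heuristic checks and the 0.85 auto-approve gate follows. How the LLM confidence and the signals are actually combined is not spelled out here, so the sketch assumes the confidence is scaled by the fraction of checks that pass. The invoice keys (`line_items`, `amount`, `total`, `invoice_date`, `due_date`, `currency`) and the small currency set are hypothetical.

```python
from datetime import date
from decimal import Decimal

AUTO_APPROVE_THRESHOLD = 0.85
ISO_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}  # illustrative subset of ISO 4217


def is_iso_date(value: str) -> bool:
    """True when the string parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def heuristic_signals(invoice: dict) -> dict[str, bool]:
    """Evaluate the four sanity checks on an extracted invoice."""
    amounts = [Decimal(item["amount"]) for item in invoice["line_items"]]
    total = Decimal(invoice["total"])
    return {
        "totals_match": sum(amounts, Decimal("0")) == total,
        "iso_dates": all(is_iso_date(invoice[k]) for k in ("invoice_date", "due_date")),
        "iso_currency": invoice["currency"] in ISO_CURRENCIES,
        "positive_amounts": total > 0 and all(a > 0 for a in amounts),
    }


def gated_confidence(llm_confidence: float, signals: dict[str, bool]) -> float:
    """Assumed combination rule: scale by the share of passing checks."""
    return llm_confidence * sum(signals.values()) / len(signals)


def auto_approves(llm_confidence: float, signals: dict[str, bool]) -> bool:
    return gated_confidence(llm_confidence, signals) >= AUTO_APPROVE_THRESHOLD
```

Under this rule a single failed check caps the score at 0.75 of the LLM confidence, so any heuristic failure sends the row to review.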
- Hybrid OCR: `pdf2image` rasterises scanned pages for Tesseract; `pypdf` parses digital PDFs natively; text positions preserved so the reviewer sees what the LLM saw.
- Schema-enforced extraction: Claude function-calling with a frozen prompt-version pin (v1.4.2) and strict 9-field JSON-Schema; malformed outputs cannot leave the worker.
- Per-field confidence + heuristic gate: auto-approve threshold ≥ 0.85; below that rows queue for human review.
- HITL review queue: sorted by `min_confidence` ascending so the most-uncertain extractions surface first; one-click corrections.
- Immutable audit log: every correction stored with reviewer id, timestamp, before/after JSON, and prompt SHA. Reproduce or roll back any prior extraction.
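The audit-log item above can be illustrated with an append-only, in-memory stand-in for the `audit_log` table. This is a sketch under assumptions: the class and method names are hypothetical, SHA-256 is an assumed choice for the prompt SHA, and a real deployment would more likely enforce immutability in Postgres itself.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def prompt_sha(prompt_text: str) -> str:
    """Fingerprint of the pinned prompt (SHA-256 is an assumption)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditEntry:
    extraction_id: str
    reviewer_id: str
    timestamp: datetime
    before_json: str  # serialized snapshots, so later dict mutation cannot alter history
    after_json: str
    prompt_sha: str


class AuditLog:
    """Append-only log: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[AuditEntry] = []

    def record(self, extraction_id: str, reviewer_id: str,
               before: dict, after: dict, sha: str) -> AuditEntry:
        entry = AuditEntry(
            extraction_id=extraction_id,
            reviewer_id=reviewer_id,
            timestamp=datetime.now(timezone.utc),
            before_json=json.dumps(before, sort_keys=True),
            after_json=json.dumps(after, sort_keys=True),
            prompt_sha=sha,
        )
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[AuditEntry, ...]:
        return tuple(self._entries)  # a copy, so callers cannot mutate the log

    def state_before(self, index: int) -> dict:
        """Recover the extraction as it was before correction number `index`."""
        return json.loads(self._entries[index].before_json)
```

Storing both snapshots per entry means a rollback only needs the single entry being undone, not a replay of the whole history.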
| Layer | Tech |
|---|---|
| Backend | Python 3.11, FastAPI, Pydantic 2, SQLAlchemy 2 + asyncpg, Alembic |
| Queue | Celery + Redis broker; idempotent on content-hash |
| OCR | pdf2image + pytesseract (scans), pypdf (digital), Pillow |
| Extraction | Anthropic Claude function-calling, 9-field strict JSON-Schema, frozen prompt v1.4.2 |
| Storage | Postgres 16; tables documents, extractions, reviews, audit_log |
| Frontend | Next.js 14, TypeScript, Tailwind, Recharts |
| Ops | Docker Compose, structlog, Tenacity retries |
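The "idempotent on content-hash" entry in the Queue row can be sketched as follows: the uploaded bytes are hashed and the digest is used as a deduplication key, so re-uploading the same file does not enqueue a second job. SHA-256 and the `IngestRegistry` class are assumptions for illustration; in the real stack the key would presumably live in Redis or Postgres rather than in memory.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Digest of the raw file bytes; the filename is deliberately ignored."""
    return hashlib.sha256(data).hexdigest()


class IngestRegistry:
    """In-memory stand-in for a unique index on the content hash."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def submit(self, filename: str, data: bytes) -> tuple[str, bool]:
        """Return (hash, enqueued); enqueued is False for a repeat upload."""
        key = content_hash(data)
        if key in self._seen:
            return key, False
        self._seen[key] = filename
        return key, True
```

Keying on content rather than filename means the same invoice arriving by drag-drop and by email is processed once.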

    git clone https://github.com/phantomdev0826/ledger-extract
    cd ledger-extract
    cp .env.example .env   # add ANTHROPIC_API_KEY
    docker compose up -d --build
    docker compose exec backend alembic upgrade head
    docker compose exec backend python -m scripts.seed_samples   # 5 sample invoices

Open http://localhost:3000 for the dashboard. Upload more invoices via drag-drop or `curl -F file=@invoice.pdf http://localhost:8000/documents`.

           ┌───────────┐
    upload │ drag-drop │
           │ email-in  │
           │ S3 cron   │──────┐
           └───────────┘      │
                              ▼
                       ┌──────────────┐         ┌──────────────┐
                       │ Celery task  │────────▶│ pdf2image +  │
                       │ on Redis     │         │ pytesseract  │  (or pypdf for digital)
                       └──────┬───────┘         └──────┬───────┘
                              │                        │
                              │                 ┌──────▼──────┐
                              │                 │  OCR text   │
                              │                 └──────┬──────┘
                              │                        │
                              │               ┌────────▼───────────┐
                              │               │ Claude function-   │
                              │               │ call · v1.4.2 pin  │
                              │               │ 9-field JSON-Schema│
                              │               └────────┬───────────┘
                              │                        │
                              │                 ┌──────▼─────────┐
                              │                 │ confidence × 4 │
                              │                 │ heuristic gate │
                              │                 └──────┬─────────┘
                              │                        │
                     ┌────────┴──────┐        ┌────────▼─────────┐
                     │ auto-approve? │ ──no──▶│ HITL review      │
                     └────────┬──────┘        │ queue            │
                              │ yes           └────────┬─────────┘
                              │                        │
                              ▼                        ▼
                        ┌────────────┐          ┌─────────────┐
                        │ ledger row │          │ audit_log   │
                        └────────────┘          │ (immutable) │
                                                └─────────────┘
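The auto-approve branch at the bottom of the diagram can be sketched as a partition step. The sketch assumes a row auto-approves only when its weakest field reaches the 0.85 threshold, and that everything else goes to the review queue sorted by `min_confidence` ascending. The `Extraction` class and the `route` function are hypothetical names.

```python
from dataclasses import dataclass

AUTO_APPROVE_THRESHOLD = 0.85


@dataclass
class Extraction:
    document_id: str
    field_confidence: dict[str, float]  # per-field scores after the heuristic gate

    @property
    def min_confidence(self) -> float:
        """The weakest field decides how the whole row is treated (assumption)."""
        return min(self.field_confidence.values())


def route(extractions: list[Extraction]) -> tuple[list[Extraction], list[Extraction]]:
    """Split rows into (auto_approved, review_queue), least certain first in the queue."""
    approved = [e for e in extractions if e.min_confidence >= AUTO_APPROVE_THRESHOLD]
    queue = sorted(
        (e for e in extractions if e.min_confidence < AUTO_APPROVE_THRESHOLD),
        key=lambda e: e.min_confidence,
    )
    return approved, queue
```

Sorting ascending puts the rows most likely to be wrong at the top of the reviewer's list.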

Run the test suite with `docker compose exec backend pytest`. It includes tests for the prompt pinning, JSON-Schema strict-mode rejection, heuristic confidence calculation, and audit-log immutability.

License: MIT






