ASandHamwich/med-ocr

med-ocr

med-ocr extracts structured fields from medical and administrative documents. It supports two parallel processing paths in the same API: a traditional OCR + rules pipeline (ocr) and an LLM-first pipeline (llm).

The service currently targets these document classes:

  • receipt
  • medical_certificate
  • referral_letter
  • unsupported

API Implementation Summary

The API is implemented with FastAPI in src/api/main.py, with startup-time dependency wiring via app lifespan state. Requests hit POST /ocr in src/api/routes.py, then route through PipelineDispatcher to either OCR or LLM processing.

What This Repository Does

  • Accepts uploaded PDFs/images over multipart/form-data.
  • Loads documents into page images (PIL.Image) as a shared ingestion boundary.
  • Runs either:
    • OCR pipeline: preprocessing -> OCR engine -> rule-based classification -> rule-based extraction -> normalization.
    • LLM pipeline: LLM classification -> LLM extraction -> validation -> normalization.
  • Returns a stable JSON output envelope with per-stage timings.
  • Optionally logs every request to SQLite for evaluation and metric tracking.

Key Components (Actual Modules)

Repository Layout

med-ocr/
├── config/                     # Runtime YAML configs (OCR, prompts, logging, thresholds)
├── data/                       # Sample docs, ground-truth JSONs, evaluation DB
├── logs/                       # Runtime logs and batch output text files
├── notebooks/                  # EDA / experiment notebooks
├── scripts/                    # Server runner + batch request scripts
├── src/
│   ├── api/                    # FastAPI app, routes, dependencies, pipeline dispatch
│   ├── classification/         # Rule-based document classifier
│   ├── evaluation/             # SQLite evaluation storage + metrics
│   ├── extraction/             # Document-type extractors + helper logic
│   ├── llm/                    # LLM client abstraction, prompts, schemas, validation
│   ├── normalisation/          # Final response normalization
│   ├── ocr/                    # OCR results, line reconstruction, engines
│   ├── preprocessing/          # Shared preprocessing stages + engine profiles
│   └── utils/                  # Shared utilities (logging/config/document loading)
├── tests/                      # Pytest suite
└── README.md

Setup and Run

1) Prerequisites

Python:

Package manager:

  • uv is the recommended tool because the repo already includes uv.lock.

System dependencies:

  • Poppler (pdf2image backend for PDF rasterization).
  • Tesseract binary (only required when OCR engine is set to tesseract).

macOS (Homebrew):

brew install poppler tesseract

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr

2) Install Python dependencies

From repo root:

uv sync
source .venv/bin/activate

Quick sanity check:

uv run pytest -q

3) Configure environment variables

Create a local .env:

cp .env.example .env

Set at least:

OPENAI_API_KEY=your_key_here

Notes:

  • OPENAI_API_KEY is required for ?pipeline=llm.
  • You can still run ?pipeline=ocr without a valid LLM key.

4) Confirm key runtime config before first run

Open config/ocr.yaml and verify:

  • api.default_pipeline (currently llm in this repo).
  • ocr.engine (paddle or tesseract).
  • ocr.document_loading.pdf_dpi / pdf_format.
  • evaluation.enabled and evaluation.db_path.

5) Start the server

Use:

scripts/run.sh

This launches Uvicorn with reload and watches both:

  • src/**/*.py
  • config/**/*.yaml

Code and YAML config changes therefore trigger an automatic restart during local development.

6) Send a smoke-test request

OCR path:

curl -X POST "http://127.0.0.1:8000/ocr?pipeline=ocr" \
  -H "accept: application/json" \
  -F "file=@/absolute/path/to/document.pdf;type=application/pdf"

LLM path:

curl -X POST "http://127.0.0.1:8000/ocr?pipeline=llm" \
  -H "accept: application/json" \
  -F "file=@/absolute/path/to/document.pdf;type=application/pdf"

Important:

  • This endpoint requires multipart/form-data because the route expects an UploadFile.
  • curl -F ... automatically sets multipart encoding.
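
For callers that prefer Python over curl, the same request can be sketched with the standard library only. This is a minimal sketch, not part of the repository: build_multipart and post_document are hypothetical helper names, and the base URL simply mirrors the curl examples above.

```python
from __future__ import annotations

import uuid
from urllib import request


def build_multipart(file_name: str, file_bytes: bytes, mime_type: str,
                    fields: dict[str, str] | None = None) -> tuple[bytes, str]:
    """Encode one file plus optional text fields as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields (for example evaluation_notes).
    for name, value in (fields or {}).items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The upload itself goes in the form field named "file".
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: {mime_type}\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post_document(path: str, pipeline: str = "ocr",
                  base_url: str = "http://127.0.0.1:8000") -> bytes:
    """Send a PDF to POST /ocr and return the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            path.rsplit("/", 1)[-1], fh.read(), "application/pdf")
    req = request.Request(
        f"{base_url}/ocr?pipeline={pipeline}", data=body, method="POST",
        headers={"Content-Type": content_type, "Accept": "application/json"})
    with request.urlopen(req) as resp:
        return resp.read()
```

Note that urllib raises urllib.error.HTTPError for 4xx and 5xx responses, so error bodies have to be read from the exception rather than from the return value.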

7) Validate outputs and logs

Inspect:

  • API response result.document_type
  • API response result.finalJson
  • API response result.stage_timings
  • logs/app.log for pipeline stage logs and errors

If evaluation is enabled, also inspect:

  • data/evaluation.sqlite3

API Contract

Endpoint

  • POST /ocr

Request format

  • Content type: multipart/form-data
  • Required form field:
    • file (UploadFile)

Supported upload types (as currently implemented):

  • application/pdf
  • image/png
  • image/jpeg
  • image/jpg

Optional query parameter:

  • pipeline=ocr|llm

If pipeline is omitted, the server uses the value of api.default_pipeline in config/ocr.yaml. If pipeline is present but invalid, the route currently falls back to the default and logs a warning.
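
The fallback rule can be sketched as a small pure function. This is illustrative only: resolve_pipeline, VALID_PIPELINES and the logger name are hypothetical and not taken from src/api/routes.py, and exact string matching is an assumption.

```python
import logging
from typing import Optional

logger = logging.getLogger("med_ocr.sketch")  # hypothetical logger name
VALID_PIPELINES = ("ocr", "llm")


def resolve_pipeline(requested: Optional[str], default: str) -> str:
    """Pick the pipeline for a request, falling back to the configured default."""
    if requested is None:
        # Query parameter omitted: use api.default_pipeline.
        return default
    if requested in VALID_PIPELINES:
        return requested
    # Present but invalid: warn and fall back instead of rejecting the request.
    logger.warning("Unknown pipeline %r; falling back to %r", requested, default)
    return default
```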

Optional form fields for evaluation:

  • ground_truth_document_type (string)
  • ground_truth_json (JSON object string)
  • evaluation_notes (string)

Ground truth JSON shapes

ground_truth_json accepts either:

  1. Flat fields object:
{
  "claimant_name": "JOHN DOE",
  "total_amount": 49.25
}
  2. Wrapped object:
{
  "document_type": "receipt",
  "fields": {
    "claimant_name": "JOHN DOE",
    "total_amount": 49.25
  }
}
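
A client or test harness that has to handle both shapes could normalize them as follows. This is a sketch under one assumption: an object counts as "wrapped" when it has a fields key holding an object. The function name parse_ground_truth is hypothetical and the service's own detection logic may differ.

```python
import json
from typing import Any, Dict, Optional, Tuple


def parse_ground_truth(raw: str) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return (document_type, fields) for either accepted JSON shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("ground_truth_json must be a JSON object")
    fields = data.get("fields")
    if isinstance(fields, dict):
        # Wrapped shape: document_type sits next to the fields object.
        return data.get("document_type"), fields
    # Flat shape: the whole object is the field mapping.
    return None, data
```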

Success response shape

{
  "message": "Processing completed.",
  "result": {
    "document_type": "receipt",
    "total_time": 2.34,
    "evaluation_run_id": 12,
    "stage_timings": {
      "document_loading": 0.2031,
      "classification": 0.0123
    },
    "finalJson": {
      "claimant_name": "JOHN DOE"
    }
  }
}

Error response contract (strict)

Error body format is intentionally fixed:

{"error":"<error_code>"}

Mappings:

  • 400 -> {"error":"file_missing"}
    • missing file upload
    • unsupported upload MIME/type
  • 422 -> {"error":"unsupported_document_type"}
    • classifier result unsupported
  • 500 -> {"error":"internal_server_error"}
    • all internal runtime failures (OCR, LLM, parsing, DB, etc.)
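
Because the error body is fixed, a client can check responses against the table above with very little code. The helper below is a hypothetical client-side sketch; it assumes successful requests return HTTP 200, which this README does not state explicitly.

```python
import json

# Status code -> error code, copied from the mappings above.
ERROR_CODES = {
    400: "file_missing",
    422: "unsupported_document_type",
    500: "internal_server_error",
}


def describe_response(status: int, body: str) -> str:
    """Summarize an /ocr response and flag anything outside the contract."""
    payload = json.loads(body)
    if status == 200:  # assumed success status
        return f"ok: {payload['result']['document_type']}"
    actual = payload.get("error")
    if ERROR_CODES.get(status) == actual:
        return f"error: {actual}"
    return f"unexpected: status={status} error={actual!r}"
```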

Pipeline Behavior

OCR pipeline (src/api/pipeline/ocr_pipeline.py)

The OCR pipeline runs these stages:

  1. document_loading
  2. quality_check
  3. preprocessor_build
  4. preprocessing
  5. ocr
  6. classification
  7. extraction
  8. normalization
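
Per-stage timings like those returned in stage_timings could be collected with a small context manager. This is a sketch, not the code in ocr_pipeline.py: StageTimer is a hypothetical name, and rounding to four decimals is an assumption based on the sample response.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Record wall-clock seconds per named stage, in execution order."""

    def __init__(self) -> None:
        self.timings: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the stage raises.
            self.timings[name] = round(time.perf_counter() - start, 4)


OCR_STAGES = (
    "document_loading", "quality_check", "preprocessor_build", "preprocessing",
    "ocr", "classification", "extraction", "normalization",
)

timer = StageTimer()
for stage_name in OCR_STAGES:
    with timer.stage(stage_name):
        pass  # real stage work would run here
```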

Preprocessing is engine-specific and configured in config/preprocessing.yaml with true, false, and check modes.

LLM pipeline (src/api/pipeline/llm_pipeline.py)

The LLM pipeline runs these stages:

  1. document_loading
  2. classification
  3. classification_validation
  4. extraction
  5. extraction_validation
  6. normalization

Prompt templates and rules are centralized in config/llm_prompts.yaml.

Configuration Guide (Common Changes You Will Actually Make)

config/ocr.yaml

This is the main runtime switchboard:

  • choose default API pipeline (api.default_pipeline)
  • choose OCR engine (ocr.engine)
  • set PDF rasterization (ocr.document_loading.pdf_dpi, pdf_format)
  • configure LLM model/timeout (llm.*)
  • enable/disable evaluation (evaluation.enabled)

Common updates:

  • Switch default pipeline:
    • set api.default_pipeline: ocr or llm
  • Switch OCR engine:
    • set ocr.engine: paddle or tesseract
  • Improve OCR rendering quality:
    • increase ocr.document_loading.pdf_dpi (slower but usually clearer)
  • Disable eval writes temporarily:
    • set evaluation.enabled: false

config/preprocessing.yaml

Controls stage selection per engine:

  • preprocessing.tesseract.*
  • preprocessing.paddle.*
  • thresholds.* for conditional check logic

Use this when OCR quality drifts by document style and you want per-engine tuning without changing code.
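
One way to read the true, false, and check modes is as a per-stage gate. The sketch below assumes that check means "run the stage only when a measured quality value falls below its threshold"; the comparison direction, the function name should_run, and the example numbers are all assumptions rather than the repository's actual logic.

```python
from typing import Union

Mode = Union[bool, str]


def should_run(mode: Mode, measured: float, threshold: float) -> bool:
    """Decide whether a preprocessing stage runs for the current page."""
    if mode is True:
        return True   # always run
    if mode is False:
        return False  # never run
    if mode == "check":
        # Assumed rule: run only when the measured quality is below the threshold.
        return measured < threshold
    raise ValueError(f"unknown preprocessing mode: {mode!r}")
```

For example, should_run("check", measured=0.3, threshold=0.5) enables a stage for a low-quality page, while the same stage is skipped for a page measured at 0.7.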

config/document_classifier.yaml

Contains:

  • class-specific keyword rules
  • regex patterns with weights
  • minimum score threshold (min_score_threshold)

Use this when classification is close but brittle on noisy OCR text.
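
To make the interplay of keywords, weighted patterns and the threshold concrete, here is a minimal scoring sketch. The rules, weights and threshold value below are invented for illustration and do not come from config/document_classifier.yaml; the real classifier in src/classification/ may score differently.

```python
import re
from typing import Any, Dict

# Hypothetical rules: each keyword hit scores 1.0, each regex hit scores its weight.
RULES: Dict[str, Dict[str, Any]] = {
    "receipt": {
        "keywords": ["receipt", "total"],
        "patterns": [(r"\btotal\s+amount\b", 2.0)],
    },
    "medical_certificate": {
        "keywords": ["medical certificate", "unfit"],
        "patterns": [(r"\bunfit\s+for\s+(work|duty)\b", 2.0)],
    },
}
MIN_SCORE_THRESHOLD = 2.0  # hypothetical value


def classify(text: str, rules: Dict[str, Dict[str, Any]] = RULES,
             min_score: float = MIN_SCORE_THRESHOLD) -> str:
    """Return the best-scoring label, or 'unsupported' below the threshold."""
    lowered = text.lower()
    scores: Dict[str, float] = {}
    for label, rule in rules.items():
        score = float(sum(1 for kw in rule["keywords"] if kw in lowered))
        score += sum(w for pat, w in rule["patterns"] if re.search(pat, lowered))
        scores[label] = score
    best = max(scores, key=lambda label: scores[label])
    return best if scores[best] >= min_score else "unsupported"
```

Raising the threshold trades missed classifications for fewer wrong labels on noisy OCR text; adding a weighted pattern is usually a more targeted fix than adding more bare keywords.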

config/llm_prompts.yaml

Contains:

  • system prompt
  • classification prompt
  • extraction template
  • referral signature policy text

Use this when LLM output quality issues are instruction-driven.

config/logging.yaml

Controls:

  • console/file handlers
  • log levels
  • formatter
  • rotation policy for logs/app.log

Batch Inference Workflow

Script:

Basic run:

scripts/batch_ocr_curl.sh \
  --list scripts/paths.txt \
  --pipeline ocr \
  --output logs/ocr_batch_results.txt

Input list formats (--list)

Each non-empty line can be:

  1. Path only:
data/pdf/receipt.pdf
  2. Path + inline ground truth JSON:
data/pdf/receipt.pdf|{"document_type":"receipt","fields":{"total_amount":49.25}}
  3. Path + ground truth JSON file:
data/pdf/receipt.pdf|@data/ground_truth/receipt.json

Both | and first , are accepted delimiters, but | is recommended for clarity.
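
The list-line rules can be expressed as a short parser. This Python sketch only illustrates the format; the actual parsing lives in the shell script, and giving | precedence over the first comma is an assumption. parse_list_line is a hypothetical name.

```python
from typing import Optional, Tuple

Parsed = Tuple[str, Optional[str], Optional[str]]


def parse_list_line(line: str) -> Optional[Parsed]:
    """Return (path, inline_json, json_file), or None for a blank line."""
    line = line.strip()
    if not line:
        return None
    if "|" in line:
        path, _, gt = line.partition("|")   # split on the first |
    elif "," in line:
        path, _, gt = line.partition(",")   # otherwise split on the first comma
    else:
        return line, None, None             # path only
    path, gt = path.strip(), gt.strip()
    if gt.startswith("@"):
        return path, None, gt[1:]           # @file reference
    return path, gt or None, None           # inline JSON
```

The comma form is ambiguous for paths that contain a comma, which is one reason to prefer | in list files.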

Typical batch options

  • --pipeline ocr|llm
  • --gt-doc-type ...
  • --gt-json ...
  • --gt-json-file ...
  • --eval-notes ...
  • --append

Output is a human-readable text report at the --output path.

Evaluation (SQLite)

When evaluation.enabled: true, each API request is persisted to SQLite (default data/evaluation.sqlite3).

Tables:

  • documents
  • runs
  • ocr_artifacts
  • field_evaluations
  • run_metrics
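
The field_evaluations table stores per-field flags (exact_match, normalized_match, is_missing, is_hallucinated). The sketch below shows one plausible way such flags could be derived from expected and predicted values. The definitions are assumptions: normalization here is just whitespace collapsing plus case folding, "missing" means expected but not predicted, and "hallucinated" means predicted but not expected. The code in src/evaluation/ may define them differently.

```python
from typing import Any, Dict, List


def _norm(value: Any) -> str:
    """Collapse whitespace and ignore case (assumed normalization)."""
    return " ".join(str(value).split()).casefold()


def evaluate_fields(expected: Dict[str, Any],
                    predicted: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one diagnostic row per field name seen on either side."""
    rows = []
    for name in sorted(set(expected) | set(predicted)):
        exp, pred = expected.get(name), predicted.get(name)
        both = exp is not None and pred is not None
        rows.append({
            "field_name": name,
            "expected_value": exp,
            "predicted_value": pred,
            "exact_match": both and exp == pred,
            "normalized_match": both and _norm(exp) == _norm(pred),
            "is_missing": exp is not None and pred is None,
            "is_hallucinated": exp is None and pred is not None,
        })
    return rows
```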

Quick inspection

List tables:

sqlite3 data/evaluation.sqlite3 ".tables"

Recent runs:

sqlite3 data/evaluation.sqlite3 \
  "SELECT id, pipeline_name, predicted_document_type, success, latency_ms, created_at FROM runs ORDER BY id DESC LIMIT 20;"

Recent metrics:

sqlite3 data/evaluation.sqlite3 \
  "SELECT run_id, classification_correct, exact_match_score, normalized_match_score, missing_rate, hallucination_rate FROM run_metrics ORDER BY run_id DESC LIMIT 20;"

Field-level diagnostics for a run:

sqlite3 data/evaluation.sqlite3 \
  "SELECT field_name, expected_value, predicted_value, exact_match, normalized_match, is_missing, is_hallucinated FROM field_evaluations WHERE run_id = 42 ORDER BY field_name;"

Python helper summary:

uv run python -c "from src.evaluation import print_run_summary; print(print_run_summary(42))"
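
The same inspection can be done from Python with the standard sqlite3 module, which is convenient in notebooks. This sketch reuses the column names from the "Recent runs" query above; recent_runs is a hypothetical helper, not part of src/evaluation.

```python
import sqlite3
from typing import List, Tuple


def recent_runs(conn: sqlite3.Connection, limit: int = 20) -> List[Tuple]:
    """Return the newest rows from the runs table, newest first."""
    cur = conn.execute(
        "SELECT id, pipeline_name, predicted_document_type, success, "
        "latency_ms, created_at FROM runs ORDER BY id DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


# Example usage against the default database path:
#   conn = sqlite3.connect("data/evaluation.sqlite3")
#   for row in recent_runs(conn, limit=5):
#       print(row)
```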

Troubleshooting

curl: (7) Failed to connect ...

The server is not running, or the host/port is wrong.

  • Start server with scripts/run.sh.
  • Verify URL is http://127.0.0.1:8000/ocr.

{"error":"file_missing"}

Usually one of:

  • missing file form field
  • unsupported MIME/file type
  • request not sent as multipart form-data

Use:

-F "file=@/absolute/path/to/file.pdf;type=application/pdf"

{"error":"unsupported_document_type"}

The request ran successfully, but the classifier could not map the document to a supported class at or above the configured confidence threshold.

  • inspect OCR/LLM logs in logs/app.log
  • inspect config/document_classifier.yaml thresholds/rules

{"error":"internal_server_error"}

Check logs/app.log for traceback. Typical causes:

  • Poppler missing (PDF render fails)
  • Tesseract binary missing (when using tesseract engine)
  • LLM config/key issues
  • malformed ground-truth JSON
  • SQLite write failure

LLM runs fail immediately

Check:

  • .env has OPENAI_API_KEY
  • config/ocr.yaml -> llm.enabled: true
  • config/ocr.yaml -> llm.model is valid for your provider key

OCR is too noisy for extraction

First-line adjustments:

  • increase ocr.document_loading.pdf_dpi
  • tune config/preprocessing.yaml per engine
  • switch engine (paddle vs tesseract) and compare

Extending to New Document Types (Safe Playbook)

To add a new type cleanly, update classification, extraction, normalization, and tests together. Do this in small steps and verify after each step.

Step 1: Define normalized output schema first

Update:

Goal: make final output contract explicit before changing upstream logic.

Step 2: Add OCR-path classification

Update:

Goal: classifier can predict the new label reliably.

Step 3: Add OCR-path extractor

Add/update:

Goal: OCR pipeline can classify and extract raw fields for new type.

Step 4: Add LLM-path schema and prompts

Update:

Goal: LLM pipeline can classify/extract the same type with the same final schema.

Step 5: Add evaluation and ground-truth support

Add/update:

  • ground-truth JSON files under data/ground_truth/
  • test list file for batch runs
  • optional analytics queries for the new fields

Step 6: Add tests and run both pipelines

Run:

uv run pytest -q

Then verify:

  • POST /ocr?pipeline=ocr
  • POST /ocr?pipeline=llm

for the same document samples and confirm output key consistency.
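
Key consistency between the two pipelines can be checked mechanically by comparing the finalJson objects from each response. The helper below is a hypothetical sketch that only compares top-level keys.

```python
from typing import Any, Dict, Set


def key_diff(ocr_json: Dict[str, Any], llm_json: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Report top-level keys present in one pipeline's finalJson but not the other."""
    ocr_keys, llm_keys = set(ocr_json), set(llm_json)
    return {
        "only_in_ocr": ocr_keys - llm_keys,
        "only_in_llm": llm_keys - ocr_keys,
    }
```

Both sets being empty means the two pipelines agree on the output schema for that sample; values can still differ.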

Development Workflow

Typical loop:

  1. Edit code/config.
  2. Keep server running via scripts/run.sh (auto-reload).
  3. Test single docs via curl.
  4. Test batches via scripts/batch_ocr_curl.sh.
  5. Inspect logs/app.log.
  6. Inspect evaluation DB for regression tracking.
  7. Run uv run pytest -q before finalizing changes.

Current Limitations (Explicit)

  • Rule-based extractors are still layout-sensitive for heavily noisy documents.
  • Handwritten signature detection for referral letters is conservative by design (prefers false negatives over false positives).
  • OCR and LLM paths share normalization, but classification/extraction internals are independent and tuned separately.

About

Exploratory OCR project on medical documents.
