med-ocr

med-ocr extracts structured fields from medical and administrative documents. It supports two parallel processing paths in the same API: a traditional OCR + rules pipeline (ocr) and an LLM-first pipeline (llm).

The service currently assigns each document one of these labels (unsupported is the fallback for documents that match no supported class):

receipt
medical_certificate
referral_letter
unsupported

API Implementation Summary

The API is implemented with FastAPI in src/api/main.py, with startup-time dependency wiring via app lifespan state. Requests hit POST /ocr in src/api/routes.py, then route through PipelineDispatcher to either OCR or LLM processing.
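
The routing step described above can be sketched in plain Python. This is an illustrative stand-in, not the repository's PipelineDispatcher: the class name DispatcherSketch, the run method, and the pipeline callable signature are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical type alias: a pipeline takes page images and returns a result dict.
Pipeline = Callable[[List[Any]], Dict[str, Any]]


@dataclass
class DispatcherSketch:
    """Illustrative stand-in for PipelineDispatcher; not the repository's class."""

    ocr_pipeline: Pipeline
    llm_pipeline: Pipeline

    def run(self, pipeline_name: str, pages: List[Any]) -> Dict[str, Any]:
        # Route the already-loaded pages to the pipeline chosen for this request.
        if pipeline_name == "ocr":
            return self.ocr_pipeline(pages)
        if pipeline_name == "llm":
            return self.llm_pipeline(pages)
        raise ValueError(f"unknown pipeline: {pipeline_name!r}")
```

Keeping both pipelines behind one call site is what lets the route handler stay identical for either path.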

What This Repository Does

Accepts uploaded PDFs/images over multipart/form-data.
Loads documents into page images (PIL.Image) as a shared ingestion boundary.
Runs either:
- OCR pipeline: preprocessing -> OCR engine -> rule-based classification -> rule-based extraction -> normalization.
- LLM pipeline: LLM classification -> LLM extraction -> validation -> normalization.
Returns a stable JSON output envelope with per-stage timings.
Optionally logs every request to SQLite for evaluation and metric tracking.

Key Components (Actual Modules)

Document ingestion layer: src/utils/document_loading.py (load_document) converts uploads to page images and is used by both pipelines.
Pipeline selector (OCR vs LLM): src/api/pipeline/dispatcher.py (PipelineDispatcher) is called by src/api/routes.py and initialized in src/api/main.py.
OCR engines (PaddleOCR, Tesseract): src/ocr/engine/ contains paddleocr.py, tesseract.py, shared interface base.py, and factory build.py.
LLM processing layer: orchestration is in src/api/pipeline/llm_pipeline.py, with client/prompt/schema/validation modules in src/llm/.
Rule-based extractors: registry is in src/extraction/registry.py, implementations are in src/extraction/extractors/, and shared helpers are in src/extraction/helpers/.
Evaluation module (SQLite-backed): src/evaluation/ with schema in src/evaluation/schema.sql.

Repository Layout

med-ocr/
├── config/                     # Runtime YAML configs (OCR, prompts, logging, thresholds)
├── data/                       # Sample docs, ground-truth JSONs, evaluation DB
├── logs/                       # Runtime logs and batch output text files
├── notebooks/                  # EDA / experiment notebooks
├── scripts/                    # Server runner + batch request scripts
├── src/
│   ├── api/                    # FastAPI app, routes, dependencies, pipeline dispatch
│   ├── classification/         # Rule-based document classifier
│   ├── evaluation/             # SQLite evaluation storage + metrics
│   ├── extraction/             # Document-type extractors + helper logic
│   ├── llm/                    # LLM client abstraction, prompts, schemas, validation
│   ├── normalisation/          # Final response normalization
│   ├── ocr/                    # OCR results, line reconstruction, engines
│   ├── preprocessing/          # Shared preprocessing stages + engine profiles
│   └── utils/                  # Shared utilities (logging/config/document loading)
├── tests/                      # Pytest suite
└── README.md

Setup and Run

1) Prerequisites

Python:

Python >= 3.13 (see pyproject.toml).

Package manager:

uv is the recommended tool because the repo already includes uv.lock.

System dependencies:

Poppler (pdf2image backend for PDF rasterization).
Tesseract binary (only required when OCR engine is set to tesseract).

macOS (Homebrew):

brew install poppler tesseract

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr

2) Install Python dependencies

From repo root:

uv sync
source .venv/bin/activate

Quick sanity check:

uv run pytest -q

3) Configure environment variables

Create a local .env:

cp .env.example .env

Set at least:

OPENAI_API_KEY=your_key_here

Notes:

OPENAI_API_KEY is required for ?pipeline=llm.
You can still run ?pipeline=ocr without a valid LLM key.

4) Confirm key runtime config before first run

Open config/ocr.yaml and verify:

api.default_pipeline (currently llm in this repo).
ocr.engine (paddle or tesseract).
ocr.document_loading.pdf_dpi / pdf_format.
evaluation.enabled and evaluation.db_path.

5) Start the server

Use:

scripts/run.sh

This launches Uvicorn with reload and watches both:

src/**/*.py
config/**/*.yaml

Code and YAML config changes therefore trigger an automatic restart during local development.

6) Send a smoke-test request

OCR path:

curl -X POST "http://127.0.0.1:8000/ocr?pipeline=ocr" \
  -H "accept: application/json" \
  -F "file=@/absolute/path/to/document.pdf;type=application/pdf"

LLM path:

curl -X POST "http://127.0.0.1:8000/ocr?pipeline=llm" \
  -H "accept: application/json" \
  -F "file=@/absolute/path/to/document.pdf;type=application/pdf"

Important:

The endpoint requires multipart/form-data because the route expects an UploadFile.
curl -F ... sets the multipart encoding automatically.
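
For callers that cannot shell out to curl, the same upload can be approximated with the Python standard library. The sketch below is an assumption-laden illustration: build_multipart_body and post_document are hypothetical helper names, and only the form field name file and the URL shape come from this README.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart_body(
    field_name: str, filename: str, content: bytes, mime_type: str
) -> Tuple[bytes, str]:
    """Return (body, content_type_header) for a single-file multipart upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def post_document(url: str, path: str, mime_type: str = "application/pdf") -> bytes:
    """POST one file to the given URL and return the raw response body."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart_body(
            "file", os.path.basename(path), handle.read(), mime_type
        )
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, a call such as post_document("http://127.0.0.1:8000/ocr?pipeline=ocr", "/absolute/path/to/document.pdf") would mirror the curl command above.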

7) Validate outputs and logs

Inspect:

API response result.document_type
API response result.finalJson
API response result.stage_timings
logs/app.log for pipeline stage logs and errors

If evaluation is enabled, also inspect:

data/evaluation.sqlite3

API Contract

Endpoint

POST /ocr

Request format

Content type: multipart/form-data
Required form field:
- file (UploadFile)

Supported upload types (as currently implemented):

application/pdf
image/png
image/jpeg
image/jpg

Optional query parameter:

pipeline=ocr|llm

If pipeline is omitted, the server uses config/ocr.yaml -> api.default_pipeline. If pipeline is present but invalid, the route currently falls back to the default and logs a warning.
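
The fallback rule can be written as a small function. This is a minimal sketch under assumptions: resolve_pipeline and the logger name are hypothetical, and exact, case-sensitive matching is assumed because the README does not say how the route normalizes the value.

```python
import logging
from typing import Optional

logger = logging.getLogger("med_ocr.sketch")  # hypothetical logger name

VALID_PIPELINES = ("ocr", "llm")


def resolve_pipeline(requested: Optional[str], default: str) -> str:
    """Pick the pipeline for a request, falling back to the configured default."""
    if requested is None:
        # Query parameter omitted: use api.default_pipeline.
        return default
    if requested in VALID_PIPELINES:
        return requested
    # Present but invalid: warn and use the default instead of rejecting.
    logger.warning("Invalid pipeline %r; falling back to %r", requested, default)
    return default
```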

Optional form fields for evaluation:

ground_truth_document_type (string)
ground_truth_json (JSON object string)
evaluation_notes (string)

Ground truth JSON shapes

ground_truth_json accepts either:

Flat fields object:

{
  "claimant_name": "JOHN DOE",
  "total_amount": 49.25
}

Wrapped object:

{
  "document_type": "receipt",
  "fields": {
    "claimant_name": "JOHN DOE",
    "total_amount": 49.25
  }
}
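
One way to handle both shapes is to reduce them to a common (document_type, fields) pair. The sketch below assumes the wrapped shape is recognized by a fields key holding an object; split_ground_truth is a hypothetical name, and the repository's own parsing may differ.

```python
import json
from typing import Any, Dict, Optional, Tuple


def split_ground_truth(raw: str) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return (document_type, fields) for either accepted ground-truth shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("ground_truth_json must be a JSON object")
    # Wrapped shape: a "fields" object, optionally alongside "document_type".
    if isinstance(data.get("fields"), dict):
        return data.get("document_type"), data["fields"]
    # Flat shape: the object itself is the field mapping.
    return None, data
```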

Success response shape

{
  "message": "Processing completed.",
  "result": {
    "document_type": "receipt",
    "total_time": 2.34,
    "evaluation_run_id": 12,
    "stage_timings": {
      "document_loading": 0.2031,
      "classification": 0.0123
    },
    "finalJson": {
      "claimant_name": "JOHN DOE"
    }
  }
}

Error response contract (strict)

Error body format is intentionally fixed:

{"error":"<error_code>"}

Mappings:

400 -> {"error":"file_missing"}
- missing file upload
- unsupported upload MIME/type
422 -> {"error":"unsupported_document_type"}
- classifier result unsupported
500 -> {"error":"internal_server_error"}
- all internal runtime failures (OCR, LLM, parsing, DB, etc.)
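
The mapping above is small enough to express as a lookup table. The following is a sketch only: error_response is a hypothetical helper, and collapsing unknown codes to the generic 500 is an assumption that follows from the "all internal runtime failures" rule.

```python
from typing import Dict, Tuple

# Error code -> HTTP status, taken from the mapping above.
ERROR_STATUS: Dict[str, int] = {
    "file_missing": 400,
    "unsupported_document_type": 422,
    "internal_server_error": 500,
}


def error_response(error_code: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, body); unknown codes collapse to the generic 500."""
    if error_code not in ERROR_STATUS:
        error_code = "internal_server_error"
    return ERROR_STATUS[error_code], {"error": error_code}
```

Because the body always has the single error key, clients can branch on that value without parsing free-form messages.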

Pipeline Behavior

OCR pipeline (`src/api/pipeline/ocr_pipeline.py`)

The OCR pipeline runs these stages:

document_loading
quality_check
preprocessor_build
preprocessing
ocr
classification
extraction
normalization

Preprocessing is engine-specific and configured in config/preprocessing.yaml. Each stage is set to true, false, or check; check mode applies the stage conditionally, based on the configured thresholds.

LLM pipeline (`src/api/pipeline/llm_pipeline.py`)

The LLM pipeline runs these stages:

document_loading
classification
classification_validation
extraction
extraction_validation
normalization

Prompt templates and rules are centralized in config/llm_prompts.yaml.
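
Both pipelines report per-stage timings in stage_timings. A minimal sketch of how a stage list could be run and timed is shown below; run_stages and the (name, callable) stage shape are assumptions for illustration, and rounding to four decimals merely mirrors the sample response.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stage shape: (stage name, function applied to the running payload).
Stage = Tuple[str, Callable[[Any], Any]]


def run_stages(stages: List[Stage], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run stages in order, feeding each output to the next, and time each one."""
    timings: Dict[str, float] = {}
    for name, func in stages:
        started = time.perf_counter()
        payload = func(payload)
        # Elapsed seconds for this stage only.
        timings[name] = round(time.perf_counter() - started, 4)
    return payload, timings
```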

Configuration Guide (Common Changes You Will Actually Make)

`config/ocr.yaml`

This is the main runtime switchboard:

choose default API pipeline (api.default_pipeline)
choose OCR engine (ocr.engine)
set PDF rasterization (ocr.document_loading.pdf_dpi, pdf_format)
configure LLM model/timeout (llm.*)
enable/disable evaluation (evaluation.enabled)

Common updates:

Switch default pipeline:
- set api.default_pipeline: ocr or llm
Switch OCR engine:
- set ocr.engine: paddle or tesseract
Improve OCR rendering quality:
- increase ocr.document_loading.pdf_dpi (slower but usually clearer)
Disable eval writes temporarily:
- set evaluation.enabled: false
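
A quick pre-flight check can catch typos in these settings before the server starts. The sketch below works on an already-parsed config mapping (for example, the result of loading the YAML file); check_runtime_config is a hypothetical helper, and only the keys and allowed values listed above are taken from this README.

```python
from typing import Any, Dict, List

# Allowed values for the switches documented above.
ALLOWED = {
    ("api", "default_pipeline"): {"ocr", "llm"},
    ("ocr", "engine"): {"paddle", "tesseract"},
}


def check_runtime_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an already-parsed config mapping."""
    problems: List[str] = []
    for (section, key), allowed in ALLOWED.items():
        value = (cfg.get(section) or {}).get(key)
        if value not in allowed:
            problems.append(f"{section}.{key}={value!r} not in {sorted(allowed)}")
    enabled = (cfg.get("evaluation") or {}).get("enabled")
    if not isinstance(enabled, bool):
        problems.append("evaluation.enabled must be true or false")
    return problems
```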

`config/preprocessing.yaml`

Controls stage selection per engine:

preprocessing.tesseract.*
preprocessing.paddle.*
thresholds.* for conditional check logic

Use this when OCR quality varies by document style and you want per-engine tuning without changing code.

`config/document_classifier.yaml`

Contains:

class-specific keyword rules
regex patterns with weights
minimum score threshold (min_score_threshold)

Use this when classification is mostly correct but brittle on noisy OCR text.
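
To make the interaction between weights and min_score_threshold concrete, here is a minimal weighted-pattern classifier. It is a sketch under assumptions: the rule shape (label mapped to pattern and weight pairs), the classify function, and the example patterns are hypothetical and do not reproduce the schema of config/document_classifier.yaml.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rule shape: label -> list of (regex pattern, weight).
Rules = Dict[str, List[Tuple[str, float]]]


def classify(text: str, rules: Rules, min_score_threshold: float) -> str:
    """Sum the weights of matching patterns per label and pick the best label."""
    best_label, best_score = "unsupported", 0.0
    for label, patterns in rules.items():
        score = sum(
            weight
            for pattern, weight in patterns
            if re.search(pattern, text, flags=re.IGNORECASE)
        )
        if score > best_score:
            best_label, best_score = label, score
    # Below the threshold, the document is reported as unsupported.
    return best_label if best_score >= min_score_threshold else "unsupported"
```

Raising the threshold trades missed classifications for fewer wrong ones, which is usually the safer direction when OCR text is noisy.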

`config/llm_prompts.yaml`

Contains:

system prompt
classification prompt
extraction template
referral signature policy text

Use this when poor LLM output stems from the instructions rather than from the model or the input document.

`config/logging.yaml`

Controls:

console/file handlers
log levels
formatter
rotation policy for logs/app.log

Batch Inference Workflow

Script:

scripts/batch_ocr_curl.sh

Basic run:

scripts/batch_ocr_curl.sh \
  --list scripts/paths.txt \
  --pipeline ocr \
  --output logs/ocr_batch_results.txt

Input list formats (`--list`)

Each non-empty line can be:

Path only:

data/pdf/receipt.pdf

Path + inline ground truth JSON:

data/pdf/receipt.pdf|{"document_type":"receipt","fields":{"total_amount":49.25}}

Path + ground truth JSON file:

data/pdf/receipt.pdf|@data/ground_truth/receipt.json

Both | and the first , on a line are accepted as delimiters, but | is recommended for clarity.
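
The line format can be illustrated with a small parser. This is a Python sketch of the format only, not the logic of scripts/batch_ocr_curl.sh: parse_list_line is a hypothetical name, and preferring | over , when both appear is an assumption.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional, Tuple


def parse_list_line(line: str) -> Optional[Tuple[str, Optional[Dict[str, Any]]]]:
    """Parse one --list line into (document_path, ground_truth or None)."""
    line = line.strip()
    if not line:
        return None  # blank lines are skipped
    # Prefer "|"; otherwise fall back to the first ",".
    for delimiter in ("|", ","):
        if delimiter in line:
            path, payload = line.split(delimiter, 1)
            break
    else:
        return line, None  # path only
    payload = payload.strip()
    if payload.startswith("@"):
        # "@file" form: read the ground truth from a JSON file.
        payload = Path(payload[1:]).read_text(encoding="utf-8")
    return path.strip(), json.loads(payload)
```

The sketch also shows why | is the safer delimiter: a path containing a comma would be split in the wrong place under the comma rule.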

Typical batch options

--pipeline ocr|llm
--gt-doc-type ...
--gt-json ...
--gt-json-file ...
--eval-notes ...
--append

Output is a human-readable text report at the --output path.

Evaluation (SQLite)

When evaluation.enabled: true, each API request is persisted to SQLite (default data/evaluation.sqlite3).

Tables:

documents
runs
ocr_artifacts
field_evaluations
run_metrics

Quick inspection

List tables:

sqlite3 data/evaluation.sqlite3 ".tables"

Recent runs:

sqlite3 data/evaluation.sqlite3 \
  "SELECT id, pipeline_name, predicted_document_type, success, latency_ms, created_at FROM runs ORDER BY id DESC LIMIT 20;"

Recent metrics:

sqlite3 data/evaluation.sqlite3 \
  "SELECT run_id, classification_correct, exact_match_score, normalized_match_score, missing_rate, hallucination_rate FROM run_metrics ORDER BY run_id DESC LIMIT 20;"

Field-level diagnostics for a run:

sqlite3 data/evaluation.sqlite3 \
  "SELECT field_name, expected_value, predicted_value, exact_match, normalized_match, is_missing, is_hallucinated FROM field_evaluations WHERE run_id = 42 ORDER BY field_name;"

Python helper summary:

uv run python -c "from src.evaluation import print_run_summary; print(print_run_summary(42))"
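
The same inspection can be done from Python with the standard sqlite3 module. The helpers below are hypothetical and separate from src.evaluation; they use only the runs columns shown in the queries above, and the per-pipeline summary assumes success is stored as 0/1.

```python
import sqlite3
from typing import List, Tuple


def recent_runs(conn: sqlite3.Connection, limit: int = 20) -> List[Tuple]:
    """Return the newest rows from the runs table, newest first."""
    query = (
        "SELECT id, pipeline_name, predicted_document_type, success, latency_ms "
        "FROM runs ORDER BY id DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


def pipeline_summary(conn: sqlite3.Connection) -> List[Tuple]:
    """Return (pipeline_name, run_count, success_rate, avg_latency_ms) per pipeline."""
    query = (
        "SELECT pipeline_name, COUNT(*), AVG(success), AVG(latency_ms) "
        "FROM runs GROUP BY pipeline_name ORDER BY pipeline_name"
    )
    return conn.execute(query).fetchall()
```

A typical call would open the database with sqlite3.connect("data/evaluation.sqlite3") and pass the connection to either helper.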

Troubleshooting

`curl: (7) Failed to connect ...`

The server is not running, or the host/port is wrong.

Start server with scripts/run.sh.
Verify URL is http://127.0.0.1:8000/ocr.

`{"error":"file_missing"}`

Usually one of:

missing file form field
unsupported MIME/file type
request not sent as multipart form-data

Use:

-F "file=@/absolute/path/to/file.pdf;type=application/pdf"

`{"error":"unsupported_document_type"}`

The request was processed, but the classifier could not assign a supported class that meets the configured score threshold.

inspect OCR/LLM logs in logs/app.log
inspect config/document_classifier.yaml thresholds/rules

`{"error":"internal_server_error"}`

Check logs/app.log for traceback. Typical causes:

Poppler missing (PDF render fails)
Tesseract binary missing (when using tesseract engine)
LLM config/key issues
malformed ground-truth JSON
SQLite write failure

LLM runs fail immediately

Check:

.env has OPENAI_API_KEY
config/ocr.yaml -> llm.enabled: true
config/ocr.yaml -> llm.model is valid for your provider key

OCR is too noisy for extraction

First-line adjustments:

increase ocr.document_loading.pdf_dpi
tune config/preprocessing.yaml per engine
switch engine (paddle vs tesseract) and compare

Extending to New Document Types (Safe Playbook)

To add a new type cleanly, update classification, extraction, normalization, and tests together. Do this in small steps and verify after each step.

Step 1: Define normalized output schema first

Update:

src/normalisation/builder.py
src/normalisation/helper.py (if new field normalizers are required)

Goal: make final output contract explicit before changing upstream logic.

Step 2: Add OCR-path classification

Update:

config/document_classifier.yaml
src/classification/document_classifier.py only if needed

Goal: classifier can predict the new label reliably.

Step 3: Add OCR-path extractor

Add/update:

new extractor in src/extraction/extractors/
register in src/extraction/registry.py
helper logic in src/extraction/helpers/ if reusable

Goal: OCR pipeline can classify and extract raw fields for new type.
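
A registry of this kind is often a mapping from document type to extractor callable. The sketch below shows that general pattern only; register, get_extractor, the extractor signature, and the prescription type are all hypothetical and are not taken from src/extraction/registry.py.

```python
from typing import Callable, Dict

# Hypothetical extractor signature: OCR text in, raw field mapping out.
Extractor = Callable[[str], Dict[str, object]]

# Hypothetical registry: document type -> extractor callable.
_EXTRACTORS: Dict[str, Extractor] = {}


def register(document_type: str) -> Callable[[Extractor], Extractor]:
    """Decorator that records an extractor under its document type."""
    def decorator(func: Extractor) -> Extractor:
        if document_type in _EXTRACTORS:
            raise ValueError(f"extractor already registered: {document_type}")
        _EXTRACTORS[document_type] = func
        return func
    return decorator


def get_extractor(document_type: str) -> Extractor:
    """Look up the extractor for a classified document type."""
    try:
        return _EXTRACTORS[document_type]
    except KeyError:
        raise LookupError(f"no extractor for {document_type!r}") from None


@register("prescription")  # hypothetical new document type
def extract_prescription(ocr_text: str) -> Dict[str, object]:
    # Placeholder logic: a real extractor would parse fields from the OCR text.
    return {"raw_length": len(ocr_text)}
```

Failing loudly on duplicate or missing registrations makes a forgotten registry entry show up in tests rather than as a silent fallback.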

Step 4: Add LLM-path schema and prompts

Update:

src/llm/schemas.py
config/llm_prompts.yaml
src/llm/prompts.py only if template wiring must change

Goal: LLM pipeline can classify/extract the same type with the same final schema.

Step 5: Add evaluation and ground-truth support

Add/update:

ground-truth JSON files under data/ground_truth/
test list file for batch runs
optional analytics queries for the new fields

Step 6: Add tests and run both pipelines

Run:

uv run pytest -q

Then send the same document samples to both endpoints and confirm that the output keys are consistent:

POST /ocr?pipeline=ocr
POST /ocr?pipeline=llm
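
The key comparison can be automated once both responses are in hand. This is a minimal sketch: key_differences is a hypothetical helper that compares the top-level keys of two finalJson objects.

```python
from typing import Any, Dict, Set


def key_differences(
    ocr_json: Dict[str, Any], llm_json: Dict[str, Any]
) -> Dict[str, Set[str]]:
    """Report top-level keys present in one pipeline's finalJson but not the other."""
    ocr_keys, llm_keys = set(ocr_json), set(llm_json)
    return {
        "only_in_ocr": ocr_keys - llm_keys,
        "only_in_llm": llm_keys - ocr_keys,
    }
```

Two empty sets mean both pipelines emit the same output keys for that document.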

Development Workflow

Typical loop:

Edit code/config.
Keep server running via scripts/run.sh (auto-reload).
Test single docs via curl.
Test batches via scripts/batch_ocr_curl.sh.
Inspect logs/app.log.
Inspect evaluation DB for regression tracking.
Run uv run pytest -q before finalizing changes.

Current Limitations (Explicit)

Rule-based extractors remain layout-sensitive and degrade on very noisy documents.
Handwritten signature detection for referral letters is conservative by design (prefers false negatives over false positives).
OCR and LLM paths share normalization, but classification/extraction internals are independent and tuned separately.

