med-ocr extracts structured fields from medical and administrative documents. It supports two parallel processing paths in the same API: a traditional OCR + rules pipeline (ocr) and an LLM-first pipeline (llm).
The service currently targets these document classes:
- receipt
- medical_certificate
- referral_letter
- unsupported
The API is implemented with FastAPI in src/api/main.py, with startup-time dependency wiring via app lifespan state. Requests hit POST /ocr in src/api/routes.py, then route through PipelineDispatcher to either OCR or LLM processing.
- Accepts uploaded PDFs/images over multipart/form-data.
- Loads documents into page images (PIL.Image) as a shared ingestion boundary.
- Runs either:
  - OCR pipeline: preprocessing -> OCR engine -> rule-based classification -> rule-based extraction -> normalization.
  - LLM pipeline: LLM classification -> LLM extraction -> validation -> normalization.
- Returns a stable JSON output envelope with per-stage timings.
- Optionally logs every request to SQLite for evaluation and metric tracking.
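The "ordered stages with per-stage timings" idea above can be sketched in a few lines. This is only an illustration: run_stages and the two demo stages are hypothetical names, not the functions used in src/api/pipeline/.

```python
import time
from typing import Any, Callable

# A stage is a (name, function) pair; each function receives the previous output.
Stage = tuple[str, Callable[[Any], Any]]


def run_stages(stages: list[Stage], payload: Any) -> tuple[Any, dict[str, float]]:
    """Run stages in order and record how long each one took, in seconds."""
    timings: dict[str, float] = {}
    for name, func in stages:
        started = time.perf_counter()
        payload = func(payload)
        timings[name] = round(time.perf_counter() - started, 4)
    return payload, timings


# Hypothetical stand-ins for the real classification and extraction stages.
demo_stages: list[Stage] = [
    ("classification", lambda text: {"text": text, "document_type": "receipt"}),
    ("extraction", lambda doc: {**doc, "fields": {"claimant_name": "JOHN DOE"}}),
]

result, stage_timings = run_stages(demo_stages, "RECEIPT JOHN DOE")
print(result["document_type"], list(stage_timings))
```

Keeping the timings in a plain dict keyed by stage name makes it easy to copy them into the response envelope unchanged.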
- Document ingestion layer: src/utils/document_loading.py (load_document) converts uploads to page images and is used by both pipelines.
- Pipeline selector (OCR vs LLM): src/api/pipeline/dispatcher.py (PipelineDispatcher) is called by src/api/routes.py and initialized in src/api/main.py.
- OCR engines (PaddleOCR, Tesseract): src/ocr/engine/ contains paddleocr.py, tesseract.py, shared interface base.py, and factory build.py.
- LLM processing layer: orchestration is in src/api/pipeline/llm_pipeline.py, with client/prompt/schema/validation modules in src/llm/.
- Rule-based extractors: registry is in src/extraction/registry.py, implementations are in src/extraction/extractors/, and shared helpers are in src/extraction/helpers/.
- Evaluation module (SQLite-backed): src/evaluation/ with schema in src/evaluation/schema.sql.
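The "shared interface plus factory" layout described for the OCR engines might look roughly like the sketch below. All class and function names here (OCREngine, FakePaddleEngine, FakeTesseractEngine, build_engine) are hypothetical placeholders; the real definitions live in src/ocr/engine/ and may differ.

```python
from typing import Protocol


class OCREngine(Protocol):
    """Hypothetical shared interface that every engine would satisfy."""

    def recognize(self, image: object) -> list[str]: ...


class FakePaddleEngine:
    """Stand-in for a PaddleOCR-backed engine."""

    def recognize(self, image: object) -> list[str]:
        return ["paddle output"]


class FakeTesseractEngine:
    """Stand-in for a Tesseract-backed engine."""

    def recognize(self, image: object) -> list[str]:
        return ["tesseract output"]


# Registry keyed by the values documented for ocr.engine.
_ENGINES: dict[str, type] = {
    "paddle": FakePaddleEngine,
    "tesseract": FakeTesseractEngine,
}


def build_engine(name: str) -> OCREngine:
    """Return an engine instance for the configured name, or fail loudly."""
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown OCR engine: {name!r}") from None


print(build_engine("paddle").recognize(None))
```

With this shape, the rest of the pipeline depends only on the interface, so switching ocr.engine does not touch calling code.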
med-ocr/
├── config/ # Runtime YAML configs (OCR, prompts, logging, thresholds)
├── data/ # Sample docs, ground-truth JSONs, evaluation DB
├── logs/ # Runtime logs and batch output text files
├── notebooks/ # EDA / experiment notebooks
├── scripts/ # Server runner + batch request scripts
├── src/
│ ├── api/ # FastAPI app, routes, dependencies, pipeline dispatch
│ ├── classification/ # Rule-based document classifier
│ ├── evaluation/ # SQLite evaluation storage + metrics
│ ├── extraction/ # Document-type extractors + helper logic
│ ├── llm/ # LLM client abstraction, prompts, schemas, validation
│ ├── normalisation/ # Final response normalization
│ ├── ocr/ # OCR results, line reconstruction, engines
│ ├── preprocessing/ # Shared preprocessing stages + engine profiles
│ └── utils/ # Shared utilities (logging/config/document loading)
├── tests/ # Pytest suite
└── README.md
Python: Python >= 3.13 (see pyproject.toml).
Package manager: uv is the recommended tool because the repo already includes uv.lock.
System dependencies:
- Poppler (pdf2image backend for PDF rasterization).
- Tesseract binary (only required when OCR engine is set to tesseract).
macOS (Homebrew):
brew install poppler tesseract
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr
From repo root:
uv sync
source .venv/bin/activate
Quick sanity check:
uv run pytest -q
Create a local .env:
cp .env.example .env
Set at least:
OPENAI_API_KEY=your_key_here
Notes:
- OPENAI_API_KEY is required for ?pipeline=llm.
- You can still run ?pipeline=ocr without a valid LLM key.
Open config/ocr.yaml and verify:
- api.default_pipeline (currently llm in this repo).
- ocr.engine (paddle or tesseract).
- ocr.document_loading.pdf_dpi / pdf_format.
- evaluation.enabled and evaluation.db_path.
Use:
scripts/run.sh
This launches Uvicorn with reload and watches both:
- src/**/*.py
- config/**/*.yaml
So code and YAML config updates trigger a restart automatically in local dev.
OCR path:
curl -X POST "http://127.0.0.1:8000/ocr?pipeline=ocr" \
-H "accept: application/json" \
-F "file=@/absolute/path/to/document.pdf;type=application/pdf"
LLM path:
curl -X POST "http://127.0.0.1:8000/ocr?pipeline=llm" \
-H "accept: application/json" \
-F "file=@/absolute/path/to/document.pdf;type=application/pdf"
Important:
- This endpoint is multipart/form-data because it expects UploadFile.
- curl -F ... automatically sets multipart encoding.
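For callers that prefer Python over curl, the same multipart request can be built with the standard library alone. This is a minimal sketch: build_multipart and post_document are hypothetical helper names, and the host, port, and file form field are taken from the curl commands above.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, content: bytes, mime: str) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def post_document(path: str, pipeline: str = "ocr",
                  base_url: str = "http://127.0.0.1:8000") -> bytes:
    """POST a local file to /ocr and return the raw JSON response bytes."""
    file_path = Path(path)
    mime = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes(), mime)
    request = urllib.request.Request(
        f"{base_url}/ocr?pipeline={pipeline}",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, post_document("/absolute/path/to/document.pdf", pipeline="llm") would send the same request as the second curl command.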
Inspect:
- API response result.document_type
- API response result.finalJson
- API response result.stage_timings
- logs/app.log for pipeline stage logs and errors
If evaluation is enabled, also inspect:
- data/evaluation.sqlite3
POST /ocr
- Content type: multipart/form-data
- Required form field: file (UploadFile)
Supported upload types (as currently implemented):
- application/pdf
- image/png
- image/jpeg
- image/jpg
Optional query parameter:
pipeline=ocr|llm
If pipeline is omitted, the server uses config/ocr.yaml -> api.default_pipeline.
If pipeline is present but invalid, the route currently falls back to the default and logs a warning.
Optional form fields for evaluation:
- ground_truth_document_type (string)
- ground_truth_json (JSON object string)
- evaluation_notes (string)
ground_truth_json accepts either:
- Flat fields object:
{
"claimant_name": "JOHN DOE",
"total_amount": 49.25
}
- Wrapped object:
{
"document_type": "receipt",
"fields": {
"claimant_name": "JOHN DOE",
"total_amount": 49.25
}
}
Example success response:
{
"message": "Processing completed.",
"result": {
"document_type": "receipt",
"total_time": 2.34,
"evaluation_run_id": 12,
"stage_timings": {
"document_loading": 0.2031,
"classification": 0.0123
},
"finalJson": {
"claimant_name": "JOHN DOE"
}
}
}
Error body format is intentionally fixed:
{"error":"<error_code>"}
Mappings:
- 400 -> {"error":"file_missing"}
  - missing file upload
  - unsupported upload MIME/type
- 422 -> {"error":"unsupported_document_type"}
  - classifier result unsupported
- 500 -> {"error":"internal_server_error"}
  - all internal runtime failures (OCR, LLM, parsing, DB, etc.)
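The fixed error mapping above can be sketched as one translation function. The exception classes and to_error_response are hypothetical names introduced for illustration; only the status codes and error bodies come from the mapping itself.

```python
class FileMissingError(Exception):
    """Hypothetical: raised for a missing upload or an unsupported MIME type."""


class UnsupportedDocumentTypeError(Exception):
    """Hypothetical: raised when the classifier result is unsupported."""


def to_error_response(exc: Exception) -> tuple[int, dict[str, str]]:
    """Translate an exception into the (status_code, body) pair listed above."""
    if isinstance(exc, FileMissingError):
        return 400, {"error": "file_missing"}
    if isinstance(exc, UnsupportedDocumentTypeError):
        return 422, {"error": "unsupported_document_type"}
    # Everything else (OCR, LLM, parsing, DB, ...) collapses into one code.
    return 500, {"error": "internal_server_error"}


print(to_error_response(RuntimeError("sqlite write failed")))
```

Collapsing all unexpected failures into a single 500 body keeps internal details out of API responses; the traceback stays in the logs.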
The OCR pipeline runs these stages:
- document_loading
- quality_check
- preprocessor_build
- preprocessing
- ocr
- classification
- extraction
- normalization
Preprocessing is engine-specific and configured in config/preprocessing.yaml with true, false, and check modes.
The LLM pipeline runs these stages:
- document_loading
- classification
- classification_validation
- extraction
- extraction_validation
- normalization
Prompt templates and rules are centralized in config/llm_prompts.yaml.
config/ocr.yaml is the main runtime switchboard:
- choose default API pipeline (api.default_pipeline)
- choose OCR engine (ocr.engine)
- set PDF rasterization (ocr.document_loading.pdf_dpi, pdf_format)
- configure LLM model/timeout (llm.*)
- enable/disable evaluation (evaluation.enabled)
Common updates:
- Switch default pipeline:
  - set api.default_pipeline: ocr or llm
- Switch OCR engine:
  - set ocr.engine: paddle or tesseract
- Improve OCR rendering quality:
  - increase ocr.document_loading.pdf_dpi (slower but usually clearer)
- Disable eval writes temporarily:
  - set evaluation.enabled: false
config/preprocessing.yaml controls stage selection per engine:
- preprocessing.tesseract.*
- preprocessing.paddle.*
- thresholds.* for conditional check logic
Use this when OCR quality drifts by document style and you want per-engine tuning without changing code.
config/document_classifier.yaml contains:
- class-specific keyword rules
- regex patterns with weights
- minimum score threshold (min_score_threshold)
Use this when classification is close but brittle on noisy OCR text.
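To make the interaction between keyword rules, weighted regex patterns, and min_score_threshold concrete, here is a minimal scoring sketch. The rule contents, weights, and the classify function are invented for illustration and are not the values shipped in the classifier config.

```python
import re

# Hypothetical rules: keyword -> weight, plus (regex, weight) pairs per class.
RULES = {
    "receipt": {
        "keywords": {"receipt": 2.0, "total": 1.0},
        "patterns": [(r"\$\s?\d+\.\d{2}", 1.5)],
    },
    "medical_certificate": {
        "keywords": {"medical certificate": 3.0, "unfit": 1.0},
        "patterns": [],
    },
}


def classify(text: str, rules: dict = RULES, min_score_threshold: float = 2.0) -> tuple[str, float]:
    """Score every class and return the best one, or 'unsupported' if too weak."""
    lowered = text.lower()
    scores: dict[str, float] = {}
    for label, rule in rules.items():
        score = sum(w for kw, w in rule["keywords"].items() if kw in lowered)
        score += sum(w for pat, w in rule["patterns"] if re.search(pat, text, re.IGNORECASE))
        scores[label] = score
    best = max(scores, key=lambda label: scores[label])
    if scores[best] < min_score_threshold:
        return "unsupported", scores[best]
    return best, scores[best]


print(classify("RECEIPT Total $49.25"))
print(classify("hello world"))
```

Raising the threshold trades more unsupported results for fewer wrong labels on noisy OCR text.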
config/llm_prompts.yaml contains:
- system prompt
- classification prompt
- extraction template
- referral signature policy text
Use this when LLM output quality issues are instruction-driven.
The logging config (under config/) controls:
- console/file handlers
- log levels
- formatter
- rotation policy for logs/app.log
Script: scripts/batch_ocr_curl.sh
Basic run:
scripts/batch_ocr_curl.sh \
--list scripts/paths.txt \
--pipeline ocr \
--output logs/ocr_batch_results.txt
Each non-empty line of the list file can be:
- Path only:
  data/pdf/receipt.pdf
- Path + inline ground truth JSON:
  data/pdf/receipt.pdf|{"document_type":"receipt","fields":{"total_amount":49.25}}
- Path + ground truth JSON file:
  data/pdf/receipt.pdf|@data/ground_truth/receipt.json
Both | and the first , are accepted as delimiters, but | is recommended for clarity.
Flags:
- --pipeline ocr|llm
- --gt-doc-type ...
- --gt-json ...
- --gt-json-file ...
- --eval-notes ...
- --append
Output is a human-readable text report at the --output path.
When evaluation.enabled: true, each API request is persisted to SQLite (default data/evaluation.sqlite3).
Tables:
- documents
- runs
- ocr_artifacts
- field_evaluations
- run_metrics
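The evaluation database can also be read from Python with the standard sqlite3 module. The sketch below assumes the runs table exposes the columns used by the sqlite3 command-line queries in this section; recent_runs is a hypothetical helper, not part of src/evaluation/.

```python
import sqlite3


def recent_runs(conn: sqlite3.Connection, limit: int = 20) -> list[dict]:
    """Return the newest rows of the runs table as plain dicts."""
    query = (
        "SELECT id, pipeline_name, predicted_document_type, success, latency_ms, created_at "
        "FROM runs ORDER BY id DESC LIMIT ?"
    )
    cursor = conn.execute(query, (limit,))
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Usage against the default database path (requires an existing evaluation DB):
#   conn = sqlite3.connect("data/evaluation.sqlite3")
#   for run in recent_runs(conn, limit=5):
#       print(run["id"], run["pipeline_name"], run["latency_ms"])
```

Passing the limit as a bound parameter rather than formatting it into the SQL string keeps the helper safe to reuse with user-supplied values.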
List tables:
sqlite3 data/evaluation.sqlite3 ".tables"
Recent runs:
sqlite3 data/evaluation.sqlite3 \
"SELECT id, pipeline_name, predicted_document_type, success, latency_ms, created_at FROM runs ORDER BY id DESC LIMIT 20;"
Recent metrics:
sqlite3 data/evaluation.sqlite3 \
"SELECT run_id, classification_correct, exact_match_score, normalized_match_score, missing_rate, hallucination_rate FROM run_metrics ORDER BY run_id DESC LIMIT 20;"
Field-level diagnostics for a run:
sqlite3 data/evaluation.sqlite3 \
"SELECT field_name, expected_value, predicted_value, exact_match, normalized_match, is_missing, is_hallucinated FROM field_evaluations WHERE run_id = 42 ORDER BY field_name;"
Python helper summary:
uv run python -c "from src.evaluation import print_run_summary; print(print_run_summary(42))"
If the request cannot connect, the server is not running or the host/port is wrong.
- Start the server with scripts/run.sh.
- Verify the URL is http://127.0.0.1:8000/ocr.
A 400 file_missing response is usually one of:
- missing file form field
- unsupported MIME/file type
- request not sent as multipart form-data
Use:
-F "file=@/absolute/path/to/file.pdf;type=application/pdf"
A 422 unsupported_document_type response means the request ran successfully, but the classifier could not map the document to a supported class at the required confidence/threshold.
- inspect OCR/LLM logs in logs/app.log
- inspect config/document_classifier.yaml thresholds/rules
For a 500 internal_server_error response, check logs/app.log for the traceback. Typical causes:
- Poppler missing (PDF render fails)
- Tesseract binary missing (when using tesseract engine)
- LLM config/key issues
- malformed ground-truth JSON
- SQLite write failure
If the LLM pipeline fails, check:
- .env has OPENAI_API_KEY
- config/ocr.yaml -> llm.enabled: true
- config/ocr.yaml -> llm.model is valid for your provider key
For poor OCR output quality, first-line adjustments:
- increase ocr.document_loading.pdf_dpi
- tune config/preprocessing.yaml per engine
- switch engine (paddle vs tesseract) and compare
To add a new type cleanly, update classification, extraction, normalization, and tests together. Do this in small steps and verify after each step.
Update:
- src/normalisation/builder.py
- src/normalisation/helper.py (if new field normalizers are required)
Goal: make final output contract explicit before changing upstream logic.
Update:
Goal: classifier can predict the new label reliably.
Add/update:
- new extractor in src/extraction/extractors/
- register in src/extraction/registry.py
- helper logic in src/extraction/helpers/ if reusable
Goal: OCR pipeline can classify and extract raw fields for new type.
Update:
- src/llm/schemas.py
- config/llm_prompts.yaml
- src/llm/prompts.py only if template wiring must change
Goal: LLM pipeline can classify/extract the same type with the same final schema.
Add/update:
- ground-truth JSON files under data/ground_truth/
- test list file for batch runs
- optional analytics queries for the new fields
Run:
uv run pytest -q
Then verify:
- POST /ocr?pipeline=ocr
- POST /ocr?pipeline=llm
for the same document samples and confirm output key consistency.
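The output key consistency check can be automated once both responses are parsed as JSON. This is a minimal sketch: diff_final_json_keys is a hypothetical helper, and it relies only on the result.finalJson field of the response envelope.

```python
def diff_final_json_keys(ocr_response: dict, llm_response: dict) -> dict[str, list[str]]:
    """Report finalJson keys that appear in one pipeline's output but not the other."""
    ocr_keys = set(ocr_response["result"]["finalJson"])
    llm_keys = set(llm_response["result"]["finalJson"])
    return {
        "only_in_ocr": sorted(ocr_keys - llm_keys),
        "only_in_llm": sorted(llm_keys - ocr_keys),
    }


# Illustrative responses; real ones come from POST /ocr with each pipeline.
ocr_out = {"result": {"finalJson": {"claimant_name": "JOHN DOE", "total_amount": 49.25}}}
llm_out = {"result": {"finalJson": {"claimant_name": "JOHN DOE", "provider": "CLINIC"}}}
print(diff_final_json_keys(ocr_out, llm_out))
```

Two empty lists mean both pipelines emit the same final schema for that sample.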
Typical loop:
- Edit code/config.
- Keep the server running via scripts/run.sh (auto-reload).
- Test single docs via curl.
- Test batches via scripts/batch_ocr_curl.sh.
- Inspect logs/app.log.
- Inspect the evaluation DB for regression tracking.
- Run uv run pytest -q before finalizing changes.
- Rule-based extractors are still layout-sensitive for heavily noisy documents.
- Handwritten signature detection for referral letters is conservative by design (prefers false negatives over false positives).
- OCR and LLM paths share normalization, but classification/extraction internals are independent and tuned separately.