nancytanaka1/mimic-icd-coder

mimic-icd-coder

Multi-label ICD-10 auto-coder for hospital discharge summaries — a reproducible clinical NLP + MLOps reference build.

End-to-end clinical NLP pipeline on Azure Databricks (Delta Lake + Unity Catalog + MLflow + Model Serving), built on MIMIC-IV-Note v2.2 + MIMIC-IV v3.1 Hosp. Reproducible on a single workstation or in the cloud without code branches; every methodological choice is pre-registered in DECISIONS.md and defended in reports/.

Methodological note. This project is inspired by Mullenbach et al. 2018 (CAML), which established the multi-label ICD coding benchmark on MIMIC-III/ICD-9. This work targets MIMIC-IV/ICD-10 — a different dataset and coding system. Numerical results are reported on their own terms, not as an apples-to-apples replication. See §6 Evaluation for the full caveat.

Purpose and compliance posture

Scientific-research use only under the PhysioNet Credentialed Health Data License v1.5.0. This repository builds a reproducible MIMIC-IV/ICD-10 multi-label coding pipeline and demonstrates production-grade MLOps methodology for credentialed clinical NLP. Not a clinical product. Not a commercial service. Not for clinical decision support. No MIMIC data or trained weights are redistributed through this repository — only aggregate research results (metrics, methodology, code, synthetic examples). Reproduction requires independent PhysioNet credentialing and CITI training.

Headline results

Test-split results on MIMIC-IV-Note v2.2 + MIMIC-IV v3.1 Hosp, top-50 ICD-10 codes. Patient-level held-out test split, n=12,091 admissions, seed=42.

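| Metric      | Baseline (TF-IDF + LR) | Chunked Bio_ClinicalBERT |
|-------------|------------------------|--------------------------|
| Micro F1    | 0.6174                 | TBD                      |
| Macro F1    | 0.5843                 | TBD                      |
| P@5         | 0.5259                 | TBD                      |
| P@8         | 0.4326                 | TBD                      |
| P@15        | 0.2935                 | TBD                      |
| Micro AUC   | 0.9284                 | TBD                      |
| Macro AUC   | 0.9097                 | TBD                      |
| Micro AUPRC | 0.6263                 | TBD                      |
| Macro AUPRC | 0.5739                 | TBD                      |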

Backing MLflow run: 4e577699a67a4027bc27628e9b237ac5 (local file store, data/mlruns/).

Baseline uses OneVsRestClassifier(LogisticRegression(class_weight="balanced")) over TF-IDF (1–2 grams, min_df=5, 200K vocab cap) with per-label decision thresholds tuned on validation by F1 maximization. The class_weight="balanced" + per-label F1 threshold combination deliberately trades ranked-precision (P@k) for per-label F1 — a documented baseline choice (DECISIONS.md 2026-04-23). Recovering P@k is an explicit objective of the transformer branch.

Validation→test drift is ≤ 0.005 on every metric, which indicates that the validation-tuned thresholds generalize. Train, val, and test are disjoint by subject_id, and tests/test_smoke.py::test_patient_split_disjoint enforces this. Patient-level splits prevent the writing-style leakage that admission-level splits allow.
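A minimal sketch of a patient-level split and its disjointness check, written against the standard library only. The function names and the shuffle-then-slice strategy are assumptions for illustration; the project's own logic lives in src/mimic_icd_coder/data/splits.py and may differ.

```python
import random
from collections import defaultdict


def patient_level_split(subject_ids, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign every row a split so that no subject_id appears in two splits."""
    patients = sorted(set(subject_ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_train = int(len(patients) * fractions[0])
    n_val = int(len(patients) * fractions[1])
    split_of = {}
    for i, pid in enumerate(patients):
        if i < n_train:
            split_of[pid] = "train"
        elif i < n_train + n_val:
            split_of[pid] = "val"
        else:
            split_of[pid] = "test"
    # Rows inherit the split of their patient, so all admissions stay together.
    return [split_of[pid] for pid in subject_ids]


def assert_disjoint(subject_ids, splits):
    """Raise AssertionError if any patient is present in more than one split."""
    seen = defaultdict(set)
    for pid, split in zip(subject_ids, splits):
        seen[split].add(pid)
    names = sorted(seen)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = seen[a] & seen[b]
            assert not overlap, f"{len(overlap)} patients span {a} and {b}"
```

Splitting on the patient list rather than the admission list is what keeps repeat admissions of one patient on the same side of the boundary.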

Reproducibility

Running mic run-all --config configs/dev.yml on fresh MIMIC-IV v3.1 Hosp + MIMIC-IV-Note v2.2 raw CSVs reproduces every headline metric to 15+ decimal places (verified 2026-04-24, MLflow run 6e809d5dfd3b46dbafae84ddba710bd7). Reproducibility is architectural: a fixed patient-split seed (42), a deterministic liblinear solver, and file-on-disk stage boundaries that make each step independently re-runnable. A reviewer with PhysioNet credentials who clones this repository should get identical numbers end-to-end.

For the full data card, model card, EDA paper, and evaluation methodology, see reports/. AI-assistance disclosure: ACKNOWLEDGMENTS.md.


1. Study summary

| Attribute       | Value |
|-----------------|-------|
| Domain          | Clinical NLP: automated medical coding |
| Input           | Free-text discharge summaries (MIMIC-IV-Note v2.2) |
| Output          | Per-code probability + thresholded binary labels over top-50 ICD-10 codes |
| Training cohort | 122,288 admissions across 65,665 patients (MIMIC-IV v3.1 ICD-10 cohort ∩ v2.2 notes) |
| Top-50 coverage | 94.12% of cohort admissions |
| License (code)  | Apache-2.0 |
| License (data)  | PhysioNet Credentialed Health Data License v1.5.0 (not redistributed) |

Performance targets

Targets for the chunked Bio_ClinicalBERT branch on the patient-level test split. The TF-IDF+LR baseline floor is Micro F1 ≥ 0.55 — below that, something upstream is broken (cohort filter, split leakage, or label misalignment).

| Metric   | Target | Floor |
|----------|--------|-------|
| Micro F1 | ≥ 0.70 | 0.55  |
| Macro F1 | ≥ 0.55 | 0.40  |
| P@5      | ≥ 0.70 |       |
| P@8      | ≥ 0.65 |       |

Targets are absolute, not benchmark-relative. They reflect the operational threshold for "the transformer branch is delivering value over the baseline" rather than a literature comparison.
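The target and floor columns can be read as a three-way grade per metric. The sketch below is illustrative only: the metric key names and the grade function are hypothetical, and the thresholds are copied from the table above.

```python
# Thresholds copied from the performance-targets table; key names are hypothetical.
TARGETS = {"micro_f1": 0.70, "macro_f1": 0.55, "p_at_5": 0.70, "p_at_8": 0.65}
FLOORS = {"micro_f1": 0.55, "macro_f1": 0.40}  # P@5 and P@8 have no stated floor


def grade(metrics):
    """Map each metric to 'below_floor', 'below_target', or 'meets_target'."""
    status = {}
    for name, target in TARGETS.items():
        value = metrics[name]
        floor = FLOORS.get(name)
        if floor is not None and value < floor:
            status[name] = "below_floor"  # suggests an upstream problem
        elif value < target:
            status[name] = "below_target"
        else:
            status[name] = "meets_target"
    return status
```

Applied to the baseline's test metrics, this grades Macro F1 as meeting its target and the other three as below target but above any stated floor, which is consistent with recovering P@k being an objective of the transformer branch.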


2. System architecture

2.1 Logical topology

                         Raw MIMIC-IV (PhysioNet)
                         discharge.csv.gz
                         diagnoses_icd.csv.gz
                         admissions.csv.gz
                         patients.csv.gz
                         d_icd_diagnoses.csv.gz
                                 │
                                 ▼
     ┌────────────── Bronze ─────────────┐     Raw mirror (Parquet / Delta)
     │  gz CSV → columnar; no transforms │
     └────────────────┬──────────────────┘
                      ▼
     ┌────────────── Silver ─────────────┐     Cleaned notes, one per hadm_id
     │  de-id collapse, dedup, min 100tk │
     └────────────────┬──────────────────┘
                      ▼
     ┌────────────── Gold ───────────────┐     Model-ready artifacts
     │  top-50 ICD-10 multi-hot matrix   │     labels.npz, label_names.json,
     │  patient-level split manifest     │     splits.parquet
     └────────────────┬──────────────────┘
                      ▼
     ┌────────── Training / Eval ────────┐     MLflow tracking + Model Registry
     │  • TF-IDF + LogReg (baseline)     │     per-label threshold tuning
     │  • Chunked Bio_ClinicalBERT       │
     │  • Clinical-Longformer (fallback) │
     └────────────────┬──────────────────┘
                      ▼
     ┌────────── Serving + Monitoring ───┐     Databricks Model Serving (GPU)
     │  FastAPI-compatible scoring API   │     Evidently drift checks
     └───────────────────────────────────┘

2.2 Deployment surfaces

The same pipeline runs in two environments with no code branching. Only config paths change.

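| Surface           | Storage               | Compute                                               | Orchestration            | Tracking                                      | Use |
|-------------------|-----------------------|-------------------------------------------------------|--------------------------|-----------------------------------------------|-----|
| Local workstation | Parquet on local disk | CPU (16 threads)                                      | mic CLI                  | File-backed MLflow                            | Cohort construction, EDA, baseline iteration, tests |
| Azure Databricks  | ADLS Gen2 + Delta     | CPU + GPU job clusters (NC6s_v3 for transformer)      | Databricks Asset Bundles | Managed MLflow + Unity Catalog Model Registry | Transformer fine-tune, Model Serving, drift monitoring |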

3. Data contracts

Full cohort composition and preprocessing logic live in reports/data_card.md. Quick reference:

3.1 Inputs

| Source                               | Version         | Key fields                                                          | Notes |
|--------------------------------------|-----------------|---------------------------------------------------------------------|-------|
| mimic-iv-note/note/discharge.csv.gz  | v2.2 (Jan 2023) | note_id, subject_id, hadm_id, note_type, note_seq, charttime, text  | 331,793 rows; note_type = 'DS' is the only value |
| mimic-iv/hosp/diagnoses_icd.csv.gz   | v3.1 (Oct 2024) | subject_id, hadm_id, seq_num, icd_code, icd_version                 | 6,364,488 rows; icd_version ∈ {9, 10} |
| mimic-iv/hosp/admissions.csv.gz      | v3.1            | subject_id, hadm_id, admittime, dischtime, ...                      | 546,028 rows |
| mimic-iv/hosp/patients.csv.gz        | v3.1            | subject_id, gender, anchor_age, ...                                 | 364,627 rows |
| mimic-iv/hosp/d_icd_diagnoses.csv.gz | v3.1            | icd_code, icd_version, long_title                                   | ICD dictionary for human-readable descriptions |

The v2.2/v3.1 mismatch is deliberate. hadm_id is stable across versions; only 61 of 331,793 notes (0.018%) are orphaned. Full rationale in DECISIONS.md (2026-04-20).

3.2 Stage outputs

| Stage  | Artifact                                                                            | Shape                                | Contract |
|--------|-------------------------------------------------------------------------------------|--------------------------------------|----------|
| Bronze | bronze/{discharge_notes,diagnoses_icd,admissions,patients,d_icd_diagnoses}.parquet  | source schema                        | Lossless columnar mirror |
| Silver | silver/notes.parquet                                                                | hadm_id, subject_id, text, n_tokens  | One row per admission; n_tokens ≥ 100 |
| Gold   | gold/labels.npz                                                                     | CSR (n_admissions, 50)               | Rows aligned 1:1 to silver/notes.parquet |
| Gold   | gold/label_names.json                                                               | list[str] length 50                  | ICD-10 codes in column order |
| Gold   | gold/hadm_ids.parquet                                                               | hadm_id                              | Row-to-hadm_id lookup |
| Gold   | gold/splits.parquet                                                                 | row_idx, split                       | Patient-level 80/10/10; no subject_id spans splits |
| Gold   | gold/baseline_model.joblib                                                          | vectorizer + 50 LR heads             | Output of fit_baseline |
| Gold   | gold/baseline_thresholds.npy                                                        | float64[50]                          | Per-label thresholds tuned on val |

Alignment invariant (pipeline.py): len(silver) == labels.shape[0]. Violation means Gold must be rebuilt.
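As an illustration, the invariant can be checked before any training step reads Gold. This is a sketch rather than the code in pipeline.py: the function name is hypothetical, and the second check (that the stored hadm_id order matches Silver row for row) is an assumption that goes slightly beyond the length check stated above.

```python
def check_alignment(silver_hadm_ids, n_label_rows, gold_hadm_ids):
    """Fail fast when Silver and Gold have drifted apart."""
    if len(silver_hadm_ids) != n_label_rows:
        raise ValueError(
            f"silver has {len(silver_hadm_ids)} rows but labels has "
            f"{n_label_rows}; rebuild Gold"
        )
    # Assumed extra guard: equal length does not prove equal row order.
    if list(silver_hadm_ids) != list(gold_hadm_ids):
        raise ValueError("hadm_id order differs between Silver and Gold; rebuild Gold")
```

A length match alone would not catch a re-sorted Silver table, which is why the sketch also compares the identifiers position by position.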

3.3 Cohort rules

Defined in configs/*.yml under cohort:.

| Rule            | Default | Rationale |
|-----------------|---------|-----------|
| icd_version     | 10      | ICD-10 is operationally current; mixing fragments the label space |
| note_types      | ['DS']  | Discharge summaries only; v2.2 contains only DS |
| min_note_tokens | 100     | Drops near-empty notes that hurt baseline precision |
| top_k_labels    | 50      | Standard label-set size in the multi-label coding literature |
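To make the icd_version and top_k_labels rules concrete, here is a small pure-Python sketch of a top-K multi-hot label builder. It is not the code in data/labels.py: the function name, the tuple input format, the alphabetical tie-break, and the choice to keep admissions that have no top-K code are all assumptions made for illustration.

```python
from collections import Counter


def build_top_k_labels(diagnoses, k=50, icd_version=10):
    """diagnoses: iterable of (hadm_id, icd_code, icd_version) tuples."""
    codes_by_admission = {}
    for hadm_id, code, version in diagnoses:
        if version == icd_version:  # drop rows from the other coding system
            codes_by_admission.setdefault(hadm_id, set()).add(code)

    # Count admissions per code (a set per admission avoids double counting).
    counts = Counter(c for codes in codes_by_admission.values() for c in codes)
    # Most frequent first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    label_names = [code for code, _ in ranked[:k]]
    column = {code: j for j, code in enumerate(label_names)}

    hadm_ids = sorted(codes_by_admission)
    matrix = [[0] * len(label_names) for _ in hadm_ids]
    for i, hadm_id in enumerate(hadm_ids):
        for code in codes_by_admission[hadm_id]:
            if code in column:
                matrix[i][column[code]] = 1
    return hadm_ids, label_names, matrix
```

The real pipeline stores this matrix in sparse CSR form; a list of lists is used here only to keep the example dependency-free.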

4. Pipeline stages

Each stage is an idempotent function in src/mimic_icd_coder/pipeline.py with a Parquet or npz checkpoint. Downstream stages read from disk, so any step can be re-run without recomputing upstream.

| Stage          | Entry point        | Reads                                 | Writes                                                     | Runtime (laptop) |
|----------------|--------------------|---------------------------------------|------------------------------------------------------------|------------------|
| Bronze         | mic ingest         | 5 gz CSVs                             | 5 Parquet mirrors                                          | 5–10 min |
| Silver         | mic silver         | bronze/discharge_notes.parquet        | silver/notes.parquet                                       | 2–3 min |
| Gold           | mic gold           | Silver + bronze/diagnoses_icd.parquet | labels.npz, label_names.json, hadm_ids.parquet             | ~30 s |
| Splits         | mic splits         | Silver                                | splits.parquet                                             | < 10 s |
| Baseline train | mic train-baseline | Silver + Gold + Splits                | baseline_model.joblib, baseline_thresholds.npy, MLflow run | 15–25 min (CPU, 16 threads) |
| Test eval      | mic evaluate-test  | Silver + Gold + Splits + saved model  | Test metrics                                               | ~1 min |
| Run-all        | mic run-all        | raw gz                                | everything                                                 | 25–40 min end-to-end |

Checkpoint layout, rooted at Paths.root (default ./data):

data/
  bronze/   discharge_notes.parquet  diagnoses_icd.parquet  admissions.parquet
            patients.parquet         d_icd_diagnoses.parquet
  silver/   notes.parquet
  gold/     labels.npz  label_names.json  hadm_ids.parquet  splits.parquet
            baseline_model.joblib  baseline_thresholds.npy
  mlruns/   <MLflow experiment tree>
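The file-on-disk stage boundary pattern can be sketched as follows. This is a hypothetical helper, not the implementation in pipeline.py: run_stage, its force flag, and the skip-if-present behaviour are assumptions, and JSON stands in for Parquet so the example needs only the standard library.

```python
import json
from pathlib import Path


def run_stage(name, inputs, output, fn, force=False):
    """Run fn over the input checkpoints and write one output checkpoint.

    Returns "skipped" when the output already exists and force is False.
    """
    output = Path(output)
    if output.exists() and not force:
        return "skipped"
    inputs = [Path(p) for p in inputs]
    missing = [str(p) for p in inputs if not p.exists()]
    if missing:
        raise FileNotFoundError(f"stage {name!r} is missing upstream files: {missing}")
    output.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name, then rename, so a crash never leaves a
    # half-written checkpoint that a later stage could read.
    tmp = output.with_suffix(output.suffix + ".tmp")
    tmp.write_text(json.dumps(fn(inputs)))
    tmp.replace(output)
    return "ran"
```

Because each stage depends only on files, any step can be re-run in isolation by deleting its output or passing force=True.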

5. Models

Full details, architecture rationale, and ethics in reports/model_card.md.

5.1 Baseline — TF-IDF + one-vs-rest LogisticRegression

src/mimic_icd_coder/models/baseline.py

| Parameter    | Default  | Config key |
|--------------|----------|------------|
| n-gram range | (1, 2)   | baseline.tfidf_ngram_range |
| min doc freq | 5        | baseline.tfidf_min_df |
| max features | 200,000  | baseline.tfidf_max_features |
| LR C         | 1.0      | baseline.logreg_c |
| class_weight | balanced | baseline.logreg_class_weight |

Per-label decision thresholds are tuned on the validation split by maximizing per-label F1 (src/mimic_icd_coder/thresholds.py).
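Per-label threshold tuning can be sketched as a grid search that, for each label independently, picks the cut-off with the highest validation F1. The grid (0.05 to 0.95 in steps of 0.05), the function names, and the first-wins tie-break are assumptions; the actual tuner in thresholds.py may search differently.

```python
def f1_at(y_true, y_prob, threshold):
    """Binary F1 for one label when predicting positive at prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(Y_val, P_val, grid=None):
    """Return one threshold per label, each maximizing that label's val F1.

    Y_val and P_val are row-major lists: one row per admission, one column
    per label (0/1 truth and predicted probability respectively).
    """
    if grid is None:
        grid = [i / 100 for i in range(5, 96, 5)]
    thresholds = []
    for j in range(len(Y_val[0])):
        y = [row[j] for row in Y_val]
        p = [row[j] for row in P_val]
        # max() keeps the first maximal element, so ties go to the lowest threshold.
        thresholds.append(max(grid, key=lambda t: f1_at(y, p, t)))
    return thresholds
```

Because each label gets its own cut-off, the resulting binary predictions are optimized for F1 rather than for the ranking quality that P@k measures.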

5.2 Transformer — Chunked Bio_ClinicalBERT (primary)

src/mimic_icd_coder/models/transformer.py, jobs/train_transformer.py

Each note is split into contiguous 512-token chunks. Each chunk runs through the BERT encoder. Per-label logits are max-pooled across chunks. This recovers signal a single 512-token window would lose — 98.74% of notes exceed 512 whitespace tokens (per reports/eda_report.md §3 token-length analysis).

| Parameter                     | Default                                                           | Config key |
|-------------------------------|-------------------------------------------------------------------|------------|
| model                         | emilyalsentzer/Bio_ClinicalBERT (Hugging Face, public weights)    | transformer.model_name |
| max sequence length per chunk | 512                                                               | transformer.max_length |
| batch size                    | 16                                                                | transformer.batch_size |
| learning rate                 | 2e-5                                                              | transformer.learning_rate |
| epochs                        | 3                                                                 | transformer.epochs |
| warmup ratio                  | 0.1                                                               | transformer.warmup_ratio |
| weight decay                  | 0.01                                                              | transformer.weight_decay |
| fp16                          | true                                                              | transformer.fp16 |

Early stop on validation Macro F1.
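The chunk-then-max-pool idea can be shown without a model. In the sketch below, encode_chunk is a hypothetical stand-in for the BERT encoder plus classification head (it takes a list of token ids and returns one logit per label). The chunking ignores the special tokens a real tokenizer adds, which would reduce the usable content per chunk slightly.

```python
def chunk_tokens(token_ids, max_length=512):
    """Split a token sequence into contiguous, non-overlapping chunks."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


def max_pool_logits(chunk_logits):
    """Element-wise max over chunks: one pooled logit per label."""
    return [max(per_label) for per_label in zip(*chunk_logits)]


def score_note(token_ids, encode_chunk, max_length=512):
    """Encode every chunk separately, then max-pool the per-label logits."""
    chunks = chunk_tokens(token_ids, max_length)
    return max_pool_logits([encode_chunk(chunk) for chunk in chunks])
```

Max pooling means a label fires if any single chunk provides strong evidence for it, which suits long notes where a diagnosis may be mentioned in only one section.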

5.3 Fallback — Clinical-Longformer

Triggered only if chunked Bio_ClinicalBERT misses the Micro F1 target by more than 3 points (0.03 absolute). 4K-token context; ~3–5× slower training. Rationale in DECISIONS.md (2026-04-20).


6. Evaluation

Full methodology in reports/eval_report.qmd.

Test-split results

Held-out patient-level test split, n=12,091 admissions, 6,567 patients. Seed 42. MLflow run 4e577699a67a4027bc27628e9b237ac5.

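| Metric      | Baseline (TF-IDF + LR) |
|-------------|------------------------|
| Micro F1    | 0.6174                 |
| Macro F1    | 0.5843                 |
| P@5         | 0.5259                 |
| P@8         | 0.4326                 |
| P@15        | 0.2935                 |
| Micro AUC   | 0.9284                 |
| Macro AUC   | 0.9097                 |
| Micro AUPRC | 0.6263                 |
| Macro AUPRC | 0.5739                 |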

Metrics used

| Metric           | Use |
|------------------|-----|
| Micro F1         | Primary operational metric; stable under class imbalance |
| Macro F1         | Rare-label performance across all 50 codes, equally weighted |
| P@5 / P@8 / P@15 | Ranked-prediction precision for coder-assist workflow |
| Per-label F1     | Error analysis on worst-performing labels |
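For reference, P@k as commonly defined in the ICD-coding literature is the fraction of the k highest-scored codes that are truly assigned, averaged over admissions. The sketch below follows that definition; it is not copied from evaluate.py, and the function name is an assumption.

```python
def precision_at_k(Y_true, scores, k):
    """Mean over admissions of (true labels among the top-k scores) / k."""
    total = 0.0
    for truth, row_scores in zip(Y_true, scores):
        # Indices of the k highest-scored labels for this admission.
        top = sorted(range(len(row_scores)), key=lambda j: -row_scores[j])[:k]
        total += sum(truth[j] for j in top) / k
    return total / len(Y_true)
```

Since the denominator is always k, an admission with fewer than k true codes cannot reach 1.0, so P@k necessarily falls as k grows past the typical number of codes per admission.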

On comparisons to prior work

This work does not replicate Mullenbach et al. 2018 in a methodologically valid sense, and does not claim to. The differences are:

  • Different dataset. Mullenbach used MIMIC-III v1.4. This work uses MIMIC-IV v3.1 + MIMIC-IV-Note v2.2.
  • Different coding system. Mullenbach used ICD-9-CM. This work uses ICD-10-CM. The label spaces are non-overlapping; the top-50 codes in each are different sets covering different clinical concepts.
  • Different cohort. Different inclusion criteria, different size, different distributional properties.
  • Different difficulty. ICD-10 is more granular than ICD-9 (~70K codes vs ~14K). Top-50 ICD-10 prediction is a different problem from top-50 ICD-9 prediction.

Numerical differences between this work's results and any number reported in Mullenbach 2018 (or downstream work on MIMIC-III) are confounded by all four factors. Such comparisons would be non-equivalent and methodologically invalid, and are not reported.

What this work does take from Mullenbach 2018: the multi-label classification framing, the patient-level evaluation discipline, the use of P@k as a coder-assist-relevant metric, and the top-50 label cardinality as a tractable problem size. These are methodological inheritances, not benchmark equivalences. Future work to produce an apples-to-apples MIMIC-III/ICD-9 reproduction is tracked in DECISIONS.md.


7. Interfaces

7.1 Local CLI

Entry points registered in pyproject.toml, implemented in src/mimic_icd_coder/cli.py.

mic ingest          --config configs/dev.yml
mic silver          --config configs/dev.yml
mic gold            --config configs/dev.yml
mic splits          --config configs/dev.yml
mic train-baseline  --config configs/dev.yml
mic evaluate-test   --config configs/dev.yml
mic run-all         --config configs/dev.yml

configs/dev.yml is gitignored; copy configs/dev.example.yml and fill in your MIMIC paths. --artifacts <dir> overrides the default ./data checkpoint root.

7.2 Databricks Asset Bundle

databricks.yml. Two targets:

| Target | Catalog       | Run-as                         | Compute |
|--------|---------------|--------------------------------|---------|
| dev    | mimic_icd_dev | workspace user                 | Standard_DS4_v2 × 2 (Bronze), Standard_DS5_v2 × 2 (baseline) |
| prod   | mimic_icd     | service principal mimic-icd-sp | same + Standard_NC6s_v3 single-node (1× V100) for transformer |

databricks bundle validate --target dev
databricks bundle deploy   --target dev
databricks bundle run ingest_bronze     --target dev
databricks bundle run train_baseline    --target dev
databricks bundle run train_transformer --target prod

7.3 Model Serving API

Databricks Model Serving endpoint, GPU-backed.

POST /serving-endpoints/mimic-icd-discharge/invocations
{
  "dataframe_records": [
    {"text": "<discharge summary text>"}
  ]
}

Response:

{
  "predictions": [
    {
      "codes":        ["I10", "I50.9", "N18.6", "E11.9"],
      "scores":       [0.94, 0.87, 0.72, 0.68],
      "thresholded":  ["I10", "I50.9"]
    }
  ]
}
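A small client-side sketch for this request and response shape, using only the standard library. The helper names are hypothetical, and the HTTP call itself is left out because it depends on the workspace host and an access token. Note that the endpoint is listed as not started in §14, so the response format above is a design target rather than observed output.

```python
import json


def build_request(texts):
    """Serialize discharge summaries into the dataframe_records payload."""
    return json.dumps({"dataframe_records": [{"text": t} for t in texts]})


def parse_response(body):
    """Turn the JSON response into ranked (code, score) pairs per note."""
    results = []
    for pred in json.loads(body)["predictions"]:
        results.append({
            # codes and scores are parallel lists, already ranked by score
            "ranked": list(zip(pred["codes"], pred["scores"])),
            # codes whose score cleared the per-label threshold
            "accepted": set(pred["thresholded"]),
        })
    return results
```

Keeping the ranked list and the thresholded set separate mirrors the two uses described earlier: ranked suggestions for coder-assist, and binary decisions for F1-style evaluation.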

8. Configuration

Template: configs/dev.example.yml. User overrides go in configs/dev.yml or configs/dev.<username>.yml, both gitignored. Schema validated by Pydantic AppConfig in src/mimic_icd_coder/config.py.

| Section       | Purpose |
|---------------|---------|
| unity_catalog | Catalog + schema names for Bronze / Silver / Gold / Models |
| data          | Input paths (local gz or ADLS abfss://), including d_icd_path |
| cohort        | Cohort filters (see §3.3) |
| split         | Train/val/test fractions, seed, strategy |
| baseline      | TF-IDF + LR hyperparameters |
| transformer   | Bio_ClinicalBERT hyperparameters |
| evaluation    | Threshold strategy, top-k metric list |
| mlflow        | Experiment name, registry model name |
| logging       | Level + format (console or JSON) |

9. Observability

| Channel          | Backing                                                | Captured |
|------------------|--------------------------------------------------------|----------|
| Structured logs  | structlog (console locally, JSON on Databricks)        | Stage start/end, row counts, label density, metric values |
| MLflow runs      | Local file store (data/mlruns) or Databricks-managed   | Params, metrics, model artifact, signature, thresholds, label list |
| Model Registry   | Unity Catalog (mimic_icd.models.discharge_top50)       | Staging / Production aliases; train-data fingerprint and git SHA tags |
| Drift monitoring | Evidently scheduled job (prod only)                    | Input distribution, prediction, and label drift |

10. Security & compliance

Full details in reports/data_card.md. Headlines:

  • Scientific-research use only under the PhysioNet Credentialed Health Data License v1.5.0. Not a clinical product, commercial service, decision-support tool, or clinical-care application.
  • Re-identification. No attempt to identify patients or institutions is made. Only aggregate cohort statistics, label-level metrics, and synthetic examples are published; no note text, admission IDs, or subject IDs leave local disk.
  • Credentialing. Reproducing results from raw data requires the reviewer to independently complete PhysioNet credentialing (CITI training + DUA agreement) before accessing MIMIC-IV. This repository does not grant any access to the underlying data.
  • .gitignore blocks CSV, Parquet, gz, npz, joblib, and user-specific configs. No raw data enters this repository.
  • Training runs in the user's own Azure tenant (single-tenant Databricks workspace, private ADLS Gen2).
  • Clinical text is never sent to third-party LLM APIs. Only open-weights models hosted inside the workspace are used.
  • CI runs only on synthetic fixtures in tests/fixtures/synthetic_notes.py.
  • Notebook outputs are PHI-scanned by scripts/check_notebook_phi.py in CI and pre-commit.
  • Service-principal credentials are stored in Databricks secret scopes.

11. Quality gates

| Gate                     | Tool                                              | Enforced in |
|--------------------------|---------------------------------------------------|-------------|
| Lint                     | ruff check                                        | Pre-commit + CI |
| Format                   | black --check (line length 100)                   | Pre-commit + CI |
| Types                    | mypy src (strict)                                 | CI |
| Unit + integration tests | pytest (56 tests on synthetic fixtures, ~25 s)    | CI + local |
| Notebook output hygiene  | nbstripout + PHI scanner                          | Pre-commit + CI |
| Data-safety guards       | Large-file check, private-key detection           | Pre-commit |
| Bundle validity          | databricks bundle validate --target dev           | Pre-deploy |
| Metric floor             | Baseline Micro F1 ≥ 0.55 on dev split             | Manual review gate after mic train-baseline |

Pre-commit config: .pre-commit-config.yaml. CI workflow: .github/workflows/ci.yml.


12. Quick start

12.1 Local (no credentialed data required)

git clone git@github.com:nancytanaka1/mimic-icd-coder.git
cd mimic-icd-coder

python -m venv .venv
# Windows: .\.venv\Scripts\activate    POSIX: source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install

pytest -q                     # synthetic fixtures only

12.2 Local end-to-end on real MIMIC (requires PhysioNet credentials)

See LOCAL_SETUP.md for the workstation walkthrough (memory profile, expected row counts, runtime envelopes, GPU prerequisites).

cp configs/dev.example.yml configs/dev.yml        # then edit data paths
mic run-all --config configs/dev.yml
mlflow ui --backend-store-uri file:./data/mlruns --port 5000

12.3 Databricks

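# Install the current Databricks CLI following the Databricks install docs.
# The legacy databricks-cli package on PyPI does not provide the bundle commands.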
databricks configure --token
databricks bundle validate --target dev
databricks bundle deploy   --target dev
databricks bundle run ingest_bronze --target dev

13. Repository layout

mimic-icd-coder/
├── src/mimic_icd_coder/
│   ├── cli.py                CLI entry points (mic ...)
│   ├── config.py             Pydantic AppConfig
│   ├── pipeline.py           Stage orchestration + Paths
│   ├── logging_utils.py      structlog configuration
│   ├── eda.py                EDA analysis helpers
│   ├── evaluate.py           Metrics
│   ├── thresholds.py         Per-label threshold tuner
│   ├── data/
│   │   ├── ingest.py         gz CSV → DataFrame readers (pyarrow CSV engine)
│   │   ├── clean.py          Silver transforms
│   │   ├── labels.py         Top-K multi-hot label builder
│   │   └── splits.py         Patient-level splitter
│   └── models/
│       ├── baseline.py       TF-IDF + LogReg + MLflow logger
│       └── transformer.py    Chunked Bio_ClinicalBERT fine-tune wrapper
├── jobs/                     Databricks-entry-point scripts
├── notebooks/01_eda.ipynb    Cohort + label distribution EDA
├── scripts/check_notebook_phi.py
├── configs/                  dev.example.yml + gitignored user configs
├── tests/                    Unit + integration tests on synthetic fixtures
├── reports/
│   ├── data_card.md
│   ├── model_card.md
│   ├── eval_report.qmd
│   ├── eda_report.md
│   └── EDA_Report.docx
├── .github/workflows/ci.yml
├── .pre-commit-config.yaml
├── databricks.yml
├── pyproject.toml
├── DECISIONS.md
├── LOCAL_SETUP.md
├── ACKNOWLEDGMENTS.md
├── LICENSE
└── README.md

14. Implementation status

| Component                                                    | Status |
|--------------------------------------------------------------|--------|
| Scaffold, CI, pre-commit, Asset Bundle                       | Ready |
| EDA notebook + paper + data card + model card + eval report  | Complete |
| Bronze ingestion (5 tables including ICD dictionary)         | Implemented and run on real data |
| Silver (clean + min-token filter)                            | Shipped |
| Gold (top-50 label matrix + patient splits)                  | Shipped: 50-label matrix on 122,288 admissions |
| TF-IDF + LR baseline                                         | Shipped: test Micro F1 0.6174, Macro F1 0.5843 |
| Per-label threshold tuning                                   | Implemented |
| Evaluation (Micro/Macro F1, P@k, AUC, AUPRC)                 | Implemented |
| Per-label error analysis + calibration + confusion patterns  | Shipped: see reports/baseline_error_analysis.md |
| MLflow tracking                                              | Local file store wired; Unity Catalog Registry write on Databricks only |
| Chunked Bio_ClinicalBERT fine-tune                           | Scaffolded: jobs/train_transformer.py; pre-registered predictions in error analysis doc |
| Clinical-Longformer fallback                                 | Not started (trigger-driven) |
| Azure Databricks workspace + Unity Catalog bootstrap         | Not started (branched from external bootstrap project) |
| Model Serving endpoint                                       | Not started |
| Evidently drift monitoring                                   | Not started |

15. References

  • Mullenbach, Wiegreffe, Duke, Sun, Eisenstein (2018). Explainable Prediction of Medical Codes from Clinical Text. NAACL. https://arxiv.org/abs/1802.05695
  • Alsentzer, Murphy, Boag, Weng, Jin, Naumann, McDermott (2019). Publicly Available Clinical BERT Embeddings. ClinicalNLP Workshop. https://arxiv.org/abs/1904.03323
  • Beltagy, Peters, Cohan (2020). Longformer: The Long-Document Transformer. https://arxiv.org/abs/2004.05150
  • Devlin, Chang, Lee, Toutanova (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
  • Johnson et al. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet.
  • Mitchell et al. (2019). Model Cards for Model Reporting. FAccT.
  • Pushkarna, Zaldivar, Kjartansson (2022). Data Cards: Purposeful and Transparent Dataset Documentation. FAccT.

16. License

Code licensed under Apache-2.0 (LICENSE). MIMIC data is licensed separately under the PhysioNet Credentialed Health Data License v1.5.0 and is not redistributed via this repository.

About

Reproducible end-to-end clinical NLP + MLOps pipeline for multi-label ICD-10 auto-coding on MIMIC-IV. Bronze→Silver→Gold→splits→train→evaluate, fully tested, CI-gated, DUA-compliant. TF-IDF + LR baseline shipped (test Micro F1 0.617, Macro F1 0.584); Bio_ClinicalBERT fine-tune loop validated locally; Databricks GPU run pending.
