A research-grade re-examination of how skin-lesion classifiers are evaluated, built on HAM10000 (dermoscopy) and DDI (diverse-skin-tone clinical photos). The headline 90%+ accuracies that circulate for HAM10000 are largely an artifact of how the data is split and scored. This repo measures those artifacts directly, under a single fixed protocol, across four axes:
- RQ1 — Leakage. How much does lesion-level train/test leakage inflate reported AUC?
- RQ2 — Ensembling & calibration. Does a diverse-architecture ensemble actually beat a single model on calibration and diversity, not just accuracy?
- RQ4 — Skin-tone fairness. Is the well-known dark-skin performance gap a clean Fitzpatrick effect, or partly a domain-shift confound?
- RQ5 — Shortcut sensitivity. How much do the models lean on non-lesion context and synthetic artifacts?
These are not four separate projects — they are axes of one benchmark artifact:
leakage-controlled × {backbones, ensembles} × {accuracy, calibration, fairness, shortcuts}. Every
metric is computed from a frozen prediction table by a shared evaluation spine (bench/metrics.py),
so results are reproducible and every axis is scored the same way. Each axis was designed, then
independently reviewed by a second model (Codex) both before the build and after the result, to
catch confounds before they reached a conclusion.
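To make the "everything derives from the prediction table" idea concrete, here is a minimal sketch of fitting an operating threshold on validation rows and then scoring test rows with it. It is not the code in bench/metrics.py. The column names `label`, `prob`, and `split` follow the prediction-table schema listed in this README; the binary 0/1 encoding of `label`, the target-sensitivity rule for picking the threshold, and the function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Row = Dict[str, object]  # one prediction-table row: at least label, prob, split


def fit_threshold(val_rows: List[Row], target_sens: float = 0.9) -> float:
    """Highest threshold whose validation sensitivity reaches target_sens.

    Assumption: label is 1 for the positive class and 0 otherwise.
    """
    pos = sorted((float(r["prob"]) for r in val_rows if r["label"] == 1), reverse=True)
    if not pos:
        raise ValueError("validation split has no positive cases")
    # At least ceil(target_sens * n_pos) positives must score >= threshold.
    k = max(1, math.ceil(target_sens * len(pos)))
    return pos[k - 1]


def sens_spec(rows: List[Row], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity at a fixed threshold (both classes must be present)."""
    tp = fn = tn = fp = 0
    for r in rows:
        predicted_pos = float(r["prob"]) >= threshold
        if r["label"] == 1:
            tp += int(predicted_pos)
            fn += int(not predicted_pos)
        else:
            fp += int(predicted_pos)
            tn += int(not predicted_pos)
    return tp / (tp + fn), tn / (tn + fp)


# Toy prediction table (hypothetical values, not real results).
table: List[Row] = [
    {"label": 1, "prob": 0.9, "split": "val"},
    {"label": 1, "prob": 0.8, "split": "val"},
    {"label": 1, "prob": 0.6, "split": "val"},
    {"label": 1, "prob": 0.3, "split": "val"},
    {"label": 0, "prob": 0.7, "split": "val"},
    {"label": 0, "prob": 0.2, "split": "val"},
    {"label": 1, "prob": 0.85, "split": "test"},
    {"label": 1, "prob": 0.5, "split": "test"},
    {"label": 0, "prob": 0.65, "split": "test"},
    {"label": 0, "prob": 0.1, "split": "test"},
]

val = [r for r in table if r["split"] == "val"]
test = [r for r in table if r["split"] == "test"]
threshold = fit_threshold(val, target_sens=0.75)  # fit on val only, then frozen
print(threshold, sens_spec(test, threshold))
```

The threshold is chosen without reading any test row, so the test sensitivity and specificity are free of test-set tuning.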
| Axis | Result | Honest bound |
|---|---|---|
| RQ1 Leakage | Lesion-level leakage inflates test AUC by ~1.3–1.7 pts population-wide and ~6–7 pts on the cases where leakage is possible (~5× amplification), via a paired leakage-injection design on an identical test set; every 95% CI excludes zero across 3 backbones. | The naive "grouped vs ungrouped" comparison shows nothing — it's confounded and diluted; the effect only appears with the paired design. |
| RQ2 Ensembling | A 3-architecture heterogeneous ensemble has ~3× the member diversity of a 3-fold same-architecture average and is best-calibrated after temperature scaling (best Brier/NLL), with the highest AUC. The homogeneous fold-average barely diversifies. | The AUC edge is inconclusive at 3 seeds. Methodological point: raw ensemble calibration is misleading — you must temperature-scale before comparing. |
| RQ4 Fairness | Cross-modality (HAM-dermoscopy model → DDI clinical photos) shows a large dark-skin collapse (MobileNetV2 V–VI AUC below chance, 0.47). But fine-tuning within DDI (modality held constant) flattens per-FST AUC (V–VI 0.72, on par with I–II) and the light-vs-dark sensitivity gap's bootstrap CIs cross zero. So the apparent gap is substantially a domain-shift confound. | Underpowered non-equivalence, not proof of parity (V–VI n=48; CIs allow gaps up to ~0.30); image-disjoint (DDI ships no patient IDs); one dataset/backbone. A landscape scan confirmed no diverse-skin-tone dermoscopy dataset with biopsy-confirmed cancer exists. |
| RQ5 Shortcuts | Backbones differ markedly in non-lesion-context reliance (MobileNetV2 most lesion-centric, EfficientNetB0 most context-sensitive). | Synthetic ink/ruler artifacts did not reproduce the Winkler melanoma-inflation effect (honest negative). Methodological point: the occlusion fill is itself a confound — the mean-fill control changed a conclusion. |
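The RQ2 methodological point (temperature-scale before comparing calibration) can be illustrated with a small sketch. It assumes binary probabilities, recovers logits from them, and fits a single temperature on validation data by minimizing NLL with a golden-section search. The repo's own implementation may differ (for example, it may work on multi-class logits); the function names, search bounds, and toy numbers here are hypothetical.

```python
import math
from typing import List, Sequence

EPS = 1e-7  # clip probabilities away from 0 and 1 before taking logs


def logit(p: float) -> float:
    p = min(max(p, EPS), 1 - EPS)
    return math.log(p / (1 - p))


def sigmoid(z: float) -> float:
    # Numerically stable for large |z|.
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)


def nll(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean negative log-likelihood of binary labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1 - EPS)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)


def brier(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def apply_temperature(probs: Sequence[float], temperature: float) -> List[float]:
    """Divide the logit by the temperature; T > 1 softens, T < 1 sharpens."""
    return [sigmoid(logit(p) / temperature) for p in probs]


def fit_temperature(val_probs, val_labels, lo=0.05, hi=20.0, iters=80) -> float:
    """Golden-section search over log(T) for the validation-NLL minimum."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2

    def loss(log_t: float) -> float:
        return nll(apply_temperature(val_probs, math.exp(log_t)), val_labels)

    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    for _ in range(iters):
        if loss(c) < loss(d):
            b = d
        else:
            a = c
        c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    return math.exp((a + b) / 2)


# Toy overconfident model: several confident mistakes on validation.
val_probs = [0.99, 0.98, 0.97, 0.02, 0.01, 0.99, 0.03, 0.96]
val_labels = [1, 1, 0, 0, 0, 0, 1, 1]
T = fit_temperature(val_probs, val_labels)  # fit on val, then frozen
scaled = apply_temperature(val_probs, T)
print(f"T={T:.2f}  NLL {nll(val_probs, val_labels):.3f} -> {nll(scaled, val_labels):.3f}")
```

In a real comparison each candidate (single model, fold average, heterogeneous ensemble) would get its own temperature fitted on validation, and Brier/NLL would then be reported on the untouched test rows.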
Full write-ups with methods, tables, and caveats are in docs/: one RQx_DESIGN.md and
RQx_RESULT.md per axis, plus the RESEARCH_PROPOSAL.md (literature
grounding + novelty assessment). RQ3 (calibration) is folded into RQ2 as an axis, not a standalone RQ.
- `bench/`: the benchmark (frozen manifests, training, the shared eval spine)
  - `metrics.py`: eval spine with threshold-on-val, sens/spec, calibration (ECE/Brier/NLL), bootstrap CIs
  - `make_manifests.py`: lesion-grouped + naive splits (RQ1)
  - `make_injection_manifests.py`: paired leakage-injection manifests (RQ1)
  - `make_cv_folds.py`: lesion-grouped CV folds (RQ2)
  - `train_eval.py`: train one backbone on a frozen manifest -> validated prediction table
  - `ensemble_eval.py`: single vs k-fold vs heterogeneous ensembles (RQ2)
  - `fairness_eval.py`: RQ4 fork A, source-locked cross-modality stress test on DDI
  - `ddi_manifest.py` / `ddi_finetune.py` / `ddi_fairness.py`: RQ4 fork B, within-clinical fine-tune
  - `shortcut_eval.py`: RQ5, occlusion + synthetic-artifact sensitivity
- `docs/`: design + result write-ups per RQ, plus the research proposal
- `runs/`: small JSON result summaries (`rq*.json`) only; per-run outputs are gitignored
- `src/`: the original training/eval/export pipeline (HAM10000, TF-Lite export)
- `manifests/`: frozen HAM10000 split manifests (DDI manifests are gitignored, see below)
- Leakage control by design. Splits are lesion-grouped; effects are measured with paired comparisons on an identical frozen test set, not by comparing different test populations.
- No test tuning. Operating thresholds and temperature scaling are fit on a held-out validation set and frozen before touching test.
- The prediction table is the contract. Every model writes one row per image (`image_id, lesion_id, dx, label, prob, split, ...`); all metrics derive from that table.
- Uncertainty is reported. Bootstrap 95% CIs on every headline number; effects are called "inconclusive" or "underpowered" when the CIs say so.
- Adversarial review. Each axis passed a second-model methodology review before and after build.
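The paired-comparison and uncertainty principles above can be sketched together: two models are scored on the same frozen test rows, and each bootstrap resample is applied to both, so the CI describes the AUC difference rather than two unrelated AUCs. This is a minimal illustration, not the repo's implementation. It resamples images with a percentile interval; whether the benchmark resamples images or lesions is not stated here, and all names and defaults are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based (Mann-Whitney) AUC with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def paired_bootstrap_auc_diff(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and 95% percentile CI for AUC(a) - AUC(b) on one test set."""
    rng = random.Random(seed)
    n = len(labels)
    diffs: List[float] = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same resample for both models
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that lost one class
            a = auc(y, [scores_a[i] for i in idx])
            b = auc(y, [scores_b[i] for i in idx])
            diffs.append(a - b)
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return auc(labels, scores_a) - auc(labels, scores_b), lo, hi


# Toy example (hypothetical scores): model A vs model B on identical test rows.
y = [1, 1, 1, 0, 0, 0, 1, 0]
a = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4, 0.6, 0.1]
b = [0.9, 0.4, 0.7, 0.5, 0.2, 0.6, 0.3, 0.1]
print(paired_bootstrap_auc_diff(y, a, b, n_boot=500))
```

An effect would be reported as inconclusive when the interval contains zero.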
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

GPU training was run under a WSL2 TensorFlow-GPU environment (RTX 4090). Typical flow per axis:
build a frozen manifest → train_eval.py to emit prediction tables → the axis's eval script →
a small runs/rq*.json summary. Example (RQ1, one backbone):
python bench/make_injection_manifests.py --seed 42
python bench/train_eval.py --manifest manifests/inject_clean_seed42.csv \
--images <ham10000_dir> --model efficientnetb0 --out runs/clean_effb0 --seed 0
python bench/metrics.py --pred runs/clean_effb0/predictions.csv

- HAM10000 (Tschandl et al. 2018): open; the primary dermoscopy set. `lesion_id` enables grouping (note: `lesion_id` ≠ patient ID). Frozen split manifests are included.
- DDI (Daneshjou et al. 2022): gated via the Stanford AIMI portal under a Research Use Agreement that forbids redistributing any portion of the dataset. DDI images, derived fold manifests, and per-image outputs are therefore gitignored; only aggregate metric summaries are tracked. Regenerate locally with `bench/ddi_manifest.py` after accepting the agreement.
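As a rough illustration of what grouping on HAM10000's `lesion_id` buys, the sketch below assigns whole lesions to splits and then checks that no lesion appears in more than one split. It is not the logic in bench/make_manifests.py: the split fractions, the absence of class stratification, and the helper names are assumptions, and the IDs are made up.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def lesion_grouped_split(
    rows: List[Dict[str, str]],
    fractions: Sequence[float] = (0.7, 0.15, 0.15),  # train, val, test (assumed)
    seed: int = 42,
) -> List[Dict[str, str]]:
    """Assign every image of a lesion to the same split."""
    lesions = sorted({r["lesion_id"] for r in rows})  # sort for determinism
    random.Random(seed).shuffle(lesions)
    n_train = int(fractions[0] * len(lesions))
    n_val = int(fractions[1] * len(lesions))
    assignment = {}
    for i, lesion in enumerate(lesions):
        if i < n_train:
            assignment[lesion] = "train"
        elif i < n_train + n_val:
            assignment[lesion] = "val"
        else:
            assignment[lesion] = "test"
    return [dict(r, split=assignment[r["lesion_id"]]) for r in rows]


def leaked_lesions(rows: List[Dict[str, str]]) -> Set[str]:
    """Lesion IDs whose images appear in more than one split."""
    splits_by_lesion = defaultdict(set)
    for r in rows:
        splits_by_lesion[r["lesion_id"]].add(r["split"])
    return {lesion for lesion, splits in splits_by_lesion.items() if len(splits) > 1}


# Toy manifest: 10 lesions with 2 images each (hypothetical IDs).
rows = [{"image_id": f"img{i}", "lesion_id": f"les{i // 2}"} for i in range(20)]

grouped = lesion_grouped_split(rows)
print("grouped leaks:", len(leaked_lesions(grouped)))

# A naive per-image split that ignores lesion_id puts sibling images on both sides.
naive = [dict(r, split="train" if i % 2 == 0 else "test") for i, r in enumerate(rows)]
print("naive leaks:", len(leaked_lesions(naive)))
```

Because `lesion_id` is not a patient ID, this check says nothing about patient-level overlap.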
This is a research benchmark for studying evaluation methodology. It is not a medical device and must not be used for diagnosis or patient care.