A Leakage-Controlled Benchmark for Skin-Lesion Classification

A research-grade re-examination of how skin-lesion classifiers are evaluated, built on HAM10000 (dermoscopy) and DDI (diverse-skin-tone clinical photos). The headline 90%+ accuracies that circulate for HAM10000 are largely an artifact of how the data is split and scored. This repo measures those artifacts directly, under a single fixed protocol, across four axes:

RQ1 — Leakage. How much does lesion-level train/test leakage inflate reported AUC?
RQ2 — Ensembling & calibration. Does a diverse-architecture ensemble actually beat a single model on calibration and diversity, not just accuracy?
RQ4 — Skin-tone fairness. Is the well-known dark-skin performance gap a clean Fitzpatrick effect, or partly a domain-shift confound?
RQ5 — Shortcut sensitivity. How much do the models lean on non-lesion context and synthetic artifacts?

These are not four separate projects; they are axes of one benchmark artifact: leakage-controlled × {backbones, ensembles} × {accuracy, calibration, fairness, shortcuts}. Every metric is computed from a frozen prediction table by a shared evaluation spine (bench/metrics.py), so results are reproducible and every axis is scored by the same code. Each axis was designed, then independently reviewed by a second model (Codex) before it was built and again after the results were in, to catch confounds before they reached a conclusion.
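As a rough illustration of what "every metric derives from a frozen prediction table" can look like, the sketch below picks an operating threshold on validation rows and then applies it unchanged to test rows. It is not the code in bench/metrics.py: the toy rows, the function names, and the use of Youden's J as the selection rule are assumptions made for this example.

```python
# Illustrative only: toy rows with a subset of the prediction-table columns
# (label, prob, split). All values are made up.
table = [
    {"split": "val", "label": 1, "prob": 0.9},
    {"split": "val", "label": 1, "prob": 0.8},
    {"split": "val", "label": 1, "prob": 0.7},
    {"split": "val", "label": 0, "prob": 0.6},
    {"split": "val", "label": 0, "prob": 0.3},
    {"split": "val", "label": 0, "prob": 0.1},
    {"split": "test", "label": 1, "prob": 0.95},
    {"split": "test", "label": 1, "prob": 0.65},
    {"split": "test", "label": 0, "prob": 0.75},
    {"split": "test", "label": 0, "prob": 0.2},
]


def sens_spec(rows, threshold):
    """Sensitivity and specificity at a fixed threshold.

    Assumes both classes are present in `rows`.
    """
    tp = sum(1 for r in rows if r["label"] == 1 and r["prob"] >= threshold)
    fn = sum(1 for r in rows if r["label"] == 1 and r["prob"] < threshold)
    tn = sum(1 for r in rows if r["label"] == 0 and r["prob"] < threshold)
    fp = sum(1 for r in rows if r["label"] == 0 and r["prob"] >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(val_rows):
    """Assumed rule: maximise Youden's J (sens + spec - 1) on validation rows."""
    candidates = sorted({r["prob"] for r in val_rows})
    return max(candidates, key=lambda t: sum(sens_spec(val_rows, t)) - 1)


val_rows = [r for r in table if r["split"] == "val"]
test_rows = [r for r in table if r["split"] == "test"]

# The threshold is fit on validation rows only and is frozen before test is scored.
threshold = pick_threshold(val_rows)
test_sens, test_spec = sens_spec(test_rows, threshold)
```

Because the threshold is a function of validation rows alone, the test rows can be re-scored any number of times without the operating point drifting toward them.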

Findings

Axis	Result	Honest bound
RQ1 Leakage	Lesion-level leakage inflates test AUC by ~1.3–1.7 pts population-wide and ~6–7 pts on the cases where leakage is possible (~5× amplification), via a paired leakage-injection design on an identical test set; every 95% CI excludes zero across 3 backbones.	The naive "grouped vs ungrouped" comparison shows nothing — it's confounded and diluted; the effect only appears with the paired design.
RQ2 Ensembling	A 3-architecture heterogeneous ensemble has ~3× the member diversity of a 3-fold same-architecture average and is best-calibrated after temperature scaling (best Brier/NLL), with the highest AUC. The homogeneous fold-average barely diversifies.	The AUC edge is inconclusive at 3 seeds. Methodological point: raw ensemble calibration is misleading — you must temperature-scale before comparing.
RQ4 Fairness	Cross-modality (HAM-dermoscopy model → DDI clinical photos) shows a large dark-skin collapse (MobileNetV2 V–VI AUC below chance, 0.47). But fine-tuning within DDI (modality held constant) flattens per-FST AUC (V–VI 0.72, on par with I–II) and the light-vs-dark sensitivity gap's bootstrap CIs cross zero. So the apparent gap is substantially a domain-shift confound.	Underpowered non-equivalence, not proof of parity (V–VI n=48; CIs allow gaps up to ~0.30); image-disjoint (DDI ships no patient IDs); one dataset/backbone. A landscape scan confirmed no *diverse-skin-tone dermoscopy* dataset with biopsy-confirmed cancer exists.
RQ5 Shortcuts	Backbones differ markedly in non-lesion-context reliance (MobileNetV2 most lesion-centric, EfficientNetB0 most context-sensitive).	Synthetic ink/ruler artifacts did not reproduce the Winkler melanoma-inflation effect (honest negative). Methodological point: the occlusion fill is itself a confound — the mean-fill control changed a conclusion.
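The RQ1 row relies on comparing two models on an identical test set. One way such a paired comparison could be scored is a paired bootstrap on the AUC difference, sketched below. This is an assumption-laden illustration rather than the repository's implementation: the function names, the toy scores, and the image-level resampling are all hypothetical, and a real analysis might resample whole lesions instead of single images.

```python
import random


def auc(labels, probs):
    """Mann-Whitney AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(labels, probs_a, probs_b, n_boot=1000, seed=0):
    """Point estimate and percentile 95% CI for AUC(a) - AUC(b).

    Both models are scored on the SAME resampled indices in each replicate,
    which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined when a resample contains one class only
        a = auc(y, [probs_a[i] for i in idx])
        b = auc(y, [probs_b[i] for i in idx])
        deltas.append(a - b)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    point = auc(labels, probs_a) - auc(labels, probs_b)
    return point, lo, hi


# Made-up scores from a hypothetical "leaky" and "clean" model on the same images.
labels = [1, 1, 1, 0, 0, 0]
probs_leaky = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2]
probs_clean = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]

point, lo, hi = paired_bootstrap_delta(labels, probs_leaky, probs_clean)
```

With only six toy images the interval is far too wide to say anything. The point of the sketch is the pairing: each replicate reuses the same indices for both models, so variation in which cases were drawn cancels out of the difference.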

Full write-ups with methods, tables, and caveats are in docs/: one RQx_DESIGN.md and RQx_RESULT.md per axis, plus the RESEARCH_PROPOSAL.md (literature grounding + novelty assessment). RQ3 (calibration) is folded into RQ2 as an axis, not a standalone RQ.

Repository layout

bench/          the benchmark — frozen manifests, training, the shared eval spine
  metrics.py            eval spine: threshold-on-val, sens/spec, calibration (ECE/Brier/NLL), bootstrap CIs
  make_manifests.py     lesion-grouped + naive splits (RQ1)
  make_injection_manifests.py  paired leakage-injection manifests (RQ1)
  make_cv_folds.py      lesion-grouped CV folds (RQ2)
  train_eval.py         train one backbone on a frozen manifest -> validated prediction table
  ensemble_eval.py      single vs k-fold vs heterogeneous ensembles (RQ2)
  fairness_eval.py      RQ4 fork A: source-locked cross-modality stress test on DDI
  ddi_manifest.py / ddi_finetune.py / ddi_fairness.py   RQ4 fork B: within-clinical fine-tune
  shortcut_eval.py      RQ5: occlusion + synthetic-artifact sensitivity
docs/           design + result write-ups per RQ, plus the research proposal
runs/           small JSON result summaries (rq*.json) only — per-run outputs are gitignored
src/            the original training/eval/export pipeline (HAM10000, TF-Lite export)
manifests/      frozen HAM10000 split manifests (DDI manifests are gitignored, see below)

Methodology principles

Leakage control by design. Splits are lesion-grouped; effects are measured with paired comparisons on an identical frozen test set, not by comparing different test populations.
No test tuning. Operating thresholds and temperature scaling are fit on a held-out validation set and frozen before touching test.
The prediction table is the contract. Every model writes one row per image (image_id, lesion_id, dx, label, prob, split, ...); all metrics derive from that table.
Uncertainty is reported. Bootstrap 95% CIs on every headline number; effects are called "inconclusive" or "underpowered" when the CIs say so.
Adversarial review. Each axis passed a second-model methodology review before and after build.
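The "no test tuning" principle applies to temperature scaling as well as to thresholds. A minimal sketch of fitting a single temperature on validation predictions and then applying it unchanged to test predictions is shown below. It assumes a binary `prob` column, a simple grid search over the temperature, and hypothetical function names; the repository's own fitting procedure may differ.

```python
import math


def _logit(p, eps=1e-7):
    """Inverse sigmoid, with clipping so 0 and 1 do not produce infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def apply_temperature(probs, T):
    """Rescale binary probabilities by dividing their logits by T."""
    return [1 / (1 + math.exp(-_logit(p) / T)) for p in probs]


def nll(labels, probs, eps=1e-7):
    """Mean negative log-likelihood of binary labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def brier(labels, probs):
    """Mean squared error between probabilities and binary labels."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def fit_temperature(val_labels, val_probs, grid=None):
    """Assumed procedure: grid search for the T that minimises validation NLL."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # T from 0.05 to 10.0
    return min(grid, key=lambda T: nll(val_labels, apply_temperature(val_probs, T)))


# Made-up, deliberately overconfident validation predictions.
val_labels = [1, 1, 0, 0, 1, 0]
val_probs = [0.99, 0.99, 0.01, 0.01, 0.01, 0.99]

T = fit_temperature(val_labels, val_probs)  # fit on validation only

# T is frozen here; test predictions are rescaled but never used to choose T.
test_probs = [0.98, 0.03, 0.97]
test_probs_scaled = apply_temperature(test_probs, T)
```

Dividing logits by a positive constant preserves the ranking of predictions, so AUC is unaffected. Only calibration metrics such as NLL and Brier change, which is why comparing them before scaling can be misleading.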

Environment & reproduction

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

GPU training was run under a WSL2 TensorFlow-GPU environment (RTX 4090). Typical flow per axis: build a frozen manifest → train_eval.py to emit prediction tables → the axis's eval script → a small runs/rq*.json summary. Example (RQ1, one backbone):

python bench/make_injection_manifests.py --seed 42
python bench/train_eval.py --manifest manifests/inject_clean_seed42.csv \
    --images <ham10000_dir> --model efficientnetb0 --out runs/clean_effb0 --seed 0
python bench/metrics.py --pred runs/clean_effb0/predictions.csv
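Before training on a manifest, it can be useful to confirm whether any lesion appears in more than one split. The sketch below is a hypothetical sanity check, not a script shipped in bench/: it assumes a CSV with image_id, lesion_id, and split columns (the actual manifest schema may differ), and the IDs in the demo are made up.

```python
import csv
import io
from collections import defaultdict


def lesions_spanning_splits(rows):
    """Return {lesion_id: set_of_splits} for lesions found in more than one split."""
    splits_by_lesion = defaultdict(set)
    for r in rows:
        splits_by_lesion[r["lesion_id"]].add(r["split"])
    return {lid: s for lid, s in splits_by_lesion.items() if len(s) > 1}


# Made-up manifest contents; a real run would open the manifest CSV file instead.
demo_csv = """image_id,lesion_id,split
ISIC_0000001,HAM_0000001,train
ISIC_0000002,HAM_0000001,train
ISIC_0000003,HAM_0000002,test
ISIC_0000004,HAM_0000003,val
ISIC_0000005,HAM_0000002,train
"""

rows = list(csv.DictReader(io.StringIO(demo_csv)))
leaks = lesions_spanning_splits(rows)

for lesion_id, splits in sorted(leaks.items()):
    print(f"{lesion_id} appears in splits: {sorted(splits)}")
```

A lesion-grouped manifest would be expected to return an empty result, while a manifest with leakage injected on purpose would presumably list the injected lesions.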

Data & licensing

HAM10000 (Tschandl et al. 2018) — open; the primary dermoscopy set. lesion_id enables grouping (note: lesion_id ≠ patient ID). Frozen split manifests are included.
DDI (Daneshjou et al. 2022) — gated via the Stanford AIMI portal under a Research Use Agreement that forbids redistributing any portion of the dataset. DDI images, derived fold manifests, and per-image outputs are therefore gitignored — only aggregate metric summaries are tracked. Regenerate locally with bench/ddi_manifest.py after accepting the agreement.

Medical disclaimer

This is a research benchmark for studying evaluation methodology. It is not a medical device and must not be used for diagnosis or patient care.
