Lawson-Darrow/Skin-Cancer-Model

A Leakage-Controlled Benchmark for Skin-Lesion Classification

A research-grade re-examination of how skin-lesion classifiers are evaluated, built on HAM10000 (dermoscopy) and DDI (diverse-skin-tone clinical photos). The headline 90%+ accuracies that circulate for HAM10000 are largely an artifact of how the data is split and scored. This repo measures those artifacts directly, under a single fixed protocol, across four axes:

  1. RQ1 — Leakage. How much does lesion-level train/test leakage inflate reported AUC?
  2. RQ2 — Ensembling & calibration. Does a diverse-architecture ensemble actually beat a single model on calibration and diversity, not just accuracy?
  3. RQ4 — Skin-tone fairness. Is the well-known dark-skin performance gap a clean Fitzpatrick effect, or partly a domain-shift confound?
  4. RQ5 — Shortcut sensitivity. How much do the models lean on non-lesion context and synthetic artifacts?

These are not four separate projects; they are axes of one benchmark artifact: leakage-controlled × {backbones, ensembles} × {accuracy, calibration, fairness, shortcuts}. Every metric is computed from a frozen prediction table by a shared evaluation spine (bench/metrics.py), so results are reproducible and every axis uses identical metric definitions. Each axis was designed, then independently reviewed by a second model (Codex) both before the build and again after the result, to catch confounds before they reached a conclusion.
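
As a rough illustration of how a metric can be derived from the frozen prediction table alone, the sketch below fits an operating threshold on validation rows and then applies it unchanged to test rows. It is a minimal sketch, not the code in bench/metrics.py: the function names, the toy rows, and the use of Youden's J as the selection criterion are assumptions made for illustration; only the column names (label, prob, split) come from the table contract described in this README.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]  # one prediction-table row: image_id, label, prob, split


def sens_spec(rows: List[Row], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when predicting positive for prob >= threshold."""
    tp = fn = tn = fp = 0
    for r in rows:
        positive_call = float(r["prob"]) >= threshold
        if int(r["label"]) == 1:
            if positive_call:
                tp += 1
            else:
                fn += 1
        else:
            if positive_call:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def pick_threshold(val_rows: List[Row]) -> float:
    """Hypothetical criterion: maximise Youden's J (sens + spec - 1) on validation."""
    best_t, best_j = 0.5, -1.0
    for t in sorted({float(r["prob"]) for r in val_rows}):
        sens, spec = sens_spec(val_rows, t)
        j = sens + spec - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t


# Toy prediction table (made-up values, for illustration only).
table = [
    {"image_id": "v1", "label": 1, "prob": 0.90, "split": "val"},
    {"image_id": "v2", "label": 1, "prob": 0.60, "split": "val"},
    {"image_id": "v3", "label": 0, "prob": 0.40, "split": "val"},
    {"image_id": "v4", "label": 0, "prob": 0.20, "split": "val"},
    {"image_id": "t1", "label": 1, "prob": 0.70, "split": "test"},
    {"image_id": "t2", "label": 1, "prob": 0.50, "split": "test"},
    {"image_id": "t3", "label": 0, "prob": 0.30, "split": "test"},
    {"image_id": "t4", "label": 0, "prob": 0.65, "split": "test"},
]
val = [r for r in table if r["split"] == "val"]
test = [r for r in table if r["split"] == "test"]

threshold = pick_threshold(val)          # fitted on validation rows only
sens, spec = sens_spec(test, threshold)  # frozen threshold applied to test
print(threshold, sens, spec)
```

Because both functions read only the table, any backbone or ensemble that writes the same columns can be scored the same way.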

Findings

| Axis | Result | Honest bound |
| --- | --- | --- |
| RQ1 Leakage | Lesion-level leakage inflates test AUC by ~1.3–1.7 pts population-wide and ~6–7 pts on the cases where leakage is possible (~5× amplification), via a paired leakage-injection design on an identical test set; every 95% CI excludes zero across 3 backbones. | The naive "grouped vs ungrouped" comparison shows nothing because it is confounded and diluted; the effect only appears with the paired design. |
| RQ2 Ensembling | A 3-architecture heterogeneous ensemble has ~3× the member diversity of a 3-fold same-architecture average and is best-calibrated after temperature scaling (best Brier/NLL), with the highest AUC. The homogeneous fold-average barely diversifies. | The AUC edge is inconclusive at 3 seeds. Methodological point: raw ensemble calibration is misleading; you must temperature-scale before comparing. |
| RQ4 Fairness | Cross-modality (HAM-dermoscopy model → DDI clinical photos) shows a large dark-skin collapse (MobileNetV2 V–VI AUC below chance, 0.47). But fine-tuning within DDI (modality held constant) flattens per-FST AUC (V–VI 0.72, on par with I–II), and the bootstrap CIs for the light-vs-dark sensitivity gap cross zero. So the apparent gap is substantially a domain-shift confound. | Underpowered non-equivalence, not proof of parity (V–VI n=48; CIs allow gaps up to ~0.30); image-disjoint (DDI ships no patient IDs); one dataset/backbone. A landscape scan confirmed no diverse-skin-tone dermoscopy dataset with biopsy-confirmed cancer exists. |
| RQ5 Shortcuts | Backbones differ markedly in non-lesion-context reliance (MobileNetV2 most lesion-centric, EfficientNetB0 most context-sensitive). Synthetic ink/ruler artifacts did not reproduce the Winkler melanoma-inflation effect (honest negative). | Methodological point: the occlusion fill is itself a confound; the mean-fill control changed a conclusion. |
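
The RQ2 point about temperature-scaling before comparing calibration can be illustrated with a small sketch. It assumes a binary task with one probability per image (as the prob column suggests), recovers a logit from each probability, and fits a single temperature on validation data by grid search over negative log-likelihood. The function names, the grid bounds, and the toy values are illustrative assumptions; the repo's own fitting procedure may differ.

```python
import math
from typing import List, Sequence


def _logit(p: float, eps: float = 1e-7) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary negative log-likelihood."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


def apply_temperature(probs: Sequence[float], temperature: float) -> List[float]:
    """Rescale each logit by 1/T; the ranking of images is unchanged."""
    return [_sigmoid(_logit(p) / temperature) for p in probs]


def fit_temperature(val_probs, val_labels, lo=0.05, hi=20.0, steps=401) -> float:
    """Grid search (geometric grid) for the T that minimises validation NLL."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    return min(grid, key=lambda t: nll(apply_temperature(val_probs, t), val_labels))


# Toy over-confident validation predictions (made-up values).
val_probs = [0.99, 0.98, 0.97, 0.99, 0.02, 0.01, 0.98, 0.03]
val_labels = [1, 1, 0, 1, 0, 0, 0, 1]

T = fit_temperature(val_probs, val_labels)  # fitted on validation only
nll_before = nll(val_probs, val_labels)
nll_after = nll(apply_temperature(val_probs, T), val_labels)
print(T, nll_before, nll_after)
```

In a real comparison the fitted temperature would be frozen and then applied to test probabilities for every model and ensemble before Brier, NLL, or ECE are compared.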

Full write-ups with methods, tables, and caveats are in docs/: one RQx_DESIGN.md and one RQx_RESULT.md per axis, plus RESEARCH_PROPOSAL.md (literature grounding + novelty assessment). RQ3 (calibration) is folded into RQ2 as an axis rather than kept as a standalone RQ, which is why the numbering above skips it.

Repository layout

bench/          the benchmark — frozen manifests, training, the shared eval spine
  metrics.py            eval spine: threshold-on-val, sens/spec, calibration (ECE/Brier/NLL), bootstrap CIs
  make_manifests.py     lesion-grouped + naive splits (RQ1)
  make_injection_manifests.py  paired leakage-injection manifests (RQ1)
  make_cv_folds.py      lesion-grouped CV folds (RQ2)
  train_eval.py         train one backbone on a frozen manifest -> validated prediction table
  ensemble_eval.py      single vs k-fold vs heterogeneous ensembles (RQ2)
  fairness_eval.py      RQ4 fork A: source-locked cross-modality stress test on DDI
  ddi_manifest.py / ddi_finetune.py / ddi_fairness.py   RQ4 fork B: within-clinical fine-tune
  shortcut_eval.py      RQ5: occlusion + synthetic-artifact sensitivity
docs/           design + result write-ups per RQ, plus the research proposal
runs/           small JSON result summaries (rq*.json) only — per-run outputs are gitignored
src/            the original training/eval/export pipeline (HAM10000, TF-Lite export)
manifests/      frozen HAM10000 split manifests (DDI manifests are gitignored, see below)

Methodology principles

  • Leakage control by design. Splits are lesion-grouped; effects are measured with paired comparisons on an identical frozen test set, not by comparing different test populations.
  • No test tuning. Operating thresholds and temperature scaling are fit on a held-out validation set and frozen before touching test.
  • The prediction table is the contract. Every model writes one row per image (image_id, lesion_id, dx, label, prob, split, ...); all metrics derive from that table.
  • Uncertainty is reported. Bootstrap 95% CIs on every headline number; effects are called "inconclusive" or "underpowered" when the CIs say so.
  • Adversarial review. Each axis passed a second-model methodology review before and after build.
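
The paired-comparison and uncertainty principles above can be sketched together: two models are scored on the same frozen test images, the same resampled indices are used for both, and a percentile interval is taken over the AUC difference. This is a minimal sketch under stated assumptions: the function names are hypothetical, the resampling unit here is the image, and the repo may instead resample by lesion or use a different interval.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def paired_bootstrap_delta(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and 95% percentile CI for AUC(a) - AUC(b) on one test set."""
    rng = random.Random(seed)
    n = len(labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both models
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that contain a single class
            a = auc(y, [scores_a[i] for i in idx])
            b = auc(y, [scores_b[i] for i in idx])
            deltas.append(a - b)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    return auc(labels, scores_a) - auc(labels, scores_b), lo, hi


# Toy usage with made-up scores for two models on the same six test images.
y_test = [0, 0, 1, 1, 0, 1]
model_a = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
model_b = [0.30, 0.60, 0.35, 0.70, 0.50, 0.40]
print(paired_bootstrap_delta(y_test, model_a, model_b, n_boot=500))
```

Reusing the same indices for both models is what makes the comparison paired: variation from which test images were drawn cancels out of the difference instead of widening the interval.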

Environment & reproduction

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

GPU training was run under a WSL2 TensorFlow-GPU environment (RTX 4090). Typical flow per axis: build a frozen manifest → train_eval.py to emit prediction tables → the axis's eval script → a small runs/rq*.json summary. Example (RQ1, one backbone):

python bench/make_injection_manifests.py --seed 42
python bench/train_eval.py --manifest manifests/inject_clean_seed42.csv \
    --images <ham10000_dir> --model efficientnetb0 --out runs/clean_effb0 --seed 0
python bench/metrics.py --pred runs/clean_effb0/predictions.csv
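
The first step of this flow builds a frozen manifest. The core idea of a lesion-grouped split, in which every image of a lesion lands in the same partition, can be sketched as follows. This is an illustrative sketch only: the function name, the split fractions, and the toy rows are assumptions, and the real bench/make_manifests.py may also stratify by diagnosis or handle edge cases this version ignores.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def lesion_grouped_split(
    rows: List[Dict[str, str]],
    fractions: Sequence[float] = (0.7, 0.15, 0.15),  # train, val, test (assumed)
    seed: int = 42,
) -> Dict[str, str]:
    """Map image_id -> split so that no lesion_id spans two splits."""
    by_lesion = defaultdict(list)
    for r in rows:
        by_lesion[r["lesion_id"]].append(r["image_id"])

    lesions = sorted(by_lesion)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(lesions)

    n_train = int(fractions[0] * len(lesions))
    n_val = int(fractions[1] * len(lesions))

    split_of: Dict[str, str] = {}
    for i, lesion in enumerate(lesions):
        if i < n_train:
            name = "train"
        elif i < n_train + n_val:
            name = "val"
        else:
            name = "test"
        for image_id in by_lesion[lesion]:  # all images of a lesion move together
            split_of[image_id] = name
    return split_of


# Toy metadata: 20 hypothetical lesions with 1 to 3 images each.
rows = [
    {"image_id": f"img{lesion}_{k}", "lesion_id": f"les{lesion}"}
    for lesion in range(20)
    for k in range(1 + lesion % 3)
]
split_of = lesion_grouped_split(rows)
print(sorted(set(split_of.values())))
```

Splitting at the image level instead would let two photos of the same lesion fall on either side of the train/test boundary, which is the leakage RQ1 measures.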

Data & licensing

  • HAM10000 (Tschandl et al. 2018) — open; the primary dermoscopy set. lesion_id enables grouping (note: lesion_id ≠ patient ID). Frozen split manifests are included.
  • DDI (Daneshjou et al. 2022) — gated via the Stanford AIMI portal under a Research Use Agreement that forbids redistributing any portion of the dataset. DDI images, derived fold manifests, and per-image outputs are therefore gitignored — only aggregate metric summaries are tracked. Regenerate locally with bench/ddi_manifest.py after accepting the agreement.

Medical disclaimer

This is a research benchmark for studying evaluation methodology. It is not a medical device and must not be used for diagnosis or patient care.
