Interpretable pathway-based machine learning for breast cancer prognosis: a four-cohort benchmark against PAM50-ROR.
Authors: Suhaan Thayyil, Eshaan Nidee
This repository contains the code, analysis pipelines, and results for a four-cohort study comparing a pathway-based survival machine learning model (Gradient Boosted Survival Analysis) against official genefu PAM50-ROR for breast cancer prognosis.
The pre-registered primary endpoint (TNBC meta-analyzed delta C-index >= +0.03, p < 0.05) was not met. The observed result was delta = +0.0144, 95% CI [-0.0472, +0.0760], p = 0.6466. All secondary analyses are exploratory.
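As a rough consistency check on the figures above, the p-value can be approximately recovered from the point estimate and its confidence interval. The sketch below assumes a symmetric normal-approximation (Wald) interval and a two-sided z-test. The README does not state which test produced p = 0.6466, and the function name `wald_p_from_ci` is hypothetical, not part of this repository.

```python
import math


def wald_p_from_ci(estimate, ci_low, ci_high, z_crit=1.959964):
    """Approximate two-sided p-value implied by a symmetric 95% CI.

    Assumption: the interval is estimate +/- z_crit * SE (normal approximation).
    """
    se = (ci_high - ci_low) / (2.0 * z_crit)  # back out the standard error
    z = estimate / se                          # Wald z-statistic
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area


delta = 0.0144
p_approx = wald_p_from_ci(delta, -0.0472, 0.0760)

# Pre-registered success rule quoted above: delta >= +0.03 and p < 0.05.
endpoint_met = (delta >= 0.03) and (p_approx < 0.05)

print(f"implied p = {p_approx:.4f}, endpoint met: {endpoint_met}")
```

Under these assumptions the implied p-value lands close to the reported 0.6466, and the success rule evaluates to not met on both criteria.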
- Harmonized database: 4,532 samples across four cohorts (TCGA-BRCA, GSE96058/SCAN-B, METABRIC, GSE20685).
- Analysis sets: 4,003 patients in patient-characteristics summaries; 4,002 metric-evaluable patients in the official PAM50-ROR discrimination analysis.
- Headline model: Gradient Boosted Survival (internal TCGA CV C-index = 0.642, delta vs Cox = +0.042).
- TNBC primary endpoint: NOT MET (meta delta = +0.0144, p = 0.6466).
- Overall discrimination: comparable to PAM50-ROR across external cohorts.
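For readers unfamiliar with the C-index quoted in these findings, the following is a minimal pure-Python sketch of Harrell's concordance index on right-censored data. It is illustrative only: the function name and toy data are hypothetical, ties in survival time are ignored for simplicity, and it is not claimed to be the estimator used by this repository's scripts.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    A pair (i, j) is comparable when subject i has an observed event and
    a strictly shorter follow-up time than subject j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier failure had higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the third subject is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
print(toy_c)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a delta C-index is the difference between two such values computed on the same patients.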
The submitted BMC Artificial Intelligence manuscript is mapped to repository files in docs/SUBMITTED_MANUSCRIPT_REPOSITORY_CROSSWALK.md.
Reviewer-facing traceability helpers:
- scripts/check_submission_consistency.py verifies the central submitted values against repository CSV/JSON outputs.
- results/Table_S2_internal_cv_manuscript_crosswalk.csv maps internal cross-validation values between the manuscript summary and exact reproducible output.
- results/Table_1_denominator_crosswalk.csv documents patient-characteristics versus metric-evaluable denominators.
src/ Python modules (models, survival analysis, data loading, features)
src/ml/ DeepSurv implementation
R/ Official genefu PAM50-ROR scoring (pam50.R)
scripts/ Analysis pipeline scripts
notebooks/ Jupyter notebooks (01-14: original analyses)
results/ All results tables and QC reports
figures/ Generated figures
paper/ Reviewer-facing LaTeX manuscript source and DOI-backed bibliography
data/ Raw, processed, and clinical data
models/ Trained model artifacts (gitignored)
tests/ Pytest test suite
docs/ Pre-registration endpoint, story lock, cohort notes
The repository includes a reviewer-facing LaTeX manuscript source at paper/main.tex, a DOI-backed BibTeX bibliography at paper/main.bib, a compiled PDF at paper/main.pdf, and a Word export at paper/main.docx. The LaTeX source mirrors the submitted competitive-not-superior framing and is intended for traceability, not as a claim that the manuscript source is the journal's copyedited version.
Build commands:
cd paper && tectonic --keep-logs main.tex
cd .. && python3 scripts/build_manuscript_docx.py
Data sources:
- TCGA-BRCA: GDC Data Portal (https://portal.gdc.cancer.gov)
- GSE96058/SCAN-B: NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96058)
- METABRIC: cBioPortal (https://www.cbioportal.org/study/summary?id=brca_metabric)
- GSE20685: NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20685)
Raw data are not redistributed in this repository.
The primary endpoint was locked before any external validation was run. See docs/PRIMARY_ENDPOINT.md (committed at 7783f4f0bfaaa6bdc611c78d33ccda621c6b243d, before any results commits).
- Install Python dependencies: pip install -r requirements.txt
- Install R with the genefu package for official PAM50-ROR scoring.
- Pipeline scripts are in scripts/. Key steps:
  - scripts/harmonize_cohorts.py -- build unified cohort database
  - scripts/score_pathways.py -- compute pathway features
  - scripts/train_ml_zoo.py -- train survival models
  - scripts/external_validation.py -- run external validation
  - scripts/compute_clinical_baselines.py -- compute clinical genomic baselines
- Phase 3 exploratory scripts: within_subtype_analysis.py, calibration_analysis.py, dca_analysis.py, stability_analysis.py
- Sensitivity analyses: external_only_tnbc_sensitivity.py, feature_ablation.py, pathway_scoring_sensitivity.py
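The key pipeline steps listed above could be chained with a small driver like the one below. This is a sketch under assumptions the README does not confirm: that the scripts are meant to run in the order listed, that each runs without command-line arguments, and that they are launched from the repository root. The names `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script paths as listed in this README; the ordering is an assumption.
PIPELINE_STEPS = [
    "scripts/harmonize_cohorts.py",
    "scripts/score_pathways.py",
    "scripts/train_ml_zoo.py",
    "scripts/external_validation.py",
    "scripts/compute_clinical_baselines.py",
]


def build_commands(python=sys.executable):
    """Return one argv list per pipeline step, relative to the repo root."""
    return [[python, step] for step in PIPELINE_STEPS]


def run_pipeline(repo_root=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True, cwd=repo_root)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The dry-run default only prints the commands, which is a cautious choice given that the raw data are not redistributed and the scripts may require inputs or arguments not described here.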
MIT License. See LICENSE.