Computational Semi-supervised MOdular Selection: a framework for the data-driven discovery of temporal transitions in cancer recurrence biology.
This repository contains the reproducibility pipeline for the manuscript COSMOS reveals a convergent proliferation-driven biological transition in pan-cancer recurrence (Zafar et al., under review). The pipeline identifies a 21 to 25 month convergence window separating early proliferation-driven recurrence from late recurrence biology across breast cancer, lung adenocarcinoma, glioblastoma, and renal clear cell carcinoma, and validates the resulting phase definition through Cox regression and landmark analysis across nine TCGA cancer types and through whole-slide histopathology in 9,816 diagnostic images.
COSMOS combines unsupervised modular feature reduction with a Maximally Selected Rank Statistics (MSRS) scan to discover temporal transition cut-points in recurrence biology, then validates the resulting phase definition against independent overall-survival endpoints.
Key findings reproduced by this pipeline:
- Temporal transition at 25 months in the breast discovery cohort GSE2034 (permutation P < 0.001), 21 months in LUAD GSE31210 (P = 0.228), and 23 months in TCGA-BRCA molecular validation (P = 0.128), defining a 21 to 25 month convergence window
- Proliferation module Cohen's d = 0.76 (P < 0.001) between early and late phase
- Pan-cancer Cox HR = 4.25 (95% CI: 3.61 to 5.01) across eight TCGA cancer types, n = 2,871, 516 OS events
- 24-month landmark HR = 2.33 (95% CI: 1.82 to 2.98), n = 2,440
- Whole-slide image validation HR = 1.80 (95% CI: 1.55 to 2.08), n = 4,881 patients, 9,816 slides
- 36 of 37 pre-specified pathways significant at FDR < 0.05 after Benjamini-Hochberg correction
- Spearman correlation rho = +0.60 between PAM50 subtype Ki-67 rank and median recurrence time
.
├── 01_preprocessing.ipynb GEO + TCGA loading, ComBat, ssGSEA
├── 02_cosmos_feature_selection.ipynb Hopkins, K-means, IQR + IsoForest, MSLR
├── 03_msrs_phase_analysis.ipynb MSRS scan, phase comparison, JSD, BH, bootstrap
├── 04_pancancer_cox.ipynb Per-cancer + pan-cancer Cox, landmark
├── 05_wsi_validation.ipynb ResNet50 features, patient-level Cox
├── 06_subtype_analysis.ipynb PAM50 subtype-stratified Spearman
├── README.md
└── requirements.txt
Each notebook is self-contained and writes intermediate pickled outputs that downstream notebooks consume.
Python 3.10 or higher is required. Create a fresh environment:
python -m venv cosmos-env
source cosmos-env/bin/activate
pip install -r requirements.txtCore dependencies:
| Package | Tested version |
|---|---|
| numpy | 1.26 |
| pandas | 2.1 |
| scipy | 1.11 |
| scikit-learn | 1.3 |
| statsmodels | 0.14 |
| lifelines | 0.27 |
| gseapy | 1.1 |
| inmoose (ComBat) | 0.7 |
| torch | 2.1 |
| torchvision | 0.16 |
| openslide-python | 1.3 |
| tqdm, joblib | latest |
A CUDA-enabled GPU is recommended for notebook 05 (whole-slide feature extraction). Notebooks 01 through 04 and 06 run on CPU.
Set the COSMOS_BASE_DIR environment variable to the root of your data directory:
export COSMOS_BASE_DIR=/path/to/cosmos_dataExpected directory structure:
$COSMOS_BASE_DIR/
├── Raw Data/
│ ├── geo/
│ │ ├── GSE2034/
│ │ │ ├── GSE2034_series_matrix.txt
│ │ │ ├── GSE2034_family.soft.gz
│ │ │ └── GSE2034_clinical.csv # manual curation of DMFS
│ │ ├── GSE2990/
│ │ │ ├── GSE2990_series_matrix.txt
│ │ │ ├── GSE2990_family.soft.gz
│ │ │ └── GSE2990_suppl_info.txt
│ │ ├── GSE103746/
│ │ └── GSE31210/
│ ├── tcga_rnaseq/
│ │ ├── TCGA-BRCA/
│ │ │ ├── counts/ # STAR-aligned gene counts
│ │ │ └── patient_file_map.csv
│ │ ├── TCGA-LUAD/
│ │ ├── TCGA-KIRC/
│ │ ├── TCGA-GBM/
│ │ └── gencode.v36.annotation.gtf.gz
│ ├── tcga_clinical/
│ │ ├── TCGA-CDR.csv # https://gdc.cancer.gov
│ │ └── brca_tcga_pan_can_atlas_2018_clinical_data.tsv # cBioPortal
│ └── pathway_databases/
│ ├── h.all.gmt # MSigDB v2023.2 Hallmark
│ ├── c2.cp.kegg_legacy.gmt
│ ├── c2.cp.reactome.gmt
│ └── c5.go.bp.gmt
├── intermediates/ # created by notebooks
└── wsi_results/ # created by notebook 05
Whole-slide images for notebook 05 are located under a separate directory specified by COSMOS_WSI_DIR. Files should be named with TCGA patient barcodes (e.g., TCGA-XX-YYYY-...svs). Both flat and per-cancer-type subdirectory layouts are supported.
Notebooks are designed to run sequentially. Each notebook writes its outputs to $COSMOS_BASE_DIR/intermediates/ for downstream notebooks to consume.
jupyter nbconvert --to notebook --execute 01_preprocessing.ipynb
jupyter nbconvert --to notebook --execute 02_cosmos_feature_selection.ipynb
jupyter nbconvert --to notebook --execute 03_msrs_phase_analysis.ipynb
jupyter nbconvert --to notebook --execute 04_pancancer_cox.ipynb
jupyter nbconvert --to notebook --execute 05_wsi_validation.ipynb
jupyter nbconvert --to notebook --execute 06_subtype_analysis.ipynbApproximate runtimes on a workstation with 384 GB RAM and an RTX 4090:
| Notebook | CPU only | With GPU |
|---|---|---|
| 01 preprocessing | 35 min | n/a |
| 02 COSMOS | 12 min | n/a |
| 03 MSRS + BH + bootstrap | 18 min | n/a |
| 04 pan-cancer Cox | 2 min | n/a |
| 05 WSI validation | impractical | 9 to 14 hr |
| 06 subtype analysis | < 1 min | n/a |
All randomized components use a fixed seed of 42:
numpy,random, andPYTHONHASHSEEDare seeded at the top of each notebooksklearnestimators receiverandom_state=42- ComBat and ssGSEA permutation tests use
np.random.RandomState(seed)for reproducible streams - PyTorch (notebook 05) seeds both CPU and CUDA streams
The pipeline produces deterministic outputs given identical input data and software versions. Small numerical differences (within reported confidence intervals) may occur across scikit-learn or gseapy versions.
- Endpoint priority: DFI as primary recurrence endpoint, PFI as fallback. GBM uses PFI as primary because DFI is not meaningful for an infiltrative primary brain tumor.
- Phase definition: Binary indicator of recurrence event observed within 24 months. Censored observations and late events both belong to the "late" group.
- Cox covariates: All pan-cancer models are stratified by cancer type. The whole-slide validation model includes
phase_early + PC1with cancer-type stratification. - Pathway panel: 37 pathways across four biological modules (Proliferation, Immune, Metabolic, Microenvironment), pre-specified before any survival analysis to prevent post-hoc selection.
- WSI feature extractor: ResNet50 pretrained on ImageNet. Pathology foundation models (UNI, CONCH, Virchow) are not used here because they were pretrained on TCGA, which would create circular dependence with the validation cohort.
Each notebook writes one or more pickle files to $COSMOS_BASE_DIR/intermediates/:
| Notebook | Outputs |
|---|---|
| 01 | geo_clinical.pkl, tcga_clinical.pkl, geo_expr_corrected.pkl, tcga_expr_corrected.pkl, ssgsea_corrected.pkl, ssgsea_raw.pkl, combined_genesets.pkl, tcga_endpoint_used.pkl |
| 02 | stage1_results.pkl, stage2_results.pkl, stage3_results.pkl, core_pathways_37.pkl, module_scores_37.pkl, brca_specific_modules.pkl, luad_specific_modules.pkl |
| 03 | transition_results.pkl (cut-points, phase comparisons, JSD, BH-corrected pathway table, bootstrap CIs) |
| 04 | cox_summary.pkl (per-cancer, pan-cancer, leave-one-out, landmark) |
| 05 | patient_level_data.csv, cancer_specific_results.csv, summary.json, wsi_pca_model.pkl, wsi_scaler.pkl |
| 06 | subtype_results.pkl (per-subtype metrics, Spearman rho and p-value) |
If you use this code, please cite:
Zafar SA, Zeng S, Khan AAK, Faisal MS, Qin W. COSMOS reveals a convergent proliferation-driven biological transition in pan-cancer recurrence. Manuscript under review.
Corresponding author: Prof. Wei Qin (wqin@sjtu.edu.cn), Department of Industrial Engineering and Management, Shanghai Jiao Tong University.
This work was supported by the Shanghai Jiao Tong University Yang Generation Award (SJTU YG2025QNA31).
MIT License. See LICENSE for details. Use of TCGA, GEO, and cBioPortal data is subject to their respective data use agreements.
For questions about the code, open an issue on GitHub. For research collaboration, contact the corresponding author through the affiliated institution.