Polygenic Risk and Imaging Multimodal Assessment
The end-to-end workflow for training/inference is:
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA ACQUISITION │
├─────────────────────────────────────────────────────────────────────────────┤
│ 1. export_chimec.py --status-only --refresh-metadata → refreshes metadata │
│ 2. export.py → requests exports from iBroker based on metadata │
│ 3. sync_local.py → syncs exported DICOMs to GPFS (from HIRO share) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES (after acquisition) │
├─────────────────────────────────────────────────────────────────────────────┤
│ 1. Disk DICOMs /gpfs/.../ChiMEC/MG/ → actual imaging data │
│ 2. Phenotype CSV Phenotype_ChiMEC_*.csv → labels (case/control, datedx) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 1: PREPROCESSING (preprocess.py) │
│ Scans disk DICOMs → extracts metadata from DICOM headers │
│ Outputs: sot/views.parquet, sot/exams.parquet, out/manifest.parquet │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 2: EMIT CSV (preprocess.py emit-csv or automatic) │
│ Joins exams with phenotype CSV → creates Mirai-compatible manifest │
│ Output: out/mirai_manifest.csv (only exams with labels) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 3: INFERENCE (run_mirai_sharded.py) │
│ Runs Mirai model on manifest → predictions │
│ Output: out/validation_output.csv │
└─────────────────────────────────────────────────────────────────────────────┘
Important: The preprocessing pipeline works directly on disk DICOMs. Historical data that exists on disk but is missing from iBroker (e.g., from earlier studies) will still be processed and can be used for training, as long as patients have phenotype labels.
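The label join in STEP 2 can be sketched with pandas. The column names here (`patient_id`, `exam_id`) are illustrative stand-ins; only `case_control`/`datedx`-style labels are taken from the phenotype CSV description above.

```python
import pandas as pd

# Toy frames standing in for sot/exams.parquet and the phenotype CSV;
# patient_id/exam_id are hypothetical column names for illustration.
exams = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p3"],
    "exam_id": ["e1", "e2", "e3", "e4"],
})
phenotype = pd.DataFrame({
    "patient_id": ["p1", "p2"],
    "case_control": ["case", "control"],
    "datedx": ["2021-01-01", None],
})

# Inner join keeps only exams whose patients have phenotype labels,
# matching "only exams with labels" in out/mirai_manifest.csv.
manifest = exams.merge(phenotype, on="patient_id", how="inner")
print(len(manifest))  # 3 (patient p3 has no label, so exam e4 is dropped)
```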
Repository layout:

- Top-level *.py: primary CLI entrypoints and analysis scripts
- prima/: lightweight Python package namespace
- scripts/: operational shell helpers
- vendor/mirai/: Mirai submodule
- data/: local generated metadata/caches (ignored in git)
# install micromamba if needed
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
./bin/micromamba shell init -s bash -r ~/micromamba
source ~/.bashrc
micromamba config append channels conda-forge
# create + activate
micromamba create -y -f env.yaml
eval "$(micromamba shell hook -s bash)"
micromamba activate prima
# install torch/vision matching cu111 wheels (from the official index)
python -m pip install \
torch==1.9.0+cu111 torchvision==0.10.0+cu111 \
-f https://download.pytorch.org/whl/torch_stable.html # from "previous versions"
# your repo + mirai fork (zarr patch)
git clone --recursive git@github.com:uchicago-dsi/prima.git
cd prima
git submodule add git@github.com:annawoodard/Mirai vendor/mirai
git submodule update --init --recursive
# install your code + mirai
pip install -e .
pip install -r requirements.txt
pip install -r requirements-dev.txt # optional: linting + notebook extras
pip install -e vendor/mirai
# fix GLIBCXX/libstdc++ compatibility issue for lifelines/pandas
# add conda/micromamba lib to LD_LIBRARY_PATH (required for survival analysis)
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# add to ~/.bashrc or ~/.bash_profile to make permanent:
# echo 'export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

The main prima environment has dependency conflicts with selenium/geckodriver. Use a separate environment:
micromamba create -y -f environment-selenium.yml
micromamba activate selenium-ff
pip install pandas requests lxml tqdm

Runtime dependencies live in requirements.txt; install requirements-dev.txt for tooling (Ruff, notebook helpers).
Before sending changes:
ruff format .
ruff check --fix .

Data acquisition involves scraping metadata from iBroker, requesting exports, and syncing files to GPFS. These steps use the selenium-ff environment.
Refreshes the iBroker metadata snapshot for all ChiMEC study IDs and overwrites data/imaging_metadata.csv.
micromamba activate selenium-ff
export IBROKER_USERNAME=your_username
export IBROKER_PASSWORD=your_password
python export_chimec.py --status-only --refresh-metadata
# start from scratch (ignore old checkpoints)
python export_chimec.py --status-only --refresh-metadata --refresh-mode fresh
# continue an interrupted refresh
python export_chimec.py --status-only --refresh-metadata --refresh-mode resume

Output: data/imaging_metadata.csv — the complete refreshed catalog of studies in iBroker.
Light fingerprints (study_id, study_date, study_uid from DICOM) enable robust disk-vs-iBroker cross-check. Dates are always from DICOM metadata, never from filename.
python export_chimec.py --build-fingerprints

Outputs:

- fingerprints/chimec/disk_fingerprints.json — compatible with disk status checks
- fingerprints/chimec/disk_fingerprints.csv — flat table for iteration/cross-check
When fingerprints exist, reconciliation uses them for disk dates (see --reconcile-disk-ibroker).
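The cross-check these fingerprints enable reduces to set arithmetic over study UIDs. A sketch with toy UIDs (not real data):

```python
# Toy fingerprint records keyed by study_uid; real ones come from DICOM
# headers via export_chimec.py --build-fingerprints.
disk = {
    "1.2.3": {"study_id": "S1", "study_date": "20190101"},
    "1.2.4": {"study_id": "S2", "study_date": "20200601"},
}
ibroker_uids = {"1.2.3", "1.2.5"}

on_disk_only = set(disk) - ibroker_uids     # historical data missing from iBroker
in_ibroker_only = ibroker_uids - set(disk)  # candidates for export
print(sorted(on_disk_only), sorted(in_ibroker_only))  # ['1.2.4'] ['1.2.5']
```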
The command:
- uses Selenium for login, then HTTP requests for study queries
- performs a full refresh of all configured study IDs
- records current iBroker status fields for downstream export decisions
Uses refreshed metadata to request exports from iBroker. It identifies studies that need to be downloaded (matching modality, not already exported/requested, not already on disk) and submits export requests via the web interface.
micromamba activate selenium-ff
export IBROKER_USERNAME=your_username
export IBROKER_PASSWORD=your_password
python export.py --auto-confirm

Common options:
# check status without exporting anything
python export.py --status-only
# run continuously with 1-hour waits between cycles
python export.py --auto-confirm --loop-wait 1h
# export a different modality
python export.py --auto-confirm --modality CT
# include patients without genotyping data
python export.py --auto-confirm --no-genotyping-filter
# limit batch size per cycle
python export.py --auto-confirm --batch-size 50

The script:
- Merges patient phenotype data, study keys, and iBroker metadata
- Filters to target modality (default: MG) and excludes already-exported or on-disk studies
- Submits export requests in batches, saving progress to data/export_state.csv
- Optionally audits pending exports to update status (--refresh-export-status)
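The candidate selection described above is a merge-and-filter. This sketch uses hypothetical column names and toy state standing in for the real metadata and data/export_state.csv:

```python
import pandas as pd

# Toy stand-ins for the merged phenotype/iBroker metadata and for the
# export/disk state tracked in data/export_state.csv.
studies = pd.DataFrame({
    "study_uid": ["a", "b", "c", "d"],
    "modality": ["MG", "MG", "CT", "MG"],
})
already_requested = {"b"}
on_disk = {"d"}

# Keep the target modality, drop anything already requested or already synced.
candidates = studies[
    (studies["modality"] == "MG")
    & ~studies["study_uid"].isin(already_requested | on_disk)
]
print(list(candidates["study_uid"]))  # ['a']
```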
After exports are requested, files appear on a HIRO CIFS share. Use sync_local.py to sync them to GPFS.
sudo mount -t cifs //cifs01uchadccd.uchad.uchospitals.edu/radiology/HIRO /mnt/uchad_samba \
-o credentials=/home/annawoodard/creds,vers=3.0,noperm,uid=$(id -u),gid=$(id -g),file_mode=0660,dir_mode=0770

Credentials file format:
username=your_username
password=your_password
domain=UCHAD
Configure source/destination paths at the top of sync_local.py:
SRC_ROOT = Path("/mnt/uchad_samba/16352A/")
DST_ROOT = Path("/gpfs/data/huo-lab/Image/ChiMEC/MG")

# dry run first (no files moved or transferred)
python sync_local.py --dry-run --no-auto-restart
# real transfer with immediate source deletion (default behavior)
python sync_local.py
# queue for deletion instead of immediate delete
python sync_local.py --no-immediate-delete
# single sync pass (default auto-restarts every 2 minutes)
python sync_local.py --no-auto-restart

The script:
- Waits for exam stability (no file changes within 10 minutes) before transferring
- Uses rsync internally for efficient copying with progress tracking
- Deletes source files after successful transfer (or queues them to _synced_and_queued_for_deletion/)
- Cleans up empty patient directories older than 1 hour
- Auto-restarts every 2 minutes by default to pick up new exports
- Handles SIGINT/SIGTERM gracefully for clean shutdown
Logs are written to sync.log and stdout.
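The stability gate ("no file changes within 10 minutes") can be sketched as an mtime check; the real sync_local.py may track additional signals:

```python
import time
from pathlib import Path

STABILITY_WINDOW_S = 10 * 60  # "no file changes within 10 minutes"

def exam_is_stable(exam_dir, now=None):
    """True if no file under exam_dir was modified inside the window."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(exam_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return False  # empty exam directory: export still in flight
    return now - max(mtimes) >= STABILITY_WINDOW_S
```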
Scan disk DICOMs, extract metadata, select full-quad exams (L-CC, L-MLO, R-CC, R-MLO), and write Zarr cache:
# full preprocessing (discovery + zarr cache)
python preprocess.py preprocess \
--raw /gpfs/data/huo-lab/Image/ChiMEC/MG \
--workers 32
# summary only (skip zarr cache, useful for quick analysis)
python preprocess.py preprocess \
--raw /gpfs/data/huo-lab/Image/ChiMEC/MG \
--workers 32 \
--summary

Outputs are written to {raw}/sot/ and {raw}/out/ by default:

- sot/views.parquet — individual DICOM views with metadata
- sot/exams.parquet — exam-level aggregated metadata
- sot/dicom_tags.parquet — all DICOM tags (wide format)
- out/manifest.parquet — Zarr URIs for each view
- out/mirai_manifest.csv — Mirai-compatible CSV with labels (auto-generated if --labels provided)
python preprocess.py emit-csv \
--raw /gpfs/data/huo-lab/Image/ChiMEC/MG \
--labels /gpfs/data/phs/groups/Projects/Huo_projects/SPORE/annawoodard/Phenotype_ChiMEC_2025Oct4.csv

To process only new exams (append to existing SoT tables):
python preprocess.py preprocess \
--raw /gpfs/data/huo-lab/Image/ChiMEC/MG \
--workers 32 \
--incremental

# monitor progress in a separate terminal
python preprocess.py monitor --interval 30
# list checkpoints
python preprocess.py checkpoint list
# show detailed status
python preprocess.py checkpoint status
# clean old checkpoints (>7 days)
python preprocess.py checkpoint clean --max-age-days 7

Important: Before running Mirai, ensure LD_LIBRARY_PATH is set:

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# check your GPU compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv

Common GPUs and required PyTorch:
- A100 (compute 8.0): needs CUDA 11+ with sm_80 → use cu118
- V100 (compute 7.0): needs CUDA 10+ with sm_70 → use cu102 or higher
- P100 (compute 6.0): needs CUDA 9+ with sm_60 → use cu102 or higher
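The compatibility rule behind this table is simple: the installed wheel's compiled arch list (what torch.cuda.get_arch_list() reports) must include your GPU's sm_* target. A stand-alone sketch:

```python
# Compute capability -> the sm_* kernel a PyTorch wheel must ship.
# A wheel missing your GPU's arch fails with "no kernel image is available".
REQUIRED_ARCH = {"8.0": "sm_80", "7.0": "sm_70", "6.0": "sm_60"}

def wheel_supports(compute_cap, compiled_archs):
    return REQUIRED_ARCH[compute_cap] in compiled_archs

# e.g. a wheel compiled without sm_80 cannot serve an A100 (compute 8.0):
print(wheel_supports("8.0", ["sm_37", "sm_50", "sm_60", "sm_70"]))  # False
print(wheel_supports("8.0", ["sm_50", "sm_60", "sm_70", "sm_80"]))  # True
```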
# install/reinstall PyTorch with correct CUDA version (A100 example)
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
# verify GPU is detected
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
# check compiled architectures if you get "no kernel image" errors
python -c "import torch; print(f'Compiled for: {torch.cuda.get_arch_list()}')"

python scripts/main.py \
--model_name mirai_full \
--img_encoder_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/mgh_mammo_MIRAI_Base_May20_2019.p" \
--transformer_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/mgh_mammo_cancer_MIRAI_Transformer_Jan13_2020.p" \
--calibrator_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/calibrators/MIRAI_FULL_PRED_RF.callibrator.p" \
--batch_size 8 \
--num_workers 4 \
--dataset csv_mammo_risk_all_full_future \
--img_mean 7047.99 \
--img_size 1664 2048 \
--img_std 12005.5 \
--metadata_path /gpfs/data/huo-lab/Image/ChiMEC/MG/out/mirai_manifest.csv \
--test \
--prediction_save_path /gpfs/data/huo-lab/Image/ChiMEC/MG/out/validation_output.csv \
--cuda

For multi-GPU data-parallel inference:

python scripts/main.py \
--model_name mirai_full \
--img_encoder_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/mgh_mammo_MIRAI_Base_May20_2019.p" \
--transformer_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/mgh_mammo_cancer_MIRAI_Transformer_Jan13_2020.p" \
--calibrator_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/calibrators/MIRAI_FULL_PRED_RF.callibrator.p" \
--batch_size 128 \
--num_workers 16 \
--dataset csv_mammo_risk_all_full_future \
--img_mean 7047.99 \
--img_size 1664 2048 \
--img_std 12005.5 \
--metadata_path /gpfs/data/huo-lab/Image/ChiMEC/MG/out/mirai_manifest.csv \
--test \
--prediction_save_path /gpfs/data/huo-lab/Image/ChiMEC/MG/out/validation_output.csv \
--cuda \
--num_gpus 4 \
--data_parallel

Important partition behavior:

- run_mirai_sharded.py is a launcher that submits one SLURM job per shard
- the launcher can run on a CPU partition (tier1q)
- shard inference jobs run on the partition passed via --partition (use gpuq)
- so GPU placement is controlled by run_mirai_sharded.py --partition gpuq, not by where the launcher itself runs
python run_mirai_sharded.py \
--max_samples_per_shard 500 \
--metadata_path /gpfs/data/huo-lab/Image/ChiMEC/MG/out/mirai_manifest.csv \
--prediction_save_path /gpfs/data/huo-lab/Image/ChiMEC/MG/out/validation_output.csv \
--model_name mirai_full \
--img_encoder_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/mgh_mammo_MIRAI_Base_May20_2019.p" \
--transformer_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/mgh_mammo_cancer_MIRAI_Transformer_Jan13_2020.p" \
--calibrator_snapshot "/gpfs/data/huo-lab/Image/annawoodard/.mirai/snapshots/calibrators/Mirai_calibrator_mar12_2022.p" \
--batch_size 8 \
--num_workers 4 \
--dataset csv_mammo_risk_all_full_future \
--img_mean 7047.99 \
--img_size 1664 2048 \
--img_std 12005.5 \
--test \
--cuda

The sharded script will:

- Split the CSV into chunks (use --num_shards N or --max_samples_per_shard N)
- Submit one GPU job per chunk via SLURM
- Automatically collate results when jobs complete
- Save recovery metadata for resuming if the main process dies
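The chunking implied by --max_samples_per_shard can be sketched as follows (a simplification; the real launcher may balance shards differently):

```python
import math

def split_into_shards(n_rows, max_samples_per_shard):
    """Chunk row indices into roughly equal shards, none exceeding the cap.

    A sketch of the --max_samples_per_shard idea, not run_mirai_sharded.py's
    exact algorithm.
    """
    n_shards = math.ceil(n_rows / max_samples_per_shard)
    size = math.ceil(n_rows / n_shards)
    return [range(i, min(i + size, n_rows)) for i in range(0, n_rows, size)]

shards = split_into_shards(1200, 500)
print([len(s) for s in shards])  # [400, 400, 400]
```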
Recovery and debugging:
# recover from a crashed run
python run_mirai_sharded.py --recover /path/to/mirai_shards/job_metadata.json
# debug with a small sample
python run_mirai_sharded.py --debug_max_samples 100 --num_shards 2 ...

Monitor GPU utilization:

watch -n 1 nvidia-smi

If the process seems hung or slow:
# LOW GPU MEMORY (<1GB) = stuck in data loading, not GPU compute
# Try with NO multiprocessing first:
--num_workers 0 --batch_size 16
# If that works, gradually increase:
--num_workers 2 --batch_size 32
--num_workers 4 --batch_size 64
# add CUDA_LAUNCH_BLOCKING=1 to see exactly where it hangs
CUDA_LAUNCH_BLOCKING=1 python scripts/main.py ... --cuda

Common bottlenecks:

- GPU memory <1GB, stuck at 0%: data loading hang — use --num_workers 0
- First batch very slow: normal with large images (1664x2048) — can take 2-5 minutes initially
- Low GPU utilization (<50%): increase --batch_size
- High GPU memory: decrease --batch_size
- Network filesystem: multiprocessing can cause deadlocks — start with --num_workers 0
Notes:
- The LD_LIBRARY_PATH setting is required because pandas (a lifelines dependency) needs GLIBCXX_3.4.29
- Ignore the "Restarting run from scratch" message during inference — it's always printed even when only running --test
Compute per-horizon AUC and survival metrics (Uno's C-index, time-dependent AUC, integrated Brier score).
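As a toy illustration of the per-horizon AUC only (analyze_mirai.py itself handles censoring and the survival metrics properly), a rank-based AUC over binary horizon labels:

```python
def auc(labels, scores):
    """Rank-based AUC: probability a random positive outscores a random
    negative (ties count half). A minimal stand-in for the per-horizon AUC,
    with no censoring handling.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# e.g. for a 5-year horizon: label = cancer within 5 years of the exam
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.3, 0.1]))  # 1.0 (perfect separation)
```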
# copy and edit the example config
cp configs/analysis.yaml /tmp/analyze_mirai.yaml
# run with a single config input
python analyze_mirai.py /tmp/analyze_mirai.yaml

analyze_mirai.py is now config-only (OmegaConf YAML/JSON). The config supports:

- Paths and core options (out_dir, out_json, split, colmap, kfold)
- QC status filtering (include_statuses, exclude_statuses)
- Auto-filtering aligned with QC server filters (GEMS, implant, scanned film, etc.)
- Annotation tag filtering (annotation_include_*, annotation_exclude_*)
Each run saves config snapshots alongside outputs:
- analyze_mirai_config.source.yaml
- analyze_mirai_config.input.yaml
- analyze_mirai_config.resolved.yaml
Analyze imaging metadata and generate summary plots.
# basic usage
python analyze_metadata.py --modality MG
# dump screening patients (scans ≥3 months before diagnosis)
python analyze_metadata.py --modality MG --dump-screening-patients

Available modalities: CR, DX, MG, US, CT, MR, NM, PT, XA, RF, ES, XC, PX, RG
Creates plots in plots/ including download status, scans per patient distributions, modality distributions, and time-to-diagnosis analysis.
Interactively review and mark mammogram exams for quality control. Generates combined 4-view figures (L CC, L MLO, R CC, R MLO) with an HTML gallery.
# prioritize worst-performing exams for efficient QC
python qc_gallery.py --serve --max-exams 100 --prioritize-errors \
--pred-csv /gpfs/data/huo-lab/Image/ChiMEC/MG/out/validation_output.csv \
--meta-csv /gpfs/data/huo-lab/Image/ChiMEC/MG/out/mirai_manifest.csv

This will:
- Compute prediction error scores for each exam
- Show you the worst-performing exams first (missed cancers, false alarms)
- Weight misclassified cases 2x so you see them immediately
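The prioritization above can be sketched as an error score that doubles for misclassified exams. This is an illustration of the idea only; qc_gallery.py's exact scoring may differ, and the 0.5 threshold here is an assumption:

```python
def error_score(label, risk, threshold=0.5):
    """Absolute prediction error, doubled when the exam is misclassified,
    so missed cancers and false alarms sort to the top. A sketch, not
    qc_gallery.py's exact formula.
    """
    err = abs(label - risk)
    misclassified = (risk >= threshold) != bool(label)
    return 2.0 * err if misclassified else err

# Toy exams: (label, predicted risk)
exams = {"missed_cancer": (1, 0.1), "false_alarm": (0, 0.9), "correct": (1, 0.8)}
ranked = sorted(exams, key=lambda k: error_score(*exams[k]), reverse=True)
print(ranked)  # misclassified exams sort first
```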
For remote access, set up SSH port forwarding:
ssh -L 5000:localhost:5000 user@remote-host

Then open http://localhost:5000/ in your browser.
For large datasets (e.g., 18k exams), use an iterative approach:
Batch 1: Initial QC (100 worst-performing exams)
# QC first batch prioritized by prediction error
python qc_gallery.py --serve --max-exams 100 --prioritize-errors \
--pred-csv /gpfs/data/huo-lab/Image/ChiMEC/MG/out/validation_output.csv \
--meta-csv /gpfs/data/huo-lab/Image/ChiMEC/MG/out/mirai_manifest.csvAnalyze patterns:
# After QC'ing 20-50 exams, analyze what makes bad exams bad
python analyze_qc_patterns.py --qc-file data/qc_status.json
# This will print:
# - DICOM tags that differ between bad and good exams
# - Suggested filters (e.g., "Exclude if PositionerPrimaryAngle < 0")
# - Known problematic patterns (implants, scanned film, positioning issues)
# - Device models with high failure rates

Test filter hypotheses (streamlined workflow):
# Step 1: Test filters and generate exam lists
python test_positioning_filters.py --max-samples 20
# Output shows:
# - How many exams each filter catches
# - Saves exam IDs to data/filter_tests/*.txt
# Step 2: QC specific filter (interactive menu - easiest!)
./qc_filter.sh
# Select filter from menu → loads ONLY those exams → fast!
# Or QC specific filter directly:
python qc_gallery.py --serve --exam-list data/filter_tests/scanned_film_exams.txt
python qc_gallery.py --serve --exam-list data/filter_tests/gems_ffdm_tc1_exams.txt
# Step 3: After QC, analyze which filters worked
python analyze_qc_patterns.py

Batch 2+: Apply validated filters and repeat
After validating a filter catches only bad exams:
- Add to auto-exclude config (one line edit):
# Edit data/qc_auto_exclude.json
{
"filters": ["gems_ffdm_tc1"] # add validated filter names here
}

- Restart the server; those exams are automatically excluded:
python qc_gallery.py --serve --max-exams 100 --prioritize-errors \
--pred-csv ... --meta-csv ...
# GEMS exams now auto-skipped, marked as "auto_excluded"

- View the cutflow (click the "📊 Cutflow" button in the browser):
- See how many exams each filter excludes
- See filter effectiveness (precision/recall)
- Validate filters before adding to config
Warning: Only add filters with >95% precision! Check cutflow first.
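Filter precision here means: of the exams a filter flags, what fraction did human QC also mark bad. A minimal sketch of the check the cutflow performs, with toy exam IDs:

```python
def filter_precision(flagged, qc_bad):
    """Fraction of filter-flagged exams that human QC also marked bad.
    Mirrors the >95% precision bar suggested before auto-excluding a filter.
    """
    if not flagged:
        return 0.0
    return len(flagged & qc_bad) / len(flagged)

flagged = {"e1", "e2", "e3", "e4"}   # exams a candidate filter catches
qc_bad = {"e1", "e2", "e3", "e9"}    # exams humans marked "bad"
p = filter_precision(flagged, qc_bad)
print(f"{p:.2f}")  # 0.75, below the 0.95 bar: don't auto-exclude yet
```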
With server running, you can also dynamically test filters:
- Use dropdown to preview a filter
- Click "Apply Filter" to see those exams
- Manually QC the batch
- If validated → add to qc_auto_exclude.json
Standard QC controls:
- Mark exams using keyboard shortcuts: G = Good, R = Needs review, B = Bad, arrow keys = Navigate
- QC status auto-saves to data/qc_status.json
- Re-run the script to continue — by default, exams marked "good" or "bad" are automatically skipped
- Apply QC filters in downstream analysis via analyze_mirai.py <config.yaml> (see below)
Default behavior: Future QC runs skip exams marked as "good" (already approved) or "bad" (already rejected), showing only "review" and unmarked exams.
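That default skip rule is straightforward to sketch (statuses here are illustrative; real ones live in data/qc_status.json):

```python
def visible_exams(statuses, skip=("good", "bad")):
    """Hide already-decided exams, keep 'review' and unmarked ones.
    Matches the default --qc-skip-status good bad behavior; in the real
    tool, unmarked exams simply have no entry in qc_status.json.
    """
    return [exam for exam, status in statuses.items() if status not in skip]

statuses = {"e1": "good", "e2": "bad", "e3": "review", "e4": "unmarked"}
print(visible_exams(statuses))  # ['e3', 'e4']
```

Passing `skip=("good",)` reproduces the documented way to re-visit "bad" exams.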
# server mode (recommended for remote work)
python qc_gallery.py --serve --max-exams 100 --random

# custom port
python qc_gallery.py --serve --port 8080
# filter to specific patient
python qc_gallery.py --serve --patient 12345
# re-visit exams previously marked as "bad"
python qc_gallery.py --serve --max-exams 100 --qc-skip-status good
# only show completely unmarked exams (skip all QC'd exams)
python qc_gallery.py --serve --max-exams 100 --qc-skip-status good bad review
# use different prediction horizon for error scoring (default: 5 years)
python qc_gallery.py --serve --prioritize-errors --horizon 3 \
--pred-csv ... --meta-csv ...
# custom paths
python qc_gallery.py \
--serve \
--views /path/to/views.parquet \
--raw /gpfs/data/huo-lab/Image/ChiMEC/MG \
--output qc_output \
--qc-file /path/to/my_qc_status.json

python qc_gallery.py --max-exams 10
# then open qc_output/gallery.html

After QC, configure filtering in the analysis config:
# edit config QC block (status + auto filters + annotations), then run:
python analyze_mirai.py /tmp/analyze_mirai.yaml

QC filter behavior:
In future QC runs (default: --qc-skip-status good bad):
- Exams marked "good" or "bad" are skipped (already decided)
- Exams marked "review" or unmarked are shown (need attention)
- To re-visit "bad" exams: --qc-skip-status good
In downstream analysis (analyze_mirai.py <config.yaml>):
- Status filtering is controlled by qc_filters.include_statuses / qc_filters.exclude_statuses
- Auto filters are controlled by qc_filters.enable_auto_filters and qc_filters.auto_filters
- Annotation filters are controlled by qc_filters.annotation_include_* and qc_filters.annotation_exclude_*
- The original validation_output.csv remains unchanged
- QC decisions remain in qc_status.json / annotations.json and are applied at analysis time
This design allows you to:
- Efficiently QC by not showing already-decided exams
- Re-run QC with different criteria
- Audit what was excluded
- Easily revert filtering decisions