A comprehensive machine learning pipeline for predicting unplanned hospital readmission within 30 days of discharge using the MIMIC-III Clinical Database.
This project implements an end-to-end pipeline that predicts 30-day hospital readmission for adult patients from MIMIC-III clinical data. It covers automated data discovery, feature engineering, model training, evaluation, and interpretability analysis.
- 🔍 Automated Data Discovery: Scans and loads available MIMIC-III tables automatically
- 🎯 Target Variable Creation: 30-day readmission with proper exclusion criteria
- 🔧 Feature Engineering: Demographics, clinical complexity, medications, labs, and temporal features
- 🤖 Multiple ML Models: XGBoost, LightGBM, Logistic Regression, and GLM (Generalized Linear Models)
- 📈 Comprehensive Evaluation: AUROC, AUPRC, calibration, and clinical utility metrics
- 🔍 Model Interpretability: SHAP-based feature importance and explanations
- 📊 Rich Visualizations: ROC curves, calibration plots, and performance dashboards
cd PHRR
# Install required packages first
pip install pandas numpy scikit-learn matplotlib statsmodels
# Then run the notebook
jupyter notebook main.ipynb
Then run all cells to execute the complete pipeline.
cd PHRR
# Install required packages first
pip install pandas numpy scikit-learn matplotlib xgboost statsmodels
# Run the pipeline
python phrr.py
# Install packages first
import subprocess
import sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas", "numpy", "scikit-learn", "matplotlib"])
# Load and run the pipeline
exec(open('phrr.py').read())
pipeline = SimplePHRRPipeline(CONFIG)
results = pipeline.run_pipeline()
The complete pipeline will:
- 📊 Data Discovery: Automatically find and load MIMIC-III tables
- 🎯 Target Creation: Create 30-day readmission target with exclusions
- 🔧 Feature Engineering: Extract 20+ clinical and demographic features
- 📊 Data Splitting: Split data temporally (train/val/test)
- 🤖 Model Training: Train XGBoost, LightGBM, and Logistic Regression
- 📈 Evaluation: Calculate AUROC, AUPRC, and clinical utility metrics
- 📊 Visualizations: Generate comprehensive evaluation dashboard
- 🔍 Interpretation: SHAP analysis for model explanations
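The temporal split step can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline's own code: the 70/15/15 proportions, the `admittime` key, and the `temporal_split` helper are assumptions, since the README does not state how the cut points are chosen.

```python
from datetime import datetime, timedelta


def temporal_split(records, time_key="admittime", train_pct=70, val_pct=15):
    """Order records by time, then cut into train/val/test by position.

    Integer percentages keep the cut points free of floating-point rounding.
    """
    ordered = sorted(records, key=lambda r: r[time_key])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy data (hypothetical): 20 admissions one week apart, supplied out of order.
start = datetime(2130, 1, 1)
records = [
    {"hadm_id": i, "admittime": start + timedelta(weeks=(i * 7) % 20)}
    for i in range(20)
]

train, val, test = temporal_split(records)
print(len(train), len(val), len(test))  # 14 3 3
```

Because the earliest admissions go to training and the latest to testing, no test admission precedes a training admission, which is the property a temporal split is meant to guarantee.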
The models typically achieve:
- AUROC: 0.65-0.75 (moderate discrimination)
- AUPRC: 0.20-0.35 (above the ~0.15 baseline expected from random ranking)
- Precision@10%: 0.25-0.40 (useful for targeting interventions)
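To make Precision@10% concrete, here is a small sketch of how such a metric can be computed: rank patients by predicted risk, keep the top 10%, and measure the share who were actually readmitted. The function name `precision_at_top` and the toy scores are hypothetical, and the pipeline's own implementation may differ.

```python
def precision_at_top(y_true, y_score, top_pct=10):
    """Share of true readmissions among the top_pct% highest-risk cases."""
    if not 0 < top_pct <= 100:
        raise ValueError("top_pct must be in (0, 100]")
    # Always keep at least one case so the ratio is defined.
    k = max(1, len(y_score) * top_pct // 100)
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data (hypothetical): 20 discharges, 3 readmissions (prevalence 0.15).
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.90, 0.80, 0.70, 0.60, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25,
           0.20, 0.18, 0.16, 0.14, 0.12, 0.10, 0.08, 0.06, 0.04, 0.02]

print(precision_at_top(y_true, y_score, top_pct=10))   # 0.5
print(precision_at_top(y_true, y_score, top_pct=100))  # 0.15
```

Taking the whole cohort (`top_pct=100`) returns the readmission rate itself, so the gap between the two numbers shows how much the ranking concentrates readmissions among the patients flagged for intervention.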
PHRR/
├── main.ipynb # Jupyter notebook (simplified pipeline)
├── phrr_simple.py # Simplified Python script (recommended)
├── phrr_complete.py # Full-featured Python script (advanced)
├── mimic-iii-clinical-database-demo-1.4/ # MIMIC-III demo dataset
└── README.md # This file
- main.ipynb: Interactive Jupyter notebook with the simplified pipeline
- phrr_simple.py: Simplified Python script with core functionality (recommended)
- phrr_complete.py: Full-featured script with advanced features (requires more packages)
- mimic-iii-clinical-database-demo-1.4/: MIMIC-III demo dataset directory
- Prediction Unit: Hospital admission (HADM_ID)
- Positive Label: Patient readmitted within 30 calendar days of discharge
- Exclusions:
  - In-hospital deaths
  - Newborn admissions
  - Patients under 18 years old
  - Same-day transfers
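The labeling rule above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's implementation: the lower-cased keys mirror ADMISSIONS columns (SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME, ADMISSION_TYPE, HOSPITAL_EXPIRE_FLAG), `age` is assumed to be precomputed from the PATIENTS table, the 30-day window is measured as elapsed time from discharge to the next admission, and the same-day transfer exclusion is left out because the README does not define it.

```python
from datetime import datetime, timedelta


def label_readmissions(admissions, window_days=30, min_age=18):
    """Map each eligible index admission (hadm_id) to 1 if readmitted in time, else 0."""
    by_patient = {}
    for adm in admissions:
        by_patient.setdefault(adm["subject_id"], []).append(adm)

    window = timedelta(days=window_days)
    labels = {}
    for stays in by_patient.values():
        stays = sorted(stays, key=lambda a: a["admittime"])
        # Pair every stay with the patient's next stay (None for the last one).
        for current, following in zip(stays, stays[1:] + [None]):
            excluded = (
                current["hospital_expire_flag"] == 1       # in-hospital death
                or current["admission_type"] == "NEWBORN"  # newborn admission
                or current["age"] < min_age                # under 18
            )
            if excluded:
                continue
            readmitted = (
                following is not None
                and timedelta(0) <= following["admittime"] - current["dischtime"] <= window
            )
            labels[current["hadm_id"]] = int(readmitted)
    return labels


def adm(subject_id, hadm_id, admit, disch, kind="EMERGENCY", died=0, age=60):
    """Build one toy admission record (hypothetical helper)."""
    return {"subject_id": subject_id, "hadm_id": hadm_id,
            "admittime": datetime.fromisoformat(admit),
            "dischtime": datetime.fromisoformat(disch),
            "admission_type": kind, "hospital_expire_flag": died, "age": age}


toy = [
    adm(1, 100, "2130-01-01", "2130-01-05"),
    adm(1, 101, "2130-01-20", "2130-01-25"),          # 15 days after discharge
    adm(2, 200, "2130-03-01", "2130-03-04"),
    adm(2, 201, "2130-06-01", "2130-06-10", died=1),  # excluded as an index stay
    adm(3, 300, "2130-02-01", "2130-02-03", kind="NEWBORN", age=0),
]
labels = label_readmissions(toy)
print(labels)
```

Note that an excluded stay is only dropped as an index admission. It still counts as a readmission event for the patient's preceding stay, which is why the exclusions are applied after pairing each stay with its successor.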
The pipeline reports several evaluation measures relevant to clinical decision-making:
- Discrimination: AUROC, AUPRC
- Calibration: Brier score, calibration plots
- Clinical Utility: Precision at top 10%/20% risk scores
- Interpretability: SHAP feature importance and local explanations
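As a reference for what the discrimination and calibration numbers mean, the sketch below computes AUROC and the Brier score from first principles in plain Python. The helper names and toy predictions are hypothetical. In practice, scikit-learn's `roc_auc_score` and `brier_score_loss` compute the same quantities and would be the more natural choice inside the pipeline.

```python
def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy predictions (hypothetical): 2 readmissions among 5 discharges.
y_true = [1, 0, 1, 0, 0]
y_prob = [0.8, 0.6, 0.4, 0.2, 0.1]

print(round(auroc(y_true, y_prob), 3))
print(round(brier_score(y_true, y_prob), 3))
```

The pairwise AUROC here is quadratic in the number of cases, which is fine for illustration but too slow for large cohorts. Lower Brier scores indicate better calibrated and sharper probabilities, with 0 being perfect.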
- Data Discovery: Automatically scans and loads available MIMIC-III tables
- Target Variable: 30-day readmission with proper exclusion criteria
- Feature Engineering: 20+ features including demographics, clinical complexity, and temporal patterns
- Model Training: XGBoost, LightGBM, and Logistic Regression with proper validation
- Evaluation: AUROC, AUPRC, calibration, and clinical utility metrics
- Interpretability: SHAP-based feature importance and explanations
- Visualizations: Comprehensive evaluation dashboard with multiple plots
This package works with the MIMIC-III Clinical Database Demo v1.4. It uses the following tables:
- ADMISSIONS (required)
- PATIENTS (required)
- DIAGNOSES_ICD (required)
- PROCEDURES_ICD (recommended)
- LABEVENTS (recommended)
- PRESCRIPTIONS (recommended)
- ICUSTAYS (optional)
- CHARTEVENTS (optional)
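One way the data discovery step could check for these tables is sketched below, using only the standard library. The function names, the assumption that each table is a `<TABLE>.csv` or `<TABLE>.csv.gz` file directly inside the data directory, and the decision to fail only on missing required tables are illustrative choices, not the pipeline's confirmed behavior.

```python
import tempfile
from pathlib import Path

REQUIRED = {"ADMISSIONS", "PATIENTS", "DIAGNOSES_ICD"}
RECOMMENDED = {"PROCEDURES_ICD", "LABEVENTS", "PRESCRIPTIONS"}
OPTIONAL = {"ICUSTAYS", "CHARTEVENTS"}


def discover_tables(data_dir):
    """Map upper-cased table names to the CSV files found directly in data_dir."""
    found = {}
    for path in sorted(Path(data_dir).iterdir()):
        name = path.name.lower()
        if name.endswith(".csv") or name.endswith(".csv.gz"):
            found[name.split(".")[0].upper()] = path
    return found


def check_tables(found):
    """Raise if a required table is absent; return the other missing tables."""
    missing_required = sorted(REQUIRED - set(found))
    if missing_required:
        raise FileNotFoundError(f"Missing required tables: {missing_required}")
    return sorted((RECOMMENDED | OPTIONAL) - set(found))


# Demo on a temporary directory holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for table in ["ADMISSIONS", "PATIENTS", "DIAGNOSES_ICD", "LABEVENTS"]:
        (Path(tmp) / f"{table}.csv").touch()
    found_names = sorted(discover_tables(tmp))
    missing = check_tables(discover_tables(tmp))

print(found_names)
print(missing)
```

Returning the missing recommended and optional tables, rather than raising, lets the caller skip the corresponding feature groups while still running on the required tables alone.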
Please read our contributing guidelines and code of conduct before submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this software in your research, please cite:
@software{phrr2024,
title={PHRR: Predictive Hospital Readmission Risk},
author={PHRR Development Team},
year={2024},
url={https://github.com/your-org/phrr}
}
- MIMIC-III Clinical Database (Johnson et al., 2016)
- PhysioNet for providing access to clinical data
- The open-source machine learning community