Advanced Machine Learning Project: Systematic exploration of probabilistic methods for medical data analysis - from mixture models to reinforcement learning
This project systematically explores probabilistic modeling techniques applied to cardiovascular disease data, progressing from basic clustering to advanced sequential decision making. Each milestone builds upon previous work, culminating in a reinforcement learning system for treatment optimization.
- M0: Exploratory Data Analysis & Statistical Foundations
- M1: Gaussian Mixture Models for Patient Clustering
- M2-M4: Bayesian Networks (Design → Inference → Learning)
- M5: Hidden Markov Models for Disease Progression
- M6: Reinforcement Learning for Treatment Policies
- Systematic Methodology: Six milestones covering core probabilistic ML techniques
- Hands-on Implementation: Build models from scratch using pgmpy, pygraphviz, hmmlearn, custom RL
- Real-world Complexity: Handle missing data, discretization, temporal dependencies
- Educational Focus: Deep understanding over black-box solutions
├── data/
│   ├── raw/                # Original datasets (INCLUDED in repo)
│   │   ├── patients.csv    # Static patient demographics & risk factors (3,000 patients)
│   │   ├── encounters.csv  # Longitudinal clinical encounters (24,000 visits)
│   │   └── README.md       # Comprehensive data documentation
│   ├── interim/            # Processed data (imputed, discretized)
│   └── processed/          # Final model-ready datasets
├── milestones/             # Six milestone deliverables
│   ├── M0/                 # Exploratory Data Analysis
│   ├── M1/                 # Gaussian Mixture Models
│   ├── M2/                 # Bayesian Network Design
│   ├── M3/                 # BN Inference (Exact & Approximate)
│   ├── M4/                 # BN Learning from Data
│   ├── M5/                 # Hidden Markov Models
│   └── M6/                 # Reinforcement Learning
└── docs/                   # Documentation & project guidelines
# Clone repository
git clone <repository-url>
cd Project
# Create virtual environment
python -m venv adv_ml_venv
source adv_ml_venv/bin/activate # Linux/Mac
# adv_ml_venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

import pandas as pd
# Load datasets (included in repo!)
patients = pd.read_csv('data/raw/patients.csv', sep=';')
encounters = pd.read_csv('data/raw/encounters.csv', sep=';')
print(f"{len(patients):,} patients, {len(encounters):,} encounters")
print(f"{encounters.groupby('patient_id').size().iloc[0]} time points per patient")

# Run milestone notebooks
jupyter lab milestones/M0/M0.ipynb # Data exploration
jupyter lab milestones/M1/M1_G14.ipynb # Feature engineering & modeling

This project systematically builds expertise in probabilistic machine learning:
| Milestone | Core Technique | Learning Focus | Deliverable |
|---|---|---|---|
| M0 | Exploratory Data Analysis | Data understanding, visualization | Statistical summaries & insights |
| M1 | Gaussian Mixture Models | Unsupervised clustering, EM algorithm | Patient phenotype discovery |
| M2 | Bayesian Network Design | Graphical models, conditional independence | Hand-designed probabilistic model |
| M3 | BN Inference | Variable elimination, belief propagation | Clinical query answering |
| M4 | BN Parameter/Structure Learning | Maximum likelihood, score-based search | Data-driven model comparison |
| M5 | Hidden Markov Models | Temporal modeling, Viterbi decoding | Disease progression analysis |
| M6 | Reinforcement Learning | Q-learning, policy optimization | Treatment recommendation system |
- Probabilistic Reasoning: Understanding uncertainty and conditional dependencies
- Graphical Models: Designing and interpreting Bayesian networks
- Temporal Modeling: Capturing disease progression with HMMs
- Decision Making: Optimizing treatment policies with reinforcement learning
- Model Evaluation: Comparing hand-designed vs. learned models
M0: Exploratory Data Analysis
- Dataset overview and baseline characteristics
- Disease state distributions and patient trajectories
- Treatment patterns and outcome analysis
- Missing data assessment and handling strategies
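
A missing-data assessment of this kind might start with per-column and per-patient missingness rates. The sketch below uses a small made-up table; the column names `troponin` and `chest_pain` are illustrative assumptions, not the documented schema of `encounters.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy encounters table; column names and values are assumptions.
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3],
    "troponin": [0.02, np.nan, 0.10, 0.12, np.nan, 0.05],
    "chest_pain": [1.0, 0.0, np.nan, 1.0, 0.0, 0.0],
})

# Fraction of missing values in each column.
missing_rate = toy.isna().mean()

# Fraction of each patient's encounters that have at least one missing value.
row_has_gap = toy.drop(columns="patient_id").isna().any(axis=1)
per_patient = row_has_gap.groupby(toy["patient_id"]).mean()

print(missing_rate)
print(per_patient)
```

Comparing these rates against the documented ranges is one way to check that the data loaded as expected before choosing an imputation strategy.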
M1: Gaussian Mixture Models
- Feature selection and preprocessing for clustering
- GMM fitting with optimal cluster selection (AIC/BIC)
- Cluster characterization and clinical interpretation
- Comparison with true disease states
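
Choosing the number of clusters by BIC can be sketched as below. This is a minimal illustration with scikit-learn's `GaussianMixture` on synthetic two-dimensional data, not the notebook's actual pipeline; the data, the candidate range of `k`, and the covariance type are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for two scaled patient features: three well-separated groups.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(200, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(200, 2)),
])

# Fit one GMM per candidate k and record its BIC (lower is better).
bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=3, random_state=0).fit(X)
    bic[k] = gmm.bic(X)

best_k = min(bic, key=bic.get)
print(best_k, bic)
```

`gmm.aic(X)` can be recorded the same way. AIC penalizes extra components less heavily than BIC, so the two criteria may disagree on real data.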
M2: Bayesian Network Design
- Structure design with clinical justification
- Variable discretization for categorical BNs
- CPT estimation from encounter data
- Conditional independence analysis
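
Discretization and CPT estimation can be illustrated with pandas alone. The sketch below assumes a single parent-child pair (`age_bin` → `disease`) on made-up data; the variable names and cut points are hypothetical, and a full network would repeat the estimate for each node given its parents.

```python
import pandas as pd

# Hypothetical toy data; variable names, values, and cut points are assumptions.
df = pd.DataFrame({
    "age": [45, 52, 61, 70, 38, 66, 58, 74],
    "disease": ["no", "no", "yes", "yes", "no", "yes", "no", "yes"],
})

# Discretize the continuous variable into labeled, right-closed bins.
df["age_bin"] = pd.cut(df["age"], bins=[0, 50, 65, 120],
                       labels=["<=50", "51-65", ">65"])

# Maximum-likelihood CPT P(disease | age_bin): counts normalized within each parent value.
cpt = pd.crosstab(df["age_bin"], df["disease"], normalize="index")
print(cpt)
```

Each row of `cpt` sums to 1. Parent values with no observations of some outcome yield zero probabilities, which is why a smoothing prior is a common addition in practice.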
M3: BN Inference
- Exact inference (Variable Elimination, Belief Propagation)
- Approximate inference with sampling methods
- Clinical query design and interpretation
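
The difference between exact and approximate inference can be seen on a two-node network. The sketch below assumes a hypothetical Disease → Symptom model with made-up probabilities and answers the query P(disease | symptom) twice: by enumeration and by rejection sampling. It uses plain NumPy rather than pgmpy, so it shows the idea rather than the library's API.

```python
import numpy as np

# Hypothetical two-node network Disease -> Symptom; all probabilities are made up.
p_disease = 0.2
p_symptom_given = {1: 0.9, 0: 0.1}  # P(symptom=1 | disease)

# Exact inference by enumeration: P(disease=1 | symptom=1).
joint_d1 = p_disease * p_symptom_given[1]
joint_d0 = (1 - p_disease) * p_symptom_given[0]
exact = joint_d1 / (joint_d1 + joint_d0)

# Approximate inference by rejection sampling:
# sample from the joint, then keep only samples consistent with the evidence.
rng = np.random.default_rng(0)
n = 200_000
disease = rng.random(n) < p_disease
symptom = rng.random(n) < np.where(disease, p_symptom_given[1], p_symptom_given[0])
approx = disease[symptom].mean()

print(f"exact={exact:.4f} approx={approx:.4f}")
```

Rejection sampling discards every sample that contradicts the evidence, so it becomes wasteful when the evidence is rare. That cost is one reason to compare several sampling methods.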
M4: BN Learning from Data
- Parameter learning with train/test patient splits
- Structure learning with score-based search
- Model comparison: hand-designed vs. learned
- Expert knowledge vs. data-driven trade-offs
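
Splitting by patient rather than by encounter keeps all visits of one patient on the same side of the split, which avoids leakage between train and test. A minimal sketch, assuming a made-up encounter table and an illustrative 80/20 ratio:

```python
import numpy as np
import pandas as pd

# Hypothetical encounter table: 10 patients with 8 visits each.
enc = pd.DataFrame({
    "patient_id": np.repeat(np.arange(10), 8),
    "visit": np.tile(np.arange(8), 10),
})

# Shuffle the unique patient IDs and hold out 20% of the patients (not of the rows).
rng = np.random.default_rng(42)
ids = rng.permutation(enc["patient_id"].unique())
n_test = int(0.2 * len(ids))
test_ids = ids[:n_test]

is_test = enc["patient_id"].isin(test_ids)
train, test = enc[~is_test], enc[is_test]

print(len(train), len(test))
```

Parameters would then be estimated on `train` only and scored on `test`, so the hand-designed and learned structures are compared on the same held-out patients.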
M5: Hidden Markov Models
- Temporal modeling with troponin + symptom features
- Baum-Welch parameter learning
- Viterbi decoding and state sequence analysis
- Comparison with true disease progression
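
Viterbi decoding can be written in a few lines of NumPy. The sketch below is a generic log-domain implementation for a discrete-emission HMM, not hmmlearn's code; the two-state model and all of its probabilities are made up for illustration.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Return the most likely hidden state path for a discrete-emission HMM."""
    n_states, T = log_A.shape[0], len(obs)
    delta = np.empty((T, n_states))             # best log-score of a path ending in each state
    back = np.zeros((T, n_states), dtype=int)   # argmax predecessor for backtracking
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j]: best path ending in state i at t-1, followed by transition i -> j
        scores = delta[t - 1][:, None] + log_A
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state model (0 = stable, 1 = acute); all numbers are assumptions.
pi = np.array([0.8, 0.2])
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # A[i, j] = P(next = j | current = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[state, symbol]; symbol 1 = symptom present
obs = [0, 0, 1, 1, 1]

path = viterbi(np.log(pi), np.log(A), np.log(B), obs)
print(path)  # [0, 0, 1, 1, 1]
```

Working in log space avoids numerical underflow on long sequences. The decoded path can then be compared against the ground-truth disease states provided with the dataset.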
M6: Reinforcement Learning
- MDP environment setup with state/action/reward
- Tabular Q-learning for treatment policies
- Policy evaluation vs. random/heuristic baselines
- Clinical interpretation of learned strategies
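
Tabular Q-learning on a small MDP can be sketched as follows. The environment here is a hypothetical three-state toy (mild, severe, recovered) with made-up transition probabilities and rewards; it is not the project's environment, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy MDP; states, actions, probabilities, and rewards are made up.
# States: 0 = mild, 1 = severe, 2 = recovered (terminal). Actions: 0 = wait, 1 = treat.
# P[state][action] -> list of (probability, next_state, reward).
P = {
    0: {0: [(0.7, 0, -1.0), (0.3, 1, -1.0)],
        1: [(0.8, 2, 8.0), (0.2, 0, -3.0)]},
    1: {0: [(1.0, 1, -5.0)],
        1: [(0.5, 0, -3.0), (0.5, 1, -3.0)]},
}
TERMINAL = 2

def step(rng, s, a):
    """Sample (next_state, reward) for taking action a in state s."""
    probs, nxt, rew = zip(*P[s][a])
    i = rng.choice(len(probs), p=probs)
    return nxt[i], rew[i]

def q_learning(episodes=5000, alpha=0.1, gamma=0.95, eps=0.2, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((3, 2))                 # the terminal row is never updated and stays zero
    for _ in range(episodes):
        s = int(rng.integers(0, 2))      # start in mild or severe
        for _ in range(50):              # cap the episode length
            # Epsilon-greedy action selection.
            a = int(rng.integers(0, 2)) if rng.random() < eps else int(Q[s].argmax())
            s2, r = step(rng, s, a)
            target = r + gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if s == TERMINAL:
                break
    return Q

Q = q_learning()
policy = Q[:2].argmax(axis=1)   # greedy action for each non-terminal state
print(Q.round(2), policy)
```

In this toy setup treating is the better action in both non-terminal states, so the learned greedy policy can be checked against that known answer before moving to random or heuristic baselines.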
- Multi-task Learning: Simultaneous disease prediction + treatment optimization
- Temporal Modeling: Account for disease state transitions over time
- Missing Data Realism: Handle realistic clinical data gaps
- Utility-based Evaluation: Beyond accuracy - optimize patient outcomes
- Causal Treatment Effects: Estimate counterfactual treatment scenarios
- Domain: Synthetic cardiovascular clinical study
- Design: Longitudinal (8 time points over ~2 years)
- Patients: 3,000 diverse synthetic patients
- Variables: Demographics, risk factors, symptoms, labs, treatments, outcomes
- Missing Data: Realistic patterns (10-20% symptoms, 5-15% labs)
- Ground Truth: Disease states provided for educational purposes
Full data documentation: data/raw/README.md
See CONTRIBUTING.md for development workflow, branching strategy, and code standards.
- Educational Data: Synthetic dataset for learning purposes only
- Not for Clinical Use: Do not apply insights to real patients
- Research Only: Suitable for methodology development and education
- Reproducible: All code, data, and results version controlled