PFans-201/ML_cardiovasc_disease


🏥 Cardiovascular Disease Analysis: Probabilistic Modeling & Decision Making

Advanced Machine Learning Project: Systematic exploration of probabilistic methods for medical data analysis - from mixture models to reinforcement learning

🎯 Project Overview

This project systematically explores probabilistic modeling techniques applied to cardiovascular disease data, progressing from basic clustering to advanced sequential decision making. Each milestone builds upon previous work, culminating in a reinforcement learning system for treatment optimization.

Core Learning Path

  • 📊 M0: Exploratory Data Analysis & Statistical Foundations
  • 🎯 M1: Gaussian Mixture Models for Patient Clustering
  • 🕸️ M2-M4: Bayesian Networks (Design → Inference → Learning)
  • ⏱️ M5: Hidden Markov Models for Disease Progression
  • 🤖 M6: Reinforcement Learning for Treatment Policies

Key Features

  • 📚 Systematic Methodology: Milestones M0 through M6 cover core probabilistic ML techniques
  • 🔬 Hands-on Implementation: Build models with pgmpy, pygraphviz, hmmlearn, and custom RL code
  • Real-world Complexity: Handle missing data, discretization, temporal dependencies
  • 🎓 Educational Focus: Deep understanding over black-box solutions

🏗️ Project Structure

├── data/
│   ├── raw/                    # 📁 Original datasets (INCLUDED in repo)
│   │   ├── patients.csv        # Static patient demographics & risk factors (3,000 patients)
│   │   ├── encounters.csv      # Longitudinal clinical encounters (24,000 visits)
│   │   └── README.md           # Comprehensive data documentation
│   ├── interim/                # Processed data (imputed, discretized)
│   └── processed/              # Final model-ready datasets
├── milestones/                 # 📋 Milestone deliverables (M0-M6)
│   ├── M0/                     # Exploratory Data Analysis
│   ├── M1/                     # Gaussian Mixture Models
│   ├── M2/                     # Bayesian Network Design
│   ├── M3/                     # BN Inference (Exact & Approximate)
│   ├── M4/                     # BN Learning from Data
│   ├── M5/                     # Hidden Markov Models
│   └── M6/                     # Reinforcement Learning
└── docs/                       # 📚 Documentation & project guidelines

🚀 Quick Start

1. Environment Setup

# Clone repository
git clone <repository-url>
cd ML_cardiovasc_disease

# Create virtual environment
python -m venv adv_ml_venv
source adv_ml_venv/bin/activate  # Linux/Mac
# adv_ml_venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

2. Data Overview

import pandas as pd

# Load datasets (included in repo!)
patients = pd.read_csv('data/raw/patients.csv', sep=';')
encounters = pd.read_csv('data/raw/encounters.csv', sep=';')

print(f"📊 {len(patients):,} patients, {len(encounters):,} encounters")
# Visit count for the first patient (the dataset has 8 time points per patient)
print(f"⏱️  {encounters.groupby('patient_id').size().iloc[0]} time points per patient")

3. Start Exploring

# Run milestone notebooks
jupyter lab milestones/M0/M0.ipynb     # Data exploration
jupyter lab milestones/M1/M1_G14.ipynb # Feature engineering & modeling

🎓 Learning Objectives

This project systematically builds expertise in probabilistic machine learning:

| Milestone | Core Technique | Learning Focus | Deliverable |
|-----------|----------------|----------------|-------------|
| M0 | Exploratory Data Analysis | Data understanding, visualization | Statistical summaries & insights |
| M1 | Gaussian Mixture Models | Unsupervised clustering, EM algorithm | Patient phenotype discovery |
| M2 | Bayesian Network Design | Graphical models, conditional independence | Hand-designed probabilistic model |
| M3 | BN Inference | Variable elimination, belief propagation | Clinical query answering |
| M4 | BN Parameter/Structure Learning | Maximum likelihood, score-based search | Data-driven model comparison |
| M5 | Hidden Markov Models | Temporal modeling, Viterbi decoding | Disease progression analysis |
| M6 | Reinforcement Learning | Q-learning, policy optimization | Treatment recommendation system |

Key Skills Developed

  • 🧮 Probabilistic Reasoning: Understanding uncertainty and conditional dependencies
  • 📊 Graphical Models: Designing and interpreting Bayesian networks
  • ⏱️ Temporal Modeling: Capturing disease progression with HMMs
  • 🎯 Decision Making: Optimizing treatment policies with reinforcement learning
  • 🔬 Model Evaluation: Comparing hand-designed vs. learned models

📈 Milestones & Progress

📊 M0: Exploratory Data Analysis

  • Dataset overview and baseline characteristics
  • Disease state distributions and patient trajectories
  • Treatment patterns and outcome analysis
  • Missing data assessment and handling strategies
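
A missing-data assessment can start with a per-column missingness rate. The following is a minimal, dependency-free sketch; the column names (`troponin`, `chest_pain`) and the sample rows are hypothetical stand-ins, not the actual schema of `encounters.csv`. Only the `;` separator is taken from the Quick Start above.

```python
import csv
import io

# Hypothetical miniature of an encounters file; real column names may differ.
SAMPLE = """patient_id;visit;troponin;chest_pain
1;0;0.02;1
1;1;;0
2;0;0.40;
2;1;0.35;1
"""

def missing_rates(rows, columns):
    """Return the fraction of empty cells for each requested column."""
    n = len(rows)
    return {c: sum(1 for r in rows if r[c] in ("", None)) / n for c in columns}

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
rates = missing_rates(rows, ["troponin", "chest_pain"])
print(rates)
```

With pandas, the same quantity is available as `encounters.isna().mean()`.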

🎯 M1: Gaussian Mixture Models

  • Feature selection and preprocessing for clustering
  • GMM fitting with optimal cluster selection (AIC/BIC)
  • Cluster characterization and clinical interpretation
  • Comparison with true disease states
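
To illustrate BIC-based cluster selection, here is a small sketch that fits 1-D Gaussian mixtures with a hand-written EM loop and compares BIC across candidate cluster counts. It is an assumption-laden toy, not the notebook's implementation: the data are synthetic draws from two Gaussians, and the initialisation (means at quantiles) is just one simple choice.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture with EM; return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor guards against a collapsing component
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def bic(loglik, k, n):
    """BIC = (number of free parameters) * ln(n) - 2 * log-likelihood."""
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    return n_params * math.log(n) - 2 * loglik

# Synthetic data: two well-separated groups (purely illustrative).
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(8, 1) for _ in range(100)]

scores = {k: bic(fit_gmm_1d(xs, k), k, len(xs)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)  # lower BIC is better
print(scores, best_k)
```

In practice a library implementation with multivariate covariances would replace the hand-written EM; the selection rule (pick the cluster count with the lowest BIC) stays the same.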

🕸️ M2: Bayesian Network Design

  • Structure design with clinical justification
  • Variable discretization for categorical BNs
  • CPT estimation from encounter data
  • Conditional independence analysis
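
The discretization and CPT estimation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the variable names, the cut points, and the observations are invented for the example (they are not clinical thresholds or project data), and Laplace smoothing is one common choice for avoiding zero probabilities.

```python
from collections import Counter, defaultdict

def discretize(value, cut_points, labels):
    """Map a continuous value to a label using ascending cut points."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

def estimate_cpt(pairs, child_states, alpha=1.0):
    """Estimate P(child | parent) from (parent, child) pairs with Laplace smoothing."""
    counts = defaultdict(Counter)
    for parent, child in pairs:
        counts[parent][child] += 1
    cpt = {}
    for parent, ctr in counts.items():
        total = sum(ctr.values()) + alpha * len(child_states)
        cpt[parent] = {s: (ctr[s] + alpha) / total for s in child_states}
    return cpt

# Hypothetical (disease_state, troponin) observations.
observations = [
    ("healthy", 0.01), ("healthy", 0.02), ("healthy", 0.06),
    ("acute", 0.50), ("acute", 0.30), ("acute", 0.03),
]
LABELS = ["low", "elevated", "high"]
CUTS = [0.04, 0.10]  # illustrative thresholds only

pairs = [(state, discretize(t, CUTS, LABELS)) for state, t in observations]
cpt = estimate_cpt(pairs, LABELS)
print(cpt)
```

Each row of the resulting table sums to one, which is the property a categorical Bayesian network library will also require of its CPTs.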

🧠 M3: Bayesian Network Inference

  • Exact inference (Variable Elimination, Belief Propagation)
  • Approximate inference with sampling methods
  • Clinical query design and interpretation
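
The contrast between exact and approximate inference can be shown on a toy network with one cause and two effects (disease → symptom, disease → test). The sketch below computes the exact posterior by summing the joint over the hidden variable, which is what variable elimination reduces to on a network this small, and compares it with a likelihood-weighting estimate. All probabilities are made up for illustration.

```python
import random

# Hypothetical parameters: prior and P(effect = 1 | disease).
P_D = {1: 0.2, 0: 0.8}
P_S_GIVEN_D = {1: 0.7, 0: 0.1}
P_T_GIVEN_D = {1: 0.9, 0: 0.05}

def bernoulli(p, value):
    """Probability of a binary value under success probability p."""
    return p if value == 1 else 1.0 - p

def exact_posterior(symptom, test):
    """P(disease = 1 | symptom, test) by summing the joint over disease."""
    joint = {
        d: P_D[d] * bernoulli(P_S_GIVEN_D[d], symptom) * bernoulli(P_T_GIVEN_D[d], test)
        for d in (0, 1)
    }
    return joint[1] / (joint[0] + joint[1])

def likelihood_weighting(symptom, test, n_samples=20000, seed=0):
    """Approximate the same posterior: sample disease, weight by evidence likelihood."""
    rng = random.Random(seed)
    weighted = {0: 0.0, 1: 0.0}
    for _ in range(n_samples):
        d = 1 if rng.random() < P_D[1] else 0
        weighted[d] += bernoulli(P_S_GIVEN_D[d], symptom) * bernoulli(P_T_GIVEN_D[d], test)
    return weighted[1] / (weighted[0] + weighted[1])

exact = exact_posterior(symptom=1, test=1)
approx = likelihood_weighting(symptom=1, test=1)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

On larger networks the enumeration step is replaced by a proper elimination ordering or message passing, while the sampling estimator keeps the same structure.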

📚 M4: Learning Bayesian Networks

  • Parameter learning with train/test patient splits
  • Structure learning with score-based search
  • Model comparison: hand-designed vs. learned
  • Expert knowledge vs. data-driven trade-offs
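
Because each patient contributes several encounters, a train/test split is usually made on patient IDs rather than on individual rows, so that no patient appears on both sides. A minimal sketch, assuming encounter records carry a `patient_id` field as in the Quick Start; the helper name `split_by_patient` and the toy records are hypothetical.

```python
import random

def split_by_patient(encounters, test_fraction=0.2, seed=42):
    """Split encounter rows so that each patient lands entirely in train or test."""
    patient_ids = sorted({row["patient_id"] for row in encounters})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in encounters if r["patient_id"] not in test_ids]
    test = [r for r in encounters if r["patient_id"] in test_ids]
    return train, test

# Toy data: 10 patients with 8 visits each.
encounters = [{"patient_id": pid, "visit": v} for pid in range(10) for v in range(8)]
train, test = split_by_patient(encounters)

train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting at the row level instead would leak information, since visits from the same patient are strongly correlated.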

⏱️ M5: Hidden Markov Models

  • Temporal modeling with troponin + symptom features
  • Baum-Welch parameter learning
  • Viterbi decoding and state sequence analysis
  • Comparison with true disease progression
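
Viterbi decoding finds the single most likely hidden state sequence for a series of observations. The sketch below is a plain log-space implementation on a two-state toy model; the state names, the discretized troponin observations, and every probability are assumptions made for the example, not parameters learned in this project.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for obs (log-space Viterbi)."""
    log = math.log
    # score[t][s]: best log-probability of any path ending in state s at time t.
    score = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, score[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            score[t][s] = best + log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

STATES = ["stable", "acute"]
START = {"stable": 0.8, "acute": 0.2}
TRANS = {"stable": {"stable": 0.9, "acute": 0.1},
         "acute": {"stable": 0.3, "acute": 0.7}}
EMIT = {"stable": {"low": 0.9, "high": 0.1},
        "acute": {"low": 0.2, "high": 0.8}}

observations = ["low", "low", "high", "high", "low"]
path = viterbi(observations, STATES, START, TRANS, EMIT)
print(path)  # ['stable', 'stable', 'acute', 'acute', 'stable']
```

With ground-truth disease states available, the decoded path can be compared position by position against the true sequence.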

🤖 M6: Reinforcement Learning

  • MDP environment setup with state/action/reward
  • Tabular Q-learning for treatment policies
  • Policy evaluation vs. random/heuristic baselines
  • Clinical interpretation of learned strategies
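
The state/action/reward setup and the tabular Q-learning update can be sketched on a two-state toy MDP. Everything here is hypothetical: the states, the transition probabilities, the reward of 1 for a healthy next state, and the treatment cost of 0.5 are invented to keep the example small, and the hyperparameters are illustrative rather than tuned.

```python
import random

STATES = ("healthy", "sick")
ACTIONS = ("wait", "treat")
# Hypothetical probability that the next state is "healthy".
P_HEALTHY_NEXT = {
    ("healthy", "wait"): 0.9, ("healthy", "treat"): 0.9,
    ("sick", "wait"): 0.1, ("sick", "treat"): 0.8,
}
TREATMENT_COST = 0.5

def step(state, action, rng):
    """Sample the next state and reward from the toy environment."""
    next_state = "healthy" if rng.random() < P_HEALTHY_NEXT[(state, action)] else "sick"
    reward = 1.0 if next_state == "healthy" else 0.0
    if action == "treat":
        reward -= TREATMENT_COST
    return next_state, reward

def q_learning(n_steps=100_000, alpha=0.02, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = "healthy"
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        next_state, reward = step(state, action, rng)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q

q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(policy)
```

In this toy setting, treating a healthy patient only adds cost, so the greedy policy should wait when healthy and treat when sick. A random or fixed heuristic policy can be run through the same `step` function to serve as a baseline.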

💡 Key Innovations

  1. Multi-task Learning: Simultaneous disease prediction + treatment optimization
  2. Temporal Modeling: Account for disease state transitions over time
  3. Missing Data Realism: Handle realistic clinical data gaps
  4. Utility-based Evaluation: Optimize patient outcomes rather than accuracy alone
  5. Causal Treatment Effects: Estimate counterfactual treatment scenarios

🔬 Data Highlights

  • Domain: Synthetic cardiovascular clinical study
  • Design: Longitudinal (8 time points over ~2 years)
  • Patients: 3,000 diverse synthetic patients
  • Variables: Demographics, risk factors, symptoms, labs, treatments, outcomes
  • Missing Data: Realistic patterns (10-20% symptoms, 5-15% labs)
  • Ground Truth: Disease states provided for educational purposes

📖 Full data documentation: data/raw/README.md

🤝 Contributing

See CONTRIBUTING.md for development workflow, branching strategy, and code standards.

⚠️ Important Notes

  • 🎓 Educational Data: Synthetic dataset for learning purposes only
  • ❌ Not for Clinical Use: Do not apply insights to real patients
  • 📚 Research Only: Suitable for methodology development and education
  • ✅ Reproducible: All code, data, and results are version controlled

About

Advanced Machine Learning for Cardiovascular Disease Prediction & Treatment Optimization
