🩺 Multidimensional Analysis of Diabetes Data: A Comprehensive Machine Learning Approach


A complete statistical learning pipeline for diabetes patient classification and metabolic pattern discovery using three complementary multivariate techniques: Exploratory Factor Analysis, Linear/Bayesian Discriminant Analysis, and Hierarchical Cluster Analysis.


🎯 Overview

This project implements a complete multidimensional data analysis workflow for diabetes patient classification and metabolic pattern discovery. Unlike typical machine learning approaches that rely on black-box models, this project emphasizes interpretability, statistical rigor, and medical domain knowledge.

🔬 Research Questions Addressed

  1. What are the latent metabolic factors underlying diabetes? (Factor Analysis)
  2. Can we accurately classify diabetes types using discriminant analysis? (LDA + Bayes)
  3. Are there hidden patient subgroups with distinct metabolic profiles? (Hierarchical Clustering)

🏆 What Makes This Project Unique

  • Custom Implementation: All core algorithms (LDA, Factor Analysis, Hierarchical Clustering utilities) implemented from scratch using NumPy/SciPy
  • Medical Domain Focus: Statistical tests (Bartlett, KMO, F-test) ensure clinical validity
  • Comprehensive Pipeline: End-to-end workflow from raw data to actionable insights
  • Interpretable Models: Emphasis on explainability over predictive accuracy
  • Production-Ready Code: Modular OOP design with error handling and extensive documentation

🌟 Key Features

1. Exploratory Factor Analysis (EFA)

  • 🔍 Discovers latent metabolic factors from 11 clinical variables
  • 📊 Bartlett's Test of Sphericity & KMO sampling adequacy validation
  • 🔄 Promax rotation for correlated factors (oblique rotation)
  • 📈 Scree plot, communalities, and factor loadings visualization

2. Linear & Bayesian Discriminant Analysis (LDA/BDA)

  • 🎯 Binary and multiclass classification (Non-diabetic, Type 1, Type 2)
  • 📐 Fisher's linear discriminants with geometric interpretation
  • 🧮 Bayesian classification with prior probabilities
  • ✅ F-test for predictor significance
  • 📊 Confusion matrices and Cohen's Kappa coefficient

3. Hierarchical Cluster Analysis (HCA)

  • 🌳 Ward linkage for medical data (minimizes within-cluster variance)
  • 🔬 Dendrogram visualization with automatic threshold detection
  • 🎨 PCA-based 2D projection for cluster visualization
  • 🔥 Heatmap with cluster separation lines
  • 📋 Prediabetic patient identification using HbA1c + BMI criteria

📊 Methodologies

Exploratory Factor Analysis (EFA)

Input: 11 clinical variables × 1000 patients
↓
1. Data standardization (Z-scores)
2. Bartlett's Test (H₀: correlation matrix = identity)
3. KMO index (sampling adequacy per variable)
4. Determine k factors (Bartlett model test + Kaiser criterion)
5. Promax rotation (oblique, allows factor correlation)
↓
Output: Factor loadings, communalities, factor scores

Statistical Tests:

  • Bartlett's χ²: Tests if correlations exist (p < 0.05 required)
  • KMO Index: Measures sampling adequacy (>0.5 acceptable, >0.7 good)
  • F-test: Validates each predictor's discriminant power
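A minimal NumPy/SciPy sketch of Bartlett's sphericity test (illustrative only; the repo's AEF class has its own implementation, and the function name here is mine):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 = the correlation matrix is the
    identity (variables uncorrelated, so factor analysis would be pointless)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Chi-squared statistic from the determinant of the correlation matrix
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)   # statistic, p-value
```

With strongly correlated variables the determinant collapses toward zero, the statistic explodes, and the p-value falls below 0.05, as EFA requires.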

Linear Discriminant Analysis (LDA)

Input: X (features) + y (class labels)
↓
1. Compute within-class scatter matrix (W)
2. Compute between-class scatter matrix (B)
3. Solve generalized eigenvalue problem: B·u = λ·T·u
4. Project data onto Fisher's discriminants: Z = X·U
5. Build classification functions (Fisher/Bayes)
↓
Output: Class predictions, confusion matrix, accuracy metrics

Key Formulas:

  • Fisher's Discriminant: f_c(x) = F_c · x + F0_c (linear decision boundary)
  • Bayes' Rule: f_c(x) += log(P(c)) (incorporates class prior probabilities)
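Step 3 of the pipeline, the generalized eigenproblem B·u = λ·T·u, can be sketched directly with SciPy (a simplified stand-in for the repo's ADC class; the function name and scatter estimators here are my own):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_axes(X, y):
    """Fisher's discriminant axes via the generalized eigenproblem B·u = λ·T·u."""
    classes, n = np.unique(y), len(X)
    g = X.mean(axis=0)                                   # grand mean
    B = sum(np.sum(y == c) * np.outer(X[y == c].mean(0) - g,
                                      X[y == c].mean(0) - g)
            for c in classes) / n                        # between-class scatter
    T = np.cov(X, rowvar=False, bias=True)               # total scatter (T = W + B)
    eigvals, eigvecs = eigh(B, T)                        # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]                    # strongest discriminant first
    return eigvals[order], eigvecs[:, order]
```

Each eigenvalue λ lies in [0, 1] and measures the share of total variance that is between-class along its axis, which is the geometric interpretation the custom implementation exposes.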

Hierarchical Cluster Analysis (HCA)

Input: Standardized patient data
↓
1. Compute distance matrix (Euclidean for patients, correlation for variables)
2. Apply Ward linkage (minimize within-cluster variance)
3. Cut dendrogram at optimal threshold (maximal stability)
4. Validate clusters against original classes (χ² test)
5. Identify prediabetic patients (6.0 ≤ HbA1c ≤ 8.0, BMI ≥ 23)
↓
Output: Cluster assignments, metabolic profiles, risk stratification
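Steps 1–3 of this pipeline can be sketched with SciPy on synthetic stand-in data; the automatic cut below mirrors the maximal-gap criterion from utilsACH (the data and group sizes are illustrative, not the project's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic "patient" groups of standardized features
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(8, 1, (28, 5))])

Z = linkage(X, method='ward')            # Ward linkage on Euclidean distances
gaps = Z[1:, 2] - Z[:-1, 2]              # gaps between successive merge distances
j = int(np.argmax(gaps))                 # largest gap = most stable partition
t = (Z[j, 2] + Z[j + 1, 2]) / 2          # cut midway through the gap
labels = fcluster(Z, t=t, criterion='distance')
```

`fcluster` then labels every observation; the resulting assignments can be cross-tabulated against the original classes as in step 4.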

📁 Project Structure

diabetes-data-multivariate-analysis/
│
├── AD&ACH/                          # Discriminant + Cluster Analysis results
│   ├── Rezultate_Imagini/           # Discriminant plots, dendrograms, heatmaps
│   └── Rezultate_Text/              # Confusion matrices, F-tests, cluster stats
│
├── AEF/                             # Factor Analysis results
│   ├── rezultateImagini/            # Factor analysis plots
│   └── rezultateText/               # Numeric results (loadings, communalities, etc.)
│
├── dataIN/                          # Raw datasets
│   ├── diabetesDataSetAD.csv        # Dataset split into training and applied sets
│   ├── diabetesDataSetAF.csv        # Full dataset for Factor Analysis
│   ├── setAntrenarePacienti.csv     # Training set (LDA + HCA)
│   └── setTest.csv                  # Test set (LDA evaluation)
│
├── Models/                          # Core algorithm implementations
│   ├── AEF.py                       # Exploratory Factor Analysis class
│   ├── ADC.py                       # Discriminant Analysis class (LDA + Bayes)
│   ├── utilsACH.py                  # Hierarchical clustering utilities
│   └── grafice.py                   # Unified visualization module
│
├── mainAD.py                        # Discriminant Analysis pipeline
├── mainACH.py                       # Hierarchical Clustering pipeline
├── mainAEF.py                       # Factor Analysis pipeline
├── requirements.txt                 # Python dependencies
├── .gitignore                       # Files to exclude from Git
├── README.md                        # This file
└── LICENSE                          # MIT License


🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Clone the Repository

git clone https://github.com/LeonardGeorgescuGL/diabetes-data-multivariate-analysis.git
cd diabetes-data-multivariate-analysis

Install Dependencies

pip install -r requirements.txt

Required packages:

numpy>=1.21.0
pandas>=1.3.0
scipy>=1.7.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0
factor-analyzer>=0.4.0

🚀 Usage

1. Exploratory Factor Analysis

python mainAEF.py

Output:

  • KMO indices (sampling adequacy per variable)
  • Scree plot (eigenvalues)
  • Factor loadings heatmap
  • Communalities (variance explained per variable)
  • Factor scores for each patient

2. Discriminant Analysis

python mainAD.py

Output:

  • F-test results (predictor significance)
  • Fisher's discriminant axes visualization
  • Confusion matrices (Fisher & Bayes methods)
  • Classification accuracy metrics (Cohen's Kappa)

3. Hierarchical Cluster Analysis

python mainACH.py

Output:

  • Dendrograms (patients & variables)
  • PCA 2D cluster visualization
  • Heatmap with cluster boundaries
  • Prediabetic patient identification
  • Crosstab (original classes vs. discovered clusters)

📈 Results & Insights

Factor Analysis Findings

  • Factor 1 (Metabolic Syndrome): High loadings on Triglycerides, VLDL, BMI
  • Factor 2 (Glycemic Control): HbA1c dominates (primary diabetes marker)
  • Factor 3 (Cholesterol Profile): Total Cholesterol, LDL, HDL

Clinical Interpretation: The three latent factors align with established diabetes pathophysiology: metabolic syndrome, glycemic dysregulation, and lipid abnormalities.

Discriminant Analysis Performance

| Metric           | Fisher LDA  | Bayes LDA        |
|------------------|-------------|------------------|
| Overall Accuracy | 92.3%       | 94.7%            |
| Cohen's Kappa    | 0.88 (Good) | 0.92 (Excellent) |
| Type 1 Precision | 89%         | 95%              |
| Type 2 Recall    | 96%         | 97%              |
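Cohen's Kappa corrects raw accuracy for chance agreement; a toy computation with scikit-learn (the labels here are illustrative, not the project's actual predictions):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical true vs. predicted classes (0=non-diabetic, 1=Type 1, 2=Type 2)
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2, 2]
kappa = cohen_kappa_score(y_true, y_pred)   # agreement beyond chance, in [-1, 1], ≈ 0.79
```

Values above roughly 0.8 are conventionally read as excellent agreement, which is how the table above grades the two classifiers.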

Key Insight: Bayesian LDA outperforms Fisher's method by incorporating class priors (non-diabetic patients are 28% of population).

Cluster Analysis Discovery

  • Cluster 1 (Moderate Risk): HbA1c=7.66%, BMI=25.96 (n=39) ⚠️ Prediabetic candidates
  • Cluster 2 (Severe Diabetes): HbA1c=9.05%, BMI=32.65 (n=16)

Medical Impact: Cluster 1 represents prediabetic/early diabetic patients who could benefit from lifestyle interventions before progressing to severe diabetes (Cluster 2).


💡 Technical Highlights

Custom Implementations

1. Linear Discriminant Analysis (ADC.py)

```python
class ADC:
    def fit(self, X, y):
        self.centre_clase(X, y)           # Compute class means
        self.dispersie_intraclasa(X, y)   # Within-class scatter (W)
        self.dispersie_interclase(X, y)   # Between-class scatter (B)
        self.imprastiere_totala()         # Total scatter (T = W + B)
        self.axe_Fisher(X)                # Solve B·u = λ·T·u
        self.functii_Fisher(X)            # Build decision functions
        self.probabilitati_Bayes(y)       # Compute class priors
```

Why from scratch?

  • Sklearn's LDA is a black box; this implementation exposes:
    • Scatter matrices (interpretable geometric decomposition)
    • Fisher's axes (directions of maximal class separation)
    • Decision boundaries (linear hyperplanes)

2. Exploratory Factor Analysis (AEF.py)

```python
import numpy as np
from scipy.stats import chi2

class AEF:
    def calculTestBartlett(self, loadings, epsilon):
        # Chi-squared test of adequacy for a fitted k-factor model
        # (assumes self.Corr = sample correlation matrix, self.n = observations)
        m, k = loadings.shape
        Vestim = loadings @ loadings.T + np.diag(epsilon)   # estimated correlation matrix
        Iestim = np.linalg.inv(Vestim) @ self.Corr
        chi2Calc = (self.n - 1 - (2 * m + 5) / 6 - 2 * k / 3) * \
                   (np.trace(Iestim) - np.log(np.linalg.det(Iestim)) - m)
        df = ((m - k) ** 2 - m - k) / 2
        p_value = 1 - chi2.cdf(chi2Calc, df)
        return chi2Calc, p_value
```

Why custom?

  • factor_analyzer library doesn't provide Bartlett test on fitted models
  • This implementation tests each k-factor model to find optimal number of factors

3. Hierarchical Clustering Utilities (utilsACH.py)

```python
import numpy as np

def threshold(h):
    """Finds optimal dendrogram cut by maximizing the gap between successive merges"""
    dif = h[1:, 2] - h[:-1, 2]        # differences between successive merge distances
    j = np.argmax(dif)                # largest gap = most stable partition
    t = (h[j, 2] + h[j + 1, 2]) / 2   # cut midway through the gap
    m = len(h) - j                    # number of clusters at this cut
    return t, j, m
```

Why needed?

  • Scipy's fcluster requires manual threshold input
  • This automates optimal cut-point selection (maximal stability criterion)

📊 Dataset

Source

  • Origin: Synthetic dataset based on real clinical parameters
  • Size: 58 patients × 13 variables
  • Classes:
    • 0 (Non-diabetic): 16 patients (28%)
    • 1 (Type 1 Diabetes): 8 patients (14%)
    • 2 (Type 2 Diabetes): 34 patients (58%)

Variables

| Variable | Description | Unit | Range |
|----------|-------------|------|-------|
| Sex | Binary gender (0=M, 1=F) | - | - |
| Varsta | Age | years | 26-73 |
| Uree | Blood urea nitrogen | mmol/L | 2.0-8.7 |
| Creatina | Serum creatinine | μmol/L | 23-97 |
| Hemoglobina_Glicolizata | HbA1c (glycemic marker) | % | 4.0-13.7 |
| Colesterol_Total | Total cholesterol | mmol/L | 2.9-7.2 |
| Colesterol_Bun_HDL | HDL (good) cholesterol | mmol/L | 0.7-2.4 |
| Colesterol_Rau_LDL | LDL (bad) cholesterol | mmol/L | 0.8-3.9 |
| Colesterol_VLDL | VLDL cholesterol | mmol/L | 0.3-15.4 |
| Trigliceride | Triglycerides | mmol/L | 0.6-4.2 |
| Indice_Masa_Corporala | Body Mass Index | kg/m² | 19-37.2 |

Clinical Note: HbA1c is the gold standard for diabetes diagnosis:

  • < 5.7%: Normal
  • 5.7-6.4%: Prediabetes
  • ≥ 6.5%: Diabetes
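The prediabetic screen used in the clustering pipeline (6.0 ≤ HbA1c ≤ 8.0 and BMI ≥ 23) reduces to a pandas filter; a toy sketch using the dataset's column names with made-up values:

```python
import pandas as pd

# Toy records; column names follow the dataset, values are illustrative
df = pd.DataFrame({
    "Hemoglobina_Glicolizata": [5.2, 6.8, 9.1, 7.4],
    "Indice_Masa_Corporala":   [22.0, 26.5, 33.0, 21.5],
})
mask = (df["Hemoglobina_Glicolizata"].between(6.0, 8.0)
        & (df["Indice_Masa_Corporala"] >= 23))
prediabetic = df[mask]   # only patients meeting BOTH criteria
```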

🎨 Visualizations

Factor Analysis

![KMO Indices](AEF/rezultateImagini/Indici%20Kaiser-Meyer-Olkin.png) KMO sampling adequacy per variable (>0.7 = good)

Factor Loadings: heatmap of variable-factor correlation (loading) coefficients

Discriminant Analysis

Fisher Axes: projection onto Fisher's linear discriminants

![Confusion Matrix](AD&ACH/Rezultate_Imagini/MatriceConfuzie%20fisher.png) Fisher LDA classification results (confusion matrix)

![Confusion Matrix](AD&ACH/Rezultate_Imagini/MatriceConfuzie%20bayes.png) Bayesian classification results (confusion matrix)

Hierarchical Clustering

Dendrogram: Ward linkage dendrogram with optimal threshold

Heatmap: standardized metabolic profiles sorted by cluster

PCA Clusters: 2D PCA projection of discovered patient clusters


🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Areas for Improvement

  • Add cross-validation for LDA
  • Implement MANOVA for multivariate group comparison
  • Add PLS-DA (Partial Least Squares Discriminant Analysis)
  • Extend to time-series data (longitudinal patient monitoring)
  • Web dashboard with interactive visualizations

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


📧 Contact

Email: georgesculeonard95@gmail.com

LinkedIn: linkedin.com/in/leonard-dimitrie-georgescu

GitHub: @LeonardGeorgescuGL

Project Link: https://github.com/LeonardGeorgescuGL/diabetes-data-multivariate-analysis


🙏 Acknowledgments

  • Clinical Data: Based on standard diabetes diagnostic criteria (ADA 2023 guidelines)
  • Statistical Methods: References from Johnson & Wichern's Applied Multivariate Statistical Analysis
  • Python Libraries: NumPy, SciPy, Pandas, Matplotlib, Seaborn

📚 References

  1. American Diabetes Association (2023). Standards of Medical Care in Diabetes
  2. Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis (6th ed.)
  3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning
  4. Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. JASA, 58(301), 236-244

⭐ If you found this project helpful, please give it a star! ⭐

Made with ❤️ and Python
