A complete statistical learning pipeline for diabetes patient classification and metabolic pattern discovery using three complementary multivariate techniques: Exploratory Factor Analysis, Linear/Bayesian Discriminant Analysis, and Hierarchical Cluster Analysis.
- Overview
- Key Features
- Methodologies
- Project Structure
- Installation
- Usage
- Results & Insights
- Technical Highlights
- Dataset
- Visualizations
- Contributing
- License
- Contact
This project implements a complete multidimensional data analysis workflow for diabetes patient classification and metabolic pattern discovery. Unlike typical machine learning approaches that rely on black-box models, this project emphasizes interpretability, statistical rigor, and medical domain knowledge.
- What are the latent metabolic factors underlying diabetes? (Factor Analysis)
- Can we accurately classify diabetes types using discriminant analysis? (LDA + Bayes)
- Are there hidden patient subgroups with distinct metabolic profiles? (Hierarchical Clustering)
- ✅ Custom Implementation: All core algorithms (LDA, Factor Analysis, Hierarchical Clustering utilities) implemented from scratch using NumPy/SciPy
- ✅ Medical Domain Focus: Statistical tests (Bartlett, KMO, F-test) ensure clinical validity
- ✅ Comprehensive Pipeline: End-to-end workflow from raw data to actionable insights
- ✅ Interpretable Models: Emphasis on explainability over predictive accuracy
- ✅ Production-Ready Code: Modular OOP design with error handling and extensive documentation
- 🔍 Discovers latent metabolic factors from 11 clinical variables
- 📊 Bartlett's Test of Sphericity & KMO sampling adequacy validation
- 🔄 Promax rotation for correlated factors (oblique rotation)
- 📈 Scree plot, communalities, and factor loadings visualization
- 🎯 Binary and multiclass classification (Non-diabetic, Type 1, Type 2)
- 📐 Fisher's linear discriminants with geometric interpretation
- 🧮 Bayesian classification with prior probabilities
- ✅ F-test for predictor significance
- 📊 Confusion matrices and Cohen's Kappa coefficient
- 🌳 Ward linkage for medical data (minimizes within-cluster variance)
- 🔬 Dendrogram visualization with automatic threshold detection
- 🎨 PCA-based 2D projection for cluster visualization
- 🔥 Heatmap with cluster separation lines
- 📋 Prediabetic patient identification using HbA1c + BMI criteria
```
Input: 11 clinical variables × 1000 patients
        ↓
1. Data standardization (Z-scores)
2. Bartlett's Test (H₀: correlation matrix = identity)
3. KMO index (sampling adequacy per variable)
4. Determine k factors (Bartlett model test + Kaiser criterion)
5. Promax rotation (oblique, allows factor correlation)
        ↓
Output: Factor loadings, communalities, factor scores
```
Statistical Tests:
- Bartlett's χ²: Tests if correlations exist (p < 0.05 required)
- KMO Index: Measures sampling adequacy (>0.5 acceptable, >0.7 good)
- F-test: Validates each predictor's discriminant power
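The two adequacy checks above (steps 2-3) can be sketched from scratch with NumPy/SciPy, in the spirit of the project's custom implementations. The data below is synthetic with a planted 3-factor structure, and all variable names are illustrative:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 3))                        # planted latent factors
X = F @ rng.normal(size=(3, 11)) + 0.5 * rng.normal(size=(200, 11))
X = (X - X.mean(axis=0)) / X.std(axis=0)             # 1. standardize (Z-scores)
R = np.corrcoef(X, rowvar=False)                     # correlation matrix
n, p = X.shape

# 2. Bartlett's test of sphericity, H0: R = identity
stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
p_value = chi2.sf(stat, p * (p - 1) / 2)             # p < 0.05 => factorable

# 3. KMO index from anti-image (partial) correlations
Rinv = np.linalg.inv(R)
A = -Rinv / np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
np.fill_diagonal(A, 0.0)                             # keep off-diagonal partials only
R_off = R - np.eye(p)                                # off-diagonal raw correlations
kmo = (R_off**2).sum() / ((R_off**2).sum() + (A**2).sum())  # > 0.5 acceptable
```

With a real factor structure present, Bartlett's p-value collapses toward zero and KMO rises well above the 0.5 floor.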
```
Input: X (features) + y (class labels)
        ↓
1. Compute within-class scatter matrix (W)
2. Compute between-class scatter matrix (B)
3. Solve generalized eigenvalue problem: B·u = λ·T·u
4. Project data onto Fisher's discriminants: Z = X·U
5. Build classification functions (Fisher/Bayes)
        ↓
Output: Class predictions, confusion matrix, accuracy metrics
```
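A minimal sketch of steps 1-4 above with NumPy/SciPy; the three-class data and all variable names are synthetic and illustrative:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)

grand_mean = X.mean(axis=0)
W = np.zeros((4, 4))                      # 1. within-class scatter
B = np.zeros((4, 4))                      # 2. between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    W += (Xc - mc).T @ (Xc - mc)
    B += len(Xc) * np.outer(mc - grand_mean, mc - grand_mean)
T = W + B                                 # total scatter

# 3. Generalized eigenproblem B·u = λ·T·u (T is symmetric positive definite)
eigvals, eigvecs = eigh(B, T)
U = eigvecs[:, np.argsort(eigvals)[::-1][:2]]  # top q = k-1 Fisher axes
Z = X @ U                                 # 4. projection onto the discriminants
```

Because T = W + B, each eigenvalue λ is the between-to-total variance ratio along its axis, so λ ∈ [0, 1].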
Key Formulas:
- Fisher's Discriminant: `f_c(x) = F_c · x + F0_c` (linear decision boundary)
- Bayes' Rule: `f_c(x) += log(P(c))` (incorporates class prior probabilities)
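A hedged sketch of these two decision rules, using the standard pooled-covariance linear discriminant form (`F_c = Σ⁻¹μ_c`, `F0_c = -½μ_cᵀΣ⁻¹μ_c`); the exact constants in the project's `functii_Fisher` may differ, and all names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),   # class 0
               rng.normal(3.0, 1.0, size=(20, 3))])  # class 1
y = np.repeat([0, 1], [40, 20])

classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)                             # P(c), estimated from the data
mus = np.array([X[y == c].mean(axis=0) for c in classes])
pooled = sum((X[y == c] - mus[i]).T @ (X[y == c] - mus[i])
             for i, c in enumerate(classes)) / (len(y) - len(classes))
Sinv = np.linalg.inv(pooled)

F = mus @ Sinv                                       # F_c, one row per class
F0 = -0.5 * np.einsum('ci,ci->c', F, mus)            # F0_c

def classify(x, bayes=False):
    s = F @ x + F0                                   # f_c(x) = F_c · x + F0_c
    if bayes:
        s = s + np.log(priors)                       # f_c(x) += log(P(c))
    return int(np.argmax(s))
```

The `bayes` flag shows exactly how the prior term shifts the decision boundary toward the rarer class's region without changing the linear form.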
```
Input: Standardized patient data
        ↓
1. Compute distance matrix (Euclidean for patients, correlation for variables)
2. Apply Ward linkage (minimize within-cluster variance)
3. Cut dendrogram at optimal threshold (maximal stability)
4. Validate clusters against original classes (χ² test)
5. Identify prediabetic patients (6.0 ≤ HbA1c ≤ 8.0, BMI ≥ 23)
        ↓
Output: Cluster assignments, metabolic profiles, risk stratification
```
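Steps 1-3 above can be sketched with SciPy's hierarchy tools; the two-group data and all names below are synthetic and illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(25, 5)),
               rng.normal(5.0, 1.0, size=(25, 5))])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized patient data

d = pdist(X, metric="euclidean")           # 1. pairwise distance matrix
h = linkage(d, method="ward")              # 2. Ward linkage

gaps = h[1:, 2] - h[:-1, 2]                # 3. cut across the largest merge gap
j = int(np.argmax(gaps))
t = (h[j, 2] + h[j + 1, 2]) / 2
labels = fcluster(h, t=t, criterion="distance")
```

Cutting midway across the largest gap between successive merge heights recovers the most stable partition; with two well-separated groups, that cut yields exactly two clusters.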
```
diabetes-data-multivariate-analysis/
│
├── AD&ACH/                       # Discriminant + Cluster Analysis results
│   ├── Rezultate_Imagini/        # Discriminant plots, dendrograms, heatmaps
│   └── Rezultate_Text/           # Confusion matrices, F-tests, cluster stats
│
├── AEF/                          # Factor Analysis results
│   ├── rezultateImagini/         # Factor analysis plots
│   └── rezultateText/            # Numeric results (loadings, communalities, etc.)
│
├── dataIN/                       # Raw datasets
│   ├── diabetesDataSetAD.csv     # Dataset split into training and application sets
│   ├── diabetesDataSetAF.csv     # Full dataset for Factor Analysis
│   ├── setAntrenarePacienti.csv  # Training set (LDA + HCA)
│   └── setTest.csv               # Test set (LDA evaluation)
│
├── Models/                       # Core algorithm implementations
│   ├── AEF.py                    # Exploratory Factor Analysis class
│   ├── ADC.py                    # Discriminant Analysis class (LDA + Bayes)
│   ├── utilsACH.py               # Hierarchical clustering utilities
│   └── grafice.py                # Unified visualization module
│
├── mainAD.py                     # Discriminant Analysis pipeline
├── mainACH.py                    # Hierarchical Clustering pipeline
├── mainAEF.py                    # Factor Analysis pipeline
├── requirements.txt              # Python dependencies
├── .gitignore                    # Files to exclude from Git
├── README.md                     # This file
└── LICENSE                       # MIT License
```
- Python 3.8 or higher
- pip package manager
```bash
git clone https://github.com/LeonardGeorgescuGL/diabetes-data-multivariate-analysis.git
cd diabetes-data-multivariate-analysis
pip install -r requirements.txt
```

Required packages:

```
numpy>=1.21.0
pandas>=1.3.0
scipy>=1.7.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0
factor-analyzer>=0.4.0
```
```bash
python mainAEF.py
```

Output:
- KMO indices (sampling adequacy per variable)
- Scree plot (eigenvalues)
- Factor loadings heatmap
- Communalities (variance explained per variable)
- Factor scores for each patient
```bash
python mainAD.py
```

Output:
- F-test results (predictor significance)
- Fisher's discriminant axes visualization
- Confusion matrices (Fisher & Bayes methods)
- Classification accuracy metrics (Cohen's Kappa)
```bash
python mainACH.py
```

Output:
- Dendrograms (patients & variables)
- PCA 2D cluster visualization
- Heatmap with cluster boundaries
- Prediabetic patient identification
- Crosstab (original classes vs. discovered clusters)
- Factor 1 (Metabolic Syndrome): High loadings on Triglycerides, VLDL, BMI
- Factor 2 (Glycemic Control): HbA1c dominates (primary diabetes marker)
- Factor 3 (Cholesterol Profile): Total Cholesterol, LDL, HDL
Clinical Interpretation: The three latent factors align with established diabetes pathophysiology: metabolic syndrome, glycemic dysregulation, and lipid abnormalities.
| Metric | Fisher LDA | Bayes LDA |
|---|---|---|
| Overall Accuracy | 92.3% | 94.7% |
| Cohen's Kappa | 0.88 (Good) | 0.92 (Excellent) |
| Type 1 Precision | 89% | 95% |
| Type 2 Recall | 96% | 97% |
Key Insight: Bayesian LDA outperforms Fisher's method by incorporating class priors (non-diabetic patients make up 28% of the population).
- Cluster 1 (Moderate Risk): HbA1c=7.66%, BMI=25.96 (n=39) ⚠️ prediabetic candidates
- Cluster 2 (Severe Diabetes): HbA1c=9.05%, BMI=32.65 (n=16)
Medical Impact: Clusters 2 & 4 represent prediabetic/early diabetic patients who could benefit from lifestyle interventions before progressing to severe diabetes (Cluster 3).
```python
class ADC:
    def fit(self, X, y):
        self.centre_clase(X, y)          # Compute class means
        self.dispersie_intraclasa(X, y)  # Within-class scatter (W)
        self.dispersie_interclase(X, y)  # Between-class scatter (B)
        self.imprastiere_totala()        # Total scatter (T = W + B)
        self.axe_Fisher(X)               # Solve B·u = λ·T·u
        self.functii_Fisher(X)           # Build decision functions
        self.probabilitati_Bayes(y)      # Compute class priors
```

Why from scratch?
- Sklearn's LDA is a black box; this implementation exposes:
- Scatter matrices (interpretable geometric decomposition)
- Fisher's axes (directions of maximal class separation)
- Decision boundaries (linear hyperplanes)
```python
class AEF:
    def calculTestBartlett(self, loadings, epsilon):
        # Computes the chi-squared statistic for model adequacy
        Vestim = loadings @ loadings.T + np.diag(epsilon)  # model-implied correlations
        Iestim = np.linalg.inv(Vestim) @ self.Corr         # compared against observed
        # trace and log_det are taken from Iestim; m = number of variables
        chi2Calc = (n - 1 - correction_factor) * (trace - log_det - m)
        p_value = 1 - chi2.cdf(chi2Calc, df)
        return chi2Calc, p_value
```

Why custom?
- The `factor_analyzer` library doesn't provide a Bartlett test on fitted models
- This implementation tests each k-factor model to find the optimal number of factors
```python
def threshold(h):
    """Finds the optimal dendrogram cut by maximizing the gap between successive merges."""
    dif = h[1:, 2] - h[:-1, 2]       # Differences between successive merge distances
    j = np.argmax(dif)               # Largest gap = most stable partition
    t = (h[j, 2] + h[j + 1, 2]) / 2  # Cut midway across the largest gap
    return t, j
```

Why needed?
- SciPy's `fcluster` requires a manual threshold input
- This function automates optimal cut-point selection (maximal-stability criterion)
- Origin: Synthetic dataset based on real clinical parameters
- Size: 58 patients × 13 variables
- Classes:
- 0 (Non-diabetic): 16 patients (28%)
- 1 (Type 1 Diabetes): 8 patients (14%)
- 2 (Type 2 Diabetes): 34 patients (58%)
| Variable | Description | Unit | Range |
|---|---|---|---|
| Sex | Binary gender | 0=M, 1=F | - |
| Varsta | Age | years | 26-73 |
| Uree | Blood urea nitrogen | mmol/L | 2.0-8.7 |
| Creatina | Serum creatinine | μmol/L | 23-97 |
| Hemoglobina_Glicolizata | HbA1c (glycemic marker) | % | 4.0-13.7 |
| Colesterol_Total | Total cholesterol | mmol/L | 2.9-7.2 |
| Colesterol_Bun_HDL | HDL (good) cholesterol | mmol/L | 0.7-2.4 |
| Colesterol_Rau_LDL | LDL (bad) cholesterol | mmol/L | 0.8-3.9 |
| Colesterol_VLDL | VLDL cholesterol | mmol/L | 0.3-15.4 |
| Trigliceride | Triglycerides | mmol/L | 0.6-4.2 |
| Indice_Masa_Corporala | Body Mass Index | kg/m² | 19-37.2 |
Clinical Note: HbA1c is the gold standard for diabetes diagnosis:
- < 5.7%: Normal
- 5.7-6.4%: Prediabetes
- ≥ 6.5%: Diabetes
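A tiny helper applying these ADA cut-offs (the function name is ours, for illustration only):

```python
def hba1c_category(hba1c: float) -> str:
    """Map an HbA1c percentage to the ADA diagnostic category."""
    if hba1c < 5.7:
        return "Normal"
    if hba1c < 6.5:
        return "Prediabetes"
    return "Diabetes"
```

Note that the clustering step deliberately uses a wider screening window (6.0 ≤ HbA1c ≤ 8.0 plus BMI ≥ 23) to flag prediabetic candidates, not this strict diagnostic boundary.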
 KMO sampling adequacy per variable (>0.7 = good)
Heatmap of variable-factor correlations cumulative coefficents
Projection onto Fisher's linear discriminants
 Fisher LDA classification results using confusion matrix
 Bayesian stochastic classification results using confusion matrix
Ward linkage dendrogram with optimal threshold
Standardized metabolic profiles sorted by cluster
2D PCA projection of discovered patient clusters
Contributions are welcome! Here's how you can help:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
- Add cross-validation for LDA
- Implement MANOVA for multivariate group comparison
- Add PLS-DA (Partial Least Squares Discriminant Analysis)
- Extend to time-series data (longitudinal patient monitoring)
- Web dashboard with interactive visualizations
This project is licensed under the MIT License - see the LICENSE file for details.
Email: georgesculeonard95@gmail.com
LinkedIn: linkedin.com/in/leonard-dimitrie-georgescu
GitHub: @LeonardGeorgescuGL
Project Link: https://github.com/LeonardGeorgescuGL/diabetes-data-multivariate-analysis
- Clinical Data: Based on standard diabetes diagnostic criteria (ADA 2023 guidelines)
- Statistical Methods: References from Johnson & Wichern's Applied Multivariate Statistical Analysis
- Python Libraries: NumPy, SciPy, Pandas, Matplotlib, Seaborn
- American Diabetes Association (2023). Standards of Medical Care in Diabetes
- Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis (6th ed.)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning
- Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. JASA, 58(301), 236-244
⭐ If you found this project helpful, please give it a star! ⭐
Made with ❤️ and Python