This project focuses on building machine learning models to classify individuals as diabetic or non-diabetic using clinical data. The goal is to develop accurate predictive models that can support early diabetes detection and improve healthcare decision-making.
Source: Diabetes Clinical Dataset from Kaggle
Size: 100,000 records with 17 features
Features:
- `year`: Year of record entry (2015-2022)
- `gender`: Patient's gender (Male, Female, Other)
- `age`: Age in years
- `location`: Geographical location
- `race`: One-hot encoded race categories (AfricanAmerican, Asian, Caucasian, Hispanic, Other)
- `hypertension`: Presence of high blood pressure (1 = Yes, 0 = No)
- `heart_disease`: Presence of heart disease (1 = Yes, 0 = No)
- `smoking_history`: Smoking status (never, former, current, etc.)
- `bmi`: Body Mass Index
- `hbA1c_level`: Hemoglobin A1c level (average blood sugar over 2-3 months)
- `blood_glucose_level`: Blood glucose concentration
- `clinical_notes`: Additional clinical observations
- `diabetes`: Diabetes status (1 = Diabetic, 0 = Non-diabetic)
```
diabetes-classification/
├── data/
│   ├── diabetes_dataset_with_notes.csv   # Original dataset
│   ├── feature_engineered_diabetes.csv   # After feature engineering
│   ├── preprocessed_diabetes.csv         # After preprocessing
│   ├── train_data.csv                    # Training set
│   └── test_data.csv                     # Test set
├── notebooks/
│   ├── EDA.ipynb                         # Exploratory Data Analysis
│   ├── feature_engineering.ipynb         # Feature engineering process
│   └── preprocess_data.ipynb             # Data preprocessing
├── models/
│   ├── logistic_regression.ipynb         # Logistic Regression model
│   ├── svm.ipynb                         # Support Vector Machine
│   ├── decision_tree.ipynb               # Decision Tree
│   ├── random_forest.ipynb               # Random Forest
│   ├── naive_bayes.ipynb                 # Naive Bayes
│   ├── xg_boost.ipynb                    # XGBoost
│   ├── light_gbm.ipynb                   # LightGBM
│   └── cat_boost.ipynb                   # CatBoost
├── utils/
│   ├── constants.py                      # Project constants
│   ├── function.py                       # Utility functions
│   └── split_dataset.py                  # Data splitting utilities
├── report.ipynb                          # Comprehensive project report
├── hba1c.jpg                             # HbA1c reference chart
└── README.md                             # This file
```
- Dataset Balance: The dataset is imbalanced, with substantially more non-diabetic than diabetic cases
- Age Distribution: Broad coverage across age groups, with the largest concentration between 40 and 60 years
- Gender Distribution: Slightly more female than male samples; the "Other" category was removed due to its small representation
- BMI Categories:
- Overweight: 43,519 individuals
- Obese: 30,074 individuals
- Normal: 24,869 individuals
- Underweight: 1,531 individuals
- Clinical Indicators:
- HbA1c levels: 37,857 normal, 41,346 caution, 20,797 danger
- Blood glucose: 51,372 normal, 37,751 prediabetic, 10,877 diabetic
- Data Cleaning:
  - Removed 14 duplicate records
  - No missing values found
- Feature Engineering:
  - Removed features: `year`, `smoking_history`, `clinical_notes`
  - Removed the "Other" gender category
  - Combined low-frequency locations (Virgin Islands, Wisconsin, Wyoming) into "Others"
  - Added a BMI classification feature (Underweight, Normal, Overweight, Obese)
- Data Transformation (see the sketch after this list):
  - Applied StandardScaler to numerical features: `age`, `bmi`, `hbA1c_level`, `blood_glucose_level`
  - Applied OneHotEncoder to categorical features: `gender`, `location`, `bmi_class`
  - Used stratified splitting to maintain class balance in the train/test sets
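The transformation steps above map directly onto scikit-learn primitives. A minimal sketch, assuming the column names from the dataset description and an illustrative file path (the notebooks may organize this differently):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("data/feature_engineered_diabetes.csv")  # path assumed
X, y = df.drop(columns=["diabetes"]), df["diabetes"]

numerical = ["age", "bmi", "hbA1c_level", "blood_glucose_level"]
categorical = ["gender", "location", "bmi_class"]

# Scale numeric columns and one-hot encode categorical ones;
# sparse_threshold=0 forces a dense output for models that need it
preprocessor = ColumnTransformer(
    [("num", StandardScaler(), numerical),
     ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    sparse_threshold=0,
)

# stratify=y keeps the diabetic/non-diabetic ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
```

The model sketches below reuse `X_train_t`, `y_train`, `X_test_t`, and `y_test` from this snippet.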
This project implements eight machine learning algorithms for diabetes classification. Each model is tuned and evaluated to identify the best-performing approach.
- Logistic Regression:
  - Algorithm: Linear classification with regularization
  - Configuration: L1 regularization for feature selection
  - Strengths: Interpretable coefficients, fast training
  - Use Case: Baseline model for comparison and feature importance analysis
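A minimal sketch of this configuration (the `C` value is an illustrative placeholder, not the tuned one; the `liblinear` solver is assumed because it supports the L1 penalty):

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
log_reg.fit(X_train_t, y_train)

# L1 drives uninformative coefficients to exactly zero, acting as feature selection
print((log_reg.coef_ == 0).sum(), "coefficients zeroed out by L1")
```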
- Support Vector Machine:
  - Algorithm: SVM with RBF (Radial Basis Function) kernel
  - Configuration: Optimized C and gamma parameters
  - Strengths: Effective in high-dimensional spaces, memory efficient
  - Use Case: Complex decision boundaries with margin maximization
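A minimal sketch (hyperparameter values are placeholders for the tuned ones):

```python
from sklearn.svm import SVC

# RBF kernel; C trades margin width against misclassification,
# gamma controls each support vector's radius of influence
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train_t, y_train)
```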
- Decision Tree:
  - Algorithm: CART (Classification and Regression Trees)
  - Configuration: Optimized max_depth, min_samples_split, min_samples_leaf
  - Strengths: Highly interpretable, handles non-linear relationships
  - Use Case: Feature importance analysis and rule extraction
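A sketch with the three named knobs (values illustrative, not the tuned ones):

```python
from sklearn.tree import DecisionTreeClassifier

# All three parameters limit tree growth to curb overfitting
tree = DecisionTreeClassifier(
    max_depth=8, min_samples_split=10, min_samples_leaf=5, random_state=42
)
tree.fit(X_train_t, y_train)
```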
- Random Forest:
  - Algorithm: Ensemble of decision trees with bagging
  - Configuration: 100 estimators with optimized parameters
  - Strengths: Reduces overfitting, handles missing values, robust
  - Use Case: General-purpose ensemble with good performance
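A sketch with the 100-estimator configuration mentioned above (other parameters left at their defaults for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bagged trees; each sees a bootstrap sample and random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_t, y_train)
```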
- Naive Bayes:
  - Algorithm: Gaussian Naive Bayes classifier
  - Configuration: Assumes feature independence
  - Strengths: Fast training/prediction, works with small datasets
  - Use Case: Probabilistic classification with assumption of feature independence
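A sketch (GaussianNB takes essentially no hyperparameters; it requires dense input, which the preprocessing sketch above already produces):

```python
from sklearn.naive_bayes import GaussianNB

# Fits one Gaussian per feature per class under the independence assumption
nb = GaussianNB()
nb.fit(X_train_t, y_train)
```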
- XGBoost:
  - Algorithm: Extreme Gradient Boosting
  - Configuration: Optimized learning rate, max_depth, n_estimators
  - Strengths: State-of-the-art performance, handles missing values
  - Use Case: High-performance competitive machine learning
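A sketch with the three named hyperparameters (values illustrative):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss"
)
xgb.fit(X_train_t, y_train)
```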
- LightGBM:
  - Algorithm: Light Gradient Boosting Machine
  - Configuration: Optimized num_leaves, learning_rate, feature_fraction
  - Strengths: Fast training speed, low memory usage, high accuracy
  - Use Case: Large datasets requiring efficient gradient boosting
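A sketch with the named parameters (values illustrative; `feature_fraction` is LightGBM's alias for `colsample_bytree`):

```python
from lightgbm import LGBMClassifier

# Leaf-wise tree growth (num_leaves) plus per-tree feature subsampling
lgbm = LGBMClassifier(num_leaves=31, learning_rate=0.05, feature_fraction=0.9)
lgbm.fit(X_train_t, y_train)
```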
- CatBoost:
  - Algorithm: Categorical Boosting
  - Configuration: Optimized iterations, depth, learning_rate
  - Strengths: Handles categorical features automatically, robust to overfitting
  - Use Case: Tabular data with mixed feature types
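A sketch with the named parameters (values illustrative; note that on the already one-hot-encoded matrix, CatBoost's native categorical handling is not exercised):

```python
from catboost import CatBoostClassifier

cat = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.05, verbose=0)
cat.fit(X_train_t, y_train)
```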
Each model undergoes extensive hyperparameter optimization:
- Method: GridSearchCV with 5-fold cross-validation (see the sketch after this list)
- Advanced Tuning: Optuna framework for complex models (CatBoost, XGBoost, LightGBM)
- Metrics: Optimized for accuracy while balancing precision and recall
- Validation: Stratified sampling to maintain class distribution
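As an illustration of the GridSearchCV setup described above, a sketch for the Random Forest model (the actual parameter grids used in the notebooks may differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
}

# Stratified 5-fold CV preserves the class distribution in every fold
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X_train_t, y_train)
print(search.best_params_, search.best_score_)
```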
All trained models are saved in the `checkpoints/` directory and can be reloaded as sketched below:

```
checkpoints/
├── catboost.pkl              # CatBoost model
├── decision_tree.pkl         # Decision Tree model
├── lgbm_model.pkl            # LightGBM model
├── logistic_regression.pkl   # Logistic Regression model
├── random_forest.pkl         # Random Forest model
├── svm.pkl                   # Support Vector Machine model
└── xgb_model.pkl             # XGBoost model
```
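A saved checkpoint can be restored with joblib (already listed in the requirements); the file name is taken from the tree above:

```python
import joblib

model = joblib.load("checkpoints/xgb_model.pkl")
predictions = model.predict(X_test_t)
```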
| Model | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.9609 | 0.6188 | 0.8870 | 0.7290 |
| SVM | 0.9690 | 0.6547 | 0.9712 | 0.7822 |
| Decision Tree | 0.9722 | 0.6735 | 1.0000 | 0.8049 |
| Random Forest | 0.9725 | 0.6829 | 0.9906 | 0.8085 |
| Naive Bayes | 0.9563 | 0.5794 | 0.8618 | 0.6929 |
| XGBoost | 0.9729 | 0.6888 | 0.9899 | 0.8124 |
| LightGBM | 0.9726 | 0.6877 | 0.9865 | 0.8104 |
| CatBoost | 0.9729 | 0.6947 | 0.9809 | 0.8134 |
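The metrics in the table can be reproduced with scikit-learn's metric functions; a sketch for a single loaded model:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = model.predict(X_test_t)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
```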
- Top Performing Models: CatBoost and XGBoost tie for the highest accuracy (97.29%), with CatBoost achieving the best F1 score (0.8134)
- Performance Tiers:
  - Tier 1 (Elite): CatBoost (F1: 0.8134), XGBoost (F1: 0.8124) - state-of-the-art gradient boosting
  - Tier 2 (Strong): LightGBM (F1: 0.8104), Random Forest (F1: 0.8085) - excellent ensemble methods
  - Tier 3 (Solid): Decision Tree (F1: 0.8049), SVM (F1: 0.7822) - good specialized performance
  - Tier 4 (Baseline): Logistic Regression (F1: 0.7290), Naive Bayes (F1: 0.6929) - traditional methods
- Precision vs. Recall Trade-offs:
  - Highest Precision: Decision Tree (100%) - zero false positives, but it misses 32.65% of diabetic cases
  - Best Recall: CatBoost (69.47%) - catches the most diabetic cases while maintaining high precision (98.09%)
  - Balanced Performance: XGBoost and LightGBM show an excellent precision-recall balance
- Feature Importance: The top contributing features across models are:
  - HbA1c level (primary diabetes indicator)
  - Blood glucose level (critical biomarker)
  - BMI (obesity correlation)
  - Age (demographic risk factor)
  - Heart disease (comorbidity)
  - Hypertension (related condition)
- Clinical Implications:
  - For Screening: The Decision Tree's perfect precision minimizes unnecessary worry from false positives
  - For Diagnosis: CatBoost's balanced metrics make it ideal for comprehensive diabetes prediction
  - For Research: Gradient boosting models (CatBoost, XGBoost, LightGBM) consistently outperform traditional ML methods
```
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
lightgbm
catboost
optuna
jupyter
joblib
```
- Clone the repository:

  ```bash
  git clone https://github.com/fisherman611/diabetes-classification.git
  cd diabetes-classification
  ```

- Set up the environment:

  ```bash
  python -m venv env
  source env/bin/activate  # On Windows: env\Scripts\activate
  pip install -r requirements.txt
  ```

- Run the analysis:
  - Start with `notebooks/EDA.ipynb` for data exploration
  - Follow with the preprocessing and feature engineering notebooks
  - Explore the individual model notebooks in the `models/` directory
  - Review the comprehensive analysis in `report.ipynb`
- Low Recall: Models miss approximately 30-40% of diabetic cases
- Class Imbalance: The dataset is skewed toward non-diabetic cases
- Data Balancing: Implement SMOTE, oversampling, or class-weighting techniques (see the sketch after this list)
- Advanced Models: Already implemented: XGBoost, LightGBM, CatBoost; future: neural networks and deep learning approaches
- Feature Selection: Remove low-importance features identified during analysis
- Ensemble Methods: Combine multiple models for improved performance (model stacking/voting)
- Cost-sensitive Learning: Adjust for the different costs of false positives vs. false negatives
- Model Optimization: Further hyperparameter tuning with advanced techniques like Bayesian optimization
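Two of the suggested balancing techniques are straightforward to prototype. A sketch assuming the split from the preprocessing snippet above (SMOTE requires the `imbalanced-learn` package, which is not in the current requirements):

```python
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Option 1: class weighting - penalize errors on the minority (diabetic) class
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X_train_t, y_train)

# Option 2: SMOTE - synthesize minority-class samples before training
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train_t, y_train)
```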
This project demonstrates how machine learning can support healthcare professionals in:
- Early Detection: Identifying at-risk individuals before symptoms appear
- Resource Allocation: Prioritizing patients for further testing
- Preventive Care: Enabling timely interventions to prevent complications
The high precision achieved by these models makes them suitable for screening applications where minimizing false positives is crucial.
Contributions are welcome! Please feel free to submit pull requests or open issues for suggestions and improvements.
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset provided by Kaggle
- Healthcare domain expertise and clinical guidelines for diabetes diagnosis