AryannAgrawall/Diabetes-Risk-Prediction-Platform

# Diabetes Risk Prediction Platform

This project started as a college assignment and grew into a small, end-to-end ML product: it trains multiple classifiers, serves them behind a Flask API, and exposes a lightweight browser UI. The goal is to predict a patient's diabetes risk/type from routinely collected clinical signals.

## Problem Statement

Given features such as age, glucose, insulin, and body mass index, predict whether an individual is non-diabetic, type-1-like, or type-2-like. The system must support experimentation with different model families, provide reproducible training, and expose a low-latency inference API that can be consumed by a web client.

## Dataset

- Source: Pima Indians Diabetes dataset (Kaggle/UCI) or a similarly structured CSV placed at `data/diabetes.csv`. When absent, the pipeline synthesizes a dataset to keep the system runnable.
- Size: ~768 rows in the canonical dataset; the synthetic generator defaults to 800 rows for parity.
- Features: `Age` (years), `Glucose` (mg/dL), `Insulin` (µU/mL), `BMI` (kg/m²); target column `type` encoded as `{0: non-diabetic, 1: type1-like, 2: type2-like}`.
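The synthetic fallback itself is not shown in this README; below is a minimal, dependency-free sketch of what such a generator might look like. The function name, feature ranges, and labeling rule are illustrative assumptions, not the repo's actual logic.

```python
import random

def make_synthetic_rows(n=800, seed=42):
    """Generate a toy dataset mimicking the four clinical features.

    Hypothetical stand-in for the project's real generator: the
    feature ranges and class rule here are illustrative only.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(21, 80)        # years
        glucose = rng.uniform(60, 220)   # mg/dL
        insulin = rng.uniform(2, 300)    # µU/mL
        bmi = rng.uniform(16, 50)        # kg/m²
        # Crude labeling rule: high glucose + low insulin -> type1-like,
        # high glucose + high BMI -> type2-like, otherwise non-diabetic.
        if glucose > 140 and insulin < 30:
            label = 1
        elif glucose > 140 and bmi > 30:
            label = 2
        else:
            label = 0
        rows.append({"Age": age, "Glucose": glucose,
                     "Insulin": insulin, "BMI": bmi, "type": label})
    return rows
```

Seeding the generator keeps the fallback reproducible, so metrics computed on synthetic data are at least comparable across runs.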

## Models

| Model | Rationale |
| --- | --- |
| Gaussian Naive Bayes | Fast probabilistic baseline; works well on small datasets. |
| MLPClassifier (sklearn) | Non-linear baseline with built-in regularization and early stopping. |
| Custom two-layer MLP | Educational implementation exposing weight serialization and a manual training loop. |
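The forward pass of such a custom two-layer MLP can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the repo's code; layer sizes and activation choices (ReLU hidden layer, softmax output) are assumptions.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer MLP: ReLU hidden layer + softmax output.

    Hypothetical sketch of the custom implementation; shapes follow the
    convention x: (batch, features), W1: (features, hidden), W2: (hidden, classes).
    """
    h = np.maximum(0.0, x @ W1 + b1)                  # hidden activations (ReLU)
    logits = h @ W2 + b2
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)          # rows sum to 1

# Tiny example: 4 input features, 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
probs = mlp_forward(rng.normal(size=(5, 4)), W1, b1, W2, b2)
```

Because the weights are plain arrays, serialization is just a matter of pickling (or `np.savez`-ing) the four parameter tensors, which is what makes the educational implementation easy to inspect.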

## Training & Evaluation

  1. `training/train.py` is the CLI entry point that orchestrates data loading, preprocessing (standardization), model training, hold-out validation, and artifact persistence (`models/*.pkl`).
  2. `training/evaluate_and_report.py` loads trained artifacts, runs a hold-out evaluation, saves confusion matrices and ROC curves under `reports/`, and prints summary metrics. (Legacy `REPORT.md` generation can be re-enabled if needed.)
  3. Metrics tracked: accuracy, precision, recall, and F1 (micro/macro). SHAP-style interpretability is planned (see limitations).
  4. Every CLI training run logs hyperparameters, metrics, and serialized artifacts to a local MLflow store (`mlruns/`) under the experiment name `diabetes-risk-local`. Launch the dashboard with `mlflow ui --backend-store-uri mlruns` (tracking URI `file:///.../mlruns`) to inspect experiments without reading the code.
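To make the micro/macro distinction concrete, here is a minimal, dependency-free computation of both F1 averages (not the project's code; `sklearn.metrics.f1_score` provides the same via its `average=` parameter):

```python
from collections import Counter

def f1_scores(y_true, y_pred, classes=(0, 1, 2)):
    """Compute macro- and micro-averaged F1 from scratch."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class_f1 = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Macro averaging weights every class equally, regardless of support.
    macro = sum(per_class_f1) / len(classes)
    # Micro averaging pools all decisions; for single-label classification
    # micro-F1 equals overall accuracy.
    micro = sum(tp.values()) / len(y_true)
    return macro, micro

macro, micro = f1_scores([0, 0, 1, 2, 2, 1], [0, 1, 1, 2, 0, 1])
```

On an imbalanced three-class problem like this one, comparing the two averages is a quick check of whether a model is neglecting the minority classes.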

## Current Results (hold-out, CLI + evaluate_and_report.py)

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| GaussianNB | 0.7013 | 0.6928 | 0.7013 | 0.6951 |
| MLP (sklearn) | 0.7208 | 0.7140 | 0.7208 | 0.7158 |
| MLP (custom) | 0.7403 | 0.7349 | 0.7403 | 0.7364 |

Metrics are synced with `REPORT.md`, generated on 2025-12-23 via `python -m training.evaluate_and_report`.

Each run also writes confusion matrices and ROC curves to `reports/<model>_{confusion_matrix,roc_curve}.png`, so reviewers can inspect the visuals without retraining (regenerate anytime with `python -m training.evaluate_and_report`).

## System Architecture

  1. Training pipeline: generates artifacts (`*.pkl`, scaler) inside `models/`.
  2. Model registry: Flask loads the serialized estimators and scaler on startup.
  3. REST API: `/predict` accepts JSON payloads, validates inputs, applies the scaler, and returns the predicted risk/type.
  4. Frontend UI: static HTML/JS client (`frontend/`) hits the Flask API to let users compare model outputs interactively.

```text
data → training scripts → model artifacts → Flask API → browser UI
```
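The validation step in the `/predict` handler could look roughly like this. The field names match the Dataset section, but the plausibility bounds and helper name are illustrative assumptions, not the repo's actual rules.

```python
REQUIRED_FIELDS = {
    # field: (min, max) plausibility bounds -- illustrative assumptions only
    "Age": (0, 120),
    "Glucose": (0, 500),
    "Insulin": (0, 1000),
    "BMI": (10, 80),
}

def validate_payload(payload):
    """Return (ok, errors) for a /predict JSON payload."""
    errors = []
    for field, (lo, hi) in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
            continue
        try:
            value = float(payload[field])
        except (TypeError, ValueError):
            errors.append(f"{field} is not numeric")
            continue
        if not lo <= value <= hi:
            errors.append(f"{field}={value} outside [{lo}, {hi}]")
    return (not errors, errors)
```

Collecting all errors instead of failing on the first one lets the API return a single 400 response that the frontend can surface field by field.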

## Quickstart

```bash
pip install -r requirements.txt

# Train and persist models (also logs to mlruns/)
python -m training.train --config configs/base.yaml

# Generate evaluation plots (writes PNGs to reports/)
python -m training.evaluate_and_report

# Optional: inspect MLflow dashboard locally
mlflow ui --backend-store-uri mlruns

# Run the Flask API (serves frontend as static files)
python backend/app.py

# Visit the UI
open frontend/index.html  # or navigate to http://127.0.0.1:5000
```

The frontend JavaScript calls the API via same-origin relative paths (`/predict`), so it works unchanged whether served locally, inside Docker, or behind a reverse proxy.
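Outside the browser, the same endpoint can be exercised with a small stdlib-only Python client. The `/predict` path comes from this README; the exact response schema is an assumption, so the sketch just returns the parsed JSON.

```python
import json
from urllib import request

def predict(features, base_url="http://127.0.0.1:5000"):
    """POST a feature dict to /predict and return the parsed JSON response."""
    body = json.dumps(features).encode("utf-8")
    req = request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = {"Age": 45, "Glucose": 130, "Insulin": 85, "BMI": 28.5}
# predict(payload)  # requires the Flask server to be running
```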

## Containerized Run

The curated `.dockerignore` keeps bytecode, MLflow runs, pickled artifacts, and virtualenvs out of the build context, so Docker layers stay lean.

```bash
# Build the production image locally
docker build -t diabetes-risk-prod .

# Run it with Gunicorn listening on $PORT (defaults to 5000)
docker run --rm -p 5000:5000 --env PORT=5000 diabetes-risk-prod

# Or use the provided compose file for repeatable dev/prod parity
docker compose up --build
```

The compose stack exposes the API on `http://127.0.0.1:5000` and serves the static frontend from the same container. Override `PORT` or `FLASK_DEBUG` in `docker-compose.yml` or with `--env` flags if needed.

## Testing

Pytest covers the input validation helpers (`tests/test_input_validation.py`) and the Flask `/predict` endpoint (`tests/test_api.py`).

```bash
python -m pytest
```

## Developer Tooling

  • Install formatter/linter/test extras with `pip install -r requirements-dev.txt` (includes pytest, black, isort, flake8, and pre-commit).
  • Enable Git hooks by running `pre-commit install` once; run them manually anytime with `pre-commit run --all-files`.
  • Black and isort keep the code style consistent, while flake8 catches lint regressions before CI runs.

## Limitations & Next Steps

  • Needs formal data validation (e.g., Great Expectations) and stronger provenance tracking when switching between real and synthetic data.
  • No centralized experiment tracking beyond the local MLflow file store; promotion-ready tracking (e.g., hosted MLflow or W&B) is future work.
  • Interpretability (SHAP, feature importances) and monitoring hooks are not yet implemented.

Planned improvements include structured logging, broader test coverage, and a richer React/Vite frontend with model comparison charts.

## About

End-to-end machine learning system for diabetes risk classification with multiple models, a REST API, a frontend UI, tests, and Dockerized deployment.
