This project started as a college assignment that I developed into a small, end-to-end ML product that trains multiple classifiers, serves them behind a Flask API, and exposes a lightweight browser UI. The goal is to predict a patient's diabetes risk/type from routinely collected clinical signals.
Given features such as age, glucose, insulin, and body mass index, predict whether an individual is non-diabetic, type-1-like, or type-2-like. The system must support experimentation with different model families, provide reproducible training, and expose a low-latency inference API that can be consumed by a web client.
- Source: Pima Indians Diabetes dataset (Kaggle/UCI) or a similarly structured CSV placed at `data/diabetes.csv`. When absent, the pipeline synthesizes a dataset to keep the system runnable.
- Size: ~768 rows in the canonical dataset; the synthetic generator defaults to 800 rows for parity.
- Features: Age (years), Glucose (mg/dL), Insulin (µU/mL), BMI (kg/m²); target column `type` encoded as {0: non-diabetic, 1: type-1-like, 2: type-2-like}.
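The synthetic fallback might look roughly like the sketch below. This is a hedged illustration only: the repo's actual generator may use different distributions, column names, and labeling logic; the heuristic label rule here is invented for the example.

```python
import random

def make_synthetic_rows(n=800, seed=42):
    """Generate n synthetic patient rows matching the dataset schema."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(21, 80)
        glucose = rng.gauss(120, 30)            # mg/dL
        insulin = max(0.0, rng.gauss(80, 40))   # µU/mL
        bmi = rng.gauss(32, 6)                  # kg/m²
        # Crude, purely illustrative label rule: elevated glucose with low
        # insulin -> type-1-like; elevated glucose otherwise -> type-2-like.
        if glucose > 140 and insulin < 50:
            label = 1
        elif glucose > 140:
            label = 2
        else:
            label = 0
        rows.append({"Age": age, "Glucose": round(glucose, 1),
                     "Insulin": round(insulin, 1), "BMI": round(bmi, 1),
                     "type": label})
    return rows

rows = make_synthetic_rows()
```

Seeding the generator keeps synthetic runs reproducible, mirroring the reproducible-training goal stated above.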
| Model | Rationale |
|---|---|
| Gaussian Naive Bayes | Fast baseline, probabilistic outputs, good for imbalanced small datasets. |
| MLPClassifier (sklearn) | Modern non-linear baseline with built-in L2 regularization and adaptive gradient-based solvers. |
| Custom two-layer MLP | Educational implementation that exposes weight serialization and manual training loops. |
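As a sketch of what the custom two-layer MLP's forward pass and weight serialization could look like (shapes, names, and the pickle format here are illustrative, not the repo's actual implementation):

```python
import pickle
import numpy as np

rng = np.random.default_rng(0)

class TinyMLP:
    """Two-layer MLP: 4 inputs -> ReLU hidden layer -> 3-class softmax."""

    def __init__(self, n_in=4, n_hidden=8, n_out=3):
        self.w1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.maximum(0, x @ self.w1 + self.b1)           # ReLU hidden layer
        logits = h @ self.w2 + self.b2
        exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)       # softmax probabilities

    def save(self, path):
        # Manual weight serialization -- the part sklearn hides from you
        with open(path, "wb") as f:
            pickle.dump({"w1": self.w1, "b1": self.b1,
                         "w2": self.w2, "b2": self.b2}, f)

model = TinyMLP()
probs = model.forward(np.array([[45.0, 120.0, 80.0, 28.5]]))  # Age, Glucose, Insulin, BMI
```

Exposing the weights directly is what makes the custom model useful for teaching: the training loop, gradients, and persistence are all visible rather than hidden behind an estimator API.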
- `training/train.py` is the CLI entry point that orchestrates data loading, preprocessing (standardization), model training, hold-out validation, and artifact persistence (`models/*.pkl`).
- `training/evaluate_and_report.py` loads trained artifacts, runs a hold-out evaluation, saves confusion matrices and ROC curves under `reports/`, and prints summary metrics. (Legacy REPORT.md generation can be re-enabled if needed.)
- Metrics tracked: accuracy, precision, recall, F1 (micro/macro). SHAP-style interpretability is planned (see limitations).
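Micro- and macro-averaged F1 can diverge sharply on imbalanced data, which is why both are tracked. A small pure-Python example (toy labels, not project data) shows the difference:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels=(0, 1, 2)):
    """Compute micro- and macro-averaged F1 from scratch."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Macro: per-class F1, then an unweighted mean over classes
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(labels)
    # Micro: pool TP/FP/FN across classes first
    # (equals accuracy in single-label classification)
    tps, fps, fns = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * tps / (2 * tps + fps + fns)
    return micro, macro

# Imbalanced toy split: class 0 dominates, the lone class-1 case is missed
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 8 + [0] + [2]
micro, macro = f1_scores(y_true, y_pred)   # micro = 0.9, macro ≈ 0.647
```

Micro-F1 barely notices the missed minority class; macro-F1 penalizes it heavily, so reporting both guards against a model that only learns the majority class.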
- Every CLI training run logs hyperparameters, metrics, and serialized artifacts to a local MLflow store (`mlruns/`) under the experiment name `diabetes-risk-local`. Launch the dashboard with `mlflow ui --backend-store-uri mlruns` (tracking URI `file:///.../mlruns`) to inspect experiments without reading the code.
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| GaussianNB | 0.7013 | 0.6928 | 0.7013 | 0.6951 |
| MLP (sklearn) | 0.7208 | 0.7140 | 0.7208 | 0.7158 |
| MLP (custom) | 0.7403 | 0.7349 | 0.7403 | 0.7364 |
Metrics synced with REPORT.md generated on 2025-12-23 via `python -m training.evaluate_and_report`.
Each run also drops confusion matrices and ROC curves to `reports/<model>_{confusion_matrix,roc_curve}.png`, so reviewers can inspect the visuals without re-training (regenerate anytime with `python -m training.evaluate_and_report`).
- Training pipeline: generates artifacts (`*.pkl`, scaler) inside `models/`.
- Model registry: Flask loads the serialized estimators and scaler on startup.
- REST API: `/predict` accepts JSON payloads, validates inputs, applies the scaler, and returns the predicted risk/type.
- Frontend UI: static HTML/JS client (`frontend/`) hits the Flask API to let users compare model outputs interactively.
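A minimal sketch of the kind of input validation `/predict` might perform. The field names match the dataset schema, but the plausibility bounds and error format here are assumptions, not the repo's exact rules:

```python
REQUIRED_FIELDS = {
    # field: (min, max) plausibility bounds -- illustrative, not clinical guidance
    "Age": (0, 120),
    "Glucose": (0, 500),
    "Insulin": (0, 1000),
    "BMI": (10, 80),
}

def validate_payload(payload):
    """Return (features, errors): a clean feature vector, or a list of problems."""
    errors, features = [], []
    for field, (lo, hi) in REQUIRED_FIELDS.items():
        value = payload.get(field)
        if value is None:
            errors.append(f"missing field: {field}")
            continue
        try:
            value = float(value)
        except (TypeError, ValueError):
            errors.append(f"non-numeric value for {field}")
            continue
        if not lo <= value <= hi:
            errors.append(f"{field} out of range [{lo}, {hi}]")
            continue
        features.append(value)
    return (features if not errors else None), errors
```

Rejecting bad input before it reaches the scaler keeps the API from returning confident predictions on garbage values.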
data → training scripts → model artifacts → Flask API → browser UI
```bash
pip install -r requirements.txt

# Train and persist models (also logs to mlruns/)
python -m training.train --config configs/base.yaml

# Generate evaluation plots (writes PNGs to reports/)
python -m training.evaluate_and_report

# Optional: inspect MLflow dashboard locally
mlflow ui --backend-store-uri mlruns

# Run the Flask API (serves frontend as static files)
python backend/app.py

# Visit the UI
open frontend/index.html  # or navigate to http://127.0.0.1:5000
```

The frontend JavaScript calls the API via same-origin relative paths (`/predict`), so it works unchanged whether served locally, inside Docker, or behind a reverse proxy.
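For reference, a hypothetical Python client for `/predict` (the field names are assumptions, and the API must actually be running for the final commented-out call to succeed):

```python
import json
import urllib.request

payload = {"Age": 54, "Glucose": 148, "Insulin": 94, "BMI": 33.6}

req = urllib.request.Request(
    "http://127.0.0.1:5000/predict",
    data=json.dumps(payload).encode("utf-8"),      # a body makes this a POST
    headers={"Content-Type": "application/json"},
)
# With the API up: urllib.request.urlopen(req).read() returns the prediction JSON
```

Stdlib `urllib` is used here only to keep the sketch dependency-free; `curl` or the browser UI exercise the same endpoint.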
The curated `.dockerignore` keeps bytecode, MLflow runs, pickled artifacts, and virtualenvs out of the build context, so Docker layers stay lean.
```bash
# Build the production image locally
docker build -t diabetes-risk-prod .

# Run it with Gunicorn listening on $PORT (defaults to 5000)
docker run --rm -p 5000:5000 --env PORT=5000 diabetes-risk-prod

# Or rely on the provided compose file for repeatable dev/prod parity
docker compose up --build
```

The compose stack exposes the API on http://127.0.0.1:5000 and serves the static frontend from the same container. Override `PORT` or `FLASK_DEBUG` in `docker-compose.yml` or with `--env` flags if needed.
Pytest covers the input validation helpers (`tests/test_input_validation.py`) and the Flask `/predict` endpoint (`tests/test_api.py`).
```bash
python -m pytest
```

- Install formatter/linter/test extras with `pip install -r requirements-dev.txt` (includes `pytest`, `black`, `isort`, `flake8`, and `pre-commit`).
- Enable Git hooks by running `pre-commit install` once; enforce them manually anytime with `pre-commit run --all-files`.
- Black + isort keep the code style consistent, while flake8 prevents lint regressions before CI even runs.
- Needs formal data validation (Great Expectations) and stronger provenance tracking when switching between real vs synthetic data.
- No centralized experiment tracking backend beyond the local MLflow file store; promotion-ready tracking (e.g., hosted MLflow/W&B) is still future work.
- Interpretability (SHAP, feature importances) and monitoring hooks are not yet implemented.
Planned improvements include structured logging, broader automated test coverage, and a richer React/Vite frontend with model comparison charts.