This repository contains a Jupyter Notebook implementation of a supervised machine learning pipeline that predicts whether a woman has diabetes. The project includes data loading, cleaning, exploratory data analysis (EDA), model training, evaluation, and model export steps.
- notebook.ipynb - Main Jupyter Notebook with the full experiment (data processing, modeling, evaluation).
- data/ (optional) - Place dataset files here if you keep them in the repo.
The goal of this project is to build a binary classification model that predicts diabetes (positive/negative) for female patients from clinical features. A typical dataset for this task is the Pima Indians Diabetes dataset; if it is not included in the repository, you can download it from public sources.
Key steps implemented in the notebook:
- Data loading and basic validation
- Exploratory data analysis (visualizations & summary statistics)
- Data preprocessing (imputation, scaling, encoding if needed)
- Feature selection / engineering
- Model training (e.g., Logistic Regression, Random Forest, XGBoost)
- Evaluation using metrics such as accuracy, precision, recall, F1-score, and ROC AUC
- Model serialization/export (joblib/pickle)
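
The preprocessing and training steps above could be chained roughly as follows. This is a minimal sketch, not the notebook's actual code. It assumes scikit-learn, uses median imputation, standard scaling, and logistic regression as stand-ins, and trains on synthetic data from `make_classification` instead of the real dataset. The names `build_pipeline` and `RANDOM_SEED` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 42  # hypothetical value; any fixed integer works


def build_pipeline() -> Pipeline:
    """Chain imputation, scaling, and a classifier into one estimator."""
    return Pipeline(
        steps=[
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
            ("model", LogisticRegression(max_iter=1000, random_state=RANDOM_SEED)),
        ]
    )


# Synthetic stand-in for the 8 clinical features; NOT real patient data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=RANDOM_SEED
)

# Knock out a few values to mimic missing measurements.
rng = np.random.default_rng(RANDOM_SEED)
X[rng.random(X.shape) < 0.05] = np.nan

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED
)

pipeline = build_pipeline()
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

Fitting the imputer and scaler inside the pipeline means they are fit on the training split only, which avoids leaking test-set statistics into training.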
Prerequisites: Python 3.8+ and the packages listed in the notebook (commonly: pandas, numpy, scikit-learn, matplotlib, seaborn, xgboost, joblib).
- Clone the repository:

      git clone https://github.com/joaogcfa/Machine-Learnig-Project.git
      cd Machine-Learnig-Project
- (Optional) Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate   # Linux / macOS
      venv\Scripts\activate      # Windows
- Install dependencies (example):

      pip install -r requirements.txt

  If requirements.txt is not present, install the commonly used packages:

      pip install pandas numpy scikit-learn matplotlib seaborn xgboost joblib notebook
- Open the notebook:

      jupyter notebook notebook.ipynb
If the dataset is not included in this repository, you can download the Pima Indians Diabetes dataset from the UCI Machine Learning Repository or Kaggle. Place the dataset in the data/ folder, or update the paths in the notebook accordingly.
Typical feature columns include:
- Pregnancies
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI
- DiabetesPedigreeFunction
- Age
- Outcome (target: 0 = no diabetes, 1 = diabetes)
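
Loading the data and checking it against the column list above might look like this. It is a sketch under assumptions: the function name `load_and_validate` is hypothetical, the CSV is assumed to have a header row with exactly these column names, and the two rows below are made up purely to exercise the function.

```python
import io

import pandas as pd

# Column names taken from the list above.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_and_validate(source) -> pd.DataFrame:
    """Read a CSV (path or file-like object) and run basic sanity checks."""
    df = pd.read_csv(source)

    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")

    # The target must be binary: 0 = no diabetes, 1 = diabetes.
    unexpected = set(df["Outcome"].unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"Unexpected Outcome values: {sorted(unexpected)}")

    return df[EXPECTED_COLUMNS]


# Two made-up rows used only to exercise the function; NOT real patient data.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,25,80,28.5,0.45,33,0\n"
    "5,160,82,30,150,34.1,0.62,51,1\n"
)
df = load_and_validate(sample_csv)

# Split into the feature matrix and the target vector.
X = df.drop(columns="Outcome")
y = df["Outcome"]
print(X.shape, y.tolist())
```

With a real file, the call would take a path instead, for example `load_and_validate("data/diabetes.csv")`, where the file name is a placeholder for whatever you saved the dataset as.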
See the notebook for model comparisons and evaluation metrics. Commonly reported metrics:
- Accuracy
- Precision / Recall / F1-score
- ROC AUC
- Confusion matrix visualization
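
The metrics listed above can be computed with `sklearn.metrics`. The sketch below uses made-up labels and scores for illustration only, and the 0.5 decision threshold is an assumption; the notebook may use different values.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.60, 0.70, 0.90])

# Threshold-based metrics need hard labels; 0.5 is an assumed cut-off.
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC AUC is computed from the scores, not the thresholded labels.
    "roc_auc": roc_auc_score(y_true, y_score),
}

# Rows are the true class, columns are the predicted class.
cm = confusion_matrix(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(cm)
```

For the visualization, scikit-learn's `ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` can plot the same matrix with matplotlib.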
- Set a random seed in the notebook to make experiments reproducible.
- Save trained model artifacts (e.g., model.joblib) and preprocessing pipelines.
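
Both tips can be combined in a few lines. The sketch below is illustrative only: `RANDOM_SEED` and its value are hypothetical, the data is synthetic, the random forest stands in for whichever model the notebook selects, and the file is written to a temporary directory so that the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 42  # hypothetical value; fix it once and reuse it everywhere

# Synthetic stand-in data; NOT the real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=RANDOM_SEED)

# Keeping preprocessing and model in one object means a single file restores both.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", RandomForestClassifier(n_estimators=50, random_state=RANDOM_SEED)),
    ]
)
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions exactly.
predictions_match = np.array_equal(pipeline.predict(X), restored.predict(X))
print(f"Predictions match after reload: {predictions_match}")
```

In the notebook you would write to a persistent path instead of a temporary directory. Joblib files are pickle-based, so load them only from sources you trust and with compatible library versions.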
Repository owner: joaogcfa (https://github.com/joaogcfa)