🧠 ML Model Deployment - HR Analytics Job Change Predictor

A production-ready Flask web application that predicts whether a data scientist will stay with a company or leave. It covers model training, evaluation, and interactive predictions, with data balancing techniques to handle the imbalanced classes.

📑 Table of Contents

Overview
Features
Tech Stack
Project Structure
Installation
Usage
Machine Learning Models
Dataset
Results & Performance
API Endpoints
Deployment
License

📊 Overview

This project builds a predictive model to determine whether data scientists will remain with their current employer or leave for better opportunities. The application provides:

Comparison of multiple ML algorithms
Data balancing techniques (oversampling, undersampling)
Interactive training interface for experimentation
Real-time predictions on new employee data
Detailed evaluation metrics and classification reports

Business Value: HR departments can identify at-risk employees and implement retention strategies.

✨ Features

🤖 Machine Learning Capabilities

Feature	Description
Multiple Algorithms	Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM)
Data Balancing	Handle imbalanced classes with oversampling, undersampling, SMOTE
Cross-Validation	K-Fold validation for robust model evaluation
Hyperparameter Tuning	GridSearchCV for optimal parameters
Model Persistence	Save/load trained models with joblib
Feature Scaling	StandardScaler for optimal algorithm performance
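
The scaling, cross-validation, tuning, and persistence capabilities listed above could fit together roughly as follows. This is a minimal sketch on synthetic data, not the project's actual `train.py`: the parameter grid, the `lr_pipeline.joblib` file name, and the use of `make_classification` are all illustrative assumptions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR data (assumption): roughly a 75/25 class split.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Hypothetical grid: only the regularization strength C is tuned here.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
)
search.fit(X, y)

# Persist the best pipeline (scaler and model together), then load it back.
model_path = os.path.join(tempfile.mkdtemp(), "lr_pipeline.joblib")
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)

print("best params:", search.best_params_)
print("best CV F1:", round(search.best_score_, 4))
```

Saving the whole pipeline rather than the bare classifier keeps the fitted scaler attached to the model, so prediction code cannot accidentally skip the scaling step.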

📈 Analysis & Reporting

Feature	Description
Classification Metrics	Precision, Recall, F1-Score, Accuracy, AUC-ROC
Confusion Matrix	Plot of predicted vs. actual classes
Train/Test Scores	Detailed performance on training and test sets
Classification Report	Per-class precision, recall, F1-score
Feature Importance	Identify most influential features
ROC Curves	Receiver Operating Characteristic analysis

🎯 Prediction Features

Feature	Description
Batch Predictions	Predict on multiple employees at once
Confidence Scores	Probability of staying vs leaving
Feature-wise Explanation	Understand prediction reasoning
Historical Comparisons	Track prediction accuracy over time

🖥️ User Interface

Feature	Description
Interactive Dashboard	Real-time model performance visualization
Model Comparison	Compare different algorithms side-by-side
Training History	Track all trained models and their metrics
Download Reports	Export predictions and analysis as CSV/PDF

🛠️ Tech Stack

Component	Technology
Backend	Python 3.8+, Flask 2.0
ML Libraries	scikit-learn, XGBoost, LightGBM
Data Processing	Pandas, NumPy
Visualization	Matplotlib, Seaborn, Plotly
Model Storage	joblib
Frontend	HTML5, CSS3, JavaScript, Bootstrap
Deployment	Gunicorn, Docker

📂 Project Structure

ml-model-deployment/
├── main.py                      # Flask application entry point
├── train.py                     # Model training logic
├── predict.py                   # Prediction logic
├── evaluate.py                  # Model evaluation
├── data_processor.py            # Data loading & preprocessing
├── config.py                    # Configuration settings
│
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container configuration
├── docker-compose.yml           # Multi-container setup
│
├── data/                        # Training datasets
│   ├── normal_data.csv         # Original (imbalanced) data
│   ├── oversample.csv          # Oversampled data
│   └── undersample_data.csv    # Undersampled data
│
├── models/                      # Saved trained models
│   ├── lr_model.pkl            # Logistic Regression
│   ├── knn_model.pkl           # KNN model
│   ├── svm_model.pkl           # SVM model
│   └── scalers/                # Feature scalers
│
├── templates/                   # HTML templates
│   ├── base.html               # Base template
│   ├── index.html              # Home page
│   ├── train.html              # Training interface
│   ├── predict.html            # Prediction interface
│   ├── results.html            # Results display
│   └── dashboard.html          # Analytics dashboard
│
├── static/                      # Static files
│   ├── css/
│   │   ├── style.css           # Custom styling
│   │   └── bootstrap.min.css
│   ├── js/
│   │   ├── script.js           # Client-side logic
│   │   └── charts.js           # Chart generation
│   └── images/                 # UI images
│
└── README.md                    # This file

🚀 Installation

Prerequisites

Python 3.8 or higher
pip (Python package manager)
Virtual environment (recommended)
2GB RAM minimum

Step-by-Step Setup

Clone repository:

git clone https://github.com/Ahmed122000/ML_model_deployment.git
cd ML_model_deployment

Create virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Prepare datasets:

# Ensure these files exist in data/ directory:
# - normal_data.csv
# - oversample.csv
# - undersample_data.csv

Run application:
```
python main.py
```
Access application:
```
http://localhost:5000
```

💡 Usage

Web Interface Navigation

1️⃣ Home Page

Project overview
Quick links to train/predict
Model statistics

2️⃣ Training Models

Steps:

1. Navigate to "Train Models" tab
2. Select Dataset:
   - Normal (original data)
   - Oversampled (more minority class samples)
   - Undersampled (fewer majority class samples)
3. Choose Algorithm:
   - Logistic Regression
   - K-Nearest Neighbors (KNN)
   - Support Vector Machine (SVM)
4. Optional: Adjust hyperparameters
5. Click "Train Model"
6. View results:
   - Train/Test scores
   - Classification report
   - Confusion matrix
   - Feature importance

Training Output:

Model: Logistic Regression
Dataset: Oversampled
Train Score: 0.8245
Test Score: 0.7893
Precision: 0.8102
Recall: 0.7654
F1-Score: 0.7873

3️⃣ Making Predictions

Steps:

1. Navigate to "Predict" tab
2. Fill employee information:
   - City Development Index (0.0 - 1.0)
   - Gender (M/F)
   - Relevant Experience (Yes/No)
   - Enrolled in University (Yes/No)
   - Education Level (High School/Bachelor/Master/PhD)
   - Major Discipline
   - Experience (years)
   - Company Size
   - Company Type
   - Last New Job (years)
   - Training Hours
3. Click "Predict"
4. View prediction result:
   - Will Stay or Will Leave
   - Confidence percentage
   - Feature contributions
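
Before form values reach the model, it helps to reject inputs outside the ranges listed above. The sketch below shows one way to do that; `validate_employee` is a hypothetical helper (not taken from the project's `predict.py`) and it only checks a subset of the fields.

```python
from typing import Any, Dict, List

# Allowed values come from the form description above; the validator itself
# is a hypothetical helper, not part of the project's code.
YES_NO = {"Yes", "No"}
EDUCATION_LEVELS = {"High School", "Bachelor", "Master", "PhD"}


def validate_employee(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors: List[str] = []

    cdi = payload.get("city_development_index")
    is_number = isinstance(cdi, (int, float)) and not isinstance(cdi, bool)
    if not is_number or not 0.0 <= cdi <= 1.0:
        errors.append("city_development_index must be a number in [0.0, 1.0]")

    if payload.get("gender") not in {"M", "F"}:
        errors.append("gender must be 'M' or 'F'")

    if payload.get("relevant_experience") not in YES_NO:
        errors.append("relevant_experience must be 'Yes' or 'No'")

    if payload.get("education_level") not in EDUCATION_LEVELS:
        errors.append("education_level is not one of the listed options")

    for field in ("experience", "training_hours"):
        value = payload.get(field)
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{field} must be a non-negative integer")

    return errors


good = {
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "education_level": "Master",
    "experience": 3,
    "training_hours": 40,
}
print(validate_employee(good))                                       # []
print(validate_employee(dict(good, city_development_index=1.5)))    # one error
```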

4️⃣ Dashboard

Compare all trained models
View training history
Analyze feature importance across models
Export reports

🤖 Machine Learning Models

1. Logistic Regression

When to use: Baseline model, interpretable results

Parameters:

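from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,            # upper bound on solver iterations
    random_state=42,          # reproducible results
    class_weight='balanced'   # weight classes inversely to their frequency
)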

Pros:

Fast training
Highly interpretable
Good for linearly separable data

Cons:

Assumes linear relationship
Less effective with complex patterns

2. K-Nearest Neighbors (KNN)

When to use: Non-linear patterns, small-medium datasets

Parameters:

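from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(
    n_neighbors=5,         # vote among the 5 nearest training samples
    weights='distance',    # closer neighbors get larger votes
    metric='euclidean'     # straight-line distance in scaled feature space
)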

Pros:

Captures non-linear patterns
Negligible training phase (lazy learner: fitting mostly stores the data)
Effective for local patterns

Cons:

Slow prediction time
Sensitive to feature scaling
Memory intensive
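
The effect of `weights='distance'` is easiest to see in a small from-scratch version. The sketch below is a simplified illustration of inverse-distance voting, not scikit-learn's implementation; the function name and the toy points are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 5) -> int:
    """Classify `query` by an inverse-distance-weighted vote of its k nearest samples."""
    # Sort training samples by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(
        ((math.dist(features, query), label) for features, label in train),
        key=lambda pair: pair[0],
    )[:k]

    # An exact match (distance 0) decides the label and avoids division by zero.
    for distance, label in neighbors:
        if distance == 0.0:
            return label

    votes: Dict[int, float] = defaultdict(float)
    for distance, label in neighbors:
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)


# Two class-0 points sit close to the query; three class-1 points are far away.
train = [
    ((0.0, 0.0), 0), ((0.1, 0.0), 0),
    ((1.0, 1.0), 1), ((1.1, 1.0), 1), ((0.9, 1.0), 1),
]
print(knn_predict(train, (0.2, 0.1), k=5))  # 0
```

A plain majority vote over the same five neighbors would pick class 1 (three votes to two); distance weighting lets the two nearby points win.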

3. Support Vector Machine (SVM)

When to use: High-dimensional data, maximum margin classification

Parameters:

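from sklearn.svm import SVC

model = SVC(
    kernel='rbf',        # Gaussian radial basis function kernel
    C=1.0,               # regularization (smaller C = stronger regularization)
    gamma='scale',       # 1 / (n_features * X.var())
    probability=True,    # enable predict_proba (makes fitting slower)
    random_state=42      # seed for the probability estimates
)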

Pros:

Effective in high dimensions
Robust to outliers
Strong theoretical foundation

Cons:

Slower training
Requires feature scaling
Hard to interpret
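
For the `kernel='rbf'` and `gamma='scale'` settings above, the kernel value is `exp(-gamma * ||a - b||^2)` and scikit-learn documents the `'scale'` heuristic as `1 / (n_features * X.var())`. A small standard-library sketch of both, using a made-up 4x2 matrix:

```python
import math
import statistics
from typing import List, Sequence


def scale_gamma(X: List[Sequence[float]]) -> float:
    """gamma='scale' heuristic: 1 / (n_features * variance of all entries of X)."""
    n_features = len(X[0])
    all_values = [value for row in X for value in row]
    # Population variance matches NumPy's default X.var() (ddof=0).
    return 1.0 / (n_features * statistics.pvariance(all_values))


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


X = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
gamma = scale_gamma(X)
print("gamma:", gamma)
print("K(x0, x0):", rbf_kernel(X[0], X[0], gamma))
print("K(x0, x2):", rbf_kernel(X[0], X[2], gamma))
```

Because `gamma` depends on the variance of the inputs, unscaled features with large ranges shrink it and flatten the kernel, which is one reason the SVM needs feature scaling.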

Data Balancing Techniques

Original Distribution

Staying: 75% (majority)
Leaving: 25% (minority)

Oversampling

Randomly duplicate minority class samples
Result: balanced 50% vs 50% distribution (the minority class grows to match the majority class count)

Undersampling

Randomly remove majority class samples
Result: balanced 50% vs 50% distribution (the majority class shrinks to match the minority class count)
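
Both techniques can be sketched with the standard library alone. The functions below are illustrative (the names and the row layout are assumptions, and the project may use a library such as imbalanced-learn instead); they show what random duplication and random removal do to the class counts.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[Dict[str, float], int]  # (features, target); 0 = stays, 1 = leaves


def split_by_class(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (majority, minority) row lists for a binary target."""
    zeros = [row for row in rows if row[1] == 0]
    ones = [row for row in rows if row[1] == 1]
    return (zeros, ones) if len(zeros) >= len(ones) else (ones, zeros)


def random_oversample(rows: List[Row], seed: int = 42) -> List[Row]:
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    majority, minority = split_by_class(rows)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def random_undersample(rows: List[Row], seed: int = 42) -> List[Row]:
    """Keep a random subset of majority rows the same size as the minority."""
    rng = random.Random(seed)
    majority, minority = split_by_class(rows)
    return rng.sample(majority, k=len(minority)) + minority


# 75 stayers and 25 leavers, mirroring the original distribution.
rows: List[Row] = [({"training_hours": float(i)}, 0) for i in range(75)]
rows += [({"training_hours": float(i)}, 1) for i in range(25)]

print(Counter(label for _, label in random_oversample(rows)))
print(Counter(label for _, label in random_undersample(rows)))
```

Oversampling ends with 75 rows per class (150 total) and undersampling with 25 per class (50 total). To keep evaluation honest, resampling is normally applied to the training split only.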

📊 Dataset

Features (11 input features plus the target)

Feature	Type	Range/Values	Description
city_development_index	float	0.0 - 1.0	City development level
gender	categorical	M/F	Employee gender
relevant_experience	binary	Yes/No	Has relevant experience
enrolled_university	categorical	Full-time/Part-time/No	University enrollment
education_level	categorical	HS/Bachelor/Master/PhD	Highest education
major_discipline	categorical	STEM/Business/Humanities	Field of study
experience	integer	0-50	Years of experience
company_size	categorical	Startup/MNC/Unicorn	Company size
company_type	categorical	IT/Service/Healthcare	Industry type
last_new_job	integer	0-5	Years between the previous and current job
training_hours	integer	0-500	Professional training hours
target	binary	0/1	0=Stays, 1=Leaves

Dataset Size

Total Records: 19,158 employees
Training Set: 70% (13,410 records)
Test Set: 30% (5,748 records)
Missing Values: < 2% (handled)
Class Imbalance: 75% vs 25%
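
The record counts above follow from a 70/30 split if the test share is rounded up and the remainder goes to training. That is also how scikit-learn's `train_test_split` resolves a fractional `test_size`, although whether the project used that function is an assumption.

```python
import math

total_records = 19_158
test_fraction = 0.30

# Round the test share up and give the remainder to training.
test_size = math.ceil(total_records * test_fraction)
train_size = total_records - test_size

print(train_size, test_size)  # 13410 5748
```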

Data Preprocessing

# Steps applied:
1. Load CSV data
2. Handle missing values (mean/mode imputation)
3. Encode categorical variables (LabelEncoder)
4. Scale numerical features (StandardScaler)
5. Split train/test (70/30, as listed under Dataset Size)
6. Handle class imbalance (oversample/undersample)
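
The imputation, encoding, scaling, and splitting steps can be sketched with scikit-learn as shown below. This is not the project's `data_processor.py`, and it deliberately differs from the listed steps in two ways: it uses `OrdinalEncoder` (scikit-learn intends `LabelEncoder` for target labels rather than feature columns), and it splits before fitting the imputers and scaler so test rows do not influence them. The tiny DataFrame is made up, and it uses a 25% test share only because it has eight rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up rows using a few column names from the feature table.
df = pd.DataFrame({
    "city_development_index": [0.92, 0.62, np.nan, 0.78, 0.55, 0.91, 0.70, 0.80],
    "training_hours": [40, 120, 15, np.nan, 60, 8, 200, 33],
    "gender": ["M", "F", np.nan, "M", "F", "M", "M", "F"],
    "relevant_experience": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
    "target": [0, 1, 0, 0, 1, 0, 1, 0],
})
numeric_cols = ["city_development_index", "training_hours"]
categorical_cols = ["gender", "relevant_experience"]

# Numeric columns: mean imputation, then standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: mode imputation, then integer codes.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit on the training split only, then reuse the fitted transformers.
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```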

📈 Results & Performance

Model Comparison (on test set)

Metric	Logistic Regression	KNN (k=5)	SVM (RBF)
Accuracy	78.23%	76.45%	79.12%
Precision	0.7891	0.7654	0.8023
Recall	0.7456	0.7234	0.7789
F1-Score	0.7667	0.7440	0.7904
AUC-ROC	0.8234	0.8012	0.8456
Training Time	2.3s	0.5s	45.2s
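
The F1-Score row can be sanity-checked against the Precision and Recall rows, since F1 is their harmonic mean. A short worked example using the table's own numbers:

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table above.
reported = {
    "Logistic Regression": (0.7891, 0.7456),
    "KNN (k=5)": (0.7654, 0.7234),
    "SVM (RBF)": (0.8023, 0.7789),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {harmonic_f1(precision, recall):.4f}")
```

The recomputed values land within about 0.0002 of the table. Small gaps like that are expected when inputs are rounded or when the reported F1 is averaged per class.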

Best Performing Model: SVM

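Highest accuracy (79.12%), F1-score (0.7904), and AUC-ROC (0.8456) of the three models
Good balance between precision (0.8023) and recall (0.7789)
Acceptable training time, though the slowest of the three at 45.2s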

🔌 API Endpoints

Flask Routes

Endpoint	Method	Purpose
`/`	GET	Home page
`/train`	GET, POST	Training interface
`/predict`	GET, POST	Prediction interface
`/results`	GET	View training results
`/dashboard`	GET	Analytics dashboard
`/api/train-model`	POST	Train model (JSON API)
`/api/predict`	POST	Make prediction (JSON API)
`/api/models`	GET	List trained models
`/api/export`	GET	Export results as CSV
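
A JSON route such as `/api/predict` could be wired up in Flask along these lines. Only the route path and method come from the table above; the required-field list, the response keys, and the `predict_leave_probability` placeholder are hypothetical and stand in for the real model call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical minimum set of fields for this sketch.
REQUIRED_FIELDS = ("city_development_index", "experience", "training_hours")


def predict_leave_probability(features: dict) -> float:
    """Placeholder for the real model call (hypothetical): returns a fixed value."""
    return 0.5


@app.route("/api/predict", methods=["POST"])
def api_predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify(error="Request body must be a JSON object"), 400

    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        return jsonify(error="Missing fields", fields=missing), 400

    probability = predict_leave_probability(payload)
    return jsonify(
        prediction="Will Leave" if probability >= 0.5 else "Will Stay",
        leave_probability=probability,
    )


# Exercise the route with Flask's test client, without starting a server.
client = app.test_client()
response = client.post("/api/predict", json={
    "city_development_index": 0.92, "experience": 3, "training_hours": 40,
})
print(response.status_code, response.get_json())
```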

API Examples

Train Model:

curl -X POST http://localhost:5000/api/train-model \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "svm",
    "dataset": "oversample"
  }'

Make Prediction:

curl -X POST http://localhost:5000/api/predict \
  -H "Content-Type: application/json" \
  -d '{
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40
  }'
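
The same prediction request can be built from Python with the standard library. This sketch mirrors the curl call above; `build_json_post` is a hypothetical helper, and the actual send is left commented out because it needs the app running locally.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address used in the curl examples


def build_json_post(path: str, payload: dict) -> urllib.request.Request:
    """Create (but do not send) a JSON POST request for the given API path."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


api_request = build_json_post("/api/predict", {
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40,
})
print(api_request.get_method(), api_request.full_url)

# Sending requires the Flask app to be running locally:
# with urllib.request.urlopen(api_request, timeout=10) as response:
#     print(json.load(response))
```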

🐳 Deployment

Docker Setup

Build image:
```
docker build -t ml-predictor:latest .
```

Run container:

docker run -p 5000:5000 ml-predictor:latest

Using Docker Compose:
```
docker-compose up
```

Production Deployment

Using Gunicorn:

gunicorn --workers 4 --bind 0.0.0.0:5000 main:app
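
The same settings can live in a Gunicorn config file, which is plain Python. The file name and the timeout value below are assumptions. The timeout is raised because the SVM training time reported in Results & Performance (45.2s) is longer than Gunicorn's default 30-second worker timeout, so a training request could otherwise be cut off.

```python
# gunicorn.conf.py (hypothetical file; load it with: gunicorn -c gunicorn.conf.py main:app)

bind = "0.0.0.0:5000"   # same address as the command-line example
workers = 4             # same worker count as the command-line example
timeout = 120           # seconds; assumed headroom for slow training requests
```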

On Heroku:

heroku login
heroku create ml-predictor
git push heroku main

🧪 Testing

Run Tests

python -m pytest tests/

Test Coverage

Unit tests for model training
Integration tests for API endpoints
Data preprocessing tests
Prediction accuracy tests

🐛 Troubleshooting

Issue: "ModuleNotFoundError"

Solution: Install requirements

pip install -r requirements.txt

Issue: "FileNotFoundError: data files"

Solution: Ensure CSV files exist in data/ directory

Issue: "Port 5000 already in use"

Solution: Use different port

python main.py --port 5001
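
The `--port` flag only works if `main.py` parses it, which this README does not confirm. If it does not, a minimal sketch of how the option could be added with `argparse` follows; the `parse_args` helper and the `--host` option are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command-line options for the web app."""
    parser = argparse.ArgumentParser(description="Run the HR analytics web app")
    parser.add_argument("--host", default="127.0.0.1", help="interface to bind to")
    parser.add_argument("--port", type=int, default=5000, help="port to listen on")
    return parser.parse_args(argv)


args = parse_args(["--port", "5001"])
print(args.host, args.port)  # 127.0.0.1 5001

# In main.py the parsed values would feed Flask's development server:
# app.run(host=args.host, port=args.port)
```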

📈 Future Enhancements

📝 Contributing

1. Fork repository
2. Create feature branch (`git checkout -b feature/improvement`)
3. Commit changes (`git commit -m 'Add improvement'`)
4. Push to branch (`git push origin feature/improvement`)
5. Open Pull Request

📄 License

This project is licensed under the MIT License - see LICENSE for details.

🙏 Acknowledgments

Kaggle - HR Analytics dataset
scikit-learn - ML algorithms
Flask - Web framework
Pandas - Data processing

👨‍💻 Author

Ahmed Hesham - @Ahmed122000

Built with ❤️ for HR Analytics & ML Deployment
