Ahmed122000/ML_model_deployment

🧠 ML Model Deployment - HR Analytics Job Change Predictor

A production-ready Flask web application that predicts whether a data scientist will stay with a company or leave. It supports model training, evaluation, and interactive predictions, with optional data balancing of the training set.


📊 Overview

This project builds a predictive model to determine whether data scientists will remain with their current employer or leave for better opportunities. The application provides:

  • Multiple ML algorithms comparison
  • Data balancing techniques (oversampling, undersampling)
  • Interactive training interface for experimentation
  • Real-time predictions on new employee data
  • Detailed evaluation metrics and classification reports

Business Value: HR departments can identify at-risk employees and implement retention strategies.


✨ Features

🤖 Machine Learning Capabilities

| Feature | Description |
| --- | --- |
| Multiple Algorithms | Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) |
| Data Balancing | Handle imbalanced classes with oversampling, undersampling, SMOTE |
| Cross-Validation | K-Fold validation for robust model evaluation |
| Hyperparameter Tuning | GridSearchCV for optimal parameters |
| Model Persistence | Save/load trained models with joblib |
| Feature Scaling | StandardScaler for optimal algorithm performance |
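As a rough illustration of how scaling, K-Fold validation, GridSearchCV, and joblib persistence can fit together, here is a minimal sketch. It is not taken from this repository: the synthetic data, the `clf__C` grid, and the temporary file location are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR data (assumption: features are numeric after encoding).
X, y = make_classification(
    n_samples=400, n_features=11, weights=[0.75, 0.25], random_state=42
)

# Keeping the scaler inside the pipeline means every CV fold fits its own scaler.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # hypothetical grid, not the project's
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
)
search.fit(X, y)

# Save the best pipeline with joblib, then load it back and compare predictions.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "lr_model.pkl"
    joblib.dump(search.best_estimator_, model_path)
    restored = joblib.load(model_path)

same_predictions = bool(
    (restored.predict(X) == search.best_estimator_.predict(X)).all()
)
print(search.best_params_, round(search.best_score_, 3), same_predictions)
```

Persisting the whole pipeline, rather than the classifier alone, keeps the fitted scaler and the model in one file so they cannot drift apart at prediction time.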

📈 Analysis & Reporting

| Feature | Description |
| --- | --- |
| Classification Metrics | Precision, Recall, F1-Score, Accuracy, AUC-ROC |
| Confusion Matrix | Confusion matrix visualization |
| Train/Test Scores | Detailed performance on training and test sets |
| Classification Report | Per-class precision, recall, F1-score |
| Feature Importance | Identify most influential features |
| ROC Curves | Receiver Operating Characteristic analysis |

🎯 Prediction Features

| Feature | Description |
| --- | --- |
| Batch Predictions | Predict on multiple employees at once |
| Confidence Scores | Probability of staying vs leaving |
| Feature-wise Explanation | Understand prediction reasoning |
| Historical Comparisons | Track prediction accuracy over time |

🖥️ User Interface

| Feature | Description |
| --- | --- |
| Interactive Dashboard | Real-time model performance visualization |
| Model Comparison | Compare different algorithms side-by-side |
| Training History | Track all trained models and their metrics |
| Download Reports | Export predictions and analysis as CSV/PDF |

🛠️ Tech Stack

| Component | Technology |
| --- | --- |
| Backend | Python 3.8+, Flask 2.0 |
| ML Libraries | scikit-learn, XGBoost, LightGBM |
| Data Processing | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn, Plotly |
| Model Storage | joblib |
| Frontend | HTML5, CSS3, JavaScript, Bootstrap |
| Deployment | Gunicorn, Docker |

📂 Project Structure

ml-model-deployment/
├── main.py                      # Flask application entry point
├── train.py                     # Model training logic
├── predict.py                   # Prediction logic
├── evaluate.py                  # Model evaluation
├── data_processor.py            # Data loading & preprocessing
├── config.py                    # Configuration settings
│
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container configuration
├── docker-compose.yml           # Multi-container setup
│
├── data/                        # Training datasets
│   ├── normal_data.csv          # Original (imbalanced) data
│   ├── oversample.csv           # Oversampled data
│   └── undersample_data.csv     # Undersampled data
│
├── models/                      # Saved trained models
│   ├── lr_model.pkl             # Logistic Regression
│   ├── knn_model.pkl            # KNN model
│   ├── svm_model.pkl            # SVM model
│   └── scalers/                 # Feature scalers
│
├── templates/                   # HTML templates
│   ├── base.html                # Base template
│   ├── index.html               # Home page
│   ├── train.html               # Training interface
│   ├── predict.html             # Prediction interface
│   ├── results.html             # Results display
│   └── dashboard.html           # Analytics dashboard
│
├── static/                      # Static files
│   ├── css/
│   │   ├── style.css            # Custom styling
│   │   └── bootstrap.min.css
│   ├── js/
│   │   ├── script.js            # Client-side logic
│   │   └── charts.js            # Chart generation
│   └── images/                  # UI images
│
└── README.md                    # This file

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Virtual environment (recommended)
  • 2GB RAM minimum

Step-by-Step Setup

  1. Clone repository:

    git clone https://github.com/Ahmed122000/ML_model_deployment.git
    cd ML_model_deployment
  2. Create virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Prepare datasets:

    # Ensure these files exist in data/ directory:
    # - normal_data.csv
    # - oversample.csv
    # - undersample_data.csv
  5. Run application:

    python main.py
  6. Access application:

    http://localhost:5000
    

💡 Usage

Web Interface Navigation

1️⃣ Home Page

  • Project overview
  • Quick links to train/predict
  • Model statistics

2️⃣ Training Models

Steps:

  1. Navigate to "Train Models" tab
  2. Select Dataset:
    • Normal (original data)
    • Oversampled (more minority class samples)
    • Undersampled (fewer majority class samples)
  3. Choose Algorithm:
    • Logistic Regression
    • K-Nearest Neighbors (KNN)
    • Support Vector Machine (SVM)
  4. Optional: Adjust hyperparameters
  5. Click "Train Model"
  6. View results:
    • Train/Test scores
    • Classification report
    • Confusion matrix
    • Feature importance

Training Output:

Model: Logistic Regression
Dataset: Oversampled
Train Score: 0.8245
Test Score: 0.7893
Precision: 0.8102
Recall: 0.7654
F1-Score: 0.7873

3️⃣ Making Predictions

Steps:

  1. Navigate to "Predict" tab

  2. Fill employee information:

    • City Development Index (0.0 - 1.0)
    • Gender (M/F)
    • Relevant Experience (Yes/No)
    • Enrolled in University (Full-time/Part-time/No)
    • Education Level (High School/Bachelor/Master/PhD)
    • Major Discipline
    • Experience (years)
    • Company Size
    • Company Type
    • Last New Job (years)
    • Training Hours
  3. Click "Predict"

  4. View prediction result:

    • Will Stay or Will Leave
    • Confidence percentage
    • Feature contributions

4️⃣ Dashboard

  • Compare all trained models
  • View training history
  • Analyze feature importance across models
  • Export reports

🤖 Machine Learning Models

1. Logistic Regression

When to use: Baseline model, interpretable results

Parameters:

LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced'
)

Pros:

  • Fast training
  • Highly interpretable
  • Good for linearly separable data

Cons:

  • Assumes linear relationship
  • Less effective with complex patterns

2. K-Nearest Neighbors (KNN)

When to use: Non-linear patterns, small-medium datasets

Parameters:

KNeighborsClassifier(
    n_neighbors=5,
    weights='distance',
    metric='euclidean'
)

Pros:

  • Captures non-linear patterns
  • Negligible training cost (lazy learner: fitting only stores the data)
  • Effective for local patterns

Cons:

  • Slow prediction time
  • Sensitive to feature scaling
  • Memory intensive

3. Support Vector Machine (SVM)

When to use: High-dimensional data, maximum margin classification

Parameters:

SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    probability=True,
    random_state=42
)

Pros:

  • Effective in high dimensions
  • Robust to outliers
  • Strong theoretical foundation

Cons:

  • Slower training
  • Requires feature scaling
  • Hard to interpret
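To show how the three configurations listed above could be compared on equal footing, the sketch below runs each through the same scaled, cross-validated pipeline. The synthetic data from `make_classification`, the 5-fold setting, and the F1 scoring choice are assumptions for illustration; they are not the project's actual training code or data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded HR features (assumption).
X, y = make_classification(
    n_samples=500, n_features=11, weights=[0.75, 0.25], random_state=42
)

# Same parameters as listed in this section.
models = {
    "Logistic Regression": LogisticRegression(
        max_iter=1000, random_state=42, class_weight="balanced"
    ),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean"),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=42),
}

scores = {}
for name, model in models.items():
    # KNN and SVM are sensitive to feature scale, so every model gets the same scaler.
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring="f1").mean()

for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

Scores on synthetic data say nothing about the real dataset; the point is only that all three models share one preprocessing and evaluation path.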

Data Balancing Techniques

Original Distribution

Staying: 75% (majority)
Leaving: 25% (minority)

Oversampling

Randomly duplicate minority class samples
Result: 50% vs 50% balanced distribution (minority grown to match the majority)

Undersampling

Randomly remove majority class samples
Result: 50% vs 50% balanced distribution (majority shrunk to match the minority)
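The two resampling strategies can be sketched with the standard library alone. This is an illustrative version, not the code that produced `oversample.csv` or `undersample_data.csv`; the function names and the toy rows are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(rows, label_of):
    """Bucket rows by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def random_oversample(rows, label_of, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_undersample(rows, label_of, seed=42):
    """Randomly drop rows of larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def get_label(row):
    return row["target"]


# Toy data with the same 75/25 imbalance: 75 "stays" (0) and 25 "leaves" (1).
rows = [{"id": i, "target": 0} for i in range(75)]
rows += [{"id": i, "target": 1} for i in range(75, 100)]

over = Counter(get_label(r) for r in random_oversample(rows, get_label))
under = Counter(get_label(r) for r in random_undersample(rows, get_label))
print(over[0], over[1])    # 75 rows per class after oversampling
print(under[0], under[1])  # 25 rows per class after undersampling
```

Either way the classes end up equal in size: oversampling grows the dataset to 150 rows here, while undersampling shrinks it to 50. Resampling is normally applied to the training split only, so the test set keeps the real class distribution.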

📊 Dataset

Features (11 input features plus the target)

| Feature | Type | Range/Values | Description |
| --- | --- | --- | --- |
| city_development_index | float | 0.0 - 1.0 | City development level |
| gender | categorical | M/F | Employee gender |
| relevant_experience | binary | Yes/No | Has relevant experience |
| enrolled_university | categorical | Full-time/Part-time/No | University enrollment |
| education_level | categorical | HS/Bachelor/Master/PhD | Highest education |
| major_discipline | categorical | STEM/Business/Humanities | Field of study |
| experience | integer | 0-50 | Years of experience |
| company_size | categorical | Startup/MNC/Unicorn | Company size |
| company_type | categorical | IT/Service/Healthcare | Industry type |
| last_new_job | integer | 0-5 | Years at current job |
| training_hours | integer | 0-500 | Professional training hours |
| target | binary | 0/1 | 0=Stays, 1=Leaves |

Dataset Size

  • Total Records: 19,158 employees
  • Training Set: 70% (13,410 records)
  • Test Set: 30% (5,748 records)
  • Missing Values: < 2% (handled)
  • Class Imbalance: 75% vs 25%

Data Preprocessing

# Steps applied:
1. Load CSV data
2. Handle missing values (mean/mode imputation)
3. Encode categorical variables (LabelEncoder)
4. Scale numerical features (StandardScaler)
5. Split train/test (70/30, matching the Dataset Size figures above)
6. Handle class imbalance (oversample/undersample)
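A compact sketch of steps 1 to 5 with pandas and scikit-learn follows. The tiny in-memory frame stands in for the CSV load, the column subset and the `test_size=0.3` value are assumptions, and the scaler is deliberately fit after the split so that test rows do not influence the scaling statistics. It is not the contents of `data_processor.py`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Step 1 stand-in: a made-up frame instead of pd.read_csv("data/normal_data.csv").
df = pd.DataFrame({
    "city_development_index": [0.92, 0.62, None, 0.78, 0.91, 0.55, 0.80, 0.66],
    "relevant_experience": ["Yes", "No", "Yes", None, "Yes", "No", "Yes", "No"],
    "training_hours": [40.0, 120.0, 15.0, 60.0, 8.0, 200.0, 33.0, 75.0],
    "target": [0, 1, 0, 0, 0, 1, 0, 1],
})
numeric_cols = ["city_development_index", "training_hours"]
categorical_cols = ["relevant_experience"]

# Step 2: mean imputation for numeric columns, mode imputation for categorical ones.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Step 3: encode each categorical column as integers.
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Step 5: stratified split (the ratio here is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols],
    df["target"],
    test_size=0.3,
    stratify=df["target"],
    random_state=42,
)
X_train, X_test = X_train.copy(), X_test.copy()

# Step 4: fit the scaler on training rows only, then apply it to both splits.
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape)
```

Step 6 (balancing) would then be applied to `X_train` and `y_train` only, leaving the test split at its natural class ratio.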

📈 Results & Performance

Model Comparison (on test set)

| Metric | Logistic Regression | KNN (k=5) | SVM (RBF) |
| --- | --- | --- | --- |
| Accuracy | 78.23% | 76.45% | 79.12% |
| Precision | 0.7891 | 0.7654 | 0.8023 |
| Recall | 0.7456 | 0.7234 | 0.7789 |
| F1-Score | 0.7667 | 0.7440 | 0.7904 |
| AUC-ROC | 0.8234 | 0.8012 | 0.8456 |
| Training Time | 2.3s | 0.5s | 45.2s |

Best Performing Model: SVM

  • Highest accuracy and F1-score
  • Good balance between precision and recall
  • Acceptable training time (45.2s, the slowest of the three)
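For readers who want to see how the precision, recall, F1, and accuracy figures relate to raw predictions, here is a small standard-library sketch. The function name and the toy labels are made up for the example; the project itself may well rely on scikit-learn's metric functions instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = leaves, 0 = stays.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target, accuracy alone can look good while recall on the "leaves" class stays low, which is why F1 and AUC-ROC sit next to accuracy in the comparison.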

🔌 API Endpoints

Flask Routes

| Endpoint | Method | Purpose |
| --- | --- | --- |
| / | GET | Home page |
| /train | GET, POST | Training interface |
| /predict | GET, POST | Prediction interface |
| /results | GET | View training results |
| /dashboard | GET | Analytics dashboard |
| /api/train-model | POST | Train model (JSON API) |
| /api/predict | POST | Make prediction (JSON API) |
| /api/models | GET | List trained models |
| /api/export | GET | Export results as CSV |

API Examples

Train Model:

curl -X POST http://localhost:5000/api/train-model \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "svm",
    "dataset": "oversample"
  }'

Make Prediction:

curl -X POST http://localhost:5000/api/predict \
  -H "Content-Type: application/json" \
  -d '{
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40
  }'
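The same prediction call can be made from Python. The sketch below builds the request with the standard library and mirrors the curl example; `build_predict_request` is a hypothetical helper, and the shape of the JSON response is not documented here, so the sending step is left as a commented-out suggestion.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:5000"):
    """Build a POST request for /api/predict with a JSON body."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request({
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40,
})
print(request.get_method(), request.full_url)

# To send it, the app must be running locally:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```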

🐳 Deployment

Docker Setup

  1. Build image:

    docker build -t ml-predictor:latest .
  2. Run container:

    docker run -p 5000:5000 ml-predictor:latest
  3. Using Docker Compose:

    docker-compose up

Production Deployment

Using Gunicorn:

gunicorn --workers 4 --bind 0.0.0.0:5000 main:app

On Heroku:

heroku login
heroku create ml-predictor
git push heroku main

🧪 Testing

Run Tests

python -m pytest tests/

Test Coverage

  • Unit tests for model training
  • Integration tests for API endpoints
  • Data preprocessing tests
  • Prediction accuracy tests

🐛 Troubleshooting

Issue: "ModuleNotFoundError"

Solution: Install requirements

pip install -r requirements.txt

Issue: "FileNotFoundError: data files"

Solution: Ensure CSV files exist in data/ directory

Issue: "Port 5000 already in use"

Solution: Use different port

python main.py --port 5001
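The command above only works if `main.py` parses a `--port` option, which this README does not show. If it does not, a flag like that could be added along these lines; the `parse_args` helper and the `--host` option are assumptions for this sketch, not the project's code.

```python
import argparse


def parse_args(argv=None):
    """Parse host and port options for the development server."""
    parser = argparse.ArgumentParser(description="Run the Flask app")
    parser.add_argument("--host", default="127.0.0.1", help="interface to bind")
    parser.add_argument("--port", type=int, default=5000, help="port to listen on")
    return parser.parse_args(argv)


args = parse_args(["--port", "5001"])
print(args.host, args.port)

# In main.py the parsed values would typically be passed to Flask's dev server:
# app.run(host=args.host, port=args.port)
```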

📈 Future Enhancements

  • Deep learning models (Neural Networks)
  • Real-time data streaming
  • Advanced feature engineering
  • Model explainability (SHAP, LIME)
  • A/B testing framework
  • Automated retraining pipeline
  • Mobile app integration
  • Multi-language support
  • Advanced visualization dashboards
  • REST API v2

📝 Contributing

  1. Fork repository
  2. Create feature branch (git checkout -b feature/improvement)
  3. Commit changes (git commit -m 'Add improvement')
  4. Push to branch (git push origin feature/improvement)
  5. Open Pull Request

📄 License

This project is licensed under the MIT License - see LICENSE for details.


🙏 Acknowledgments


👨‍💻 Author

Ahmed Hesham - @Ahmed122000

Built with ❤️ for HR Analytics & ML Deployment
