A production-ready Flask web application that predicts whether a data scientist will stay with a company or leave. It covers machine learning model training, evaluation, and interactive prediction, with data balancing techniques for the imbalanced classes.
- Overview
- Features
- Tech Stack
- Project Structure
- Installation
- Usage
- Machine Learning Models
- Dataset
- Results & Performance
- API Endpoints
- Deployment
- License
This project builds a predictive model to determine whether data scientists will remain with their current employer or leave for better opportunities. The application provides:
- Multiple ML algorithms comparison
- Data balancing techniques (oversampling, undersampling)
- Interactive training interface for experimentation
- Real-time predictions on new employee data
- Detailed evaluation metrics and classification reports
Business Value: HR departments can identify at-risk employees and implement retention strategies.
| Feature | Description |
|---|---|
| Multiple Algorithms | Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) |
| Data Balancing | Handle imbalanced classes with oversampling, undersampling, SMOTE |
| Cross-Validation | K-Fold validation for robust model evaluation |
| Hyperparameter Tuning | GridSearchCV for optimal parameters |
| Model Persistence | Save/load trained models with joblib |
| Feature Scaling | StandardScaler for optimal algorithm performance |
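As a rough illustration of how the features in this table fit together, the sketch below chains StandardScaler and a classifier in a scikit-learn Pipeline, tunes it with GridSearchCV over stratified K-Fold splits, and saves the best model with joblib. The synthetic data, the parameter grid, and the temporary file path are assumptions for the example, not the project's actual settings.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: roughly a 75/25 class split).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Keeping the scaler inside the pipeline means each CV fold fits its own scaler,
# so no statistics leak from the validation fold into training.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Illustrative grid only; the project's real search space is not documented here.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1")
search.fit(X, y)
print("Best parameters:", search.best_params_)

# Persist the tuned pipeline and load it back (hypothetical temporary location).
model_path = Path(tempfile.mkdtemp()) / "lr_model.pkl"
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)
```

Saving the whole pipeline rather than the bare classifier keeps the fitted scaler and the model together, so prediction code cannot accidentally skip the scaling step.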
| Feature | Description |
|---|---|
| Classification Metrics | Precision, Recall, F1-Score, Accuracy, AUC-ROC |
| Confusion Matrix | Visualization of the confusion matrix |
| Train/Test Scores | Detailed performance on training and test sets |
| Classification Report | Per-class precision, recall, F1-score |
| Feature Importance | Identify most influential features |
| ROC Curves | Receiver Operating Characteristic analysis |
| Feature | Description |
|---|---|
| Batch Predictions | Predict on multiple employees at once |
| Confidence Scores | Probability of staying vs leaving |
| Feature-wise Explanation | Understand prediction reasoning |
| Historical Comparisons | Track prediction accuracy over time |
| Feature | Description |
|---|---|
| Interactive Dashboard | Real-time model performance visualization |
| Model Comparison | Compare different algorithms side-by-side |
| Training History | Track all trained models and their metrics |
| Download Reports | Export predictions and analysis as CSV/PDF |
| Component | Technology |
|---|---|
| Backend | Python 3.8+, Flask 2.0 |
| ML Libraries | scikit-learn, XGBoost, LightGBM |
| Data Processing | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn, Plotly |
| Model Storage | joblib |
| Frontend | HTML5, CSS3, JavaScript, Bootstrap |
| Deployment | Gunicorn, Docker |
ml-model-deployment/
├── main.py                      # Flask application entry point
├── train.py                     # Model training logic
├── predict.py                   # Prediction logic
├── evaluate.py                  # Model evaluation
├── data_processor.py            # Data loading & preprocessing
├── config.py                    # Configuration settings
│
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container configuration
├── docker-compose.yml           # Multi-container setup
│
├── data/                        # Training datasets
│   ├── normal_data.csv          # Original (imbalanced) data
│   ├── oversample.csv           # Oversampled data
│   └── undersample_data.csv     # Undersampled data
│
├── models/                      # Saved trained models
│   ├── lr_model.pkl             # Logistic Regression
│   ├── knn_model.pkl            # KNN model
│   ├── svm_model.pkl            # SVM model
│   └── scalers/                 # Feature scalers
│
├── templates/                   # HTML templates
│   ├── base.html                # Base template
│   ├── index.html               # Home page
│   ├── train.html               # Training interface
│   ├── predict.html             # Prediction interface
│   ├── results.html             # Results display
│   └── dashboard.html           # Analytics dashboard
│
├── static/                      # Static files
│   ├── css/
│   │   ├── style.css            # Custom styling
│   │   └── bootstrap.min.css
│   ├── js/
│   │   ├── script.js            # Client-side logic
│   │   └── charts.js            # Chart generation
│   └── images/                  # UI images
│
└── README.md                    # This file
- Python 3.8 or higher
- pip (Python package manager)
- Virtual environment (recommended)
- 2GB RAM minimum
- Clone repository:
  git clone https://github.com/Ahmed122000/ML_model_deployment.git
  cd ML_model_deployment
- Create virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Prepare datasets:
  # Ensure these files exist in data/ directory:
  # - normal_data.csv
  # - oversample.csv
  # - undersample_data.csv
- Run application:
  python main.py
- Access application:
  http://localhost:5000
- Project overview
- Quick links to train/predict
- Model statistics
Steps:
- Navigate to "Train Models" tab
- Select Dataset:
  - Normal (original data)
  - Oversampled (more minority class samples)
  - Undersampled (fewer majority class samples)
- Choose Algorithm:
  - Logistic Regression
  - K-Nearest Neighbors (KNN)
  - Support Vector Machine (SVM)
- Optional: Adjust hyperparameters
- Click "Train Model"
- View results:
  - Train/Test scores
  - Classification report
  - Confusion matrix
  - Feature importance
Training Output:
Model: Logistic Regression
Dataset: Oversampled
Train Score: 0.8245
Test Score: 0.7893
Precision: 0.8102
Recall: 0.7654
F1-Score: 0.7873
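For orientation, the fields in this output can be produced with standard scikit-learn calls. The sketch below is not the project's train.py; it uses synthetic data and assumes that "Train Score" and "Test Score" are accuracy values and that precision, recall, and F1 refer to the positive class (1 = leaves).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: ~75/25 class split).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced")
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

report = {
    "Train Score": model.score(X_train_s, y_train),  # accuracy on training data
    "Test Score": model.score(X_test_s, y_test),     # accuracy on held-out data
    "Precision": precision_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "F1-Score": f1_score(y_test, y_pred),
}
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

A train score noticeably higher than the test score, as in the sample output above, is the usual sign of mild overfitting.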
Steps:
- Navigate to "Predict" tab
- Fill employee information:
  - City Development Index (0.0 - 1.0)
  - Gender (M/F)
  - Relevant Experience (Yes/No)
  - Enrolled in University (Yes/No)
  - Education Level (High School/Bachelor/Master/PhD)
  - Major Discipline
  - Experience (years)
  - Company Size
  - Company Type
  - Last New Job (years)
  - Training Hours
- Click "Predict"
- View prediction result:
  - Will Stay or Will Leave
  - Confidence percentage
  - Feature contributions
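A label plus a confidence percentage like the one shown in the result view can be derived from a classifier's `predict_proba` output. The following is a minimal sketch under assumptions: the helper name `describe_prediction` is hypothetical, the model is trained on synthetic data, and the input row is assumed to be already encoded and scaled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in model; the real app would load a saved model instead.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)


def describe_prediction(model, features):
    """Return a label and confidence for one already-encoded feature row."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    proba = model.predict_proba(row)[0]
    idx = int(np.argmax(proba))
    # Target encoding from the dataset table: 0 = stays, 1 = leaves.
    label = "Will Leave" if model.classes_[idx] == 1 else "Will Stay"
    return {"prediction": label, "confidence": round(float(proba[idx]) * 100, 1)}


result = describe_prediction(model, X[0])
print(result)
```

For SVC, `predict_proba` is only available when the model is created with `probability=True`.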
- Compare all trained models
- View training history
- Analyze feature importance across models
- Export reports
When to use: Baseline model, interpretable results
Parameters:
LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced'
)
Pros:
- Fast training
- Highly interpretable
- Good for linearly separable data
Cons:
- Assumes linear relationship
- Less effective with complex patterns
When to use: Non-linear patterns, small-medium datasets
Parameters:
KNeighborsClassifier(
    n_neighbors=5,
    weights='distance',
    metric='euclidean'
)
Pros:
- Captures non-linear patterns
- No training phase
- Effective for local patterns
Cons:
- Slow prediction time
- Sensitive to feature scaling
- Memory intensive
When to use: High-dimensional data, maximum margin classification
Parameters:
SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    probability=True,
    random_state=42
)
Pros:
- Effective in high dimensions
- Robust to outliers
- Strong theoretical foundation
Cons:
- Slower training
- Requires feature scaling
- Hard to interpret
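To compare the three configurations above on equal footing, one option is to wrap each in a pipeline with StandardScaler and score it with cross-validation. The sketch below uses synthetic data and 5-fold F1 scoring as assumptions; it illustrates the comparison and does not reproduce the project's reported numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption: ~75/25 class split).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=42
)

# Same hyperparameters as listed for each model above.
models = {
    "Logistic Regression": LogisticRegression(
        max_iter=1000, random_state=42, class_weight="balanced"
    ),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean"),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=42),
}

scores = {}
for name, estimator in models.items():
    # KNN and SVM are sensitive to feature scale, so every model gets a scaler.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring="f1").mean()

for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```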
Staying: 75% (majority)
Leaving: 25% (minority)
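Oversampling: randomly duplicate minority class samples
Result: both classes at the majority size (75 vs 75 per 100 original records), a balanced 50/50 distribution
Undersampling: randomly remove majority class samples
Result: both classes at the minority size (25 vs 25 per 100 original records), a balanced 50/50 distribution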
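Both balancing strategies are simple enough to sketch with the standard library. The functions below are illustrative (the names `random_oversample` and `random_undersample` are not taken from the project) and work on a toy set of 100 records with the 75/25 split described above.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate random rows of smaller classes until all match the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


# Toy data: 75 "stay" (0) and 25 "leave" (1) records.
labels = [0] * 75 + [1] * 25
rows = list(range(100))

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))   # 75 records per class
print(Counter(under_labels))  # 25 records per class
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution.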
| Feature | Type | Range/Values | Description |
|---|---|---|---|
| city_development_index | float | 0.0 - 1.0 | City development level |
| gender | categorical | M/F | Employee gender |
| relevant_experience | binary | Yes/No | Has relevant experience |
| enrolled_university | categorical | Full-time/Part-time/No | University enrollment |
| education_level | categorical | HS/Bachelor/Master/PhD | Highest education |
| major_discipline | categorical | STEM/Business/Humanities | Field of study |
| experience | integer | 0-50 | Years of experience |
| company_size | categorical | Startup/MNC/Unicorn | Company size |
| company_type | categorical | IT/Service/Healthcare | Industry type |
| last_new_job | integer | 0-5 | Years between previous and current job |
| training_hours | integer | 0-500 | Professional training hours |
| target | binary | 0/1 | 0=Stays, 1=Leaves |
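The ranges in this table translate directly into input checks for the prediction form or JSON API. The sketch below is a hypothetical validator (the function name and error messages are illustrative, and fields that are absent from the payload are skipped by assumption); it only covers the fields whose allowed values are fully listed above.

```python
# Numeric fields: (expected type, minimum, maximum), taken from the table above.
NUMERIC_RANGES = {
    "city_development_index": (float, 0.0, 1.0),
    "experience": (int, 0, 50),
    "last_new_job": (int, 0, 5),
    "training_hours": (int, 0, 500),
}

# Categorical fields and their allowed values, taken from the table above.
CHOICES = {
    "gender": ("M", "F"),
    "relevant_experience": ("Yes", "No"),
    "enrolled_university": ("Full-time", "Part-time", "No"),
    "education_level": ("HS", "Bachelor", "Master", "PhD"),
}


def validate_employee(payload):
    """Return a list of problems found in a prediction payload (empty if valid)."""
    errors = []
    for field, (kind, low, high) in NUMERIC_RANGES.items():
        if field not in payload:
            continue  # assumption: missing fields are handled elsewhere
        value = payload[field]
        allowed_types = (int, float) if kind is float else (int,)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed_types):
            errors.append(f"{field}: expected {kind.__name__}")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} is outside [{low}, {high}]")
    for field, allowed_values in CHOICES.items():
        if field in payload and payload[field] not in allowed_values:
            errors.append(f"{field}: {payload[field]!r} is not one of {allowed_values}")
    return errors


sample = {
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40,
}
print(validate_employee(sample))
print(validate_employee({"city_development_index": 1.5, "gender": "X"}))
```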
- Total Records: 19,158 employees
- Training Set: 70% (13,410 records)
- Test Set: 30% (5,748 records)
- Missing Values: < 2% (handled)
- Class Imbalance: 75% vs 25%
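The split sizes above are what scikit-learn's `train_test_split` yields for 19,158 records with `test_size=0.30`. The sketch below checks this on placeholder labels; the use of `stratify`, which keeps the 75/25 class ratio in both splits, is an assumption about how the split could be done, not a statement about the project's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_records = 19158
n_leaving = round(n_records * 0.25)  # assumption: about 25% of records are "leaves"

# Placeholder features and labels with the documented class imbalance.
y = np.array([1] * n_leaving + [0] * (n_records - n_leaving))
X = np.arange(n_records).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(len(X_train), len(X_test))  # 13410 5748
print(f"leave rate: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```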
# Steps applied:
1. Load CSV data
2. Handle missing values (mean/mode imputation)
3. Encode categorical variables (LabelEncoder)
4. Scale numerical features (StandardScaler)
5. Split train/test (70/30)
6. Handle class imbalance (oversample/undersample)

| Metric | Logistic Regression | KNN (k=5) | SVM (RBF) |
|---|---|---|---|
| Accuracy | 78.23% | 76.45% | 79.12% |
| Precision | 0.7891 | 0.7654 | 0.8023 |
| Recall | 0.7456 | 0.7234 | 0.7789 |
| F1-Score | 0.7667 | 0.7440 | 0.7904 |
| AUC-ROC | 0.8234 | 0.8012 | 0.8456 |
| Training Time | 2.3s | 0.5s | 45.2s |
Best overall: SVM (RBF)
- Highest accuracy and F1-score
- Good balance between precision and recall
- Acceptable training time (45.2s, the slowest of the three)
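As a quick sanity check on the results table, F1 is the harmonic mean of precision and recall, so each reported F1-score can be recomputed from the precision and recall in the same column. The sketch below does this with the table's values; small differences are expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "Logistic Regression": (0.7891, 0.7456, 0.7667),
    "KNN (k=5)": (0.7654, 0.7234, 0.7440),
    "SVM (RBF)": (0.8023, 0.7789, 0.7904),
}

for name, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.4f}, reported = {f1:.4f}")
```

All three recomputed values agree with the table to within 0.001.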
| Endpoint | Method | Purpose |
|---|---|---|
| / | GET | Home page |
| /train | GET, POST | Training interface |
| /predict | GET, POST | Prediction interface |
| /results | GET | View training results |
| /dashboard | GET | Analytics dashboard |
| /api/train-model | POST | Train model (JSON API) |
| /api/predict | POST | Make prediction (JSON API) |
| /api/models | GET | List trained models |
| /api/export | GET | Export results as CSV |
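The JSON endpoints in this table can also be called from Python using only the standard library. The sketch below is a hypothetical client: the helper names are illustrative, the base URL assumes the local development server, and it assumes the API answers with a JSON body, which this README does not specify.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: app running locally on port 5000


def build_json_post(path, payload):
    """Build a POST request with a JSON body for one of the /api/ endpoints."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_api(path, payload, timeout=30):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_json_post(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


employee = {
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40,
}
request = build_json_post("/api/predict", employee)
print(request.get_method(), request.full_url)

# With the application running, the call itself would be:
# result = call_api("/api/predict", employee)
```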
Train Model:
curl -X POST http://localhost:5000/api/train-model \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "svm",
    "dataset": "oversample"
  }'
Make Prediction:
curl -X POST http://localhost:5000/api/predict \
  -H "Content-Type: application/json" \
  -d '{
    "city_development_index": 0.92,
    "gender": "M",
    "relevant_experience": "Yes",
    "experience": 3,
    "training_hours": 40
  }'
- Build image:
  docker build -t ml-predictor:latest .
- Run container:
  docker run -p 5000:5000 ml-predictor:latest
- Using Docker Compose:
  docker-compose up
Using Gunicorn:
gunicorn --workers 4 --bind 0.0.0.0:5000 main:app
On Heroku:
heroku login
heroku create ml-predictor
git push heroku main
Run tests:
python -m pytest tests/
- Unit tests for model training
- Integration tests for API endpoints
- Data preprocessing tests
- Prediction accuracy tests
Solution: Install requirements
pip install -r requirements.txt
Solution: Ensure CSV files exist in data/ directory
Solution: Use different port
python main.py --port 5001
- Deep learning models (Neural Networks)
- Real-time data streaming
- Advanced feature engineering
- Model explainability (SHAP, LIME)
- A/B testing framework
- Automated retraining pipeline
- Mobile app integration
- Multi-language support
- Advanced visualization dashboards
- REST API v2
- Fork repository
- Create feature branch (git checkout -b feature/improvement)
- Commit changes (git commit -m 'Add improvement')
- Push to branch (git push origin feature/improvement)
- Open Pull Request
This project is licensed under the MIT License - see LICENSE for details.
- Kaggle - HR Analytics dataset
- scikit-learn - ML algorithms
- Flask - Web framework
- Pandas - Data processing
Ahmed Hesham - @Ahmed122000
Built with ❤️ for HR Analytics & ML Deployment