You are working as a data scientist at a food delivery company. The company wants to improve the system that calculates the ETA for its delivery persons. Rather than relying on a fixed method or formula, the management has decided to develop intelligent software that can predict the time of arrival for the delivery persons.
Develop a machine learning model that can calculate the time taken by a delivery person to deliver an order, given relevant information.
E.D.A
Time taken to deliver the order is highest in Semi-Urban cities and least in Urban cities.
Delivering an order during a festival takes more time. There could be two reasons:
1. A larger number of orders.
2. More traffic on the roads.
Time taken to deliver the order in Fog and Cloudy weather is greater than in other conditions.
There is a sudden drop in time taken for ratings of 4.5 and above.
Most of the orders are received and delivered during the evening (6 PM to 10 PM).
Time taken to deliver the order is least in Sunny weather with Medium traffic and highest in Fog weather with Jam traffic.
Time taken to deliver the order is least in the mornings of Cloudy, Fog or Sunny weather and highest in the evenings.
Refer to the notebook for the complete analysis.
Feature Engineering
First, created 5 stratified folds of the data (see create_folds.py).
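Stratifying on a continuous target such as delivery time usually needs a binning step first. The following is a minimal sketch of one way to do it, not necessarily what create_folds.py does: the column name `time_taken`, the `kfold` output column, the 10 quantile bins and the random seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def create_folds(df: pd.DataFrame, target: str = "time_taken",
                 n_splits: int = 5, n_bins: int = 10, seed: int = 42) -> pd.DataFrame:
    """Add a `kfold` column so every fold sees a similar target distribution."""
    # Shuffle once and reset the index so labels match row positions.
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Bin the continuous target into quantile buckets (assumed: 10 bins).
    bins = pd.qcut(df[target], q=n_bins, labels=False, duplicates="drop")
    df["kfold"] = -1
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (_, valid_idx) in enumerate(skf.split(df, bins)):
        df.loc[valid_idx, "kfold"] = fold
    return df


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"time_taken": rng.integers(10, 55, size=500)})
folded = create_folds(demo)
```

Storing the fold id as a column lets every later step (encoding, model selection, tuning) reuse exactly the same splits.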
- Filled null values in categorical variables with "NULL".
- Fixed ratings and time.
Refer to the fixing data notebook
- Extracted granular features from Date and Time columns
- Created bins for Order time.
- Calculated Distance metrics for location data.
- Computed GeoHash of the Locations.
- Greedily combined pairs of categorical columns.
Refer to feature_eng.py
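One common distance metric for latitude/longitude pairs is the haversine (great-circle) distance. The sketch below shows it under the assumption that restaurant and delivery coordinates are given in decimal degrees; the function name is hypothetical and feature_eng.py may compute other metrics as well.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of longitude along the equator is roughly 111.19 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```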
Feature Encoding
- Applied Label Encoding on Road_traffic_density, Festival and City columns.
- Applied Target Mean encoding with cross validation on the remaining categorical columns
Refer to feature_encode.py
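Target mean encoding with cross validation can be sketched as follows: each row is encoded with the target mean of its category computed on the other folds only, which limits target leakage. This is an illustrative version, not the code in feature_encode.py; the `Weather`, `time_taken` and `kfold` column names are assumptions.

```python
import numpy as np
import pandas as pd


def target_mean_encode(df: pd.DataFrame, col: str, target: str,
                       fold_col: str = "kfold") -> pd.Series:
    """Out-of-fold mean encoding of `col` against `target`."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    for fold in df[fold_col].unique():
        valid_mask = df[fold_col] == fold
        # Category means come from the training folds only.
        means = df.loc[~valid_mask].groupby(col)[target].mean()
        encoded[valid_mask] = df.loc[valid_mask, col].map(means)
    # Categories unseen in the training folds fall back to the global mean.
    return encoded.fillna(global_mean)


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Weather": ["Fog", "Fog", "Sunny", "Sunny", "Fog", "Sunny"],
    "time_taken": [30.0, 34.0, 20.0, 22.0, 32.0, 18.0],
    "kfold": [0, 1, 0, 1, 2, 2],
})
demo["Weather_te"] = target_mean_encode(demo, "Weather", "time_taken")
```

In the demo, the first row (fold 0, Fog) is encoded as 33.0, the mean of the Fog rows in folds 1 and 2, rather than with a mean that includes its own target value.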
Feature Selection
- Dropped features with variance less than or equal to 0.1.
- Kept the features selected by CatBoost, XGBoost and LightGBM.
Refer to feature_selection.py
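The variance filter can be illustrated with a few lines of pandas. This sketch assumes all features are numeric at this stage and uses the population variance (`ddof=0`); the helper name and the demo columns are hypothetical, and feature_selection.py may implement the step differently.

```python
import pandas as pd


def drop_low_variance(df: pd.DataFrame, threshold: float = 0.1) -> pd.DataFrame:
    """Keep only numeric columns whose variance is strictly above `threshold`."""
    variances = df.var(axis=0, ddof=0, numeric_only=True)  # population variance
    keep = variances[variances > threshold].index
    return df[keep]


# Made-up values, for illustration only.
demo = pd.DataFrame({
    "near_constant": [1.0, 1.0, 1.0, 1.1],
    "distance_km": [2.0, 5.5, 9.1, 3.3],
})
reduced = drop_low_variance(demo)
```

Variance depends on the scale of each feature, so a fixed cutoff such as 0.1 behaves differently for a column measured in kilometres than for one measured in metres.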
Imputation
- Iteratively imputed the data using LightGBM and CatBoost.
- The imputed data is used only for the models that cannot handle null values.
Refer to impute.py
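Iterative imputation fits a regressor on the observed values of each incomplete column, using the other columns as inputs, and repeats this for a few rounds. The sketch below uses scikit-learn's `IterativeImputer` with a `DecisionTreeRegressor` standing in for LightGBM or CatBoost to keep dependencies light; that swap, the synthetic data and all parameter values are assumptions, and impute.py may work differently.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the third column depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + X[:, 1]

# Knock out roughly 20% of the third column.
X_missing = X.copy()
X_missing[rng.random(200) < 0.2, 2] = np.nan

imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=0),  # stand-in model
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

Any regressor with `fit` and `predict` methods can be passed as `estimator`, so a gradient-boosting model could be substituted in the same place.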
Model Selection
Performed Stratified K-fold cross validation on Regression models.
| Model | R2 Score | RMSE |
|---|---|---|
| LightGBM | 0.8274 | 3.8982 |
| CatBoost without categorical encoding | 0.8266 | 3.9077 |
| Random Forest | 0.8232 | 3.9461 |
| CatBoost with categorical encoding | 0.8188 | 3.9722 |
| XGBoost | 0.8167 | 4.0172 |
| Gradient Boosting | 0.7836 | 4.3654 |
| AdaBoost | 0.6095 | 5.8631 |
| Linear Regression | 0.5619 | 6.2112 |
- Based on the above results, the selected models are LightGBM, CatBoost, Random Forest and XGBoost.
- CatBoost also performs better on the data in which the categorical columns are not encoded.
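For reference, the two metrics reported in the table can be computed as below. This is a plain NumPy sketch of the standard definitions, with made-up numbers; the project may rely on library implementations instead.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values, for illustration only.
y_true = [20.0, 30.0, 40.0]
y_pred = [22.0, 30.0, 38.0]
demo_r2 = r2_score(y_true, y_pred)
demo_rmse = rmse(y_true, y_pred)
```

With K-fold cross validation, each metric is computed on the held-out fold and then averaged across folds.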
Hypertuning
- Hypertuned the selected models using Optuna.
- Results After Hypertuning:
| Model | R2 Score | RMSE |
|---|---|---|
| CatBoost | 0.8319 | 3.8476 |
| XGBoost | 0.8318 | 3.8484 |
| LightGBM | 0.8284 | 3.8876 |
| Random Forest | 0.8274 | 3.8990 |
Refer to hypertuning.py
Best Model and Result
- After hypertuning, the best models are CatBoost and XGBoost.
- Averaging their predictions gives slightly better results.
| Model | R2 Score | RMSE |
|---|---|---|
| CatBoost + XGBoost | 0.8351 | 3.811 |
| CatBoost | 0.8319 | 3.8476 |
| XGBoost | 0.8318 | 3.8484 |
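The averaging step amounts to an unweighted mean of the two models' predictions. A minimal sketch, with made-up prediction values and a hypothetical helper name:

```python
import numpy as np


def average_predictions(*preds: np.ndarray) -> np.ndarray:
    """Unweighted mean of per-model prediction arrays of equal length."""
    return np.mean(np.vstack(preds), axis=0)


# Made-up predictions (in the target's units), for illustration only.
catboost_pred = np.array([25.0, 31.0, 18.0])
xgboost_pred = np.array([27.0, 29.0, 20.0])
blended = average_predictions(catboost_pred, xgboost_pred)
```

Averaging tends to help when the models make partly different errors, which is consistent with the small gain in the table above.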
git clone https://github.com/mohan-gupta/estimating-delivery-time.git # clone
cd estimating-delivery-time
pip install -r requirements.txt # install
cd app
streamlit run streamlit_app.py # run
First follow the approach described above, after the E.D.A section, to prepare the data, then run the following commands.
cd src
sh run.sh #run









