Logistics Delay Prediction: EDA, Modeling, and Evaluation

Overview

The goal of this project is to predict whether an e-commerce shipment will be delayed or delivered on time. Shipment delays directly affect customer satisfaction and operational costs. Framing the problem as a machine learning task lets us identify high-risk shipments early and intervene.

Dataset

This project uses the Olist Brazilian E-Commerce Dataset, provided via Kaggle. The dataset contains information on 100k+ orders made at multiple marketplaces in Brazil. Its extensive schema spans features such as order statuses, price, payment metrics, product attributes, customer locations, and detailed delivery timestamps.

Problem Statement

We frame delay prediction as a binary classification problem:

Class 0: On-time
Class 1: Delayed

Challenge: Class Imbalance

The dataset exhibits extreme class imbalance, heavily skewed toward the on-time class (Class 0). Approximately 92.1% of shipments were on time and only 7.9% were delayed.

Caption: The vast majority of orders arrive on time, highlighting the difficulty of modeling the minority (delayed) class without proactive handling.
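To make the imbalance concrete, the sketch below rebuilds the reported 92.1% / 7.9% split with synthetic labels (not the real Olist data) and derives two numbers from it: the accuracy of a trivial model that always predicts on-time, and the weights that the "balanced" heuristic would assign to each class. The sample size of 1,000 is an arbitrary assumption.

```python
import numpy as np

# Synthetic labels mirroring the reported split: 92.1% on time (0), 7.9% delayed (1).
y = np.array([0] * 921 + [1] * 79)

counts = np.bincount(y)                    # number of samples per class
majority_accuracy = counts.max() / len(y)  # accuracy of always predicting "on time"

# "Balanced" weights: n_samples / (n_classes * class_count), the formula
# scikit-learn documents for class_weight='balanced'.
class_weights = len(y) / (len(counts) * counts)

print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"class weights (on-time, delayed): {class_weights.round(3)}")
```

The delayed class receives a much larger weight, so each misclassified delay costs the model more during training than a misclassified on-time order.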

Methodology

Data Processing

Cleaning: Merged multiple dataset tables (orders, order_items, customers, sellers) to build a unified view. Dropped anomalous records where delivery dates were completely missing.
Handling Missing Values: Applied a predefined ColumnTransformer. Used SimpleImputer(strategy='median') for numerical columns and SimpleImputer(strategy='constant', fill_value='unknown') for categoricals.
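The two steps above can be sketched roughly as follows. The tables are tiny toy stand-ins, the join keys (order_id, customer_id, seller_id) and the customer_state column are assumed from the public Olist schema, and the column groupings passed to the ColumnTransformer are hypothetical; the notebook's actual lists may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy stand-ins for the Olist tables (assumed join keys, made-up values).
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
    "order_delivered_customer_date": ["2018-01-20", None, "2018-02-02"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "seller_id": ["s1", "s1", "s2"],
    "price": [10.0, 25.0, np.nan],
    "freight_value": [5.0, 7.0, 9.0],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_state": pd.Series(["SP", "RJ", np.nan], dtype=object),
})
sellers = pd.DataFrame({"seller_id": ["s1", "s2"],
                        "seller_zip_code_prefix": [1000, 2000]})

# Build the unified view, then drop orders with no delivery date.
unified = (
    orders.merge(order_items, on="order_id", how="inner")
    .merge(customers, on="customer_id", how="left")
    .merge(sellers, on="seller_id", how="left")
    .dropna(subset=["order_delivered_customer_date"])
    .reset_index(drop=True)
)

numeric_cols = ["price", "freight_value"]   # hypothetical grouping
categorical_cols = ["customer_state"]       # hypothetical grouping

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", SimpleImputer(strategy="constant", fill_value="unknown"),
         categorical_cols),
    ]
)
imputed = preprocessor.fit_transform(unified)
print(imputed)
```

In a real pipeline the preprocessor would be fitted on the training split only, so that medians are not computed from held-out data.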

Feature Engineering

Key features engineered to capture logistical realities:

Distance Proxies: customer_zip_code_prefix and seller_zip_code_prefix were used as coarse proxies for the shipping distance between seller and customer.
Time/Date Features: Extracted purchase_month, purchase_day, order_purchase_hour, and introduced an is_weekend flag indicating when an order was originally placed.
Logistics Dynamics: Engineered a time_to_ship numerical feature representing the duration between order approval and handoff to the carrier.
Delivery Estimates: Defined the primary target is_delayed as 1 when the actual delivery timestamp is later than the estimated delivery date, and 0 otherwise.
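A minimal pandas sketch of these features is shown below. The timestamp column names are assumed from the public Olist orders table, add_delay_features is a hypothetical helper name, purchase_day is assumed to mean day of month, and time_to_ship is assumed to be measured in days.

```python
import pandas as pd

TIMESTAMP_COLS = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]

def add_delay_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `orders` with date features and the is_delayed target."""
    out = orders.copy()
    for col in TIMESTAMP_COLS:
        out[col] = pd.to_datetime(out[col])

    purchase = out["order_purchase_timestamp"]
    out["purchase_month"] = purchase.dt.month
    out["purchase_day"] = purchase.dt.day           # assumption: day of month
    out["order_purchase_hour"] = purchase.dt.hour
    out["is_weekend"] = (purchase.dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6

    # Days between order approval and handoff to the carrier.
    shipping_gap = out["order_delivered_carrier_date"] - out["order_approved_at"]
    out["time_to_ship"] = shipping_gap.dt.total_seconds() / 86400

    # Target: 1 if the order arrived after the estimated delivery date.
    out["is_delayed"] = (
        out["order_delivered_customer_date"] > out["order_estimated_delivery_date"]
    ).astype(int)
    return out

toy_orders = pd.DataFrame({
    "order_purchase_timestamp": ["2018-01-06 14:30:00", "2018-01-03 09:00:00"],
    "order_approved_at": ["2018-01-06 15:00:00", "2018-01-03 10:00:00"],
    "order_delivered_carrier_date": ["2018-01-08 15:00:00", "2018-01-04 10:00:00"],
    "order_delivered_customer_date": ["2018-01-20 12:00:00", "2018-01-10 12:00:00"],
    "order_estimated_delivery_date": ["2018-01-18 00:00:00", "2018-01-15 00:00:00"],
})
features = add_delay_features(toy_orders)
print(features[["is_weekend", "order_purchase_hour", "time_to_ship", "is_delayed"]])
```

The first toy order was placed on a Saturday and arrived two days after its estimate, so it is flagged as both a weekend purchase and a delay.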

Exploratory Data Analysis (EDA)

During EDA, we explored macro trends within the shipping data:

Compared delay rates across temporal features (e.g., delays by day of week).
Confirmed the extreme class imbalance, which calls for mitigation such as class weighting.
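One simple way to look at delays by day of week is a grouped mean of the target. The sketch below uses four made-up orders; the purchase_dow column name is an assumption introduced for illustration.

```python
import pandas as pd

# Made-up orders: two placed on Mondays, two on Saturdays.
orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-01-08", "2018-01-06", "2018-01-13"]
    ),
    "is_delayed": [0, 1, 1, 1],
})

# Day name of the purchase date (hypothetical helper column).
orders["purchase_dow"] = orders["order_purchase_timestamp"].dt.day_name()

# Mean of a 0/1 target per group is the share of delayed orders in that group.
delay_rate_by_dow = orders.groupby("purchase_dow")["is_delayed"].mean()
print(delay_rate_by_dow)
```

On real data it is worth reporting group sizes next to the rates, since a high delay rate on a sparsely populated day can be noise.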

Modeling

We trained and evaluated three primary algorithms for this classification task:

Logistic Regression (Baseline): A fast linear model fitted using class_weight='balanced' to establish our baseline evaluation metrics.
Neural Network: A sequential network of fully connected Dense layers with Dropout, intended to capture complex, non-linear logistical patterns.
Random Forest: A robust tree-based ensemble tuned with RandomizedSearchCV. It is particularly effective at capturing feature interactions and does not require feature scaling.
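The baseline and the tuned forest can be set up along the following lines. This is a sketch on synthetic data, not the notebook's code: the search space, the roc_auc scoring choice, the three-fold CV, and the use of class_weight='balanced' on the forest are all assumptions. The neural network is left out to keep the example free of a TensorFlow dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the engineered shipment features.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.92, 0.08], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Baseline: scaled logistic regression with balanced class weights.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(X_train, y_train)

# Hypothetical search space; the notebook's actual grid is not given.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # sample 5 of the 27 possible combinations
    scoring="roc_auc",   # assumption: rank candidates by ROC-AUC
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
forest = search.best_estimator_
print(search.best_params_)
```

Stratifying the split keeps the share of delayed orders roughly equal in the training and test sets, which matters when the minority class is this small.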

Results

Below are the summarized findings across all three models:

Logistic Regression: Served as a solid benchmark. It yielded an ROC-AUC of 0.716 and recalled 61% of all delays; however, it struggled with precision on the delayed class (Precision: 0.15).
Neural Network: A moderate improvement over the baseline. Accuracy remained stable across training epochs, and the model reached an ROC-AUC of 0.74.
Random Forest (Tuned): Our best-performing model overall, with the highest Accuracy (0.912) and ROC-AUC (0.763).

Caption: Comparison of the final evaluation metrics, showing the trade-off between F1-score and ROC-AUC under class imbalance.
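For reference, the metrics reported above can all be computed with scikit-learn. The sketch below uses ten hand-made predictions (not model output from this project) and an assumed 0.5 decision threshold, chosen so that accuracy looks acceptable while precision on the delayed class is poor.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Eight on-time orders (0) and two delayed orders (1), with made-up scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.2, 0.1, 0.4, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)  # assumed decision threshold of 0.5

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # delayed class
    "recall": recall_score(y_true, y_pred),        # delayed class
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),     # uses scores, not labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here one true delay is caught, one is missed, and two on-time orders are flagged by mistake, giving 70% accuracy but only one correct flag in three.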

Model Comparison

Best Performer: Random Forest posted the highest top-line metrics (ROC-AUC: 0.763, Accuracy: 0.912). However, its recall on the delayed class (0.298) lagged well behind the Logistic Regression baseline (0.61). Because roughly 92% of shipments are on time, a model that always predicts on-time would score similar accuracy, so ROC-AUC and recall are the more informative measures here.
Trade-offs (Interpretability vs. Performance): Logistic Regression provides transparent coefficients that show how each feature shifts the predicted odds of a delay, while the Random Forest offers a higher ROC-AUC at some cost to direct explainability, partly offset by its feature importance scores.
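As an illustration of what "transparent coefficients" means in practice, the sketch below fits a logistic regression on synthetic data and converts its coefficients to odds ratios. The data-generating rule (delays driven only by time_to_ship) is an assumption made up for the example, not a finding of this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "time_to_ship": rng.exponential(2.0, n),
    "freight_value": rng.normal(20.0, 5.0, n),
})

# Synthetic target: longer time_to_ship raises delay risk; freight_value does not.
logits = -3.0 + 0.8 * X["time_to_ship"]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize so coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(class_weight="balanced").fit(X_scaled, y)

# exp(coefficient) = multiplicative change in the odds of a delay
# for a one-standard-deviation increase in the feature.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.round(2))
```

An odds ratio well above 1 marks a feature that raises delay risk, and a value near 1 marks one with little effect. These are associations in the fitted model rather than causal effects.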

Key Insights

Freight and Value Dynamics: freight_value and price ranked among the strongest predictors in the Random Forest's feature importances.
Process Bottlenecks: time_to_ship (time spent before carrier handoff) is one of the strongest indicators of eventual downstream delays.
Location Impacts: Geographic distribution, represented via customer_zip_code_prefix, strongly influences a shipment's delay risk.

Caption: Tree-based mappings highlight Freight Value and Price as the most dominant features in predicting transit delay.
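A feature importance ranking like the one described can be read from a fitted forest's feature_importances_ attribute. The sketch below uses synthetic data in which only freight_value drives the label, an assumption made so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "freight_value": rng.normal(20.0, 5.0, n),
    "price": rng.normal(100.0, 30.0, n),
    "is_weekend": rng.integers(0, 2, n),
})

# Synthetic label driven only by freight_value, plus a little noise.
y = (X["freight_value"] + rng.normal(0.0, 1.0, n) > 25.0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances tend to favor features with many distinct values, such as zip-code prefixes, so scikit-learn's permutation_importance on held-out data is a useful cross-check.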

Business Impact

Applying these insights supports several operational and customer-facing improvements:

Early Delay Prediction: Flag high-risk shipments before they leave the warehouse.
Proactive Logistics Planning: Pre-emptively allocate alternative carriers for risk-sensitive hubs.
Customer Notification: Automate alerts that set realistic expectations about upcoming delays.
Route Optimization: Diagnose chronically underperforming zip-code sectors.

How to Run

Clone this repository [Placeholder Link]
Open the primary notebook Shipment_Delay.ipynb
Install required dependencies listed at the top of the notebook (pip install pandas numpy scikit-learn tensorflow matplotlib seaborn)
Ensure the Kaggle API is authenticated to download the Olist dataset
Run all cells sequentially to regenerate graphs and models
