The goal of this project is to predict whether an e-commerce shipment will be delayed or delivered on time. Shipment delays directly affect customer satisfaction and operational costs. By framing this challenge as a machine learning task, we can build proactive systems that identify high-risk shipments early and allow intervention.
This project uses the Olist Brazilian E-Commerce Dataset, provided via Kaggle. The dataset contains information on 100k+ orders made at multiple marketplaces in Brazil. Its extensive schema spans features such as order statuses, price, payment metrics, product attributes, customer locations, and detailed delivery timestamps.
We frame delay prediction as a binary classification problem:
- Class 0: On-time
- Class 1: Delayed
Challenge (Class Imbalance): The dataset is heavily skewed toward the on-time class (Class 0). Approximately 92.1% of shipments were on time, and only 7.9% were delayed.
Caption: The vast majority of orders arrive on time, highlighting the difficulty of modeling the minority (delayed) class without proactive handling.
- Cleaning: Merged multiple dataset tables (`orders`, `order_items`, `customers`, `sellers`) to build a unified view. Dropped anomalous records where delivery dates were completely missing.
- Handling Missing Values: Applied a predefined `ColumnTransformer`. Used `SimpleImputer(strategy='median')` for numerical columns and `SimpleImputer(strategy='constant', fill_value='unknown')` for categoricals.
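The cleaning and imputation steps above might look roughly like the sketch below. The small in-memory frames stand in for the Olist CSV tables, and the column split (`numeric_cols`, `categorical_cols`), the join keys, and the one-hot encoding step are illustrative assumptions rather than the notebook's exact configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the Olist tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
    "order_delivered_customer_date": ["2018-01-10", None, "2018-02-03"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "seller_id": ["s1", "s1", "s2"],
    "price": [59.9, 120.0, np.nan],
    "freight_value": [13.3, np.nan, 8.7],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_state": ["SP", "RJ", np.nan],
})
sellers = pd.DataFrame({
    "seller_id": ["s1", "s2"],
    "seller_state": ["SP", "MG"],
})

# Merge into one row per order item, then drop rows with no delivery date.
df = (
    orders.merge(order_items, on="order_id", how="inner")
    .merge(customers, on="customer_id", how="left")
    .merge(sellers, on="seller_id", how="left")
    .dropna(subset=["order_delivered_customer_date"])
)

numeric_cols = ["price", "freight_value"]              # assumed column split
categorical_cols = ["customer_state", "seller_state"]  # assumed column split

preprocessor = ColumnTransformer([
    # Median imputation for numerical columns.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Constant "unknown" for categoricals, then one-hot encoding (assumed step).
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocessor.fit_transform(df)
print(X.shape)
```

In a real pipeline the preprocessor should be fitted on the training split only, so that medians are not computed from test data.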
Key features engineered to capture logistical realities:
- Distance Proxies: `customer_zip_code_prefix` and `seller_zip_code_prefix` were used to approximate geo-logistical distance.
- Time/Date Features: Extracted `purchase_month`, `purchase_day`, and `order_purchase_hour`, and introduced an `is_weekend` flag indicating whether the order was placed on a weekend.
- Logistics Dynamics: Engineered a `time_to_ship` numerical feature representing the duration between order approval and handoff to the carrier.
- Delivery Estimates: Formulated the primary target `is_delayed` by checking whether the actual delivery timestamp exceeded the estimated delivery date.
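As a rough illustration of the time-based features and the target, the sketch below derives them with pandas on two made-up orders. The timestamp column names are assumed to match the public Olist orders table. Treating `purchase_day` as day of month and measuring `time_to_ship` in days are assumptions as well.

```python
import pandas as pd

# Two made-up orders; timestamp column names are assumed from the Olist schema.
orders = pd.DataFrame({
    "order_purchase_timestamp": ["2018-03-03 22:15:00", "2018-03-05 09:30:00"],
    "order_approved_at": ["2018-03-03 22:30:00", "2018-03-05 10:00:00"],
    "order_delivered_carrier_date": ["2018-03-05 22:30:00", "2018-03-06 10:00:00"],
    "order_delivered_customer_date": ["2018-03-20 12:00:00", "2018-03-10 12:00:00"],
    "order_estimated_delivery_date": ["2018-03-15 00:00:00", "2018-03-18 00:00:00"],
})
for col in list(orders.columns):
    orders[col] = pd.to_datetime(orders[col])

purchased = orders["order_purchase_timestamp"]
orders["purchase_month"] = purchased.dt.month
orders["purchase_day"] = purchased.dt.day          # day of month (assumption)
orders["order_purchase_hour"] = purchased.dt.hour
orders["is_weekend"] = (purchased.dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6

# Duration between approval and carrier handoff, in days (unit is an assumption).
orders["time_to_ship"] = (
    orders["order_delivered_carrier_date"] - orders["order_approved_at"]
).dt.total_seconds() / 86400

# Target: 1 if the order arrived after the estimated delivery date, else 0.
orders["is_delayed"] = (
    orders["order_delivered_customer_date"] > orders["order_estimated_delivery_date"]
).astype(int)

print(orders[["is_weekend", "time_to_ship", "is_delayed"]])
```

Here the first order was placed on a Saturday, took two days to reach the carrier, and arrived after its estimate, so it is labelled as delayed.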
During EDA, we explored macro trends within the shipping data:
- Correlated delayed shipments with temporal features (e.g. delays by day-of-week).
- Validated the extreme class imbalance requiring algorithmic intervention (e.g., class weighting and balancing).
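To make the class-weighting idea concrete, the sketch below applies the "balanced" heuristic used by scikit-learn, `n_samples / (n_classes * class_count)`, to the 92.1% / 7.9% split reported earlier. The sample count of 100,000 is a round illustrative number, not the dataset's exact size.

```python
# "Balanced" class weights: n_samples / (n_classes * count_for_class).
n_samples = 100_000                 # illustrative round number (assumption)
counts = {0: 92_100, 1: 7_900}      # on-time vs. delayed, per the reported split

weights = {
    label: n_samples / (len(counts) * count)
    for label, count in counts.items()
}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under these assumptions a delayed example counts roughly 11.7 times as much as an on-time one during training, which is what pushes a weighted model toward higher recall on delays.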
We trained and evaluated three primary algorithms for this classification task:
- Logistic Regression (Baseline): A fast linear model fitted with `class_weight='balanced'` to establish our baseline evaluation metrics.
- Neural Network: A sequential network of fully connected Dense layers with Dropout regularization, intended to capture complex, non-linear logistical patterns.
- Random Forest: A tree-based ensemble tuned with `RandomizedSearchCV`. It is effective at capturing feature interactions and does not require feature scaling.
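A minimal sketch of the two scikit-learn models is shown below, trained on synthetic imbalanced data rather than the Olist features. The search space, `n_iter`, `cv`, and the `roc_auc` scoring choice are assumptions for illustration; the notebook's actual settings may differ. The neural network is left out to keep the example dependency-light.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the engineered shipment features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: linear model with balanced class weights.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_train, y_train)

# Random Forest tuned by randomized search (search space is an assumption).
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": [None, 5, 10],
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
forest = search.best_estimator_

print("Best parameters:", search.best_params_)
```

Stratifying the split keeps the minority-class share similar in the training and test sets, which matters when only a small fraction of rows are delayed.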
Below are the summarized findings across all three models:
- Logistic Regression: Served as a solid benchmark. It yielded an ROC-AUC of 0.716 and recalled 61% of all delays, but its precision on the delayed class was low (0.15).
- Neural Network: Improved moderately on the baseline, reaching an ROC-AUC of 0.74 with accuracy that stayed stable across training epochs.
- Random Forest (Tuned): Our best-performing model overall, with the highest Accuracy (0.912) and ROC-AUC (0.763).
Caption: A comparative glance over the final Evaluation Metrics demonstrating trade-offs between F1-Score and ROC-AUC for class imbalance.
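For reference, the F1-score combines precision and recall as their harmonic mean. The short calculation below plugs in the Logistic Regression figures reported above (precision 0.15, recall 0.61). It is a worked example derived from those two numbers, not a value taken from the notebook. It also records the accuracy a trivial always-on-time predictor would reach, assuming the evaluation split mirrors the overall 92.1% on-time rate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Logistic Regression figures for the delayed class, as reported above.
lr_f1 = f1_score(precision=0.15, recall=0.61)

# Accuracy of always predicting "on-time" if 92.1% of shipments are on time
# (assumes the evaluation split has the same class ratio as the full dataset).
majority_baseline_accuracy = 0.921

print(f"Logistic Regression F1 (delayed class): {lr_f1:.3f}")
print(f"Always-on-time accuracy: {majority_baseline_accuracy:.3f}")
```

Because a trivial predictor already scores about 0.92 on accuracy under that assumption, ROC-AUC, precision, and recall are the more informative numbers when comparing these models.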
- Best Performer: Random Forest had the highest top-line metrics (ROC-AUC: 0.763, Accuracy: 0.912). However, its recall on the delayed class (0.298) lagged behind the Logistic Regression baseline (0.61), so it misses more of the actual delays.
- Trade-offs (Interpretability vs. Performance): Logistic Regression provides transparent coefficients that show how each feature shifts the predicted delay risk, while the Random Forest offers higher ROC-AUC and accuracy at some cost to direct explainability, though it still exposes `Feature Importance` metrics.
- Freight and Value Dynamics: `freight_value` and `price` were the strongest signals according to the Random Forest's feature importances.
- Process Bottlenecks: `time_to_ship` (time spent before carrier handoff) is among the strongest predictors of downstream delays.
- Location Impacts: Geographic distribution, represented via `customer_zip_code_prefix`, strongly influences a shipment's delay risk.
Caption: Tree-based mappings highlight Freight Value and Price as the most dominant features in predicting transit delay.
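Feature importances of this kind can be read from a fitted Random Forest as sketched below. The data is synthetic and the feature names are generic placeholders, so the resulting ranking says nothing about the real shipment features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (not the Olist columns).
feature_names = [f"feature_{i}" for i in range(5)]
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = (
    pd.Series(forest.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favor continuous or high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check before drawing operational conclusions.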
Applying these machine-learning insights supports several operational and customer-facing improvements:
- Early Delay Prediction: Flag high-risk shipments before they leave the warehouse.
- Proactive Logistics Planning: Pre-emptively allocate alternative carriers for risk-sensitive hubs.
- Customer Notification: Automate communication alerts setting practical expectations for upcoming delays.
- Route Optimization: Diagnose chronically underperforming zip-code sectors.
- Clone this repository [Placeholder Link]
- Open the primary notebook `Shipment_Delay.ipynb`
- Install required dependencies listed at the top of the notebook (`pip install pandas numpy scikit-learn tensorflow matplotlib seaborn`)
- Ensure the Kaggle API is authenticated to download the Olist dataset
- Run all cells sequentially to regenerate graphs and models