Delays Are Not Random — Logistic Regression

"A delay is not just a late shipment — it's a break in the operational flow. And breaks in operational flow always leave traces in the data."

🎯 Business Problem

Every logistics operation deals with delays. The typical response is reactive: track the KPI, escalate when it's red, firefight until it's green again. This project asks a different question — what if you could score each shipment's probability of being late before it leaves the dock?

With 67.8% of shipments arriving delayed in this dataset, the problem isn't rare. It's structural. And structural problems have structural signals — which is exactly what Logistic Regression is built to find.

📊 Dataset

1,500 shipment records from a logistics operation
Target: Reached.on.Time (binary) — 0 = delayed, 1 = on time
Class balance: 67.8% delayed (imbalanced toward the problem class)
Source: Simulated ERP/TMS data reflecting real logistics feature distributions

Feature	Type	Description
`Warehouse_block`	Categorical	Storage zone (A–F)
`Mode_of_Shipment`	Categorical	Ship / Flight / Road
`Customer_care_calls`	Numerical	Support interactions before delivery
`Customer_rating`	Ordinal	1–5 satisfaction score
`Cost_of_the_Product`	Numerical	USD product value
`Prior_purchases`	Numerical	Customer purchase history
`Product_importance`	Categorical	Low / Medium / High
`Gender`	Categorical	Customer demographic
`Discount_offered`	Numerical	% discount applied
`Weight_in_gms`	Numerical	Shipment weight (grams)

Surprising EDA finding: Ship mode has the highest delay rate (69.2%) — counterintuitive until you account for volume and route complexity.
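
The per-mode delay rate behind a finding like this can be computed with a simple groupby. The snippet below is a minimal sketch on a tiny hypothetical sample, not the notebook's code; it assumes the target column is named `Reached.on.Time` with 0 = delayed, as stated above. Reporting the shipment count next to the rate helps show whether a high rate rests on enough volume to be meaningful.

```python
import pandas as pd

# Tiny hypothetical sample; the real data lives in logistics_shipments_data.csv.
df = pd.DataFrame({
    "Mode_of_Shipment": ["Ship", "Ship", "Ship", "Flight", "Flight", "Road", "Road", "Road"],
    "Reached.on.Time": [0, 0, 1, 0, 1, 1, 0, 1],  # 0 = delayed, 1 = on time
})

# A shipment counts as delayed when the target equals 0.
df["delayed"] = (df["Reached.on.Time"] == 0).astype(int)

# Delay rate and shipment volume per mode, highest delay rate first.
summary = (
    df.groupby("Mode_of_Shipment")["delayed"]
      .agg(delay_rate="mean", shipments="count")
      .sort_values("delay_rate", ascending=False)
)
print(summary)
```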

🤖 Model

Algorithm: Logistic Regression — sklearn.linear_model.LogisticRegression

The choice is deliberate. Before reaching for a complex ensemble, the right question is: can a simple, interpretable model explain this problem well enough to act on? In logistics, a model that operations teams can read and trust is usually worth more than a black box with 2% higher accuracy.

Logistic Regression outputs delay probability (0–1), not just binary pass/fail — enabling risk-based prioritization across the shipment queue.

Preprocessing: StandardScaler on numerics, Label Encoding for categoricals, all inside a sklearn Pipeline.
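
A preprocessing-plus-model pipeline along these lines might look like the sketch below. It is not the notebook's code. Two assumptions are worth flagging: the data is synthetic (random values under a subset of the column names from the feature table), and `OrdinalEncoder` stands in for label encoding, because sklearn's `LabelEncoder` is designed for target labels and does not slot into a `ColumnTransformer` over several feature columns. The last lines show how the probability output can be used to rank shipments by delay risk.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data (assumption): random values, no real signal.
X = pd.DataFrame({
    "Customer_care_calls": rng.integers(1, 8, n),
    "Discount_offered": rng.integers(0, 60, n),
    "Weight_in_gms": rng.integers(1000, 7000, n),
    "Mode_of_Shipment": rng.choice(["Ship", "Flight", "Road"], n),
    "Product_importance": rng.choice(["Low", "Medium", "High"], n),
})
reached_on_time = (rng.random(n) >= 0.678).astype(int)  # 0 = delayed, 1 = on time

numeric = ["Customer_care_calls", "Discount_offered", "Weight_in_gms"]
categorical = ["Mode_of_Shipment", "Product_importance"]

# Scale numeric columns, integer-encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OrdinalEncoder(), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, reached_on_time, test_size=0.25, random_state=0, stratify=reached_on_time
)
model.fit(X_train, y_train)

# predict_proba columns follow classes_; pick the column for class 0 (delayed).
delayed_col = list(model.named_steps["clf"].classes_).index(0)
delay_prob = model.predict_proba(X_test)[:, delayed_col]

# Rank held-out shipments from highest to lowest delay risk.
ranked = X_test.assign(delay_prob=delay_prob).sort_values("delay_prob", ascending=False)
print(ranked.head())
```

Keeping the scaler and encoder inside the pipeline means they are fit on the training split only, which avoids leaking test-set statistics into the model.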

📈 Key Results

Metric	Value
Accuracy	67.3%
Precision (Delay)	75.8%
Recall (Delay)	81.4%
F1 (Delay)	0.785

Why Recall matters here: Missing a delay costs more than a false alarm. At 81.4% recall, the model flags roughly 4 out of 5 actual delays before they happen.
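
As a quick consistency check, the F1 score in the table follows directly from the reported precision and recall. The sketch below uses only the figures above; nothing is recomputed from the data.

```python
precision = 0.758  # reported precision for the delay class
recall = 0.814     # reported recall for the delay class

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 (delay) = {f1:.3f}")  # prints: F1 (delay) = 0.785

# Share of actual delays the model does not flag.
miss_rate = 1 - recall
print(f"Missed delays = {miss_rate:.1%}")  # prints: Missed delays = 18.6%
```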

🔍 Top Delay Drivers (Log-Odds Coefficients)

Feature	Coefficient	Direction
`Customer_care_calls`	+0.82	🔴 Risk factor
`Discount_offered`	+0.67	🔴 Risk factor
`Prior_purchases`	−0.54	🔵 Protective
`Cost_of_the_Product`	−0.41	🔵 Protective

More customer care calls = more likely delayed. Loyal customers (high prior purchases) see fewer delays. High-value products get handled more carefully. These aren't surprises — they're quantified.
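
Log-odds coefficients are easier to communicate as odds ratios. The sketch below exponentiates the coefficients from the table, taking the table's sign convention (positive raises the odds of delay) at face value. Since the numeric features are standardized, each ratio is assumed to describe the effect of a one standard deviation increase, not a one unit increase.

```python
import math

# Log-odds coefficients as reported in the table above.
coefficients = {
    "Customer_care_calls": 0.82,
    "Discount_offered": 0.67,
    "Prior_purchases": -0.54,
    "Cost_of_the_Product": -0.41,
}

# exp(coefficient) is the multiplicative change in the odds for a one-unit
# increase in the (scaled) feature, holding the other features fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<22} odds ratio = {ratio:.2f}")
```

Read this way, a ratio above 1 multiplies the odds of delay and a ratio below 1 shrinks them. For example, the `Customer_care_calls` coefficient of +0.82 corresponds to odds of delay about 2.27 times higher per standard deviation of additional calls.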

🗂️ Repository Structure

Delays_Are_Not_Random/
├── 01_Logistic_Regression_Logistics.ipynb  # Notebook (no outputs)
├── logistics_shipments_data.csv            # Sample dataset (250 rows)
├── README.md
└── requirements.txt

📦 Full Project Pack — complete dataset (1,500 rows), notebook with full outputs, presentation deck (PPTX + PDF), and app.py simulator available on Gumroad.

🚀 How to Run

Option 1 — Google Colab: Click the badge above.

Option 2 — Local:

pip install -r requirements.txt
jupyter notebook 01_Logistic_Regression_Logistics.ipynb

💡 Key Learnings

Coefficients are storytelling tools — they translate model output into language operations managers actually understand.
Recall > Accuracy in logistics — with a 67.8% base delay rate, a naive classifier that always predicts "delayed" scores 67.8% accuracy without learning anything. Recall on the delay class is the real signal.
Customer care calls are a proxy for frustration — high call volume before delivery is the model's strongest feature, and it makes operational sense.
Simple models deployed beat complex models in notebooks — Logistic Regression is often good enough, and always explainable.
EDA validates model logic — if the most important features don't make physical sense, the model is learning noise. Here they do.
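
The accuracy baseline mentioned in the learnings can be reproduced with sklearn's `DummyClassifier`. This is a minimal sketch on synthetic labels that match the stated class balance (1,500 rows, 67.8% delayed); it is not taken from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels matching the stated class balance (assumption).
n_total = 1500
n_delayed = round(n_total * 0.678)  # 1017 delayed shipments
y = np.array([0] * n_delayed + [1] * (n_total - n_delayed))  # 0 = delayed, 1 = on time
X = np.zeros((n_total, 1))  # features are ignored by the dummy model

# Always predict the most frequent class, which here is "delayed".
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any reported accuracy should be read against this baseline, which is why the per-class precision and recall carry more information here than accuracy alone.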

👤 Author

Luis Lozano | Operational Excellence Manager · Master Black Belt · Machine Learning
GitHub: LozanoLsa · Gumroad: lozanolsa.gumroad.com

Turning Operations into Predictive Systems — Clone it. Fork it. Improve it.

