This project performs a Cohort Analysis using the Online Retail II dataset from Kaggle. It analyzes customer purchase behavior over time to identify retention patterns, revenue trends, and average revenue per user (ARPU) by cohort.
📈 Objective
The goal of this analysis is to understand:
How customer activity evolves over their lifetime.
Whether newer cohorts perform better or worse in terms of revenue.
How average revenue per user changes month by month.
Data Cleaning
Removed rows with missing CustomerID.
Calculated Revenue as Quantity × Price.
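The two cleaning steps above can be sketched in pandas as follows. This is a minimal illustration on made-up rows, not the project's actual script. The column names `CustomerID`, `Quantity`, and `Price` follow this README and are assumed to match the loaded data; the raw file may spell them differently.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are invented for illustration).
df = pd.DataFrame({
    "CustomerID": [1.0, None, 2.0],
    "Quantity": [2, 1, 3],
    "Price": [2.5, 4.0, 1.0],
})

# Drop rows without a customer identifier: they cannot be assigned to a cohort.
df = df.dropna(subset=["CustomerID"]).copy()

# Revenue per line item is quantity times unit price.
df["Revenue"] = df["Quantity"] * df["Price"]

print(df)
```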
Cohort Creation
Defined the first purchase month (FirstOrderMonth) for each customer.
Calculated the purchase month (InvoiceDateMonth) for each transaction.
Grouped by (FirstOrderMonth, InvoiceDateMonth) to compute:
Unique number of customers.
Total revenue per cohort and month.
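A possible way to build the cohort table described above, shown on a small invented sample. It assumes `InvoiceDate` is already parsed as a datetime column and that `Revenue` has been computed. The output column names `Customers` and `Revenue` are illustrative choices, not taken from the project's script.

```python
import pandas as pd

# Toy transactions (invented values): three customers across two months.
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-02-10", "2010-01-20", "2010-01-25", "2010-02-03",
    ]),
    "Revenue": [10.0, 5.0, 8.0, 2.0, 7.0],
})

# Truncate each invoice date to the first day of its month.
df["InvoiceDateMonth"] = df["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# A customer's cohort is the month of their earliest purchase.
df["FirstOrderMonth"] = df.groupby("CustomerID")["InvoiceDateMonth"].transform("min")

# One row per (cohort, purchase month): unique customers and summed revenue.
cohorts = (
    df.groupby(["FirstOrderMonth", "InvoiceDateMonth"])
      .agg(Customers=("CustomerID", "nunique"), Revenue=("Revenue", "sum"))
      .reset_index()
)

print(cohorts)
```

Using `nunique` rather than `count` matters here: a customer with several invoices in the same month should be counted once.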
Cohort Lifetime
Computed CohortLifetime = difference in months between the order month and the cohort’s first order month.
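One way to compute the month difference is plain calendar arithmetic on the year and month components. The sketch below assumes both columns hold month-start timestamps; the sample dates are invented.

```python
import pandas as pd

# Toy cohort rows (invented): one cohort observed in three different months.
cohorts = pd.DataFrame({
    "FirstOrderMonth": pd.to_datetime(["2010-11-01", "2010-11-01", "2010-11-01"]),
    "InvoiceDateMonth": pd.to_datetime(["2010-11-01", "2010-12-01", "2011-02-01"]),
})

first = cohorts["FirstOrderMonth"].dt
order = cohorts["InvoiceDateMonth"].dt

# Whole calendar months elapsed since the cohort's first order month.
cohorts["CohortLifetime"] = (order.year - first.year) * 12 + (order.month - first.month)

print(cohorts)
```

Calendar arithmetic avoids the rounding that comes from dividing a day count by an average month length.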
Metrics
ARPU (Average Revenue per User): Revenue / Customers
Retention: Number of active customers per cohort over time.
Total Revenue: Sum of revenues per cohort and lifetime month.
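The metrics above reduce to simple column operations once the cohort table exists. The sketch below uses invented numbers. The `RetentionRate` column is an optional addition that this README does not define (the README tracks raw active-customer counts); it is included only to show how counts could be normalized by cohort size.

```python
import pandas as pd

# Toy cohort table (invented values), sorted by cohort and lifetime month.
cohorts = pd.DataFrame({
    "FirstOrderMonth": ["2010-01", "2010-01", "2010-02"],
    "CohortLifetime": [0, 1, 0],
    "Customers": [4, 2, 5],
    "Revenue": [100.0, 30.0, 50.0],
})

# ARPU: revenue in a cohort-month divided by the customers active in that month.
cohorts["ARPU"] = cohorts["Revenue"] / cohorts["Customers"]

# Optional (not in the README): share of the cohort's month-0 customers still active.
# Relies on the rows being sorted so that lifetime 0 comes first within each cohort.
cohort_size = cohorts.groupby("FirstOrderMonth")["Customers"].transform("first")
cohorts["RetentionRate"] = cohorts["Customers"] / cohort_size

print(cohorts)
```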
Created heatmaps using Seaborn for:
ARPU over time
Active customers over time
Total revenue per cohort
| Metric | Description | Color Palette |
| --- | --- | --- |
| ARPU Over Time | Shows average revenue per user per cohort. | YlGnBu |
| Number of Active Customers | Retention behavior of each cohort. | Purples |
| Total Revenue by Cohort | Total monthly revenue per cohort. | OrRd |
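A heatmap of this kind can be drawn by pivoting the cohort table so that cohorts form the rows and lifetime months form the columns. The following is a sketch on invented values, not the project's plotting code. The figure size and annotation format are arbitrary choices; only the `YlGnBu` palette comes from this README.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy cohort table (invented values).
cohorts = pd.DataFrame({
    "FirstOrderMonth": ["2010-01", "2010-01", "2010-02"],
    "CohortLifetime": [0, 1, 0],
    "ARPU": [25.0, 15.0, 10.0],
})

# Rows: cohort; columns: months since first order; cells: ARPU.
arpu_pivot = cohorts.pivot_table(
    index="FirstOrderMonth", columns="CohortLifetime", values="ARPU", aggfunc="mean"
)

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(arpu_pivot, annot=True, fmt=".1f", cmap="YlGnBu", ax=ax)
ax.set_title("ARPU Over Time")
ax.set_xlabel("CohortLifetime (months)")
ax.set_ylabel("FirstOrderMonth")
```

Cells for months a cohort has not yet reached are missing in the pivot and are left blank in the plot. The same pattern applies to the other two heatmaps by swapping the `values` column and the `cmap` argument.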
Technologies
Python
Pandas
NumPy
Seaborn / Matplotlib
KaggleHub (to download the dataset directly)
Clone this repository:
git clone https://github.com//cohort-analysis-online-retail.git
cd cohort-analysis-online-retail
Install dependencies:
pip install pandas numpy seaborn matplotlib kagglehub
Run the analysis:
python cohort_analysis_with_the_online_retail_ii_dataset.py