diff --git a/_config.yml b/_config.yml index ea7d07e6d..026bd5afd 100644 --- a/_config.yml +++ b/_config.yml @@ -8,22 +8,22 @@ plugins: #------------------------------- # General Site Settings -title: Johnny Hopkins -description: "Hi I'm Johnny, and I'm a Data Scientist. My portfolio focuses on interesting projects I've recently undertaken, with a strong emphasis on business impact. Please visit my Github & LinkedIn pages (or download my Resume) by using the links below!" +title: Charles Goodaker +description: "Hi I'm Charles, and I'm a Data Scientist. My portfolio focuses on projects I've recently undertaken, with a strong emphasis on business impact. Please visit my Github & LinkedIn pages (or download my Resume) by using the links below!" baseurl: "" # the subpath of your site, e.g. /blog url: "" # the base hostname & protocol for your site, e.g. http://example.com #------------------------------- # About Section subtitle: Data Science Portfolio -location: "London, UK" +location: "Canton, GA" resume_url: /docs/resume.docx -avatar_image: /img/profile_picture.JPG +avatar_image: /img/CAF9BD2A-23C8-4235-BE2F-89315A9439E7_1_105_c.jpeg #------------------------------- # Contact links -linkedln: "https://linkedln.com/#" # Add your linkedln handle -github: "https://github.com/#" # Add your github handle +linkedln: "https://linkedin.com/in/charles-goodaker-1b9a86b" # LinkedIn handle +github: "https://github.com/cgoodaker" # Add your github handle paginate: 6 paginate_path: "/page/:num" @@ -52,3 +52,4 @@ defaults: # - vendor/gems/ # - vendor/ruby/ + diff --git a/_posts/2026-03-10-coffee-and-python.md b/_posts/2026-03-10-coffee-and-python.md new file mode 100644 index 000000000..786e67a24 --- /dev/null +++ b/_posts/2026-03-10-coffee-and-python.md @@ -0,0 +1,66 @@ +--- +layout: post +title: Coffee & Python +image: "/posts/coffee_python.jpg" +tags: [Python, Coffee] +--- + +# My first project +## is all about +### how much +#### I LOVE +##### Python & Coffee! 
+ +--- + +Firstly, I love Python so much, here is some code! + +``` +my_love_for_python = 0 +my_python_knowledge = 0 + +for day in lifetime: + my_love_for_python += 1 + my_python_knowledge += 1 +``` + +Just so you really see how much I love Python, here is some code BUT with some colours for keywords & functionality! + +```python +my_love_for_python = 0 +my_python_knowledge = 0 + +for day in lifetime: + my_love_for_python += 1 + my_python_knowledge += 1 +``` + +Here is an **unordered list** showing some things I love about Python + +* For my work + * Data Analysis + * Data Visualisation + * Machine Learning +* For fun + * Deep Learning + * Computer Vision + * Projects about coffee + +Here is an _ordered list_ showing some things I love about coffee + +1. The smell + 1. Especially in the morning, but also at all times of the day! +2. The taste +3. The fact I can run the 100m in approx. 9 seconds after having 4 cups in quick succession + +I love Python & Coffee so much, here is that picture from the top of my project AGAIN, but this time, in the BODY of my project! + +![alt text](/img/posts/coffee_python.jpg "Coffee & Python - I love them!") + +The above image is just linked to the actual file in my Github, but I could also link to images online, using the URL! + +A line break, like this one below - helps me make sense of what I'm reading, especially when I've had so much coffee that my vision goes a little blurry + +--- + +I could also add things to my project like links, tables, quotes, and HTML blocks - but I'm starting to get a cracking headache. Must be coffee time. 
diff --git a/_posts/2021-06-09-Finding-Prime-Numbers-With-Python.md b/_posts/2026-04-30-Finding-Prime-Numbers-With-Python.md similarity index 100% rename from _posts/2021-06-09-Finding-Prime-Numbers-With-Python.md rename to _posts/2026-04-30-Finding-Prime-Numbers-With-Python.md diff --git a/_posts/2026-05-19-chi-square-test.md b/_posts/2026-05-19-chi-square-test.md new file mode 100644 index 000000000..11f2902fa --- /dev/null +++ b/_posts/2026-05-19-chi-square-test.md @@ -0,0 +1,322 @@ +--- +layout: post +title: Assessing Campaign Performance Using Chi-Square Test For Independence +image: "/posts/ab-testing-title-img.png" +tags: [AB Testing, Hypothesis Testing, Chi-Square, Python] +--- + +In this project we apply Chi-Square Test For Independence (a Hypothesis Test) to assess the performance of two types of mailers that were sent out to promote a new service! + +# Table of contents + +- [00. Project Overview](#overview-main) + - [Context](#overview-context) + - [Actions](#overview-actions) + - [Results & Discussion](#overview-results) +- [01. Concept Overview](#concept-overview) +- [02. Data Overview & Preparation](#data-overview) +- [03. Applying Chi-Square Test For Independence](#chi-square-application) +- [04. Analysing The Results](#chi-square-results) +- [05. Discussion](#discussion) + +___ + +# Project Overview + +### Context + +Earlier in the year, our client, a grocery retailer, ran a campaign to promote their new "Delivery Club" - an initiative that costs a customer $100 per year for membership, but offers free grocery deliveries rather than the normal cost of $10 per delivery. + +For the campaign promoting the club, customers were put randomly into three groups - the first group received a low quality, low cost mailer, the second group received a high quality, high cost mailer, and the third group were a control group, receiving no mailer at all. 
+ +The client knows that customers who were contacted, signed up for the Delivery Club at a far higher rate than the control group, but now want to understand if there is a significant difference in signup rate between the cheap mailer and the expensive mailer. This will allow them to make more informed decisions in the future, with the overall aim of optimising campaign ROI! + +
+
+### Actions
+
+For this test, as it is focused on comparing the *rates* of two groups, we applied the Chi-Square Test For Independence. Full details of this test can be found in the dedicated section below.
+
+**Note:** Another option when comparing "rates" is a test known as the *Z-Test For Proportions*. While we could absolutely use this test here, we have chosen the Chi-Square Test For Independence because:
+
+* Both tests lead to the same conclusion: for two groups, the Chi-Square Statistic (with no continuity correction) is the square of the Z statistic, and the two-sided p-values are identical
+* The Chi-Square Test can be represented using 2x2 tables of data - meaning it can be easier to explain to stakeholders
+* The Chi-Square Test can extend out to more than 2 groups - meaning the client can have one consistent approach to measuring significance
+
+From the *campaign_data* table in the client database, we isolated customers that received "Mailer 1" (low cost) and "Mailer 2" (high cost) for this campaign, and excluded customers who were in the control group.
+
+We set out our hypotheses and Acceptance Criteria for the test, as follows:
+
+**Null Hypothesis:** There is no relationship between mailer type and signup rate. They are independent.
+**Alternate Hypothesis:** There is a relationship between mailer type and signup rate. They are not independent.
+**Acceptance Criteria:** 0.05
+
+As a requirement of the Chi-Square Test For Independence, we aggregated this data down to a 2x2 matrix for *signup_flag* by *mailer_type* and fed this into the algorithm (using the *scipy* library) to calculate the Chi-Square Statistic, p-value, Degrees of Freedom, and expected values.
+
+
+ +### Results & Discussion + +Based upon our observed values, we can give this all some context with the sign-up rate of each group. We get: + +* Mailer 1 (Low Cost): **32.8%** signup rate +* Mailer 2 (High Cost): **37.8%** signup rate + +However, the Chi-Square Test gives us the following statistics: + +* Chi-Square Statistic: **1.94** +* p-value: **0.16** + +The Critical Value for our specified Acceptance Criteria of 0.05 is **3.84** + +Based upon these statistics, we retain the null hypothesis, and conclude that there is no relationship between mailer type and signup rate. + +In other words - while we saw that the higher cost Mailer 2 had a higher signup rate (37.8%) than the lower cost Mailer 1 (32.8%) it appears that this difference is not significant, at least at our Acceptance Criteria of 0.05. + +Without running this Hypothesis Test, the client may have concluded that they should always look to go with higher cost mailers - and from what we've seen in this test, that may not be a great decision. It would result in them spending more, but not *necessarily* gaining any extra revenue as a result + +Our results here also do not say that there *definitely isn't a difference between the two mailers* - we are only advising that we should not make any rigid conclusions *at this point*. + +Running more A/B Tests like this, gathering more data, and then re-running this test may provide us, and the client more insight! + +
+
+ +___ + +# Concept Overview + +
+#### A/B Testing + +An A/B Test can be described as a randomised experiment containing two groups, A & B, that receive different experiences. Within an A/B Test, we look to understand and measure the response of each group - and the information from this helps drive future business decisions. + +Application of A/B testing can range from testing different online ad strategies, different email subject lines when contacting customers, or testing the effect of mailing customers a coupon, vs a control group. Companies like Amazon are running these tests in an almost never-ending cycle, testing new website features on randomised groups of customers...all with the aim of finding what works best so they can stay ahead of their competition. Reportedly, Netflix will even test different images for the same movie or show, to different segments of their customer base to see if certain images pull more viewers in. + +
+#### Hypothesis Testing + +A Hypothesis Test is used to assess the plausibility, or likelihood of an assumed viewpoint based on sample data - in other words, it helps us assess whether a certain view we have about some data is likely to be true or not. + +There are many different scenarios we can run Hypothesis Tests on, and they all have slightly different techniques and formulas - however they all have some shared, fundamental steps & logic that underpin how they work. + +
+**The Null Hypothesis** + +In any Hypothesis Test, we start with the Null Hypothesis. The Null Hypothesis is where we state our initial viewpoint, and in statistics, and specifically Hypothesis Testing, our initial viewpoint is always that the result is purely by chance or that there is no relationship or association between two outcomes or groups + +
+**The Alternate Hypothesis** + +The aim of the Hypothesis Test is to look for evidence to support or reject the Null Hypothesis. If we reject the Null Hypothesis, that would mean we’d be supporting the Alternate Hypothesis. The Alternate Hypothesis is essentially the opposite viewpoint to the Null Hypothesis - that the result is *not* by chance, or that there *is* a relationship between two outcomes or groups + +
+**The Acceptance Criteria** + +In a Hypothesis Test, before we collect any data or run any numbers - we specify an Acceptance Criteria. This is a p-value threshold at which we’ll decide to reject or support the null hypothesis. It is essentially a line we draw in the sand saying "if I was to run this test many many times, what proportion of those times would I want to see different results come out, in order to feel comfortable, or confident that my results are not just some unusual occurrence" + +Conventionally, we set our Acceptance Criteria to 0.05 - but this does not have to be the case. If we need to be more confident that something did not occur through chance alone, we could lower this value down to something much smaller, meaning that we only come to the conclusion that the outcome was special or rare if it’s extremely rare. + +So to summarise, in a Hypothesis Test, we test the Null Hypothesis using a p-value and then decide its fate based on the Acceptance Criteria. + +
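One way to make the Acceptance Criteria concrete is to simulate many experiments where the Null Hypothesis is true and count how often the p-value still falls below the threshold. The sketch below is only an illustration under my own assumptions: it uses a fair coin rather than the campaign data, a normal approximation for the p-value, and hypothetical function names.

```python
import random
from math import erfc, sqrt


def two_sided_p_value(heads: int, flips: int) -> float:
    """p-value for 'the coin is fair', using the normal approximation."""
    expected_heads = flips / 2
    std_dev = sqrt(flips) / 2  # sqrt(n * 0.5 * 0.5)
    z = abs(heads - expected_heads) / std_dev
    return erfc(z / sqrt(2))  # P(|Z| >= z) for a standard normal Z


def false_rejection_rate(acceptance_criteria: float,
                         experiments: int = 2000,
                         flips: int = 400,
                         seed: int = 42) -> float:
    """Share of experiments on a genuinely fair coin that reject the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if two_sided_p_value(heads, flips) <= acceptance_criteria:
            rejections += 1
    return rejections / experiments


# the null hypothesis is true in every simulated experiment
rate_at_05 = false_rejection_rate(0.05)
rate_at_01 = false_rejection_rate(0.01)

print(f"rejected at 0.05: {rate_at_05:.1%} of experiments")
print(f"rejected at 0.01: {rate_at_01:.1%} of experiments")
```

With a threshold of 0.05, roughly one experiment in twenty rejects the Null Hypothesis even though nothing is going on. Lowering the threshold makes these false alarms rarer, at the cost of needing stronger evidence before we act.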
+**Types Of Hypothesis Test** + +There are many different types of Hypothesis Tests, each of which is appropriate for use in differing scenarios - depending on a) the type of data that you’re looking to test and b) the question that you’re asking of that data. + +In the case of our task here, where we are looking to understand the difference in sign-up *rate* between two groups - we will utilise the Chi-Square Test For Independence. + +
+#### Chi-Square Test For Independence
+
+The Chi-Square Test For Independence is a type of Hypothesis Test that assumes observed frequencies for categorical variables will match the expected frequencies.
+
+The *assumption* is the Null Hypothesis, which as discussed above is always the viewpoint that the two groups will be equal. With the Chi-Square Test For Independence we look to calculate a statistic which, based on the specified Acceptance Criteria, will mean we either reject or retain this initial assumption.
+
+The *observed frequencies* are the true values that we’ve seen.
+
+The *expected frequencies* are what we would *expect* to see in each cell if the two variables were independent, calculated from the row and column totals of all of the data.
+
+**Note:** Another option when comparing "rates" is a test known as the *Z-Test For Proportions*. While we could absolutely use this test here, we have chosen the Chi-Square Test For Independence because:
+
+* Both tests lead to the same conclusion: for two groups, the Chi-Square Statistic (with no continuity correction) is the square of the Z statistic, and the two-sided p-values are identical
+* The Chi-Square Test can be represented using 2x2 tables of data - meaning it can be easier to explain to stakeholders
+* The Chi-Square Test can extend out to more than 2 groups - meaning the business can have one consistent approach to measuring significance
+
+___
+
+
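To show how observed and expected frequencies turn into a single statistic, here is a minimal hand calculation in plain Python. The counts are made up purely for illustration (they are not the campaign data), and the function names are my own.

```python
def expected_frequencies(observed):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# hypothetical counts: rows are mailer types, columns are [no signup, signup]
observed = [
    [140, 60],  # Mailer 1: 60 of 200 signed up (30%)
    [110, 90],  # Mailer 2: 90 of 200 signed up (45%)
]

expected = expected_frequencies(observed)
statistic = chi_square_statistic(observed)

print(expected)
print(round(statistic, 2))
```

Because 150 of the 400 customers signed up overall, independence would imply 75 signups in each group of 200. The further the observed counts sit from those expected counts, the larger the statistic becomes.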
+# Data Overview & Preparation + +In the client database, we have a *campaign_data* table which shows us which customers received each type of "Delivery Club" mailer, which customers were in the control group, and which customers joined the club as a result. + +For this task, we are looking to find evidence that the Delivery Club signup rate for customers that received "Mailer 1" (low cost) was different to those who received "Mailer 2" (high cost) and thus from the *campaign_data* table we will just extract customers in those two groups, and exclude customers who were in the control group. + +In the code below, we: + +* Load in the Python libraries we require for importing the data and performing the chi-square test (using scipy) +* Import the required data from the *campaign_data* table +* Exclude customers in the control group, giving us a dataset with Mailer 1 & Mailer 2 customers only + +
+```python
+
+# import the required python libraries
+import pandas as pd
+from scipy.stats import chi2_contingency, chi2
+
+# import campaign data
+campaign_data = ...
+
+# remove customers who were in the control group
+campaign_data = campaign_data.loc[campaign_data["mailer_type"] != "Control"]
+
+```
+
+A sample of this data can be seen below:
+
+
+ +| **customer_id** | **campaign_name** | **mailer_type** | **signup_flag** | +|---|---|---|---| +| 74 | delivery_club | Mailer1 | 1 | +| 524 | delivery_club | Mailer1 | 1 | +| 607 | delivery_club | Mailer2 | 1 | +| 343 | delivery_club | Mailer1 | 0 | +| 322 | delivery_club | Mailer2 | 1 | +| 115 | delivery_club | Mailer2 | 0 | +| 1 | delivery_club | Mailer2 | 1 | +| 120 | delivery_club | Mailer1 | 1 | +| 52 | delivery_club | Mailer1 | 1 | +| 405 | delivery_club | Mailer1 | 0 | +| 435 | delivery_club | Mailer2 | 0 | + +
+In the DataFrame we have: + +* customer_id +* campaign name +* mailer_type (either Mailer1 or Mailer2) +* signup_flag (either 1 or 0) + +___ + +
+# Applying Chi-Square Test For Independence + +
+#### State Hypotheses & Acceptance Criteria For Test + +The very first thing we need to do in any form of Hypothesis Test is state our Null Hypothesis, our Alternate Hypothesis, and the Acceptance Criteria (more details on these in the section above) + +In the code below we code these in explicitly & clearly so we can utilise them later to explain the results. We specify the common Acceptance Criteria value of 0.05. + +```python + +# specify hypotheses & acceptance criteria for test +null_hypothesis = "There is no relationship between mailer type and signup rate. They are independent" +alternate_hypothesis = "There is a relationship between mailer type and signup rate. They are not independent" +acceptance_criteria = 0.05 + +``` + +
+#### Calculate Observed Frequencies & Expected Frequencies
+
+As mentioned in the section above, in a Chi-Square Test For Independence, the *observed frequencies* are the true values that we’ve seen, in other words the actual counts per group in the data itself. The *expected frequencies* are what we would *expect* to see based on *all* of the data combined.
+
+The below code:
+
+* Summarises our dataset to a 2x2 matrix for *signup_flag* by *mailer_type*
+* Based on this, calculates the:
+    * Chi-Square Statistic
+    * p-value
+    * Degrees of Freedom
+    * Expected Values
+* Prints out the Chi-Square Statistic & p-value from the test
+* Calculates the Critical Value based upon our Acceptance Criteria & the Degrees Of Freedom
+* Prints out the Critical Value
+
+```python
+
+# aggregate our data to get observed values
+observed_values = pd.crosstab(campaign_data["mailer_type"], campaign_data["signup_flag"]).values
+
+# run the chi-square test
+chi2_statistic, p_value, dof, expected_values = chi2_contingency(observed_values, correction = False)
+
+# print chi-square statistic
+print(chi2_statistic)
+# >> 1.94
+
+# print p-value
+print(p_value)
+# >> 0.16
+
+# find the critical value for our test
+critical_value = chi2.ppf(1 - acceptance_criteria, dof)
+
+# print critical value
+print(critical_value)
+# >> 3.84
+
+```
+
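The call above passes *correction = False*, which switches off Yates' continuity correction. To see what that choice changes, the sketch below computes a 2x2 statistic by hand with and without the correction. The counts are an assumption on my part: I picked them so that the signup rates come out at roughly 32.8% and 37.8%, and they are not taken from the campaign data. The p-value shortcut used here only holds for one Degree of Freedom.

```python
from math import erfc, sqrt


def chi_square_2x2(observed, yates_correction=False):
    """Chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            gap = abs(count - expected)
            if yates_correction:
                # shrink each gap by 0.5 before squaring, never below zero
                gap = max(gap - 0.5, 0.0)
            statistic += gap ** 2 / expected

    # with 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2))
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# illustrative counts: rows are mailer types, columns are [no signup, signup]
observed = [[252, 123], [209, 127]]

plain_stat, plain_p = chi_square_2x2(observed)
yates_stat, yates_p = chi_square_2x2(observed, yates_correction=True)

print(round(plain_stat, 2), round(plain_p, 2))
print(round(yates_stat, 2), round(yates_p, 2))
```

The corrected statistic is always the smaller of the two, so its p-value is larger. In other words, the correction makes the test more conservative about declaring a difference significant.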
+Based upon our observed values, we can give this all some context with the sign-up rate of each group. We get:
+
+* Mailer 1 (Low Cost): **32.8%** signup rate
+* Mailer 2 (High Cost): **37.8%** signup rate
+
+From this, we can see that the higher cost mailer had a higher signup rate in our data. The results from our Chi-Square Test will provide us more information about how confident we can be that this difference is robust, or if it might have occurred by chance.
+
+We have a Chi-Square Statistic of **1.94** and a p-value of **0.16**. The critical value for our specified Acceptance Criteria of 0.05 is **3.84**.
+
+**Note:** When applying the Chi-Square Test above, we use the parameter *correction = False*, which means we are not applying what is known as *Yates' Correction*. This correction is applied by default when the Degrees of Freedom is equal to one, and helps to prevent overestimation of statistical significance in that case.
+
+___
+
+
+# Analysing The Results
+
+At this point we have everything we need to understand the results of our Chi-Square test. Since our resulting p-value of **0.16** is *greater* than our Acceptance Criteria of 0.05, we _retain_ the Null Hypothesis and conclude that there is no significant difference between the signup rates of Mailer 1 and Mailer 2.
+
+We can make the same conclusion based upon our resulting Chi-Square statistic of **1.94** being _lower_ than our Critical Value of **3.84**.
+
+To make this script more dynamic, we can create code to automatically interpret the results and explain the outcome to us...
+
+```python
+
+# print the results (based upon p-value)
+if p_value <= acceptance_criteria:
+    print(f"As our p-value of {p_value} is lower than our acceptance_criteria of {acceptance_criteria} - we reject the null hypothesis, and conclude that: {alternate_hypothesis}")
+else:
+    print(f"As our p-value of {p_value} is higher than our acceptance_criteria of {acceptance_criteria} - we retain the null hypothesis, and conclude that: {null_hypothesis}")
+
+# >> As our p-value of 0.16351 is higher than our acceptance_criteria of 0.05 - we retain the null hypothesis, and conclude that: There is no relationship between mailer type and signup rate. They are independent
+
+
+# print the results (based upon chi-square statistic)
+if chi2_statistic >= critical_value:
+    print(f"As our chi-square statistic of {chi2_statistic} is higher than our critical value of {critical_value} - we reject the null hypothesis, and conclude that: {alternate_hypothesis}")
+else:
+    print(f"As our chi-square statistic of {chi2_statistic} is lower than our critical value of {critical_value} - we retain the null hypothesis, and conclude that: {null_hypothesis}")
+
+# >> As our chi-square statistic of 1.9414 is lower than our critical value of 3.841458820694124 - we retain the null hypothesis, and conclude that: There is no relationship between mailer type and signup rate. They are independent
+
+```
+
+As we can see from the outputs of these print statements, we do indeed retain the null hypothesis. We could not find enough evidence that the signup rates for Mailer 1 and Mailer 2 were different - and thus conclude that there was no significant difference. + +___ + +
+# Discussion + +While we saw that the higher cost Mailer 2 had a higher signup rate (37.8%) than the lower cost Mailer 1 (32.8%) it appears that this difference is not significant, at least at our Acceptance Criteria of 0.05. + +Without running this Hypothesis Test, the client may have concluded that they should always look to go with higher cost mailers - and from what we've seen in this test, that may not be a great decision. It would result in them spending more, but not *necessarily* gaining any extra revenue as a result + +Our results here also do not say that there *definitely isn't a difference between the two mailers* - we are only advising that we should not make any rigid conclusions *at this point*. + +Running more A/B Tests like this, gathering more data, and then re-running this test may provide us, and the client more insight! diff --git a/_posts/2026-05-25-predicting-customer-loyalty.md b/_posts/2026-05-25-predicting-customer-loyalty.md new file mode 100644 index 000000000..bbb3203f0 --- /dev/null +++ b/_posts/2026-05-25-predicting-customer-loyalty.md @@ -0,0 +1,1180 @@ +--- +layout: post +title: Predicting Customer Loyalty Using ML +image: "/posts/regression-title-img.png" +tags: [Customer Loyalty, Machine Learning, Regression, Python] +--- + +My client, a grocery retailer, hired a market research consultancy to append market level customer loyalty information to the database. However, only around 50% of the client's customer base could be tagged, thus the other half did not have this information present. I'll use ML to solve this! + +# Table of contents + +- [00. Project Overview](#overview-main) + - [Context](#overview-context) + - [Actions](#overview-actions) + - [Results](#overview-results) + - [Growth/Next Steps](#overview-growth) + - [Key Definition](#overview-definition) +- [01. Data Overview](#data-overview) +- [02. Modeling Overview](#modelling-overview) +- [03. Linear Regression](#linreg-title) +- [04. 
Decision Tree](#regtree-title) +- [05. Random Forest](#rf-title) +- [06. Modeling Summary](#modelling-summary) +- [07. Predicting Missing Loyalty Scores](#modelling-predictions) +- [08. Growth & Next Steps](#growth-next-steps) + +___ + +# Project Overview + +### Context + +My client, a grocery retailer, hired a market research consultancy to append market level customer loyalty information to the database. However, only around 50% of the client's customer base could be tagged, thus the other half did not have this information present. + +The overall aim of this work is to accurately predict the *loyalty score* for those customers who could not be tagged, enabling my client a clear understanding of true customer loyalty, regardless of total spend volume - and allowing for more accurate and relevant customer tracking, targeting, and comms. + +To achieve this, I looked to build out a predictive model that will find relationships between customer metrics and *loyalty score* for those customers who were tagged, and use this to predict the loyalty score metric for those who were not. +
+
+### Actions + +I firstly needed to compile the necessary data from tables in the database, gathering key customer metrics that may help predict *loyalty score*, appending on the dependent variable, and separating out those who did and did not have this dependent variable present. + +As I am predicting a numeric output, I tested three regression modeling approaches, namely: + +* Linear Regression +* Decision Tree +* Random Forest +
+
+ +### Results + +My testing found that the Random Forest had the highest predictive accuracy. + +
+**Metric 1: Adjusted R-Squared (Test Set)** + +* Random Forest = 0.955 +* Decision Tree = 0.886 +* Linear Regression = 0.754 + +
+**Metric 2: R-Squared (K-Fold Cross Validation, k = 4)** + +* Random Forest = 0.925 +* Decision Tree = 0.871 +* Linear Regression = 0.853 + +As the most important outcome for this project was predictive accuracy, rather than explicitly understanding weighted drivers of prediction, I chose the Random Forest as the model to use for making predictions on the customers who were missing the *loyalty score* metric. +
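For readers unfamiliar with the first metric, Adjusted R-Squared takes the ordinary R-Squared and penalises it for the number of input variables in the model. A minimal sketch of the standard formula is below. The numbers are hypothetical and are not the scores reported above.

```python
def adjusted_r_squared(r_squared: float, n_rows: int, n_features: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_rows - 1) / (n_rows - n_features - 1)


# hypothetical test set of 80 rows with the same raw R-Squared of 0.90
r_squared = 0.90
with_two_features = adjusted_r_squared(r_squared, n_rows=80, n_features=2)
with_eight_features = adjusted_r_squared(r_squared, n_rows=80, n_features=8)

print(round(with_two_features, 3))
print(round(with_eight_features, 3))
```

For the same raw R-Squared, the model that needed more input variables receives the lower adjusted score, which is why this metric is useful when comparing models with different numbers of features.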
+
+### Growth/Next Steps + +While predictive accuracy was relatively high - other modelling approaches could be tested, especially those somewhat similar to Random Forest, for example XGBoost, LightGBM to see if even more accuracy could be gained. + +From a data point of view, further variables could be collected, and further feature engineering could be undertaken to ensure that we have as much useful information available for predicting customer loyalty +
+
+### Key Definition
+
+The *loyalty score* metric measures the proportion (between 0 and 1) of grocery spend (market level) that each customer allocates to the client vs. all of the competitors.
+
+Example 1: Customer X has a total grocery spend of $100 and all of this is spent with our client. Customer X has a *loyalty score* of 1.0
+
+Example 2: Customer Y has a total grocery spend of $200 but only 20% is spent with our client. The remaining 80% is spent with competitors. Customer Y has a *loyalty score* of 0.2
+
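The two examples above can be written as a one-line calculation. The function name below is hypothetical and only illustrates the definition. It is not how the consultancy computed the scores.

```python
def loyalty_score(spend_with_client: float, total_grocery_spend: float) -> float:
    """Share of a customer's total grocery spend that goes to the client."""
    if total_grocery_spend <= 0:
        raise ValueError("total_grocery_spend must be positive")
    return spend_with_client / total_grocery_spend


# Example 1: Customer X spends all $100 with the client
customer_x = loyalty_score(100, 100)

# Example 2: Customer Y spends 20% of $200 (so $40) with the client
customer_y = loyalty_score(40, 200)

print(customer_x, customer_y)
```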
+
+___
+
+# Data Overview
+
+We will be predicting the *loyalty_score* metric. This metric exists (for approximately half of the customer base) in the *loyalty_scores* table of the client database.
+
+The key variables hypothesized to predict the missing loyalty scores will come from the client database, namely the *transactions* table, the *customer_details* table, and the *product_areas* table.
+
+Using pandas in Python, I merged these tables together for all customers, creating a single dataset that we can use for modeling.
+
+```python
+
+# import required packages
+import pandas as pd
+import pickle
+
+# import required data tables
+loyalty_scores = pd.read_excel("data/grocery_data.xlsx", sheet_name = "loyalty_scores")
+customer_details = pd.read_excel("data/grocery_data.xlsx", sheet_name = "customer_details")
+transactions = pd.read_excel("data/grocery_data.xlsx", sheet_name = "transactions")
+# loyalty_scores returns 400 rows, customer_details returns 870 rows
+
+# merge loyalty score data and customer details data, at customer level. customer_details is the base.
+# Later we'll split this into 2 datasets: one for modeling and one for scoring
+data_for_regression = pd.merge(customer_details, loyalty_scores, how = "left", on = "customer_id")
+
+# aggregate sales data from transactions table. Create 4 sales metrics (a 5th is engineered below).
+sales_summary = transactions.groupby("customer_id").agg({"sales_cost" : "sum",
+                                                         "num_items" : "sum",
+                                                         "transaction_id" : "count",
+                                                         "product_area_id" : "nunique"}).reset_index()
+
+# rename columns for clarity
+sales_summary.columns = ["customer_id", "total_sales", "total_items", "transaction_count", "product_area_count"]
+
+# engineer an average basket value column for each customer
+sales_summary["average_basket_value"] = sales_summary["total_sales"] / sales_summary["transaction_count"]
+
+# Overwrite data_for_regression and merge the sales summary with the overall customer data
+data_for_regression = pd.merge(data_for_regression, sales_summary, how = "inner", on = "customer_id")
+# Our new "data_for_regression" has 10 columns including customer_id
+
+# split out data for modeling (loyalty score is present)
+regression_modeling = data_for_regression.loc[data_for_regression["customer_loyalty_score"].notna()]
+
+# split out data for scoring post-modeling (loyalty score is missing)
+regression_scoring = data_for_regression.loc[data_for_regression["customer_loyalty_score"].isna()]
+
+# for scoring set, drop the loyalty score column (as it is blank/redundant)
+# (reassign rather than drop in place, to avoid modifying a slice of data_for_regression)
+regression_scoring = regression_scoring.drop(["customer_loyalty_score"], axis = 1)
+
+# save our datasets for future use
+pickle.dump(regression_modeling, open("data/customer_loyalty_modeling.p", "wb"))
+pickle.dump(regression_scoring, open("data/customer_loyalty_scoring.p", "wb"))
+
+```
+
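To make the merge-then-split logic easier to follow, here is a tiny sketch on made-up data. The four customers and their values are hypothetical. Only the column names mirror the code above.

```python
import pandas as pd

# hypothetical customer base of four, only two of whom were tagged with a score
customer_details = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "credit_score": [0.60, 0.70, 0.50, 0.80],
})
loyalty_scores = pd.DataFrame({
    "customer_id": [1, 3],
    "customer_loyalty_score": [0.90, 0.30],
})

# a left merge keeps every customer, leaving NaN where no score exists
merged = pd.merge(customer_details, loyalty_scores, how="left", on="customer_id")

# customers with a score are used to train and test the model
modeling = merged.loc[merged["customer_loyalty_score"].notna()]

# customers without a score are the ones we want to predict for
scoring = merged.loc[merged["customer_loyalty_score"].isna()]
scoring = scoring.drop(columns=["customer_loyalty_score"])

print(modeling)
print(scoring)
```

Using `how="left"` with the customer table as the base is what guarantees untagged customers survive the merge. An inner merge at this step would silently drop exactly the customers we need to score.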
+After this data pre-processing in Python, we have a dataset for modeling that contains the following fields... +
+
+ +| **Variable Name** | **Variable Type** | **Description** | +|---|---|---| +| loyalty_score | Dependent | The % of total grocery spend that each customer allocates to ABC Grocery vs. competitors | +| distance_from_store | Independent | "The distance in miles from the customers home address, and the store" | +| gender | Independent | The gender provided by the customer | +| credit_score | Independent | The customers most recent credit score | +| total_sales | Independent | Total spend by the customer in ABC Grocery within the latest 6 months | +| total_items | Independent | Total products purchased by the customer in ABC Grocery within the latest 6 months | +| transaction_count | Independent | Total unique transactions made by the customer in ABC Grocery within the latest 6 months | +| product_area_count | Independent | The number of product areas within ABC Grocery the customers has shopped into within the latest 6 months | +| average_basket_value | Independent | The average spend per transaction for the customer in ABC Grocery within the latest 6 months | + +___ +
+# Modeling Overview
+
+I will build a model that looks to accurately predict the “loyalty_score” metric for those customers that were able to be tagged, based upon the customer metrics listed above.
+
+If that can be achieved, I can use this model to predict the customer loyalty score for the customers that were unable to be tagged by the agency.
+
+As I am predicting a numeric output, I tested three regression modeling approaches, namely:
+
+* Linear Regression
+* Decision Tree
+* Random Forest
+
+___
+
+# Linear Regression + +I utilized the scikit-learn library within Python to model our data using Linear Regression. The code sections below are broken up into 4 key sections: + +* Data Import +* Data Preprocessing +* Model Training +* Performance Assessment + +
+### Data Import
+
+Since I saved our modeling data as a pickle file, I'll import it. I will ensure the id column is removed, and I'll also ensure the data is shuffled.
+
+```python
+
+# import required packages
+import pandas as pd
+import pickle
+import matplotlib.pyplot as plt
+from sklearn.linear_model import LinearRegression
+from sklearn.utils import shuffle
+from sklearn.model_selection import train_test_split, cross_val_score, KFold
+from sklearn.metrics import r2_score
+from sklearn.preprocessing import OneHotEncoder
+from sklearn.feature_selection import RFECV
+
+# import modeling data (same filename as the pickle file saved in the Data Overview section)
+data_for_model = pickle.load(open("data/customer_loyalty_modeling.p", "rb"))
+
+# drop unnecessary columns
+data_for_model.drop("customer_id", axis = 1, inplace = True)
+
+# shuffle data
+data_for_model = shuffle(data_for_model, random_state = 42)
+
+```
+
+### Data Preprocessing + +For Linear Regression there are certain data preprocessing steps that need to be addressed, including: + +* Missing values in the data +* The effect of outliers (for Linear) +* Encoding categorical variables to numeric form +* Multicollinearity & Feature Selection + +
+##### Missing Values + +The number of missing values in the data was extremely low, so instead of applying any imputation (e.g. mean, most common value) I'll just drop those rows. + +```python + +# check how many values are missing in each column, then remove rows where values are missing +data_for_model.isna().sum() +data_for_model.dropna(how = "any", inplace = True) + +``` + +
+##### Outliers + +The ability for a Linear Regression model to generalize well across *all* data can be hampered if there are outliers present. There is no right or wrong way to deal with outliers, but it is always something worth very careful consideration - just because a value is high or low, does not necessarily mean it should not be there! + +In this code section, I'll use **.describe()** from Pandas to investigate the spread of values for each of our predictors. The results of this can be seen in the table below. + +
+ +| **metric** | **distance_from_store** | **credit_score** | **total_sales** | **total_items** | **transaction_count** | **product_area_count** | **average_basket_value** | +|---|---|---|---|---|---|---|---| +| mean | 2.02 | 0.60 | 1846.50 | 278.30 | 44.93 | 4.31 | 36.78 | +| std | 2.57 | 0.10 | 1767.83 | 214.24 | 21.25 | 0.73 | 19.34 | +| min | 0.00 | 0.26 | 45.95 | 10.00 | 4.00 | 2.00 | 9.34 | +| 25% | 0.71 | 0.53 | 942.07 | 201.00 | 41.00 | 4.00 | 22.41 | +| 50% | 1.65 | 0.59 | 1471.49 | 258.50 | 50.00 | 4.00 | 30.37 | +| 75% | 2.91 | 0.66 | 2104.73 | 318.50 | 53.00 | 5.00 | 47.21 | +| max | 44.37 | 0.88 | 9878.76 | 1187.00 | 109.00 | 5.00 | 102.34 | + +
+Based on this investigation, the *max* values for several variables are much higher than the *median* value. + +This is the case for the columns *distance_from_store*, *total_sales*, and *total_items*. + +For example, the median *distance_from_store* is 1.645 miles, but the maximum is over 44 miles! + +Because of this, I'll apply some outlier removal in order to facilitate generalization across the full dataset. + +I'll do this using the "boxplot approach", removing any rows where the value in one of those columns sits more than 2 times the interquartile range below the lower quartile or above the upper quartile. + +
+```python + +outlier_investigation = data_for_model.describe() +outlier_columns = ["distance_from_store", "total_sales", "total_items"] + +# boxplot approach +for column in outlier_columns: + + lower_quartile = data_for_model[column].quantile(0.25) + upper_quartile = data_for_model[column].quantile(0.75) + iqr = upper_quartile - lower_quartile + iqr_extended = iqr * 2 + min_border = lower_quartile - iqr_extended + max_border = upper_quartile + iqr_extended + + outliers = data_for_model[(data_for_model[column] < min_border) | (data_for_model[column] > max_border)].index + print(f"{len(outliers)} outliers detected in column {column}") + + data_for_model.drop(outliers, inplace = True) + +``` + +
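As a quick sanity check of the border calculation, here is a minimal sketch that applies the same arithmetic to the (rounded) *distance_from_store* quartiles from the summary table, 0.71 and 2.91. The helper name `iqr_borders` is hypothetical and only introduced for this illustration; it is not part of the project code.

```python
def iqr_borders(lower_quartile, upper_quartile, multiplier=2):
    """Return the (min_border, max_border) pair used by the boxplot approach."""
    # extend the interquartile range by the chosen multiplier
    iqr_extended = (upper_quartile - lower_quartile) * multiplier
    return lower_quartile - iqr_extended, upper_quartile + iqr_extended

# rounded quartiles for distance_from_store, taken from the describe() table
min_border, max_border = iqr_borders(0.71, 2.91)
print(round(min_border, 2), round(max_border, 2))
```

With these rounded quartiles the upper border comes out at roughly 7.31 miles, so the 44.37 mile maximum would be removed, while the lower border is negative and therefore removes nothing at the low end for this column.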
+##### Split Out Data For Modeling + +In the next code block I'll do two things, firstly split the data into an **X** object which contains only the predictor variables, and a **y** object that contains only the dependent variable. + +Once I have done this, I split the data into "training" and "test" sets to ensure I can fairly validate the accuracy of the predictions on data that was not used in training. In this case, I have allocated 80% of the data for training, and the remaining 20% for validation. + +
+```python + +# split data into X and y objects for modeling +X = data_for_model.drop(["customer_loyalty_score"], axis = 1) +y = data_for_model["customer_loyalty_score"] + +# split out training & test sets +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42) + +``` + +
+##### Categorical Predictor Variables + +In the dataset, there is one categorical variable *gender* which has values of "M" for Male, "F" for Female, and "U" for Unknown. + +The Linear Regression algorithm can't deal with data in this format as it can't assign any numerical meaning to it when looking to assess the relationship between the variable and the dependent variable. + +As *gender* doesn't have any explicit *order* to it, in other words, Male isn't higher or lower than Female and vice versa - one appropriate approach is to apply "One Hot Encoding" to the categorical column. + +One Hot Encoding can be thought of as a way to represent categorical variables as binary vectors, in other words, a set of *new* columns for each categorical value with either a 1 or a 0 saying whether that value is true or not for that observation. These new columns would go into our model as input variables, and the original column is discarded. + +I also drop one of the new columns using the parameter *drop = "first"*. This is done to avoid the *dummy variable trap* where the newly created encoded columns perfectly predict each other, running the risk of breaking the assumption that there is no multicollinearity - a requirement or at least an important consideration for some models, Linear Regression being one of them! Multicollinearity occurs when two or more input variables are *highly* correlated with each other. It is a scenario to attempt to avoid. While it won't necessarily affect the predictive accuracy of our model, it can make it difficult to trust the statistics around how well the model is performing, and how much impact each input variable is truly having. + +In the code, I also make sure to apply *fit_transform* to the training set, but only *transform* to the test set. This means the One Hot Encoding logic will *learn and apply* the "rules" from the training data, but only *apply* them to the test data. 
 This is important in order to avoid *data leakage*, where information from the test set influences what is learned during training, which would mean we can't fully trust model performance metrics! + +For ease, after applying One Hot Encoding, training and test objects can be turned back into Pandas DataFrames, with the column names applied. + +
+```python + +# list of categorical variables that need encoding +categorical_vars = ["gender"] + +# instantiate OHE class +one_hot_encoder = OneHotEncoder(sparse_output=False, drop = "first") + +# apply OHE +X_train_encoded = one_hot_encoder.fit_transform(X_train[categorical_vars]) +X_test_encoded = one_hot_encoder.transform(X_test[categorical_vars]) + +# extract feature names for encoded columns (array) +encoder_feature_names = one_hot_encoder.get_feature_names_out(categorical_vars) + +# turn objects back to Pandas DataFrame +X_train_encoded = pd.DataFrame(X_train_encoded, columns = encoder_feature_names) + +# Create a new DataFrame by concatenating the original to the encoded DataFrame +X_train = pd.concat([X_train.reset_index(drop=True), X_train_encoded.reset_index(drop=True)], axis = 1) +# Drop the original (un-encoded) categorical column +X_train.drop(categorical_vars, axis = 1, inplace = True) + +# Test data +X_test_encoded = pd.DataFrame(X_test_encoded, columns = encoder_feature_names) +X_test = pd.concat([X_test.reset_index(drop=True), X_test_encoded.reset_index(drop=True)], axis = 1) +# Drop old "gender" +X_test.drop(categorical_vars, axis = 1, inplace = True) + +``` + +
+##### Feature Selection using RFE (Recursive Feature Elimination) + +Feature Selection is the process used to select the input variables that are most important to your Machine Learning task. It can be a very important addition or at least, consideration, in certain scenarios. The potential benefits of Feature Selection are: + +* **Improved Model Accuracy** - eliminating noise can help true relationships stand out +* **Lower Computational Cost** - our model becomes faster to train, and faster to make predictions +* **Explainability** - understanding & explaining outputs for stakeholders & customers becomes much easier + +There are multiple ways to apply Feature Selection. These range from simple methods such as a *Correlation Matrix* showing variable relationships, to *Univariate Testing* which helps us understand statistical relationships between variables, and then to even more powerful approaches like *Recursive Feature Elimination (RFE)* which is an approach that starts with all input variables and then iteratively removes those with the weakest relationships to the output variable. + +For this task, I applied a variation of Recursive Feature Elimination called *Recursive Feature Elimination With Cross Validation (RFECV)* where the data is split into many "chunks" and iteratively trains & validates models on each "chunk" separately. This means that each time we assess different models with different variables included, or eliminated, the algorithm also knows how accurate each of those models was. From the suite of model scenarios that are created, the algorithm can determine which provided the best accuracy, and thus can infer the best set of input variables to use! + +
+```python + +# instantiate RFECV & the model type to be utilized +regressor = LinearRegression() +feature_selector = RFECV(regressor) + +# fit RFECV onto our training data only +fit = feature_selector.fit(X_train,y_train) + +# extract & print the optimal number of features +optimal_feature_count = feature_selector.n_features_ +print(f"Optimal number of features: {optimal_feature_count}") + +# limit our training & test sets to only include the selected variables +X_train = X_train.loc[:, feature_selector.get_support()] +X_test = X_test.loc[:, feature_selector.get_support()] + +``` + +
+The below code then produces a plot that visualizes the cross-validated accuracy with each potential number of features + +```python + +# note: on matplotlib versions older than 3.6 this style is named 'seaborn-poster' +plt.style.use('seaborn-v0_8-poster') +plt.plot(range(1, len(fit.cv_results_['mean_test_score']) + 1), fit.cv_results_['mean_test_score'], marker = "o") +plt.ylabel("Model Score") +plt.xlabel("Number of Features") +plt.title(f"Feature Selection using RFE \n Optimal number of features is {optimal_feature_count} (at score of {round(max(fit.cv_results_['mean_test_score']),4)})") +plt.tight_layout() +plt.show() + +``` + +
+This creates the below plot, which shows that the highest cross-validated accuracy (0.8635) is actually when all eight of the original input variables are included. This is marginally higher than 6 included variables, and 7 included variables. We will continue on with all 8! + +
+![alt text](/img/posts/lin-reg-feature-selection-plot.png "Linear Regression Feature Selection Plot") + +
+### Model Training + +Instantiating and training the Linear Regression model is done using the below code + +```python + +# instantiate our model object +regressor = LinearRegression() + +# fit our model using our training set +regressor.fit(X_train, y_train) + +``` + +
+### Model Performance Assessment + +##### Predict On The Test Set + +To assess how well the model is predicting on new data - the trained model object (here called *regressor*) is used and asked to predict the *loyalty_score* variable for the test set + +```python + +# predict on the test set +y_pred = regressor.predict(X_test) + +``` + +
+##### Calculate R-Squared + +R-Squared is a metric that shows the percentage of variance in our output variable *y* that is being explained by our input variable(s) *x*. It usually falls between 0 and 1 (on unseen data it can turn negative if the model predicts worse than simply using the mean of *y*), with a higher value showing a higher level of explained variance. Another way of explaining this would be to say that, if we had an r-squared score of 0.8 it would suggest that 80% of the variation of the output variable is being explained by the input variables - and something else, or some other variables must account for the other 20%. + +To calculate r-squared, I use the following code where I pass in the *actual* outputs for the test set (y_test) as well as our *predicted* outputs for the test set (y_pred) + +```python + +# calculate r-squared for our test set predictions +r_squared = r2_score(y_test, y_pred) +print(r_squared) + +``` + +The resulting r-squared score from this is **0.78** + +
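To make the definition a little more concrete, here is a minimal sketch of the arithmetic behind r-squared: one minus the ratio of the squared prediction errors to the squared deviations from the mean. The function name `r_squared_manual` and the toy values are made up for illustration; in the project itself the score comes from scikit-learn's `r2_score`.

```python
def r_squared_manual(actual, predicted):
    """Compute r-squared as 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    # variation the model failed to explain
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # total variation of the actual values around their mean
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

# toy loyalty scores (hypothetical values, not taken from the project data)
actual = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]
print(round(r_squared_manual(actual, predicted), 3))
```

A model that always predicts the mean of the actual values would score 0 under this formula, which is why r-squared is often read as the improvement over that naive baseline.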
+##### Calculate Cross Validated R-Squared + +An even more powerful and reliable way to assess model performance is to utilize Cross Validation. + +Instead of simply dividing the data into a single training set, and a single test set, with Cross Validation the data is broken into a number of "chunks", and the model is iteratively trained on all but one of the "chunks" and tested on the remaining "chunk" until each has had a chance to be the test set. + +The result of this is that a number of test set validation results are provided - and the average of these is calculated to give a much more robust & reliable view of how the model will perform on new, un-seen data! + +In the code below, this is put into place. First, 4 "chunks" are specified, and then we pass in the regressor object, the training inputs (*X_train*), and the training outputs (*y_train*). Also specified is the metric with which to assess, in this case, r-squared. + +Finally, a mean of all four validation results is calculated. + +```python + +# calculate the mean cross validated r-squared across the four folds of the training data +cv = KFold(n_splits = 4, shuffle = True, random_state = 42) +cv_scores = cross_val_score(regressor, X_train, y_train, cv = cv, scoring = "r2") +cv_scores.mean() + +``` + +The mean cross-validated r-squared score from this is **0.853** + +
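To show what the "chunks" look like in practice, here is a simplified sketch of how row indices can be divided into folds. It is an illustration under assumptions rather than what scikit-learn does internally: the helper `k_fold_indices` is hypothetical, and it skips the shuffling that `KFold(shuffle = True)` applies in the project code.

```python
def k_fold_indices(num_rows, num_chunks):
    """Split row indices into contiguous chunks and return (train, test) index pairs."""
    indices = list(range(num_rows))
    base_size, remainder = divmod(num_rows, num_chunks)
    folds = []
    start = 0
    for chunk in range(num_chunks):
        # the first `remainder` chunks take one extra row so every row is used once
        size = base_size + (1 if chunk < remainder else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

# each row lands in the test chunk exactly once across the four folds
for train_idx, test_idx in k_fold_indices(num_rows=10, num_chunks=4):
    print("train:", train_idx, "test:", test_idx)
```

Each of the four models is scored only on the chunk it did not see during training, and the mean of those four scores is the cross-validated figure.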
+##### Calculate Adjusted R-Squared + +When applying Linear Regression with *multiple* input variables, the r-squared metric on its own *can* end up being an overinflated view of goodness of fit. This is because each input variable will have an *additive* effect on the overall r-squared score. In other words, every input variable added to the model can only keep the r-squared value (as measured on the training data) the same or *increase* it, and *never decreases* it, even if the relationship is by chance. + +**Adjusted R-Squared** is a metric that compensates for the addition of input variables, and only increases if the variable improves the model above what would be obtained by probability. It is best practice to use Adjusted R-Squared when assessing the results of a Linear Regression with multiple input variables, as it gives a fairer view of the fit of the data. + +```python + +# calculate adjusted r-squared for our test set predictions +num_data_points, num_input_vars = X_test.shape +adjusted_r_squared = 1 - (1 - r_squared) * (num_data_points - 1) / (num_data_points - num_input_vars - 1) +print(adjusted_r_squared) + +``` + +The resulting *adjusted* r-squared score from this is **0.754** which as expected, is slightly lower than the score we got for r-squared on its own. + +
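To see how the penalty behaves, here is a small sketch that wraps the same formula in a function and compares two scenarios with an identical r-squared but a different number of input variables. The function name and the figure of 100 data points are assumptions chosen purely for illustration, not values from this project.

```python
def adjusted_r_squared(r_squared, num_data_points, num_input_vars):
    """Penalise r-squared for the number of input variables used."""
    return 1 - (1 - r_squared) * (num_data_points - 1) / (num_data_points - num_input_vars - 1)

# same r-squared, hypothetical sample of 100 rows, different numbers of input variables
few_vars = adjusted_r_squared(0.78, num_data_points=100, num_input_vars=8)
many_vars = adjusted_r_squared(0.78, num_data_points=100, num_input_vars=20)
print(round(few_vars, 3), round(many_vars, 3))
```

The second figure comes out lower than the first, which reflects the idea that extra input variables have to earn their place by genuinely improving the fit.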
+### Model Summary Statistics + +Although the overall goal for this project is predictive accuracy, rather than an explicit understanding of the relationships of each of the input variables and the output variable, it is always interesting to look at the summary statistics for these. +
+```python + +# extract model coefficients and create a DataFrame +coefficients = pd.DataFrame(regressor.coef_) +# Make DataFrame more useful by adding names +input_variable_names = pd.DataFrame(X_train.columns) +summary_stats = pd.concat([input_variable_names,coefficients], axis = 1) +summary_stats.columns = ["input_variable", "coefficient"] + +# Values in the DataFrame will make up the values going into the equation for the "line of best fit" or +# technically the "plane of best fit" + +# extract model intercept +regressor.intercept_ + +``` +
+The information from that code block can be found in the table below: +
+ +| **input_variable** | **coefficient** | +|---|---| +| intercept | 0.516 | +| distance_from_store | -0.201 | +| credit_score | -0.028 | +| total_sales | 0.000 | +| total_items | 0.001 | +| transaction_count | -0.005 | +| product_area_count | 0.062 | +| average_basket_value | -0.004 | +| gender_M | -0.013 | + +
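As a rough illustration of how these coefficients combine into a prediction, the sketch below plugs the rounded values from the table into the "plane of best fit" equation. The customer values are hypothetical, and because the coefficients are rounded to three decimals (total_sales shows as 0.000), the output is only an approximation of what the fitted `regressor` would return.

```python
# rounded coefficients copied from the table above
intercept = 0.516
coefficients = {
    "distance_from_store": -0.201,
    "credit_score": -0.028,
    "total_sales": 0.000,
    "total_items": 0.001,
    "transaction_count": -0.005,
    "product_area_count": 0.062,
    "average_basket_value": -0.004,
    "gender_M": -0.013,
}

def predict_loyalty(customer):
    """Intercept plus the sum of each coefficient multiplied by the customer's value."""
    return intercept + sum(coefficients[name] * value for name, value in customer.items())

# hypothetical customer, made up for this example
customer = {
    "distance_from_store": 1.0, "credit_score": 0.6, "total_sales": 1500.0,
    "total_items": 250, "transaction_count": 50, "product_area_count": 4,
    "average_basket_value": 30.0, "gender_M": 1,
}

# same customer, living one mile further away, everything else held constant
further_away = dict(customer, distance_from_store=customer["distance_from_store"] + 1)
print(round(predict_loyalty(further_away) - predict_loyalty(customer), 3))
```

Holding every other input constant, the difference between the two predictions is simply the *distance_from_store* coefficient.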
+The coefficient value for each of the input variables, along with that of the intercept would make up the equation for the line of best fit for this particular model (or more accurately, in this case it would be the plane of best fit, as we have multiple input variables). + +For each input variable, the coefficient value above tells us, with *everything else staying constant*, how many units the output variable (loyalty score) would change with a *one unit change* in this particular input variable. + +To provide an example of this using the table above, we can see that the *distance_from_store* input variable has a coefficient value of -0.201. This is saying that *loyalty_score* decreases by 0.201 (or roughly 20 percentage points, as loyalty score is a percentage, or at least a decimal value between 0 and 1) for *every additional mile* that a customer lives from the store. This makes intuitive sense, as customers who live a long way from this store most likely live near *another* store where they might do some of their shopping as well. Whereas, customers who live near this store, probably do a greater proportion of their shopping at "this" store and hence have a higher loyalty score! + +___ +
+# Decision Tree + +We will again utilize the scikit-learn library within Python to model our data using a Decision Tree. The code sections below are broken up into 4 key sections: + +* Data Import +* Data Preprocessing +* Model Training +* Performance Assessment + +
+### Data Import + +Since the modeling data was saved as a pickle file, it is imported. Next, the id column is removed and our data is shuffled. + +```python + +# import required packages +import pandas as pd +import pickle +import matplotlib.pyplot as plt +from sklearn.tree import DecisionTreeRegressor, plot_tree +from sklearn.utils import shuffle +from sklearn.model_selection import train_test_split, cross_val_score, KFold +from sklearn.metrics import r2_score +from sklearn.preprocessing import OneHotEncoder + +# import modelling data +data_for_model = pickle.load(open("data/customer_loyalty_modelling.p", "rb")) + +# drop unnecessary columns +data_for_model.drop("customer_id", axis = 1, inplace = True) + +# shuffle data +data_for_model = shuffle(data_for_model, random_state = 42) + +``` +
+### Data Preprocessing + +While Linear Regression is susceptible to the effects of outliers, and highly correlated input variables - Decision Trees are not, so the required preprocessing here is lighter. We still however will put in place logic for: + +* Missing values in the data +* Encoding categorical variables to numeric form + +
+##### Missing Values + +The number of missing values in the data was extremely low, so instead of applying any imputation (e.g. mean, most common value) we will just remove those rows. + +```python + +# check how many values are missing in each column, then remove rows where values are missing +data_for_model.isna().sum() +data_for_model.dropna(how = "any", inplace = True) + +``` + +
+##### Split Out Data For Modelling + +In exactly the same way we did for Linear Regression, in the next code block we do two things, we firstly split our data into an **X** object which contains only the predictor variables, and a **y** object that contains only our dependent variable. + +Once we have done this, we split our data into training and test sets to ensure we can fairly validate the accuracy of the predictions on data that was not used in training. In this case, we have allocated 80% of the data for training, and the remaining 20% for validation. + +
+```python + +# split data into X and y objects for modelling +X = data_for_model.drop(["customer_loyalty_score"], axis = 1) +y = data_for_model["customer_loyalty_score"] + +# split out training & test sets +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42) + +``` + +
+##### Categorical Predictor Variables + +In our dataset, we have one categorical variable *gender* which has values of "M" for Male, "F" for Female, and "U" for Unknown. + +Just like the Linear Regression algorithm, the Decision Tree cannot deal with data in this format as it can't assign any numerical meaning to it when looking to assess the relationship between the variable and the dependent variable. + +As *gender* doesn't have any explicit *order* to it, in other words, Male isn't higher or lower than Female and vice versa - we would again apply One Hot Encoding to the categorical column. + +
+```python + +# list of categorical variables that need encoding +categorical_vars = ["gender"] + +# instantiate OHE class (sparse_output requires scikit-learn 1.2 or later, matching the Linear Regression section) +one_hot_encoder = OneHotEncoder(sparse_output=False, drop = "first") + +# apply OHE +X_train_encoded = one_hot_encoder.fit_transform(X_train[categorical_vars]) +X_test_encoded = one_hot_encoder.transform(X_test[categorical_vars]) + +# extract feature names for encoded columns +encoder_feature_names = one_hot_encoder.get_feature_names_out(categorical_vars) + +# turn objects back to pandas dataframe +X_train_encoded = pd.DataFrame(X_train_encoded, columns = encoder_feature_names) +X_train = pd.concat([X_train.reset_index(drop=True), X_train_encoded.reset_index(drop=True)], axis = 1) +X_train.drop(categorical_vars, axis = 1, inplace = True) + +X_test_encoded = pd.DataFrame(X_test_encoded, columns = encoder_feature_names) +X_test = pd.concat([X_test.reset_index(drop=True), X_test_encoded.reset_index(drop=True)], axis = 1) +X_test.drop(categorical_vars, axis = 1, inplace = True) + +``` + +
+### Model Training + +Instantiating and training our Decision Tree model is done using the below code. We use the *random_state* parameter to ensure we get reproducible results, and this helps us understand any improvements in performance with changes to model hyperparameters. + +```python + +# instantiate our model object +regressor = DecisionTreeRegressor(random_state = 42) + +# fit our model using our training set +regressor.fit(X_train, y_train) + +``` + +
+### Model Performance Assessment + +##### Predict On The Test Set + +To assess how well our model is predicting on new data - we use the trained model object (here called *regressor*) and ask it to predict the *loyalty_score* variable for the test set + +```python + +# predict on the test set +y_pred = regressor.predict(X_test) + +``` + +
+##### Calculate R-Squared + +To calculate r-squared, we use the following code where we pass in our *predicted* outputs for the test set (y_pred), as well as the *actual* outputs for the test set (y_test) + +```python + +# calculate r-squared for our test set predictions +r_squared = r2_score(y_test, y_pred) +print(r_squared) + +``` + +The resulting r-squared score from this is **0.898** + +
+##### Calculate Cross Validated R-Squared + +As we did when testing Linear Regression, we will again utilise Cross Validation. + +Instead of simply dividing our data into a single training set, and a single test set, with Cross Validation we break our data into a number of "chunks" and then iteratively train the model on all but one of the "chunks", and test the model on the remaining "chunk" until each has had a chance to be the test set. + +The result of this is that we are provided a number of test set validation results - and we can take the average of these to give a much more robust & reliable view of how our model will perform on new, un-seen data! + +In the code below, we put this into place. We again specify that we want 4 "chunks" and then we pass in our regressor object, the training inputs (*X_train*), and the training outputs (*y_train*). We also specify the metric we want to assess with, in this case, we stick with r-squared. + +Finally, we take a mean of all four validation results. + +```python + +# calculate the mean cross validated r-squared across the four folds of the training data +cv = KFold(n_splits = 4, shuffle = True, random_state = 42) +cv_scores = cross_val_score(regressor, X_train, y_train, cv = cv, scoring = "r2") +cv_scores.mean() + +``` + +The mean cross-validated r-squared score from this is **0.871** which is slightly higher than we saw for Linear Regression. + +
+##### Calculate Adjusted R-Squared + +Just like we did with Linear Regression, we will also calculate the *Adjusted R-Squared* which compensates for the addition of input variables, and only increases if the variable improves the model above what would be obtained by probability. + +```python + +# calculate adjusted r-squared for our test set predictions +num_data_points, num_input_vars = X_test.shape +adjusted_r_squared = 1 - (1 - r_squared) * (num_data_points - 1) / (num_data_points - num_input_vars - 1) +print(adjusted_r_squared) + +``` + +The resulting *adjusted* r-squared score from this is **0.887** which as expected, is slightly lower than the score we got for r-squared on its own. + +
+### Decision Tree Regularisation + +Decision Trees can be prone to over-fitting, in other words, without any limits on their splitting, they will end up learning the training data perfectly. We would much prefer our model to have a more *generalised* set of rules, as this will be more robust & reliable when making predictions on *new* data. + +One effective method of avoiding this over-fitting, is to apply a *max depth* to the Decision Tree, meaning we only allow it to split the data a certain number of times before it is required to stop. + +Unfortunately, we don't necessarily know the *best* number of splits to use for this - so below we will loop over a variety of values and assess which gives us the best predictive performance! + +
+```python + +# finding the best max_depth + +# set up range for search, and empty list to append accuracy scores to +max_depth_list = list(range(1,9)) +accuracy_scores = [] + +# loop through each possible depth, train and validate model, append test set accuracy +for depth in max_depth_list: + + regressor = DecisionTreeRegressor(max_depth = depth, random_state = 42) + regressor.fit(X_train,y_train) + y_pred = regressor.predict(X_test) + accuracy = r2_score(y_test,y_pred) + accuracy_scores.append(accuracy) + +# store max accuracy, and optimal depth +max_accuracy = max(accuracy_scores) +max_accuracy_idx = accuracy_scores.index(max_accuracy) +optimal_depth = max_depth_list[max_accuracy_idx] + +# plot accuracy by max depth +plt.plot(max_depth_list,accuracy_scores) +plt.scatter(optimal_depth, max_accuracy, marker = "x", color = "red") +plt.title(f"Accuracy by Max Depth \n Optimal Tree Depth: {optimal_depth} (Accuracy: {round(max_accuracy,4)})") +plt.xlabel("Max Depth of Decision Tree") +plt.ylabel("Accuracy") +plt.tight_layout() +plt.show() + +``` +
+That code gives us the below plot - which visualises the results! + +
+![alt text](/img/posts/regression-tree-max-depth-plot.png "Decision Tree Max Depth Plot") + +
+In the plot we can see that the *maximum* accuracy (r-squared) on the test set is found when applying a *max_depth* value of 7. However, we lose very little accuracy back to a value of 4, and this would result in a simpler model that should generalise better on new data. We make the executive decision to re-train our Decision Tree with a maximum depth of 4! + +
+### Visualise Our Decision Tree + +To see the decisions that have been made in the (re-fitted) tree, we can use the plot_tree functionality that we imported from scikit-learn. To do this, we use the below code: + +
+```python + +# re-fit our model using max depth of 4 +regressor = DecisionTreeRegressor(random_state = 42, max_depth = 4) +regressor.fit(X_train, y_train) + +# plot the nodes of the decision tree +# feature names come from X_train (the encoded columns the model was fitted on), not the un-encoded X +plt.figure(figsize=(25,15)) +tree = plot_tree(regressor, + feature_names = list(X_train.columns), + filled = True, + rounded = True, + fontsize = 16) + +``` +
+That code gives us the below plot: + +
+![alt text](/img/posts/regression-tree-nodes-plot.png "Decision Tree Nodes Plot") + +
+This is a very powerful visual, and one that can be shown to stakeholders in the business to ensure they understand exactly what is driving the predictions. + +One interesting thing to note is that the *very first split* appears to be using the variable *distance from store* so it would seem that this is a very important variable when it comes to predicting loyalty! + +___ +
+# Random Forest + +We will again utilise the scikit-learn library within Python to model our data using a Random Forest. The code sections below are broken up into 4 key sections: + +* Data Import +* Data Preprocessing +* Model Training +* Performance Assessment + +
+### Data Import + +Again, since we saved our modelling data as a pickle file, we import it. We ensure we remove the id column, and we also ensure our data is shuffled. + +```python + +# import required packages +import pandas as pd +import pickle +import matplotlib.pyplot as plt +from sklearn.ensemble import RandomForestRegressor +from sklearn.utils import shuffle +from sklearn.model_selection import train_test_split, cross_val_score, KFold +from sklearn.metrics import r2_score +from sklearn.preprocessing import OneHotEncoder +from sklearn.inspection import permutation_importance + +# import modelling data +data_for_model = pickle.load(open("data/customer_loyalty_modelling.p", "rb")) + +# drop unnecessary columns +data_for_model.drop("customer_id", axis = 1, inplace = True) + +# shuffle data +data_for_model = shuffle(data_for_model, random_state = 42) + +``` +
+### Data Preprocessing + +While Linear Regression is susceptible to the effects of outliers, and highly correlated input variables - Random Forests, just like Decision Trees, are not, so the required preprocessing here is lighter. We still however will put in place logic for: + +* Missing values in the data +* Encoding categorical variables to numeric form + +
+##### Missing Values + +The number of missing values in the data was extremely low, so instead of applying any imputation (e.g. mean, most common value) we will just remove those rows. + +```python + +# check how many values are missing in each column, then remove rows where values are missing +data_for_model.isna().sum() +data_for_model.dropna(how = "any", inplace = True) + +``` + +
+##### Split Out Data For Modelling + +In exactly the same way we did for Linear Regression, in the next code block we do two things, we firstly split our data into an **X** object which contains only the predictor variables, and a **y** object that contains only our dependent variable. + +Once we have done this, we split our data into training and test sets to ensure we can fairly validate the accuracy of the predictions on data that was not used in training. In this case, we have allocated 80% of the data for training, and the remaining 20% for validation. + +
+```python + +# split data into X and y objects for modelling +X = data_for_model.drop(["customer_loyalty_score"], axis = 1) +y = data_for_model["customer_loyalty_score"] + +# split out training & test sets +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42) + +``` + +
+##### Categorical Predictor Variables + +In our dataset, we have one categorical variable *gender* which has values of "M" for Male, "F" for Female, and "U" for Unknown. + +Just like the Linear Regression algorithm, Random Forests cannot deal with data in this format as it can't assign any numerical meaning to it when looking to assess the relationship between the variable and the dependent variable. + +As *gender* doesn't have any explicit *order* to it, in other words, Male isn't higher or lower than Female and vice versa - we would again apply One Hot Encoding to the categorical column. + +
+```python + +# list of categorical variables that need encoding +categorical_vars = ["gender"] + +# instantiate OHE class (sparse_output requires scikit-learn 1.2 or later, matching the Linear Regression section) +one_hot_encoder = OneHotEncoder(sparse_output=False, drop = "first") + +# apply OHE +X_train_encoded = one_hot_encoder.fit_transform(X_train[categorical_vars]) +X_test_encoded = one_hot_encoder.transform(X_test[categorical_vars]) + +# extract feature names for encoded columns +encoder_feature_names = one_hot_encoder.get_feature_names_out(categorical_vars) + +# turn objects back to pandas dataframe +X_train_encoded = pd.DataFrame(X_train_encoded, columns = encoder_feature_names) +X_train = pd.concat([X_train.reset_index(drop=True), X_train_encoded.reset_index(drop=True)], axis = 1) +X_train.drop(categorical_vars, axis = 1, inplace = True) + +X_test_encoded = pd.DataFrame(X_test_encoded, columns = encoder_feature_names) +X_test = pd.concat([X_test.reset_index(drop=True), X_test_encoded.reset_index(drop=True)], axis = 1) +X_test.drop(categorical_vars, axis = 1, inplace = True) + +``` + +
+### Model Training + +Instantiating and training our Random Forest model is done using the below code. We use the *random_state* parameter to ensure we get reproducible results, and this helps us understand any improvements in performance with changes to model hyperparameters. + +We leave the other parameters at their default values, meaning that we will just be building 100 Decision Trees in this Random Forest. + +```python + +# instantiate our model object +regressor = RandomForestRegressor(random_state = 42) + +# fit our model using our training set only +regressor.fit(X_train, y_train) + +``` + +
+### Model Performance Assessment + +##### Predict On The Test Set + +To assess how well our model is predicting on new data, we use the trained model object (here called *regressor*) and ask it to predict the *loyalty_score* variable for the test set. + +```python + +# predict on the test set +y_pred = regressor.predict(X_test) + +``` + +
+##### Calculate R-Squared + +To calculate r-squared, we use the following code where we pass in our *predicted* outputs for the test set (y_pred), as well as the *actual* outputs for the test set (y_test) + +```python + +# calculate r-squared for our test set predictions +r_squared = r2_score(y_test, y_pred) +print(r_squared) + +``` + +The resulting r-squared score from this is **0.957** - higher than both Linear Regression & the Decision Tree. + +
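For intuition about what `r2_score` is measuring, r-squared can be written out by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes nothing beyond that definition; the function name `r_squared_score` and the toy values are hypothetical and are not the project's data.

```python
# Illustrative sketch only: r-squared computed from its definition.

def r_squared_score(actual, predicted):
    """1 - (sum of squared residuals / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


toy_actual = [0.2, 0.4, 0.6, 0.8]         # hypothetical loyalty scores
toy_predicted = [0.25, 0.35, 0.65, 0.75]  # hypothetical predictions

toy_r_squared = r_squared_score(toy_actual, toy_predicted)
print(round(toy_r_squared, 3))
```

A model that always predicts the mean of the actual values scores 0 on this measure, and perfect predictions score 1.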
+##### Calculate Cross Validated R-Squared + +As we did when testing Linear Regression & our Decision Tree, we will again utilise Cross Validation (for more info on how this works, please refer to the Linear Regression section above). + +```python + +# calculate the mean cross validated r-squared on the training set +cv = KFold(n_splits = 4, shuffle = True, random_state = 42) +cv_scores = cross_val_score(regressor, X_train, y_train, cv = cv, scoring = "r2") +cv_scores.mean() + +``` + +The mean cross-validated r-squared score from this is **0.923** which again is higher than we saw for both Linear Regression & our Decision Tree. + +
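As a rough picture of what a shuffled 4-fold split does, the sketch below divides row indices into four folds so that each row is used for validation exactly once. It is a plain-Python illustration under simplifying assumptions, not how scikit-learn's `KFold` assigns rows internally; the function name `k_fold_indices` and the 12-row example are hypothetical.

```python
# Illustrative sketch only: shuffled k-fold index splitting.
import random


def k_fold_indices(num_rows, n_splits, seed):
    """Return a list of (training_indices, validation_indices) pairs."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    folds = [indices[i::n_splits] for i in range(n_splits)]
    splits = []
    for i, validation in enumerate(folds):
        # every fold except the current one is used for training
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(num_rows=12, n_splits=4, seed=42)
for training, validation in splits:
    print(len(training), sorted(validation))
```

The model is trained and scored once per pair, and the mean of those scores is what gets reported as the cross-validated result.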
+##### Calculate Adjusted R-Squared + +Just like we did with Linear Regression & our Decision Tree, we will also calculate the *Adjusted R-Squared* which compensates for the addition of input variables, and only increases if the variable improves the model more than would be expected by chance. + +```python + +# calculate adjusted r-squared for our test set predictions +num_data_points, num_input_vars = X_test.shape +adjusted_r_squared = 1 - (1 - r_squared) * (num_data_points - 1) / (num_data_points - num_input_vars - 1) +print(adjusted_r_squared) + +``` + +The resulting *adjusted* r-squared score from this is **0.955** which, as expected, is slightly lower than the score we got for r-squared on its own - but again higher than for our other models. + +
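To see how the penalty behaves, the same formula can be wrapped in a small function and evaluated for different numbers of input variables. In this sketch the r-squared of 0.957 is the test set score reported above, but the row count of 80 and the variable counts are hypothetical values chosen purely for illustration, and the function name `adjusted_r_squared_score` is not from the project.

```python
# Illustrative sketch only: how adjusted r-squared shrinks as input variables are added.

def adjusted_r_squared_score(r_squared, num_data_points, num_input_vars):
    """Same formula as the code above, wrapped in a reusable function."""
    return 1 - (1 - r_squared) * (num_data_points - 1) / (num_data_points - num_input_vars - 1)


example_r_squared = 0.957   # test set r-squared reported above
example_num_rows = 80       # hypothetical number of test set rows

for num_input_vars in (2, 8, 20):  # hypothetical numbers of input variables
    adjusted = adjusted_r_squared_score(example_r_squared, example_num_rows, num_input_vars)
    print(num_input_vars, round(adjusted, 3))
```

For a fixed r-squared and row count, the adjusted score falls as more input variables are included, which is why it is a fairer basis for comparing models with different numbers of inputs.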
+### Feature Importance + +In our Linear Regression model, to understand the relationships between input variables and our output variable, loyalty score, we examined the coefficients. With our Decision Tree we looked at what the earlier splits were. These gave us some insight into which input variables were having the most impact. + +A Random Forest is an ensemble model, made up of many, many Decision Trees, each of which is different due to the randomness of the data being provided, and the random selection of input variables available at each potential split point. + +Because of this, we end up with a powerful and robust model, and because of the random or different nature of all these Decision Trees, the model gives us a unique insight into how important each of our input variables is to the overall model. + +As we’re using random samples of data, and input variables for each Decision Tree - there are many scenarios where certain input variables are being held back, and this gives us a way to compare how accurate the model's predictions are if that variable is or isn’t present. + +So, at a high level, in a Random Forest we can measure *importance* by asking *How much would accuracy decrease if a specific input variable was removed or randomised?* + +If this decrease in performance, or accuracy, is large, then we’d deem that input variable to be quite important, and if we see only a small decrease in accuracy, then we’d conclude that the variable is of less importance. + +There are two common ways to tackle this. The first, often just called **Feature Importance**, is where we find all nodes in the Decision Trees of the forest where a particular input variable is used to split the data, assess what the Mean Squared Error (for a Regression problem) was before the split was made, and compare this to the Mean Squared Error after the split was made. 
We can take the *average* of these improvements across all Decision Trees in the Random Forest to get a score that tells us *how much better* we’re making the model by using that input variable. + +If we do this for *each* of our input variables, we can compare these scores and understand which is adding the most value to the predictive power of the model! + +The other approach, often called **Permutation Importance**, cleverly uses some data that has gone *unused* when random samples are selected for each Decision Tree (this stage is called "bootstrap sampling" or "bootstrapping"). + +These observations that were not randomly selected for each Decision Tree are known as *Out of Bag* observations and these can be used for testing the accuracy of each particular Decision Tree. + +For each Decision Tree, all of the *Out of Bag* observations are gathered and then passed through. Once all of these observations have been run through the Decision Tree, we obtain an accuracy score for these predictions, which in the case of a regression problem could be Mean Squared Error or r-squared. + +In order to understand the *importance*, we *randomise* the values within one of the input variables - a process that essentially destroys any relationship that might exist between that input variable and the output variable - and run that updated data through the Decision Tree again, obtaining a second accuracy score. The difference between the original accuracy and the new accuracy gives us a view on how important that particular variable is for predicting the output. + +*Permutation Importance* is often preferred over *Feature Importance* which can at times inflate the importance of numerical features. Both are useful, and in most cases will give fairly similar results. + +Let's put them both in place, and plot the results... + +
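Before the library version, the shuffle-and-rescore idea behind Permutation Importance can be sketched in a few lines of plain Python. This is a simplified illustration and not the scikit-learn implementation: it scores a fixed stand-in model on one held-out dataset rather than on per-tree Out of Bag observations, and the names `toy_model`, `permutation_importance_sketch` and the generated toy data are all hypothetical.

```python
# Illustrative sketch only: permutation importance by shuffling one column at a time.
import random


def r_squared_score(actual, predicted):
    """1 - (sum of squared residuals / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def toy_model(row):
    """Stand-in for a trained model: it relies on the first input only."""
    return 3.0 * row[0]


def permutation_importance_sketch(model, rows, actual, n_repeats, seed):
    """Average drop in r-squared after shuffling each input column in turn."""
    rng = random.Random(seed)
    baseline = r_squared_score(actual, [model(row) for row in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled_col = [row[col] for row in rows]
            rng.shuffle(shuffled_col)  # break the link between this column and the output
            shuffled_rows = [row[:col] + [value] + row[col + 1:]
                             for row, value in zip(rows, shuffled_col)]
            score = r_squared_score(actual, [model(row) for row in shuffled_rows])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# toy data: the output depends on the first column only, plus a little noise
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
actual = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in rows]

importances = permutation_importance_sketch(toy_model, rows, actual, n_repeats=10, seed=42)
print(importances)
```

Shuffling the column the model relies on causes a large drop in r-squared, while shuffling the column it ignores changes nothing, which is exactly the contrast the importance score is designed to expose.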
+```python + +# calculate feature importance +# (feature names come from X_train, as the one hot encoded columns only exist after encoding) +feature_importance = pd.DataFrame(regressor.feature_importances_) +feature_names = pd.DataFrame(X_train.columns) +feature_importance_summary = pd.concat([feature_names,feature_importance], axis = 1) +feature_importance_summary.columns = ["input_variable","feature_importance"] +feature_importance_summary.sort_values(by = "feature_importance", inplace = True) + +# plot feature importance +plt.barh(feature_importance_summary["input_variable"],feature_importance_summary["feature_importance"]) +plt.title("Feature Importance of Random Forest") +plt.xlabel("Feature Importance") +plt.tight_layout() +plt.show() + +# calculate permutation importance +# (results are stored under a new name so the imported permutation_importance function is not overwritten) +result = permutation_importance(regressor, X_test, y_test, n_repeats = 10, random_state = 42) +permutation_importance_values = pd.DataFrame(result["importances_mean"]) +feature_names = pd.DataFrame(X_test.columns) +permutation_importance_summary = pd.concat([feature_names,permutation_importance_values], axis = 1) +permutation_importance_summary.columns = ["input_variable","permutation_importance"] +permutation_importance_summary.sort_values(by = "permutation_importance", inplace = True) + +# plot permutation importance +plt.barh(permutation_importance_summary["input_variable"],permutation_importance_summary["permutation_importance"]) +plt.title("Permutation Importance of Random Forest") +plt.xlabel("Permutation Importance") +plt.tight_layout() +plt.show() + +``` +
+That code gives us the below plots - the first being for *Feature Importance* and the second for *Permutation Importance*! + +
+![alt text](/img/posts/rf-regression-feature-importance.png "Random Forest Feature Importance Plot") +
+
+![alt text](/img/posts/rf-regression-permutation-importance.png "Random Forest Permutation Importance Plot") + +
+The overall story from both approaches is very similar, in that by far the most important or impactful input variable is *distance_from_store*, which is the same insight we derived when assessing our Linear Regression & Decision Tree models. + +There are slight differences in the order or "importance" for the remaining variables but overall they have provided similar findings. + +___ +
+# Modelling Summary + +The most important outcome for this project was predictive accuracy, rather than explicitly understanding the drivers of prediction. Based upon this, we chose the model that performed the best when predicting on the test set - the Random Forest. + +
+**Metric 1: Adjusted R-Squared (Test Set)** + +* Random Forest = 0.955 +* Decision Tree = 0.886 +* Linear Regression = 0.754 + +
+**Metric 2: R-Squared (K-Fold Cross Validation, k = 4)** + +* Random Forest = 0.925 +* Decision Tree = 0.871 +* Linear Regression = 0.853 + +
+Even though we were not specifically interested in the drivers of prediction, it was interesting to see across all three modelling approaches, that the input variable with the biggest impact on the prediction was *distance_from_store* rather than variables such as *total sales*. This is interesting information for the business, so discovering this as we went was worthwhile. + +
+# Predicting Missing Loyalty Scores + +We have selected the model to use (Random Forest) and now we need to make the *loyalty_score* predictions for those customers that the market research consultancy were unable to tag. + +We cannot just pass the data for these customers into the model, as is - we need to ensure the data is in exactly the same format as what was used when training the model. + +In the following code, we will + +* Import the required packages for preprocessing +* Import the data for those customers who are missing a *loyalty_score* value +* Import our model object & any preprocessing artifacts +* Drop columns that were not used when training the model (customer_id) +* Drop rows with missing values +* Apply One Hot Encoding to the gender column (using transform) +* Make the predictions using .predict() + +
+```python + +# import required packages +import pandas as pd +import pickle + +# import customers for scoring +to_be_scored = ... + +# import model and model objects +regressor = ... +one_hot_encoder = ... + +# drop unused columns +to_be_scored.drop(["customer_id"], axis = 1, inplace = True) + +# drop missing values +to_be_scored.dropna(how = "any", inplace = True) + +# apply one hot encoding (transform only) +categorical_vars = ["gender"] +encoder_vars_array = one_hot_encoder.transform(to_be_scored[categorical_vars]) +encoder_feature_names = one_hot_encoder.get_feature_names_out(categorical_vars) +encoder_vars_df = pd.DataFrame(encoder_vars_array, columns = encoder_feature_names) +to_be_scored = pd.concat([to_be_scored.reset_index(drop=True), encoder_vars_df.reset_index(drop=True)], axis = 1) +to_be_scored.drop(categorical_vars, axis = 1, inplace = True) + +# make our predictions! +loyalty_predictions = regressor.predict(to_be_scored) + +``` +
+Just like that, we have made our *loyalty_score* predictions for these missing customers. Due to the impressive metrics on the test set, we can be reasonably confident with these scores. This extra customer information will ensure our client can undertake more accurate and relevant customer tracking, targeting, and comms. + +___ +
+# Growth & Next Steps + +While predictive accuracy was relatively high, other modelling approaches could be tested, especially those somewhat similar to Random Forest (for example XGBoost or LightGBM), to see if even more accuracy could be gained. + +We could even look to tune the hyperparameters of the Random Forest, notably regularisation parameters such as tree depth, as well as potentially training on a higher number of Decision Trees in the Random Forest. + +From a data point of view, further variables could be collected, and further feature engineering could be undertaken to ensure that we have as much useful information available for predicting customer loyalty. diff --git a/img/CAF9BD2A-23C8-4235-BE2F-89315A9439E7_1_105_c.jpeg b/img/CAF9BD2A-23C8-4235-BE2F-89315A9439E7_1_105_c.jpeg new file mode 100644 index 000000000..2f8ae9c07 Binary files /dev/null and b/img/CAF9BD2A-23C8-4235-BE2F-89315A9439E7_1_105_c.jpeg differ diff --git a/img/ab-testing-title-img.png b/img/ab-testing-title-img.png new file mode 100644 index 000000000..840c09964 Binary files /dev/null and b/img/ab-testing-title-img.png differ diff --git a/img/posts/ab-testing-title-img.png b/img/posts/ab-testing-title-img.png new file mode 100644 index 000000000..840c09964 Binary files /dev/null and b/img/posts/ab-testing-title-img.png differ diff --git a/img/posts/lin-reg-feature-selection-plot.png b/img/posts/lin-reg-feature-selection-plot.png new file mode 100644 index 000000000..3bced3729 Binary files /dev/null and b/img/posts/lin-reg-feature-selection-plot.png differ diff --git a/img/posts/regression-title-img.png b/img/posts/regression-title-img.png new file mode 100644 index 000000000..ddecd7ddc Binary files /dev/null and b/img/posts/regression-title-img.png differ diff --git a/img/posts/regression-tree-max-depth-plot.png b/img/posts/regression-tree-max-depth-plot.png new file mode 100644 index 000000000..5cbcc6ac5 Binary files /dev/null and b/img/posts/regression-tree-max-depth-plot.png differ 
diff --git a/img/posts/regression-tree-nodes-plot.png b/img/posts/regression-tree-nodes-plot.png new file mode 100644 index 000000000..6b90d580e Binary files /dev/null and b/img/posts/regression-tree-nodes-plot.png differ diff --git a/img/posts/rf-regression-feature-importance.png b/img/posts/rf-regression-feature-importance.png new file mode 100644 index 000000000..83b6a940f Binary files /dev/null and b/img/posts/rf-regression-feature-importance.png differ diff --git a/img/posts/rf-regression-permutation-importance.png b/img/posts/rf-regression-permutation-importance.png new file mode 100644 index 000000000..bdfd2fca2 Binary files /dev/null and b/img/posts/rf-regression-permutation-importance.png differ