GroupProject/MobileTelcoSegmentation_Copy.Rmd at master · adwoabanful/GroupProject · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
title: "Mobile Telecom Segmentation"
author: "Group Great"
output:
  html_document:
    css: AnalyticsStyles/default.css
    theme: paper
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: yes
  pdf_document:
    includes:
      in_header: AnalyticsStyles/default.sty
always_allow_html: yes
---

```{r setuplibraries, echo=FALSE, message=FALSE}
suppressWarnings(source("Library/library.R"))
#source("Library/heatmapOutput.R")
#source("Library/MartrixOperations.R")
#source("Library/allPublicLibraries.R")
# Package options
suppressWarnings(ggthemr('fresh'))  # ggplot theme
opts_knit$set(progress=FALSE, verbose=FALSE)
opts_chunk$set(echo=FALSE, fig.align="center", fig.width=10, fig.height=6.35, results="asis")
options(knitr.kable.NA = '')
```

#1) Context of the project
We are using a database with mobile telecommunication consumption data.

The data belongs to a mobile telecom company that operates in Ghana, and our focus will be exclusively on customers who buy prepaid packages (similar to Starhub prepaid packages)

The company uses an old customer segmentation to design service packages (voice, roaming, data, etc) and to target sales initiatives. However, there is the perception that the segmentation is outdated and not working anymore as it should.
<br>

#2)         The business questions:
The objective is to create a new segmentation based on data of customer usage of the different services, so the company can better offer and design the right packages to the right customer segments.

1) What are the main customer segments based on historical usage of the different services (voice, data, etc)?


#3)     The data

##3.1)   Treat and load the data

```{r setupdata1E, echo=TRUE, tidy=TRUE}
# Please ENTER the name of the file with the data used. The file should be a .csv with one row per observation (e.g. person) and one column per attribute. Do not add .csv at the end, make sure the data are numeric.
BDASmallTableClean = "Library/BDASmallTableNew.csv"

#>>>>>>> bfff9c1f85926b4164661820aa734f6312e97bf8 (WHAT IS THIS??)

# Please enter the minimum number below which you would like not to print - this makes the readability of the tables easier. Default values are either 10e6 (to print everything) or 0.5. Try both to see the difference.
MIN_VALUE = 0.5

# Please enter the maximum number of observations to show in the report and slides.
# DEFAULT is 10. If the number is large the report may be slow.
max_data_report = 10
```

```{r}
ProjectData <- read.csv(BDASmallTableClean)
ProjectData <- data.matrix(ProjectData)
ProjectData_INITIAL <- ProjectData

```

##3.2)   Describe the data
(highlevel description of contents, # of rows, # of columns, list the columns and its meaning, show a sample of the data)


#4)         PART 1: Factor/Component selection


In this part we identify the underlying factors that best capture the behaviours on the various variables.

```{r setupfactor, echo=TRUE, tidy=TRUE}
# Please ENTER the original raw attributes to use.
# Please use numbers, not column names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
factor_attributes_used = c(3:ncol(ProjectData))

# Please ENTER the selection criterions for the factors to use.
# Choices: "eigenvalue", "variance", "manual"
factor_selectionciterion = "eigenvalue"

# Please ENTER the desired minumum variance explained
# (Only used in case "variance" is the factor selection criterion used).
minimum_variance_explained = 80  # between 1 and 100

# Please ENTER the number of factors to use
# (Only used in case "manual" is the factor selection criterion used).
manual_numb_factors_used = 10

# Please ENTER the rotation eventually used (e.g. "none", "varimax", "quatimax", "promax", "oblimin", "simplimax", and "cluster" - see help(principal)). Default is "varimax"
rotation_used = "varimax"

```

```{r}
factor_attributes_used <- intersect(factor_attributes_used, 1:ncol(ProjectData))
ProjectDataFactor <- ProjectData[,factor_attributes_used]
ProjectDataFactor <- ProjectData <- data.matrix(ProjectDataFactor)
```

## Steps 1-2: Check the Data

Start by some basic visual exploration of, say, a few data:

```{r}
rownames(ProjectDataFactor) <- paste0("Obs.", sprintf("%02i", 1:nrow(ProjectDataFactor)))
iprint.df(t(head(round(ProjectDataFactor, 2), max_data_report)), scale = TRUE)
```

The data we use here have the following descriptive statistics:

```{r}
iprint.df(round(my_summary(ProjectDataFactor), 2), scale = FALSE)
```

## Step 3: Check Correlations

This is the correlation matrix of the customer responses to the `r ncol(ProjectDataFactor)` attitude questions - which are the only questions that we will use for the segmentation (see the case):

```{r}
thecor = round(cor(ProjectDataFactor),2)
iprint.df(round(thecor,2), scale=TRUE)
```


## Step 4: Choose number of factors

Clearly the survey asked many redundant questions (can you think some reasons why?), so we may be able to actually "group" these 29 attitude questions into only a few "key factors". This not only will simplify the data, but will also greatly facilitate our understanding of the customers.

To do so, we use methods called [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) and [factor analysis](https://en.wikipedia.org/wiki/Factor_analysis) as also discussed in the [Dimensionality Reduction readings](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions23/FactorAnalysisReading.html). We can use two different R commands for this (they make slightly different information easily available as output): the command `principal` (check `help(principal)` from R package [psych](http://personality-project.org/r/psych/)), and the command `PCA` from R package [FactoMineR](http://factominer.free.fr) - there are more packages and commands for this, as these methods are very widely used.

```{r}
# Here is how the `principal` function is used
UnRotated_Results<-principal(ProjectDataFactor, nfactors=ncol(ProjectDataFactor), rotate="none",score=TRUE)
UnRotated_Factors<-round(UnRotated_Results$loadings,2)
UnRotated_Factors<-as.data.frame(unclass(UnRotated_Factors))
colnames(UnRotated_Factors)<-paste("Comp",1:ncol(UnRotated_Factors),sep="")
```

```{r}
# Here is how we use the `PCA` function
Variance_Explained_Table_results<-PCA(ProjectDataFactor, graph=FALSE)
Variance_Explained_Table<-Variance_Explained_Table_results$eig
Variance_Explained_Table_copy<-Variance_Explained_Table

rownames(Variance_Explained_Table) <- paste("Component", 1:nrow(Variance_Explained_Table), sep=" ")
colnames(Variance_Explained_Table) <- c("Eigenvalue", "Pct of explained variance", "Cumulative pct of explained variance")
```

Let's look at the **variance explained** as well as the **eigenvalues** (see session readings):

```{r}
iprint.df(round(Variance_Explained_Table, 2), scale = FALSE)
```

```{r}
eigenvalues  <- Variance_Explained_Table[, "Eigenvalue"]
df           <- cbind(as.data.frame(eigenvalues), c(1:length(eigenvalues)), rep(1, length(eigenvalues)))
colnames(df) <- c("eigenvalues", "components", "abline")
iplot.df(melt(df, id="components"))
```

## Step 5: Interpret the factors

Let's now see how the "top factors" look like.

```{r}
if (factor_selectionciterion == "eigenvalue")
  factors_selected = sum(Variance_Explained_Table_copy[,1] >= 1)
if (factor_selectionciterion == "variance")
  factors_selected = 1:head(which(Variance_Explained_Table_copy[,"cumulative percentage of variance"]>= minimum_variance_explained),1)
if (factor_selectionciterion == "manual")
  factors_selected = manual_numb_factors_used
```

To better visualize them, we will use what is called a "rotation". There are many rotations methods. In this case we selected the `r rotation_used` rotation. For our data, the `r factors_selected` selected factors look as follows after this rotation:

```{r}
Rotated_Results<-principal(ProjectDataFactor, nfactors=max(factors_selected), rotate=rotation_used,score=TRUE)
Rotated_Factors<-round(Rotated_Results$loadings,2)
Rotated_Factors<-as.data.frame(unclass(Rotated_Factors))
colnames(Rotated_Factors)<-paste("Comp.",1:ncol(Rotated_Factors),sep="")

sorted_rows <- sort(Rotated_Factors[,1], decreasing = TRUE, index.return = TRUE)$ix
Rotated_Factors <- Rotated_Factors[sorted_rows,]

iprint.df(Rotated_Factors, scale=TRUE)
```

To better visualize and interpret the factors we often "suppress" loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:

```{r}
Rotated_Factors_thres <- Rotated_Factors
Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE]<-NA
colnames(Rotated_Factors_thres)<- colnames(Rotated_Factors)
rownames(Rotated_Factors_thres)<- rownames(Rotated_Factors)

iprint.df(Rotated_Factors_thres, scale=TRUE)
```

To understand if there was any redundancy between factors (i.e. high correlation) we took a look at correlation between the rotated factors
```{r}
factcor = round(cor(Rotated_Factors),2)
iprint.df(round(factcor,2), scale=TRUE)
```

## Step 6:  Save factor scores

We can now either replace all initial variables used in this part with the factors scores or just select one of the initial variables for each of the selected factors in order to represent that factor. Here is how the factor scores  are for the first few respondents:

```{r}
NEW_ProjectData <- round(Rotated_Results$scores[1:nrow(Rotated_Results$scores)-1,1:factors_selected,drop=F],2)
colnames(NEW_ProjectData)<-paste(c("Offnet Voice + SMS ( < 55%)", "Recharge", "Data Packages", "Out Roaming", "OG Intl Voice", "OG Offnet SMS + Intl IC SMS ( < 55%)", "Voice Package", "VAS Package","OG Intl SMS, SMS Package", "Pay-as-you-go Data", "OG SMS Onnet", "Content Provider VAS Package", "Blackberry Package", "Games", "IVR Shortcodes", "Onnet Voice OG", "IC Intl Voice", "OG Airtime Gifts", "Active Days", "IC Airtime Gifts"),1:ncol(NEW_ProjectData),sep=" ")

iprint.df(t(head(NEW_ProjectData, 10)), scale=TRUE)
```


<hr>\clearpage


#5)         PART 2: Segmentation


```{r setupcluster, echo=TRUE, tidy=TRUE}

# Please ENTER the number of clusters to eventually use for this report
numb_clusters_used = 7 #Use this to "play" with the data until finding a suitable segmentation

# Please enter the method to use for the segmentation:
profile_with = "hclust" #"hclust" or "kmeans"

# Please ENTER the distance metric eventually used for the clustering in case of hierarchical clustering
# (e.g. "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski" - see help(dist)).
# DEFAULT is "euclidean"
distance_used = "euclidean"

# Please ENTER the hierarchical clustering method to use (options are:
# "ward", "single", "complete", "average", "mcquitty", "median" or "centroid").
# DEFAULT is "ward"
hclust_method = "ward.D"

# Please ENTER the kmeans clustering method to use (options are:
# "Hartigan-Wong", "Lloyd", "Forgy", "MacQueen").
# DEFAULT is "Lloyd"
kmeans_method = "Lloyd"

```


##Step 1: Decide on pair-wise distance metric

We need to define a distance metric that measures how different people (observations in general) are from each other. This can be an important choice. Here are the differences between the observations using the distance metric we selected:

```{r}
euclidean_pairwise <- as.matrix(dist(head(NEW_ProjectData, max_data_report), method="euclidean"))
euclidean_pairwise <- euclidean_pairwise*lower.tri(euclidean_pairwise) + euclidean_pairwise*diag(euclidean_pairwise) + 10e10*upper.tri(euclidean_pairwise)
euclidean_pairwise[euclidean_pairwise==10e10] <- NA
rownames(euclidean_pairwise) <- colnames(euclidean_pairwise) <- sprintf("Obs.%02d", 1:max_data_report)

iprint.df(round(euclidean_pairwise))
```

We can see the histogram of, say, the first 2 variables

```{r}
variables_to_plot = 1:2
do.call(iplot.grid, lapply(variables_to_plot, function(n){
  iplot.hist(NEW_ProjectData[, n], breaks=5, xlab = paste("Variable", n))
}))
```


##Step 2: Segmentation and Robustness Analysis

###i.     Segmentation/Robustness Analysis


We need to select the clustering method to use, as well as the number of clusters. It may be useful to see the dendrogram from Hierarchical Clustering of a sample of the factored data, to have a quick idea of how the data may be segmented and how many segments there may be. **Not that there is vector size limit to hclust method, thus we use a sample of, say, a 1000 observations**.
<br>
Here is the dendrogram for our data:

```{r}
Hierarchical_Cluster_distances <- dist(NEW_ProjectData[1:1000,], method=distance_used)
Hierarchical_Cluster <- hclust(Hierarchical_Cluster_distances, method=hclust_method)
# Display dendogram
iplot.dendrogram(Hierarchical_Cluster)
# TODO: Draw dendogram with red borders around the 3 clusters
 #rect.hclust(Hierarchical_Cluster, k=numb_clusters_used, border="red")
```

We can also plot the "distances" traveled before we need to merge any of the lower and smaller in size clusters into larger ones - the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers, we see the first 20 here.
```{r}
num <- nrow(NEW_ProjectData[1:1000,]) - 1
df1 <- cbind(as.data.frame(Hierarchical_Cluster$height[length(Hierarchical_Cluster$height):1]), c(1:num))
colnames(df1) <- c("distances","index")
iplot.df(melt(head(df1, 20), id="index"), xlab="Number of Components")
```

Here is the segment membership of the first `r max_data_report` respondents if we use hierarchical clustering:

```{r}
cluster_memberships_hclust <- as.vector(cutree(Hierarchical_Cluster, k=numb_clusters_used))
cluster_ids_hclust=unique(cluster_memberships_hclust)

ProjectData_with_hclust_membership <- cbind(1:length(cluster_memberships_hclust),cluster_memberships_hclust)
colnames(ProjectData_with_hclust_membership)<-c("Observation Number","Cluster_Membership")

iprint.df(round(head(ProjectData_with_hclust_membership, max_data_report), 2))
```

while this is the segment membership if we use k-means:

```{r}
kmeans_clusters <- kmeans(NEW_ProjectData,centers= numb_clusters_used, iter.max=2000, algorithm=kmeans_method)

ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster),kmeans_clusters$cluster)
colnames(ProjectData_with_kmeans_membership)<-c("Observation Number","Cluster_Membership")

iprint.df(round(head(ProjectData_with_kmeans_membership, max_data_report), 2))
```

#c.          Step 3: Profile and interpret the above defined segments
<br>
This is the final output. The list of segments and their definitions