---
Project title: "Data Science with R Project Proposal"
output:
  rmarkdown::html_document:
    theme: lumen
bibliography: bibliography.bib
---

<style>
body {
text-align: justify}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### Project Proposal on Customer Behavioural Analytics in the Retail Sector
<br>
__Project title:__ <font color ="black"> "Customer Behavioural Analytics in the Retail Sector" <br /> </font>

__Names of Team Members:__<br />
<font color ="black">
1. Nadiia Honcharenko (220681, nadiia.honcharenko@st.ovgu.de) <br />
2. Rutuja Shivraj Pawar (220051, rutuja.pawar@ovgu.de) <br />
3. Shivani Jadhav (223856, shivani.jadhav@st.ovgu.de) <br />
4. Sumit Kundu (217453, sumit.kundu@st.ovgu.de)
</font>

__Under the Guidance of:__ <font color ="black"> M.Sc. Uli Niemann </font>

__Date:__ <font color ="black"> ```r format(Sys.Date(), "%B %e, %Y")``` </font>

__Background and Motivation:__ <font color ="black"> Customers are central to the success of any business. Conventional wisdom holds that retaining an existing customer costs far less than acquiring a new one. For a business to grow sustainably, both retaining its existing customer base and expanding it with new customers are critical. This requires a business to understand how its customers behave in relation to it. Obtaining a 360&deg; view of its customers is therefore crucial for a business seeking a competitive edge in the market. In such a scenario, Customer Behavioural Analytics plays an important role by leveraging data analytics to find meaningful behavioural patterns in customer-specific business data. <br/>
Consequently, this project aims to understand consumer behaviour in the retail sector. Decoding this behaviour relies on understanding how consumers make purchase decisions and which factors influence those decisions. The project also aims to discover dependencies between customers, products and shops in order to gain further insights into customer behaviour. These insights can help a business implement strategies that increase revenue through customer satisfaction.

</font>

__Project objectives:__ <font color ="black"> This project addresses the problem of understanding the behaviour of customers of _Coop_, an Italian retail distribution company, in a single Italian city. It intends to derive analytical insights into the customers' purchase behaviour by answering the following Research Questions (RQ): <br />

__1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?__ <br />
_Relevance:_ This will help to understand whether price is an important factor affecting the purchase decisions of the majority of customers. <br />

__2. For which products are customers ready to travel long distances, and for which products do they select the closest shop?__ <br />
_Relevance:_ This will help to understand the nature of the products in the context of proximity. It is assumed that customers select the closest shops to buy everyday products such as groceries but may travel long distances to buy one-time-purchase products such as kitchen equipment. As Data Science is results-driven and not based solely on intuition, this question can help to verify this assumption.<br />

__3. What is the maximum likelihood that a customer selects a particular shop to purchase a particular product?__ <br />
_Relevance:_ This will help to understand which shops in the retail chain are in demand for a particular product. This can further facilitate better stock management to meet customer demand.<br />

__4. How do the shops rank in terms of attracting customers and thus generating revenue, and what is the individual customer base of each shop?__<br />
_Relevance:_ This will help to understand the most popular shops in the retail chain and target different shop-level marketing schemes to the appropriate customers through customer segmentation. <br />

__5. Which customers are the most profitable in terms of revenue generation?__<br />
_Relevance:_ This will help to identify customers with a potentially high Customer Lifetime Value and to target them with appropriate loyalty programs, turning satisfied, loyal customers into advocates for the business.

</font>

__Name of the dataset to be used:__ <font color ="black"> Supermarket aggr.Customer^[https://bigml.com/user/czuriaga/gallery/dataset/5559c2c6200d5a6570000084] <br />
The dataset to be used is the retail market data of _Coop_, one of the largest Italian retail distribution companies, for a single Italian city [@pennacchioli2013explaining].<br />
The Supermarket aggr.Customer dataset used for the analysis contains data aggregated from the original datasets^[http://www.michelecoscia.com/?page_id=379] [@pennacchioli2013explaining] and mapped to new columns. It contains 40 features and 60,366 instances and is approximately 14.0 MB in size. </font>


__Design overview:__ <font color ="black"> Below are some of the algorithms and methods that we plan to use in our project:<br />

__1. Support Vector Machine (SVM)__ <br />
We will approach RQ1 as a classification task and hence use an SVM to classify whether or not a customer is willing to travel long distances to purchase products despite a high average product price in a shop. <br />
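
A minimal sketch of how RQ1 could be framed as an SVM classification task, assuming the `e1071` package is available and using entirely synthetic data. The column names `avg_price`, `monthly_spend` and `travels_far` are hypothetical and are not taken from the Coop dataset; the labelling rule is invented purely so that the example runs.

```r
library(e1071)  # assumed to be installed; provides svm()

set.seed(42)
n <- 200

# Synthetic stand-in for customer-level features (hypothetical columns)
customers <- data.frame(
  avg_price     = runif(n, min = 1, max = 10),   # average product price in the chosen shop
  monthly_spend = runif(n, min = 50, max = 500)  # total spend per month
)

# Invented labelling rule: customers tend to travel far when prices are low
customers$travels_far <- factor(
  ifelse(customers$avg_price + rnorm(n, sd = 1) < 5.5, "yes", "no")
)

# Hold out 25% of the rows for evaluation
test_idx <- sample(n, size = n / 4)
train <- customers[-test_idx, ]
test  <- customers[test_idx, ]

# Fit a radial-kernel SVM; e1071 scales the predictors by default
fit <- svm(travels_far ~ avg_price + monthly_spend, data = train, kernel = "radial")

# Evaluate on the held-out rows
pred <- predict(fit, newdata = test)
accuracy <- mean(pred == test$travels_far)
print(table(predicted = pred, actual = test$travels_far))
print(accuracy)
```

In the actual analysis the label would have to be derived from the observed shop distances, and the kernel and cost parameters would need tuning.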

__2. k-means Clustering__ <br />
RQ2, RQ4 and RQ5 require segmenting products, customers and shops into multiple clusters. We plan to use k-means clustering to partition the data into clusters and analyse the resulting segments.<br />
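
As an illustration of the segmentation step, the following sketch clusters synthetic customers with base R's `kmeans()`. The features `total_revenue` and `n_purchases`, the two simulated customer groups and the choice of two clusters are assumptions made for the example, not properties of the Coop data.

```r
set.seed(1)

# Two simulated customer groups: occasional shoppers and frequent high spenders
customers <- data.frame(
  total_revenue = c(rnorm(50, mean = 200, sd = 30), rnorm(50, mean = 1000, sd = 100)),
  n_purchases   = c(rnorm(50, mean = 10,  sd = 2),  rnorm(50, mean = 60,   sd = 8))
)

# Standardise the features so that neither dominates the Euclidean distance
scaled <- scale(customers)

# nstart = 25 reruns k-means from several random starts and keeps the best fit
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Attach the cluster labels and summarise each segment on the original scale
customers$segment <- fit$cluster
print(aggregate(cbind(total_revenue, n_purchases) ~ segment, data = customers, FUN = mean))
```

The number of clusters is fixed at two here only because the data were simulated that way; on real data it would need to be chosen, for example by comparing the within-cluster sum of squares across several values of `centers`.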

__3. Naive Bayes__ <br />
RQ3 requires a maximum likelihood estimate involving customers, products and shops. Hence we plan to build a Naive Bayes model for this estimation.<br />
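
The maximum-likelihood idea behind RQ3 can be illustrated with plain relative frequencies in base R. This is a sketch of the underlying estimate only, not a fitted Naive Bayes model, and the toy purchase records and their column names are hypothetical.

```r
# Toy purchase records (hypothetical customers, products and shops)
purchases <- data.frame(
  customer_id = c(1, 1, 2, 2, 3, 3, 4),
  product     = c("bread", "pan", "bread", "bread", "pan", "bread", "pan"),
  shop        = c("S1",    "S2",  "S1",    "S1",    "S2",  "S2",    "S2"),
  stringsAsFactors = FALSE
)

# Count how often each product was bought in each shop
counts <- table(product = purchases$product, shop = purchases$shop)

# Row-wise relative frequencies are the maximum-likelihood estimates of P(shop | product)
p_shop_given_product <- prop.table(counts, margin = 1)
print(p_shop_given_product)

# For each product, pick the shop with the highest estimated probability
best <- apply(p_shop_given_product, 1, which.max)
most_likely_shop <- setNames(colnames(p_shop_given_product)[best],
                             rownames(p_shop_given_product))
print(most_likely_shop)
```

A Naive Bayes model extends this by combining several such conditional frequency tables, one per predictor, under an independence assumption.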

__4. Apriori algorithm__ <br />
RQ3 also requires associations to be drawn between customers and shops for different products. Hence we plan to use the Apriori algorithm to determine the association patterns in the data.
</font>
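
A small sketch of the association-rule step planned for RQ3, assuming the `arules` package is available. The purchase records, the `product=` and `shop=` item encoding and the support and confidence thresholds are all hypothetical choices made for illustration.

```r
library(arules)  # assumed to be installed; provides apriori()

# Toy purchase records: each purchase becomes one transaction whose items
# are a product label and a shop label
purchases <- list(
  c("product=bread", "shop=S1"),
  c("product=bread", "shop=S1"),
  c("product=bread", "shop=S1"),
  c("product=bread", "shop=S2"),
  c("product=pan",   "shop=S2"),
  c("product=pan",   "shop=S2")
)
trans <- as(purchases, "transactions")

# Mine rules of the form {product} => {shop}; the rhs is restricted to shop items
rules <- apriori(
  trans,
  parameter  = list(support = 0.3, confidence = 0.7, minlen = 2),
  appearance = list(rhs = c("shop=S1", "shop=S2"), default = "lhs"),
  control    = list(verbose = FALSE)
)

# Show the rules, strongest first
inspect(sort(rules, by = "confidence"))
```

On the real data, customer attributes could be added as further items so that the rules also relate customer groups to shops.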

__Time plan:__ <font color ="black">
<table>
<colgroup>
<col width="22%" />
<col width="77%" />
</colgroup>
<thead>
<tr class="header">
<th align="left">Week</th>
<th align="left">Responsibilities and Workload Distribution</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">19.11.</td>
<td align="left">
_Nadiia & Rutuja:_ Data Cleaning <br />
_Shivani:_ Data Transformation <br />
_Sumit:_ Data Reduction<br />
</td>
</tr>

<tr class="even">
<td align="left">26.11.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ EDA and data modeling, with each team member working on different modeling techniques <br />
</td>
</tr>

<tr class="odd">
<td align="left">03.12.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ EDA and data visualization, with each team member working on different visualization techniques<br />
</td>
</tr>

<tr class="even">
<td align="left">10.12.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ Website development, with each team member creating a different webpage
</td>
</tr>

<tr class="odd">
<td align="left">17.12.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ Answering the Research Questions, with each member addressing a different question using different machine learning models
</td>
</tr>

<tr class="even">
<td align="left">24.12.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ Answering the Research Questions, with each member addressing a different question using different machine learning models
</td>
</tr>

<tr class="odd">
<td align="left">31.12.</td>
<td align="left">
_Nadiia and Shivani:_ Website Integration with the Project data <br />
_Rutuja and Sumit:_ Project Screencast creation<br />
</td>
</tr>

<tr class="even">
<td align="left">07.01.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ Final Project Detailing and Polishing
</td>
</tr>

<tr class="odd">
<td align="left">14.01.</td>
<td align="left">
_Nadiia, Rutuja, Shivani, Sumit:_ Project Wrap-up<br />
</td>
</tr>
</tbody>
</table>
</font>


```{r echo=FALSE, message=FALSE}
library(vistime)

# Project phases as inline CSV; strip.white = TRUE removes the indentation
# that would otherwise become part of the event names
timeline <- read.csv(
  text = "event,group,start,end,color
          Data Wrangling,Data Wrangling,2018-11-19,2018-11-26,#90caf9
          EDA,EDA,2018-11-26,2018-12-10,#90caf9
          Website Development,Website Development,2018-12-10,2018-12-17,#90caf9
          Answering RQs,Answering RQs,2018-12-17,2018-12-31,#90caf9
          Website & Screencast,Website & Screencast,2018-12-31,2019-01-07,#90caf9
          Project Polishing,Project Polishing,2019-01-07,2019-01-14,#90caf9
          Project Wrap-up,Project Wrap-up,2019-01-14,2019-01-16,#90caf9",
  strip.white = TRUE,
  stringsAsFactors = FALSE
)

# Draw the timeline without event labels
vistime(timeline, title = "Project Timeline", showLabels = FALSE)
```

__GitHub Repository:__ https://github.com/Rspawar/Data-Science-with-R.git

__References:__


---
#Comments

# RQ3
# This requires an association algorithm (maximum likelihood analysis) to predict that a customer will always select a particular shop to purchase a particular product with a given ID (calculated based on the maximum number of customers visiting a particular shop to buy a particular product)

# Don't display warnings
# {r, warning = FALSE}

# Don't display code
# {r, echo=FALSE}

# To store long computation results in a local cache
# {r, cache=TRUE}

# Example usage of labelling and reusing code chunks
# ```{r iris_plot, echo = FALSE, eval = FALSE}
# library(ggplot2)
# ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
#```

# ```{r , ref.label='iris_plot', fig.width = 7, fig.height = 2.5}
# ```

# Inline R code using r
# A random sample of 5 numbers from the set of numbers between
# 1 and 10 is `r sample(1:10, 5)`
---