---
title: "Airbnb Price Prediction"
date: "`r format(Sys.time(), '%B %e, %Y')`"
fontsize: 12pt
bibliography: ["one.bib"]
csl: apa.csl
link-citations: true
output:
  pdf_document: default
  html_document:
    df_print: paged
header-includes: \usepackage{amsmath} \usepackage{color} \usepackage[english]{babel} \usepackage{graphicx} \usepackage{filecontents} \usepackage{natbib} \usepackage{url} \usepackage[nottoc]{tocbibind}
---


\tableofcontents


\newpage
\maketitle

\section{Introduction}

  Airbnb has become an essential part of trip and vacation planning for over 150 million people. Since 2008, guests and hosts have used Airbnb for a unique and personalized travel experience with a wide range of options.
Before Airbnb, most travelers had to rely on hotels. Because hotels are not as widely available as Airbnb listings, Airbnb's business model soon made it the best vacation rental marketplace. \
Because of Airbnb's dramatic growth, price prediction has become an essential element of the platform. Hosts typically set the price of a listing, and both the host and Airbnb need to offer a fair price to the consumer. Setting the price is crucial for new and existing hosts: a price that is too high costs them popularity or leaves them without guests. Customers, for their part, can check and compare prices according to their needs.
\
We plan to build a machine learning model using algorithms such as Linear Regression, Ridge Regression, Decision Tree Regression, and Random Forest Regression. We also use machine learning techniques to tune our model and try to achieve good predictive accuracy.
\section{Data Preparation}
  The dataset used in the project was obtained from Kaggle \cite{Kaggle}. It is a public Airbnb dataset that is also available on the original dataset website. The dataset describes the listing activity and metrics in New York City, NY for the year 2019. The data file includes the information needed to learn about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions. The dataset is in comma-separated values (.csv) file format. After importing the dataset, we performed some analysis on it. The dataset contains 48894 rows and 16 columns.
\subsection{Handling Missing Values}

  After importing the dataset, we noticed that the columns "last_review" and "reviews_per_month" had 10052 missing values. A few other columns also had some missing values, which were either imputed or dropped. Overall, the dataset did not contain many missing values, and rows in which the majority of values were missing were not used to train or build the model.
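
The check and clean-up described above can be sketched in Python with pandas. This is a minimal illustration on a made-up four-row table, not the project's actual code. The text does not say which columns were imputed and which were dropped, so the choices below (filling `reviews_per_month` with zero and dropping `last_review`) are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; the values are made up for illustration.
listings = pd.DataFrame({
    "price": [120, 80, 200, 65],
    "last_review": ["2019-05-01", None, "2019-06-12", None],
    "reviews_per_month": [1.2, np.nan, 0.4, np.nan],
})

# Count missing values per column, as in the check described above.
missing_counts = listings.isna().sum()

# Assumption: a listing with no reviews has a review rate of zero.
listings["reviews_per_month"] = listings["reviews_per_month"].fillna(0.0)

# Assumption: the raw review date is not used as a predictor, so it is dropped.
cleaned = listings.drop(columns=["last_review"])
```
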
\newpage
\subsection{Dealing with categorical inputs}
  Too many levels in a categorical variable is one of the most frequently occurring problems in predictive modeling. Our dataset has three categorical variables: neighborhood_group, neighborhood, and room_type. The variables neighborhood_group and room_type have only a few levels, so dummy coding is feasible. The neighborhood variable, however, has 219 levels, which makes dummy coding overly complex: the number of dummy variables would be one less than the number of levels. This leads to the problem of high dimensionality, which may ultimately cause over-fitting, so that the model cannot generalize properly to a new dataset. Moreover, including categorical variables with too many levels can lead to the problem of quasi-complete separation. Quasi-complete separation occurs when a level of the categorical variable has a target response of either 0% or 100%, and it can complicate the interpretation of the regression model. For these reasons, we did not include neighborhood in our final model.
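
As a small illustration of the dummy coding discussed above, the following sketch uses `pandas.get_dummies` on a toy `room_type` column. The rows are made up, and the call is one common way to build the indicators, not necessarily the one used in the project.

```python
import pandas as pd

# Toy rows; the level names are illustrative.
rooms = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
})

# drop_first=True keeps k - 1 indicator columns for a variable with k levels,
# so the dropped level acts as the reference category.
dummies = pd.get_dummies(rooms["room_type"], drop_first=True)

n_levels = rooms["room_type"].nunique()
```

Here three levels yield two indicator columns. Applied to a variable with 219 levels, the same call would yield 218 columns, which is the dimensionality concern raised above.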
\

\subsection{Variable screening}

  In predictive modeling, carefully selected features can improve the accuracy of the model, while including too many or too few variables can lead to overfitting or underfitting. A model that is not "just right", or close to it, cannot generalize well to unseen data. There are multiple methods for variable screening. We can compare candidate sets of variables using criteria and tests such as AIC, BIC, the F-test, and cross-validation. Stepwise selection, which proceeds by forward or backward selection, is one widely used method. Because we have a very small number of features, we instead performed best subset selection. Best subset selection can be computationally expensive, but in our case it is feasible.
\begin{figure}
  \centering
    \includegraphics[width=0.8\textwidth]{~/AirBnbSeniorProject/images/variableBest.png}
    \caption{Best Subset Regression Result}
    \includegraphics[width=0.8\textwidth]{~/AirBnbSeniorProject/images/plotAIC.png}
    \caption{Best Subset Regression: Choosing number of predictors}
\end{figure}
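
The idea behind best subset selection can be sketched as follows: fit a least-squares model on every non-empty subset of predictors and keep the subset with the lowest AIC. This is a minimal sketch on synthetic data. The helper names `aic_for_subset` and `best_subset` are hypothetical, and the Gaussian form of AIC (up to an additive constant) is an assumption about the criterion.

```python
from itertools import combinations

import numpy as np


def aic_for_subset(X, y, columns):
    """Fit ordinary least squares on the chosen columns and return its AIC."""
    n = len(y)
    # Design matrix: intercept column plus the selected predictors.
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1]  # number of estimated coefficients
    # Gaussian AIC up to an additive constant: n * log(RSS / n) + 2k.
    return n * np.log(rss / n) + 2 * k


def best_subset(X, y):
    """Score every non-empty subset of predictors and return the lowest-AIC one."""
    p = X.shape[1]
    candidates = [
        cols for size in range(1, p + 1) for cols in combinations(range(p), size)
    ]
    return min(candidates, key=lambda cols: aic_for_subset(X, y, cols))


# Synthetic data (hypothetical): only the first two columns drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

chosen = best_subset(X, y)
```

With $p$ predictors the search visits $2^p - 1$ subsets, which is why the method is only practical when the number of features is small.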

\newpage
\section{Models Used}


\subsection{Linear Regression }

Linear Regression is a technique that is widely used in Statistics, Statistical Learning, and Machine Learning. A linear regression model focuses on modeling the relationship between observed variables. A simple linear regression model fits a line to the observations of two variables; that line is also called the "line of best fit". The task is to draw the line that is "closest" to the points ($x_i$, $y_i$), where $x_i$ and $y_i$ are observations of the two variables, which are expected to depend linearly on each other. \cite{LinearRegression}
\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/linear.png}
    \caption{Simple Linear Regression from \cite{WikipediaEN:LinearRegression.svg}}
\end{figure}
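
In symbols, "closest" is usually taken in the least-squares sense. As a sketch under that assumption, with $\beta_0$ and $\beta_1$ introduced here as notation for the intercept and the slope, the line of best fit for $n$ observations solves

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 .
$$

The fitted line is then $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.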

\subsection{Ridge Regression}

  Ridge Regression is also a technique that is widely used in Statistics, Statistical Learning, and Machine Learning. Ridge Regression is used when regression data suffer from multicollinearity. It introduces some degree of bias into the regression estimates, which may prevent overfitting. Introducing some bias may provide better long-term predictions and reduce the standard errors.
The hope is that results improve because the model does not overfit the training set. \cite{RidgeRregression}
\newpage
\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/ridge.jpeg}
    \caption{Ridge Regression figure from \cite{RIDGEFIGURE}}
\end{figure}
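
The bias mentioned above comes from a penalty on the size of the coefficients. In the standard formulation, written here with notation that is not in the original text ($p$ predictors $x_{ij}$ and a tuning parameter $\lambda$), the ridge estimate solves

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}, \qquad \lambda \ge 0 .
$$

Setting $\lambda = 0$ recovers ordinary least squares, while larger values of $\lambda$ shrink the coefficients toward zero.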

\subsection{Decision Tree }

  A decision tree is another model widely used in Machine Learning. Decision trees are used for both classification and regression problems. The algorithm breaks the dataset down into smaller and smaller subsets while a decision tree is incrementally developed. A decision tree resembles a flowchart; the final result is a tree with nodes and leaves. \cite{DecisionTreee}
\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/tree.png}
    \caption{Decision Tree Regression from \cite{DecisionTreee}}
\end{figure}
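
To make the idea of breaking the data into smaller subsets concrete, the sketch below finds a single split on one feature. It assumes the common regression-tree criterion of minimizing the summed squared error around each side's mean. The function name `best_split` and the data are made up for illustration.

```python
import numpy as np


def best_split(x, y):
    """Return the threshold on x that minimizes the summed squared error
    of predicting each side of the split by its own mean."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold separates identical values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse


# Made-up data: the response jumps when x crosses 5.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([50.0, 52.0, 49.0, 51.0, 120.0, 118.0, 122.0, 121.0])

threshold, sse = best_split(x, y)
```

A full tree would apply the same search recursively to each side of the split until a stopping rule is met; each final subset becomes a leaf that predicts its mean.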
\newpage

\subsection{Random Forest Regression }

A random forest is a model that fits multiple decision trees on various sub-samples of the dataset and averages the results of all the trees to improve predictive accuracy while also controlling overfitting. As the figure below shows, multiple decision trees are used, and their average is taken as the result of the random forest. \cite{RandomForestRegresso}

\begin{figure}
  \centering
    \includegraphics[width=0.9\textwidth]{~/AirBnbSeniorProject/images/random.png}
    \caption{Random Forest Regression figure from \cite{randomForestFigure}}
\end{figure}
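
The averaging step can be checked directly with scikit-learn's `RandomForestRegressor`, whose fitted trees are exposed through the `estimators_` attribute. The data below are synthetic and the settings are arbitrary; this is a sketch of the mechanism, not the configuration used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical); it stands in for the listing features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=5, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Prediction of each individual tree, one row per tree.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# The forest's prediction is the mean over its trees.
forest_prediction = forest.predict(X[:5])
tree_average = per_tree.mean(axis=0)
```
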

\section{Analysis}

\subsection{Exploratory Data Analysis }

\subsubsection{Correlation Analysis}

A \textbf{correlation coefficient} is a value that indicates whether there is a linear association between two variables. Its value can be anywhere from $-1$ to $1$. A value of $0$ indicates no linear association between the observed variables. A negative value indicates a negative linear association, and a positive value indicates a positive linear association. \cite{Correlation}
\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/corr.png}
    \caption{Results of Pearson Correlation Test on Dataset}
\end{figure}
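
For reference, the Pearson correlation coefficient named in the caption is defined, for $n$ paired observations with sample means $\bar{x}$ and $\bar{y}$, as

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} .
$$

The symbols $r$, $\bar{x}$, and $\bar{y}$ are standard notation introduced here; they do not appear elsewhere in this paper.
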

As we can see from the correlation plot above, our features are correlated with each other, and we can observe strong correlation between features. This can sometimes be an indicator of multicollinearity.

\subsubsection{Price Analysis}

Machine Learning algorithms are typically sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in a less accurate model and ultimately poor predictive accuracy. \cite{OutlierDetection}
\
To build a good model, it is important to have a dataset that does not have too many outliers, so we try to detect outliers in our prices.
\begin{figure}
  \centering
    \includegraphics[width=0.4\textwidth]{~/AirBnbSeniorProject/images/outPrice.png}
    \caption{Density plot of price with outliers}
\end{figure}
As the plot above shows, our price distribution has many outliers, so we drop all the outliers from our dataset.
\begin{figure}
  \centering
    \includegraphics[width=0.5\textwidth]{~/AirBnbSeniorProject/images/Price.png}
    \caption{Density plot of price after removing outliers}
\end{figure}
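
The text does not state which rule was used to flag outliers. One common choice, assumed here purely for illustration, is Tukey's rule: drop values that lie more than 1.5 times the interquartile range (IQR) outside the first and third quartiles. The prices below are made up.

```python
import numpy as np

# Made-up nightly prices with two extreme values.
prices = np.array([60, 75, 80, 95, 100, 110, 120, 150, 900, 5000], dtype=float)

# Assumption: Tukey's rule, which flags points more than 1.5 * IQR
# outside the first and third quartiles.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = prices[(prices >= lower) & (prices <= upper)]
```

In this toy example the two extreme prices fall above the upper fence and are removed, while the remaining eight values are kept.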
\newpage
\textbf{Neighbourhood Group vs Price Analysis}
\
Let's now look into the relationship between neighbourhood group and price. This would help us understand if the price variable is impacted by neighbourhood groups.
\begin{figure}
  \centering
    \includegraphics[width=0.6\textwidth]{~/AirBnbSeniorProject/images/nhgPrice.png}
    \caption{Violin Plot for Neighbourhood Distribution vs Price}
\end{figure}

We can notice from the violin plot above that listings in Manhattan and Brooklyn are more expensive and distributed around the upper price ranges, whereas listings in Queens, Staten Island, and the Bronx are distributed more in the lower price ranges. This indicates that prices are affected by the location of the Airbnb.  \

\newpage
\textbf{Room Type vs Price Analysis}
\
Now we look into the relationship between room type and price. This would help us understand if the price variable is impacted by room type.
\begin{figure}
  \centering
    \includegraphics[width=0.6\textwidth]{~/AirBnbSeniorProject/images/rmtype.png}
    \caption{Violin Plot for room type vs price}
\end{figure}
As we can see from the violin plot above, renting an entire apartment is significantly more expensive, whereas a shared room is much cheaper. Private rooms are a bit more expensive than shared rooms, but the major price difference is between shared rooms and entire apartments.

\subsection{Model Optimization and Tuning Techniques}
In our models, we have used multiple optimization and tuning techniques for Random Forest Regression. Before we analyze our models and results, it is important to understand these techniques. Here are the techniques that we have used to tune models using Python.
\subsubsection{Cross Validation}
"Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then when training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation." \cite{CrossValidation}
\
There are many Cross Validation techniques, but we have used \textbf{K-fold Cross Validation}.
\
\
"\textbf{K-fold Cross Validation} is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed." \cite{CrossValidation}
\
\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/cross.png}
    \caption{K-Fold Cross Validation Visualized from \cite{kfold}}
\end{figure}
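
The procedure quoted above can be sketched with scikit-learn's `KFold`. The data are synthetic, a plain linear regression stands in for the learner, and $k = 5$ is an arbitrary choice for illustration; none of these are taken from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data (hypothetical) with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in folds.split(X):
    # Train on k - 1 folds, evaluate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    residuals = y[test_idx] - model.predict(X[test_idx])
    fold_errors.append(np.sqrt(np.mean(residuals ** 2)))

# Average error across the k trials.
cv_rmse = float(np.mean(fold_errors))
```

Every observation is used for testing exactly once and for training $k - 1$ times.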
\newpage
\subsubsection{Bootstrapping}
"Bootstrap aggregating, also called bagging (from bootstrap aggregating), is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach." \cite{Bootstrap}

\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/bootstrap.png}
    \caption{Bootstrap Visualized from \cite{bootstrapFigure}}
\end{figure}
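
A hand-rolled version of bagging makes the two steps in the quotation explicit: draw bootstrap samples (rows sampled with replacement), fit one model per sample, and average the predictions. The sketch below uses synthetic data and scikit-learn decision trees as the base learner; the number of models is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 50 + 100 + rng.normal(scale=10, size=200)

n_models = 20
models = []
for _ in range(n_models):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate: average the predictions of all bootstrapped models.
X_new = np.array([[2.5], [7.5]])
bagged_prediction = np.mean([m.predict(X_new) for m in models], axis=0)
```
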

\subsection{Model Comparison and Results}

We used four different Machine Learning models: Linear Regression, Ridge Regression, Decision Tree, and Random Forest Regression. We also tuned the Random Forest Regressor using cross-validation and bootstrapping. For this model, the significant predictors were categorical variables such as neighborhood group and room type, and numeric variables such as minimum nights and host information. The results are not particularly good in terms of predictive accuracy. \
\
We trained the model on 70% of the dataset and tested it on the remaining 30%. With a tuned random forest, we achieved 55% model accuracy and an RMSE of 36.40. \
\
This dataset did not include features that are important for Airbnb price prediction. In machine learning, the variables in the dataset must be significant predictors of the target variable. When we book an Airbnb, we almost always look at the number of rooms, bathrooms, and other services included. People also generally care about service fees and cancellation policies. Because these features were not available in our dataset, our model did not perform well. In the figure below, we can use the RMSE to understand the performance of the models. \

\begin{figure}
  \centering
    \includegraphics[width=0.7\textwidth]{~/AirBnbSeniorProject/images/rmse.png}
    \caption{RMSE of all the models}
\end{figure}
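
The 70/30 split and the tuning step described above can be sketched with scikit-learn's `train_test_split` and `RandomizedSearchCV`. The data are synthetic, and the search space, number of iterations, and number of folds are assumptions chosen to keep the example small; they are not the project's settings and will not reproduce the reported numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the prepared feature matrix and prices (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 4))
y = 15 * X[:, 0] + 8 * X[:, 1] + rng.normal(scale=10, size=300)

# 70% of the rows for training, the remaining 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative search space; these ranges are assumptions, not the project's.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled parameter settings
    cv=3,      # K-fold cross-validation with K = 3
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 30%.
predictions = search.best_estimator_.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
```

Randomized search samples a fixed number of parameter settings instead of trying every combination, which keeps tuning affordable when the grid is large.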

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. \cite{rootmean}
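
Written as a formula, with $y_i$ the observed price, $\hat{y}_i$ the predicted price, and $n$ the number of test observations (notation introduced here), this is

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} .
$$

Because the errors are squared before averaging, RMSE is expressed in the same units as the price and penalizes large misses more heavily than small ones.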
\
Here is another plot, a histogram of some data points, that shows how the tuned Random Forest model performed.
\begin{figure}
  \centering
    \includegraphics[width=0.6\textwidth]{~/AirBnbSeniorProject/images/tuned.png}
    \caption{Tuned Random Forest Prediction Visualized}
\end{figure}

\newpage
\section{Conclusion}

We worked on the price prediction problem and compared various Machine Learning models in this project. We did not achieve high predictive accuracy because of the lack of features in the dataset. Although the model did not perform well, we were still able to compare multiple models and achieve a decent predictive accuracy. We also used a tool provided by the Python library scikit-learn, RandomizedSearchCV, to perform model tuning.
\
Now that we know more about Airbnb, future work could use a dataset with better features to build a model with better predictive accuracy. A better model can help hosts estimate their Airbnb prices. As we mentioned before, a price that is too high can hurt the host's business, because people might not choose to stay if it is too expensive for what is offered, while a price that is too low might make the host lose money overall.
\


\clearpage

\bibliographystyle{unsrt}
\bibliography{one}