eth-datamining/report.tex at master · thecopy/eth-datamining · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
\documentclass[a4paper, 11pt]{article}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage[pdftex]{hyperref}

\setlength{\textwidth}{16.5cm}
\setlength{\marginparwidth}{1.5cm}
\setlength{\parindent}{0cm}
\setlength{\parskip}{0.15cm}
\setlength{\textheight}{22cm}
\setlength{\oddsidemargin}{0cm}
\setlength{\evensidemargin}{\oddsidemargin}
\setlength{\topmargin}{0cm}
\setlength{\headheight}{0cm}
\setlength{\headsep}{0cm}

\renewcommand{\familydefault}{\sfdefault}

\title{Data Mining: Learning from Large Data Sets - Spring Semester 2014}
\author{jerik@student.ethz.ch\\ sethv@student.ethz.ch\\ ruxu@student.ethz.ch\\}
\date{\today}

\begin{document}
\maketitle

\section{Approximate Near-duplicate Search using Locality Sensitive Hashing}
To detect duplicate videos, min-hashing is used on videos in the mapper so that similar ones are grouped together. In order not to miss any potential duplicate videos, we set a relatively lower threshold and then manually filter out false positives using Jaccard similarity.

        \subsection{Min-Hashing on the Mapper}
                We start by generating the permutations which will be used in the algorithm first. Then a hash function of the form
                        \[h(x) = a_0x + a_1 \mod N\]
                is chosen, where N is the number of features (i.e. $10000$) and $a_0$ and $a_1$ is uniformly randomly generated. In the implementation, 256 permutations are used. Incidentally, for the signature matrix, 16 rows and 16 bands are used.

                Afterwards, we generate the signature matrix by reading the next video descriptor, i.e. all the shingles, and then performing the following algorithm:
                \begin{verbatim}
                          set M to be a row vector of length 256
                          set each element of M to infinity

                          for i=1:10000 do
                              if current video has shingles, do
                                  for each hash function h_i, do
                                      if h_i(s) < M(i), do
                                      set M(i) = h_i(s)
                          return M
                \end{verbatim}
                Therefore, $M$ represents the column in the signature matrix which stands for the current video.
                \pagebreak

                Using the following algorithm, we iterate for each band $B$:
                \begin{verbatim}
                          set M according to the algorithm above
                          set key to ''
                          for each row R in band B, do
                              append M(R) to key
                              emit(key, value)
                \end{verbatim}

                where $value$ is the video ID.\\
                Read the next video and repeat the steps above.

        \subsection{Detecting Duplicates on the Reducer}
                Since in the mapper we emit the signature as the key and MapReduce environment groups data automatically based on the keys, we can get all videos with the same key sequentially. Because we used 16 bands with 16 rows, we have a threshold of 0.8 = 80\% (target is 85\%). Hence we will get some false positives, but very few false negatives.\\

                After collecting all videos with the same key (signature), we inspect all videos with this key by calculating their Jaccard similarity. If the similarity is at least 85\%, the video is considered as a duplicate.

        \subsection{Result}
                With the above algorithm, we achieved a result of 0.99. We think that by further tweaking the number of band and rows, we could get a lower threshold before manually filtering similar object and potentially reach a higher score.

\newpage

\section{Large-Scale Image Classification}

In this part of the project, we trained our SVM classifier using parallel Stochastic Gradient Descent on MapReduce. The task was to classify pictures, represented as 400-coefficient feature vectors, into two categories: Nature and People.

\subsection{Mapper}
In every mapper, we trained a linear SVM classifier online by reading the input data line by line. We used the function SGDClassifier provided by scikit for learning.

\subsection{Reducer}
The implementation in the reducer is straightforward. We just took the mean of all the values passed by every mapper to obtain the final coefficients.

\subsection{Results}
In order to achieve better result, we tried different tranformations on the input data. Example transformations we used include power functions, logarithm functions, etc. The submission result did not show very much improvement. We also alternated between different loss funcitons and regularization parameters. In the final result, we chose to use hinge loss and L2 norm and obtained around 77 percent of the final score.

\newpage

\section{Extracting Representative Elements From Large Datasets}

The goal of this project was to extract 200 representative centers of a dataset of 2,000,000 images. The images were represented as 750-dimensional feature vectors.

\subsection{Mapper}
Our mapper function uses the MiniBatchKMeans algorithm provided by scikit to generate 200 clusters. We used k-means++ with 5 random initializations.

\subsection{Reducer}
The reducer simply averages the clusterings produced by each mapper.

\subsection{Results}

The "representativeness" of our set C of 200 centers for the provided dataset P was defined as
\[cost(P,C) = \frac{1}{\mathopen|P\mathclose|} \sum\limits_{p \in P} \min\limits_{c \in C} d(p,c)^2\]
where $d(p,c)$ is the Euclidean distance from a datapoint p to a center c.

Our result was 743.42174089, a lower cost than the easy baseline but higher than the hard baseline. We suspect that importance sampling would have given a superior result, but uniform sampling and the scikit-learn k-means implementation was sufficient to surpass the easy baseline.

\newpage

\section{Explore-Exploit Tradeoffs in Recommender Systems}
For this problem, we used the LinUCB algorithm given in lecture to learn the recommendation policy over the set of articles and logs. More precisely, the disjont LinUCB model was selected.

\subsection{Recommendation}

The algorithm for recommendation is presented below for reference.
\newline

In each round:
\begin{itemize}

\item Receive article set $A_t$ and user features $z_t$
\item For each article $x \in A_t$:

\begin{itemize}

    \item If $x$ is new, then set $M_x = I$ and $b_x = 0$
    \item Set $\hat{w}_x = M_x^{-1} \cdot b_x$
    \item Set $UCB_x = \hat{w}_x^T \cdot z_t + \alpha \sqrt{z_t^T \cdot M_x^{-1} \cdot z_t)}$

\end{itemize}

\item Recommend article $x_t =
    \operatorname{argmax}_{x \in A_t} UCB_x$

\end {itemize}


\subsection{Update}

The algorithm for update is presented below for reference:
\newline

\begin{itemize}
    \item If the random policy does not recommend the same article, no update occurs.

    \item Otherwise, the reward is either 1 (user clicked on the recommended article) or 0 (user did not click). We use $y_t$ below to denote the reward.


    Set $M_x \leftarrow M_x + z_t \cdot z_t^T$

    Set $b_x \leftarrow b_x + y_t z_t$

    This update is completed easily in our implementation, as we simply add the outer product of the chosen feature vector $z_t$ to the chosen article's matrix $M_x$ (stored in the hash table $A$) and add reward feedback to the chosen article's $b_x$.
    \begin{verbatim}
        A[chosen_article] += np.outer(chosen_feature, chosen_feature)
        b[chosen_article] += reward * chosen_feature
    \end{verbatim}
\end{itemize}

\subsection{Implementation details}

We use Python dictionaries to keep the matrix $M_x$ and a vector $b_x$ for each arm (article). Since each round requires multiple vector operations (and possibly a matrix outer product) we use Numpy arrays rather than Python lists. Numpy provides optimized linear algebra operations that speed up our code by orders of magnitude compared to the naive implementation. Also, we cached the inverse of matrices so we do not need to repeat such expensive operations unnecessarily.

\subsection{Results}

We experimented with various values for the learning rate $\alpha$ between 0.1 and 2 with incremental step of 0.1. After some test submissions we found that $\alpha = 0.2$ produced a score of 0.059632, just above the hard baseline.


\end{document}