2023/30-matrix-factorization.qmd at main · datasciencelabs/2023 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
# Matrix factorization

```{r, echo=FALSE, message=FALSE}
library(tidyverse)
library(janitor)
library(dslabs)

rmse <- function(r) sqrt(mean(r^2, na.rm = TRUE))
```

Matrix factorization is a widely used concept in machine learning. It is very much related to factor analysis, singular value decomposition (SVD), and principal component analysis (PCA). Here we describe the concept in the context of movie recommendation systems.

We have described how the model:

$$
Y_{u,i} = \mu + b_i + b_u + \varepsilon_{u,i}
$$

accounts for movie to movie differences through the $b_i$ and user to user differences through the $b_u$. But this model leaves out an important source of variation related to the fact that groups of movies have similar rating patterns and groups of users have similar rating patterns as well. We will discover these patterns by studying the residuals:

$$
r_{u,i} = y_{u,i} - \hat{b}_i - \hat{b}_u
$$


We can compute these residuals for the model we fit in the previous chapter:

```{r}
y <- select(movielens, movieId, userId, rating) |>
  pivot_wider(names_from = movieId, values_from = rating) |>
  column_to_rownames("userId") |>
  as.matrix()

movie_map <- movielens |> select(movieId, title) |> distinct(movieId, .keep_all = TRUE)

lambda <- 3.1
mu <- mean(y, na.rm = TRUE)
n <- colSums(!is.na(y))
b_i_reg <- colSums(y - mu, na.rm = TRUE) / (n + lambda)
b_u <- rowMeans(sweep(y - mu, 2, b_i_reg), na.rm = TRUE)
r <- sweep(y - mu, 2, b_i_reg) - b_u
colnames(r) <- with(movie_map, title[match(colnames(r), movieId)])
```

If the movie and user effect model explains all the signal, and the $\varepsilon$ are just noise, then the residuals for different movies should be independent from each other. But they are not. Here are some examples:

```{r movie-cor, echo=FALSE, warning=FALSE, message=FALSE, out.width="100%", fig.width=9, fig.height=3}
m_1 <- "Godfather, The"
m_2 <- "Godfather: Part II, The"
p1 <- data.frame(x = r[, m_1], y = r[, m_2]) |>
  ggplot(aes(x, y)) + geom_point() + xlab(m_1) + ylab(m_2)

m_3 <- "Goodfellas"
p2 <- data.frame(x = r[, m_1], y = r[, m_3]) |>
  ggplot(aes(x, y)) + geom_point() + xlab(m_1) + ylab(m_3)

m_4 <- "You've Got Mail"
m_5 <- "Sleepless in Seattle"
p3 <- data.frame(x = r[, m_4], y = r[, m_5]) |>
  ggplot(aes(x, y)) + geom_point() + xlab(m_4) + ylab(m_5)

gridExtra::grid.arrange(p1, p2 ,p3, ncol = 3)
```

This plot says that users that liked The Godfather more than what the model expects them to, based on the movie and user effects, also liked The Godfather II more than expected. A similar relationship is seen when comparing The Godfather and Goodfellas. Although not as strong, there is still correlation. We see correlations between You've Got Mail and Sleepless in Seattle as well

By looking at the correlation between movies, we can see a pattern (we rename the columns to save print space):

```{r, echo=FALSE}
x <- r[, c(m_1, m_2, m_3, m_4, m_5)]
short_names <- c("Godfather", "Godfather2", "Goodfellas",
                 "You've Got", "Sleepless")
colnames(x) <- short_names
cor(x, use = "pairwise.complete")
```

There seems to be people that like romantic comedies more than expected, while others that like gangster movies more than expected.

These results tell us that there is structure in the data. But how can we model this?

## Factor analysis {#sec-factor-analysis}

Here is an illustration, using a simulation, of how we can use some structure to predict the $r_{u,i}$. Suppose our residuals `r` look like this:

```{r, echo=FALSE}
q <- matrix(c(1 , 1, 1, -1, -1), ncol = 1)
rownames(q) <- short_names
p <- matrix(rep(c(2, 0, -2), c(3, 5, 4)), ncol = 1)
rownames(p) <- 1:nrow(p)

set.seed(1988)
r <- jitter(p %*% t(q), factor = 2)
```

```{r}
round(r, 1)
```

There seems to be a pattern here. In fact, we can see very strong correlation patterns:

```{r}
cor(r)
```

We can create vectors `q` and `p`, that can explain much of the structure we see. The `q` would look like this:

```{r}
t(q)
```

and it narrows down movies to two groups: gangster (coded with 1) and romance (coded with -1). We can also reduce the users to three groups:

```{r}
t(p)
```

those that like gangster movies and dislike romance movies (coded as 2), those that like romance movies and dislike gangster movies (coded as -2), and those that don't care (coded as 0). The main point here is that we can almost reconstruct $r$, which has 60 values, with a couple of vectors totaling 17 values. Note that `p` and `q` are equivalent to the patterns and weights we described in Section @sec-pca.

If $r$ contains the residuals for users $u=1,\dots,12$ for movies $i=1,\dots,5$ we can write the following mathematical formula for our residuals $r_{u,i}$.

$$
r_{u,i} \approx p_u q_i
$$

This implies that we can explain more variability by modifying our previous model for movie recommendations to:

$$
Y_{u,i} = \mu + b_i + b_u + p_u q_i + \varepsilon_{u,i}
$$


However, we motivated the need for the $p_u q_i$ term with a simple simulation. The structure found in data is usually more complex. For example, in this first simulation we assumed there were was just one factor $p_u$ that determined which of the two genres movie $u$ belongs to. But the structure in our movie data seems to be much more complicated than gangster movie versus romance. We may have many other factors. Here we present a slightly more complex simulation. We now add a sixth movie, Scent of Woman.


```{r, echo=FALSE}
set.seed(1988)
m_6 <- "Scent of a Woman"
q <- cbind(c(1 , 1, 1, -1, -1, -1),
           c(1 , 1, -1, -1, -1, 1))
rownames(q) <- c(short_names, "Scent")
p <- cbind(rep(c(1,0,-1), c(3,5,4)),
          c(-1,1,1,0,0,1,1,1,0,-1,-1,-1))
rownames(p) <- 1:nrow(p)

r <- jitter(p %*% t(q), factor = 2)
```

```{r}
round(r, 1)
```

By exploring the correlation structure of this new dataset
```{r, echo=FALSE}
colnames(r)[4:6] <- c("YGM", "SS", "SW")
cor(r)
```
we note that perhaps we need a second factor to account for the fact that some users like Al Pacino, while others dislike him or don't care. Notice that the overall structure of the correlation obtained from the simulated data is not that far off the real correlation:

```{r, echo=FALSE}
cnames <- colnames(r)
r <- sweep(y - mu, 2, b_i_reg) - b_u
colnames(r) <- with(movie_map, title[match(colnames(r), movieId)])

six_movies <- c(m_1, m_2, m_3, m_4, m_5, m_6)
x <- r[, six_movies]
colnames(x) <- cnames
cor(x, use = "pairwise.complete")
```

To explain this more complicated structure, we need two factors. For example something like this:

```{r}
t(q)
```

With the first factor (the first column of `q`) used to code the gangster versus romance groups and a second factor (the second column of `q`) to explain the Al Pacino versus no Al Pacino groups. We will also need two sets of coefficients to explain the variability introduced by the $3\times 3$ types of groups:

```{r}
t(p)
```

The model with two factors has 36 parameters that can be used to explain much of the variability in the 72 ratings:

$$
Y_{u,i} = \mu + b_i + b_u + p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \varepsilon_{u,i}
$$

Note that in an actual data application, we need to fit this model to data. To explain the complex correlation we observe in real data, we usually permit the entries of $p$ and $q$ to be continuous values, rather than discrete ones as we used in the simulation. For example, rather than dividing movies into gangster or romance, we define a continuum. Also note that this is not a linear model and to fit it we need to use an algorithm other than the one used by `lm` to find the parameters that minimize the least squares. The winning algorithms for the Netflix challenge fit a model similar to the above and used regularization to penalize for large values of $p$ and $q$, rather than using least squares. Implementing this approach is beyond the scope of this book.

## Exercises

In this exercise set, we will be covering a topic useful for understanding matrix factorization: the singular value decomposition (SVD). SVD is a mathematical result that is widely used in machine learning, both in practice and to understand the mathematical properties of some algorithms. This is a rather advanced topic and to complete this exercise set you will have to be familiar with linear algebra concepts such as matrix multiplication, orthogonal matrices, and diagonal matrices.

The SVD tells us that we can _decompose_ an $N\times p$ matrix $Y$ with $p < N$ as

$$ Y = U D V^{\top} $$

With $U$ and $V$ _orthogonal_ of dimensions $N\times p$ and $p\times p$, respectively, and $D$ a $p \times p$ _diagonal_ matrix with the values of the diagonal decreasing:

$$d_{1,1} \geq d_{2,2} \geq \dots d_{p,p}.$$

In this exercise, we will see one of the ways that this decomposition can be useful. To do this, we will construct a dataset that represents grade scores for 100 students in 24 different subjects. The overall average has been removed so this data represents the percentage point each student received above or below the average test score. So a 0 represents an average grade (C), a 25 is a high grade (A+), and a -25 represents a low grade (F). You can simulate the data like this:


```{r, eval=FALSE}
set.seed(1987)
n <- 100
k <- 8
Sigma <- 64  * matrix(c(1, .75, .5, .75, 1, .5, .5, .5, 1), 3, 3)
m <- MASS::mvrnorm(n, rep(0, 3), Sigma)
m <- m[order(rowMeans(m), decreasing = TRUE),]
y <- m %x% matrix(rep(1, k), nrow = 1) +
  matrix(rnorm(matrix(n * k * 3)), n, k * 3)
colnames(y) <- c(paste(rep("Math",k), 1:k, sep="_"),
                 paste(rep("Science",k), 1:k, sep="_"),
                 paste(rep("Arts",k), 1:k, sep="_"))
```

Our goal is to describe the student performances as succinctly as possible. For example, we want to know if these test results are all just  random independent numbers. Are all students just about as good? Does being good in one subject  imply you will be good in another? How does the SVD help with all this? We will go step by step to show that with just three relatively small pairs of vectors we can explain much of the variability in this $100 \times 24$ dataset.

You can visualize the 24 test scores for the 100 students by plotting an image:

```{r, eval=FALSE}
my_image <- function(x, zlim = range(x), ...){
  colors = rev(RColorBrewer::brewer.pal(9, "RdBu"))
  cols <- 1:ncol(x)
  rows <- 1:nrow(x)
  image(cols, rows, t(x[rev(rows),,drop=FALSE]), xaxt = "n", yaxt = "n",
        xlab="", ylab="",  col = colors, zlim = zlim, ...)
  abline(h=rows + 0.5, v = cols + 0.5)
  axis(side = 1, cols, colnames(x), las = 2)
}

my_image(y)
```

(@) How would you describe the data based on this figure?

a. The test scores are all independent of each other.
b. The students that test well are at the top of the image and there seem to be three groupings by subject.
c. The students that are good at math are not good at science.
d. The students that are good at math are not good at humanities.

(@) You can examine the correlation between the test scores directly like this:

```{r, eval=FALSE}
my_image(cor(y), zlim = c(-1,1))
range(cor(y))
axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
```

Which of the following best describes what you see?

a. The test scores are independent.
b. Math and science are highly correlated but the humanities are not.
c. There is high correlation between tests in the same subject but no correlation across subjects.
d. There is a correlation among all tests, but higher if the tests are in science and math and even higher within each subject.