diff --git a/blog/_posts/2016-02-21-cfn.md b/blog/_posts/2016-02-21-cfn.md new file mode 100644 index 0000000..b165eea --- /dev/null +++ b/blog/_posts/2016-02-21-cfn.md @@ -0,0 +1,339 @@ +--- +layout: post +title: Collaborative Filtering with Neural Networks +comments: True +author: florian-strub +excerpt: Collaborative Filtering methods are key algorithms for Recommender Systems. They are used by major companies such as Netflix, Amazon or Spotify to suggest items to users. While Neural Networks have tremendous success in image and speech recognition, Collaborative Filtering received less attention as it deals with sparse inputs/targets. After a quick introduction to Recommendation Systems, we show how to implement a well-defined Autoencoders to perform Collaborative Filtering. +picture: http://insidebigdata.com/wp-content/uploads/2014/06/Humor_recommender.jpg +--- + + + +Deep Learning has well-known state-of-the-art results in Computer Vision, Speech Recognition or Natural Language Processing. +This post intends at using Neural Networks on a different topic : Recommender systems and Collaborative Filtering. +It turns out that a well-defined Autoencoder can compete with the best Collaborative Fitering algorithms. +This post focus on understating and implementing such Autoencoders. + +#Collaborative Filtering# + +Recommendation systems are a subclass of algorithms that seek to predict the rating that a user would give to an item. +A good recommendation system may dramatically increase the amount of sales of a firm or retain customers. For instance, 80% of movies watched on Netflix come from the recommender system of the company. + +One of the most successful approach of Recommendation systems is to use the explicit feedback of users to build a Matrix of ratings. This technique is known as Collaborative Filtering. + +For instance, if we gather the feedback of movies of all users into a matrix, we obtain a sparse matrix of ratings. Note that the sparsity is induced by unknown values. + +![initialMatrix](http://florian-strub.com/tmp/initialMatrix.png) + +In this case, should we recommend Bob to watch _The Exorcist_ or _Titanic_? + +Collaborative Filtering aims at estimating the ratings a user would have given to other movies by using the ratings of all the other users. +Inherent features of users or items such as the genre, age, actors or some descriptions are _not_ taken into account. +The goal is then to turn the previous sparse matrix of ratings into a dense one: + +![finalMatrix](http://florian-strub.com/tmp/finalMatrix.png) + +Therefore, the Recommender System will advise Bob to watch _The Exorcist_. + +If Bob eventually rates _The Exorcist_, the matrix is re-estimated and so on... +In practice, the matrix of ratings is very sparse. IMDB currently indexes around 1,7 millions movies while users rarely rate more than a few hundreds movies. + +I would recommend the [following paper](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf) [1] to curious people which provide a great insight regarding the challenges of the field (Latent factors, Cold-Sart Problem, Hybrid Systems etc.) + +In this post, we will introduce a Collaborative Filtering approach based on Stacked Denoising Autoencoders. +A preliminary (and excellent!) work was first pulished in 2007 by Ruslan Salakhutdinov [2]. + +# Understanding the model # + +As a short reminder, Autoencoders are unsupervised Networks where the output of the Network plan to reconstruct the initial input. +The Network is constrained to use narrow hidden layers, forcing a dimensionality reduction on the data. + +In a Collaborative Filtering task, the goal is to turn the sparse matrix of ratings into a dense one. +To do so, we are going to decompose the matrix by either its lines or columns and feed them to the Autoencoder. +The output layer will reconstruct the lines/columns and complete them. + + ![UV](http://florian-strub.com/tmp/UV.png) + +Note that the input/output size will depend on the matrix dimension. + +Before implemeting the Autoencoder, we need to define how to handle unknwon values. + +In the forward pass, a naive but efficient approach is to turn unkwown values into zero values. +This is strictly equivalent to removing the neurons from the network. In other words, unknown values are ignored. + +In the backward pass, the same idea can be applied. When no rating exist, the error is set to zero as it cannot be computed. + + ![training1](http://florian-strub.com/tmp/training3-small.png) + +At first sight, there is very few constraint in such Autoencoders. Actually, dropout already removed 50% of the weights. +Beware that more than 95% of the ratings are missing in real datasets. Thus, the bottleneck of the autoencoder is often greater than the number of ratings in the input vector! +If we change the toys example by a bit more realistic autoencoder. It is not straightfroward that the network will be able to learn a useful representation from a very sparse representation. + + ![error](http://florian-strub.com/tmp/realNetworks.png) + +The autoencoder is acctually diverted from its original purpose. It does not aim to reconstruct the initial input anymore, it aims at predicting the missing data. + +By keeping this idea in mind, we applied the Denoising loss [2] to emphasize the prediction aspect during the training. +First, a few input ratings are masked on every input vector to simulate missing ratings. +And then, a higher error is returned when the output rating is erroneously predicted. + +The training loss becomes + ![DAELoss](http://florian-strub.com/tmp/DAELoss-small.png) + +where \tilde{x} is the corrupted input x, J are the indices of the corrupted elements of x, nn(x)_j is the jth output of the Network, K(x) are the indices of known values of x. +The loss is based on two main hyperparameters Alpha, Beta. They balance the evolution of the Network between denoising the input (alpha) and reconstructing the input (beta). + +Finally, the full training process is described below + + ![training1](http://florian-strub.com/tmp/training2-small.png) + +A more advanced description is described in our [paper](https://hal.archives-ouvertes.fr/hal-01281794) [4]). +Importantly, we explain in this paper the underlying idea that links classic Collaborative Filtering techniques and Autoencoders and propose ideas that __may explain__ why neural Networks performs well in Collaborative Filtering tasks. + +Now, let's start the interesting part! + +# Implementation # + + + +First, we need to install the nnsparse package. It will provide us with the basic tools to handle sparse data. + +``` +luarocks install nnsparse +``` + + +## The data ## +We are going to work on a small but classic dataset called [movieLens-1M](http://grouplens.org/datasets/movielens/). +This dataset contains 1 million ratings from 1 to 5 given by 6040 users on 3700 movies. + +To avoid memory overflow, data must be stored in sparse tensors. Given the toy matrix in the first section, the sparse matrix of users/items would be: + + ![training1](http://florian-strub.com/tmp/dense2sparse.png) + + +The following code parses the movieLens-1M dataset and turns it into a sparse reprensation of items + + +```lua + + local train, test = {}, {} + + -- Step 1 : Load file + local ratesfile = io.open("ml-1m/ratings.dat", "r") + + -- Step 2 : Retrieve ratings + for line in ratesfile:lines() do + + -- parse the file by using regex + local userIdStr, movieIdStr, ratingStr, _ = line:match('(%d+)::(%d+)::(%d%.?%d?)::(%d+)') + + local userId = tonumber(userIdStr) + local itemId = tonumber(movieIdStr) + local rating = tonumber(ratingStr) + + -- normalize the rating between [-1, 1] + rating = (rating-3)/2 + + -- we are going to autoencode the item with a training ratio of 0.9 + if torch.uniform() < 0.9 then + train[itemId] = train[itemId] or nnsparse.DynamicSparseTensor() + train[itemId]:append(torch.Tensor{userId, rating}) + else + test[itemId] = test[itemId] or nnsparse.DynamicSparseTensor() + test[itemId]:append(torch.Tensor{userId, rating}) + end + + end + + -- Step 3 : Build the final sparse matrices + for k, oneTrain in pairs(train) do train[k] = oneTrain:build():ssortByIndex() end + for k, oneTest in pairs(test) do test[k] = oneTest:build():ssortByIndex() end + +``` + +It is important to unbias the lines/columns by the training mean to increase the final results. + +```lua +-- Step 4 : remove mean +for k, oneTrain in pairs(train) do + local mean = oneTrain[{ {},2 }]:mean() + train[k][{ {},2 }]:add(-mean) + test[k] [{ {},2 }]:add(-mean) +end +``` + +## The Network ## +We will first implement a shallow Autoencoder with sparse inputs for items. Deeper Autoencoder can be used, but the final results will only be increased by a small margin. +So let's keep things simple! + +- On CPU: + +```lua +local network = nn.Sequential() +network:add(nnsparse.SparseLinearBatch(6040, 500)) --There are 6040 users in movieLens-1M +network:add(nn.Tanh()) +network:add(nn.Linear(500, 6040)) +network:add(nn.Tanh()) +``` + +- On GPU: + +```lua +local network = nn.Sequential() +network:add(nnsparse.Densify(6040)) --GPU does not handle sparse data, so we densify the input on the fly +network:add(nn.Linear(6040, 750)) +network:add(nn.Tanh()) +network:add(nn.Linear(750, 6040)) +network:add(nn.Tanh()) +network:cuda() +``` + +## Training Loss ## + +As described above, training the autoencoder requires to use a Denoising Loss. +The ```nnsparse``` package offers a full Denoising Loss (Gaussian Noise/Masking Noise/ Salt&Paper Noise) for CPU and a Masking Loss that both work on CPU/GPU. +We will use the latter since we want to focus on the prediction aspect. + +```lua + + -- define the Loss function + lossFct = nnsparse.MaskCriterion(nn.MSECriterion(), { + alpha = 1, + beta = 0.5, + hideRatio = 0.25, + }) + lossFct.inputDim = 6040 --require for sparse inputs + lossFct.sizeAverage = false + + -- Create minibatch + local input, minibatch = {}, {} + + --shuffle the indices of the inputs to create the minibatch + local shuffle = torch.randperm(inputSize) + shuffle:apply(function(k) + if train[k] then + input[#input+1] = train[k] + if #input == batchSize then + minibatch[#minibatch+1] = input + input = {} + end + end + end) + if #input > 0 then + minibatch[#minibatch+1] = input + end + + -- Classic training + local w, dw = network:getParameters() + + for _, input in pairs(minibatch) do + local function feval(x) + + -- Reset gradients and losses + network:zeroGradParameters() + + -- AutoEncoder targets + local target = input + + -- Compute noisy input for Denoising autoencoders + local noisyInput = lossFct:prepareInput(input) + + -- FORWARD + local output = network:forward(noisyInput) + local loss = lossFct:forward(output, target) + + -- BACKWARD + local dloss = lossFct:backward(output, target) + local _ = network:backward(noisyInput, dloss) + + -- Return loss and gradients + return loss/batchSize, dw:div(batchSize) + end + + sgdConfiguration.evalCounter = t + optim.sgd (feval, w, sgdConfiguration ) + + end +``` + +## Testing Loss ## + +Torch criterions do not handle sparse targets by default. Thus, it is necessary to wrap them into another criterion. +The ```nnsparse.SparseCriterion``` takes a dense output and a sparse target as parameters. +It then computes the loss/dloss regarding the indices of the target. + +```lua + -- Create minibatch + local noRatings = 0 + local input, target, minibatch = {}, {}, {} + + for k, _ in pairs(train) do + + if test[k] ~= nil then --ignore when there is no target + input [#input +1] = train[k] + target[#target +1] = test[k] + + noRatings = noRatings + test[k]:size(1) + + if #input == batchSize then + minibatch[#minibatch+1] = {input = input, target = target} + input, target = {}, {} + end + end + end + if #input > 0 then + minibatch[#minibatch+1] = {input = input, target = target} + end + + -- define the testing criterion + local criterion = nnsparse.SparseCriterion(nn.MSECriterion()) + criterion.sizeAverage = false + + -- Compute the RMSE by predicting the testing dataset thanks to the training dataset + local err = 0 + for _, oneBatch in pairs(minibatch) do + local output = network:forward(oneBatch.input) + err = err + criterion:forward(output, oneBatch.target) + end + + err = err/noRatings + + print("Current RMSE : " .. math.sqrt(err) * 2) -- The RMSE is rescaled +``` + +Here we are! The shallow autoencoder can be trained. + +The full code can be found [here](https://github.com/fstrub95/Autoencoders_cf/blob/master/src/tutorial.lua). + +A more advanced implementation can be found [here](https://github.com/fstrub95/Autoencoders_cf/). + + +# Results # + +For movieLens-1M, the network should be trained in a few minutes (~20 epoches). +The final RMSE should be around 0.839 with the following hyperparameters. Deeper Networks can go up to a RMSE of 0.831. + +```lua +local sgdConfiguration = { + learningRate = 0.07, + learningRateDecay = 0.9, + weightDecay = 0.05, +} +``` + +As a benchmark, Alternative Least Squares [3] is a widely used Collaborative algorithm. It has a final RMSE around 0.850 on movieLens-1M - cf [Mahout](http://mahout.apache.org/) + [Script](https://github.com/apache/mahout/blob/master/examples/bin/factorize-movielens-1M.sh) + +Furthermore, the bigger is the dataset, the better will be the Autoencoder. The current state of the art results can be obtained with a shallow autoencoder for movieLens-10/20M. + +Once you have your network, you are free to implement your recommendation system! + +More advanced Network can be built to integrate side information such as movie tags, age/sex/job etc. [4] +Boosting can also be used to mix both User/Item-autoencoders to compute a single matrice. + + +## References +1. *Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, (8), 30-37.,* +2. *Salakhutdinov, R., Mnih, A., & Hinton, G. (2007, June). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning (pp. 791-798). ACM.* +3. *Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the netflix prize. In Algorithmic Aspects in Information and Management (pp. 337-348). Springer Berlin Heidelberg.* +4. *Strub, F., Mary, J., Gaudel R., (2016). Hybrid Collaborative Filterings and Neural Networks. Under review* diff --git a/blog/_posts/images/cfn/DAELoss-small.png b/blog/_posts/images/cfn/DAELoss-small.png new file mode 100644 index 0000000..5e34475 Binary files /dev/null and b/blog/_posts/images/cfn/DAELoss-small.png differ diff --git a/blog/_posts/images/cfn/Humor_recommender.jpg b/blog/_posts/images/cfn/Humor_recommender.jpg new file mode 100644 index 0000000..289428f Binary files /dev/null and b/blog/_posts/images/cfn/Humor_recommender.jpg differ diff --git a/blog/_posts/images/cfn/UV.png b/blog/_posts/images/cfn/UV.png new file mode 100644 index 0000000..c944a86 Binary files /dev/null and b/blog/_posts/images/cfn/UV.png differ diff --git a/blog/_posts/images/cfn/dense2sparse.png b/blog/_posts/images/cfn/dense2sparse.png new file mode 100644 index 0000000..a8a1b82 Binary files /dev/null and b/blog/_posts/images/cfn/dense2sparse.png differ diff --git a/blog/_posts/images/cfn/finalMatrix.png b/blog/_posts/images/cfn/finalMatrix.png new file mode 100644 index 0000000..eba67c42 Binary files /dev/null and b/blog/_posts/images/cfn/finalMatrix.png differ diff --git a/blog/_posts/images/cfn/initialMatrix.png b/blog/_posts/images/cfn/initialMatrix.png new file mode 100644 index 0000000..f548256 Binary files /dev/null and b/blog/_posts/images/cfn/initialMatrix.png differ diff --git a/blog/_posts/images/cfn/realNetworks.png b/blog/_posts/images/cfn/realNetworks.png new file mode 100644 index 0000000..354d636 Binary files /dev/null and b/blog/_posts/images/cfn/realNetworks.png differ diff --git a/blog/_posts/images/cfn/training1-small.png b/blog/_posts/images/cfn/training1-small.png new file mode 100644 index 0000000..e392bf7 Binary files /dev/null and b/blog/_posts/images/cfn/training1-small.png differ diff --git a/blog/_posts/images/cfn/training2-small.png b/blog/_posts/images/cfn/training2-small.png new file mode 100644 index 0000000..ef418b7 Binary files /dev/null and b/blog/_posts/images/cfn/training2-small.png differ diff --git a/blog/_posts/images/cfn/training3-small.png b/blog/_posts/images/cfn/training3-small.png new file mode 100644 index 0000000..a8f9e68 Binary files /dev/null and b/blog/_posts/images/cfn/training3-small.png differ