Large-scale Parallel Collaborative Filtering for the Netflix Prize Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan
TALK LEAD: PRESTON ENGSTROM
FACILITATORS: SERENA MCDONNELL, OMAR NADA
Agenda
Historical Background
Problem Formulation
Alternating Least Squares For Collaborative Filtering
Results
Discussion
The Netflix Prize
In 2006, Netflix offered 1 million dollars for an algorithm that could beat their in-house system, "Cinematch", by 10% in terms of RMSE
100MM ratings
(User, Item, Date, Rating)
Training data and holdout were split by date:
- Training data: 1996 – 2005
- Test data: 2006
The Netflix Prize: Solutions
Recommendation Systems: Collaborative Filtering
Generate recommendations of relevant items from the aggregate behavior and tastes of a large number of users
Contrasted with content-based filtering, which uses attributes (text, tags, user age, demographics) to find related items and users
Notation
n_u : Number of users
n_m : Number of items (movies)
R : n_u × n_m interaction matrix
r_ij : Rating provided by user i for item j
I_i : Set of movies user i has rated
n_ui : Cardinality of I_i
I_j : Set of users who rated movie j
n_mj : Cardinality of I_j
Problem Formulation
Given a set of users and items, estimate how a user would rate an item they have had no interaction with
Available information:
- List of (User ID, Item ID, Rating, ...)
Low Rank Approximation
- Populate a matrix with the known interactions, factor it into two smaller matrices, and use them to generate the unknown interactions
Problem Formulation
(Figure: the rating matrix R factored into a user feature matrix U and a movie feature matrix M)
Let U = [u_i] be the user feature matrix, where u_i ∈ ℝ^{n_f} for all i = 1…n_u
Let M = [m_j] be the movie feature matrix, where m_j ∈ ℝ^{n_f} for all j = 1…n_m
n_f is the dimension of the feature space. This gives (n_u + n_m) × n_f parameters to be learned
If the full interaction matrix R were known, and n_f is sufficiently large, we could expect that r_ij = ⟨u_i, m_j⟩, ∀ i, j
(Example in the figure: n_f = 3)
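To make the factor model concrete, here is a minimal NumPy sketch (not from the paper; the sizes and random values are purely illustrative) of predicting a rating as the inner product of an n_f = 3 user vector and movie vector:

```python
import numpy as np

# Toy sizes: 4 users, 5 movies, n_f = 3 features (as in the slide's example)
n_u, n_m, n_f = 4, 5, 3

rng = np.random.default_rng(0)
U = rng.normal(size=(n_f, n_u))   # user feature matrix; column u_i per user
M = rng.normal(size=(n_f, n_m))   # movie feature matrix; column m_j per movie

# Predicted rating for user i and movie j is the inner product <u_i, m_j>
i, j = 2, 4
r_hat_ij = U[:, i] @ M[:, j]

# The full predicted interaction matrix is U^T M, of shape n_u x n_m
R_hat = U.T @ M
assert np.isclose(R_hat[i, j], r_hat_ij)
```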
Problem Formulation
Mean square loss function for a single rating:
L²(r, u, m) = (r − ⟨u, m⟩)²
Total loss for a given U and M, over the set of known ratings I:
L_emp(R, U, M) = (1/n) Σ_{(i,j) ∈ I} L²(r_ij, u_i, m_j), where n is the number of known ratings
Finding a good low rank approximation:
(U, M) = argmin_{(U, M)} L_emp(R, U, M)
Many parameters and few ratings make overfitting very likely, so a regularization term should be introduced:
L_reg(R, U, M) = L_emp(R, U, M) + λ ( ‖U Γ_U‖² + ‖M Γ_M‖² ), for suitably chosen Tikhonov matrices Γ_U and Γ_M
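A minimal sketch of this regularized loss, assuming the simplest choice Γ_U = Γ_M = identity (a plain L2 penalty); the weighted choice used by ALS-WR appears later. Variable and function names here are mine:

```python
import numpy as np

def regularized_loss(R, known, U, M, lam):
    """L_reg = L_emp + lam * (||U||_F^2 + ||M||_F^2), taking Gamma_U and
    Gamma_M to be identity matrices (plain L2 penalty) for simplicity.

    R     : n_u x n_m array of ratings (unobserved entries may hold anything)
    known : boolean n_u x n_m mask of observed ratings (the set I)
    U, M  : n_f x n_u and n_f x n_m feature matrices
    lam   : regularization weight
    """
    errors = (R - U.T @ M)[known]          # residuals on known ratings only
    l_emp = np.mean(errors ** 2)           # (1/n) * sum of squared errors
    penalty = lam * (np.sum(U ** 2) + np.sum(M ** 2))
    return l_emp + penalty
```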
Singular Value Decomposition
Singular Value Decomposition (SVD) is a common method for approximating a matrix R by the product of two rank-k matrices, R̃ = U^T × M
Because the algorithm operates over the entire matrix, minimizing the Frobenius norm ‖R − R̃‖, standard SVD cannot handle the many unknown entries of R and fails to find suitable U and M
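For illustration only (not part of the paper's method), a rank-k truncation of a fully observed matrix is easy with NumPy's SVD; the point is that the Frobenius norm is taken over every entry, so the mostly missing Netflix ratings would first have to be imputed:

```python
import numpy as np

def truncated_svd(R_full, k):
    """Best rank-k approximation of a *fully observed* matrix R_full in the
    Frobenius norm (Eckart-Young theorem)."""
    W, s, Vt = np.linalg.svd(R_full, full_matrices=False)
    return W[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# On the Netflix data the vast majority of R is unobserved, so the entries
# would first have to be imputed (e.g. with per-movie means), which biases
# the result; this motivates minimizing over known ratings only, as ALS does.
```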
Alternating Least Squares
Alternating Least Squares is one method for finding U and M that considers only the known ratings
By treating each update of U and M as a least-squares problem, we can update one matrix while holding the other fixed, using only the known entries of R
ALS with Weighted-λ-Regularization
Step 1: Set M to be an n_f × n_m matrix, and assign the average rating of each movie to its first row
Step 2: Fix M and solve for U by minimizing the objective function
Step 3: Fix U and solve for M by minimizing the objective function
Step 4: Repeat Steps 2 and 3 for a set number of iterations, or until the improvement in RMSE on the probe dataset falls below 0.0001 (a sketch of this loop follows below)
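A compact sketch of this loop, assuming helper callables `solve_U`, `solve_M`, and `probe_rmse` supplied elsewhere (the least-squares updates are derived later in the talk); initializing the remaining rows of M with small random values follows the original paper's description:

```python
import numpy as np

def als_wr(R, known, n_f, lam, solve_U, solve_M, probe_rmse,
           max_iters=30, tol=1e-4):
    """Outer ALS-WR loop. `solve_U` and `solve_M` are the two least-squares
    updates; `probe_rmse(U, M)` returns RMSE on the probe set."""
    n_u, n_m = R.shape
    rng = np.random.default_rng(0)

    # Step 1: first row of M holds each movie's average rating,
    # remaining rows hold small random values
    M = rng.normal(scale=0.01, size=(n_f, n_m))
    counts = np.maximum(known.sum(axis=0), 1)
    M[0, :] = (R * known).sum(axis=0) / counts

    U = np.zeros((n_f, n_u))
    prev = np.inf
    for _ in range(max_iters):
        U = solve_U(R, known, M, lam)      # Step 2: fix M, solve for U
        M = solve_M(R, known, U, lam)      # Step 3: fix U, solve for M
        rmse = probe_rmse(U, M)
        if prev - rmse < tol:              # Step 4: stop on < 0.0001 improvement
            break
        prev = rmse
    return U, M
```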
ALS with Weighted-λ-Regularization
The selected loss function uses Tikhonov regularization to penalize large parameters. This scheme was found never to overfit the test data as the number of features n_f or the number of iterations was increased.
f(U, M) = Σ_{(i,j) ∈ I} ( r_ij − u_i^T m_j )² + λ ( Σ_i n_ui ‖u_i‖² + Σ_j n_mj ‖m_j‖² )

λ : Regularization weight
n_ui : Cardinality of I_i
n_mj : Cardinality of I_j
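A direct NumPy transcription of this objective for reference (a sketch; variable names are mine, and `known` is a boolean mask of the set I):

```python
import numpy as np

def als_wr_objective(R, known, U, M, lam):
    """f(U, M) = sum over known (i,j) of (r_ij - u_i^T m_j)^2
                 + lam * ( sum_i n_ui * ||u_i||^2 + sum_j n_mj * ||m_j||^2 )"""
    sq_err = ((R - U.T @ M) ** 2)[known].sum()

    n_ui = known.sum(axis=1)                 # ratings per user,  |I_i|
    n_mj = known.sum(axis=0)                 # ratings per movie, |I_j|
    penalty = lam * (np.sum(n_ui * np.sum(U ** 2, axis=0)) +
                     np.sum(n_mj * np.sum(M ** 2, axis=0)))
    return sq_err + penalty
```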
Solving for U given M

(1/2) ∂f/∂u_ki = 0, ∀ i, k
⇒ Σ_{j ∈ I_i} (u_i^T m_j − r_ij) m_kj + λ n_ui u_ki = 0, ∀ i, k
⇒ Σ_{j ∈ I_i} m_kj m_j^T u_i + λ n_ui u_ki = Σ_{j ∈ I_i} m_kj r_ij, ∀ i, k
⇒ (M_Ii M_Ii^T + λ n_ui E) u_i = M_Ii R^T(i, I_i), ∀ i
⇒ u_i = A_i^{-1} V_i, ∀ i

where A_i = M_Ii M_Ii^T + λ n_ui E, V_i = M_Ii R^T(i, I_i), M_Ii is the sub-matrix of M whose columns are the movies j ∈ I_i, R(i, I_i) is the row vector of user i's known ratings, and E is the n_f × n_f identity matrix.
Update Procedure (Single Machine)
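A single-machine NumPy sketch of the "fix M, solve for U" step, following the closed form u_i = A_i^{-1} V_i derived above (function and variable names are mine):

```python
import numpy as np

def solve_U(R, known, M, lam):
    """Update all user vectors with M fixed:
    u_i = A_i^{-1} V_i,  A_i = M_Ii M_Ii^T + lam * n_ui * E,
                         V_i = M_Ii R(i, I_i)^T
    """
    n_f = M.shape[0]
    n_u = R.shape[0]
    U = np.zeros((n_f, n_u))
    for i in range(n_u):
        I_i = np.flatnonzero(known[i])       # movies rated by user i
        if I_i.size == 0:
            continue                         # leave u_i at zero if no ratings
        M_Ii = M[:, I_i]                     # n_f x |I_i| sub-matrix of M
        A_i = M_Ii @ M_Ii.T + lam * I_i.size * np.eye(n_f)
        V_i = M_Ii @ R[i, I_i]
        U[:, i] = np.linalg.solve(A_i, V_i)  # more stable than an explicit inverse
    return U

# The "fix U, solve for M" step is symmetric, looping over movies instead of users.
```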
Parallelization
ALS-WR can be parallelized by distributing the updates of U and M over multiple computers
When updating U, each computer needs to hold only a portion of U and the corresponding rows of R; however, a complete copy of M is needed locally on each computer
Once all computers have updated their local partition of U, the results can be gathered and redistributed to allow the subsequent update of M
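A single-machine illustration of this partition/gather pattern using a process pool, assuming the `solve_U` sketch above; the paper's implementation ran across multiple computers, so this only mimics the data movement:

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def parallel_solve_U(R, known, M, lam, solve_U, n_workers=4):
    """Distribute the user update: each worker receives one block of users
    (its rows of R and `known`) plus a complete copy of M, mirroring how each
    machine holds only its partition of R and U but all of M.
    `solve_U` must be a module-level function so it can be pickled."""
    blocks = np.array_split(np.arange(R.shape[0]), n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(solve_U, R[b], known[b], M, lam) for b in blocks]
        partial_U = [f.result() for f in futures]
    # Gather the per-block columns of U; the full U would then be redistributed
    # to all workers for the symmetric update of M.
    return np.hstack(partial_U)
```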
Post Processing
Global mean shift:
- Given a prediction P, if the mean of P is not equal to the mean of the test dataset, all predictions are shifted by a constant τ = mean_test − mean_P
- This is shown to strictly reduce RMSE
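A one-line version of the shift, assuming `target_mean` is the known mean of the test (or probe) ratings:

```python
import numpy as np

def global_mean_shift(pred, target_mean):
    """Shift every prediction by tau = mean_test - mean_pred so that the
    prediction mean matches the test-set mean."""
    return pred + (target_mean - np.mean(pred))
```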
Linear combinations of predictors:
- Given two predictors P0 and P1, a new family of predictors P_x = (1 − x) P0 + x P1 can be formed, and x* determined by minimizing RMSE(P_x)
- This yields a predictor P_x*, which is at least as good as P0 or P1
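Minimizing RMSE(P_x) over x is a one-dimensional least-squares problem with a closed form; a small sketch (names are mine):

```python
import numpy as np

def optimal_blend(p0, p1, y):
    """Find x* minimizing the RMSE of (1 - x) * p0 + x * p1 against targets y.
    One-dimensional least squares: x* = <p1 - p0, y - p0> / ||p1 - p0||^2."""
    d = p1 - p0
    x_star = float(d @ (y - p0)) / float(d @ d)
    blended = (1 - x_star) * p0 + x_star * p1
    return x_star, blended
```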
2 Minute Break
Hyperparameter Selection: λ
Hyperparameter Selection: n_f
Experimental Results
Where does ALS-WR stack up against the top Netflix Prize winners?
The best RMSE score is an improvement of 5.56% over Cinematch
n_f     λ       RMSE     Bias Correction
50      0.065   0.9114   No
150     0.065   0.9066   No
300     0.065   0.9017   Yes
400     0.065   0.9006   Yes
500     0.065   0.9000   Yes
1000    0.065   0.8985   No
Combining ALS-WR With Other Methods
The top-performing Netflix Prize contributions were all ensembles of models. Given this, and the apparent diminishing returns from increasing n_f and λ, ALS-WR was combined with two other common methods:
- Restricted Boltzmann Machines (RBM)
- K-nearest Neighbours (kNN)
Both methods can be parallelized in a similar manner to ALS-WR, and each provides a modest improvement in performance.
Using a linear blend of ALS, RBM, and kNN, an RMSE of 0.8952 (a 5.91% improvement over Cinematch) was obtained
Discussion
While RMSE is the error metric both Netflix and the researchers chose, it's not necessarily the most appropriate. What other metrics could have been useful?
It is stated that the regularization scheme prevents overfitting if either the number of features or the number of iterations is increased. Does this mean that if both are increased, the model will overfit? Is the claim that the model never overfits valid?