Matrix Factorization
Transcript of a lecture by Bamshad Mobasher, DePaul University
The $1 Million Question
Ratings Data

[Figure: sparse user-movie ratings matrix, 480,000 users × 17,700 movies]
Training Data
- 100 million ratings (matrix is 99% sparse)
- Rating = [user, movie-id, time-stamp, rating value]
- Generated by users between Oct 1998 and Dec 2005
- Users randomly chosen among the set with at least 20 ratings
- Small perturbations added to help with anonymity
Ratings Data

[Figure: the same 480,000 users × 17,700 movies matrix with held-out entries (?) forming the test data set, which consists of the most recent ratings]
Scoring
- Minimize root mean square error (RMSE):

  Mean square error = (1/|R|) Σ_{(u,i) ∈ R} ( r_ui − r̂_ui )²

  - Does not necessarily correlate well with user satisfaction
  - But it is a widely used, well-understood quantitative measure
- RMSE baseline scores on test data:
  - 1.054: just predict the mean user rating for each movie
  - 0.953: Netflix's own system (Cinematch) as of 2006
  - 0.941: nearest-neighbor method using correlation
  - 0.857: the 10% reduction required to win $1 million
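The RMSE criterion above is simple to compute. A minimal sketch in NumPy; the toy ratings below are invented to score the "predict the mean rating" baseline:

```python
import numpy as np

def rmse(ratings, predictions):
    """Root mean square error over the known (user, movie) ratings R."""
    ratings = np.asarray(ratings, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.sqrt(np.mean((ratings - predictions) ** 2))

# Toy example: score the "predict the mean rating" baseline.
known = np.array([1.0, 3.0, 4.0, 5.0])
baseline = np.full_like(known, known.mean())  # predict 3.25 everywhere
print(rmse(known, baseline))  # ≈ 1.479
```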
Matrix Factorization of Ratings Data
- Based on the idea of latent factor analysis
  - Identify latent (unobserved) factors that "explain" observations in the data
  - In this case, the observations are user ratings of movies
  - The factors may represent combinations of features or characteristics of movies and users that result in the ratings
[Figure: the m × n ratings matrix R (m users, n movies) approximated by the product of an m × f user-factor matrix and an f × n movie-factor matrix]

r_ui ≈ q_i^T p_u
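To make the factorization concrete, here is a small NumPy sketch: P holds the user factor vectors p_u as rows, Q holds the movie factor vectors q_i as rows, and each predicted rating is the inner product q_i^T p_u. The matrices are random, just to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, f = 4, 5, 2             # m users, n movies, f latent factors
P = rng.normal(size=(m, f))   # row u holds p_u
Q = rng.normal(size=(n, f))   # row i holds q_i

R_hat = P @ Q.T               # m × n matrix of predictions r̂_ui = q_i^T p_u
u, i = 2, 3
assert np.isclose(R_hat[u, i], Q[i] @ P[u])
print(R_hat.shape)  # (4, 5)
```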
Matrix Factorization of Ratings Data
Figure from Koren, Bell, Volinsky, IEEE Computer, 2009
Matrix Factorization of Ratings Data
Credit: Alex Lin, Intelligent Mining
Predictions as Filling Missing Data
Credit: Alex Lin, Intelligent Mining
Learning Factor Matrices
- Need to learn the feature vectors from training data
  - User feature vector: (a, b, c)
  - Item feature vector: (x, y, z)
- Approach: minimize the errors on known ratings

Credit: Alex Lin, Intelligent Mining
Learning Factor Matrices

r_ui ≈ q_i^T p_u

min_{q,p} Σ_{(u,i) ∈ R} ( r_ui − q_i^T p_u )²

Add regularization:

min_{q,p} Σ_{(u,i) ∈ R} ( r_ui − q_i^T p_u )² + λ ( ‖q_i‖² + ‖p_u‖² )
Stochastic Gradient Descent (SGD)

min_{q,p} Σ_{(u,i) ∈ R} ( r_ui − q_i^T p_u )²  +  λ ( ‖q_i‖² + ‖p_u‖² )
          (goodness of fit)                       (regularization)

Online ("stochastic") gradient update equations:

e_ui = r_ui − q_i^T p_u
q_i ← q_i + γ ( e_ui p_u − λ q_i )
p_u ← p_u + γ ( e_ui q_i − λ p_u )
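The update equations above translate directly into a training loop. A minimal sketch, assuming ratings arrive as (user, movie, rating) triples; the learning rate γ, regularization λ, epoch count, and toy ratings are illustrative values, not the ones used in the competition:

```python
import numpy as np

def train_sgd(ratings, m, n, f=2, gamma=0.02, lam=0.02, epochs=500, seed=0):
    """Learn user factors P and movie factors Q by SGD on (u, i, r) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(m, f))          # p_u vectors, one row per user
    Q = 0.1 * rng.normal(size=(n, f))          # q_i vectors, one row per movie
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - Q[i] @ P[u]                # e_ui = r_ui - q_i^T p_u
            Q[i] += gamma * (e * P[u] - lam * Q[i])
            P[u] += gamma * (e * Q[i] - lam * P[u])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0)]
P, Q = train_sgd(ratings, m=3, n=3)
train_rmse = np.sqrt(np.mean([(r - Q[i] @ P[u]) ** 2 for u, i, r in ratings]))
print(train_rmse)  # small: the model fits the known ratings closely
```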
Components of a Rating Predictor
user bias + movie bias + user-movie interaction

Baseline predictor
- Separates users and movies
- Often overlooked
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

User-movie interaction
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Credit: Yehuda Koren, Google, Inc.
Modeling Systematic Biases

r_ui ≈ μ + b_u + b_i + q_i^T p_u

where μ is the overall mean rating, b_u the mean rating offset for user u, b_i the mean rating offset for movie i, and q_i^T p_u the user-movie interaction term.

Example:
- Mean rating μ = 3.7
- You are a critical reviewer: your ratings are 1 lower than the mean → b_u = −1
- Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
- Predicted rating for you on Star Wars = 3.7 − 1 + 0.5 = 3.2

Credit: Padhraic Smyth, University of California, Irvine
Objective Function

min_{q,p,b} { Σ_{(u,i) ∈ R} ( r_ui − (μ + b_u + b_i + q_i^T p_u) )²  +  λ ( ‖q_i‖² + ‖p_u‖² + b_u² + b_i² ) }
             (goodness of fit)                                          (regularization)

λ is typically selected via grid search on a validation set.

Credit: Padhraic Smyth, University of California, Irvine
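Evaluating this objective is mechanical once the pieces are in place. A minimal sketch; the function name and toy numbers are my own, chosen to replay the Star Wars example with the interaction term zeroed out:

```python
import numpy as np

def objective(ratings, mu, b_u, b_i, P, Q, lam):
    """Regularized squared error for r_ui ≈ mu + b_u + b_i + q_i^T p_u."""
    total = 0.0
    for u, i, r in ratings:
        pred = mu + b_u[u] + b_i[i] + Q[i] @ P[u]
        total += (r - pred) ** 2                        # goodness of fit
        total += lam * (Q[i] @ Q[i] + P[u] @ P[u]       # regularization
                        + b_u[u] ** 2 + b_i[i] ** 2)
    return total

# Star Wars example from the biases slide, with zero factor vectors:
ratings = [(0, 0, 3.2)]
val = objective(ratings, mu=3.7, b_u=np.array([-1.0]), b_i=np.array([0.5]),
                P=np.zeros((1, 2)), Q=np.zeros((1, 2)), lam=0.0)
print(val)  # essentially zero: the biases alone reproduce the rating
```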
[Figure with 5% and 8% improvement annotations. Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
Explanation for increase?
Adding Time Effects

r_ui ≈ μ + b_u + b_i + user-movie interactions
r_ui ≈ μ + b_u(t) + b_i(t) + user-movie interactions

Add time dependence to the biases. The time dependence is parametrized by linear trends, binning, and other methods.

For details see Y. Koren, "Collaborative Filtering with Temporal Dynamics," ACM SIGKDD Conference, 2009.

Credit: Padhraic Smyth, University of California, Irvine
Adding Time Effects

r_ui ≈ μ + b_u(t) + b_i(t) + q_i^T p_u(t)

Add time dependence to the user "factor weights". This models the fact that a user's interests over "genres" (the q's) may change over time.
[Figure with 5% and 8% improvement annotations. Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
The Kitchen Sink Approach
- Many options for modeling
  - Variants of the ideas we have seen so far:
    - Different numbers of factors
    - Different ways to model time
    - Different ways to handle implicit information
    - ...
  - Other models (not described here):
    - Nearest-neighbor models
    - Restricted Boltzmann machines
- Model averaging was useful:
  - Linear model combining
  - Neural network combining
  - Gradient boosted decision tree combining
  - Note: combining weights learned on a validation set ("stacking")

Credit: Padhraic Smyth, University of California, Irvine
Other Aspects of Model Building
- Automated parameter tuning
  - Using a validation set and grid search, parameters such as learning rates and regularization weights can be optimized
- Memory requirements
  - Can fit within roughly 1 GB of RAM
- Training time
  - On the order of days, but achievable on commodity hardware rather than a supercomputer
  - Some parallelization was used

Credit: Padhraic Smyth, University of California, Irvine
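The grid search mentioned above can be sketched as follows. Everything here is invented for illustration: a tiny rank-1 factor model trained by SGD, a toy train/validation split, and candidate values for the learning rate γ and regularization λ:

```python
import itertools
import numpy as np

def validation_rmse(gamma, lam, train, valid):
    """Fit a rank-1 factor model with SGD, then score it on held-out ratings."""
    rng = np.random.default_rng(0)
    m = 1 + max(u for u, _, _ in train + valid)   # number of users
    n = 1 + max(i for _, i, _ in train + valid)   # number of movies
    P = 0.1 * rng.normal(size=(m, 1))
    Q = 0.1 * rng.normal(size=(n, 1))
    for _ in range(500):
        for u, i, r in train:
            e = r - Q[i] @ P[u]
            Q[i] += gamma * (e * P[u] - lam * Q[i])
            P[u] += gamma * (e * Q[i] - lam * P[u])
    return float(np.sqrt(np.mean([(r - Q[i] @ P[u]) ** 2 for u, i, r in valid])))

train = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 4.0)]
valid = [(1, 1, 2.0)]
grid = itertools.product([0.005, 0.02], [0.01, 0.1])   # candidate (gamma, lam)
best = min(grid, key=lambda gl: validation_rmse(gl[0], gl[1], train, valid))
print("selected (gamma, lambda):", best)
```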
Progress Prize 2008
- Sept 2nd: only 3 teams qualify for 1% improvement over the previous year
- Oct 2nd: leading team has 9.4% overall improvement
- Progress prize ($50,000) awarded to the BellKor team of 3 AT&T researchers (same as before) plus 2 Austrian graduate students, Andreas Toscher and Martin Jahrer
- Key winning strategy: clever "blending" of predictions from models used by both teams
- Speculation that 10% would be attained by mid-2009
The Leading Team for the Final Prize
- BellKor's Pragmatic Chaos
  - BellKor: Yehuda Koren (now Yahoo!), Bob Bell, Chris Volinsky, AT&T
  - BigChaos: Michael Jahrer, Andreas Toscher, 2 grad students from Austria
  - Pragmatic Theory: Martin Chabert, Martin Piotte, 2 engineers from Montreal (Quebec)
June 26th 2009: after 1000 days & nights…
Million Dollars Awarded Sept 21st 2009