Post on 21-Jan-2018
Matrix Factorizations for Recommender Systems
Dmitriy Selivanov (selivanov.dmitriy@gmail.com)
2017-11-16
Recommender systems are everywhere
Goals
Propose “relevant” items to customers:

- Retention
- Exploration
- Up-sale
- Personalized offers
- Recommended items for a customer given their history of activities (transactions, browsing history, favourites)
- Similar items
  - substitutions
  - bundles (frequently bought together)
  - . . .
Live demo

Dataset - LastFM-360K:

- 360k users
- 160k artists
- 17M observations
- sparsity ~ 0.9997
Explicit feedback

Ratings, likes/dislikes, purchases:

- cleaner data
- smaller
- hard to collect

$$\mathrm{RMSE}^2 = \frac{1}{|D|}\sum_{u,i \in D}(r_{ui} - \hat{r}_{ui})^2$$
Netflix prize

- ~480k users, 18k movies, 100M ratings
- sparsity ~ 99%
- goal: reduce RMSE by 10% - from 0.9514 to 0.8563
Implicit feedback

- noisy feedback (clicks, likes, purchases, searches, . . . )
- much easier to collect
- wider user/item coverage
- usually sparsity > 99.9%
One-Class Collaborative Filtering

- observed entries are positive preferences and should have high confidence
- missing entries in the matrix are a mix of negative and positive preferences; treat them as negative with low confidence
- we cannot really distinguish whether a user did not click a banner out of lack of interest or lack of awareness
Evaluation

Recap: we only care about producing a small set of highly relevant items. RMSE is a bad metric - it has a very weak connection to business goals.

We are only interested in the precision of the retrieved items:

- space on the screen is limited
- only the order matters - the most relevant items should be at the top
Ranking - Mean average precision

$$\mathrm{AveragePrecision} = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{\text{number of relevant documents}}$$

##    index relevant precision_at_k
## 1:     1        0      0.0000000
## 2:     2        0      0.0000000
## 3:     3        1      0.3333333
## 4:     4        0      0.2500000
## 5:     5        0      0.2000000

map@5 = 0.1566667
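The precision@k column and the map@5 value above can be reproduced with a few lines of NumPy. The helper names are illustrative, and `map@k` here follows the convention of the table: the mean of precision@k over the first k positions.

```python
import numpy as np

def precision_at_k(relevant):
    """Precision after each of the top-k positions: cumulative hits / k."""
    relevant = np.asarray(relevant, dtype=float)
    ranks = np.arange(1, len(relevant) + 1)
    return np.cumsum(relevant) / ranks

def map_at_k(relevant):
    """Mean of precision@k over positions 1..k, matching the table above."""
    return precision_at_k(relevant).mean()

rel = [0, 0, 1, 0, 0]            # the example from the table
print(precision_at_k(rel))       # precision at positions 1..5: 0, 0, 1/3, 1/4, 1/5
print(round(map_at_k(rel), 7))   # 0.1566667
```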
Ranking - Normalized Discounted Cumulative Gain

Intuition is the same as for MAP@K, but it also takes into account the magnitude of relevance:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$nDCG_p = \frac{DCG_p}{IDCG_p}$$

$$IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
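A small NumPy sketch of the DCG/nDCG formulas above (function names are illustrative):

```python
import numpy as np

def dcg(relevances):
    """DCG_p: sum over positions i of (2^rel_i - 1) / log2(i + 1)."""
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))   # log2(i + 1), i = 1..p
    return np.sum((2.0 ** rel - 1.0) / discounts)

def ndcg(relevances):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([3, 2, 0, 1])` is below 1 because the item with relevance 1 is ranked under an irrelevant one.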
Approaches

- Content based
  - good for cold start
  - not personalized
- Collaborative filtering
  - vanilla collaborative filtering
  - matrix factorizations
  - . . .
- Hybrid and context-aware recommender systems
  - best of both worlds
Focus today

- WRMF (Weighted Regularized Matrix Factorization) - “Collaborative Filtering for Implicit Feedback Datasets” (2008)
  - efficient learning with accelerated approximate Alternating Least Squares
  - inference time
- Linear-Flow - “Practical Linear Models for Large-Scale One-Class Collaborative Filtering” (2016)
  - efficient truncated SVD
  - cheap cross-validation with a full regularization path
Matrix Factorizations

- Users can be described by a small number of latent factors $p_{uk}$
- Items can be described by a small number of latent factors $q_{ki}$
Sparse data

(Diagram: sparse users × items interaction matrix.)

Low rank matrix factorization

$$R = P \times Q$$

(Diagram: users × factors matrix $P$ times factors × items matrix $Q$.)

Reconstruction

(Diagram: reconstructed users × items matrix.)
Truncated SVD

Take the k largest singular values:

$$X \approx U_k D_k V_k^T$$

- $X_k \in R^{m \times n}$
- $U_k$, $V_k$ - columns are orthonormal bases (dot product of any 2 columns is zero, unit norm)
- $D_k$ - matrix with singular values on the diagonal

Truncated SVD is the best rank-k approximation of the matrix X in terms of Frobenius norm:

$$||X - U_k D_k V_k^T||_F$$

$$P = U_k \sqrt{D_k}, \quad Q = \sqrt{D_k} V_k^T$$
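This factor split can be sketched with SciPy's sparse `svds` (the matrix here is random toy data, not real ratings):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# toy sparse "rating" matrix: 100 users x 50 items, ~5% observed
X = sparse_random(100, 50, density=0.05, format="csr", random_state=42)

k = 8
U, s, Vt = svds(X, k=k)            # k largest singular triplets

# split sqrt(D_k) between the two factors, as on the slide
P = U * np.sqrt(s)                 # user embeddings, 100 x k
Q = np.sqrt(s)[:, None] * Vt       # item embeddings, k x 50

X_hat = P @ Q                      # best rank-k approximation of X
```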
Issue with truncated SVD for “explicit” feedback
- Optimal in terms of Frobenius norm - takes the zeros in the rating matrix into account:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|users| \times |items|}\sum_{u \in users,\, i \in items}(r_{ui} - \hat{r}_{ui})^2}$$

- Overfits the data

Objective = error only on “observed” ratings:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|Observed|}\sum_{u,i \in Observed}(r_{ui} - \hat{r}_{ui})^2}$$
SVD-like matrix factorization with ALS
$$J = \sum_{u,i \in Observed}(r_{ui} - p_u \times q_i)^2 + \lambda(||P||^2 + ||Q||^2)$$

Given Q fixed, solve for $p_u$:

$$\min \sum_{i \in Observed}(r_{ui} - q_i \times p_u)^2 + \lambda\sum_{j=1}^{k} p_{uj}^2$$

Given P fixed, solve for $q_i$:

$$\min \sum_{u \in Observed}(r_{ui} - p_u \times q_i)^2 + \lambda\sum_{j=1}^{k} q_{ij}^2$$

Ridge regression: $P = (Q^T Q + \lambda I)^{-1}Q^T r$, $Q = (P^T P + \lambda I)^{-1}P^T r$
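A minimal dense sketch of this alternating scheme, assuming NaN marks unobserved ratings; `als_step` is an illustrative helper whose body is exactly the per-row ridge solve above:

```python
import numpy as np

def als_step(R, F, lam):
    """Solve one ridge regression per row of R over its observed entries,
    with the other side's factors F fixed. NaN = unobserved."""
    n, k = R.shape[0], F.shape[1]
    out = np.zeros((n, k))
    for u in range(n):
        obs = ~np.isnan(R[u])                 # this row's observed entries
        Fo, r = F[obs], R[u, obs]
        out[u] = np.linalg.solve(Fo.T @ Fo + lam * np.eye(k), Fo.T @ r)
    return out

# toy 3 users x 4 items rating matrix
R = np.array([[5., 3., np.nan, 1.],
              [4., np.nan, np.nan, 1.],
              [np.nan, 1., 5., 4.]])
Q = np.random.default_rng(0).normal(size=(4, 2))  # random item factors
for _ in range(10):                  # alternate the two ridge solves
    P = als_step(R, Q, lam=0.1)      # user factors given item factors
    Q = als_step(R.T, P, lam=0.1)    # item factors given user factors
```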
“Collaborative Filtering for Implicit Feedback Datasets”

WRMF - Weighted Regularized Matrix Factorization

- “Default” approach
- Proposed in 2008, but still widely used in industry (even at YouTube)
- several high-quality open-source implementations

$$J = \sum_{u,i} C_{ui}(P_{ui} - X_u Y_i)^2 + \lambda(||X||_F + ||Y||_F)$$

- Preferences are binary:

$$P_{ui} = \begin{cases} 1 & \text{if } R_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}$$

- Confidence: $C_{ui} = 1 + f(R_{ui})$
Alternating Least Squares for implicit feedback

For fixed Y:

$$\frac{dL}{dx_u} = -2\sum_{i} c_{ui}(p_{ui} - x_u^T y_i)y_i + 2\lambda x_u = -2\sum_{i} c_{ui}(p_{ui} - y_i^T x_u)y_i + 2\lambda x_u = -2Y^T C^u p(u) + 2Y^T C^u Y x_u + 2\lambda x_u$$

- Setting $dL/dx_u = 0$ for the optimal solution gives us $(Y^T C^u Y + \lambda I)x_u = Y^T C^u p(u)$
- $x_u$ can be obtained by solving a system of linear equations:

$$x_u = \mathrm{solve}(Y^T C^u Y + \lambda I,\; Y^T C^u p(u))$$
Alternating Least Squares for implicit feedback

Similarly, for fixed X:

- $dL/dy_i = -2X^T C^i p(i) + 2X^T C^i X y_i + 2\lambda y_i$
- $y_i = \mathrm{solve}(X^T C^i X + \lambda I,\; X^T C^i p(i))$

Another optimization:

- $X^T C^i X = X^T X + X^T (C^i - I)X$
- $Y^T C^u Y = Y^T Y + Y^T (C^u - I)Y$

$X^T X$ and $Y^T Y$ can be precomputed.
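A dense NumPy sketch of one full user-side update using this Gram-matrix trick (names are illustrative; real implementations exploit sparsity, since $C^u - I$ is nonzero only on observed items):

```python
import numpy as np

def wrmf_user_step(C, P_bin, Y, lam):
    """Recompute all user factors X with item factors Y fixed.
    C: confidence matrix (equals 1 where unobserved), P_bin: binary preferences."""
    n_users, k = C.shape[0], Y.shape[1]
    YtY = Y.T @ Y                          # precomputed once per sweep
    X = np.zeros((n_users, k))
    for u in range(n_users):
        cu = C[u]
        # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; cheap when few cu != 1
        A = YtY + (Y.T * (cu - 1.0)) @ Y + lam * np.eye(k)
        b = (Y.T * cu) @ P_bin[u]          # Y^T C^u p(u)
        X[u] = np.linalg.solve(A, b)
    return X

rng = np.random.default_rng(0)
R = rng.poisson(0.2, size=(20, 30)).astype(float)  # toy interaction counts
C = 1.0 + 10.0 * R                                 # confidence = 1 + f(R)
P_bin = (R > 0).astype(float)                      # binary preferences
Y = rng.normal(scale=0.1, size=(30, 4))
X = wrmf_user_step(C, P_bin, Y, lam=0.01)
```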
Accelerated Approximate Alternating Least Squares

$$y_i = \mathrm{solve}(X^T C^i X + \lambda I,\; X^T C^i p(i))$$

Instead of solving the system exactly, run a fixed number of steps (usually 3-4 is enough) of an iterative method:

- Conjugate Gradient
- Coordinate Descent
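For example, SciPy's conjugate gradient solver capped at a few iterations gives an approximate solution of the same kind of system; the SPD matrix below is a random stand-in for $Y^T C^u Y + \lambda I$:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(1)
k = 10
M = rng.normal(size=(k, k))
A = M @ M.T + np.eye(k)             # SPD stand-in for Y^T C^u Y + lambda*I
b = rng.normal(size=k)              # stand-in for Y^T C^u p(u)

x_exact = np.linalg.solve(A, b)     # exact solve
x_approx, _ = cg(A, b, maxiter=3)   # only 3 CG steps, as used in practice
```

A few CG steps usually shrink the residual enough for one factor update, and the next ALS sweep corrects the remaining error.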
Inference time

How to make recommendations for new users? There are no user embeddings, since these users were not in the original matrix!

Make one step of ALS with the item embeddings matrix fixed => get new user embeddings:

- given Y fixed, $C^{u_{new}}$ - confidences of the new user-item interactions
- $x_{u_{new}} = \mathrm{solve}(Y^T C^{u_{new}} Y + \lambda I,\; Y^T C^{u_{new}} p(u_{new}))$
- $scores = X_{new} Y^T$
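A sketch of this fold-in step (function name and shapes are illustrative): confidence is 1 for items the user never touched, so only the interacted rows of Y enter the correction term.

```python
import numpy as np

def fold_in_user(Y, item_ids, conf, lam=0.01):
    """One ALS step for a new user, item factors Y fixed."""
    k = Y.shape[1]
    Yo = Y[item_ids]                                   # interacted items only
    # Y^T C Y + lam*I, using Y^T C Y = Y^T Y + Yo^T (C_obs - I) Yo
    A = Y.T @ Y + (Yo.T * (conf - 1.0)) @ Yo + lam * np.eye(k)
    b = (Yo.T * conf) @ np.ones(len(item_ids))         # p = 1 on interactions
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(scale=0.1, size=(1000, 32))             # pretrained item factors
x_new = fold_in_user(Y, [3, 17, 256], conf=np.array([5.0, 2.0, 9.0]))
scores = x_new @ Y.T                                   # score all items
top_items = np.argsort(-scores)[:10]                   # recommend the top 10
```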
WRMF Implementations
- python implicit - implements Conjugate Gradient, with GPU support recently!
- R reco - implements Conjugate Gradient
- Spark ALS
- Quora qmf
- Google TensorFlow
Linear-Flow
The idea is to learn an item-item similarity matrix W from the data:

$$\min J = ||X - XW_k||_F + \lambda||W_k||_F$$

with the constraint:

$$rank(W) \le k$$
Linear-Flow observations
1. Without L2 regularization, the optimal solution is $W_k = Q_k Q_k^T$, where $SVD_k(X) = P_k \Sigma_k Q_k^T$.
2. Without the $rank(W) \le k$ constraint, the optimal solution is just the ridge regression solution: $W = (X^T X + \lambda I)^{-1}X^T X$ - infeasible at this scale.
Linear-Flow reparametrization
$$SVD_k(X) = P_k \Sigma_k Q_k^T$$

Let $W = Q_k Y$:

$$\mathrm{argmin}_Y\; ||X - X Q_k Y||_F + \lambda||Q_k Y||_F$$

Motivation: with $\lambda = 0$, $W = Q_k Q_k^T$, and the solution of the current problem is $Y = Q_k^T$.
Linear-Flow closed-form solution
- Notice that if $Q_k$ is orthogonal, then $||Q_k Y||_F = ||Y||_F$
- Solve $||X - X Q_k Y||_F + \lambda||Y||_F$
- Simple ridge regression with a closed-form solution:

$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1}Q_k^T X^T X$$

Very cheap inversion of a matrix of rank k!
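Putting the pieces together in a short SciPy sketch on toy random data (`W` is the learned item-item similarity):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# toy implicit-feedback matrix: 200 users x 100 items
X = sparse_random(200, 100, density=0.05, format="csr", random_state=0)

k, lam = 16, 1.0
_, _, Qt = svds(X, k=k)                 # SVD_k(X) = P_k Sigma_k Q_k^T
Qk = Qt.T                               # items x k

XtX = (X.T @ X).toarray()               # item-item Gram matrix
Z = Qk.T @ XtX                          # Q_k^T X^T X
Y = np.linalg.solve(Z @ Qk + lam * np.eye(k), Z)   # closed-form ridge solution
W = Qk @ Y                              # item-item similarity, rank <= k

scores = X @ W                          # user x item recommendation scores
```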
Linear-Flow hassle-free cross-validation
$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1}Q_k^T X^T X$$

How to find lambda with cross-validation?

- pre-compute $Z = Q_k^T X^T X$, so $Y = (Z Q_k + \lambda I)^{-1} Z$
- pre-compute $Z Q_k$
- notice that the value of lambda affects only the diagonal of $Z Q_k$
- generate a sequence of lambdas (say, of length 50) based on the min/max diagonal values
- solving 50 ridge regressions of a small rank is super-fast
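A sketch of this regularization-path trick with random stand-ins for $Z$ and $Q_k$ (in practice they come from the SVD of $X$):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_items = 8, 50
Qk = np.linalg.qr(rng.normal(size=(n_items, k)))[0]  # orthonormal stand-in
Z = rng.normal(size=(k, n_items))                    # stand-in for Q_k^T X^T X

ZQ = Z @ Qk                                          # pre-computed once, k x k
d = np.abs(np.diag(ZQ))
lambdas = np.geomspace(max(d.min(), 1e-3), d.max() * 10, num=50)

# lambda only shifts the diagonal, so the whole path is 50 tiny k x k solves
Ys = [np.linalg.solve(ZQ + lam * np.eye(k), Z) for lam in lambdas]
# each Y in Ys would then be scored (e.g. map@k) on a validation split
```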
Suggestions

- start simple - SVD, WRMF
- design proper cross-validation - both the objective and the data split
- think about how to incorporate business logic (for example, how to exclude something)
- use single-machine implementations
- think about inference time
- don't waste time on libraries/articles/blog posts which demonstrate MF with dense matrices
Questions?
- http://dsnotes.com/tags/recommender-systems/
- https://github.com/dselivanov/reco
Contacts:
- selivanov.dmitriy@gmail.com
- https://github.com/dselivanov
- https://www.linkedin.com/in/dselivanov1