Post on 21-Jan-2018
Matrix Factorizations for Recommender Systems
Dmitriy Selivanov (selivanov.dmitriy@gmail.com)
2017-11-16
Recommender systems are everywhere
Goals
Propose “relevant” items to customers:

- Retention
- Exploration
- Up-sale
- Personalized offers
- Recommended items for a customer given their history of activities (transactions, browsing history, favourites)
- Similar items
  - substitutions
  - bundles (frequently bought together)
  - . . .
Live demo

Dataset - LastFM-360K:

- 360k users
- 160k artists
- 17M observations
- sparsity ~ 0.9997
Explicit feedback

Ratings, likes/dislikes, purchases:

- cleaner data
- smaller
- hard to collect

$$\mathrm{RMSE}^2 = \frac{1}{|D|}\sum_{u,i \in D}(r_{ui} - \hat{r}_{ui})^2$$
Netflix prize

- ~480k users, 18k movies, 100M ratings
- sparsity ~ 99%
- goal: reduce RMSE by 10% - from 0.9514 to 0.8563
Implicit feedback

- noisy feedback (clicks, likes, purchases, searches, . . . )
- much easier to collect
- wider user/item coverage
- usually sparsity > 99.9%
One-Class Collaborative Filtering

- observed entries are positive preferences and should have high confidence
- missing entries in the matrix are a mix of negative and positive preferences; treat them as negative with low confidence
- we cannot really distinguish whether a user did not click a banner out of lack of interest or lack of awareness
Evaluation

Recap: we only care about producing a small set of highly relevant items. RMSE is a bad metric - it has a very weak connection to business goals.

We are only interested in the precision of the retrieved items:

- space on the screen is limited
- only the order matters - the most relevant items should be at the top
Ranking - Mean average precision

$$\mathrm{AveragePrecision} = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{\text{number of relevant documents}}$$

##    index relevant precision_at_k
## 1:     1        0      0.0000000
## 2:     2        0      0.0000000
## 3:     3        1      0.3333333
## 4:     4        0      0.2500000
## 5:     5        0      0.2000000

map@5 = 0.1566667
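The precision@k column and the map@5 value above can be reproduced with a few lines of NumPy. The helper names are illustrative, and `map@k` here follows the convention of the table: the mean of precision@k over the first k positions.

```python
import numpy as np

def precision_at_k(relevant):
    """Precision after each of the top-k positions: cumulative hits / k."""
    relevant = np.asarray(relevant, dtype=float)
    ranks = np.arange(1, len(relevant) + 1)
    return np.cumsum(relevant) / ranks

def map_at_k(relevant):
    """Mean of precision@k over positions 1..k, matching the table above."""
    return precision_at_k(relevant).mean()

rel = [0, 0, 1, 0, 0]            # the example from the table
print(precision_at_k(rel))       # precision at positions 1..5: 0, 0, 1/3, 1/4, 1/5
print(round(map_at_k(rel), 7))   # 0.1566667
```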
Ranking - Normalized Discounted Cumulative Gain

Intuition is the same as for MAP@K, but it also takes into account the magnitude of relevance:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$nDCG_p = \frac{DCG_p}{IDCG_p}$$

$$IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
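A small NumPy sketch of the DCG/nDCG formulas above (function names are illustrative):

```python
import numpy as np

def dcg(relevances):
    """DCG_p: sum over positions i of (2^rel_i - 1) / log2(i + 1)."""
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))   # log2(i + 1), i = 1..p
    return np.sum((2.0 ** rel - 1.0) / discounts)

def ndcg(relevances):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([3, 2, 0, 1])` is below 1 because the item with relevance 1 is ranked under an irrelevant one.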
Approaches

- Content based
  - good for cold start
  - not personalized
- Collaborative filtering
  - vanilla collaborative filtering
  - matrix factorizations
  - . . .
- Hybrid and context-aware recommender systems
  - best of both worlds
Focus today

- WRMF (Weighted Regularized Matrix Factorization) - “Collaborative Filtering for Implicit Feedback Datasets” (2008)
  - efficient learning with accelerated approximate Alternating Least Squares
  - inference time
- Linear-Flow - “Practical Linear Models for Large-Scale One-Class Collaborative Filtering” (2016)
  - efficient truncated SVD
  - cheap cross-validation with a full regularization path
Matrix Factorizations

- Users can be described by a small number of latent factors $p_{uk}$
- Items can be described by a small number of latent factors $q_{ki}$
Sparse data

(Diagram: sparse users × items interaction matrix.)

Low rank matrix factorization

$$R = P \times Q$$

(Diagram: users × factors matrix $P$ times factors × items matrix $Q$.)

Reconstruction

(Diagram: reconstructed users × items matrix.)
Truncated SVD

Take the k largest singular values:

$$X \approx U_k D_k V_k^T$$

- $X_k \in R^{m \times n}$
- $U_k$, $V_k$ - columns are orthonormal bases (dot product of any 2 columns is zero, unit norm)
- $D_k$ - matrix with singular values on the diagonal

Truncated SVD is the best rank-k approximation of the matrix X in terms of Frobenius norm:

$$||X - U_k D_k V_k^T||_F$$

$$P = U_k \sqrt{D_k}, \quad Q = \sqrt{D_k} V_k^T$$
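This factor split can be sketched with SciPy's sparse `svds` (the matrix here is random toy data, not real ratings):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# toy sparse "rating" matrix: 100 users x 50 items, ~5% observed
X = sparse_random(100, 50, density=0.05, format="csr", random_state=42)

k = 8
U, s, Vt = svds(X, k=k)            # k largest singular triplets

# split sqrt(D_k) between the two factors, as on the slide
P = U * np.sqrt(s)                 # user embeddings, 100 x k
Q = np.sqrt(s)[:, None] * Vt       # item embeddings, k x 50

X_hat = P @ Q                      # best rank-k approximation of X
```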
Issue with truncated SVD for “explicit” feedback
- Optimal in terms of Frobenius norm - takes the zeros in the rating matrix into account:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|users| \times |items|}\sum_{u \in users,\, i \in items}(r_{ui} - \hat{r}_{ui})^2}$$

- Overfits the data

Objective = error only on “observed” ratings:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|Observed|}\sum_{u,i \in Observed}(r_{ui} - \hat{r}_{ui})^2}$$
SVD-like matrix factorization with ALS
$$J = \sum_{u,i \in Observed}(r_{ui} - p_u \times q_i)^2 + \lambda(||P||^2 + ||Q||^2)$$

Given Q fixed, solve for $p_u$:

$$\min \sum_{i \in Observed}(r_{ui} - q_i \times p_u)^2 + \lambda\sum_{j=1}^{k} p_{uj}^2$$

Given P fixed, solve for $q_i$:

$$\min \sum_{u \in Observed}(r_{ui} - p_u \times q_i)^2 + \lambda\sum_{j=1}^{k} q_{ij}^2$$

Ridge regression: $P = (Q^T Q + \lambda I)^{-1}Q^T r$, $Q = (P^T P + \lambda I)^{-1}P^T r$
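A minimal dense sketch of this alternating scheme, assuming NaN marks unobserved ratings; `als_step` is an illustrative helper whose body is exactly the per-row ridge solve above:

```python
import numpy as np

def als_step(R, F, lam):
    """Solve one ridge regression per row of R over its observed entries,
    with the other side's factors F fixed. NaN = unobserved."""
    n, k = R.shape[0], F.shape[1]
    out = np.zeros((n, k))
    for u in range(n):
        obs = ~np.isnan(R[u])                 # this row's observed entries
        Fo, r = F[obs], R[u, obs]
        out[u] = np.linalg.solve(Fo.T @ Fo + lam * np.eye(k), Fo.T @ r)
    return out

# toy 3 users x 4 items rating matrix
R = np.array([[5., 3., np.nan, 1.],
              [4., np.nan, np.nan, 1.],
              [np.nan, 1., 5., 4.]])
Q = np.random.default_rng(0).normal(size=(4, 2))  # random item factors
for _ in range(10):                  # alternate the two ridge solves
    P = als_step(R, Q, lam=0.1)      # user factors given item factors
    Q = als_step(R.T, P, lam=0.1)    # item factors given user factors
```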
“Collaborative Filtering for Implicit Feedback Datasets”

WRMF - Weighted Regularized Matrix Factorization

- “Default” approach
- Proposed in 2008, but still widely used in industry (even at YouTube)
- several high-quality open-source implementations

$$J = \sum_{u,i} C_{ui}(P_{ui} - X_u Y_i)^2 + \lambda(||X||_F + ||Y||_F)$$

- Preferences are binary:

$$P_{ui} = \begin{cases} 1 & \text{if } R_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}$$

- Confidence: $C_{ui} = 1 + f(R_{ui})$
Alternating Least Squares for implicit feedback

For fixed Y:

$$\frac{dL}{dx_u} = -2\sum_{i} c_{ui}(p_{ui} - x_u^T y_i)y_i + 2\lambda x_u = -2\sum_{i} c_{ui}(p_{ui} - y_i^T x_u)y_i + 2\lambda x_u = -2Y^T C^u p(u) + 2Y^T C^u Y x_u + 2\lambda x_u$$

- Setting $dL/dx_u = 0$ for the optimal solution gives us $(Y^T C^u Y + \lambda I)x_u = Y^T C^u p(u)$
- $x_u$ can be obtained by solving a system of linear equations:

$$x_u = \mathrm{solve}(Y^T C^u Y + \lambda I,\; Y^T C^u p(u))$$
Alternating Least Squares for implicit feedback

Similarly, for fixed X:

- $dL/dy_i = -2X^T C^i p(i) + 2X^T C^i X y_i + 2\lambda y_i$
- $y_i = \mathrm{solve}(X^T C^i X + \lambda I,\; X^T C^i p(i))$

Another optimization:

- $X^T C^i X = X^T X + X^T (C^i - I)X$
- $Y^T C^u Y = Y^T Y + Y^T (C^u - I)Y$

$X^T X$ and $Y^T Y$ can be precomputed.
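A dense NumPy sketch of one full user-side update using this Gram-matrix trick (names are illustrative; real implementations exploit sparsity, since $C^u - I$ is nonzero only on observed items):

```python
import numpy as np

def wrmf_user_step(C, P_bin, Y, lam):
    """Recompute all user factors X with item factors Y fixed.
    C: confidence matrix (equals 1 where unobserved), P_bin: binary preferences."""
    n_users, k = C.shape[0], Y.shape[1]
    YtY = Y.T @ Y                          # precomputed once per sweep
    X = np.zeros((n_users, k))
    for u in range(n_users):
        cu = C[u]
        # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; cheap when few cu != 1
        A = YtY + (Y.T * (cu - 1.0)) @ Y + lam * np.eye(k)
        b = (Y.T * cu) @ P_bin[u]          # Y^T C^u p(u)
        X[u] = np.linalg.solve(A, b)
    return X

rng = np.random.default_rng(0)
R = rng.poisson(0.2, size=(20, 30)).astype(float)  # toy interaction counts
C = 1.0 + 10.0 * R                                 # confidence = 1 + f(R)
P_bin = (R > 0).astype(float)                      # binary preferences
Y = rng.normal(scale=0.1, size=(30, 4))
X = wrmf_user_step(C, P_bin, Y, lam=0.01)
```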
Accelerated Approximate Alternating Least Squares

$$y_i = \mathrm{solve}(X^T C^i X + \lambda I,\; X^T C^i p(i))$$

Instead of solving the system exactly, run a fixed number of steps (usually 3-4 is enough) of an iterative method:

- Conjugate Gradient
- Coordinate Descent
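For example, SciPy's conjugate gradient solver capped at a few iterations gives an approximate solution of the same kind of system; the SPD matrix below is a random stand-in for $Y^T C^u Y + \lambda I$:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(1)
k = 10
M = rng.normal(size=(k, k))
A = M @ M.T + np.eye(k)             # SPD stand-in for Y^T C^u Y + lambda*I
b = rng.normal(size=k)              # stand-in for Y^T C^u p(u)

x_exact = np.linalg.solve(A, b)     # exact solve
x_approx, _ = cg(A, b, maxiter=3)   # only 3 CG steps, as used in practice
```

A few CG steps usually shrink the residual enough for one factor update, and the next ALS sweep corrects the remaining error.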
Inference time

How to make recommendations for new users? There are no user embeddings, since these users were not in the original matrix!

Make one step of ALS with the item embeddings matrix fixed => get new user embeddings:

- given Y fixed, $C^{u_{new}}$ - confidences of the new user-item interactions
- $x_{u_{new}} = \mathrm{solve}(Y^T C^{u_{new}} Y + \lambda I,\; Y^T C^{u_{new}} p(u_{new}))$
- $scores = X_{new} Y^T$
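A sketch of this fold-in step (function name and shapes are illustrative): confidence is 1 for items the user never touched, so only the interacted rows of Y enter the correction term.

```python
import numpy as np

def fold_in_user(Y, item_ids, conf, lam=0.01):
    """One ALS step for a new user, item factors Y fixed."""
    k = Y.shape[1]
    Yo = Y[item_ids]                                   # interacted items only
    # Y^T C Y + lam*I, using Y^T C Y = Y^T Y + Yo^T (C_obs - I) Yo
    A = Y.T @ Y + (Yo.T * (conf - 1.0)) @ Yo + lam * np.eye(k)
    b = (Yo.T * conf) @ np.ones(len(item_ids))         # p = 1 on interactions
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(scale=0.1, size=(1000, 32))             # pretrained item factors
x_new = fold_in_user(Y, [3, 17, 256], conf=np.array([5.0, 2.0, 9.0]))
scores = x_new @ Y.T                                   # score all items
top_items = np.argsort(-scores)[:10]                   # recommend the top 10
```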
WRMF Implementations
- python implicit - implements Conjugate Gradient, with GPU support recently!
- R reco - implements Conjugate Gradient
- Spark ALS
- Quora qmf
- Google TensorFlow
Linear-Flow
The idea is to learn an item-item similarity matrix W from the data:

$$\min J = ||X - XW_k||_F + \lambda||W_k||_F$$

with the constraint:

$$rank(W) \le k$$
Linear-Flow observations
1. Without L2 regularization, the optimal solution is $W_k = Q_k Q_k^T$, where $SVD_k(X) = P_k \Sigma_k Q_k^T$.
2. Without the $rank(W) \le k$ constraint, the optimal solution is just the ridge regression solution: $W = (X^T X + \lambda I)^{-1}X^T X$ - infeasible at this scale.
Linear-Flow reparametrization
$$SVD_k(X) = P_k \Sigma_k Q_k^T$$

Let $W = Q_k Y$:

$$\mathrm{argmin}_Y\; ||X - X Q_k Y||_F + \lambda||Q_k Y||_F$$

Motivation: with $\lambda = 0$, $W = Q_k Q_k^T$, and the solution of the current problem is $Y = Q_k^T$.
Linear-Flow closed-form solution
- Notice that if $Q_k$ is orthogonal, then $||Q_k Y||_F = ||Y||_F$
- Solve $||X - X Q_k Y||_F + \lambda||Y||_F$
- Simple ridge regression with a closed-form solution:

$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1}Q_k^T X^T X$$

Very cheap inversion of a matrix of rank k!
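Putting the pieces together in a short SciPy sketch on toy random data (`W` is the learned item-item similarity):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# toy implicit-feedback matrix: 200 users x 100 items
X = sparse_random(200, 100, density=0.05, format="csr", random_state=0)

k, lam = 16, 1.0
_, _, Qt = svds(X, k=k)                 # SVD_k(X) = P_k Sigma_k Q_k^T
Qk = Qt.T                               # items x k

XtX = (X.T @ X).toarray()               # item-item Gram matrix
Z = Qk.T @ XtX                          # Q_k^T X^T X
Y = np.linalg.solve(Z @ Qk + lam * np.eye(k), Z)   # closed-form ridge solution
W = Qk @ Y                              # item-item similarity, rank <= k

scores = X @ W                          # user x item recommendation scores
```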
Linear-Flow hassle-free cross-validation
$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1}Q_k^T X^T X$$

How to find lambda with cross-validation?

- pre-compute $Z = Q_k^T X^T X$, so $Y = (Z Q_k + \lambda I)^{-1} Z$
- pre-compute $Z Q_k$
- notice that the value of lambda affects only the diagonal of $Z Q_k$
- generate a sequence of lambdas (say, of length 50) based on the min/max diagonal values
- solving 50 ridge regressions of a small rank is super-fast
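A sketch of this regularization-path trick with random stand-ins for $Z$ and $Q_k$ (in practice they come from the SVD of $X$):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_items = 8, 50
Qk = np.linalg.qr(rng.normal(size=(n_items, k)))[0]  # orthonormal stand-in
Z = rng.normal(size=(k, n_items))                    # stand-in for Q_k^T X^T X

ZQ = Z @ Qk                                          # pre-computed once, k x k
d = np.abs(np.diag(ZQ))
lambdas = np.geomspace(max(d.min(), 1e-3), d.max() * 10, num=50)

# lambda only shifts the diagonal, so the whole path is 50 tiny k x k solves
Ys = [np.linalg.solve(ZQ + lam * np.eye(k), Z) for lam in lambdas]
# each Y in Ys would then be scored (e.g. map@k) on a validation split
```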
Suggestions

- start simple - SVD, WRMF
- design proper cross-validation - both the objective and the data split
- think about how to incorporate business logic (for example, how to exclude something)
- use single-machine implementations
- think about inference time
- don't waste time on libraries/articles/blog posts which demonstrate MF with dense matrices
Questions?
- http://dsnotes.com/tags/recommender-systems/
- https://github.com/dselivanov/reco
Contacts:
- selivanov.dmitriy@gmail.com
- https://github.com/dselivanov
- https://www.linkedin.com/in/dselivanov1