Dictionary Learning for Massive Matrix Factorization

Dictionary Learning forMassive Matrix Factorization

Arthur Mensch, Julien MairalGael Varoquaux, Bertrand Thirion

Inria Parietal, Inria Thoth

October 6, 2016

Introduction

Why am I here ?

Inria Parietal: machine learning for neuro-imaging(fMRI data)

Matrix factorization: major ingredient in fMRI analysis

Very large datasets (2 TB): we designed faster algorithms

These algorithms can be used in collaborative filtering

D AX

Voxe

ls

Time

=

k spatial maps Time

x

1Work presented at ICML 2016

Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28

Deroule

1 Matrix factorization for recommender systemsCollaborative filteringMatrix factorization formulationExisting methods

2 Subsampled online dictionary learningDictionary learning – existing methodsHandling missing values efficientlyNew algorithm

3 ResultsSettingBenchmarksParameter setting


Deroule





Collaborative filtering

Collaborative platform

n users rate a fraction ofp items

e.g movies, restaurants

Estimate ratings forrecommendation

Use the ratings of other users for recommendation


How to predict ratings ?

Credit: [Bell and Koren, 2007]

Joe like We weresoldiers, Black Hawkdown.

Bob and Alice like thesame films, and alsolike Saving privateRyan.

Joe should watchSaving private Ryan,because all of themindeed likes war films.

Need to uncover topics in items


Predicting rate with scalar products

Embeddings to model the existence of genre/category/topics

Representative vectors forusers and items:

(αj)1≤j≤n, (di )1≤i≤p ∈ Rk

q-th coefficient of di , αj

= affinity with the “topic” q

Ratings xij (item i , user j):

xij = d>i αj ( + biases)

= Common affinity for topics

xij αjdi

k topics

di

αj

1

Learning problem: estimate D and A with known ratings


Matrix factorization

1Ω ? X AD

p

n k

n

=

1X ∈ Rp×n ≈ DA ∈ Rp×k × Rk×n

Constraints / penalty on factors D and A

We only observe 1Ω ? X — Ω set of ratings provided by users

Recommender systems : millions of users, millions of items

How to scale matrix factorization to very large datasets ?


Formalism

Finding representation in Rk for items and users:

minD∈Rp×k

A∈Rk×n

∑(i ,j)∈Ω

(xij − d>i αj)2 + λ(‖D‖2

F + ‖A‖2F )

= ‖1Ω ? (X−DA)‖22 + λ(‖D‖2

F + ‖A‖2F ) 1Ω set of knownratings

`2 reconstruction loss — `2 penalty for generalization

Existing methods

Alternated minimization

Stochastic gradient descent


Existing methods

Alternated minimization

Minimize over A, D: alternate between

D = minD∈Rp×k

∑(i ,j)∈Ω

(xij − d>i αj)2 + λ‖D‖2

F

A = minA∈Rk×n

∑(i ,j)∈Ω

(xij − d>i αj)2 + λ‖A‖2

F

No hyperparameters

Slow and memory expensive: use all ratings at each iteration

a.k.a. coordinate descent (variation in parameter update order)


Existing methods

Stochastic gradient descent

minA,D

∑(i ,j)∈Ω

fij(A,B)def= (xij − d>i α

j)2 +1

cjλ‖αj‖2

2 +1

ciλ‖di‖2

2

Gradient step for each rating:

(At ,Dt)← (At−1,Dt−1)− 1

ct∇(A,D)fij(At−1,Dt−1)

Fast and memory efficient – won the Netflix prize

Very sensitive to step sizes (ct) – need to cross-validate


Towards a new algorithm

Best of both worlds ?

Fast and memory efficient algorithm

Little sensitive to hyperparameter setting

Subsampled online dictionary learning

Builds upon the online dictionary learning algorithm

popular in computer vision and interpretable learning (fMRI)

Adapt it to handle missing values efficiently


Deroule





Dictionary learning

Recall: recommender system formalism

Non-masked matrix factorization with `2 penalty:

minD∈Rp×k

A∈Rk×n

n∑j=1

(xj −D>αj)2 + λ(‖D‖2F + ‖A‖2

F )

Penalties can be richer, and made into constraints

Dictionary learning

Learn the left side factor [Olshausen and Field, 1997]

minD∈C

n∑j=1

‖xj −Dαj‖22 +λΩ(αj) αj = argmin

α∈Rk

‖xi −Dα‖22 +λΩ(α)

Naive approach: alternated minimization


Online dictionary learning [Mairal et al., 2010]

At iteration t, select xt in xjj (user ratings), improve D

Single iteration complexity ∝ sample dimension O(p)

(Dt)t converges in a few epochs (one for large n)

xt αtD

p

n k n

=Stream

1Very efficient in computer vision / networks / fMRI /hyperspectral images

Can we use it efficiently for recommender systems ?


In short: Handling missing values

X

p

n

xt

Steam

Handle large n

n

Handle missing valuesOnline → online + partial

Batch →online

Mtxt

StreamIgno

re

UnknownUnaccessed

1Leverage streaming + partial access to samples


In detail: online dictionary learning

Objective function involves latent codes (right side factor)

minD∈C

1

t

t∑i=1

‖xi −Dα∗i (D)‖22, α∗i (D) = argmin

α

1

2‖xi −Dα‖2

2 + λΩ(α)

Replace latent codes by codes computed with old dictionaries

Build an upper-bounding surrogate function

min1

t

t∑i=1

‖xi−Dαi‖22 αi = argmin

α

1

2‖xi−Di−1α‖2

2+λΩ(α)

Minimize surrogate — updateable online at low cost



Algorithm outline

1 Compute code

– xt → complexity depends on p

αt = argminα∈Rk

‖xt −Dt−1α‖22 + λΩ(αt)

2 Update the surrogate function

– Complexity in O(p)

gt =1

t

t∑i=1

‖xi −Dαi‖22 = Tr (

1

2D>DAt −D>Bt)

At = (1− 1

t)At−1 +

1

tαtα

>t Bt = (1− 1

t)Bt−1 +

1

txtα

>t

3 Minimize surrogate

– Complexity in O(p)

Dt = argminD∈C

gt(D) ∇gt = DAt − Bt



Algorithm outline

1 Compute code – xt → complexity depends on p

αt = argminα∈Rk

‖xt −Dt−1α‖22 + λΩ(αt)

2 Update the surrogate function – Complexity in O(p)

gt =1

t

t∑i=1

‖xi −Dαi‖22 = Tr (

1

2D>DAt −D>Bt)

At = (1− 1

t)At−1 +

1

tαtα

>t Bt = (1− 1

t)Bt−1 +

1

txtα

>t

3 Minimize surrogate – Complexity in O(p)

Dt = argminD∈C

gt(D) ∇gt = DAt − Bt


Specification for a new algorithm

Mtxt

StreamIgno

rep

n

1

Constrained : use only knownratings from Ω

Efficient: single iteration in O(s),# of ratings provided by user t

Principled: follows the onlinematrix factorization algorithm asmuch as possible


Missing values in practice

Data stream: (xt)t → masked (Mtxt)t

= ratings from user t

Dimension: p (all items) → s (rated items)

Use only Mtxt in algorithm computation

→ complexity in O(s)Mtxt

StreamIgno

rep

n

1

Adaptation to make

Modify all parts of the algorithm to obtain O(s) complexity

1 Codecomputation

2 Surrogateupdate

3 Surrogateminimization


Missing values in practice

Data stream: (xt)t → masked (Mtxt)t

= ratings from user t

Dimension: p (all items) → s (rated items)

Use only Mtxt in algorithm computation

→ complexity in O(s)Mtxt

StreamIgno

rep

n

1Adaptation to make

Modify all parts of the algorithm to obtain O(s) complexity

1 Codecomputation

2 Surrogateupdate

3 Surrogateminimization


Subsampled online dictionary learningCheck out paper !

Original online MF

1 Code computation

αt = argminα∈Rk

‖xt −Dt−1α‖22

+ λΩ(αt)

2 Surrogate aggregation

At =1

t

t∑i=1

αiα>i

Bt = Bt−1 +1

t(xtα

>t − Bt−1)

3 Surrogate minimization

Dj ← p⊥Crj(Dj−

1

(At)j,j(DAj

t−Bjt))

Our algorithm

1 Code computation: masked loss

αt = argminα∈Rk

‖Mt(xt −Dt−1α)‖22

+ λrkMt

pΩ(αt)

2 Surrogate aggregation

At =1

t

t∑i=1

αiα>i

Bt = Bt−1 +1∑t

i=1 Mi(Mtxtα

>t −MtBt−1)

3 Surrogate minimization

MtDj ← p⊥Cj (MtD

j −1

(At)j,jMt(D(Aj

t − (Bjt))


Deroule





Experiments

Validation : Test RMSE (rating prediction) vs CPU time

Baseline : Coordinate descent solver [Yu et al., 2012] for

minD∈Rp×k

A∈Rk×n

∑(i ,j)∈Ω

(xij − d>i αj)2 + λ(‖D‖2

F + ‖A‖2F )

Fastest solver available apart from SGD — hyperparameters

↑ Our method has a learning rate with little influence

Datasets : Movielens, Netflix

Publicly available

Larger one in the industry...


Results

Scalable algorithm: speed-up improves with size


Performance

Dataset Test RMSE Convergence time Speed

CD SODL CD SODL -up

ML 1M 0.872 0.866 6 s 8 s ×0.75ML 10M 0.802 0.799 223 s 60 s ×3.7NF (140M) 0.938 0.934 1714 s 256 s ×6.8

Outperform coordinate descent beyond 10M ratings

Same prediction performance

Speed-up 6.8× on Netflix

Simple model: RMSE is not state-of-the-art


Robustness to learning rate

Learning rate in algorithm to be set in [0.75, 1] (← theory)

In practice: Just set it in [0.8, 1]

1 10 40Epoch

0.800.810.820.830.840.850.860.87

RM

SE

on

test

set

Learning rate β0.75

0.78

0.81

0.83

0.86

0.89

0.92

0.94

0.97

1.00

MovieLens 10M

.1 1 10 200.930.940.950.960.970.980.99 Netflix


Conclusion

Take-home message

Online matrix factorization can be adaptedto handle missing value efficiently, with verygood performance in reccommender system

Mtxt

StreamIgno

rep

n

1Algorithm usable in any rich model involving matrix factorization

Python package http://github.com/arthurmensch/modl

Article/slides at http://amensch.fr/publications

Questions ?


http://github.com/arthurmensch/modl

http://amensch.fr/publications

Appendix: Resting-state fMRI

Online dictionary learning

235 h run time

1 full epoch

10 h run time

124 epoch

Proposed method

10 h run time

12 epoch, reduction r=12

Qualitatively, usable maps are obtained 10× faster


Bibliography I

[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007).

Lessons from the Netflix prize challenge.

ACM SIGKDD Explorations Newsletter, 9(2):75–79.

[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).

Online learning for matrix factorization and sparse coding.

The Journal of Machine Learning Research, 11:19–60.

[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).

Sparse coding with an overcomplete basis set: A strategy employed by V1?

Vision Research, 37(23):3311–3325.

[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).

Scalable coordinate descent approaches to parallel matrix factorization forrecommender systems.

In Proceedings of the International Conference on Data Mining, pages765–774. IEEE.


Dictionary Learning for Massive Matrix Factorization

Internet

Transcript of Dictionary Learning for Massive Matrix Factorization