Dictionary Learning for Massive Matrix Factorization
Arthur Mensch, Julien Mairal, Gael Varoquaux, Bertrand Thirion
Inria Parietal, Inria Thoth
October 6, 2016
Introduction
Why am I here?
Inria Parietal: machine learning for neuro-imaging (fMRI data)
Matrix factorization: major ingredient in fMRI analysis
Very large datasets (2 TB): we designed faster algorithms
These algorithms can be used in collaborative filtering
[Figure: X (voxels × time) = D (k spatial maps) × A (time)]
Work presented at ICML 2016
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
Outline
1 Matrix factorization for recommender systems: collaborative filtering, matrix factorization formulation, existing methods
2 Subsampled online dictionary learning: dictionary learning (existing methods), handling missing values efficiently, new algorithm
3 Results: setting, benchmarks, parameter setting
Collaborative filtering
Collaborative platform
n users rate a fraction of p items
e.g. movies, restaurants
Estimate ratings for recommendation
Use the ratings of other users for recommendation
How to predict ratings ?
Credit: [Bell and Koren, 2007]
Joe likes We Were Soldiers and Black Hawk Down.
Bob and Alice like the same films, and also like Saving Private Ryan.
Joe should watch Saving Private Ryan, because all of them indeed like war films.
Need to uncover topics in items
Predicting ratings with scalar products
Embeddings to model the existence of genres/categories/topics
Representative vectors for users and items:
$(\alpha_j)_{1 \le j \le n},\ (d_i)_{1 \le i \le p} \in \mathbb{R}^k$
q-th coefficient of $d_i$, $\alpha_j$ = affinity with the “topic” q
Ratings $x_{ij}$ (item i, user j):
$x_{ij} = d_i^\top \alpha_j$ (+ biases) = common affinity for topics
[Figure: the rating $x_{ij}$ is the scalar product of the item vector $d_i$ and the user vector $\alpha_j$, both of size k (topics)]
Learning problem: estimate D and A with known ratings
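The scalar-product model above can be illustrated with a toy example. This is a minimal sketch: the values in D and A are made up for illustration, the `predict` helper is a hypothetical name, and biases are left out.

```python
import numpy as np

# Hypothetical toy embeddings with k = 2 topics (values are illustrative only).
# Rows of D are item vectors d_i (p = 3 items); columns of A are user vectors alpha_j (n = 2 users).
D = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
A = np.array([[1.0, 0.0],
              [0.2, 1.0]])

def predict(D, A, i, j):
    """Predicted rating of item i by user j: the scalar product d_i^T alpha_j (no biases)."""
    return D[i] @ A[:, j]

# All predicted ratings at once: X_hat = D A, of shape (p, n).
X_hat = D @ A
```

Each entry of `X_hat` is large when the item and the user put weight on the same topics, which is the "common affinity" reading of the model.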
Matrix factorization
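[Figure: the observed matrix $1_\Omega \odot X$ (p × n) is approximated by the product of D (p × k) and A (k × n)]
$X \in \mathbb{R}^{p \times n} \approx DA$, with $D \in \mathbb{R}^{p \times k}$ and $A \in \mathbb{R}^{k \times n}$
Constraints / penalty on factors D and A
We only observe $1_\Omega \odot X$, where Ω is the set of ratings provided by users
Recommender systems: millions of users, millions of items
How to scale matrix factorization to very large datasets?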
Formalism
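Finding a representation in $\mathbb{R}^k$ for items and users:
$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
$$\sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 = \|1_\Omega \odot (X - DA)\|_2^2, \quad \Omega: \text{set of known ratings}$$
ℓ2 reconstruction loss, ℓ2 penalty for generalization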
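The objective above can be evaluated directly from a 0/1 mask. The following is a small sketch, assuming the ratings are stored in a dense array X with a mask array of the same shape marking the entries in Ω; the function name `masked_objective` is an illustrative choice.

```python
import numpy as np

def masked_objective(X, mask, D, A, lam):
    """Squared error on observed entries plus l2 (Frobenius) penalties on both factors.

    X: (p, n) ratings, mask: (p, n) array equal to 1 (or True) on observed entries,
    D: (p, k) item factors, A: (k, n) user factors, lam: penalty weight.
    """
    residual = mask * (X - D @ A)  # unobserved entries contribute zero
    return np.sum(residual ** 2) + lam * (np.sum(D ** 2) + np.sum(A ** 2))
```

A dense mask is only practical for small examples; with millions of users and items the observed entries would be stored in a sparse format instead.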
Existing methods
Alternated minimization
Stochastic gradient descent
Existing methods
Alternated minimization
Minimize over A, D: alternate between
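$$D = \operatorname*{argmin}_{D \in \mathbb{R}^{p \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|D\|_F^2$$
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|A\|_F^2$$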
No hyperparameters
Slow and memory expensive: use all ratings at each iteration
a.k.a. coordinate descent (variation in parameter update order)
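The alternation described on this slide can be sketched as follows. This is an illustrative implementation, not the baseline solver benchmarked later: each sub-problem is a ridge regression solved in closed form, the data is assumed dense with a boolean mask, and the names `als` and `masked_objective` are hypothetical.

```python
import numpy as np

def masked_objective(X, mask, D, A, lam):
    """Squared error on observed entries plus l2 penalties."""
    return np.sum((X - D @ A)[mask] ** 2) + lam * (np.sum(D ** 2) + np.sum(A ** 2))

def als(X, mask, k, lam=0.1, n_iter=20, seed=0):
    """Alternate exact minimization over A (user codes) and D (item vectors).

    X: (p, n) ratings, mask: (p, n) boolean array, True on observed entries.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    A = rng.standard_normal((k, n))
    reg = lam * np.eye(k)
    for _ in range(n_iter):
        # Fix D: each user code alpha_j solves a ridge regression on the items user j rated.
        for j in range(n):
            obs = mask[:, j]
            Dj = D[obs]
            A[:, j] = np.linalg.solve(Dj.T @ Dj + reg, Dj.T @ X[obs, j])
        # Fix A: each item vector d_i solves a ridge regression on the users who rated item i.
        for i in range(p):
            obs = mask[i]
            Ai = A[:, obs]
            D[i] = np.linalg.solve(Ai @ Ai.T + reg, Ai @ X[i, obs])
    return D, A
```

Every outer iteration visits all observed ratings twice, which is the cost the slide refers to; there is no step size to tune because each sub-problem is solved exactly.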
Existing methods
Stochastic gradient descent
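$$\min_{A, D} \sum_{(i,j) \in \Omega} f_{ij}(A, D), \quad f_{ij}(A, D) \overset{\text{def}}{=} (x_{ij} - d_i^\top \alpha_j)^2 + \frac{\lambda}{c_j} \|\alpha_j\|_2^2 + \frac{\lambda}{c_i} \|d_i\|_2^2$$
Gradient step for each rating:
$$(A_t, D_t) \leftarrow (A_{t-1}, D_{t-1}) - \frac{1}{c_t} \nabla_{(A,D)} f_{ij}(A_{t-1}, D_{t-1})$$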
Fast and memory efficient – won the Netflix prize
Very sensitive to step sizes (ct) – need to cross-validate
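A per-rating gradient step can be sketched as below. This is a simplified illustration under stated assumptions: the step size is a constant `step` (the factor 2 of the gradient is absorbed into it), the penalty weights are a plain `lam` rather than the per-item and per-user scalings $1/c_i$ and $1/c_j$ of the slide, and the function names are hypothetical.

```python
import numpy as np

def train_sse(X, rows, cols, D, A):
    """Sum of squared errors over the observed ratings (rows[m], cols[m])."""
    pred = np.sum(D[rows] * A[:, cols].T, axis=1)
    return np.sum((X[rows, cols] - pred) ** 2)

def sgd_epoch(X, rows, cols, D, A, lam, step, rng):
    """One pass over the observed ratings in random order, one gradient step per rating.

    Only d_i and alpha_j are touched for rating (i, j), so a step costs O(k).
    """
    for m in rng.permutation(len(rows)):
        i, j = rows[m], cols[m]
        d, a = D[i].copy(), A[:, j].copy()
        err = X[i, j] - d @ a
        D[i] += step * (err * a - lam * d)
        A[:, j] += step * (err * d - lam * a)
    return D, A
```

The result depends strongly on `step`: too large and the iterates diverge, too small and convergence is slow, which is why the slide mentions cross-validation.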
Towards a new algorithm
Best of both worlds?
Fast and memory efficient algorithm
Low sensitivity to hyperparameter settings
Subsampled online dictionary learning
Builds upon the online dictionary learning algorithm
popular in computer vision and interpretable learning (fMRI)
Adapt it to handle missing values efficiently
Dictionary learning
Recall: recommender system formalism
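Non-masked matrix factorization with ℓ2 penalty:
$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{j=1}^{n} \|x_j - D \alpha_j\|_2^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
Penalties can be richer, and made into constraints
Dictionary learning
Learn the left side factor [Olshausen and Field, 1997]
$$\min_{D \in \mathcal{C}} \sum_{j=1}^{n} \|x_j - D \alpha_j\|_2^2 + \lambda \Omega(\alpha_j), \quad \alpha_j = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_j - D \alpha\|_2^2 + \lambda \Omega(\alpha)$$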
Naive approach: alternated minimization
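For a fixed dictionary D, computing a code $\alpha_j$ is a penalized regression. The sketch below assumes one particular choice, $\Omega(\alpha) = \|\alpha\|_1$ (sparse coding), and the loss scaled as $\frac{1}{2}\|x - D\alpha\|_2^2$; the slide leaves Ω generic. It uses ISTA (proximal gradient), which is one standard solver among several, and the name `sparse_code` is illustrative.

```python
import numpy as np

def sparse_code(x, D, lam, n_iter=200):
    """ISTA for min_alpha 0.5 * ||x - D alpha||_2^2 + lam * ||alpha||_1 (assumed penalty)."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient: squared spectral norm of D
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = alpha - D.T @ (D @ alpha - x) / L  # gradient step on the quadratic term
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return alpha
```

Alternated minimization would call such a routine for every sample at every pass over the data, then update D with all codes fixed.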
Online dictionary learning [Mairal et al., 2010]
At iteration t, select $x_t$ in $\{x_j\}_j$ (user ratings), improve D
Single iteration complexity ∝ sample dimension: O(p)
$(D_t)_t$ converges in a few epochs (one for large n)
[Figure: samples $x_t$ are streamed from X (p × n); each is approximated by $D \alpha_t$, with D of size p × k]
Very efficient in computer vision / networks / fMRI / hyperspectral images
Can we use it efficiently for recommender systems?
In short: Handling missing values
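[Figure: the data matrix X (p × n) is accessed as a stream of samples $x_t$; each sample is then masked into $M_t x_t$, and unknown or unaccessed entries are ignored]
Handle large n: batch → online
Handle missing values: online → online + partial
Leverage streaming + partial access to samples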
In detail: online dictionary learning
Objective function involves latent codes (right side factor)
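$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D \alpha_i^*(D)\|_2^2, \quad \alpha_i^*(D) = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D \alpha\|_2^2 + \lambda \Omega(\alpha)$$
Replace latent codes by codes computed with old dictionaries
Build an upper-bounding surrogate function
$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D \alpha_i\|_2^2, \quad \alpha_i = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D_{i-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$$
Minimize surrogate: updateable online at low cost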
In detail: online dictionary learning
Algorithm outline
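1 Compute code from $x_t$: complexity depends on p
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$$
2 Update the surrogate function: complexity in O(p)
$$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D \alpha_i\|_2^2 = \operatorname{Tr}\left( \frac{1}{2} D^\top D A_t - D^\top B_t \right) + \text{const}$$
$$A_t = \left(1 - \frac{1}{t}\right) A_{t-1} + \frac{1}{t} \alpha_t \alpha_t^\top, \quad B_t = \left(1 - \frac{1}{t}\right) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top$$
3 Minimize surrogate: complexity in O(p)
$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} g_t(D), \quad \nabla g_t(D) = D A_t - B_t$$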
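The three steps of the outline can be put together in a short sketch. It is an illustration under explicit assumptions, not the reference implementation: the penalty is taken as $\Omega(\alpha) = \|\alpha\|_2^2$ so that the code has a closed form, the constraint set C is taken as columns of D inside the unit ℓ2 ball, and the surrogate is minimized with one pass of block coordinate descent over the columns. All function names are hypothetical.

```python
import numpy as np

def ridge_code(x, D, lam):
    """Step 1 (assumed penalty): argmin_alpha ||x - D alpha||^2 + lam * ||alpha||^2."""
    k = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)

def project_l2_ball(d):
    """Projection onto the unit l2 ball (assumed constraint set for each column)."""
    return d / max(1.0, np.linalg.norm(d))

def online_dictionary_learning(X, k, lam=0.1, n_epochs=1, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, k))  # running average of alpha alpha^T
    B = np.zeros((p, k))  # running average of x alpha^T
    t = 0
    for _ in range(n_epochs):
        for j in rng.permutation(n):
            t += 1
            x = X[:, j]
            alpha = ridge_code(x, D, lam)       # 1. compute code
            A += (np.outer(alpha, alpha) - A) / t  # 2. update surrogate statistics
            B += (np.outer(x, alpha) - B) / t
            for q in range(k):                  # 3. minimize surrogate, column by column
                if A[q, q] > 1e-12:
                    step = (D @ A[:, q] - B[:, q]) / A[q, q]
                    D[:, q] = project_l2_ball(D[:, q] - step)
    return D
```

The updates `A += (... - A) / t` and `B += (... - B) / t` are the recursions for $A_t$ and $B_t$ written in incremental form; every operation on x, B and D touches all p rows, which is where the O(p) cost per iteration comes from.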
Specification for a new algorithm
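[Figure: stream of masked samples $M_t x_t$ drawn from the p × n data matrix; unobserved entries are ignored]
Constrained: use only known ratings from Ω
Efficient: single iteration in O(s), where s is the number of ratings provided by user t
Principled: follows the online matrix factorization algorithm as much as possible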
Missing values in practice
Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$ = ratings from user t
Dimension: p (all items) → s (rated items)
Use only $M_t x_t$ in algorithm computation → complexity in O(s)
[Figure: stream of masked samples $M_t x_t$ drawn from the p × n data matrix; unobserved entries are ignored]
Adaptation to make
Modify all parts of the algorithm to obtain O(s) complexity
1 Code computation
2 Surrogate update
3 Surrogate minimization
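Subsampled online dictionary learning
Check out the paper!
Original online MF
1 Code computation
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \quad B_t = B_{t-1} + \frac{1}{t} \left( x_t \alpha_t^\top - B_{t-1} \right)$$
3 Surrogate minimization
$$D^j \leftarrow p^\perp_{\mathcal{C}^r_j} \left( D^j - \frac{1}{(A_t)_{j,j}} \left( D A_t^j - B_t^j \right) \right)$$
Our algorithm
1 Code computation: masked loss
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|M_t (x_t - D_{t-1} \alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \quad B_t = B_{t-1} + \frac{1}{\sum_{i=1}^{t} M_i} \left( M_t x_t \alpha_t^\top - M_t B_{t-1} \right)$$
3 Surrogate minimization
$$M_t D^j \leftarrow p^\perp_{\mathcal{C}_j} \left( M_t D^j - \frac{1}{(A_t)_{j,j}} M_t \left( D A_t^j - B_t^j \right) \right)$$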
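The first two masked steps can be sketched in a few lines to show why only the s rated rows are touched. This is a hedged illustration, not the package's code: the mask $M_t$ is represented by an index array `idx` of the rated items (assumed to contain no duplicates), the penalty is assumed to be $\Omega(\alpha) = \|\alpha\|_2^2$, the factor $1 / \sum_i M_i$ is read as a per-item count of how often each row has been observed, and the surrogate minimization step is left out. Function names are hypothetical; the paper gives the full algorithm.

```python
import numpy as np

def masked_code(x_obs, D_obs, lam, p):
    """Code from the s observed rows only, with the penalty rescaled by s / p (= rk M_t / p)."""
    s, k = D_obs.shape
    gram = D_obs.T @ D_obs + lam * (s / p) * np.eye(k)
    return np.linalg.solve(gram, D_obs.T @ x_obs)

def masked_B_update(B, counts, idx, x_obs, alpha):
    """Update only the observed rows of B, each averaged over the number of times it was seen."""
    counts[idx] += 1
    B[idx] += (np.outer(x_obs, alpha) - B[idx]) / counts[idx][:, None]
    return B, counts
```

For a user who rated the items in `idx`, one would call `masked_code(x[idx], D[idx], lam, p)` and then `masked_B_update(B, counts, idx, x[idx], alpha)`; both operate on arrays with s rows rather than p.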
Experiments
Validation: test RMSE (rating prediction) vs CPU time
Baseline: coordinate descent solver [Yu et al., 2012] for
$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
Fastest solver available apart from SGD, which requires hyperparameter tuning
Our method has a learning rate, but with little influence
Datasets: MovieLens, Netflix
Publicly available
Larger ones exist in industry...
Results
Scalable algorithm: speed-up improves with size
Performance
Dataset      Test RMSE         Convergence time    Speed-up
             CD      SODL      CD        SODL
ML 1M        0.872   0.866     6 s       8 s       ×0.75
ML 10M       0.802   0.799     223 s     60 s      ×3.7
NF (140M)    0.938   0.934     1714 s    256 s     ×6.8
Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up 6.8× on Netflix
Simple model: RMSE is not state-of-the-art
Robustness to learning rate
Learning rate β in the algorithm: to be set in [0.75, 1] (from theory)
In practice: just set it in [0.8, 1]
[Figure: RMSE on test set vs. epoch for learning rates β between 0.75 and 1.00; two panels: MovieLens 10M and Netflix]
Conclusion
Take-home message
Online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems
[Figure: stream of masked samples $M_t x_t$ drawn from the p × n data matrix; unobserved entries are ignored]
Algorithm usable in any rich model involving matrix factorization
Python package http://github.com/arthurmensch/modl
Article/slides at http://amensch.fr/publications
Questions ?
Appendix: Resting-state fMRI
Online dictionary learning
235 h run time
1 full epoch
10 h run time
1/24 epoch
Proposed method
10 h run time
1/2 epoch, reduction r=12
Qualitatively, usable maps are obtained 10× faster
Bibliography
[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007).
Lessons from the Netflix prize challenge.
ACM SIGKDD Explorations Newsletter, 9(2):75–79.
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research, 11:19–60.
[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37(23):3311–3325.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).
Scalable coordinate descent approaches to parallel matrix factorization for recommender systems.
In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.