Large-scale Matrix Factorization
Transcript of lecture slides (10-405, by Kijung Shin)
Roadmap
• Matrix Factorization (review) <<
• Algorithms
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++
• Experiments
• Extension to Tensor Factorization
• Conclusions
Matrix Factorization: Problem
• Given:
  ◦ V: an n by m matrix
    ▪ possibly with missing values
  ◦ r: target rank (a scalar)
    ▪ usually r ≪ n and r ≪ m
• Find:
  ◦ W: an n by r matrix
  ◦ H: an r by m matrix
    ▪ without missing values
[Figure: V is n × m with "?" marking its missing entries; W is n × r; H is r × m]
Matrix Factorization: Problem
• Goal: WH ≈ V
[Figure: W × H ≈ V, with "?" marking the missing entries of V]
Loss Function
$$L(V, W, H) = \sum_{(i,j) \in Z} \left( V_{ij} - (WH)_{ij} \right)^2 + \cdots$$
• Goal: make WH similar to V
• Z: indices of the non-missing entries, i.e., (i, j) ∈ Z ↔ V_ij is not missing
• $(WH)_{ij}$: the (i, j)-th entry of WH; $V_{ij}$: the (i, j)-th entry of V
Loss Function (cont.)
$$L(V, W, H) = \sum_{(i,j) \in Z} \left( V_{ij} - (WH)_{ij} \right)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$
• Goal: prevent overfitting (by keeping the entries of W and H close to zero)
• $\lambda$: regularization parameter
• Frobenius norm: $\|W\|_F^2 = \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2$
(a sketch of this loss follows below)
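To make the objective concrete, here is a minimal NumPy sketch of this loss, assuming V is a dense array with np.nan marking missing entries (an illustrative convention; large-scale systems use sparse formats):

```python
import numpy as np

def loss(V, W, H, lam):
    """Regularized squared loss over the non-missing entries of V."""
    obs = ~np.isnan(V)                  # Z: mask of non-missing entries
    err = (V - W @ H)[obs]              # V_ij - (WH)_ij for (i, j) in Z
    return np.sum(err ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))
```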
Algorithms
• How can we minimize this loss function L?
  ◦ Stochastic Gradient Descent: SGD (covered in the last lecture)
  ◦ Alternating Least Squares: ALS (covered today)
  ◦ Cyclic Coordinate Descent: CCD++ (covered today)
• Are these algorithms parallelizable?
• Yes, all of them!
Roadmap
• Matrix Factorization (review)
• Algorithms
  ◦ Distributed SGD: DSGD <<
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++
• Experiments
• Extension to Tensor Factorization
• Conclusions
Distributed SGD (DSGD)
"Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent" (KDD 2011)
Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis
Stochastic GD for MF (review)
• Let W be stacked row-wise from $W_{1:}, \ldots, W_{n:}$ and H column-wise from $H_{:1}, \ldots, H_{:m}$
• $L(V, W, H)$: sum of the loss over the non-missing entries,
$$L(V, W, H) = \sum_{(i,j) \in Z} L'(V_{ij}, W_{i:}, H_{:j})$$
where the loss at each non-missing entry $V_{ij}$ is
$$L'(V_{ij}, W_{i:}, H_{:j}) := \left( V_{ij} - W_{i:} H_{:j} \right)^2 + \lambda \left( \frac{\|W_{i:}\|^2}{N_{i:}} + \frac{\|H_{:j}\|^2}{N_{:j}} \right)$$
• Note $W_{i:} H_{:j} = (WH)_{ij}$; $N_{i:}$ is the number of non-missing entries in the i-th row of V (and $N_{:j}$ in the j-th column)
Stochastic GD for MF (cont.)
• Stochastic Gradient Descent (SGD) for MF:
• repeat until convergence
  ◦ randomly shuffle the non-missing entries of V
  ◦ for each non-missing entry: perform an SGD step on it
Stochastic GD for MF (cont.)
• An SGD step on each non-missing entry V_ij:
  ◦ read W_i: and H_:j
  ◦ compute the gradient of $L'(V_{ij}, W_{i:}, H_{:j})$
  ◦ update W_i: and H_:j
    ▪ detailed update rules were covered in the last lecture (a sketch follows below)
[Figure: the entry V_ij of V touches only the row W_i: of W and the column H_:j of H]
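The update rules follow by differentiating L' above with respect to W_i: and H_:j. A minimal NumPy sketch of one step; the array layout, the per-row/column counts N_row and N_col, the step size eps, and the regularizer lam are illustrative assumptions:

```python
import numpy as np

def sgd_step(W, H, i, j, v_ij, N_row, N_col, eps, lam):
    """One SGD step on the non-missing entry V_ij = v_ij."""
    w_i = W[i, :].copy()        # read W_i: (copied: both updates use the old values)
    h_j = H[:, j].copy()        # read H_:j
    err = v_ij - w_i @ h_j      # prediction error V_ij - (WH)_ij
    # gradient of L' w.r.t. W_i: is -2*err*H_:j + 2*lam*W_i:/N_i:, symmetrically for H_:j
    W[i, :] = w_i + eps * (2 * err * h_j - 2 * lam * w_i / N_row[i])
    H[:, j] = h_j + eps * (2 * err * w_i - 2 * lam * h_j / N_col[j])
```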
Simple Parallel SGD for MF
• Parameter Mixing: MSGD
  ◦ entries of V are distributed across multiple machines
[Figure: the entries of V split across Machines 1-3]
Simple Parallel SGD for MF (cont.)
• Parameter Mixing: MSGD
  ◦ entries of V are distributed across multiple machines
  ◦ Map step: each machine runs SGD independently on the assigned entries until convergence
[Figure: each of Machines 1-3 holds its share of V and its own copy of W and H]
Simple Parallel SGD for MF (cont.)
• Parameter Mixing: MSGD
  ◦ Map step: each machine runs an independent instance of SGD on its subset of the data until convergence
  ◦ Reduce step: average the results (i.e., W and H)
• Problem: does not converge to correct solutions
  ◦ no guarantee that "the average" reduces the loss function L
[Figure: H ← (H from Machine 1 + H from Machine 2 + H from Machine 3) / 3]
Simple Parallel SGD for MF (cont.)
• Iterative Parameter Mixing: ISGD
  ◦ entries of V are distributed across multiple machines
  ◦ repeat until convergence
    ▪ Map step: each machine runs SGD independently on the assigned entries for some time
    ▪ Reduce step: average the results (i.e., W and H) and broadcast the averaged results
• Problem: slow convergence
  ◦ still has the averaging step
Interchangeability
• How can we avoid the averaging step?
  ◦ let different machines update different entries of W and H
• Two entries V_ij and V_kl of V are interchangeable if they share neither a row nor a column
[Figure: V_ij and V_il share the row W_i:, so they are not interchangeable; V_ij and V_kl (i ≠ k, j ≠ l) touch disjoint rows W_i:, W_k: and columns H_:j, H_:l, so they are interchangeable]
Interchangeability (cont.)
• If V_ij and V_kl are interchangeable, SGD steps on V_ij and V_kl can be parallelized safely:
  ◦ one step reads W_i: and H_:j, computes the gradient of $L'(V_{ij}, W_{i:}, H_{:j})$, and updates W_i: and H_:j
  ◦ the other reads W_k: and H_:l, computes the gradient of $L'(V_{kl}, W_{k:}, H_{:l})$, and updates W_k: and H_:l
  ◦ no conflicts!
[Figure: the two SGD steps touch disjoint rows of W and disjoint columns of H]
Distributed SGD
• Block W, H, and V:
  ◦ W is split row-wise into W(1), W(2), W(3)
  ◦ H is split column-wise into H(1), H(2), H(3)
  ◦ V is split into the aligned blocks V(1,1), V(1,2), …, V(3,3)
[Figure: the 3 × 3 blocking of V aligned with the blocks of W and H]
Distributed SGD (cont.)
• Block W, H, and V
• repeat until convergence
  ◦ for each set of interchangeable blocks of V:
    ▪ 1. run SGD on the blocks in parallel (no conflict between machines)
    ▪ 2. merge the results (i.e., W and H); averaging is not needed
• Blocks on a diagonal are interchangeable! (see the sketch below)
[Figure: the diagonal blocks V(1,1), V(2,2), V(3,3) of the 3 × 3 blocking]
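A minimal single-process sketch of one DSGD epoch that simulates the machines sequentially; in the real system, the b blocks of each stratum run on b machines in parallel. It reuses sgd_step from the earlier sketch, and the blocking maps and entry storage are illustrative assumptions:

```python
import numpy as np

def dsgd_epoch(entries, W, H, b, row_blk, col_blk, sgd_kwargs):
    """One DSGD epoch over a b-by-b blocking: every block of V is used exactly once.

    entries: dict mapping (i, j) -> V_ij over the non-missing entries.
    row_blk, col_blk: arrays mapping each row/column index to a block id in 0..b-1.
    """
    shift = np.random.randint(b)           # randomize the order of the strata
    for s in range(b):
        # Stratum s pairs row-block p with column-block (p + s + shift) % b.
        # The pairing is a permutation, so the b paired blocks are interchangeable.
        for (i, j), v in entries.items():  # within a block, order should be shuffled too
            if (row_blk[i] + s + shift) % b == col_blk[j]:
                sgd_step(W, H, i, j, v, **sgd_kwargs)
```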
Interchangeable Blocks
• Example of interchangeable blocks: the diagonal V(1,1), V(2,2), V(3,3)
[Figure: Machine 1 holds (W(1), H(1), V(1,1)); Machine 2 holds (W(2), H(2), V(2,2)); Machine 3 holds (W(3), H(3), V(3,3))]
Interchangeable Blocks (cont.)
• Another example of interchangeable blocks: V(1,2), V(2,3), V(3,1)
[Figure: Machine 1 holds (W(1), H(2), V(1,2)); Machine 2 holds (W(2), H(3), V(2,3)); Machine 3 holds (W(3), H(1), V(3,1))]
Interchangeable Blocks (cont.)
• Another example of interchangeable blocks: V(1,3), V(2,1), V(3,2)
[Figure: Machine 1 holds (W(1), H(3), V(1,3)); Machine 2 holds (W(2), H(1), V(2,1)); Machine 3 holds (W(3), H(2), V(3,2))]
Interchangeable Blocks (cont.)
• Blocks that are not interchangeable (not used in DSGD): e.g., V(1,3), V(2,3), V(3,2)
  ◦ V(1,3) and V(2,3) share the column block H(3), so they conflict
[Figure: V(1,3), V(2,3), V(3,2) in the 3 × 3 blocking]
More details of DSGD
• How should we choose the sets of interchangeable blocks?
  ◦ in each iteration, choose the sets so that every block of V is used exactly once
  ◦ but the order and grouping are random
• SGD within each block:
  ◦ every non-missing entry in the block is used exactly once
  ◦ but the order is random
More details of DSGD (cont.)
• Use the "bold driver" heuristic to set the step size (or learning rate):
  ◦ after each iteration,
    ▪ increase the step size if the loss decreased
    ▪ decrease the step size if the loss increased
  ◦ (a sketch follows below)
• Implemented on Hadoop and in R/Snowfall
  ◦ Snowfall: a package for parallel R programs
  ◦ https://cran.r-project.org/web/packages/snowfall/index.html
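A one-function sketch of the bold-driver schedule; the growth and shrink factors are illustrative assumptions, not the paper's exact constants:

```python
def bold_driver(step, loss, prev_loss, grow=1.05, shrink=0.5):
    """Raise the step size after an improving epoch, cut it after a worsening one."""
    return step * grow if loss < prev_loss else step * shrink
```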
Pros & Cons of DSGD
• Pros:
  ◦ faster convergence than MSGD and ISGD: no averaging step
  ◦ memory efficiency: each machine needs to maintain only a single block of W and a single block of H in memory
• Cons:
  ◦ many hyperparameters: step size (ε) in addition to the regularization parameter (λ) and the rank (r)
[Figure: Machines 1-3 each hold one diagonal block V(k,k) together with W(k) and H(k)]
Roadmap
• Matrix Factorization (review)
• Algorithms
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS <<
  ◦ Cyclic Coordinate Descent: CCD++
• Experiments
• Extension to Tensor Factorization
• Conclusions
Alternating Least Squares (ALS)
"Large-scale Parallel Collaborative Filtering for the Netflix Prize" (AAIM 2008)
Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan
Main Idea behind ALS
• How hard is matrix factorization?
  ◦ i.e., finding $\arg\min_{W,H} L(V, W, H)$
• Solving the entire problem is difficult:
  ◦ L(V, W, H) is a non-convex function of W and H
    ▪ many local optima
  ◦ finding a global optimum is NP-hard
Main Idea behind ALS (cont.)
• How hard is solving a smaller problem?
  ◦ specifically, finding $\arg\min_{H} L(V, W, H)$ while fixing W to its current value
• Much easier!
  ◦ L(V, W, H) is a convex function of H (once W is fixed)
    ▪ one local optimum, which is also globally optimal
  ◦ moreover, a closed-form solution exists
    ▪ we can directly compute the global optimum
Updating H
• Finding $\arg\min_{H} L(V, W, H)$ while fixing W to its current value
• This problem is solved by the following update rule, derived from the first-order condition $\frac{\partial L}{\partial H_{kj}} = 0, \forall k, \forall j$:

For $j = 1, \ldots, m$:
$$H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:} W_{i:}^T + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}$$

• $Z_{:j}$: row indices of the non-missing entries in the j-th column of V
• $I_r$: the r by r identity matrix (a sketch of this update follows below)
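A minimal NumPy sketch of this column update, assuming V is dense with np.nan marking missing entries (real implementations use sparse formats):

```python
import numpy as np

def update_H_column(V, W, H, j, lam):
    """Closed-form ALS update of the column H_:j with W fixed."""
    rows = np.where(~np.isnan(V[:, j]))[0]   # Z_:j, the rows observed in column j
    Wj = W[rows, :]                          # the rows W_i: with i in Z_:j
    r = W.shape[1]
    A = Wj.T @ Wj + lam * np.eye(r)          # sum_i W_i: W_i:^T + lam * I_r
    b = Wj.T @ V[rows, j]                    # sum_i V_ij W_i:
    H[:, j] = np.linalg.solve(A, b)          # solve, rather than invert explicitly
```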
Updating W
• Likewise, finding $\arg\min_{W} L(V, W, H)$ while fixing H to its current value
• This problem is solved by the following update rule, derived from the first-order condition $\frac{\partial L}{\partial W_{ik}} = 0, \forall i, \forall k$:

For $i = 1, \ldots, n$:
$$W_{i:} \leftarrow \left( \sum_{j \in Z_{i:}} H_{:j} H_{:j}^T + \lambda I_r \right)^{-1} \sum_{j \in Z_{i:}} V_{ij} H_{:j}$$

• $Z_{i:}$: column indices of the non-missing entries in the i-th row of V
Alternating Least Squares (ALS)
randomly initialize W and H
repeat until convergence:
    update W while fixing H to its current value
    update H while fixing W to its current value

• Each step never increases L(V, W, H):
  ◦ W is updated to the "best" W that minimizes L(V, W, H) for the current H
  ◦ H is updated to the "best" H that minimizes L(V, W, H) for the current W
• L(V, W, H) monotonically decreases until convergence (see the sketch below)
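A compact sketch of the full alternating loop, reusing update_H_column from the sketch above together with its row-wise mirror; the fixed iteration count standing in for a convergence test is an assumption:

```python
import numpy as np

def update_W_row(V, W, H, i, lam):
    """Closed-form ALS update of the row W_i: with H fixed."""
    cols = np.where(~np.isnan(V[i, :]))[0]   # Z_i:, the columns observed in row i
    Hi = H[:, cols]                          # the columns H_:j with j in Z_i:
    r = H.shape[0]
    A = Hi @ Hi.T + lam * np.eye(r)          # sum_j H_:j H_:j^T + lam * I_r
    b = Hi @ V[i, cols]                      # sum_j V_ij H_:j
    W[i, :] = np.linalg.solve(A, b)

def als(V, r, lam, n_iters=20):
    n, m = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.standard_normal((n, r)), rng.standard_normal((r, m))
    for _ in range(n_iters):                 # each half-step never increases L
        for i in range(n):
            update_W_row(V, W, H, i, lam)
        for j in range(m):
            update_H_column(V, W, H, j, lam)
    return W, H
```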
Parallel ALS: Idea
• Recall the update rule for each j-th column of H:
$$H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:} W_{i:}^T + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}$$
[Figure: updating H_:j reads W and the j-th column V_:j, and writes only H_:j]
Parallel ALS: Idea (cont.)
• Potentially every entry of W needs to be read
• Only the j-th column of V needs to be read
• H does not need to be read
⇒ the columns of H can be updated independently in parallel
[Figure: the reads (W, V_:j) and the write (H_:j) of the column update]
Updating H in Parallel
• A toy example with 3 machines
• 1. Divide V column-wise into 3 pieces: V(1), V(2), V(3)
Updating H in Parallel (cont.)
• 2. Distribute the pieces across the machines
• 3. Broadcast W to every machine
[Figure: Machine k holds (W, V(k))]
Updating H in Parallel (cont.)
• 4. Compute the corresponding columns of H in parallel
[Figure: Machine k computes its piece H(k) from (W, V(k))]
Updating H in Parallel (cont.)
• 5. Broadcast the part of H that each machine computed
[Figure: every machine ends up with the full H = (H(1), H(2), H(3))]
Parallel ALS: Idea
• Recall the update rule for each row of W:
$$W_{i:} \leftarrow \left( \sum_{j \in Z_{i:}} H_{:j} H_{:j}^T + \lambda I_r \right)^{-1} \sum_{j \in Z_{i:}} V_{ij} H_{:j}$$
[Figure: updating W_i: reads H and the i-th row V_i:, and writes only W_i:]
Parallel ALS: Idea (cont.)
• Potentially every entry of H needs to be read
• Only the i-th row of V needs to be read
• W does not need to be read
⇒ the rows of W can be updated independently in parallel
[Figure: the reads (H, V_i:) and the write (W_i:) of the row update]
Updating W in Parallel
• A toy example with 3 machines
• 1. Divide V row-wise into 3 pieces: V(1), V(2), V(3)
Updating W in Parallel (cont.)
• 2. Distribute the pieces across the machines
• 3. Broadcast H to every machine
[Figure: Machine k holds (H, V(k))]
Updating W in Parallel (cont.)
• 4. Compute the corresponding rows of W in parallel
[Figure: Machine k computes its piece W(k) from (H, V(k))]
Updating W in Parallel (cont.)
• 5. Broadcast the part of W that each machine computed
[Figure: every machine ends up with the full W = (W(1), W(2), W(3))]
Pros & Cons of ALS
• Pros:
  ◦ fewer hyperparameters: no step size required
• Cons:
  ◦ high computational cost: e.g., the matrix inversion in the update rule takes O(r³) time
$$H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:} W_{i:}^T + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}$$
Pros & Cons of ALS (cont.)
• Cons (cont.):
  ◦ high memory requirement:
    ▪ while updating W (or H), each machine maintains the entire H (or W) in memory
[Figure: each of Machines 1-3 holds the whole W alongside its piece V(k)]
Roadmap
• Matrix Factorization (review)
• Algorithms
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++ <<
• Experiments
• Extension to Tensor Factorization
• Conclusions
Cyclic Coordinate Descent (CCD++)
"Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems" (ICDM 2012, Best Paper Award)
Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon
Matrix Factorization: Revisit
• Goal: V is approximated by the product of W and H
[Figure: W (n × r) × H (r × m) ≈ V (n × m), with "?" marking the missing entries of V]
Matrix Factorization: Revisit
• Goal: V is approximated by the product of W and H; entrywise,
$$V_{ij} \approx (WH)_{ij} = \sum_{k=1}^{r} W_{ik} H_{kj} = W_{i1} H_{1j} + \cdots + W_{ir} H_{rj}$$
[Figure: the entry V_ij depends on the row W_i: and the column H_:j]
Matrix Factorization: Revisit
• Equivalently, V is approximated by the sum of r rank-1 matrices:
$$V \approx W_{:1} H_{1:} + \cdots + W_{:r} H_{r:}$$
where $W_{:k}$ is the k-th column of W and $H_{k:}$ is the k-th row of H, so that
$$V_{ij} \approx (W_{:1} H_{1:})_{ij} + \cdots + (W_{:r} H_{r:})_{ij} = W_{i1} H_{1j} + \cdots + W_{ir} H_{rj}$$
(see the check below)
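A short NumPy check of this identity, with illustrative shapes:

```python
import numpy as np

n, m, r = 4, 5, 3
rng = np.random.default_rng(0)
W, H = rng.standard_normal((n, r)), rng.standard_normal((r, m))
# the product WH equals the sum of the r rank-1 outer products W_:k H_k:
assert np.allclose(W @ H, sum(np.outer(W[:, k], H[k, :]) for k in range(r)))
```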
Cyclic Coordinate Descent: CCD++
• repeat until convergence
  ◦ for k = 1 … r:
    ▪ update the k-th column of W and the k-th row of H while fixing the other entries of W and H
• Consider a toy example with rank 3:
$$V \approx W_{:1} H_{1:} + W_{:2} H_{2:} + W_{:3} H_{3:}$$
Order of Updates in CCD++
• Step 1: update $W_{:1}$ and $H_{1:}$ (with $W_{:2} H_{2:}$ and $W_{:3} H_{3:}$ fixed)
• Step 2: update $W_{:2}$ and $H_{2:}$ (with the other two components fixed)
• Step 3: update $W_{:3}$ and $H_{3:}$ (with the other two components fixed), then cycle back
Performing One Update
• Let $W_{:1}$ and $H_{1:}$ be the column and row being updated
• Moving the fixed components to the other side yields the residual matrix R:
$$R := V - W_{:2} H_{2:} - W_{:3} H_{3:} \approx W_{:1} H_{1:}$$
Performing One Update (cont.)
• Updating $W_{:1}$ and $H_{1:}$ amounts to factorizing R with rank 1:
$$R = V - W_{:2} H_{2:} - W_{:3} H_{3:}$$
$$\;\; = V - W_{:1} H_{1:} - W_{:2} H_{2:} - W_{:3} H_{3:} + W_{:1} H_{1:}$$
$$\;\; = V - WH + W_{:1} H_{1:}$$
Performing One Update (cont.)
• Factorizing R with rank 1 can be performed using ALS, with the following rules (derived from the ALS update rules):

For $i = 1, \ldots, n$:
$$W_{ik} \leftarrow \frac{\sum_{j \in Z_{i:}} R_{ij} H_{kj}}{\sum_{j \in Z_{i:}} H_{kj}^2 + \lambda}$$

For $j = 1, \ldots, m$:
$$H_{kj} \leftarrow \frac{\sum_{i \in Z_{:j}} R_{ij} W_{ik}}{\sum_{i \in Z_{:j}} W_{ik}^2 + \lambda}$$

(a sketch follows below)
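A minimal NumPy sketch of this rank-1 step, assuming a dense residual R with np.nan marking missing entries (the paper works on sparse data):

```python
import numpy as np

def rank1_update(R, w, h, lam, n_inner=3):
    """Rank-1 factorization R ≈ w h^T via the CCD++ rules above.

    w, h are views of W[:, k] and H[k, :] and are updated in place.
    """
    obs = (~np.isnan(R)).astype(float)   # indicator of the non-missing entries
    R0 = np.nan_to_num(R)                # zeros stand in for the missing entries
    for _ in range(n_inner):
        w[:] = (R0 @ h) / (obs @ (h * h) + lam)      # all W_ik updates at once
        h[:] = (R0.T @ w) / (obs.T @ (w * w) + lam)  # all H_kj updates at once
```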
Pseudocode of CCD++

Pseudocode (simple):
    randomly initialize W and H
    repeat until convergence:
        for k = 1 … r:
            R ← V − WH + W_:k H_k:    (recomputed from scratch: repeated work!)
            update W_:k and H_k: (rank-1 factorization of R)

Pseudocode (detailed), where the residual R is maintained incrementally instead:
    randomly initialize W and H
    R ← V − WH
    repeat until convergence:
        for k = 1 … r:
            R ← R + W_:k H_k:    (add the old k-th component back)
            update W_:k and H_k: (rank-1 factorization of R)
            R ← R − W_:k H_k:    (subtract the new k-th component)

(a sketch follows below)
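A sketch of the detailed, residual-maintaining variant, reusing rank1_update from the sketch above; the fixed outer iteration count is an assumption:

```python
import numpy as np

def ccdpp(V, r, lam, n_outer=10):
    n, m = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.standard_normal((n, r)), rng.standard_normal((r, m))
    R = V - W @ H                              # residual, maintained incrementally
    for _ in range(n_outer):
        for k in range(r):
            R += np.outer(W[:, k], H[k, :])    # add the old rank-1 component back
            rank1_update(R, W[:, k], H[k, :], lam)
            R -= np.outer(W[:, k], H[k, :])    # subtract the new component
    return W, H
```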
Parallel CCD++
• Updating H_k: in parallel:
  ◦ 1. divide R column-wise into R(1), R(2), R(3)
  ◦ 2. distribute the pieces across the machines
  ◦ 3. broadcast W_:k to every machine
[Figure: Machine i holds (W_:k, R(i))]
Parallel CCD++ (cont.)
• Updating H_k: in parallel (cont.):
  ◦ 4. compute the corresponding entries of H_k: in parallel
[Figure: Machine i computes its piece of H_k:]
Parallel CCD++ (cont.)
• Updating H_k: in parallel (cont.):
  ◦ 5. send the computed entries to the other machines
[Figure: every machine ends up with the full H_k:]
Pros & Cons of CCD++
• Pros:
  ◦ low computational cost: the update rules need no matrix inversion
$$W_{ik} \leftarrow \frac{\sum_{j \in Z_{i:}} R_{ij} H_{kj}}{\sum_{j \in Z_{i:}} H_{kj}^2 + \lambda}, \qquad H_{kj} \leftarrow \frac{\sum_{i \in Z_{:j}} R_{ij} W_{ik}}{\sum_{i \in Z_{:j}} W_{ik}^2 + \lambda}$$
Pros & Cons of CCD++ (cont.)
• Pros (cont.):
  ◦ low memory requirements:
    ▪ each machine maintains one row of H (or one column of W) in memory at a time, instead of the entire H (or W)
[Figure: Machine 1 in ALS holds (W, V(1)); Machine 1 in CCD++ holds only (W_:k, R(1))]
Pros & Cons of CCD++ (cont.)
• Cons:
  ◦ slow for dense matrices with many non-missing entries:
    ▪ to update R, every non-missing entry needs to be read and rewritten
    ▪ this is especially slow if R is stored on disk
Roadmap
• Matrix Factorization (review)
• Algorithms
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++
• Experiments <<
• Extension to Tensor Factorization
• Conclusions
Experimental Settings
• Datasets: rating matrices (users × movies or songs) with missing entries
[Figure: a toy users × movies rating matrix with "?" marking missing ratings]
Experimental Settings (cont.)
• Datasets:
  ◦ 2.7M users, 18K movies, 100M ratings
  ◦ 1M users, 625K songs, 256M ratings
  ◦ 71K users, 65K movies, 10M ratings
Experimental Settings (cont.)
• Split of datasets:
  ◦ training set: about 80% of the non-missing entries
  ◦ test set: about 20% of the non-missing entries
• Evaluation metric: test RMSE (root-mean-square error)
$$RMSE(V, W, H) = \sqrt{ \frac{1}{|Z_{test}|} \sum_{(i,j) \in Z_{test}} \left( V_{ij} - (WH)_{ij} \right)^2 }$$
• $Z_{test}$: indices of the test entries; $V_{ij}$: true rating; $(WH)_{ij}$: estimated rating
(a sketch follows below)
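A minimal sketch of this metric, assuming the test set is given as a list of (i, j, rating) triples:

```python
import numpy as np

def test_rmse(test_entries, W, H):
    """Root-mean-square error of WH over the held-out entries."""
    sq_err = [(v - W[i, :] @ H[:, j]) ** 2 for i, j, v in test_entries]
    return float(np.sqrt(np.mean(sq_err)))
```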
Experimental Settings (cont.)
• Machines and implementations:
  ◦ shared-memory setting: 8 cores, OpenMP in C++
  ◦ distributed setting: up to 20 machines, MPI in C++
Convergence Speed
• Shared-memory setting with 8 cores
• CCD++ decreases the test RMSE fastest
[Figure 3(b) of [YHS+12]]
Convergence Speed (cont.)
• Shared-memory setting with 8 cores
• CCD++'s test RMSE converges fastest
[Figure 3(c) of [YHS+12]]
Convergence Speed (cont.)
• Shared-memory setting with 8 cores
• CCD++ converges fastest
[Figure 3(a) of [YHS+12]]
Speedup in Shared Memory
• CCD++ and ALS show near-linear machine scalability
• DSGD suffers from high cache-miss rates due to its randomness
  ◦ the cache-miss rate increases as more cores are used
[Figure 4 of [YHS+12]]
Speedup in Distributed Settings
• All algorithms show similar speedups
[Figure 6(b) of [YHS+12]]
Roadmap
• Matrix Factorization (review)
• Algorithms
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++
• Experiments
• Extension to Tensor Factorization <<
• Conclusions
Tensors
• An N-way tensor is an N-dimensional array:
  ◦ 1-way tensor (= vector): entries $t_{i_1}$ with $i_1 = 1, \ldots, I_1$
  ◦ 2-way tensor (= matrix): entries $t_{i_1 i_2}$ with $i_1 = 1, \ldots, I_1$ and $i_2 = 1, \ldots, I_2$
  ◦ 3-way tensor: a cube of entries $t_{i_1 i_2 i_3}$
Tensor Factorization: Problem
• Given:
  ◦ X: an I by J by K tensor (i.e., a 3-dimensional array)
    ▪ possibly with missing values
  ◦ R: target rank (a scalar)
    ▪ usually R ≪ I, J, K
• Find:
  ◦ U: an I by R matrix
  ◦ H: a J by R matrix
  ◦ W: a K by R matrix
    ▪ without missing values
[Figure: the I × J × K tensor X with "?" marking missing entries, and the factor matrices U, H, W]
Tensor Factorization: Problem
• Goal: $X \approx [U, H, W]$,
  ◦ where $[U, H, W]_{ijk} = \sum_{r=1}^{R} U_{ir} H_{jr} W_{kr}$ (see the sketch below)
[Figure: X approximated by the factor matrices U, H, W]
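This reconstruction is a one-liner in NumPy (an illustrative sketch):

```python
import numpy as np

def reconstruct(U, H, W):
    """CP reconstruction: [U,H,W]_ijk = sum_r U_ir * H_jr * W_kr."""
    return np.einsum('ir,jr,kr->ijk', U, H, W)
```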
Loss Function
$$L(X, U, H, W) = \sum_{(i,j,k) \in Z} \left( X_{ijk} - [U, H, W]_{ijk} \right)^2 + \cdots$$
• Goal: make X and [U, H, W] similar
• Z: indices of the non-missing entries, i.e., (i, j, k) ∈ Z ↔ X_ijk is not missing
• $[U, H, W]_{ijk}$: the (i, j, k)-th entry of [U, H, W]; $X_{ijk}$: the (i, j, k)-th entry of X
Loss Function (cont.)
$$L(X, U, H, W) = \sum_{(i,j,k) \in Z} \left( X_{ijk} - [U, H, W]_{ijk} \right)^2 + \lambda \left( \|U\|_F^2 + \|H\|_F^2 + \|W\|_F^2 \right)$$
• Goal: prevent overfitting (by keeping the entries of U, H, and W close to zero)
• $\lambda$: regularization parameter
• Frobenius norm: $\|U\|_F^2 = \sum_{i=1}^{I} \sum_{r=1}^{R} U_{ir}^2$
SGD for TF
• $L(X, U, H, W)$: sum of the loss over the non-missing entries,
$$L(X, U, H, W) = \sum_{(i,j,k) \in Z} L'(X_{ijk}, U_{i:}, H_{j:}, W_{k:})$$
• An SGD step on each non-missing entry X_ijk:
  ◦ read U_i:, H_j:, and W_k:
  ◦ compute the gradient of $L'(X_{ijk}, U_{i:}, H_{j:}, W_{k:})$
  ◦ update U_i:, H_j:, and W_k: (a sketch follows below)
[Figure: the entry X_ijk touches only the rows U_i:, H_j:, and W_k: of the factor matrices]
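A minimal sketch of one such step, taking the per-entry loss L' to be squared error plus plain L2 regularization (an assumption; the exact regularization mirrors the MF case):

```python
import numpy as np

def sgd_step_tf(U, H, W, i, j, k, x_ijk, eps, lam):
    """One SGD step on the non-missing entry X_ijk of a CP factorization."""
    u, h, w = U[i, :].copy(), H[j, :].copy(), W[k, :].copy()
    err = x_ijk - np.sum(u * h * w)        # X_ijk - [U,H,W]_ijk
    U[i, :] = u + eps * (2 * err * h * w - 2 * lam * u)
    H[j, :] = h + eps * (2 * err * u * w - 2 * lam * h)
    W[k, :] = w + eps * (2 * err * u * h - 2 * lam * w)
```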
Parallel Algorithms for TF
• The parallel algorithms for MF have been extended to TF:
  ◦ DSGD → FlexiFaCT [BKP+14]
  ◦ ALS → [SPK16]
  ◦ CCD++ → CDTF [SK17]
Applications
• Context-aware recommendation [KABO10]
[Figure: a 3-way tensor of ratings (users × movies × context); a missing rating is estimated as $X_{ijk} = [U, H, W]_{ijk}$]
Applications (cont.)
• Context-aware recommendation [KABO10]
[Figure: a 3-way tensor of ratings (users × restaurants × context); a missing rating is estimated as $X_{ijk} = [U, H, W]_{ijk}$]
Applications (cont.)
• Video restoration [LMWY13]
[Figure: a video as a 3-way tensor (x-axis pixels × y-axis pixels × frames); a missing or corrupted pixel is estimated as $X_{ijk} = [U, H, W]_{ijk}$]
Applications (cont.)
• Video restoration [LMWY13]
[Figure: a corrupted frame and the restored frame]
Applications (cont.)
• Personalized web search [SZL+05]
[Figure: a 3-way tensor of preferences (users × query keywords × web pages); a missing preference is estimated as $X_{ijk} = [U, H, W]_{ijk}$]
Roadmap
• Matrix Factorization (review)
• Algorithms
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++
• Experiments
• Extension to Tensor Factorization
• Conclusions <<
Conclusions
• Matrix Factorization (MF)
• Parallel algorithms for MF:
  ◦ Distributed SGD: DSGD
  ◦ Alternating Least Squares: ALS
  ◦ Cyclic Coordinate Descent: CCD++
• Tensor Factorization (TF):
  ◦ extension of MF to tensors
  ◦ applications: context-aware recommendation, video restoration, personalized web search, etc.
Questions?
• These slides are available at: www.cs.cmu.edu/~kijungs/ETC/10-405.pdf
• Email: [email protected]
References
• [SZL+05] Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen. "CubeSVD: A Novel Approach to Personalized Web Search." WWW 2005
• [ZWSP08] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. "Large-scale Parallel Collaborative Filtering for the Netflix Prize." AAIM 2008
• [KABO10] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. "Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering." RecSys 2010
• [GNHS11] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. "Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent." KDD 2011
• [DKA11] Daniel M. Dunlavy, Tamara G. Kolda, and Evrim Acar. "Temporal Link Prediction Using Matrix and Tensor Factorizations." TKDD 2011
References (cont.)
• [YHS+12] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. "Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems." ICDM 2012
• [LMWY13] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. "Tensor Completion for Estimating Missing Values in Visual Data." TPAMI 2013
• [BKP+14] Alex Beutel, Abhimanu Kumar, Evangelos E. Papalexakis, Partha Pratim Talukdar, Christos Faloutsos, and Eric P. Xing. "FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop." SDM 2014
• [SPK16] Shaden Smith, Jongsoo Park, and George Karypis. "An Exploration of Optimization Algorithms for High Performance Tensor Completion." SC 2016
• [SK17] Kijung Shin and U Kang. "Distributed Methods for High-dimensional and Large-scale Tensor Factorization." ICDM 2014
Closed-Form Solution of ALS
• Let W be stacked row-wise from $W_{1:}, \ldots, W_{n:}$ and H column-wise from $H_{:1}, \ldots, H_{:m}$. Then
$$L(V, W, H) = \sum_{(i,j) \in Z} \left( V_{ij} - W_{i:} H_{:j} \right)^2 + \lambda \left( \sum_{i=1}^{n} \|W_{i:}\|^2 + \sum_{j=1}^{m} \|H_{:j}\|^2 \right)$$
$$= \sum_{(i,j) \in Z} \left( V_{ij} - \sum_{k=1}^{r} W_{ik} H_{kj} \right)^2 + \lambda \left( \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2 + \sum_{j=1}^{m} \sum_{k=1}^{r} H_{kj}^2 \right)$$
Closed-Form Solution of ALS (cont.)
Setting $\frac{1}{2} \frac{\partial L}{\partial H_{kj}} = 0, \forall k, \forall j$ (first-order conditions):
$$\Rightarrow -\sum_{i \in Z_{:j}} \left( V_{ij} - W_{i:} H_{:j} \right) W_{ik} + \lambda H_{kj} = 0, \quad \forall k, \forall j$$
$$\Rightarrow \sum_{i \in Z_{:j}} \left( W_{i:} H_{:j} \right) W_{ik} + \lambda H_{kj} = \sum_{i \in Z_{:j}} V_{ij} W_{ik}, \quad \forall k, \forall j$$
which, written in matrix form over all k, gives the update rule for $H_{:j}$ above
• $Z_{:j}$: rows of the non-missing entries in the j-th column of V