Fast Coordinate Descent Methods with Variable Selection...

58
Fast Coordinate Descent Methods with Variable Selection for NMF ChoJui Hsieh and Inderjit S. Dhillo Published on KDD 2011 Hongchang Gao

Transcript of Fast Coordinate Descent Methods with Variable Selection...

Page 1: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Methods with Variable Selection for NMF

ChoJui Hsieh and Inderjit S. Dhillo Published on KDD 2011

Hongchang Gao

Page 2: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Outline

• Definition • Multiplicative Update Method • Alternating Non-negative Least Squares • Gradient Descent Method • Fast Coordinate Descent Methods

Page 3: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Definition

• Given a nonnegative matrix , find nonnegative matrices and to – The partial derivative r.w.t W and H

m nV R ×∈m kW R ×∈ k nH R ×∈

T Tf WHH VHW∂

= −∂

T Tf W WH W VH∂

= −∂

2

, 0

1min ( , ) || ||2 FW H

f W H V WH≥

= −

Page 4: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Outline

• Definition • Multiplicative Update Method • Alternating Non-negative Least Squares • Gradient Descent Method • Fast Coordinate Descent Methods

Page 5: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• The most common used method • Proposed by Lee and Seung (2001) • The update rule:

( )( )

Tia

ia ia Tia

VHW WWHH

=

( )( )

Ta

a a Ta

W VH H

W WHµ

µ µµ

=

Page 6: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• Arise from gradient descent method – Where is a small positive number.

• Set it as

• Then

[( ) ( ) ]T Tia ia ia ia iaW W VH WHHε← + −

iaε

( )ia

ia Tia

WWHH

ε =

( )( )

Tia

ia ia Tia

VHW WWHH

=

Presenter
Presentation Notes
This method takes a step in the direction of the negative gradient.
Page 7: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• Algorithm – The 10−9 in each update rule is added to avoid

division by zero

Page 8: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• Property 1 – If and are strictly positive, these matrices

remain positive throughout the iterations.

• Property 2 – If and , then

initW initH

* *{ , } { , }k kW H W H→ * *0, 0W H> >

* *

* *

( , ) 0

( , ) 0

f W HWf W HH

∂=

∂∂

=∂

Page 9: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• Proof of Property 2 – The update rule

– For the limit point

[ . / ( )].*[ ( )]T TH H H W WH W V WH= + −

([ ] [ ] ) 0[ ]

[ ] [ ] 0

[ ] 0

ij T Tij ijT

ij

T Tij ij

ij

HW V W WH

W WH

W V W WHfH

− =

⇒ − =

∂⇒ =

Page 10: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• From the two properties, KKT conditions satisfied, which means the limit point is a stationary point.

• Otherwise, can not determine whether it is a stationary point

Page 11: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Multiplicative Update Method

• Conclusion: – The sequence can not guarantee to converge to a

stationary point – When converge, are slow to converge notoriously – The computational cost for each iteration – Once an element in W or H becomes 0, it must

remain 0

( )O mnk

Page 12: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Outline

• Definition • Multiplicative Update Method • Gradient Descent Method • Alternating Non-negative Least Squares • Fast Coordinate Descent Methods

Page 13: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Gradient Descent Method

• The update rule: – The multiplicative update method can be

considered as a gradient method

Page 14: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Gradient Descent Method

• How to choose the step ? – Initialize as 1, then multiply them by ½ at each

iteration. Can not guarantee the non-negativity. – Project gradient method.

,H Wε ε

Page 15: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Gradient Descent Method

• Main idea of Projected Gradient Method – Given such a problem

– Update rule:

Page 16: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Gradient Descent Method

• Conclusion: – Without a careful choice for step, it is difficult to

guarantee non-negativity. – The projection makes it difficult to analysis the

convergence. – Sensitive to the initialization

Page 17: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Outline

• Definition • Multiplicative Update Method • Gradient Descent Method • Alternating Non-negative Least Squares • Fast Coordinate Descent Methods

Page 18: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Alternating Non-negative Least Squares

• The objective is not convex in both W and H, but it is convex in either W or H.

• Alternatively fixes one matrix and improves the other, called Block Coordinate Descent

Page 19: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Alternating Non-negative Least Squares

• Theorem – Any limit point of the sequence generated by

Algorithm 2 is a stationary point. { , }k kW H

Page 20: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Alternating Non-negative Least Squares

• Conclusion: – Has nice optimization properties. – It can be very fast if well implemented.

Page 21: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Outline

• Definition • Multiplicative Update Method • Gradient Descent Method • Alternating Non-negative Least Squares • Fast Coordinate Descent Methods

Page 22: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Method with Variable Selection

• Contribution – Propose a variable selection scheme – Guarantee the convergence – Propose a cyclic coordinate method solve

Page 23: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Method with Variable Selection

• Coordinate Gradient Method • Variable Selection Strategy • CGD for KL-Divergence • Convergence Analysis • Experiment Result

Page 24: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method

• Coordinate Descent Method – updates one variable at a time until convergence. – More efficient than ANLS

• ANLS need find an exact solution for each sub-problem to guarantee a stationary point

Page 25: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method

• The update rule for W – Where is a matrix with all elements zero

except the (i, r) elements equals one.

• It equals to solve a one-variable subproblem:

irE m k×

Page 26: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method

• Rewrite it as – It is a one-variable quadratic function with

constraint – Has closed form solution:

– where

Page 27: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method

• Existing Method – FastHals is a coordinate descent method.

• Use a cyclic coordinate descent method – It first updates all variables in W in cyclic order, and then

updates variables in H.

• May perform unneeded descent steps on unimportant variables.

Page 28: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Method with Variable Selection

• Coordinate Gradient Method • Variable Selection Strategy • CGD for KL-Divergence • Convergence Analysis • Experiment Result

Page 29: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• Greedy Coordinate Descent (GCD) – select variables according to their importance

• Behavior Of FastHals and GCD – Apparently, GCD focuses on nonzero variables – GCD reduces the objective value more efficiently

Page 30: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• Update rules: – In the outer updates:

– In the inner updates:

Page 31: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• If is selected to update – The optimal update is

– The objective will be decreased by

• Where measures how much the objective can be

reduced by choosing • Thus, according to choose the which reduce the

objective value mostly

irW

WirD

irWirWWD

Page 32: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• Idea – Maintain GW , DW to determine which to update – update them after updating each element

• Strategy – 1. Precompute at the beginning of updates – 2. Update – 3. Update the i-th row of GW and DW in O(k) time

*ir irW W s← +

WG

irW

Page 33: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• Strategy – 4. Select the next variable-to-update to satisfy

• A brute force search will cost O(mk) • Proposed method:

– store the largest value and index for each row

– Only one element of q will be changed after updating – Takes O(k) time to recalculate qi – Takes O(logm) time to recalculate the largest value of q – The total cost for one update is O(k+logm)

Page 34: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

( log )O k m+

( )O k

( )O k

( )O k

(log )O m

Page 35: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• Note that – Maintain GW in O(k) time – Maintain GH in O(kn)

• Because each element of W is changed, the whole matrix GH will be changed

– Restrict to either W or H for a sequence

Page 36: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• Stop condition – At the beginning of updates to W, store

– Iteratively choose variables to update to meet

• Note that it can be achieved in a finite number of

iterations because f(W, H) is lower bounded, the minimum for f(W,H) with fixed H is achievable.

Page 37: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• A more efficient row-based variable selection – When k<<m, the term will cost dominately – Row-based selection

• Changes in the i-th row of DW will not affect the other rows

• Iteratively update variables in the i-th row until meeting

– Note that choose the largest value in one row costs O(k), cheaper than O(logm)

• Then update the other rows. • Taking O(k) time totally for each variable update.

log m

Page 38: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the
Page 39: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Variable Selection Strategy

• To get the amortized cost per coordinate update, divide the numbers by t

Page 40: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Method with Variable Selection

• Coordinate Gradient Method • Variable Selection Strategy • CGD for KL-Divergence • Convergence Analysis • Experiment Result

Page 41: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

• Apply coordinate descent for solving NMF with KL-divergence – Consider one-variable sub-problem

– Unlike least squares NMF, it has no closed form solution

Page 42: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

• The method in FastHals – Solve a different problem to approximate it – Have close form solution – May converge to a different final solution.

Page 43: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

• Propose to solve it with Newton’s method – Where

– Takes O(n) time for summation

Page 44: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

• Note that the case of – For , then for all positive values

ignore those entries. – For , the Newton direction will be

infinity, thus, reset s so that is a small positive value and restart the Newton method

0, ( ) 0ij ijV WH= =

0ijV = log(( ) ) 0ij ijV WH = ( )ijWH

( ) 0ij ijWH sH+ =

irW s+

Page 45: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

– Theorem 1 shows that Newton method for the

special objective function converges without line search

Page 46: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

• Computational Complexity – To maintain the gradient similar to least squares

– The complexity is

– It is expensive compared to the time cost for updating one variable. DO NOT maintain gradient!

– Adopt Cyclic Coordinate Descent, taking for each coordinate update

( )O n

( )O k

( )O nk

( )O n

( )O nd

Page 47: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Coordinate Descent Method for NMF with KL-Divergence

Page 48: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Method with Variable Selection

• Coordinate Gradient Method • Variable Selection Strategy • CGD for KL-Divergence • Convergence Analysis • Experiment Result

Page 49: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Convergence Property

• For least squares

Method Convergence

Multiplicative Not guarantee converge to a stationary point

Gradient Descent Method Lack convergent theory to support this method

ANLS (with exact solution) Any limit point is a stationary point

GCD Any limit point is a stationary point

Page 50: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Convergence Property

• For KL-Divergence

Page 51: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Fast Coordinate Descent Method with Variable Selection

• Coordinate Gradient Method • Variable Selection Strategy • CGD for KL-Divergence • Convergence Analysis • Experiment Result

Page 52: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Experiment Result

• Stopping condition – Adopt projected gradient as stopping condition

– According to KKT, is a stationary point if and only if . Use it to measure how close to stationary point

Page 53: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Experiment Result

• Least square NMF on dense data – FLOP: num of floating point operations

Page 54: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Experiment Result

• KL NMF on dense data

Page 55: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Experiment Result

• Objective value reduced on sparse data

Page 56: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Experiment Result

• Projected gradient on sparse data

Page 57: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Reference • Hsieh, Cho-Jui, and Inderjit S. Dhillon. "Fast coordinate

descent methods with variable selection for non-negative matrix factorization." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.

• Lin, Chih-Jen. "Projected gradient methods for nonnegative matrix factorization." Neural computation 19.10 (2007): 2756-2779.

• Berry, Michael W., et al. "Algorithms and applications for approximate nonnegative matrix factorization." Computational statistics & data analysis 52.1 (2007): 155-173.

Page 58: Fast Coordinate Descent Methods with Variable Selection ...ranger.uta.edu/~heng/CSE6389_15_slides/Fast... · Coordinate Descent Method for NMF with KL-Divergence • Note that the

Thank you