Fast Coordinate Descent Methods with Variable Selection for NMF

Cho-Jui Hsieh and Inderjit S. Dhillon, published at KDD 2011

Hongchang Gao



Outline

• Definition
• Multiplicative Update Method
• Gradient Descent Method
• Alternating Non-negative Least Squares
• Fast Coordinate Descent Methods

Definition

• Given a nonnegative matrix $V \in \mathbb{R}^{m \times n}$, find nonnegative matrices $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$ to

$$\min_{W, H \ge 0} f(W, H) = \frac{1}{2} \| V - WH \|_F^2$$

  – The partial derivatives with respect to W and H:

$$\frac{\partial f}{\partial W} = W H H^T - V H^T, \qquad \frac{\partial f}{\partial H} = W^T W H - W^T V$$
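A minimal NumPy sketch of the objective and its two partial derivatives may help make the definition concrete. The function names, matrix sizes, and random seed are illustrative assumptions, not part of the slides.

```python
import numpy as np

def nmf_objective(V, W, H):
    """Least-squares NMF objective f(W, H) = 0.5 * ||V - WH||_F^2."""
    R = V - W @ H
    return 0.5 * np.sum(R * R)

def nmf_gradients(V, W, H):
    """Return (df/dW, df/dH) = (W H H^T - V H^T, W^T W H - W^T V)."""
    grad_W = W @ (H @ H.T) - V @ H.T
    grad_H = (W.T @ W) @ H - W.T @ V
    return grad_W, grad_H

# Toy data: sizes and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
V = rng.random((6, 5))
W = rng.random((6, 3))
H = rng.random((3, 5))
grad_W, grad_H = nmf_gradients(V, W, H)
print(nmf_objective(V, W, H), grad_W.shape, grad_H.shape)
```

Each gradient has the same shape as the matrix it differentiates with respect to, which is what the entry-wise update rules on the following slides rely on.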


Multiplicative Update Method

• The most commonly used method
• Proposed by Lee and Seung (2001)
• The update rule:

$$W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}, \qquad H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$$

Multiplicative Update Method

• Arises from the gradient descent method

$$W_{ia} \leftarrow W_{ia} + \epsilon_{ia} \left[ (V H^T)_{ia} - (W H H^T)_{ia} \right]$$

  – where $\epsilon_{ia}$ is a small positive number.
• Set it as

$$\epsilon_{ia} = \frac{W_{ia}}{(W H H^T)_{ia}}$$

• Then

$$W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$$

Presenter notes: This method takes a step in the direction of the negative gradient.
Multiplicative Update Method

• Algorithm
  – The $10^{-9}$ in each update rule is added to avoid division by zero
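The algorithm listing itself is not reproduced here, so the following is only a sketch of how the multiplicative updates with the $10^{-9}$ safeguard could look in NumPy. The update order (H first, then W), the iteration count, and the helper names are assumptions.

```python
import numpy as np

def multiplicative_update(V, W, H, n_iter=200, eps=1e-9):
    """Lee-Seung style multiplicative updates for 0.5 * ||V - WH||_F^2."""
    W, H = W.copy(), H.copy()
    for _ in range(n_iter):
        # eps in each denominator avoids division by zero.
        H *= (W.T @ V) / ((W.T @ W) @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def objective(V, W, H):
    return 0.5 * np.linalg.norm(V - W @ H) ** 2

rng = np.random.default_rng(0)  # arbitrary seed and sizes
V = rng.random((8, 6))
W0, H0 = rng.random((8, 3)), rng.random((3, 6))
W1, H1 = multiplicative_update(V, W0, H0)
print(objective(V, W0, H0), objective(V, W1, H1))
```

Because every factor in the update is nonnegative, W and H stay nonnegative without any explicit projection.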

Multiplicative Update Method

• Property 1
  – If $W^{init}$ and $H^{init}$ are strictly positive, these matrices remain positive throughout the iterations.
• Property 2
  – If $\{W^k, H^k\} \to \{W^*, H^*\}$ and $W^* > 0$, $H^* > 0$, then

$$\frac{\partial f}{\partial W}(W^*, H^*) = 0, \qquad \frac{\partial f}{\partial H}(W^*, H^*) = 0$$

Multiplicative Update Method

• Proof of Property 2
  – The update rule (./ and .* denote element-wise division and multiplication)

$$H = H + \left[ H \;./\; (W^T W H) \right] \;.\!*\; \left[ W^T (V - WH) \right]$$

  – For the limit point (its entries are strictly positive, so the factor $[H]_{ij} / [W^T W H]_{ij}$ is nonzero)

$$\frac{[H]_{ij}}{[W^T W H]_{ij}} \left( [W^T V]_{ij} - [W^T W H]_{ij} \right) = 0 \;\Rightarrow\; [W^T V]_{ij} - [W^T W H]_{ij} = 0 \;\Rightarrow\; \left[ \frac{\partial f}{\partial H} \right]_{ij} = 0$$

Multiplicative Update Method

• From the two properties, the KKT conditions are satisfied at a strictly positive limit point, which means the limit point is a stationary point.
• Otherwise (when the limit point has zero entries), we cannot determine whether it is a stationary point.

Multiplicative Update Method

• Conclusion:
  – The sequence is not guaranteed to converge to a stationary point
  – When it does converge, it is notoriously slow
  – The computational cost of each iteration is $O(mnk)$
  – Once an element in W or H becomes 0, it must remain 0


Gradient Descent Method

• The update rule:
  – The multiplicative update method can be considered as a gradient method
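For the least-squares objective $f$ from the Definition slide, a plain gradient step with step sizes $\epsilon_W$ and $\epsilon_H$ would take the form below. This is the textbook rule, written out as an assumption about what the slide displayed.

```latex
W \leftarrow W - \epsilon_W \frac{\partial f}{\partial W}
  = W - \epsilon_W \left( W H H^T - V H^T \right),
\qquad
H \leftarrow H - \epsilon_H \frac{\partial f}{\partial H}
  = H - \epsilon_H \left( W^T W H - W^T V \right)
```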

Gradient Descent Method

• How to choose the step sizes $\epsilon_H$, $\epsilon_W$?
  – Initialize them as 1, then multiply them by ½ at each iteration. This cannot guarantee non-negativity.
  – Projected gradient method.

Gradient Descent Method

• Main idea of Projected Gradient Method
  – Given such a problem
  – Update rule:
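As a rough illustration of the projected gradient idea, the sketch below takes one gradient step on H with W fixed and then projects onto the nonnegative orthant with an element-wise max. The fixed step size 1/L, with L the largest eigenvalue of $W^T W$, is an assumption chosen for simplicity; it is not the line search used in Lin (2007).

```python
import numpy as np

def projected_gradient_step_H(V, W, H):
    """One step H <- max(0, H - step * df/dH) with W held fixed."""
    WtW = W.T @ W
    grad_H = WtW @ H - W.T @ V
    # Assumed step size: 1 / (largest eigenvalue of W^T W).
    step = 1.0 / np.linalg.norm(WtW, 2)
    return np.maximum(0.0, H - step * grad_H)

rng = np.random.default_rng(1)  # arbitrary seed and sizes
V = rng.random((7, 5))
W = rng.random((7, 3))
H = rng.random((3, 5))
H_new = projected_gradient_step_H(V, W, H)

f_old = 0.5 * np.linalg.norm(V - W @ H) ** 2
f_new = 0.5 * np.linalg.norm(V - W @ H_new) ** 2
print(f_old, f_new)
```

The projection is what keeps H feasible; a plain gradient step with the same step size could produce negative entries.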

Gradient Descent Method

• Conclusion:
  – Without a careful choice of step size, it is difficult to guarantee non-negativity.
  – The projection makes it difficult to analyze the convergence.
  – Sensitive to the initialization


Alternating Non-negative Least Squares

• The objective is not jointly convex in W and H, but it is convex in W when H is fixed and in H when W is fixed.
• Alternately fix one matrix and improve the other; this is called Block Coordinate Descent.
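The alternating scheme can be sketched with SciPy's `scipy.optimize.nnls`, which solves one nonnegative least-squares problem exactly per call. Solving column by column, the helper names, and the number of outer iterations are assumptions made for illustration; a production ANLS solver would use a faster block method.

```python
import numpy as np
from scipy.optimize import nnls

def nnls_columns(A, B):
    """Solve min_{X >= 0} ||A X - B||_F column by column with scipy's NNLS."""
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

def anls(V, W, n_outer=20):
    """Alternate exact nonnegative least-squares solves for H and W."""
    for _ in range(n_outer):
        H = nnls_columns(W, V)          # W fixed: each column of H is an NNLS problem
        W = nnls_columns(H.T, V.T).T    # H fixed: each row of W is an NNLS problem
    return W, H

def objective(V, W, H):
    return 0.5 * np.linalg.norm(V - W @ H) ** 2

rng = np.random.default_rng(2)  # arbitrary seed and sizes
V = rng.random((8, 6))
W0, H0 = rng.random((8, 3)), rng.random((3, 6))
W1, H1 = anls(V, W0)
print(objective(V, W0, H0), objective(V, W1, H1))
```

Each block solve is a convex problem, so the objective cannot increase from one half-step to the next.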

Alternating Non-negative Least Squares

• Theorem
  – Any limit point of the sequence $\{W^k, H^k\}$ generated by Algorithm 2 is a stationary point.

Alternating Non-negative Least Squares

• Conclusion:
  – Has nice optimization properties.
  – It can be very fast if well implemented.


Fast Coordinate Descent Method with Variable Selection

• Contribution
  – Propose a variable selection scheme
  – Guarantee the convergence
  – Propose a cyclic coordinate method to solve NMF with KL-divergence

Fast Coordinate Descent Method with Variable Selection

• Coordinate Gradient Method
• Variable Selection Strategy
• CGD for KL-Divergence
• Convergence Analysis
• Experiment Result

Coordinate Descent Method

• Coordinate Descent Method
  – Updates one variable at a time until convergence.
  – More efficient than ANLS
    • ANLS needs to find an exact solution for each sub-problem to guarantee a stationary point

Coordinate Descent Method

• The update rule for W
  – where $E_{ir}$ is an $m \times k$ matrix with all elements zero except the (i, r) element, which equals one.
• It is equivalent to solving a one-variable subproblem:

Coordinate Descent Method

• Rewrite it as
  – It is a one-variable quadratic function with constraint
  – Has closed form solution:
  – where
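The one-variable problem can be worked out directly from the least-squares objective $f(W, H) = \frac{1}{2}\|V - WH\|_F^2$. The derivation below is a sketch; the names $g^W_{ir}$ and $G^W$ are notation assumed here for the one-variable function and the gradient with respect to W, and $s^*$ denotes the optimal step.

```latex
% Changing W_{ir} by s means replacing W with W + s E_{ir}:
g^W_{ir}(s) = f(W + s E_{ir}, H)
            = f(W, H) + G^W_{ir}\, s + \tfrac{1}{2} (H H^T)_{rr}\, s^2,
\qquad G^W = \frac{\partial f}{\partial W} = W H H^T - V H^T .

% Minimizing the quadratic subject to W_{ir} + s \ge 0 gives
s^* = \max\!\left( 0,\; W_{ir} - \frac{G^W_{ir}}{(H H^T)_{rr}} \right) - W_{ir} .
```

The max with 0 is the projection of the unconstrained minimizer onto the feasible set, which is why the update never makes $W_{ir}$ negative.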

Coordinate Descent Method

• Existing Method
  – FastHals is a coordinate descent method.
• It uses a cyclic coordinate descent method
  – It first updates all variables in W in cyclic order, and then updates variables in H.
• It may perform unneeded descent steps on unimportant variables.
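A generic cyclic pass over W can be sketched as follows. This is not the FastHals implementation; it is a plain cyclic coordinate descent that assumes the closed-form step $W_{ir} \leftarrow \max(0, W_{ir} - G^W_{ir} / (HH^T)_{rr})$ with $G^W = WHH^T - VH^T$, and the function name is hypothetical.

```python
import numpy as np

def cyclic_cd_sweep_W(V, W, H):
    """One cyclic pass over every W[i, r] with H fixed, using the closed-form step."""
    W = W.copy()
    HHt = H @ H.T
    VHt = V @ H.T
    m, k = W.shape
    for i in range(m):
        for r in range(k):
            if HHt[r, r] == 0.0:
                continue  # row r of H is all zero, so W[i, r] has no effect
            g = W[i] @ HHt[:, r] - VHt[i, r]   # current gradient entry, O(k)
            W[i, r] = max(0.0, W[i, r] - g / HHt[r, r])
    return W

rng = np.random.default_rng(3)  # arbitrary seed and sizes
V = rng.random((8, 6))
W = rng.random((8, 3))
H = rng.random((3, 6))
W_new = cyclic_cd_sweep_W(V, W, H)

f_old = 0.5 * np.linalg.norm(V - W @ H) ** 2
f_new = 0.5 * np.linalg.norm(V - W_new @ H) ** 2
print(f_old, f_new)
```

Every coordinate is visited exactly once per pass, including coordinates whose update barely changes the objective, which is the inefficiency that variable selection targets.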


Variable Selection Strategy

• Greedy Coordinate Descent (GCD)
  – Selects variables according to their importance
• Behavior of FastHals and GCD
  – Apparently, GCD focuses on nonzero variables
  – GCD reduces the objective value more efficiently

Variable Selection Strategy

• Update rules:
  – In the outer updates:
  – In the inner updates:

Variable Selection Strategy

• If $W_{ir}$ is selected to update
  – The optimal update is
  – The objective will be decreased by
    • where $D^W_{ir}$ measures how much the objective can be reduced by choosing $W_{ir}$
• Thus, according to $D^W$, choose the $W_{ir}$ which reduces the objective value the most
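To make the selection rule concrete, the sketch below computes, for every entry of W at once, the closed-form step and the resulting objective decrease, then picks the entry with the largest decrease. The formulas $s = \max(0, W - G / d) - W$ and $D = -G s - \frac{1}{2} d s^2$, with $G = WHH^T - VH^T$ and $d_r = (HH^T)_{rr}$, follow from the quadratic one-variable problem; the function and variable names are assumptions, and the diagonal of $HH^T$ is assumed strictly positive.

```python
import numpy as np

def greedy_candidates(V, W, H):
    """Return gradient G, optimal steps S, and objective decreases D for all W[i, r]."""
    HHt = H @ H.T
    d = np.diag(HHt)                       # assumed strictly positive
    G = W @ HHt - V @ H.T                  # gradient with respect to W
    S = np.maximum(0.0, W - G / d) - W     # optimal one-variable steps
    D = -G * S - 0.5 * d * S ** 2          # decrease obtained by each step
    return G, S, D

def objective(V, W, H):
    return 0.5 * np.linalg.norm(V - W @ H) ** 2

rng = np.random.default_rng(4)  # arbitrary seed and sizes
V = rng.random((8, 6))
W = rng.random((8, 3))
H = rng.random((3, 6))

G, S, D = greedy_candidates(V, W, H)
i, r = np.unravel_index(np.argmax(D), D.shape)   # most important variable
W_new = W.copy()
W_new[i, r] += S[i, r]
drop = objective(V, W, H) - objective(V, W_new, H)
print(i, r, D[i, r], drop)
```

The measured drop matches the predicted decrease for the selected entry, so the matrix D can be used to rank variables without evaluating the objective.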

Variable Selection Strategy

• Idea
  – Maintain $G^W$, $D^W$ to determine which $W_{ir}$ to update
  – Update them after updating each element
• Strategy
  – 1. Precompute $G^W$ at the beginning of updates
  – 2. Update $W_{ir} \leftarrow W_{ir} + s^*$
  – 3. Update the i-th row of $G^W$ and $D^W$ in O(k) time

Variable Selection Strategy

• Strategy
  – 4. Select the next variable-to-update to satisfy
    • A brute force search will cost O(mk)
    • Proposed method:
      – Store the largest value and index for each row
      – Only one element of q will be changed after updating
      – Takes O(k) time to recalculate $q_i$
      – Takes O(log m) time to recalculate the largest value of q
      – The total cost for one update is O(k + log m)

Variable Selection Strategy

[Algorithm listing; per-step cost annotations: O(k), O(k), O(k), O(log m); total per update: O(k + log m)]

Variable Selection Strategy

• Note that
  – Maintaining $G^W$ takes O(k) time
  – Maintaining $G^H$ takes O(kn) time
    • Because when an element of W is changed, the whole matrix $G^H$ changes
  – Restrict the updates to either W or H for a sequence of steps

Variable Selection Strategy

• Stop condition
  – At the beginning of updates to W, store
  – Iteratively choose variables to update to meet
    • Note that it can be achieved in a finite number of iterations because f(W, H) is lower bounded and the minimum of f(W, H) with fixed H is achievable.

Variable Selection Strategy

• A more efficient row-based variable selection
  – When k << m, the log m term will dominate the cost
  – Row-based selection
    • Changes in the i-th row of $D^W$ will not affect the other rows
    • Iteratively update variables in the i-th row until meeting
      – Note that choosing the largest value in one row costs O(k), cheaper than O(log m)
    • Then update the other rows.
    • Takes O(k) time in total for each variable update.
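The row-based scheme can be sketched as follows: precompute the gradient once, then for each row repeatedly pick the entry with the largest predicted decrease, update it, and refresh that row of the gradient in O(k). The stopping threshold `tol * p_init` (with `p_init` the largest predicted decrease before any update) and the `max_inner` cap are assumptions standing in for the condition on the slide, which is not legible here.

```python
import numpy as np

def gcd_row_sweep_W(V, W, H, tol=1e-3, max_inner=100):
    """Greedy coordinate descent over W, one row at a time, with H fixed."""
    W = W.copy()
    HHt = H @ H.T
    d = np.diag(HHt)                 # assumed strictly positive
    G = W @ HHt - V @ H.T            # gradient with respect to W, precomputed once

    def gains(w, g):
        s = np.maximum(0.0, w - g / d) - w       # optimal steps
        return s, -g * s - 0.5 * d * s * s       # predicted decreases

    p_init = gains(W, G)[1].max()    # largest decrease before any update
    for i in range(W.shape[0]):
        for _ in range(max_inner):
            s, D = gains(W[i], G[i])             # O(k)
            r = int(np.argmax(D))                # O(k)
            if D[r] <= tol * p_init:             # assumed stopping rule
                break
            W[i, r] += s[r]
            G[i] += s[r] * HHt[r]                # refresh row i of the gradient, O(k)
    return W

rng = np.random.default_rng(5)  # arbitrary seed and sizes
V = rng.random((10, 6))
W = rng.random((10, 4))
H = rng.random((4, 6))
W_new = gcd_row_sweep_W(V, W, H)

f_old = 0.5 * np.linalg.norm(V - W @ H) ** 2
f_new = 0.5 * np.linalg.norm(V - W_new @ H) ** 2
print(f_old, f_new)
```

Because an update to row i only touches row i of the gradient, no global priority structure over all m rows is needed, which is where the log m term disappears.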


Variable Selection Strategy

• To get the amortized cost per coordinate update, divide the numbers by t


Coordinate Descent Method for NMF with KL-Divergence

• Apply coordinate descent for solving NMF with KL-divergence
  – Consider the one-variable sub-problem
  – Unlike least squares NMF, it has no closed form solution

Coordinate Descent Method for NMF with KL-Divergence

• The method in FastHals
  – Solves a different problem to approximate it
  – Has a closed form solution
  – May converge to a different final solution.

Coordinate Descent Method for NMF with KL-Divergence

• Propose to solve it with Newton's method
  – where
  – Takes O(n) time for summation
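A sketch of the Newton iteration for one variable may help. It assumes the KL objective $\sum_{ij} V_{ij} \log(V_{ij} / (WH)_{ij}) - V_{ij} + (WH)_{ij}$, for which the one-variable function in s (with $W_{ir} \leftarrow W_{ir} + s$) has first derivative $\sum_j H_{rj} (1 - V_{ij} / ((WH)_{ij} + s H_{rj}))$ and second derivative $\sum_j V_{ij} H_{rj}^2 / ((WH)_{ij} + s H_{rj})^2$. All entries are assumed strictly positive and k ≥ 2, so the zero-valued special cases are not handled; the function names and iteration count are hypothetical.

```python
import numpy as np

def kl_divergence(V, WH):
    return np.sum(V * np.log(V / WH) - V + WH)

def kl_newton_update(V, W, H, i, r, n_newton=50):
    """Newton iterations (no line search) for the step s applied to W[i, r]."""
    wh = W[i] @ H            # row i of WH, length n
    h = H[r]
    v = V[i]
    s = 0.0
    for _ in range(n_newton):
        denom = wh + s * h
        g1 = np.sum(h * (1.0 - v / denom))          # first derivative, O(n)
        g2 = np.sum(v * h * h / (denom * denom))    # second derivative, O(n)
        if g2 == 0.0:
            break
        s = max(-W[i, r], s - g1 / g2)              # keep W[i, r] + s nonnegative
    return s

rng = np.random.default_rng(6)  # arbitrary seed and sizes; entries kept away from 0
V = rng.random((6, 8)) + 0.1
W = rng.random((6, 3)) + 0.1
H = rng.random((3, 8)) + 0.1

i, r = 2, 1
s = kl_newton_update(V, W, H, i, r)
W_new = W.copy()
W_new[i, r] += s
kl_old = kl_divergence(V, W @ H)
kl_new = kl_divergence(V, W_new @ H)
print(s, kl_old, kl_new)
```

Each Newton iteration sums over the n entries of one row, which is the O(n) summation cost mentioned above.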

Coordinate Descent Method for NMF with KL-Divergence

• Note the case of $V_{ij} = 0$ or $(WH)_{ij} = 0$
  – For $V_{ij} = 0$, $V_{ij} \log((WH)_{ij}) = 0$ for all positive values of $(WH)_{ij}$, so ignore those entries.
  – For $(WH)_{ij} + s H_{rj} = 0$, the Newton direction will be infinity; thus, reset s so that $W_{ir} + s$ is a small positive value and restart the Newton method

Coordinate Descent Method for NMF with KL-Divergence

  – Theorem 1 shows that Newton's method for this special objective function converges without line search

Coordinate Descent Method for NMF with KL-Divergence

• Computational Complexity
  – To maintain the gradient similarly to least squares (per-term cost annotations on the slide: O(n), O(k))
  – The complexity is O(nk)
  – It is expensive compared to the O(n) time cost for updating one variable. DO NOT maintain the gradient!
  – Adopt Cyclic Coordinate Descent, taking O(nd) for each coordinate update


Convergence Property

• For least squares

Method | Convergence
Multiplicative update | Not guaranteed to converge to a stationary point
Gradient descent method | Lacks convergence theory to support this method
ANLS (with exact solution) | Any limit point is a stationary point
GCD | Any limit point is a stationary point


Convergence Property

• For KL-Divergence


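Experiment Result

• Stopping condition
  – Adopt the projected gradient as the stopping condition
  – According to the KKT conditions, a point is a stationary point if and only if its projected gradient is zero. Use the norm of the projected gradient to measure how close the current point is to a stationary point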
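The projected gradient can be sketched for the least-squares objective as follows. The entry-wise definition used here (keep the gradient where the variable is positive, keep only its negative part where the variable is zero) follows the usual convention in Lin (2007); treating it as the exact measure used in the experiments is an assumption, as are the function names.

```python
import numpy as np

def projected_gradient(X, G):
    """Entry-wise: G where X > 0, min(0, G) where X == 0."""
    return np.where(X > 0, G, np.minimum(G, 0.0))

def projected_gradient_norm(V, W, H):
    """Frobenius norm of the projected gradient of 0.5 * ||V - WH||_F^2."""
    GW = W @ (H @ H.T) - V @ H.T
    GH = (W.T @ W) @ H - W.T @ V
    pW = projected_gradient(W, GW)
    pH = projected_gradient(H, GH)
    return np.sqrt(np.sum(pW ** 2) + np.sum(pH ** 2))

# Small check of the entry-wise rule.
example = projected_gradient(np.array([0.0, 1.0]), np.array([2.0, -3.0]))

# At an exact factorization the gradient vanishes, so the norm is about zero.
rng = np.random.default_rng(7)  # arbitrary seed and sizes
W = rng.random((6, 3))
H = rng.random((3, 5))
V_exact = W @ H
norm_at_solution = projected_gradient_norm(V_exact, W, H)
print(example, norm_at_solution)
```

A positive gradient at a zero entry is ignored because moving in the descent direction would leave the feasible set, which is exactly the KKT condition for a bound-constrained problem.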

Experiment Result

• Least squares NMF on dense data
  – FLOP: number of floating point operations


Experiment Result

• KL NMF on dense data


Experiment Result

• Objective value reduced on sparse data


Experiment Result

• Projected gradient on sparse data

References

• Hsieh, Cho-Jui, and Inderjit S. Dhillon. "Fast coordinate descent methods with variable selection for non-negative matrix factorization." Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011.
• Lin, Chih-Jen. "Projected gradient methods for nonnegative matrix factorization." Neural Computation 19.10 (2007): 2756-2779.
• Berry, Michael W., et al. "Algorithms and applications for approximate nonnegative matrix factorization." Computational Statistics & Data Analysis 52.1 (2007): 155-173.


Thank you