Parallel Numerical Algorithms
Chapter 6 – Structured and Low Rank Matrices
Section 6.3 – Numerical Optimization

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign

CS 554 / CSE 512

Outline

1 Alternating Least Squares
    Quadratic Optimization
    Parallel ALS

2 Coordinate Descent
    Coordinate Descent
    Cyclic Coordinate Descent

3 Gradient Descent
    Gradient Descent
    Stochastic Gradient Descent
    Parallel SGD

4 Nonlinear Optimization
    Nonlinear Equations
    Optimization

Quadratic Optimization: Matrix Completion

Given a subset of entries

\Omega \subseteq \{1, \dots, m\} \times \{1, \dots, n\}

of the entries of matrix A ∈ R^{m×n}, seek the rank-k approximation

\operatorname{argmin}_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \; \sum_{(i,j) \in \Omega} \Big( \underbrace{a_{ij} - \sum_{l} w_{il} h_{jl}}_{(A - WH^T)_{ij}} \Big)^2 + \lambda \big( \|W\|_F + \|H\|_F \big)

Problems of this type are studied in sparse approximation

Ω may be a randomly selected sample subset

Methods for this problem are typical of numerical optimization and machine learning
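As a concrete illustration, here is a minimal NumPy sketch (all names are illustrative; Ω is assumed to be given as a sequence of (i, j) index pairs) that evaluates this regularized objective for candidate factors W and H:

```python
import numpy as np

def completion_objective(A, W, H, omega, lam):
    """Evaluate the regularized matrix-completion objective over the sample set omega.

    A     : m x n array (only entries listed in omega are used)
    W, H  : m x k and n x k factor matrices
    omega : sequence of (i, j) index pairs of observed entries
    lam   : regularization parameter lambda
    """
    rows, cols = zip(*omega)
    rows, cols = np.asarray(rows), np.asarray(cols)
    # Residuals r_ij = a_ij - w_i . h_j over the observed entries only
    residuals = A[rows, cols] - np.einsum("ik,ik->i", W[rows], H[cols])
    return np.sum(residuals**2) + lam * (np.linalg.norm(W, "fro") + np.linalg.norm(H, "fro"))
```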

Alternating Least Squares

Alternating least squares (ALS) fixes W and solves for H, then vice versa, repeating until convergence

Each step improves the approximation; convergence to a minimum is expected given a satisfactory starting guess

With H fixed, we have a quadratic optimization problem

\operatorname{argmin}_{W \in \mathbb{R}^{m \times k}} \; \sum_{(i,j) \in \Omega} \Big( a_{ij} - \sum_{l} w_{il} h_{jl} \Big)^2 + \lambda \|W\|_F

The optimization problem is independent for each row of W

Letting w_i = w_{i\star}, h_i = h_{i\star}, \Omega_i = \{ j : (i,j) \in \Omega \}, seek

\operatorname{argmin}_{w_i \in \mathbb{R}^{k}} \; \sum_{j \in \Omega_i} \big( a_{ij} - w_i h_j^T \big)^2 + \lambda \|w_i\|^2

ALS: Quadratic Optimization

Seek the minimizer w_i of the quadratic vector equation

f(w_i) = \sum_{j \in \Omega_i} \big( a_{ij} - w_i h_j^T \big)^2 + \lambda \|w_i\|^2

Differentiating with respect to w_i gives

\frac{\partial f(w_i)}{\partial w_i} = 2 \sum_{j \in \Omega_i} h_j^T \big( w_i h_j^T - a_{ij} \big) + 2 \lambda w_i = 0

Rotating w_i h_j^T = h_j w_i^T and defining G^{(i)} = \sum_{j \in \Omega_i} h_j^T h_j, we obtain

\big( G^{(i)} + \lambda I \big) w_i^T = \sum_{j \in \Omega_i} h_j^T a_{ij}

which is a k × k symmetric linear system of equations
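The resulting per-row update is easy to state in code. Below is a minimal NumPy sketch (illustrative names; H is assumed dense and Ω_i given as an index array) that forms G^{(i)} and solves the k × k system for one row w_i:

```python
import numpy as np

def als_row_update(a_i, H, omega_i, lam):
    """Solve (G^(i) + lam*I) w_i^T = sum_{j in omega_i} h_j^T a_ij for row i of W.

    a_i     : length-n array holding row i of A (only entries in omega_i are used)
    H       : n x k factor matrix
    omega_i : indices j with (i, j) observed
    lam     : regularization parameter lambda
    """
    H_i = H[omega_i]                      # |Omega_i| x k, the rows h_j needed for this update
    G_i = H_i.T @ H_i                     # k x k Gram matrix G^(i)
    rhs = H_i.T @ a_i[omega_i]            # k-vector: sum_j h_j^T a_ij
    k = H.shape[1]
    return np.linalg.solve(G_i + lam * np.eye(k), rhs)
```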

ALS: Iteration Cost

For updating each w_i, ALS is dominated in cost by two steps

1 Form G^{(i)} = \sum_{j \in \Omega_i} h_j^T h_j
    dense matrix-matrix product
    O(|Ω_i| k^2) work
    logarithmic depth

2 Solve the linear system with G^{(i)} + λI
    dense symmetric k × k linear solve
    O(k^3) work
    typically O(k) depth

These updates can be done for all m rows of W independently (a full sweep is sketched below)
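For completeness, a half-sweep that updates every row of W might look like the following sketch (illustrative names; omega_rows[i] is assumed to hold the index array Ω_i; the update of H is symmetric, with the roles of W and H and of row/column index sets exchanged):

```python
import numpy as np

def als_update_W(A, H, omega_rows, lam):
    """One ALS half-sweep: update every row of W with H held fixed.

    Each loop iteration is an independent subproblem, which is what the
    parallel variants on the following slides exploit.
    """
    m, k = A.shape[0], H.shape[1]
    W = np.zeros((m, k))
    for i in range(m):
        H_i = H[omega_rows[i]]                               # rows h_j with (i, j) observed
        G_i = H_i.T @ H_i                                    # step 1: Gram matrix, O(|Omega_i| k^2)
        rhs = H_i.T @ A[i, omega_rows[i]]
        W[i] = np.linalg.solve(G_i + lam * np.eye(k), rhs)   # step 2: k x k solve, O(k^3)
    return W
```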

Parallel ALS

Let each task optimize a row w_i of W

Need to compute G^{(i)} for each task

A specific subset of the rows of H is needed for each G^{(i)}

Task execution is embarrassingly parallel if all of H is stored on each processor
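If H fits in every worker's memory, the half-sweep parallelizes directly over rows. A minimal sketch using Python's multiprocessing, assuming the same illustrative data layout as above (a production code would more likely use MPI or threaded BLAS):

```python
import numpy as np
from multiprocessing import Pool

def _row_solve(args):
    a_i, H, omega_i, lam = args
    H_i = H[omega_i]
    k = H.shape[1]
    return np.linalg.solve(H_i.T @ H_i + lam * np.eye(k), H_i.T @ a_i[omega_i])

def parallel_als_update_W(A, H, omega_rows, lam, nprocs=4):
    """Embarrassingly parallel ALS half-sweep: H is replicated in every worker."""
    tasks = [(A[i], H, omega_rows[i], lam) for i in range(A.shape[0])]
    with Pool(nprocs) as pool:
        rows = pool.map(_row_solve, tasks)   # each row of W solved independently
    return np.vstack(rows)
```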

Memory-Constrained Parallel ALS

May not have enough memory to replicate H on all processors

Communication is then required, and the communication pattern is data-dependent

Could rotate the rows of H along a ring of processors

Each processor computes contributions to the G^{(i)} it owns

Requires Θ(p) latency cost for each iteration of ALS
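One way to realize the ring rotation is sketched below with mpi4py (a schematic only, not the course's reference implementation; names such as H_block and accumulate_gram are illustrative). Each of the p processors owns a block of rows of H and, over p steps, accumulates the partial Gram matrices G^{(i)} for the rows of W it owns:

```python
from mpi4py import MPI

def ring_accumulate_grams(H_block, accumulate_gram, comm=MPI.COMM_WORLD):
    """Rotate row-blocks of H around a ring of p processors.

    H_block         : this processor's block of rows of H
    accumulate_gram : user-supplied callback that folds the received block's
                      h_j^T h_j contributions into the locally owned G^(i)
    """
    p, rank = comm.Get_size(), comm.Get_rank()
    left, right = (rank - 1) % p, (rank + 1) % p
    block = H_block
    for step in range(p):
        # The block currently held originated on rank (rank - step) % p,
        # which identifies the global rows of H it contains.
        accumulate_gram(block, owner=(rank - step) % p)
        # Pass the block to the right neighbor, receive from the left:
        # Theta(p) messages per ALS sweep, as noted above.
        block = comm.sendrecv(block, dest=right, source=left)
```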

Updating a Single Variable

Rather than solving for whole rows w_i, solve for individual elements of W; recall

\operatorname{argmin}_{W \in \mathbb{R}^{m \times k}} \; \sum_{(i,j) \in \Omega} \Big( a_{ij} - \sum_{l} w_{il} h_{jl} \Big)^2 + \lambda \|W\|_F

Coordinate descent finds the best replacement µ for w_{it}

\mu = \operatorname{argmin}_{\mu} \sum_{j \in \Omega_i} \Big( a_{ij} - \mu h_{jt} - \sum_{l \neq t} w_{il} h_{jl} \Big)^2 + \lambda \mu^2

The solution is given by

\mu = \frac{\sum_{j \in \Omega_i} h_{jt} \big( a_{ij} - \sum_{l \neq t} w_{il} h_{jl} \big)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}
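A direct NumPy transcription of this closed-form single-coordinate update (illustrative; it costs O(|Ω_i| k) because the inner sum over l is recomputed each time):

```python
import numpy as np

def coordinate_update(a_i, W_i, H, omega_i, t, lam):
    """Best replacement mu for w_it, with all other variables held fixed.

    a_i : row i of A (length n);  W_i : current row i of W (length k)
    H   : n x k factor matrix;    omega_i : observed column indices in row i
    """
    H_i = H[omega_i]                                  # |Omega_i| x k
    # Prediction excluding coordinate t: sum_{l != t} w_il h_jl
    partial = H_i @ W_i - H_i[:, t] * W_i[t]
    numerator = np.dot(H_i[:, t], a_i[omega_i] - partial)
    denominator = lam + np.dot(H_i[:, t], H_i[:, t])
    return numerator / denominator
```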

Coordinate Descent

For all (i, j) ∈ Ω compute the elements r_{ij} of

R = A - WH^T

so that we can optimize via

\mu = \frac{\sum_{j \in \Omega_i} h_{jt} \big( a_{ij} - \sum_{l \neq t} w_{il} h_{jl} \big)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}
    = \frac{\sum_{j \in \Omega_i} h_{jt} \big( r_{ij} + w_{it} h_{jt} \big)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}

after which we can update R via

r_{ij} \leftarrow r_{ij} - (\mu - w_{it}) h_{jt} \quad \forall j \in \Omega_i

both using O(|Ω_i|) operations
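Maintaining the sparse residual R makes the same update O(|Ω_i|). A sketch (illustrative names; r_i holds the residuals r_ij for j ∈ Ω_i, stored in the same order as omega_i):

```python
import numpy as np

def coordinate_update_residual(r_i, W_i, H, omega_i, t, lam):
    """O(|Omega_i|) coordinate update of w_it using cached residuals r_ij.

    r_i is modified in place so it stays equal to a_ij - w_i . h_j for j in omega_i.
    Returns the new value mu of w_it.
    """
    h_t = H[omega_i, t]                                  # the entries h_jt that matter
    mu = np.dot(h_t, r_i + W_i[t] * h_t) / (lam + np.dot(h_t, h_t))
    r_i -= (mu - W_i[t]) * h_t                           # keep residuals consistent
    return mu
```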

Cyclic Coordinate Descent (CCD)

Updating w_i costs O(|Ω_i| k) operations with coordinate descent rather than O(|Ω_i| k^2 + k^3) operations with ALS

By solving for all of w_i at once, ALS obtains a more accurate solution than coordinate descent

Coordinate descent can use different update orderings:

    Cyclic coordinate descent (CCD) updates all columns of W, then all columns of H (ALS-like ordering)

    CCD++ alternates between columns of W and columns of H

All entries within a column can be updated concurrently
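To make the orderings concrete, a CCD++-style sweep might look like the following sketch (illustrative; it applies the residual-based single-coordinate update above column by column, alternating between W and H, with the observed entries given by a boolean mask):

```python
import numpy as np

def ccdpp_sweep(A, W, H, mask, lam):
    """One CCD++ sweep: for each t, update column t of W, then column t of H.

    mask is a boolean m x n array marking the observed entries Omega.
    W and H are updated in place; R = A - W H^T is maintained on Omega throughout.
    """
    R = np.where(mask, A - W @ H.T, 0.0)
    k = W.shape[1]
    for t in range(k):
        # Update column t of W: all rows are independent (parallelizable)
        for i in range(A.shape[0]):
            js = np.nonzero(mask[i])[0]
            if js.size == 0:
                continue
            h_t = H[js, t]
            mu = np.dot(h_t, R[i, js] + W[i, t] * h_t) / (lam + np.dot(h_t, h_t))
            R[i, js] -= (mu - W[i, t]) * h_t
            W[i, t] = mu
        # Update column t of H symmetrically: all columns of A are independent
        for j in range(A.shape[1]):
            iis = np.nonzero(mask[:, j])[0]
            if iis.size == 0:
                continue
            w_t = W[iis, t]
            nu = np.dot(w_t, R[iis, j] + H[j, t] * w_t) / (lam + np.dot(w_t, w_t))
            R[iis, j] -= (nu - H[j, t]) * w_t
            H[j, t] = nu
    return W, H
```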

Parallel CCD++

Yu, Hsieh, Si, and Dhillon (2012) propose using a row-blocked layout of W and H

They keep the corresponding m/p rows and n/p columns of A and R on each processor (using twice the minimal amount of memory)

Every column update in CCD++ is then fully parallelized, but an allgather of each column is required to update R

The complexity of updating all of W and all of H is then

T_p(m, n, k) = \Theta\big( k\, T_p^{\mathrm{allgather}}(m + n) + \gamma\, Q_1(m, n, k)/p \big)
             = O\big( \alpha k \log p + \beta (m + n) k + \gamma |\Omega| k / p \big)
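The communication step is simply an allgather of the freshly updated column pieces. A schematic with mpi4py (illustrative decomposition, not the reference implementation):

```python
import numpy as np
from mpi4py import MPI

def gather_column(col_local, comm=MPI.COMM_WORLD):
    """Allgather the locally updated piece of a factor column.

    Each of the p processors updates its m/p (or n/p) entries of column t of
    W (or H) independently, then every processor needs the full column to
    update its local rows/columns of R.
    """
    pieces = comm.allgather(col_local)   # O(alpha log p + beta * column length) per column
    return np.concatenate(pieces)
```

Doing this once for each of the k columns of W and of H per sweep accounts for the αk log p + β(m + n)k communication terms above.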

Gradient-Based Update

ALS minimizes over w_i exactly; gradient descent methods only improve it

Recall that we seek to minimize

f(w_i) = \sum_{j \in \Omega_i} \big( a_{ij} - w_i h_j^T \big)^2 + \lambda \|w_i\|^2

and use the partial derivative

\frac{\partial f(w_i)}{\partial w_i} = 2 \sum_{j \in \Omega_i} h_j^T \big( w_i h_j^T - a_{ij} \big) + 2 \lambda w_i = 2 \Big( \lambda w_i - \sum_{j \in \Omega_i} r_{ij} h_j \Big)

The gradient descent method updates

w_i \leftarrow w_i - \eta \, \frac{\partial f(w_i)}{\partial w_i}

where the parameter η is the step size
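In code, one gradient step on row w_i is just a scaled residual accumulation (a sketch with illustrative names; r_i again holds r_ij for j ∈ Ω_i in the same order as omega_i):

```python
import numpy as np

def gradient_step_row(W_i, H, r_i, omega_i, lam, eta):
    """One gradient-descent step on row w_i of W, with H and residuals fixed.

    Implements w_i <- w_i - eta * 2 * (lam * w_i - sum_j r_ij h_j).
    """
    grad = 2.0 * (lam * W_i - H[omega_i].T @ r_i)
    return W_i - eta * grad
```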

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) performs fine-grained updates based on a single component of the gradient

Again, the full gradient is

\frac{\partial f(w_i)}{\partial w_i} = 2 \Big( \lambda w_i - \sum_{j \in \Omega_i} r_{ij} h_j \Big) = 2 \sum_{j \in \Omega_i} \big( \lambda w_i / |\Omega_i| - r_{ij} h_j \big)

SGD selects a random (i, j) ∈ Ω and updates w_i using h_j

w_i \leftarrow w_i - \eta \big( \lambda w_i / |\Omega_i| - r_{ij} h_j \big)

SGD then updates r_{ij} = a_{ij} - w_i h_j^T

Each update costs O(k) operations
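A single SGD update, as a sketch (illustrative; A and R are kept dense for simplicity, with R meaningful only at the observed entries):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_update(A, W, H, R, omega, omega_row_sizes, lam, eta):
    """One stochastic update: pick a random observed (i, j) and adjust w_i.

    omega           : array of observed (i, j) pairs
    omega_row_sizes : omega_row_sizes[i] = |Omega_i|, used to split the
                      regularization term across the samples of row i
    """
    i, j = omega[rng.integers(len(omega))]
    W[i] -= eta * (lam * W[i] / omega_row_sizes[i] - R[i, j] * H[j])   # O(k) work
    R[i, j] = A[i, j] - W[i] @ H[j]                                    # refresh the residual
    return i, j
```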

Asynchronous SGD

Parallelizing SGD is easy aside from ensuring that concurrent updates do not conflict

Asynchronous shared-memory implementations of SGD are popular and achieve high performance

For a sufficiently small step size, inconsistencies among updates (e.g., duplication) are not problematic statistically

Asynchrony can, however, slow down convergence

Blocked SGD

Distributed blocked SGD introduces further considerations

Associate a task with the updates on a block

Can define a p × p grid of blocks, each of dimension m/p × n/p

Blocks along a diagonal, superdiagonal, or subdiagonal touch disjoint rows of W and columns of H, so p such tasks can execute concurrently

Assuming Θ(|Ω|/p^2) updates are performed on each block, the execution time for |Ω| updates is

T_p(m, n, k) = \Theta\big( \alpha p \log p + \beta \min(m, n)\, k + \gamma |\Omega| k / p \big)
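The scheduling idea is easiest to see in code: in round s, processor q works on block (q, (q + s) mod p), so no two active blocks share a block-row or block-column. A small sketch of the schedule (illustrative; the block update itself would be an SGD loop like the one above restricted to the Ω entries inside the block):

```python
def blocked_sgd_schedule(p):
    """Yield, for each of the p rounds, the set of blocks updated concurrently.

    Round s assigns block (q, (q + s) % p) to processor q; the p blocks of a
    round lie on a wrapped diagonal, so they share no rows of W or columns
    of H and can be updated independently.
    """
    for s in range(p):
        yield [(q, (q + s) % p) for q in range(p)]

# Example: with p = 3 the rounds are
# [(0,0),(1,1),(2,2)], [(0,1),(1,2),(2,0)], [(0,2),(1,0),(2,1)]
for round_blocks in blocked_sgd_schedule(3):
    print(round_blocks)
```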

Nonlinear Equations

Potential sources of parallelism in solving a nonlinear equation f(x) = 0 include

Evaluation of the function f and its derivatives in parallel (illustrated in the sketch below)

Parallel implementation of linear algebra computations (e.g., solving the linear system in Newton-like methods)

Simultaneous exploration of different regions via multiple starting points (e.g., if many solutions are sought or convergence is difficult to achieve)
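As one concrete instance of the first two items, the columns of a finite-difference Jacobian for Newton's method can be evaluated independently. A sketch (illustrative; f must be a module-level function so it can be shipped to workers, and a real code would reuse function evaluations and use a parallel linear solver for the Newton step):

```python
import numpy as np
from multiprocessing import Pool

def _jac_column(args):
    f, x, j, h = args
    e = np.zeros_like(x)
    e[j] = h
    return (f(x + e) - f(x - e)) / (2 * h)     # central difference: column j of J(x)

def newton_step(f, x, h=1e-6, nprocs=4):
    """One Newton step x <- x - J(x)^{-1} f(x), with the Jacobian columns
    evaluated in parallel by finite differences."""
    with Pool(nprocs) as pool:
        cols = pool.map(_jac_column, [(f, x, j, h) for j in range(x.size)])
    J = np.column_stack(cols)
    return x - np.linalg.solve(J, f(x))
```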

Optimization

Sources of parallelism in optimization problems include

Evaluation of the objective and constraint functions and their derivatives in parallel

Parallel implementation of linear algebra computations (e.g., solving the linear system in Newton-like methods)

Simultaneous exploration of different regions via multiple starting points (e.g., if a global optimum is sought or convergence is difficult to achieve; see the multi-start sketch below)

Multi-directional searches in direct search methods

Decomposition methods for structured problems, such as linear, quadratic, or separable programming
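A minimal multi-start sketch along the lines of the third item (illustrative only; it uses SciPy's general-purpose minimizer on the Rosenbrock function as a stand-in objective):

```python
import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize, rosen

def _solve_from(x0):
    res = minimize(rosen, x0, method="BFGS")   # independent local minimization
    return res.fun, res.x

def multistart_minimize(n_starts=8, dim=4, nprocs=4, seed=0):
    """Run independent local optimizations from random starting points in
    parallel and keep the best local minimum found."""
    rng = np.random.default_rng(seed)
    starts = [rng.uniform(-2, 2, dim) for _ in range(n_starts)]
    with Pool(nprocs) as pool:
        results = pool.map(_solve_from, starts)
    return min(results, key=lambda r: r[0])
```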

References

Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization." Foundations of Computational Mathematics 9.6 (2009): 717.

Jain, Prateek, Praneeth Netrapalli, and Sujay Sanghavi. "Low-rank matrix completion using alternating minimization." Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, ACM, 2013.

Yu, H.-F., Hsieh, C.-J., Si, S., and Dhillon, I. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." 2012 IEEE 12th International Conference on Data Mining, pp. 765-774, 2012.

References

Recht, Benjamin, Christopher Ré, Stephen Wright, and Feng Niu. "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems, pp. 693-701, 2011.

Gemulla, Rainer, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 69-77, ACM, 2011.

Karlsson, Lars, Daniel Kressner, and André Uschmajew. "Parallel algorithms for tensor completion in the CP format." Parallel Computing 57 (2016): 222-234.

References – Parallel Optimization

J. E. Dennis and V. Torczon, "Direct search methods on parallel machines," SIAM J. Optimization 1:448-474, 1991.

J. E. Dennis and Z. Wu, "Parallel continuous optimization," in J. Dongarra et al., eds., Sourcebook of Parallel Computing, pp. 649-670, Morgan Kaufmann, 2003.

F. A. Lootsma and K. M. Ragsdell, "State-of-the-art in parallel nonlinear optimization," Parallel Computing 6:133-155, 1988.

R. Schnabel, "Sequential and parallel methods for unconstrained optimization," in M. Iri and K. Tanabe, eds., Mathematical Programming: Recent Developments and Applications, pp. 227-261, Kluwer, 1989.

S. A. Zenios, "Parallel numerical optimization: current trends and an annotated bibliography," ORSA J. Comput. 1:20-43, 1989.