Peter Richtárik: Parallel Coordinate Descent Methods
Simons Institute for the Theory of Computing, Berkeley
Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013


TRANSCRIPT

Page 1:

Parallel Coordinate Descent Methods

Peter Richtárik

Simons Institute for the Theory of Computing, Berkeley
Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013

Page 2:

Page 3:

Randomized Coordinate Descent in 2D

Page 4:

2D Optimization

Contours of a function

Goal: find the minimizer

[Figure: contour plot of a 2D function]

Page 5:

Randomized Coordinate Descent in 2D

[Figure: contour plot; starting point with coordinate directions N, S, E, W]

Page 6:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterate 1]

Page 7:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterates 1–2]

Page 8:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterates 1–3]

Page 9:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterates 1–4]

Page 10:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterates 1–5]

Page 11:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterates 1–6]

Page 12:

Randomized Coordinate Descent in 2D

[Figure: contour plot; iterates 1–7, reaching the minimizer. SOLVED!]
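
(Illustrative sketch, not from the slides: a few iterations of randomized coordinate descent on an assumed 2D quadratic, with exact minimization along the chosen coordinate, mirroring the walkthrough above.)

```python
import numpy as np

# Assumed 2D quadratic: f(x) = 0.5 * x^T A x - b^T x, with A positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

rng = np.random.default_rng(0)
x = np.array([2.0, 2.0])            # assumed starting point

for k in range(1, 8):               # iterates 1..7, as in the figure
    i = rng.integers(2)             # pick a coordinate direction uniformly at random
    # exact minimization along coordinate i: solve d f / d x_i = 0
    x[i] = (b[i] - A[i, 1 - i] * x[1 - i]) / A[i, i]
    print(f"iterate {k}: coordinate {i}, x = {x}, f(x) = {f(x):.4f}")
```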

Page 13:

Convergence of Randomized Coordinate Descent

Strongly convex F
Smooth or ‘simple’ nonsmooth F
‘Difficult’ nonsmooth F

Focus on n (big data = big n)

Page 14:

Parallelization Dream

Serial: update one coordinate per iteration.
Parallel (what we WANT): update many coordinates at once and simply add up the individual updates.

Whether this works depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration.

What do we actually get?

Page 15:

How (Not) to Parallelize Coordinate Descent

Page 16:

“Naive” parallelization

Do the same thing as before, but for MORE or ALL coordinates, and ADD UP the updates.

Page 17:

Failure of naive parallelization

[Figure: from point 0, the two individually computed coordinate updates 1a and 1b]

Page 18:

Failure of naive parallelization

[Figure: adding updates 1a and 1b gives point 1]

Page 19:

Failure of naive parallelization

[Figure: from point 1, the individual updates 2a and 2b]

Page 20:

Failure of naive parallelization

[Figure: adding updates 2a and 2b gives point 2]

Page 21:

Failure of naive parallelization

[Figure: the iterates fail to approach the minimizer. OOPS!]
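
(Illustrative sketch, not from the slides: a tiny numerical example of the failure just pictured. The coupled quadratic below is an assumption chosen so that adding up the two individually optimal coordinate updates makes no progress.)

```python
import numpy as np

# Assumed coupled quadratic: f(x) = (x1 + x2 - 1)^2; its minimizers satisfy x1 + x2 = 1
def f(x):
    return (x[0] + x[1] - 1.0) ** 2

x = np.zeros(2)                      # point "0"
for k in range(1, 5):
    h = np.empty(2)
    h[0] = (1.0 - x[1]) - x[0]       # individually optimal update of x1 (x2 held fixed)
    h[1] = (1.0 - x[0]) - x[1]       # individually optimal update of x2 (x1 held fixed)
    x = x + h                        # "naive" parallelization: add up both updates
    print(f"iterate {k}: x = {x}, f(x) = {f(x)}")
# The iterates jump between (0, 0) and (1, 1): f stays at 1 and never decreases.
```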

Page 22:

Idea: averaging updates may help

[Figure: from point 0, averaging the updates 1a and 1b reaches the minimizer. SOLVED!]

Page 23:

Averaging can be too conservative

[Figure: from point 0, averaged updates produce points 1, 2, and so on, each making only a small step]

Page 24:

Averaging may be too conservative (2)

[Figure: the averaged step (BAD!!!) falls far short of the step we wanted (WANT)]

Page 25:

What to do?

Averaging: $x_{k+1} = x_k + \frac{1}{|\hat{S}_k|} \sum_{i \in \hat{S}_k} h^{(i)} e_i$

Summation: $x_{k+1} = x_k + \sum_{i \in \hat{S}_k} h^{(i)} e_i$

where $h^{(i)}$ is the update to coordinate $i$ and $e_i$ is the $i$-th unit coordinate vector.

Figure out when one can safely use summation.
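
(Illustrative sketch, not from the slides: the two combination rules written out in code; the vector, the sampled set, and the per-coordinate updates are made-up values.)

```python
import numpy as np

def combine_by_averaging(x, S, h):
    """x_new = x + (1/|S|) * sum_{i in S} h[i] * e_i  (conservative)."""
    x_new = x.copy()
    for i in S:
        x_new[i] += h[i] / len(S)
    return x_new

def combine_by_summation(x, S, h):
    """x_new = x + sum_{i in S} h[i] * e_i  (aggressive; safe only under extra conditions)."""
    x_new = x.copy()
    for i in S:
        x_new[i] += h[i]
    return x_new

x = np.zeros(5)                      # made-up current iterate
S = [0, 2, 4]                        # made-up sampled coordinates
h = {0: 1.0, 2: -0.5, 4: 2.0}        # made-up per-coordinate updates
print(combine_by_averaging(x, S, h))
print(combine_by_summation(x, S, h))
```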

Page 26:

Optimization Problems

Page 27:

Problem

Minimize $F(x) = f(x) + \Omega(x)$ over $x \in \mathbb{R}^n$

Loss $f$: convex (smooth or nonsmooth)
Regularizer $\Omega$: convex (smooth or nonsmooth), separable, allowed to take the value $+\infty$

Page 28:

Regularizer: examples

No regularizer
Weighted L1 norm (e.g., LASSO)
Weighted L2 norm
Box constraints (e.g., SVM dual)
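
(Illustrative sketch, not from the slides: because these regularizers are separable, the corresponding proximal step decomposes coordinate-wise. The helper names and test values below are assumptions.)

```python
import numpy as np

def prox_weighted_l1(v, lam, w):
    """Coordinate-wise soft-thresholding: argmin_x 0.5*(x - v)^2 + lam*w*|x|, per coordinate."""
    return np.sign(v) * np.maximum(np.abs(v) - lam * w, 0.0)

def prox_box(v, lo, hi):
    """Coordinate-wise projection onto the box [lo, hi] (box-constrained problems)."""
    return np.clip(v, lo, hi)

v = np.array([1.5, -0.2, 0.7])
print(prox_weighted_l1(v, lam=0.5, w=np.ones(3)))   # LASSO-style shrinkage step
print(prox_box(v, lo=0.0, hi=1.0))                  # e.g., an SVM-dual-style constraint
```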

Page 29:

Loss: examples

Quadratic loss
L-infinity
L1 regression
Exponential loss
Logistic loss
Square hinge loss

References: BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13

Page 30:

3 models for f with a small ESO (stepsize) parameter:

1. Smooth partially separable f [RT’11b]
2. Nonsmooth max-type f [FR’13]
3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]

Page 31:

General Theory

Page 32:

Randomized Parallel Coordinate Descent Method

$x_{k+1} = x_k + \sum_{i \in \hat{S}_k} h_k^{(i)} e_i$

$\hat{S}_k$ = random set of coordinates (sampling)
$x_k$ = current iterate, $x_{k+1}$ = new iterate
$e_i$ = $i$-th unit coordinate vector
$h_k^{(i)}$ = update to $i$-th coordinate
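
(Illustrative sketch, not from the slides: one possible instantiation of this iteration for a smooth quadratic loss with no regularizer, using a τ-nice sampling and per-coordinate steps of the form $h^{(i)} = -\nabla_i f(x)/(\beta w_i)$. The toy data, the choice $w_i = (A^\top A)_{ii}$, and the conservative choice $\beta = \tau$ are assumptions.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy instance: f(x) = 0.5 * ||A x - b||^2 (quadratic loss, no regularizer)
m, n, tau = 50, 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

w = np.sum(A * A, axis=0)     # coordinate-wise curvature: w_i = (A^T A)_{ii} (assumption)
beta = float(tau)             # conservative stepsize parameter (assumption; the theory
                              # allows smaller beta when the data are sparse)

def grad(x):
    return A.T @ (A @ x - b)

x = np.zeros(n)
for k in range(500):
    S = rng.choice(n, size=tau, replace=False)    # tau-nice sampling
    g = grad(x)                                   # full gradient for simplicity; only the
                                                  # coordinates in S are actually needed
    x[S] += -g[S] / (beta * w[S])                 # per-coordinate updates, applied by summation

print("final objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```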

Page 33:

ESO: Expected Separable Overapproximation

Definition [RT’11b]. We write $(f, \hat{S}) \sim ESO(\beta, w)$ (shorthand) if

$\mathbb{E}\Big[f\Big(x + \sum_{i \in \hat{S}} h^{(i)} e_i\Big)\Big] \le f(x) + \frac{\mathbb{E}[|\hat{S}|]}{n}\Big(\langle \nabla f(x), h \rangle + \frac{\beta}{2}\|h\|_w^2\Big)$

Minimize the right-hand side in h:
1. Separable in h
2. Can minimize in parallel
3. Can compute updates only for the coordinates in $\hat{S}$
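
(Added derivation, assuming the ESO form reconstructed above and no regularizer: separability is what lets the minimization over h split into n independent one-dimensional problems, one per sampled coordinate.)

```latex
\min_{h^{(i)} \in \mathbb{R}} \;
  \nabla_i f(x)\, h^{(i)} + \frac{\beta w_i}{2}\bigl(h^{(i)}\bigr)^2
\quad\Longrightarrow\quad
\nabla_i f(x) + \beta w_i\, h^{(i)} = 0
\quad\Longrightarrow\quad
h^{(i)} = -\frac{\nabla_i f(x)}{\beta w_i}
```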

Page 34:

Convergence rate: convex f

Theorem [RT’11b]. A number of iterations $K$ on the order of $\frac{n\beta}{\tau\epsilon}$ (up to constants and a logarithmic factor) implies $F(x_K) - F^* \le \epsilon$ with high probability.

n = # coordinates
τ = average # updated coordinates per iteration
β = stepsize parameter
ε = error tolerance
K = # iterations

Page 35:

Convergence rate: strongly convex f

Theorem [RT’11b]. When F is strongly convex, a number of iterations $K$ on the order of $\frac{n}{\tau}\cdot\frac{\beta+\mu_\Omega}{\mu_f+\mu_\Omega}\log\frac{1}{\epsilon}$ implies $F(x_K) - F^* \le \epsilon$ with high probability.

$\mu_f$ = strong convexity constant of the loss f
$\mu_\Omega$ = strong convexity constant of the regularizer

Page 36:

Partial Separability and Doubly Uniform Samplings

Page 37:

Serial uniform sampling. Probability law: $P(\hat{S} = \{i\}) = \frac{1}{n}$ for every coordinate $i$.

Page 38:

τ-nice sampling. Probability law: $\hat{S}$ is a subset of $\{1, \dots, n\}$ of size τ, chosen uniformly at random.

Good for shared-memory systems
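
(Illustrative sketch, not from the slides: drawing one τ-nice sample; n and τ are arbitrary.)

```python
import numpy as np

def tau_nice_sample(n, tau, rng):
    """Return a subset of {0, ..., n-1} of size tau, chosen uniformly at random."""
    return rng.choice(n, size=tau, replace=False)

rng = np.random.default_rng(1)
print(tau_nice_sample(n=10, tau=3, rng=rng))
```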

Page 39:

Doubly uniform sampling

Probability law: $P(\hat{S} = S)$ depends only on the cardinality $|S|$.

Can model unreliable processors / machines

Page 40:

ESO for partially separable functions and doubly uniform samplings

Theorem [RT’11b]. If f is smooth and partially separable of degree ω (model 1 above) and $\hat{S}$ is a doubly uniform sampling, then an ESO holds; for the τ-nice sampling the parameter is $\beta = 1 + \frac{(\omega - 1)(\tau - 1)}{\max(1,\, n - 1)}$.

Page 41:

Theoretical speedup

ω = degree of partial separability
n = # coordinates
τ = # coordinate updates / iteration

WEAK OR NO SPEEDUP: non-separable (dense) problems
LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems. Much of Big Data is here!
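
(Illustrative sketch, not from the slides: how the speedup τ/β behaves if one assumes the ESO parameter β = 1 + (ω − 1)(τ − 1)/(n − 1) for a τ-nice sampling. The numbers are made up.)

```python
def eso_beta(n, omega, tau):
    """Assumed ESO parameter for a tau-nice sampling on a partially separable f of degree omega."""
    return 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)

def speedup(n, omega, tau):
    return tau / eso_beta(n, omega, tau)

n, tau = 1_000_000, 1000
for omega in (10, 1000, n):          # sparse, moderately sparse, fully dense data
    print(f"omega = {omega:>9}: speedup with tau = {tau} is about {speedup(n, omega, tau):.0f}x")
```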

Page 42:

Page 43:

Theory: n = 1000 (# coordinates)

[Figure: theoretical speedup]

Page 44:

Practice: n = 1000 (# coordinates)

[Figure: observed speedup]

Page 45:

Experiment with a 1 billion-by-2 billion LASSO problem

Page 46:

Optimization with Big Data = Extreme* Mountain Climbing

* in a billion dimensional space on a foggy day

Page 47:

Coordinate Updates

Page 48:

Iterations

Page 49:

Wall Time

Page 50:

Distributed-Memory Coordinate Descent

Page 51:

Distributed τ-nice sampling

Probability law: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3, ...); each machine independently picks τ coordinates from its own partition, uniformly at random.

Good for a distributed version of coordinate descent
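
(Illustrative sketch, not from the slides: the partition-then-sample structure described above; the contiguous partition into c blocks and the per-machine τ are assumptions.)

```python
import numpy as np

def distributed_tau_nice_sample(n, c, tau, rng):
    """Partition {0, ..., n-1} into c contiguous blocks (one per machine); each machine
    independently samples tau coordinates from its own block, uniformly at random."""
    blocks = np.array_split(np.arange(n), c)
    return [rng.choice(block, size=tau, replace=False) for block in blocks]

rng = np.random.default_rng(2)
for machine, S in enumerate(distributed_tau_nice_sample(n=12, c=3, tau=2, rng=rng), start=1):
    print(f"Machine {machine} updates coordinates {sorted(S.tolist())}")
```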

Page 52:

ESO: Distributed setting

Model 3: f with ‘bounded Hessian’ [BKBG’11, RT’13a]

Theorem [RT’13b]. An ESO holds for the distributed sampling, with a parameter involving the spectral norm of the data.

Page 53:

Bad partitioning at most doubles # of iterations

Theorem [RT’13b]. A bound of the form “# iterations = K implies the target accuracy is reached” holds, where K depends on the # of nodes, the # of updates per node, and the spectral norm of the partitioning; even a bad partitioning at most doubles the # of iterations.

Page 54:

Page 55:

LASSO with a 3TB data matrix

128 Cray XE6 nodes, 4 MPI processes per node (c = 512). Each node: 2 × 16 cores with 32 GB RAM.

= # coordinates

Page 56:

Conclusions

• Coordinate descent methods scale very well to big data problems of special structure:
  – Partial separability (sparsity)
  – Small spectral norm of the data
  – Nesterov separability, ...
• Care is needed when combining updates (add them up? average?)

Page 57:

• Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for L1-regularized loss minimization. JMLR 2011.

• Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

• [RT’11b] P.R. and Martin Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.

• Rachael Tappenden, P.R. and Jacek Gondzio, Inexact coordinate descent: complexity and preconditioning, arXiv: 1304.5530, 2013.

• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.

• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.

References: serial coordinate descent

Page 58:

• [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin, Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011

• [RT’12] P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012

• [TBRS’13] Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013

• [FR’13] Olivier Fercoq and P.R., Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013

• [RT’13a] P.R. and Martin Takáč, Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013

• [RT’13b] P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013

References: parallel coordinate descent

Good entry point to the topic (4p paper)

Page 59:

• P.R. and Martin Takáč, Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings 2012.

• Rachael Tappenden, P.R. and Burak Buke, Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.

• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.

• Shai Shalev-Shwartz and Tong Zhang, Accelerated mini-batch stochastic dual coordinate ascent. NIPS 2013.

References: parallel coordinate descent

TALK TOMORROW