COCOA: Communication-Efficient Coordinate Ascent

Virginia Smith, with Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan


DESCRIPTION

"COCOA: Communication-Efficient Coordinate Ascent" presentation from AMPCamp 5 by Virginia Smith

TRANSCRIPT

Page 1: COCOA: Communication-Efficient Coordinate Ascent

Virginia Smith, with Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan

Pages 2–6: Outline

LARGE-SCALE OPTIMIZATION

COCOA

RESULTS

Pages 7–9: Machine Learning with Large Datasets

image/music/video tagging, document categorization,
item recommendation, click-through rate prediction,
sequence tagging, protein structure prediction,
sensor data prediction, spam classification,
fraud detection

Pages 10–13: Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …

Pages 14–20: Example: SVM Classification

[Figure: max-margin linear classification. The separating hyperplane $w^\top x - b = 0$, the margin hyperplanes $w^\top x - b = \pm 1$, the normal vector $w$, and the margin width $2/\|w\|$.]

Page 21: Example: SVM Classification

The same figure, together with the regularized hinge-loss objective:

$\min_{w \in \mathbb{R}^d} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{hinge}}(y_i\, w^\top x_i)$

Page 22: Example: SVM Classification

Many optimization methods apply:

Descent algorithms and line search methods
Acceleration, momentum, and conjugate gradients
Newton and quasi-Newton methods
Coordinate descent
Stochastic and incremental gradient methods
SMO
SVMlight
LIBLINEAR
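To make the objective concrete, here is a minimal NumPy sketch that evaluates the regularized hinge-loss objective above; the function name and toy data are illustrative, not from the slides.

import numpy as np

def svm_primal_objective(w, X, y, lam):
    # P(w) = (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w.x_i)
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * lam * (w @ w) + hinge.mean()

# Toy usage: 100 examples, 5 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
print(svm_primal_objective(np.zeros(5), X, y, lam=0.01))  # equals 1.0 at w = 0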

Pages 23–25: Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …

SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open problem: efficiently solving the objective when the data is distributed.

Pages 26–38: Distributed Optimization

"Always communicate": reduce every round: $w \leftarrow w - \alpha \sum_k \Delta w_k$

✔ convergence guarantees
✗ high communication

"Never communicate": average once at the end (ZDWJ, 2012): $w := \frac{1}{K} \sum_k w_k$

✔ low communication
✗ convergence not guaranteed
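A single-process sketch contrasting the two extremes on K simulated workers; the quadratic local losses, step size, and round count are assumptions for illustration, not from the slides.

import numpy as np

rng = np.random.default_rng(1)
K, n_k, d = 4, 50, 5
Xs = [rng.normal(size=(n_k, d)) for _ in range(K)]              # K workers' local data
ys = [X @ np.ones(d) + 0.1 * rng.normal(size=n_k) for X in Xs]

def local_grad(w, X, y):
    # Gradient of the local least-squares loss (1/(2*n_k)) * ||Xw - y||^2
    return X.T @ (X @ w - y) / len(y)

# "Always communicate": reduce the workers' updates into w every round
alpha, w = 0.05, np.zeros(d)
for _ in range(100):
    grads = [local_grad(w, X, y) for X, y in zip(Xs, ys)]
    w = w - alpha * sum(grads)          # reduce: w = w - alpha * sum_k dw_k

# "Never communicate": solve locally, average the K models once
w_locals = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)]
w_oneshot = sum(w_locals) / K           # average: w := (1/K) * sum_k w_k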

Pages 39–46: Mini-batch

reduce: $w \leftarrow w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$

✔ convergence guarantees
✔ tunable communication

A natural middle ground.
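A sketch of one such mini-batch step for the hinge-loss SVM objective from earlier; the per-example subgradient, step-size schedule, and batch size are assumptions for illustration.

import numpy as np

def hinge_subgrad(w, x_i, y_i, lam):
    # Per-example subgradient of (lam/2)*||w||^2 + max(0, 1 - y_i * w.x_i)
    g = lam * w
    if y_i * (w @ x_i) < 1.0:
        g = g - y_i * x_i
    return g

def minibatch_step(w, X, y, batch, alpha, lam):
    # reduce: w = w - (alpha/|b|) * sum_{i in b} dw_i
    total = sum(hinge_subgrad(w, X[i], y[i], lam) for i in batch)
    return w - (alpha / len(batch)) * total

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)); y = np.sign(rng.normal(size=200))
w = np.zeros(5)
for t in range(50):
    batch = rng.choice(200, size=16, replace=False)
    w = minibatch_step(w, X, y, batch, alpha=0.5 / (t + 1), lam=0.01)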

Page 47: Are we done?

Pages 48–51: Mini-batch Limitations

1. METHODS BEYOND SGD
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE

Pages 52–53: Outline

LARGE-SCALE OPTIMIZATION

COCOA

RESULTS

Pages 54–56: Mini-batch Limitations, and the Fixes

1. METHODS BEYOND SGD: Use a primal-dual framework
2. STALE UPDATES: Immediately apply local updates
3. AVERAGE OVER BATCH SIZE: Average over K ≪ batch size

Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)

Pages 57–62: 1. Primal-Dual Framework

PRIMAL ≥ DUAL

$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i(w^\top x_i) \right]$

$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n}\, x_i$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
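For one concrete instance: with the hinge loss, $\ell_i^*(-\alpha_i) = -y_i \alpha_i$ (for $y_i \alpha_i \in [0, 1]$) and $w(\alpha) = A\alpha$, so the gap $P(w(\alpha)) - D(\alpha)$ is easy to check. A minimal sketch under those assumptions, with toy data that is not from the slides:

import numpy as np

def duality_gap(alpha, X, y, lam):
    # Gap P(w(alpha)) - D(alpha) for the hinge-loss SVM;
    # requires dual feasibility y_i * alpha_i in [0, 1]
    n = len(y)
    w = X.T @ alpha / (lam * n)                    # w(alpha) = A @ alpha
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    P = 0.5 * lam * (w @ w) + hinge.mean()
    D = -0.5 * lam * (w @ w) + (y * alpha).mean()  # -(lam/2)||A a||^2 + mean(y_i a_i)
    return P - D

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)); y = np.sign(rng.normal(size=100))
alpha = 0.5 * y                                    # feasible: y_i * alpha_i = 0.5
print(duality_gap(alpha, X, y, lam=0.1))           # nonnegative by weak duality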

Pages 63–65: 2. Immediately Apply Updates

STALE: every update is computed against the same fixed $w$, and the sum is applied only after the loop:

for $i \in b$:
    $\Delta w \leftarrow \Delta w - \alpha \nabla_i P(w)$
end
$w \leftarrow w + \Delta w$

FRESH: each update is applied immediately, so later steps see it:

for $i \in b$:
    $\Delta w \leftarrow \Delta w - \alpha \nabla_i P(w)$
    $w \leftarrow w + \Delta w$
end
Output: $\Delta\alpha_{[k]}$ and $\Delta w := A_{[k]}\, \Delta\alpha_{[k]}$
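A toy NumPy illustration of why this matters; the coupled quadratic objective and step size are assumptions, chosen so that coordinate gradients depend on each other.

import numpy as np

# P(w) = 0.5*w'Qw - c'w with non-diagonal Q, so grad_i depends on other coordinates
Q = np.array([[2.0, 0.5, 0.2],
              [0.5, 2.0, 0.5],
              [0.2, 0.5, 2.0]])
c = np.ones(3)
grad_i = lambda w, i: Q[i] @ w - c[i]
alpha, b = 0.4, [0, 1, 2]

# STALE: every gradient uses the same outdated w; the summed update lands at the end
w, dw = np.zeros(3), np.zeros(3)
for i in b:
    dw[i] -= alpha * grad_i(w, i)
w_stale = w + dw

# FRESH: each coordinate update is applied immediately
w = np.zeros(3)
for i in b:
    w[i] -= alpha * grad_i(w, i)
print(w_stale, w)   # differ: fresh steps react to earlier updates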

Pages 66–67: 3. Average over K

reduce: $w \leftarrow w + \frac{1}{K} \sum_k \Delta w_k$

Pages 68–72: CoCoA

Algorithm 1: CoCoA
Input: $T \geq 1$, scaling parameter $1 \leq \beta_K \leq K$ (default: $\beta_K := 1$)
Data: $\{(x_i, y_i)\}_{i=1}^{n}$ distributed over $K$ machines
Initialize: $\alpha^{(0)}_{[k]} \leftarrow 0$ for all machines $k$, and $w^{(0)} \leftarrow 0$
for $t = 1, 2, \ldots, T$
    for all machines $k = 1, 2, \ldots, K$ in parallel
        $(\Delta\alpha_{[k]}, \Delta w_k) \leftarrow \text{LocalDualMethod}(\alpha^{(t-1)}_{[k]}, w^{(t-1)})$
        $\alpha^{(t)}_{[k]} \leftarrow \alpha^{(t-1)}_{[k]} + \frac{\beta_K}{K}\, \Delta\alpha_{[k]}$
    end
    reduce: $w^{(t)} \leftarrow w^{(t-1)} + \frac{\beta_K}{K} \sum_{k=1}^{K} \Delta w_k$
end

Procedure A: LocalDualMethod (dual algorithm on machine $k$)
Input: local $\alpha_{[k]} \in \mathbb{R}^{n_k}$, and $w \in \mathbb{R}^d$ consistent with the other coordinate blocks of $\alpha$, s.t. $w = A\alpha$
Data: local $\{(x_i, y_i)\}_{i=1}^{n_k}$
Output: $\Delta\alpha_{[k]}$ and $\Delta w := A_{[k]}\, \Delta\alpha_{[k]}$

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
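The slides note the Spark implementation fits in under ten lines; below is a hedged single-process NumPy simulation of Algorithm 1 for the hinge-loss SVM, using closed-form SDCA steps as the LocalDualMethod. The parallel loop runs sequentially here, and all names, data, and hyperparameters are illustrative.

import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    # LocalDualMethod sketch: H SDCA steps on machine k's coordinate block,
    # with each local update applied immediately (the "fresh" scheme)
    alpha_k, w = alpha_k.copy(), w.copy()
    a0, w0 = alpha_k.copy(), w.copy()
    for _ in range(H):
        i = rng.integers(len(y_k))
        x, yi = X_k[i], y_k[i]
        # closed-form hinge-loss coordinate update, clipping y_i*alpha_i into [0, 1]
        proj = np.clip(yi * alpha_k[i] + lam * n * (1.0 - yi * (w @ x)) / (x @ x), 0.0, 1.0)
        delta = yi * proj - alpha_k[i]
        alpha_k[i] += delta
        w += delta * x / (lam * n)           # keeps w = A @ alpha consistent
    return alpha_k - a0, w - w0              # (Delta alpha_[k], Delta w_k)

def cocoa(Xs, ys, lam, T, H, beta_K=1.0, seed=0):
    rng = np.random.default_rng(seed)
    K, d = len(Xs), Xs[0].shape[1]
    n = sum(len(y) for y in ys)
    w = np.zeros(d)
    alphas = [np.zeros(len(y)) for y in ys]
    for _ in range(T):
        # in a real deployment this loop runs in parallel, one task per machine
        results = [local_sdca(alphas[k], w, Xs[k], ys[k], lam, n, H, rng)
                   for k in range(K)]
        for k, (da, _) in enumerate(results):
            alphas[k] += (beta_K / K) * da
        w = w + (beta_K / K) * sum(dw for _, dw in results)   # reduce step
    return w, alphas

# Toy usage: K = 4 machines, 100 local examples each
rng = np.random.default_rng(7)
w_true = rng.normal(size=10)
Xs = [rng.normal(size=(100, 10)) for _ in range(4)]
ys = [np.sign(X @ w_true) for X in Xs]
w, _ = cocoa(Xs, ys, lam=1e-3, T=20, H=100)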

Pages 73–76: Convergence

$\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \;\leq\; \left(1 - (1 - \Theta)\, \frac{1}{K}\, \frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T} \left(D(\alpha^*) - D(\alpha^{(0)})\right)$

Assumptions: the $\ell_i$ are $(1/\gamma)$-smooth, and LocalDualMethod makes $\Theta$ improvement per step; e.g. for SDCA,

$\Theta = \left(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\, \frac{1}{n}\right)^{H}$

Here $\sigma$ is a measure of the difficulty of the data partition, with $0 \leq \sigma \leq n/K$. The same rate applies to the duality gap!

✔ it converges!
✔ convergence rate matches current state-of-the-art mini-batch methods
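To get a feel for the geometric rate, a small sketch that plugs in assumed, purely illustrative values and solves for the number of outer rounds $T$ needed to shrink the initial dual suboptimality by a factor of $10^6$:

import math

# Illustrative values, not from the slides
lam, n, gamma, K, sigma = 1e-4, 1_000_000, 1.0, 16, 1000.0
H = n  # one local pass of SDCA per round

theta = (1 - (lam * n * gamma) / (1 + lam * n * gamma) / n) ** H
rate = 1 - (1 - theta) * (1 / K) * (lam * n * gamma) / (sigma + lam * n * gamma)

# smallest T with rate**T <= 1e-6
T = math.ceil(math.log(1e-6) / math.log(rate))
print(f"per-round factor {rate:.5f}, T = {T} communication rounds")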

Pages 77–78: Outline

LARGE-SCALE OPTIMIZATION

COCOA

RESULTS

Page 79: Empirical Results in Spark

Dataset     Training (n)   Features (d)   Sparsity   Workers (K)
Cov         522,911        54             22.22%     4
Rcv1        677,399        47,236         0.16%      8
Imagenet    32,751         160,000        100%       32

Pages 80–81: [Figure: log primal suboptimality vs. time (s), one panel per dataset.
Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10)
Cov: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1)
RCV1: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100)]

Pages 82–88: COCOA Take-Aways

A framework for distributed optimization
Uses state-of-the-art primal-dual methods
Reduces communication:
 - applies updates immediately
 - averages over the # of machines
Strong empirical results & convergence guarantees

To appear in NIPS '14
www.berkeley.edu/~vsmith