
Page 1:

Recent Progresses in Stochastic Algorithms for Big Data Optimization

Tong Zhang

Rutgers University & Baidu Inc.

collaborators: Shai Shalev-Shwartz, Rie Johnson, Lin Xiao, Ohad Shamir and Nathan Srebro

Page 2:

Outline

Background:
- big data optimization problem
- 1st order stochastic gradient versus batch gradient: pros and cons

Stochastic gradient algorithms with variance reduction:
- algorithm 1: SVRG (Stochastic variance reduced gradient)
- algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
- algorithm 3: accelerated SDCA (with Nesterov acceleration)

Strategies for distributed computing:
- algorithm 4: DANE (Distributed Approximate NEwton-type method)
- behaves like 2nd order stochastic sampling


Page 5:

Mathematical Problem

Big Data Optimization Problem in machine learning:

\min_w f(w), \quad f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)

Special structure: sum over data.

Big data (n large) requires distributed training.

Page 6:

Assumptions on loss function

\lambda-strong convexity:

f(w') \ge f(w) + \nabla f(w)^\top (w' - w) + \frac{\lambda}{2} \|w' - w\|_2^2

L-smoothness:

f_i(w') \le f_i(w) + \nabla f_i(w)^\top (w' - w) + \frac{L}{2} \|w' - w\|_2^2

Page 7:

Example: Computational Advertising

Large scale regularized logistic regression

\min_w \frac{1}{n} \sum_{i=1}^n \underbrace{\ln(1 + e^{-w^\top x_i y_i}) + \frac{\lambda}{2} \|w\|_2^2}_{f_i(w)}

- data (x_i, y_i) with y_i \in \{\pm 1\}; parameter vector w
- \lambda-strongly convex and L-smooth with L = 0.25 \max_i \|x_i\|_2^2 + \lambda
- big data: n ~ 10-100 billion
- high dimension: dim(x_i) ~ 10-100 billion

How to solve big optimization problems efficiently?
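To make the setup concrete, here is a minimal numerical sketch (not from the slides) of the regularized logistic objective f(w) and its full gradient; the names X, y, lam and the dense numpy setting are illustrative assumptions.

import numpy as np

def logistic_objective(w, X, y, lam):
    """f(w) = (1/n) sum_i ln(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||_2^2."""
    margins = y * (X @ w)                        # y_i * w^T x_i for each example
    loss = np.mean(np.log1p(np.exp(-margins)))   # average logistic loss
    return loss + 0.5 * lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    """Full gradient: (1/n) sum_i -y_i x_i / (1 + exp(y_i w^T x_i)) + lam * w."""
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))          # derivative of the logistic loss wrt the margin
    return X.T @ coef / X.shape[0] + lam * w

These two helpers are reused in the later sketches as stand-ins for a generic f and its gradient.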


Page 10:

Statistical Thinking: sampling

Objective function:

f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)

- sample objective function: only optimize approximate objective

1st order gradient:

\nabla f(w) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w)

- sample 1st order gradient (stochastic gradient): converges to exact optimal; variance reduction: fast rate

2nd order gradient:

\nabla^2 f(w) = \frac{1}{n} \sum_{i=1}^n \nabla^2 f_i(w)

- sample 2nd order gradient (stochastic Newton): converges to exact optimal with fast rate, distributed computing


Page 13:

Batch Optimization Method: Gradient Descent

Solve

w_* = \arg\min_w f(w), \quad f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w).

Gradient Descent (GD):

w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).

How fast does this method converge to the optimal solution?

General result: convergence to a local minimum under suitable conditions. The convergence rate depends on the conditions on f(\cdot). For \lambda-strongly convex and L-smooth problems, the rate is linear:

f(w_k) - f(w_*) = O((1 - \rho)^k),

where \rho = O(\lambda/L) is the inverse condition number.
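The GD rule above, written as a minimal sketch; it reuses the illustrative logistic_gradient helper assumed in the earlier example, and the step size and iteration count are arbitrary placeholder choices.

import numpy as np

def gradient_descent(X, y, lam, eta=0.1, num_iters=100):
    """Plain batch GD: w_k = w_{k-1} - eta * grad f(w_{k-1})."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w -= eta * logistic_gradient(w, X, y, lam)   # one full pass over all n examples per step
    return w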


Page 15:

Stochastic Approximate Gradient Computation

If

f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w),

GD requires the computation of the full gradient, which is extremely costly:

\nabla f(w) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w)

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate

\nabla f(w) \approx \frac{1}{|B|} \sum_{i \in B} \nabla f_i(w)

- It is an unbiased estimator
- more efficient computation, but introduces variance
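A minimal sketch of one epoch of mini-batch SGD using the estimator above (illustrative; it assumes the same logistic-regression helpers as before, and batch_size and eta are arbitrary choices).

import numpy as np

def sgd_epoch(w, X, y, lam, eta=0.1, batch_size=32, rng=np.random.default_rng(0)):
    """One pass of mini-batch SGD: w <- w - eta * (1/|B|) sum_{i in B} grad f_i(w)."""
    n = X.shape[0]
    for _ in range(n // batch_size):
        B = rng.integers(0, n, size=batch_size)              # sample a mini-batch (with replacement)
        w = w - eta * logistic_gradient(w, X[B], y[B], lam)  # unbiased estimate of the full gradient
    return w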


Page 17:

SGD versus GD

SGD:
- faster computation per step
- sublinear convergence, due to the variance of the gradient approximation:

f(w_t) - f(w_*) = O(1/t).

GD:
- slower computation per step
- linear convergence:

f(w_t) - f(w_*) = O((1 - \rho)^t).

Page 18:

Improving SGD via Variance Reduction

- GD converges fast but computation is slow
- SGD computation is fast but it converges slowly
  - slow convergence due to inherent variance

SGD as a statistical estimator of the gradient:
- let g_i = \nabla f_i
- unbiasedness: E g_i = \frac{1}{n} \sum_{i=1}^n g_i = \nabla f
- error of using g_i to approximate \nabla f: variance E \|g_i - E g_i\|_2^2

Statistical thinking:
- relating variance to optimization
- design other unbiased gradient estimators with smaller variance


Page 20:

Relating Statistical Variance to Optimization

Want to optimize

\min_w f(w)

Full gradient \nabla f(w).

Given an unbiased random estimator g_i of \nabla f(w) and the SGD rule

w \to w - \eta g_i,

the reduction of the objective is

E f(w - \eta g_i) \le f(w) - \underbrace{(\eta - \eta^2 L/2) \|\nabla f(w)\|_2^2}_{\text{non-random}} + \frac{\eta^2 L}{2} \underbrace{E \|g_i - E g_i\|_2^2}_{\text{variance}}.

(This follows from L-smoothness, f(w - \eta g_i) \le f(w) - \eta \nabla f(w)^\top g_i + \frac{\eta^2 L}{2} \|g_i\|_2^2, by taking expectations and using E \|g_i\|_2^2 = \|\nabla f(w)\|_2^2 + E \|g_i - E g_i\|_2^2.)

Smaller variance implies bigger reduction.



Page 26:

Stochastic Variance Reduced Gradient (SVRG) I

Objective function

f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) = \frac{1}{n} \sum_{i=1}^n \tilde{f}_i(w),

where

\tilde{f}_i(w) = f_i(w) - \underbrace{(\nabla f_i(\tilde{w}) - \nabla f(\tilde{w}))^\top w}_{\text{sum to zero}}.

Pick \tilde{w} to be an approximate solution (close to w_*). The SVRG rule (control variates) is

w_t = w_{t-1} - \eta_t \nabla \tilde{f}_i(w_{t-1}) = w_{t-1} - \eta_t \underbrace{[\nabla f_i(w_{t-1}) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w})]}_{\text{small variance}}.
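A minimal sketch of the SVRG rule above with a constant step size. The stage structure (recompute the full gradient at the reference point once per stage, then take stochastic variance-reduced steps) follows the update formula; the specific constants and the reuse of the illustrative logistic_gradient helper are assumptions.

import numpy as np

def svrg(X, y, lam, eta=0.1, num_stages=20, inner_steps=None, rng=np.random.default_rng(0)):
    """SVRG: w_t = w_{t-1} - eta * [grad f_i(w_{t-1}) - grad f_i(w_tilde) + grad f(w_tilde)]."""
    n, d = X.shape
    inner_steps = inner_steps or 2 * n
    w_tilde = np.zeros(d)
    for _ in range(num_stages):
        full_grad = logistic_gradient(w_tilde, X, y, lam)    # grad f(w_tilde), one full pass per stage
        w = w_tilde.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            gi = logistic_gradient(w, X[i:i+1], y[i:i+1], lam)              # grad f_i(w_{t-1})
            gi_tilde = logistic_gradient(w_tilde, X[i:i+1], y[i:i+1], lam)  # grad f_i(w_tilde)
            w -= eta * (gi - gi_tilde + full_grad)           # variance-reduced step, constant step size
        w_tilde = w                                          # next reference point
    return w_tilde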

Page 27:

Stochastic Variance Reduced Gradient (SVRG) II

Assume that \tilde{w} \approx w_* and w_{t-1} \approx w_*. Then

\nabla f(\tilde{w}) \approx \nabla f(w_*) = 0, \quad \nabla f_i(w_{t-1}) \approx \nabla f_i(\tilde{w}).

This means

\nabla f_i(w_{t-1}) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w}) \to 0.

It is possible to choose a constant step size \eta_t = \eta instead of requiring \eta_t \to 0. One can achieve comparable linear convergence with SVRG:

E f(w_t) - f(w_*) = O((1 - \rho)^t),

where \rho = O(\lambda n / (L + \lambda n)); convergence is faster than GD.

Page 28:

Compare SVRG to Batch Gradient Descent Algorithm

Number of examples needed to achieve \epsilon accuracy:

- Batch GD: O(n \cdot (L/\lambda) \log(1/\epsilon))
- SVRG: O((n + L/\lambda) \log(1/\epsilon))

Assume an L-smooth loss:

\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|

and a \lambda-strongly convex objective:

\|\nabla f(w) - \nabla f(w')\| \ge \lambda \|w - w'\|

The gain of SVRG over the batch algorithm is significant when n is large.
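A quick back-of-the-envelope comparison of the two bounds, under purely illustrative values of n, the condition number L/\lambda, and the target accuracy (none of these numbers are from the slides):

# Illustrative plug-in of the complexity bounds above.
n, cond, log_eps = 1e9, 1e4, 10.0    # n examples, condition number L/lambda, log(1/eps)
batch_gd = n * cond * log_eps        # O(n * (L/lambda) * log(1/eps))  -> ~1e14 example-gradients
svrg_cost = (n + cond) * log_eps     # O((n + L/lambda) * log(1/eps)) -> ~1e10 example-gradients
print(f"batch GD ~ {batch_gd:.1e}, SVRG ~ {svrg_cost:.1e}")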


Page 32:

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem:

\min_w \left[ \frac{1}{n} \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda \|w\|_1 \right]

or the ridge regression problem:

\min_w \underbrace{\frac{1}{n} \sum_{i=1}^n (w^\top x_i - y_i)^2}_{\text{loss}} + \underbrace{\frac{\lambda}{2} \|w\|_2^2}_{\text{regularization}}

Our goal: solve regularized loss minimization problems as fast as we can.

- A good solution leads to a stochastic algorithm called proximal Stochastic Dual Coordinate Ascent (Prox-SDCA).
- We show: fast convergence of SDCA for many regularized loss minimization problems in machine learning.


Page 35:

General Problem

Want to solve:

\min_w P(w) := \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w) \right],

where the X_i are matrices and g(\cdot) is strongly convex.

Examples:

Multi-class logistic loss:

\phi_i(X_i^\top w) = \ln \sum_{\ell=1}^K \exp(w^\top X_{i,\ell}) - w^\top X_{i,y_i}.

L1-L2 regularization:

g(w) = \frac{1}{2} \|w\|_2^2 + \frac{\sigma}{\lambda} \|w\|_1

Page 36:

Dual Formulation

Primal:

\min_w P(w) := \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w) \right],

Dual:

\max_\alpha D(\alpha) := \left[ \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-\alpha_i) - \lambda g^*\left( \frac{1}{\lambda n} \sum_{i=1}^n X_i \alpha_i \right) \right]

with the relationship

w = \nabla g^*\left( \frac{1}{\lambda n} \sum_{i=1}^n X_i \alpha_i \right).

The convex conjugate (dual) is defined as \phi_i^*(a) = \sup_z (a z - \phi_i(z)).

SDCA: randomly pick i and optimize D(\alpha) by varying \alpha_i while keeping the other dual variables fixed.
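To make the coordinate step concrete, here is a minimal sketch of plain SDCA specialized to the ridge regression instance from the motivation slide (squared loss \phi_i(z) = (z - y_i)^2, g(w) = \frac{1}{2}\|w\|_2^2). The closed-form coordinate maximizer below is derived for that specific choice and is an illustration, not the slides' general Prox-SDCA.

import numpy as np

def sdca_ridge(X, y, lam, num_epochs=10, rng=np.random.default_rng(0)):
    """SDCA for ridge regression: maintain alpha and w = (1/(lam*n)) * sum_i alpha_i x_i,
    and maximize the dual D(alpha) exactly over one coordinate alpha_i at a time."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum("ij,ij->i", X, X)       # ||x_i||^2, precomputed
    for _ in range(num_epochs * n):
        i = rng.integers(n)
        # Closed-form maximizer of D over alpha_i (for phi_i(z) = (z - y_i)^2).
        delta = (y[i] - X[i] @ w - alpha[i] / 2.0) / (0.5 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        w += (delta / (lam * n)) * X[i]          # keep w = (1/(lam*n)) sum_i alpha_i x_i
    return w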

Page 37:

Example: L1-L2 Regularized Logistic Regression

Primal:

P(w) = \frac{1}{n} \sum_{i=1}^n \underbrace{\ln(1 + e^{-w^\top X_i Y_i})}_{\phi_i(w)} + \underbrace{\frac{\lambda}{2} w^\top w + \sigma \|w\|_1}_{\lambda g(w)}.

Dual: with \alpha_i Y_i \in [0, 1],

D(\alpha) = \frac{1}{n} \sum_{i=1}^n \underbrace{\left[ -\alpha_i Y_i \ln(\alpha_i Y_i) - (1 - \alpha_i Y_i) \ln(1 - \alpha_i Y_i) \right]}_{-\phi_i^*(-\alpha_i)} - \frac{\lambda}{2} \|\mathrm{trunc}(v, \sigma/\lambda)\|_2^2

s.t. \quad v = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i X_i; \qquad w = \mathrm{trunc}(v, \sigma/\lambda)

where

\mathrm{trunc}(u, \delta)_j =
  u_j - \delta   if u_j > \delta
  0              if |u_j| \le \delta
  u_j + \delta   if u_j < -\delta

Page 38:

Proximal-SDCA for L1-L2 Regularization

Algorithm:
- Keep the dual \alpha and v = (\lambda n)^{-1} \sum_i \alpha_i X_i
- Randomly pick i
- Find \Delta_i by approximately maximizing

  -\phi_i^*(\alpha_i + \Delta_i) - \mathrm{trunc}(v, \sigma/\lambda)^\top X_i \Delta_i - \frac{1}{2 \lambda n} \|X_i\|_2^2 \Delta_i^2,

  where \phi_i^*(\alpha_i + \Delta) = (\alpha_i + \Delta) Y_i \ln((\alpha_i + \Delta) Y_i) + (1 - (\alpha_i + \Delta) Y_i) \ln(1 - (\alpha_i + \Delta) Y_i)

- \alpha = \alpha + \Delta_i \cdot e_i
- v = v + (\lambda n)^{-1} \Delta_i \cdot X_i
- Let w = \mathrm{trunc}(v, \sigma/\lambda).
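The trunc operator used above is coordinate-wise soft thresholding; a minimal numpy sketch (illustrative, not the slides' code):

import numpy as np

def trunc(u, delta):
    """Soft thresholding: shrink each coordinate of u toward zero by delta."""
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)

# Example: trunc(np.array([0.3, -1.2, 0.05]), 0.1) -> array([ 0.2, -1.1,  0. ])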

Page 39:

Convergence rate

The number of iterations needed to achieve \epsilon accuracy:

For (1/\gamma)-smooth loss:

O\left( \left( n + \frac{1}{\gamma \lambda} \right) \log \frac{1}{\epsilon} \right)

For L-Lipschitz loss:

O\left( n + \frac{L^2}{\lambda \epsilon} \right)

Page 40:

Solving L1 with Smooth Loss

Assume we want to solve L1 regularization to accuracy \epsilon with smooth \phi_i:

\frac{1}{n} \sum_{i=1}^n \phi_i(w) + \sigma \|w\|_1.

Apply Prox-SDCA with an extra term 0.5 \lambda \|w\|_2^2, where \lambda = O(\epsilon):
- number of iterations needed is O(n + 1/\epsilon).

Compare to Dual Averaging SGD (Xiao):
- number of iterations needed is O(1/\epsilon^2).

Compare to batch FISTA (Nesterov's accelerated proximal gradient):
- number of iterations needed is O(n/\sqrt{\epsilon}).

Prox-SDCA wins in the statistically interesting regime: \epsilon > \Omega(1/n^2).

One can design an accelerated Prox-SDCA procedure that is always superior to FISTA.



Page 47:

Accelerated Prox-SDCA

Solving:

P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w)

- The convergence rate of Prox-SDCA depends on O(1/\lambda).
- This is inferior to acceleration, which has an O(1/\sqrt{\lambda}) dependency, when \lambda is very small, e.g. O(1/n).

Inner-outer Iteration Accelerated Prox-SDCA:
- Pick a suitable \kappa = \Theta(1/n) and \beta
- For t = 2, 3, ... (outer iteration):
  - Let g_t(w) = \lambda g(w) + 0.5 \kappa \|w\|_2^2 - \kappa w^\top y^{(t-1)}  (\kappa-strongly convex)
  - Let P_t(w) = P(w) - \lambda g(w) + g_t(w)  (redefines P(\cdot); \kappa-strongly convex)
  - Approximately solve P_t(w) for (w^{(t)}, \alpha^{(t)}) with Prox-SDCA  (inner iteration)
  - Let y^{(t)} = w^{(t)} + \beta (w^{(t)} - w^{(t-1)})  (acceleration)
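A minimal sketch of the inner-outer structure above. Here inner_solver is an assumed callback standing in for an approximate Prox-SDCA solve of P_t (which depends on the extrapolation point y and on \kappa); its interface and the constants are illustrative assumptions, not the slides' specification.

import numpy as np

def accelerated_outer_loop(inner_solver, dim, kappa, beta, num_outer=50):
    """Outer acceleration: re-solve the kappa-strongly-convex problem P_t built around
    y^(t-1), then extrapolate y^(t) = w^(t) + beta * (w^(t) - w^(t-1))."""
    w_prev = np.zeros(dim)
    y = np.zeros(dim)
    w = w_prev
    for _ in range(num_outer):
        w = inner_solver(y, kappa)       # approximate minimizer of P_t (inner Prox-SDCA run)
        y = w + beta * (w - w_prev)      # Nesterov-style extrapolation
        w_prev = w
    return w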


Page 49:

Performance Comparisons

Problem          | Algorithm                     | Runtime
-----------------|-------------------------------|------------------------------------------------
SVM              | SGD                           | d / (\lambda \epsilon)
SVM              | AGD (Nesterov)                | d n \sqrt{1/(\lambda \epsilon)}
SVM              | Acc-Prox-SDCA                 | d (n + \min\{1/(\lambda\epsilon), \sqrt{n/(\lambda\epsilon)}\})
Lasso            | SGD and variants              | d / \epsilon^2
Lasso            | Stochastic Coordinate Descent | d n / \epsilon
Lasso            | FISTA                         | d n \sqrt{1/\epsilon}
Lasso            | Acc-Prox-SDCA                 | d (n + \min\{1/\epsilon, \sqrt{n/\epsilon}\})
Ridge Regression | Exact                         | d^2 n + d^3
Ridge Regression | SGD, SDCA                     | d (n + 1/\lambda)
Ridge Regression | AGD                           | d n \sqrt{1/\lambda}
Ridge Regression | Acc-Prox-SDCA                 | d (n + \min\{1/\lambda, \sqrt{n/\lambda}\})

Page 50:

Summary of Sampling

1st order gradient

\nabla f(w) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w)

- sample 1st order gradient (stochastic gradient): variance reduction leads to fast linear convergence

2nd order gradient

\nabla^2 f(w) = \frac{1}{n} \sum_{i=1}^n \nabla^2 f_i(w)

- sample 2nd order gradient (stochastic Newton): converges to exact optimal with fast rate in the distributed computing setting


Page 54:

Distributed Computing

Assume: data distributed over machines
- m processors
- each has n/m examples

Simple Computational Strategy: One Shot Averaging (OSA)
- run optimization on the m machines separately, obtaining parameters w^{(1)}, ..., w^{(m)}
- average the parameters: \bar{w} = m^{-1} \sum_{i=1}^m w^{(i)}

Page 55:

Improvement

Advantages of the OSA strategy:
- machines run independently
- simple and computationally efficient; asymptotically good in theory

Disadvantage:
- practically inferior to training all examples on a single machine

Traditional solution in optimization: ADMM

New idea: via 2nd order gradient sampling
- Distributed Approximate NEwton (DANE)


Page 57:

Distribution Scheme

Assume: data distributed over machines with the decomposed problem

f(w) = \sum_{\ell=1}^m f^{(\ell)}(w).

- m processors
- each f^{(\ell)}(w) has n/m randomly partitioned examples
- each machine holds a complete set of parameters

Page 58:

DANE

Starting with \bar{w} obtained using OSA, iterate:
- Take \bar{w} and define

  \tilde{f}^{(\ell)}(w) = f^{(\ell)}(w) - (\nabla f^{(\ell)}(\bar{w}) - \nabla f(\bar{w}))^\top w

- each machine solves

  w^{(\ell)} = \arg\min_w \tilde{f}^{(\ell)}(w)

  independently
- Take the partial average as the next \bar{w}

The gradient correction is similar to SVRG.
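A minimal single-process simulation of one DANE round as described above. This is illustrative only: it splits one in-memory dataset into shards instead of running on separate machines, reuses the logistic_gradient helper assumed earlier, solves each corrected local problem with a few plain gradient steps, and assumes equal-sized shards so that the full gradient is the mean of the shard gradients.

import numpy as np

def dane_round(w_bar, shards, lam, eta=0.1, local_steps=100):
    """One DANE iteration: each shard minimizes f_l(w) - (grad f_l(w_bar) - grad f(w_bar))^T w,
    then the local solutions are averaged to form the next w_bar."""
    # Full gradient at the reference point (one round of communication in practice).
    full_grad = np.mean([logistic_gradient(w_bar, X, y, lam) for X, y in shards], axis=0)
    local_solutions = []
    for X, y in shards:
        local_grad_ref = logistic_gradient(w_bar, X, y, lam)   # grad f_l(w_bar)
        w = w_bar.copy()
        for _ in range(local_steps):
            g = logistic_gradient(w, X, y, lam) - (local_grad_ref - full_grad)  # corrected local gradient
            w -= eta * g
        local_solutions.append(w)
    return np.mean(local_solutions, axis=0)                    # partial average -> next w_bar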


Page 60:

Relationship to 2nd Order Gradient Sampling

Each f^{(\ell)}(w) takes n/m out of the n samples from f(w) = n^{-1} \sum_i f_i(w).

\tilde{f}^{(\ell)}(w) = f^{(\ell)}(w) - (\nabla f^{(\ell)}(\bar{w}) - \nabla f(\bar{w}))^\top w

Expanding \tilde{f}^{(\ell)}(w) around \bar{w}, we have

\tilde{f}^{(\ell)}(w) \approx f^{(\ell)}(\bar{w}) + \nabla f^{(\ell)}(\bar{w})^\top (w - \bar{w}) + \frac{1}{2} (w - \bar{w})^\top \nabla^2 f^{(\ell)}(\bar{w}) (w - \bar{w}) - (\nabla f^{(\ell)}(\bar{w}) - \nabla f(\bar{w}))^\top w.

\min_w \tilde{f}^{(\ell)}(w) is therefore equivalent to the approximate minimization of

\min_w \; f^{(\ell)}(\bar{w}) + \nabla f(\bar{w})^\top (w - \bar{w}) + \frac{1}{2} \underbrace{(w - \bar{w})^\top \nabla^2 f^{(\ell)}(\bar{w}) (w - \bar{w})}_{\text{2nd order gradient sampling from } \nabla^2 f(\bar{w})}

This leads to fast convergence.


Page 63:

Comparisons

[Figure: objective value versus iteration t on three datasets (COV1, ASTRO, MNIST-47), comparing DANE, ADMM, OSA, and the optimum (Opt).]

Page 64:

Summary

Optimization in machine learning: sum-over-data structure
- suitable for statistical sampling

Traditional methods: gradient-based batch algorithms
- do not take advantage of the special structure

Recent progress: stochastic optimization (focusing on my own work)
- Sampling of the 1st order gradient: variance reduction leads to a fast linear convergence rate
- Sampling of the 2nd order gradient: leads to fast distributed algorithms


Page 66:

References

- Rie Johnson and TZ. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS 2013.
- Lin Xiao and TZ. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. Tech Report arXiv:1403.4699, March 2014.
- Shai Shalev-Shwartz and TZ. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR 14:567-599, 2013.
- Shai Shalev-Shwartz and TZ. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. Tech Report arXiv:1309.2375, Sep 2013.
- Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

My friend Lin Xiao has made further improvements with collaborators on multiple fronts in this line of big data optimization research.
