
Page 1: Stochastic Subgradient Method - UCI

Stochastic Subgradient Method

Lingjie Weng, Yutian Chen

Bren School of Information and Computer Science

UC Irvine

Page 2: Stochastic Subgradient Method - UCI

Subgradient

β€’ Recall the basic inequality for a convex differentiable f: f(y) β‰₯ f(x) + βˆ‡f(x)^T(y βˆ’ x) for all y.

β€’ The gradient βˆ‡f(x) determines a global under-estimator of f.

What if f is not differentiable?

β€’ We can still construct a global under-estimator: the vector g is a subgradient of f at x if

f(y) β‰₯ f(x) + g^T(y βˆ’ x) for all y

β€’ There can be more than one subgradient at a given point: the collection of all subgradients of f at x is the subdifferential βˆ‚f(x).

Page 3: Stochastic Subgradient Method - UCI

An example: a non-differentiable function and its subgradients

Let f = max{f1, f2}, with f1, f2 convex and differentiable. At a point x':

β€’ f1(x') > f2(x'): unique subgradient g = βˆ‡f1(x')

β€’ f1(x') < f2(x'): unique subgradient g = βˆ‡f2(x')

β€’ f1(x') = f2(x'): the subgradients form the line segment [βˆ‡f1(x'), βˆ‡f2(x')]
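A minimal Python sketch of this case analysis (added for illustration, not from the slides; the pieces f1, f2 and the midpoint tie-break are assumptions chosen for the example):

```python
import numpy as np

def subgrad_max(x, f1, f2, grad_f1, grad_f2):
    """Return one valid subgradient of f = max(f1, f2) at x.

    When the two pieces tie, any convex combination of their gradients is a
    subgradient; we return the midpoint of the segment for concreteness.
    """
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return grad_f1(x)
    if v2 > v1:
        return grad_f2(x)
    return 0.5 * (grad_f1(x) + grad_f2(x))

# Example: f1(x) = x^2 and f2(x) = (x - 2)^2 tie at x = 1, where the
# subdifferential is the segment [-2, 2].
f1, f2 = lambda x: x**2, lambda x: (x - 2.0)**2
g1, g2 = lambda x: 2.0 * x, lambda x: 2.0 * (x - 2.0)
print(subgrad_max(1.0, f1, f2, g1, g2))   # 0.0, which lies in [-2, 2]
```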

Page 4: Stochastic Subgradient Method - UCI

Subgradient methods

β€’ The subgradient method is a simple algorithm for minimizing a nondifferentiable convex function f

π‘₯(π‘˜+1) = π‘₯(π‘˜) βˆ’ π›Όπ‘˜π‘”π‘˜

π‘₯(π‘˜) is the π‘˜th iterate

𝑔 π‘˜ is any subgradient of 𝑓 at π‘₯ π‘˜

π›Όπ‘˜ > 0 is the π‘˜th step size

β€’ Not a descent method, so we keep track of the best point so far:

f_best^(k) = min{f(x^(1)), …, f(x^(k))}
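As a concrete illustration, here is a minimal Python sketch of this update with f_best tracking (not the authors' code; the test function f(x) = β€–xβ€–_1 and the 1/k step-size rule are assumptions for the example):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step_size, iters=1000):
    """Subgradient method x_{k+1} = x_k - alpha_k * g_k, returning the best
    point seen, since the iteration is not a descent method."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)                  # any subgradient of f at x
        x = x - step_size(k) * g        # the objective may go up here
        fx = f(x)
        if fx < f_best:                 # so remember the best iterate so far
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: minimize the nondifferentiable f(x) = ||x||_1.
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)          # one subgradient of ||x||_1 at x
x_best, f_best = subgradient_method(
    f, subgrad, x0=np.array([3.0, -2.0]), step_size=lambda k: 1.0 / k)
print(x_best, f_best)
```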

Page 5: Stochastic Subgradient Method - UCI

Why does it work? Convergence

Convergence results for different step-size rules:

β€’ Constant step size: Ξ±_k = Ξ± (a positive constant).

– Converges to a suboptimal solution: f_best^(k) βˆ’ f* ≀ (R^2 + G^2 k Ξ±^2) / (2kΞ±) β†’ G^2 Ξ± / 2 as k β†’ ∞.

β€’ Constant step length: Ξ±_k = Ξ³ / β€–g^(k)β€–_2 (i.e., β€–x^(k+1) βˆ’ x^(k)β€–_2 = Ξ³).

– Converges to a suboptimal solution: f_best^(k) βˆ’ f* ≀ (R^2 + Ξ³^2 k) / (2Ξ³k/G) β†’ GΞ³ / 2 as k β†’ ∞.

β€’ Square summable but not summable: Ξ£_{k=1}^∞ Ξ±_k^2 < ∞ and Ξ£_{k=1}^∞ Ξ±_k = ∞ (e.g., Ξ±_k = a/(b + k)).

β€’ Nonsummable diminishing step size: lim_{kβ†’βˆž} Ξ±_k = 0 and Ξ£_{k=1}^∞ Ξ±_k = ∞ (e.g., Ξ±_k = a/√k).

β€’ Nonsummable diminishing step length: Ξ±_k = Ξ³_k / β€–g^(k)β€–_2 with Ξ³_k nonsummable and diminishing.

– The last three rules converge to the optimal value: lim_{kβ†’βˆž} f_best^(k) βˆ’ f* = 0.

Assumptions:

β€’ f* = inf_x f(x) > βˆ’βˆž, with f(x*) = f*

β€’ β€–g^(k)β€–_2 ≀ G for all k (this holds if f satisfies the Lipschitz condition |f(u) βˆ’ f(v)| ≀ G β€–u βˆ’ vβ€–_2)

β€’ β€–x^(1) βˆ’ x*β€–_2 ≀ R

Page 6: Stochastic Subgradient Method - UCI

Stopping criterion

β€’ Terminating when (R^2 + G^2 Ξ£_{i=1}^k Ξ±_i^2) / (2 Ξ£_{i=1}^k Ξ±_i) ≀ Ξ΅ is really, really slow.

β€’ The choice Ξ±_i = (R/G)/√k, i = 1, …, k, minimizes this suboptimality bound.

β€’ The number of steps needed to achieve a guaranteed accuracy of Ξ΅ is then at least (RG/Ξ΅)^2, no matter what step sizes we use.

β€’ There really isn't a good stopping criterion for the subgradient method; it is used only for problems in which very high accuracy is not required, typically around 10%.

Page 7: Stochastic Subgradient Method - UCI

Subgradient method for unconstrained problem

1. Set π‘˜ ≔ 1 and. Choose an infinite sequence of positive step size value

*𝛼(π‘˜)+π‘˜=1∞ .

2. Compute a subgradient 𝑔 π‘˜ ∈ πœ•π‘“(π‘₯ π‘˜ ).

3. Update π‘₯(π‘˜+1) = π‘₯(π‘˜) βˆ’ π›Όπ‘˜π‘”π‘˜

4. If algorithm has not converged, then set π‘˜ ≔ π‘˜ + 1 and go to step 2.

Properties: Simple and general. Any subgradient from πœ•π‘“(π‘₯ π‘˜ ) is sufficient. Convergence theorems exist for properly selected step size Slow convergence. It usually requires a lot of iterations

No monotonic convergence. βˆ’π‘” π‘˜ need not be decreasing direction β†’ line search cannot be used to find optimal π›Όπ‘˜.

No good stopping condition.

Page 8: Stochastic Subgradient Method - UCI

Example: 𝐿2-norm Support Vector Machines

β€’ π’˜βˆ—, π‘βˆ—, πœ‰βˆ— = argminπ’˜,𝑏,πœ‰

1

2βˆ₯ π’˜ βˆ₯2 +𝐢 πœ‰π‘–

π‘šπ‘–=1 ,

s.t. 𝑦𝑖 π’˜π‘‡π‘₯𝑖 + 𝑏 β‰₯ 1 βˆ’ πœ‰π‘–, 𝑖 = 1,… ,π‘š

πœ‰π‘– β‰₯ 0, 𝑖 = 1,… ,π‘š

β€’ Note that πœ‰π‘– β‰₯ max *0,1 βˆ’ 𝑦𝑖 π’˜π‘‡π‘₯𝑖 + 𝑏 + and the equality is attained in the optimum, i.e., we can optimize

s.t. πœ‰π‘– = max 0,1 βˆ’ 𝑦𝑖 π’˜π‘‡π‘₯𝑖 + 𝑏 = β„“ π’˜, 𝒙𝑖 𝑦𝑖 (hinge loss)

β€’ We can reformulate the problem as an unconstrained convex problem

π’˜βˆ—, π‘βˆ— =argminπ’˜,𝑏

1

2βˆ₯ π’˜ βˆ₯2 +𝐢ℓ( π’˜, 𝒙𝑖 𝑦𝑖)

Page 9: Stochastic Subgradient Method - UCI

Example: 𝐿2-norm Support Vector Machines (cont'd)

β€’ The function f_0(w, b) = (1/2)β€–wβ€–^2 is differentiable, with βˆ‡_w f_0(w, b) = w and βˆ‡_b f_0(w, b) = 0.

β€’ The function f_i(w, b) = max{0, 1 βˆ’ y_i(w^T x_i + b)} = β„“(y_i(w^T x_i + b)) is a pointwise maximum, so its subdifferential consists of the convex combinations of the gradients of the active functions.

[Figure: the hinge loss β„“(x) = max{0, 1 βˆ’ x} compared with the 0/1 loss.]

Page 10: Stochastic Subgradient Method - UCI

Constant step length: Ξ±_t = Ξ³ / β€–g^(t)β€–_2

Converges to a suboptimal solution: lim_{tβ†’βˆž} f_best^(t) βˆ’ f* ≀ GΞ³ / 2

Diminishing step size: Ξ±_t = Ξ³ / √t

Converges to the optimal solution: lim_{tβ†’βˆž} f_best^(t) βˆ’ f* = 0

[Figures: f_best^(t) βˆ’ f* versus iteration number t under each step-size rule.]

Trade-off: a larger Ξ³ gives faster convergence, but a larger final suboptimality.
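A small synthetic experiment in the same spirit (not the slides' actual data or code; the piecewise-linear objective β€–Ax βˆ’ bβ€–_1, the value Ξ³ = 0.1, and the iteration count are assumptions) lets you compare the two rules directly:

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

f = lambda x: np.abs(A @ x - b).sum()            # nondifferentiable convex objective
subgrad = lambda x: A.T @ np.sign(A @ x - b)     # one subgradient of f at x

def run(step_rule, iters=2000):
    """Run the subgradient method and record f_best over the iterations."""
    x, f_best, history = np.zeros(5), np.inf, []
    for t in range(1, iters + 1):
        g = subgrad(x)
        x = x - step_rule(t, g) * g
        f_best = min(f_best, f(x))
        history.append(f_best)
    return history

gamma = 0.1
const_length = run(lambda t, g: gamma / np.linalg.norm(g))   # constant step length
diminishing = run(lambda t, g: gamma / np.sqrt(t))           # diminishing step size
print(const_length[-1], diminishing[-1])                     # f_best under each rule
```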

Page 11: Stochastic Subgradient Method - UCI

Subgradient method for constrained problem

β€’ Consider the constrained optimization problem

minimize f(x)

s.t. x ∈ S (S is a convex set)

β€’ The projected subgradient method is given by

x^(k+1) = P_S(x^(k) βˆ’ Ξ±_k g^(k))

where P_S is (Euclidean) projection onto S.

1. Set k := 1 and choose an infinite sequence of positive step sizes {Ξ±_k}_{k=1}^∞.

2. Compute a subgradient g^(k) ∈ βˆ‚f(x^(k)).

3. Update x^(k+1) = P_S(x^(k) βˆ’ Ξ±_k g^(k)).

4. If the algorithm has not converged, set k := k + 1 and go to step 2.
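A minimal Python sketch of this projected variant (added for illustration; the example set S, the unit Euclidean ball, and the test objective are assumptions, not from the slides):

```python
import numpy as np

def projected_subgradient(f, subgrad, project, x0, step_size, iters=1000):
    """Projected subgradient method: x_{k+1} = P_S(x_k - alpha_k * g_k)."""
    x = project(np.asarray(x0, dtype=float))
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = project(x - step_size(k) * subgrad(x))   # step, then project back onto S
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: minimize ||x - c||_1 over the unit ball S = {x : ||x||_2 <= 1}.
c = np.array([2.0, -1.0])
proj_ball = lambda x: x if np.linalg.norm(x) <= 1.0 else x / np.linalg.norm(x)
x_best, f_best = projected_subgradient(
    f=lambda x: np.abs(x - c).sum(),
    subgrad=lambda x: np.sign(x - c),
    project=proj_ball, x0=np.zeros(2), step_size=lambda k: 1.0 / np.sqrt(k))
print(x_best, f_best)
```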

Page 12: Stochastic Subgradient Method - UCI

Example: 𝐿2-norm Support Vector Machines

β€’ (w*, b*) = argmin_{w,b} (1/2)β€–wβ€–^2 + C Ξ£_{i=1}^m max{0, 1 βˆ’ y_i(w^T x_i + b)}

β€’ An alternative formulation of the same problem is

(w*, b*) = argmin_{w,b} Ξ£_{i=1}^m max{0, 1 βˆ’ y_i(w^T x_i + b)}

s.t. β€–wβ€– ≀ W

β€’ Note that the regularization constant C > 0 is replaced here by a less abstract constant W > 0.

β€’ To implement the projected subgradient method we only need a projection operator onto S = {(w, b) : β€–wβ€– ≀ W}:

P_S(w, b): w β†’ w Β· W/β€–wβ€– (when β€–wβ€– > W), b β†’ b

β€’ So after the update w := w βˆ’ Ξ±_k w' and b := b βˆ’ Ξ±_k b' (with (w', b') a subgradient of the objective), we rescale w := w Β· W/β€–wβ€– whenever β€–wβ€– > W.

Page 13: Stochastic Subgradient Method - UCI

[Figures: results for the constrained SVM with constant step length Ξ±_t = Ξ³ / β€–g^(t)β€–_2 and with diminishing step size Ξ±_t = Ξ³ / √t.]

Page 14: Stochastic Subgradient Method - UCI

Noisy unbiased subgradient

β€’ A random vector g ∈ R^n is a noisy unbiased subgradient of f: R^n β†’ R at x if, for all z,

f(z) β‰₯ f(x) + (E g)^T(z βˆ’ x)

i.e., E g ∈ βˆ‚f(x).

β€’ The noise can represent error in computing g, measurement noise, Monte Carlo sampling error, etc.

β€’ If x is also random, g is a noisy unbiased subgradient of f at x if, for all z,

f(z) β‰₯ f(x) + E[g | x]^T(z βˆ’ x)

holds almost surely.

β€’ This is the same as E[g | x] ∈ βˆ‚f(x) (a.s.).
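In the finite-sum problems common in machine learning, the usual way to obtain such a noisy unbiased subgradient is to sample terms. A minimal sketch (assuming, for illustration, that f is an average of m hinge-type terms and that one index is drawn uniformly at random):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_subgradient(w, X, y):
    """Noisy unbiased subgradient of f(w) = (1/m) * sum_i max(0, 1 - y_i * w^T x_i):
    pick one term uniformly at random and return a subgradient of that term.
    Averaging over the random index gives a subgradient of f, so the expectation
    of the returned vector lies in the subdifferential of f at w."""
    i = rng.integers(len(y))
    if y[i] * (X[i] @ w) < 1.0:
        return -y[i] * X[i]
    return np.zeros_like(w)

# Tiny usage example with made-up data.
X = np.array([[1.0, 0.5], [-0.5, 2.0]])
y = np.array([1.0, -1.0])
print(noisy_subgradient(np.zeros(2), X, y))
```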

Page 15: Stochastic Subgradient Method - UCI

Stochastic subgradient method

β€’ The stochastic subgradient method is the subgradient method, using noisy unbiased subgradients:

x^(k+1) = x^(k) βˆ’ Ξ±_k g^(k)

β€’ x^(k) is the k-th iterate

β€’ g^(k) is any noisy unbiased subgradient of (convex) f at x^(k)

β€’ Ξ±_k > 0 is the k-th step size

β€’ Define f_best^(k) = min{f(x^(1)), …, f(x^(k))}

Page 16: Stochastic Subgradient Method - UCI

Assumptions

β€’ π‘“βˆ— = infπ‘₯

𝑓(π‘₯) > βˆ’βˆž, with 𝑓 π‘₯βˆ— = π‘“βˆ—

β€’ 𝑬 βˆ₯ 𝑔 π‘˜ βˆ₯22≀ 𝐺2for all π‘˜

β€’ 𝑬 βˆ₯ π‘₯ 1 βˆ’ π‘₯βˆ— βˆ₯22≀ 𝑅2

β€’ Step sizes are square-summable but not summable

π›Όπ‘˜ β‰₯ 0, π›Όπ‘˜2 =βˆ₯ 𝛼 βˆ₯2

2< ∞, π›Όπ‘˜ = ∞

∞

π‘˜=1

∞

π‘˜=1

These assumptions are stronger than needed, just to simplify proofs

Page 17: Stochastic Subgradient Method - UCI

Convergence results

β€’ Convergence in expectation:

E f_best^(k) βˆ’ f* ≀ (R^2 + G^2 β€–Ξ±β€–_2^2) / (2 Ξ£_{i=1}^k Ξ±_i),   so   lim_{kβ†’βˆž} E f_best^(k) = f*

β€’ Convergence in probability (by Markov's inequality): for any Ξ΅ > 0,

lim_{kβ†’βˆž} Prob(f_best^(k) β‰₯ f* + Ξ΅) = 0

β€’ Almost sure convergence:

lim_{kβ†’βˆž} f_best^(k) = f* (a.s.)

Page 18: Stochastic Subgradient Method - UCI

Example: Machine learning problem

β€’ Minimize the (empirical) loss

min_w L(w) = (1/m) Ξ£_{i=1}^m loss(w, x_i, y_i)

Subgradient estimate: g^(k) = βˆ‡_w loss(w^(k), x_i, y_i), with the index i drawn uniformly at random (so the estimate is a noisy unbiased subgradient).

β€’ Example: 𝐿2-regularized linear classification with hinge loss, a.k.a. SVM:

min_{β€–wβ€–_2 ≀ B} (1/m) Ξ£_{i=1}^m β„“(⟨w, x_i⟩ y_i)   ≑   min_w (1/m) Ξ£_{i=1}^m β„“(⟨w, x_i⟩ y_i) + (Ξ»/2)β€–wβ€–^2

Algorithm: initialize at w^(0). At iteration k:

w^(k+1) ← w^(k) βˆ’ Ξ±_k g^(k)

If β€–w^(k+1)β€– > B: w^(k+1) ← B w^(k+1) / β€–w^(k+1)β€–

Page 19: Stochastic Subgradient Method - UCI

Stochastic vs Batch Gradient Descent

π’˜ =1

π‘š π’˜π‘–

π‘š

𝑖=1

Page 20: Stochastic Subgradient Method - UCI

Stochastic vs Batch Gradient Descent

β€’ Runtime to guarantee an error Ξ΅ (with step size Ξ±_k = (B/G)/√k):

– Stochastic gradient descent: O(G^2 B^2 / Ξ΅^2 Β· d)

– Batch gradient descent: O(G^2 B^2 / Ξ΅^2 Β· m d)

β€’ This is the optimal runtime if we only use gradients and only assume a Lipschitz condition.

β€’ Slower than second-order methods.

Page 21: Stochastic Subgradient Method - UCI

Stochastic programming

minimize E f_0(x, Ο‰)

subject to E f_i(x, Ο‰) ≀ 0, i = 1, …, m

If f_i(x, Ο‰) is convex in x for each Ο‰, the problem is convex.

β€’ The β€œcertainty-equivalent” problem

minimize f_0(x, E Ο‰)

subject to f_i(x, E Ο‰) ≀ 0, i = 1, …, m

(if f_i(x, Ο‰) is convex in Ο‰, this gives a lower bound on the optimal value of the stochastic problem)

Page 22: Stochastic Subgradient Method - UCI

Machine learning as stochastic programming

Minimizing the generalization error: min_{w ∈ 𝒲} E[f(w, z)]

β€’ Two approaches:

– Sample Average Approximation (SAA):

β€’ Collect a sample z_1, …, z_m

β€’ Minimize F(w) = (1/m) Ξ£_{i=1}^m f(w, z_i), taking ŵ = argmin_w F(w)

β€’ This is minimizing the empirical risk

β€’ Use stochastic/batch gradient descent, Newton's method, …

– Stochastic Approximation (SA):

β€’ Update w^(k) based on a weak estimator of βˆ‡F(w^(k)), namely g^(k) = βˆ‡f(w^(k), z_k)

β€’ Stochastic gradient descent, reporting wΜ„^(m) = (1/m) Ξ£_{i=1}^m w^(i)


Page 24: Stochastic Subgradient Method - UCI

SAA vs SA

β€’ Runtime to guarantee an error Ξ΅, i.e., L(w) ≀ L(w*) + Ξ΅:

– SAA (empirical risk minimization):

β€’ Sample size needed to guarantee L(ŵ) ≀ L(w*) + Ξ΅: m = Ξ©(G^2 B^2 / Ξ΅^2)

β€’ Using any method (stochastic/batch subgradient descent, Newton's method, …), the runtime is Ξ©(G^2 B^2 / Ξ΅^2 Β· d)

– SA (single-pass stochastic gradient descent):

β€’ After k iterations, L(wΜ„^(k)) ≀ L(w*) + O(√(G^2 B^2 / k))

β€’ Runtime: O(G^2 B^2 / Ξ΅^2 Β· d)

Page 25: Stochastic Subgradient Method - UCI

References

β€’ S. Boyd et al. Subgradient Methods; Stochastic Subgradient Methods. Lecture slides and notes for EE364b, Stanford University, Spring Quarter 2007–08.

β€’ Vojtech Franc. A Short Tutorial on Subgradient Methods. IDA retreat talk, May 16, 2007.

β€’ Nathan Srebro and Ambuj Tewari. Stochastic Optimization. ICML 2010 Tutorial, July 2010.