stochastic subgradient method - uci
TRANSCRIPT
Stochastic Subgradient Method
Lingjie Weng, Yutian Chen
Bren School of Information and Computer Science, UC Irvine
Subgradient
• Recall the basic inequality for a convex differentiable f:
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)
• The gradient ∇f(x) at x determines a global under-estimator of f.
What if f is not differentiable?
• We can still construct a global under-estimator: the vector g is a subgradient of f at x if
f(y) ≥ f(x) + gᵀ(y − x) for all y
• There can be more than one subgradient at a given point: the collection of subgradients of f at x is the subdifferential ∂f(x).
An example: non-differentiable function and its subgradients
• f₁(x′) > f₂(x′): unique subgradient g = ∇f₁(x′)
• f₁(x′) < f₂(x′): unique subgradient g = ∇f₂(x′)
• f₁(x′) = f₂(x′): the subgradients form the line segment [∇f₁(x′), ∇f₂(x′)]
f = max{f₁, f₂}, with f₁, f₂ convex and differentiable
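As a quick numeric sketch of the pointwise-maximum rule (the quadratics f₁, f₂ below are a hypothetical example, not from the slides), we can pick a valid subgradient of f = max{f₁, f₂} and check the defining inequality f(y) ≥ f(x) + g(y − x) on a grid:

```python
# Sketch (hypothetical example): subgradients of f = max(f1, f2)
# with f1(x) = (x - 1)**2 and f2(x) = (x + 1)**2, both convex.

def grad_f1(x):
    return 2.0 * (x - 1.0)

def grad_f2(x):
    return 2.0 * (x + 1.0)

def subgradient_of_max(x):
    """Return one valid subgradient of max(f1, f2) at x."""
    f1, f2 = (x - 1.0) ** 2, (x + 1.0) ** 2
    if f1 > f2:
        return grad_f1(x)      # unique subgradient: grad f1
    if f2 > f1:
        return grad_f2(x)      # unique subgradient: grad f2
    # tie: any convex combination of the two gradients is a subgradient
    return 0.5 * (grad_f1(x) + grad_f2(x))

def f(x):
    return max((x - 1.0) ** 2, (x + 1.0) ** 2)

# Verify the subgradient inequality f(y) >= f(x) + g*(y - x) at the
# tie point x0 = 0, where f1(0) == f2(0) and the subgradient is not unique.
x0 = 0.0
g = subgradient_of_max(x0)
ok = all(f(y) >= f(x0) + g * (y - x0) - 1e-12
         for y in [i / 10.0 for i in range(-30, 31)])
```

At points where one function is strictly larger, the routine returns that function's gradient, matching the three cases above.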
Subgradient methods
• The subgradient method is a simple algorithm for minimizing a nondifferentiable convex function f:
x^(k+1) = x^(k) − α_k g^(k)
x^(k) is the kth iterate
g^(k) is any subgradient of f at x^(k)
α_k > 0 is the kth step size
• Not a descent method, so we keep track of the best point so far:
f_best^(k) = min{f(x^(1)), …, f(x^(k))}
Why does it work? Convergence
Convergence results for different step sizes
• Constant step size: α_k = α (a positive constant).
– Converges to a suboptimal solution: f_best^(k) − f* ≤ (R² + G²kα²)/(2kα) → G²α/2 as k → ∞
• Constant step length: α_k = γ/‖g^(k)‖₂ (i.e., ‖x^(k+1) − x^(k)‖₂ = γ).
– Converges to a suboptimal solution: f_best^(k) − f* ≤ (R² + γ²k)/(2γk/G) → Gγ/2 as k → ∞
• Square summable but not summable: Σ_{k=1}^∞ α_k² < ∞, Σ_{k=1}^∞ α_k = ∞ (e.g., α_k = a/(b + k))
• Nonsummable diminishing step size: lim_{k→∞} α_k = 0 and Σ_{k=1}^∞ α_k = ∞ (e.g., α_k = a/√k)
• Nonsummable diminishing step length: α_k = γ_k/‖g^(k)‖₂, with γ_k nonsummable diminishing
– The last three rules converge to the optimal solution: lim_{k→∞} f_best^(k) − f* = 0
Assumptions:
• f* = inf_x f(x) > −∞, with f(x*) = f*
• ‖g^(k)‖₂ ≤ G for all k (holds if f satisfies the Lipschitz condition |f(u) − f(v)| ≤ G‖u − v‖₂)
• ‖x^(1) − x*‖₂ ≤ R
Stopping criterion
• Terminating when (R² + G² Σ_{i=1}^k α_i²)/(2 Σ_{i=1}^k α_i) ≤ ε is really, really slow.
• α_i = (R/G)/√k, i = 1, …, k minimizes the suboptimality bound.
• The number of steps needed to achieve a guaranteed accuracy of ε is at least (RG/ε)², no matter what step sizes we use.
• There really isn't a good stopping criterion for the subgradient method; it is used only for problems in which very high accuracy is not required, typically around 10%.
Subgradient method for unconstrained problem
1. Set k ← 1 and choose an infinite sequence of positive step sizes {α_k}_{k=1}^∞.
2. Compute a subgradient g^(k) ∈ ∂f(x^(k)).
3. Update x^(k+1) = x^(k) − α_k g^(k).
4. If the algorithm has not converged, set k ← k + 1 and go to step 2.
Properties:
Simple and general: any subgradient from ∂f(x^(k)) is sufficient, and convergence theorems exist for properly selected step sizes.
Slow convergence: it usually requires a lot of iterations.
No monotonic convergence: −g^(k) need not be a descent direction, so line search cannot be used to find an optimal α_k.
No good stopping condition.
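The four steps above can be sketched as a reusable routine (a minimal sketch with assumed interfaces; the ℓ1-type example objective is illustrative, not from the slides):

```python
# Minimal sketch of steps 1-4: a generic subgradient method that
# tracks the best point, since the method is not a descent method.

def subgradient_method(f, subgrad, x0, step_sizes, max_iters):
    """Run the subgradient method; return the best iterate and value."""
    x = list(x0)
    x_best, f_best = list(x), f(x)
    for k in range(1, max_iters + 1):                 # step 4: loop
        g = subgrad(x)                                # step 2: any subgradient
        a = step_sizes(k)                             # step 1: chosen rule
        x = [xi - a * gi for xi, gi in zip(x, g)]     # step 3: update
        if f(x) < f_best:                             # keep the best point
            x_best, f_best = list(x), f(x)
    return x_best, f_best

# Illustrative problem: minimize the nondifferentiable f(w) = |w1| + |w2|.
f = lambda w: abs(w[0]) + abs(w[1])
subgrad = lambda w: [(wi > 0) - (wi < 0) for wi in w]  # componentwise sign

x_best, f_best = subgradient_method(f, subgrad, [1.0, -2.0],
                                    lambda k: 0.5 / k ** 0.5, 5000)
```

With the nonsummable diminishing rule α_k = 0.5/√k the best value approaches the optimum f* = 0, though the iterates themselves oscillate.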
Example: L2-norm Support Vector Machines
• (w*, b*, ξ*) = argmin_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i,
s.t. y_i(wᵀx_i + b) ≥ 1 − ξ_i, i = 1, …, m
ξ_i ≥ 0, i = 1, …, m
• Note that ξ_i ≥ max{0, 1 − y_i(wᵀx_i + b)} and equality is attained at the optimum, i.e., we can optimize with
ξ_i = max{0, 1 − y_i(wᵀx_i + b)} = ℓ(⟨w, x_i⟩ y_i) (hinge loss)
• We can reformulate the problem as an unconstrained convex problem:
(w*, b*) = argmin_{w,b} (1/2)‖w‖² + C Σ_i ℓ(⟨w, x_i⟩ y_i)
Example: L2-norm Support Vector Machines (cont')
• The function f₀(w, b) = (1/2)‖w‖² is differentiable, thus ∇_w f₀(w, b) = w and ∇_b f₀(w, b) = 0.
• The function f_i(w, b) = max{0, 1 − y_i(wᵀx_i + b)} = ℓ(y_i(wᵀx_i + b)) is a pointwise maximum, thus its subdifferential is the convex combination of the gradients of the active functions.
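Putting the two pieces together, one subgradient of the unconstrained SVM objective can be assembled as follows (a sketch; the tiny two-point dataset is a hypothetical example, not from the slides):

```python
# Sketch: one subgradient of the unconstrained L2-SVM objective
# f(w, b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w.x_i + b)).

def svm_subgradient(w, b, X, y, C):
    """Return a subgradient (g_w, g_b) of the hinge-loss SVM objective."""
    g_w = list(w)                      # gradient of the 0.5 * ||w||^2 term
    g_b = 0.0                          # that term does not depend on b
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        if margin < 1:                 # hinge term active: its gradient is
            for j in range(len(w)):    # -y_i * x_i (at the kink margin == 1,
                g_w[j] -= C * y_i * x_i[j]  # any convex combination is valid)
            g_b -= C * y_i
    return g_w, g_b

# Tiny hypothetical dataset: two points with labels +1 / -1.
X, y = [[1.0, 0.0], [-1.0, 0.0]], [1, -1]
g_w, g_b = svm_subgradient([0.0, 0.0], 0.0, X, y, C=1.0)
```

At w = 0, b = 0 both hinge terms are active, so the subgradient is just the sum of the active terms' gradients.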
[Figure: the 0/1 loss and the hinge loss ℓ(x)]
Constant step length: α_t = γ/‖g^(t)‖₂
Converges to a suboptimal solution: lim_{t→∞} f_best^(t) − f* ≤ Gγ/2
Diminishing step size: α_t = γ/√t
Converges to the optimal solution: lim_{t→∞} f_best^(t) − f* = 0
[Figures: f_best^(t) − f* versus iteration number t, for each step size rule]
Trade-off: larger γ gives faster convergence, but (with the constant step length) larger final suboptimality
Subgradient method for constrained problem
• Let us consider the constrained optimization problem
minimize f(x)
s.t. x ∈ S (S is a convex set)
• The projected subgradient method is given by
x^(k+1) = P_S(x^(k) − α_k g^(k))
where P_S is the (Euclidean) projection onto S.
1. Set k ← 1 and choose an infinite sequence of positive step sizes {α_k}_{k=1}^∞.
2. Compute a subgradient g^(k) ∈ ∂f(x^(k)).
3. Update x^(k+1) = P_S(x^(k) − α_k g^(k)).
4. If the algorithm has not converged, set k ← k + 1 and go to step 2.
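A small sketch of the projected update (the linear objective and the unit-ball constraint are a hypothetical example, not from the slides):

```python
# Sketch (hypothetical problem): projected subgradient method minimizing
# the linear f(x) = x1 + x2 over the Euclidean unit ball S = {x : ||x||_2 <= 1}.
import math

def project_unit_ball(x):
    """Euclidean projection P_S onto the unit ball."""
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x] if n > 1 else list(x)

f = lambda x: x[0] + x[1]
g = [1.0, 1.0]                    # f is linear, so its subgradient is constant

x = [0.0, 0.0]
f_best = f(x)
for k in range(1, 2001):
    step = 1.0 / math.sqrt(k)     # nonsummable diminishing step size
    x = project_unit_ball([xi - step * gi for xi, gi in zip(x, g)])
    f_best = min(f_best, f(x))

# The optimum is x* = (-1/sqrt(2), -1/sqrt(2)) with f* = -sqrt(2).
```

Each iteration first takes the usual subgradient step and then projects back into S, so every iterate is feasible.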
Example: L2-norm Support Vector Machines
• (w*, b*) = argmin_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m max{0, 1 − y_i(wᵀx_i + b)}
• An alternative formulation of the same problem is
(w*, b*) = argmin_{w,b} Σ_i max{0, 1 − y_i(wᵀx_i + b)}
s.t. ‖w‖ ≤ R
• Note that the regularization constant C > 0 is replaced here by a less abstract constant R > 0.
• To implement the subgradient method we need only a projection operator
P_S(w, b) = (w · R/‖w‖, b) if ‖w‖ > R, else (w, b)
• So after the update w ← w − α_k g_w and b ← b − α_k g_b, we project w := w · R/‖w‖ whenever ‖w‖ > R.
[Figures: constant step length α_t = γ/‖g^(t)‖₂ vs. diminishing step size α_t = γ/√t]
Noisy unbiased subgradient
• A random vector g ∈ ℝⁿ is a noisy unbiased subgradient for f: ℝⁿ → ℝ at x if for all z
f(z) ≥ f(x) + (𝔼g)ᵀ(z − x)
i.e., g̃ = 𝔼g ∈ ∂f(x)
• The noise can represent error in computing a subgradient, measurement noise, Monte Carlo sampling error, etc.
• If x is also random, g is a noisy unbiased subgradient of f at x if for all z
f(z) ≥ f(x) + 𝔼[g | x]ᵀ(z − x)
holds almost surely
• Same as 𝔼[g | x] ∈ ∂f(x) (a.s.)
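To make the definition concrete, here is a sketch (the Gaussian noise model is an assumption for illustration, not from the slides) of a noisy unbiased subgradient of f(x) = |x|, checked by Monte Carlo averaging:

```python
# Sketch (hypothetical noise model): a noisy unbiased subgradient of
# f(x) = |x|, built by adding zero-mean noise to a true subgradient.
# Its expectation lies in the subdifferential of f.
import random

random.seed(0)

def noisy_subgradient(x):
    true_g = (x > 0) - (x < 0)              # a subgradient of |x|
    return true_g + random.gauss(0.0, 0.5)  # zero-mean Monte Carlo noise

# The empirical mean of many draws should approach sign(2.0) = 1,
# which is the (unique) subgradient of |x| at x = 2.
samples = [noisy_subgradient(2.0) for _ in range(20000)]
mean_g = sum(samples) / len(samples)
```

Individual draws are not subgradients at all, but their expectation is, which is exactly what the stochastic subgradient method needs.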
Stochastic subgradient method
• The stochastic subgradient method is the subgradient method, run with noisy unbiased subgradients:
x^(k+1) = x^(k) − α_k g^(k)
• x^(k) is the kth iterate
• g^(k) is any noisy unbiased subgradient of (convex) f at x^(k)
• α_k > 0 is the kth step size
• Define f_best^(k) = min{f(x^(1)), …, f(x^(k))}
Assumptions
• f* = inf_x f(x) > −∞, with f(x*) = f*
• 𝔼‖g^(k)‖₂² ≤ G² for all k
• 𝔼‖x^(1) − x*‖₂² ≤ R²
• Step sizes are square-summable but not summable:
α_k ≥ 0, Σ_{k=1}^∞ α_k² = ‖α‖₂² < ∞, Σ_{k=1}^∞ α_k = ∞
These assumptions are stronger than needed; they just simplify the proofs.
Convergence results
• Convergence in expectation:
𝔼f_best^(k) − f* ≤ (R² + G²‖α‖₂²)/(2 Σ_{i=1}^k α_i), lim_{k→∞} 𝔼f_best^(k) = f*
• Convergence in probability (by Markov's inequality): for any ε > 0,
lim_{k→∞} Prob(f_best^(k) ≥ f* + ε) = 0
• Almost sure convergence:
lim_{k→∞} f_best^(k) = f* (a.s.)
Example: Machine learning problem
• Minimizing a loss function
min_w L(w) = (1/n) Σ_{i=1}^n loss(w, x_i, y_i)
Subgradient estimate: ĝ^(t) = ∇_w loss(w^(t), x_i, y_i) for a randomly sampled example i
• Example: L2-regularized linear classification with hinge loss, a.k.a. SVM:
min_{‖w‖₂ ≤ B} (1/n) Σ_{i=1}^n ℓ(⟨w, x_i⟩ y_i) ≡ min_w (1/n) Σ_{i=1}^n ℓ(⟨w, x_i⟩ y_i) + (λ/2)‖w‖²
Initialize w^(1). At iteration t:
w^(t+1) ← w^(t) − α_t ĝ^(t)
If ‖w^(t+1)‖ > B: w^(t+1) ← B w^(t+1)/‖w^(t+1)‖
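The iteration above can be sketched end to end (the four-point dataset and the constants B and G are hypothetical, chosen only so the toy problem is separable; this is not the slides' experiment):

```python
# Sketch: stochastic subgradient descent on the hinge loss with
# projection onto the norm ball ||w|| <= B, on a tiny separable dataset.
import math, random

random.seed(1)
X = [[2.0, 0.5], [1.5, 1.0], [-2.0, -0.5], [-1.0, -1.5]]
y = [1, 1, -1, -1]
B, G = 2.0, 3.0                       # assumed norm bound and subgradient bound

w = [0.0, 0.0]
for t in range(1, 2001):
    i = random.randrange(len(X))      # sample one example: unbiased estimate
    margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
    g = [-y[i] * xj if margin < 1 else 0.0 for xj in X[i]]  # hinge subgradient
    step = (B / G) / math.sqrt(t)
    w = [wj - step * gj for wj, gj in zip(w, g)]
    norm = math.sqrt(sum(wj * wj for wj in w))
    if norm > B:                      # project back onto the norm ball
        w = [B * wj / norm for wj in w]

# Count training errors of the final iterate.
train_errors = sum(y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) <= 0
                   for i in range(len(X)))
```

Each step touches a single example, so the per-iteration cost is independent of the dataset size.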
Stochastic vs Batch Gradient Descent
g = (1/n) Σ_{i=1}^n g_i
(the batch gradient averages the per-example gradients g_i; a stochastic step uses a single randomly chosen g_i)
Stochastic vs Batch Gradient Descent
• Runtime to guarantee an error ε (with α_t = (B/G)/√t):
– Stochastic Gradient Descent: O(G²B²/ε² · d)
– Batch Gradient Descent: O(G²B²/ε² · nd)
• This is the optimal runtime if we only use gradients and only assume the Lipschitz condition.
• Slower than second-order methods.
Stochastic programming
minimize 𝔼f₀(x, ω)
subject to 𝔼f_i(x, ω) ≤ 0, i = 1, …, m
If f_i(x, ω) is convex in x for each ω, the problem is convex.
• "Certainty-equivalent" problem:
minimize f₀(x, 𝔼ω)
subject to f_i(x, 𝔼ω) ≤ 0, i = 1, …, m
(if f_i(x, ω) is convex in ω, this gives a lower bound on the optimal value of the stochastic problem)
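The lower-bound remark is Jensen's inequality: for f₀ convex in ω, f₀(x, 𝔼ω) ≤ 𝔼f₀(x, ω). A tiny numeric sketch (the quadratic f₀ and the scenario list are a hypothetical example, not from the slides):

```python
# Sketch (hypothetical example): with f0(x, w) = (x - w)**2, convex in w,
# the certainty-equivalent value f0(x, E[w]) lower-bounds the stochastic
# value E[f0(x, w)] (Jensen's inequality).

omegas = [0.0, 1.0, 2.0, 5.0]          # equally likely scenarios for omega
x = 1.5

f0 = lambda x, w: (x - w) ** 2
stochastic_value = sum(f0(x, w) for w in omegas) / len(omegas)
certainty_equiv = f0(x, sum(omegas) / len(omegas))  # plug in E[omega] = 2.0
```

Here the certainty-equivalent problem evaluates the objective at the mean scenario, which never overestimates the true stochastic objective.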
Machine learning as stochastic programming
Minimizing the generalization error: min_{h∈H} 𝔼[f(h, z)]
• Two approaches:
– Sample Average Approximation (SAA):
• Collect samples z₁, …, z_T
• Minimize F̂(h) = (1/T) Σ_{t=1}^T f(h, z_t), ĥ = argmin_h F̂(h)
• This is minimizing the empirical risk
• Solve by stochastic/batch gradient descent, Newton's method, …
– Stochastic Approximation (SA):
• Update h^(t) based on a weak estimator of ∇F(h^(t)): ĝ^(t) = ∇f(h^(t), z_t)
• Stochastic gradient descent, with output h̄ = (1/T) Σ_{t=1}^T h^(t)
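The two approaches can be contrasted on a toy problem (a hypothetical example, not from the slides): min_h 𝔼[(h − z)²], whose minimizer is h* = 𝔼z. SAA solves the empirical average exactly (its minimizer is the sample mean); SA makes a single pass of stochastic gradient steps and averages the iterates.

```python
# Sketch (hypothetical problem): SAA vs SA for min_h E[(h - z)^2],
# whose minimizer is h* = E[z] = 3.0 under the sampling below.
import random

random.seed(2)
z_samples = [random.gauss(3.0, 1.0) for _ in range(5000)]

# SAA: minimize the empirical average (1/T) * sum_t (h - z_t)^2 exactly;
# its closed-form minimizer is the sample mean.
h_saa = sum(z_samples) / len(z_samples)

# SA: single-pass stochastic gradient descent with iterate averaging.
# With step 0.5/t, each update h <- h + (1/t)*(z - h) is the running mean.
h, h_sum = 0.0, 0.0
for t, z in enumerate(z_samples, start=1):
    h -= (0.5 / t) * 2.0 * (h - z)     # stochastic gradient of (h - z)^2
    h_sum += h
h_sa = h_sum / len(z_samples)          # averaged iterate h-bar
```

Both estimates land near h* = 3; SA never stores the dataset, touching each sample exactly once.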
SAA vs SA
• Runtime to guarantee an error ε (i.e., L(ĥ) ≤ L(h*) + ε):
– SAA (Empirical Risk Minimization):
• Sample size needed to guarantee L(ĥ) ≤ L(h*) + ε: T = Ω(G²B²/ε²)
• Using any method (stochastic/batch subgradient descent, Newton's method, …), runtime: Ω(G²B²/ε² · d)
– SA (single-pass Stochastic Gradient Descent):
• After T iterations, L(h̄^(T)) ≤ L(h*) + O(√(G²B²/T))
• Runtime: O(G²B²/ε² · d)
References
• S. Boyd et al. Subgradient Methods; Stochastic Subgradient Methods. Lecture slides and notes for EE364b, Stanford University, Spring Quarter 2007-08.
• Vojtech Franc. A Short Tutorial on Subgradient Methods. Talk at the IDA retreat, May 16, 2007.
• Nathan Srebro and Ambuj Tewari. Stochastic Optimization. ICML 2010 Tutorial, July 2010.