Tutorial: Part 2
Optimization for Machine Learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora & Yoram Singer
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad): works for non-smooth & non-convex objectives (a small sketch follows this list)
2. Variance reduction: uses the special ERM structure; very effective for smooth & convex objectives
3. Acceleration/momentum: smooth convex only; general-purpose optimization since the 80's
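For concreteness, here is a minimal sketch of diagonal AdaGrad, the adaptive-regularization method named in item 1. The step size eta, the horizon T, and the toy quadratic objective are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def adagrad(grad, x0, eta=0.5, T=500, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate step sizes eta / sqrt(sum of past squared gradients)."""
    x = x0.copy()
    g2_sum = np.zeros_like(x0)
    for _ in range(T):
        g = grad(x)
        g2_sum += g ** 2                              # accumulate squared gradients per coordinate
        x = x - eta * g / (np.sqrt(g2_sum) + eps)
    return x

# Toy objective with very different curvature per coordinate: f(x) = sum_i scales_i * x_i^2.
scales = np.array([100.0, 1.0, 0.01])
grad_f = lambda x: 2 * scales * x
print(adagrad(grad_f, x0=np.ones(3)))                 # all coordinates shrink despite the scale mismatch
```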
Condition number of convex functions

The condition number of a convex function is defined as γ = β/α, where (simplified)

0 ≺ αI ≼ ∇²f(x) ≼ βI

α = strong convexity, β = smoothness.

Non-convex smooth functions (simplified):

−βI ≼ ∇²f(x) ≼ βI

Why do we care?
Well-conditioned functions exhibit much faster optimization! (but equivalent via reductions)
Examples
The descent lemma for β-smooth functions (algorithm: x_{t+1} = x_t − η∇_t):

f(x_{t+1}) − f(x_t) ≤ ∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²
  = −(η − βη²)‖∇_t‖² = −(1/(4β))‖∇_t‖²   (for η = 1/(2β))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

−2M ≤ f(x_T) − f(x_1) = Σ_t [f(x_{t+1}) − f(x_t)] ≤ −(1/(4β)) Σ_t ‖∇_t‖²

Thus, there exists a t for which

‖∇_t‖² ≤ 8Mβ/T
Smooth gradient descent
Conclusions: for x_{t+1} = x_t − η∇_t and T = Ω(1/ε), gradient descent finds a point with
‖∇_t‖² ≤ ε
1. This holds even for non-convex functions.
2. For convex functions it implies f(x_t) − f(x*) ≤ O(ε) (and faster rates hold for smooth convex functions!).
(A small sketch of this guarantee follows.)
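As an illustration of this guarantee, here is a minimal sketch of gradient descent with the step size η = 1/(2β) used in the descent-lemma calculation, run on a toy smooth non-convex objective and returning the iterate with the smallest gradient norm. The objective, the smoothness bound, and the horizon are illustrative assumptions.

```python
import numpy as np

def gradient_descent_min_grad(grad, x0, beta, T):
    """Run x_{t+1} = x_t - eta * grad(x_t) with eta = 1/(2*beta) and return the
    iterate with the smallest squared gradient norm seen (the 'exists t' guarantee)."""
    eta = 1.0 / (2.0 * beta)
    x, best_x, best_g2 = x0, x0, np.inf
    for _ in range(T):
        g = grad(x)
        g2 = float(g @ g)
        if g2 < best_g2:
            best_x, best_g2 = x, g2
        x = x - eta * g
    return best_x, best_g2

# Toy smooth, non-convex objective f(x) = ||x||^2 + cos(sum(x)); its Hessian is
# 2I - cos(sum(x)) * 11^T, so beta = 2 + d bounds the smoothness for d = 5.
grad_f = lambda x: 2 * x - np.sin(np.sum(x))
x_best, g2_best = gradient_descent_min_grad(grad_f, x0=np.ones(5), beta=7.0, T=500)
print(g2_best)   # small squared gradient norm, consistent with the 8*M*beta/T bound
```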
Smooth gradient descent
The descent lemma for β-smooth functions with a stochastic gradient oracle (algorithm: x_{t+1} = x_t − η∇̂_t, with E[∇̂_t] = ∇_t):

E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²]
  = E[−η ∇_t⊤∇̂_t + β‖η∇̂_t‖²]
  = −η‖∇_t‖² + η²β E[‖∇̂_t‖²]
  = −η‖∇_t‖² + η²β (‖∇_t‖² + var(∇̂_t))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

T = O(Mβ/ε²)  ⇒  ∃ t ≤ T such that ‖∇_t‖² ≤ ε
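A minimal sketch of the stochastic variant: the gradient oracle is corrupted by zero-mean noise of variance σ², and a smaller step size keeps the η²β·var(∇̂) term in the bound above dominated by the −η‖∇‖² gain. The noise model, step-size rule, and iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad, x0, beta, sigma2, eps):
    """SGD with a noisy gradient oracle of variance sigma2.  The step size shrinks
    with eps so the eta^2 * beta * variance term stays small; the iteration budget
    scales like beta * sigma2 / eps^2, mirroring the T = O(M*beta/eps^2) bound above."""
    eta = eps / (2.0 * beta * sigma2)
    T = int(8 * beta * sigma2 / eps ** 2)
    x = x0
    for _ in range(T):
        g_hat = grad(x) + rng.normal(scale=np.sqrt(sigma2), size=x.shape)
        x = x - eta * g_hat
    return x

grad_f = lambda x: 2 * x                       # f(x) = ||x||^2, so beta = 2
x = sgd(grad_f, x0=np.ones(5), beta=2.0, sigma2=1.0, eps=0.1)
print(np.linalg.norm(grad_f(x)) ** 2)          # on the order of eps
```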
Non-convex stochastic gradient descent
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:

x_{t+1} = x_t − η (∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0))

Every so often, compute a full gradient and restart at a new x_0.
Theorem [Schmidt, Le Roux, Bach '12; Johnson & Zhang '13; Mahdavi, Zhang, Jin '13]: Variance reduction for well-conditioned functions,

0 ≺ αI ≼ ∇²f(x) ≼ βI,   γ = β/α,

produces an ε-approximate solution in time

O((m + γ) d log(1/ε))
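Below is a minimal sketch of this variance-reduction scheme (SVRG-style) for a least-squares ERM objective: once per epoch a full gradient is computed at a snapshot, and the cheap inner steps use the combined low-variance estimator from the update above. The data, step size, and epoch length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))                      # m = 200 examples, d = 10 features
b = rng.normal(size=200)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]      # gradient of example i's loss
full_grad = lambda x: A.T @ (A @ x - b) / len(b)    # full gradient (average over examples)

def svrg(x0, eta=0.01, epochs=30, epoch_len=200):
    """Each epoch: full gradient at the snapshot, then cheap steps with the estimator
    g = grad_i(x) - grad_i(x_snap) + full_grad(x_snap), whose variance vanishes at the optimum."""
    x = x0
    for _ in range(epochs):
        x_snap = x.copy()
        mu = full_grad(x_snap)                      # the expensive full gradient, once per epoch
        for _ in range(epoch_len):
            i = rng.integers(len(b))
            g = grad_i(x, i) - grad_i(x_snap, i) + mu
            x = x - eta * g
    return x

x = svrg(np.zeros(10))
print(np.linalg.norm(full_grad(x)))                 # near-zero gradient of the ERM objective
```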
Controlling the variance: interpolating GD and SGD
γ should be interpreted as 1/ε.
• Optimal gradient complexity (smooth, convex)
• Modest practical improvements; non-convex "momentum" methods
• With variance reduction, the fastest possible running time of first-order methods:
O((m + √(γm)) d log(1/ε))
[Woodworth, Srebro '15] – tight lower bound with a gradient oracle
Acceleration/momentum [Nesterov '83]
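A minimal sketch of Nesterov-style acceleration on a smooth convex quadratic: a gradient step is taken at a look-ahead point, followed by an extrapolation (momentum) step. The momentum schedule (t − 1)/(t + 2) is one standard choice; the quadratic and the horizon are illustrative assumptions.

```python
import numpy as np

def nesterov(grad, x0, beta, T):
    """Accelerated gradient descent: gradient step at the look-ahead point y,
    then extrapolate with momentum coefficient (t - 1) / (t + 2)."""
    x, y = x0.copy(), x0.copy()
    for t in range(1, T + 1):
        x_next = y - (1.0 / beta) * grad(y)                  # gradient step at y
        y = x_next + (t - 1.0) / (t + 2.0) * (x_next - x)    # momentum / extrapolation
        x = x_next
    return x

# Illustrative ill-conditioned quadratic f(x) = 0.5 * x' H x with smoothness beta = 100.
H = np.diag(np.linspace(1.0, 100.0, 20))
grad_f = lambda x: H @ x
x = nesterov(grad_f, x0=np.ones(20), beta=100.0, T=300)
print(0.5 * x @ H @ x)    # decreases roughly like O(beta * ||x0 - x*||^2 / T^2)
```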
Experiments w. convex losses
Improve upon Gradient Descent++?
Next few slides: move from first-order methods (GD++) to second-order methods.
Higher Order Optimization
• Gradient Descent – direction of steepest descent
• Second-Order Methods – use local curvature
Newton's method (+ trust region)
[Figure: Newton's method iterates x_1, x_2, x_3]
x_{t+1} = x_t − η [∇²f(x)]^{-1} ∇f(x)
For non-convex functions the Newton step can move toward ∞. Solution: solve a quadratic approximation in a local area (trust region).
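A minimal sketch of a damped Newton step with a crude trust-region-style safeguard, capping the step length so an indefinite or nearly singular Hessian cannot send the iterate off to ∞. The 1e-6 damping, the radius, and the logistic-type objective are illustrative assumptions, not the exact scheme from the slides.

```python
import numpy as np

def newton_step(grad, hess, x, radius=1.0):
    """One Newton step x - [hess f(x)]^{-1} grad f(x), truncated to a ball of the
    given radius (a crude trust-region safeguard).  Note the O(d^3) linear solve."""
    g, H = grad(x), hess(x)
    step = np.linalg.solve(H + 1e-6 * np.eye(len(x)), g)    # damped solve, O(d^3)
    norm = np.linalg.norm(step)
    if norm > radius:
        step *= radius / norm                                # stay inside the trust region
    return x - step

# Illustrative smooth objective: a single logistic loss log(1 + exp(-b * a'x)).
a, b = np.array([1.0, -2.0, 0.5]), 1.0
f_grad = lambda x: -b * a / (1 + np.exp(b * (a @ x)))
f_hess = lambda x: np.outer(a, a) * np.exp(b * (a @ x)) / (1 + np.exp(b * (a @ x))) ** 2

x = np.zeros(3)
for _ in range(10):
    x = newton_step(f_grad, f_hess, x)
print(np.linalg.norm(f_grad(x)))    # gradient norm shrinks while steps stay bounded
```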
Newton's method (+ trust region)
[Figure: Newton's method iterates x_1, x_2, x_3]

x_{t+1} = x_t − η [∇²f(x)]^{-1} ∇f(x)
d³ time per iteration! Infeasible for ML!!
Till recently…
Speed up the Newton direction computation??
• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
• 2015 Gödel prize
• Used by Daitch-Spielman for faster flow algorithms
• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
• Allows stochastic information
• Still prohibitive: rank × d² time
Stochastic Newton?
• ERM, rank-1 loss: arg min_x E_{i∼U[m]} [ℓ(x; a_i, b_i) + (1/2)‖x‖²]
• Unbiased estimator of the Hessian:

∇̂² = a_i a_i⊤ ℓ″(x; a_i, b_i) + I,   i ∼ U[1, …, m]

• Clearly E[∇̂²] = ∇²f, but E[(∇̂²)^{−1}] ≠ (∇²f)^{−1}
x_{t+1} = x_t − η [∇²f(x)]^{-1} ∇f(x)
Single example; vector-vector products only
For any distribution over the naturals, i ∼ N
Circumvent Hessian creation and inversion!
• 3 steps:
• (1) Represent the Hessian inverse as an infinite series:

∇^{−2} = Σ_{i=0}^{∞} (I − ∇²)^i

• (2) Sample from the infinite series (Hessian-gradient product), ONCE:

[∇²f]^{−1} ∇f = Σ_i (I − ∇²f)^i ∇f = E_{i∼N} [(I − ∇²f)^i ∇f · 1/Pr[i]]

• (3) Estimate the Hessian power by sampling i.i.d. data examples:

= E_{i∼N, k∼[i]} [ Π_{k=1}^{i} (I − ∇̂²f_k) ∇f · 1/Pr[i] ]
Linear-time Second-order Stochastic Algorithm (LiSSA)
• Use the estimator ∇̂^{−2}f defined previously
• Compute a full (large batch) gradient ∇f
• Move in the direction ∇̂^{−2}f · ∇f
• Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:

O(dm log(1/ε) + √γ d² log(1/ε))

1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16]
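A minimal sketch of a LiSSA-style Newton direction estimator for ridge-regularized logistic regression: the product [∇²f]^{−1}∇f is approximated by the truncated recursion u ← ∇f + (I − ∇̂²)u, where each step uses the rank-one Hessian of a single sampled example (never forming a matrix). The data scaling (so the Hessian is ≼ I), the recursion depth, and the regularization weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 500, 10, 0.1
A = rng.normal(size=(m, d)) / np.sqrt(d)      # scaling keeps per-example Hessians roughly <= I
b = rng.choice([-1.0, 1.0], size=m)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def full_grad(x):
    # gradient of (1/m) * sum_i log(1 + exp(-b_i a_i'x)) + (lam/2) ||x||^2
    return -(A.T @ (b * sigmoid(-b * (A @ x)))) / m + lam * x

def lissa_direction(x, grad, depth=200):
    """Estimate [hess f(x)]^{-1} grad via u <- grad + (I - hess_i) u, where hess_i is
    the Hessian of a single sampled example (rank-one + lam*I), applied to u directly."""
    u = grad.copy()
    for _ in range(depth):
        i = rng.integers(m)
        s = sigmoid(A[i] @ x) * sigmoid(-(A[i] @ x))      # logistic curvature at example i
        hess_u = s * A[i] * (A[i] @ u) + lam * u          # hess_i @ u via vector products only
        u = grad + u - hess_u
    return u

x = np.zeros(d)
for _ in range(20):                 # outer loop: full gradient, then estimated Newton step
    g = full_grad(x)
    x = x - lissa_direction(x, g)
print(np.linalg.norm(full_grad(x)))
```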
What about constraints??
Next few slides – projection-free (Frank-Wolfe) methods
Recommendation systems
[Figure: partially observed ratings matrix – rows = users, columns = movies, entry (i, j) = rating of user i for movie j]
[Figure: fully completed ratings matrix – complete the missing entries]
Recommendation systems
[Figure: ratings matrix with newly observed entries – get new data]
Recommendation systems
[Figure: re-completed ratings matrix – re-complete the missing entries]
Assume the "true matrix" has low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices
• Trace norm of a matrix = sum of its singular values
• K = { X | X is a matrix with trace norm at most D }
• Computational bottleneck: projections onto K require an eigendecomposition: O(n³) operations
• But: linear optimization over K is easier – computing the top eigenvector; O(sparsity) time (a sketch of this oracle appears after the list below)
1. Matrix completion (K = bounded trace norm matrices): eigendecomposition → largest singular vector computation
2. Online routing (K = flow polytope): conic optimization over the flow polytope → shortest path computation
3. Rotations (K = rotation matrices): conic optimization over rotations → Wahba's algorithm, linear time
4. Matroids (K = matroid polytope): convex optimization via the ellipsoid method → greedy algorithm, nearly linear time!
Projections → linear optimization
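A minimal sketch of the linear-optimization oracle for the trace-norm ball mentioned above: instead of the full eigendecomposition that a projection would need, it computes only the top singular pair of the gradient (here with scipy's sparse svds). The matrix sizes and the trace-norm bound D are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_tracenorm(G, D):
    """arg min over {X : trace_norm(X) <= D} of <G, X> is -D * u1 v1', where
    (u1, v1) is the top singular pair of G -- a single leading-singular-vector
    computation instead of a full SVD."""
    u, s, vt = svds(G, k=1)                 # top singular triplet of the gradient
    return -D * np.outer(u[:, 0], vt[0])

rng = np.random.default_rng(0)
G = rng.normal(size=(50, 40))               # stand-in for a gradient of the completion loss
V = linear_opt_tracenorm(G, D=10.0)
print(np.trace(G.T @ V))                    # = -D * sigma_max(G), the smallest linear value on the ball
```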
Conditional Gradient algorithm [Frank, Wolfe '56]
1. At iteration t: x_t is a convex combination of at most t vertices (sparsity)
2. No learning rate: η_t ≈ 1/t (independent of the diameter, gradients, etc.)
Convex opt problem: min_{x∈K} f(x)
• f is smooth and convex
• linear optimization over K is easy
v_t = arg min_{x∈K} ∇f(x_t)⊤ x
x_{t+1} = x_t + η_t (v_t − x_t)
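A minimal sketch of the conditional gradient iteration above, on the probability simplex, where the linear oracle simply picks the best single coordinate (a vertex). The objective, the step-size schedule 2/(t + 2), and the horizon are illustrative assumptions.

```python
import numpy as np

def frank_wolfe(grad, linear_oracle, x0, T):
    """Conditional gradient: v_t = argmin_{v in K} grad(x_t)' v,
    then x_{t+1} = x_t + eta_t (v_t - x_t) with eta_t = 2 / (t + 2)."""
    x = x0
    for t in range(T):
        v = linear_oracle(grad(x))
        eta = 2.0 / (t + 2.0)
        x = x + eta * (v - x)        # remains a convex combination of vertices of K
    return x

def simplex_oracle(g):
    """Linear optimization over the probability simplex: the best vertex e_i."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

# Illustrative smooth objective f(x) = ||x - y||^2 over the simplex (y lies inside it).
y = np.array([0.1, 0.5, 0.2, 0.2])
grad_f = lambda x: 2 * (x - y)
x = frank_wolfe(grad_f, simplex_oracle, x0=np.ones(4) / 4, T=200)
print(x)    # approaches y, at the O(1/t) rate proved next
```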
FW theorem
Theorem: For step sizes η_t = O(1/t), the iterates

v_t = arg min_{x∈K} ∇_t⊤ x,   x_{t+1} = x_t + η_t (v_t − x_t)

satisfy f(x_t) − f(x*) = O(1/t).

[Figure: one Frank-Wolfe step – x_t, ∇f(x_t), v_{t+1}, x_{t+1}]

Proof, main observation: denote h_t = f(x_t) − f(x*). Then

f(x_{t+1}) − f(x*) = f(x_t + η_t(v_t − x_t)) − f(x*)
  ≤ f(x_t) − f(x*) + η_t (v_t − x_t)⊤∇_t + (η_t² β/2) ‖v_t − x_t‖²      (β-smoothness of f)
  ≤ f(x_t) − f(x*) + η_t (x* − x_t)⊤∇_t + (η_t² β/2) ‖v_t − x_t‖²      (optimality of v_t)
  ≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (η_t² β/2) ‖v_t − x_t‖²      (convexity of f)
  ≤ (1 − η_t)(f(x_t) − f(x*)) + (η_t² β/2) D².

Thus h_{t+1} ≤ (1 − η_t) h_t + O(η_t²), and by induction h_t = O(1/t).
Online Conditional Gradient
x_{t+1} ← (1 − t^{−α}) x_t + t^{−α} v_t

[Figure: online conditional gradient step – x_t, v_t, x_{t+1}]
Theorem [Hazan, Kale '12]: Regret = O(T^{3/4})
Theorem [Garber, Hazan '13]: For polytopes, strongly-convex and smooth losses,
1. Offline: convergence after t steps at rate e^{−Ω(t)}
2. Online: Regret = O(√T)
• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
1. Use x_t, obtain f_t
2. Compute x_{t+1} as follows:
v_t = arg min_{x∈K} (Σ_{i=1}^{t} ∇f_i(x_i) + λ_t x_t)⊤ x
x_{t+1} = (1 − t^{−α}) x_t + t^{−α} v_t
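A minimal sketch of the online conditional gradient update above, reusing a simplex linear oracle: at each round the played point is updated by a single linear optimization over the regularized running gradient sum. The losses, the weight λ_t, and the exponent α are illustrative assumptions.

```python
import numpy as np

def simplex_oracle(g):
    """Linear optimization over the probability simplex: the best vertex e_i."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

def online_conditional_gradient(loss_grads, d, alpha=0.5, lam=1.0):
    """Round t: play x_t, observe grad_t = grad f_t(x_t), then
    v_t = argmin_{v in K} (sum_{i<=t} grad_i + lam_t * x_t)' v and
    x_{t+1} = (1 - t^-alpha) x_t + t^-alpha v_t."""
    x = np.ones(d) / d
    grad_sum = np.zeros(d)
    plays = []
    for t, grad_ft in enumerate(loss_grads, start=1):
        plays.append(x.copy())
        grad_sum += grad_ft(x)
        v = simplex_oracle(grad_sum + lam * np.sqrt(t) * x)   # one linear optimization per round
        x = (1 - t ** (-alpha)) * x + t ** (-alpha) * v
    return plays

# Illustrative sequence of linear losses f_t(x) = c_t' x on the simplex.
rng = np.random.default_rng(0)
costs = [rng.normal(size=4) for _ in range(100)]
plays = online_conditional_gradient([(lambda x, c=c: c) for c in costs], d=4)
print(plays[-1])
```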
Next few slides: survey of the state of the art
Non-convex optimization in ML

[Figure: distribution over data {a} ∈ R^d → machine → label]

arg min_{x∈R^d} (1/m) Σ_{i=1}^{m} ℓ_i(x, a_i, b_i) + R(x)
What is Optimization

But generally speaking...
We're screwed.
• Local (non-global) minima of f_0
• All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0

[Figure: highly non-convex surface with many local minima]

(Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009)
Solution concepts for non-convex optimization?
• Global minimization is NP hard, even for degree 4 polynomials. Even local minimization up to 4th order optimality conditions is NP hard.
• Algorithmic stability is sufficient [Hardt, Recht, Singer ‘16]
• Optimization approaches:
• Finding vanishing gradients / local minima efficiently
• Graduated optimization / homotopy method
• Quasi-convexity
• Structure of local optima (probabilistic assumptions that allow alternating minimization, …)
Goal: find a point x such that
1. ‖∇f(x)‖ ≤ ε (approximate first-order optimality)
2. ∇²f(x) ≽ −εI (approximate second-order optimality)
(A small check of these two conditions is sketched after the list below.)
1. (we've proved) The GD algorithm x_{t+1} = x_t − η∇_t finds a point satisfying (1) in O(1/ε) (expensive) iterations
2. (we've proved) The SGD algorithm x_{t+1} = x_t − η∇̂_t finds a point satisfying (1) in O(1/ε²) (cheap) iterations
3. The SGD algorithm with added noise finds a point satisfying (1 & 2) in O(poly(1/ε)) (cheap) iterations [Ge, Huang, Jin, Yuan '15]
4. Recent second-order methods find a point satisfying (1 & 2) in O(1/ε^{7/4}) (expensive) iterations [Carmon, Duchi, Hinder, Sidford '16] [Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16]
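As a small illustration of the two optimality conditions in the goal above, here is a sketch that checks whether a given point is an ε-approximate first- and second-order stationary point; the test function (a saddle) is an illustrative assumption.

```python
import numpy as np

def is_approx_stationary(grad, hess, x, eps):
    """Check ||grad f(x)|| <= eps and lambda_min(hess f(x)) >= -eps."""
    first = np.linalg.norm(grad(x)) <= eps
    second = np.linalg.eigvalsh(hess(x)).min() >= -eps
    return first, second

# Illustrative non-convex function f(x) = x1^2 - x2^2, with a saddle at the origin.
grad_f = lambda x: np.array([2 * x[0], -2 * x[1]])
hess_f = lambda x: np.diag([2.0, -2.0])
print(is_approx_stationary(grad_f, hess_f, np.zeros(2), eps=0.1))   # (True, False): a saddle point
```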
Gradient/Hessian-based methods
[Figure: classification example – chair / car]
Recap
1. Online learning and stochastic optimization
• Regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Advanced optimization
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm
Thank you!