Tutorial: Part 2
Optimization for Machine Learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora & Yoram Singer
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad): works for non-smooth & non-convex objectives (a small sketch follows this list)
2. Variance reduction: uses the special ERM structure; very effective for smooth & convex objectives
3. Acceleration/momentum: smooth convex only; general-purpose optimization since the 80's
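For concreteness, here is a minimal sketch of diagonal AdaGrad, the adaptive-regularization method named in item 1. The step size eta, the horizon T, and the toy quadratic objective are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def adagrad(grad, x0, eta=0.5, T=500, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate step sizes eta / sqrt(sum of past squared gradients)."""
    x = x0.copy()
    g2_sum = np.zeros_like(x0)
    for _ in range(T):
        g = grad(x)
        g2_sum += g ** 2                              # accumulate squared gradients per coordinate
        x = x - eta * g / (np.sqrt(g2_sum) + eps)
    return x

# Toy objective with very different curvature per coordinate: f(x) = sum_i scales_i * x_i^2.
scales = np.array([100.0, 1.0, 0.01])
grad_f = lambda x: 2 * scales * x
print(adagrad(grad_f, x0=np.ones(3)))                 # all coordinates shrink despite the scale mismatch
```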
Condition number of convex functions

The condition number of a convex function is defined as γ = β/α, where (simplified)

0 ≺ αI ≼ ∇²f(x) ≼ βI

α = strong convexity, β = smoothness.

Non-convex smooth functions (simplified):

−βI ≼ ∇²f(x) ≼ βI

Why do we care?
Well-conditioned functions exhibit much faster optimization! (but equivalent via reductions)
Examples
The descent lemma for β-smooth functions (algorithm: x_{t+1} = x_t − η∇_t):

f(x_{t+1}) − f(x_t) ≤ ∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²
  = −(η − βη²)‖∇_t‖² = −(1/(4β))‖∇_t‖²   (for η = 1/(2β))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

−2M ≤ f(x_T) − f(x_1) = Σ_t [f(x_{t+1}) − f(x_t)] ≤ −(1/(4β)) Σ_t ‖∇_t‖²

Thus, there exists a t for which

‖∇_t‖² ≤ 8Mβ/T
Smooth gradient descent
Conclusions: for x_{t+1} = x_t − η∇_t and T = Ω(1/ε), gradient descent finds a point with
‖∇_t‖² ≤ ε
1. This holds even for non-convex functions.
2. For convex functions it implies f(x_t) − f(x*) ≤ O(ε) (and faster rates hold for smooth convex functions!).
(A small sketch of this guarantee follows.)
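As an illustration of this guarantee, here is a minimal sketch of gradient descent with the step size η = 1/(2β) used in the descent-lemma calculation, run on a toy smooth non-convex objective and returning the iterate with the smallest gradient norm. The objective, the smoothness bound, and the horizon are illustrative assumptions.

```python
import numpy as np

def gradient_descent_min_grad(grad, x0, beta, T):
    """Run x_{t+1} = x_t - eta * grad(x_t) with eta = 1/(2*beta) and return the
    iterate with the smallest squared gradient norm seen (the 'exists t' guarantee)."""
    eta = 1.0 / (2.0 * beta)
    x, best_x, best_g2 = x0, x0, np.inf
    for _ in range(T):
        g = grad(x)
        g2 = float(g @ g)
        if g2 < best_g2:
            best_x, best_g2 = x, g2
        x = x - eta * g
    return best_x, best_g2

# Toy smooth, non-convex objective f(x) = ||x||^2 + cos(sum(x)); its Hessian is
# 2I - cos(sum(x)) * 11^T, so beta = 2 + d bounds the smoothness for d = 5.
grad_f = lambda x: 2 * x - np.sin(np.sum(x))
x_best, g2_best = gradient_descent_min_grad(grad_f, x0=np.ones(5), beta=7.0, T=500)
print(g2_best)   # small squared gradient norm, consistent with the 8*M*beta/T bound
```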
Smooth gradient descent
The descent lemma for β-smooth functions with a stochastic gradient oracle (algorithm: x_{t+1} = x_t − η∇̂_t, with E[∇̂_t] = ∇_t):

E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²]
  = E[−η ∇_t⊤∇̂_t + β‖η∇̂_t‖²]
  = −η‖∇_t‖² + η²β E[‖∇̂_t‖²]
  = −η‖∇_t‖² + η²β (‖∇_t‖² + var(∇̂_t))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

T = O(Mβ/ε²)  ⇒  ∃ t ≤ T such that ‖∇_t‖² ≤ ε
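A minimal sketch of the stochastic variant: the gradient oracle is corrupted by zero-mean noise of variance σ², and a smaller step size keeps the η²β·var(∇̂) term in the bound above dominated by the −η‖∇‖² gain. The noise model, step-size rule, and iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad, x0, beta, sigma2, eps):
    """SGD with a noisy gradient oracle of variance sigma2.  The step size shrinks
    with eps so the eta^2 * beta * variance term stays small; the iteration budget
    scales like beta * sigma2 / eps^2, mirroring the T = O(M*beta/eps^2) bound above."""
    eta = eps / (2.0 * beta * sigma2)
    T = int(8 * beta * sigma2 / eps ** 2)
    x = x0
    for _ in range(T):
        g_hat = grad(x) + rng.normal(scale=np.sqrt(sigma2), size=x.shape)
        x = x - eta * g_hat
    return x

grad_f = lambda x: 2 * x                       # f(x) = ||x||^2, so beta = 2
x = sgd(grad_f, x0=np.ones(5), beta=2.0, sigma2=1.0, eps=0.1)
print(np.linalg.norm(grad_f(x)) ** 2)          # on the order of eps
```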
Non-convex stochastic gradient descent
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:

x_{t+1} = x_t − η (∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0))

Every so often, compute a full gradient and restart at a new x_0.
Theorem [Schmidt, Le Roux, Bach '12; Johnson & Zhang '13; Mahdavi, Zhang, Jin '13]: Variance reduction for well-conditioned functions,

0 ≺ αI ≼ ∇²f(x) ≼ βI,   γ = β/α,

produces an ε-approximate solution in time

O((m + γ) d log(1/ε))
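Below is a minimal sketch of this variance-reduction scheme (SVRG-style) for a least-squares ERM objective: once per epoch a full gradient is computed at a snapshot, and the cheap inner steps use the combined low-variance estimator from the update above. The data, step size, and epoch length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))                      # m = 200 examples, d = 10 features
b = rng.normal(size=200)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]      # gradient of example i's loss
full_grad = lambda x: A.T @ (A @ x - b) / len(b)    # full gradient (average over examples)

def svrg(x0, eta=0.01, epochs=30, epoch_len=200):
    """Each epoch: full gradient at the snapshot, then cheap steps with the estimator
    g = grad_i(x) - grad_i(x_snap) + full_grad(x_snap), whose variance vanishes at the optimum."""
    x = x0
    for _ in range(epochs):
        x_snap = x.copy()
        mu = full_grad(x_snap)                      # the expensive full gradient, once per epoch
        for _ in range(epoch_len):
            i = rng.integers(len(b))
            g = grad_i(x, i) - grad_i(x_snap, i) + mu
            x = x - eta * g
    return x

x = svrg(np.zeros(10))
print(np.linalg.norm(full_grad(x)))                 # near-zero gradient of the ERM objective
```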
Controlling the variance: interpolating GD and SGD
γ should be interpreted as 1/ε.
• Optimal gradient complexity (smooth, convex)
• Modest practical improvements; non-convex "momentum" methods
• With variance reduction, the fastest possible running time of first-order methods:
O((m + √(γm)) d log(1/ε))
[Woodworth, Srebro '15] – tight lower bound with a gradient oracle
Acceleration/momentum [Nesterov '83]
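A minimal sketch of Nesterov-style acceleration on a smooth convex quadratic: a gradient step is taken at a look-ahead point, followed by an extrapolation (momentum) step. The momentum schedule (t − 1)/(t + 2) is one standard choice; the quadratic and the horizon are illustrative assumptions.

```python
import numpy as np

def nesterov(grad, x0, beta, T):
    """Accelerated gradient descent: gradient step at the look-ahead point y,
    then extrapolate with momentum coefficient (t - 1) / (t + 2)."""
    x, y = x0.copy(), x0.copy()
    for t in range(1, T + 1):
        x_next = y - (1.0 / beta) * grad(y)                  # gradient step at y
        y = x_next + (t - 1.0) / (t + 2.0) * (x_next - x)    # momentum / extrapolation
        x = x_next
    return x

# Illustrative ill-conditioned quadratic f(x) = 0.5 * x' H x with smoothness beta = 100.
H = np.diag(np.linspace(1.0, 100.0, 20))
grad_f = lambda x: H @ x
x = nesterov(grad_f, x0=np.ones(20), beta=100.0, T=300)
print(0.5 * x @ H @ x)    # decreases roughly like O(beta * ||x0 - x*||^2 / T^2)
```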
Experiments w. convex losses
Improve upon Gradient Descent++?
Next few slides: move from first-order methods (GD++) to second-order methods.
Higher Order Optimization
• Gradient Descent – direction of steepest descent
• Second-Order Methods – use local curvature
Newton's method (+ trust region)
[Figure: Newton's method iterates x_1, x_2, x_3]
x_{t+1} = x_t − η [∇²f(x)]^{-1} ∇f(x)
For non-convex functions the Newton step can move toward ∞. Solution: solve a quadratic approximation in a local area (trust region).
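A minimal sketch of a damped Newton step with a crude trust-region-style safeguard, capping the step length so an indefinite or nearly singular Hessian cannot send the iterate off to ∞. The 1e-6 damping, the radius, and the logistic-type objective are illustrative assumptions, not the exact scheme from the slides.

```python
import numpy as np

def newton_step(grad, hess, x, radius=1.0):
    """One Newton step x - [hess f(x)]^{-1} grad f(x), truncated to a ball of the
    given radius (a crude trust-region safeguard).  Note the O(d^3) linear solve."""
    g, H = grad(x), hess(x)
    step = np.linalg.solve(H + 1e-6 * np.eye(len(x)), g)    # damped solve, O(d^3)
    norm = np.linalg.norm(step)
    if norm > radius:
        step *= radius / norm                                # stay inside the trust region
    return x - step

# Illustrative smooth objective: a single logistic loss log(1 + exp(-b * a'x)).
a, b = np.array([1.0, -2.0, 0.5]), 1.0
f_grad = lambda x: -b * a / (1 + np.exp(b * (a @ x)))
f_hess = lambda x: np.outer(a, a) * np.exp(b * (a @ x)) / (1 + np.exp(b * (a @ x))) ** 2

x = np.zeros(3)
for _ in range(10):
    x = newton_step(f_grad, f_hess, x)
print(np.linalg.norm(f_grad(x)))    # gradient norm shrinks while steps stay bounded
```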
Newton's method (+ trust region)
[Figure: Newton's method iterates x_1, x_2, x_3]

x_{t+1} = x_t − η [∇²f(x)]^{-1} ∇f(x)
d³ time per iteration! Infeasible for ML!!
Till recently…
Speed up the Newton direction computation??
• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
• 2015 Gödel prize
• Used by Daitch-Spielman for faster flow algorithms
• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
• Allows stochastic information
• Still prohibitive: rank × d² time
Stochastic Newton?
• ERM, rank-1 loss: arg min_x E_{i∼U[m]} [ℓ(x; a_i, b_i) + (1/2)‖x‖²]
• Unbiased estimator of the Hessian:

∇̂² = a_i a_i⊤ ℓ″(x; a_i, b_i) + I,   i ∼ U[1, …, m]

• Clearly E[∇̂²] = ∇²f, but E[(∇̂²)^{−1}] ≠ (∇²f)^{−1}
x_{t+1} = x_t − η [∇²f(x)]^{-1} ∇f(x)
Single example; vector-vector products only
For any distribution over the naturals, i ∼ N
Circumvent Hessian creation and inversion!
• 3 steps:
• (1) Represent the Hessian inverse as an infinite series:

∇^{−2} = Σ_{i=0}^{∞} (I − ∇²)^i

• (2) Sample from the infinite series (Hessian-gradient product), ONCE:

[∇²f]^{−1} ∇f = Σ_i (I − ∇²f)^i ∇f = E_{i∼N} [(I − ∇²f)^i ∇f · 1/Pr[i]]

• (3) Estimate the Hessian power by sampling i.i.d. data examples:

= E_{i∼N, k∼[i]} [ Π_{k=1}^{i} (I − ∇̂²f_k) ∇f · 1/Pr[i] ]
Linear-time Second-order Stochastic Algorithm (LiSSA)
• Use the estimator ∇̂^{−2}f defined previously
• Compute a full (large batch) gradient ∇f
• Move in the direction ∇̂^{−2}f · ∇f
• Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:

O(dm log(1/ε) + √γ d² log(1/ε))

1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16]
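A minimal sketch of a LiSSA-style Newton direction estimator for ridge-regularized logistic regression: the product [∇²f]^{−1}∇f is approximated by the truncated recursion u ← ∇f + (I − ∇̂²)u, where each step uses the rank-one Hessian of a single sampled example (never forming a matrix). The data scaling (so the Hessian is ≼ I), the recursion depth, and the regularization weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 500, 10, 0.1
A = rng.normal(size=(m, d)) / np.sqrt(d)      # scaling keeps per-example Hessians roughly <= I
b = rng.choice([-1.0, 1.0], size=m)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def full_grad(x):
    # gradient of (1/m) * sum_i log(1 + exp(-b_i a_i'x)) + (lam/2) ||x||^2
    return -(A.T @ (b * sigmoid(-b * (A @ x)))) / m + lam * x

def lissa_direction(x, grad, depth=200):
    """Estimate [hess f(x)]^{-1} grad via u <- grad + (I - hess_i) u, where hess_i is
    the Hessian of a single sampled example (rank-one + lam*I), applied to u directly."""
    u = grad.copy()
    for _ in range(depth):
        i = rng.integers(m)
        s = sigmoid(A[i] @ x) * sigmoid(-(A[i] @ x))      # logistic curvature at example i
        hess_u = s * A[i] * (A[i] @ u) + lam * u          # hess_i @ u via vector products only
        u = grad + u - hess_u
    return u

x = np.zeros(d)
for _ in range(20):                 # outer loop: full gradient, then estimated Newton step
    g = full_grad(x)
    x = x - lissa_direction(x, g)
print(np.linalg.norm(full_grad(x)))
```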
What about constraints??
Next few slides – projection-free (Frank-Wolfe) methods
Recommendation systems
[Figure: partially observed ratings matrix – rows = users, columns = movies, entry (i, j) = rating of user i for movie j]
[Figure: fully completed ratings matrix – complete the missing entries]
Recommendation systems
[Figure: ratings matrix with newly observed entries – get new data]
Recommendation systems
[Figure: re-completed ratings matrix – re-complete the missing entries]
Assume the "true matrix" has low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices
• Trace norm of a matrix = sum of its singular values
• K = { X | X is a matrix with trace norm at most D }
• Computational bottleneck: projections onto K require an eigendecomposition: O(n³) operations
• But: linear optimization over K is easier – computing the top eigenvector; O(sparsity) time (a sketch of this oracle appears after the list below)
1. Matrix completion (K = bounded trace norm matrices): eigendecomposition → largest singular vector computation
2. Online routing (K = flow polytope): conic optimization over the flow polytope → shortest path computation
3. Rotations (K = rotation matrices): conic optimization over rotations → Wahba's algorithm, linear time
4. Matroids (K = matroid polytope): convex optimization via the ellipsoid method → greedy algorithm, nearly linear time!
Projections → linear optimization
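A minimal sketch of the linear-optimization oracle for the trace-norm ball mentioned above: instead of the full eigendecomposition that a projection would need, it computes only the top singular pair of the gradient (here with scipy's sparse svds). The matrix sizes and the trace-norm bound D are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_tracenorm(G, D):
    """arg min over {X : trace_norm(X) <= D} of <G, X> is -D * u1 v1', where
    (u1, v1) is the top singular pair of G -- a single leading-singular-vector
    computation instead of a full SVD."""
    u, s, vt = svds(G, k=1)                 # top singular triplet of the gradient
    return -D * np.outer(u[:, 0], vt[0])

rng = np.random.default_rng(0)
G = rng.normal(size=(50, 40))               # stand-in for a gradient of the completion loss
V = linear_opt_tracenorm(G, D=10.0)
print(np.trace(G.T @ V))                    # = -D * sigma_max(G), the smallest linear value on the ball
```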
Conditional Gradient algorithm [Frank, Wolfe '56]
1. At iteration t: x_t is a convex combination of at most t vertices (sparsity)
2. No learning rate: η_t ≈ 1/t (independent of the diameter, gradients, etc.)
Convex opt problem: min_{x∈K} f(x)
• f is smooth and convex
• linear optimization over K is easy
v_t = arg min_{x∈K} ∇f(x_t)⊤ x
x_{t+1} = x_t + η_t (v_t − x_t)
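A minimal sketch of the conditional gradient iteration above, on the probability simplex, where the linear oracle simply picks the best single coordinate (a vertex). The objective, the step-size schedule 2/(t + 2), and the horizon are illustrative assumptions.

```python
import numpy as np

def frank_wolfe(grad, linear_oracle, x0, T):
    """Conditional gradient: v_t = argmin_{v in K} grad(x_t)' v,
    then x_{t+1} = x_t + eta_t (v_t - x_t) with eta_t = 2 / (t + 2)."""
    x = x0
    for t in range(T):
        v = linear_oracle(grad(x))
        eta = 2.0 / (t + 2.0)
        x = x + eta * (v - x)        # remains a convex combination of vertices of K
    return x

def simplex_oracle(g):
    """Linear optimization over the probability simplex: the best vertex e_i."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

# Illustrative smooth objective f(x) = ||x - y||^2 over the simplex (y lies inside it).
y = np.array([0.1, 0.5, 0.2, 0.2])
grad_f = lambda x: 2 * (x - y)
x = frank_wolfe(grad_f, simplex_oracle, x0=np.ones(4) / 4, T=200)
print(x)    # approaches y, at the O(1/t) rate proved next
```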
FW theorem
Theorem: For step sizes η_t = O(1/t), the iterates

v_t = arg min_{x∈K} ∇_t⊤ x,   x_{t+1} = x_t + η_t (v_t − x_t)

satisfy f(x_t) − f(x*) = O(1/t).

[Figure: one Frank-Wolfe step – x_t, ∇f(x_t), v_{t+1}, x_{t+1}]

Proof, main observation: denote h_t = f(x_t) − f(x*). Then

f(x_{t+1}) − f(x*) = f(x_t + η_t(v_t − x_t)) − f(x*)
  ≤ f(x_t) − f(x*) + η_t (v_t − x_t)⊤∇_t + (η_t² β/2) ‖v_t − x_t‖²      (β-smoothness of f)
  ≤ f(x_t) − f(x*) + η_t (x* − x_t)⊤∇_t + (η_t² β/2) ‖v_t − x_t‖²      (optimality of v_t)
  ≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (η_t² β/2) ‖v_t − x_t‖²      (convexity of f)
  ≤ (1 − η_t)(f(x_t) − f(x*)) + (η_t² β/2) D².

Thus h_{t+1} ≤ (1 − η_t) h_t + O(η_t²), and by induction h_t = O(1/t).
Online Conditional Gradient
x_{t+1} ← (1 − t^{−α}) x_t + t^{−α} v_t

[Figure: online conditional gradient step – x_t, v_t, x_{t+1}]
Theorem [Hazan, Kale '12]: Regret = O(T^{3/4})
Theorem [Garber, Hazan '13]: For polytopes, strongly-convex and smooth losses,
1. Offline: convergence after t steps at rate e^{−Ω(t)}
2. Online: Regret = O(√T)
• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
1. Use x_t, obtain f_t
2. Compute x_{t+1} as follows:
v_t = arg min_{x∈K} (Σ_{i=1}^{t} ∇f_i(x_i) + λ_t x_t)⊤ x
x_{t+1} = (1 − t^{−α}) x_t + t^{−α} v_t
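A minimal sketch of the online conditional gradient update above, reusing a simplex linear oracle: at each round the played point is updated by a single linear optimization over the regularized running gradient sum. The losses, the weight λ_t, and the exponent α are illustrative assumptions.

```python
import numpy as np

def simplex_oracle(g):
    """Linear optimization over the probability simplex: the best vertex e_i."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

def online_conditional_gradient(loss_grads, d, alpha=0.5, lam=1.0):
    """Round t: play x_t, observe grad_t = grad f_t(x_t), then
    v_t = argmin_{v in K} (sum_{i<=t} grad_i + lam_t * x_t)' v and
    x_{t+1} = (1 - t^-alpha) x_t + t^-alpha v_t."""
    x = np.ones(d) / d
    grad_sum = np.zeros(d)
    plays = []
    for t, grad_ft in enumerate(loss_grads, start=1):
        plays.append(x.copy())
        grad_sum += grad_ft(x)
        v = simplex_oracle(grad_sum + lam * np.sqrt(t) * x)   # one linear optimization per round
        x = (1 - t ** (-alpha)) * x + t ** (-alpha) * v
    return plays

# Illustrative sequence of linear losses f_t(x) = c_t' x on the simplex.
rng = np.random.default_rng(0)
costs = [rng.normal(size=4) for _ in range(100)]
plays = online_conditional_gradient([(lambda x, c=c: c) for c in costs], d=4)
print(plays[-1])
```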
Next few slides: survey of the state of the art
Non-convex optimization in ML

[Figure: distribution over data {a} ∈ R^d → machine → label]

arg min_{x∈R^d} (1/m) Σ_{i=1}^{m} ℓ_i(x, a_i, b_i) + R(x)
What is Optimization

But generally speaking...
We're screwed.
• Local (non-global) minima of f_0
• All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0

[Figure: highly non-convex surface with many local minima]

(Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009)
Solution concepts for non-convex optimization?
• Global minimization is NP hard, even for degree 4 polynomials. Even local minimization up to 4th order optimality conditions is NP hard.
• Algorithmic stability is sufficient [Hardt, Recht, Singer ‘16]
• Optimization approaches:
• Finding vanishing gradients / local minima efficiently
• Graduated optimization / homotopy method
• Quasi-convexity
• Structure of local optima (probabilistic assumptions that allow alternating minimization, …)
Goal: find a point x such that
1. ‖∇f(x)‖ ≤ ε (approximate first-order optimality)
2. ∇²f(x) ≽ −εI (approximate second-order optimality)
(A small check of these two conditions is sketched after the list below.)
1. (we've proved) The GD algorithm x_{t+1} = x_t − η∇_t finds a point satisfying (1) in O(1/ε) (expensive) iterations
2. (we've proved) The SGD algorithm x_{t+1} = x_t − η∇̂_t finds a point satisfying (1) in O(1/ε²) (cheap) iterations
3. The SGD algorithm with added noise finds a point satisfying (1 & 2) in O(poly(1/ε)) (cheap) iterations [Ge, Huang, Jin, Yuan '15]
4. Recent second-order methods find a point satisfying (1 & 2) in O(1/ε^{7/4}) (expensive) iterations [Carmon, Duchi, Hinder, Sidford '16] [Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16]
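As a small illustration of the two optimality conditions in the goal above, here is a sketch that checks whether a given point is an ε-approximate first- and second-order stationary point; the test function (a saddle) is an illustrative assumption.

```python
import numpy as np

def is_approx_stationary(grad, hess, x, eps):
    """Check ||grad f(x)|| <= eps and lambda_min(hess f(x)) >= -eps."""
    first = np.linalg.norm(grad(x)) <= eps
    second = np.linalg.eigvalsh(hess(x)).min() >= -eps
    return first, second

# Illustrative non-convex function f(x) = x1^2 - x2^2, with a saddle at the origin.
grad_f = lambda x: np.array([2 * x[0], -2 * x[1]])
hess_f = lambda x: np.diag([2.0, -2.0])
print(is_approx_stationary(grad_f, hess_f, np.zeros(2), eps=0.1))   # (True, False): a saddle point
```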
Gradient/Hessian-based methods
[Figure: classification example – chair / car]
Recap
1. Online learning and stochastic optimization
• Regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Advanced optimization
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm
Thank you!