TRANSCRIPT
Tutorial: Part 1
Optimization for Machine Learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora, Yoram Singer
ML paradigm
Distribution over examples $a \in \mathbb{R}^d$ → Machine → label $b = f_{\mathrm{parameters}}(a)$ (e.g., chair / car)
This tutorial: training the machine
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline / online / stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
NOT touched upon:
• Parallelism / distributed computation (asynchronous optimization, HOGWILD!, etc.), Bayesian inference in graphical models, Markov chain Monte Carlo, partial information and bandit algorithms
Mathematical optimization
Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: minimizer $x \in K$, such that $f(x) \le f(y) \ \forall y \in K$
Accessing f? (values, differentials, …)
Generally NP-hard, given full access to the function.
What is Optimization?
But generally speaking... we're screwed:
• Local (non-global) minima of $f_0$
• All kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$
[Figure: 3-D surface plot of a highly non-convex function over $[-3, 3]^2$, with many local minima. Slide credit: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009.]
Learning = optimization over data (a.k.a. Empirical Risk Minimization)
Fitting the parameters of the model ("training") = optimization problem:
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
m = # of examples, (a, b) = (features, labels), d = dimension
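To make the ERM objective concrete, here is a minimal NumPy sketch (not from the slides; the synthetic data, the generic `loss` argument, and the squared-norm choice of R(x) are illustrative assumptions):

```python
import numpy as np

def empirical_risk(x, A, b, loss, lam=0.0):
    """Regularized empirical risk: (1/m) * sum_i loss(x, a_i, b_i) + lam * ||x||^2."""
    m = A.shape[0]
    data_term = np.mean([loss(x, A[i], b[i]) for i in range(m)])
    return data_term + lam * np.dot(x, x)

# Tiny synthetic example (m = 100 examples, d = 5 features, labels in {-1, +1}).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = np.sign(A @ rng.normal(size=5))
squared_loss = lambda x, a, y: (x @ a - y) ** 2
print(empirical_risk(np.zeros(5), A, b, squared_loss, lam=0.1))
```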
Example: linear classification
Given a sample $S = \{(a_1, b_1), \dots, (a_m, b_m)\}$, find a hyperplane (through the origin, w.l.o.g.) such that:
$$x = \arg\min_{\|x\| \le 1} \ \#\text{ of mistakes} \ =\ \arg\min_{\|x\| \le 1} \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big| \ =\ \arg\min_{\|x\| \le 1} \ \frac{1}{m}\sum_i \ell(x, a_i, b_i),$$
$$\text{for } \ \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \ne b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$
NP-hard!
Sum of signs → global optimization is NP-hard! But locally verifiable…
Local property that ensures global optimality?
Convexity
A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:
$$f\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \ \le\ \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$$
• Informally: smiley :)
• Alternative definition (for differentiable f):
$$f(y) \ \ge\ f(x) + \nabla f(x)^\top (y - x)$$
[Figure: a convex curve lying above its tangent line at $x$, evaluated at a second point $y$.]
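As a quick sanity check (not on the slides), both definitions can be verified for $f(x) = \|x\|^2$:
$$\tfrac{1}{2}\|x\|^2 + \tfrac{1}{2}\|y\|^2 - \big\|\tfrac{1}{2}x + \tfrac{1}{2}y\big\|^2 = \tfrac{1}{4}\|x - y\|^2 \ \ge\ 0,$$
and, since $\nabla f(x) = 2x$,
$$\|y\|^2 - \|x\|^2 - 2x^\top(y - x) = \|y - x\|^2 \ \ge\ 0.$$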
Convex sets
A set K is convex if and only if:
$$x, y \in K \ \Rightarrow\ \tfrac{1}{2}x + \tfrac{1}{2}y \in K$$
Convex relaxations for linear (& kernel) classification
Loss functions of the form $\ell(x, a_i, b_i) = \ell(x^\top a_i \cdot b_i)$, replacing the non-convex objective
$$x = \arg\min_{\|x\| \le 1} \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$
1. Ridge / linear regression: $\ell(x^\top a_i, b_i) = (x^\top a_i - b_i)^2$
2. SVM (hinge loss): $\ell(x^\top a_i, b_i) = \max\{0,\ 1 - b_i \, x^\top a_i\}$
3. Logistic regression: $\ell(x^\top a_i, b_i) = \log\!\left(1 + e^{-b_i \cdot x^\top a_i}\right)$
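A minimal NumPy sketch of these three surrogate losses (illustrative only; written for a single example with label $b \in \{-1, +1\}$, except the squared loss, which also covers regression targets):

```python
import numpy as np

def squared_loss(x, a, b):
    """Ridge / linear regression loss: (x^T a - b)^2."""
    return (x @ a - b) ** 2

def hinge_loss(x, a, b):
    """SVM (hinge) loss: max(0, 1 - b * x^T a)."""
    return max(0.0, 1.0 - b * (x @ a))

def logistic_loss(x, a, b):
    """Logistic loss: log(1 + exp(-b * x^T a))."""
    return np.log1p(np.exp(-b * (x @ a)))

x = np.array([0.5, -0.2]); a = np.array([1.0, 2.0]); b = 1.0
print(squared_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))
```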
We have: cast learning as mathematical optimization, and argued that convexity is algorithmically important.
Next → algorithms!
Gradient descent, constrained set
$$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$$
Update (gradient step followed by projection onto K):
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
[Figure: iterates $p_1, p_2, p_3, \dots$ stepping outside K and being projected back, approaching $p^*$.]
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$$f\!\left(\frac{1}{T}\sum_t x_t\right) \ \le\ \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Where:
• G = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of the constraint set: $\forall x, y \in K,\ \|x - y\| \le D$
(A code sketch of this method follows below.)
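A minimal sketch of projected gradient descent, assuming the constraint set is a Euclidean ball so the projection has a closed form (any convex K with a projection oracle would do; the objective and constants below are illustrative):

```python
import numpy as np

def project_onto_ball(y, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def projected_gradient_descent(grad_f, x0, radius, G, T):
    """Run T steps with step size eta = D / (G * sqrt(T)), D = 2 * radius.
    Returns the average iterate, which is what the theorem bounds."""
    D = 2.0 * radius
    eta = D / (G * np.sqrt(T))
    x = x0.copy()
    iterates = []
    for _ in range(T):
        iterates.append(x)
        y = x - eta * grad_f(x)           # gradient step
        x = project_onto_ball(y, radius)  # projection back onto K
    return np.mean(iterates, axis=0)

# Illustrative use: minimize f(x) = ||x - c||^2 over the unit ball.
c = np.array([2.0, 0.0])
grad_f = lambda x: 2.0 * (x - c)
x_avg = projected_gradient_descent(grad_f, np.zeros(2), radius=1.0, G=6.0, T=2000)
print(x_avg)   # close to the constrained optimum [1, 0]
```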
Convergence of gradient descent
Proof (using the update $y_{t+1} = x_t - \eta \nabla f(x_t)$, $x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$):
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2 (the Pythagorean theorem: projecting onto a convex set does not increase the distance to any point of the set):
$$\|x^* - x_{t+1}\|^2 \ \le\ \|x^* - y_{t+1}\|^2$$
Thus:
$$\|x^* - x_{t+1}\|^2 \ \le\ \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$
And hence (Jensen's inequality, then convexity, then the bound above rearranged, then telescoping):
$$f\!\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \ \le\ \frac{1}{T}\sum_t \big(f(x_t) - f(x^*)\big) \ \le\ \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*)$$
$$\le\ \frac{1}{T}\sum_t \left[\frac{1}{2\eta}\Big(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\Big) + \frac{\eta}{2} G^2\right] \ \le\ \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2 \ \le\ \frac{DG}{\sqrt{T}}$$
Recap
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$$f\!\left(\frac{1}{T}\sum_t x_t\right) \ \le\ \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
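Spelling out the iteration count (a small derivation, not on the slide): to make the error term at most $\epsilon$ we need
$$\frac{DG}{\sqrt{T}} \le \epsilon \quad\Longleftrightarrow\quad T \ge \left(\frac{DG}{\epsilon}\right)^2,$$
so, treating $D$ and $G$ as constants, $T = O(1/\epsilon^2)$ iterations suffice.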
Gradient Descent: caveat
For ERM problems
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
1. The gradient depends on all the data
2. What about generalization?
Next few slides:
Simultaneous optimization and generalization → faster optimization! (single example per iteration)
Statistical (PAC) learning
Nature: i.i.d. examples from a distribution D over $A \times B = \{(a, b)\}$
Learner: outputs a hypothesis $h$ from a class $H = \{h_1, \dots, h_N\}$, after seeing examples $(a_1, b_1), \dots, (a_m, b_m)$
Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$
A hypothesis class $H: X \to Y$ is learnable if for all $\epsilon, \delta > 0$ there exists an algorithm such that, after seeing $m$ examples, for $m = \mathrm{poly}(\delta, \epsilon, \mathrm{dimension}(H))$, it finds $h$ such that with probability $1 - \delta$:
$$\mathrm{err}(h) \ \le\ \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon, \qquad \text{where } \ \mathrm{err}(h) = \mathbb{E}_{(a, b) \sim D}\big[\ell(h, (a, b))\big]$$
More powerful setting: Online Learning in Games
Iteratively, for $t = 1, 2, \dots, T$:
• Player: $h_t \in H$
• Adversary: $(a_t, b_t) \in A \times B$
• Loss $\ell(h_t, (a_t, b_t))$
Goal: minimize the (average, expected) regret:
$$\frac{1}{T}\left[\sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t))\right] \ \xrightarrow[T \to \infty]{}\ 0$$
Vanishing regret → generalization in the PAC setting! (online-to-batch conversion)
From this point onwards: $f_t(x) = \ell(x, a_t, b_t)$ = the loss on one example
Can we minimize regret efficiently?
Online gradient descent [Zinkevich '05]
$$y_{t+1} = x_t - \eta \nabla f_t(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: $\ \mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*) = O(\sqrt{T})$
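A minimal online gradient descent loop (a sketch; the decision set $K = [-1, 1]$, the linear losses $f_t(x) = c_t x$, and the step size are illustrative assumptions):

```python
import numpy as np

def online_gradient_descent(costs, eta):
    """OGD on K = [-1, 1] with linear losses f_t(x) = c_t * x."""
    x = 0.0
    total_loss = 0.0
    for c in costs:                       # adversary reveals f_t after we commit to x_t
        total_loss += c * x               # suffer f_t(x_t)
        y = x - eta * c                   # gradient step (gradient of c*x is c)
        x = float(np.clip(y, -1.0, 1.0))  # projection onto [-1, 1]
    return total_loss

rng = np.random.default_rng(1)
T = 10_000
costs = rng.uniform(-1.0, 1.0, size=T)
alg_loss = online_gradient_descent(costs, eta=1.0 / np.sqrt(T))
best_fixed = min(costs.sum() * x for x in (-1.0, 1.0))  # best fixed point in hindsight
print("regret:", alg_loss - best_fixed, " vs sqrt(T) =", np.sqrt(T))
```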
Analysis
Denote $\nabla_t := \nabla f_t(x_t)$.
Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \ \le\ \|y_{t+1} - x^*\|$$
Thus:
$$\|x_{t+1} - x^*\|^2 \ \le\ \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Convexity:
$$\sum_t \big[f_t(x_t) - f_t(x^*)\big] \ \le\ \sum_t \nabla_t^\top (x_t - x^*) \ \le\ \frac{1}{2\eta}\sum_t \Big(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\Big) + \frac{\eta}{2}\sum_t \|\nabla_t\|^2$$
$$\le\ \frac{1}{2\eta}\|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 \ =\ O(\sqrt{T}) \qquad \text{(for } \eta \approx D / (G\sqrt{T})\text{)}$$
Lower bound
• Two loss functions, T iterations: $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$ (the second loss is the first times $-1$), chosen at random each round
• Expected loss = 0 (for any algorithm)
• Regret (compared to either fixed point $-1$ or $+1$):
$$\mathrm{Regret} = \Omega(\sqrt{T}), \qquad \mathbb{E}\big[\,|\#\{+1\text{'s}\} - \#\{-1\text{'s}\}|\,\big] = \Omega(\sqrt{T})$$
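An illustrative simulation of this lower bound (not from the slides): with random $\pm 1$ linear losses on $K = [-1, 1]$, any algorithm has expected loss 0, while the best fixed point in hindsight gains $|\#(+1\text{ losses}) - \#(-1\text{ losses})|$, which grows like $\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 10_000
for T in [100, 1_000, 10_000, 100_000]:
    # Sum of T independent +/-1 signs, one value per trial.
    sums = 2 * rng.binomial(T, 0.5, size=trials) - T
    print(f"T={T:>7}  E|#(+1) - #(-1)| ~ {np.abs(sums).mean():8.1f}   sqrt(T) = {T**0.5:8.1f}")
```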
Stochastic gradient descent
Learning problem:
$$\arg\min_{x \in \mathbb{R}^d} \ F(x) = \mathbb{E}_{(a_i, b_i)}\big[\ell_i(x, a_i, b_i)\big], \qquad \text{random example: } f_t(x) = \ell_i(x, a_i, b_i)$$
1. We have proved (for any sequence of $\nabla_t$):
$$\frac{1}{T}\sum_t \nabla_t^\top x_t \ \le\ \min_{x^* \in K} \frac{1}{T}\sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$
2. Taking the (conditional) expectation:
$$\mathbb{E}\left[F\!\left(\frac{1}{T}\sum_t x_t\right)\right] - \min_{x^* \in K} F(x^*) \ \le\ \mathbb{E}\left[\frac{1}{T}\sum_t \nabla_t^\top (x_t - x^*)\right] \ \le\ \frac{DG}{\sqrt{T}}$$
Stochastic vs. full gradient descent
One example per step, same convergence as GD, and it gives direct generalization! (formally this needs martingales)
Total running time for $\epsilon$ generalization error: $O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$. (A code sketch follows below.)
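A minimal stochastic gradient descent sketch for this setting (illustrative assumptions: least-squares loss, one random example per step, projection onto a Euclidean ball, and averaging of the iterates):

```python
import numpy as np

def sgd_least_squares(A, b, radius, eta, T, seed=0):
    """SGD on F(x) = E_i[(x^T a_i - b_i)^2], one random example per step."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(T):
        i = rng.integers(m)                    # sample one example
        grad = 2.0 * (x @ A[i] - b[i]) * A[i]  # gradient of the single-example loss
        x = x - eta * grad
        norm = np.linalg.norm(x)
        if norm > radius:                      # projection onto {||x|| <= radius}
            x *= radius / norm
        avg += x
    return avg / T

rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10))
x_true = rng.normal(size=10)
b = A @ x_true + 0.1 * rng.normal(size=1000)
x_hat = sgd_least_squares(A, b, radius=5.0, eta=0.01, T=20_000)
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```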
Regularization & Gradient Descent++
Why "regularize"?
• Statistical learning theory / Occam's razor: # of examples needed to learn a hypothesis class ~ its "dimension"
  • VC dimension
  • Fat-shattering dimension
  • Rademacher width
  • Margin / norm of a linear / kernel classifier
• PAC theory: regularization ↔ reduced complexity
• Regret minimization: regularization ↔ stability
Minimize regret: best-in-hindsight
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$
• Most natural (follow-the-leader): play the best decision on the losses seen so far,
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
• Provably works [Kalai-Vempala '05] with one step of lookahead:
$$x_t' = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$$
• So if $x_t \approx x_{t+1}$, we get a regret bound
• But instability! $x_t - x_{t+1}$ can be large (see the small demonstration below)
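A tiny illustration of the instability (not from the slides): on $K = [-1, 1]$ with alternating linear losses $f_t(x) = \pm x$, follow-the-leader oscillates between its past minimizers and suffers regret linear in T:

```python
# Follow-The-Leader on K = [-1, 1] with losses f_t(x) = c_t * x, c_t alternating +1 / -1.
T = 1000
cum = 0.0        # cumulative cost coefficient seen so far
ftl_loss = 0.0
for t in range(T):
    # FTL plays the minimizer of the cumulative loss so far (x = 0 on a tie / first round).
    x = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
    c = 1.0 if t % 2 == 0 else -1.0
    ftl_loss += c * x
    cum += c
best_fixed = -abs(cum)                        # loss of the best fixed point in hindsight
print("FTL regret:", ftl_loss - best_fixed)   # grows linearly in T (about T/2 here)
```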
Fixing FTL: Follow-The-Regularized-Leader (FTRL)
• Linearize: replace $f_t$ by the linear function $\nabla f_t(x_t)^\top x$
• Add regularization:
$$x_t = \arg\min_{x \in K} \ \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$
• $R(x)$ is a strongly convex function, which ensures stability:
$$\nabla_t^\top (x_t - x_{t+1}) = O(\eta)$$
FTRL vs. gradient descent
• Take $R(x) = \frac{1}{2}\|x\|^2$:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) \ =\ \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$$
• Essentially OGD with "lazy" projections: starting with $y_1 = 0$, for $t = 1, 2, \dots$
$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
(Here $\Pi_K$ denotes Euclidean projection onto K.)
FTRL vs. Multiplicative Weights
• Experts setting: $K = \Delta_n$ = distributions over the $n$ experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy
• Gives the Multiplicative Weights method!
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) \ =\ \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right) \Big/\ Z_t$$
(entrywise exponential; $Z_t$ is the normalization constant)
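A minimal sketch of the resulting Multiplicative Weights update in the experts setting (the random loss vectors $c_t \in [0, 1]^n$ and the step size are illustrative assumptions):

```python
import numpy as np

def multiplicative_weights(loss_vectors, eta):
    """Entropic FTRL on the simplex: x_t proportional to exp(-eta * cumulative losses)."""
    n = loss_vectors.shape[1]
    cum = np.zeros(n)              # cumulative loss per expert
    total = 0.0
    for c in loss_vectors:
        w = np.exp(-eta * cum)     # entrywise exponential
        x = w / w.sum()            # normalize: a distribution over experts
        total += x @ c             # expected loss this round
        cum += c
    return total, cum

rng = np.random.default_rng(4)
T, n = 5_000, 10
losses = rng.uniform(0.0, 1.0, size=(T, n))
alg_loss, cum = multiplicative_weights(losses, eta=np.sqrt(np.log(n) / T))
print("regret vs best expert:", alg_loss - cum.min())
```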
FTRL ⇔ Online Mirror Descent
FTRL form:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
Mirror descent form:
$$y_{t+1} = (\nabla R)^{-1}\big(\nabla R(y_t) - \eta \nabla f_t(x_t)\big), \qquad x_t = \Pi^R_K(y_t)$$
Bregman projection:
$$\Pi^R_K(y) = \arg\min_{x \in K} B_R(x \,\|\, y), \qquad B_R(x \,\|\, y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$$
Adaptive Regularization: AdaGrad
• Consider a generalized linear model, where the prediction is a function of $a^\top x$; then the gradient is a scalar times the feature vector:
$$\nabla f_t(x) = \ell'(a_t, b_t, x) \cdot a_t$$
• OGD update: $x_{t+1} = x_t - \eta \nabla_t = x_t - \eta\, \ell'(a_t, b_t, x)\, a_t$
• All features are treated equally in updating the parameter vector
• In typical text classification tasks, the feature vectors $a_t$ are very sparse: slow learning!
• Adaptive regularization: per-feature learning rates
Optimal regularization
• The general RFTL form:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$
• Which regularizer to pick?
• AdaGrad: treat this as a learning problem! Family of regularizations:
$$R(x) = \|x\|_A^2 \quad \text{s.t.} \quad A \succeq 0, \ \mathrm{Trace}(A) = d$$
• Objective in the matrix world: best regret in hindsight!
AdaGrad (diagonal form)
• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. Use $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:
$$G_t = \mathrm{diag}\!\left(\sum_{i=1}^{t} \nabla f_i(x_i)\, \nabla f_i(x_i)^\top\right)$$
$$y_{t+1} = x_t - \eta\, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \ (y_{t+1} - x)^\top G_t\, (y_{t+1} - x)$$
• Regret bound [Duchi, Hazan, Singer '10]:
$$O\!\left(\sum_i \sqrt{\sum_t \nabla_{t,i}^2}\right),$$
which can be a factor of $\sqrt{d}$ better than SGD
• Infrequently occurring, or small-scale, features have a small influence on the regret (and therefore on convergence to the optimal parameter)
(A minimal code sketch follows below.)
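A minimal diagonal-AdaGrad sketch for the unconstrained case (so the $G_t$-norm projection step is skipped; the least-squares losses, the sparse synthetic data, and the step size are illustrative assumptions):

```python
import numpy as np

def adagrad_diagonal(grad_fn, examples, x0, eta, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps eta / sqrt(sum of squared gradients)."""
    x = x0.copy()
    sq_grad_sum = np.zeros_like(x)                       # diagonal of G_t
    for (a, b) in examples:
        g = grad_fn(x, a, b)
        sq_grad_sum += g * g
        x = x - eta * g / (np.sqrt(sq_grad_sum) + eps)   # G_t^{-1/2} scaling
    return x

# Illustrative use on sparse least-squares data: rarely active features still get large steps.
rng = np.random.default_rng(5)
d, T = 20, 5_000
x_true = rng.normal(size=d)
examples = []
for _ in range(T):
    a = rng.normal(size=d) * (rng.random(d) < 0.1)   # ~10% of features active per example
    examples.append((a, a @ x_true))
grad = lambda x, a, b: 2.0 * (x @ a - b) * a
x_hat = adagrad_diagonal(grad, examples, np.zeros(d), eta=0.5)
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```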
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline / stochastic / online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization