TRANSCRIPT
Tutorial: Part 1
Optimization for Machine Learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora, Yoram Singer
ML paradigm
Distribution over examples $a \in \mathbb{R}^d$ → Machine → label $b = f_{\mathrm{parameters}}(a)$ (e.g., chair / car)
This tutorial: training the machine
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline / online / stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
NOT touched upon:
• Parallelism / distributed computation (asynchronous optimization, HOGWILD!, etc.), Bayesian inference in graphical models, Markov chain Monte Carlo, partial information and bandit algorithms
Mathematical optimization
Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: minimizer $x \in K$, such that $f(x) \le f(y) \ \forall y \in K$
Accessing f? (values, differentials, …)
Generally NP-hard, given full access to the function.
What is Optimization?
But generally speaking... we're screwed:
• Local (non-global) minima of $f_0$
• All kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$
[Figure: 3-D surface plot of a highly non-convex function over $[-3, 3]^2$, with many local minima. Slide credit: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009.]
Learning = optimization over data (a.k.a. Empirical Risk Minimization)
Fitting the parameters of the model ("training") = optimization problem:
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
m = # of examples, (a, b) = (features, labels), d = dimension
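To make the ERM objective concrete, here is a minimal NumPy sketch (not from the slides; the synthetic data, the generic `loss` argument, and the squared-norm choice of R(x) are illustrative assumptions):

```python
import numpy as np

def empirical_risk(x, A, b, loss, lam=0.0):
    """Regularized empirical risk: (1/m) * sum_i loss(x, a_i, b_i) + lam * ||x||^2."""
    m = A.shape[0]
    data_term = np.mean([loss(x, A[i], b[i]) for i in range(m)])
    return data_term + lam * np.dot(x, x)

# Tiny synthetic example (m = 100 examples, d = 5 features, labels in {-1, +1}).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = np.sign(A @ rng.normal(size=5))
squared_loss = lambda x, a, y: (x @ a - y) ** 2
print(empirical_risk(np.zeros(5), A, b, squared_loss, lam=0.1))
```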
Example: linear classification
Given a sample $S = \{(a_1, b_1), \dots, (a_m, b_m)\}$, find a hyperplane (through the origin, w.l.o.g.) such that:
$$x = \arg\min_{\|x\| \le 1} \ \#\text{ of mistakes} \ =\ \arg\min_{\|x\| \le 1} \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big| \ =\ \arg\min_{\|x\| \le 1} \ \frac{1}{m}\sum_i \ell(x, a_i, b_i),$$
$$\text{for } \ \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \ne b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$
NP-hard!
Sum of signs → global optimization is NP-hard! But locally verifiable…
Local property that ensures global optimality?
Convexity
A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:
$$f\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \ \le\ \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$$
• Informally: smiley :)
• Alternative definition (for differentiable f):
$$f(y) \ \ge\ f(x) + \nabla f(x)^\top (y - x)$$
[Figure: a convex curve lying above its tangent line at $x$, evaluated at a second point $y$.]
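As a quick sanity check (not on the slides), both definitions can be verified for $f(x) = \|x\|^2$:
$$\tfrac{1}{2}\|x\|^2 + \tfrac{1}{2}\|y\|^2 - \big\|\tfrac{1}{2}x + \tfrac{1}{2}y\big\|^2 = \tfrac{1}{4}\|x - y\|^2 \ \ge\ 0,$$
and, since $\nabla f(x) = 2x$,
$$\|y\|^2 - \|x\|^2 - 2x^\top(y - x) = \|y - x\|^2 \ \ge\ 0.$$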
Convex sets
A set K is convex if and only if:
$$x, y \in K \ \Rightarrow\ \tfrac{1}{2}x + \tfrac{1}{2}y \in K$$
Convex relaxations for linear (& kernel) classification
Loss functions of the form $\ell(x, a_i, b_i) = \ell(x^\top a_i \cdot b_i)$, replacing the non-convex objective
$$x = \arg\min_{\|x\| \le 1} \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$
1. Ridge / linear regression: $\ell(x^\top a_i, b_i) = (x^\top a_i - b_i)^2$
2. SVM (hinge loss): $\ell(x^\top a_i, b_i) = \max\{0,\ 1 - b_i \, x^\top a_i\}$
3. Logistic regression: $\ell(x^\top a_i, b_i) = \log\!\left(1 + e^{-b_i \cdot x^\top a_i}\right)$
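A minimal NumPy sketch of these three surrogate losses (illustrative only; written for a single example with label $b \in \{-1, +1\}$, except the squared loss, which also covers regression targets):

```python
import numpy as np

def squared_loss(x, a, b):
    """Ridge / linear regression loss: (x^T a - b)^2."""
    return (x @ a - b) ** 2

def hinge_loss(x, a, b):
    """SVM (hinge) loss: max(0, 1 - b * x^T a)."""
    return max(0.0, 1.0 - b * (x @ a))

def logistic_loss(x, a, b):
    """Logistic loss: log(1 + exp(-b * x^T a))."""
    return np.log1p(np.exp(-b * (x @ a)))

x = np.array([0.5, -0.2]); a = np.array([1.0, 2.0]); b = 1.0
print(squared_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))
```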
We have: cast learning as mathematical optimization, and argued that convexity is algorithmically important.
Next → algorithms!
Gradient descent, constrained set
$$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$$
Update (gradient step followed by projection onto K):
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
[Figure: iterates $p_1, p_2, p_3, \dots$ stepping outside K and being projected back, approaching $p^*$.]
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$$f\!\left(\frac{1}{T}\sum_t x_t\right) \ \le\ \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Where:
• G = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of the constraint set: $\forall x, y \in K,\ \|x - y\| \le D$
(A code sketch of this method follows below.)
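A minimal sketch of projected gradient descent, assuming the constraint set is a Euclidean ball so the projection has a closed form (any convex K with a projection oracle would do; the objective and constants below are illustrative):

```python
import numpy as np

def project_onto_ball(y, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def projected_gradient_descent(grad_f, x0, radius, G, T):
    """Run T steps with step size eta = D / (G * sqrt(T)), D = 2 * radius.
    Returns the average iterate, which is what the theorem bounds."""
    D = 2.0 * radius
    eta = D / (G * np.sqrt(T))
    x = x0.copy()
    iterates = []
    for _ in range(T):
        iterates.append(x)
        y = x - eta * grad_f(x)           # gradient step
        x = project_onto_ball(y, radius)  # projection back onto K
    return np.mean(iterates, axis=0)

# Illustrative use: minimize f(x) = ||x - c||^2 over the unit ball.
c = np.array([2.0, 0.0])
grad_f = lambda x: 2.0 * (x - c)
x_avg = projected_gradient_descent(grad_f, np.zeros(2), radius=1.0, G=6.0, T=2000)
print(x_avg)   # close to the constrained optimum [1, 0]
```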
Convergence of gradient descent
Proof (using the update $y_{t+1} = x_t - \eta \nabla f(x_t)$, $x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$):
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2 (the Pythagorean theorem: projecting onto a convex set does not increase the distance to any point of the set):
$$\|x^* - x_{t+1}\|^2 \ \le\ \|x^* - y_{t+1}\|^2$$
Thus:
$$\|x^* - x_{t+1}\|^2 \ \le\ \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$
And hence (Jensen's inequality, then convexity, then the bound above rearranged, then telescoping):
$$f\!\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \ \le\ \frac{1}{T}\sum_t \big(f(x_t) - f(x^*)\big) \ \le\ \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*)$$
$$\le\ \frac{1}{T}\sum_t \left[\frac{1}{2\eta}\Big(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\Big) + \frac{\eta}{2} G^2\right] \ \le\ \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2 \ \le\ \frac{DG}{\sqrt{T}}$$
Recap
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$$f\!\left(\frac{1}{T}\sum_t x_t\right) \ \le\ \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
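Spelling out the iteration count (a small derivation, not on the slide): to make the error term at most $\epsilon$ we need
$$\frac{DG}{\sqrt{T}} \le \epsilon \quad\Longleftrightarrow\quad T \ge \left(\frac{DG}{\epsilon}\right)^2,$$
so, treating $D$ and $G$ as constants, $T = O(1/\epsilon^2)$ iterations suffice.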
Gradient Descent: caveat
For ERM problems
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
1. The gradient depends on all the data
2. What about generalization?
Next few slides:
Simultaneous optimization and generalization → faster optimization! (single example per iteration)
Statistical (PAC) learning
Nature: i.i.d. examples from a distribution D over $A \times B = \{(a, b)\}$
Learner: outputs a hypothesis $h$ from a class $H = \{h_1, \dots, h_N\}$, after seeing examples $(a_1, b_1), \dots, (a_m, b_m)$
Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$
A hypothesis class $H: X \to Y$ is learnable if for all $\epsilon, \delta > 0$ there exists an algorithm such that, after seeing $m$ examples, for $m = \mathrm{poly}(\delta, \epsilon, \mathrm{dimension}(H))$, it finds $h$ such that with probability $1 - \delta$:
$$\mathrm{err}(h) \ \le\ \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon, \qquad \text{where } \ \mathrm{err}(h) = \mathbb{E}_{(a, b) \sim D}\big[\ell(h, (a, b))\big]$$
More powerful setting: Online Learning in Games
Iteratively, for $t = 1, 2, \dots, T$:
• Player: $h_t \in H$
• Adversary: $(a_t, b_t) \in A \times B$
• Loss $\ell(h_t, (a_t, b_t))$
Goal: minimize the (average, expected) regret:
$$\frac{1}{T}\left[\sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t))\right] \ \xrightarrow[T \to \infty]{}\ 0$$
Vanishing regret → generalization in the PAC setting! (online-to-batch conversion)
From this point onwards: $f_t(x) = \ell(x, a_t, b_t)$ = the loss on one example
Can we minimize regret efficiently?
Online gradient descent [Zinkevich '05]
$$y_{t+1} = x_t - \eta \nabla f_t(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: $\ \mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*) = O(\sqrt{T})$
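A minimal online gradient descent loop (a sketch; the decision set $K = [-1, 1]$, the linear losses $f_t(x) = c_t x$, and the step size are illustrative assumptions):

```python
import numpy as np

def online_gradient_descent(costs, eta):
    """OGD on K = [-1, 1] with linear losses f_t(x) = c_t * x."""
    x = 0.0
    total_loss = 0.0
    for c in costs:                       # adversary reveals f_t after we commit to x_t
        total_loss += c * x               # suffer f_t(x_t)
        y = x - eta * c                   # gradient step (gradient of c*x is c)
        x = float(np.clip(y, -1.0, 1.0))  # projection onto [-1, 1]
    return total_loss

rng = np.random.default_rng(1)
T = 10_000
costs = rng.uniform(-1.0, 1.0, size=T)
alg_loss = online_gradient_descent(costs, eta=1.0 / np.sqrt(T))
best_fixed = min(costs.sum() * x for x in (-1.0, 1.0))  # best fixed point in hindsight
print("regret:", alg_loss - best_fixed, " vs sqrt(T) =", np.sqrt(T))
```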
Analysis
Denote $\nabla_t := \nabla f_t(x_t)$.
Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \ \le\ \|y_{t+1} - x^*\|$$
Thus:
$$\|x_{t+1} - x^*\|^2 \ \le\ \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Convexity:
$$\sum_t \big[f_t(x_t) - f_t(x^*)\big] \ \le\ \sum_t \nabla_t^\top (x_t - x^*) \ \le\ \frac{1}{2\eta}\sum_t \Big(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\Big) + \frac{\eta}{2}\sum_t \|\nabla_t\|^2$$
$$\le\ \frac{1}{2\eta}\|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 \ =\ O(\sqrt{T}) \qquad \text{(for } \eta \approx D / (G\sqrt{T})\text{)}$$
Lower bound
• Two loss functions, T iterations: $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$ (the second loss is the first times $-1$), chosen at random each round
• Expected loss = 0 (for any algorithm)
• Regret (compared to either fixed point $-1$ or $+1$):
$$\mathrm{Regret} = \Omega(\sqrt{T}), \qquad \mathbb{E}\big[\,|\#\{+1\text{'s}\} - \#\{-1\text{'s}\}|\,\big] = \Omega(\sqrt{T})$$
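An illustrative simulation of this lower bound (not from the slides): with random $\pm 1$ linear losses on $K = [-1, 1]$, any algorithm has expected loss 0, while the best fixed point in hindsight gains $|\#(+1\text{ losses}) - \#(-1\text{ losses})|$, which grows like $\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 10_000
for T in [100, 1_000, 10_000, 100_000]:
    # Sum of T independent +/-1 signs, one value per trial.
    sums = 2 * rng.binomial(T, 0.5, size=trials) - T
    print(f"T={T:>7}  E|#(+1) - #(-1)| ~ {np.abs(sums).mean():8.1f}   sqrt(T) = {T**0.5:8.1f}")
```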
Stochastic gradient descent
Learning problem:
$$\arg\min_{x \in \mathbb{R}^d} \ F(x) = \mathbb{E}_{(a_i, b_i)}\big[\ell_i(x, a_i, b_i)\big], \qquad \text{random example: } f_t(x) = \ell_i(x, a_i, b_i)$$
1. We have proved (for any sequence of $\nabla_t$):
$$\frac{1}{T}\sum_t \nabla_t^\top x_t \ \le\ \min_{x^* \in K} \frac{1}{T}\sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$
2. Taking the (conditional) expectation:
$$\mathbb{E}\left[F\!\left(\frac{1}{T}\sum_t x_t\right)\right] - \min_{x^* \in K} F(x^*) \ \le\ \mathbb{E}\left[\frac{1}{T}\sum_t \nabla_t^\top (x_t - x^*)\right] \ \le\ \frac{DG}{\sqrt{T}}$$
Stochastic vs. full gradient descent
One example per step, same convergence as GD, and it gives direct generalization! (formally this needs martingales)
Total running time for $\epsilon$ generalization error: $O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$. (A code sketch follows below.)
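A minimal stochastic gradient descent sketch for this setting (illustrative assumptions: least-squares loss, one random example per step, projection onto a Euclidean ball, and averaging of the iterates):

```python
import numpy as np

def sgd_least_squares(A, b, radius, eta, T, seed=0):
    """SGD on F(x) = E_i[(x^T a_i - b_i)^2], one random example per step."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(T):
        i = rng.integers(m)                    # sample one example
        grad = 2.0 * (x @ A[i] - b[i]) * A[i]  # gradient of the single-example loss
        x = x - eta * grad
        norm = np.linalg.norm(x)
        if norm > radius:                      # projection onto {||x|| <= radius}
            x *= radius / norm
        avg += x
    return avg / T

rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10))
x_true = rng.normal(size=10)
b = A @ x_true + 0.1 * rng.normal(size=1000)
x_hat = sgd_least_squares(A, b, radius=5.0, eta=0.01, T=20_000)
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```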
Regularization & Gradient Descent++
Why "regularize"?
• Statistical learning theory / Occam's razor: # of examples needed to learn a hypothesis class ~ its "dimension"
  • VC dimension
  • Fat-shattering dimension
  • Rademacher width
  • Margin / norm of a linear / kernel classifier
• PAC theory: regularization ↔ reduced complexity
• Regret minimization: regularization ↔ stability
Minimize regret: best-in-hindsight
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$
• Most natural (follow-the-leader): play the best decision on the losses seen so far,
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
• Provably works [Kalai-Vempala '05] with one step of lookahead:
$$x_t' = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$$
• So if $x_t \approx x_{t+1}$, we get a regret bound
• But instability! $x_t - x_{t+1}$ can be large (see the small demonstration below)
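A tiny illustration of the instability (not from the slides): on $K = [-1, 1]$ with alternating linear losses $f_t(x) = \pm x$, follow-the-leader oscillates between its past minimizers and suffers regret linear in T:

```python
# Follow-The-Leader on K = [-1, 1] with losses f_t(x) = c_t * x, c_t alternating +1 / -1.
T = 1000
cum = 0.0        # cumulative cost coefficient seen so far
ftl_loss = 0.0
for t in range(T):
    # FTL plays the minimizer of the cumulative loss so far (x = 0 on a tie / first round).
    x = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
    c = 1.0 if t % 2 == 0 else -1.0
    ftl_loss += c * x
    cum += c
best_fixed = -abs(cum)                        # loss of the best fixed point in hindsight
print("FTL regret:", ftl_loss - best_fixed)   # grows linearly in T (about T/2 here)
```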
Fixing FTL: Follow-The-Regularized-Leader (FTRL)
• Linearize: replace $f_t$ by the linear function $\nabla f_t(x_t)^\top x$
• Add regularization:
$$x_t = \arg\min_{x \in K} \ \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$
• $R(x)$ is a strongly convex function, which ensures stability:
$$\nabla_t^\top (x_t - x_{t+1}) = O(\eta)$$
FTRL vs. gradient descent
• Take $R(x) = \frac{1}{2}\|x\|^2$:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) \ =\ \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$$
• Essentially OGD with "lazy" projections: starting with $y_1 = 0$, for $t = 1, 2, \dots$
$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
(Here $\Pi_K$ denotes Euclidean projection onto K.)
FTRL vs. Multiplicative Weights
• Experts setting: $K = \Delta_n$ = distributions over the $n$ experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy
• Gives the Multiplicative Weights method!
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) \ =\ \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right) \Big/\ Z_t$$
(entrywise exponential; $Z_t$ is the normalization constant)
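A minimal sketch of the resulting Multiplicative Weights update in the experts setting (the random loss vectors $c_t \in [0, 1]^n$ and the step size are illustrative assumptions):

```python
import numpy as np

def multiplicative_weights(loss_vectors, eta):
    """Entropic FTRL on the simplex: x_t proportional to exp(-eta * cumulative losses)."""
    n = loss_vectors.shape[1]
    cum = np.zeros(n)              # cumulative loss per expert
    total = 0.0
    for c in loss_vectors:
        w = np.exp(-eta * cum)     # entrywise exponential
        x = w / w.sum()            # normalize: a distribution over experts
        total += x @ c             # expected loss this round
        cum += c
    return total, cum

rng = np.random.default_rng(4)
T, n = 5_000, 10
losses = rng.uniform(0.0, 1.0, size=(T, n))
alg_loss, cum = multiplicative_weights(losses, eta=np.sqrt(np.log(n) / T))
print("regret vs best expert:", alg_loss - cum.min())
```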
FTRL ⇔ Online Mirror Descent
FTRL form:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
Mirror descent form:
$$y_{t+1} = (\nabla R)^{-1}\big(\nabla R(y_t) - \eta \nabla f_t(x_t)\big), \qquad x_t = \Pi^R_K(y_t)$$
Bregman projection:
$$\Pi^R_K(y) = \arg\min_{x \in K} B_R(x \,\|\, y), \qquad B_R(x \,\|\, y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$$
Adaptive Regularization: AdaGrad
• Consider a generalized linear model, where the prediction is a function of $a^\top x$; then the gradient is a scalar times the feature vector:
$$\nabla f_t(x) = \ell'(a_t, b_t, x) \cdot a_t$$
• OGD update: $x_{t+1} = x_t - \eta \nabla_t = x_t - \eta\, \ell'(a_t, b_t, x)\, a_t$
• All features are treated equally in updating the parameter vector
• In typical text classification tasks, the feature vectors $a_t$ are very sparse: slow learning!
• Adaptive regularization: per-feature learning rates
Optimal regularization
• The general RFTL form:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$
• Which regularizer to pick?
• AdaGrad: treat this as a learning problem! Family of regularizations:
$$R(x) = \|x\|_A^2 \quad \text{s.t.} \quad A \succeq 0, \ \mathrm{Trace}(A) = d$$
• Objective in the matrix world: best regret in hindsight!
AdaGrad (diagonal form)
• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. Use $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:
$$G_t = \mathrm{diag}\!\left(\sum_{i=1}^{t} \nabla f_i(x_i)\, \nabla f_i(x_i)^\top\right)$$
$$y_{t+1} = x_t - \eta\, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \ (y_{t+1} - x)^\top G_t\, (y_{t+1} - x)$$
• Regret bound [Duchi, Hazan, Singer '10]:
$$O\!\left(\sum_i \sqrt{\sum_t \nabla_{t,i}^2}\right),$$
which can be a factor of $\sqrt{d}$ better than SGD
• Infrequently occurring, or small-scale, features have a small influence on the regret (and therefore on convergence to the optimal parameter)
(A minimal code sketch follows below.)
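A minimal diagonal-AdaGrad sketch for the unconstrained case (so the $G_t$-norm projection step is skipped; the least-squares losses, the sparse synthetic data, and the step size are illustrative assumptions):

```python
import numpy as np

def adagrad_diagonal(grad_fn, examples, x0, eta, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps eta / sqrt(sum of squared gradients)."""
    x = x0.copy()
    sq_grad_sum = np.zeros_like(x)                       # diagonal of G_t
    for (a, b) in examples:
        g = grad_fn(x, a, b)
        sq_grad_sum += g * g
        x = x - eta * g / (np.sqrt(sq_grad_sum) + eps)   # G_t^{-1/2} scaling
    return x

# Illustrative use on sparse least-squares data: rarely active features still get large steps.
rng = np.random.default_rng(5)
d, T = 20, 5_000
x_true = rng.normal(size=d)
examples = []
for _ in range(T):
    a = rng.normal(size=d) * (rng.random(d) < 0.1)   # ~10% of features active per example
    examples.append((a, a @ x_true))
grad = lambda x, a, b: 2.0 * (x @ a - b) * a
x_hat = adagrad_diagonal(grad, examples, np.zeros(d), eta=0.5)
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```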
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline / stochastic / online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization