Lecture 5: optimization and convexity Sanjeev Arora Elad Hazan COS 402 – Machine Learning and Artificial Intelligence Fall 2016


Page 1:

Lecture 5: optimization and convexity

Sanjeev Arora, Elad Hazan

COS 402 – Machine Learning and Artificial Intelligence, Fall 2016

Page 2:

Admin

• Exercise 2 (implementation) next Thu, in class
• Exercise 3 (written), due next Thu
• Movie – "Ex Machina" + discussion panel w. Prof. Hasson (PNI), Wed Oct. 4th 19:30, tickets still available @ Bella room 204 COS

• Next Tue: special guest - Dr. Yoram Singer @ Google

Page 3:

Recap

• Definition + fundamental theorem of statistical learning
• Powerful classes w. low sample complexity error exist (i.e. python programs), but are computationally hard
• Perceptron
• SVM

Page 4:

Agenda

• Convex relaxations
• Convex optimization
• Gradient descent

Page 5:

Mathematical optimization

Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$

Output: point $x \in K$, such that $f(x) \le f(y) \ \ \forall y \in K$
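A minimal worked instance (illustrative only, not from the slides): take a quadratic over an interval,

$$f(x) = (x - 2)^2, \qquad K = [0, 1] \subseteq \mathbb{R}.$$

The unconstrained minimizer $x = 2$ lies outside $K$; since $f$ decreases on $[0,1]$, the constrained output is the boundary point $x = 1$.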

Page 6:

Mathematical optimization

• Continuous functions (back to calculus: derivatives, differentiability, …)
• Vs. combinatorial optimization as in graph algorithms (strong connection)
• Studied since the early 1900's; lots of work in the Soviet Union (central optimization, resource allocation, military applications, etc.)
• Special cases: linear programming, convex optimization, max flow in graphs

Efficient (poly-time) algorithms

Page 7:

Optimization for linear classification

Given a sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, find a hyperplane (through the origin w.l.o.g.) such that:

$$w = \arg\min_{\|w\| \le 1} \ \#\text{ of mistakes}$$

Page 8:

Optimization for linear classification

$$w = \arg\min_{\|w\| \le 1} \left|\left\{\, i \ \text{s.t.}\ \operatorname{sign}(w^\top x_i) \ne y_i \,\right\}\right|$$

Page 9:

Minimization can be hard

Page 10:

Sum of signs → hard

Page 11:

Convex functions: local → global

Sum of convex functions → also convex

Page 12:

Convex relaxation for 0-1 loss

Page 13:

Convex relaxation for linear classification

$$w = \arg\min_{\|w\| \le 1} \sum_i \ell(w^\top x_i, y_i), \quad \text{such as:}$$

1. Ridge / linear regression: $\ell(w^\top x_i, y_i) = (w^\top x_i - y_i)^2$

2. SVM: $\ell(w^\top x_i, y_i) = \max\{0,\, 1 - y_i\, w^\top x_i\}$
3. Logistic regression: $\ell(w^\top x_i, y_i) = \log(1 + e^{-y_i w^\top x_i})$

vs. the original 0-1 objective:

$$w = \arg\min_{\|w\| \le 1} \left|\left\{\, i \ \text{s.t.}\ \operatorname{sign}(w^\top x_i) \ne y_i \,\right\}\right|$$
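A small sketch of the three surrogate losses above (hypothetical helper names, assuming labels $y_i \in \{-1, +1\}$), to make the relaxation concrete:

```python
import numpy as np

def squared_loss(w, x, y):
    # Ridge / linear regression surrogate: (w^T x - y)^2
    return (w @ x - y) ** 2

def hinge_loss(w, x, y):
    # SVM surrogate: max{0, 1 - y * w^T x}
    return max(0.0, 1.0 - y * (w @ x))

def logistic_loss(w, x, y):
    # Logistic regression surrogate: log(1 + exp(-y * w^T x))
    return np.log1p(np.exp(-y * (w @ x)))

# One labeled example in R^2 with y in {-1, +1}
w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = 1.0
print(squared_loss(w, x, y), hinge_loss(w, x, y), logistic_loss(w, x, y))
```

All three are convex in $w$, unlike the 0-1 objective.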

Page 14:

Small recap

• Finding linear classifiers: formulated as mathematical optimization
• Convexity: property that allows local greedy algorithms
• Formulate convex relaxations to linear classification

Next:
• Algorithms for convex optimization

Page 15:

Convexity

A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if, for all $x, y$:

$$f\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$$

• Informally: smiley ☺
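A quick numerical illustration of the midpoint definition (a sketch with made-up helper names; a random spot-check gives evidence of convexity, not a proof):

```python
import numpy as np

def midpoint_convex(f, num_trials=1000, dim=2, scale=5.0, seed=0):
    # Spot-check f(x/2 + y/2) <= f(x)/2 + f(y)/2 on random pairs of points.
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        x = rng.uniform(-scale, scale, size=dim)
        y = rng.uniform(-scale, scale, size=dim)
        if f((x + y) / 2) > f(x) / 2 + f(y) / 2 + 1e-12:
            return False
    return True

print(midpoint_convex(lambda v: np.sum(v**2)))     # ||v||^2 is convex -> True
print(midpoint_convex(lambda v: -np.sum(v**2)))    # -||v||^2 is concave -> False (almost surely)
print(midpoint_convex(lambda v: np.sin(v).sum()))  # sin is not convex -> False (almost surely)
```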

Page 16:

Calculus reminder: gradient

• Gradient = the vector of coordinate-wise derivatives; its negative is the direction of steepest descent:

$$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$$
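As a quick sanity check of this definition (an illustrative sketch, not course code), the gradient can be approximated coordinate by coordinate with finite differences:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Approximate [grad f(x)]_i = d/dx_i f(x) with central differences.
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Check on f(x) = ||x||^2, whose exact gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v**2), x))  # ~ [2, -4, 6]
```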

Page 17:

Convexity

(assumes differentiability, o/w subgradient; another alternative: second derivative is non-negative in 1D)

• Alternative definition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$

[Figure: $f$ and its tangent at $x$, which lies below the graph at $y$]
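A small numerical check of this inequality (an assumed example using $f(x) = \|x\|^2$ with gradient $2x$, not from the slides):

```python
import numpy as np

def first_order_gap(f, grad, x, y):
    # f(y) - [f(x) + grad f(x)^T (y - x)]; nonnegative for all x, y iff f is convex.
    return f(y) - (f(x) + grad(x) @ (y - x))

f = lambda v: np.sum(v ** 2)   # convex
grad = lambda v: 2 * v
rng = np.random.default_rng(1)
gaps = [first_order_gap(f, grad, rng.normal(size=3), rng.normal(size=3)) for _ in range(5)]
print(gaps)  # all nonnegative: the tangent plane at x stays below f
```

For $f(x) = \|x\|^2$ the gap equals $\|y - x\|^2$, so it is nonnegative exactly as the definition demands.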

Page 18:

Greedy optimization: gradient descent

• Move in the direction of steepest descent, i.e., the negative gradient $-\nabla f(x)$, where

$$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$$

$$x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$$

where $\eta$ is the "step size" or "learning rate".

[Figure: iterates $p_1, p_2, p_3, \dots$ approaching the optimum $p^*$]
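A minimal sketch of this update in code (hypothetical names, fixed step size; illustrative rather than the course's implementation):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, num_steps=100):
    # x_{t+1} <- x_t - eta * grad f(x_t)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad(x)
    return x

# Example: minimize f(x) = ||x - c||^2, whose gradient is 2(x - c); minimizer is c.
c = np.array([3.0, -1.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0], num_steps=200))  # ~ [3, -1]
```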

Page 19:

Gradient descent – constrained set

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
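A sketch of the projected update, assuming for concreteness that $K$ is a Euclidean ball so the projection has a closed form (other constraint sets need their own projection routine):

```python
import numpy as np

def project_onto_ball(y, radius=1.0):
    # Euclidean projection onto K = {x : ||x|| <= radius}:
    # argmin_{x in K} ||y - x|| simply rescales y if it lies outside the ball.
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_gradient_descent(grad, x0, eta=0.1, num_steps=100, radius=1.0):
    # y_{t+1} <- x_t - eta * grad f(x_t);  x_{t+1} = argmin_{x in K} ||y_{t+1} - x||
    x = project_onto_ball(np.array(x0, dtype=float), radius)
    for _ in range(num_steps):
        y = x - eta * grad(x)
        x = project_onto_ball(y, radius)
    return x

# Example: minimize ||x - c||^2 over the unit ball; the unconstrained minimizer c
# lies outside K, so the answer is c rescaled to the boundary.
c = np.array([3.0, 4.0])
print(projected_gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0]))  # ~ [0.6, 0.8]
```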

Page 20:

Convex constraints

Set K is convex if and only if:

$$x, y \in K \ \Rightarrow\ \left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \in K$$

Page 21:

Gradient descent – constrained set

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Let:
• G = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of the constraint set: $\forall x, y \in K,\ \|x - y\| \le D$

Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,

$$f\!\left(\frac{1}{T}\sum_t x_t\right) \ \le\ \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$

Page 22:

Proof:
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta\, \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$

This is the Pythagorean theorem: projecting $y_{t+1}$ onto the convex set $K$ cannot increase the distance to $x^* \in K$.

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Page 23:

Proof:
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta\, \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$

Thus:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta\, \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$

And hence:

$$f\!\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \ \le\ \frac{1}{T}\sum_t \left[f(x_t) - f(x^*)\right] \quad \text{(Jensen, by convexity)}$$
$$\le\ \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*) \quad \text{(first-order convexity condition)}$$
$$\le\ \frac{1}{T}\sum_t \left[\frac{1}{2\eta}\left(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\right) + \frac{\eta}{2} G^2\right]$$
$$\le\ \frac{1}{2\eta T}\, D^2 + \frac{\eta}{2}\, G^2 \ \le\ \frac{DG}{\sqrt{T}}$$

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Page 24:

Gradient descent – constrained set

Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,

$$f\!\left(\frac{1}{T}\sum_t x_t\right) \ \le\ \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$

Thus, to get an $\epsilon$-approximate solution, apply $\frac{D^2 G^2}{\epsilon^2}$ gradient iterations.
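For a sense of scale (an illustrative plug-in of the bound, not a slide example): setting $\frac{DG}{\sqrt{T}} \le \epsilon$ and solving for $T$ gives

$$T \ \ge\ \frac{D^2 G^2}{\epsilon^2}, \qquad \text{e.g. } D = 1,\ G = 1,\ \epsilon = 0.01 \ \Rightarrow\ T \ge 10^4 \text{ iterations.}$$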

Page 25:

GD for linear classification

1. Ridge / linear regression: $\ell(w^\top x_i, y_i) = (w^\top x_i - y_i)^2$
2. SVM: $\ell(w^\top x_i, y_i) = \max\{0,\, 1 - y_i\, w^\top x_i\}$
3. Logistic regression: $\ell(w^\top x_i, y_i) = \log(1 + e^{-y_i w^\top x_i})$

$$w = \arg\min_{\|w\| \le 1} \frac{1}{m}\sum_i \ell(w^\top x_i, y_i)$$

Page 26:

GD for linear classification

$$w = \arg\min_{\|w\| \le 1} \frac{1}{m}\sum_i \ell(w^\top x_i, y_i)$$

$$w_{t+1} = w_t - \eta \cdot \frac{1}{m}\sum_i \ell'(w_t^\top x_i, y_i)\, x_i$$

• Complexity? $\frac{1}{\epsilon^2}$ iterations, each taking ~linear time in the dataset
• Overall $O\!\left(\frac{md}{\epsilon^2}\right)$ running time, where m = # of examples in $\mathbb{R}^d$

• Can we speed it up??
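A hedged sketch of this update for the logistic loss (hypothetical names, synthetic data, and unconstrained for simplicity, i.e. omitting the $\|w\| \le 1$ projection); each step touches all m examples, matching the ~linear per-iteration cost above:

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, num_steps=500):
    # Full-batch gradient descent on (1/m) sum_i log(1 + exp(-y_i w^T x_i)).
    # Gradient of one term: -y_i * x_i / (1 + exp(y_i w^T x_i)).
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        margins = y * (X @ w)                      # y_i * w^T x_i for all i
        coeffs = -y / (1.0 + np.exp(margins))      # per-example derivative factor
        grad = (coeffs[:, None] * X).mean(axis=0)  # average over the m examples
        w = w - eta * grad                         # each step costs O(m d) time
    return w

# Tiny synthetic example with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]))
w = logistic_gd(X, y)
print(w, np.mean(np.sign(X @ w) == y))  # learned direction and training accuracy
```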

Page 27:

Summary

• Mathematical optimization for linear classification
• Convex relaxations
• Gradient descent algorithm
• GD applied to linear classification