smo research
TRANSCRIPT
-
SVM by Sequential Minimal
Optimization (SMO)
Algorithm by John Platt
Lecture by David Page, CS 760: Machine Learning
-
Quick Review
Inner product (specifically here the dot product) is defined as <x, z> = Σi xi zi
Changing x to w and z to x, we may also write <w, x>, or w·x, or wTx
In our usage, x is the feature vector and w is the weight (or coefficient) vector
A line (or more generally a hyperplane) is written as wTx = b, or wTx - b = 0
A linear separator is written as sign(wTx - b)
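As a concrete illustration (not from the slides), a linear separator can be evaluated in a few lines of numpy; the weight vector w, bias b, and feature vector x below are made-up example values.

import numpy as np

# Hypothetical weight vector, bias, and feature vector (made-up values).
w = np.array([2.0, -1.0, 0.5])
b = 1.0
x = np.array([1.0, 0.0, 4.0])

# Hyperplane: w^T x - b = 0; linear separator: sign(w^T x - b).
score = w @ x - b          # 2.0 + 0.0 + 2.0 - 1.0 = 3.0
label = np.sign(score)     # +1 => x falls on the positive side
print(score, label)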
-
Quick Review (Continued)
Recall in SVMs we want to maximize the margin subject to correctly separating the data; that is, we want to:
  minimize ||w||2 subject to yi(wTxi - b) >= 1 for all i
Here the ys are the class values (+1 for positive and -1 for negative), so yi(wTxi - b) >= 1 says xi is correctly labeled with room to spare
Recall ||w||2 is just wTw
-
Recall Full Formulation
As the last lecture showed us, we can:
Solve the dual more efficiently (fewer unknowns):
  maximize Σi αi - ½ Σi Σj yi yj αi αj xiTxj
  subject to 0 <= αi <= C and Σi αi yi = 0
Add parameter C to allow some misclassifications
Replace xiTxj by a more general kernel term K(xi, xj)
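To make the "more general kernel term" concrete, here is a small sketch (my own illustration, not the lecture's code) comparing the plain dot-product kernel with an RBF kernel; the bandwidth gamma is an assumed example value.

import numpy as np

def linear_kernel(x, z):
    # Plain dot product: K(x, z) = x^T z
    return x @ z

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian/RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x_i = np.array([1.0, 2.0])
x_j = np.array([2.0, 1.0])
print(linear_kernel(x_i, x_j))  # 4.0
print(rbf_kernel(x_i, x_j))     # exp(-0.5 * 2) ~= 0.368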
-
Intuitive Introduction to SMO
The perceptron learning algorithm is essentially doing the same thing: find a linear separator by adjusting weights on misclassified examples
Unlike the perceptron, SMO has to maintain the sum over examples of example weight times example label (the constraint Σi αi yi = 0)
Therefore, when SMO adjusts the weight of one example, it must also adjust the weight of another
-
An Old Slide:
Perceptron as Classifier
Output for example x is sign(wTx)
Candidate Hypotheses: real-valued weight vectors w
Training: update w for each misclassified example x (target class t, predicted o) by:
  wi <- wi + η(t - o)xi    Here η is the learning rate parameter
Let E be the error o - t. The above can be rewritten as
  w <- w - ηEx
So the final weight vector is a sum of weighted contributions from each example. The predictive model can be rewritten as a weighted combination of examples rather than features, where the weight on an example is the sum (over iterations) of terms -ηE.
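The two views on this slide can be sketched in a few lines of numpy (my own illustration, with an assumed learning rate eta and made-up toy data): the feature-space update w <- w - ηEx, and the per-example bookkeeping that accumulates -ηE for each example.

import numpy as np

# Made-up toy data: rows are examples, targets are +/-1.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.5]])
t = np.array([1.0, -1.0, -1.0])
eta = 0.1                      # assumed learning rate

w = np.zeros(2)                # feature view: one weight per feature
c = np.zeros(len(X))           # example view: one weight per example

for _ in range(10):            # a few passes over the data
    for i, (x, target) in enumerate(zip(X, t)):
        o = np.sign(w @ x) or 1.0    # predicted class (treat 0 as +1)
        if o != target:
            E = o - target           # error as defined on the slide
            w -= eta * E * x         # w <- w - eta*E*x
            c[i] -= eta * E          # same update, bookkept per example

# The two views agree: w equals the weighted sum of the examples.
assert np.allclose(w, (c[:, None] * X).sum(axis=0))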
-
Corresponding Use in Prediction
To use the perceptron, the prediction is wTx - b. We may treat -b as one more coefficient in w (and omit it here), and may take the sign of this value
In the alternative view of the last slide, the prediction can be written as a sum over examples of (accumulated -ηE terms) times xjTx, minus b, or, revising the weight appropriately,
  Σj α'j xjTx - b
Prediction in SVM is:
  Σj αj yj xjTx - b
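A minimal sketch (my own, not the lecture's code) of this example-weight form of prediction, Σj αj yj K(xj, x) - b; the kernel, alphas, and data below are assumed placeholders.

import numpy as np

def svm_predict(x, X_train, y_train, alphas, b, kernel):
    # Prediction as a weighted combination of training examples:
    # u(x) = sum_j alpha_j * y_j * K(x_j, x) - b
    u = sum(a * y * kernel(xj, x) for a, y, xj in zip(alphas, y_train, X_train))
    return np.sign(u - b), u - b

# Usage with a dot-product kernel and made-up values:
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1.0, -1.0])
alphas  = np.array([0.7, 0.7])
label, score = svm_predict(np.array([2.0, 0.5]), X_train, y_train,
                           alphas, b=0.0, kernel=np.dot)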
-
From Perceptron Rule to SMO Rule
Recall that the SVM optimization problem has the added requirement that: Σi αi yi = 0
Therefore if we increase one α by an amount η, in either direction, then we have to change another α by an equal amount in the opposite direction (relative to class value). We can accomplish this by changing:
  w <- w - ηEx    also viewed as:  α <- α - yηE
to:
  α1 <- α1 - y1η(E1 - E2)
or equivalently:
  α1 <- α1 + y1η(E2 - E1)
We then also adjust α2 so that y1α1 + y2α2 stays the same
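A small sketch (my own illustration) of this paired adjustment: after moving α1 by the SMO-style step, α2 is moved so that y1·α1 + y2·α2 is unchanged; eta, the errors, and the starting alphas are assumed example values.

# Assumed example values for a single paired update.
y1, y2 = 1.0, -1.0
a1, a2 = 0.3, 0.5
E1, E2 = 0.8, -0.2
eta = 0.25                     # step size (next slide explains how it is set)

target = y1 * a1 + y2 * a2     # quantity that must stay fixed

a1_new = a1 + y1 * eta * (E2 - E1)        # alpha1 <- alpha1 + y1*eta*(E2 - E1)
# Solve y1*a1_new + y2*a2_new = target for a2_new (y2 is +/-1, so 1/y2 = y2):
a2_new = y2 * (target - y1 * a1_new)

assert abs(y1 * a1_new + y2 * a2_new - target) < 1e-12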
-
First of Two Other Changes
Instead of η = 1/20 or some similar constant, we set
  η = 1 / (x1·x1 + x2·x2 - 2 x1·x2)
or more generally
  η = 1 / (K(x1,x1) + K(x2,x2) - 2K(x1,x2))
For a given difference in error between two examples, the step size is larger when the two examples are more similar. When x1 and x2 are similar, K(x1,x2) is larger (for typical kernels, including the dot product). Hence the denominator is smaller and the step size is larger.
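A quick numeric check of this claim (my own example with made-up vectors): with the dot-product kernel, a similar pair x1, x2 gives a small denominator and hence a large step size.

import numpy as np

def step_size(x1, x2, kernel=np.dot):
    # eta = 1 / (K(x1,x1) + K(x2,x2) - 2*K(x1,x2))
    return 1.0 / (kernel(x1, x1) + kernel(x2, x2) - 2.0 * kernel(x1, x2))

similar    = step_size(np.array([1.0, 1.0]), np.array([1.1, 0.9]))   # 50.0
dissimilar = step_size(np.array([1.0, 1.0]), np.array([-2.0, 3.0]))  # ~0.077
print(similar, dissimilar)   # the similar pair yields the much larger step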
-
Second of Two Other Changes
Recall our formulation had an additional constraint: 0 <= αi <= C for all i
Therefore, we clip any change we make to any α to respect this constraint
This yields the following SMO algorithm. (So far we have motivated it intuitively. Afterward we will show it really minimizes our objective.)
-
Updating Two αs: One SMO Step
Given examples e1 and e2, set
  α2new = α2 + y2(E1 - E2) / (K(x1,x1) + K(x2,x2) - 2K(x1,x2))
where:
  Ei = ui - yi, the error of the current SVM output ui on example ei
Clip this value in the natural way: if y1 = y2 then:
  max(0, α1 + α2 - C) <= α2new,clipped <= min(C, α1 + α2)
otherwise:
  max(0, α2 - α1) <= α2new,clipped <= min(C, C + α2 - α1)
Set α1new = α1 + s(α2 - α2new,clipped), where s = y1y2
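Putting this slide together in code: a hedged sketch of one SMO step for a pair of examples (my own paraphrase, loosely following Platt's pseudocode); the alphas, labels, errors, C, and kernel values are assumed inputs.

def smo_step(a1, a2, y1, y2, E1, E2, C, K11, K22, K12):
    """One SMO step: returns updated (alpha1, alpha2), or None if no progress."""
    s = y1 * y2
    kappa = K11 + K22 - 2.0 * K12          # second derivative of the objective
    if kappa <= 0:
        return None                         # unusual case; Platt handles it separately

    # Unconstrained optimum for alpha2.
    a2_new = a2 + y2 * (E1 - E2) / kappa

    # Clip so that 0 <= alpha2 <= C and y1*a1 + y2*a2 stays feasible.
    if y1 == y2:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    else:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    a2_clipped = min(max(a2_new, L), H)

    # Move alpha1 by the compensating amount.
    a1_new = a1 + s * (a2 - a2_clipped)
    return a1_new, a2_clipped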
-
Karush-Kuhn-Tucker (KKT) Conditions
It is necessary and sufficient for a solution to our objective that all αs satisfy the following:
An α is 0 iff that example is correctly labeled with room to spare
An α is C iff that example is incorrectly labeled or in the margin
An α is properly between 0 and C (is unbound) iff that example is barely correctly labeled (is a support vector)
We just check KKT to within some small epsilon, typically 10^-3
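One way to code this check (a sketch under my own conventions, not the lecture's code): r = y*u - 1 measures how far the example sits from the margin, and tol plays the role of the small epsilon.

def violates_kkt(alpha, y, u, C, tol=1e-3):
    """True if this example violates the KKT conditions by more than tol.

    u is the current SVM output for the example; r = y*u - 1 is its
    signed slack relative to the margin.
    """
    r = y * u - 1.0
    # alpha could usefully increase (example inside margin or misclassified)
    # or usefully decrease (example outside margin but alpha > 0).
    return (r < -tol and alpha < C) or (r > tol and alpha > 0.0)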
-
SMO Algorithm
Input: C, kernel, kernel parameters, epsilon
Initialize b and all αs to 0
Repeat until KKT satisfied (to within epsilon):
Find an example e1 that violates KKT (prefer unbound examples here, choose randomly among those)
Choose a second example e2. Prefer one that maximizes the step size (in practice, it is faster to just maximize |E1 - E2|). If that fails to result in change, randomly choose an unbound example. If that fails, randomly choose any example. If that fails, re-choose e1.
Update α1 and α2 in one step
Compute the new threshold b
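A compact sketch of this outer loop (my own paraphrase of Platt's pseudocode, not the lecture's code); violates_kkt and smo_step are the hypothetical helpers sketched above, update_threshold is the helper sketched after the next slide, and the second-example heuristics are simplified to just maximizing |E1 - E2|.

import random

def smo_train(X, y, C, kernel, tol=1e-3):
    """Simplified SMO training loop; returns the alphas and threshold b."""
    n = len(X)
    alphas, b = [0.0] * n, 0.0
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]

    def output(i):
        # Current SVM output u_i = sum_j alpha_j y_j K(x_j, x_i) - b
        return sum(alphas[j] * y[j] * K[i][j] for j in range(n)) - b

    progress = True
    while progress:
        progress = False
        for i1 in random.sample(range(n), n):        # first example, random order
            u1 = output(i1)
            if not violates_kkt(alphas[i1], y[i1], u1, C, tol):
                continue
            E1 = u1 - y[i1]
            # Simplified second-choice heuristic: maximize |E1 - E2|.
            i2 = max((j for j in range(n) if j != i1),
                     key=lambda j: abs(E1 - (output(j) - y[j])))
            E2 = output(i2) - y[i2]
            step = smo_step(alphas[i1], alphas[i2], y[i1], y[i2],
                            E1, E2, C, K[i1][i1], K[i2][i2], K[i1][i2])
            if step is None or abs(step[1] - alphas[i2]) < 1e-12:
                continue                              # no real change; try another e1
            b = update_threshold(b, E1, E2, alphas[i1], alphas[i2], step,
                                 y[i1], y[i2], K[i1][i1], K[i2][i2], K[i1][i2])
            alphas[i1], alphas[i2] = step
            progress = True
    return alphas, b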
-
At End of Each Step, Need to Update b
Compute b1 and b2 as follows:
  b1 = E1 + y1(α1new - α1)K(x1,x1) + y2(α2new,clipped - α2)K(x1,x2) + b
  b2 = E2 + y1(α1new - α1)K(x1,x2) + y2(α2new,clipped - α2)K(x2,x2) + b
Choose b = (b1 + b2)/2
When α1 and α2 are not at bound, it is in fact guaranteed that b = b1 = b2
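A sketch of this threshold update (my own helper, the update_threshold referenced in the training loop above); step holds the new (clipped) alpha pair, and the b1/b2 expressions follow Platt's paper.

def update_threshold(b, E1, E2, a1_old, a2_old, step, y1, y2, K11, K22, K12):
    """Recompute the threshold after one SMO step (b1/b2 as in Platt's paper)."""
    a1_new, a2_new = step
    b1 = E1 + y1 * (a1_new - a1_old) * K11 + y2 * (a2_new - a2_old) * K12 + b
    b2 = E2 + y1 * (a1_new - a1_old) * K12 + y2 * (a2_new - a2_old) * K22 + b
    # When both new alphas are unbound, b1 == b2; otherwise average them.
    return (b1 + b2) / 2.0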
-
Derivation of SMO Update Rule
Recall the objective function we want to optimize; written as a minimization it is:
  Ψ(α) = ½ Σi Σj yi yj αi αj K(xi, xj) - Σi αi
Can write the objective as:
  Ψ(α1, α2) = ½K11α1² + ½K22α2² + sK12α1α2 + y1α1v1 + y2α2v2 - α1 - α2 + Ψconst
where:
  Kij = K(xi, xj) and s = y1y2
  vi = Σj>=3 yj αj* Kij = ui + b* - y1α1*K1i - y2α2*K2i   (the interaction of α1 or α2 with everybody else)
  Ψconst collects the terms involving only all the other αs
Starred values are from the end of the previous round
-
Expressing Objective in Terms of One
Lagrange Multiplier Only
Because of the constraint Σi αi yi = 0, each update has to obey:
  α1 + sα2 = α1* + sα2* = w
where w is a constant and s is y1y2
This lets us express the objective function in terms of α2 alone (substitute α1 = w - sα2):
  Ψ(α2) = ½K11(w - sα2)² + ½K22α2² + sK12(w - sα2)α2 + y1(w - sα2)v1 + y2α2v2 - (w - sα2) - α2 + Ψconst
-
Find Minimum of this Objective
Find the extremum via the first derivative with respect to α2:
  dΨ/dα2 = -sK11(w - sα2) + K22α2 + sK12(w - sα2) - K12α2 - y2v1 + y2v2 + s - 1 = 0
If the second derivative is positive (the usual case), then the minimum is where (by simple algebra, remembering ss = 1):
  α2new(K11 + K22 - 2K12) = s(K11 - K12)w + y2(v1 - v2) + 1 - s
-
Finishing Up
Recall that w = α1* + sα2* and vi = ui + b* - y1α1*K1i - y2α2*K2i
So we can rewrite
  α2new(K11 + K22 - 2K12) = s(K11 - K12)w + y2(v1 - v2) + 1 - s
as:
  α2new(K11 + K22 - 2K12) = s(K11 - K12)(sα2* + α1*) + y2(u1 - u2 + y1α1*(K12 - K11) + y2α2*(K22 - K21)) + y2y2 - y1y2
We can rewrite this as (the α1* terms cancel, since ss = 1 and y2y1 = s):
  α2new(K11 + K22 - 2K12) = α2*(K11 + K22 - 2K12) + y2((u1 - y1) - (u2 - y2))
-
Finishing Up
Based on the values of w and v, we rewrote
  α2new(K11 + K22 - 2K12) = s(K11 - K12)w + y2(v1 - v2) + 1 - s
as:
  α2new(K11 + K22 - 2K12) = α2*(K11 + K22 - 2K12) + y2((u1 - y1) - (u2 - y2))
From this, we get our update rule:
  α2new = α2* + y2(E1 - E2) / (K11 + K22 - 2K12)
where
  E1 = u1 - y1
  E2 = u2 - y2