smo research
TRANSCRIPT
-
SVM by Sequential Minimal
Optimization (SMO)
Algorithm by John Platt
Lecture by David Page, CS 760: Machine Learning
-
Quick Review
Inner product (specifically here the dot product) is defined as <x, z> = Σi xi zi
Changing x to w and z to x, we may also write <w, x>, or w·x, or wTx
In our usage, x is the feature vector and w is the weight (or coefficient) vector
A line (or more generally a hyperplane) is written as wTx = b, or wTx - b = 0
A linear separator is written as sign(wTx - b)
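As a concrete illustration (not from the slides), a linear separator can be evaluated in a few lines of numpy; the weight vector w, bias b, and feature vector x below are made-up example values.

import numpy as np

# Hypothetical weight vector, bias, and feature vector (made-up values).
w = np.array([2.0, -1.0, 0.5])
b = 1.0
x = np.array([1.0, 0.0, 4.0])

# Hyperplane: w^T x - b = 0; linear separator: sign(w^T x - b).
score = w @ x - b          # 2.0 + 0.0 + 2.0 - 1.0 = 3.0
label = np.sign(score)     # +1 => x falls on the positive side
print(score, label)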
-
Quick Review (Continued)
Recall in SVMs we want to maximize the margin subject to correctly separating the data; that is, we want to:
  minimize ||w||2 subject to yi(wTxi - b) >= 1 for all i
Here the ys are the class values (+1 for positive and -1 for negative), so yi(wTxi - b) >= 1 says xi is correctly labeled with room to spare
Recall ||w||2 is just wTw
-
Recall Full Formulation
As the last lecture showed us, we can:
Solve the dual more efficiently (fewer unknowns):
  maximize Σi αi - ½ Σi Σj yi yj αi αj xiTxj
  subject to 0 <= αi <= C and Σi αi yi = 0
Add parameter C to allow some misclassifications
Replace xiTxj by a more general kernel term K(xi, xj)
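To make the "more general kernel term" concrete, here is a small sketch (my own illustration, not the lecture's code) comparing the plain dot-product kernel with an RBF kernel; the bandwidth gamma is an assumed example value.

import numpy as np

def linear_kernel(x, z):
    # Plain dot product: K(x, z) = x^T z
    return x @ z

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian/RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x_i = np.array([1.0, 2.0])
x_j = np.array([2.0, 1.0])
print(linear_kernel(x_i, x_j))  # 4.0
print(rbf_kernel(x_i, x_j))     # exp(-0.5 * 2) ~= 0.368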
-
Intuitive Introduction to SMO
The perceptron learning algorithm is essentially doing the same thing: find a linear separator by adjusting weights on misclassified examples
Unlike the perceptron, SMO has to maintain the sum over examples of example weight times example label (the constraint Σi αi yi = 0)
Therefore, when SMO adjusts the weight of one example, it must also adjust the weight of another
-
An Old Slide:
Perceptron as Classifier
Output for example x is sign(wTx)
Candidate Hypotheses: real-valued weight vectors w
Training: update w for each misclassified example x (target class t, predicted o) by:
  wi <- wi + η(t - o)xi    Here η is the learning rate parameter
Let E be the error o - t. The above can be rewritten as
  w <- w - ηEx
So the final weight vector is a sum of weighted contributions from each example. The predictive model can be rewritten as a weighted combination of examples rather than features, where the weight on an example is the sum (over iterations) of terms -ηE.
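The two views on this slide can be sketched in a few lines of numpy (my own illustration, with an assumed learning rate eta and made-up toy data): the feature-space update w <- w - ηEx, and the per-example bookkeeping that accumulates -ηE for each example.

import numpy as np

# Made-up toy data: rows are examples, targets are +/-1.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.5]])
t = np.array([1.0, -1.0, -1.0])
eta = 0.1                      # assumed learning rate

w = np.zeros(2)                # feature view: one weight per feature
c = np.zeros(len(X))           # example view: one weight per example

for _ in range(10):            # a few passes over the data
    for i, (x, target) in enumerate(zip(X, t)):
        o = np.sign(w @ x) or 1.0    # predicted class (treat 0 as +1)
        if o != target:
            E = o - target           # error as defined on the slide
            w -= eta * E * x         # w <- w - eta*E*x
            c[i] -= eta * E          # same update, bookkept per example

# The two views agree: w equals the weighted sum of the examples.
assert np.allclose(w, (c[:, None] * X).sum(axis=0))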
-
Corresponding Use in Prediction
To use the perceptron, the prediction is wTx - b. We may treat -b as one more coefficient in w (and omit it here), and may take the sign of this value
In the alternative view of the last slide, the prediction can be written as a sum over examples of (accumulated -ηE terms) times xjTx, minus b, or, revising the weight appropriately,
  Σj α'j xjTx - b
Prediction in SVM is:
  Σj αj yj xjTx - b
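A minimal sketch (my own, not the lecture's code) of this example-weight form of prediction, Σj αj yj K(xj, x) - b; the kernel, alphas, and data below are assumed placeholders.

import numpy as np

def svm_predict(x, X_train, y_train, alphas, b, kernel):
    # Prediction as a weighted combination of training examples:
    # u(x) = sum_j alpha_j * y_j * K(x_j, x) - b
    u = sum(a * y * kernel(xj, x) for a, y, xj in zip(alphas, y_train, X_train))
    return np.sign(u - b), u - b

# Usage with a dot-product kernel and made-up values:
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1.0, -1.0])
alphas  = np.array([0.7, 0.7])
label, score = svm_predict(np.array([2.0, 0.5]), X_train, y_train,
                           alphas, b=0.0, kernel=np.dot)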
-
From Perceptron Rule to SMO Rule
Recall that the SVM optimization problem has the added requirement that: Σi αi yi = 0
Therefore if we increase one α by an amount η, in either direction, then we have to change another α by an equal amount in the opposite direction (relative to class value). We can accomplish this by changing:
  w <- w - ηEx    also viewed as:  α <- α - yηE
to:
  α1 <- α1 - y1η(E1 - E2)
or equivalently:
  α1 <- α1 + y1η(E2 - E1)
We then also adjust α2 so that y1α1 + y2α2 stays the same
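A small sketch (my own illustration) of this paired adjustment: after moving α1 by the SMO-style step, α2 is moved so that y1·α1 + y2·α2 is unchanged; eta, the errors, and the starting alphas are assumed example values.

# Assumed example values for a single paired update.
y1, y2 = 1.0, -1.0
a1, a2 = 0.3, 0.5
E1, E2 = 0.8, -0.2
eta = 0.25                     # step size (next slide explains how it is set)

target = y1 * a1 + y2 * a2     # quantity that must stay fixed

a1_new = a1 + y1 * eta * (E2 - E1)        # alpha1 <- alpha1 + y1*eta*(E2 - E1)
# Solve y1*a1_new + y2*a2_new = target for a2_new (y2 is +/-1, so 1/y2 = y2):
a2_new = y2 * (target - y1 * a1_new)

assert abs(y1 * a1_new + y2 * a2_new - target) < 1e-12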
-
First of Two Other Changes
Instead of η = 1/20 or some similar constant, we set
  η = 1 / (x1·x1 + x2·x2 - 2 x1·x2)
or more generally
  η = 1 / (K(x1,x1) + K(x2,x2) - 2K(x1,x2))
For a given difference in error between two examples, the step size is larger when the two examples are more similar. When x1 and x2 are similar, K(x1,x2) is larger (for typical kernels, including the dot product). Hence the denominator is smaller and the step size is larger.
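A quick numeric check of this claim (my own example with made-up vectors): with the dot-product kernel, a similar pair x1, x2 gives a small denominator and hence a large step size.

import numpy as np

def step_size(x1, x2, kernel=np.dot):
    # eta = 1 / (K(x1,x1) + K(x2,x2) - 2*K(x1,x2))
    return 1.0 / (kernel(x1, x1) + kernel(x2, x2) - 2.0 * kernel(x1, x2))

similar    = step_size(np.array([1.0, 1.0]), np.array([1.1, 0.9]))   # 50.0
dissimilar = step_size(np.array([1.0, 1.0]), np.array([-2.0, 3.0]))  # ~0.077
print(similar, dissimilar)   # the similar pair yields the much larger step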
-
Second of Two Other Changes
Recall our formulation had an additional constraint: 0 <= αi <= C for all i
Therefore, we clip any change we make to any α to respect this constraint
This yields the following SMO algorithm. (So far we have motivated it intuitively. Afterward we will show it really minimizes our objective.)
-
Updating Two αs: One SMO Step
Given examples e1 and e2, set
  α2new = α2 + y2(E1 - E2) / (K(x1,x1) + K(x2,x2) - 2K(x1,x2))
where:
  Ei = ui - yi, the error of the current SVM output ui on example ei
Clip this value in the natural way: if y1 = y2 then:
  max(0, α1 + α2 - C) <= α2new,clipped <= min(C, α1 + α2)
otherwise:
  max(0, α2 - α1) <= α2new,clipped <= min(C, C + α2 - α1)
Set α1new = α1 + s(α2 - α2new,clipped), where s = y1y2
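Putting this slide together in code: a hedged sketch of one SMO step for a pair of examples (my own paraphrase, loosely following Platt's pseudocode); the alphas, labels, errors, C, and kernel values are assumed inputs.

def smo_step(a1, a2, y1, y2, E1, E2, C, K11, K22, K12):
    """One SMO step: returns updated (alpha1, alpha2), or None if no progress."""
    s = y1 * y2
    kappa = K11 + K22 - 2.0 * K12          # second derivative of the objective
    if kappa <= 0:
        return None                         # unusual case; Platt handles it separately

    # Unconstrained optimum for alpha2.
    a2_new = a2 + y2 * (E1 - E2) / kappa

    # Clip so that 0 <= alpha2 <= C and y1*a1 + y2*a2 stays feasible.
    if y1 == y2:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    else:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    a2_clipped = min(max(a2_new, L), H)

    # Move alpha1 by the compensating amount.
    a1_new = a1 + s * (a2 - a2_clipped)
    return a1_new, a2_clipped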
-
Karush-Kuhn-Tucker (KKT) Conditions
It is necessary and sufficient for a solution to our objective that all αs satisfy the following:
An α is 0 iff that example is correctly labeled with room to spare
An α is C iff that example is incorrectly labeled or in the margin
An α is properly between 0 and C (is unbound) iff that example is barely correctly labeled (is a support vector)
We just check KKT to within some small epsilon, typically 10^-3
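One way to code this check (a sketch under my own conventions, not the lecture's code): r = y*u - 1 measures how far the example sits from the margin, and tol plays the role of the small epsilon.

def violates_kkt(alpha, y, u, C, tol=1e-3):
    """True if this example violates the KKT conditions by more than tol.

    u is the current SVM output for the example; r = y*u - 1 is its
    signed slack relative to the margin.
    """
    r = y * u - 1.0
    # alpha could usefully increase (example inside margin or misclassified)
    # or usefully decrease (example outside margin but alpha > 0).
    return (r < -tol and alpha < C) or (r > tol and alpha > 0.0)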
-
SMO Algorithm
Input: C, kernel, kernel parameters, epsilon
Initialize b and all αs to 0
Repeat until KKT satisfied (to within epsilon):
Find an example e1 that violates KKT (prefer unbound examples here, choose randomly among those)
Choose a second example e2. Prefer one that maximizes the step size (in practice, it is faster to just maximize |E1 - E2|). If that fails to result in change, randomly choose an unbound example. If that fails, randomly choose any example. If that fails, re-choose e1.
Update α1 and α2 in one step
Compute the new threshold b
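A compact sketch of this outer loop (my own paraphrase of Platt's pseudocode, not the lecture's code); violates_kkt and smo_step are the hypothetical helpers sketched above, update_threshold is the helper sketched after the next slide, and the second-example heuristics are simplified to just maximizing |E1 - E2|.

import random

def smo_train(X, y, C, kernel, tol=1e-3):
    """Simplified SMO training loop; returns the alphas and threshold b."""
    n = len(X)
    alphas, b = [0.0] * n, 0.0
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]

    def output(i):
        # Current SVM output u_i = sum_j alpha_j y_j K(x_j, x_i) - b
        return sum(alphas[j] * y[j] * K[i][j] for j in range(n)) - b

    progress = True
    while progress:
        progress = False
        for i1 in random.sample(range(n), n):        # first example, random order
            u1 = output(i1)
            if not violates_kkt(alphas[i1], y[i1], u1, C, tol):
                continue
            E1 = u1 - y[i1]
            # Simplified second-choice heuristic: maximize |E1 - E2|.
            i2 = max((j for j in range(n) if j != i1),
                     key=lambda j: abs(E1 - (output(j) - y[j])))
            E2 = output(i2) - y[i2]
            step = smo_step(alphas[i1], alphas[i2], y[i1], y[i2],
                            E1, E2, C, K[i1][i1], K[i2][i2], K[i1][i2])
            if step is None or abs(step[1] - alphas[i2]) < 1e-12:
                continue                              # no real change; try another e1
            b = update_threshold(b, E1, E2, alphas[i1], alphas[i2], step,
                                 y[i1], y[i2], K[i1][i1], K[i2][i2], K[i1][i2])
            alphas[i1], alphas[i2] = step
            progress = True
    return alphas, b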
-
At End of Each Step, Need to Update b
Compute b1 and b2 as follows:
  b1 = E1 + y1(α1new - α1)K(x1,x1) + y2(α2new,clipped - α2)K(x1,x2) + b
  b2 = E2 + y1(α1new - α1)K(x1,x2) + y2(α2new,clipped - α2)K(x2,x2) + b
Choose b = (b1 + b2)/2
When α1 and α2 are not at bound, it is in fact guaranteed that b = b1 = b2
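A sketch of this threshold update (my own helper, the update_threshold referenced in the training loop above); step holds the new (clipped) alpha pair, and the b1/b2 expressions follow Platt's paper.

def update_threshold(b, E1, E2, a1_old, a2_old, step, y1, y2, K11, K22, K12):
    """Recompute the threshold after one SMO step (b1/b2 as in Platt's paper)."""
    a1_new, a2_new = step
    b1 = E1 + y1 * (a1_new - a1_old) * K11 + y2 * (a2_new - a2_old) * K12 + b
    b2 = E2 + y1 * (a1_new - a1_old) * K12 + y2 * (a2_new - a2_old) * K22 + b
    # When both new alphas are unbound, b1 == b2; otherwise average them.
    return (b1 + b2) / 2.0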
-
Derivation of SMO Update Rule
Recall the objective function we want to optimize; written as a minimization it is:
  Ψ(α) = ½ Σi Σj yi yj αi αj K(xi, xj) - Σi αi
Can write the objective as:
  Ψ(α1, α2) = ½K11α1² + ½K22α2² + sK12α1α2 + y1α1v1 + y2α2v2 - α1 - α2 + Ψconst
where:
  Kij = K(xi, xj) and s = y1y2
  vi = Σj>=3 yj αj* Kij = ui + b* - y1α1*K1i - y2α2*K2i   (the interaction of α1 or α2 with everybody else)
  Ψconst collects the terms involving only all the other αs
Starred values are from the end of the previous round
-
Expressing Objective in Terms of One
Lagrange Multiplier Only
Because of the constraint Σi αi yi = 0, each update has to obey:
  α1 + sα2 = α1* + sα2* = w
where w is a constant and s is y1y2
This lets us express the objective function in terms of α2 alone (substitute α1 = w - sα2):
  Ψ(α2) = ½K11(w - sα2)² + ½K22α2² + sK12(w - sα2)α2 + y1(w - sα2)v1 + y2α2v2 - (w - sα2) - α2 + Ψconst
-
Find Minimum of this Objective
Find the extremum via the first derivative with respect to α2:
  dΨ/dα2 = -sK11(w - sα2) + K22α2 + sK12(w - sα2) - K12α2 - y2v1 + y2v2 + s - 1 = 0
If the second derivative is positive (the usual case), then the minimum is where (by simple algebra, remembering ss = 1):
  α2new(K11 + K22 - 2K12) = s(K11 - K12)w + y2(v1 - v2) + 1 - s
-
Finishing Up
Recall that w = α1* + sα2* and vi = ui + b* - y1α1*K1i - y2α2*K2i
So we can rewrite
  α2new(K11 + K22 - 2K12) = s(K11 - K12)w + y2(v1 - v2) + 1 - s
as:
  α2new(K11 + K22 - 2K12) = s(K11 - K12)(sα2* + α1*) + y2(u1 - u2 + y1α1*(K12 - K11) + y2α2*(K22 - K21)) + y2y2 - y1y2
We can rewrite this as (the α1* terms cancel, since ss = 1 and y2y1 = s):
  α2new(K11 + K22 - 2K12) = α2*(K11 + K22 - 2K12) + y2((u1 - y1) - (u2 - y2))
-
Finishing Up
Based on the values of w and v, we rewrote
  α2new(K11 + K22 - 2K12) = s(K11 - K12)w + y2(v1 - v2) + 1 - s
as:
  α2new(K11 + K22 - 2K12) = α2*(K11 + K22 - 2K12) + y2((u1 - y1) - (u2 - y2))
From this, we get our update rule:
  α2new = α2* + y2(E1 - E2) / (K11 + K22 - 2K12)
where
  E1 = u1 - y1
  E2 = u2 - y2