TRANSCRIPT
Support vector machines (SVMs) Lecture 2
David Sontag
New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin
Geometry of linear separators (see blackboard)
A plane can be specified as the set of all points given by:
x = x0 + α u + β v   (Barber, Section 29.1.1-4)
where x0 is a vector from the origin to a point in the plane, and u, v are two non-parallel directions in the plane.
Alternatively, it can be specified by its normal vector (we will call this w): a point x lies in the plane exactly when the dot product w·x equals a fixed scalar. We only need to specify this dot product, a scalar (we will call this the offset, b).
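As a small illustrative example (numbers not from the slides): in two dimensions, taking w = (1, 2) and offset b = 3 gives the line {x : x1 + 2·x2 = 3}; the same line can be written parametrically as x = (3, 0) + α(−2, 1), since w·(−2, 1) = 0.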
Linear Separators
If training data is linearly separable, perceptron is guaranteed to find some linear separator
Which of these is optimal?
SVMs (Vapnik, 1990’s) choose the linear separator with the largest margin
• Good according to intuition, theory, practice
• SVMs became famous when, using images as input, they gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task
Support Vector Machine (SVM)
V. Vapnik
Robust to outliers!
Support vector machines: 3 key ideas
1. Use optimization to find a solution (i.e. a hyperplane) with few errors
2. Seek a large margin separator to improve generalization
3. Use the kernel trick to make large feature spaces computationally efficient
Finding a perfect classifier (when one exists) using linear programming
[Figure: separating hyperplane w·x + b = 0, with the lines w·x + b = +1 and w·x + b = −1 on either side]
For every data point (xt, yt), enforce the constraint
  w·xt + b ≥ +1  for yt = +1,  and  w·xt + b ≤ −1  for yt = −1.
Equivalently, we want to satisfy all of the linear constraints
  yt (w·xt + b) ≥ 1.
This linear program can be efficiently solved using algorithms such as simplex, interior point, or ellipsoid
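To make this concrete, here is a minimal sketch (not from the slides) of the feasibility LP using scipy.optimize.linprog on a toy data set; the data and variable names are illustrative only.

```python
# Variables are z = [w_1, ..., w_d, b]. Each point contributes one constraint
# y_t (w·x_t + b) >= 1, rewritten as -y_t [x_t, 1]·z <= -1 for linprog's
# "A_ub @ z <= b_ub" form. The objective is constant, so we only ask for
# a feasible point.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])  # toy data
y = np.array([+1, +1, -1, -1])
n, d = X.shape

A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows: -y_t * [x_t, 1]
b_ub = -np.ones(n)                                      # ... <= -1
c = np.zeros(d + 1)                                     # feasibility only

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
if res.success:
    w, b = res.x[:d], res.x[d]
    print("separator:", w, b, "margins:", y * (X @ w + b))
```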
Finding a perfect classifier (when one exists) using linear programming
Example of a 2-dimensional linear programming (feasibility) problem, drawn in weight space:
For SVMs, each data point gives one inequality, yt (w·xt + b) ≥ 1, i.e. a half-space of allowed weight vectors.
What happens if the data set is not linearly separable?
Minimizing number of errors (0-1 loss)
• Try to find weights that violate as few constraints as possible?
• Formalize this using the 0-1 loss:
  min_{w,b} Σ_j ℓ_{0,1}(y_j, w·x_j + b),  where ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]
  (the objective counts #(mistakes))
• Unfortunately, minimizing 0-1 loss is NP-hard in the worst case – a non-starter. We need another approach.
Key idea #1: Allow for slack
For each data point:
• If functional margin ≥ 1, don’t care
• If functional margin < 1, pay linear penalty
[Figure: hyperplane w·x + b = 0 with margin lines w·x + b = +1 and w·x + b = −1; points violating the margin carry slacks ξ1, ξ2, ξ3, ξ4]
  min_{w,b,ξ} Σ_j ξ_j
  subject to  y_j (w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j   (“slack variables”)
We now have a linear program again, and can efficiently find its optimum.
Key idea #1: Allow for slack
[Same picture and slack LP as before:  min_{w,b,ξ} Σ_j ξ_j  subject to  y_j (w·x_j + b) ≥ 1 − ξ_j,  ξ_j ≥ 0  (“slack variables”)]
What is the optimal value ξ_j* as a function of w* and b*?
• If y_j (w*·x_j + b*) ≥ 1, then ξ_j = 0
• If y_j (w*·x_j + b*) < 1, then ξ_j = 1 − y_j (w*·x_j + b*)
Sometimes written as ξ_j = max(0, 1 − y_j (w*·x_j + b*))
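As a quick numeric check (illustrative numbers, not from the slides): if a point has y_j (w*·x_j + b*) = 0.3, its optimal slack is ξ_j* = max(0, 1 − 0.3) = 0.7; if y_j (w*·x_j + b*) = 1.5, then ξ_j* = 0.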
Equivalent hinge loss formulation
Starting from  min_{w,b,ξ} Σ_j ξ_j  subject to  y_j (w·x_j + b) ≥ 1 − ξ_j,  ξ_j ≥ 0,
and substituting ξ_j = max(0, 1 − (w·x_j + b) y_j) into the objective, we get:
  min_{w,b} Σ_j max(0, 1 − (w·x_j + b) y_j)
This is empirical risk minimization, using the hinge loss
  min_{w,b} Σ_j ℓ_hinge(y_j, w·x_j + b)
The hinge loss is defined as ℓ_hinge(y, ŷ) = max(0, 1 − y·ŷ)
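A minimal sketch (not from the slides) of evaluating this hinge-loss empirical risk for a candidate (w, b) with NumPy:

```python
import numpy as np

def hinge_risk(w, b, X, y):
    """Empirical risk sum_j max(0, 1 - y_j (w·x_j + b)) for labels y in {-1, +1}."""
    scores = X @ w + b                          # w·x_j + b for every example
    return np.maximum(0.0, 1.0 - y * scores).sum()

# toy usage with made-up numbers
X = np.array([[2.0, 1.0], [-1.0, -1.0]])
y = np.array([+1, -1])
print(hinge_risk(np.array([1.0, 0.5]), 0.0, X, y))   # both margins >= 1 -> 0.0
```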
Hinge loss vs. 0/1 loss
[Plot: hinge loss and 0/1 loss as functions of the margin y·ŷ; the hinge loss is 0 for y·ŷ ≥ 1 and grows linearly to the left, while the 0/1 loss jumps from 0 to 1 at y·ŷ = 0]
Hinge loss upper bounds 0/1 loss!
It is the tightest convex upper bound on the 0/1 loss.
Hinge loss: ℓ_hinge(y, ŷ) = max(0, 1 − y·ŷ)
0-1 Loss: ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]
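For instance (illustrative numbers): at y·ŷ = −0.5 the 0/1 loss is 1 and the hinge loss is 1.5; at y·ŷ = 0.5 the 0/1 loss is 0 and the hinge loss is 0.5; at y·ŷ = 2 both are 0. In every case the hinge loss is at least the 0/1 loss.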
Key idea #2: seek large margin
Key idea #2: seek large margin
• Suppose again that the data is linearly separable and we are solving a feasibility problem, with constraints y_t (w·x_t + b) ≥ 1 for all t
• If the length of the weight vector ||w|| is too small, the optimization problem is infeasible! Why?
[Figure: two copies of the separator w·x + b = 0 with its margin lines w·x + b = +1 and w·x + b = −1; as ||w|| (and |b|) get smaller, the +1 and −1 lines move farther apart, so eventually some data point falls strictly between them and its constraint is violated]
[Figure: hyperplane w·x + b = 0 with margin lines w·x + b = +1 and w·x + b = −1; x1 is a point on the +1 line, x2 is its projection onto the −1 line, and γ is the distance from each margin line to the separator]
What is γ (the geometric margin) as a function of w?
x1 and x2 differ only along the normal direction: x1 − x2 = 2γ · w/||w||.
We also know that: w·x1 + b = +1 and w·x2 + b = −1, so subtracting gives w·(x1 − x2) = 2.
So, 2γ ||w|| = 2, i.e. γ = 1/||w||.
γ_i = distance to the i’th data point,  γ = min_i γ_i
(assuming there is a data point on the w·x + b = +1 or −1 line)
Final result: we can maximize γ by minimizing ||w||² !
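For example (illustrative numbers): if w = (3, 4) then ||w|| = 5, so the geometric margin is γ = 1/5 and the gap between the two canonical lines is 2γ = 2/5.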
(Hard margin) support vector machines
• Example of a convex optimization problem
– A quadratic program
– Polynomial-time algorithms to solve!
• Hyperplane defined by support vectors
– Could use them as a lower-dimensional basis to write down the separator, although we haven’t seen how yet
• More on these later
[Figure: maximum-margin separator w·x + b = 0 with canonical lines w·x + b = +1 and w·x + b = −1; the gap between the canonical lines is the margin 2γ]
Support vectors:
• data points on the canonical lines
Non-support vectors:
• everything else
• moving them will not change w
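For reference, the quadratic program the bullets refer to (shown as an image on the slide) is the standard hard-margin SVM; a ½ factor is often placed in front of ||w||₂², which does not change the minimizer here:
  min_{w,b}  ||w||₂²   subject to   y_j (w·x_j + b) ≥ 1  for all j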
Allowing for slack: “Soft margin SVM”
For each data point:
• If margin ≥ 1, don’t care
• If margin < 1, pay linear penalty
[Figure: separator w·x + b = 0 with margin lines w·x + b = ±1; violating points carry slacks ξ1, ξ2, ξ3, ξ4]
  min_{w,b,ξ}  ||w||₂² + C Σ_j ξ_j
  subject to  y_j (w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j   (“slack variables”)
Slack penalty C > 0:
• C = ∞: have to separate the data!
• C = 0: ignores the data entirely!
• Select C using cross-validation
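A minimal sketch (not from the slides) of fitting a soft-margin linear SVM with scikit-learn and selecting the slack penalty by cross-validation, as the slide suggests; the toy data set and the grid of C values are illustrative only (scikit-learn’s C plays the role of the slack penalty above, up to a constant rescaling of the objective).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.array([[2.0, 1.0], [1.0, 3.0], [0.5, 0.5],
              [-1.0, -1.0], [-2.0, 0.5], [-0.5, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# Grid-search over C with 3-fold cross-validation.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=3)
search.fit(X, y)
print(search.best_params_,
      search.best_estimator_.coef_, search.best_estimator_.intercept_)
```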
Equivalent formulation using hinge loss
Starting from  min_{w,b,ξ}  ||w||₂² + C Σ_j ξ_j  subject to  y_j (w·x_j + b) ≥ 1 − ξ_j,  ξ_j ≥ 0,
and substituting ξ_j = max(0, 1 − y_j (w·x_j + b)) into the objective, we get:
  min_{w,b}  ||w||₂² + C Σ_j ℓ_hinge(y_j, w·x_j + b)
The second term is empirical risk minimization, using the hinge loss ℓ_hinge(y, ŷ) = max(0, 1 − y·ŷ).
The first term, ||w||₂², is called regularization; it is used to prevent overfitting!
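A minimal sketch (not from the slides) of minimizing this regularized hinge objective directly by subgradient descent; the step size and iteration count are illustrative, not tuned.

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, iters=1000):
    """Minimize ||w||^2 + C * sum_j max(0, 1 - y_j (w·x_j + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1.0                                  # points paying hinge penalty
        grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([+1, +1, -1, -1])
w, b = svm_subgradient_descent(X, y)
print(w, b, y * (X @ w + b))    # final weights and margins
```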
What if the data is not linearly separable?
Use features of features of features of features….
Feature space can get really large really quickly!
  φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )ᵀ
Example [figure credit: Tommi Jaakkola]
What’s Next!
• Learn one of the most interesting and exciting recent advancements in machine learning
  – Key idea #3: the “kernel trick”
  – High-dimensional feature spaces at no extra cost
• But first, a detour – constrained optimization!
Constrained optimization
[Three small plots of the objective x², each with its minimizer marked]
  No constraint:        min_x x²                →  x* = 0
  Constraint x ≥ −1:    min_x x²  s.t.  x ≥ −1  →  x* = 0  (constraint inactive)
  Constraint x ≥ 1:     min_x x²  s.t.  x ≥ 1   →  x* = 1  (constraint active)
How do we solve with constraints? Lagrange Multipliers!!!
Lagrange multipliers – Dual variables
Start from the constrained problem min_x x² subject to x ≥ b. Rewrite the constraint as x − b ≥ 0 and add a Lagrange multiplier α ≥ 0.
Introduce the Lagrangian (objective):  L(x, α) = x² − α(x − b)
We will solve:  min_x max_{α≥0} L(x, α)
Why is this equivalent? min is fighting max!
• If x < b: (x − b) < 0, so max_{α≥0} −α(x − b) = ∞ – min won’t let this happen!
• If x > b: (x − b) > 0, so max_{α≥0} −α(x − b) = 0 with α* = 0 – min is cool with 0, and L(x, α) = x² (original objective)
• If x = b: α can be anything, and L(x, α) = x² (original objective)
The min on the outside forces max to behave, so constraints will be satisfied.
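As a worked instance of this recipe (numbers illustrative): for min_x x² subject to x ≥ 1, the Lagrangian is L(x, α) = x² − α(x − 1). Setting ∂L/∂x = 2x − α = 0 gives x = α/2; substituting back gives the dual g(α) = −α²/4 + α, which is maximized over α ≥ 0 at α* = 2, so x* = α*/2 = 1, matching the answer read off the plot above.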
Dual SVM derivation (1) – the linearly separable case (hard margin SVM)
Original optimization problem:
Lagrangian:
Rewrite constraints
One Lagrange multiplier per example
Our goal now is to solve:
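The equations on this slide appeared as images; written out in the standard form (here with the conventional ½ factor, so that ∂L/∂w = w − Σ_j α_j y_j x_j as on the next slides), they are:
  Original optimization problem:   min_{w,b} ½||w||₂²   subject to   y_j (w·x_j + b) ≥ 1  for all j
  Rewritten constraints:           y_j (w·x_j + b) − 1 ≥ 0
  Lagrangian (one multiplier α_j ≥ 0 per example):   L(w, b, α) = ½||w||₂² − Σ_j α_j [ y_j (w·x_j + b) − 1 ]
  Our goal:   min_{w,b} max_{α≥0} L(w, b, α)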
Dual SVM derivation (2) – the linearly separable case (hard margin SVM)
Swap min and max:
  (Primal)  min_{w,b} max_{α≥0} L(w, b, α)    ⟷    (Dual)  max_{α≥0} min_{w,b} L(w, b, α)
Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!
Dual SVM derivation (3) – the linearly separable case (hard margin SVM)
Can solve for optimal w, b as function of α:
  ∂L/∂w = w − Σ_j α_j y_j x_j
Setting this to zero gives  w* = Σ_j α_j y_j x_j.
(Dual)
Substituting these values back in (and simplifying), we obtain the dual problem, in which the sums run over all training examples, the features enter only through dot products x_j·x_k, and everything else is a scalar.
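Written out (the slide showed it as an image), the standard hard-margin dual obtained by this substitution is:
  max_{α≥0}  Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k (x_j · x_k)   subject to   Σ_j α_j y_j = 0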
Dual formulation only depends on dot-products of the features!
First, we introduce a feature mapping φ(x) and run the SVM on the transformed examples φ(x_j).
Next, replace the dot product with an equivalent kernel function:  K(x_j, x_k) = φ(x_j) · φ(x_k).
SVM with kernels
• Never compute features explicitly!!! – Compute dot products in closed form
• O(n²) time in the size of the dataset to compute the objective – much work on speeding this up
Predict with:
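The prediction rule this line refers to was shown as an image; in its standard kernelized form it is ŷ = sign( Σ_j α_j y_j K(x_j, x) + b ), where the sum runs over the training examples and only the support vectors have α_j > 0.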
Common kernels
• Polynomials of degree exactly d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
• And many others: very active area of research!
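The kernel formulas on this slide were images; their standard forms (parameter names σ, η, ν are the usual conventions, not taken from the slide) are:
  Polynomials of degree exactly d:  K(u, v) = (u · v)^d
  Polynomials of degree up to d:    K(u, v) = (1 + u · v)^d
  Gaussian (RBF) kernels:           K(u, v) = exp( −||u − v||² / (2σ²) )
  Sigmoid:                          K(u, v) = tanh( η (u · v) + ν )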
Quadratic kernel [figure credit: Tommi Jaakkola]
Quadratic kernel [Cynthia Rudin]
Feature mapping given by:
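The slide’s equation was an image; for the common variant K(u, v) = (1 + u·v)² in two dimensions, one feature mapping that realizes it is
  φ(x) = ( 1, √2·x1, √2·x2, x1², x2², √2·x1x2 ),
since φ(u)·φ(v) = 1 + 2 u·v + (u·v)² = (1 + u·v)². (If the slide instead used K(u, v) = (u·v)², drop the constant and linear features.)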