An Introduction to Support Vector Machines
TRANSCRIPT
Machine Learning for Data Mining: Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 21, 2015
1 / 85
Outline
1 History
2 Separable Classes
   Separable Classes
   Hyperplanes
3 Support Vectors
   Support Vectors
   Quadratic Optimization
   Lagrange Multipliers
   Method
   Karush-Kuhn-Tucker Conditions
   Primal-Dual Problem for Lagrangian
   Properties
4 Kernel
   Kernel Idea
   Higher Dimensional Space
   Examples
   Now, How to select a Kernel?
2 / 85
History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
   At the Institute of Control Sciences, Moscow
   In the paper "Estimation of dependencies based on empirical data"
Corinna Cortes and Vladimir Vapnik in 1995
   They invented its current incarnation, soft margins, at the AT&T Labs.
BTW Corinna Cortes
   Danish computer scientist known for her contributions to the field of machine learning.
   She is currently the Head of Google Research, New York.
   Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on the theoretical foundations of support vector machines.
3 / 85
In addition
Alexey Yakovlevich Chervonenkis
   He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.
He died on 22 September 2014
   At Losiny Ostrov National Park.
4 / 85
Applications
Partial List
1 Predictive Control
   Control of chaotic systems.
2 Inverse Geosounding Problem
   Used to understand the internal structure of our planet.
3 Environmental Sciences
   Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
   Recognizing whether two different species contain similar genes.
5 Facial expression classification
6 Texture Classification
7 E-Learning
8 Handwritten Recognition
9 And counting...
5 / 85
Separable Classes
Given
   x_i, i = 1, · · · , N
   A set of samples belonging to two classes ω1, ω2.
Objective
   We want to obtain decision functions
      g(x) = w^T x + w_0
7 / 85
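As a concrete illustration of such a decision function, here is a minimal sketch; the weight vector w and bias w_0 below are made-up values for illustration, not learned parameters.

```python
import numpy as np

# Minimal sketch of the linear decision function g(x) = w^T x + w_0.
# The weights below are illustrative values, not learned parameters.
w = np.array([2.0, -1.0])
w0 = -1.0

def g(x):
    """Evaluate g(x) = w^T x + w_0."""
    return w @ x + w0

# Decide the class by the sign of g(x): class omega_1 if g(x) >= 0, else omega_2.
x = np.array([3.0, 1.0])
label = 1 if g(x) >= 0 else -1   # g(x) = 2*3 - 1*1 - 1 = 4, so label = +1
```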
Such that we can do the following
A linear separation function g(x) = w^T x + w_0
8 / 85
In other words ...
We have the following samples
   For x_1, · · · , x_m ∈ C1
   For x_1, · · · , x_n ∈ C2
We want the following decision surfaces
   w^T x_i + w_0 ≥ 0 for d_i = +1 if x_i ∈ C1
   w^T x_i + w_0 ≤ 0 for d_i = −1 if x_i ∈ C2
10 / 85
What do we want?
Our goal is to search for a direction w that gives the maximum possible margin.
(Figure: two candidate separating directions, direction 1 and direction 2, with their margins)
11 / 85
Remember
We have the following
(Figure: a point, its projection on the hyperplane, the projection distance r, and the distance d from the origin 0)
12 / 85
A Little of Geometry
Thus
(Figure: a right triangle with legs A and B, hypotenuse C, the origin distance d and the point distance r)
Then
   d = |w_0| / √(w_1^2 + w_2^2),  r = |g(x)| / √(w_1^2 + w_2^2)    (1)
13 / 85
First, d = |w_0| / √(w_1^2 + w_2^2)
We can use the following rule in a triangle with a 90° angle
   Area = (1/2) C d    (2)
In addition, the area can also be calculated as
   Area = (1/2) A B    (3)
Thus
   d = A B / C
Remark: Can you get the rest of the values?
14 / 85
What about r = |g(x)| / √(w_1^2 + w_2^2)?
First, remember
   g(x_p) = 0 and x = x_p + r (w / ‖w‖)    (4)
Thus, we have
   g(x) = w^T [x_p + r (w / ‖w‖)] + w_0
        = w^T x_p + w_0 + r (w^T w / ‖w‖)
        = g(x_p) + r ‖w‖
Then
   r = g(x) / ‖w‖
15 / 85
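Equation (1) and the derivation above can be checked numerically. In this sketch, w, w_0 and the point x are illustrative values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Sketch: distances to the hyperplane w^T x + w_0 = 0 in 2D,
# d = |w_0| / ||w|| (origin) and r = g(x) / ||w|| (a point x).
# w, w_0 and x below are illustrative values.
w = np.array([3.0, 4.0])    # ||w|| = sqrt(w_1^2 + w_2^2) = 5
w0 = -10.0

def g(x):
    return w @ x + w0

x = np.array([2.0, 2.0])             # g(x) = 6 + 8 - 10 = 4
r = g(x) / np.linalg.norm(w)         # signed distance of x: 4 / 5
d = abs(w0) / np.linalg.norm(w)      # distance of the origin: 10 / 5
```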
This has the following interpretation
(Figure: r measured from the projection x_p on the hyperplane)
16 / 85
Now
We know that the straight line that we are looking for looks like
   w^T x + w_0 = 0    (5)
What about something like this?
   w^T x + w_0 = δ    (6)
Clearly
   This will be above or below the initial line w^T x + w_0 = 0.
17 / 85
Come back to the hyperplanes
Then each border support line has a specific bias!!!
(Figure: the two border hyperplanes through the support vectors)
18 / 85
Then, normalize by δ
The new margin functions
   w′^T x + w_10 = 1
   w′^T x + w_01 = −1
where w′ = w/δ, w_10 = w′_0/δ, and w_01 = w′′_0/δ
Now, we come back to the middle separator hyperplane, but with the normalized term
   w^T x_i + w_0 ≥ w′^T x + w_10 for d_i = +1
   w^T x_i + w_0 ≤ w′^T x + w_01 for d_i = −1
Where w_0 is the bias of that central hyperplane!! And w is the normalized direction of w′
19 / 85
Come back to the hyperplanes
The meaning of what I am saying!!!
20 / 85
A little about Support Vectors
They are the vectors
   x_i such that w^T x_i + w_0 = 1 or w^T x_i + w_0 = −1
Properties
   They are the vectors nearest to the decision surface and the most difficult to classify.
   Because of that, we have the name "Support Vector Machines".
22 / 85
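A sketch of how one might pick out the support vectors numerically, given a hyperplane: they are exactly the samples where |g(x_i)| = 1. The hyperplane (w, w_0) and the data below are illustrative values, not a trained model.

```python
import numpy as np

# Support vectors are the samples x_i with w^T x_i + w_0 = +1 or -1.
# The hyperplane and data below are illustrative values.
w = np.array([0.25, 0.25])
w0 = 0.0
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])

g = X @ w + w0                          # g(x_i) for every sample
on_margin = np.isclose(np.abs(g), 1.0)  # |g(x_i)| = 1 marks a support vector
support_vectors = X[on_margin]          # rows (2, 2) and (-2, -2)
```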
Now, we can summarize the decision rule for the hyperplane
For the support vectors
   g(x_i) = w^T x_i + w_0 = −(+)1 for d_i = −(+)1    (7)
Implies
The distance to the support vectors is:
   r = g(x_i) / ‖w‖ = 1/‖w‖ if d_i = +1, and −1/‖w‖ if d_i = −1
23 / 85
Therefore ...
We want the optimum value of the margin of separation as
   ρ = 1/‖w‖ + 1/‖w‖ = 2/‖w‖    (8)
And the support vectors define the value of ρ.
24 / 85
Quadratic Optimization
Then, we have the samples with labels
   T = {(x_i, d_i)}_{i=1}^N
Then we can write the decision rule as
   d_i (w^T x_i + w_0) ≥ 1, i = 1, · · · , N
26 / 85
Then, we have the optimization problem
The optimization problem
   min_w Φ(w) = (1/2) w^T w
   s.t. d_i (w^T x_i + w_0) ≥ 1, i = 1, · · · , N
Observations
   The cost function Φ(w) is convex.
   The constraints are linear with respect to w.
27 / 85
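Because the cost is convex and the constraints are linear, the primal problem can be handed to a generic constrained solver. This is a sketch using SciPy's SLSQP on a tiny, made-up, linearly separable set; in practice a dedicated quadratic-programming solver would be used.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve min_w (1/2) w^T w  s.t.  d_i (w^T x_i + w_0) >= 1
# with a generic SLSQP solver.  The toy data below are illustrative.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:2]                      # v packs (w_1, w_2, w_0)
    return 0.5 * w @ w

# One inequality constraint per sample: d_i (w^T x_i + w_0) - 1 >= 0.
cons = [{'type': 'ineq',
         'fun': lambda v, i=i: d[i] * (v[:2] @ X[i] + v[2]) - 1.0}
        for i in range(len(d))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
w, w0 = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w)   # rho = 2 / ||w||
```

For this toy set the support vectors are (2, 2) and (−2, −2), so the optimal margin is their separation, 4√2.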
Lagrange Multipliers
The method of Lagrange multipliers
   Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.
   This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
29 / 85
Lagrange Multipliers
The classical problem formulation
   min f(x_1, x_2, ..., x_n)
   s.t. h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
   min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}    (9)
where
   L(x, λ) is the Lagrangian function.
   λ is an unspecified positive or negative constant called the Lagrange Multiplier.
30 / 85
Finding an Optimum using Lagrange Multipliers
New problem
   min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}
We want an optimal λ = λ*
   If the minimum of L(x_1, x_2, ..., x_n, λ*) occurs at (x_1, x_2, ..., x_n)^T = (x_1, x_2, ..., x_n)^T*, and (x_1, x_2, ..., x_n)^T* satisfies h_1(x_1, x_2, ..., x_n) = 0, then (x_1, x_2, ..., x_n)^T* minimizes:
      min f(x_1, x_2, ..., x_n)
      s.t. h_1(x_1, x_2, ..., x_n) = 0
Trick
   The trick is to find the appropriate value for the Lagrange multiplier λ.
31 / 85
Remember
Think about this
   Remember the First Law of Newton!!!
Yes!!! A system in equilibrium does not move
(Figure: a static body with balanced forces)
32 / 85
Lagrange Multipliers
Definition
   Gives a set of necessary conditions to identify optimal points of an equality-constrained optimization problem.
33 / 85
Lagrange was a Physicist
He was thinking of the following formula
   A system in equilibrium satisfies the following equation:
      F_1 + F_2 + ... + F_K = 0    (10)
But functions do not have forces?
   Are you sure?
Think about the following
   The gradient of a surface.
34 / 85
Gradient of a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
   ∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z    (11)
where i, j and k are unit vectors in the directions x, y and z.
35 / 85
Example
We have f(x, y) = x exp{−x^2 − y^2}
(Figure: surface plot of f)
36 / 85
Example
With the gradient at the contours when projected onto the 2D plane
(Figure: contour plot of f with gradient arrows)
37 / 85
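The gradient field in such a contour plot can be reproduced with a simple central-difference approximation of ∇f for the example surface f(x, y) = x exp{−x^2 − y^2}; the evaluation point below is arbitrary.

```python
import numpy as np

# Sketch: central-difference gradient of f(x, y) = x * exp(-x^2 - y^2),
# the surface used in the example above.
def f(p):
    x, y = p
    return x * np.exp(-x**2 - y**2)

def grad(p, h=1e-6):
    """Approximate df/dx_i by (f(p + h e_i) - f(p - h e_i)) / (2h)."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

# Analytic gradient for comparison:
#   df/dx = (1 - 2x^2) exp(-x^2 - y^2),  df/dy = -2xy exp(-x^2 - y^2)
p = np.array([0.5, -0.5])
exact = np.array([1 - 2 * p[0]**2, -2 * p[0] * p[1]]) * np.exp(-p[0]**2 - p[1]**2)
```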
Now, Think about this
Yes, we can use the gradient
   However, we need to do some scaling of the forces by using parameters λ
Thus, we have
   F_0 + λ_1 F_1 + ... + λ_K F_K = 0    (12)
where F_0 is the gradient of the principal cost function and the F_i are the gradients of the constraints, for i = 1, 2, ..., K.
38 / 85
Thus
If we have the following optimization:
   min f(x)
   s.t. g_1(x) = 0
        g_2(x) = 0
39 / 85
Geometric interpretation in the case of minimization
What is wrong? The gradients point in the other direction; we can fix this by simply multiplying by −1.
Here the cost function is f(x, y) = x exp{−x^2 − y^2} and we want to minimize f(x)
(Figure: level curves of f(x) with the constraint curves g_1(x) and g_2(x))
   −∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0
Nevertheless: it is equivalent to ∇f(x) − λ_1 ∇g_1(x) − λ_2 ∇g_2(x) = 0
39 / 85
Method
Steps
1 The original problem is rewritten as: minimize L(x, λ) = f(x) − λ h_1(x)
2 Take the derivatives of L(x, λ) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier λ.
4 Plug x, in terms of λ, into the constraint h_1(x) = 0 and solve for λ.
5 Calculate x by using the value just found for λ.
From step 2
If there are n variables (i.e., x_1, · · · , x_n) then you will get n equations with n + 1 unknowns (i.e., n variables x_i and one Lagrange multiplier λ).
41 / 85
Example
We can apply this to the following problem
   min f(x, y) = x^2 − 8x + y^2 − 12y + 48
   s.t. x + y = 8
42 / 85
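Applying the method steps to this example: the stationarity equations of L(x, y, λ) = f(x, y) − λ(x + y − 8), together with the constraint, form a linear system, which a quick sketch can solve numerically.

```python
import numpy as np

# The slide's example via the Lagrangian
#   L(x, y, lam) = x^2 - 8x + y^2 - 12y + 48 - lam * (x + y - 8).
# Setting dL/dx = 0, dL/dy = 0 and enforcing the constraint gives:
#   2x - lam = 8,  2y - lam = 12,  x + y = 8
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([8.0, 12.0, 8.0])
x, y, lam = np.linalg.solve(A, b)          # x = 3, y = 5, lam = -2
f_min = x**2 - 8*x + y**2 - 12*y + 48      # constrained minimum: -2
```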
Then, Rewriting The Optimization Problem
The optimization with equality constraints
   min_w Φ(w) = (1/2) w^T w
   s.t. d_i (w^T x_i + w_0) − 1 = 0, i = 1, · · · , N
43 / 85
Then, for our problem
Using the Lagrange Multipliers (we will call them α_i)
We obtain the following cost function
   J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i [d_i (w^T x_i + w_0) − 1]
Observation
   Minimize with respect to w and w_0.
   Maximize with respect to α because it dominates
      −Σ_{i=1}^{N} α_i [d_i (w^T x_i + w_0) − 1].    (13)
44 / 85
Karush-Kuhn-Tucker Conditions
First, An Inequality Constrained Problem P
   min f(x)
   s.t. g_1(x) = 0
        ...
        g_N(x) = 0
A really minimal version!!! Hey, it is a patchwork!!!
   A point x is a local minimum of an equality constrained problem P only if a set of non-negative α_j's may be found such that:
      ∇L(x, α) = ∇f(x) − Σ_{i=1}^{N} α_i ∇g_i(x) = 0
46 / 85
Karush-Kuhn-Tucker Conditions
Important
   Think about this: each constraint corresponds to a sample in one of the two classes, thus
   The corresponding α_i's are going to be zero after optimization if a constraint is not active, i.e. d_i (w^T x_i + w_0) − 1 ≠ 0 (Remember the Maximization).
Again the Support Vectors
   This actually defines the idea of support vectors!!!
Thus
   Only the α_i's with active constraints (Support Vectors) will be different from zero, i.e. when d_i (w^T x_i + w_0) − 1 = 0.
47 / 85
A small deviation from the SVMs for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
   Let X be a non-empty open set in R^n, and let f : R^n → R and g_i : R^n → R for i = 1, ..., m. Consider the problem P to minimize f(x) subject to x ∈ X and g_i(x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | g_i(x) = 0}. Suppose that f and g_i for i ∈ I are differentiable at x and that the g_i for i ∉ I are continuous at x. Furthermore, suppose that the ∇g_i(x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars u_i for i ∈ I such that
      ∇f(x) + Σ_{i∈I} u_i ∇g_i(x) = 0
      u_i ≥ 0 for i ∈ I
48 / 85
It is more...

In addition to the above assumptions
If g_i for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form:

∇f(x) + Σ_{i=1}^{m} u_i ∇g_i(x) = 0
u_i g_i(x) = 0 for i = 1, ..., m
u_i ≥ 0 for i = 1, ..., m

49 / 85
The necessary conditions for optimality

We use the previous theorem

∇( (1/2) w^T w − Σ_{i=1}^{N} α_i [d_i(w^T x_i + w_0) − 1] )   (14)

Condition 1
∂J(w, w_0, α)/∂w = 0

Condition 2
∂J(w, w_0, α)/∂w_0 = 0

50 / 85
Using the conditions

We have the first condition

∂J(w, w_0, α)/∂w = ∂((1/2) w^T w)/∂w − ∂(Σ_{i=1}^{N} α_i [d_i(w^T x_i + w_0) − 1])/∂w = 0

∂J(w, w_0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^{N} α_i d_i x_i

Thus

w = Σ_{i=1}^{N} α_i d_i x_i   (15)

51 / 85
In a similar way ...

We have, by the second optimality condition,

Σ_{i=1}^{N} α_i d_i = 0

Note

α_i [d_i(w^T x_i + w_0) − 1] = 0

because each constraint term vanishes at the optimal solution, i.e. α_i = 0 or d_i(w^T x_i + w_0) − 1 = 0.

52 / 85
Thus

We need something extra
Our classic trick of transforming a problem into another problem.

In this case
We use the Primal-Dual Problem for the Lagrangian.

Where
We move from a minimization to a maximization!!!

53 / 85
Lagrangian Dual Problem

Consider the following nonlinear programming problem
Primal Problem P

min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     h_i(x) = 0 for i = 1, ..., l
     x ∈ X

Lagrange Dual Problem D

max Θ(u, v)
s.t. u ≥ 0

where Θ(u, v) = inf_x { f(x) + Σ_{i=1}^{m} u_i g_i(x) + Σ_{i=1}^{l} v_i h_i(x) | x ∈ X }

55 / 85
What does this mean?

Assume that the equality constraints do not exist
We have then

min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     x ∈ X

Now assume that we finish with only one constraint
We have then

min f(x)
s.t. g(x) ≤ 0
     x ∈ X

56 / 85
What does this mean?

[Figure: the image G of X under the map x ↦ (g(x), f(x)) in the y–z plane, with points A and B and two supporting lines of different slopes]

57 / 85
What does this mean?

Thus, in the y–z plane you have

G = {(y, z) | y = g(x), z = f(x) for some x ∈ X}   (16)

Thus
Given u ≥ 0, we need to minimize f(x) + u g(x) to find θ(u), which is equivalent to solving ∇f(x) + u ∇g(x) = 0.

58 / 85
What does this mean?

Thus, in the y–z plane, we have

z + uy = α   (17)

a line with slope −u.

Then, to minimize z + uy = α
We need to move the line z + uy = α parallel to itself, as far down as possible along its negative gradient, while keeping contact with G.

59 / 85
In other words

Move the line parallel to itself until it supports G

[Figure: the set G with the line z + uy = α moved down until it becomes a supporting line of G]

Note: the set G lies above the line and touches it.

60 / 85
Thus

The problem
Then, the problem is to find the slope of the supporting hyperplane for G.

Its intersection with the z-axis
Gives θ(u).

61 / 85
Again

We can see θ(u)

[Figure: the supporting line of G with slope −u; its intercept on the z-axis is θ(u)]

62 / 85
Thus

The dual problem is equivalent to
Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.

63 / 85
Or

Such a hyperplane has slope −ū and supports G at (ȳ, z̄)

[Figure: the optimal supporting line of G, with slope −ū, touching G at (ȳ, z̄)]

Remark: the optimal solution is ū and the optimal dual objective is z̄.

64 / 85
For more on this, Please!!!

Look at this book
“Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa and C. M. Shetty, Wiley, New York (2006), at page 260.

65 / 85
Example (Lagrange Dual)

Primal

min x_1^2 + x_2^2
s.t. −x_1 − x_2 + 4 ≤ 0
     x_1, x_2 ≥ 0

Lagrange Dual

Θ(u) = inf { x_1^2 + x_2^2 + u(−x_1 − x_2 + 4) | x_1, x_2 ≥ 0 }

66 / 85
Solution

Differentiate with respect to x_1 and x_2
We have two cases to take into account: u ≥ 0 and u < 0.

The first case is clear
What about when u < 0?

We have that

θ(u) = −(1/2)u^2 + 4u if u ≥ 0, and θ(u) = 4u if u < 0   (18)

67 / 85
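The closed form in Eq. 18 is easy to check numerically: for u ≥ 0 the infimum over x_1, x_2 ≥ 0 is attained at x_1 = x_2 = u/2. A minimal sketch (the grid bounds and step size are arbitrary choices):

```python
def theta_numeric(u, lo=0.0, hi=5.0, step=0.01):
    """Approximate theta(u) = inf over x1, x2 >= 0 of the Lagrangian
    x1^2 + x2^2 + u * (-x1 - x2 + 4) by brute-force grid search."""
    best = float("inf")
    n = int((hi - lo) / step) + 1
    for i in range(n):
        x1 = lo + i * step
        for j in range(n):
            x2 = lo + j * step
            val = x1 ** 2 + x2 ** 2 + u * (-x1 - x2 + 4)
            best = min(best, val)
    return best

def theta_closed(u):
    # Eq. 18, the case u >= 0
    return -0.5 * u ** 2 + 4 * u if u >= 0 else 4 * u

# The grid search and the closed form agree on a few sample values of u.
for u in (0.0, 1.0, 2.0, 3.0):
    assert abs(theta_numeric(u) - theta_closed(u)) < 1e-4
```

For u < 0 the infimum is attained at x_1 = x_2 = 0, giving 4u, which is the second branch of Eq. 18.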
Duality Theorem

First Property
If the primal has an optimal solution, then so does the dual.

Thus
For w* and α* to be optimal solutions of the primal and dual problems, respectively, it is necessary and sufficient that w*:

  is feasible for the primal problem, and
  Φ(w*) = J(w*, w_0*, α*) = min_w J(w, w_0*, α*)

69 / 85
Reformulate our Equations

We have then

J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i d_i w^T x_i − w_0 Σ_{i=1}^{N} α_i d_i + Σ_{i=1}^{N} α_i

Now, by our 2nd optimality condition (Σ_{i=1}^{N} α_i d_i = 0),

J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i d_i w^T x_i + Σ_{i=1}^{N} α_i

70 / 85
We have finally, for the 1st Optimality Condition:

First

w^T w = Σ_{i=1}^{N} α_i d_i w^T x_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i

Second, setting J(w, w_0, α) = Q(α)

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i

71 / 85
From here, we have the problem

This is the problem that we really solve
Given the training sample {(x_i, d_i)}_{i=1}^{N}, find the Lagrange multipliers {α_i}_{i=1}^{N} that maximize the objective function

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i

subject to the constraints

Σ_{i=1}^{N} α_i d_i = 0   (19)
α_i ≥ 0 for i = 1, ..., N   (20)

Note
In the primal we were minimizing the cost function; by Lagrangian duality this corresponds to maximizing over α. That is the reason why we maximize Q(α).

72 / 85
Solving for α

We can compute w* once we get the optimal α_i* by using (Eq. 15)

w* = Σ_{i=1}^{N} α_i* d_i x_i

In addition, we can compute the optimal bias w_0* using the optimal weight w*
For this, we use the positive-margin equation:

g(x^(s)) = (w*)^T x^(s) + w_0 = 1

corresponding to a positive support vector x^(s).

Then

w_0* = 1 − (w*)^T x^(s) for d^(s) = 1   (21)

73 / 85
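The whole pipeline (maximize Q(α), then recover w via Eq. 15 and w_0 via Eq. 21) can be sketched on a toy two-point problem where the dual has a closed form. With x_1 = (1, 1), d_1 = +1 and x_2 = (−1, −1), d_2 = −1, the equality constraint Σ α_i d_i = 0 forces α_1 = α_2 = a, so Q reduces to 2a − 4a^2; a minimal sketch (the learning rate and iteration count are arbitrary choices):

```python
# Toy dataset: one point per class, linearly separable.
X = [(1.0, 1.0), (-1.0, -1.0)]
d = [1.0, -1.0]

# On this symmetric problem the constraint sum_i a_i * d_i = 0 forces
# a_1 = a_2 = a, and Q(a) = 2a - 4a^2. Maximize by gradient ascent
# with projection onto a >= 0 (constraint 20).
a = 0.0
for _ in range(1000):
    grad = 2.0 - 8.0 * a        # dQ/da
    a = max(0.0, a + 0.05 * grad)

alphas = [a, a]

# Eq. 15: w = sum_i alpha_i * d_i * x_i
w = [sum(alphas[i] * d[i] * X[i][k] for i in range(2)) for k in range(2)]

# Eq. 21: w0 = 1 - w^T x^(s) for a positive support vector x^(s)
xs = X[0]
w0 = 1.0 - sum(w[k] * xs[k] for k in range(2))
# Here a converges to 1/4, so w = (1/2, 1/2) and w0 = 0.
```

Both points are support vectors (both constraints are active), which is why both α_i's come out nonzero, in line with the KKT discussion above.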
What do we need?

Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable.
Problem: what happens when they are not separable?
What can we do?

75 / 85
Map to a higher Dimensional Space

Assume that there exists a mapping

x ∈ R^l → y ∈ R^k

Then, it is possible to define the following mapping.

77 / 85
Define a map to a higher Dimension

Nonlinear transformations
Given a series of nonlinear transformations

{φ_i(x)}_{i=1}^{m}

from the input space to the feature space,

We can define the decision surface as

Σ_{i=1}^{m} w_i φ_i(x) + w_0 = 0

78 / 85
This allows us to define

The following vector

φ(x) = (φ_0(x), φ_1(x), ..., φ_m(x))^T

that represents the mapping.

From this mapping
we can define the following kernel function

K : X × X → R
K(x_i, x_j) = φ(x_i)^T φ(x_j)

79 / 85
Example

Assume

x ∈ R^2 → y = (x_1^2, √2 x_1 x_2, x_2^2)^T

We can show that

y_i^T y_j = (x_i^T x_j)^2

81 / 85
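This identity is easy to verify numerically; a minimal sketch (the two sample vectors are arbitrary):

```python
import math

def phi(x):
    # Explicit feature map x -> (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

xi, xj = (1.0, 2.0), (3.0, 4.0)

lhs = dot(phi(xi), phi(xj))   # inner product computed in feature space
rhs = dot(xi, xj) ** 2        # kernel evaluated directly in input space
assert abs(lhs - rhs) < 1e-9  # both equal (1*3 + 2*4)^2 = 121
```

The point of the kernel trick is exactly this: the right-hand side never forms the feature vectors, yet computes the same inner product.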
Example of Kernels

Polynomials

k(x, z) = (x^T z + 1)^q, q > 0

Radial Basis Functions

k(x, z) = exp(−‖x − z‖^2 / σ^2)

Hyperbolic Tangents

k(x, z) = tanh(β x^T z + γ)

82 / 85
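These three families can be written directly as functions; a minimal sketch (the parameter values q, σ, β, γ and the test vectors are arbitrary choices):

```python
import math

def poly_kernel(x, z, q=2):
    # Polynomial kernel: (x^T z + 1)^q
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    # Radial Basis Function kernel: exp(-||x - z||^2 / sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    # Hyperbolic tangent kernel: tanh(beta * x^T z + gamma)
    return math.tanh(beta * sum(a * b for a, b in zip(x, z)) + gamma)

x, z = (1.0, 0.0), (0.0, 1.0)
assert poly_kernel(x, z) == 1.0               # (0 + 1)^2
assert abs(rbf_kernel(x, x) - 1.0) < 1e-12    # RBF of a point with itself
assert rbf_kernel(x, z) < 1.0                 # and it decays with distance
assert abs(tanh_kernel(x, z)) < 1.0           # tanh output lies in (-1, 1)
```

Note that the hyperbolic tangent kernel satisfies Mercer's condition only for some choices of β and γ, whereas the RBF kernel is a valid kernel for every σ > 0.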
Now, How to select a Kernel?

We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.

Thus
In general, the Radial Basis Function kernel is a reasonable first choice.

Then
If this fails, we can try the other possible kernels.

84 / 85
Thus, we have something like this
Step 1Normalize the data.
Step 2Use cross-validation to adjust the parameters of the selected kernel.
Step 3Train against the entire dataset.
85 / 85
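The three steps above can be sketched in pure Python. To keep the sketch self-contained, a trivial kernel nearest-class-mean classifier stands in for a full SVM trainer; the toy dataset, the σ grid, and the fold count are all arbitrary choices:

```python
import math
import random

def zscore(X):
    # Step 1: normalize each feature to zero mean, unit variance.
    n, dim = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(dim)]
    sd = [math.sqrt(sum((x[j] - mu[j]) ** 2 for x in X) / n) or 1.0
          for j in range(dim)]
    return [tuple((x[j] - mu[j]) / sd[j] for j in range(dim)) for x in X]

def rbf(x, z, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

def accuracy(train, test, sigma):
    # Stand-in classifier: assign each test point to the class whose
    # training points have the highest mean kernel similarity to it.
    correct = 0
    for x, d in test:
        score = {}
        for c in (-1.0, 1.0):
            pts = [xi for xi, di in train if di == c]
            score[c] = sum(rbf(x, xi, sigma) for xi in pts) / len(pts)
        if max(score, key=score.get) == d:
            correct += 1
    return correct / len(test)

# Toy two-cluster dataset, one Gaussian blob per class.
random.seed(0)
data = [((random.gauss(c, 0.5), random.gauss(c, 0.5)), 1.0 if c else -1.0)
        for c in (0, 3) for _ in range(20)]
labels = [d for _, d in data]
points = zscore([x for x, _ in data])
data = list(zip(points, labels))

# Step 2: pick sigma by k-fold cross-validation over a small grid.
k = 4
folds = [data[i::k] for i in range(k)]
best_sigma, best_acc = None, -1.0
for sigma in (0.1, 0.5, 1.0, 2.0):
    accs = []
    for i in range(k):
        test = folds[i]
        train = [s for j in range(k) if j != i for s in folds[j]]
        accs.append(accuracy(train, test, sigma))
    cv = sum(accs) / k
    if cv > best_acc:
        best_sigma, best_acc = sigma, cv

# Step 3: "train" against the entire dataset with the chosen sigma.
final_acc = accuracy(data, data, best_sigma)
```

In practice the stand-in classifier would be replaced by an actual SVM solver, but the skeleton (normalize, cross-validate the kernel parameters, retrain on everything) stays the same.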