An Introduction to Support Vector Machines
TRANSCRIPT
Machine Learning for Data Mining: Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 21, 2015
1 / 85
Outline
1 History
2 Separable Classes
   Separable Classes
   Hyperplanes
3 Support Vectors
   Support Vectors
   Quadratic Optimization
   Lagrange Multipliers
   Method
   Karush-Kuhn-Tucker Conditions
   Primal-Dual Problem for Lagrangian
   Properties
4 Kernel
   Kernel Idea
   Higher Dimensional Space
   Examples
   Now, How to select a Kernel?
2 / 85
History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
   At the Institute of Control Sciences, Moscow
   In the paper "Estimation of dependencies based on empirical data"
Corinna Cortes and Vladimir Vapnik in 1995
   They invented its current incarnation, soft margins, at the AT&T Labs.
BTW Corinna Cortes
   Danish computer scientist known for her contributions to the field of machine learning.
   She is currently the Head of Google Research, New York.
   Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on the theoretical foundations of support vector machines.
3 / 85
In addition
Alexey Yakovlevich Chervonenkis
   He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.
He died on 22 September 2014
   At Losiny Ostrov National Park.
4 / 85
Applications
Partial List
1 Predictive Control
   Control of chaotic systems.
2 Inverse Geosounding Problem
   Used to understand the internal structure of our planet.
3 Environmental Sciences
   Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
   Recognizing whether two different species contain similar genes.
5 Facial expression classification
6 Texture Classification
7 E-Learning
8 Handwritten Recognition
9 And counting...
5 / 85
Separable Classes
Given
   x_i, i = 1, · · · , N
   A set of samples belonging to two classes ω1, ω2.
Objective
   We want to obtain decision functions
      g(x) = w^T x + w_0
7 / 85
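As a concrete illustration of such a decision function, here is a minimal sketch; the weight vector w and bias w_0 below are made-up values for illustration, not learned parameters.

```python
import numpy as np

# Minimal sketch of the linear decision function g(x) = w^T x + w_0.
# The weights below are illustrative values, not learned parameters.
w = np.array([2.0, -1.0])
w0 = -1.0

def g(x):
    """Evaluate g(x) = w^T x + w_0."""
    return w @ x + w0

# Decide the class by the sign of g(x): class omega_1 if g(x) >= 0, else omega_2.
x = np.array([3.0, 1.0])
label = 1 if g(x) >= 0 else -1   # g(x) = 2*3 - 1*1 - 1 = 4, so label = +1
```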
Such that we can do the following
A linear separation function g(x) = w^T x + w_0
8 / 85
In other words ...
We have the following samples
   For x_1, · · · , x_m ∈ C1
   For x_1, · · · , x_n ∈ C2
We want the following decision surfaces
   w^T x_i + w_0 ≥ 0 for d_i = +1 if x_i ∈ C1
   w^T x_i + w_0 ≤ 0 for d_i = −1 if x_i ∈ C2
10 / 85
What do we want?
Our goal is to search for a direction w that gives the maximum possible margin.
(Figure: two candidate separating directions, direction 1 and direction 2, with their margins)
11 / 85
Remember
We have the following
(Figure: a point, its projection on the hyperplane, the projection distance r, and the distance d from the origin 0)
12 / 85
A Little of Geometry
Thus
(Figure: a right triangle with legs A and B, hypotenuse C, the origin distance d and the point distance r)
Then
   d = |w_0| / √(w_1^2 + w_2^2),  r = |g(x)| / √(w_1^2 + w_2^2)    (1)
13 / 85
First, d = |w_0| / √(w_1^2 + w_2^2)
We can use the following rule in a triangle with a 90° angle
   Area = (1/2) C d    (2)
In addition, the area can also be calculated as
   Area = (1/2) A B    (3)
Thus
   d = A B / C
Remark: Can you get the rest of the values?
14 / 85
What about r = |g(x)| / √(w_1^2 + w_2^2)?
First, remember
   g(x_p) = 0 and x = x_p + r (w / ‖w‖)    (4)
Thus, we have
   g(x) = w^T [x_p + r (w / ‖w‖)] + w_0
        = w^T x_p + w_0 + r (w^T w / ‖w‖)
        = g(x_p) + r ‖w‖
Then
   r = g(x) / ‖w‖
15 / 85
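Equation (1) and the derivation above can be checked numerically. In this sketch, w, w_0 and the point x are illustrative values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Sketch: distances to the hyperplane w^T x + w_0 = 0 in 2D,
# d = |w_0| / ||w|| (origin) and r = g(x) / ||w|| (a point x).
# w, w_0 and x below are illustrative values.
w = np.array([3.0, 4.0])    # ||w|| = sqrt(w_1^2 + w_2^2) = 5
w0 = -10.0

def g(x):
    return w @ x + w0

x = np.array([2.0, 2.0])             # g(x) = 6 + 8 - 10 = 4
r = g(x) / np.linalg.norm(w)         # signed distance of x: 4 / 5
d = abs(w0) / np.linalg.norm(w)      # distance of the origin: 10 / 5
```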
This has the following interpretation
(Figure: r measured from the projection x_p on the hyperplane)
16 / 85
Now
We know that the straight line that we are looking for looks like
   w^T x + w_0 = 0    (5)
What about something like this?
   w^T x + w_0 = δ    (6)
Clearly
   This will be above or below the initial line w^T x + w_0 = 0.
17 / 85
Come back to the hyperplanes
Then each border support line has a specific bias!!!
(Figure: the two border hyperplanes through the support vectors)
18 / 85
Then, normalize by δ
The new margin functions
   w′^T x + w_10 = 1
   w′^T x + w_01 = −1
where w′ = w/δ, w_10 = w′_0/δ, and w_01 = w′′_0/δ
Now, we come back to the middle separator hyperplane, but with the normalized term
   w^T x_i + w_0 ≥ w′^T x + w_10 for d_i = +1
   w^T x_i + w_0 ≤ w′^T x + w_01 for d_i = −1
Where w_0 is the bias of that central hyperplane!! And w is the normalized direction of w′
19 / 85
Come back to the hyperplanes
The meaning of what I am saying!!!
20 / 85
A little about Support Vectors
They are the vectors
   x_i such that w^T x_i + w_0 = 1 or w^T x_i + w_0 = −1
Properties
   They are the vectors nearest to the decision surface and the most difficult to classify.
   Because of that, we have the name "Support Vector Machines".
22 / 85
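A sketch of how one might pick out the support vectors numerically, given a hyperplane: they are exactly the samples where |g(x_i)| = 1. The hyperplane (w, w_0) and the data below are illustrative values, not a trained model.

```python
import numpy as np

# Support vectors are the samples x_i with w^T x_i + w_0 = +1 or -1.
# The hyperplane and data below are illustrative values.
w = np.array([0.25, 0.25])
w0 = 0.0
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])

g = X @ w + w0                          # g(x_i) for every sample
on_margin = np.isclose(np.abs(g), 1.0)  # |g(x_i)| = 1 marks a support vector
support_vectors = X[on_margin]          # rows (2, 2) and (-2, -2)
```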
Now, we can summarize the decision rule for the hyperplane
For the support vectors
   g(x_i) = w^T x_i + w_0 = −(+)1 for d_i = −(+)1    (7)
Implies
The distance to the support vectors is:
   r = g(x_i) / ‖w‖ = 1/‖w‖ if d_i = +1, and −1/‖w‖ if d_i = −1
23 / 85
Therefore ...
We want the optimum value of the margin of separation as
   ρ = 1/‖w‖ + 1/‖w‖ = 2/‖w‖    (8)
And the support vectors define the value of ρ.
24 / 85
Quadratic Optimization
Then, we have the samples with labels
   T = {(x_i, d_i)}_{i=1}^N
Then we can write the decision rule as
   d_i (w^T x_i + w_0) ≥ 1, i = 1, · · · , N
26 / 85
Then, we have the optimization problem
The optimization problem
   min_w Φ(w) = (1/2) w^T w
   s.t. d_i (w^T x_i + w_0) ≥ 1, i = 1, · · · , N
Observations
   The cost function Φ(w) is convex.
   The constraints are linear with respect to w.
27 / 85
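Because the cost is convex and the constraints are linear, the primal problem can be handed to a generic constrained solver. This is a sketch using SciPy's SLSQP on a tiny, made-up, linearly separable set; in practice a dedicated quadratic-programming solver would be used.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve min_w (1/2) w^T w  s.t.  d_i (w^T x_i + w_0) >= 1
# with a generic SLSQP solver.  The toy data below are illustrative.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:2]                      # v packs (w_1, w_2, w_0)
    return 0.5 * w @ w

# One inequality constraint per sample: d_i (w^T x_i + w_0) - 1 >= 0.
cons = [{'type': 'ineq',
         'fun': lambda v, i=i: d[i] * (v[:2] @ X[i] + v[2]) - 1.0}
        for i in range(len(d))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
w, w0 = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w)   # rho = 2 / ||w||
```

For this toy set the support vectors are (2, 2) and (−2, −2), so the optimal margin is their separation, 4√2.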
Lagrange Multipliers
The method of Lagrange multipliers
   Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.
   This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
29 / 85
Lagrange Multipliers
The classical problem formulation
   min f(x_1, x_2, ..., x_n)
   s.t. h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
   min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}    (9)
where
   L(x, λ) is the Lagrangian function.
   λ is an unspecified positive or negative constant called the Lagrange Multiplier.
30 / 85
Finding an Optimum using Lagrange Multipliers
New problem
   min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}
We want an optimal λ = λ*
   If the minimum of L(x_1, x_2, ..., x_n, λ*) occurs at (x_1, x_2, ..., x_n)^T = (x_1, x_2, ..., x_n)^T*, and (x_1, x_2, ..., x_n)^T* satisfies h_1(x_1, x_2, ..., x_n) = 0, then (x_1, x_2, ..., x_n)^T* minimizes:
      min f(x_1, x_2, ..., x_n)
      s.t. h_1(x_1, x_2, ..., x_n) = 0
Trick
   The trick is to find the appropriate value for the Lagrange multiplier λ.
31 / 85
Remember
Think about this
   Remember the First Law of Newton!!!
Yes!!! A system in equilibrium does not move
(Figure: a static body with balanced forces)
32 / 85
Lagrange Multipliers
Definition
   Gives a set of necessary conditions to identify optimal points of an equality-constrained optimization problem.
33 / 85
Lagrange was a Physicist
He was thinking of the following formula
   A system in equilibrium satisfies the following equation:
      F_1 + F_2 + ... + F_K = 0    (10)
But functions do not have forces?
   Are you sure?
Think about the following
   The gradient of a surface.
34 / 85
Gradient of a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
   ∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z    (11)
where i, j and k are unit vectors in the directions x, y and z.
35 / 85
Example
We have f(x, y) = x exp{−x^2 − y^2}
(Figure: surface plot of f)
36 / 85
Example
With the gradient at the contours when projected onto the 2D plane
(Figure: contour plot of f with gradient arrows)
37 / 85
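The gradient field in such a contour plot can be reproduced with a simple central-difference approximation of ∇f for the example surface f(x, y) = x exp{−x^2 − y^2}; the evaluation point below is arbitrary.

```python
import numpy as np

# Sketch: central-difference gradient of f(x, y) = x * exp(-x^2 - y^2),
# the surface used in the example above.
def f(p):
    x, y = p
    return x * np.exp(-x**2 - y**2)

def grad(p, h=1e-6):
    """Approximate df/dx_i by (f(p + h e_i) - f(p - h e_i)) / (2h)."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

# Analytic gradient for comparison:
#   df/dx = (1 - 2x^2) exp(-x^2 - y^2),  df/dy = -2xy exp(-x^2 - y^2)
p = np.array([0.5, -0.5])
exact = np.array([1 - 2 * p[0]**2, -2 * p[0] * p[1]]) * np.exp(-p[0]**2 - p[1]**2)
```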
Now, Think about this
Yes, we can use the gradient
   However, we need to do some scaling of the forces by using parameters λ
Thus, we have
   F_0 + λ_1 F_1 + ... + λ_K F_K = 0    (12)
where F_0 is the gradient of the principal cost function and the F_i are the gradients of the constraints, for i = 1, 2, ..., K.
38 / 85
Thus
If we have the following optimization:
   min f(x)
   s.t. g_1(x) = 0
        g_2(x) = 0
39 / 85
Geometric interpretation in the case of minimization
What is wrong? The gradients point in the other direction; we can fix this by simply multiplying by −1.
Here the cost function is f(x, y) = x exp{−x^2 − y^2} and we want to minimize f(x)
(Figure: level curves of f(x) with the constraint curves g_1(x) and g_2(x))
   −∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0
Nevertheless: it is equivalent to ∇f(x) − λ_1 ∇g_1(x) − λ_2 ∇g_2(x) = 0
39 / 85
Method
Steps
1 The original problem is rewritten as: minimize L(x, λ) = f(x) − λ h_1(x)
2 Take the derivatives of L(x, λ) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier λ.
4 Plug x, in terms of λ, into the constraint h_1(x) = 0 and solve for λ.
5 Calculate x by using the value just found for λ.
From step 2
If there are n variables (i.e., x_1, · · · , x_n) then you will get n equations with n + 1 unknowns (i.e., n variables x_i and one Lagrange multiplier λ).
41 / 85
Example
We can apply this to the following problem
   min f(x, y) = x^2 − 8x + y^2 − 12y + 48
   s.t. x + y = 8
42 / 85
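Applying the method steps to this example: the stationarity equations of L(x, y, λ) = f(x, y) − λ(x + y − 8), together with the constraint, form a linear system, which a quick sketch can solve numerically.

```python
import numpy as np

# The slide's example via the Lagrangian
#   L(x, y, lam) = x^2 - 8x + y^2 - 12y + 48 - lam * (x + y - 8).
# Setting dL/dx = 0, dL/dy = 0 and enforcing the constraint gives:
#   2x - lam = 8,  2y - lam = 12,  x + y = 8
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([8.0, 12.0, 8.0])
x, y, lam = np.linalg.solve(A, b)          # x = 3, y = 5, lam = -2
f_min = x**2 - 8*x + y**2 - 12*y + 48      # constrained minimum: -2
```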
Then, Rewriting The Optimization Problem
The optimization with equality constraints
   min_w Φ(w) = (1/2) w^T w
   s.t. d_i (w^T x_i + w_0) − 1 = 0, i = 1, · · · , N
43 / 85
Then, for our problem
Using the Lagrange Multipliers (we will call them α_i)
We obtain the following cost function
   J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i [d_i (w^T x_i + w_0) − 1]
Observation
   Minimize with respect to w and w_0.
   Maximize with respect to α because it dominates
      −Σ_{i=1}^{N} α_i [d_i (w^T x_i + w_0) − 1].    (13)
44 / 85
Karush-Kuhn-Tucker Conditions
First, An Inequality Constrained Problem P
   min f(x)
   s.t. g_1(x) = 0
        ...
        g_N(x) = 0
A really minimal version!!! Hey, it is a patchwork!!!
   A point x is a local minimum of an equality constrained problem P only if a set of non-negative α_j's may be found such that:
      ∇L(x, α) = ∇f(x) − Σ_{i=1}^{N} α_i ∇g_i(x) = 0
46 / 85
Karush-Kuhn-Tucker Conditions
Important
   Think about this: each constraint corresponds to a sample in one of the two classes, thus
   The corresponding α_i's are going to be zero after optimization if a constraint is not active, i.e. d_i (w^T x_i + w_0) − 1 ≠ 0 (Remember the Maximization).
Again the Support Vectors
   This actually defines the idea of support vectors!!!
Thus
   Only the α_i's with active constraints (Support Vectors) will be different from zero, i.e. when d_i (w^T x_i + w_0) − 1 = 0.
47 / 85
A small deviation from the SVMs for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
   Let X be a non-empty open set in R^n, and let f : R^n → R and g_i : R^n → R for i = 1, ..., m. Consider the problem P to minimize f(x) subject to x ∈ X and g_i(x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | g_i(x) = 0}. Suppose that f and g_i for i ∈ I are differentiable at x and that the g_i for i ∉ I are continuous at x. Furthermore, suppose that the ∇g_i(x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars u_i for i ∈ I such that
      ∇f(x) + Σ_{i∈I} u_i ∇g_i(x) = 0
      u_i ≥ 0 for i ∈ I
48 / 85
It is more...

In addition to the above assumptions
If g_i for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form:

∇f(x) + Σ_{i=1}^{m} u_i ∇g_i(x) = 0
u_i g_i(x) = 0 for i = 1, ..., m
u_i ≥ 0 for i = 1, ..., m

49 / 85
The necessary conditions for optimality

We use the previous theorem

∇( (1/2) w^T w − Σ_{i=1}^{N} α_i [d_i(w^T x_i + w_0) − 1] )   (14)

Condition 1
∂J(w, w_0, α)/∂w = 0

Condition 2
∂J(w, w_0, α)/∂w_0 = 0

50 / 85
Using the conditions

We have the first condition

∂J(w, w_0, α)/∂w = ∂((1/2) w^T w)/∂w − ∂(Σ_{i=1}^{N} α_i [d_i(w^T x_i + w_0) − 1])/∂w = 0

∂J(w, w_0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^{N} α_i d_i x_i

Thus

w = Σ_{i=1}^{N} α_i d_i x_i   (15)

51 / 85
In a similar way ...

We have, by the second optimality condition,

Σ_{i=1}^{N} α_i d_i = 0

Note

α_i [d_i(w^T x_i + w_0) − 1] = 0

because each constraint term vanishes at the optimal solution, i.e. α_i = 0 or d_i(w^T x_i + w_0) − 1 = 0.

52 / 85
Thus

We need something extra
Our classic trick of transforming a problem into another problem.

In this case
We use the Primal-Dual Problem for the Lagrangian.

Where
We move from a minimization to a maximization!!!

53 / 85
Lagrangian Dual Problem

Consider the following nonlinear programming problem
Primal Problem P

min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     h_i(x) = 0 for i = 1, ..., l
     x ∈ X

Lagrange Dual Problem D

max Θ(u, v)
s.t. u ≥ 0

where Θ(u, v) = inf_x { f(x) + Σ_{i=1}^{m} u_i g_i(x) + Σ_{i=1}^{l} v_i h_i(x) | x ∈ X }

55 / 85
What does this mean?

Assume that the equality constraints do not exist
We have then

min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     x ∈ X

Now assume that we finish with only one constraint
We have then

min f(x)
s.t. g(x) ≤ 0
     x ∈ X

56 / 85
What does this mean?

[Figure: the image G of X under the map x ↦ (g(x), f(x)) in the y–z plane, with points A and B and two supporting lines of different slopes]

57 / 85
What does this mean?

Thus, in the y–z plane you have

G = {(y, z) | y = g(x), z = f(x) for some x ∈ X}   (16)

Thus
Given u ≥ 0, we need to minimize f(x) + u g(x) to find θ(u), which is equivalent to solving ∇f(x) + u ∇g(x) = 0.

58 / 85
What does this mean?

Thus, in the y–z plane, we have

z + uy = α   (17)

a line with slope −u.

Then, to minimize z + uy = α
We need to move the line z + uy = α parallel to itself, as far down as possible along its negative gradient, while keeping contact with G.

59 / 85
In other words

Move the line parallel to itself until it supports G

[Figure: the set G with the line z + uy = α moved down until it becomes a supporting line of G]

Note: the set G lies above the line and touches it.

60 / 85
Thus

The problem
Then, the problem is to find the slope of the supporting hyperplane for G.

Its intersection with the z-axis
Gives θ(u).

61 / 85
Again

We can see θ(u)

[Figure: the supporting line of G with slope −u; its intercept on the z-axis is θ(u)]

62 / 85
Thus

The dual problem is equivalent to
Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.

63 / 85
Or

Such a hyperplane has slope −ū and supports G at (ȳ, z̄)

[Figure: the optimal supporting line of G, with slope −ū, touching G at (ȳ, z̄)]

Remark: the optimal solution is ū and the optimal dual objective is z̄.

64 / 85
For more on this, Please!!!

Look at this book
“Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa and C. M. Shetty, Wiley, New York (2006), at page 260.

65 / 85
Example (Lagrange Dual)

Primal

min x_1^2 + x_2^2
s.t. −x_1 − x_2 + 4 ≤ 0
     x_1, x_2 ≥ 0

Lagrange Dual

Θ(u) = inf { x_1^2 + x_2^2 + u(−x_1 − x_2 + 4) | x_1, x_2 ≥ 0 }

66 / 85
Solution

Differentiate with respect to x_1 and x_2
We have two cases to take into account: u ≥ 0 and u < 0.

The first case is clear
What about when u < 0?

We have that

θ(u) = −(1/2)u^2 + 4u if u ≥ 0, and θ(u) = 4u if u < 0   (18)

67 / 85
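The closed form in Eq. 18 is easy to check numerically: for u ≥ 0 the infimum over x_1, x_2 ≥ 0 is attained at x_1 = x_2 = u/2. A minimal sketch (the grid bounds and step size are arbitrary choices):

```python
def theta_numeric(u, lo=0.0, hi=5.0, step=0.01):
    """Approximate theta(u) = inf over x1, x2 >= 0 of the Lagrangian
    x1^2 + x2^2 + u * (-x1 - x2 + 4) by brute-force grid search."""
    best = float("inf")
    n = int((hi - lo) / step) + 1
    for i in range(n):
        x1 = lo + i * step
        for j in range(n):
            x2 = lo + j * step
            val = x1 ** 2 + x2 ** 2 + u * (-x1 - x2 + 4)
            best = min(best, val)
    return best

def theta_closed(u):
    # Eq. 18, the case u >= 0
    return -0.5 * u ** 2 + 4 * u if u >= 0 else 4 * u

# The grid search and the closed form agree on a few sample values of u.
for u in (0.0, 1.0, 2.0, 3.0):
    assert abs(theta_numeric(u) - theta_closed(u)) < 1e-4
```

For u < 0 the infimum is attained at x_1 = x_2 = 0, giving 4u, which is the second branch of Eq. 18.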
Duality Theorem

First Property
If the primal has an optimal solution, then so does the dual.

Thus
For w* and α* to be optimal solutions of the primal and dual problems, respectively, it is necessary and sufficient that w*:

  is feasible for the primal problem, and
  Φ(w*) = J(w*, w_0*, α*) = min_w J(w, w_0*, α*)

69 / 85
Reformulate our Equations

We have then

J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i d_i w^T x_i − w_0 Σ_{i=1}^{N} α_i d_i + Σ_{i=1}^{N} α_i

Now, by our 2nd optimality condition (Σ_{i=1}^{N} α_i d_i = 0),

J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i d_i w^T x_i + Σ_{i=1}^{N} α_i

70 / 85
We have finally, for the 1st Optimality Condition:

First

w^T w = Σ_{i=1}^{N} α_i d_i w^T x_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i

Second, setting J(w, w_0, α) = Q(α)

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i

71 / 85
From here, we have the problem

This is the problem that we really solve
Given the training sample {(x_i, d_i)}_{i=1}^{N}, find the Lagrange multipliers {α_i}_{i=1}^{N} that maximize the objective function

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_j^T x_i

subject to the constraints

Σ_{i=1}^{N} α_i d_i = 0   (19)
α_i ≥ 0 for i = 1, ..., N   (20)

Note
In the primal we were minimizing the cost function; by Lagrangian duality this corresponds to maximizing over α. That is the reason why we maximize Q(α).

72 / 85
Solving for α

We can compute w* once we get the optimal α_i* by using (Eq. 15)

w* = Σ_{i=1}^{N} α_i* d_i x_i

In addition, we can compute the optimal bias w_0* using the optimal weight w*
For this, we use the positive-margin equation:

g(x^(s)) = (w*)^T x^(s) + w_0 = 1

corresponding to a positive support vector x^(s).

Then

w_0* = 1 − (w*)^T x^(s) for d^(s) = 1   (21)

73 / 85
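The whole pipeline (maximize Q(α), then recover w via Eq. 15 and w_0 via Eq. 21) can be sketched on a toy two-point problem where the dual has a closed form. With x_1 = (1, 1), d_1 = +1 and x_2 = (−1, −1), d_2 = −1, the equality constraint Σ α_i d_i = 0 forces α_1 = α_2 = a, so Q reduces to 2a − 4a^2; a minimal sketch (the learning rate and iteration count are arbitrary choices):

```python
# Toy dataset: one point per class, linearly separable.
X = [(1.0, 1.0), (-1.0, -1.0)]
d = [1.0, -1.0]

# On this symmetric problem the constraint sum_i a_i * d_i = 0 forces
# a_1 = a_2 = a, and Q(a) = 2a - 4a^2. Maximize by gradient ascent
# with projection onto a >= 0 (constraint 20).
a = 0.0
for _ in range(1000):
    grad = 2.0 - 8.0 * a        # dQ/da
    a = max(0.0, a + 0.05 * grad)

alphas = [a, a]

# Eq. 15: w = sum_i alpha_i * d_i * x_i
w = [sum(alphas[i] * d[i] * X[i][k] for i in range(2)) for k in range(2)]

# Eq. 21: w0 = 1 - w^T x^(s) for a positive support vector x^(s)
xs = X[0]
w0 = 1.0 - sum(w[k] * xs[k] for k in range(2))
# Here a converges to 1/4, so w = (1/2, 1/2) and w0 = 0.
```

Both points are support vectors (both constraints are active), which is why both α_i's come out nonzero, in line with the KKT discussion above.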
What do we need?

Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable.
Problem: what happens when they are not separable?
What can we do?

75 / 85
Map to a higher Dimensional Space

Assume that there exists a mapping

x ∈ R^l → y ∈ R^k

Then, it is possible to define the following mapping.

77 / 85
Define a map to a higher Dimension

Nonlinear transformations
Given a series of nonlinear transformations

{φ_i(x)}_{i=1}^{m}

from the input space to the feature space,

We can define the decision surface as

Σ_{i=1}^{m} w_i φ_i(x) + w_0 = 0

78 / 85
This allows us to define

The following vector

φ(x) = (φ_0(x), φ_1(x), ..., φ_m(x))^T

that represents the mapping.

From this mapping
we can define the following kernel function

K : X × X → R
K(x_i, x_j) = φ(x_i)^T φ(x_j)

79 / 85
Example

Assume

x ∈ R^2 → y = (x_1^2, √2 x_1 x_2, x_2^2)^T

We can show that

y_i^T y_j = (x_i^T x_j)^2

81 / 85
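This identity is easy to verify numerically; a minimal sketch (the two sample vectors are arbitrary):

```python
import math

def phi(x):
    # Explicit feature map x -> (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

xi, xj = (1.0, 2.0), (3.0, 4.0)

lhs = dot(phi(xi), phi(xj))   # inner product computed in feature space
rhs = dot(xi, xj) ** 2        # kernel evaluated directly in input space
assert abs(lhs - rhs) < 1e-9  # both equal (1*3 + 2*4)^2 = 121
```

The point of the kernel trick is exactly this: the right-hand side never forms the feature vectors, yet computes the same inner product.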
Example of Kernels

Polynomials

k(x, z) = (x^T z + 1)^q, q > 0

Radial Basis Functions

k(x, z) = exp(−‖x − z‖^2 / σ^2)

Hyperbolic Tangents

k(x, z) = tanh(β x^T z + γ)

82 / 85
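These three families can be written directly as functions; a minimal sketch (the parameter values q, σ, β, γ and the test vectors are arbitrary choices):

```python
import math

def poly_kernel(x, z, q=2):
    # Polynomial kernel: (x^T z + 1)^q
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    # Radial Basis Function kernel: exp(-||x - z||^2 / sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    # Hyperbolic tangent kernel: tanh(beta * x^T z + gamma)
    return math.tanh(beta * sum(a * b for a, b in zip(x, z)) + gamma)

x, z = (1.0, 0.0), (0.0, 1.0)
assert poly_kernel(x, z) == 1.0               # (0 + 1)^2
assert abs(rbf_kernel(x, x) - 1.0) < 1e-12    # RBF of a point with itself
assert rbf_kernel(x, z) < 1.0                 # and it decays with distance
assert abs(tanh_kernel(x, z)) < 1.0           # tanh output lies in (-1, 1)
```

Note that the hyperbolic tangent kernel satisfies Mercer's condition only for some choices of β and γ, whereas the RBF kernel is a valid kernel for every σ > 0.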
Now, How to select a Kernel?

We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.

Thus
In general, the Radial Basis Function kernel is a reasonable first choice.

Then
If this fails, we can try the other possible kernels.

84 / 85
Thus, we have something like this
Step 1Normalize the data.
Step 2Use cross-validation to adjust the parameters of the selected kernel.
Step 3Train against the entire dataset.
85 / 85
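The three steps above can be sketched in pure Python. To keep the sketch self-contained, a trivial kernel nearest-class-mean classifier stands in for a full SVM trainer; the toy dataset, the σ grid, and the fold count are all arbitrary choices:

```python
import math
import random

def zscore(X):
    # Step 1: normalize each feature to zero mean, unit variance.
    n, dim = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(dim)]
    sd = [math.sqrt(sum((x[j] - mu[j]) ** 2 for x in X) / n) or 1.0
          for j in range(dim)]
    return [tuple((x[j] - mu[j]) / sd[j] for j in range(dim)) for x in X]

def rbf(x, z, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

def accuracy(train, test, sigma):
    # Stand-in classifier: assign each test point to the class whose
    # training points have the highest mean kernel similarity to it.
    correct = 0
    for x, d in test:
        score = {}
        for c in (-1.0, 1.0):
            pts = [xi for xi, di in train if di == c]
            score[c] = sum(rbf(x, xi, sigma) for xi in pts) / len(pts)
        if max(score, key=score.get) == d:
            correct += 1
    return correct / len(test)

# Toy two-cluster dataset, one Gaussian blob per class.
random.seed(0)
data = [((random.gauss(c, 0.5), random.gauss(c, 0.5)), 1.0 if c else -1.0)
        for c in (0, 3) for _ in range(20)]
labels = [d for _, d in data]
points = zscore([x for x, _ in data])
data = list(zip(points, labels))

# Step 2: pick sigma by k-fold cross-validation over a small grid.
k = 4
folds = [data[i::k] for i in range(k)]
best_sigma, best_acc = None, -1.0
for sigma in (0.1, 0.5, 1.0, 2.0):
    accs = []
    for i in range(k):
        test = folds[i]
        train = [s for j in range(k) if j != i for s in folds[j]]
        accs.append(accuracy(train, test, sigma))
    cv = sum(accs) / k
    if cv > best_acc:
        best_sigma, best_acc = sigma, cv

# Step 3: "train" against the entire dataset with the chosen sigma.
final_acc = accuracy(data, data, best_sigma)
```

In practice the stand-in classifier would be replaced by an actual SVM solver, but the skeleton (normalize, cross-validate the kernel parameters, retrain on everything) stays the same.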