

1

CSE 881: Data Mining

Lecture 10: Support Vector Machines

2

Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data

3

Support Vector Machines

One Possible Solution

[Figure: data with candidate decision boundary B1]

4

Support Vector Machines

Another possible solution

[Figure: data with another candidate decision boundary B2]

5

Support Vector Machines

Other possible solutions

[Figure: data with several other candidate decision boundaries]

6

Support Vector Machines

Which one is better? B1 or B2? How do you define better?

[Figure: candidate decision boundaries B1 and B2]

7

Support Vector Machines

Find the hyperplane that maximizes the margin => B1 is better than B2

[Figure: boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; the margin of B1 is shown]

8

SVM vs Perceptron

Perceptron: minimizes the least-squares error (via gradient descent)

SVM: maximizes the margin

9

Support Vector Machines

[Figure: decision boundary B1 with its margin hyperplanes b11 and b12]

Decision boundary: w · x + b = 0

Margin hyperplanes: w · x + b = 1 and w · x + b = −1

f(x) = 1 if w · x + b ≥ 1
      −1 if w · x + b ≤ −1

Margin = 2 / ||w||

10

Learning Linear SVM

We want to maximize: Margin = 2 / ||w||

– which is equivalent to minimizing: L(w) = ||w||² / 2

– but subject to the following constraints:

  f(x_i) = 1 if w · x_i + b ≥ 1
          −1 if w · x_i + b ≤ −1

  or, equivalently: y_i (w · x_i + b) ≥ 1, i = 1, 2, ..., N

This is a constrained optimization problem; solve it using the Lagrange multiplier method.
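Not the lecture's Lagrange-multiplier derivation (which follows next): a minimal numerical sketch of this constrained problem using SciPy's generic SLSQP solver on a hypothetical, linearly separable 2-D toy set.

```python
# Numerically solve  min ||w||^2 / 2  subject to  y_i (w . x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, 0.5]])  # toy data (assumed)
y = np.array([1, 1, -1, -1])

def objective(p):
    w = p[:2]                      # p = (w1, w2, b); b is not penalized
    return 0.5 * np.dot(w, w)

constraints = [
    {"type": "ineq", "fun": lambda p, i=i: y[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0}
    for i in range(len(y))         # y_i (w . x_i + b) - 1 >= 0
]

res = minimize(objective, x0=np.array([1.0, 1.0, -1.0]),
               constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```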

11

Unconstrained Optimization

min E(w1, w2) = w1² + w2²

dE/dw1 = 2 w1 = 0  ⇒  w1 = 0
dE/dw2 = 2 w2 = 0  ⇒  w2 = 0

[Figure: surface of E(w1, w2) over the (w1, w2) plane]

w1 = 0, w2 = 0 minimizes E(w1, w2)

Minimum value is E(0, 0) = 0
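A tiny illustration (my own, not from the slides): plain gradient descent on E(w1, w2) = w1² + w2² converges to this unconstrained minimum at (0, 0).

```python
# Gradient descent on E(w1, w2) = w1^2 + w2^2; the gradient is (2 w1, 2 w2).
import numpy as np

w = np.array([3.0, -2.0])      # arbitrary starting point
lr = 0.1                       # step size
for _ in range(100):
    grad = 2.0 * w             # dE/dw1 = 2 w1, dE/dw2 = 2 w2
    w = w - lr * grad

print(w)                       # approaches (0, 0), where E attains its minimum 0
```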

12

Constrained Optimization

min E(w1, w2) = w1² + w2²

subject to 2 w1 + w2 = 4 (equality constraint)

Lagrangian: L(w1, w2, λ) = w1² + w2² + λ (4 − 2 w1 − w2)

dL/dw1 = 2 w1 − 2 λ = 0  ⇒  w1 = λ
dL/dw2 = 2 w2 − λ = 0   ⇒  w2 = λ / 2
dL/dλ  = 4 − 2 w1 − w2 = 0  ⇒  2 λ + λ/2 = 4  ⇒  λ = 8/5

So w1 = 8/5, w2 = 4/5, and E(w1, w2) = 16/5

Dual problem: max_λ L(λ) = 4 λ − (5/4) λ², maximized at λ = 8/5 with value 16/5

[Figure: surface of E(w1, w2) with the constraint line 2 w1 + w2 = 4]
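As a sanity check of this worked example (my own sketch, not part of the slides), SymPy can solve the stationarity conditions of the Lagrangian directly:

```python
# Solve dL/dw1 = dL/dw2 = dL/dlam = 0 for L = w1^2 + w2^2 + lam*(4 - 2*w1 - w2).
import sympy as sp

w1, w2, lam = sp.symbols("w1 w2 lam")
L = w1**2 + w2**2 + lam * (4 - 2 * w1 - w2)

sol = sp.solve([sp.diff(L, v) for v in (w1, w2, lam)], (w1, w2, lam), dict=True)[0]
print(sol)                                   # {w1: 8/5, w2: 4/5, lam: 8/5}
print(sp.simplify(w1**2 + w2**2).subs(sol))  # 16/5
```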

13

Learning Linear SVM

Lagrangian:

  L = ||w||² / 2 − Σ_{i=1..n} λ_i [ y_i (w · x_i + b) − 1 ]

– Take the derivative w.r.t. w and b:

  ∂L/∂w = 0  ⇒  w = Σ_{i=1..n} λ_i y_i x_i
  ∂L/∂b = 0  ⇒  Σ_{i=1..n} λ_i y_i = 0

– Additional constraints:

  λ_i ≥ 0
  λ_i [ y_i (w · x_i + b) − 1 ] = 0

– Dual problem:

  max L_D = Σ_{i=1..n} λ_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} λ_i λ_j y_i y_j (x_i · x_j)
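A minimal sketch (my own, not from the lecture) of evaluating this dual objective with NumPy for a given vector of multipliers λ:

```python
# Dual objective  L_D(lam) = sum_i lam_i - 1/2 * sum_ij lam_i lam_j y_i y_j (x_i . x_j)
import numpy as np

def dual_objective(lam, X, y):
    G = (X @ X.T) * np.outer(y, y)          # G_ij = y_i y_j (x_i . x_j)
    return lam.sum() - 0.5 * lam @ G @ lam

# Toy usage: any lam >= 0 with sum_i lam_i y_i = 0 is dual-feasible.
X = np.array([[1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
lam = np.array([0.5, 0.5])                  # satisfies sum_i lam_i y_i = 0
print(dual_objective(lam, X, y))
```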

14

Learning Linear SVM

Bigger picture:
– The learning algorithm needs to find w and b
– To solve for w:

  w = Σ_{i=1..n} λ_i y_i x_i

– But:  λ_i [ y_i (w · x_i + b) − 1 ] = 0

  so λ_i is zero for points that do not reside on y_i (w · x_i + b) = 1

Data points whose λ_i are not zero are called support vectors
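A short sketch (my own) of recovering w and b once the multipliers are known: w = Σ λ_i y_i x_i, and b = y_s − w · x_s for any support vector x_s, since support vectors satisfy y_s (w · x_s + b) = 1 and y_s = ±1.

```python
import numpy as np

def recover_w_b(lam, X, y, tol=1e-8):
    w = (lam * y) @ X                      # w = sum_i lam_i y_i x_i
    sv = np.where(lam > tol)[0]            # indices of the support vectors
    s = sv[0]                              # any support vector will do
    b = y[s] - w @ X[s]                    # from y_s (w . x_s + b) = 1
    return w, b, sv
```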

15

Example of Linear SVM

x1      x2       y    λ
0.3858  0.4687   1    65.5261
0.4871  0.6110  -1    65.5261
0.9218  0.4103  -1     0
0.7382  0.8936  -1     0
0.1763  0.0579   1     0
0.4057  0.3529   1     0
0.9355  0.8132  -1     0
0.2146  0.0099   1     0

The support vectors are the two points with nonzero λ (the first two rows).
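A hedged reproduction of this kind of result (my own code, not the lecture's): fitting scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin SVM, and the reported support vectors and coefficients should match the table up to numerical tolerance.

```python
# Fit a (nearly) hard-margin linear SVM to the 8 points above and inspect
# which of them end up as support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.3858, 0.4687], [0.4871, 0.6110], [0.9218, 0.4103],
              [0.7382, 0.8936], [0.1763, 0.0579], [0.4057, 0.3529],
              [0.9355, 0.8132], [0.2146, 0.0099]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1])

clf = SVC(kernel="linear", C=1e6)          # large C ~ hard margin
clf.fit(X, y)

print("support vector indices:", clf.support_)        # expected: the first two points
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("|dual coefs| =", np.abs(clf.dual_coef_[0]))     # lambda_i * y_i for support vectors
```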

16

Learning Linear SVM

Bigger picture…

– The decision boundary depends only on the support vectors: given a data set with the same support vectors, the decision boundary will not change

– How to classify using the SVM once w and b are found? Given a test record x_i:

  f(x_i) = 1 if w · x_i + b ≥ 1
          −1 if w · x_i + b ≤ −1
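A one-line sketch of that decision rule (my own illustration); in practice the prediction is simply the sign of w · x_i + b, which agrees with the rule above whenever the test record falls outside the margin:

```python
import numpy as np

def classify(x, w, b):
    # Predict +1 or -1 from the sign of the score w . x + b.
    return 1 if np.dot(w, x) + b >= 0 else -1
```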

17

Support Vector Machines

What if the problem is not linearly separable?

18

Support Vector Machines

What if the problem is not linearly separable?
– Introduce slack variables ξ_i

  Need to minimize:  L(w) = ||w||² / 2 + C Σ_{i=1..N} ξ_i^k

  Subject to:
    f(x_i) = 1 if w · x_i + b ≥ 1 − ξ_i
            −1 if w · x_i + b ≤ −1 + ξ_i

  If k is 1 or 2, this leads to the same objective function as the linear SVM but with different constraints (see textbook)
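A small sketch (my own, for the k = 1 case): at the optimum each slack equals ξ_i = max(0, 1 − y_i (w · x_i + b)), so the soft-margin objective can be evaluated directly:

```python
# Soft-margin objective for k = 1:  ||w||^2 / 2 + C * sum_i xi_i,
# with the smallest feasible slacks xi_i = max(0, 1 - y_i (w . x_i + b)).
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    scores = X @ w + b
    slacks = np.maximum(0.0, 1.0 - y * scores)   # xi_i; zero for points outside the margin
    return 0.5 * np.dot(w, w) + C * slacks.sum()
```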

19

Nonlinear Support Vector Machines

What if decision boundary is not linear?

20

Nonlinear Support Vector Machines

Trick: Transform data into higher dimensional space

Decision boundary: w · Φ(x) + b = 0
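One concrete choice of Φ for 2-D inputs (my own illustration; the slides do not fix a particular map) is the degree-2 map associated with the polynomial kernel (x · z + 1)²:

```python
# Hypothetical explicit degree-2 feature map for x = (x1, x2).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

# A linear boundary w . phi(x) + b = 0 in this 6-D space corresponds to a
# quadratic decision boundary in the original 2-D space.
```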

21

Learning Nonlinear SVM

Optimization problem: minimize ||w||² / 2 subject to y_i (w · Φ(x_i) + b) ≥ 1

This leads to the same set of equations as before, but involving Φ(x) instead of x.

22

Learning Nonlinear SVM

Issues:

– What type of mapping function should be used?

– How to do the computation in the high-dimensional space? Most computations involve the dot product Φ(x_i) · Φ(x_j)

Curse of dimensionality?

23

Learning Nonlinear SVM

Kernel Trick:

– Φ(x_i) · Φ(x_j) = K(x_i, x_j)

– K(x_i, x_j) is a kernel function, expressed in terms of the coordinates in the original space. Examples:
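Two commonly used kernels, given here as illustrative examples rather than the slide's exact list, are the polynomial and Gaussian (RBF) kernels:

```python
# Illustrative kernel functions:
#   polynomial:   K(x, z) = (x . z + 1)^p
#   Gaussian/RBF: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
import numpy as np

def polynomial_kernel(x, z, p=2):
    return (np.dot(x, z) + 1.0) ** p

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

# Sanity check: the degree-2 polynomial kernel equals the dot product of the
# explicit map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1).
def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose(polynomial_kernel(x, z), phi(x) @ phi(z)))   # True
```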

24

Example of Nonlinear SVM

SVM with polynomial degree 2 kernel
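A hedged sketch (my own code and synthetic data, not the figure's) of fitting such a model with scikit-learn:

```python
# Fit an SVM with a degree-2 polynomial kernel on data with a circular class boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # not linearly separable

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```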

25

Learning Nonlinear SVM

Advantages of using kernel:

– Don't have to know the mapping function Φ explicitly
– Computing the dot product Φ(x_i) · Φ(x_j) in the original space avoids the curse of dimensionality

Not all functions can be kernels

– Must make sure there is a corresponding Φ in some high-dimensional space

– Mercer’s theorem (see textbook)
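As a hedged illustration of what Mercer's condition means in practice (my own sketch): on any finite sample, a valid kernel's Gram matrix must be symmetric and positive semi-definite, which can be checked numerically.

```python
# Necessary check for a valid kernel: the Gram matrix K_ij = K(x_i, x_j)
# on a finite sample must be symmetric with no negative eigenvalues.
import numpy as np

def gram_is_psd(kernel, X, tol=1e-10):
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(1).normal(size=(20, 2))
print(gram_is_psd(lambda a, b: (a @ b + 1.0) ** 2, X))   # True: polynomial kernel is valid
```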
