Support Vector Machines

Prakash B. Pimpale, CDAC Mumbai
Workshop on Machine Learning


DESCRIPTION

Introduction to SVM.

TRANSCRIPT

Page 1: Support Vector Machines

Support Vector Machines

Prakash B. Pimpale
CDAC Mumbai

Page 2: Support Vector Machines

Outline

o Introduction
o Towards SVM
o Basic Concept
o Implementations
o Issues
o Conclusion & References

Page 3: Support Vector Machines

Introduction:

o SVMs – supervised learning methods for classification and regression
o Base: Vapnik–Chervonenkis (statistical learning) theory
o First practical implementation: early nineties
o Satisfying from a theoretical point of view
o Can lead to high performance in practical applications
o Currently considered one of the most efficient families of algorithms in machine learning

Page 4: Support Vector Machines

Towards SVM

A: I found a really good function describing the training examples using an ANN, but it couldn't classify test examples that efficiently. What could be the problem?
B: It didn't generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM
1) Generalizes well
And what's more...
2) Computationally efficient (just a convex optimization problem)
3) Robust in high dimensions too (no overfitting)

Page 5: Support Vector Machines

A: Why is it so?
B: So many questions...!

o Vapnik & Chervonenkis Statistical Learning Theory result: relates the ability to learn a rule for classifying training data to the ability of the resulting rule to classify unseen examples (generalization)
o Let a rule $f \in F$
o Empirical Risk of $f$: a measure of the quality of classification on the training data
  Best performance: $R_{emp}(f) = 0$
  Worst performance: $R_{emp}(f) = 1$
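
In the standard formulation (not spelled out on the slide), the empirical risk is simply the fraction of the N training examples that $f$ misclassifies:

  $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ f(x^{(i)}) \neq y^{(i)} \right]$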

Page 6: Support Vector Machines

What about the Generalization?

o Risk of a classifier: probability that rule $f$ makes a mistake on a new sample randomly generated by the underlying random mechanism (the data-generating distribution):

  $R(f) = P(f(X) \neq Y)$

  Best generalization: $R(f) = 0$
  Worst generalization: $R(f) = 1$

o Many times a small Empirical Risk implies a small Risk

Page 7: Support Vector Machines

Is the problem solved? ........ NO!

o Is the Risk of the rule $\hat{f}$ selected by Empirical Risk Minimization (ERM) near to that of the ideal rule $f_{ideal}$?
o No, not in the case of overfitting
o Important result of Statistical Learning Theory:

  $E\,R(\hat{f}) \;\leq\; R(f_{ideal}) + C\sqrt{\frac{V(F)}{N}}$

  Where, V(F) – VC dimension of class F
  N – number of observations for training
  C – a universal constant

Page 8: Support Vector Machines

What it says:

o The Risk of the rule selected by ERM is not far from the Risk of the ideal rule if:
1) N is large enough
2) The VC dimension of F is small enough

[VC dimension? In short, the larger a class F, the larger its VC dimension (Sorry Vapnik sir!)]

Page 9: Support Vector Machines

Structural Risk Minimization (SRM)

o Consider a family of subclasses of F:

  $F_0 \subset F_1 \subset \ldots \subset F_n$  such that  $V(F_0) \leq V(F_1) \leq \ldots \leq V(F_n)$

o Find the minimum Empirical Risk for each subclass and its VC dimension
o Select the subclass with the minimum bound on the Risk (i.e. the sum of the VC dimension term and the empirical risk)

Page 10: Support Vector Machines

SRM Graphically:

[Figure: the bound $E\,R(\hat{f}) \leq R(f_{ideal}) + C\sqrt{\frac{V(F)}{N}}$ shown graphically, as the sum of the empirical risk and the complexity term]

Page 11: Support Vector Machines

A: What does it have to do with SVM....?
B: SVM is an approximate implementation of SRM!
A: How?
B: Just in a simple way for now, import a result: maximizing the distance of the decision boundary from the training points minimizes the VC dimension, resulting in good generalization!

Page 12: Support Vector Machines

A: That means from now on our target is maximizing the distance between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well, but can you please explain the concept of SVM and how to implement it? Are there any packages available?
B: Yeah, don't worry, there are many implementations available; just use them for your application. The next part of the presentation will give a basic idea about SVM, so be with me!

Page 13: Support Vector Machines

Basic Concept of SVM:

o Which line will classify the unseen data well?
o The dotted line! It's the line with Maximum Margin!

Page 14: Support Vector Machines

Cont...

[Figure: maximum-margin separating hyperplane $W^T X + b = 0$ with margin boundaries $W^T X + b = +1$ and $W^T X + b = -1$; the points lying on the margin boundaries are the Support Vectors]

Page 15: Support Vector Machines

Some definitions:

o Functional Margin w.r.t.
1) individual examples: $\hat{\gamma}^{(i)} = y^{(i)}\left(W^T x^{(i)} + b\right)$
2) the example set $S = \{(x^{(i)}, y^{(i)});\; i = 1, \ldots, m\}$: $\hat{\gamma} = \min_{i=1,\ldots,m} \hat{\gamma}^{(i)}$

o Geometric Margin w.r.t.
1) individual examples: $\gamma^{(i)} = y^{(i)}\left(\left(\frac{W}{\|W\|}\right)^T x^{(i)} + \frac{b}{\|W\|}\right)$
2) the example set S: $\gamma = \min_{i=1,\ldots,m} \gamma^{(i)}$

Page 16: Support Vector Machines

Problem Formulation:

[Figure: separating hyperplane $W^T X + b = 0$ with margin boundaries $W^T X + b = +1$ and $W^T X + b = -1$]

Page 17: Support Vector Machines

Cont..

o Distance of a point (u, v) from the line Ax + By + C = 0 is given by |Au + Bv + C| / ||n||, where ||n|| is the norm of the normal vector n = (A, B)
o Distance of the hyperplane $W^T X + b = 0$ from the origin = $\frac{|b|}{\|W\|}$
o Distance of point A (on $W^T X + b = -1$) from the origin = $\frac{|b + 1|}{\|W\|}$
o Distance of point B (on $W^T X + b = +1$) from the origin = $\frac{|b - 1|}{\|W\|}$
o Distance between points A and B (the Margin) = $\frac{2}{\|W\|}$

Page 18: Support Vector Machines

Cont...

We have a data set $d = \{X^{(i)}, Y^{(i)}\},\; i = 1, \ldots, m$, where $X \in R^n$ and $Y \in R^1$

Separating hyperplane:

  $W^T X + b = 0$
  s.t.
  $W^T X^{(i)} + b > 0$ if $Y^{(i)} = +1$
  $W^T X^{(i)} + b < 0$ if $Y^{(i)} = -1$

Page 19: Support Vector Machines

Cont...

o Suppose the training data also satisfy the following constraints:

  $W^T X^{(i)} + b \geq +1$ for $Y^{(i)} = +1$
  $W^T X^{(i)} + b \leq -1$ for $Y^{(i)} = -1$

o Combining these into one:

  $Y^{(i)}\left(W^T X^{(i)} + b\right) \geq 1 \quad \forall i$

o Our objective is to find the hyperplane (W, b) with maximal separation between it and the closest data points while satisfying the above constraint

Page 20: Support Vector Machines

THE PROBLEM:

  $\max_{W,b} \; \frac{2}{\|W\|}$

  such that
  $Y^{(i)}\left(W^T X^{(i)} + b\right) \geq 1 \quad \forall i$

Also we know $\|W\| = \sqrt{W^T W}$

Page 21: Support Vector Machines

Cont..

So the problem can be written as:

  $\min_{W,b} \; \frac{1}{2} W^T W$

  such that
  $Y^{(i)}\left(W^T X^{(i)} + b\right) \geq 1 \quad \forall i$

Notice: $W^T W = \|W\|^2$

It is just a convex quadratic optimization problem!
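
As an illustration only (not part of the original slides), here is a minimal sketch of this hard-margin primal QP solved with the cvxpy library on a tiny made-up, linearly separable data set:

# Illustrative sketch; assumes numpy and cvxpy are installed.
# Solves: min (1/2) w'w  s.t.  y_i (w'x_i + b) >= 1 for all i.
import numpy as np
import cvxpy as cp

# Tiny hypothetical linearly separable data set (labels must be +1/-1).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

m, n = X.shape
w = cp.Variable(n)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))   # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]     # y_i (w'x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w.value))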

Page 22: Support Vector Machines

DUAL

o Solving the dual of our problem will lead us to apply SVM to nonlinearly separable data efficiently
o It can be shown that

  $\min_{W,b} \max_{\alpha \geq 0} L(W, b, \alpha) \;\geq\; \max_{\alpha \geq 0} \min_{W,b} L(W, b, \alpha)$

  (the primal value bounds the dual value; under the KKT conditions the two are equal)

o Primal problem:

  $\min_{W,b} \; \frac{1}{2} W^T W$

  such that
  $Y^{(i)}\left(W^T X^{(i)} + b\right) \geq 1 \quad \forall i$

Page 23: Support Vector Machines

Constructing the Lagrangian

o Lagrangian for our problem:

  $L(W, b, \alpha) = \frac{1}{2}\|W\|^2 - \sum_{i=1}^{m} \alpha_i \left[ Y^{(i)}\left(W^T X^{(i)} + b\right) - 1 \right]$

  where $\alpha_i$ is a Lagrange multiplier and $\alpha_i \geq 0$

o Now minimizing it w.r.t. W and b: we set the derivatives of the Lagrangian w.r.t. W and b to zero

Page 24: Support Vector Machines

Cont...

o Setting the derivative w.r.t. W to zero gives:

  $W = \sum_{i=1}^{m} \alpha_i Y^{(i)} X^{(i)}$

o Setting the derivative w.r.t. b to zero gives:

  $\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$

Page 25: Support Vector Machines

Cont...

o Plugging these results into the Lagrangian gives:

  $L(W, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \, (X^{(i)})^T X^{(j)}$

o Call it

  $D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \, (X^{(i)})^T X^{(j)}$

o This is the result of our minimization w.r.t. W and b

Page 26: Support Vector Machines

So The DUAL:

o Now the Dual becomes:

  $\max_{\alpha} \; D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \, \langle X^{(i)}, X^{(j)} \rangle$

  s.t.
  $\alpha_i \geq 0, \quad i = 1, \ldots, m$
  $\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$

o Solving this optimization problem gives us the $\alpha_i$
o Also the Karush-Kuhn-Tucker (KKT) condition is satisfied at this solution, i.e.

  $\alpha_i \left[ Y^{(i)}\left(W^T X^{(i)} + b\right) - 1 \right] = 0 \quad \text{for } i = 1, \ldots, m$

Page 27: Support Vector Machines

Values of W and b:

o W can be found using:

  $W = \sum_{i=1}^{m} \alpha_i Y^{(i)} X^{(i)}$

o b can be found using:

  $b^* = -\dfrac{\max_{i:\, Y^{(i)} = -1} W^{*T} X^{(i)} + \min_{i:\, Y^{(i)} = +1} W^{*T} X^{(i)}}{2}$
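
As a rough illustration (not from the slides), given dual solutions alpha obtained from some QP solver, W and b could be recovered along these lines:

# Sketch only: recover W and b from dual variables alpha (assumed already
# computed by a QP solver) for linearly separable data X, labels y in {-1,+1}.
import numpy as np

def recover_w_b(alpha, X, y):
    # W = sum_i alpha_i * y_i * x_i
    W = (alpha * y) @ X
    # b* = -( max_{i: y_i=-1} W'x_i + min_{i: y_i=+1} W'x_i ) / 2
    scores = X @ W
    b = -(scores[y == -1].max() + scores[y == +1].min()) / 2.0
    return W, b

By the KKT condition above, only examples with alpha_i > 0 (the support vectors) contribute to the sum for W.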

Page 28: Support Vector Machines

What if data is nonlinearly separable?

o The maximal margin hyperplane can classify only linearly separable data
o What if the data is linearly non-separable?
o Take your data to a space where it is linearly separable (a higher dimensional space) and use the maximal margin hyperplane there!

Page 29: Support Vector Machines

Taking it to a higher dimension works! Example: XOR

Page 30: Support Vector Machines

Doing it in a higher dimensional space

o Let $\Phi: X \to F$ be a non-linear mapping from the input space X (original space) to the feature space F (higher dimensional)
o Then our inner (dot) product $\langle X^{(i)}, X^{(j)} \rangle$ in the higher dimensional space becomes $\langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle$
o Now, the problem becomes:

  $\max_{\alpha} \; D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \, \langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle$

  s.t.
  $\alpha_i \geq 0, \quad i = 1, \ldots, m$
  $\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$

Page 31: Support Vector Machines

Kernel function:

o There exists a way to compute the inner product in feature space as a function of the original input points – it's the kernel function!
o Kernel function:

  $K(x, z) = \langle \phi(x), \phi(z) \rangle$

o We need not know $\phi$ to compute $K(x, z)$

Page 32: Support Vector Machines

An example:

For n = 3, let $x, z \in R^n$ and $K(x, z) = (x^T z)^2$,

i.e. $K(x, z) = \left(\sum_{i=1}^{n} x_i z_i\right)\left(\sum_{j=1}^{n} x_j z_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i x_j)(z_i z_j) = \langle \phi(x), \phi(z) \rangle$

where the feature mapping is given as:

  $\phi(x) = (x_1 x_1,\; x_1 x_2,\; x_1 x_3,\; x_2 x_1,\; x_2 x_2,\; x_2 x_3,\; x_3 x_1,\; x_3 x_2,\; x_3 x_3)^T$

so $K(x, z) = \langle \phi(x), \phi(z) \rangle$

Page 33: Support Vector Machines

example cont...

o Here, for $x = (1, 2)^T$ and $z = (3, 4)^T$:

  $\phi(x) = (x_1 x_1,\; x_1 x_2,\; x_2 x_1,\; x_2 x_2)^T = (1,\; 2,\; 2,\; 4)^T, \qquad \phi(z) = (9,\; 12,\; 12,\; 16)^T$

  $K(x, z) = (x^T z)^2 = (11)^2 = 121$

  $\langle \phi(x), \phi(z) \rangle = 1 \cdot 9 + 2 \cdot 12 + 2 \cdot 12 + 4 \cdot 16 = 121$

  So $K(x, z) = \langle \phi(x), \phi(z) \rangle$
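
A tiny numerical check of this example (illustration only, not in the original deck):

# Verify that K(x, z) = (x'z)^2 equals <phi(x), phi(z)> for the slide's example.
import numpy as np

def phi(v):                       # feature map: all pairwise products v_i * v_j
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print((x @ z) ** 2)               # kernel value: (1*3 + 2*4)^2 = 121
print(phi(x) @ phi(z))            # explicit feature-space inner product: 121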

Page 34: Support Vector Machines

So our SVM for the non-linearly separable data:

o Optimization problem:

  $\max_{\alpha} \; D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \, K(X^{(i)}, X^{(j)})$

  s.t.
  $\alpha_i \geq 0, \quad i = 1, \ldots, m$
  $\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$

o Decision function:

  $F(X) = \mathrm{Sign}\left( \sum_{i=1}^{m} \alpha_i Y^{(i)} K(X^{(i)}, X) + b \right)$
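
A direct transcription of this decision function (sketch only; assumes alpha, b, and a kernel function K are already available from training):

# Kernel SVM decision function: F(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )
import numpy as np

def decision(X_train, y_train, alpha, b, K, x_new):
    s = sum(a * yi * K(xi, x_new) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s + b)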

Page 35: Support Vector Machines

Some commonly used Kernel functions:

o Linear: $K(X, Y) = X^T Y$

o Polynomial of degree d: $K(X, Y) = (X^T Y + 1)^d$

o Gaussian Radial Basis Function (RBF): $K(X, Y) = e^{-\frac{\|X - Y\|^2}{2\sigma^2}}$

o Tanh kernel: $K(X, Y) = \tanh\left(\rho (X^T Y) - \delta\right)$

Page 36: Support Vector Machines

Implementations:

Some ready-to-use SVM implementations:
1) LIBSVM: A library for SVM by Chih-Chung Chang and Chih-Jen Lin (at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: An implementation in C by Thorsten Joachims (at: http://svmlight.joachims.org/)
3) Weka: A data mining software in Java by the University of Waikato (at: http://www.cs.waikato.ac.nz/ml/weka/)
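
For instance, scikit-learn (not listed above, but its SVC class wraps LIBSVM) lets you try a kernel SVM in a few lines; this is only a usage sketch with arbitrarily chosen parameters:

# Usage sketch: scikit-learn's SVC is built on top of LIBSVM.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel; C and gamma chosen arbitrarily
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))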

Page 37: Support Vector Machines

Issues:

o Selecting a suitable kernel: most of the time it is trial and error
o Multiclass classification: one decision function for each class (1 vs the other l-1 classes), then pick the class whose decision function gives the maximum value; i.e. if X belongs to class 1, then for this and the other (l-1) classes the values of the decision functions are:

  $F_1(X) \geq +1$
  $F_2(X) \leq -1$
  $\vdots$
  $F_l(X) \leq -1$
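
A minimal sketch of this one-vs-rest scheme (illustrative only; assumes the per-class decision functions F_k have already been trained):

# One-vs-rest: pick the class whose decision function gives the largest value.
import numpy as np

def predict_multiclass(decision_functions, x):
    # decision_functions: list of callables F_k(x), one per class
    values = np.array([F(x) for F in decision_functions])
    return int(np.argmax(values))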

Page 38: Support Vector Machines

Cont....

o Sensitive to noise: mislabeled data can badly affect the performance
o Good performance for applications like:
1) Computational biology and medical applications (protein, cancer classification problems)
2) Image classification
3) Hand-written character recognition
And many others.....
o Use SVM for high dimensional, linearly separable data (its strength); for nonlinearly separable data, performance depends on the choice of kernel

Page 39: Support Vector Machines

Conclusion:

Support Vector Machines provide a very simple method for linear classification. But performance, in the case of nonlinearly separable data, largely depends on the choice of kernel!

Page 40: Support Vector Machines

References:

o Nello Cristianini and John Shawe-Taylor (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
o Christopher J.C. Burges (1998). A Tutorial on Support Vector Machines for Pattern Recognition. In Usama Fayyad, editor, Data Mining and Knowledge Discovery, 2, 121-167. Kluwer Academic Publishers, Boston.
o Andrew Ng (2007). CS229 Lecture Notes. Stanford Engineering Everywhere, Stanford University.
o Support Vector Machines <http://www.svms.org> (Accessed 10.11.2008)
o Wikipedia
o Kernel-Machines.org <http://www.kernel-machines.org> (Accessed 10.11.2008)

Page 41: Support Vector Machines

Thank You!


[email protected] ; [email protected]