"svm - the original papers" presentation @ papers we love bucharest
TRANSCRIPT
[Figure: trend of SVM-related papers on arXiv, data from Nov 1997 to Nov 2013 — http://bookworm.culturomics.org/arxiv/]
Introduction
The separating hyperplane is $\{x \mid w \cdot x + b = 0\}$, and $w$ is normal to it: for any two points $x$, $x_0$ on the hyperplane,

$$w \cdot (x - x_0) = (w \cdot x + b) - (w \cdot x_0 + b) = 0.$$

Distance between two parallel hyperplanes $(w, b_1)$ and $(w, b_2)$: starting from a point $x$ on the first and moving along $w$ until reaching the second, $d((w, b_1), (w, b_2)) = \|t\,w\| = |t|\,\|w\|$, where

$$w \cdot (x + t\,w) + b_2 = 0 \ \Rightarrow\ w \cdot x + t\,\|w\|^2 + b_2 = 0$$
$$w \cdot x + b_1 = 0 \ \Rightarrow\ -b_1 + t\,\|w\|^2 + b_2 = 0 \ \Rightarrow\ t = \frac{b_1 - b_2}{\|w\|^2}$$

so

$$d((w, b_1), (w, b_2)) = |t|\,\|w\| = \frac{|b_1 - b_2|}{\|w\|}.$$

In particular, the margin between the hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$ (offsets $b - 1$ and $b + 1$) is

$$d((w, b - 1), (w, b + 1)) = \frac{|(b - 1) - (b + 1)|}{\|w\|} = \frac{2}{\|w\|}.$$

[Figure: + and − points in the $(x_1, x_2)$ plane, separated by $w^T x + b = 0$ with supporting hyperplanes $w^T x + b = \pm 1$ at distance $2/\|w\|$ from each other.]
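As a quick numeric check of the distance formula (the numbers are chosen here for illustration; they are not from the slides):

$$w = (3, 4),\ b_1 = 1,\ b_2 = 11: \quad d((w, b_1), (w, b_2)) = \frac{|1 - 11|}{\|(3, 4)\|} = \frac{10}{5} = 2.$$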
training set: $\{(x_i, y_i) \mid i = 1, \dots, p\}$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$; $y_i = +1$ for $x_i \in A$, $y_i = -1$ for $x_i \in B$

Correct separation requires

$$w \cdot x_i + b \ge 1 \ \text{ if } y_i = +1, \qquad w \cdot x_i + b \le -1 \ \text{ if } y_i = -1,$$

or, in one condition, $y_i\,(w \cdot x_i + b) \ge 1$; correct classification means $y_i = \text{sign}(w \cdot x_i + b)$.

The widest-margin separator solves

$$\max_{w, b} \frac{1}{\|w\|} \quad \text{s.t. } y_i\,(w \cdot x_i + b) \ge 1,$$

and since $\max \frac{1}{\|w\|} \equiv \min \|w\|$, equivalently

$$\min_{w, b} \frac{\|w\|^2}{2} \quad \text{s.t. } y_i\,(w \cdot x_i + b) \ge 1, \; i = 1, \dots, p.$$
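This optimization problem can be handed directly to a QP solver. A minimal sketch with the cvxpy modeling library (the toy data and the choice of cvxpy are mine, for illustration; they are not part of the talk):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two point clouds (illustrative, not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min ||w||^2 / 2  s.t.  y_i (w . x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w) / 2), constraints)
problem.solve()

print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```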
Perceptron
Frank Rosenblatt
[Figure: the perceptron — inputs $x_1, x_2, \dots, x_N$ with weights $w_1, w_2, \dots, w_N$, a bias $b$ on a constant input $+1$, output $y$.]

decision function: $\hat f(x) = \text{sign}(w \cdot x + b)$

Example: $x = (x_1, x_2, \dots, x_N)$ = features of the customer;
approve credit if $\sum_{i=1}^{N} w_i x_i \ge$ threshold; deny credit if $\sum_{i=1}^{N} w_i x_i <$ threshold.
input: $T = \{(x_i, y_i) \mid i = 1, \dots, p\}$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$

    w ← 0, b ← 0
    repeat
        for i = 1 to p
            if sign(w · x_i + b) ≠ y_i then
                w ← w + y_i x_i
                b ← b + y_i
            end if
        end for
    until sign(w · x_j + b) = y_j for all j = 1, …, p
    return w, b
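A direct Python transcription of this pseudocode (a sketch; `max_epochs` is my safeguard against non-separable inputs, not part of the original algorithm):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Rosenblatt's perceptron. X is (p, N); y is (p,) with labels in {-1, +1}."""
    p, N = X.shape
    w = np.zeros(N)
    b = 0.0
    for _ in range(max_epochs):                  # guard: loops forever on non-separable data otherwise
        errors = 0
        for i in range(p):
            if np.sign(w @ X[i] + b) != y[i]:    # misclassified (sign(0) = 0 also counts as an error)
                w += y[i] * X[i]                 # rotate w toward x_i (y_i = +1) or away (y_i = -1)
                b += y[i]
                errors += 1
        if errors == 0:                          # a full pass with no mistakes: done
            break
    return w, b
```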
[Figure: geometry of the update $w \leftarrow w + y_i x_i$ — for a misclassified $x_i$ with $y_i = +1$, $w$ rotates toward $x_i$; with $y_i = -1$, $w$ rotates away from $x_i$.]
Novikoff’s Theorem of Convergence

Assume there exists $w^*$ with $\|w^*\| = 1$ and $\gamma > 0$ such that $y_i\,(w^* \cdot x_i) \ge \gamma$ for all $i = 1, \dots, p$, and define

$$R \stackrel{\text{def}}{=} \max_{i = 1, \dots, p} \|x_i\|.$$

Then the perceptron algorithm makes at most $(R/\gamma)^2$ errors.

A.B.J. Novikoff, "On convergence proofs on perceptrons", Proc. of the Symposium on the Mathematical Theory of Automata, 12, pp. 615-622 (1962)
Proof (the bias $b$ is omitted; it can be absorbed into $w$ by extending each $x_i$ with a constant coordinate). Suppose the $k$-th error is made on example $(x_i, y_i)$, so $y_i\,(w^{(k)} \cdot x_i) \le 0$ and $w^{(k+1)} = w^{(k)} + y_i x_i$.

By induction on $k$:

$$w^* \cdot w^{(k+1)} = w^* \cdot w^{(k)} + y_i\,(w^* \cdot x_i) \ge w^* \cdot w^{(k)} + \gamma \ \Rightarrow\ w^* \cdot w^{(k+1)} \ge (k+1)\,\gamma$$

By induction on $k$ (the cross term is $\le 0$ because an error was made):

$$\|w^{(k+1)}\|^2 = \|w^{(k)}\|^2 + 2\,y_i\,(w^{(k)} \cdot x_i) + \|x_i\|^2 \le \|w^{(k)}\|^2 + R^2 \ \Rightarrow\ \|w^{(k+1)}\|^2 \le (k+1)\,R^2$$

Cauchy-Schwarz: $w^* \cdot w^{(k)} \le \|w^*\|\,\|w^{(k)}\| = \|w^{(k)}\|$.

Combining the three: $k\,\gamma \le w^* \cdot w^{(k)} \le \|w^{(k)}\| \le \sqrt{k}\,R$, hence $k \le (R/\gamma)^2$.
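The bound is easy to probe empirically. A sketch (the toy data and margin construction are mine), using the bias-free variant that matches the theorem's statement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: label by the sign of the first coordinate, pushed away from the boundary.
X = rng.normal(size=(200, 2))
X[:, 0] += np.where(X[:, 0] >= 0, 0.5, -0.5)    # enforce a positive margin
y = np.sign(X[:, 0])

# Count perceptron mistakes (no bias, matching the theorem).
w, mistakes = np.zeros(2), 0
changed = True
while changed:
    changed = False
    for i in range(len(X)):
        if np.sign(w @ X[i]) != y[i]:
            w += y[i] * X[i]
            mistakes += 1
            changed = True

R = np.linalg.norm(X, axis=1).max()
w_star = np.array([1.0, 0.0])                   # unit vector realizing the margin by construction
gamma = (y * (X @ w_star)).min()
print(f"mistakes = {mistakes}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")
```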
Demo
Mark I Perceptron hardware
N = 400, p = 512
• inputs captured by a 20 × 20 array of cadmium sulphide photocells
• a patch board allowed different configurations of input features to be tried
• each adaptive weight was implemented using a rotary variable resistor/potentiometer, driven by an electric motor

C. Bishop, "Pattern Recognition and Machine Learning", Springer (2006), p. 196
Linear separability as an LP problem

A necessary and sufficient condition for the linear separability of the pattern sets A and B is that $\theta(A, B) > 0$, where $\theta(A, B)$ is the solution of the linear programming problem:

$$\theta(A, B) = \min_{u, v, p} \left\{ \mathbf{1}^T p \;\middle|\; A^T u - B^T v + p \ge 0,\ -A^T u + B^T v + p \ge 0,\ \mathbf{1}^T u = 1,\ \mathbf{1}^T v = 1,\ u \ge 0,\ v \ge 0 \right\}$$

where u, v, and p are m-, k-, and n-dimensional column vectors.
A necessary and sufficient condition for the sets of patterns A and B to be linearly inseparable (with card(A) = m, card(B) = k, points in $\mathbb{R}^n$) is that the system

$$A^T u = B^T v, \quad \mathbf{1}^T u = 1, \quad \mathbf{1}^T v = 1, \quad u \ge 0, \quad v \ge 0$$

has a solution — that is, the convex hulls of A and B intersect.

The constraint matrix of the LP above has size $(m + k + n + 2 + n) \times (m + k + n)$.
Linear separability as an LP problem

In block-matrix form (rows: the two groups of $n$ inequalities bounding $|A^T u - B^T v|$ by $p$, the two normalization equalities, and the nonnegativity of $u$ and $v$):

$$\min_{u, v, p} \sum_{i=1}^{n} p_i \quad \text{s.t.} \quad
\begin{pmatrix}
A^T & -B^T & I_n \\
-A^T & B^T & I_n \\
\mathbf{1}^T & 0 & 0 \\
0 & \mathbf{1}^T & 0 \\
I_m & 0 & 0 \\
0 & I_k & 0
\end{pmatrix}
\begin{pmatrix} u \\ v \\ p \end{pmatrix}
\;\begin{matrix} \ge 0 \\ \ge 0 \\ = 1 \\ = 1 \\ \ge 0 \\ \ge 0 \end{matrix}$$
A and B are linearly separable iff $\theta(A, B) = \min_{u, v, p} \sum_{i=1}^{n} p_i > 0$.

A and B are linearly inseparable iff this system has a solution:

$$\begin{pmatrix}
A^T & -B^T \\
\mathbf{1}^T & 0 \\
0 & \mathbf{1}^T \\
I_m & 0 \\
0 & I_k
\end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
\;\begin{matrix} = 0 \\ = 1 \\ = 1 \\ \ge 0 \\ \ge 0 \end{matrix}$$
Separability can also be written directly in terms of the separating hyperplane:

$$\exists\, h \in \mathbb{R}^n,\ \beta \in \mathbb{R} \ \text{ s.t. } \ h^T a \ge \beta + 1 \ \ \forall a \in A, \qquad h^T b \le \beta - 1 \ \ \forall b \in B$$

or, in matrix form:

$$\begin{pmatrix} A & -\mathbf{1} \\ -B & \mathbf{1} \end{pmatrix}
\begin{pmatrix} h \\ \beta \end{pmatrix} \ge \mathbf{1}$$
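The $\theta(A, B)$ test runs on any off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog; the function name `theta`, the toy points, and the variable ordering (u, v, p) are my own encoding, not from the talk:

```python
import numpy as np
from scipy.optimize import linprog

def theta(A, B):
    """LP separability test: theta > 0 iff A and B are linearly separable.
    A is (m, n), B is (k, n); rows are pattern vectors. Variables are stacked as (u, v, p)."""
    m, n = A.shape
    k = B.shape[0]
    c = np.concatenate([np.zeros(m + k), np.ones(n)])   # minimize 1^T p
    I = np.eye(n)
    # p >= |A^T u - B^T v| componentwise, rewritten as two "<= 0" blocks:
    A_ub = np.block([[-A.T,  B.T, -I],
                     [ A.T, -B.T, -I]])
    b_ub = np.zeros(2 * n)
    # normalization: 1^T u = 1 and 1^T v = 1
    A_eq = np.zeros((2, m + k + n))
    A_eq[0, :m] = 1
    A_eq[1, m:m + k] = 1
    b_eq = np.ones(2)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)  # default bounds: all vars >= 0
    return res.fun

A = np.array([[2.0, 2.0], [3.0, 1.0]])
B = np.array([[-1.0, -1.0], [0.0, -2.0]])
print("theta(A, B) =", theta(A, B))   # > 0 here, so A and B are linearly separable
```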
Linear separability as an LP problem
Demo
Cover’s Theorem
dichotomy: $\{(x_i, y_i) \mid i = 1, \dots, N\}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$. Assume there is no subset of $d$ or fewer points that is linearly dependent (general position).

Number of linearly separable dichotomies of $N$ points in $d$ dimensions:

$$C(N, d) = 2 \sum_{i=0}^{d-1} \binom{N-1}{i}$$

and the fraction of the $2^N$ dichotomies that are linearly separable is $F(N, d) = C(N, d) / 2^N$.
A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space
Thomas Cover
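The counting formula is simple to evaluate; a short sketch (the function names C and F mirror the notation above):

```python
from math import comb

def C(N, d):
    """Number of linearly separable dichotomies of N points in general position in R^d."""
    return 2 * sum(comb(N - 1, i) for i in range(d))

def F(N, d):
    """Fraction of all 2^N dichotomies that are linearly separable."""
    return C(N, d) / 2 ** N

# Raising the dimension makes separability more likely, as the theorem suggests:
for d in (2, 5, 10, 20):
    print(f"d = {d:2d}: F(40, d) = {F(40, d):.6f}")
```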
Bernhard Boser, Isabelle Guyon, Vladimir Vapnik
Corinna Cortes, Vladimir Vapnik
Vladimir Vapnik, Alexey Chervonenkis
V. Vapnik, A. Chervonenkis, "A note on one class of perceptrons", Automation and remote control, 25, 1 (1964)
Lagrangian optimization problem:

$$\min_x \Phi(x) \quad \text{s.t. } g_i(x) \ge 0,\ i = 1, \dots, p, \qquad x \in \mathbb{R}^N,\ \Phi \text{ convex},\ g_i \text{ linear}$$

$$L(x, \alpha) = \Phi(x) - \sum_{i=1}^{p} \alpha_i\, g_i(x)$$

Primal formulation: $\min_x \max_{\alpha \ge 0} L(x, \alpha)$

Dual formulation: $\max_{\alpha \ge 0} \min_x L(x, \alpha)$

There is a unique saddle point: $\exists!\ (x^*, \alpha^*) \in \mathbb{R}^N \times \mathbb{R}^p$, at which

$$\min_x \max_\alpha L(x, \alpha) = L(x^*, \alpha^*) = \max_\alpha \min_x L(x, \alpha)$$

Karush-Kuhn-Tucker (KKT) conditions:

$$\nabla_x L(x^*, \alpha^*) = 0$$
$$\alpha_i^*\, g_i(x^*) = 0, \quad i = 1, \dots, p \quad \text{(KKT complementarity condition)}$$
$$\alpha_i^* \ge 0, \qquad g_i(x^*) \ge 0, \quad i = 1, \dots, p$$
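A one-dimensional worked example (my own, for illustration):

$$\min_x x^2 \ \text{ s.t. } x - 1 \ge 0: \qquad L(x, \alpha) = x^2 - \alpha (x - 1), \quad \nabla_x L = 2x - \alpha = 0, \quad \alpha (x - 1) = 0.$$

If $\alpha^* = 0$ then $x^* = 0$, violating $x \ge 1$; so the constraint is active: $x^* = 1$, $\alpha^* = 2 \ge 0$, and all KKT conditions hold.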
Generalized Portrait Method
$$\min_{w, b} \frac{\|w\|^2}{2} \quad \text{s.t. } y_i\,(w \cdot x_i + b) \ge 1, \; i = 1, \dots, p$$

training set: $\{(x_i, y_i) \mid i = 1, \dots, p\}$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$; $y_i = +1$ for $x_i \in A$, $y_i = -1$ for $x_i \in B$

Classifier for new instances: $f(x) = \text{sign}(w \cdot x + b)$
Primal formulation for linear SVM: a convex QP in the $N$ variables $w_i$, $i = 1, \dots, N$ (and $b$).

Apply the method of Lagrange multipliers:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{p} \alpha_i \left[ y_i\,(w \cdot x_i + b) - 1 \right]$$

$$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{p} \alpha_i y_i x_i \qquad\qquad \frac{\partial L(w, b, \alpha)}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{p} \alpha_i y_i = 0$$

Substituting both back into $L$ (the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$):

$$L(\alpha) = \sum_{i=1}^{p} \alpha_i - \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j)$$

Dual formulation for linear SVM:

$$\max_\alpha L(\alpha) \quad \text{s.t. } \alpha_i \ge 0,\ i = 1, \dots, p, \quad \sum_{i=1}^{p} \alpha_i y_i = 0$$

a convex QP in the $p$ variables $\alpha_i$, $i = 1, \dots, p$.

The saddle point is a minimum over $w$ ($w^*$ solves the primal) and a maximum over $\alpha$ ($\alpha^*$ solves the dual).
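The dual QP is just as easy to hand to a solver, and the support vectors drop out of the solution. A sketch with cvxpy (the toy data, the 1e-6 threshold, and the small ridge that keeps the quadratic form numerically positive semidefinite are my choices):

```python
import cvxpy as cp
import numpy as np

# Toy separable data (same illustrative setup as before).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
p = len(y)

# G_ij = y_i y_j (x_i . x_j); tiny ridge for numerical positive semidefiniteness
G = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(p)

alpha = cp.Variable(p)
problem = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G)),
    [alpha >= 0, y @ alpha == 0],
)
problem.solve()

a = alpha.value
w = (a * y) @ X                    # w* = sum_i alpha_i* y_i x_i
sv = a > 1e-6                      # support vectors: non-zero multipliers
b = np.mean(y[sv] - X[sv] @ w)     # from y_s (w . x_s + b) = 1, using 1/y_s = y_s
print("support vector indices:", np.where(sv)[0])
print("w =", w, " b =", b)
```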
Support Vectors

KKT complementarity condition: $\alpha_i^* \left[ y_i\,(w^* \cdot x_i + b^*) - 1 \right] = 0$, $i = 1, \dots, p$

Primal feasibility for linear SVM: $y_i\,(w \cdot x_i + b) \ge 1$, $i = 1, \dots, p$; dual feasibility: $\alpha_i \ge 0$, $i = 1, \dots, p$.

By linear separability of the training set, $S = \{ s \mid \alpha_s^* > 0 \} \subseteq \{1, 2, \dots, p\}$ is non-empty, and

$$y_s\,(w^* \cdot x_s + b^*) = 1, \quad s \in S$$
[Figure: the margin figure again — + and − points separated by $w^{*T} x + b^* = 0$ with supporting hyperplanes $w^{*T} x + b^* = \pm 1$ and margin $2/\|w\|$ — with the support vectors lying on the supporting hyperplanes.]

Support vectors = points with non-zero Lagrangian multipliers; $\text{card}(S) \ll p$.
Primal solution → supporting hyperplanes
Dual solution → support vectors
Kernel functions

$$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \qquad \phi(x_1, x_2) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$$

$$\phi(x) \cdot \phi(y) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2) \cdot (y_1^2, y_2^2, \sqrt{2}\,y_1 y_2) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2\,x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2 = (x \cdot y)^2$$

$$K(x, y) = \phi(x) \cdot \phi(y) \quad \text{kernel function}$$

[Figure: four points $v_1, \dots, v_4$ in the $(x_1, x_2)$ plane, with $v_1, v_3$ labeled + and $v_2, v_4$ labeled −: linearly inseparable in the input space, linearly separable in the image space $(x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$.]
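The identity $\phi(x) \cdot \phi(y) = (x \cdot y)^2$ can be checked numerically in a few lines (the sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 from the slide."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # inner product computed in the image space
print((x @ y) ** 2)      # kernel evaluated in the input space: identical value
```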
From Cover’s theorem: a complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space.
Examples:

- polynomial: $K(x, y) = (x \cdot y)^q$, or more generally $K(x, y) = p(x \cdot y)$ for a polynomial $p$
- Gaussian (RBF): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
- exponential: $K(x, y) = \exp(-\gamma \|x - y\|)$
- sigmoid: $K(x, y) = \tanh(\kappa\,(x \cdot y) + c)$

a kernel expresses a measure of “similarity” between vectors
Kernel trick
input space:

$$\hat f(x) = \text{sign}(w^* \cdot x + b^*), \qquad w^* = \sum_{i=1}^{p} \alpha_i^* y_i\, x_i$$

image space:

$$\hat f(x) = \text{sign}(w^* \cdot \phi(x) + b^*), \qquad w^* = \sum_{i=1}^{p} \alpha_i^* y_i\, \phi(x_i)$$

$$\hat f(x) = \text{sign}\!\left( \sum_{i=1}^{p} \alpha_i^* y_i\, \phi(x_i) \cdot \phi(x) + b^* \right) = \text{sign}\!\left( \sum_{i=1}^{p} \alpha_i^* y_i\, K(x_i, x) + b^* \right)$$

No need to know $\phi$ explicitly!

Since $\alpha_i^* = 0$ for $i \notin S$, instead of obtaining a function with complexity proportional to the image-space dimension, we obtain an expression with complexity proportional to the number of support vectors.

The dual problem, too, needs only kernel values:

$$\max_\alpha \left( \sum_{i=1}^{p} \alpha_i - \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \right) \quad \text{s.t. } \alpha_i \ge 0,\ i = 1, \dots, p, \quad \sum_{i=1}^{p} \alpha_i y_i = 0$$
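For completeness, the same machinery through scikit-learn's SVC, on the XOR-like configuration from the $v_1 \dots v_4$ figure (sklearn is not part of the talk; a large C approximates the hard-margin classifier):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: linearly inseparable in the input space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Polynomial kernel (gamma * <x, y> + coef0)^degree = (x . y)^2 with these settings.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=1e6)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("predictions:", clf.predict(X))   # separable in the image space
```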
Long history of development

"You look at stuff like this in a book and you think, well, Vladimir Vapnik just figured this out one Saturday afternoon when the weather was too bad to go outside. That's not how it happened." - Patrick Winston

"The invention of SVMs happened when Bernhard decided to implement Vladimir's algorithm in the three months we had left before we moved to Berkeley. After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the 'potential function' algorithm. Vladimir initially resisted the idea because the inventors of the 'potential functions' algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960s in Russia! But Bernhard tried it anyways, and the SVMs were born!" - Isabelle Guyon
Bibliography

C. Bishop, "Pattern Recognition and Machine Learning", Springer (2006)
B. Boser, I. Guyon, V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", Proc. of the 5th Annual Workshop on Computational Learning Theory (COLT '92), pp. 144-152 (1992)
T. Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", IEEE Transactions on Electronic Computers, EC-14, No. 3, pp. 326-334 (1965)
I. Guyon - https://www.facebook.com/WomenInMachineLearning/posts/1314864275195877
L. Hamel, "Knowledge Discovery with Support Vector Machines", Wiley (2009)
E. Kim, "Everything You Wanted to Know about the Kernel Trick (But Were Too Afraid to Ask)" - http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
O. Mangasarian, "Linear and Nonlinear Separation of Patterns by Linear Programming", Operations Research, 13, pp. 444-452 (1965)
H. Murrell, "Data Mining with R" - http://www.cs.ukzn.ac.za/~hughm/dm/
F. Rosenblatt, "The Perceptron. A Perceiving and Recognizing Automaton (Project PARA)", Cornell Aeronautical Laboratory, Report No. 85-460-1 (1957)
A. Smola, "Introduction to Machine Learning, Class 10-701/15-781", Carnegie Mellon University - http://alex.smola.org/teaching/10-701-15/
A. Statnikov, D. Hardin, I. Guyon, C. Aliferis, "A Gentle Introduction to Support Vector Machines in Biomedicine", AMIA 2009 Annual Symposium, San Francisco (2009)
V. Vapnik, "The Nature of Statistical Learning Theory", 2nd ed., Springer (2000)
V. Vapnik, "Estimation of Dependences Based on Empirical Data", 2nd ed., Springer (2006)
R. Vogler, "Testing for Linear Separability with Linear Programming in R" - http://www.joyofdata.de/blog/testing-linear-separability-linear-programming-r-glpk/
P. Winston, "6.034 Artificial Intelligence. Lecture 16: Learning: Support Vector Machines" - https://www.youtube.com/watch?v=_PwhiWxHK8o