"svm - the original papers" presentation @ papers we love bucharest
TRANSCRIPT
[Figure: trend of SVM-related papers on arXiv, data from Nov 1997 to Nov 2013 — http://bookworm.culturomics.org/arxiv/]
Introduction
The separating hyperplane is $\{x \mid w \cdot x + b = 0\}$, and $w$ is normal to it: for any two points $x$, $x_0$ on the hyperplane,

$$w \cdot (x - x_0) = (w \cdot x + b) - (w \cdot x_0 + b) = 0.$$

Distance between two parallel hyperplanes $(w, b_1)$ and $(w, b_2)$: starting from a point $x$ on the first and moving along $w$ until reaching the second, $d((w, b_1), (w, b_2)) = \|t\,w\| = |t|\,\|w\|$, where

$$w \cdot (x + t\,w) + b_2 = 0 \ \Rightarrow\ w \cdot x + t\,\|w\|^2 + b_2 = 0$$
$$w \cdot x + b_1 = 0 \ \Rightarrow\ -b_1 + t\,\|w\|^2 + b_2 = 0 \ \Rightarrow\ t = \frac{b_1 - b_2}{\|w\|^2}$$

so

$$d((w, b_1), (w, b_2)) = |t|\,\|w\| = \frac{|b_1 - b_2|}{\|w\|}.$$

In particular, the margin between the hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$ (offsets $b - 1$ and $b + 1$) is

$$d((w, b - 1), (w, b + 1)) = \frac{|(b - 1) - (b + 1)|}{\|w\|} = \frac{2}{\|w\|}.$$

[Figure: + and − points in the $(x_1, x_2)$ plane, separated by $w^T x + b = 0$ with supporting hyperplanes $w^T x + b = \pm 1$ at distance $2/\|w\|$ from each other.]
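As a quick numeric check of the distance formula (the numbers are chosen here for illustration; they are not from the slides):

$$w = (3, 4),\ b_1 = 1,\ b_2 = 11: \quad d((w, b_1), (w, b_2)) = \frac{|1 - 11|}{\|(3, 4)\|} = \frac{10}{5} = 2.$$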
training set: $\{(x_i, y_i) \mid i = 1, \dots, p\}$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$; $y_i = +1$ for $x_i \in A$, $y_i = -1$ for $x_i \in B$

Correct separation requires

$$w \cdot x_i + b \ge 1 \ \text{ if } y_i = +1, \qquad w \cdot x_i + b \le -1 \ \text{ if } y_i = -1,$$

or, in one condition, $y_i\,(w \cdot x_i + b) \ge 1$; correct classification means $y_i = \text{sign}(w \cdot x_i + b)$.

The widest-margin separator solves

$$\max_{w, b} \frac{1}{\|w\|} \quad \text{s.t. } y_i\,(w \cdot x_i + b) \ge 1,$$

and since $\max \frac{1}{\|w\|} \equiv \min \|w\|$, equivalently

$$\min_{w, b} \frac{\|w\|^2}{2} \quad \text{s.t. } y_i\,(w \cdot x_i + b) \ge 1, \; i = 1, \dots, p.$$
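This optimization problem can be handed directly to a QP solver. A minimal sketch with the cvxpy modeling library (the toy data and the choice of cvxpy are mine, for illustration; they are not part of the talk):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two point clouds (illustrative, not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min ||w||^2 / 2  s.t.  y_i (w . x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w) / 2), constraints)
problem.solve()

print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```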
Perceptron
Frank Rosenblatt
[Figure: the perceptron — inputs $x_1, x_2, \dots, x_N$ with weights $w_1, w_2, \dots, w_N$, a bias $b$ on a constant input $+1$, output $y$.]

decision function: $\hat f(x) = \text{sign}(w \cdot x + b)$

Example: $x = (x_1, x_2, \dots, x_N)$ = features of the customer;
approve credit if $\sum_{i=1}^{N} w_i x_i \ge$ threshold; deny credit if $\sum_{i=1}^{N} w_i x_i <$ threshold.
input: $T = \{(x_i, y_i) \mid i = 1, \dots, p\}$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$

    w ← 0, b ← 0
    repeat
        for i = 1 to p
            if sign(w · x_i + b) ≠ y_i then
                w ← w + y_i x_i
                b ← b + y_i
            end if
        end for
    until sign(w · x_j + b) = y_j for all j = 1, …, p
    return w, b
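A direct Python transcription of this pseudocode (a sketch; `max_epochs` is my safeguard against non-separable inputs, not part of the original algorithm):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Rosenblatt's perceptron. X is (p, N); y is (p,) with labels in {-1, +1}."""
    p, N = X.shape
    w = np.zeros(N)
    b = 0.0
    for _ in range(max_epochs):                  # guard: loops forever on non-separable data otherwise
        errors = 0
        for i in range(p):
            if np.sign(w @ X[i] + b) != y[i]:    # misclassified (sign(0) = 0 also counts as an error)
                w += y[i] * X[i]                 # rotate w toward x_i (y_i = +1) or away (y_i = -1)
                b += y[i]
                errors += 1
        if errors == 0:                          # a full pass with no mistakes: done
            break
    return w, b
```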
[Figure: geometry of the update $w \leftarrow w + y_i x_i$ — for a misclassified $x_i$ with $y_i = +1$, $w$ rotates toward $x_i$; with $y_i = -1$, $w$ rotates away from $x_i$.]
Novikoff’s Theorem of Convergence

Assume there exists $w^*$ with $\|w^*\| = 1$ and $\gamma > 0$ such that $y_i\,(w^* \cdot x_i) \ge \gamma$ for all $i = 1, \dots, p$, and define

$$R \stackrel{\text{def}}{=} \max_{i = 1, \dots, p} \|x_i\|.$$

Then the perceptron algorithm makes at most $(R/\gamma)^2$ errors.

A.B.J. Novikoff, "On convergence proofs on perceptrons", Proc. of the Symposium on the Mathematical Theory of Automata, 12, pp. 615-622 (1962)
Proof (the bias $b$ is omitted; it can be absorbed into $w$ by extending each $x_i$ with a constant coordinate). Suppose the $k$-th error is made on example $(x_i, y_i)$, so $y_i\,(w^{(k)} \cdot x_i) \le 0$ and $w^{(k+1)} = w^{(k)} + y_i x_i$.

By induction on $k$:

$$w^* \cdot w^{(k+1)} = w^* \cdot w^{(k)} + y_i\,(w^* \cdot x_i) \ge w^* \cdot w^{(k)} + \gamma \ \Rightarrow\ w^* \cdot w^{(k+1)} \ge (k+1)\,\gamma$$

By induction on $k$ (the cross term is $\le 0$ because an error was made):

$$\|w^{(k+1)}\|^2 = \|w^{(k)}\|^2 + 2\,y_i\,(w^{(k)} \cdot x_i) + \|x_i\|^2 \le \|w^{(k)}\|^2 + R^2 \ \Rightarrow\ \|w^{(k+1)}\|^2 \le (k+1)\,R^2$$

Cauchy-Schwarz: $w^* \cdot w^{(k)} \le \|w^*\|\,\|w^{(k)}\| = \|w^{(k)}\|$.

Combining the three: $k\,\gamma \le w^* \cdot w^{(k)} \le \|w^{(k)}\| \le \sqrt{k}\,R$, hence $k \le (R/\gamma)^2$.
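The bound is easy to probe empirically. A sketch (the toy data and margin construction are mine), using the bias-free variant that matches the theorem's statement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: label by the sign of the first coordinate, pushed away from the boundary.
X = rng.normal(size=(200, 2))
X[:, 0] += np.where(X[:, 0] >= 0, 0.5, -0.5)    # enforce a positive margin
y = np.sign(X[:, 0])

# Count perceptron mistakes (no bias, matching the theorem).
w, mistakes = np.zeros(2), 0
changed = True
while changed:
    changed = False
    for i in range(len(X)):
        if np.sign(w @ X[i]) != y[i]:
            w += y[i] * X[i]
            mistakes += 1
            changed = True

R = np.linalg.norm(X, axis=1).max()
w_star = np.array([1.0, 0.0])                   # unit vector realizing the margin by construction
gamma = (y * (X @ w_star)).min()
print(f"mistakes = {mistakes}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")
```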
Demo
Mark I Perceptron hardware
N = 400, p = 512
• inputs captured by a 20 × 20 array of cadmium sulphide photocells
• a patch board allowed different configurations of input features to be tried
• each adaptive weight was implemented using a rotary variable resistor/potentiometer, driven by an electric motor

C. Bishop, "Pattern Recognition and Machine Learning", Springer (2006), p. 196
Linear separability as an LP problem

A necessary and sufficient condition for the linear separability of the pattern sets A and B is that $\theta(A, B) > 0$, where $\theta(A, B)$ is the solution of the linear programming problem:

$$\theta(A, B) = \min_{u, v, p} \left\{ \mathbf{1}^T p \;\middle|\; A^T u - B^T v + p \ge 0,\ -A^T u + B^T v + p \ge 0,\ \mathbf{1}^T u = 1,\ \mathbf{1}^T v = 1,\ u \ge 0,\ v \ge 0 \right\}$$

where u, v, and p are m-, k-, and n-dimensional column vectors.
A necessary and sufficient condition for the sets of patterns A and B to be linearly inseparable (with card(A) = m, card(B) = k, points in $\mathbb{R}^n$) is that the system

$$A^T u = B^T v, \quad \mathbf{1}^T u = 1, \quad \mathbf{1}^T v = 1, \quad u \ge 0, \quad v \ge 0$$

has a solution — that is, the convex hulls of A and B intersect.

The constraint matrix of the LP above has size $(m + k + n + 2 + n) \times (m + k + n)$.
Linear separability as an LP problem

In block-matrix form (rows: the two groups of $n$ inequalities bounding $|A^T u - B^T v|$ by $p$, the two normalization equalities, and the nonnegativity of $u$ and $v$):

$$\min_{u, v, p} \sum_{i=1}^{n} p_i \quad \text{s.t.} \quad
\begin{pmatrix}
A^T & -B^T & I_n \\
-A^T & B^T & I_n \\
\mathbf{1}^T & 0 & 0 \\
0 & \mathbf{1}^T & 0 \\
I_m & 0 & 0 \\
0 & I_k & 0
\end{pmatrix}
\begin{pmatrix} u \\ v \\ p \end{pmatrix}
\;\begin{matrix} \ge 0 \\ \ge 0 \\ = 1 \\ = 1 \\ \ge 0 \\ \ge 0 \end{matrix}$$
A and B are linearly separable iff $\theta(A, B) = \min_{u, v, p} \sum_{i=1}^{n} p_i > 0$.

A and B are linearly inseparable iff this system has a solution:

$$\begin{pmatrix}
A^T & -B^T \\
\mathbf{1}^T & 0 \\
0 & \mathbf{1}^T \\
I_m & 0 \\
0 & I_k
\end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
\;\begin{matrix} = 0 \\ = 1 \\ = 1 \\ \ge 0 \\ \ge 0 \end{matrix}$$
Separability can also be written directly in terms of the separating hyperplane:

$$\exists\, h \in \mathbb{R}^n,\ \beta \in \mathbb{R} \ \text{ s.t. } \ h^T a \ge \beta + 1 \ \ \forall a \in A, \qquad h^T b \le \beta - 1 \ \ \forall b \in B$$

or, in matrix form:

$$\begin{pmatrix} A & -\mathbf{1} \\ -B & \mathbf{1} \end{pmatrix}
\begin{pmatrix} h \\ \beta \end{pmatrix} \ge \mathbf{1}$$
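The $\theta(A, B)$ test runs on any off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog; the function name `theta`, the toy points, and the variable ordering (u, v, p) are my own encoding, not from the talk:

```python
import numpy as np
from scipy.optimize import linprog

def theta(A, B):
    """LP separability test: theta > 0 iff A and B are linearly separable.
    A is (m, n), B is (k, n); rows are pattern vectors. Variables are stacked as (u, v, p)."""
    m, n = A.shape
    k = B.shape[0]
    c = np.concatenate([np.zeros(m + k), np.ones(n)])   # minimize 1^T p
    I = np.eye(n)
    # p >= |A^T u - B^T v| componentwise, rewritten as two "<= 0" blocks:
    A_ub = np.block([[-A.T,  B.T, -I],
                     [ A.T, -B.T, -I]])
    b_ub = np.zeros(2 * n)
    # normalization: 1^T u = 1 and 1^T v = 1
    A_eq = np.zeros((2, m + k + n))
    A_eq[0, :m] = 1
    A_eq[1, m:m + k] = 1
    b_eq = np.ones(2)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)  # default bounds: all vars >= 0
    return res.fun

A = np.array([[2.0, 2.0], [3.0, 1.0]])
B = np.array([[-1.0, -1.0], [0.0, -2.0]])
print("theta(A, B) =", theta(A, B))   # > 0 here, so A and B are linearly separable
```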
Linear separability as an LP problem
Demo
Cover’s Theorem
dichotomy: $\{(x_i, y_i) \mid i = 1, \dots, N\}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$. Assume there is no subset of $d$ or fewer points that is linearly dependent (general position).

Number of linearly separable dichotomies of $N$ points in $d$ dimensions:

$$C(N, d) = 2 \sum_{i=0}^{d-1} \binom{N-1}{i}$$

and the fraction of the $2^N$ dichotomies that are linearly separable is $F(N, d) = C(N, d) / 2^N$.
A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space
Thomas Cover
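The counting formula is simple to evaluate; a short sketch (the function names C and F mirror the notation above):

```python
from math import comb

def C(N, d):
    """Number of linearly separable dichotomies of N points in general position in R^d."""
    return 2 * sum(comb(N - 1, i) for i in range(d))

def F(N, d):
    """Fraction of all 2^N dichotomies that are linearly separable."""
    return C(N, d) / 2 ** N

# Raising the dimension makes separability more likely, as the theorem suggests:
for d in (2, 5, 10, 20):
    print(f"d = {d:2d}: F(40, d) = {F(40, d):.6f}")
```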
Bernhard Boser, Isabelle Guyon, Vladimir Vapnik
Corinna Cortes, Vladimir Vapnik
Vladimir Vapnik, Alexey Chervonenkis
V. Vapnik, A. Chervonenkis, "A note on one class of perceptrons", Automation and remote control, 25, 1 (1964)
Lagrangian optimization problem:

$$\min_x \Phi(x) \quad \text{s.t. } g_i(x) \ge 0,\ i = 1, \dots, p, \qquad x \in \mathbb{R}^N,\ \Phi \text{ convex},\ g_i \text{ linear}$$

$$L(x, \alpha) = \Phi(x) - \sum_{i=1}^{p} \alpha_i\, g_i(x)$$

Primal formulation: $\min_x \max_{\alpha \ge 0} L(x, \alpha)$

Dual formulation: $\max_{\alpha \ge 0} \min_x L(x, \alpha)$

There is a unique saddle point: $\exists!\ (x^*, \alpha^*) \in \mathbb{R}^N \times \mathbb{R}^p$, at which

$$\min_x \max_\alpha L(x, \alpha) = L(x^*, \alpha^*) = \max_\alpha \min_x L(x, \alpha)$$

Karush-Kuhn-Tucker (KKT) conditions:

$$\nabla_x L(x^*, \alpha^*) = 0$$
$$\alpha_i^*\, g_i(x^*) = 0, \quad i = 1, \dots, p \quad \text{(KKT complementarity condition)}$$
$$\alpha_i^* \ge 0, \qquad g_i(x^*) \ge 0, \quad i = 1, \dots, p$$
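A one-dimensional worked example (my own, for illustration):

$$\min_x x^2 \ \text{ s.t. } x - 1 \ge 0: \qquad L(x, \alpha) = x^2 - \alpha (x - 1), \quad \nabla_x L = 2x - \alpha = 0, \quad \alpha (x - 1) = 0.$$

If $\alpha^* = 0$ then $x^* = 0$, violating $x \ge 1$; so the constraint is active: $x^* = 1$, $\alpha^* = 2 \ge 0$, and all KKT conditions hold.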
Generalized Portrait Method
$$\min_{w, b} \frac{\|w\|^2}{2} \quad \text{s.t. } y_i\,(w \cdot x_i + b) \ge 1, \; i = 1, \dots, p$$

training set: $\{(x_i, y_i) \mid i = 1, \dots, p\}$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$; $y_i = +1$ for $x_i \in A$, $y_i = -1$ for $x_i \in B$

Classifier for new instances: $f(x) = \text{sign}(w \cdot x + b)$
Primal formulation for linear SVM: a convex QP in the $N$ variables $w_i$, $i = 1, \dots, N$ (and $b$).

Apply the method of Lagrange multipliers:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{p} \alpha_i \left[ y_i\,(w \cdot x_i + b) - 1 \right]$$

$$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{p} \alpha_i y_i x_i \qquad\qquad \frac{\partial L(w, b, \alpha)}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{p} \alpha_i y_i = 0$$

Substituting both back into $L$ (the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$):

$$L(\alpha) = \sum_{i=1}^{p} \alpha_i - \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j)$$

Dual formulation for linear SVM:

$$\max_\alpha L(\alpha) \quad \text{s.t. } \alpha_i \ge 0,\ i = 1, \dots, p, \quad \sum_{i=1}^{p} \alpha_i y_i = 0$$

a convex QP in the $p$ variables $\alpha_i$, $i = 1, \dots, p$.

The saddle point is a minimum over $w$ ($w^*$ solves the primal) and a maximum over $\alpha$ ($\alpha^*$ solves the dual).
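The dual QP is just as easy to hand to a solver, and the support vectors drop out of the solution. A sketch with cvxpy (the toy data, the 1e-6 threshold, and the small ridge that keeps the quadratic form numerically positive semidefinite are my choices):

```python
import cvxpy as cp
import numpy as np

# Toy separable data (same illustrative setup as before).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
p = len(y)

# G_ij = y_i y_j (x_i . x_j); tiny ridge for numerical positive semidefiniteness
G = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(p)

alpha = cp.Variable(p)
problem = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G)),
    [alpha >= 0, y @ alpha == 0],
)
problem.solve()

a = alpha.value
w = (a * y) @ X                    # w* = sum_i alpha_i* y_i x_i
sv = a > 1e-6                      # support vectors: non-zero multipliers
b = np.mean(y[sv] - X[sv] @ w)     # from y_s (w . x_s + b) = 1, using 1/y_s = y_s
print("support vector indices:", np.where(sv)[0])
print("w =", w, " b =", b)
```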
Support Vectors

KKT complementarity condition: $\alpha_i^* \left[ y_i\,(w^* \cdot x_i + b^*) - 1 \right] = 0$, $i = 1, \dots, p$

Primal feasibility for linear SVM: $y_i\,(w \cdot x_i + b) \ge 1$, $i = 1, \dots, p$; dual feasibility: $\alpha_i \ge 0$, $i = 1, \dots, p$.

By linear separability of the training set, $S = \{ s \mid \alpha_s^* > 0 \} \subseteq \{1, 2, \dots, p\}$ is non-empty, and

$$y_s\,(w^* \cdot x_s + b^*) = 1, \quad s \in S$$
[Figure: the margin figure again — + and − points separated by $w^{*T} x + b^* = 0$ with supporting hyperplanes $w^{*T} x + b^* = \pm 1$ and margin $2/\|w\|$ — with the support vectors lying on the supporting hyperplanes.]

Support vectors = points with non-zero Lagrangian multipliers; $\text{card}(S) \ll p$.
Primal solution → supporting hyperplanes
Dual solution → support vectors
Kernel functions

$$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \qquad \phi(x_1, x_2) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$$

$$\phi(x) \cdot \phi(y) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2) \cdot (y_1^2, y_2^2, \sqrt{2}\,y_1 y_2) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2\,x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2 = (x \cdot y)^2$$

$$K(x, y) = \phi(x) \cdot \phi(y) \quad \text{kernel function}$$

[Figure: four points $v_1, \dots, v_4$ in the $(x_1, x_2)$ plane, with $v_1, v_3$ labeled + and $v_2, v_4$ labeled −: linearly inseparable in the input space, linearly separable in the image space $(x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$.]
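The identity $\phi(x) \cdot \phi(y) = (x \cdot y)^2$ can be checked numerically in a few lines (the sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 from the slide."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # inner product computed in the image space
print((x @ y) ** 2)      # kernel evaluated in the input space: identical value
```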
From Cover’s theorem: a complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space.
Examples:

- polynomial: $K(x, y) = (x \cdot y)^q$, or more generally $K(x, y) = p(x \cdot y)$ for a polynomial $p$
- Gaussian (RBF): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
- exponential: $K(x, y) = \exp(-\gamma \|x - y\|)$
- sigmoid: $K(x, y) = \tanh(\kappa\,(x \cdot y) + c)$

a kernel expresses a measure of “similarity” between vectors
Kernel trick
input space:

$$\hat f(x) = \text{sign}(w^* \cdot x + b^*), \qquad w^* = \sum_{i=1}^{p} \alpha_i^* y_i\, x_i$$

image space:

$$\hat f(x) = \text{sign}(w^* \cdot \phi(x) + b^*), \qquad w^* = \sum_{i=1}^{p} \alpha_i^* y_i\, \phi(x_i)$$

$$\hat f(x) = \text{sign}\!\left( \sum_{i=1}^{p} \alpha_i^* y_i\, \phi(x_i) \cdot \phi(x) + b^* \right) = \text{sign}\!\left( \sum_{i=1}^{p} \alpha_i^* y_i\, K(x_i, x) + b^* \right)$$

No need to know $\phi$ explicitly!

Since $\alpha_i^* = 0$ for $i \notin S$, instead of obtaining a function with complexity proportional to the image-space dimension, we obtain an expression with complexity proportional to the number of support vectors.

The dual problem, too, needs only kernel values:

$$\max_\alpha \left( \sum_{i=1}^{p} \alpha_i - \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \right) \quad \text{s.t. } \alpha_i \ge 0,\ i = 1, \dots, p, \quad \sum_{i=1}^{p} \alpha_i y_i = 0$$
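For completeness, the same machinery through scikit-learn's SVC, on the XOR-like configuration from the $v_1 \dots v_4$ figure (sklearn is not part of the talk; a large C approximates the hard-margin classifier):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: linearly inseparable in the input space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Polynomial kernel (gamma * <x, y> + coef0)^degree = (x . y)^2 with these settings.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=1e6)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("predictions:", clf.predict(X))   # separable in the image space
```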
Long history of development

"You look at stuff like this in a book and you think, well, Vladimir Vapnik just figured this out one Saturday afternoon when the weather was too bad to go outside. That's not how it happened." - Patrick Winston

"The invention of SVMs happened when Bernhard decided to implement Vladimir's algorithm in the three months we had left before we moved to Berkeley. After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the 'potential function' algorithm. Vladimir initially resisted the idea because the inventors of the 'potential functions' algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960s in Russia! But Bernhard tried it anyways, and the SVMs were born!" - Isabelle Guyon
Bibliography

C. Bishop, "Pattern Recognition and Machine Learning", Springer (2006)
B. Boser, I. Guyon, V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", Proc. of the 5th Annual Workshop on Computational Learning Theory (COLT '92), pp. 144-152 (1992)
T. Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", IEEE Transactions on Electronic Computers, EC-14, No. 3, pp. 326-334 (1965)
I. Guyon - https://www.facebook.com/WomenInMachineLearning/posts/1314864275195877
L. Hamel, "Knowledge Discovery with Support Vector Machines", Wiley (2009)
E. Kim, "Everything You Wanted to Know about the Kernel Trick (But Were Too Afraid to Ask)" - http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
O. Mangasarian, "Linear and Nonlinear Separation of Patterns by Linear Programming", Operations Research, 13, pp. 444-452 (1965)
H. Murrell, "Data Mining with R" - http://www.cs.ukzn.ac.za/~hughm/dm/
F. Rosenblatt, "The Perceptron. A Perceiving and Recognizing Automaton (Project PARA)", Cornell Aeronautical Laboratory, Report No. 85-460-1 (1957)
A. Smola, "Introduction to Machine Learning, Class 10-701/15-781", Carnegie Mellon University - http://alex.smola.org/teaching/10-701-15/
A. Statnikov, D. Hardin, I. Guyon, C. Aliferis, "A Gentle Introduction to Support Vector Machines in Biomedicine", AMIA 2009 Annual Symposium, San Francisco (2009)
V. Vapnik, "The Nature of Statistical Learning Theory", 2nd ed., Springer (2000)
V. Vapnik, "Estimation of Dependences Based on Empirical Data", 2nd ed., Springer (2006)
R. Vogler, "Testing for Linear Separability with Linear Programming in R" - http://www.joyofdata.de/blog/testing-linear-separability-linear-programming-r-glpk/
P. Winston, "6.034 Artificial Intelligence. Lecture 16: Learning: Support Vector Machines" - https://www.youtube.com/watch?v=_PwhiWxHK8o