An Introduction to Support Vector Machine
![Page 1: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/1.jpg)
An Introduction to Support Vector Machine
![Page 2: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/2.jpg)
Support Vector Machine (SVM)
- A classifier derived from statistical learning theory by Vapnik et al. in 1992
- SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task
- Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
- Also used for regression
![Page 3: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/3.jpg)
Outline
- Linear Discriminant Function
- Large Margin Linear Classifier
- Nonlinear SVM: The Kernel Trick
![Page 4: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/4.jpg)
Linear Discriminant Function
- g(x) is a linear function:

  $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$

- A hyperplane in the feature space: $\mathbf{w}^T\mathbf{x} + b = 0$, which separates the region where $\mathbf{w}^T\mathbf{x} + b > 0$ from the region where $\mathbf{w}^T\mathbf{x} + b < 0$ in the $(x_1, x_2)$ plane
- (Unit-length) normal vector of the hyperplane:

  $\mathbf{n} = \dfrac{\mathbf{w}}{\|\mathbf{w}\|}$
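As a concrete sketch in Python (with made-up values for w and b, purely for illustration), classifying a point just means checking the sign of g(x):

```python
import numpy as np

# Hypothetical parameters of a linear discriminant in a 2-D feature space.
w = np.array([2.0, -1.0])
b = 0.5

def g(x):
    # g(x) = w^T x + b
    return w @ x + b

x = np.array([1.0, 3.0])
print(g(x))                    # signed value: > 0 on one side of the hyperplane, < 0 on the other
print(np.sign(g(x)))           # predicted class (+1 or -1)
print(w / np.linalg.norm(w))   # unit-length normal vector n = w / ||w||
```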
![Page 5: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/5.jpg)
Linear Discriminant Function
- How would you classify these points using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!

[Figure: labeled training points in the (x1, x2) plane; one marker denotes +1, the other denotes -1]
![Page 6: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/6.jpg)
Linear Discriminant Function
- How would you classify these points using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!

[Figure: labeled training points in the (x1, x2) plane; one marker denotes +1, the other denotes -1]
![Page 7: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/7.jpg)
Linear Discriminant Function
- How would you classify these points using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!

[Figure: labeled training points in the (x1, x2) plane; one marker denotes +1, the other denotes -1]
![Page 8: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/8.jpg)
Linear Discriminant Function
- How would you classify these points using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!
- Which one is the best?

[Figure: labeled training points in the (x1, x2) plane; one marker denotes +1, the other denotes -1]
![Page 9: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/9.jpg)
Large Margin Linear Classifier
- The linear discriminant function (classifier) with the maximum margin is the best.
- Margin is defined as the width that the boundary could be increased by before hitting a data point: the "safe zone" around the boundary.
- Why is it the best?
  - Robust to outliers, and thus strong generalization ability
  - Good according to PAC (Probably Approximately Correct) theory

[Figure: the margin around the separating boundary in the (x1, x2) plane; one class denotes +1, the other -1]
![Page 10: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/10.jpg)
Maximum Margin Classification
- Distance from point $\mathbf{x}_i$ to the hyperplane is:

  $r = \dfrac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|}$

- Examples closest to the hyperplane are support vectors.
- Margin $M$ of the classifier is the distance between support vectors on both sides.
- Only support vectors matter; other training points are ignorable.

[Figure: support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ on the margin boundaries, margin width M, normal vector n, in the (x1, x2) plane]
![Page 11: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/11.jpg)
Large Margin Linear Classifier
- Given a set of data points $\{(\mathbf{x}_i, y_i)\}$, $i = 1, 2, \ldots, n$, where

  $\mathbf{w}^T\mathbf{x}_i + b \ge M/2$ if $y_i = +1$
  $\mathbf{w}^T\mathbf{x}_i + b \le -M/2$ if $y_i = -1$

- With a scale transformation on both $\mathbf{w}$ and $b$, the above is equivalent to

  For $y_i = +1$, $\mathbf{w}^T\mathbf{x}_i + b \ge 1$
  For $y_i = -1$, $\mathbf{w}^T\mathbf{x}_i + b \le -1$

[Figure: labeled training points (+1 / -1) in the (x1, x2) plane with margin M]
![Page 12: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/12.jpg)
Large Margin Linear Classifier
- We know that

  $\mathbf{w}^T\mathbf{x}^+ + b = 1$
  $\mathbf{w}^T\mathbf{x}^- + b = -1$

  Thus $\mathbf{w}^T(\mathbf{x}^+ - \mathbf{x}^-) = 2$

- The margin width is:

  $M = (\mathbf{x}^+ - \mathbf{x}^-)\cdot\mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-)\cdot\dfrac{\mathbf{w}}{\|\mathbf{w}\|} = \dfrac{2}{\|\mathbf{w}\|}$

[Figure: support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ on the margin boundaries in the (x1, x2) plane]
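A quick numeric sanity check of M = 2/||w|| (the weight vector here is arbitrary, chosen only to make the arithmetic obvious):

```python
import numpy as np

w = np.array([3.0, 4.0])       # arbitrary example weight vector, ||w|| = 5
M = 2.0 / np.linalg.norm(w)    # margin width M = 2 / ||w||
print(M)                       # 0.4
```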
![Page 13: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/13.jpg)
Large Margin Linear Classifier
- Formulation:

  maximize $\dfrac{2}{\|\mathbf{w}\|}$

  such that

  For $y_i = +1$, $\mathbf{w}^T\mathbf{x}_i + b \ge 1$
  For $y_i = -1$, $\mathbf{w}^T\mathbf{x}_i + b \le -1$
![Page 14: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/14.jpg)
Large Margin Linear Classifier
- Formulation:

  minimize $\dfrac{1}{2}\|\mathbf{w}\|^2$

  such that

  For $y_i = +1$, $\mathbf{w}^T\mathbf{x}_i + b \ge 1$
  For $y_i = -1$, $\mathbf{w}^T\mathbf{x}_i + b \le -1$
![Page 15: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/15.jpg)
Large Margin Linear Classifier
- Formulation:

  minimize $\dfrac{1}{2}\|\mathbf{w}\|^2$

  such that

  $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$
![Page 16: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/16.jpg)
Solving the Optimization Problem
minimize $\dfrac{1}{2}\|\mathbf{w}\|^2$   s.t.   $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$

This is quadratic programming with linear constraints. Its Lagrangian function is:

minimize $L_p(\mathbf{w}, b, \alpha_i) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}\alpha_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right)$   s.t.   $\alpha_i \ge 0$
![Page 17: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/17.jpg)
Solving the Optimization Problem
minimize $L_p(\mathbf{w}, b, \alpha_i) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}\alpha_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right)$   s.t.   $\alpha_i \ge 0$

Setting the partial derivatives to zero:

$\dfrac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i$

$\dfrac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0$
![Page 18: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/18.jpg)
Solving the Optimization Problem
minimize $L_p(\mathbf{w}, b, \alpha_i) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}\alpha_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right)$   s.t.   $\alpha_i \ge 0$

Lagrangian dual problem:

maximize $\sum_{i=1}^{n}\alpha_i - \dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$

s.t.   $\alpha_i \ge 0$, and $\sum_{i=1}^{n}\alpha_i y_i = 0$
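This dual is itself a quadratic program, so any generic QP solver can handle it. Below is a minimal sketch using cvxopt on a tiny made-up dataset; this is only one way to set it up, not the solver assumed by the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

# Tiny linearly separable toy dataset (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual: maximize sum_i alpha_i - 1/2 * alpha^T Q alpha, with Q_ij = y_i y_j x_i^T x_j.
# cvxopt minimizes 1/2 x^T P x + q^T x  subject to  G x <= h, A x = b.
Q = np.outer(y, y) * (X @ X.T)
P = matrix(Q + 1e-8 * np.eye(n))   # small ridge for numerical stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))             # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))       # equality constraint: sum_i alpha_i y_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol['x']).ravel()
print(alpha)                       # non-zero entries correspond to support vectors
```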
![Page 19: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/19.jpg)
Solving the Optimization Problem
- The solution has the form:

  $\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i = \sum_{i \in SV}\alpha_i y_i \mathbf{x}_i$

- From the KKT (Karush-Kuhn-Tucker) condition, we know:

  $\alpha_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right) = 0$

- Thus, only support vectors have $\alpha_i \neq 0$.
- Get $b$ from $y_k(\mathbf{w}^T\mathbf{x}_k + b) - 1 = 0$, where $\mathbf{x}_k$ is any support vector; thus $b = y_k - \sum_i \alpha_i y_i \mathbf{x}_i^T\mathbf{x}_k$ for any $\alpha_k > 0$.

[Figure: support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ on the margin boundaries in the (x1, x2) plane]
![Page 20: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/20.jpg)
Solving the Optimization Problem
- The linear discriminant function is:

  $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = \sum_{i \in SV}\alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b$

- Notice it relies on a dot product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$.
- Also keep in mind that solving the optimization problem involved computing the dot products $\mathbf{x}_i^T\mathbf{x}_j$ between all pairs of training points.
- That is, no need to compute $\mathbf{w}$ explicitly for classification.
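A small follow-on sketch (NumPy; `alpha` would come from a dual solver such as the one above, and the names are illustrative) showing how b and g(x) are obtained from dot products with the support vectors only, without ever forming w:

```python
import numpy as np

def bias_and_score(X, y, alpha, x_new, tol=1e-6):
    sv = alpha > tol                       # support vectors: points with alpha_i > 0
    k = np.argmax(sv)                      # index of any one support vector x_k
    # b = y_k - sum_{i in SV} alpha_i y_i x_i^T x_k
    b = y[k] - np.sum(alpha[sv] * y[sv] * (X[sv] @ X[k]))
    # g(x) = sum_{i in SV} alpha_i y_i x_i^T x + b
    g = np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b
    return b, g
```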
![Page 21: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/21.jpg)
Large Margin Linear Classifier
- What if the data is not linearly separable? (noisy data, outliers, etc.)
- Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy data points.

[Figure: misclassified points with slack values $\xi_1$ and $\xi_2$ in the (x1, x2) plane; one class denotes +1, the other -1]
![Page 22: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/22.jpg)
Large Margin Linear Classifier
- Formulation:

  minimize $\dfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i$

  such that

  $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$
  $\xi_i \ge 0$

- Parameter C can be viewed as a way to control over-fitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
- For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
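This trade-off is what you tune in practice. A minimal sketch with scikit-learn's SVC (assuming scikit-learn is installed; the data is synthetic and only meant to show the effect of C):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Small C: wider margin, more training errors tolerated.
    # Large C: narrower margin, tries harder to classify every training point.
    print(C, clf.n_support_, clf.score(X, y))
```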
![Page 23: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/23.jpg)
Solving the Optimization Problem
- Formulation (Lagrangian dual problem):

  maximize $\sum_{i=1}^{n}\alpha_i - \dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$

  such that

  $0 \le \alpha_i \le C$
  $\sum_{i=1}^{n}\alpha_i y_i = 0$
![Page 24: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/24.jpg)
Solving the Optimization Problem
- Again, $\mathbf{x}_i$ with non-zero $\alpha_i$ will be support vectors.
- Solution to the dual problem is:

  $\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i = \sum_{i \in SV}\alpha_i y_i \mathbf{x}_i$

  $b = y_k(1 - \xi_k) - \sum_i \alpha_i y_i \mathbf{x}_i^T\mathbf{x}_k$   for any $k$ s.t. $\alpha_k > 0$

- Again, we don't need to compute $\mathbf{w}$ explicitly for classification:

  $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = \sum_{i \in SV}\alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b$
![Page 25: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/25.jpg)
Non-linear SVMs
- Datasets that are linearly separable with noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space?

[Figure: three 1-D examples along the x axis; in the last one the data are mapped to (x, x^2), where they become separable]
![Page 26: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/26.jpg)
Non-linear SVMs: Feature Space
- General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

  $\Phi: \mathbf{x} \rightarrow \phi(\mathbf{x})$
![Page 27: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/27.jpg)
Nonlinear SVMs: The Kernel Trick
- With this mapping, our discriminant function is now:

  $g(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b = \sum_{i \in SV}\alpha_i y_i \phi(\mathbf{x}_i)^T\phi(\mathbf{x}) + b$

- No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

  $K(\mathbf{x}_i, \mathbf{x}_j) \equiv \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$
![Page 28: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/28.jpg)
Nonlinear SVMs: The Kernel Trick
- An example: 2-dimensional vectors $\mathbf{x} = [x_1\; x_2]$; let $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T\mathbf{x}_j)^2$.

  Need to show that $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$:

  $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T\mathbf{x}_j)^2$
  $= 1 + x_{i1}^2 x_{j1}^2 + 2\,x_{i1}x_{j1}x_{i2}x_{j2} + x_{i2}^2 x_{j2}^2 + 2\,x_{i1}x_{j1} + 2\,x_{i2}x_{j2}$
  $= [1\;\; x_{i1}^2\;\; \sqrt{2}\,x_{i1}x_{i2}\;\; x_{i2}^2\;\; \sqrt{2}\,x_{i1}\;\; \sqrt{2}\,x_{i2}]^T\,[1\;\; x_{j1}^2\;\; \sqrt{2}\,x_{j1}x_{j2}\;\; x_{j2}^2\;\; \sqrt{2}\,x_{j1}\;\; \sqrt{2}\,x_{j2}]$
  $= \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$, where $\phi(\mathbf{x}) = [1\;\; x_1^2\;\; \sqrt{2}\,x_1x_2\;\; x_2^2\;\; \sqrt{2}\,x_1\;\; \sqrt{2}\,x_2]$

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
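The identity is easy to verify numerically; a short NumPy sketch with two arbitrary 2-D vectors:

```python
import numpy as np

def phi(x):
    # Explicit feature map matching the expansion of (1 + x_i^T x_j)^2 above.
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2,
                     x2**2, np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.7, -1.2])
xj = np.array([2.0, 0.5])

K = (1 + xi @ xj) ** 2                    # kernel computed in the original 2-D space
print(np.isclose(K, phi(xi) @ phi(xj)))   # True: same value via the 6-D feature map
```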
![Page 29: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/29.jpg)
Nonlinear SVMs: The Kernel Trick
- Examples of commonly-used kernel functions:
  - Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$
  - Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T\mathbf{x}_j)^p$
  - Gaussian (Radial-Basis Function, RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
  - Sigmoid kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta_0\,\mathbf{x}_i^T\mathbf{x}_j + \beta_1)$
- Mercer's theorem: every symmetric positive semi-definite function is a kernel.
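For reference, these kernels are one-liners; a sketch in NumPy (the parameter names p, sigma, beta0, beta1 mirror the formulas above):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```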
![Page 30: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/30.jpg)
Nonlinear SVM: Optimization
- Formulation (Lagrangian dual problem):

  maximize $\sum_{i=1}^{n}\alpha_i - \dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

  such that

  $0 \le \alpha_i \le C$
  $\sum_{i=1}^{n}\alpha_i y_i = 0$

- The solution of the discriminant function is:

  $g(\mathbf{x}) = \sum_{i \in SV}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$

- The optimization technique is the same.
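The decision function keeps the same shape as in the linear case, with the dot product replaced by a kernel evaluation. A minimal sketch (any kernel, e.g. the rbf_kernel sketched above, can be passed in; the support-vector arrays are assumed to come from a trained model):

```python
def g(x, sv_x, sv_y, sv_alpha, b, kernel):
    # g(x) = sum_{i in SV} alpha_i * y_i * K(x_i, x) + b
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(sv_alpha, sv_y, sv_x)) + b
```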
![Page 31: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/31.jpg)
Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages are available)
4. Construct the discriminant function from the support vectors
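In practice the four steps collapse into a library call; a hedged sketch with scikit-learn's SVC (which wraps LibSVM) on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Steps 1-3: choose a kernel, choose C, and let the library solve the QP.
clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

# Step 4: the discriminant function is built from the support vectors.
print(clf.support_vectors_.shape)     # support vectors found by the solver
print(clf.decision_function(X[:3]))   # g(x) for a few points
print(clf.predict(X[:3]))             # sign of g(x) as class labels
```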
![Page 32: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/32.jpg)
Some Issues
- Choice of kernel
  - Gaussian or polynomial kernel is the default
  - if ineffective, more elaborate kernels are needed
  - domain experts can give assistance in formulating appropriate similarity measures
- Choice of kernel parameters
  - e.g. σ in the Gaussian kernel
  - σ is the distance between closest points with different classifications
  - in the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters (see the cross-validation sketch below)
- Optimization criterion: hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
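As suggested above for setting kernel parameters, cross-validation is the usual recipe; a minimal sketch with scikit-learn's GridSearchCV (synthetic data, illustrative parameter grid):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Cross-validate over C and the RBF width (scikit-learn's gamma plays the role of 1 / (2 * sigma^2)).
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```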
![Page 33: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/33.jpg)
Summary: Support Vector Machine
1. Large Margin Classifier
   - Better generalization ability & less over-fitting
2. The Kernel Trick
   - Map data points to a higher-dimensional space in order to make them linearly separable.
   - Since only the dot product is used, we do not need to represent the mapping explicitly.
![Page 34: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/34.jpg)
Demo of LibSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
![Page 35: An Introduction to Support Vector Machine](https://reader034.vdocuments.net/reader034/viewer/2022051315/627a317ed29e105ea8207a03/html5/thumbnails/35.jpg)
References on SVM and Stock Prediction
http://www.svms.org/finance/HuangNakamoriWang2005.pdf
http://cs229.stanford.edu/proj2012/ShenJiangZhang-StockMarketForecastingusingMachineLearningAlgorithms.pdf
http://research.ijcaonline.org/volume41/number3/pxc3877555.pdf
and other references online …