An Introduction to Support Vector Machines
Seong-Bae Park
Kyungpook National University
http://sejong.knu.ac.kr/~sbpark
2
Supervised Learning
[Diagram: the Environment presents a problem x; the Teacher supplies the desired solution d; the Learner (Student) produces y = f(x) and receives feedback comparing y with d.]
3
Quality of Learning Machine
Loss L(y, f(x, w)) ≥ 0
Discrepancy between the true output and the output of the learning machine
Risk functional
Expected value of the loss:

R(w) = \int L(y, f(x, w)) \, p(x, y) \, dx \, dy

Learning
The process of estimating the function f(x, w) which minimizes the risk functional using only the training data
4
Common Learning Tasks (1)
Classification
Loss function:

L(y, f(x, w)) = \begin{cases} 0 & \text{if } y = f(x, w) \\ 1 & \text{if } y \neq f(x, w) \end{cases}

Risk:

R(w) = \int L(y, f(x, w)) \, p(x, y) \, dx \, dy
5
Common Learning Tasks (2)
Regression
Common Loss Function: Squared Error (L2)

L(y, f(x, w)) = (y - f(x, w))^2

Risk:

R(w) = \int (y - f(x, w))^2 \, p(x, y) \, dx \, dy
6
ML Hypothesis
Maximum Likelihood hypothesis

h_{ML} = \arg\max_{h \in H} P(h|D)
       = \arg\max_{h \in H} \frac{P(D|h) P(h)}{P(D)}
       = \arg\max_{h \in H} P(D|h) P(h)
       = \arg\max_{h \in H} P(D|h)

(P(D) is constant over h; the last step assumes a uniform prior P(h).)
7
Maximum Likelihood Revisited
If x_1, …, x_n are iid samples from a pdf f(x|w), the likelihood is defined by

P(x_1, …, x_n | w) = \prod_{i=1}^{n} f(x_i | w).

Maximum Likelihood Estimator
Choose w* that maximizes the likelihood.
Relation to Loss
Taking the negative log-likelihood as the loss gives the risk

R_{ML}(w) = -\sum_{i=1}^{n} \ln f(x_i | w).
8
Empirical Risk Minimization
Do we know p(x, y)? Generally NO! What we have is only the training data.

R(w) = \int L(y, f(x, w)) \, p(x, y) \, dx \, dy

Empirical Risk:

R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, w))

ERM is more general than ML. In density estimation, ERM with the loss L(f(x, w)) = -\ln f(x|w) is equivalent to ML.
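The idea above can be sketched in a few lines: since p(x, y) is unknown, the risk R(w) is approximated by the average loss over the training sample. The toy data and threshold classifier below are hypothetical, not from the slides.

```python
def zero_one_loss(y_true, y_pred):
    """0/1 classification loss from the 'Common Learning Tasks' slide."""
    return 0 if y_true == y_pred else 1

def empirical_risk(data, f, loss):
    """R_emp(w) = (1/n) * sum_i L(y_i, f(x_i))."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

# Toy 1-D training set: label is (roughly) the sign of x.
train = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.7, 1), (-0.1, 1)]

# A simple threshold classifier, a stand-in for f(x, w).
f = lambda x: 1 if x >= 0 else -1

print(empirical_risk(train, f, zero_one_loss))  # misclassifies (-0.1, 1) -> 0.2
```

Minimizing this sample average over w is exactly the ERM principle of the slide.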
9
Risk and Empirical Risk
When the loss is \frac{1}{2} |y - f(x, w)|:

R(w) = \int \frac{1}{2} |y - f(x, w)| \, dP(x, y)

R_{emp}(w) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, w)|

Relation between them (Vapnik, 1995)
With probability 1 - η,

R(w) \le R_{emp}(w) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}

h: VC dimension (≥ 0)
Regardless of P(x, y)
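The bound's behavior is easy to probe numerically. The sketch below evaluates the square-root capacity term for illustrative values of h, l, and η (not taken from the slides).

```python
# Numeric sketch of the capacity term in Vapnik's bound:
#   R(w) <= R_emp(w) + sqrt( (h*(log(2l/h) + 1) - log(eta/4)) / l ).
import math

def vc_confidence(h, l, eta):
    """Capacity term added to the empirical risk (Vapnik, 1995)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# More data (larger l) shrinks the term; higher capacity h grows it.
print(vc_confidence(h=10, l=1000, eta=0.05) < vc_confidence(h=10, l=100, eta=0.05))   # True
print(vc_confidence(h=10, l=1000, eta=0.05) < vc_confidence(h=100, l=1000, eta=0.05)) # True
```

This is the trade-off the next slides exploit: minimize R(w) by keeping both R_emp(w) and h small.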
10
Risk and Empirical Risk (cont.)
The square-root term in the bound on the previous slide is called the VC confidence.
11
VC dimension
A set of instances S is shattered by {f(w)} iff for every dichotomy of S there exists some f(w) consistent with that dichotomy.
For l points there are 2^l dichotomies. The Vapnik-Chervonenkis dimension, VC, is the maximum number of training points that can be shattered by {f(w)}.
12
Minimizing R(w) by minimizing h
13
Perceptron Revisited: Linear Separators
w^T x + b = 0
f(x) = sign(w^T x + b)
[Figure: a linear separator w^T x + b = 0, with w^T x + b > 0 on one side and w^T x + b < 0 on the other.]
14
Learning Perceptron (1)
Perceptron Learning Algorithm
Given a training set S = {(x_1, y_1), …, (x_l, y_l)} and learning rate η ∈ R+:

w_0 ← 0; b_0 ← 0; k ← 0
R ← max_{1≤i≤l} ||x_i||
while (there are errors)
  for i = 1 to l
    if y_i(⟨w_k ⋅ x_i⟩ + b_k) ≤ 0 then
      w_{k+1} ← w_k + η y_i x_i
      b_{k+1} ← b_k + η y_i R^2
      k ← k + 1
    end if
  end for
end while
return (w_k, b_k)
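The algorithm above can be sketched in pure Python for 2-D inputs; the linearly separable toy training set is hypothetical.

```python
# Runnable sketch of the perceptron learning algorithm from the slide,
# including the R^2-scaled bias update.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(S, eta=1.0, max_epochs=100):
    """Train on S = [(x, y), ...] with y in {-1, +1}; returns (w, b)."""
    dim = len(S[0][0])
    w, b = [0.0] * dim, 0.0
    R = max(dot(x, x) for x, _ in S) ** 0.5   # R = max_i ||x_i||
    for _ in range(max_epochs):
        errors = 0
        for x, y in S:
            if y * (dot(w, x) + b) <= 0:       # misclassified (or on boundary)
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y * R * R
                errors += 1
        if errors == 0:                        # converged: every point correct
            break
    return w, b

# Linearly separable toy data: the positive class has x1 + x2 > 1.
S = [([0.0, 0.0], -1), ([1.0, 0.0], -1), ([0.0, 1.0], -1),
     ([2.0, 1.0], 1), ([1.0, 2.0], 1), ([2.0, 2.0], 1)]
w, b = perceptron(S)
print(all(y * (dot(w, x) + b) > 0 for x, y in S))  # True
```

For separable data the perceptron convergence theorem guarantees the loop terminates, which is why a modest `max_epochs` suffices here.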
15
Learning Perceptron (2)
16
Linear Separators
Which of the linear separators is optimal?
17
Margin
Distance from example x to the separator is r = \frac{w^T x + b}{\|w\|}.
Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the width of separation between the classes.
18
Maximum Margin Classification
Maximizing the margin is good according to intuition and PAC theory.
It implies that only support vectors are important; other training examples are ignorable.
19
Linear SVM Mathematically (1)
Assuming all data are at least distance 1 from the hyperplane, the following two constraints hold for a training set {(x_i, y_i)}:

w^T x_i + b ≥ 1 if y_i = 1
w^T x_i + b ≤ -1 if y_i = -1

For support vectors, the inequality becomes an equality. Since each example's distance from the hyperplane is r = \frac{w^T x + b}{\|w\|}, the margin is ρ = \frac{2}{\|w\|}.
20
Linear SVMs Mathematically (2)
Quadratic optimization problem:

Find w and b such that ρ = \frac{2}{\|w\|} is maximized and for all {(x_i, y_i)}:
w^T x_i + b ≥ 1 if y_i = 1; w^T x_i + b ≤ -1 if y_i = -1

A better formulation:

Find w and b such that Φ(w) = ½ w^T w is minimized and for all {(x_i, y_i)}:
y_i (w^T x_i + b) ≥ 1
21
Solving the Optimization Problem
We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every constraint of the primal problem:

Primal: Find w and b such that Φ(w) = ½ w^T w is minimized and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

Dual: Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i
22
The Optimization Problem Solution
The solution has the form:

w = Σ α_i y_i x_i
b = y_k - w^T x_k for any x_k such that α_k ≠ 0

Each non-zero α_i indicates that the corresponding x_i is a support vector. The classifying function then has the form:

f(x) = Σ α_i y_i x_i^T x + b

Notice that it relies on an inner product between the test point x and the support vectors x_i. Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all training points.
23
Support Vectors in Dual Form
[Figure: a separating hyperplane with margin ρ; almost all multipliers are zero (α_1 = α_2 = α_4 = α_5 = α_7 = … = 0), and only the support vectors on the margin have nonzero multipliers: α_3 = 0.8, α_6 = 0.4, α_13 = 1.4.]
24
Soft Margin Classification (1)
What if the training set is not linearly separable? Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
25
Soft Margin Classification (2)
Optimization Situation
Minimize

½ \|w\|^2 + C \sum_{i=1}^{l} ξ_i    (C > 0 is a user-defined parameter)

subject to

w ⋅ x_i + b ≥ +1 - ξ_i for y_i = +1
w ⋅ x_i + b ≤ -1 + ξ_i for y_i = -1
ξ_i ≥ 0

Dual problem
Maximize

\sum_i α_i - ½ \sum_{i,j} α_i α_j y_i y_j \, x_i ⋅ x_j

subject to

0 ≤ α_i ≤ C,  \sum_i α_i y_i = 0
26
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x-axis that is not linearly separable becomes separable after mapping x → (x, x²).]
27
Non-linear SVMs: Feature spaces
General idea
The original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
28
The “Kernel Trick”
The linear classifier relies on inner products between vectors: K(x_i, x_j) = x_i^T x_j.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j).
A kernel function is a function that corresponds to an inner product in some feature space.
Example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.
We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

K(x_i, x_j) = (1 + x_i^T x_j)^2
= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
= [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}^2, √2 x_{j1} x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2]
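The identity above is easy to check numerically: the kernel value computed in the 2-D input space must equal the inner product in the 6-D feature space. The test vectors below are arbitrary.

```python
# Numeric check of the kernel-trick identity (1 + x.y)^2 = phi(x).phi(y),
# with phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2].
import math

def poly_kernel(x, y):
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    s2 = math.sqrt(2)
    return [1, x[0] ** 2, s2 * x[0] * x[1], x[1] ** 2, s2 * x[0], s2 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [0.5, -1.0], [2.0, 3.0]
print(abs(poly_kernel(x, y) - dot(phi(x), phi(y))) < 1e-12)  # True
```

The point of the trick: the left-hand side costs O(d), while the explicit mapping would cost O(d²) here and far more for higher-degree kernels.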
29
What Functions are Kernels?
For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
Mercer's theorem: every positive semi-definite symmetric function is a kernel. Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

    | K(x_1, x_1) K(x_1, x_2) K(x_1, x_3) … K(x_1, x_N) |
K = | K(x_2, x_1) K(x_2, x_2) K(x_2, x_3) … K(x_2, x_N) |
    | …           …           …           … …           |
    | K(x_N, x_1) K(x_N, x_2) K(x_N, x_3) … K(x_N, x_N) |
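A minimal sketch of the Mercer condition: build the Gram matrix of the Gaussian kernel on a few points and spot-check positive semi-definiteness via z^T K z ≥ 0 for random vectors z. The sample points and σ are arbitrary illustrative choices; this is a sampling check, not a full eigenvalue test.

```python
import math
import random

def rbf(x, y, sigma=1.0):
    """Gaussian kernel on 1-D inputs: exp(-(x - y)^2 / (2 sigma^2))."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

points = [-1.5, -0.2, 0.0, 0.7, 2.3]
K = [[rbf(a, b) for b in points] for a in points]

random.seed(0)
ok = True
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in points]
    quad = sum(z[i] * K[i][j] * z[j]
               for i in range(len(z)) for j in range(len(z)))
    ok = ok and quad >= -1e-9   # allow tiny floating-point slack
print(ok)  # True: the RBF Gram matrix passes the PSD spot-check
```

A function failing such a check on any point set cannot be a Mercer kernel.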
30
Examples of Kernel Functions
Linear: K(x_i, x_j) = x_i^T x_j
Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
Gaussian (radial-basis function network): K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2σ^2}\right)
Two-layer perceptron: K(x_i, x_j) = \tanh(β_0 x_i^T x_j + β_1)
31
Non-linear SVMs Mathematically
Dual problem formulation:

Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j K(x_i, x_j) is maximized and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i

The solution is:

f(x) = Σ α_i y_i K(x_i, x) + b

The optimization techniques for finding the α_i's remain the same!
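The dual-form decision function is straightforward to evaluate once the α_i are known. The sketch below uses a hypothetical two-point toy problem whose dual solution can be derived by hand: for x_1 = -1 (y = -1) and x_2 = +1 (y = +1) with a linear kernel, α_1 = α_2 = 0.5 and b = 0 satisfy the constraints, giving w = 1 and f(x) = x.

```python
def linear_kernel(u, v):
    return u * v

def decision(x, svs, alphas, ys, b, K):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * K(xi, x) for a, y, xi in zip(alphas, ys, svs)) + b

svs, ys = [-1.0, 1.0], [-1, 1]
alphas, b = [0.5, 0.5], 0.0        # hand-derived dual solution for this toy set

print(decision(2.0, svs, alphas, ys, b, linear_kernel))   # 2.0
print(decision(-0.3, svs, alphas, ys, b, linear_kernel))  # -0.3
```

Swapping `linear_kernel` for any Mercer kernel changes the decision surface without changing this evaluation code, which is the point of the slide.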
32
SVM Structure
33
VC dimension of SVM
Minimal embedding space
Any embedding space with minimal dimension for a given kernel.
Let K be a kernel which corresponds to a minimal embedding space H. Then the VC dimension of the corresponding SVM is dim(H) + 1.
The VC dimension of an SVM can be ∞.
Striking conundrum: high VC dimension, but good performance!
34
Generalization Error by Margin
Risk bound by margin ρ
With probability 1 - η,

R(w, b) \le \frac{c}{l} \left( \frac{R^2}{ρ^2} \log^2 l + \log(1/η) \right)

A large margin makes the SVM stronger!
35
SVMlight (1)
Author: T. Joachims
Download: http://svmlight.joachims.org
Two executable files:
svm_learn: svm_learn training_data model_file
svm_classify: svm_classify test_data model_file
[Diagram: training data → svm_learn → generated model; the model and test data → svm_classify → classified result.]
36
SVMlight (2)
Written in C
Applicable to classification, regression, and ranking tasks
Can handle thousands of support vectors
Can handle hundreds of thousands of training examples
Supports standard kernel functions and user-defined kernels
Uses a sparse vector representation
37
Why is handling many SVs important?
Learning an SVM:

Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j Q_ij is maximized and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i

Q: n × n matrix (Q_ij = y_i y_j K(x_i, x_j)). For many real-world applications, Q is too large for standard computers.
SMO decomposition: the overall QP is decomposed into small QP subproblems. Joachims presented an optimization method for SMO decomposition.
38
svm_learn options
-z {c, r, p}   Selection of task: classification (c), regression (r), preference ranking (p) (default is c)
-c float       C parameter for soft-margin SVM (default: E[x^T x]^-1)
-t int         Type of kernel function:
               0: linear
               1: polynomial (s x⋅y + c)^d
               2: RBF
               3: sigmoid tanh(s x⋅y + c)
               4: user-defined kernel
-d int         Parameter d in polynomial kernel
-g float       Parameter gamma in RBF kernel
-s float       Parameter s in sigmoid/polynomial kernel
-r float       Parameter c in sigmoid/polynomial kernel
-u string      Parameter of user-defined kernel
39
Format of data
Each example is represented as a line:

<line>    .=. <target> <feature>:<value> … <feature>:<value> # <info>
<target>  .=. +1 | -1 | 0 | <float>
<feature> .=. <integer>
<value>   .=. <float>
<info>    .=. <string>

Feature/value pairs must be ordered by increasing feature number. Features with value zero can be skipped.
Example:

-1 1:0.43 3:0.12 9284:0.2 # comment
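Emitting this sparse format is a one-liner worth getting right: features sorted by increasing index, zero-valued features skipped. A small sketch (the feature dictionary is hypothetical):

```python
def to_svmlight_line(target, features, info=None):
    """features: {feature_index: value}; returns one SVMlight data line."""
    pairs = " ".join(f"{k}:{v:g}" for k, v in sorted(features.items()) if v != 0)
    line = f"{target:+d} {pairs}" if isinstance(target, int) else f"{target} {pairs}"
    return f"{line} # {info}" if info else line

print(to_svmlight_line(-1, {3: 0.12, 1: 0.43, 9284: 0.2, 7: 0.0}, "comment"))
# -1 1:0.43 3:0.12 9284:0.2 # comment
```

The output reproduces the example line above; note feature 7 is dropped because its value is zero.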
40
Text Chunking
Corpus: http://sejong.knu.ac.kr/~sbpark/Chunk
[Sample from the corpus: a POS-tagged sentence with chunk labels, e.g. maj B-ADVP, mmd B-NP, ncn I-NP, jxt I-NP, ncn B-NP, jcm I-NP, ncps I-NP, jca I-NP, mag B-ADVP, paa B-VP, ef I-VP, nbn I-VP, paa I-VP, ef I-VP, sf O.]

Information   Value
Vocabulary    16,838
Total Words   321,328
Chunk Types   9
POS Tags      52
Sentences     12,092
Phrases       112,658
41
Context
42
Data Format
1 1:1 16315:1 32630:1 50221:1 66411:1 82496:1 97890:1 114205:1 114258:1 114311:1 114401:1 114447:1 114492:1 114553:1 114576:1 114586:1 114596:1
-1 1:1 16315:1 33906:1 50096:1 66181:1 81575:1 98759:1 114205:1 114258:1 114348:1 114394:1 114439:1 114500:1 114535:1 114576:1 114586:1 114599:1
-1 1:1 17591:1 33781:1 49866:1 65260:1 82444:1 97890:1 114205:1 114295:1 114341:1 114386:1 114447:1 114482:1 114553:1 114576:1 114589:1 114603:1
1 1276:1 17466:1 33551:1 48945:1 66129:1 81575:1 97894:1 114242:1 114288:1 114333:1 114394:1 114429:1 114500:1 114556:1 114579:1 114593:1 114603:1
-1 1276:1 17466:1 33551:1 49814:1 65260:1 81579:1 97890:1 114242:1 114288:1 114333:1 114376:1 114447:1 114503:1 114552:1 114583:1 114593:1 114599:1
-1 1151:1 17236:1 33499:1 48945:1 65264:1 81575:1 98803:1 114235:1 114280:1 114323:1 114394:1 114450:1 114499:1 114533:1 114583:1 114589:1 114603:1
…
BNP.data
43
Running SVMlight
svm_learn BNP.data BNP.model
SVM-light Version V3.50
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
114605 # highest feature index
290465 # number of training documents
13947 # number of support vectors plus 1
0.94731663 # threshold b
-0.05882352941165028270553705169732 456:1 16683:1 33555:1 48945:1 65260:1 81981:1 98703:1 114229:1 114309:1 114324:1 114394:1 114447:1 114480:1 114564:1 114579:1 114593:1 114603:1
-0.05882352941165028270553705169732 1:1 17591:1 33555:1 49634:1 65472:1 82444:1 98054:1 114205:1 114295:1 114324:1 114401:1 114447:1 114482:1 114550:1 114576:1 114589:1 114603:1
…
44
Performance

           Decision Tree   SVM           MBL
Accuracy   97.95±0.24%     98.15±0.20%   97.79±0.29%
F-score    91.36±0.85      92.54±0.72    91.38±1.01
45
Another Example Task
Korean Clause Boundary Detection
[Table: the Korean sentence 기지에서 보이는 위버반도에서 가장 높ㄴ 봉우리를 서울봉이 라 부르ㄴ다 . shown word by word in four columns: Word, POS (ncn, jca, pvg, etm, nq, jca, mag, paa, etm, ncn, jco, nq, jp, ecs, pvg, ef, sf), Chunk (B-NP, I-NP, B-VP, I-VP, …, O), and Output, where each word is labeled S, E, or X (S S S X X E S X X X E X X X X E X X E).]
46
Clause Boundary Detection
Two Binary Classification Tasks
Finding the Ending Point (S, X)
Finding the Starting Point (E, X)
[Diagram: for a sentence S: w1, w2, …, wi, …, wn, each of the two tasks runs the same pipeline: feature set → feature selection → learning → classification, producing the ending-point labels (S vs. X) and the starting-point labels (E vs. X).]
47
Features
Dimension of a vector (= 4,232)
# of words: 4,171
# of POSs: 52
# of chunks: 9
Trigram Model
wi-1: 1 ~ 4,232
wi: 4,233 ~ 8,464
wi+1: 8,465 ~ 12,696
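The trigram indexing above is just an offset scheme: each window position gets its own block of 4,232 indices. A sketch (the local feature ids 30, 2070, 1457 are hypothetical, chosen so the result reproduces the sample indices 30, 6302, 9921 that appear on the vector-representation slide):

```python
BLOCK = 4232   # 4,171 words + 52 POS tags + 9 chunk labels

def trigram_indices(local_ids):
    """local_ids: one local feature id (1..4232) per window position
    (w_{i-1}, w_i, w_{i+1}); returns the global sparse indices."""
    return sorted(pos * BLOCK + fid for pos, fid in enumerate(local_ids))

print(trigram_indices([30, 2070, 1457]))  # [30, 6302, 9921]
```

In the real task each position contributes several local features (word, POS, chunk), each offset the same way.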
48
Vector Representation
[The same Korean example sentence as before, with Word, POS, Chunk, and Output columns; the current window is (wi-1, wi, wi+1).]

        wi-1   wi        wi+1
Word    는     위버반도   에서
POS     etm    nq        jca
Chunk   I-VP   B-NP      I-NP
Ending Point: E

Feature indices by window position:
wi-1: 30:1 4215:1 4229:1 4232:1
wi:   6302:1 8423:1 8462:1
wi+1: 9921:1 12664:1 12692:1

Resulting vector:
-1 30:1 4215:1 4229:1 4232:1 6302:1 8423:1 8462:1 9921:1 12664:1 12692:1
49
Execution of SVMlight (1)
50
Execution of SVMlight (2)
51
Third Example: Text Classification
Document into a vector
Binary vector x = <w1, w2, …, w|V|>
Commonly-used Corpus
Reuters-21578: 12,902 Reuters stories, 118 categories
ModApte split: 75% for training (9,603 stories), 25% for test (3,299 stories)
Feature Selection
The 300 words with the highest mutual information with each category; |V| = 300
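The selection criterion above can be sketched from document counts: mutual information between a word's presence and category membership. The contingency counts below are hypothetical toy statistics, not Reuters figures.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between word presence and category membership.
    n11: docs with word & in category; n10: word & not in category;
    n01: no word & in category; n00: neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for nwc, nw, nc in [(n11, n11 + n10, n11 + n01),
                        (n10, n11 + n10, n10 + n00),
                        (n01, n01 + n00, n11 + n01),
                        (n00, n01 + n00, n10 + n00)]:
        if nwc > 0:
            mi += (nwc / n) * math.log2(n * nwc / (nw * nc))
    return mi

# A word concentrated in the category scores higher than an uninformative one.
print(mutual_information(49, 1, 1, 49) > mutual_information(25, 25, 25, 25))  # True
```

Ranking the vocabulary by this score and keeping the top 300 words per category yields the |V| = 300 representation described above.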
52
Text Classification Results
53
Interpreting Weight Vector
Category “interest”
Terms with Highest Weight
Prime: 0.70
Rate: 0.67
Interest: 0.63
Rates: 0.60
Discount: 0.46
Terms with Lowest Weight
Group: -0.24
Year: -0.25
Sees: -0.33
World: -0.35
Dlrs: -0.71