An Introduction to Support Vector Machines
Seong-Bae Park
Kyungpook National University
http://sejong.knu.ac.kr/~sbpark
2
Supervised Learning
[Diagram: the Environment presents a problem x; the Teacher supplies the desired solution d; the Learner (Student) produces y = f(x) and receives feedback comparing y with d.]
3
Quality of Learning Machine
Loss L(y, f(x, w)) ≥ 0
Discrepancy between the true output and the output of the learning machine
Risk functional
Expected value of the loss:

R(w) = \int L(y, f(x, w)) \, p(x, y) \, dx \, dy

Learning
The process of estimating the function f(x, w) which minimizes the risk functional using only the training data
4
Common Learning Tasks (1)
Classification
Loss function:

L(y, f(x, w)) = \begin{cases} 0 & \text{if } y = f(x, w) \\ 1 & \text{if } y \neq f(x, w) \end{cases}

Risk:

R(w) = \int L(y, f(x, w)) \, p(x, y) \, dx \, dy
5
Common Learning Tasks (2)
Regression
Common Loss Function: Squared Error (L2)

L(y, f(x, w)) = (y - f(x, w))^2

Risk:

R(w) = \int (y - f(x, w))^2 \, p(x, y) \, dx \, dy
6
ML Hypothesis
Maximum Likelihood hypothesis

h_{ML} = \arg\max_{h \in H} P(h|D)
       = \arg\max_{h \in H} \frac{P(D|h) P(h)}{P(D)}
       = \arg\max_{h \in H} P(D|h) P(h)
       = \arg\max_{h \in H} P(D|h)

(P(D) is constant over h; the last step assumes a uniform prior P(h).)
7
Maximum Likelihood Revisited
If x_1, …, x_n are iid samples from a pdf f(x|w), the likelihood is defined by

P(x_1, …, x_n | w) = \prod_{i=1}^{n} f(x_i | w).

Maximum Likelihood Estimator
Choose w* that maximizes the likelihood.
Relation to Loss
Taking the negative log-likelihood as the loss gives the risk

R_{ML}(w) = -\sum_{i=1}^{n} \ln f(x_i | w).
8
Empirical Risk Minimization
Do we know p(x, y)? Generally NO! What we have is only the training data.

R(w) = \int L(y, f(x, w)) \, p(x, y) \, dx \, dy

Empirical Risk:

R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, w))

ERM is more general than ML. In density estimation, ERM with the loss L(f(x, w)) = -\ln f(x|w) is equivalent to ML.
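The idea above can be sketched in a few lines: since p(x, y) is unknown, the risk R(w) is approximated by the average loss over the training sample. The toy data and threshold classifier below are hypothetical, not from the slides.

```python
def zero_one_loss(y_true, y_pred):
    """0/1 classification loss from the 'Common Learning Tasks' slide."""
    return 0 if y_true == y_pred else 1

def empirical_risk(data, f, loss):
    """R_emp(w) = (1/n) * sum_i L(y_i, f(x_i))."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

# Toy 1-D training set: label is (roughly) the sign of x.
train = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.7, 1), (-0.1, 1)]

# A simple threshold classifier, a stand-in for f(x, w).
f = lambda x: 1 if x >= 0 else -1

print(empirical_risk(train, f, zero_one_loss))  # misclassifies (-0.1, 1) -> 0.2
```

Minimizing this sample average over w is exactly the ERM principle of the slide.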
9
Risk and Empirical Risk
When the loss is \frac{1}{2} |y - f(x, w)|:

R(w) = \int \frac{1}{2} |y - f(x, w)| \, dP(x, y)

R_{emp}(w) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, w)|

Relation between them (Vapnik, 1995)
With probability 1 - η,

R(w) \le R_{emp}(w) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}

h: VC dimension (≥ 0)
Regardless of P(x, y)
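The bound's behavior is easy to probe numerically. The sketch below evaluates the square-root capacity term for illustrative values of h, l, and η (not taken from the slides).

```python
# Numeric sketch of the capacity term in Vapnik's bound:
#   R(w) <= R_emp(w) + sqrt( (h*(log(2l/h) + 1) - log(eta/4)) / l ).
import math

def vc_confidence(h, l, eta):
    """Capacity term added to the empirical risk (Vapnik, 1995)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# More data (larger l) shrinks the term; higher capacity h grows it.
print(vc_confidence(h=10, l=1000, eta=0.05) < vc_confidence(h=10, l=100, eta=0.05))   # True
print(vc_confidence(h=10, l=1000, eta=0.05) < vc_confidence(h=100, l=1000, eta=0.05)) # True
```

This is the trade-off the next slides exploit: minimize R(w) by keeping both R_emp(w) and h small.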
10
Risk and Empirical Risk (cont.)
The square-root term in the bound on the previous slide is called the VC confidence.
11
VC dimension
A set of instances S is shattered by {f(w)} iff for every dichotomy of S there exists some f(w) consistent with that dichotomy.
For l points there are 2^l dichotomies. The Vapnik-Chervonenkis dimension, VC, is the maximum number of training points that can be shattered by {f(w)}.
12
Minimizing R(w) by minimizing h
13
Perceptron Revisited: Linear Separators
w^T x + b = 0
f(x) = sign(w^T x + b)
[Figure: a linear separator w^T x + b = 0, with w^T x + b > 0 on one side and w^T x + b < 0 on the other.]
14
Learning Perceptron (1)
Perceptron Learning Algorithm
Given a training set S = {(x_1, y_1), …, (x_l, y_l)} and learning rate η ∈ R+:

w_0 ← 0; b_0 ← 0; k ← 0
R ← max_{1≤i≤l} ||x_i||
while (there are errors)
  for i = 1 to l
    if y_i(⟨w_k ⋅ x_i⟩ + b_k) ≤ 0 then
      w_{k+1} ← w_k + η y_i x_i
      b_{k+1} ← b_k + η y_i R^2
      k ← k + 1
    end if
  end for
end while
return (w_k, b_k)
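The algorithm above can be sketched in pure Python for 2-D inputs; the linearly separable toy training set is hypothetical.

```python
# Runnable sketch of the perceptron learning algorithm from the slide,
# including the R^2-scaled bias update.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(S, eta=1.0, max_epochs=100):
    """Train on S = [(x, y), ...] with y in {-1, +1}; returns (w, b)."""
    dim = len(S[0][0])
    w, b = [0.0] * dim, 0.0
    R = max(dot(x, x) for x, _ in S) ** 0.5   # R = max_i ||x_i||
    for _ in range(max_epochs):
        errors = 0
        for x, y in S:
            if y * (dot(w, x) + b) <= 0:       # misclassified (or on boundary)
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y * R * R
                errors += 1
        if errors == 0:                        # converged: every point correct
            break
    return w, b

# Linearly separable toy data: the positive class has x1 + x2 > 1.
S = [([0.0, 0.0], -1), ([1.0, 0.0], -1), ([0.0, 1.0], -1),
     ([2.0, 1.0], 1), ([1.0, 2.0], 1), ([2.0, 2.0], 1)]
w, b = perceptron(S)
print(all(y * (dot(w, x) + b) > 0 for x, y in S))  # True
```

For separable data the perceptron convergence theorem guarantees the loop terminates, which is why a modest `max_epochs` suffices here.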
15
Learning Perceptron (2)
16
Linear Separators
Which of the linear separators is optimal?
17
Margin
Distance from example x to the separator is r = \frac{w^T x + b}{\|w\|}.
Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the width of separation between the classes.
18
Maximum Margin Classification
Maximizing the margin is good according to intuition and PAC theory.
It implies that only support vectors are important; other training examples are ignorable.
19
Linear SVM Mathematically (1)
Assuming all data are at least distance 1 from the hyperplane, the following two constraints hold for a training set {(x_i, y_i)}:

w^T x_i + b ≥ 1 if y_i = 1
w^T x_i + b ≤ -1 if y_i = -1

For support vectors, the inequality becomes an equality. Since each example's distance from the hyperplane is r = \frac{w^T x + b}{\|w\|}, the margin is ρ = \frac{2}{\|w\|}.
20
Linear SVMs Mathematically (2)
Quadratic optimization problem:

Find w and b such that ρ = \frac{2}{\|w\|} is maximized and for all {(x_i, y_i)}:
w^T x_i + b ≥ 1 if y_i = 1; w^T x_i + b ≤ -1 if y_i = -1

A better formulation:

Find w and b such that Φ(w) = ½ w^T w is minimized and for all {(x_i, y_i)}:
y_i (w^T x_i + b) ≥ 1
21
Solving the Optimization Problem
We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every constraint of the primal problem:

Primal: Find w and b such that Φ(w) = ½ w^T w is minimized and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

Dual: Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i
22
The Optimization Problem Solution
The solution has the form:

w = Σ α_i y_i x_i
b = y_k - w^T x_k for any x_k such that α_k ≠ 0

Each non-zero α_i indicates that the corresponding x_i is a support vector. The classifying function then has the form:

f(x) = Σ α_i y_i x_i^T x + b

Notice that it relies on an inner product between the test point x and the support vectors x_i. Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all training points.
23
Support Vectors in Dual Form
[Figure: a separating hyperplane with margin ρ; almost all multipliers are zero (α_1 = α_2 = α_4 = α_5 = α_7 = … = 0), and only the support vectors on the margin have nonzero multipliers: α_3 = 0.8, α_6 = 0.4, α_13 = 1.4.]
24
Soft Margin Classification (1)
What if the training set is not linearly separable? Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
25
Soft Margin Classification (2)
Optimization Situation
Minimize

½ \|w\|^2 + C \sum_{i=1}^{l} ξ_i    (C > 0 is a user-defined parameter)

subject to

w ⋅ x_i + b ≥ +1 - ξ_i for y_i = +1
w ⋅ x_i + b ≤ -1 + ξ_i for y_i = -1
ξ_i ≥ 0

Dual problem
Maximize

\sum_i α_i - ½ \sum_{i,j} α_i α_j y_i y_j \, x_i ⋅ x_j

subject to

0 ≤ α_i ≤ C,  \sum_i α_i y_i = 0
26
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x-axis that is not linearly separable becomes separable after mapping x → (x, x²).]
27
Non-linear SVMs: Feature spaces
General idea
The original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
28
The “Kernel Trick”
The linear classifier relies on inner products between vectors: K(x_i, x_j) = x_i^T x_j.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j).
A kernel function is a function that corresponds to an inner product in some feature space.
Example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.
We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

K(x_i, x_j) = (1 + x_i^T x_j)^2
= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
= [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}^2, √2 x_{j1} x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2]
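The identity above is easy to check numerically: the kernel value computed in the 2-D input space must equal the inner product in the 6-D feature space. The test vectors below are arbitrary.

```python
# Numeric check of the kernel-trick identity (1 + x.y)^2 = phi(x).phi(y),
# with phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2].
import math

def poly_kernel(x, y):
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    s2 = math.sqrt(2)
    return [1, x[0] ** 2, s2 * x[0] * x[1], x[1] ** 2, s2 * x[0], s2 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [0.5, -1.0], [2.0, 3.0]
print(abs(poly_kernel(x, y) - dot(phi(x), phi(y))) < 1e-12)  # True
```

The point of the trick: the left-hand side costs O(d), while the explicit mapping would cost O(d²) here and far more for higher-degree kernels.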
29
What Functions are Kernels?
For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
Mercer's theorem: every positive semi-definite symmetric function is a kernel. Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

    | K(x_1, x_1) K(x_1, x_2) K(x_1, x_3) … K(x_1, x_N) |
K = | K(x_2, x_1) K(x_2, x_2) K(x_2, x_3) … K(x_2, x_N) |
    | …           …           …           … …           |
    | K(x_N, x_1) K(x_N, x_2) K(x_N, x_3) … K(x_N, x_N) |
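A minimal sketch of the Mercer condition: build the Gram matrix of the Gaussian kernel on a few points and spot-check positive semi-definiteness via z^T K z ≥ 0 for random vectors z. The sample points and σ are arbitrary illustrative choices; this is a sampling check, not a full eigenvalue test.

```python
import math
import random

def rbf(x, y, sigma=1.0):
    """Gaussian kernel on 1-D inputs: exp(-(x - y)^2 / (2 sigma^2))."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

points = [-1.5, -0.2, 0.0, 0.7, 2.3]
K = [[rbf(a, b) for b in points] for a in points]

random.seed(0)
ok = True
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in points]
    quad = sum(z[i] * K[i][j] * z[j]
               for i in range(len(z)) for j in range(len(z)))
    ok = ok and quad >= -1e-9   # allow tiny floating-point slack
print(ok)  # True: the RBF Gram matrix passes the PSD spot-check
```

A function failing such a check on any point set cannot be a Mercer kernel.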
30
Examples of Kernel Functions
Linear: K(x_i, x_j) = x_i^T x_j
Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
Gaussian (radial-basis function network): K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2σ^2}\right)
Two-layer perceptron: K(x_i, x_j) = \tanh(β_0 x_i^T x_j + β_1)
31
Non-linear SVMs Mathematically
Dual problem formulation:

Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j K(x_i, x_j) is maximized and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i

The solution is:

f(x) = Σ α_i y_i K(x_i, x) + b

The optimization techniques for finding the α_i's remain the same!
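The dual-form decision function is straightforward to evaluate once the α_i are known. The sketch below uses a hypothetical two-point toy problem whose dual solution can be derived by hand: for x_1 = -1 (y = -1) and x_2 = +1 (y = +1) with a linear kernel, α_1 = α_2 = 0.5 and b = 0 satisfy the constraints, giving w = 1 and f(x) = x.

```python
def linear_kernel(u, v):
    return u * v

def decision(x, svs, alphas, ys, b, K):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * K(xi, x) for a, y, xi in zip(alphas, ys, svs)) + b

svs, ys = [-1.0, 1.0], [-1, 1]
alphas, b = [0.5, 0.5], 0.0        # hand-derived dual solution for this toy set

print(decision(2.0, svs, alphas, ys, b, linear_kernel))   # 2.0
print(decision(-0.3, svs, alphas, ys, b, linear_kernel))  # -0.3
```

Swapping `linear_kernel` for any Mercer kernel changes the decision surface without changing this evaluation code, which is the point of the slide.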
32
SVM Structure
33
VC dimension of SVM
Minimal embedding space
Any embedding space with minimal dimension for a given kernel.
Let K be a kernel which corresponds to a minimal embedding space H. Then the VC dimension of the corresponding SVM is dim(H) + 1.
The VC dimension of an SVM can be ∞.
Striking conundrum: high VC dimension, but good performance!
34
Generalization Error by Margin
Risk bound by margin ρ
With probability 1 - η,

R(w, b) \le \frac{c}{l} \left( \frac{R^2}{ρ^2} \log^2 l + \log(1/η) \right)

A large margin makes the SVM stronger!
35
SVMlight (1)
Author: T. Joachims
Download: http://svmlight.joachims.org
Two executable files:
svm_learn: svm_learn training_data model_file
svm_classify: svm_classify test_data model_file
[Diagram: training data → svm_learn → generated model; the model and test data → svm_classify → classified result.]
36
SVMlight (2)
Written in C
Applicable to classification, regression, and ranking tasks
Can handle thousands of support vectors
Can handle hundreds of thousands of training examples
Supports standard kernel functions and user-defined kernels
Uses a sparse vector representation
37
Why is handling many SVs important?
Learning an SVM:

Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j Q_ij is maximized and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i

Q: n × n matrix (Q_ij = y_i y_j K(x_i, x_j)). For many real-world applications, Q is too large for standard computers.
SMO decomposition: the overall QP is decomposed into small QP subproblems. Joachims presented an optimization method for SMO decomposition.
38
svm_learn options
-z {c, r, p}   Selection of task: classification (c), regression (r), preference ranking (p) (default is c)
-c float       C parameter for soft-margin SVM (default: E[x^T x]^-1)
-t int         Type of kernel function:
               0: linear
               1: polynomial (s x⋅y + c)^d
               2: RBF
               3: sigmoid tanh(s x⋅y + c)
               4: user-defined kernel
-d int         Parameter d in polynomial kernel
-g float       Parameter gamma in RBF kernel
-s float       Parameter s in sigmoid/polynomial kernel
-r float       Parameter c in sigmoid/polynomial kernel
-u string      Parameter of user-defined kernel
39
Format of data
Each example is represented as a line:

<line>    .=. <target> <feature>:<value> … <feature>:<value> # <info>
<target>  .=. +1 | -1 | 0 | <float>
<feature> .=. <integer>
<value>   .=. <float>
<info>    .=. <string>

Feature/value pairs must be ordered by increasing feature number. Features with value zero can be skipped.
Example:

-1 1:0.43 3:0.12 9284:0.2 # comment
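Emitting this sparse format is a one-liner worth getting right: features sorted by increasing index, zero-valued features skipped. A small sketch (the feature dictionary is hypothetical):

```python
def to_svmlight_line(target, features, info=None):
    """features: {feature_index: value}; returns one SVMlight data line."""
    pairs = " ".join(f"{k}:{v:g}" for k, v in sorted(features.items()) if v != 0)
    line = f"{target:+d} {pairs}" if isinstance(target, int) else f"{target} {pairs}"
    return f"{line} # {info}" if info else line

print(to_svmlight_line(-1, {3: 0.12, 1: 0.43, 9284: 0.2, 7: 0.0}, "comment"))
# -1 1:0.43 3:0.12 9284:0.2 # comment
```

The output reproduces the example line above; note feature 7 is dropped because its value is zero.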
40
Text Chunking
Corpus: http://sejong.knu.ac.kr/~sbpark/Chunk
[Sample from the corpus: a POS-tagged sentence with chunk labels, e.g. maj B-ADVP, mmd B-NP, ncn I-NP, jxt I-NP, ncn B-NP, jcm I-NP, ncps I-NP, jca I-NP, mag B-ADVP, paa B-VP, ef I-VP, nbn I-VP, paa I-VP, ef I-VP, sf O.]

Information   Value
Vocabulary    16,838
Total Words   321,328
Chunk Types   9
POS Tags      52
Sentences     12,092
Phrases       112,658
41
Context
42
Data Format
1 1:1 16315:1 32630:1 50221:1 66411:1 82496:1 97890:1 114205:1 114258:1 114311:1 114401:1 114447:1 114492:1 114553:1 114576:1 114586:1 114596:1
-1 1:1 16315:1 33906:1 50096:1 66181:1 81575:1 98759:1 114205:1 114258:1 114348:1 114394:1 114439:1 114500:1 114535:1 114576:1 114586:1 114599:1
-1 1:1 17591:1 33781:1 49866:1 65260:1 82444:1 97890:1 114205:1 114295:1 114341:1 114386:1 114447:1 114482:1 114553:1 114576:1 114589:1 114603:1
1 1276:1 17466:1 33551:1 48945:1 66129:1 81575:1 97894:1 114242:1 114288:1 114333:1 114394:1 114429:1 114500:1 114556:1 114579:1 114593:1 114603:1
-1 1276:1 17466:1 33551:1 49814:1 65260:1 81579:1 97890:1 114242:1 114288:1 114333:1 114376:1 114447:1 114503:1 114552:1 114583:1 114593:1 114599:1
-1 1151:1 17236:1 33499:1 48945:1 65264:1 81575:1 98803:1 114235:1 114280:1 114323:1 114394:1 114450:1 114499:1 114533:1 114583:1 114589:1 114603:1
…
BNP.data
43
Running SVMlight
svm_learn BNP.data BNP.model
SVM-light Version V3.50
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
114605 # highest feature index
290465 # number of training documents
13947 # number of support vectors plus 1
0.94731663 # threshold b
-0.05882352941165028270553705169732 456:1 16683:1 33555:1 48945:1 65260:1 81981:1 98703:1 114229:1 114309:1 114324:1 114394:1 114447:1 114480:1 114564:1 114579:1 114593:1 114603:1
-0.05882352941165028270553705169732 1:1 17591:1 33555:1 49634:1 65472:1 82444:1 98054:1 114205:1 114295:1 114324:1 114401:1 114447:1 114482:1 114550:1 114576:1 114589:1 114603:1
…
44
Performance

           Decision Tree   SVM           MBL
Accuracy   97.95±0.24%     98.15±0.20%   97.79±0.29%
F-score    91.36±0.85      92.54±0.72    91.38±1.01
45
Another Example Task
Korean Clause Boundary Detection
[Table: the Korean sentence 기지에서 보이는 위버반도에서 가장 높ㄴ 봉우리를 서울봉이 라 부르ㄴ다 . shown word by word in four columns: Word, POS (ncn, jca, pvg, etm, nq, jca, mag, paa, etm, ncn, jco, nq, jp, ecs, pvg, ef, sf), Chunk (B-NP, I-NP, B-VP, I-VP, …, O), and Output, where each word is labeled S, E, or X (S S S X X E S X X X E X X X X E X X E).]
46
Clause Boundary Detection
Two Binary Classification Tasks
Finding the Ending Point (S, X)
Finding the Starting Point (E, X)
[Diagram: for a sentence S: w1, w2, …, wi, …, wn, each of the two tasks runs the same pipeline: feature set → feature selection → learning → classification, producing the ending-point labels (S vs. X) and the starting-point labels (E vs. X).]
47
Features
Dimension of a vector (= 4,232)
# of words: 4,171
# of POSs: 52
# of chunks: 9
Trigram Model
wi-1: 1 ~ 4,232
wi: 4,233 ~ 8,464
wi+1: 8,465 ~ 12,696
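The trigram indexing above is just an offset scheme: each window position gets its own block of 4,232 indices. A sketch (the local feature ids 30, 2070, 1457 are hypothetical, chosen so the result reproduces the sample indices 30, 6302, 9921 that appear on the vector-representation slide):

```python
BLOCK = 4232   # 4,171 words + 52 POS tags + 9 chunk labels

def trigram_indices(local_ids):
    """local_ids: one local feature id (1..4232) per window position
    (w_{i-1}, w_i, w_{i+1}); returns the global sparse indices."""
    return sorted(pos * BLOCK + fid for pos, fid in enumerate(local_ids))

print(trigram_indices([30, 2070, 1457]))  # [30, 6302, 9921]
```

In the real task each position contributes several local features (word, POS, chunk), each offset the same way.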
48
Vector Representation
[The same Korean example sentence as before, with Word, POS, Chunk, and Output columns; the current window is (wi-1, wi, wi+1).]

        wi-1   wi        wi+1
Word    는     위버반도   에서
POS     etm    nq        jca
Chunk   I-VP   B-NP      I-NP
Ending Point: E

Feature indices by window position:
wi-1: 30:1 4215:1 4229:1 4232:1
wi:   6302:1 8423:1 8462:1
wi+1: 9921:1 12664:1 12692:1

Resulting vector:
-1 30:1 4215:1 4229:1 4232:1 6302:1 8423:1 8462:1 9921:1 12664:1 12692:1
49
Execution of SVMlight (1)
50
Execution of SVMlight (2)
51
Third Example: Text Classification
Document into a vector
Binary vector x = <w1, w2, …, w|V|>
Commonly-used Corpus
Reuters-21578: 12,902 Reuters stories, 118 categories
ModApte split: 75% for training (9,603 stories), 25% for test (3,299 stories)
Feature Selection
The 300 words with the highest mutual information with each category; |V| = 300
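The selection criterion above can be sketched from document counts: mutual information between a word's presence and category membership. The contingency counts below are hypothetical toy statistics, not Reuters figures.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between word presence and category membership.
    n11: docs with word & in category; n10: word & not in category;
    n01: no word & in category; n00: neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for nwc, nw, nc in [(n11, n11 + n10, n11 + n01),
                        (n10, n11 + n10, n10 + n00),
                        (n01, n01 + n00, n11 + n01),
                        (n00, n01 + n00, n10 + n00)]:
        if nwc > 0:
            mi += (nwc / n) * math.log2(n * nwc / (nw * nc))
    return mi

# A word concentrated in the category scores higher than an uninformative one.
print(mutual_information(49, 1, 1, 49) > mutual_information(25, 25, 25, 25))  # True
```

Ranking the vocabulary by this score and keeping the top 300 words per category yields the |V| = 300 representation described above.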
52
Text Classification Results
53
Interpreting Weight Vector
Category “interest”
Terms with Highest Weight
Prime: 0.70
Rate: 0.67
Interest: 0.63
Rates: 0.60
Discount: 0.46
Terms with Lowest Weight
Group: -0.24
Year: -0.25
Sees: -0.33
World: -0.35
Dlrs: -0.71