Christopher M. Bishop, Pattern Recognition and Machine Learning

Page 1: Christopher M. Bishop, Pattern Recognition and Machine Learning

Christopher M. Bishop, Pattern Recognition and Machine Learning

Page 2: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 3: Christopher M. Bishop, Pattern Recognition and Machine Learning

Supervised Learning

In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning.

[Figure: a learned function y(x) maps inputs to outputs, trained on example pairs (x, t) such as (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail).]

Page 4: Christopher M. Bishop, Pattern Recognition and Machine Learning

Classification

[Figure: a two-dimensional example with axes x1 and x2; the decision boundary y(x) = 0 separates the region y(x) > 0 (class t = +1) from the region y(x) < 0 (class t = -1).]

Page 5: Christopher M. Bishop, Pattern Recognition and Machine Learning

Regression

[Figure: a regression example with x in [0, 1] and t in [-1, 1], showing training points (x, t), a fitted curve y(x), and a prediction at a new input x.]

Page 6: Christopher M. Bishop, Pattern Recognition and Machine Learning

Linear Models

Linear models for regression and classification:

$$y(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_D x_D, \quad \text{where } \mathbf{x} = (x_1, \dots, x_D)^T$$

If we apply feature extraction with basis functions $\phi_j$:

$$y(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

where x is the input and w are the model parameters.
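For intuition, here is a minimal Python sketch of such a linear basis-function model (not from the book or the slides; the Gaussian basis functions and all numerical values are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """Feature extraction: map scalar inputs x to basis features phi_j(x)."""
    # phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2)), one column per basis function
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def linear_model(x, w, w0, centers):
    """y(x) = w0 + sum_j w_j * phi_j(x) = w^T phi(x) plus a bias."""
    Phi = gaussian_basis(x, centers)
    return w0 + Phi @ w

# Example usage with arbitrary illustrative values
x = np.linspace(0.0, 1.0, 5)
centers = np.linspace(0.0, 1.0, 3)   # three basis functions
w = np.array([0.5, -1.0, 2.0])       # model parameters
print(linear_model(x, w, w0=0.1, centers=centers))
```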

Page 7: Christopher M. Bishop, Pattern Recognition and Machine Learning

Problems with Feature Space

Why feature extraction? Working in high-dimensional feature spaces makes it possible to express complex functions with simple (linear) models.

Problems:
- computational cost (working with very large feature vectors)
- the curse of dimensionality

Page 8: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (1)

A kernel function is an inner product in some feature space, acting as a nonlinear similarity measure:

$$k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$$

Examples:
- polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^d$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$
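A minimal Python sketch of these two kernels (the parameter values c, d, and sigma below are arbitrary illustrative choices):

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """k(x, z) = (x^T z + c)^d"""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```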

Page 9: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (2)

Many linear models can be reformulated using a "dual representation" in which the kernel functions arise naturally, and only inner products between data (input) points are required.

Worked example (2-D input, quadratic kernel):

$$\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &= (\mathbf{x}^T \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 \\
&= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 \\
&= (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T \\
&= \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{z})
\end{aligned}$$
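This identity is easy to check numerically; a short sketch (the example vectors are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel on 2-D inputs."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])   # arbitrary example vectors
z = np.array([2.0, -1.0])

kernel_value = np.dot(x, z) ** 2          # computed in the 2-D data space
feature_value = np.dot(phi(x), phi(z))    # computed in the 3-D feature space
print(kernel_value, feature_value)        # both give the same number
```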

Page 10: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (3)

We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing φ, so there is no need to specify which features are being used
- we can save computation by not explicitly mapping the data to feature space, but instead working out the inner product directly in the data space

Page 11: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (4)

Kernel methods exploit information about the inner products between data items.

We can construct kernels indirectly by choosing a feature-space mapping φ, or directly by choosing a valid kernel function.

If a poor kernel function is chosen, it will map the data to a space with many irrelevant features, so we need some prior knowledge of the target problem.

Page 12: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (5)

Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function

Page 13: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (6)

Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points.

A sparse kernel machine makes predictions using only a subset of the training data points.

Page 14: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 15: Christopher M. Bishop, Pattern Recognition and Machine Learning

Support Vector Machines (1)

Support vector machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory.

Generalization theory describes how to control learning machines to prevent them from overfitting.

Page 16: Christopher M. Bishop, Pattern Recognition and Machine Learning

Support Vector Machines (2)

To avoid overfitting, SVMs modify the error function to a "regularized form"

$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$

where the hyperparameter λ balances the trade-off between the two terms. The aim of E_W is to limit the estimated functions to smooth functions. As a side effect, SVMs obtain a sparse model.
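For intuition only, a minimal sketch of such a regularized error, assuming a sum-of-squares data term and an L2 weight penalty (the SVM itself uses a hinge-type data term, so this only illustrates the E_D + λE_W structure):

```python
import numpy as np

def regularized_error(w, Phi, t, lam):
    """E(w) = E_D(w) + lambda * E_W(w): data-dependent error plus weight penalty."""
    E_D = 0.5 * np.sum((Phi @ w - t) ** 2)   # data-dependent (fit) error
    E_W = 0.5 * np.dot(w, w)                 # penalizes large, non-smooth weights
    return E_D + lam * E_W
```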

Page 17: Christopher M. Bishop, Pattern Recognition and Machine Learning

Support Vector Machines (3)


Fig. 1 Architecture of SVM

Page 18: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (1)

The mechanism to prevent overfitting in classification is the "maximum margin classifier".

The SVM is fundamentally a two-class classifier.

Page 19: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (1)

The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space.

2D example:

Page 20: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (2)

[Figure: the margin is the distance between the decision boundary and the closest data points; the points lying on the margin are the support vectors.]

Page 21: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (3)

[Figure: two separating boundaries compared, one with a small margin and one with a large margin.]

Page 22: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (4)

Intuitively, the maximum margin solution is "robust": if we have made a small error in the location of the boundary, it gives us the least chance of causing a misclassification.

The concept of maximum margin is usually justified using Vapnik's statistical learning theory.

Empirically it works well.

Page 23: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (2)

After the optimization process, we obtain the prediction model

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

where (x_n, t_n) are the N training data points. We find that a_n is zero except for the support vectors, so the model is sparse.
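A minimal sketch of this prediction model (the Gaussian kernel, the support vectors, the multipliers a_n, and the bias b below are all made-up illustrative values):

```python
import numpy as np

def svm_predict(x, support_x, support_t, a, b, kernel):
    """y(x) = sum_n a_n * t_n * k(x, x_n) + b, summed over the support vectors only."""
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a, support_t, support_x)) + b

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
support_x = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]   # hypothetical support vectors
support_t = [+1, -1]                                        # their class labels
a = [0.7, 0.4]                                              # hypothetical multipliers
b = -0.1
print(np.sign(svm_predict(np.array([1.0, 0.0]), support_x, support_t, a, b, gaussian)))
```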

Page 24: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (3)

Fig. 2 Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function.

Page 25: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (4)

For overlapping class distributions, the SVM allows some of the training points to be misclassified, at the cost of a penalty: the "soft margin".

Page 26: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (5)

For multiclass problems, there are methods that combine multiple two-class SVMs (a sketch follows the figure caption below):
- one versus the rest
- one versus one (requires more training time)

Fig. 3 Problems in multiclass classification using multiple SVMs
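As one possible illustration (using scikit-learn, which is not referenced in the slides), both combination schemes can be sketched as follows:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # a standard 3-class dataset

# One-versus-the-rest: trains 3 binary SVMs (each class against all others)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# One-versus-one: trains 3*(3-1)/2 = 3 binary SVMs, one per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```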

Page 27: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (1)

For regression problems, the mechanism to prevent overfitting is the "ε-insensitive error function": errors smaller than ε cost nothing, and larger errors grow linearly.

[Figure: comparison of the quadratic error function and the ε-insensitive error function.]
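A minimal sketch of the ε-insensitive error function (the value of ε is arbitrary):

```python
import numpy as np

def eps_insensitive_error(y, t, eps=0.1):
    """E_eps(y - t) = 0 if |y - t| < eps, otherwise |y - t| - eps."""
    return np.maximum(np.abs(y - t) - eps, 0.0)

print(eps_insensitive_error(np.array([0.05, 0.3, -0.5]), 0.0))  # -> [0.  0.2 0.4]
```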

Page 28: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (2)

Fig. 4 The ε-tube. Points inside the tube incur no error; a point outside the tube incurs error |y(x) - t| - ε.

Page 29: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (3)

After the optimization process, we obtain the prediction model

$$y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n)\, k(\mathbf{x}, \mathbf{x}_n) + b$$

We find that the coefficients are zero except for the support vectors, so the model is sparse.
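A minimal sketch of this regression model (the kernel, the support vectors, the coefficient differences a_n - â_n, and the bias are made-up illustrative values):

```python
import numpy as np

def svr_predict(x, support_x, coef, b, kernel):
    """y(x) = sum_n (a_n - a_hat_n) * k(x, x_n) + b; coef holds the differences a_n - a_hat_n."""
    return sum(c * kernel(x, x_n) for c, x_n in zip(coef, support_x)) + b

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
support_x = [np.array([0.2]), np.array([0.8])]   # made-up support vectors
coef = [1.5, -0.6]                               # made-up (a_n - a_hat_n) values
print(svr_predict(np.array([0.5]), support_x, coef, b=0.05, kernel=gaussian))
```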

Page 30: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (4)

Fig. 5 Regression results. Support vectors lie on the boundary of the tube or outside the tube.

Page 31: Christopher M. Bishop, Pattern Recognition and Machine Learning

Disadvantages

- The SVM is not sparse enough: the number of support vectors typically grows linearly with the size of the training set.
- Predictions are not probabilistic.
- The error/margin trade-off parameters must be estimated by cross-validation, which wastes computation.
- Kernel functions are limited (they must be valid, positive-definite kernels).
- Multiclass classification problems require combining multiple binary SVMs.

Page 32: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 33: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (1)

The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations.

The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM.

Page 34: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (2)

The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM. The kernel functions are simply treated as basis functions, rather than as dot products in some feature space:

$$y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b$$
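To make the "kernels as basis functions" view concrete, a small sketch of the resulting design matrix (the bias column and the choice of kernel are the only assumptions here):

```python
import numpy as np

def kernel_design_matrix(X, kernel):
    """Kernels treated as basis functions: Phi[i, n] = k(x_i, x_n), plus a column of ones for b."""
    N = len(X)
    K = np.array([[kernel(X[i], X[n]) for n in range(N)] for i in range(N)])
    return np.hstack([K, np.ones((N, 1))])

# With weights w (length N) and bias b stacked as [w, b],
# the model evaluated at the training points is simply Phi @ [w, b].
```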

Page 35: Christopher M. Bishop, Pattern Recognition and Machine Learning

Bayesian Inference

Bayesian inference allows one to model uncertainty about the world and about outcomes of interest by combining common-sense knowledge and observational evidence.

Page 36: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (3)

In the Bayesian framework, we use a prior distribution over w to avoid overfitting:

$$p(\mathbf{w} \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha w_m^2}{2}\right)$$

where α is a hyperparameter that controls the model parameters w.

Page 37: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (4)

Goal: find the most probable α* and β* in order to compute the predictive distribution over t_new for a new input x_new, i.e.

$$p(t_{new} \mid \mathbf{x}_{new}, \mathbf{X}, \mathbf{t}, \alpha^*, \beta^*)$$

We obtain α* and β* by maximizing the likelihood function

$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$$

where X and t are the training data and their target values.

Page 38: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (5)

The RVM uses "automatic relevance determination" (ARD) to achieve sparsity:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{m=1}^{N} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m w_m^2}{2}\right)$$

where α_m represents the precision of w_m.

In the procedure of finding α_m*, some α_m go to infinity, which drives the corresponding w_m to zero; the remaining data points are the relevance vectors.
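A rough sketch of this procedure for regression, assuming a Gaussian likelihood and the standard evidence re-estimation updates (as in Tipping, 2001); this is illustrative only and omits the numerical refinements a real implementation needs:

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=100, alpha_cap=1e9):
    """Sketch of RVM training by evidence re-estimation (automatic relevance determination)."""
    N, M = Phi.shape
    alpha = np.ones(M)          # one precision hyperparameter per weight
    beta = 1.0                  # noise precision
    for _ in range(n_iter):
        # Posterior over w given the current alpha and beta
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # Re-estimate the hyperparameters
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (m ** 2 + 1e-12)
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
        alpha = np.minimum(alpha, alpha_cap)   # weights whose alpha hits the cap are effectively pruned
    relevant = alpha < alpha_cap               # surviving basis functions: the relevance vectors
    return m, relevant
```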

Page 39: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparisons - Regression

[Figure: regression fits from the RVM (with the one-standard-deviation band of the predictive distribution) and from the SVM.]

Page 40: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparisons - Regression


Page 41: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparison - Classification

[Figure: classification results compared for the RVM and the SVM.]

Page 42: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparison - Classification


Page 43: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparisons

- The RVM is much sparser and makes probabilistic predictions.
- The RVM gives better generalization in regression.
- The SVM gives better generalization in classification.
- The RVM is computationally demanding during training.

Page 44: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 45: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (1)

SVM for face detection


Page 46: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (2)

Marti Hearst, "Support Vector Machines", 1998

Page 47: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (3)

In feature-matching-based object tracking, SVMs are used to detect false feature matches.

Weiyu Zhu et al., "Tracking of Object with SVM Regression", 2001

Page 48: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (4)

Recovering 3D human poses with the RVM.

A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression", 2004

Page 49: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 50: Christopher M. Bishop, Pattern Recognition and Machine Learning

Conclusions

The SVM is a learning machine based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks.

The RVM has the same functional form as the SVM but provides probabilistic predictions and sparser solutions.

Page 51: Christopher M. Bishop, Pattern Recognition and Machine Learning

References

- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods," Cambridge University Press, 2000
- M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, 2001

Page 52: Christopher M. Bishop, Pattern Recognition and Machine Learning

Underfitting and Overfitting

[Figure: underfitting (model too simple) versus overfitting (model too complex), illustrated by how each fits new data. Adapted from http://www.dtreg.com/svm.htm]