Christopher M. Bishop, Pattern Recognition and Machine Learning

Page 1: Christopher M. Bishop, Pattern Recognition and Machine Learning

Christopher M. Bishop, Pattern Recognition and Machine Learning

Page 2: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 3: Christopher M. Bishop, Pattern Recognition and Machine Learning

Supervised Learning

In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning.

[Figure: a learned function y(x) maps inputs to outputs, trained on example pairs (x, t) such as (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail).]

Page 4: Christopher M. Bishop, Pattern Recognition and Machine Learning

Classification

[Figure: a two-dimensional example with axes x1 and x2; the decision boundary y(x) = 0 separates the region y(x) > 0 (class t = +1) from the region y(x) < 0 (class t = -1).]

Page 5: Christopher M. Bishop, Pattern Recognition and Machine Learning

Regression

[Figure: a regression example with x in [0, 1] and t in [-1, 1], showing training points (x, t), a fitted curve y(x), and a prediction at a new input x.]

Page 6: Christopher M. Bishop, Pattern Recognition and Machine Learning

Linear Models

Linear models for regression and classification:

$$y(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_D x_D, \quad \text{where } \mathbf{x} = (x_1, \dots, x_D)^T$$

If we apply feature extraction with basis functions $\phi_j$:

$$y(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

where x is the input and w are the model parameters.
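For intuition, here is a minimal Python sketch of such a linear basis-function model (not from the book or the slides; the Gaussian basis functions and all numerical values are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """Feature extraction: map scalar inputs x to basis features phi_j(x)."""
    # phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2)), one column per basis function
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def linear_model(x, w, w0, centers):
    """y(x) = w0 + sum_j w_j * phi_j(x) = w^T phi(x) plus a bias."""
    Phi = gaussian_basis(x, centers)
    return w0 + Phi @ w

# Example usage with arbitrary illustrative values
x = np.linspace(0.0, 1.0, 5)
centers = np.linspace(0.0, 1.0, 3)   # three basis functions
w = np.array([0.5, -1.0, 2.0])       # model parameters
print(linear_model(x, w, w0=0.1, centers=centers))
```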

Page 7: Christopher M. Bishop, Pattern Recognition and Machine Learning

Problems with Feature Space

Why feature extraction? Working in high-dimensional feature spaces makes it possible to express complex functions with simple (linear) models.

Problems:
- computational cost (working with very large feature vectors)
- the curse of dimensionality

Page 8: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (1)

A kernel function is an inner product in some feature space, acting as a nonlinear similarity measure:

$$k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$$

Examples:
- polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^d$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$
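A minimal Python sketch of these two kernels (the parameter values c, d, and sigma below are arbitrary illustrative choices):

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """k(x, z) = (x^T z + c)^d"""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```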

Page 9: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (2)

Many linear models can be reformulated using a "dual representation" in which the kernel functions arise naturally, and only inner products between data (input) points are required.

Worked example (2-D input, quadratic kernel):

$$\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &= (\mathbf{x}^T \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 \\
&= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 \\
&= (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T \\
&= \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{z})
\end{aligned}$$
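This identity is easy to check numerically; a short sketch (the example vectors are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel on 2-D inputs."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])   # arbitrary example vectors
z = np.array([2.0, -1.0])

kernel_value = np.dot(x, z) ** 2          # computed in the 2-D data space
feature_value = np.dot(phi(x), phi(z))    # computed in the 3-D feature space
print(kernel_value, feature_value)        # both give the same number
```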

Page 10: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (3)

We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing φ, so there is no need to specify which features are being used
- we can save computation by not explicitly mapping the data to feature space, but instead working out the inner product directly in the data space

Page 11: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (4)

Kernel methods exploit information about the inner products between data items.

We can construct kernels indirectly by choosing a feature-space mapping φ, or directly by choosing a valid kernel function.

If a poor kernel function is chosen, it will map the data to a space with many irrelevant features, so we need some prior knowledge of the target problem.

Page 12: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (5)

Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function

Page 13: Christopher M. Bishop, Pattern Recognition and Machine Learning

Kernel Methods (6)

Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points.

A sparse kernel machine makes predictions using only a subset of the training data points.

Page 14: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 15: Christopher M. Bishop, Pattern Recognition and Machine Learning

Support Vector Machines (1)

Support vector machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory.

Generalization theory describes how to control learning machines to prevent them from overfitting.

Page 16: Christopher M. Bishop, Pattern Recognition and Machine Learning

Support Vector Machines (2)

To avoid overfitting, SVMs modify the error function to a "regularized form"

$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$

where the hyperparameter λ balances the trade-off between the two terms. The aim of E_W is to limit the estimated functions to smooth functions. As a side effect, SVMs obtain a sparse model.
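For intuition only, a minimal sketch of such a regularized error, assuming a sum-of-squares data term and an L2 weight penalty (the SVM itself uses a hinge-type data term, so this only illustrates the E_D + λE_W structure):

```python
import numpy as np

def regularized_error(w, Phi, t, lam):
    """E(w) = E_D(w) + lambda * E_W(w): data-dependent error plus weight penalty."""
    E_D = 0.5 * np.sum((Phi @ w - t) ** 2)   # data-dependent (fit) error
    E_W = 0.5 * np.dot(w, w)                 # penalizes large, non-smooth weights
    return E_D + lam * E_W
```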

Page 17: Christopher M. Bishop, Pattern Recognition and Machine Learning

Support Vector Machines (3)


Fig. 1 Architecture of SVM

Page 18: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (1)

The mechanism to prevent overfitting in classification is the "maximum margin classifier".

The SVM is fundamentally a two-class classifier.

Page 19: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (1)

The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space.

2D example:

Page 20: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (2)

[Figure: the margin is the distance between the decision boundary and the closest data points; the points lying on the margin are the support vectors.]

Page 21: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (3)

[Figure: two separating boundaries compared, one with a small margin and one with a large margin.]

Page 22: Christopher M. Bishop, Pattern Recognition and Machine Learning

Maximum Margin Classifiers (4)

Intuitively, the maximum margin solution is "robust": if we have made a small error in the location of the boundary, it gives us the least chance of causing a misclassification.

The concept of maximum margin is usually justified using Vapnik's statistical learning theory.

Empirically it works well.

Page 23: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (2)

After the optimization process, we obtain the prediction model

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

where (x_n, t_n) are the N training data points. We find that a_n is zero except for the support vectors, so the model is sparse.
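A minimal sketch of this prediction model (the Gaussian kernel, the support vectors, the multipliers a_n, and the bias b below are all made-up illustrative values):

```python
import numpy as np

def svm_predict(x, support_x, support_t, a, b, kernel):
    """y(x) = sum_n a_n * t_n * k(x, x_n) + b, summed over the support vectors only."""
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a, support_t, support_x)) + b

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
support_x = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]   # hypothetical support vectors
support_t = [+1, -1]                                        # their class labels
a = [0.7, 0.4]                                              # hypothetical multipliers
b = -0.1
print(np.sign(svm_predict(np.array([1.0, 0.0]), support_x, support_t, a, b, gaussian)))
```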

Page 24: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (3)

Fig. 2 Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function.

Page 25: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (4)

For overlapping class distributions, the SVM allows some of the training points to be misclassified, at the cost of a penalty: the "soft margin".

Page 26: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Classification (5)

For multiclass problems, there are methods that combine multiple two-class SVMs (a sketch follows the figure caption below):
- one versus the rest
- one versus one (requires more training time)

Fig. 3 Problems in multiclass classification using multiple SVMs
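As one possible illustration (using scikit-learn, which is not referenced in the slides), both combination schemes can be sketched as follows:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # a standard 3-class dataset

# One-versus-the-rest: trains 3 binary SVMs (each class against all others)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# One-versus-one: trains 3*(3-1)/2 = 3 binary SVMs, one per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```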

Page 27: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (1)

For regression problems, the mechanism to prevent overfitting is the "ε-insensitive error function": errors smaller than ε cost nothing, and larger errors grow linearly.

[Figure: comparison of the quadratic error function and the ε-insensitive error function.]
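A minimal sketch of the ε-insensitive error function (the value of ε is arbitrary):

```python
import numpy as np

def eps_insensitive_error(y, t, eps=0.1):
    """E_eps(y - t) = 0 if |y - t| < eps, otherwise |y - t| - eps."""
    return np.maximum(np.abs(y - t) - eps, 0.0)

print(eps_insensitive_error(np.array([0.05, 0.3, -0.5]), 0.0))  # -> [0.  0.2 0.4]
```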

Page 28: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (2)

Fig. 4 The ε-tube. Points inside the tube incur no error; a point outside the tube incurs error |y(x) - t| - ε.

Page 29: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (3)

After the optimization process, we obtain the prediction model

$$y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n)\, k(\mathbf{x}, \mathbf{x}_n) + b$$

We find that the coefficients are zero except for the support vectors, so the model is sparse.
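A minimal sketch of this regression model (the kernel, the support vectors, the coefficient differences a_n - â_n, and the bias are made-up illustrative values):

```python
import numpy as np

def svr_predict(x, support_x, coef, b, kernel):
    """y(x) = sum_n (a_n - a_hat_n) * k(x, x_n) + b; coef holds the differences a_n - a_hat_n."""
    return sum(c * kernel(x, x_n) for c, x_n in zip(coef, support_x)) + b

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
support_x = [np.array([0.2]), np.array([0.8])]   # made-up support vectors
coef = [1.5, -0.6]                               # made-up (a_n - a_hat_n) values
print(svr_predict(np.array([0.5]), support_x, coef, b=0.05, kernel=gaussian))
```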

Page 30: Christopher M. Bishop, Pattern Recognition and Machine Learning

SVM for Regression (4)

Fig. 5 Regression results. Support vectors lie on the boundary of the tube or outside the tube.

Page 31: Christopher M. Bishop, Pattern Recognition and Machine Learning

Disadvantages

- The SVM is not sparse enough: the number of support vectors typically grows linearly with the size of the training set.
- Predictions are not probabilistic.
- The error/margin trade-off parameters must be estimated by cross-validation, which wastes computation.
- Kernel functions are limited (they must be valid, positive-definite kernels).
- Multiclass classification problems require combining multiple binary SVMs.

Page 32: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 33: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (1)

The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations.

The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM.

Page 34: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (2)

The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM. The kernel functions are simply treated as basis functions, rather than as dot products in some feature space:

$$y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b$$
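To make the "kernels as basis functions" view concrete, a small sketch of the resulting design matrix (the bias column and the choice of kernel are the only assumptions here):

```python
import numpy as np

def kernel_design_matrix(X, kernel):
    """Kernels treated as basis functions: Phi[i, n] = k(x_i, x_n), plus a column of ones for b."""
    N = len(X)
    K = np.array([[kernel(X[i], X[n]) for n in range(N)] for i in range(N)])
    return np.hstack([K, np.ones((N, 1))])

# With weights w (length N) and bias b stacked as [w, b],
# the model evaluated at the training points is simply Phi @ [w, b].
```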

Page 35: Christopher M. Bishop, Pattern Recognition and Machine Learning

Bayesian Inference

Bayesian inference allows one to model uncertainty about the world and about outcomes of interest by combining common-sense knowledge and observational evidence.

Page 36: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (3)

In the Bayesian framework, we use a prior distribution over w to avoid overfitting:

$$p(\mathbf{w} \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha w_m^2}{2}\right)$$

where α is a hyperparameter that controls the model parameters w.

Page 37: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (4)

Goal: find the most probable α* and β* in order to compute the predictive distribution over t_new for a new input x_new, i.e.

$$p(t_{new} \mid \mathbf{x}_{new}, \mathbf{X}, \mathbf{t}, \alpha^*, \beta^*)$$

We obtain α* and β* by maximizing the likelihood function

$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$$

where X and t are the training data and their target values.

Page 38: Christopher M. Bishop, Pattern Recognition and Machine Learning

Relevance Vector Machines (5)

The RVM uses "automatic relevance determination" (ARD) to achieve sparsity:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{m=1}^{N} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m w_m^2}{2}\right)$$

where α_m represents the precision of w_m.

In the procedure of finding α_m*, some α_m go to infinity, which drives the corresponding w_m to zero; the remaining data points are the relevance vectors.
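A rough sketch of this procedure for regression, assuming a Gaussian likelihood and the standard evidence re-estimation updates (as in Tipping, 2001); this is illustrative only and omits the numerical refinements a real implementation needs:

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=100, alpha_cap=1e9):
    """Sketch of RVM training by evidence re-estimation (automatic relevance determination)."""
    N, M = Phi.shape
    alpha = np.ones(M)          # one precision hyperparameter per weight
    beta = 1.0                  # noise precision
    for _ in range(n_iter):
        # Posterior over w given the current alpha and beta
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # Re-estimate the hyperparameters
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (m ** 2 + 1e-12)
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
        alpha = np.minimum(alpha, alpha_cap)   # weights whose alpha hits the cap are effectively pruned
    relevant = alpha < alpha_cap               # surviving basis functions: the relevance vectors
    return m, relevant
```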

Page 39: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparisons - Regression

[Figure: regression fits from the RVM (with the one-standard-deviation band of the predictive distribution) and from the SVM.]

Page 40: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparisons - Regression


Page 41: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparison - Classification

[Figure: classification results compared for the RVM and the SVM.]

Page 42: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparison - Classification


Page 43: Christopher M. Bishop, Pattern Recognition and Machine Learning

Comparisons

- The RVM is much sparser and makes probabilistic predictions.
- The RVM gives better generalization in regression.
- The SVM gives better generalization in classification.
- The RVM is computationally demanding during training.

Page 44: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 45: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (1)

SVM for face detection


Page 46: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (2)

Marti Hearst, "Support Vector Machines", 1998

Page 47: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (3)

In feature-matching-based object tracking, SVMs are used to detect false feature matches.

Weiyu Zhu et al., "Tracking of Object with SVM Regression", 2001

Page 48: Christopher M. Bishop, Pattern Recognition and Machine Learning

Applications (4)

Recovering 3D human poses with the RVM.

A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression", 2004

Page 49: Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Page 50: Christopher M. Bishop, Pattern Recognition and Machine Learning

Conclusions

The SVM is a learning machine based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks.

The RVM has the same functional form as the SVM but provides probabilistic predictions and sparser solutions.

Page 51: Christopher M. Bishop, Pattern Recognition and Machine Learning

References

- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods," Cambridge University Press, 2000
- M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, 2001

Page 52: Christopher M. Bishop, Pattern Recognition and Machine Learning

Underfitting and Overfitting

[Figure: underfitting (model too simple) versus overfitting (model too complex), illustrated by how each fits new data. Adapted from http://www.dtreg.com/svm.htm]