PATTERN RECOGNITION AND MACHINE LEARNING, CHAPTER 7: SPARSE KERNEL MACHINES


TRANSCRIPT


PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 7: SPARSE KERNEL MACHINES


Outline

• The problem: finding a sparse decision (and regression) machine that uses kernels

• The solution: Support Vector Machines (SVMs) and Relevance Vector Machines (RVMs)

• The core ideas behind the solutions

• The mathematical details


The problem (1)

Methods introduced in chapters 3 and 4:

• Take into account all data points in the training set -> cumbersome

• Do not take advantage of kernel methods -> basis functions have to be explicit

Example: Least squares and logistic regression


The problem (2)

Kernel methods require evaluation of the kernel function k(x_n, x_m) for all pairs of training points x_n, x_m -> cumbersome
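For illustration (a minimal NumPy sketch, not part of the original slides): evaluating the kernel for all pairs means filling an N x N Gram matrix, i.e. N^2 kernel evaluations.

import numpy as np

def gram_matrix(X, kernel):
    # Fill the N x N Gram matrix K[n, m] = k(x_n, x_m): N^2 kernel evaluations.
    N = len(X)
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

# Example: Gaussian (RBF) kernel, N = 500 points -> 250,000 evaluations.
rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
X = np.random.randn(500, 2)
K = gram_matrix(X, rbf)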


The solution (1)

Support vector machines (SVMs) are kernel machines that compute a decision boundary making sparse use of data points
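As a quick illustration (a sketch using scikit-learn's SVC on synthetic data, not part of the original slides), typically only a small subset of the training points end up as support vectors:

import numpy as np
from sklearn.svm import SVC

# Toy binary classification problem: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(+2, 1, size=(100, 2))])
t = np.array([-1] * 100 + [+1] * 100)

clf = SVC(kernel="rbf", C=1.0).fit(X, t)
print("training points:", len(X))
print("support vectors:", len(clf.support_))  # the decision boundary depends only on these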


The solution (2)

Relevance vector machines (RVMs) are kernel machines that compute a posterior class probability making sparse use of data points


The solution (3)

SVMs as well as RVMs can also be used for regression

[Figure: SVM fit vs. RVM fit; the RVM is even sparser!]


SVM: The core idea (1)

The class separator that maximizes the margin between itself and the nearest data points will have the smallest generalization error.


SVM: The core idea (2)

In input space: [figure of the maximum-margin decision boundary, its margin, and the support vectors]


SVM: The core idea (3)

For regression: [figure of the ε-insensitive tube around the regression function]


RVM: The core idea (1)

Exclude basis vectors whose presence reduces the probability of the observed data
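In formulas (a compressed summary of the RVM mechanism, added here for reference): each weight w_i is given its own precision hyperparameter \alpha_i,

p(w \mid \alpha) = \prod_i \mathcal{N}(w_i \mid 0, \alpha_i^{-1}),

and the \alpha_i are set by maximizing the marginal likelihood of the observed targets. For many basis vectors this optimization drives \alpha_i \to \infty, the posterior of the corresponding w_i collapses onto zero, and the basis vector is effectively excluded.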


RVM: The core idea (2)

For classification and regression: [figure with two panels, Classification (left) and Regression (right)]


SVM: The details (1)

Equation of the decision surface:

Distance of a point from the decision surface:
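Presumably these are the standard expressions from the chapter:

y(x) = w^T \phi(x) + b

and, for a correctly classified point x_n with target t_n \in \{-1, +1\}, the distance is

\frac{t_n \, y(x_n)}{\|w\|} = \frac{t_n \left( w^T \phi(x_n) + b \right)}{\|w\|}.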


SVM: The details (2)

Distance of a point from the decision surface:

Maximum margin solution:
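Presumably the maximum margin solution referred to here is

\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n \left( w^T \phi(x_n) + b \right) \right] \right\},

i.e. w and b are chosen so as to maximize the distance of the closest data point to the decision surface.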


SVM: The details (3)

Distance of a point from the decision surface:

We may therefore rescale w and b such that

t_n \left( w^T \phi(x_n) + b \right) = 1

for the point closest to the surface.


SVM: The details (4)

Therefore, we can reduce the maximization of

\frac{1}{\|w\|} \min_n \left[ t_n \left( w^T \phi(x_n) + b \right) \right]

to the minimization of

\frac{1}{2} \|w\|^2

under the constraint

t_n \left( w^T \phi(x_n) + b \right) \ge 1, \qquad n = 1, \dots, N.


SVM: The details (5)

To solve this constrained problem, we introduce Lagrange multipliers a_n \ge 0 and minimize (with respect to w and b)

L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left( w^T \phi(x_n) + b \right) - 1 \right\}.

Equivalently, we can maximize the dual representation

\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(x_n, x_m)

subject to a_n \ge 0 and \sum_n a_n t_n = 0, where the kernel function k(x, x') = \phi(x)^T \phi(x') can be chosen without specifying \phi(x) explicitly.


SVM: The details (6)

Because of the constraint

a_n \left\{ t_n y(x_n) - 1 \right\} = 0,

only those a_n survive for which x_n lies on the margin, i.e.

t_n y(x_n) = 1.

All other points get a_n = 0 and drop out of the solution. This leads to sparsity.


SVM: The details (7)

Based on numerical optimization of the parameters a_n and b, predictions for new data points x can be made by evaluating the sign of

y(x) = \sum_{n=1}^{N} a_n t_n \, k(x, x_n) + b,

where only the support vectors (the points with a_n > 0) contribute to the sum.
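As a small illustration (a NumPy sketch, not part of the original slides; the multipliers a and the bias b are assumed to come from a quadratic-programming solver):

import numpy as np

def svm_predict(x, X_train, t, a, b, kernel):
    # y(x) = sum_n a_n * t_n * k(x, x_n) + b; only support vectors (a_n > 0) contribute.
    support = np.where(a > 1e-10)[0]
    y = sum(a[n] * t[n] * kernel(x, X_train[n]) for n in support) + b
    return np.sign(y)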


SVM: The details (8)

In cases where the data points are not separable in feature space, we need a soft margin, i.e. a (limited) tolerance for misclassified points.

To achieve this, we introduce slack variables \xi_n \ge 0 with

t_n y(x_n) \ge 1 - \xi_n,

so that a point with 0 < \xi_n \le 1 lies inside the margin but on the correct side of the decision boundary, and a point with \xi_n > 1 is misclassified.


SVM: The details (9)

Graphically: [figure illustrating the soft margin and the slack variables ξ_n]


SVM: The details (10)

The same procedure as before (with additional Lagrange multipliers and corresponding additional constraints) again yields a sparse kernel-based solution:
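Presumably the solution is the same dual-form predictor as before, now with box constraints on the multipliers:

y(x) = \sum_{n=1}^{N} a_n t_n \, k(x, x_n) + b, \qquad 0 \le a_n \le C, \qquad \sum_{n=1}^{N} a_n t_n = 0.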


SVM: The details (11)

The soft-margin approach can be formulated as minimizing the regularized error function

C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2.

This formulation can be extended to use SVMs for regression:

C \sum_{n=1}^{N} \left( \xi_n + \hat{\xi}_n \right) + \frac{1}{2} \|w\|^2,

where \xi_n \ge 0 and \hat{\xi}_n \ge 0 are slack variables describing the position of a data point above or below a tube of width 2ϵ around the estimate y(x).


SVM: The details (12)

Graphically: [figure illustrating SVM regression with the ε-insensitive tube and the slack variables ξ_n, ξ̂_n]


SVM: The details (13)

Again, optimization using Lagrange multipliers yields a sparse kernel-based solution:
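Presumably this is the regression counterpart of the earlier predictor, with two sets of multipliers (one for each side of the tube):

y(x) = \sum_{n=1}^{N} \left( a_n - \hat{a}_n \right) k(x, x_n) + b, \qquad 0 \le a_n \le C, \quad 0 \le \hat{a}_n \le C.

Only points lying on or outside the ε-tube obtain non-zero multipliers, which again gives sparsity.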


SVM: Limitations

• Output is a decision, not a posterior probability

• Extension of classification to more than two classes is problematic

• The parameters C and ϵ have to be found by methods such as cross-validation (see the sketch after this list)

• Kernel functions are required to be positive definite
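
For instance (a minimal sketch using scikit-learn's SVR on synthetic data, not part of the original slides), C and ϵ can be chosen by grid search with cross-validation:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic 1-D regression data: a noisy sinusoid.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# Cross-validate over C and epsilon for an RBF-kernel support vector regressor.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "epsilon": [0.01, 0.1, 0.5]},
    cv=5,
)
search.fit(X, t)
print(search.best_params_)
print("support vectors:", len(search.best_estimator_.support_))  # sparse subset of the 200 points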