Pattern Recognition and Machine Learning



PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 7: SPARSE KERNEL MACHINES

Outline

• The problem: finding a sparse decision (and regression) machine that uses kernels

• The solution: Support Vector Machines (SVMs) and Relevance Vector Machines (RVMs)

• The core ideas behind the solutions

• The mathematical details

The problem (1)

Methods introduced in Chapters 3 and 4:

• Take into account all data points in the training set -> cumbersome

• Do not take advantage of kernel methods -> basis functions have to be explicit

Example: Least squares and logistic regression

The problem (2)

Kernel methods require evaluation of the kernel function for all pairs of training points -> cumbersome
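With N training points this means computing the full N x N Gram matrix (a sketch in the standard notation of Chapter 6):

K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m), \qquad n, m = 1, \dots, N

so both computation and storage scale as O(N^2).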

The solution (1)

Support vector machines (SVMs) are kernel machines that compute a decision boundary making sparse use of data points

The solution (2)

Relevance vector machines (RVMs) are kernel machines that compute a posterior class probability making sparse use of data points

The solution (3)

SVMs as well as RVMs can also be used for regression

[Figure: SVM vs. RVM regression fits; the RVM is even sparser!]

SVM: The core idea (1)

The class separator that maximizes the margin between itself and the nearest data points has the smallest generalization error:

SVM: The core idea (2)

In input space:

SVM: The core idea (3)

For regression:

RVM: The core idea (1)

Exclude basis vectors whose presence reduces the probability of the observed data

RVM: The core idea (2)

For classification and regression:

Classification Regression
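In both cases the underlying model is linear in one kernel basis function per training point, with an individual Gaussian prior precision \alpha_i on each weight (a sketch of the standard RVM form, following Section 7.2 of the book):

y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b, \qquad p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}(w_i \mid 0, \alpha_i^{-1})

For classification the output is passed through a logistic sigmoid; for regression it is used directly with Gaussian noise. Maximizing the evidence drives many \alpha_i \to \infty, which forces the corresponding w_i to zero and prunes those basis vectors.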

SVM: The details (1)

Equation of the decision surface:

Distance of a point from the decision surface:
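Presumably these are the standard two-class linear model in feature space and the signed distance of a correctly classified point, with targets t_n \in \{-1, +1\}:

y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b = 0 \qquad \text{(decision surface)}

\frac{t_n \, y(\mathbf{x}_n)}{\lVert \mathbf{w} \rVert} \qquad \text{(distance of } \mathbf{x}_n \text{ from the surface)}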

SVM: The details (2)

Distance of a point from the decision surface:

Maximum margin solution:
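In this notation the maximum margin solution is presumably:

\arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\lVert \mathbf{w} \rVert} \min_n \left[ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \right] \right\}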

SVM: The details (3)

Distance of a point from the decision surface:

We may therefore rescale \mathbf{w} and b such that the point closest to the surface satisfies:
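Presumably the canonical condition, and the resulting constraint on all points, are:

t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) = 1 \quad \text{(closest point)}, \qquad t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1 \quad \text{(all points)}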

SVM: The details (4)

Therefore, maximizing the margin can be reduced to an equivalent constrained minimization.
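The problem presumably takes the standard quadratic-programming form (maximizing 1/\lVert \mathbf{w} \rVert is equivalent to minimizing \frac{1}{2}\lVert \mathbf{w} \rVert^2):

\min_{\mathbf{w}, b} \ \frac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1, \quad n = 1, \dots, N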

SVM: The details (5)

To solve this, we introduce Lagrange multipliers a_n \ge 0 and minimize the resulting Lagrangian with respect to \mathbf{w} and b.

Equivalently, we can maximize the dual representation

where the kernel function k(\mathbf{x}, \mathbf{x}') can be chosen without specifying the basis functions \phi(\mathbf{x}) explicitly.
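Written out, the Lagrangian and its dual are presumably the standard ones, with one multiplier a_n per data point:

L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) - 1 \right\}

\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m), \qquad a_n \ge 0, \quad \sum_{n=1}^{N} a_n t_n = 0

with k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}').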

SVM: The details (6)

Because of the complementary-slackness constraint, only those multipliers a_n survive for which \mathbf{x}_n lies on the margin, i.e. t_n y(\mathbf{x}_n) = 1. This leads to sparsity.
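The full Karush-Kuhn-Tucker conditions behind this statement are presumably:

a_n \ge 0, \qquad t_n y(\mathbf{x}_n) - 1 \ge 0, \qquad a_n \left\{ t_n y(\mathbf{x}_n) - 1 \right\} = 0

Points with a_n = 0 drop out of the prediction; the remaining points are the support vectors.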

SVM: The details (7)

Based on numerical optimization of the parameters \{a_n\} and b, predictions on new data points can be made by evaluating the sign of y(\mathbf{x}).
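Presumably y(\mathbf{x}) here is the standard kernel expansion, in which the sum effectively runs only over the support vectors:

y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b

with b fixed by the margin condition t_n y(\mathbf{x}_n) = 1 of any support vector.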

SVM: The details (8)

In cases where the data points are not separable in feature space, we need a soft margin, i.e. a (limited) tolerance for misclassified points.

To achieve this, we introduce slack variables \xi_n \ge 0, one per data point, with \xi_n = 0 for points on or inside the correct margin boundary and \xi_n = |t_n - y(\mathbf{x}_n)| for all other points.
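The resulting soft-margin problem is presumably the standard one:

\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad t_n y(\mathbf{x}_n) \ge 1 - \xi_n, \quad \xi_n \ge 0

where the parameter C > 0 controls the trade-off between the slack-variable penalty and the margin.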

SVM: The details (9)

Graphically:

SVM: The details (10)

The same procedure as before (with additional Lagrange multipliers and corresponding additional constraints) again yields a sparse kernel-based solution:
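Presumably the dual is the same as before, now with box constraints on the multipliers:

\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m), \qquad 0 \le a_n \le C, \quad \sum_{n=1}^{N} a_n t_n = 0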

SVM: The details (11)

The soft-margin approach can be formulated as minimizing the regularized error function

This formulation can be extended to use SVMs for regression:

where \xi_n \ge 0 and \hat{\xi}_n \ge 0 are slack variables describing the position of a data point above or below a tube of width 2\epsilon around the estimate y(\mathbf{x}).
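Written out, the two regularized error functions are presumably:

C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \lVert \mathbf{w} \rVert^2 \qquad \text{(classification)}

C \sum_{n=1}^{N} \left( \xi_n + \hat{\xi}_n \right) + \frac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n, \quad t_n \ge y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n \qquad \text{(regression)}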

SVM: The details (12)

Graphically:

SVM: The details (13)

Again, optimization using Lagrange multipliers yields a sparse kernel-based solution:
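The regression prediction then presumably has the standard form:

y(\mathbf{x}) = \sum_{n=1}^{N} \left( a_n - \hat{a}_n \right) k(\mathbf{x}, \mathbf{x}_n) + b

where a_n and \hat{a}_n are the Lagrange multipliers for the upper and lower tube constraints; only points on or outside the \epsilon-tube have nonzero multipliers, which again gives sparsity.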

SVM: Limitations

• Output is a decision, not a posterior probability

• Extension of classification to more than two classes is problematic

• The parameters C and ϵ have to be found by methods such as cross validation

• Kernel functions are required to be positive definite
