An Introduction to Kernel-Based Learning Algorithms. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf. Presented by: Joanna Giforos, CS8980: Topics in Machine Learning, 9 March 2006.


TRANSCRIPT

Page 1:

An Introduction to Kernel-Based Learning Algorithms

K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf

Presented by: Joanna Giforos

CS8980: Topics in Machine Learning

9 March, 2006

Page 2:

Outline

Problem Description

Nonlinear Algorithms in Kernel Feature Space

Supervised Learning: Nonlinear SVM, Kernel Fisher Discriminant Analysis

Unsupervised Learning: Kernel Principal Component Analysis

Applications

Model-specific kernels

Page 3:

Problem Description

2-class classification: estimate a function $f$ using input-output training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$

such that $f$ will correctly classify unseen examples $(x, y)$.

i.e., find a mapping $f: \mathbb{R}^d \to \{-1, +1\}$.

Assume: training and test data are drawn from the same probability distribution $P(x, y)$.

Page 4:

Problem Description

A learning machine is a family of functions $\{f(\cdot, \alpha) : \alpha \in \Lambda\}$. For a task of learning two classes, $f(x, \alpha) \in \{-1, +1\}$ for all $x$ and $\alpha$.

Too complex $\Rightarrow$ overfitting; not complex enough $\Rightarrow$ underfitting.

Want to find the right balance between accuracy and complexity.

Page 5:

Problem Description

The best $f(\cdot, \alpha)$ is the one that minimizes the expected risk (test error):

$$R(\alpha) = \int \tfrac{1}{2}\,|f(x, \alpha) - y| \; dP(x, y)$$

Since $P(x, y)$ is unknown, we instead minimize the empirical risk (training error):

$$R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2}\,|f(x_i, \alpha) - y_i|$$

$R_{emp}(\alpha) \to R(\alpha)$ as $n \to \infty$.

Page 6:

Structural Risk Minimization

Construct a nested family of function classes $F_1 \subset F_2 \subset \cdots \subset F_k$ with non-decreasing VC dimensions $h_1 \leq h_2 \leq \cdots \leq h_k$.

Let $f_1, \ldots, f_k$ be the solutions of empirical risk minimization in each class.

SRM chooses the function class $F_i$ and the function $f_i$ such that an upper bound on the generalization error is minimized.
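A standard form of this bound (Vapnik's VC bound; the notation here is mine): with probability at least $1 - \delta$ over a sample of size $n$, a function class of VC dimension $h$ satisfies

$$R(\alpha) \;\leq\; R_{emp}(\alpha) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\delta}{4}}{n}}$$

The second term grows with the capacity $h$, so minimizing the bound trades training error against complexity.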

Page 7:

Nonlinear Algorithms in Kernel Feature Space

Via a non-linear mapping $\Phi: \mathbb{R}^d \to \mathcal{F}$, $x \mapsto \Phi(x)$,

the data $x_1, \ldots, x_n \in \mathbb{R}^d$ are mapped into a potentially much higher dimensional feature space $\mathcal{F}$.

Given this mapping, we can compute scalar products in feature space using kernel functions: $k(x, z) = \langle \Phi(x), \Phi(z) \rangle$.

$\Phi$ does not need to be known explicitly $\Rightarrow$ every linear algorithm that only uses scalar products can implicitly be executed in $\mathcal{F}$ by using kernels.

Page 8:

Nonlinear Algorithms in Kernel Feature Space: Example
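A minimal numerical illustration of this point, assuming the classic degree-2 polynomial map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, whose scalar product equals the kernel $k(x, z) = \langle x, z\rangle^2$:

```python
import numpy as np

# Explicit degree-2 feature map for 2-D inputs: Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# Degree-2 polynomial kernel: k(x, z) = <x, z>^2
def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both quantities are identical: the kernel computes the scalar product
# in feature space without ever forming Phi(x) explicitly.
print(np.dot(phi(x), phi(z)))  # 1.0
print(poly_kernel(x, z))       # 1.0
```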

Page 9:

Supervised Learning: Nonlinear SVM

Consider linear classifiers in feature space using dot products: $f(x) = \operatorname{sgn}\left(\langle w, \Phi(x)\rangle + b\right)$.

Conditions for classification without training error: $y_i\left(\langle w, \Phi(x_i)\rangle + b\right) \geq 1$, $i = 1, \ldots, n$.

GOAL: Find $w$ and $b$ such that the empirical risk and the regularization term are minimized.

But we cannot explicitly access $w$ in the feature space, so we introduce Lagrange multipliers $\alpha_i$, one for each of the above constraints.
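Writing out the standard hard-margin Lagrangian (notation is mine) makes the step to the next slide explicit:

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i\left(\langle w, \Phi(x_i)\rangle + b\right) - 1 \right], \qquad \alpha_i \geq 0$$

Setting the derivatives with respect to $w$ and $b$ to zero gives $w = \sum_i \alpha_i y_i \Phi(x_i)$ and $\sum_i \alpha_i y_i = 0$, which eliminates $w$ and leads to the dual problem on the next slide.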

Page 10:

Supervised Learning: Nonlinear SVM

Last class we saw that the nonlinear SVM primal problem is:

$$\min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(\langle w, \Phi(x_i)\rangle + b\right) \geq 1, \; i = 1, \ldots, n$$

which leads to the dual:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \; \sum_{i=1}^{n} \alpha_i y_i = 0$$

Page 11:

Supervised Learning: Nonlinear SVM

Using the KKT optimality conditions on the dual SVM problem, we obtain the complementarity conditions:

$$\alpha_i \left[ y_i\left(\langle w, \Phi(x_i)\rangle + b\right) - 1 \right] = 0, \quad i = 1, \ldots, n$$

The solution is sparse in $\alpha$ $\Rightarrow$ many patterns are outside the margin area and their optimal $\alpha_i$'s are zero.

Without sparsity, SVM would be impractical for large data sets.

Page 12:

Supervised Learning: Nonlinear SVM

The dual problem can be rewritten as:

$$\max_{\alpha} \; \mathbf{1}^{\top}\alpha - \tfrac{1}{2}\,\alpha^{\top} Q\, \alpha \quad \text{subject to} \quad \alpha \geq 0, \; y^{\top}\alpha = 0$$

where $Q_{ij} = y_i y_j\, k(x_i, x_j)$.

Since this is a convex QP (a concave objective maximized over a convex feasible set), every local maximum is a global maximum, but there can be several optimal solutions (in terms of the $\alpha_i$).

Once the $\alpha_i$'s are found using QP solvers, simply plug them into the prediction rule:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i\, k(x, x_i) + b \right)$$
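A minimal sketch of this pipeline, assuming an RBF kernel and scikit-learn's SVC with a precomputed Gram matrix (the data and parameters below are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# RBF kernel k(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs of rows
def rbf_kernel(A, B, gamma=0.5):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)   # not linearly separable in input space

K = rbf_kernel(X, X)
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)    # the QP solver only sees the Gram matrix

# Prediction rule f(x) = sgn(sum_i alpha_i y_i k(x, x_i) + b),
# where the sum runs only over the (sparse) set of support vectors.
X_test = rng.normal(size=(5, 2))
K_sv = rbf_kernel(X_test, X[clf.support_])
decision = K_sv @ clf.dual_coef_.ravel() + clf.intercept_
print(np.sign(decision))
print(clf.predict(rbf_kernel(X_test, X)))            # same labels via scikit-learn's own rule
```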

Page 13:

Supervised Learning: KFD

Discriminant analysis seeks to find a projection of the data in a direction that is efficient for discrimination.

Image from: R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley & Sons, INC., 2001.

Page 14:

Supervised Learning: KFD

Solve Fisher's linear discriminant in kernel feature space. It aims at finding linear projections such that the classes are well separated:

How far apart are the projected means? (should be large)

How big is the variance of the data in this direction? (should be small)

Recall that this can be achieved by maximizing the Rayleigh quotient

$$J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w}$$

where $S_B = (m_1 - m_2)(m_1 - m_2)^{\top}$ is the between-class scatter and $S_W = \sum_{j=1,2} \sum_{x \in C_j} (x - m_j)(x - m_j)^{\top}$ is the within-class scatter, with $m_j$ the mean of class $C_j$.

Page 15:

Supervised Learning: KFD

In kernel feature space $\mathcal{F}$, express $w$ in terms of the mapped training patterns, $w = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$,

to get:

$$J(\alpha) = \frac{\alpha^{\top} M\, \alpha}{\alpha^{\top} N\, \alpha}$$

where $M = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{\top}$ with $(\mu_j)_i = \frac{1}{n_j} \sum_{x \in C_j} k(x_i, x)$, and $N = \sum_{j=1,2} K_j (I - \mathbf{1}_{n_j}) K_j^{\top}$, with $K_j$ the $n \times n_j$ kernel matrix between all training points and the points of class $j$ and $\mathbf{1}_{n_j}$ the $n_j \times n_j$ matrix with all entries $1/n_j$.

Page 16:

Supervised Learning: KFD

The projection of a test point onto the discriminant is computed by $\langle w, \Phi(x)\rangle = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$.

We can solve the generalized eigenvalue problem $M\alpha = \lambda N\alpha$.

But $N$ and $M$ may be large and non-sparse; we can instead transform KFD into a convex QP problem.

Question – can we use numerical approximations to the eigenvalue problem?
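As an illustration of the dual formulation above, a minimal NumPy sketch (assuming an RBF kernel; a small regularizer is added since $N$ may be singular, and the rank-one structure of $M$ is used to avoid an explicit eigensolver):

```python
import numpy as np

# RBF kernel, as in the SVM sketch above
def rbf_kernel(A, B, gamma=0.5):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kfd_fit(X, y, gamma=0.5, reg=1e-3):
    """Return the dual coefficients alpha of the kernel Fisher discriminant."""
    K = rbf_kernel(X, X, gamma)                            # n x n Gram matrix
    # (mu_j)_i = mean over class-j points x of k(x_i, x)
    mu = {c: K[:, y == c].mean(axis=1) for c in (-1, 1)}
    # Within-class matrix N = sum_j K_j (I - 1_{n_j}) K_j^T
    N = np.zeros_like(K)
    for c in (-1, 1):
        Kc = K[:, y == c]
        nc = Kc.shape[1]
        N += Kc @ (np.eye(nc) - np.ones((nc, nc)) / nc) @ Kc.T
    N += reg * np.eye(len(X))                              # regularize: N may be singular
    # Since M has rank one, the leading direction of the generalized eigenproblem
    # is alpha proportional to N^{-1} applied to the difference of the class means.
    return np.linalg.solve(N, mu[1] - mu[-1])

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + 0.3 * X[:, 1]**2 > 0, 1, -1)
alpha = kfd_fit(X, y)
# Projection onto the discriminant: <w, Phi(x)> = sum_i alpha_i k(x_i, x)
projections = rbf_kernel(X, X) @ alpha
```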

Page 17:

Supervised Learning: KFD

We can reformulate KFD as a constrained optimization problem. FD tries to minimize the variance of the data along the projection whilst maximizing the distance between the projected means.

This QP is equivalent to maximizing $J(\alpha)$, since $M$ is a matrix of rank 1 (its columns are linearly dependent) and the solutions of $J(\alpha)$ are invariant under scaling.

$\Rightarrow$ We can fix the distance of the means to some arbitrary positive value and just minimize the variance.

Page 18:

Connection Between Boosting and Kernel Methods

It can be shown that boosting maximizes the smallest margin $\rho$.

Recall that the SVM maximizes the margin by minimizing $\|w\|_2$.

In general, using an $\ell_p$ norm constraint on the weight vector leads to maximizing the $\ell_q$ distance between the hyperplane and the training points, where $1/p + 1/q = 1$.

Boosting uses the $\ell_1$ norm; SVM uses the $\ell_2$ norm.
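The underlying fact is Hölder duality; in my notation, the $\ell_q$ distance from a point $x$ to the hyperplane $\{z : \langle w, z\rangle + b = 0\}$ is

$$\operatorname{dist}_q(x) = \frac{|\langle w, x\rangle + b|}{\|w\|_p}, \qquad \frac{1}{p} + \frac{1}{q} = 1,$$

so constraining $\|w\|_p$ while maximizing the functional margin maximizes the $\ell_q$ geometric margin: $p = q = 2$ for the SVM, and $p = 1$, $q = \infty$ for boosting.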

Page 19:

Unsupervised Methods: Linear PCA

Principal Component Analysis (PCA) attempts to efficiently represent the data by finding orthonormal axes which maximally decorrelate the data.

Given centered observations $x_1, \ldots, x_n \in \mathbb{R}^d$ with $\sum_{i=1}^{n} x_i = 0$,

PCA finds the principal axes by diagonalizing the covariance matrix

$$C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^{\top}$$

Note that $C$ is positive semidefinite, and thus can be diagonalized with nonnegative eigenvalues.

Page 20:

Unsupervised Methods: Linear PCA

The eigenvectors lie in the span of $x_1, \ldots, x_n$: from $Cv = \lambda v$ it can be shown that

$$\lambda v = C v = \frac{1}{n} \sum_{i=1}^{n} \langle x_i, v\rangle\, x_i$$

But $\lambda$ is just a scalar, so all solutions $v$ with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_n$, i.e. $v = \sum_{i=1}^{n} \alpha_i x_i$.

Page 21:

Unsupervised Methods: Kernel PCA

If we first map the data into another space via $\Phi: \mathbb{R}^d \to \mathcal{F}$,

then, assuming we can center the data ($\sum_{i=1}^{n} \Phi(x_i) = 0$), we can write the covariance matrix as

$$\bar{C} = \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^{\top}$$

which can be diagonalized with nonnegative eigenvalues $\lambda$ satisfying $\lambda v = \bar{C} v$.

Page 22:

Unsupervised Methods: Kernel PCA

As in linear PCA, all solutions $v$ with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_n)$, i.e. $v = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$.

Substituting into $\lambda v = \bar{C} v$, we get:

$$\lambda \sum_{i=1}^{n} \alpha_i \Phi(x_i) = \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \left\langle \Phi(x_i), \sum_{j=1}^{n} \alpha_j \Phi(x_j) \right\rangle$$

where $K$ is the inner product (kernel) matrix, $K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j)\rangle$.

Premultiplying both sides by $\Phi(x_k)^{\top}$, we finally get:

$$n\lambda\, K\alpha = K^2\alpha, \quad \text{solved by the eigenvalue problem} \quad n\lambda\,\alpha = K\alpha$$

Page 23:

Unsupervised Methods: Kernel PCA

The resulting set of eigenvectors is then used to extract the principal components of a test point $x$ by:

$$\langle v^k, \Phi(x)\rangle = \sum_{i=1}^{n} \alpha_i^k\, k(x_i, x)$$
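A minimal sketch of the whole procedure, assuming an RBF kernel (kernel centering is done on the Gram matrix; centering of the test kernel is omitted for brevity):

```python
import numpy as np

# RBF kernel, as in the earlier sketches
def rbf_kernel(A, B, gamma=0.5):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_pca_fit(X, n_components=2, gamma=0.5):
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    # Center in feature space, using only the Gram matrix
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    # Solve n*lambda*alpha = Kc*alpha; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Normalize so each v^k has unit norm in feature space: alpha^k . (Kc alpha^k) = 1
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return alphas

def kernel_pca_transform(alphas, X_train, X_new, gamma=0.5):
    # Principal components of a test point: <v^k, Phi(x)> = sum_i alpha_i^k k(x_i, x)
    return rbf_kernel(X_new, X_train, gamma) @ alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
alphas = kernel_pca_fit(X)
print(kernel_pca_transform(alphas, X, X[:5]).shape)   # (5, 2)
```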

Page 24:

Unsupervised Methods: Kernel PCA

Nonlinearities only enter the computation at two points:

in the calculation of the kernel matrix $K$

in the evaluation of new points

Drawback of kernel PCA: for large data sets there are storage and computational complexity issues. One can use sparse approximations of $K$.

Question: Can we think of other unsupervised methods which can make use of kernels? Kernel k-means, kernel ICA, spectral clustering (see the sketch below).
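As one concrete answer to the question above, a minimal sketch of kernel k-means (an illustrative helper of my own, assuming a precomputed Gram matrix K such as the RBF kernel used earlier); like kernel PCA, it only ever touches scalar products:

```python
import numpy as np

def kernel_kmeans(K, n_clusters=2, n_iter=20, seed=0):
    """Illustrative kernel k-means: cluster using only the Gram matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:            # guard against an emptied cluster
                dist[:, c] = np.inf
                continue
            # ||Phi(x_i) - mean of Phi over cluster c||^2, written with kernel values only
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        labels = dist.argmin(axis=1)
    return labels
```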

Page 25:

Unsupervised Methods: Linear PCA

Page 26:

Unsupervised Methods: Kernel PCA

Page 27:

Applications

Support Vector Machines and Kernel Fisher Discriminant:
Bioinformatics: protein classification
OCR
Face recognition
Content-based image retrieval
Decision tree predictive modeling
…

Kernel PCA:
Denoising
Compression
Visualization
Feature extraction for classification

Page 28:

Kernels for Specific Applications

Image segmentation: Gaussian-weighted $\chi^2$-distance between local color histograms. This can be shown to be robust for color and texture discrimination.

Text classification: Vector Space kernels

Structured Data (strings, trees, etc.): Spectrum kernels

Generative models: P-kernels, Fisher kernels