

Machine Learning: Basic Principles

Sparse Kernel Machines

T-61.6020 Special Course in Computer and Information Science

20070226 | Janne Argillander, [email protected]


Contents

• Introduction
• Maximum margin classifier
• Overlapping class distributions
• Multiclass classifier
• Relevance Vector Machines
• Outroduction


Introduction


• Algorithms based on non-linear kernels are attractive, but because the kernel function k(x_n, x_m) has to be evaluated for all pairs of training points, the computational complexity is high.
• Some kind of "reduction" has to be done.
• We say a solution is sparse when only a subset of the training data points is needed for evaluation (see the sketch below).
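To make the cost concrete, here is a minimal NumPy sketch (the RBF kernel, the data sizes and the 30-point "retained" subset are illustrative assumptions, not something from the slides): the dense formulation touches all N × N kernel values, while a sparse model only evaluates the kernel between a new point and the retained subset.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))

# Dense formulation: all N*N kernel values are needed.
K_full = rbf_kernel(X_train, X_train)        # 1000 x 1000 evaluations

# Sparse formulation: prediction only needs the kernel between the test
# point and the (much smaller) retained subset of training points.
retained = X_train[:30]                      # pretend 30 points were kept
x_test = rng.normal(size=(1, 2))
k_sparse = rbf_kernel(x_test, retained)      # 1 x 30 evaluations
print(K_full.shape, k_sparse.shape)
```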


Which boundary should we pick?


Maximum margin classifier


• The Support Vector Machine (SVM) approaches the problem stated on the previous slide through the concept of the margin.
• The margin is the smallest distance between the decision boundary and any of the training samples.
• The points that attain this smallest distance define the decision boundary and are called support vectors.


Properties of SVM

• The SVM does not provide posterior probabilities. The output value can be seen as a distance from the decision boundary to the test point (measured in the feature space); see the sketch after this list.
  – Considering a retrieval task (most probable first), how do we sort the results?
  – Can we combine two different methods or models?
• Can be used for classification, regression and data reduction.
  – In the classification task the samples are usually annotated as positive or negative examples; SVMs can also be applied to examples without annotations.
• The SVM is fundamentally a two-class classifier.
  – What if we have more classes?
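As a small illustration of these points, the scikit-learn sketch below (the toy data set and parameter values are illustrative choices) trains a soft-margin SVM, reports how many training points end up as support vectors, and ranks samples by their decision_function score, a signed distance-like quantity rather than a posterior probability.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two-class toy data.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # soft-margin SVM with an RBF kernel
clf.fit(X, y)

# Sparsity: only the support vectors are needed at prediction time.
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))

# No posterior probabilities by default; decision_function gives a signed
# distance-like score that can be used to sort retrieval results
# ("most confident first").
scores = clf.decision_function(X)
ranking = np.argsort(-scores)    # indices sorted by decreasing score
print(ranking[:5])
```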


Properties of SVM

• The naive solution is slow, O(N³), but with the help of Lagrange multipliers the task drops to between O(N) and O(N²).
• Handles a data set that is not linearly separable by mapping it into a feature space where it becomes linearly separable.
  – This space is defined by the non-linear kernel function k (see the sketch below).
  – The dimensionality of the feature space can be high.
• The traditional SVM does not like outliers.
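A small sketch of what "the feature space defined by the kernel" means, using the quadratic kernel k(x, z) = (x · z + 1)² purely as an illustrative choice: evaluating the kernel in the two-dimensional input space gives exactly the inner product of explicit six-dimensional feature vectors, so the feature space can have much higher dimension than the space we actually compute in.

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for k(x, z) = (x . z + 1)^2 with 2-D inputs."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

k_direct  = (x @ z + 1.0) ** 2                     # evaluated in input space
k_feature = poly2_features(x) @ poly2_features(z)  # dot product in feature space
print(k_direct, k_feature)                         # both give 2.89
```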




Overlapping class distributions

• A penalty that increases with the distance from the margin (slack variables).
• ν-SVM: a method that can handle data sets with a few outliers.
• The parameters have to be estimated:
  – cross-validation over a hold-out set (see the sketch below).
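A minimal sketch of this parameter estimation step, assuming a scikit-learn ν-SVM on an illustrative noisy toy data set: ν (which bounds the fraction of margin errors, i.e. tolerated outliers) and the kernel width γ are chosen by cross-validation, exactly the kind of hold-out procedure mentioned above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

# Overlapping, noisy two-class data.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# nu upper-bounds the fraction of margin errors and lower-bounds the
# fraction of support vectors; nu and gamma are the parameters that
# "have to be estimated".
grid = GridSearchCV(
    NuSVC(kernel="rbf"),
    param_grid={"nu": [0.05, 0.1, 0.2, 0.4], "gamma": [0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```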


Multiclass classifier


SVM is fundamentally a two-class classifier.

Let us consider a problem with K classes.

• One-versus-rest approach (1-BGM)
  – Training: C_k is marked as positive, the remaining K−1 classes as negative.
  – The training sets will be imbalanced (if we had symmetry, it is now lost), e.g. 10 'faces' vs. 100 000 non-faces.
  – What if we had similar classes?
  – Can lead to very, very slow models, O(KN²) – O(K²N²).
• One-versus-one approach
  – Training: two-class SVMs on all possible pairs of classes (see the sketch below).
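The sketch below contrasts the two strategies using scikit-learn's generic wrappers around a two-class SVM; the toy data set is an illustrative choice, and the point is the number of binary classifiers each strategy trains (K versus K(K−1)/2).

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Toy problem with K = 4 classes.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# One-versus-rest: K binary SVMs, each trained on an imbalanced set.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# One-versus-one: K(K-1)/2 binary SVMs, one per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(len(ovr.estimators_), "one-vs-rest classifiers")  # 4
print(len(ovo.estimators_), "one-vs-one classifiers")   # 4 * 3 / 2 = 6
```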


Multiclass = problems?

[Figure: decision regions obtained with one-versus-rest and one-versus-one]

HINT: In general, multiclass classification is still an open issue → solve it and become famous!


Relevance Vector Machines


On the way to even sparser solutions

• Some limitations of the SVM:
  – The outputs of an SVM represent decisions, not posterior probabilities.
  – The trade-off parameter controlling the slack variables has to be found with a hold-out method.
  – The kernel must be positive definite.
  – SVM is just not cool anymore?
• The Relevance Vector Machine (RVM) builds on ideas from the SVM:
  – Shares many (good) characteristics of the SVM.
  – No restriction to positive definite kernels.
  – No cross-validation needed.
  – Number of relevance vectors << number of support vectors → faster, sparser predictions.
  – When training the RVM we have to compute the inverse of an M × M matrix (where M is the number of basis functions) → O(M³) → RVM training takes more time (see the sketch below).
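A minimal NumPy sketch of RVM training for regression, following the evidence re-estimation updates of Tipping (2000); the kernel width, iteration count and pruning threshold are illustrative assumptions. The M × M inverse inside the loop is the O(M³) cost mentioned above, and the pruning step is where the sparsity (the small set of relevance vectors) comes from.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rvm_regression(X, t, gamma=1.0, n_iter=100, alpha_max=1e6):
    """Sparse Bayesian (RVM) regression via evidence re-estimation."""
    N = X.shape[0]
    Phi = np.hstack([np.ones((N, 1)), rbf_kernel(X, X, gamma)])  # bias + kernels
    keep = np.arange(Phi.shape[1])   # indices of surviving basis functions
    alpha = np.ones(Phi.shape[1])    # one precision hyperparameter per weight
    beta = 1.0                       # noise precision
    for _ in range(n_iter):
        # Posterior over the weights: the O(M^3) inverse mentioned above.
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        g = 1.0 - alpha * np.diag(Sigma)   # "well-determinedness" of each weight
        alpha = g / (m ** 2 + 1e-12)       # re-estimate the alphas
        beta = (N - g.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
        # Prune basis functions whose alpha diverges: their weights go to zero.
        mask = alpha < alpha_max
        alpha, Phi, keep = alpha[mask], Phi[:, mask], keep[mask]
    # Recompute the posterior mean for the surviving basis functions.
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, keep, beta
```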


[Figure: classification with the ν-SVM and with the RVM]


Function approximation (RVM)

[Figure: RVM approximation of sinc(x) and of sinc(x) + Gaussian noise; a small sketch reproducing this setup follows below]

Tipping, M. E. (2000). The Relevance Vector Machine
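A usage sketch that reproduces the flavour of this experiment with the rvm_regression and rbf_kernel functions from the previous sketch; the noise level and kernel width are illustrative.

```python
import numpy as np

# Noisy samples of sinc(x) = sin(x)/x.
rng = np.random.default_rng(0)
X = np.linspace(-10, 10, 100).reshape(-1, 1)
t = np.sinc(X[:, 0] / np.pi) + rng.normal(scale=0.1, size=100)

m, keep, beta = rvm_regression(X, t, gamma=0.5)

# Predict using only the surviving (relevance) basis functions.
Phi = np.hstack([np.ones((X.shape[0], 1)), rbf_kernel(X, X, 0.5)])[:, keep]
y_pred = Phi @ m
print("relevance vectors:", len(keep), "of", X.shape[0] + 1, "basis functions")
```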


Outroduction


Software packages and other things worth checking

• Command-line tools
  – SVMlight: http://svmlight.joachims.org/
  – SVMTorch Multi III: http://www.torch.ch/ (one-versus-rest approach)
• Do it yourself
  – LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• Interfaces to LIBSVM
  – Matlab, R, Weka, LabVIEW, ...
• http://www.kernel-machines.org/
• Google