Machine Learning using Matlab
Lecture 7 Support Vector Machine (SVM)
Note
● The deadline for presentation applications is 11.06.2017. If you have not sent your application yet, please do so as soon as possible.
● The presentation schedule will be released on our course website next week.
● In Thursday's lab session there will be a quiz; if you finish it in time, you will receive a bonus toward your final score.
Outline
● Primal and dual forms
● Feature map
● Kernel trick
● Regression
● SVM toolbox
Intuition
[Figure: separating hyperplane with the points on the margin labelled as support vectors]
SVM is also called a "maximum margin classifier".
"Hard" margin
● Given training examples {(xᵢ, yᵢ)}, i = 1, …, m, with yᵢ ∈ {−1, +1}, SVM aims to find an optimal hyperplane wᵀx + b = 0 so that:
yᵢ(wᵀxᵢ + b) ≥ 1 for all i
● This is equivalent to minimizing the following function:
min over w, b of ½‖w‖², subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i
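As a concrete illustration (not part of the original slides), the hard-margin primal is just a quadratic program and can be solved directly with MATLAB's quadprog (Optimization Toolbox); the toy data below is invented for this sketch.

% Hard-margin SVM primal solved as a QP (toy data, purely illustrative)
rng(1);
X = [randn(20,2) + 2; randn(20,2) - 2];   % two well-separated clouds
y = [ones(20,1); -ones(20,1)];            % labels in {-1, +1}
[m, d] = size(X);

% Optimization variable z = [w; b]:
%   minimize (1/2) w'w   subject to   y_i (w'x_i + b) >= 1  for all i
H = blkdiag(eye(d), 0);                   % quadratic term acts on w only
f = zeros(d + 1, 1);
A = -[bsxfun(@times, y, X), y];           % encodes -y_i*(w'x_i + b) <= -1
bIneq = -ones(m, 1);

z = quadprog(H, f, A, bIneq);             % requires Optimization Toolbox
w = z(1:d);
b = z(end);
margin = 2 / norm(w);                     % geometric margin achieved by the solution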
Which classifier is better?
There is a tradeoff between the size of the margin and the number of mistakes on the training data.
Introduce “slack” variables
[Figure: soft-margin hyperplane with slack variables measuring margin violations; support vectors labelled]
● For 0 < ξᵢ ≤ 1, the point lies between the margin and the correct side of the hyperplane. This is a margin violation.
● For ξᵢ > 1, the point is misclassified.
"Soft" margin solution
The optimization problem becomes:
min over w, b, ξ of ½‖w‖² + C ∑ᵢ ξᵢ, subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i
● Every constraint can be satisfied if ξᵢ is sufficiently large.
● C is a regularization parameter:
○ small C ⇒ large margin
○ large C ⇒ narrow margin
○ C = ∞ ⇒ hard margin
● This is called the primal form of SVM.
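As a hedged sketch (not from the slides): in MATLAB's built-in fitcsvm, the name-value pair BoxConstraint plays the role of C, so the effect of small versus large C can be checked directly; the toy data and values are made up.

% Soft-margin SVM: BoxConstraint plays the role of C (sketch, toy data)
rng(1);
X = [randn(50,2) + 1.5; randn(50,2) - 1.5];
y = [ones(50,1); -ones(50,1)];

smallC = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 0.01); % wide margin, tolerates violations
largeC = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 100);  % narrow margin, few violations

% For a linear kernel the primal coefficients are stored in Beta, so the
% nominal margin width 2/||w|| of both settings can be compared:
fprintf('margin (C = 0.01): %.3f\n', 2 / norm(smallC.Beta));
fprintf('margin (C = 100) : %.3f\n', 2 / norm(largeC.Beta));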
Different regularization
Dual form
Introducing Lagrange multipliers αᵢ, we obtain the dual form of SVM:
max over α of ∑ᵢ αᵢ − ½ ∑ᵢ ∑ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ, subject to 0 ≤ αᵢ ≤ C and ∑ᵢ αᵢ yᵢ = 0
The decision function can be rewritten as:
f(x) = sign(∑ᵢ αᵢ yᵢ xᵢᵀx + b)
Prediction is very fast, as most αᵢ are zero; only the support vectors (αᵢ > 0) contribute.
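To make this sparsity concrete, here is a hedged sketch using the fitted-model properties exposed by fitcsvm (Alpha, SupportVectors, SupportVectorLabels, Bias, IsSupportVector); it assumes the default settings (no standardization, unit kernel scale) and toy data.

% Sparsity of the dual solution: only support vectors have nonzero alpha (sketch, toy data)
rng(1);
X = [randn(50,2) + 1.5; randn(50,2) - 1.5];
y = [ones(50,1); -ones(50,1)];
Mdl = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 1);

numSV = nnz(Mdl.IsSupportVector);          % training points with alpha_i > 0
fprintf('%d of %d training points are support vectors\n', numSV, size(X,1));

% Prediction touches only the support vectors:
%   score(x) = sum_j alpha_j * y_j * (x_j' * x) + b
xNew  = [0.5, -0.2];
score = sum(Mdl.Alpha .* Mdl.SupportVectorLabels .* (Mdl.SupportVectors * xNew')) + Mdl.Bias;
label = sign(score);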
What if the data is not linearly separable?
● In logistic regression, we add more parameters to make the decision boundary nonlinear.
● However, we cannot do the same in SVM, because we still want to learn a linear classifier.
Map data into a higher dimension
Data is linearly separable in 3D space
Feature map
● By mapping the data from a d-dimensional to a D-dimensional space (d < D), we can still learn a linear classifier.
● x ↦ φ(x), where φ is called the feature map.
What changes in classifier learning after mapping the features?
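As an illustration (not from the slides), the sketch below maps 2-D points that are only separable by a circle through a hypothetical quadratic feature map φ(x) = (x₁², √2·x₁x₂, x₂²); after the mapping, a plain linear SVM separates them.

% Feature map sketch: circularly separable 2-D data becomes linearly
% separable after a quadratic feature map (toy data, illustrative only)
rng(1);
X = randn(200, 2);
y = sign(sum(X.^2, 2) - 1);               % +1 outside the unit circle, -1 inside
y(y == 0) = 1;

% phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2):  d = 2  ->  D = 3
Phi = [X(:,1).^2, sqrt(2)*X(:,1).*X(:,2), X(:,2).^2];

% A *linear* SVM in the mapped space separates the classes
Mdl = fitcsvm(Phi, y, 'KernelFunction', 'linear');
trainErr = resubLoss(Mdl);                % should be (near) zero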
Kernel trick - demonstration
Transformed feature in primal form
Classifier:
f(x) = wᵀφ(x) + b
Optimization:
min over w, b, ξ of ½‖w‖² + C ∑ᵢ ξᵢ, subject to yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0
● Simply map x to φ(x), where the data is separable.
● Solve for w in the high-dimensional space.
● There are many more parameters to learn for w if D ≫ d. Can we avoid this?
Transformed feature in dual form
Classifier:
f(x) = sign(∑ᵢ αᵢ yᵢ φ(xᵢ)ᵀφ(x) + b)
Optimization:
max over α of ∑ᵢ αᵢ − ½ ∑ᵢ ∑ⱼ αᵢ αⱼ yᵢ yⱼ φ(xᵢ)ᵀφ(xⱼ), subject to 0 ≤ αᵢ ≤ C and ∑ᵢ αᵢ yᵢ = 0
● In the dual form, φ(x) only occurs in pairs φ(xᵢ)ᵀφ(xⱼ).
● Only the m-dimensional vector α needs to be learnt.
● Kernel: k(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
Kernel in dual form
Classifier:
f(x) = sign(∑ᵢ αᵢ yᵢ k(xᵢ, x) + b)
Optimization:
max over α of ∑ᵢ αᵢ − ½ ∑ᵢ ∑ⱼ αᵢ αⱼ yᵢ yⱼ k(xᵢ, xⱼ), subject to 0 ≤ αᵢ ≤ C and ∑ᵢ αᵢ yᵢ = 0
Kernel trick
1. The classifier can be learnt and applied without explicitly computing φ(x).
2. All that is required is to compute the kernel function k(xᵢ, xⱼ).
3. The complexity of learning depends on the number of training examples m rather than the dimension of the feature space D (see the sketch below).
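A small sketch of point 3 (not from the slides): all that learning needs is the m × m Gram matrix, computed here for a Gaussian kernel with an assumed width σ; solvers such as fitcsvm build this matrix internally.

% Kernel trick sketch: learning only needs the m-by-m Gram matrix K,
% never the (possibly infinite-dimensional) feature map phi(x)
rng(1);
m = 100;
X = randn(m, 5);                          % toy data, d = 5
sigma = 1.0;                              % Gaussian kernel width (assumed)

% Pairwise squared distances and Gaussian kernel K(i,j) = exp(-||xi - xj||^2 / (2*sigma^2))
sqd = pdist2(X, X).^2;                    % requires Statistics and Machine Learning Toolbox
K   = exp(-sqd / (2*sigma^2));            % m-by-m, independent of D

% Cost of building K is O(m^2 * d); the dual problem has only m variables alpha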
Common kernel functions
● Linear kernel: k(x, x′) = xᵀx′
● Polynomial kernel: k(x, x′) = (xᵀx′ + 1)^d
○ Contains all polynomial terms up to degree d
● Radial Basis Function (Gaussian kernel): k(x, x′) = exp(−‖x − x′‖² / (2σ²))
○ Infinite-dimensional feature space
How many parameters do you need to tune for each kernel? (See the sketch below.)
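As a hedged illustration, the three kernels can be written as MATLAB anonymous functions acting on row vectors; this also makes the tunable parameters explicit (the names d and sigma follow the slide).

% The three common kernels as MATLAB anonymous functions (x, z are row vectors)
linearK = @(x, z)        x * z';                              % no kernel parameter
polyK   = @(x, z, d)     (x * z' + 1).^d;                     % one parameter: degree d
gaussK  = @(x, z, sigma) exp(-sum((x - z).^2) / (2*sigma^2)); % one parameter: width sigma

% Example evaluations
x = [1 2]; z = [0.5 -1];
kLin  = linearK(x, z);
kPoly = polyK(x, z, 3);
kRBF  = gaussK(x, z, 0.5);
% The soft-margin C must always be tuned as well, so a Gaussian-kernel SVM
% has two hyperparameters in total: C and sigma.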
Kernel trick - summary
● The classifier can be learnt in a high-dimensional feature space without explicitly knowing the feature map.
● Kernels can also be used elsewhere, for example in kernel PCA and kernel k-means.
● Different kernel functions may suit different scenarios.
● However, the optimal kernel parameters have to be chosen empirically.
Support Vector Regression (SVR)
ε-insensitive loss
Errors smaller than ε are ignored: L_ε(y, f(x)) = max(0, |y − f(x)| − ε)
SVR primal form
min over w, b, ξ, ξ* of ½‖w‖² + C ∑ᵢ (ξᵢ + ξᵢ*)
subject to yᵢ − (wᵀxᵢ + b) ≤ ε + ξᵢ, (wᵀxᵢ + b) − yᵢ ≤ ε + ξᵢ*, and ξᵢ, ξᵢ* ≥ 0
SVR dual form
Introducing Lagrange multipliers αᵢ and αᵢ*, we have:
max over α, α* of −½ ∑ᵢ ∑ⱼ (αᵢ − αᵢ*)(αⱼ − αⱼ*) xᵢᵀxⱼ − ε ∑ᵢ (αᵢ + αᵢ*) + ∑ᵢ yᵢ (αᵢ − αᵢ*)
subject to ∑ᵢ (αᵢ − αᵢ*) = 0 and 0 ≤ αᵢ, αᵢ* ≤ C
with prediction f(x) = ∑ᵢ (αᵢ − αᵢ*) xᵢᵀx + b
SVR - summary
● SVR is the extension of SVM to regression, so the optimization algorithms for SVM can be applied to SVR directly [Smola ’04].
● Likewise, the “kernel trick” can also be applied to SVR.
● Q: how many parameters should I tune if I use a Gaussian kernel?
Three parameters, namely C, σ, and ε.
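A hedged sketch of how these three knobs appear in MATLAB's fitrsvm (Statistics and Machine Learning Toolbox): BoxConstraint corresponds to C, KernelScale plays the role of σ, and Epsilon is ε; the data and parameter values below are placeholders.

% SVR in MATLAB with a Gaussian kernel: the three knobs C, sigma, epsilon (sketch, toy data)
rng(1);
x = linspace(-3, 3, 100)';
y = sin(x) + 0.1*randn(size(x));          % noisy 1-D regression target

Mdl = fitrsvm(x, y, ...
    'KernelFunction', 'gaussian', ...
    'BoxConstraint',  1, ...              % C
    'KernelScale',    0.5, ...            % plays the role of sigma
    'Epsilon',        0.05);              % width of the insensitive tube

yhat = predict(Mdl, x);
plot(x, y, '.', x, yhat, '-');            % fitted curve vs. noisy samples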
SVM toolbox
1. LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/
2. SVMlight: http://svmlight.joachims.org/
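LIBSVM ships with a MATLAB interface (compiled svmtrain/svmpredict MEX files); the call below is only a minimal sketch using the standard LIBSVM option flags, with Xtrain, yTrain, Xtest, yTest assumed to be already loaded — check the README of the version you install.

% Minimal LIBSVM usage from MATLAB (sketch; assumes the LIBSVM MATLAB
% interface is compiled and on the path, and data matrices already exist)
% -s 0: C-SVC,  -t 2: RBF kernel,  -c: C,  -g: gamma
model = svmtrain(yTrain, Xtrain, '-s 0 -t 2 -c 1 -g 0.5');
[yPred, accuracy, scores] = svmpredict(yTest, Xtest, model);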
SVM - summary
● SVM was originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
● SVM can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
● For multiclass problems, you can use either a one-vs-rest scheme or a dedicated multiclass SVM, e.g. [Weston ’99] and [Crammer ’01]; see the sketch after this list.
● SVM is a convex problem, so we obtain a globally optimal solution. However, the computational cost increases with the number of training examples, which is why more efficient optimization algorithms have been proposed, e.g. SMO [Platt ’99] and [Joachims ’99].
● Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.
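As a hedged addendum (not on the slides), MATLAB's fitcecoc can train a one-vs-rest ensemble of binary SVMs; the class names and data below are invented for the sketch.

% One-vs-rest multiclass SVM via fitcecoc (sketch, toy data)
rng(1);
X = [randn(30,2); randn(30,2) + 3; randn(30,2) + repmat([6 0], 30, 1)];
Y = [repmat({'a'},30,1); repmat({'b'},30,1); repmat({'c'},30,1)];

t   = templateSVM('KernelFunction', 'gaussian', 'BoxConstraint', 1);
Mdl = fitcecoc(X, Y, 'Learners', t, 'Coding', 'onevsall');  % one binary SVM per class
err = resubLoss(Mdl);                                       % training error of the ensemble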