Lecture 12 - SVM


Page 1: Lecture12 - SVM

Introduction to Machine Learning

Lecture 12: Support Vector Machines

Albert Orriols i Puig (aorriols@salle.url.edu)

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture12 - SVM

Recap of Lecture 11
1st generation NN: Perceptrons and others

Also multi-layer perceptrons


Page 3: Lecture12 - SVM

Recap of Lecture 11
2nd generation NN

Some people figured out how to adapt the weights of internal layers

Seemed to be very powerful and able to solve almost anything


The reality showed that this was not exactly true

Page 4: Lecture12 - SVM

Today’s Agenda

Moving to SVM
Linear SVM

The separable case
The non-separable case

Non-Linear SVM


Page 5: Lecture12 - SVM

Introduction
SVM (Vapnik, 1995)

Clever type of perceptron

Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe

A clever optimization technique is used to select the best subset of features

Many NN researchers switched to SVM in the 1990s because they worked better

Here, we’ll take a slow path into SVM concepts


Page 6: Lecture12 - SVM

Shattering Points with Oriented Hyperplanes
Remember the idea:

I want to build hyperplanes that separate points of two classes

In a two-dimensional space, hyperplanes are just lines

E.g.: Linear Classifier

Which is the best separating line?

Remember, a hyperplane is represented by the equation

W · X + b = 0
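As an illustrative sketch only (not from the slides), the decision rule of such a linear classifier fits in a few lines of Python; the function name and the toy values are made up for the example:

```python
import numpy as np

def predict(W, b, X):
    """Classify each row of X with the hyperplane W·x + b = 0 (labels +1/-1)."""
    return np.sign(X @ W + b)

# A line in 2-D acting as the separating hyperplane
W = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0]])
print(predict(W, b, X))  # [ 1. -1.]
```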


Page 7: Lecture12 - SVM

Linear SVM
I want the line that maximizes the margin between examples of both classes!

Support Vectors


Page 8: Lecture12 - SVM

Linear SVM
In more detail

Let’s assume two classes: yi = {-1, 1}

Each example is described by a set of features x (x is a vector; for clarity, we will mark vectors in bold in the remainder of the slides)

The problem can be formulated as follows. All training examples must satisfy (in the separable case) the constraints written out below.

These constraints can be combined into a single inequality, also shown below.

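Presumably the standard separable-case conditions:

    W \cdot x_i + b \geq +1   for y_i = +1
    W \cdot x_i + b \leq -1   for y_i = -1

which combine into

    y_i (W \cdot x_i + b) - 1 \geq 0   for all i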

Page 9: Lecture12 - SVM

Linear SVM
What are the support vectors?

Let’s find the points that lie on the hyperplane H1

Their perpendicular distance to the origin is

Let’s find the points that lie on the hyperplane H2

Their perpendicular distance to the origin is

The margin is:
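Filling in the standard expressions (presumably the ones shown on the slide):

    H1: W \cdot x + b = +1,  perpendicular distance to the origin |1 - b| / \|W\|
    H2: W \cdot x + b = -1,  perpendicular distance to the origin |-1 - b| / \|W\|
    margin between H1 and H2 = 2 / \|W\|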


Page 10: Lecture12 - SVM

Linear SVM
Therefore, the problem is

Find the hyperplane that minimizes

Subject to
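In the usual notation, the separable-case problem is presumably:

    minimize    \frac{1}{2} \|W\|^2
    subject to  y_i (W \cdot x_i + b) - 1 \geq 0   for all i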

But let us change to the Lagrange formulation because:
The constraints will be placed on the Lagrange multipliers themselves (easier to handle)

Training data will appear only in the form of dot products between vectors


Page 11: Lecture12 - SVM

Linear SVM
The Lagrangian formulation comes to be
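Presumably the standard primal Lagrangian of the problem above:

    L_P = \frac{1}{2} \|W\|^2 - \sum_i \alpha_i y_i (W \cdot x_i + b) + \sum_i \alpha_i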

Where αi are the Lagrange multipliers

So, now we need to:
Minimize Lp w.r.t. w, b

Simultaneously require that the derivatives of Lp w.r.t. α vanish

All subject to the constraints αi ≥ 0


Page 12: Lecture12 - SVM

Linear SVM
Transformation to the dual problem

This is a convex problem

We can equivalently solve the dual problem

That is, maximize LD

w.r.t. αi

Subject to the constraints

And with αi ≥ 0
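Written out, the dual is presumably:

    L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j

    subject to  \sum_i \alpha_i y_i = 0  and  \alpha_i \geq 0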


Page 13: Lecture12 - SVM

Linear SVM

This is a quadratic programming problem. You can solve it with many methods such as gradient descent

We’ll not see these methods in class
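As a practical aside (not part of the slides): libraries such as scikit-learn already wrap a QP solver, so a linear SVM can be trained in a few lines; the toy data below is invented for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration)
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1.0)  # C only matters in the non-separable case
clf.fit(X, y)

print(clf.support_vectors_)        # training points with non-zero alpha_i
print(clf.predict([[2.5, 2.5]]))   # classify a new point
```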


Page 14: Lecture12 - SVM

The Non-Separable Case
What if I cannot separate the two classes?

We will not be able to solve the Lagrangian formulation proposed

Any idea?


Page 15: Lecture12 - SVM

The Non-Separable Case
Just relax the constraints by permitting some errors


Page 16: Lecture12 - SVM

The Non-Separable Case
That means that the Lagrangian is rewritten

We change the objective function to be minimized to
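Presumably the standard soft-margin objective, with one slack variable \xi_i per training example:

    minimize    \frac{1}{2} \|W\|^2 + C \sum_i \xi_i
    subject to  y_i (W \cdot x_i + b) \geq 1 - \xi_i,   \xi_i \geq 0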

Therefore, we are maximizing the margin and minimizing the error

C is a constant to be chosen by the user

The dual problem becomes

Subject to the constraints shown below
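Presumably the same L_D as in the separable case,

    L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j

    now subject to  0 \leq \alpha_i \leq C  and  \sum_i \alpha_i y_i = 0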


Page 17: Lecture12 - SVM

Non-Linear SVM
What happens if the decision function is not a linear function of the data?

In our equations, data appears in the form of dot products xi · xj

Wouldn’t you like to have polynomial, logarithmic, … functions to fit the data?


Page 18: Lecture12 - SVM

Non-Linear SVM

The kernel trick
Map the data into a higher-dimensional space

Mercer’s theorem: any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space

Now, we have a kernel function

An example

All we have talked about still holds when using the kernel function

The only difference is that now my function will be
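Presumably the standard kernelized decision function, where the s_i are the N_s support vectors:

    f(x) = \mathrm{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right)

A classic example of the mapping behind a kernel: in two dimensions, K(x, y) = (x \cdot y)^2 corresponds to the explicit feature map \Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), since \Phi(x) \cdot \Phi(y) = (x \cdot y)^2.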


Page 19: Lecture12 - SVM

Non-Linear SVM
Some typical kernels
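The kernels usually listed at this point (presumably the ones on the slide) are:

    Polynomial:    K(x, y) = (x \cdot y + 1)^p
    Gaussian RBF:  K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))
    Sigmoid:       K(x, y) = \tanh(\kappa \, x \cdot y - \delta)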

A visual example of a polynomial kernel with p=3


Page 20: Lecture12 - SVM

Some Further Issues
We have to classify data

Described by nominal attributes and continuous attributes

Probably with missing values

That may have more than two classes

How does SVM deal with them?
SVM is defined over continuous attributes: no problem!

Nominal attributes: map them into a continuous space

Multiple classes: build SVMs that discriminate each pair of classes (one way to do both is sketched below)
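A hedged way to realise these two points with scikit-learn (an illustration, not the course’s prescribed tooling): one-hot encode the nominal attributes and let SVC build the pairwise (one-vs-one) classifiers; the data below is made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Made-up data: one nominal attribute and one continuous attribute
nominal = np.array([['red'], ['green'], ['blue'], ['red']])
continuous = np.array([[0.2], [1.5], [3.1], [2.7]])
y = np.array([0, 1, 2, 1])                      # three classes

# Map the nominal attribute into a continuous space (one-hot encoding)
X = np.hstack([OneHotEncoder().fit_transform(nominal).toarray(), continuous])

# SVC trains one SVM per pair of classes (one-vs-one) internally
clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X, y)
print(clf.predict(X))
```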


Page 21: Lecture12 - SVM

Some Further Issues
I’ve seen lots of formulas… But I want to program an SVM builder. How do I get my SVM?

We have already mentioned that there are many methods to solve the quadratic programming problem

Many algorithms have been designed specifically for SVM

One of the most significant: Sequential Minimal Optimization (SMO)

Currently, there are many new algorithms


Page 22: Lecture12 - SVM

Next Class

Association Rules


Page 23: Lecture12 - SVM

Introduction to Machine Learning

Lecture 12: Support Vector Machines

Albert Orriols i Puig (aorriols@salle.url.edu)

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull