TRANSCRIPT
Prof. Didier Stricker
Kaiserslautern University
http://ags.cs.uni-kl.de/
DFKI – Deutsches Forschungszentrum für Künstliche Intelligenz
http://av.dfki.de
2D Image Processing
Introduction to object recognition & SVM
1
• Basic recognition tasks
• A statistical learning approach
• Traditional recognition pipeline
• Bags of features
• Classifiers
• Support Vector Machine
Overview
2
Image classification
• outdoor/indoor
• city/forest/factory/etc.
3
Image tagging
• street
• people
• building
• mountain
• …
4
Object detection
• find pedestrians
5
• walking
• shopping
• rolling a cart
• sitting
• talking
• …
Activity recognition
6
(Figure: scene with labeled regions: mountain, building, tree, banner, market, people, street lamp, sky)
Image parsing
7
This is a busy street in an Asian city.
Mountains and a large palace or
fortress loom in the background. In the
foreground, we see colorful souvenir
stalls and people walking around and
shopping. One person in the lower left
is pushing an empty cart, and a couple
of people in the middle are sitting,
possibly posing for a photograph.
Image description
8
Image classification
9
Apply a prediction function to a feature
representation of the image to get the
desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
The statistical learning framework
10
y = f(x)
Training: given a training set of labeled examples
{(x1,y1), …, (xN,yN)}, estimate the prediction function f by
minimizing the prediction error on the training set
Testing: apply f to a never before seen test example x
and output the predicted value y = f(x)
The statistical learning framework
y = f(x)  (y: output, f: prediction function, x: image feature)
11
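The training/testing framework above can be sketched in a few lines. This is an illustrative toy, not the lecture's method: a nearest-centroid classifier over made-up 2-D "features" stands in for f, and the fruit labels echo the slide's examples.

```python
import numpy as np

# Toy illustration of y = f(x): "training" estimates f from labeled
# examples {(x1,y1), ..., (xN,yN)}; "testing" applies f to unseen x.
# Here f is a nearest-centroid classifier over hypothetical 2-D features.

def train(X, y):
    """Estimate f by storing one mean feature vector (centroid) per class."""
    classes = sorted(set(y))
    return {c: X[[i for i, yi in enumerate(y) if yi == c]].mean(axis=0)
            for c in classes}

def predict(centroids, x):
    """Apply f: return the label of the closest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = ["apple", "apple", "tomato", "tomato"]

f = train(X_train, y_train)
print(predict(f, np.array([0.9, 1.1])))  # a point near the "apple" cluster
```

Real systems differ only in scale: X holds image features instead of toy vectors, and f is a trained classifier such as an SVM.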
Steps
Training: Training Images + Training Labels → Image Features → Learned model
Testing: Test Image → Image Features → Learned model → Prediction
Slide credit: D. Hoiem
12
Traditional recognition pipeline
Image Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
• Features are not learned
• Trainable classifier is often generic (e.g. SVM)
13
Bags of features
14
1. Extract local features
2. Learn “visual vocabulary”
3. Quantize local features using visual vocabulary
4. Represent images by frequencies of “visual words”
Traditional features: Bags-of-features
15
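The four steps above can be sketched end to end. This is a simplified stand-in, not a full pipeline: random vectors play the role of local descriptors (e.g. SIFT), and a minimal k-means plays the role of vocabulary learning; the dimensions and k are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means: returns the learned 'visual vocabulary' (centers)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest visual word
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bag_of_words(descriptors, vocab):
    """Quantize descriptors, return a normalized word-frequency histogram."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

# 1. extract local features (here: fake 8-D descriptors)
all_desc = rng.normal(size=(200, 8))
# 2. learn the visual vocabulary on descriptors pooled from training images
vocab = kmeans(all_desc, k=10)
# 3-4. represent one image by its histogram of visual-word frequencies
image_desc = rng.normal(size=(50, 8))
h = bag_of_words(image_desc, vocab)
print(h.shape)  # one 10-bin histogram per image
```

The resulting fixed-length histograms are what gets fed to the trainable classifier in the pipeline above.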
• Sample patches and extract descriptors
1. Local feature extraction
16
• Orderless document representation:
frequencies of words from a dictionary (Salton & McGill, 1983)
• Classification to determine document categories
Bags of features: Motivation
17
Traditional recognition pipeline
Image Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
18
f(x) = label of the training example nearest to x
All we need is a distance function for our inputs
No training required!
Classifiers: Nearest neighbor
(Figure: a test example among training examples from class 1 and class 2)
19
• For a new point, find the k closest points from training data
• Vote for class label with labels of the k points
K-nearest neighbor classifier
k = 5
20
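The k-nearest-neighbor rule above fits in a few lines of numpy. The 2-D points and labels are made up for illustration; distance is Euclidean, matching the "all we need is a distance function" remark.

```python
import numpy as np
from collections import Counter

# k-NN: no training beyond storing the data; prediction is a majority
# vote among the k closest training points.

def knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    votes = Counter(y_train[i] for i in nearest) # vote with their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.5, 0.5], [0.2, 0.1],
                    [3.0, 3.0], [3.5, 2.8], [2.9, 3.2]])
y_train = [1, 1, 1, 2, 2, 2]

print(knn_predict(X_train, y_train, np.array([0.3, 0.3]), k=5))
```

Note the cost profile discussed later: prediction scans all training examples, which is why NN methods are slow at test time.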
Which classifier is more robust to outliers?
K-nearest neighbor classifier
Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 21
K-nearest neighbor classifier
Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 22
Linear classifiers
Find a linear function to separate the classes:
f(x) = sgn(w·x + b)
23
Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/ 24
• NN pros: simple to implement; decision boundaries not necessarily linear; works for any number of classes; nonparametric method
• NN cons: need a good distance function; slow at test time
• Linear pros: low-dimensional parametric representation; very fast at test time
• Linear cons: works for two classes; how to train the linear function? what if data is not linearly separable?
Nearest neighbor vs. linear classifiers
25
26
Linear Classifiers can lead to many equally
valid choices for the decision boundary
Maximum Margin
Are these really
“equally valid”?
27
How can we pick which is best?
Maximize the size of the margin.
Max Margin
Are these really
“equally valid”?
Large Margin
Small Margin
28
Support Vectors are those
input points (vectors) closest
to the decision boundary
1. They are vectors
2. They “support” the
decision hyperplane
Support Vectors
29
Define this as a decision
problem
The decision hyperplane: w·x + b = 0
No fancy math, just the
equation of a hyperplane.
Support Vectors
30
Any line in the plane is perpendicular to
the normal of the plane,
which can be written in the form
w·x + b = 0
where w is the normal, and b is the negative distance
of the plane to the origin (up to the scale of w).
Recall: Equation of a plane
31
Define this as a decision problem
The decision hyperplane: w·x + b = 0
Decision function: f(x) = sgn(w·x + b)
Support Vectors
32
Define this as a decision problem
The decision hyperplane: w·x + b = 0
Margin hyperplanes: w·x + b = +1 and w·x + b = −1
Support Vectors
33
Support Vectors
The decision hyperplane: w·x + b = 0
Scale invariance: for any c > 0, (cw)·x + cb = 0 defines the same hyperplane,
so we can rescale w and b until the margin hyperplanes become w·x + b = ±1.
35
This scaling does not change the
decision hyperplane or the support
vector hyperplanes, but it eliminates
a variable from the optimization.
36
We will represent the size of
the margin in terms of w.
This will allow us to simultaneously:
• Identify a decision boundary
• Maximize the margin
What are we optimizing?
37
1. There must be at least
one point that lies on
each margin hyperplane
How do we represent the size of
the margin in terms of w?
Proof outline: if not, we
could define a larger-margin
pair of hyperplanes that does
touch the nearest
point(s).
38
How do we represent the size of the
margin in terms of w?
1. There must be at least one point
that lies on each margin hyperplane
2. Thus: w·x₁ + b = +1 for some point x₁,
and w·x₂ + b = −1 for some point x₂
3. And: subtracting the two gives w·(x₁ − x₂) = 2
39
3. And: w·(x₁ − x₂) = 2
The vector w is perpendicular to the decision hyperplane:
for any two points x, x′ on it, w·x + b = 0 = w·x′ + b, so w·(x − x′) = 0.
If the dot product of two vectors equals zero, the two vectors are perpendicular.
How do we represent the size of the
margin in terms of w?
40
41
The margin is the projection
of x1 – x2 onto w, the normal
of the hyperplane.
How do we represent the size of the
margin in terms of w?
42
Recall: Vector Projection
43
The margin is the projection
of x₁ − x₂ onto w, the normal of
the hyperplane.
How do we represent the size of the
margin in terms of w?
Projection: the length of the projection of a vector a onto w is (w·a)/‖w‖
Size of the Margin: (w·(x₁ − x₂))/‖w‖ = 2/‖w‖
44
Maximizing the margin
Goal: maximize the margin 2/‖w‖, i.e. minimize ‖w‖²/2,
subject to linear separability of the data
by the decision boundary: yᵢ(w·xᵢ + b) ≥ 1 for all i
45
For constrained optimization, use Lagrange multipliers
and the Karush-Kuhn-Tucker (KKT) conditions.
Optimize the “Primal”:
L(w, b, α) = ‖w‖²/2 − Σᵢ αᵢ [yᵢ(w·xᵢ + b) − 1], with αᵢ ≥ 0
Max Margin Loss Function
46
Optimize the “Primal”
Max Margin Loss Function
Partial w.r.t. b: ∂L/∂b = −Σᵢ αᵢyᵢ = 0, so Σᵢ αᵢyᵢ = 0
47
Optimize the “Primal”
Max Margin Loss Function
Partial w.r.t. w: ∂L/∂w = w − Σᵢ αᵢyᵢxᵢ = 0, so w = Σᵢ αᵢyᵢxᵢ
Now we have to find the αᵢ:
substitute back into the loss function.
48
Construct the “dual”
Max Margin Loss Function
With: Σᵢ αᵢyᵢ = 0 and αᵢ ≥ 0
Then: maximize W(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)
49
Solve this quadratic program to identify the
Lagrange multipliers and thus the weights
Dual formulation of the error
There exist (rather) fast approaches to quadratic
program solving in C, C++, Python, Java, and R
50
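As a toy alternative to the fast QP packages mentioned above, the dual can be attacked directly on a tiny, made-up separable dataset with projected gradient ascent (clip to αᵢ ≥ 0, re-project onto Σᵢαᵢyᵢ = 0 after each step). This is a sketch of the formulation, not a production solver; real code would use LIBSVM, SVMLight, or a QP library.

```python
import numpy as np

# Projected gradient ascent on the SVM dual
#   W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i.x_j)
# for a hypothetical 2-D dataset with two points per class.

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i.x_j
alpha = np.zeros(len(X))
lr = 0.01
for _ in range(5000):
    alpha += lr * (1.0 - Q @ alpha)        # ascent step on the dual
    alpha -= y * (alpha @ y) / (y @ y)     # project onto sum(alpha*y) = 0
    alpha = np.maximum(alpha, 0.0)         # enforce alpha_i >= 0

w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
sv = np.flatnonzero(alpha > 1e-6)          # support vectors: alpha_i > 0
b = y[sv[0]] - w @ X[sv[0]]                # from y_i (w.x_i + b) = 1
print(np.sign(X @ w + b))                  # should match the labels y
```

Consistent with the KKT discussion that follows, only the two points nearest the boundary end up with non-zero αᵢ.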
Quadratic Programming
Minimize f(x) = (1/2) xᵀQx + cᵀx subject to linear constraints.
• If Q is positive semi-definite, then f(x) is convex.
• If f(x) is convex, then any local minimum is a global minimum.
51
Karush-Kuhn-Tucker Conditions
In constrained optimization, at the optimal solution:
constraint × Lagrange multiplier = 0, i.e. αᵢ [yᵢ(w·xᵢ + b) − 1] = 0
Only points on the margin hyperplanes contribute to the solution!
Support Vector Expansion
New decision function: f(x) = sgn(Σᵢ αᵢyᵢ (xᵢ·x) + b)
When αᵢ is non-zero, xᵢ is a support vector.
When αᵢ is zero, xᵢ is not a support vector.
52
• General idea: the original input space can always
be mapped to some higher-dimensional feature
space where the training set is separable
Nonlinear SVMs
Φ: x→ φ(x)
53
• Linearly separable dataset in 1D:
• Non-separable dataset in 1D:
• We can map the data to a higher-dimensional space: x → (x, x²)
Nonlinear SVMs
Slide credit: Andrew Moore
54
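The 1-D example above can be checked numerically. The data and the threshold on x² are illustrative assumptions: negatives sit between two positive clusters, so no single threshold on x separates the classes, but after the lifting x → (x, x²) a horizontal line does.

```python
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

def separable_by_threshold(x, y):
    """Can any single threshold on x split the two classes?"""
    for t in np.concatenate([x, [x.max() + 1.0]]):
        left, right = y[x < t], y[x >= t]
        if (all(left == -1) and all(right == 1)) or \
           (all(left == 1) and all(right == -1)):
            return True
    return False

# lift to 2-D: phi(x) = (x, x^2); the line x2 = 2 now separates the data
phi = np.stack([x, x ** 2], axis=1)
pred = np.where(phi[:, 1] >= 2.0, 1, -1)
print(separable_by_threshold(x, y), (pred == y).all())
```

The same idea, applied implicitly via kernels, is what the next slides develop.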
• General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable.
• The kernel trick: instead of explicitly computing the
lifting transformation φ(x), define a kernel function K
such that
K(x,y) = φ(x) · φ(y)
(to be valid, the kernel function must satisfy
Mercer’s condition)
The kernel trick
55
The kernel trick
• Linear SVM decision function: f(x) = sgn(w·x + b) = sgn(Σᵢ αᵢyᵢ (xᵢ·x) + b)
(xᵢ: support vector, αᵢ: learned weight)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
56
The kernel trick
• Linear SVM decision function: f(x) = sgn(Σᵢ αᵢyᵢ (xᵢ·x) + b)
• Kernel SVM decision function: f(x) = sgn(Σᵢ αᵢyᵢ φ(xᵢ)·φ(x) + b) = sgn(Σᵢ αᵢyᵢ K(xᵢ, x) + b)
• This gives a nonlinear decision boundary in the original feature space
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
57
Polynomial kernel: K(x, y) = (c + x·y)ᵈ
58
Gaussian kernel
• Also known as the radial basis function
(RBF) kernel:
K(x, y) = exp(−(1/(2σ²)) ‖x − y‖²)
(Figure: K(x, y) as a function of the distance ‖x − y‖)
59
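The identity K(x, y) = φ(x)·φ(y) can be verified numerically for a kernel whose lifting is known in closed form. For the homogeneous degree-2 polynomial kernel in 2-D (c = 0, an illustrative choice), φ(x) = (x₁², √2·x₁x₂, x₂²); the RBF kernel is shown alongside, where φ is infinite-dimensional and only K is ever computed.

```python
import numpy as np

def poly_kernel(x, y, d=2, c=0.0):
    return (c + x @ y) ** d

def phi(x):
    """Explicit lifting for the degree-2 homogeneous polynomial kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# equal up to floating-point rounding: the kernel trick in action
print(poly_kernel(x, y), phi(x) @ phi(y))
print(rbf_kernel(x, y))  # computed without ever forming phi
```

This is why kernel SVMs scale with the number of support vectors rather than with the (possibly infinite) dimension of the lifted space.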
Gaussian kernel
(Figure: nonlinear decision boundary of a Gaussian-kernel SVM with the support vectors (SVs) highlighted)
60
SVMs: Pros and cons
• Pros:
Kernel-based framework is very powerful and flexible
Training is convex optimization; a globally optimal solution can be found
Amenable to theoretical analysis
SVMs work very well in practice, even with very small training sample sizes
• Cons:
No “direct” multi-class SVM; must combine two-class SVMs (e.g., with one-vs-others)
Computation and memory costs (esp. for nonlinear SVMs)
61
• Generalization refers to the ability to correctly
classify never before seen examples
• Can be controlled by turning “knobs” that affect
the complexity of the model
Generalization
Training set (labels known) vs. test set (labels unknown)
62
• Training error: how does the model perform on the
data on which it was trained?
• Test error: how does it perform on never before seen
data?
Diagnosing generalization ability
Source: D. Hoiem
(Figure: error vs. model complexity, from low to high; training error keeps decreasing while test error is high at both extremes, marking the underfitting and overfitting regimes)
63
Underfitting and overfitting
• Underfitting: training and test error are both high
Model does an equally poor job on the training and the test set
Either the training procedure is ineffective or the model is too
“simple” to represent the data
• Overfitting: training error is low but test error is high
Model fits irrelevant characteristics (noise) in the training data
Model is too complex or the amount of training data is insufficient
(Figure: underfitting, good generalization, and overfitting)
64
Effect of training set size
(Figure: test error vs. model complexity for many vs. few training examples)
Source: D. Hoiem
65
66
Validation
• Split the data into training, validation, and test subsets
• Use the training set to optimize model parameters
• Use the validation set to choose the best model
• Use the test set only to evaluate performance
(Figure: training, validation, and test set loss vs. model complexity; the stopping point is where the validation set loss is lowest)
67
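The split described above can be sketched as follows; the 60/20/20 proportions and the placeholder data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

# Shuffle once, then carve the data into train/validation/test subsets.
# Parameters are fit on train, the complexity "knob" is chosen by
# validation loss, and the test set is touched only once at the end.

rng = np.random.default_rng(0)
X = np.arange(100).reshape(100, 1).astype(float)  # placeholder features
y = (X[:, 0] > 50).astype(int)                    # placeholder labels

idx = rng.permutation(len(X))
n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Keeping the three subsets disjoint is what makes the final test error an honest estimate of generalization.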
Histogram of Oriented Gradients (HoG)
[Dalal and Triggs, CVPR 2005]
10x10 cells
20x20 cells
HoGify
68
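The core HoG idea, per-cell histograms of gradient orientations weighted by gradient magnitude, can be sketched as below. This is a heavily simplified illustration: real HoG (Dalal and Triggs) adds block normalization and interpolation, and the cell size and bin count here are illustrative.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation (0-180 deg)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    h_cells, w_cells = img.shape[0] // cell, img.shape[1] // cell
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((h_cells, w_cells, bins))
    for i in range(h_cells):
        for j in range(w_cells):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

img = np.tile(np.arange(32), (32, 1))  # horizontal ramp: all gradients at 0 deg
h = hog_cells(img, cell=8, bins=9)
print(h.shape)  # (4, 4, 9): a 4x4 grid of cells, 9 orientation bins each
```

Concatenating (and normalizing) these cell histograms yields the fixed-length descriptor that is fed to a linear SVM in the pedestrian detector above.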
Histogram of Oriented Gradients
(HoG)
69
Many examples on the Internet:
https://www.youtube.com/watch?v=b_DSTuxdGDw
Histogram of Oriented Gradients
(HoG)
[Dalal and Triggs, CVPR 2005]
70
71
Demo software:
http://cs.stanford.edu/people/karpathy/svmjs/demo
A useful website: www.kernel-machines.org
Software: LIBSVM: www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMLight: svmlight.joachims.org/
Resources
72
73
Thank you!