TRANSCRIPT
Prof. Didier Stricker
Kaiserslautern University
http://ags.cs.uni-kl.de/
DFKI – Deutsches Forschungszentrum für Künstliche Intelligenz
http://av.dfki.de
2D Image Processing
Introduction to object recognition & SVM
1
• Basic recognition tasks
• A statistical learning approach
• Traditional recognition pipeline
• Bags of features
• Classifiers
• Support Vector Machine
Overview
2
Image classification
• outdoor/indoor
• city/forest/factory/etc.
3
Image tagging
• street
• people
• building
• mountain
• …
4
Object detection
• find pedestrians
5
• walking
• shopping
• rolling a cart
• sitting
• talking
• …
Activity recognition
6
(Figure: scene with labeled regions: mountain, building, tree, banner, market, people, street lamp, sky)
Image parsing
7
This is a busy street in an Asian city.
Mountains and a large palace or
fortress loom in the background. In the
foreground, we see colorful souvenir
stalls and people walking around and
shopping. One person in the lower left
is pushing an empty cart, and a couple
of people in the middle are sitting,
possibly posing for a photograph.
Image description
8
Image classification
9
Apply a prediction function to a feature
representation of the image to get the
desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
The statistical learning framework
10
y = f(x)
Training: given a training set of labeled examples
{(x1,y1), …, (xN,yN)}, estimate the prediction function f by
minimizing the prediction error on the training set
Testing: apply f to a never before seen test example x
and output the predicted value y = f(x)
The statistical learning framework
y = f(x)  (y: output, f: prediction function, x: image feature)
11
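The training/testing framework above can be sketched in a few lines. This is an illustrative toy, not the lecture's method: a nearest-centroid classifier over made-up 2-D "features" stands in for f, and the fruit labels echo the slide's examples.

```python
import numpy as np

# Toy illustration of y = f(x): "training" estimates f from labeled
# examples {(x1,y1), ..., (xN,yN)}; "testing" applies f to unseen x.
# Here f is a nearest-centroid classifier over hypothetical 2-D features.

def train(X, y):
    """Estimate f by storing one mean feature vector (centroid) per class."""
    classes = sorted(set(y))
    return {c: X[[i for i, yi in enumerate(y) if yi == c]].mean(axis=0)
            for c in classes}

def predict(centroids, x):
    """Apply f: return the label of the closest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = ["apple", "apple", "tomato", "tomato"]

f = train(X_train, y_train)
print(predict(f, np.array([0.9, 1.1])))  # a point near the "apple" cluster
```

Real systems differ only in scale: X holds image features instead of toy vectors, and f is a trained classifier such as an SVM.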
Steps
Training: Training Images + Training Labels → Image Features → Learned model
Testing: Test Image → Image Features → Learned model → Prediction
Slide credit: D. Hoiem
12
Traditional recognition pipeline
Image Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
• Features are not learned
• Trainable classifier is often generic (e.g. SVM)
13
Bags of features
14
1. Extract local features
2. Learn “visual vocabulary”
3. Quantize local features using visual vocabulary
4. Represent images by frequencies of “visual words”
Traditional features: Bags-of-features
15
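The four steps above can be sketched end to end. This is a simplified stand-in, not a full pipeline: random vectors play the role of local descriptors (e.g. SIFT), and a minimal k-means plays the role of vocabulary learning; the dimensions and k are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means: returns the learned 'visual vocabulary' (centers)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest visual word
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bag_of_words(descriptors, vocab):
    """Quantize descriptors, return a normalized word-frequency histogram."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

# 1. extract local features (here: fake 8-D descriptors)
all_desc = rng.normal(size=(200, 8))
# 2. learn the visual vocabulary on descriptors pooled from training images
vocab = kmeans(all_desc, k=10)
# 3-4. represent one image by its histogram of visual-word frequencies
image_desc = rng.normal(size=(50, 8))
h = bag_of_words(image_desc, vocab)
print(h.shape)  # one 10-bin histogram per image
```

The resulting fixed-length histograms are what gets fed to the trainable classifier in the pipeline above.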
• Sample patches and extract descriptors
1. Local feature extraction
16
• Orderless document representation:
frequencies of words from a dictionary (Salton & McGill, 1983)
• Classification to determine document categories
Bags of features: Motivation
17
Traditional recognition pipeline
Image Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
18
f(x) = label of the training example nearest to x
All we need is a distance function for our inputs
No training required!
Classifiers: Nearest neighbor
(Figure: a test example among training examples from class 1 and class 2)
19
• For a new point, find the k closest points from training data
• Vote for class label with labels of the k points
K-nearest neighbor classifier
k = 5
20
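The k-nearest-neighbor rule above fits in a few lines of numpy. The 2-D points and labels are made up for illustration; distance is Euclidean, matching the "all we need is a distance function" remark.

```python
import numpy as np
from collections import Counter

# k-NN: no training beyond storing the data; prediction is a majority
# vote among the k closest training points.

def knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    votes = Counter(y_train[i] for i in nearest) # vote with their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.5, 0.5], [0.2, 0.1],
                    [3.0, 3.0], [3.5, 2.8], [2.9, 3.2]])
y_train = [1, 1, 1, 2, 2, 2]

print(knn_predict(X_train, y_train, np.array([0.3, 0.3]), k=5))
```

Note the cost profile discussed later: prediction scans all training examples, which is why NN methods are slow at test time.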
Which classifier is more robust to outliers?
K-nearest neighbor classifier
Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 21
K-nearest neighbor classifier
Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 22
Linear classifiers
Find a linear function to separate the classes:
f(x) = sgn(w·x + b)
23
Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/ 24
• NN pros: simple to implement; decision boundaries not necessarily linear; works for any number of classes; nonparametric method
• NN cons: need a good distance function; slow at test time
• Linear pros: low-dimensional parametric representation; very fast at test time
• Linear cons: works for two classes; how to train the linear function? what if data is not linearly separable?
Nearest neighbor vs. linear classifiers
25
26
Linear Classifiers can lead to many equally
valid choices for the decision boundary
Maximum Margin
Are these really
“equally valid”?
27
How can we pick which is best?
Maximize the size of the margin.
Max Margin
Are these really
“equally valid”?
Large Margin
Small Margin
28
Support Vectors are those
input points (vectors) closest
to the decision boundary
1. They are vectors
2. They “support” the
decision hyperplane
Support Vectors
29
Define this as a decision
problem
The decision hyperplane: w·x + b = 0
No fancy math, just the
equation of a hyperplane.
Support Vectors
30
Any line in the plane is perpendicular to
the normal of the plane,
which can be written in the form
w·x + b = 0
where w is the normal, and b is the negative distance
of the plane to the origin (up to the scale of w).
Recall: Equation of a plane
31
Define this as a decision problem
The decision hyperplane: w·x + b = 0
Decision function: f(x) = sgn(w·x + b)
Support Vectors
32
Define this as a decision problem
The decision hyperplane: w·x + b = 0
Margin hyperplanes: w·x + b = +1 and w·x + b = −1
Support Vectors
33
Support Vectors
The decision hyperplane: w·x + b = 0
Scale invariance: for any c > 0, (cw)·x + cb = 0 defines the same hyperplane,
so we can rescale w and b until the margin hyperplanes become w·x + b = ±1.
35
This scaling does not change the
decision hyperplane or the support
vector hyperplanes, but it eliminates
a variable from the optimization.
36
We will represent the size of
the margin in terms of w.
This will allow us to simultaneously:
• Identify a decision boundary
• Maximize the margin
What are we optimizing?
37
1. There must be at least
one point that lies on
each margin hyperplane
How do we represent the size of
the margin in terms of w?
Proof outline: if not, we
could define a larger-margin
pair of hyperplanes that does
touch the nearest
point(s).
38
How do we represent the size of the
margin in terms of w?
1. There must be at least one point
that lies on each margin hyperplane
2. Thus: w·x₁ + b = +1 for some point x₁,
and w·x₂ + b = −1 for some point x₂
3. And: subtracting the two gives w·(x₁ − x₂) = 2
39
3. And: w·(x₁ − x₂) = 2
The vector w is perpendicular to the decision hyperplane:
for any two points x, x′ on it, w·x + b = 0 = w·x′ + b, so w·(x − x′) = 0.
If the dot product of two vectors equals zero, the two vectors are perpendicular.
How do we represent the size of the
margin in terms of w?
40
41
The margin is the projection
of x1 – x2 onto w, the normal
of the hyperplane.
How do we represent the size of the
margin in terms of w?
42
Recall: Vector Projection
43
The margin is the projection
of x₁ − x₂ onto w, the normal of
the hyperplane.
How do we represent the size of the
margin in terms of w?
Projection: the length of the projection of a vector a onto w is (w·a)/‖w‖
Size of the Margin: (w·(x₁ − x₂))/‖w‖ = 2/‖w‖
44
Maximizing the margin
Goal: maximize the margin 2/‖w‖, i.e. minimize ‖w‖²/2,
subject to linear separability of the data
by the decision boundary: yᵢ(w·xᵢ + b) ≥ 1 for all i
45
For constrained optimization, use Lagrange multipliers
and the Karush-Kuhn-Tucker (KKT) conditions.
Optimize the “Primal”:
L(w, b, α) = ‖w‖²/2 − Σᵢ αᵢ [yᵢ(w·xᵢ + b) − 1], with αᵢ ≥ 0
Max Margin Loss Function
46
Optimize the “Primal”
Max Margin Loss Function
Partial w.r.t. b: ∂L/∂b = −Σᵢ αᵢyᵢ = 0, so Σᵢ αᵢyᵢ = 0
47
Optimize the “Primal”
Max Margin Loss Function
Partial w.r.t. w: ∂L/∂w = w − Σᵢ αᵢyᵢxᵢ = 0, so w = Σᵢ αᵢyᵢxᵢ
Now we have to find the αᵢ:
substitute back into the loss function.
48
Construct the “dual”
Max Margin Loss Function
With: Σᵢ αᵢyᵢ = 0 and αᵢ ≥ 0
Then: maximize W(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)
49
Solve this quadratic program to identify the
Lagrange multipliers and thus the weights
Dual formulation of the error
There exist (rather) fast approaches to quadratic
program solving in C, C++, Python, Java, and R
50
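As a toy alternative to the fast QP packages mentioned above, the dual can be attacked directly on a tiny, made-up separable dataset with projected gradient ascent (clip to αᵢ ≥ 0, re-project onto Σᵢαᵢyᵢ = 0 after each step). This is a sketch of the formulation, not a production solver; real code would use LIBSVM, SVMLight, or a QP library.

```python
import numpy as np

# Projected gradient ascent on the SVM dual
#   W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i.x_j)
# for a hypothetical 2-D dataset with two points per class.

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i.x_j
alpha = np.zeros(len(X))
lr = 0.01
for _ in range(5000):
    alpha += lr * (1.0 - Q @ alpha)        # ascent step on the dual
    alpha -= y * (alpha @ y) / (y @ y)     # project onto sum(alpha*y) = 0
    alpha = np.maximum(alpha, 0.0)         # enforce alpha_i >= 0

w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
sv = np.flatnonzero(alpha > 1e-6)          # support vectors: alpha_i > 0
b = y[sv[0]] - w @ X[sv[0]]                # from y_i (w.x_i + b) = 1
print(np.sign(X @ w + b))                  # should match the labels y
```

Consistent with the KKT discussion that follows, only the two points nearest the boundary end up with non-zero αᵢ.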
Quadratic Programming
Minimize f(x) = (1/2) xᵀQx + cᵀx subject to linear constraints.
• If Q is positive semi-definite, then f(x) is convex.
• If f(x) is convex, then any local minimum is a global minimum.
51
Karush-Kuhn-Tucker Conditions
In constrained optimization, at the optimal solution:
constraint × Lagrange multiplier = 0, i.e. αᵢ [yᵢ(w·xᵢ + b) − 1] = 0
Only points on the margin hyperplanes contribute to the solution!
Support Vector Expansion
New decision function: f(x) = sgn(Σᵢ αᵢyᵢ (xᵢ·x) + b)
When αᵢ is non-zero, xᵢ is a support vector.
When αᵢ is zero, xᵢ is not a support vector.
52
• General idea: the original input space can always
be mapped to some higher-dimensional feature
space where the training set is separable
Nonlinear SVMs
Φ: x→ φ(x)
53
• Linearly separable dataset in 1D:
• Non-separable dataset in 1D:
• We can map the data to a higher-dimensional space: x → (x, x²)
Nonlinear SVMs
Slide credit: Andrew Moore
54
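The 1-D example above can be checked numerically. The data and the threshold on x² are illustrative assumptions: negatives sit between two positive clusters, so no single threshold on x separates the classes, but after the lifting x → (x, x²) a horizontal line does.

```python
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

def separable_by_threshold(x, y):
    """Can any single threshold on x split the two classes?"""
    for t in np.concatenate([x, [x.max() + 1.0]]):
        left, right = y[x < t], y[x >= t]
        if (all(left == -1) and all(right == 1)) or \
           (all(left == 1) and all(right == -1)):
            return True
    return False

# lift to 2-D: phi(x) = (x, x^2); the line x2 = 2 now separates the data
phi = np.stack([x, x ** 2], axis=1)
pred = np.where(phi[:, 1] >= 2.0, 1, -1)
print(separable_by_threshold(x, y), (pred == y).all())
```

The same idea, applied implicitly via kernels, is what the next slides develop.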
• General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable.
• The kernel trick: instead of explicitly computing the
lifting transformation φ(x), define a kernel function K
such that
K(x,y) = φ(x) · φ(y)
(to be valid, the kernel function must satisfy
Mercer’s condition)
The kernel trick
55
The kernel trick
• Linear SVM decision function: f(x) = sgn(w·x + b) = sgn(Σᵢ αᵢyᵢ (xᵢ·x) + b)
(xᵢ: support vector, αᵢ: learned weight)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
56
The kernel trick
• Linear SVM decision function: f(x) = sgn(Σᵢ αᵢyᵢ (xᵢ·x) + b)
• Kernel SVM decision function: f(x) = sgn(Σᵢ αᵢyᵢ φ(xᵢ)·φ(x) + b) = sgn(Σᵢ αᵢyᵢ K(xᵢ, x) + b)
• This gives a nonlinear decision boundary in the original feature space
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
57
Polynomial kernel: K(x, y) = (c + x·y)ᵈ
58
Gaussian kernel
• Also known as the radial basis function
(RBF) kernel:
K(x, y) = exp(−(1/(2σ²)) ‖x − y‖²)
(Figure: K(x, y) as a function of the distance ‖x − y‖)
59
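The identity K(x, y) = φ(x)·φ(y) can be verified numerically for a kernel whose lifting is known in closed form. For the homogeneous degree-2 polynomial kernel in 2-D (c = 0, an illustrative choice), φ(x) = (x₁², √2·x₁x₂, x₂²); the RBF kernel is shown alongside, where φ is infinite-dimensional and only K is ever computed.

```python
import numpy as np

def poly_kernel(x, y, d=2, c=0.0):
    return (c + x @ y) ** d

def phi(x):
    """Explicit lifting for the degree-2 homogeneous polynomial kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# equal up to floating-point rounding: the kernel trick in action
print(poly_kernel(x, y), phi(x) @ phi(y))
print(rbf_kernel(x, y))  # computed without ever forming phi
```

This is why kernel SVMs scale with the number of support vectors rather than with the (possibly infinite) dimension of the lifted space.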
Gaussian kernel
(Figure: nonlinear decision boundary of a Gaussian-kernel SVM with the support vectors (SVs) highlighted)
60
SVMs: Pros and cons
• Pros:
Kernel-based framework is very powerful and flexible
Training is convex optimization; a globally optimal solution can be found
Amenable to theoretical analysis
SVMs work very well in practice, even with very small training sample sizes
• Cons:
No “direct” multi-class SVM; must combine two-class SVMs (e.g., with one-vs-others)
Computation and memory costs (esp. for nonlinear SVMs)
61
• Generalization refers to the ability to correctly
classify never before seen examples
• Can be controlled by turning “knobs” that affect
the complexity of the model
Generalization
Training set (labels known) vs. test set (labels unknown)
62
• Training error: how does the model perform on the
data on which it was trained?
• Test error: how does it perform on never before seen
data?
Diagnosing generalization ability
Source: D. Hoiem
(Figure: error vs. model complexity, from low to high; training error keeps decreasing while test error is high at both extremes, marking the underfitting and overfitting regimes)
63
Underfitting and overfitting
• Underfitting: training and test error are both high
Model does an equally poor job on the training and the test set
Either the training procedure is ineffective or the model is too
“simple” to represent the data
• Overfitting: training error is low but test error is high
Model fits irrelevant characteristics (noise) in the training data
Model is too complex or the amount of training data is insufficient
(Figure: underfitting, good generalization, and overfitting)
64
Effect of training set size
(Figure: test error vs. model complexity for many vs. few training examples)
Source: D. Hoiem
65
66
Validation
• Split the data into training, validation, and test subsets
• Use the training set to optimize model parameters
• Use the validation set to choose the best model
• Use the test set only to evaluate performance
(Figure: training, validation, and test set loss vs. model complexity; the stopping point is where the validation set loss is lowest)
67
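The split described above can be sketched as follows; the 60/20/20 proportions and the placeholder data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

# Shuffle once, then carve the data into train/validation/test subsets.
# Parameters are fit on train, the complexity "knob" is chosen by
# validation loss, and the test set is touched only once at the end.

rng = np.random.default_rng(0)
X = np.arange(100).reshape(100, 1).astype(float)  # placeholder features
y = (X[:, 0] > 50).astype(int)                    # placeholder labels

idx = rng.permutation(len(X))
n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Keeping the three subsets disjoint is what makes the final test error an honest estimate of generalization.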
Histogram of Oriented Gradients (HoG)
[Dalal and Triggs, CVPR 2005]
10x10 cells
20x20 cells
HoGify
68
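The core HoG idea, per-cell histograms of gradient orientations weighted by gradient magnitude, can be sketched as below. This is a heavily simplified illustration: real HoG (Dalal and Triggs) adds block normalization and interpolation, and the cell size and bin count here are illustrative.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation (0-180 deg)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    h_cells, w_cells = img.shape[0] // cell, img.shape[1] // cell
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((h_cells, w_cells, bins))
    for i in range(h_cells):
        for j in range(w_cells):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

img = np.tile(np.arange(32), (32, 1))  # horizontal ramp: all gradients at 0 deg
h = hog_cells(img, cell=8, bins=9)
print(h.shape)  # (4, 4, 9): a 4x4 grid of cells, 9 orientation bins each
```

Concatenating (and normalizing) these cell histograms yields the fixed-length descriptor that is fed to a linear SVM in the pedestrian detector above.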
Histogram of Oriented Gradients
(HoG)
69
Many examples on the Internet:
https://www.youtube.com/watch?v=b_DSTuxdGDw
Histogram of Oriented Gradients
(HoG)
[Dalal and Triggs, CVPR 2005]
70
71
Demo software:
http://cs.stanford.edu/people/karpathy/svmjs/demo
A useful website: www.kernel-machines.org
Software: LIBSVM: www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMLight: svmlight.joachims.org/
Resources
72
73
Thank you!