Viola 2003
Learning and Vision: Discriminative Models
Paul Viola, Microsoft Research
Overview
• Perceptrons
• Support Vector Machines
– Face and pedestrian detection
• AdaBoost
– Faces
• Building Fast Classifiers
– Trading off speed for accuracy…
– Face and object detection
• Memory Based Learning
– Simard
– Moghaddam
History Lesson
• 1950s: Perceptrons are cool
– Very simple learning rule, can learn "complex" concepts
– Generalized perceptrons are better – but too many weights
• 1960s: Perceptrons stink (Minsky & Papert)
– Some simple concepts require an exponential # of features
• Can't possibly learn that, right?
• 1980s: MLPs are cool (Rumelhart & McClelland / PDP)
– Sort-of-simple learning rule, can learn anything (?)
– Create just the features you need
• 1990s: MLPs stink
– Hard to train: slow / local minima
• 1996: Perceptrons are cool
Why did we need multi-layer perceptrons?
• Problems like this seem to require very complex non-linearities.
• Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.
Why an exponential number of features?

A generalized perceptron enumerates every monomial of the inputs as a feature:

  Φ(x) = (1, x_1, x_2, …, x_1^2, x_1 x_2, x_2^2, …, x_1 x_2 x_3, …)

14th Order??? 120 Features

The number of monomials of order up to k in n variables is

  C(n+k, k) = (n+k)! / (n! k!) = O(n^min(k,n))

  k: poly order, n: variables

N=21, k=5 --> 65,000 features
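The feature count is easy to sanity-check. A quick sketch (n and k as on the slide; Python's `math.comb` computes the binomial coefficient):

```python
from math import comb

def num_poly_features(n, k):
    """Number of monomials of degree <= k in n variables: C(n+k, k)."""
    return comb(n + k, k)

# The slide's example: 21 variables, 5th-order polynomial.
print(num_poly_features(21, 5))   # 65780 -- the slide's "65,000 features"
```

The count grows combinatorially with the order k, which is the point of the slide: enumerating the features explicitly quickly becomes infeasible.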
MLPs vs. Perceptrons
• MLPs are hard to train…
– Takes a long time (unpredictably long)
– Can converge to poor local minima
• MLPs are hard to understand
– What are they really doing?
• Perceptrons are easy to train…
– A type of linear programming; polynomial time.
– One minimum, which is global.
• Generalized perceptrons are easier to understand.
– Polynomial functions.
Perceptron Training is Linear Programming

Given training data (x_i, y_i), x ∈ R^N, y ∈ {−1, +1}:

  find w such that  ∀i: y_i w^T x_i > 0

Polynomial time in the number of variables and in the number of constraints.

What about linearly inseparable data? Add slack variables s_i:

  min Σ_i s_i   subject to   ∀i: y_i w^T x_i + s_i > 0,   s_i ≥ 0
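Besides the LP formulation, the classic online rule satisfies the same constraints y_i w^T x_i > 0 on separable data. A minimal numpy sketch (the toy data is made up for illustration):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Online perceptron: on each mistake (y_i * w.x_i <= 0), add y_i * x_i to w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:          # all constraints y_i w.x_i > 0 are satisfied
            break
    return w

# Toy separable data; the first component is a constant 1 (folds in the bias).
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
```

On separable data this converges in a finite number of updates (the classic perceptron convergence theorem); the LP view gives the same guarantee with an explicit polynomial-time bound.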
Rebirth of Perceptrons
• How to train effectively
– Linear Programming (… later Quadratic Programming)
– Though on-line training works great too.
• How to get so many features inexpensively?!?
– The Kernel Trick
• How to generalize with so many features?
– VC dimension. (Or is it regularization?)
Support Vector Machines
Lemma 1: Weight vectors are simple

• The weight vector lives in a sub-space spanned by the examples…
– Its dimensionality is determined by the number of examples, not the complexity of the feature space.

  w_0 = 0
  w_{t+1} = w_t + y_l x_l   (each update adds a multiple of one example)
  ⇒  w = Σ_l b_l Φ(x_l)
Lemma 2: Only need to compare examples

  y = sign(w^T Φ(x))
    = sign( (Σ_l b_l Φ(x_l))^T Φ(x) )
    = sign( Σ_l b_l Φ(x_l)^T Φ(x) )
    = sign( Σ_l b_l K(x_l, x) )

with w = Σ_l b_l Φ(x_l).
Simple Kernels yield Complex Features

  K(x, z) = (1 + x^T z)^2
          = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2

This is an inner product in the feature space

  Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2)
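The kernel-equals-inner-product identity on this slide can be verified numerically. A sketch with numpy, using z as the second input vector and the √2 factors that make the feature map exact:

```python
import numpy as np

def K(u, v):
    """Second-order polynomial kernel."""
    return (1 + u @ v) ** 2

def phi(x):
    """Explicit feature map with (1 + u.v)^2 = phi(u).phi(v)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

u = np.array([0.5, -1.0])
v = np.array([2.0, 3.0])
assert np.isclose(K(u, v), phi(u) @ phi(v))   # same value, two computations
```

Evaluating K costs one inner product in R^2, while the explicit map lives in R^6; for higher orders and dimensions the gap is exactly the exponential blow-up counted earlier.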
But Kernel Perceptrons Can Generalize Poorly

  K(x, z) = (1 + x^T z)^14
Perceptron Rebirth: Generalization
• Too many features … Occam is unhappy
– Perhaps we should encourage smoothness?

  min Σ_i s_i   and   min Σ_j b_j^2   (smoother)

  subject to   ∀i: y_i Σ_j b_j K(x_j, x_i) + s_i > 0,   s_i ≥ 0
Linear Program is not unique

The linear program can return any multiple of the correct weight vector...

Slack variables & weight prior – force the solution toward zero:

  ∀i: y_i w^T x_i > 0   becomes
  ∀i: y_i w^T x_i + s_i ≥ 0,   s_i ≥ 0,   min Σ_i s_i,   min w^T w
Definition of the Margin
• Geometric Margin: Gap between negatives and positives measured perpendicular to a hyperplane
• Classifier Margin:

  margin = min_{i∈POS} (w^T x_i) / ‖w‖ − max_{i∈NEG} (w^T x_i) / ‖w‖
Require non-zero margin

  ∀l: y_l w^T x_l + s_l > 0   – allows solutions with zero margin.

  ∀l: y_l w^T x_l + s_l ≥ 1   – enforces a non-zero margin between examples and the decision boundary.
Constrained Optimization

• Find the smoothest function that separates the data
– Quadratic Programming (similar to Linear Programming)
• A single, global minimum
• Polynomial-time algorithm

  min Σ_l s_l   and   min Σ_j b_j^2

  subject to   ∀l: y_l Σ_j b_j K(x_j, x_l) + s_l ≥ 1,   s_l ≥ 0
Constrained Optimization 2
[Figure: the margin constraint for example x_3 is inactive.]
SVM: examples
SVM: Key Ideas
• Augment inputs with a very large feature set
– Polynomials, etc.
• Use Kernel Trick(TM) to do this efficiently
• Enforce/Encourage smoothness with a weight penalty
• Introduce margin
• Find best solution using Quadratic Programming
SVM: Zip Code recognition
• Data dimension: 256
• Feature space: 4th-order polynomial
– roughly 100,000,000 dims
The Classical Face Detection Process
[Figure: a window is scanned over the image at every position, from the smallest scale up to larger scales – roughly 50,000 locations/scales per image.]
Classifier is Learned from Labeled Data
• Training Data
– 5000 faces
• All frontal
– 10^8 non-faces
– Faces are normalized
• Scale, translation
• Many variations
– Across individuals
– Illumination
– Pose (rotation both in plane and out)
Key Properties of Face Detection
• Each image contains 10-50 thousand locations/scales
• Faces are rare: 0-50 per image
– 1000 times as many non-faces as faces
• Requires an extremely small false-positive rate: 10^-6
Sung and Poggio
Rowley, Baluja & Kanade
First Fast System - Low Res to Hi
Osuna, Freund, and Girosi
Support Vectors
P, O, & G: First Pedestrian Work
On to AdaBoost
• Given a set of weak classifiers
– None much better than random
• Iteratively combine classifiers
– Form a linear combination
– Training error converges to 0 quickly
– Test error is related to the training margin

  h_j(x) ∈ {−1, +1}   (originally)
  h_j(x) ∈ R   ("confidence rated")

  C(x) = Σ_t α_t h_t(x)
AdaBoost (Freund & Schapire)

[Figure: Weak Classifier 1 → weights of misclassified examples increased → Weak Classifier 2 → weights increased → Weak Classifier 3 → …]

The final classifier is a linear combination of the weak classifiers.

  h_t = argmin_h Σ_i D_t(i) e^{−y_i h(x_i)}

  α_t = (1/2) log(W_t^c / W_t^e),   where W_t^c and W_t^e are the total weights of the examples h_t classifies correctly and incorrectly

  D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t
AdaBoost Properties

  Z_t = Σ_i D_t(i) e^{−α_t y_i h_t(x_i)}

The training loss is the exponential loss:

  Loss(y_i, C(x_i)) = Σ_i e^{−y_i C(x_i)}

Unrolling the weight update D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t (with D_1(i) = 1/N) gives

  (1/N) Σ_i e^{−y_i C(x_i)} = (1/N) Σ_i Π_t e^{−α_t y_i h_t(x_i)} = Π_t Z_t

so minimizing Z_t at each round greedily minimizes the exponential loss of the combined classifier.
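A minimal sketch of these update rules (discrete AdaBoost over a fixed pool of weak classifiers; the toy data and stumps are made up for illustration):

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Minimal discrete AdaBoost; weak_learners are functions h(x) -> -1 or +1."""
    n = len(y)
    D = np.full(n, 1.0 / n)                     # example weights D_t(i)
    alphas, chosen = [], []
    for _ in range(T):
        # h_t = weak classifier with the lowest weighted error under D_t.
        preds = [np.array([h(x) for x in X]) for h in weak_learners]
        errs = [D[p != y].sum() for p in preds]
        t = int(np.argmin(errs))
        eps = max(errs[t], 1e-12)
        if eps >= 0.5:                          # no weak learner beats random
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * preds[t])   # D_{t+1}(i) ~ D_t(i) e^{-a y_i h_t(x_i)}
        D = D / D.sum()                         # divide by the normalizer Z_t
        alphas.append(alpha)
        chosen.append(weak_learners[t])
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, chosen)))

# Toy 1-D data and two threshold stumps.
X = [-2.0, -1.0, 1.0, 2.0]
y = np.array([-1, -1, 1, 1])
stumps = [lambda x: 1.0 if x > 0 else -1.0,
          lambda x: 1.0 if x > -1.5 else -1.0]
clf = adaboost(X, y, stumps, T=3)
```

The `alpha` formula here is the usual error-rate form; it is equivalent to the slide's ratio of correctly to incorrectly classified weight, since eps is exactly the misclassified weight.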
AdaBoost: Super Efficient Feature Selector
• Features = Weak Classifiers
• Each round selects the optimal feature given:
– Previously selected features
– Exponential loss
Boosted Face Detection: Image Features
"Rectangle filters"

Similar to Haar wavelets [Papageorgiou et al.]

60,000 × 100 = 6,000,000 unique binary features

  h_t(x_i) = +1 if f_t(x_i) > θ_t, −1 otherwise

  C(x) = Σ_t α_t h_t(x)
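The slides do not spell out how rectangle filters are evaluated quickly; in the Viola-Jones detector each rectangle sum costs four array lookups via the integral image. A sketch (numpy assumed; the two-rectangle filter is one example shape):

```python
import numpy as np

def integral_image(img):
    """Cumulative 2-D sum: entry (r, c) holds the sum of img[:r+1, :c+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    """Sum of pixels in the h x w rectangle with top-left corner (top, left)."""
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_filter(ii, top, left, h, w):
    """Two-rectangle feature: left half minus right half (w must be even)."""
    return (rect_sum(ii, top, left, h, w // 2)
            - rect_sum(ii, top, left + w // 2, h, w // 2))

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
```

After the one-time cumulative-sum pass, every filter evaluation is constant time regardless of rectangle size, which is what makes exhaustively scoring millions of features per round feasible.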
Feature Selection
• For each round of boosting:
– Evaluate each rectangle filter on each example
– Sort examples by filter values
– Select the best threshold for each filter (min Z)
– Select the best filter/threshold combination (= feature)
– Reweight examples
• M filters, T thresholds, N examples, L learning time
– O( MT L(MTN) ) naïve wrapper method
– O( MN ) AdaBoost feature selector
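The sort-then-sweep threshold search can be sketched as follows: one pass over the sorted filter values finds the error-minimizing threshold for a single filter (function and variable names are mine, not from the slides):

```python
import numpy as np

def best_threshold(values, y, D):
    """Best threshold for a stump h(x) = +1 iff filter(x) > theta.

    values: one filter's output on every example; y in {-1,+1}; D: example weights.
    One sorted sweep: moving theta past each example flips its prediction to -1.
    """
    order = np.argsort(values)
    v, yy, w = values[order], y[order], D[order]
    err = w[yy == -1].sum()            # theta below the min: everything predicted +1
    best_err, best_theta = err, v[0] - 1.0
    for i in range(len(v)):
        # Example i now falls on the 'predict -1' side of theta.
        err += w[i] if yy[i] == 1 else -w[i]
        theta = v[i] if i == len(v) - 1 else (v[i] + v[i + 1]) / 2
        if err < best_err:
            best_err, best_theta = err, theta
    return best_theta, best_err

values = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1])
theta, err = best_threshold(values, y, np.full(4, 0.25))
```

The slides minimize Z rather than weighted error; for ±1-valued stumps the two criteria are monotonically related (Z_t = 2√(ε(1−ε))), so sweeping on error picks the same threshold.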
Example Classifier for Face Detection
ROC curve for 200 feature classifier
A classifier with 200 rectangle features was learned using AdaBoost
95% correct detection on the test set, with 1 in 14,084 false positives.
Not quite competitive...
Building Fast Classifiers
• Given a nested set of classifier hypothesis classes
• Computational Risk Minimization
• The trade-off between false positives and false negatives at each stage is determined by its threshold.

[Figure: ROC curve, % detection (50-100) vs. % false positives (0-50).]

[Figure: cascade of classifiers. Each IMAGE SUB-WINDOW enters Classifier 1; a T result passes it on to Classifier 2, then Classifier 3; an F result at any stage rejects it as NON-FACE.]
Other Fast Classification Work
• Simard
• Rowley (Faces)
• Fleuret & Geman (Faces)
Cascaded Classifier
[Figure: cascade. IMAGE SUB-WINDOW → 1-feature classifier → 5-feature classifier → 20-feature classifier → FACE; the stages pass 50%, 20%, and 2% of sub-windows respectively, and an F at any stage rejects the sub-window as NON-FACE.]
• A 1-feature classifier achieves a 100% detection rate and about a 50% false-positive rate.
• A 5-feature classifier achieves a 100% detection rate and a 40% false-positive rate (20% cumulative)
– using data from the previous stage.
• A 20-feature classifier achieves a 100% detection rate with a 10% false-positive rate (2% cumulative).
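The control flow of the cascade is just an early-exit loop; a sketch (the stage classifiers and thresholds are placeholders):

```python
def cascade_classify(x, stages):
    """Evaluate a detection cascade: stages is a list of (classifier, threshold).

    Reject as soon as any stage's score falls below its threshold; only
    sub-windows passing every stage are reported as faces.
    """
    for clf, thresh in stages:
        if clf(x) < thresh:
            return False   # NON-FACE: rejected early, later stages never run
    return True            # FACE: survived all stages

# Toy stages: the 'classifier' is the identity score, thresholds 0 then 5.
stages = [(lambda x: x, 0), (lambda x: x, 5)]
```

Because the vast majority of sub-windows are rejected by the cheap early stages, the average cost per window stays close to the cost of the first (1-feature) classifier, which is the source of the speed-for-accuracy trade-off this section describes.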
Comparison to Other Systems

Detection rate (%) at a given number of false detections:

False detections:        10    31    50    65    78    95   110   167   422
Viola-Jones            78.3  85.2  88.8  90.0  90.1  90.8  91.1  91.8  93.7
Rowley-Baluja-Kanade   83.2  86.0     –     –     –  89.2     –  90.1  89.9
Schneiderman-Kanade       –     –     –  94.4     –     –     –     –     –
Roth-Yang-Ahuja           –     –     –     – (94.8)    –     –     –     –
Output of Face Detector on Test Images
Solving other “Face” Tasks
Facial Feature Localization
Demographic Analysis
Profile Detection
Feature Localization
• Surprising properties of our framework
– The cost of detection is not a function of image size
• Just the number of features
– Learning automatically focuses attention on key regions
• Conclusion: the “feature” detector can include a large contextual region around the feature
Feature Localization Features
• Learned features reflect the task
Profile Detection
More Results
Profile Features
Features, Features, Features
• In almost every case:
– Good features beat good learning
– Learning beats no learning
• Critical classifier ratio: quality / complexity
– By this measure, AdaBoost >> SVM