Learning and Vision: Discriminative Models
Paul Viola, Microsoft Research
[email protected]

TRANSCRIPT

Page 1: Viola 2003 Learning and Vision: Discriminative Models Paul Viola Microsoft Research viola@microsoft.com

Viola 2003

Learning and Vision: Discriminative Models

Paul Viola, Microsoft Research

[email protected]

Page 2:

Overview

• Perceptrons
• Support Vector Machines
  – Face and pedestrian detection
• AdaBoost
  – Faces
• Building Fast Classifiers
  – Trading off speed for accuracy…
  – Face and object detection
• Memory Based Learning
  – Simard
  – Moghaddam

Page 3:

History Lesson

• 1950s: Perceptrons are cool
  – Very simple learning rule, can learn “complex” concepts
  – Generalized perceptrons are better, but need too many weights
• 1960s: Perceptrons stink (Minsky & Papert)
  – Some simple concepts require an exponential # of features
  – Can’t possibly learn that, right?
• 1980s: MLPs are cool (Rumelhart & McClelland / PDP)
  – Sort of simple learning rule, can learn anything (?)
  – Create just the features you need
• 1990s: MLPs stink
  – Hard to train: slow / local minima
• 1996: Perceptrons are cool again

Page 4:

Why did we need multi-layer perceptrons?

• Problems like this seem to require very complex non-linearities.

• Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

Page 5:

Why an exponential number of features?

A generalized perceptron over x = (x_1, …, x_n) uses every monomial up to some order k as a feature:

  Φ(x) = (1, x_1, x_2, …, x_1², x_1·x_2, x_2², x_1²·x_2, x_1·x_2·x_3, …)

A 14th-order expansion here already has 120 features. In general, the number of monomials of degree ≤ k in n variables is

  C(n + k, k) = (n + k)! / (n!·k!) = O((n + k)^min(n, k))

where k is the polynomial order and n is the number of variables.

For n = 21, k = 5 this is already about 65,000 features.
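The count above is easy to check directly; a minimal sketch (the helper name `num_poly_features` is my own):

```python
from math import comb

def num_poly_features(n, k):
    """Number of monomials of degree <= k in n variables: C(n + k, k)."""
    return comb(n + k, k)

print(num_poly_features(21, 5))   # 65780, the slide's "~65,000 features"
print(num_poly_features(2, 14))   # 120: a 14th-order expansion of 2 variables
```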

Page 6:

MLP’s vs. Perceptron

• MLPs are hard to train
  – Takes a long time (unpredictably long)
  – Can converge to poor minima
• MLPs are hard to understand
  – What are they really doing?
• Perceptrons are easy to train
  – A type of linear programming; polynomial time.
  – One minimum, which is global.
• Generalized perceptrons are easier to understand.
  – Polynomial functions.

Page 7:

Perceptron Training is Linear Programming

Given training data (x_i, y_i) with x_i ∈ R^N and y_i ∈ {-1, +1}, find w such that

  ∀i: y_i·wᵀx_i > 0

This is polynomial time in the number of variables and in the number of constraints.

What about linearly inseparable data? Add slack variables s_i:

  ∀i: y_i·wᵀx_i + s_i > 0,  ∀i: s_i ≥ 0,  min Σ_i s_i
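The “easy to train” claim can also be illustrated with the classic on-line perceptron rule rather than an LP solver; a minimal sketch on toy data (all names are hypothetical):

```python
# On-line perceptron (a sketch): add each misclassified example into w until
# every constraint y_i * w.x_i > 0 is satisfied, or we run out of epochs.
def train_perceptron(X, y, epochs=100):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                errors += 1
        if errors == 0:       # all constraints satisfied: linearly separated
            break
    return w

# Linearly separable toy data (bias folded in as a constant feature).
X = [(1, 2, 1), (1, 3, 2), (-1, -2, -1), (-2, -1, -2)]
y = [1, 1, -1, -1]
w = train_perceptron(X, y)
assert all(yi * sum(wj * xj for wj, xj in zip(w, xi)) > 0
           for xi, yi in zip(X, y))
```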

Page 8:

Rebirth of Perceptrons

• How to train effectively?
  – Linear programming (… later quadratic programming)
  – Though on-line works great too.
• How to get so many features inexpensively?
  – The kernel trick.
• How to generalize with so many features?
  – VC dimension. (Or is it regularization?)

→ Support Vector Machines

Page 9:

Lemma 1: Weight vectors are simple

• The weight vector lives in a sub-space spanned by the examples
  – Its dimensionality is determined by the number of examples, not the complexity of the feature space.

Starting from w_0 = 0, each perceptron update adds a multiple of one training example, so the weight vector can always be written as

  w = Σ_l b_l·x_l    or, with a feature map,    w = Σ_l b_l·Φ(x_l)

Page 10:

Lemma 2: Only need to compare examples

Substituting w = Σ_l b_l·Φ(x_l) into the classifier, examples enter only through inner products:

  y(x) = wᵀΦ(x) = Σ_l b_l·Φ(x_l)ᵀΦ(x) = Σ_l b_l·K(x_l, x)

so only the kernel K(x_l, x) = Φ(x_l)ᵀΦ(x) is ever needed; Φ itself never has to be computed explicitly.
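A sketch of the resulting kernel classifier; `poly_kernel` and `predict` are illustrative names, not from the slides:

```python
# Kernel classifier (sketch): the prediction is a weighted sum of kernel
# comparisons against stored training examples, per the lemma above.
def poly_kernel(x, z, d=2):
    return (1 + sum(a * b for a, b in zip(x, z))) ** d

def predict(b, X_train, x, kernel=poly_kernel):
    return sum(bl * kernel(xl, x) for bl, xl in zip(b, X_train))
```

For instance, with one stored example and coefficient b = 1.0, `predict([1.0], [(1, 0)], (1, 0))` evaluates to (1 + 1)² = 4.0.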

Page 11:

Simple Kernels yield Complex Features

For two-dimensional inputs, the second-order polynomial kernel

  K(x, z) = (1 + xᵀz)² = (1 + x_1·z_1 + x_2·z_2)²
          = 1 + 2·x_1·z_1 + 2·x_2·z_2 + x_1²·z_1² + 2·x_1·x_2·z_1·z_2 + x_2²·z_2²

is exactly the inner product Φ(x)ᵀΦ(z) of the explicit feature vectors

  Φ(x) = (1, √2·x_1, √2·x_2, x_1², √2·x_1·x_2, x_2²)

so one cheap kernel evaluation implicitly compares all quadratic features.
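The equivalence can be verified numerically; a sketch assuming the 2-D quadratic feature map written above:

```python
import math

# Check that (1 + x.z)^2 equals phi(x).phi(z) for the explicit quadratic map.
def phi(x):
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)

def kernel(x, z):
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (0.5, -1.0), (2.0, 3.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
assert abs(explicit - kernel(x, z)) < 1e-9   # same value, up to rounding
```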

Page 12:

But Kernel Perceptrons Can Generalize Poorly

  K(x, z) = (1 + xᵀz)¹⁴

Page 13:

Perceptron Rebirth: Generalization

• Too many features … Occam is unhappy
  – Perhaps we should encourage smoothness?

  ∀i: y_i·Σ_j b_j·K(x_j, x_i) + s_i > 0,  ∀i: s_i ≥ 0

  min Σ_i s_i   plus the smoothing term   min Σ_j b_j²

Page 14:

Linear Program is not unique

The linear program can return any multiple of a correct weight vector: if w satisfies

  ∀i: y_i·wᵀx_i > 0

then so does c·w for any c > 0. Slack variables and a weight prior force the solution toward zero:

  ∀i: y_i·wᵀx_i + s_i ≥ 0,  ∀i: s_i ≥ 0,  min Σ_i s_i,  min wᵀw

Page 15:

Definition of the Margin

• Geometric margin: the gap between negatives and positives, measured perpendicular to the hyperplane.

• Classifier margin:

  min_{i ∈ POS} wᵀx_i - max_{i ∈ NEG} wᵀx_i
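The classifier margin is directly computable; a sketch with hypothetical toy data:

```python
# Classifier margin (sketch): smallest score over the positives minus
# largest score over the negatives, for a fixed weight vector w.
def classifier_margin(w, pos, neg):
    score = lambda x: sum(wj * xj for wj, xj in zip(w, x))
    return min(score(x) for x in pos) - max(score(x) for x in neg)

w = (1.0, 1.0)
pos = [(2, 2), (3, 1)]
neg = [(0, 0), (-1, 1)]
print(classifier_margin(w, pos, neg))  # 4.0
```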

Page 16:

Require non-zero margin

  ∀l: y_l·wᵀx_l ≥ 0 - s_l    allows solutions with zero margin.

  ∀l: y_l·wᵀx_l ≥ 1 - s_l    enforces a non-zero margin between examples and the decision boundary.

Page 17:

Constrained Optimization

• Find the smoothest function that separates the data
  – Quadratic programming (similar to linear programming)
    • Single minimum, which is global
    • Polynomial-time algorithm

  ∀l: y_l·Σ_j b_j·K(x_j, x_l) ≥ 1 - s_l,  ∀l: s_l ≥ 0

  min Σ_l s_l   and   min Σ_j b_j²

Page 18:

Constrained Optimization 2

(Figure: the constraint for x_3 is inactive.)

Page 19:

SVM: examples

Page 20:

SVM: Key Ideas

• Augment inputs with a very large feature set
  – Polynomials, etc.
• Use the Kernel Trick(TM) to do this efficiently
• Enforce/encourage smoothness with a weight penalty
• Introduce a margin
• Find the best solution using quadratic programming

Page 21:

SVM: Zip Code Recognition

• Data dimension: 256
• Feature space: 4th order polynomials
  – roughly 100,000,000 dimensions

Page 22:

The Classical Face Detection Process

(Figure: a detector window is scanned over the image at every position, from the smallest scale up through larger scales: about 50,000 locations/scales per image.)

Page 23:

Classifier is Learned from Labeled Data

• Training data
  – 5,000 faces
    • All frontal
  – 10⁸ non-faces
  – Faces are normalized
    • Scale, translation
• Many variations
  – Across individuals
  – Illumination
  – Pose (rotation both in plane and out)

Page 24:

Key Properties of Face Detection

• Each image contains 10,000 to 50,000 locations/scales
• Faces are rare: 0 to 50 per image
  – 1,000 times as many non-faces as faces
• An extremely small false positive rate is required: about 10⁻⁶

Page 25:

Sung and Poggio

Page 26:

Rowley, Baluja & Kanade

First Fast System: Low Resolution to High

Page 27:

Osuna, Freund, and Girosi

Page 28:

Support Vectors

Page 29:

P, O, & G: First Pedestrian Work

Page 30:

On to AdaBoost

• Given a set of weak classifiers

– None much better than random

• Iteratively combine classifiers
  – Form a linear combination

– Training error converges to 0 quickly

– Test error is related to training margin

  Originally: h_j(x) ∈ {-1, +1}
  Also: h_j(x) ∈ R  (“confidence rated”)

  C(x) = Σ_t α_t·h_t(x) + b

Page 31:

AdaBoost

(Figure: weak classifier 1 is trained; the weights of misclassified examples are increased; weak classifier 2 is trained on the reweighted data; weights are increased again; weak classifier 3 is trained. The final classifier is a linear combination of the weak classifiers.)

  α_t = ½·log(W₊ / W₋)    (W₊, W₋: total weight of correctly / incorrectly classified examples)

  D_{t+1}(i) = D_t(i)·e^(-y_i·α_t·h_t(x_i)) / Z_t

  h_t = argmin_h Σ_i D_t(i)·e^(-y_i·α·h(x_i))

Freund & Schapire

Page 32:

AdaBoost Properties

With the normalizer

  Z_t = Σ_i D_t(i)·e^(-y_i·α_t·h_t(x_i))

unrolling the weight update D_{t+1}(i) = D_t(i)·e^(-y_i·α_t·h_t(x_i)) / Z_t over all rounds shows that the product of the normalizers equals the exponential loss of the combined classifier:

  Π_t Z_t = Σ_i D_1(i)·e^(-y_i·Σ_t α_t·h_t(x_i)) = Σ_i D_1(i)·e^(-y_i·C(x_i)) = Σ_i D_1(i)·Loss(y_i, C(x_i))

with Loss(y, C(x)) = e^(-y·C(x)). So minimizing each Z_t in turn greedily minimizes an upper bound on the training error.
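The update rules and the Π_t Z_t identity can be exercised on toy data; a sketch using decision stumps as weak classifiers (all names are my own, and the stump search is simplified):

```python
import math

# AdaBoost with 1-D decision stumps (a sketch of the update rules above).
# Also checks numerically that the product of the normalizers Z_t equals
# the average exponential loss of the combined classifier.
def stump(threshold, sign):
    return lambda x: sign * (1 if x > threshold else -1)

def adaboost(xs, ys, rounds=5):
    m = len(xs)
    D = [1.0 / m] * m                 # uniform initial weights D_1(i)
    ensemble, prod_Z = [], 1.0
    candidates = [stump(t, s) for t in xs for s in (1, -1)]
    for _ in range(rounds):
        # pick the stump with lowest weighted error (both signs are tried)
        h = min(candidates,
                key=lambda g: sum(d for d, x, y in zip(D, xs, ys) if g(x) != y))
        eps = sum(d for d, x, y in zip(D, xs, ys) if h(x) != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - eps) / eps)
        unnorm = [d * math.exp(-y * alpha * h(x))
                  for d, x, y in zip(D, xs, ys)]
        Z = sum(unnorm)
        prod_Z *= Z
        D = [u / Z for u in unnorm]   # reweight: errors gain weight
        ensemble.append((alpha, h))
    return ensemble, prod_Z

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]             # not separable by any single stump
ensemble, prod_Z = adaboost(xs, ys)
C = lambda x: sum(a * h(x) for a, h in ensemble)
exp_loss = sum(math.exp(-y * C(x)) for x, y in zip(xs, ys)) / len(xs)
assert abs(prod_Z - exp_loss) < 1e-9  # the Prod_t Z_t identity holds
```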

Page 33:

AdaBoost: Super Efficient Feature Selector

• Features = weak classifiers

• Each round selects the optimal feature given:
  – Previously selected features
  – Exponential loss

Page 34:

Boosted Face Detection: Image Features

“Rectangle filters”

Similar to Haar wavelets (Papageorgiou et al.)

60,000 × 100 = 6,000,000 unique binary features

  h_t(x) = α_t if f_t(x) > θ_t, otherwise β_t

  C(x) = Σ_t h_t(x) + b
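Rectangle filters are cheap because any rectangle sum costs four array lookups once an integral image is built; a sketch (the helper names and the 4×3 toy image are illustrative):

```python
# Integral image: ii[r][c] holds the sum of all pixels above and to the left,
# so any rectangle sum is four lookups, independent of rectangle size.
def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle with top-left corner (r, c), in O(1)."""
    return ii[r + h][c + w] - ii[r][c + w] - ii[r + h][c] + ii[r][c]

def two_rect_feature(ii, r, c, h, w):
    """Left-half minus right-half filter value (one of the filter types)."""
    return rect_sum(ii, r, c, h, w // 2) - rect_sum(ii, r, c + w // 2, h, w // 2)

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
ii = integral_image(img)
assert rect_sum(ii, 0, 0, 3, 4) == 78   # sum of the whole image
assert rect_sum(ii, 1, 1, 2, 2) == 34   # 6 + 7 + 10 + 11
```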

Page 35:

Page 36:

(Figure slide repeating the AdaBoost equations from Page 31.)

Page 37:

Feature Selection

• For each round of boosting:
  – Evaluate each rectangle filter on each example
  – Sort examples by filter value
  – Select the best threshold for each filter (min Z)
  – Select the best filter/threshold pair (= feature)
  – Reweight examples

• For M filters, T thresholds, N examples, and learning time L:
  – O(MT · L(MTN)): naïve wrapper method
  – O(MN): AdaBoost feature selector
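The sort-then-sweep threshold search for one filter can be sketched as follows (one polarity only; the real selector also tries the flipped sign, minimizes Z rather than raw weighted error, and gets its weights from AdaBoost):

```python
# Best threshold for one filter (sketch): sort examples by filter value, then
# sweep once with running weight sums, so each filter costs one sort plus O(N).
def best_threshold(values, labels, weights):
    order = sorted(range(len(values)), key=lambda i: values[i])
    w_neg = sum(w for w, y in zip(weights, labels) if y == -1)
    # Start with threshold below all values: predict +1 everywhere,
    # so every negative is misclassified.
    best_err, best_thr = w_neg, values[order[0]] - 1.0
    below_pos = below_neg = 0.0
    for i in order:
        if labels[i] == 1:
            below_pos += weights[i]   # a positive now falls below threshold
        else:
            below_neg += weights[i]
        # rule: predict -1 at or below threshold, +1 above
        err = below_pos + (w_neg - below_neg)
        if err < best_err:
            best_err, best_thr = err, values[i]
    return best_thr, best_err

vals = [0.1, 0.4, 0.35, 0.8]
labs = [-1, 1, -1, 1]
ws = [0.25, 0.25, 0.25, 0.25]
thr, err = best_threshold(vals, labs, ws)
print(thr, err)  # 0.35 0.0: this toy filter separates the data perfectly
```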

Page 38:

Example Classifier for Face Detection

ROC curve for 200 feature classifier

A classifier with 200 rectangle features was learned using AdaBoost

95% correct detection on the test set, with 1 false positive in 14,084 windows.

Not quite competitive...

Page 39:

Building Fast Classifiers

• Given a nested set of classifier hypothesis classes

• Computational Risk Minimization
  – The trade-off between % false positives and false negatives is determined by the stage thresholds

(Figure: an ROC curve of % detection vs. % false positives, and a cascade flowchart: each image sub-window passes through Classifier 1, then Classifier 2, then Classifier 3; a “true” result passes the window onward, eventually to FACE, while a “false” result at any stage immediately yields NON-FACE.)

Page 40:

Other Fast Classification Work

• Simard

• Rowley (Faces)
• Fleuret & Geman (Faces)

Page 41:

Cascaded Classifier

(Figure: an image sub-window enters a 1-feature classifier; about 50% of windows pass to a 5-feature classifier, about 20% pass on to a 20-feature classifier, and about 2% reach the final FACE decision. A “false” result at any stage yields NON-FACE.)

• A 1-feature classifier achieves a 100% detection rate and about a 50% false positive rate.

• A 5-feature classifier achieves a 100% detection rate and a 40% false positive rate (20% cumulative), using data from the previous stage.

• A 20-feature classifier achieves a 100% detection rate with a 10% false positive rate (2% cumulative).
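The cascade’s early-rejection logic can be sketched in a few lines; the stages here are hypothetical threshold tests standing in for the boosted classifiers, and the cumulative false positive rate is the product of the per-stage rates (0.5 × 0.4 × 0.1 = 2%):

```python
# Cascade evaluation (sketch): a window is rejected at the first stage it
# fails, so most non-faces cost only the cheapest classifier.
def cascade(stages, window):
    for stage in stages:
        if not stage(window):
            return False          # NON-FACE: rejected early, cheaply
    return True                   # FACE: survived every stage

# Hypothetical stand-in stages of increasing strictness.
stages = [lambda w: w > 0.2, lambda w: w > 0.5, lambda w: w > 0.9]
print(cascade(stages, 0.95))  # True
print(cascade(stages, 0.3))   # False: rejected by the second stage
```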

Page 42:

Comparison to Other Systems

Detection rate (%) for a given number of false detections:

Detector              |   10    31    50    65    78    95   110   167   422
Viola-Jones           | 78.3  85.2  88.8  90.0  90.1  90.8  91.1  91.8  93.7
Rowley-Baluja-Kanade  | 83.2  86.0  89.2  90.1        89.9
Schneiderman-Kanade   |                   94.4
Roth-Yang-Ahuja       |                        (94.8)

Page 43:

Output of Face Detector on Test Images

Page 44:

Solving other “Face” Tasks

Facial Feature Localization

Demographic Analysis

Profile Detection

Page 45:

Feature Localization

• Surprising properties of our framework
  – The cost of detection is not a function of image size

• Just the number of features

– Learning automatically focuses attention on key regions

• Conclusion: the “feature” detector can include a large contextual region around the feature

Page 46:

Feature Localization Features

• Learned features reflect the task

Page 47:

Profile Detection

Page 48:

More Results

Page 49:

Profile Features

Page 50:

Features, Features, Features

• In almost every case:

  Good features beat good learning

  Learning beats no learning

• Critical classifier ratio:

  quality / complexity

• AdaBoost >> SVM