Face detection with boosted Gaussian features
Pattern Recognition, Feb 2007. Presented by 井民全.
Outline
• Introduction
• A brief overview of AdaBoost
• The VC-Dimension concept
• The features
  – Anisotropic Gaussian filters
  – Gaussian vs. Haar-like
• Experiments and results
Introduction
• Automatic face detection is a key step in any face processing system
• It is far from a trivial task
  – faces are highly deformable objects
  – lighting conditions and poses vary widely
• Holistic methods
  – consider the face as a global object
• Feature-based methods
  – recognize parts of the face and assemble them to take the final decision
Introduction
• The classical approach for face detection
  – Step 1: scan the input image with a sliding window
  – Step 2: classify each window position as either face or non-face
• Efficient exploration of the search space is a key ingredient for obtaining a fast face detector
  – skin color, a coarse-to-fine approach, etc.
Introduction
• A fast algorithm was proposed by Viola and Jones, built on three main ideas:
  • train a strong classifier by boosting Haar-like feature-based weak classifiers
  • use the so-called integral image as the image representation, so that features can be computed very efficiently
  • arrange the classifiers in a cascade structure for speed
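The integral image idea can be sketched in a few lines of Python (a generic illustration, not the authors' code): after one precomputation pass, the sum of any rectangle costs just four array lookups.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; a zero row/column is prepended
    # so rectangle sums need no boundary checks.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, x, y, w, h):
    # Sum of the w-by-h rectangle with top-left corner (x, y),
    # computed with 4 array references regardless of rectangle size.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```

This is why rectangle features remain cheap even for the largest rectangles in a window.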
A brief overview of AdaBoost
• A strong classifier is a weighted combination of weak classifiers h1, h2, h3, h4, …, hT-1, hT
• Classification operates on 24x24 pixel windows
Cascaded Classifiers Structure
• Each stage is built by an AdaBoost learner that performs feature selection and classifier training over the feature set
• Stage 1: a single weak classifier h1; windows classified False are rejected
• Stage 2: weak classifiers h1, h2, …, h10; windows that pass go on, the rest are rejected (False)
• Stage 3: still more weak classifiers; only windows that pass every stage are accepted as faces
• Each stage rejects as many negatives as possible while keeping almost all faces (minimizing false negatives), e.g. a 100% detection rate with a 50% false-positive rate per stage
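The cascade above can be sketched as follows; the stage list layout and the example thresholds are illustrative assumptions, not the paper's API:

```python
def cascade_classify(window, stages):
    # `stages` is a list of (weak_classifiers, threshold) pairs, ordered
    # from cheapest to most complex; each weak classifier is a
    # (decision_fn, alpha) pair.  A window is rejected as soon as one
    # stage's weighted vote falls below that stage's threshold.
    for classifiers, threshold in stages:
        score = sum(alpha * h(window) for h, alpha in classifiers)
        if score < threshold:
            return 0          # False -> reject early, as in the diagram
    return 1                  # passed every stage -> face
```

The early exit is the whole point: the vast majority of windows are non-faces and are dismissed by the first, cheapest stages.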
Haar-like features
• Two-rectangle feature: the difference between the sums of pixels within two rectangular regions
  – the regions have the same size and shape and are horizontally or vertically adjacent
• The base resolution is 24x24; the exhaustive set of rectangle features is large, over 180,000
• Three-rectangle feature: the sum within two outside rectangles subtracted from the sum in a center rectangle
• Four-rectangle feature: the difference between diagonal pairs of rectangles
Over 180,000 rectangle features are associated with each 24x24 sub-image
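A minimal sketch of one feature type, assuming the standard rectangle-sum formulation (the ~180,000 total counts all the feature types together; the count below covers only the horizontally adjacent two-rectangle type):

```python
import numpy as np

def two_rect_feature(window, x, y, w, h):
    # Two-rectangle feature: difference between the pixel sums of two
    # equally sized, horizontally adjacent w-by-h rectangles.
    left = window[y:y + h, x:x + w].sum()
    right = window[y:y + h, x + w:x + 2 * w].sum()
    return left - right

def count_two_rect_features(size=24):
    # Exhaustively count the horizontally adjacent two-rectangle
    # features that fit inside a size-by-size window.
    n = 0
    for w in range(1, size + 1):
        for h in range(1, size + 1):
            for x in range(size - 2 * w + 1):
                for y in range(size - h + 1):
                    n += 1
    return n
```

Enumerating every position and size of every type is what makes the feature pool so large.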
The feature values: an example
• Each example x is mapped to a feature vector {f1(x), f2(x), …, fj(x), …, f180,000(x)}

The training process for a weak classifier (an example)
• Training examples (xi, yi) with their feature values fj(x):
  – y = 1: {10, 23, …, 5, …}
  – y = 1: {7, 20, …, 25, …}
  – y = 0: {15, 21, …, 100, …}
  – y = 0: {15, 21, …, 20, …}
• A weak classifier thresholds a single feature, e.g.
  h1(xi) = 1 if fj(xi) < 30, 0 otherwise
• Search for the feature whose training error is minimal!
The 1st iteration
• Applying h1(xi) = 1 if fj(xi) < 30, 0 otherwise:
  – yi = 1, fj(x) = {10, 23, …, 5, …} → h1(x) = 1
  – yi = 1, fj(x) = {7, 20, …, 25, …} → h1(x) = 1
  – yi = 0, fj(x) = {15, 21, …, 100, …} → h1(x) = 0
  – yi = 0, fj(x) = {7, 23, …, 20, …} → h1(x) = 1 (false positive: a non-face classified as a face)
The training error for h1
• The weighted error of classifier hj is
  εj = Σi wi |hj(xi) − yi|
• With each of the four examples starting at weight 1/4, the three correct classifications contribute 0 and the false positive contributes 1/4:
  εj = 0 + 0 + 0 + 1/4 = 1/4 for h1
• The weights w1,i are the per-example misclassification costs
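The weighted error εj = Σi wi |hj(xi) − yi| is one line of Python; the numbers below reproduce the slide's four-example scenario:

```python
def weighted_error(weights, predictions, labels):
    # epsilon_j = sum_i w_i * |h_j(x_i) - y_i|: because h and y are both
    # 0 or 1, each term adds w_i exactly when example i is misclassified.
    return sum(w * abs(h - y)
               for w, h, y in zip(weights, predictions, labels))
```

With uniform weights of 1/4 and a single false positive, the error comes out to 1/4, matching the slide.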
Update the weight (1/2)
• wt+1,i = wt,i βt^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise
• Correctly classified examples get their weight reduced (multiplied by βt < 1): distribute the contribution!
Update the weight (2/2)
• With εt = 1/4, correctly classified examples are multiplied by βt = 0.25/0.75:
  – yi = 1, h1(x) = 1 (correct): w2,i = 1/4 × 0.25/0.75 (decreases)
  – yi = 1, h1(x) = 1 (correct): w2,i = 1/4 × 0.25/0.75 (decreases)
  – yi = 0, h1(x) = 0 (correct): w2,i = 1/4 × 0.25/0.75 (decreases)
  – yi = 0, h1(x) = 1 (false positive): w2,i = 1/4 (unchanged)
Normalizing the weights
• wt+1,i ← wt+1,i / Σk=1..N wt+1,k, where N = # of examples
• After normalization:
  – each correctly classified example: w2,i = 0.166
  – the misclassified example: w2,i = 0.5 (the weight of the example that was just misclassified grows from 1/4 to 0.5)
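The update-then-normalize step can be sketched as a small helper (an illustration, not the paper's code); it reproduces the slide's numbers, where the misclassified example's weight grows from 1/4 to 0.5:

```python
def update_weights(weights, correct, error):
    # beta_t = eps_t / (1 - eps_t).  Correctly classified examples are
    # multiplied by beta_t (< 1 when eps_t < 1/2); misclassified ones
    # keep their weight; then all weights are renormalized to sum to 1.
    beta = error / (1.0 - error)
    new = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]
```

Relative to the others, the hard example's cost rises, which is exactly what steers the next round toward classifying it correctly.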
Analysis
• We choose as classifier the feature that yields the minimal overall weighted classification error.
• The examples misclassified in the previous round have had their error cost increased, so the training process is driven not to misclassify them again in this round.
• Each example's weight is its misclassification cost; when classifying with feature j, the overall training error is εj = Σi wi |hj(xi) − yi|
• In the cascade structure, the false positives of h1 (non-face windows passed as faces) become the focus of the next classifier h2
The Boost algorithm for
classifier learning
Step 1: Given example images (x1, y1), …, (xn, yn), where yi = 0 for negative (non-face) and yi = 1 for positive (face) examples.
Step 2: Initialize the weights w1,i = 1/(2m) for yi = 0 and w1,i = 1/(2l) for yi = 1, where m and l are the numbers of negatives and positives.
For t = 1, …, T:
  1. Normalize the weights, wt,i ← wt,i / Σk=1..n wt,k, so that wt is a probability distribution.
  2. For each feature j, train a classifier hj which is restricted to using a single feature (the weak learner constructor). The error is evaluated with respect to wt: εj = Σi wt,i |hj(xi) − yi|. Choose the classifier ht with the lowest error εt.
  3. Update the weights: wt+1,i = wt,i βt^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise, and βt = εt / (1 − εt).
The final strong classifier:
  h(x) = 1 if Σt=1..T αt ht(x) ≥ (1/2) Σt=1..T αt, and 0 otherwise, where αt = log(1/βt)
• The window passes if more than half of the weighted votes agree.
• The weight of each classifier's vote is determined by its accuracy.
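The whole loop can be sketched in Python. This is a compact illustration, not the paper's implementation: the stump search, the uniform weight initialization (the slides split it by class), and the data layout are all assumptions made for brevity.

```python
import math

def train_stump(values, labels, weights):
    # For one feature: try every observed value as a threshold theta and
    # both polarities p, returning the (theta, p, error) that minimizes
    # the weighted error  sum_i w_i * |h(x_i) - y_i|.
    best = (None, None, float("inf"))
    for theta in values:
        for p in (1, -1):
            err = sum(w * abs((1 if p * v < p * theta else 0) - y)
                      for v, y, w in zip(values, labels, weights))
            if err < best[2]:
                best = (theta, p, err)
    return best

def adaboost(features, labels, T):
    # features[j][i] holds feature j evaluated on example i, i.e. f_j(x_i).
    n = len(labels)
    weights = [1.0 / n] * n        # uniform start (the slides split by class)
    strong = []                    # chosen weak classifiers: (j, theta, p, alpha)
    for _ in range(T):
        total = sum(weights)
        weights = [w / total for w in weights]          # 1. normalize
        # 2. train a one-feature stump per feature; keep the lowest error
        j, (theta, p, err) = min(
            ((j, train_stump(vals, labels, weights))
             for j, vals in enumerate(features)),
            key=lambda item: item[1][2])
        err = max(err, 1e-10)      # guard: a perfect stump would give log(0)
        beta = err / (1.0 - err)
        # 3. shrink the weights of the correctly classified examples
        for i, v in enumerate(features[j]):
            h = 1 if p * v < p * theta else 0
            if h == labels[i]:
                weights[i] *= beta
        strong.append((j, theta, p, math.log(1.0 / beta)))
    return strong

def classify(strong, example):
    # Final strong classifier: alpha-weighted majority vote of the weak ones.
    votes = sum(alpha for j, theta, p, alpha in strong
                if (1 if p * example[j] < p * theta else 0) == 1)
    return 1 if votes >= 0.5 * sum(s[3] for s in strong) else 0
```

The inner `train_stump` scan over all features is exactly where the 180,000-feature pool makes training expensive.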
Outline
• Introduction
• A brief overview of AdaBoost
• The VC-Dimension concept
• The features
  – Anisotropic Gaussian filters
  – Gaussian vs. Haar-like
• Experiments and results
The VC-Dimension concept
• A learning machine f takes an input x and transforms it, using some vector of adjustable parameters α, into a predicted output: x → f → yest
• yest ∈ {−1, 1}; in some papers the definition is yest ∈ {0, 1}
How do we characterize “power”?
• Different machines have different amounts of "power"
• Tradeoff:
  – More power: can model more complex classifiers, but might overfit
  – Less power: not going to overfit, but restricted in what it can model
• How do we characterize the amount of power?
Some definitions
• Given some machine f
• Assume all training points (xk, yk) were drawn i.i.d. (independent and identically distributed) from some distribution
• Assume future test points will be drawn from the same distribution
Probability of misclassification
R(α) = TESTERR(α) = E[ ½ |y − f(x, α)| ]  — the probability of misclassification
Remp(α) = TRAINERR(α) = (1/R) Σk=1..R ½ |yk − f(xk, α)|  — the fraction of the training set misclassified
where R = # of training examples
Vapnik-Chervonenkis dimension
• Given some machine f, let h be its VC dimension
• Vapnik showed that, with probability 1 − η,
  TESTERR(α) ≤ TRAINERR(α) + sqrt( ( h (log(2R/h) + 1) − log(η/4) ) / R )
  where TRAINERR, R (the # of training examples) and η are known quantities, and h is the VC dimension
• This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of f
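The confidence term of the bound is easy to evaluate numerically (using the form of the bound given above; the η = 0.05 default is an arbitrary choice for illustration):

```python
import math

def vc_confidence(h, R, eta=0.05):
    # Vapnik's confidence term: with probability 1 - eta,
    # TESTERR <= TRAINERR + sqrt((h * (log(2R/h) + 1) - log(eta/4)) / R),
    # where h is the VC dimension and R is the number of training examples.
    return math.sqrt((h * (math.log(2.0 * R / h) + 1.0)
                      - math.log(eta / 4.0)) / R)
```

The term grows with the VC dimension h and shrinks with the number of training examples R, which is the whole tradeoff the next slides exploit.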
• But given machine f, how do we define and compute h, the VC-dimension of f?

Shattering
• Machine f can shatter a set of points x1, x2, …, xr if and only if:
  – for every possible training set of the form (x1, y1), (x2, y2), …, (xr, yr)
  – there exists some value of α that gets zero training error
Question
• Can the following f shatter the following points?
Answer: No problem
• There are four training sets to consider:
  – a horizontal split (ok), a diagonal split (ok), the diagonal split with signs flipped (ok), and the horizontal split with signs flipped (ok)
Question
• Can the following f shatter the following points?
Answer: No way my friend
• Labelings with one class outside the circle and one class inside are fine (ok), in either sign (ok)
• But the remaining labelings conflict: no change of parameters can realize them, because in f(x, b) there is no way to control x
• What's the VC dimension of this f(x, b)?

Definition of VC dimension
• Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them
• Ans: 1 for the circle machine above
• Intuitively: over all arrangements of examples, the largest number of examples this machine can classify without error
VC dim of line machine
• For 2-d inputs, what's the VC-dimension of f(x, w, b) = sign(w·x + b)?
• Well, can we find four points that f can shatter?
…
• The larger a machine's VC-dimension, the greater its power.
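Shattering can be checked empirically for tiny point sets. The sketch below brute-forces random linear classifiers (an illustrative heuristic, not a proof procedure): three non-collinear points are shattered by 2-d lines, while four points in the XOR configuration are not, consistent with the line machine having VC-dimension 3.

```python
import itertools
import numpy as np

def can_shatter(points, trials=5000, seed=0):
    # For every +/-1 labeling of `points`, randomly search for a linear
    # classifier sign(w.x + b) with zero training error.  Failing to
    # find one for some labeling means "not shattered" (heuristic, but
    # reliable for these tiny configurations).
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    for labels in itertools.product((-1, 1), repeat=len(points)):
        y = np.asarray(labels)
        for _ in range(trials):
            w = rng.normal(size=X.shape[1])
            b = rng.normal()
            if np.all(np.sign(X @ w + b) == y):
                break                 # this labeling is realizable
        else:
            return False              # no separating line found
    return True
```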
Structural Risk Minimization
• Considers a sequence of hypothesis spaces f1, f2, … of increasing complexity
  – for example, polynomials of increasing degree

Structural Risk Minimization
• We're trying to decide which machine to use
• We train each machine and make a table with columns: i, TrainErr, VC-Conf, probability upper bound on TestErr, and Choice
• The rows run from i = 1 (the simplest machine) to i = 5 (the most complex); the machine with the lowest upper bound on TestErr is chosen

Analysis
• Vapnik-Chervonenkis theory tells us that any machine's test error is related to its VC-dimension, i.e. the machine's complexity
• For the same data set, the more complex the machine, the more it overfits the training data
Generalization error for the AdaBoost proposed by Freund
• The weak classifiers have the form
  hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise,
  where θj is a threshold, pj is a parity indicating the direction of the inequality sign, and fj is a feature
• With high probability, the test error of the decision function fT output by AdaBoost exceeds TRAINERR by a term on the order of sqrt(T d / N),
  where N is the number of examples, T is the number of boosting rounds, and d is the VC-dimension of the weak classifier space
• For example, the AdaBoost cascade proposed in [1] uses 6061 features in total across all layers
• AdaBoost has an important drawback: it tends to overfit the training examples
Outline
• Introduction
• A brief overview of AdaBoost
• The VC-Dimension concept
• The features
  – Anisotropic Gaussian filters
  – Gaussian vs. Haar-like
• Experiments and results
The proposed new features – anisotropic Gaussian filters
• The generating function φ(x, y): ℝ² → ℝ is an odd, edge-like function, a Gaussian modulated by x:
  φ(x, y) = x exp(−(x² + y²))
• It efficiently captures contour singularities with a smooth, low-resolution function
The transformations
• Translation by (x0, y0): T(x0,y0)(x, y) = (x − x0, y − y0)
• Rotation by θ: Rθ(x, y) = (x cos θ + y sin θ, −x sin θ + y cos θ)
• Bending by r: Br bends the filter's axis along a circle of radius r, remapping points with x < r through the polar coordinates sqrt(x² + (r − y)²) and the arctangent of their ratio, and leaving the remaining points unchanged
• Anisotropic scaling by (sx, sy): S(sx,sy)(x, y) = (x/sx, y/sy)
• By combining these four basic transformations, a whole family of filters is generated:
  φ(x0,y0,θ,r,sx,sy)(x, y) = φ( S(sx,sy) ∘ Br ∘ Rθ ∘ T(x0,y0) (x, y) )
Anisotropic Gaussian filters with different rotation and bending parameters
Some of the features selected by the proposed method
• A feature is the inner product between the input image and a particular filter:
  fi(D) = ∫∫X×Y φi(x, y) I(x, y) dx dy
• The weak classifiers keep the same form as before:
  hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise,
  where θj is a threshold, pj is a parity indicating the direction of the inequality sign, and fj is a feature
• Input image × particular filter → feature
• The features are particularly well adapted to capture local contours that are insensitive to changes in the lighting conditions
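Discretely, the integral above is just an element-wise product and sum. The sketch below assumes the generating function as reconstructed on the earlier slide, applies only anisotropic scaling (rotation, bending and translation are omitted), and uses arbitrary sx, sy values for illustration:

```python
import numpy as np

def sample_filter(size, sx=4.0, sy=2.0):
    # A sampled anisotropic Gaussian filter built from the generating
    # function phi(x, y) = x * exp(-(x^2 + y^2)), with anisotropic
    # scaling only; it is odd in x, so it responds to vertical edges.
    ys, xs = np.mgrid[0:size, 0:size]
    x = (xs - size / 2.0) / sx
    y = (ys - size / 2.0) / sy
    return x * np.exp(-(x ** 2 + y ** 2))

def gaussian_feature(window, filt):
    # Feature value: the inner product of the image window with the
    # filter, a discrete version of the integral on the slide.
    return float(np.sum(window * filt))
```

Because the filter is odd and smooth, a window containing an edge produces a much larger response than a flat window, and a uniform brightness change shifts the response only slightly.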
Gaussian vs. Haar-like
The Haar-like templates
• Haar filters model global contrasts; they
  – capture the contrast between image regions well (good)
  – are more sensitive to the direction of the light source (bad)
  – are limited for modeling the smooth transitions present in facial images (bad)
Gaussian vs. Haar-like
The first and second features selected by AdaBoost proposed by P. Viola
Gaussian vs. Haar-like
• As the number of stages grows, the Haar features' test error stops decreasing, while the Gaussian features' test error keeps decreasing
• AdaBoost focuses on the hard-to-classify examples, and the simplistic Haar features (HF) are not discriminant enough to separate the two classes
The receiver operating characteristic analysis
Experimental Results
• A 20x15-pixel window is used to scan the image
• For different scales, the image is rescaled by powers of 1.2
• Training data
  – Face images: XM2VTS, BioID, FERET, augmented by scaling, in-plane rotation and shifts: 9,500 face images
  – Non-face data: randomly selected images without human faces: 500,000 non-face images
• Total set of Gaussian features (GF) used for training the classifier: 202,200 features
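The multi-scale scan can be sketched by enlarging the window by powers of 1.2 (equivalent to shrinking the image); the scan stride is an assumption, since the slides don't specify it:

```python
def sliding_window_scan(image_shape, win_w=20, win_h=15, scale=1.2, step=2):
    # Enumerate candidate detection windows: slide a 20x15 window over
    # the image, then repeat with the window enlarged by powers of 1.2.
    H, W = image_shape
    windows = []
    s = 1.0
    while win_w * s <= W and win_h * s <= H:
        w, h = int(win_w * s), int(win_h * s)
        stride = max(1, int(step * s))
        for y in range(0, H - h + 1, stride):
            for x in range(0, W - w + 1, stride):
                windows.append((x, y, w, h))
        s *= scale
    return windows
```

Every window in this list is what the cascade must classify, which is why rejecting most of them in the first cheap stages matters so much.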
The evaluation protocol
• The Popovici score compares, for each detection:
  – the between-eyes distance
  – the angle of the eyes axis
  – the annotated (ground-truth) position vs. the detected position
• Because every detector output is included in the score, a high false-positive count also lowers the score
• Test sets:
  – 12,480 images in the French and English datasets
  – 23 images with 155 low-resolution faces
  – 123 images with 483 faces (manually selected)
  – 130 images with 507 faces (complete set)
• A mixed cascade is fastest: most false positives are filtered out by a fast 5-stage Haar-feature (HF) cascade, so it is faster than a pure 12-stage Gaussian-feature (GF) cascade
• Thank you