Face detection with boosted Gaussian features
Pattern Recognition, Feb 2007. Presented by 井民全.
Outline
• Introduction
• A brief overview of AdaBoost
• The VC-Dimension concept
• The features
  – Anisotropic Gaussian filters
  – Gaussian vs. Haar-like
• Experiments and results
Introduction
• Automatic face detection is a key step in any face processing system
• It is far from a trivial task
  – faces are highly deformable objects
  – lighting conditions and poses vary widely
• Holistic methods
  – consider the face as a global object
• Feature-based methods
  – recognize parts of the face and assemble them to take the final decision
Introduction
• The classical approach for face detection
  – Step 1: scan the input image with a sliding window
  – Step 2: classify each window position as either face or non-face
• Efficient exploration of the search space is a key ingredient for obtaining a fast face detector
  – skin color, a coarse-to-fine approach, etc.
Introduction
• A fast algorithm was proposed by Viola and Jones, built on three main ideas:
  • train a strong classifier by boosting Haar-like feature-based weak classifiers
  • use the so-called integral image as the image representation, so that features can be computed very efficiently
  • arrange the classifiers in a cascade structure for speed
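The integral image idea can be sketched in a few lines of Python (a generic illustration, not the authors' code): after one precomputation pass, the sum of any rectangle costs just four array lookups.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; a zero row/column is prepended
    # so rectangle sums need no boundary checks.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, x, y, w, h):
    # Sum of the w-by-h rectangle with top-left corner (x, y),
    # computed with 4 array references regardless of rectangle size.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```

This is why rectangle features remain cheap even for the largest rectangles in a window.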
A brief overview of AdaBoost
• A strong classifier is a weighted combination of weak classifiers h1, h2, h3, h4, …, hT-1, hT
• Classification operates on 24x24 pixel windows
Cascaded Classifiers Structure
• Each stage is built by an AdaBoost learner that performs feature selection and classifier training over the feature set
• Stage 1: a single weak classifier h1; windows classified False are rejected
• Stage 2: weak classifiers h1, h2, …, h10; windows that pass go on, the rest are rejected (False)
• Stage 3: still more weak classifiers; only windows that pass every stage are accepted as faces
• Each stage rejects as many negatives as possible while keeping almost all faces (minimizing false negatives), e.g. a 100% detection rate with a 50% false-positive rate per stage
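The cascade above can be sketched as follows; the stage list layout and the example thresholds are illustrative assumptions, not the paper's API:

```python
def cascade_classify(window, stages):
    # `stages` is a list of (weak_classifiers, threshold) pairs, ordered
    # from cheapest to most complex; each weak classifier is a
    # (decision_fn, alpha) pair.  A window is rejected as soon as one
    # stage's weighted vote falls below that stage's threshold.
    for classifiers, threshold in stages:
        score = sum(alpha * h(window) for h, alpha in classifiers)
        if score < threshold:
            return 0          # False -> reject early, as in the diagram
    return 1                  # passed every stage -> face
```

The early exit is the whole point: the vast majority of windows are non-faces and are dismissed by the first, cheapest stages.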
Haar-like features
• Two-rectangle feature: the difference between the sums of pixels within two rectangular regions
  – the regions have the same size and shape and are horizontally or vertically adjacent
• The base resolution is 24x24; the exhaustive set of rectangle features is large, over 180,000
• Three-rectangle feature: the sum within two outside rectangles subtracted from the sum in a center rectangle
• Four-rectangle feature: the difference between diagonal pairs of rectangles
Over 180,000 rectangle features are associated with each 24x24 sub-image
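A minimal sketch of one feature type, assuming the standard rectangle-sum formulation (the ~180,000 total counts all the feature types together; the count below covers only the horizontally adjacent two-rectangle type):

```python
import numpy as np

def two_rect_feature(window, x, y, w, h):
    # Two-rectangle feature: difference between the pixel sums of two
    # equally sized, horizontally adjacent w-by-h rectangles.
    left = window[y:y + h, x:x + w].sum()
    right = window[y:y + h, x + w:x + 2 * w].sum()
    return left - right

def count_two_rect_features(size=24):
    # Exhaustively count the horizontally adjacent two-rectangle
    # features that fit inside a size-by-size window.
    n = 0
    for w in range(1, size + 1):
        for h in range(1, size + 1):
            for x in range(size - 2 * w + 1):
                for y in range(size - h + 1):
                    n += 1
    return n
```

Enumerating every position and size of every type is what makes the feature pool so large.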
The feature values: an example
• Each example x is mapped to a feature vector {f1(x), f2(x), …, fj(x), …, f180,000(x)}

The training process for a weak classifier (an example)
• Training examples (xi, yi) with their feature values fj(x):
  – y = 1: {10, 23, …, 5, …}
  – y = 1: {7, 20, …, 25, …}
  – y = 0: {15, 21, …, 100, …}
  – y = 0: {15, 21, …, 20, …}
• A weak classifier thresholds a single feature, e.g.
  h1(xi) = 1 if fj(xi) < 30, 0 otherwise
• Search for the feature whose training error is minimal!
The 1st iteration
• Applying h1(xi) = 1 if fj(xi) < 30, 0 otherwise:
  – yi = 1, fj(x) = {10, 23, …, 5, …} → h1(x) = 1
  – yi = 1, fj(x) = {7, 20, …, 25, …} → h1(x) = 1
  – yi = 0, fj(x) = {15, 21, …, 100, …} → h1(x) = 0
  – yi = 0, fj(x) = {7, 23, …, 20, …} → h1(x) = 1 (false positive: a non-face classified as a face)
The training error for h1
• The weighted error of classifier hj is
  εj = Σi wi |hj(xi) − yi|
• With each of the four examples starting at weight 1/4, the three correct classifications contribute 0 and the false positive contributes 1/4:
  εj = 0 + 0 + 0 + 1/4 = 1/4 for h1
• The weights w1,i are the per-example misclassification costs
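The weighted error εj = Σi wi |hj(xi) − yi| is one line of Python; the numbers below reproduce the slide's four-example scenario:

```python
def weighted_error(weights, predictions, labels):
    # epsilon_j = sum_i w_i * |h_j(x_i) - y_i|: because h and y are both
    # 0 or 1, each term adds w_i exactly when example i is misclassified.
    return sum(w * abs(h - y)
               for w, h, y in zip(weights, predictions, labels))
```

With uniform weights of 1/4 and a single false positive, the error comes out to 1/4, matching the slide.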
Update the weight (1/2)
• wt+1,i = wt,i βt^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise
• Correctly classified examples get their weight reduced (multiplied by βt < 1): distribute the contribution!
Update the weight (2/2)
• With εt = 1/4, correctly classified examples are multiplied by βt = 0.25/0.75:
  – yi = 1, h1(x) = 1 (correct): w2,i = 1/4 × 0.25/0.75 (decreases)
  – yi = 1, h1(x) = 1 (correct): w2,i = 1/4 × 0.25/0.75 (decreases)
  – yi = 0, h1(x) = 0 (correct): w2,i = 1/4 × 0.25/0.75 (decreases)
  – yi = 0, h1(x) = 1 (false positive): w2,i = 1/4 (unchanged)
Normalizing the weights
• wt+1,i ← wt+1,i / Σk=1..N wt+1,k, where N = # of examples
• After normalization:
  – each correctly classified example: w2,i = 0.166
  – the misclassified example: w2,i = 0.5 (the weight of the example that was just misclassified grows from 1/4 to 0.5)
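The update-then-normalize step can be sketched as a small helper (an illustration, not the paper's code); it reproduces the slide's numbers, where the misclassified example's weight grows from 1/4 to 0.5:

```python
def update_weights(weights, correct, error):
    # beta_t = eps_t / (1 - eps_t).  Correctly classified examples are
    # multiplied by beta_t (< 1 when eps_t < 1/2); misclassified ones
    # keep their weight; then all weights are renormalized to sum to 1.
    beta = error / (1.0 - error)
    new = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]
```

Relative to the others, the hard example's cost rises, which is exactly what steers the next round toward classifying it correctly.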
Analysis
• We choose as classifier the feature that yields the minimal overall weighted classification error.
• The examples misclassified in the previous round have had their error cost increased, so the training process is driven not to misclassify them again in this round.
• Each example's weight is its misclassification cost; when classifying with feature j, the overall training error is εj = Σi wi |hj(xi) − yi|
• In the cascade structure, the false positives of h1 (non-face windows passed as faces) become the focus of the next classifier h2
The Boost algorithm for
classifier learning
Step 1: Given example images (x1, y1), …, (xn, yn), where yi = 0 for negative (non-face) and yi = 1 for positive (face) examples.
Step 2: Initialize the weights w1,i = 1/(2m) for yi = 0 and w1,i = 1/(2l) for yi = 1, where m and l are the numbers of negatives and positives.
For t = 1, …, T:
  1. Normalize the weights, wt,i ← wt,i / Σk=1..n wt,k, so that wt is a probability distribution.
  2. For each feature j, train a classifier hj which is restricted to using a single feature (the weak learner constructor). The error is evaluated with respect to wt: εj = Σi wt,i |hj(xi) − yi|. Choose the classifier ht with the lowest error εt.
  3. Update the weights: wt+1,i = wt,i βt^(1−ei), where ei = 0 if example xi is classified correctly and ei = 1 otherwise, and βt = εt / (1 − εt).
The final strong classifier:
  h(x) = 1 if Σt=1..T αt ht(x) ≥ (1/2) Σt=1..T αt, and 0 otherwise, where αt = log(1/βt)
• The window passes if more than half of the weighted votes agree.
• The weight of each classifier's vote is determined by its accuracy.
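The whole loop can be sketched in Python. This is a compact illustration, not the paper's implementation: the stump search, the uniform weight initialization (the slides split it by class), and the data layout are all assumptions made for brevity.

```python
import math

def train_stump(values, labels, weights):
    # For one feature: try every observed value as a threshold theta and
    # both polarities p, returning the (theta, p, error) that minimizes
    # the weighted error  sum_i w_i * |h(x_i) - y_i|.
    best = (None, None, float("inf"))
    for theta in values:
        for p in (1, -1):
            err = sum(w * abs((1 if p * v < p * theta else 0) - y)
                      for v, y, w in zip(values, labels, weights))
            if err < best[2]:
                best = (theta, p, err)
    return best

def adaboost(features, labels, T):
    # features[j][i] holds feature j evaluated on example i, i.e. f_j(x_i).
    n = len(labels)
    weights = [1.0 / n] * n        # uniform start (the slides split by class)
    strong = []                    # chosen weak classifiers: (j, theta, p, alpha)
    for _ in range(T):
        total = sum(weights)
        weights = [w / total for w in weights]          # 1. normalize
        # 2. train a one-feature stump per feature; keep the lowest error
        j, (theta, p, err) = min(
            ((j, train_stump(vals, labels, weights))
             for j, vals in enumerate(features)),
            key=lambda item: item[1][2])
        err = max(err, 1e-10)      # guard: a perfect stump would give log(0)
        beta = err / (1.0 - err)
        # 3. shrink the weights of the correctly classified examples
        for i, v in enumerate(features[j]):
            h = 1 if p * v < p * theta else 0
            if h == labels[i]:
                weights[i] *= beta
        strong.append((j, theta, p, math.log(1.0 / beta)))
    return strong

def classify(strong, example):
    # Final strong classifier: alpha-weighted majority vote of the weak ones.
    votes = sum(alpha for j, theta, p, alpha in strong
                if (1 if p * example[j] < p * theta else 0) == 1)
    return 1 if votes >= 0.5 * sum(s[3] for s in strong) else 0
```

The inner `train_stump` scan over all features is exactly where the 180,000-feature pool makes training expensive.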
Outline
• Introduction
• A brief overview of AdaBoost
• The VC-Dimension concept
• The features
  – Anisotropic Gaussian filters
  – Gaussian vs. Haar-like
• Experiments and results
The VC-Dimension concept
• A learning machine f takes an input x and transforms it, using some vector of adjustable parameters α, into a predicted output: x → f → yest
• yest ∈ {−1, 1}; in some papers the definition is yest ∈ {0, 1}
How do we characterize “power”?
• Different machines have different amounts of "power"
• Tradeoff:
  – More power: can model more complex classifiers, but might overfit
  – Less power: not going to overfit, but restricted in what it can model
• How do we characterize the amount of power?
Some definitions
• Given some machine f
• Assume all training points (xk, yk) were drawn i.i.d. (independent and identically distributed) from some distribution
• Assume future test points will be drawn from the same distribution
Probability of misclassification
R(α) = TESTERR(α) = E[ ½ |y − f(x, α)| ]  — the probability of misclassification
Remp(α) = TRAINERR(α) = (1/R) Σk=1..R ½ |yk − f(xk, α)|  — the fraction of the training set misclassified
where R = # of training examples
Vapnik-Chervonenkis dimension
• Given some machine f, let h be its VC dimension
• Vapnik showed that, with probability 1 − η,
  TESTERR(α) ≤ TRAINERR(α) + sqrt( ( h (log(2R/h) + 1) − log(η/4) ) / R )
  where TRAINERR, R (the # of training examples) and η are known quantities, and h is the VC dimension
• This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of f
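The confidence term of the bound is easy to evaluate numerically (using the form of the bound given above; the η = 0.05 default is an arbitrary choice for illustration):

```python
import math

def vc_confidence(h, R, eta=0.05):
    # Vapnik's confidence term: with probability 1 - eta,
    # TESTERR <= TRAINERR + sqrt((h * (log(2R/h) + 1) - log(eta/4)) / R),
    # where h is the VC dimension and R is the number of training examples.
    return math.sqrt((h * (math.log(2.0 * R / h) + 1.0)
                      - math.log(eta / 4.0)) / R)
```

The term grows with the VC dimension h and shrinks with the number of training examples R, which is the whole tradeoff the next slides exploit.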
• But given machine f, how do we define and compute h, the VC-dimension of f?

Shattering
• Machine f can shatter a set of points x1, x2, …, xr if and only if:
  – for every possible training set of the form (x1, y1), (x2, y2), …, (xr, yr)
  – there exists some value of α that gets zero training error
Question
• Can the following f shatter the following points?
Answer: No problem
• There are four training sets to consider:
  – a horizontal split (ok), a diagonal split (ok), the diagonal split with signs flipped (ok), and the horizontal split with signs flipped (ok)
Question
• Can the following f shatter the following points?
Answer: No way my friend
• Labelings with one class outside the circle and one class inside are fine (ok), in either sign (ok)
• But the remaining labelings conflict: no change of parameters can realize them, because in f(x, b) there is no way to control x
• What's the VC dimension of this f(x, b)?

Definition of VC dimension
• Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them
• Ans: 1 for the circle machine above
• Intuitively: over all arrangements of examples, the largest number of examples this machine can classify without error
VC dim of line machine
• For 2-d inputs, what's the VC-dimension of f(x, w, b) = sign(w·x + b)?
• Well, can we find four points that f can shatter?
…
• The larger a machine's VC-dimension, the greater its power.
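Shattering can be checked empirically for tiny point sets. The sketch below brute-forces random linear classifiers (an illustrative heuristic, not a proof procedure): three non-collinear points are shattered by 2-d lines, while four points in the XOR configuration are not, consistent with the line machine having VC-dimension 3.

```python
import itertools
import numpy as np

def can_shatter(points, trials=5000, seed=0):
    # For every +/-1 labeling of `points`, randomly search for a linear
    # classifier sign(w.x + b) with zero training error.  Failing to
    # find one for some labeling means "not shattered" (heuristic, but
    # reliable for these tiny configurations).
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    for labels in itertools.product((-1, 1), repeat=len(points)):
        y = np.asarray(labels)
        for _ in range(trials):
            w = rng.normal(size=X.shape[1])
            b = rng.normal()
            if np.all(np.sign(X @ w + b) == y):
                break                 # this labeling is realizable
        else:
            return False              # no separating line found
    return True
```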
Structural Risk Minimization
• Considers a sequence of hypothesis spaces f1, f2, … of increasing complexity
  – for example, polynomials of increasing degree

Structural Risk Minimization
• We're trying to decide which machine to use
• We train each machine and make a table with columns: i, TrainErr, VC-Conf, probability upper bound on TestErr, and Choice
• The rows run from i = 1 (the simplest machine) to i = 5 (the most complex); the machine with the lowest upper bound on TestErr is chosen

Analysis
• Vapnik-Chervonenkis theory tells us that any machine's test error is related to its VC-dimension, i.e. the machine's complexity
• For the same data set, the more complex the machine, the more it overfits the training data
Generalization error for the AdaBoost proposed by Freund
• The weak classifiers have the form
  hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise,
  where θj is a threshold, pj is a parity indicating the direction of the inequality sign, and fj is a feature
• With high probability, the test error of the decision function fT output by AdaBoost exceeds TRAINERR by a term on the order of sqrt(T d / N),
  where N is the number of examples, T is the number of boosting rounds, and d is the VC-dimension of the weak classifier space
• For example, the AdaBoost cascade proposed in [1] uses 6061 features in total across all layers
• AdaBoost has an important drawback: it tends to overfit the training examples
Outline
• Introduction
• A brief overview of AdaBoost
• The VC-Dimension concept
• The features
  – Anisotropic Gaussian filters
  – Gaussian vs. Haar-like
• Experiments and results
The proposed new features – anisotropic Gaussian filters
• The generating function φ(x, y): ℝ² → ℝ is an odd, edge-like function, a Gaussian modulated by x:
  φ(x, y) = x exp(−(x² + y²))
• It efficiently captures contour singularities with a smooth, low-resolution function
The transformations
• Translation by (x0, y0): T(x0,y0)(x, y) = (x − x0, y − y0)
• Rotation by θ: Rθ(x, y) = (x cos θ + y sin θ, −x sin θ + y cos θ)
• Bending by r: Br bends the filter's axis along a circle of radius r, remapping points with x < r through the polar coordinates sqrt(x² + (r − y)²) and the arctangent of their ratio, and leaving the remaining points unchanged
• Anisotropic scaling by (sx, sy): S(sx,sy)(x, y) = (x/sx, y/sy)
• By combining these four basic transformations, a whole family of filters is generated:
  φ(x0,y0,θ,r,sx,sy)(x, y) = φ( S(sx,sy) ∘ Br ∘ Rθ ∘ T(x0,y0) (x, y) )
Anisotropic Gaussian filters with different rotation and bending parameters
Some of the features selected by the proposed method
• A feature is the inner product between the input image and a particular filter:
  fi(D) = ∫∫X×Y φi(x, y) I(x, y) dx dy
• The weak classifiers keep the same form as before:
  hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise,
  where θj is a threshold, pj is a parity indicating the direction of the inequality sign, and fj is a feature
• Input image × particular filter → feature
• The features are particularly well adapted to capture local contours that are insensitive to changes in the lighting conditions
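Discretely, the integral above is just an element-wise product and sum. The sketch below assumes the generating function as reconstructed on the earlier slide, applies only anisotropic scaling (rotation, bending and translation are omitted), and uses arbitrary sx, sy values for illustration:

```python
import numpy as np

def sample_filter(size, sx=4.0, sy=2.0):
    # A sampled anisotropic Gaussian filter built from the generating
    # function phi(x, y) = x * exp(-(x^2 + y^2)), with anisotropic
    # scaling only; it is odd in x, so it responds to vertical edges.
    ys, xs = np.mgrid[0:size, 0:size]
    x = (xs - size / 2.0) / sx
    y = (ys - size / 2.0) / sy
    return x * np.exp(-(x ** 2 + y ** 2))

def gaussian_feature(window, filt):
    # Feature value: the inner product of the image window with the
    # filter, a discrete version of the integral on the slide.
    return float(np.sum(window * filt))
```

Because the filter is odd and smooth, a window containing an edge produces a much larger response than a flat window, and a uniform brightness change shifts the response only slightly.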
Gaussian vs. Haar-like
The Haar-like templates
• Haar filters model global contrasts; they
  – capture the contrast between image regions well (good)
  – are more sensitive to the direction of the light source (bad)
  – are limited for modeling the smooth transitions present in facial images (bad)
Gaussian vs. Haar-like
The first and second features selected by AdaBoost proposed by P. Viola
Gaussian vs. Haar-like
• As the number of stages grows, the Haar features' test error stops decreasing, while the Gaussian features' test error keeps decreasing
• AdaBoost focuses on the hard-to-classify examples, and the simplistic Haar features (HF) are not discriminant enough to separate the two classes
The receiver operating characteristic analysis
Experimental Results
• A 20x15-pixel window is used to scan the image
• For different scales, the image is rescaled by powers of 1.2
• Training data
  – Face images: XM2VTS, BioID, FERET, augmented by scaling, in-plane rotation and shifts: 9,500 face images
  – Non-face data: randomly selected images without human faces: 500,000 non-face images
• Total set of Gaussian features (GF) used for training the classifier: 202,200 features
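The multi-scale scan can be sketched by enlarging the window by powers of 1.2 (equivalent to shrinking the image); the scan stride is an assumption, since the slides don't specify it:

```python
def sliding_window_scan(image_shape, win_w=20, win_h=15, scale=1.2, step=2):
    # Enumerate candidate detection windows: slide a 20x15 window over
    # the image, then repeat with the window enlarged by powers of 1.2.
    H, W = image_shape
    windows = []
    s = 1.0
    while win_w * s <= W and win_h * s <= H:
        w, h = int(win_w * s), int(win_h * s)
        stride = max(1, int(step * s))
        for y in range(0, H - h + 1, stride):
            for x in range(0, W - w + 1, stride):
                windows.append((x, y, w, h))
        s *= scale
    return windows
```

Every window in this list is what the cascade must classify, which is why rejecting most of them in the first cheap stages matters so much.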
The evaluation protocol
• The Popovici score compares, for each detection:
  – the between-eyes distance
  – the angle of the eyes axis
  – the annotated (ground-truth) position vs. the detected position
• Because every detector output is included in the score, a high false-positive count also lowers the score
• Test sets:
  – 12,480 images in the French and English datasets
  – 23 images with 155 low-resolution faces
  – 123 images with 483 faces (manually selected)
  – 130 images with 507 faces (complete set)
• A mixed cascade is fastest: most false positives are filtered out by a fast 5-stage Haar-feature (HF) cascade, so it is faster than a pure 12-stage Gaussian-feature (GF) cascade
• Thank you