TRANSCRIPT
ENSEMBLE LEARNING: ADABOOST
Jianping Fan Dept of Computer Science UNC-Charlotte
ENSEMBLE LEARNING
A machine learning paradigm where multiple learners are used to solve the problem
[Diagram: previously, a single learner was trained to solve the problem; in an ensemble, several learners are trained on the problem and their outputs are combined.]
The generalization ability of the ensemble is usually significantly better than that of an individual learner
Boosting is one of the most important families of ensemble methods
A BRIEF HISTORY
Resampling for estimating a statistic:
– Bootstrapping
Resampling for classifier design:
– Bagging
– Boosting (Schapire 1989)
– AdaBoost (Schapire 1995)
BOOTSTRAP ESTIMATION
Repeatedly draw n samples from D with replacement
For each set of samples, estimate a statistic
The bootstrap estimate is the mean of the individual estimates
Used to estimate a statistic (parameter) and its variance
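A minimal Python sketch of this procedure (the function name and toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def bootstrap_estimate(data, statistic, n_boot=1000, rng=None):
    """Bootstrap estimate of `statistic` and its variance over `data`."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    # Repeatedly draw n samples from the data with replacement
    estimates = np.array([
        statistic(rng.choice(data, size=n, replace=True))
        for _ in range(n_boot)
    ])
    # The bootstrap estimate is the mean of the individual estimates;
    # their spread estimates the statistic's variance.
    return estimates.mean(), estimates.var(ddof=1)

# Example: bootstrap the median of a small sample
est, var = bootstrap_estimate(np.array([2.1, 3.5, 1.8, 4.2, 2.9]), np.median)
```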
BAGGING - AGGREGATE BOOTSTRAPPING
For i = 1..M: draw n* < n samples from D with replacement, and learn classifier Ci
Final classifier is a vote of C1 .. CM
Increases classifier stability/reduces variance
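A minimal sketch of bagging in Python, using scikit-learn decision trees as the base learners (the function names and parameters are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=25, n_star=None, rng=None):
    """Train M classifiers C1..CM, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    n_star = n if n_star is None else n_star   # n* <= n samples per round
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n_star)  # draw with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final classifier is a majority vote of C1..CM (labels in {-1, +1})."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.sign(votes)
```

Averaging many high-variance trees trained on different bootstrap samples is what gives bagging its stabilizing, variance-reducing effect.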
BAGGING
[Diagram: T random samples, each drawn with replacement from the training data, are fed to the learning algorithm ML, producing classifiers f1, f2, …, fT, which are combined into the final classifier f.]
BOOSTING
[Diagram: the original training sample trains f1; each subsequent weighted sample trains the next classifier f2, …, fT via ML; the weak classifiers are combined into the final classifier f.]
REVISIT BAGGING
BOOSTING CLASSIFIER
BAGGING VS BOOSTING
Bagging: the construction of complementary base-learners is left to chance and to the instability of the learning method.
Boosting: actively seeks to generate complementary base-learners by training the next base-learner on the mistakes of the previous learners.
BOOSTING (SCHAPIRE 1989)
Randomly select n1 < n samples from D without replacement to obtain D1; train weak learner C1
Select n2 < n samples from D, with half of them misclassified by C1, to obtain D2; train weak learner C2
Select all samples from D on which C1 and C2 disagree; train weak learner C3
Final classifier is a vote of the three weak learners
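A hedged Python sketch of this three-classifier procedure (decision stumps via scikit-learn stand in for the weak learners; n1 and n2 are caller-chosen, and the sketch assumes C1 misclassifies at least n2/2 examples and that C1 and C2 disagree somewhere):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_1989(X, y, n1, n2, rng=None):
    """Schapire's original boosting: train C1, C2, C3 on three sample sets."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    # D1: n1 < n samples drawn without replacement; train C1
    i1 = rng.choice(n, size=n1, replace=False)
    C1 = DecisionTreeClassifier(max_depth=1).fit(X[i1], y[i1])
    # D2: n2 samples, half misclassified by C1, half correctly classified
    p1 = C1.predict(X)
    wrong, right = np.flatnonzero(p1 != y), np.flatnonzero(p1 == y)
    i2 = np.concatenate([rng.choice(wrong, size=n2 // 2, replace=False),
                         rng.choice(right, size=n2 - n2 // 2, replace=False)])
    C2 = DecisionTreeClassifier(max_depth=1).fit(X[i2], y[i2])
    # D3: all samples on which C1 and C2 disagree; train C3
    i3 = np.flatnonzero(p1 != C2.predict(X))
    C3 = DecisionTreeClassifier(max_depth=1).fit(X[i3], y[i3])
    return C1, C2, C3

def boosting_1989_predict(C1, C2, C3, X):
    # Final classifier is a majority vote of the three weak learners
    return np.sign(C1.predict(X) + C2.predict(X) + C3.predict(X))
```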
ADABOOST (SCHAPIRE 1995)
Instead of sampling, re-weight the training examples
After re-weighting, the previous weak learner has only 50% accuracy over the new distribution
Can be used to learn weak classifiers
Final classification based on weighted vote of weak classifiers
ADABOOST TERMS
Learner = Hypothesis = Classifier
Weak Learner: < 50% error over any distribution
Strong Classifier: thresholded linear combination of weak learner outputs
ADABOOST = ADAPTIVE BOOSTING
A learning algorithm
Building a strong classifier from a lot of weaker ones
ADABOOST CONCEPT
Weak classifiers, slightly better than random:
$h_1(x) \in \{-1, +1\}$
$h_2(x) \in \{-1, +1\}$
…
$h_T(x) \in \{-1, +1\}$
Strong classifier:
$H_T(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
WEAKER CLASSIFIERS
Each weak classifier learns by considering one simple feature
The T most beneficial features for classification should be selected
How to (a stump sketch follows this list)
– define features?
– select beneficial features?
– train weak classifiers?
– manage (weight) the training samples?
– associate a weight with each weak classifier?
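One concrete, commonly used answer to the first three questions is a decision stump: a weak classifier that thresholds a single simple feature. A brute-force sketch (the function name is an illustrative assumption):

```python
import numpy as np

def fit_stump(X, y, D):
    """Pick the (feature, threshold, polarity) stump minimizing the
    weighted error sum_i D(i) * [y_i != h(x_i)]; labels in {-1, +1}."""
    best = (None, None, None, np.inf)
    for f in range(X.shape[1]):              # each stump uses one feature
        for thr in np.unique(X[:, f]):       # candidate thresholds
            for pol in (+1, -1):             # which side predicts +1
                pred = pol * np.sign(X[:, f] - thr)
                pred[pred == 0] = pol        # break ties at the threshold
                err = D[pred != y].sum()     # weighted error under D
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best
```

Selecting, at each round, the stump with the lowest weighted error is exactly how the T most beneficial features get picked.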
THE STRONG CLASSIFIERS
How good will the strong classifier be?
THE ADABOOST ALGORITHM
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$
Initialization: $D_1(i) = \frac{1}{m}$, $i = 1, \ldots, m$
For $t = 1, \ldots, T$:
• Find the classifier $h_t: X \to \{-1, +1\}$ that minimizes the error w.r.t. $D_t$, i.e., $h_t = \arg\min_{h_j} \epsilon_j$ where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$ and $D_t(i)$ is the probability distribution over the $x_i$'s at time $t$ (minimize the weighted error)
• Weight classifier: $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$ (chosen to minimize the exponential loss)
• Update distribution: $D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor (gives misclassified patterns more chance to be learned)
Output the final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
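The algorithm above translates nearly line-for-line into Python. A sketch assuming depth-1 scikit-learn trees (decision stumps) as the weak learners; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost as on the slide; labels y must be in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        # Weak learner minimizing the weighted error w.r.t. D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()               # eps_t = sum_i D_t(i)[y_i != h_t(x_i)]
        if eps >= 0.5:                         # no better than random: stop
            break
        eps = max(eps, 1e-12)                  # guard against log of zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D = D * np.exp(-alpha * y * pred)      # D_{t+1}(i) prop. D_t(i) e^{-a_t y_i h_t(x_i)}
        D = D / D.sum()                        # Z_t normalization
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    """H(x) = sign( sum_t alpha_t h_t(x) )."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(hs, alphas)))
```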
BOOSTING ILLUSTRATION
Weak Classifier 1
BOOSTING ILLUSTRATION
Weights increased
THE ADABOOST ALGORITHM
The weight-update step, typically with $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$ and $D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ normalizes: the weights of incorrectly classified examples are increased so that the base learner is forced to focus on the hard examples in the training set.
BOOSTING ILLUSTRATION
Weak Classifier 2
BOOSTING ILLUSTRATION
Weights increased
BOOSTING ILLUSTRATION
Weak Classifier 3
BOOSTING ILLUSTRATION
Final classifier is a combination of the weak classifiers
THE ADABOOST ALGORITHM
[Same algorithm as above.]
What goal does AdaBoost want to reach?
THE ADABOOST ALGORITHM
[Same algorithm as above, with the classifier weight left open: $\alpha_t = ?$]
Why $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$? The choices of $\alpha_t$ and $D_t$ are goal dependent.
GOAL
Minimize exponential loss
Final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
$\mathrm{loss}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$
GOAL
Minimizing the exponential loss $E_{x,y}\left[e^{-yH(x)}\right]$ maximizes the margin $yH(x)$.
GOAL
Final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Minimize $\mathrm{loss}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$
Define $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$ with $H_0(x) = 0$; then $H(x) = H_T(x)$.
$E_{x,y}\left[e^{-yH_t(x)}\right] = E_x\left[E_{y|x}\left[e^{-yH_t(x)} \mid x\right]\right]$
$= E_x\left[E_{y|x}\left[e^{-y[H_{t-1}(x) + \alpha_t h_t(x)]} \mid x\right]\right]$
$= E_x\left[E_{y|x}\left[e^{-yH_{t-1}(x)}\,e^{-\alpha_t y h_t(x)} \mid x\right]\right]$
$= E_x\left[e^{-yH_{t-1}(x)}\left(e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right]$
$\alpha_t = ?$ Set $\frac{\partial}{\partial \alpha_t} E_{x,y}\left[e^{-yH_t(x)}\right] = 0$:
$E_x\left[e^{-yH_{t-1}(x)}\left(-e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right] = 0$
Solving gives
$\alpha_t = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \neq h_t(x))} = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$
where $\epsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_t(x_i)]$, taking $P(x_i, y_i) = D_t(i)$.
With $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$ plugged back into the algorithm, the remaining question is: $D_{t+1} = ?$
$D_{t+1} = ?$ Expand the loss at round $t$ using a second-order Taylor expansion of $e^{-\alpha_t y h_t}$:
$E_{x,y}\left[e^{-yH_t}\right] = E_{x,y}\left[e^{-yH_{t-1}}\,e^{-\alpha_t y h_t}\right] \approx E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h_t + \frac{\alpha_t^2 y^2 h_t^2}{2}\right)\right]$
$h_t = \arg\min_{h} E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \frac{\alpha_t^2 y^2 h^2}{2}\right)\right]$
Since $y^2 h^2 = 1$:
$h_t = \arg\min_{h} E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \frac{\alpha_t^2}{2}\right)\right] = \arg\min_{h} E_x\left[E_{y|x}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \frac{\alpha_t^2}{2}\right) \mid x\right]\right]$
Dropping the terms that do not depend on $h$ (with $\alpha_t > 0$):
$h_t = \arg\min_{h} E_x\left[E_{y|x}\left[-e^{-yH_{t-1}}\,\alpha_t\,y\,h \mid x\right]\right] = \arg\max_{h} E_x\left[E_{y|x}\left[e^{-yH_{t-1}}\,y\,h(x) \mid x\right]\right]$
$= \arg\max_{h} E_x\left[h(x)\,e^{-H_{t-1}(x)}\,P(y = 1 \mid x) - h(x)\,e^{H_{t-1}(x)}\,P(y = -1 \mid x)\right]$
Equivalently, under the reweighted distribution $x, y \sim e^{-yH_{t-1}(x)}\,P(y|x)$:
$h_t = \arg\max_{h} E_{x,\,y \sim e^{-yH_{t-1}(x)} P(y|x)}\left[y\,h(x)\right]$, which is maximized when $h(x)$ agrees with $y$ at each $x$, so
$h_t(x) = \mathrm{sign}\left(P_{x,\,y \sim e^{-yH_{t-1}(x)} P(y|x)}(y = 1 \mid x) - P_{x,\,y \sim e^{-yH_{t-1}(x)} P(y|x)}(y = -1 \mid x)\right)$
$= \mathrm{sign}\left(E_{x,\,y \sim e^{-yH_{t-1}(x)} P(y|x)}\left[y \mid x\right]\right)$
At time $t$, therefore, the examples are effectively drawn from $x, y \sim e^{-yH_{t-1}(x)}\,P(y|x)$.
At time 1: $x, y \sim P(y|x)$ with $P(y_i|x_i) = 1$, so $D_1(i) = \frac{1}{Z_1} \cdot \frac{1}{m}$, the uniform distribution.
At time $t+1$: $x, y \sim e^{-yH_t(x)}\,P(y|x) \propto D_t\,e^{-\alpha_t y h_t(x)}$, which is exactly the update rule
$D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor.
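A worked numeric check of this update (the numbers are illustrative): suppose $\epsilon_t = 0.25$. Then
$\alpha_t = \frac{1}{2}\ln\frac{0.75}{0.25} = \frac{1}{2}\ln 3 \approx 0.549$,
so each misclassified example's weight is multiplied by $e^{\alpha_t} = \sqrt{3}$ and each correctly classified one by $e^{-\alpha_t} = 1/\sqrt{3}$. After normalization the misclassified examples carry total weight
$\frac{0.25\,\sqrt{3}}{0.25\,\sqrt{3} + 0.75/\sqrt{3}} = \frac{0.25 \cdot 3}{0.25 \cdot 3 + 0.75} = \frac{1}{2}$,
so $h_t$ has exactly 50% weighted error on $D_{t+1}$, matching the earlier claim that the previous weak learner has only 50% accuracy over the new distribution.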
PROS AND CONS OF ADABOOST
Advantages
– Very simple to implement
– Does feature selection, resulting in a relatively simple classifier
– Fairly good generalization
Disadvantages
– Suboptimal solution
– Sensitive to noisy data and outliers
INTUITION
Train a set of weak hypotheses: h1, …, hT.
The combined hypothesis H is a weighted majority vote of the T weak hypotheses. Each hypothesis ht has a weight αt.
During the training, focus on the examples that are misclassified. At round t, example xi has the weight Dt(i).
BASIC SETTING
Binary classification problem
Training data: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
$D_t(i)$: the weight of $x_i$ at round $t$; $D_1(i) = 1/m$
A learner $L$ that finds a weak hypothesis $h_t: X \to Y$ given the training set and $D_t$
The error of a weak hypothesis $h_t$: $\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] = \sum_{i:\,h_t(x_i) \neq y_i} D_t(i)$
THE BASIC ADABOOST ALGORITHM
For $t = 1, \ldots, T$
• Train the weak learner using the training data and $D_t$
• Get $h_t: X \to \{-1, +1\}$ with error $\epsilon_t = \sum_{i:\,h_t(x_i) \neq y_i} D_t(i)$
• Choose $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$
• Update
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
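To tie the pieces together, a hypothetical end-to-end run of the `adaboost_fit` / `adaboost_predict` sketch from earlier, on synthetic data (scikit-learn's `make_classification` generates the toy problem; all names are assumptions of the sketch, not of the slides):

```python
import numpy as np
from sklearn.datasets import make_classification

# Toy binary problem with labels remapped from {0, 1} to {-1, +1}
X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1

hs, alphas = adaboost_fit(X, y, T=50)   # defined in the earlier sketch
train_acc = np.mean(adaboost_predict(hs, alphas, X) == y)
print(f"rounds used: {len(hs)}, training accuracy: {train_acc:.3f}")
```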
THE GENERAL ADABOOST ALGORITHM