The Boosting Approach to
Machine Learning
Maria-Florina Balcan
04/05/2019
Boosting
• Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor.
• Backed by solid theoretical foundations.
• Works well in practice: AdaBoost and its variations are among the top ML algorithms.
• General method for improving the accuracy of any given learning algorithm.
Readings:
• The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.
• Theory and Applications of Boosting. NIPS tutorial. http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf
Plan for today:
• Motivation.
• A bit of history.
• AdaBoost: algorithm, guarantees, discussion.
• Focus on supervised classification.
An Example: Spam Detection
• E.g., classify which emails are spam and which are important.
Key observation/motivation:
• Easy to find rules of thumb that are often correct.
• E.g., "If 'buy now' appears in the message, then predict spam."
• E.g., "If 'say good-bye to debt' appears in the message, then predict spam."
• Harder to find a single rule that is very highly accurate.
An Example: Spam Detection
• Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.
• Apply the weak learner to a subset of emails, obtain a rule of thumb h1.
• Apply to a 2nd subset of emails, obtain a 2nd rule of thumb h2.
• Apply to a 3rd subset of emails, obtain a 3rd rule of thumb h3.
• Repeat T times; combine the weak rules h1, …, hT into a single highly accurate rule.
Boosting: Important Aspects
How to choose examples on each round?
• Typically, concentrate on the "hardest" examples (those most often misclassified by previous rules of thumb).
How to combine rules of thumb into a single prediction rule?
• Take a (weighted) majority vote of the rules of thumb.
Historically…
Weak Learning vs Strong Learning
• [Kearns & Valiant '88]: defined weak learning: being able to predict better than random guessing (error ≤ 1/2 − γ), consistently.
• Posed an open problem: "Does there exist a boosting algorithm that turns a weak learner into a strong learner (that can produce arbitrarily accurate hypotheses)?"
• Informally, given a "weak" learning algorithm that can consistently find classifiers of error ≤ 1/2 − γ, a boosting algorithm would provably construct a single classifier with error ≤ ε.
Surprisingly…
Weak Learning = Strong Learning
Original Construction [Schapire '89]:
• Poly-time boosting algorithm; exploits the fact that we can learn a little on every distribution.
• A modest booster obtained by calling the weak learning algorithm on 3 distributions: error β ≤ 1/2 − γ → error 3β² − 2β³.
• Then amplifies the modest boost in accuracy by running this recursively.
• Cool conceptually and technically, but not very practical.
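As a sanity check on the 3β² − 2β³ figure, the sketch below (illustrative only, not Schapire's actual construction) verifies that a majority vote over three classifiers, each erring independently with probability β, errs with probability exactly 3β²(1 − β) + β³ = 3β² − 2β³, which is strictly smaller than β for β < 1/2:

```python
from itertools import product

def majority_error(beta):
    """Error of a majority vote over three classifiers that each err
    independently with probability beta (an idealized reading of the slide)."""
    err = 0.0
    for wrong in product([False, True], repeat=3):
        p = 1.0
        for w in wrong:
            p *= beta if w else (1 - beta)
        if sum(wrong) >= 2:          # at least 2 of 3 wrong: majority errs
            err += p
    return err

for beta in (0.1, 0.3, 0.45):
    assert abs(majority_error(beta) - (3 * beta**2 - 2 * beta**3)) < 1e-12
    assert majority_error(beta) < beta   # strictly better than each voter
```

For β = 0.3, for example, the vote's error is 3(0.09) − 2(0.027) = 0.216, a real (if modest) improvement over 0.3.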
An explosion of subsequent work
AdaBoost (Adaptive Boosting)
[Freund-Schapire, JCSS '97]
Gödel Prize winner 2003
"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"
Informal Description of AdaBoost
• Boosting: turns a weak algorithm into a strong learner.
Input: S = {(x_1, y_1), …, (x_m, y_m)}; x_i ∈ X, y_i ∈ Y = {−1, 1}; weak learning algorithm A (e.g., Naïve Bayes, decision stumps).
• For t = 1, 2, …, T:
  • Construct D_t on {x_1, …, x_m}.
  • Run A on D_t, producing h_t: X → {−1, 1} (weak classifier), with ε_t = Pr_{x_i ∼ D_t}[h_t(x_i) ≠ y_i] the error of h_t over D_t.
• Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i, and decreases it on x_i if h_t is correct.
• Output H_final(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).
[Figure: a 2-D dataset of + and − examples, with a weak classifier h_t separating them.]
AdaBoost (Adaptive Boosting)
Input: S = {(x_1, y_1), …, (x_m, y_m)}; weak learning algorithm A.
• For t = 1, 2, …, T:
  • Construct D_t on {x_1, …, x_m}.
  • Run A on D_t, producing h_t.
Constructing D_t:
• D_1 is uniform on {x_1, …, x_m} [i.e., D_1(i) = 1/m].
• Given D_t and h_t, set
  D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t}  if y_i = h_t(x_i)
  D_{t+1}(i) = (D_t(i)/Z_t) · e^{α_t}   if y_i ≠ h_t(x_i)
  i.e., D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)},
  where α_t = (1/2) ln((1 − ε_t)/ε_t) > 0.
• D_{t+1} puts half of its weight on the examples x_i where h_t is incorrect and half on the examples where h_t is correct.
Final hypothesis: H_final(x) = sign(Σ_t α_t h_t(x)).
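These update and voting rules can be sketched in a few lines of Python. The slides' toy example uses 2-D half-planes as weak classifiers; the 1-D threshold stumps and the small dataset below are illustrative substitutes, and the stump pool deliberately includes thresholds beyond the data range (which act as constant classifiers on the sample):

```python
import math

def stump_pool(xs):
    """Candidate weak classifiers: 1-D threshold stumps of both orientations.
    Thresholds at half-integers, including one below the min and one above
    the max of the data (those behave as constants on the sample)."""
    lo, hi = int(min(xs)), int(max(xs))
    return [(lambda x, t=t, s=s: s if x > t else -s)
            for t in [i + 0.5 for i in range(lo - 1, hi + 1)]
            for s in (+1, -1)]

def adaboost(X, y, stumps, T):
    m = len(X)
    D = [1.0 / m] * m                       # D_1: uniform on the sample
    ensemble = []                           # list of (alpha_t, h_t)
    for _ in range(T):
        # Weak learner A: pick the stump with lowest weighted error over D_t.
        h = min(stumps, key=lambda g: sum(D[i] for i in range(m)
                                          if g(X[i]) != y[i]))
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])
        if eps >= 0.5:                      # no edge left: stop
            break
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, h))
        if eps == 0:
            break
        # Multiplicative update, then renormalize by Z_t.
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
    return lambda x: 1 if sum(a * g(x) for a, g in ensemble) >= 0 else -1

# Toy 1-D sample: positives at both ends, negatives in the middle.
X = [0, 1, 2, 3, 4, 5, 6, 7]
y = [+1, +1, -1, -1, -1, -1, +1, +1]
H_final = adaboost(X, y, stump_pool(X), T=3)
```

On this particular sample no single stump gets below 2/8 training error, yet after three rounds the weighted vote classifies every point correctly.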
Adaboost: A toy example
Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)
Nice Features of AdaBoost
• Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps)!
• Very fast (a single pass through the data each round) and simple to code, with no parameters to tune.
• Grounded in rich theory.
• Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.
• Relevant for the big-data age: quickly focuses on "core difficulties"; well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12].
Analyzing Training Error
Theorem: Let ε_t = 1/2 − γ_t (the error of h_t over D_t). Then
err_S(H_final) ≤ exp(−2 Σ_t γ_t²).
So, if for all t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(−2γ²T).
The training error drops exponentially in T!
To get err_S(H_final) ≤ ε, only T = O((1/γ²) log(1/ε)) rounds are needed.
AdaBoost is adaptive:
• Does not need to know γ or T a priori.
• Can exploit γ_t ≫ γ.
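The rounds bound follows by rearranging the theorem: exp(−2γ²T) ≤ ε exactly when T ≥ ln(1/ε)/(2γ²). A small illustrative check (the function name is ours):

```python
import math

def rounds_needed(gamma, eps):
    # Smallest T with exp(-2 * gamma**2 * T) <= eps,
    # i.e. T >= ln(1/eps) / (2 * gamma**2).
    return math.ceil(math.log(1 / eps) / (2 * gamma ** 2))

# Example: a weak-learner edge of gamma = 0.1 drives training error
# below eps = 1% within a few hundred rounds.
T = rounds_needed(0.1, 0.01)
assert math.exp(-2 * 0.1 ** 2 * T) <= 0.01
```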
Understanding the Updates & Normalization
Recall D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}.
Claim: D_{t+1} puts half of the weight on the x_i where h_t was incorrect and half of the weight on the x_i where h_t was correct.
Weight on correctly classified examples:
Pr_{D_{t+1}}[y_i = h_t(x_i)] = Σ_{i: y_i = h_t(x_i)} (D_t(i)/Z_t) e^{−α_t} = ((1 − ε_t)/Z_t) e^{−α_t} = ((1 − ε_t)/Z_t) √(ε_t/(1 − ε_t)) = √(ε_t(1 − ε_t)) / Z_t.
Weight on misclassified examples:
Pr_{D_{t+1}}[y_i ≠ h_t(x_i)] = Σ_{i: y_i ≠ h_t(x_i)} (D_t(i)/Z_t) e^{α_t} = (ε_t/Z_t) e^{α_t} = (ε_t/Z_t) √((1 − ε_t)/ε_t) = √(ε_t(1 − ε_t)) / Z_t.
The probabilities are equal! Indeed, Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2√(ε_t(1 − ε_t)), so each side carries weight exactly 1/2.
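The algebra above can be checked numerically. This sketch (the function name is ours) recomputes α_t, Z_t, and the two renormalized masses for several values of ε_t:

```python
import math

def check_half_split(eps_t, tol=1e-12):
    """Recompute the slide's algebra for a given weak-learner error eps_t."""
    alpha = 0.5 * math.log((1 - eps_t) / eps_t)                   # alpha_t
    Z = (1 - eps_t) * math.exp(-alpha) + eps_t * math.exp(alpha)  # Z_t
    # Renormalized mass on correctly / incorrectly classified examples.
    w_correct = (1 - eps_t) * math.exp(-alpha) / Z
    w_incorrect = eps_t * math.exp(alpha) / Z
    assert abs(Z - 2 * math.sqrt(eps_t * (1 - eps_t))) < tol
    assert abs(w_correct - 0.5) < tol
    assert abs(w_incorrect - 0.5) < tol

for eps in (0.05, 0.2, 0.4, 0.49):
    check_half_split(eps)
```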
Think about generalization aspects!
What you should know
• The difference between weak and strong learners.
• The AdaBoost algorithm and the intuition for the distribution update step.
• The training error bound for AdaBoost.
Advanced (Not required) Material
Analyzing Training Error: Proof Math
Step 1: Unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i) [the unthresholded weighted vote of the h_t on x_i].
Step 2: err_S(H_final) ≤ Π_t Z_t.
Step 3: Π_t Z_t = Π_t 2√(ε_t(1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ e^{−2 Σ_t γ_t²}.
Analyzing Training Error: Proof Math
Step 1: Unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).
Recall D_1(i) = 1/m and D_{t+1}(i) = D_t(i) exp(−y_i α_t h_t(x_i)) / Z_t. So:
D_{T+1}(i) = (exp(−y_i α_T h_T(x_i)) / Z_T) × D_T(i)
= (exp(−y_i α_T h_T(x_i)) / Z_T) × (exp(−y_i α_{T−1} h_{T−1}(x_i)) / Z_{T−1}) × D_{T−1}(i)
= ……
= (exp(−y_i α_T h_T(x_i)) / Z_T) × ⋯ × (exp(−y_i α_1 h_1(x_i)) / Z_1) × (1/m)
= (1/m) · exp(−y_i (α_1 h_1(x_i) + ⋯ + α_T h_T(x_i))) / (Z_1 ⋯ Z_T)
= (1/m) · exp(−y_i f(x_i)) / Π_t Z_t.
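The telescoping can also be verified numerically: run the recurrence on arbitrary (made-up) weak-classifier outputs and weights, then compare against the closed form from Step 1:

```python
import math
import random

random.seed(0)
m, T = 6, 5
y = [random.choice([-1, 1]) for _ in range(m)]
# Made-up weak-classifier outputs h[t][i] and positive weights alpha[t].
h = [[random.choice([-1, 1]) for _ in range(m)] for _ in range(T)]
alpha = [random.uniform(0.1, 1.0) for _ in range(T)]

# Run the recurrence D_{t+1}(i) = D_t(i) exp(-y_i alpha_t h_t(x_i)) / Z_t.
D = [1.0 / m] * m
Zs = []
for t in range(T):
    D = [D[i] * math.exp(-y[i] * alpha[t] * h[t][i]) for i in range(m)]
    Zs.append(sum(D))
    D = [d / Zs[-1] for d in D]

# Compare with the closed form: D_{T+1}(i) = exp(-y_i f(x_i)) / (m prod_t Z_t).
for i in range(m):
    f_i = sum(alpha[t] * h[t][i] for t in range(T))
    assert abs(D[i] - math.exp(-y[i] * f_i) / (m * math.prod(Zs))) < 1e-12
```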
Analyzing Training Error: Proof Math
Step 1: Unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).
Step 2: err_S(H_final) ≤ Π_t Z_t:
err_S(H_final) = (1/m) Σ_i 1[y_i ≠ H_final(x_i)]
= (1/m) Σ_i 1[y_i f(x_i) ≤ 0]
≤ (1/m) Σ_i exp(−y_i f(x_i))   [the 0/1 loss is upper-bounded by the exp loss]
= Σ_i D_{T+1}(i) Π_t Z_t = Π_t Z_t.
Analyzing Training Error: Proof Math
Step 1: Unwrapping the recurrence: D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).
Step 2: err_S(H_final) ≤ Π_t Z_t.
Step 3: Π_t Z_t = Π_t 2√(ε_t(1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ e^{−2 Σ_t γ_t²}.
Note: recall Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2√(ε_t(1 − ε_t)); in fact, α_t is the minimizer of α ↦ (1 − ε_t) e^{−α} + ε_t e^{α}.
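A quick numerical check of Step 3's chain of identities (2√(ε(1 − ε)) = √(1 − 4γ²) ≤ e^{−2γ²}, using 1 − x ≤ e^{−x}) and of the claim that α_t minimizes Z_t:

```python
import math

for eps in (0.1, 0.25, 0.4, 0.45):
    gamma = 0.5 - eps
    Z = lambda a: (1 - eps) * math.exp(-a) + eps * math.exp(a)
    alpha_star = 0.5 * math.log((1 - eps) / eps)
    # The identities and the bound from Step 3.
    assert abs(Z(alpha_star) - 2 * math.sqrt(eps * (1 - eps))) < 1e-12
    assert abs(Z(alpha_star) - math.sqrt(1 - 4 * gamma ** 2)) < 1e-12
    assert Z(alpha_star) <= math.exp(-2 * gamma ** 2)
    # alpha_star beats nearby choices of alpha, as the minimizer should.
    assert Z(alpha_star) <= min(Z(alpha_star - 0.1), Z(alpha_star + 0.1))
```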
Generalization Guarantees (informal)
How about generalization guarantees? Original analysis [Freund & Schapire '97]:
• Let H be the set of rules that the weak learner can use.
• Let G be the set of weighted majority rules over T elements of H (i.e., the things that AdaBoost might output).
Theorem [Freund & Schapire '97]: ∀ g ∈ G, err(g) ≤ err_S(g) + Õ(√(Td/m)), where d = VC dimension of H and T = # of rounds.
Generalization Guarantees
[Figure: error vs. T (# of rounds): the train error curve decreases with T, while the generalization-error bound grows with the complexity term as T increases.]
Theorem [Freund & Schapire '97]: ∀ g ∈ G, err(g) ≤ err_S(g) + Õ(√(Td/m)), where d = VC dimension of H and T = # of rounds.
Generalization Guarantees
• Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.
• Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!
• These results seem to contradict the FS'97 bound and Occam's razor (in order to achieve good test error, the classifier should be as simple as possible)!
How can we explain the experiments?
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee present a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods".
Key idea: Training error does not tell the whole story. We also need to consider the classification confidence!
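The confidence quantity that paper studies is the normalized margin y_i f(x_i) / Σ_t α_t, which lies in [−1, 1], is positive exactly when the example is classified correctly, and measures how decisive the weighted vote is. A small sketch (the ensemble below is made up for illustration):

```python
def normalized_margins(X, y, ensemble):
    """Normalized margin y_i * f(x_i) / sum_t alpha_t of each example.
    Lies in [-1, 1]; positive iff correct, magnitude = voting confidence.
    `ensemble` is a list of (alpha_t, h_t) pairs, as AdaBoost produces."""
    total = sum(a for a, _ in ensemble)
    return [yi * sum(a * h(x) for a, h in ensemble) / total
            for x, yi in zip(X, y)]

# Made-up ensemble of three sign classifiers on 1-D points:
ensemble = [(0.5, lambda x: 1 if x > 0 else -1),
            (0.3, lambda x: 1 if x > 2 else -1),
            (0.2, lambda x: 1 if x < 5 else -1)]
margins = normalized_margins([1, 3, 6], [1, 1, -1], ensemble)
# x=3 gets a unanimous correct vote (margin 1.0); x=1 a narrow correct one;
# x=6 is misclassified (negative margin).
```

Running more boosting rounds can keep pushing these margins up even after the training error hits zero, which is the paper's explanation for the experiments above.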