The Boosting Approach to Machine Learning


Page 1: The Boosting Approach to Machine Learning

The Boosting Approach to Machine Learning

Maria-Florina Balcan

04/05/2019

Page 2: The Boosting Approach to Machine Learning

Boosting

β€’ Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.

β€’ Backed up by solid foundations.

β€’ Works well in practice: AdaBoost and its variants are among the top ML algorithms.

β€’ General method for improving the accuracy of any given learning algorithm.

Page 3: The Boosting Approach to Machine Learning

Readings:

β€’ The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001

β€’ Theory and Applications of Boosting. NIPS tutorial.

http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today:

β€’ Motivation.

β€’ A bit of history.

β€’ AdaBoost: algorithm, guarantees, discussion.

β€’ Focus on supervised classification.

Page 4: The Boosting Approach to Machine Learning

An Example: Spam Detection

The task: classify which emails are spam and which are important.

Key observation/motivation:

β€’ Easy to find rules of thumb that are often correct.

β€’ E.g., β€œIf buy now appears in the message, then predict spam.”

β€’ E.g., β€œIf say good-bye to debt appears in the message, then predict spam.”

β€’ Harder to find a single rule that is very highly accurate.

Page 5: The Boosting Approach to Machine Learning

β€’ Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.

An Example: Spam Detection

β€’ apply weak learner to a subset of emails, obtain rule of thumb

β€’ apply to 2nd subset of emails, obtain 2nd rule of thumb

β€’ apply to 3rd subset of emails, obtain 3rd rule of thumb

β€’ repeat T times; combine weak rules into a single highly accurate rule.

π’‰πŸ

π’‰πŸ

π’‰πŸ‘

𝒉𝑻

…E

mai

ls

Page 6: The Boosting Approach to Machine Learning

Boosting: Important Aspects

How to choose examples on each round?

β€’ Typically, concentrate on the β€œhardest” examples (those most often misclassified by previous rules of thumb).

How to combine rules of thumb into a single prediction rule?

β€’ Take a (weighted) majority vote of the rules of thumb.

Page 7: The Boosting Approach to Machine Learning

Historically….

Page 8: The Boosting Approach to Machine Learning

Weak Learning vs Strong Learning

β€’ [Kearns & Valiant ’88]: defined weak learning:being able to predict better than random guessing

(error ≀1

2βˆ’ 𝛾) , consistently.

β€’ Posed an open pb: β€œDoes there exist a boosting algo that turns a weak learner into a strong learner (that can produce

arbitrarily accurate hypotheses)?”

β€’ Informally, given β€œweak” learning algo that can consistently

find classifiers of error ≀1

2βˆ’ 𝛾, a boosting algo would

provably construct a single classifier with error ≀ πœ–.

Page 9: The Boosting Approach to Machine Learning

Surprisingly….

Weak Learning = Strong Learning

Original Construction [Schapire ’89]:

β€’ Poly-time boosting algorithm; exploits the fact that we can learn a little on every distribution.

β€’ A modest booster is obtained by calling the weak learning algorithm on 3 carefully constructed distributions:

Error = Ξ² < 1/2 βˆ’ Ξ³  β†’  error 3Ξ²^2 βˆ’ 2Ξ²^3

β€’ It then amplifies this modest boost in accuracy by applying the construction recursively.

β€’ Cool conceptually and technically, but not very practical.
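To see the improvement concretely, here is one application of the error map (illustrative numbers, not from the slides):

```latex
% One step of Schapire's construction: error map g(\beta) = 3\beta^2 - 2\beta^3.
% The value \beta = 0.4 is only an example.
\[
  \beta = 0.4 \;\Longrightarrow\; 3\beta^2 - 2\beta^3
  = 3(0.16) - 2(0.064) = 0.352 \;<\; 0.4 .
\]
% Since 3\beta^2 - 2\beta^3 < \beta whenever 0 < \beta < 1/2,
% applying the construction recursively drives the error toward 0.
```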

Page 10: The Boosting Approach to Machine Learning

An explosion of subsequent work

Page 11: The Boosting Approach to Machine Learning

Adaboost (Adaptive Boosting)

[Freund-Schapire, JCSS’97]

Gödel Prize winner 2003

β€œA Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”

Page 12: The Boosting Approach to Machine Learning

Informal Description of AdaBoost

β€’ Boosting: turns a weak algorithm into a strong learner.

Input: S = {(x_1, y_1), …, (x_m, y_m)}; x_i ∈ X, y_i ∈ Y = {βˆ’1, 1};
weak learning algorithm A (e.g., NaΓ―ve Bayes, decision stumps).

β€’ For t = 1, 2, …, T

β€’ Construct D_t on {x_1, …, x_m}

β€’ Run A on D_t producing h_t: X β†’ {βˆ’1, 1} (weak classifier)

Ξ΅_t = Pr_{x_i ∼ D_t}[ h_t(x_i) β‰  y_i ] = error of h_t over D_t

Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i, and decreases it if h_t is correct.

β€’ Output: H_final(x) = sign( Ξ£_{t=1}^{T} Ξ±_t h_t(x) )

[Figure: positively (+) and negatively (βˆ’) labeled examples with a weak classifier h_t.]

Page 13: The Boosting Approach to Machine Learning

Adaboost (Adaptive Boosting)

β€’ Weak learning algorithm A.

β€’ For t = 1, 2, …, T

β€’ Construct D_t on {x_1, …, x_m}

β€’ Run A on D_t producing h_t

Constructing D_t:

β€’ D_1 uniform on {x_1, …, x_m} [i.e., D_1(i) = 1/m]

β€’ Given D_t and h_t, set Ξ±_t = (1/2) ln( (1 βˆ’ Ξ΅_t) / Ξ΅_t ) > 0 and

D_{t+1}(i) = ( D_t(i) / Z_t ) Β· e^{βˆ’Ξ±_t}  if y_i = h_t(x_i)

D_{t+1}(i) = ( D_t(i) / Z_t ) Β· e^{Ξ±_t}   if y_i β‰  h_t(x_i)

Equivalently: D_{t+1}(i) = ( D_t(i) / Z_t ) Β· e^{βˆ’Ξ±_t y_i h_t(x_i)}

D_{t+1} puts half of its weight on the examples x_i where h_t is incorrect and half on the examples where h_t is correct.

Final hyp: H_final(x) = sign( Ξ£_t Ξ±_t h_t(x) )

Page 14: The Boosting Approach to Machine Learning

Adaboost: A toy example

Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)

Page 15: The Boosting Approach to Machine Learning

Adaboost: A toy example

Page 16: The Boosting Approach to Machine Learning

Adaboost: A toy example

Page 17: The Boosting Approach to Machine Learning

Adaboost (Adaptive Boosting)

[This slide repeats the AdaBoost construction from Page 13, revisited after the toy example.]

Page 18: The Boosting Approach to Machine Learning

Nice Features of Adaboost

β€’ Very general: a meta-procedure, it can use any weak learning algorithm (e.g., NaΓ―ve Bayes, decision stumps)!!!

β€’ Very fast (single pass through data each round) & simple to code, no parameters to tune.

β€’ Grounded in rich theory.

β€’ Shift in mindset: goal is now just to find classifiers a bit better than random guessing.

β€’ Relevant for big data age: quickly focuses on β€œcore difficulties”, well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT’12].


Page 19: The Boosting Approach to Machine Learning

Analyzing Training Error

Theorem πœ–π‘‘ = 1/2 βˆ’ 𝛾𝑑 (error of β„Žπ‘‘ over 𝐷𝑑)

π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ exp βˆ’2

𝑑

𝛾𝑑2

So, if βˆ€π‘‘, 𝛾𝑑 β‰₯ 𝛾 > 0, then π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ exp βˆ’2 𝛾2𝑇

Adaboost is adaptive

β€’ Does not need to know 𝛾 or T a priori

β€’ Can exploit 𝛾𝑑 ≫ 𝛾

The training error drops exponentially in T!!!

To get π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ πœ–, need only 𝑇 = 𝑂1

𝛾2log

1

πœ–rounds

Page 20: The Boosting Approach to Machine Learning

Understanding the Updates & Normalization

Claim: D_{t+1} puts half of its weight on the examples x_i where h_t was incorrect and half on the examples where h_t was correct.

Recall D_{t+1}(i) = ( D_t(i) / Z_t ) Β· e^{βˆ’Ξ±_t y_i h_t(x_i)} and e^{βˆ’Ξ±_t} = √( Ξ΅_t / (1 βˆ’ Ξ΅_t) ).

Weight on correctly classified examples:

Pr_{D_{t+1}}[ y_i = h_t(x_i) ] = Ξ£_{i: y_i = h_t(x_i)} ( D_t(i) / Z_t ) e^{βˆ’Ξ±_t} = ( (1 βˆ’ Ξ΅_t) / Z_t ) Β· √( Ξ΅_t / (1 βˆ’ Ξ΅_t) ) = √( Ξ΅_t (1 βˆ’ Ξ΅_t) ) / Z_t

Weight on misclassified examples:

Pr_{D_{t+1}}[ y_i β‰  h_t(x_i) ] = Ξ£_{i: y_i β‰  h_t(x_i)} ( D_t(i) / Z_t ) e^{Ξ±_t} = ( Ξ΅_t / Z_t ) Β· √( (1 βˆ’ Ξ΅_t) / Ξ΅_t ) = √( Ξ΅_t (1 βˆ’ Ξ΅_t) ) / Z_t

The two probabilities are equal!

Normalization: Z_t = (1 βˆ’ Ξ΅_t) e^{βˆ’Ξ±_t} + Ξ΅_t e^{Ξ±_t} = 2 √( Ξ΅_t (1 βˆ’ Ξ΅_t) ), so each side gets weight exactly 1/2.
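A tiny numerical check of this claim, using the same update rule (illustrative code, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
D = rng.random(m); D /= D.sum()          # some current distribution D_t
correct = rng.random(m) < 0.8            # pretend h_t is right on ~80% of points

eps = D[~correct].sum()                  # epsilon_t under D_t
alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t

D_next = D * np.exp(np.where(correct, -alpha, alpha))
D_next /= D_next.sum()                   # divide by Z_t

# Both print ~0.5: D_{t+1} splits its weight evenly between
# the examples h_t got right and the ones it got wrong.
print(D_next[correct].sum(), D_next[~correct].sum())
```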

Page 21: The Boosting Approach to Machine Learning

Analyzing Training Error

Theorem πœ–π‘‘ = 1/2 βˆ’ 𝛾𝑑 (error of β„Žπ‘‘ over 𝐷𝑑)

π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ exp βˆ’2

𝑑

𝛾𝑑2

So, if βˆ€π‘‘, 𝛾𝑑 β‰₯ 𝛾 > 0, then π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ exp βˆ’2 𝛾2𝑇

The training error drops exponentially in T!!!

To get π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ πœ–, need only 𝑇 = 𝑂1

𝛾2log

1

πœ–rounds

Think about generalization aspects!!!!

Page 22: The Boosting Approach to Machine Learning

What you should know

β€’ The difference between weak and strong learners.

β€’ AdaBoost algorithm and intuition for the distribution update step.

β€’ The training error bound for AdaBoost.

Page 23: The Boosting Approach to Machine Learning

Advanced (Not required) Material

Page 24: The Boosting Approach to Machine Learning

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t , where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

[f(x_i) is the unthresholded weighted vote of the h_t on x_i.]

Step 2: err_S(H_final) ≀ ∏_t Z_t .

Step 3: ∏_t Z_t = ∏_t 2 √( Ξ΅_t (1 βˆ’ Ξ΅_t) ) = ∏_t √( 1 βˆ’ 4Ξ³_t^2 ) ≀ e^{ βˆ’2 Ξ£_t Ξ³_t^2 }

Page 25: The Boosting Approach to Machine Learning

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t , where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

Recall D_1(i) = 1/m and D_{t+1}(i) = D_t(i) Β· exp( βˆ’y_i Ξ±_t h_t(x_i) ) / Z_t . Then

D_{T+1}(i) = ( exp( βˆ’y_i Ξ±_T h_T(x_i) ) / Z_T ) Γ— D_T(i)

= ( exp( βˆ’y_i Ξ±_T h_T(x_i) ) / Z_T ) Γ— ( exp( βˆ’y_i Ξ±_{Tβˆ’1} h_{Tβˆ’1}(x_i) ) / Z_{Tβˆ’1} ) Γ— D_{Tβˆ’1}(i)

= …

= ( exp( βˆ’y_i Ξ±_T h_T(x_i) ) / Z_T ) Γ— β‹― Γ— ( exp( βˆ’y_i Ξ±_1 h_1(x_i) ) / Z_1 ) Γ— (1/m)

= (1/m) Β· exp( βˆ’y_i ( Ξ±_1 h_1(x_i) + β‹― + Ξ±_T h_T(x_i) ) ) / ( Z_1 β‹― Z_T )

= (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t .

Page 26: The Boosting Approach to Machine Learning

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t , where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

Step 2: err_S(H_final) ≀ ∏_t Z_t .

err_S(H_final) = (1/m) Ξ£_i 1[ y_i β‰  H_final(x_i) ]

= (1/m) Ξ£_i 1[ y_i f(x_i) ≀ 0 ]

≀ (1/m) Ξ£_i exp( βˆ’y_i f(x_i) )   [the 0/1 loss is upper bounded by the exp loss]

= Ξ£_i D_{T+1}(i) Β· ∏_t Z_t   [by Step 1]

= ∏_t Z_t .
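The only inequality used in Step 2 is the pointwise bound 1[z ≀ 0] ≀ e^{βˆ’z}, applied with z = y_i f(x_i); filling in the one-line justification behind the slide's loss picture:

```latex
% For z \le 0: e^{-z} \ge 1 = \mathbf{1}[z \le 0].
% For z > 0:  e^{-z} > 0 = \mathbf{1}[z \le 0].
\[
  \mathbf{1}[z \le 0] \;\le\; e^{-z}\ \ \text{for all } z,
  \qquad\text{hence}\qquad
  \mathbf{1}[\,y_i f(x_i) \le 0\,] \;\le\; e^{-y_i f(x_i)} .
\]
```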

Page 27: The Boosting Approach to Machine Learning

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t , where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

Step 2: err_S(H_final) ≀ ∏_t Z_t .

Step 3: ∏_t Z_t = ∏_t 2 √( Ξ΅_t (1 βˆ’ Ξ΅_t) ) = ∏_t √( 1 βˆ’ 4Ξ³_t^2 ) ≀ e^{ βˆ’2 Ξ£_t Ξ³_t^2 }

Note: recall Z_t = (1 βˆ’ Ξ΅_t) e^{βˆ’Ξ±_t} + Ξ΅_t e^{Ξ±_t} = 2 √( Ξ΅_t (1 βˆ’ Ξ΅_t) ), and that Ξ±_t is the minimizer of Ξ± β†’ (1 βˆ’ Ξ΅_t) e^{βˆ’Ξ±} + Ξ΅_t e^{Ξ±}.
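Two small calculations make Step 3 self-contained (standard steps I am filling in, not spelled out on the slide):

```latex
% (1) Plug \alpha_t = \tfrac12 \ln\frac{1-\epsilon_t}{\epsilon_t} into Z_t,
%     using e^{-\alpha_t} = \sqrt{\epsilon_t/(1-\epsilon_t)} and \epsilon_t = 1/2 - \gamma_t:
\[
  Z_t = (1-\epsilon_t)\sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}}
      + \epsilon_t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}
      = 2\sqrt{\epsilon_t(1-\epsilon_t)}
      = \sqrt{1-4\gamma_t^2}.
\]
% (2) Apply 1 + x \le e^x with x = -4\gamma_t^2:
\[
  \prod_t \sqrt{1-4\gamma_t^2}
  \;\le\; \prod_t e^{-2\gamma_t^2}
  \;=\; e^{-2\sum_t \gamma_t^2}.
\]
```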

Page 28: The Boosting Approach to Machine Learning

Generalization Guarantees (informal)

Theorem (training error, restated): with Ξ΅_t = 1/2 βˆ’ Ξ³_t, err_S(H_final) ≀ exp( βˆ’2 Ξ£_t Ξ³_t^2 ).

How about generalization guarantees?

Original analysis [Freund & Schapire ’97]:

β€’ Let H be the set of rules that the weak learner can use.

β€’ Let G be the set of weighted majority rules over T elements of H (i.e., the things that AdaBoost might output).

Theorem [Freund & Schapire ’97]

βˆ€ g ∈ G, err(g) ≀ err_S(g) + Γ•( √( T d / m ) )

where T = # of rounds and d = VC dimension of H.

Page 29: The Boosting Approach to Machine Learning

Generalization Guarantees

[Figure: error vs. T (# of rounds), showing the train error, the complexity term, and the resulting generalization-error bound.]

Theorem [Freund & Schapire ’97]

βˆ€ g ∈ G, err(g) ≀ err_S(g) + Γ•( √( T d / m ) )

where T = # of rounds and d = VC dimension of H.

Page 30: The Boosting Approach to Machine Learning

Generalization Guarantees

β€’ Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.

β€’ Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

Page 31: The Boosting Approach to Machine Learning

Generalization Guarantees

β€’ Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.

β€’ Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

β€’ These results seem to contradict the FS’97 bound and Occam’s razor (in order to achieve good test error, the classifier should be as simple as possible)!

Page 32: The Boosting Approach to Machine Learning

How can we explain the experiments?

Key Idea:

Training error does not tell the whole story. We also need to consider the classification confidence!!

R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in β€œBoosting the margin: A new explanation for the effectiveness of voting methods.”