The Boosting Approach to Machine Learning: An Overview (Robert E. Schapire)


Page 1: The Boosting Approach to Machine Learning  An Overview

The Boosting Approach to Machine Learning: An Overview
(Robert E. Schapire)

Giulio Meneghin, Javier Kreiner

March 4, 2009

Page 2: The Boosting Approach to Machine Learning An Overview

Outline

1 Introduction

2 AdaBoost

3 Why does it work?

4 Extensions of AdaBoost

5 Applications

6 Conclusions

Page 3: The Boosting Approach to Machine Learning An Overview

Introduction

Boosting Approach: Combining simple rules

Suppose we have a classification problem.

Sometimes finding a lot of simple but not so accurate classification rules is much easier than finding a single highly accurate classification rule. (An algorithm that provides us with simple rules is called a weak learner.)

Many weak classifiers (simple rules) → one strong classifier (accurate rule).

Key idea: give importance to misclassified data. How? Have a distribution over the training data.

Finally: find a good way to combine the weak classifiers into a general rule.

Page 4: The Boosting Approach to Machine Learning An Overview

Introduction

Example

Suppose we have points belonging to two different distributions.

Blue ∼ N(0, 1), red with density 1/(r√(8π³)) · e^(−(r−4)²/2)
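To make the example concrete, here is a small sampling sketch (our illustration, not part of the original slides; the names m, blue, red are our own). It reads the red density as a ring: uniform angle and radius distributed as N(4, 1).

import numpy as np

rng = np.random.default_rng(0)
m = 200                                   # points per class (arbitrary choice)

# Blue class: standard 2-D Gaussian.
blue = rng.normal(0.0, 1.0, size=(m, 2))

# Red class: the density above corresponds to a uniform angle and a
# radius distributed as N(4, 1), i.e. a noisy ring of radius 4.
theta = rng.uniform(0.0, 2.0 * np.pi, size=m)
radius = rng.normal(4.0, 1.0, size=m)
red = np.column_stack((radius * np.cos(theta), radius * np.sin(theta)))

X = np.vstack((blue, red))                        # training inputs
y = np.concatenate((-np.ones(m), np.ones(m)))     # labels in {-1, +1}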

Page 5: The Boosting Approach to Machine Learning An Overview

Introduction

Example

It’s very straightforward to come up with linear classifiers:

Page 6: The Boosting Approach to Machine Learning An Overview

Introduction

Example

If we combined many of these classifiers, we could obtain a very accurate rule:

Page 7: The Boosting Approach to Machine Learning An Overview

AdaBoost

What is a weak learner?

Given a set of examples X = {(x1, y1), ..., (xm, ym)} with yi ∈ Y = {−1, 1}, and a distribution Dt(i) over the examples, the weak learner gives us a hypothesis (or classifier)

ht : X → Y

The goodness of that classifier is measured by its error:

εt = Pr_{i∼Dt}[ht(xi) ≠ yi] = Σ_{i : ht(xi) ≠ yi} Dt(i)

We ask of a weak learner that it consistently give an error εt < 1/2 − γ (slightly better than random guessing) for any distribution over the examples.

NOTE: If the learning algorithm doesn't accept a distribution, we simply resample the training set according to Dt and give the resampled training set to the weak learner.
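As a concrete illustration of such a weak learner (our own sketch, not part of the original slides; the name train_stump and its signature are our choices), the snippet below fits a decision stump, i.e. a threshold on a single feature, by minimizing the weighted error εt under the distribution Dt.

import numpy as np

def train_stump(X, y, D):
    # Weak learner: choose the single-feature threshold rule with the
    # smallest weighted error under the distribution D. Labels are in {-1, +1}.
    m, n_features = X.shape
    best = None  # (error, feature index, threshold, polarity)
    for j in range(n_features):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = D[pred != y].sum()          # weighted error eps_t
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    err, j, thr, polarity = best
    h = lambda Z, j=j, thr=thr, polarity=polarity: np.where(
        polarity * (Z[:, j] - thr) >= 0, 1, -1)
    return h, err

Any other learning algorithm that accepts example weights (or the resampling trick in the note above) could play the same role.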

Page 8: The Boosting Approach to Machine Learning  An Overview


AdaBoost

Adaboost pseudocode

Input:

a set of examples X = {(x1, y1), ..., (xm, ym)} with yi ∈ Y = {−1, 1}

a weak learning algorithm WeakLearn

the number of iterations T

Page 9: The Boosting Approach to Machine Learning An Overview

AdaBoost

Adaboost pseudocode

Initialize the distribution D1(i) = 1/m for all i (same weight for every example).

For t = 1, ..., T:

1 Call WeakLearn using distribution Dt.

2 Get back a classifier ht : X → Y.

3 Compute the error of ht: εt = Σ_{i : ht(xi) ≠ yi} Dt(i). If εt > 1/2, set T = t − 1 and exit the loop.

4 Set αt (for example, αt = (1/2) ln((1 − εt)/εt)).

5 Update Dt:

Dt+1(i) = (Dt(i)/Zt) × e^(−αt) if ht(xi) = yi, or × e^(αt) if ht(xi) ≠ yi;
in general, Dt+1(i) = (Dt(i)/Zt) · e^(−αt yi ht(xi)),

where Zt is a normalization constant (so that Dt+1 is a valid distribution). Since αt > 0, this procedure puts more weight on difficult (misclassified) examples.
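A quick worked example of steps 4 and 5, with made-up numbers: suppose εt = 0.2. Then αt = (1/2) ln(0.8/0.2) = (1/2) ln 4 ≈ 0.69, so every correctly classified example has its weight multiplied by e^(−0.69) ≈ 0.5 and every misclassified example by e^(0.69) ≈ 2.0 before normalization. After dividing by Zt, the misclassified examples carry exactly half of the total weight, which is what forces the next weak classifier to focus on them.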

Page 14: The Boosting Approach to Machine Learning An Overview

AdaBoost

Adaboost pseudocode

Output: The final classifier:

H(x) = sign( Σ_{t=1}^{T} αt ht(x) )

Note that this is a weighted vote of the classifiers from each iteration.
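Putting the pseudocode and this final vote together, here is a compact Python sketch (our illustration, not the paper's reference implementation; it assumes a weak learner such as the train_stump function sketched earlier is available).

import numpy as np

def adaboost(X, y, weak_learn, T):
    # AdaBoost as in the pseudocode above. weak_learn(X, y, D) must return
    # a hypothesis h (mapping an array of examples to {-1, +1}) and its
    # weighted error eps under the distribution D.
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        h, eps = weak_learn(X, y, D)             # steps 1-3
        if eps > 0.5:                            # no better than chance: stop
            break
        eps = max(eps, 1e-12)                    # guard against division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # step 4
        D = D * np.exp(-alpha * y * h(X))        # step 5 (general formula)
        D = D / D.sum()                          # normalize by Z_t
        hyps.append(h)
        alphas.append(alpha)

    def H(Z):
        # final classifier: sign of the weighted vote of all weak hypotheses
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, hyps)))

    return H

For example, H = adaboost(X, y, train_stump, T=50) returns the weighted-vote classifier H(x) of this slide.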

Page 15: The Boosting Approach to Machine Learning An Overview

AdaBoost

Diagram

Page 16: The Boosting Approach to Machine Learning An Overview

Why does it work?

Bound on the training error

Fortunately we have a strong guarantee on the training error (the proportion of misclassified training examples):

(1/m) |{i : H(xi) ≠ yi}| ≤ ∏_t Zt ≤ e^(−2 Σ_{t=1}^{T} γt²) ≤ e^(−2Tγ²)

Here γt = 1/2 − εt. Provided WeakLearn always does better than chance, so that γt ≥ γ > 0:

The training error drops exponentially fast with T .
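For a feel of the numbers: if every round achieves γt ≥ 0.1 (that is, εt ≤ 0.4), then after T = 200 rounds the bound is e^(−2·200·0.01) = e^(−4) ≈ 0.018, so at most about 2% of the training examples can still be misclassified.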

Page 17: The Boosting Approach to Machine Learning An Overview

Why does it work?

Generalization error

The generalization error is bounded by:

Pr[H(x) ≠ y] + Õ( √(Td / m) )

where Pr[·] is the empirical probability (measured on the training set), d is the Vapnik-Chervonenkis dimension of the space of weak classifiers, and Õ hides all logarithmic and constant factors.

This suggests that boosting should overfit, since the bound grows with T.

Empirically, overfitting often does not happen; the bound is simply not tight enough.

There are other bounds on the generalization error (for example, margin-based ones), but in general they give only a qualitative picture; they are too weak to be quantitatively useful.

Page 18: The Boosting Approach to Machine Learning An Overview

Why does it work?

Demos

http://vision.ucla.edu/~vedaldi/code/snippets/snippets.html

http://www.cse.ucsd.edu/~yfreund/adaboost/index.html

Page 19: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Extensions of AdaBoost

There are a lot of variants of AdaBoost, depending on the specific problem to be solved.

One of them is AdaBoost.M1.

Used for k-class classification.

Page 20: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Page 21: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Multiclass classification: AdaBoost.M1

If k = 2, chance performance (accuracy) = 1/2.

If k = n, chance performance = 1/n.

AdaBoost.M1 still requires WeakLearner accuracy above 1/2 (error εt < 1/2), regardless of k.

So for more than two classes, WeakLearner must already be a fairly strong learner if we want to use AdaBoost.M1.

Page 22: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Multiclass classification: AdaBoost.M2

Instead of a single class, WeakLearner can specify a set of plausible labels.

The weak hypothesis is a k-element vector with elements in [0, 1].

0 = not plausible, 1 = plausible (≠ probable).

Requires modification of WeakLearner.

Page 23: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

OCR example

Page 24: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Incorporating Human Knowledge

AdaBoost performance depends on the amount of training data (among other things).

A solution: exploit available human knowledge while boosting the weak classifier.

Page 25: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Incorporating Human Knowledge: BoosTexter

Binary classification problem, classes in {−1, +1}.

A human expert constructs a function p that estimates, for each instance, the probability that it belongs to class +1.

p need not be highly accurate; another parameter, µ, controls how much confidence is placed in the human expert's knowledge.

Page 26: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Text classification performance

Comparison of AdaBoost with four other text categorization methods on two datasets, Reuters newswire articles (left) and AP newswire headlines (right). x-axis: number of class labels.

Page 27: The Boosting Approach to Machine Learning An Overview

Extensions of AdaBoost

Finding outliers

AdaBoost gives higher weights to harder examples.

Examples with highest weights are often outliers.

If there are too many outliers, or the data is too noisy, performance decreases (although solutions have been proposed).

Page 28: The Boosting Approach to Machine Learning An Overview

Applications

Applications

Text filtering: Schapire, Singer, Singhal. Boosting and Rocchio applied to text filtering. 1998

Routing: Iyer, Lewis, Schapire, Singer, Singhal. Boosting for document routing. 2000

Ranking problems: Freund, Iyer, Schapire, Singer. An efficient boosting algorithm for combining preferences. 1998

Image retrieval: Tieu, Viola. Boosting image retrieval. 2000

Medical diagnosis: Merler, Furlanello, Larcher, Sboner. Tuning cost sensitive boosting and its application to melanoma diagnosis. 2001

Page 29: The Boosting Approach to Machine Learning An Overview

Conclusions

Conclusions

Perspective shift: you may not need to find a single perfect classifier; combining many good-enough classifiers can be sufficient.

An example we have already seen: face detection.

Thoughts:

Fast, simple, easy to program.

Tuned by only one parameter (T, the number of iterations).

Theoretical guarantees, given: (1) WeakLearner performance, (2) training set size.

Variants exist to address specific problems.

Page 30: The Boosting Approach to Machine Learning An Overview

Conclusions

References

Freund, Y.; Schapire, R.E., Experiments with a New Boosting Algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156, 1996

Rochery, M.; Schapire, R.E.; Rahim, M.; Gupta, N., BoosTexter for text categorization in spoken language dialogue, Unpublished manuscript, 2001

Schapire, R.E.; Singer, Y., Improved boosting algorithms using confidence-rated predictions, Machine Learning, 37(3), pp. 297-336, 1999

Schapire, R.E., The Boosting Approach to Machine Learning: An Overview, MSRI Workshop on Nonlinear Estimation and Classification, 2002

Matas, J.; Sochman, J., AdaBoost, http://cmp.felk.cvut.cz

Hoiem, D., AdaBoost, www.cs.uiuc.edu/homes/dhoiem/presentations/AdaboostTutorial.ppt

Page 31: The Boosting Approach to Machine Learning An Overview

Conclusions

Useful links

- http://www.boosting.org
- http://www.cs.princeton.edu/~schapire/boost.html

Page 32: The Boosting Approach to Machine Learning An Overview

Conclusions

Thank you!
