Boosting & AdaBoost. Luis De Alba ([email protected]). 5.3.2008


Page 1: Boosting & AdaBoost

Boosting & AdaBoost

Luis De Alba, [email protected]

5.3.2008

Page 2: Boosting & AdaBoost

Boosting

Definition: A method of producing a very accurate prediction rule by combining inaccurate rules of thumb.

Objective: Improve the accuracy of any given learning algorithm (LMS, SVM, Perceptron, etc.).

Page 3: Boosting & AdaBoost

How to boost in two steps:
1. Select a classifier with an accuracy greater than chance (random guessing).
2. Form an ensemble of the selected classifiers whose joint decision rule has high accuracy.

Page 4: Boosting & AdaBoost

Example 1: Golf tournament. Tiger Woods vs. 18 average-or-better players.

Page 5: Boosting & AdaBoost

Example 2: Classification problem. 2 dimensions, 2 classes, 3 classifiers. Training set D = {(x_1, y_1), ..., (x_m, y_m)}.

a) Train C_1 with D_1, where D_1 is randomly selected from D.
b) Train C_2 with D_2, where only 50% of the selected set is correctly classified by C_1.

Page 6: Boosting & AdaBoost

c) Train C_3 with D_3, where the classification of the selected D_3 disagrees between C_1 and C_2.

Classification task: to classify a new x, evaluate C_1 and C_2. If they agree (C_1 = C_2), output that common label (use C_1); if they disagree, output C_3's label.
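The joint decision rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the classifier objects c1, c2, c3 and their call signature are made up, not part of the slides.

```python
def ensemble_predict(c1, c2, c3, x):
    """Joint decision rule from the example: trust C1 and C2 when they
    agree, and let C3 break the tie when they disagree.

    c1, c2, c3 are assumed to be trained classifiers mapping an input x
    to a label in {-1, +1} (illustrative names, not from the slides).
    """
    label1, label2 = c1(x), c2(x)
    if label1 == label2:
        return label1      # C1 and C2 agree: use their common label
    return c3(x)           # they disagree: defer to C3
```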

Page 7: Boosting & AdaBoost

[Figure from Duda, et al.]

Page 8: Boosting & AdaBoost

AdaBoost

Adaptive Boosting, aka "AdaBoost". Introduced by Yoav Freund and Robert Schapire in 1995, both from AT&T Labs.

Advantages:
- As many weak learners as needed (wished).
- "Focuses in" on the informative or "difficult" patterns.

Page 9: Boosting & AdaBoost

Algorithm inputs:
- Training set: (x_1, y_1), ..., (x_m, y_m), with Y = {−1, +1} (two classes).
- Weak or base algorithm C (LMS, Perceptron, etc.).
- Number of rounds (calls) to the weak algorithm, k = 1, ..., K.

Page 10: Boosting & AdaBoost

Each training tuple receives a weight that determines its probability of being selected.

At initialization, all weights across the training set are uniform: W_1(i) = 1/m, i.e., each training pair (x_i, y_i) starts with weight w_i = 1/m, for i = 1, ..., m.

Page 11: Boosting & AdaBoost

For each iteration k, a random set is selected from the training set according to the weights, and the classifier C_k is trained with the selection.
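A minimal sketch of this weighted sampling step, assuming NumPy and a training set stored in arrays X and y (illustrative names): examples are drawn with replacement, each with probability proportional to its current weight.

```python
import numpy as np

def sample_training_set(X, y, weights, rng=None):
    """Draw a sample D_k of the same size as the training set, where each
    example i is selected with probability weights[i] (with replacement)."""
    if rng is None:
        rng = np.random.default_rng()
    m = len(X)
    idx = rng.choice(m, size=m, replace=True, p=weights)
    return X[idx], y[idx]
```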

Page 12: Boosting & AdaBoost

After C_k is trained:
- Increase the weights of training patterns misclassified by C_k.
- Decrease the weights of training patterns correctly classified by C_k.

Repeat for all K rounds.
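The reweighting can be written compactly, again assuming NumPy arrays; alpha_k here is the classifier weight defined on the next slide, and all names are illustrative.

```python
import numpy as np

def update_weights(weights, y, predictions, alpha_k):
    """Multiply each weight by exp(-alpha_k) when the example was classified
    correctly (y_i * C_k(x_i) = +1) and by exp(+alpha_k) when it was
    misclassified (y_i * C_k(x_i) = -1), then renormalize to a distribution."""
    new_w = weights * np.exp(-alpha_k * y * predictions)
    return new_w / new_w.sum()   # division by the normalization factor Z_k
```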

Page 13: Boosting & AdaBoost

Algorithm:
  initialize D = {(x_1, y_1), ..., (x_m, y_m)}, K, W_1(i) = 1/m for i = 1, ..., m
  for k = 1 to K:
    train weak learner C_k using D_k sampled according to W_k(i)
    E_k ← training error of C_k measured using D_k
    α_k ← (1/2) · ln[(1 − E_k) / E_k]
    W_{k+1}(i) ← W_k(i) · exp(−α_k · y_i · C_k(x_i)) / Z_k
  return C_k and α_k for all K
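Putting the pieces together, here is a self-contained Python sketch of the loop above, under stated assumptions: the weak learner is a trivial one-feature threshold stump rather than the Perceptron the course suggests, the guard against E_k reaching 0 or 1 is a practical addition, and all names are illustrative.

```python
import numpy as np

def train_stump(X, y):
    """Deliberately simple weak learner (an assumption for illustration):
    threshold one feature, choosing the feature, threshold and polarity
    with the lowest training error on the given sample."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] >= thr, polarity, -polarity)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, best = err, (j, thr, polarity)
    j, thr, polarity = best
    return lambda X_new: np.where(X_new[:, j] >= thr, polarity, -polarity)

def adaboost_train(X, y, K, weak_train=train_stump, rng=None):
    """Sketch of the loop on the slide: sample D_k according to the weights,
    train C_k, compute E_k on D_k, set alpha_k, reweight, repeat."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = len(X)
    weights = np.full(m, 1.0 / m)              # W_1(i) = 1/m
    classifiers, alphas = [], []
    for _ in range(K):
        idx = rng.choice(m, size=m, replace=True, p=weights)
        X_k, y_k = X[idx], y[idx]              # D_k sampled according to W_k
        C_k = weak_train(X_k, y_k)
        E_k = np.mean(C_k(X_k) != y_k)         # training error of C_k on D_k
        E_k = np.clip(E_k, 1e-10, 1 - 1e-10)   # practical guard against ln(0)
        alpha_k = 0.5 * np.log((1.0 - E_k) / E_k)
        weights = weights * np.exp(-alpha_k * y * C_k(X))
        weights /= weights.sum()               # divide by Z_k: W_{k+1} is a distribution
        classifiers.append(C_k)
        alphas.append(alpha_k)
    return classifiers, alphas
```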

Page 14: Boosting & AdaBoost

Output of the final classifier for any given input:

H(x) = sign( Σ_{k=1}^{K} α_k · C_k(x) )
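The final classifier simply combines the K weak hypotheses; a small sketch (illustrative names) of H(x) = sign(Σ α_k · C_k(x)) evaluated for a batch of inputs:

```python
import numpy as np

def adaboost_predict(classifiers, alphas, X_new):
    """H(x) = sign( sum_k alpha_k * C_k(x) ), evaluated for all rows of X_new."""
    scores = sum(a * C(X_new) for a, C in zip(alphas, classifiers))
    return np.sign(scores)
```

For example, passing the lists returned by the training sketch above as `classifiers` and `alphas` would give the ensemble's labels for X_new.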

Page 15: Boosting & AdaBoost

Error calculation:

The error is the count, normalized by m, of all training examples wrongly classified by the weak learner:

E_k = (1/m) · Σ_{i=1}^{m} 1[C_k(x_i) ≠ y_i]

The weak learner shall do better than random guessing, i.e., E_k < 0.5.

Page 16: Boosting & AdaBoost

Alpha (α) measurement:

α_k = (1/2) · ln[(1 − E_k) / E_k]

It measures the importance assigned to C_k. There are two things to note:
- α_k ≥ 0 if E_k ≤ 1/2.
- α_k gets larger as E_k gets smaller.

Page 17: Boosting & AdaBoost

Normalization factor:

Z_k is a normalization factor chosen so that W_{k+1} will be a distribution. Possible calculation:

Z_k = Σ_{i=1}^{m} W_{k+1}(i)

Page 18: Boosting & AdaBoost

Hypothesis:

H(x) = sign( Σ_{k=1}^{K} α_k · C_k(x) )

H is a weighted majority vote of the K weak hypotheses, where α_k is the weight associated with C_k.

For each instance x, the output is a prediction whose sign represents the class (−1, +1) and whose magnitude represents the confidence; for a confidence-rated weak hypothesis this magnitude is |C_k(x)|.

Page 19: Boosting & AdaBoost

Why better than random?

Suppose three classifiers:
- Worse than random (E = 0.75): Alpha = −0.549, CC = 1.731, IC = 0.577
- Random (E = 0.50): Alpha = 0.0, CC = 1.0, IC = 1.0
- Better than random (E = 0.25): Alpha = 0.549, CC = 0.577, IC = 1.731

CC = Correctly Classified Weight Factor; IC = Incorrectly Classified Weight Factor.
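The numbers on this slide follow directly from the formulas for α_k and the weight factors exp(∓α_k); a quick check:

```python
import math

for E in (0.75, 0.50, 0.25):
    alpha = 0.5 * math.log((1 - E) / E)
    cc = math.exp(-alpha)   # factor applied to correctly classified weights
    ic = math.exp(+alpha)   # factor applied to incorrectly classified weights
    print(f"E={E:.2f}  alpha={alpha:+.3f}  CC={cc:.3f}  IC={ic:.3f}")
```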

Page 20: Boosting & AdaBoost

Generalization Error

The generalization error can be bounded in terms of the sample size m, the VC-dimension d, and the number of boosting rounds K: roughly the training error plus Õ(√(Kd/m)).

This bound suggests overfitting when K is too large. Experiments showed, however, that AdaBoost often does not overfit: it continues improving even after the training error reaches zero. A margin-based bound of the form Õ(√(d/(mθ²))), where θ is the margin, is independent of K [1].

Page 21: Boosting & AdaBoost

[Figure from Duda, et al.]

Page 22: Boosting & AdaBoost

Multi-class, Multi-label

In the multi-label case, each instance x ∈ X may belong to multiple labels in the label set 𝒴. Training examples are pairs (x, Y) where Y ⊆ 𝒴; for example, instances x_1 and x_2 may each carry several of the labels y_1, ..., y_5.

Page 23: Boosting & AdaBoost

Decompose the problem into n orthogonal binary classification problems.

For Y ⊆ 𝒴, define Y[ℓ] for ℓ ∈ 𝒴 to be:

Y[ℓ] = +1 if ℓ ∈ Y, −1 if ℓ ∉ Y

H : X × 𝒴 → {−1, +1} is defined by H(x, ℓ) = H(x)[ℓ].

Page 24: Boosting & AdaBoost

Then, replace each training example (x_i, Y_i) by k examples ((x_i, ℓ), Y_i[ℓ]) for ℓ ∈ 𝒴.

The algorithm is called AdaBoost.MH [3].
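The reduction to binary problems can be sketched as follows; the label set and example data are made up for illustration, and only the decomposition step (not the full AdaBoost.MH algorithm of [3]) is shown.

```python
def decompose_multilabel(examples, label_set):
    """Replace each (x_i, Y_i), with Y_i a subset of the label set, by one
    binary example ((x_i, l), Y_i[l]) per label l, where Y_i[l] = +1 if
    l is in Y_i and -1 otherwise."""
    binary_examples = []
    for x, Y in examples:
        for label in label_set:
            sign = +1 if label in Y else -1
            binary_examples.append(((x, label), sign))
    return binary_examples

# Illustrative usage (made-up data): two instances, five possible labels.
labels = ["y1", "y2", "y3", "y4", "y5"]
data = [("x1", {"y1", "y3"}), ("x2", {"y2", "y4", "y5"})]
print(decompose_multilabel(data, labels))
```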

Page 25: Boosting & AdaBoost

Homework hints
- The weak learning function is a simple Perceptron.
- Do not forget how Alpha is computed.
- Do not forget what the weight vector is and how it is computed.
- Do not be lazy!

Page 26: Boosting & AdaBoost
Page 27: Boosting & AdaBoost

Conclusion

AdaBoost does not require prior knowledge of the weak learner.

The performance is completely dependent on the learner and the training data.

AdaBoost can identify outliers based on the weight vector.

It is susceptible to noise and to data sets with a very large number of outliers.

Page 28: Boosting & AdaBoost

Bibliography

[1] A Short Introduction to Boosting. Yoav Freund, Robert Schapire. 1999.

[2] Pattern Classification. Richard Duda, et al. Wiley Interscience, 2nd Edition. 2000. [Section 9.5.2]

[3] Improved Boosting Algorithms Using Confidence-Rated Predictions. Robert Schapire, Yoram Singer. 1998. [Sections 6, 7]