
Lecture 7 – Boosting

Thomas Schön
Division of Systems and Control
Department of Information Technology
Uppsala University

Email: [email protected]


Summary of Lecture 6 (I/III)

CART: Classification And Regression Trees

Partition the input space using recursive binary splitting

[Figure: Left panel – "Partitioning of input space": the (x1, x2) plane is split at the points s1, s2, s3, s4 into regions R1–R5. Right panel – "Tree representation": internal nodes x1 ≤ s1, x2 ≤ s2, x1 ≤ s3, x2 ≤ s4 with leaves R1–R5.]

• Classification: Majority vote within the region.
• Regression: Mean of training data within the region.


Summary of Lecture 6 (II/III)

Deep tree models tend to have low bias but high variance.

Bagging (= bootstrap aggregating) is a general ensemble method that can be used to reduce model variance (well suited for trees):

1. For b = 1 to B (can run in parallel):
   (a) Draw a bootstrap data set T^b from T.
   (b) Train a model y^b(x) using the bootstrapped data T^b.
2. Aggregate the B models by outputting
   - Regression: y_bag(x) = (1/B) ∑_{b=1}^{B} y^b(x).
   - Classification: y_bag(x) = MajorityVote{ y^b(x) }_{b=1}^{B}.
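A minimal sketch of the procedure for regression in Python, assuming scikit-learn is available (the choice of B = 50 and unpruned trees is illustrative, not prescribed by the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, B=50, seed=0):
    """Step 1: train B deep regression trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):                                               # can run in parallel
        idx = rng.integers(0, n, size=n)                             # (a) bootstrap data set
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))   # (b) train on it
    return models

def y_bag(models, X_new):
    """Step 2 (regression): average the B individual predictions."""
    return np.mean([m.predict(X_new) for m in models], axis=0)
```

For classification one would instead return the majority vote over the B predicted labels.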


Summary of Lecture 6 (III/III)

Random forests = bagged trees, but with further variance reduction achieved by decorrelating the B ensemble members.

Specifically:

For each split in each tree, only a random subset of q ≤ p inputs is considered as splitting variables.
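In scikit-learn this corresponds to the max_features argument of RandomForestClassifier; a minimal sketch on toy data (q = sqrt(p) and B = 100 are common choices used here, not values taken from the lecture):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                 # p = 10 inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# B = n_estimators bagged trees; at each split only a random subset of
# q = sqrt(p) inputs is considered (max_features="sqrt"), which decorrelates the trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)
print(forest.predict(X[:5]))
```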


Contents – Lecture 7

1. Boosting – the general idea
2. AdaBoost – the first practical boosting method
3. Robust loss functions
4. Gradient boosting


Boosting

Even a simple (classification or regression) model can typically capture some aspects of the input-output relationship.

Can we then learn an ensemble of “weak models”, each describing some part of this relationship, and combine these into one “strong model”?

Boosting:
• Sequentially learns an ensemble of weak models.
• Combines these into one strong model.
• General strategy – can in principle be used to improve any supervised learning algorithm.
• One of the most successful machine learning ideas!


Boosting

[Figure: Training data → Model 1; modified data → Model 2; modified data → Model 3; … modified data → Model B. The B models are combined into the boosted model.]

The models are built sequentially such that each model tries to correct the mistakes made by the previous one!


Binary classification

We will restrict our attention to binary classification.
• Class labels are −1 and 1, i.e. y ∈ {−1, 1}.
• We have access to some (weak) base classifier, e.g. a classification tree.

Note. Using labels −1 and 1 is mathematically convenient as it allows us to express a majority vote between B classifiers y^1(x), ..., y^B(x) as

sign( ∑_{b=1}^{B} y^b(x) ) = +1 if more plus-votes than minus-votes,
                             −1 if more minus-votes than plus-votes.
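A tiny numerical check of this identity, with three hypothetical classifier outputs for a single input x:

```python
import numpy as np

votes = np.array([+1, -1, +1])            # outputs of y^1(x), y^2(x), y^3(x)
print(int(np.sign(votes.sum())))          # 1: more plus-votes than minus-votes
```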


Boosting procedure (for classification)

Boosting procedure:
1. Assign weights w_i^1 = 1/n to all data points.
2. For b = 1 to B:
   (a) Train a weak classifier y^(b)(x) on the weighted training data {(x_i, y_i, w_i^b)}_{i=1}^n.
   (b) Update the weights {w_i^{b+1}}_{i=1}^n from {w_i^b}_{i=1}^n:
       i. Increase weights for all points misclassified by y^(b)(x).
       ii. Decrease weights for all points correctly classified by y^(b)(x).

The predictions of the B classifiers, y^(1)(x), ..., y^(B)(x), are combined using a weighted majority vote:

y_boost^B(x) = sign( ∑_{b=1}^{B} α^(b) y^(b)(x) ).


ex) Boosting illustration

[Figure: boosting illustration in the (X1, X2) plane. Three panels show the weak classifiers at iterations b = 1, b = 2 and b = 3, i.e. α^(1)y^(1)(x), α^(2)y^(2)(x) and α^(3)y^(3)(x). A final panel shows the boosted model y_boost(x) = sign( ∑_{b=1}^{3} α^(b) y^(b)(x) ).]


The technical details...

Q1: How do we reweight the data?

Q2: How are the coefficients α^(1), ..., α^(B) computed?


Exponential loss

Loss functions for a binary classifier y(x) = sign(c(x)).

[Figure: Exponential loss function L(y, c(x)) = exp(−y · c(x)) plotted against the margin y · c(x). The misclassification loss I{y ≠ y(x)} = I{y · c(x) < 0} is plotted for comparison.]
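A small sketch reproducing the two curves numerically (values only; the plotting itself is omitted):

```python
import numpy as np

margin = np.linspace(-2, 2, 9)                     # y * c(x)
exp_loss = np.exp(-margin)                         # exponential loss
misclass = (margin < 0).astype(float)              # misclassification loss I{y*c(x) < 0}
for m, e, z in zip(margin, exp_loss, misclass):
    print(f"margin {m:+.1f}: exponential {e:5.2f}, misclassification {z:.0f}")
```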


AdaBoost pseudo-code

1. Assign weights w_i^1 = 1/n to all data points.
2. For b = 1 to B:
   (a) Train a weak classifier y^(b)(x) on the weighted training data {(x_i, y_i, w_i^b)}_{i=1}^n.
   (b) Update the weights {w_i^{b+1}}_{i=1}^n from {w_i^b}_{i=1}^n (increase for misclassified points, decrease for correctly classified points):
       i. Compute the weighted classification error
          E_train^b = ∑_{i=1}^{n} w_i^b I{y_i ≠ y^(b)(x_i)}.
       ii. Compute the classifier “confidence”
          α^(b) = 0.5 log( (1 − E_train^b) / E_train^b ).
       iii. Compute the new weights
          w_i^{b+1} = w_i^b exp( −α^(b) y_i y^(b)(x_i) ), i = 1, ..., n.
       iv. Normalize: w_i^{b+1} ← w_i^{b+1} / ∑_{j=1}^{n} w_j^{b+1}, for i = 1, ..., n.
3. Output y_boost^(B)(x) = sign( ∑_{b=1}^{B} α^(b) y^(b)(x) ).

Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning (ICML), Bari, Italy, 1996.

Freund and Schapire were awarded the Gödel Prize for this work.
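A minimal sketch of the pseudo-code in Python, using decision stumps from scikit-learn as the weak classifier (the stump choice, B = 50 and the small clipping of the error to keep the logarithm finite are assumptions on top of the algorithm above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, B=50):
    """AdaBoost for binary labels y in {-1, +1}, following the pseudo-code above."""
    n = len(y)
    w = np.full(n, 1.0 / n)                               # 1. uniform initial weights
    stumps, alphas = [], []
    for _ in range(B):
        # (a) weak classifier trained on the weighted data (a "stump": a single split)
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # (b) i. weighted classification error (clipped to avoid division by zero)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        # ii. classifier "confidence"
        alpha = 0.5 * np.log((1.0 - err) / err)
        # iii. new weights and iv. normalization
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        # 3. weighted majority vote
        return np.sign(sum(a * s.predict(X_new) for a, s in zip(alphas, stumps)))

    return predict
```

Calling adaboost(X, y) returns a prediction function; note that the labels must follow the ±1 convention introduced earlier.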


Boosting vs. bagging

Bagging                               | Boosting
Learns base models in parallel        | Learns base models sequentially
Uses bootstrapped datasets            | Uses reweighted datasets
Does not overfit as B becomes large   | Can overfit as B becomes large
Reduces variance but not bias         | Also reduces bias!
(requires deep trees as base models)  | (works well with shallow trees)

N.B. Boosting does not require each base model to have low bias. Thus, a shallow classification tree (say, 4-8 terminal nodes) or even a tree with a single split (2 terminal nodes, a “stump”) is often sufficient.


ex) The Viola-Jones face detector


ex) The Viola-Jones face detector

The Viola-Jones face detector:
• Revolutionized computer face detection in 2001.
• First real-time system – soon built into standard point-and-shoot cameras.
• Based on AdaBoost!

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, Hawaii, USA, 2001.


ex) The Viola-Jones face detector

Uses extremely simple (and fast!) base models:
1. Place a “mask” of type A, B, C or D on top of the image.
2. Value = ∑(pixels in black area) − ∑(pixels in white area).
3. Classify as face or no face.

The base models are combined using AdaBoost, resulting in a fast and accurate face detector.
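A sketch of how such a feature value can be computed efficiently with an integral image (the function names, the toy image and the convention that the right half is the black area are illustrative assumptions, not details from the original detector):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum needs only four lookups."""
    ii = img.astype(float).cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))        # zero row/column above and to the left

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in the given rectangle, using the padded integral image."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def two_rect_feature(img, top, left, height, width):
    """Value = sum(pixels in black area) - sum(pixels in white area)
    for a two-rectangle mask split vertically down the middle."""
    ii = integral_image(img)
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return black - white

img = np.arange(64).reshape(8, 8)              # toy 8x8 "image"
print(two_rect_feature(img, top=2, left=2, height=4, width=4))
```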


Robust loss functions

Why exponential loss?
• Main reason – computational simplicity (cf. squared loss in linear regression).

However, models using exponential loss are sensitive to “outliers”, e.g. mislabeled or noisy data.


Robust loss functions

Instead of exponential loss we can use a robust loss function with a more gentle slope for negative margins.

[Figure: exponential loss, binomial deviance, Huber-like loss and misclassification loss plotted against the margin y · c(x).]
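For reference, two of the plotted losses written as functions of the margin (the binomial deviance is given in its common logistic-loss form; the exact scaling convention may differ from the figure, and the Huber-like loss is omitted here):

```latex
L_{\text{exp}}\bigl(y, c(x)\bigr) = \exp\bigl(-y \cdot c(x)\bigr),
\qquad
L_{\text{dev}}\bigl(y, c(x)\bigr) = \log\Bigl(1 + \exp\bigl(-y \cdot c(x)\bigr)\Bigr).
```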


Gradient boosting

No free lunch! The optimization problem

(α^(b), y^(b)) = arg min_{(α, y)} ∑_{i=1}^{n} L( y_i, c^(b−1)(x_i) + α y(x_i) )

lacks a closed-form solution for a general loss function L(y, c(x)).

Gradient boosting methods address this by using techniques from numerical optimization.


Gradient descent

Consider the generic optimization problem min_θ J(θ). A numerical method for solving this problem is gradient descent.

Initialize θ^(0) and iterate:

θ^(b) = θ^(b−1) − γ^(b) ∇_θ J(θ^(b−1)),

where γ^(b) is the step size at the bth iteration.

Can be compared with boosting – sequentially construct an ensemble of models as:

c^(b)(x) = c^(b−1)(x) + α^(b) y^(b)(x)

for some base model y^(b)(x) and “step size” α^(b).
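A minimal numerical illustration of the gradient descent iteration on a toy objective (the quadratic J and the constant step size are placeholders chosen only for this example):

```python
import numpy as np

def grad_J(theta):
    """Gradient of the toy objective J(theta) = 0.5 * ||theta - 3||^2."""
    return theta - 3.0

theta = np.zeros(2)                                  # theta^(0)
gamma = 0.1                                          # constant step size gamma^(b)
for _ in range(100):
    theta = theta - gamma * grad_J(theta)            # theta^(b) = theta^(b-1) - gamma * grad J
print(theta)                                         # approaches the minimizer [3, 3]
```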


Gradient boosting

Gradient boosting mimics the gradient descent algorithm by fitting the bth base model to the negative gradient of the loss function,

g_i^(b) = − [ ∂L(y_i, c) / ∂c ]_{c = c^(b−1)(x_i)},   i = 1, ..., n.

That is, y^(b)(x) is learned using {(x_i, g_i^(b))}_{i=1}^n as training data.
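A minimal sketch for the squared-error loss L(y, c) = (y − c)^2 / 2, whose negative gradient is simply the residual y − c; the fixed learning rate used in place of a line search, and the tree depth, are simplifying assumptions (scikit-learn assumed available):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, B=100, learning_rate=0.1, max_depth=2):
    """Each base tree is fitted to the negative gradient of the squared-error loss,
    i.e. the current residuals y - c^(b-1)(x_i)."""
    init = y.mean()
    c = np.full(len(y), init)                       # initial constant model c^(0)
    trees = []
    for _ in range(B):
        g = y - c                                   # negative gradient = residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        c += learning_rate * tree.predict(X)        # c^(b) = c^(b-1) + step * y^(b)
        trees.append(tree)

    def predict(X_new):
        pred = np.full(X_new.shape[0], init)
        for tree in trees:
            pred += learning_rate * tree.predict(X_new)
        return pred

    return predict

# toy example: noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)
predict = gradient_boost_regression(X, y)
print(np.mean((predict(X) - y) ** 2))               # training mean squared error
```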

Gradient boosting (with some additional bells and whistles) is used by the very popular libraries XGBoost and LightGBM:

xgboost.readthedocs.io
lightgbm.readthedocs.io
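A minimal usage sketch, assuming xgboost is installed (the toy data and the hyperparameter values are illustrative, not recommendations from the lecture):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # binary labels coded as 0/1

# gradient-boosted trees: B = n_estimators, step size = learning_rate
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))
```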


Classification methods (covered so far)

So far in the course we have discussed the following methods for classification...

Parametric classifiers
1. Logistic regression – linear
2. Linear discriminant analysis – linear
3. Quadratic discriminant analysis – nonlinear

Nonparametric classifiers
4. k-nearest neighbors
5. Classification trees

Ensemble-based methods
6. Bagging (bagged trees / random forests)
7. Boosting (AdaBoost)


A few concepts to summarize lecture 7

Boosting: Sequential ensemble method, where each consecutive model tries to correct the mistakes of the previous one.

AdaBoost: The first successful boosting algorithm. Designed for binary classification.

Exponential loss: The classification loss function used by AdaBoost.

Gradient boosting: A boosting procedure that can be used with any differentiable loss function.
