Lecture 7 – Boosting
Thomas Schön, Division of Systems and Control, Department of Information Technology, Uppsala University.
Email: [email protected]
Summary of Lecture 6 (I/III)
CART: Classification And Regression Trees
Partition the input space using recursive binary splitting
[Figure: partitioning of the input space (x1, x2) into regions R1–R5 by the split points s1–s4, together with the corresponding tree representation: splits x1 ≤ s1, x2 ≤ s2, x1 ≤ s3, x2 ≤ s4 leading to the leaves R1–R5.]
• Classification: Majority vote within the region.
• Regression: Mean of training data within the region.
Summary of Lecture 6 (II/III)
Deep tree models tend to have low bias but high variance.
Bagging (= bootstrap aggregating) is a general ensemble method that can be used to reduce model variance (well suited for trees):

1. For b = 1 to B (can run in parallel):
   (a) Draw a bootstrap data set T^b from T.
   (b) Train a model y^b(x) using the bootstrapped data T^b.
2. Aggregate the B models by outputting:
   - Regression: $y_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} y^b(x)$.
   - Classification: $y_{\text{bag}}(x) = \text{MajorityVote}\{y^b(x)\}_{b=1}^{B}$.
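As a concrete illustration, here is a minimal bagging sketch in Python (not from the lecture; it assumes NumPy arrays and scikit-learn's DecisionTreeRegressor as the base model, and the function names are my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_regressor(X, y, B=100, seed=0):
    """Train B deep regression trees on bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap: draw n indices with replacement
        tree = DecisionTreeRegressor()    # deep tree: low bias, high variance
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models

def predict_bagged(models, X):
    """Aggregate by averaging the B individual predictions (regression case)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```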
Summary of Lecture 6 (III/III)
Random forests = bagged trees, but with further variance reduction achieved by decorrelating the B ensemble members.

Specifically: for each split in each tree, only a random subset of q ≤ p inputs is considered as splitting variables.
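In scikit-learn, for instance, this random subset size is exposed as the max_features argument (a sketch with illustrative settings; q = √p is a common default for classification):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt": each split considers only q = sqrt(p) randomly
# chosen inputs, decorrelating the B = 100 bagged trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```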
Contents – Lecture 7
1. Boosting – the general idea
2. AdaBoost – the first practical boosting method
3. Robust loss functions
4. Gradient boosting
Boosting
Even a simple (classification or regression) model can typically capture some aspects of the input-output relationship.

Can we then learn an ensemble of “weak models”, each describing some part of this relationship, and combine these into one “strong model”?

Boosting:
• Sequentially learns an ensemble of weak models.
• Combines these into one strong model.
• General strategy – can in principle be used to improve any supervised learning algorithm.
• One of the most successful machine learning ideas!
Boosting
[Diagram: training data → Model 1 → modified data → Model 2 → modified data → Model 3 → … → Model B → boosted model.]
The models are built sequentially such that each model tries to correct the mistakes made by the previous one!
Binary classification
We will restrict our attention to binary classification.
• Class labels are −1 and 1, i.e. y ∈ {−1, 1}.
• We have access to some (weak) base classifier, e.g. a classification tree.

Note. Using labels −1 and 1 is mathematically convenient as it allows us to express a majority vote between B classifiers y^1(x), . . . , y^B(x) as

$$\mathrm{sign}\left(\sum_{b=1}^{B} y^b(x)\right) = \begin{cases} +1 & \text{if more plus-votes than minus-votes,} \\ -1 & \text{if more minus-votes than plus-votes.} \end{cases}$$
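As a quick sanity check, this sign-of-sum vote is one line in NumPy (a toy example with made-up predictions; ties cannot occur when B is odd):

```python
import numpy as np

votes = np.array([+1, -1, +1, +1, -1])  # y^1(x), ..., y^5(x), each in {-1, +1}
majority = np.sign(votes.sum())         # +1 here: three plus-votes, two minus-votes
```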
Boosting procedure (for classification)
Boosting procedure:
1. Assign weights w_i^1 = 1/n to all data points.
2. For b = 1 to B:
   (a) Train a weak classifier y^(b)(x) on the weighted training data {(x_i, y_i, w_i^b)}_{i=1}^n.
   (b) Update the weights {w_i^{b+1}}_{i=1}^n from {w_i^b}_{i=1}^n:
      i. Increase weights for all points misclassified by y^(b)(x).
      ii. Decrease weights for all points correctly classified by y^(b)(x).

The predictions of the B classifiers, y^(1)(x), . . . , y^(B)(x), are combined using a weighted majority vote:

$$y^B_{\text{boost}}(x) = \mathrm{sign}\left(\sum_{b=1}^{B} \alpha^{(b)} y^{(b)}(x)\right).$$
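The weighted vote itself is equally compact (a toy sketch; the confidences α^(b) are made up):

```python
import numpy as np

alphas = np.array([0.8, 0.3, 0.5])  # illustrative confidences alpha^(b)
preds  = np.array([+1, -1, +1])     # y^(1)(x), y^(2)(x), y^(3)(x)
y_boost = np.sign(alphas @ preds)   # weighted majority vote: sign(1.0) = +1
```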
ex) Boosting illustration
[Figure: panels in the (X1, X2)-plane illustrating boosting iterations b = 1, 2, 3, followed by the boosted model combining α^(1)y^(1)(x), α^(2)y^(2)(x), α^(3)y^(3)(x) into y_boost(x) = sign(∑_{b=1}^{3} α^(b) y^(b)(x)).]
The technical details...

Q1: How do we reweight the data?

Q2: How are the coefficients α^(1), . . . , α^(B) computed?
Exponential loss
Loss functions for a binary classifier y(x) = sign(c(x)).
[Plot: loss vs. margin y · c(x), comparing the exponential loss and the misclassification loss.]
Exponential loss function L(y, c(x)) = exp(−y · c(x)) plotted vs. the margin y · c(x). The misclassification loss I{y ≠ y(x)} = I{y · c(x) < 0} is plotted for comparison.
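Both losses are one-liners to compute (a minimal sketch, assuming NumPy; function names are mine):

```python
import numpy as np

def exponential_loss(y, c):
    """L(y, c) = exp(-y * c); heavily penalizes large negative margins."""
    return np.exp(-y * c)

def misclassification_loss(y, c):
    """I{y * c < 0}: 1 if misclassified, 0 otherwise."""
    return (y * c < 0).astype(float)

margins = np.linspace(-2, 2, 9)
# The exponential loss upper-bounds the misclassification loss everywhere.
print(exponential_loss(1.0, margins))
print(misclassification_loss(1.0, margins))
```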
AdaBoost pseudo-code
1. Assign weights w_i^1 = 1/n to all data points.
2. For b = 1 to B:
   (a) Train a weak classifier y^(b)(x) on the weighted training data {(x_i, y_i, w_i^b)}_{i=1}^n.
   (b) Update the weights {w_i^{b+1}}_{i=1}^n from {w_i^b}_{i=1}^n:
      i. Compute the weighted classification error $E^b_{\text{train}} = \sum_{i=1}^{n} w_i^b \, \mathbb{I}\{y_i \neq y^{(b)}(x_i)\}$.
      ii. Compute the classifier “confidence” $\alpha^{(b)} = 0.5 \log\big((1 - E^b_{\text{train}})/E^b_{\text{train}}\big)$.
      iii. Compute new weights $w_i^{b+1} = w_i^b \exp\big({-\alpha^{(b)} y_i y^{(b)}(x_i)}\big)$, i = 1, . . . , n.
      iv. Normalize: $w_i^{b+1} \leftarrow w_i^{b+1} / \sum_{j=1}^{n} w_j^{b+1}$, for i = 1, . . . , n.
3. Output $y^{(B)}_{\text{boost}}(x) = \mathrm{sign}\big(\sum_{b=1}^{B} \alpha^{(b)} y^{(b)}(x)\big)$.
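Below is a minimal NumPy/scikit-learn sketch of this pseudo-code (not from the lecture; decision stumps as weak classifiers, and the function names and the clipping of the error are my own choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, B=50):
    """AdaBoost with decision stumps. Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # step 1: uniform weights
    classifiers, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # step 2(a): weighted training
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))         # i. weighted classification error
        err = np.clip(err, 1e-10, 1 - 1e-10)  # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - err) / err) # ii. classifier "confidence"
        w = w * np.exp(-alpha * y * pred)     # iii. reweight the data points
        w = w / w.sum()                       # iv. normalize
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of predictions."""
    score = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(score)
```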
Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning (ICML). Bari, Italy, 1996.
Freund and Schapire were awarded the Gödel Prize for their work on AdaBoost.
Boosting vs. bagging
Bagging vs. boosting:
• Bagging learns base models in parallel; boosting learns base models sequentially.
• Bagging uses bootstrapped datasets; boosting uses reweighted datasets.
• Bagging does not overfit as B becomes large; boosting can overfit as B becomes large.
• Bagging reduces variance but not bias (requires deep trees as base models); boosting also reduces bias (works well with shallow trees).

N.B. Boosting does not require each base model to have low bias. Thus, a shallow classification tree (say, 4–8 terminal nodes) or even a tree with a single split (2 terminal nodes, a “stump”) is often sufficient.
ex) The Viola-Jones face detector
The Viola-Jones face detector:
• Revolutionized computer face detection in 2001.
• The first real-time system; soon built into standard point-and-shoot cameras.
• Based on AdaBoost!
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Kauai, Hawaii, USA, 2001.
ex) The Viola-Jones face detector
Uses extremely simple (and fast!) base models:
1. Place a “mask” of type A, B, C or D on top of the image.
2. Compute Value = ∑(pixels in black area) − ∑(pixels in white area).
3. Classify as face or no face.

Base models are combined using AdaBoost, resulting in a fast and accurate face detector.
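A toy sketch of such a two-rectangle feature value (the mask geometry and the black/white assignment are illustrative assumptions; the real detector evaluates these via integral images for speed):

```python
import numpy as np

def haar_two_rect_value(img, r, c, h, w):
    """Value = sum(right half) - sum(left half) of an h-by-w patch at (r, c)."""
    left  = img[r:r+h, c:c+w//2].sum()
    right = img[r:r+h, c+w//2:c+w].sum()
    return right - left

img = np.arange(16.0).reshape(4, 4)          # toy 4x4 grayscale patch
print(haar_two_rect_value(img, 0, 0, 4, 4))  # prints 16.0
```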
Robust loss functions
Why exponential loss?
• Main reason – computational simplicity (cf. squared loss in linear regression).

However, models using exponential loss are sensitive to “outliers”, e.g. mislabeled or noisy data.
Robust loss functions
Instead of exponential loss we can use a robust loss function with a more gentle slope for negative margins.

[Plot: loss vs. margin y · c(x), comparing the exponential loss, binomial deviance, Huber-like loss, and misclassification loss.]
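The contrast is easy to see numerically (a sketch; this uses one common parameterization of the binomial deviance, log(1 + exp(−m)), which may differ by constants from the lecture's plot):

```python
import numpy as np

def exponential_loss(m):
    """m = y * c(x), the margin; blows up exponentially for negative margins."""
    return np.exp(-m)

def binomial_deviance(m):
    """Grows only linearly for large negative margins: more robust to outliers."""
    return np.log(1.0 + np.exp(-m))

margins = np.array([-3.0, -1.0, 0.0, 1.0])
print(exponential_loss(margins))   # ~ [20.09  2.72  1.00  0.37]
print(binomial_deviance(margins))  # ~ [ 3.05  1.31  0.69  0.31]
```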
Gradient boosting
No free lunch! The optimization problem
$$(\alpha^{(b)}, y^{(b)}) = \arg\min_{(\alpha, y)} \sum_{i=1}^{n} L\big(y_i,\, c^{(b-1)}(x_i) + \alpha\, y(x_i)\big)$$

lacks a closed-form solution for a general loss function L(y, c(x)).
Gradient boosting methods address this by using techniques from numerical optimization.
Gradient descent
Consider the generic optimization problem minθ J(θ). A numerical method for solving this problem is gradient descent.
Initialize θ(0) and iterate:
$$\theta^{(b)} = \theta^{(b-1)} - \gamma^{(b)} \nabla_\theta J(\theta^{(b-1)}),$$

where γ^(b) is the step size at the bth iteration.
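A tiny self-contained example of this iteration (the quadratic objective and fixed step size are mine, purely for illustration):

```python
def gradient_descent(grad, theta0, gamma=0.1, iters=100):
    """Generic gradient descent: theta <- theta - gamma * grad(theta)."""
    theta = theta0
    for _ in range(iters):
        theta = theta - gamma * grad(theta)
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))  # ~ 3.0
```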
This can be compared with boosting, which sequentially constructs an ensemble of models as

$$c^{(b)}(x) = c^{(b-1)}(x) + \alpha^{(b)} y^{(b)}(x)$$

for some base model y^(b)(x) and “step size” α^(b).
Gradient boosting
Gradient boosting mimics the gradient descent algorithm by fitting the bth base model to the negative gradient of the loss function,

$$g_i^{(b)} = -\left[\frac{\partial L(y_i, c)}{\partial c}\right]_{c = c^{(b-1)}(x_i)}, \quad i = 1, \ldots, n.$$

That is, y^(b)(x) is learned using {(x_i, g_i^(b))}_{i=1}^n as training data.
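A minimal sketch of this idea for squared-error loss, where the negative gradient is simply the residual y_i − c^(b−1)(x_i) (regression trees as base models; the fixed step size α and function names are my own simplifications, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_train(X, y, B=100, alpha=0.1):
    """Gradient boosting for L(y, c) = (y - c)^2 / 2, so -dL/dc = y - c."""
    c = np.zeros(len(y))                 # initial model c^(0)(x) = 0
    trees = []
    for _ in range(B):
        g = y - c                        # negative gradient = residuals
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, g)                   # fit base model to pseudo-residuals
        c = c + alpha * tree.predict(X)  # c^(b) = c^(b-1) + alpha * y^(b)
        trees.append(tree)
    return trees

def gradient_boost_predict(trees, X, alpha=0.1):
    """Sum the scaled base models (use the same alpha as in training)."""
    return alpha * sum(t.predict(X) for t in trees)
```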
Gradient boosting (with some additional bells and whistles) is used by the very popular libraries XGBoost and LightGBM:
xgboost.readthedocs.io/lightgbm.readthedocs.io/
Classification methods (covered so far)
So far in the course we have discussed the following methods for classification. . .
Parametric classifiers
1. Logistic regression – linear
2. Linear discriminant analysis – linear
3. Quadratic discriminant analysis – nonlinear

Nonparametric classifiers
4. k-nearest neighbors
5. Classification trees

Ensemble-based methods
6. Bagging (bagged trees / random forests)
7. Boosting (AdaBoost)
A few concepts to summarize lecture 7
Boosting: Sequential ensemble method, where each consecutive model tries to correct the mistakes of the previous one.
AdaBoost: The first successful boosting algorithm. Designed for binary classification.
Exponential loss: The classification loss function used by AdaBoost.
Gradient boosting: A boosting procedure that can be used with any differentiable loss function.