Lecture 7 – Boosting
Thomas Schön, Division of Systems and Control, Department of Information Technology, Uppsala University.
Email: [email protected]
Summary of Lecture 6 (I/III)
CART: Classification And Regression Trees
Partition the input space using recursive binary splitting
[Figure: partitioning of the input space (x1, x2) into regions R1–R5 by the split points s1–s4, together with the corresponding tree representation: splits x1 ≤ s1, x2 ≤ s2, x1 ≤ s3, x2 ≤ s4 leading to the leaves R1–R5.]
• Classification: Majority vote within the region.
• Regression: Mean of training data within the region.
Summary of Lecture 6 (II/III)
Deep tree models tend to have low bias but high variance.
Bagging (= bootstrap aggregating) is a general ensemble method that can be used to reduce model variance (well suited for trees):

1. For b = 1 to B (can run in parallel):
   (a) Draw a bootstrap data set T^b from T.
   (b) Train a model y^b(x) using the bootstrapped data T^b.
2. Aggregate the B models by outputting:
   - Regression: $y_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} y^b(x)$.
   - Classification: $y_{\text{bag}}(x) = \text{MajorityVote}\{y^b(x)\}_{b=1}^{B}$.
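As a concrete illustration, here is a minimal bagging sketch in Python (not from the lecture; it assumes NumPy arrays and scikit-learn's DecisionTreeRegressor as the base model, and the function names are my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_regressor(X, y, B=100, seed=0):
    """Train B deep regression trees on bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap: draw n indices with replacement
        tree = DecisionTreeRegressor()    # deep tree: low bias, high variance
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models

def predict_bagged(models, X):
    """Aggregate by averaging the B individual predictions (regression case)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```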
Summary of Lecture 6 (III/III)
Random forests = bagged trees, but with further variance reduction achieved by decorrelating the B ensemble members.

Specifically: for each split in each tree, only a random subset of q ≤ p inputs is considered as splitting variables.
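In scikit-learn, for instance, this random subset size is exposed as the max_features argument (a sketch with illustrative settings; q = √p is a common default for classification):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt": each split considers only q = sqrt(p) randomly
# chosen inputs, decorrelating the B = 100 bagged trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```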
Contents – Lecture 7
1. Boosting – the general idea
2. AdaBoost – the first practical boosting method
3. Robust loss functions
4. Gradient boosting
Boosting
Even a simple (classification or regression) model can typically capture some aspects of the input-output relationship.

Can we then learn an ensemble of “weak models”, each describing some part of this relationship, and combine these into one “strong model”?

Boosting:
• Sequentially learns an ensemble of weak models.
• Combines these into one strong model.
• General strategy – can in principle be used to improve any supervised learning algorithm.
• One of the most successful machine learning ideas!
Boosting
[Diagram: training data → Model 1 → modified data → Model 2 → modified data → Model 3 → … → Model B → boosted model.]
The models are built sequentially such that each model tries to correct the mistakes made by the previous one!
Binary classification
We will restrict our attention to binary classification.
• Class labels are −1 and 1, i.e. y ∈ {−1, 1}.
• We have access to some (weak) base classifier, e.g. a classification tree.

Note. Using labels −1 and 1 is mathematically convenient as it allows us to express a majority vote between B classifiers y^1(x), . . . , y^B(x) as

$$\mathrm{sign}\left(\sum_{b=1}^{B} y^b(x)\right) = \begin{cases} +1 & \text{if more plus-votes than minus-votes,} \\ -1 & \text{if more minus-votes than plus-votes.} \end{cases}$$
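As a quick sanity check, this sign-of-sum vote is one line in NumPy (a toy example with made-up predictions; ties cannot occur when B is odd):

```python
import numpy as np

votes = np.array([+1, -1, +1, +1, -1])  # y^1(x), ..., y^5(x), each in {-1, +1}
majority = np.sign(votes.sum())         # +1 here: three plus-votes, two minus-votes
```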
Boosting procedure (for classification)
Boosting procedure:
1. Assign weights w_i^1 = 1/n to all data points.
2. For b = 1 to B:
   (a) Train a weak classifier y^(b)(x) on the weighted training data {(x_i, y_i, w_i^b)}_{i=1}^n.
   (b) Update the weights {w_i^{b+1}}_{i=1}^n from {w_i^b}_{i=1}^n:
      i. Increase weights for all points misclassified by y^(b)(x).
      ii. Decrease weights for all points correctly classified by y^(b)(x).

The predictions of the B classifiers, y^(1)(x), . . . , y^(B)(x), are combined using a weighted majority vote:

$$y^B_{\text{boost}}(x) = \mathrm{sign}\left(\sum_{b=1}^{B} \alpha^{(b)} y^{(b)}(x)\right).$$
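The weighted vote itself is equally compact (a toy sketch; the confidences α^(b) are made up):

```python
import numpy as np

alphas = np.array([0.8, 0.3, 0.5])  # illustrative confidences alpha^(b)
preds  = np.array([+1, -1, +1])     # y^(1)(x), y^(2)(x), y^(3)(x)
y_boost = np.sign(alphas @ preds)   # weighted majority vote: sign(1.0) = +1
```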
ex) Boosting illustration
[Figure: panels in the (X1, X2)-plane illustrating boosting iterations b = 1, 2, 3, followed by the boosted model combining α^(1)y^(1)(x), α^(2)y^(2)(x), α^(3)y^(3)(x) into y_boost(x) = sign(∑_{b=1}^{3} α^(b) y^(b)(x)).]
The technical details...

Q1: How do we reweight the data?

Q2: How are the coefficients α^(1), . . . , α^(B) computed?
Exponential loss
Loss functions for a binary classifier y(x) = sign(c(x)).
[Plot: loss vs. margin y · c(x), comparing the exponential loss and the misclassification loss.]
Exponential loss function L(y, c(x)) = exp(−y · c(x)) plotted vs. the margin y · c(x). The misclassification loss I{y ≠ y(x)} = I{y · c(x) < 0} is plotted for comparison.
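Both losses are one-liners to compute (a minimal sketch, assuming NumPy; function names are mine):

```python
import numpy as np

def exponential_loss(y, c):
    """L(y, c) = exp(-y * c); heavily penalizes large negative margins."""
    return np.exp(-y * c)

def misclassification_loss(y, c):
    """I{y * c < 0}: 1 if misclassified, 0 otherwise."""
    return (y * c < 0).astype(float)

margins = np.linspace(-2, 2, 9)
# The exponential loss upper-bounds the misclassification loss everywhere.
print(exponential_loss(1.0, margins))
print(misclassification_loss(1.0, margins))
```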
AdaBoost pseudo-code
1. Assign weights w_i^1 = 1/n to all data points.
2. For b = 1 to B:
   (a) Train a weak classifier y^(b)(x) on the weighted training data {(x_i, y_i, w_i^b)}_{i=1}^n.
   (b) Update the weights {w_i^{b+1}}_{i=1}^n from {w_i^b}_{i=1}^n:
      i. Compute the weighted classification error $E^b_{\text{train}} = \sum_{i=1}^{n} w_i^b \, \mathbb{I}\{y_i \neq y^{(b)}(x_i)\}$.
      ii. Compute the classifier “confidence” $\alpha^{(b)} = 0.5 \log\big((1 - E^b_{\text{train}})/E^b_{\text{train}}\big)$.
      iii. Compute new weights $w_i^{b+1} = w_i^b \exp\big({-\alpha^{(b)} y_i y^{(b)}(x_i)}\big)$, i = 1, . . . , n.
      iv. Normalize: $w_i^{b+1} \leftarrow w_i^{b+1} / \sum_{j=1}^{n} w_j^{b+1}$, for i = 1, . . . , n.
3. Output $y^{(B)}_{\text{boost}}(x) = \mathrm{sign}\big(\sum_{b=1}^{B} \alpha^{(b)} y^{(b)}(x)\big)$.
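Below is a minimal NumPy/scikit-learn sketch of this pseudo-code (not from the lecture; decision stumps as weak classifiers, and the function names and the clipping of the error are my own choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, B=50):
    """AdaBoost with decision stumps. Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # step 1: uniform weights
    classifiers, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # step 2(a): weighted training
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))         # i. weighted classification error
        err = np.clip(err, 1e-10, 1 - 1e-10)  # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - err) / err) # ii. classifier "confidence"
        w = w * np.exp(-alpha * y * pred)     # iii. reweight the data points
        w = w / w.sum()                       # iv. normalize
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of predictions."""
    score = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(score)
```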
Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning (ICML). Bari, Italy, 1996.
Freund and Schapire were awarded the Gödel Prize for their work on AdaBoost.
Boosting vs. bagging
Bagging vs. boosting:
• Bagging learns base models in parallel; boosting learns base models sequentially.
• Bagging uses bootstrapped datasets; boosting uses reweighted datasets.
• Bagging does not overfit as B becomes large; boosting can overfit as B becomes large.
• Bagging reduces variance but not bias (requires deep trees as base models); boosting also reduces bias (works well with shallow trees).

N.B. Boosting does not require each base model to have low bias. Thus, a shallow classification tree (say, 4–8 terminal nodes) or even a tree with a single split (2 terminal nodes, a “stump”) is often sufficient.
ex) The Viola-Jones face detector
The Viola-Jones face detector:
• Revolutionized computer face detection in 2001.
• The first real-time system; soon built into standard point-and-shoot cameras.
• Based on AdaBoost!
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Kauai, Hawaii, USA, 2001.
ex) The Viola-Jones face detector
Uses extremely simple (and fast!) base models:
1. Place a “mask” of type A, B, C or D on top of the image.
2. Compute Value = ∑(pixels in black area) − ∑(pixels in white area).
3. Classify as face or no face.

Base models are combined using AdaBoost, resulting in a fast and accurate face detector.
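A toy sketch of such a two-rectangle feature value (the mask geometry and the black/white assignment are illustrative assumptions; the real detector evaluates these via integral images for speed):

```python
import numpy as np

def haar_two_rect_value(img, r, c, h, w):
    """Value = sum(right half) - sum(left half) of an h-by-w patch at (r, c)."""
    left  = img[r:r+h, c:c+w//2].sum()
    right = img[r:r+h, c+w//2:c+w].sum()
    return right - left

img = np.arange(16.0).reshape(4, 4)          # toy 4x4 grayscale patch
print(haar_two_rect_value(img, 0, 0, 4, 4))  # prints 16.0
```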
Robust loss functions
Why exponential loss?
• Main reason – computational simplicity (cf. squared loss in linear regression).

However, models using exponential loss are sensitive to “outliers”, e.g. mislabeled or noisy data.
Robust loss functions
Instead of exponential loss we can use a robust loss function with a more gentle slope for negative margins.

[Plot: loss vs. margin y · c(x), comparing the exponential loss, binomial deviance, Huber-like loss, and misclassification loss.]
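The contrast is easy to see numerically (a sketch; this uses one common parameterization of the binomial deviance, log(1 + exp(−m)), which may differ by constants from the lecture's plot):

```python
import numpy as np

def exponential_loss(m):
    """m = y * c(x), the margin; blows up exponentially for negative margins."""
    return np.exp(-m)

def binomial_deviance(m):
    """Grows only linearly for large negative margins: more robust to outliers."""
    return np.log(1.0 + np.exp(-m))

margins = np.array([-3.0, -1.0, 0.0, 1.0])
print(exponential_loss(margins))   # ~ [20.09  2.72  1.00  0.37]
print(binomial_deviance(margins))  # ~ [ 3.05  1.31  0.69  0.31]
```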
Gradient boosting
No free lunch! The optimization problem
$$(\alpha^{(b)}, y^{(b)}) = \arg\min_{(\alpha, y)} \sum_{i=1}^{n} L\big(y_i,\, c^{(b-1)}(x_i) + \alpha\, y(x_i)\big)$$

lacks a closed-form solution for a general loss function L(y, c(x)).
Gradient boosting methods address this by using techniques from numerical optimization.
Gradient descent
Consider the generic optimization problem minθ J(θ). A numerical method for solving this problem is gradient descent.
Initialize θ(0) and iterate:
$$\theta^{(b)} = \theta^{(b-1)} - \gamma^{(b)} \nabla_\theta J(\theta^{(b-1)}),$$

where γ^(b) is the step size at the bth iteration.
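A tiny self-contained example of this iteration (the quadratic objective and fixed step size are mine, purely for illustration):

```python
def gradient_descent(grad, theta0, gamma=0.1, iters=100):
    """Generic gradient descent: theta <- theta - gamma * grad(theta)."""
    theta = theta0
    for _ in range(iters):
        theta = theta - gamma * grad(theta)
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))  # ~ 3.0
```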
This can be compared with boosting, which sequentially constructs an ensemble of models as

$$c^{(b)}(x) = c^{(b-1)}(x) + \alpha^{(b)} y^{(b)}(x)$$

for some base model y^(b)(x) and “step size” α^(b).
Gradient boosting
Gradient boosting mimics the gradient descent algorithm by fitting the bth base model to the negative gradient of the loss function,

$$g_i^{(b)} = -\left[\frac{\partial L(y_i, c)}{\partial c}\right]_{c = c^{(b-1)}(x_i)}, \quad i = 1, \ldots, n.$$

That is, y^(b)(x) is learned using {(x_i, g_i^(b))}_{i=1}^n as training data.
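A minimal sketch of this idea for squared-error loss, where the negative gradient is simply the residual y_i − c^(b−1)(x_i) (regression trees as base models; the fixed step size α and function names are my own simplifications, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_train(X, y, B=100, alpha=0.1):
    """Gradient boosting for L(y, c) = (y - c)^2 / 2, so -dL/dc = y - c."""
    c = np.zeros(len(y))                 # initial model c^(0)(x) = 0
    trees = []
    for _ in range(B):
        g = y - c                        # negative gradient = residuals
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, g)                   # fit base model to pseudo-residuals
        c = c + alpha * tree.predict(X)  # c^(b) = c^(b-1) + alpha * y^(b)
        trees.append(tree)
    return trees

def gradient_boost_predict(trees, X, alpha=0.1):
    """Sum the scaled base models (use the same alpha as in training)."""
    return alpha * sum(t.predict(X) for t in trees)
```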
Gradient boosting (with some additional bells and whistles) is used by the very popular libraries XGBoost and LightGBM:
xgboost.readthedocs.io/lightgbm.readthedocs.io/
Classification methods (covered so far)
So far in the course we have discussed the following methods for classification. . .
Parametric classifiers
1. Logistic regression – linear
2. Linear discriminant analysis – linear
3. Quadratic discriminant analysis – nonlinear

Nonparametric classifiers
4. k-nearest neighbors
5. Classification trees

Ensemble-based methods
6. Bagging (bagged trees / random forests)
7. Boosting (AdaBoost)
A few concepts to summarize lecture 7
Boosting: Sequential ensemble method, where each consecutive model tries to correct the mistakes of the previous one.
AdaBoost: The first successful boosting algorithm. Designed for binary classification.
Exponential loss: The classification loss function used by AdaBoost.
Gradient boosting: A boosting procedure that can be used with any differentiable loss function.