CS 273A: Machine Learning, Winter 2021
Lecture 13: Ensemble Methods
Roy Fox, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
All slides in this course adapted from Alex Ihler & Sameer Singh
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Adding features
• So far: linear SVMs, not very expressive
‣ add features: $x \mapsto \Phi(x)$
• Linearly non-separable in $x_1$
• Linearly separable in quadratic features, e.g. $x_2 = x_1^2$
[Figure: the same 1D data, plotted on the $x_1$ axis and in the $(x_1, x_2 = x_1^2)$ plane]
Adding features
• Prediction: $y(x) = \mathrm{sign}(w \cdot \Phi(x) + b)$
• Dual problem:
$\max_{0 \le \lambda \le R} \sum_j \Big( \lambda_j - \tfrac{1}{2} \sum_k \lambda_j \lambda_k y^{(j)} y^{(k)} \, \Phi(x^{(j)}) \cdot \Phi(x^{(k)}) \Big) \quad \text{s.t.} \quad \sum_j \lambda_j y^{(j)} = 0$
• Example: quadratic features $\Phi(x) = [1 \;\; \sqrt{2}\,x_i \;\; x_i^2 \;\; \sqrt{2}\,x_i x_{i'}]$
‣ $n$ features $\mapsto O(n^2)$ features
‣ Why $\sqrt{2}$? Next slide... it just scales the corresponding weights
Implicit features
• For the dual problem, we need $K_{jk} = \Phi(x^{(j)}) \cdot \Phi(x^{(k)})$
• Kernel trick: with $\Phi(x) = [1 \;\; \sqrt{2}\,x_i \;\; x_i^2 \;\; \sqrt{2}\,x_i x_{i'}]$:
$K_{jk} = 1 + \sum_i 2 x_i^{(j)} x_i^{(k)} + \sum_i (x_i^{(j)} x_i^{(k)})^2 + \sum_{i<i'} 2 (x_i^{(j)} x_i^{(k)})(x_{i'}^{(j)} x_{i'}^{(k)}) = \Big(1 + \sum_i x_i^{(j)} x_i^{(k)}\Big)^2$
‣ Each of the $m^2$ elements is computed in $O(n)$ time (instead of $O(n^2)$), as the sketch below verifies
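To make the bookkeeping concrete, here is a minimal NumPy sketch (the helper names `quad_features` and `quad_kernel` are ours) checking that the explicit $O(n^2)$-dimensional feature map and the $O(n)$ kernel agree:

```python
import numpy as np

def quad_features(x):
    """Explicit quadratic map: [1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_i' for i < i']."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * v for v in x]
    feats += [v ** 2 for v in x]
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

def quad_kernel(x, z):
    """Kernel trick: K(x, z) = (1 + x . z)^2, computed in O(n) time."""
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
# The O(n^2)-dimensional dot product and the O(n) kernel agree:
assert np.isclose(quad_features(x) @ quad_features(z), quad_kernel(x, z))
```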
Mercer's Theorem
• Reminder: positive semidefinite matrix $A \succeq 0$: $v^\intercal A v \ge 0$ for all vectors $v$
• Positive semidefinite kernel $K \succeq 0$: the Gram matrix $[K(x^{(j)}, x^{(k)})]_{jk} \succeq 0$ for all datasets
• Mercer's Theorem: $K \succeq 0 \implies K(x, x') = \Phi(x) \cdot \Phi(x')$ for some $\Phi$
• $\Phi(x)$ may be hard to calculate
‣ May even be infinite-dimensional (Hilbert space)
‣ Not an issue, only the kernel $K(x, x')$ should be easy to compute ($O(m^2)$ times)
Common kernel functions
• Polynomial: $K(x, x') = (1 + x \cdot x')^d$
• Radial Basis Functions (RBF): $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
• Saturating: $K(x, x') = \tanh(a\, x \cdot x' + c)$
• Domain-specific: textual similarity, genetic code similarity, ...
‣ May not be positive semidefinite, yet still work well in practice
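A minimal sketch of these three kernels for a pair of vectors, assuming NumPy (the function names and default parameter values are ours):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel: (1 + x . z)^d."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(x, z, a=1.0, c=0.0):
    """Saturating kernel: tanh(a x . z + c); not PSD in general."""
    return np.tanh(a * (x @ z) + c)
```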
Kernel SVMs
• Define kernel $K : (x, x') \mapsto \mathbb{R}$
• Solve dual QP:
$\max_{0 \le \lambda \le R} \sum_j \Big( \lambda_j - \tfrac{1}{2} \sum_k \lambda_j \lambda_k y^{(j)} y^{(k)} K(x^{(j)}, x^{(k)}) \Big) \quad \text{s.t.} \quad \sum_j \lambda_j y^{(j)} = 0$
• Learned parameters = $\lambda$ ($m$ parameters)
‣ But also need to store all support vectors (those having $\lambda_j > 0$)
• Prediction: $y(x) = \mathrm{sign}(w \cdot \Phi(x)) = \mathrm{sign}\Big(\sum_j \lambda_j y^{(j)} \Phi(x^{(j)}) \cdot \Phi(x)\Big) = \mathrm{sign}\Big(\sum_j \lambda_j y^{(j)} K(x^{(j)}, x)\Big)$
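The prediction rule touches the data only through $K$; a sketch of it, assuming the duals $\lambda_j$ and the support vectors are already available (e.g., from a QP solver):

```python
import numpy as np

def kernel_svm_predict(X_sv, y_sv, lam, kernel, x):
    """y(x) = sign(sum_j lam_j y^(j) K(x^(j), x)) over the support vectors."""
    score = sum(l * yj * kernel(xj, x) for l, yj, xj in zip(lam, y_sv, X_sv))
    return np.sign(score)
```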
Demo
• https://cs.stanford.edu/people/karpathy/svmjs/demo/
Linear vs. kernel SVMs
• Linear SVMs
‣ $y = \mathrm{sign}(w \cdot x + b) \implies n + 1$ parameters
‣ Alternatively: represent by the indexes of the SVs; usually #SVs = #parameters
• Kernel SVMs
‣ $K(x, x')$ may correspond to a high- (possibly infinite-) dimensional $\Phi(x)$
‣ Typically more efficient to store the SVs $x^{(j)}$ (not $\Phi(x^{(j)})$)
- And their corresponding $\lambda_j$
Recap
• Maximize margin for separable data
‣ Primal QP: minimize $\|w\|^2$ subject to linear constraints
‣ Dual QP: $m$ variables, $m^2$ dot products
• Soft margin for non-separable data
‣ Primal problem: regularized hinge loss
‣ Dual problem: $m$-dimensional QP
• Kernel Machines
‣ Dual form involves only pairwise similarity $K(x^{(j)}, x^{(k)})$
‣ Mercer kernels: equivalent to dot products in an implicit high-dimensional space
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Bias vs. variance
• Imagine 3 universes → 3 datasets
• A simple model:
‣ Poor prediction (on average across universes)
- High bias
‣ Doesn't vary much between universes
- Low variance
• A complex model:
- Low bias
- High variance
[Figure: "all data" vs. "observed data" in each universe]
Averaging across datasets
• What if we could reach out across universes?
‣ Average models for different datasets $\mathcal{D}_1, \dots, \mathcal{D}_K$
‣ For classification: majority vote of the different models
• Same bias
• Lower variance
• But we only have our training set $\mathcal{D}$
‣ Idea: resample $\mathcal{D}_1, \dots, \mathcal{D}_K$ from $\mathcal{D}$
- Average models trained for each $\mathcal{D}_k$
Bootstrap
• Resampling = any method that samples a new dataset from the training set:
$\mathcal{D}' = \{(x^{(j_1)}, y^{(j_1)}), \dots, (x^{(j_b)}, y^{(j_b)})\}, \quad j_1, \dots, j_b \sim U(1, \dots, m)$
‣ Subsampling = resampling without replacement (choose a subset)
‣ Bootstrap = resampling with replacement (may repeat the same datapoint)
- Preferred for theory that is less sensitive to a good choice of $b$
- But has higher variance
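A one-function bootstrap sketch, assuming NumPy (names are ours):

```python
import numpy as np

def bootstrap_sample(X, y, b, rng):
    """Draw b indices uniformly with replacement: j_1, ..., j_b ~ U(1, ..., m)."""
    idx = rng.integers(0, len(X), size=b)
    return X[idx], y[idx]
```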
Bagging
• Bagging = bootstrap aggregating:
‣ Resample $K$ datasets $\mathcal{D}_1, \dots, \mathcal{D}_K$ of size $b$
‣ Train models $\theta_1, \dots, \theta_K$ on each dataset
‣ Regression: output $f_\theta : x \mapsto \frac{1}{K} \sum_k f_{\theta_k}(x)$
‣ Classification: output $f_\theta : x \mapsto \mathrm{majority}\{f_{\theta_k}(x)\}$, as in the sketch below
• Similar to cross-validation (though for a different purpose), but outputs the average model
‣ Also, datasets are resampled (with replacement), not a partition
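A bagging sketch for binary classification, assuming scikit-learn decision trees as the base model and labels in $\{-1, +1\}$ (so that majority vote is the sign of the summed predictions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, seed=0):
    """Train K trees, each on a bootstrap resample of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap with b = m
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the ensemble (ties map to 0)."""
    return np.sign(np.sum([m.predict(X) for m in models], axis=0))
```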
Bagging: properties
• Each model is trained on less data
‣ More bias
‣ More variance
‣ Replacement also adds variance (repetitions throw off the data distribution)
• Models are averaged
‣ Doesn't affect bias (defined as the average over models)
‣ Variance reduced a lot (roughly as $\frac{1}{K}$, under some conditions)
• More bias, less variance ⟹ less overfitting = simpler model, in a sense
Bagged decision trees
• A model badly in need of complexity reduction: decision trees
‣ Very low bias, very high variance
• Randomly resample data
• Train a decision tree on each sample; no max depth
‣ Still low bias, high variance
• Average / majority decision over models
Bagged decision trees
• The average model can't just "memorize" the training data
‣ Each data point is only seen by a few models
‣ Hopefully still predicted well by the majority of the other models
[Figure: full training dataset; majority of 5 trees; majority of 25 trees; majority of 100 trees]
Ensemble methods
• Ensemble = "committee" of models: $y_k(x) = f_{\theta_k}(x)$
‣ Decisions made by average / majority vote: $y(x) = \frac{1}{K} \sum_k y_k(x)$
‣ May be weighted: better model = higher weight: $y(x) = \sum_k \alpha_k y_k(x)$
• Stacking = use the ensemble outputs as inputs (as in MLP): $y(x) = f_\theta(y_1(x), \dots, y_K(x))$
‣ $f_\theta$ trained on held-out data = validation of which model should be trusted
‣ $f_\theta$ linear ⟹ weighted committee, with learned weights
Mixture of Experts (MoE)
• Experts = models that can "specialize", good only for some instances
‣ Let weights depend on $x$: $y(x) = \sum_k \alpha_k(x) y_k(x)$
• Can we predict which model will perform well?
‣ Learn a predictor $\alpha_\phi(k \,|\, x)$
- E.g., multilogistic regression (softmax): $\alpha_\phi(k \,|\, x) = \frac{\exp(\phi_k \cdot x)}{\sum_{k'} \exp(\phi_{k'} \cdot x)}$
• Loss, experts, weights differentiable ⟹ end-to-end gradient-based learning
[Figure: mixture of 3 linear predictors]
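A sketch of the MoE forward pass with the softmax gate above; the parameter shapes and names are our assumptions:

```python
import numpy as np

def moe_predict(Phi, experts, x):
    """y(x) = sum_k alpha_phi(k|x) y_k(x), with a softmax gate over x.
    Phi: (K, n) gating parameters; experts: K callables x -> y_k(x)."""
    logits = Phi @ x
    alpha = np.exp(logits - logits.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return sum(a * f(x) for a, f in zip(alpha, experts))
```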
Random Forests
• Bagging over decision trees: which feature at the root?
‣ Much data ⟹ max info gain stable across data samples
‣ Little diversity among models ⟹ little gained from the ensemble
• Random Forests = subsample features
‣ Each tree is only allowed to use a subset of the features
‣ Still low, but higher bias
‣ Average over trees for lower variance
• Works very well in practice ⟹ go-to algorithm for small ML tasks
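In practice one rarely builds this by hand; a usage sketch with scikit-learn (note that `RandomForestClassifier` subsamples features per split via `max_features`, rather than one subset per tree; the toy dataset is just for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
# 100 trees, each split restricted to a random subset of the features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # training accuracy
```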
Recap
• Ensembles = collections of predictors
‣ Combine predictions to improve performance
• Bagging = bootstrap aggregation
‣ Reduces model class complexity to mitigate overfitting
‣ Resample the data many times (with replacement)
- For each resample, train a model
‣ More bias but less variance
‣ Also more compute, both at training time and at test time
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Growing ensembles
• Ensemble = collection of models: $y(x) = \sum_k f_k(x)$
‣ Models should "cover" for each other
• If we could add a model $f_{K+1}$ to a given ensemble $\hat{y}$, what would we add?
$\mathcal{L}(y, \hat{y}') = \mathcal{L}(y, \hat{y} + f_{K+1}(x))$
• Let's find the $f_{K+1}(x)$ that minimizes this loss
‣ If we could do this well, we'd be done in one step
‣ Instead, let's do it badly but many times → gradually improve
Boosting
• Question: can we create a strong learner from many weak learners?
‣ Weak learner = underfits, but fast and simple (e.g., decision stump, perceptron)
‣ Strong learner = performs well but increasingly complex
• Boosting: focus new learners on instances that current ensemble gets wrong
‣ Train new learner
‣ Measure errors
‣ Re-weight data points to emphasize large residuals
‣ Repeat
Example: MSE loss
• Ensemble: $\hat{y}_K = \sum_k f_k(x)$; MSE loss: $\mathcal{L}(y, \hat{y}_k) = \frac{1}{2}(y - \hat{y}_{k-1} - f_k(x))^2$
‣ To minimize: have $f_k(x)$ try to predict the residual $y - \hat{y}_{k-1}$
‣ Then update $\hat{y}_k = \hat{y}_{k-1} + f_k(x)$
[Figure: data and prediction; residual and weak model; increasingly accurate, increasingly complex]
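A sketch of this residual-fitting loop with regression stumps as the weak model, assuming scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_mse(X, y, K=50):
    """Each weak model f_k fits the current residual y - yhat_{k-1}."""
    yhat = np.zeros(len(y))
    models = []
    for _ in range(K):
        f = DecisionTreeRegressor(max_depth=1).fit(X, y - yhat)  # weak model
        yhat = yhat + f.predict(X)                               # yhat_k = yhat_{k-1} + f_k(x)
        models.append(f)
    return models
```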
Gradient Boosting
• More generally: pseudo-residuals $r_k^{(j)} = -\partial_{\hat{y}} \mathcal{L}(y^{(j)}, \hat{y}) \big|_{\hat{y} = \hat{y}_{k-1}^{(j)}}$
‣ $r_k^{(j)}$ = steepest descent of the loss in "prediction space" (how $\hat{y}_{k-1}^{(j)}$ should change)
‣ For MSE loss: $r_k^{(j)} = y^{(j)} - \hat{y}_{k-1}^{(j)}$, as before
• Gradient Boosting:
‣ Learn a weak model $f_k : x^{(j)} \mapsto r_k^{(j)}$ to predict the pseudo-residuals
‣ Find the best step size (line search): $\alpha_k = \arg\min_\alpha \frac{1}{m} \sum_j \mathcal{L}\big(y^{(j)}, \hat{y}_{k-1}^{(j)} + \alpha f_k(x^{(j)})\big)$
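A generic sketch of the loop, with the loss and its gradient passed in and a simple grid search standing in for the exact line search (all names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, loss_grad, K=50):
    """Gradient boosting sketch. loss(y, p) and loss_grad(y, p) act elementwise."""
    yhat = np.zeros(len(y))
    ensemble = []
    for _ in range(K):
        r = -loss_grad(y, yhat)                           # pseudo-residuals
        f = DecisionTreeRegressor(max_depth=2).fit(X, r)  # weak model
        fx = f.predict(X)
        grid = np.linspace(0.01, 1.0, 100)                # crude line search
        alpha = grid[np.argmin([loss(y, yhat + a * fx).mean() for a in grid])]
        yhat += alpha * fx
        ensemble.append((alpha, f))
    return ensemble

# For MSE: loss = lambda y, p: 0.5 * (y - p) ** 2; loss_grad = lambda y, p: p - y
```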
Demo
• http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Growing ensembles
• Ensemble = collection of models: $y(x) = \sum_k \alpha_k f_k(x)$
‣ Models should "cover" for each other
• If we could add a model to a given ensemble, what would we add?
$\mathcal{L}(y, \hat{y}_k) = \mathcal{L}(y, \hat{y}_{k-1} + \alpha_k f_k(x))$
• Let's find the $\alpha_k$, $f_k(x)$ that minimize this loss
‣ If we could do this well, we'd be done in one step
‣ Instead, let's do it badly but many times → gradually improve
Example: exponential loss
• Exponential loss: $\mathcal{L}(y, \hat{y}) = e^{-y \hat{y}}$
‣ Optimal $\hat{y}(x)$: $\arg\min_{\hat{y}} \mathbb{E}_{y|x}[\mathcal{L}(y, \hat{y})] = \frac{1}{2} \ln \frac{p(y = +1 \,|\, x)}{p(y = -1 \,|\, x)}$ (proof by derivative)
‣ If we can minimize the loss ⟹ $\mathrm{sign}(\hat{y})$ is the more likely label
• Let's find the model $f_k : x \mapsto \{+1, -1\}$ that minimizes
$\sum_j \mathcal{L}(y^{(j)}, \hat{y}_k^{(j)}) = \sum_j \mathcal{L}\big(y^{(j)}, \hat{y}_{k-1}^{(j)} + \alpha_k f_k(x^{(j)})\big) = \sum_j \underbrace{e^{-y^{(j)} \hat{y}_{k-1}^{(j)}}}_{w_{k-1}^{(j)}} e^{-y^{(j)} \alpha_k f_k(x^{(j)})}$
‣ The last factor is $e^{-\alpha_k}$ when $f_k$ is correct and $e^{\alpha_k}$ when it errs; splitting the sum accordingly:
$= (e^{\alpha_k} - e^{-\alpha_k}) \sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})] + e^{-\alpha_k} \underbrace{\sum_j w_{k-1}^{(j)}}_{\text{independent of } f_k}$
Minimizing weighted loss
• So far, we minimized the average loss: $\frac{1}{m} \sum_j \mathcal{L}(y^{(j)}, \hat{y}^{(j)})$
• We can also minimize a weighted loss: $\sum_j w^{(j)} \mathcal{L}(y^{(j)}, \hat{y}^{(j)})$
‣ Every data point "counts" as $w^{(j)}$
‣ E.g., in decision trees, a weighted info gain is obtained with $p(y = c) \propto \sum_{j : y^{(j)} = c} w^{(j)}$
• In our current case, the weighted 0–1 loss: $\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]$
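For completeness, the weighted 0–1 loss in code; scikit-learn's tree learners accept these weights directly via `sample_weight`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_01_loss(w, y, pred):
    """Weighted 0-1 loss: sum_j w^(j) * delta[y^(j) != pred^(j)]."""
    return np.sum(w * (y != pred))

# Weighted training, e.g. a stump that minimizes a weighted criterion:
# DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
```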
Boosting with exponential loss (cont.)
• The best classifier to add to the ensemble minimizes the weighted 0–1 loss:
$\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]$, with $w_{k-1}^{(j)} = e^{-y^{(j)} \hat{y}_{k-1}^{(j)}}$
• It gives weighted error rate $\epsilon_k = \frac{\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]}{\sum_j w_{k-1}^{(j)}}$
• Plugging into the loss and solving: $\alpha_k = \frac{1}{2} \ln \frac{1 - \epsilon_k}{\epsilon_k}$
• Now add the model and update the ensemble: $\hat{y}_k(x) = \hat{y}_{k-1}(x) + \alpha_k f_k(x)$
AdaBoost
• AdaBoost = adaptive boosting (full sketch below):
‣ Initialize $w_0^{(j)} = \frac{1}{m}$
‣ Train classifier $f_k$ on the training data with weights $w_{k-1}$
‣ Compute the weighted error rate $\epsilon_k = \frac{\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]}{\sum_j w_{k-1}^{(j)}}$
‣ Compute $\alpha_k = \frac{1}{2} \ln \frac{1 - \epsilon_k}{\epsilon_k}$
‣ Update weights $w_k^{(j)} = w_{k-1}^{(j)} e^{-y^{(j)} \alpha_k f_k(x^{(j)})}$ (increase weight for misclassified points)
• Predict $y(x) = \mathrm{sign} \sum_k \alpha_k f_k(x)$
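Putting the steps together, a compact AdaBoost sketch with decision stumps, assuming scikit-learn, labels in $\{-1, +1\}$, and no guard against $\epsilon_k \in \{0, 1\}$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=50):
    """AdaBoost with stumps; y in {-1, +1}. Returns [(alpha_k, f_k), ...]."""
    m = len(y)
    w = np.full(m, 1.0 / m)                                 # w_0^(j) = 1/m
    ensemble = []
    for _ in range(K):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f.predict(X)
        eps = w[pred != y].sum() / w.sum()                  # weighted error rate
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w = w * np.exp(-y * alpha * pred)                   # up-weight mistakes
        ensemble.append((alpha, f))
    return ensemble

def adaboost_predict(ensemble, X):
    """y(x) = sign(sum_k alpha_k f_k(x))."""
    return np.sign(sum(a * f.predict(X) for a, f in ensemble))
```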
Recap
• Ensembles = collections of predictors
‣ Combine predictions to improve performance
• Boosting: Gradient Boost, AdaBoost, ...
‣ Build strong predictor from many weak ones
‣ Train sequentially; later predictors focus on the mistakes of earlier ones
- Weight “hard” examples more