CS 273A: Machine Learning, Winter 2021
Lecture 13: Ensemble Methods
Roy Fox, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
All slides in this course adapted from Alex Ihler & Sameer Singh
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Adding features
• So far: linear SVMs, not very expressive
‣ add features: $x \mapsto \Phi(x)$
• Linearly non-separable in $x_1$
• Linearly separable in quadratic features, e.g. $x_2 = x_1^2$
[Figure: the same 1D data, plotted on the $x_1$ axis and in the $(x_1, x_2 = x_1^2)$ plane]
Adding features
• Prediction: $y(x) = \mathrm{sign}(w \cdot \Phi(x) + b)$
• Dual problem:
$\max_{0 \le \lambda \le R} \sum_j \Big( \lambda_j - \tfrac{1}{2} \sum_k \lambda_j \lambda_k y^{(j)} y^{(k)} \, \Phi(x^{(j)}) \cdot \Phi(x^{(k)}) \Big) \quad \text{s.t.} \quad \sum_j \lambda_j y^{(j)} = 0$
• Example: quadratic features $\Phi(x) = [1 \;\; \sqrt{2}\,x_i \;\; x_i^2 \;\; \sqrt{2}\,x_i x_{i'}]$
‣ $n$ features $\mapsto O(n^2)$ features
‣ Why $\sqrt{2}$? Next slide... it just scales the corresponding weights
Implicit features
• For the dual problem, we need $K_{jk} = \Phi(x^{(j)}) \cdot \Phi(x^{(k)})$
• Kernel trick: with $\Phi(x) = [1 \;\; \sqrt{2}\,x_i \;\; x_i^2 \;\; \sqrt{2}\,x_i x_{i'}]$:
$K_{jk} = 1 + \sum_i 2 x_i^{(j)} x_i^{(k)} + \sum_i (x_i^{(j)} x_i^{(k)})^2 + \sum_{i<i'} 2 (x_i^{(j)} x_i^{(k)})(x_{i'}^{(j)} x_{i'}^{(k)}) = \Big(1 + \sum_i x_i^{(j)} x_i^{(k)}\Big)^2$
‣ Each of the $m^2$ elements is computed in $O(n)$ time (instead of $O(n^2)$), as the sketch below verifies
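To make the bookkeeping concrete, here is a minimal NumPy sketch (the helper names `quad_features` and `quad_kernel` are ours) checking that the explicit $O(n^2)$-dimensional feature map and the $O(n)$ kernel agree:

```python
import numpy as np

def quad_features(x):
    """Explicit quadratic map: [1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_i' for i < i']."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * v for v in x]
    feats += [v ** 2 for v in x]
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

def quad_kernel(x, z):
    """Kernel trick: K(x, z) = (1 + x . z)^2, computed in O(n) time."""
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
# The O(n^2)-dimensional dot product and the O(n) kernel agree:
assert np.isclose(quad_features(x) @ quad_features(z), quad_kernel(x, z))
```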
Mercer's Theorem
• Reminder: positive semidefinite matrix $A \succeq 0$: $v^\intercal A v \ge 0$ for all vectors $v$
• Positive semidefinite kernel $K \succeq 0$: the Gram matrix $[K(x^{(j)}, x^{(k)})]_{jk} \succeq 0$ for all datasets
• Mercer's Theorem: $K \succeq 0 \implies K(x, x') = \Phi(x) \cdot \Phi(x')$ for some $\Phi$
• $\Phi(x)$ may be hard to calculate
‣ May even be infinite-dimensional (Hilbert space)
‣ Not an issue, only the kernel $K(x, x')$ should be easy to compute ($O(m^2)$ times)
Common kernel functions
• Polynomial: $K(x, x') = (1 + x \cdot x')^d$
• Radial Basis Functions (RBF): $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
• Saturating: $K(x, x') = \tanh(a\, x \cdot x' + c)$
• Domain-specific: textual similarity, genetic code similarity, ...
‣ May not be positive semidefinite, yet still work well in practice
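A minimal sketch of these three kernels for a pair of vectors, assuming NumPy (the function names and default parameter values are ours):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel: (1 + x . z)^d."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(x, z, a=1.0, c=0.0):
    """Saturating kernel: tanh(a x . z + c); not PSD in general."""
    return np.tanh(a * (x @ z) + c)
```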
Kernel SVMs
• Define kernel $K : (x, x') \mapsto \mathbb{R}$
• Solve dual QP:
$\max_{0 \le \lambda \le R} \sum_j \Big( \lambda_j - \tfrac{1}{2} \sum_k \lambda_j \lambda_k y^{(j)} y^{(k)} K(x^{(j)}, x^{(k)}) \Big) \quad \text{s.t.} \quad \sum_j \lambda_j y^{(j)} = 0$
• Learned parameters = $\lambda$ ($m$ parameters)
‣ But also need to store all support vectors (those having $\lambda_j > 0$)
• Prediction: $y(x) = \mathrm{sign}(w \cdot \Phi(x)) = \mathrm{sign}\Big(\sum_j \lambda_j y^{(j)} \Phi(x^{(j)}) \cdot \Phi(x)\Big) = \mathrm{sign}\Big(\sum_j \lambda_j y^{(j)} K(x^{(j)}, x)\Big)$
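The prediction rule touches the data only through $K$; a sketch of it, assuming the duals $\lambda_j$ and the support vectors are already available (e.g., from a QP solver):

```python
import numpy as np

def kernel_svm_predict(X_sv, y_sv, lam, kernel, x):
    """y(x) = sign(sum_j lam_j y^(j) K(x^(j), x)) over the support vectors."""
    score = sum(l * yj * kernel(xj, x) for l, yj, xj in zip(lam, y_sv, X_sv))
    return np.sign(score)
```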
Demo
• https://cs.stanford.edu/people/karpathy/svmjs/demo/
Linear vs. kernel SVMs
• Linear SVMs
‣ $y = \mathrm{sign}(w \cdot x + b) \implies n + 1$ parameters
‣ Alternatively: represent by the indexes of the SVs; usually #SVs = #parameters
• Kernel SVMs
‣ $K(x, x')$ may correspond to a high- (possibly infinite-) dimensional $\Phi(x)$
‣ Typically more efficient to store the SVs $x^{(j)}$ (not $\Phi(x^{(j)})$)
- And their corresponding $\lambda_j$
Recap
• Maximize margin for separable data
‣ Primal QP: minimize $\|w\|^2$ subject to linear constraints
‣ Dual QP: $m$ variables, $m^2$ dot products
• Soft margin for non-separable data
‣ Primal problem: regularized hinge loss
‣ Dual problem: $m$-dimensional QP
• Kernel Machines
‣ Dual form involves only pairwise similarity $K(x^{(j)}, x^{(k)})$
‣ Mercer kernels: equivalent to dot products in an implicit high-dimensional space
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Bias vs. variance
• Imagine 3 universes → 3 datasets
• A simple model:
‣ Poor prediction (on average across universes)
- High bias
‣ Doesn't vary much between universes
- Low variance
• A complex model:
- Low bias
- High variance
[Figure: "all data" vs. "observed data" in each universe]
Averaging across datasets
• What if we could reach out across universes?
‣ Average models for different datasets $\mathcal{D}_1, \dots, \mathcal{D}_K$
‣ For classification: majority vote of the different models
• Same bias
• Lower variance
• But we only have our training set $\mathcal{D}$
‣ Idea: resample $\mathcal{D}_1, \dots, \mathcal{D}_K$ from $\mathcal{D}$
- Average models trained for each $\mathcal{D}_k$
Bootstrap
• Resampling = any method that samples a new dataset from the training set:
$\mathcal{D}' = \{(x^{(j_1)}, y^{(j_1)}), \dots, (x^{(j_b)}, y^{(j_b)})\}, \quad j_1, \dots, j_b \sim U(1, \dots, m)$
‣ Subsampling = resampling without replacement (choose a subset)
‣ Bootstrap = resampling with replacement (may repeat the same datapoint)
- Preferred for theory that is less sensitive to a good choice of $b$
- But has higher variance
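A one-function bootstrap sketch, assuming NumPy (names are ours):

```python
import numpy as np

def bootstrap_sample(X, y, b, rng):
    """Draw b indices uniformly with replacement: j_1, ..., j_b ~ U(1, ..., m)."""
    idx = rng.integers(0, len(X), size=b)
    return X[idx], y[idx]
```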
Bagging
• Bagging = bootstrap aggregating:
‣ Resample $K$ datasets $\mathcal{D}_1, \dots, \mathcal{D}_K$ of size $b$
‣ Train models $\theta_1, \dots, \theta_K$ on each dataset
‣ Regression: output $f_\theta : x \mapsto \frac{1}{K} \sum_k f_{\theta_k}(x)$
‣ Classification: output $f_\theta : x \mapsto \mathrm{majority}\{f_{\theta_k}(x)\}$, as in the sketch below
• Similar to cross-validation (though for a different purpose), but outputs the average model
‣ Also, datasets are resampled (with replacement), not a partition
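A bagging sketch for binary classification, assuming scikit-learn decision trees as the base model and labels in $\{-1, +1\}$ (so that majority vote is the sign of the summed predictions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, seed=0):
    """Train K trees, each on a bootstrap resample of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap with b = m
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the ensemble (ties map to 0)."""
    return np.sign(np.sum([m.predict(X) for m in models], axis=0))
```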
Bagging: properties
• Each model is trained on less data
‣ More bias
‣ More variance
‣ Replacement also adds variance (repetitions throw off the data distribution)
• Models are averaged
‣ Doesn't affect bias (defined as the average over models)
‣ Variance reduced a lot (roughly as $\frac{1}{K}$, under some conditions)
• More bias, less variance ⟹ less overfitting = simpler model, in a sense
Bagged decision trees
• A model badly in need of complexity reduction: decision trees
‣ Very low bias, very high variance
• Randomly resample data
• Train a decision tree on each sample; no max depth
‣ Still low bias, high variance
• Average / majority decision over models
Bagged decision trees
• The average model can't just "memorize" the training data
‣ Each data point is only seen by a few models
‣ Hopefully still predicted well by the majority of the other models
[Figure: full training dataset; majority of 5 trees; majority of 25 trees; majority of 100 trees]
Ensemble methods
• Ensemble = "committee" of models: $y_k(x) = f_{\theta_k}(x)$
‣ Decisions made by average / majority vote: $y(x) = \frac{1}{K} \sum_k y_k(x)$
‣ May be weighted: better model = higher weight: $y(x) = \sum_k \alpha_k y_k(x)$
• Stacking = use the ensemble outputs as inputs (as in MLP): $y(x) = f_\theta(y_1(x), \dots, y_K(x))$
‣ $f_\theta$ trained on held-out data = validation of which model should be trusted
‣ $f_\theta$ linear ⟹ weighted committee, with learned weights
Mixture of Experts (MoE)
• Experts = models that can "specialize", good only for some instances
‣ Let weights depend on $x$: $y(x) = \sum_k \alpha_k(x) y_k(x)$
• Can we predict which model will perform well?
‣ Learn a predictor $\alpha_\phi(k \,|\, x)$
- E.g., multilogistic regression (softmax): $\alpha_\phi(k \,|\, x) = \frac{\exp(\phi_k \cdot x)}{\sum_{k'} \exp(\phi_{k'} \cdot x)}$
• Loss, experts, weights differentiable ⟹ end-to-end gradient-based learning
[Figure: mixture of 3 linear predictors]
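A sketch of the MoE forward pass with the softmax gate above; the parameter shapes and names are our assumptions:

```python
import numpy as np

def moe_predict(Phi, experts, x):
    """y(x) = sum_k alpha_phi(k|x) y_k(x), with a softmax gate over x.
    Phi: (K, n) gating parameters; experts: K callables x -> y_k(x)."""
    logits = Phi @ x
    alpha = np.exp(logits - logits.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return sum(a * f(x) for a, f in zip(alpha, experts))
```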
Random Forests
• Bagging over decision trees: which feature at the root?
‣ Much data ⟹ max info gain stable across data samples
‣ Little diversity among models ⟹ little gained from the ensemble
• Random Forests = subsample features
‣ Each tree is only allowed to use a subset of the features
‣ Still low, but higher bias
‣ Average over trees for lower variance
• Works very well in practice ⟹ go-to algorithm for small ML tasks
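In practice one rarely builds this by hand; a usage sketch with scikit-learn (note that `RandomForestClassifier` subsamples features per split via `max_features`, rather than one subset per tree; the toy dataset is just for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
# 100 trees, each split restricted to a random subset of the features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # training accuracy
```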
Recap
• Ensembles = collections of predictors
‣ Combine predictions to improve performance
• Bagging = bootstrap aggregation
‣ Reduces model class complexity to mitigate overfitting
‣ Resample the data many times (with replacement)
- For each resample, train a model
‣ More bias but less variance
‣ Also more compute, both at training time and at test time
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Growing ensembles
• Ensemble = collection of models: $y(x) = \sum_k f_k(x)$
‣ Models should "cover" for each other
• If we could add a model $f_{K+1}$ to a given ensemble $\hat{y}$, what would we add?
$\mathcal{L}(y, \hat{y}') = \mathcal{L}(y, \hat{y} + f_{K+1}(x))$
• Let's find the $f_{K+1}(x)$ that minimizes this loss
‣ If we could do this well, we'd be done in one step
‣ Instead, let's do it badly but many times → gradually improve
Boosting
• Question: can we create a strong learner from many weak learners?
‣ Weak learner = underfits, but fast and simple (e.g., decision stump, perceptron)
‣ Strong learner = performs well but increasingly complex
• Boosting: focus new learners on instances that current ensemble gets wrong
‣ Train new learner
‣ Measure errors
‣ Re-weight data points to emphasize large residuals
‣ Repeat
Example: MSE loss
• Ensemble: $\hat{y}_K = \sum_k f_k(x)$; MSE loss: $\mathcal{L}(y, \hat{y}_k) = \frac{1}{2}(y - \hat{y}_{k-1} - f_k(x))^2$
‣ To minimize: have $f_k(x)$ try to predict the residual $y - \hat{y}_{k-1}$
‣ Then update $\hat{y}_k = \hat{y}_{k-1} + f_k(x)$
[Figure: data and prediction; residual and weak model; increasingly accurate, increasingly complex]
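A sketch of this residual-fitting loop with regression stumps as the weak model, assuming scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_mse(X, y, K=50):
    """Each weak model f_k fits the current residual y - yhat_{k-1}."""
    yhat = np.zeros(len(y))
    models = []
    for _ in range(K):
        f = DecisionTreeRegressor(max_depth=1).fit(X, y - yhat)  # weak model
        yhat = yhat + f.predict(X)                               # yhat_k = yhat_{k-1} + f_k(x)
        models.append(f)
    return models
```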
Gradient Boosting
• More generally: pseudo-residuals $r_k^{(j)} = -\partial_{\hat{y}} \mathcal{L}(y^{(j)}, \hat{y}) \big|_{\hat{y} = \hat{y}_{k-1}^{(j)}}$
‣ $r_k^{(j)}$ = steepest descent of the loss in "prediction space" (how $\hat{y}_{k-1}^{(j)}$ should change)
‣ For MSE loss: $r_k^{(j)} = y^{(j)} - \hat{y}_{k-1}^{(j)}$, as before
• Gradient Boosting:
‣ Learn a weak model $f_k : x^{(j)} \mapsto r_k^{(j)}$ to predict the pseudo-residuals
‣ Find the best step size (line search): $\alpha_k = \arg\min_\alpha \frac{1}{m} \sum_j \mathcal{L}\big(y^{(j)}, \hat{y}_{k-1}^{(j)} + \alpha f_k(x^{(j)})\big)$
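A generic sketch of the loop, with the loss and its gradient passed in and a simple grid search standing in for the exact line search (all names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, loss_grad, K=50):
    """Gradient boosting sketch. loss(y, p) and loss_grad(y, p) act elementwise."""
    yhat = np.zeros(len(y))
    ensemble = []
    for _ in range(K):
        r = -loss_grad(y, yhat)                           # pseudo-residuals
        f = DecisionTreeRegressor(max_depth=2).fit(X, r)  # weak model
        fx = f.predict(X)
        grid = np.linspace(0.01, 1.0, 100)                # crude line search
        alpha = grid[np.argmin([loss(y, yhat + a * fx).mean() for a in grid])]
        yhat += alpha * fx
        ensemble.append((alpha, f))
    return ensemble

# For MSE: loss = lambda y, p: 0.5 * (y - p) ** 2; loss_grad = lambda y, p: p - y
```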
Demo
• http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
Today's lecture
Bagging
Gradient boosting
AdaBoost
Kernel Machines
Growing ensembles
• Ensemble = collection of models: $y(x) = \sum_k \alpha_k f_k(x)$
‣ Models should "cover" for each other
• If we could add a model to a given ensemble, what would we add?
$\mathcal{L}(y, \hat{y}_k) = \mathcal{L}(y, \hat{y}_{k-1} + \alpha_k f_k(x))$
• Let's find the $\alpha_k$, $f_k(x)$ that minimize this loss
‣ If we could do this well, we'd be done in one step
‣ Instead, let's do it badly but many times → gradually improve
Example: exponential loss
• Exponential loss: $\mathcal{L}(y, \hat{y}) = e^{-y \hat{y}}$
‣ Optimal $\hat{y}(x)$: $\arg\min_{\hat{y}} \mathbb{E}_{y|x}[\mathcal{L}(y, \hat{y})] = \frac{1}{2} \ln \frac{p(y = +1 \,|\, x)}{p(y = -1 \,|\, x)}$ (proof by derivative)
‣ If we can minimize the loss ⟹ $\mathrm{sign}(\hat{y})$ is the more likely label
• Let's find the model $f_k : x \mapsto \{+1, -1\}$ that minimizes
$\sum_j \mathcal{L}(y^{(j)}, \hat{y}_k^{(j)}) = \sum_j \mathcal{L}\big(y^{(j)}, \hat{y}_{k-1}^{(j)} + \alpha_k f_k(x^{(j)})\big) = \sum_j \underbrace{e^{-y^{(j)} \hat{y}_{k-1}^{(j)}}}_{w_{k-1}^{(j)}} e^{-y^{(j)} \alpha_k f_k(x^{(j)})}$
‣ The last factor is $e^{-\alpha_k}$ when $f_k$ is correct and $e^{\alpha_k}$ when it errs; splitting the sum accordingly:
$= (e^{\alpha_k} - e^{-\alpha_k}) \sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})] + e^{-\alpha_k} \underbrace{\sum_j w_{k-1}^{(j)}}_{\text{independent of } f_k}$
Minimizing weighted loss
• So far, we minimized the average loss: $\frac{1}{m} \sum_j \mathcal{L}(y^{(j)}, \hat{y}^{(j)})$
• We can also minimize a weighted loss: $\sum_j w^{(j)} \mathcal{L}(y^{(j)}, \hat{y}^{(j)})$
‣ Every data point "counts" as $w^{(j)}$
‣ E.g., in decision trees, a weighted info gain is obtained with $p(y = c) \propto \sum_{j : y^{(j)} = c} w^{(j)}$
• In our current case, the weighted 0–1 loss: $\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]$
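For completeness, the weighted 0–1 loss in code; scikit-learn's tree learners accept these weights directly via `sample_weight`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_01_loss(w, y, pred):
    """Weighted 0-1 loss: sum_j w^(j) * delta[y^(j) != pred^(j)]."""
    return np.sum(w * (y != pred))

# Weighted training, e.g. a stump that minimizes a weighted criterion:
# DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
```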
Boosting with exponential loss (cont.)
• The best classifier to add to the ensemble minimizes the weighted 0–1 loss:
$\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]$, with $w_{k-1}^{(j)} = e^{-y^{(j)} \hat{y}_{k-1}^{(j)}}$
• It gives weighted error rate $\epsilon_k = \frac{\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]}{\sum_j w_{k-1}^{(j)}}$
• Plugging into the loss and solving: $\alpha_k = \frac{1}{2} \ln \frac{1 - \epsilon_k}{\epsilon_k}$
• Now add the model and update the ensemble: $\hat{y}_k(x) = \hat{y}_{k-1}(x) + \alpha_k f_k(x)$
AdaBoost
• AdaBoost = adaptive boosting (full sketch below):
‣ Initialize $w_0^{(j)} = \frac{1}{m}$
‣ Train classifier $f_k$ on the training data with weights $w_{k-1}$
‣ Compute the weighted error rate $\epsilon_k = \frac{\sum_j w_{k-1}^{(j)} \delta[y^{(j)} \ne f_k(x^{(j)})]}{\sum_j w_{k-1}^{(j)}}$
‣ Compute $\alpha_k = \frac{1}{2} \ln \frac{1 - \epsilon_k}{\epsilon_k}$
‣ Update weights $w_k^{(j)} = w_{k-1}^{(j)} e^{-y^{(j)} \alpha_k f_k(x^{(j)})}$ (increase weight for misclassified points)
• Predict $y(x) = \mathrm{sign} \sum_k \alpha_k f_k(x)$
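Putting the steps together, a compact AdaBoost sketch with decision stumps, assuming scikit-learn, labels in $\{-1, +1\}$, and no guard against $\epsilon_k \in \{0, 1\}$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=50):
    """AdaBoost with stumps; y in {-1, +1}. Returns [(alpha_k, f_k), ...]."""
    m = len(y)
    w = np.full(m, 1.0 / m)                                 # w_0^(j) = 1/m
    ensemble = []
    for _ in range(K):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f.predict(X)
        eps = w[pred != y].sum() / w.sum()                  # weighted error rate
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w = w * np.exp(-y * alpha * pred)                   # up-weight mistakes
        ensemble.append((alpha, f))
    return ensemble

def adaboost_predict(ensemble, X):
    """y(x) = sign(sum_k alpha_k f_k(x))."""
    return np.sign(sum(a * f.predict(X) for a, f in ensemble))
```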
Recap
• Ensembles = collections of predictors
‣ Combine predictions to improve performance
• Boosting: Gradient Boost, AdaBoost, ...
‣ Build strong predictor from many weak ones
‣ Train sequentially; later predictors focus on the mistakes of earlier ones
- Weight “hard” examples more