
  • Machine Learning 1 Lecture 13.2 - Combining Models

    Bootstrapping and Feature Bagging - Recap Bias-Variance Decomposition

    Erik Bekkers

    (Bishop 14.2)

    Image credit: Kirillm | Getty Images

    Slide credits: Patrick Forré and Rianne van den Berg

  • Machine Learning 1

    ‣ Combining models: (Bishop 14.1-14.4)

    ‣ Bayesian model averaging vs. model combination methods

    ‣ Committees:

    ‣ Bootstrap aggregation

    ‣ Random subspace methods

    ‣ Boosting

    ‣ Decision trees

    ‣ Random forests

    2


  • Machine Learning 1

    ‣ The simplest way to construct a committee is to average the predictions of a set of individual models

    ‣ Remember the bias-variance trade-off: the model error decomposes into two components

    ‣ Bias: arises from the difference between the model and the ground-truth function that needs to be predicted

    ‣ Variance: represents the sensitivity of a model to the individual data points it was trained on

    3

    Constructing committees

  • Machine Learning 1

    ‣ Generate L datasets of N points each:

      x \sim U(0,1), \qquad t = \sin(2\pi x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \alpha^{-1}), \qquad \mathbb{E}[t \,|\, x] = \sin(2\pi x)

    ‣ Predictions with 24 Gaussian basis functions, fit by regularized least squares:

      E_D = \frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \phi(x_n)\}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}, \qquad y^{(l)}(x) = (\mathbf{w}^{(l)})^T \phi(x)

    ‣ Average prediction over the L datasets:

      \mathbb{E}_D[y_D(x)] \approx \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x)

    (A code sketch of this experiment follows the figure below.)

    4

    Bias-Variance Decomposition: Example

    [Figure panels: left column shows individual fits for ln λ = 2.6, −0.31, and −2.4; right column shows the corresponding averaged fits; axes are x and t.]

    Figure 3.5 Illustration of the dependence of bias and variance on model complexity, governed by a regularization parameter λ, using the sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25 data points, and there are 24 Gaussian basis functions in the model so that the total number of parameters is M = 25 including the bias parameter. The left column shows the result of fitting the model to the data sets for various values of ln λ (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green).

    Figure: bias-variance decomposition (Bishop 3.5)

    [Handwritten annotation: top row (large λ): low variance; bottom row (small λ): low bias.]
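A minimal sketch of the experiment above, assuming the setup from the figure caption (L = 100 datasets of N = 25 points, 24 Gaussian basis functions plus a bias term, regularized least squares). The helper `gaussian_design` and the basis-function width are illustrative choices, not from the lecture.

```python
import numpy as np

def gaussian_design(x, centers, s=0.1):
    # Design matrix: a bias column plus 24 Gaussian basis functions (width s is an assumption)
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(0)
L, N, alpha = 100, 25, 25.0              # datasets, points per dataset, noise precision (alpha assumed)
centers = np.linspace(0, 1, 24)          # 24 Gaussian basis functions
lam = np.exp(2.6)                        # regularization strength, here ln(lambda) = 2.6
x_grid = np.linspace(0, 1, 200)
Phi_grid = gaussian_design(x_grid, centers)

fits = []
for _ in range(L):
    x = rng.uniform(0, 1, N)                                     # x ~ U(0, 1)
    t = np.sin(2 * np.pi * x) + rng.normal(0, alpha ** -0.5, N)  # t = sin(2*pi*x) + noise
    Phi = gaussian_design(x, centers)
    # Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    fits.append(Phi_grid @ w)                                    # y^(l)(x) on a dense grid

avg_fit = np.mean(fits, axis=0)                                  # approximates E_D[y_D(x)]
bias_sq = np.mean((avg_fit - np.sin(2 * np.pi * x_grid)) ** 2)
variance = np.mean(np.var(fits, axis=0))
print(f"(bias)^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

Rerunning with lam set to np.exp(-0.31) or np.exp(-2.4) reproduces the qualitative behaviour in the figure: large λ lowers the variance term, small λ lowers the squared bias.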

  • Machine Learning 1

    ‣ When we average models trained on different datasets, the contribution of the variance is reduced

    ‣ When we average a set of low-bias models (complex models such as high-order polynomials), we obtain accurate predictions!

    ‣ However, in practice we only have a single dataset!

    ‣ One way to introduce variability between the different models in the committee: bootstrap datasets.

    5

    Averaging predictions from different models

  • Machine Learning 1

    ‣ Suppose your original dataset consists of N data points: X = [x_1, \ldots, x_N]^T

    ‣ Create B new datasets \{X_1, \ldots, X_B\} by drawing N points at random from X, with replacement

    ‣ Some data points will occur multiple times in X_b

    ‣ Some data points will be absent from X_b

    6

    Committees: bootstrapping datasets

    [Handwritten annotation: each X_b is a new dataset of size N.]
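A minimal sketch of the bootstrap sampling described above, assuming a toy 1-D regression dataset; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 25, 10                               # original dataset size, number of bootstrap datasets
X = rng.uniform(0, 1, N)                    # inputs x_1, ..., x_N
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, N)

bootstrap_sets = []
for b in range(B):
    idx = rng.integers(0, N, size=N)        # draw N indices at random, with replacement
    bootstrap_sets.append((X[idx], t[idx])) # bootstrap dataset X_b (with its targets)
    # some original points occur multiple times in X_b, others are absent
    print(f"X_{b + 1}: {len(np.unique(idx))} of {N} distinct original points")
```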

  • Machine Learning 1

    ‣ We have generated B bootstrap datasets \{X_1, \ldots, X_B\}

    ‣ Use each X_b to train a separate model y_b(x)

    ‣ The committee's prediction:

      y_{COM}(x) = \frac{1}{B} \sum_{b=1}^{B} y_b(x)

    ‣ This is called bootstrap aggregation / bagging!

    ‣ Suppose the ground-truth function that we need to predict is h(x)

    ‣ The prediction of each individual model: y_b(x) = h(x) + \epsilon_b(x)

    ‣ Error of model b: \epsilon_b(x)

    7

    Regression with bootstrap datasets
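A minimal bagging sketch under assumed choices (a toy sinusoidal dataset and degree-9 polynomial least squares as the low-bias base model, which is not necessarily the lecture's learner): each bootstrap dataset trains its own model y_b(x), and the committee averages them.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, degree = 25, 20, 9                    # data points, committee size, polynomial degree (low bias)
X = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, N)
x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)              # ground-truth function h(x)

models = []
for b in range(B):
    idx = rng.integers(0, N, size=N)                   # bootstrap dataset X_b
    models.append(np.polyfit(X[idx], t[idx], degree))  # y_b(x): one model per bootstrap dataset

# Committee prediction y_COM(x) = (1/B) * sum_b y_b(x)
y_com = np.mean([np.polyval(w, x_test) for w in models], axis=0)

E_AV = np.mean([np.mean((np.polyval(w, x_test) - h) ** 2) for w in models])
E_COM = np.mean((y_com - h) ** 2)
print(f"E_AV ~ {E_AV:.4f}, E_COM ~ {E_COM:.4f}  (E_COM <= E_AV)")
```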

  • Machine Learning 1

    ‣ The average sum-of-squares error of model b:

      \mathbb{E}_x[\{y_b(x) - h(x)\}^2] = \mathbb{E}_x[\epsilon_b(x)^2]

    ‣ The average error made by the models individually:

      E_{AV} = \frac{1}{B} \sum_{b=1}^{B} \mathbb{E}_x[\epsilon_b(x)^2]

    ‣ The expected error of the committee y_{COM}(x) = \frac{1}{B} \sum_{b=1}^{B} y_b(x):

      E_{COM} = \mathbb{E}_x\left[\left\{\frac{1}{B} \sum_{b=1}^{B} y_b(x) - h(x)\right\}^2\right] = \mathbb{E}_x\left[\left\{\frac{1}{B} \sum_{b=1}^{B} \epsilon_b(x)\right\}^2\right]

    ‣ If we assume \mathbb{E}_x[\epsilon_b(x)] = 0 and \mathrm{cov}[\epsilon_b(x), \epsilon_{b'}(x)] = \mathbb{E}_x[\epsilon_b(x)\,\epsilon_{b'}(x)] = 0 for b' \neq b, then

      E_{COM} = \frac{1}{B} E_{AV}

    8

    Bootstrap aggregation
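A worked version of this step under the stated zero-mean and uncorrelated-error assumptions (following Bishop 14.2):

```latex
\begin{aligned}
E_{COM} &= \mathbb{E}_x\Big[\Big\{\tfrac{1}{B}\textstyle\sum_{b=1}^{B}\epsilon_b(x)\Big\}^2\Big]
         = \frac{1}{B^2}\sum_{b=1}^{B}\sum_{b'=1}^{B}\mathbb{E}_x[\epsilon_b(x)\,\epsilon_{b'}(x)] \\
        &= \frac{1}{B^2}\sum_{b=1}^{B}\mathbb{E}_x[\epsilon_b(x)^2]
         \qquad (\text{cross terms vanish since } \mathbb{E}_x[\epsilon_b(x)\,\epsilon_{b'}(x)] = 0 \text{ for } b' \neq b) \\
        &= \frac{1}{B}\cdot\frac{1}{B}\sum_{b=1}^{B}\mathbb{E}_x[\epsilon_b(x)^2]
         = \frac{1}{B}\,E_{AV}.
\end{aligned}
```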

  • Machine Learning 1

    ‣ If we assume \mathbb{E}_x[\epsilon_b(x)] = 0 and \mathbb{E}_x[\epsilon_b(x)\,\epsilon_{b'}(x)] = 0 for b' \neq b:

      E_{COM} = \frac{1}{B} E_{AV}

    ‣ It seems like the average error of a model due to the variance can be reduced by a factor B if we average B versions of the model…

    ‣ However, we assumed that the errors of the individual models are uncorrelated!

    ‣ In practice, the errors are highly correlated (the bootstrap datasets are not independent)

    ‣ But even for correlated errors, E_{COM} \leq E_{AV}!

    ‣ Strategy: choose models with low bias (complex models that can overfit); the bootstrap-aggregated model will have a lower error than the average error of the individual models.

    9

    Bootstrap aggregation
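The bound for correlated errors can be made precise with Jensen's inequality (convexity of the square), as in Bishop Exercise 14.3; a short sketch:

```latex
\Big\{\frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)\Big\}^2 \;\le\; \frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)^2
\quad\Longrightarrow\quad
E_{COM} = \mathbb{E}_x\Big[\Big\{\tfrac{1}{B}\textstyle\sum_b \epsilon_b(x)\Big\}^2\Big]
\;\le\; \frac{1}{B}\sum_{b=1}^{B}\mathbb{E}_x[\epsilon_b(x)^2] \;=\; E_{AV}.
```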

  • Machine Learning 1

    ‣ Feature bagging: sample a subset of r < D features for each learner

    ‣ Also called the 'random subspace method'

    ‣ Works especially well if the features are uncorrelated

    ‣ Causes learners to not over-focus on features that are overly predictive for the training set but do not generalize to new data

    ‣ So feature bagging works well if the number of features is much larger than the number of training points

    ‣ Decision trees with bootstrapping and random subspaces -> random forests

    10

    Committees: feature bagging

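A minimal random-subspace sketch, assuming a synthetic regression problem and scikit-learn decision trees as the base learners (the data, r, and tree depth are illustrative): each learner is trained on a bootstrap sample of the rows and a random subset of r < D features, which together give a random-forest-style committee.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N, D, r, B = 200, 30, 5, 25                 # points, features, subspace size r < D, committee size
X = rng.normal(size=(N, D))
t = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, N)   # only a few features are truly predictive

learners = []
for b in range(B):
    feats = rng.choice(D, size=r, replace=False)    # random subspace for learner b
    rows = rng.integers(0, N, size=N)               # bootstrap of the rows (bagging)
    tree = DecisionTreeRegressor(max_depth=4).fit(X[rows][:, feats], t[rows])
    learners.append((feats, tree))

def committee_predict(X_new):
    # Average the B learners, each applied to its own feature subset
    return np.mean([tree.predict(X_new[:, feats]) for feats, tree in learners], axis=0)

X_test = rng.normal(size=(50, D))
print(committee_predict(X_test)[:5])
```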