
Lecture 12: The Bootstrap

Reading: Chapter 5

STATS 202: Data mining and analysis

Jonathan Taylor, 10/19
Slide credits: Sergio Bacallado

Announcements

- The midterm is a week from today.
- Topics: chapters 1-5 and 10 of the book; everything up to and including today's lecture.
- We will post a practice exam.
- Notes: 1 page double-sided or 2 pages single-sided. Closed book.
- No calculators necessary.
- SCPD students: if you haven't chosen your proctor already, you must do it ASAP. For guidelines see:
  http://scpd.stanford.edu/programs/courses/graduate-courses/exam-monitor-information


Cross-validation vs. the Bootstrap

Cross-validation: provides estimates of the (test) error.

The Bootstrap: provides the (standard) error of estimates.

- One of the most important techniques in all of Statistics.
- A computer-intensive method.
- Popularized by Brad Efron, from Stanford.


Standard errors in linear regression

Standard error: the SD of an estimate computed from a sample of size n.

Classical way to compute Standard Errors

Example: Estimate the variance of a sample x_1, x_2, ..., x_n:

    \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 .

What is the Standard Error of σ̂²?

1. Assume that x_1, ..., x_n are normally distributed with common mean µ and variance σ².
2. Then (n − 1)σ̂²/σ² has a χ² distribution with n − 1 degrees of freedom.
3. For large n, σ̂² is approximately normally distributed around σ².
4. The SD of this sampling distribution is the Standard Error.
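As a concrete check of this classical recipe, here is a minimal sketch (not from the slides) that compares the analytic standard error implied by the χ² result, SE(σ̂²) = σ²·sqrt(2/(n − 1)), with the spread of σ̂² across simulated normal samples; the values of n and σ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0

# Under normality, (n-1) * sigma_hat^2 / sigma^2 ~ chi^2 with n-1 df, so
# Var(sigma_hat^2) = 2 * sigma**4 / (n - 1) and SE = sigma**2 * sqrt(2 / (n - 1)).
se_classical = sigma**2 * np.sqrt(2 / (n - 1))

# Monte Carlo check: draw many samples of size n and compute sigma_hat^2 for each.
sigma_hat2 = np.array([np.var(rng.normal(0.0, sigma, n), ddof=1) for _ in range(20_000)])

print(se_classical, sigma_hat2.std())  # the two values should be close
```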


Limitations of the classical approach

This approach has served statisticians well for many years; however, what happens if:

- The distributional assumption (for example, x_1, ..., x_n being normal) breaks down?
- The estimator does not have a simple form and its sampling distribution cannot be derived analytically?


Example. Investing in two assets

Suppose that X and Y are the returns of two assets.

These returns are observed every day: (x_1, y_1), ..., (x_n, y_n).

[Figure: four scatter plots of simulated daily returns, Y against X.]

Example. Investing in two assets

We have a fixed amount of money to invest, and we will invest a fraction α in X and a fraction (1 − α) in Y. Therefore, our return will be

    \alpha X + (1 - \alpha) Y.

Our goal is to minimize the variance of our return as a function of α. One can show that the optimal α is (a short derivation is sketched below):

    \alpha = \frac{\sigma_Y^2 - \mathrm{Cov}(X, Y)}{\sigma_X^2 + \sigma_Y^2 - 2\,\mathrm{Cov}(X, Y)} .

Proposal: Use the plug-in estimate:

    \hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \widehat{\mathrm{Cov}}(X, Y)}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\,\widehat{\mathrm{Cov}}(X, Y)} .
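The "one can show" step is a short variance calculation; a sketch of the derivation (not spelled out on the slide) is:

```latex
\[
\operatorname{Var}\!\bigl(\alpha X + (1-\alpha) Y\bigr)
  = \alpha^2 \sigma_X^2 + (1-\alpha)^2 \sigma_Y^2
    + 2\alpha(1-\alpha)\operatorname{Cov}(X, Y).
\]
% Differentiate with respect to alpha and set the derivative to zero:
\[
2\alpha \sigma_X^2 - 2(1-\alpha)\sigma_Y^2 + 2(1-2\alpha)\operatorname{Cov}(X, Y) = 0
\;\Longrightarrow\;
\alpha = \frac{\sigma_Y^2 - \operatorname{Cov}(X, Y)}
              {\sigma_X^2 + \sigma_Y^2 - 2\operatorname{Cov}(X, Y)}.
\]
```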


Example. Investing in two assets

Suppose we compute the estimate α̂ = 0.6 using the samples (x_1, y_1), ..., (x_n, y_n).

- How sure can we be of this value?
- If we resampled the observations, would we get a wildly different α̂?

In this thought experiment, we know the actual joint distribution P(X, Y), so we can resample the n observations to our hearts' content.
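A minimal sketch of the plug-in estimate α̂, assuming the observed returns are stored in NumPy arrays x and y (the function name alpha_hat is only illustrative):

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    c = np.cov(x, y)                     # 2x2 sample covariance matrix of (X, Y)
    var_x, var_y, cov_xy = c[0, 0], c[1, 1], c[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)
```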

Resampling the data from the true distribution

[Figure: four scatter plots of Y against X, each showing a fresh sample of n observations drawn from the true joint distribution P(X, Y).]

Computing the standard error of α̂

For each resampling of the data,

    (x_1^{(1)}, \ldots, x_n^{(1)})
    (x_1^{(2)}, \ldots, x_n^{(2)})
    \ldots

we can compute a value of the estimate: α̂^(1), α̂^(2), ....

The Standard Error of α̂ is approximated by the standard deviation of these values.
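This thought experiment is easy to simulate; a minimal sketch, assuming (purely for illustration) that the true P(X, Y) is a bivariate normal with a made-up covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
mean = [0.0, 0.0]                      # hypothetical true P(X, Y): bivariate normal
cov_true = [[1.0, 0.5], [0.5, 1.25]]   # made-up covariance, for illustration only
n, B = 100, 1000

alphas = np.empty(B)
for b in range(B):
    x, y = rng.multivariate_normal(mean, cov_true, size=n).T  # fresh sample of size n
    c = np.cov(x, y)
    alphas[b] = (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

print(alphas.std())  # approximates SE(alpha-hat) under repeated sampling from P(X, Y)
```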


In reality, we only have n samples

[Figure: scatter plot of the n observed pairs, Y against X.]

- However, these samples can be used to approximate the joint distribution of X and Y.
- The Bootstrap: resample from the empirical distribution:

      \hat{P}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)} .

- Equivalently, resample the data by drawing n samples with replacement from the actual observations (see the sketch below).
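A minimal sketch of this resampling scheme, assuming x and y are NumPy arrays of the observed returns (the name bootstrap_se_alpha and the default B = 1000 are illustrative choices, not from the slides):

```python
import numpy as np

def bootstrap_se_alpha(x, y, B=1000, seed=0):
    """Bootstrap SE of alpha-hat: resample the (x_i, y_i) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    alphas = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # n draws with replacement from {0, ..., n-1}
        c = np.cov(x[idx], y[idx])
        alphas[b] = (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])
    return alphas.std()
```

Note that only the observed data enter the loop; no knowledge of the true P(X, Y) is required.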


A schematic of the Bootstrap

Original data Z (n = 3 observations):

    Obs   X     Y
    1     4.3   2.4
    2     2.1   1.1
    3     5.3   2.8

Each Bootstrap dataset Z*1, Z*2, ..., Z*B is obtained by drawing n observations with replacement from Z, and each yields an estimate α̂*1, α̂*2, ..., α̂*B:

    Z*1  →  α̂*1
    Obs   X     Y
    3     5.3   2.8
    1     4.3   2.4
    3     5.3   2.8

    Z*2  →  α̂*2
    Obs   X     Y
    2     2.1   1.1
    3     5.3   2.8
    1     4.3   2.4

    ...

    Z*B  →  α̂*B
    Obs   X     Y
    2     2.1   1.1
    2     2.1   1.1
    1     4.3   2.4

Comparing Bootstrap resamplings to resamplings from the true distribution

[Figure: histograms of α̂ across resamplings from the true distribution and across Bootstrap resamplings, together with side-by-side boxplots labeled "True" and "Bootstrap", on a common α axis from roughly 0.3 to 0.9.]