
Lecture 12: The Bootstrap

Reading: Chapter 5

STATS 202: Data mining and analysis

Jonathan Taylor, 10/19

Slide credits: Sergio Bacallado

1 / 14

Announcements

▶ Midterm is a week from today.

▶ Topics: chapters 1-5 and 10 of the book, everything until and including today's lecture.

▶ We will post a practice exam.

▶ Notes: 1 page double sided or 2 pages single sided. Closed book.

▶ No calculators necessary.

▶ SCPD students: if you haven't chosen your proctor already, you must do it ASAP. For guidelines see:

http://scpd.stanford.edu/programs/courses/graduate-courses/exam-monitor-information

2 / 14


Cross-validation vs. the Bootstrap

Cross-validation: provides estimates of the (test) error.

The Bootstrap: provides the (standard) error of estimates.

▶ One of the most important techniques in all of Statistics.

▶ A computer-intensive method.

▶ Popularized by Brad Efron, from Stanford.

3 / 14


Standard errors in linear regression

Standard error: the SD of an estimate's sampling distribution, over repeated samples of size n.

4 / 14

Classical way to compute Standard Errors

Example: Estimate the variance of a sample x1, x2, . . . , xn:

σ̂² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)².

What is the Standard Error of σ̂2?

1. Assume that x1, . . . , xn are normally distributed with common mean µ and variance σ².

2. Then (n − 1)σ̂²/σ² has a χ-squared distribution with n − 1 degrees of freedom.

3. For large n, σ̂² is approximately normally distributed around σ².

4. The SD of this sampling distribution is the Standard Error.

5 / 14
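The four steps above can be checked numerically. This is a minimal sketch, assuming simulated normal data (the choices of n and σ are illustrative, not from the slides): the χ-squared fact in step 2 implies Var(σ̂²) = 2σ⁴/(n − 1), so SE(σ̂²) = σ²√(2/(n − 1)), and a Monte Carlo simulation should reproduce that value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0  # illustrative sample size and true SD

# Analytic standard error under normality:
# (n - 1) * sigma_hat^2 / sigma^2 ~ chi-squared(n - 1), hence
# Var(sigma_hat^2) = 2 * sigma^4 / (n - 1).
se_analytic = sigma**2 * np.sqrt(2.0 / (n - 1))

# Monte Carlo check: draw many samples of size n and look at the
# spread of the variance estimates across samples.
sigma2_hats = np.array([
    np.var(rng.normal(0.0, sigma, size=n), ddof=1)
    for _ in range(20000)
])
se_mc = sigma2_hats.std(ddof=1)
```

With this many replications the simulated spread should agree with the analytic formula to within a few percent.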


Limitations of the classical approach

This approach has served statisticians well for many years; however, what happens if:

▶ The distributional assumption (for example, x1, . . . , xn being normal) breaks down?

▶ The estimator does not have a simple form, and its sampling distribution cannot be derived analytically?

6 / 14


Example. Investing in two assets

Suppose that X and Y are the returns of two assets.

These returns are observed every day: (x1, y1), . . . , (xn, yn).

[Figure: four scatterplots of daily returns, Y against X, each panel showing one sample of n observations from the joint distribution of the two assets' returns.]

7 / 14

Example. Investing in two assets

We have a fixed amount of money to invest, and we will invest a fraction α on X and a fraction (1 − α) on Y. Therefore, our return will be

αX + (1 − α)Y.

Our goal will be to minimize the variance of our return as a function of α. One can show that the optimal α is:

α = (σ²_Y − Cov(X, Y)) / (σ²_X + σ²_Y − 2 Cov(X, Y)).

Proposal: Use an estimate:

α̂ = (σ̂²_Y − Ĉov(X, Y)) / (σ̂²_X + σ̂²_Y − 2 Ĉov(X, Y)).

8 / 14
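The plug-in estimate α̂ can be sketched in a few lines of Python. The data here are simulated, not from the slides: the covariance matrix is an assumption chosen so that the true α equals 0.6, and `alpha_hat` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated daily returns for the two assets (illustrative data):
# Var(X) = 1, Var(Y) = 1.25, Cov(X, Y) = 0.5, so true alpha = 0.6.
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])
returns = rng.multivariate_normal([0.0, 0.0], cov, size=100)
x, y = returns[:, 0], returns[:, 1]

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2.0 * cov_xy)

a = alpha_hat(x, y)
```

The sample estimate `a` will land near, but not exactly at, the true value 0.6; how far it typically lands is exactly the standard-error question the next slides address.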


Example. Investing in two assets

Suppose we compute the estimate α̂ = 0.6 using the samples (x1, y1), . . . , (xn, yn).

▶ How sure can we be of this value?

▶ If we resampled the observations, would we get a wildly different α̂?

In this thought experiment, we know the actual joint distribution P(X, Y), so we can resample the n observations to our hearts' content.

9 / 14

Resampling the data from the true distribution

[Figure: four scatterplots, each showing a fresh sample of n observations (X, Y) drawn from the true joint distribution P(X, Y).]

10 / 14

Computing the standard error of α̂

For each resampling of the data,

(x1^(1), . . . , xn^(1))

(x1^(2), . . . , xn^(2))

. . .

we can compute a value of the estimate: α̂^(1), α̂^(2), . . . .

The Standard Error of α̂ is approximated by the standard deviation of these values.

11 / 14
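In code, the thought experiment looks like this. It is a sketch under the assumption that we know P(X, Y): here a simulated bivariate normal with true α = 0.6 (the distribution, n, and the number of resamplings are all illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])  # assumed "true" P(X, Y); true alpha = 0.6

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2.0 * cov_xy)

# Since we "know" P(X, Y), draw fresh samples of size n repeatedly
# and watch how alpha_hat varies from sample to sample.
alphas = []
for _ in range(2000):
    sample = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    alphas.append(alpha_hat(sample[:, 0], sample[:, 1]))

# The SD of these values approximates SE(alpha_hat).
se_alpha = np.std(alphas, ddof=1)
```

The estimates cluster around the true α = 0.6, and their standard deviation is the standard error we were after; the catch, addressed next, is that in practice we cannot draw fresh samples from P(X, Y).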


In reality, we only have n samples

[Figure: the same four scatterplots of sampled returns; in reality, only one such sample of n observations is available.]

▶ However, these samples can be used to approximate the joint distribution of X and Y.

▶ The Bootstrap: Resample from the empirical distribution:

P̂(X, Y) = (1/n) Σ_{i=1}^n δ(xi, yi).

▶ Equivalently, resample the data by drawing n samples with replacement from the actual observations.

12 / 14
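The bullets above can be sketched directly: draw n rows with replacement from the one observed sample, recompute α̂ each time, and take the SD of the results. The data-generating step below is an assumption (simulated returns stand in for the slides' data); the resampling loop is the Bootstrap itself.

```python
import numpy as np

rng = np.random.default_rng(3)

# In reality we observe only one sample of n paired observations.
# (Simulated here for illustration.)
n = 100
data = rng.multivariate_normal([0.0, 0.0],
                               [[1.0, 0.5], [0.5, 1.25]], size=n)

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2.0 * cov_xy)

# The Bootstrap: draw n rows WITH replacement from the observed data,
# i.e. sample from the empirical distribution P_hat(X, Y).
B = 2000
boot_alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # indices drawn with replacement
    boot_alphas[b] = alpha_hat(data[idx, 0], data[idx, 1])

# Bootstrap estimate of SE(alpha_hat).
se_boot = boot_alphas.std(ddof=1)
```

Note the only difference from the thought experiment is where the resamples come from: the empirical distribution of the one observed sample, rather than the unknown true P(X, Y).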


A schematic of the Bootstrap

Original data Z:

Obs   X     Y
1     4.3   2.4
2     2.1   1.1
3     5.3   2.8

Bootstrap resample Z*1 (rows drawn with replacement) → estimate α̂*1:

Obs   X     Y
3     5.3   2.8
1     4.3   2.4
3     5.3   2.8

Bootstrap resample Z*2 → estimate α̂*2:

Obs   X     Y
2     2.1   1.1
3     5.3   2.8
1     4.3   2.4

. . .

Bootstrap resample Z*B → estimate α̂*B:

Obs   X     Y
2     2.1   1.1
2     2.1   1.1
1     4.3   2.4

13 / 14

Comparing Bootstrap resamplings to resamplings from the true distribution

[Figure: histograms of α̂ over repeated samples from the true distribution (left) and over Bootstrap resamples (right), with boxplots of the two sets of estimates. Both distributions are centered near the same value and span roughly α = 0.3 to 0.9.]

14 / 14
