Lecture 12: The Bootstrap
Reading: Chapter 5
STATS 202: Data mining and analysis
Jonathan Taylor, 10/19. Slide credits: Sergio Bacallado
Announcements
I Midterm is a week from today
I Topics: chapters 1-5 and 10 of the book — everything until and including today's lecture.
I We will post a practice exam.
I Notes: 1 page double sided or 2 pages single sided. Closed book.
I No calculators necessary.
I SCPD students: if you haven't chosen your proctor already, you must do it ASAP. For guidelines see:
http://scpd.stanford.edu/programs/courses/graduate-courses/exam-monitor-information
Cross-validation vs. the Bootstrap
Cross-validation: provides estimates of the (test) error.
The Bootstrap: provides the (standard) error of estimates.
I One of the most important techniques in all of Statistics.
I Computer intensive method.
I Popularized by Brad Efron, from Stanford.
Standard errors in linear regression
Standard error: SD of an estimate from a sample of size n.
Classical way to compute Standard Errors
Example: Estimate the variance of a sample x1, x2, . . . , xn:
σ̂² = 1/(n−1) · ∑_{i=1}^{n} (x_i − x̄)².
What is the Standard Error of σ̂2?
1. Assume that x1, . . . , xn are normally distributed with common mean µ and variance σ².
2. Then (n−1)σ̂²/σ² has a χ-squared distribution with n−1 degrees of freedom.
3. For large n, σ̂² is normally distributed around σ².
4. The SD of this sampling distribution is the Standard Error.
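As a sketch of step 4: under normality, Var(σ̂²) = 2σ⁴/(n−1), so the Standard Error is σ²√(2/(n−1)). The following numpy snippet (with hypothetical choices of n, µ, σ) checks this analytic value against the SD of σ̂² over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 200, 0.0, 1.0  # hypothetical sample size and parameters

# Analytic standard error of the sample variance under normality:
# Var(sigma_hat^2) = 2*sigma^4/(n-1), so SE = sigma^2 * sqrt(2/(n-1)).
se_analytic = sigma**2 * np.sqrt(2 / (n - 1))

# Monte Carlo check: draw many samples and take the SD of sigma_hat^2.
reps = 20000
var_hats = np.array([
    rng.normal(mu, sigma, size=n).var(ddof=1) for _ in range(reps)
])
se_mc = var_hats.std()

print(se_analytic, se_mc)  # the two values should be close
```

The agreement between `se_analytic` and `se_mc` is what the classical derivation promises; the simulation route is what the bootstrap will generalize.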
Limitations of the classical approach
This approach has served statisticians well for many years; however, what happens if:
I The distributional assumption — for example, x1, . . . , xn being normal — breaks down?
I The estimator does not have a simple form and its sampling distribution cannot be derived analytically?
Example. Investing in two assets
Suppose that X and Y are the returns of two assets.
These returns are observed every day: (x1, y1), . . . , (xn, yn).
[Figure: four scatter plots of simulated daily returns, X vs. Y.]
Example. Investing in two assets
We have a fixed amount of money to invest and we will invest a fraction α on X and a fraction (1 − α) on Y. Therefore, our return will be

αX + (1 − α)Y.

Our goal will be to minimize the variance of our return as a function of α. One can show that the optimal α is:

α = (σ²_Y − Cov(X, Y)) / (σ²_X + σ²_Y − 2 Cov(X, Y)).

Proposal: Use an estimate:

α̂ = (σ̂²_Y − Ĉov(X, Y)) / (σ̂²_X + σ̂²_Y − 2 Ĉov(X, Y)).
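The plug-in estimate α̂ reads the required variances and covariance off the sample covariance matrix. A minimal numpy sketch (the simulated return parameters are made up for illustration):

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    c = np.cov(x, y)  # 2x2 sample covariance matrix of (X, Y)
    var_x, var_y, cov_xy = c[0, 0], c[1, 1], c[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

# Hypothetical daily returns: bivariate normal with assumed parameters.
rng = np.random.default_rng(1)
xy = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=100)
print(alpha_hat(xy[:, 0], xy[:, 1]))
```

With the assumed parameters the population α is (1.25 − 0.5)/(1 + 1.25 − 1) = 0.6, so the printed α̂ should land in that neighborhood.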
Example. Investing in two assets
Suppose we compute the estimate α̂ = 0.6 using the samples (x1, y1), . . . , (xn, yn).
I How sure can we be of this value?
I If we resampled the observations, would we get a wildly different α̂?
In this thought experiment, we know the actual joint distribution P(X, Y), so we can resample the n observations to our hearts' content.
Resampling the data from the true distribution
[Figure: four scatter plots of fresh samples of (X, Y) drawn from the true distribution.]
Computing the standard error of α̂
For each resampling of the data,
(x_1^{(1)}, . . . , x_n^{(1)})
(x_1^{(2)}, . . . , x_n^{(2)})
. . .
we can compute a value of the estimate α̂^{(1)}, α̂^{(2)}, . . . .
The Standard Error of α̂ is approximated by the standard deviation of these values.
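This thought experiment is easy to simulate. Assuming we know the true joint distribution (a bivariate normal with illustrative parameters below), we draw many fresh datasets of size n, compute α̂ on each, and take the standard deviation:

```python
import numpy as np

rng = np.random.default_rng(2)
true_cov = np.array([[1.0, 0.5], [0.5, 1.25]])  # assumed "true" P(X, Y)
n, reps = 100, 1000

def alpha_hat(x, y):
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# Draw `reps` independent datasets from the true distribution; the SD of
# the resulting alpha-hats approximates the Standard Error of alpha-hat.
alphas = np.empty(reps)
for r in range(reps):
    xy = rng.multivariate_normal([0, 0], true_cov, size=n)
    alphas[r] = alpha_hat(xy[:, 0], xy[:, 1])

print(alphas.std())
```

Of course, this only works because we granted ourselves the true P(X, Y); the next slide drops that assumption.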
In reality, we only have n samples
[Figure: scatter plots of the n observed samples of (X, Y).]
I However, these samples can be used to approximate the joint distribution of X and Y.
I The Bootstrap: Resample from the empirical distribution:

P̂(X, Y) = 1/n · ∑_{i=1}^{n} δ_{(x_i, y_i)}.

I Equivalently, resample the data by drawing n samples with replacement from the actual observations.
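The bootstrap replaces fresh draws from P(X, Y) with draws from the empirical distribution, i.e. sampling rows of the observed data with replacement. A minimal numpy sketch (the single observed dataset is simulated here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def alpha_hat(x, y):
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# One observed dataset of n (X, Y) pairs (simulated for illustration).
n = 100
xy = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=n)

# Bootstrap: draw n row indices with replacement, recompute alpha-hat
# on each resampled dataset, and take the SD of the B estimates.
B = 1000
boot_alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # n rows sampled with replacement
    boot_alphas[b] = alpha_hat(xy[idx, 0], xy[idx, 1])

se_boot = boot_alphas.std()
print(se_boot)
```

Note the only change from the previous simulation: instead of new draws from the true distribution, each iteration reuses the one dataset we actually have.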
A schematic of the Bootstrap
Original Data (Z):

Obs   X     Y
1     4.3   2.4
2     2.1   1.1
3     5.3   2.8

Each bootstrap data set Z*b is obtained by drawing n = 3 observations with replacement from Z, and each yields an estimate α̂*b:

Z*1: Obs 3, 1, 3  →  α̂*1
Z*2: Obs 2, 3, 1  →  α̂*2
. . .
Z*B: Obs 2, 2, 1  →  α̂*B
Comparing Bootstrap resamplings to resamplings from the true distribution
[Figure: histograms of α̂ from resamplings of the true distribution (left) and from bootstrap resamplings (right), with side-by-side boxplots; both distributions are centered near the true α.]