Page 1: Bootstrappingericfrazerlock.com/bootstrapping.pdf · I A bootstrap statistic is the statistic computed from a bootstrap sample. I E.g., sample mean, sample proportion, correlation,

Bootstrapping

PUBH 7401: Fundamentals of Biostatistical Inference

Eric F. Lock
UMN Division of Biostatistics, SPH

[email protected]

10/18/2018

PUBH 7401: Fundamentals of Biostatistical Inference Bootstrapping

Page 2

Recall: Sampling distributions

Question: how do I find/approximate the sampling distribution of a statistic?

1. Derivation from the probability distribution of the X_i's
2. Simulation from the probability distribution of the X_i's
3. Approximation using asymptotic theory
4. Bootstrapping

Option 1.) is often not possible
Option 3.) is nice, but...

• The Central Limit Theorem is only for a sample mean
• May not be a good approximation (especially for small sample size n)

Page 3

Sampling Distribution Via Simulation

In an ideal world, to learn about the sampling distribution we would

1. Take lots of different samples from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution

This is totally unreasonable, as we would never take multiple samples from the population

Page 4

Sampling Distribution of Statistics Using Simulation

In a slightly more realistic world, to learn about the sampling distribution we could

1. Simulate lots of different samples (of the same size) from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution

(But this is challenging if we do not know the distribution of the population)
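The three steps above can be sketched in Python for a case where the population distribution is known. This is a minimal illustration, not part of the original slides: the Exponential(rate 1) population, the sample size n = 30, the choice of the median as the statistic, and the helper name `simulate_sampling_distribution` are all assumptions made for the example.

```python
import random
import statistics

def simulate_sampling_distribution(draw_sample, statistic, n_sims=2000, seed=0):
    """Return n_sims values of `statistic`, each computed on a fresh simulated sample."""
    rng = random.Random(seed)
    return [statistic(draw_sample(rng)) for _ in range(n_sims)]

# Assumed population: Exponential with rate 1 (illustrative choice only).
n = 30

def draw_sample(rng):
    # Step 1: simulate one sample of size n from the assumed population.
    return [rng.expovariate(1.0) for _ in range(n)]

# Step 2: compute the statistic (here, the median) in each simulated sample.
medians = simulate_sampling_distribution(draw_sample, statistics.median)

# Step 3: summarize the simulated sampling distribution
# (a histogram of `medians` would be the plot).
print("mean of simulated medians:", round(statistics.mean(medians), 3))
print("SD of simulated medians:  ", round(statistics.stdev(medians), 3))
```

The approach only works because `draw_sample` can generate new data; without a known population distribution there is nothing to draw from, which is the gap the bootstrap fills.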

Page 5

Idea of the (nonparametric) bootstrap

• In practice, we only collect a single sample from the population
    • A single sample statistic
• Want to know how statistics could vary with different samples
• Bootstrapping: Using sample data, create an artificial "population" to sample from
    • Many copies of the original sample
• Reese's Pieces: http://ericfrazerlock.com/Reeses_bootstrap.pptx

Page 6

Idea of the (nonparametric) bootstrap

• A bootstrap "population" from a sample:

Page 7

Sampling with replacement

• To draw a sample from the bootstrap population, sample with replacement from the sample we have.
• Each unit can be selected more than once.
• The bootstrap sample is of the original sample size n.
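Sampling with replacement can be sketched with Python's standard library. The data values and the helper name `bootstrap_sample` below are hypothetical, chosen only to illustrate the three points above.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) values from `data` with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(1)            # fixed seed so the sketch is reproducible
original = [3, 7, 8, 12, 15]      # hypothetical original sample, n = 5

resample = bootstrap_sample(original, rng)
print("original sample: ", original)
print("bootstrap sample:", resample)
```

Because each draw is made from the full original sample, some values may appear more than once in `resample` and others not at all, while the size stays at n.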

Page 8

Bootstrap terminology

• A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
• A bootstrap statistic is the statistic computed from a bootstrap sample.
    • E.g., sample mean, sample proportion, correlation, etc.
• A bootstrap distribution is the distribution of many bootstrap statistics.
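The three terms map directly onto a few lines of code. The following is a minimal sketch: the data values, the choice of 1,000 resamples, and the function name `bootstrap_distribution` are assumptions for illustration.

```python
import random
import statistics

def bootstrap_distribution(data, statistic, n_boot=1000, seed=0):
    """Return n_boot bootstrap statistics computed from resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    boot_stats = []
    for _ in range(n_boot):
        sample = rng.choices(data, k=n)       # a bootstrap sample
        boot_stats.append(statistic(sample))  # a bootstrap statistic
    return boot_stats                         # the bootstrap distribution

# Hypothetical original sample (n = 10).
data = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.6, 9.5, 11.0]

boot_means = bootstrap_distribution(data, statistics.mean)
print("original sample mean:   ", round(statistics.mean(data), 3))
print("mean of bootstrap means:", round(statistics.mean(boot_means), 3))
print("SD of bootstrap means:  ", round(statistics.stdev(boot_means), 3))
```

Any function of a sample can be passed as `statistic`, for example `statistics.median`, so the same loop serves for statistics other than the mean.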

Page 9

Finding a bootstrap distribution

http://ericfrazerlock.com/bootstrap_analogy.pptx

Page 10

Golden rule of bootstrapping

• Bootstrap statistics are to the original sample statistic as the original sample statistic is to the population parameter

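• The bootstrap distribution approximates the shape and spread (variance) of the unknown sampling distribution of the statistic; it is centered at the original sample statistic rather than at the population parameter.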
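One way to write the golden rule in symbols is given below as a hedged sketch. The notation is assumed here and does not appear in the slides: θ is the population parameter, θ̂ the original sample statistic, θ̂* a bootstrap statistic, and B the number of bootstrap samples.

```latex
\hat{\theta}^{*} - \hat{\theta} \;\overset{d}{\approx}\; \hat{\theta} - \theta,
\qquad
\widehat{\mathrm{SE}}(\hat{\theta})
  = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat{\theta}^{*}_{b} - \bar{\theta}^{*}\bigr)^{2}},
\quad
\bar{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*}_{b}
```

Under this reading, the standard deviation of the bootstrap statistics serves as an estimate of the standard error of the original statistic.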

Page 11

Example: Body temperatures

• Consider body temperatures for a random sample of 50 individuals [1]

• The mean body temperature in the sample is x̄ = 98.26°F

• Generate a bootstrap distribution for the sample mean
• http://www.lock5stat.com/StatKey/bootstrap_1_quant/bootstrap_1_quant.html

[1] https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910512

Page 12

Probability interpretation of bootstrap

• Recall: for a simple random sample, X_1, ..., X_n are independent and from the same probability distribution
• In practice, we may not know the probability distribution
• Our "best guess" of the population pmf is often the empirical pmf. That is, we assume that

$$\hat{P}(X = x) = \frac{\#\text{ of times } x \text{ is observed in the sample}}{n}$$

• This can be a good approximation even if the distribution is continuous
• When we bootstrap, we simulate from this distribution
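The empirical pmf and the act of simulating from it can be sketched as follows. The sample values and the helper name `empirical_pmf` are hypothetical and serve only to illustrate the definition above.

```python
import random
from collections import Counter

def empirical_pmf(sample):
    """Map each observed value x to (# of times x is observed) / n."""
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}

sample = [2, 3, 3, 5, 5, 5, 7, 8]   # hypothetical sample, n = 8
pmf = empirical_pmf(sample)
print(pmf)  # {2: 0.125, 3: 0.25, 5: 0.375, 7: 0.125, 8: 0.125}

# Simulating n independent draws from the empirical pmf ...
rng = random.Random(0)
values = list(pmf)
weights = [pmf[x] for x in values]
draws = rng.choices(values, weights=weights, k=len(sample))
print(draws)
# ... follows the same distribution as sampling the original data
# with replacement, i.e. rng.choices(sample, k=len(sample)).
```

Drawing from the weighted pmf and resampling the raw data are two descriptions of the same procedure, which is why a bootstrap sample can be read as a simulated sample from the empirical distribution.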

Page 13

Empirical Distribution

Consider the previous health cost data for El Goog.
Assume that we took a sample of size 5 from the cost data:

## [1] 479 489 725 1955 2809

The empirical distribution p̂(x) is

x       479    489    725    1955   2809
p̂(x)    0.2    0.2    0.2    0.2    0.2
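As a small worked check (added here for illustration, not taken from the slides), the mean of this empirical distribution equals the sample mean of the five observed costs:

```latex
E_{\hat{p}}[X] = \sum_{x} x\,\hat{p}(x)
  = 0.2\,(479 + 489 + 725 + 1955 + 2809)
  = 0.2 \times 6457
  = 1291.4
  = \bar{x}
```

This is why the bootstrap distribution of the sample mean is centered near the original sample mean rather than near the unknown population mean.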

Page 14

Bootstrap Sampling Distribution

1,000 bootstrap simulations with the original n = 50 dataset

[Figure: bootstrap distribution. x-axis: Mean Cost in Bootstrap Samples (1500 to 2500); y-axis: Density (0.0000 to 0.0020)]
