# Bootstrapping

Posted on 26-Jun-2020


• Bootstrapping

PUBH 7401: Fundamentals of Biostatistical Inference

Eric F. Lock UMN Division of Biostatistics, SPH

elock@umn.edu

10/18/2018

PUBH 7401: Fundamentals of Biostatistical Inference Bootstrapping

• Recall: Sampling distributions

Question: how do I find/approximate the sampling distribution of a statistic?

1. Derivation from the probability distribution of the Xi's
2. Simulation from the probability distribution of the Xi's
3. Approximation using asymptotic theory
4. Bootstrapping

Option 1 is often not possible. Option 3 is nice, but...

- The Central Limit Theorem applies only to a sample mean
- It may not be a good approximation (especially for a small sample size n)


• Sampling Distribution Via Simulation

In an ideal world, to learn about the sampling distribution we would

1. Take lots of different samples from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution

This is unrealistic: in practice we would never take many samples from the same population.


• Sampling Distribution of Statistics Using Simulation

In a slightly more realistic world, to learn about the sampling distribution we could

1. Simulate lots of different samples (of the same size) from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution

(But this is challenging if we do not know the distribution of the population.)
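
The three steps above can be sketched in Python for a case where the population distribution is known. This is a minimal illustration and not part of the original slides: the exponential population with mean 1000, the sample size of 50, and the name `simulate_sampling_distribution` are all assumptions chosen for the example.

```python
import random
import statistics

def simulate_sampling_distribution(draw_sample, statistic, n_sims=2000):
    """Compute the statistic on n_sims independently simulated samples."""
    return [statistic(draw_sample()) for _ in range(n_sims)]

random.seed(7401)  # fixed seed so the sketch is reproducible

n = 50  # assumed sample size

def draw_sample():
    # Assumed population: exponential with mean 1000 (hypothetical choice).
    return [random.expovariate(1 / 1000) for _ in range(n)]

# Steps 1 and 2: simulate many samples and compute the mean of each.
sample_means = simulate_sampling_distribution(draw_sample, statistics.mean)

# Step 3 (summarized numerically here instead of plotted):
# the center and spread of the simulated sampling distribution.
print("mean of sample means:", round(statistics.mean(sample_means), 1))
print("SD of sample means:  ", round(statistics.stdev(sample_means), 1))
```

For this assumed population the spread should land near 1000 / sqrt(50), roughly 141, which is the theoretical standard deviation of the sample mean.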


• Idea of the (nonparametric) bootstrap

- In practice, we collect only a single sample from the population
  - A single sample statistic
- We want to know how statistics could vary across different samples
- Bootstrapping: using the sample data, create an artificial “population” to sample from
  - Many copies of the original sample
- Reese’s Pieces: http://ericfrazerlock.com/Reeses_bootstrap.pptx


• Idea of the (nonparametric) bootstrap

- A bootstrap “population” from a sample:


• Sampling with replacement

- To draw a sample from the bootstrap population, sample with replacement from the sample we have.
- Each unit can be selected more than once.
- The bootstrap sample has the same size n as the original sample.
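
A single bootstrap sample can be drawn in one line with Python's standard library. This is only a sketch: the five data values are hypothetical, and `random.choices` is used because it samples with replacement.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

original_sample = [3, 7, 8, 12, 15]  # hypothetical data, not from the slides

# Sample WITH replacement, keeping the original sample size n.
bootstrap_sample = random.choices(original_sample, k=len(original_sample))

print("original: ", original_sample)
print("bootstrap:", bootstrap_sample)
```

Because sampling is with replacement, some values typically appear more than once in the bootstrap sample while others are left out.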


• Bootstrap terminology

- A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
- A bootstrap statistic is the statistic computed from a bootstrap sample.
  - E.g., sample mean, sample proportion, correlation, etc.
- A bootstrap distribution is the distribution of many bootstrap statistics.
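
The three terms map directly onto a few lines of code. The sketch below is one possible arrangement, not the course's implementation: the function name `bootstrap_distribution`, its parameters, and the ten data values are hypothetical. The median is used as the statistic to show that the procedure is not tied to the sample mean.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_boot=1000, seed=None):
    """Return n_boot bootstrap statistics computed from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    boot_stats = []
    for _ in range(n_boot):
        boot_sample = rng.choices(sample, k=n)      # a bootstrap sample
        boot_stats.append(statistic(boot_sample))   # a bootstrap statistic
    return boot_stats                               # the bootstrap distribution

sample = [12, 15, 9, 22, 17, 14, 30, 11, 19, 16]  # hypothetical data

boot_medians = bootstrap_distribution(sample, statistics.median,
                                      n_boot=1000, seed=42)
print(len(boot_medians), min(boot_medians), max(boot_medians))
```

Passing the statistic as a function keeps the resampling loop unchanged when switching to a proportion, a correlation, or any other statistic.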


• Finding a bootstrap distribution

http://ericfrazerlock.com/bootstrap_analogy.pptx


• Golden rule of bootstrapping

- Bootstrap statistics are to the original sample statistic as the original sample statistic is to the population parameter
- The bootstrap distribution approximates the shape and spread (variance) of the unknown sampling distribution of the statistic; it is centered near the original sample statistic rather than the population parameter.
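
The golden rule can be checked numerically for the sample mean, where a textbook formula for the standard error, s / sqrt(n), is available for comparison. The sketch below rests on assumptions: the "original sample" is 50 made-up values drawn from a normal distribution, and 5,000 bootstrap samples is an arbitrary choice.

```python
import math
import random
import statistics

rng = random.Random(2018)  # fixed seed so the sketch is reproducible

# Hypothetical original sample of size 50 (values made up for illustration).
sample = [rng.gauss(100, 15) for _ in range(50)]
n = len(sample)
sample_mean = statistics.mean(sample)

# Bootstrap distribution of the sample mean.
boot_means = [statistics.mean(rng.choices(sample, k=n)) for _ in range(5000)]

boot_center = statistics.mean(boot_means)            # close to the sample mean
boot_se = statistics.stdev(boot_means)               # bootstrap standard error
formula_se = statistics.stdev(sample) / math.sqrt(n) # s / sqrt(n)

print("sample mean:      ", round(sample_mean, 2))
print("bootstrap center: ", round(boot_center, 2))
print("bootstrap SE:     ", round(boot_se, 2))
print("formula SE:       ", round(formula_se, 2))
```

The bootstrap distribution is centered at the sample mean, not at the assumed population mean of 100, while its spread closely tracks the formula-based standard error.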


• Example: Body temperatures

- Consider body temperatures for a random sample of 50 individuals¹

- The mean body temperature in the sample is x̄ = 98.26°F

- Generate a bootstrap distribution for the sample mean
- http://www.lock5stat.com/StatKey/bootstrap_1_quant/bootstrap_1_quant.html

¹ https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910512

• Probability interpretation of bootstrap

- Recall: for a simple random sample, X1, ..., Xn are independent and come from the same probability distribution
- In practice, we may not know the probability distribution
- Our “best guess” of the population pmf is often the empirical pmf. That is, we assume that

$$\hat{P}(X = x) = \frac{\text{\# of times } x \text{ is observed in the sample}}{n}$$

- This can be a good approximation even if the distribution is continuous
- When we bootstrap, we simulate from this distribution
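
A short derivation, added here as a worked consequence rather than something stated on the slides, shows what simulating from the empirical pmf implies. The notation is an assumption for this sketch: $x_1, \dots, x_n$ are the observed values, $X^*$ is a single draw from $\hat{P}$, and $\bar{X}^*$ is the mean of a bootstrap sample of $n$ independent such draws.

```latex
\begin{align*}
E_{\hat{P}}[X^*] &= \sum_x x \, \hat{P}(X = x)
                  = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \\
\operatorname{Var}_{\hat{P}}(X^*) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \\
\operatorname{Var}_{\hat{P}}(\bar{X}^*) &= \frac{1}{n} \operatorname{Var}_{\hat{P}}(X^*)
                  = \frac{1}{n^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\end{align*}
```

So bootstrap sample means are centered at the original sample mean, and their variance is nearly the usual estimate $s^2 / n$, differing only by the factor $(n-1)/n$.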


• Empirical Distribution

Consider the previous health cost data for El Goog. Assume that we took a sample of size 5 from the cost data:

##  479 489 725 1955 2809

The empirical distribution p̂(x) is

| x    | 479 | 489 | 725 | 1955 | 2809 |
|------|-----|-----|-----|------|------|
| p̂(x) | 0.2 | 0.2 | 0.2 | 0.2  | 0.2  |
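
The empirical distribution for these five values can be computed by counting, and drawing from it is the same as resampling the data with replacement. The sketch below uses the five cost values shown above; the variable names and the seed are illustrative choices.

```python
import random
from collections import Counter

costs = [479, 489, 725, 1955, 2809]  # the sample of size 5 shown above
n = len(costs)

# Empirical pmf: (# of times x is observed in the sample) / n
counts = Counter(costs)
empirical_pmf = {x: count / n for x, count in sorted(counts.items())}
print(empirical_pmf)
# {479: 0.2, 489: 0.2, 725: 0.2, 1955: 0.2, 2809: 0.2}

# Simulating n draws from the empirical pmf gives a bootstrap sample.
rng = random.Random(5)  # arbitrary seed for reproducibility
values = list(empirical_pmf)
weights = list(empirical_pmf.values())
bootstrap_sample = rng.choices(values, weights=weights, k=n)
print(bootstrap_sample)
```

If a value appeared twice in the data, its probability would be 2/n, so it would be drawn twice as often as a value that appeared once.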


• Bootstrap Sampling Distribution

1,000 bootstrap simulations with the original n = 50 dataset

[Figure: density plot of the bootstrap distribution. x-axis: Mean Cost in Bootstrap Samples (ticks at 1500, 2000, 2500); y-axis: Density (0.0000 to 0.0020).]
