  • Bootstrapping

    PUBH 7401: Fundamentals of Biostatistical Inference

    Eric F. Lock UMN Division of Biostatistics, SPH

    elock@umn.edu

    10/18/2018

    PUBH 7401: Fundamentals of Biostatistical Inference Bootstrapping

  • Recall: Sampling distributions

    Question: how do I find or approximate the sampling distribution of a statistic?

    1. Derivation from the probability distribution of the Xi's
    2. Simulation from the probability distribution of the Xi's
    3. Approximation using asymptotic theory
    4. Bootstrapping

    Option 1 is often not possible. Option 3 is nice, but:

    - The Central Limit Theorem applies only to a sample mean
    - It may not be a good approximation (especially for small sample size n)

  • Sampling Distribution Via Simulation

    In an ideal world, to learn about the sampling distribution we would

    1. Take lots of different samples from the population
    2. Calculate the statistic in each of those samples
    3. Plot the sampling distribution

    This is unrealistic: in practice we would never take multiple samples from the population.

  • Sampling Distribution of Statistics Using Simulation

    In a slightly more realistic world, to learn about the sampling distribution we could

    1. Simulate lots of different samples (of the same size) from the population
    2. Calculate the statistic in each of those samples
    3. Plot the sampling distribution

    (But this is challenging if we do not know the distribution of the population.)
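
The three steps above can be sketched in a few lines when the population distribution is known. This is only an illustration: the exponential population with mean 1000, the helper names `simulate_sampling_distribution` and `draw_cost`, and the choices n = 50 and 2,000 repetitions are all assumptions, not part of the course material.

```python
import random
import statistics


def simulate_sampling_distribution(draw, statistic, n, reps, seed=0):
    """Return `statistic` computed on each of `reps` simulated samples of size n."""
    rng = random.Random(seed)
    return [statistic([draw(rng) for _ in range(n)]) for _ in range(reps)]


def draw_cost(rng):
    # Hypothetical population (an assumption): exponential with mean 1000.
    return rng.expovariate(1 / 1000)


# Steps 1 and 2: simulate many samples of size 50 and compute the mean of each.
sample_means = simulate_sampling_distribution(
    draw_cost, statistics.mean, n=50, reps=2000
)

# Step 3 would be a histogram; here we summarise centre and spread instead.
print("centre:", round(statistics.mean(sample_means), 1))
print("spread:", round(statistics.stdev(sample_means), 1))
```

Under these assumptions the centre should land near 1000 and the spread near 1000/√50, which is about 141.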

  • Idea of the (nonparametric) bootstrap

    - In practice, we only collect a single sample from the population
      - A single sample statistic
    - Want to know how statistics could vary with different samples
    - Bootstrapping: using sample data, create an artificial “population” to sample from
      - Many copies of the original sample
    - Reese’s Pieces: http://ericfrazerlock.com/Reeses_bootstrap.pptx

  • Idea of the (nonparametric) bootstrap

    - A bootstrap “population” from a sample (figure not reproduced in this transcript)

  • Sampling with replacement

    - To draw a sample from the bootstrap population, sample with replacement from the sample we have.
    - Each unit can be selected more than once.
    - The bootstrap sample is of the original sample size n.
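
A minimal sketch of drawing one bootstrap sample, using `random.choices` from the Python standard library for sampling with replacement. The data values and the helper name `bootstrap_sample` are hypothetical.

```python
import random


def bootstrap_sample(sample, rng):
    """Draw len(sample) units from `sample`, with replacement."""
    return rng.choices(sample, k=len(sample))


rng = random.Random(1)          # fixed seed only so the run is repeatable
original = [3, 7, 8, 12, 15]    # hypothetical data
resample = bootstrap_sample(original, rng)

# Same size as the original; some values may repeat and others may be absent.
print(resample)
```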

  • Bootstrap terminology

    - A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
    - A bootstrap statistic is the statistic computed from a bootstrap sample.
      - E.g., sample mean, sample proportion, correlation, etc.
    - A bootstrap distribution is the distribution of many bootstrap statistics.
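
The three terms fit together as follows. This is a sketch with hypothetical data and a hypothetical helper name, `bootstrap_distribution`; any function of a sample can be passed in as the statistic.

```python
import random
import statistics


def bootstrap_distribution(sample, statistic, reps=1000, seed=0):
    """Compute `statistic` on `reps` bootstrap samples drawn from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Each rng.choices(...) call is one bootstrap sample;
    # each statistic(...) value is one bootstrap statistic.
    return [statistic(rng.choices(sample, k=n)) for _ in range(reps)]


sample = [12, 15, 9, 21, 18, 14, 30, 11, 16, 13]   # hypothetical data

boot_means = bootstrap_distribution(sample, statistics.mean)
boot_medians = bootstrap_distribution(sample, statistics.median)

print("bootstrap SE of the mean:  ", round(statistics.stdev(boot_means), 2))
print("bootstrap SE of the median:", round(statistics.stdev(boot_medians), 2))
```

The lists `boot_means` and `boot_medians` are the bootstrap distributions; plotting either as a histogram gives the usual picture.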

  • Finding a bootstrap distribution

    http://ericfrazerlock.com/bootstrap_analogy.pptx

  • Golden rule of bootstrapping

    - Bootstrap statistics are to the original sample statistic as the original sample statistic is to the population parameter
    - The bootstrap distribution approximates the shape and spread (variance) of the unknown sampling distribution of the statistic. It is centered at the original sample statistic, not at the population parameter.
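
A small numerical check of this rule for the sample mean. The Normal(100, 15) population, the seed, and the variable names are assumptions made for illustration. The bootstrap distribution should be centered at the sample mean x̄, and its spread should be close to the usual standard error estimate s/√n.

```python
import math
import random
import statistics

rng = random.Random(7)

# Hypothetical population (an assumption): Normal with mean 100 and SD 15.
n = 50
sample = [rng.gauss(100, 15) for _ in range(n)]
xbar = statistics.mean(sample)

# Bootstrap distribution of the sample mean from this single sample.
boot_means = [statistics.mean(rng.choices(sample, k=n)) for _ in range(2000)]

boot_centre = statistics.mean(boot_means)
boot_se = statistics.stdev(boot_means)
plug_in_se = statistics.stdev(sample) / math.sqrt(n)

print("sample mean:     ", round(xbar, 2))
print("bootstrap centre:", round(boot_centre, 2))
print("bootstrap SE:    ", round(boot_se, 2))
print("s / sqrt(n):     ", round(plug_in_se, 2))
```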

  • Example: Body temperatures

    - Consider body temperatures for a random sample of 50 individuals¹
    - The mean body temperature in the sample is x̄ = 98.26°
    - Generate a bootstrap distribution for the sample mean
    - http://www.lock5stat.com/StatKey/bootstrap_1_quant/bootstrap_1_quant.html

    ¹ https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910512

  • Probability interpretation of bootstrap

    - Recall: for a simple random sample, X1, . . . , Xn are independent and from the same probability distribution
    - In practice, we may not know the probability distribution
    - Our “best guess” of the population pmf is often the empirical pmf. That is, we assume that

      P̂(X = x) = (# of times x is observed in the sample) / n

    - This can be a good approximation even if the distribution is continuous
    - When we bootstrap, we simulate from this distribution
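
The empirical pmf can be computed directly by counting. In this sketch the data and the helper name `empirical_pmf` are hypothetical. The last few lines simulate from the empirical pmf, which amounts to resampling the original sample with replacement.

```python
import random
from collections import Counter


def empirical_pmf(sample):
    """Map each observed value x to (# of times x is observed) / n."""
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}


sample = [2, 3, 3, 5, 3, 2]     # hypothetical data with repeated values
pmf = empirical_pmf(sample)
print(pmf)                      # 2 -> 1/3, 3 -> 1/2, 5 -> 1/6

# Simulate n draws from the empirical pmf.
rng = random.Random(0)
values, probs = zip(*pmf.items())
draws = rng.choices(values, weights=probs, k=len(sample))
print(draws)
```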

  • Empirical Distribution

    Consider the previous health cost data for El Goog. Assume that we took a sample of size 5 from the cost data:

    ## [1] 479 489 725 1955 2809

    The empirical distribution p̂(x) is

    x       479   489   725   1955   2809
    p̂(x)    0.2   0.2   0.2   0.2    0.2
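
With only five observations, the bootstrap distribution of the mean can be listed exactly rather than simulated, since there are just 5⁵ = 3,125 equally likely ordered resamples. The sketch below does this; the exhaustive enumeration is an illustrative choice, not something the slides do, and it is only feasible for very small n.

```python
import itertools
import math
import statistics

costs = [479, 489, 725, 1955, 2809]   # the sample of size 5 shown above
n = len(costs)

# Every ordered resample of size n is equally likely when each value has
# probability 0.2, so listing all n**n of them gives the exact distribution.
all_means = [statistics.mean(r) for r in itertools.product(costs, repeat=n)]

exact_centre = statistics.mean(all_means)
exact_se = statistics.pstdev(all_means)

print("number of resamples:", len(all_means))
print("centre:", round(exact_centre, 1))
print("standard error:", round(exact_se, 1))
print("range:", min(all_means), "to", max(all_means))
```

The centre equals the original sample mean, and the standard error equals the standard deviation of the empirical distribution divided by √n.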

  • Bootstrap Sampling Distribution

    1,000 bootstrap simulations with the original n = 50 dataset

    [Figure: bootstrap distribution of the mean cost. x-axis: Mean Cost in Bootstrap Samples (ticks at 1500, 2000, 2500); y-axis: Density (0.0000 to 0.0020)]
