
August 2004 Copyright Tim Hesterberg 1

Introduction to the Bootstrap (and Permutation Tests)

Tim Hesterberg, Ph.D.

Association of General Clinical Research Center Statisticians

August 2004, Toronto


Outline of Talk

• Why Resample?

• Introduction to Bootstrapping

• More examples, sampling methods

• Two-sample Bootstrap

• Two-sample Permutation Test

• Other statistics

• Other permutation tests


Why Resample?

• Fewer assumptions: normality, equal variances

• Greater accuracy (in practice)

• Generality: Same basic procedure for wide range of statistics, sampling methods

• Promote understanding: Concrete analogies to theoretical concepts


Good Books

• Hesterberg et al. Bootstrap Methods and Permutation Tests (2003, W. H. Freeman)

• B. Efron and R. Tibshirani An Introduction to the Bootstrap (1993, Chapman & Hall).

• A.C. Davison and D.V. Hinkley, Bootstrap Methods and Their Application (Cambridge University Press, 1997).


Example - Verizon

                        Number of      Average
                        Observations   Repair Time
ILEC (Verizon)          1664            8.4
CLEC (other carrier)      23           16.5

Is the difference statistically significant?

Example Data

[Figure: density histograms of Repair Time (0 to 200) for ILEC and CLEC, and a normal quantile plot of Repair Time against Quantiles of Standard Normal for both groups.]


Start Simple

• We’ll start simple – single sample mean

• Later
  – other statistics
  – two samples
  – permutation tests


Bootstrap Procedure

• Repeat 1000 times
  – Draw a sample of size n with replacement from the original data (“bootstrap sample”, or “resample”)
  – Calculate the sample mean for the resample

• The 1000 bootstrap sample means comprise the bootstrap distribution.
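The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the S-PLUS code used in the talk; the function name `bootstrap_means` and the `times` values are hypothetical and are not the Verizon data.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the bootstrap distribution of the sample mean."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw a resample of size n, with replacement, from the original data.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means


# Hypothetical repair times, for illustration only.
times = [1.2, 0.5, 8.4, 3.3, 22.0, 6.1, 0.9, 14.7, 2.8, 5.5]
boot = bootstrap_means(times)
print("observed mean:      ", statistics.fmean(times))
print("mean of bootstrap:  ", statistics.fmean(boot))
print("bootstrap SE:       ", statistics.stdev(boot))
```

The list `boot` plays the role of the bootstrap distribution; its standard deviation is the bootstrap standard error.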


Bootstrap Distn for ILEC mean

[Figure: bootstrap : ILEC$Time : mean. Density of the bootstrap distribution of the mean (about 7.5 to 9.5) with legend entries Observed and Mean, and a normal quantile plot of the bootstrap means.]


Bootstrap Standard Error

• Bootstrap standard error (SE) = standard deviation of bootstrap distribution

> ILEC.boot.mean
Call:
bootstrap(data = ILEC, statistic = mean, seed = 36)

Number of Replications: 1000

Summary Statistics:
     Observed   Mean      Bias      SE
mean    8.412  8.395  -0.01698  0.3672


Bootstrap Distn for CLEC mean

[Figure: bootstrap : CLEC$Time : mean. Density of the bootstrap distribution of the mean (about 10 to 30) with legend entries Observed and Mean, and a normal quantile plot of the bootstrap means.]


Take another look

• Take another look at the previous two figures.

• Is the amount of non-normality/asymmetry there a cause for concern?

• Note – we’re looking at a sampling distribution, not the underlying distribution. This is after the CLT effect!


Idea behind bootstrapping

• Plug-in principle
  – Underlying distribution is unknown
  – Substitute your best guess


Ideal world


Bootstrap world


Fundamental Bootstrap Principle


• Fundamental Bootstrap Principle
  – This substitution works
  – Not always
  – Bootstrap distribution centered at statistic, not parameter


Secondary Principle

• Implement the Fundamental Principle by Monte Carlo sampling

• This is just an implementation detail!
  – Exact: n^n samples
  – Monte Carlo, typically 1000 samples

• 1000 realizations from theoretical bootstrap distribution

• More for higher accuracy (e.g. 500,000)
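For a very small sample the exact and Monte Carlo versions can be compared directly. The sketch below enumerates all n^n ordered resamples of a hypothetical four-point data set and compares the exact bootstrap SE of the mean with a 1000-resample Monte Carlo estimate; the data values are assumptions made only for illustration.

```python
import itertools
import math
import random
import statistics

data = [3.0, 5.0, 8.0, 12.0]  # hypothetical values
n = len(data)

# Exact bootstrap: every one of the n**n equally likely ordered resamples.
exact_means = [statistics.fmean(s) for s in itertools.product(data, repeat=n)]
exact_se = statistics.pstdev(exact_means)

# Monte Carlo bootstrap: 1000 random resamples from the same distribution.
rng = random.Random(1)
mc_means = [statistics.fmean(rng.choices(data, k=n)) for _ in range(1000)]
mc_se = statistics.stdev(mc_means)

print("number of exact resamples:", len(exact_means))
print("exact bootstrap SE:       ", exact_se)
print("divisor-n formula:        ", statistics.pstdev(data) / math.sqrt(n))
print("Monte Carlo bootstrap SE: ", mc_se)
```

The exact SE agrees with the formula that uses divisor n, and the Monte Carlo value differs from it only by simulation error, which shrinks as the number of resamples grows.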


Not Creating Data from Nothing

• Some are uncomfortable with the bootstrap, because they think it is creating data out of nothing. (The name doesn’t help!)

• Not creating data. No better parameter estimates. (Exception – bagging, boosting.)

• Use the original data to estimate SE or other aspects of the sampling distribution.
  – Using sampling, rather than a formula


Formulaic and Bootstrap SE


What to substitute?


• What to substitute?
  – Empirical distribution – ordinary bootstrap
  – Smoothed distribution – smoothed bootstrap
  – Parametric distribution – parametric bootstrap
  – Satisfy assumptions, e.g. null hypothesis


Another example: Kyphosis

• Variables: Kyphosis (present or absent), Age of child, Number of vertebrae in operation, Start of range of vertebrae

• Logistic regression


Kyphosis - Logistic Regression

                   Value  Std. Error    t value
(Intercept)  -2.03693225  1.44918287  -1.405573
Age           0.01093048  0.00644419   1.696175
Start        -0.20651000  0.06768504  -3.051043
Number        0.41060098  0.22478659   1.826626

Null Deviance: 83.23447 on 80 df
Residual Deviance: 61.37993 on 77 df


Kyphosis vs. Start

[Figure: Kyphosis (0.0 to 1.0) plotted against Start (about 5 to 15).]


Kyphosis Example

• Pseudo-code:

    Repeat 1000 times {
        Draw sample with replacement from original rows
        Fit logistic regression
        Save coefficients
    }
    Use the bootstrap distribution

• Live demo (kyphosis.ssc)
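The pseudo-code can be made concrete with a small, self-contained sketch. It is not the kyphosis.ssc demo: the data are synthetic (generated below, not the kyphosis data), only one predictor is used, and the Newton-Raphson fitter `fit_logistic` is a hypothetical helper written for this illustration.

```python
import math
import random


def fit_logistic(x, y, iterations=15):
    """Fit P(y=1) = 1/(1+exp(-(a + b*x))) by Newton-Raphson; return (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            eta = max(-30.0, min(30.0, a + b * xi))  # clip to avoid overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # negative Hessian (information matrix)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:             # degenerate resample: stop updating
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Synthetic rows (x, y), for illustration only.
rng = random.Random(0)
x = [rng.randint(1, 18) for _ in range(81)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(1.0 - 0.25 * xi))) else 0 for xi in x]
rows = list(zip(x, y))

coefs = []
for _ in range(1000):
    resample = rng.choices(rows, k=len(rows))  # resample whole rows
    xs, ys = zip(*resample)
    coefs.append(fit_logistic(xs, ys))         # save coefficients

slopes = [b for _, b in coefs]
print("original fit:", fit_logistic(x, y))
print("bootstrap slope range:", min(slopes), max(slopes))
```

Each element of `coefs` is one draw from the bootstrap distribution of the coefficient vector; SEs, percentiles and scatterplots are all computed from this list.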


Bootstrap SE and bias

• Bootstrap SE (standard error) = standard deviation of bootstrap distribution

• Bootstrap bias = mean of bootstrap distribution – original statistic


t confidence interval

• Statistic ± t* × SE(bootstrap)

• Reasonable interval if the bootstrap distribution is approximately normal with little bias. Compare to bootstrap percentiles. (Return to Kyphosis example.)

• In the literature, “bootstrap t” means something else.


Percentiles to check Bootstrap t

• If bootstrap distribution is approximately normal and unbiased, then bootstrap t intervals and corresponding percentiles should be similar.

• Compare these

• If similar use either; else use a more accurate interval
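The comparison can be sketched as follows. This is an illustration under stated assumptions: the function name and data are hypothetical, the level is fixed at 95%, and the normal quantile is used as a stand-in for t*, which is reasonable only when the degrees of freedom are not small.

```python
import random
import statistics
from statistics import NormalDist


def t_and_percentile_intervals(data, stat, n_resamples=1000, seed=0):
    """Return the 95% (t-style interval, percentile interval) for stat."""
    rng = random.Random(seed)
    n = len(data)
    boot = [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]
    observed = stat(data)
    se = statistics.stdev(boot)             # bootstrap standard error
    z = NormalDist().inv_cdf(0.975)         # assumption: normal quantile for t*
    t_interval = (observed - z * se, observed + z * se)
    cuts = statistics.quantiles(boot, n=40)  # cut points every 2.5%
    percentile_interval = (cuts[0], cuts[-1])
    return t_interval, percentile_interval


# Hypothetical skewed data, for illustration only.
times = [1.2, 0.5, 8.4, 3.3, 22.0, 6.1, 0.9, 14.7, 2.8, 5.5]
t_int, pct_int = t_and_percentile_intervals(times, statistics.fmean)
print("t interval:         ", t_int)
print("percentile interval:", pct_int)
```

If the two intervals differ noticeably, as they tend to for skewed data, that is the signal to move to a more accurate interval.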


More Accurate Intervals

• BCa, Tilting, others (real bootstrap-t)

• Percentile and “bootstrap-t”:
  – first-order correct
  – Consistent, coverage error O(1/sqrt(n))

• BCa and Tilting:
  – second-order correct
  – coverage error O(1/n)


Different Sampling Procedures

• Two-sample applications

• Other sampling situations


Two-sample Bootstrap Procedure

Given independent SRSs from two populations:

• Repeat 1000 times
  – Draw a sample of size m, with replacement, from sample 1
  – Draw a sample of size n, with replacement, from sample 2, independently
  – Compute statistic that compares two groups, e.g. difference in means

• The 1000 bootstrap statistics comprise the bootstrap distribution.
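A minimal sketch of the two-sample procedure, using the difference in means as the statistic. The function name and both data sets are hypothetical.

```python
import random
import statistics


def two_sample_bootstrap(sample1, sample2, n_resamples=1000, seed=0):
    """Bootstrap distribution of mean(sample1) - mean(sample2)."""
    rng = random.Random(seed)
    m, n = len(sample1), len(sample2)
    diffs = []
    for _ in range(n_resamples):
        r1 = rng.choices(sample1, k=m)  # resample group 1 on its own
        r2 = rng.choices(sample2, k=n)  # resample group 2 independently
        diffs.append(statistics.fmean(r1) - statistics.fmean(r2))
    return diffs


# Hypothetical data, for illustration only.
group_a = [1.2, 0.5, 8.4, 3.3, 6.1, 0.9, 2.8, 5.5, 4.0, 7.3]
group_b = [9.5, 22.0, 14.7, 3.1, 18.2, 11.6]
diffs = two_sample_bootstrap(group_a, group_b)
observed = statistics.fmean(group_a) - statistics.fmean(group_b)
print("observed difference:", observed)
print("bootstrap SE:       ", statistics.stdev(diffs))
```

The groups are never pooled here; each keeps its own sample size and its own empirical distribution.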


Example – Relative Risk

Blood Pressure   Cardiovascular Disease
High             55/3338 = 0.0165
Low              21/2676 = 0.0078

Estimated Relative Risk = 0.0165/0.0078 = 2.12 (about 2.10 from the unrounded proportions)
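The relative risk can be bootstrapped with the two-sample procedure by resampling each blood-pressure group separately. The sketch below rebuilds 0/1 outcomes from the counts on the slide; the function name is hypothetical. Note that the ratio computed from the unrounded proportions is about 2.10, while dividing the rounded proportions gives the 2.12 shown.

```python
import random
import statistics


def bootstrap_relative_risk(events1, n1, events2, n2, n_resamples=1000, seed=0):
    """Bootstrap distribution of (events1/n1) / (events2/n2)."""
    rng = random.Random(seed)
    group1 = [1] * events1 + [0] * (n1 - events1)
    group2 = [1] * events2 + [0] * (n2 - events2)
    ratios = []
    for _ in range(n_resamples):
        p1 = sum(rng.choices(group1, k=n1)) / n1
        p2 = sum(rng.choices(group2, k=n2)) / n2
        if p2 > 0:  # skip a resample with no events in group 2 (very rare here)
            ratios.append(p1 / p2)
    return ratios


observed_rr = (55 / 3338) / (21 / 2676)
ratios = bootstrap_relative_risk(55, 3338, 21, 2676)
cuts = statistics.quantiles(ratios, n=40)  # cut points every 2.5%
print("observed relative risk:  ", observed_rr)
print("95% percentile interval: ", (cuts[0], cuts[-1]))
```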


…bootstrap Relative Risk

[Figure: density of the bootstrap distribution of the relative risk (about 1 to 5), with legend entries Observed, Mean, t, percentile, BCa and tilt.]


Example: Verizon

[Figure: density histograms of Repair Time (0 to 200) for ILEC and CLEC, and a normal quantile plot of Repair Time for both groups.]


…difference in means

[Figure: bootstrap : Verizon$Time : mean : ILEC - CLEC. Density of the bootstrap distribution of the difference in means (about -25 to 0) with legend entries Observed and Mean, and a normal quantile plot.]


…difference in trimmed means

[Figure: bootstrap : Verizon : mean(Time, trim =... : ILEC - CLEC. Density of the bootstrap distribution of the difference in trimmed means (about -15 to 0) with legend entries Observed and Mean, and a normal quantile plot.]


…comparison

• Diff means

       Observed    Mean    Bias     SE
mean     -8.098  -7.931  0.1663  3.893

• Diff 25% trimmed means

       Observed    Mean    Bias     SE
Param    -10.34  -10.19  0.1452  2.737


Other Sampling Situations

• Stratified Sampling
  – Resample within strata

• Small samples or strata
  – Correct for narrowness bias

• Finite Population
  – Create finite population, resample without replacement

• Regression


Bootstrap SE too small

• Usual SE for mean is $s/\sqrt{n}$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

• Bootstrap corresponds to using divisor of n instead of n-1.

• Bias factor for each sample, each stratum


Remedies for small SE

• Multiply SE by sqrt(n/(n-1))
  – Equal strata sizes only. No effect on CIs.

• Sample with reduced size, (n-1)

• Bootknife sampling
  – Omit random observation
  – Sample size n from remaining n-1

• Smoothed bootstrap
  – Choose smoothing parameter to match variance
  – Continuous data only
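Bootknife sampling is easy to sketch and to check numerically. In the illustration below (hypothetical data and function names), the ordinary bootstrap SE of the mean comes out close to the divisor-n value, while the bootknife SE comes out close to the usual s/sqrt(n).

```python
import math
import random
import statistics


def bootknife_sample(data, rng):
    """Omit one random observation, then draw n with replacement from the rest."""
    n = len(data)
    omit = rng.randrange(n)
    remaining = data[:omit] + data[omit + 1:]
    return rng.choices(remaining, k=n)


def se_of_mean(sampler, data, n_resamples=20000, seed=0):
    """Standard deviation of resampled means under the given sampler."""
    rng = random.Random(seed)
    means = [statistics.fmean(sampler(data, rng)) for _ in range(n_resamples)]
    return statistics.stdev(means)


def ordinary_sample(data, rng):
    return rng.choices(data, k=len(data))


data = [4.0, 7.5, 1.0, 9.0, 3.5, 6.0]  # hypothetical small sample
n = len(data)
usual_se = statistics.stdev(data) / math.sqrt(n)
ordinary_se = se_of_mean(ordinary_sample, data)
bootknife_se = se_of_mean(bootknife_sample, data)

print("usual s/sqrt(n):      ", usual_se)
print("ordinary bootstrap SE:", ordinary_se)
print("bootknife SE:         ", bootknife_se)
```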


Smoothed bootstrap

• Kernel Density Estimate = Nonparametric bootstrap + random noise

[Figure: TV Advertising, Basic Cable. Density against minutes/half-hour (about 6 to 12).]
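The "nonparametric bootstrap + random noise" view of the smoothed bootstrap can be sketched directly. The example assumes a Gaussian kernel and picks the bandwidth h = s/sqrt(n), one choice that makes the variance of the smoothed distribution equal to s^2; the function name and data are hypothetical.

```python
import math
import random
import statistics


def smoothed_resample(data, rng):
    """Ordinary resample plus Gaussian noise with bandwidth h = s / sqrt(n)."""
    n = len(data)
    # Empirical variance is s^2 * (n-1)/n; adding noise variance s^2/n gives s^2.
    h = statistics.stdev(data) / math.sqrt(n)
    return [x + rng.gauss(0.0, h) for x in rng.choices(data, k=n)]


data = [6.5, 7.2, 8.1, 8.8, 9.4, 9.9, 10.6, 11.3]  # hypothetical values
rng = random.Random(0)
draws = [v for _ in range(2000) for v in smoothed_resample(data, rng)]
print("s^2 of data:             ", statistics.variance(data))
print("variance of smooth draws:", statistics.pvariance(draws))
```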


Finite Population

• Sample size n from population size N

• If N is a multiple of n,
  – repeat each observation (N/n) times,
  – bootstrap sample without replacement

• If N is not a multiple of n,
  – Repeat each observation same # of times
    • round N/n up, down
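A sketch of the simple case, where N is a multiple of n. The function name is hypothetical, and the case where N is not a multiple of n is deliberately left out rather than guessed at.

```python
import random


def finite_population_resample(sample, population_size, rng):
    """Resample n values without replacement from a pseudo-population of size N."""
    n = len(sample)
    if population_size % n != 0:
        raise ValueError("this sketch only handles N that is a multiple of n")
    # Repeat each observation N/n times to create the finite pseudo-population.
    pseudo_population = list(sample) * (population_size // n)
    return rng.sample(pseudo_population, n)


rng = random.Random(0)
sample = [12, 15, 9, 20, 17]  # hypothetical sample, n = 5
resample = finite_population_resample(sample, 20, rng)
print(resample)
```

Because sampling is without replacement, no value can appear more than N/n times in a resample, which is how the finite-population correction enters.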


Resampling for Regression

• Resample observations (random effects)
  – Problem with factors, random amount of info

• Resample residuals (fixed effects)
  – Fit model
  – Resample residuals, with replacement
  – Add to fitted values
  – Problems with heteroskedasticity, lack of fit
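Both schemes can be sketched for simple linear regression. The helper names and the data are hypothetical; the point is only the difference in what gets resampled.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares fit y = a + b*x; return (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def resample_observations(x, y, rng):
    """Resample whole (x, y) rows; the x values change from resample to resample."""
    idx = rng.choices(range(len(x)), k=len(x))
    return [x[i] for i in idx], [y[i] for i in idx]


def resample_residuals(x, y, rng):
    """Keep x fixed; add resampled residuals to the fitted values."""
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    new_y = [fi + e for fi, e in zip(fitted, rng.choices(residuals, k=len(x)))]
    return list(x), new_y


x = list(range(1, 11))  # hypothetical data
y = [2.4, 3.1, 3.4, 4.2, 4.4, 5.1, 5.3, 6.2, 6.4, 7.1]
rng = random.Random(0)
slopes_obs = [fit_line(*resample_observations(x, y, rng))[1] for _ in range(1000)]
slopes_res = [fit_line(*resample_residuals(x, y, rng))[1] for _ in range(1000)]
print("fitted slope:             ", fit_line(x, y)[1])
print("SE, resample observations:", statistics.stdev(slopes_obs))
print("SE, resample residuals:   ", statistics.stdev(slopes_res))
```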


Basic Rule for Sampling

• Sample in a way consistent with how the data were produced

• Including any additional information
  – Continuous distribution (if it matters, e.g. for medians)
  – Null hypothesis


Resampling for Hypothesis Tests

• Sample in a manner consistent with H0

• P-value = P0(random value exceeds observed value)

[Figure: sampling distribution when H0 is true, with the observed statistic marked and the P-value indicated.]


Permutation Test for 2-samples

• H0: no real difference between groups; observations could come from one group as well as the other

• Resample: randomly choose n1 observations for group 1, rest for group 2.

• Equivalent to permuting all n, first n1 into group 1.
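A minimal sketch of the two-sample permutation test for a difference in means, with a one-sided "less" alternative. The function name and data are hypothetical, and the p-value uses the common (count + 1)/(B + 1) convention, which is an assumption here rather than something stated on the slide.

```python
import random
import statistics


def permutation_test_means(group1, group2, n_resamples=9999, seed=0):
    """One-sided test of H0: no difference, against mean1 - mean2 < 0."""
    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = statistics.fmean(group1) - statistics.fmean(group2)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # permute all n; the first n1 go to group 1
        diff = statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:])
        if diff <= observed:
            count += 1
    return observed, (count + 1) / (n_resamples + 1)


# Hypothetical data, for illustration only.
g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [10.0, 11.0, 12.0, 13.0, 14.0]
observed, p_value = permutation_test_means(g1, g2)
print("observed difference:", observed)
print("one-sided p-value:  ", p_value)
```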


Verizon permutation test

[Figure: permutation : Verizon$Time : mean : ILEC - CLEC. Permutation distribution of the difference in means, with legend entries Observed and Mean.]


Test results

Pooled-variance t-test
    t = -2.6125, df = 1685, p-value = 0.0045

Non-pooled-variance t-test
    t = -1.9834, df = 22.3463548265907, p-value = 0.0299

> permVerizon3
Call:
permutationTestMeans(data = Verizon$Time, treatment = Verizon$Group, B = 499999, alternative = "less", seed = 99)

Number of Replications: 499999

Summary Statistics:
    Observed       Mean     SE  alternative  p.value
Var   -8.098  -0.001288  3.105         less  0.01825


Permutation vs Pooled Bootstrap

• Pooled bootstrap test
  – Pool all n observations
  – Choose n1 with replacement for group 1
  – Choose n2 with replacement for group 2

• Permutation test is preferred
  – Condition on the observed data
  – Same number of outliers as the observed data


Assumptions

• Permutation Test:
  – Same distribution for two populations
    • When H0 is true
    • Population variances must be the same; sample variances may differ
  – Does not require normality
  – Does not require that data be a random sample from a larger population


Other Statistics

• Procedure works for variety of statistics
  – Difference in means
  – t-statistic
  – difference in trimmed means

• Work directly with statistic of interest
  – Same p-value for $\bar{x}_1 - \bar{x}_2$ and pooled-variance t-statistic


Difference in Trimmed Means

[Figure: permutation 25% trimmed mean: Verizon ILEC-CLEC. Density of the permutation distribution (about -10 to 0), with legend entries Observed and Mean.]

P-value = 0.0002


General Permutation Tests

• Compute Statistic for data

• Resample in a way consistent with H0 and study design

• Construct permutation distribution

• P-value = percentage of resampled statistics that exceed original statistic


Perm Test for Matched Pairs or Stratified Sampling

• Permute within each pair

• Permute within each stratum
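For matched pairs, permuting within a pair amounts to swapping the two labels, which flips the sign of that pair's difference. The sketch below uses that fact for a two-sided test of the mean difference; the function name, the data, and the (count + 1)/(B + 1) p-value convention are assumptions made for illustration.

```python
import random
import statistics


def paired_permutation_test(before, after, n_resamples=9999, seed=0):
    """Two-sided permutation test of H0: no difference within pairs."""
    rng = random.Random(seed)
    diffs = [a - b for b, a in zip(before, after)]
    observed = statistics.fmean(diffs)
    count = 0
    for _ in range(n_resamples):
        # Swapping the labels inside a pair flips the sign of its difference.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(permuted)) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_resamples + 1)


# Hypothetical paired measurements.
before = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0]
after = [12.1, 13.8, 11.5, 12.9, 15.2, 12.4, 13.7, 13.0]
observed_diff, p_paired = paired_permutation_test(before, after)
print("mean difference:", observed_diff)
print("p-value:        ", p_paired)
```

For stratified data the same idea applies with a shuffle of the group labels inside each stratum instead of a sign flip.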


Example: Puromycin

• The data are from a biochemical experiment where the initial velocity of a reaction was measured for different concentrations of the substrate. Data are from two runs, one on cells treated with the drug Puromycin, the other on cells without.

• Variables: concentration, velocity, treatment


Puromycin data

[Figure: Puromycin. Velocity (about 50 to 200) against Concentration (0.0 to 1.0), for untreated and treated cells.]


Permutation Test for Puromycin

• Statistic: ratio of smooths, at each original concentration

• Stratify by original concentration

• Permute only the treatment variable

permutationTest(data = Puromycin, statistic = f,
    alternative = "less", combine = T, seed = 42,
    group = Puromycin$conc,
    resampleColumns = "state")


Puromycin – Permutation Graphs

[Figure: permutation : Puromycin : f. Six density panels of the permutation distribution, one per concentration (0.02, 0.06, 0.11, 0.22, 0.56, 1.1), each with legend entries Observed and Mean.]


Puromycin – P-values

Summary Statistics:
      Observed   Mean       SE  alternative  p-value
0.02    0.9085  1.016  0.14932         less    0.259
0.06    0.8509  1.005  0.08191         less    0.024
0.11    0.8254  1.002  0.07011         less    0.003
0.22    0.8034  1.001  0.07657         less    0.002
0.56    0.7850  1.007  0.09675         less    0.002
1.1     0.7937  1.025  0.13384         less    0.053

Combined p-value:
0.02, 0.06, 0.11, 0.22, 0.56, 1.1
                            0.002


Permutation test curves

[Figure: Velocity (about 50 to 200) against Concentration (0.0 to 1.0), with curves for untreated, treated, and perm/untreated.]


Permutation Test of Relationship

• To test H0: X and Y are independent

• Permute either X or Y (both is just extra work)

• Test statistic may be correlation, regression slope, chi-square statistic (Fisher’s exact test), …
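A sketch of a permutation test of independence using the correlation as the test statistic. Only Y is permuted, as the slide suggests. The function names and data are hypothetical, the test is two-sided, and the (count + 1)/(B + 1) p-value convention is an assumption.

```python
import math
import random
import statistics


def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_test_correlation(x, y, n_resamples=9999, seed=0):
    """Two-sided permutation test of H0: X and Y are independent."""
    rng = random.Random(seed)
    observed = correlation(x, y)
    count = 0
    for _ in range(n_resamples):
        permuted_y = rng.sample(y, len(y))  # permute Y, leave X fixed
        if abs(correlation(x, permuted_y)) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_resamples + 1)


# Hypothetical data with a strong linear relationship.
x = list(range(1, 11))
y = [2.4, 3.1, 3.4, 4.2, 4.4, 5.1, 5.3, 6.2, 6.4, 7.1]
r_observed, p_corr = permutation_test_correlation(x, y)
print("observed correlation:", r_observed)
print("p-value:             ", p_corr)
```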


Perm Test in Regression

• Simple regression: permute X or Y

• Multiple regression:
  – Permute Y to test H0: no X contributes
  – To test incremental contribution of X1
    • Cannot permute X1
    • That loses joint relationship of Xs


Example: Kyphosis




Kyphosis vs. Start

[Figure: Kyphosis (0.0 to 1.0) plotted against Start (about 5 to 15).]


Kyphosis Permutation Test

• Permute Kyphosis (the response variable), leaving other variables fixed.

• Test statistic is residual deviance.

Summary Statistics:
      Observed   Mean     SE  alternative  p-value
Param    61.38  79.95  2.828         less    0.001


Kyphosis Permutation Distribution

[Figure: Permutation Distribution for Kyphosis. Density of the Residual Deviance (about 65 to 80), with legend entries Observed and Mean.]


When Perm Testing Fails

• Permutation Testing is not Universal
  – Cannot test H0: = 0
  – Cannot test H0: = 1

• Use Confidence Intervals

• Bootstrap tilting
  – Find maximum-likelihood weighted distribution that satisfies H0, use weighted bootstrap


If time permits

• Bias – Portfolio optimization example, in section3.ppt

• More about confidence intervals, from section5.ppt


Summary

• Basic bootstrap idea
  – Substitute best estimate for population(s)
    • For testing, match null hypothesis
  – Sample consistently with how data produced
  – Inspect bootstrap distribution – Normal?
  – Compare t and percentile intervals, BCa & tilting


Summary

• Testing
  – Sample consistent with H0
  – Permutation test to compare groups, test relationships
  – No permutation tests in some situations; use bootstrap confidence interval or test


Resources

• www.insightful.com/Hesterberg/bootstrap

• S+Resample: www.insightful.com/downloads/libraries

• [email protected]


Supplement for pages 24-27

• This document is a supplement to the presentation at the AGS. This includes some material that was shown in a live demo using S-PLUS, corresponding to pages 24-27 of the original presentation.




Graphical bootstrap of predictions

[Figure: Kyphosis (0.0 to 1.0) against Start (about 5 to 15), with bootstrap predictions overlaid.]


Bootstrap Coefficients

[Figure: bootstrap : glm(formula = Kyp... : coef(glm(data)). Four density panels of the bootstrap distributions of the coefficients (Intercept), Age, Start and Number, each with legend entries Observed and Mean.]


Bootstrap Scatterplots

[Figure: scatterplot matrix of the bootstrap coefficients (Intercept), Age, Start and Number.]




Are t-limits reasonable here?


[Figure: density of the bootstrap distribution of the Start coefficient (about -0.8 to 0.0) with legend entries Observed and Mean, and a normal quantile plot of the bootstrap Start coefficients.]



• Remember, the previous two plots show the bootstrap distribution, an estimate of the sampling distribution, after the Central Limit Theorem has had its chance to work.




Compare t and percentile CIs

> signif(limits.t(boot.kyphosis), 2)
                2.5%       5%     95%   97.5%
(Intercept)  -6.1000  -5.4000   1.400   2.000
Age          -0.0054  -0.0027   0.025   0.027
Start        -0.3800  -0.3500  -0.063  -0.034
Number       -0.2900  -0.1800   1.000   1.100

> signif(limits.percentile(boot.kyphosis), 2)
                 2.5%       5%     95%   97.5%
(Intercept)  -6.80000  -5.8000   0.560   1.400
Age           0.00077   0.0021   0.028   0.033
Start        -0.44000  -0.3900  -0.120  -0.095
Number       -0.09400   0.0078   1.100   1.300


Compare asymmetry of CIs

> signif(limits.t(boot.kyphosis) - boot.kyphosis$observed, 2)
               2.5%      5%    95%  97.5%
(Intercept)  -4.100  -3.400  3.400  4.100
Age          -0.016  -0.014  0.014  0.016
Start        -0.170  -0.140  0.140  0.170
Number       -0.710  -0.590  0.590  0.710

> signif(limits.percentile(boot.kyphosis) - boot.kyphosis$observed, 2)
              2.5%       5%    95%  97.5%
(Intercept)  -4.80  -3.8000  2.600  3.500
Age          -0.01  -0.0088  0.018  0.022
Start        -0.23  -0.1800  0.088  0.110
Number       -0.51  -0.4000  0.710  0.850
