TRANSCRIPT
SYMBOLIC COMPUTATION OF NONPARAMETRIC
BOOTSTRAP ESTIMATORS AND THEIR PROPERTIES
George Theodore Sampson
A thesis submitted in conformity with the requirements for the
Degree of Doctor of Philosophy
Graduate Department of Statistics
University of Toronto
© Copyright by George Theodore Sampson 1997
National Library of Canada · Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Symbolic Computation of Nonparametric Bootstrap Estimators and their
Properties
George Theodore Sampson
Ph.D. 1997
Department of Statistics
University of Toronto
Abstract
The bootstrap was introduced in 1979 as a computer-intensive method for estimating the standard error and other properties of an estimator. This thesis focuses on how the same objective can be achieved, in many cases, by the development and implementation of computer algorithms for the symbolic computation and evaluation of series expansions. The analytic bootstrap permits the calculation of standard errors for a wide variety of estimators in a fraction of the time and with comparable or better accuracy when compared with conventional bootstrap Monte Carlo resampling. In addition, properties such as bias and variance of bootstrap estimators may be assessed. The methodology is applied to examples from Efron and Tibshirani (1993) involving the correlation coefficient, the Behrens-Fisher statistic, double-bootstrapping and BCa bootstrap confidence intervals. An alternative methodology based on the exhaustive enumeration of bootstrap resamples is also presented and illustrated for the case of the median. Implications for future work are discussed.
Acknowledgements
I would like to express my gratitude to my supervisor David Andrews for his invaluable guidance, patience, encouragement, enthusiasm and humour during the undertaking of this thesis. It has been a privilege and a pleasure working with him.
My thanks also go to Andrey Feuerverger and Rob Tibshirani for making very useful comments and suggestions on the manuscript. I wish to thank Yodit Seifu for her encouragement and assistance. Ruth Croxford, Jenny Tam, Laura Kerr and Tom Glinos have been helpful with computational and administrative details. Gus Vukov has provided me with many valuable teaching experiences.
I would like to thank all the graduate students, staff and faculty in the Department of Statistics for a meaningful and memorable academic experience.
I am indebted to the Department of Statistics and the University of Toronto for providing me with financial support during the first four years of my graduate studies.
This thesis is dedicated to my parents whose initial encouragement instilled in me a resolve to undertake the present studies.
Contents

1 Introduction 1
  1.1 Overview 1
  1.2 Literature review 7
  1.3 Thesis outline 9

2 Symbolic Computation of Bootstrapped Order Statistics and their Moments: The Case of the Median 11
  2.1 Introduction 11
  2.2 Review of Bootstrap Estimation of Standard Error 12
  2.3 Properties of Bootstrap Estimators 14
    2.3.1 Introduction 14
    2.3.2 Exact Computation of Bootstrap Estimators 16
    2.3.3 Moments of Bootstrap Estimators 19
  2.4 Application to Order Statistics: The Case of the Median 22
    2.4.1 Some Properties of Order Statistics 23
    2.4.2 The Case of the Median 28
    2.4.3 Some Numerical Results 38
  2.5 Conclusions and Discussion 43

3 Bootstrapped M-estimators and their Properties 44
  3.1 Introduction 44
  3.2 Bootstrapping M-estimators: An Analytical Approach 46
    3.2.1 Introduction 46
    3.2.2 An Approximation for Functions of M-estimators 46
    3.2.3 Weighting and Decomposing Residual Sums 51
    3.2.4 Analytical Bootstrap Moments of M-estimators and their Properties 54
  3.3 Computational Tools 57
  3.4 The Case of the Correlation Coefficient 60
    3.4.1 Introduction 60
    3.4.2 Computational Tools 61
    3.4.3 Numerical Results 69
  3.5 Discussion and Conclusions 71

4 Double-Bootstrapping and the Behrens-Fisher Statistic 79
  4.1 Introduction 79
  4.2 Symbolic Computation of the Bootstrap-t Confidence Interval 79
    4.2.1 Background 79
    4.2.2 Symbolic Computation of the Bootstrap-t Interval for the Correlation Coefficient 82
    4.2.3 Discussion 87
  4.3 Symbolic Computation of BCa Confidence Intervals 87
    4.3.1 Introduction 87
    4.3.2 Background 88
    4.3.3 Symbolic Computation of the Acceleration Constant 91
    4.3.4 Discussion 96
  4.4 Analytic Bootstrap Computation for the Behrens-Fisher Statistic 97
    4.4.1 Introduction 97
    4.4.2 Background 97
    4.4.3 Analytic Bootstrap Computation 98
    4.4.4 Numerical Application 103
    4.4.5 Discussion 104
  4.5 Conclusions 106

5 Concluding Remarks 108
  5.1 Remarks 108

A Source Code used for this Thesis 111
  A.1 Source Code used for Chapter 2 111
  A.2 Source Code used for Chapter 3 122
  A.3 Source Code used for Chapter 4 143
List of Tables

1.1 Bootstrap estimators and associated properties for a statistic θ̂ for which analytic expressions are derived
2.1 Total number of distinct resamples and number of partitions for various sample sizes
2.2 The 5 partitions associated with a sample size of 4 with corresponding multinomial probabilities
2.3 The 12 distinct permutations of the partition (2, 1, 1, 0)
2.4 Approximating distributions for various λ values
2.5 Partitions and multinomial probabilities for a sample size of 3
2.6 Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median using several sample sizes for the case of the normal distribution
2.7 Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median using several sample sizes for the case of the uniform distribution
2.8 Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median using several sample sizes for the case of the logistic distribution
3.1 Bootstrap estimators and related properties for the correlation coefficient for which analytic expansions are derived
3.2 n = 15 Law School Data: original and standardized data
3.3 n = 100 simulated bivariate observations from N(0, 1): variable x
3.4 n = 100 simulated bivariate observations from N(0, 1): variable y 71
3.5 Symbolic bootstrap and classical results for the correlation coefficient using two data sets 72
3.6 Monte Carlo Bootstrap Estimates of the Mean of the correlation coefficient for various B 73
3.7 Monte Carlo Bootstrap Estimates of Standard Error of the correlation coefficient for various B 74
4.1 Comparison of 90% confidence intervals for ρ using analytic bootstrap-t, conventional bootstrap-t, and standard methods for the law school data (n = 15) 86
4.2 Bootstrap estimators and related properties for the Behrens-Fisher statistic for which analytic expansions are derived 99
4.3 Survival times (days) of gastric cancer patients (n = 44) 103
4.4 Symbolic bootstrap and classical results for the Behrens-Fisher statistic using the survival data, n = 44 104
4.5 Monte Carlo Bootstrap Estimates of Mean and Standard Error of the Behrens-Fisher statistic for various B for the survival data, n = 44 (s.e. in brackets) 105
Chapter 1
Introduction
1.1 Overview
In 1979 Efron proposed the idea of using cornputer-based simulations instead of math-
ematical calculations to obtain the sampling properties of random variables. This
thesis focuses on how the same objective can be achieved, in many cases. by the de-
velopment and implementat ion of cornputer algori t hms for the sy mbolic compu t at ion
and evaluation of asyrnptotic expansions. Our study is largel:; confined to a collec-
tion of problems that may be treated using classical theory for sums of independent
and identically distributed random variables - the so-called *-smooth function model'.
(Hall 1992). The corresponding class of statistics is wide and includes. for example.
sample means. variances. ratios and differences of means and variances. correlat ions.
maximum likelihood estimators and M-estimators. But it is by no means comprehen-
sive. It does not include. for example. the sample median or other order statistics:
we treat this case in a separate chapter.
The bootstrap was originally introduced as a general method for assessing the statistical accuracy of an estimator. The procedure applies a given estimator θ̂ to each of B bootstrap samples, and then estimates the standard error of θ̂ by the empirical standard deviation of the B replications. More formally, suppose that x = (x_1, . . . , x_n) is an observed random sample of size n from a population with distribution function F. We estimate a parameter of interest θ by θ̂, to which we wish to assign a standard
error denoted by

    se_F(θ̂).                                                        (1.1)

Now (1.1) may in turn be estimated by

    se_F̂(θ̂),                                                        (1.2)

that is, by replacing F by the empirical distribution function F̂, defined to be the discrete distribution that assigns probability 1/n to each value x_i for i = 1, . . . , n. When F is unknown or θ̂ is complicated, bootstrap methodology estimates (1.1) by applying the same principle. To this end the bootstrap algorithm for estimating standard errors selects B independent bootstrap samples x*_1, . . . , x*_b, . . . , x*_B, each consisting of n observations drawn with replacement from x. The bootstrap replication θ̂*(b) corresponding to each bootstrap sample is evaluated and the standard error se_F(θ̂) is estimated by

    ŝe_B = { Σ_{b=1}^{B} [θ̂*(b) − θ̂*(·)]² / (B − 1) }^{1/2},        (1.3)

where

    θ̂*(·) = Σ_{b=1}^{B} θ̂*(b) / B.                                  (1.4)

We note that (1.3) is an approximation of the ideal (i.e. "infinite") bootstrap estimate of se_F̂(θ̂*) given by

    lim_{B→∞} ŝe_B(θ̂*) = { E_F̂(θ̂*²) − [E_F̂(θ̂*)]² }^{1/2}.          (1.5)

Similarly, the bootstrap algorithm uses (1.4) to approximate the ideal bootstrap expectation

    E_F̂(θ̂*).                                                        (1.6)
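The resampling algorithm just described can be sketched in a few lines. The following Python illustration is ours, not code from the thesis (whose implementations use Mathematica and S-plus); the data and choice of statistic are for demonstration only.

```python
import math
import random

def bootstrap_se(x, stat, B=2000, seed=0):
    """Monte Carlo bootstrap standard error, equations (1.3)-(1.4):
    draw B resamples of size n with replacement, evaluate the statistic
    on each, and take the empirical standard deviation of the B values."""
    rng = random.Random(seed)
    n = len(x)
    reps = [stat([rng.choice(x) for _ in range(n)]) for _ in range(B)]
    rep_mean = sum(reps) / B                     # theta-hat*(.) of (1.4)
    return math.sqrt(sum((r - rep_mean) ** 2 for r in reps) / (B - 1))

x = [1.2, 0.7, 3.1, 2.4, 1.9, 0.3, 2.8, 1.1]     # illustrative data only
se_hat = bootstrap_se(x, stat=lambda s: sum(s) / len(s))
```

For the sample mean the ideal (B → ∞) value (1.5) is available in closed form, {Σ(x_i − x̄)²/n}^{1/2}/n^{1/2} ≈ 0.337 for these data, and ŝe_B fluctuates around it.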
Observe that a bootstrap sample x* = (x*_1, . . . , x*_n) consists of members of the original data set x, some appearing zero times, some appearing once, some appearing twice, etc. Hence, the bootstrap sample x* generates a corresponding set of frequencies (w_1, . . . , w_n) conditional on the original sample x, that is,

    x* ≡ (x; w),  with  w | x ~ Multinomial(n; 1/n, . . . , 1/n),    (1.7)

where w = (w_1, . . . , w_n) is a multinomial vector of integer weights. In this sense, the ideal bootstrap mean (1.6), for example, may be alternatively expressed notationally as

    E_{w|x}(θ̂*).                                                    (1.8)

We shall use this notation throughout this thesis in order to convey (1.7), that is, that bootstrap sampling can be thought of as weighted resampling. Moreover, this notation shall represent ideal bootstrap estimates in the sense of (1.5).

More generally, we may represent a generic bootstrap estimator of E_F̂[g(θ̂*)] by

    T = E_{w|x}[g(θ̂*)]                                              (1.9)

for some function g. Indeed, when

    g(θ̂*) = θ̂*,

we get (1.8), which may be approximated by (1.4). Further, when

    g(θ̂*) = [θ̂* − E_{w|x}(θ̂*)]²,

then (1.9) gives the ideal bootstrap variance estimator

    Var_{w|x}(θ̂*) = E_{w|x}(θ̂*²) − [E_{w|x}(θ̂*)]²,                  (1.10)

which may be approximated by the square of (1.3). In addition, we may evaluate properties of (1.8) and (1.10) such as the mean E_x(·) and variance Var_x(·). Table 1.1 below summarizes the six bootstrap expressions for which representations in terms of analytical expressions comprise the focus of this thesis. These six quantities include the bootstrap mean and the bootstrap variance of an estimator θ̂ as well as two properties for each of these.
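The equivalence between resampling and weighted resampling can be made concrete: drawing n points with replacement induces a multinomial frequency vector w, and any statistic of the resample can be recomputed from (x, w). A small illustrative Python sketch (ours, not the thesis's code):

```python
import random
from collections import Counter

rng = random.Random(1)
x = [5.0, 2.0, 9.0, 4.0]
n = len(x)

# Draw one bootstrap sample by index and record its frequency vector
# w = (w_1, ..., w_n), the multinomial weight vector of (1.7).
idx = [rng.randrange(n) for _ in range(n)]
counts = Counter(idx)
w = [counts.get(i, 0) for i in range(n)]

# The weights always sum to n, and a statistic of the resample can be
# computed directly from the weights; e.g. the resample mean:
resample_mean = sum(x[i] for i in idx) / n
weighted_mean = sum(wi * xi for wi, xi in zip(w, x)) / n
```

The two means agree exactly, which is the sense in which bootstrap sampling is weighted resampling.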
Table 1.1: Bootstrap estimators and associated properties for a statistic θ̂ for which analytic expressions are derived.
The intent of conventional bootstrap methodology is to estimate, for example,

    Var_F(θ̂),                                                       (1.11)

using the ideal bootstrap variance estimator

    Var_{w|x}(θ̂*).                                                  (1.12)

One motivation for this thesis is to investigate how well the bootstrap procedure (1.12) estimates (1.11). To do this we will evaluate properties of (1.12) such as

    E_x[Var_{w|x}(θ̂*)]

and

    Var_x[Var_{w|x}(θ̂*)].

These properties could be used diagnostically to uncover, for example, inherent bias in the bootstrap procedure (1.12) as represented by

    E_x[Var_{w|x}(θ̂*)] − Var_F(θ̂).

The objective of this thesis is to develop symbolic computational tools that can be used in practice to derive Monte Carlo-free bootstrap estimates and to help study properties of bootstrap estimators.
We explore two methods for the symbolic evaluation of the quantities in Table 1.1. The first approach is based on the exact computation of the ideal bootstrap expectation (1.8) as studied by Fisher and Hall (1991). Their algorithm requires enumeration of all distinct resamples which may be drawn with replacement from the original sample. We develop our method with respect to the class of L-estimators, that is, linear combinations of order statistics. However, the technique may be used for more general estimators. We will focus attention on the sample median, a statistic whose distribution is known exactly; this will facilitate illustration of the appropriate symbolic tools. The second approach involves the symbolic derivation of a bootstrap moment and its properties for statistics belonging to the class of M-estimators, that is, for any statistic θ̂ assumed to be the solution of the estimating equation

    Σ_{i=1}^{n} ψ(x_i, θ̂) = 0,                                      (1.13)

where the x_i's arise from the density f(x, θ). The method depends primarily on two key symbolic routines - one for computing asymptotic expansions of roots of estimating equations, and one for computing bootstrap versions of these expansions. The algorithms have been implemented using a computer-algebraic manipulation package called Mathematica (Wolfram, 1988). The methodology is applied to examples
from Efron and Tibshirani (1993) involving the correlation coefficient, the Behrens-Fisher statistic, double-bootstrapping and BCa bootstrap confidence intervals.
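As a concrete illustration of an estimating equation of the form Σψ(x_i, θ̂) = 0, the sketch below computes a Huber-type M-estimator of location by bisection. This example is ours (the tuning constant and data are arbitrary), not an implementation from the thesis.

```python
def huber_psi(u, c=1.345):
    # Huber's psi function: linear near zero, bounded beyond c.
    return max(-c, min(c, u))

def m_estimate(x, c=1.345):
    """Solve sum_i psi(x_i - theta) = 0 by bisection. The sum is
    monotone non-increasing in theta, so bisection locates the root."""
    lo, hi = min(x), max(x)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sum(huber_psi(xi - mid, c) for xi in x) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

x = [0.1, 0.4, -0.2, 0.3, 8.0]   # four central points and one gross outlier
theta_hat = m_estimate(x)        # stays near the bulk; the mean is 1.72
```

Because ψ is bounded, the outlier at 8.0 contributes at most c to the estimating equation, and the root lands near 0.49 rather than near the sample mean 1.72.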
Finally, we summarize the features/advantages of the analytic bootstrap developed in this thesis as follows:

- Analytic derivation and numerical evaluation of bootstrap estimates and their properties entirely avoids Monte Carlo resampling, resulting in faster and more efficient calculation.
- It can be used diagnostically to assess properties such as bias, variance or other moments of a bootstrap procedure.
- For the class of L-estimators (linear combinations of order statistics), the analytic bootstrap calculates ideal bootstrap estimators exactly for sample sizes up to n = 20 and calculates properties of these estimators exactly for much larger sample sizes.
- For the class of M-estimators (maximum likelihood-type estimators), the analytic bootstrap calculates ideal bootstrap estimators and properties of these exactly for linear statistics, and to within a specified order of accuracy for large sample sizes in the case of non-linear statistics.
- The analytic bootstrap may be applied to a broad class of statistics, namely, to any estimator which may be expressed as a non-linear smooth function of averages.
- The analytical approach for the class of M-estimators is less computer intensive for larger data sets. The reverse is true for conventional bootstrap calculations.
- Analytic bootstrap expressions are potentially insightful because component statistical quantities such as cumulants and sums are identified.
- Each analytic bootstrap expression (representing a bootstrap moment or a property of this) is a mathematical bootstrap formula which may be numerically evaluated efficiently for a given data set, or repeatedly evaluated for different data sets arising within a given application. The computational effort required to derive each formula need only be undertaken once. It is subsequently usable in perpetuity by other practitioners for the purpose of efficient numerical evaluation.
This thesis does not discuss the analytic bootstrap in the context of parametric bootstrap estimation, kernel estimation, or data which are not independent and identically distributed. These could be topics for future research.
1.2 Literature review
This thesis represents the marriage between two technologies, symbolic computation and bootstrap estimation, and a broad class of statistics known as maximum likelihood type estimators, or M-estimators. In the literature these three areas have largely been developed independently of each other.

Efron's (1979) article has spawned an entire literature on the topic of the bootstrap. The monograph by Efron and Tibshirani (1993) presents an intuitive account of the bootstrap and its diverse applications in statistical inference. It contains a comprehensive list of references on the subject. The lecture notes of Beran and Ducharme (1991) and Hall's (1992) monograph give a mathematical treatment of the bootstrap. In particular, Fisher and Hall (1991) discuss the exact computation of nonparametric bootstrap estimators using exhaustive enumeration of distinct resamples which may be drawn with replacement from the original sample. We use a related approach in Chapter 2 to develop the symbolic computation of bootstrap estimators and corresponding properties connected with the class of L-estimators. Oldford (1985) considers analytic evaluation of bootstrap estimates of bias and variance for linear and quadratic approximations of an estimator. This forms a basis for Chapter 3, which is developed using a generalized methodology found in Andrews and Feuerverger (1993).

Huber (1964, 1967) considered the class of maximum likelihood type estimators θ̂ that solve estimating equations such as (1.13). A detailed account of the theory of M-estimators can be found in Huber (1981). Andrews and Feuerverger (1993) show how a single arbitrary M-estimator, or a smooth function of such estimators, may be expressed as a particular sequence of increasingly accurate Taylor series approximations. Alternative presentations may also be found in Ferguson (1989) and McCullagh (1987). This approach is used in Chapter 3 as the basis for the development of algorithms permitting the symbolic representation of nonparametric bootstrap moments (and respective properties) associated with M-estimators.
In this thesis a set of general procedures for the symbolic derivation of bootstrap quantities is presented. One of the two contexts considered involves the use of asymptotic expansions. The derivation of asymptotic expansions by hand is typically a laborious task; it can be extremely time consuming and can involve exceedingly complicated algebra. The potential for clerical error is high. Symbolic computation provides a practical alternative to doing these expansions by hand - the development of appropriate computer-algebraic tools permits the statistician to derive the expansions correctly and reliably, in a fraction of the time taken by hand, and to efficiently evaluate the derived formulae in specific cases. Expansions are obtained quickly without the chance for human clerical error. This approach liberates statisticians from the tedious details of asymptotic calculation and extends their potential capabilities. Expansions derived by computer may moreover be saved in a form suitable for publication and then inserted directly into text, further reducing the chance for errors in publications. Programming errors can be reduced by testing software against known statistical results; the software can then be used to derive new results. The procedures developed in this thesis have been tested against simulated results documented in, for example, Efron and Tibshirani (1993), and against independent results obtained from the use of code written in S-plus.
Symbolic computation is a powerful facility available to research statisticians. The use of this technology in statistics is relatively new, its history extending over little more than a decade. Packages like Mathematica, Maple, Macsyma and Reduce are being used increasingly in areas of statistical theory where the algebra may be otherwise prohibitively complex. There are numerous books that describe a variety of useful applications of symbolic computation in statistics - see, for example, Heller (1991), Maeder (1991), Wagon (1991), Davenport et al. (1988). Silverman and Young (1987) use computer algebra to evaluate criteria on which the decision to smooth a bootstrap distribution is based. Young and Daniels (1990) apply symbolic programming to study bias in bootstrap estimation of tail probabilities for two simple situations. Venables (1985) utilizes symbolic computation heavily to obtain expansions of maximum marginal likelihood estimates. Other contributions include Kendall (1988, 1990), Young (1988) and Barndorff-Nielsen and Blaesild (1986). For a complete review of the use of symbolic computation in probability and statistics prior to 1991, see Kendall (1993). More recently, Stafford (1992), Stafford and Andrews (1993), Andrews and Stafford (1993), Andrews et al. (1992) and Wang (1992) use methods of symbolic computation to derive new statistical results as well as to confirm old results. Kendall (1994) develops routines in Reduce that aid in determining whether a particular expression is tensorially invariant. Stafford and Bellhouse (1995) develop procedures that are used to automate complicated algebraic calculations frequently encountered in sample survey theory. Additional contributions include Stafford et al. (1994), Stafford (1994), Andrews et al. (1994), Andrews and Stafford (1995), Bellhouse et al. (1995), Stafford (1995) and Stafford (1996).
1.3 Thesis outline
The thesis is organized as follows:
Chapter 2 presents algorithms for the symbolic computation of properties of nonparametric bootstrap estimators involving order statistics. The case of the median is considered in detail. Comparisons are made with Monte Carlo simulations.
In Chapter 3 symbolic procedures are developed and implemented for the analytical representation of bootstrap estimators and their properties with respect to statistics defined by estimating equations and functions of these. The technique is illustrated for the case of the correlation coefficient. Comparisons with conventionally computed bootstrap estimates are made.
The methodology is further illustrated in Chapter 4 with respect to the Behrens-Fisher statistic, double-bootstrapping and BCa bootstrap confidence intervals.
Finally, Chapter 5 provides concluding remarks and suggestions for future research.
Chapter 2
Symbolic Computation of
Bootstrapped Order Statistics
and their Moments : The Case of
the Median
2.1 Introduction
In this chapter we seek to compute properties of nonparametric bootstrap estimators involving order statistics exactly using symbolic computation. In particular, the method permits evaluation of the bias and the mean square error of such estimators analytically. The resulting expressions may be efficiently evaluated numerically. Such calculations require enumeration of all possible resamples which may be drawn with replacement from the original sample, and this will be shown to be feasible for moderate sample sizes.
Section 2 reviews standard bootstrap estimation. In Section 3 we first review an algorithm given in Fisher and Hall (1991) which describes the exact computation of nonparametric bootstrap estimators in small samples. We propose an extension to this procedure for the exact computation of moments of nonparametric bootstrap estimators in much larger samples. In Section 4 this methodology is illustrated by evaluating moments associated with bootstrapping order statistics, in particular, the median. This is achieved through the use of the λ-family of distributions that permits expression of moments of order statistics in closed form for a variety of distributional shapes. These results are compared with Monte Carlo simulations. Various algorithms developed in the chapter are also described. Conclusions are discussed in Section 5.
2.2 Review of Bootstrap Estimation of Standard Error

Suppose that x = (x_1, . . . , x_n) is an observed random sample of size n from a population with distribution function F and we wish to estimate the parameter of interest

    θ = t(F),                                                       (2.1)

expressed as some function t of F, on the basis of x. For this purpose we calculate an estimate of interest

    θ̂ = s(x),                                                       (2.2)

to which we wish to assign a standard error denoted by

    se_F(θ̂).                                                        (2.3)

(The hat symbol "^" always indicates quantities calculated from the observed data.) Now (2.3) may in turn be estimated by

    se_F̂(θ̂),                                                        (2.4)

that is, by replacing F with the empirical distribution function F̂ defined to be the discrete distribution that assigns probability 1/n to each value x_i for i = 1, . . . , n.
When F is unknown, or θ̂ is complicated, bootstrap methodology attempts to estimate (2.3) by applying this same principle. The bootstrap algorithm for estimating standard errors consists of several steps which are given as follows. We use notation similar to that used by Efron and Tibshirani (1993, Chapter 6). First, B independent bootstrap samples x*_1, . . . , x*_B are selected, each consisting of n observations drawn with replacement from x. Each bootstrap sample, generically represented as

    x* = (x*_1, . . . , x*_n),                                      (2.5)

is a randomized, or resampled, version of the original data set x, and hence consists of members of x, some appearing zero times, some appearing once, some appearing twice, etc. Hence, the bootstrap sample x* generates a corresponding set of frequencies (w_1, . . . , w_n) conditional on the original sample x, that is,

    x* ≡ (x; w),  with  w | x ~ Multinomial(n; 1/n, . . . , 1/n),   (2.6)

where w = (w_1, . . . , w_n) is a multinomial vector of integer weights. The bootstrap replication corresponding to each bootstrap sample is then evaluated as

    θ̂*(b) = s(x*_b),  b = 1, . . . , B.                             (2.7)

Finally, the bootstrap estimate of the standard error of θ̂ is calculated as the sample standard deviation of the B replications

    ŝe_B = { Σ_{b=1}^{B} [θ̂*(b) − θ̂*(·)]² / (B − 1) }^{1/2},        (2.8)

where θ̂*(·) = Σ_{b=1}^{B} θ̂*(b)/B. The limit of ŝe_B(θ̂*) as B → ∞ is called the ideal bootstrap estimate of se_F̂(θ̂), that is,

    lim_{B→∞} ŝe_B(θ̂*) = se_F̂(θ̂*).                                 (2.9)
The bootstrap algorithm described above is a computational method for obtaining an approximation to the numerical value of se_F̂(θ̂*). This ideal bootstrap estimator and its empirical approximation ŝe_B are sometimes referred to as nonparametric bootstrap estimates because they are based on F̂, the nonparametric estimate of the population F. Parametric bootstrap estimation uses a different estimate of F but this will not be discussed in this thesis.

Expressing (2.3), (2.4), (2.9) and (2.8) in terms of variances we have, respectively,

    Var_F(θ̂) = [se_F(θ̂)]²,                                         (2.10)

    Var_F̂(θ̂) = [se_F̂(θ̂)]²,                                         (2.11)

    Var_F̂(θ̂*) = [se_F̂(θ̂*)]²,                                       (2.12)

    V̂ar_B(θ̂*) = [ŝe_B(θ̂*)]².                                       (2.13)

The quantities Var_F̂(θ̂*) and V̂ar_B(θ̂*) may be made arbitrarily close for large enough B and hence will be assumed to be effectively equal in this discussion.

In the next section we will develop a diagnostic tool that will enable us to calculate bias and other properties of the ideal bootstrap variance estimator Var_F̂(θ̂*), and the method will be illustrated in the case of order statistics. This methodology is capable of giving analytical or numerical results and is free of Monte Carlo resampling.
2.3 Properties of Bootstrap Estimators
2.3.1 Introduction
In this section we propose a procedure for the exact computation of moments of
nonparametric bootstrap estimators. We first review an algorithm given in Fisher and Hall (1991) which describes the exact computation of nonparametric bootstrap estimators in small samples.
As was discussed in Chapter 1 we can express the bootstrap sample as

    x* = (x; w)                                                     (2.14)

and the bootstrap statistic θ̂* as

    θ̂* = s(x; w),                                                   (2.15)

where w is a multinomial vector of integer weights. Since bootstrap sampling is identical with weighted resampling we may represent a generic bootstrap estimator as

    T = E_{w|x}[g(θ̂*)]                                              (2.16)

for some function g. Indeed, when

    g(θ̂*) = [θ̂* − E_{w|x}(θ̂*)]²,                                    (2.17)

then (2.16) gives the ideal bootstrap variance estimator

    Var_{w|x}(θ̂*).                                                  (2.18)

The quantity E_{w|x}(θ̂*) in (2.18) may be used to assess the ideal bootstrap estimate of bias of θ̂ by

    bias_F̂ = E_{w|x}(θ̂*) − θ̂.                                       (2.19)

For most statistics that arise in practice, (2.19) must be approximated by Monte Carlo simulation. In particular, the bootstrap estimate of bias based on B replications is (2.19) with θ̂*(·) = Σ_{b=1}^{B} θ̂*(b)/B substituted for E_{w|x}(θ̂*), so that

    b̂ias_B = θ̂*(·) − θ̂.                                             (2.20)
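The Monte Carlo bias estimate can be sketched briefly; the following Python illustration is ours, with the plug-in variance chosen as the statistic because its ideal bootstrap bias is known in closed form.

```python
import random

def bootstrap_bias(x, stat, B=4000, seed=2):
    """Monte Carlo approximation to the bias estimate: average the
    statistic over B resamples and subtract theta-hat, as in (2.20)."""
    rng = random.Random(seed)
    n = len(x)
    reps = [stat([rng.choice(x) for _ in range(n)]) for _ in range(B)]
    return sum(reps) / B - stat(x)

def var_plugin(s):
    # Plug-in variance (divisor n). Under resampling its expectation is
    # (n - 1)/n times var_plugin(x), so the ideal bootstrap bias estimate
    # equals -var_plugin(x)/n; the Monte Carlo value hovers nearby.
    m = sum(s) / len(s)
    return sum((v - m) ** 2 for v in s) / len(s)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
bias_hat = bootstrap_bias(x, var_plugin)
```

For these data var_plugin(x) ≈ 6.609 with n = 8, so the ideal bias estimate is about −0.826, and bias_hat approximates it with Monte Carlo error of order B^{-1/2}.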
Fisher and Hall (1991) show how Monte Carlo resampling may be circumvented by describing an algorithm for the exact computation of nonparametric bootstrap
estimators in small samples. This technique requires enumeration of all distinct resamples which may be drawn with replacement from the original sample, and also calculation of the probability and value of the statistic associated with each resample. We shall review this procedure for the exact calculation of E_{w|x}(θ̂*) in the next sub-section.
2.3.2 Exact Computation of Bootstrap Estimators
Suppose that we wish to calculate the exact value of the ideal bootstrap estimator of the mean of a statistic θ̂, given by E_{w|x}(θ̂*). The discussion below follows Fisher and Hall (1991).
If the sample x is of size n, and if all elements of x are distinct, then the number of different possible resamples x* of size n equals the number, N(n), of distinct ways of placing n indistinguishable objects into n numbered boxes, the boxes being allowed to contain any number of objects. Indeed, letting k_i denote the number of times x_i is repeated in x*, then the number of different possible resamples equals the number of ways of choosing nonnegative integers k = (k_1, …, k_n) satisfying k_1 + … + k_n = n. This is given by

N(n) = C(2n − 1, n) = (2n − 1)!/[n!(n − 1)!].   (2.21)

See Hall (1992, Appendix I) for a proof of (2.21).
Let W(n) denote the set of all distinct integer n-tuples w = (w_1, …, w_n), henceforth called partitions, such that w_1 ≥ w_2 ≥ … ≥ w_n ≥ 0 and Σ w_i = n. Given (w_1, …, w_n) ∈ W(n), let M(w_1, …, w_n) be the set of all ordered n-tuples k = (k_1, …, k_n) which are simply permutations of (w_1, …, w_n). Defining

K(n) = ∪_{w ∈ W(n)} M(w),
then K(n) contains precisely N(n) elements given by (2.21). The probability of obtaining a specific resample x* based on the n-tuple k = (k_1, …, k_n) is the multinomial probability p(k) given by

p(k) = (n!/(k_1! ⋯ k_n!)) n^(−n),   (2.23)
with the k_i ≥ 0 as above. Furthermore, let θ̂*(k) denote the calculated bootstrap replication of θ̂ corresponding to the multinomial n-tuple k. For example, if k = (1, 2, 0, 1) in the case n = 4, then the bootstrap replication θ̂*(k) is correspondingly calculated from the resample x* = (x_1, x_2, x_2, x_4). Then the ideal bootstrap estimator of the mean of θ̂ may be computed exactly and directly as the weighted average of the N(n) bootstrap values θ̂*(k):

E_{w|x}(θ̂*) = Σ_{k ∈ K(n)} p(k) θ̂*(k),   (2.24)
or alternatively,

E_{w|x}(θ̂*) = ((n − 1)!/n^(n−1)) Σ_{w ∈ W(n)} (1/(w_1! ⋯ w_n!)) Σ_{k ∈ M(w)} θ̂*(k).   (2.25)

Equation (2.25) is computationally more efficient than (2.24). For example, for n = 4, (2.25) is 7 times more computationally efficient than (2.24).
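As a quick illustration of (2.23)–(2.24) — my own sketch in Python, not part of the thesis code — the ideal bootstrap mean of, say, the sample median can be computed either by brute force over all n^n equally likely index resamples, or over the N(n) distinct resamples weighted by their multinomial probabilities p(k); the two agree exactly:

```python
from itertools import product, combinations_with_replacement
from math import factorial
from statistics import median

def ideal_boot_mean_brute(x, stat=median):
    # (2.24) by brute force: average the statistic over all n^n index resamples
    n = len(x)
    vals = [stat([x[i] for i in idx]) for idx in product(range(n), repeat=n)]
    return sum(vals) / n ** n

def ideal_boot_mean_multiset(x, stat=median):
    # sum over the N(n) distinct resamples, each weighted by its multinomial
    # probability p(k) = n!/(k_1! ... k_n!) * n^(-n), as in (2.23)
    n = len(x)
    total = 0.0
    for idx in combinations_with_replacement(range(n), n):
        k = [idx.count(i) for i in range(n)]
        p = factorial(n) / n ** n
        for ki in k:
            p /= factorial(ki)
        total += p * stat([x[i] for i in idx])
    return total

x = [1.2, 3.4, 2.1, 5.0]          # n = 4: 256 index tuples, but only N(4) = 35 multisets
print(ideal_boot_mean_brute(x), ideal_boot_mean_multiset(x))   # the two agree
```

The multiset version already shows the saving that grouping resamples brings; the partition form (2.25) groups further still.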
If the sequence w = (w_1, …, w_n) consists of precisely r_i j_i's for 1 ≤ i ≤ q, where the j_i's are distinct and the r_i's sum to n, then M(w), that is, the second summation in (2.25), contains exactly

L(w) = n!/(r_1! ⋯ r_q!)   (2.26)

elements. Table 2.1 lists respective values of N(n) and the number of partitions w contained in W(n) for various sample sizes. We consider the example below.
Table 2.1: Total number of distinct resamples and number of partitions for various sample sizes

n     N(n)          Number of partitions
3     10            3
4     35            5
15    77,558,760    176

Example 1

We consider in detail the case for n = 4. Table 2.2 lists the 5 partitions comprising W(4) with the respective number of distinct permutations L(w), the simple multinomial probabilities p(k), and the probabilities p(w) of the partitions. The probabilities p(w) will be discussed more fully in the next section.

The L(2,1,1,0) = 12 distinct permutations k contained in M(2,1,1,0), for example, are listed in Table 2.3 below:
The ideal bootstrap estimator (2.25) may now be expressed as

E_{w|x}(θ̂*) = (3!/4³) [ (1/4!) Σ_{k∈M(4,0,0,0)} θ̂*(k) + (1/3!) Σ_{k∈M(3,1,0,0)} θ̂*(k) + (1/(2!2!)) Σ_{k∈M(2,2,0,0)} θ̂*(k) + (1/2!) Σ_{k∈M(2,1,1,0)} θ̂*(k) + Σ_{k∈M(1,1,1,1)} θ̂*(k) ].   (2.27)
Table 2.2: The 5 partitions associated with a sample size of 4 with corresponding multinomial probabilities

w            L(w)   p(k)      p(w)
(4,0,0,0)    4      1/256     4/256
(3,1,0,0)    12     4/256     48/256
(2,2,0,0)    6      6/256     36/256
(2,1,1,0)    12     12/256    144/256
(1,1,1,1)    1      24/256    24/256

Table 2.3: The 12 distinct permutations of the partition (2,1,1,0)

(2,1,1,0) (2,1,0,1) (2,0,1,1) (1,2,1,0) (1,2,0,1) (0,2,1,1) (1,1,2,0) (1,0,2,1) (0,1,2,1) (1,1,0,2) (1,0,1,2) (0,1,1,2)
For example, the 4th summation in (2.27) represents the sum of 12 bootstrap replicates calculated from distinct resamples associated with the permutations listed in Table 2.3. A similar expression for exactly evaluating the ideal bootstrap estimator of the variance of θ̂ could also be obtained.
2.3.3 Moments of Bootstrap Estimators
We seek to calculate the mean and variance of the ideal bootstrap quantity

T = E_{w|x}[g(θ̂*)]   (2.28)

for some function g, with respect to the original distribution F. Indeed, the first moment of (2.28) may be evaluated exactly, using (2.25), as

E_x(T) = ((n − 1)!/n^(n−1)) Σ_{w ∈ W(n)} (1/(w_1! ⋯ w_n!)) Σ_{k ∈ M(w)} E_x[g(θ̂*(k))] = Σ_{w ∈ W(n)} L(w) p(k) E_w[g(θ̂*)],   (2.29)

where

E_w[g(θ̂*)] = E_x[g(θ̂*(k))]

for any k ∈ M(w), and the quantities p(k) and L(w) are given respectively by (2.23) and (2.26). Clearly Σ_{w ∈ W(n)} L(w) = N(n) as given by (2.21).
Note that the final expression in (2.29) may be alternatively obtained by considering that

E_x(T) = Σ_{w ∈ W(n)} p(w) E_w[g(θ̂*)], where p(w) = L(w) p(k).

Applying this in the context of Example 1, with n = 4,
permits the first moment of (2.27) to be expressed exactly, that is,

E_x[E_{w|x}(θ̂*)] = Σ_{w ∈ W(4)} p(w) E_w(θ̂*),

where E_w(θ̂*) = E_x[θ̂*(w_1, w_2, w_3, w_4)] represents the first moment of the bootstrap replication θ̂* with respect to the specific partition w = (w_1, w_2, w_3, w_4).
The algorithm associated with (2.29) could be expressed as:

Algorithm 2.1 (Property of a bootstrap moment):

• Construct the list l of all partitions w ∈ W(n) for a given positive integer n.

• Over all the partitions w in l, sum the quantities p(w) E_w[g(θ̂*)],
where p(w) = L(w) p(k).
The list l of all partitions w ∈ W(n) for a given positive integer n can be created from the algorithm given in Wang (1992, Chapter 2). The tool for calculating such partitions is called FPL[n] and the code is available in Appendix A.1 in a file called partition.m.
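The partition enumeration that FPL[n] performs can be sketched in a few lines of Python (an illustrative reimplementation of mine, not the thesis's Mathematica code):

```python
def partitions(n, max_part=None):
    # generate all partitions of n as weakly decreasing tuples of positive parts
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# zero-pad to length n to obtain the elements of W(n)
W = [p + (0,) * (4 - len(p)) for p in partitions(4)]
print(len(W), W)   # 5 partitions of 4
```

For n = 15 the same generator yields the 176 partitions referred to later in the chapter.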
We shall now apply this methodology to the case of assessing the mean and variance of bootstrap quantities associated with order statistics, in particular, the sample median.
2.4 Application to Order Statistics : The Case of
the Median
In this section we apply the methodology developed in the preceding discussion to
the case where θ̂ = sample median, and θ̂* denotes a replication of the bootstrapped sample median. The median is a statistic whose distribution is known exactly; this will facilitate illustration of the methodology described above.
Before proceeding we review some general properties of order statistics.
2.4.1 Some Properties of Order Statistics
Let x_1, …, x_n be a random sample of n distinct observations from an absolutely continuous population with pdf f(x) and cdf F(x). Let

x_(1) ≤ x_(2) ≤ … ≤ x_(n)

be the order statistics obtained by arranging the above sample in increasing order of magnitude. The density function of x_(i) (1 ≤ i ≤ n) is well known to be

f_(i)(x) = (n!/((i − 1)!(n − i)!)) [F(x)]^(i−1) [1 − F(x)]^(n−i) f(x),   −∞ < x < ∞,   (2.34)

from which it follows that the k-th moment of x_(i) for k = 1, 2, … may be presented as

E(x_(i)^k) = ∫_{−∞}^{∞} x^k f_(i)(x) dx.   (2.35)
The variance of x_(i) may then be expressed in terms of (2.35) as Var(x_(i)) = E(x_(i)²) − [E(x_(i))]².   (2.36)
In many situations (2.35) does not exist explicitly in closed form. An example of this is the case where F(x) represents the cumulative normal distribution. In order to circumvent this difficulty we shall express (2.35) alternatively in terms of the inverse probability integral transformation. As will be seen shortly, this provides a means of carrying an initial variable with a uniform distribution into a new variable with a special distribution of interest. For a description of this transformation see, for example, Fraser (1976, pages 86–87). This representation of (2.35) will then be amenable to closed form expression through the use of a special family of approximating distributions to be described below.
Let u_1, u_2, …, u_n be a random sample from the U[0,1] distribution and let

u_(1) ≤ u_(2) ≤ … ≤ u_(n)

be the order statistics obtained from this sample. Define the inverse cumulative distribution function of the population as

F^(−1)(u) = inf{x : F(x) ≥ u},   0 < u < 1.   (2.37)

Now it is well known that if u is a U[0,1] random variable, then F^(−1)(u) has distribution function F. We have

F(F^(−1)(u)) = u

since F is continuous. Further, it can easily be verified by using (2.37) that

(x_(1), …, x_(n)) =_d (F^(−1)(u_(1)), …, F^(−1)(u_(n)))   (2.38)

and

x_(i) =_d F^(−1)(u_(i)),   (2.39)

where =_d is to be read as "has the same distribution as". The distributional relations in (2.38) and (2.39) were observed by Scheffé and Tukey (1945). Hence, using (2.37)–(2.39) permits (2.35) to be written alternatively and more compactly as

E(x_(i)^k) = (n!/((i − 1)!(n − i)!)) ∫_0^1 [F^(−1)(u)]^k u^(i−1) (1 − u)^(n−i) du.   (2.40)
In the particular case when

x_i ~ U[0,1],   i = 1, 2, …, n,

then (2.40) can be expressed explicitly in closed form. Indeed, (2.39) is modified to

F^(−1)(u_(i)) =_d u_(i),

from whence it follows that

E(x_(i)^k) = E(u_(i)^k) = ∫_0^1 u^k f_(i)(u) du,   (2.41)

where

f_(i)(u) = u^(i−1) (1 − u)^(n−i) / B(i, n − i + 1)

is the density function of a two-parameter beta distribution, and B(·,·) is the complete beta function defined by

B(a, b) = ∫_0^1 t^(a−1) (1 − t)^(b−1) dt = Γ(a)Γ(b)/Γ(a + b).

Upon simplification, (2.41) yields

E(u_(i)^k) = B(i + k, n − i + 1)/B(i, n − i + 1).   (2.42)
For example, the mean and variance of the sample median x̃ for odd sample size may
be shown to be

E(x̃) = 1/2   and   Var(x̃) = 1/(4(n + 2)).

Table 2.4: Approximating distributions for various λ values

λ        Approximating distribution
0.135    Normal
0        Logistic
1        Uniform
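These closed forms are easy to check numerically. The following sketch (mine, not the thesis code) uses the beta-moment identity E[u_(i)^k] = B(i+k, n−i+1)/B(i, n−i+1) for uniform order statistics to verify the mean 1/2 and variance 1/(4(n+2)) of the uniform sample median:

```python
from math import gamma

def beta(a, b):
    # complete beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def u_order_moment(k, i, n):
    # E[u_(i)^k] for U[0,1] order statistics: B(i+k, n-i+1)/B(i, n-i+1)
    return beta(i + k, n - i + 1) / beta(i, n - i + 1)

for n in (3, 5, 7):
    i = (n + 1) // 2                          # the median index for odd n
    m1 = u_order_moment(1, i, n)
    var = u_order_moment(2, i, n) - m1 ** 2
    print(n, m1, var, 1 / (4 * (n + 2)))      # mean 0.5; variance matches 1/(4(n+2))
```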
The above example illustrates the use of the particular inverse function given by F^(−1)(u) = u, which permits (2.40) to be expressed in closed form in this particular case. A special form of F^(−1)(u) permitting closed form expression of moments of order statistics in a variety of cases was provided by Hastings et al. (1947). They suggested the class of approximating distributions implicit in the transformation

x = F^(−1)(u) = u^λ − (1 − u)^λ,   for λ > 0,   (2.45)

where u has a U[0,1] distribution, x is the random variable whose order statistics interest us, and λ is a distributional shape parameter. All the distributions represented by (2.45) for various values of λ are symmetrical. Table 2.4 presents 3 examples of approximating distributions generated by (2.45) for the indicated values of λ.
Simulated data sets generated according to (2.45) for the 3 values of λ shown in Table 2.4 may be seen to exhibit properties closely mimicking those of the normal, logistic and uniform distributions, respectively.

By using the representation (2.45) we are now able to express moments of order statistics in closed form. Indeed, substitution of (2.45) into (2.40) yields linear
combinations of beta functions. For general sample size n and distributional shape parameter λ, the second moment of the i-th order statistic can be expressed explicitly and exactly in terms of beta functions as

E(x_(i)²) = (n!/((i − 1)!(n − i)!)) ∫_0^1 [u^λ − (1 − u)^λ]² u^(i−1) (1 − u)^(n−i) du
          = (n!/((i − 1)!(n − i)!)) [B(i, 2λ + n − i + 1) − 2B(λ + i, λ + n − i + 1) + B(2λ + i, n − i + 1)].   (2.46)
A simple algorithm was implemented in Mathematica that computes (2.46). The function is called SecondMomOrder[i, n, λ].
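A direct Python transcription of (2.46) — an illustrative sketch of mine; the thesis function itself is Mathematica — is straightforward:

```python
from math import gamma, factorial

def beta(a, b):
    # complete beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def second_mom_order(i, n, lam):
    # E[x_(i)^2] for the lambda-family x = u^lam - (1 - u)^lam, per (2.46)
    c = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return c * (beta(i, 2 * lam + n - i + 1)
                - 2 * beta(lam + i, lam + n - i + 1)
                + beta(2 * lam + i, n - i + 1))

print(second_mom_order(2, 3, 0.135))   # ≈ 0.0174 (near-normal case)
print(second_mom_order(2, 3, 1))       # median of U[-1,1] for n = 3
```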
For example, when λ = 0.135 the distribution of the random variable x generated by

x = u^0.135 − (1 − u)^0.135

approximates a normal distribution. When n = 3 the second moment of the median E(x̃²) is

SecondMomOrder[2, 3, 0.135] = 0.0174397,
which is precisely (2.46) for i = 2, n = 3, λ = 0.135. When λ = 1 then (2.45) becomes

x = 2u − 1,

which is distributed as U[−1, 1]. In this case, E(x̃²) is

SecondMomOrder[2, 3, 1] = 0.2.
2.4.2 The Case of the Median
We now use the tools discussed in the previous sections to evaluate properties of bootstrap estimators associated with the median. We commence by considering in detail the first moment of the bootstrap estimator of the second moment of the median in the case when the random sample is of size n = 3. This result will suggest that properties (such as moments or cumulants) of bootstrapped moments of order statistics are linear combinations of bootstrapped moments of order statistics of different sample sizes. Later we shall consider computations of such moments for various sample sizes and distributions. We use the following Table 2.5, which is analogous to Table 2.2 in the present context:

Table 2.5: Partitions and multinomial probabilities for a sample size of 3

w          L(w)   p(k)    p(w)
(3,0,0)    3      1/27    3/27
(2,1,0)    6      3/27    18/27
(1,1,1)    1      6/27    6/27
Noting that there are three partitions for n = 3 and substituting g(θ̂*) = (x̃*)² into (2.31) gives

E_x[E_{w|x}((x̃*)²)] = (1/9) E_x[(x̃*)² | (3,0,0)] + (6/9) E_x[(x̃*)² | (2,1,0)] + (2/9) E_x[(x̃*)² | (1,1,1)],   (2.50)

where x̃* | w_i denotes the sample median associated with the i-th partition w_i.

Consider first x̃* | (3,0,0) in (2.50). This denotes that the sample median is associated with the partition (3,0,0). The partition (3,0,0) places all the weight on the first observation; it may be interpreted as representing three identical realizations of a single random variable x, in which case n = 1. Hence,

x̃* | (3,0,0) =_d x.
Hence, the median of three identical realizations reduces to the median of a sample of size n = 1, i.e. a single random variable. Therefore,

E_x[(x̃*)² | (3,0,0)] = E(x²).
Next consider x̃* | (2,1,0). The partition (2,1,0) may be interpreted as representing the realizations of two distinct random variables, so that n = 2. One of the variables provides 2 identical observations while the other has generated a single different observation. There are thus two order statistics, each of which has probability 1/2 of representing the median. Hence,

x̃* | (2,1,0) =_d x_(1) or x_(2), each with probability 1/2,

where x_(1) ≤ x_(2) denote the order statistics of a sample of size 2. Applying the expectation operator gives

E_x[(x̃*)² | (2,1,0)] = (1/2) E(x_(1)²) + (1/2) E(x_(2)²),

by using (2.35).
Finally, consider x̃* | (1,1,1). The partition (1,1,1) may be interpreted as representing respective realizations of 3 distinct random variables. In this case n = 3. There are thus three distinct order statistics and it follows that

x̃* | (1,1,1) =_d x_(2),
which provides

E_x[(x̃*)² | (1,1,1)] = E(x_(2)²).

Equation (2.50) now becomes

E_x[E_{w|x}((x̃*)²)] = (7/9) E(x²) + (2/9) E(x̃²),   (2.52)

where E(x̃²) = E(x_(2)²). Equation (2.52) represents a linear combination of a second moment for sample size 1 and a second moment of the median for a sample size 3.
It is interesting to note that this expression is weighted in favour of the sample of size 1. It is an analytical expression, a formula, which is true for any distribution. It can be efficiently evaluated, numerically and exactly, for any distribution of interest belonging to the λ-family of distributions generated by (2.45). Similar analytical formulas may be generated for any sample size, although we consider only odd sample sizes in this thesis. The method allows us to study, for example in this context, the properties of the bootstrap procedure which estimates the second moment of the sample median.
The algorithm describing the calculation of the conditional expectation of the squared sample median for a given partition, for odd values of n and arbitrary parent distribution function F(x), is given by:

Algorithm 2.2 (Calculation of E_{w|x}[(x̃*)²] for F(x) and odd n):

• Compute the number of distinct order statistics for the given partition. This equals the number of non-zero elements in the partition. In the case of repeated observations the corresponding order statistic is the common value.
• Compute all permutations of the given partition. Each permutation assumes observations arranged in descending order.

• For each permutation record the order statistic which corresponds to the median.

• Count the number of permutations in which the first order statistic is the median. Do the same for all order statistics.

• Divide each count by the total number of permutations. Each order statistic is hence associated with a numerical probability that it represents the median.

• The conditional expectation of the squared sample median for a given partition is the sum of products of the expectation (over an arbitrary parent F) of each squared order statistic and its associated probability of representing the median for that partition. The expectation is evaluated, in general, by numerical integration.

• If F is a member of the λ-family then closed forms of moments of order statistics are evaluated using the symbolic function SecondMomOrder[i, n, λ].

The tool which gives the order statistics of a partition w with the respective probabilities they have of representing the median is called Med[w] and the code is available in Appendix A.1 in a file called os.m.
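One way to realize the core of Algorithm 2.2 and the tool Med[w] in Python — my own sketch, following the interpretation used above for the partitions (3,0,0), (2,1,0) and (1,1,1): the m non-zero weights are multiplicities attached to the order statistics of a sample of size m — is:

```python
from itertools import permutations
from fractions import Fraction

def median_probs(w):
    # For a partition w of odd n, return the probability that each order
    # statistic (of the m distinct underlying variables) is the resample median.
    nz = [v for v in w if v > 0]
    assigns = set(permutations(nz))          # distinct assignments of multiplicities
    counts = {}
    for a in assigns:
        resample = sorted(i + 1 for i, mult in enumerate(a) for _ in range(mult))
        med = resample[len(resample) // 2]   # index of the median order statistic
        counts[med] = counts.get(med, 0) + 1
    return {i: Fraction(c, len(assigns)) for i, c in sorted(counts.items())}

print(median_probs((3, 0, 0)))   # order statistic 1 with probability 1
print(median_probs((2, 1, 0)))   # order statistics 1 and 2, each with probability 1/2
print(median_probs((1, 1, 1)))   # order statistic 2 (the median of 3) with probability 1
```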
Similarly, the first moment of the bootstrap estimate of variance of the median may be generally expressed as

E_x[Var_{w|x}(x̃*)],   (2.53)

which for the case n = 3 becomes

E_x[Var_{w|x}(x̃*)] = (7/9) × Var(x) + (2/9) × Var(x̃).   (2.54)
The exact bias of the bootstrap estimator of variance of the median for n = 3 follows immediately from (2.54) as

BIAS[Var_{w|x}(x̃*)] = E_x[Var_{w|x}(x̃*)] − Var(x̃)
                    = (7/9) × Var(x) + (2/9) × Var(x̃) − Var(x̃)
                    = (7/9) × [Var(x) − Var(x̃)].   (2.55)

A similar evaluation with respect to (2.52) yields

BIAS[E_{w|x}((x̃*)²)] = (7/9) × [E(x²) − E(x̃²)].   (2.56)
Note that the discussion in this chapter centers on the evaluation of

E_x[E_{w|x}((x̃*)²)],   (2.58)

which is identical to

E_x[Var_{w|x}(x̃*)] + Var(x̃),

assuming that

E(x̃) = 0,   (2.59)

that is, that the population median is zero. More generally, for an estimator θ̂ where we denote E_x(θ̂) = θ, the inner expectation in (2.58) may be written as

E_{w|x}((θ̂* − θ)²) ≈ E_{w|x}((θ̂* − θ̂)²) + (θ̂ − θ)²,   (2.60)

assuming that E_{w|x}(θ̂* − θ̂) is approximately uncorrelated with θ̂ − θ. Considering
the expectation of (2.60) over x, we then observe that

E_x[E_{w|x}((θ̂* − θ)²)] ≈ E_x[E_{w|x}((θ̂* − θ̂)²)] + E_x[(θ̂ − θ)²] ≈ 2 E_x[(θ̂ − θ)²],   (2.61)

assuming E_{w|x}((θ̂* − θ̂)²) is close to (θ̂ − θ)². It follows that

E_x[E_{w|x}((θ̂*)²)] ≈ 2 E(θ̂²)   (2.62)

if θ = 0 is assumed. In the case of the median we have assumed (2.59) in this chapter and thus we will observe

E_x[E_{w|x}((x̃*)²)] ≈ 2 E(x̃²).   (2.63)

The relationship given by (2.63) is corroborated by numerical results discussed later. This does not signify that the bootstrap second moment of the median, E_{w|x}((x̃*)²), is positively biased, but only that (2.63) is due to the assumption (2.59) that has been made.
The Mathematica functions ExpBoot[n, λ] and VarBoot[n, λ] respectively compute

E_x[E_{w|x}((x̃*)²)]   (2.64)

and

Var_x[E_{w|x}((x̃*)²)]   (2.65)

by using the function SecondMomOrder[i, n, λ] seen above, and so work only for the λ-family of approximating distributions and odd sample size n. The bias of the
bootstrap estimator of the second moment of the median may be expressed as

BIAS = E_x[E_{w|x}((x̃*)²)] − E_true(x̃²),   (2.66)

where E_true(x̃²) represents the true value of the second moment of the median. The latter can be computed by using SecondMomOrder[i, n, λ] whenever x_(i) = x̃.

The algorithms describing the evaluation of each of (2.64) and (2.65) are given next:
Algorithm 2.3 (Calculation of ExpBoot[n, λ]):

• Derive all partitions of the positive integer n using FPL[n].

• Compute the multinomial-type probability of occurrence of each partition w using Prob[w] from Appendix A.1.

• Use Med[w] to find the order statistics for each partition w having non-zero probabilities of representing the median. Compute the probabilities.

• Compute the conditional expectation of the squared sample median for each partition by summing the products of SecondMomOrder[i, n, λ] and the respective probabilities the order statistics have of representing the median.

• Sum the conditional expectations found in the previous step, weighted by the respective multinomial-type probabilities of occurrence of the partitions.

The code corresponding to the tool ExpBoot[n, λ] is available in Appendix A.1 in a file called Expboot.m.
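Putting the pieces together, the computation that Algorithm 2.3 describes can be sketched end to end in Python (my reconstruction, not the thesis's Mathematica code; the function names are mine):

```python
from itertools import permutations
from math import gamma, factorial

def beta(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def second_mom_order(i, n, lam):
    # E[x_(i)^2] under the lambda-family, eq. (2.46)
    c = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return c * (beta(i, 2 * lam + n - i + 1)
                - 2 * beta(lam + i, lam + n - i + 1)
                + beta(2 * lam + i, n - i + 1))

def partitions(n, mx=None):
    mx = mx or n
    if n == 0:
        yield ()
    else:
        for f in range(min(n, mx), 0, -1):
            for rest in partitions(n - f, f):
                yield (f,) + rest

def expboot(n, lam):
    # E_x[E_{w|x}((median*)^2)] for odd n and a lambda-family parent
    total = 0.0
    for part in partitions(n):
        w = part + (0,) * (n - len(part))
        L = len(set(permutations(w)))        # L(w), eq. (2.26)
        p_w = L * factorial(n) / n ** n      # p(w) = L(w) p(k)
        for wi in w:
            p_w /= factorial(wi)
        nz = [v for v in w if v]
        m = len(nz)                          # number of distinct order statistics
        assigns = set(permutations(nz))      # multiplicity-to-order-statistic assignments
        for a in assigns:
            res = sorted(i + 1 for i, mult in enumerate(a) for _ in range(mult))
            med_stat = res[len(res) // 2]    # which order statistic is the median
            total += (p_w / len(assigns)) * second_mom_order(med_stat, m, lam)
    return total

print(expboot(3, 0.135))   # ≈ 0.0342
```

For n = 3 this reduces to a weighted combination of E(x²) and the second moment of the median, as derived from first principles above.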
Algorithm 2.4 (Calculation of VarBoot[n, λ]):

• Derive all partitions of the positive integer n using FPL[n].

• Compute the square of the multinomial-type probability of occurrence of each partition w using Prob[w] from Appendix A.1.
• Use Med[w] to find the order statistics for each partition w having non-zero probabilities of representing the median. Compute the probabilities.

• Compute the conditional variance of the squared sample median for each partition by summing the products of VarSqOrder[i, n, λ] with the respective probabilities the order statistics have of representing the median. This tool is a function of SecondMomOrder[i, n, λ].

• Sum the conditional variances found in the previous step, weighted by the respective squared multinomial-type probabilities of occurrence of the partitions.

The code corresponding to the tool VarBoot[n, λ] is available in Appendix A.1 in a file called Varboot.m.
The expressions (2.64), (2.65) and (2.66) represent various properties of the bootstrap estimator of the second moment of the median which can be explicitly evaluated for given n and λ. For example, when n = 9 and λ = 0.135, (2.64) evaluates to an explicit numerical value. However, setting only n = 3 in ExpBoot[n, λ] yields the formula
in which λ remains a free symbolic parameter. This is precisely the λ-family equivalent of the first expression in (2.52). Similar analytical representations may be given for any n. We qualify this, however, as follows. Note that (2.67) contains 3 major terms. Table 2.1 indicates that for n = 3, ten terms would be involved if we were evaluating the bootstrap estimator directly. The exact calculation of the bootstrap estimator (based on exhaustive enumeration of resamples) for a sample size of n = 15 would involve the enumeration of almost 80,000,000 resamples, whereas a mere 176 partitions would be required for the exact computation of a property of the same bootstrap estimator! This indicates that the calculations of moments of bootstrap estimators using the methodology developed in this chapter may be carried out for much larger sample sizes than would be feasible if bootstrap estimators were to be evaluated directly.
Recall that we have shown by first principles that for n = 3, (2.64) reduces to (2.52). We can hence compute the right hand side of (2.52) directly as a check on the Mathematica code underlying ExpBoot[n, λ]. Indeed, E(x²) in (2.52) is computed as SecondMomOrder[1, 1, 0.135], and E(x̃²) was given in (2.48) as

SecondMomOrder[2, 3, 0.135] = 0.0174397.
Thus the right hand side of (2.52) is

(7/9) × E(x²) + (2/9) × E(x̃²) = 0.0342303,

which substantiates the left hand side of (2.52) as computed by ExpBoot[n, λ]. Similarly, when n = 3, λ = 1 we have x = 2u − 1 ~ U[−1, 1], and then the right hand side of (2.52) is again

(7/9) × E(x²) + (2/9) × E(x̃²),

for which

E(x²) = 1/3,

and (2.49) permit

(7/9) × (1/3) + (2/9) × 0.2 = 0.303704,

as desired.
The next section presents tabulated results of the Mathematica functions described here for various n and λ. Included in the tables are the results of independent simulations which provide checks of the algorithms underlying the functions.
2.4.3 Some Numerical Results
Tables 2.6, 2.7 and 2.8 below present numerical results for various small sample sizes and three values of λ associated with 10 quantities. The quantities have been multiplied by n so as to remove the decreasing effect due to increasing sample size.

The ten quantities presented in these tables are described as follows:

1. ExpBoot represents the Mathematica function ExpBoot[n, λ] defined in the preceding section.
Table 2.6: Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median, using several sample sizes, for the case of the normal distribution

2. ExpBootSim was calculated as a check of the correctness of the Mathematica code underlying ExpBoot[n, λ]. ExpBootSim is an estimate of ExpBoot[n, λ] for a particular n and λ and was computed according to the following algorithm. This algorithm was coded using S-Plus.

• For each value of λ and n carry out the following:

• Randomly select 1000 samples of size n (odd) from the U[0,1] distribution.

• For each sample:

  - select B = 1 resamples of size n

  - calculate the median x̃* of the resample

  - transform to a λ-family variate using x = (x̃*)^λ − (1 − x̃*)^λ

• Calculate ExpBootSim = mean(x_1², …, x_1000²)

• Calculate ExpBootSim_stderr = [variance(x_1², …, x_1000²)/1000]^(1/2)
3. ExpBootSim_stderr is defined in 2. above.

4. E_true(x̃²) is computed by the function SecondMomOrder[i, n, λ] whenever the i-th order statistic is the median in a sample of odd sample size n. This function
Table 2.7: Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median, using several sample sizes, for the case of the uniform distribution. Columns include n E_true(x̃²), n E_sim(x̃²), E_sim(x̃²)_stderr, n Bias and n VarBoot.
was defined in the preceding section.

5. E_sim(x̃²) was calculated as a check of the correctness of the Mathematica code underlying SecondMomOrder[i, n, λ], which computes E_true(x̃²) whenever the i-th order statistic is the median in a sample of odd sample size n. E_sim(x̃²) is an estimate of E_true(x̃²) for a particular n, λ, and i and was computed according to the algorithm described below. This algorithm was coded using S-Plus.

• For each value of λ and n carry out the following:

• Randomly select 1000 samples of size n (odd) from the U[0,1] distribution.

• For each sample:

  - calculate the median x̃

  - transform to a λ-family variate using x = (x̃)^λ − (1 − x̃)^λ

• Calculate E_sim(x̃²) = mean(x_1², …, x_1000²)

• Calculate E_sim(x̃²)_stderr = [variance(x_1², …, x_1000²)/1000]^(1/2)

6. E_sim(x̃²)_stderr is defined in 5. above.

7. Bias = ExpBoot[n, λ] − E_true(x̃²)
Table 2.8: Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median, using several sample sizes, for the case of the logistic distribution (λ = 0; sample sizes n = 1, 3, 5, 7, 9)
8. VarBoot represents the Mathematica function VarBoot[n, λ] defined in the previous section.

9. Boot_conv represents a conventional bootstrap calculation of the second moment of the median and was computed according to the algorithm described below. This algorithm was coded using S-Plus.

• For each value of λ and n carry out the following:

• Randomly select one sample of size n (odd) from the U[0,1] distribution.

• From this single sample select B = 1000 bootstrap resamples.

• For each of the bootstrap resamples:

  - calculate the median x̃*

  - transform to a λ-family variate using x = (x̃*)^λ − (1 − x̃*)^λ

• Calculate Boot_conv = mean(x_1², …, x_1000²)

• Calculate Boot_conv_stderr = [variance(x_1², …, x_1000²)/1000]^(1/2)

10. Boot_conv_stderr is defined in 9. above.
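The conventional bootstrap calculation of item 9 can be sketched in Python rather than S-Plus (my illustration; the function name, B and the seed are arbitrary choices):

```python
import random

def lam_transform(u, lam):
    # lambda-family variate x = u^lam - (1 - u)^lam, as in (2.45)
    return u ** lam - (1 - u) ** lam

def boot_second_moment_median(n, lam, B=1000, seed=0):
    # conventional bootstrap: one U[0,1] sample, B resamples, median of each
    # resample transformed to the lambda family and squared
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    vals = []
    for _ in range(B):
        res = sorted(rng.choice(sample) for _ in range(n))
        vals.append(lam_transform(res[n // 2], lam) ** 2)   # n odd
    mean = sum(vals) / B
    stderr = (sum((v - mean) ** 2 for v in vals) / B / B) ** 0.5
    return mean, stderr

print(boot_second_moment_median(3, 1))   # (estimate, simulation standard error)
```

Unlike the exact partition-based computation, this estimate carries both resampling noise (finite B) and sampling noise from the single underlying sample.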
2.5 Conclusions and Discussion
In this chapter we have detailed a diagnostic methodology that permits the study of properties of bootstrap estimators associated with order statistics. This approach was illustrated for the case where the estimator is the median of a sample of odd sample size. The ideal bootstrap estimator of the second moment of the median can be expressed as a weighted multinomial average of order statistics, conditional on the underlying sample, and this can be computed exactly through exhaustive enumeration of distinct resamples.

Algorithms were developed that permit moments of bootstrap estimators in this context to be expressed symbolically and, if desired, numerically, for any distribution and for any odd sample size. The methodology provides exact results. The correctness of the algorithms was checked by comparing numerical results generated by the algorithms with independent simulations based on conventional Monte Carlo resampling. This has been confirmed by the results given in Tables 2.6, 2.7 and 2.8.
The numerical results corroborate the relationship given by (2.63), especially for a Gaussian shaped distribution. The apparent 2:1 bias suggested in Tables 2.6, 2.7 and 2.8 is largely due to the definitional assumption (2.59) that has been made rather than to any long-run discrepancy between the bootstrap second moment of the median and what it is attempting to estimate.
The methodology developed in this chapter for studying properties of bootstrap estimators involving exhaustive enumeration is deemed useful for sample sizes of moderate size. Indeed, characteristics of bootstrap procedures are more unpredictable for smaller sample sizes than for larger sample sizes, and the symbolic tools presented here provide a technique for assessing these properties. The consensus in the literature is that bootstrap estimates calculated on the basis of exhaustive enumeration may be competitive with Monte Carlo resampling up to about n = 10. The reason that properties of bootstrap estimates may be assessed for much larger sample sizes is that the calculations involved here use partitions, which are substantially fewer than the respective number of all possible distinct resamples.
Table 2.1 illustrates this. For example, the exact calculation of a bootstrap estimator (based on exhaustive enumeration of resamples) for a sample size of n = 15 would involve the enumeration of almost 80,000,000 resamples, whereas a mere 176 partitions would be required for the exact computation of a property (i.e. a moment) of the same bootstrap estimator. It appears feasible that the algorithms discussed in this chapter could be further developed to permit symbolic expression of higher moments of bootstrap estimators associated not only with more general L-estimators but with M-estimators as well. In addition, greater computational efficiency could be achieved by eliminating multinomial weights possessing negligible probabilities.

In the next chapter, we will discuss an alternative methodology that permits analytical expression of bootstrap estimators and corresponding properties associated with the class of M-estimators.
Chapter 3
Bootstrapped M-estimators and
their Properties
3.1 Introduction
In this chapter, procedures are developed and implemented for analytically computing bootstrap estimators and their properties for statistics defined by estimating equations and functions of these. The resulting symbolic quantities are expressed in terms of series expansions to any desired order. Moreover, these formulas may then be evaluated numerically for any particular set of n iid observations. This method entirely avoids the use of conventional Monte Carlo resampling.
The basis for the proposed methodology comprises the six following steps:

1. A statistic defined by estimating equations is approximated to any order by a finite linear combination of products of normalized, centered sums of independent and identically distributed random variables.

2. A bootstrap replicate of such a statistic may be similarly (analytically) expressed as a finite linear combination of products of weighted sums. The weights arise from the multinomial distribution. The weighted sums are decomposed into simple sums and products of sums involving distinct observations (henceforth referred to as disjoint sums).
CHAPTER 3. BOOTSTRAPPED M-ESTIMATORS
3. The bootstrap expectation of the statistic is obtained by considering the expectation of the series expansion with respect to the multinomial weights arising in (2). The moments of the weights are found from the moment generating function of the multinomial distribution.

4. Products of sums involving distinct observations are re-expressed as linear combinations of simple sums and ordinary products of sums.

5. Properties of the preceding expressions may be evaluated. This corresponds to calculating various moments of the analytical bootstrap quantities with respect to the original random variable x, for example, in the one-sample case.

6. Symbolic expressions representing bootstrap moments and their properties may be further assessed numerically if desired by evaluating these for a particular set of data.
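To make step (1) concrete, here is a small numerical illustration of my own (the thesis works symbolically in Mathematica; the Huber score with c = 1.345 is just an assumed example): a one-dimensional M-estimator solved exactly by bisection, compared with its first-order expansion about θ0 in terms of a normalized sum of the ψ(x_i, θ0):

```python
import random

C = 1.345   # Huber tuning constant (an illustrative choice)

def psi(x, t):
    # Huber score: psi(x, t) = clip(x - t, -C, C), decreasing in t
    return max(-C, min(C, x - t))

def m_estimate(xs, lo=-50.0, hi=50.0):
    # exact root of (1/n) sum_i psi(x_i, theta) = 0 by bisection
    def g(t):
        return sum(psi(x, t) for x in xs) / len(xs)
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_term_expansion(xs, theta0=0.0):
    # first-order inversion of the estimating equation about theta0:
    # theta_hat ~ theta0 + mean psi(x, theta0) / mean 1{|x - theta0| < C},
    # a centered sum of iid terms divided by a derivative term
    n = len(xs)
    num = sum(psi(x, theta0) for x in xs) / n
    den = sum(1.0 for x in xs if abs(x - theta0) < C) / n
    return theta0 + num / den

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)]
print(m_estimate(data), one_term_expansion(data))   # the two are close, near 0
```

Higher-order terms of the expansion refine this approximation; the thesis carries such expansions symbolically to arbitrary order.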
In Sections 3.2.2–3.2.4 the main methodological details are presented. Specifically, we establish (1) in Section 3.2.2 by considering the inversion of the Taylor series representation of an estimating equation. Sections 3.2.3–3.2.4 discuss (2)–(3), while (4)–(6) are discussed in Section 3.2.4. Section 3.3 focuses on computational issues. In Section 3.4 we apply the algorithms to the case of the correlation coefficient. Specifically, we compare our results to those obtained by Efron and Tibshirani (1993) using their Law School data. We also consider an artificially generated sample of n = 100 bivariate Gaussian observations.
The methods described in the present chapter have been implemented for machine computation and, in fact, depend primarily on two key symbolic routines: one for computing asymptotic expansions of roots of estimating equations, and one for computing bootstrap analogues of these expansions. The algorithms have been implemented using Mathematica (Wolfram, 1988). The routines are very simple to apply to a broad range of statistics. The Mathematica code is presented in Appendix A.2 of this thesis.
3.2 Bootstrapping M-estimators: An Analytical Approach
3.2.1 Introduction
The following sub-sections present the methodological details relevant to steps (1)-(6) given above.
3.2.2 An Approximation for Functions of M-estimators
In this section we show how a single arbitrary M-estimator θ̂, or a smooth function of such estimators, may be expressed as a particular sequence of increasingly accurate series approximations. The discussion below follows Andrews and Feuerverger (1993). Alternative presentations may also be found in Ferguson (1989) and McCullagh (1987).
Let x_1, . . . , x_n represent independently and identically distributed observations from a distribution indexed by a parameter θ, with θ_0 being the assumed true value. As developed by Huber (1964, 1967), we consider first a one-dimensional M-estimator θ̂, assumed to be the solution of the equation

If the x_i's arise from the density f(x, θ), and if ψ(x, θ) = (∂/∂θ) log f(x, θ), it follows that θ̂ is the maximum likelihood estimate of θ.
Denoting expectation with respect to f_{θ_0} by E_0, we shall require that

    E_0 ψ(x, θ_0) = 0                                            (3.2)

and

    E_0 ψ'(x, θ_0) ≠ 0

so that θ̂ is √n-consistent for θ_0 (see, for example, Huber (1977)). The following
construction enables us to write θ̂ in the form of a series approximation. Consider the k-th order Taylor expansion of (3.1) about θ_0:

where

The superscripts (j) on ψ represent differentiation with respect to θ. The remainder term appearing in (3.3) may be expressed as

for some θ* between θ̂ and θ_0, and is of the order indicated whenever θ̂ = θ_0 + Op(n^{-1/2}), provided only that ψ^{(k+1)}(x, θ) is uniformly integrable in a neighborhood of θ_0.
Now observe that (3.3) may alternatively be written as
where
and so
Note that the j = 0 term in the first sum of (3.4) has been omitted in view of (3.2).
Andrews and Feuerverger (1993) present an iterative inversion of the power series representation (3.4) so as to provide increasingly accurate approximations to the root θ̂ of the equation (3.1). It is important to note that the terms Z_n, henceforth referred to as residual sums, appearing in (3.4) are normalized, centered averages and hence are of precise order Op(1) provided the ψ^{(j)} have finite second moments; i.e., they are of order Op(1) and not of any lower order. (See Hall (1992), pp. xii-xiii, for a fuller discussion.) Indeed, the use of Op(1) random variables of the form (3.5) and (3.6) in (3.4) will conveniently allow the expression of successive iterative approximations to θ̂ in terms of stochastic asymptotic expansions in powers of the sample size n.
The specific sequence of approximate roots θ̂_l to (3.1) is constructed inductively to have the form

for l = 1, 2, . . . , where the δ_l's are Op(1), depend upon n, and, further,

    θ̂_l + δ_l = θ̂ + Op(n^{-(l+2)/2}).                             (3.8)

Each δ_l has the special form

that is, a linear combination of products of exactly l + 1 normalized, centered averages Z_n of the type (3.5), (3.6). The inductive construction (3.7) is rendered complete by finally defining

for l = 1, 2, . . . . It is observed that the representation (3.10) is of order Op(1), satisfies (3.8), and has the form given by (3.9). This procedure is a symbolic analogue
of a quasi Newton-Raphson iteration; see Andrews and Stafford (1993).
For example, setting l = 2, the third-order expansion of the root θ̂ in iterative form may be expressed as
We now consider the analogous multiparameter problem. The multivariate M-estimator

for the component parameters
having the true values

is the solution of the system of equations

for j = 1, . . . , p. The methods of the previous paragraphs extend directly (care should be used in the matrix algebra since, for example, E_0 ψ'(X, θ_0) is now a matrix) to permit us to write each component θ̂_j of θ̂ in the form (3.7), (3.9), where any of the terms Z_n appearing in (3.9) can now be any of the terms (3.5) or (3.6) corresponding to any of the ψ-functions ψ_1, . . . , ψ_p. Now suppose that we are interested in some smooth function, say g, of the component M-estimators, so that ĝ = g(θ̂_1, . . . , θ̂_p) is viewed as being the statistic of interest, and g_0 = g(θ_01, . . . , θ_0p) as the quantity of inferential interest. Then, by substituting approximations which are the vector analogues of (3.7) (for each component of the M-estimator) into a straightforward Taylor expansion for g, we are led to the fact that
where

    U_r = γ_0 + γ_1/n^{1/2} + γ_2/n + · · · + γ_r/n^{r/2}

and the γ_i's are each Op(1) linear combinations of products of i + 1 normalized, centered averages as in (3.9), that is,
As before, this procedure enables us to consider the members of the sequence of
truncated expansions as approximants to the quantity √n (ĝ − g_0).
3.2.3 Weighting and Decomposing Residual Sums
Conventional bootstrapping is based on the simulated generation of a finite number of resamples drawn with replacement from the original sample. Bootstrap replicates of a statistic θ̂ of interest are computed, and the desired moment associated with the statistic can then be estimated. We observed in the previous section that θ̂ may be approximated by using the expansion (3.7) together with (3.10). In this section we consider converting (3.7) and (3.10) into bootstrap replicates analytically, that is, without resorting to Monte Carlo simulation. We shall do so by algebraically introducing those weights which are implicitly generated by the repeated sampling with replacement inherent in bootstrapping. We consider the single-parameter case below.
From (2.5) and (2.6) it follows that

Consequently, the introduction of multinomial weights w_i into (3.6) yields the weighted residual sum

Weighted (i.e. bootstrap) versions of (3.7) and (3.10) may hence be similarly obtained by observing that
and
with (3.6) replaced by (3.15).
Consider the example in (3.11) in this context. The third-order expansion of the bootstrap replicate θ̂* is observed to represent a function of moments and weighted residual sums, as well as products of these. That is, the third-order expansion of θ̂* can be expressed as a linear combination of products of weighted residual sums. Hence,

We now observe that products of residual sums algebraically decompose into linear combinations of simple sums, disjoint sums, and products of these. This fact will be important for the subsequent development of the present discussion, as it will facilitate computation of moments corresponding to the multinomial weights. A computational procedure for providing such a decomposition for any product of residual sums will be given in Section 3.3. We note that in (3.16) the second-order term contains products of two weighted residual sums, the third-order term contains products of three weighted residual sums, and so on. Consider the product Z_n*(θ_0) Z_n^{(1)*}(θ_0). Disregarding centering and normalizing operations, and letting ψ(x_i, θ) = ψ_i for the sake of clarity, we have

Similarly, a product of three weighted residual sums produces, for example,
The combinatorial array of terms in (3.18) suggests a well-known pattern that corresponds to a partitioning of a list of three items. The computational tool FP[list] provides a full partition of a list of any number of items. For example,

    FP[{a, b, c}] = {{{a, b, c}}, {{a, b}, {c}}, {{a, c}, {b}}, {{a}, {b, c}}, {{a}, {b}, {c}}}    (3.19)

contains the five partitions from which the respective terms in (3.18) have been constructed. Similarly,
is used as the basis for decomposing a product of four weighted residual sums into a sum of fifteen component terms, and so on. The above expressions may be readily evaluated using tools for symbolic computation. The procedures described below were based on those developed by Andrews et al. (1993) and subsequent unpublished work. These basic procedures were adapted for the present study. The tool FP[list] is described by the following algorithm:
Algorithm 3.1:
Let FP[{a}] = {{{a}}},
    FP[{a, b}] = {{{a, b}}, {{a}, {b}}}.
To find the partitions in FP[{a, b, c}], do the following:
  • To each element of each partition in the preceding list, join c.
  • To each partition in the preceding list, append the new block {c}.
Thus, FP[{a, b, c}] is given by (3.19) above.
Apply this rule to find the partitions of any list {a, b, c, . . .}.
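The inductive rule above is straightforward to render in code. The following Python sketch (illustrative; the thesis implementation is in Mathematica) builds the full partitions by joining the newest item to each block of each preceding partition and then appending it as a block of its own. The counts 2, 5, 15 recover (3.19) and the fifteen-term decomposition noted above.

```python
def FP(items):
    """Full partitions of a list, built by the inductive rule of Algorithm 3.1."""
    if len(items) == 1:
        return [[[items[0]]]]
    last = items[-1]
    partitions = []
    for p in FP(items[:-1]):
        # join the newest item to each block of each preceding partition ...
        for i in range(len(p)):
            partitions.append([block + [last] if j == i else list(block)
                               for j, block in enumerate(p)])
        # ... and also append it as a new block of its own
        partitions.append([list(block) for block in p] + [[last]])
    return partitions

assert len(FP(list("ab"))) == 2
assert len(FP(list("abc"))) == 5    # the five partitions in (3.19)
assert len(FP(list("abcd"))) == 15  # fifteen component terms, as noted above
```

The counts follow the Bell numbers, as expected for set partitions.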
In this section we have indicated that the bootstrap replicate θ̂* may be approximated arbitrarily well by weighted analogues of (3.7) and (3.10), that is, by a truncated series in products of weighted residual sums. Moreover, these products may further be decomposed into linear combinations of simple and disjoint sums. It follows that any smooth function g of θ̂*, for example g(θ̂*) = (θ̂*)², may also be expressed as a polynomial series in decomposed products of weighted residual sums. Finally, the bootstrap expectation of g(θ̂*) is an expectation over a multinomial vector of weights w conditional on a sample x, as in (2.16), that is,

which can be expressed in a similar form. We consider this in the next section.
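As a concrete illustration of bootstrap expectations of this kind, take θ̂ to be the sample mean and g(θ̂*) = (x̄*)². The multinomial weight moments give E_{w|x}[(x̄*)²] = x̄² + Σ_i (x_i − x̄)²/n², which the Python sketch below (synthetic data, exact rational arithmetic; illustrative only, not thesis code) checks against brute-force enumeration of all nⁿ resamples:

```python
from fractions import Fraction
from itertools import product

data = [1, 2, 6, 9]          # small illustrative sample
n = len(data)
xbar = Fraction(sum(data), n)

# Brute-force ideal bootstrap expectation of g(theta*) = (xbar*)^2,
# enumerating all n^n equally likely resamples (feasible only for tiny n)
exact = Fraction(0)
for draw in product(range(n), repeat=n):
    m = Fraction(sum(data[i] for i in draw), n)
    exact += m * m
exact /= n ** n

# Analytic form obtained from the multinomial weight moments:
# E_{w|x}[(xbar*)^2] = xbar^2 + sum_i (x_i - xbar)^2 / n^2
analytic = xbar ** 2 + sum((Fraction(x) - xbar) ** 2 for x in data) / n ** 2
assert exact == analytic
```

The exact agreement reflects the fact that, for this simple g, the truncated weighted-series expansion is already the full expansion.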
3.2.4 Analytical Bootstrap Moments of M-Estimators and their Properties
In this section we consider analytical representations of the bootstrap estimator and its properties. For example, when g(θ̂*) = θ̂* or g(θ̂*) = [θ̂* − E_{w|x}(θ̂*)]², then (3.22) provides the bootstrap estimates of mean or variance, given respectively by

and

As before, the discussion here shall consider only ideal bootstrap estimates, in the sense that B → ∞ (see Section 2.2).
Since expectation is a linear operator, E_{w|x}[g(θ̂*)] may be expressed as a truncated polynomial series in decomposed residual sums, with some coefficients arising from various moments of the weights w. For example, taking the bootstrap expectation
Note that the second step in (3.25) is possible because the multinomial weights (w_1, . . . , w_n) are identically distributed and, though not independent, are exchangeable. That is, E[w_i w_j] = E[w_1 w_2] for i ≠ j, but E[w_1 w_2] ≠ E[w_1] · E[w_2]. The
coefficients E_{w|x}[w_1²] and E_{w|x}[w_1 w_2] are known and may be simply computed from the moment generating function of the multinomial distribution. Moreover, the disjoint sum in the second term on the right side of (3.25) may be expressed as a linear combination of products of simple sums. Similar evaluations would take place when evaluating the bootstrap expectation of any product of weighted residual sums.
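These weight moments are easy to verify numerically. For w ~ Multinomial(n; 1/n, . . . , 1/n), the moment generating function (or direct binomial and covariance calculations) gives E[w_1²] = 2 − 1/n and E[w_1 w_2] = 1 − 1/n. The following Python sketch is illustrative only and confirms this by exact enumeration for a small n:

```python
from fractions import Fraction
from itertools import product

def weight_moments(n):
    """Exact E[w1^2] and E[w1*w2] for w ~ Multinomial(n; 1/n, ..., 1/n),
    by enumerating all n^n equally likely resamples (tiny n only)."""
    total = n ** n
    e_w1sq = Fraction(0)
    e_w1w2 = Fraction(0)
    for draw in product(range(n), repeat=n):
        w1 = draw.count(0)   # number of times observation 1 is drawn
        w2 = draw.count(1)   # number of times observation 2 is drawn
        e_w1sq += Fraction(w1 * w1, total)
        e_w1w2 += Fraction(w1 * w2, total)
    return e_w1sq, e_w1w2

n = 4
e_w1sq, e_w1w2 = weight_moments(n)
assert e_w1sq == 2 - Fraction(1, n)   # E[w1^2] = 2 - 1/n
assert e_w1w2 == 1 - Fraction(1, n)   # E[w1 w2] = 1 - 1/n
```

Note that E[w_1 w_2] = 1 − 1/n ≠ E[w_1] E[w_2] = 1, reflecting the negative correlation of the multinomial counts.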
The series evaluation of the ideal bootstrap estimator E_{w|x}[g(θ̂*)] to any level of precision thus depends solely on the specification of the appropriate ψ-functions defining the M-estimator θ̂, while numerical evaluation would require observation of the sample (x_1, . . . , x_n). The specified ψ-functions must be differentiable up to the desired order of the expansion of θ̂. For example, the bootstrap estimates of mean and variance, given respectively by (3.23) and (3.24), can each be expressed analytically as a linear combination of residual sums; symbolic evaluation in each case would depend only on the specification of the appropriate ψ-functions, and subsequent numerical evaluation, if desired, would require the sample x = (x_1, . . . , x_n).
Finally, one may consider properties of the series representation of E_{w|x}[g(θ̂*)]. Indeed, any expectation, over x, of such a series may be calculated directly. For example, the mean and variance of the bootstrap estimator E_{w|x}[g(θ̂*)] are given by

and

and such expressions can be evaluated in a straightforward manner.
The salient feature of the representations described in this section is that they can be expressed in symbolic form. This precludes the necessity for Monte Carlo simulation. In the next section we shall describe and illustrate several computational procedures, developed using Mathematica, that permit analytic expression of the bootstrap quantities considered above. This methodology will then be applied to the case of the correlation coefficient.
3.3 Computational Tools
In this section we will describe and illustrate a number of computational procedures developed using Mathematica that permit analytic expression of various quantities considered in the previous two sub-sections.
In Section 3.2.2 it was observed that (3.7) together with (3.10) provide increasingly accurate iterative approximations to a single-dimensional M-estimator θ̂. An algorithm has been implemented in Mathematica that computes (3.7), (3.10). This function is called ThetaHat[i] and it returns the expansion for θ̂, correct to order n^{-i/2}. For example, the second-order approximation to the root θ̂ is computed as

This corresponds to the sum of the first four terms in (3.11), where Z_n(ψ(X, θ)) is abbreviated Z_n(θ) and expectations are represented as cumulants. When i = 3, an expression with nine terms is returned. When i = 4, the function returns an expression with 19 terms. Similarly, the 6th-order expansion contains 73 terms. The expansion for the M-estimator may thus be derived to any order. The code for the tool ThetaHat[i] is available in Appendix A.2 in a file called Bootcum.m.
Now the bootstrap quantity θ̂*, or any power of this, can always be represented symbolically as a linear combination of products of weighted residual sums (factors in Z_n*); these are now functions of the bootstrap resample X*. Thus, the bootstrap version of (3.28) would look identical to (3.28) with Z_n and X replaced by Z_n* and X*. Moreover, this analytic bootstrap quantity represents (3.16) truncated to second order.
The function BootCum[θ̂][i, j] returns the i-th bootstrap cumulant of some estimator θ̂, expressed as an expansion correct to order n^{-j/2}. This calculation represents the kernel upon which the methodology described in the present chapter is based. In particular, BootCum[θ̂][1, 2] represents the analytical expansion of the bootstrap expectation E_{w|x}(θ̂*) correct to second order, that is,
The code BootCum[θ̂][2, 2] represents the analytical expansion of the bootstrap variance Var_{w|x}(θ̂*) correct to second order. For example, the second-order expansions of the bootstrap estimates of the mean and variance of θ̂ are displayed symbolically, respectively, as

    BootCum[θ̂][1, 2] = θ̂ + Cum(ψ(X*, θ̂) ψ'(X*, θ̂)) / (n Cum[ψ'(X, θ̂)]²)

and

    BootCum[θ̂][2, 2] = Cum(ψ(X*, θ̂), ψ(X*, θ̂)) / (n Cum[ψ'(X, θ̂)]²) .
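The second-order variance expansion can be sanity-checked in the simplest case. For ψ(x, θ) = x − θ (so θ̂ is the sample mean), Cum(ψ, ψ)/(n Cum[ψ']²) reduces to Σ_i (x_i − x̄)²/n², which is in fact the exact ideal bootstrap variance of the mean; a conventional Monte Carlo bootstrap fluctuates around this value. The Python sketch below is illustrative only (the data are synthetic):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(50)]
n = len(data)
xbar = sum(data) / n

# Second-order analytic bootstrap variance for the sample mean
# (psi(x, theta) = x - theta): Cum(psi, psi) / (n * Cum[psi']^2)
# reduces to sum_i (x_i - xbar)^2 / n^2, the exact ideal bootstrap variance.
analytic = sum((x - xbar) ** 2 for x in data) / n ** 2

# Conventional Monte Carlo bootstrap for comparison
B = 4000
replicates = []
for _ in range(B):
    resample = [random.choice(data) for _ in range(n)]
    replicates.append(sum(resample) / n)
mc = statistics.pvariance(replicates)

# The Monte Carlo estimate fluctuates around the analytic value
assert abs(mc - analytic) / analytic < 0.15
```

Unlike the Monte Carlo figure, the analytic value carries no resampling variability, which is the point of the symbolic approach.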
The code for the tool BootCum[θ̂][i, j] is available in Appendix A.2 in a file called Bootcum.m. The main steps of the algorithms associated with ThetaHat[i] and BootCum[θ̂][i, j] are given below.
We suppose that θ̂ is the root of the equation f(θ̂) = 0. The solution is expressed as a sequence of increasingly accurate series approximations. A starting value θ_0 must be provided. The algorithm is a Newton iteration as follows:
Algorithm 3.2 (ThetaHat[i]):
• Provide a starting value θ_0.
• In general, θ̂_j = θ̂_{j−1} − f_i(θ̂_{j−1}) / f_i'(θ_0),
where f_i(θ̂) is a series expansion of f(θ̂) correct to order n^{-i/2}.
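The fixed-denominator (quasi-Newton) step above can be illustrated numerically. In the Python sketch below, ψ(x, θ) = tanh(x − θ) is a hypothetical smooth score chosen purely for illustration, and the denominator is held fixed at the slope evaluated at the starting value θ_0, as in the algorithm:

```python
import math

# Hypothetical smooth score for illustration (not from the thesis):
# psi(x, theta) = tanh(x - theta), psi'(x, theta) = -(1 - tanh(x - theta)^2)
def psi(x, t):
    return math.tanh(x - t)

def psi_prime(x, t):
    return -(1.0 - math.tanh(x - t) ** 2)

data = [0.2, -0.5, 1.3, 0.7, -0.1, 0.9]

def f(t):
    # estimating equation: average of psi over the sample
    return sum(psi(x, t) for x in data) / len(data)

theta0 = 0.0  # starting value
# fixed denominator: the slope of f at theta0, as in the quasi-Newton step
slope = sum(psi_prime(x, theta0) for x in data) / len(data)

t = theta0
for _ in range(20):
    t = t - f(t) / slope   # theta_j = theta_{j-1} - f(theta_{j-1}) / f'(theta_0)

assert abs(f(t)) < 1e-12   # t now solves the estimating equation
```

Freezing the denominator at θ_0 trades the quadratic convergence of full Newton for a much simpler symbolic expansion, which is what makes the series inversion tractable.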
Algorithm 3.3 (BootCum[θ̂][i, j]):
• A statistic defined by estimating equations can be approximated to any order by a finite linear combination of products of normalized, centered sums of independent and identically distributed random variables.
• A bootstrap replicate of such a statistic, or any smooth function of this, may be similarly (analytically) expressed as a finite linear combination of products of weighted sums. The weights arise from the multinomial distribution. The weighted sums are decomposed into simple sums and products of sums involving distinct observations (referred to as disjoint sums).
• The bootstrap expectation of the statistic is obtained by considering the expectation of the series expansion with respect to the multinomial weights arising in the preceding step. The moments of the weights are found from the moment generating function of the multinomial distribution.
• Products of sums involving distinct observations are re-expressed as linear combinations of simple sums and ordinary products of sums.
Finally, we consider the procedure that evaluates properties such as (3.26) and (3.27). The function Cum[BootCum[θ̂][i]][j, k] returns a series expansion correct to order n^{-k/2} of the j-th cumulant, over x, of the i-th bootstrap cumulant of the statistic θ̂. In particular,

    E_x[E_{w|x}(θ̂*)] = Cum[BootCum[θ̂][1]][1, 2] + Op(n⁻¹)

    Var_x[Var_{w|x}(θ̂*)] = Cum[BootCum[θ̂][2]][2, 2] + Op(n⁻¹) .    (3.36)
The functional representations given in this section are generic and apply to any bootstrapped M-estimator θ̂*, except those which cannot be expressed in terms of smooth functions of averages. An example of this is the median; this case was treated alternatively in Chapter 2. In the following section we shall illustrate the functions developed here by applying them to the case of the correlation coefficient.
3.4 The Case of the Correlation Coefficient
3.4.1 Introduction
In this section we illustrate the methodology developed in the preceding sections by considering the case of a bivariate correlation coefficient using both real and simulated data: Efron and Tibshirani's (1993) Law School data, relating LSAT and GPA scores of students entering 15 American law schools, and a data set simulated from a bivariate standardized Gaussian distribution.
3.4.2 Computational Tools
The symbolic function Cor[i] expands the bivariate sample correlation coefficient β̂, given by

using a Taylor series in residual sums correct to i-th order. Since the correlation coefficient is invariant with respect to linear transformations, we may simplify the expansion by assuming, without loss of generality, that E(x) = E(y) = 0 and E(x²) = E(y²) = 1. For example, the 2nd-order expansion of β̂ is given analytically as

where, for example, the residual sum Z(n)(X) according to (3.5) is given by
The main steps of the algorithm associated with Cor[i] are as follows:
Algorithm 3.4:
• Assume w.l.o.g. that E(x) = E(y) = 0 and E(x²) = E(y²) = 1.
• Express Cov(x, y) as a series expansion in terms of residual sums of the form (3.39).
• Express Cov(x, x) and Cov(y, y) as series expansions in terms of residual sums.
• The product of the series expansions of Cov(x, x)^{-1/2} and Cov(y, y)^{-1/2} is then multiplied by the expansion for Cov(x, y) to produce terms in the series expansion of β̂ = Cov(x, y) · Cov(x, x)^{-1/2} · Cov(y, y)^{-1/2}.
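The invariance used in the first step of Algorithm 3.4 is easy to check numerically: standardizing the data leaves β̂ unchanged, and for standardized data the correlation reduces to the plain cross-moment. The data below are synthetic and the Python sketch is illustrative only:

```python
import random

random.seed(2)
n = 40
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + random.gauss(0, 1) for a in x]   # synthetic correlated pair

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def corr(u, v):
    # the smooth function of averages in Algorithm 3.4
    return cov(u, v) * cov(u, u) ** -0.5 * cov(v, v) ** -0.5

def standardize(v):
    m, s = mean(v), cov(v, v) ** 0.5
    return [(a - m) / s for a in v]

xs, ys = standardize(x), standardize(y)
beta = corr(x, y)
# invariance under the linear (standardizing) transformation
assert abs(corr(xs, ys) - beta) < 1e-12
# for standardized data the correlation is the plain cross-moment
assert abs(mean([a * b for a, b in zip(xs, ys)]) - beta) < 1e-12
```

This is why the w.l.o.g. standardization costs nothing while greatly shortening the symbolic expansions.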
The code for the tool Cor[i] may be found in Appendix A.2 in a file called Cor1.m. Subsequent code used for the asymptotic bootstrap expansions associated with the correlation, as derived below, can be found in the file Cor2.m.
The tool Cum[Cor][i, j] returns the i-th cumulant, over the (x, y) distribution, of β̂, expressed as an expansion correct to order n^{-j/2}. (In the notation, x, y and X, Y are used interchangeably.) For example, the mean E_{x,y}(β̂) and the variance Var_{x,y}(β̂) expressed as expansions to order n⁻¹ are given, respectively, by
and

where, for example, Cum(X, X, X) represents the third cumulant of X, that is,
if μ = E(X). See, for example, McCullagh (1987) for a discussion of multivariate cumulants.
The function BootCum[Cor][i, j] returns the i-th bootstrap cumulant of the correlation coefficient expressed as an expansion correct to order n^{-j/2}. For example, the 2nd-order analytic expressions for the bootstrap mean

and the bootstrap variance

of the correlation coefficient are given, respectively, as
and
The 4th-order expansion of the bootstrap variance of the correlation (3.44) is much lengthier than (3.46) and possesses 226 terms. We provide the first and final few terms of this expansion as follows:

Note that the 4th-order expansion (3.47) contains residual sums, whereas the 2nd-order expansion (3.46) does not. Moreover, the 2nd- and 3rd-order expansions of (3.44) are identical, the 4th- and 5th-order expansions are identical, and so on.
Our methodology now permits us to consider properties, indeed cumulants, of the bootstrap quantities (3.43) and (3.44). For illustration we consider the mean and variance, over x and y, of (3.43) and (3.44). In practice, any cumulant expanded to an arbitrary order of accuracy may be considered. The function Cum[BootCum[Cor][i]][j, k] computes a series expansion correct to order n^{-k/2} of the j-th cumulant, over x and y, of the i-th bootstrap cumulant of the correlation coefficient. In particular,

    Var_{x,y}[E_{w|x,y}(β̂*)] = Cum[BootCum[Cor][1]][2, 2] + Op(n⁻¹)
CHA PTER 3. BOOTSTRA PPED M-ESTIiVATORS
    Var_{x,y}[Var_{w|x,y}(β̂*)] = Cum[BootCum[Cor][2]][2, 2] + Op(n⁻¹) .    (3.51)
For example, the second-order expansions considered in (3.48)-(3.50) involve fourth-order joint cumulants such as β² Cum(X, X, Y, Y), β Cum(X, Y, Y, Y), and Cum(Y, Y, Y, Y).
We observe that

The corresponding 4th-order and higher-order expansions are not identical, in general, due to the appearance of residual sums.
Table 3.1 below summarizes the six quantities, associated with the correlation coefficient, for which symbolic expressions have been considered in this chapter. These represent the bootstrap mean and bootstrap variance, and the mean and variance, over x and y, of each of these. However, the computational tools developed here can provide symbolic expansions for any bootstrap cumulant of the sample correlation to any desired order, as well as properties of these.
Table 3.1: Bootstrap estimators and related properties for the correlation coefficient for which analytic expansions are derived

                        Bootstrap Mean (β̂*)         Bootstrap Variance (β̂*)
  Estimator             E_{w|x,y}(β̂*)               Var_{w|x,y}(β̂*)
  Mean over x, y        E_{x,y}[E_{w|x,y}(β̂*)]      E_{x,y}[Var_{w|x,y}(β̂*)]
  Variance over x, y    Var_{x,y}[E_{w|x,y}(β̂*)]    Var_{x,y}[Var_{w|x,y}(β̂*)]
The series expansions associated with Table 3.1 for the correlation coefficient constitute symbolic representations, in fact formulas, which may be evaluated for any bivariate data set for which calculation of the correlation, its bootstrap moments, and properties of these is desired. These expansions, such as the ones in (3.45)-(3.50), represent analytic bootstrap expressions which may be evaluated numerically for any pair of distributions F and G associated with the dependent
variables x and y for which cumulants are known. In particular, we will use the empirical (nonparametric) distribution functions F_n and G_n as defined in Chapter 2, Section 2.2. With reference to the law school data, for example, we make the following empirical substitutions in expressions (3.45)-(3.50):
3. All residual sums vanish when moments are replaced by empirical moments. For example, E(X) → x̄ yields Z(n)(X) = 0 in (3.39). For distributions other than the empirical distribution this is not always the case. If, for example, a standard normal parent is assumed, then not all residual sums vanish.
4. Cumulants → empirical cumulants. For example, the third cumulant

is replaced by the third empirical cumulant

where, for example, Sum[X³] = Σ_{i=1}^{n} X_i³.
An advantage of the analytical approach given above is that once a bootstrap expansion to a specified order has been calculated, the derivational work for the particular problem at hand has been done for all time; it remains merely to numerically evaluate the formula for any bivariate data set for which estimation of the correlation coefficient is desired. This contrasts with the conventional approach, which requires that a bootstrap variance estimate for the correlation coefficient based on Monte Carlo simulation be determined each time that a different data set is encountered.
3.4.3 Numerical Results
We will numerically evaluate the second- and fourth-order expansions for the six bootstrap expressions given in Table 3.1 for the two bivariate data sets described below. The methodology is illustrated for real and simulated data sets of different sizes.
Table 3.2 shows a random sample of size n = 15 drawn from a population of N = 82 American law schools. Shown on the left side are two measurements made on the entering classes of 1973 for each school in the sample: LSAT, the average score of the class on a national law test, and GPA, the average undergraduate grade point average achieved by the members of the class. The data were obtained from Efron and Tibshirani (1993, p. 19). The symbolic expansions given previously assume standardized variables, and hence the same data, shown on the right side of Table 3.2, have also been standardized.
Table 3.2: n = 15 Law School Data: original and standardized data
Tables 3.3 and 3.4 display n = 100 simulated bivariate observations generated from a standard Gaussian distribution. The data were also standardized to produce exactly zero sample means and unit sample variances for each of the variables.
Table 3.5 gives the numerical results obtained by calculating the 2nd- and 4th-
Table 3.3: n = 100 simulated bivariate observations from N(0, 1): variable x
order expansions associated with the six moments identified in Table 3.1 for each of the data sets. The classical results are also presented, where β̂ has been computed using (3.37) and E_{x,y}(β̂), Var_{x,y}(β̂) have been computed to order n⁻¹, for example, by (3.40) and (3.41). The asymptotic variance of the correlation is given by
Tables 3.6 and 3.7 document the conventional Monte Carlo bootstrap estimates of the mean and standard error of the sample correlation coefficient (the standard error of each estimate is shown in brackets alongside). These are then compared with the corresponding analytically computed bootstrap mean and standard error of the correlation given in lines 1 and 4 of Table 3.5. The Mathematica and S-Plus code underlying the computation of the results within these tables is given in Appendix A.2.
Note that the calculations exhibited in Table 3.5 were carried out using Mathematica
Table 3.4: n = 100 simulated bivariate observations from N(0, 1): variable y
and those in Tables 3.6 and 3.7 were performed using S-Plus. Therefore, variances from Table 3.5 should be multiplied by a factor

when being compared with analogous variances associated with Table 3.7.
We discuss the results and conclusions based on Tables 3.5, 3.6 and 3.7 in the next section.
3.5 Discussion and Conclusions
A number of points arise from consideration of the results described in the previous section, amongst which we note the following:
1. The symbolic bootstrap methodology developed in this chapter may be applied to a broad class of statistics, namely, to any estimator which may be expressed
Table 3.5: Symbolic bootstrap and classical results for the correlation coefficient using two data sets
as a smooth (possibly non-linear) function of averages. The sample correlation coefficient is an example of such an estimator.
Analytical Bootstrap: E_{w|x,y}(β̂*); E_{x,y}[E_{w|x,y}(β̂*)]; [Var_{w|x,y}(β̂*)]^{1/2}; [Var_{x,y}[E_{w|x,y}(β̂*)]]^{1/2}; E_{x,y}[Var_{w|x,y}(β̂*)]; [Var_{x,y}[Var_{w|x,y}(β̂*)]]^{1/2}
2. The computational tools presented in this chapter with respect to the correlation coefficient could provide symbolic expansions for any bootstrap cumulant to any desired order, as well as properties of these.
3. The numerical evaluations of the 2nd- and 4th-order analytical expansions for E_{w|x,y}(β̂*) and [Var_{w|x,y}(β̂*)]^{1/2} given in Table 3.5 for the sample correlation are confirmed by the analogous Monte Carlo calculations in Tables 3.6 and 3.7 for both data sets.
Classical results (empirical distributions F_n and G_n)
Moreover, the numerical evaluation of the 4th-order symbolic expansion for the bootstrap standard error of the correlation, [Var_{w|x,y}(β̂*)]^{1/2}, for the law school data as given in Table 3.5 matches the conventional bootstrap calculation given in Efron and Tibshirani (1993, p. 50), i.e. SE(β̂*) = 0.132 for B = 3200.
This demonstrates that the symbolic tools developed here for calculating bootstrap quantities comprise a viable alternative methodology to conventional resampling. Any bootstrap moment of a sufficiently smooth estimator, and properties of this, may be derived symbolically and subsequently evaluated numerically to within a specified order of accuracy, without recourse to conventional Monte Carlo resampling. This is not true of conventional bootstrap estimates, which are unavoidably subject to random variability (note the SEs enclosed in brackets in Tables 3.6 and 3.7).

Table 3.5:
Law School Data (n = 15)
  2nd order: 0.771713  0.767052  0.124276  0.124276  0.015445  0.015445
  4th order: 0.770879  0.765075  0.134268  0.131504  0.017372  0.016806  0.158942
Simulated Data (n = 100)
  2nd order: 0.693382  0.69206   0.051133  0.051133  0.001615  0.002615
  4th order: 0.692396  0.692158  0.051514  0.051680  0.002670  0.001623  0.0144849  0.162598  0.0731031

Table 3.6: Monte Carlo Bootstrap Estimates of the Mean of the correlation coefficient for various B
4. For n = 100, the 2nd- and 4th-order analytical calculations are identical up to at least three decimal places for the six quantities in Table 3.5. For n = 15 this is true only up to the first decimal place. We conclude that fewer terms are required in the analytical bootstrap expansions as the size of a data set increases. Indeed, 4th-order accuracy was required to achieve agreement to at least two significant figures between the analytical and conventional Monte Carlo bootstrap standard errors of the correlation for n = 15. However, only 2nd-order accuracy was sufficient to achieve this for the larger data set, n = 100. Comparatively, then, the analytical approach is less computer-intensive for larger data sets. The reverse is true for conventional bootstrap calculations.
Table 3.7: Monte Carlo Bootstrap Estimates of Standard Error of the correlation coefficient for various B
5. The closeness between the analytical results for E_{w|x,y}(β̂*) and [Var_{w|x,y}(β̂*)]^{1/2} and the corresponding Monte Carlo evaluations increases as B increases, for both data sets.
6. The series expansions associated with Table 3.1 for the correlation coefficient constitute symbolic representations, in fact formulas, which may be applied to any bivariate set of data for which calculations of the correlation, its bootstrap moments, and properties of these are desired. An advantage of the analytical approach is that once a particular expansion to a specified order has been calculated, the derivational work for the problem at hand has been done and need not be duplicated by other practitioners. The formula can then be numerically evaluated efficiently for any bivariate data set for which estimation of the correlation coefficient is required. This contrasts with the conventional approach, which requires that a particular bootstrap moment for the correlation coefficient based on Monte Carlo simulation be generated each time that a different data set is encountered.
7. The symbolic bootstrap method can represent bootstrap moments analytically,
which is not possible in the context of conventional bootstrap computation.
Table 3.7 (excerpt): Monte Carlo bootstrap estimates of standard error (se_B)
  Law School Data (n = 15):  0.1241 (0.0084)   0.1194 (0.0054)
  Simulated Data (n = 100):  0.0605 (0.0059)   0.0463 (0.0028)
CHA PTER 3. BOOTSTRAPPED M-ESTIMATORS l F
1.3
Moreover, diagnostic properties such as the bias and variance of bootstrap moments
may be represented analytically. The reason for this is that any bootstrap
moment, when expressed symbolically, represents a random variable for which
various moments (i.e. properties) may in turn be considered. This permits the
user to study the analytical structure of a bootstrap quantity of interest and
identify component statistical quantities such as moments and sums. This can
be potentially insightful.
For example, the bias of the bootstrap variance of the correlation may itself be
represented analytically, and the resulting expression could be evaluated for any
distribution, for example the Gaussian distribution. Using the empirical
distribution functions F_n and G_n for the law school data we calculate the 4th
order bootstrap estimate of the bias of the bootstrap variance of the correlation;
the corresponding calculation for the simulated data (n = 100) yields BIAS =
-0.000047. Similarly, the 4th order bootstrap estimate of the bias of the bootstrap
expectation of the correlation may be calculated for the law school data;
the corresponding calculation for n = 100 is BIAS = -0.000238.
These results suggest that the bootstrap variance of the correlation and the
bootstrap procedure for estimating the mean of the correlation are largely free
of bias. In both cases, the negligible bias that exists decreases with increasing
sample size.
8. We observe that
and
for both the 2nd and 4th order expansions.
9. The methodology discussed in Chapter 2, in which exhaustive enumeration of
bootstrap resamples was used as a basis for the exact computation of bootstrap
estimates and associated properties, could alternatively be used for the case
of the correlation coefficient. For example, exact evaluation of the bootstrap
estimate of standard error of the correlation for the law school data (n = 15)
would involve the enumeration according to (2.21) of C(2n-1, n) = 77,558,760,
or almost 80,000,000, distinct bootstrap resamples. This very large computation
was avoided, and conventional bootstrap sampling with B = 195,000 was used
as a proxy. The results shown at the bottom of Tables 3.6 and 3.7 for the law
school data support the corresponding analytically-based calculations in Table 3.5.
10. The issue of error analysis with respect to asymptotic expansions of bootstrap
moments and their properties is not investigated quantitatively in this thesis.
Considerations such as deciding how many terms to include in a particular
expansion, and how to tell how much error has been induced by truncating the
series, would play a role in such an analysis.

The answer to the first question depends on the sample size and the smoothness
of the estimator in question. Higher order terms involve higher order derivatives,
which can escalate in magnitude. In the case of smooth estimators the
magnitude of higher order derivatives is dampened. Although no useful bound is
known for these, they may be evaluated in any application. Hence, series
expansions involving very smooth estimators and large sample sizes would converge
more rapidly than those involving less smooth estimators and smaller sample
sizes. Consequently, in the former case the error induced by truncating a series
at a given point would be less than the error induced by truncating a series at
the same place in the latter case. For example, we observe in Table 3.5 that for
n = 15 the error between the 2nd and 4th order expansions of the bootstrap
standard error of the correlation coefficient is 0.131804 - 0.124276 = 0.007528,
and this decreases to 0.051680 - 0.051133 = 0.000547 for n = 100. The results
in Tables 3.5-3.7 suggest that a 4th order expansion of the bootstrap standard
error of the correlation evaluated for n = 15 data points is necessary to achieve
the numerical stability that a corresponding 2nd order expansion enjoys
when evaluated for the larger sample size of n = 100. We would draw
similar conclusions for estimators differing in smoothness characteristics.
The analytic bootstrap developed in this chapter calculates ideal bootstrap
estimates and corresponding properties exactly for the case of linear statistics.
Such computations are also exact when performed on truncated expansions of
non-linear smooth statistics. For example, consider the truncated series expansion
of the correlation coefficient given by Cor[i] for some finite i. Then the
bootstrap variance of this, given by BootCum[Cor[i]][2, j], will be exact for
large enough j. More generally, we may conjecture that if the evaluation of the
truncated expansion ThetaHat[i] for a given data set and finite i is very close
to θ̂, then the evaluation of the exact expression BootCum[ThetaHat[i]][2, j]
(for large enough j) would similarly be very close to the evaluation of the series
BootCum[ThetaHat][2, j].
Finally, analytical error analysis involving truncated series expansions of bootstrap
expressions, as referred to above, may be contrasted with the sampling error
induced by the Monte Carlo bootstrap. For example, this could entail taking
blocks of 100 bootstrap replicates of the estimator and studying the empirical
variation among the block estimates. However, such an analysis is limited
because it considers only the central region of the empirical distribution of the
bootstrap estimate. In contrast, truncated analytical bootstrap expansions can
permit expression of subtler higher moments of the distribution, particularly
those providing information about the tails. Moreover, such expansions are
true for any distribution and not just the empirical distribution.
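The block-based assessment of Monte Carlo sampling error described above can be sketched generically as follows. This is an illustrative Python sketch under assumed block counts, not code from the thesis.

```python
import math
import random

def bootstrap_replicates(data, stat, B, rng):
    # Draw B bootstrap replicates of the statistic.
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]

def block_variation(data, stat, n_blocks=20, block_size=100, seed=1):
    # Split the Monte Carlo effort into blocks of bootstrap replicates,
    # form the bootstrap estimate within each block, and report the
    # spread among block estimates as a gauge of sampling error.
    rng = random.Random(seed)
    block_means = []
    for _ in range(n_blocks):
        reps = bootstrap_replicates(data, stat, block_size, rng)
        block_means.append(sum(reps) / block_size)
    grand = sum(block_means) / n_blocks
    spread = math.sqrt(sum((m - grand) ** 2 for m in block_means)
                       / (n_blocks - 1))
    return grand, spread
```

As noted in the text, the spread among block means reflects only the variability of the central portion of the bootstrap distribution; it says nothing about the tails.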
We will further illustrate the methodology developed in the present chapter by
next considering analytic bootstrap expansions associated with the bootstrap-t
interval, BCa confidence intervals, and the Behrens-Fisher statistic.
Chapter 4
Double-Bootstrapping and the
Behrens-Fisher statistic
4.1 Introduction
The methodology developed in the preceding chapter will presently be further
illustrated by considering analytic expansions associated with the bootstrap-t interval,
BCa confidence intervals, and the Behrens-Fisher statistic.

Bootstrap-t intervals are described in Efron (1981) and the bias-corrected, accelerated
interval (BCa) is suggested in Efron (1987). The two-sample problem with
unequal variances has a long history; see, for example, Cox and Hinkley (1974, page
143) and Robinson (1982).
4.2 Symbolic Computation of the Bootstrap-t
Confidence Interval
4.2.1 Background
We present the conventional bootstrap-t procedure in this section. Our description
follows Efron and Tibshirani (1993). The notation follows that of Section 2.2 .
The standard large-sample central 1 - 2α confidence interval

(θ̂ - z^(1-α)·ŝe, θ̂ - z^(α)·ŝe)    (4.1)

is derived from the assumption that

Z = (θ̂ - θ)/ŝe ~ N(0, 1) ,    (4.2)

where z^(α) refers to the αth percentile of the standard normal distribution and ŝe is
an estimate of the standard deviation of θ̂. This is valid as n → ∞, but is only
an approximation for finite samples. In 1908, for the case θ̂ = x̄, Gosset derived the
better approximation

Z = (θ̂ - θ)/ŝe ~ t_{n-1} ,    (4.3)

where t_{n-1} represents the Student's t distribution on n - 1 degrees of freedom. This
approximation yields the interval

(θ̂ - t^(1-α)_{n-1}·ŝe, θ̂ - t^(α)_{n-1}·ŝe) ,    (4.4)

with t^(α)_{n-1} denoting the αth percentile of the t distribution on n - 1 degrees of freedom.
The use of the t distribution doesn't adjust the confidence interval to account for
skewness in the underlying population or other errors that can result when θ̂ is not
the sample mean. The bootstrap-t interval is a procedure which does adjust for these
errors. We describe it next.
From an observed random sample x = (x_1, ..., x_n) we generate B bootstrap
samples x*_1, x*_2, ..., x*_B, and for each we compute the quantity

t*(b) = (θ̂*(b) - θ̂) / ŝe*(b) ,    (4.5)

where θ̂*(b) is the value of θ̂ for the bootstrap sample x*_b and ŝe*(b) is the estimated
standard error of θ̂* for the same bootstrap sample x*_b. The αth percentile of the t*(b)
is estimated by the value t̂^(α) such that

#{t*(b) ≤ t̂^(α)}/B = α .    (4.6)

The bootstrap-t confidence interval is thus given by

(θ̂ - t̂^(1-α)·ŝe, θ̂ - t̂^(α)·ŝe) .    (4.7)
The bootstrap-t approach thus estimates the distribution of (4.3) directly from the
data. A bootstrap table is built by generating B bootstrap samples, and then
computing the bootstrap version of (4.3) for each; the bootstrap table consists of the
percentiles of these B values. Note that θ̂ and ŝe represent estimates using the original
sample; they are not bootstrap estimates. However, ŝe could be replaced by the
bootstrap estimate of standard error of θ̂, i.e. ŝe_B. We will use this option when we
consider the analytical bootstrap-t approach in the next section.
There are a number of advantages that the bootstrap-t procedure enjoys over the
standard methods. First, accurate intervals can be obtained without having to make
normal theory assumptions. Second, bootstrap-t intervals often enjoy better coverage.
Third, these intervals offer second order accuracy versus the first order accuracy
associated with standard methods. The bootstrap-t approach is particularly applicable
to location statistics like the sample mean or a sample percentile. However, at
least in its simple form, it cannot be trusted for more general problems, like setting
a confidence interval for a correlation coefficient. Moreover, it may perform erratically
in small-sample, nonparametric settings. The latter drawback can, however, be
alleviated by the use of appropriate transformations; see Tibshirani (1988) for the
implementation of the variance-stabilized bootstrap-t interval.

Finally, there is a computational problem associated with the bootstrap-t approach
which can be circumvented by employing the methodology developed in this
thesis. In the denominator of the statistic (4.5) we require ŝe*(b), the estimated
standard error of θ̂* for the bootstrap sample x*_b. The difficulty arises when θ̂ is
a complicated statistic for which there is no simple standard error formula. We
would then need to compute a bootstrap estimate of standard error for each bootstrap
sample, so that two (nested) levels of bootstrap sampling are required. The quantity
(4.5) now becomes

t*(b) = (θ̂*(b) - θ̂) / ŝe**(b) ,    (4.8)

where the double star notation in the denominator is used to signify that two levels of
bootstrapping are required. It is considered sufficient to use B = 25 for the estimation
of each ŝe**(b), while B = 1000 is needed for the computation of percentiles. Hence
the overall number of bootstrap samples needed is perhaps 25 × 1000 = 25,000, a
formidable number if θ̂ is costly to compute.
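The nested sampling scheme just described can be sketched in Python as follows. This is an illustrative stand-in for the conventional procedure, not the thesis's code, and in practice smaller B values than B = 1000 outer and B = 25 inner may be used for speed.

```python
import math
import random

def sd(v):
    # Sample standard deviation.
    m = sum(v) / len(v)
    return math.sqrt(sum((u - m) ** 2 for u in v) / (len(v) - 1))

def resample(x, rng):
    # One bootstrap resample: n draws with replacement.
    n = len(x)
    return [x[rng.randrange(n)] for _ in range(n)]

def bootstrap_t_interval(x, stat, alpha=0.05, B_outer=400, B_inner=25, seed=0):
    # Conventional nested bootstrap-t: each outer resample yields
    # t*(b) = (theta*(b) - theta_hat) / se**(b), where se**(b) is itself
    # estimated by an inner bootstrap of B_inner resamples.
    rng = random.Random(seed)
    theta = stat(x)
    reps, tstats = [], []
    for _ in range(B_outer):
        xb = resample(x, rng)
        tb = stat(xb)
        reps.append(tb)
        se_inner = sd([stat(resample(xb, rng)) for _ in range(B_inner)])
        if se_inner > 0:
            tstats.append((tb - theta) / se_inner)
    se_hat = sd(reps)  # bootstrap SE of theta_hat from the outer level
    tstats.sort()
    t_lo = tstats[int(alpha * len(tstats))]            # percentile of t*
    t_hi = tstats[int((1 - alpha) * len(tstats)) - 1]  # upper percentile
    return theta - t_hi * se_hat, theta - t_lo * se_hat
```

The cost is B_outer × B_inner evaluations of the statistic, which is the 25,000-sample burden the analytical approach is designed to avoid.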
4.2.2 Symbolic Computation of the Bootstrap-t Interval for
the Correlation Coefficient
The analytic bootstrap-t approach constructs confidence intervals in a manner
similar to the way standard normal and t tables are used to establish (4.4) from (4.3). There are
two key differences. First, the conventional estimate of standard error ŝe used in the
denominator of (4.3) is replaced by its bootstrap analogue

ŝe_B = [Var_{F̂}(θ̂*)]^{1/2} .    (4.9)

Second, both θ̂ and (4.9) may each be expressed in series form (to some order), as
was illustrated in the previous chapter, at which point the user may choose to compute
these numerically for given data. This gives rise to two advantages not enjoyed by
the conventional bootstrap-t procedure. Firstly, as will be shown below, maintaining
the quantities θ̂ and (4.9) in analytical form allows the subsequent computation
of certain bootstrap moments which may be used to further standardize the analytical
t-statistic described here. Moreover, properties of these bootstrap moments
with respect to the original underlying distribution may also be assessed, although
we shall not consider these. Secondly, calculation of an analytical bootstrap-t interval
precludes the need for Monte Carlo resampling at any level. For example, by
circumventing the requirement for building a distribution of B bootstrap percentiles
of (4.3), the analytic bootstrap-t method avoids the necessity of undertaking
double-bootstrapping in order to calculate quantities such as ŝe**(b). We illustrate
the approach for the sample correlation coefficient.
Replacing θ̂ = ρ̂ in (4.3), we modify

t = (ρ̂ - ρ)/ŝe    (4.10)

by considering the t-like statistic

t = (ρ̂ - ρ) / [Var_{F_n,G_n}(ρ̂*)]^{1/2} ,    (4.11)

which can be re-written as the difference

t = ρ̂/[Var_{F_n,G_n}(ρ̂*)]^{1/2} - ρ/[Var_{F_n,G_n}(ρ̂*)]^{1/2} .    (4.12)

Observe that (4.12) may be expressed analytically as a series expansion to any order
desired because the component factors ρ̂ and Var_{F_n,G_n}(ρ̂*) may each be similarly
expressed (i.e. algebraically) according to Section 3.4. In the second term of (4.12) ρ is
kept constant; when bootstrap moments of t are computed, ρ is then replaced by ρ̂, at
which point it is expressed analytically to some order.
The bootstrap expectation

E_{F_n,G_n}(t)    (4.13)

and the bootstrap variance

Var_{F_n,G_n}(t)    (4.14)

may each be represented analytically to any desired order. Moreover, properties of
(4.13) and (4.14) such as E_x[·] and Var_x[·] may also be assessed, although for brevity's
sake we shall not consider these here. For example, the first order expansion of the
bootstrap mean (4.13) may be derived immediately using the tool BootCum[t][i, j].
For the law school data we observe that BootCum[t][1,1] = 0.29595. The Mathematica
code underlying the tool BootCum[t][i, j] may be found in Appendix A.3
in a file called t.m.
It is well known that if we construct a confidence interval for Fisher's variance-stabilizing
transformation of the correlation coefficient φ = tanh^{-1} ρ, given by

φ = (1/2) log[(1 + ρ)/(1 - ρ)] ,    (4.15)

and then transform the endpoints back with the inverse transformation

ρ = tanh(φ) = (e^{2φ} - 1)/(e^{2φ} + 1) ,    (4.16)

then a more accurate interval is obtained. We will use φ in the context of comparing
three pairs of confidence intervals below.
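Computationally, the transformation and its inverse amount to the following Python sketch. Note the 1/sqrt(n - 3) standard error used here is the classical normal-theory value for Fisher's z, an assumption of this sketch rather than the t-percentile constructions compared below.

```python
import math

def fisher_interval(rho_hat, n, z=1.645):
    # 90% interval formed on the phi = arctanh(rho) scale, with the
    # endpoints mapped back to the rho scale via tanh.
    phi = math.atanh(rho_hat)        # forward transformation
    se = 1.0 / math.sqrt(n - 3)      # normal-theory SE of phi (assumption)
    lo, hi = phi - z * se, phi + z * se
    return math.tanh(lo), math.tanh(hi)  # inverse transformation
```

For the law school values ρ̂ = 0.776 and n = 15 this sketch gives roughly (.51, .91); note how the back-transformed interval is asymmetric about ρ̂ and respects the range (-1, 1).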
First, the standard 1 - 2α t-interval for the correlation coefficient follows (4.4)
and is given by (4.17). This can be improved by using φ̂ as in (4.18) and transforming
back to an interval for ρ using (4.16). We assume that the corresponding
percentiles t^(α)_{n-1} are different in (4.17) and (4.18). Second, the bootstrap-t
interval for ρ, given by (4.7) as (4.19), may also be improved by considering (4.20).
Again, we assume that the corresponding values of t̂^(α) are different in each case.
Third, the analytic bootstrap-t interval is based on (4.11) and is given by (4.21).
Finally, a standardized analytical bootstrap-t interval may be formulated by
standardizing (4.11) using (4.13) and (4.14); upon rearranging, where t is given by
(4.11), we obtain the interval (4.22).
For example, to calculate a 90% confidence interval for ρ for the law school data using
(4.22) we set

1. E_{F_n,G_n}(t) = 0.29595 (the first order result seen above), and

2. [Var_{F_n,G_n}(ρ̂*)]^{1/2} = 0.132, from Chapter 3.

Table 4.1 compares the six different 90% confidence intervals (4.17) to (4.22) for
the correlation coefficient for the n = 15 law school data.
Table 4.1: Comparison of 90% confidence intervals for ρ using analytic bootstrap-t, conventional bootstrap-t, and standard methods for the law school data (n = 15)

1. standard t                       (.51, .98)
2. standard t_φ                     (.51, .91)
3. boot-t                           (-.03, .90)
4. boot-t_φ                         (.44, .91)
5. analytical¹ boot-t               (.54, 1.01)
6. analytical¹ standardized boot-t  (.58, .98)

Intervals #3 and #4 were obtained by Efron and Tibshirani (1993, pp. 162-163)
using conventional double-bootstrapping. They used B = 25 bootstrap samples to
calculate the bootstrap estimate of standard error ŝe**(b) for each of 1000 outer
bootstrap samples, so that a total of 25,000 bootstrap samples were used. The
superscript "1" in intervals #5 and #6 signifies that all analytic component terms were
computed to first order, except for the bootstrap variance of the correlation, which was
calculated to fourth order as in Chapter 3.
4.2.3 Discussion

We have shown that the two nested levels of bootstrap sampling associated with
the conventional bootstrap-t interval can be avoided by representing the bootstrap-t
statistic in terms of a symbolic expansion. We compared confidence intervals for
the correlation using standard and conventional bootstrap-t methods with the same
intervals evaluated analytically.

It is observed that the conventional bootstrap-t interval #4 for the correlation
based on Fisher's z transformation has the same length as the analytical bootstrap-t
interval #5 computed without use of Fisher's transformation. However, the latter
interval has partially strayed outside the allowable range for the correlation. Moreover,
the analytical standardized bootstrap-t interval #6 is shorter than the conventional
interval #4.

This suggests that the analytical bootstrap-t approach provides intervals which
not only circumvent computer-intensive double-bootstrapping calculations but may
also be more accurate.
4.3 Symbolic Computation of BCa Confidence Intervals

4.3.1 Introduction

In this section we explore the development of computer algebraic tools which permit
the analytic expression of an integral component associated with BCa confidence
intervals. We first provide a general overview of the BCa confidence interval and its
precursor, the percentile method. We follow Efron and Tibshirani (1993) and Efron
(1987).
4.3.2 Background
A number of problems connected with conventional bootstrap-t confidence intervals
were outlined in Section 4.2.1. Instead of focusing on a statistic of the form t =
(θ̂ - θ)/ŝe, a different approach for remedying such problems works directly with the
bootstrap distribution of θ̂*. We describe first the percentile method, followed by its
improved version, the bias-corrected and accelerated (BCa) interval.
Suppose the empirical distribution F̂ generates the bootstrap data set x* by random
sampling, giving the bootstrap replication θ̂* = s(x*). This is repeated an
infinite number of times. Let Ĝ be the cumulative bootstrap distribution function of
θ̂*, that is,

Ĝ(c) = Pr{θ̂* ≤ c} .

The 1 - 2α percentile interval is defined by the α and 1 - α percentiles of Ĝ as

(Ĝ^{-1}(α), Ĝ^{-1}(1 - α)) ,    (4.23)

since by definition Ĝ^{-1}(α) = θ̂*^(α), the 100·αth percentile of the bootstrap
distribution. Expression (4.23) refers to the ideal bootstrap situation in which the number
of bootstrap replications is infinite. In practice we use some finite number B of
replications, in which case the approximate 1 - 2α percentile interval is

(θ̂*_B^(α), θ̂*_B^(1-α)) .    (4.24)

The lower confidence limit θ̂*_B^(α) is the 100·αth empirical percentile of the θ̂*(b) values,
that is, the B·αth value in the ordered list of the B replications of θ̂*.

The percentile interval for a parameter θ is more accurate than a standard normal
interval when the bootstrap distribution of θ̂* is non-normal. Both methods agree
well when the standard interval is constructed on an appropriate transformation φ̂
of θ̂ and then mapped back to the θ scale. The advantage of the percentile method
is that it automatically makes this transformation; unlike the standard interval, we
don't need to know the correct transformation. All we need assume is that such a
transformation exists.
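The finite-B percentile interval described above reduces to sorting the replicates and indexing. The following is a generic Python sketch, not code from the thesis:

```python
import random

def percentile_interval(x, stat, alpha=0.05, B=1000, seed=0):
    # Approximate 1 - 2*alpha percentile interval: sort the B bootstrap
    # replications and read off the B*alpha-th and B*(1-alpha)-th values.
    rng = random.Random(seed)
    n = len(x)
    reps = sorted(stat([x[rng.randrange(n)] for _ in range(n)])
                  for _ in range(B))
    return reps[int(B * alpha)], reps[int(B * (1 - alpha)) - 1]
```

No transformation of the statistic appears anywhere in the procedure, which is precisely the transformation-respecting property discussed in the text.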
The percentile method is not the last word in bootstrap confidence intervals. There
are other ways the standard intervals can fail besides non-normality. For example, θ̂
might be a biased normal estimate, as in (4.25), in which case no transformation can
fix things up. The BCa confidence interval represents an extension of the percentile
method that automatically handles both bias and transformations. Moreover, it
allows the standard error in (4.25) to vary with θ, rather than being forced to stay
constant. We discuss the BCa confidence procedure next.
The BCa interval endpoints are given by percentiles of the bootstrap distribution,
but not necessarily the same ones as in (4.23). The percentiles used depend on two
numbers â and ẑ_0, called the acceleration and the bias-correction, respectively. The 1 - 2α
BCa interval is given by

(θ̂*_B^(α_1), θ̂*_B^(α_2)) ,    (4.26)

where

α_1 = Φ( ẑ_0 + (ẑ_0 + z^(α)) / (1 - â(ẑ_0 + z^(α))) )

and

α_2 = Φ( ẑ_0 + (ẑ_0 + z^(1-α)) / (1 - â(ẑ_0 + z^(1-α))) ) .

Here Φ(·) is the standard normal cumulative distribution function and z^(α) is the
100·αth percentile point of a standard normal distribution. Notice that if â and ẑ_0
equal zero, then α_1 = α and α_2 = 1 - α, so that the BCa and percentile intervals
coincide.
The BCa interval given above is based on the following model. Recall that standard
confidence intervals are based on the assumption

(θ̂ - θ)/ŝe ~ N(0, 1) .    (4.28)

The BCa interval assumes, more generally, that there exists a monotone transformation
g and constants z_0 and a such that φ̂ = g(θ̂) and φ = g(θ) satisfy

φ̂ ~ N(φ - z_0·se_φ, se_φ²)    (4.30)

with

se_φ = se_{φ_0}[1 + a(φ - φ_0)]    (4.31)

for any fixed value of φ_0. The BCa interval is attractive because one does not need
to know the transformation g(·); the only three quantities needed to construct the
interval are the bootstrap distribution Ĝ and the constants ẑ_0 and â.

The estimated value of the bias-correction ẑ_0 is obtained directly from the proportion
of bootstrap replications less than the original estimate θ̂,

ẑ_0 = Φ^{-1}( #{θ̂*(b) < θ̂} / B ) .    (4.32)

Roughly speaking, the quantity ẑ_0 measures the median bias of θ̂*, that is, the
discrepancy between the median of θ̂* and θ̂, in normal units.
The quantity â is called the acceleration because it measures the rate of change
of the standard error of θ̂ with respect to the true parameter value θ, measured
on a normalized scale. The normal approximation θ̂ ~ N(θ, se²) assumes that the
standard error of θ̂ is the same for all θ. This is often unrealistic, and the acceleration
adjustment â corrects for it. For the multinomial distribution, which corresponds
to the nonparametric bootstrap, â may be computed in terms of the jackknife values
of a statistic θ̂ = s(x). Let x_(i) represent the original sample with the ith point x_i
deleted, let θ̂_(i) = s(x_(i)), and define

θ̂_(·) = Σ_i θ̂_(i) / n .    (4.33)

A simple expression for the acceleration is given by

â = Σ_i (θ̂_(·) - θ̂_(i))³ / { 6 [ Σ_i (θ̂_(·) - θ̂_(i))² ]^{3/2} } .    (4.34)
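For reference, the Monte Carlo route through the bias-correction and jackknife acceleration just described, the route the symbolic method of the next section avoids, can be sketched in Python as follows (a generic illustration, not the thesis's code):

```python
import random
from statistics import NormalDist

def bca_interval(x, stat, alpha=0.05, B=1000, seed=0):
    # BCa interval: bias-correction z0-hat from the proportion of
    # replicates below theta-hat, acceleration a-hat from the jackknife
    # formula, adjusted levels alpha1 and alpha2, then endpoints read
    # from the sorted replicates.
    rng = random.Random(seed)
    n = len(x)
    theta = stat(x)
    reps = sorted(stat([x[rng.randrange(n)] for _ in range(n)])
                  for _ in range(B))
    nd = NormalDist()
    z0 = nd.inv_cdf(sum(r < theta for r in reps) / B)   # bias-correction
    jack = [stat(x[:i] + x[i + 1:]) for i in range(n)]  # jackknife values
    jbar = sum(jack) / n
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den                                       # acceleration
    def level(z):
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    a1 = level(nd.inv_cdf(alpha))
    a2 = level(nd.inv_cdf(1 - alpha))
    return reps[int(B * a1)], reps[min(int(B * a2), B - 1)]
```

When z0 and a are both zero the adjusted levels collapse to alpha and 1 - alpha and the interval coincides with the percentile interval, mirroring the remark above.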
In summary, the percentile method generalizes the standard normal approximation
(4.28) by allowing a transformation g(·) of θ̂ and θ. The BCa method adds the further
adjustments ẑ_0 and â. The bootstrap-t approach is second-order accurate but not
transformation-respecting. The percentile method is transformation-respecting but
not second-order accurate. The standard method is neither, while the BCa method
is both.

In the next section we show how recourse to resampling methods such as the
jackknife for estimating the acceleration constant â can be avoided by using the
symbolic bootstrap methodology developed in the previous chapter.
4.3.3 Symbolic Computation of the Acceleration Constant â

We show how the computer algebraic tools developed in Chapter 3 may be used for
the symbolic representation of the acceleration adjustment used in BCa confidence
intervals. As will be shown below, the key reason that the acceleration constant is
conducive to symbolic representation (whereas the bias-correction is not) is that
it can be represented as a smooth function of averages. We provide a setting for this
representation by justifying the expression for se_φ given by (4.31), as follows.
Indeed, the first-order Taylor series expansion of se_φ² about φ = φ_0 is given by
(4.36), which may be re-expressed as (4.37). Taking the square root of each side of
(4.37), we have (4.38). Re-introducing the notation se_φ = σ_φ/√n, we observe that
(4.38) and (4.31) are the same, as desired, with a as given by (4.39).

By interpreting (4.36) as a simple linear regression model (4.40), with z = (φ - φ_0)
as the independent variable and the slope represented by β_1, (4.39) becomes (4.41).
Disregarding the divisor n in (4.41) for the moment, it is well known that the slope
β_1 of a generic regression model (4.40) can always be expressed as (4.42), which we
re-formulate in terms of an expectation E and a variance Var taken with respect to
any distribution F.
The expression (4.41) now becomes the basis for the symbolic computation of
(4.42). We show this for the case of the correlation coefficient, and then, using the law
school data, we compare our numerical evaluation of (4.42) with that of Efron (1987).
Indeed, setting z as in (4.43), we observe that (4.44) becomes (4.45).
Note that the subscript x in (4.45) refers to a bivariate sample of size n, whereas the
notation z as in (4.40) denotes a single independent random variable.
We now use the same tools developed in Chapter 3 to formulate a new tool called
Boot[Acc][i] for expressing (4.45) to any desired order. The Mathematica code
underlying the tool Boot[Acc][i] may be found in Appendix A.3 in a file called
bca.m. For example, the first-order symbolic bootstrap expansion of (4.45) is given
by (4.46). The analytic bootstrap expression (4.46) may be evaluated numerically for any pair of
distributions F and G associated with the variables x and y for which the
cumulants are known. In particular, we will use the empirical distribution functions
F_n and G_n as follows.

Using the law school data with ρ̂ = 0.776, (4.46) evaluates to

Boot[Acc][1] = -0.2772 ,

and upon factoring in the divisor n = 15 we find

â_1 = -0.01545 .
Using the fourth-order result from Chapter 3, substitution in (4.42) yields

â = -0.0700 .

Using the jackknife approximation (4.34), Efron (1987, p. 178) obtained a comparable value.
Finally, we can perform an independent check on the results obtained above by
expressing Var(ρ̂) as a first-order Taylor expansion about ρ = ρ_0, with
ρ̂ = ρ_0 = 0.776. Dividing by n = 15 we get -0.082323, which is close to our
value of â = -0.0700.
4.3.4 Discussion

We have shown that the acceleration adjustment associated with BCa confidence
intervals may be represented exactly as a symbolic bootstrap expansion to any order
desired. For illustration this was applied to the case of the correlation coefficient.
The resulting analytical expression is a formula which may be used repeatedly in any
context for which a BCa confidence interval for the correlation is desired. Numerical
evaluation of a first-order analytical bootstrap expansion of the acceleration yielded a
result consistent with that produced by jackknife resampling. Higher-order evaluation
would render more accurate results. Properties of the symbolic bootstrap
representation of the acceleration could be further assessed, although for brevity's sake these
are not considered here.
4.4 Analytic Bootstrap Computation for the
Behrens-Fisher statistic
4.4.1 Introduction
We develop here the analytical representation of bootstrap moments related to the
Behrens-Fisher statistic and their properties. Numerical evaluation of the symbolic
expressions is illustrated using data from a case study in medicine. We compare some
of these results with bootstrap estimates computed using Monte Carlo resampling.
4.4.2 Background

Suppose that (x_1, ..., x_{n_1}) and (y_1, ..., y_{n_2}) comprise independent samples from Gaussian
distributions N(μ_1, σ_1²) and N(μ_2, σ_2²), respectively. Since σ_1² and σ_2² are typically
unknown, we would consider estimating them by s_1² = Σ_{i=1}^{n_1}(x_i - x̄)²/(n_1 - 1) and
s_2² = Σ_{i=1}^{n_2}(y_i - ȳ)²/(n_2 - 1). The hypothesis

H_0 : μ_1 = μ_2

may be assessed using the statistic

t = (x̄ - ȳ) / √(s_1²/n_1 + s_2²/n_2) ,    (4.47)

where x̄ = Σ_{i=1}^{n_1} x_i/n_1 and ȳ = Σ_{i=1}^{n_2} y_i/n_2. (The test statistic (4.47) would have a
Student's t distribution under the null hypothesis if the two Gaussian populations
had the same variance σ_1² = σ_2² = σ² and a pooled estimate of σ² was used.) A
number of approximate solutions have been proposed in the literature. This is known
as the Behrens-Fisher problem. For illustrative purposes it suffices to consider the
balanced case only, namely when n_1 = n_2 = n, and so the test statistic becomes

t = (x̄ - ȳ) / √((s_1² + s_2²)/n) ,    (4.48)

say.

In the next section we shall calculate bootstrap estimates of the mean and variance
of the Behrens-Fisher statistic (4.48) and their properties by using the analytical
bootstrap. The symbolic calculations are facilitated by considering the modified
Behrens-Fisher statistic (4.49).
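Before the symbolic treatment, the balanced statistic (4.48) and its conventional Monte Carlo bootstrap moments can be computed directly. The following Python sketch is a generic illustration with synthetic inputs, not the survival data of Section 4.4.4:

```python
import math
import random

def bf_stat(x, y):
    # Balanced Behrens-Fisher statistic (4.48):
    # t = (xbar - ybar) / sqrt((s1^2 + s2^2) / n), with n = len(x) = len(y).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s1 = sum((v - xbar) ** 2 for v in x) / (n - 1)
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return (xbar - ybar) / math.sqrt((s1 + s2) / n)

def bootstrap_moments(x, y, B=2000, seed=0):
    # Monte Carlo bootstrap mean and variance of the statistic:
    # resample each group independently, with replacement.
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(B):
        xs = [x[rng.randrange(n)] for _ in range(n)]
        ys = [y[rng.randrange(n)] for _ in range(n)]
        reps.append(bf_stat(xs, ys))
    m = sum(reps) / B
    v = sum((r - m) ** 2 for r in reps) / (B - 1)
    return m, v
```

Resampling the two groups independently reflects the assumption that x and y come from independent populations with possibly unequal variances.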
4.4.3 Analytic Bootstrap Computation

As was done for the case of the correlation coefficient, Table 4.2 below lists the six
bootstrap quantities for which symbolic expressions associated with the modified
Behrens-Fisher statistic (4.49) are presented in this section. These represent the
bootstrap mean and bootstrap variance, and the mean and variance (over x and y)
of each of these. Moreover, if required, the computational tools developed here
can provide symbolic expansions for any bootstrap cumulant of the Behrens-Fisher
statistic to any desired order, as well as properties of these. The tools presented below
were coded in Mathematica and are based on the methodology developed in
Chapter 3.
Table 4.2: Bootstrap estimators and related properties for the Behrens-Fisher statistic for which analytic expansions are derived

| Bootstrap Mean (bf*) | Bootstrap Variance (bf*) |

The symbolic function BF[i] expands the modified Behrens-Fisher statistic (4.49)
using a Taylor series in residual sums correct to i-th order. We assume H_0 : μ_1 =
μ_2 = 0 and that the variables x and y are independent. For example, the 2nd order
expansion of (4.49) is given by (4.50).

The tool Cum[BF][i, j] returns the i-th cumulant (over x and y) of the modified
Behrens-Fisher statistic expressed as an expansion correct to order n^{-j/2}. For
example, the mean E_{x,y}(bf) and the variance Var_{x,y}(bf), expressed as expansions to
order n^{-1}, are given respectively by (4.51) and (4.52).
The function BootCum[BF][i, j] returns the i-th bootstrap cumulant of the
modified Behrens-Fisher statistic expressed as an expansion correct to order n^{-j/2}.
The Mathematica code underlying this tool and BF[i] may be found in Appendix
A.3. For example, the 4th order analytic expressions for the bootstrap expectation
E_{F_n,G_n}(bf*) (4.53) and the bootstrap variance Var_{F_n,G_n}(bf*) (4.54)
of the modified Behrens-Fisher statistic are given respectively by (4.55) and (4.56).
We now consider the procedure that evaluates properties, indeed various cumulants
(over x and y), of the bootstrap mean (4.53) and the bootstrap variance (4.54).
The function Cum[BootCum[BF][i]][j, k] returns a series expansion correct to
order n^{-k/2} of the j-th cumulant (over x and y) of the i-th bootstrap cumulant of
the modified Behrens-Fisher statistic. In particular,

E_{x,y}[E_{F_n,G_n}(bf*)] = Cum[BootCum[BF][1]][1,4] + O_p(n^{-5/2})    (4.57)
Var_{x,y}[E_{F_n,G_n}(bf*)] = Cum[BootCum[BF][1]][2,4] + O_p(n^{-5/2})    (4.58)
E_{x,y}[Var_{F_n,G_n}(bf*)] = Cum[BootCum[BF][2]][1,4] + O_p(n^{-5/2})    (4.59)
Var_{x,y}[Var_{F_n,G_n}(bf*)] = Cum[BootCum[BF][2]][2,4] + O_p(n^{-5/2}) .    (4.60)

In practice, higher moments of other bootstrap cumulants may be expanded to any
order of accuracy with these tools.
For example, the fourth order expansions considered in (4.57)-(4.59) include

Cum[BootCum[BF][1]][1,4] = 5 (Cum(X,X,X) - Cum(Y,Y,Y)) / [ 2 n² (Cum(X,X) + Cum(Y,Y))^{3/2} ] .    (4.61)
Since the fourth order expansion considered in (4.60) is lengthy, we provide below
the corresponding second order expression:

Cum[BootCum[BF][2]][2,2] = [ 2 Cum(X,X)² + 2 Cum(Y,Y)² + Cum(X,X,X,X) + Cum(Y,Y,Y,Y) ] / [ n³ (Cum(X,X) + Cum(Y,Y))² ] .    (4.64)
The six expressions (4.55), (4.56), and (4.61) - (4.64) are examples of series expansions of the six bootstrap quantities associated with Table 4.2. They constitute symbolic representations, in fact formulas, which may subsequently be applied to any pair of independent data sets for which numerical computations of the Behrens-Fisher statistic, its bootstrap moments, or properties of these are desired. Moreover, (4.55), (4.56), and (4.61) - (4.64) represent analytic bootstrap expressions which may be evaluated numerically for any pair of distributions F and G associated with the independent variables x and y for which cumulants are known. In particular, we will use the empirical (nonparametric) distribution functions F_n and G_n.
We next illustrate the numerical evaluation of these expansions within the context of a case study from medicine.
4.4.4 Numerical Application
Table 4.3 shows the original survival times of 88 patients with gastric cancer equally divided into two independent treatment groups. The data come from Gamerman (1991). Group 1 received chemotherapy and radiation while Group 2 received only chemotherapy. Interest lies in comparing the average survival times of the two groups. Without loss of generality, prior to evaluation each of the two data sets was standardized to have zero mean. It is assumed that both samples come from independent populations possessing unequal variances.
Table 4.3: Survival times (days) of gastric cancer patients (n = 44)
The 2nd and 4th order expansions of the six bootstrap quantities associated with Table 4.2 could be evaluated numerically for any underlying distributions for which cumulants exist. Here we evaluate the analytic expansions at the empirical distribution functions F_n and G_n because these are considered close to the true underlying F and G. Hence, the data in Table 4.3 are used to numerically evaluate the symbolic bootstrap expressions. The results were then scaled by appropriate powers of sqrt(n) so that they are associated with (4.48) instead of (4.49). The final results are shown in Table 4.4. For comparison purposes the bootstrap mean and bootstrap standard error of the Behrens-Fisher statistic have also been estimated using conventional Monte Carlo resampling for various values of B. The algorithm upon which these calculations are based is given in Efron and Tibshirani (1993, p. 4). The simulation results are presented in Table 4.5.

Table 4.4: Symbolic bootstrap and classical results for the Behrens-Fisher statistic using the survival data, n = 44

    Empirical distributions F_n and G_n       2nd order    4th order    Classical
    [Var_{x,y}(Var*_{x,y}(bf*))]^{1/2}        0.992728     1.44237

4.4.5 Discussion
We observe that the numerical evaluations of the 2nd and 4th order symbolic expressions for the bootstrap mean and the bootstrap standard error of the Behrens-Fisher statistic, given by E*_{x,y}(bf*) and [Var*_{x,y}(bf*)]^{1/2} in Table 4.4, are almost identical to the corresponding Monte Carlo calculations seen in Table 4.5. (The latter computations, representing estimates of bootstrap standard error, are multiplied by sqrt((n-1)/n) when compared to the symbolic-based computations.) This demonstrates that the analytic bootstrap is capable of providing bootstrap estimates without resorting to conventional bootstrap resampling.

Table 4.5: Monte Carlo bootstrap estimates of the mean and standard error of the Behrens-Fisher statistic for various B for the survival data, n = 44 (s.e. in brackets)

    B       Bootstrap mean     Bootstrap s.e.
    50       0.028 (0.155)     1.093 (0.125)
    100     -0.088 (0.108)     1.057 (0.077)
    200     -0.060 (0.072)     1.022 (0.058)
    400     -0.015 (0.049)     0.980 (0.036)
    800     -0.000 (0.035)     0.988 (0.024)
    1600     0.018 (0.025)     1.008 (0.019)

Diagnostic properties such as bias and variance of bootstrap moments may be assessed analytically and then evaluated numerically. For example, the bias of the bootstrap variance of the Behrens-Fisher statistic is represented by the difference E_{x,y}[Var*_{x,y}(bf*)] - Var_{x,y}(bf). This expression could be evaluated for any distribution, for example the Gaussian distribution. Using the empirical distribution functions F_n and G_n, we may calculate the 4th order bootstrap estimate of the bias of the bootstrap variance.
Similarly, the 4th order bootstrap estimate of the bias of the bootstrap expectation may be calculated. These results suggest that the bootstrap procedure for estimating the mean and variance of the Behrens-Fisher statistic is largely free of bias.
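The Monte Carlo comparison reported in Table 4.5 can be sketched in Python (an illustration only, not the thesis's S-Plus code; the synthetic data and the value of B are placeholders standing in for the standardized survival-time groups):

```python
import random
import statistics


def behrens_fisher(x, y):
    """Welch-type (Behrens-Fisher) statistic for two independent samples."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / (vx / nx + vy / ny) ** 0.5


def bootstrap_moments(x, y, B, seed=0):
    """Resample each group independently B times; return the Monte Carlo
    estimates of the bootstrap mean and bootstrap s.e. of the statistic."""
    rng = random.Random(seed)
    stats = []
    for _ in range(B):
        xs = [rng.choice(x) for _ in x]
        ys = [rng.choice(y) for _ in y]
        stats.append(behrens_fisher(xs, ys))
    return statistics.mean(stats), statistics.stdev(stats)


# placeholder data: two groups of 44 with unequal variances, zero means
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(44)]
y = [rng.gauss(0, 2) for _ in range(44)]
boot_mean, boot_se = bootstrap_moments(x, y, B=400)
print(boot_mean, boot_se)
```

Because the statistic is studentized, the bootstrap s.e. lands near 1, matching the pattern of Table 4.5.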
4.5 Conclusions
The symbolic bootstrap methodology developed in Chapter 3 may be applied to a broad class of statistics, namely those which may be expressed as smooth non-linear functions of averages. The Behrens-Fisher statistic, the acceleration adjustment in BCa confidence intervals, and various quantities associated with bootstrap-t confidence intervals all satisfy this criterion. The present chapter has shown in detail how the analytic bootstrap may be applied to these examples. In all cases numerical evaluation of the analytic bootstrap estimates matched computations based on conventional resampling techniques.
If required, the computational tools presented in this chapter with respect to the three examples studied could provide symbolic expansions for any bootstrap cumulant to any desired order, as well as properties of these.
Analytic bootstrap expansions constitute symbolic representations, in fact bootstrap formulas, which are functions of various cumulants and residual sums. They are true for any distribution(s) for which the cumulants are known. In particular, the empirical distribution functions were used for numerical evaluation. The symbolic expressions offer two advantages over conventional Monte Carlo resampling. First, the existence of the particular 2nd and 4th order analytic expansions seen in this chapter means that the derivational work they represent does not need to be duplicated by different practitioners: the expressions are available for efficient and repeated numerical evaluation for different data sets. Second, analytic bootstrap expansions enable the user to identify component statistical quantities, such as particular cumulants or residual sums, which may influence the magnitude of properties such as the bias or variance of the bootstrap procedure in question. Hence, analytic bootstrap methodology enables the user to better understand the bootstrap procedure being used.
Chapter 5
Concluding remarks
5.1 Remarks
In this thesis we established the representation of bootstrap moments and their properties as analytical expansions involving cumulants and standardized sums. This methodology was developed for a broad class of statistics, namely the M-estimators. The approach was illustrated by providing examples involving the correlation coefficient, the Behrens-Fisher statistic, double-bootstrapping, and BCa bootstrap confidence intervals. Numerical evaluations of the symbolic bootstrap expressions were shown to be consistent with analogous Monte Carlo approximations. An alternative methodology based on the exhaustive enumeration of bootstrap resamples was also presented and illustrated for the class of L-estimators, in particular the median. The exhaustive enumeration approach used for order statistics provides exact analytical bootstrap expressions, and although the approach may also be applied to more general estimators it has sample size restrictions. The use of asymptotic expansions in the case of M-estimators has no such limitation, and the symbolic bootstrap quantities may be obtained to an arbitrary precision.
The features/advantages of the analytic bootstrap developed in this thesis may be summarized, as they were in Chapter 1, as follows:
1. Analytic derivation and numerical evaluation of bootstrap estimates and their properties entirely avoids Monte Carlo resampling, resulting in faster and more efficient calculation.
2. It can be used diagnostically to assess properties such as bias, variance or other moments of a bootstrap procedure.
3. For the class of L-estimators (linear combinations of order statistics), the analytic bootstrap calculates ideal bootstrap estimators exactly for sample sizes up to n = 20 and calculates properties of these estimators exactly for much larger sample sizes.
For the class of M-estimators (maximum likelihood-type estimators), the analytic bootstrap calculates ideal bootstrap estimators and properties of these exactly for linear statistics, and to within a specified order of accuracy for large sample sizes in the case of non-linear statistics.
4. The analytic bootstrap may be applied to a broad class of statistics, namely to any estimator which may be expressed as a smooth non-linear function of averages.
5. The analytical approach for the class of M-estimators is less computer intensive for larger data sets. The reverse is true for conventional bootstrap calculations.
6. Analytic bootstrap expressions are potentially insightful because component statistical quantities such as cumulants and sums are identified.
7. Each analytic bootstrap expression (representing a bootstrap moment or a property of this) is a mathematical bootstrap formula which may be numerically evaluated efficiently for a given data set, or repeatedly evaluated for different data sets arising within a given application. The computational effort required to derive each formula need only be undertaken once. It is subsequently usable in perpetuity by other practitioners for the purpose of efficient numerical evaluation.
This thesis has established that a large class of problems may be addressed by the analytic bootstrap. Further work would be required to investigate the application of the methodology to estimators involving non-iid random variables. The methodology could also be extended to address multi-dimensional parameters. Other work may involve use of the analytic bootstrap on the basis of bootstrap resamples of smaller size m instead of n. Symbolic Cornish-Fisher expansions of bootstrapped quantiles of a statistic could be obtained by computing various bootstrap cumulants, plugging them into an Edgeworth expansion of the distribution function of the estimator, and inverting.
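The Cornish-Fisher route sketched above can be made concrete. The following Python fragment (an illustration with assumed cumulant values, not thesis output) applies the standard third- and fourth-cumulant Cornish-Fisher correction to a normal quantile:

```python
from statistics import NormalDist


def cornish_fisher_quantile(p, k3=0.0, k4=0.0):
    """Approximate the p-quantile of a standardized statistic from its
    third and fourth cumulants via the Cornish-Fisher expansion."""
    z = NormalDist().inv_cdf(p)
    return (z
            + (z ** 2 - 1) * k3 / 6
            + (z ** 3 - 3 * z) * k4 / 24
            - (2 * z ** 3 - 5 * z) * k3 ** 2 / 36)


# with all higher cumulants zero, the normal quantile is recovered
q0 = cornish_fisher_quantile(0.975)
q_skew = cornish_fisher_quantile(0.975, k3=0.5)  # positive skew raises the upper quantile
print(q0, q_skew)
```

In the symbolic setting, the bootstrap cumulants produced by the tools of Chapters 3 and 4 would play the role of k3 and k4.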
Appendix A
Source Code used for this Thesis
In this Appendix we provide the source code that was used for this thesis. This includes the Mathematica code underlying the computational tools developed in Chapters 2 and 3, together with code used for the examples in Chapter 4. S-Plus code that was used to carry out comparative Monte Carlo simulations related to these chapters is also given below.
A.1 Source Code used for Chapter 2
This section contains the source code for 3 Mathematica programs associated with the functions developed in Chapter 2, and the source code for examples of 3 programs written in S-Plus that were used to conduct simulation checks of the functions.
(* This program computes the multinomial-type probability of a given *)
(* set of weights (i.e. a list) t = {n1,n2,n3,...} *)
Print[lst2];
mult = Multinomial[f[1], f[2], f[3], f[4], f[5],
                   f[6], f[7], f[8], f[9], f[10],
                   f[11], f[12], f[13], f[14], f[15],
                   f[16], f[17], f[18], f[19], f[20]];
(* Print[mult]; *)
den = Apply[Times, Table[y[i], {i, ls}]];
(* Print[den]; *)
mult * 1 * m/den
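The quantity this program computes - the probability that uniform resampling of n points realizes a given vector of repetition counts - can be sketched compactly in Python (the helper name is illustrative, not from the thesis):

```python
from math import factorial


def resample_probability(weights):
    """Probability that uniform resampling of n = sum(weights) points
    reproduces the given repetition counts (one count per data point)."""
    n = sum(weights)
    coef = factorial(n)          # multinomial coefficient n! / (n1! n2! ...)
    for w in weights:
        coef //= factorial(w)
    return coef / n ** n         # each of the n^n resamples is equally likely


# sample of size 3: the resample {x1, x1, x2} has probability 3!/(2! 1! 0!) / 3^3
print(resample_probability([2, 1, 0]))  # 3/27
```

Summing this probability over every weight vector with the same total recovers 1, which is a quick sanity check on the formula.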
(* This program computes:
   Med[ {list t of weights} ] = { {r(i), n, p(r(i))} }
   for any list t where
   r(i) = the i-th order statistic having non-zero probability of occurrence,
   n = number of distinct order statistics available, and
   p(r(i)) = probability of occurrence of the i-th order statistic.
   This is done for ODD sample size only. Note that the sum of weights
   equals the sample size.
*)
OsPerm[t_] := Map[F, Permutations[t]]
F2[t_] := Union[Partition[Map[F, Permutations[t]], 1]]
Med[t_] := Block[{},
  Do[
    If[i <= Length[t], g[i] = Count[OsPerm[t], i];
      p[i] = g[i] / Length[Permutations[t]] // N],
    {i, Length[t]}];
tab2 = DeleteCases[tab, 0];
(* This code finds all partitions of any positive integer *)
ClearAll[J]
J[I[n1_], 0] := I[n1]
J[I[n1_], I[n2_]] := I[Join[n1, n2]]
J[I[n1_], I[n2_] + b_] := J[I[n1], I[n2]] + J[I[n1], b]
rem[I[a__]] := a
FPL[nn_] := Map[rem, Apply[List, FP[nn]]]
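For reference, the partition enumeration that FP/FPL performs has this minimal Python analogue (a sketch, not a translation of the Mathematica rules):

```python
def partitions(n, largest=None):
    """Yield all partitions of the positive integer n as non-increasing lists."""
    if largest is None:
        largest = n
    if n == 0:
        yield []
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield [part] + rest


print(list(partitions(4)))  # [[4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]]
```

Each partition of the sample size corresponds to one shape of bootstrap weight vector, which is why the enumeration above drives the exact bootstrap calculations.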
(* This computes the expectation of the bootstrap estimate of the second
   moment of a sample median for given odd sample size and lambda.
   It uses the preceding three programs.
   The function SecondMomOrder[r_, n_, l_] is defined here. *)
<<prob.m
<<os.m
<<partition.m
mbi[a_ + b_] := mbi[a] + mbi[b];
mbi[a_ b_] := a mbi[b] /; FreeQ[a, u];
mbi[u^a_. (1-u)^b_.] := Beta[a+1, b+1];
mbi[u] = Beta[2, 1];
mbi[u^a_] := Beta[a+1, 1];
mbi[(1-u)^b_] := Beta[1, b+1];
SecondMomOrder[r_, n_, l_] := mbi[Expand[(x[l])^2 f[r, n]]]
ExpBoot[n_, l_] := Block[{},
  a = Map[Prob, FPL[n]];
  b = Map[Med, FPL[n]];
  Do[
    If[Length[b[[i]]] == 1,
      c[i] = Flatten[b[[i]]];
      d[i] = a[[i]] SecondMomOrder[c[i][[1]], c[i][[2]], l],
      Do[
        e[j] = b[[i]][[j]][[3]] SecondMomOrder[b[[i]][[j]][[1]],
                 b[[i]][[j]][[2]], l],
        {j, Length[b[[i]]]}];
      d[i] = Apply[Plus, Table[e[j], {j, Length[b[[i]]]}]] a[[i]]],
    {i, Length[b]}];
  Apply[Plus, Table[d[i], {i, Length[a]}]]
]
(* This computes the variance of the bootstrap estimate of
   the second moment of a sample median for given odd sample size
   and lambda. The function VarSqOrder[r_, n_, l_] is defined here
   and this depends on SecondMomOrder[r_, n_, l_]. *)
<<prob.m
<<os.m
<<partition.m
mbi[a_ + b_] := mbi[a] + mbi[b];
mbi[a_ b_] := a mbi[b] /; FreeQ[a, u];
mbi[u^a_. (1-u)^b_.] := Beta[a+1, b+1];
mbi[u] = Beta[2, 1];
mbi[u^a_] := Beta[a+1, 1];
mbi[(1-u)^b_] := Beta[1, b+1];
x[l_] := u^l - (1-u)^l
f[r_, n_] := Beta[r, n-r+1]^(-1) u^(r-1) (1-u)^(n-r)
SecondMomOrder[r_, n_, l_] := mbi[Expand[(x[l])^2 f[r, n]]]
mv2[r_, n_, l_] := mbi[Expand[(x[l])^4 f[r, n]]]
VarSqOrder[r_, n_, l_] := mv2[r, n, l] - (SecondMomOrder[r, n, l])^2
a = Map[Prob, FPL[n]]^2;
b = Map[Med, FPL[n]];
Apply[Plus, Table[d[i], {i, Length[a]}]]
(* This was used to simulate the ExpBoot function *)
(* The particular example here assumes a Gaussian shape for
   a sample size of 3. There were 12 combinations in all *)
return(med)
med_apply(mat,2,boot)   # med contains 1000 transformed median values
mean1_mean(med^2)
stderr1_sqrt(var(med^2)/1000)
# Examples
# mean1: 0.03669427   stderr1: 0.001580984
# This is a simulation CHECK of the Mathematica function
# SecondMomOrder[i,n,l] (= mv[r,n,l]) = true value of E(median^2),
# that is, for the case where the i-th order statistic happens to be the
# median in a sample of size n.
# Below 1000 samples of size 3 have been simulated from a U[0,1] distribution.
# The particular example here assumes a Gaussian shape for
# a sample size of 3. (12 combinations)
mu_matrix(runif(3000),3,1000)
u.med_apply(mu,2,median)
x_(u.med^0.135) - (1-u.med)^0.135   # x resembles a Normal since lambda = 0.135
mean1_mean(x^2)
st.err1_sqrt(var(x^2)/1000)
# --- This is the conventional bootstrap, selecting 1000 resamples
# --- from the initial sample of size 3, calculating the transformed median,
# --- then the average of their squares.
# --- This is bootstrap-reg.
# The particular example here assumes a Gaussian shape for
# a sample size of 3. (12 combinations)
mat_matrix(runif(3),3,1)
med_apply(mat,2,boot)   # med contains 1000 transformed median values
mean1_mean(med^2)
stderr1_sqrt(var(med^2)/1000)
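The same conventional-bootstrap check can be sketched in Python (lambda = 0.135 and the sizes follow the S-Plus code above; everything else is illustrative):

```python
import random

LAMBDA = 0.135  # this power transform of a U(0,1) median is close to normal


def transformed_median(sample):
    """Transformed median x = u^lambda - (1-u)^lambda of an odd-sized sample."""
    u = sorted(sample)[len(sample) // 2]
    return u ** LAMBDA - (1 - u) ** LAMBDA


def boot_second_moment(sample, nboot=1000, seed=0):
    """Conventional bootstrap estimate of E[(transformed median)^2]."""
    rng = random.Random(seed)
    vals = [transformed_median([rng.choice(sample) for _ in sample])
            for _ in range(nboot)]
    return sum(v * v for v in vals) / nboot


rng = random.Random(42)
sample = [rng.random() for _ in range(3)]
print(boot_second_moment(sample))
```

The result can then be compared with the exact ExpBoot value computed symbolically above.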
A.2 Source Code used for Chapter 3
This section contains the Mathematica code that was used in Chapter 3 for the symbolic derivation and numerical evaluation of bootstrap estimators and corresponding properties associated with M-estimators. Code connected with the case of the correlation coefficient is also included. S-Plus code for producing comparative Monte Carlo simulations for the correlation coefficient is given at the end of this section.
(* This contains the code for: ThetaHat[i] and BootCum[thetahat][i, j] *)
(* Supporting code called astat.3.0.new.m is associated with
   Andrews et al. (1993). *)
(* Change Cums to Gmoms if rv is X1 *)
Cum[{a__}] := Cum[a]
Mom[{a__}] := Mom[a]
Cum[a_] := Mom[a] /; !FreeQ[a, X1[w]]
Cum[a_, b__] := CfromM[a, b] /; !FreeQ[{a, b}, X1[w]]
Mom[a_] := Gmom[a] /; !FreeQ[a, X1[w]]
Mom[a_, b__] := Gmom[Apply[Times, {a, b}]] /; !FreeQ[a, X1[w]]
(* Read in definitions *)
$Echo = "stdout"
(* Define psi function, root etc. *)
Psi = psifn[X[w], al] &
Nroot0[Cum[Psi]] = theta
thetahat = Nroot[AF[n][Psi]];
(* empir replaces F with Fn in stat *)
(* eCum stands for empirical Cum *)
rsn[a_] := rsm[a] /. X[w] -> Xstar[w]
NT[empir[a_], i_] := Nget[Expand[Nsum[a, i]], i] /.
  RS[n] -> rsn /. rsm -> RS[n] /. Cum -> tCum /. tCum -> Cum
tCum[a__] := s[Length[{a}]] eCum[a] /; !FreeQ[{a}, Xstar]
(* eCum[a_] := n^(-1) S[n][a] /. X[w] -> Xstar[w] *)
(* X* = Xstar, i.e. the weighted or bootstrapped X *)
ThetaHat[i_] := Nsum[thetahat, i]
BootCum[thetahat][i_, j_] :=
  Nsum[empir[Ncum[empir[Npower[thetahat, 1]], i]], j]
(* In practice, for a given statistic, this BootCum code must use
   the tool Nboot. See for example the code used to derive
   bootstrap moments of the correlation coefficient below. *)
(*
ExpBootExp[i_, j_, k_] :=
  Nsum[Ncum[empir[Ncum[empir[Npower[thetahat, 1]], k]], j], i]
*)
(* This defines the bootstrap expectation of a statistic *)
(* Supporting code called astat.3.0.new.m is associated
   with Andrews et al. (1993). *)
RVQ::usage =
  "RVQ[a] returns True iff a is a random variable."
ClearAll[RVQ]
RVQ[a_, b__] := If[RVQ[a], True, RVQ[b]]
RVQ[g_[a___]] := If[MemberQ[ConsExpectFns, g], False,
  If[MemberQ[RandomFns, g], True, RVQ[a]]]
RVQ[z_] := False
ConsExpectFns = {Expect, Moment, CentralMoment, Theta, Cum, ED, InvED, ICum};
RandomFns = {RD, RS, DF};
AllRVQ[a_, b__] := If[RVQ[a], AllRVQ[b], False]
AllRVQ[a_] := RVQ[a];
VQ::usage =
  "VQ[a] returns True iff a is a variable."
ClearAll[VQ]
VQ[a_, b__] := If[VQ[a], True, VQ[b]]
VQ[g_[a___]] := If[MemberQ[ConsFns, g], False,
  If[MemberQ[VarFns, g], True, VQ[a]]]
VQ[z_] := False
ConsFns = {SUM};
VarFns = {};
AllVQ[a_, b__] := If[VQ[a], AllVQ[b], False]
AllVQ[a_] := VQ[a];
FPQ::usage =
  "FPQ[a, b, ..] returns the full partition of a, b, ..."
SetAttributes[PE, Orderless]
ClearAll[addn1]
addn1[n_, a_ + b_] := addn1[n, a] + addn1[n, b]
addn1[n_, (a_ + b_) c_] := addn1[n, a c] + addn1[n, b c]
addn1[n_, a_ PE[b__]] := a addn1[n, PE[b]]
addn1[n_, PE[a__]] :=
  Apply[Plus, Table[Apply[PE, addn2[{a}, n, iadd]], {iadd, 1, Length[{a}]}]] +
  Append[PE[a], {n}]
ClearAll[addn2]
addn2[a_, n_, i_] :=
  Table[If[jaddn2 == i, Append[a[[i]], n], a[[jaddn2]]],
    {jaddn2, 1, Length[a]}];
ClearAll[FP1]
FP1[inlist_] := Expand[FP1[PE[inlist], Length[inlist]]]
FP1[pe_, 1] := pe
FP1[a_ + b_] := FP1[a] + FP1[b]
FP1[c_ PE[a__], n_] := c FP1[PE[a], n]
FP1[PE[a__], n_] :=
  addn1[Last[{a}], FP1[PE[Drop[{a}, -1]], n-1]]
EV::usage =
  "EV[a] returns the expected value of a"
ClearAll[EV]
EV[a_ + b_] := EV[a] + EV[b]
EV[(a_ + b_)^c_. d_.] := EV[Expand[(a+b)^c d]] /;
  ((c > 0) && (IntegerQ[c]))
EV[a_ b_] := a EV[b] /; !RVQ[a]
EV[a_] := a /; !RVQ[a]
EV[SUM[a_]] := SUM[EV[a]]
EV[SUM[a_] c_] := EV[c, a]
EV[SUM[a_] c_., b__] := EV[c, a, b]
EV[SUM[a_]^b_ c_.] := EV[SUM[a]^(b-1) c, a]
EV[SUM[a_]^b_ c_., d__] := EV[SUM[a]^(b-1) c, a, d]
EV[1, a__] := Collect[Expand[Map[EVPE, FP1[{a}]]], SUM]
ClearAll[EVPE]
EVPE[(a_ + b_)] := EVPE[a] + EVPE[b]
EVPE[(a_ + b_) c_] := EVPE[a c] + EVPE[b c]
EVPE[a_ PE[b__]] := a EVPE[PE[b]]
EVPE[PE[a_]] := SUM[EV[Apply[Times, a]]]
EVPE[PE[a__]] := Apply[SUM, Map[EV[Apply[Times, #1]] &, PE[a]]]
BEV::usage =
  "BEV[a] returns the bootstrap expected value of a"
ClearAll[BEV]
BEV[a_ + b_] := BEV[a] + BEV[b]
BEV[(a_ + b_)^c_. d_.] := BEV[Expand[(a+b)^c d]] /;
  ((c > 0) && (IntegerQ[c]))
BEV[a_ b_] := a BEV[b] /; !RVQ[a]
BEV[a_] := a /; !RVQ[a]
BEV[SUM[a_]] := SUM[a]
BEV[SUM[a_] c_] := BEV[c, a]
BEV[SUM[a_] c_., b__] := BEV[c, a, b]
BEV[SUM[a_]^b_ c_.] := BEV[SUM[a]^(b-1) c, a]
BEV[SUM[a_]^b_ c_., d__] := BEV[SUM[a]^(b-1) c, a, d]
BEV[1, a__] := Collect[Expand[Map[BEVPE, FP1[{a}]]], SUM]
ClearAll[BEVPE]
BEVPE[(a_ + b_)] := BEVPE[a] + BEVPE[b]
BEVPE[(a_ + b_) c_] := BEVPE[a c] + BEVPE[b c]
BEVPE[a_ PE[b__]] := a BEVPE[PE[b]]
BEVPE[PE[a_]] := wt[Length[a]] SUM[Apply[Times, a]]
BEVPE[PE[a__]] :=
  wtpe[PE[a]] Apply[SUM, Map[Apply[Times, #1] &, {a}]]
ClearAll[wt]
SetAttributes[wt, Orderless]
wt[a__] := wt[a] = mgf[a] /. wrules
BVAR::usage =
  "BVAR[a] returns the bootstrap variance of a"
BVAR[a_] := Collect[BEV[a^2] - BEV[a]^2, SUM]
SUM::usage =
  "SUM[a] represents the sum of a_i.
   SUM[a, b, ..] represents the sum of a_i, b_j, ...
   over i != j != ..."
ClearAll[SUM]
SetAttributes[SUM, Orderless]
SUM[a_] := n a /; !VQ[a]
SUM[a_ + b_] := SUM[a] + SUM[b]
SUM[a_ + b_, c__] := SUM[a, c] + SUM[b, c]
SUM[a_ b_] := a SUM[b] /; !VQ[a]
SUM[a__] :=
  Collect[Expand[Map[CSUM[Length[{a}], #1] &, FP1[{a}]]], SUM] /;
  Length[{a}] > 1
ClearAll[CSUM]
CSUM[l_, a_ + b_] := CSUM[l, a] + CSUM[l, b]
CSUM[l_, a_ b_] := a CSUM[l, b] /; !VQ[a]
CSUM[l_, (b_ + c_)^d_.] :=
  CSUM[l, Expand[(b+c)^d]] /; IntegerQ[d] && (d > 0)
CSUM[l_, PE[a__]] := Block[{la},
  la = Length[{a}];
  (-1)^(l - la) Apply[Times, Map[(Length[#1] - 1)! SUM2[#1] &, {a}]]]
SUM2[{a__}] := SUM[Times[a]]
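The identity behind these reduction rules - a sum over distinct indices rewritten in terms of unrestricted sums - can be checked numerically. For two sequences, sum_{i != j} a_i b_j = (sum_i a_i)(sum_j b_j) - sum_i a_i b_i; a Python illustration:

```python
def sum_distinct(a, b):
    """Direct sum of a_i * b_j over i != j."""
    return sum(a[i] * b[j]
               for i in range(len(a)) for j in range(len(b)) if i != j)


def sum_reduced(a, b):
    """Same quantity via unrestricted sums, as the symbolic rules produce."""
    return sum(a) * sum(b) - sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 4.0]
print(sum_distinct(a, b), sum_reduced(a, b))
```

The CSUM rule generalizes this two-factor case to arbitrary set partitions of the index set, with the (-1)^(l-la) (la-1)! signs playing the role of the inclusion-exclusion coefficients.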
Nboot::usage =
  "NT[Nboot[a_], i] represents the i-th term of the bootstrap
   expectation of a";
NT[Nboot[a_], i_] := NT[Nboot[a], i] =
  Collect[BootSum[a, i, 0], RS]
BootSum[a_, i_, j_] := Nget[BootTerm[a, j], i-j] +
  If[j < i, BootSum[a, i, j+1], 0]
BootTerm[a_, i_] := BootTerm[a, i] =
  Collect[Expand[(BEV[NT[a, i] /. RS[n] -> RtoS] /. SUM -> StoR)], n]
StoR[ZM[a__]] := n^(1/2) RS[n][a]
RtoS[a_] := n^(-1/2) SUM[ZM[a]]
StoR[a_] := RS[n][a /. ZM -> ZtoX]
ZtoX[a_] := a - Cum[a]
EmpCum[a_] := n^(-1) SUM[a]
EmpCum[a__] := Map[SignedEV, FP1[{a}]]
SignedEV[b_ PE[a__]] := b SignedEV[PE[a]]
SignedEV[PE[a__]] :=
  (-1)^(Length[{a}] - 1) (Length[{a}] - 1)! *
    Apply[Times, Map[BAV, Map[Apply[Times, #1] &, {a}]]]
BAV[a_] := n^(-1) SUM[a]
(* Example *)
RVQ[X] = True
VQ[X] = True
EmpCum[X, X, X, X]
<<BSTAT.1.0.new.m   (* bootstrap code *)
<<astat.3.0.new.m
(* Some Examples *)
RVQ[X1] = True
RVQ[X2] = True
RVQ[X3] = True
VQ[X1] = True
VQ[X2] = True
VQ[X3] = True;
SUM[X1]
SUM[X1, X2]
SUM[X1, X2, X3]
SUM[X1 X2 X3] + SUM[X1, X2 X3] + SUM[X1 X2, X3] +
  SUM[X1 X3, X2] + SUM[X1, X2, X3]
EV[SUM[X1] SUM[X2]]
BEV[SUM[X1]]
BEV[SUM[X1] SUM[X2]]
BEV[SUM[X1] SUM[X2] SUM[X3]]
BVAR[SUM[X1]/n]
BVAR[BVAR[SUM[X1]/n]]
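The final example calls compute the bootstrap variance of the sample mean, for which the ideal bootstrap has the well-known exact closed form Var*(xbar*) = n^(-2) sum_i (x_i - xbar)^2. A quick Python check against direct resampling (placeholder data, illustration only):

```python
import random


def bvar_mean_exact(x):
    """Exact (ideal) bootstrap variance of the sample mean."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / n ** 2


def bvar_mean_mc(x, nboot=20000, seed=0):
    """Monte Carlo approximation by direct resampling."""
    rng = random.Random(seed)
    n = len(x)
    means = [sum(rng.choice(x) for _ in range(n)) / n for _ in range(nboot)]
    m = sum(means) / nboot
    return sum((v - m) ** 2 for v in means) / nboot


x = [1.2, -0.4, 0.8, 2.1, -1.5]
print(bvar_mean_exact(x), bvar_mean_mc(x))
```

This is the numerical content of BVAR[SUM[X1]/n]: for linear statistics the symbolic bootstrap moment is exact, so resampling is unnecessary.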
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
Cum[LX[w]] = 0
Cum[LY[w]] = 0
LX/: Cum[LX[w], LX[w]] = 1
LY/: Cum[LY[w], LY[w]] = 1
LX/: Cum[LX[w], LY[w]] = rho
cov/: NT[cov, i_] := NT[cov, i] =
  If[i == 0, Cum[LX[w], LY[w]],
    If[i == 1, RS[n][LX[w] LY[w]] -
        Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
      If[i == 2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
vax/: NT[vax, i_] :=
  If[i == 0, Cum[LX[w], LX[w]],
    If[i == 1, RS[n][LX[w] LX[w]] -
        Cum[LX[w]] RS[n][LX[w]] - Cum[LX[w]] RS[n][LX[w]],
      If[i == 2, -RS[n][LX[w]] RS[n][LX[w]], 0]]]
vay/: NT[vay, i_] :=
  If[i == 0, Cum[LY[w], LY[w]],
    If[i == 1, RS[n][LY[w] LY[w]] -
        Cum[LY[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LY[w]],
      If[i == 2, -RS[n][LY[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[NF[#1^(-1/2) &, Ntimes[vax, vay]], cov], i]], RS]
Cor[i_] := Nsum[cor, i]
NT[cor, 2]
Nsum[cor, 2]
NT[Nboot[cor], 0]
NT[Nboot[cor], 1]
NT[Nboot[cor], 2]
v2 = Expand[Nsum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 2]]
Collect[(v2 /. RS[n] -> RtoS /. Cum -> EmpCum), SUM];
(* This computes 2nd order symbolic bootstrap expansions
   for the correlation coefficient and numerically
   evaluates these for two data sets.
   4th order code is similar.
   This is for n = 15 AND n = 100. *)
(* Uses NEW versions of both BSTAT and ASTAT! *)
(* Calculates cor2, e2 = BootExp[cor], v2 = BootVar[cor] *)
(* Plus TeXForm and numerical versions for e2, v2 *)
(* Assuming empirical distribution Fn,
   i.e. set RS[n] -> 0 below, and use empirical cums here *)
(* The values below are correctly standardized! *)
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
(* Expansion of correlation *)
Cor[i_] := Nsum[cor, i]
cor2 = Nsum[cor, 2] /. texr
incor2 = TeXForm[cor2]
incor22 = InputForm[cor2]
(* Expansion of bootstrap mean of the correlation *)
e2 = Expand[Nsum[Nboot[cor], 2]]
ine2 = TeXForm[e2]
ine22 = InputForm[e2]
(* Expansion of E - over x - of bootstrap mean of the correlation *)
ee2 = Expand[Nsum[Ncum[Nboot[cor], 1], 2]]
inee2 = TeXForm[ee2]
inee22 = InputForm[ee2]
(* Expansion of Var - over x - of bootstrap mean of the correlation *)
ve2 = Expand[Nsum[Ncum[Nboot[cor], 2], 2]]
inve2 = TeXForm[ve2]
inve22 = InputForm[ve2]
(* Expansion of bootstrap variance of the correlation *)
v2 = Expand[Nsum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 2]]
inv2 = TeXForm[v2]
inv22 = InputForm[v2]
(* Expansion of E - over x - of bootstrap variance of the correlation *)
ev2 = Expand[Nsum[Ncum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 1], 2]]
inev2 = TeXForm[ev2]
inev22 = InputForm[ev2]
(* Expansion of Var - over x - of bootstrap variance of the correlation *)
vv2 = Expand[Nsum[Ncum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 2], 2]]
invv2 = TeXForm[vv2]
invv22 = InputForm[vv2]
PLUS[{a__}] := PLUS[a]
(* The following is for n = 15 and n = 100 *)
e2num1 = e2SUM /. {SUM -> PLUS,
  LX[w] -> {-0.600997, 0.860219, -1.04679, -0.551465, 1.62798,
    -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
    1.30601, -0.625764, -1.36875, -0.700063, -0.155203},
  LY[w] -> {1.25538, 0.872812, -1.21003, -0.274879, 1.46791,
    -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
    0.107685, -1.50758, -1.42257, -0.912485, -0.572429},
  n -> 15,
  rho -> 0.776374} /. PLUS -> Plus
  rho -> 0.692704} /. PLUS -> Plus
... ETC ...
(******************** r.s2 ********************************)
(* SPLUS code for producing comparative Monte Carlo
   simulations for the correlation coefficient for
   n = 15 law school data.
   Can hence compare the bootstrap mean and standard error for the
   correlation coefficient obtained using the analytical
   bootstrap with the same obtained using conventional resampling
   as done below. *)
options(object.size=40e6)
data_c(law1, law2)
xdata_matrix(data, ncol=2)
x_1:n
nboot_25600
theta_function(x, xdata){ cor(xdata[x,1], xdata[x,2]) }
results_bootstrap(x, nboot, theta, xdata)
bs.se_sqrt(var(results))
varvar_(nboot/((nboot-1)^2))*var((results - mean(results))^2)
var.bsse_(1/(4*var(results)))*varvar
bs.se
se.bsse
xbar
se
(************************ r2.s ************************************)
(* SPLUS code for producing comparative Monte Carlo
   simulations for the correlation coefficient for
   n = 100 simulated standard normal data.
   Can hence compare the bootstrap mean and standard error for the
   correlation coefficient obtained using the analytical
   bootstrap with the same obtained using conventional resampling
   as done below. *)
options(object.size=50e6)
data_c(norm1, norm2)
xdata_matrix(data, ncol=2)
bootstrap_function(x, nboot, theta, xdata){
  data_matrix(sample(x, size=length(x)*nboot, replace=T), nrow=nboot)
  return(apply(data, 1, theta, xdata))
}
bs.se_sqrt(var(results))
varvar_(nboot/((nboot-1)^2))*var((results - mean(results))^2)
var.bsse_(1/(4*var(results)))*varvar
se.bsse_var.bsse^0.5
xbar_mean(results)
se_sqrt(var(results)/nboot)
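The variance-of-the-bootstrap-s.e. computation in these scripts uses the delta method: if v is the sampled variance of the bootstrap replicates, then se(sqrt(v)) ~ sqrt(Var(v)/(4 v)). A Python rendering (synthetic bivariate normal data as a stand-in for the law school data):

```python
import random


def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_corr_se(x, y, nboot=2000, seed=0):
    """Bootstrap s.e. of the correlation, with a delta-method standard
    error for the bootstrap s.e. itself (mirroring the S-Plus scripts)."""
    rng = random.Random(seed)
    n = len(x)
    results = []
    for _ in range(nboot):
        idx = [rng.randrange(n) for _ in range(n)]
        results.append(corr([x[i] for i in idx], [y[i] for i in idx]))
    m = sum(results) / nboot
    v = sum((r - m) ** 2 for r in results) / (nboot - 1)
    sq = [(r - m) ** 2 for r in results]
    msq = sum(sq) / nboot
    var_sq = sum((s - msq) ** 2 for s in sq) / (nboot - 1)
    varvar = (nboot / (nboot - 1) ** 2) * var_sq
    se_bsse = (varvar / (4 * v)) ** 0.5   # Var(sqrt(v)) ~ Var(v) / (4 v)
    return v ** 0.5, se_bsse


rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [0.7 * xi + 0.3 * rng.gauss(0, 1) for xi in x]
print(bootstrap_corr_se(x, y))
```

The se.bsse output quantifies the Monte Carlo noise that the analytic bootstrap avoids entirely.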
A.3 Source Code used for Chapter 4
This section contains the Mathematica programs that were used for the examples
in Chapter 4. S-PLUS code for producing comparative Monte Carlo simulations for
the Behrens-Fisher statistic is given at the end of this section.
(****************** t.m ************************************)
(* This defines the analytical bootstrap-t statistic
for the correlation coefficient
and derives the symbolic bootstrap expectation of
it as well as evaluating it for the law school data *)
$Echo = "stdout"
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
Cum[LX[w]] = 0
Cum[LY[w]] = 0
LX/: Cum[LX[w], LX[w]] = 1
LY/: Cum[LY[w], LY[w]] = 1
LX/: Cum[LX[w], LY[w]] = rho
cov/: NT[cov, i_] := NT[cov, i] =
  If[i==0, Cum[LX[w], LY[w]],
   If[i==1, RS[n][LX[w]LY[w]] -
     Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Ntimes[vax, vay]], cov], i]], RS]
cor2/: NT[cor2, i_] := NT[cor2, i] = NT[cor, i+1]
bootvar/: NT[bootvar, i_] := NT[bootvar, i] =
  NT[Nboot[Npower[cor, 2]], i] - NT[Npower[Nboot[cor], 2], i]
bootvar2/: NT[bootvar2, i_] := NT[bootvar2, i] =
  NT[Nboot[Npower[cor2, 2]], i] - NT[Npower[Nboot[cor2], 2], i]
t/: NT[t, i_] := NT[t, i] =
  Expand[
   NT[Ntimes[cor2, Nf[#1^(-1/2)&, bootvar2]], i]
    - NT[Ntimes[r1, Nf[#1^(-1/2)&, bootvar]], i]]
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
bootexpt1 = Expand[Nsum[Nboot[Npower[t, 1]], 1]] /. r1 -> cor2
% /. texr
TeXForm[%]
% /. texr;
Simplify[%]
TeXForm[%]
bootexp = Expand[Nsum[Nboot[t], 1]]
% /. texr;
Simplify[%]
TeXForm[%]
bootexpt1SUM = Collect[(bootexpt1 /. RS[n] -> (0&) /. Cum -> EmpCum), SUM];
bootexpt1SUM /. {SUM -> PLUS, LX[w] -> {-0.600997, 0.860219, -1.04679, -0.551465,
    1.62798, -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
    1.30601, -0.625764, -1.36875, -0.700063, -0.155203},
   LY[w] -> {1.25538, 0.872812, -1.21003, -0.274879, 1.4679,
    -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
    0.107685, -1.50758,
    -1.42257, -0.912485, -0.572429},
   n -> 15,
   rho -> 0.776374} /. PLUS -> Plus
(************************ bca1.m ************************)
(* This calculates the symbolic bootstrap expansion of the
acceleration adjustment associated with BCa C.I.'s.
This is done for the correlation coefficient w.r.t.
the law school data. *)
RVQ[X] = True
VQ[X] = True
RVQ[Y] = True
VQ[Y] = True
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
Cum[LX[w]] = 0
Cum[LY[w]] = 0
LX/: Cum[LX[w], LX[w]] = 1
LY/: Cum[LY[w], LY[w]] = 1
LX/: Cum[LX[w], LY[w]] = rho
cov/: NT[cov, i_] := NT[cov, i] =
  If[i==0, Cum[LX[w], LY[w]],
   If[i==1, RS[n][LX[w]LY[w]] -
     Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
vax/: NT[vax, i_] :=
  If[i==0, Cum[LX[w], LX[w]],
   If[i==1, RS[n][LX[w]LX[w]] -
     Cum[LX[w]] RS[n][LX[w]] - Cum[LX[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LX[w]], 0]]]
vay/: NT[vay, i_] :=
  If[i==0, Cum[LY[w], LY[w]],
   If[i==1, RS[n][LY[w]LY[w]] -
     Cum[LY[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LY[w]],
    If[i==2, -RS[n][LY[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Ntimes[vax, vay]], cov], i]], RS]
cor2/: NT[cor2, i_] := NT[cor2, i] = NT[cor, i+1]
(* Bootvar gives the ith term, as opposed to the sum that v2 gives.
We need this ith term as input to get the ith term of the t statistic *)
bootvar/: NT[bootvar, i_] := NT[bootvar, i] =
  NT[Nboot[Npower[cor, 2]], i] - NT[Npower[Nboot[cor], 2], i]
bootvar2/: NT[bootvar2, i_] := NT[bootvar2, i] =
  NT[Nboot[Npower[cor2, 2]], i] - NT[Npower[Nboot[cor2], 2], i]
(* The next section calculates the acceleration constant 'a' which
also represents a slope! *)
prod/: NT[prod, i_] := NT[prod, i] =
  Expand[NT[Ntimes[cor, bootvar], i]]
bprod/: NT[bprod, i_] := NT[bprod, i] =
  Expand[NT[Nboot[prod], i]]
bprod2/: NT[bprod2, i_] := NT[bprod2, i] = NT[bprod, i+1]
bcor/: NT[bcor, i_] := NT[bcor, i] =
  Expand[NT[Nboot[cor], i]]
bootvar2/: NT[bootvar2, i_] := NT[bootvar2, i] =
  NT[Nboot[Npower[cor2, 2]], i] - NT[Npower[Nboot[cor2], 2], i]
bbvar2/: NT[bbvar2, i_] := NT[bbvar2, i] =
  Expand[NT[Nboot[bootvar2], i]]
(* Made up of three pieces: bprod2, bootvar2, bcor. *)
acc3/: NT[acc3, i_] := NT[acc3, i] =
  Expand[NT[Ntimes[bprod2, Nf[#1^(-1/2)&, bootvar2]], i]] -
   Expand[NT[Ntimes[Ntimes[bcor, bootvar2], Nf[#1^(-1/2)&, bootvar2]], i]]
Boot[Acc][i_] := Simplify[Nsum[acc3, i]]
(************* NUMERICAL EVALUATION FOR n = 15 *****************)
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
slope = Simplify[Nsum[acc3, 1]]  (* first order slope *)
% /. texr
TeXForm[%]
slopeSUM = Collect[(slope /. RS[n] -> (0&) /. Cum -> EmpCum), SUM];
PLUS[{a__}] := PLUS[a]
(* The following is for the law school data, n = 15 *)
slopenum = slopeSUM /. {SUM -> PLUS, LX[w] -> {-0.600997, 0.860219, -1.04679,
     -0.551465, 1.62798,
     -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
     1.30601, -0.625764, -1.36875, -0.700063, -0.155203},
    LY[w] -> {1.25538, 0.872812, -1.21003, -0.274879, 1.4679,
     -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
     0.107685, -1.50758,
     -1.42257, -0.912485, -0.572429},
    n -> 15,
    rho -> 0.776374} /. PLUS -> Plus
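The listing above evaluates the symbolic acceleration "slope" on the standardized law school data. A conventional numeric counterpart, which this symbolic result can be checked against, is the jackknife estimate of the BCa acceleration constant from Efron (1987), reference [15]. The sketch below is ours (the function name is hypothetical, and this is the standard jackknife formula, not the thesis's symbolic expansion); the data are the standardized law school values from the substitution above.

```python
import numpy as np

def jackknife_acceleration(x, y):
    """Jackknife estimate of the BCa acceleration constant for the
    correlation coefficient:
        a = sum(d_i^3) / (6 * (sum(d_i^2))^(3/2)),
    where d_i = theta_bar - theta_(i) and theta_(i) is the correlation
    computed with observation i deleted (Efron 1987)."""
    n = len(x)
    theta = np.array([np.corrcoef(np.delete(x, i), np.delete(y, i))[0, 1]
                      for i in range(n)])
    d = theta.mean() - theta
    return float(np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5))

# Standardized law school data from the substitution above (n = 15).
lx = np.array([-0.600997, 0.860219, -1.04679, -0.551465, 1.62798,
               -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
               1.30601, -0.625764, -1.36875, -0.700063, -0.155203])
ly = np.array([1.25538, 0.872812, -1.21003, -0.274879, 1.4679,
               -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
               0.107685, -1.50758, -1.42257, -0.912485, -0.572429])
a = jackknife_acceleration(lx, ly)
```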
<<BSTAT.indep2.LXonly.m  (* Indep. version of BSTAT.1.0.new.m *)
(* Some Examples *)
RVQ[X] = True
VQ[X] = True
RVQ[Y] = True
VQ[Y] = True
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
BEV[ SUM[X]^2 ]
BEV[ SUM[Y^2] ]
BEV[ SUM[X]^2 SUM[Y]^2 ]
BEV[ SUM[X] SUM[Y] ]
BEV[ SUM[X]^2 SUM[Y] ]
BEV[ SUM[X]^2 SUM[Y^2] SUM[Y] ]
BEV[ SUM[X] ]
BEV[ SUM[Y] ]
BEV[ 1 ]
(* end of examples *)
(* BEHRENS-FISHER problem : corrected
Assume Ex = 0, Ey = 0 with unequal variances
which now really are just second moments!
Equal zero means as opposed to equal non-zero means
will speed up the 4th order symbolic calcs!
*)
cov/: NT[cov, i_] := NT[cov, i] =
  If[i==0, Cum[LX[w], LY[w]],
   If[i==1, RS[n][LX[w]LY[w]] -
     Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
vax/: NT[vax, i_] :=
  If[i==0, Cum[LX[w], LX[w]] - Cum[LX[w]] Cum[LX[w]],
   If[i==1, RS[n][LX[w]LX[w]] -
     Cum[LX[w]] RS[n][LX[w]] - Cum[LX[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LX[w]], 0]]]
vay/: NT[vay, i_] :=
  If[i==0, Cum[LY[w], LY[w]] - Cum[LY[w]] Cum[LY[w]],
   If[i==1, RS[n][LY[w]LY[w]] -
     Cum[LY[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LY[w]],
    If[i==2, -RS[n][LY[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Ntimes[vax, vay]], cov], i]], RS]
(* BEHRENS-FISHER CODE : *)
diff/: NT[diff, i_] := NT[diff, i] =
  If[i==0, Cum[LX[w]] - Cum[LY[w]],
   If[i==1, RS[n][LX[w]] - RS[n][LY[w]], 0]]
bf/: NT[bf, i_] := NT[bf, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Apply[Plus, {vax, vay}]], diff], i]], RS]
BF[i_] := Nsum[bf, i]
NT[bf, 2]
Nsum[bf, 2]
NT[Nboot[bf], 0]
NT[Nboot[bf], 1]
NT[Nboot[bf], 2]
bootexp = Expand[Nsum[Nboot[bf], 4]]
bootvar = Expand[Nsum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 4]]
(* ... survival data. 4th order uses similar code. *)
(* Uses SURVIVAL data (n = 44) *)
(********** Assuming 0, 0, unequal variances! **********)
(* and assuming INDEPENDENCE code in bf3.m *)
(* i.e. BEV tool was MODIFIED *)
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
(* Expansion of Behrens-Fisher statistic : *)
bf2 = Nsum[bf, 2]
% /. texr
(* Expansion of bootstrap mean of the Behrens-Fisher statistic *)
e2 = Expand[Nsum[Nboot[bf], 2]]
% /. texr
Simplify[%]
TeXForm[%]
(* Expansion of E - over x - of bootstrap mean of the B-F stat. *)
ee2 = Expand[Nsum[Ncum[Nboot[bf], 1], 2]]
ee2 /. texr
Simplify[%]
TeXForm[%]
ee4 = Expand[Nsum[Ncum[Nboot[bf], 1], 4]]
ee4 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of Var - over x - of bootstrap mean of the B-F stat. *)
ve2 = Expand[Nsum[Ncum[Nboot[bf], 2], 2]]
ve2 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of bootstrap variance of the B-F stat. *)
v2 = Expand[Nsum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 2]]
v2 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of E - over x - of bootstrap variance of the B-F stat. *)
ev2 = Expand[Nsum[Ncum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 1], 2]]
ev2 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of Var - over x - of bootstrap variance of the B-F stat. *)
vv2 = Expand[Nsum[Ncum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 2], 2]]
vv2 /. texr
Simplify[%]
TeXForm[%]
vv2one = Expand[Nsum[Ncum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 2], 2]]
vv2one /. texr
Simplify[%]
TeXForm[%]
exp2 = Expand[Nsum[Ncum[bf, 1], 2]]
exp2 /. texr
Simplify[%]
TeXForm[%]
var2 = Expand[Nsum[Ncum[bf, 2], 2]]
var2 /. texr
Simplify[%]
TeXForm[%]
(* exp4 = Expand[Nsum[Ncum[bf,1],4]] /. texr *)
(* var4 = Expand[Nsum[Ncum[bf,2],4]] /. texr *)
PLUS[{a__}] := PLUS[a]
(* The following uses the SURVIVAL data *)
(**************** assuming 0 means and unequal variances **************)
LY[w] ->
 {-644.9318182, -582.9318182, -540.9318182, -520.4318182, -463.9318182,
  -429.9318182, -395.9318182, -383.9318182, -344.9318182, -344.9318182,
  -303.9318182, -291.9318182, -289.9318182, -287.9318182,
  -265.9318182, -262.9318182, -262.9318182, -257.9318182, -251.9318182,
  -237.9318182, -185.9318182, -156.9318182, -122.9318182, -121.9318182,
  -110.9318182, -83.93181818, -76.93181818, 29.06818182, 30.06818182,
  102.0681818, 132.0681818, 140.0681818, 151.0681818, 309.0681818,
  322.0681818, 331.0681818, 599.0681818, 625.0681818, 774.0681818,
  814.0681818, 870.0681818, 905.0681818, 1044.068182,
  1048.068182
 },
 n -> 44} /. PLUS -> Plus
 },
 n -> 44} /. PLUS -> Plus
... ETC ...
ee2num1 // N
% Sqrt[44] // N
ee4num1 // N
% Sqrt[44] // N
ve2num1 // N
% 44 // N
Sqrt[%] // N
v2num1 // N
var = % 44 // N
se = Sqrt[%] // N
% Sqrt[44/43] // N  (* to compare to Splus *)
ev2num1 // N
% 44 // N
vv2num1 // N
vv2num2 // N
% 44^2 // N
exp2num // N
% Sqrt[44] // N
var2num // N
% 44 // N
(* Same as BSTAT.1.0.new.m except for extra code below included
to create the INDEPENDENCE criterion necessary for
the Behrens-Fisher statistic *)
BEV::usage =
 "BEV[a] returns the bootstrap expectation value of a"
(*
:[font = output; output; inactive]
"BEV[a] returns the bootstrap expectation value of a"
:[font = input; initialization]
*)
ClearAll[BEV]
BEV[a_ + b_] := BEV[a] + BEV[b]
BEV[(a_ + b_)^c_. d_.] := BEV[Expand[(a+b)^c d]] /;
  ((c > 0) && (IntegerQ[c]))
BEV[a_ b_] := a BEV[b] /; !RVQ[a]
BEV[a_] := a /; !RVQ[a]
BEV[SUM[a_]] := SUM[a]
BEV[SUM[a_] c_.] := BEV[c, a]
BEV[SUM[a_] c_., b__] := BEV[c, a, b]
BEV[SUM[a_]^b_ c_.] := BEV[SUM[a]^(b-1) c, a]
BEV[SUM[a_]^b_ c_., d__] := BEV[SUM[a]^(b-1) c, a, d]
(* Setting VARIABLES ! *)
RVQ[X] = True
VQ[X] = True
RVQ[Y] = True
VQ[Y] = True
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True
LX/: RVQ[LX[w]] = True
LY/: RVQ[LY[w]] = True
VQ[ZM[LX[w]]] = True
RVQ[ZM[LX[w]]] = True
VQ[ZM[LY[w]]] = True
RVQ[ZM[LY[w]]] = True
BEV[ SUM[LX[w]] ]
BEV[ SUM[LX[w]]^2 ]
BEV[ SUM[LX[w]]^2 SUM[LY[w]]^2 ]
BEV[ SUM[LX[w]] SUM[LY[w]] ]
(* end of examples *)
(* THIS PART CREATES INDEPENDENCE *)
h[a_] := Prepend[a, 1]
g[a_] := {h[Cases[a, ZM[LX[w]]^c_.]], h[Cases[a, ZM[LY[w]]^c_.]]}
m[a_] := Apply[Times, Apply[BEV1, a, {1}]]
BEV1[a_] := a /; !RVQ[a]
BEV[1, a__] := Collect[Expand[m[g[{a}]]], SUM]
BEV1[1, a__] := Collect[Expand[Map[BEVPE, FP1[{a}] + PE[0]]], SUM]
(* Examples *)
BEV[ SUM[LX[w]] ]
BEV[ SUM[LX[w]]^2 ]
BEV[ SUM[LX[w]]^2 SUM[LY[w]]^2 ]
BEV[ SUM[LX[w]] SUM[LY[w]] ]
(* end of examples *)
# Behrens-Fisher statistic : Monte Carlo calcs.
# INDEP. SAMPLES APPROACH (unlike the corr. coeff. with paired samples!)
# Survival times (days) of gastric cancer patients (n = 44)
# Data have been standardized to have ZERO MEANS only
# i.e. assume H0: equal means = 0
options(object.size=10e6)
n_44
x_1:n
nboot_12800  # Repeat for B = 50, 100, 200, 400, 800, 1600, 3200, 6400,
             # 12800, 25600
Bmeans1_apply(data1, 1, mean)
Bvars1_apply(data1, 1, var)/n
Bmeans2_apply(data2, 1, mean)
Bvars2_apply(data2, 1, var)/n
bf_(Bmeans1 - Bmeans2)/sqrt(Bvars1 + Bvars2)
# B bootstrap reps of BF statistic
# as on p. 224 Efron & Tibs., step (16.6)
bf.mean_mean(bf)  # bootstrap mean (exp.) of bf
se.mean_sqrt(var(bf)/nboot)
bf.se_sqrt(var(bf))  # bootstrap SE of bf
varvar_(nboot/((nboot-1)^2))*var((bf - mean(bf))^2)
var.bsse_(1/(4*var(bf)))*varvar
se.se_var.bsse^0.5
# 1. See if matches with symbolic results, using modified
#    BSTAT code assuming INDEPENDENCE and ZERO means
# 2. Calc. SE's in brackets later using code below
# 3. Run with B as now preferred (even spacings and up to 25,600)
bf.mean
se.mean
bf.se
se.se
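The independent-samples resampling scheme of the listing above can also be sketched in Python. This is a sketch only: the two samples below are synthetic zero-mean stand-ins for the centred gastric-cancer survival groups (not the n = 44 data), and each group is resampled independently, unlike the paired resampling used for the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two centred samples: zero means (H0),
# unequal variances, n = 44 each, mirroring the listing's setup.
x = rng.normal(scale=1.0, size=44)
x -= x.mean()
y = rng.normal(scale=2.0, size=44)
y -= y.mean()

n, nboot = 44, 12800
# Independent resampling of each sample (the key difference from the
# paired correlation-coefficient case).
bx = x[rng.integers(0, n, size=(nboot, n))]
by = y[rng.integers(0, n, size=(nboot, n))]

bmeans1, bvars1 = bx.mean(axis=1), bx.var(axis=1, ddof=1) / n
bmeans2, bvars2 = by.mean(axis=1), by.var(axis=1, ddof=1) / n
bf = (bmeans1 - bmeans2) / np.sqrt(bvars1 + bvars2)  # B reps of the B-F stat

bf_mean = bf.mean()      # bootstrap mean of bf, as bf.mean above
bf_se = bf.std(ddof=1)   # bootstrap SE of bf, as bf.se above
```

Under H0 the statistic is approximately pivotal, so `bf_se` should sit near 1; the symbolic bootstrap of the thesis supplies the expansion terms these Monte Carlo values converge to.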
Bibliography
[1] Andrews, D.F. and Feuerverger, A. (1993). General saddlepoint approximation
methods for bootstrap configurations. Univ. of Toronto Tech. Rep.
[2] Andrews, D.F. and Stafford, J.E. (1993). Tools for the symbolic computation of
asymptotic expansions. J. Roy. Statist. Soc. B 55, 613-627.
[3] Andrews, D.F. and Stafford, J.E. (1995). Iterated full partitions. Univ. of Toronto
Tech. Rep.
[4] Andrews, D.F., Stafford, J.E. and Wang, Y. (1992). On reading Lawley (1956):
an application of symbolic computation. Univ. of Toronto Tech. Rep.
[5] Andrews, D.F., Stafford, J.E. and Wang, Y. (1993). ASTAT: a system for calcu-
lating asymptotic expansions. Univ. of Toronto Tech. Rep.
[6] Andrews, D.F., Stafford, J.E. and Wang, Y. (1994). The calculation of asymptotic
expansions based on estimating equations. Univ. of Toronto Tech. Rep.
[7] Barndorff-Nielsen, O.E. and Blaesild, P. (1986). A note on the calculation of
Bartlett adjustments. J. Roy. Statist. Soc. B 48, 351-358.
[8] Bellhouse, D.R., Philips, R. and Stafford, J.E. (1995). Symbolic operators for
multiple sums. Univ. of Western Ontario Tech. Rep.
[9] Beran, R. and Ducharme, G. (1991). Asymptotic theory for bootstrap methods in
statistics. Centre de Recherches Mathématiques, Univ. of Montreal.
[10] Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall,
London.
[11] Davenport, J.H., Siret, Y. and Tournier, E. (1988). Computer Algebra: Systems
and Algorithms for Doing Algebraic Computation. Academic Press, New York.
[12] DiCiccio, T. and Tibshirani, R. (1987). Bootstrap confidence intervals and boot-
strap approximations. J. Amer. Statist. Assoc. 82, 163-170.
[13] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist.
7, 1-26.
[14] Efron, B. (1981). Nonparametric standard errors and confidence intervals (with
discussion). Canad. J. Statist. 9, 139-172.
[15] Efron, B. (1987). Better bootstrap confidence intervals (with discussion). J.
Amer. Statist. Assoc. 82, 171-200.
[16] Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman
and Hall, New York.
[17] Ferguson, H. (1989). Asymptotic Properties of a Conditional Maximum Likeli-
hood Estimator. Ph.D. Thesis, Univ. of Toronto.
[18] Fisher, N.I. and Hall, P. (1991). Bootstrap algorithms for small samples. J.
Statist. Plann. Inf. 27, 157-169.
[19] Fraser, D.A.S. (1976). Probability and Statistics: Theory and Applications. DAI
Press, Toronto.
[20] Gamerman, D. (1991). Dynamic Bayesian models for survival data. Appl.
Statist. 40, 63-79.
[21] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
[22] Hastings, C., Mosteller, F., Tukey, J.W. and Winsor, C.P. (1947). Low moments
for small samples: a comparative study of order statistics. Ann. Math. Statist. 18,
413-426.
[23] Heller, B. (1991). MACSYMA for Statisticians. Wiley, New York.
[24] Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math.
Statist. 35, 73-101.
[25] Huber, P.J. (1967). The behaviour of maximum likelihood estimates under
nonstandard conditions. Proc. of the 5th Berkeley Symposium on Mathematical
Statistics and Probability 1, 221-233.
[26] Huber, P.J. (1981). Robust Statistics. Wiley, New York.
[27] Johnson, N.L. and Kotz, S. (1969). Distributions in Statistics: Discrete Distri-
butions. Houghton Mifflin, Boston, MA.
[28] Kendall, W.S. (1988). Symbolic computation and the diffusion of shapes of
triads. Advances in Applied Probability 20, 775-797.
[29] Kendall, W.S. (1990). The diffusion of Euclidean shape. In Disorder in Physical
Systems, Ed. G. Grimmett and D. Welsh. Oxford University Press, pp. 203-217.
[30] Kendall, W.S. (1993). Computer algebra in probability and statistics. Statistica
Neerlandica 47, 9-25.
[31] Kendall, W.S. (1994). Computer algebra and yoke geometry I: when is an
expression a tensor? Statistics and Computing 4.
[32] Lawley, D.N. (1956). A general method for approximating the distribution of
likelihood ratio criteria. Biometrika 43, 295-303.
[33] Maeder, R. (1991). Programming in Mathematica. Addison-Wesley, Reading,
MA.
[34] McCullagh, P. (1987). Tensor Methods in Statistics. Chapman and Hall, New
York.
[35] Oldford, R.W. (1985). Bootstrapping by Monte Carlo versus approximating the
estimator and bootstrapping exactly: cost and performance. Commun. Statist.
Ser. B 14, 395-424.
[36] Robinson, G.K. (1982). Behrens-Fisher problem. In Encyclopedia of Statistical
Sciences, Vol. 1, Ed. S. Kotz and N.L. Johnson. Wiley, New York, pp. 205-209.
[37] Scheffé, H. and Tukey, J.W. (1945). Non-parametric estimation I. Validation of
order statistics. Ann. Math. Statist. 16, 187-192.
[38] Silverman, B.W. and Young, G.A. (1987). The bootstrap: to smooth or not to
smooth? Biometrika 74, 469-479.
[39] Stafford, J.E. (1992). Symbolic Computation and the Comparison of Traditional
and Robust Test Statistics. Ph.D. Thesis, Univ. of Toronto.
[40] Stafford, J.E. (1994). Automating the partition of indexes. J. Comp. and Graph.
Statist. 3, 249-260.
[41] Stafford, J.E. (1995). Partitions and the symbolic method. Univ. of Western
Ontario Tech. Rep.
[42] Stafford, J.E. (1996). A note on symbolic Newton-Raphson. (Submitted for
publication.)
[43] Stafford, J.E. and Andrews, D.F. (1993). A symbolic algorithm for studying
adjustments to the profile likelihood. Biometrika 80, 715-730.
[44] Stafford, J.E. and Bellhouse, D.R. (1995). A computer algebra for sample survey
theory. Univ. of Western Ontario Tech. Rep.
[45] Stafford, J.E., Andrews, D.F. and Wang, Y. (1994). Symbolic computation: a
unified approach to studying likelihood. Statistics and Computing 4, 235-245.
[46] Thisted, R.A. (1988). Elements of Statistical Computing. Chapman and Hall,
New York.
[47] Tibshirani, R. (1988). Variance stabilization and the bootstrap. Biometrika 75,
433-444.
[48] Venables, W.N. (1985). The multiple correlation coefficient and Fisher's A statis-
tic. Aust. J. Statist. 27, 173-182.
[49] Wagon, S. (1991). Mathematica in Action. W.H. Freeman, San Francisco.
[50] Wang, Y. (1992). Symbolic Computation for Statistics Using Tensors. Ph.D.
Thesis, Univ. of Toronto.
[51] Wolfram, S. (1988). Mathematica: A System for Doing Mathematics by Com-
puter, 2nd Ed. Addison-Wesley, New York.
[52] Young, G.A. (1988). A note on bootstrapping the correlation coefficient.
Biometrika 75, 370-373.
[53] Young, G.A. and Daniels, H.E. (1990). Bootstrap bias. Biometrika 77, 179-185.