TRANSCRIPT
SYMBOLIC COMPUTATION OF NONPARAMETRIC
BOOTSTRAP ESTIMATORS AND THEIR PROPERTIES
George Theodore Sampson
A thesis submitted in conformity with the requirements for the
Degree of Doctor of Philosophy
Graduate Department of Statistics
University of Toronto
© Copyright by George Theodore Sampson 1997
National Library of Canada · Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Symbolic Computation of Nonparametric Bootstrap Estimators and their
Properties
George Theodore Sampson
Ph.D. 1997
Department of Statistics
University of Toronto
Abstract
The bootstrap was introduced in 1979 as a computer-intensive method for estimating the standard error and other properties of an estimator. This thesis focuses on how the same objective can be achieved, in many cases, by the development and implementation of computer algorithms for the symbolic computation and evaluation of series expansions. The analytic bootstrap permits the calculation of standard errors for a wide variety of estimators in a fraction of the time and with comparable or better accuracy when compared with conventional bootstrap Monte Carlo resampling. In addition, properties such as bias and variance of bootstrap estimators may be assessed. The methodology is applied to examples from Efron and Tibshirani (1993) involving the correlation coefficient, the Behrens-Fisher statistic, double-bootstrapping and BCa bootstrap confidence intervals. An alternative methodology based on the exhaustive enumeration of bootstrap resamples is also presented and illustrated for the case of the median. Implications for future work are discussed.
Acknowledgements
I would like to express my gratitude to my supervisor David Andrews for his invaluable guidance, patience, encouragement, enthusiasm and humour during the undertaking of this thesis. It has been a privilege and a pleasure working with him.
My thanks also go to Andrey Feuerverger and Rob Tibshirani for making very useful comments and suggestions on the manuscript. I wish to thank Yodit Seifu for her encouragement and assistance. Ruth Croxford, Jenny Tam, Laura Kerr and Tom Glinos have been helpful with computational and administrative details. Gus Vukov has provided me with many valuable teaching experiences.
I would like to thank all the graduate students, staff and faculty in the Department of Statistics for a meaningful and memorable academic experience.
I am indebted to the Department of Statistics and the University of Toronto for providing me with financial support during the first four years of my graduate studies.
This thesis is dedicated to my parents whose initial encouragement instilled in me a resolve to undertake the present studies.
Contents

1 Introduction 1
  1.1 Overview 1
  1.2 Literature review 7
  1.3 Thesis outline 9

2 Symbolic Computation of Bootstrapped Order Statistics and their Moments: The Case of the Median 11
  2.1 Introduction 11
  2.2 Review of Bootstrap Estimation of Standard Error 12
  2.3 Properties of Bootstrap Estimators 14
    2.3.1 Introduction 14
    2.3.2 Exact Computation of Bootstrap Estimators 16
    2.3.3 Moments of Bootstrap Estimators 19
  2.4 Application to Order Statistics: The Case of the Median 22
    2.4.1 Some Properties of Order Statistics 23
    2.4.2 The Case of the Median 28
    2.4.3 Some Numerical Results 38
  2.5 Conclusions and Discussion 43

3 Bootstrapped M-estimators and their Properties 44
  3.1 Introduction 44
  3.2 Bootstrapping M-estimators: An Analytical Approach 46
    3.2.1 Introduction 46
    3.2.2 An Approximation for Functions of M-estimators 46
    3.2.3 Weighting and Decomposing Residual Sums 51
    3.2.4 Analytical Bootstrap Moments of M-estimators and their Properties 54
  3.3 Computational Tools 57
  3.4 The Case of the Correlation Coefficient 60
    3.4.1 Introduction 60
    3.4.2 Computational Tools 61
    3.4.3 Numerical Results 69
  3.5 Discussion and Conclusions 71

4 Double-Bootstrapping and the Behrens-Fisher Statistic 79
  4.1 Introduction 79
  4.2 Symbolic Computation of the Bootstrap-t Confidence Interval 79
    4.2.1 Background 79
    4.2.2 Symbolic Computation of the Bootstrap-t Interval for the Correlation Coefficient 82
    4.2.3 Discussion 87
  4.3 Symbolic Computation of BCa Confidence Intervals 87
    4.3.1 Introduction 87
    4.3.2 Background 88
    4.3.3 Symbolic Computation of the Acceleration Constant 91
    4.3.4 Discussion 96
  4.4 Analytic Bootstrap Computation for the Behrens-Fisher Statistic 97
    4.4.1 Introduction 97
    4.4.2 Background 97
    4.4.3 Analytic Bootstrap Computation 98
    4.4.4 Numerical Application 103
    4.4.5 Discussion 104
  4.5 Conclusions 106

5 Concluding Remarks 108
  5.1 Remarks 108

A Source Code used for this Thesis 111
  A.1 Source Code used for Chapter 2 111
  A.2 Source Code used for Chapter 3 122
  A.3 Source Code used for Chapter 4 143
List of Tables

1.1 Bootstrap estimators and associated properties for a statistic θ̂ for which analytic expressions are derived
2.1 Total number of distinct resamples and number of partitions for various sample sizes
2.2 The 5 partitions associated with a sample size of 4 with corresponding multinomial probabilities
2.3 The 12 distinct permutations of the partition (2, 1, 1, 0)
2.4 Approximating distributions for various λ values
2.5 Partitions and multinomial probabilities for a sample size of 3
2.6 Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median using several sample sizes for the case of the normal distribution
2.7 Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median using several sample sizes for the case of the uniform distribution
2.8 Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median using several sample sizes for the case of the logistic distribution
3.1 Bootstrap estimators and related properties for the correlation coefficient for which analytic expansions are derived
3.2 n = 15 Law School Data: original and standardized data
3.3 n = 100 simulated bivariate observations from N(0, 1): variable x
3.4 n = 100 simulated bivariate observations from N(0, 1): variable y 71
3.5 Symbolic bootstrap and classical results for the correlation coefficient using two data sets 72
3.6 Monte Carlo Bootstrap Estimates of the Mean of the correlation coefficient for various B 73
3.7 Monte Carlo Bootstrap Estimates of Standard Error of the correlation coefficient for various B 74
4.1 Comparison of 90% confidence intervals for ρ using analytic bootstrap-t, conventional bootstrap-t, and standard methods for the law school data (n = 15) 86
4.2 Bootstrap estimators and related properties for the Behrens-Fisher statistic for which analytic expansions are derived 99
4.3 Survival times (days) of gastric cancer patients (n = 44) 103
4.4 Symbolic bootstrap and classical results for the Behrens-Fisher statistic using the survival data, n = 44 104
4.5 Monte Carlo Bootstrap Estimates of Mean and Standard Error of the Behrens-Fisher statistic for various B for the survival data, n = 44 (s.e. in brackets) 105
Chapter 1
Introduction
1.1 Overview
In 1979 Efron proposed the idea of using cornputer-based simulations instead of math-
ematical calculations to obtain the sampling properties of random variables. This
thesis focuses on how the same objective can be achieved, in many cases. by the de-
velopment and implementat ion of cornputer algori t hms for the sy mbolic compu t at ion
and evaluation of asyrnptotic expansions. Our study is largel:; confined to a collec-
tion of problems that may be treated using classical theory for sums of independent
and identically distributed random variables - the so-called *-smooth function model'.
(Hall 1992). The corresponding class of statistics is wide and includes. for example.
sample means. variances. ratios and differences of means and variances. correlat ions.
maximum likelihood estimators and M-estimators. But it is by no means comprehen-
sive. It does not include. for example. the sample median or other order statistics:
we treat this case in a separate chapter.
The bootstrap was originally introduced as a general method for assessing the statistical accuracy of an estimator. The procedure applies a given estimator θ̂ to each of B bootstrap samples, and then estimates the standard error of θ̂ by the empirical standard deviation of the B replications. More formally, suppose that x = (x_1, . . . , x_n) is an observed random sample of size n from a population with distribution function F. We estimate a parameter of interest θ by θ̂, to which we wish to assign a standard
error denoted by

    se_F(θ̂).                                                        (1.1)

Now (1.1) may in turn be estimated by

    se_F̂(θ̂),                                                        (1.2)

that is, by replacing F by the empirical distribution function F̂, defined to be the discrete distribution that assigns probability 1/n to each value x_i for i = 1, . . . , n. When F is unknown or θ̂ is complicated, bootstrap methodology estimates (1.1) by applying the same principle. To this end the bootstrap algorithm for estimating standard errors selects B independent bootstrap samples x*_1, . . . , x*_b, . . . , x*_B, each consisting of n observations drawn with replacement from x. The bootstrap replication θ̂*(b) corresponding to each bootstrap sample is evaluated and the standard error se_F(θ̂) is estimated by

    ŝe_B = { Σ_{b=1}^{B} [θ̂*(b) − θ̂*(·)]² / (B − 1) }^{1/2},        (1.3)

where

    θ̂*(·) = Σ_{b=1}^{B} θ̂*(b) / B.                                  (1.4)

We note that (1.3) is an approximation of the ideal (i.e. "infinite") bootstrap estimate of se_F̂(θ̂*) given by

    lim_{B→∞} ŝe_B(θ̂*) = { E_F̂(θ̂*²) − [E_F̂(θ̂*)]² }^{1/2}.          (1.5)

Similarly, the bootstrap algorithm uses (1.4) to approximate the ideal bootstrap expectation

    E_F̂(θ̂*).                                                        (1.6)
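The resampling algorithm just described can be sketched in a few lines. The following Python illustration is ours, not code from the thesis (whose implementations use Mathematica and S-plus); the data and choice of statistic are for demonstration only.

```python
import math
import random

def bootstrap_se(x, stat, B=2000, seed=0):
    """Monte Carlo bootstrap standard error, equations (1.3)-(1.4):
    draw B resamples of size n with replacement, evaluate the statistic
    on each, and take the empirical standard deviation of the B values."""
    rng = random.Random(seed)
    n = len(x)
    reps = [stat([rng.choice(x) for _ in range(n)]) for _ in range(B)]
    rep_mean = sum(reps) / B                     # theta-hat*(.) of (1.4)
    return math.sqrt(sum((r - rep_mean) ** 2 for r in reps) / (B - 1))

x = [1.2, 0.7, 3.1, 2.4, 1.9, 0.3, 2.8, 1.1]     # illustrative data only
se_hat = bootstrap_se(x, stat=lambda s: sum(s) / len(s))
```

For the sample mean the ideal (B → ∞) value (1.5) is available in closed form, {Σ(x_i − x̄)²/n}^{1/2}/n^{1/2} ≈ 0.337 for these data, and ŝe_B fluctuates around it.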
Observe that a bootstrap sample x* = (x*_1, . . . , x*_n) consists of members of the original data set x, some appearing zero times, some appearing once, some appearing twice, etc. Hence, the bootstrap sample x* generates a corresponding set of frequencies (w_1, . . . , w_n) conditional on the original sample x, that is,

    x* ≡ (x; w),  with  w | x ~ Multinomial(n; 1/n, . . . , 1/n),    (1.7)

where w = (w_1, . . . , w_n) is a multinomial vector of integer weights. In this sense, the ideal bootstrap mean (1.6), for example, may be alternatively expressed notationally as

    E_{w|x}(θ̂*).                                                    (1.8)

We shall use this notation throughout this thesis in order to convey (1.7), that is, that bootstrap sampling can be thought of as weighted resampling. Moreover, this notation shall represent ideal bootstrap estimates in the sense of (1.5).

More generally, we may represent a generic bootstrap estimator of E_F̂[g(θ̂*)] by

    T = E_{w|x}[g(θ̂*)]                                              (1.9)

for some function g. Indeed, when

    g(θ̂*) = θ̂*,

we get (1.8), which may be approximated by (1.4). Further, when

    g(θ̂*) = [θ̂* − E_{w|x}(θ̂*)]²,

then (1.9) gives the ideal bootstrap variance estimator

    Var_{w|x}(θ̂*) = E_{w|x}(θ̂*²) − [E_{w|x}(θ̂*)]²,                  (1.10)

which may be approximated by the square of (1.3). In addition, we may evaluate properties of (1.8) and (1.10) such as the mean E_x(·) and variance Var_x(·). Table 1.1 below summarizes the six bootstrap expressions for which representations in terms of analytical expressions comprise the focus of this thesis. These six quantities include the bootstrap mean and the bootstrap variance of an estimator θ̂ as well as two properties for each of these.
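The equivalence between resampling and weighted resampling can be made concrete: drawing n points with replacement induces a multinomial frequency vector w, and any statistic of the resample can be recomputed from (x, w). A small illustrative Python sketch (ours, not the thesis's code):

```python
import random
from collections import Counter

rng = random.Random(1)
x = [5.0, 2.0, 9.0, 4.0]
n = len(x)

# Draw one bootstrap sample by index and record its frequency vector
# w = (w_1, ..., w_n), the multinomial weight vector of (1.7).
idx = [rng.randrange(n) for _ in range(n)]
counts = Counter(idx)
w = [counts.get(i, 0) for i in range(n)]

# The weights always sum to n, and a statistic of the resample can be
# computed directly from the weights; e.g. the resample mean:
resample_mean = sum(x[i] for i in idx) / n
weighted_mean = sum(wi * xi for wi, xi in zip(w, x)) / n
```

The two means agree exactly, which is the sense in which bootstrap sampling is weighted resampling.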
Table 1.1: Bootstrap estimators and associated properties for a statistic θ̂ for which analytic expressions are derived.
The intent of conventional bootstrap methodology is to estimate, for example,

    Var_F(θ̂),                                                       (1.11)

using the ideal bootstrap variance estimator

    Var_{w|x}(θ̂*).                                                  (1.12)

One motivation for this thesis is to investigate how well the bootstrap procedure (1.12) estimates (1.11). To do this we will evaluate properties of (1.12) such as

    E_x[Var_{w|x}(θ̂*)]

and

    Var_x[Var_{w|x}(θ̂*)].

These properties could be used diagnostically to uncover, for example, inherent bias in the bootstrap procedure (1.12) as represented by

    E_x[Var_{w|x}(θ̂*)] − Var_F(θ̂).

The objective of this thesis is to develop symbolic computational tools that can be used in practice to derive Monte Carlo-free bootstrap estimates and to help study properties of bootstrap estimators.
We explore two methods for the symbolic evaluation of the quantities in Table 1.1. The first approach is based on the exact computation of the ideal bootstrap expectation (1.8) as studied by Fisher and Hall (1991). Their algorithm requires enumeration of all distinct resamples which may be drawn with replacement from the original sample. We develop our method with respect to the class of L-estimators, that is, linear combinations of order statistics. However, the technique may be used for more general estimators. We will focus attention on the sample median, a statistic whose distribution is known exactly; this will facilitate illustration of the appropriate symbolic tools. The second approach involves the symbolic derivation of a bootstrap moment and its properties for statistics belonging to the class of M-estimators, that is, for any statistic θ̂ assumed to be the solution of the estimating equation

    Σ_{i=1}^{n} ψ(x_i, θ̂) = 0,                                      (1.13)

where the x_i's arise from the density f(x, θ). The method depends primarily on two key symbolic routines - one for computing asymptotic expansions of roots of estimating equations, and one for computing bootstrap versions of these expansions. The algorithms have been implemented using a computer-algebraic manipulation package called Mathematica (Wolfram, 1988). The methodology is applied to examples
from Efron and Tibshirani (1993) involving the correlation coefficient, the Behrens-Fisher statistic, double-bootstrapping and BCa bootstrap confidence intervals.
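As a concrete illustration of an estimating equation of the form Σψ(x_i, θ̂) = 0, the sketch below computes a Huber-type M-estimator of location by bisection. This example is ours (the tuning constant and data are arbitrary), not an implementation from the thesis.

```python
def huber_psi(u, c=1.345):
    # Huber's psi function: linear near zero, bounded beyond c.
    return max(-c, min(c, u))

def m_estimate(x, c=1.345):
    """Solve sum_i psi(x_i - theta) = 0 by bisection. The sum is
    monotone non-increasing in theta, so bisection locates the root."""
    lo, hi = min(x), max(x)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sum(huber_psi(xi - mid, c) for xi in x) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

x = [0.1, 0.4, -0.2, 0.3, 8.0]   # four central points and one gross outlier
theta_hat = m_estimate(x)        # stays near the bulk; the mean is 1.72
```

Because ψ is bounded, the outlier at 8.0 contributes at most c to the estimating equation, and the root lands near 0.49 rather than near the sample mean 1.72.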
Finally, we summarize the features/advantages of the analytic bootstrap developed in this thesis as follows:

- Analytic derivation and numerical evaluation of bootstrap estimates and their properties entirely avoids Monte Carlo resampling, resulting in faster and more efficient calculation.
- It can be used diagnostically to assess properties such as bias, variance or other moments of a bootstrap procedure.
- For the class of L-estimators (linear combinations of order statistics), the analytic bootstrap calculates ideal bootstrap estimators exactly for sample sizes up to n = 20 and calculates properties of these estimators exactly for much larger sample sizes.
- For the class of M-estimators (maximum likelihood-type estimators), the analytic bootstrap calculates ideal bootstrap estimators and properties of these exactly for linear statistics, and to within a specified order of accuracy for large sample sizes in the case of non-linear statistics.
- The analytic bootstrap may be applied to a broad class of statistics, namely, to any estimator which may be expressed as a non-linear smooth function of averages.
- The analytical approach for the class of M-estimators is less computer intensive for larger data sets. The reverse is true for conventional bootstrap calculations.
- Analytic bootstrap expressions are potentially insightful because component statistical quantities such as cumulants and sums are identified.
- Each analytic bootstrap expression (representing a bootstrap moment or a property of this) is a mathematical bootstrap formula which may be numerically evaluated efficiently for a given data set, or repeatedly evaluated for different data sets arising within a given application. The computational effort required to derive each formula need only be undertaken once. It is subsequently usable in perpetuity by other practitioners for the purpose of efficient numerical evaluation.
This thesis does not discuss the analytic bootstrap in the context of parametric bootstrap estimation, kernel estimation, or data which are not independent and identically distributed. These could be topics for future research.
1.2 Literature review
This thesis represents the marriage between two technologies, symbolic computation and bootstrap estimation, and a broad class of statistics known as maximum likelihood type estimators, or M-estimators. In the literature these three areas have largely been developed independently of each other.

Efron's (1979) article has spawned an entire literature on the topic of the bootstrap. The monograph by Efron and Tibshirani (1993) presents an intuitive account of the bootstrap and its diverse applications in statistical inference. It contains a comprehensive list of references on the subject. The lecture notes of Beran and Ducharme (1991) and Hall's (1992) monograph give a mathematical treatment of the bootstrap. In particular, Fisher and Hall (1991) discuss the exact computation of nonparametric bootstrap estimators using exhaustive enumeration of distinct resamples which may be drawn with replacement from the original sample. We use a related approach in Chapter 2 to develop the symbolic computation of bootstrap estimators and corresponding properties connected with the class of L-estimators. Oldford (1985) considers analytic evaluation of bootstrap estimates of bias and variance for linear and quadratic approximations of an estimator. This forms a basis for Chapter 3, which is developed using a generalized methodology found in Andrews and Feuerverger (1993).

Huber (1964, 1967) considered the class of maximum likelihood type estimators θ̂ that solve estimating equations such as (1.13). A detailed account of the theory of M-estimators can be found in Huber (1981). Andrews and Feuerverger (1993) show how a single arbitrary M-estimator, or a smooth function of such estimators, may be expressed as a particular sequence of increasingly accurate Taylor series approximations. Alternative presentations may also be found in Ferguson (1989) and McCullagh (1987). This approach is used in Chapter 3 as the basis for the development of algorithms permitting the symbolic representation of nonparametric bootstrap moments (and respective properties) associated with M-estimators.
In this thesis a set of general procedures for the symbolic derivation of bootstrap quantities is presented. One of the two contexts considered involves the use of asymptotic expansions. The derivation of asymptotic expansions by hand is typically a laborious task; it can be extremely time consuming and can involve exceedingly complicated algebra. The potential for clerical error is high. Symbolic computation provides a practical alternative to doing these expansions by hand - the development of appropriate computer-algebraic tools permits the statistician to derive the expansions correctly and reliably, in a fraction of the time taken by hand, and to efficiently evaluate the derived formulae in specific cases. Expansions are obtained quickly without the chance for human clerical error. This approach liberates statisticians from the tedious details of asymptotic calculation and extends their potential capabilities. Expansions derived by computer may moreover be saved in a form suitable for publication and then inserted directly into text, further reducing the chance for errors in publications. Programming errors can be reduced by testing software against known statistical results; the software can then be used to derive new results. The procedures developed in this thesis have been tested against simulated results documented in, for example, Efron and Tibshirani (1993), and against independent results obtained from the use of code written in S-plus.
Symbolic computation is a powerful facility available to research statisticians. The use of this technology in statistics is relatively new, its history extending over little more than a decade. Packages like Mathematica, Maple, Macsyma and Reduce are being used increasingly in areas of statistical theory where the algebra may be otherwise prohibitively complex. There are numerous books that describe a variety of useful applications of symbolic computation in statistics - see, for example, Heller (1991), Maeder (1991), Wagon (1991), Davenport et al. (1988). Silverman and Young (1987) use computer algebra to evaluate criteria on which the decision to smooth a bootstrap distribution is based. Young and Daniels (1990) apply symbolic programming to study bias in bootstrap estimation of tail probabilities for two simple situations. Venables (1985) utilizes symbolic computation heavily to obtain expansions of maximum marginal likelihood estimates. Other contributions include Kendall (1988, 1990), Young (1988) and Barndorff-Nielsen and Blaesild (1986). For a complete review of the use of symbolic computation in probability and statistics prior to 1991, see Kendall (1993). More recently, Stafford (1992), Stafford and Andrews (1993), Andrews and Stafford (1993), Andrews et al. (1992) and Wang (1992) use methods of symbolic computation to derive new statistical results as well as to confirm old results. Kendall (1994) develops routines in Reduce that aid in determining whether a particular expression is tensorially invariant. Stafford and Bellhouse (1995) develop procedures that are used to automate complicated algebraic calculations frequently encountered in sample survey theory. Additional contributions include Stafford et al. (1994), Stafford (1994), Andrews et al. (1994), Andrews and Stafford (1995), Bellhouse et al. (1995), Stafford (1995) and Stafford (1996).
1.3 Thesis outline
The thesis is organized as follows:
Chapter 2 presents algorithms for the symbolic computation of properties of nonparametric bootstrap estimators involving order statistics. The case of the median is considered in detail. Comparisons are made with Monte Carlo simulations.
In Chapter 3 symbolic procedures are developed and implemented for the analytical representation of bootstrap estimators and their properties with respect to statistics defined by estimating equations and functions of these. The technique is illustrated for the case of the correlation coefficient. Comparisons with conventionally computed bootstrap estimates are made.
The methodology is further illustrated in Chapter 4 with respect to the Behrens-Fisher statistic, double-bootstrapping and BCa bootstrap confidence intervals.
Finally, Chapter 5 provides concluding remarks and suggestions for future research.
Chapter 2
Symbolic Computation of
Bootstrapped Order Statistics
and their Moments : The Case of
the Median
2.1 Introduction
In this chapter we seek to compute properties of nonparametric bootstrap estimators involving order statistics exactly using symbolic computation. In particular, the method permits evaluation of the bias and the mean square error of such estimators analytically. The resulting expressions may be efficiently evaluated numerically. Such calculations require enumeration of all possible resamples which may be drawn with replacement from the original sample, and this will be shown to be feasible for moderate sample sizes.
Section 2 reviews standard bootstrap estimation. In Section 3 we first review an algorithm given in Fisher and Hall (1991) which describes the exact computation of nonparametric bootstrap estimators in small samples. We propose an extension to this procedure for the exact computation of moments of nonparametric bootstrap estimators in much larger samples. In Section 4 this methodology is illustrated by evaluating moments associated with bootstrapping order statistics, in particular, the median. This is achieved through the use of the λ-family of distributions that permits expression of moments of order statistics in closed form for a variety of distributional shapes. These results are compared with Monte Carlo simulations. Various algorithms developed in the chapter are also described. Conclusions are discussed in Section 5.
2.2 Review of Bootstrap Estimation of Standard Error

Suppose that x = (x_1, . . . , x_n) is an observed random sample of size n from a population with distribution function F and we wish to estimate the parameter of interest

    θ = t(F),                                                       (2.1)

expressed as some function t of F, on the basis of x. For this purpose we calculate an estimate of interest

    θ̂ = s(x),                                                       (2.2)

to which we wish to assign a standard error denoted by

    se_F(θ̂).                                                        (2.3)

(The hat symbol "^" always indicates quantities calculated from the observed data.) Now (2.3) may in turn be estimated by

    se_F̂(θ̂),                                                        (2.4)

that is, by replacing F with the empirical distribution function F̂ defined to be the discrete distribution that assigns probability 1/n to each value x_i for i = 1, . . . , n.
When F is unknown, or θ̂ is complicated, bootstrap methodology attempts to estimate (2.3) by applying this same principle. The bootstrap algorithm for estimating standard errors consists of several steps which are given as follows. We use notation similar to that used by Efron and Tibshirani (1993, Chapter 6). First, B independent bootstrap samples x*_1, . . . , x*_B are selected, each consisting of n observations drawn with replacement from x. Each bootstrap sample, generically represented as

    x* = (x*_1, . . . , x*_n),                                      (2.5)

is a randomized, or resampled, version of the original data set x, and hence consists of members of x, some appearing zero times, some appearing once, some appearing twice, etc. Hence, the bootstrap sample x* generates a corresponding set of frequencies (w_1, . . . , w_n) conditional on the original sample x, that is,

    x* ≡ (x; w),  with  w | x ~ Multinomial(n; 1/n, . . . , 1/n),   (2.6)

where w = (w_1, . . . , w_n) is a multinomial vector of integer weights. The bootstrap replication corresponding to each bootstrap sample is then evaluated as

    θ̂*(b) = s(x*_b),  b = 1, . . . , B.                             (2.7)

Finally, the bootstrap estimate of the standard error of θ̂ is calculated as the sample standard deviation of the B replications

    ŝe_B = { Σ_{b=1}^{B} [θ̂*(b) − θ̂*(·)]² / (B − 1) }^{1/2},        (2.8)

where θ̂*(·) = Σ_{b=1}^{B} θ̂*(b)/B. The limit of ŝe_B(θ̂*) as B → ∞ is called the ideal bootstrap estimate of se_F̂(θ̂), that is,

    lim_{B→∞} ŝe_B(θ̂*) = se_F̂(θ̂*).                                 (2.9)
The bootstrap algorithm described above is a computational method for obtaining an approximation to the numerical value of se_F̂(θ̂*). This ideal bootstrap estimator and its empirical approximation ŝe_B are sometimes referred to as nonparametric bootstrap estimates because they are based on F̂, the nonparametric estimate of the population F. Parametric bootstrap estimation uses a different estimate of F but this will not be discussed in this thesis.

Expressing (2.3), (2.4), (2.9) and (2.8) in terms of variances we have, respectively,

    Var_F(θ̂) = [se_F(θ̂)]²,                                         (2.10)

    Var_F̂(θ̂) = [se_F̂(θ̂)]²,                                         (2.11)

    Var_F̂(θ̂*) = [se_F̂(θ̂*)]²,                                       (2.12)

    V̂ar_B(θ̂*) = [ŝe_B(θ̂*)]².                                       (2.13)

The quantities Var_F̂(θ̂*) and V̂ar_B(θ̂*) may be made arbitrarily close for large enough B and hence will be assumed to be effectively equal in this discussion.

In the next section we will develop a diagnostic tool that will enable us to calculate bias and other properties of the ideal bootstrap variance estimator Var_F̂(θ̂*), and the method will be illustrated in the case of order statistics. This methodology is capable of giving analytical or numerical results and is free of Monte Carlo resampling.
2.3 Properties of Bootstrap Estimators
2.3.1 Introduction
In this section we propose a procedure for the exact computation of moments of
nonparametric bootstrap estimators. We first review an algorithm given in Fisher and Hall (1991) which describes the exact computation of nonparametric bootstrap estimators in small samples.
As was discussed in Chapter 1 we can express the bootstrap sample as

    x* = (x; w)                                                     (2.14)

and the bootstrap statistic θ̂* as

    θ̂* = s(x; w),                                                   (2.15)

where w is a multinomial vector of integer weights. Since bootstrap sampling is identical with weighted resampling we may represent a generic bootstrap estimator as

    T = E_{w|x}[g(θ̂*)]                                              (2.16)

for some function g. Indeed, when

    g(θ̂*) = [θ̂* − E_{w|x}(θ̂*)]²,                                    (2.17)

then (2.16) gives the ideal bootstrap variance estimator

    Var_{w|x}(θ̂*).                                                  (2.18)

The quantity E_{w|x}(θ̂*) in (2.18) may be used to assess the ideal bootstrap estimate of bias of θ̂ by

    bias_F̂ = E_{w|x}(θ̂*) − θ̂.                                       (2.19)

For most statistics that arise in practice, (2.19) must be approximated by Monte Carlo simulation. In particular, the bootstrap estimate of bias based on B replications is (2.19) with θ̂*(·) = Σ_{b=1}^{B} θ̂*(b)/B substituted for E_{w|x}(θ̂*), so that

    b̂ias_B = θ̂*(·) − θ̂.                                             (2.20)
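The Monte Carlo bias estimate can be sketched briefly; the following Python illustration is ours, with the plug-in variance chosen as the statistic because its ideal bootstrap bias is known in closed form.

```python
import random

def bootstrap_bias(x, stat, B=4000, seed=2):
    """Monte Carlo approximation to the bias estimate: average the
    statistic over B resamples and subtract theta-hat, as in (2.20)."""
    rng = random.Random(seed)
    n = len(x)
    reps = [stat([rng.choice(x) for _ in range(n)]) for _ in range(B)]
    return sum(reps) / B - stat(x)

def var_plugin(s):
    # Plug-in variance (divisor n). Under resampling its expectation is
    # (n - 1)/n times var_plugin(x), so the ideal bootstrap bias estimate
    # equals -var_plugin(x)/n; the Monte Carlo value hovers nearby.
    m = sum(s) / len(s)
    return sum((v - m) ** 2 for v in s) / len(s)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
bias_hat = bootstrap_bias(x, var_plugin)
```

For these data var_plugin(x) ≈ 6.609 with n = 8, so the ideal bias estimate is about −0.826, and bias_hat approximates it with Monte Carlo error of order B^{-1/2}.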
Fisher and Hall (1991) show how Monte Carlo resampling may be circumvented by describing an algorithm for the exact computation of nonparametric bootstrap
estimators in small samples. This technique requires enumeration of all distinct resamples which may be drawn with replacement from the original sample, and also calculation of the probability and value of the statistic associated with each resample. We shall review this procedure for the exact calculation of E_{w|x}(θ̂*) in the next sub-section.
2.3.2 Exact Computation of Bootstrap Estimators
Suppose that we wish to calculate the exact value of the ideal bootstrap estimator of the mean of a statistic θ̂, given by E_{w|x}(θ̂*). The discussion below follows Fisher and Hall (1991).
If the sample x is of size n, and if all elements of x are distinct, then the number of different possible resamples x* of size n equals the number, N(n), of distinct ways of placing n indistinguishable objects into n numbered boxes, the boxes being allowed to contain any number of objects. Indeed, letting k_i denote the number of times x_i is repeated in x*, then the number of different possible resamples equals the number of ways of choosing nonnegative integers k = (k_1, …, k_n) satisfying k_1 + … + k_n = n. This is given by

N(n) = C(2n − 1, n) = (2n − 1)!/[n!(n − 1)!].   (2.21)

See Hall (1992, Appendix I) for a proof of (2.21).
Let W(n) denote the set of all distinct integer n-tuples w = (w_1, …, w_n), henceforth called partitions, such that w_1 ≥ w_2 ≥ … ≥ w_n ≥ 0 and Σ w_i = n. Given (w_1, …, w_n) ∈ W(n), let M(w_1, …, w_n) be the set of all ordered n-tuples k = (k_1, …, k_n) which are simply permutations of (w_1, …, w_n). Defining

K(n) = ∪_{w ∈ W(n)} M(w),
then K(n) contains precisely N(n) elements given by (2.21). The probability of obtaining a specific resample x* based on the n-tuple k = (k_1, …, k_n) is the multinomial probability p(k) given by

p(k) = (n!/(k_1! ⋯ k_n!)) n^(−n),   (2.23)
with the k_i ≥ 0 as above. Furthermore, let θ̂*(k) denote the calculated bootstrap replication of θ̂ corresponding to the multinomial n-tuple k. For example, if k = (1, 2, 0, 1) in the case n = 4, then the bootstrap replication θ̂*(k) is correspondingly calculated from the resample x* = (x_1, x_2, x_2, x_4). Then the ideal bootstrap estimator of the mean of θ̂ may be computed exactly and directly as the weighted average of the N(n) bootstrap values θ̂*(k):

E_{w|x}(θ̂*) = Σ_{k ∈ K(n)} p(k) θ̂*(k),   (2.24)
or alternatively,

E_{w|x}(θ̂*) = ((n − 1)!/n^(n−1)) Σ_{w ∈ W(n)} (1/(w_1! ⋯ w_n!)) Σ_{k ∈ M(w)} θ̂*(k).   (2.25)

Equation (2.25) is computationally more efficient than (2.24). For example, for n = 4, (2.25) is 7 times more computationally efficient than (2.24).
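As a quick illustration of (2.23)–(2.24) — my own sketch in Python, not part of the thesis code — the ideal bootstrap mean of, say, the sample median can be computed either by brute force over all n^n equally likely index resamples, or over the N(n) distinct resamples weighted by their multinomial probabilities p(k); the two agree exactly:

```python
from itertools import product, combinations_with_replacement
from math import factorial
from statistics import median

def ideal_boot_mean_brute(x, stat=median):
    # (2.24) by brute force: average the statistic over all n^n index resamples
    n = len(x)
    vals = [stat([x[i] for i in idx]) for idx in product(range(n), repeat=n)]
    return sum(vals) / n ** n

def ideal_boot_mean_multiset(x, stat=median):
    # sum over the N(n) distinct resamples, each weighted by its multinomial
    # probability p(k) = n!/(k_1! ... k_n!) * n^(-n), as in (2.23)
    n = len(x)
    total = 0.0
    for idx in combinations_with_replacement(range(n), n):
        k = [idx.count(i) for i in range(n)]
        p = factorial(n) / n ** n
        for ki in k:
            p /= factorial(ki)
        total += p * stat([x[i] for i in idx])
    return total

x = [1.2, 3.4, 2.1, 5.0]          # n = 4: 256 index tuples, but only N(4) = 35 multisets
print(ideal_boot_mean_brute(x), ideal_boot_mean_multiset(x))   # the two agree
```

The multiset version already shows the saving that grouping resamples brings; the partition form (2.25) groups further still.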
If the sequence w = (w_1, …, w_n) consists of precisely r_i j_i's for 1 ≤ i ≤ q, where the j_i's are distinct and the r_i's sum to n, then M(w), that is, the second summation in (2.25), contains exactly

L(w) = n!/(r_1! ⋯ r_q!)   (2.26)

elements. Table 2.1 lists respective values of N(n) and the number of partitions w contained in W(n) for various sample sizes. We consider the example below.
Table 2.1: Total number of distinct resamples and number of partitions for various sample sizes

n     N(n)          Number of partitions
3     10            3
4     35            5
15    77,558,760    176

Example 1

We consider in detail the case for n = 4. Table 2.2 lists the 5 partitions comprising W(4) with the respective number of distinct permutations L(w), the simple multinomial probabilities p(k), and the probabilities p(w) of the partitions. The probabilities p(w) will be discussed more fully in the next section.

The L(2,1,1,0) = 12 distinct permutations k contained in M(2,1,1,0), for example, are listed in Table 2.3 below:
The ideal bootstrap estimator (2.25) may now be expressed as

E_{w|x}(θ̂*) = (3!/4³) [ (1/4!) Σ_{k∈M(4,0,0,0)} θ̂*(k) + (1/3!) Σ_{k∈M(3,1,0,0)} θ̂*(k) + (1/(2!2!)) Σ_{k∈M(2,2,0,0)} θ̂*(k) + (1/2!) Σ_{k∈M(2,1,1,0)} θ̂*(k) + Σ_{k∈M(1,1,1,1)} θ̂*(k) ].   (2.27)
Table 2.2: The 5 partitions associated with a sample size of 4 with corresponding multinomial probabilities

w            L(w)   p(k)      p(w)
(4,0,0,0)    4      1/256     4/256
(3,1,0,0)    12     4/256     48/256
(2,2,0,0)    6      6/256     36/256
(2,1,1,0)    12     12/256    144/256
(1,1,1,1)    1      24/256    24/256

Table 2.3: The 12 distinct permutations of the partition (2,1,1,0)

(2,1,1,0) (2,1,0,1) (2,0,1,1) (1,2,1,0) (1,2,0,1) (0,2,1,1) (1,1,2,0) (1,0,2,1) (0,1,2,1) (1,1,0,2) (1,0,1,2) (0,1,1,2)
For example, the 4th summation in (2.27) represents the sum of 12 bootstrap replicates calculated from distinct resamples associated with the permutations listed in Table 2.3. A similar expression for exactly evaluating the ideal bootstrap estimator of the variance of θ̂ could also be obtained.
2.3.3 Moments of Bootstrap Estimators
We seek to calculate the mean and variance of the ideal bootstrap quantity

T = E_{w|x}[g(θ̂*)]   (2.28)

for some function g, with respect to the original distribution F. Indeed, the first moment of (2.28) may be evaluated exactly, using (2.25), as

E_x(T) = ((n − 1)!/n^(n−1)) Σ_{w ∈ W(n)} (1/(w_1! ⋯ w_n!)) Σ_{k ∈ M(w)} E_x[g(θ̂*(k))] = Σ_{w ∈ W(n)} L(w) p(k) E_w[g(θ̂*)],   (2.29)

where

E_w[g(θ̂*)] = E_x[g(θ̂*(k))]

for any k ∈ M(w), and the quantities p(k) and L(w) are given respectively by (2.23) and (2.26). Clearly Σ_{w ∈ W(n)} L(w) = N(n) as given by (2.21).
Note that the final expression in (2.29) may be alternatively obtained by considering that

E_x(T) = Σ_{w ∈ W(n)} p(w) E_w[g(θ̂*)], where p(w) = L(w) p(k).

Applying this in the context of Example 1, with n = 4,
permits the first moment of (2.27) to be expressed exactly, that is,

E_x[E_{w|x}(θ̂*)] = Σ_{w ∈ W(4)} p(w) E_w(θ̂*),

where E_w(θ̂*) = E_x[θ̂*(w_1, w_2, w_3, w_4)] represents the first moment of the bootstrap replication θ̂* with respect to the specific partition w = (w_1, w_2, w_3, w_4).
The algorithm associated with (2.29) could be expressed as:

Algorithm 2.1 (Property of a bootstrap moment):

• Construct the list l of all partitions w ∈ W(n) for a given positive integer n.

• Over all the partitions w in l, sum the quantities p(w) E_w[g(θ̂*)],
where p(w) = L(w) p(k).
The list l of all partitions w ∈ W(n) for a given positive integer n can be created from the algorithm given in Wang (1992, Chapter 2). The tool for calculating such partitions is called FPL[n] and the code is available in Appendix A.1 in a file called partition.m.
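The partition enumeration that FPL[n] performs can be sketched in a few lines of Python (an illustrative reimplementation of mine, not the thesis's Mathematica code):

```python
def partitions(n, max_part=None):
    # generate all partitions of n as weakly decreasing tuples of positive parts
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# zero-pad to length n to obtain the elements of W(n)
W = [p + (0,) * (4 - len(p)) for p in partitions(4)]
print(len(W), W)   # 5 partitions of 4
```

For n = 15 the same generator yields the 176 partitions referred to later in the chapter.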
We shall now apply this methodology to the case of assessing the mean and variance of bootstrap quantities associated with order statistics, in particular, the sample median.
2.4 Application to Order Statistics : The Case of
the Median
In this section we apply the methodology developed in the preceding discussion to
the case where θ̂ = sample median, and θ̂* denotes a replication of the bootstrapped sample median. The median is a statistic whose distribution is known exactly; this will facilitate illustration of the methodology described above.
Before proceeding we review some general properties of order statistics.
2.4.1 Some Properties of Order Statistics
Let x_1, …, x_n be a random sample of n distinct observations from an absolutely continuous population with pdf f(x) and cdf F(x). Let

x_(1) ≤ x_(2) ≤ … ≤ x_(n)

be the order statistics obtained by arranging the above sample in increasing order of magnitude. The density function of x_(i) (1 ≤ i ≤ n) is well known to be

f_(i)(x) = (n!/((i − 1)!(n − i)!)) [F(x)]^(i−1) [1 − F(x)]^(n−i) f(x),   −∞ < x < ∞,   (2.34)

from which it follows that the k-th moment of x_(i) for k = 1, 2, … may be presented as

E(x_(i)^k) = ∫_{−∞}^{∞} x^k f_(i)(x) dx.   (2.35)
The variance of x_(i) may then be expressed in terms of (2.35) as Var(x_(i)) = E(x_(i)²) − [E(x_(i))]².   (2.36)
In many situations (2.35) does not exist explicitly in closed form. An example of this is the case where F(x) represents the cumulative normal distribution. In order to circumvent this difficulty we shall express (2.35) alternatively in terms of the inverse probability integral transformation. As will be seen shortly, this provides a means of carrying an initial variable with a uniform distribution into a new variable with a special distribution of interest. For a description of this transformation see, for example, Fraser (1976, pages 86–87). This representation of (2.35) will then be amenable to closed form expression through the use of a special family of approximating distributions to be described below.
Let u_1, u_2, …, u_n be a random sample from the U[0,1] distribution and let

u_(1) ≤ u_(2) ≤ … ≤ u_(n)

be the order statistics obtained from this sample. Define the inverse cumulative distribution function of the population as

F^(−1)(u) = inf{x : F(x) ≥ u},   0 < u < 1.   (2.37)

Now it is well known that if u is a U[0,1] random variable, then F^(−1)(u) has distribution function F. We have

F(F^(−1)(u)) = u

since F is continuous. Further, it can easily be verified by using (2.37) that

(x_(1), …, x_(n)) =_d (F^(−1)(u_(1)), …, F^(−1)(u_(n)))   (2.38)

and

x_(i) =_d F^(−1)(u_(i)),   (2.39)

where =_d is to be read as "has the same distribution as". The distributional relations in (2.38) and (2.39) were observed by Scheffé and Tukey (1945). Hence, using (2.37)–(2.39) permits (2.35) to be written alternatively and more compactly as

E(x_(i)^k) = (n!/((i − 1)!(n − i)!)) ∫_0^1 [F^(−1)(u)]^k u^(i−1) (1 − u)^(n−i) du.   (2.40)
In the particular case when

x_i ~ U[0,1],   i = 1, 2, …, n,

then (2.40) can be expressed explicitly in closed form. Indeed, (2.39) is modified to

F^(−1)(u_(i)) =_d u_(i),

from whence it follows that

E(x_(i)^k) = E(u_(i)^k) = ∫_0^1 u^k f_(i)(u) du,   (2.41)

where

f_(i)(u) = u^(i−1) (1 − u)^(n−i) / B(i, n − i + 1)

is the density function of a two-parameter beta distribution, and B(·,·) is the complete beta function defined by

B(a, b) = ∫_0^1 t^(a−1) (1 − t)^(b−1) dt = Γ(a)Γ(b)/Γ(a + b).

Upon simplification, (2.41) yields

E(u_(i)^k) = B(i + k, n − i + 1)/B(i, n − i + 1).   (2.42)
For example, the mean and variance of the sample median x̃ for odd sample size may
be shown to be

E(x̃) = 1/2   and   Var(x̃) = 1/(4(n + 2)).

Table 2.4: Approximating distributions for various λ values

λ        Approximating distribution
0.135    Normal
0        Logistic
1        Uniform
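These closed forms are easy to check numerically. The following sketch (mine, not the thesis code) uses the beta-moment identity E[u_(i)^k] = B(i+k, n−i+1)/B(i, n−i+1) for uniform order statistics to verify the mean 1/2 and variance 1/(4(n+2)) of the uniform sample median:

```python
from math import gamma

def beta(a, b):
    # complete beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def u_order_moment(k, i, n):
    # E[u_(i)^k] for U[0,1] order statistics: B(i+k, n-i+1)/B(i, n-i+1)
    return beta(i + k, n - i + 1) / beta(i, n - i + 1)

for n in (3, 5, 7):
    i = (n + 1) // 2                          # the median index for odd n
    m1 = u_order_moment(1, i, n)
    var = u_order_moment(2, i, n) - m1 ** 2
    print(n, m1, var, 1 / (4 * (n + 2)))      # mean 0.5; variance matches 1/(4(n+2))
```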
The above example illustrates the use of the particular inverse function given by F^(−1)(u) = u, which permits (2.40) to be expressed in closed form in this particular case. A special form of F^(−1)(u) permitting closed form expression of moments of order statistics in a variety of cases was provided by Hastings et al. (1947). They suggested the class of approximating distributions implicit in the transformation

x = F^(−1)(u) = u^λ − (1 − u)^λ,   for λ > 0,   (2.45)

where u has a U[0,1] distribution, x is the random variable whose order statistics interest us, and λ is a distributional shape parameter. All the distributions represented by (2.45) for various values of λ are symmetrical. Table 2.4 presents 3 examples of approximating distributions generated by (2.45) for the indicated values of λ.
Simulated data sets generated according to (2.45) for the 3 values of λ shown in Table 2.4 may be seen to exhibit properties closely mimicking those of the normal, logistic and uniform distributions, respectively.

By using the representation (2.45) we are now able to express moments of order statistics in closed form. Indeed, substitution of (2.45) into (2.40) yields linear
combinations of beta functions. For general sample size n and distributional shape parameter λ, the second moment of the i-th order statistic can be expressed explicitly and exactly in terms of beta functions as

E(x_(i)²) = (n!/((i − 1)!(n − i)!)) ∫_0^1 [u^λ − (1 − u)^λ]² u^(i−1) (1 − u)^(n−i) du
          = (n!/((i − 1)!(n − i)!)) [B(i, 2λ + n − i + 1) − 2B(λ + i, λ + n − i + 1) + B(2λ + i, n − i + 1)].   (2.46)
A simple algorithm was implemented in Mathematica that computes (2.46). The function is called SecondMomOrder[i, n, λ].
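A direct Python transcription of (2.46) — an illustrative sketch of mine; the thesis function itself is Mathematica — is straightforward:

```python
from math import gamma, factorial

def beta(a, b):
    # complete beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def second_mom_order(i, n, lam):
    # E[x_(i)^2] for the lambda-family x = u^lam - (1 - u)^lam, per (2.46)
    c = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return c * (beta(i, 2 * lam + n - i + 1)
                - 2 * beta(lam + i, lam + n - i + 1)
                + beta(2 * lam + i, n - i + 1))

print(second_mom_order(2, 3, 0.135))   # ≈ 0.0174 (near-normal case)
print(second_mom_order(2, 3, 1))       # median of U[-1,1] for n = 3
```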
For example, when λ = 0.135 the distribution of the random variable x generated by

x = u^0.135 − (1 − u)^0.135

approximates a normal distribution. When n = 3 the second moment of the median E(x̃²) is

SecondMomOrder[2, 3, 0.135] = 0.0174397,
which is precisely (2.46) for i = 2, n = 3, λ = 0.135. When λ = 1 then (2.45) becomes

x = 2u − 1,

which is distributed as U[−1, 1]. In this case, E(x̃²) is

SecondMomOrder[2, 3, 1] = 0.2.
2.4.2 The Case of the Median
We now use the tools discussed in the previous sections to evaluate properties of bootstrap estimators associated with the median. We commence by considering in detail the first moment of the bootstrap estimator of the second moment of the median in the case when the random sample is of size n = 3. This result will suggest that properties (such as moments or cumulants) of bootstrapped moments of order statistics are linear combinations of bootstrapped moments of order statistics of different sample sizes. Later we shall consider computations of such moments for various sample sizes and distributions. We use the following Table 2.5, which is analogous to Table 2.2 in the present context:

Table 2.5: Partitions and multinomial probabilities for a sample size of 3

w          L(w)   p(k)    p(w)
(3,0,0)    3      1/27    3/27
(2,1,0)    6      3/27    18/27
(1,1,1)    1      6/27    6/27
Noting that there are three partitions for n = 3 and substituting g(θ̂*) = (x̃*)² into (2.31) gives

E_x[E_{w|x}((x̃*)²)] = (1/9) E_x[(x̃*)² | (3,0,0)] + (6/9) E_x[(x̃*)² | (2,1,0)] + (2/9) E_x[(x̃*)² | (1,1,1)],   (2.50)

where x̃* | w_i denotes the sample median associated with the i-th partition w_i.

Consider first x̃* | (3,0,0) in (2.50). This denotes that the sample median is associated with the partition (3,0,0). The partition (3,0,0) places all the weight on the first observation; it may be interpreted as representing three identical realizations of a single random variable x, in which case n = 1. Hence,

x̃* | (3,0,0) =_d x.
Hence, the median of three identical realizations reduces to the median of a sample of size n = 1, i.e. a single random variable. Therefore,

E_x[(x̃*)² | (3,0,0)] = E(x²).
Next consider x̃* | (2,1,0). The partition (2,1,0) may be interpreted as representing the realizations of two distinct random variables, so that n = 2. One of the variables provides 2 identical observations while the other has generated a single different observation. There are thus two order statistics, each of which has probability 1/2 of representing the median. Hence,

x̃* | (2,1,0) =_d x_(1) or x_(2), each with probability 1/2,

where x_(1) ≤ x_(2) denote the order statistics of a sample of size 2. Applying the expectation operator gives

E_x[(x̃*)² | (2,1,0)] = (1/2) E(x_(1)²) + (1/2) E(x_(2)²),

by using (2.35).
Finally, consider x̃* | (1,1,1). The partition (1,1,1) may be interpreted as representing respective realizations of 3 distinct random variables. In this case n = 3. There are thus three distinct order statistics and it follows that

x̃* | (1,1,1) =_d x_(2),
which provides

E_x[(x̃*)² | (1,1,1)] = E(x_(2)²).

Equation (2.50) now becomes

E_x[E_{w|x}((x̃*)²)] = (7/9) E(x²) + (2/9) E(x̃²),   (2.52)

where E(x̃²) = E(x_(2)²). Equation (2.52) represents a linear combination of a second moment for sample size 1 and a second moment of the median for a sample size 3.
It is interesting to note that this expression is weighted in favour of the sample of size 1. It is an analytical expression, a formula, which is true for any distribution. It can be efficiently evaluated, numerically and exactly, for any distribution of interest belonging to the λ-family of distributions generated by (2.45). Similar analytical formulas may be generated for any sample size, although we consider only odd sample sizes in this thesis. The method allows us to study, for example in this context, the properties of the bootstrap procedure which estimates the second moment of the sample median.
The algorithm describing the calculation of the conditional expectation of the squared sample median for a given partition, for odd values of n and arbitrary parent distribution function F(x), is given by:

Algorithm 2.2 (Calculation of E_{w|x}[(x̃*)²] for F(x) and odd n):

• Compute the number of distinct order statistics for the given partition. This equals the number of non-zero elements in the partition. In the case of repeated observations the corresponding order statistic is the common value.
• Compute all permutations of the given partition. Each permutation assumes observations arranged in descending order.

• For each permutation record the order statistic which corresponds to the median.

• Count the number of permutations in which the first order statistic is the median. Do the same for all order statistics.

• Divide each count by the total number of permutations. Each order statistic is hence associated with a numerical probability that it represents the median.

• The conditional expectation of the squared sample median for a given partition is the sum of products of the expectation (over an arbitrary parent F) of each squared order statistic and its associated probability of representing the median for that partition. The expectation is evaluated, in general, by numerical integration.

• If F is a member of the λ-family then closed forms of moments of order statistics are evaluated using the symbolic function SecondMomOrder[i, n, λ].

The tool which gives the order statistics of a partition w with the respective probabilities they have of representing the median is called Med[w] and the code is available in Appendix A.1 in a file called os.m.
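One way to realize the core of Algorithm 2.2 and the tool Med[w] in Python — my own sketch, following the interpretation used above for the partitions (3,0,0), (2,1,0) and (1,1,1): the m non-zero weights are multiplicities attached to the order statistics of a sample of size m — is:

```python
from itertools import permutations
from fractions import Fraction

def median_probs(w):
    # For a partition w of odd n, return the probability that each order
    # statistic (of the m distinct underlying variables) is the resample median.
    nz = [v for v in w if v > 0]
    assigns = set(permutations(nz))          # distinct assignments of multiplicities
    counts = {}
    for a in assigns:
        resample = sorted(i + 1 for i, mult in enumerate(a) for _ in range(mult))
        med = resample[len(resample) // 2]   # index of the median order statistic
        counts[med] = counts.get(med, 0) + 1
    return {i: Fraction(c, len(assigns)) for i, c in sorted(counts.items())}

print(median_probs((3, 0, 0)))   # order statistic 1 with probability 1
print(median_probs((2, 1, 0)))   # order statistics 1 and 2, each with probability 1/2
print(median_probs((1, 1, 1)))   # order statistic 2 (the median of 3) with probability 1
```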
Similarly, the first moment of the bootstrap estimate of variance of the median may be generally expressed as

E_x[Var_{w|x}(x̃*)],   (2.53)

which for the case n = 3 becomes

E_x[Var_{w|x}(x̃*)] = (7/9) × Var(x) + (2/9) × Var(x̃).   (2.54)
The exact bias of the bootstrap estimator of variance of the median for n = 3 follows immediately from (2.54) as

BIAS[Var_{w|x}(x̃*)] = E_x[Var_{w|x}(x̃*)] − Var(x̃)
                    = (7/9) × Var(x) + (2/9) × Var(x̃) − Var(x̃)
                    = (7/9) × [Var(x) − Var(x̃)].   (2.55)

A similar evaluation with respect to (2.52) yields

BIAS[E_{w|x}((x̃*)²)] = (7/9) × [E(x²) − E(x̃²)].   (2.56)
Note that the discussion in this chapter centers on the evaluation of

E_x[E_{w|x}((x̃*)²)],   (2.58)

which is identical to

E_x[Var_{w|x}(x̃*)] + Var(x̃),

assuming that

E(x̃) = 0,   (2.59)

that is, that the population median is zero. More generally, for an estimator θ̂ where we denote E_x(θ̂) = θ, the inner expectation in (2.58) may be written as

E_{w|x}((θ̂* − θ)²) ≈ E_{w|x}((θ̂* − θ̂)²) + (θ̂ − θ)²,   (2.60)

assuming that E_{w|x}(θ̂* − θ̂) is approximately uncorrelated with θ̂ − θ. Considering
the expectation of (2.60) over x, we then observe that

E_x[E_{w|x}((θ̂* − θ)²)] ≈ E_x[E_{w|x}((θ̂* − θ̂)²)] + E_x[(θ̂ − θ)²] ≈ 2 E_x[(θ̂ − θ)²],   (2.61)

assuming E_{w|x}((θ̂* − θ̂)²) is close to (θ̂ − θ)². It follows that

E_x[E_{w|x}((θ̂*)²)] ≈ 2 E(θ̂²)   (2.62)

if θ = 0 is assumed. In the case of the median we have assumed (2.59) in this chapter and thus we will observe

E_x[E_{w|x}((x̃*)²)] ≈ 2 E(x̃²).   (2.63)

The relationship given by (2.63) is corroborated by numerical results discussed later. This does not signify that the bootstrap second moment of the median, E_{w|x}((x̃*)²), is positively biased, but only that (2.63) is due to the assumption (2.59) that has been made.
The Mathematica functions ExpBoot[n, λ] and VarBoot[n, λ] respectively compute

E_x[E_{w|x}((x̃*)²)]   (2.64)

and

Var_x[E_{w|x}((x̃*)²)]   (2.65)

by using the function SecondMomOrder[i, n, λ] seen above, and so work only for the λ-family of approximating distributions and odd sample size n. The bias of the
bootstrap estimator of the second moment of the median may be expressed as

BIAS = E_x[E_{w|x}((x̃*)²)] − E_true(x̃²),   (2.66)

where E_true(x̃²) represents the true value of the second moment of the median. The latter can be computed by using SecondMomOrder[i, n, λ] whenever x_(i) = x̃.

The algorithms describing the evaluation of each of (2.64) and (2.65) are given next:
Algorithm 2.3 (Calculation of ExpBoot[n, λ]):

• Derive all partitions of the positive integer n using FPL[n].

• Compute the multinomial-type probability of occurrence of each partition w using Prob[w] from Appendix A.1.

• Use Med[w] to find the order statistics for each partition w having non-zero probabilities of representing the median. Compute the probabilities.

• Compute the conditional expectation of the squared sample median for each partition by summing the products of SecondMomOrder[i, n, λ] and the respective probabilities the order statistics have of representing the median.

• Sum the conditional expectations found in the previous step, weighted by the respective multinomial-type probabilities of occurrence of the partitions.

The code corresponding to the tool ExpBoot[n, λ] is available in Appendix A.1 in a file called Expboot.m.
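Putting the pieces together, the computation that Algorithm 2.3 describes can be sketched end to end in Python (my reconstruction, not the thesis's Mathematica code; the function names are mine):

```python
from itertools import permutations
from math import gamma, factorial

def beta(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def second_mom_order(i, n, lam):
    # E[x_(i)^2] under the lambda-family, eq. (2.46)
    c = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return c * (beta(i, 2 * lam + n - i + 1)
                - 2 * beta(lam + i, lam + n - i + 1)
                + beta(2 * lam + i, n - i + 1))

def partitions(n, mx=None):
    mx = mx or n
    if n == 0:
        yield ()
    else:
        for f in range(min(n, mx), 0, -1):
            for rest in partitions(n - f, f):
                yield (f,) + rest

def expboot(n, lam):
    # E_x[E_{w|x}((median*)^2)] for odd n and a lambda-family parent
    total = 0.0
    for part in partitions(n):
        w = part + (0,) * (n - len(part))
        L = len(set(permutations(w)))        # L(w), eq. (2.26)
        p_w = L * factorial(n) / n ** n      # p(w) = L(w) p(k)
        for wi in w:
            p_w /= factorial(wi)
        nz = [v for v in w if v]
        m = len(nz)                          # number of distinct order statistics
        assigns = set(permutations(nz))      # multiplicity-to-order-statistic assignments
        for a in assigns:
            res = sorted(i + 1 for i, mult in enumerate(a) for _ in range(mult))
            med_stat = res[len(res) // 2]    # which order statistic is the median
            total += (p_w / len(assigns)) * second_mom_order(med_stat, m, lam)
    return total

print(expboot(3, 0.135))   # ≈ 0.0342
```

For n = 3 this reduces to a weighted combination of E(x²) and the second moment of the median, as derived from first principles above.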
Algorithm 2.4 (Calculation of VarBoot[n, λ]):

• Derive all partitions of the positive integer n using FPL[n].

• Compute the square of the multinomial-type probability of occurrence of each partition w using Prob[w] from Appendix A.1.
• Use Med[w] to find the order statistics for each partition w having non-zero probabilities of representing the median. Compute the probabilities.

• Compute the conditional variance of the squared sample median for each partition by summing the products of VarSqOrder[i, n, λ] with the respective probabilities the order statistics have of representing the median. This tool is a function of SecondMomOrder[i, n, λ].

• Sum the conditional variances found in the previous step, weighted by the respective squared multinomial-type probabilities of occurrence of the partitions.

The code corresponding to the tool VarBoot[n, λ] is available in Appendix A.1 in a file called Varboot.m.
The expressions (2.64), (2.65) and (2.66) represent various properties of the bootstrap estimator of the second moment of the median which can be explicitly evaluated for given n and λ. For example, when n = 9 and λ = 0.135, (2.64) evaluates to an explicit numerical value. However, setting only n = 3 in ExpBoot[n, λ] yields the formula
in which λ remains a free symbolic parameter. This is precisely the λ-family equivalent of the first expression in (2.52). Similar analytical representations may be given for any n. We qualify this, however, as follows. Note that (2.67) contains 3 major terms. Table 2.1 indicates that for n = 3, ten terms would be involved if we were evaluating the bootstrap estimator directly. The exact calculation of the bootstrap estimator (based on exhaustive enumeration of resamples) for a sample size of n = 15 would involve the enumeration of almost 80,000,000 resamples, whereas a mere 176 partitions would be required for the exact computation of a property of the same bootstrap estimator! This indicates that the calculations of moments of bootstrap estimators using the methodology developed in this chapter may be carried out for much larger sample sizes than would be feasible if bootstrap estimators were to be evaluated directly.
Recall that we have shown by first principles that for n = 3, (2.64) reduces to (2.52). We can hence compute the right hand side of (2.52) directly as a check on the Mathematica code underlying ExpBoot[n, λ]. Indeed, E(x²) in (2.52) is computed as SecondMomOrder[1, 1, 0.135], and E(x̃²) was given in (2.48) as

SecondMomOrder[2, 3, 0.135] = 0.0174397.
Thus the right hand side of (2.52) is

(7/9) × E(x²) + (2/9) × E(x̃²) = 0.0342303,

which substantiates the left hand side of (2.52) as computed by ExpBoot[n, λ]. Similarly, when n = 3, λ = 1 we have x = 2u − 1 ~ U[−1, 1], and then the right hand side of (2.52) is again

(7/9) × E(x²) + (2/9) × E(x̃²),

for which

E(x²) = 1/3,

and (2.49) permit

(7/9) × (1/3) + (2/9) × 0.2 = 0.303704,

as desired.
The next section presents tabulated results of the Mathematica functions described here for various n and λ. Included in the tables are the results of independent simulations which provide checks of the algorithms underlying the functions.
2.4.3 Some Numerical Results
Tables 2.6, 2.7 and 2.8 below present numerical results for various small sample sizes and three values of λ associated with 10 quantities. The quantities have been multiplied by n so as to remove the decreasing effect due to increasing sample size.

The ten quantities presented in these tables are described as follows:

1. ExpBoot represents the Mathematica function ExpBoot[n, λ] defined in the preceding section.
Table 2.6: Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median, using several sample sizes, for the case of the normal distribution

2. ExpBootSim was calculated as a check of the correctness of the Mathematica code underlying ExpBoot[n, λ]. ExpBootSim is an estimate of ExpBoot[n, λ] for a particular n and λ and was computed according to the following algorithm. This algorithm was coded using S-Plus.

• For each value of λ and n carry out the following:

• Randomly select 1000 samples of size n (odd) from the U[0,1] distribution.

• For each sample:

  - select B = 1 resamples of size n

  - calculate the median x̃* of the resample

  - transform to a λ-family variate using x = (x̃*)^λ − (1 − x̃*)^λ

• Calculate ExpBootSim = mean(x_1², …, x_1000²)

• Calculate ExpBootSim_stderr = [variance(x_1², …, x_1000²)/1000]^(1/2)
3. ExpBootSim_stderr is defined in 2. above.

4. E_true(x̃²) is computed by the function SecondMomOrder[i, n, λ] whenever the i-th order statistic is the median in a sample of odd sample size n. This function
Table 2.7: Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median, using several sample sizes, for the case of the uniform distribution. Columns include n E_true(x̃²), n E_sim(x̃²), E_sim(x̃²)_stderr, n Bias and n VarBoot.
was defined in the preceding section.

5. E_sim(x̃²) was calculated as a check of the correctness of the Mathematica code underlying SecondMomOrder[i, n, λ], which computes E_true(x̃²) whenever the i-th order statistic is the median in a sample of odd sample size n. E_sim(x̃²) is an estimate of E_true(x̃²) for a particular n, λ, and i and was computed according to the algorithm described below. This algorithm was coded using S-Plus.

• For each value of λ and n carry out the following:

• Randomly select 1000 samples of size n (odd) from the U[0,1] distribution.

• For each sample:

  - calculate the median x̃

  - transform to a λ-family variate using x = (x̃)^λ − (1 − x̃)^λ

• Calculate E_sim(x̃²) = mean(x_1², …, x_1000²)

• Calculate E_sim(x̃²)_stderr = [variance(x_1², …, x_1000²)/1000]^(1/2)

6. E_sim(x̃²)_stderr is defined in 5. above.

7. Bias = ExpBoot[n, λ] − E_true(x̃²)
Table 2.8: Analytical and simulated results for various properties and bootstrap estimates related to the 2nd moment of the median, using several sample sizes, for the case of the logistic distribution (λ = 0; sample sizes n = 1, 3, 5, 7, 9)
8. VarBoot represents the Mathematica function VarBoot[n, λ] defined in the previous section.

9. Boot_conv represents a conventional bootstrap calculation of the second moment of the median and was computed according to the algorithm described below. This algorithm was coded using S-Plus.

• For each value of λ and n carry out the following:

• Randomly select one sample of size n (odd) from the U[0,1] distribution.

• From this single sample select B = 1000 bootstrap resamples.

• For each of the bootstrap resamples:

  - calculate the median x̃*

  - transform to a λ-family variate using x = (x̃*)^λ − (1 − x̃*)^λ

• Calculate Boot_conv = mean(x_1², …, x_1000²)

• Calculate Boot_conv_stderr = [variance(x_1², …, x_1000²)/1000]^(1/2)

10. Boot_conv_stderr is defined in 9. above.
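The conventional bootstrap calculation of item 9 can be sketched in Python rather than S-Plus (my illustration; the function name, B and the seed are arbitrary choices):

```python
import random

def lam_transform(u, lam):
    # lambda-family variate x = u^lam - (1 - u)^lam, as in (2.45)
    return u ** lam - (1 - u) ** lam

def boot_second_moment_median(n, lam, B=1000, seed=0):
    # conventional bootstrap: one U[0,1] sample, B resamples, median of each
    # resample transformed to the lambda family and squared
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    vals = []
    for _ in range(B):
        res = sorted(rng.choice(sample) for _ in range(n))
        vals.append(lam_transform(res[n // 2], lam) ** 2)   # n odd
    mean = sum(vals) / B
    stderr = (sum((v - mean) ** 2 for v in vals) / B / B) ** 0.5
    return mean, stderr

print(boot_second_moment_median(3, 1))   # (estimate, simulation standard error)
```

Unlike the exact partition-based computation, this estimate carries both resampling noise (finite B) and sampling noise from the single underlying sample.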
2.5 Conclusions and Discussion
In this chapter we have detailed a diagnostic methodology that permits the study of properties of bootstrap estimators associated with order statistics. This approach was illustrated for the case where the estimator is the median of a sample of odd sample size. The ideal bootstrap estimator of the second moment of the median can be expressed as a weighted multinomial average of order statistics, conditional on the underlying sample, and this can be computed exactly through exhaustive enumeration of distinct resamples.

Algorithms were developed that permit moments of bootstrap estimators in this context to be expressed symbolically and, if desired, numerically, for any distribution and for any odd sample size. The methodology provides exact results. The correctness of the algorithms was checked by comparing numerical results generated by the algorithms with independent simulations based on conventional Monte Carlo resampling. This has been confirmed by the results given in Tables 2.6, 2.7 and 2.8.
The numerical results corroborate the relationship given by (2.63), especially for a Gaussian shaped distribution. The apparent 2:1 bias suggested in Tables 2.6, 2.7 and 2.8 is largely due to the definitional assumption (2.59) that has been made rather than to any long-run discrepancy between the bootstrap second moment of the median and what it is attempting to estimate.
The methodology developed in this chapter for studying properties of bootstrap estimators involving exhaustive enumeration is deemed useful for sample sizes of moderate size. Indeed, characteristics of bootstrap procedures are more unpredictable for smaller sample sizes than for larger sample sizes, and the symbolic tools presented here provide a technique for assessing these properties. The consensus in the literature is that bootstrap estimates calculated on the basis of exhaustive enumeration may be competitive with Monte Carlo resampling up to about n = 10. The reason that properties of bootstrap estimates may be assessed for much larger sample sizes is that the calculations involved here use partitions, which are substantially fewer than the respective number of all possible distinct resamples.
Table 2.1 illustrates this. For example, the exact calculation of a bootstrap estimator (based on exhaustive enumeration of resamples) for a sample size of n = 15 would involve the enumeration of almost 80,000,000 resamples, whereas a mere 176 partitions would be required for the exact computation of a property (i.e. a moment) of the same bootstrap estimator. It appears feasible that the algorithms discussed in this chapter could be further developed to permit symbolic expression of higher moments of bootstrap estimators associated not only with more general L-estimators but with M-estimators as well. In addition, greater computational efficiency could be achieved by eliminating multinomial weights possessing negligible probabilities.

In the next chapter, we will discuss an alternative methodology that permits analytical expression of bootstrap estimators and corresponding properties associated with the class of M-estimators.
Chapter 3
Bootstrapped M-estimators and
their Properties
3.1 Introduction
In this chapter, procedures are developed and implemented for analytically computing bootstrap estimators and their properties for statistics defined by estimating equations and functions of these. The resulting symbolic quantities are expressed in terms of series expansions to any desired order. Moreover, these formulas may then be evaluated numerically for any particular set of n iid observations. This method entirely avoids the use of conventional Monte Carlo resampling.
The basis for the proposed methodology comprises the six following steps:

1. A statistic defined by estimating equations is approximated to any order by a finite linear combination of products of normalized, centered sums of independent and identically distributed random variables.

2. A bootstrap replicate of such a statistic may be similarly (analytically) expressed as a finite linear combination of products of weighted sums. The weights arise from the multinomial distribution. The weighted sums are decomposed into simple sums and products of sums involving distinct observations (henceforth referred to as disjoint sums).
CHAPTER 3. BOOTSTRAPPED M-ESTIMATORS
3. The bootstrap expectation of the statistic is obtained by considering the expectation of the series expansion with respect to the multinomial weights arising in (2). The moments of the weights are found from the moment generating function of the multinomial distribution.

4. Products of sums involving distinct observations are re-expressed as linear combinations of simple sums and ordinary products of sums.

5. Properties of the preceding expressions may be evaluated. This corresponds to calculating various moments of the analytical bootstrap quantities with respect to the original random variable x, for example, in the one-sample case.

6. Symbolic expressions representing bootstrap moments and their properties may be further assessed numerically if desired by evaluating these for a particular set of data.
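To make step (1) concrete, here is a small numerical illustration of my own (the thesis works symbolically in Mathematica; the Huber score with c = 1.345 is just an assumed example): a one-dimensional M-estimator solved exactly by bisection, compared with its first-order expansion about θ0 in terms of a normalized sum of the ψ(x_i, θ0):

```python
import random

C = 1.345   # Huber tuning constant (an illustrative choice)

def psi(x, t):
    # Huber score: psi(x, t) = clip(x - t, -C, C), decreasing in t
    return max(-C, min(C, x - t))

def m_estimate(xs, lo=-50.0, hi=50.0):
    # exact root of (1/n) sum_i psi(x_i, theta) = 0 by bisection
    def g(t):
        return sum(psi(x, t) for x in xs) / len(xs)
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_term_expansion(xs, theta0=0.0):
    # first-order inversion of the estimating equation about theta0:
    # theta_hat ~ theta0 + mean psi(x, theta0) / mean 1{|x - theta0| < C},
    # a centered sum of iid terms divided by a derivative term
    n = len(xs)
    num = sum(psi(x, theta0) for x in xs) / n
    den = sum(1.0 for x in xs if abs(x - theta0) < C) / n
    return theta0 + num / den

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)]
print(m_estimate(data), one_term_expansion(data))   # the two are close, near 0
```

Higher-order terms of the expansion refine this approximation; the thesis carries such expansions symbolically to arbitrary order.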
In Sections 3.2.2–3.2.4 the main methodological details are presented. Specifically, we establish (1) in Section 3.2.2 by considering the inversion of the Taylor series representation of an estimating equation. Sections 3.2.3–3.2.4 discuss (2)–(3), while (4)–(6) are discussed in Section 3.2.4. Section 3.3 focuses on computational issues. In Section 3.4 we apply the algorithms to the case of the correlation coefficient. Specifically, we compare our results to those obtained by Efron and Tibshirani (1993) using their Law School data. We also consider an artificially generated sample of n = 100 bivariate Gaussian observations.
The methods described in the present chapter have been implemented for machine computation and, in fact, depend primarily on two key symbolic routines: one for computing asymptotic expansions of roots of estimating equations, and one for computing bootstrap analogues of these expansions. The algorithms have been implemented using Mathematica (Wolfram, 1988). The routines are very simple to apply to a broad range of statistics. The Mathematica code is presented in Appendix A.2 of this thesis.
3.2 Bootstrapping M-estimators: An Analytical Approach
3.2.1 Introduction
The following sub-sections present the methodological details relevant to steps (1)-(6) given above.
3.2.2 An Approximation for Functions of M-estimators
In this section we show how a single arbitrary M-estimator θ̂, or a smooth function of such estimators, may be expressed as a particular sequence of increasingly accurate series approximations. The discussion below follows Andrews and Feuerverger (1993). Alternative presentations may also be found in Ferguson (1989) and McCullagh (1987).
Let x_1, . . . , x_n represent independently and identically distributed observations from a distribution indexed by a parameter θ, with θ_0 being the assumed true value. As developed by Huber (1964, 1967), we consider first a one-dimensional M-estimator θ̂, assumed to be the solution of the equation

If the x_i's arise from the density f(x, θ), and if ψ(x, θ) = (∂/∂θ) log f(x, θ), it follows that θ̂ is the maximum likelihood estimate of θ.
Denoting expectation with respect to f_{θ_0} by E_0, we shall require that

    E_0 ψ(x, θ_0) = 0                                            (3.2)

and

    E_0 ψ'(x, θ_0) ≠ 0

so that θ̂ is √n-consistent for θ_0 (see, for example, Huber (1977)). The following
construction enables us to write θ̂ in the form of a series approximation. Consider the k-th order Taylor expansion of (3.1) about θ_0:

where

The superscripts (j) on ψ represent differentiation with respect to θ. The remainder term appearing in (3.3) may be expressed as

for some θ* between θ̂ and θ_0, and is of the order indicated whenever θ̂ = θ_0 + Op(n^{-1/2}), provided only that ψ^{(k+1)}(x, θ) is uniformly integrable in a neighborhood of θ_0.
Now observe that (3.3) may alternatively be written as
where
and so
Note that the j = 0 term in the first sum of (3.4) has been omitted in view of (3.2).
Andrews and Feuerverger (1993) present an iterative inversion of the power series representation (3.4) so as to provide increasingly accurate approximations to the root θ̂ of the equation (3.1). It is important to note that the terms Z_n, henceforth referred to as residual sums, appearing in (3.4) are normalized, centered averages and hence are of precise order Op(1) provided the ψ^{(j)} have finite second moments; i.e., they are of order Op(1) and not of any lower order. (See Hall (1992), pp. xii-xiii, for a fuller discussion.) Indeed, the use of Op(1) random variables of the form (3.5) and (3.6) in (3.4) will conveniently allow the expression of successive iterative approximations to θ̂ in terms of stochastic asymptotic expansions in powers of the sample size n.
The specific sequence of approximate roots θ̂_l to (3.1) is constructed inductively to have the form

for l = 1, 2, . . . , where the δ_l's are Op(1), depend upon n, and, further,

    θ̂_l + δ_l = θ̂ + Op(n^{-(l+2)/2}).                             (3.8)

Each δ_l has the special form

that is, a linear combination of products of exactly l + 1 normalized, centered averages Z_n of the type (3.5), (3.6). The inductive construction (3.7) is rendered complete by finally defining

for l = 1, 2, . . . . It is observed that the representation (3.10) is of order Op(1), satisfies (3.8), and has the form given by (3.9). This procedure is a symbolic analogue
of a quasi Newton-Raphson iteration; see Andrews and Stafford (1993).
For example, setting l = 2, the third-order expansion of the root θ̂ in iterative form may be expressed as
We now consider the analogous multiparameter problem. The multivariate M-estimator

for the component parameters
having the true values

is the solution of the system of equations

for j = 1, . . . , p. The methods of the previous paragraphs extend directly (care should be used in the matrix algebra since, for example, E_0 ψ'(X, θ_0) is now a matrix) to permit us to write each component θ̂_j of θ̂ in the form (3.7), (3.9), where any of the terms Z_n appearing in (3.9) can now be any of the terms (3.5) or (3.6) corresponding to any of the ψ-functions ψ_1, . . . , ψ_p. Now suppose that we are interested in some smooth function, say g, of the component M-estimators, so that ĝ = g(θ̂_1, . . . , θ̂_p) is viewed as being the statistic of interest, and g_0 = g(θ_01, . . . , θ_0p) as the quantity of inferential interest. Then, by substituting approximations which are the vector analogues of (3.7) (for each component of the M-estimator) into a straightforward Taylor expansion for g, we are led to the fact that
where

    U_r = γ_0 + γ_1/n^{1/2} + γ_2/n + · · · + γ_r/n^{r/2}

and the γ_i's are each Op(1) linear combinations of products of i + 1 normalized, centered averages as in (3.9), that is,
As before, this procedure enables us to consider the members of the sequence of
truncated expansions as approximants to the quantity √n (ĝ − g_0).
3.2.3 Weighting and Decomposing Residual Sums
Conventional bootstrapping is based on the simulated generation of a finite number of resamples drawn with replacement from the original sample. Bootstrap replicates of a statistic θ̂ of interest are computed, and the desired moment associated with the statistic can then be estimated. We observed in the previous section that θ̂ may be approximated by using the expansion (3.7) together with (3.10). In this section we consider converting (3.7) and (3.10) into bootstrap replicates analytically, that is, without resorting to Monte Carlo simulation. We shall do so by algebraically introducing those weights which are implicitly generated by the repeated sampling with replacement inherent in bootstrapping. We consider the single-parameter case below.
From (2.5) and (2.6) it follows that

Consequently, the introduction of multinomial weights w_i into (3.6) yields the weighted residual sum

Weighted (i.e. bootstrap) versions of (3.7) and (3.10) may hence be similarly obtained by observing that
and
with (3.6) replaced by (3.15).
Consider the example in (3.11) in this context. The third-order expansion of the bootstrap replicate θ̂* is observed to represent a function of moments and weighted residual sums, as well as products of these. That is, the third-order expansion of θ̂* can be expressed as a linear combination of products of weighted residual sums. Hence,

We now observe that products of residual sums algebraically decompose into linear combinations of simple sums, disjoint sums, and products of these. This fact will be important for the subsequent development of the present discussion, as it will facilitate computation of moments corresponding to the multinomial weights. A computational procedure for providing such a decomposition for any product of residual sums will be given in Section 3.3. We note that in (3.16) the second-order term contains products of two weighted residual sums, the third-order term contains products of three weighted residual sums, and so on. Consider the product Z_n*(θ_0) Z_n^{(1)*}(θ_0). Disregarding centering and normalizing operations, and letting ψ(x_i, θ) = ψ_i for the sake of clarity, we have

Similarly, a product of three weighted residual sums produces, for example,
The combinatorial array of terms in (3.18) suggests a well-known pattern that corresponds to a partitioning of a list of three items. The computational tool FP[list] provides a full partition of a list of any number of items. For example,

    FP[{a, b, c}] = {{{a, b, c}}, {{a, b}, {c}}, {{a, c}, {b}}, {{a}, {b, c}}, {{a}, {b}, {c}}}    (3.19)

contains the five partitions from which the respective terms in (3.18) have been constructed. Similarly,
is used as the basis for decomposing a product of four weighted residual sums into a sum of fifteen component terms, and so on. The above expressions may be readily evaluated using tools for symbolic computation. The procedures described below were based on those developed by Andrews et al. (1993) and subsequent unpublished work. These basic procedures were adapted for the present study. The tool FP[list] is described by the following algorithm:
Algorithm 3.1:
Let FP[{a}] = {{{a}}},
    FP[{a, b}] = {{{a, b}}, {{a}, {b}}}.
To find the partitions in FP[{a, b, c}], do the following:
  • To each element of each partition in the preceding list, join c.
  • To each partition in the preceding list, append the new block {c}.
Thus, FP[{a, b, c}] is given by (3.19) above.
Apply this rule to find the partitions of any list {a, b, c, . . .}.
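The inductive rule above is straightforward to render in code. The following Python sketch (illustrative; the thesis implementation is in Mathematica) builds the full partitions by joining the newest item to each block of each preceding partition and then appending it as a block of its own. The counts 2, 5, 15 recover (3.19) and the fifteen-term decomposition noted above.

```python
def FP(items):
    """Full partitions of a list, built by the inductive rule of Algorithm 3.1."""
    if len(items) == 1:
        return [[[items[0]]]]
    last = items[-1]
    partitions = []
    for p in FP(items[:-1]):
        # join the newest item to each block of each preceding partition ...
        for i in range(len(p)):
            partitions.append([block + [last] if j == i else list(block)
                               for j, block in enumerate(p)])
        # ... and also append it as a new block of its own
        partitions.append([list(block) for block in p] + [[last]])
    return partitions

assert len(FP(list("ab"))) == 2
assert len(FP(list("abc"))) == 5    # the five partitions in (3.19)
assert len(FP(list("abcd"))) == 15  # fifteen component terms, as noted above
```

The counts follow the Bell numbers, as expected for set partitions.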
In this section we have indicated that the bootstrap replicate θ̂* may be approximated arbitrarily well by weighted analogues of (3.7) and (3.10), that is, by a truncated series in products of weighted residual sums. Moreover, these products may further be decomposed into linear combinations of simple and disjoint sums. It follows that any smooth function g of θ̂*, for example g(θ̂*) = (θ̂*)², may also be expressed as a polynomial series in decomposed products of weighted residual sums. Finally, the bootstrap expectation of g(θ̂*) is an expectation over a multinomial vector of weights w conditional on a sample x, as in (2.16), that is,

which can be expressed in a similar form. We consider this in the next section.
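As a concrete illustration of bootstrap expectations of this kind, take θ̂ to be the sample mean and g(θ̂*) = (x̄*)². The multinomial weight moments give E_{w|x}[(x̄*)²] = x̄² + Σ_i (x_i − x̄)²/n², which the Python sketch below (synthetic data, exact rational arithmetic; illustrative only, not thesis code) checks against brute-force enumeration of all nⁿ resamples:

```python
from fractions import Fraction
from itertools import product

data = [1, 2, 6, 9]          # small illustrative sample
n = len(data)
xbar = Fraction(sum(data), n)

# Brute-force ideal bootstrap expectation of g(theta*) = (xbar*)^2,
# enumerating all n^n equally likely resamples (feasible only for tiny n)
exact = Fraction(0)
for draw in product(range(n), repeat=n):
    m = Fraction(sum(data[i] for i in draw), n)
    exact += m * m
exact /= n ** n

# Analytic form obtained from the multinomial weight moments:
# E_{w|x}[(xbar*)^2] = xbar^2 + sum_i (x_i - xbar)^2 / n^2
analytic = xbar ** 2 + sum((Fraction(x) - xbar) ** 2 for x in data) / n ** 2
assert exact == analytic
```

The exact agreement reflects the fact that, for this simple g, the truncated weighted-series expansion is already the full expansion.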
3.2.4 Analytical Bootstrap Moments of M-Estimators and their Properties
In this section we consider analytical representations of the bootstrap estimator and its properties. For example, when g(θ̂*) = θ̂* or g(θ̂*) = [θ̂* − E_{w|x}(θ̂*)]², then (3.22) provides the bootstrap estimates of mean or variance, given respectively by

and

As before, the discussion here shall consider only ideal bootstrap estimates, in the sense that B → ∞ (see Section 2.2).
Since expectation is a linear operator, E_{w|x}[g(θ̂*)] may be expressed as a truncated polynomial series in decomposed residual sums, with some coefficients arising from various moments of the weights w. For example, taking the bootstrap expectation
Note that the second step in (3.25) is possible because the multinomial weights (w_1, . . . , w_n) are identically distributed and, though not independent, are exchangeable. That is, E[w_i w_j] = E[w_1 w_2] for i ≠ j, but E[w_1 w_2] ≠ E[w_1] · E[w_2]. The
coefficients E_{w|x}[w_1²] and E_{w|x}[w_1 w_2] are known and may be simply computed from the moment generating function of the multinomial distribution. Moreover, the disjoint sum in the second term on the right side of (3.25) may be expressed as a linear combination of products of simple sums. Similar evaluations would take place when evaluating the bootstrap expectation of any product of weighted residual sums.
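These weight moments are easy to verify numerically. For w ~ Multinomial(n; 1/n, . . . , 1/n), the moment generating function (or direct binomial and covariance calculations) gives E[w_1²] = 2 − 1/n and E[w_1 w_2] = 1 − 1/n. The following Python sketch is illustrative only and confirms this by exact enumeration for a small n:

```python
from fractions import Fraction
from itertools import product

def weight_moments(n):
    """Exact E[w1^2] and E[w1*w2] for w ~ Multinomial(n; 1/n, ..., 1/n),
    by enumerating all n^n equally likely resamples (tiny n only)."""
    total = n ** n
    e_w1sq = Fraction(0)
    e_w1w2 = Fraction(0)
    for draw in product(range(n), repeat=n):
        w1 = draw.count(0)   # number of times observation 1 is drawn
        w2 = draw.count(1)   # number of times observation 2 is drawn
        e_w1sq += Fraction(w1 * w1, total)
        e_w1w2 += Fraction(w1 * w2, total)
    return e_w1sq, e_w1w2

n = 4
e_w1sq, e_w1w2 = weight_moments(n)
assert e_w1sq == 2 - Fraction(1, n)   # E[w1^2] = 2 - 1/n
assert e_w1w2 == 1 - Fraction(1, n)   # E[w1 w2] = 1 - 1/n
```

Note that E[w_1 w_2] = 1 − 1/n ≠ E[w_1] E[w_2] = 1, reflecting the negative correlation of the multinomial counts.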
The series evaluation of the ideal bootstrap estimator E_{w|x}[g(θ̂*)] to any level of precision thus depends solely on the specification of the appropriate ψ-functions defining the M-estimator θ̂, while numerical evaluation would require observation of the sample (x_1, . . . , x_n). The specified ψ-functions must be differentiable up to the desired order of the expansion of θ̂. For example, the bootstrap estimates of mean and variance, given respectively by (3.23) and (3.24), can each be expressed analytically as a linear combination of residual sums; symbolic evaluation in each case would depend only on the specification of the appropriate ψ-functions, and subsequent numerical evaluation, if desired, would require the sample x = (x_1, . . . , x_n).
Finally, one may consider properties of the series representation of E_{w|x}[g(θ̂*)]. Indeed, any expectation, over x, of such a series may be calculated directly. For example, the mean and variance of the bootstrap estimator E_{w|x}[g(θ̂*)] are given by

and

and such expressions can be evaluated in a straightforward manner.
The salient feature of the representations described in this section is that they can be expressed in symbolic form. This precludes the necessity for Monte Carlo simulation. In the next section we shall describe and illustrate several computational procedures, developed using Mathematica, that permit analytic expression of the bootstrap quantities considered above. This methodology will then be applied to the case of the correlation coefficient.
3.3 Computational Tools
In this section we will describe and illustrate a number of computational procedures developed using Mathematica that permit analytic expression of various quantities considered in the previous two sub-sections.
In Section 3.2.2 it was observed that (3.7) together with (3.10) provide increasingly accurate iterative approximations to a single-dimensional M-estimator θ̂. An algorithm has been implemented in Mathematica that computes (3.7), (3.10). This function is called ThetaHat[i] and it returns the expansion for θ̂, correct to order n^{-i/2}. For example, the second-order approximation to the root θ̂ is computed as

This corresponds to the sum of the first four terms in (3.11), where Z_n(ψ(X, θ)) is abbreviated Z_n(θ) and expectations are represented as cumulants. When i = 3, an expression with nine terms is returned. When i = 4, the function returns an expression with 19 terms. Similarly, the 6th-order expansion contains 73 terms. The expansion for the M-estimator may thus be derived to any order. The code for the tool ThetaHat[i] is available in Appendix A.2 in a file called Bootcum.m.
Now the bootstrap quantity θ̂*, or any power of this, can always be represented symbolically as a linear combination of products of weighted residual sums (factors in Z_n*); these are now functions of the bootstrap resample X*. Thus, the bootstrap version of (3.28) would look identical to (3.28) with Z_n and X replaced by Z_n* and X*. Moreover, this analytic bootstrap quantity represents (3.16) truncated to second order.
The function BootCum[θ̂][i, j] returns the i-th bootstrap cumulant of some estimator θ̂, expressed as an expansion correct to order n^{-j/2}. This calculation represents the kernel upon which the methodology described in the present chapter is based. In particular, BootCum[θ̂][1, 2] represents the analytical expansion of the bootstrap expectation E_{w|x}(θ̂*) correct to second order, that is,
The code BootCum[θ̂][2, 2] represents the analytical expansion of the bootstrap variance Var_{w|x}(θ̂*) correct to second order. For example, the second-order expansions of the bootstrap estimates of the mean and variance of θ̂ are displayed symbolically, respectively, as

    BootCum[θ̂][1, 2] = θ̂ + Cum(ψ(X*, θ̂) ψ'(X*, θ̂)) / (n Cum[ψ'(X, θ̂)]²)

and

    BootCum[θ̂][2, 2] = Cum(ψ(X*, θ̂), ψ(X*, θ̂)) / (n Cum[ψ'(X, θ̂)]²) .
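The second-order variance expansion can be sanity-checked in the simplest case. For ψ(x, θ) = x − θ (so θ̂ is the sample mean), Cum(ψ, ψ)/(n Cum[ψ']²) reduces to Σ_i (x_i − x̄)²/n², which is in fact the exact ideal bootstrap variance of the mean; a conventional Monte Carlo bootstrap fluctuates around this value. The Python sketch below is illustrative only (the data are synthetic):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(50)]
n = len(data)
xbar = sum(data) / n

# Second-order analytic bootstrap variance for the sample mean
# (psi(x, theta) = x - theta): Cum(psi, psi) / (n * Cum[psi']^2)
# reduces to sum_i (x_i - xbar)^2 / n^2, the exact ideal bootstrap variance.
analytic = sum((x - xbar) ** 2 for x in data) / n ** 2

# Conventional Monte Carlo bootstrap for comparison
B = 4000
replicates = []
for _ in range(B):
    resample = [random.choice(data) for _ in range(n)]
    replicates.append(sum(resample) / n)
mc = statistics.pvariance(replicates)

# The Monte Carlo estimate fluctuates around the analytic value
assert abs(mc - analytic) / analytic < 0.15
```

Unlike the Monte Carlo figure, the analytic value carries no resampling variability, which is the point of the symbolic approach.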
The code for the tool BootCum[θ̂][i, j] is available in Appendix A.2 in a file called Bootcum.m. The main steps of the algorithms associated with ThetaHat[i] and BootCum[θ̂][i, j] are given below.
We suppose that θ̂ is the root of the equation f(θ̂) = 0. The solution is expressed as a sequence of increasingly accurate series approximations. A starting value θ_0 must be provided. The algorithm is a Newton iteration as follows:
Algorithm 3.2 (ThetaHat[i]):
• Provide a starting value θ_0.
• In general, θ̂_j = θ̂_{j−1} − f_i(θ̂_{j−1}) / f_i'(θ_0),
where f_i(θ̂) is a series expansion of f(θ̂) correct to order n^{-i/2}.
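The fixed-denominator (quasi-Newton) step above can be illustrated numerically. In the Python sketch below, ψ(x, θ) = tanh(x − θ) is a hypothetical smooth score chosen purely for illustration, and the denominator is held fixed at the slope evaluated at the starting value θ_0, as in the algorithm:

```python
import math

# Hypothetical smooth score for illustration (not from the thesis):
# psi(x, theta) = tanh(x - theta), psi'(x, theta) = -(1 - tanh(x - theta)^2)
def psi(x, t):
    return math.tanh(x - t)

def psi_prime(x, t):
    return -(1.0 - math.tanh(x - t) ** 2)

data = [0.2, -0.5, 1.3, 0.7, -0.1, 0.9]

def f(t):
    # estimating equation: average of psi over the sample
    return sum(psi(x, t) for x in data) / len(data)

theta0 = 0.0  # starting value
# fixed denominator: the slope of f at theta0, as in the quasi-Newton step
slope = sum(psi_prime(x, theta0) for x in data) / len(data)

t = theta0
for _ in range(20):
    t = t - f(t) / slope   # theta_j = theta_{j-1} - f(theta_{j-1}) / f'(theta_0)

assert abs(f(t)) < 1e-12   # t now solves the estimating equation
```

Freezing the denominator at θ_0 trades the quadratic convergence of full Newton for a much simpler symbolic expansion, which is what makes the series inversion tractable.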
Algorithm 3.3 (BootCum[θ̂][i, j]):
• A statistic defined by estimating equations can be approximated to any order by a finite linear combination of products of normalized, centered sums of independent and identically distributed random variables.
• A bootstrap replicate of such a statistic, or any smooth function of this, may be similarly (analytically) expressed as a finite linear combination of products of weighted sums. The weights arise from the multinomial distribution. The weighted sums are decomposed into simple sums and products of sums involving distinct observations (referred to as disjoint sums).
• The bootstrap expectation of the statistic is obtained by considering the expectation of the series expansion with respect to the multinomial weights arising in the preceding step. The moments of the weights are found from the moment generating function of the multinomial distribution.
• Products of sums involving distinct observations are re-expressed as linear combinations of simple sums and ordinary products of sums.
Finally, we consider the procedure that evaluates properties such as (3.26) and (3.27). The function Cum[BootCum[θ̂][i]][j, k] returns a series expansion correct to order n^{-k/2} of the j-th cumulant, over x, of the i-th bootstrap cumulant of the statistic θ̂. In particular,

    E_x[E_{w|x}(θ̂*)] = Cum[BootCum[θ̂][1]][1, 2] + Op(n⁻¹)

    Var_x[Var_{w|x}(θ̂*)] = Cum[BootCum[θ̂][2]][2, 2] + Op(n⁻¹) .    (3.36)
The functional representations given in this section are generic and apply to any bootstrapped M-estimator θ̂*, except those which cannot be expressed in terms of smooth functions of averages. An example of this is the median; this case was treated alternatively in Chapter 2. In the following section we shall illustrate the functions developed here by applying them to the case of the correlation coefficient.
3.4 The Case of the Correlation Coefficient
3.4.1 Introduction
In this section we illustrate the methodology developed in the preceding sections by considering the case of a bivariate correlation coefficient using both real and simulated data: Efron and Tibshirani's (1993) Law School data, relating LSAT and GPA scores of students entering 15 American law schools, and a data set simulated from a bivariate standardized Gaussian distribution.
3.4.2 Computational Tools
The symbolic function Cor[i] expands the bivariate sample correlation coefficient β̂, given by

using a Taylor series in residual sums correct to i-th order. Since the correlation coefficient is invariant with respect to linear transformations, we may simplify the expansion by assuming, without loss of generality, that E(x) = E(y) = 0 and E(x²) = E(y²) = 1. For example, the 2nd-order expansion of β̂ is given analytically as

where, for example, the residual sum Z(n)(X) according to (3.5) is given by
The main steps of the algorithm associated with Cor[i] are as follows:
Algorithm 3.4:
• Assume w.l.o.g. that E(x) = E(y) = 0 and E(x²) = E(y²) = 1.
• Express Cov(x, y) as a series expansion in terms of residual sums of the form (3.39).
• Express Cov(x, x) and Cov(y, y) as series expansions in terms of residual sums.
• The product of the series expansions of Cov(x, x)^{-1/2} and Cov(y, y)^{-1/2} is then multiplied by the expansion for Cov(x, y) to produce terms in the series expansion of β̂ = Cov(x, y) · Cov(x, x)^{-1/2} · Cov(y, y)^{-1/2}.
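The invariance used in the first step of Algorithm 3.4 is easy to check numerically: standardizing the data leaves β̂ unchanged, and for standardized data the correlation reduces to the plain cross-moment. The data below are synthetic and the Python sketch is illustrative only:

```python
import random

random.seed(2)
n = 40
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + random.gauss(0, 1) for a in x]   # synthetic correlated pair

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def corr(u, v):
    # the smooth function of averages in Algorithm 3.4
    return cov(u, v) * cov(u, u) ** -0.5 * cov(v, v) ** -0.5

def standardize(v):
    m, s = mean(v), cov(v, v) ** 0.5
    return [(a - m) / s for a in v]

xs, ys = standardize(x), standardize(y)
beta = corr(x, y)
# invariance under the linear (standardizing) transformation
assert abs(corr(xs, ys) - beta) < 1e-12
# for standardized data the correlation is the plain cross-moment
assert abs(mean([a * b for a, b in zip(xs, ys)]) - beta) < 1e-12
```

This is why the w.l.o.g. standardization costs nothing while greatly shortening the symbolic expansions.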
The code for the tool Cor[i] may be found in Appendix A.2 in a file called Cor1.m. Subsequent code used for the asymptotic bootstrap expansions associated with the correlation, as derived below, can be found in the file Cor2.m.
The tool Cum[Cor][i, j] returns the i-th cumulant, over the (x, y) distribution, of β̂, expressed as an expansion correct to order n^{-j/2}. (In the notation, x, y and X, Y are used interchangeably.) For example, the mean E_{x,y}(β̂) and the variance Var_{x,y}(β̂) expressed as expansions to order n⁻¹ are given, respectively, by
and

where, for example, Cum(X, X, X) represents the third cumulant of X, that is,
if μ = E(X). See, for example, McCullagh (1987) for a discussion of multivariate cumulants.
The function BootCum[Cor][i, j] returns the i-th bootstrap cumulant of the correlation coefficient expressed as an expansion correct to order n^{-j/2}. For example, the 2nd-order analytic expressions for the bootstrap mean

and the bootstrap variance

of the correlation coefficient are given, respectively, as
and
The 4th-order expansion of the bootstrap variance of the correlation (3.44) is much lengthier than (3.46) and possesses 226 terms. We provide the first and final few terms of this expansion as follows:

Note that the 4th-order expansion (3.47) contains residual sums, whereas the 2nd-order expansion (3.46) does not. Moreover, the 2nd- and 3rd-order expansions of (3.44) are identical, the 4th- and 5th-order expansions are identical, and so on.
Our methodology now permits us to consider properties, indeed cumulants, of the bootstrap quantities (3.43) and (3.44). For illustration we consider the mean and variance, over x and y, of (3.43) and (3.44). In practice, any cumulant expanded to an arbitrary order of accuracy may be considered. The function Cum[BootCum[Cor][i]][j, k] computes a series expansion correct to order n^{-k/2} of the j-th cumulant, over x and y, of the i-th bootstrap cumulant of the correlation coefficient. In particular,

    Var_{x,y}[E_{w|x,y}(β̂*)] = Cum[BootCum[Cor][1]][2, 2] + Op(n⁻¹)
CHA PTER 3. BOOTSTRA PPED M-ESTIiVATORS
    Var_{x,y}[Var_{w|x,y}(β̂*)] = Cum[BootCum[Cor][2]][2, 2] + Op(n⁻¹) .    (3.51)
For example, the second-order expansions considered in (3.48)-(3.50) involve fourth-order joint cumulants such as β² Cum(X, X, Y, Y), β Cum(X, Y, Y, Y), and Cum(Y, Y, Y, Y).
We observe that

The corresponding 4th-order and higher-order expansions are not identical, in general, due to the appearance of residual sums.
Table 3.1 below summarizes the six quantities, associated with the correlation coefficient, for which symbolic expressions have been considered in this chapter. These represent the bootstrap mean and bootstrap variance, and the mean and variance, over x and y, of each of these. However, the computational tools developed here can provide symbolic expansions for any bootstrap cumulant of the sample correlation to any desired order, as well as properties of these.
Table 3.1: Bootstrap estimators and related properties for the correlation coefficient for which analytic expansions are derived

                        Bootstrap Mean (β̂*)         Bootstrap Variance (β̂*)
  Estimator             E_{w|x,y}(β̂*)               Var_{w|x,y}(β̂*)
  Mean over x, y        E_{x,y}[E_{w|x,y}(β̂*)]      E_{x,y}[Var_{w|x,y}(β̂*)]
  Variance over x, y    Var_{x,y}[E_{w|x,y}(β̂*)]    Var_{x,y}[Var_{w|x,y}(β̂*)]
The series expansions associated with Table 3.1 for the correlation coefficient constitute symbolic representations, in fact formulas, which may be evaluated for any bivariate data set for which calculation of the correlation, its bootstrap moments, and properties of these is desired. These expansions, such as the ones in (3.45)-(3.50), represent analytic bootstrap expressions which may be evaluated numerically for any pair of distributions F and G associated with the dependent
variables x and y for which cumulants are known. In particular, we will use the empirical (nonparametric) distribution functions F_n and G_n as defined in Chapter 2, Section 2.2. With reference to the law school data, for example, we make the following empirical substitutions in expressions (3.45)-(3.50):
3. All residual sums vanish when moments are replaced by empirical moments. For example, E(X) → x̄ yields Z(n)(X) = 0 in (3.39). For distributions other than the empirical distribution this is not always the case. If, for example, a standard normal parent is assumed, then not all residual sums vanish.
4. Cumulants → empirical cumulants. For example, the third cumulant

is replaced by the third empirical cumulant

where, for example, Sum[X³] = Σ_{i=1}^{n} X_i³.
An advantage of the analytical approach given above is that once a bootstrap expansion to a specified order has been calculated, the derivational work for the particular problem at hand has been done for all time; it remains merely to numerically evaluate the formula for any bivariate data set for which estimation of the correlation coefficient is desired. This contrasts with the conventional approach, which requires that a bootstrap variance estimate for the correlation coefficient based on Monte Carlo simulation be determined each time that a different data set is encountered.
3.4.3 Numerical Results
We will numerically evaluate the second- and fourth-order expansions for the six bootstrap expressions given in Table 3.1 for the two bivariate data sets described below. The methodology is illustrated for real and simulated data sets of different sizes.
Table 3.2 shows a random sample of size n = 15 drawn from a population of N = 82 American law schools. Shown on the left side are two measurements made on the entering classes of 1973 for each school in the sample: LSAT, the average score of the class on a national law test, and GPA, the average undergraduate grade point average achieved by the members of the class. The data were obtained from Efron and Tibshirani (1993, p. 19). The symbolic expansions given previously assume standardized variables, and hence the same data, shown on the right side of Table 3.2, have also been standardized.
Table 3.2: n = 15 Law School Data: original and standardized data
Tables 3.3 and 3.4 display n = 100 simulated bivariate observations generated from a standard Gaussian distribution. The data were also standardized to produce exactly zero sample means and unit sample variances for each of the variables.
Table 3.5 gives the numerical results obtained by calculating the 2nd- and 4th-
Table 3.3: n = 100 simulated bivariate observations from N(0, 1): variable x
order expansions associated with the six moments identified in Table 3.1 for each of the data sets. The classical results are also presented, where β̂ has been computed using (3.37) and E_{x,y}(β̂), Var_{x,y}(β̂) have been computed to order n⁻¹, for example, by (3.40) and (3.41). The asymptotic variance of the correlation is given by
Tables 3.6 and 3.7 document the conventional Monte Carlo bootstrap estimates of the mean and standard error of the sample correlation coefficient (the standard error of each estimate is shown in brackets alongside). These are then compared with the corresponding analytically computed bootstrap mean and standard error of the correlation given in lines 1 and 4 of Table 3.5. The Mathematica and S-Plus code underlying the computation of the results within these tables is given in Appendix A.2.
Note that the calculations exhibited in Table 3.5 were carried out using Mathematica
Table 3.4: n = 100 simulated bivariate observations from N(0, 1): variable y
and those in Tables 3.6 and 3.7 were performed using S-Plus. Therefore, variances from Table 3.5 should be multiplied by a factor

when being compared with analogous variances associated with Table 3.7.
We discuss the results and conclusions based on Tables 3.5, 3.6 and 3.7 in the next section.
3.5 Discussion and Conclusions
A number of points arise from consideration of the results described in the previous section, amongst which we note the following:
1. The symbolic bootstrap methodology developed in this chapter may be applied to a broad class of statistics, namely, to any estimator which may be expressed
Table 3.5: Symbolic bootstrap and classical results for the correlation coefficient using two data sets
as a smooth (possibly non-linear) function of averages. The sample correlation coefficient is an example of such an estimator.
Analytical Bootstrap: E_{w|x,y}(β̂*); E_{x,y}[E_{w|x,y}(β̂*)]; [Var_{w|x,y}(β̂*)]^{1/2}; [Var_{x,y}[E_{w|x,y}(β̂*)]]^{1/2}; E_{x,y}[Var_{w|x,y}(β̂*)]; [Var_{x,y}[Var_{w|x,y}(β̂*)]]^{1/2}
2. The computational tools presented in this chapter with respect to the correlation coefficient could provide symbolic expansions for any bootstrap cumulant to any desired order, as well as properties of these.
3. The numerical evaluations of the 2nd- and 4th-order analytical expansions for E_{w|x,y}(β̂*) and [Var_{w|x,y}(β̂*)]^{1/2} given in Table 3.5 for the sample correlation are confirmed by the analogous Monte Carlo calculations in Tables 3.6 and 3.7 for both data sets.
Classical results (empirical distributions F_n and G_n)
Moreover, the numerical evaluation of the 4th-order symbolic expansion for the bootstrap standard error of the correlation, [Var_{w|x,y}(β̂*)]^{1/2}, for the law school data as given in Table 3.5 matches the conventional bootstrap calculation given in Efron and Tibshirani (1993, p. 50), i.e. SE(β̂*) = 0.132 for B = 3200.
This demonstrates that the symbolic tools developed here for calculating bootstrap quantities comprise a viable alternative methodology to conventional resampling. Any bootstrap moment of a sufficiently smooth estimator, and properties of this, may be derived symbolically and subsequently evaluated numerically to within a specified order of accuracy, without recourse to conventional Monte Carlo resampling. This is not true of conventional bootstrap estimates, which are unavoidably subject to random variability (note the SEs enclosed in brackets in Tables 3.6 and 3.7).

Table 3.5:
Law School Data (n = 15)
  2nd order: 0.771713  0.767052  0.124276  0.124276  0.015445  0.015445
  4th order: 0.770879  0.765075  0.134268  0.131504  0.017372  0.016806  0.158942
Simulated Data (n = 100)
  2nd order: 0.693382  0.69206   0.051133  0.051133  0.001615  0.002615
  4th order: 0.692396  0.692158  0.051514  0.051680  0.002670  0.001623  0.0144849  0.162598  0.0731031

Table 3.6: Monte Carlo Bootstrap Estimates of the Mean of the correlation coefficient for various B
4. For n = 100, the 2nd- and 4th-order analytical calculations are identical up to at least three decimal places for the six quantities in Table 3.5. For n = 15 this is true only up to the first decimal place. We conclude that fewer terms are required in the analytical bootstrap expansions as the size of a data set increases. Indeed, 4th-order accuracy was required to achieve agreement to at least two significant figures between the analytical and conventional Monte Carlo bootstrap standard errors of the correlation for n = 15. However, only 2nd-order accuracy was sufficient to achieve this for the larger data set, n = 100. Comparatively, then, the analytical approach is less computer-intensive for larger data sets. The reverse is true for conventional bootstrap calculations.
Table 3.7: Monte Carlo Bootstrap Estimates of Standard Error of the correlation coefficient for various B
5. The closeness between the analytical results for E_{w|x,y}(β̂*) and [Var_{w|x,y}(β̂*)]^{1/2} and the corresponding Monte Carlo evaluations increases as B increases, for both data sets.
6. The series expansions associated with Table 3.1 for the correlation coefficient constitute symbolic representations, in fact formulas, which may be applied to any bivariate set of data for which calculations of the correlation, its bootstrap moments, and properties of these are desired. An advantage of the analytical approach is that once a particular expansion to a specified order has been calculated, the derivational work for the problem at hand has been done and need not be duplicated by other practitioners. The formula can then be numerically evaluated efficiently for any bivariate data set for which estimation of the correlation coefficient is required. This contrasts with the conventional approach, which requires that a particular bootstrap moment for the correlation coefficient based on Monte Carlo simulation be generated each time that a different data set is encountered.
7. The symbolic bootstrap method can represent bootstrap moments analytically,
which is not possible in the context of conventional bootstrap computation.
Table 3.7 (excerpt): Monte Carlo bootstrap estimates of standard error (se_B)
  Law School Data (n = 15):  0.1241 (0.0084)   0.1194 (0.0054)
  Simulated Data (n = 100):  0.0605 (0.0059)   0.0463 (0.0028)
CHA PTER 3. BOOTSTRAPPED M-ESTIMATORS l F
1.3
Moreover, diagnostic properties such as the bias and variance of bootstrap moments
may be represented analytically. The reason for this is that any bootstrap
moment, when expressed symbolically, represents a random variable for which
various moments (i.e. properties) may in turn be considered. This permits the
user to study the analytical structure of a bootstrap quantity of interest and
identify component statistical quantities such as moments and sums. This can
be potentially insightful.
For example, the bias of the bootstrap variance of the correlation may itself be
represented analytically, and the resulting expression could be evaluated for any
distribution, for example the Gaussian distribution. Using the empirical
distribution functions F_n and G_n for the law school data we calculate the 4th
order bootstrap estimate of the bias of the bootstrap variance of the correlation;
the corresponding calculation for the simulated data (n = 100) yields BIAS =
-0.000047. Similarly, the 4th order bootstrap estimate of the bias of the bootstrap
expectation of the correlation may be calculated for the law school data;
the corresponding calculation for n = 100 is BIAS = -0.000238.
These results suggest that the bootstrap variance of the correlation and the
bootstrap procedure for estimating the mean of the correlation are largely free
of bias. In both cases, the negligible bias that exists decreases with increasing
sample size.
8. We observe that
and
for both the 2nd and 4th order expansions.
9. The methodology discussed in Chapter 2, in which exhaustive enumeration of
bootstrap resamples was used as a basis for the exact computation of bootstrap
estimates and associated properties, could alternatively be used for the case
of the correlation coefficient. For example, exact evaluation of the bootstrap
estimate of standard error of the correlation for the law school data (n = 15)
would involve the enumeration according to (2.21) of C(2n-1, n) = 77,558,760,
or almost 80,000,000, distinct bootstrap resamples. This very large computation
was avoided, and conventional bootstrap sampling with B = 195,000 was used
as a proxy. The results shown at the bottom of Tables 3.6 and 3.7 for the law
school data support the corresponding analytically-based calculations in Table 3.5.
10. The issue of error analysis with respect to asymptotic expansions of bootstrap
moments and their properties is not investigated quantitatively in this thesis.
Considerations such as deciding how many terms to include in a particular
expansion, and how to tell how much error has been induced by truncating the
series, would play a role in such an analysis.

The answer to the first question depends on the sample size and the smoothness
of the estimator in question. Higher order terms involve higher order derivatives,
which can escalate in magnitude. In the case of smooth estimators the
magnitude of higher order derivatives is dampened. Although no useful bound is
known for these, they may be evaluated in any application. Hence, series
expansions involving very smooth estimators and large sample sizes would converge
more rapidly than those involving less smooth estimators and smaller sample
sizes. Consequently, in the former case the error induced by truncating a series
at a given point would be less than the error induced by truncating a series at
the same place in the latter case. For example, we observe in Table 3.5 that for
n = 15 the error between the 2nd and 4th order expansions of the bootstrap
standard error of the correlation coefficient is 0.131804 - 0.124276 = 0.007528,
and this decreases to 0.051680 - 0.051133 = 0.000547 for n = 100. The results
in Tables 3.5-3.7 suggest that a 4th order expansion of the bootstrap standard
error of the correlation evaluated for n = 15 data points is necessary to achieve
the numerical stability that a corresponding 2nd order expansion enjoys
when evaluated for the larger sample size of n = 100. We would draw
similar conclusions for estimators differing in smoothness characteristics.
The analytic bootstrap developed in this chapter calculates ideal bootstrap
estimates and corresponding properties exactly for the case of linear statistics.
Such computations are also exact when performed on truncated expansions of
non-linear smooth statistics. For example, consider the truncated series expansion
of the correlation coefficient given by Cor[i] for some finite i. Then the
bootstrap variance of this, given by BootCum[Cor[i]][2, j], will be exact for
large enough j. More generally, we may conjecture that if the evaluation of the
truncated expansion ThetaHat[i] for a given data set and finite i is very close
to θ̂, then the evaluation of the exact expression BootCum[ThetaHat[i]][2, j]
(for large enough j) would similarly be very close to the evaluation of the series
BootCum[ThetaHat][2, j].
Finally, analytical error analysis involving truncated series expansions of bootstrap
expressions, as referred to above, may be contrasted with the sampling error
induced by the Monte Carlo bootstrap. For example, this could entail taking
blocks of 100 bootstrap replicates of the estimator and studying the empirical
variation among the block estimates. However, such an analysis is limited
because it considers only the central region of the empirical distribution of the
bootstrap estimate. In contrast, truncated analytical bootstrap expansions can
permit expression of subtler higher moments of the distribution, particularly
those providing information about the tails. Moreover, such expansions are
true for any distribution and not just the empirical distribution.
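The block-based assessment of Monte Carlo sampling error described above can be sketched generically as follows. This is an illustrative Python sketch under assumed block counts, not code from the thesis.

```python
import math
import random

def bootstrap_replicates(data, stat, B, rng):
    # Draw B bootstrap replicates of the statistic.
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]

def block_variation(data, stat, n_blocks=20, block_size=100, seed=1):
    # Split the Monte Carlo effort into blocks of bootstrap replicates,
    # form the bootstrap estimate within each block, and report the
    # spread among block estimates as a gauge of sampling error.
    rng = random.Random(seed)
    block_means = []
    for _ in range(n_blocks):
        reps = bootstrap_replicates(data, stat, block_size, rng)
        block_means.append(sum(reps) / block_size)
    grand = sum(block_means) / n_blocks
    spread = math.sqrt(sum((m - grand) ** 2 for m in block_means)
                       / (n_blocks - 1))
    return grand, spread
```

As noted in the text, the spread among block means reflects only the variability of the central portion of the bootstrap distribution; it says nothing about the tails.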
We will further illustrate the methodology developed in the present chapter by
next considering analytic bootstrap expansions associated with the bootstrap-t
interval, BCa confidence intervals, and the Behrens-Fisher statistic.
Chapter 4
Double-Bootstrapping and the
Behrens-Fisher statistic
4.1 Introduction
The methodology developed in the preceding chapter will presently be further
illustrated by considering analytic expansions associated with the bootstrap-t interval,
BCa confidence intervals, and the Behrens-Fisher statistic.

Bootstrap-t intervals are described in Efron (1981) and the bias-corrected, accelerated
interval (BCa) is suggested in Efron (1987). The two-sample problem with
unequal variances has a long history; see, for example, Cox and Hinkley (1974, page
143) and Robinson (1982).
4.2 Symbolic Computation of the Bootstrap-t
Confidence Interval
4.2.1 Background
We present the conventional bootstrap-t procedure in this section. Our description
follows Efron and Tibshirani (1993). The notation follows that of Section 2.2 .
The standard large-sample central 1 - 2α confidence interval

(θ̂ - z^(1-α)·ŝe, θ̂ - z^(α)·ŝe)    (4.1)

is derived from the assumption that

Z = (θ̂ - θ)/ŝe ~ N(0, 1) ,    (4.2)

where z^(α) refers to the αth percentile of the standard normal distribution and ŝe is
an estimate of the standard deviation of θ̂. This is valid as n → ∞, but is only
an approximation for finite samples. In 1908, for the case θ̂ = x̄, Gosset derived the
better approximation

Z = (θ̂ - θ)/ŝe ~ t_{n-1} ,    (4.3)

where t_{n-1} represents the Student's t distribution on n - 1 degrees of freedom. This
approximation yields the interval

(θ̂ - t^(1-α)_{n-1}·ŝe, θ̂ - t^(α)_{n-1}·ŝe) ,    (4.4)

with t^(α)_{n-1} denoting the αth percentile of the t distribution on n - 1 degrees of freedom.
The use of the t distribution doesn't adjust the confidence interval to account for
skewness in the underlying population or other errors that can result when θ̂ is not
the sample mean. The bootstrap-t interval is a procedure which does adjust for these
errors. We describe it next.
From an observed random sample x = (x_1, ..., x_n) we generate B bootstrap
samples x*_1, x*_2, ..., x*_B, and for each we compute the quantity

t*(b) = (θ̂*(b) - θ̂) / ŝe*(b) ,    (4.5)

where θ̂*(b) is the value of θ̂ for the bootstrap sample x*_b and ŝe*(b) is the estimated
standard error of θ̂* for the same bootstrap sample x*_b. The αth percentile of the t*(b)
is estimated by the value t̂^(α) such that

#{t*(b) ≤ t̂^(α)}/B = α .    (4.6)

The bootstrap-t confidence interval is thus given by

(θ̂ - t̂^(1-α)·ŝe, θ̂ - t̂^(α)·ŝe) .    (4.7)
The bootstrap-t approach thus estimates the distribution of (4.3) directly from the
data. A bootstrap table is built by generating B bootstrap samples, and then
computing the bootstrap version of (4.3) for each; the bootstrap table consists of the
percentiles of these B values. Note that θ̂ and ŝe represent estimates using the original
sample; they are not bootstrap estimates. However, ŝe could be replaced by the
bootstrap estimate of standard error of θ̂, i.e. ŝe_B. We will use this option when we
consider the analytical bootstrap-t approach in the next section.
There are a number of advantages that the bootstrap-t procedure enjoys over the
standard methods. First, accurate intervals can be obtained without having to make
normal theory assumptions. Second, bootstrap-t intervals often enjoy better coverage.
Third, these intervals offer second order accuracy versus the first order accuracy
associated with standard methods. The bootstrap-t approach is particularly applicable
to location statistics like the sample mean or a sample percentile. However, at
least in its simple form, it cannot be trusted for more general problems, like setting
a confidence interval for a correlation coefficient. Moreover, it may perform erratically
in small-sample, nonparametric settings. The latter drawback can, however, be
alleviated by the use of appropriate transformations; see Tibshirani (1988) for the
implementation of the variance-stabilized bootstrap-t interval.

Finally, there is a computational problem associated with the bootstrap-t approach
which can be circumvented by employing the methodology developed in this
thesis. In the denominator of the statistic (4.5) we require ŝe*(b), the estimated
standard error of θ̂* for the bootstrap sample x*_b. The difficulty arises when θ̂ is
a complicated statistic for which there is no simple standard error formula. We
would then need to compute a bootstrap estimate of standard error for each bootstrap
sample, so that two (nested) levels of bootstrap sampling are required. The quantity
(4.5) now becomes

t*(b) = (θ̂*(b) - θ̂) / ŝe**(b) ,    (4.8)

where the double star notation in the denominator is used to signify that two levels of
bootstrapping are required. It is considered sufficient to use B = 25 for the estimation
of each ŝe**(b), while B = 1000 is needed for the computation of percentiles. Hence
the overall number of bootstrap samples needed is perhaps 25 × 1000 = 25,000, a
formidable number if θ̂ is costly to compute.
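The nested sampling scheme just described can be sketched in Python as follows. This is an illustrative stand-in for the conventional procedure, not the thesis's code, and in practice smaller B values than B = 1000 outer and B = 25 inner may be used for speed.

```python
import math
import random

def sd(v):
    # Sample standard deviation.
    m = sum(v) / len(v)
    return math.sqrt(sum((u - m) ** 2 for u in v) / (len(v) - 1))

def resample(x, rng):
    # One bootstrap resample: n draws with replacement.
    n = len(x)
    return [x[rng.randrange(n)] for _ in range(n)]

def bootstrap_t_interval(x, stat, alpha=0.05, B_outer=400, B_inner=25, seed=0):
    # Conventional nested bootstrap-t: each outer resample yields
    # t*(b) = (theta*(b) - theta_hat) / se**(b), where se**(b) is itself
    # estimated by an inner bootstrap of B_inner resamples.
    rng = random.Random(seed)
    theta = stat(x)
    reps, tstats = [], []
    for _ in range(B_outer):
        xb = resample(x, rng)
        tb = stat(xb)
        reps.append(tb)
        se_inner = sd([stat(resample(xb, rng)) for _ in range(B_inner)])
        if se_inner > 0:
            tstats.append((tb - theta) / se_inner)
    se_hat = sd(reps)  # bootstrap SE of theta_hat from the outer level
    tstats.sort()
    t_lo = tstats[int(alpha * len(tstats))]            # percentile of t*
    t_hi = tstats[int((1 - alpha) * len(tstats)) - 1]  # upper percentile
    return theta - t_hi * se_hat, theta - t_lo * se_hat
```

The cost is B_outer × B_inner evaluations of the statistic, which is the 25,000-sample burden the analytical approach is designed to avoid.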
4.2.2 Symbolic Computation of the Bootstrap-t Interval for
the Correlation Coefficient
The analytic bootstrap-t approach constructs confidence intervals in a manner
similar to the way standard normal and t tables are used to establish (4.4) from (4.3). There are
two key differences. First, the conventional estimate of standard error ŝe used in the
denominator of (4.3) is replaced by its bootstrap analogue

ŝe_B = [Var_{F̂}(θ̂*)]^{1/2} .    (4.9)

Second, both θ̂ and (4.9) may each be expressed in series form (to some order), as
was illustrated in the previous chapter, at which point the user may choose to compute
these numerically for given data. This gives rise to two advantages not enjoyed by
the conventional bootstrap-t procedure. Firstly, as will be shown below, maintaining
the quantities θ̂ and (4.9) in analytical form allows the subsequent computation
of certain bootstrap moments which may be used to further standardize the analytical
t-statistic described here. Moreover, properties of these bootstrap moments
with respect to the original underlying distribution may also be assessed, although
we shall not consider these. Secondly, calculation of an analytical bootstrap-t interval
precludes the need for Monte Carlo resampling at any level. For example, by
circumventing the requirement for building a distribution of B bootstrap percentiles
of (4.3), the analytic bootstrap-t method avoids the necessity of undertaking
double-bootstrapping in order to calculate quantities such as ŝe**(b). We illustrate
the approach for the sample correlation coefficient.
Replacing θ̂ = ρ̂ in (4.3), we modify

t = (ρ̂ - ρ)/ŝe    (4.10)

by considering the t-like statistic

t = (ρ̂ - ρ) / [Var_{F_n,G_n}(ρ̂*)]^{1/2} ,    (4.11)

which can be re-written as the difference

t = ρ̂/[Var_{F_n,G_n}(ρ̂*)]^{1/2} - ρ/[Var_{F_n,G_n}(ρ̂*)]^{1/2} .    (4.12)

Observe that (4.12) may be expressed analytically as a series expansion to any order
desired because the component factors ρ̂ and Var_{F_n,G_n}(ρ̂*) may each be similarly
expressed (i.e. algebraically) according to Section 3.4. In the second term of (4.12) ρ is
kept constant; when bootstrap moments of t are computed, ρ is then replaced by ρ̂, at
which point it is expressed analytically to some order.
The bootstrap expectation

E_{F_n,G_n}(t)    (4.13)

and the bootstrap variance

Var_{F_n,G_n}(t)    (4.14)

may each be represented analytically to any desired order. Moreover, properties of
(4.13) and (4.14) such as E_x[·] and Var_x[·] may also be assessed, although for brevity's
sake we shall not consider these here. For example, the first order expansion of the
bootstrap mean (4.13) may be derived immediately using the tool BootCum[t][i, j].
For the law school data we observe that BootCum[t][1,1] = 0.29595. The Mathematica
code underlying the tool BootCum[t][i, j] may be found in Appendix A.3
in a file called t.m.
It is well known that if we construct a confidence interval for Fisher's variance-stabilizing
transformation of the correlation coefficient φ = tanh^{-1} ρ, given by

φ = (1/2) log[(1 + ρ)/(1 - ρ)] ,    (4.15)

and then transform the endpoints back with the inverse transformation

ρ = tanh(φ) = (e^{2φ} - 1)/(e^{2φ} + 1) ,    (4.16)

then a more accurate interval is obtained. We will use φ in the context of comparing
three pairs of confidence intervals below.
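Computationally, the transformation and its inverse amount to the following Python sketch. Note the 1/sqrt(n - 3) standard error used here is the classical normal-theory value for Fisher's z, an assumption of this sketch rather than the t-percentile constructions compared below.

```python
import math

def fisher_interval(rho_hat, n, z=1.645):
    # 90% interval formed on the phi = arctanh(rho) scale, with the
    # endpoints mapped back to the rho scale via tanh.
    phi = math.atanh(rho_hat)        # forward transformation
    se = 1.0 / math.sqrt(n - 3)      # normal-theory SE of phi (assumption)
    lo, hi = phi - z * se, phi + z * se
    return math.tanh(lo), math.tanh(hi)  # inverse transformation
```

For the law school values ρ̂ = 0.776 and n = 15 this sketch gives roughly (.51, .91); note how the back-transformed interval is asymmetric about ρ̂ and respects the range (-1, 1).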
First, the standard 1 - 2α t-interval for the correlation coefficient follows (4.4)
and is given by (4.17). This can be improved by using φ̂ as in (4.18) and transforming
back to an interval for ρ using (4.16). We assume that the corresponding
percentiles t^(α)_{n-1} are different in (4.17) and (4.18). Second, the bootstrap-t
interval for ρ, given by (4.7) as (4.19), may also be improved by considering (4.20).
Again, we assume that the corresponding values of t̂^(α) are different in each case.
Third, the analytic bootstrap-t interval is based on (4.11) and is given by (4.21).
Finally, a standardized analytical bootstrap-t interval may be formulated by
standardizing (4.11) using (4.13) and (4.14); upon rearranging, where t is given by
(4.11), we obtain the interval (4.22).
For example, to calculate a 90% confidence interval for ρ for the law school data using
(4.22) we set

1. E_{F_n,G_n}(t) = 0.29595 (the first order result seen above), and

2. [Var_{F_n,G_n}(ρ̂*)]^{1/2} = 0.132, from Chapter 3.

Table 4.1 compares the six different 90% confidence intervals (4.17) to (4.22) for
the correlation coefficient for the n = 15 law school data.
Table 4.1: Comparison of 90% confidence intervals for ρ using analytic bootstrap-t, conventional bootstrap-t, and standard methods for the law school data (n = 15)

1. standard t                       (.51, .98)
2. standard t_φ                     (.51, .91)
3. boot-t                           (-.03, .90)
4. boot-t_φ                         (.44, .91)
5. analytical¹ boot-t               (.54, 1.01)
6. analytical¹ standardized boot-t  (.58, .98)

Intervals #3 and #4 were obtained by Efron and Tibshirani (1993, pp. 162-163)
using conventional double-bootstrapping. They used B = 25 bootstrap samples to
calculate the bootstrap estimate of standard error ŝe**(b) for each of 1000 outer
bootstrap samples, so that a total of 25,000 bootstrap samples were used. The
superscript "1" in intervals #5 and #6 signifies that all analytic component terms were
computed to first order, except for the bootstrap variance of the correlation, which was
calculated to fourth order as in Chapter 3.
4.2.3 Discussion

We have shown that the two nested levels of bootstrap sampling associated with
the conventional bootstrap-t interval can be avoided by representing the bootstrap-t
statistic in terms of a symbolic expansion. We compared confidence intervals for
the correlation using standard and conventional bootstrap-t methods with the same
intervals evaluated analytically.

It is observed that the conventional bootstrap-t interval #4 for the correlation
based on Fisher's z transformation has the same length as the analytical bootstrap-t
interval #5 computed without use of Fisher's transformation. However, the latter
interval has partially strayed outside the allowable range for the correlation. Moreover,
the analytical standardized bootstrap-t interval #6 is shorter than the conventional
interval #4.

This suggests that the analytical bootstrap-t approach provides intervals which
not only circumvent computer-intensive double-bootstrapping calculations but may
also be more accurate.
4.3 Symbolic Computation of BCa Confidence Intervals

4.3.1 Introduction

In this section we explore the development of computer algebraic tools which permit
the analytic expression of an integral component associated with BCa confidence
intervals. We first provide a general overview of the BCa confidence interval and its
precursor, the percentile method. We follow Efron and Tibshirani (1993) and Efron
(1987).
4.3.2 Background
A number of problems connected with conventional bootstrap-t confidence intervals
were outlined in Section 4.2.1. Instead of focusing on a statistic of the form t =
(θ̂ - θ)/ŝe, a different approach for remedying such problems works directly with the
bootstrap distribution of θ̂*. We describe first the percentile method, followed by its
improved version, the bias-corrected and accelerated (BCa) interval.
Suppose the empirical distribution F̂ generates the bootstrap data set x* by random
sampling, giving the bootstrap replication θ̂* = s(x*). This is repeated an
infinite number of times. Let Ĝ be the cumulative bootstrap distribution function of
θ̂*, that is,

Ĝ(c) = Pr{θ̂* ≤ c} .

The 1 - 2α percentile interval is defined by the α and 1 - α percentiles of Ĝ as

(Ĝ^{-1}(α), Ĝ^{-1}(1 - α)) ,    (4.23)

since by definition Ĝ^{-1}(α) = θ̂*^(α), the 100·αth percentile of the bootstrap
distribution. Expression (4.23) refers to the ideal bootstrap situation in which the number
of bootstrap replications is infinite. In practice we use some finite number B of
replications, in which case the approximate 1 - 2α percentile interval is

(θ̂*_B^(α), θ̂*_B^(1-α)) .    (4.24)

The lower confidence limit θ̂*_B^(α) is the 100·αth empirical percentile of the θ̂*(b) values,
that is, the B·αth value in the ordered list of the B replications of θ̂*.

The percentile interval for a parameter θ is more accurate than a standard normal
interval when the bootstrap distribution of θ̂* is non-normal. Both methods agree
well when the standard interval is constructed on an appropriate transformation φ̂
of θ̂ and then mapped back to the θ scale. The advantage of the percentile method
is that it automatically makes this transformation; unlike the standard interval, we
don't need to know the correct transformation. All we need assume is that such a
transformation exists.
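The finite-B percentile interval described above reduces to sorting the replicates and indexing. The following is a generic Python sketch, not code from the thesis:

```python
import random

def percentile_interval(x, stat, alpha=0.05, B=1000, seed=0):
    # Approximate 1 - 2*alpha percentile interval: sort the B bootstrap
    # replications and read off the B*alpha-th and B*(1-alpha)-th values.
    rng = random.Random(seed)
    n = len(x)
    reps = sorted(stat([x[rng.randrange(n)] for _ in range(n)])
                  for _ in range(B))
    return reps[int(B * alpha)], reps[int(B * (1 - alpha)) - 1]
```

No transformation of the statistic appears anywhere in the procedure, which is precisely the transformation-respecting property discussed in the text.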
The percentile method is not the last word in bootstrap confidence intervals. There
are other ways the standard intervals can fail besides non-normality. For example, θ̂
might be a biased normal estimate, as in (4.25), in which case no transformation can
fix things up. The BCa confidence interval represents an extension of the percentile
method that automatically handles both bias and transformations. Moreover, it
allows the standard error in (4.25) to vary with θ, rather than being forced to stay
constant. We discuss the BCa confidence procedure next.
The BCa interval endpoints are given by percentiles of the bootstrap distribution,
but not necessarily the same ones as in (4.23). The percentiles used depend on two
numbers â and ẑ_0, called the acceleration and the bias-correction, respectively. The 1 - 2α
BCa interval is given by

(θ̂*_B^(α_1), θ̂*_B^(α_2)) ,    (4.26)

where

α_1 = Φ( ẑ_0 + (ẑ_0 + z^(α)) / (1 - â(ẑ_0 + z^(α))) )

and

α_2 = Φ( ẑ_0 + (ẑ_0 + z^(1-α)) / (1 - â(ẑ_0 + z^(1-α))) ) .

Here Φ(·) is the standard normal cumulative distribution function and z^(α) is the
100·αth percentile point of a standard normal distribution. Notice that if â and ẑ_0
equal zero, then α_1 = α and α_2 = 1 - α, so that the BCa and percentile intervals
coincide.
The BCa interval given above is based on the following model. Recall that standard
confidence intervals are based on the assumption

(θ̂ - θ)/ŝe ~ N(0, 1) .    (4.28)

The BCa interval assumes, more generally, that there exists a monotone transformation
g and constants z_0 and a such that φ̂ = g(θ̂) and φ = g(θ) satisfy

φ̂ ~ N(φ - z_0·se_φ, se_φ²)    (4.30)

with

se_φ = se_{φ_0}[1 + a(φ - φ_0)]    (4.31)

for any fixed value of φ_0. The BCa interval is attractive because one does not need
to know the transformation g(·); the only three quantities needed to construct the
interval are the bootstrap distribution Ĝ and the constants ẑ_0 and â.

The estimated value of the bias-correction ẑ_0 is obtained directly from the proportion
of bootstrap replications less than the original estimate θ̂,

ẑ_0 = Φ^{-1}( #{θ̂*(b) < θ̂} / B ) .    (4.32)

Roughly speaking, the quantity ẑ_0 measures the median bias of θ̂*, that is, the
discrepancy between the median of θ̂* and θ̂, in normal units.
The quantity â is called the acceleration because it measures the rate of change
of the standard error of θ̂ with respect to the true parameter value θ, measured
on a normalized scale. The normal approximation θ̂ ~ N(θ, se²) assumes that the
standard error of θ̂ is the same for all θ. This is often unrealistic, and the acceleration
adjustment â corrects for it. For the multinomial distribution, which corresponds
to the nonparametric bootstrap, â may be computed in terms of the jackknife values
of a statistic θ̂ = s(x). Let x_(i) represent the original sample with the ith point x_i
deleted, let θ̂_(i) = s(x_(i)), and define

θ̂_(·) = Σ_i θ̂_(i) / n .    (4.33)

A simple expression for the acceleration is given by

â = Σ_i (θ̂_(·) - θ̂_(i))³ / { 6 [ Σ_i (θ̂_(·) - θ̂_(i))² ]^{3/2} } .    (4.34)
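For reference, the Monte Carlo route through the bias-correction and jackknife acceleration just described, the route the symbolic method of the next section avoids, can be sketched in Python as follows (a generic illustration, not the thesis's code):

```python
import random
from statistics import NormalDist

def bca_interval(x, stat, alpha=0.05, B=1000, seed=0):
    # BCa interval: bias-correction z0-hat from the proportion of
    # replicates below theta-hat, acceleration a-hat from the jackknife
    # formula, adjusted levels alpha1 and alpha2, then endpoints read
    # from the sorted replicates.
    rng = random.Random(seed)
    n = len(x)
    theta = stat(x)
    reps = sorted(stat([x[rng.randrange(n)] for _ in range(n)])
                  for _ in range(B))
    nd = NormalDist()
    z0 = nd.inv_cdf(sum(r < theta for r in reps) / B)   # bias-correction
    jack = [stat(x[:i] + x[i + 1:]) for i in range(n)]  # jackknife values
    jbar = sum(jack) / n
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den                                       # acceleration
    def level(z):
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    a1 = level(nd.inv_cdf(alpha))
    a2 = level(nd.inv_cdf(1 - alpha))
    return reps[int(B * a1)], reps[min(int(B * a2), B - 1)]
```

When z0 and a are both zero the adjusted levels collapse to alpha and 1 - alpha and the interval coincides with the percentile interval, mirroring the remark above.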
In summary, the percentile method generalizes the standard normal approximation
(4.28) by allowing a transformation g(·) of θ̂ and θ. The BCa method adds the further
adjustments ẑ_0 and â. The bootstrap-t approach is second-order accurate but not
transformation-respecting. The percentile method is transformation-respecting but
not second-order accurate. The standard method is neither, while the BCa method
is both.

In the next section we show how recourse to resampling methods such as the
jackknife for estimating the acceleration constant â can be avoided by using the
symbolic bootstrap methodology developed in the previous chapter.
4.3.3 Symbolic Computation of the Acceleration Constant â

We show how the computer algebraic tools developed in Chapter 3 may be used for
the symbolic representation of the acceleration adjustment used in BCa confidence
intervals. As will be shown below, the key reason that the acceleration constant is
conducive to symbolic representation (whereas the bias-correction is not) is that
it can be represented as a smooth function of averages. We provide a setting for this
representation by justifying the expression for se_φ given by (4.31), as follows.
Indeed, the first-order Taylor series expansion of se_φ² about φ = φ_0 is given by
(4.36), which may be re-expressed as (4.37). Taking the square root of each side of
(4.37), we have (4.38). Re-introducing the notation se_φ = σ_φ/√n, we observe that
(4.38) and (4.31) are the same, as desired, with a as given by (4.39).

By interpreting (4.36) as a simple linear regression model (4.40), with z = (φ - φ_0)
as the independent variable and the slope represented by β_1, (4.39) becomes (4.41).
Disregarding the divisor n in (4.41) for the moment, it is well known that the slope
β_1 of a generic regression model (4.40) can always be expressed as (4.42), which we
re-formulate in terms of an expectation E and a variance Var taken with respect to
any distribution F.
The expression (4.41) now becomes the basis for the symbolic computation of
(4.42). We show this for the case of the correlation coefficient, and then, using the law
school data, we compare our numerical evaluation of (4.42) with that of Efron (1987).
Indeed, setting z as in (4.43), we observe that (4.44) becomes (4.45).
Note that the subscript x in (4.45) refers to a bivariate sample of size n, whereas the
notation z as in (4.40) denotes a single independent random variable.
We now use the same tools developed in Chapter 3 to formulate a new tool called
Boot[Acc][i] for expressing (4.45) to any desired order. The Mathematica code
underlying the tool Boot[Acc][i] may be found in Appendix A.3 in a file called
bca.m. For example, the first-order symbolic bootstrap expansion of (4.45) is given
by (4.46). The analytic bootstrap expression (4.46) may be evaluated numerically for any pair of
distributions F and G associated with the variables x and y for which the
cumulants are known. In particular, we will use the empirical distribution functions
F_n and G_n as follows.

Using the law school data with ρ̂ = 0.776, (4.46) evaluates to

Boot[Acc][1] = -0.2772 ,

and upon factoring in the divisor n = 15 we find

â_1 = -0.01545 .
Using the fourth-order result from Chapter 3, substitution in (4.42) yields

â = -0.0700 .

Using the jackknife approximation (4.34), Efron (1987, p. 178) obtained a comparable value.
Finally, we can perform an independent check on the results obtained above by
expressing Var(ρ̂) as a first-order Taylor expansion about ρ = ρ_0, with
ρ̂ = ρ_0 = 0.776. Dividing by n = 15 we get -0.082323, which is close to our
value of â = -0.0700.
4.3.4 Discussion

We have shown that the acceleration adjustment associated with BCa confidence
intervals may be represented exactly as a symbolic bootstrap expansion to any order
desired. For illustration this was applied to the case of the correlation coefficient.
The resulting analytical expression is a formula which may be used repeatedly in any
context for which a BCa confidence interval for the correlation is desired. Numerical
evaluation of a first-order analytical bootstrap expansion of the acceleration yielded a
result consistent with that produced by jackknife resampling. Higher-order evaluation
would render more accurate results. Properties of the symbolic bootstrap
representation of the acceleration could be further assessed, although for brevity's sake these
are not considered here.
4.4 Analytic Bootstrap Computation for the
Behrens-Fisher statistic
4.4.1 Introduction
We develop here the analytical representation of bootstrap moments related to the
Behrens-Fisher statistic and their properties. Numerical evaluation of the symbolic
expressions is illustrated using data from a case study in medicine. We compare some
of these results with bootstrap estimates computed using Monte Carlo resampling.
4.4.2 Background

Suppose that (x_1, ..., x_{n_1}) and (y_1, ..., y_{n_2}) comprise independent samples from Gaussian
distributions N(μ_1, σ_1²) and N(μ_2, σ_2²), respectively. Since σ_1² and σ_2² are typically
unknown, we would consider estimating them by s_1² = Σ_{i=1}^{n_1}(x_i - x̄)²/(n_1 - 1) and
s_2² = Σ_{i=1}^{n_2}(y_i - ȳ)²/(n_2 - 1). The hypothesis

H_0 : μ_1 = μ_2

may be assessed using the statistic

t = (x̄ - ȳ) / √(s_1²/n_1 + s_2²/n_2) ,    (4.47)

where x̄ = Σ_{i=1}^{n_1} x_i/n_1 and ȳ = Σ_{i=1}^{n_2} y_i/n_2. (The test statistic (4.47) would have a
Student's t distribution under the null hypothesis if the two Gaussian populations
had the same variance σ_1² = σ_2² = σ² and a pooled estimate of σ² was used.) A
number of approximate solutions have been proposed in the literature. This is known
as the Behrens-Fisher problem. For illustrative purposes it suffices to consider the
balanced case only, namely when n_1 = n_2 = n, and so the test statistic becomes

t = (x̄ - ȳ) / √((s_1² + s_2²)/n) ,    (4.48)

say.

In the next section we shall calculate bootstrap estimates of the mean and variance
of the Behrens-Fisher statistic (4.48) and their properties by using the analytical
bootstrap. The symbolic calculations are facilitated by considering the modified
Behrens-Fisher statistic (4.49).
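Before the symbolic treatment, the balanced statistic (4.48) and its conventional Monte Carlo bootstrap moments can be computed directly. The following Python sketch is a generic illustration with synthetic inputs, not the survival data of Section 4.4.4:

```python
import math
import random

def bf_stat(x, y):
    # Balanced Behrens-Fisher statistic (4.48):
    # t = (xbar - ybar) / sqrt((s1^2 + s2^2) / n), with n = len(x) = len(y).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s1 = sum((v - xbar) ** 2 for v in x) / (n - 1)
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return (xbar - ybar) / math.sqrt((s1 + s2) / n)

def bootstrap_moments(x, y, B=2000, seed=0):
    # Monte Carlo bootstrap mean and variance of the statistic:
    # resample each group independently, with replacement.
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(B):
        xs = [x[rng.randrange(n)] for _ in range(n)]
        ys = [y[rng.randrange(n)] for _ in range(n)]
        reps.append(bf_stat(xs, ys))
    m = sum(reps) / B
    v = sum((r - m) ** 2 for r in reps) / (B - 1)
    return m, v
```

Resampling the two groups independently reflects the assumption that x and y come from independent populations with possibly unequal variances.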
4.4.3 Analytic Bootstrap Computation

As was done for the case of the correlation coefficient, Table 4.2 below lists the six
bootstrap quantities for which symbolic expressions associated with the modified
Behrens-Fisher statistic (4.49) are presented in this section. These represent the
bootstrap mean and bootstrap variance, and the mean and variance (over x and y)
of each of these. Moreover, if required, the computational tools developed here
can provide symbolic expansions for any bootstrap cumulant of the Behrens-Fisher
statistic to any desired order, as well as properties of these. The tools presented below
were coded in Mathematica and are based on the methodology developed in
Chapter 3.
Table 4.2: Bootstrap estimators and related properties for the Behrens-Fisher statistic for which analytic expansions are derived

| Bootstrap Mean (bf*) | Bootstrap Variance (bf*) |

The symbolic function BF[i] expands the modified Behrens-Fisher statistic (4.49)
using a Taylor series in residual sums correct to i-th order. We assume H_0 : μ_1 =
μ_2 = 0 and that the variables x and y are independent. For example, the 2nd order
expansion of (4.49) is given by (4.50).

The tool Cum[BF][i, j] returns the i-th cumulant (over x and y) of the modified
Behrens-Fisher statistic expressed as an expansion correct to order n^{-j/2}. For
example, the mean E_{x,y}(bf) and the variance Var_{x,y}(bf), expressed as expansions to
order n^{-1}, are given respectively by (4.51) and (4.52).
The function BootCum[BF][i, j] returns the i-th bootstrap cumulant of the
modified Behrens-Fisher statistic expressed as an expansion correct to order n^{-j/2}.
The Mathematica code underlying this tool and BF[i] may be found in Appendix
A.3. For example, the 4th order analytic expressions for the bootstrap expectation
E_{F_n,G_n}(bf*) (4.53) and the bootstrap variance Var_{F_n,G_n}(bf*) (4.54)
of the modified Behrens-Fisher statistic are given respectively by (4.55) and (4.56).
We now consider the procedure that evaluates properties, indeed various cumulants
(over x and y), of the bootstrap mean (4.53) and the bootstrap variance (4.54).
The function Cum[BootCum[BF][i]][j, k] returns a series expansion correct to
order n^{-k/2} of the j-th cumulant (over x and y) of the i-th bootstrap cumulant of
the modified Behrens-Fisher statistic. In particular,

E_{x,y}[E_{F_n,G_n}(bf*)] = Cum[BootCum[BF][1]][1,4] + O_p(n^{-5/2})    (4.57)
Var_{x,y}[E_{F_n,G_n}(bf*)] = Cum[BootCum[BF][1]][2,4] + O_p(n^{-5/2})    (4.58)
E_{x,y}[Var_{F_n,G_n}(bf*)] = Cum[BootCum[BF][2]][1,4] + O_p(n^{-5/2})    (4.59)
Var_{x,y}[Var_{F_n,G_n}(bf*)] = Cum[BootCum[BF][2]][2,4] + O_p(n^{-5/2}) .    (4.60)

In practice, higher moments of other bootstrap cumulants may be expanded to any
order of accuracy with these tools.
For example, the fourth order expansions considered in (4.57)-(4.59) include

Cum[BootCum[BF][1]][1,4] = 5 (Cum(X,X,X) - Cum(Y,Y,Y)) / [ 2 n² (Cum(X,X) + Cum(Y,Y))^{3/2} ] .    (4.61)
Since the fourth order expansion considered in (4.60) is lengthy, we provide below
the corresponding second order expression:

Cum[BootCum[BF][2]][2,2] = [ 2 Cum(X,X)² + 2 Cum(Y,Y)² + Cum(X,X,X,X) + Cum(Y,Y,Y,Y) ] / [ n³ (Cum(X,X) + Cum(Y,Y))² ] .    (4.64)
The six expressions (4.55), (4.56), and (4.61) - (4.64) are examples of series expansions of the six bootstrap quantities associated with Table 4.2. They constitute symbolic representations, in fact formulas, which may subsequently be applied to any pair of independent data sets for which numerical computations of the Behrens-Fisher statistic, its bootstrap moments, or properties of these are desired. Moreover, (4.55), (4.56), and (4.61) - (4.64) represent analytic bootstrap expressions which may be evaluated numerically for any pair of distributions F and G associated with the independent variables x and y for which cumulants are known. In particular, we will use the empirical (nonparametric) distribution functions F_n and G_n.
We next illustrate the numerical evaluation of these expansions within the context of a case study from medicine.
4.4.4 Numerical Application
Table 4.3 shows the original survival times of 88 patients with gastric cancer equally divided into two independent treatment groups. The data come from Gamerman (1991). Group 1 received chemotherapy and radiation while Group 2 received only chemotherapy. Interest lies in comparing the average survival times of the two groups. Without loss of generality, prior to evaluation each of the two data sets was standardized to have zero mean. It is assumed that both samples come from independent populations possessing unequal variances.
Table 4.3: Survival times (days) of gastric cancer patients (n = 44)
The 2nd and 4th order expansions of the six bootstrap quantities associated with Table 4.2 could be evaluated numerically for any underlying distributions for which cumulants exist. Here we evaluate the analytic expansions at the empirical distribution functions F_n and G_n because these are considered close to the true underlying F and G. Hence, the data in Table 4.3 are used to numerically evaluate the symbolic bootstrap expressions. The results were then scaled by appropriate powers of sqrt(n) so that they are associated with (4.48) instead of (4.49). The final results are shown in Table 4.4. For comparison purposes the bootstrap mean and bootstrap standard error of the Behrens-Fisher statistic have also been estimated using conventional Monte Carlo resampling for various values of B. The algorithm upon which these calculations are based is given in Efron and Tibshirani (1993, p. 4). The simulation results are presented in Table 4.5.

Table 4.4: Symbolic bootstrap and classical results for the Behrens-Fisher statistic using the survival data, n = 44

    Empirical distributions F_n and G_n       2nd order    4th order    Classical
    [Var_{x,y}(Var*_{x,y}(bf*))]^{1/2}        0.992728     1.44237

4.4.5 Discussion
We observe that the numerical evaluations of the 2nd and 4th order symbolic expressions for the bootstrap mean and the bootstrap standard error of the Behrens-Fisher statistic, given by E*_{x,y}(bf*) and [Var*_{x,y}(bf*)]^{1/2} in Table 4.4, are almost identical to the corresponding Monte Carlo calculations seen in Table 4.5. (The latter computations, representing estimates of bootstrap standard error, are multiplied by sqrt((n-1)/n) when compared to the symbolic-based computations.) This demonstrates that the analytic bootstrap is capable of providing bootstrap estimates without resorting to conventional bootstrap resampling.

Table 4.5: Monte Carlo bootstrap estimates of the mean and standard error of the Behrens-Fisher statistic for various B for the survival data, n = 44 (s.e. in brackets)

    B       Bootstrap mean     Bootstrap s.e.
    50       0.028 (0.155)     1.093 (0.125)
    100     -0.088 (0.108)     1.057 (0.077)
    200     -0.060 (0.072)     1.022 (0.058)
    400     -0.015 (0.049)     0.980 (0.036)
    800     -0.000 (0.035)     0.988 (0.024)
    1600     0.018 (0.025)     1.008 (0.019)

Diagnostic properties such as bias and variance of bootstrap moments may be assessed analytically and then evaluated numerically. For example, the bias of the bootstrap variance of the Behrens-Fisher statistic is represented by the difference E_{x,y}[Var*_{x,y}(bf*)] - Var_{x,y}(bf). This expression could be evaluated for any distribution, for example the Gaussian distribution. Using the empirical distribution functions F_n and G_n, we may calculate the 4th order bootstrap estimate of the bias of the bootstrap variance.
Similarly, the 4th order bootstrap estimate of the bias of the bootstrap expectation may be calculated. These results suggest that the bootstrap procedure for estimating the mean and variance of the Behrens-Fisher statistic is largely free of bias.
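The Monte Carlo comparison reported in Table 4.5 can be sketched in Python (an illustration only, not the thesis's S-Plus code; the synthetic data and the value of B are placeholders standing in for the standardized survival-time groups):

```python
import random
import statistics


def behrens_fisher(x, y):
    """Welch-type (Behrens-Fisher) statistic for two independent samples."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / (vx / nx + vy / ny) ** 0.5


def bootstrap_moments(x, y, B, seed=0):
    """Resample each group independently B times; return the Monte Carlo
    estimates of the bootstrap mean and bootstrap s.e. of the statistic."""
    rng = random.Random(seed)
    stats = []
    for _ in range(B):
        xs = [rng.choice(x) for _ in x]
        ys = [rng.choice(y) for _ in y]
        stats.append(behrens_fisher(xs, ys))
    return statistics.mean(stats), statistics.stdev(stats)


# placeholder data: two groups of 44 with unequal variances, zero means
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(44)]
y = [rng.gauss(0, 2) for _ in range(44)]
boot_mean, boot_se = bootstrap_moments(x, y, B=400)
print(boot_mean, boot_se)
```

Because the statistic is studentized, the bootstrap s.e. lands near 1, matching the pattern of Table 4.5.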
4.5 Conclusions
The symbolic bootstrap methodology developed in Chapter 3 may be applied to a broad class of statistics, namely those which may be expressed as smooth non-linear functions of averages. The Behrens-Fisher statistic, the acceleration adjustment in BCa confidence intervals, and various quantities associated with bootstrap-t confidence intervals all satisfy this criterion. The present chapter has shown in detail how the analytic bootstrap may be applied to these examples. In all cases numerical evaluation of the analytic bootstrap estimates matched computations based on conventional resampling techniques.
If required, the computational tools presented in this chapter with respect to the three examples studied could provide symbolic expansions for any bootstrap cumulant to any desired order, as well as properties of these.
Analytic bootstrap expansions constitute symbolic representations, in fact bootstrap formulas, which are functions of various cumulants and residual sums. They are true for any distribution(s) for which the cumulants are known. In particular, the empirical distribution functions were used for numerical evaluation. The symbolic expressions offer two advantages over conventional Monte Carlo resampling. First, the existence of the particular 2nd and 4th order analytic expansions seen in this chapter means that the derivational work they represent does not need to be duplicated by different practitioners: the expressions are available for efficient and repeated numerical evaluation for different data sets. Second, analytic bootstrap expansions enable the user to identify component statistical quantities, such as particular cumulants or residual sums, which may influence the magnitude of properties such as the bias or variance of the bootstrap procedure in question. Hence, analytic bootstrap methodology enables the user to better understand the bootstrap procedure being used.
Chapter 5
Concluding remarks
5.1 Remarks
In this thesis we established the representation of bootstrap moments and their properties as analytical expansions involving cumulants and standardized sums. This methodology was developed for a broad class of statistics, namely the M-estimators. The approach was illustrated by providing examples involving the correlation coefficient, the Behrens-Fisher statistic, double-bootstrapping, and BCa bootstrap confidence intervals. Numerical evaluations of the symbolic bootstrap expressions were shown to be consistent with analogous Monte Carlo approximations. An alternative methodology based on the exhaustive enumeration of bootstrap resamples was also presented and illustrated for the class of L-estimators, in particular the median. The exhaustive enumeration approach used for order statistics provides exact analytical bootstrap expressions, and although the approach may also be applied to more general estimators it has sample size restrictions. The use of asymptotic expansions in the case of M-estimators has no such limitation, and the symbolic bootstrap quantities may be obtained to an arbitrary precision.
The features/advantages of the analytic bootstrap developed in this thesis may be summarized, as they were in Chapter 1, as follows:
1. Analytic derivation and numerical evaluation of bootstrap estimates and their properties entirely avoids Monte Carlo resampling, resulting in faster and more efficient calculation.
2. It can be used diagnostically to assess properties such as bias, variance or other moments of a bootstrap procedure.
3. For the class of L-estimators (linear combinations of order statistics), the analytic bootstrap calculates ideal bootstrap estimators exactly for sample sizes up to n = 20 and calculates properties of these estimators exactly for much larger sample sizes.
For the class of M-estimators (maximum likelihood-type estimators), the analytic bootstrap calculates ideal bootstrap estimators and properties of these exactly for linear statistics, and to within a specified order of accuracy for large sample sizes in the case of non-linear statistics.
4. The analytic bootstrap may be applied to a broad class of statistics, namely to any estimator which may be expressed as a smooth non-linear function of averages.
5. The analytical approach for the class of M-estimators is less computer intensive for larger data sets. The reverse is true for conventional bootstrap calculations.
6. Analytic bootstrap expressions are potentially insightful because component statistical quantities such as cumulants and sums are identified.
7. Each analytic bootstrap expression (representing a bootstrap moment or a property of this) is a mathematical bootstrap formula which may be numerically evaluated efficiently for a given data set, or repeatedly evaluated for different data sets arising within a given application. The computational effort required to derive each formula need only be undertaken once. It is subsequently usable in perpetuity by other practitioners for the purpose of efficient numerical evaluation.
This thesis has established that a large class of problems may be addressed by the analytic bootstrap. Further work would be required to investigate the application of the methodology to estimators involving non-iid random variables. The methodology could also be extended to address multi-dimensional parameters. Other work may involve use of the analytic bootstrap on the basis of bootstrap resamples of smaller size m instead of n. Symbolic Cornish-Fisher expansions of bootstrapped quantiles of a statistic could be obtained by computing various bootstrap cumulants, plugging them into an Edgeworth expansion of the distribution function of the estimator, and inverting.
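The Cornish-Fisher route sketched above can be made concrete. The following Python fragment (an illustration with assumed cumulant values, not thesis output) applies the standard third- and fourth-cumulant Cornish-Fisher correction to a normal quantile:

```python
from statistics import NormalDist


def cornish_fisher_quantile(p, k3=0.0, k4=0.0):
    """Approximate the p-quantile of a standardized statistic from its
    third and fourth cumulants via the Cornish-Fisher expansion."""
    z = NormalDist().inv_cdf(p)
    return (z
            + (z ** 2 - 1) * k3 / 6
            + (z ** 3 - 3 * z) * k4 / 24
            - (2 * z ** 3 - 5 * z) * k3 ** 2 / 36)


# with all higher cumulants zero, the normal quantile is recovered
q0 = cornish_fisher_quantile(0.975)
q_skew = cornish_fisher_quantile(0.975, k3=0.5)  # positive skew raises the upper quantile
print(q0, q_skew)
```

In the symbolic setting, the bootstrap cumulants produced by the tools of Chapters 3 and 4 would play the role of k3 and k4.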
Appendix A
Source Code used for this Thesis
In this Appendix we provide the source code that was used for this thesis. This includes the Mathematica code underlying the computational tools developed in Chapters 2 and 3, together with code used for the examples in Chapter 4. S-Plus code that was used to carry out comparative Monte Carlo simulations related to these chapters is also given below.
A.1 Source Code used for Chapter 2
This section contains the source code for 3 Mathematica programs associated with the functions developed in Chapter 2, and the source code for examples of 3 programs written in S-Plus that were used to conduct simulation checks of the functions.
(* This program computes the multinomial-type probability of a given *)
(* set of weights (i.e. a list) t = {n1,n2,n3,...} *)
Print[lst2];
mult = Multinomial[f[1], f[2], f[3], f[4], f[5],
                   f[6], f[7], f[8], f[9], f[10],
                   f[11], f[12], f[13], f[14], f[15],
                   f[16], f[17], f[18], f[19], f[20]];
(* Print[mult]; *)
den = Apply[Times, Table[y[i], {i, ls}]];
(* Print[den]; *)
mult * 1 * m/den
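The quantity this program computes - the probability that uniform resampling of n points realizes a given vector of repetition counts - can be sketched compactly in Python (the helper name is illustrative, not from the thesis):

```python
from math import factorial


def resample_probability(weights):
    """Probability that uniform resampling of n = sum(weights) points
    reproduces the given repetition counts (one count per data point)."""
    n = sum(weights)
    coef = factorial(n)          # multinomial coefficient n! / (n1! n2! ...)
    for w in weights:
        coef //= factorial(w)
    return coef / n ** n         # each of the n^n resamples is equally likely


# sample of size 3: the resample {x1, x1, x2} has probability 3!/(2! 1! 0!) / 3^3
print(resample_probability([2, 1, 0]))  # 3/27
```

Summing this probability over every weight vector with the same total recovers 1, which is a quick sanity check on the formula.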
(* This program computes:
   Med[ {list t of weights} ] = { {r(i), n, p(r(i))} }
   for any list t where
   r(i) = the i-th order statistic having non-zero probability of occurrence,
   n = number of distinct order statistics available, and
   p(r(i)) = probability of occurrence of the i-th order statistic.
   This is done for ODD sample size only. Note that the sum of weights
   equals the sample size.
*)
OsPerm[t_] := Map[F, Permutations[t]]
F2[t_] := Union[Partition[Map[F, Permutations[t]], 1]]
Med[t_] := Block[{},
  Do[
    If[i <= Length[t], g[i] = Count[OsPerm[t], i];
      p[i] = g[i] / Length[Permutations[t]] // N],
    {i, Length[t]}];
tab2 = DeleteCases[tab, 0];
(* This code finds all partitions of any positive integer *)
ClearAll[J]
J[I[n1_], 0] := I[n1]
J[I[n1_], I[n2_]] := I[Join[n1, n2]]
J[I[n1_], I[n2_] + b_] := J[I[n1], I[n2]] + J[I[n1], b]
rem[I[a__]] := a
FPL[nn_] := Map[rem, Apply[List, FP[nn]]]
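For reference, the partition enumeration that FP/FPL performs has this minimal Python analogue (a sketch, not a translation of the Mathematica rules):

```python
def partitions(n, largest=None):
    """Yield all partitions of the positive integer n as non-increasing lists."""
    if largest is None:
        largest = n
    if n == 0:
        yield []
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield [part] + rest


print(list(partitions(4)))  # [[4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]]
```

Each partition of the sample size corresponds to one shape of bootstrap weight vector, which is why the enumeration above drives the exact bootstrap calculations.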
(* This computes the expectation of the bootstrap estimate of the second
   moment of a sample median for given odd sample size and lambda.
   It uses the preceding three programs.
   The function SecondMomOrder[r_, n_, l_] is defined here. *)
<<prob.m
<<os.m
<<partition.m
mbi[a_ + b_] := mbi[a] + mbi[b];
mbi[a_ b_] := a mbi[b] /; FreeQ[a, u];
mbi[u^a_. (1-u)^b_.] := Beta[a+1, b+1];
mbi[u] = Beta[2, 1];
mbi[u^a_] := Beta[a+1, 1];
mbi[(1-u)^b_] := Beta[1, b+1];
SecondMomOrder[r_, n_, l_] := mbi[Expand[(x[l])^2 f[r, n]]]
ExpBoot[n_, l_] := Block[{},
  a = Map[Prob, FPL[n]];
  b = Map[Med, FPL[n]];
  Do[
    If[Length[b[[i]]] == 1,
      c[i] = Flatten[b[[i]]];
      d[i] = a[[i]] SecondMomOrder[c[i][[1]], c[i][[2]], l],
      Do[
        e[j] = b[[i]][[j]][[3]] SecondMomOrder[b[[i]][[j]][[1]],
                 b[[i]][[j]][[2]], l],
        {j, Length[b[[i]]]}];
      d[i] = Apply[Plus, Table[e[j], {j, Length[b[[i]]]}]] a[[i]]],
    {i, Length[b]}];
  Apply[Plus, Table[d[i], {i, Length[a]}]]
]
(* This computes the variance of the bootstrap estimate of
   the second moment of a sample median for given odd sample size
   and lambda. The function VarSqOrder[r_, n_, l_] is defined here
   and this depends on SecondMomOrder[r_, n_, l_]. *)
<<prob.m
<<os.m
<<partition.m
mbi[a_ + b_] := mbi[a] + mbi[b];
mbi[a_ b_] := a mbi[b] /; FreeQ[a, u];
mbi[u^a_. (1-u)^b_.] := Beta[a+1, b+1];
mbi[u] = Beta[2, 1];
mbi[u^a_] := Beta[a+1, 1];
mbi[(1-u)^b_] := Beta[1, b+1];
x[l_] := u^l - (1-u)^l
f[r_, n_] := Beta[r, n-r+1]^(-1) u^(r-1) (1-u)^(n-r)
SecondMomOrder[r_, n_, l_] := mbi[Expand[(x[l])^2 f[r, n]]]
mv2[r_, n_, l_] := mbi[Expand[(x[l])^4 f[r, n]]]
VarSqOrder[r_, n_, l_] := mv2[r, n, l] - (SecondMomOrder[r, n, l])^2
a = Map[Prob, FPL[n]]^2;
b = Map[Med, FPL[n]];
Apply[Plus, Table[d[i], {i, Length[a]}]]
(* This was used to simulate the ExpBoot function *)
(* The particular example here assumes a Gaussian shape for
   a sample size of 3. There were 12 combinations in all *)
return(med)
med_apply(mat,2,boot)   # med contains 1000 transformed median values
mean1_mean(med^2)
stderr1_sqrt(var(med^2)/1000)
# Examples
# mean1: 0.03669427   stderr1: 0.001580984
# This is a simulation CHECK of the Mathematica function
# SecondMomOrder[i,n,l] (= mv[r,n,l]) = true value of E(median^2),
# that is, for the case where the i-th order statistic happens to be the
# median in a sample of size n.
# Below 1000 samples of size 3 have been simulated from a U[0,1] distribution.
# The particular example here assumes a Gaussian shape for
# a sample size of 3. (12 combinations)
mu_matrix(runif(3000),3,1000)
u.med_apply(mu,2,median)
x_(u.med^0.135) - (1-u.med)^0.135   # x resembles a Normal since lambda = 0.135
mean1_mean(x^2)
st.err1_sqrt(var(x^2)/1000)
# --- This is the conventional bootstrap, selecting 1000 resamples
# --- from the initial sample of size 3, calculating the transformed median,
# --- then the average of their squares.
# --- This is bootstrap-reg.
# The particular example here assumes a Gaussian shape for
# a sample size of 3. (12 combinations)
mat_matrix(runif(3),3,1)
med_apply(mat,2,boot)   # med contains 1000 transformed median values
mean1_mean(med^2)
stderr1_sqrt(var(med^2)/1000)
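The same conventional-bootstrap check can be sketched in Python (lambda = 0.135 and the sizes follow the S-Plus code above; everything else is illustrative):

```python
import random

LAMBDA = 0.135  # this power transform of a U(0,1) median is close to normal


def transformed_median(sample):
    """Transformed median x = u^lambda - (1-u)^lambda of an odd-sized sample."""
    u = sorted(sample)[len(sample) // 2]
    return u ** LAMBDA - (1 - u) ** LAMBDA


def boot_second_moment(sample, nboot=1000, seed=0):
    """Conventional bootstrap estimate of E[(transformed median)^2]."""
    rng = random.Random(seed)
    vals = [transformed_median([rng.choice(sample) for _ in sample])
            for _ in range(nboot)]
    return sum(v * v for v in vals) / nboot


rng = random.Random(42)
sample = [rng.random() for _ in range(3)]
print(boot_second_moment(sample))
```

The result can then be compared with the exact ExpBoot value computed symbolically above.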
A.2 Source Code used for Chapter 3
This section contains the Mathematica code that was used in Chapter 3 for the symbolic derivation and numerical evaluation of bootstrap estimators and corresponding properties associated with M-estimators. Code connected with the case of the correlation coefficient is also included. S-Plus code for producing comparative Monte Carlo simulations for the correlation coefficient is given at the end of this section.
(* This contains the code for: ThetaHat[i] and BootCum[thetahat][i, j] *)
(* Supporting code called astat.3.0.new.m is associated with
   Andrews et al. (1993). *)
(* Change Cums to Gmoms if rv is X1 *)
Cum[{a__}] := Cum[a]
Mom[{a__}] := Mom[a]
Cum[a_] := Mom[a] /; !FreeQ[a, X1[w]]
Cum[a_, b__] := CfromM[a, b] /; !FreeQ[{a, b}, X1[w]]
Mom[a_] := Gmom[a] /; !FreeQ[a, X1[w]]
Mom[a_, b__] := Gmom[Apply[Times, {a, b}]] /; !FreeQ[a, X1[w]]
(* Read in definitions *)
$Echo = "stdout"
(* Define psi function, root etc. *)
Psi = psifn[X[w], al] &
Nroot0[Cum[Psi]] = theta
thetahat = Nroot[AF[n][Psi]];
(* empir replaces F with Fn in stat *)
(* eCum stands for empirical Cum *)
rsn[a_] := rsm[a] /. X[w] -> Xstar[w]
NT[empir[a_], i_] := Nget[Expand[Nsum[a, i]], i] /.
  RS[n] -> rsn /. rsm -> RS[n] /. Cum -> tCum /. tCum -> Cum
tCum[a__] := s[Length[{a}]] eCum[a] /; !FreeQ[{a}, Xstar]
(* eCum[a_] := n^(-1) S[n][a] /. X[w] -> Xstar[w] *)
(* X* = Xstar, i.e. the weighted or bootstrapped X *)
ThetaHat[i_] := Nsum[thetahat, i]
BootCum[thetahat][i_, j_] :=
  Nsum[empir[Ncum[empir[Npower[thetahat, 1]], i]], j]
(* In practice, for a given statistic, this BootCum code must use
   the tool Nboot. See for example the code used to derive
   bootstrap moments of the correlation coefficient below. *)
(*
ExpBootExp[i_, j_, k_] :=
  Nsum[Ncum[empir[Ncum[empir[Npower[thetahat, 1]], k]], j], i]
*)
(* This defines the bootstrap expectation of a statistic *)
(* Supporting code called astat.3.0.new.m is associated
   with Andrews et al. (1993). *)
RVQ::usage =
  "RVQ[a] returns True iff a is a random variable."
ClearAll[RVQ]
RVQ[a_, b__] := If[RVQ[a], True, RVQ[b]]
RVQ[g_[a___]] := If[MemberQ[ConsExpectFns, g], False,
  If[MemberQ[RandomFns, g], True, RVQ[a]]]
RVQ[z_] := False
ConsExpectFns = {Expect, Moment, CentralMoment, Theta, Cum, ED, InvED, ICum};
RandomFns = {RD, RS, DF};
AllRVQ[a_, b__] := If[RVQ[a], AllRVQ[b], False]
AllRVQ[a_] := RVQ[a];
VQ::usage =
  "VQ[a] returns True iff a is a variable."
ClearAll[VQ]
VQ[a_, b__] := If[VQ[a], True, VQ[b]]
VQ[g_[a___]] := If[MemberQ[ConsFns, g], False,
  If[MemberQ[VarFns, g], True, VQ[a]]]
VQ[z_] := False
ConsFns = {SUM};
VarFns = {};
AllVQ[a_, b__] := If[VQ[a], AllVQ[b], False]
AllVQ[a_] := VQ[a];
FPQ::usage =
  "FPQ[a, b, ..] returns the full partition of a, b, ..."
SetAttributes[PE, Orderless]
ClearAll[addn1]
addn1[n_, a_ + b_] := addn1[n, a] + addn1[n, b]
addn1[n_, (a_ + b_) c_] := addn1[n, a c] + addn1[n, b c]
addn1[n_, a_ PE[b__]] := a addn1[n, PE[b]]
addn1[n_, PE[a__]] :=
  Apply[Plus, Table[Apply[PE, addn2[{a}, n, iadd]], {iadd, 1, Length[{a}]}]] +
  Append[PE[a], {n}]
ClearAll[addn2]
addn2[a_, n_, i_] :=
  Table[If[jaddn2 == i, Append[a[[i]], n], a[[jaddn2]]],
    {jaddn2, 1, Length[a]}];
ClearAll[FP1]
FP1[inlist_] := Expand[FP1[PE[inlist], Length[inlist]]]
FP1[pe_, 1] := pe
FP1[a_ + b_] := FP1[a] + FP1[b]
FP1[c_ PE[a__], n_] := c FP1[PE[a], n]
FP1[PE[a__], n_] :=
  addn1[Last[{a}], FP1[PE[Drop[{a}, -1]], n-1]]
EV::usage =
  "EV[a] returns the expected value of a"
ClearAll[EV]
EV[a_ + b_] := EV[a] + EV[b]
EV[(a_ + b_)^c_. d_.] := EV[Expand[(a+b)^c d]] /;
  ((c > 0) && (IntegerQ[c]))
EV[a_ b_] := a EV[b] /; !RVQ[a]
EV[a_] := a /; !RVQ[a]
EV[SUM[a_]] := SUM[EV[a]]
EV[SUM[a_] c_] := EV[c, a]
EV[SUM[a_] c_., b__] := EV[c, a, b]
EV[SUM[a_]^b_ c_.] := EV[SUM[a]^(b-1) c, a]
EV[SUM[a_]^b_ c_., d__] := EV[SUM[a]^(b-1) c, a, d]
EV[1, a__] := Collect[Expand[Map[EVPE, FP1[{a}]]], SUM]
ClearAll[EVPE]
EVPE[(a_ + b_)] := EVPE[a] + EVPE[b]
EVPE[(a_ + b_) c_] := EVPE[a c] + EVPE[b c]
EVPE[a_ PE[b__]] := a EVPE[PE[b]]
EVPE[PE[a_]] := SUM[EV[Apply[Times, a]]]
EVPE[PE[a__]] := Apply[SUM, Map[EV[Apply[Times, #1]] &, PE[a]]]
BEV::usage =
  "BEV[a] returns the bootstrap expected value of a"
ClearAll[BEV]
BEV[a_ + b_] := BEV[a] + BEV[b]
BEV[(a_ + b_)^c_. d_.] := BEV[Expand[(a+b)^c d]] /;
  ((c > 0) && (IntegerQ[c]))
BEV[a_ b_] := a BEV[b] /; !RVQ[a]
BEV[a_] := a /; !RVQ[a]
BEV[SUM[a_]] := SUM[a]
BEV[SUM[a_] c_] := BEV[c, a]
BEV[SUM[a_] c_., b__] := BEV[c, a, b]
BEV[SUM[a_]^b_ c_.] := BEV[SUM[a]^(b-1) c, a]
BEV[SUM[a_]^b_ c_., d__] := BEV[SUM[a]^(b-1) c, a, d]
BEV[1, a__] := Collect[Expand[Map[BEVPE, FP1[{a}]]], SUM]
ClearAll[BEVPE]
BEVPE[(a_ + b_)] := BEVPE[a] + BEVPE[b]
BEVPE[(a_ + b_) c_] := BEVPE[a c] + BEVPE[b c]
BEVPE[a_ PE[b__]] := a BEVPE[PE[b]]
BEVPE[PE[a_]] := wt[Length[a]] SUM[Apply[Times, a]]
BEVPE[PE[a__]] :=
  wtpe[PE[a]] Apply[SUM, Map[Apply[Times, #1] &, {a}]]
ClearAll[wt]
SetAttributes[wt, Orderless]
wt[a__] := wt[a] = mgf[a] /. wrules
BVAR::usage =
  "BVAR[a] returns the bootstrap variance of a"
BVAR[a_] := Collect[BEV[a^2] - BEV[a]^2, SUM]
SUM::usage =
  "SUM[a] represents the sum of a_i.
   SUM[a, b, ..] represents the sum of a_i, b_j, ...
   over i != j != ..."
ClearAll[SUM]
SetAttributes[SUM, Orderless]
SUM[a_] := n a /; !VQ[a]
SUM[a_ + b_] := SUM[a] + SUM[b]
SUM[a_ + b_, c__] := SUM[a, c] + SUM[b, c]
SUM[a_ b_] := a SUM[b] /; !VQ[a]
SUM[a__] :=
  Collect[Expand[Map[CSUM[Length[{a}], #1] &, FP1[{a}]]], SUM] /;
  Length[{a}] > 1
ClearAll[CSUM]
CSUM[l_, a_ + b_] := CSUM[l, a] + CSUM[l, b]
CSUM[l_, a_ b_] := a CSUM[l, b] /; !VQ[a]
CSUM[l_, (b_ + c_)^d_.] :=
  CSUM[l, Expand[(b+c)^d]] /; IntegerQ[d] && (d > 0)
CSUM[l_, PE[a__]] := Block[{la},
  la = Length[{a}];
  (-1)^(l - la) Apply[Times, Map[(Length[#1] - 1)! SUM2[#1] &, {a}]]]
SUM2[{a__}] := SUM[Times[a]]
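The identity behind these reduction rules - a sum over distinct indices rewritten in terms of unrestricted sums - can be checked numerically. For two sequences, sum_{i != j} a_i b_j = (sum_i a_i)(sum_j b_j) - sum_i a_i b_i; a Python illustration:

```python
def sum_distinct(a, b):
    """Direct sum of a_i * b_j over i != j."""
    return sum(a[i] * b[j]
               for i in range(len(a)) for j in range(len(b)) if i != j)


def sum_reduced(a, b):
    """Same quantity via unrestricted sums, as the symbolic rules produce."""
    return sum(a) * sum(b) - sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 4.0]
print(sum_distinct(a, b), sum_reduced(a, b))
```

The CSUM rule generalizes this two-factor case to arbitrary set partitions of the index set, with the (-1)^(l-la) (la-1)! signs playing the role of the inclusion-exclusion coefficients.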
Nboot::usage =
  "NT[Nboot[a_], i] represents the i-th term of the bootstrap
   expectation of a";
NT[Nboot[a_], i_] := NT[Nboot[a], i] =
  Collect[BootSum[a, i, 0], RS]
BootSum[a_, i_, j_] := Nget[BootTerm[a, j], i-j] +
  If[j < i, BootSum[a, i, j+1], 0]
BootTerm[a_, i_] := BootTerm[a, i] =
  Collect[Expand[(BEV[NT[a, i] /. RS[n] -> RtoS] /. SUM -> StoR)], n]
StoR[ZM[a__]] := n^(1/2) RS[n][a]
RtoS[a_] := n^(-1/2) SUM[ZM[a]]
StoR[a_] := RS[n][a /. ZM -> ZtoX]
ZtoX[a_] := a - Cum[a]
EmpCum[a_] := n^(-1) SUM[a]
EmpCum[a__] := Map[SignedEV, FP1[{a}]]
SignedEV[b_ PE[a__]] := b SignedEV[PE[a]]
SignedEV[PE[a__]] :=
  (-1)^(Length[{a}] - 1) (Length[{a}] - 1)! *
    Apply[Times, Map[BAV, Map[Apply[Times, #1] &, {a}]]]
BAV[a_] := n^(-1) SUM[a]
(* Example *)
RVQ[X] = True
VQ[X] = True
EmpCum[X, X, X, X]
<<BSTAT.1.0.new.m   (* bootstrap code *)
<<astat.3.0.new.m
(* Some Examples *)
RVQ[X1] = True
RVQ[X2] = True
RVQ[X3] = True
VQ[X1] = True
VQ[X2] = True
VQ[X3] = True;
SUM[X1]
SUM[X1, X2]
SUM[X1, X2, X3]
SUM[X1 X2 X3] + SUM[X1, X2 X3] + SUM[X1 X2, X3] +
  SUM[X1 X3, X2] + SUM[X1, X2, X3]
EV[SUM[X1] SUM[X2]]
BEV[SUM[X1]]
BEV[SUM[X1] SUM[X2]]
BEV[SUM[X1] SUM[X2] SUM[X3]]
BVAR[SUM[X1]/n]
BVAR[BVAR[SUM[X1]/n]]
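The final example calls compute the bootstrap variance of the sample mean, for which the ideal bootstrap has the well-known exact closed form Var*(xbar*) = n^(-2) sum_i (x_i - xbar)^2. A quick Python check against direct resampling (placeholder data, illustration only):

```python
import random


def bvar_mean_exact(x):
    """Exact (ideal) bootstrap variance of the sample mean."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / n ** 2


def bvar_mean_mc(x, nboot=20000, seed=0):
    """Monte Carlo approximation by direct resampling."""
    rng = random.Random(seed)
    n = len(x)
    means = [sum(rng.choice(x) for _ in range(n)) / n for _ in range(nboot)]
    m = sum(means) / nboot
    return sum((v - m) ** 2 for v in means) / nboot


x = [1.2, -0.4, 0.8, 2.1, -1.5]
print(bvar_mean_exact(x), bvar_mean_mc(x))
```

This is the numerical content of BVAR[SUM[X1]/n]: for linear statistics the symbolic bootstrap moment is exact, so resampling is unnecessary.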
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
Cum[LX[w]] = 0
Cum[LY[w]] = 0
LX/: Cum[LX[w], LX[w]] = 1
LY/: Cum[LY[w], LY[w]] = 1
LX/: Cum[LX[w], LY[w]] = rho
cov/: NT[cov, i_] := NT[cov, i] =
  If[i == 0, Cum[LX[w], LY[w]],
    If[i == 1, RS[n][LX[w] LY[w]] -
        Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
      If[i == 2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
vax/: NT[vax, i_] :=
  If[i == 0, Cum[LX[w], LX[w]],
    If[i == 1, RS[n][LX[w] LX[w]] -
        Cum[LX[w]] RS[n][LX[w]] - Cum[LX[w]] RS[n][LX[w]],
      If[i == 2, -RS[n][LX[w]] RS[n][LX[w]], 0]]]
vay/: NT[vay, i_] :=
  If[i == 0, Cum[LY[w], LY[w]],
    If[i == 1, RS[n][LY[w] LY[w]] -
        Cum[LY[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LY[w]],
      If[i == 2, -RS[n][LY[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[NF[#1^(-1/2) &, Ntimes[vax, vay]], cov], i]], RS]
Cor[i_] := Nsum[cor, i]
NT[cor, 2]
Nsum[cor, 2]
NT[Nboot[cor], 0]
NT[Nboot[cor], 1]
NT[Nboot[cor], 2]
v2 = Expand[Nsum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 2]]
Collect[(v2 /. RS[n] -> RtoS /. Cum -> EmpCum), SUM];
(* This computes 2nd order symbolic bootstrap expansions
   for the correlation coefficient and numerically
   evaluates these for two data sets.
   4th order code is similar.
   This is for n = 15 AND n = 100. *)
(* Uses NEW versions of both BSTAT and ASTAT! *)
(* Calculates cor2, e2 = BootExp[cor], v2 = BootVar[cor] *)
(* Plus TeXForm and numerical versions for e2, v2 *)
(* Assuming empirical distribution Fn,
   i.e. set RS[n] -> 0 below, and use empirical cums here *)
(* The values below are correctly standardized! *)
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
(* Expansion of correlation *)
Cor[i_] := Nsum[cor, i]
cor2 = Nsum[cor, 2] /. texr
incor2 = TeXForm[cor2]
incor22 = InputForm[cor2]
(* Expansion of bootstrap mean of the correlation *)
e2 = Expand[Nsum[Nboot[cor], 2]]
ine2 = TeXForm[e2]
ine22 = InputForm[e2]
(* Expansion of E - over x - of bootstrap mean of the correlation *)
ee2 = Expand[Nsum[Ncum[Nboot[cor], 1], 2]]
inee2 = TeXForm[ee2]
inee22 = InputForm[ee2]
(* Expansion of Var - over x - of bootstrap mean of the correlation *)
ve2 = Expand[Nsum[Ncum[Nboot[cor], 2], 2]]
inve2 = TeXForm[ve2]
inve22 = InputForm[ve2]
(* Expansion of bootstrap variance of the correlation *)
v2 = Expand[Nsum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 2]]
inv2 = TeXForm[v2]
inv22 = InputForm[v2]
(* Expansion of E - over x - of bootstrap variance of the correlation *)
ev2 = Expand[Nsum[Ncum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 1], 2]]
inev2 = TeXForm[ev2]
inev22 = InputForm[ev2]
(* Expansion of Var - over x - of bootstrap variance of the correlation *)
vv2 = Expand[Nsum[Ncum[Nboot[Npower[cor, 2]] - Npower[Nboot[cor], 2], 2], 2]]
invv2 = TeXForm[vv2]
invv22 = InputForm[vv2]
PLUS[{a__}] := PLUS[a]
(* The following is for n = 15 and n = 100 *)
e2num1 = e2SUM /. {SUM -> PLUS,
  LX[w] -> {-0.600997, 0.860219, -1.04679, -0.551465, 1.62798,
    -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
    1.30601, -0.625764, -1.36875, -0.700063, -0.155203},
  LY[w] -> {1.25538, 0.872812, -1.21003, -0.274879, 1.46791,
    -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
    0.107685, -1.50758, -1.42257, -0.912485, -0.572429},
  n -> 15,
  rho -> 0.776374} /. PLUS -> Plus
  rho -> 0.692704} /. PLUS -> Plus
... ETC ...
(******************** r.s2 ********************************)
(* SPLUS code for producing comparative Monte Carlo
   simulations for the correlation coefficient for
   n = 15 law school data.
   Can hence compare the bootstrap mean and standard error for the
   correlation coefficient obtained using the analytical
   bootstrap with the same obtained using conventional resampling
   as done below. *)
options(object.size=40e6)
data_c(law1, law2)
xdata_matrix(data, ncol=2)
x_1:n
nboot_25600
theta_function(x, xdata){ cor(xdata[x,1], xdata[x,2]) }
results_bootstrap(x, nboot, theta, xdata)
bs.se_sqrt(var(results))
varvar_(nboot/((nboot-1)^2))*var((results - mean(results))^2)
var.bsse_(1/(4*var(results)))*varvar
bs.se
se.bsse
xbar
se
(************************ r2.s ************************************)
(* SPLUS code for producing comparative Monte Carlo
   simulations for the correlation coefficient for
   n = 100 simulated standard normal data.
   Can hence compare the bootstrap mean and standard error for the
   correlation coefficient obtained using the analytical
   bootstrap with the same obtained using conventional resampling
   as done below. *)
options(object.size=50e6)
data_c(norm1, norm2)
xdata_matrix(data, ncol=2)
bootstrap_function(x, nboot, theta, xdata){
  data_matrix(sample(x, size=length(x)*nboot, replace=T), nrow=nboot)
  return(apply(data, 1, theta, xdata))
}
bs.se_sqrt(var(results))
varvar_(nboot/((nboot-1)^2))*var((results - mean(results))^2)
var.bsse_(1/(4*var(results)))*varvar
se.bsse_var.bsse^0.5
xbar_mean(results)
se_sqrt(var(results)/nboot)
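The variance-of-the-bootstrap-s.e. computation in these scripts uses the delta method: if v is the sampled variance of the bootstrap replicates, then se(sqrt(v)) ~ sqrt(Var(v)/(4 v)). A Python rendering (synthetic bivariate normal data as a stand-in for the law school data):

```python
import random


def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_corr_se(x, y, nboot=2000, seed=0):
    """Bootstrap s.e. of the correlation, with a delta-method standard
    error for the bootstrap s.e. itself (mirroring the S-Plus scripts)."""
    rng = random.Random(seed)
    n = len(x)
    results = []
    for _ in range(nboot):
        idx = [rng.randrange(n) for _ in range(n)]
        results.append(corr([x[i] for i in idx], [y[i] for i in idx]))
    m = sum(results) / nboot
    v = sum((r - m) ** 2 for r in results) / (nboot - 1)
    sq = [(r - m) ** 2 for r in results]
    msq = sum(sq) / nboot
    var_sq = sum((s - msq) ** 2 for s in sq) / (nboot - 1)
    varvar = (nboot / (nboot - 1) ** 2) * var_sq
    se_bsse = (varvar / (4 * v)) ** 0.5   # Var(sqrt(v)) ~ Var(v) / (4 v)
    return v ** 0.5, se_bsse


rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [0.7 * xi + 0.3 * rng.gauss(0, 1) for xi in x]
print(bootstrap_corr_se(x, y))
```

The se.bsse output quantifies the Monte Carlo noise that the analytic bootstrap avoids entirely.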
A.3 Source Code used for Chapter 4
This section contains the Mathematica programs that were used for the examples
in Chapter 4. S-PLUS code for producing comparative Monte Carlo simulations for
the Behrens-Fisher statistic is given at the end of this section.
(****************** t.m ************************************)
(* This defines the analytical bootstrap-t statistic
for the correlation coefficient
and derives the symbolic bootstrap expectation of
it as well as evaluating it for the law school data *)
$Echo = "stdout"
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
Cum[LX[w]] = 0
Cum[LY[w]] = 0
LX/: Cum[LX[w], LX[w]] = 1
LY/: Cum[LY[w], LY[w]] = 1
LX/: Cum[LX[w], LY[w]] = rho
cov/: NT[cov, i_] := NT[cov, i] =
  If[i==0, Cum[LX[w], LY[w]],
   If[i==1, RS[n][LX[w]LY[w]] -
     Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Ntimes[vax, vay]], cov], i]], RS]
cor2/: NT[cor2, i_] := NT[cor2, i] = NT[cor, i+1]
bootvar/: NT[bootvar, i_] := NT[bootvar, i] =
  NT[Nboot[Npower[cor, 2]], i] - NT[Npower[Nboot[cor], 2], i]
bootvar2/: NT[bootvar2, i_] := NT[bootvar2, i] =
  NT[Nboot[Npower[cor2, 2]], i] - NT[Npower[Nboot[cor2], 2], i]
t/: NT[t, i_] := NT[t, i] =
  Expand[
   NT[Ntimes[cor2, Nf[#1^(-1/2)&, bootvar2]], i]
    - NT[Ntimes[r1, Nf[#1^(-1/2)&, bootvar]], i]]
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
bootexpt1 = Expand[Nsum[Nboot[Npower[t, 1]], 1]] /. r1 -> cor2
% /. texr
TeXForm[%]
% /. texr;
Simplify[%]
TeXForm[%]
bootexp = Expand[Nsum[Nboot[t], 1]]
% /. texr;
Simplify[%]
TeXForm[%]
bootexpt1SUM = Collect[(bootexpt1 /. RS[n] -> (0&) /. Cum -> EmpCum), SUM];
bootexpt1SUM /. {SUM -> PLUS, LX[w] -> {-0.600997, 0.860219, -1.04679, -0.551465,
    1.62798, -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
    1.30601, -0.625764, -1.36875, -0.700063, -0.155203},
   LY[w] -> {1.25538, 0.872812, -1.21003, -0.274879, 1.4679,
    -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
    0.107685, -1.50758,
    -1.42257, -0.912485, -0.572429},
   n -> 15,
   rho -> 0.776374} /. PLUS -> Plus
(************************ bca1.m ************************)
(* This calculates the symbolic bootstrap expansion of the
acceleration adjustment associated with BCa C.I.'s.
This is done for the correlation coefficient w.r.t.
the law school data. *)
RVQ[X] = True
VQ[X] = True
RVQ[Y] = True
VQ[Y] = True
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
Cum[LX[w]] = 0
Cum[LY[w]] = 0
LX/: Cum[LX[w], LX[w]] = 1
LY/: Cum[LY[w], LY[w]] = 1
LX/: Cum[LX[w], LY[w]] = rho
cov/: NT[cov, i_] := NT[cov, i] =
  If[i==0, Cum[LX[w], LY[w]],
   If[i==1, RS[n][LX[w]LY[w]] -
     Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
vax/: NT[vax, i_] :=
  If[i==0, Cum[LX[w], LX[w]],
   If[i==1, RS[n][LX[w]LX[w]] -
     Cum[LX[w]] RS[n][LX[w]] - Cum[LX[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LX[w]], 0]]]
vay/: NT[vay, i_] :=
  If[i==0, Cum[LY[w], LY[w]],
   If[i==1, RS[n][LY[w]LY[w]] -
     Cum[LY[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LY[w]],
    If[i==2, -RS[n][LY[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Ntimes[vax, vay]], cov], i]], RS]
cor2/: NT[cor2, i_] := NT[cor2, i] = NT[cor, i+1]
(* Bootvar gives the ith term, as opposed to the sum that v2 gives.
We need this ith term as input to get the ith term of the t statistic *)
bootvar/: NT[bootvar, i_] := NT[bootvar, i] =
  NT[Nboot[Npower[cor, 2]], i] - NT[Npower[Nboot[cor], 2], i]
bootvar2/: NT[bootvar2, i_] := NT[bootvar2, i] =
  NT[Nboot[Npower[cor2, 2]], i] - NT[Npower[Nboot[cor2], 2], i]
(* The next section calculates the acceleration constant 'a' which
also represents a slope! *)
prod/: NT[prod, i_] := NT[prod, i] =
  Expand[NT[Ntimes[cor, bootvar], i]]
bprod/: NT[bprod, i_] := NT[bprod, i] =
  Expand[NT[Nboot[prod], i]]
bprod2/: NT[bprod2, i_] := NT[bprod2, i] = NT[bprod, i+1]
bcor/: NT[bcor, i_] := NT[bcor, i] =
  Expand[NT[Nboot[cor], i]]
bootvar2/: NT[bootvar2, i_] := NT[bootvar2, i] =
  NT[Nboot[Npower[cor2, 2]], i] - NT[Npower[Nboot[cor2], 2], i]
bbvar2/: NT[bbvar2, i_] := NT[bbvar2, i] =
  Expand[NT[Nboot[bootvar2], i]]
(* Made up of three pieces: bprod2, bootvar2, bcor. *)
acc3/: NT[acc3, i_] := NT[acc3, i] =
  Expand[NT[Ntimes[bprod2, Nf[#1^(-1/2)&, bootvar2]], i]] -
   Expand[NT[Ntimes[Ntimes[bcor, bootvar2], Nf[#1^(-1/2)&, bootvar2]], i]]
Boot[Acc][i_] := Simplify[Nsum[acc3, i]]
(************* NUMERICAL EVALUATION FOR n = 15 *****************)
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
slope = Simplify[Nsum[acc3, 1]]  (* first order slope *)
% /. texr
TeXForm[%]
slopeSUM = Collect[(slope /. RS[n] -> (0&) /. Cum -> EmpCum), SUM];
PLUS[{a__}] := PLUS[a]
(* The following is for the law school data, n = 15 *)
slopenum = slopeSUM /. {SUM -> PLUS, LX[w] -> {-0.600997, 0.860219, -1.04679,
     -0.551465, 1.62798,
     -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
     1.30601, -0.625764, -1.36875, -0.700063, -0.155203},
    LY[w] -> {1.25538, 0.872812, -1.21003, -0.274879, 1.4679,
     -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
     0.107685, -1.50758,
     -1.42257, -0.912485, -0.572429},
    n -> 15,
    rho -> 0.776374} /. PLUS -> Plus
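The listing above evaluates the symbolic acceleration "slope" on the standardized law school data. A conventional numeric counterpart, which this symbolic result can be checked against, is the jackknife estimate of the BCa acceleration constant from Efron (1987), reference [15]. The sketch below is ours (the function name is hypothetical, and this is the standard jackknife formula, not the thesis's symbolic expansion); the data are the standardized law school values from the substitution above.

```python
import numpy as np

def jackknife_acceleration(x, y):
    """Jackknife estimate of the BCa acceleration constant for the
    correlation coefficient:
        a = sum(d_i^3) / (6 * (sum(d_i^2))^(3/2)),
    where d_i = theta_bar - theta_(i) and theta_(i) is the correlation
    computed with observation i deleted (Efron 1987)."""
    n = len(x)
    theta = np.array([np.corrcoef(np.delete(x, i), np.delete(y, i))[0, 1]
                      for i in range(n)])
    d = theta.mean() - theta
    return float(np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5))

# Standardized law school data from the substitution above (n = 15).
lx = np.array([-0.600997, 0.860219, -1.04679, -0.551465, 1.62798,
               -0.501932, -1.12109, 1.50414, 1.25648, 0.117227,
               1.30601, -0.625764, -1.36875, -0.700063, -0.155203])
ly = np.array([1.25538, 0.872812, -1.21003, -0.274879, 1.4679,
               -0.104851, -0.4024, 1.4254, 1.12785, 0.150192,
               0.107685, -1.50758, -1.42257, -0.912485, -0.572429])
a = jackknife_acceleration(lx, ly)
```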
<<BSTAT.indep2.LXonly.m  (* Indep. version of BSTAT.1.0.new.m *)
(* Some Examples *)
RVQ[X] = True
VQ[X] = True
RVQ[Y] = True
VQ[Y] = True
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True;
BEV[ SUM[X]^2 ]
BEV[ SUM[Y^2] ]
BEV[ SUM[X]^2 SUM[Y]^2 ]
BEV[ SUM[X] SUM[Y] ]
BEV[ SUM[X]^2 SUM[Y] ]
BEV[ SUM[X]^2 SUM[Y^2] SUM[Y] ]
BEV[ SUM[X] ]
BEV[ SUM[Y] ]
BEV[ 1 ]
(* end of examples *)
(* BEHRENS-FISHER problem : corrected
Assume Ex = 0, Ey = 0 with unequal variances
which now really are just second moments!
Equal zero means as opposed to equal non-zero means
will speed up the 4th order symbolic calcs!
*)
cov/: NT[cov, i_] := NT[cov, i] =
  If[i==0, Cum[LX[w], LY[w]],
   If[i==1, RS[n][LX[w]LY[w]] -
     Cum[LX[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LY[w]], 0]]]
vax/: NT[vax, i_] :=
  If[i==0, Cum[LX[w], LX[w]] - Cum[LX[w]] Cum[LX[w]],
   If[i==1, RS[n][LX[w]LX[w]] -
     Cum[LX[w]] RS[n][LX[w]] - Cum[LX[w]] RS[n][LX[w]],
    If[i==2, -RS[n][LX[w]] RS[n][LX[w]], 0]]]
vay/: NT[vay, i_] :=
  If[i==0, Cum[LY[w], LY[w]] - Cum[LY[w]] Cum[LY[w]],
   If[i==1, RS[n][LY[w]LY[w]] -
     Cum[LY[w]] RS[n][LY[w]] - Cum[LY[w]] RS[n][LY[w]],
    If[i==2, -RS[n][LY[w]] RS[n][LY[w]], 0]]]
cor/: NT[cor, i_] := NT[cor, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Ntimes[vax, vay]], cov], i]], RS]
(* BEHRENS-FISHER CODE : *)
diff/: NT[diff, i_] := NT[diff, i] =
  If[i==0, Cum[LX[w]] - Cum[LY[w]],
   If[i==1, RS[n][LX[w]] - RS[n][LY[w]], 0]]
bf/: NT[bf, i_] := NT[bf, i] =
  Collect[Expand[NT[Ntimes[Nf[#1^(-1/2)&, Apply[Plus, {vax, vay}]], diff], i]], RS]
BF[i_] := Nsum[bf, i]
NT[bf, 2]
Nsum[bf, 2]
NT[Nboot[bf], 0]
NT[Nboot[bf], 1]
NT[Nboot[bf], 2]
bootexp = Expand[Nsum[Nboot[bf], 4]]
bootvar = Expand[Nsum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 4]]
(* ... survival data. 4th order uses similar code. *)
(* Uses SURVIVAL data (n = 44) *)
(********** Assuming 0, 0, unequal variances! **********)
(* and assuming INDEPENDENCE code in bf3.m *)
(* i.e. BEV tool was MODIFIED *)
texr = { RS -> Z, LX[w] -> X, LY[w] -> Y }
(* Expansion of Behrens-Fisher statistic : *)
bf2 = Nsum[bf, 2]
% /. texr
(* Expansion of bootstrap mean of the Behrens-Fisher statistic *)
e2 = Expand[Nsum[Nboot[bf], 2]]
% /. texr
Simplify[%]
TeXForm[%]
(* Expansion of E - over x - of bootstrap mean of the B-F stat. *)
ee2 = Expand[Nsum[Ncum[Nboot[bf], 1], 2]]
ee2 /. texr
Simplify[%]
TeXForm[%]
ee4 = Expand[Nsum[Ncum[Nboot[bf], 1], 4]]
ee4 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of Var - over x - of bootstrap mean of the B-F stat. *)
ve2 = Expand[Nsum[Ncum[Nboot[bf], 2], 2]]
ve2 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of bootstrap variance of the B-F stat. *)
v2 = Expand[Nsum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 2]]
v2 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of E - over x - of bootstrap variance of the B-F stat. *)
ev2 = Expand[Nsum[Ncum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 1], 2]]
ev2 /. texr
Simplify[%]
TeXForm[%]
(* Expansion of Var - over x - of bootstrap variance of the B-F stat. *)
vv2 = Expand[Nsum[Ncum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 2], 2]]
vv2 /. texr
Simplify[%]
TeXForm[%]
vv2one = Expand[Nsum[Ncum[Nboot[Npower[bf, 2]] - Npower[Nboot[bf], 2], 2], 2]]
vv2one /. texr
Simplify[%]
TeXForm[%]
exp2 = Expand[Nsum[Ncum[bf, 1], 2]]
exp2 /. texr
Simplify[%]
TeXForm[%]
var2 = Expand[Nsum[Ncum[bf, 2], 2]]
var2 /. texr
Simplify[%]
TeXForm[%]
(* exp4 = Expand[Nsum[Ncum[bf,1],4]] /. texr *)
(* var4 = Expand[Nsum[Ncum[bf,2],4]] /. texr *)
PLUS[{a__}] := PLUS[a]
(* The following uses the SURVIVAL data *)
(**************** assuming 0 means and unequal variances **************)
LY[w] ->
 {-644.9318182, -582.9318182, -540.9318182, -520.4318182, -463.9318182,
  -429.9318182, -395.9318182, -383.9318182, -344.9318182, -344.9318182,
  -303.9318182, -291.9318182, -289.9318182, -287.9318182,
  -265.9318182, -262.9318182, -262.9318182, -257.9318182, -251.9318182,
  -237.9318182, -185.9318182, -156.9318182, -122.9318182, -121.9318182,
  -110.9318182, -83.93181818, -76.93181818, 29.06818182, 30.06818182,
  102.0681818, 132.0681818, 140.0681818, 151.0681818, 309.0681818,
  322.0681818, 331.0681818, 599.0681818, 625.0681818, 774.0681818,
  814.0681818, 870.0681818, 905.0681818, 1044.068182,
  1048.068182
 },
 n -> 44} /. PLUS -> Plus
 },
 n -> 44} /. PLUS -> Plus
... ETC ...
ee2num1 // N
% Sqrt[44] // N
ee4num1 // N
% Sqrt[44] // N
ve2num1 // N
% 44 // N
Sqrt[%] // N
v2num1 // N
var = % 44 // N
se = Sqrt[%] // N
% Sqrt[44/43] // N  (* to compare to Splus *)
ev2num1 // N
% 44 // N
vv2num1 // N
vv2num2 // N
% 44^2 // N
exp2num // N
% Sqrt[44] // N
var2num // N
% 44 // N
(* Same as BSTAT.1.0.new.m except for extra code below included
to create the INDEPENDENCE criterion necessary for
the Behrens-Fisher statistic *)
BEV::usage =
 "BEV[a] returns the bootstrap expectation value of a"
(*
:[font = output; output; inactive]
"BEV[a] returns the bootstrap expectation value of a"
:[font = input; initialization]
*)
ClearAll[BEV]
BEV[a_ + b_] := BEV[a] + BEV[b]
BEV[(a_ + b_)^c_. d_.] := BEV[Expand[(a+b)^c d]] /;
  ((c > 0) && (IntegerQ[c]))
BEV[a_ b_] := a BEV[b] /; !RVQ[a]
BEV[a_] := a /; !RVQ[a]
BEV[SUM[a_]] := SUM[a]
BEV[SUM[a_] c_.] := BEV[c, a]
BEV[SUM[a_] c_., b__] := BEV[c, a, b]
BEV[SUM[a_]^b_ c_.] := BEV[SUM[a]^(b-1) c, a]
BEV[SUM[a_]^b_ c_., d__] := BEV[SUM[a]^(b-1) c, a, d]
(* Setting VARIABLES ! *)
RVQ[X] = True
VQ[X] = True
RVQ[Y] = True
VQ[Y] = True
LX/: VQ[LX[w]] = True
LY/: VQ[LY[w]] = True
LX/: RVQ[LX[w]] = True
LY/: RVQ[LY[w]] = True
VQ[ZM[LX[w]]] = True
RVQ[ZM[LX[w]]] = True
VQ[ZM[LY[w]]] = True
RVQ[ZM[LY[w]]] = True
BEV[ SUM[LX[w]] ]
BEV[ SUM[LX[w]]^2 ]
BEV[ SUM[LX[w]]^2 SUM[LY[w]]^2 ]
BEV[ SUM[LX[w]] SUM[LY[w]] ]
(* end of examples *)
(* THIS PART CREATES INDEPENDENCE *)
h[a_] := Prepend[a, 1]
g[a_] := {h[Cases[a, ZM[LX[w]]^c_.]], h[Cases[a, ZM[LY[w]]^c_.]]}
m[a_] := Apply[Times, Apply[BEV1, a, {1}]]
BEV1[a_] := a /; !RVQ[a]
BEV[1, a__] := Collect[Expand[m[g[{a}]]], SUM]
BEV1[1, a__] := Collect[Expand[Map[BEVPE, FP1[{a}] + PE[0]]], SUM]
(* Examples *)
BEV[ SUM[LX[w]] ]
BEV[ SUM[LX[w]]^2 ]
BEV[ SUM[LX[w]]^2 SUM[LY[w]]^2 ]
BEV[ SUM[LX[w]] SUM[LY[w]] ]
(* end of examples *)
# Behrens-Fisher statistic : Monte Carlo calcs.
# INDEP. SAMPLES APPROACH (unlike the corr. coeff. with paired samples!)
# Survival times (days) of gastric cancer patients (n = 44)
# Data have been standardized to have ZERO MEANS only
# i.e. assume H0: equal means = 0
options(object.size=10e6)
n_44
x_1:n
nboot_12800  # Repeat for B = 50, 100, 200, 400, 800, 1600, 3200, 6400,
             # 12800, 25600
Bmeans1_apply(data1, 1, mean)
Bvars1_apply(data1, 1, var)/n
Bmeans2_apply(data2, 1, mean)
Bvars2_apply(data2, 1, var)/n
bf_(Bmeans1 - Bmeans2)/sqrt(Bvars1 + Bvars2)
# B bootstrap reps of BF statistic
# as on p. 224 Efron & Tibs., step (16.6)
bf.mean_mean(bf)  # bootstrap mean (exp.) of bf
se.mean_sqrt(var(bf)/nboot)
bf.se_sqrt(var(bf))  # bootstrap SE of bf
varvar_(nboot/((nboot-1)^2))*var((bf - mean(bf))^2)
var.bsse_(1/(4*var(bf)))*varvar
se.se_var.bsse^0.5
# 1. See if matches with symbolic results, using modified
#    BSTAT code assuming INDEPENDENCE and ZERO means
# 2. Calc. SE's in brackets later using code below
# 3. Run with B as now preferred (even spacings and up to 25,600)
bf.mean
se.mean
bf.se
se.se
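The independent-samples resampling scheme of the listing above can also be sketched in Python. This is a sketch only: the two samples below are synthetic zero-mean stand-ins for the centred gastric-cancer survival groups (not the n = 44 data), and each group is resampled independently, unlike the paired resampling used for the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two centred samples: zero means (H0),
# unequal variances, n = 44 each, mirroring the listing's setup.
x = rng.normal(scale=1.0, size=44)
x -= x.mean()
y = rng.normal(scale=2.0, size=44)
y -= y.mean()

n, nboot = 44, 12800
# Independent resampling of each sample (the key difference from the
# paired correlation-coefficient case).
bx = x[rng.integers(0, n, size=(nboot, n))]
by = y[rng.integers(0, n, size=(nboot, n))]

bmeans1, bvars1 = bx.mean(axis=1), bx.var(axis=1, ddof=1) / n
bmeans2, bvars2 = by.mean(axis=1), by.var(axis=1, ddof=1) / n
bf = (bmeans1 - bmeans2) / np.sqrt(bvars1 + bvars2)  # B reps of the B-F stat

bf_mean = bf.mean()      # bootstrap mean of bf, as bf.mean above
bf_se = bf.std(ddof=1)   # bootstrap SE of bf, as bf.se above
```

Under H0 the statistic is approximately pivotal, so `bf_se` should sit near 1; the symbolic bootstrap of the thesis supplies the expansion terms these Monte Carlo values converge to.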
Bibliography
[1] Andrews, D.F. and Feuerverger, A. (1993). General saddlepoint approximation
methods for bootstrap configurations. Univ. of Toronto Tech. Rep.
[2] Andrews, D.F. and Stafford, J.E. (1993). Tools for the symbolic computation of
asymptotic expansions. J. Roy. Statist. Soc. B 55, 613-627.
[3] Andrews, D.F. and Stafford, J.E. (1995). Iterated full partitions. Univ. of Toronto
Tech. Rep.
[4] Andrews, D.F., Stafford, J.E. and Wang, Y. (1992). On reading Lawley (1956):
an application of symbolic computation. Univ. of Toronto Tech. Rep.
[5] Andrews, D.F., Stafford, J.E. and Wang, Y. (1993). ASTAT: a system for calcu-
lating asymptotic expansions. Univ. of Toronto Tech. Rep.
[6] Andrews, D.F., Stafford, J.E. and Wang, Y. (1994). The calculation of asymptotic
expansions based on estimating equations. Univ. of Toronto Tech. Rep.
[7] Barndorff-Nielsen, O.E. and Blaesild, P. (1986). A note on the calculation of
Bartlett adjustments. J. Roy. Statist. Soc. B 48, 351-358.
[8] Bellhouse, D.R., Philips, R. and Stafford, J.E. (1995). Symbolic operators for
multiple sums. Univ. of Western Ontario Tech. Rep.
[9] Beran, R. and Ducharme, G. (1991). Asymptotic theory for bootstrap methods in
statistics. Centre de Recherches Mathématiques, Univ. of Montreal.
[10] Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall,
London.
[11] Davenport, J.H., Siret, Y. and Tournier, E. (1988). Computer Algebra: Systems
and Algorithms for Doing Algebraic Computation. Academic Press, New York.
[12] DiCiccio, T. and Tibshirani, R. (1987). Bootstrap confidence intervals and boot-
strap approximations. J. Amer. Statist. Assoc. 82, 163-170.
[13] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist.
7, 1-26.
[14] Efron, B. (1981). Nonparametric standard errors and confidence intervals (with
discussion). Canad. J. Statist. 9, 139-172.
[15] Efron, B. (1987). Better bootstrap confidence intervals (with discussion). J.
Amer. Statist. Assoc. 82, 171-200.
[16] Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman
and Hall, New York.
[17] Ferguson, H. (1989). Asymptotic Properties of a Conditional Maximum Likeli-
hood Estimator. Ph.D. Thesis, Univ. of Toronto.
[18] Fisher, N.I. and Hall, P. (1991). Bootstrap algorithms for small samples. J.
Statist. Plann. Inf. 27, 157-169.
[19] Fraser, D.A.S. (1976). Probability and Statistics: Theory and Applications. DAI
Press, Toronto.
[20] Gamerman, D. (1991). Dynamic Bayesian models for survival data. Appl.
Statist. 40, 63-79.
[21] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
[22] Hastings, C., Mosteller, F., Tukey, J.W. and Winsor, C.P. (1947). Low moments
for small samples: a comparative study of order statistics. Ann. Math. Statist. 18,
413-426.
[23] Heller, B. (1991). MACSYMA for Statisticians. Wiley, New York.
[24] Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math.
Statist. 35, 73-101.
[25] Huber, P.J. (1967). The behaviour of maximum likelihood estimates under
nonstandard conditions. Proc. of the 5th Berkeley Symposium on Mathematical
Statistics and Probability 1, 221-233.
[26] Huber, P.J. (1981). Robust Statistics. Wiley, New York.
[27] Johnson, N.L. and Kotz, S. (1969). Distributions in Statistics: Discrete Distri-
butions. Houghton Mifflin, Boston, MA.
[28] Kendall, W.S. (1988). Symbolic computation and the diffusion of shapes of
triads. Advances in Applied Probability 20, 775-797.
[29] Kendall, W.S. (1990). The diffusion of Euclidean shape. In Disorder in Physical
Systems, Ed. G. Grimmett and D. Welsh. Oxford University Press, pp. 203-217.
[30] Kendall, W.S. (1993). Computer algebra in probability and statistics. Statistica
Neerlandica 47, 9-25.
[31] Kendall, W.S. (1994). Computer algebra and yoke geometry I: when is an
expression a tensor? Statistics and Computing 4.
[32] Lawley, D.N. (1956). A general method for approximating the distribution of
likelihood ratio criteria. Biometrika 43, 295-303.
[33] Maeder, R. (1991). Programming in Mathematica. Addison-Wesley, Reading,
MA.
[34] McCullagh, P. (1987). Tensor Methods in Statistics. Chapman and Hall, New
York.
[35] Oldford, R.W. (1985). Bootstrapping by Monte Carlo versus approximating the
estimator and bootstrapping exactly: cost and performance. Commun. Statist.
Ser. B 14, 395-424.
[36] Robinson, G.K. (1982). Behrens-Fisher problem. In Encyclopedia of Statistical
Sciences, Vol. 1, Ed. S. Kotz and N.L. Johnson. Wiley, New York, pp. 205-209.
[37] Scheffé, H. and Tukey, J.W. (1945). Non-parametric estimation I. Validation of
order statistics. Ann. Math. Statist. 16, 187-192.
[38] Silverman, B.W. and Young, G.A. (1987). The bootstrap: to smooth or not to
smooth? Biometrika 74, 469-479.
[39] Stafford, J.E. (1992). Symbolic Computation and the Comparison of Traditional
and Robust Test Statistics. Ph.D. Thesis, Univ. of Toronto.
[40] Stafford, J.E. (1994). Automating the partition of indexes. J. Comp. and Graph.
Statist. 3, 249-260.
[41] Stafford, J.E. (1995). Partitions and the symbolic method. Univ. of Western
Ontario Tech. Rep.
[42] Stafford, J.E. (1996). A note on symbolic Newton-Raphson. (Submitted for
publication.)
[43] Stafford, J.E. and Andrews, D.F. (1993). A symbolic algorithm for studying
adjustments to the profile likelihood. Biometrika 80, 715-730.
[44] Stafford, J.E. and Bellhouse, D.R. (1995). A computer algebra for sample survey
theory. Univ. of Western Ontario Tech. Rep.
[45] Stafford, J.E., Andrews, D.F. and Wang, Y. (1994). Symbolic computation: a
unified approach to studying likelihood. Statistics and Computing 4, 235-245.
[46] Thisted, R.A. (1988). Elements of Statistical Computing. Chapman and Hall,
New York.
[47] Tibshirani, R. (1988). Variance stabilization and the bootstrap. Biometrika 75,
433-444.
[48] Venables, W.N. (1985). The multiple correlation coefficient and Fisher's A statis-
tic. Aust. J. Statist. 27, 173-182.
[49] Wagon, S. (1991). Mathematica in Action. W.H. Freeman, San Francisco.
[50] Wang, Y. (1992). Symbolic Computation for Statistics Using Tensors. Ph.D.
Thesis, Univ. of Toronto.
[51] Wolfram, S. (1988). Mathematica: A System for Doing Mathematics by Com-
puter, 2nd Ed. Addison-Wesley, New York.
[52] Young, G.A. (1988). A note on bootstrapping the correlation coefficient.
Biometrika 75, 370-373.
[53] Young, G.A. and Daniels, H.E. (1990). Bootstrap bias. Biometrika 77, 179-185.