
EMPIRICAL DISTRIBUTIONS

Winfried Stute

University of Giessen

Contents

1 Introduction
  1.1 Counting
  1.2 Diracs and Indicators
  1.3 Empirical Distributions for Real-Valued Data
  1.4 Some Targets
  1.5 The Lorenz Curve
  1.6 Gini's Index
  1.7 The ROC-Curve
  1.8 The Mean Residual Lifetime Function
  1.9 The Total Time on Test Transform
  1.10 The Product Integration Formula
  1.11 Sampling Designs and Weighting

2 The Single Event Process
  2.1 The Basic Process
  2.2 Distribution-Free Transformations
  2.3 The Uniform Case

3 Univariate Empiricals: The IID Case
  3.1 Basic Facts
  3.2 Finite-Dimensional Distributions
  3.3 Order Statistics
  3.4 Some Selected Boundary Crossing Probabilities

4 U-Statistics
  4.1 Introduction
  4.2 The Hajek Projection Lemma
  4.3 Projection of U-Statistics
  4.4 The Variance of a U-Statistic
  4.5 U-Processes: A Martingale Approach

5 Statistical Functionals
  5.1 Empirical Equations
  5.2 Anova Decomposition of Statistical Functionals
  5.3 The Jackknife Estimate of Variance
  5.4 The Jackknife Estimate of Bias

6 Stochastic Inequalities
  6.1 The D-K-W Bound
  6.2 Binomial Tail Bounds
  6.3 Oscillations of Empirical Processes
  6.4 Exponential Bounds for Sums of Independent Random Variables

7 Invariance Principles
  7.1 Continuity of Stochastic Processes
  7.2 Gaussian Processes
  7.3 Brownian Motion
  7.4 Donsker's Invariance Principles
  7.5 More Invariance Principles
  7.6 Parameter Empirical Processes (Regression)

8 Empirical Measures: A Dynamic Approach
  8.1 The Single-Event Process
  8.2 Martingales and the Doob-Meyer Decomposition
  8.3 The Doob-Meyer Decomposition of the Single-Event Process
  8.4 The Empirical Distribution Function
  8.5 The Predictable Quadratic Variation
  8.6 Some Stochastic Differential Equations
  8.7 Stochastic Exponentials

9 Introduction to Survival Analysis
  9.1 Right Censorship: The Kaplan-Meier Estimator
  9.2 Martingale Structures under Censorship
  9.3 Confidence Bands for F under Right Censorship
  9.4 Rank Tests for Censored Data
  9.5 Parametric Modelling in Survival Analysis

10 "Time To Event" Data
  10.1 Sampling Designs in Survival Analysis
  10.2 Nonparametric Estimation in Counting Processes
  10.3 Nonparametric Testing in Counting Processes
  10.4 Maximum Likelihood Procedures
  10.5 Right-Truncation

11 The Multivariate Case
  11.1 Introduction
  11.2 Identification of Defective Survival Functions
  11.3 The Multivariate Kaplan-Meier Estimator
  11.4 Efficiency of the Estimator
  11.5 Simulation Results

12 Nonparametric Curve Estimation
  12.1 Nonparametric Density Estimation
  12.2 Nonparametric Regression: Stochastic Design
  12.3 Consistent Nonparametric Regression
  12.4 Nearest-Neighbor Regression Estimators
  12.5 Nonparametric Classification
  12.6 Smoothed Empirical Integrals

13 Conditional U-Statistics
  13.1 Introduction and Main Result
  13.2 Discrimination
  13.3 Simulation Study

Chapter 1

Introduction

1.1 Counting

Empirical distributions have been designed to provide fundamental mathematical and graphical tools in the context of data analysis.

To begin with, suppose that n customers are asked to name their favorite car from a list {a, b, c, d, e}. Denote by Xj the response of the j-th customer. A quantity of general interest is the number of customers in favor of 'a', say. This number is the outcome of a procedure which consists of n steps and which adds a 'one' to the previous value whenever one obtains another positive vote for 'a': counting.

To formalize things a little bit, let S = {a, b, c, d, e} be the sample space of all possible outcomes. Set A = {a}, a set consisting of only one element. Let 1A be the indicator function associated with A, i.e., the function defined on S attaining the value 1 on A and 0 otherwise:

1A(x) = 1 if x ∈ A and 1A(x) = 0 if x ∉ A.    (1.1.1)

In our example, 1A(Xj) equals 1 if the vote of the j-th customer is in favor of 'a' and 0 otherwise. The absolute number of customers voting for 'a' of course equals

∑_{j=1}^n 1A(Xj).    (1.1.2)

Needless to say, (1.1.2) also applies to other sets B, C, . . . and other X's. As another example, consider the number of days per year at a specific location,

when the maximum temperature exceeds 30°C. The set of interest then is the interval A = [30, ∞), together with its complement Aᶜ = [−273, 30). Quantity (1.1.2) may then be compared with similar results from previous years and is helpful in analyzing possible changes in the local climate. In other situations, the X's may take on multivariate values as well. We may be interested in the income of a person (per year) together with the amount of money spent on travelling. When we split the 'income scale' into finitely many groups A1, . . . , Ak representing classes from 'very poor' to 'very rich', and likewise partition the scale for 'travel expenses' into groups B1, . . . , Bm, we may form new groups Ar × Bs, 1 ≤ r ≤ k, 1 ≤ s ≤ m, with

1Ar×Bs(Xj) = 1

just indicating that the j-th person belongs to the r-th group as far as income and to the s-th group as far as travel expenses are concerned.

More generally, we may split a sample space S into finitely many mutually disjoint sets (groups, cells) C1, . . . , Cm, i.e.,

Ci ∩ Cj = ∅ for i ≠ j.

We call π = {C1, . . . , Cm} a partition of S if ∪_{i=1}^m Ci = S.

E.g., if one is interested in the age distribution of a population, we may put

C1 = [0, 9], C2 = [10, 19], . . . , C10 = [90, 99], C11 = [100,∞).

A person aged 25 thus is assigned to C3. In this example, if one has the impression that the grouping is "too rough" so that relevant information is lost, one may feel free to choose a "finer" partition. In other situations, the relevant sets, say B1, . . . , Bm, may be increasing: B1 ⊂ B2 ⊂ . . . ⊂ Bm. When S is the real line and each Bi = (−∞, ti] is an extended interval, then the Bi's are increasing if and only if t1 ≤ t2 ≤ . . . ≤ tm. A (finite) sequence of B's is called decreasing iff B1 ⊃ . . . ⊃ Bm. If a sequence of B's is increasing or decreasing we call the B's linearly ordered. Note that in the case of extended intervals, we may always arrange the B's in such a way that they are increasing and hence linearly ordered. Finally, to each sequence B1 ⊂ . . . ⊂ Bm we may attach the natural partition π = {C1, . . . , Cm+1} defined by

C1 = B1, C2 = B2 \ B1, . . . , Cm = Bm \ Bm−1, Cm+1 = S \ Bm.

We thus see that absolute frequencies (1.1.2) may be computed in general situations, whether the data are on a nominal, ordinal or metric scale, or whether they are univariate or multivariate.

Obviously, (1.1.2) may become very large with n. A quantity which is better suited to discriminate between the fractions of A, B, . . . in the whole population is the so-called relative frequency. For a set (or group) A this becomes

µn(A) = (1/n) ∑_{j=1}^n 1A(Xj).    (1.1.3)

This quantity is called the empirical distribution (measure) of A:

Figure 1.1.1: Empirical measure of a set (eight data points, two of which fall into A, so µn(A) = 2/8)
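To make (1.1.3) concrete, here is a minimal Python sketch; the sample of responses is invented for illustration and is not part of the text.

    # Relative frequency mu_n(A) = (1/n) * sum_j 1_A(X_j) for the car-brand survey.
    def empirical_measure(data, A):
        """Relative frequency of the set A among the observations."""
        n = len(data)
        return sum(1 for x in data if x in A) / n

    votes = ["a", "c", "a", "b", "e", "a", "d", "b"]   # hypothetical responses X_1, ..., X_8
    print(empirical_measure(votes, {"a"}))             # mu_n({a})    = 3/8
    print(empirical_measure(votes, {"a", "b"}))        # mu_n({a, b}) = 5/8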

If π = {C1, . . . , Cm} is a partition of the sample space S, then

∑_{i=1}^m µn(Ci) = µn(∪_{i=1}^m Ci) = µn(S) = 1.

Hence the empirical frequencies (µn(C1), . . . , µn(Cm)) attain their values in the set of m-tuples (p1, . . . , pm) with

0 ≤ pi ≤ 1 and ∑_{i=1}^m pi = 1.

If the sets B1, . . . , Bm are increasing, then each Xj contained in Bi is also contained in Bi+1. Hence

0 ≤ µn(B1) ≤ µn(B2) ≤ . . . ≤ µn(Bm) ≤ 1.

For data on a nominal scale we may represent, e.g., group frequencies through the following diagram:

Figure 1.1.2: Empirical frequencies for data on a nominal scale

When the data are on an ordinal scale, a graphical representation should take care of this fact in that the "best" group, e.g., could be placed at the left end. Below we present the empirical measures (or fractions) of final grades of the master students at the Mathematical Institute of the University of Giessen in one of the previous years.

Figure 1.1.3: Empirical frequencies for data on an ordinal scale (grades A, B, C, D)

When the data are on a metric scale, grouping of the data into finitely many sets may lead to a considerable loss of information. In such a situation we may consider a much richer class of sets. For example, let A be the class of all extended intervals (−∞, t] with t ∈ R. We then come up with

Fn(t) ≡ µn(−∞, t] = (1/n) ∑_{j=1}^n 1{Xj ≤ t},    (1.1.4)

the empirical distribution function of X1, . . . , Xn. Thus Fn(t) denotes

the relative number of data being less than or equal to t. For any two s and t we either have s ≤ t or t < s. In the first case (−∞, s] is contained in (−∞, t] so that by monotonicity of µn

Fn(s) ≤ Fn(t) for s ≤ t,

i.e., Fn is non-decreasing. If we let s ↓ −∞ and t ↑ ∞, we obtain in the limit

Fn(−∞) = 0 and Fn(∞) = 1.    (1.1.5)

The first statement just says that among real data, there are no Xj's which are less than or equal to −∞, while, for the second, all are less than or equal to +∞. Actually, we already have Fn(x) = 0 for all x < smallest datum and Fn(x) = 1 for all x ≥ largest datum.
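A minimal sketch of (1.1.4), not from the original text and assuming numpy is available: Fn(t) is simply the fraction of observations not exceeding t.

    import numpy as np

    def edf(data):
        """Return the empirical distribution function F_n of the sample `data`."""
        x = np.asarray(data, dtype=float)
        n = x.size
        def F_n(t):
            return np.count_nonzero(x <= t) / n
        return F_n

    sample = [2.3, 0.7, 1.9, 3.1, 0.7]                  # illustrative data
    F_n = edf(sample)
    print(F_n(0.0), F_n(0.7), F_n(2.0), F_n(10.0))      # 0.0, 0.4, 0.6, 1.0

In particular, F_n is 0 below the smallest datum and 1 at and above the largest, as stated above.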

Sometimes practitioners of statistical methodology prefer to look at

1 − Fn(t) = (1/n) ∑_{j=1}^n 1{Xj > t},   t ∈ R,

which is non-increasing in t. For example, in a medical context, the Xj's might represent the times elapsed from surgery until death. Then 1 − Fn(t) equals the fraction of patients who survived at least t time units. Therefore, in this context, 1 − Fn is called the empirical survival function.

For multivariate data X1, . . . , Xn in the Euclidean space Rk we have many possibilities to choose among sets A. For example, we could take for A so-called quadrants

At = ∏_{l=1}^k (−∞, tl] = {x = (x1, . . . , xk)T : xl ≤ tl for all 1 ≤ l ≤ k},

where t = (t1, . . . , tk)T is a given vector determining the quadrant. Here and in the following the superscript "T" denotes transposition, i.e., if not stated otherwise, t etc. denote column vectors. The multivariate empirical distribution function is then defined as

Fn(t) = µn(At) = (1/n) ∑_{j=1}^n 1{Xjl ≤ tl for all 1 ≤ l ≤ k}.

Here Xjl denotes the l-th component of Xj. For ease of notation we write Xj ≤ t iff ≤ holds coordinatewise. Then Fn becomes

Fn(t) = (1/n) ∑_{j=1}^n 1{Xj ≤ t}.

Other possible choices for A are halfspaces of the form

A = {x : ⟨x, β⟩ ≤ c},

where β ∈ Rk and c ∈ R are fixed or may vary as t before, and

⟨x, β⟩ = ∑_{i=1}^k xiβi

denotes the scalar product in Rk.

Figure 1.1.4: A data set split into two pieces by a line

In general, the choice of A will depend on the kind of application one has in mind. Therefore, in the beginning, we leave it open what the A's are. Also the sample space S may be quite general. Later in this monograph, most of our analysis will first focus on S = R, i.e., on real-valued data, with the class of A's being the extended intervals (−∞, t]. After that we shall study multivariate problems. Our final extension will be concerned with the important case when the available information comes through functions over time or space. This will lead us to the study of so-called functional data.

1.2 Diracs and Indicators

The indicator function is intimately related to a so-called point mass (or Dirac measure).

For each x ∈ S, define

δx(A) = 1 if x ∈ A and δx(A) = 0 if x ∉ A.    (1.2.1)

Figure 1.2.1: Evaluation of a Dirac measure (δx(A1) = 0, δx(A2) = 1)

If we compare (1.2.1) with (1.1.1), we immediately obtain

δx(A) = 1A(x). (1.2.2)

Note that for δx the point x is kept fixed with A varying among a class of sets, while for the indicator function, the set A is fixed with x varying in the sample space. It is easy to see that each δx is a (probability) measure. In fact,

0 ≤ δx(A) ≤ 1 with δx(S) = 1

and

δx(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ δx(Ai)

for any mutually disjoint subsets Ai of S. One may check this equality separately for the two cases that x is contained in none of the Ai's and, alternatively, that x is contained in precisely one of them. In the first case both sides equal zero, while in the latter we have one on both sides.

Within this framework, (1.1.3) becomes

µn(A) = (1/n) ∑_{j=1}^n δXj(A)    (1.2.3)

or, omitting sets,

µn = (1/n) ∑_{j=1}^n δXj.    (1.2.4)

In other words, the empirical measure is a sum of equally weighted Diracs.

For many statistical procedures, counting data is not sufficient. Rather we would like to extend the scope of possible applications of empirical distributions by introducing empirical integrals. Starting again with a point mass δx, we have

δx(A) = ∫ 1A dδx = 1A(x).    (1.2.5)

The first equation just expresses the fact that the integral of an indicator function always equals the measure of the underlying set. If we put φ = 1A, the second equation may be rewritten to become

∫ φ dδx = φ(x).    (1.2.6)

The last equality is simple but fundamental and may be easily extended to other φ's, not necessarily of indicator type. Since µn is a linear combination of the point masses δXj, 1 ≤ j ≤ n, we thus obtain

∫ φ dµn = (1/n) ∑_{j=1}^n φ(Xj),    (1.2.7)

i.e., an empirical integral may be most easily evaluated by just averaging the φ-values at the data.

When the data are real-valued, the choice of φ(x) = x yields a representation of the sample mean as an empirical integral:

∫ x dµn = (1/n) ∑_{j=1}^n Xj ≡ X̄n.

Similarly, if we set φ(x) = x^p, we obtain

∫ x^p dµn = (1/n) ∑_{j=1}^n Xj^p,

the empirical p-th moment.

The above identities easily extend to complex-valued functions. E.g., if we set

φλ(x) = exp[iλx],

we obtain the so-called empirical characteristic function (Fourier transform):

ψn(λ) ≡ ∫ φλ dµn = (1/n) ∑_{j=1}^n exp[iλXj].

Here i denotes the imaginary unit satisfying i² = −1. When we take the real-valued version

φ⁰λ(x) = exp[λx],

we come up with the empirical Laplace transform:

ψ⁰n(λ) = ∫ φ⁰λ dµn = (1/n) ∑_{j=1}^n exp[λXj].

Needless to say, if we set φ = 1(−∞,t], then ∫ φ dµn = Fn(t).

To present an example for bivariate data Xj = (Xj1, Xj2)T, 1 ≤ j ≤ n, the choice of

φ(x1, x2) = x1x2

leads to

∫ φ dµn = (1/n) ∑_{j=1}^n Xj1Xj2,

an empirical mixed moment.

This small list of examples already indicates that many traditional quantities known from elementary statistical analysis may be expressed in terms of µn.
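All of these empirical integrals are plain averages of φ over the data. The following sketch (illustrative only, with invented data) evaluates (1.2.7) for a few of the φ's just mentioned.

    import numpy as np

    def empirical_integral(phi, data):
        """Evaluate the empirical integral  (1/n) * sum_j phi(X_j)."""
        x = np.asarray(data)
        return np.mean(phi(x))

    x = np.array([1.0, 2.0, 2.5, 4.0])                              # illustrative sample

    mean   = empirical_integral(lambda t: t,    x)                  # sample mean
    second = empirical_integral(lambda t: t**2, x)                  # empirical 2nd moment
    ecf    = empirical_integral(lambda t: np.exp(1j * 0.5 * t), x)  # psi_n(0.5)

    print(mean, second, ecf)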

When the data are real-valued or multivariate, µn is uniquely determined through Fn. We therefore prefer to replace µn with Fn and write

∫ φ dµn ≡ ∫ φ dFn.

For a deeper analysis, the class of quantities which somehow can be computed from the available data through µn or Fn, respectively, needs to be enlarged.

Going back to (1.2.3), it is tempting to look also at products of µn. For example, the product measure µn ⊗ µn is uniquely determined through its values on rectangles A1 × A2 and is given by

µn ⊗ µn(A1 × A2) = (1/n²) ∑_{i=1}^n ∑_{j=1}^n δXi(A1)δXj(A2) = (1/n²) ∑_{i=1}^n ∑_{j=1}^n 1{Xi ∈ A1, Xj ∈ A2}.

This expression shows that µn ⊗ µn is a new (artificial) measure on the sample space S × S giving masses n⁻² to each of the points (Xi, Xj), 1 ≤ i, j ≤ n. The representation corresponding to (1.2.4) becomes

µn ⊗ µn = n⁻² ∑_{i=1}^n ∑_{j=1}^n δ(Xi,Xj).

Since (1.2.6) holds true for a general Dirac, we get, for an arbitrary function φ = φ(x, y) of two variables:

∫ φ d(µn ⊗ µn) = (1/n²) ∑_{i=1}^n ∑_{j=1}^n φ(Xi, Xj),

a so-called V-statistic. A closely related version is

Un = (1/(n(n − 1))) ∑∑_{i ≠ j} φ(Xi, Xj).

These statistics are called U-statistics (of degree two).

For example, if the Xj ’s are real-valued and if we set φ(x, y) = 12(x − y)2,

we obtain the sample variance

σ2n =1

n− 1

n∑j=1

(Xj − Xn)2.

Page 16: 2013 W Stute Empirical Distributions

1.2. DIRACS AND INDICATORS 11

Actually,

1

2n(n− 1)

n∑i=1

n∑j=1

i=j

(Xi −Xj)2 =

1

2n(n− 1)

n∑i=1

n∑j=1

(Xi −Xj)2

=1

n− 1

n∑i=1

X2i − 1

n(n− 1)

n∑i=1

n∑j=1

XiXj

=1

n− 1

n∑i=1

X2i − n

n− 1X2

n =1

n− 1

n∑i=1

(Xi − Xn)2.
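To make the connection concrete, the following brute-force sketch (illustrative data, not from the text) evaluates the degree-two U-statistic with kernel φ(x, y) = ½(x − y)² and compares it with the usual sample variance.

    import numpy as np

    def u_statistic(phi, data):
        """Degree-two U-statistic: average of phi(X_i, X_j) over all pairs i != j."""
        x = np.asarray(data, dtype=float)
        n = x.size
        total = sum(phi(x[i], x[j]) for i in range(n) for j in range(n) if i != j)
        return total / (n * (n - 1))

    x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])                  # illustrative data
    U_n = u_statistic(lambda a, b: 0.5 * (a - b) ** 2, x)
    print(U_n, np.var(x, ddof=1))                            # both equal the sample variance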

Our discussion may have already indicated that point measures or Diracs are just the cornerstones of a hierarchy of more or less sophisticated statistical tools. After a moment of thought this is not surprising, because in a statistical framework the relevant information is often obtained only through a set of finitely many (discrete) data so that it is only natural to put the weights there.

So far we have restricted ourselves to equal weighting. As it will turn out, there will be many data situations when we are led to masses which are not equal. For further discussion, we rewrite (1.2.4) as

µn = ∑_{j=1}^n (1/n) δXj.    (1.2.8)

From a purely mathematical point of view this is trivial if not ridiculous. We prefer (1.2.8) over (1.2.4) whenever we want to point out that 1/n is a weight attached to Xj. When the weight Wjn given to Xj depends also on j, the resulting (generalized) empirical distribution becomes

µn = ∑_{j=1}^n Wjn δXj.    (1.2.9)

In such a situation it will not be possible to keep the weights separated from the Xj's so that (1.2.8) is the right way to look at µn. The empirical integrals w.r.t. µn in (1.2.9) equal

∫ φ dµn = ∑_{j=1}^n Wjn φ(Xj).    (1.2.10)

An introductory discussion of examples where it may be necessary to replace the standard weights n⁻¹ by more sophisticated Wjn's will appear in Section 1.5. Generally speaking, a proper choice of Wjn will depend on the problem one has in mind as well as on the structure of the data. In many situations the Wjn will depend on the observations and are therefore random. Though the Wjn's may be very complicated, there will be principles which may guide us in finding and analyzing the Wjn's. It is the aim of this monograph to develop such principles and to study the resulting empirical quantities.

Our final extension of an empirical measure will incorporate weights depending also on a variable x, say. In other words, the weights Wjn are functions rather than constants and the associated empirical distribution becomes

µn = ∑_{j=1}^n Wjn(x) δXj.    (1.2.11)

The x’s need not necessarily be taken from the same S but may vary inanother set. More clearly, we have

µn(x,A) =n∑

j=1

Wjn(x)1Xj∈A (1.2.12)

and ∫φ(y)µn(x, dy) =

n∑j=1

Wjn(x)φ(Xj). (1.2.13)

Adopting terminology from probability theory, we call µn(x, A) an empirical kernel. The integrals (1.2.13) computed w.r.t. these µn are functions in x. As one may expect, these functions are candidates for estimators of functions rather than parameters like the mean or variance of a population.

The kernel µn(x, A) is called an empirical probability kernel iff for all x we have Wjn(x) ≥ 0 and

∑_{j=1}^n Wjn(x) = 1.

In some situations the sum of weights may be less than or equal to one:

0 ≤ ∑_{j=1}^n Wjn(x) ≤ 1,

in which case we call µn an empirical sub-probability kernel or just an empirical sub-distribution. When some of the weights are negative, µn is called a signed empirical distribution (resp. kernel).
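As a preview of such x-dependent weights, the sketch below uses normalized Gaussian kernel weights Wjn(x), which makes µn(x, ·) an empirical probability kernel. This particular choice of weights is only one possible example (it anticipates the smoothing methods of later chapters) and is not a construction taken from this section.

    import numpy as np

    def kernel_weights(x, data, h):
        """Gaussian kernel weights W_jn(x), normalized so that they sum to 1."""
        d = np.asarray(data, dtype=float)
        k = np.exp(-0.5 * ((x - d) / h) ** 2)
        return k / k.sum()

    def weighted_empirical_integral(phi, x, data, h):
        """Evaluate  sum_j W_jn(x) * phi(X_j),  cf. (1.2.13)."""
        d = np.asarray(data, dtype=float)
        w = kernel_weights(x, d, h)
        return np.sum(w * phi(d))

    data = [0.4, 1.1, 1.9, 2.5, 3.2]                 # illustrative observations
    print(weighted_empirical_integral(lambda y: y, x=2.0, data=data, h=0.7))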

1.3 Empirical Distributions for Real-Valued Data

In this section we present a preliminary discussion of Fn with standard weights n⁻¹, when the observations are real-valued. In this case we may arrange the data X1, . . . , Xn in increasing order. We then come up with the so-called order statistics

X1:n ≤ . . . ≤ Xn:n.

Sometimes the smallest and the largest datum, X1:n and Xn:n, are called extreme order statistics. When the data are pairwise distinct, i.e., if

X1:n < . . . < Xn:n,

there are no ties.

From the definition of Fn it follows that

• Fn is nondecreasing

• Fn(x) = 0 for all x < X1:n and Fn(x) = 1 for all x ≥ Xn:n

• Fn is constant between successive (distinct) order statistics

• Fn is continuous from the right, i.e., for each x ∈ R: Fn(x) = lim_{y↓x} Fn(y)

• Fn has left-hand limits: Fn(x−) = lim_{y↑x} Fn(y) exists

• Fn has discontinuities only at the order statistics Xj:n.

The jump size

Fn{x} = Fn(x) − Fn(x−)

vanishes for all x outside the data set. When x = Xj for some j, then

Fn{Xj} = dj/n,

where dj is the number of data with the same value as Xj. When no ties are present, then dj = 1. In this case, Fn has exactly n jumps with jump size n⁻¹. Hence for large n, Fn is a function with many discontinuities but small jump heights.

Figure 1.3.1: Example of an empirical distribution function for n = 100 observations

Note also that for the computation of Fn it is enough to know the set of ordered data. To recover the value of Xj from the order statistics we also need the position of Xj within the sample. This leads us to the rank of Xj.

Definition 1.3.1. Assume that there are no ties. If Xj = Xi:n, then the index i is called the rank of Xj and will be denoted by Rj.

In a particular data situation Xj may denote the result of the j-th experiment or the response of the j-th person in a medical or socio-demographic survey. As another possibility the index j may refer to time. For example, in a financial context, Xj may denote the price of a stock on day j at the closing of the session. In each case, and contrary to Xi:n, the index j needs to be chosen before the experiment is run.

Example 1.3.2. In this example we record the times (in seconds) in the 100m final (men) at the Olympic Games 2004 in Athens. Xj corresponds to the time of athlete number j in the alphabetic order as outlined below:

Collins     X1 = 10.00
Crawford    X2 = 9.89
Gatlin      X3 = 9.85
Greene      X4 = 9.87
Obikwelu    X5 = 9.86
Powell      X6 = 9.94
Thompson    X7 = 10.10
Zakari      X8 = DNF

From this we see that X1:n = X3 = 9.85 with R3 = 1, i.e., the winner of the gold medal was Gatlin.

By the definition of ranks, we have

Xj = XRj:n.    (1.3.1)

Hence the original data X1, . . . , Xn can be reconstructed from the set of order statistics and the ranks. In other words, ranks and order statistics contain the same information as the original (unordered) sample.

The empirical distribution function provides a convenient tool to connect and express these quantities. First, we have

Rj = nFn(Xj).    (1.3.2)

This means, to compute the ranks we need to evaluate Fn at very special t's, namely the (original) data. Secondly, when we compute Fn at the ordered data, we obtain a fixed value:

Fn(Xi:n) = i/n.    (1.3.3)

Conversely, if we fix an arbitrary 0 < u ≤ 1, then we find an x ∈ R such that

Fn(x) = u

only when u equals i/n for some 1 ≤ i ≤ n. Therefore, to get some kind of inverse function for Fn, we have to adopt a definition which at first sight looks a little strange.

Definition 1.3.3. Let Fn be the standard empirical distribution function. Then the associated empirical quantile function Fn⁻¹ is defined on the interval 0 < u ≤ 1 and is given by

Fn⁻¹(u) = inf{t : Fn(t) ≥ u}.

This definition also applies to Fn's with other weights Wjn(x). For standard weights, we may represent the order statistics in terms of Fn⁻¹, namely

Xi:n = Fn⁻¹(i/n).    (1.3.4)

Also the following properties of Fn and Fn⁻¹ are readily checked:

Fn⁻¹(u) = Xi:n for (i − 1)/n < u ≤ i/n and 1 ≤ i ≤ n    (1.3.5)

and

Fn(Fn⁻¹(u)) ≥ u for 0 < u ≤ 1    (1.3.6)

with equality in (1.3.6) only when u = i/n, 1 ≤ i ≤ n. Note that, by (1.3.5), Fn⁻¹ is left-hand continuous.

The u-quantile Fn⁻¹(u) divides the sample into the lower 100 × u percent and the upper 100 × (1 − u) percent of the data. Fn⁻¹(1/2) is the sample (or empirical) median, a popular parameter for the central location of a data set. Other quantiles which have found some interest are the lower and upper quartiles, Fn⁻¹(1/4) and Fn⁻¹(3/4). The interquartile range

Fn⁻¹(3/4) − Fn⁻¹(1/4)    (1.3.7)

is a convenient means to measure the spread of the central part of the data.
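The following sketch (assuming no ties) collects these objects in code: order statistics, ranks, the empirical quantile function of Definition 1.3.3, the sample median and the interquartile range (1.3.7). The data are the seven finishing times of Example 1.3.2 (Zakari's DNF omitted).

    import numpy as np

    def empirical_quantile(data, u):
        """F_n^{-1}(u) = X_{i:n} for (i-1)/n < u <= i/n, 0 < u <= 1 (Definition 1.3.3)."""
        x = np.sort(np.asarray(data, dtype=float))
        n = x.size
        i = int(np.ceil(n * u))              # smallest i with i/n >= u
        return x[i - 1]

    data = np.array([10.00, 9.89, 9.85, 9.87, 9.86, 9.94, 10.10])
    order_stats = np.sort(data)              # X_{1:n} <= ... <= X_{n:n}
    ranks = data.argsort().argsort() + 1     # R_j, valid when there are no ties

    median = empirical_quantile(data, 0.5)
    iqr = empirical_quantile(data, 0.75) - empirical_quantile(data, 0.25)
    print(order_stats, ranks, median, iqr)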


1.4 Some Targets

So far we have introduced and discussed several elementary quantities related to Fn. In this section we formalize their mathematical background. This enables us to express and view our statistics as estimators of various targets. In particular, we provide the mathematical framework for the Xj's. Again we first restrict ourselves to real-valued X's.

In mathematical terms, the Xj are random variables defined on a probability space (Ω, A, P). For the time being we assume that all X's have the same distribution function (d.f.)

P({ω ∈ Ω : Xj(ω) ≤ t}) ≡ F(t)    (Xj ∼ F).

The random variables Xj are assumed to be measurable w.r.t. the σ-algebra A, guaranteeing that the events {Xj ≤ t} belong to A. The quantity F(t) equals the probability that in a future draw from the same population, the attained value does not exceed t. Typically, F is unknown and needs to be estimated from a set of data, often being the only source of available information.

The function F has the following properties shared by each d.f. G:

(i) G is non-decreasing

(ii) G is right-hand continuous with left-hand limits

(iii) lim_{t↓−∞} G(t) = 0 and lim_{t↑∞} G(t) = 1

For G = Fn, conditions (i) – (iii) have been discussed in Section 1.3. For other G's like G = F, (i) – (iii) follow from elementary properties of the underlying probability measure P.

The notion of a quantile function also extends from Fn to arbitrary d.f.'s G:

G⁻¹(u) = inf{t ∈ R : G(t) ≥ u},   0 < u ≤ 1.    (1.4.1)

F⁻¹ constitutes the 'true' quantile function. If G(t) < 1 for all t ∈ R, then G⁻¹(1) = ∞. Generally, G⁻¹(1) equals the smallest upper bound for the support of G. Inequality (1.3.6) holds for a general G. Note that G(G⁻¹(u)) = u if G is continuous.

At this point of the discussion it is already useful to look at F and Fn as two 'points' in the space of all distributions, one (G = F) representing the whole but unknown population and the other one (G = Fn) revealing some, though limited, information about Ω through X. With this in mind we may consider all quantities discussed in the previous sections also for G = F.

A functional T, which maps G into a real number or a vector T(G), is called a statistical functional. A simple example of a linear functional is defined, for a given φ, as

T(G) = ∫ φ(x) G(dx),

provided the integral exists. The functional T may be also evaluated at measures which are not necessarily probability measures. T is called linear because for any two G1, G2 for which T is well-defined and any real numbers a1 and a2 we have

T (a1G1 + a2G2) = a1T (G1) + a2T (G2).

While

T(F) = ∫ φ(x) F(dx)

denotes the theoretical quantity of interest,

T(Fn) = ∫ φ(x) Fn(dx)

equals the estimator obtained by replacing F with Fn. Furthermore, by transformation of integrals,

T(F) = E[φ(X)],   X ∼ F,

while

T(Fn) = E[φ(X*)],   X* ∼ Fn.

This (very) simple example already reveals the possibility that we have more or less two options to represent quantities of interest:

(i) a representation in terms of random variables

(ii) a representation in terms of distributions

No representation is superior to the other one but each has its own value. A representation of expectation and variance, e.g., through variables still exhibits that originally these quantities were attached to variables describing some particular random phenomena. On the other hand, the representation through F gives us a clue how to estimate the target, namely just by replacing F with Fn (plug-in method).

The V-statistic

Vn = (1/n²) ∑_{i=1}^n ∑_{j=1}^n φ(Xi, Xj) = ∫∫ φ(x, y) Fn(dx) Fn(dy)

equals T(Fn), where now

T(G) = ∫∫ φ(x, y) G(dx) G(dy)

is a so-called V-functional of degree two. The function φ is called the kernel associated with T.

We now list some targets which may be of some general interest. This list of course is not complete and is intended only to give a rough idea what could be the subject of research. In each case, we assume without further mentioning that all integrals exist at G = F.

In the following we assume that X ∼ F .

Example 1.4.1. The choice of φ(x) = x^k leads to the k-th moment of X,

EX^k = ∫ x^k F(dx).

The empirical estimator becomes

∫ x^k Fn(dx) = (1/n) ∑_{j=1}^n Xj^k.

Example 1.4.2. If we set φt(x) = 1{x ≤ t}, this leads to

∫ φt(x) F(dx) = F(t),

respectively

∫ φt(x) Fn(dx) = Fn(t).

Note that if we consider the collection {φt : t ∈ R} of all φt's, we obtain a stochastic process or random function indexed by t ∈ R.

Example 1.4.3. The family of functions φλ(x) = exp[λx] leads to the Laplace transform of X (resp. F):

λ → ∫ exp[λx] F(dx) = E e^{λX}

along with its estimator

λ → ∫ exp[λx] Fn(dx) = (1/n) ∑_{j=1}^n exp[λXj],

the empirical Laplace transform.

Example 1.4.4. The complex-valued version φλ(x) = exp[iλx] leads to the Fourier transform of X (resp. F):

λ → ∫ exp[iλx] F(dx) = E e^{iλX}

along with its estimator

λ → ∫ exp[iλx] Fn(dx) = (1/n) ∑_{j=1}^n exp[iλXj].

Here again, i is the imaginary unit.

The Fourier transform uniquely determines the distribution of a random variable X. Hence the empirical version is a candidate to detect distributional features of X.

Example 1.4.5. Another quantity which uniquely determines the distribution of a random variable X is its cumulative hazard function

ΛF(t) = ∫_{(−∞,t]} F(dx)/(1 − F(x−)) = ∫ [1{x ≤ t}/(1 − F(x−))] F(dx).

In this case, the function φ consists of two components, the numerator being a function only of x, while the denominator depends on x through F. The cumulative hazard function plays an outstanding role in survival analysis and reliability, when the data are lifetimes (failure times) of individuals or technical units and are therefore nonnegative. Hence in this case it is sufficient to extend the integral over the positive real line. For a continuous F, the left-hand limit F(x−) coincides with F(x). At this point it is not clear why one should not use F(x) instead of F(x−). For a preliminary answer we compute the empirical cumulative hazard function estimator, the so-called Nelson-Aalen estimator:

Λn(t) = ∫_0^t Fn(dx)/(1 − Fn(x−)) = ∑_{i=1}^n [ 1{Xi ≤ t} / ∑_{j=1}^n 1{Xj ≥ Xi} ].

Note that the choice of Fn(x−) in the denominator is responsible for Xj ≥ Xi (and not Xj > Xi). Because of Xi ≥ Xi, the sum is at least 1 so that the ratio is well-defined. If, as in Example 1.4.2, we let t vary in some interval, say, we obtain a stochastic process in t.

When there are no ties we may compute Λn also by summation along the order statistics. This yields

Λn(t) = ∑_{i=1}^n 1{Xi:n ≤ t}/(n − i + 1).

We thus see that, like Fn, also Λn jumps at the data, the weights 1/(n − i + 1) assigned to the i-th order statistic, however, being strictly increasing as we move to the right.

While the total mass under (the standard) Fn equals one, the Nelson-Aalen estimator has total mass

∑_{i=1}^n 1/(n − i + 1) = ∑_{i=1}^n 1/i ∼ ln n → ∞ as n → ∞.    (1.4.2)

This total cumulative mass is attained for all t ≥ Xn:n so that Λn is a constant function there.

Figure 1.4.1: Example of a Nelson-Aalen estimator for n = 100 observations

In the example underlying Figure 1.4.1, we feature some effects which are typical for the Nelson-Aalen estimator. For 0 ≤ t ≤ 3 the function is close to a smooth function; here it is the straight line y = t. In the extreme right tail, there are fewer observations with weights becoming larger and larger. Consequently there is no averaging effect and Λn is not a reliable estimate of ΛF.
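A small sketch of the estimator (assuming no ties, as in the order-statistic formula above; the lifetimes are invented): the Nelson-Aalen estimator puts mass 1/(n − i + 1) at the i-th order statistic.

    import numpy as np

    def nelson_aalen(data):
        """Return t -> Lambda_n(t) = sum_{i: X_{i:n} <= t} 1/(n - i + 1), for samples without ties."""
        x = np.sort(np.asarray(data, dtype=float))
        n = x.size
        increments = 1.0 / (n - np.arange(1, n + 1) + 1)    # 1/n, 1/(n-1), ..., 1
        cum = np.cumsum(increments)
        def Lambda_n(t):
            k = np.count_nonzero(x <= t)                    # number of order statistics <= t
            return cum[k - 1] if k > 0 else 0.0
        return Lambda_n

    lifetimes = [0.8, 1.7, 2.4, 3.9, 5.2]     # illustrative failure times
    L = nelson_aalen(lifetimes)
    print(L(2.0), L(10.0))                    # 1/5 + 1/4 = 0.45,  and 1/5 + 1/4 + 1/3 + 1/2 + 1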

Example 1.4.6. Though the empirical distribution function is a nonparametric estimator in the sense that for its computation as well as its properties no parametric model assumption is required, our view turns out to be useful also in parametric statistics. For this, let M = {f(·, θ) : θ ∈ Θ} be a family of densities parametrized by a set Θ of finite-dimensional vectors. In the case of independent observations from M, i.e., the true density equals

f(·, θ0) for some unknown θ0, the log-likelihood function equals

θ → ∑_{i=1}^n ln f(Xi, θ),

which after normalization becomes

θ → ∫ ln f(x, θ) Fn(dx).    (1.4.3)

We thus see that log-likelihood functions are empirical integrals parametrized by θ. Under some smoothness conditions the maximizer of (1.4.3), i.e., the Maximum Likelihood Estimator (MLE), satisfies

∫ ψ(x, θ) Fn(dx) = 0,

where

ψ(x, θ) = (∂/∂θ) ln f(x, θ).
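As a toy illustration (not from the text), take the exponential family f(x, θ) = θ e^{−θx}, x > 0. Maximizing the empirical integral (1.4.3) over a grid recovers the familiar MLE θ̂ = 1/X̄n; the data below are invented.

    import numpy as np

    def empirical_log_likelihood(theta, data):
        """(1/n) * sum_j ln f(X_j, theta) for the exponential density f(x, theta) = theta * exp(-theta*x)."""
        x = np.asarray(data, dtype=float)
        return np.mean(np.log(theta) - theta * x)

    data = np.array([0.5, 1.2, 0.3, 2.0, 0.9])          # illustrative positive observations
    grid = np.linspace(0.05, 5.0, 2000)
    values = [empirical_log_likelihood(th, data) for th in grid]
    theta_hat = grid[int(np.argmax(values))]

    print(theta_hat, 1.0 / data.mean())                 # grid maximizer is close to 1 / sample mean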

Example 1.4.7. Sometimes the target is known to take on a specific value, given some proper assumptions hold. For example, assume F is symmetric at zero, i.e., X = −X in distribution when X ∼ F, and φ is odd:

φ(−x) = −φ(x).

Then

∫ φ(x) F(dx) = 0.

Important examples for φ are

φ(x) = sign(x) = 1 if x > 0, 0 if x = 0, −1 if x < 0,

the sign function, or

φt(x) = sin(tx),   t ∈ R.

Example 1.4.8. As another example, consider a two-sample situation, in which the first sample is from F and the second is from G. For a score function J set

∫ J(F(x)) G(dx).

If F = G and F is continuous, the last integral becomes ∫_0^1 J(u) du, a known value.

The empirical version becomes

∫ J(Fn(x)) Gm(dx),    (1.4.4)

an empirical integral with φ(x) = J(Fn(x)). Here, Fn and Gm are the empirical d.f.'s of the two samples X1, . . . , Xn and Y1, . . . , Ym from (the unknown) F and G, respectively. Applying our general formula (1.2.7), we get

∫ J(Fn(x)) Gm(dx) = (1/m) ∑_{i=1}^m J(Fn(Yi)).    (1.4.5)

Note that since Yi is from the second sample and Fn is from the first, nFn(Yi) is not the rank of Yi. Rather it provides us with the location (or position) of Yi w.r.t. the first sample. E.g., when we put J(u) = u, then (1.4.4) equals

(1/(nm)) ∑_{i=1}^m ∑_{j=1}^n 1{Xj ≤ Yi}.

This integral is related to the Wilcoxon two-sample rank statistic. At the same time it also constitutes an example of a two-sample U-statistic.

Since under F = G (continuous)

∫ F(x) G(dx) = ∫ F(x) F(dx) = ∫_0^1 u du = 1/2,

a known value, we may expect that (1.4.5) for J(u) = u is at least close to 1/2 when F = G. At the same time this integral, namely (1.4.5), should hopefully be far away from 1/2 if F ≠ G. In general, the interesting question will then be how to choose J in order to get large power on certain alternatives to F = G.
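A brief sketch of (1.4.5) with J(u) = u (simulated data, purely for illustration): the statistic averages Fn over the second sample and should lie near 1/2 when both samples come from the same continuous distribution.

    import numpy as np

    def two_sample_statistic(x_sample, y_sample, J=lambda u: u):
        """Compute (1/m) * sum_i J(F_n(Y_i)),  cf. (1.4.5)."""
        x = np.asarray(x_sample, dtype=float)
        y = np.asarray(y_sample, dtype=float)
        F_n_at_y = np.array([np.count_nonzero(x <= yi) / x.size for yi in y])
        return np.mean(J(F_n_at_y))

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)                 # first sample, from F
    y = rng.normal(size=40)                 # second sample, here G = F
    print(two_sample_statistic(x, y))       # close to 1/2 since F = G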

We see from Examples 1.4.7 and 1.4.8 that empirical integrals may also be utilized for tests in that a null hypothesis such as symmetry or equality in distribution is rejected if the empirical term deviates too much from a pre-specified value.

Many of the concepts discussed so far have obvious extensions to the k-variate case. As we know, a convenient class of sets which determines a distribution is the family of quadrants

Q = (−∞, t],

where −∞ = (−∞, . . . , −∞)T and t = (t1, . . . , tk)T. The pertaining empirical d.f. was defined as

Fn(t) = (1/n) ∑_{j=1}^n 1{Xj ≤ t},   t ∈ Rk.

New questions may come up now which have no counterpart in the univariate case.

Example 1.4.9. We may be interested in the dependence structure of the X-coordinates. A simple measure of association between the coordinates of a bivariate vector X = (X1, X2)T, say, is the correlation

Corr X = Cov(X1, X2) / √(Var X1 · Var X2).

We have already seen in Section 1.2 that variances are simple (quadratic) functionals of a distribution function. Denoting with F the joint distribution of X1 and X2, also the covariance allows for a simple representation as a (statistical) functional of F:

Cov(X1, X2) = (1/2) ∫∫ (x1 − y1)(x2 − y2) F(dx1, dx2) F(dy1, dy2).

Example 1.4.10. Much work, e.g., has been devoted to testing for independence of X1 and X2. Write F1 and F2 for the marginal distributions of X1 and X2, respectively. Independence is tantamount to

F(t1, t2) − F1(t1)F2(t2) = 0    (1.4.6)

for all t1, t2 ∈ R. A test of (1.4.6) may then be based on its empirical analogue

Fn(t1, t2) − F1n(t1)F2n(t2).

Summarizing, the purpose of this section so far was to demonstrate that quantities or structures which may be expressed through the true distribution function have natural sample analogues in terms of empirical d.f.'s. The empirical versions then serve as estimators of the theoretical quantities or may lead to statistics for testing hypotheses about unknown parameters or structures.

In most of our examples the entities of interest as well as their estimators could be written in an explicit form. The log-likelihood function was an exception in that the estimator was implicitly defined as a solution of an empirical equation.

So far we focused on empirical masses attached to fixed sets A, B, . . . For extended intervals (−∞, t], Fn(t) represents the cumulative masses attached to all Xj's less than or equal to t. Therefore Fn is also called the cumulative empirical d.f. Similarly, F is the theoretical cumulative d.f. representing the whole population. As such, both Fn and F have a global feature. At the same time there are targets which characterize the distribution of a population in a local way.

For this, assume again that the Xj's are real-valued from a distribution function F. If F is differentiable with derivative

f(x) = lim_{h↓0} [F(x + h) − F(x)]/h,    (1.4.7)

the (Lebesgue) density of F, then for small h > 0 we have

F(x + h) − F(x) ∼ h f(x).    (1.4.8)

These terms exhibit the local structure of F in a neighborhood of x ∈ R. Practitioners of statistical methodology often prefer a density as a representative of a distribution and to recover masses as areas under the density. There are other important quantities which are of local type and which correspond to cumulative functions.

Example 1.4.11. If F has a density f, then the cumulative hazard function ΛF equals

ΛF(t) = ∫_0^t f(x)/(1 − F(x)) dx = −ln(1 − F(t)).

Hence

1 − F(t) = exp[−ΛF(t)].

The function

λ(x) = f(x)/(1 − F(x))

is the hazard function of F. We may use (1.4.7) to represent λ(x) as

λ(x) = lim_{h↓0} [F(x + h) − F(x)]/[h(1 − F(x))] = lim_{h↓0} P(x < X ≤ x + h | x < X)/h,

where

P(A|B) = P(A ∩ B)/P(B)

is the conditional probability of an event A given an event B (with positive probability). Hence similarly to (1.4.8) we get

P(x < X ≤ x + h | x < X) ∼ h λ(x).

In other words, the probability that, e.g., a technical unit may fail between x and x + h given that it worked until x is proportional to λ(x). The quantity x may be interpreted as the current age of the system.

If X denotes a discrete lifetime attaining the values i = 0, 1, 2, . . ., the hazard function is defined as

λ(i) = P(i < X ≤ i + 1 | i < X) = P(X = i + 1)/P(i < X).

Hence λ(i) equals the probability that, given a person has survived i time units, death will occur in the next period.

Table 1.4.1 presents the hazard functions of the German male and female population computed for some selected decades in the last century.

The pertaining plot in Figure 1.4.2 reveals that the hazard function of a human population is likely to have the shape of a bathtub. If we compare the local risks for German women in the 1910's and 1980's, it becomes apparent that significant improvements as to the rate of mortality have been obtained for young children (in particular babies) and "elderly" women exceeding 60 years of age.

Figure 1.4.2: Plot of two selected hazard functions (1000 × λ versus age in years) for the female population in Germany, 1901/10 and 1980/82

Example 1.4.12. Another important local quantity is the so-called regression function. For this, recall that we already discussed some measures of association for the two components of an observational vector (X1, X2)T. The purpose then was to quantify the degree of dependence between X1 and X2. In regression analysis one is interested in the kind of dependence between X1 and X2. This is of great practical importance when X2, e.g., is not observable and needs to be predicted on the basis of X1. Hence X1 is sometimes called the independent variable (dose, input) while X2 is the dependent variable (response, output). To distinguish between the different roles played by X1 and X2, we prefer the notation X and Y for X1 and X2, respectively.

Table 1.4.1: 1000 × λ(i)

Males
         1901/10  1924/26  1932/34  1949/51  1960/62  1970/72  1980/82
i = 0     202,34   115,38    85,35    61,77    35,33    26,00    13,07
i = 1      39,88    16,19     9,26     4,16     2,31     1,55     0,92
i = 2      14,92     6,36     4,50     2,46     1,40     1,00     0,63
i = 5       5,28     2,42     2,32     1,21     0,80     0,73     0,44
i = 10      2,44     1,42     1,33     0,70     0,45     0,47     0,29
i = 15      2,77     1,94     1,57     1,04     0,75     0,79     0,52
i = 20      5,04     4,27     2,83     1,88     1,85     2,00     1,54
i = 25      5,13     4,39     2,97     2,23     1,69     1,61     1,28
i = 30      5,56     4,05     3,24     2,28     1,70     1,70     1,32
i = 35      6,97     4,25     3,94     2,76     2,09     2,10     1,69
i = 40      9,22     5,35     4,82     3,52     2,95     3,20     2,78
i = 45     12,44     7,23     6,58     5,16     4,43     4,75     4,44
i = 50     16,93    10,30     9,39     8,50     7,39     7,71     7,33
i = 55     23,57    15,48    14,18    12,75    12,97    12,06    10,97
i = 60     32,60    23,62    21,72    18,91    22,04    20,44    18,25
i = 65     47,06    36,92    34,04    29,06    34,33    34,59    28,34
i = 70     69,36    58,08    54,01    45,79    50,87    55,92    45,76
i = 75    106,40    93,91    87,40    75,08    78,85    84,15    75,30
i = 80    157,87   141,96   136,68   121,37   122,97   122,86   115,52
i = 85    231,60   212,85   207,69   190,15   188,02   180,95   169,05
i = 90    320,02   284,69   287,73   282,56   279,21   259,70   227,77

Females
         1901/10  1924/26  1932/34  1949/51  1960/62  1970/72  1980/82
i = 0     170,48    93,92    68,39    49,09    27,78    19,84    10,37
i = 1      38,47    14,93     8,23     3,60     2,01     1,31     0,84
i = 2      14,63     5,74     3,98     2,15     1,08     0,80     0,50
i = 5       5,31     2,19     2,15     0,99     0,56     0,50     0,29
i = 10      2,56     1,20     1,14     0,47     0,28     0,28     0,20
i = 15      3,02     1,81     1,30     0,68     0,40     0,45     0,33
i = 20      4,22     3,32     2,27     1,15     0,62     0,65     0,44
i = 25      5,37     3,94     2,70     1,35     0,73     0,63     0,51
i = 30      5,97     4,14     3,01     1,65     0,99     0,77     0,66
i = 35      6,86     4,52     3,48     1,99     1,38     1,16     0,96
i = 40      7,71     5,31     4,22     2,55     2,01     1,78     1,42
i = 45      8,54     6,44     5,46     3,68     2,99     2,82     2,23
i = 50     11,26     8,86     7,91     5,46     4,45     4,56     3,50
i = 55     16,19    12,73    11,53     8,13     6,72     5,38     5,30
i = 60     24,73    19,47    17,46    12,91    10,85     9,88     8,69
i = 65     39,60    31,55    28,53    22,24    18,62    17,11    13,39
i = 70     62,06    51,98    47,61    39,11    32,85    30,19    22,77
i = 75     98,31    85,29    80,33    68,11    59,61    54,29    43,11
i = 80    146,50   133,71   126,51   114,02   103,31    94,43    76,44
i = 85    217,39   198,37   193,66   173,62   166,26   155,88   132,62
i = 90    295,66   263,08   273,64   259,16   248,21   234,20   206,54

Source: Statistisches Bundesamt.

If Y has a finite expectation, a result in probability theory guarantees the existence of a function m = m(x) such that

Y = m(X) + ε,    (1.4.9)

where ε is orthogonal to X, i.e., the conditional expectation of ε, the noise variable, given X equals zero:

E(ε|X) = 0.    (1.4.10)

Given X = x (and not Y), the optimal predictor of Y then equals m(x). In probability theory the notation

m(x) = E[Y |X = x]

is very common and intuitive. It points out that m(x) equals the mean output when the input equals x. Unless m is a constant function, the expected response therefore depends on the value attained by X. Unfortunately, theory only provides us with the existence of m. If (X, Y) admits a bivariate Lebesgue density f = f(x, y), we have

m(x) = ∫ y f(x, y) dy / f1(x),    (1.4.11)

where f1 is the (marginal) density of X. Equation (1.4.11) exhibits the local flavor of m but also points out that in a real world situation, m is unknown and depends on unknown quantities like f and f1.

We now introduce the cumulative process pertaining to m. Again we restrict ourselves to the bivariate case. Let F1 be the d.f. of X and set

I(t) = E[Y 1{X ≤ t}].

When Y ≡ 1, we obtain I(t) = F1(t). In the general case, we get, using well-known properties of conditional expectations:

I(t) = E[E(Y 1{X ≤ t}|X)] = E[1{X ≤ t} E(Y |X)] = E[1{X ≤ t} m(X)] = ∫_{(−∞,t]} m(x) F1(dx).

The function I is the integrated (or cumulative) regression function, where integration takes place w.r.t. the (marginal) distribution of X. Needless to say, this concept has an obvious extension to multivariate X's. If a sample (Xj, Yj), 1 ≤ j ≤ n, from the same distribution as (X, Y) is available, the empirical estimator of I equals

In(t) = (1/n) ∑_{j=1}^n Yj 1{Xj ≤ t}.

Again, when all Yj = 1, we obtain the empirical d.f. of the Xj's. In the general case, In is an example of a so-called marked point process with jumps at Xj, 1 ≤ j ≤ n, and marks (or jump heights) Yj/n. These marks may be random and take on positive as well as negative values. The choice of the indicator 1{X ≤ t} has been made only to make I comparable with F. Practitioners may feel free to replace this indicator, e.g., by 1{X > t}, which is more popular for lifetime data.
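A compact sketch of the marked empirical process In (with invented data): each observation Xj contributes its mark Yj/n to all t ≥ Xj.

    import numpy as np

    def integrated_regression(x_data, y_data):
        """Return t -> I_n(t) = (1/n) * sum_j Y_j * 1{X_j <= t}."""
        x = np.asarray(x_data, dtype=float)
        y = np.asarray(y_data, dtype=float)
        n = x.size
        def I_n(t):
            return np.sum(y[x <= t]) / n
        return I_n

    x = np.array([0.2, 0.5, 0.9, 1.4])       # inputs (illustrative)
    y = np.array([1.0, -0.4, 2.2, 0.6])      # responses, the "marks"
    I_n = integrated_regression(x, y)
    print(I_n(1.0), I_n(2.0))                # 0.7 and 0.85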

A preliminary consequence of our discussion is that it seems possible to propose natural estimators if the target has a global, respectively cumulative, character. For local quantities like densities, hazard or regression functions things seem to be different. Consider the problem of estimating a density f. The empirical d.f. Fn is constant between two successive data and therefore has derivative zero there. On the other hand, Fn is not even continuous at the data. Any proposal to smooth Fn and then to compute the resulting derivative brings us exactly to the heart of the problem:

How to smooth?

From a mathematical point of view, our goal is to invert the 'integration operator' through differentiation. As we have seen, however, the inverse operator is not applicable to the estimator Fn, a feature which is quite common in so-called ill-posed problems.

1.5 The Lorenz Curve

In the following sections we discuss several important quantities which may be expressed through F and F⁻¹. First we investigate the so-called Lorenz curve.

Definition 1.5.1. Let F be any distribution function on the real line with quantile function F⁻¹. Assume that µ = ∫ x F(dx) exists and does not vanish. Then the Lorenz function L is defined through

LF(p) = L(p) = µ⁻¹ ∫_0^p F⁻¹(u) du,   0 ≤ p ≤ 1.

Since by transformation of integrals

µ = ∫_0^1 F⁻¹(u) du,

we have L(1) = 1. Clearly, L(0) = 0. Hence the associated Lorenz curve (p, L(p)), 0 ≤ p ≤ 1, connects the points (0, 0) and (1, 1). It is scale-free.

Remark 1.5.2. If the random variable X with d.f. F is nonnegative, so is F⁻¹(u). Consequently, in this situation, L is a non-decreasing function.

Lemma 1.5.3. Assume that µ > 0. Then the Lorenz function L is convex.

Proof. We have to show that for p1, p2 from the unit interval and any 0 ≤ c ≤ 1

cL(p1) + (1− c)L(p2) ≥ L(cp1 + (1− c)p2). (1.5.1)

The left-hand side equals, for 0 ≤ p1 ≤ p2 ≤ 1:

µ⁻¹ ∫_0^{p1} F⁻¹(u) du + (1 − c) µ⁻¹ ∫_{p1}^{p2} F⁻¹(u) du.

For the right-hand side of (1.5.1) we obtain

L(cp1 + (1 − c)p2) = µ⁻¹ ∫_0^{p1} F⁻¹(u) du + µ⁻¹ ∫_{p1}^{cp1+(1−c)p2} F⁻¹(u) du.

Hence it remains to show that

(1 − c) ∫_{p1}^{p2} F⁻¹(u) du ≥ ∫_{p1}^{cp1+(1−c)p2} F⁻¹(u) du.

The integral on the right-hand side equals, however,

(1 − c) ∫_{p1}^{p2} F⁻¹(cp1 + (1 − c)u) du.

The conclusion now follows from the monotonicity of F⁻¹ and

u ≥ cp1 + (1 − c)u for p1 ≤ u ≤ p2.

Example 1.5.4. For the exponential distribution with parameter λ > 0 we have

L(p) = p + (1 − p) ln(1 − p).

Example 1.5.5. For the uniform distribution on the interval [a, b] we have

L(p) = (2ap + (b − a)p²)/(a + b).

In particular, for a = 0 and b = 1, we obtain L(p) = p².

Corollary 1.5.6. The Lorenz curve is always below the straight line connecting (0, 0) and (1, 1).

We may view L as a functional T evaluated at p and F:

LF(p) = T(p, F).

To compute the empirical counterpart, we have to replace F by Fn, i.e., the empirical Lorenz function equals

Ln(p) = T(p, Fn) = (1/µn) ∫_0^p Fn⁻¹(u) du,

where µn = n⁻¹ ∑_{i=1}^n Xi is the sample mean.

Since Fn⁻¹ is constant between two successive values (i − 1)/n and i/n, we have for (k − 1)/n < p ≤ k/n:

Ln(p) = (1/µn) [ ∑_{i=1}^{k−1} (1/n) Xi:n + (p − (k − 1)/n) Xk:n ].

We see that Ln(p) may also be expressed through sums of order statistics.
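The following sketch evaluates Ln(p) directly from the order statistics, following the piecewise-linear formula above; the data are those of Example 1.6.2 below, used here only for illustration.

    import numpy as np

    def empirical_lorenz(data, p):
        """L_n(p) for nonnegative data, via the piecewise-linear formula."""
        x = np.sort(np.asarray(data, dtype=float))
        n = x.size
        mu_n = x.mean()
        k = int(np.ceil(n * p)) if p > 0 else 0
        if k == 0:
            return 0.0
        partial = x[:k - 1].sum() / n              # first k-1 order statistics, each weighted 1/n
        return (partial + (p - (k - 1) / n) * x[k - 1]) / mu_n

    wealth = [1.0, 4.0, 9.0, 16.0, 25.0]
    print([round(empirical_lorenz(wealth, p), 4) for p in (0.2, 0.4, 0.6, 0.8, 1.0)])
    # -> [0.0182, 0.0909, 0.2545, 0.5455, 1.0]   i.e. 1/55, 5/55, 14/55, 30/55, 1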

Example 1.5.7. If X1:n = . . . = Xn−1:n = 0 and Xn:n = K > 0, then

Ln(p) = 0 for 0 ≤ p ≤ (n − 1)/n and Ln(p) = n(p − 1 + 1/n) for (n − 1)/n < p ≤ 1.

If we interpret the X's as the wealth owned by n members of a population, this describes a situation where all is owned by a single member. In this extreme situation, L is flat up to (n − 1)/n. Hence the difference between L and the 45° line is very large. This observation leads us to an index which measures the distribution of wealth in an economy.

1.6 Gini’s Index

In this section, F denotes a continuous d.f. supported by the positive real line. Let A denote the area between the 45° line and the Lorenz curve, and let B be the area below the Lorenz curve. Gini's index, see Gini (1936), is then defined as

G := A/(A + B).

Since A + B = 1/2, we obtain

G = 2A = 2 [1/2 − ∫_0^1 L(p) dp] = 1 − 2 ∫_0^1 L(p) dp.

We always have 0 ≤ G ≤ 1. G is close to one if L(p) is close to zero. In view of Example 1.5.7, a large G indicates that the wealth of a population is concentrated in a few hands.

Lemma 1.6.1. Assume that F is a continuous d.f. supported by the positive real line. Then we have

G = ∆/(2µ),

where

∆ = ∫_0^∞ ∫_0^∞ |x − y| F(dx) F(dy).

In terms of random variables, we have

∆ = E|X − Y|, where X, Y ∼ F are independent.

Country                      Rank    Gini index

Namibia 1 70,70

Südafrika 2 65,00

Lesotho 3 63,20

Botsuana 4 63,00

Sierra Leone 5 62,90

Zentralafrikan. Republik 6 61,30

Haiti 7 59,20

Bolivien 8 58,20

Honduras 9 57,70

Kolumbien 10 56,00

Guatemala 11 55,10

Thailand 12 53,60

Hongkong 13 53,30

Paraguay 14 53,20

Chile 15 52,10

Brasilien 16 51,90

Panama 17 51,90

Mexiko 18 51,70

Papua-Neuguinea 19 50,90

Sambia 20 50,80

Swasiland 21 50,40

Costa Rica 22 50,30

Gambia 23 50,20

Simbabwe 24 50,10

Sri Lanka 25 49,00

Dominikanische Republik 26 48,40

China 27 48,00

Madagaskar 28 47,50

Singapur 29 47,30

Ecuador 30 47,30

Nepal 31 47,20

El Salvador 32 46,90

Ruanda 33 46,80

Malaysia 34 46,20

Peru 35 46,00

Argentinien 36 45,80

Philippinen 37 45,80

Mosambik 38 45,60

Jamaika 39 45,50

Uruguay 40 45,30

Bulgarien 41 45,30

Vereinigte Staaten 42 45,00

Guyana 43 44,60

Kamerun 44 44,60

Iran 45 44,50

Kambodscha 46 44,40

Uganda 47 44,30

Mazedonien 48 44,20


Nigeria 49 43,70

Kenia 50 42,50

Burundi 51 42,40

Russische Föderation 52 42,00

Côte d´Ivoire 53 41,50

Senegal 54 41,30

Marokko 55 40,90

Georgien 56 40,80

Turkmenistan 57 40,80

Nicaragua 58 40,50

Türkei 59 40,20

Mali 60 40,10

Tunesien 61 40,00

Jordanien 62 39,70

Burkina Faso 63 39,50

Guinea 64 39,40

Ghana 65 39,40

Israel 66 39,20

Mauretanien 67 39,00

Mauritius 68 39,00

Venezuela 69 39,00

Malawi 70 39,00

Portugal 71 38,50

Moldau 72 38,00

Jemen 73 37,70

Vietnam 74 37,60

Japan 75 37,60

Tansania 76 37,60

Indien 77 36,80

Usbekistan 78 36,80

Indonesien 79 36,80

Laos 80 36,70

Mongolei 81 36,50

Benin 82 36,50

Neuseeland 83 36,20

Bosnien und Herzegowina 84 36,20

Litauen 85 35,50

Algerien 86 35,30

Lettland 87 35,20

Albanien 88 34,50

Ägypten 89 34,40

Polen 90 34,20

Großbritannien 91 34,00

Niger 92 34,00

Irland 93 33,90

Schweiz 94 33,70

Aserbaidschan 95 33,70

Kirgisistan 96 33,40

Rumänien 97 33,30


Bangladesch 98 33,20

Griechenland 99 33,00

Frankreich 100 32,70

Taiwan 101 32,60

Tadschikistan 102 32,60

Kanada 103 32,10

Italien 104 32,00

Spanien 105 32,00

Timor-Leste 106 31,90

Estland 107 31,30

Korea, Republik 108 31,00

Tschechische Republik 109 31,00

Armenien 110 30,90

Niederlande 111 30,90

Pakistan 112 30,60

Australien 113 30,50

Europäische Union 114 30,40

Äthiopien 115 30,00

Kosovo 116 30,00

Zypern 117 29,00


Slowenien 118 28,40

Serbien 119 28,20

Island 120 28,00

Belgien 121 28,00

Ukraine 122 27,50

Belarus 123 27,20

Deutschland 124 27,00

Kroatien 125 27,00

Finnland 126 26,80

Kasachstan 127 26,70

Slowakei 128 26,00

Luxemburg 129 26,00

Malta 130 26,00


Österreich 131 26,00

Norwegen 132 25,00

Dänemark 133 24,80

Ungarn 134 24,70

Montenegro 135 24,30

Schweden 136 23,00

Figure 1.6.1: Gini index by country (in percent)

Proof. We have

∆ = 2 ∫_0^∞ ∫_0^y (y − x) F(dx) F(dy) = 2 [ ∫_0^∞ y F(y) F(dy) − ∫_0^∞ ∫_0^y x F(dx) F(dy) ].

Applying Fubini's Theorem, the last expression becomes

2 [ ∫_0^∞ y F(y) F(dy) − ∫_0^∞ x (1 − F(x)) F(dx) ] = 2 ∫_0^∞ (2x F(x) − x) F(dx) = 2 [ 2 ∫ x F(x) F(dx) − µ ].

Since, by transformation of integrals and under continuity of F,

∫ x F(x) F(dx) = ∫_0^1 F⁻¹(u) u du = ∫_0^1 ∫_0^u dv F⁻¹(u) du = ∫_0^1 ∫_v^1 F⁻¹(u) du dv = µ − ∫_0^1 ∫_0^v F⁻¹(u) du dv,

we come up with

∆ = 2 [ µ − 2 ∫_0^1 ∫_0^v F⁻¹(u) du dv ],

whence

∆/(2µ) = 1 − (2/µ) ∫_0^1 ∫_0^v F⁻¹(u) du dv = 1 − 2 ∫_0^1 L(v) dv = G.

The empirical Gini index of course equals

Gn = 1 − 2 ∫_0^1 Ln(p) dp.

Since Ln is a piecewise linear function, interpolating the values Ln(k/n) = Sk/Sn, where Sk = ∑_{i=1}^k Xi:n and S0 = 0, the trapezoidal rule yields

Gn = 1 − 2 ∑_{k=1}^n (1/2)(k/n − (k − 1)/n)(Sk−1/Sn + Sk/Sn) = 1 − (1/n) ∑_{k=1}^n (Sk + Sk−1)/Sn.

Example 1.6.2. Assume that X1 = 16, X2 = 1, X3 = 4, X4 = 25 and X5 =9. Ordering gives X1:5 = 1, X2:5 = 4, X3:5 = 9, X4:5 = 16, X5:5 = 25. HenceLn interpolates the values 0, 1

55 ,555 ,

1455 ,

3055 and 1 yielding G5 =

2455 = 43, 64%.

The statistical analysis of Gn requires some knowledge of the distributionaltheory of sums of order statistics. Alternatively we could use the equivalentrepresentation in terms of U -statistics.

Rather than looking at the area between the 45 line and the Lorenz curve,we could take into account the discrepancy

Z(p) = p− L(p).

Obviously, Z(p) ≥ 0 and

1∫0

Z(p)dp =1

2−

1∫0

L(p)dp =1

2G.

Since L is convex,the function Z is concave. Furthermore,

Z(0) = 0 and Z(1) = 1− L(1) = 0.

Since L and hence Z is continuous, Z admits a maximizer, say p0:

Z(p0) = max0≤p≤1

Z(p).

It turns out that p0 admits an interesting interpretation.

Lemma 1.6.3. We havep0 = F (µ).

In other words, p0 equals the proportion of the population obtaining lessthan or equal to the mean.

Page 43: 2013 W Stute Empirical Distributions

38 CHAPTER 1. INTRODUCTION

Proof. We have to show that Z(F (µ)) ≥ Z(p) for all 0 ≤ p ≤ 1. We onlydiscuss the case F (µ) ≥ p, the other being dealt with in a similar way. Now,

Z(F (µ))− Z(p) = F (µ)− p− [L(F (µ))− L(p)]

= F (µ)− p− 1

µ

F (µ)∫p

F−1(u)du. (1.6.1)

The integral on the right-hand side equals, by transformation of integrals,

∫1F−1(p)≤x≤µxF (dx)

so that (??) becomes

1

µ

∫(µ− x)1F−1(p)≤x≤µF (dx).

Obviously the integral is nonnegative.

It is interesting to investigate the maximal deviation between L and the45-line, i.e., of

Z(p0) = p0 − L(p0).

Lemma 1.6.4. We have

Z(p0) =δ

with

δ =

∫|x− µ|F (dx),

the mean absolute deviation.

Page 44: 2013 W Stute Empirical Distributions

1.7. THE ROC-CURVE 39

Proof. We have

δ

2µ=

1

∫|x− µ|F (dx) = 1

∞∫µ

(x− µ)F (dx) +1

µ∫0

(µ− x)F (dx)

=1

∞∫0

(µ− x)F (dx) +1

µ

∞∫µ

(x− µ)F (dx) =1

µ

∞∫µ

(x− µ)F (dx)

=1

µ

∞∫µ

xF (dx)− 1 + F (µ) = F (µ)− 1

µ

µ∫0

xF (dx)

= F (µ)− 1

µ

F (µ)∫0

F−1(u)du. (1.6.2)

Recalling that p0 = F (µ), we obtain that (1.6.2) equals Z(p0).

1.7 The ROC-Curve

The ROC-Curve (Receiver Operating Characteristic Curve) wasoriginally proposed to analyze the accuracy of a medical diagnostic test. Tobegin with, assume that a variable D may take on the two values zero andone indicating the true but unknown status of a patient:

D =

1 for disease0 for non-disease

Let Y be the result of a diagnostic test. Suppose that

Y =

1 positive for disease0 negative for disease

The result of the test may be classified as follows:

D = 0 D = 1

Y = 0 True negative False negativeY = 1 False positive True positive

We denote with

FPF = P(Y = 1|D = 0)

Page 45: 2013 W Stute Empirical Distributions

40 CHAPTER 1. INTRODUCTION

and

TPF = P(Y = 1|D = 1)

the false positive and true positive fraction, respectively. The ideal testyields

FPF = 0 and TPF = 1.

For a useless test which is unable to discriminate between D = 0 and D = 1we have FPF = TPF. In a biomedical context FPF and 1−TPF are calledsensitivity and specificity, i.e.,

sensitivity = P(Y = 1|D = 0)

specificity = P(Y = 0|D = 1)

In engineering the terms ”hit rate” (TPF) and ”false alarm rate” (FPF)are quite common. In statistics when the goal is to test the null hypothesisD = 0, FPF is called the significance level and TPF power. The quantityρ = P(D = 1) is called the population prevalence of disease. Finally, themisclassification probability equals

P(Y = D) = P(D = 1)P(Y = 0|D = 1) + P(D = 0)P(Y = 1|D = 0)

= ρ(1− TPF) + (1− ρ)FPF.

In applications to real data

1− TPF = P(Y = 0|D = 1)

should be small since usually a negative diagnosis of a patient who has fallensick may result in a fatal incidence. On the other hand, a positive diagnosisof a negative status may only result in some less serious inconveniences. Wefinally introduce the predictive values

Positive predictive value = PPV = P(D = 1|Y = 1)

Negative predictive value = NPV = P(D = 0|Y = 0)

A perfect test has

PPV = 1 = NPV

while a useless test yields

P(D = 1|Y = 1) = P(D = 1) and P(D = 0|Y = 0) = P(D = 0).

Page 46: 2013 W Stute Empirical Distributions

1.7. THE ROC-CURVE 41

A straightforward application of Bayes’ rule gives us PPV and NPV in ageneral situation:

PPV =ρTPF

ρTPF + (1− ρ)FPF

NPV =(1− ρ)(1− FPF)

(1− ρ)(1− FPF) + ρ(1− TPF)

In terms of TPF and FPF, the test is perfect if

TPF = 1 and FPF = 0. (1.7.1)

In applications, the variable Y is not dichotomous (Y = 0 or 1) but a scorestatistic Y = S(X) of some available data vector X. We can dichotomise bycomparing Y with a threshold c. In the context of these notes we diagnosethe case to be positive if Y > c and negative if Y ≤ c. It follows that

FPF(c) = P(Y > c|D = 0)

TPF(c) = P(Y > c|D = 1) (1.7.2)

The problem now is one of choosing the threshold c yielding a good if notoptimal test. To start with, we consider the curve consisting of all valuesfrom (1.7.2) with −∞ < c <∞.

Definition 1.7.1. The ROC curve consists of all tuples (FPF(c),TPF(c))with −∞ < c <∞.

Since FPF and TPF are both survival functions, we have

limc→∞

TPF(c) = 0 limc→∞

FPF(c) = 0

and

limc→−∞

TPF(c) = 1 limc→−∞

FPF(c) = 1.

If we plot TPF(c) against FPF(c), we obtain the ROC-function. It is anincreasing function on the unit interval with ROC(0) = 0 and ROC(1) = 1.If we denote with

F (y) = P(Y ≤ y|D = 1) and G(y) = P(Y ≤ y|D = 0)

the two relevant d.f.’s and with F and G their survival functions, then byconstruction

ROC(t) = F G−1(t), 0 < t < 1.

Page 47: 2013 W Stute Empirical Distributions

42 CHAPTER 1. INTRODUCTION

In applications people became interested in, as for the Lorenz-curve, thearea under ROC, i.e.,

AUC =

1∫0

ROC(t)dt.

Transformation of integrals leads to

AUC =

∫F (y)G(dy) = P(Y1 > Y0),

where Y1 and Y0 are independent with Y1 ∼ F and Y0 ∼ G. Given twosamples Y11, . . . , Y1n and Y01, . . . , Y0m from F resp. G, the estimator ofAUC becomes

AUC =1

nm

n∑i=1

m∑j=1

1Y1i>Y0j,

a two-sample U -statistic.

As for the Lorenz curve we could consider the partial AUC, namely

pAUC(s) =

s∫0

ROC(t)dt, 0 ≤ s ≤ 1.

This is a functional in s, F and G. If we replace F and G by their empiricalversions, we obtain a stochastic process.

Remark 1.7.2. We mention that some authors base the ROC analysis onF and G and not on F and G. Also a lot of ROC analysis has been donein a parametric framework, see Hsieh and Turnbull (1996), Ann. Statistics24, 25-40.

1.8 The Mean Residual Lifetime Function

Let X be a positive random variable with d.f. F and finite mean

µ = EX =

∞∫0

(1− F (x))dx.

The mean residual lifetime function is defined for all x with F (x) < 1:

e(x) = E(X − x|X > x) =

∞∫xF (y)dy

F (x).

Page 48: 2013 W Stute Empirical Distributions

1.8. THE MEAN RESIDUAL LIFETIME FUNCTION 43

Note that e(0) = µ. The function e has been primarily used in survivalanalysis and reliability theory. It is the mean remaining lifelength of thoseindividuals who survive beyond x.

Example 1.8.1. For the exponential d.f. F (x) = exp(−λx), x ≥ 0, we have

e(x) = λ−1.

There is an intimate relation between the Lorenz function and the meanresidual lifetime function.

Lemma 1.8.2. We have

L(F (x)) = 1− 1

µF (x)[e(x) + x].

Proof. Note that

F (x)[e(x) + x] = xF (x) +

∞∫x

F (y)dy

= EX1x<X = µ− EX1X≤x = µ−F (x)∫0

F−1(t)dt

= µ(1− L(F (x))).

The next lemma presents a formula connecting F and e.

Lemma 1.8.3. We have

F (x) =e(0)

e(x)exp

− x∫0

dt

e(t)

.Proof. Since

− 1

e(t)=

−F (t)∞∫t

F (z)dz

≡ h′(t)

h(t)

we obtain

−x∫

0

dt

e(t)= lnh(t)

∣∣∣∣x0

Page 49: 2013 W Stute Empirical Distributions

44 CHAPTER 1. INTRODUCTION

and therefore

exp

− x∫0

dt

e(t)

=h(x)

h(0)=h(x)

µ.

We conclude

e(0) exp[−∫. . .] = h(x) =

∞∫x

F (z)dz = F (x)e(x).

1.9 The Total Time on Test Transform

Again, let X be a nonnegative random variable with d.f. F . The totaltime on test transform (TTT-transform) is defined as

H−1(t) =

F−1(t)∫0

F (x)dx, 0 < t < 1.

In survival analysis, the TTT-transform and its sample version are examplesof processes (plots) which aim at analyzing the distributional properties ofa lifetime variable subject to falling below (or above) a given threshold.

By now, there is a huge literature on the empirical TTT-transform. InBarlow et al. (1972), it appeared in the estimation process of a failure rateunder order restrictions. Barlow and Proschan (1969) studied tests basedon TTT when the data are incomplete. Langberg et al. (1980) showed howthe TTT-transform can be used to characterize lifetime distributions; seealso Klefsjo (1983a, b). More recently, Haupt and Schabe (1997) appliedthe TTT concept to construct bathtub-shaped hazard functions.

Example 1.9.1. For the exponential d.f. with parameter λ we have

H−1(t) =t

λ.

Again, there are some interesting connections with other functions. One ofthese examples is the following

Page 50: 2013 W Stute Empirical Distributions

1.10. THE PRODUCT INTEGRATION FORMULA 45

Lemma 1.9.2. We have, for 0 < t < 1,

H−1(t) = µL(t) + F−1(t)(1− t).

Proof. By definition,

H−1(t) =

F−1(t)∫0

F (x)dx =

F−1(t)∫0

∫1x<yF (dy)dx

=

∞∫0

min(F−1(t), y)F (dy) =

F−1(t)∫0

yF (dy) + F−1(t)(1− t).

The empirical version of H−1 is obtained, if one replaces F and F−1 by theirempirical counterparts:

H−1n (t) =

F−1n (t)∫0

Fn(x)dx.

At t = k/n we get

H−1n

(k

n

)=

Xk:n∫0

Fn(x)dx =1

n

k∑

i=1

Xi:n + (n− k)Xk:n

.

This also explains the name for H−1.

1.10 The Product Integration Formula

To summarize the results of the last 5 sections, we have introduced fourimportant functionals of F , namely the

• Lorenz curve (including Gini’s Index)

• ROC-curve

• Mean residual lifetime function

• TTT transform

Page 51: 2013 W Stute Empirical Distributions

46 CHAPTER 1. INTRODUCTION

It became clear that there is an intimate relationship between these func-tions and each may serve as a tool to model various aspects of a parentpopulation. In this section we study the connection between the d.f. Fand the cumulative hazard function Λ. As our final result we shall comeup with the famous product integration formula. It will offer a possibilityto represent empirical d.f.’s through products rather than sums. Importantapplications of this approach are in survival analysis.

Let us start with an arbitrary not necessarily random sequence of real num-bers, say S0, S1, S2, . . . In most cases (Si)i is adapted to a filtration (Fi)i.For a given x define the sequence

Tn = xn∏

i=1

(1 + ∆Si), n = 0, 1, . . .

The stochastic integral of T− w.r.t. S is defined as∫[0,n]

T−dS =

n∑i=1

Ti−1∆Si =

n∑i=1

Ti−1(Si − Si−1).

Lemma 1.10.1. We have, for n = 0, 1, . . .

Tn = x+

∫[0,n]

T−dS. (1.10.1)

Proof. Obviously, the assertion is true for n = 0 since both sides equal x.The general case is proved by induction on n. Given that the assertion istrue for n, we have

Tn+1 = Tn(1 + ∆Sn+1) = Tn + Tn∆Sn+1

= x+

∫[0,n]

T−dS + Tn∆Sn+1 = x+

∫[0,n+1]

T−dS.

Equation (1.10.1) is an example of an Integral Equation. For x = 1, wecall the sequence

Tn ≡ Expn(S) =

n∏i=1

(1 + ∆Si) = 1 +

∫[0,n]

T−dS (1.10.2)

Page 52: 2013 W Stute Empirical Distributions

1.10. THE PRODUCT INTEGRATION FORMULA 47

the Stochastic Exponential of S.

The above setting allows for a simple but important extension. For this, let(hn)n be a predictable sequence. From (Sn)n and (hn)n we may constructa new sequence, namely∫

[0,n]

hdS =

n∑i=1

hi∆Si =

n∑i=1

hi(Si − Si−1).

The associated stochastic exponential equals

Tn = Expn

∫[0,n]

hdS

=

n∏i=1

(1 + hi∆Si).

Lemma 1.10.2. The sequence (Tn)n satisfies

Tn = 1 +

∫[0,n]

T−hdS.

Proof. The proof is similar to that of Lemma 1.10.1

Note that Lemma 1.10.1 is a special case of 1.10.2. Just put hn ≡ 1. Now, ifS is a function of bounded variation, we may let the number of grid pointstend to infinity. Again, put t0 = −∞ and keep tn = t fixed. Then the limitof (1.10.2) exists and we obtain the analogue of (1.10.2) in continuous time:

Tt = 1 +

∫(−∞,t]

T (x−)S(dx).

In some situations S ≡ 0 on the negative real line. In this case we have

Tt = 1 +

∫[0,t]

T (x−)S(dx) (1.10.3)

Example 1.10.3. Recall the cumulative hazard function Λ satisfying

dΛ =dF

1− F−.

Page 53: 2013 W Stute Empirical Distributions

48 CHAPTER 1. INTRODUCTION

Conclude

F (t) =

∫(−∞,t]

(1− F−)dΛ

and

F (t) ≡ 1− F (t) = 1−∫

(−∞,t]

(1− F−)dΛ. (1.10.4)

We thus see that the survival function F satisfies (1.10.3) with S = −Λ.Solving (1.10.4) means that we aim at representing F in terms of Λ. For acontinuous F , Example 1.4.11 presents the solution, namely

F (t) = exp(−Λ(t)).

For applications in statistics we need the solution of (1.10.4), however, inthe general case. This will be necessary mainly for two reasons: In manysituations it will be simpler to estimate Λ, say by Λn, than F . This Λn willbe discrete so that a general representation of the corresponding Fn wouldbe helpful to construct the associated survival function and hence estimatorof F .

To solve (1.10.4) in the general case, we need some notation.

Definition 1.10.4. Let S be a nondecreasing function. Put

Sc(t) = S(t)−∑x≤t

∆S(x)

≡ S(t)−∑x≤t

Sx

The function Sc is called the continuous part of S. Note that S admits onlyat most countably many discontinuities.

Theorem 1.10.5. Let F be any d.f. with associated cumulative hazardfunction Λ. Then we have

1− F (t) = exp[−Λc(t)]∏x≤t

[1− Λx]. (1.10.5)

If F is continuous, so is Λ. Therefore the product is empty and equals 1,while Λc = Λ. For a purely discrete Λ, (1.10.5) becomes

1− F (t) =∏x≤t

[1− Λx]. (1.10.6)

Page 54: 2013 W Stute Empirical Distributions

1.10. THE PRODUCT INTEGRATION FORMULA 49

For the proof of Theorem 1.10.5 we need the following result for exponentials.

Lemma 1.10.6. Let (S1i )i and (S2

i )i be two sequences. Then we have

Expn(S1)Expn(S

2) = Expn(S1 + S2 + [S1, S2]). (1.10.7)

Here

[S1, S2]n = S10S

20 +

n∑k=1

∆S1k∆S

2k

denotes the Quadratic Covariation of S1 und S2.

Proof of Theorem 1.10.5. By definition,

Λ(t) = Λc(t) +∑x≤t

Λx

≡ Λc(t) + Λd(t),

where “d” stands for discrete part. As before, consider a grid t0 = −∞ <t1 < . . . < tn = t. If we let n tend to infinity such that the grid gets finerand finer, we get

Expn(Λc) =

n∏i=1

[1 + ∆Λc(ti)] (1.10.8)

= exp

n∑

i=1

ln[1 + ∆Λc(ti)]

= exp

n∑

i=1

∆Λc(ti) + o(1)

→ exp[Λc(t)].

This argument shows that the exponential of a continuous monoton boundedfunction f equals t→ exp[f(t)]. For the discrete part, however, we get

Expn(Λd) →

∏x≤t

[1 + Λx]. (1.10.9)

Now replace Λ with −Λ. Then the product of (1.10.8) and (1.10.9) convergesto the right-hand side of (1.10.5). To prove the Theorem, we apply Lemma

Page 55: 2013 W Stute Empirical Distributions

50 CHAPTER 1. INTRODUCTION

1.10.6. By (1.10.7) the product of (1.10.8) and (1.10.9) equals

n∏i=1

[1 + ∆S1

i +∆S2i +∆S1

i ∆S2i

]= exp

n∑

i=1

ln(1 + ∆S1i +∆S2

i +∆S1i ∆S

2i )

= exp

n∑

i=1

ln(1 + ∆S1i +∆S2

i ) + o(1)

.

The last equation is obtained from a Taylor-expansion and the fact that

n∑i=1

∆S1i ∆S

2i →

∫(−∞,t]

ΛxΛc(dx) = 0.

Finally, S = −Λ = −Λc − Λd so that by (1.10.4)

n∏i=1

(1 + ∆S1i +∆S2

i ) → Expt[−Λ] = 1− F (t).

In the following we apply the last theorem to the empirical d.f. Fn. For thesake of simplicity we assume that the order statistics X1:n < . . . < Xn:n

are pairwise distinct. This is always the case (with probability one) if theunderlying d.f. F is continuous. Now, since Fn only jumps at the data, wehave

Λnx =Fnx

1− Fn(x−)=

1

n−i+1 if x = Xi:n

0 elsewhere.

Hence Theorem 1.10.5 yields

1− Fn(t) =∏

Xi:n≤t

[1− 1

n− i+ 1

]=

∏Xi:n≤t

n− i

n− i+ 1. (1.10.10)

Equation (1.10.10) is called the Product-Limit representation of Fn. Forthe mass at t, we get

Fnt = (1− Fn(t−))− (1− Fn(t))

=∏

Xi:n<t

n− i

n− i+ 1−∏

Xi:n≤t

n− i

n− i+ 1.

Page 56: 2013 W Stute Empirical Distributions

1.11. SAMPLING DESIGNS AND WEIGHTING 51

This difference equals zero if t = Xj:n for all 1 ≤ j ≤ n. For t = Xj:n,however, we get

Fnt =

j−1∏i=1

n− i

n− i+ 1−

j∏i=1

n− i

n− i+ 1=

j−1∏i=1

n− i

n− i+ 1

[1− n− j

n− j + 1

]=

n− 1

n

n− 2

n− 1. . .

n− j + 1

n− j + 2

1

n− j + 1=

1

n,

as expected.

If we check the proof of Theorem 1.10.5 carefully, we see that the followingarguments were essential

• 1− F is the exponential of −Λ.

• For taking logarithms we needed 1 + ∆S1i + ∆S2

i + ∆S1i ∆S

2i > 0.

This was true, however, since ∆S1i and ∆S1

i ∆S2i became small while

∆S2i → −Λx and 1− Λx > 0.

• For taking limits it was important that Λ is finite over (−∞, t] for eachfinite t.

We already knew before that 1 − F is the exponential of −Λ. Equation(1.10.5) is important since it reveals a possibility to explicitly represent1− F in terms of the continuous and discrete part of Λ.

1.11 Sampling Designs and Weighting

In the previous sections we have seen that empirical d.f.’s may serve asa universal tool to express (and analyze) various statistics which at firstsight do not seem to have much in common. For this kind of discussion noparticular assumptions like independence of the data were necessary.

In this section we outline several possible data situations which we will studyin future chapters. The classical field of empirical distributions (with stan-dard weights) deals with independent identically distributed (i.i.d.)random observations. Therefore, in forthcoming chapters, we shall firststudy this case in greater detail. The theory and their applications are par-ticularly rich when the data are real-valued. For the multivariate case thefact that the data cannot be ordered causes new challenging questions. Inthe context of time series analysis the data feature some dependence.

Page 57: 2013 W Stute Empirical Distributions

52 CHAPTER 1. INTRODUCTION

Empirical distributions are helpful to analyze this dependence structure inboth the frequency and time domain. Quite often one has a particular modelfor the autoregressive function in mind, i.e., one assumes that

Xi = m(Xi−1, . . . , Xi−p, θ) + εi

satisfies some (parametric) dynamic model, in which the new variable is anunknown function of the previous ones, subject to some noise εi. In such asituation empirical d.f.’s are also useful to analyze the noise variables εiand the goodness-of-fit of the model in the context of a residual anal-ysis.

In the following we discuss some data examples which show that we need tobe open to replace the standard weights n−1 by other ones.

Example 1.11.1. This is a situation frequently appearing inRobust Statis-tics. Recalling that the sample mean 1

n

∑nj=1Xj is an estimator for the

expectation of a parent distribution F , in order to protect against out-liers, the sample mean may be replaced by a weighted average of data inwhich the influence of extreme order statistics is trimmed. More precisely,let X1:n ≤ . . . ≤ Xn:n again denote the set of data X1, . . . , Xn arranged inincreasing order. Let Win be generated by a function g defined on the unitinterval (0,1):

Win =1

ng

(i

n+ 1

), 1 ≤ i ≤ n.

Then the statistic

Tn :=n∑

i=1

WinXi:n =1

n

n∑i=1

g

(i

n+ 1

)Xi:n

is called an L (= linear)-statistic. Tn is the empirical version of

T (F ) =

1∫0

g(u)F−1(u)du.

In the above version i/n was replaced by i/(n+ 1) since from time to timeone considers functions g only defined on the open unit interval (0,1) suchthat g(u) tends to infinity when u→ 0 or u→ 1.

The choice of g ≡ 1 leads to the sample mean. If, for 0 < a < 12 , we set

g(x) =1

1− 2a1[a,1−a](x),

Page 58: 2013 W Stute Empirical Distributions

1.11. SAMPLING DESIGNS AND WEIGHTING 53

we obtain

Tn =1

n(1− 2a)

(n+1)(1−a)∑i=(n+1)a

Xi:n,

which is closely related to the a-trimmed mean 1(n+1)(1−2a)+1

∑(n+1)(1−a)i=(n+1)a Xi:n.

Example 1.11.2. Let X1, X2, . . . , Xn be a sample of real-valued data. E.g.,Xi may be the disease-free survival time of patient i after surgery. One maythen split a time interval into disjoint sub-intervals I1, . . . , Ik with lengthh1, . . . , hk, respectively. Put, for x ∈ Ij , 1 ≤ j ≤ k,

fn(x) =1

nhj

n∑i=1

1Ij (Xi).

This function fn is called a histogram. Note that fn(x) ≥ 0 and∫fn(x)dx =

1. Hence fn is an (empirical) density. The weight of Xi depends on the lo-cation of x, i.e., on the cell Ij(x) = Ij which contains x:

Win(x) =1

nhj1Xi∈Ij.

0 1 2 3 4 5 6

0.0

0.2

0.4

0.6

0.8

Figure 1.11.1: Example of a histogram based on n = 100 observations, withhj = 1/2

Page 59: 2013 W Stute Empirical Distributions

54 CHAPTER 1. INTRODUCTION

Note that fn very much depends on the choice of the cells Ij . A slight changeof the I’s may result in a dramatic change of fn, if many data are locatedclose to the boundary of a cell.

Example 1.11.3. This example is similar to the previous one but frommultivariate statistics. Assume we observe bivariate data (Xi, Yi), 1 ≤ i ≤ n.One may be interested in the mean m(x) = E[Y |X = x] of Y given that Xattains a specified value x, i.e., in the regression of Y at x. Since there maybe no Xi with Xi = x, we may open a window with center x and length2h and average the Yi’s for which the Xi’s fall into [x−h, x+h]. This leadsto

mn(x) =

∑ni=1 Yi1[x−h,x+h](Xi)∑ni=1 1[x−h,x+h](Xi)

=

n∑i=1

Win(x)Yi,

where

Win =Win(x) =1[x−h,x+h](Xi)∑nj=1 1[x−h,x+h](Xj)

is the weight of Yi depending on x, h and the locations of all Xj . For smallh > 0 it may happen that the denominator of Win equals zero so that Win

is not well-defined. For large h’s we face the risk that we loose the localflavor of the data. So the question is how to choose a good h. Also notethat each window is located around x and no pre-specified choice of cells asfor histograms is required.

Example 1.11.4. This example describes an important situation wherethe available data are incomplete. It’s taken from survival analysis. Thefollowing figure describes a situation where patients enter a study at time ofsurgery, say. Then one may be interested in the disease-free survival time.Due to unexpected losses or since the study has to be terminated at time T ,not all survival times Yi are available. Rather, some only exist in censored

Page 60: 2013 W Stute Empirical Distributions

1.11. SAMPLING DESIGNS AND WEIGHTING 55

form, i.e., rather than Yi the time Ci spent in the study is observed.

3C

1Y

4C

5C

2Y

1

2

3

4

5

1. 2. 3. 4. 5. time ofsurgery

T = end ofstudy

Figure 1.11.2: Data censored from the right

In our example Y1 and Y2 are observable. For cases 3 and 5 we observe thetimes spent until the end of the study, while patient 4 dropped out of thestudy before the end.

Though some data are censored, they provide important information in thatit is known that a patient survived at least for some time. The question thenbecomes one of how to properly weight each datum.

In the following example the situation is even worse since some of the datamay be completely lost.

Example 1.11.5. In retrospective AIDS data analysis (e.g., transfusionassociated AIDS) one may be interested in the incubation period Xi =Wi − Si, where Si is the (calendar) time of infection and Wi is the time ofdiagnosis. Typically, there is a time T when the study has to be terminated.A case is only included if Wi ≤ T or Xi ≤ T −Si = Yi. In this case both Xi

and Yi are observed. If Xi > Yi, the data are unknown and the case is notincluded. One says thatXi is truncated by Yi from the right. Consequentlysample size n is unknown. Again the question is how to weight the availabledata.

Example 1.11.6. Sometimes it may be known that the observationsX1, . . . , Xn come from a distribution function F satisfying some constraint,e.g.,

EX =

∫xF (dx) = 0.

Page 61: 2013 W Stute Empirical Distributions

56 CHAPTER 1. INTRODUCTION

In regression or time series analysis, the dependent variables are decomposedinto the input (resp. prognostic) part and noise ε, where by construction εhas expectation zero. In a residual analysis, when Xi is the i-th residual,it is usually not guaranteed that

∫xFn(dx) = 0 so that one may wish to

replace Fn with some Fn satisfying∫xFn(dx) = 0. In most cases Fn will

also have masses at X1, . . . , Xn but different from 1n . So-called Empirical

Likelihood-Methods aim at finding and analyzing (modified) empiricaldistribution functions satisfying such constraints.

Example 1.11.7. In the situation to discuss now, the unknown parameterof interest is not F or any of its functionals but population size. Assumeone is interested in the number n of unknown species to be investigated.To get some initial information, one has to observe (and hopefully detect)a (random) number of species within a pre-specified period of time. We seethat, as in Example 1.5.5, the statistical inference on the total number n ofunknown species may present a challenging problem.

Example 1.11.8. Our final example is concerned with possible structuralchanges. Though many times it is assumed that a sample has been drawnfrom the same population under identical conditions, in a real world thereis no guarantee that this is the case. E.g., given a set of observationsX1, . . . , Xn, there may be an unknown index k with 1 < k ≤ n such that for1 ≤ j ≤ k the distribution of Xj is different from that of Xj for k < j ≤ n.The unknown k is then called a changepoint.

Conclusions. The main goal of this introductory chapter was first to intro-duce the important concept of a Dirac. Appropriately weighted this led usto general empirical distributions (kernels). Associated empirical integralsthen form a flexible and rich class of elementary statistics. For univariatedata order statistics are obtained after ordering the data. Ranks describethe position of an observation in the full sample or the relative position ofone sample w.r.t. another. At the end it has become clear, that quantitiescomputable from empirical d.f.’s are just statistical functionals T = T (G)evaluated at G = Fn, while the target is the term evaluated at the unknownparent d.f. G = F .

In such a situation T (F ) is the parameter of interest describing some char-acteristics of the parent population. The examples discussed in section 1.4 –1.10 then showed that very often we are not only interested in a parameter.Rather terms like T (x,G), T (p,G) or T (λ,G) are functions and not param-eters. The discussion in Section 1.10 revealed that the function of interest

Page 62: 2013 W Stute Empirical Distributions

1.11. SAMPLING DESIGNS AND WEIGHTING 57

may be characterized as being the (unique) solution of an integral equation.Adopting the notation from calculus, the associated T may henceforth becalled a statistical operator.

Things are changing if we are interested in local quantities rather than func-tionals of the cumulative d.f. F . Statistical methodology then faces so-calledill-posed problems requiring some smoothing (or regularization) to make in-version of operators feasible.

Sometimes data may be incomplete or missing, which requires a new weight-ing. In other cases, reweighting will be necessary in order to fulfill some con-straints. As our example on unknown species or the problem from change-point analysis have revealed, there are situations when the target is not afunction of F but some other parameter of interest, like the size of a popu-lation.

So far, no distributional assumptions on the observations X1, . . . , Xn likeindependence were made. Before we do that, it is necessary to have a closerlook at the situation when sample size equals n = 1.

Page 63: 2013 W Stute Empirical Distributions

58 CHAPTER 1. INTRODUCTION

Page 64: 2013 W Stute Empirical Distributions

Chapter 2

The Single Event Process

2.1 The Basic Process

The whole chapter is devoted to a detailed analysis of the empirical d.f. onthe real line when the sample size equals n = 1.

Definition 2.1.1. Let X be a real-valued random variable defined on aprobability space (Ω,A,P). Then we call the stochastic process

St = 1X≤t, t ∈ R,

the Single Event Process.

This process equals F1. Hence, all the properties established for a generalFn in Section 1.3 apply also here. In particular, St equals zero for t < Xand one for t ≥ X. The jump size at X equals one. Let F denote the d.f. ofX. For each fixed t, the variable St is a 0− 1 or Bernoulli-variable withexpectation

E(St) = P(X ≤ t) = F (t) (2.1.1)

and variance

Var(St) = F (t)(1− F (t)). (2.1.2)

The variance attains its maximum when F (t) = 12 , i.e., at the median. As we

move to the left or to the right, the variance declines. It vanishes wheneverF (t) = 0 or F (t) = 1, i.e., outside the support of F .

In the next lemma we present the covariance structure of S· ≡ St : t ∈ R.

59

Page 65: 2013 W Stute Empirical Distributions

60 CHAPTER 2. THE SINGLE EVENT PROCESS

Lemma 2.1.2. For s ≤ t we have

Cov(Ss, St) = F (s)(1− F (t)). (2.1.3)

Proof. We have

Cov(Ss, St) = E(SsSt)− E(Ss)E(St) = F (s)− F (s)F (t),

whence the result. For the last equality note that

SsSt = 1X≤s1X≤t = 1X≤s,X≤t = 1X≤s.

For a general pair s, t of real numbers the covariance equals

Cov(Ss, St) = F (s ∧ t)− F (s)F (t), (2.1.4)

where

s ∧ t = min(s, t).

For arbitrary measurable functions φ, φ1 and φ2 we similarly get the fol-lowing elementary but basic equations.

Lemma 2.1.3. Provided all integrals exist, we have

• E(φ(X)) =∫φdF

• Cov(φ1(X), φ2(X)) =∫φ1φ2dF −

∫φ1dF

∫φ2dF .

By the Cauchy-Schwarz inequality∣∣∣∣∫ φ1φ2dF

∣∣∣∣ ≤√∫

φ21dF

∫φ22dF .

Hence, the integral∫φ1φ2dF exists whenever φ1, φ2 ∈ L2(F ), the space of

F -square-integrable functions on the real line.

If we set

< φ1, φ2 >≡∫φ1φ2dF,

Page 66: 2013 W Stute Empirical Distributions

2.1. THE BASIC PROCESS 61

we obtain a scalar-product on the space L2(F ). The functions φ1, φ2 arecalled orthogonal if and only if∫

φ1φ2dF = 0.

If, furthermore, they are also centered, i.e.,∫φ1dF = 0 =

∫φ2dF,

thenCov(φ1(X), φ2(X)) = 0. (2.1.5)

(2.1.5) implies, that also the correlation between φ1(X) and φ2(X) equalszero.

Orthogonality is obtained for all F if φ1φ2 ≡ 0. E.g., if φi = 1Bi for i = 1, 2and B1 ∩B2 = ∅, then

1B11B2 = 1B1∩B2 = 1∅ = 0.

Other φ’s may be orthogonal only w.r.t a special F but not w.r.t others.

Having studied some elementary properties of the Single Event Process,we now begin to analyze the probabilistic structure of the whole process.The distribution of S· is uniquely determined by its finite-dimensionaldistributions (fidis). For this we fix any finite collection of t’s, say

t1 < t2 < . . . < tk < tk+1.

To study the distribution of (St1 , St2 , . . . , Stk+1) we note that, e.g.,

P(Stk+1= 1|St1 = 0, . . . , Stk = 0) =

P(X ≤ tk+1, X > tk, . . . , X > t1)

P(X > tk, . . . , X > t1)

=P(X ≤ tk+1, X > tk)

P(X > tk)= P(Stk+1

= 1|Stk = 0).

Similar results hold for other possible past values of S. So we may concludethe following lemma.

Lemma 2.1.4. The Single Event Process is a Markov process with transi-tion probabilities, for s < t,

P(St = 1|Ss = 1) = 1, P(St = 1|Ss = 0) =F (t)− F (s)

1− F (s),

P(St = 0|Ss = 1) = 0, P(St = 0|Ss = 0) =1− F (t)

1− F (s).

Page 67: 2013 W Stute Empirical Distributions

62 CHAPTER 2. THE SINGLE EVENT PROCESS

We take this opportunity to also introduce the pertaining increments.Again, consider finitely many t0 < t1 < t2 < . . . < tk+1 in increasingorder. Then the associated increments (of the S· process) are given by

St0 , St1 − St0 , . . . , Stk+1− Stk .

Note that∆Sti ≡ Sti − Sti−1 = 1ti−1<X≤ti

and the sets Bi = (ti−1, ti] are mutually disjoint. Hence, the increments ofthe Single Event Process are orthogonal with expectations

F (ti)− F (ti−1) ≡ ∆Fti .

Conclude that

Cov(∆Sti ,∆Stj ) = −∆Fti∆Ftj for i = j. (2.1.6)

Empirical d.f.’s are traditionally encountered as estimators of unknown d.f.’sF . The value Fn(t) may then be taken as a substitute for the probabilityF (t) that in a future draw from the same population the value of X does notexceed t. When one applies empirical d.f.’s in this context one assumes thatapart from the previously obtained data X1, . . . , Xn no further informationon X is available.

Very often, however, the situation is different. To give a simple example,assume that X represents the disease free survival time of a patient aftersurgery. Denote with 0 the time of surgery. Then at time 0 < s < t someinformation on X may be already available. For example, if X ≤ s, thennecessarily X ≤ t, while if X > s the patient is still at risk. Given thisinformation the quantity of interest is no longer F (t) but

P(X ≤ t|X > s) =F (t)− F (s)

1− F (s).

See Lemma 2.1.4. If rather than in the event X ≤ t we are interested inany function φ(X) of X and want to encounter the information up to s, wehave to introduce the natural filtration

Fs = σ(X ≤ u : u ≤ s).

Then (Fs)s is increasing in s with S· being adapted to F·, i.e., for each s thevariable Ss is measurable w.r.t. Fs. Actually, this is the smallest filtration

Page 68: 2013 W Stute Empirical Distributions

2.1. THE BASIC PROCESS 63

making S· an adapted process. In a real world situation, at time s, there mayalso be covariables available so that in this case the available information issummarized in a filtration Gs with Gs ⊃ Fs. In this chapter, however, westick to Fs.

In the following, we fix an integrable function φ(X) of X. Set

Xt = E[φ(X)|Ft] – a martingale.

If we set F−∞ = ∅,Ω, then

X−∞ = E[φ(X)] =

∫φdF,

the target we discussed before. For other t’s, Xt represents a predictor ofφ(X) given the information up to time t. This comment aims at changingour view of things a bit in that the former target

∫φdF now becomes only

the starting point of a whole family Xt of quantities which focus on the truebut possibly unobserved value of X rather than a population parameter. Ast increases, Xt incorporates the available information so that one hopefullycomes up with a better updated predictor of φ(X). The last remark isintended to point out the dynamic character of Xt.

In the next lemma we present an explicit expression for Xt.

Lemma 2.1.5. We have, for all t with F (t) < 1,

Xt = φ(X)1X≤t + 1X>t

∫(t,∞)

φ(u)F (du)

1− F (t).

Proof. Note that both summands are measurable w.r.t. Ft. Actually, thesecond term is just 1− St multiplied with a deterministic factor. As to thefirst, if X ≤ t, we know the precise value of X since Ft contains all X ≤ uwith u ≤ t. Knowing, for each u ≤ t, whether X ≤ u or X > uoccurred, is tantamount to knowing the value of X. Finally, using theMarkov-property of S·, it remains to show that Xt has the same expectationas φ(X) on both X ≤ t and X > t. But this is obvious.

Page 69: 2013 W Stute Empirical Distributions

64 CHAPTER 2. THE SINGLE EVENT PROCESS

The prediction error φ(X)−Xt equals

φ(X)−Xt = φ(X)1X>t − 1X>t

∫(t,∞)

φ(u)F (du)

1− F (t)

= 1X>t

φ(X)−

∫(t,∞)

φ(u)F (du)

1− F (t)

whence

E[φ(X)−Xt]2 =

∫X>t

φ(X)−

∫(t,∞)

φ(u)F (du)

1− F (t)

2

dP

=

∫X>t

φ2(X)dP−[∫

(t,∞)

φdF ]2

1− F (t)≤

∫X>t

φ2(X)dP.

As t → ∞, this upper bound tends to zero. Also rates of convergence areeasily available. E.g., if

∫|X|φ2(X)dP <∞, then (for t > 0)∫

X>t

φ2(X)dP ≤ 1

t

∫|X|φ2(X)dP = O(t−1).

If, in Lemma 2.1.5, we set φ(X) = 1X>s for t < s, we obtain

E[1X>s|Ft] = 1X>s1X≤t + 1X>t1− F (s)

1− F (t)

= 1X>t1− F (s)

1− F (t).

From this we immediately obtain

Lemma 2.1.6. The process

t→1X>t

1− F (t)

is a (forward) martingale w.r.t. the natural filtration. Each variable hasexpectation one.

Page 70: 2013 W Stute Empirical Distributions

2.2. DISTRIBUTION-FREE TRANSFORMATIONS 65

With similar arguments one can show that the process

t→1X≤t

F (t),

is a martingale in reverse time. It is interesting to look atXt if the variable tobe integrated is not a deterministic function φ(X) of X, but any (integrable)random variable Y . If m denotes the regression function of Y w.r.t. X, weobtain

Xt ≡ E [Y |Ft] = E E[Y |X]|Ft= E[m(X)|Ft],

i.e., Lemma 2.1.5 applies with φ = m. C.f. the function I in Section 1.4.

2.2 Distribution-Free Transformations

In nonparametric statistics no particular assumption about the underly-ing distribution function F will be imposed. As a computational drawbackcritical regions of test statistics or distributions of estimators may depend onF and could therefore be difficult to obtain. While with high-speed comput-ers at hand Monte Carlo techniques may nowadays be employed to overcomethese problems, beginning in the 1940’s, an increasing scepticism about theappropriateness of small parametric models led to an enormous interest instatistical methodology which was distribution-free under broad modelassumptions. As we shall see later the ideas elaborated then are useful inour context and will help to motivate also more advanced new statisticalapproaches.

Generally, a distribution-free transformation aims at transforming a variable(an observation) X to another one which has a known distribution. In thissection only real-valued X will be considered. The most important trans-formation is the one which comes up with a uniformly distributed variable.

Definition 2.2.1. The uniform distribution (on the unit interval [0, 1])is given through its d.f.

FU (t) =

0 for t < 0t for 0 ≤ t ≤ 11 for 1 < t

Page 71: 2013 W Stute Empirical Distributions

66 CHAPTER 2. THE SINGLE EVENT PROCESS

Hence, a random variable from this distribution can only take on values in(0, 1) (with probability one). Write U ∼ U [0, 1].

The other important reference distribution is the exponential distribu-tion.

Definition 2.2.2. The exponential distribution with parameter λ > 0 isgiven through

1− F (t) =

1 for t < 0exp[−λt] for t ≥ 0

(2.2.1)

We shall write X ∼ Exp(λ) whenever X has d.f. (2.2.1). Note that such anX is nonnegative with probability one. The hazard function pertaining tothis F equals the constant function λ(x) ≡ λ. The following lemma revealsa way how to generate a variable X with d.f. F .

Lemma 2.2.3. Let U be uniform on [0, 1]. Then

X := F−1(U) ∼ F. (2.2.2)

Proof. For each real t we have from the definition of F−1 that

F−1(U) ≤ t = U ≤ F (t)

whenceP(X ≤ t) = P(U ≤ F (t)) = F (t).

While Lemma 2.2.3 states that a random variable X ∼ F equals F−1(U) indistribution, the following lemma shows that any random variable X ∼ Fmay be written as

X = F−1(U) with probability one

and not only in distribution, provided the underlying probability space(Ω,A,P) is rich enough to carry appropriate random variables independentof X. This will be assumed throughout without further mentioning.

Lemma 2.2.4. Assume that X ∼ F . Then, for an appropriate uniform U ,we have

X = F−1(U) with probability one.

Page 72: 2013 W Stute Empirical Distributions

2.2. DISTRIBUTION-FREE TRANSFORMATIONS 67

Proof. Denote with A = a the set of atoms of F , i.e., the set of pointssuch that Fa > 0. There are at most countably many atoms, if there areany. Let V be independent of X and uniformly distributed on [0, 1]. Set

U =

F (X) if X /∈ AF (a−) + FaV if X = a ∈ A

.

Check that U ∼ U [0, 1] and X = F−1(U).

−3 −2 −1 0 1 2 3 4

0.0

0.2

0.4

0.6

0.8

1.0

U3

U4

U1

U5

U2

X3 X4 X1 X5 X2

Figure 2.2.1: X-raindrops as the U -clouds hit the F -mountain

What Lemma 2.2.4 points out and the corresponding Figure 2.2.1 depictsis the following: we may interpret the observed X-data as raindrops fallingto the earth. The location of the drops depends on the height of the U -clouds. Once these are blown to the right they hit the F -mountain (graph)in different locations. These are in charge of the actual values of the X’s.The experimenter or observer only has access to the X-world but not tothe U -sky. Also the shape of the mountain (shape of F ) remains unknown.More or less such a picture appropriately modified applies to other situationsin statistics, i.e., the goal is to analyze what is available to get some ideasof what is behind the stage. We also see why we need the infimum in thedefinition (1.4.1) of F−1. It guarantees that once a U -cloud (like U4, U5)has started on a level which does not yield a point to drop down, the gate(hole) between F (x−) and F (x) in the F -graph needs to be closed. This isexactly what the infimum does.

Page 73: 2013 W Stute Empirical Distributions

68 CHAPTER 2. THE SINGLE EVENT PROCESS

For most distributions, the quantile function does not allow for a simpleanalytic expression so that Lemma 2.2.3 is of limited value if one reallywants to generate X through U and F−1. For the exponential distribution,however,

u = F (t) = 1− exp(−λt)

immediately yieldsF−1(u) = −λ−1 ln(1− u)

so thatX := −λ−1 ln(1− U) ∼ Exp(λ). (2.2.3)

Lemma 2.2.4 and (2.2.3) now lead to the desired distribution-free transfor-mations. Recall

ΛF (t) =

∫(−∞,t]

F (dx)

1− F (x−),

the cumulative hazard function of F .

Lemma 2.2.5. Assume that X ∼ F and F is continuous. Then

(i) F (X) ∼ FU

and

(ii) ΛF (X) ∼ Exp(1).

Proof. By Lemma 2.2.4 we have X = F−1(U). By continuity of F ,

F F−1(u) = u for all 0 < u < 1.

HenceF (X) = U,

which proves (i). For the second statement, we obtain under continuity

ΛF (t) =

∫(−∞,t]

F (dx)

1− F (x)= − ln[1− F (t)]. (2.2.4)

Since X = F−1(U), it follows that

ΛF (X) = − ln[1− F (X)]

= − ln[1− U ] ∼ Exp(1),

Page 74: 2013 W Stute Empirical Distributions

2.3. THE UNIFORM CASE 69

by (2.2.3).

Lemma 2.2.4 has some interesting consequences for the representation of theSingle Event Process. Actually, since X = F−1(U), we have

S(t) ≡ St = 1X≤t = 1U≤F (t). (2.2.5)

If we denote with SX and SU the S-process for X and U , then (2.2.5) statesthat

SX(t) = SU (F (t)). (2.2.6)

The process SU will be restricted to [0, 1] since for t outside the unit intervalSU is constant (= 0 resp. 1) and therefore uninteresting.

Equation (2.2.6) shows that the study of SX could be traced back to that ofSU . Actually, SX equals SU after a proper time transformation throughF . When we observe X, only the left-hand side of (2.2.6) is known. NeitherU nor F are available. Though, (2.2.6) turns out to be a basic equalityto understand and study many distribution-free procedures in statistics.It is responsible for the special role played by the uniform distribution innonparametric statistics.

Therefore our next section will be devoted to the study of U and the asso-ciated SU .

2.3 The Uniform Case

In the previous section we have studied the connection between X and U ,a uniform variable.

The density of U equals

fU (t) =

1 for 0 ≤ t ≤ 10 elsewhere

.

fU is symmetric at 1/2. The associated cumulative hazard function equals

ΛU (t) =

t∫0

1

1− xdx

= − ln(1− t), 0 ≤ t < 1.

Page 75: 2013 W Stute Empirical Distributions

70 CHAPTER 2. THE SINGLE EVENT PROCESS

It tends to +∞ as t ↑ 1. The k-th moment equals

1∫0

tkdt =1

k + 1.

In particular, the mean equals 1/2, and for the variance we obtain

VarU =1

3− 1

4=

1

12.

When we study SU and – later – the uniform empirical d.f. FUn , we also

have to study the space L2(FU ). In particular, we are looking for systemsof orthonormal functions spanning relevant subspaces of L2(FU ).

Lemma 2.3.1. Consider the functions

φj(t) =√2 sin(jπt), 0 ≤ t ≤ 1, j ∈ N.

Then φj is an orthonormal system of functions in L2(FU ).

Proof. For each j ∈ N we have

1∫0

φ2j (t)dt = 2

[1

2t− cos jπt · sin jπt

2jπ

] ∣∣∣∣10

= 1

while

1∫0

sin(jπt) sin(kπt)dt =sin[(j − k)πt]

2π(j − k)

∣∣∣∣10

− sin[(j + k)πt]

2π(j + k)

∣∣∣∣10

= 0 for j = k.

Since the functions φj vanish at zero and one they are not candidates fora basis in L2(FU ). If, however, we restrict ourselves to the subspace of allfunctions in L2(FU ) vanishing at the boundary of [0,1], these functions areindeed candidates. Coming back to SU , we have SU (0) = 0 but SU (1) = 1.To obtain a process which also vanishes at t = 1, we set

α1(t) = SUt − E(SU

t ) = 1U≤t − t, 0 ≤ t ≤ 1.

Page 76: 2013 W Stute Empirical Distributions

2.3. THE UNIFORM CASE 71

The upper bar stands for uniform. This process is centered and vanishes att = 0 and t = 1. It has the same covariance structure as SU , see (2.1.4),namely

Cov(α1(s), α1(t)) = s ∧ t− st, 0 ≤ s, t ≤ 1.

As a function of t, it equals

α1(t) =

−t for t < U1− t for U ≤ t ≤ 1

.

The sample path thus has the shape of a sawtooth.

1

0U 1

Figure 2.3.1: A sawtooth-type path of α1

As a centered version of the Single Event Process the process α1 will play adominant role in our analysis. It is called the (uniform) Empirical Process(for sample size one). The extension to larger n will be introduced anddiscussed in Chapter 3.

We will show later that the system φj is a complete system of orthonormalfunctions in the subspace of all functions in L2(FU ) vanishing at zero andone. From this we get

α1 =∞∑j=1

< α1, φj > φj , (2.3.1)

Page 77: 2013 W Stute Empirical Distributions

72 CHAPTER 2. THE SINGLE EVENT PROCESS

the Fourier representation of α1. The Fourier coefficients < α1, φj > equal

< α1, φj > =

1∫0

α1(t)φj(t)dt = −U∫0

tφj(t)dt+

1∫U

(1− t)φj(t)dt

= −1∫

0

tφj(t)dt+

1∫U

φj(t)dt =√2cos(jπU)

jπ=

√2

jπψj(U),

with

ψj(x) = cos(jπx), 0 ≤ x ≤ 1.

The functions ψ1, ψ2, . . . satisfy certain properties which make the orthonor-mal representation (2.3.1) interesting for statistical applications.

Lemma 2.3.2. We have, for each j ≥ 1,

Eψj(U) =

1∫0

ψj(x)dx = 0

and, for i = j,

E[ψi(U)ψj(U)] =

1∫0

ψi(x)ψj(x)dx = 0.

Proof. The first statement follows from integration, while the second is adirect consequence of

1∫0

cos(iπx) cos(jπx)dx =sin[(i− j)πx]

2π(i− j)

∣∣∣∣10

+sin[(i+ j)πx]

2π(i+ j)

∣∣∣∣10

= 0.

We see from Lemma 2.3.2 that the Fourier coefficients

< α1, φj >=

√2

jπψj(U) (2.3.2)

Page 78: 2013 W Stute Empirical Distributions

2.3. THE UNIFORM CASE 73

are uncorrelated and centered. For the variance we get

Var(< α1, φj >) =2

j2π2

1∫0

ψ2j (x)dx =

2

j2π2

[cos jπt · sin jπt

2jπ+

1

2t

] ∣∣∣∣10

=1

j2π2.

Hence, the variances decrease at the rate j−2. We may approximate α1 bythe series in (2.3.1) truncated at some integer k:

αk1 =

k∑j=1

< α1, φj > φj .

For the difference α1 − αk1 we get by the orthonormality of the φj ’s and

Parseval’s identity

1∫0

[α1(t)− αk1(t)]

2dt =

∞∑j=k+1

< α1, φj >2 .

Its expectation equals∞∑

j=k+1

1

j2π2= O(

1

k).

For k = 0 we get1∫

0

α21(t)dt =

∞∑j=1

< α1, φj >2 .

The decomposition (2.3.1) of α1 into orthonormal functions and uncorrelatedcoefficients has its counterpart in Multivariate Statistics, where we deal withfinite-dimensional problems. There the components are called principal com-ponents. The vectors forming the orthonormal basis are eigenfunctions andthe variances of the coefficients are the eigenvalues of a certain matrix. Hereit is similar. Without much further work we shall see later that the functionsφj form an eigenbasis for our subspace of L2(FU ), and the eigenvalues

λj =1

j2π2, j = 1, 2, . . .

are associated to the covariance kernel

K(s, t) = s ∧ t− st, 0 ≤ s, t ≤ 1.

Page 79: 2013 W Stute Empirical Distributions

74 CHAPTER 2. THE SINGLE EVENT PROCESS

Therefore representation (2.3.1) constitutes the principal component de-composition of the Single Event Process.

It is also interesting to determine the distribution of the (uncorrelated)ψj(U).

Lemma 2.3.3. ψ1(U), ψ2(U), . . . all have the same distribution. TheirFourier-transform equals the Bessel-function of order zero, i.e.,

E[eit cos(jπU)

]= J0(t) :=

∞∑k=0

(−1)kt2k

(k!)24k.

Proof. Set

I1 :=

1∫0

cos[t cos(jπx)]dx

and

I2 :=

1∫0

sin[t cos(jπx)]dx.

We first show

I1 =

1∫0

cos[t cos(πx)]dx, (2.3.3)

I2 =1

j

j∑k=1

(−1)k−1

1∫0

sin[t cos(πx)]dx, (2.3.4)

1∫0

sin[t cos(πx)]dx = 0, (2.3.5)

and1∫

0

cos[t cos(πx)]dx = J0(t). (2.3.6)

As to I1, set u = jx and obtain

I1 =1

j

j∫0

cos[t cos(πu)]du =1

j

j∑k=1

k∫k−1

cos[t cos(πu)]du.

Page 80: 2013 W Stute Empirical Distributions

2.3. THE UNIFORM CASE 75

Setting y = u− (k − 1), the k-th summand becomes

k∫k−1

cos[t cos(πu)]du =

1∫0

cos[t cos(πy + π(k − 1))]dy

=

1∫0

cos[(−1)k−1t cos(πy)

]dy =

1∫0

cos[t cos(πx)]dx.

This proves (2.3.3). Similarly, we obtain for I2:

I2 =1

j

j∑k=1

1∫0

sin[(−1)k−1t cos(πy)

]dy =

1

j

j∑k=1

(−1)k−1

1∫0

sin[t cos(πx)]dx.

For (2.3.5), set u = cos(πx) whence

dx = − 1

π[sin(πx)]−1du = − 1

π[1− u2]−1/2du. (2.3.7)

Conclude that

1∫0

sin[t cos(πx)]dx =1

π

1∫−1

sin(tu)(1− u2)−1/2du = 0,

where the last equation follows from the fact that the integrand is odd.Finally,

1∫0

cos[t cos(πx)]dx =1

π

1∫−1

cos(tu)(1− u2)−1/2du

=2

π

1∫0

cos(tu)(1− u2)−1/2du =2

π

∞∑k=0

(−1)kt2k

(2k)!

1∫0

u2kdu

(1− u2)1/2.

The first equation uses (2.3.7) again, while the second uses the fact that theintegrand is even. For the last equation, just apply the series expansion ofthe cosine function. To compute the last integral, apply (2.3.7) again to get

1∫0

u2kdu

(1− u2)1/2= π

1/2∫0

cos2k(πx)dx =2k − 1

2kπ

1/2∫0

cos2(k−1)(πx)dx.

Page 81: 2013 W Stute Empirical Distributions

76 CHAPTER 2. THE SINGLE EVENT PROCESS

By recursion, we therefore obtain

1/2∫0

cos2k(πx)dx =1

2

1 · 3 · 5 . . . (2k − 1)

2 · 4 · 6 . . . 2k=

1

2

(2k)!2kk!

2kk!=

(2k)!

2 · 4k(k!)2.

Summarizing,

E[eit cos(jπU)

]=

1∫0

cos[t cos(jπx)]dx+ i

1∫0

sin[t cos(jπx)]dx

=

1∫0

cos[t cos(πx)]dx.

This proves the lemma.

Page 82: 2013 W Stute Empirical Distributions

Chapter 3

Univariate Empiricals:The IID Case

3.1 Basic Facts

So far we have discussed only those properties of empirical d.f.’s which donot require any distributional assumptions on the underlying observations.Clearly, to obtain, e.g., the distribution of Fn(t) or any other empiricalintegral

∫φdFn, the distributional structure of X1, . . . , Xn turns out to be

crucial. The situation which is best known is the case of independentidentically distributed (i.i.d.) random variables X1, . . . , Xn. As always,let F denote the unknown d.f. of each Xi.

This chapter is devoted to various problems and questions which may arise inconnection with empiricals of i.i.d. real-valued data. In the first section wecollect some basic properties of Fn which are straightforward consequencesof classical results in probability theory.

Recall the empirical distribution

µn(A) =1

n

n∑i=1

1Xi∈A.

In the i.i.d. case nµn(A) is a sum of n i.i.d. Bernoulli-random variables withsuccess parameter p = µ(A). Hence the following lemma is straightforward.

Lemma 3.1.1. Assume that X1, . . . , Xn are i.i.d. from some distribution

77

Page 83: 2013 W Stute Empirical Distributions

78 CHAPTER 3. UNIVARIATE EMPIRICALS: THE IID CASE

µ. For any (measurable) set A we have for k = 0, 1, . . . , n

P(nµn(A) = k) =

(n

k

)µk(A)(1− µ(A))n−k,

i.e., nµn(A) ∼ Bin(n, µ(A)), the binomial distribution with parametersn and p = µ(A).

Note that this result holds true for i.i.d. Xi’s which take their values in anysample space, not necessarily the real line, and for any (measurable) setA. For extended intervals A = (−∞, t], this lemma takes on a special formwhich, for the sake of reference, is listed as part of the following lemma.

Lemma 3.1.2. Assume X1, . . . , Xn are i.i.d. from F . We then have

P(nFn(t) = k) =

(n

k

)F k(t)(1− F (t))n−k, k = 0, 1, . . . , n (3.1.1)

E[Fn(t)] = F (t) (3.1.2)

Var[Fn(t)] =1

nF (t)(1− F (t)) (3.1.3)

Cov(Fn(s), Fn(t)) =1

nF (s)(1− F (t)). (3.1.4)

Since Fn is a normalized sum of independent Single Event Processes, (3.1.2)– (3.1.4) easily follow from (2.1.2), Lemma 2.1.2 and the fact that the vari-ance of a sum of independent random variables equals the sum of theirvariances. Assertion (3.1.2) states that Fn(t) is an unbiased estimator ofF (t).

From (3.1.2) and (3.1.3) we obtain that, as n→ ∞,

Fn(t) → F (t) in L2(Ω,A,P),

the space of square-integrable functions on (Ω,A,P) endowed with theL2-norm

∥ ξ ∥2 =

[∫ξ2dP

]1/2, ξ ∈ L2(Ω,A,P).

Recalling that

Fn(t) =1

n

n∑i=1

1Xi≤t

Page 84: 2013 W Stute Empirical Distributions

3.1. BASIC FACTS 79

is a sample mean of bounded (and hence P-integrable) i.i.d. random vari-ables, the Strong Law of Large Numbers (SLLN) may be applied to getfor each fixed t ∈ R:

limn→∞

Fn(t) = F (t) with probability one. (3.1.5)

In other words, Fn(t) is a strongly consistent estimator of F (t).

0 1 2 3 4 5 6

0.0

0.2

0.4

0.6

0.8

1.0

Figure 3.1.1: Example of F100 and its target F

In Figure 3.1.1 we have added the true d.f. F to the empirical d.f. F100

already depicted in Figure 1.3.1. These data were independently generatedfrom an Exp(1)-distribution. It becomes clear that apart from small devia-tions F100 follows the graph of F very closely.

Finally, the de Moivre-Laplace version of the Central Limit Theorem(CLT) yields for each t ∈ R, as n→ ∞, an approximation for the distributionof a properly standardized Fn(t):

n1/2[Fn(t)− F (t)]L−→ N (0, σ2t ), (3.1.6)

where L denotes convergence in law (or distribution), and N (0, σ2) isthe normal distribution with expectation zero and variance σ2. According

Page 85: 2013 W Stute Empirical Distributions

80 CHAPTER 3. UNIVARIATE EMPIRICALS: THE IID CASE

to (3.1.3),σ2 = σ2t = F (t)(1− F (t)).

The assertions (3.1.2) – (3.1.3) in Lemma 3.1.2 as well as (3.1.5) and (3.1.6)allow for straightforward extensions to general empirical integrals. To applythe SLLN and CLT appropriate moment conditions are required which areautomatically fulfilled for indicator functions. See also Lemma 2.1.3 forsample size n = 1.

Lemma 3.1.3. Assume that X1, . . . , Xn are i.i.d. from a d.f. F , and letφ,φ1, φ2 : R → R be arbitrary (Borel measurable) functions. Provided thatall integrals on the right-hand side exist, we have:

E[∫φdFn

]=∫φdF (3.1.7)

Var[∫φdFn

]= 1

n

[∫φ2dF −

(∫φdF

)2](3.1.8)

Cov[∫φ1dFn,

∫φ2dFn

]= 1

n

[∫φ1φ2dF −

∫φ1dF

∫φ2dF

](3.1.9)

limn→∞∫φdFn =

∫φdF with probability one (3.1.10)

n1/2[∫φdFn −

∫φdF

] L−→ N (0, σ2) (3.1.11)

with

σ2 =

∫φ2dF −

(∫φdF

)2

.

All statements are classical results from probability theory properly ’trans-lated’ into the ’language of empiricals’.

Equation (3.1.4) reveals the covariance structure of Fn when consideredas a stochastic process. To get rid of the factor 1

n on the right-hand side weneed to introduce the standardized process

αn(t) := n1/2[Fn(t)− F (t)], t ∈ R.

This process is the so-called empirical process. For sample size n = 1 andfor the uniform case it was introduced and studied in Section 2.3. In termsof αn we immediately get from the preceding results

Lemma 3.1.4. Assume that X1, . . . , Xn are i.i.d. from F . Then we have:

E[αn(t)] = 0 for each t ∈ R (3.1.12)

Cov[αn(s), αn(t)] = F (s)(1− F (t)) for s ≤ t (3.1.13)

αn(t)L−→ N (0, F (t)(1− F (t))) as n→ ∞. (3.1.14)

Page 86: 2013 W Stute Empirical Distributions

3.1. BASIC FACTS 81

Since both Fn and F are distribution functions, we have

limt↓−∞

Fn(t) = 0 = limt↓−∞

F (t)

and

limt↑∞

Fn(t) = 1 = limt↑∞

F (t).

Hence

limt↓−∞

αn(t) = 0 = limt↑∞

αn(t). (3.1.15)

By (3.1.15), we may continuously extend αn to ±∞ through

αn(−∞) = 0 = αn(∞). (3.1.16)

Since, by (3.1.16), αn vanishes at±∞, i.e., on both sides of a river called ’realline’, the empirical process is sometimes characterized as being of bridgetype.

We now come back to the distribution-free transformations as introducedin Section 2.2. Under independence of X1, . . . , Xn, we may apply (2.2.2) toeach Xi to obtain

Xi = F−1(Ui) for i = 1, . . . , n,

for an independent sample U1, . . . , Un from FU . As before, we have

Xi ≤ t if and only if Ui ≤ F (t).

Conclude that

Fn(t) = Fn(F (t)), (3.1.17)

where

Fn(u) =1

n

n∑i=1

1Ui≤u, 0 ≤ u ≤ 1,

is the empirical d.f. of the uniform sample U1, . . . , Un. The representation(3.1.17) constitutes the extension of (2.2.6) to sample size n. It allows for areduction to the uniform case. In some instances this turns out to be quiteuseful. The main advantages here are:

(i) FU is continuous

(ii) FU has compact support [0, 1].

Page 87: 2013 W Stute Empirical Distributions

82 CHAPTER 3. UNIVARIATE EMPIRICALS: THE IID CASE

Discontinuities or atoms of F may, from time to time, create some problems.With (3.1.17), these effects are absorbed by the time transformation F .The original empirical d.f. Fn and its theoretical counterpart F are definedon the real line, which is not compact. Sometimes, however, it is useful tohave a smallest and a largest t. This would require compactification of thereal line and a continuous extension of Fn and F . See (3.1.15) and (3.1.16).To some readers not familiar with these topological aspects, a transformationof the time scale is a welcome way out of the dilemma. Moreover, we shall seesoon that such transformations are extremely important to also understandthe distributional properties of several other statistics based on Fn and αn.

As to the empirical process, (3.1.17) yields

αn(t) = αn(F (t)), t ∈ R,

a composition of the deterministic function F and the random process αn

defined on the (compact) unit interval.

In our next lemma we derive a representation of the original X-order statis-tics in terms of uniform order statistics.

Lemma 3.1.5. For 0 < u < 1 we have

Fn^{−1}(u) = F^{−1}(F̄n^{−1}(u)).   (3.1.18)

In particular, for u = i/n, we get

Xi:n = F^{−1}(Ui:n), 1 ≤ i ≤ n.

Proof. The proof follows from (3.1.17) and the definition of quantile functions. See also Figure 2.2.1.
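The following short Python sketch (not from the original text) illustrates Lemma 3.1.5: sorting a uniform sample and applying the quantile function yields, in distribution, the order statistics of an F-sample. Taking F = Exp(1), so F^{−1}(u) = −log(1 − u), is an assumption made for the example.

import numpy as np

rng = np.random.default_rng(1)
n = 10
u_order = np.sort(rng.uniform(size=n))   # U_{1:n} <= ... <= U_{n:n}
x_order = -np.log1p(-u_order)            # X_{i:n} = F^{-1}(U_{i:n})
print(x_order)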

In Chapter 1 we introduced the rank of an observation as a means to describe its position within the sample. Let

R = (R1, . . . , Rn)

denote the vector of ranks. The following result presents a well-known fact, namely that under some weak assumptions R is distribution-free.


Lemma 3.1.6. Let R be the vector of ranks for an i.i.d. sequence of observations from a continuous d.f. F. Then R is distribution-free, i.e., its distribution does not depend on F.

Proof. Under continuity,

F ∘ F^{−1}(u) = u for all 0 < u < 1.

Moreover, Xi = F^{−1}(Ui) for all 1 ≤ i ≤ n. Altogether, this gives us for each 1 ≤ i ≤ n:

Ri = nFn(Xi) = nF̄n(F(Xi)) = nF̄n(F ∘ F^{−1}(Ui)) = nF̄n(Ui),

the second equality following from (3.1.17). We see that F has 'dropped out'. The proof is complete.

Lemma 3.1.6 does not specify the distribution of the vector of ranks. It only guarantees that it suffices to consider uniformly distributed observations. Due to continuity, there will be no ties among the data and each rank is well-defined. The vector R attains its values in the set of all permutations of the integers 1, . . . , n. The event {R = r} corresponds to a unique ordering of the U's. For independent U's each of the n! possible orderings has equal probability, namely

∫ . . . ∫_{0<u1<...<un<1} 1 du1 . . . dun = 1/n!.

Hence, we obtain the following lemma.

Lemma 3.1.7. For i.i.d. observations from a continuous d.f. F, the vector of ranks is uniformly distributed on the set of all permutations of 1, . . . , n (Laplace model):

P(R = r) = 1/n!.
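A small simulation sketch (our own illustration, not part of the original text) makes Lemma 3.1.7 concrete; the normal distribution and n = 3 are arbitrary choices.

import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
n, reps = 3, 60000
counts = Counter()
for _ in range(reps):
    x = rng.normal(size=n)
    ranks = tuple(x.argsort().argsort() + 1)   # R_i = n * F_n(X_i)
    counts[ranks] += 1

for perm, c in sorted(counts.items()):
    print(perm, c / reps)                      # each frequency close to 1/6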

Figure 3.1.1 suggests that the (pointwise) consistency of Fn, see (3.1.5), may also hold uniformly in t. This is ascertained in the following famous result.

Theorem 3.1.8 (Glivenko-Cantelli). Assume that X1, . . . , Xn are i.i.d. from an arbitrary (!) d.f. F. Then, as n → ∞,

Dn := sup_{t∈R} |Fn(t) − F(t)| → 0 with probability one.


Proof. The proof of this result presents another nice application of (3.1.17). First we have

Dn = sup_{t∈R} |F̄n(F(t)) − F(t)|.   (3.1.19)

Setting u = F(t), we obtain

Dn ≤ sup_{0≤u≤1} |F̄n(u) − u| ≡ D̄n.   (3.1.20)

Hence, it suffices to bound D̄n. As to this, fix some ε > 0 and choose a finite grid 0 = u0 < u1 < u2 < . . . < uk = 1 such that

max_{0≤i<k} (ui+1 − ui) ≤ ε.

Then use the monotonicity of F̄n and the identity function to get for any u between ui and ui+1:

F̄n(u) − u ≤ F̄n(ui+1) − ui ≤ |F̄n(ui+1) − ui+1| + ε

and similarly

−ε − |F̄n(ui) − ui| ≤ F̄n(u) − u.

In conclusion,

sup_{0≤u≤1} |F̄n(u) − u| ≤ max_{0≤i≤k} |F̄n(ui) − ui| + ε.

The maximum is taken over finitely many points. Hence, by (3.1.5), the max tends to zero with probability one. We obtain

lim sup_{n→∞} D̄n ≤ ε P-almost surely.

Now choose ε = 1/m with m ∈ N and let m → ∞ to complete the proof.

Though the last proof is simple, it is instructive to discuss the various issues related to Dn. First, the symbol Dn stands for the distance or discrepancy between Fn and F. In a hypothesis testing framework, when F is replaced with a hypothetical distribution, the resulting test is called the Kolmogorov-Smirnov test. Thus in our context the quantity Dn will henceforth be called the K-S distance. Other distances between Fn and F will be studied in detail later. Equation (3.1.19) is interesting in itself. For a continuous F, the time transformation u = F(t) is surjective, so that in (3.1.20) we indeed have equality:

Dn = D̄n.


Again, as with rank statistics, F has dropped out so that

Dn is distribution-free when F is continuous.   (3.1.21)

Next observe that, because Fn and F are right-hand continuous and have left-hand limits,

Dn = sup_{t∈Q} |Fn(t) − F(t)|,

with Q denoting the set of all rational numbers. Hence Dn, being the supremum of countably many random variables |Fn(t) − F(t)|, is again measurable. In particular, the event

Ω0 = {lim_{n→∞} Dn = 0}

is measurable and P(Ω0) is well-defined.

The next comment is concerned with the technical part. As remarked earlier, (3.1.17) was the key step for the transformation to the unit interval. The compactness of this interval was essential to guarantee the existence of a 'finite ε-grid'. The monotonicity of F̄n together with the monotonicity and uniform continuity of the identity function u → u were then used to obtain uniform convergence from pointwise convergence. Note that pointwise convergence at the boundary points u = 0 and u = 1 is given for free, since there both functions equal zero or one, respectively. We mention that in detail since much later, when we deal with general (weighted) empiricals, convergence at boundary points may and will cause special difficulties and nothing will be given for granted anymore.

The interested reader is asked to prove the Glivenko-Cantelli Theorem without using (3.1.17). Find a proper grid of the whole real line and allow for infinitely many discontinuities of F.

In Figures 3.1.2 and 3.1.3 we consider plots of the function Fn − F under different simulated scenarios. For the true distribution we took the uniform and the Exp(1)-distribution. Sample sizes were n = 50 and n = 100. For n = 100, Fn − F is close to zero, indicating that the approximation of F through Fn is good already for moderate sample sizes. On the other hand, since the graph of Fn − F looks like a (mathematical) worm, the plots are unable to reveal any characteristic features of the deviation between Fn and F. In such a situation it is helpful to put Fn − F under the microscope. In mathematical terms, this means that we have to multiply Fn − F with a factor which gets large as n increases. For the empirical process this factor is n^{1/2}.


[Figure 3.1.2: Glivenko-Cantelli: Uniform scenario. Four panels plotting Fn − F over [0, 1]: Samples 1 and 2 with n = 50, Samples 3 and 4 with n = 100.]

[Figure 3.1.3: Glivenko-Cantelli: Exponential scenario. Four panels plotting Fn − F over [0, 6]: Samples 1 and 2 with n = 50, Samples 3 and 4 with n = 100.]


[Figure 3.1.4: Empirical process: Uniform scenario. Four panels plotting αn over [0, 1]: Samples 1 and 2 with n = 50, Samples 3 and 4 with n = 100.]

[Figure 3.1.5: Empirical process: Exponential scenario. Four panels plotting αn over [0, 6]: Samples 1 and 2 with n = 50, Samples 3 and 4 with n = 100.]


Figures 3.1.4 and 3.1.5 show the plots of αn = n^{1/2}(Fn − F) for the same set of data. Under the microscope, the roughness of the sample paths becomes apparent. Also, the paths do not explode, so that multiplication with the standardizing factor n^{1/2} seems to keep everything in balance. Mathematically, at least for a fixed t, this is justified by the CLT guaranteeing that the distribution of αn(t) has a nondegenerate limit.
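A minimal plotting sketch in the spirit of Figures 3.1.2-3.1.5 (our own illustration; the grid and sample sizes are arbitrary choices) shows both Fn − F and αn for uniform data, where F(t) = t.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 501)
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for n in (50, 100):
    x = rng.uniform(size=n)
    Fn = np.searchsorted(np.sort(x), t, side="right") / n    # F_n(t)
    axes[0].plot(t, Fn - t, label=f"n={n}")                   # F_n - F
    axes[1].plot(t, np.sqrt(n) * (Fn - t), label=f"n={n}")    # alpha_n
axes[0].set_title("F_n - F"); axes[1].set_title("alpha_n")
for ax in axes: ax.legend()
plt.show()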

The next lemma presents an algorithm for computing Dn in finitely many steps.

Lemma 3.1.9. For a continuous F, we have

Dn = D*n := max_{1≤i≤n} max[F(Xi:n) − (i − 1)/n, i/n − F(Xi:n)].

Proof. For Xi:n ≤ t < Xi+1:n and 1 ≤ i < n we have by monotonicity

Fn(t) − F(t) ≤ Fn(Xi:n) − F(Xi:n) = i/n − F(Xi:n) ≤ D*n

as well as

Fn(t) − F(t) ≥ Fn(Xi:n) − F(Xi+1:n) ≥ −D*n.

For t < X1:n,

−D*n ≤ −F(X1:n) ≤ Fn(t) − F(t) ≤ 0 ≤ D*n,

while for t ≥ Xn:n,

−D*n ≤ 0 ≤ Fn(t) − F(t) ≤ 1 − F(Xn:n) ≤ D*n.

Summarizing, we obtain Dn ≤ D*n. Conversely,

i/n − F(Xi:n) = Fn(Xi:n) − F(Xi:n) ≤ Dn

and, by continuity of F,

F(Xi:n) − (i − 1)/n = lim_{t↑Xi:n} [F(t) − Fn(t)] ≤ Dn.

Hence D*n ≤ Dn, and the proof is complete.
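The finite algorithm of Lemma 3.1.9 is easily coded. The sketch below (not from the original text) computes Dn for a continuous F; taking F to be the standard normal c.d.f. via scipy.stats.norm is an assumption made only for the example.

import numpy as np
from scipy.stats import norm

def ks_distance(x, cdf):
    """D_n = max_i max( F(X_{i:n}) - (i-1)/n , i/n - F(X_{i:n}) )."""
    n = len(x)
    u = cdf(np.sort(x))               # F(X_{1:n}), ..., F(X_{n:n})
    i = np.arange(1, n + 1)
    return np.max(np.maximum(u - (i - 1) / n, i / n - u))

rng = np.random.default_rng(4)
print(ks_distance(rng.normal(size=100), norm.cdf))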


Lemma 3.1.9 implies that

Dn = max(D+n, D−n),

where

D+n = sup_{t∈R} [Fn(t) − F(t)] = max_{1≤i≤n} [i/n − F(Xi:n)]

and

D−n = sup_{t∈R} [F(t) − Fn(t)] = max_{1≤i≤n} [F(Xi:n) − (i − 1)/n]

denote the two one-sided deviations between Fn and F. Also D+n and D−n are distribution-free as long as F is continuous. Moreover, both have the same distribution. This is most easily seen by introducing the variables Ui = F(Xi) and U*i = 1 − Ui, 1 ≤ i ≤ n. Under continuity of F, the U*i's are also i.i.d. from FU. Moreover, in an obvious notation, D+*n = D−n, whence D+n = D−n in distribution.

Our final results are concerned with the consistency of empirical quantiles. So far we have only studied the consistency of empirical integrals, with special emphasis on Fn(t). Under appropriate integrability conditions on φ, consistency was a straightforward consequence of the SLLN. For quantiles the situation is different. They are defined through Fn^{−1} and F^{−1}, to which a priori no SLLN applies.

We first study the uniform quantiles F̄n^{−1}(u). Lemma 3.1.10 shows that the K-S distance between F̄n^{−1} and the identity function equals the K-S distance between F̄n and the identity function.

Lemma 3.1.10. We have

D̄n := sup_{0≤u≤1} |F̄n(u) − u| = sup_{0<u<1} |F̄n^{−1}(u) − u| =: D**n.

Proof. Because of Lemma 3.1.9, we have

D̄n = max_{1≤i≤n} max[Ui:n − (i − 1)/n, i/n − Ui:n].

Since

Ui:n − (i − 1)/n = lim_{u↓(i−1)/n} [F̄n^{−1}(u) − u]

and

i/n − Ui:n = i/n − F̄n^{−1}(i/n),

we obtain

D̄n ≤ D**n.

Conversely,

(i − 1)/n < u ≤ i/n implies F̄n^{−1}(u) = Ui:n

and therefore

−D̄n ≤ Ui:n − i/n ≤ F̄n^{−1}(u) − u ≤ Ui:n − (i − 1)/n ≤ D̄n,

from which D**n ≤ D̄n.

An application of the Glivenko-Cantelli Theorem to a uniform sample together with Lemma 3.1.10 immediately yields, with probability one,

sup_{0<u<1} |F̄n^{−1}(u) − u| = D**n → 0.   (3.1.22)

For a general F, things are less obvious. The next lemma gives a positive answer as to pointwise consistency.

Lemma 3.1.11. Assume that F^{−1} is continuous at 0 < u < 1. Then

lim_{n→∞} Fn^{−1}(u) = F^{−1}(u) with probability one.

Proof. The proof is an immediate consequence of (3.1.18) and (3.1.22).

It is easy to see that F^{−1} is continuous at u iff

F(t) > u for each t > F^{−1}(u),

i.e., there is no non-degenerate interval [F^{−1}(u), a] such that F is constant there. Put another way, F^{−1} is discontinuous at u iff F attains the value u on a non-degenerate interval. Conclude that F^{−1} has at most countably many discontinuities, so that in particular Lemma 3.1.11 holds for Lebesgue-almost all 0 < u < 1. Hence the Ui's appearing in the representation Xi = F^{−1}(Ui) are with probability one continuity points of F^{−1}.


If F^{−1} is continuous on some compact subinterval [u0, u1] of (0, 1), then a modification of the proof of Theorem 3.1.8 gives uniform convergence on [u0, u1]:

sup_{u0≤u≤u1} |Fn^{−1}(u) − F^{−1}(u)| → 0 with probability one.

An extension to the whole (open!) interval (0, 1) creates some problems, since compactness of the parameter set is lost and no simple 'ε-grid' approach is available. Actually, suppose that F has unbounded support, like a normal distribution. Then

sup_{0<u<1} |Fn^{−1}(u) − F^{−1}(u)| = ∞,

since X1:n ≤ Fn^{−1}(u) ≤ Xn:n and F^{−1} is unbounded. This simple counterexample shows that compactness arguments are indeed crucial to obtain uniform convergence, and that the uniform convergence valid for Fn may not hold for other estimators. Another example is the cumulative hazard function. We have noted in (2.2.4) that for a continuous F

ΛF(t) = −ln[1 − F(t)]

and therefore

ΛF(t) ↑ ∞ as t ↑ bF,

where bF is the smallest upper bound for the support of F, possibly infinite.

The Nelson-Aalen estimator

Λn(t) = ∫_{(−∞,t]} Fn(dx)/(1 − Fn(x−))

is a bounded function, so that necessarily

sup_t |Λn(t) − ΛF(t)| = ∞.

One may argue that the bad fit of Λn is caused by t's in the far right tails, namely t ≫ Xn:n. This is true only to a certain extent, since typically a bad fit of an estimating function is not abrupt but takes place gradually. See Figure 1.4.1. In other words, Λn(t) is not a reliable estimator of ΛF(t) already for those t's ranging between the extreme order statistics Xn:n > Xn−1:n > Xn−2:n > . . ..
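For complete (uncensored) data without ties, the Nelson-Aalen estimator above reduces to Λn(t) = Σ_{Xi:n ≤ t} 1/(n − i + 1), since Fn jumps by 1/n at Xi:n and Fn(Xi:n−) = (i − 1)/n. The following sketch (not from the original text) codes this; the Exp(1) sample, for which ΛF(t) = t, is an illustrative assumption.

import numpy as np

def nelson_aalen(x, t):
    """Lambda_n(t) for an untied sample: sum over X_{i:n} <= t of 1/(n-i+1)."""
    xs = np.sort(x)
    n = len(xs)
    k = np.searchsorted(xs, t, side="right")   # number of X_{i:n} <= t
    return np.sum(1.0 / (n - np.arange(k)))    # 1/(n-i+1), i = 1, ..., k

rng = np.random.default_rng(5)
x = rng.exponential(size=200)
for t in (0.5, 1.0, 2.0):
    print(t, nelson_aalen(x, t))               # compare with Lambda_F(t) = t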


3.2 Finite-Dimensional Distributions

In the last section we have seen that for i.i.d. random variables with distribution µ, we have

P(nµn(A) = k) = \binom{n}{k} µ(A)^k (1 − µ(A))^{n−k}, k = 0, 1, . . . , n.

Since the number of observations n is fixed, the event {nµn(A) = k} automatically implies nµn(Ā) = n − k, where Ā denotes the complement of A. In other words, the event on the left-hand side equals

{nµn(A) = k, nµn(Ā) = n − k}.

The collection {A, Ā} forms a simple partition of the sample space S. We now discuss the extension to more than two sets. Let A1, A2, . . . , Am be a (measurable) partition of the sample space, i.e.,

Ai ∩ Aj = ∅ for i ≠ j, i, j = 1, . . . , m, and ∪_{i=1}^m Ai = S.

Such a situation occurs in the χ²-test, where one compares the frequencies µn(Ai) with hypothetical cell probabilities µ0(Ai), 1 ≤ i ≤ m, through the pertaining weighted Euclidean distance:

χ² = n Σ_{i=1}^m [µn(Ai) − µ0(Ai)]² / µ0(Ai).   (3.2.1)

The joint distribution of these frequencies equals a multinomial distribution.

Lemma 3.2.1. For i.i.d. observations from a distribution µ we have

P(nµn(Ai) = ki for 1 ≤ i ≤ m) = n!/(k1! · · · km!) ∏_{i=1}^m [µ(Ai)]^{ki},

for ki = 0, 1, . . . , n such that Σ_{i=1}^m ki = n.

This lemma may also be applied to obtain the joint distribution of empirical measures of sets which are not disjoint but increasing.

Lemma 3.2.2. For sets C1 ⊂ C2 ⊂ . . . ⊂ Cm and integers 0 ≤ k1 ≤ k2 ≤ . . . ≤ km ≤ n we have

P(nµn(Ci) = ki for 1 ≤ i ≤ m) = n!/(n1! n2! · · · nm+1!) ∏_{i=1}^{m+1} [µ(Ai)]^{ni},

where n1 = k1, ni = ki − ki−1 for 2 ≤ i ≤ m, nm+1 = n − km, and A1 = C1, Ai = Ci \ Ci−1 for 2 ≤ i ≤ m, Am+1 = S \ Cm.

Proof. The proof is an immediate consequence of the previous lemma.

Though Lemmas 3.2.1 and 3.2.2 provide us with exact distributions, the formulas are not tractable for moderate and large sample sizes. Moreover, for a deeper forthcoming analysis they are also not very insightful for detecting hidden structures. In particular, if we want to study the dynamic behavior of empirical and related processes, it is useful to restate the last lemma in terms of conditional probabilities.

Lemma 3.2.3. In the situation of Lemma 3.2.2 we have

P(nµn(Cm) = km | nµn(Ci) = ki for 1 ≤ i ≤ m − 1)
= P(nµn(Cm) = km | nµn(Cm−1) = km−1)
= \binom{n − km−1}{km − km−1} [µ(Cm \ Cm−1)/(1 − µ(Cm−1))]^{km−km−1} [1 − µ(Cm \ Cm−1)/(1 − µ(Cm−1))]^{n−km}.

Proof. Write the conditional probability as a ratio of two probabilities and apply the last lemma.

Lemma 3.2.3 states that nµn has a Markov property when it is evaluated on increasing sets. Furthermore, the 'transition probability' equals a (shifted) binomial distribution with new parameters n − km−1 and µ(· | S \ Cm−1). More precisely, given nµn(Cm−1) = km−1, nµn(Cm) has the same distribution as km−1 + (n − km−1) µn−km−1(Cm \ Cm−1), where the data defining the last empirical distribution are i.i.d. from µ restricted (and re-normalized) to the complement of Cm−1.

For real-valued data, the most important Ci's are again the extended intervals (−∞, ti]. Because of their importance, the preceding results are summarized for these C's in the following Theorem. It constitutes the extension of Lemma 2.1.4 to sample size n > 1.

Theorem 3.2.4. Assume that X1, . . . , Xn are i.i.d. from F. Then Fn is a Markov process such that for t1 < t2 < . . . < tm and 0 ≤ k1 ≤ k2 ≤ . . . ≤ km ≤ n:

P(nFn(tm) = km | nFn(tm−1) = km−1)
= \binom{n − km−1}{km − km−1} [(F(tm) − F(tm−1))/(1 − F(tm−1))]^{km−km−1} [1 − (F(tm) − F(tm−1))/(1 − F(tm−1))]^{n−km}.


Considering an increasing sequence of ti's is, more or less, only a matter of convenience or taste. From a mathematical point of view we could also consider a decreasing sequence of ti's. In such a situation we have

Theorem 3.2.5. Assume that X1, . . . , Xn are i.i.d. from F. Then Fn is a Markov process in reverse time such that for t1 < t2 < . . . < tm and 0 ≤ k1 ≤ k2 ≤ . . . ≤ km ≤ n:

P(nFn(t1) = k1 | nFn(t2) = k2, . . . , nFn(tm) = km)
= P(nFn(t1) = k1 | nFn(t2) = k2)
= \binom{k2}{k1} (F(t1)/F(t2))^{k1} (1 − F(t1)/F(t2))^{k2−k1}.

Later on we shall view the processes Fn and αn as random elements in the space D of all (possibly discontinuous) functions which are right-hand continuous and have limits from the left, equipped with the σ-field generated by all Fn(t), t ∈ R. Its distribution is uniquely determined through its finite-dimensional distributions. We may then combine Theorems 3.2.4 and 3.2.5 with the uniform representation (3.1.17) to come up with the following result.

Theorem 3.2.6. For i.i.d. data, nFn is a Markov process such that conditionally on nFn(t0) = k, the process nFn(t) on t ≥ t0 has the same distribution as the process

t → k + (n − k) F̄n−k[(F(t) − F(t0))/(1 − F(t0))].

Similarly, for the process in reverse time, we have that conditionally on nFn(t0) = k, the process nFn(t) on t ≤ t0 has the same distribution as the process

t → k F̄k(F(t)/F(t0)).

Informally speaking, we see that conditionally on nFn(t0) = k, the process nFn on t ≥ t0 resp. t ≤ t0 equals in distribution a uniform empirical d.f. with sample size and time transformation properly adjusted. These statements are (hopefully) easier to remember than those in Theorems 3.2.4 and 3.2.5. They readily enable us to compute conditional expectations. For example, in the forward case, let Ft = σ(Fn(s) : s ≤ t). Then the first statement in Theorem 3.2.6 implies for t0 ≤ t:

E[nFn(t) | Ft0] = nFn(t0) + (n − nFn(t0)) (F(t) − F(t0))/(1 − F(t0)),   (3.2.2)


while in the backward case with Gt = σ(Fn(s) : t ≤ s):

E[nFn(t) | Gt0] = nFn(t0) F(t)/F(t0).   (3.2.3)

From (3.2.2) and (3.2.3) we immediately get the following result.

Corollary 3.2.7. Let X1, . . . , Xn be i.i.d. from F. Then

• the process t → (1 − Fn(t))/(1 − F(t)) is a martingale w.r.t. (Ft)t (on {F(t) < 1});

• the process t → Fn(t)/F(t) is a reverse martingale w.r.t. (Gt)t (on {F(t) > 0}).

The forward case also readily follows from Lemma 2.1.6.
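The conditional-expectation formula (3.2.2) behind Corollary 3.2.7 can be checked by simulation. The sketch below is our own illustration; uniform data (so F(t) = t), t0 = 0.3, t = 0.6 and the sample size are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(6)
n, reps, t0, t = 20, 200000, 0.3, 0.6
u = rng.uniform(size=(reps, n))
k0 = (u <= t0).sum(axis=1)                    # nF_n(t0)
k1 = (u <= t).sum(axis=1)                     # nF_n(t)
for k in (4, 6, 8):
    emp = k1[k0 == k].mean()                  # simulated E[nF_n(t) | nF_n(t0) = k]
    theo = k + (n - k) * (t - t0) / (1 - t0)  # formula (3.2.2)
    print(k, emp, theo)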

Theorem 3.2.6 also yields explicit formulas for conditional variances:

Var[nFn(t) | Ft0] = (n − nFn(t0)) [(F(t) − F(t0))/(1 − F(t0))] [1 − (F(t) − F(t0))/(1 − F(t0))]   (3.2.4)

Var[nFn(t) | Gt0] = nFn(t0) [F(t)/F(t0)] [1 − F(t)/F(t0)].   (3.2.5)

The last statements in this section deal with the limit distribution of αn at finitely many points. With Nm(0, Σ) we denote the m-variate normal distribution with expectation 0 ∈ R^m and m×m covariance matrix Σ.

Lemma 3.2.8. Recall

αn(t) = n^{1/2}[Fn(t) − F(t)], t ∈ R,

the empirical process. Then, for each t1 ≤ t2 ≤ . . . ≤ tm, we get

[αn(t1), . . . , αn(tm)]^T →_L Nm(0, Σ) as n → ∞

with

Σ = (σij)_{1≤i,j≤m} and σij = F(ti ∧ tj) − F(ti)F(tj).   (3.2.6)

This lemma is an immediate consequence of the multivariate CLT. For (3.2.6), see also (2.1.4).

Next we formulate the asymptotic result corresponding to the disjoint sets A1, . . . , Am considered in Lemma 3.2.1. Since µn(Am) = 1 − µn(A1) − . . . − µn(Am−1), the limit distribution of (µn(A1), . . . , µn(Am)), when properly standardized, will be degenerate. Therefore we restrict ourselves to the first m − 1 coordinates.

Lemma 3.2.9. Let A1, . . . , Am be a partition of the sample space. Then, as n → ∞,

n^{1/2}[µn(A1) − µ(A1), . . . , µn(Am−1) − µ(Am−1)]^T →_L Nm−1(0, Σ),

where Σ is the (m − 1)×(m − 1) matrix with entries

Σii = pi(1 − pi) and Σij = −pi pj for i ≠ j,

and pi = µ(Ai).

Proof. The assertion follows from the multivariate CLT. For the covariance matrix, recall (2.1.6).

We only mention the well-known fact that Lemma 3.2.9 governs the limit distribution of χ² from (3.2.1). Actually, setting p̂i = µn(Ai) we get

χ² = n (p̂1 − p1, . . . , p̂m−1 − pm−1) I (p̂1 − p1, . . . , p̂m−1 − pm−1)^T

with the (m − 1)×(m − 1) matrix I having entries

Iii = pi^{−1} + pm^{−1} and Iij = pm^{−1} for i ≠ j,

denoting the pertaining Fisher Information Matrix. From Lemma 3.2.9 and the Continuous Mapping Theorem, we obtain

χ² →_L Y^T I Y with Y ∼ Nm−1(0, Σ).

But I = Σ^{−1}. We therefore have the following result.

Theorem 3.2.10. As n → ∞,

χ² →_L Z,

where Z has a χ²-distribution with m − 1 degrees of freedom.
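The statistic (3.2.1) and its χ²(m − 1) limit can be illustrated numerically. The following sketch is our own; the partition of [0, 1] into m = 4 equal cells, uniform data, and the use of the true cell probabilities as hypothetical ones are assumptions made for the example.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n, m, reps = 500, 4, 2000
p0 = np.full(m, 1 / m)
stats = []
for _ in range(reps):
    counts = np.bincount((rng.uniform(size=n) * m).astype(int), minlength=m)
    mu_n = counts / n
    stats.append(n * np.sum((mu_n - p0) ** 2 / p0))    # statistic (3.2.1)

# Simulated upper 5% point versus the chi-square(m-1) quantile.
print(np.quantile(stats, 0.95), chi2.ppf(0.95, df=m - 1))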


3.3 Order Statistics

In this section we collect some basic distributional properties of order statistics. Special emphasis is given to order statistics from the uniform and the exponential distribution. Order statistics are important for at least three reasons:

• Special quantiles like the median, upper and lower quartiles or, more generally, linear combinations of order statistics are important (robust) competitors to estimators based on empirical moments (like sample means or sample variances).

• As Lemma 3.1.9 has shown, properly transformed order statistics may appear in distribution-free statistics. To determine the distribution of these statistics, a detailed study of order statistics will be indispensable.

• In some data situations, e.g., in life testing, n items are put on test and monitored until the r-th failure. In such a situation only X1:n, . . . , Xr:n are observable and any statistical inference therefore needs to be based on them.

Our first result deals with the distribution of a single Xi:n. For example, when i = n, we get

P(Xn:n ≤ x) = P(Xi ≤ x for all i = 1, . . . , n) = F^n(x), by independence.

For i = 1, we obtain

P(X1:n ≤ x) = 1 − P(X1:n > x) = 1 − P(Xi > x for all i = 1, . . . , n) = 1 − [1 − F(x)]^n.

These examples are two special cases of the following more general result.

Lemma 3.3.1. Let X1, . . . , Xn be a sample of independent observations from a d.f. F. Then

Gi(x) = P(Xi:n ≤ x) = Σ_{k=i}^n \binom{n}{k} F^k(x)(1 − F(x))^{n−k}, 1 ≤ i ≤ n.

Proof. The assertion immediately follows from, see (1.3.4),

P(Xi:n ≤ x) = P(i/n ≤ Fn(x))

and (3.1.1).

The function Gi only gives limited information about the distributional character of order statistics. Therefore we next determine the joint distribution of all Xi:n's. Since by construction X1:n ≤ X2:n ≤ . . . ≤ Xn:n, this distribution is supported by the set K of all n-vectors x = (x1, . . . , xn) satisfying x1 ≤ x2 ≤ . . . ≤ xn. The set of rectangles ∏_{i=1}^n (ai, bi] with a1 ≤ b1 ≤ . . . ≤ an ≤ bn uniquely determines a distribution on K. From the i.i.d. property of the Xi's, we obtain

P(ai < Xi:n ≤ bi for 1 ≤ i ≤ n) = n! ∏_{i=1}^n [F(bi) − F(ai)].

When F admits a Lebesgue density f, the Xi:n's also have a Lebesgue density supported by K.

Lemma 3.3.2. Assume that X1, . . . , Xn are independent from a density f. Then (X1:n, . . . , Xn:n) has the Lebesgue density

gn(x1, . . . , xn) = n! ∏_{i=1}^n f(xi) for x1 ≤ x2 ≤ . . . ≤ xn, and gn(x1, . . . , xn) = 0 elsewhere.

For the sake of reference we display two important special cases:

F = FU:   gn(x1, . . . , xn) = n! for 0 ≤ x1 ≤ . . . ≤ xn ≤ 1, and 0 elsewhere.

F = Exp(1):   gn(x1, . . . , xn) = n! exp[−x1 − . . . − xn] for 0 ≤ x1 ≤ . . . ≤ xn < ∞, and 0 elsewhere.

The distributional behavior of the order statistics 0 ≤ E1:n ≤ . . . ≤ En:n pertaining to a sample E1, . . . , En of independent random variables from an Exp(1) distribution is particularly nice.

As seen before, (E1:n, . . . , En:n) has density

gn(x1, . . . , xn) = n! exp[−Σ_{i=1}^n xi] for 0 ≤ x1 < . . . < xn, and 0 elsewhere.


Define E0:n = 0 and set

Yin ≡ Yi = (n − i + 1)(Ei:n − Ei−1:n),

whence

Er:n = Σ_{i=1}^r (Ei:n − Ei−1:n) = Σ_{i=1}^r Yi/(n − i + 1).

The interesting fact about this representation is that the Yi's have a simple distributional structure. See Renyi (1953).

Lemma 3.3.3. The random variables Y1, . . . , Yn are i.i.d. from Exp(1).

Proof. We have

(Y1, . . . , Yn)^T = A (E1:n, . . . , En:n)^T

with the n × n matrix A having diagonal entries Aii = n − i + 1, subdiagonal entries Ai,i−1 = −(n − i + 1) for 2 ≤ i ≤ n, and zeros elsewhere:

A =
[  n          0        . . .              0 ]
[ −(n−1)      n−1      0      . . .       0 ]
[  0         −(n−2)    n−2                  ]
[                      . . .                ]
[            −2        2                  0 ]
[  0   . . .           −1                 1 ]

From calculus, the Y-vector has density

(x1, . . . , xn) → |det A^{−1}| gn(A^{−1}(x1, . . . , xn)^T).

Since det A^{−1} = 1/det A = 1/n! and

gn(A^{−1}(x1, . . . , xn)^T) = n! exp[−Σ_{i=1}^n xi] for x1, . . . , xn ≥ 0, and 0 elsewhere,

the proof is complete.
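The Renyi representation of Lemma 3.3.3 gives a direct way to generate exponential order statistics from i.i.d. Exp(1) spacings. The sketch below is our own illustration; the sample size is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(8)
n = 10
y = rng.exponential(size=n)                     # Y_1, ..., Y_n i.i.d. Exp(1)
e_order = np.cumsum(y / (n - np.arange(n)))     # E_{r:n} = sum_{i<=r} Y_i/(n-i+1)
print(e_order)                                  # increasing, distributed as Exp(1) order statistics
print(np.sort(rng.exponential(size=n)))         # direct construction, same distribution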


If we represent the Ei's in terms of uniform random variables U1, . . . , Un:

Ei = −ln(1 − Ui),

then

−ln(1 − Ur:n) = Σ_{i=1}^r Yi/(n − i + 1).   (3.3.1)

Solving for Yr yields

Yr/(n − r + 1) = ln[(1 − Ur−1:n)/(1 − Ur:n)]

and therefore

Qr := [(1 − Ur:n)/(1 − Ur−1:n)]^{n−r+1} = exp(−Yr).   (3.3.2)

We may therefore conclude from Lemma 3.3.3 the following Corollary. See Malmquist (1950).

Corollary 3.3.4. The random variables Q1, . . . , Qn are i.i.d. from FU .

There is another interesting variant of Lemma 3.3.3 which brings the cumulative hazard function ΛF back into play.

Corollary 3.3.5. Assume that X1, . . . , Xn are i.i.d. from a continuous F. Then the variables

Zi = (n − i + 1)[ΛF(Xi:n) − ΛF(Xi−1:n)], 1 ≤ i ≤ n,

are i.i.d. from Exp(1). For completeness, set ΛF(X0:n) = 0.

This Corollary is indeed an extension of Lemma 3.3.3 to the case of a general continuous F. Actually, when the Xi's are Exp(1), then ΛF(t) = t so that the Zi's equal the Yi's from before. Corollary 3.3.5 reveals that ΛF may serve as a tool to transform the dependent order statistics to a sample of independent random variables with a specified distribution.

Proof of Corollary 3.3.5. The assertion is a straightforward consequence of Lemma 3.3.3, the monotonicity of ΛF and Lemma 2.2.5.

Relation (3.3.1) is the basic tool to analyze the probabilistic properties of uniform order statistics. First,

1 − Ur:n = exp[−Σ_{i=1}^r Yi/(n − i + 1)].   (3.3.3)


Since the Y's are independent, the partial sums are Markovian. So are U1:n, . . . , Un:n. Moreover, for k + 1 ≤ j ≤ n, we have

Uj:n = 1 − exp[−Σ_{i=1}^j Yi/(n − i + 1)] = 1 − exp[−Σ_{i=1}^k Yi/(n − i + 1)] exp[−Σ_{i=k+1}^j Yi/(n − i + 1)],

whence

(Uj:n − Uk:n)/(1 − Uk:n) = 1 − exp[−Σ_{i=k+1}^j Yi/(n − i + 1)]   (3.3.4)
= 1 − exp[−Σ_{i=k+1}^j Yi/((n − k) − (i − k) + 1)]
= 1 − exp[−Σ_{i=1}^{j−k} Yi+k/(n − k − i + 1)]
=_L Uj−k:n−k.

From this we immediately get the following lemma.

Lemma 3.3.6. For each n ≥ 1, U1:n, . . . , Un:n is a Markov sequence such that for 0 ≤ y1 ≤ . . . ≤ yk ≤ 1:

L((Ui:n − yk)/(1 − yk), k + 1 ≤ i ≤ n | U1:n = y1, . . . , Uk:n = yk) = L(U1:n−k, . . . , Un−k:n−k).

Here again ’L’ stands for distribution.

The following lemma presents a reverse Markov property for the Ui:n’s.

Lemma 3.3.7. For each n ≥ 1 and all 0 < yk+1 ≤ . . . < yn < 1 we have

L(Ui:n, 1 ≤ i ≤ k | Uk+1:n = yk+1, . . . , Un:n = yn) = L(yk+1 Ui:k, 1 ≤ i ≤ k).

Proof. Put Ūi = 1 − Ui. The Ūi's are again i.i.d. from FU. Since Ūi:n = 1 − Un−i+1:n, we get from the preceding lemma, for any a1 ≤ b1 ≤ . . . ≤ ak ≤ bk ≤ yk+1:

P(ai < Ui:n ≤ bi for 1 ≤ i ≤ k | Uk+1:n = yk+1, . . . , Un:n = yn)
= P(1 − bi ≤ Ūn−i+1:n < 1 − ai for 1 ≤ i ≤ k | Ūn−k:n = 1 − yk+1, . . . , Ū1:n = 1 − yn)
= P(−bi + yk+1 ≤ yk+1 Uk−i+1:k < −ai + yk+1 for 1 ≤ i ≤ k)
= P(ai < yk+1 Ui:k ≤ bi for 1 ≤ i ≤ k).

Our final result in this direction is the following

Lemma 3.3.8. For 1 < r < n and 0 < x < 1, given Ur:n = x, the random vectors (U1:n, . . . , Ur−1:n) and (Ur+1:n, . . . , Un:n) are independent and distributed as (xU1:r−1, . . . , xUr−1:r−1) and (x + (1 − x)U1:n−r, . . . , x + (1 − x)Un−r:n−r).

When F has a derivative f = F′, then Gi also has a Lebesgue density which may be immediately derived from the formula in Lemma 3.3.1. In the following we shall rather apply the Markov structure of the Ui:n's to get some experience with a recursion technique, which also applies in a conditional setting.

First,

Gn(x) = P(Un:n ≤ x) = x^n for 0 ≤ x ≤ 1.

Furthermore, for 1 ≤ i < n and 0 ≤ x ≤ 1, we have

Gi(x) = P(Ui:n ≤ x) = ∫_0^1 P(Ui:n ≤ x | Ui+1:n = y) Gi+1(dy)
= P(Ui+1:n ≤ x) + ∫_x^1 P(yUi:i ≤ x) Gi+1(dy)
= Gi+1(x) + x^i ∫_x^1 y^{−i} Gi+1(dy).   (3.3.5)

This recursion will be the key tool to compute the density gi of Ui:n.

Page 108: 2013 W Stute Empirical Distributions

3.3. ORDER STATISTICS 103

Lemma 3.3.9. For 1 ≤ i ≤ n,

gi(x) = x^{i−1}(1 − x)^{n−i}/B(i, n − i + 1) for 0 ≤ x ≤ 1, and gi(x) = 0 elsewhere.

Here

B(a, b) = ∫_0^1 t^{a−1}(1 − t)^{b−1} dt, a, b > 0,

denotes the Beta-function.

Remark 3.3.10. Recall that

B(a, b) = Γ(a)Γ(b)/Γ(a + b)

and Γ is the Gamma function satisfying

Γ(a) = (a − 1)! for a ∈ N.

Proof of Lemma 3.3.9. For i = n,

gn(x) = n x^{n−1} on 0 ≤ x ≤ 1,

which is the same as the expression in the lemma. For 1 ≤ i < n and 0 ≤ x ≤ 1, we apply (3.3.5) and use induction to get

gi(x) = gi+1(x) + i x^{i−1} ∫_x^1 y^{−i} gi+1(y) dy − x^i x^{−i} gi+1(x)
= i x^{i−1} ∫_x^1 y^{−i} y^i (1 − y)^{n−i−1} dy / B(i + 1, n − i)
= i x^{i−1} ∫_x^1 (1 − y)^{n−i−1} dy / B(i + 1, n − i)
= x^{i−1}(1 − x)^{n−i}/B(i, n − i + 1),

where the last equality follows from

(n − i) B(i + 1, n − i)/i = B(i, n − i + 1).

Page 109: 2013 W Stute Empirical Distributions

104 CHAPTER 3. UNIVARIATE EMPIRICALS: THE IID CASE

For a general d.f. F with density f, since

P(Xi:n ≤ x) = P(Ui:n ≤ F(x)),

Xi:n has density

fi(x) = f(x) F^{i−1}(x)(1 − F(x))^{n−i}/B(i, n − i + 1).   (3.3.6)

Lemma 3.3.9 immediately yields the moments of Ui:n.

Corollary 3.3.11. For j ≥ 1 and 1 ≤ i ≤ n, we have

E U^j_{i:n} = B(j + i, n − i + 1)/B(i, n − i + 1).

As special cases we obtain

E Ui:n = i/(n + 1) ≡ πi,n

and

E U²_{i:n} = i(i + 1)/((n + 1)(n + 2)).

From this,

Var Ui:n = πi,n(1 − πi,n)/(n + 2).

These formulae may also be easily obtained by using (3.3.3). For example, since

E[exp(−λE)] = 1/(1 + λ) for λ > 0,

E[1 − Ur:n] = ∏_{i=1}^r 1/(1 + 1/(n − i + 1)) = ∏_{i=1}^r (n − i + 1)/(n − i + 2) = (n − r + 1)/(n + 1) = 1 − r/(n + 1),

whence

E Ur:n = r/(n + 1).

Furthermore, putting j = k + 1 in (3.3.4), we get

Uk+1:n = e^{−Yk+1/(n−k)} Uk:n + (1 − e^{−Yk+1/(n−k)})

and, by independence of Uk:n and Yk+1,

E[Uk+1:n | U1:n, . . . , Uk:n] = ((n − k)/(n − k + 1)) Uk:n + 1/(n − k + 1).

Lemma 3.3.12. The sequence

(Ui:n − i/(n + 1))/(n − i + 1), 1 ≤ i ≤ n,

is a mean zero martingale w.r.t. the natural filtration

Fi = σ(Uj:n : 1 ≤ j ≤ i), 1 ≤ i ≤ n.

The reverse Markov property of the Ui:n's gives rise to a reverse martingale. The proof is similar to that of Lemma 3.3.12 and is therefore omitted.

Lemma 3.3.13. The sequence

(n + 1) Ui:n/i, 1 ≤ i ≤ n,

is a mean-one reverse martingale w.r.t. the natural filtration Fi = σ(Uj:n : i ≤ j ≤ n).

In the last result of this section, we relate uniform order statistics to partial sums (rather than order statistics) of independent exponential random variables.

Lemma 3.3.14. The vector (U1:n, . . . , Un:n) of uniform order statistics has the same distribution as

(Z1/Zn+1, . . . , Zn/Zn+1),

where Zi = E1 + . . . + Ei is the i-th partial sum from a sequence E1, . . . , En+1 of independent Exp(1) random variables.

Proof. For 0 ≤ a1 ≤ b1 ≤ . . . ≤ an+1 ≤ bn+1 a Fubini-type argument yields

P(ai < Zi ≤ bi for 1 ≤ i ≤ n + 1) = ∫_{a1}^{b1} ∫_{a2−x1}^{b2−x1} . . . ∫_{an+1−x1−...−xn}^{bn+1−x1−...−xn} exp[−x1 − . . . − xn+1] dxn+1 dxn . . . dx1.

Substituting yn+1 = xn+1 + (x1 + . . . + xn), one obtains for the last integral

∫_{a1}^{b1} ∫_{a2−x1}^{b2−x1} . . . ∫_{an+1}^{bn+1} exp(−yn+1) dyn+1 dxn . . . dx1 = ∫_{a1}^{b1} ∫_{a2}^{b2} . . . ∫_{an+1}^{bn+1} exp(−xn+1) dxn+1 . . . dx1.

This shows that (Z1, . . . , Zn+1) has the Lebesgue density

(x1, . . . , xn+1) → exp(−xn+1) for 0 ≤ x1 < . . . < xn+1, and 0 elsewhere.

Conclude that

P(ai < Zi/Zn+1 ≤ bi for 1 ≤ i ≤ n) = ∫_0^∞ ∫_{a1 xn+1}^{b1 xn+1} . . . ∫_{an xn+1}^{bn xn+1} exp(−xn+1) dxn . . . dx1 dxn+1
= ∏_{i=1}^n (bi − ai) ∫_0^∞ x^n exp(−x) dx = n! ∏_{i=1}^n (bi − ai),

as desired.
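Lemma 3.3.14 also gives a convenient way to simulate uniform order statistics. The sketch below is our own illustration; n and the number of replications are arbitrary choices, and the means are compared with EUi:n = i/(n + 1) from Corollary 3.3.11.

import numpy as np

rng = np.random.default_rng(9)
n, reps = 5, 100000
z = np.cumsum(rng.exponential(size=(reps, n + 1)), axis=1)   # Z_1, ..., Z_{n+1}
u_order = z[:, :n] / z[:, [n]]                                # Z_i / Z_{n+1}, i = 1, ..., n
print(u_order.mean(axis=0))                                   # close to i/(n+1) = 1/6, ..., 5/6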

The results presented in this section are fundamental tools for proving various properties of empirical d.f.'s. In the following section they will be used to derive some exact boundary crossing probabilities for Fn.

3.4 Some Selected Boundary Crossing Probabilities

The Glivenko-Cantelli Theorem asserts that the empirical d.f.'s approach, as the sample size increases to infinity, the true d.f. in the uniform metric. In this section we derive several results which characterize the deviation between Fn and F for fixed finite n.

Our first Theorem, which is originally due to Daniels (1945), yields the exact distribution of the maximal relative deviation between Fn and F:

Rn = sup_{t: 0<F(t)} Fn(t)/F(t).

By the SLLN,

Fn(t)/F(t) → 1 with probability one.   (3.4.1)


Note, however, that compared with Fn − F, the process of interest, Fn/F, is only defined on the left-open interval of those t's for which F(t) is positive, so that compactness arguments in connection with a uniform version of (3.4.1) do not apply here. Indeed,

Var[Fn(t)/F(t)] = (1 − F(t))/(nF(t))

tends to infinity for fixed n, as F(t) → 0, so that Rn is likely to be heavily controlled by Fn(t) when t is small.

Theorem 3.4.1. For a continuous F, we have, for 0 < ε ≤ 1,

Gn(ε) ≡ P(Rn < 1/ε) = 1 − ε.   (3.4.2)

Proof. That Gn is distribution-free follows as for the K-S statistic, since under continuity

Rn = sup_{0<u≤1} F̄n(u)/u ≡ R̄n.

It is noteworthy that Gn is also the same for each n ≥ 1. For n = 1, R̄1 = 1/U1, so that (3.4.2) is immediate.

For a general n > 1, we follow Renyi (1973), who suggested proving Theorem 3.4.1 by induction on n. Now,

R̄n = max_{1≤k≤n} (k/n)/Uk:n

so that

Gn(ε) = P(min_{1≤k≤n} nUk:n/k > ε).

Recall that Un:n has the Lebesgue density n y^{n−1} on 0 < y < 1. Thus, for n > 1, we may apply Lemma 3.3.7 and use the validity of (3.4.2) for n − 1 to get

Gn(ε) = ∫_ε^1 Gn−1((n − 1)ε/(ny)) n y^{n−1} dy = ∫_ε^1 [1 − (n − 1)ε/(ny)] n y^{n−1} dy = 1 − ε.
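Theorem 3.4.1 is easy to confirm by simulation, since R̄n = max_k (k/n)/Uk:n. The sketch below is our own illustration; n and ε are arbitrary choices.

import numpy as np

rng = np.random.default_rng(10)
n, reps, eps = 25, 100000, 0.3
u = np.sort(rng.uniform(size=(reps, n)), axis=1)
r_n = np.max((np.arange(1, n + 1) / n) / u, axis=1)   # R_n = max_k (k/n)/U_{k:n}
print((r_n < 1 / eps).mean(), 1 - eps)                # both close to 0.7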


If we set 1/ε = 1 + δ with δ ≥ 0, we may restate Theorem 3.4.1 to get

P(sup_{0<F(t)} [Fn(t)/F(t) − 1] > δ) = 1/(1 + δ).

It follows that, as expected, Fn/F does not converge to 1 uniformly in t. On the other hand, to formulate a positive result, Theorem 3.4.1 asserts that up to an event of probability ε, Fn is bounded from above by F/ε:

Fn(t) ≤ F(t)/ε.   (3.4.3)

This inequality is also true when F(t) = 0, since then with probability one all data exceed t, so that Fn(t) = 0. Since for uniformly distributed data F(t) = t is a linear function, the bound (3.4.3) is sometimes called a linear bound (on the F-scale) for Fn.

Next we investigate lower bounds corresponding to (3.4.3). Again it suffices to consider the case of uniformly distributed U's.

For 0 ≤ s ≤ n and a, b > 0 such that a + sb < 1, put

Pns(a, b) = P(Uk:n ≤ a + (k − 1)b for 1 ≤ k ≤ s, Us+1:n > a + sb),

with Un+1:n ≡ 1.

Lemma 3.4.2 (Dempster). We have

Pns(a, b) = \binom{n}{s} a (a + sb)^{s−1} (1 − a − sb)^{n−s}.

Proof. The assertion is true for n = 1. Also, for a general n, it holds for s = 0. For n ≥ 2 and s ≥ 1 we obtain upon conditioning on U1:n:

Pns(a, b) = ∫_0^a n(1 − y)^{n−1} Pn−1,s−1[(a + b − y)/(1 − y), b/(1 − y)] dy.


Now, use induction on n to get the result.

From Dempster's formula we obtain further results on the relative deviation between Fn and F.

Corollary 3.4.3 (Renyi). For 0 < ε < 1 we have

P(max_{1≤k≤n} nUk:n/k > ε) = Σ_{s=0}^{n−1} \binom{n}{s} (ε/n)^s (1 + s)^{s−1} [1 − (1 + s)ε/n]^{n−s}
→ Σ_{s=0}^∞ ε^s e^{−(s+1)ε} (s + 1)^{s−1}/s! as n → ∞.

Proof. From Dempster's formula, we have

P(max_{1≤k≤n} nUk:n/k > ε) = Σ_{s=0}^{n−1} Pns(ε/n, ε/n) = Σ_{s=0}^{n−1} \binom{n}{s} (ε/n)^s (1 + s)^{s−1} [1 − (1 + s)ε/n]^{n−s}.

The limit result follows from an application of the dominated convergence theorem.

We now derive a formula for

P(max_{1≤k≤n−1} nUk+1:n/k > ε),

which differs slightly from the probability appearing in Corollary 3.4.3. It is of interest because

inf_{t>U1:n} F̄n(t)/t = [max_{1≤k≤n−1} nUk+1:n/k]^{−1}

so that

P(inf_{t>U1:n} F̄n(t)/t < 1/ε) = P(max_{1≤k≤n−1} nUk+1:n/k > ε).


Now, conditioning on U1:n, we obtain for 0 < ε < n

P(max_{1≤k≤n−1} nUk+1:n/k > ε) = P(U1:n > ε/n) + ∫_0^{ε/n} n(1 − y)^{n−1} Σ_{s=0}^r Pn−1,s[(ε/n − y)/(1 − y), ε/(n(1 − y))] dy,

where r is the largest integer < n/ε. Applying Dempster's formula we get the following Corollary.

Corollary 3.4.4 (Chang). For 0 < ε < n,

P(max_{1≤k≤n−1} nUk+1:n/k > ε) = (1 − ε/n)^n + Σ_{i=1}^{r+1} \binom{n}{i} (ε/n)^i (1 − iε/n)^{n−i} (i − 1)^{i−1}.

Letting n → ∞, we get for all ε > 0

P(max_{1≤k≤n−1} nUk+1:n/k > ε) → e^{−ε} + Σ_{k=1}^∞ (ε e^{−ε})^k (k − 1)^{k−1}/k!.

Our last result in this section presents the exact distribution of

D+n = sup_{t∈R} [Fn(t) − F(t)] = sup_{0≤u≤1} [F̄n(u) − u],

when F is continuous.

Theorem 3.4.5 (Birnbaum-Tingey). For all 0 < x ≤ 1 and n ≥ 1 we have

P(D+n ≤ x) = 1 − x Σ_{j=0}^{⌊n(1−x)⌋} \binom{n}{j} (1 − x − j/n)^{n−j} (x + j/n)^{j−1}.

Proof. Obviously, D+n ≤ x if and only if

i/n ≤ Ui:n + x for all i = n, n − 1, . . . , K,   (3.4.4)

where the integer K is defined through

K/n − x ≥ 0 > (K − 1)/n − x.


In view of Lemma 3.3.2, P(D+n ≤ x) equals the integral

J(x, n, K) = n! ∫_{1−x}^1 ∫_{1−x−1/n}^{yn} . . . ∫_{1−x−(n−K)/n}^{yK+1} ∫_0^{yK} . . . ∫_0^{y2} dy1 . . . dyn.

By induction we obtain

∫_0^{yK} . . . ∫_0^{y2} dy1 . . . dyK−1 = yK^{K−1}/(K − 1)!.

Inserting this expression into J(x, n, K) we obtain the recursion formula

J(x, n, K) = J(x, n, K + 1) − [(1 − x − (n − K)/n)^K / K!] × n! ∫_{1−x}^1 . . . ∫_{1−x−(n−(K+1))/n}^{yK+2} dyK+1 . . . dyn
≡ J(x, n, K + 1) − [(1 − x − (n − K)/n)^K / K!] · I(x, n, K + 1),

whence

J(x, n, K) = J(x, n, n) − Σ_{i=K}^{n−1} [(1 − x − (n − i)/n)^i / i!] · I(x, n, i + 1).

Note that

J(x, n, n) = 1 − (1 − x)^n,

while for the I's we obtain through induction on i:

I(x, n, i + 1) = n! x (x + (n − i)/n)^{n−i−1}/(n − i)!, i = K, . . . , n − 1.

Putting j = n − i, we obtain the assertion of the Theorem.
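The Birnbaum-Tingey formula is straightforward to evaluate and to check against simulation, since D+n = max_i [i/n − Ui:n] for uniform data. The sketch below is our own; n and x are arbitrary choices.

import numpy as np
from math import comb, floor

def birnbaum_tingey(n, x):
    """P(D_n^+ <= x) from Theorem 3.4.5."""
    s = sum(comb(n, j) * (1 - x - j / n) ** (n - j) * (x + j / n) ** (j - 1)
            for j in range(floor(n * (1 - x)) + 1))
    return 1 - x * s

rng = np.random.default_rng(11)
n, x, reps = 10, 0.3, 100000
u = np.sort(rng.uniform(size=(reps, n)), axis=1)
d_plus = np.max(np.arange(1, n + 1) / n - u, axis=1)   # D_n^+ = max_i [i/n - U_{i:n}]
print(birnbaum_tingey(n, x), (d_plus <= x).mean())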

The distribution of Dn = max(D+n, D−n) is much more complicated. Because D+n = D−n in distribution (when F is continuous), we have at least an inequality for the upper tail probabilities:

P(Dn > x) ≤ 2 P(D+n > x).


Chapter 4

U-Statistics

4.1 Introduction

Let X1, . . . , Xn be a sample of i.i.d. random variables with d.f. F. For a given function φ, the empirical integral ∫ φ dFn then is a linear statistic to which standard results from probability such as the SLLN, the CLT or the law of the iterated logarithm may be applied. As we have seen in previous sections, there are many examples in which the statistic of interest is a multiple sum. The associated statistical functional then is of the form

T(G) = ∫ . . . ∫ h(x1, . . . , xk) G(dx1) . . . G(dxk).

Here and in the following it will be assumed that the integral exists for G = F. For G = Fn, it is always finite.

The function h is called the kernel of T, while k is its degree. When k = 1, we are back at linear functionals, so here we shall focus on k > 1. For G = Fn, we obtain the V-statistic

T(Fn) = n^{−k} Σ_{i1=1}^n . . . Σ_{ik=1}^n h(Xi1, . . . , Xik).

When θ = T(F) is the parameter of interest, it is more common to look at

Un = ((n − k)!/n!) Σ_π h(Xi1, . . . , Xik),

where summation takes place over all (incomplete) permutations π = (i1, . . . , ik) of integers 1 ≤ ij ≤ n with length k ≤ n. The variable Un is called a U-statistic of degree k, since Un is an unbiased estimator of θ:

EUn = θ = T(F).

The goal of this chapter is to discuss some fundamental finite and large sample properties of Un. We remark already now that in some applications the kernel h may vary with n, in which case special care is required when we let n tend to infinity.

Now, the main problem with Un is its nonlinearity when k > 1. This limits the applicability of standard results from probability theory. Therefore, in Section 4.2, we shall look for the linear statistic Ûn which is closest to Un, and determine the associated φ. Then, by expanding Un into

Un = Ûn + (Un − Ûn) ≡ Ûn + Rn,

we shall discuss when the remainder Rn may be neglected, if at all.

It is again useful to look at this problem from a geometric point of view. For this, let φ1 and φ2 be two F-integrable functions and let a1, a2 be two real numbers. If we combine the two linear statistics

L^j_n = Σ_{i=1}^n φj(Xi), j = 1, 2,

we obtain

a1 L^1_n + a2 L^2_n = Σ_{i=1}^n (a1φ1 + a2φ2)(Xi),

which is a linear statistic with score function a1φ1 + a2φ2. In other words, the linear statistics form a subspace of all statistics. We enlarge this subspace to the subspace of all sums, where the transformations of the data may differ from i to i:

L = Σ_{i=1}^n φi(Xi).   (4.1.1)

In the following section we present a famous result due to Hajek, who was able to compute the projection of a general square-integrable statistic. Then we shall apply Hajek's projection lemma to U-statistics.


4.2 The Hajek Projection Lemma

The Projection Lemma which we will discuss now was established by Hajek (1968) and then applied to achieve asymptotic normality of linear rank statistics. Much earlier, Hoeffding (1948) had already used projection techniques to linearize U-statistics.

The original Hajek projection lemma only assumes independence of the underlying variables. Equality of the distributions is not required.

Lemma 4.2.1 (Hajek). Let X1, . . . , Xn be independent random variables and let S = S(X1, . . . , Xn) be a square-integrable function (statistic) of X1, . . . , Xn. Set

Ŝ = Σ_{i=1}^n E(S|Xi) − (n − 1)ES.

Then Ŝ is a member of the previously mentioned subspace. Actually, Ŝ equals the projection of S and has the following properties:

(i) E(Ŝ) = E(S)

(ii) E(S − Ŝ)² = Var(S) − Var(Ŝ)

(iii) For any L of the form (4.1.1), one has

E(S − L)² = E(S − Ŝ)² + E(Ŝ − L)²,

i.e., the left-hand side is minimized for L = Ŝ.

Proof. (i) is trivial, and (ii) follows from (iii) if we take for L the constant E(S) = E(Ŝ). To show (iii), assume E(S) = E(Ŝ) = 0 w.l.o.g. We have

E[(S − Ŝ)(Ŝ − L)] = Σ_{i=1}^n E[(S − Ŝ)(E(S|Xi) − φi(Xi))] = Σ_{i=1}^n E[(E(S|Xi) − φi(Xi)) E(S − Ŝ|Xi)].

Because of independence,

E(E(S|Xj)|Xi) = E(S) for i ≠ j and E(E(S|Xj)|Xi) = E(S|Xi) for i = j.

Hence, since E(S) = 0,

E(Ŝ|Xi) = E(S|Xi),

i.e.,

E(S − Ŝ|Xi) = 0.

It follows that

E[(S − Ŝ)(Ŝ − L)] = 0

and therefore (iii).

From time to time the statistic S also contains other variables which are measurable with respect to a σ-field F. In such a situation we are looking for an approximation of S through a statistic

L = Σ_{i=1}^n Zi,   (4.2.1)

where Zi is a σ(Xi, F)-measurable function which may depend on Xi but not on the others. The following lemma is a conditional version of Hajek's lemma.

Assumption: E[E(S|Xi, F)|Xj, F] = E(S|F) for i ≠ j.

Lemma 4.2.2. Under ES² < ∞, let

Ŝ = Σ_{i=1}^n E(S|Xi, F) − (n − 1)E(S|F).

Then

(i) E(Ŝ|F) = E(S|F)

(ii) E[(S − Ŝ)²|F] = Var(S|F) − Var(Ŝ|F)

(iii) For any admissible L, one has

E[(S − L)²|F] = E[(S − Ŝ)²|F] + E[(Ŝ − L)²|F],

i.e., Ŝ minimizes the left-hand side.


Proof. Needless to say, Ŝ is admissible. (i) is trivial, since F ⊂ σ(Xi, F). (ii) follows from (iii) upon setting

L = E(S|F) = E(Ŝ|F).

To show (iii), assume E(S|F) = 0 = E(Ŝ|F) w.l.o.g. We then have

E[(S − Ŝ)(Ŝ − L)|F] = Σ_{i=1}^n E[(S − Ŝ)(E(S|Xi, F) − Zi)|F] = Σ_{i=1}^n E{(E(S|Xi, F) − Zi) E[S − Ŝ|Xi, F] | F}.

From the assumption,

E[E(S|Xi, F)|Xj, F] = E(S|F) for i ≠ j and = E(S|Xi, F) for i = j.

It follows that

E(Ŝ|Xi, F) = E(S|Xi, F),

whence

E[(S − Ŝ)(Ŝ − L)|F] = 0.

Conclude (iii).

Equality (ii) is likely to be applied in the following way. Assume that as n → ∞ the right-hand side converges to zero in probability or P-a.s. Then so does the left-hand side. In case we may apply an integral convergence theorem, we infer S − Ŝ → 0 in L2. Coming back to the right-hand side, the σ-fields may depend on n. In case they are increasing or decreasing, martingale arguments might be useful. Conditioning on F is always useful when S contains awkward F-measurable components.

4.3 Projection of U-Statistics

As before, consider

Un = ((n − k)!/n!) Σ_π h(Xi1, . . . , Xik),

and denote with Ûn its Hajek projection. Define

hj(x) = ∫_{R^{k−1}} h(x1, . . . , xj−1, x, xj+1, . . . , xk) F(dx1) . . . F(dxj−1) F(dxj+1) . . . F(dxk).

If h is symmetric, i.e., invariant w.r.t. permutations of the xi, the hj's are all equal.

Lemma 4.3.1. We have

Ûn − θ = n^{−1} Σ_{i=1}^n Σ_{j=1}^k [hj(Xi) − θ],

where θ = E[h(X1, . . . , Xk)], the target.

Proof. Obviously, for 1 ≤ i ≤ n,

((n − k)!/n!) Σ_π E[h(Xi1, . . . , Xik)|Xi] = ((n − k)!/n!) [Σ^0_π θ + Σ_{j=1}^k Σ^j_π hj(Xi)] = ((n − k)/n) θ + n^{−1} Σ_{j=1}^k hj(Xi).

Here Σ^0 resp. Σ^j denote summation over all π not containing i resp. containing i at position j. It follows that

Ûn = Σ_{i=1}^n E(Un|Xi) − (n − 1)θ
= (n − k)θ + n^{−1} Σ_{i=1}^n Σ_{j=1}^k hj(Xi) − (n − 1)θ
= θ + n^{−1} Σ_{i=1}^n Σ_{j=1}^k [hj(Xi) − θ].

The proof is complete.

Now, putting

h̄(x) = Σ_{j=1}^k [hj(x) − θ],

we get

Ûn − θ = n^{−1} Σ_{i=1}^n h̄(Xi).

The random variables h̄(X1), . . . , h̄(Xn) are i.i.d. and centered. So

n^{1/2}(Ûn − θ) → N(0, σ²) in distribution   (4.3.1)

with

σ² = ∫ h̄²(x) F(dx).

Also note that

Vn = Un − Ûn = ((n − k)!/n!) Σ_π H(Xi1, . . . , Xik)

is again a U-statistic with kernel

H(x1, . . . , xk) = h(x1, . . . , xk) − θ − (1/k) Σ_{j=1}^k h̄(xj).

Lemma 4.3.2. The h̄-function associated with the kernel H vanishes: H̄ ≡ 0.

Proof. The statement follows from straightforward computations, or from the fact that the projection of Vn satisfies V̂n = Ûn − Ûn = 0, the projection of the linear statistic Ûn being Ûn itself.

U-statistics with kernel H such that H̄ ≡ 0 are henceforth called degenerate.

To properly understand the (limit) behavior of Un, we may decompose Un − θ into

Un − θ = (Ûn − θ) + (Un − Ûn) = (Ûn − θ) + Vn.

Two different scenarios are possible.

Scenario 1:

• n^{1/2} Vn → 0 in probability and Un is nondegenerate.


In this case, (4.3.1) applies and we obtain

n^{1/2}(Un − θ) → N(0, σ²) in distribution.

Scenario 2:

• Un is degenerate, i.e., h̄ vanishes and so does σ² in (4.3.1).

Now n^{1/2} is unlikely to be the correct standardizing factor, and it is not clear what the correct approximation through a nondegenerate distribution may look like.

A partial answer to these questions will be given in the following section, when we discuss the variance of a U-statistic.

4.4 The Variance of a U-Statistic

In this section we compute the variance of a U-statistic and provide some useful upper bounds. Again, let

Un = ((n − k)!/n!) Σ_π h(Xi1, . . . , Xik),

where h is square-integrable and Σ_π denotes summation over the n!/(n − k)! permutations π of k distinct elements from {1, . . . , n}. Set θ = EUn. We may assume w.l.o.g. θ = 0. Clearly,

U²n = [(n − k)!/n!]² Σ_{π1,π2} h(Xi1, . . . , Xik) h(Xj1, . . . , Xjk).

The sum over π1 and π2 may be written as

Σ_{π1,π2} = Σ_{r=0}^k Σ^r_{π1,π2}.

Here Σ^r denotes summation over all permutations π1 and π2 (of length k) with exactly r indices in common. If these indices are in positions ∆1 and ∆2, then by the i.i.d. property

E[h(Xi1, . . . , Xik) h(Xj1, . . . , Xjk)] = ∫ . . . ∫_{R^{2k−r}} h(x1, . . . , xk) h(y1, . . . , yk) F(dx1) . . . F(dx2k−r),


in which the y's in position ∆2 agree with the x's in position ∆1 and are taken from xk+1, . . . , x2k−r otherwise. In particular, the expectation only depends on ∆1 and ∆2 and may thus be written as I(∆1, ∆2). There are (n − r)!/(n − 2k + r)! possibilities which lead to the same I(∆1, ∆2). For r = 0, each I vanishes. Hence we come up with the following

Lemma 4.4.1. Let Un be a (centered) square-integrable U-statistic. Then we have

Var(Un) = [(n − k)!/n!]² Σ_{r=1}^k [(n − r)!/(n − 2k + r)!] Σ_{|∆1|=r=|∆2|} I(∆1, ∆2).   (4.4.1)

For a symmetric h, I(∆1, ∆2) ≡ Ir only depends on r. There are

\binom{k}{r} \binom{k}{r} \binom{n}{r} r! r!

possibilities for choosing ∆1 and ∆2. It follows that

Var(Un) = Σ_{r=1}^k \binom{n}{k}^{−1} \binom{k}{r} \binom{n−k}{k−r} Ir.   (4.4.2)

Equality (4.4.2) is due to Hoeffding (1948).

Now, for a general kernel, by the Cauchy-Schwarz inequality,

|I(∆1, ∆2)| ≤ E h²(X1, . . . , Xk).

The r-th term in (4.4.1) is thus bounded from above in absolute value by

\binom{n}{k}^{−1} \binom{k}{r} \binom{n−k}{k−r} E h² = O(n^{−r}).

In particular, if the sum over the I(∆1, ∆2) vanishes for r = 1, . . . , s − 1, then

Var(Un) = O(n^{−s}).   (4.4.3)

In some applications we need the last bound for a Un in which the kernel h may depend on n. Under the assumptions leading to (4.4.3), we then get

Var(Un) = O(n^{−s} cn),

where cn is an upper bound for E h²n(X1, . . . , Xk).


We are now in a position to discuss Scenario 1 from the previous section at full length. So suppose that h̄ does not vanish. We already noticed that Vn is a U-statistic with kernel H. The random variable H(Xi1, . . . , Xik) has expectation zero, and for two permutations (i1, . . . , ik) and (j1, . . . , jk) having only one index in common (say il = js),

E[H(Xi1, . . . , Xik) | Xj1, . . . , Xjk] = hl(Xjs) − θ − (1/k) Σ_{j=1}^k E[h̄(Xij) | Xj1, . . . , Xjk]
= hl(Xjs) − θ − (1/k) h̄(Xjs).   (4.4.4)

For a symmetric kernel the term in (4.4.4) vanishes, so that

E[H(Xi1, . . . , Xik) H(Xj1, . . . , Xjk)] = 0.

For a general kernel h and for a given π = (j1, . . . , jk), the common index equal to js may sit in any position l, so that the sum over the terms in (4.4.4) also vanishes. We may conclude that for r = 1 the sum over the ∆'s in (4.4.1) equals zero. From (4.4.3) we obtain

Var(Vn) = E V²n = O(n^{−2}).

Altogether this yields

Theorem 4.4.2. Assume that Un is a nondegenerate square-integrable U-statistic. Then

n^{1/2}(Un − θ) → N(0, σ²) in distribution,

with

σ² = ∫ h̄²(x) F(dx).
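Theorem 4.4.2 is easy to observe numerically. The following sketch (our own illustration) uses the symmetric, nondegenerate kernel h(x, y) = |x − y| of degree k = 2; Exp(1) data, for which θ = E|X − Y| = 1, the sample size and the number of replications are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(12)
n, reps, theta = 100, 2000, 1.0

def u_stat(x):
    """U-statistic with kernel |x - y|, averaged over the n(n-1) ordered pairs i != j."""
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (len(x) * (len(x) - 1))

z = np.array([np.sqrt(n) * (u_stat(rng.exponential(size=n)) - theta)
              for _ in range(reps)])
print(z.mean(), z.var())      # mean close to 0; variance close to sigma^2 of (4.3.1)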

4.5 U-Processes: A Martingale Approach

Assume that X1, . . . , Xn is a (finite) sequence of independent identically distributed (i.i.d.) random variables with distribution function (d.f.) F, defined on a probability space (Ω, A, P). Let h be a function (the kernel) on m-dimensional Euclidean space and set (for n ≥ m)

Un = ((n − m)!/n!) Σ_π h(Xi1, . . . , Xim),


where π extends over all multiindices π = (i1, . . . , im) of pairwise distinct 1 ≤ ij ≤ n, 1 ≤ j ≤ m. Commonly Un is called a U-statistic. U-statistics were introduced by Hoeffding (1948). They have been extensively investigated over the last 40 years. Most of the basic theory is contained in Serfling (1980), Denker (1985) and Lee (1990). See also Randles and Wolfe (1979). More recently much attention has been given to what has been called a U(-statistic) process. For ease of representation we shall restrict ourselves to degree m = 2. The U-statistic process is then defined as follows: for real u and v set

Un(u, v) = (1/(n(n − 1))) Σ_{1≤i≠j≤n} h(Xi, Xj) 1{Xi ≤ u, Xj ≤ v}.

Write

Fn(x) = (1/n) Σ_{i=1}^n 1{Xi ≤ x}, x ∈ R,

the empirical d.f. of the sample. Then Un(u, v) becomes (assuming no ties)

Un(u, v) = (n/(n − 1)) ∫_{−∞}^u ∫_{−∞}^v h(x, y) 1{x ≠ y} Fn(dx) Fn(dy).

Now, a standard method to analyze the (large sample) distributional behavior of Un is to write Un as

Un = Ûn + Rn,

in which Ûn is the Hajek projection of Un and the remainder Rn = Un − Ûn is a degenerate U-statistic that is asymptotically negligible when compared with Ûn. In fact, provided that h has a finite pth moment and zero mean,

n^{1/2} Ûn → N(0, σ²) in distribution

and

E|Rn|^p ≤ C n^{−p}.

See Serfling [(1980), page 188]. It follows that also

n^{1/2} Un → N(0, σ²) in distribution.

Of course, this approach immediately applies to Un(u, v) for each fixed (u, v). Just replace h(x, y) by h(x, y) 1{x ≤ u, y ≤ v}. Unfortunately, this


is insufficient for handling Un(u, v) [resp. Rn(u, v)] as a process in (u, v). Particularly, the (pointwise) Hajek approach does not yield bounds for sup_{u,v} |Rn(u, v)|. Such bounds are, however, extremely useful in applications. For example, in survival analysis, U-statistic processes or variants of them appear in the context of estimating the lifetime distribution F and the cumulative hazard function Λ when the data are censored or truncated [cf. Lo and Singh (1986), Lo, Mack and Wang (1989), Major and Rejto (1988) and Chao and Lo (1988)]. In Lo and Singh (1986) the analysis of the remainder term incorporated known global and local properties of empirical processes. In Lo, Mack and Wang (1989) the error bounds were improved by applying a sharp moment bound for degenerate U-statistics due to Dehling, Denker and Philipp (1987). In Major and Rejto (1988) a bound for sup_u |Rn(u, 1)| of large deviation type due to Major (1988) was applied, which required h to be bounded. In all these papers estimation of F (resp. Λ) could only be carried through on intervals strictly contained in the support of the distribution of the observed data; similarly in Chao and Lo (1988) for truncated data situations. This general drawback mainly arose because of the lack of a sharp bound for sup_{u,v} |Rn(u, v)| when the kernel h is not necessarily bounded.

Classes of degenerate U-statistics have also been studied, from a different point of view, by Nolan and Pollard (1987). In their Theorem 6 they derive an upper bound for the mean of the supremum by first decoupling the U-process of interest and then using a chaining argument conditionally on the observations. Now, by Holder, a more efficient inequality would be one relating the pth order mean of the supremum to the pth order mean of the envelope function, p ≥ 2.

At least this is a typical feature of many other maximal inequalities. We also refer to de la Pena (1992) and the literature cited there. In these papers the main emphasis is on relating the maximum of interest to the maximum of a decoupled process. No explicit bounds for a degenerate U-statistic process are derived that are comparable to ours. Note, however, that in applications the leading (Hajek) part is well understood and it is the degenerate part that creates the more serious problems.

In this section we shall employ martingale methods to provide a maximal bound satisfying the above requirements. As a consequence we would be able to improve the a.s. representations of the product-limit estimators of F for censored and truncated data as discussed above; see Stute (1993, 1994).


Denote by Ûn(u, v) the Hajek projection of Un(u, v). As for proofs, unfortunately, as a process in (u, v),

Un(u, v) − Ûn(u, v)

does not enjoy any particular structural properties, so that standard maximal inequalities cannot be applied directly. As another possibility, assume for a moment that h is nonnegative. Then Un(u, v) is nondecreasing in (u, v) and adapted to the filtration

Fu,v = σ(1{Xi ≤ x} : x ≤ max(u, v)).

Let Cn(u, v) denote the compensator in the Doob-Meyer decomposition of Un(u, v); see, for example, Dozzi (1981). At first sight one might expect

Un(u, v) − Cn(u, v)

to be a two-parameter martingale to which standard maximal bounds could be applied; see Cairoli and Walsh (1975). A serious drawback of this approach is that with this choice of F:

1. The process (Un(·) − Cn(·), F) does not satisfy the fundamental conditional independence property (F4) in Cairoli and Walsh (1975).

2. The compensator Cn(·) is still a U-statistic process rather than a sum of i.i.d. processes.

3. Un(·) − Cn(·) turns out not to be a degenerate U-statistic.

The last comments were meant only to express the author’s difficulties whenwriting the paper, in finding a proper decomposition of Un(u, v), in whichthe remainder term (at least the most interesting part of it) is both a two-parameter (strong) martingale in (u, v) and a degenerate U -statistic for (u, v)fixed. Given such a decomposition we could then apply standard maximalinequalities for (strong) two-parameter martingales. Having thus replacedsupu,v |Rn(u, v)| by a single Rn(u, v),E|Rn(u, v)|p could be further dealt withby applying Burkholder’s inequality.

Furthermore, in our analysis, the Doob-Meyer decomposition of the process∑j

h(s,Xj)1Xj≤v s fixed

will be employed. Finally, some global bounds for empirical d.f.’s a laDvoretzky-Kiefer- Wolfowitz (1956) will be required.


Now, the process U_n(u,v) may be written as

n(n-1) U_n(u,v) = \sum_{1 \le i < j \le n} h(X_i, X_j) 1_{\{X_i \le u, X_j \le v\}} + \sum_{1 \le j < i \le n} h(X_i, X_j) 1_{\{X_i \le u, X_j \le v\}} \equiv I_n(u,v) + II_n(u,v).

The following theorem contains the key representation of I_n(u,v) in terms of a sum of independent random processes.

Theorem 4.5.1. Assume h \in L_p(F \otimes F), with p \ge 2. Then we have

I_n(u,v) = \sum_{1 \le i < j \le n} \Big[ \int_{-\infty}^{u} h(x, X_j) 1_{\{X_j \le v\}} F(dx) + \int_{-\infty}^{v} h(X_i, y) 1_{\{X_i \le u\}} F(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) F(dy) \Big] + R_n(u,v),

where for each u_0, v_0,

E\Big[ \sup_{u \le u_0, v \le v_0} |R_n(u,v)|^p \Big] \le C^p n^p.    (4.5.1)

The constant C satisfies

C \le C' \Big( \int_{-\infty}^{u_0}\int_{-\infty}^{v_0} |h(x,y)|^p F(dx) F(dy) \Big)^{1/p}

with C' depending only on p.

A similar representation also holds for II_n(u,v). Putting these together we get the following corollary.

Corollary 4.5.2. Under the assumptions of Theorem 4.5.1,

n(n-1) U_n(u,v) = n(n-1) \Big[ \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F_n(dx) F(dy) + \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) F_n(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) F(dy) \Big] + R_n(u,v),

where the remainder satisfies (4.5.1), with C replaced by 2C.

Since (assuming no ties)

n(n-1) U_n(u,v) = n^2 \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) 1_{\{x \ne y\}} F_n(dx) F_n(dy),

we may write the equation in Corollary 4.5.2 as

\frac{n}{n-1} \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) 1_{\{x \ne y\}} F_n(dx) F_n(dy) = \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F_n(dx) F(dy) + \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) F_n(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) F(dy) + R_n(u,v)/[n(n-1)].

Inequality (4.5.1) together with the Markov inequality yields, with probability 1,

\sup_{u \le u_0, v \le v_0} |R_n(u,v)| = o(n^{1+1/p} (\ln n)^{\delta}),    (4.5.2)

whenever \delta satisfies p\delta > 1. Furthermore, if h is bounded, (4.5.1) may be applied for each p \ge 2.


So far we kept u_0 and v_0 fixed. In such a situation integrability of h^p only up to (u_0, v_0) is sufficient. Actually, it may happen that (u_0, v_0) = (u_n, v_n) varies with n in such a way that h^p is not integrable over the whole plane, but

\int_{-\infty}^{u_n}\int_{-\infty}^{v_n} |h|^p \, dF \, dF \to \infty

at a prescribed rate. Theorem 4.5.1 is particularly useful also in this case. On the other hand, if either u_n or v_n becomes small as n \to \infty (such situations occur quite often in nonparametric curve estimation), then the integral

\int_{-\infty}^{u_n}\int_{-\infty}^{v_n} |h(x,y)|^p F(dx) F(dy)

also becomes small, to the effect that the bound in (4.5.2) may be replaced by smaller ones. The last remarks also apply to the results that follow. Interestingly enough, (4.5.2) may be improved a lot. This is due to the fact that, according to Berk (1966), a sequence of normalized U-statistics is a reverse-time martingale. Utilizing this, we get the following result.

Theorem 4.5.3. Under the assumptions of Theorem 4.5.1, with probability 1,

\sup_{u \le u_0, v \le v_0} |R_n(u,v)| = o(n (\ln n)^{\delta})

whenever p\delta > 1. For bounded h's we may therefore take any \delta > 0.

With some extra work the logarithmic factor may be pushed down so as to get a bounded LIL. The necessary methodology may be found, for a fixed U-statistic rather than a process, in a notable paper by Dehling, Denker and Philipp (1986). After truncation, they applied their moment inequality, at stage n, with a p = p_n depending on n such that p_n \to \infty slowly, to the effect that for a bounded LIL the moment inequality "serves the same purpose as an exponential bound" (personal communication by M. Denker). Since this method is well established now, we need not dwell on this here again.

In the next theorem we are concerned with a two-sample situation. Let X_1, \ldots, X_n be i.i.d. with common d.f. F and let, independently of the X's, Y_1, \ldots, Y_m be another i.i.d. sequence with common d.f. G. We shall derive a representation of the process

nm\, U_{nm}(u,v) = \sum_{i=1}^{n}\sum_{j=1}^{m} h(X_i, Y_j) 1_{\{X_i \le u, Y_j \le v\}}.


Theorem 4.5.4. Assume h \in L_p(F \otimes G), with p \ge 2. Then we have

nm\, U_{nm}(u,v) = \sum_{i=1}^{n}\sum_{j=1}^{m} \Big[ \int_{-\infty}^{u} h(x, Y_j) 1_{\{Y_j \le v\}} F(dx) + \int_{-\infty}^{v} h(X_i, y) 1_{\{X_i \le u\}} G(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) G(dy) \Big] + R_{nm}(u,v),

where for each u_0, v_0,

E\Big[ \sup_{u \le u_0, v \le v_0} |R_{nm}(u,v)|^p \Big] \le [C^2 nm]^{p/2}.

The constant C satisfies

C \le C' \Big( \int_{-\infty}^{u_0}\int_{-\infty}^{v_0} |h(x,y)|^p F(dx) G(dy) \Big)^{1/p}.

The analogue of Theorem 4.5.3 is only formulated for m = n.

Theorem 4.5.5. Under the assumptions of Theorem 4.5.4, with probability 1 as n \to \infty,

\sup_{u \le u_0, v \le v_0} |R_{nm}(u,v)| = o(n (\ln n)^{\delta})

whenever p\delta > 1.

Another variant (resp. extension) of Theorem 4.5.1, which is extremely useful in applications, comes up when, in addition to X_i, there is a Y_i paired with X_i. Typically X_i is correlated with Y_i. We may then form

I_n(u,v) = \sum_{1 \le i < j \le n} h(X_i, Y_j) 1_{\{X_i \le u, Y_j \le v\}}.

Clearly, this I_n equals the I_n from Theorem 4.5.1 if X_i = Y_i; similarly for II_n(u,v). The following Theorem 4.5.6 is an extension of Theorem 4.5.1 to paired observations.


Theorem 4.5.6. Assume that (X_i, Y_i), 1 \le i \le n, is an i.i.d. sample from some bivariate d.f. H with marginals F and G. Assume h \in L_p(F \otimes G) with p \ge 2. Then we have

I_n(u,v) = \sum_{1 \le i < j \le n} \Big[ \int_{-\infty}^{u} h(x, Y_j) 1_{\{Y_j \le v\}} F(dx) + \int_{-\infty}^{v} h(X_i, y) 1_{\{X_i \le u\}} G(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y) F(dx) G(dy) \Big] + R_n(u,v),

where R_n satisfies (4.5.1) and the h-integral in the bound is taken w.r.t. F \otimes G. The assertion of Theorem 4.5.3 also extends to the present case.

Remark 4.5.7. The results of this section may be extended to U-statistic processes of degree m > 2, but proofs become more complicated and the notation even more cumbersome. As far as applications are concerned, however, the case m = 2 is by far the most important one.

We end this section by presenting several examples to which the theorems may be applied. For these we remark that in the formulation of the previous results, the point infinity could also be included in the parameter set. What matters is that the parameter sets of the coordinate spaces need to be linearly ordered.

Example 4.5.8. (Censored data). In the random censorship model the actually observed data are Z_i = \min(X_i, C_i) and \delta_i = 1_{\{X_i \le C_i\}}, where X_i is the variable of interest (the lifetime), which is at risk of being censored by C_i, the censoring variable. For estimation of the cumulative hazard function of X, a crucial role is played by the (one-parameter) process

I_n(u) = \sum_{1 \le i < j \le n} \frac{\delta_i 1_{\{Z_j > Z_i\}}}{(1-H)^2(Z_i)}\, 1_{\{Z_i \le u\}},    u \in R.

Here H is the d.f. of Z_i. If we introduce

Y_i = Z_i if \delta_i = 1,  and  Y_i = \infty if \delta_i = 0,

then

\delta_i 1_{\{Z_i \le u\}} = 1_{\{Y_i \le u\}},


and, therefore,

I_n(u) = \sum_{1 \le i < j \le n} \frac{1_{\{Z_j > Y_i\}}}{(1-H)^2(Y_i)}\, 1_{\{Y_i \le u\}} = I_n(u, \infty)

for an appropriate kernel h. The fact that Y_i is an extended random variable is of no importance to us. The theorems have been formulated for real variables just for the sake of convenience, but may be generalized easily to the foregoing setup. This example is discussed in greater detail in Stute (1994).

Example 4.5.9. (Truncated data). Here one observes (X_i, Y_i) only if Y_i \le X_i. Though originally X_i is assumed independent of Y_i, the actually observed pair has dependent components. For estimation of the cumulative hazard function the following process constitutes a crucial part in the analysis:

I_n(u) = \sum_{1 \le i < j \le n} \frac{1_{\{Y_j \le X_i \le X_j\}}}{C^2(X_i)}\, 1_{\{X_i \le u\}}

for some particular function C. Obviously I_n may be decomposed into two parts, each of which is of the type discussed in Theorem 4.5.6, with v = \infty. See Stute (1993) for a thorough discussion of this example.

Example 4.5.10. (Two samples). In the situation of Theorem 4.5.4, the Wilcoxon two-sample rank test for H_0: F = G versus H_1: F = G(\cdot - \Delta) is based on the U-statistic

T_{nm} = \sum_{i=1}^{n}\sum_{j=1}^{m} h(X_i, Y_j),

with h(x,y) = 1_{\{x \le y\}}. In our previous notation,

T_{nm} = nm\, U_{nm}(\infty, \infty).

We may now consider the associated process U_{nm} in order to construct tests for H_0 versus H_1 which are based on the whole of U_{nm} rather than only T_{nm}. It would be interesting to compare the power of these tests with that of the standard Wilcoxon test.
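To make the two-sample construction concrete, here is a minimal simulation sketch (not part of the original text): it computes the Wilcoxon statistic T_{nm} with kernel h(x,y) = 1_{\{x \le y\}} and evaluates the associated process U_{nm}(u,v) at a few points. Sample sizes and distributions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 60
X = rng.normal(0.0, 1.0, n)        # sample from F
Y = rng.normal(0.3, 1.0, m)        # sample from G (shifted, as under H1)

# kernel h(x, y) = 1{x <= y}, evaluated for all pairs
H = (X[:, None] <= Y[None, :]).astype(float)          # shape (n, m)
T_nm = H.sum()                                        # Wilcoxon statistic = nm * U_nm(inf, inf)

def U_nm(u, v):
    """U_{nm}(u, v) = (nm)^{-1} sum_i sum_j h(X_i, Y_j) 1{X_i <= u, Y_j <= v}."""
    mask = (X[:, None] <= u) & (Y[None, :] <= v)
    return (H * mask).sum() / (n * m)

print("T_nm =", T_nm)
print("nm * U_nm(inf, inf) =", n * m * U_nm(np.inf, np.inf))   # equals T_nm
print("U_nm at the sample medians:", U_nm(np.median(X), np.median(Y)))
```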

Example 4.5.11. (Trimmed U-statistics). Gijbels, Janssen and Veraverbeke (1988) investigated so-called trimmed U-statistics

n(n-1)\, U_n^0(\alpha, \beta) = \sum_{i=1}^{[n\alpha]}\sum_{j=1}^{[n\beta]} h(X_{i:n}, X_{j:n}),

where X_{1:n} \le \ldots \le X_{n:n} denotes the ordered sample.


Since

n(n-1)\, U_n^0(\alpha, \beta) = \sum_{i=1}^{n}\sum_{j=1}^{n} h(X_i, X_j) 1_{\{X_i \le F_n^{-1}(\alpha),\, X_j \le F_n^{-1}(\beta)\}},

we see that U_n^0(\alpha, \beta) is related to U_n(F_n^{-1}(\alpha), F_n^{-1}(\beta)), neglecting the sum over i = j for a moment. Observe that u = F_n^{-1}(\alpha) and v = F_n^{-1}(\beta) are random in this case. The (uniform) representation of U_n in terms of a (simple) sum of independent random processes together with their tightness in the two-dimensional Skorokhod space allows for a simple analysis of U_n^0(\alpha, \beta), not just for a fixed (\alpha, \beta) [as done in Gijbels, Janssen and Veraverbeke (1988)], but as a process in (\alpha, \beta). Details are omitted.

U-statistic processes also occur in the analysis of linear rank statistics. We only mention the possibility of representing a linear signed rank statistic (up to an error term) as a sum of i.i.d. random processes [cf. Sen (1981), Theorem 5.4.2].

Example 4.5.12. (Linear signed rank statistics). For a sample X_1, \ldots, X_n and a proper score function \varphi, it is required to represent the double sum

\sum_{1 \le i \ne j \le n} \varphi(X_i)\, 1_{\{|X_j| \le X_i\}}\, 1_{\{0 \le X_i \le x\}}.

We see that Theorem 4.5.6 applies with Y_j = |X_j|.

Example 4.5.13. The Lorenz functional

L(p) = \frac{1}{\mu} \int_{0}^{p} F^{-1}(u)\, du

serves as a tool to measure the economic imbalances in a population. It is also of interest to determine its variance. This leads to the function

L_1(p) = \frac{1}{\sigma^2} \int_{0}^{p}\int_{0}^{p} [F^{-1}(u) - F^{-1}(v)]^2\, du\, dv.
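A minimal numerical sketch (not from the text) of how both functionals can be estimated by plugging in the empirical d.f.: the empirical quantile function is piecewise constant, so the integrals reduce to sums over order statistics (the fractional remainder of p*n is ignored here), and \mu and \sigma^2 are replaced by the sample mean and variance.

```python
import numpy as np

def lorenz_functionals(x, p):
    """Plug-in estimates of L(p) and L1(p) based on the empirical quantile function."""
    x = np.sort(np.asarray(x, dtype=float))          # order statistics X_{1:n} <= ... <= X_{n:n}
    n = len(x)
    k = int(np.floor(p * n))                         # F_n^{-1}(u) = X_{i:n} for u in ((i-1)/n, i/n]
    head = x[:k]                                     # quantiles contributing to [0, p]
    mu, sigma2 = x.mean(), x.var()
    L = head.sum() / (n * mu)                        # ~ (1/mu) * int_0^p F_n^{-1}(u) du
    diffs = head[:, None] - head[None, :]
    L1 = (diffs ** 2).sum() / (n ** 2 * sigma2)      # ~ (1/sigma^2) * double integral
    return L, L1

incomes = np.random.default_rng(1).lognormal(mean=0.0, sigma=0.8, size=500)
print(lorenz_functionals(incomes, p=0.5))
```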


Chapter 5

Statistical Functionals

5.1 Empirical Equations

Empirical equations have been addressed briefly in Section 1.4. For a more detailed discussion, consider a parametric family M = \{f_\theta : \theta \in \Theta\} of densities w.r.t. a measure \mu. Assume that the true distribution function F belongs to M, i.e., F = F_{\theta_0} = f_{\theta_0}\mu for some (unknown) \theta_0 \in \Theta. Set

K(\theta, \theta_0) = \int \ln\frac{f_{\theta_0}(x)}{f_\theta(x)}\, F(dx) = \int \Big[ \ln\frac{f_{\theta_0}(x)}{f_\theta(x)} \Big] f_{\theta_0}(x)\, \mu(dx).

This quantity is called the Kullback-Leibler information. In the context of this section the following property of K is important.

Lemma 5.1.1. We have K(\theta, \theta_0) \ge 0, with K(\theta, \theta_0) = 0 if and only if \theta = \theta_0.

The assertion of this lemma may be reformulated as follows: consider the mapping

T_F : \theta \mapsto \int \ln f_\theta(x)\, F(dx),

which of course depends on F. Since K(\theta, \theta_0) = T_F(\theta_0) - T_F(\theta), Lemma 5.1.1 states that \theta_0 maximizes T_F. Denote with T(F) the maximizer of T_F, i.e.,

T(F) = \arg\max_\theta T_F(\theta).

When F = F_{\theta_0}, Lemma 5.1.1 yields T(F_{\theta_0}) = \theta_0. This property is frequently called Fisher consistency. Since F is unknown, so is T_F. Given an i.i.d.


sample from F, we may compute F_n and consider the mapping T_{F_n} instead of T_F:

T_{F_n}(\theta) = \int \ln f_\theta(x)\, F_n(dx) = \frac{1}{n}\sum_{i=1}^{n} \ln f_\theta(X_i).

The parameter T(F_n) is the maximum likelihood estimator (MLE) of \theta_0. For our discussion of T(F_n) = \arg\max_\theta T_{F_n}(\theta) it is sufficient to assume that T(F_n) is well defined. We see again that the target may be obtained as T(F), while the estimator equals T(F_n). Under some smoothness assumptions, the MLE solves the (vector) equation

\frac{1}{n}\sum_{i=1}^{n} \frac{\partial \ln f_\theta(X_i)}{\partial \theta} = \int \psi(x, \theta)\, F_n(dx) = 0,    (5.1.1)

with

\psi(x, \theta) = \frac{\partial \ln f_\theta(x)}{\partial \theta}.
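A small illustration of solving the empirical equation (5.1.1) numerically (not part of the text): for the exponential model f_\theta(x) = \theta e^{-\theta x} the score is \psi(x, \theta) = 1/\theta - x, and the root of the empirical score equals the MLE 1/\bar X. The model, the root-finding bracket, and the use of scipy are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
X = rng.exponential(scale=1 / 2.5, size=200)        # true theta_0 = 2.5

def empirical_score(theta):
    """int psi(x, theta) F_n(dx) = mean of psi(X_i, theta) with psi(x, theta) = 1/theta - x."""
    return np.mean(1.0 / theta - X)

theta_hat = brentq(empirical_score, 1e-6, 100.0)    # solve the empirical equation (5.1.1)
print(theta_hat, 1.0 / X.mean())                    # both equal the MLE 1/Xbar
```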

Equation (5.1.1) is an example of an empirical equation. The same principle also applies to other \psi's; see below. In such a situation the resulting estimator is called an M- (or maximum likelihood type) estimator, and the associated functional, attaching to a d.f. G the solution of

\int \psi(x, \theta)\, G(dx) = 0,

is called an M-functional. As it will turn out, the class of M-functionals may be substantially enlarged if we let \psi also depend on G:

L(\theta, G) \equiv \int \psi(x, \theta, G)\, G(dx).

The parameter of interest is the solution of L(\theta, F) = 0, and the estimator is obtained as the solution of the empirical equation

L(\theta, F_n) = 0.

We shall discuss several important estimators which all fit into this scheme:

R-estimators: Let J be a nondecreasing function on [0,1] such that J(1-t) = -J(t) for 0 \le t \le 1. Put

L(\theta, G) = \int J\Big( \frac{G(x) + 1 - G(2\theta - x)}{2} \Big)\, G(dx).

If F is symmetric about \theta_0, then L(\theta_0, F) = 0, i.e., the center of symmetry \theta_0 may be characterized as a solution of this equation. The estimator satisfies L(\theta, F_n) = 0.

For J(t) = t - 1/2, we obtain

\theta_n = \mathrm{med}\Big\{ \frac{X_i + X_j}{2} : 1 \le i, j \le n \Big\},

the Hodges-Lehmann estimator.
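As a quick illustration (not part of the text), the Hodges-Lehmann estimator is easy to compute directly as the median of all pairwise averages; the sample below is an arbitrary choice.

```python
import numpy as np

def hodges_lehmann(x):
    """Median of the pairwise averages (X_i + X_j)/2 over 1 <= i, j <= n."""
    x = np.asarray(x, dtype=float)
    pairwise_means = (x[:, None] + x[None, :]) / 2.0   # includes i = j, as in the definition above
    return np.median(pairwise_means)

rng = np.random.default_rng(3)
sample = rng.standard_cauchy(101) + 1.0                # symmetric about theta_0 = 1
print(hodges_lehmann(sample), np.median(sample))       # both estimate the center of symmetry
```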

L-M-estimators: We put

\psi(x, \theta, G) = J(G(x))\, \psi_0(x - \theta).

For J \equiv 1, we are back at the M-estimator. For \psi_0(y) = y, we arrive at so-called L-estimators.

Minimum-Distance estimators: Consider

d^2(G, F_\theta) = \int [G(x) - F_\theta(x)]^2\, G(dx).

The parameter of interest is the \theta minimizing d^2(F, F_\theta). Under smoothness this minimizer satisfies

\int \psi(x, \theta, F)\, F(dx) = 0

with

\psi(x, \theta, G) = [G(x) - F_\theta(x)]\, \frac{\partial}{\partial\theta} F_\theta(x).

More examples are discussed in Stute (1986), Stoch. Proc. Appl. 22, 223-244. This paper also provides conditions under which

n^{1/2}[T(F_n) - T(F)]

has a normal limit. Generally speaking, differentiability of \psi in \theta and G is required, guaranteeing that a Taylor expansion applies. For other estimators, like the Least Absolute Deviation estimator, differentiability is not available, so that other arguments will be needed.


5.2 Anova Decomposition of Statistical Functionals

Let S = S(X_1, \ldots, X_n) be a square integrable functional of independent variables X_1, \ldots, X_n. Set, for T \subset \{1, \ldots, n\},

S_T = E(S \mid X_i, i \in T).

In particular,

S_\emptyset = E(S),    S_T = S for T = \{1, \ldots, n\}.

Put

Y_T = \sum_{U \subset T} (-1)^{|T \setminus U|} S_U.

Lemma 5.2.1. (Efron-Stein) We have

S = \sum_{T \subset \{1,\ldots,n\}} Y_T,

where the summands satisfy

(i) E(Y_T) = 0 for each T \ne \emptyset,

(ii) Cov(Y_{T_1}, Y_{T_2}) = 0 for T_1 \ne T_2.

Proof. The representation of S follows from

\sum_{T \subset \{1,\ldots,n\}} Y_T = \sum_{T \subset \{1,\ldots,n\}} \sum_{U \subset T} (-1)^{|T \setminus U|} S_U = \sum_{U \subset \{1,\ldots,n\}} S_U \sum_{k=0}^{n-|U|} (-1)^k \binom{n-|U|}{k}.

The sum over k equals zero unless |U| = n, in which case it is one. For (i),

E(Y_T) = \sum_{U \subset T} (-1)^{|T \setminus U|} E(S) = E(S) \sum_{k=0}^{r} (-1)^k \binom{r}{k} = 0,

where r = |T| \ge 1.


As to (ii), since Y_\emptyset is a constant, we may assume T_1, T_2 \ne \emptyset w.l.o.g. Suppose also that T_2 \setminus T_1 is nonempty (otherwise, consider T_1 \setminus T_2). Then

Cov(Y_{T_1}, Y_{T_2}) = E(Y_{T_1} Y_{T_2}) = E\big( Y_{T_1}\, E(Y_{T_2} \mid X_i, i \in T_1) \big).

We show that the conditional expectation is zero. By independence we get

E(Y_{T_2} \mid X_i, i \in T_1) = \sum_{U \subset T_2} (-1)^{|T_2 \setminus U|} E(S \mid X_i, i \in U \cap T_1)
= \sum_{W \subset T_1 \cap T_2} (-1)^{|T_2| - |W| - |T_2 \setminus T_1|} E(S \mid X_i, i \in W) \sum_{V \subset T_2 \setminus T_1} (-1)^{|T_2 \setminus T_1| - |V|}.

Because of |T_2 \setminus T_1| > 0, the sum over the V's vanishes.

Remark 5.2.2. Set

W_k = \sum_{T \subset \{1,\ldots,k\}} Y_T,    1 \le k \le n.

Then

W_k = \sum_{T \subset \{1,\ldots,k\}} \sum_{U \subset T} (-1)^{|T \setminus U|} S_U = \sum_{U \subset \{1,\ldots,k\}} S_U \sum_{r=0}^{k-|U|} (-1)^r \binom{k-|U|}{r} = E(S \mid X_i, i = 1, \ldots, k),

i.e., (W_k)_k is a martingale w.r.t. F_k = \sigma(X_i, 1 \le i \le k), k = 1, \ldots, n.

If, in the Efron-Stein representation of S, we only consider the T's with cardinality not exceeding one, we obtain the Hajek projection \hat S:

\hat S := \sum_{|T| \le 1} Y_T = \sum_{i=1}^{n} E(S \mid X_i) - (n-1) E S.


5.3 The Jackknife Estimate of Variance

Recall the Efron-Stein decomposition of a square integrable statistic S = S(X_1, \ldots, X_n):

S_T := E(S \mid X_i, i \in T)

and

Y_T := \sum_{U \subset T} (-1)^{|T \setminus U|} S_U.

Then

S = \sum_{T \subset \{1,\ldots,n\}} Y_T.

Moreover, the Y_T's are uncorrelated. In particular,

Var(S) = \sum_{\emptyset \ne T \subset \{1,\ldots,n\}} Var(Y_T).

In this section we study estimation of Var(S) in the case when the X's are i.i.d. and S is symmetric. Then Var(Y_T) only depends on the cardinality of T. Write \sigma_k^2 = Var(Y_T) when |T| = k. Hence

s_n^2 \equiv Var(S) = \sum_{k=1}^{n} \binom{n}{k} \sigma_k^2.

Note that \sigma_k^2 depends on n. We now introduce the jackknife estimate of s_n^2. Denote with S_{(i)} = S(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) the value of S after deleting X_i from the sample. Observe that S_{(i)} is computed for sample size n-1. Let S_{(\cdot)} be the sample mean of S_{(1)}, \ldots, S_{(n)}. The jackknife estimate of s_{n-1}^2 is then given by

\hat s_{n-1}^2 := \sum_{i=1}^{n} [S_{(i)} - S_{(\cdot)}]^2.
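A minimal computational sketch (not from the text): the jackknife variance estimate \hat s_{n-1}^2 for a generic symmetric statistic, computed from the n leave-one-out values. The statistics used below (median and mean) are arbitrary choices; for the sample mean the result equals the classical unbiased estimate divided by n-1, in line with the linear case discussed later in this section.

```python
import numpy as np

def jackknife_variance(x, stat):
    """hat s^2_{n-1} = sum_i [S_(i) - S_(.)]^2 over the leave-one-out values S_(i)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    leave_one_out = np.array([stat(np.delete(x, i)) for i in range(n)])
    return np.sum((leave_one_out - leave_one_out.mean()) ** 2)

rng = np.random.default_rng(4)
x = rng.normal(size=200)
print(jackknife_variance(x, np.median))                               # variance estimate for the median
print(jackknife_variance(x, np.mean), x.var(ddof=1) / (len(x) - 1))   # mean: both values coincide
```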

It is the purpose of this section to compare E\hat s_{n-1}^2 with s_{n-1}^2. For this, compare S_{(i)} with S_{(j)}, i \ne j. Check that

S_{(i)} - S_{(j)} = \Sigma' Y_T - \Sigma'' Y_T,

where \Sigma' denotes summation over all T \subset \{1, \ldots, i-1, i+1, \ldots, n\} containing j, and \Sigma'' denotes summation over all T \subset \{1, \ldots, j-1, j+1, \ldots, n\} containing i. Since thus all T appearing on the right-hand side are different, the corresponding Y's are uncorrelated. By symmetry, we thus get

E(S_{(i)} - S_{(j)})^2 = 2 \sum_{k=1}^{n-1} \binom{n-2}{k-1} \sigma_k^2.

Now,

2\hat s_{n-1}^2 = n^{-1} \sum_{i \ne j} (S_{(i)} - S_{(j)})^2,

whence

E\hat s_{n-1}^2 = (n-1) \sum_{k=1}^{n-1} \binom{n-2}{k-1} \sigma_k^2.

Recall

s_{n-1}^2 = \sum_{k=1}^{n-1} \binom{n-1}{k} \sigma_k^2.

We find that

E\hat s_{n-1}^2 - s_{n-1}^2 = \sum_{k=2}^{n-1} (k-1) \binom{n-1}{k} \sigma_k^2 \ge 0.

It follows that the jackknife estimate of variance is always biased upwards. It is unbiased only when \sigma_k^2 = 0 for k \ge 2. In this case

S = E(S) + \sum_{i=1}^{n} Y_{\{i\}} = \sum_{i=1}^{n} E(S \mid X_i) - (n-1) E(S),

i.e., S coincides with its Hajek projection. Such S's are called linear. They are called quadratic iff \sigma_k^2 = 0 for k \ge 3. Such an S admits a representation

S = E(S) + \sum_{i=1}^{n} Y_{\{i\}} + \sum_{i \ne j} Y_{\{i,j\}},

a U-statistic of degree two. More generally, every S for which \sigma_k^2 = 0 when k > m is a U-statistic of degree m (provided \sigma_m^2 \ne 0).

Conversely, if S only depends on X_1, \ldots, X_m, then Y_T = 0 for each T \subset \{1, \ldots, n\} not wholly contained in \{1, \ldots, m\}. Thus, if S = \sum h(X_{i_1}, \ldots, X_{i_m}) is a (square-integrable) U-statistic of degree m, we find that S = \Sigma' Y_T, where \Sigma' extends over all T \subset \{1, \ldots, n\} with |T| \le m. Hence \sigma_k^2 = 0


whenever k > m. In summary, \sigma_k^2 = 0 for k > m if and only if S is a U-statistic of degree m.

The i-th pseudo-value is defined as

S_i^* = n S(X_1, \ldots, X_n) - (n-1) S(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) = n S(X_1, \ldots, X_n) - (n-1) S_{(i)}.

Denote with \bar S^* their arithmetic mean. Then

\sum_{i=1}^{n} (S_i^* - \bar S^*)^2 = (n-1)^2 \sum_{i=1}^{n} (S_{(i)} - S_{(\cdot)})^2,

i.e., in terms of the pseudo-values the jackknife estimate of variance equals

(n-1)^{-2} \sum_{i=1}^{n} (S_i^* - \bar S^*)^2.

Example 5.3.1. For the sample mean, S_i^* = X_i and \bar S^* = \bar X_n. Since \bar X_n is linear, the jackknife estimate of variance (n-1)^{-2} \sum_{i=1}^{n} (X_i - \bar X_n)^2 is unbiased (for sample size n-1).

Example 5.3.2. If S is the median and n = 2m, then

\hat s_{n-1}^2 = \frac{n}{4}\, [X_{m+1:n} - X_{m:n}]^2.

5.4 The Jackknife Estimate of Bias

Let S_n \equiv S = S(X_1, \ldots, X_n) be an integrable function of i.i.d. random variables X_1, \ldots, X_n \sim F. Assume that S is designed to estimate a population parameter \theta(F). Consider the bias

bias := E_F(S) - \theta(F).

Quenouille's estimate of bias is

BIAS := (n-1)[S_{(\cdot)} - S] = \frac{n-1}{n}\sum_{i=1}^{n} S_{(i)} - (n-1) S.

This leads to the bias-corrected jackknife estimate of \theta(F):

S^* = S - BIAS = n S - (n-1) S_{(\cdot)}.


Lemma 5.4.1. Assume that bias has an expansion

bias = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + O(n^{-4}),

the parameters a_1, a_2 and a_3 usually depending on F. Then S^* has the bias expansion

E(S^*) - \theta(F) = \frac{-a_2}{n(n-1)} + O(n^{-3}).

Proof. We have

E(S^*) - \theta(F) = n E(S_n) - (n-1) E(S_{n-1}) - \theta(F)
= n[E(S_n) - \theta(F)] - (n-1)[E(S_{n-1}) - \theta(F)]
= a_1 + \frac{a_2}{n} + \frac{a_3}{n^2} - \Big( a_1 + \frac{a_2}{n-1} + \frac{a_3}{(n-1)^2} \Big) + O(n^{-3})
= a_2 \Big( \frac{1}{n} - \frac{1}{n-1} \Big) + a_3 \Big( \frac{1}{n^2} - \frac{1}{(n-1)^2} \Big) + O(n^{-3}),

whence the assertion.

Example 5.4.2. Take \theta(F) = \sigma^2(F), and let S = n^{-1}\sum_{i=1}^{n}(X_i - \bar X_n)^2. A simple calculation shows

BIAS = \frac{-1}{n(n-1)}\sum_{i=1}^{n}(X_i - \bar X_n)^2,

yielding

S^* = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X_n)^2,

the usual unbiased estimate of \theta(F).
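A short numerical check of Example 5.4.2 (not part of the text): computing Quenouille's bias estimate for the plug-in variance S = n^{-1}\sum(X_i - \bar X_n)^2 and verifying that the bias-corrected value S^* equals the usual unbiased variance estimate.

```python
import numpy as np

def jackknife_bias_correct(x, stat):
    """Return (BIAS, S*) with BIAS = (n-1)[S_(.) - S] and S* = n*S - (n-1)*S_(.)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = stat(x)
    S_dot = np.mean([stat(np.delete(x, i)) for i in range(n)])
    bias_hat = (n - 1) * (S_dot - S)
    return bias_hat, n * S - (n - 1) * S_dot

x = np.random.default_rng(5).normal(size=50)
plug_in_var = lambda y: np.mean((y - y.mean()) ** 2)
bias_hat, S_star = jackknife_bias_correct(x, plug_in_var)
print(S_star, x.var(ddof=1))   # S* coincides with the unbiased variance estimate
```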

We now compute the bias corrected value of the jackknife estimate of variance

\hat s_{n-1}^2 = \sum_{i=1}^{n} [S_{(i)} - S_{(\cdot)}]^2 \equiv W.

Clearly,

W = \sum_{i=1}^{n} S_{(i)}^2 - n^{-1} \sum_{i,j=1}^{n} S_{(i)} S_{(j)}.


Denote with S_{ir} the value of S taken at the sample with X_i and X_r deleted. Then

W_{(\cdot)} = \frac{1}{n}\sum_{r=1}^{n}\sum_{i \ne r} S_{ir}^2 - \frac{1}{n(n-1)}\sum_{r=1}^{n}\sum_{i,j \ne r} S_{ir} S_{jr},

so that by symmetry

E W_{(\cdot)} = (n-2) E S_{12}^2 - (n-2) E S_{13} S_{23}.

On the other hand,

E W = (n-1) E S_{(1)}^2 - (n-1) E S_{(1)} S_{(2)}.

Since

W^* = W - BIAS = n W - (n-1) W_{(\cdot)},

we find that

E W^* = n(n-1)\big[ E S_{(1)}^2 - E S_{(1)} S_{(2)} \big] - (n-1)(n-2)\big[ E S_{13}^2 - E S_{13} S_{23} \big].

But

E S_{(1)} S_{(2)} = E S_{(1)}^2 - \frac{1}{2} E(S_{(1)} - S_{(2)})^2

and

E S_{13} S_{23} = E S_{13}^2 - \frac{1}{2} E(S_{13} - S_{23})^2.

Conclude that

E W^* = \frac{1}{2} n(n-1) E(S_{(1)} - S_{(2)})^2 - \frac{1}{2}(n-1)(n-2) E(S_{13} - S_{23})^2
= n(n-1)\sum_{k=1}^{n-1}\binom{n-2}{k-1}\sigma_{k,n-1}^2 - (n-1)(n-2)\sum_{k=1}^{n-2}\binom{n-3}{k-1}\sigma_{k,n-2}^2.

The second index i in \sigma_{k,i}^2 indicates the underlying sample size. Note that the first summand equals n E\hat s_{n-1}^2, while the second is (n-1) E\hat s_{n-2}^2. Both expectations exceed the corresponding true variances. By subtracting them (when appropriately normalized) it might well be that W^* leads to a "truly" bias corrected value of the jackknife estimate of variance.

To make this argument work one also needs some smoothness of the variance as a function of n.


Remark 5.4.3. In terms of the pseudo-values S_i^* = n S - (n-1) S_{(i)}, we have

BIAS = S - n^{-1}\sum_{i=1}^{n} S_i^*.

To motivate the choice of BIAS, assume that S_n = S(F_n) is a smooth statistical functional of the empirical d.f. F_n, i.e., \theta(F) = \lim_{n \to \infty} S(F_n) in L_1. For sample size n-1,

bias = E S(F_{n-1}) - \theta(F) = -\sum_{k=0}^{\infty}\big[ E S(F_{n+k}) - E S(F_{n+k-1}) \big].

Replace the series by the finite sum from k = 0 to k = m. In S(F_{n+k}) - S(F_{n+k-1}) we compare values of S for two consecutive sample sizes. For k = 0 we may generate such a pair by deleting a point mass from F_n. In doing this for each 1 \le i \le n, we arrive at

E S(F_n) - E S(F_{n-1}) \sim \frac{1}{n}\sum_{i=1}^{n}[S_n - S_{(i)}] = S_n - S_{(\cdot)}.

Taking the last term also for estimating E S(F_{n+k}) - E S(F_{n+k-1}) for k > 0, and letting m = n-2, we finally arrive at BIAS.


Chapter 6

Stochastic Inequalities

6.1 The D-K-W Bound

This section deals with the most famous bound on the deviation between F_n and F, the Dvoretzky-Kiefer-Wolfowitz (1956) exponential bound for the upper tails of D_n^+.

Theorem 6.1.1. There exists a universal constant c > 0 so that for all x \ge 0 and n \ge 1

P(n^{1/2} D_n^+ > x) \le c \exp(-2x^2).    (6.1.1)

Proof. In view of (3.1.17), it suffices to study the case F = F_U. The original proof of Dvoretzky-Kiefer-Wolfowitz, which is presented here, rests on Theorem 3.4.5. Since the left-hand side of (6.1.1) equals zero for x \ge \sqrt{n}, we only need to consider 0 \le x \le \sqrt{n}. Apply Theorem 3.4.5 to get

P(n^{1/2} D_n^+ > x) = (1 - x/\sqrt{n})^n + x\sqrt{n} \sum_{j=\langle x\sqrt{n}\rangle + 1}^{n-1} Q_n(j, x),    (6.1.2)

with

Q_n(j, x) = \binom{n}{j} (j - x\sqrt{n})^j (n - j + x\sqrt{n})^{n-j-1} n^{-n}.

Taking logarithms and differentiating we see that on 0 \le x \le \sqrt{n} the function

x \mapsto (1 - x/\sqrt{n})^n \exp(2x^2)

attains its maximum at zero. Conclude that

(1 - x/\sqrt{n})^n \le \exp(-2x^2).


Moreover, for x\sqrt{n} < j < n, we get

\frac{d}{dx}\ln Q_n(j,x) = \frac{-x n^2}{(j - x\sqrt{n})(n - j + x\sqrt{n})} - \frac{\sqrt{n}}{n - j + x\sqrt{n}} < -4x\Big[1 + \frac{4}{n^2}\Big(\frac{n}{2} - j + x\sqrt{n}\Big)^2\Big] = -4x - \frac{16x}{n^2}\Big(\frac{n}{2} - j + x\sqrt{n}\Big)^2.

The last term, however, has the primitive

x \mapsto -2x^2 - \frac{8x^2}{n^2}\Big(\frac{n}{2} - j + \frac{2x\sqrt{n}}{3}\Big)^2 - \frac{4x^4}{9n}.

Integration therefore leads to

Q_n(j,x) \le Q_n(j,0) \exp\Big[-2x^2 - \frac{8x^2}{n^2}\Big(\frac{n}{2} - j + \frac{2x\sqrt{n}}{3}\Big)^2 - \frac{4x^4}{9n}\Big]    (6.1.3)

as well as, for x \ge 1,

Q_n(j,x) \le c_1 Q_n(j,1) \exp\Big[-2x^2 - \frac{8x^2}{n^2}\Big(\frac{n}{2} - j + \frac{2x\sqrt{n}}{3}\Big)^2 - \frac{4x^4}{9n}\Big]    (6.1.4)

with a universal constant c_1 > 0. To bound Q_n(j,0) and Q_n(j,1), we first consider the case |j - n/2| \le n/4. From Stirling's formula

k! \sim \Big(\frac{k}{e}\Big)^k \sqrt{2\pi k}

we may infer that

Q_n(j,0) \le c_2 n^{-3/2}

for some generic constant c_2. Denote with \Sigma' the sum over those j in (6.1.2) satisfying |j - n/2| \le n/4, and let \Sigma'' be the sum over the remaining ones. From (6.1.3) we obtain

\Sigma' Q_n(j,x) \le c_2 n^{-3/2} \exp(-2x^2)\, \Sigma' \exp\Big[-8x^2\Big(\frac{1}{2} - \frac{j}{n} + \frac{2x}{3\sqrt{n}}\Big)^2\Big]
\le 2 c_2 n^{-3/2} \exp(-2x^2) \sum_{j=0}^{\infty} \exp[-8x^2 j^2/n^2]
\le 2 c_2 n^{-1/2} \exp(-2x^2) \Big[ \frac{1}{n} + \int_{0}^{\infty} \exp(-8x^2 t^2)\, dt \Big]
\le c_3 x^{-1} n^{-1/2} \exp(-2x^2)


and therefore

x n^{1/2}\, \Sigma' Q_n(j,x) \le c_3 \exp(-2x^2).

We now turn to the second sum. We assume w.l.o.g. that x > 1. If 2x\sqrt{n}/3 \le n/8, the second term in the exponent of (6.1.4) does not exceed -x^2/8. For x > 3\sqrt{n}/16, the last term is less than or equal to -(4/9)(3/16)^2 x^2, so that in either case

Q_n(j,x) \le c_1 Q_n(j,1) \exp[-2x^2 - c_4 x^2] \le c_5 x^{-1} Q_n(j,1) \exp(-2x^2).

It follows, since \sqrt{n}\, \Sigma_j Q_n(j,1) \le 1, that

x\sqrt{n}\, \Sigma'' Q_n(j,x) \le c_5 \exp(-2x^2)\, \sqrt{n}\, \Sigma'' Q_n(j,1) \le c_5 \exp(-2x^2).

The proof of the Theorem is complete.

The proof of the Theorem is complete.

Massart (1990) was able to refine the original arguments of D-K-W andproved that the optimal c equals 1.
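A small Monte Carlo sketch (not part of the text) that checks the one-sided bound numerically, using Massart's constant c = 1; the sample size, number of replications, and level x are arbitrary choices.

```python
import numpy as np

# Check of P(sqrt(n) * D_n^+ > x) <= exp(-2 x^2) for the uniform d.f.
rng = np.random.default_rng(6)
n, reps, x = 100, 20000, 1.0

exceed = 0
grid = np.arange(1, n + 1) / n                     # F_n evaluated at the order statistics
for _ in range(reps):
    U = np.sort(rng.uniform(size=n))
    Dn_plus = np.max(grid - U)                     # sup_t [F_n(t) - t] is attained at the U_(i)
    exceed += (np.sqrt(n) * Dn_plus > x)

print("empirical tail:", exceed / reps, "   bound exp(-2x^2):", np.exp(-2 * x**2))
```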

Theorem 6.1.1 immediately yields some rough rates for the convergence of F_n to F. First, for a given \varepsilon > 0, the D-K-W bound yields

P(D_n > \varepsilon) \le 2c \exp(-2n\varepsilon^2).

Since the series of the terms on the right-hand side converges, the Borel-Cantelli Lemma yields the Glivenko-Cantelli Theorem, namely that D_n \to 0 with probability one. A slight modification of this argument yields the following corollary.

Corollary 6.1.2. With probability one,

\limsup_{n \to \infty} \Big[\frac{n}{\ln n}\Big]^{1/2} D_n < \infty.

Proof. Put x = (c \ln n)^{1/2} with c > 1/2 and argue as before.

The assertion of Corollary 6.1.2 is often stated in the form

D_n = O\Big(\Big(\frac{\ln n}{n}\Big)^{1/2}\Big)    with probability one.

For many applications this bound is sufficient. The precise rate of convergence is of the order (\ln\ln n / n)^{1/2}, which indicates that the almost sure behavior of F_n, as n \to \infty, is governed by a Law of the Iterated Logarithm.


6.2 Binomial Tail Bounds

As before, let X_1, \ldots, X_n be i.i.d. from F so that \eta := n F_n(t) \sim Bin(n, p) with p = F(t). If 0 < p < 1, the CLT yields

P\Big( n^{1/2}\frac{F_n(t) - F(t)}{\sqrt{F(t)(1-F(t))}} \ge x \Big) \to \frac{1}{\sqrt{2\pi}}\int_{x}^{\infty} e^{-u^2/2}\, du = P(\xi \ge x),    (6.2.1)

where \xi \sim N(0,1). Mill's ratio states that for all x > 0

\Big(\frac{1}{x} - \frac{1}{x^3}\Big)\frac{1}{\sqrt{2\pi}}\exp(-x^2/2) \le P(\xi \ge x) \le \frac{1}{x}\,\frac{1}{\sqrt{2\pi}}\exp(-x^2/2),    (6.2.2)

i.e., neglecting the factors 1/x and 1/x^3, the upper tails of a standard normal distribution decrease exponentially fast. In this section we show that a similar bound also holds, for finite n, for a standardized binomial variable.

First, the Chebychev inequality leads to

P\Big( n^{1/2}\frac{F_n(t) - F(t)}{\sqrt{F(t)(1-F(t))}} \ge x \Big) \le x^{-2},

which is far worse than what we may expect from (6.2.1) and (6.2.2). A much sharper bound is obtained if we incorporate the moment generating function of a binomial random variable.

Lemma 6.2.1. Let \eta \sim Bin(n,p), 0 < p < 1. Then the moment generating function of \eta equals

M(z) = E[\exp(z\eta)] = [(1-p) + p\exp z]^n.

Proof. Use the fact that \eta equals, in distribution, the sum of n independent Bernoulli random variables with parameter p, whose moment generating function equals (1-p) + p\exp z.

To bound P(\eta - np \ge \varepsilon) for \varepsilon > 0, note that the Markov inequality yields for each z \ge 0:

P(\eta - np \ge \varepsilon) \le E[\exp(z(\eta - np - \varepsilon))] = M(z)\exp[-z(np + \varepsilon)].

We thus get the following bound.


Lemma 6.2.2. For each \varepsilon > 0 we have

P(\eta - np \ge \varepsilon) \le \inf_{z \ge 0} M(z)\exp[-z(np + \varepsilon)] \equiv \rho.

To determine \rho let f be defined by

\exp[-f(u)] = \inf_{z \ge 0} M(z)\exp(-zu),

so that

\rho = \exp[-f(np + \varepsilon)].

The function f is called the Chernoff function of M. Clearly, f is nonnegative and nondecreasing.

Lemma 6.2.3. We have

f(u) \equiv f(u,n,p) =
  0                                                      for u \le 0,
  u \ln\frac{u}{np} + (n-u)\ln\frac{n-u}{n(1-p)}          for 0 < u < n,
  -n \ln p                                               for u = n,
  +\infty                                                for n < u.

Proof. The assertion follows by formal differentiation of M(z)\exp(-zu) in (0,n) and a separate check outside of this interval.

Before we further analyze the Chernoff function we note that the lower tail

P(\eta - np \le -\varepsilon)

equals

P(\eta' \ge \varepsilon + n(1-p)),    where \eta' \sim Bin(n, 1-p),

whence

P(\eta - np \le -\varepsilon) \le \exp[-f(n(1-p) + \varepsilon, n, 1-p)].

Coming back to the upper tails with u = np + \varepsilon, we shall focus on 0 < u < n, i.e., \varepsilon < n(1-p). We write

f(u,n,p) = np\Big[1 + \frac{\varepsilon}{np}\Big]\ln\Big[1 + \frac{\varepsilon}{np}\Big] + n(1-p)\Big[1 - \frac{\varepsilon}{n(1-p)}\Big]\ln\Big[1 - \frac{\varepsilon}{n(1-p)}\Big].


For |x| < 1 we may expand

(1+x)\ln(1+x) = x + \frac{x^2}{1\cdot 2} - \frac{x^3}{2\cdot 3} + \frac{x^4}{3\cdot 4} - \ldots

and

(1-x)\ln(1-x) = -x + \frac{x^2}{1\cdot 2} + \frac{x^3}{2\cdot 3} + \frac{x^4}{3\cdot 4} + \ldots

With x_1 = \varepsilon/np and x_2 = \varepsilon/n(1-p) we thus get

f(u,n,p) = \frac{\varepsilon^2}{2np(1-p)} + np\Big[-\frac{x_1^3}{2\cdot 3} + \frac{x_1^4}{3\cdot 4} - \ldots\Big] + n(1-p)\Big[\frac{x_2^3}{2\cdot 3} + \frac{x_2^4}{3\cdot 4} + \ldots\Big].

When 0 < x_1, x_2 < 1 it therefore follows that

f(u,n,p) \ge \frac{\varepsilon^2}{2np(1-p)} + np\Big[-\frac{x_1^3}{2\cdot 3} + \frac{x_1^4}{3\cdot 4} - \ldots\Big] \ge \frac{\varepsilon^2}{2np(1-p)} + \frac{np}{1-p}\Big[-\frac{x_1^3}{2\cdot 3} + \frac{x_1^4}{3\cdot 4} - \ldots\Big] = \frac{\varepsilon^2}{2np(1-p)}\,\psi\Big(\frac{\varepsilon}{np}\Big).

The function \psi is defined, for |x| < 1, as

\psi(x) = 1 - \frac{x}{3} + \frac{x^2}{6} - \ldots + \frac{(-1)^k\, 2x^k}{(k+2)(k+1)} + \ldots

Check that

\psi(x) = 2h(1+x)/x^2    (6.2.3)

with

h(x) = x(\ln x - 1) + 1.    (6.2.4)

In terms of (6.2.3) and (6.2.4), \psi is defined for all positive x. Putting

g_1(\varepsilon) = f(np + \varepsilon, n, p)    and    g_2(\varepsilon) = \varepsilon^2 \psi\Big(\frac{\varepsilon}{np}\Big)\Big/ 2np(1-p),

we easily find that

g_1(0) = 0 = g_2(0)    and    \partial g_1/\partial\varepsilon \ge \partial g_2/\partial\varepsilon.

The second inequality is equivalent to

(1-p)\ln\Big[1 - \frac{\varepsilon}{n(1-p)}\Big] + p\ln\Big[1 + \frac{\varepsilon}{np}\Big] \le 0,

which easily follows from the concavity of x \mapsto \ln(1+x). We thus obtain g_1(\varepsilon) \ge g_2(\varepsilon).


Lemma 6.2.4. Let \eta \sim Bin(n,p). Then we have for all \varepsilon > 0

(i)  P(\eta - np \ge \varepsilon) \le \exp\Big[-\frac{\varepsilon^2}{2np(1-p)}\psi\Big(\frac{\varepsilon}{np}\Big)\Big] = \exp\Big[-\frac{np}{1-p}\, h\Big(1 + \frac{\varepsilon}{np}\Big)\Big],

(ii) P(\eta - np \le -\varepsilon) \le \exp\Big[-\frac{\varepsilon^2}{2np(1-p)}\psi\Big(\frac{\varepsilon}{n(1-p)}\Big)\Big].

Improved bounds are obtained if we use sharper lower bounds for f(u,n,p). For example, when 1-p \le p and therefore p \ge 1/2, we get for u = np + \varepsilon

f(u,n,p) \ge \varepsilon^2/2np(1-p).    (6.2.5)

For an arbitrary 0 < p < 1, the somewhat weaker bound

f(u,n,p) \ge \frac{\varepsilon^2}{2np(1-p)} - \frac{np}{6}\Big(\frac{\varepsilon}{np}\Big)^3 = \frac{\varepsilon^2}{2np(1-p)}\Big[1 - \frac{\varepsilon(1-p)}{3np}\Big]    (6.2.6)

will suffice for most applications.

In many situations

\frac{\varepsilon(1-p)}{3np} \le \delta,    a given threshold.

In this case

f(u,n,p) \ge \frac{\varepsilon^2(1-\delta)}{2np(1-p)}.    (6.2.7)

For the standardized F_n(t), we obtain for all x \ge 0 and a given threshold 0 < \delta < 1

P\Big(\frac{n^{1/2}[F_n(t)-F(t)]}{\sqrt{F(t)(1-F(t))}} \ge x\Big) \le \exp[-x^2/2]  if p = F(t) \ge 1/2,  and  \le \exp[-x^2(1-\delta)/2]  for a general p = F(t),

provided that x \le 3\delta\sqrt{np}/(1-p)^{3/2}.
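A quick numerical sketch (not from the text) comparing the exact binomial upper tail with the Chernoff bound exp[-f(np+\varepsilon, n, p)] from Lemmas 6.2.2 and 6.2.3 and with the weaker bound (6.2.7); scipy is assumed to be available for the exact tail, and the parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import binom

n, p, eps = 200, 0.3, 15.0
u = n * p + eps

# Chernoff function f(u, n, p) = u ln(u/(np)) + (n-u) ln((n-u)/(n(1-p))) for 0 < u < n
f = u * np.log(u / (n * p)) + (n - u) * np.log((n - u) / (n * (1 - p)))

delta = eps * (1 - p) / (3 * n * p)                      # threshold appearing in (6.2.7)
bound_627 = np.exp(-eps**2 * (1 - delta) / (2 * n * p * (1 - p)))

exact = binom.sf(np.ceil(n * p + eps) - 1, n, p)         # P(eta >= np + eps)
print("exact tail:", exact)
print("Chernoff bound exp(-f):", np.exp(-f))
print("bound (6.2.7):", bound_627)
```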


6.3 Oscillations of Empirical Processes

Consider the uniform empirical process

\alpha_n(t) = n^{1/2}[F_n(t) - t],    0 \le t \le 1,

based on a sample U_1, \ldots, U_n. Since each U_i is a discontinuity of \alpha_n, the number of jumps increases with n. On the other hand, the jump size n^{-1/2} tends to zero, so that a priori it is not clear how the paths evolve as n gets large. A measure for the roughness of a sample path is given through the oscillation modulus

\omega_n(a) = \sup_{|t-s| \le a} |\alpha_n(t) - \alpha_n(s)|.

Typically, 0 < a < 1 is a small number, so that \omega_n measures local oscillations of \alpha_n. To study \omega_n in greater detail we first keep s fixed, say s = 0, and consider the one-sided (local) deviation \sup_{0 \le t \le a} \alpha_n(t). Since the supremum is taken over an uncountable set, it is wise to first bound \alpha_n on a finite grid 0 = t_0 < t_1 < \ldots < t_m = a and then let the mesh of the partition tend to zero. For level x > 0, put

A_i = \{\alpha_n(t_i) > x\}.

We then have

P\Big( \sup_{0 \le i \le m} \alpha_n(t_i) > x \Big) = P\Big( \bigcup_{i=0}^{m} A_i \Big).

Next we introduce an event

A_m^* = \{\alpha_n(t_m) > x^*\}.

The threshold x^* will be a little smaller than x, so that A_m \subset A_m^*. We have

P\Big( \bigcup_{i=0}^{m} A_i \Big) = P\Big( \bigcup_{i=0}^{m} A_i \cap A_m^* \Big) + P\Big( \bigcup_{i=0}^{m} A_i \cap \bar A_m^* \Big)
\le P(A_m^*) + \sum_{i=0}^{m-1} P\big( \bar A_m^* \cap A_i \cap \bar A_0 \cap \ldots \cap \bar A_{i-1} \big).

We may further decompose each set A_i \cap \bar A_0 \cap \ldots \cap \bar A_{i-1} into finitely many sets on which the discrete variables \alpha_n(t_j), j \le i, attain specified values


\alpha_n(t_j) = x_j. Write

P(A_m^*, \alpha_n(t_j) = x_j for 0 \le j \le i)
= P(A_m^* \mid \alpha_n(t_j) = x_j for 0 \le j \le i)\, P(\alpha_n(t_j) = x_j for 0 \le j \le i)
= P(A_m^* \mid \alpha_n(t_i) = x_i)\, P(\alpha_n(t_j) = x_j for 0 \le j \le i),

where the last equality uses the Markov property of \alpha_n. Suppose x^* has been chosen so that each of the conditional probabilities P(\bar A_m^* \mid \alpha_n(t_i) = x_i) is less than or equal to a constant c, c < 1. This would yield

P\Big( \bigcup_{i=0}^{m} A_i \Big) \le P(A_m^*) + c \sum_{i=0}^{m-1} P(A_i \cap \bar A_0 \cap \ldots \cap \bar A_{i-1}) \le P(A_m^*) + c\, P\Big( \bigcup_{i=0}^{m} A_i \Big).

From this we immediately obtain

P\Big( \sup_{0 \le i \le m} \alpha_n(t_i) > x \Big) \le \frac{1}{1-c}\, P(\alpha_n(a) > x^*).

Since the right-hand side does not depend on the chosen grid, the bound also holds when we let the mesh tend to zero. Since \alpha_n is continuous from the right and has left-hand limits, we finally arrive at

P\Big( \sup_{0 \le t \le a} \alpha_n(t) > x \Big) \le \frac{1}{1-c}\, P(\alpha_n(a) > x^*).    (6.3.1)

Arguments similar to those which led to the maximal inequality (6.3.1) are well known and have been successfully applied also outside empirical process theory.

We now investigate conditions on x^* \le x guaranteeing

P(\bar A_m^* \mid \alpha_n(t_i) = x_i) \le c

for each x_i > x. Now,

P(\alpha_n(t_m) \le x^* \mid \alpha_n(t_i) = x_i) = P\Big( (n-N)\, F_{n-N}\Big(\frac{t_m - t_i}{1 - t_i}\Big) \le \sqrt{n}\, x^* + t_m n - N \Big),

with N = \sqrt{n}\, x_i + n t_i. The last probability equals

P\Big( (n-N)\, F_{n-N}\Big(\frac{t_m - t_i}{1 - t_i}\Big) - (n-N)\frac{t_m - t_i}{1 - t_i} \le \sqrt{n}\Big[ x^* - x_i\frac{1 - t_m}{1 - t_i} \Big] \Big).

Since x_i > x,

\sqrt{n}\Big[ x^* - x_i\frac{1 - t_m}{1 - t_i} \Big] \le \sqrt{n}\Big[ x^* - x\frac{1 - t_m}{1 - t_i} \Big] \le \sqrt{n}\,[x^* - x(1-a)].

Write x^* = x(1 - \delta) with a < \delta/2. Then

\sqrt{n}\,[x^* - x(1-a)] \le -\sqrt{n}\, x\delta/2.

Under these conditions the Chebychev inequality yields

P(\alpha_n(t_m) \le x^* \mid \alpha_n(t_i) = x_i) \le \frac{(n-N)\frac{t_m - t_i}{1 - t_i}}{n x^2\delta^2/4} \le \frac{4a}{x^2\delta^2} \le 1/2,

where the last inequality holds provided that 8a \le x^2\delta^2.

Lemma 6.3.1. Assume that

(i) a < \delta/2,

(ii) 8a \le x^2\delta^2.

Then

P\Big( \sup_{0 \le t \le a} \alpha_n(t) > x \Big) \le 2 P\big( \alpha_n(a) > x(1-\delta) \big).

A similar bound holds for the lower tails, so that summarizing we get under (i)-(ii)

P\Big( \sup_{0 \le t \le a} |\alpha_n(t)| > x \Big) \le 2 P\big( |\alpha_n(a)| > x(1-\delta) \big).    (6.3.2)

Next we derive an upper bound for the oscillation modulus \omega_n.

Lemma 6.3.2. Let 0 < a, \delta < 1 and s > 0 be such that

(i) a < \delta/2,

(ii) 8 \le [s\delta/(1+\delta)]^2,

(iii) s \le \delta x_\delta \sqrt{na}/4 for some positive x_\delta depending only on \delta.

Then

P(\omega_n(a) > s\sqrt{a}) \le C_\delta\, a^{-1} \exp[-s^2(1-\delta)^5/2],

with C_\delta = 64\delta^{-2}.

Proof. Put x = s\sqrt{a} and let R be the smallest integer satisfying 1/\sqrt{R} \le \delta\sqrt{a}/2. We obviously have

\omega_n(a) \le \max_{0 \le i \le R-1} \sup_{0 \le t \le a} \Big|\alpha_n\Big(\frac{i}{R}+t\Big) - \alpha_n\Big(\frac{i}{R}\Big)\Big| + 2\max_{0 \le i \le R-1} \sup_{0 \le \tau \le 1/R} \Big|\alpha_n\Big(\frac{i}{R}+\tau\Big) - \alpha_n\Big(\frac{i}{R}\Big)\Big|.

Since \alpha_n has stationary increments, we obtain

P(\omega_n(a) > x) \le P\Big( \max_{0 \le i \le R-1} \sup_{0 \le t \le a} \Big|\alpha_n\Big(\frac{i}{R}+t\Big) - \alpha_n\Big(\frac{i}{R}\Big)\Big| > \frac{x}{1+\delta} \Big) + P\Big( 2\max_{0 \le i \le R-1} \sup_{0 \le \tau \le 1/R} \Big|\alpha_n\Big(\frac{i}{R}+\tau\Big) - \alpha_n\Big(\frac{i}{R}\Big)\Big| > \frac{\delta x}{1+\delta} \Big)
\le R\, P\Big( \sup_{0 \le t \le a} |\alpha_n(t)| > \frac{x}{1+\delta} \Big) + R\, P\Big( \sup_{0 \le t \le 1/R} |\alpha_n(t)| > \frac{\delta x}{2(1+\delta)} \Big).

It follows from (i)-(ii) and the choice of R that (6.3.2) is applicable to each of the above probabilities. Hence

P(\omega_n(a) > x) \le 2R\Big[ P\Big(|\alpha_n(a)| > \frac{x(1-\delta)}{1+\delta}\Big) + P\Big(\Big|\alpha_n\Big(\frac{1}{R}\Big)\Big| > \frac{x\delta(1-\delta)}{2(1+\delta)}\Big) \Big].

Under (iii), we may apply (6.2.7) and thus get the final result.

Theorem 6.3.3. For each \varepsilon > 0 and every \eta > 0 there exists a small a > 0 such that for all n \ge n_0(\varepsilon, \eta)

P(\omega_n(a) \ge \varepsilon) \le \eta.    (6.3.3)

Proof. Put \delta = 1/2. We shall apply Lemma 6.3.2 with s = a^{-1/4}. Conditions (i)-(iii) from Lemma 6.3.2 are then all satisfied for a > 0 sufficiently small and n sufficiently large. Moreover, we may choose a > 0 so small that also \varepsilon \ge a^{1/4} and C_{1/2}\, a^{-1} \exp[-a^{-1/2}/64] \le \eta hold. This completes the proof.

proof.

We also mention that Theorem 6.3.3 may be readily applied to bound theoscillation modulus of empirical processes from a general F satisfying someweak smoothness assumptions.

If, for example, F is Lipschitz:

|F (x)− F (y)| ≤M |x− y| (6.3.4)

then, in view of (6.3.4),

ωn(a) ≤ ωn(Ma), (6.3.5)

where

ωn(a) = sup|t−s|≤a

|αn(t)− αn(s)|. (6.3.6)

is the oscillation modulus pertaining to a general empirical process. Condi-tion (6.3.4) is satisfied if F has a bounded Lebesgue density.

We may also let a = a_n tend to zero as n \to \infty. A rough upper bound for \omega_n(a_n) may be obtained if we set s = s_n = \sqrt{K \ln a_n^{-1}}, for a large K > 0.

Theorem 6.3.4. Assume

(i) \ln a_n^{-1} = o(n a_n)

and

(ii) \sum_{n \ge 1} a_n^r < \infty for some r > 0.

Then, with probability one,

\omega_n(a_n) = O\Big(\sqrt{a_n \ln a_n^{-1}}\Big).

Improvements of these results may be found in Stute (1982). Extensions to a general F satisfying (6.3.4) utilize (6.3.5).
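A small simulation sketch (not part of the text): it approximates \omega_n(a) for the uniform empirical process on a finite grid and compares its size with the rate \sqrt{a \ln a^{-1}} from Theorem 6.3.4. Grid resolution, sample size and a are arbitrary choices, and evaluating on a grid only approximates the supremum.

```python
import numpy as np

def oscillation_modulus(U, a, grid_size=2000):
    """Approximate w_n(a) = sup_{|t-s|<=a} |alpha_n(t) - alpha_n(s)| on a finite grid."""
    n = len(U)
    t = np.linspace(0.0, 1.0, grid_size)
    Fn = np.searchsorted(np.sort(U), t, side="right") / n
    alpha = np.sqrt(n) * (Fn - t)
    lag = max(1, int(a * (grid_size - 1)))           # grid indices within distance a
    return max(np.abs(alpha[k:] - alpha[:-k]).max() for k in range(1, lag + 1))

rng = np.random.default_rng(7)
n, a = 2000, 0.01
U = rng.uniform(size=n)
print("omega_n(a) ~", oscillation_modulus(U, a))
print("rate sqrt(a * ln(1/a)) =", np.sqrt(a * np.log(1 / a)))
```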


6.4 Exponential Bounds for Sums of Independent Random Variables

Let X_1, \ldots, X_n be independent random variables, not necessarily from the same distribution. Set

S := \sum_{i=1}^{n} X_i,    \bar S := S/n

and

\mu := E(\bar S) = n^{-1}\sum_{i=1}^{n} E(X_i).

Clearly, for each \varepsilon > 0 and any h > 0,

P(\bar S - \mu \ge \varepsilon) = P(S - E(S) \ge n\varepsilon) = P\big( h(S - E(S)) \ge hn\varepsilon \big) \le e^{-hn\varepsilon - hE(S)}\prod_{i=1}^{n} E e^{hX_i}.    (6.4.1)

Lemma 6.4.1. Suppose a \le X \le b, and let h \in R. Then

E e^{hX} \le \frac{b - E(X)}{b - a} e^{ha} + \frac{E(X) - a}{b - a} e^{hb}.

Proof. We have

e^{hX} = e^{h\frac{b-X}{b-a}a + h\frac{X-a}{b-a}b} \le \frac{b-X}{b-a} e^{ha} + \frac{X-a}{b-a} e^{hb},

by convexity. Integrate out.

Lemma 6.4.2. (Hoeffding). Assume that X_1, \ldots, X_n are independent with 0 \le X_i \le 1. We then have for 0 < \varepsilon < 1 - \mu:

P(\bar S - \mu \ge \varepsilon) \le \Big[ \Big(\frac{\mu}{\mu+\varepsilon}\Big)^{\mu+\varepsilon}\Big(\frac{1-\mu}{1-\mu-\varepsilon}\Big)^{1-\mu-\varepsilon} \Big]^n \le e^{-n\varepsilon^2 g(\mu)} \le e^{-2n\varepsilon^2},

where

g(\mu) = \frac{1}{1-2\mu}\ln\frac{1-\mu}{\mu}    for 0 \le \mu < 1/2,
g(\mu) = \frac{1}{2\mu(1-\mu)}                   for 1/2 \le \mu \le 1.

Proof. Apply Lemma 6.4.1 to get (with \mu_i \equiv E X_i)

E e^{hX_i} \le 1 - \mu_i + \mu_i e^h

and therefore

\prod_{i=1}^{n} E e^{hX_i} \le \prod_{i=1}^{n} (1 - \mu_i + \mu_i e^h).

Now use the fact that the geometric mean is always less than or equal to the arithmetic mean. It follows that

\prod_{i=1}^{n} (1 - \mu_i + \mu_i e^h) \le \Big[ \frac{1}{n}\sum_{i=1}^{n} (1 - \mu_i + \mu_i e^h) \Big]^n = [1 - \mu + \mu e^h]^n.

Plugging this into (6.4.1) we get

P(\bar S - \mu \ge \varepsilon) \le \big[ e^{-h\varepsilon - h\mu}(1 - \mu + \mu e^h) \big]^n.

Formal differentiation shows that the right-hand side is minimized for

h = \ln\frac{(1-\mu)(\mu+\varepsilon)}{(1-\mu-\varepsilon)\mu}.

Inserting this h we obtain the first inequality. For the second we note that the first inequality may be written as

P(\bar S - \mu \ge \varepsilon) \le e^{-n\varepsilon^2 G(\varepsilon,\mu)},

with

G(\varepsilon, \mu) = \frac{\mu+\varepsilon}{\varepsilon^2}\ln\Big(\frac{\mu+\varepsilon}{\mu}\Big) + \frac{1-\mu-\varepsilon}{\varepsilon^2}\ln\Big(\frac{1-\mu-\varepsilon}{1-\mu}\Big).

Now check

g(\mu) = \inf_{0 \le \varepsilon \le 1-\mu} G(\varepsilon, \mu).

Finally, g(\mu) \ge 2.

Hoeffdings’s inequality may be extended to the case, when the bounds varywith i.


Lemma 6.4.3. (Hoeffding). Assume that X_1, \ldots, X_n are independent with a_i \le X_i \le b_i. Then we have for each \varepsilon > 0:

P(\bar S - \mu \ge \varepsilon) \le \exp\Big[ -\frac{2n^2\varepsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2} \Big].

Proof. Omitted.
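A quick Monte Carlo sketch (not from the text) checking Hoeffding's bound of Lemma 6.4.3 for bounded, non-identically distributed summands; the specific distributions, sample size and level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps, eps = 50, 100000, 0.1
a, b = np.zeros(n), np.linspace(0.5, 2.0, n)          # X_i uniform on [a_i, b_i]

X = rng.uniform(a, b, size=(reps, n))
Sbar = X.mean(axis=1)
mu = ((a + b) / 2).mean()                             # mu = n^{-1} sum E(X_i)

empirical = np.mean(Sbar - mu >= eps)
hoeffding = np.exp(-2 * n**2 * eps**2 / np.sum((b - a) ** 2))
print("empirical tail:", empirical, "  Hoeffding bound:", hoeffding)
```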


Chapter 7

Invariance Principles

7.1 Continuity of Stochastic Processes

The functions F_n, F_n^{-1} and \alpha_n depend on the data X_1, \ldots, X_n and are thus random. Random functions are realizations of so-called Stochastic Processes. If, e.g., one studies \alpha_n at finitely many points t_1, \ldots, t_m, then the multivariate CLT may be used to approximate the distribution of (\alpha_n(t_1), \ldots, \alpha_n(t_m)). See Lemma 3.2.8. The Kolmogorov-Smirnov (K-S) statistic

D_n = \sup_{0 \le t \le 1} |F_n(t) - t|

is an example of a quantity depending on all F_n(t), 0 \le t \le 1, so that a reduction to a specific finite-dimensional subvector is not possible. In this chapter we shall discuss the problem of how to determine or approximate the distributions of a large class of functions of \alpha_n. First, we fix a general stochastic process S = (S_t)_{0 \le t \le 1}. Later on we shall specify S and discuss important examples appearing in statistics. That the parameter set equals the unit interval [0,1] is only for mathematical convenience. What is important is compactness. Along with S we consider the associated family of finite-dimensional distributions (fidi's), i.e., the distributions of subvectors

(S_{t_1}, \ldots, S_{t_m}),    where 0 \le t_1 < t_2 < \ldots < t_m \le 1

are arbitrary. From a measure theoretic argument one can see that the distribution of S, which is defined on the space of all functions, is uniquely determined through its fidi's. This means that if two processes have the same fidi's, then they give equal probabilities to events which are measurable


w.r.t. the \sigma-field generated by all finite-dimensional projections

f \mapsto (f(t_1), \ldots, f(t_m)),    0 \le t_1 < t_2 < \ldots < t_m \le 1.

For any function f defined on I = [0,1], the degree of smoothness may be measured through the oscillation modulus

\omega(f, \delta) = \sup_{|t_1 - t_2| \le \delta} |f(t_1) - f(t_2)|.

Uniform continuity of f is equivalent to

\lim_{\delta \downarrow 0} \omega(f, \delta) = 0,    (7.1.1)

while Lipschitz continuity

|f(t_1) - f(t_2)| \le C|t_1 - t_2|

yields

\omega(f, \delta) \le C\delta.

Since

\omega(f, \delta_1) \le \omega(f, \delta_2)    if \delta_1 \le \delta_2,

(7.1.1) is equivalent to

\lim_{n \to \infty} \omega(f, \delta_n) = 0    for any sequence \delta_n \downarrow 0.

Sample continuity of S(\omega) for almost all \omega \in \Omega is equivalent to

\lim_{n \to \infty} P(\omega(S, \delta_n) > \varepsilon) = 0    for all \varepsilon > 0.    (7.1.2)

Verification of (7.1.2) is not easy since the oscillation modulus incorporates all pairs |t_1 - t_2| \le \delta.

A well-known result due to A. N. Kolmogorov gives sufficient conditions through bivariate fidi's which ensure uniform continuity of S(\omega) with probability one, at least if one restricts S to the dyadic numbers T = \{k/2^n : 0 \le k \le 2^n, n \ge 1\}.

Theorem 7.1.1. (A. N. Kolmogorov). Let S = (S_t)_{0 \le t \le 1} be any stochastic process such that there exist K < \infty, a \ge 0 and b > 1 so that for any \varepsilon > 0

P(|S_t - S_{t'}| > \varepsilon) \le K\varepsilon^{-a}|t - t'|^b.    (7.1.3)

Then with probability one S(\omega) is uniformly continuous on T.


A sufficient condition for (7.1.3) is

E|S_t - S_{t'}|^a \le K|t - t'|^b.    (7.1.4)

In fact, by the Chebychev inequality, (7.1.4) implies

P(|S_t - S_{t'}| > \varepsilon) \le \varepsilon^{-a} E|S_t - S_{t'}|^a \le K\varepsilon^{-a}|t - t'|^b,

i.e., (7.1.3) holds.

The restriction to the dyadic numbers is not serious, since by putting

\tilde S_t := \lim_{t' \to t} S_{t'},

where t' \in T, we obtain a continuous version \tilde S which has the same fidi's as S.

7.2 Gaussian Processes

Lemma 3.2.8 asserts that the fidi's of \alpha_n have normal limits. Processes which have normal fidi's, not only in the limit, are of special importance since they are candidates for limit processes.

Definition 7.2.1. A stochastic process (S_t)_{0 \le t \le 1} is called a Gaussian Process if all fidi's are normal distributions.

Normal distributions are uniquely determined through

m(t) = E S_t,  the mean function,

and

K(s,t) = Cov(S_s, S_t),  the covariance function.

If m(t) = 0 for all t, (S_t)_t is called a centered Gaussian Process. Recall that K is symmetric:

K(s,t) = K(t,s)

and nonnegative definite: for any t_1, \ldots, t_k and \lambda_1, \ldots, \lambda_k \in R:

\sum_{i=1}^{k}\sum_{j=1}^{k} \lambda_i K(t_i, t_j)\lambda_j \ge 0.


It can be shown that for any K satisfying these two properties there exists a centered Gaussian process (S_t)_t such that

Cov(S_s, S_t) = K(s,t).

We now introduce the most famous Gaussian Process, the Brownian Motion. It belongs to

K(s,t) = \min(s,t).

Clearly, K is symmetric. For 0 \le t_1 < t_2 < \ldots < t_k let X_1, \ldots, X_k be independent normal random variables such that

X_1 \sim N(0, t_1),\ X_2 \sim N(0, t_2 - t_1),\ \ldots,\ X_k \sim N(0, t_k - t_{k-1}).

Put

S_1 = X_1,\ S_2 = X_1 + X_2,\ \ldots,\ S_k = X_1 + \ldots + X_k.

Then, for i < j, we have

\sigma_{ij} = E[S_i S_j] = E\Big[ S_i\Big( S_i + \sum_{k=i+1}^{j} X_k \Big) \Big] = E S_i^2 = t_i = \min(t_i, t_j).

Hence K(t_i, t_j) is a covariance function and therefore nonnegative definite.

Theorem 7.2.2. Let B = (B_t)_{0 \le t \le 1} be a Gaussian Process with

E B_t = 0    and    Cov(B_s, B_t) = \min(s,t).

Then we have:

(i) The process B exists.

(ii) B has continuous sample paths.

(iii) B_0 = 0 w.p. 1.

(iv) B has independent increments with B_t - B_s \sim N(0, t-s) for 0 \le s \le t \le 1.

(v) E[(B_t - B_s)^2] = t - s for 0 \le s \le t \le 1.


Proof. We have already seen that \min(s,t) is an admissible covariance function. Furthermore, m(t) \equiv 0. Hence B exists. Putting s = t, we get Var B_t = t. For t = 0, we have E B_0 = 0 = Var B_0 and therefore B_0 = 0 with probability one. The variable B_t - B_s is a linear function of the Gaussian random vector (B_s, B_t) and therefore Gaussian. Check that for 0 \le s \le t \le 1

E(B_t - B_s)^2 = E B_t^2 + E B_s^2 - 2 E B_t B_s = t + s - 2s = t - s.

For increments B_{t_1}, B_{t_2} - B_{t_1}, \ldots, B_{t_k} - B_{t_{k-1}} we get for i < j:

E\big[(B_{t_j} - B_{t_{j-1}})(B_{t_i} - B_{t_{i-1}})\big] = t_i - t_{i-1} - t_i + t_{i-1} = 0.

Hence the increments are uncorrelated. Since they are jointly Gaussian, they are also independent. It remains to prove continuity. For this we verify Kolmogorov's criterion: for s < t, B_t - B_s \sim N(0, t-s). Therefore,

E[B_t - B_s]^4 = (t-s)^2 E\xi^4,    \xi \sim N(0,1).

This completes the proof.

We already mentioned that we could define a Brownian Motion on any interval [0, T], and finally on [0, \infty).

7.3 Brownian Motion

The Brownian Motion is the most important Gaussian Process. Since B_0 = 0, it starts in zero. The process

\tilde B_t = x + B_t

is called a Brownian Motion starting in x. The process

\tilde B_t = \sigma B_t

is called a Brownian Motion with scale parameter \sigma > 0. For \sigma = 1 and x = 0, we call B a standard Brownian Motion. In many applications we come up with Brownian Motions which involve a transformation \Lambda of time:

\tilde B_t = B_{\Lambda(t)}.


This process is also a centered Gaussian process with covariance

Cov(\tilde B_s, \tilde B_t) = \min(\Lambda(s), \Lambda(t)).

If \Lambda is nondecreasing,

\min(\Lambda(s), \Lambda(t)) = \Lambda(s)    for s \le t.

From B we may obtain other Gaussian processes.

Lemma 7.3.1. Let \sigma > 0. Then

\tilde B_t = \sigma^{-1/2} B_{t\sigma},    t \ge 0,

is a Brownian Motion.

Proof. (\tilde B_t)_t is a centered Gaussian process with covariance

Cov(\tilde B_s, \tilde B_t) = \sigma^{-1} Cov(B_{s\sigma}, B_{t\sigma}) = \sigma^{-1}\sigma s = s    for 0 \le s \le t.

Lemma 7.3.2. Fix t_0 > 0. Then

\tilde B_t = B_{t_0+t} - B_{t_0},    t \ge 0,

is a Brownian Motion.

Proof. \tilde B is a centered Gaussian process with covariance

Cov(\tilde B_s, \tilde B_t) = E\big[(B_{t_0+s} - B_{t_0})(B_{t_0+t} - B_{t_0})\big] = t_0 + s - t_0 - t_0 + t_0 = s    for 0 \le s \le t.

The following process is of central importance in the theory of Empirical Processes.

Lemma 7.3.3. Let B = (B_t)_{0 \le t \le 1} be a Brownian Motion on the unit interval. Put

B_t^0 = B_t - t B_1,    0 \le t \le 1.

Then (B_t^0)_{0 \le t \le 1} is a centered Gaussian process with covariance

Cov(B_s^0, B_t^0) = s - st    for all 0 \le s \le t \le 1.


Proof.

Cov(B_s^0, B_t^0) = E[(B_s - sB_1)(B_t - tB_1)] = s - st - st + st = s - st.

Remark 7.3.4. Note that \alpha_n and B^0 are both centered and have the same covariance structure. Also

\alpha_n(0) = B_0^0 = 0 = \alpha_n(1) = B_1^0.

Therefore B^0 is called a Brownian Bridge.

7.4 Donsker’s Invariance Principles

In Section 3.4 we have derived several probabilities for events depending on the sample path of F_n. Most of these formulae are complicated and may be implemented only for a few small n. For other n, one may wonder if there is a limit as n \to \infty which may then serve as an approximation when no exact probabilities are available.

The most important example of an approximation through a limit distribution is the CLT: assume that X_1, \ldots, X_n are i.i.d. \sim F with E X_1^2 < \infty. Then

\lim_{n \to \infty} P\Big( n^{-1/2}\sum_{i=1}^{n}(X_i - EX_1) \le x\sigma \Big) = \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-y^2/2}\, dy,

where \sigma^2 = Var X_1.

In other words, if G_n denotes the d.f. of (n\sigma^2)^{-1/2}\sum_{i=1}^{n}(X_i - EX_1), then

\lim_{n \to \infty} G_n(x) = \Phi(x)    for all x \in R.

This convergence is called convergence in distribution or convergence in law. In other situations, the limit distribution need not be continuous. Then convergence is only required at those points at which the limit function is continuous:

\lim_{n \to \infty} G_n(x) = G(x)    for all x at which G is continuous.


The same concept applies to random vectors.

To discuss the distributional convergence of random processes like \alpha_n, as n \to \infty, one has to replace intervals or quadrants by sets A in the space of functions which carry no mass on their boundary.

Definition 7.4.1. Assume that (S_n(t))_{0 \le t \le 1} is a sequence of stochastic processes whose sample paths are in some metric space (X, d). Let (S(t))_{0 \le t \le 1} be another process. Then S_n converges in distribution to S:

S_n \to S in law,

if and only if

\lim_{n \to \infty} P(S_n \in A) = P(S \in A)    for all A \subset X such that P(S \in \partial A) = 0.

Here \partial A is the boundary of A.

A classical book on convergence of stochastic processes is Billingsley (1968).

In applications the most important examples for the space X are X = C(\Theta), the space of continuous functions over some (\sigma-)compact space \Theta.

So far we only studied \Theta = [0,1], but we may also consider higher-dimensional cubes or other subsets of R^d as well. The other important space of sample functions is the so-called Skorokhod space D[0,1], consisting of all right continuous functions with left-hand limits. The empirical d.f. and the empirical process have sample paths in this space. Billingsley (1968) provides a thorough discussion of how D[0,1] may be metrized. The following two results give sufficient conditions which guarantee convergence in distribution in C and D.

Theorem 7.4.2. Let (S_n(t))_{0 \le t \le 1} and (S(t))_{0 \le t \le 1} be stochastic processes in C[0,1] such that

(S_n(t_1), \ldots, S_n(t_k)) \to (S(t_1), \ldots, S(t_k)) in law, for all 0 < t_1 < \ldots < t_k < 1 (convergence of fidi's),    (7.4.1)

and

\lim_{\delta \downarrow 0}\limsup_{n \to \infty} P(\omega(S_n, \delta) > \varepsilon) = 0    for all \varepsilon > 0.    (7.4.2)


Then we have: S_n \to S in distribution.

In many applications the limit process will be Gaussian, and the convergence of the fidi's follows from an application of the multivariate CLT. One possibility to check (7.4.2) is to show that (7.1.3) holds uniformly in n, with the same K, a and b.

The following theorem constitutes the counterpart of Theorem 7.4.2 for the case D[0,1].

Theorem 7.4.3. Let (S_n(t))_{0 \le t \le 1} and (S(t))_{0 \le t \le 1} be stochastic processes in D[0,1] such that

S has continuous sample paths,    (7.4.3)

(S_n(t_1), \ldots, S_n(t_k)) \to (S(t_1), \ldots, S(t_k)) in law, for all 0 \le t_1 < t_2 < \ldots < t_k \le 1 (convergence of fidi's),    (7.4.4)

\lim_{\delta \downarrow 0}\limsup_{n \to \infty} P(\omega(S_n, \delta) > \varepsilon) = 0    for all \varepsilon > 0.    (7.4.5)

Then: S_n \to S in distribution.

The last theorem covers an important special case, namely that the limit S is continuous though for each n \ge 1 the process S_n may have jumps. This is exactly the case for the uniform empirical process \alpha_n. In view of Lemma 3.2.8 the candidate for the limit is the Brownian Bridge S = B^0. While this process has continuous sample paths, \alpha_n has n discontinuities. Note, however, that the jump size n^{-1/2} tends to zero so that, as n \to \infty,

• the number of jumps tends to infinity,

• the jump sizes tend to zero,

• the limit process is continuous.

In many cases the continuity of S is obtained through an application of Theorem 7.1.1, while the convergence of the fidi's will again follow from an application of the multivariate CLT. Verification of (7.4.5) typically requires more work.

In the literature the convergence of S_n to some limit S is often called an Invariance Principle. The reason for this becomes clear if one recalls the


classical CLT. Typically, the distribution F of the summands X_i is arbitrary and the distribution of the sums is complicated. The CLT then implies that in the limit this distribution is, up to the scale parameter \sigma, invariant w.r.t. F, i.e., does not depend on F. Similar things occur with the so-called functional limit theorems covered by Theorems 7.4.2 and 7.4.3.

The first invariance principles were due to Donsker (1952/53).

Theorem 7.4.4. (Donsker). Let X_1, \ldots, X_n be i.i.d. with E X_i = 0 and Var X_i = 1. Put

S_n(t) = \frac{1}{\sqrt{n}}\sum_{i=1}^{\langle nt\rangle} X_i,    0 \le t \le 1,

the so-called partial sum process. Then we have

S_n \to B in distribution,

where B is the Brownian Motion.

Theorem 7.4.5. (Donsker). Let U_1, \ldots, U_n be i.i.d. from U[0,1]. Denote with \alpha_n the uniform empirical process. Then

\alpha_n \to B^0 in distribution,

where B^0 is the Brownian Bridge.

Proof. The assertion follows from Theorem 7.4.3 upon applying Lemma 3.2.8 and Theorem 6.3.3.
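A small simulation sketch (not part of the text): it generates paths of the partial sum process and of the uniform empirical process and checks that their empirical covariances at two fixed time points approach min(s,t) and s(1-t), the covariances of the Brownian Motion and the Brownian Bridge. All numerical choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 500, 5000
s, t = 0.3, 0.7

# partial sum process S_n(t) = n^{-1/2} * sum_{i <= nt} X_i with E X = 0, Var X = 1
X = rng.standard_normal((reps, n))
S = np.cumsum(X, axis=1) / np.sqrt(n)
Ss, St = S[:, int(n * s) - 1], S[:, int(n * t) - 1]
print("cov of partial sums:", np.cov(Ss, St)[0, 1], "  min(s,t) =", min(s, t))

# uniform empirical process alpha_n(t) = sqrt(n) [F_n(t) - t]
U = rng.uniform(size=(reps, n))
alpha_s = np.sqrt(n) * ((U <= s).mean(axis=1) - s)
alpha_t = np.sqrt(n) * ((U <= t).mean(axis=1) - t)
print("cov of empirical process:", np.cov(alpha_s, alpha_t)[0, 1], "  s(1-t) =", s * (1 - t))
```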

There is an important technique which shows how useful invariance principles may be. It is based on the so-called Continuous Mapping Theorem (CMT).

Theorem 7.4.6. (CMT). Assume S_n \to S in distribution. Let T be any continuous functional. Then

T(S_n) \to T(S) in distribution.

We shall now demonstrate how Donsker's invariance principle for empirical processes in connection with the CMT may be applied to easily get the limit distribution of various statistics of interest.


The one- and two-sided Smirnov and Kolmogorov-Smirnov statistics have been defined as

D_n^+ = \sup_{t \in R}[F_n(t) - F(t)],    D_n^- = \sup_{t \in R}[F(t) - F_n(t)]

and

D_n = \sup_{t \in R}|F_n(t) - F(t)|.

By the Glivenko-Cantelli Theorem, all three quantities converge to zero and are therefore degenerate in the limit. To obtain non-degenerate limits one needs to redefine them accordingly:

D_n^+ = n^{1/2}\sup_{t \in R}[F_n(t) - F(t)],    D_n^- = n^{1/2}\sup_{t \in R}[F(t) - F_n(t)]

and

D_n = n^{1/2}\sup_{t \in R}|F_n(t) - F(t)|.

Under continuity of F we have, e.g.,

D_n = n^{1/2}\sup_{0 \le t \le 1}|\bar F_n(t) - t| = \sup_{0 \le t \le 1}|\alpha_n(t)|,

where \bar F_n denotes the empirical d.f. of the uniform-transformed data.

This is a continuous functional of \alpha_n. Therefore, Donsker and the CMT imply

D_n \to \sup_{0 \le t \le 1}|B^0(t)| in distribution.    (7.4.6)

Similarly,

D_n^+ \to \sup_{0 \le t \le 1} B^0(t),    D_n^- \to \sup_{0 \le t \le 1} B^0(t) in distribution.    (7.4.7)

We see that three convergence results easily follow from one basic result: the invariance principle for the underlying process \alpha_n.

Another example is the Cramer-von Mises statistic

CvM = n\int [F_n(t) - F(t)]^2\, F(dt),

which under continuity of F becomes

CvM = n\int_{0}^{1}[\bar F_n(t) - t]^2\, dt = \int_{0}^{1}\alpha_n^2(t)\, dt.


By Donsker and the CMT we get

CvM \to \int_{0}^{1}[B^0(t)]^2\, dt in distribution.

The distributions of the limits in (7.4.6) and (7.4.7) are known and tabulated.

Lemma 7.4.7. For x \ge 0 we have

P\Big( \sup_{0 \le t \le 1} B^0(t) \ge x \Big) = \exp[-2x^2],

P\Big( \sup_{0 \le t \le 1} |B^0(t)| \ge x \Big) = 2\sum_{k=1}^{\infty}(-1)^{k+1}\exp[-2k^2x^2] \equiv K(x).

Hence, e.g., from (7.4.6) we get

\lim_{n \to \infty} P(D_n \ge x) = K(x).    (7.4.8)

Equation (7.4.8) is the basis for the classical Kolmogorov-Smirnov Goodness-of-Fit Test. The null hypothesis to be tested is

H_0 : F = F_0,

where F_0 is a fully specified distribution (simple hypothesis). Consider

D_n := n^{1/2}\sup_{t \in R}|F_n(t) - F_0(t)|.

For a given significance level 0 < \alpha < 1, choose c > 0 so that K(c) = \alpha. Reject H_0 iff D_n \ge c. From (7.4.8) we get that asymptotically the error of the first kind equals \alpha.

The same procedure works for CvM. In this case the critical value c needs to be taken from the distribution of \int_0^1[B^0(t)]^2\, dt, which is tabulated.
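A small sketch (not from the text) of the resulting test: it computes D_n for a simple hypothesis F_0 and evaluates the limiting tail probability K(x) by truncating the alternating series of Lemma 7.4.7. The choice of F_0, the data, and the truncation point are arbitrary; scipy is only used to supply the example F_0.

```python
import numpy as np
from scipy.stats import norm   # only needed for the example F_0 below

def K_tail(x, terms=100):
    """Limiting tail P(sup |B0| >= x) = 2 sum_{k>=1} (-1)^{k+1} exp(-2 k^2 x^2)."""
    k = np.arange(1, terms + 1)
    return float(2 * np.sum((-1.0) ** (k + 1) * np.exp(-2 * k**2 * x**2)))

def ks_statistic(X, F0):
    """D_n = sqrt(n) sup_t |F_n(t) - F0(t)| and its asymptotic p-value K(D_n)."""
    u = np.sort(F0(np.asarray(X)))                   # reduce to the uniform case
    n = len(u)
    i = np.arange(1, n + 1)
    Dn = np.sqrt(n) * max(np.max(i / n - u), np.max(u - (i - 1) / n))
    return Dn, K_tail(Dn)

X = np.random.default_rng(10).normal(size=200)
print(ks_statistic(X, norm.cdf))                     # H0: F = N(0,1); a large p-value is expected
```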

7.5 More Invariance Principles

The next invariance principle covers a two-sample situation. Let X_1, \ldots, X_n and Y_1, \ldots, Y_m be two independent samples with

X_1 \sim F    and    Y_1 \sim G    (continuous),


both unknown. The null hypothesis to be tested is

H_0 : F = G.

Denote with F_n and G_m the associated empirical d.f.'s. The test statistic of K-S type equals

D_{nm} = \sqrt{\cdot}\,\sup_{t \in R}|F_n(t) - G_m(t)|,

where at this moment we leave it open how the standardizing factor looks. To determine critical values one needs to know or at least approximate the distribution of D_{nm} under H_0. Now, under H_0,

D_{nm} = \sqrt{\cdot}\,\sup_{0 \le t \le 1}|\bar F_n(t) - \bar G_m(t)|.

Put

\alpha_n(t) = n^{1/2}[\bar F_n(t) - t]    and    \beta_m(t) = m^{1/2}[\bar G_m(t) - t],

and set N = n + m, the sample size of the pooled sample.

With \lambda_N = n/N, we get

\sqrt{N}[\bar F_n(t) - \bar G_m(t)] = \sqrt{N}\big[ n^{-1/2}\alpha_n(t) - m^{-1/2}\beta_m(t) \big] = \lambda_N^{-1/2}\alpha_n(t) - (1-\lambda_N)^{-1/2}\beta_m(t).

Assuming thatλN → λ for some 0 < λ < 1,

we get from Donsker that

√N [Fn − Gm] → λ−1/2B0 − (1− λ)−1/2B1 ≡ B,

where B0 and B1 are two independent Brownian Bridges. The process B,as a linear combination of two independent Gaussian processes, is also aGaussian process with expectation zero and covariance function

K(s, t) = E[B(s)B(t)] = λ−1s(1− t) + (1− λ)−1s(1− t)

=

[1

λ+

1

1− λ

]s(1− t) =

1

λ(1− λ)s(1− t).

This gives us a clue how to choose the standardizing factor in the definition of Dnm. Put

γN = √(λN(1 − λN)N) [F̄n − Ḡm] = √(nm/N) [F̄n − Ḡm].

Then γN → B in distribution, where B now denotes a Brownian Bridge. Hence H0 is rejected at level α iff

Dnm = √(nm/N) sup_{t∈R} |Fn(t) − Gm(t)| ≥ c,

where c is chosen so that K(c) = α, with K as in Lemma 7.4.7.
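A minimal sketch of the resulting two-sample test, assuming the standardizing factor √(nm/N) derived above; the implementation details are our own:

    import numpy as np

    def two_sample_ks(x, y):
        # sqrt(nm/N) * sup_t |Fn(t) - Gm(t)|, evaluated over the pooled sample points
        x, y = np.sort(x), np.sort(y)
        n, m = len(x), len(y)
        pooled = np.concatenate([x, y])
        Fn = np.searchsorted(x, pooled, side="right") / n
        Gm = np.searchsorted(y, pooled, side="right") / m
        return np.sqrt(n * m / (n + m)) * np.max(np.abs(Fn - Gm))

    # Reject H0: F = G iff the statistic exceeds c with K(c) = alpha
    # (K as in Lemma 7.4.7); c is roughly 1.36 for alpha = 0.05.
    rng = np.random.default_rng(1)
    print(two_sample_ks(rng.normal(size=100), rng.normal(size=120)))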

In our next example we use empirical process theory to design tests for symmetry of F at zero:

H0 : F(x) + F(−x) = 1 for all x ∈ R.   (7.5.1)

We may expect that under H0

Fn(x) + Fn(−x) ∼ 1 for all x ∈ R.

Therefore we study the following variant of an empirical process:

γn(x) = n^{1/2}[Fn(x) − 1 + Fn(−x)], x ∈ R.

Under (7.5.1) we obtain

γn(x) = n^{1/2}[F̄n(F(x)) − 1 + F̄n(F(−x))]
      = n^{1/2}[F̄n(F(x)) − 1 + F̄n(1 − F(x))].

Putting t = F(x), it remains to study

γn(t) = n^{1/2}[F̄n(t) − 1 + F̄n(1 − t)] = αn(t) + αn(1 − t), 0 ≤ t ≤ 1.

From Donsker we get

γn →_L B,

with

B(t) = B0(t) + B0(1 − t).

Our next aim will be to determine the distribution of sup_{0≤t≤1} |B(t)|. For this, put

R(t) = B0((1 + t)/2) + B0((1 − t)/2), −1 ≤ t ≤ 1.

Note that

B(t) = R(2t − 1), 0 ≤ t ≤ 1.

The process

Z(t) = R(1 − t), 0 ≤ t ≤ 1,

is a Brownian Motion on [0, 1]. Actually Z is a zero-mean Gaussian process with covariance, if 0 ≤ s ≤ t ≤ 1,

E[Z(s)Z(t)] = E[R(1 − s)R(1 − t)]
 = E[B0(1 − s/2) B0(1 − t/2)] + E[B0(1 − s/2) B0(t/2)]
 + E[B0(s/2) B0(1 − t/2)] + E[B0(s/2) B0(t/2)]
 = (1 − t/2) − (1 − s/2)(1 − t/2) + t/2 − (1 − s/2)(t/2)
 + s/2 − (s/2)(1 − t/2) + s/2 − (s/2)(t/2)
 = s.

Since

{γn(t) : 0 ≤ t ≤ 1} →_L {R(2t − 1) : 0 ≤ t ≤ 1},

the CMT implies

sup_{0≤t≤1} |γn(t)| →_L sup_{0≤t≤1} |R(2t − 1)|.

Now, R(t) = R(−t), whence

sup_{0≤t≤1} |R(2t − 1)| = sup_{0≤t≤1} |R(t)| = sup_{0≤t≤1} |Z(t)|.

Thus we have obtained

Theorem 7.5.1. Under H0 (provided F is continuous):

sup_{x∈R} |γn(x)| →_L sup_{0≤t≤1} |Z(t)|,

where Z is a Brownian Motion.
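A small sketch of the symmetry statistic sup_x |γn(x)|; evaluating the step function at the points ±Xi, both from the left and from the right, is our own implementation choice:

    import numpy as np

    def symmetry_statistic(x):
        # n^{1/2} * sup_x |Fn(x) - 1 + Fn(-x)|, the statistic of Theorem 7.5.1
        xs = np.sort(np.asarray(x, dtype=float))
        n = len(xs)
        grid = np.concatenate([xs, -xs])               # the step function only jumps at +/- X_i
        Fr = np.searchsorted(xs, grid, side="right") / n   # Fn(x) at the candidate points
        Gl = np.searchsorted(xs, -grid, side="left") / n   # P_n(X < -x): value just right of x
        Gr = np.searchsorted(xs, -grid, side="right") / n  # P_n(X <= -x): value at x
        vals = np.concatenate([Fr - 1.0 + Gl, Fr - 1.0 + Gr])
        return np.sqrt(n) * np.max(np.abs(vals))

    rng = np.random.default_rng(2)
    print(symmetry_statistic(rng.normal(size=500)))        # symmetric: moderate values
    print(symmetry_statistic(rng.exponential(size=500)))   # asymmetric: large values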


7.6 Parameter Empirical Processes (Regression)

So far invariance principles were discussed for processes where the parameter set was identical with the scale of the observations X1, . . . , Xn. In this section we study a typical situation in parametric statistics. The example is taken from regression. The observations (Xi, Yi), 1 ≤ i ≤ n, are independent from the same unknown d.f. F. The X's take their values in R^d, while the Y's are real-valued. If E|Yi| < ∞, each Yi may be decomposed into

Yi = m(Xi) + εi,

where E[εi|Xi] = 0. The function m is the regression function of Y w.r.t. X and is also unknown. Very often one assumes that m belongs to a parametric family M = {m(·, β) : β ∈ Θ ⊂ R^p} of functions, i.e., one assumes that the true m satisfies

m(x) = m(x, β0) for some β0 ∈ Θ and all x ∈ R^d.   (7.6.1)

The best studied case is the Linear Model, in which case

m(x, β) = <x, β> = ∑_{i=1}^d xi βi.

If necessary, one may enlarge x by a component 1, leading to an additional intercept parameter β0:

m(x, β) = β0 + ∑_{i=1}^d xi βi.

Another class of popular models are the so-called Generalized Linear Models, in which the dependence on β is nonlinear. See the book of Fahrmeir and Tutz.

The classical estimator of β0 is the Least Squares Estimator (LSE)

βn = argmin_β n^{−1} ∑_{i=1}^n [Yi − m(Xi, β)]².

In this section we show how so-called Parameter Empirical Processes may be used to analyze βn. Of course, the model assumption (7.6.1) need not be true at all, so that we have to study the behavior of βn whether m ∈ M is true or not.

In the first step we want to find out whether βn converges to some β̄0. This limit should be "the true parameter" β0 under H0 : m ∈ M. If the model is not correctly specified, βn still makes sense, but there exists no such β0. Summarizing, we may expect that βn approaches some β̄0, as n → ∞, which coincides with β0 when the model is true.

In the second step we need to analyze the distributional convergence of n^{1/2}(βn − β̄0) a little closer. Among other things, we will come up with a so-called i.i.d. representation.

Before we analyze βn, we fix β. Then, by the SLLN,

n^{−1} ∑_{i=1}^n [Yi − m(Xi, β)]² → E[Y − m(X, β)]²
 = EY² − 2 ∫ m(x) m(x, β) F0(dx) + ∫ m²(x, β) F0(dx),   (7.6.2)

where F0 is the (marginal) d.f. of X. Expression (7.6.2) equals

EY² − ∫ m²(x) F0(dx) + ∫ [m(x) − m(x, β)]² F0(dx).   (7.6.3)

Since EY² − ∫ m²(x) F0(dx) does not depend on β, (7.6.2) and (7.6.3) only differ by a constant. Hence they have the same minimizer, say β̄0. In other words,

β̄0 = argmin_β ∫ [m(x) − m(x, β)]² F0(dx).

If m(x) = m(x, β0), then clearly β̄0 = β0.

As regularity assumptions we require that, as a function of β, the function m(x, ·) is twice continuously differentiable. Put

m′(x, β) = ∂m(x, β)/∂β

and

∂²m(x, β)/(∂βi ∂βj) ≡ mij(x, β) ≡ (M(x, β))ij, 1 ≤ i, j ≤ p.   (7.6.4)

As we shall see, second order differentiability is indispensable for the understanding of βn if H0 is not satisfied. All vectors β will be column vectors, and β^T will be their transpose. Recall β̄0. A crucial role in our analysis will be played by the p × p matrix

A = ∫ m′(x, β̄0)[m′(x, β̄0)]^T F0(dx) + ∫ [m(x, β̄0) − m(x)] M(x, β̄0) F0(dx).

The second integral vanishes in two cases:

• when H0 is satisfied, so that β̄0 = β0 and m(x, β̄0) = m(x);

• when M is a linear model, so that M ≡ 0 is the null matrix.

The first matrix is well known and corresponds to the matrix "X X^T" of the linear model for finite samples. Throughout this section we assume that

A is nonsingular.   (7.6.5)

Our result provides a representation of βn which is valid under both H0 : m ∈ M and H1 : m ∉ M.

Theorem 7.6.1. Under (7.6.4) and (7.6.5), assume that β̄0 is unique and lies in the interior of Θ. Then

lim_{n→∞} βn = β̄0 with probability one

and

n^{1/2}(βn − β̄0) = n^{−1/2} A^{−1} ∑_{i=1}^n (Yi − m(Xi, β̄0)) m′(Xi, β̄0) + oP(1).   (7.6.6)

The summands on the right-hand side are centered with, under independence of Xi and εi, covariance matrix

Σ = Var(ε) E[m′(X, β̄0)[m′(X, β̄0)]^T]   (7.6.7)
  + E[(m(X) − m(X, β̄0))² m′(X, β̄0)[m′(X, β̄0)]^T].   (7.6.8)

The second term vanishes under H0.

Equation (7.6.6) immediately yields the asymptotic (multivariate) normality of n^{1/2}(βn − β̄0):

n^{1/2}(βn − β̄0) → N(0, A^{−1} Σ A^{−1}) in distribution.


Under H0,

A^{−1} Σ A^{−1} = Var(ε) A^{−1},

a well known fact.
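The following sketch illustrates the sandwich formula A^{−1} Σ A^{−1} for a simple nonlinear model. The model m(x, β) = exp(βx), the crude grid minimizer and all names are our own choices; this is only an illustration of the theorem, not the text's method.

    import numpy as np

    def m(x, b):        return np.exp(b * x)
    def m_prime(x, b):  return x * np.exp(b * x)     # derivative w.r.t. the scalar parameter

    def fit_lse(x, y, grid):
        # crude grid minimizer of n^{-1} sum [Y_i - m(X_i, b)]^2
        sse = [np.mean((y - m(x, b)) ** 2) for b in grid]
        return grid[int(np.argmin(sse))]

    rng = np.random.default_rng(3)
    n = 2000
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.exp(0.5 * x) + 0.2 * rng.normal(size=n)   # model holds with beta0 = 0.5

    b_hat = fit_lse(x, y, np.linspace(0.0, 1.0, 2001))
    res = y - m(x, b_hat)
    g = m_prime(x, b_hat)
    A_hat = np.mean(g * g)                           # empirical version of A (p = 1)
    Sigma_hat = np.mean((res * g) ** 2)              # empirical version of Sigma
    var_hat = Sigma_hat / A_hat**2                   # sandwich A^{-1} Sigma A^{-1}
    print(b_hat, np.sqrt(var_hat / n))               # estimate and its standard error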

Proof of Theorem 7.6.1. Since β̄0 is in the interior of Θ, we may tacitly assume that βn takes its value there as well. Then we have the normal equations

0 = ∑_{i=1}^n (Yi − m(Xi, βn)) m′(Xi, βn),

from which we get

n^{−1/2} ∑_{i=1}^n (Yi − m(Xi)) m′(Xi, βn) = n^{−1/2} ∑_{i=1}^n (m(Xi, βn) − m(Xi)) m′(Xi, βn).   (7.6.9)

Define the process

αn(β) = n^{−1/2} ∑_{i=1}^n (Yi − m(Xi)) m′(Xi, β).

Standard arguments show that the αn are uniformly continuous in C(Θ), the space of (vector-valued) continuous functions on Θ, endowed with the topology of uniform convergence on compacta. Since βn → β̄0, we get

n^{−1/2} ∑_{i=1}^n (Yi − m(Xi)) m′(Xi, βn) = αn(βn) = αn(β̄0) + oP(1).

The right-hand side of (7.6.9) may be expanded into

n^{−1/2} ∑_{i=1}^n (m(Xi, βn) − m(Xi, β̄0)) m′(Xi, βn) + n^{−1/2} ∑_{i=1}^n (m(Xi, β̄0) − m(Xi)) m′(Xi, βn)

= n^{−1/2} ∑_{i=1}^n m′(Xi, βn)[m′(Xi, β̄n)]^T (βn − β̄0)
+ n^{−1/2} ∑_{i=1}^n (m(Xi, β̄0) − m(Xi)) m′(Xi, β̄0)
+ n^{−1/2} ∑_{i=1}^n (m(Xi, β̄0) − m(Xi)) M(Xi, β̃n) (βn − β̄0),

for appropriate β̄n, β̃n tending to β̄0 along with βn. By the assumed continuity of M as a function of β and a uniform version of the SLLN, equation (7.6.9) may be rewritten as

αn(β̄0) + oP(1) = n^{−1/2} ∑_{i=1}^n (m(Xi, β̄0) − m(Xi)) m′(Xi, β̄0) + [A + oP(1)] n^{1/2}(βn − β̄0).

Note that because β̄0 is the minimizer of ∫ [m(x) − m(x, β)]² F0(dx) lying in the interior of Θ, the variables

[m(Xi, β̄0) − m(Xi)] m′(Xi, β̄0), 1 ≤ i ≤ n,

have expectation zero. Conclude that, up to an oP(1) term,

n^{1/2}(βn − β̄0) = n^{−1/2} A^{−1} ∑_{i=1}^n [Yi − m(Xi) − m(Xi, β̄0) + m(Xi)] m′(Xi, β̄0)
                 = n^{−1/2} A^{−1} ∑_{i=1}^n [Yi − m(Xi, β̄0)] m′(Xi, β̄0),

a sum of standardized centered independent random variables. For the computation of Σ, note that εi = Yi − m(Xi) is orthogonal to the σ-field generated by Xi. Hence the covariance is the sum of the two covariances Cov[(Y − m(X)) m′(X, β̄0)] and Cov[(m(X) − m(X, β̄0)) m′(X, β̄0)]. Finally, use the independence of ε and X to come up with (7.6.7).

The result and its proof require some comments:

• The LSE is an example of an estimator which is defined as the minimizer of the data-dependent function

β → n^{−1} ∑_{i=1}^n [Yi − m(Xi, β)]².   (7.6.10)

• Only in very special cases (like the Linear Model) is it possible to get an explicit representation of βn. Generally this is not possible. To analyze its distributional behavior, the i.i.d. representation provides a powerful tool to determine the limit distribution of n^{1/2}(βn − β̄0).


• Our discussion exhibits, for the first time, that an assumed model may be incorrectly specified. Therefore βn needs to be studied under both m ∈ M and m ∉ M.

• To derive the i.i.d. representation we use arguments which will appear, in modified form, also in other situations. First, regularity (or smoothness) of the model is important. Taylor's expansion is important for a local linearization at β̄0.

• This β̄0 is the minimizer of the squared distance

D²(β) = ∫ [m(x) − m(x, β)]² F0(dx)

between m and M. In this way m(·, β̄0) is the projection of m onto M. The function (7.6.10) is, up to the constant EY² − ∫ m²(x) F0(dx), the empirical analog of D²(β).

• A crucial role in our arguments is played by the process αn(β). A continuity argument is used to get

αn(βn) = αn(β̄0) + oP(1).

The process αn is an example of a Parameter Empirical Process. The summands (Yi − m(Xi)) m′(Xi, β) of this process are independent and centered (which is always important!):

E[(Yi − m(Xi)) m′(Xi, β)] = E[εi m′(Xi, β)] = E[m′(Xi, β) E[εi|Xi]] = 0.

• The terms Yi − m(Xi, βn), 1 ≤ i ≤ n, are the fitted errors. They are called residuals.


Chapter 8

Empirical Measures: A Dynamic Approach

8.1 The Single-Event Process

In many applied fields researchers are interested in the time X elapsed until a certain event occurs. To give only a few examples, X could be

• the survival time of a patient, i.e., the time from surgery to death

• the disease-free survival time, e.g., the time elapsed from a first chemotherapy until reappearance of metastases

• the appearance of the first customer in a service station

• the time until a bond defaults

• the breakdown of a technical unit

Since X very often is a random variable, the associated process

St = 1{X≤t}, t ∈ R,

is a random process jumping from zero to one at t = X. As in Chapter 2 we call (St)t the Single-Event Process. This process may be extended in many ways. Nevertheless it constitutes the cornerstone of many other processes appearing in applied fields and statistics. Each St is a Bernoulli 0-1 random variable with expectation

E(St) = P(X ≤ t) ≡ F(t),

the (unknown) distribution function (d.f.) of X. To better understand the following, assume that X is nonnegative, like any lifetime variable, and imagine that we start at t = 0. Since we are no prophets we are not able to precisely forecast the value of X. It may, however, be the case that at time t we have some information helpful to predict X. In the case of a patient, this could be some covariate information about his/her health status. In the simplest case, when there are no covariates, at least the fact that X > s for all s ≤ t means that "the event" has not occurred before t. The event history of the process is reflected through a σ-algebra of events which are observable up to t, say Ft, and which may be helpful to predict the future development of the process. In the simplest case

Ft = σ(1{X≤s} : s ≤ t).

Note that Ft ↑ as t increases. The value of St is known only at time t, but not necessarily before. In technical terminology, this means that (St)t is adapted to the filtration (Ft)t.

To discuss the following, we first restrict ourselves to finitely many t's, say 0 = t0 < t1 < t2 < . . . < tk. Put Si = Sti for short. At time ti, only S0, S1, . . . , Si are known, but not necessarily Si+1, Si+2, . . . , Sk. The main question we are concerned with now is how to predict future values of Sj given the information at t = ti. Clearly, for a variable Ai+1 to be a predictor of Si+1 at ti means, e.g., that this Ai+1 needs to be known and computable at ti. Summarizing, we have come up with two important concepts, which can and will be discussed also in a context more general than the Single-Event Process and which are therefore formulated in full generality.

Definition 8.1.1. Let (Fi)0≤i≤k be some increasing filtration, and let (Si)0≤i≤k and (Ai)0≤i≤k be two sequences of random variables. Then we call

• (Si)i adapted to (Fi)i iff Si is Fi-measurable

• (Ai)i predictable w.r.t. (Fi)i iff Ai+1 is Fi-measurable

As we know from our experience in everyday life, many quantities are not predictable and therefore can be predicted only with error. Relevant issues of this scenario will be discussed in the next section.


8.2 Martingales and the Doob-Meyer Decomposition

We know from probability theory that if S is an unknown (square-integrable) variable and the information is given through events forming a σ-algebra F, then the optimal predictor A for S minimizing the squared prediction error E[(S − A)²] based on F equals the conditional expectation A = E(S|F).

Coming back to a sequence of observations S0, S1, S2, . . . , Sk adapted to F0 ⊂ F1 ⊂ . . . ⊂ Fk, one may wish to predict, at each ti, the value of Si+1 on the basis of Fi. A possible candidate for Ai+1 is Si itself. This situation describes one of the best studied cases and deserves a special name.

Definition 8.2.1. Let (Si)0≤i≤k be a sequence of (integrable) random variables adapted to an increasing filtration (Fi)0≤i≤k. Then (Si)i is called a Martingale iff

E(Si+1|Fi) = Si P-almost surely

for all i = 0, . . . , k − 1.

Being a martingale thus means that the best predictor of Si+1 is just the previous value Si.

Of course it would be very naive to believe that martingales are all around and that the last observation would always be an appropriate (and simple) choice for a predictor. Though, as the next important result will show us, we are at least very close to such a situation.

Lemma 8.2.2. (Doob-Meyer). Let (Si)0≤i≤k be adapted to (Fi)0≤i≤k. Then, for a given initial value a, there is a unique decomposition (almost surely) of Si into

Si = Ai + Mi, i = 0, 1, . . . , k,

such that

• (Mi)i is a martingale

• (Ai)i is predictable

• A0 = a


Proof. The proof is interesting because it explicitly shows how to compute Mi and Ai. Set, by recursion,

Ai = a for i = 0,  Ai = Ai−1 − Si−1 + E(Si|Fi−1) for i ≥ 1,

Mi = S0 − a for i = 0,  Mi = Mi−1 + Si − E(Si|Fi−1) for i ≥ 1.

Clearly, by induction,

Si = Ai + Mi, i = 0, 1, . . .

Also, (Ai)i is predictable with A0 = a. That (Mi)i is a martingale is readily checked. To show uniqueness, let Si = A′i + M′i be another such decomposition. Conclude that

Ai − A′i = M′i − Mi,

whence

E[Ai − A′i|Fi−1] = E[M′i − Mi|Fi−1].

Since Ai and A′i are measurable w.r.t. Fi−1 and Mi and M′i are martingales, the last equation yields

Ai − A′i = M′i−1 − Mi−1 = Ai−1 − A′i−1.

By induction it follows that

Ai − A′i = A0 − A′0 = a − a = 0

and, finally, Mi = M′i. This concludes the proof of the lemma.

In the literature, the process (Ai)i is called the Compensator, while, from time to time, (Mi)i is called the Innovation Martingale.

To interpret the Doob-Meyer Decomposition, note that a martingale is trend-free. The compensator Ai is already known at time ti−1. If the original sequence (Si)i includes a trend, this is compensated by (Ai)i.

We would also like to mention a well-known decomposition of an output random variable Y in terms of an input X and noise ε:

Y = m(X) + ε.   (8.2.1)

Here, m(X) = E(Y|X) and m is the regression function of Y w.r.t. X. The variable ε is the so-called noise variable and satisfies E(ε|X) = 0 P-almost surely. Also (8.2.1) may be interpreted as a decomposition into a predictable part, namely m(X), and some trend-free component ε. The difference between (8.2.1) and the Doob-Meyer decomposition is that in (8.2.1) only one pair (X, Y) of random variables is considered, while in Lemma 8.2.2 we consider a sequence of random variables over time. At the same time the σ-algebras Fi containing the information at time ti may be quite arbitrary. Of course, the decomposition heavily depends on the choice of Fi and will change when we replace the filtration by another one.

8.3 The Doob-Meyer Decomposition of the Single-Event Process

For a better understanding, we first consider St = 1{X≤t} at finitely many t0 ≤ t1 ≤ . . . ≤ tk, with

Fi = σ(1{X≤tj} : 0 ≤ j ≤ i).

Put Si ≡ Sti, and recall F, the d.f. of X.

Lemma 8.3.1. For i = 1, . . . , k, we have

E[Si|Fi−1] = 1{X≤ti−1} + 1{X>ti−1} (F(ti) − F(ti−1)) / (1 − F(ti−1)).

Proof. Write

Si = 1{X≤ti} = 1{X≤ti}1{X≤ti−1} + 1{X≤ti}1{X>ti−1} = 1{X≤ti−1} + 1{ti−1<X≤ti}.

The first summand is measurable w.r.t. Fi−1. For the second summand we get

E[1{ti−1<X≤ti}|Fi−1] = 1{ti−1<X} ∫_{ti−1<X} 1{ti−1<X≤ti} dP / P(ti−1 < X)
                     = 1{X>ti−1} (F(ti) − F(ti−1)) / (1 − F(ti−1)).

The conclusion follows.

Now recall the proof of the Doob-Meyer Decomposition. Put t0 = −∞, so that S0 ≡ 0. Let a = 0. The martingale part of the Single-Event Process then satisfies the recursion

Mi = Mi−1 + Si − E(Si|Fi−1)
   = Mi−1 + 1{ti−1<X≤ti} − 1{X>ti−1} (F(ti) − F(ti−1)) / (1 − F(ti−1)).

Summation from i = 1 to i = k leads to

Mk = 1{X≤tk} − ∑_{i=1}^k 1{X>ti−1} (F(ti) − F(ti−1)) / (1 − F(ti−1)).

In other words, the compensator equals

Ak = ∑_{i=1}^k 1{X>ti−1} (F(ti) − F(ti−1)) / (1 − F(ti−1)).

From this we may get an idea how M and A look when all processes are considered in continuous time. Just let the grid {ti} of points get finer and finer. If we keep tk = t fixed, then Ak converges to

At = ∫_{(−∞,t]} 1{x≤X} / (1 − F(x−)) F(dx).

Hence we obtain the following result.

Theorem 8.3.2. Let X be a real random variable with d.f. F. Put St = 1{X≤t} and set Ft = σ(1{X≤s} : s ≤ t). Then the innovation martingale of (St)t equals

Mt = 1{X≤t} − ∫_{(−∞,t]} 1{x≤X} / (1 − F(x−)) F(dx).

Remark 8.3.3. Note that (St)t and (At)t are nondecreasing. The martingale (Mt)t is trend-free and satisfies

E(Mt) = 0 for all t ∈ R.


Next, it will be convenient to introduce a new measure:

dΛ = dF / (1 − F−),

i.e.,

Λ(t) = ∫_{(−∞,t]} F(dx) / (1 − F(x−)).

As in Chapter 1, the function Λ is called the Cumulative Hazard Function of F. Also recall that, if F is continuous, then

Λ(t) = ∫_{(−∞,t]} F(dx) / (1 − F(x)) = − ln(1 − F(t)),

whence

F̄(t) ≡ 1 − F(t) = exp(−Λ(t)).   (8.3.1)

The function F̄ is called the Survival Function.

For a continuous F, we have Λ(t) ↑ ∞ as t ↑ ∞. Hence, in this case, Λ is not finite. Note that when F is discrete with a finite number of atoms, the associated Λ is finite. If F has density f, then

dΛ = f / (1 − F) dx ≡ λ dx.

The function

λ(x) = f(x) / (1 − F(x)), x ∈ R,

is called the Hazard Function. The compensator becomes

∫_{(−∞,t]} 1{X≥x} λ(x) dx.

Hence A has a Lebesgue density, which equals λ(x) on {x ≤ X} and vanishes afterwards. In most textbooks on Survival Analysis, the function λ(x) is introduced as

λ(x) = lim_{∆x↓0} P(x < X ≤ x + ∆x | x < X) / ∆x.

Hence, for small ∆x,

P(x < X ≤ x + ∆x | x < X) ∼ λ(x) ∆x.


The quantity λ(x) therefore describes the current speed or intensity at which, e.g., a patient approaches death, given that he/she has survived x. Therefore λ(x) is an important parameter for describing the current risk status of the patient. Since after the occurrence of X the Single-Event Process does not allow for more events, it is only natural that in the compensator the density jumps down to zero immediately after X.

Example 8.3.4. Assume that X has a uniform distribution on [0, 1], so that

λ(x) = 1 / (1 − x) on 0 < x < 1.

Given that X has exceeded x, the intensity of a default in the near future increases to infinity at a hyperbolic rate.

8.4 The Empirical Distribution Function

In this section we extend the previous results to the empirical distribution function

Fn(t) = (1/n) ∑_{i=1}^n 1{Xi≤t}, t ∈ R.

Here, X1, . . . , Xn, . . . are independent identically distributed (i.i.d.) random variables from the same unknown d.f. F. The filtration now becomes, for sample size n,

Ft = σ(1{Xi≤s} : s ≤ t, 1 ≤ i ≤ n).

By independence of the Xi's, the compensator and martingale part of Fn is just the (normalized) sum of the compensators and innovation martingales of the single indicators. Hence we get the following result.

Theorem 8.4.1. The Doob-Meyer decomposition of Fn equals

Fn(t) = Mn(t) + An(t),

where

Mn(t) = Fn(t) − ∫_{(−∞,t]} (1 − Fn(x−)) / (1 − F(x−)) F(dx).   (8.4.1)

Since Mn is a centered process, the covariance of Mn becomes, for s ≤ t:

Cov(Mn(s), Mn(t)) = E[Mn(s)Mn(t)] = E E[Mn(s)Mn(t)|Fs]
                  = E{Mn(s) E[Mn(t)|Fs]} = E M²n(s) ≡ (1/n) γ(s).


The function γ is computed in the next lemma.

Lemma 8.4.2. For s ≤ t, we have

Cov(Mn(s), Mn(t)) = (1/n) γ(s)

with

γ(s) = F(s) − ∫_{(−∞,s]} F{x} / (1 − F(x−)) F(dx).

Here F{x} = F(x) − F(x−) is the point mass at x. If F is continuous, then F{x} = 0 and

γ(s) = F(s) for all s.

Proof. Since γ(s) = E M²1(s), we obtain

γ(s) = E[1{X≤s} − ∫_{(−∞,s]} 1{x≤X} / (1 − F(x−)) F(dx)]²
     = E 1²{X≤s} − 2 E[1{X≤s} ∫_{(−∞,s]} 1{x≤X} / (1 − F(x−)) F(dx)]
     + E[∫_{(−∞,s]} 1{x≤X} / (1 − F(x−)) F(dx)]².

The second expectation equals

E[1{X≤s} ∫_{(−∞,s]} 1{x≤X} / (1 − F(x−)) F(dx)] = ∫_{(−∞,s]} E 1{x≤X≤s} / (1 − F(x−)) F(dx)
 = ∫_{(−∞,s]} (F(s) − F(x−)) / (1 − F(x−)) F(dx).

Finally,

E[∫_{(−∞,s]} 1{x≤X} / (1 − F(x−)) F(dx)]²
 = E ∫_{(−∞,s]} ∫_{(−∞,s]} 1{x≤X} 1{y≤X} / ((1 − F(x−))(1 − F(y−))) F(dx) F(dy).

By Fubini, we obtain

E[. . .]² = ∫_{(−∞,s]} ∫_{(−∞,s]} (1 − F(max(x, y)−)) / ((1 − F(x−))(1 − F(y−))) F(dx) F(dy).

The double integral is split into two pieces: y ≤ x ≤ s and x < y ≤ s. This leads to

∫_{(−∞,s]} ∫_{(−∞,s]} 1{y≤x} (1 − F(x−)) / ((1 − F(x−))(1 − F(y−))) F(dx) F(dy)
 = ∫_{(−∞,s]} ∫_{(−∞,s]} 1{y≤x} 1 / (1 − F(y−)) F(dx) F(dy) = ∫_{(−∞,s]} (F(s) − F(y−)) / (1 − F(y−)) F(dy),

and

∫_{(−∞,s]} ∫_{(−∞,s]} 1{x<y≤s} (1 − F(y−)) / ((1 − F(x−))(1 − F(y−))) F(dx) F(dy)
 = ∫_{(−∞,s]} ∫_{(−∞,s]} 1{x<y≤s} 1 / (1 − F(x−)) F(dy) F(dx) = ∫_{(−∞,s]} (F(s) − F(x)) / (1 − F(x−)) F(dx).

Summation of all terms yields the conclusion.

Remark 8.4.3. If F has discontinuities, γ = F is violated, but γ ≤ F is always valid. In particular γ ≤ 1 is bounded. We could rewrite γ to obtain

γ(s) = ∫_{(−∞,s]} [1 − F{x} / (1 − F(x−))] F(dx) = ∫_{(−∞,s]} [1 − Λ{x}] F(dx).

Since

F{x} = P(X = x) ≤ P(X ≥ x) = 1 − F(x−),

we get Λ{x} ≤ 1. Conclude that γ is nondecreasing.

As a final comment, if F is continuous, then

Mn(x) = M̄n(F(x))

with

M̄n(t) = F̄n(t) − ∫_{[0,t]} (1 − F̄n(u)) / (1 − u) du, 0 ≤ t ≤ 1,

where F̄n is the empirical d.f. of a uniform sample U1, . . . , Un.

It is useful to write (8.4.1) in terms of differentials:

dMn = dFn − (1 − Fn−) / (1 − F−) dF.   (8.4.2)

If we replace the indicator 1{x≤t} by an arbitrary function φ, then (8.4.2) implies

∫ φ dMn = ∫ φ dFn − ∫ φ (1 − Fn−) / (1 − F−) dF
        = ∫ φ d(Fn − F) + ∫ φ (Fn− − F−) / (1 − F−) dF.

8.5 The Predictable Quadratic Variation

Mean and variance are two basic concepts in elementary probability theory and statistics. In a dynamic framework we have seen so far that means should be updated at each time t so as to incorporate the currently available information. This led to the Doob-Meyer Decomposition, in which the martingale part may be viewed as a noise process and the mean, i.e., the predictable part, is captured by the compensator. Extending these ideas to the variance, we need to also include expected variations over time.

Definition 8.5.1. Let S0, S1, . . . , Sk be a finite sequence of square-integrable random variables adapted to the filtration F0 = {∅, Ω} ⊂ F1 ⊂ . . . ⊂ Fk. Then we call

⟨S⟩k ≡ ⟨S, S⟩k = S²0 + ∑_{i=1}^k E[∆²Si|Fi−1]

the Predictable Quadratic Variation (at time k). Here

∆Si = Si − Si−1

is the i-th increment.

The process ⟨S⟩k is important when one studies squared martingales.


Lemma 8.5.2. Let M0, M1, . . . , Mk be a square-integrable martingale. Then

M²i − ⟨M⟩i, i = 0, 1, . . . , k,

is also a martingale. In other words, ⟨M⟩i is the compensator of M²i, i = 0, 1, . . . , k.

Proof. We have

E[M²i − ⟨M⟩i|Fi−1] = E[M²i|Fi−1] − ⟨M⟩i
 = E(∆²Mi|Fi−1) + M²i−1 − ⟨M⟩i
 = M²i−1 − ⟨M⟩i−1.

Next we compute the predictable quadratic variation of the martingale

Mk = 1{X≤tk} − ∑_{i=1}^k 1{X>ti−1} (F(ti) − F(ti−1)) / (1 − F(ti−1)).

First,

∆Mk = Mk − Mk−1 = 1{tk−1<X≤tk} − 1{X>tk−1} (F(tk) − F(tk−1)) / (1 − F(tk−1)),

and therefore

∆²Mk = 1{tk−1<X≤tk} + 1{X>tk−1} [F(tk) − F(tk−1)]² / [1 − F(tk−1)]²
     − 2 · 1{tk−1<X≤tk} (F(tk) − F(tk−1)) / (1 − F(tk−1)).

Conclude that

E[∆²Mk|Fk−1] = 1{tk−1<X} (F(tk) − F(tk−1)) / (1 − F(tk−1))
 + 1{tk−1<X} [F(tk) − F(tk−1)]² / [1 − F(tk−1)]²
 − 2 · 1{tk−1<X} [F(tk) − F(tk−1)]² / [1 − F(tk−1)]²,

whence, with t0 = −∞,

⟨M⟩n = ∑_{k=1}^n 1{tk−1<X} (F(tk) − F(tk−1)) (1 − F(tk)) / [1 − F(tk−1)]².

Going to the limit, we obtain ⟨M⟩t in continuous time.


Lemma 8.5.3.

⟨M⟩t = ∫_{(−∞,t]} 1{x≤X} (1 − F(x)) / [1 − F(x−)]² F(dx).

For sample size n ≥ 1, the martingale Mn from (8.4.1) has the predictable quadratic variation

⟨Mn⟩t = (1/n) ∫_{(−∞,t]} (1 − Fn−)(1 − F) / (1 − F−)² dF.

Since M²n − ⟨Mn⟩ is a centered martingale, it follows that

E M²n(t) = E⟨Mn⟩t = (1/n) γ(t) = (1/n) ∫_{(−∞,t]} (1 − F) / (1 − F−) dF.

8.6 Some Stochastic Differential Equations

From (8.4.2) we obtain

dMn = d(Fn − F) + (Fn− − F−) / (1 − F−) dF.   (8.6.1)

This is a differential equation in Fn − F. If we multiply both sides with n^{1/2} and introduce the standardized processes

M̄n = n^{1/2} Mn and αn = n^{1/2}(Fn − F),

then, if F is the uniform distribution on [0, 1],

M̄n → B and αn → B0.

Here, B and B0 denote a Brownian Motion and a Brownian Bridge, respectively. This follows from Donsker's invariance principle for the empirical process αn. In the limit we obtain the differential equation

dB = dB0 + B0(t) / (1 − t) dt.

Rewriting the last equation we get

dB0 = − B0(t) / (1 − t) dt + dB.


This equation admits the solution

B0(t) = (1 − t) ∫_{[0,t]} 1 / (1 − s) B(ds), 0 ≤ t ≤ 1.

Going back to finite sample size, the following result therefore is not surprising.

Theorem 8.6.1. Under F = U[0, 1],

Fn(t) − t = (1 − t) ∫_{[0,t]} Mn(ds) / (1 − s).

Proof. From (8.4.2), the right-hand side equals

(1 − t) [∫_{[0,t]} Fn(ds) / (1 − s) − ∫_{[0,t]} (1 − Fn(s−)) / (1 − s)² ds].   (8.6.2)

It suffices to deal with n = 1. The first integral then becomes 1{X1≤t} / (1 − X1), while the second equals

∫_{[0,t]} 1{X1≥s} / (1 − s)² ds = [1 / (1 − s)]_0^{t∧X1} = 1 / (1 − t∧X1) − 1.

We have to consider two cases. When X1 ≤ t, (8.6.2) equals

(1 − t) [1/(1 − X1) − 1/(1 − X1) + 1] = 1 − t = 1{X1≤t} − t.

If X1 > t, we obtain

(1 − t) [0 − 1/(1 − t) + 1] = −1 + 1 − t = 1{X1≤t} − t.

This concludes the proof of the Theorem.

Since Mn is a martingale and the integrand is a deterministic function, the integral is also a martingale.

Corollary 8.6.2. The process

t → (Fn(t) − t) / (1 − t) = ∫_{[0,t]} Mn(ds) / (1 − s), 0 ≤ t < 1,

is a martingale under F = U[0, 1].
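The identity of Theorem 8.6.1 can be checked numerically; the following sketch (our own) evaluates both sides for one uniform sample, splitting the integral into its jump part and its absolutely continuous part:

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.sort(rng.uniform(size=25))
    n, t = len(x), 0.8

    Fn_t = np.mean(x <= t)
    knots = np.concatenate([[0.0], x[x < t], [t]])
    levels = np.arange(len(knots) - 1) / n
    # jump part: sum over observations <= t of (1/n) / (1 - X_i)
    jump = np.sum(1.0 / n / (1.0 - x[x < t]))
    # continuous part: -int (1 - Fn(s)) / (1 - s)^2 ds, with Fn piecewise constant
    cont = -np.sum((1.0 - levels) * (1.0 / (1.0 - knots[1:]) - 1.0 / (1.0 - knots[:-1])))
    print(Fn_t - t, (1.0 - t) * (jump + cont))     # the two numbers agree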


8.7 Stochastic Exponentials

Stochastic Exponentials were briefly discussed in Section 1.10. In this section we study some extensions and apply them to the empirical process. The following result is an extension of Theorem 1.10.5.

Lemma 8.7.1. (Gill). Let A and B be two distribution functions (not necessarily finite or with total mass 1) such that

A{x} ≤ 1 and B{x} < 1 for all x ∈ R.

Then the equation

Zt = ∫_{(−∞,t]} (1 − Z(s−)) / (1 − B{s}) [A(ds) − B(ds)]

admits a unique solution, namely

Z(t) = 1 − [∏_{s≤t}(1 − A{s}) exp(−Ac(t))] / [∏_{s≤t}(1 − B{s}) exp(−Bc(t))].

Proof. The function 1 − Zt is the exponential of the process (Ct)t given by

dCt = (dBt − dAt) / (1 − B{t}).

As usual, we decompose A and B into their continuous and discrete parts:

A = Ac + Ad,  B = Bc + Bd.

If we consider the exponential of C in discrete time, i.e., over a finite grid of points, we have with (1.10.2)

Cn = ∏_{i=1}^n [1 + (∆Bc_i + ∆Bd_i − ∆Ac_i − ∆Ad_i) / (1 − B{ti})]
   = exp( ∑_{i=1}^n ln[1 + (∆Bc_i + ∆Bd_i − ∆Ac_i − ∆Ad_i) / (1 − B{ti})] )
   = exp( ∑_{i=1}^n { ln[1 + (B{ti} − A{ti}) / (1 − B{ti})]
     + (∆Bc_i − ∆Ac_i) / (1 − B{ti}) · 1 / (1 + (B{ti} − A{ti}) / (1 − B{ti})) } + o(1) ).

As the grid gets finer, this converges to

∏_{s≤t} [1 + (B{s} − A{s}) / (1 − B{s})] exp[Bc_t − Ac_t]
 = [∏_{s≤t}(1 − A{s}) exp(−Ac(t))] / [∏_{s≤t}(1 − B{s}) exp(−Bc(t))].

From this the conclusion follows.

In Corollary 8.6.2 we have seen that the process

t → (Fn(t) − t) / (1 − t) = ∫_{[0,t]} Mn(ds) / (1 − s)

is a martingale if F = U[0, 1]. Now we study the non-centered process

βn(t) = (1 − Fn(t)) / (1 − t), 0 ≤ t < 1.

Note that the expectation of βn(t) equals one. Now,

βn(t) = (1 − Fn(t)) / (1 − t) = 1 − (Fn(t) − t) / (1 − t)
      = 1 − ∫_{[0,t]} Mn(ds) / (1 − s)
      = 1 − ∫_{[0,t]} (1 − Fn(s−)) / (1 − s) · Mn(ds) / (1 − Fn(s−))
      = 1 + ∫_{[0,t]} βn(s−) M0n(ds),   (8.7.1)

where

M0n(ds) = − Mn(ds) / (1 − Fn(s−)).

Note that this is legitimate only for s ≤ Xn:n. To be correct we need to redefine M0n so as to become

M0n(ds) = − 1{s≤Xn:n} / (1 − Fn(s−)) Mn(ds).   (8.7.2)

For t ≤ Xn:n, (8.7.1) remains the same. For t > Xn:n we have βn(t) = 0. Hence the right-hand side of (8.7.1) becomes

1 − ∫_0^{Xn:n} Mn(ds) / (1 − s) = 1 − (Fn(Xn:n) − Xn:n) / (1 − Xn:n) = 0.

Corollary 8.7.2. The process

βn(t) = (1 − Fn(t)) / (1 − t), 0 ≤ t < 1,

is the Stochastic Exponential of the martingale M0n defined in (8.7.2).


Remark 8.7.3. Because of (8.4.2) we have on {s ≤ Xn:n}

dM0n = − dMn / (1 − Fn−) = − dFn / (1 − Fn−) + dF / (1 − F−).

Hence (8.7.1) may also be written, for t ≤ Xn:n, in the form

βn(t) = 1 − ∫_{[0,t]} βn(s−) (Λn(ds) − Λ(ds)).

Since βn is an integral of a deterministic function w.r.t. the martingale Mn, it is also a martingale. Since the index set [0, 1) is open on the right and βn(1−) = 0, βn cannot be extended to become a martingale on the closed set [0, 1]. Therefore, to obtain maximal bounds for βn, we first have to restrict to compact subintervals of [0, 1). For example, Doob's maximal inequality yields, for any c > 0:

P(sup_{0≤s≤t} (1 − Fn(s)) / (1 − s) > c) ≤ E βn(t) / c = 1/c.

If we let t tend to one, then by σ-continuity

P(sup_{0≤s<1} (1 − Fn(s)) / (1 − s) > c) ≤ 1/c.

If, for a given ε > 0, we choose c > 0 so large that 1/c ≤ ε, then, except on an event with probability at most ε, we have

1 − Fn(s) ≤ c(1 − s)   (8.7.3)

for all 0 ≤ s < 1. Obviously, this inequality also holds true for s = 1. Inequality (8.7.3) provides us with a so-called linear upper bound for 1 − Fn. Such bounds are very useful in proofs if one needs to bound 1 − Fn from above.


Chapter 9

Introduction to Survival Analysis

9.1 Right Censorship: The Kaplan-Meier Estimator

In Survival Analysis one is interested in phenomena evolving over time. E.g., one may focus on the vital health status of a patient since surgery. Information about the status may be given through covariates known at the beginning of the observational period, which may be updated from time to time. To give another example, in a financial setting, investors holding corporate bonds may wish to know more about the company. An updated rating may then be helpful to assess the actual risk status of the company as to possible defaults before or at maturity of the bonds.

In the following we let Y denote the random time elapsed from the beginning of the study (surgery, investment) until default. We are interested in the survival function

S(y) = F̄(y) = 1 − F(y) = P(Y > y).

For example, in a medical setting, one may wish to know the probability that the disease-free survival time exceeds y = 5 years. A typical feature of such data is that, due to follow-up losses or an early end of the study, the information is incomplete. This means that for a sample of size n not all relevant Y1, . . . , Yn are available. Rather, in some cases, some surrogates are known which contribute less information than the full sample. Consequently, the empirical d.f. Fn of Y1, . . . , Yn cannot be computed and therefore cannot serve as an estimator of F. In the following sections we discuss several such situations and show how the statistical analysis needs to be properly adapted to the situation at hand.

In this section we briefly discuss the best studied case of incomplete data, namely Right Censorship. A typical example is shown in the following figure.

[Figure: staggered study entries of five patients plotted against calendar time up to the end of the study; patients 1 and 2 die after Y1 resp. Y2 time units under observation, patients 3 and 5 are still alive at the end of the study (censoring times C3, C5), and patient 4 is lost to follow-up (censoring time C4).]

We see that entries into the study are staggered. Patients 1 and 2 died after staying Y1 resp. Y2 time units in the group, while patients 3 and 5 were still alive at the end. Finally, patient 4 dropped out due to follow-up losses. In the last three cases, rather than Yi, we observe Ci, the time spent under observation until follow-up loss or for some other reason. Summarizing, rather than Yi, we observe a variable Zi which constitutes the minimum of Yi and a so-called censoring variable Ci, together with the information indicating which of Yi or Ci was actually observed:

Zi = min(Yi, Ci), δi = 1{Yi≤Ci}.

Hence δi = 1 indicates that Yi has not been censored, while δi = 0 means that instead of Yi the variable Ci was observed.

The problem now becomes one of estimating F, the unknown d.f. of the "lifetime" Y, from the available (Zi, δi), 1 ≤ i ≤ n. Throughout it will be assumed that, for each 1 ≤ i ≤ n, Yi is independent of Ci. Denote with G the (unknown) d.f. of each C. For the moment we restrict ourselves to the case when no covariables are present.


To derive such an estimator we shall make heavy use of the approach outlined in the first chapter. For this, recall the cumulative hazard function of F, given through

dΛ = dF / (1 − F−).

Then clearly

dΛ = (1 − G−) dF / ((1 − G−)(1 − F−)) ≡ (1 − G−) dF / (1 − H−),   (9.1.1)

where

1 − H(z−) = (1 − F(z−))(1 − G(z−)) = P(Y ≥ z) P(C ≥ z) = P(Y ≥ z, C ≥ z) = P(Z ≥ z).

Here the last equality follows from the assumed independence of Y and C. Recall that Z1, . . . , Zn are observable. Hence H can be nonparametrically estimated by the empirical d.f. of the Zi's:

Hn(z) = (1/n) ∑_{i=1}^n 1{Zi≤z}, z ∈ R.

Next we consider the numerator in (9.1.1). Define H1 by

dH1 = (1 − G−) dF.

Then we have the following

Lemma 9.1.1. For each z ∈ R,

P(Z ≤ z, δ = 1) = H1(z).

Proof. We have

P(Z ≤ z, δ = 1) = P(Z ≤ z, Y ≤ C) = P(Y ≤ z, Y ≤ C) = ∫_{(−∞,z]} (1 − G(y−)) F(dy).

Also H1 can be nonparametrically estimated, through an empirical sub-distribution function:

H1n(z) = (1/n) ∑_{i=1}^n 1{Zi≤z, δi=1}.


Together, Hn and H1n define an estimator of Λ, namely

dΛn ≡ dH1n / (1 − Hn−).

In an integrated form, we have

Λn(z) = ∫_{(−∞,z]} H1n(dy) / (1 − Hn(y−)) = ∑_{i=1}^n 1{Zi≤z, δi=1} · 1 / (n − Ri + 1),

where Ri is the rank of Zi among Z1, . . . , Zn and we assume for the moment that all Zi are distinct.

The corresponding estimator of F is obtained from

1 − Fn(t) = ∏_{z≤t} [1 − Λn{z}].   (9.1.2)

This estimator was introduced in a landmark paper by Kaplan and Meier (1958). For further analysis, note that Λn only jumps at the Zi for which δi = 1. The jump size equals

Λn{Zi} = 1 / (n − Ri + 1).

Plugging this into (9.1.2), we get for the Kaplan-Meier estimator

1 − Fn(t) = ∏_{Zi≤t, δi=1} (n − Ri) / (n − Ri + 1) = ∏_{Zi≤t} [1 − δi / (n − Ri + 1)].

After ordering the Zi's, we get

1 − Fn(t) = ∏_{Zi:n≤t} [1 − δ[i:n] / (n − i + 1)].   (9.1.3)

Here, δ[i:n] is the δ-concomitant of Zi:n, i.e., δ[i:n] = δj if Zi:n = Zj. From (9.1.3) we finally get the mass given to Zi:n:

Win = δ[i:n] / (n − i + 1) ∏_{k=1}^{i−1} [1 − δ[k:n] / (n − k + 1)].

When all data are uncensored, all δ's equal one and Win collapses to 1/n. Hence, when no censorship is present, the Kaplan-Meier estimator becomes the ordinary empirical d.f. In the general case, the weights Win depend on the location of the δ-labels and are therefore random.


When F has discontinuities, the data may have ties. In this general case the Kaplan-Meier estimator is computed as follows. Since 1 − Fn is the exponential of −Λn, we have

1 − Fn(t) = 1 − ∫_{(−∞,t]} (1 − Fn−) dΛn.

From this we get the point mass at t:

Fn{t} = (1 − Fn(t−)) / (1 − Hn(t−)) · H1n{t}.   (9.1.4)

Since H1n attributes positive mass only to the uncensored data, the same is true for Fn. Now, let t1 < . . . < tk be the pairwise distinct values among Z1, . . . , Zn. From (9.1.4) we obtain a recursive formula for the masses at the ti. For t1 we obtain

Fn{t1} = H1n{t1} ≡ k1/n,

where k1 is the number of Z-data with Zi = t1 and δi = 1. Conclude that

1 − Fn(t1) = 1 − Fn{t1} = 1 − k1/n = (n − k1)/n
           = (n − 1)/n · (n − 2)/(n − 1) · · · (n − k1)/(n − k1 + 1)
           = ∏_{i=1}^{k1} [1 − δ[i:n] / (n − i + 1)].

Practically, we have done the same for the k1 smallest uncensored Zi as for the empirical d.f. when no censorship was present. It could be that some censored variables also take on the value t1. They do not contribute to the last product. If we continue in this way we find that also in the case of ties the formula

1 − Fn(t) = ∏_{Zi:n≤t} [1 − δ[i:n] / (n − i + 1)]   (9.1.5)

holds true. If there is a tj which is attained by both censored and uncensored variables, the uncensored need to be ordered before the censored, while the ordering among the censored and among the uncensored may be arbitrary. For the weight Wjn attributed to tj we get

Wjn = kj / (n − nHn(tj−)) ∏_{Zi:n<tj} [1 − δ[i:n] / (n − i + 1)],

where again kj is the number of uncensored Z's attaining the value tj.

Next we consider an arbitrary function φ. For the associated Kaplan-Meier integral ∫ φ dFn, we get

∫ φ dFn = ∑_{j=1}^k Wjn φ(tj).

In the analysis of classical empirical integrals,

∫ φ dFn = (1/n) ∑_{i=1}^n φ(Xi),

we can make use of the fact that

Xi = F^{−1}(Ui) in distribution,

so that

∫ φ dFn = ∫ φ ∘ F^{−1} dF̄n.

Here U1, . . . , Un is a uniform sample and F̄n its empirical d.f. Note that with probability one the U's have no ties. The existence of ties in the available X-data may be attributed to the quantile transformation F^{−1}. Summarizing, replacing φ by φ ∘ F^{−1}, we may assume w.l.o.g. that the data come from a (continuous) uniform distribution with no ties present. To apply the same idea in the case of right censored data we have to take into account censorship effects.

Lemma 9.1.2. Let Y ∼ F and C ∼ G be independent, and let D = {a1, a2, . . .} denote the possible discontinuities of F. Let V ∼ U[0, 1] be independent of Y and C. Put

U = F(Y) if Y ∉ D,  U = F(Y−) + [F(a) − F(a−)] V if Y = a and a ∈ D,

U* = F(C).

Then we have:

(i) U and U* are independent

(ii) U ∼ U[0, 1]

(iii) Y = F^{−1}(U)

(iv) δ = 1{Y≤C} = 1{U≤U*} = δ*

Proof. Elementary.

When we apply the previous lemma to the pairs (Yi, Ci), we obtain a sample (Ui, U*i) with the same censorship structure as the original sample. In the U-sample no ties are present with probability one. For a general integrand φ, we have

φ(Yi) = φ(F^{−1}(Ui)), if δi = δ*i = 1.

The Kaplan-Meier weights remain unchanged. As a consequence,

∫ φ dFn = ∑_{i=1}^n Win φ(F^{−1}(Z*i:n))

with Z*i = min(Ui, U*i). Hence, in proofs, we may assume w.l.o.g. that no ties are present.

9.2 Martingale Structures under Censorship

As in the previous section, let

Y ∼ F and C ∼ G

be two independent random variables such that

Z = min(Y, C) and δ = 1{Y≤C}

are observed. Again, denote

1 − H(t) = P(Z > t) = (1 − F(t))(1 − G(t)),

H1(t) = P(Z ≤ t, δ = 1) = ∫_{(−∞,t]} (1 − G−) dF.

Finally, set

H0(t) = P(Z ≤ t, δ = 0) = ∫_{(−∞,t]} (1 − F) dG.


The corresponding estimators are given by

Hn(t) = (1/n) ∑_{i=1}^n 1{Zi≤t},

H1n(t) = (1/n) ∑_{i=1}^n 1{Zi≤t, δi=1},

H0n(t) = (1/n) ∑_{i=1}^n 1{Zi≤t, δi=0},

respectively. We have also seen that

dΛF ≡ dΛ = dH1 / (1 − H−),

while the empirical variant became

dΛn = dH1n / (1 − Hn−).

In the following we derive the Doob-Meyer decompositions of Hn, H1n and H0n. As before, it is enough to study sample size n = 1. The filtration has to be chosen such that all three processes are adapted. Since

1{Z≤t, δ=0} = 1{Z≤t} − 1{Z≤t, δ=1},

it suffices to consider

Ft = σ(1{Z≤s}, 1{Z≤s, δ=1} : s ≤ t).

As before we again consider a finite grid −∞ = t1 < . . . < tn < t = tn+1. Then

1{Z≤tk, δ=1} = 1{Z≤tk−1, δ=1} + 1{tk−1<Z≤tk, δ=1}.

The first indicator is again measurable w.r.t. Ftk−1. For the second indicator we obtain, by the Markov property,

E(1{tk−1<Z≤tk, δ=1}|Ftk−1) = E(1{...} | 1{Z≤tk−1}, 1{Z≤tk−1, δ=1}).

The σ-algebra generated by the two events consists of ∅, Ω, {Z ≤ tk−1, δ = 1}, {Z ≤ tk−1, δ = 0}, {Z > tk−1} and unions thereof. Hence

E(1{...} | . . .) = 1{Z>tk−1} ∫_{Z>tk−1} 1{tk−1<Z≤tk, δ=1} dP / P(Z > tk−1)
                 = 1{Z>tk−1} (H1(tk) − H1(tk−1)) / (1 − H(tk−1)).


From this we get the Doob-Meyer decomposition in discrete time and, finally, by going to the limit, in continuous time.

Theorem 9.2.1. Consider the process St = 1{Z≤t, δ=1} with the filtration Ft = σ(1{Z≤s}, 1{Z≤s, δ=1} : s ≤ t). Then (St)t admits the innovation martingale

M1t = 1{Z≤t, δ=1} − ∫_{(−∞,t]} 1{x≤Z} / (1 − H(x−)) H1(dx)
    = 1{Z≤t, δ=1} − ∫_{(−∞,t]} 1{x≤Z} / (1 − F(x−)) F(dx).

Corollary 9.2.2. The process H1n is adapted to Ft = σ(1{Zi≤s}, 1{Zi≤s, δi=1} : i = 1, . . . , n, s ≤ t) and has the innovation martingale

M1n(t) = H1n(t) − ∫_{(−∞,t]} (1 − Hn(x−)) / (1 − F(x−)) F(dx).   (9.2.1)

Remark 9.2.3. For H0n the innovation martingale becomes

M0n(t) = H0n(t) − ∫_{(−∞,t]} (1 − Hn(x−)) / (1 − H(x−)) H0(dx)
       = H0n(t) − ∫_{(−∞,t]} (1 − Hn(x−)) / (1 − H(x−)) (1 − F(x)) G(dx).

If F is continuous, the compensator becomes

∫_{(−∞,t]} (1 − Hn(x−)) / (1 − G(x−)) G(dx).

Coming back to (9.2.1), in terms of differentials we obtain

dM1n = dH1n − (1 − Hn−) / (1 − F−) dF.

On {t ≤ Zn:n} we divide both sides by 1 − Hn− to get

dM1n / (1 − Hn−) = dH1n / (1 − Hn−) − dF / (1 − F−) = dΛn − dΛ.   (9.2.2)


Integration leads to

(Λn − Λ)(t) = ∫_{(−∞,t]} dM1n / (1 − Hn−) on {t ≤ Zn:n}.   (9.2.3)

Equation (9.2.3) is the analog of Remark 8.7.3 when there is no censorship.

So far we have discussed martingale structures for Hn, H1n and H0n. Now we study the Kaplan-Meier estimator Fn itself. For this, recall that

• 1 − Fn is the exponential of −Λn,

• 1 − F is the exponential of −Λ.

Lemma 8.7.1 enables us to find the solution of the equation

Zt = ∫_{(−∞,t]} (1 − Z(s−)) / (1 − Λ{s}) [Λn(ds) − Λ(ds)].

Since Λn is purely discrete, Λc_n ≡ 0. It follows that

Z(t) = 1 − ∏_{s≤t}(1 − Λn{s}) / [∏_{s≤t}(1 − Λ{s}) exp(−Λc(t))] = 1 − (1 − Fn(t)) / (1 − F(t)).

Hence, for F(t) < 1,

(1 − Fn(t)) / (1 − F(t)) = 1 − ∫_{(−∞,t]} (1 − Z(s−)) / (1 − Λ{s}) [Λn(ds) − Λ(ds)]
 = 1 − ∫_{(−∞,t]} (1 − Fn(s−)) / ((1 − F(s−))(1 − Λ{s})) [Λn(ds) − Λ(ds)].

Since

(1 − F(s−))(1 − Λ{s}) = (1 − F(s−))(1 − F{s}/(1 − F(s−))) = 1 − F(s−) − F{s} = 1 − F(s),

we obtain

(1 − Fn(t)) / (1 − F(t)) = 1 − ∫_{(−∞,t]} (1 − Fn(s−)) / (1 − F(s)) [Λn(ds) − Λ(ds)].   (9.2.4)


Equations (9.2.2) and (9.2.4) will enable us to further study the Kaplan-Meier estimator. First, from (9.2.4),

(Fn(t) − F(t)) / (1 − F(t)) = ∫_{(−∞,t]} (1 − Fn(s−)) / (1 − F(s)) [Λn(ds) − Λ(ds)].

Using (9.2.2) we get the following

Theorem 9.2.4. On {t ≤ Zn:n} and for F(t) < 1 we have

(Fn(t) − F(t)) / (1 − F(t)) = ∫_{(−∞,t]} (1 − Fn(s−)) / ((1 − F(s))(1 − Hn(s−))) M1n(ds).   (9.2.5)

Theorem 9.2.4 constitutes an extension of Theorem 8.6.1 to the Kaplan-Meier estimator. Note that in the classical case 1 − Fn = 1 − Hn, so that the (predictable) random ratio (1 − Fn)/(1 − Hn) cancels out. Therefore, with no censorship present, the restriction t ≤ Un:n was not necessary. For the ratio of 1 − Fn and 1 − F a similar argument yields, for t ≤ Zn:n,

(1 − Fn(t)) / (1 − F(t)) = 1 − ∫_{(−∞,t]} (1 − Fn(s−)) / ((1 − F(s−))(1 − Hn(s−))(1 − Λ{s})) M1n(ds).   (9.2.6)

Theorem 9.2.5. The process

βn(t) = (1 − Fn(t)) / (1 − F(t))

equals, on {t ≤ Zn:n}, the exponential of the process M̃n defined through

dM̃n = − dM1n / ((1 − Hn−)(1 − Λ{s})).

The last result extends Corollary 8.7.2 in two ways:

• Censorship is admitted

• Λ may have discontinuities


Next we determine the covariance structure of M1n. The proof is similar to that of Lemma 8.4.2. For γ(s) ≡ E[(M11(s))²] we now obtain

γ(s) = E[1{Z≤s, δ=1} − ∫_{(−∞,s]} 1{x≤Z} / (1 − F(x−)) F(dx)]²
 = H1(s) − 2 E ∫_{(−∞,s]} 1{x≤Z≤s, δ=1} / (1 − F(x−)) F(dx)
 + E[∫_{(−∞,s]} 1{x≤Z} / (1 − F(x−)) F(dx)]².

Proceeding as for Lemma 8.4.2, we obtain

γ(s) = H1(s) − ∫_{(−∞,s]} H1{x} / (1 − F(x−)) F(dx).   (9.2.7)

Theorem 9.2.6. The innovation martingale M1n has the covariance function

Cov(M1n(s), M1n(t)) = (1/n) γ(s) for s ≤ t,

with γ as in (9.2.7).

For a continuous H1, we have γ = H1. When there is no censorship, (9.2.7) and Lemma 1.4.2 coincide.

To determine the limit distribution of Fn: since the weights Win are random, the CLT for sums of independent random variables is not directly applicable. In such a situation a representation in terms of stochastic integrals, see (9.2.5), may be helpful.

In the following we sketch the arguments. Starting with (9.2.5), the so-called Kaplan-Meier process

αn(t) = n^{1/2}[Fn(t) − F(t)]

may be written as

αn(t) = (1 − F(t)) ∫_{(−∞,t]} (1 − Fn(s−)) / ((1 − F(s))(1 − Hn(s−))) M̄n(ds),


with M̄n = n^{1/2} M1n. The process M̄n is a martingale with covariance γ. From the Glivenko-Cantelli Theorem we have, with probability one,

1 − Hn → 1 − H uniformly.

The extension of the Glivenko-Cantelli Theorem to the Kaplan-Meier estimator is due to Stute and Wang (Ann. Statist., 1993):

1 − Fn → 1 − F uniformly.

Hence, on an interval (−∞, t0] with F(t0) < 1, αn(t) is asymptotically equivalent to

α0n(t) = (1 − F(t)) ∫_{(−∞,t]} (1 − F(s−)) / ((1 − F(s))(1 − H(s−))) M̄n(ds)
       = (1 − F(t)) ∫_{(−∞,t]} M̄n(ds) / ((1 − F(s))(1 − G(s−))).

M̄n has the same covariance structure as B ∘ γ, where B is a standard Brownian Motion. Therefore we may expect that

αn(·) → (1 − F(·)) ∫_{(−∞,·]} d(B ∘ γ) / ((1 − F)(1 − G−)) ≡ B̃(·).

Since the integrand is deterministic, B̃ is a centered Gaussian process. The covariance equals

E[B̃(s)B̃(t)] = (1 − F(s))(1 − F(t)) ∫_{(−∞,s∧t]} dγ / ((1 − F)²(1 − G−)²)
 = (1 − F(s))(1 − F(t)) ∫_{(−∞,s∧t]} 1/((1 − F)²(1 − G−)²) [dH1 − (H1{·}/(1 − F−)) dF].

If F and hence H1 are continuous, the second integrating measure vanishes, and we obtain

E[B̃(s)B̃(t)] = (1 − F(s))(1 − F(t)) ∫_{(−∞,s∧t]} dF / ((1 − F)²(1 − G−))
            = (1 − F(s))(1 − F(t)) ∫_{(−∞,s∧t]} dΛ / (1 − H−).   (9.2.8)

The following result summarizes our findings.


Theorem 9.2.7. (Breslow-Crowley). The process αn converges in distribution on each interval (−∞, t0] with H(t0) < 1, with limit B̃. Here B̃ is a centered Gaussian process with covariance (if F is continuous) given by (9.2.8).

Corollary 9.2.8. Assume F is continuous. Then for all x ∈ R

P(sup_{t≤t0} |αn(t)| ≥ x) → P(sup_{t≤t0} |B̃(t)| ≥ x).

Proof. This assertion follows from the Breslow-Crowley result upon applying the continuous mapping theorem.

9.3 Confidence Bands for F under Right Censorship

In this section we demonstrate how the results of the previous section may be applied in a statistical context. Again, assume that we observe (Zi, δi), 1 ≤ i ≤ n. We want to construct a confidence band for the unknown d.f. F, i.e., for each t we aim at constructing an interval

In(t) = (Fn(t) − x/√n, Fn(t) + x/√n)

such that with a predetermined probability 1 − c we have

F(t) ∈ In(t) for all t ≤ t0.

The quantity x may and will depend on t. Hence the width of the confidence band will change from t to t. To find such x's, introduce the function

C(t) = ∫_{(−∞,t]} dΛ / (1 − H−) = ∫_{(−∞,t]} dH1 / (1 − H−)².

The function C is nondecreasing, and the limit of αn in the Breslow-Crowley result may be written as

B̃ = (1 − F) B ∘ C,

where again B is a Brownian Motion. Put

K(t) = C(t) / (1 + C(t))


so that

C(t) = K(t) / (1 − K(t)).

Then

B̃ = (1 − F) B ∘ (K/(1 − K)) = (1 − F)/(1 − K) · (1 − K) B ∘ (K/(1 − K))
  = (1 − F)/(1 − K) · B0 ∘ K,

where B0 is a Brownian Bridge. Thus, putting

βn = (1 − K)/(1 − F) · αn,

we get

βn → B0 ∘ K in distribution.

From the Continuous Mapping Theorem,

P(sup_{t≤t0} |(1 − K(t)) αn(t) / (1 − F(t))| ≥ x) → P(sup_{t≤t0} |B0 ∘ K(t)| ≥ x).

Since along with C also K is monotone increasing in t,

P(sup_{t≤t0} |B0 ∘ K(t)| ≥ x) = P(sup_{0≤u≤K(t0)} |B0(u)| ≥ x).

For selected values of K(t0) and x, the last probabilities are tabulated. Choosing x such that

P(sup_{0≤u≤K(t0)} |B0(u)| ≥ x) = c,   (9.3.1)

a pre-specified value, we obtain

P(sup_{t≤t0} |(1 − K(t)) αn(t) / (1 − F(t))| ≥ x) → c.

The same holds when we replace F with Fn and K by

Kn = Cn / (1 + Cn)

with

Cn(t) = ∫_{(−∞,t]} dH1n / (1 − Hn−)².

Summarizing we obtain


Corollary 9.3.1. (Hall-Wellner Bands). For a given 0 < c < 1, let x be chosen as in (9.3.1). Put

In(t) = (Fn(t) − n^{−1/2} x (1 − Fn(t)) / (1 − Kn(t)), Fn(t) + n^{−1/2} x (1 − Fn(t)) / (1 − Kn(t))).

Then we have

P(F(t) ∈ In(t) for all t ≤ t0) → 1 − c.
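A minimal sketch (our own implementation) of the Hall-Wellner band on a grid of time points; the critical value x solving (9.3.1) is assumed to be supplied by the user (for Kn(t0) close to 1 it approaches the full Brownian-Bridge value, roughly 1.36 for c = 0.05). Ties are not treated specially here.

    import numpy as np

    def hall_wellner_band(z, d, grid, x):
        # pointwise lower/upper band limits of Corollary 9.3.1 at the grid points
        z, d = np.asarray(z, float), np.asarray(d, int)
        order = np.argsort(z)
        z, d = z[order], d[order]
        n = len(z)
        surv, C_hat, j = 1.0, 0.0, 0
        lower, upper = [], []
        for t in grid:
            while j < n and z[j] <= t:
                at_risk = n - j
                if d[j] == 1:
                    surv *= 1.0 - 1.0 / at_risk        # Kaplan-Meier step
                    C_hat += n / at_risk**2            # increment of Cn = int dH1n/(1-Hn-)^2
                j += 1
            F_hat = 1.0 - surv
            K_hat = C_hat / (1.0 + C_hat)
            half = x / np.sqrt(n) * (1.0 - F_hat) / (1.0 - K_hat)
            lower.append(max(F_hat - half, 0.0))
            upper.append(min(F_hat + half, 1.0))
        return np.array(lower), np.array(upper)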

Remark 9.3.2. The approach of Hall and Wellner incorporates a time transformation K which leads to a time-transformed Brownian Bridge. If one is not willing to follow this approach, one needs to base the analysis on the process αn/(1 − Fn). It follows from the Breslow-Crowley result and the consistency of Fn that

P(sup_{t≤t0} |αn(t) / (1 − Fn(t))| ≥ x) → P(sup_{t≤t0} |B(C(t))| ≥ x)
 = P(sup_{0≤u≤C(t0)} |B(u)| ≥ x) = P(sup_{0≤u≤1} |B(uC(t0))| ≥ x)
 = P(√C(t0) sup_{0≤u≤1} |B(u)| ≥ x).

Replace x with x√C(t0) resp. x√Cn(t0). Then

P(sup_{t≤t0} |αn(t) / (1 − Fn(t))| ≥ x√Cn(t0)) → P(sup_{0≤u≤1} |B(u)| ≥ x).

The resulting confidence band is of the form

In(t) = (Fn(t) − x n^{−1/2} (1 − Fn(t)) √Cn(t0), Fn(t) + x n^{−1/2} (1 − Fn(t)) √Cn(t0)).

9.4 Rank Tests for Censored Data

In Case-Control studies one is interested in a comparison of a new therapy, say, with an existing one. For this one needs to compare two samples drawn from each of the two populations. Under right-censorship this amounts to comparing samples of the form

(Z1i, δ1i), 1 ≤ i ≤ n1, and (Z2i, δ2i), 1 ≤ i ≤ n2.


Each Z is of the form

Z1i = Y1i ∧ C1i resp. Z2i = Y2i ∧ C2i.

The lifetime distributions F1 and F2 satisfying

Y1i ∼ F1 and Y2i ∼ F2

are the quantities of interest, while the censoring distributions G1 and G2 satisfying

C1i ∼ G1 and C2i ∼ G2

are nonparametric nuisance parameters. Denote with H1 and H2 the d.f.'s of Z1i and Z2i, respectively. As before, we obtain

1 − H1 = (1 − F1)(1 − G1),
1 − H2 = (1 − F2)(1 − G2),

while the corresponding sub-distributions now become

H1_1(t) = P(Z1 ≤ t, δ1 = 1) = ∫_{(−∞,t]} (1 − G1−) dF1,
H1_2(t) = P(Z2 ≤ t, δ2 = 1) = ∫_{(−∞,t]} (1 − G2−) dF2,

respectively. A possible hypothesis to be tested might be

H0 : F1 = F2 versus H1 : F1 ≠ F2

or

H0 : F1 = F2 versus H1 : F1 ≥ F2 and F1 ≠ F2.

The distributions G1 and G2 may differ under H0, so that the observed Z1 and Z2 may have different distributions H1 and H2 even under the null hypothesis. Hence a statistical test for H0 has to make sure that it does not respond to differences in the Z's which are merely caused by G1 ≠ G2.

A class of tests which have found a lot of interest in practical applications are the so-called Linear Rank Tests. The word "linear" stands for sums or, in our notation, for empirical integrals. A crucial part is played by a weight function W which needs to be chosen in such a way that under the alternative relevant departures between the samples are detected and upweighted, resulting in a test with hopefully large power.

To start with the theoretical quantities, we aim at comparing two terms which are identical under H0, irrespective of whether G1 = G2 or not. Since we shall restrict ourselves to integrals, the problem becomes one of choosing

• a weight function W

• two measures µ and ν such that ∫ W dµ = ∫ W dν under H0.

Focus on µ and ν first. Putting

dµ = (1 − G1−)(1 − G2−)(1 − F2−) / (1 − H−) dF1

and

dν = (1 − G1−)(1 − G2−)(1 − F1−) / (1 − H−) dF2,

where

H = n1/(n1 + n2) · H1 + n2/(n1 + n2) · H2,

we get that

I ≡ ∫ W(H(x)) [µ(dx) − ν(dx)]

vanishes under the null hypothesis. Note that in I the weight function becomes W ∘ H rather than W, with W being defined on the unit interval. The function H may be viewed as the d.f. of the pooled sample Z11, . . . , Z1n1, Z21, . . . , Z2n2. The measures µ and ν are closely connected with the cumulative hazard functions. Actually, we have

dµ = (1 − H2−) / (1 − H−) dH1_1,

dν = (1 − H1−) / (1 − H−) dH1_2.

The final test statistic is just the empirical analog of I. This leads to

Î ≡ ∫ W(Ĥ(x)) [(1 − Ĥ2(x−)) / (1 − Ĥ(x−)) Ĥ1_1(dx) − (1 − Ĥ1(x−)) / (1 − Ĥ(x−)) Ĥ1_2(dx)].


The function Ĥ is the empirical d.f. of the pooled Z's:

Ĥ(x) = 1/(n1 + n2) [∑_{i=1}^{n1} 1{Z1i≤x} + ∑_{i=1}^{n2} 1{Z2i≤x}].

The functions Ĥ1 and Ĥ2 are the empirical d.f.'s of the Z-subsamples:

Ĥ1(x) = (1/n1) ∑_{i=1}^{n1} 1{Z1i≤x} and Ĥ2(x) = (1/n2) ∑_{i=1}^{n2} 1{Z2i≤x}.

Conclude that

Ĥ = n1/(n1 + n2) Ĥ1 + n2/(n1 + n2) Ĥ2,

and henceforth

1 − Ĥ = n1/(n1 + n2) (1 − Ĥ1) + n2/(n1 + n2) (1 − Ĥ2).

Set

λ1 = n1/(n1 + n2), λ2 = n2/(n1 + n2).

Then we obtain

Î = ∫ W ∘ Ĥ [ (1/λ2) dĤ1_1 − (n1/n2) (1 − Ĥ1−)/(1 − Ĥ−) dĤ1_1 − (1 − Ĥ1−)/(1 − Ĥ−) dĤ1_2 ].

Denote the data points in the pooled Z-sample by T1 < T2 < . . . < TN, where N = n1 + n2, and let δ(k) be the δ-label of Tk. If we introduce

ρk = 1 if Tk comes from the first sample, and ρk = 0 if Tk comes from the second sample,

then we easily express Î as a sum:

Î = N/(n1 n2) ∑_{k=1}^N W(k/N) δ(k) [ρk − nk1/(N − k + 1)],

with

nk1 = ∑_{j=1}^N 1{Tj≥Tk} ρj

being the number of data points in the pooled sample which are greater than or equal to the k-th order statistic and come from the first sample.

Two famous tests:

W ≡ 1 : Mantel-Haenszel (log-rank) Test

W = Id : Gehan's Wilcoxon Test
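A minimal sketch (our own implementation) of Î in the summation form above; W ≡ 1 gives the Mantel-Haenszel test and W(u) = u Gehan's Wilcoxon test; ties are not treated specially.

    import numpy as np

    def linear_rank_statistic(z1, d1, z2, d2, W=lambda u: np.ones_like(u)):
        # I_hat = N/(n1 n2) * sum_k W(k/N) delta_(k) [rho_k - n_{k1}/(N-k+1)]
        z = np.concatenate([z1, z2])
        d = np.concatenate([d1, d2]).astype(float)
        rho = np.concatenate([np.ones(len(z1)), np.zeros(len(z2))])
        order = np.argsort(z)
        d, rho = d[order], rho[order]
        N, n1, n2 = len(z), len(z1), len(z2)
        k = np.arange(1, N + 1)
        nk1 = np.cumsum(rho[::-1])[::-1]       # number of pooled obs >= T_k from sample 1
        return N / (n1 * n2) * np.sum(W(k / N) * d * (rho - nk1 / (N - k + 1)))

    rng = np.random.default_rng(8)
    y1, c1 = rng.exponential(1.0, 80), rng.exponential(2.0, 80)
    y2, c2 = rng.exponential(1.5, 90), rng.exponential(2.0, 90)
    print(linear_rank_statistic(np.minimum(y1, c1), y1 <= c1,
                                np.minimum(y2, c2), y2 <= c2))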


9.5 Parametric Modelling in Survival Analysis

As before we consider i.i.d. data Y1, . . . , Yn from a d.f. F, which are at risk of being censored from the right. In this section we shall discuss several issues arising when F comes from a parametric model. Usually this model is given through densities, M = {fθ : θ ∈ Θ}, i.e., F admits a density f such that

f = fθ0 for some θ0 ∈ Θ.

The "true" parameter θ0 is unknown and needs to be estimated from the data. If all Y's were observable, then the likelihood function

Ln(θ) = ∏_{i=1}^n fθ(Yi)

is available. It is known since Wald (1949) that its maximizer is strongly consistent for θ0 under weak regularity assumptions on M.

In Survival Analysis it is more common to model F through its hazard function. This is mainly because the main interest is now to model the risk in the near future, which is better captured by λ than by f. Thus, for a given model M = {λθ : θ ∈ Θ} of hazard functions, we have, by continuity,

λθ(x) = fθ(x) / (1 − Fθ(x)).

Conclude that

fθ(Y) = λθ(Y)(1 − Fθ(Y)) = λθ(Y) exp[− ∫_{−∞}^Y λθ(t) dt]
      = λθ(Y) exp[− ∫_{−∞}^∞ 1{t≤Y} / (1 − Fθ(t−)) Fθ(dt)].

The integral in the exponent equals the compensator of the Single-Event Process at t = ∞, when F = Fθ. If a fully observable sample Y1, . . . , Yn is given, then the likelihood function becomes

Ln(θ) =n∏

i=1

λθ(Yi) exp

− ∞∫−∞

∑ni=1 1t≤Yi

1− Fθ(t−)Fθ(dt)

.


So far we have assumed in this section that the $Y$'s are i.i.d. and observable. As we shall see, in the presence of covariates the assumption that the $Y$'s are identically distributed and come from a homogeneous population can very often not be justified. In other words, the hazard function may vary with $i$ for $1\le i\le n$. With the same arguments as before, we obtain for the likelihood function
$$L_n(\theta) = \prod_{i=1}^{n}\lambda_\theta^i(Y_i)\exp\left[-\int_{-\infty}^{Y_i}\lambda_\theta^i(t)\,dt\right].$$

Example 9.5.1. A popular choice for modelling $\lambda_\theta^i$ is of the form
$$\lambda_\theta^i(x) = \lambda_\theta(x)\mu_i(x).$$
Here, the functions $\lambda_\theta$ are a model for the baseline risk of a population, while the $\mu_i$ describe an individual risk component. If, e.g., $\mu_1(x) > \mu_2(x)$ for all $x$, this implies that, whatever the true parameter $\theta$ is, person 1 faces a larger risk than person 2.

In medical applications individual information often comes in through a covariate $X_i$ measured at the beginning of the study. In such a case the function $\mu_i$ does not depend on the time argument.

Example 9.5.2. Maybe the best studied example is Cox regression. Here the individual hazard rate is modelled as
$$\lambda_\theta^i(x) = \lambda_\gamma(x)\exp(\beta^t X_i). \qquad (9.5.1)$$
Hence $\theta = (\gamma,\beta)$. The function $\lambda_\gamma$ is again called the baseline hazard function.

There is one important comment to be made. Since in our setting the variable $X_i$ is a random prognostic factor, the function $\lambda_\theta^i$ in (9.5.1) equals the hazard function of $Y_i$ conditional on the covariate $X_i$. In other words, $\lambda_{\theta_0}^i(x)$ equals the hazard function associated with the conditional d.f. $F(y|X_i)$ of $Y$ given $X_i$. Therefore the function
$$L_n(\theta) = \prod_{i=1}^{n}\lambda_\gamma(Y_i)\exp(\beta^t X_i)\exp\left[-\exp(\beta^t X_i)\int_{-\infty}^{Y_i}\lambda_\gamma(s)\,ds\right]$$


is called the conditional likelihood function. The full likelihood equals
$$\tilde L_n(\theta) = \prod_{i=1}^{n}f_\theta(X_i,Y_i),$$
where now $f_\theta$ denotes the joint density of $(X_i,Y_i)$ under $\theta$. We have
$$f_\theta(X_i,Y_i) = f_\theta^1(X_i)\,f_\theta^{2,1}(Y_i|X_i),$$
where in obvious notation
$$f_\theta^1 = \text{density of } X_i, \qquad f_\theta^{2,1} = \text{conditional density of } Y_i \text{ given } X_i.$$
For $f_\theta^{2,1}$ we have, by model assumption,
$$f_\theta^{2,1}(Y_i|X_i) = \lambda_\theta^i(Y_i)\exp\left[-\int_{-\infty}^{Y_i}\lambda_\theta^i(t)\,dt\right]$$
with $\lambda_\theta^i$ given as in (9.5.1). Hence
$$\tilde L_n(\theta) = L_n(\theta)\prod_{i=1}^{n}f_\theta^1(X_i).$$
If the density $f_\theta^1$ does not depend on $\theta$, then $L_n$ and $\tilde L_n$ coincide up to a factor not depending on $\theta$. In this case their maximizers coincide. In the general case, $f_\theta^1$ will depend on $\theta$ and the maximizers of $L_n$ and $\tilde L_n$ are distinct. Another common name for $L_n$ is partial likelihood.

As another popular model we mention the following example.

Example 9.5.3. Now the individual risk acts as an additive component:
$$\lambda_\theta^i(x) = \lambda_\gamma(x) + \mu_i(x).$$
A possible choice for $\mu_i(x)$, given the covariate $X_i$, is $\mu_i(x) = \beta^t X_i$.

So far our discussion has focused on the likelihood approach. Interestingly enough, the compensator appeared in the exponent of $L_n$. A direct analysis of the martingale itself has also found a lot of interest. Our approach


can be easily extended to non-identically distributed random variables. In such a situation the compensator equals, for $\theta = \theta_0$,
$$n^{-1}\sum_{i=1}^{n}\int_{(-\infty,t]}1_{\{x\le Y_i\}}\lambda_\theta^i(x)\,dx.$$
Since $\theta_0$ is unknown, we consider
$$\Delta_n^\theta(t) := F_n(t) - n^{-1}\sum_{i=1}^{n}\int_{(-\infty,t]}1_{\{x\le Y_i\}}\lambda_\theta^i(x)\,dx \qquad (9.5.2)$$
also for other $\theta$'s. The true parameter may therefore be characterized as the parameter which makes $\Delta_n^\theta(\cdot)$ a martingale. The process (9.5.2) is called the Martingale Residual Process.

So far we have assumed that the $Y$'s are observable. When censorship is present we consider the process $1_{\{Z_i\le t,\,\delta_i=1\}}$. Its compensator is given by
$$\int_{(-\infty,t]}\frac{1_{\{x\le Z_i\}}}{1-F_i(x-)}\,F_i(dx) = \int_{(-\infty,t]}1_{\{x\le Z_i\}}\lambda_\theta^i(x)\,dx.$$
The residual martingale becomes
$$\Delta_n^\theta(t) = H_n^1(t) - n^{-1}\sum_{i=1}^{n}\int_{(-\infty,t]}1_{\{x\le Z_i\}}\lambda_\theta^i(x)\,dx. \qquad (9.5.3)$$
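A small numerical sketch (ours, not from the text) may help fix ideas: assuming the simplest parametric model, a constant hazard $\lambda_\theta^i(x)=\theta$ on $[0,\infty)$, the residual process (9.5.3) can be evaluated on a grid; at $\theta=\theta_0$ it should fluctuate around zero like a martingale.

```python
import numpy as np

def martingale_residual_process(z, delta, lam, grid):
    """Residual process (9.5.3) for right-censored data under a constant
    hazard lambda_theta(x) = lam on [0, infinity) (exponential model).

    z, delta : observed times and censoring labels (1 = event observed)
    lam      : candidate hazard value theta
    grid     : time points t at which Delta_n^theta(t) is evaluated
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=float)
    out = []
    for t in grid:
        h1n = np.mean((z <= t) & (delta == 1))          # H_n^1(t)
        # integral of 1{x <= Z_i} * lam over (0, t] equals lam * min(Z_i, t)
        compensator = lam * np.mean(np.minimum(z, t))
        out.append(h1n - compensator)
    return np.array(out)

# simulated example: theta_0 = 0.5, independent exponential censoring
rng = np.random.default_rng(0)
y = rng.exponential(1 / 0.5, size=200)
c = rng.exponential(1 / 0.3, size=200)
z, delta = np.minimum(y, c), (y <= c).astype(int)
grid = np.linspace(0.1, 5.0, 20)
print(martingale_residual_process(z, delta, 0.5, grid))   # close to zero at the true theta
```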


Chapter 10

“Time To Event” Data

10.1 Sampling Designs in Survival Analysis

It is worthwhile recalling the scenario which gave rise to data censored from the right. After entry into a study an individual is observed over a period of time until a certain event can be observed. Due to an early end of the study or due to follow-up losses, censorship may then occur. To determine the exact value of $Z_i$, however, it is necessary to monitor the history of a patient without gaps. In a real life situation things may differ, in that patients appear for a check only at times $T_0 < T_1 < T_2 < \ldots$ and monitoring is only possible then. Such a situation gives rise to the following sampling design.

Example 10.1.1. Let $T_0 < T_1 < \ldots$ be an increasing sequence of monitoring times. Suppose that at time $T_j$ it became known that a default took place at time $R$ between $T_{j-1}$ and $T_j$. The associated indicators
$$\delta_i = \begin{cases} 1 & \text{if } T_{i-1} < R \le T_i\\ 0 & \text{else}\end{cases}$$
together with the monitoring times $T_i$ constitute the available information in the form $(T_k,\delta_k)$, $k = 1,2,\ldots$ In the above example, $\delta_j = 1$ while all other labels vanish. At time $t$, only those $(T_k,\delta_k)$ with $T_k\le t$ are available.

Figure 10.1.1: Lifetime Data (monitoring times $T_0 < T_1 < \ldots < T_5$ and default time $R$).


Example 10.1.2. Imagine one is interested in the age $Y$ at which a young child develops a certain ability. Since a correct measurement of $Y$ may require some experience, a group of children is observed over time in a kindergarten. Let $U$ denote the age of a child at which it entered the group, and let $V$ be the time of its exit. In such a situation we are at risk of not observing $Y$, the quantity of interest, for two possible reasons. If $Y < U$, the ability was already developed before entering the study, so that the available quantity is $U$. Similarly, if $V < Y$. Consequently, the observable quantity becomes
$$Z = \begin{cases} U & \text{if } Y < U\\ Y & \text{if } U\le Y\le V\\ V & \text{if } V < Y\end{cases}$$
together with its label
$$\delta = \begin{cases} 1 & \text{if } Y < U\\ 2 & \text{if } U\le Y\le V\\ 3 & \text{if } V < Y.\end{cases}$$
This sampling design is called Double-Censorship.

Example 10.1.3. In epidemiological studies one often faces so-called truncation effects. For example, in an AIDS study one may be interested in the time elapsed from infection until diagnosis. Suppose that the study ends at time $c$ and the available information is to be analyzed.

Figure 10.1.2: Truncated Data (diagnosis times $w_1, w_2, w_3$, incubation periods $Y_1, Y_2, Y_3$, and end of study $c$).

If, e.g., infection is known to be caused by blood transfusion, then $\tau$ is known for case one, as is $Y_1$. For obvious reasons, case three is not included, while case two has not been diagnosed at time $c$, so it also cannot be included. More formally, put
$$\tau = \text{time of infection}, \qquad w = \text{time of diagnosis},$$
and let $Y = w - \tau$ denote the incubation period. A case is included if and only if $w\le c$ or, equivalently, iff
$$Y \le U = c - \tau. \qquad (10.1.1)$$
The variable $U$ is called a truncation variable. If (10.1.1) is satisfied then both $Y$ and $U$ are observed. Compare this situation with right censorship, where either $Y$ or $C$ is known. Under right truncation, cases with a smaller $Y$ have a better chance to be included in the study. This is a kind of length-biasedness which should be respected in the statistical analysis.

Example 10.1.4. In the previous example a datum was available and included in the analysis whenever $Y\le U$. This required monitoring of all cases diagnosed before $c$. In the present example imagine that the study only begins at time $c$ and that only those cases are of interest where infection took place before $c$ while not being diagnosed at $c$. In Figure 10.1.2 there is only one such case, namely case two. Since $Y > U$ it cannot be observed at $c$. Since, compared with the previous example, $c$ is now the beginning and not the end of the study, we need to define the sampling plan after $c$. If the observational period is $[c,\infty)$, then each case will be registered with probability one, subject to $\tau + Y = w \ge c$. For such a $(Y,\tau)$ we write $(Y^0,\tau^0)$. We then have, for $t\le c$,
$$P(\tau^0\le t, Y^0\le y) = P(\tau\le t, Y\le y \mid \tau + Y \ge c).$$

Example 10.1.5. (Sampling in a Parallelogram). The situation now is close to the previous one, with one important difference. The observation period starting at $c$ is not $[c,\infty)$ but terminates at $c+a$. Set $T_1 = c$ and $T_2 = c+a$. We thus observe $\tau$ and $Y$ under the condition $\tau + Y\ge T_1$ only if $\tau + Y\le T_2$. Graphically, this means that we observe $(Y,\tau)$ only if it falls into a parallelogram as depicted below.


Figure 10.1.3: Sampling in a Parallelogram (axes $\tau$ and $Y$; the observable region lies between the lines $\tau+Y=T_1$ and $\tau+Y=T_2$).

Again the distribution of the observed $(Y^0,\tau^0)$ is an appropriate conditional distribution of $(Y,\tau)$:
$$P(\tau^0\le t, Y^0\le y) = P(\tau\le t, Y\le y \mid c+a \ge \tau + Y \ge c).$$

Example 10.1.6. In the situation described in Example 10.1.4 it may well be that although $\tau + Y > c$ the variable $Y$ is not observed, due to right censorship. Hence, given $\tau + Y\ge c$, only $(\tau,\min(Y,C),\delta)$ with $\delta = 1_{\{Y\le C\}}$ is observable. This situation is called left-truncation and right-censorship.

Example 10.1.7. It may be that data are available only at time $c$ but ongoing observation is not possible. In this case $Y$ is never observable. One possibility is to study the "age" at time $c$. In other words, of interest is the distribution
$$P(c-\tau\le t \mid \tau + Y \ge c).$$

Summary. In this section we reviewed several sampling situations in survival analysis. So far only right censorship was discussed in detail. When truncation comes into play, empirical estimators are substitutes for conditional probabilities. The main question will be one of reconstructing unconditional probabilities from these estimators.

10.2 Nonparametric Estimation in Counting Processes

The Single-Event Process
$$S_t = 1_{\{Y\le t\}}$$


and its modification
$$S_t = 1_{\{Z\le t,\,\delta=1\}}$$
are two simple examples of a type of stochastic process which in its general form is called a counting process. Another famous example is the Poisson process. Not to forget the extensions of $S_t$ to larger sample size.

The general concept of a counting process constitutes a perfect framework for modeling "Time to Event" data.

In its most general form a counting process is based on a sequence $T_1 < T_2 < T_3 < \ldots$ of increasing random variables such that $T_i\uparrow\infty$. This assumption guarantees that in any compact interval we only have finitely many jumps. If we only have finitely many points to distribute, the sequence is assumed to be finite. The associated counting process becomes
$$N_t = \sum_{i=1}^{\infty}1_{\{T_i\le t\}}.$$
It can be shown that every $(N_t)_t$ admits a compensator and thus an innovation martingale. In the examples discussed so far the compensator admitted an intensity of the form
$$x\mapsto 1_{\{x\le Y\}}\lambda(x) \quad\text{resp.}\quad x\mapsto 1_{\{x\le Z\}}\lambda(x),$$
where
$$\frac{dF}{1-F_-} = \lambda\,dx.$$
For a time-homogeneous Poisson process the intensity is a constant $\lambda$, while a time-heterogeneous Poisson process has a deterministic intensity $\lambda = \lambda(x)$. These examples motivated researchers to study point processes with compensators which are multiplicative in the following sense.

Definition 10.2.1. A counting process is said to admit a multiplicative compensator if there is a deterministic function $\lambda = \lambda(x)$ and a predictable process $Y = Y(x)$ such that
$$M_t := N_t - \int_{(-\infty,t]}\lambda(x)Y(x)\,dx \qquad (10.2.1)$$
is a martingale.


In the case of the Single-Event process, $Y(x) = 1_{\{x\le Y\}}$ and $\lambda(x)$ is the hazard function of $F$.

In this section we study the problem of estimating the function
$$\Lambda(t) = \int_{(-\infty,t]}\lambda(x)\,dx.$$
First work on this problem was due to Nelson (1969) and Aalen (1975). Therefore the resulting estimator is called the Nelson-Aalen estimator.

To start with, rewrite (10.2.1) in terms of differentials:
$$dM_t = dN_t - \lambda Y\,dt$$
or
$$\lambda Y\,dt = dN_t - dM_t. \qquad (10.2.2)$$
If $Y(t)>0$, then
$$\lambda(t)\,dt = Y^{-1}(t)\,dN_t - Y^{-1}(t)\,dM_t \qquad (10.2.3)$$
and therefore
$$\Lambda(t) = \int_{(-\infty,t]}Y^{-1}(x)\,N(dx) - \int_{(-\infty,t]}Y^{-1}(x)\,M(dx).$$
Now, $(M_t)_t$ is a martingale and $Y$ is predictable. It follows that the second process is a centered martingale. Neglecting this noisy part, the estimator of $\Lambda$ becomes
$$\hat\Lambda(t) \equiv \int_{(-\infty,t]}Y^{-1}(x)\,N(dx) = \sum_{T_i\le t}Y^{-1}(T_i).$$
If $Y(x)$ may attain the value zero we need to introduce the function $J(t) = 1_{\{Y(t)>0\}}$ and consider the equation
$$J(t)\lambda(t)\,dt = \frac{J(t)}{Y(t)}\,dN_t - \frac{J(t)}{Y(t)}\,dM_t.$$
The estimator
$$\hat\Lambda(t) = \int_{(-\infty,t]}\frac{J(x)}{Y(x)}\,dN_x$$


therefore is an estimator of
$$\Lambda^*(t) = \int_{(-\infty,t]}J(x)\lambda(x)\,dx.$$
We have
$$\Lambda^*(t) \sim E\hat\Lambda(t) = \int_{(-\infty,t]}\lambda(x)\,P(Y(x)>0)\,dx.$$
Therefore, in general, $\hat\Lambda(t)$ is a biased estimator of $\Lambda(t)$. The bias becomes small when $P(Y(x)=0)$ is small for $x\le t$.

When we have independent replications of $(N_t)_t$ the procedure needs to be slightly modified. In this case the arguments for (10.2.2) and (10.2.3) now yield
$$\lambda(t)\left(\sum_{i=1}^{n}Y_i(t)\right)dt = d\left(\sum_{i=1}^{n}N_t^i\right) - d\left(\sum_{i=1}^{n}M_t^i\right)$$
and finally
$$\hat\Lambda(t) = \int_{(-\infty,t]}\frac{J(x)}{\sum_{i=1}^{n}Y_i(x)}\,d\left(\sum_{i=1}^{n}N_x^i\right).$$
For right-censored data,
$$Y_i(x) = 1_{\{Z_i\ge x\}} \quad\text{and}\quad N_i(x) = 1_{\{Z_i\le x,\,\delta_i=1\}}.$$
Conclude that
$$\hat\Lambda(t) = \int_{(-\infty,t]}\frac{J(x)}{1-H_n(x-)}\,H_n^1(dx).$$
The correction through $J(x)$ is only necessary for $x > \max(Z_i : 1\le i\le n)$. But $H_n^1$ has no mass there, so that after all
$$\hat\Lambda(t) = \int_{(-\infty,t]}\frac{1}{1-H_n(x-)}\,H_n^1(dx).$$
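For concreteness, here is a short Python sketch of this last formula for right-censored data (the code and the name `nelson_aalen` are ours, not part of the text); $1-H_n(x-)$ is simply the number of observations still at risk divided by $n$, so the estimator jumps by one over the number at risk at each event time.

```python
import numpy as np

def nelson_aalen(z, delta):
    """Nelson-Aalen estimator Lambda_hat evaluated at the ordered event times.

    z     : observed times Z_i = min(Y_i, C_i)
    delta : censoring labels (1 = event observed)
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    order = np.argsort(z, kind="stable")
    z, delta = z[order], delta[order]
    n = len(z)

    at_risk = n - np.arange(n)            # number of Z_j >= Z_i (no ties assumed)
    jumps = delta / at_risk               # increment of Lambda_hat at each event
    cumhaz = np.cumsum(jumps)
    return z[delta == 1], cumhaz[delta == 1]

# example
times, cumhaz = nelson_aalen([2.0, 3.5, 1.0, 4.2, 2.8], [1, 0, 1, 1, 0])
print(times, cumhaz)
```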

10.3 Nonparametric Testing in Counting Processes

In this section we apply the concepts developed so far to testing hypotheses. For this, imagine we observe counting processes $N_1, N_2,\ldots,N_k$ with $k$ fixed.


Each $N_j$ may itself be a sum of individual counting processes, as, e.g.,
$$N(t) = \sum_{i=1}^{n}1_{\{Z_i\le t,\,\delta_i=1\}}.$$
The best is to view each $N_j$ as the result of several measurements under therapy $j$, $1\le j\le k$. We assume that each $N_j$ admits a multiplicative intensity $\lambda_j Y_j$, i.e., the innovation martingale of $N_j$ equals
$$M_j(t) = N_j(t) - \int_{(-\infty,t]}\lambda_j(x)Y_j(x)\,dx.$$
Denote with
$$\Lambda_j(t) = \int_{(-\infty,t]}\lambda_j(x)\,dx$$
the cumulative hazard function associated with $\lambda_j$. One question which has been studied in detail in the literature is how to test the hypothesis
$$H_0 : \lambda_1 = \ldots = \lambda_k = \lambda_0$$
with $\lambda_0$ unspecified.

In the following we assume that $N_1,\ldots,N_k$ are independent. The first step in our analysis is to form the "Grand Sum Process"
$$N_\bullet = \sum_{j=1}^{k}N_j.$$
Similarly,
$$Y_\bullet = \sum_{j=1}^{k}Y_j.$$
Denote with $\mathcal F_t = \sigma(\mathcal F_{1t},\ldots,\mathcal F_{kt})$, where $(\mathcal F_{jt})_t$ is the filtration pertaining to the $j$-th process. By the assumed independence, the innovation martingale of $N_\bullet$ becomes $M_\bullet$, defined through
$$dM_\bullet = dN_\bullet - \left[\sum_{j=1}^{k}\lambda_j Y_j\right]dt \equiv dN_\bullet - (\lambda Y)_\bullet\,dt.$$


This leads to
$$dN_\bullet = (\lambda Y)_\bullet\,dt + dM_\bullet.$$
For each single $j$ we similarly obtain
$$dN_j = \lambda_j Y_j\,dt + dM_j, \qquad 1\le j\le k,$$
respectively
$$\frac{dN_j}{Y_j} = \lambda_j\,dt + \frac{dM_j}{Y_j} \qquad (\text{if } Y_j(t)>0). \qquad (10.3.1)$$
The crucial argument comes now: under $H_0$,
$$(\lambda Y)_\bullet = \lambda_0 Y_\bullet,$$
whence
$$\frac{dN_\bullet}{Y_\bullet} = \lambda_0\,dt + \frac{dM_\bullet}{Y_\bullet}. \qquad (10.3.2)$$
Under the null hypothesis, the drift parts in (10.3.1) and (10.3.2) coincide, so that $dN_j/Y_j$ and $dN_\bullet/Y_\bullet$ only differ in their martingale parts. The difference between these two processes can be weighted appropriately and then integrated. We then come up with a linear statistic
$$Z_j(t) = \int_{(-\infty,t]}K_j(x)J_j(x)\left[\frac{N_j(dx)}{Y_j(x)} - \frac{N_\bullet(dx)}{Y_\bullet(x)}\right].$$
Under the null hypothesis, each $Z_j$ is "almost" centered. We may next consider the vector-valued process
$$Z_t = (Z_1(t),\ldots,Z_k(t))$$
and define a $\chi^2$-type statistic
$$\chi^2 \equiv Z_t\hat\Sigma_t^{-1}Z_t^t,$$
where $\hat\Sigma_t$ is an estimator of the covariance of $Z_t$. In the case that each $N_j$ is a sum of $n_j$ independent processes, some asymptotic analysis shows that in the limit $\chi^2$ has a $\chi^2$-distribution with $k-1$ degrees of freedom (without proof).

Most popular weight functions are of the type
$$K_j(x) = K(x)Y_j(x).$$


Here $K$ is a predictable process only depending on $N_j$ and $Y_\bullet$. For such a $K_j$ we get
$$\begin{aligned}
Z_j(t) &= \int_{(-\infty,t]}K(x)\,N_j(dx) - \int_{(-\infty,t]}K(x)\,\frac{Y_j(x)}{Y_\bullet(x)}\,N_\bullet(dx)\\
&= \int_{(-\infty,t]}K(x)\,M_j(dx) - \int_{(-\infty,t]}K(x)\,\frac{Y_j(x)}{Y_\bullet(x)}\,M_\bullet(dx)\\
&= \sum_{l=1}^{k}\int_{(-\infty,t]}K(x)\left[\delta_{lj} - \frac{Y_j(x)}{Y_\bullet(x)}\right]M_l(dx).
\end{aligned}$$
Here $\delta_{lj}$ is the Kronecker symbol. In the literature, two of several choices for $K$ are

• $K(x) = 1_{\{Y_\bullet(x)>0\}}$

• $K(x) = Y_\bullet(x)$.

We study $Z_j$ in a little greater detail for right-censored data. In this case
$$Z_j(t) = n_j\int_{(-\infty,t]}K(x)\,H^1_{n_j}(dx) - \sum_{l=1}^{k}\frac{n_j n_l}{N}\int_{(-\infty,t]}K(x)\,\frac{1-H_{n_j}(x-)}{1-H_\bullet(x-)}\,H^1_{n_l}(dx),$$
with $N = n_1+n_2+\ldots+n_k$. For $K = 1-H_{\bullet-}$ the term simplifies to become
$$Z_j(t) = n_j\int_{(-\infty,t]}(1-H_\bullet(x-))\,H^1_{n_j}(dx) - \sum_{l=1}^{k}\frac{n_j n_l}{N}\int_{(-\infty,t]}(1-H_{n_j}(x-))\,H^1_{n_l}(dx).$$
All integrals are empiricals which can be readily computed.
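The following Python sketch (ours; `k_sample_Z` is a hypothetical name) evaluates the simplified statistic $Z_j(t)$ for the choice $K = 1-H_\bullet(\cdot-)$ directly from the sub-distribution empiricals, assuming no ties between observations.

```python
import numpy as np

def edf_left(points, x):
    """Empirical d.f. of `points` evaluated at x- (left limit), vectorized in x."""
    points = np.sort(np.asarray(points, dtype=float))
    return np.searchsorted(points, x, side="left") / len(points)

def k_sample_Z(samples, t):
    """Vector (Z_1(t), ..., Z_k(t)) for the weight K = 1 - H_bullet(.-).

    samples : list of (z, delta) pairs, one per therapy group; z observed times,
              delta censoring labels (1 = event observed)
    """
    ns = np.array([len(z) for z, _ in samples], dtype=float)
    N = ns.sum()
    pooled = np.concatenate([np.asarray(z, dtype=float) for z, _ in samples])

    Z = []
    for j, (zj, dj) in enumerate(samples):
        zj = np.asarray(zj, dtype=float); dj = np.asarray(dj, dtype=int)
        ev_j = zj[(dj == 1) & (zj <= t)]                  # event times of group j up to t
        # n_j * integral of (1 - H_bullet(x-)) dH^1_{n_j}
        first = np.sum(1.0 - edf_left(pooled, ev_j))
        # sum over l of (n_j n_l / N) * integral of (1 - H_{n_j}(x-)) dH^1_{n_l}
        second = 0.0
        for zl, dl in samples:
            zl = np.asarray(zl, dtype=float); dl = np.asarray(dl, dtype=int)
            ev_l = zl[(dl == 1) & (zl <= t)]
            second += ns[j] / N * np.sum(1.0 - edf_left(zj, ev_l))
        Z.append(first - second)
    return np.array(Z)

# example with three small groups; under H_0 the components are roughly centered
g1 = ([1.2, 2.3, 3.1], [1, 1, 0])
g2 = ([0.8, 2.9, 4.0], [1, 0, 1])
g3 = ([1.5, 3.3, 3.9], [0, 1, 1])
print(k_sample_Z([g1, g2, g3], t=3.5))
```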

Now, if $n_j\to\infty$ such that
$$\frac{n_j}{N}\to\mu_j \quad\text{for } 1\le j\le k,$$
then
$$\lim_{n_j\to\infty}\frac{Z_j(t)}{N} = \mu_j\int_{(-\infty,t]}(1-H_\bullet(x-))\,H^1_j(dx) - \sum_{l=1}^{k}\mu_j\mu_l\int_{(-\infty,t]}(1-H_j(x-))\,H^1_l(dx).$$


Under the null hypothesis the limit should be zero. Actually,
$$dH^1_l = (1-G_{l-})\,dF_l = (1-H_{l-})\,d\Lambda_l = (1-H_{l-})\,d\Lambda_0,$$
whence
$$\sum_{l=1}^{k}\mu_l\,dH^1_l = (1-H_{\bullet-})\,d\Lambda_0.$$
Conclude
$$\sum_{l=1}^{k}\mu_l\int_{(-\infty,t]}(1-H_{j-})\,dH^1_l = \int_{(-\infty,t]}(1-H_{j-})(1-H_{\bullet-})\,d\Lambda_0 = \int_{(-\infty,t]}(1-H_{\bullet-})\,dH^1_j.$$

10.4 Maximum Likelihood Procedures

Recall that when we observe independent identically distributed random variables $Y_1,\ldots,Y_n$ with hazard function $\lambda_\theta$, $\theta = \theta_0$, then the likelihood function becomes
$$L_n(\theta) = \prod_{i=1}^{n}\lambda_\theta(Y_i)\exp\left[-\int_{(-\infty,Y_i]}\lambda_\theta(t)\,dt\right].$$
Taking logarithms leads to
$$\ln L_n(\theta) = \sum_{i=1}^{n}\left[\ln\lambda_\theta(Y_i) - \int_{-\infty}^{\infty}1_{\{t\le Y_i\}}\lambda_\theta(t)\,dt\right]. \qquad (10.4.1)$$
For a general counting process $N$ with multiplicative intensity we get
$$\ln L_n(\theta) = \int\ln\lambda_\theta(t)\,N(dt) - \int Y(t)\lambda_\theta(t)\,dt. \qquad (10.4.2)$$
The analysis of the MLE is more or less parallel to the classical theory. For right-censored data, (10.4.2) takes on the form
$$\ln L_n(\theta) = \sum_{i=1}^{n}\left[\delta_i\ln\lambda_\theta(Z_i) - \int_{(-\infty,Z_i]}\lambda_\theta(t)\,dt\right].$$
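Before turning to covariates, here is a minimal numerical sketch (ours, not from the text) of this censored log-likelihood for the simplest parametric choice, a constant hazard $\lambda_\theta(x)=\theta$ on $[0,\infty)$, for which the maximizer has a closed form: the number of observed events divided by the total time at risk.

```python
import numpy as np

def censored_loglik(theta, z, delta):
    """ln L_n(theta) for the constant hazard lambda_theta(x) = theta, x >= 0."""
    z = np.asarray(z, float); delta = np.asarray(delta, int)
    return np.sum(delta * np.log(theta) - theta * z)

z = np.array([1.2, 0.7, 3.4, 2.2, 0.9])
delta = np.array([1, 0, 1, 1, 0])
theta_hat = delta.sum() / z.sum()                      # closed-form maximizer
grid = np.linspace(0.05, 2.0, 400)                     # grid maximization as a check
print(theta_hat, grid[np.argmax([censored_loglik(t, z, delta) for t in grid])])
```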


Of more interest is the case when $\lambda_\theta$ depends on covariates. The best studied case is
$$\lambda_\theta(x) \equiv \lambda_\theta^i(x) \equiv \lambda_\gamma(x)\exp[\beta^t X_i]. \qquad (10.4.3)$$
Since
$$\frac{\lambda_\theta^i(x)}{\lambda_\theta^j(x)} = \exp\left[\beta^t(X_i-X_j)\right]$$
does not depend on $x$, this model is called the Cox Proportional Hazards Model. The partial log-likelihood function becomes
$$\ln L_n(\theta) = \sum_{i=1}^{n}\left[\delta_i\ln\lambda_\gamma(Z_i) + \delta_i\beta^t X_i - \exp(\beta^t X_i)\int_{-\infty}^{Z_i}\lambda_\gamma(t)\,dt\right]. \qquad (10.4.4)$$
If the baseline function $\lambda_\gamma$ does not depend on a finite-dimensional parameter $\gamma$ but may be any function, we arrive at a semiparametric model. In this case $\lambda_\gamma$ needs to be replaced with $\lambda = \lambda(t)$, and maximization takes place w.r.t. $\beta$ and $\lambda$.

The following presents a detailed analysis of the Cox model. For a reference, see Tsiatis (1981). Assume that
$$\lambda_\theta^i(x) \equiv \lambda_\theta(x) = \lambda(x)\exp[\beta^t X_i]$$
and that
$$Y_i \text{ and } C_i \text{ are independent conditionally on } X_i. \qquad (10.4.5)$$
This implies that the conditional d.f.'s of $Y$ and $C$,
$$F(y|x) \equiv P(Y\le y\mid X=x) \quad\text{and}\quad G(y|x) \equiv P(C\le y\mid X=x),$$
satisfy
$$1-H(y|x) = (1-F(y|x))(1-G(y|x)),$$
where
$$H(y|x) = P(Z\le y\mid X=x).$$
The hazard function of $F(y|x)$ is by assumption equal to (10.4.3), namely $\lambda(y)\exp(\beta^t x)$. There will be no conditions on $G(\cdot|x)$. Similar to the unconditional case, we have to introduce the (conditional) sub-distribution
$$H^1(y|x) = P(Z\le y, \delta=1\mid X=x).$$


Because of independence,
$$\begin{aligned}
H^1(y|x) &= E\left[1_{\{Z\le y,\,Y\le C\}}\mid X=x\right] = \int_{(-\infty,y]}[1-G(z-|x)]\,F(dz|x)\\
&= \int_{(-\infty,y]}[1-H(z-|x)]\,\lambda(z)\exp(\beta^t x)\,dz\\
&= \exp(\beta^t x)\int_{(-\infty,y]}[1-H(z-|x)]\,\lambda(z)\,dz. \qquad (10.4.6)
\end{aligned}$$
As before, denote with
$$H^1(y) = P(Z\le y, \delta=1)$$
the unconditional sub-distribution and let $c = c(x)$, $x\in\mathbb R^p$, be the density of the $p$-dimensional covariate $X$. Then
$$H^1(y) = \int H^1(y|x)\,c(x)\,dx = \int\int_{(-\infty,y]}\exp(\beta^t x)[1-H(z-|x)]\,\lambda(z)\,c(x)\,dz\,dx. \qquad (10.4.7)$$

Lemma 10.4.1. The function $H^1$ is differentiable in $y$, and its derivative equals
$$[H^1(y)]' = \lambda(y)\int[1-H(y-|x)]\exp(\beta^t x)\,c(x)\,dx.$$

Proof. This follows from (10.4.7).

In the following, let $\phi$ be any function defined on $\mathbb R^p$. Set
$$E_\phi(y) \equiv E[\phi(X)1_{\{Z\ge y\}}] = E[\phi(X)P(Z\ge y\mid X)] = \int\phi(x)[1-H(y-|x)]\,c(x)\,dx.$$
Finally, put
$$E^1_\phi(y) = E[\phi(X)1_{\{Z\le y,\,\delta=1\}}] = E[\phi(X)H^1(y|X)] = \int\phi(x)H^1(y|x)\,c(x)\,dx.$$


By (10.4.6),
$$E^1_\phi(y) = \int\int_{(-\infty,y]}\phi(x)[1-H(z-|x)]\exp(\beta^t x)\,\lambda(z)\,dz\,c(x)\,dx.$$

Lemma 10.4.2. The function $E^1_\phi$ is differentiable with derivative
$$[E^1_\phi(y)]' = \int\phi(x)[1-H(y-|x)]\,\lambda(y)\exp(\beta^t x)\,c(x)\,dx.$$

Proof. As before.

If we take for $\phi$ the function
$$\phi_0(x) = \exp(\beta^t x)$$
we get two important relations.

Corollary 10.4.3.
$$\lambda(y) = \frac{[H^1(y)]'}{E_{\phi_0}(y)}. \qquad (10.4.8)$$

Equation (10.4.8) provides us with a representation of the unknown baseline function in terms of two quantities which will be estimable. The proof of (10.4.8) is a direct consequence of Lemma 10.4.1 and the definition of $E_{\phi_0}$.

Furthermore, we get for any $\phi$
$$[E^1_\phi(y)]' = \frac{[H^1(y)]'\,E_{\phi\phi_0}(y)}{E_{\phi_0}(y)}, \qquad (10.4.9)$$
which is an easy consequence of Lemma 10.4.1, Lemma 10.4.2 and the definitions of the involved functions.

It is time to go back to the partial log-likelihood function (10.4.4). Write
$$X_i = (X_i^1,\ldots,X_i^p)^t.$$
If we take the partial derivative of (10.4.4) w.r.t. $\beta_j$, $1\le j\le p$, we obtain
$$\frac{\partial\ln L_n}{\partial\beta_j} = \sum_{i=1}^{n}\left[\delta_i X_i^j - X_i^j\exp(\beta^t X_i)\int_{(-\infty,Z_i]}\lambda(t)\,dt\right]. \qquad (10.4.10)$$


Denote with $\pi_j$ the $j$-th projection of $\mathbb R^p$ onto $\mathbb R$. With $\phi = \pi_j$, (10.4.10) takes on the form
$$\frac{\partial\ln L_n}{\partial\beta_j} = \sum_{i=1}^{n}\left[\delta_i\phi(X_i) - \phi(X_i)\exp(\beta^t X_i)\int_{(-\infty,Z_i]}\lambda(t)\,dt\right]. \qquad (10.4.11)$$
The right-hand side also makes sense for $\beta$'s other than the true parameter. Therefore, from now on, we denote the true parameter with $\beta_0$. From (10.4.8) and (10.4.9) we get, by Fubini,
$$\begin{aligned}
E\left[\phi(X_i)\phi_0(X_i)\int_{(-\infty,Z_i]}\lambda(t)\,dt\right]
&= \int_{-\infty}^{\infty}\lambda(t)\,E\left[\phi(X_i)\phi_0(X_i)1_{\{t\le Z_i\}}\right]dt\\
&= \int_{-\infty}^{\infty}\lambda(t)\,E_{\phi\phi_0}(t)\,dt = \int_{-\infty}^{\infty}[E^1_\phi(t)]'\,dt\\
&= E^1_\phi(\infty) - E^1_\phi(-\infty) = E\left[\phi(X)1_{\{\delta=1\}}\right].
\end{aligned}$$
Conclude

Lemma 10.4.4. At $\beta = \beta_0$ we have
$$E\left[\frac{\partial\ln L_n}{\partial\beta_j}\right] = 0 \quad\text{for } 1\le j\le p.$$

The lemma asserts that $\beta = \beta_0$ is a solution of a certain equation. It suggests that $\beta_0$ should be estimated by solving
$$\frac{\partial\ln L_n}{\partial\beta_j} = 0 \quad\text{for } 1\le j\le p.$$
Unfortunately, (10.4.11) contains the unknown function $\lambda$. A possible way out of this dilemma is to use (10.4.8). From this one gets
$$\int_{(-\infty,Z_i]}\lambda(t)\,dt = \int_{(-\infty,Z_i]}\frac{1}{E_{\phi_0}(t)}\,H^1(dt).$$
$H^1$ and $E_{\phi_0}$ may be replaced with
$$H_n^1(y) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Z_i\le y,\,\delta_i=1\}}$$


and
$$\hat E(y) = \frac{1}{n}\sum_{i=1}^{n}\exp(\beta^t X_i)1_{\{Z_i\ge y\}},$$
which still contains the parameter $\beta$. This leads to a term $l_j(\beta)$ replacing $\frac{\partial\ln L_n}{\partial\beta_j}$ in the following way:
$$\begin{aligned}
l_j(\beta) &\equiv \sum_{i=1}^{n}\left[\delta_i X_i^j - X_i^j\exp(\beta^t X_i)\,\frac{1}{n}\sum_{k=1}^{n}\delta_k 1_{\{Z_k\le Z_i\}}\hat E^{-1}(Z_k)\right]\\
&= \sum_{i=1}^{n}\delta_i X_i^j - \frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{n}\delta_k X_i^j\exp(\beta^t X_i)1_{\{Z_k\le Z_i\}}\hat E^{-1}(Z_k)\\
&= \sum_{i=1}^{n}\delta_i X_i^j - \sum_{k=1}^{n}\delta_k\,\frac{\sum_{i=1}^{n}X_i^j\exp(\beta^t X_i)1_{\{Z_i\ge Z_k\}}}{\sum_{i=1}^{n}\exp(\beta^t X_i)1_{\{Z_i\ge Z_k\}}}.
\end{aligned}$$
We need to solve
$$l_j(\beta) = 0 \quad\text{for } 1\le j\le p.$$
Denote with $\hat\beta = (\hat\beta_1,\ldots,\hat\beta_p)^t$ its solution.
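These score equations are exactly the equations solved by Cox's partial-likelihood estimator. The following sketch (ours; `cox_score` and `solve_beta` are hypothetical names, and the fixed-step ascent is only one simple choice of solver, not the text's algorithm) illustrates how $\hat\beta$ can be computed numerically on simulated data.

```python
import numpy as np

def cox_score(beta, Z, delta, X):
    """Score vector with components l_j(beta) as defined in the text."""
    w = np.exp(X @ beta)                                # exp(beta^t X_i)
    score = (delta[:, None] * X).sum(axis=0)
    for k in range(len(Z)):
        if delta[k] == 1:
            risk = Z >= Z[k]                            # risk set at Z_k
            score -= (w[risk][:, None] * X[risk]).sum(axis=0) / w[risk].sum()
    return score

def solve_beta(Z, delta, X, iters=200, step=0.1):
    """Crude fixed-step ascent beta <- beta + step * l(beta)/n."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = beta + step * cox_score(beta, Z, delta, X) / len(Z)
    return beta

# simulated data from a Cox model with beta_0 = (0.5, -0.5) and baseline hazard 1
rng = np.random.default_rng(1)
n, beta0 = 500, np.array([0.5, -0.5])
X = rng.normal(size=(n, 2))
Y = rng.exponential(1.0 / np.exp(X @ beta0))
C = rng.exponential(2.0, size=n)
Z, delta = np.minimum(Y, C), (Y <= C).astype(int)
print(solve_beta(Z, delta, X))                          # roughly (0.5, -0.5)
```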

The rest of this section is a little technical. The lemmas will be needed to prove consistency of $\hat\beta$.

Lemma 10.4.5. With probability one, we have
$$\sup_{y\in\mathbb R}|H_n^1(y)-H^1(y)|\to 0, \qquad (10.4.12)$$
$$\sup_{y\in\mathbb R}|\hat E_\phi(y)-E_\phi(y)|\to 0, \qquad (10.4.13)$$
and
$$\sup_{y\in\mathbb R}|\hat E^1_\phi(y)-E^1_\phi(y)|\to 0. \qquad (10.4.14)$$
Here $\hat E_\phi$ and $\hat E^1_\phi$ are the estimators of $E_\phi$ and $E^1_\phi$, respectively.

Proof. Follows from the Strong Law of Large Numbers and its extension to uniform convergence.

Lemma 10.4.6. For each $\beta$, with probability one,
$$\lim_{n\to\infty}\frac{1}{n}\,l_j(\beta) = E^1_{\pi_j}(\infty) - \int\frac{E_{\pi_j\phi}(y)}{E_\phi(y)}\,H^1(dy),$$
where $\phi(x) = \exp(\beta^t x)$.


Recall that the right-hand side in Lemma 10.4.6 equals zero for $\beta = \beta_0$.

Theorem 10.4.7 (Tsiatis). With probability one,
$$\lim_{n\to\infty}\hat\beta = \beta_0.$$

The last point to discuss is estimation of $\lambda$. We already showed that
$$\lambda(t)\,dt = E^{-1}_{\phi_0}\,dH^1.$$
From this the empirical estimate of $\Lambda$ becomes
$$\hat\Lambda(t) \equiv \int_{(-\infty,t]}\hat E^{-1}_{\phi_0}\,dH_n^1 = \frac{1}{n}\sum_{i=1}^{n}\delta_i\hat E^{-1}_{\phi_0}(Z_i)1_{\{Z_i\le t\}} = \sum_{i=1}^{n}\frac{\delta_i 1_{\{Z_i\le t\}}}{\sum_{k=1}^{n}\exp(\hat\beta^t X_k)1_{\{Z_k\ge Z_i\}}}.$$
From the previous results we obtain

Corollary 10.4.8. For each $t\in\mathbb R$ we have, with probability one,
$$\lim_{n\to\infty}\hat\Lambda(t) = \Lambda(t).$$

We finally discuss the problem of estimating the case-specific survival probabilities
$$P(Y>t\mid X=x) = 1-F(t|x).$$
Under the Cox model,
$$1-F(t|x) = \exp\left[-\int_{(-\infty,t]}\lambda(s)\exp(\beta^t x)\,ds\right] = \exp\left[-\exp(\beta^t x)\int_{(-\infty,t]}\lambda(s)\,ds\right].$$
Hence the estimator becomes
$$1-\hat F(t|x) = \exp\left[-\exp(\hat\beta^t x)\hat\Lambda(t)\right].$$
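Continuing the illustrative sketch above (again ours, with hypothetical names and with `Z, delta, X` as in the previous example), the estimator $\hat\Lambda$ and the conditional survival function $1-\hat F(t|x)$ can be computed directly from $\hat\beta$:

```python
import numpy as np

def baseline_cumhaz(t, beta_hat, Z, delta, X):
    """Lambda_hat(t): sum over events with Z_i <= t of
    1 / sum_k exp(beta_hat^t X_k) 1{Z_k >= Z_i}."""
    w = np.exp(X @ beta_hat)
    total = 0.0
    for i in range(len(Z)):
        if delta[i] == 1 and Z[i] <= t:
            total += 1.0 / w[Z >= Z[i]].sum()
    return total

def cond_survival(t, x, beta_hat, Z, delta, X):
    """Estimated P(Y > t | X = x) under the Cox model."""
    return np.exp(-np.exp(x @ beta_hat) * baseline_cumhaz(t, beta_hat, Z, delta, X))
```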


10.5 Right-Truncation

Let $Y_1,\ldots,Y_N$ be a sequence of i.i.d. random variables from some unknown d.f. $F$. Under right-truncation one observes $Y_i$ only if it falls below a threshold $Z_i$, $1\le i\le N$. Here, also $Z_1,\ldots,Z_N$ are i.i.d. from some unknown d.f. $G$ such that for each $1\le i\le N$
$$Y_i \text{ and } Z_i \text{ are independent.} \qquad (10.5.1)$$
If $Y_i\le Z_i$ then both $Y_i$ and $Z_i$ are observed. Denote with
$$n := \sum_{i=1}^{N}1_{\{Y_i\le Z_i\}}$$
the (random) number of actually observed pairs $(Y_i,Z_i)$ with $Y_i\le Z_i$. The number $N$ typically is unknown, so that the truncation probability $1-\alpha$ with
$$\alpha = P(Y_1\le Z_1)$$
cannot simply be estimated by
$$1-\hat\alpha = \frac{1}{N}\sum_{i=1}^{N}1_{\{Y_i>Z_i\}}.$$
Denote the actually observed pairs with $(Y_i^0,Z_i^0)$, i.e.,
$$(Y_i^0,Z_i^0) = (Y_i,Z_i) \quad\text{if } Y_i\le Z_i. \qquad (10.5.2)$$
By the constraint in (10.5.2) we only observe a pair $(Y_i,Z_i)$ if it falls above the line $y = z$. We thus see that again we have only information in partial form. To perform a proper statistical analysis of $(Y_i^0,Z_i^0)$, $1\le i\le n$, say, we need to study the relation between the observed and the original pairs. First, for each real $t$,
$$\{Y\le t,\,Y\le Z\} = \{Y^0\le t,\,Y\le Z\},$$
i.e., taking intersections with $\{Y\le Z\}$, we know that $Y\le t$ iff $Y^0\le t$. Furthermore, the events we can observe at time $t$ are of the type
$$\mathcal G_n(t) = \sigma\left(\{Y_i^0 < s < Z_i^0\},\ \{s\le Y_i^0\le Z_i^0\} : t\le s,\ 1\le i\le n\right).$$
Note that the filtration is nonincreasing as $t\uparrow$. Hence, rather than with martingales, we have to deal with reverse-time martingales. We first derive the Doob-Meyer decomposition of the process
$$t\mapsto 1_{\{t\le Y\le Z\}}.$$


As usual we first consider the discrete case. So, let
$$t = t_{n+1} < t_n < t_{n-1} < \ldots < t_1 < \infty = t_0$$
be a finite grid. Then we have
$$1_{\{t_k\le Y\le Z\}} = 1_{\{t_{k-1}\le Y\le Z\}} + 1_{\{t_k\le Y<t_{k-1}<Z\}} + 1_{\{t_k\le Y\le Z\le t_{k-1},\,Y<t_{k-1}\}}.$$
The first term is predictable. Because of the reverse Markov property, the conditional expectation of the second term equals
$$E\left[1_{\{t_k\le Y<t_{k-1}<Z\}}\mid\mathcal G_1(t_{k-1})\right] = 1_{\{Y<t_{k-1}<Z\}}\,\frac{P(t_k\le Y<t_{k-1}<Z)}{P(Y<t_{k-1}<Z)}.$$
Similarly, we obtain
$$E\left[1_{\{t_k\le Y\le Z\le t_{k-1},\,Y<t_{k-1}\}}\mid\mathcal G_1(t_{k-1})\right] = 1_{\{Y\le Z\le t_{k-1},\,Y<t_{k-1}\}}\,\frac{P(t_k\le Y\le Z\le t_{k-1},\,Y<t_{k-1})}{P(Y\le Z\le t_{k-1},\,Y<t_{k-1})}.$$
Summing up, we get
$$\begin{aligned}
E\left[1_{\{t_k\le Y\le Z\}}\mid\mathcal G_1(t_{k-1})\right] &= 1_{\{t_{k-1}\le Y\le Z\}}\\
&\quad + 1_{\{Y<t_{k-1}<Z\}}\,\frac{P(t_k\le Y<t_{k-1}<Z)}{P(Y<t_{k-1}<Z)}\\
&\quad + 1_{\{Y\le Z\le t_{k-1},\,Y<t_{k-1}\}}\,\frac{P(t_k\le Y\le Z\le t_{k-1},\,Y<t_{k-1})}{P(Y\le Z\le t_{k-1},\,Y<t_{k-1})}.
\end{aligned}$$
In the Doob-Meyer decomposition we get for the martingale part
$$\begin{aligned}
M_k &= M_{k-1} + 1_{\{t_k\le Y<t_{k-1},\,Y\le Z\}}\\
&\quad - 1_{\{Y<t_{k-1}<Z\}}\,\frac{P(t_k\le Y<t_{k-1}<Z)}{P(Y<t_{k-1}<Z)} - 1_{\{Y\le Z\le t_{k-1},\,Y<t_{k-1}\}}\,\frac{P(t_k\le Y\le Z\le t_{k-1},\,Y<t_{k-1})}{P(Y\le Z\le t_{k-1},\,Y<t_{k-1})}.
\end{aligned}$$
By induction,
$$\begin{aligned}
M_k &= 1_{\{t_k\le Y\le Z\}} - \sum_{j=0}^{k-1}1_{\{Y<t_j<Z\}}\,\frac{P(t_{j+1}\le Y<t_j<Z)}{P(Y<t_j<Z)} \qquad (10.5.3)\\
&\quad - \sum_{j=0}^{k-1}1_{\{Y\le Z\le t_j,\,Y<t_j\}}\,\frac{P(t_{j+1}\le Y\le Z\le t_j,\,Y<t_j)}{P(Y\le Z\le t_j,\,Y<t_j)}. \qquad (10.5.4)
\end{aligned}$$


Now, since $Y$ and $Z$ are independent, we have
$$\frac{P(t_{j+1}\le Y<t_j<Z)}{P(Y<t_j<Z)} = \frac{[F(t_j-)-F(t_{j+1}-)](1-G(t_j))}{F(t_j-)(1-G(t_j))} = \frac{F(t_j-)-F(t_{j+1}-)}{F(t_j-)}.$$
Hence the sum in (10.5.3) is the $F$-integral of a certain step function which, as the grid gets finer and finer, tends to the limit
$$\int_{[t,\infty)}1_{\{Y\le u<Z\}}\,\frac{F(du)}{F(u)}.$$
When $F$ and $G$ have no atoms in common, the sum in (10.5.4) tends to zero. Without further mentioning we assume that this holds.

Lemma 10.5.1. The process
$$t\mapsto 1_{\{t\le Y\le Z\}}$$
has the innovation martingale (in reverse time)
$$M_t = 1_{\{t\le Y\le Z\}} - \int_{[t,\infty)}1_{\{Y\le u<Z\}}\,\frac{F(du)}{F(u)}.$$

Corollary 10.5.2. The process
$$H_n^1(t) = \frac{1}{n}\sum_{i=1}^{n}1_{\{t\le Y_i\le Z_i\}}$$
has the innovation martingale
$$M_n(t) = H_n^1(t) - \int_{[t,\infty)}\frac{C_n(u+)}{F(u)}\,F(du),$$
where
$$C_n(u) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Y_i\le u\le Z_i\}}.$$
This $C_n$ is an estimator of the function
$$C(u) = P(Y\le u\le Z\mid Y\le Z) = \alpha^{-1}F(u)(1-G(u-)),$$


which plays an important role in the analysis of truncated data. Actually,
$$C_n(u) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Y_i\le u\le Z_i\}}$$
is a ratio of a sample mean estimating $F(1-G_-)$ at $u$ and $n/N$, which is an (unknown) estimator of
$$\alpha \equiv P(Y\le Z).$$
We now discuss the problem of estimating the unknown d.f. of $Y$. First we are looking for a relation which may be helpful to identify $F$. Denote with $F^*$ the d.f. of the actually observed $Y^0$'s. Then
$$F^*(y) = P(Y^0\le y) = P(Y\le y\mid Y\le Z).$$
This function can be estimated through
$$F_n^*(y) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Y_i^0\le y\}}.$$
Since we observe both $Y_i^0$ and $Z_i^0$, also the function
$$H^*(y,z) = P(Y\le y, Z\le z\mid Y\le Z)$$
may be estimated, namely by
$$H_n(y,z) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Y_i^0\le y,\,Z_i^0\le z\}}.$$
Now, because $Y$ and $Z$ are independent,
$$H^*(y,z) = \alpha^{-1}\int_{-\infty}^{y}\int_{-\infty}^{z}1_{\{u\le v\}}\,F(du)\,G(dv) = \alpha^{-1}\int_{-\infty}^{y\wedge z}[G(z)-G(u-)]\,F(du).$$
Putting $z = \infty$, we get
$$F^*(y) = \alpha^{-1}\int_{-\infty}^{y}[1-G(u-)]\,F(du). \qquad (10.5.5)$$


Thus the unknown $\alpha$ is contained in both $C$ and $F^*$. There is some hope that by taking appropriate ratios we may arrive at estimable quantities in which $\alpha$ has cancelled out. Now, rewrite (10.5.5) to get
$$\alpha\,dF^* = (1-G_-)\,dF$$
and therefore
$$dF = \frac{\alpha\,dF^*}{1-G_-} = \frac{dF^*}{\alpha^{-1}(1-G_-)}.$$
Introduce
$$C(x) = P(Y\le x\le Z\mid Y\le Z) = \alpha^{-1}F(x)(1-G(x-)).$$
Then
$$\frac{dF}{F} = \frac{dF^*}{C}. \qquad (10.5.6)$$
Equation (10.5.6) is the required identifiability equation. Both terms on the right-hand side are estimable. For example, if $F$ has a density $f$, then
$$\int_{t}^{\infty}\frac{dF}{F} = \int_{t}^{\infty}\frac{f(x)}{F(x)}\,dx = \ln F(x)\Big|_{t}^{\infty} = -\ln F(t)$$
and therefore
$$F(t) = \exp\left[-\int_{t}^{\infty}\frac{dF}{F}\right]. \qquad (10.5.7)$$
We need (10.5.7) for the general case, i.e., also when $F$ has atoms. The product integration formula for survival functions is obtained by multiplying the variable of interest with minus one. So, write
$$F(t) = P(Y\le t) = P(-Y\ge -t) = P(X\ge s),$$
where $X = -Y$ and $s = -t$. Furthermore,
$$P(X\ge s) = \lim_{u\uparrow s}P(X>u).$$
Let, for a moment, $G$ denote the d.f. of $X$, and let $\Lambda$ denote the pertaining cumulative hazard function, i.e.,
$$\Lambda(x) = \int_{(-\infty,x]}\frac{dG}{1-G_-}.$$


Then
$$1-G(u-) = P(X\ge u) = P(Y\le -u) = F(-u)$$
and therefore
$$\Lambda(x) = \int_{-\infty}^{x}\frac{G(du)}{F(-u)} = E\left[1_{\{X\le x\}}\frac{1}{F(-X)}\right] = E\left[1_{\{Y\ge -x\}}\frac{1}{F(Y)}\right] = \int_{[-x,\infty)}\frac{F(dy)}{F(y)}. \qquad (10.5.8)$$
Hence $x$ is a discontinuity point of $\Lambda$ if and only if $-x$ is a discontinuity point of $F$. For such an $x$,
$$\Lambda\{x\} = \frac{F\{-x\}}{F(-x)}.$$
Conclude that
$$\begin{aligned}
F(t) &= \lim_{u\uparrow -t}[1-G(u)] = \lim_{u\uparrow -t}\exp[-\Lambda^c(u)]\prod_{x\le u}[1-\Lambda\{x\}]\\
&= \lim_{u\uparrow -t}\exp[-\Lambda^c(u)]\prod_{x\le u}\frac{F(-x-)}{F(-x)} = \exp[-\Lambda^c(-t)]\prod_{y>t}\frac{F(y-)}{F(y)}.
\end{aligned}$$
Since, by (10.5.8), $\Lambda^c(-t)$ equals the continuous part at $t$ of
$$\Lambda(t) = \int_{[t,\infty)}\frac{F(dy)}{F(y)}, \qquad (10.5.9)$$
upon noticing that this $\Lambda$ is left-hand continuous and nonincreasing with $\Lambda(\infty)=0$, we obtain, summarizing:

Lemma 10.5.3. With $\Lambda$ from (10.5.9) we have
$$F(t) = \exp[-\Lambda^c(t)]\prod_{y>t}\frac{F(y-)}{F(y)} = \exp[-\Lambda^c(t)]\prod_{y>t}[1-\Lambda\{y\}].$$

We are now in the position to derive the estimator of $F$.


1. Step: Since, by (10.5.6),
$$\Lambda(t) = \int_{[t,\infty)}\frac{F(dy)}{F(y)} = \int_{[t,\infty)}\frac{F^*(dy)}{C(y)},$$
and $F^*$ and $C$ are estimated through
$$F_n^*(y) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Y_i^0\le y\}}$$
and
$$C_n(y) = \frac{1}{n}\sum_{i=1}^{n}1_{\{Y_i^0\le y\le Z_i^0\}},$$
we set for $\Lambda_n$
$$\Lambda_n(t) \equiv \int_{[t,\infty)}\frac{F_n^*(dy)}{C_n(y)} = \frac{1}{n}\sum_{i=1}^{n}1_{\{t\le Y_i^0\}}C_n^{-1}(Y_i^0) = \sum_{i=1}^{n}\frac{1_{\{t\le Y_i^0\}}}{\sum_{j=1}^{n}1_{\{Y_j^0\le Y_i^0\le Z_j^0\}}}.$$
Note that the denominator is always $\ge 1$.

2. Step: In the next step we estimate $F$. Since $\Lambda_n$ is purely discrete, Lemma 10.5.3 yields
$$F_n(t) = \prod_{y>t}[1-\Lambda_n\{y\}].$$
If the $Y_i^0$ are pairwise distinct,
$$F_n(t) = \prod_{Y_i^0>t}\left[1-\frac{1}{nC_n(Y_i^0)}\right]. \qquad (10.5.10)$$
This estimator is called the Lynden-Bell estimator. Recall (10.5.9), which in terms of differentials yields
$$-F\,d\Lambda = dF.$$


Integration leads to
$$-\int_{(t,\infty)}F(y)\,\Lambda(dy) = 1-F(t). \qquad (10.5.11)$$
Equation (10.5.11) is true for any $\Lambda$ and the associated $F$. For $\Lambda_n$ and $F_n$ we therefore get
$$-\int_{(t,\infty)}F_n(y)\,\Lambda_n(dy) = 1-F_n(t). \qquad (10.5.12)$$
We conclude our discussion with the estimation of $\alpha$. We have seen that
$$\alpha\,dF^* = (1-G_-)\,dF.$$
Integration gives
$$\alpha = \int(1-G_-)\,dF.$$
Since $Z$ is truncated through $Y$ from the right, we can use a Lynden-Bell estimator for $G$, say $G_n$. The estimator of $\alpha$ will be
$$\hat\alpha = \int(1-G_{n-})\,dF_n.$$
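A compact Python sketch of the two steps above (our illustration; `lynden_bell` is a hypothetical name, and distinct observed $Y$'s are assumed) computes $C_n$ and the Lynden-Bell estimator $F_n$ from the observed pairs $(Y_i^0,Z_i^0)$:

```python
import numpy as np

def lynden_bell(y, z):
    """Lynden-Bell estimator F_n for right-truncated data (Y_i observed iff Y_i <= Z_i).

    y, z : observed pairs with y[i] <= z[i]
    Returns the ordered observed Y-values and F_n evaluated at these points.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(y)

    # n * C_n(Y_i) = number of j with Y_j <= Y_i <= Z_j (always >= 1)
    nC = np.array([np.sum((y <= yi) & (yi <= z)) for yi in y])

    order = np.argsort(y)
    y_ord, nC_ord = y[order], nC[order]

    # F_n(t) = product over Y_i > t of (1 - 1/(n C_n(Y_i))), evaluated at t = Y_(k)
    factors = 1.0 - 1.0 / nC_ord
    Fn = np.ones(n)
    for k in range(n - 2, -1, -1):            # backward cumulative product over larger Y's
        Fn[k] = Fn[k + 1] * factors[k + 1]
    return y_ord, Fn

# example: right-truncated exponential sample
rng = np.random.default_rng(2)
Y = rng.exponential(1.0, size=400)
Z = rng.exponential(2.0, size=400)
keep = Y <= Z
yv, Fn = lynden_bell(Y[keep], Z[keep])
print(yv[:5], Fn[:5])
```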


Chapter 11

The Multivariate Case

11.1 Introduction

Let $X = (X_1,\ldots,X_m)$, $m\ge 1$, be a random vector with distribution function $F$, i.e.,
$$F(x_1,\ldots,x_m) = P(X_1\le x_1,\ldots,X_m\le x_m)$$
denotes the probability that the $X_i$ jointly fall below given thresholds $x_i$, $1\le i\le m$. In many real life situations the vector $X$ comprises the individual "lifetimes" of the $m$ components of a complex system. For example, in a financial portfolio of bonds, $X_i$ usually denotes the time until default of the $i$-th asset. In a medical setup, imagine that a patient is constantly monitored after surgery and further breakdowns may be caused by so-called competing risks. Similarly in reliability theory, i.e., in a technical environment. In all these examples the $X_i$ may be viewed as individual times-to-event and are therefore nonnegative. Consequently $F$ is concentrated on the $m$-fold product of the positive real line, so that
$$P(X_1\ge 0,\ldots,X_m\ge 0) = 1.$$
In the analysis of survival data it may often happen that, due to time limitations, the $X_i$ are not always observable. Rather, there will be a censoring variable $Y_i$, the time spent under investigation, so that only
$$Z_i = \min(X_i,Y_i), \quad 1\le i\le m,$$
and
$$\delta_i = 1_{\{X_i\le Y_i\}},$$


the censoring status or label of the datum, are available. Given a number of independent replications of
$$Z = (Z_1,\ldots,Z_m) \quad\text{and}\quad \delta = (\delta_1,\ldots,\delta_m),$$
it is then the goal of statistical inference to estimate the unknown distribution function (d.f.) $F$.

While for $m = 1$, i.e., for real-valued survival data, the analysis already started, in a nonparametric framework, with the landmark paper of Kaplan and Meier (1958), the multivariate case has only been studied since the mid 1980's. An early reference is Dabrowska (1988), whose estimator, however, was not a bona fide estimator but could give negative (empirical) probability masses to certain events. Van der Laan (1995) derived implicit estimators which were asymptotically efficient under strong assumptions like complete availability of the censoring variables. Gill et al. (1995) showed that the estimators obtained until then were efficient under complete independence and continuity but inefficient otherwise. As a conclusion, the problem of finding efficient bona fide estimators for $F$ under censorship in a general scenario is still open when $m\ge 2$. It is the goal of this chapter to provide such an estimator, i.e., to extend the Kaplan-Meier (KM) estimator to the multivariate case. The outline of this chapter is as follows. In the rest of this section we discuss some identifiability issues. In Section 11.2 we further motivate our approach, while in Section 11.3 we derive the multivariate KM-estimator. Section 11.4 provides its efficiency, while in Section 11.5 we report on some simulation results.

In the rest of this section we first show why the multivariate case is so much different from the univariate situation. For this, we remind the reader how the univariate KM-estimator is usually derived. When $m = 1$, the survival function
$$\bar F(x) = P(X\ge x)$$
satisfies the trivial equation
$$\bar F(x) = 1-F(x-),$$
where
$$F(x-) = \lim_{y\uparrow x}F(y)$$
is the left-hand limit of $F$ at $x$. Moreover, introduce $H$, the d.f. of $Z = \min(X,Y)$, and the joint d.f. of $Z$ and $\delta$:
$$H(x) = P(Z\le x) \quad\text{and}\quad H^1(x) = P(Z\le x, \delta=1).$$


Under independence of $X$ and $Y$,
$$dH^1 = \bar G\,dF \quad\text{and}\quad 1-H = (1-F)(1-G), \qquad (11.1.1)$$
with $G(x) = P(Y\le x)$ denoting the (unknown) d.f. of $Y$ and $\bar G(y) = P(Y\ge y)$. In conclusion we obtain
$$\bar F(x) = 1-F(x-) = 1-\int_{[0,x)}\frac{\bar F(y)}{\bar F(y)}\,F(dy) = 1-\int_{[0,x)}\bar F(y)\,\Lambda(dy), \qquad (11.1.2)$$
where $d\Lambda = \frac{dF}{\bar F}$ is the hazard measure associated with $F$. In view of (11.1.1), (11.1.2) becomes
$$\bar F(x) = 1-\int_{[0,x)}\bar F(y)\,\frac{H^1(dy)}{\bar H(y)}. \qquad (11.1.3)$$
There is one word of caution necessary. Namely, (11.1.3) is true only if $\bar G(y)>0$ for all $y<x$. Actually, suppose that $\bar G(y)=0$ for all $y$ in a left neighborhood $(x-\delta,x)$ of $x$ but $dF>0$ there. This implies that with probability one all $Y$'s are less than or equal to $x-\delta$. Hence possible $X$'s which exceed $x-\delta$ are necessarily censored. As a conclusion, $F$ cannot be fully reconstructed from a set of censored data. Stute and Wang (1993) contains a detailed discussion of this issue. To avoid this here, we assume throughout without further mentioning that the support of $F$ is included in that of $G$.

There are more important facts about (11.1.2) and (11.1.3). First, equation (11.1.2) is an inhomogeneous Volterra integral equation such that, given
$$d\Lambda = \frac{dH^1}{\bar H},$$
its solution $q$ satisfying $q(0)=1$ is unique and equals $q = \bar F$. In other words, the hazard measure $d\Lambda$ uniquely determines $\bar F$ resp. $F$ through (11.1.2). An explicit representation is obtained through
$$\bar F(x) = \exp[-\Lambda^c(x)]\prod_{0\le t<x}[1-\Lambda\{t\}], \qquad (11.1.4)$$


the famous product-limit formula. Here $\Lambda^c$ and $\Lambda^d$ denote the continuous and the discrete parts of $\Lambda$, respectively. The KM-estimator is then obtained by first replacing $H^1$ and $H$ by their empirical analogs, leading to
$$d\Lambda_n = \frac{dH_n^1}{\bar H_n},$$
and then applying (11.1.4) to $\Lambda_n$.
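As a quick illustration (ours; the helper name `kaplan_meier` is hypothetical), applying (11.1.4) to the purely discrete $\Lambda_n$ gives the familiar product-limit computation for right-censored data:

```python
import numpy as np

def kaplan_meier(z, delta):
    """Product-limit estimator: apply (11.1.4) to the discrete empirical
    hazard dLambda_n = dH_n^1 / bar H_n (no ties assumed)."""
    z = np.asarray(z, float); delta = np.asarray(delta, int)
    order = np.argsort(z, kind="stable")
    z, delta = z[order], delta[order]
    n = len(z)
    at_risk = n - np.arange(n)                 # n * bar H_n at each ordered Z_i
    hazard_jumps = delta / at_risk             # Lambda_n jump at Z_i
    surv = np.cumprod(1.0 - hazard_jumps)      # bar F_n just after each Z_i
    return z, surv

print(kaplan_meier([3.1, 0.9, 2.2, 4.5, 1.7], [1, 1, 0, 1, 0]))
```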

We now come to the multivariate case. For ease of discussion, we only consider the case $m = 2$. The following example shows that things have changed.

Example 11.1.1. Let $\bar F_1$ and $\bar F_2$ be two absolutely continuous survival functions on the positive real line with densities $f_1$ and $f_2$. Set $\bar F(x_1,x_2) = \bar F_1(x_1)\bar F_2(x_2)$. The hazard function of $F$ is obviously
$$\lambda(x_1,x_2) = \frac{f_1(x_1)f_2(x_2)}{\bar F_1(x_1)\bar F_2(x_2)} \equiv \lambda_1(x_1)\lambda_2(x_2),$$
say. Since, for any $a>0$,
$$\lambda(x_1,x_2) = (a\lambda_1(x_1))(a^{-1}\lambda_2(x_2)),$$
the family of bivariate survival functions
$$\bar F_a(x_1,x_2) = \exp\left[-\int_0^{x_1}a\lambda_1(t_1)\,dt_1\right]\exp\left[-\int_0^{x_2}a^{-1}\lambda_2(t_2)\,dt_2\right]$$
all have the same hazard function as $F$.

The message of Example 11.1.1 is simple but striking, namely that in the multivariate case the hazard measure does not uniquely specify the survival function and hence $F$. There are more important differences. For example, if one tries to mimic the trivial derivation of (11.1.2) in the multivariate case, one finds out that this is impossible, simply because typically $F$ and $\bar F$ do not sum up to one. Actually, for $m = 2$, $F(x_1,x_2)+\bar F(x_1,x_2)$ equals the probability that the vector $X$ falls into the area southwest or northeast of $(x_1,x_2)$, the other two areas being totally neglected.

Summarizing our findings obtained so far, a simple extension of (11.1.2)-(11.1.4) to the multivariate case will be hopeless. Since it makes no sense


to start the estimation procedure before solving the identifiability question, we have to discuss the first issue in greater detail.

Again, for ease of representation and readability, we concentrate on bivariate data. Recall $X = (X_1,X_2)$ and, independently, $Y = (Y_1,Y_2)$ from $G$. Also, as before,
$$Z_i = \min(X_i,Y_i) \quad\text{and}\quad \delta_i = 1_{\{X_i\le Y_i\}}, \quad i = 1,2.$$
Setting
$$H^{11}(t_1,t_2) = P(Z_1\le t_1, \delta_1=1, Z_2\le t_2, \delta_2=1)$$
and
$$H(t_1,t_2) = P(Z_1\le t_1, Z_2\le t_2),$$
we obtain for the survival functions
$$\bar F(x_1,x_2) = P(X_1\ge x_1, X_2\ge x_2)$$
and
$$\bar G(x_1,x_2) = P(Y_1\ge x_1, Y_2\ge x_2),$$
as in the univariate case,
$$dH^{11} = \bar G\,dF \quad\text{and}\quad \bar G\bar F = \bar H.$$
Conclude that
$$\bar F(x_1,x_2) = \int_{[x_1,\infty)\times[x_2,\infty)}\bar F(t_1,t_2)\,\frac{H^{11}(dt_1,dt_2)}{\bar H(t_1,t_2)}. \qquad (11.1.5)$$
As in the univariate case we have to assume that
$$\operatorname{supp}F\subset\operatorname{supp}G. \qquad (11.1.6)$$
Given $H^{11}$ and $\bar H$, (11.1.5) exhibits that $\bar F$ is a solution of a homogeneous Volterra equation rather than an inhomogeneous one as in (11.1.3). If one looks at the right-hand side of (11.1.5) and interprets the integral as an operator acting on $\bar F$, (11.1.5) states that $\bar F$ is an eigenfunction with eigenvalue one. There is little hope, though, in view of Example 11.1.1, that the eigenspace is one-dimensional, so that $\bar F$ is the only solution satisfying $\bar F(0,0)=1$. Another complication comes in since, compared with (11.1.3), we now integrate over the north-eastern area, which very often (e.g., in the Lebesgue-dominated case) has infinite mass under the hazard measure $d\Lambda$. The following construction serves to replace $d\Lambda$ by a finite measure.


Definition 11.1.2. For any bivariate vector $X$ and $0<\varepsilon<1$ let
$$X^\infty = \begin{cases} X & \text{with probability } 1-\varepsilon\\ x_\infty & \text{with probability } \varepsilon,\end{cases}$$
where $x_\infty$ is a vector, possibly $(\infty,\infty)$, which exceeds all $(x_1,x_2)$ (componentwise) in the support of $F$.

Note that in the discussion below, as well as in the estimation process, the choice of $x_\infty$ is not important. It only serves as an intellectual tool for introducing a defective distribution on $\mathbb R^2$ with survival function
$$\bar F_\varepsilon(x_1,x_2) = \begin{cases} (1-\varepsilon)\bar F(x_1,x_2)+\varepsilon & \text{if } (x_1,x_2)\in\mathbb R^2\\ \varepsilon & \text{if } x = (x_1,x_2) = x_\infty.\end{cases}$$
Of course,
$$\bar F(x) = \lim_{\varepsilon\downarrow 0}\bar F_\varepsilon(x) \quad\text{uniformly in } x = (x_1,x_2)\in\mathbb R^2. \qquad (11.1.7)$$
When it comes to estimation we may let $\varepsilon = \varepsilon_n$ depend on the sample size $n$ in such a way that $\varepsilon_n\downarrow 0$ as $n\uparrow\infty$. It is also interesting to note that in the univariate case the Kaplan-Meier estimate is also defective whenever the $\delta$ for the largest $Z$ equals zero. Then the point $x_\infty$ carrying the remaining mass is from time to time called the "cemetery".

Now, denote with $d\Lambda_\varepsilon$ the hazard measure associated with $\bar F_\varepsilon$. We have
$$d\Lambda_\varepsilon = \begin{cases} 1 & \text{on } \{x_\infty\}\\[4pt] \dfrac{(1-\varepsilon)\,dF}{(1-\varepsilon)\bar F+\varepsilon} & \text{on } \mathbb R_+\times\mathbb R_+,\end{cases} \qquad (11.1.8)$$
a finite measure. Finally, observe that the homogeneous equation
$$q_\varepsilon(x) = \int 1_{\{t\ge x\}}q_\varepsilon(t)\,\Lambda_\varepsilon(dt), \qquad (11.1.9)$$
where $t = (t_1,t_2)\ge x = (x_1,x_2)$ means componentwise, admits the solution
$$q_\varepsilon(x) = \bar F_\varepsilon(x). \qquad (11.1.10)$$
In the following section we shall show that for a given $\varepsilon>0$, $\bar F_\varepsilon$ is the only solution of (11.1.9) satisfying $q_\varepsilon(0,0)=1$ and $q_\varepsilon(x_\infty)=\varepsilon$. In other words, for defective distributions, the survival function is reconstructable from its hazard function $\Lambda_\varepsilon$. Also we need to find a one-to-one relation between $\bar F_\varepsilon$ and $\Lambda_\varepsilon$ serving the same purpose as (11.1.4) in the univariate case. Finally, recall (11.1.7) to get the original target $\bar F$ back from the $\bar F_\varepsilon$'s.


11.2 Identification of Defective Survival Functions

Consider the eigenproblem (11.1.9) with $\Lambda_\varepsilon$ defined as in (11.1.8). Since $d\Lambda_\varepsilon$ is unknown in practice, we need to study (11.1.9) also for other distributions, e.g., for estimators of $d\Lambda_\varepsilon$. Therefore, in the following, we let $P$ be any finite distribution giving mass one to $x_\infty$, and consider the integral equation
$$q(x) = \int 1_{\{t\ge x\}}q(t)\,P(dt). \qquad (11.2.1)$$
We then have the following result.

Theorem 11.2.1. Let $P$ be any given finite measure with mass one at $x_\infty$. Furthermore, let $0<\varepsilon<1$ be a given positive number. Then there exists exactly one nonnegative $q_\varepsilon = q_\varepsilon(P)$ with $q_\varepsilon(0,0)=1$ and $q_\varepsilon(x_\infty)=\varepsilon$ satisfying (11.2.1). Moreover, this $q_\varepsilon$ admits the Neumann representation
$$\begin{aligned}
q_\varepsilon(x) &= \varepsilon\left[1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge t_1\ge x,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)\right]\\
&= \frac{1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge t_1\ge x,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)}{1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge t_1\ge 0,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)}. \qquad (11.2.2)
\end{aligned}$$

Proof. It is easy to check that the $q_\varepsilon$ from (11.2.2) is nonnegative and satisfies $q_\varepsilon(0,0)=1$. Also the series in the numerator vanishes at $x = x_\infty$, so that $q_\varepsilon(x_\infty)=\varepsilon$. We shall see from the proof below that both series converge. Actually, the positive $\varepsilon$ has been introduced only for one reason, namely to enforce convergence in the Neumann representation. Finally, check that (11.2.2) is indeed a solution of (11.2.1). Hence it remains to prove that $q_\varepsilon$ from (11.2.2) is the only solution of (11.2.1). So, let $q$ be any other nonnegative solution of (11.2.1) satisfying the two constraints $q(0,0)=1$ and $q(x_\infty)=\varepsilon$. By (11.2.1), it is necessarily non-increasing in $x$ and bounded from above by 1. If we iterate (11.2.1), we obtain
$$\begin{aligned}
q(x) &= q(x_\infty) + \int 1_{\{t\ge x,\,t\ne x_\infty\}}q(t)\,P(dt) \qquad (11.2.3)\\
&= q(x_\infty) + \int\!\!\int 1_{\{t_2\ge t_1\ge x,\,t_1\ne x_\infty\}}q(t_2)\,P(dt_2)\,P(dt_1)\\
&= q(x_\infty) + q(x_\infty)\int 1_{\{t_1\ge x,\,t_1\ne x_\infty\}}\,P(dt_1) + \int\!\!\int 1_{\{t_2\ge t_1\ge x,\,t_2\ne x_\infty\}}q(t_2)\,P(dt_2)\,P(dt_1).
\end{aligned}$$


If we repeat this several times we get, for any fixed $k\in\mathbb N$:
$$\begin{aligned}
q(x) &= q(x_\infty)\left[1+\sum_{i=1}^{k-1}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge x,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)\right]\\
&\quad + \int\ldots\int 1_{\{t_k\ge t_{k-1}\ge\ldots\ge x,\,t_{k-1}\ne x_\infty\}}\,q(t_k)\,P(dt_k)\ldots P(dt_1)\\
&\equiv q(x_\infty)\left[1+\sum_{i=1}^{k-1}b_i\right] + c_k = \varepsilon\left[1+\sum_{i=1}^{k-1}b_i\right] + c_k.
\end{aligned}$$
Clearly, all $b_k$'s and $c_k$'s are nonnegative. In particular, the sum of the $b$'s is non-decreasing in $k$. Conclude that
$$\varepsilon\left[1+\sum_{i=1}^{\infty}b_i\right] \le q(x).$$
Since $\varepsilon$ is positive, the series is finite. Especially, we get $b_k\to 0$ as $k\to\infty$. But
$$0\le c_k\le q(0)\int\ldots\int 1_{\{t_k\ge t_{k-1}\ge\ldots\ge x,\,t_{k-1}\ne x_\infty\}}\,P(dt_k)\ldots P(dt_1)\le q(0)\,b_{k-1}\left[P(\mathbb R_+^2)+1\right].$$
Therefore, $c_k\to 0$ as $k\to\infty$ and thus
$$q(x) = \varepsilon\left[1+\sum_{i=1}^{\infty}b_i\right] = \varepsilon\left[1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge t_1\ge x,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)\right].$$
Since $q(0,0)=1$, we get in particular
$$1 = \varepsilon\left[1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge\ldots\ge t_1\ge 0,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)\right],$$
so that in summary we have obtained (11.2.2).

In view of Theorem 11.2.1, we know that for $dP = d\Lambda_\varepsilon$ the only solution of (11.1.9) equals $q_\varepsilon = q_\varepsilon(\Lambda_\varepsilon) = \bar F_\varepsilon$. Moreover, (11.2.2) yields the Neumann representation of $\bar F_\varepsilon$. Next, we study (11.2.1) for
$$dP \equiv dP_\varepsilon^0 = \frac{(1-\varepsilon)\,dH^{11}}{(1-\varepsilon)\bar H+\varepsilon} \quad\text{on } \mathbb R_+\times\mathbb R_+$$


(and mass one on $\{x_\infty\}$) in greater detail. Let $q_\varepsilon^0 = q_\varepsilon(P_\varepsilon^0)$ be the corresponding $q$. Since, by
$$dH^{11} = \bar G\,dF \quad\text{and}\quad \bar G\bar F = \bar H,$$
we have
$$dP_\varepsilon^0 = \frac{(1-\varepsilon)\bar G\,dF}{(1-\varepsilon)\bar G\bar F+\varepsilon} \le \frac{(1-\varepsilon)\,dF}{(1-\varepsilon)\bar F+\varepsilon} = d\Lambda_\varepsilon,$$
(11.2.2) yields
$$q_\varepsilon^0(x)\le q_\varepsilon(x) \quad\text{for all } x.$$
To obtain a lower bound we need to strengthen the identifiability condition (11.1.6) a little bit, namely to
$$\operatorname{supp}F\subset\operatorname{supp}G \quad\text{strictly}.$$
In technical terms this means that for each $x$ in the support of $F$ we have
$$\bar G(x)\ge\delta>0 \qquad (11.2.4)$$
for an appropriate (small) $0<\delta<1$. This means, for a vector $X = (X_1,X_2)$ taking its value in the extreme north-eastern corner, that there is a little though positive chance of not being censored. Under (11.2.4), we have, for $0<\varepsilon<\delta<1$,
$$dP_\varepsilon^0 = \frac{(1-\varepsilon)\,dF}{(1-\varepsilon)\bar F+\varepsilon/\bar G} \ge \frac{(1-\varepsilon)\,dF}{(1-\varepsilon)\bar F+\varepsilon/\delta} \ge \frac{(1-\varepsilon/\delta)\,dF}{(1-\varepsilon/\delta)\bar F+\varepsilon/\delta}.$$
Hence by the Neumann representation (11.2.2) we get
$$q_\varepsilon^0(x)\ge q_{\varepsilon/\delta}(x) \quad\text{for all } x. \qquad (11.2.5)$$
Taking into account (11.1.7), we thus obtain the following

Theorem 11.2.2. Under (11.2.4), we have, uniformly in $x$,
$$\bar F = \lim_{\varepsilon\downarrow 0}q_\varepsilon(\Lambda_\varepsilon) = \lim_{\varepsilon\downarrow 0}q_\varepsilon(P_\varepsilon^0).$$


11.3 The Multivariate Kaplan-Meier Estimator

In this section we derive and study the multivariate extension of the Kaplan-Meier estimator. Again, we restrict ourselves to dimension two. We start with equation (11.1.5) and replace the unknown $H^{11}$ and $\bar H$ by their empirical counterparts
$$H_n^{11}(t_1,t_2) = \frac{1}{n}\sum_{j=1}^{n}\delta_{1j}\delta_{2j}1_{\{Z_{1j}\le t_1,\,Z_{2j}\le t_2\}},$$
$$\bar H_n(t_1,t_2) = \frac{1}{n}\sum_{j=1}^{n}1_{\{Z_{1j}\ge t_1,\,Z_{2j}\ge t_2\}},$$
where from now on we denote with $(Z_{1j},Z_{2j},\delta_{1j},\delta_{2j})$, $1\le j\le n$, a sample of independent replicates of $Z = (Z_1,Z_2)$ and $\delta = (\delta_1,\delta_2)$ available to the analyst. See Section 11.1 for the notion of $Z$ and $\delta$. To obtain a defective estimator, we choose $\varepsilon = 1/(n+1)$, so that the empirical equivalent of (11.2.1) becomes
$$q_\varepsilon(x_1,x_2) = \frac{1}{n+1} + \int\!\!\int_{[x_1,\infty)\times[x_2,\infty)}q_\varepsilon(t_1,t_2)\,\frac{H_n^{11}(dt_1,dt_2)}{\bar H_n(t_1,t_2)+\frac{1}{n}}. \qquad (11.3.1)$$
Denote with
$$\bar F_n = q_\varepsilon(P_n^0)$$
the unique nonnegative solution of (11.3.1) satisfying $q_\varepsilon(0,0)=1$, where $dP_n^0 = dH_n^{11}/(\bar H_n+1/n)$. $\bar F_n$ will be the bivariate analog of the univariate Kaplan-Meier estimator. Extensions to higher dimension are straightforward at the expense of some more notation. As to computational aspects, one possibility would be the Neumann representation (11.2.2) with $dP = dP_n^0$. We prefer to solve (11.3.1) directly. Forgetting the leading summand $1/(n+1)$ for a moment, equation (11.3.1) says that $\bar F_n$ is the eigenfunction (with eigenvalue one) of an appropriate (empirical) operator equation.

To begin with the computation: since $H_n^{11}$ is a discrete measure supported by certain $(Z_{1i},Z_{2i})$, $1\le i\le n$, so is $F_n$. Let $p_i$ denote the mass given to $(Z_{1i},Z_{2i})$ under $F_n$. These are the quantities we are interested in because they uniquely determine $F_n$. Conclude that
$$\bar F_n(Z_{1i},Z_{2i}) = \sum_{j=1}^{n}a_{ij}p_j + \frac{1}{n+1},$$


where $a_{ij}$ is the indicator
$$a_{ij} = \begin{cases} 1 & \text{if } Z_{1j}\ge Z_{1i} \text{ and } Z_{2j}\ge Z_{2i}\\ 0 & \text{otherwise.}\end{cases}$$
Since in this section $n$ is fixed, we may extend this definition to
$$a_{i,n+1} = 1 \quad\text{for } 1\le i\le n \text{ or } i = n+1$$
to describe the position of $x_\infty$ in the sample. Also let $p_{n+1} = 1/(n+1)$ denote the mass of $x_\infty$. Finally, let
$$b_i = \frac{\Delta H_n^{11}(Z_{1i},Z_{2i})}{\bar H_n(Z_{1i},Z_{2i})+\frac{1}{n}} = \frac{\delta_{1i}\delta_{2i}}{n\bar H_n(Z_{1i},Z_{2i})+1}$$
be the mass given to $(Z_{1i},Z_{2i})$ under $dH_n^{11}/(\bar H_n+\frac{1}{n})$. Let $(Z_{1,n+1},Z_{2,n+1})$ correspond to $x_\infty$, and put $b_{n+1} = 1$. Then equation (11.3.1) with $x_1 = Z_{1i}$ and $x_2 = Z_{2i}$ may be rewritten as
$$\sum_{j=1}^{n+1}a_{ij}p_j = \sum_{k=1}^{n+1}\bar F_n(Z_{1k},Z_{2k})\,a_{ik}b_k = \sum_{k=1}^{n+1}a_{ik}b_k\sum_{l=1}^{n+1}a_{kl}p_l = \sum_{l=1}^{n+1}\left[\sum_{k=1}^{n+1}a_{ik}b_k a_{kl}\right]p_l.$$
For $i = n+1$, i.e., at $x = x_\infty$, the equation is also true, since both sides equal $1/(n+1)$. Hence, if we introduce the vector $p = (p_1,\ldots,p_n,p_{n+1})^t$ and the matrices
$$A = (a_{ij})_{1\le i,j\le n+1} \quad\text{and}\quad B = \operatorname{diag}(b_1,\ldots,b_{n+1}),$$
then equation (11.3.1) becomes
$$Ap = ABAp. \qquad (11.3.2)$$
Hence the vector $Ap$ is an eigenvector of the matrix $AB$, with eigenvalue one. To simplify things, we may order the first components $Z_{1j}$ in increasing order to come up with the associated order statistics
$$Z_{11:n}\le\ldots\le Z_{1n:n}.$$
After ordering, the $a_{ij}$ become
$$a_{ij} = \begin{cases} 0 & \text{if } j<i\\ 1 & \text{if } j=i\\ 1\text{ or }0 & \text{if } j>i,\end{cases} \qquad (11.3.3)$$


assuming for the time being that no ties are present. An ordering of the second components would lead to the same numerical results. Now, under (11.3.3), the matrix $A$ becomes an invertible upper triangular matrix, and equation (11.3.2) boils down to
$$p = BAp. \qquad (11.3.4)$$
Of course, now $p_j$ is the mass given to $(Z_{1j:n},Z_{[2j:n]})$, the second $Z$ being the concomitant of the first. Furthermore, upon setting $\delta_j = \delta_{1j}\delta_{2j}$, we have
$$BA = (b_i a_{ij})_{1\le i,j\le n+1}$$
with
$$b_i a_{ij} = \frac{\delta_i a_{ij}}{\sum_{k=i}^{n+1}a_{ik}}.$$
In particular, $BA$ too is upper triangular, with diagonal elements
$$b_i = \frac{\delta_i}{\sum_{k=i}^{n+1}a_{ik}}, \qquad 1\le i\le n+1.$$
Note that $b_{n+1}=1$ but $0\le b_i<1$ for $1\le i\le n$, simply because $a_{ii}=1=a_{i,n+1}$. We are now in the position to formulate the main result of this section.

Theorem 11.3.1. The equation (11.3.4) has a unique admissible solution $p = (p_1,\ldots,p_{n+1})^t$, i.e., all $p_i$ satisfy $p_i\ge 0$ and
$$\sum_{i=1}^{n+1}p_i = 1.$$

Proof. From (11.3.4) we obtain, for each $1\le i\le n+1$:
$$p_i = \sum_{j=1}^{n+1}b_i a_{ij}p_j.$$
Since $a_{ij}=0$ for $i>j$, it follows that
$$p_i = b_i p_i + \sum_{j=i+1}^{n+1}b_i a_{ij}p_j.$$
Recall $p_{n+1}=1/(n+1)$. For $1\le i\le n$ we get
$$p_i = c_i\sum_{j=i+1}^{n+1}a_{ij}p_j, \qquad (11.3.5)$$


with
$$c_i = \frac{b_i}{1-b_i}, \qquad 1\le i\le n,$$
being well-defined because of $b_i<1$. Since all $b_i$ and $a_{ij}$ are nonnegative, so are the $p_i$'s. Furthermore, by backward recursion based on (11.3.5), all $p_i$'s are unique up to a common factor. Hence by dividing each $p_i$ through $\sum_{j=1}^{n+1}p_j$, we obtain the admissible solution.

Note that the backward recursion (11.3.5) leads to an explicit representation of $p_i$ through $p_{n+1}$ (before normalizing): for $1\le i\le n$ we have
$$\begin{aligned}
p_i &= c_i p_{n+1}\Bigg[1 + \sum_{j=i+1}^{n}a_{ij}c_j + \sum_{j=i+1}^{n}\sum_{k=j+1}^{n}a_{ij}a_{jk}c_jc_k + \ldots + a_{i,i+1}a_{i+1,i+2}\ldots a_{n-1,n}\,c_{i+1}c_{i+2}\ldots c_n\Bigg]\\
&\equiv c_i d_i p_{n+1}
\end{aligned}$$
and therefore
$$\sum_{i=1}^{n+1}p_i = p_{n+1}\left[1+\sum_{i=1}^{n}c_i d_i\right].$$
To get a proper distribution, $p_{n+1}$ therefore needs to be updated and becomes
$$p_{n+1} = \left[1+\sum_{i=1}^{n}c_i d_i\right]^{-1}.$$
(11.3.5) then applies to compute the other $p_i$'s from the set of $c$'s and $a$'s, i.e., from the data. As mentioned earlier, in the univariate case the Kaplan-Meier estimator is always defective and thus unprotected against loss of mass when the $\delta$ of the largest $Z$ equals zero. Hence it is interesting to look at our approach also in dimension one. In this case $a_{ij}=1$ for $j\ge i$, after ordering. Conclude that
$$b_i = \frac{\delta_i}{n+2-i}, \qquad 1\le i\le n+1.$$
The backward recursion (11.3.5) simplifies dramatically and becomes
$$p_i = c_i\sum_{j=i+1}^{n+1}p_j, \qquad 1\le i\le n.$$
From this we obtain
$$p_i = \frac{c_i}{b_{i+1}}\,p_{i+1}$$


and thus, because of $b_{n+1}=1$,
$$p_i = \frac{c_i}{b_{i+1}}\,\frac{c_{i+1}}{b_{i+2}}\,\frac{c_{i+2}}{b_{i+3}}\cdots p_{n+1} = c_i p_{n+1}\prod_{j=i+1}^{n}(1+c_j)$$
for $1\le i\le n$. The updated $p_{n+1}$ equals
$$p_{n+1} = \frac{1}{\prod_{j=1}^{n}(1+c_j)},$$
so that at the end of the day, for $1\le i\le n$,
$$p_i = \frac{c_i}{\prod_{j=1}^{i}(1+c_j)} = b_i\prod_{j=1}^{i-1}[1-b_j].$$
These are exactly the Kaplan-Meier weights given to the sample $(Z_i,\delta_i)$, $1\le i\le n$, enriched by $x_\infty$. The choice of $x_\infty$ is not important when we restrict our estimator to the data. Furthermore, as was shown by Stute and Wang (1993) and Stute (1995), the fundamental large sample results, the Strong Law of Large Numbers and the Central Limit Theorem, continue to hold for KM-integrals and are not affected by the possible (negligible) defectiveness of the estimator. Similarly for the multivariate case, where (before normalization) the mass $1/(n+1)$ given to $x_\infty$ is positive but small as $n$ gets larger.

11.4 Efficiency of the Estimator

While in the previous section the focus was on the construction of a nonparametric estimator of a multivariate survival function and some of its numerical properties, this section is devoted to its statistical behavior. For this, let $P$ again be any finite measure on the Euclidean space enriched by $x_\infty$ and giving mass one to $x_\infty$. Recall that
$$q(x) = \int 1_{\{t\ge x\}}q(t)\,P(dt) \qquad (11.4.1)$$
admits exactly one nonnegative solution satisfying $q(0,0)=1$ and $q(x_\infty)=\varepsilon$. Moreover, $q$ equals
$$q(x) = \frac{1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge t_1\ge x,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)}{1+\sum_{i=1}^{\infty}\int\ldots\int 1_{\{t_i\ge t_{i-1}\ge\ldots\ge t_1\ge 0,\,t_i\ne x_\infty\}}\,P(dt_i)\ldots P(dt_1)}. \qquad (11.4.2)$$


Since $q(x)$ is uniquely determined through $P$, we may view $q(x)$ as the value of a statistical functional $q(x) = T_x(P)$ evaluated at $P$. Equation (11.4.2) exhibits that $T_x$ is a ratio of two infinite order $U$-functionals. Similar to Serfling (1980), one can show that $T_x$ is Gateaux-differentiable. The left-hand side of (11.4.1) yields not only a function in $x$ but also a distribution, which is again denoted with $q$. By taking increments in (11.4.1) we obtain for any rectangle $I$
$$q(I) = \int 1_I(t)q(t)\,P(dt). \qquad (11.4.3)$$
From this we get, by summing up both sides of (11.4.3), that
$$\int\phi\,dq = \int\phi(t)q(t)\,P(dt) \qquad (11.4.4)$$
for any elementary function $\phi$. Finally, by going to the limit, (11.4.4) holds also for any bounded continuous function $\phi$, and finally for any bounded measurable $\phi$. The boundedness is needed to guarantee, for any finite $P$, the existence of the integrals. Relevant $q$'s obtained and discussed so far are the ones associated with
$$dP = d\Lambda_\varepsilon, \qquad dP = dP_\varepsilon^0 = \frac{(1-\varepsilon)\,dH^{11}}{(1-\varepsilon)\bar H+\varepsilon}$$
and
$$dP = dP_n^0 = \frac{(1-\frac{1}{n+1})\,dH_n^{11}}{(1-\frac{1}{n+1})\bar H_n+\frac{1}{n+1}} = \frac{dH_n^{11}}{\bar H_n+\frac{1}{n}}.$$
The true but unknown hazard measure $dP = dF/\bar F$ does not belong to this collection, since in many cases it puts infinite mass on the Euclidean space.

To also include integrals w.r.t. this last $P$, we assume that the function $\phi$ is such that $\bar F$ is bounded away from zero on the support of $\phi$. This guarantees that all integrals and moments to appear below exist. Important examples of such $\phi$'s are the indicators of the rectangles $[0,x_1]\times[0,x_2]$ with $x_1\le T_1$ and $x_2\le T_2$ such that $\bar F(T_1,T_2)>0$. Of course, for such a $\phi$,
$$\int\phi\,dF = F(x_1,x_2).$$
Recall that for sample size $n$ the mass $\varepsilon$ sitting at $x_\infty$ equals $1/(n+1)$. With this $\varepsilon = \varepsilon_n$, we have
$$\bar F_n(x) = T_x(P_n^0).$$


Furthermore, set
$$\bar F_n^0(x) = T_x(P_\varepsilon^0),$$
a non-random survival function. To study the efficiency of our estimator $\bar F_n$, we consider a linear functional $\int\phi\,dF_n$ of $F_n$, with $\phi$ satisfying the above support condition. Then, by (11.4.4),
$$\int\phi\,dF_n = \int\phi(t)\bar F_n(t)\,P_n^0(dt)$$
and
$$\int\phi\,dF_n^0 = \int\phi(t)\bar F_n^0(t)\,P_\varepsilon^0(dt).$$
Conclude that
$$\int\phi\,dF_n - \int\phi\,dF_n^0 = \int\phi(t)[\bar F_n(t)-\bar F_n^0(t)]\,P_n^0(dt) + \int\phi(t)\bar F_n^0(t)\,[P_n^0(dt)-P_\varepsilon^0(dt)] = I + II.$$
The second integral equals
$$II = \int\phi\bar F_n^0\left[\frac{dH_n^{11}}{\bar H_n+\frac{1}{n}} - \frac{dH^{11}}{\bar H+\frac{1}{n}}\right] = \int\frac{\phi\bar F_n^0}{(\bar H_n+\frac{1}{n})(\bar H+\frac{1}{n})}\left[\bar H\,dH_n^{11} - \bar H_n\,dH^{11} + \frac{1}{n}(dH_n^{11}-dH^{11})\right].$$
Standard empirical process theory thus yields, in view of Theorem 11.2.2,
$$\begin{aligned}
n^{1/2}\,II &= n^{1/2}\int\frac{\phi\bar F}{\bar H^2}\left[\bar H\,dH_n^{11} - \bar H_n\,dH^{11}\right] + o_P(1)\\
&= n^{1/2}\left[\int\frac{\phi\bar F}{\bar H}\,(dH_n^{11}-dH^{11}) + \int\frac{\phi\bar F}{\bar H^2}\,[\bar H-\bar H_n]\,dH^{11}\right] + o_P(1)\\
&= n^{1/2}\left[\int\frac{\phi}{\bar G}\,(dH_n^{11}-dH^{11}) - \int\frac{\phi}{\bar H}\,[\bar H_n-\bar H]\,dF\right] + o_P(1).
\end{aligned}$$
The term in brackets is a sum of centered i.i.d. random variables which, after multiplication with $n^{1/2}$, is asymptotically normal. Because of $dH^{11} = \bar G\,dF$, it simplifies to
$$\int\phi\left[\frac{dH_n^{11}}{\bar G} - \frac{\bar H_n\,dF}{\bar H}\right]. \qquad (11.4.5)$$


Conclude that

n^{1/2} II = ∫ φ dα_n + o_P(1),

where the associated centered empirical process α_n is defined through

dα_n = n^{1/2} [dH^{11}_n / G − H_n dF / H].

It will also be relevant for I. Actually, since both F_n and F^0_n pertain to the same ε, namely ε = 1/(n+1), we obtain from (11.2.2)

F_n(t) − F^0_n(t) = 1/(n+1) Σ_{i=1}^∞ ∫···∫ 1_{t_i≥···≥t_1≥t, t_i≠x∞} [P^0_n(dt_i)···P^0_n(dt_1) − P^0_ε(dt_i)···P^0_ε(dt_1)].

The linear expansion of the right-hand side is obtained through the Hajek projection; see Serfling (1980). More precisely, define ᾱ_n through dᾱ_n = dα_n / F. Then we have, up to an o_P(1) term,

n^{1/2}[F_n(t) − F^0_n(t)]
 = 1/(n+1) Σ_{i=1}^∞ Σ_{k=1}^i ∫···∫ 1_{t_i≥···≥t_1≥t, t_i≠x∞} P^0_ε(dt_i) ··· ᾱ_n(dt_k) ··· P^0_ε(dt_1)
 = 1/(n+1) Σ_{k=1}^∞ ∫···∫ 1_{t_k≥···≥t_1≥t, t_k≠x∞} [1 + Σ_{i=k+1}^∞ ∫···∫ 1_{t_i≥···≥t_k, t_i≠x∞} P^0_ε(dt_i)···P^0_ε(dt_{k+1})] ᾱ_n(dt_k) ··· P^0_ε(dt_1)
 = Σ_{k=1}^∞ ∫···∫ T_{t_k}(P^0_ε) 1_{t_k≥···≥t_1≥t, t_k≠x∞} ᾱ_n(dt_k) ··· P^0_ε(dt_1).

From Theorem 11.2.2 and the tightness of α_n the last series becomes, up to an o_P(1) term,

Σ_{k=1}^∞ ∫···∫ F(t_k) 1_{t_k≥···≥t_1≥t, t_k≠x∞} ᾱ_n(dt_k) · F(dt_{k−1})/F(t_{k−1}) ··· F(dt_1)/F(t_1).

On the support of φ, by the Glivenko-Cantelli Theorem, the distribution function of P^0_n converges uniformly to that of dF/F, with probability one.


Summarizing, we have, up to an o_P(1) term,

n^{1/2} I = Σ_{k=1}^∞ ∫···∫ (φ(t)/F(t)) 1_{t_k≥···≥t_1≥t, t_k≠x∞} ᾱ_n(dt_k) · F(dt_{k−1})/F(t_{k−1}) ··· F(dt_1)/F(t_1) F(dt).

Taking (11.4.5) into account, we have thus obtained

n^{1/2} [∫ φ dF_n − ∫ φ dF^0_n] = ∫ ψ dα_n + o_P(1)    (11.4.6)

with

ψ(s) = φ(s) + Σ_{k=1}^∞ ∫···∫ φ(t_1) 1_{s≥t_k≥···≥t_1} F(dt_k)/F(t_k) ··· F(dt_1)/F(t_1).    (11.4.7)

Hence the Gateaux differential of a linear functional of F_n belongs to the closed linear span of the empirical process α_n, which obviously belongs to the closed linear span of the model at F. For efficiency, since ∫ ψ dα_n is a linear functional, it suffices to consider sample size n = 1. Now

∫ ψ dα_1 = ∫ ψ [dH^{11}_1 / G − H_1 dF / H]
         = δ_1 δ_2 ψ(Z_1, Z_2) / G(Z_1, Z_2) − ∫ 1_{Z_1≥z_1, Z_2≥z_2} ψ(z_1, z_2) / H(z_1, z_2) F(dz_1, dz_2).

It follows that

E[∫ ψ dα_1 | X_1, X_2] = ψ(X_1, X_2) − ∫ 1_{X_1≥z_1, X_2≥z_2} ψ(z_1, z_2) / F(z_1, z_2) F(dz_1, dz_2).

If we plug (11.4.7) into the last expression, the terms in the series cancel out and the conditional expectation drops down to

E[∫ ψ dα_1 | X_1, X_2] = φ(X_1, X_2).    (11.4.8)


11.5 Simulation Results

[Figure: perspective plots of Fehat (the estimator) and Fbar over the grid (x1seq, x2seq) ∈ [0, 6]²; n = 100, theta = 1.125, tau = 0.125, epsilon = (n+1)^{−1}.]


Bibliography

[1] Dabrowska, D. M., Kaplan-Meier estimate on the plane, Ann. Statist., 16, 1475-1489 (1988).

[2] Gill, R. D., van der Laan, M. J. and Wellner, J. A., Inefficient estimators of the bivariate survival function for three models, Ann. Inst. H. Poincaré Probab. Statist., 31, 545-597 (1995).

[3] Prentice, R. L., Moodie, F. and Wu, J., Hazard-based nonparametric survivor function estimation, J. R. Stat. Soc. Ser. B Stat. Methodol., 66, 305-319 (2004).

[4] Serfling, R. J., Approximation Theorems of Mathematical Statistics, Wiley, New York (1980).

[5] Stute, W., The central limit theorem under random censorship, Ann. Statist., 23, 422-439 (1995).

[6] Stute, W. and Wang, J. L., The strong law under random censorship, Ann. Statist., 21, 1591-1607 (1993).

[7] van der Laan, M. J., Efficient estimation in the bivariate censoring model and repairing NPMLE, Ann. Statist., 24, 596-627 (1996).

[8] van der Vaart, A., On differentiable functions, Ann. Statist., 19, 178-204 (1991).


Chapter 12

Nonparametric Curve Estimation

12.1 Nonparametric Density Estimation

Let X_1, . . . , X_n be a sample of independent random variables from a d.f. F with density f = F′. In this section we discuss a nonparametric estimator of f which was first proposed by Rosenblatt (1956) and Parzen (1962) and which has since been extensively studied in the literature.

We already mentioned that F_n does not allow for a straightforward estimation of f. Noting, however, that

f(x) = F′(x) = lim_{a↓0} [F(x + a/2) − F(x − a/2)] / a,

we could consider the right-hand side, for a fixed bandwidth a > 0, and estimate the ratio through

f_n(x) = [F_n(x + a/2) − F_n(x − a/2)] / a.

Since F_n is not differentiable at the data, it is not permitted to take the limit a → 0.

For any such bandwidth, we have

E f_n(x) = [F(x + a/2) − F(x − a/2)] / a

and

Var f_n(x) = (1/(na²)) (F(x + a/2) − F(x − a/2)) (1 − F(x + a/2) + F(x − a/2)) ∼ f(x)/(na).

The formula for E f_n(x) reveals that in general f_n(x) is a biased estimator of f(x). Put

Bias f_n(x) := E f_n(x) − f(x).

If f is twice continuously differentiable, then Taylor's formula yields

Bias f_n(x) = (a²/24) f″(x) + O(a³).

An overall measure for the deviation between f_n and f at x is the Mean Squared Error

MSE f_n(x) := Bias² f_n(x) + Var f_n(x).

Neglecting error terms we get

MSE f_n(x) = (a⁴/24²) [f″(x)]² + f(x)/(na) = c_1 a⁴ + c_2/(na).

Setting the derivative 4 c_1 a³ − c_2/(na²) equal to zero shows that this expression is minimized for

a = [c_2 / (4 c_1 n)]^{1/5} = c n^{−1/5}.

The unpleasant feature of this result is that the constant c depends on f and is therefore unknown.
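As a quick numerical check of the last display (with purely illustrative constants c_1, c_2), the following sketch minimizes c_1 a⁴ + c_2/(na) over a grid and compares the minimizer with the closed form [c_2/(4 c_1 n)]^{1/5}.

```python
import numpy as np

c1, c2, n = 0.7, 1.3, 1000                      # illustrative constants only
a_grid = np.linspace(1e-3, 1.0, 100_000)
mse = c1 * a_grid ** 4 + c2 / (n * a_grid)      # MSE f_n(x) = c1*a^4 + c2/(n*a)
a_numeric = a_grid[np.argmin(mse)]
a_closed = (c2 / (4 * c1 * n)) ** (1 / 5)       # zero of 4*c1*a^3 - c2/(n*a^2)
print(a_numeric, a_closed)                      # both approximately 0.215
```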

A way out of this dilemma is to parametrize the bandwidth as t n^{−1/5} and look at f_n as a stochastic process in t. This will be studied in a forthcoming section.

Our results on the oscillation modulus immediately apply to get an almost sure bound for the stochastic error. Actually, for any bandwidth a = a_n tending to zero,

√(n a_n) [f_n(x) − E f_n(x)] = a_n^{−1/2} [α_n(x + a_n/2) − α_n(x − a_n/2)],

which in absolute values is uniformly in x bounded from above by

a_n^{−1/2} ω_n(a_n) = O(√(ln a_n^{−1})).

Hence for the optimal rate a_n = t n^{−1/5},

sup_x |f_n(x) − E f_n(x)| = O(√(ln n) / n^{2/5}).

The analytic error (bias) equals, as we have seen before, O(a_n²) = O(n^{−2/5}).

In this section we extend f_n to a more general class of estimators. Put

K(y) = 1_{−1/2 ≤ y < 1/2},  y ∈ R.    (12.1.1)

Then we have

f_n(x) = (1/a) ∫ K((x − y)/a) F_n(dy).    (12.1.2)

The function K is nonnegative and has Lebesgue integral ∫ K(y) dy = 1, i.e., K is a Lebesgue density. Equation (12.1.2) also applies to other densities K. If we replace the indicator by a differentiable K, also the resulting f_n will be differentiable, allowing for estimation of higher order derivatives of F. The function K is called the kernel and the resulting f_n is a kernel estimator.

As for the naive kernel (12.1.1), the optimal choice of a_n depends on unknown quantities. Introducing the function

K_a(y) = (1/a) K(y/a),  y ∈ R,

the kernel estimator equals K_a ∗ F_n, where ∗ denotes convolution. Most of the results obtained so far extend to a general K, under some mild tail conditions on K(x) as |x| → ∞.
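A minimal sketch of the kernel estimator (12.1.2) in Python (variable names illustrative): the default kernel is the naive indicator kernel of (12.1.1); any other Lebesgue density K may be passed instead, e.g. the Gaussian kernel.

```python
import numpy as np

def kernel_density(x_grid, data, a, kernel=None):
    """Rosenblatt-Parzen estimator f_n(x) = (1/(n*a)) * sum_i K((x - X_i)/a)."""
    if kernel is None:                                    # naive kernel of (12.1.1)
        kernel = lambda u: ((u >= -0.5) & (u < 0.5)).astype(float)
    x_grid = np.atleast_1d(np.asarray(x_grid, float))
    data = np.asarray(data, float)
    u = (x_grid[:, None] - data[None, :]) / a
    return kernel(u).mean(axis=1) / a

rng = np.random.default_rng(1)
sample = rng.normal(size=500)
grid = np.linspace(-3, 3, 61)
a_n = len(sample) ** (-1 / 5)                             # bandwidth of order n^{-1/5}
f_naive = kernel_density(grid, sample, a_n)
f_gauss = kernel_density(grid, sample, a_n,
                         kernel=lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi))
```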


12.2 Nonparametric Regression: Stochastic Design

An important issue in Multivariate Statistics is that of investigating the dependence structure of the components of multivariate data. This question was initiated by F. Galton, who coined the term regression. Further, let (X, Y) be a random vector in R^{d+1}. A classical model assumption is that, up to a stochastic error, Y depends linearly on X:

Y = X^T β + ε,  E(ε|X) = 0,  Var ε = σ² < ∞.    (12.2.1)

Here, β is the unknown parameter of interest. Apart from (12.2.1) it is often assumed that ε has a normal distribution. In this case, many distributions of quantities of interest (e.g., of the LSE of β) may be explicitly computed. Otherwise, asymptotic distributional results may be obtained under appropriate assumptions. Estimation and testing procedures are based on an i.i.d. sample (Xi, Yi), 1 ≤ i ≤ n, from the same distribution as (X, Y). Much less has been known for a long time when the model assumptions (12.2.1) cannot be justified. For a general approach to this question, observe that

E[Y|X] = E[X^T β|X] + E[ε|X] = X^T β,

i.e.

E[Y|X = x] = x^T β.

In other words, knowing β is tantamount to knowing the regression function

m(x) = E[Y|X = x].    (12.2.2)

Nonparametric regression is concerned with estimation of m without making any model assumptions like (12.2.1). As becomes clear from (12.2.2), the regression function m is the factorization of E[Y|X] w.r.t. X. It is known from measure theory that existence of m follows from the Radon-Nikodym Theorem, but no explicit representation of m is available in general.

Estimation of m from an i.i.d. sample was independently initiated by Nadaraya (1964) and Watson (1964). To point out their idea, suppose that x is an X-atom, i.e., P(X = x) > 0. We then know that (a version of) m(x) is given by

m(x) = (1/P(X = x)) ∫_{X=x} Y dP.    (12.2.3)

The SLLN implies that with probability one

n^{−1} Σ_{i=1}^n Yi 1_{Xi=x} → ∫_{X=x} Y dP

and

n^{−1} Σ_{i=1}^n 1_{Xi=x} → P(X = x).

Thus, defining

m_n(x) = Σ_{i=1}^n Yi 1_{Xi=x} / Σ_{i=1}^n 1_{Xi=x} = Σ_{i=1}^n Yi 1_{x}(Xi) / Σ_{i=1}^n 1_{x}(Xi),

we get m_n(x) → m(x) with probability one.

If x is not an X-atom, (12.2.3) is no longer true. Nadaraya and Watson proposed (when d = 1) to replace the one-point set {x} with an interval (x − a_n, x + a_n]:

m_n(x) = Σ_{i=1}^n Yi 1_{x−a_n<Xi≤x+a_n} / Σ_{i=1}^n 1_{x−a_n<Xi≤x+a_n}    (12.2.4)

(= 0 if the denominator is zero).

Here a_n > 0 is again a bandwidth (or window) tending to zero as n gets large. a_n will play the same role as the bandwidth in the kernel density estimator. Similarly, also here, we are likely to lose the local information contained in the data and increase the bias if we choose a_n too large (oversmoothing), and to have too much variance if a_n is too small (undersmoothing). Choice of a proper bandwidth will therefore be an important issue also here.

Notice that m_n of (12.2.4) may be written as

m_n(x) = Σ_{i=1}^n Yi K((x − Xi)/a_n) / Σ_{i=1}^n K((x − Xi)/a_n)    (12.2.5)

with

K(y) = 1_{[−1,1)}(y).

Needless to say, (12.2.5) may be extended to more general kernels K. Note also that, unlike the density case, we need not require ∫ K(u) du = 1 (the case ∫ K = 0 is always ruled out). Also, because m may take on negative values, there are no psychological (or emotional) objections to kernels attaining negative values.

If x is not an atom of X, more sophisticated arguments will be needed for showing, e.g., consistency of m_n(x). Here is a heuristic argument which uses the fact that m(X) = E[Y|X]. Only the means of the numerator and the denominator are considered.

We have

E[(1/(n a_n)) Σ_{i=1}^n Yi K((x − Xi)/a_n)] = E[(1/a_n) Y K((x − X)/a_n)] = E E[. . . |X]
 = E[(1/a_n) K((x − X)/a_n) E[Y|X]] = E[(1/a_n) m(X) K((x − X)/a_n)].

Assuming that X has a density f, the last integral becomes

a_n^{−1} ∫ m(y) f(y) K((x − y)/a_n) dy → m(x) f(x) ∫ K(u) du

as a_n → 0, under appropriate regularity conditions on f and m. The denominator of (12.2.5) formally equals the numerator if we set Yi ≡ 1. Thus the expectation of the denominator is likely to converge to 1 · f(x) ∫ K(u) du. Provided that f(x) > 0, the ratio therefore in the limit becomes m(x), as desired.

Originally, m_n was motivated upon assuming that (X, Y) has a (bivariate) density g. In this case

m(x) = ∫ y g(x, y) dy / f(x),    (12.2.6)

where

f(x) = ∫ g(x, y) dy

is the marginal density of X. Denote with

H_n(u, v) = n^{−1} Σ_{i=1}^n 1_{Xi≤u, Yi≤v},  u, v ∈ R,


the bivariate empirical d.f. of (Xi, Yi), 1 ≤ i ≤ n. Set

g_n(x, y) = a_n^{−2} ∫∫ K((x − u)/a_n) K((y − v)/a_n) H_n(du, dv) = a_n^{−2} n^{−1} Σ_{i=1}^n K((x − Xi)/a_n) K((y − Yi)/a_n).

Then g_n constitutes the extension of the Rosenblatt-Parzen kernel density estimator to the bivariate case, with kernel

K_0(u, v) = K(u) K(v).

Check that, by Fubini,

∫ y g_n(x, y) dy = a_n^{−2} ∫∫∫ y K((x − u)/a_n) K((y − v)/a_n) H_n(du, dv) dy
 = a_n^{−2} ∫∫∫ y K((x − u)/a_n) K((y − v)/a_n) dy H_n(du, dv)
 = a_n^{−1} ∫∫ K((x − u)/a_n) ∫ (a_n w + v) K(w) dw H_n(du, dv)
 = a_n^{−1} ∫∫ v K((x − u)/a_n) H_n(du, dv)
 = a_n^{−1} n^{−1} Σ_{i=1}^n Yi K((x − Xi)/a_n),

provided that ∫ K(w) dw = 1 and ∫ w K(w) dw = 0. If in (12.2.6) g and f are replaced with g_n and

f_n(x) = a_n^{−1} n^{−1} Σ_{i=1}^n K((x − Xi)/a_n),

respectively, then

m_n(x) ≡ ∫ y g_n(x, y) dy / f_n(x)

becomes (12.2.5), as expected.

Remark 12.2.1. Setting

W_in(x) = K((x − Xi)/a_n) / Σ_{j=1}^n K((x − Xj)/a_n),

we have

m_n(x) = Σ_{i=1}^n W_in(x) Yi.    (12.2.7)
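A minimal sketch of the Nadaraya-Watson estimator (12.2.5)/(12.2.7) in Python (illustrative names), with the window kernel K = 1_{[−1,1)} as default and the convention m_n(x) = 0 when the denominator vanishes.

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, a_n, kernel=None):
    """m_n(x) = sum_i W_in(x)*Y_i with W_in(x) = K((x-X_i)/a_n) / sum_j K((x-X_j)/a_n)."""
    if kernel is None:
        kernel = lambda u: ((u >= -1.0) & (u < 1.0)).astype(float)   # K = 1_{[-1,1)}
    x_grid = np.atleast_1d(np.asarray(x_grid, float))
    K = kernel((x_grid[:, None] - np.asarray(X, float)[None, :]) / a_n)
    num = K @ np.asarray(Y, float)
    den = K.sum(axis=1)
    return np.where(den > 0, num / np.maximum(den, 1e-300), 0.0)     # 0 if denominator is 0

# toy example with m(x) = sin(2*pi*x) on [0, 1]
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 400)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=400)
m_hat = nadaraya_watson(np.linspace(0, 1, 21), X, Y, a_n=0.05)
```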

12.3 Consistent Nonparametric Regression

Stone (1977) considered estimation of m through estimators of the form (12.2.7) for general weights W_n, where W_n = W_n(x) = (W_1n(x), . . . , W_nn(x)) and W_in(x) = W_in(x, X_1, . . . , X_n):

m_n(x) = Σ_{i=1}^n W_in(x) Yi.

Let (X, Y) ∈ R^{d+1}, where d ≥ 1. ∥ · ∥ will denote some norm on R^d.

Definition 12.3.1. A sequence m_n is said to be consistent in L^r, r ≥ 1, if

m_n(X) → m(X) in L^r,    (12.3.1)

i.e.

E ∫ |m_n(x) − m(x)|^r µ(dx) → 0.

Here µ is the distribution of X and m(x) = E[Y|X = x] is the true regression function. Without further mentioning it is assumed that E|Y|^r < ∞.

Consider the following set of assumptions

(i) E[Σ_{i=1}^n |W_in(X)| f(Xi)] ≤ C E f(X) for every f ≥ 0 and n ≥ 1, some C ≥ 1.

(ii) For some D ≥ 1, P(Σ_{i=1}^n |W_in(X)| ≤ D) = 1.

(iii) Σ_{i=1}^n |W_in(X)| 1_{∥Xi−X∥>ε} → 0 in probability, for each ε > 0.

(iv) Σ_{i=1}^n W_in(X) → 1 in probability.

(v) max_{1≤i≤n} |W_in(X)| → 0 in probability.

Theorem 12.3.2. (Stone). Under (i) - (v),

m_n(X) → m(X) in L^r.

To prove the theorem, we need a series of lemmas.

Lemma 12.3.3. Under (i) – (iii), assume that E|f(X)|^r < ∞. Then

E[Σ_{i=1}^n |W_in(X)| |f(Xi) − f(X)|^r] → 0.

Proof. For ε > 0 given, let h be a continuous function on R^d with compact support such that

E|f(X) − h(X)|^r ≤ ε.

By (i),

E[Σ_{i=1}^n |W_in(X)| |f(Xi) − h(Xi)|^r] ≤ C E|f(X) − h(X)|^r ≤ Cε.

From (ii),

E[Σ_{i=1}^n |W_in(X)| |f(X) − h(X)|^r] ≤ D E|f(X) − h(X)|^r ≤ Dε.

Altogether this shows that we need to prove the lemma only for continuous f with compact support. Set M = ∥f∥_∞. Since f is uniformly continuous, given ε > 0, we may find some δ > 0 such that

∥x − x_1∥ ≤ δ implies |f(x) − f(x_1)|^r ≤ ε.

By (ii),

E[Σ_{i=1}^n |W_in(X)| |f(Xi) − f(X)|^r] ≤ (2M)^r E[Σ_{i=1}^n |W_in(X)| 1_{∥Xi−X∥>δ}] + εD.

Use (ii) and (iii) to conclude that the last expectation converges to zero. This completes the proof of the lemma.

Lemma 12.3.4. Under (i) – (iii), let W_n be a sequence of nonnegative weights such that for nonnegative constants M_n and N_n

P(M_n ≤ Σ_{i=1}^n W_in(X) ≤ N_n) → 1.

Let f be a nonnegative Borel-function on R^d such that E f(X) < ∞. Then

lim inf_{n→∞} E Σ_{i=1}^n W_in(X) f(Xi) ≥ (lim inf_{n→∞} M_n) E f(X)

and

lim sup_{n→∞} E Σ_{i=1}^n W_in(X) f(Xi) ≤ (lim sup_{n→∞} N_n) E f(X).

Proof. Set

A_n = {M_n ≤ Σ_{i=1}^n W_in(X) ≤ N_n}.

Assume w.l.o.g. M_n ≤ D. Then

M_n − D 1_{A_n^c} ≤ Σ_{i=1}^n W_in(X) ≤ N_n + D 1_{A_n^c}

and therefore

M_n E f(X) − D E[1_{A_n^c} f(X)] ≤ E[Σ_{i=1}^n W_in(X) f(X)] ≤ N_n E f(X) + D E[1_{A_n^c} f(X)].

Now, P(A_n) → 1 implies E[1_{A_n^c} f(X)] → 0. Consequently

lim inf_{n→∞} E[Σ_{i=1}^n W_in(X) f(X)] ≥ (lim inf_{n→∞} M_n) E f(X).

Apply Lemma 12.3.3 to prove the lim inf part of the lemma. A similar argument yields the lim sup part.


Lemma 12.3.5. Under (i) – (iii), assume that for some constants M_n and N_n

P(M_n ≤ Σ_{i=1}^n W²_in(X) ≤ N_n) → 1.

Then for every nonnegative µ-integrable function f

lim inf_{n→∞} E[Σ_{i=1}^n W²_in(X) f(Xi)] ≥ (lim inf_{n→∞} M_n) E f(X)

and

lim sup_{n→∞} E[Σ_{i=1}^n W²_in(X) f(Xi)] ≤ (lim sup_{n→∞} N_n) E f(X).

Proof. Apply the last lemma to W²_in, with C and D replaced by CD and D², respectively.

Lemma 12.3.6. Under (i) - (iii), for every Borel-function f,

Σ_{i=1}^n |W_in(X)| 1_{|f(Xi)−f(X)|>ε} → 0 in probability

for each ε > 0.

Proof. Let ε > 0 be given. Assume first that f is bounded. By Lemma 12.3.3,

E[Σ_{i=1}^n |W_in(X)| |f(Xi) − f(X)|] → 0

and therefore

Σ_{i=1}^n |W_in(X)| 1_{|f(Xi)−f(X)|>ε} → 0 in probability.

For an arbitrary f, we may choose an M > 0 such that P(|f(X)| > M) ≤ δ, a given small positive constant. Set f̃ = (f ∧ M) ∨ (−M). Since

{f̃(Xi) ≠ f(Xi)} ⊂ {|f(Xi)| > M},

we obtain

Σ_{i=1}^n |W_in(X)| 1_{|f(Xi)−f(X)|>ε} ≤ Σ_{i=1}^n |W_in(X)| 1_{|f̃(Xi)−f̃(X)|>ε} + Σ_{i=1}^n |W_in(X)| 1_{|f(Xi)|>M or |f(X)|>M}.

The first sum converges to zero in probability, by boundedness of f̃. The expectation of the second sum is less than or equal to (use (i) and (ii))

(C + D) P(|f(X)| > M) ≤ (C + D)δ.

Lemma 12.3.7. Under (i) - (iv), assume that f(X) ∈ L^r, some r ≥ 1. Then

Σ_{i=1}^n W_in(X) f(Xi) → f(X) in L^r.

Proof. By (ii),

|Σ_{i=1}^n W_in(X) − 1|^r ≤ (1 + D)^r with probability one.

Thus, by (iv),

E|[(Σ_{i=1}^n W_in(X)) − 1] f(X)|^r → 0.    (12.3.2)

Apply (ii), Lemma 12.3.3 and Hölder's inequality to obtain

E|Σ_{i=1}^n W_in(X) [f(Xi) − f(X)]|^r → 0.    (12.3.3)

Clearly, (12.3.2) and (12.3.3) imply the assertion of the lemma.

We are now in the position to give the

Proof of Theorem 12.3.2. Assume (i) – (v), and let r ≥ 1. Since E|Y|^r < ∞ by assumption,

E|m(X)|^r = E|E(Y|X)|^r ≤ E|Y|^r < ∞


by Jensen's inequality. Now,

Σ_{i=1}^n W_in(X) Yi − m(X) = [Σ_{i=1}^n W_in(X) m(Xi) − m(X)] + Σ_{i=1}^n W_in(X) Zi,

where

Zi = Yi − E[Yi|Xi] = Yi − m(Xi).

The term in brackets converges to zero in L^r by Lemma 12.3.7. For the second sum consider r = 2 first. It follows from the very definition of m and the independence of the data that

E[Σ_{i=1}^n W_in(x) Zi]² = Σ_{i=1}^n Σ_{j=1}^n E[W_in(x) W_jn(x) Zi Zj]
 = Σ_{i=1}^n Σ_{j=1}^n E[E(W_in(x) W_jn(x) Zi Zj | X_1, . . . , X_n)]
 = Σ_{i=1}^n Σ_{j=1}^n E[W_in(x) W_jn(x) E(Zi Zj | X_1, . . . , X_n)]
 = Σ_{i=1}^n E[W²_in(x) E(Z²_i | Xi)] = Σ_{i=1}^n E[W²_in(x) h(Xi)],

say. It follows that

E[Σ_{i=1}^n W_in(X) Zi]² = E[Σ_{i=1}^n W²_in(X) h(Xi)].

By Lemma 12.3.5, the lim sup of the right-hand side is less than or equal to

(lim sup_{n→∞} N_n) E h(X),    (12.3.4)

where N_n is such that

P(Σ_{i=1}^n W²_in(X) ≤ N_n) → 1.

By (ii) and (v),

Σ_{i=1}^n W²_in(X) → 0 in probability,

i.e., for N_n we may take N_n ≡ δ, where δ > 0 is any small number. Expression (12.3.4) thus becomes δ E h(X). Since δ is arbitrary, this proves the theorem.

Remark 12.3.8. Observing that

W_in(X) = W_in(X; X_1, . . . , X_n)

is a measurable function of the i.i.d. random vectors X, X_1, . . . , X_n, we get

E[|W_in(X)| f(Xi)] = E[|W_in(Xi; X_1, . . . , X, . . . , X_n)| f(X)],

so that (i) amounts to

E[Σ_{i=1}^n |W_in(Xi; X_1, . . . , X, . . . , X_n)| f(X)] ≤ C E f(X),

any f ≥ 0. In particular, (i) is satisfied whenever

(i)* Σ_{i=1}^n |W_in(Xi; X_1, . . . , x, . . . , X_n)| ≤ C, each x ∈ R^d,    (12.3.5)

holds.

Remark 12.3.9. Let g be a Borel-measurable function such that g(Y) ∈ L^r. Replacing Yi by g(Yi), we obtain

m_n(x) = Σ_{i=1}^n W_in(x) g(Yi),

which, under (i) - (v), is consistent for

m(x) = E[g(Y)|X = x].

As an example, take g = 1_{(−∞,t]}; then

m_n(x) = m_n(t; x) = Σ_{i=1}^n W_in(x) 1_{Yi≤t}

and

m(x) = m(t; x) = P(Y ≤ t|X = x),

the empirical and the true conditional d.f. at t given X = x.

Similarly, combining m_n for g(y) = y² and g(y) = y, we obtain

σ²_n(x) = Σ_{i=1}^n W_in(x) Y²_i − [Σ_{i=1}^n W_in(x) Yi]²

as an estimator for the conditional variance

E[Y²|X = x] − E[Y|X = x]² ≡ Var(Y|X = x).

12.4 Nearest-Neighbor Regression Estimators

In this section we shall introduce and discuss a family of weights satisfying (i)* from the last section. Also, all of the other conditions in Theorem 12.3.2 will be seen to hold. To be specific, let X_1, . . . , X_n be a random sample in R^d with distribution µ. Fix x ∈ R^d. Set d_i := ∥x − X_i∥, a nonnegative random variable. Clearly, d_1, . . . , d_n are i.i.d. Denote with d_{1:n} ≤ d_{2:n} ≤ . . . ≤ d_{n:n} the ordered d_i's.

Definition 12.4.1. For 1 ≤ k ≤ n, X_i is called the k-nearest neighbor (k-NN) of x iff d_i = d_{k:n}. If the d's have ties, we may break them by attaching independently a uniform Z_i to X_i and then replacing X_i by (X_i, Z_i), thus embedding X_i into R^{d+1}. X_i is said to be among the k-NN if d_i = d_{r:n}, r ≤ k.

Definition 12.4.2. For any µ, supp(µ) is the smallest closed set S for which µ(S) = 1.

Remark 12.4.3. In “Topological Measure Theory” it is shown that supp(µ) exists.

In the following k = kn may and will depend on n as n→ ∞.

Lemma 12.4.4. Assume k_n/n → 0 as n → ∞. Then for each x ∈ supp(µ) we have

d_{k_n:n} = d_{k_n:n}(x) → 0 with probability one,

i.e. the k_n-NN converges to x P-a.s.

Proof. For each ε > 0, denote with

S_ε(x) = {y : ∥x − y∥ < ε}

the open ε-ball with center x. For x ∈ supp(µ), we have µ(S_ε(x)) > 0. In fact, from µ(S_ε(x)) = 0 we obtain µ(A) = 1, where A = supp(µ) \ S_ε(x) is a closed set strictly contained in supp(µ), a contradiction.

Now, from the SLLN,

µ_n(S_ε(x)) → µ(S_ε(x)) ≡ a > 0 P-a.s.

Hence,

Σ_{i=1}^n 1_{Xi∈S_ε(x)} ≥ na/2 ≥ k_n

eventually with probability one. Conclude d_{k_n:n} < ε. This proves the lemma.

In the following, replace x by X, so that dkn:n becomes dkn:n(X).

Lemma 12.4.5. Under k_n/n → 0 as n → ∞,

d_{k_n:n}(X) → 0 with probability one.

Proof. Set

A = {(ω, x) ∈ Ω × R^d : d^ω_{k_n:n}(x) → 0}.

Then A ∈ A ⊗ B^d. From Lemma 12.4.4,

P(A_x) = 1 for each x ∈ supp(µ).

By independence of X and the training sample,

P(d_{k_n:n}(X) → 0) = ∫_{supp(µ)} P(A_x) µ(dx) = 1.

We now introduce the k_n-NN weights. For x ∈ R^d, set

W_in(x) = 1/k_n if X_i is among the k_n-NN of x, and 0 otherwise.
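A minimal sketch of the resulting k_n-NN regression estimate (Python, illustrative names), using the sup-norm as distance and breaking ties arbitrarily:

```python
import numpy as np

def knn_regression(x, X, Y, k):
    """m_n(x) = sum_i W_in(x) Y_i with W_in(x) = 1/k if X_i is among the k-NN of x."""
    X = np.asarray(X, float)
    x = np.asarray(x, float)
    d = np.abs(X - x).max(axis=-1) if X.ndim > 1 else np.abs(X - x)  # sup-norm distances
    nn = np.argsort(d)[:k]                                           # k nearest neighbours
    return np.asarray(Y, float)[nn].mean()                           # average of their Y's

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (500, 2))
Y = X[:, 0] ** 2 + X[:, 1] + 0.2 * rng.normal(size=500)
k_n = int(np.sqrt(500))                   # k_n -> infinity while k_n/n -> 0
print(knn_regression(np.array([0.3, -0.2]), X, Y, k_n))
```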

Lemma 12.4.6. Under k_n/n → 0, we have for each ε > 0

Σ_{i=1}^n |W_in(X)| 1_{∥Xi−X∥>ε} → 0 in probability.

This is condition (iii) in Theorem 12.3.2.

Proof. Set

A_n := {d_{k_n:n}(X) > ε}.

Lemma 12.4.5 yields P(A_n) → 0 as n → ∞. Furthermore,

Σ_{i=1}^n |W_in(X)| 1_{∥Xi−X∥>ε} ≤ 1_{A_n},

whence the result.

Since W_in(x) ≥ 0,

Σ_{i=1}^n W_in(x) = 1

and

max_{1≤i≤n} W_in(x) = k_n^{−1} → 0 (provided that k_n → ∞),

also conditions (ii), (iv) and (v) are satisfied. It remains to verify condition (i) resp. (i)*:

(i)* Σ_{i=1}^n W_in(Xi; X_1, . . . , x, . . . , X_n) ≤ C, each x ∈ R^d.

Notice that the left-hand side of (i)* equals k_n^{−1} times the number of i's for which x is among the k_n-NN of X_i.

We first treat the case kn = 1.

Lemma 12.4.7. (Bickel and Breiman). For each d ≥ 1 and any norm ∥ · ∥, there exists α(d) < ∞ such that it is possible to write R^d as the union of α(d) disjoint cones C_1, . . . , C_{α(d)} with 0 as their common peak such that if x, y ∈ C_j (x, y ≠ 0), then ∥x − y∥ < max(∥x∥, ∥y∥), j = 1, . . . , α(d).

Proof. Set

S(0, 1) = {x ∈ R^d : ∥x∥ ≤ 1},

the closed unit ball in R^d. Since ∂S(0, 1) is compact we can find disjoint sets C̃_1, . . . , C̃_{α(d)} such that

∪_{j=1}^{α(d)} C̃_j = ∂S(0, 1) and ∥x̃ − ỹ∥ < 1 for x̃, ỹ ∈ C̃_j.

Let

C_j = {λx̃ : x̃ ∈ C̃_j, λ ≥ 0}, 1 ≤ j ≤ α(d).

Suppose x = λx̃, y = ηỹ with x̃, ỹ ∈ C̃_j. W.l.o.g., assume λ ≤ η. Then

∥x − y∥ = η ∥(λ/η)x̃ − ỹ∥ ≤ η [(1 − λ/η) ∥ỹ∥ + (λ/η) ∥x̃ − ỹ∥] < η = ∥y∥.

Corollary 12.4.8. For any set of n distinct points in R^d, say x_1, . . . , x_n, x_1 can be the NN (k = 1) of at most α(d) points. Hence, for k_n = 1, (i)* holds with C = α(d).

Proof. Assume x_1 = 0 w.l.o.g. Lemma 12.4.7 yields that in each C_j there exists at most one point for which x_1 is a NN. In fact, for two points x, y from the same C_j such that ∥x∥ ≤ ∥y∥, we have ∥x − y∥ < ∥y∥, i.e. 0 cannot be a NN of y.

Corollary 12.4.9. For a general k_n, x_1 can be among the k_n-NN of at most k_n α(d) points. Hence (i)* again holds true with C = α(d).

Proof. Similar to the last Corollary.

Altogether, we have shown that the k_n-NN weights satisfy the assumptions of Theorem 12.3.2. Hence we may conclude

Theorem 12.4.10. Under k_n/n → 0 and k_n → ∞, the k_n-NN regression estimate

m_n(X) = Σ_{i=1}^n W_in(X) Yi

fulfills

m_n(X) → m(X) in L^r.

Notice that Theorem 12.4.10 holds under no assumption on the underlying distribution of (X, Y) (up to E|Y|^r < ∞). Therefore we say that m_n(X) is universally consistent.


12.5 Nonparametric Classification

Suppose that Y is some variable attaining only finitely many values, say 1, . . . , m. Rather than Y, we observe some random vector X, which typically is correlated with Y. The problem of classification is one of specifying the value of Y on the basis of X.

In other words, we are looking for a function (or rule) g of X such that g(X) may serve as a substitute for Y. Denote with

P(g(X) ≠ Y) the probability of misclassification.

g_0 is optimal if the probability of misclassification is minimal:

inf_g P(g(X) ≠ Y) ≡ L∗ = P(g_0(X) ≠ Y).

In the following we shall derive the optimal rule. Now,

L∗ = 1 − sup_g P(g(X) = Y).

But

P(g(X) = Y) = Σ_{i=1}^m P(g(X) = i, Y = i)
 = Σ_{i=1}^m ∫_{g(X)=i} P(Y = i|X) dP = Σ_{i=1}^m ∫_{x: g(x)=i} p_i(x) µ(dx)
 ≤ Σ_{i=1}^m ∫_{x: g(x)=i} max_j p_j(x) µ(dx) = ∫ max_j p_j(x) µ(dx),    (12.5.1)

where as in the previous section µ denotes the distribution of X and

p_i(x) = P(Y = i|X = x)

denotes the so-called “a posteriori probability” of Y = i. Notice that we obtain equality in (12.5.1) if g is such that

g(x) = i iff p_i(x) = max_j p_j(x).    (12.5.2)


The rule g_0 defined by (12.5.2) is called the Bayes-Rule. For this g_0, we have

P(g_0(X) = Y) = ∫ max_j p_j(x) µ(dx)

and therefore

L∗ = 1 − E[max_{1≤j≤m} p_j(X)], the Bayes risk.

This is the minimal probability of misclassification attainable.

The point now is that for (12.5.2) we need to know p_j, 1 ≤ j ≤ m. Since in practice this is not possible, they need to be estimated from a learning sample (Xi, Yi), 1 ≤ i ≤ n. Set

p_jn(x) = Σ_{i=1}^n W_in(x) 1_{Yi=j}, 1 ≤ j ≤ m.

Under the conditions of Theorem 12.3.2, for each 1 ≤ j ≤ m,

p_jn(X) → p_j(X) in L^r as n → ∞.    (12.5.3)

According to (12.5.2) it is tempting to consider the empirical Bayes rule

g_n(x) = i iff p_in(x) = max_j p_jn(x)

as a substitute for g_0. From Theorem 12.4.10 we already know that (12.5.3) holds for the k_n-NN weights. In this case p_jn(x) is the relative frequency of the label j among those observations whose X is among the k_n-NN of x. The resulting g_n is also called the majority vote.
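A minimal sketch of the k_n-NN majority vote (Python, illustrative names): estimate p_jn(x) by the relative frequency of the label j among the k_n nearest neighbours of x and return a maximizing label.

```python
import numpy as np

def majority_vote(x, X, Y, k):
    """Empirical Bayes rule g_n(x) based on the k-NN weights (ties broken arbitrarily)."""
    X = np.asarray(X, float)
    d = np.abs(X - np.asarray(x, float)).max(axis=-1) if X.ndim > 1 else np.abs(X - x)
    nn_labels = np.asarray(Y)[np.argsort(d)[:k]]          # labels of the k nearest neighbours
    labels, counts = np.unique(nn_labels, return_counts=True)
    return labels[np.argmax(counts)]                      # arg max of the estimated p_jn(x)

# toy two-class example with p_1(x) = P(Y = 1 | X = x) = x on [0, 1]
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 1000)
Y = (rng.uniform(size=1000) < X).astype(int)              # Bayes rule: predict 1 iff x > 1/2
print(majority_vote(0.8, X, Y, k=int(np.sqrt(1000))))     # typically 1
```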

Definition 12.5.1. A sequence {g_n}_n of rules is said to be consistent in Bayes risk if

L_n ≡ P(g_n(X) ≠ Y) → L∗ as n → ∞.

Theorem 12.5.2. Suppose that g_n is an empirical Bayes rule such that (12.5.3) holds. Then g_n is consistent in Bayes risk.

Proof. Set

e_n(X) = max_{1≤j≤m} |p_jn(X) − p_j(X)|.

From (12.5.3) we obtain

E e_n(X) → 0 as n → ∞.    (12.5.4)


Now, by definition of L∗,

L∗ ≤ P(g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n) P-a.s.

Integrate out to get

L∗ ≤ P(g_n(X) ≠ Y).

Furthermore,

P(g_n(X) = Y | X, X_1, Y_1, . . . , X_n, Y_n) = Σ_{i=1}^m 1_{g_n(X)=i} p_i(X)
 = Σ_{i=1}^m 1_{g_n(X)=i} p_in(X) + Σ_{i=1}^m 1_{g_n(X)=i} [p_i(X) − p_in(X)]
 = max_j p_jn(X) + Σ_{i=1}^m 1_{g_n(X)=i} [p_i(X) − p_in(X)]
 ≥ max_j p_jn(X) − e_n(X) ≥ max_j p_j(X) − 2 e_n(X),

whence

P(g_n(X) = Y) ≥ ∫ max_j p_j(x) µ(dx) − 2 E e_n(X) = 1 − L∗ − 2 E e_n(X).

From (12.5.4) we may infer

lim sup_{n→∞} P(g_n(X) ≠ Y) ≤ L∗,

as desired.

Remark 12.5.3. The consistency in Bayes risk of the k_n-NN rule was first proved (under further regularity conditions) by Fix and Hodges (1951).

Remark 12.5.4. Much attention has also been given to the case k ≡ 1, i.e. Y is classified as the value of that Yi for which Xi is the nearest neighbor of X. It may be shown that

L = lim_{n→∞} P(g_n(X) ≠ Y)

exists, with

L∗ ≤ L ≤ L∗(2 − (m/(m−1)) L∗).

Classical paper: Cover and Hart (1967), IEEE Transactions on Information Theory 13, 21-27.


An important issue in pattern classification is that of estimating the unknown probability of misclassification. L_n is the overall probability of misclassification, i.e., it constitutes the mean risk averaged over all outcomes of X and Y and the training sample.

A more informative quantity, which represents the probability of misclassification given the particular sample X, (Xi, Yi), 1 ≤ i ≤ n, on hand, is the conditional probability

P(g_n(X) ≠ Y | X, X_1, Y_1, . . . , X_n, Y_n) = L_{n1}.

Another quantity, which is of interest when only the training sample is known but X has not been sampled so far, is

P(g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n) = L_{n2}.

Note that both L_{n1} and L_{n2} are random. Recall that in the proof of Theorem 12.5.2 we have seen that

1 − L_{n1} = Σ_{i=1}^m 1_{g_n(X)=i} p_i(X) ≥ max_j p_j(X) − 2 e_n(X).    (12.5.5)

Integrating with respect to X leads to

1 − L_{n2} = P(g_n(X) = Y | X_1, Y_1, . . . , X_n, Y_n) = Σ_{i=1}^m ∫ 1_{g_n(x)=i} p_i(x) µ(dx)
 ≥ ∫ max_j p_j(x) µ(dx) − 2 ∫ e_n(x) µ(dx) = 1 − L∗ − 2 ∫ e_n(x) µ(dx).    (12.5.6)

Theorem 12.5.5. Under the assumptions of Theorem 12.5.2,

L_{n2} → L∗ in probability.

Proof. Similar to the proof of Theorem 12.5.2.

Theorem 12.5.5 asserts that L_{n2} is also a consistent estimate of L∗. Though L_{n2} is measurable w.r.t. the training sample, it cannot be computed due to the fact that the p_i, 1 ≤ i ≤ m, in

1 − L_{n2} = Σ_{i=1}^m ∫_{g_n=i} p_i(x) µ(dx)


are not known. It is tempting, however, to consider

1 − L̂_{n2} = Σ_{i=1}^m ∫_{g_n=i} p_in(x) µ(dx)

instead. L̂_{n2} is called the apparent error rate.

Corollary 12.5.6. Under the assumptions of Theorem 12.5.2,

L̂_{n2} → L∗ in probability.

Proof. In view of Theorem 12.5.5 it remains to prove that

L̂_{n2} − L_{n2} → 0 in probability.

But

|L̂_{n2} − L_{n2}| ≤ Σ_{i=1}^m ∫_{g_n=i} |p_i(x) − p_in(x)| µ(dx) ≤ Σ_{i=1}^m ∫ |p_i(x) − p_in(x)| µ(dx).

The last term converges to zero in the mean and hence in probability.

Notice that L̂_{n2} is computable only when µ is known. If µ is unknown, L̂_{n2} is redefined so as to become

1 − L̂_{n2} = n^{−1} Σ_{j=1}^n max_{1≤i≤m} p_in(Xj).

In the literature it is often reported that L̂_{n2} underestimates L_{n2}.

We now introduce two variants of L̂_{n2}:

L^{CV}_{n2} – the cross-validated estimate of L_{n2}: for this, compute p_{i,n−1} as before, but this time for (X_1, Y_1), . . . , (X_{j−1}, Y_{j−1}), (X_{j+1}, Y_{j+1}), . . . , (X_n, Y_n), i.e., for the whole sample with (Xj, Yj) deleted. Write p^{(j)}_{i,n−1} for this p and put

1 − L^{CV}_{n2} = n^{−1} Σ_{j=1}^n max_{1≤i≤m} p^{(j)}_{i,n−1}(Xj).


L^H_{n2} – the half-sample estimate of L_{n2}: choose 1 ≤ k < n, and compute p_{i,n−k} from the sample (X_1, Y_1), . . . , (X_{n−k}, Y_{n−k}). Use X_{n−k+1}, . . . , X_n for estimation of µ. This results in

1 − L^H_{n2} = k^{−1} Σ_{j=n−k+1}^n max_{1≤i≤m} p_{i,n−k}(Xj).

Note that for L^H_{n2}, we do not require Y_{n−k+1}, . . . , Y_n.

Note 12.5.7. Versions of L^{CV}_{n2} and L^H_{n2} which are more widely used in the literature are (in obvious notation)

L^{CV}_{n2} = n^{−1} Σ_{i=1}^n 1_{g^{(i)}_n(Xi) ≠ Yi}

and

L^H_{n2} = k^{−1} Σ_{j=n−k+1}^n 1_{g_{n,n−k}(Xj) ≠ Yj}.
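A minimal sketch of the cross-validated error estimate of Note 12.5.7 for the k_n-NN rule (Python, illustrative names): each X_j is classified by a majority vote with (X_j, Y_j) removed from the sample, and the misclassifications are counted.

```python
import numpy as np

def loo_error_rate(X, Y, k):
    """Leave-one-out estimate L^CV = n^{-1} sum_j 1{g_n^{(j)}(X_j) != Y_j} for the k-NN rule."""
    X, Y = np.asarray(X, float), np.asarray(Y)
    n, errors = len(Y), 0
    for j in range(n):
        d = np.abs(X - X[j]).max(axis=-1) if X.ndim > 1 else np.abs(X - X[j])
        d[j] = np.inf                                  # delete (X_j, Y_j) from the sample
        nn_labels = Y[np.argsort(d)[:k]]
        labels, counts = np.unique(nn_labels, return_counts=True)
        errors += labels[np.argmax(counts)] != Y[j]
    return errors / n

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 300)
Y = (rng.uniform(size=300) < X).astype(int)            # p_1(x) = x, Bayes risk = 1/4
print(loo_error_rate(X, Y, k=17))                      # roughly 0.25-0.30 in this model
```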

12.6 Smoothed Empirical Integrals

Consider an i.i.d. sequence X_1, X_2, . . . from a density f. For a given bandwidth a > 0, set

f_n(x) = (1/(na)) Σ_{i=1}^n K((x − Xi)/a) = (1/a) ∫ K((x − y)/a) F_n(dy).

Set

F̃_n(t) = ∫_{−∞}^t f_n(x) dx, t ∈ R,

the smoothed empirical d.f. For a given score-function φ, the smoothed empirical φ-integral equals

I = ∫ φ(x) F̃_n(dx) = ∫ φ(x) f_n(x) dx = (1/a) ∫∫ φ(x) K((x − y)/a) dx F_n(dy) ≡ ∫ φ_a(y) F_n(dy).

We have

E(I) = ∫ φ_a(y) F(dy) = φ̄_a

and

n Var(I) = ∫ [φ_a − φ̄_a]² F(dy) = σ²(a),

provided φ_a is square-integrable.
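A minimal sketch (Python, illustrative names) of the smoothed empirical φ-integral I = ∫ φ_a dF_n: after the substitution x = y + au one has φ_a(y) = ∫ φ(y + au) K(u) du, and the inner integral is approximated below by a Riemann sum on a fixed grid (here for a Gaussian kernel K).

```python
import numpy as np

def smoothed_empirical_integral(phi, sample, a, kernel=None, u_grid=None):
    """I = int phi_a(y) F_n(dy) with phi_a(y) = int phi(y + a*u) K(u) du."""
    if kernel is None:
        kernel = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)   # Gaussian K
    if u_grid is None:
        u_grid = np.linspace(-5.0, 5.0, 401)
    w = kernel(u_grid) * (u_grid[1] - u_grid[0])
    w = w / w.sum()                                    # renormalise the truncated kernel
    sample = np.asarray(sample, float)
    phi_a = (phi(sample[:, None] + a * u_grid[None, :]) * w).sum(axis=1)
    return phi_a.mean()                                # integration against F_n

rng = np.random.default_rng(6)
X = rng.normal(size=1000)
print(smoothed_empirical_integral(lambda x: x ** 2, X, a=0.2))
# for phi(x) = x^2 and Gaussian K this is mean(X**2) + a**2 (up to grid truncation)
```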

In the following Lemma we formulate some conditions on f and φ under which the above integrals converge to those with φ in place of φ_a. For this, note that

φ̄_a = ∫ φ(x) [∫ (1/a) f(y) K((x − y)/a) dy] dx.

Under some mild conditions on K the inner integrals converge to f(x) for all x up to a Lebesgue null set. To show

φ̄_a → ∫ φ dF ≡ φ̄ as a → 0    (12.6.1)

we need to justify the assumptions in the Lebesgue dominated convergence theorem. Whenever K(x) = T(∥x∥), T nonincreasing, by Wheeden and Zygmund, p. 156,

sup_{a>0} |f ∗ K_a(x)| ≤ c f∗(x)    (12.6.2)

for some generic constant c. Here, for any integrable function g,

g∗(x) = sup_Q (1/λ(Q)) ∫_Q |g(y)| dy

denotes the Hardy-Littlewood maximal function of g, Q is an interval centered at x and λ(Q) is the Lebesgue measure of Q. This definition extends to the multivariate case, in which the supremum is taken over all cubes with center x and edges parallel to the coordinate axes.

As a conclusion (12.6.2) implies (12.6.1) provided that φf∗ is Lebesgue-integrable.

As to

∫ [φ_a − φ̄_a]² F(dy) → ∫ [φ − φ̄]² F(dy),    (12.6.3)

when (12.6.1) is satisfied, we need to show

∫ φ²_a(y) F(dy) → ∫ φ²(y) F(dy).


We have

|∫ [φ²_a − φ²] dF| ≤ ∫ |φ_a − φ| |φ_a + φ| dF ≤ ∥φ_a − φ∥_2 (∥φ_a∥_2 + ∥φ∥_2),

the ∥ · ∥_2 being taken in L²(F). First, by Cauchy-Schwarz,

∫ φ²_a(y) F(dy) ≤ (1/a) ∫∫ φ²(x) K((x − y)/a) dx F(dy) = ∫ φ²(x) f ∗ K_a(x) dx ≤ c ∫ φ²(x) f∗(x) dx < ∞,

provided φ² f∗ ∈ L¹(R). Finally, using ∫ K(x) dx = 1 and Cauchy-Schwarz,

∫ (φ_a − φ)² dF ≤ ∫∫ (1/a) [φ(x) − φ(y)]² K((x − y)/a) dx F(dy) = ∫∫ (1/a) [φ(x) − φ(y)]² f(y) K((x − y)/a) dy dx.

One can show that

(1/a) ∫ [φ(x) − φ(y)]² f(y) K((x − y)/a) dy → 0

for all x up to a Lebesgue-null set, since φf ∈ L¹(R) and φ²f ∈ L¹(R). Application of the dominated convergence theorem follows as before. In summary, we get

Lemma 12.6.1. Provided φ² f∗ ∈ L¹(R) and φ f∗ ∈ L¹(R), under standard regularity assumptions on K,

E(I) → ∫ φ dF

and

n Var(I) → ∫ [φ − φ̄]² dF

as a → 0.

Now, consider the stochastic process

Z_n(a) ≡ Z_n = n^{1/2} ∫ φ_a(y) [F_n(dy) − F(dy)].


For a > 0 fixed, the CLT easily yields

Z_n(a) → N(0, σ²(a))    (12.6.4)

in distribution.

Setting φ_0(y) = φ(y), (12.6.4) also holds for a = 0 when φ ∈ L²(F). If we let a depend on n in such a way that a = a_n → 0 as n → ∞, then

Var(Z_n(a_n) − Z_n(0)) ≤ ∫ [φ_{a_n} − φ]² dF → 0,

as was shown before.

As a conclusion we get the following

Theorem 12.6.1. Under the assumptions of Lemma 12.6.1, with a_n → 0 as n → ∞,

n^{1/2} ∫ φ_{a_n}(y) [F_n(dy) − F(dy)] → N(0, σ²(0))

in distribution.

Under an appropriate smoothness assumption on f, for K even,

∫ φ_a(y) F(dy) = ∫ φ(x) F(dx) + (a²/2) ∫∫ z² K(z) φ(x) f″(x) dz dx + O(a³).

We see that

n^{1/2} ∫ φ(y) [F̃_n(dy) − F(dy)] → N(0, σ²(0)),

provided that n^{1/2} a_n² → 0. Under different conditions on a_n it may happen that the normal limit has a non-zero expectation.


Chapter 13

Conditional U-Statistics

13.1 Introduction and Main Result

Let (X,Y ) be a random vector in Rd+1. The regression function

m(x) = E[Y |X = x]

then is a convenient tool to describe the dependence between X and Y. Since in practice m will be unknown, it needs to be estimated from a sample (Xi, Yi), 1 ≤ i ≤ n. Starting with Nadaraya (1964) and Watson (1964), by now there is a huge literature on nonparametric estimation of m. Stone (1977) investigated estimates of the form

m_n(x) = Σ_{i=1}^n W_in(x) Yi.

He could show that under very broad assumptions on the weights, for p ≥ 1,

E[|m_n(X) − m(X)|^p] → 0.    (13.1.1)

In particular, he was able to verify these assumptions for NN-type estimators. Window estimators were dealt with independently by Devroye and Wagner (1980) and Spiegelman and Sacks (1980). Devroye (1981) considered L^p-convergence at a point:

E[|m_n(x_0) − m(x_0)|^p] → 0.    (13.1.2)

Both (13.1.1) and (13.1.2) have important implications for the Bayes-risk consistency of empirical Bayes rules in discrimination.


There are several situations in which one is not only interested in the relation between a single X and the corresponding Y, but in the dependence structure of a few X's and a function of the associated Y's. To be precise, let h = h(Y_1, . . . , Y_k) be a p-times integrable function of Y_1, . . . , Y_k, p ≥ 1. h is called a kernel, and k is its degree. The Y's need not be real but may be random vectors in any Euclidean space. Put, for x = (x_1, . . . , x_k),

m(x) = E[h(Y_1, . . . , Y_k) | X_1 = x_1, . . . , X_k = x_k].

The problem of estimating m was studied in Stute (1991). As estimators we considered generalizations of the Nadaraya-Watson estimator, namely

m_n(x) = Σ_π h(Y_{π_1}, . . . , Y_{π_k}) Π_{j=1}^k K[(x_j − X_{π_j})/b_n] / Σ_π Π_{j=1}^k K[(x_j − X_{π_j})/b_n].    (13.1.3)

Here K is a smoothing kernel and b_n > 0 is a bandwidth tending to zero at an appropriate rate. Summation extends over all permutations π = (π_1, . . . , π_k) of length k, i.e., over all pairwise distinct π_1, . . . , π_k taken from {1, . . . , n}. Clearly, setting

W_{πn}(x) ≡ W_π(x) = Π_{j=1}^k K[(x_j − X_{π_j})/b_n] / Σ_σ Π_{j=1}^k K[(x_j − X_{σ_j})/b_n]

and Y_π = (Y_{π_1}, . . . , Y_{π_k}), we have

m_n(x) = Σ_π W_π(x) h(Y_π).    (13.1.4)

Statistics of the type (13.1.4) may be called conditional (or local) U-statistics, because in spirit they are similar to Hoeffding's (1948) U-statistic, extending the sample mean.

The analog of (13.1.1) for conditional U-statistics has been obtained in Stute (1994). In the present paper we consider L^p-convergence at a point x_0, say, of a window estimator, i.e., of the form (13.1.3) with K = 1_{[−1,1]^d}. Extensions to more general kernels will be dealt with in Remark 13.1.2 below. The examples in Stute (1991) could also be mentioned here, but we prefer to discuss another example, which may be of independent interest. See Sections 13.2 and 13.3 for details.

In what follows ∥ · ∥ denotes the sup-norm on a Euclidean space. For the window estimator,

W_π(x) = 1_{∥X_π−x∥≤b_n} / Σ_σ 1_{∥X_σ−x∥≤b_n},


where Xπ = (Xπ1 , . . . , Xπk). Let µ denote the distribution of X.

Theorem 13.1.1. Assume that b_n → 0 and n b_n^d → ∞. Then for µ ⊗ . . . ⊗ µ-almost all x_0,

E[|m_n(x_0) − m(x_0)|^p] → 0 as n → ∞.

Remark 13.1.2. The assertion of Theorem 13.1.1 may be easily extended to kernels K on R^d satisfying

c_1 1_{∥x∥≤r_1} ≤ K(x) ≤ c_2 1_{∥x∥≤r_2}

for some 0 < c_1, c_2 < ∞, 0 < r_1 < r_2 < ∞, provided that

lim sup_{b→0} µ({x : ∥x − x_i∥ ≤ b r_2}) / µ({x : ∥x − x_i∥ ≤ b r_1}) < ∞    (13.1.5)

for 1 ≤ i ≤ k. In particular, (13.1.5) is satisfied if x_i is a µ-atom or the restriction of µ to some neighborhood of x_i is dominated by Lebesgue-measure.
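A minimal sketch of the window-type conditional U-statistic for degree k = 2 (Python, illustrative names, one-dimensional X's): the double sum runs over pairs of distinct indices, as in (13.1.3) with the window kernel.

```python
import numpy as np

def conditional_u_stat(x1, x2, X, Y, b_n, h):
    """m_n(x1, x2) = sum_{pi} h(Y_pi1, Y_pi2) 1{|X_pi1-x1|<=b_n, |X_pi2-x2|<=b_n}
                     / sum_{sigma} (same indicator), pi, sigma with distinct entries."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    idx1 = np.nonzero(np.abs(X - x1) <= b_n)[0]       # candidates for the first slot
    idx2 = np.nonzero(np.abs(X - x2) <= b_n)[0]       # candidates for the second slot
    num = den = 0.0
    for i in idx1:
        for j in idx2:
            if i != j:                                # permutations use distinct indices
                num += h(Y[i], Y[j])
                den += 1.0
    return num / den if den > 0 else 0.0

# estimate m^1(x1, x2) = P(Y_1 <= Y_2 | X_1 = x1, X_2 = x2) in the model Y = beta*X + eps
rng = np.random.default_rng(7)
beta, n = 1.0, 200
X = rng.uniform(0, 1, n)
Y = beta * X + rng.normal(size=n)
print(conditional_u_stat(0.2, 0.8, X, Y, b_n=0.2, h=lambda y1, y2: float(y1 <= y2)))
```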

13.2 Discrimination

In discrimination an unobservable random variable Y taking values from a finite set {1, 2, . . . , M}, say, has to be estimated from some (hopefully) correlated X. The optimal predictor g_0(X) minimizing the probability of error P(g(X) ≠ Y) is given by

g_0(x) = arg max_{1≤j≤M} p_j(x),

where

p_j(x) = P(Y = j | X = x), 1 ≤ j ≤ M.

g_0 is called the Bayes-rule and

L∗ = P(g_0(X) ≠ Y) = 1 − E[max_{1≤j≤M} p_j(X)]

is the associated Bayes-risk. Since in practice the a posteriori probabilities p_j are seldom known, they need to be estimated from a training sample (Xi, Yi), 1 ≤ i ≤ n. Replacing Yi by 1_{Yi=j}, the local average estimators become

m^j_n(x) = Σ_{i=1}^n W_in(x) 1_{Yi=j}.


Put

g_n(x) = arg max_{1≤j≤M} m^j_n(x).

(13.1.1) then is useful to show Bayes-risk consistency of g_n:

P(g_n(X) ≠ Y) → L∗.

See Stone (1977) or Chapter 12. On the other hand, (13.1.2) turns out to be useful to handle the probability of error given the training sample:

P(g_n(X) ≠ Y | X_1, Y_1, . . . , X_n, Y_n).

See Devroye (1981).

Now, in applications, it often happens that more than one Y has to be classified. Suppose that X^0_1, . . . , X^0_k are independent copies of X. The corresponding Y^0_1, . . . , Y^0_k are unobservable. They may be discrete or continuous random vectors in any Euclidean space. Rather than predicting their values item by item, one may, e.g., be interested in comparing their (expected) values on the basis of the input vectors X^0_1, . . . , X^0_k.

To be precise, let h be a function of degree k taking values from {1, 2, . . . , M}. The sets

A_j = {(y_1, . . . , y_k) : h(y_1, . . . , y_k) = j}, 1 ≤ j ≤ M,

then form a partition of the feature space. h(Y^0_1, . . . , Y^0_k) = j then is tantamount to (Y^0_1, . . . , Y^0_k) ∈ A_j, which is to be considered as one of M concurrent situations of interest. Take, for example, k = 2 and let the Y's be real-valued. We may then wonder whether Y^0_1 ≤ Y^0_2 or not. Putting

h(y_1, y_2) = 1 if y_1 ≤ y_2 and 0 if y_1 > y_2,

we arrive at a discrimination problem for h(Y^0_1, Y^0_2). Given (X^0_1, X^0_2) and a training sample (Xi, Yi), 1 ≤ i ≤ n, a decision has to be made as to whether h = 1 or 0. This example will be considered in the simulation study of Section 13.3. Others are mentioned in Stute (1994).

It is easily seen that now the Bayes-rule equals

g_0(x) = arg max_{1≤j≤M} m^j(x),


where

m^j(x) = P(h(Y^0_1, . . . , Y^0_k) = j | X^0_1 = x_1, . . . , X^0_k = x_k).

The Bayes-risk becomes

L∗ = 1 − E[max_{1≤j≤M} m^j(X)],

with X = (X^0_1, . . . , X^0_k). The empirical Bayes-rule is given by

g_{n0}(x) = arg max_{1≤j≤M} m^j_n(x),

with

m^j_n(x) = Σ_π W_π(x) 1_{h(Y_π)=j}.

One can show that under mild assumptions on the weights one obtains Bayes-risk consistency:

P(g_{n0}(X) ≠ h(Y)) → L∗.

In this section we discuss the limit behavior of the probability of error given a particular training sample at hand:

L_n := P(g_{n0}(X) ≠ h(Y) | X_1, Y_1, . . . , X_n, Y_n).

Theorem 13.2.1. Assume b_n → 0 and n b_n^d → ∞. Then, as n → ∞,

L_n → L∗ in the mean.

Similar to classical discrimination,

1 − L_n = Σ_{j=1}^M ∫_{g_{n0}=j} m^j(x) µ(dx_1) . . . µ(dx_k).

Since the m^j are unknown, we are tempted to estimate L_n by the apparent error rate L̂_n given as

1 − L̂_n = Σ_{j=1}^M ∫_{g_{n0}=j} m^j_n(x) µ(dx_1) . . . µ(dx_k) = ∫ max_{1≤j≤M} m^j_n(x) µ(dx_1) . . . µ(dx_k).


Corollary 13.2.2. Under the assumptions of Theorem 13.2.1,

L̂_n → L∗ in the mean.

In the most general situation, also µ will be unknown. We may then replace µ in the definition of L̂_n by the empirical measure µ_n of X_1, . . . , X_n. So we come up with the following estimate of 1 − L_n:

1 − L̂_n = n^{−k} Σ_{1≤i_1,...,i_k≤n} max_{1≤j≤M} m^j_n(X_{i_1}, . . . , X_{i_k}).

Two modifications are suggested in order to reduce the bias of L̂_n:

1. Summation should only be extended over all π = (i_1, . . . , i_k) with pairwise distinct i_r's.

2. Given such a ”multi-index” π, m^j_n should be replaced by m^{jπ}_n, computed from the sample (X_1, Y_1), . . . , (X_n, Y_n) with (X_{i_r}, Y_{i_r}), 1 ≤ r ≤ k, deleted.

Hence we obtain

1 − L̂_n := (1/(n(n−1) · · · (n−k+1))) Σ_π max_{1≤j≤M} m^{jπ}_n(X_π),

the jackknife-corrected estimate of 1 − L_n.

Example 13.2.3. Suppose that the Y's are real-valued and have a continuous distribution. This assumption is not very crucial. It is only to guarantee that Y^0_1, . . . , Y^0_k have no ties. Any randomization will do the job. Put

h(y_1, . . . , y_k) = j iff y_j is the maximum among y_1, . . . , y_k.

Then h is well defined on the range of (Y^0_1, . . . , Y^0_k). Of course, M = k, and the problem becomes one of determining the index of that Y among Y^0_1, . . . , Y^0_k which is the largest. Assume for simplicity that X^0_1, . . . , X^0_k are pairwise distinct, and let ε := min_{i≠r} ∥X^0_i − X^0_r∥ > 0. For b_n < ε/2 and 1 ≤ j ≤ k, compute the number of π's such that Y_{π_j} is the maximum among (Y_{π_1}, . . . , Y_{π_k}) subject to ∥X_{π_i} − X^0_i∥ ≤ b_n for 1 ≤ i ≤ k. Then g_{n0}(X) is just a majority vote among these numbers.


13.3 Simulation Study

In our simulation study we consider data generated from the linear regression model Y = βX + ε, in which X is uniformly distributed on the unit interval and ε is standard normal, both independent of each other. For various β's, training samples (Xi, Yi), 1 ≤ i ≤ n, were drawn. Given two input variables X^0_1 and X^0_2 it is required to make a decision as to whether Y^0_1 exceeds Y^0_2 or not. See Example 13.2.3 with k = 2. The rule g_{n0} was applied for various bandwidths. Finally, the jackknife-corrected L̂_n was computed for each training sample. Table 13.3.2 below presents the estimated mean and variance based on 1000 replicates. For the sake of comparison the true values of L∗ are listed in Table 13.3.1. It becomes apparent that L∗ attains its maximum for β = 0 and decreases as |β| increases. This should also be intuitively clear, since for small |β| the input variables contain less information on the Y's than for large |β|'s. In the extreme, when β = 0, Y does not depend on X at all, so that X^0_1 and X^0_2 become uninformative as to our multi-input classification problem. By symmetry, L∗ = 0.5.
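The values of L∗ in Table 13.3.1 can be recovered as follows: in this model, m¹(x_1, x_2) = P(Y^0_1 ≤ Y^0_2 | X^0_1 = x_1, X^0_2 = x_2) = Φ(β(x_2 − x_1)/√2), so L∗ = E[min(m¹(X), 1 − m¹(X))] = E[Φ(−β|X^0_2 − X^0_1|/√2)]. A minimal Monte Carlo sketch (Python, illustrative sample size):

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))   # standard normal d.f.

rng = np.random.default_rng(8)
U, V = rng.uniform(size=200_000), rng.uniform(size=200_000)      # X_1^0, X_2^0 ~ U(0,1)
for beta in (0.0, 0.5, 1.0, 1.5, 2.0):
    L_star = Phi(-beta * np.abs(U - V) / np.sqrt(2)).mean()
    print(beta, round(float(L_star), 3))     # approx. 0.500, 0.453, 0.408, 0.366, 0.328
```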

Next we present the summary statistics for sample sizes n = 10 and n = 20 (the latter in parentheses) for some selected values of b_n and β.

To measure the performance of the rule g_{n0}, it is also instructive to compare L∗ and L̂_n with the true error rate L_n. This is done for n = 10 and β = 1. (See Table 13.3.3.)

The true error rate is always bigger than L∗ (by definition of L∗). It is pretty well approximated by L̂_n for 0.6 ≤ b_n ≤ 0.8. Also, L_n seems less vulnerable to an improper choice of b_n.

In Table 13.3.4 we list, for n = 10 and various β's, the bandwidths leading to an L̂_n which in the mean is closest to L∗.

We see that b_n decreases as β increases. Hence for large β's efficient discrimination between the Y's is based on smaller neighborhoods of the input variables. After a moment of thought this observation becomes quite natural, since for these values of β, Y is strongly correlated with X, so that small neighborhoods contain enough information on the Y's. On the other hand, small neighborhoods have the advantage of properly dividing the training sample into disjoint parts such that each part contains information which is only relevant for the corresponding pair (X^0, Y^0).


Needless to mention that the optimal bandwidth decreases as sample size increases.

β      L∗
0.0    0.500
0.1    0.491
0.2    0.481
0.3    0.472
0.4    0.463
0.5    0.453
0.6    0.444
0.7    0.435
0.8    0.426
0.9    0.417
1.0    0.408
1.3    0.383
1.5    0.366
1.8    0.343
2.0    0.328

Table 13.3.1:


β      b_n    Estimated mean of L̂_n    Estimated variance of L̂_n
0.1    0.8    0.48 (0.48)    0.0004 (0.00009)
       0.7    0.46 (0.47)    0.0006 (0.0002)
       0.6    0.44 (0.46)    0.0010 (0.0003)
       0.5    0.42 (0.45)    0.0020 (0.0005)
0.5    0.8    0.47 (0.48)    0.0005 (0.0001)
       0.7    0.45 (0.47)    0.0010 (0.0003)
       0.6    0.43 (0.46)    0.0020 (0.0006)
       0.5    0.41 (0.44)    0.0020 (0.0009)
1.0    0.8    0.46 (0.47)    0.0006 (0.0003)
       0.7    0.44 (0.46)    0.0010 (0.0006)
       0.6    0.42 (0.44)    0.0020 (0.001)
       0.5    0.40 (0.43)    0.0030 (0.001)
1.5    0.8    0.46    0.0008
       0.7    0.44    0.0017
       0.6    0.42    0.0025
       0.5    0.40    0.0030
       0.4    0.37    0.0040
       0.3    0.34    0.0040
2.0    0.8    0.46    0.009
       0.7    0.43    0.0018
       0.6    0.40    0.0026
       0.5    0.37    0.0036
       0.4    0.34    0.0040
       0.3    0.31    0.0052

Table 13.3.2: (values in parentheses refer to n = 20)

b_n    L∗       Estimated mean of L̂_n    Estimated mean of L_n
0.8    0.408    0.46    0.46
0.7    0.408    0.44    0.45
0.6    0.408    0.42    0.45
0.5    0.408    0.40    0.45

Table 13.3.3:


β      b_n
0.1    0.8
0.5    0.7
1.0    0.5
1.5    0.4
2.0    0.3

Table 13.3.4: