

Modern Survival Analysis

David Steinsaltz1

University of Oxford

1University lecturer at the Department of Statistics, University of Oxford

Page 2: Modern Survival Analysis notes... · O. O. Aalen, O. Borgan, H. K. Gjessing, Survival and Event History Analysis: A Process Point of View Other material will come from • J. P. Klein

Contents

1 Time processes and counting processes
  1.1 Definition of counting process
  1.2 Examples of counting processes
    1.2.1 Single survival time
    1.2.2 Sum of counting processes
  1.3 Poisson process
  1.4 Intensity
    1.4.1 Deterministic intensity
    1.4.2 Random intensity
  1.5 Parametric models
    1.5.1 Independent right censoring
    1.5.2 Independent left truncation

2 σ-algebras and conditioning
  2.1 σ-algebras
    2.1.1 Background
    2.1.2 σ-algebras and information
    2.1.3 Examples
    2.1.4 Intersections of σ-algebras
    2.1.5 σ-algebra generated by a family of sets
    2.1.6 Joins of σ-algebras
    2.1.7 σ-algebra generated by a random variable
    2.1.8 Filtrations of σ-algebras
    2.1.9 Adapted processes
    2.1.10 Stopping times
    2.1.11 Predictable processes
  2.2 Conditioning
    2.2.1 Definition of conditioning
    2.2.2 Intuitive definition for discrete random variables
    2.2.3 Properties of conditional expectations

3 Martingales
  3.1 More about conditioning
    3.1.1 Examples
    3.1.2 In search of past time
    3.1.3 Examples
  3.2 Martingales
    3.2.1 Definitions
  3.3 Compensators
    3.3.1 Inhomogeneous Poisson counting process

4 Stochastic integrals and nonparametric estimation
  4.1 Background
    4.1.1 Introduction to non-parametric estimation
    4.1.2 The multiplicative intensity model
  4.2 Cheater’s guide to stochastic integrals
  4.3 The Nelson–Aalen estimator
    4.3.1 Distinct event times: Informal derivation
    4.3.2 Distinct event times: Formal derivation of the Nelson–Aalen estimator
    4.3.3 Simulated data set
    4.3.4 Breaking ties
    4.3.5 Simulated example with ties

5 Variation and confidence intervals for non-parametric estimators
  5.1 Variation processes
    5.1.1 Intuitive definitions
    5.1.2 Formal definitions
    5.1.3 Useful facts about variation processes
    5.1.4 Caveats (not examinable)
  5.2 Examples
    5.2.1 Independent sums
    5.2.2 Weighted independent sums
    5.2.3 Compensated homogeneous Poisson process
    5.2.4 Compensated inhomogeneous Poisson process
  5.3 Normal approximation for martingales
  5.4 Pointwise confidence intervals for the Nelson–Aalen estimator
    5.4.1 Simulated data
    5.4.2 More simulated data

6 Nonparametric estimation, continued
  6.1 The nobody-left problem
  6.2 The Kaplan–Meier estimator
    6.2.1 Deriving the Kaplan–Meier estimator
    6.2.2 The relation between Nelson–Aalen and Kaplan–Meier
    6.2.3 Duhamel’s equation
    6.2.4 Confidence intervals for the Kaplan–Meier estimator
  6.3 Computing survival estimators in R
    6.3.1 Survival objects with only right-censoring
    6.3.2 Other survival objects
  6.4 Survival to ∞

7 Comparing distributions: Excess mortality
  7.1 Estimating excess mortality: One-sample setting
  7.2 Excess mortality: Two-sample case
  7.3 Nonparametric tests for equality: One-sample setting
    7.3.1 No ties
    7.3.2 Weight functions and particular tests
    7.3.3 With ties
    7.3.4 An example

8 Excess mortality II: Two-sample setting
  8.1 Non-parametric tests for equality: Two-sample setting
    8.1.1 No ties
    8.1.2 Weight functions and particular tests
    8.1.3 With ties
    8.1.4 The AML example
    8.1.5 Kidney dialysis example
    8.1.6 Nonparametric tests in R

9 Additive hazards regression
  9.1 Describing the model
  9.2 Fitting the model
  9.3 Variance estimation
    9.3.1 Martingale representation of additive hazards model
    9.3.2 Estimating the covariance matrix
  9.4 Testing for a single effect
  9.5 Examples
    9.5.1 Single covariate
    9.5.2 Simulated data

10 Relative-risk models
  10.1 The relative-risk regression model
  10.2 Partial likelihood
  10.3 Significance testing
  10.4 Estimating baseline hazard
    10.4.1 Breslow’s estimator
    10.4.2 Individual risk ratios

11 Relative risk regression, continued
  11.1 Dealing with ties
  11.2 Asymptotic properties of partial likelihood
  11.3 The AML example
  11.4 The Cox model in R
  11.5 Graphical tests of the proportional hazards assumption
    11.5.1 Log cumulative hazard plot
    11.5.2 Andersen plot
    11.5.3 Arjas plot
    11.5.4 Leukaemia example

12 Model diagnostics
  12.1 General principles of model selection
    12.1.1 The idea of model diagnostics
    12.1.2 A simulated example
  12.2 Cox–Snell residuals
  12.3 Bone marrow transplantation example

13 Model diagnostics, continued
  13.1 Martingale residuals
    13.1.1 Definition of martingale residuals
    13.1.2 Application of martingale residuals for estimating covariate transforms
  13.2 Outliers and leverage
    13.2.1 Deviance residuals
    13.2.2 Schoenfeld residuals
    13.2.3 Delta–beta residuals
  13.3 Residuals in R
    13.3.1 Dutch Cancer Institute (NKI) breast cancer data
    13.3.2 Complementary log-log plot
    13.3.3 Andersen plot
    13.3.4 Cox–Snell residuals
    13.3.5 Martingale residuals
  13.4 Schoenfeld residuals

14 Censoring and truncation revisited
  14.1 Left censoring
  14.2 Right truncation
  14.3 Doubly-censored data: Turnbull’s algorithm

15 Censoring and truncation, continued
  15.1 Interval-censored data
  15.2 Current status data
    15.2.1 Parametric approaches
    15.2.2 Nonparametric approaches
    15.2.3 Example of current status data
  15.3 Dependent censoring
    15.3.1 Censoring plot
    15.3.2 Corrected survival estimators
    15.3.3 Inverse probability of censoring weighting

16 Frailty and recurrent events
  16.1 Proportional frailty model
  16.2 Examples of frailty distributions
    16.2.1 Gamma frailty
    16.2.2 PVF family
  16.3 Effects on the hazard ratio
    16.3.1 Changing relative risk
    16.3.2 Hazard and frailty of survivors
  16.4 Repeated events
    16.4.1 The Poisson model
    16.4.2 The Poisson regression model
    16.4.3 Negative-binomial model
    16.4.4 The Andersen-Gill model

A Notes on the Poisson Process
  A.1 Point processes
  A.2 The Poisson process on R+
    A.2.1 Local definition of the Poisson process
    A.2.2 Global definition of the Poisson process
    A.2.3 Defining the interarrival process
    A.2.4 Equivalence of the definitions
    A.2.5 The Poisson process as Markov process
  A.3 Examples and extensions
  A.4 Some basic calculations
  A.5 Thinning and merging
  A.6 Poisson process and the uniform distribution

B Assignments
  B.1 Modern Survival Problem sheet 1: Counting processes and martingales
  B.2 Modern Survival Problem sheet 2: Nonparametric estimation of survival curves
  B.3 Modern Survival Problem sheet 3: Estimating quantiles and excess mortality
  B.4 Modern Survival Problem sheet 4: Nonparametric testing and semiparametric models
  B.5 Modern Survival Problem sheet 5: Relative risks and diagnostics
  B.6 Modern Survival Problem sheet 6: Censoring and truncation, frailty and repeated events

C Solutions
  C.1 Modern Survival Problem sheet 1: Counting processes and martingales
  C.2 Modern Survival Problem sheet 2: Nonparametric estimation of survival curves
  C.3 Modern Survival Problem sheet 3: Estimating quantiles and excess mortality
  C.4 Modern Survival Problem sheet 4: Nonparametric testing and semiparametric models
  C.5 Modern Survival Problem sheet 5: Relative risks and diagnostics
  C.6 Modern Survival Problem sheet 6: Censoring and truncation, frailty and repeated events


Website: http://www.steinsaltz.me.uk/survival/survival.html

Classes: There will be 6 classes, held Monday mornings 11–12 in the lecture theatre of 1 SPR in weeks 3, 4, 5, 7, 8 of Michaelmas Term, and week 1 of Hilary Term. Work for each class is to be turned in at the statistics department by Friday noon.

Overview: Students will learn how to use the basic mathematical tools that are used to evaluate survival models. They will learn both mathematical facts about standard survival models currently in use, and how to use standard R packages to fit these models to data. They will learn how to interpret models critically, and how to choose an appropriate model.

Prerequisites: Part A Probability and Statistics are required. BS3a and BS3b would be helpful. Basic computer skills, including some familiarity with R; this is at a level that an interested student with no R experience could acquire in a few hours.

Synopsis

• Point processes and compensators. Introduction to martingales.

• Non-parametric estimation. Semi-parametric estimation and the Cox model. Additive hazards regression.

• Model selection: Hypothesis testing and information criteria.

• Model diagnostics: Graphical methods, residual methods.

• Advanced topics, such as: Interval censoring. Frailty models. Repeated events. Bayesian approaches to survival.

Reading: The primary source for material in this course will be

• O. O. Aalen, O. Borgan, H. K. Gjessing, Survival and Event History Analysis: A Process Point of View

Other material will come from

• J. P. Klein and M. L. Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data (2nd edition)

• T. R. Fleming and D. P. Harrington, Counting Processes and Survival Analysis

Klein and Moeschberger is the most applied, least theoretical book. Fleming and Harrington is more rigorous than the level of this course. Aalen et al. is at more or less the level we are aiming for. If you’re looking for more detail about the mathematics, Fleming and Harrington is a good place to start. If you’re looking for a more straightforward presentation, try Klein and Moeschberger.


Lecture 1

Time processes and counting processes

1.1 Definition of counting process

Given a collection of random times T1, . . . , Tn, where n in principle could be ∞, we have an associated counting process

N(t) := #{i : Ti ≤ t} = ∑_{i=1}^n 1{Ti ≤ t}.   (1.1)

Note that this definition makes counting processes continuous from the right. From the left they have limits¹, but are discontinuous at each Ti. We write N(t−) for the left limit at t, which is

N(t−) = { N(t) − #{i : Ti = t}   if t ∈ {T1, . . . , Tn},
        { N(t)                   otherwise.

In general, a random function that is piecewise constant and right-continuous, with all jumps being positive integers, may be called a counting process, and it may be associated to a collection of random variables where T ∈ R appears exactly N(T) − N(T−) times. (We will only be considering finite collections of random variables. More generally, one may assume merely that the number of T where this difference is nonzero is finite on any bounded interval.)

¹Functions that are continuous from the right, but only have limits from the left, are sometimes called càdlàg, from the French continue à droite, limite à gauche. We will not be using this term.

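The definition (1.1) and the left limit N(t−) can be sketched in a few lines of code. This is an illustrative sketch, not code from the notes; the helper names (`counting_process`, `N_minus`) are our own.

```python
# A counting process N(t) = #{i : T_i <= t} built from a finite list of random
# times, together with its left limit N(t-).
import bisect

def counting_process(times):
    """Return N(t) and N(t-) as functions of t, for the given jump times."""
    ts = sorted(times)

    def N(t):
        # number of T_i with T_i <= t (right-continuous by construction)
        return bisect.bisect_right(ts, t)

    def N_minus(t):
        # left limit: number of T_i strictly less than t
        return bisect.bisect_left(ts, t)

    return N, N_minus

N, N_minus = counting_process([0.5, 1.2, 1.2, 3.0])
# N is right-continuous: N(1.2) counts both jumps at 1.2, while N(1.2-) does not,
# so the jump size N(1.2) - N(1.2-) is 2 (a tied time; this process is not simple).
print(N(1.2), N_minus(1.2))   # 3 1
```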


We will refer to a counting process whose jumps are all of size 1 (with probability 1) as a simple counting process. The counting process associated to any independent sequence of continuous random variables is simple.

1.2 Examples of counting processes

1.2.1 Single survival time

Let T be a nonnegative random variable. The counting process associated with T is a function that is 0 for t < T, and then jumps to N(t) = 1 at t = T.

1.2.2 Sum of counting processes

If N1, . . . , Nk are counting processes, then N(t) = ∑_{i=1}^k Ni(t) is also a counting process. In particular, the counting process associated with T1, . . . , Tk is the sum of the k counting processes associated with the individual times Ti. It may also be written as

N(t) = ∑_{i=1}^k Ni(t) = #{i : Ti ≤ t}.

The sum of independent simple counting processes may not be simple. But it will be if the processes are all generated by times with continuous distributions.

1.3 Poisson process

The fundamental counting process is the Poisson process. This is a topic in Part A Probability and in Statistics BS3a (Applied Probability), so should be familiar to you, but notes from Part A Probability are included as a reminder in Appendix A.

As noted there, the term “Poisson process” is used sometimes for the point process (the random collection of points) and sometimes for the counting process associated with this point process. When we need to make clear which we are referring to, we will use the terms Poisson point process and Poisson counting process.

The Poisson counting process is the simplest nontrivial continuous-time discrete-space homogeneous Markov process on Z+. It starts at 0, and moves up by one step after holding times that are i.i.d. exponential. Thus it has the Markov property; in particular, its increments over disjoint intervals are independent: for any 0 ≤ s1 < t1 ≤ s2 < t2 ≤ · · · ≤ sn < tn, the random variables

N(t1) − N(s1), . . . , N(tn) − N(sn)

are independent.
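The construction just described (start at 0, step up after i.i.d. exponential holding times) can be simulated directly. A minimal sketch, not from the notes; the function name is our own.

```python
# Simulating the homogeneous Poisson counting process from its holding-time
# description, and checking E[N(t)] = rate * t empirically.
import random

def poisson_event_times(rate, t_max, rng):
    """Event times of a rate-`rate` homogeneous Poisson process on [0, t_max]."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)   # i.i.d. exponential holding time
        if t > t_max:
            return times
        times.append(t)

rng = random.Random(1)
rate, t = 2.0, 3.0
# N(t) is Poisson with mean rate * t; check the empirical mean over replications.
mean_count = sum(len(poisson_event_times(rate, t, rng)) for _ in range(2000)) / 2000
print(mean_count)   # should be close to rate * t = 6
```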

1.4 Intensity

1.4.1 Deterministic intensity

Let N(t) be a Poisson counting process with rate 1 on R+, let λ : R+ → R+ be a piecewise continuous, locally bounded (that is, bounded on bounded intervals) function, and let Λ(t) := ∫_0^t λ(s) ds. Define

N^(Λ)(t) := N(Λ(t)).   (1.2)

Remember that the Poisson process with rate 1 has the property that

lim_{δ↓0} δ^{-1} P{N(t + δ) − N(t) > 0} = 1;

that is, points accumulate at rate 1. The time-changed process N^(Λ) has

lim_{δ↓0} δ^{-1} P{N^(Λ)(t + δ) − N^(Λ)(t) > 0}
    = lim_{δ↓0} δ^{-1} P{N(Λ(t + δ)) − N(Λ(t)) > 0}
    = lim_{δ↓0} ((Λ(t + δ) − Λ(t))/δ) · δ∗^{-1} P{N(Λ(t) + δ∗) − N(Λ(t)) > 0},   where δ∗ = Λ(t + δ) − Λ(t),
    = λ(t).

We call N^(Λ) the (inhomogeneous) Poisson counting process with cumulative intensity Λ, and the corresponding point process is the (inhomogeneous) Poisson point process with intensity λ. It has the same general properties as the Poisson process (no clustering, independent increments), but relaxes the assumption of constant rate. The expected number of points in an interval (s, t) is no longer proportional to t − s, but to ∫_s^t λ(x) dx.

As with the homogeneous Poisson process, the sum of independent inhomogeneous Poisson processes with intensities λi(t) is an inhomogeneous Poisson process with intensity ∑ λi(t).
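The time-change definition (1.2) gives a direct way to simulate an inhomogeneous Poisson process: for strictly increasing Λ, the events of N^(Λ) on [0, t] are exactly the rate-1 event times below Λ(t), mapped through Λ^{-1}. A sketch under that assumption, with Λ(t) = t² (so λ(t) = 2t); all names are our own.

```python
# Simulating N^(Lambda)(t) = N(Lambda(t)) by mapping rate-1 event times
# through the inverse of the cumulative intensity.
import random

def rate_one_times(s_max, rng):
    """Event times of a rate-1 Poisson process on [0, s_max]."""
    times, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)
        if s > s_max:
            return times
        times.append(s)

def time_changed_times(Lambda_inv, Lambda_t_max, rng):
    """Event times of N^(Lambda) on [0, t_max], where Lambda_t_max = Lambda(t_max)."""
    return [Lambda_inv(s) for s in rate_one_times(Lambda_t_max, rng)]

rng = random.Random(2)
Lambda = lambda t: t ** 2        # cumulative intensity, strictly increasing
Lambda_inv = lambda u: u ** 0.5
t_max = 2.0
# E[N^(Lambda)(t_max)] = Lambda(t_max) = 4; check the empirical mean.
counts = [len(time_changed_times(Lambda_inv, Lambda(t_max), rng)) for _ in range(2000)]
print(sum(counts) / len(counts))   # close to 4
```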

We may also think of the inhomogeneous Poisson process with intensity λ, like the Poisson process, as a sequence of interarrival times. The sequence is no longer independent, but the event times are Markov, in the sense that if we define 0 = T0 < T1 < T2 < · · · < Tn to be the event times, and τi := Ti − Ti−1, then conditioned on (T0, . . . , Ti) the next interarrival time τi+1 has hazard rate λ(t + Ti), which translates to the density

f_{τi+1 | (T0, . . . , Ti)}(t) = λ(Ti + t) exp{ −∫_{Ti}^{Ti+t} λ(s) ds }.   (1.3)
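The density (1.3) says the integrated hazard of τi+1 past t is Λ(Ti + t) − Λ(Ti), so the next event time can be sampled by the standard inversion trick: with E ~ Exp(1), take Ti+1 = Λ^{-1}(Λ(Ti) + E). This is a generic technique, not a method from the notes; the sketch again uses Λ(t) = t², and the names are our own.

```python
# Sequential sampling of inhomogeneous-Poisson event times via inversion of
# the cumulative intensity: T_{i+1} = Lambda^{-1}(Lambda(T_i) + E), E ~ Exp(1).
import random

def next_event(t_prev, Lambda, Lambda_inv, rng):
    e = rng.expovariate(1.0)               # unit-exponential increment
    return Lambda_inv(Lambda(t_prev) + e)  # invert the cumulative intensity

Lambda = lambda t: t ** 2
Lambda_inv = lambda u: u ** 0.5

rng = random.Random(3)
# One path of event times, generated from the Markov description above.
t, path = 0.0, []
for _ in range(5):
    t = next_event(t, Lambda, Lambda_inv, rng)
    path.append(t)
print(path)   # strictly increasing event times
```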

Recall that, for a positive random variable T with density f, cdf F(t) = ∫_0^t f(s) ds, and survival function S(t) = 1 − F(t), the hazard rate of T is defined as

λ(t) := f(t)/(1 − F(t)) = −S′(t)/S(t) = lim_{δ↓0} δ^{-1} P{ t ≤ T < t + δ | t ≤ T }.   (1.4)

The intensity is the same as the hazard rate of the first waiting time.

More generally, we may have a “cumulative intensity” Λ(t) that is not differentiable, or not even continuous. Any nondecreasing right-continuous function Λ may be used in the definition (1.2). Discontinuities of Λ correspond to atoms of the event-time distribution, where the probability of a positive number of events is nonzero. Precisely, the distribution of N(t) − N(t−) is Poisson with parameter Λ(t) − Λ(t−).
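The chain of identities in (1.4) can be checked numerically for a familiar distribution. A purely illustrative sketch (not code from the notes), using the Weibull distribution with shape k and scale 1, for which S(t) = exp(−t^k) and hence λ(t) = k t^{k−1}.

```python
# Numerical check of lambda(t) = -S'(t)/S(t) against the closed form
# k * t^(k-1) for the Weibull(shape=k, scale=1) distribution.
import math

def hazard_from_survival(S, t, delta=1e-6):
    """lambda(t) = -S'(t)/S(t), with S' approximated by a central difference."""
    dS = (S(t + delta) - S(t - delta)) / (2 * delta)
    return -dS / S(t)

k = 2.0
S = lambda t: math.exp(-t ** k)
t = 1.3
print(hazard_from_survival(S, t), k * t ** (k - 1))   # both approximately 2.6
```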

1.4.2 Random intensity

A counting process with fixed intensity is useful for some purposes (modelling customer arrivals at a queue, for instance, where customers are more likely to come at certain times of day) but generally not for survival modelling. The reason is that the most common survival applications involve a fixed population of individuals, which is depleted at each event. Random censoring (for example, individuals dropping out of the study) will also cause the rate of new events to change.

We will allow, then, for the possibility that the intensity λ(t) depends on some stochastic process (that is, a random function). It will be important that we only allow the intensity to depend on the past. We will be developing a mathematical definition of “conditioning on the past” in the coming chapters, but eschewing complete mathematical rigour. Those who are interested may wish to look at chapter 5 of [Dur10] (or almost any graduate-level text on stochastic processes).

Survival experiments

A paradigm example is where we have n individuals, each of whom has a random event time Ti. The population is not replaced, so the number of individuals at risk is changing.


Suppose that the Ti are i.i.d., with hazard rate λ(t). We define a downward counting process

Y(t) := #{i : t < Ti} = n − #{i : t ≥ Ti}.   (1.5)

This is called the number at risk; that is, the number of individuals at time t whose event has not yet occurred. Then if N(t) is the number of events that have happened up to time t, the intensity at time t is λ(t)Y(t). Since Y(t) is random, the intensity is also random. On the interval t ∈ [T(i), T(i+1)) the intensity is (n − i)λ(t).
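The number at risk (1.5) and the resulting random intensity λ(t)Y(t) are easy to compute from a list of event times. An illustrative sketch; the function names are our own.

```python
# Y(t) = #{i : t < T_i}, and the random intensity lambda(t) * Y(t) for
# i.i.d. event times sharing a common hazard rate.
def at_risk(event_times, t):
    """Number at risk at time t: Y(t) = #{i : t < T_i}."""
    return sum(1 for T in event_times if t < T)

def total_intensity(event_times, hazard, t):
    """Intensity of the event-counting process at time t."""
    return hazard(t) * at_risk(event_times, t)

times = [0.7, 1.1, 2.5, 4.0]    # n = 4 individuals
hazard = lambda t: 0.5          # constant individual hazard, for illustration
# At t = 1.0, three individuals remain at risk, so the intensity is 0.5 * 3.
print(at_risk(times, 1.0), total_intensity(times, hazard, 1.0))   # 3 1.5
```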

In some applications the individuals will have different hazard rates for their waiting times. For such applications we need to think of a risk set R(t), the set of individuals at risk at time t (so Y(t) = |R(t)|), and then the random intensity at time t would be given by

λ(t) = ∑_{i ∈ R(t)} λi(t).

Right censoring

A slightly more complicated example is when each individual has a pair of positive random variables (Ui, Ci). Ui is the event time, Ci is the censoring time. We observe (Ti, δi) where Ti = Ui ∧ Ci and δi = 1{Ui ≤ Ci}.

The idea is that the event time is not observed if it comes after Ci (so-called right censoring); in that case, all we know is that censoring occurred at the time Ti. The number at risk is now given by

Y(t) := #{i : t < Ui ∧ Ci} = n − #{i : t ≥ Ui ∧ Ci}.   (1.6)

As before, if the Ui all have hazard rate λ(t), the intensity is Y(t)λ(t). The difference is that Y(t) is no longer computable from the event times (Ui) alone.

Notation 1.4.1. We use the notations a ∧ b = min{a, b} and a ∨ b = max{a, b}. If T1, . . . , Tn are real numbers, we write T(1) ≤ · · · ≤ T(n) for the ordered sequence. (Tie-breaking may be done arbitrarily; the mathematical subtleties of tie-breaking will not concern us.)

We will generally use capital letters T1, . . . , Tn to represent independent times (which may be event times or censoring times), and t1 < · · · < tk to represent the distinct event times in increasing order (each of which may occur multiple times).
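Generating right-censored data in this setup is mechanical: draw the latent pairs (Ui, Ci), then record Ti = Ui ∧ Ci and δi = 1{Ui ≤ Ci}. A sketch assuming random censoring (Ui and Ci independent); the names are our own.

```python
# Producing observed right-censored pairs (T_i, delta_i) from latent event
# times U_i and independent censoring times C_i.
import random

def right_censor(event_times, censor_times):
    """Return the observed pairs (T_i, delta_i) = (min(U_i, C_i), 1{U_i <= C_i})."""
    return [(min(u, c), 1 if u <= c else 0)
            for u, c in zip(event_times, censor_times)]

rng = random.Random(5)
U = [rng.expovariate(1.0) for _ in range(5)]   # latent event times
C = [rng.expovariate(0.5) for _ in range(5)]   # independent censoring times
data = right_censor(U, C)
# The number at risk of (1.6) counts those neither dead nor censored by t:
Y = lambda t: sum(1 for T, _ in data if t < T)
print(data)
print(Y(0.0))   # everyone is at risk at time 0
```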


We will mainly be concerned with the case where Ui and Ci are independent of each other, called random censoring. When we neglect to state otherwise, it will be assumed that censoring is random. Right censoring covers several common practical situations, including

(i). Loss to follow-up: the subject moves away and we can’t contact him/her.

(ii). Drop out: the subject declines or is unable to continue participation. As this is often due to side-effects in medical trials, or a deterioration in condition, it is clear that independence of drop out will often be approximate at best.

(iii). End of trial: this is the simplest case, since a deterministic planned end to the study is presumptively independent of the event times.

“End of trial” is called type I censoring. If individuals have distinct censoring times that are fixed in advance, this is called progressive type I censoring.

Left truncation

An observation is “truncated” if an individual whose event occurs at the wrong time drops out of the observation set altogether. Truncation can be trickier than censoring, because a truncated observation by definition gives no evidence that anything has happened. Estimates based on truncated observations are conditional estimates. Left truncation means that individuals are included only if their event occurs after the truncation time; right truncation means that individuals are included only if their event occurs before the truncation time. Left truncation is inherently more natural to deal with than right truncation, just as right censoring is more natural than left censoring, because of the asymmetry of information in time: it is easier to understand conditioning on a past event than on a future event. Some examples:

(i). Delayed entry into study: Often “time” in a survival study is “time on test”, the time elapsed since the subject entered the study, but not always. Depending on the scientific question, there may be an earlier key moment from which it is natural to start the clock running, so that the individual is at risk of the event occurring before entering the study, or otherwise becoming observable. For example, we may be interested in hazard rate as a function of age based on an observational study, where subjects are recruited at a variety of different ages. Or we may be interested in survival time of cancer patients after diagnosis, where the study recruits them at a later time.


(ii). Delayed identification: [MSD90] points out that some studies of pregnancy loss find that the rate falls continuously through the pregnancy, while others find rates that rise to a peak toward the end of the first trimester. The discrepancy comes from the distinct methods of analysis, which may or may not take account of left truncation. In this case, the truncation event is the recognition that the woman is pregnant. Pregnancy loss before that time will typically not be recognised as such.

Missing data

The terminology introduced here will not be used later, and is not examinable. The concepts described here may be useful, though, for understanding later material.

Censoring and truncation are both forms of “missing data”. There is a whole area of statistics that studies the problems of missing information; a thorough textbook on the subject is [LR02]. The relevant terms here are Missing at Random (MAR) and Missing Completely at Random (MCAR). Data are MCAR if the values of the data do not affect whether they are missing, once we know the true model. They are MAR if the values of the missing data do not affect whether they are missing, once we know the true model and the observable data.

Thus, in the case of independent right censoring, individual i has some vector of observed covariates xi and two independent times (Ui, Ci) — Ui is the event time and Ci the censoring time — and we only get to observe Ti = Ui ∧ Ci and δi = 1_{Ui ≤ Ci}. Both the event time and the censoring time may depend on xi. This clearly is not MCAR, since large values of Ui are more likely to be censored than small values. On the other hand, they are MAR, regardless of how Ci may depend on the observed covariates.

Censoring or truncation that produces data missing at random — so that the censoring (or truncation) time and the unobserved portion of the event time are independent, given the observed covariates — is called non-informative. That means, in the right-censoring setting, we want the distribution of Ui conditioned on {Ui > c} ∩ {Ci = c} to be the same as the distribution of Ui conditioned on {Ui > c} (for any fixed c). Non-informative censoring and truncation allow us to analyse survival times without jointly modelling the censoring and truncation times, and we will be assuming non-informativeness unless otherwise indicated. Clearly it is impossible to have a general method for analysing survival times in the presence of informative censoring.


Suppose, for example, we are studying the time to next seizure in epilepsy patients receiving a new drug or a placebo. Suppose half the subjects in the treatment group, but one fourth of the placebo subjects, drop out because of apparent side effects. If the side effects are, let us say, indigestion caused by the effects of swallowing the pills, then this is non-informative, since the only factor linking censoring and event time is which group the subject is in, which is a known covariate. If, on the other hand, some subjects dropped out because of headaches or memory lapses — which sometimes precede seizures

— then this would be informative censoring. The subject who dropped out at time c is likely to have unobserved data — the remaining time to the next seizure Ui − c — smaller than that of a typical subject who was observed not to have had a seizure up to time c.

An example of non-informative censoring that is not independent is so-called type II right censoring. In type II censoring, a population of N individuals is observed until exactly r of them have had their event, at which point the study is concluded. Clearly the censoring times are not independent of the event times, but it also seems intuitively clear that there is no particular problem with using the observed event times to draw conclusions about that portion of the survival curve. The next few lectures will develop a mathematical framework that allows us to draw the appropriate distinctions for determining, among other things, when censoring is sufficiently independent.
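The type II setting can be checked by simulation. The sketch below is a hypothetical illustration (exponential event times with a chosen rate, not an example from the text): it applies the censored-data exponential MLE k/ΣTi derived in section 1.5.1, noting that under type II censoring the N − r censored individuals all have Ti equal to the r-th event time.

```python
import random

def type2_mle(rate, n_pop, r, rng):
    """Type II censoring: observe n_pop i.i.d. Exp(rate) event times until
    exactly r events have occurred; the rest are censored at the r-th event
    time.  Returns the censored-data MLE k / sum(T_i), with k = r."""
    events = sorted(rng.expovariate(rate) for _ in range(n_pop))
    t_stop = events[r - 1]                               # study ends here
    total_time = sum(events[:r]) + (n_pop - r) * t_stop  # censored T_i = t_stop
    return r / total_time

rng = random.Random(1)
# average the estimator over many replications
est = sum(type2_mle(2.0, 50, 25, rng) for _ in range(2000)) / 2000
print(est)  # close to the true rate 2.0
```

Averaged over replications the estimate sits close to the true rate (up to the usual small-sample bias of exponential MLEs), supporting the intuition that this censoring scheme causes no particular problem.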

A useful formal definition of non-informative censoring will require a different mathematical framework, which we will introduce in the coming lectures.

1.5 Parametric models

Survival analysis, more than most areas of statistics, tends to work with nonparametric and semiparametric methods, and these will be our main concern in this course. Nonetheless, parametric methods are also important, and require little modification from the methods you will have learned in other courses. In particular, likelihood methods work unchanged, as long as we modify the likelihood appropriately.

The simplest model would assume that individual event times have independent exponential distributions — that is, constant hazard rates. If we observe n independent times U1, . . . , Un, with unknown hazard rate λ,


the log likelihood is

\[
\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} U_i.
\]

The maximum is attained at

\[
\hat\lambda = \frac{n}{\sum_{i=1}^{n} U_i},
\]

and the expected and observed Fisher information are both n/λ².
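As a quick numerical sketch (with simulated data, not an example from the text), the MLE and its information-based standard error can be computed directly:

```python
import math
import random

def exp_mle(times):
    """MLE for fully observed exponential times: lambda_hat = n / sum(U_i),
    with Fisher information n / lambda^2, hence standard error lambda / sqrt(n)."""
    n = len(times)
    lam_hat = n / sum(times)
    se = lam_hat / math.sqrt(n)   # plug-in: 1 / sqrt(n / lam_hat^2)
    return lam_hat, se

rng = random.Random(0)
times = [rng.expovariate(0.5) for _ in range(20000)]
lam_hat, se = exp_mle(times)
print(lam_hat, se)  # lam_hat close to the true rate 0.5
```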

1.5.1 Independent right censoring

What happens if we add non-informative right censoring? We first simplify by assuming independent censoring: there are i.i.d. censoring times C1, . . . , Cn with density gµ, determined by some parameters µ, which are independent of the failure times. (Any parameters that define this density are nuisance parameters, which we are not interested in estimating. If we are doing maximum likelihood estimation, we require only that there be no interference between µ and λ, in the sense that any feasible value of µ and any feasible value of λ may occur together. If we are doing Bayesian estimation, we need the priors on µ and λ to be independent.)

We write fλ(u) and Fλ(u) for the density and cdf respectively of Ui given the parameter λ; in the exponential case these are λe^{−λu} and 1 − e^{−λu} respectively. The density and cdf for Ci are gµ and Gµ. Suppose now we have made observations Ti = Ui ∧ Ci and δi = 1_{Ui ≤ Ci}. The likelihood contribution of an individual with δi = 1 and Ti = t — an uncensored observation — is then the density of observing exactly Ui = t, multiplied by the probability of Ci ≥ t conditioned on Ui = t. The likelihood contribution of an individual with δi = 0 and Ti = t — a censored observation — is the density of observing exactly Ci = t, multiplied by the probability of Ui > t conditioned on Ci = t. This gives us a log likelihood

\[
\begin{aligned}
\ell(\lambda, \mu \mid T_i, \delta_i) &= \sum_{i=1}^{n} \Big( \delta_i \big( \log f_\lambda(T_i) + \log P_\mu\{C \ge T_i \mid U = T_i\} \big) \\
&\qquad\qquad + (1 - \delta_i) \big( \log P_\lambda\{U > T_i \mid C = T_i\} + \log g_\mu(T_i) \big) \Big) \\
&= \sum_{i=1}^{n} \big( \delta_i \log f_\lambda(T_i) + (1 - \delta_i) \log(1 - F_\lambda(T_i)) \big) \\
&\qquad + \sum_{i=1}^{n} \big( \delta_i \log(1 - G_\mu(T_i)) + (1 - \delta_i) \log g_\mu(T_i) \big)
\end{aligned}
\]


by independence of C and U. If µ and λ are entirely distinct parameters — in the sense that (λ0, µ0) is a possible pair of values whenever λ0 is a possible value of λ and µ0 is a possible value of µ — we can drop the second sum from the likelihood for purposes of maximum likelihood estimation of λ. (If we are doing Bayesian analysis with this likelihood, the corresponding assumption would be that the prior makes λ and µ independent.) Thus

\[
\ell(\lambda \mid T_i, \delta_i) = \sum_{i=1}^{n} \big( \delta_i \log f_\lambda(T_i) + (1 - \delta_i) \log(1 - F_\lambda(T_i)) \big). \tag{1.7}
\]

(If λ and µ are not independent, we can still use this likelihood for inference, but now it is a partial likelihood. We have thrown away some of the information about λ, which was included in Ci, but the inferences we make are still valid. We will discuss partial likelihoods at length in section 10.2.)

We return now to the case where the distribution is exponential. The log likelihood is then

\[
\ell(\lambda) = k \log \lambda - \lambda \sum_{i=1}^{n} T_i,
\]

where k = Σ δi is the number of uncensored observations. This yields an MLE of λ̂ = k/Σ Ti. The observed Fisher information is k/λ², and the expected information is (n/λ²) · P{C ≥ U}. Unsurprisingly, then, average efficiency is reduced exactly by a factor of P{C ≥ U}. Information is produced by observed events. But information also accumulates more quickly when λ is small, meaning that each individual is observed for a longer time. Asymptotically, the information will be like (Σ Ti)²/k.

More generally, there might be a model where the hazard rate is of the form λh(t), where h is a known function. Then

\[
f_\lambda(u) = \lambda h(u) e^{-\lambda H(u)}, \qquad F_\lambda(u) = 1 - e^{-\lambda H(u)},
\]

where H(u) = ∫₀ᵘ h(t) dt. The log likelihood is then, up to an additive term not depending on λ,

\[
\ell(\lambda) = k \log \lambda - \lambda \sum_{i=1}^{n} H(T_i).
\]
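A simulation sketch of this model with right censoring (an assumed Weibull-type choice h(t) = 2t, so H(t) = t², and exponential censoring — details not from the text): substituting into (1.7) gives a log likelihood of the form k log λ − λ Σ H(Ti) up to a constant, whose maximiser is k/Σ H(Ti).

```python
import math
import random

def mle_prop_hazard(rate, cens_rate, n, rng):
    """Hazard lambda*h(t) with h(t) = 2t, so H(t) = t^2 (a Weibull hazard),
    under independent Exp(cens_rate) right censoring.  The MLE maximising
    k*log(lambda) - lambda*sum(H(T_i)) is k / sum(H(T_i))."""
    k, total_H = 0, 0.0
    for _ in range(n):
        u = rng.random()
        event = math.sqrt(-math.log(1.0 - u) / rate)  # invert F(t) = 1 - exp(-rate*t^2)
        cens = rng.expovariate(cens_rate)
        t = min(event, cens)
        k += event <= cens
        total_H += t * t                              # H(t) = t^2
    return k / total_H

rng = random.Random(3)
lam_hat = mle_prop_hazard(1.5, 0.5, 40000, rng)
print(lam_hat)  # close to the true value 1.5
```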

1.5.2 Independent left truncation

Suppose now that each individual has three times: L, U, C, with L ≤ C, and U independent of the pair (L, C). When U < L we observe nothing;


when U > C we observe L and T = C, with δ = 0; otherwise, we observe T = U with δ = 1.

Let I be the set of individuals who are observed, and define the conditional density

\[
f_\lambda(u \mid L = s) = \begin{cases} \dfrac{f_\lambda(u)}{1 - F_\lambda(s)} & \text{if } u \ge s, \\ 0 & \text{if } u < s. \end{cases}
\]
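For the exponential distribution, this conditional density simply restarts the clock at s, by memorylessness. A small sketch (illustrative values only):

```python
import math

def f_exp_trunc(lam, u, s):
    """f_lambda(u | L = s) = f(u) / (1 - F(s)) for u >= s, else 0,
    specialised to the exponential distribution."""
    if u < s:
        return 0.0
    return lam * math.exp(-lam * u) / math.exp(-lam * s)

# Memorylessness: the left-truncated exponential density equals the
# exponential density of the residual time u - s.
print(f_exp_trunc(2.0, 1.5, 1.0))  # equals 2 * exp(-2 * 0.5)
```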

We get

\[
\begin{aligned}
\ell(\lambda, \mu \mid I, T_i, \delta_i) = {} & \sum_{i \in I} \delta_i \Big( \log f_\lambda(T_i \mid L = L_i) + \log\big[\text{marginal density of } L \text{ at } L_i\big] \\
& \qquad\qquad + \log P\big\{ C \ge T_i \mid U = T_i \big\} \Big) \\
& + \sum_{i \in I} (1 - \delta_i) \Big( \log P_\lambda\big\{ U > T_i \mid U \ge L_i \big\} \\
& \qquad\qquad + \log\big[\text{joint density of } (C, L) \text{ at } (T_i, L_i)\big] \Big).
\end{aligned}
\]

As before, we can drop the terms from the log likelihood that do not depend on λ:

\[
\ell(\lambda \mid I, T_i, \delta_i) = \sum_{i \in I} \Big( \delta_i \log f_\lambda(T_i \mid L = L_i) + (1 - \delta_i) \log P_\lambda\big\{ U > T_i \mid U \ge L_i \big\} \Big).
\]

Again substituting the hazard rate hλ and cumulative hazard Hλ for U, we get

\[
\ell(\lambda \mid I, T_i, \delta_i) = \sum_{i \in I} \big( \delta_i \log h_\lambda(T_i) - H_\lambda(T_i) + H_\lambda(L_i) \big).
\]

In the case of an exponential distribution we get

\[
\ell(\lambda \mid I, T_i, \delta_i) = k \log \lambda - \lambda \sum_{i \in I} (T_i - L_i),
\]

where k is the number of uncensored (and untruncated) observations. Thus, the change from the case without truncation is simply that — in addition to including only the untruncated individuals (i ∈ I) in the estimate (which is unavoidable, since the other data don't exist!) — the time of the event is replaced by the “time under observation”: time starting from Li, since observation before Li would not have been possible.
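A simulation sketch of this estimator (the entry and censoring distributions below are assumptions for illustration, not from the text):

```python
import random

def trunc_cens_exp_mle(rate, n, rng):
    """Left-truncated, right-censored exponential data.  Entry L ~ U(0,1),
    censoring C = L + U(0,3), event time U ~ Exp(rate) independent of (L, C);
    anyone with U < L is never observed.  MLE: k / sum(T_i - L_i)."""
    k, exposure = 0, 0.0
    for _ in range(n):
        L = rng.random()
        C = L + 3.0 * rng.random()
        U = rng.expovariate(rate)
        if U < L:
            continue                 # truncated: not in the observed set I
        T = min(U, C)
        k += U <= C
        exposure += T - L            # time under observation
    return k / exposure

rng = random.Random(7)
lam_hat = trunc_cens_exp_mle(1.0, 30000, rng)
print(lam_hat)  # close to the true rate 1.0
```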


Lecture 2

σ-algebras and conditioning

2.1 σ algebras

2.1.1 Background

If you have learned any measure-theoretic probability, you will be familiar with the fundamental “probability triple” (Ω, F, P). Here Ω is the sample space, P is the probability distribution, and F is the set of “events”: the subsets of Ω that have probabilities. For discrete probability F can include all the subsets, so there's no need to think about it. One of the first theorems you prove in measure-theoretic probability is that it is impossible to define a continuous distribution on the real numbers such that all subsets have probabilities. This means that a mathematically rigorous treatment requires that we think carefully about what events are allowed. The collection of events forms a σ-algebra, meaning that it satisfies the following conditions:

(i). Ω ∈ F;

(ii). If A ∈ F then Aᶜ ∈ F;

(iii). If A1, A2, . . . is a countable collection of elements of F, then ⋃_{i=1}^{∞} Ai ∈ F.

These clearly parallel the probability axioms, saying that any set whose probability could be computed from knowing the probabilities of events in F is also an event in F. The smallest σ-algebra that contains all the open intervals is called the Borel σ-algebra, and that is the σ-algebra that we conventionally use for continuous probability on R.

The Borel σ-algebra is BIG. It doesn't include all subsets of R, but actually constructing a set that is not Borel is not something you are likely


to do by accident. Thus, we can still be pretty rigorous about our probability theory — as rigorous as we intend to be in this course — without ever defining precisely which sets are “events”. We will be ignoring these “measurability” questions.

2.1.2 σ-algebras and information

So why are we talking about σ-algebras? They turn out to be exactly the right mathematical language for talking about different sets of information. There are good reasons for restricting to a smaller class of events than the maximum possible. The basic idea is that the full σ-algebra F represents complete information about which outcome occurred, while smaller σ-algebras represent reduced information, such as all the random occurrences up to a fixed time t. In σ-algebra language, all you can know is which events in the σ-algebra occurred.

2.1.3 Examples

Trivial σ-algebra

The minimal σ-algebra — the smallest amount of information you could have — is represented by T = {∅, Ω}. This corresponds to having no information at all. This σ-algebra is contained in any other σ-algebra, by definition.

An example on [0, 1]

Let Ω = [0, 1]. Then F = {∅, [0, 1/2], (1/2, 1], Ω} is a σ-algebra. It may be thought of as representing the information about which half of the interval ω is in.

Coin tossing

Let Ω = {0, 1}^n (where n may be ∞). For distinct integers 1 ≤ k1, . . . , kj ≤ n and xi ∈ {0, 1}, let A(k1, . . . , kj ; x1, . . . , xj) be the “cylinder set”

\[
A(k_1, \dots, k_j; x_1, \dots, x_j) := \big\{ (\omega_1, \dots, \omega_n) : \omega_{k_i} = x_i \text{ for } 1 \le i \le j \big\}.
\]

This is the set of outcomes such that the ki-th coordinate is fixed to be xi. That is, it is the set of sequences of coin flips where some particular flips (given by the ki) are known to have particular values (given by the xi). There are 3^m different cylinder sets with 1 ≤ k1 < k2 < · · · < kj ≤ m, which we may list as A1, . . . , A_{3^m}. Then if we define

\[
\mathcal F_m := \big\{ A_{i_1} \cup \dots \cup A_{i_j} : 0 \le j \le 3^m,\ 1 \le i_1 < i_2 < \dots < i_j \le 3^m \big\},
\]

then Fm may be thought of as representing the information in the first m flips.
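The count 3^m can be checked by brute force for small m: each cylinder set constraining only the first m flips corresponds to choosing, for each of those coordinates, the value 0, the value 1, or no constraint — 3^m patterns, all yielding distinct subsets of Ω. A small enumeration sketch:

```python
from itertools import product

def count_cylinder_sets(m, n):
    """Enumerate, as explicit subsets of {0,1}^n, every cylinder set that
    constrains only the first m coordinates, and count the distinct ones."""
    omega = list(product((0, 1), repeat=n))
    seen = set()
    # a pattern assigns 0, 1, or None (= unconstrained) to each of the
    # first m coordinates
    for pattern in product((0, 1, None), repeat=m):
        cyl = frozenset(w for w in omega
                        if all(p is None or w[i] == p
                               for i, p in enumerate(pattern)))
        seen.add(cyl)
    return len(seen)

print(count_cylinder_sets(2, 4), count_cylinder_sets(3, 4))  # 9 27, i.e. 3^m
```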

2.1.4 Intersections of σ-algebras

If F1, . . . , Fn are σ-algebras, the intersection represents the information that is common to all of them, written

\[
\mathcal F_1 \wedge \dots \wedge \mathcal F_n = \bigwedge_{i=1}^{n} \mathcal F_i = \bigcap_{i=1}^{n} \mathcal F_i.
\]

As with any other sort of intersection, we may take intersections over arbitrary collections: given σ-algebras {Fi : i ∈ I}, for any index set I, the intersection ⋀_{i∈I} Fi = ⋂_{i∈I} Fi is a σ-algebra.

2.1.5 σ algebra generated by a family of sets

If F is any collection of subsets of Ω, the intersection of all σ-algebras containing F is a σ-algebra, called the σ-algebra generated by F, written σ(F). As already mentioned, the σ-algebra generated by the open intervals is called the Borel σ-algebra.

Coin-tossing

The σ-algebra described in section 2.1.3 is generated by the family of sets

\[
B_i = \big\{ (\omega_1, \dots, \omega_n) : \omega_i = 0 \big\}.
\]

2.1.6 Joins of σ-algebras

If F and G are σ-algebras, the join of F and G, written F ∨ G, is the σ-algebra generated by sets of the form A ∩ B, where A ∈ F and B ∈ G. It may be thought of as the σ-algebra containing all the information of F and G together.

We may also define the join of an arbitrary collection of σ-algebras, as being generated by all finite intersections of elements of the individual σ-algebras.


2.1.7 σ-algebra generated by a random variable

If X is a real-valued random variable, the σ-algebra 〈X〉 is defined to be the σ-algebra generated by the family of events of the form {ω : X(ω) ≤ x}, for any x ∈ R. In other words, it consists of the sets of the form X⁻¹(B), where B ⊂ R is any Borel set. This may be thought of as the σ-algebra of information about the outcome that is contained in the value of X.

If X and Y are random variables, the σ-algebra generated by them jointly is 〈X, Y〉 = 〈X〉 ∨ 〈Y〉. We may likewise speak of the σ-algebra generated by an arbitrary collection of random variables.

If F is a σ-algebra with {ω : X(ω) ≤ a} ∈ F for all a ∈ R, we say that X ∈ F. Equivalently, we say that X is F-measurable. Thus, we can think of a σ-algebra as a collection of random variables, and this is the perspective we will be taking. A σ-algebra is a way of summarising the statement that we know the values of certain random variables. The axioms of a σ-algebra mean that for any collection of random variables X1, . . . , Xn ∈ F, F also includes any other random variable that could be computed as a function of X1, . . . , Xn.

Coin-tossing

The σ-algebra Fm described in section 2.1.3 is generated by the random variables X1, . . . , Xm, where Xi is the outcome of the i-th flip.

2.1.8 Filtrations of σ-algebras

A filtration is a collection of σ-algebras Ft such that Fs ⊂ Ft when s ≤ t.(They are said to be increasing.)

A stochastic process (X(t))t≥0 is said to be adapted to the filtration (Ft) if X(t) ∈ Ft for every t. The filtration (Ft) generated by {Xs : s ≤ t} is called the past σ-algebra of X. It is the minimal filtration such that X is adapted to (Ft).

We define

\[
\mathcal F_{t+} := \bigwedge_{s > t} \mathcal F_s, \qquad \mathcal F_{t-} := \bigvee_{s < t} \mathcal F_s.
\]

Intuitively, Ft+ includes all information that is available at all times after t, while Ft− includes all information that is available strictly before time t. We call a filtration right-continuous if Ft = Ft+. Unless otherwise indicated, we will be assuming that filtrations are right-continuous.


The natural filtration of a stochastic process (Xt)t≥0 is the filtration generated by Xt. This is the smallest filtration with respect to which (Xt) is adapted, but we will often need to include additional information — additional random variables — in the filtration. In particular, we will include information about censoring and truncation events up to time t in the filtration.

2.1.9 Adapted processes

A stochastic process is a random function. That is, for every value of the time coordinate, it determines a random variable. For our purposes, the time coordinate will usually be R+ (or a bounded subinterval), and the random variables will usually be real-valued; in other applications the time coordinate may be the natural numbers. A stochastic process (Xt)t≥0 is adapted to the filtration (Ft)t≥0 if Xt ∈ Ft for each t ≥ 0. The smallest adapted filtration — that is, the filtration where Ft has just the minimum amount of information required to determine Xs for all s ≤ t — is called the natural filtration of (Xt).

For example, suppose we are considering a simple queueing process, where individuals arrive at the times of a Poisson process, and are then served at times that are i.i.d. exponentially distributed. Let Ft be the σ-algebra generated by arrivals up to time t. Then Yt, the total number of arrivals up to time t, is adapted with respect to Ft, but Zt, the length of the queue at time t, is not, because it depends on information about the random service times, which are not in Ft.
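A simulation sketch of this example (simplified so that each arrival is served immediately — effectively infinitely many servers, an assumption not in the text): Yt is a function of the arrival times alone, while Zt also needs the service times.

```python
import random

def simulate(t_max, arr_rate, serv_rate, rng):
    """Poisson arrivals on [0, t_max]; arrival i is served immediately for an
    i.i.d. Exp(serv_rate) duration.  Returns Y (arrivals by time t, a function
    of the arrival times alone) and Z (number still in service at time t,
    which also depends on the service times)."""
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(arr_rate)   # i.i.d. exponential inter-arrival gaps
        if t > t_max:
            break
        arrivals.append(t)
    services = [rng.expovariate(serv_rate) for _ in arrivals]
    Y = lambda t: sum(a <= t for a in arrivals)
    Z = lambda t: sum(a <= t < a + s for a, s in zip(arrivals, services))
    return Y, Z

rng = random.Random(5)
Y, Z = simulate(10.0, 2.0, 1.0, rng)
print(Y(5.0), Z(5.0))  # always Z(t) <= Y(t)
```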

Usually we will be taking adaptedness for granted, by understanding Ft to include “all the information” generated by time t.

2.1.10 Stopping times

A stopping time (also called a Markov random time) is a random time such that we know at time t whether it has happened. As usual, “knowledge” is defined by a filtration. That is, a stopping time is a positive-real-valued random variable T such that {ω : T ≤ t} ∈ Ft for all t. Some examples:

• Let Nt be a counting process with deterministic intensity. The time of the first event is a stopping time (with respect to any filtration to which Nt is adapted).

• Let Nt be a counting process of a survival process with n individuals, each of whom has exactly one event at some random time. The time T(k) of the k-th death is a stopping time for any k. This remains true if we start with a random number of individuals, and does not depend on any assumptions about the distribution of the death times. (If k is larger than the number of deaths that ultimately occur, either because k > n or because some deaths do not occur — so lim_{t→∞} Nt < k — the stopping time will be T(k) = ∞.)

• The time of the last observed death in a survival experiment starting with a (possibly random) collection of individuals with random right censoring is not a stopping time. This is because any given event may be the last one, if all survivors are ultimately censored. There is no way to know this at the time of the event (unless, of course, there are no survivors).

Formally, a stopping time is a random time such that all events of the form {ω : T ≤ t} are in Ft. Note that a stopping time with respect to one filtration is automatically a stopping time with respect to any larger filtration — that is, a filtration with more information.

2.1.11 Predictable processes

A crucial concept for the models we want to develop is that of a predictable process. Intuitively, a predictable process is an Ft-adapted process (X(t)) such that no new information appears suddenly. We can “predict” the value of the process at any time on the basis of information available at least infinitesimally before that time. Clearly a left-continuous process would meet this definition. A jump process that has a clock that accumulates time — possibly at a random rate, according to an auxiliary stochastic process — and then jumps when the clock hits 1 would also be predictable. On the other hand, the process

\[
X(t) = \begin{cases} 0 \text{ for all } t & \text{with probability } \tfrac12; \\ 1_{t \ge 1} & \text{with probability } \tfrac12 \end{cases}
\]

is clearly not predictable (when Ft is the natural filtration of X). The value of X(1) cannot be calculated from any information available in the values X(s) for s < 1.

This suggests a definition

(X(t)) is predictable if X(t) ∈ Ft− for each t.

This definition is not restrictive enough, though, because an unpredictable process may not be unpredictable at any particular time. A Poisson counting

Page 26: Modern Survival Analysis notes... · O. O. Aalen, O. Borgan, H. K. Gjessing, Survival and Event History Analysis: A Process Point of View Other material will come from • J. P. Klein

Conditional expectations 18

process is quintessentially unpredictable — independent increments imply that there is no way to know that a jump is coming until it actually comes — but for any fixed t the probability of an unpredictable jump actually occurring at time t is 0. Thus

\[
X(t) = X(t-) + \big[ X(t) - X(t-) \big],
\]

where X(t−) := lim_{δ↓0} X(t − δ). Clearly X(t−) is Ft−-measurable, and X(t) − X(t−) is almost surely 0, so it is F0-measurable (by completeness). So this definition would not exclude the Poisson process.

An alternative definition would be

(X(t)) is predictable if it is left-continuous.

This clearly is too restrictive, since “predictability” would not be infringed if there were a right-continuous jump at a deterministic time, for example. The technical definition then extends this to processes whose random bits are left-continuous. A process is defined to be predictable if for any stopping time T the value of X(T) is determined by information available up to but not including T.

We will use this definition heuristically. For practical purposes, all predictable processes we actually use will be left-continuous.

The more general definition is discussed in section 1.4 of [FH91].

Non-examinable: There are two ways of making a technically correct definition. One is to say that a stochastic process is predictable if it is in the predictable σ-algebra, defined to be the σ-algebra generated by all left-continuous adapted processes. This is a bit hard to imagine, but essentially it extends the left-continuous processes to limits of left-continuous processes. The more natural definition is to say that the process X is predictable (with respect to the filtration (Ft)) if for every stopping time T the random variable X(T) ∈ F_{T−}, defined to be the σ-algebra generated by all events of the form A ∩ {T ≥ t} where A ∈ Ft−.

2.2 Conditioning

One of the most important concepts in probability is the conditional expectation of one random variable conditioned on another — or on a collection of other random variables. An excellent elementary introduction to this concept


may be found in chapter 6 of [Pit93]. A more sophisticated treatment is in chapter 5 of [Dur10].

2.2.1 Definition of conditioning

Let F be a σ-algebra, and X a random variable that may or may not be in F. Then

\[
E\big[ X \mid \mathcal F \big], \tag{2.1}
\]

the expectation of X conditioned on F, is, intuitively, the best approximation — the best guess — we can make to X, given only the information in F. It is the projection of X onto the F-measurable random variables. This may be taken as a formal definition:

\[
E\big[ X \mid \mathcal F \big] \text{ is the } \mathcal F\text{-measurable r.v. } Y \text{ that minimises } E\big[ (X - Y)^2 \big]. \tag{2.2}
\]

Alternatively,

\[
E\big[ X \mid \mathcal F \big] \text{ is the } \mathcal F\text{-measurable r.v. } Y \text{ that satisfies } E[XZ] = E[YZ] \text{ for any } Z \in \mathcal F. \tag{2.3}
\]

These definitions aren't very convenient to work with, and they're not obviously even definitions — does such a Y exist? Is it unique? — so we delve into some more intuitive descriptions. (The answer to both of these questions is yes, by the way, but we won't be going into the proof of this result, which is a standard theorem in any introductory text on measure-theoretic probability. See, for example, [Dur10, section 5.1].)

2.2.2 Intuitive definition for discrete random variables

Suppose F = 〈W〉 is the σ-algebra generated by the discrete random variable W. Then E[X|F] is some function g(W) — that is, if we know that W(ω) = w, then we can compute E[X|F](ω) = g(w). What would g(w) be? According to (2.3), if we take Z = 1_{W=w},

\[
E\big[ X 1_{W=w} \big] = E\big[ g(W) 1_{W=w} \big] = g(w)\,P\{W = w\}.
\]

Thus

\[
g(w) = \frac{E\big[ X 1_{W=w} \big]}{P\{W = w\}},
\]

which is our old definition for E[X|W = w]. That is, E[X|F] = E[X|W] is a random variable that may be written as a function of W, by the rule that assigns to E[X|W](ω) the value E[X|W = w] whenever W(ω) = w.
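This rule can be computed exactly in a small discrete example (a hypothetical two-dice setup, not from the text): with W the first die and X the sum of two fair dice, g(w) = E[X 1_{W=w}]/P{W = w} should come out to w + 7/2.

```python
from fractions import Fraction

# Two fair dice: W = first die, X = sum of the two dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
p = Fraction(1, 36)   # every outcome equally likely

def g(w):
    num = sum(p * (a + b) for a, b in outcomes if a == w)  # E[X 1_{W=w}]
    den = sum(p for a, b in outcomes if a == w)            # P{W = w}
    return num / den

print([g(w) for w in range(1, 7)])  # g(w) == w + 7/2 for each w
```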


It is clear that this may be generalised to σ-algebras generated by multiple discrete random variables. Generalising to continuous (or more complicated) random variables takes some abstract mathematics, but it still works, the results are unique (or, unique enough), and it is still true that when F = 〈W1, . . . , Wk〉 then

\[
E[X \mid \mathcal F](\omega) = E\big[ X \mid W_1 = w_1, \dots, W_k = w_k \big] \text{ whenever } W_i(\omega) = w_i. \tag{2.4}
\]

2.2.3 Properties of conditional expectations

It is a theorem in measure-theoretic probability that conditional expectations always exist. But when we want to work with them, we usually don't go back to the definitions. We use some convenient properties.

• If X ∈ F then E[X|F] = X.

• Suppose X is independent of F, meaning that for any real x and any A ∈ F,

\[
P\big( \{X \le x\} \cap A \big) = P\{X \le x\}\,P(A).
\]

Then E[X|F] = E[X]. This applies, in particular, to the case when F = {∅, Ω}. Since X is always independent of T, we see that E[X|T] is the constant E[X].

• If F ⊂ G are two σ-algebras then

\[
E\Big[ E\big[ X \mid \mathcal F \big] \,\Big|\, \mathcal G \Big] = E\big[ X \mid \mathcal F \big] = E\Big[ E\big[ X \mid \mathcal G \big] \,\Big|\, \mathcal F \Big]. \tag{2.5}
\]

In other words, the cruder estimate dominates. The first equality follows immediately from the fact that E[X|F] is already G-measurable. The second equality follows, intuitively, from the fact that an estimate of an estimate cannot be better than an estimate of the original quantity. Formally (but not examinably!) we go back to the definition (2.3): E[X|F] is an F-measurable random variable, and if Z is any other F-measurable random variable,

\[
E\Big[ E\big[ X \mid \mathcal G \big] Z \Big] = E\Big[ E\big[ XZ \mid \mathcal G \big] \Big] = E[XZ] = E\Big[ E\big[ XZ \mid \mathcal F \big] \Big] = E\Big[ E\big[ X \mid \mathcal F \big] Z \Big],
\]

since Z ∈ F ⊂ G.

• If X and Y are any random variables,

\[
E\big[ X + Y \mid \mathcal F \big] = E\big[ X \mid \mathcal F \big] + E\big[ Y \mid \mathcal F \big]. \tag{2.6}
\]


Since integrals behave like sums, if we have a collection of bounded random variables λ(u) (for u ∈ [s, t]),

\[
E\Big[ \int_s^t \lambda(u)\,du \,\Big|\, \mathcal F \Big] = \int_s^t E\big[ \lambda(u) \mid \mathcal F \big]\,du. \tag{2.7}
\]

• If Y ∈ F, then

\[
E\big[ XY \mid \mathcal F \big] = E\big[ X \mid \mathcal F \big] \cdot Y. \tag{2.8}
\]

Proof. Intuitively this makes sense: if we know the value of Y, the best approximation to XY must be obtained by multiplying Y by the best approximation to X. Formally, we use the characterisation (2.3). First of all, since Y ∈ F and E[X|F] ∈ F, clearly Y·E[X|F] ∈ F. If Z is any other F-measurable random variable then also ZY ∈ F. Thus, by (2.3),

\[
E\Big[ Z \cdot Y E\big[ X \mid \mathcal F \big] \Big] = E\Big[ ZY \cdot E\big[ X \mid \mathcal F \big] \Big] = E[ZXY].
\]


Lecture 3

Martingales

3.1 More about conditioning

3.1.1 Examples

Constants

The most trivial collection of random variables is the collection F of all constants (or deterministic random variables).1 Then E[X|F] = E[X].

We could also think of F as being an empty collection of random variables; constants may be thought of as functions with no arguments.

Conditioning on a single random variable

If F is generated by a random variable Y, we write E[X|F] as E[X|Y], as it is sometimes defined; that is, E[X|Y] is a random variable that is a function of Y, calculated according to the rule: when Y = y, E[X|Y] = E[X|Y = y].

Uniform distribution

This is just an example to see how these tools work in a different kind of case. There is no need to learn the details of this particular example, but you should understand the techniques being applied. Let X be uniformly distributed on [0, 1], and let Y = X(1 − X). Then F := 〈X〉 includes all open intervals, hence is the complete Borel σ-algebra, while G := 〈Y〉 includes only the Borel sets that are symmetric about 1/2. Clearly E[X|F] = X and E[X²|F] = X², since X and X² are both

1In the usual language of σ-algebras, F is the σ-algebra {∅, Ω}.


F-measurable. On the other hand,

\[
E\big[ X \mid \mathcal G \big] = \frac12, \qquad
E\big[ X^2 \mid \mathcal G \big] = E\big[ X - Y \mid \mathcal G \big] = \frac12 - Y.
\]

The second calculation follows from the first, using (2.6). The first seems intuitively clear: given the value of Y, which is the information in G, we know that X = ½(1 ± √(1 − 4Y)), with both possibilities equally likely (since the distribution is uniform). Thus, the conditional expectation is

\[
E[X \mid Y = y] = \frac12 \cdot \frac{1 + \sqrt{1 - 4y}}{2} + \frac12 \cdot \frac{1 - \sqrt{1 - 4y}}{2} = \frac12.
\]

More formally, we can write Z = sgn(2X − 1). Then X = ½(1 + Z√(1 − 4Y)), so

\[
E\big[ X \mid \mathcal G \big] = \frac12 + \frac12 E\big[ Z\sqrt{1 - 4Y} \mid \mathcal G \big] = \frac12 + \frac12 E\big[ Z \mid \mathcal G \big] \sqrt{1 - 4Y}
\]

by (2.8). We know that

\[
E\big[ Z \mid \mathcal G \big] = \frac{f\big( (1 + \sqrt{1 - 4Y})/2 \big) - f\big( (1 - \sqrt{1 - 4Y})/2 \big)}{f\big( (1 + \sqrt{1 - 4Y})/2 \big) + f\big( (1 - \sqrt{1 - 4Y})/2 \big)},
\]

where f is the density of X. In the case where f is uniform, this is 0, but we can compute this more generally.
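A Monte Carlo sketch of this example: conditioning on Y falling in a narrow band selects the two preimages of that band symmetrically about 1/2, so the conditional mean of X should be close to 1/2. (The band and sample size below are arbitrary choices for illustration.)

```python
import random

# X ~ U(0,1), Y = X(1-X).  Conditioning on Y in a narrow band selects the
# two preimages x and 1-x with equal weight, so E[X | Y] should be 1/2.
rng = random.Random(11)
xs = [rng.random() for _ in range(200000)]
band = [x for x in xs if 0.10 < x * (1.0 - x) < 0.12]
cond_mean = sum(band) / len(band)
print(cond_mean)  # close to 0.5
```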

3.1.2 In search of past time

We can now describe, informally, the general intensity of a counting process as

    λ(t)dt = E[dN(t) | F_{t−}].    (3.1)

This is a useful way of thinking about it, even if it's not quite right, because dt and dN(t) aren't real mathematical objects (unless we're doing nonstandard analysis).

This intuition allows us to think of intensities as being relative to the information available at time t, and we may want to be able to change our minds about what information is available for computing the intensity. For example, in many of the models we are concerned with, the hazard rates of individuals depend on covariates. A given covariate may be observed or unobserved in a given setting, and the filtration is part of the model. It is crucial that F_t (or whatever we call the filtration in a given setting)



contains only random variables whose values are known by time t, which is in general a subset of the random variables whose values have been physically determined by time t.

This leads to an important theorem, the Innovation Theorem, which is an application of (2.5): Suppose we have two different collections of random variables F_t and G_t for each t, such that F_t contains only partial information up to time t — that is, only some of the random variables determined by time t — and G_t contains all the information. (Or, G_t may also contain only partial information, but in any case F_t ⊂ G_t.) Suppose we have computed the intensity of the process based on the information in G_t, but we actually want the intensity based on F_t. Then by (2.5)

    λ_F(t) = E[dN(t) | F_{t−}] = E[ E[dN(t) | G_{t−}] | F_{t−} ].

Theorem 3.1 (Innovation Theorem). Suppose F_t ⊂ G_t for every t. Then the associated intensities satisfy

    λ_F(t) = E[λ_G(t) | F_{t−}].    (3.2)

3.1.3 Examples

Coin flipping

Let Ω = {0, 1}^∞ be the space of infinite sequences of coin flips, and X_i = ω_i the outcome of the i-th flip. Let P be the probability distribution that makes the X_i i.i.d. uniform on {0, 1}. Let F_t be the σ-algebra generated by X_1, …, X_m, where m = ⌊t⌋. Of course, we would ordinarily look at this process in discrete time, but there is no problem with embedding it in continuous time. We may also define X(t) := X_{⌊t⌋} and S(t) = Σ_{1≤i≤t} X_i.

Then X(t) and S(t) are right-continuous and in F_t; (F_t) is a right-continuous filtration.

If n ≥ m then

    E[S(n) | F_m] = E[S(m) | F_m] + E[S(n) − S(m) | F_m]
                  = S(m) + E[S(n) − S(m)]
                  = S(m) + (n − m)/2,

since S(n) − S(m) is independent of F_m.
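This identity is easy to check by simulation. The sketch below is my own illustration (the function name and parameter values are not from the notes): it estimates E[S(n) − S(m)] and the covariance between the increment and S(m) for fair coin flips.

```python
import random

def coin_flip_check(m=10, n=30, n_paths=100_000, seed=0):
    """Check that S(n) - S(m) has mean (n - m)/2 and is uncorrelated with S(m)."""
    rng = random.Random(seed)
    sm, inc = [], []
    for _ in range(n_paths):
        flips = [rng.randrange(2) for _ in range(n)]
        sm.append(sum(flips[:m]))    # S(m), known at time m
        inc.append(sum(flips[m:]))   # S(n) - S(m), independent of F_m
    mean_inc = sum(inc) / n_paths
    mean_sm = sum(sm) / n_paths
    cov = sum((a - mean_inc) * (b - mean_sm) for a, b in zip(inc, sm)) / n_paths
    return mean_inc, cov
```

With m = 10 and n = 30 the mean increment should come out near (n − m)/2 = 10 and the covariance near 0, consistent with E[S(n) | F_m] = S(m) + (n − m)/2.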



Poisson process

Let N(t) be a Poisson counting process with intensity λ, and F_t the associated filtration. For any fixed t and t′ > t,

    E[N(t′) | F_t] = E[N(t′) − N(t) | F_t] + E[N(t) | F_t] = λ(t′ − t) + N(t),

and

    E[N(t′)² | F_t] = E[(N(t′) − N(t))² | F_t] + E[N(t)² | F_t] + 2E[(N(t′) − N(t))N(t) | F_t]
                    = λ(t′ − t) + λ²(t′ − t)² + N(t)² + 2N(t)·λ(t′ − t),

so that

    Var(N(t′) | F_t) = E[N(t′)² | F_t] − E[N(t′) | F_t]² = λ(t′ − t).

Thus, unsurprisingly (since N(t′) − N(t) is independent of F_t), the conditional variance of N(t′) given F_t is precisely the same as Var(N(t′) − N(t)).
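A quick Monte Carlo check of this is straightforward. The helper below is my own sketch (parameter values arbitrary): it samples the count of a rate-λ Poisson process in a window of width t′ − t by summing exponential inter-arrival gaps, then compares mean and variance.

```python
import random

def poisson_window_count(rng, lam, width):
    """Number of events of a rate-lam Poisson process in an interval of given width."""
    n, clock = 0, 0.0
    while True:
        clock += rng.expovariate(lam)
        if clock > width:
            return n
        n += 1

def increment_moments(lam=2.0, width=1.5, n_paths=50_000, seed=0):
    """Sample mean and variance of N(t') - N(t); both should be near lam * width."""
    rng = random.Random(seed)
    xs = [poisson_window_count(rng, lam, width) for _ in range(n_paths)]
    mean = sum(xs) / n_paths
    var = sum((x - mean) ** 2 for x in xs) / n_paths
    return mean, var
```

With λ = 2 and width 1.5 both moments should be close to λ(t′ − t) = 3.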

Frailty process

Suppose we have a single individual, whose mortality rate as a function of age is Gompertz — that is, λ(t) = B e^{θt}. Suppose now that B is itself a random "frailty", determined already at age 0. For definiteness, let us say that B has a Gamma distribution with parameters (r, µ). (Recall that this distribution has density

    f_{r,µ}(x) = (µ^r / Γ(r)) x^{r−1} e^{−µx}    (3.3)

on x ∈ (0, ∞). Its expectation is r/µ and its variance r/µ².) Let G_t be the complete past up to time t, while F_t is the past excluding the random variable B. (In this case, F_t will be generated by T·1{T ≤ t}.)

Since G includes all the information,

    λ_G(t) = B e^{θt} 1{T ≥ t}.

That is, up to time t we know B, and so we know that the rate of the single event occurring at time t — intuitively, the "instantaneous probability per unit of time" — is the hazard rate, unless the event has already happened, in which case the intensity is 0.



λ_F is the intensity you would estimate if you could not observe B. The Innovation Theorem tells us that

    λ_F(t) = E[B e^{θt} 1{T ≥ t} | F_{t−}] = e^{θt} E[B·1{T ≥ t} | T·1{T < t}],

because T·1{T < t} generates F_{t−}. Now, on the event T·1{T < t} > 0 — that is, when T < t, so the event has already happened — λ_F(t) = λ_G(t) = 0. On the remaining event, where T ≥ t, we get

    λ_F(t) = e^{θt} E[B·1{T ≥ t} | T ≥ t]
           = e^{θt} E[B | T ≥ t]
           = e^{θt} E[B·1{T ≥ t}] / P{T ≥ t}
           = e^{θt} ( ∫₀^∞ x f_{r,µ}(x) P{T ≥ t | B = x} dx ) / ( ∫₀^∞ f_{r,µ}(x) P{T ≥ t | B = x} dx )
           = e^{θt} ( ∫₀^∞ x^r e^{−µx} P{T ≥ t | B = x} dx ) / ( ∫₀^∞ x^{r−1} e^{−µx} P{T ≥ t | B = x} dx ).

We have

    P{T ≥ t | B = x} = exp( −∫₀^t x e^{θs} ds ) = exp( −(x/θ)(e^{θt} − 1) ).

So

    λ_F(t) = e^{θt} ( ∫₀^∞ x^r exp(−x(µ + (e^{θt} − 1)/θ)) dx ) / ( ∫₀^∞ x^{r−1} exp(−x(µ + (e^{θt} − 1)/θ)) dx )
           = e^{θt} Γ(r + 1)(µ + (e^{θt} − 1)/θ)^{−(r+1)} / ( Γ(r)(µ + (e^{θt} − 1)/θ)^{−r} )
           = r e^{θt} / (µ + (e^{θt} − 1)/θ).
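The closed form for λ_F can be checked numerically. The sketch below is my own (parameter values arbitrary): it estimates e^{θt} E[B | T ≥ t] by weighting Gamma draws of B by the survival probability P{T ≥ t | B}, and compares against the formula.

```python
import math, random

def lambda_F_exact(t, r, mu, theta):
    """Population hazard r * e^{theta t} / (mu + (e^{theta t} - 1)/theta)."""
    return r * math.exp(theta * t) / (mu + (math.exp(theta * t) - 1.0) / theta)

def lambda_F_mc(t, r, mu, theta, n=100_000, seed=1):
    """Monte Carlo: e^{theta t} E[B * S(t|B)] / E[S(t|B)], B ~ Gamma(shape r, rate mu)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        b = rng.gammavariate(r, 1.0 / mu)   # gammavariate takes shape and *scale* = 1/rate
        surv = math.exp(-(b / theta) * (math.exp(theta * t) - 1.0))  # P(T >= t | B = b)
        num += b * surv
        den += surv
    return math.exp(theta * t) * num / den
```

For instance, with r = 2, µ = 1, θ = 0.1 the two agree to within Monte Carlo error at moderate ages t.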

3.2 Martingales

3.2.1 Definitions

Intuitively, a martingale is a "fair game". That is, based on what has happened up to time t, the expected future change is 0. On average, the best guess about the value at any future time is that it will be equal to the current value. Formally, the stochastic process (M(t))_{t≥0} is a martingale with respect to the filtration F_t (or an "F_t-martingale") if for any t ≥ s,

    E[M(t) | F_s] = M(s).    (3.4)

Fact 3.2. If (M(t)) is a martingale and Y ∈ F_s, then E[(M(t) − M(s))Y] = 0 for any t ≥ s, and hence E[M(t)Y] = E[M(s)Y].

This is an immediate consequence of (2.8).

Fact 3.3 (Optional stopping). If T > s is a bounded stopping time, then E[M(T) | F_s] = M(s).

Example: Sums of i.i.d. random variables

If M(t) = Σ_{i=0}^{⌊t⌋} ξ_i, where ξ_0, ξ_1, … are i.i.d. with E[ξ_i] = 0, then (M(t)) is a martingale. (We can also think of this as a discrete-time martingale.)

The reason martingales are interesting is that they have many of the same nice properties as sums of i.i.d. random variables — Laws of Large Numbers and the Central Limit Theorem — but are much more general.

Brownian motion

As the number of jumps in a discrete martingale gets large, while the size of the jumps remains small, the process can be rescaled to converge to a continuous martingale called Brownian motion, or the Wiener process. The mathematical analysis of this object is a fascinating subject in its own right (which many of you will have seen in other courses), but we will not have much to say about it in this course. Still, it is useful to know that there is this universal limiting object, which plays the same role for whole stochastic-process paths that the Gaussian distribution plays for one-dimensional averages. The essential properties of Brownian motion are:

• Continuous: Brownian motion is a random continuous function B : R₊ → R.

• Independent increments: For any 0 ≤ s₁ ≤ t₁ ≤ s₂ ≤ t₂ ≤ ⋯ ≤ sₙ ≤ tₙ, the random variables B(t₁) − B(s₁), B(t₂) − B(s₂), …, B(tₙ) − B(sₙ) are independent.

• Normal distribution: For any s ≤ t, B(t) − B(s) is normally distributed with mean 0 and variance t − s.



3.3 Compensators

If N(t) is a Poisson counting process with intensity λ, it obviously isn't a martingale, since it only goes up. But if we define M(t) := N(t) − λt, then

    E[M(t) | F_s] = E[M(s) + (M(t) − M(s)) | F_s]
                  = M(s) + E[N(t) − N(s) − λ(t − s) | F_s]
                  = M(s) + E[N(t) − N(s) | F_s] − λ(t − s)
                  = M(s).

The last line uses the fact that N(t) − N(s) is independent of F_s, which implies that

    E[N(t) − N(s) | F_s] = E[N(t) − N(s)] = λ(t − s).

So we subtract the function λt from N and get a martingale. We say λt is the compensator of the Poisson counting process.
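Both the mean-zero property of M(t) = N(t) − λt and Fact 3.2 (taking Y = M(s)) can be checked by simulation. This sketch is my own illustration, with arbitrary parameters:

```python
import random

def compensated_poisson_check(lam=1.5, s=2.0, t=5.0, n_paths=50_000, seed=0):
    """For M(t) = N(t) - lam*t, estimate E[M(t)] and E[(M(t) - M(s)) * M(s)]."""
    rng = random.Random(seed)
    mean_mt = cross = 0.0
    for _ in range(n_paths):
        ns = nt = 0
        clock = 0.0
        while True:
            clock += rng.expovariate(lam)   # exponential inter-arrival gaps
            if clock > t:
                break
            nt += 1
            if clock <= s:
                ns += 1
        ms = ns - lam * s                   # M(s)
        mt = nt - lam * t                   # M(t)
        mean_mt += mt
        cross += (mt - ms) * ms             # averages to 0 by Fact 3.2 with Y = M(s)
    return mean_mt / n_paths, cross / n_paths
```

Both estimates should be near 0: the increment M(t) − M(s) has mean zero and is uncorrelated with anything known at time s.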

A compensator for the random process N(t) is a (possibly random) process A(t) with the following properties:

• A(t) is non-decreasing;

• A(t) is predictable;

• N(t)−A(t) is a martingale.

Of course, if N(t) has a compensator A(t), then taking M(t) := N(t) − A(t), for any t ≥ s

    E[N(t) | F_s] = E[M(t) + A(t) | F_s]
                  = E[M(t) | F_s] + A(s) + E[A(t) − A(s) | F_s]
                  ≥ M(s) + A(s),

since A(t) − A(s) is always ≥ 0, so

    E[N(t) | F_s] ≥ N(s).    (3.5)

A process N satisfying (3.5) is called a submartingale. There is a result, the Doob–Meyer decomposition, telling us that every submartingale has a compensator — that is, it may be written as the sum of a martingale and a non-decreasing predictable process.

Intuitively, the compensator is the cumulative conditional rate of instantaneous average increase of the process. For a Poisson-like counting process this is exactly the same thing that we have vaguely defined as a cumulative intensity.

We can restate the Innovation Theorem in a more general way:

Theorem 3.4. Suppose F_t ⊂ G_t for every t. Suppose (N_t) is a G_t-submartingale, with uniformly continuous compensator A(t). Then it is also an F_t-submartingale, with compensator A*(t) := E[A(t) | F_t].

Proof. For 0 ≤ s ≤ t, by the refinement rule

    E[N(t) | F_s] = E[ E[N(t) | G_s] | F_s ]
                  ≥ E[N(s) | F_s]    (since N is a G-submartingale)
                  = N(s).

So N is an F-submartingale. A* is continuous and adapted, hence F-predictable.²

It remains to show that N − A* is an F-martingale. For 0 ≤ s ≤ t, we have

    E[A*(t) | F_s] = E[ E[A(t) | F_t] | F_s ] = E[A(t) | F_s],

so

    E[N(t) − A*(t) | F_s] = E[N(t) − A(t) | F_s]
                          = E[ E[N(t) − A(t) | G_s] | F_s ]
                          = E[N(s) − A(s) | F_s]
                          = N(s) − E[A(s) | F_s]
                          = N(s) − A*(s).

3.3.1 Inhomogeneous Poisson counting process

Let N(t) be a counting process with predictable intensity λ(t). We make the assumption that λ(t) does not change, on average, very rapidly, in the sense that there is a constant C such that for all t ≥ s ≥ u ≥ 0,

    E[ |λ(t) − λ(s)| | F_u ] ≤ C(t − s).    (3.6)

²Uniform continuity of A implies that A* is continuous. This is not examinable, but you may want to think about why it is true. In fact, continuity is not required here — predictable would suffice — but it is enough for our purposes.



We have not formally defined what it means for λ(t) to be the intensity of N(t). We define it now (not completely rigorously) by the relations

    E[N(t + δ) − N(t) | F_t] = λ(t)δ + o(δ),  uniformly,
    E[(N(t + δ) − N(t))·1{N(t + δ) − N(t) ≥ 2} | F_t] = o(δ),  uniformly.    (3.7)

That is, conditioned on all the information up to time t, the estimated rate of new points appearing in the next instant is λ(t); and the rate of multiple points in a tiny interval is vanishingly small. The errors are small with respect to δ in a way that is uniform over all random outcomes and all t.

Let Λ(t) := ∫₀^t λ(s)ds. If we define M(t) := N(t) − Λ(t), then for t ≥ s

    E[M(t + δ) | F_s] = E[ E[M(t + δ) | F_t] | F_s ].

Thus,

    (d/dt) E[M(t) | F_s] = lim_{δ↓0} δ⁻¹ E[ E[M(t + δ) − M(t) | F_t] | F_s ]
     = lim_{δ↓0} E[ δ⁻¹E[N(t + δ) − N(t) | F_t] − E[ δ⁻¹ ∫_t^{t+δ} λ(u)du | F_t ] | F_s ]
     = lim_{δ↓0} E[ δ⁻¹E[δλ(t) + o(δ) | F_t] − E[λ(t) | F_t] | F_s ]
     = 0.

Thus E[M(t) | F_s] is constant for t ≥ s, hence always equal to M(s), confirming that M is a martingale.

More rigorously, we can start with a predictable process Λ(t) and a Poisson counting process X with unit intensity, independent of Λ. Define N(t) = X(Λ(t)). This is what we mean by a Poisson counting process with (random) cumulative intensity Λ. It is an adapted submartingale with compensator Λ. To see this, we write M(t) := N(t) − Λ(t), and define G_t to be the σ-algebra generated by (Λ(s))_{0≤s≤t}. The strategy is this: we can prove that something has expectation 0 by showing that there is some additional information that we can condition on, such that the conditional expectation is 0 no matter what value that information takes. Conditioning on G_t ∨ F_s we can treat Λ(t) like a constant, and so

    E[M(t) | G_t ∨ F_s] = E[ X(Λ(t)) − Λ(t) | G_t ∨ F_s ].

We compute

    E[M(t) | F_s] = E[ E[M(t) | G_t ∨ F_s] | F_s ]
     = E[ E[X(Λ(t)) − Λ(t) | G_t ∨ F_s] | F_s ]
     = E[ E[X(Λ(t)) − X(Λ(s)) | G_t ∨ F_s] | F_s ] + E[ X(Λ(s)) − Λ(t) | F_s ]
     = E[ E[Λ(t) − Λ(s) | G_t ∨ F_s] | F_s ] + X(Λ(s)) − Λ(s) − E[Λ(t) − Λ(s) | F_s]
     = X(Λ(s)) − Λ(s)
     = M(s).
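The time-change construction is easy to simulate. This sketch is my own illustration: the random cumulative intensity Λ(t) = R·t, with R drawn uniformly per path, is just an arbitrary choice of a Λ determined at time 0.

```python
import random

def time_changed_poisson(t=4.0, n_paths=50_000, seed=0):
    """N(t) = X(Lambda(t)) with X a unit-rate Poisson process and Lambda(t) = R*t,
    R ~ Uniform(0.5, 1.5). Returns the average of M(t) = N(t) - Lambda(t)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        R = rng.uniform(0.5, 1.5)
        lam_t = R * t                      # Lambda(t), fixed once R is drawn
        n, clock = 0, 0.0
        while True:
            clock += rng.expovariate(1.0)  # unit-rate process X, run up to time Lambda(t)
            if clock > lam_t:
                break
            n += 1
        total += n - lam_t                 # one sample of M(t)
    return total / n_paths
```

The average of M(t) should be near 0, consistent with Λ being the compensator of N.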


Lecture 4

Stochastic integrals and nonparametric estimation

4.1 Background

4.1.1 Introduction to non-parametric estimation

Hazard rates are natural objects for modelling, but not for estimation. The problem is the same as for density estimation: an observation occurs at a single dimensionless point. The best summary of the actual observations would be to say that the density is infinite at the points of observation, and zero elsewhere. Various assumptions of smoothness of the density or hazard rate will lead to optimal smoothing procedures for hazard-rate estimation. (An alternative approach, called isotonic estimation, is to impose assumptions like an increasing hazard rate, which indirectly force a certain degree of smoothness.) We will not have much to say about hazard-rate (or density) estimation in this course.

Instead, we will be concerned with estimating cumulative hazard rates, which is equivalent to estimating survival functions (or cdfs), since S(t) = e^{−Λ(t)}. Whereas hazard rates can be any nonnegative function, cumulative hazard rates are by definition increasing functions, which is a much more manageable class of functions to search.




4.1.2 The multiplicative intensity model

Our basic model is a counting process that is a sum of n individual counting processes:

    N(t) = Σ_{i=1}^n N_i(t),  where N_i(t) has intensity λ_i(t) = α(t)Y_i(t),

and Y_i(t) is a random process that is 0 or 1, depending on whether individual i is "at risk" up to time t. The intensity of N(t) is then λ(t) = α(t)Y(t), where

    Y(t) = Σ_{i=1}^n Y_i(t) = # individuals at risk at time t.

Here α(t) is an arbitrary (deterministic) positive function with ∫₀^∞ α(t)dt = ∞.

We assume that Y_i(t) is predictable; that is, Y_i(t) is known infinitesimally before time t. In particular, there is no hidden frailty or other unknown information for individual i, and Y_i(t) doesn't depend on the jump that happens or doesn't happen at time t. (We will relax the first condition in Lecture 10.) Our goal is to estimate A(t) = ∫₀^t α(s)ds.

In the survival setting — where each individual can have at most one event T_i — Y_i(t) = 0 for t > T_i, and α(t) represents the instantaneous probability of the event occurring for an at-risk individual.

This framework allows for both left truncation and right censoring. Left truncation means that events that happen before a certain (possibly random) time τ_i exclude the individual from the study. This is represented in the model by a Y_i(t) that starts at 0, and then jumps to 1 at time τ_i if and only if τ_i < T_i. Thus a truncated individual is represented by an intensity that is always 0, which is equivalent (from the point of view of estimating the intensity α) to not being included in the study.

Right censoring means that observations after the (possibly random) censoring time C_i are not observed. This is represented by a Y_i that starts at 1, and then drops to 0 at C_i ∧ T_i.

Note the asymmetry: In censoring, we always observe either the event time or the censoring time, but never both.¹ In truncation, we observe either

¹In theory. In practice, the censoring time may be difficult to observe. If a survival time is censored because a subject has moved away — or died of an unrelated cause — this may not be known immediately, or ever. If there is, say, a five-year follow-up, at which point it becomes known that a certain subject moved away and left no forwarding address, there may be no way to determine at precisely what point he or she became unobservable.



both times — truncation time followed by event time, for left truncation — or neither, which is why the latter case is equivalent to not being included in the study at all.² Of course, both may be active: An individual whose left-truncation time precedes the right-censoring time has a Y_i that jumps from 0 up to 1 at time τ_i, and then down to 0 at time C_i ∧ T_i, but only if τ_i < C_i ∧ T_i. If τ_i ≥ C_i ∧ T_i then Y_i(t) is always 0.

These are not the only possibilities for Y_i. It is possible for individual i to be at risk only at certain times (for instance, if the event time is the completion of a certain task, and the subject takes breaks, during which Y_i = 0). It may be that the maximum number of events for the individual is two or more (perhaps random), in which case Y_i(t) remains at 1 until that number of events has been completed. Y_i(t) may depend in a complicated way on the other N_j(t). For instance, in a matched case-control study the subjects may be paired up, with observations being made until the first of the two has an event. In this case, we would define Y_i(t) = 1{t ≤ T_j ∧ T_i}, where j is the partner of i.

Non-informative censoring is guaranteed in the multiplicative intensity model by the assumption that Y_i(t) is predictable, and, in particular, that it is adapted.

4.2 Cheater’s guide to stochastic integrals

You may be familiar with the Riemann–Stieltjes integral

    ∫₀^t f(x)dG(x).

Without concerning ourselves too much with the formalities, we may think of this as the integral of a bounded function f(x) with respect to changes in G. If G is a differentiable function whose derivative is g, then dG = (dG/dx)dx, so we may write

    ∫₀^t f(x)dG(x) = ∫₀^t f(x)g(x)dx.

Recording five years as the censoring time will clearly overestimate the time at risk, and so underestimate the hazard rate.

²Again, in theory. In practice we may not know precisely when an individual in our study first was at risk, from the perspective of the study. Fortunately, the standard methods do not require that we know anything about the individuals who were truncated out of the study.



What if G has jumps, but is differentiable between the jumps? Suppose G′(x) = g(x) for x ∉ {x₁, …, xₙ}, where x₁ < ⋯ < xₙ, and G has a jump of size y_i at x_i — that is, G(x_i) − G(x_i−) = y_i. Then the "change in G" has a jump of size y_i at x_i, so

    ∫₀^t f(x)dG(x) = ∫₀^t f(x)g(x)dx + Σ_{i=1}^n f(x_i)y_i.    (4.1)

The random functions we will be concerned with are piecewise differentiable with jumps, so we may apply formula (4.1) to define the integral with respect to one of them. In particular, if N(t) is a counting process with distinct events at 0 ≤ T₁ < ⋯ < T_K ≤ t (where K may now also be random), then for any bounded predictable stochastic process X(s),

    ∫₀^t X(s)dN(s) = Σ_{i=1}^K X(T_i).    (4.2)
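Formula (4.2) makes such integrals trivial to compute: the integral is just the sum of the integrand over event times. A minimal helper (my own, for illustration):

```python
def counting_process_integral(X, event_times, t):
    """Evaluate the stochastic integral of X against dN up to time t,
    i.e. the sum of X(T_i) over event times T_i <= t, as in (4.2)."""
    return sum(X(Ti) for Ti in event_times if Ti <= t)
```

For instance, with events at times 1, 2 and 5 and X(s) = s², the integral up to t = 3 is 1 + 4 = 5.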

Example 4.1: Compensated counting process

Given the compensator Λ(s) = ∫₀^s λ(u)du, we define the martingale M(s) = N(s) − Λ(s). Then for any predictable process (X(t)),

    ∫₀^t X(s)dM(s) = Σ_{T_i≤t} X(T_i) − ∫₀^t X(s)λ(s)ds.    (4.3)

Note that the changes in M all have expectation 0 (because M is a martingale). What we are doing is taking the change ∆M(s) and multiplying it by X(s), which is already known before time s, so may be thought of as a constant. The resulting change is larger or smaller than ∆M(s), in proportion to X(s), but it is still zero on average.

Fact 4.1. If X is an (F_t)-predictable random process and M an (F_t)-martingale, then

    Y(t) := ∫₀^t X(s)dM(s)  is an (F_t)-martingale.    (4.4)



4.3 The Nelson–Aalen estimator

4.3.1 Distinct event times: Informal derivation

Consider the case of right-censored (but not truncated) survival data that are precisely observed, so that there is no possibility of an exact tie. Let t₁ < t₂ < ⋯ < t_m be the (ordered) times at which an event is observed. If there are n individuals under observation, then of course m ≤ n.

Split up the time period under observation into equal subintervals of width ε > 0, each small enough that the probability of two events in any subinterval is negligible. The cumulative hazard may be thought of as approximately

    A(t) ≈ ε Σ_{k=0}^{⌊t/ε⌋} α(kε),

and εα(kε)Y(kε) is (up to errors of order ε²) the probability of an event occurring in the time interval [kε, (k+1)ε], conditioned on the past up to time kε. That is, conditioned on the past, 1{N((k+1)ε) > N(kε)} is a Bernoulli random variable with probability εα(kε)Y(kε), for which the natural estimator is simply the random variable itself, 1{N((k+1)ε) > N(kε)}. Thus, the estimator of α(kε) is

    α̂(kε) = 1{N((k+1)ε) > N(kε)} / (εY(kε)),

which is nonzero just for those intervals with some t_i ∈ [kε, (k+1)ε), at which point it has the value 1/(εY(t_i)) (up to an error of order ε). Thus

    Â(t) = ε Σ_{k=0}^{⌊t/ε⌋} α̂(kε) = Σ_{i: t_i≤t} 1/Y(t_i),    (4.5)

which is the Nelson–Aalen estimator.

4.3.2 Distinct event times: Formal derivation of the Nelson–Aalen estimator

Since N(t) is a counting process with intensity α(t)Y(t), we may write (informally)

    dN(t) = α(t)Y(t)dt + dM(t),

where M(t) is a martingale jump process. As long as Y(t) stays nonzero³, we may write

    ∫₀^t dN(s)/Y(s) = ∫₀^t α(s)ds + ∫₀^t dM(s)/Y(s).

Recall that the integral with respect to a counting process is the same as summing the integrand over the jump points. Thus the left-hand side is just the Nelson–Aalen estimator Â(t). The first term on the right-hand side is A(t). Thus

    Â(t) − A(t) = ∫₀^t dM(s)/Y(s).

Thus Â(t) − A(t) is a martingale, implying in particular that its expectation is 0 for any t. This means that Â(t) is an unbiased estimator of A(t) for each t.

4.3.3 Simulated data set

Suppose that we have 10 observations in the data set, with failure times as follows:

    21, 47, 52, 58+, 71, 72+, 125, 143+, 143+, 143+    (4.6)

Here + indicates a censored observation. Then we can calculate the Nelson–Aalen estimator for S(t) at all time points. It is obviously unsafe to extrapolate much beyond the last time point, 143, even with a large data set.

Table 4.1: Computations of survival estimates for simulated data set (4.6)

     t_i   Y(t_i)   1/Y(t_i)   Â(t_i)   Ŝ(t_i)
      21     10      0.100      0.100     0.90
      47      9      0.111      0.211     0.81
      52      8      0.125      0.336     0.71
      71      6      0.167      0.503     0.60
     125      4      0.250      0.753     0.47
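Table 4.1 is easy to reproduce in code. The sketch below is my own (it builds in the convention of section 4.3.4 that an event precedes a censoring at a nominally tied time), returning one row per event:

```python
import math

def nelson_aalen(times, events):
    """Nelson-Aalen estimator for right-censored survival data.
    times: observation times; events: True for an event, False for censoring.
    Returns rows (t_i, Y(t_i), A_hat(t_i), S_hat(t_i) = exp(-A_hat))."""
    # Sort by time, with events placed before censorings at tied times.
    data = sorted(zip(times, events), key=lambda p: (p[0], not p[1]))
    n, A, rows = len(data), 0.0, []
    for i, (t, is_event) in enumerate(data):
        if is_event:
            at_risk = n - i                 # everyone at or after position i is still at risk
            A += 1.0 / at_risk
            rows.append((t, at_risk, round(A, 3), round(math.exp(-A), 2)))
    return rows
```

Applied to the data set (4.6) — times 21, 47, 52, 58+, 71, 72+, 125, 143+, 143+, 143+ — this reproduces the rows of Table 4.1.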

³We may deal with the problem of Y(t) = 0 by integrating only over s such that Y(s) > 0; this is done formally in [ABG08], section 3.1.5.



4.3.4 Breaking ties

We describe here only the survival setting, where each individual has a maximum of one event, so that Y_i(t) = 0 for t ≥ T_i.

We are assuming here that event times have continuous distributions, though rounding may lead to nominal ties. This has important implications for how we deal with ties that appear in our data.

By convention, we always assume that censoring follows an event when they are nominally simultaneous. The argument is that Y(T_i) is the number of individuals at risk just before time T_i.

When multiple events are simultaneous, we treat them as though they were in fact distinct, even if the distinction is unknown. Suppose d_i is the number of events reported at time t_i. If Y(t_i) is the number at risk just before t_i, then the first death lowers this to Y(t_i) − 1, then Y(t_i) − 2, and so on, until it reaches Y(t_i) − d_i, which is the same as Y(t_i+) = lim_{δ↓0} Y(t_i + δ). This changes the estimator of A(t) to

    Â(t) = Σ_{t_i≤t} Σ_{k=0}^{d_i−1} 1/(Y(t_i) − k).    (4.7)

The corresponding variance is then

    Var(Â(t)) = Σ_{t_i≤t} Σ_{k=0}^{d_i−1} 1/(Y(t_i) − k)².    (4.8)
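The inner sum in (4.7) is the hazard increment contributed by d tied events with Y individuals at risk; as a one-line helper (my own, for illustration):

```python
def tied_increment(Y, d):
    """Hazard increment at a time with d tied events and Y at risk, as in (4.7)."""
    return sum(1.0 / (Y - k) for k in range(d))
```

For the tie at time 47 in the example of section 4.3.5 (d = 2, Y = 9) this gives 1/9 + 1/8 ≈ 0.236, the "0.111 + 0.125" entry of Table 4.2.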

Obviously this is somewhat crude; there is no mathematical theory behind it. In order to make a precise theory out of it, we need a model for how the ties arise. In the extreme case, we might view the ties as being "real ties", leading to a slightly different estimator — described in section 3.1.3 of [ABG08] — but one that will in practice differ only very slightly.

4.3.5 Simulated example with ties

Suppose that we now have 10 observations in the data set, with failure times as follows:

    21, 47, 47, 58+, 71, 71+, 125, 143+, 143+, 143+    (4.9)

The tie between the event at 71 and the censoring at 71 is resolved by assuming the event happens first, leaving the estimators unchanged. The effect of the tie at 47 is to reduce the number of jumps, but our resolution of the tie leaves the estimator unchanged outside of the interval where



the observations have changed. The resulting Nelson–Aalen estimator is computed in Table 4.2, and plotted in Figure 4.1.

Table 4.2: Computations of survival estimates for simulated data set (4.9)

     t_i   Y(t_i)   hazard increment   Â(t_i)   Ŝ(t_i) = e^{−Â(t_i)}
      21     10      0.100              0.100     0.90
      47      9      0.111 + 0.125      0.336     0.71
      71      6      0.167              0.503     0.60
     125      4      0.250              0.753     0.47

[Figure 4.1: Plot of Nelson–Aalen survival estimates from Table 4.1 (black) and Table 4.2 (red); axes are Age (0–150) against Survival (0.0–1.0), with + marking censored observations.]


Lecture 5

Variation and confidence intervals for non-parametric estimators

5.1 Variation processes

5.1.1 Intuitive definitions

One of the nice properties of independent sums is that the variance of the sum is the sum of the variances, since the covariances are all 0. The same is true for martingale sums, as long as we define variance to mean the conditional variance given the past. We define the predictable variation process ⟨M⟩(t) to be an increasing process that accumulates all of the conditional variance up to time t:

    d⟨M⟩(t) = Var(dM(t) | F_{t−}).

It is random because it sums variances conditional on the developments up to time t. The optional variation process is the sum of the actual squared changes in the process: if M has jumps of size Y_i at points T_i, then

    [M](t) = Σ_{T_i≤t} Y_i².

(This is in our special setting where the stochastic processes have only jumps and differentiable pieces.)¹

¹If M is differentiable over the interval [s, t], and we break it up into K equal pieces t_i = s + (t − s)i/K, then Σ (M(t_{i+1}) − M(t_i))² → 0 as K → ∞, so the only contribution to the




5.1.2 Formal definitions

We may divide up the interval [0, t] into n equal subintervals [t_i, t_{i+1}) with t_i = it/n, and treat M(t_i) like a discrete-time process. (We won't be using discrete-time martingales, but they are defined in the obvious way, and discussed in section 2.1 of [ABG08].) Letting ∆M_i = M(t_i) − M(t_{i−1}), we see that E[∆M_i | F_{t_{i−1}}] = 0, and we define

    ⟨M⟩(t) = lim_{n→∞} Σ_{i=1}^n Var(∆M_i | F_{t_{i−1}}),

and

    [M](t) = lim_{n→∞} Σ_{i=1}^n (∆M_i)².

Of course, we need a bit of mathematical work — which we will skip — to show that these limits always exist for the sorts of processes we are concerned with here.

5.1.3 Useful facts about variation processes

Fact 5.1. If M is a martingale with M(0) = 0, then

    M² − ⟨M⟩ is a mean-zero martingale;
    M² − [M] is a mean-zero martingale.

In particular,

    Var(M(t)) = E[M(t)²] = E[⟨M⟩(t)] = E[[M](t)].    (5.1)

Fact 5.2. If X is a predictable stochastic process, and M = N − Λ a counting-process martingale, then

    M̃(t) := ∫₀^t X(s)dM(s)

is a martingale whose variation processes are

    ⟨M̃⟩(t) = ∫₀^t X(s)²dΛ(s) = ∫₀^t X(s)²λ(s)ds,    (5.2)

    [M̃](t) = ∫₀^t X(s)²dN(s).    (5.3)

In other words, the predictable variation is driven by the continuous part of M, while the optional variation is driven by the random jumps.

optional variation comes from the jumps. It's different when M is a continuous Markov process such as Brownian motion, but we won't consider that here.
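For the compensated Poisson process M = N − λt (taking X ≡ 1 in Fact 5.2), ⟨M⟩(t) = λt and [M](t) = N(t), so (5.1) says Var(M(t)) = λt = E[N(t)]. A quick check of this — my own sketch, with arbitrary parameters:

```python
import random

def variation_check(lam=2.0, t=3.0, n_paths=50_000, seed=0):
    """Estimate Var(M(t)) and E[[M](t)] = E[N(t)] for M(t) = N(t) - lam*t."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_paths):
        n, clock = 0, 0.0
        while True:
            clock += rng.expovariate(lam)   # exponential inter-arrival gaps
            if clock > t:
                break
            n += 1
        counts.append(n)
    mean_n = sum(counts) / n_paths                               # estimates E[N(t)]
    var_m = sum((c - lam * t) ** 2 for c in counts) / n_paths    # Var(M(t)), since E[M(t)] = 0
    return mean_n, var_m
```

With λ = 2 and t = 3, both quantities should be near λt = 6, illustrating that the predictable and optional variation processes have the same expectation.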



5.1.4 Caveats (not examinable)

We are calling these “facts” rather than “theorems” because

(i). We are not proving them.

(ii). They’re not technically true. But they’re true enough for our purposes.

The problem is that M could be a martingale, but M(t)² might not in general even have a finite expectation. Similarly, the claim made in (4.4) did not impose any assumptions on the integrand process X. It would clearly work if X were bounded, but we need to allow for integrals like ∫M(s−)dM(s). Even if M is derived from a counting process with bounded intensity, it won't be bounded. It will be locally bounded, though, by which we mean something like: M is bounded on finite intervals with high probability. The result of the integration is a local square-integrable martingale, which means that it essentially satisfies the conditions for a martingale, including having variation processes that satisfy the formulas above, except for small probabilities that it could be much larger than expected on small intervals.

For the sorts of examples we will be considering here, the statements are true; and they can be proved in significantly more generality if we expand our definitions by replacing martingales with local martingales. The proofs, together with the definitions of local martingales and related concepts, are not difficult, but they would take an extra couple of lectures. If you're interested, there is a straightforward treatment — in the context of counting processes — in chapter 2 of [FH91].

5.2 Examples

5.2.1 Independent sums

Let ξ₁, ξ₂, … be independent normal random variables with mean 0 and Var(ξ_i) = σ_i², and let

    M(t) := Σ_{i≤t} ξ_i.

Thus M(t) is a right-continuous process that makes jumps precisely at positive-integer times. F_t includes the values of all jumps that happened up to and including time t, and of course any function of a combination of these. When t is a positive integer, F_{t−} does not include the value of ξ_t; otherwise, F_{t−} is identical with F_t.


For any s < t,

E[M(t) | Fs] = E[M(t) − M(s) | Fs] + E[M(s) | Fs]
             = E[∑s<i≤t ξi | Fs] + M(s)
             = ∑s<i≤t E[ξi | Fs] + M(s)
             = M(s).

The last equality follows from the fact that ξi is independent of Fs for i > s, so that E[ξi | Fs] = E[ξi] = 0.

The predictable variation is flat except at positive integers t. At those times it increments by

d〈M〉(t) = Var(dM(t) | Ft−) = Var(ξt | Ft−) = σt²,

since ξt is independent of Ft−, with variance σt². Thus

〈M〉(t) = ∑i≤t σi².

Note that the predictable variation is deterministic, because the increments are independent of the past. The optional variation process is also flat away from positive integers, but the jumps are the squares of the jumps in M, so

[M](t) = ∑i≤t ξi².
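The two variation processes are easy to check by simulation. The following sketch (in Python with NumPy, not part of the notes; the σi values are arbitrary illustrations) builds one path of M together with its deterministic predictable variation and its random optional variation, and then confirms across many replications that Var(M(n)) matches 〈M〉(n).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20                                 # integer jump times 1, ..., n
sigma = np.linspace(0.5, 2.0, n)       # illustrative standard deviations sigma_i

# One realisation of M at integer times: M(t) = sum_{i <= t} xi_i
xi = rng.normal(0.0, sigma)
M = np.cumsum(xi)

# Predictable variation <M>(t) = sum_{i <= t} sigma_i^2 : deterministic
pred_var = np.cumsum(sigma ** 2)

# Optional variation [M](t) = sum_{i <= t} xi_i^2 : random
opt_var = np.cumsum(xi ** 2)

# Across many replications, Var(M(n)) should match <M>(n).
reps = rng.normal(0.0, sigma, size=(100_000, n)).sum(axis=1)
print(reps.var(), pred_var[-1])
```

Note that the optional variation differs from realisation to realisation, while the predictable variation is the same for every path.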

5.2.2 Weighted independent sums

Imagine that a gambler is betting on the outcomes of the random variables ξi in section 5.2.1. This means that at time i she gets to choose, based on everything she has seen so far — formally, this means that the random variable Ci+1 ∈ Fi — to bet an amount Ci+1 (bounded by some fixed C), which will return Ci+1 ξi+1 at time i + 1. Her fortune at time t (relative to her initial fortune, which is set to 0) is

M(t) := ∑i≤t Ci ξi.

Thus M(t) is a right-continuous process that makes jumps precisely at positive-integer times. Ft includes the values of all jumps that happened up to and including time t, as well as C⌊t⌋+1.


For any s < t,

E[M(t) | Fs] = E[M(t) − M(s) | Fs] + E[M(s) | Fs]
             = E[∑s<i≤t Ci ξi | Fs] + M(s)
             = ∑s<i≤t E[Ci ξi | Fs] + M(s)
             = M(s).

The last equality follows from the fact that Fs ⊂ Fi−1 (because s < i, and there is no new information between time i − 1 and time i), so that

E[Ci ξi | Fs] = E[ E[Ci ξi | Fi−1] | Fs ]
              = E[ Ci E[ξi | Fi−1] | Fs ]     since Ci ∈ Fi−1
              = E[ Ci · 0 | Fs ]
              = 0,

since ξi is independent of Fi−1, so that E[ξi | Fi−1] = E[ξi] = 0. (This is just a formal way of saying that at time i the random variable Ci is like a constant, so Ci ξi is a constant times ξi, with expectation 0.)

The predictable variation is flat except at positive integers t. At those times it increments by

d〈M〉(t) = Var(dM(t) | Ft−) = Var(Ct ξt | Ft−).

Since E[Ct ξt | Ft−] = 0,

Var(Ct ξt | Ft−) = E[(Ct ξt)² | Ft−]
                 = Ct² E[ξt² | Ft−]     since Ct² ∈ Ft−
                 = Ct² σt².

So

〈M〉(t) = ∑i≤t Ci² σi².

Note that this is random, because Ci is random. But it is “predictable”, in the sense that 〈M〉(t) ∈ Ft− — that is, it is known before time t.

The optional variation process is also flat away from positive integers, but the jumps are the squares of the jumps in M, so

[M](t) = ∑i≤t Ci² ξi².


5.2.3 Compensated homogeneous Poisson process

Let N(t) be a homogeneous Poisson counting process with intensity λ, and

M(t) := N(t)− λt.

We already know that M is a martingale. The predictable variation is no longer piecewise flat. Instead, regardless of the past, dM(t) is 1 − λ dt with probability λ dt, and −λ dt otherwise, making

d〈M〉(t) = Var(dM(t) | Ft−) = λ dt + O(dt²).

Thus 〈M〉(t) = λt. The optional variation process is just [M](t) = N(t).

5.2.4 Compensated inhomogeneous Poisson process

Let N(t) be an inhomogeneous Poisson counting process with intensity λ(t) at time t, and

M(t) := N(t) − Λ(t), where Λ(t) = ∫_0^t λ(s) ds.

Now, conditioned on the past, dM(t) is 1 − λ(t)dt with probability λ(t)dt, and −λ(t)dt otherwise, making

d〈M〉(t) = Var(dM(t) | Ft−) = λ(t)dt + O(dt²).

Thus 〈M〉(t) = Λ(t). The optional variation process is again just [M](t) = N(t).
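Both Poisson facts are easy to check empirically; a quick sketch (Python, with an arbitrary rate and horizon, not from the notes) confirms that E[M(t)] ≈ 0 and Var(M(t)) ≈ 〈M〉(t) = λt.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t, reps = 2.0, 5.0, 20_000        # illustrative rate, horizon, replications

# N(t) ~ Poisson(lam * t); the compensated value is M(t) = N(t) - lam * t.
N = rng.poisson(lam * t, size=reps)
M = N - lam * t

# E[M(t)] = 0 and Var(M(t)) = <M>(t) = lam * t.
print(M.mean(), M.var(), lam * t)
```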

5.3 Normal approximation for martingales

Like sums of i.i.d. random variables, martingales are approximately normal. The simplest Central Limit Theorem, taught in Part A Probability, says that if ξ1, ξ2, . . . are i.i.d. random variables with E[ξi] = µ and Var(ξi) = σ², then defining Mn := ∑_{i=1}^n (ξi − µ), we have Mn/(σ√n) →d N(0, 1); that is, Mn standardised to have variance 1 converges to a standard normal distribution.

The same proof extends easily to any sequence of independent random variables with E[ξi] = µi and Var(ξi) = σi². Then, defining Mn := ∑_{i=1}^n (ξi − µi), we have

( ∑_{i=1}^n σi² )^{−1/2} Mn →d N(0, 1) as n → ∞,

as long as ∑_{i=1}^∞ σi² = ∞. (We must also impose some sort of condition on the third moments. It would suffice, for instance, to know that there is a constant C such that E[|ξi|³] ≤ Cσi².) Check that you understand that the i.i.d. Central Limit Theorem is just a special case of this one.

These generalise to martingales, with one important complication: the variance may be random. The quantity corresponding to σi² is the increment to the predictable variation process. What we need is for the predictable variation to be approximately a fixed deterministic function, and for individual jumps to be small.

Suppose that we have a mean-zero martingale M(n)(t) with a parameter n — for instance, the number of subjects — such that the predictable variation converges to the function V(t) as n → ∞:

lim_{n→∞} 〈M(n)〉(t) = V(t) for each t. (5.4)

Suppose, too, that individual jumps are small — for instance, that letting T1, T2, . . . be the times of jumps,

lim_{n→∞} E[ ∑_{Ti≤t} |M(n)(Ti) − M(n)(Ti−)|³ ] = 0 for each t. (5.5)

Then M(n)(t)/√V(t) →d N(0, 1) as n → ∞, for each t. In fact, we have a functional CLT, telling us that the entire random function M(n) converges to the time-changed Brownian motion W(V(·)), where W is Brownian motion. But we won’t need this formal result, though we will occasionally refer to it for intuition.

A version that will be adequate for our purposes is the following: Suppose M(n)(t) = N(n)(t) − Λ(n)(t) is a counting-process martingale, and H(n)(t) is a predictable process. Then M̃(n)(t) := ∫_0^t H(n)(s) dM(n)(s) is another martingale, and

Theorem 5.3. Suppose there is a function v (and V(t) = ∫_0^t v(s) ds) such that

H(n)(s)² λ(n)(s) →P v(s) as n → ∞, and (5.6)

H(n)(s) →P 0 as n → ∞. (5.7)

Then M̃(n) converges in distribution to the stochastic process W(V(t)), where W is Brownian motion. In particular, M̃(n)(t) converges to a normal distribution with mean 0 and variance V(t).
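A minimal numerical check of the theorem (a Python sketch with illustrative values, not from the notes): merge n independent rate-λ Poisson processes, so λ(n) = nλ, and take H(n) = 1/√n, a constant predictable process. Then H(n)² λ(n) = λ (so v ≡ λ and V(t) = λt) and H(n) → 0, and the rescaled compensated count should have variance close to V(t).

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t, reps = 1.0, 2.0, 50_000        # illustrative values; V(t) = lam * t = 2

for n in (10, 1000):
    # N^(n)(t) ~ Poisson(n * lam * t); H^(n) = 1 / sqrt(n) is constant,
    # so the integral of H dM is just (N^(n)(t) - n * lam * t) / sqrt(n).
    Mn = (rng.poisson(n * lam * t, size=reps) - n * lam * t) / np.sqrt(n)
    print(n, Mn.mean(), Mn.var())      # variance approaches V(t) = lam * t
```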


The specific application of these definitions to point processes, to define survival distribution confidence intervals, may be found in sections 5.4 and 6.2.4. More details may be found in section 2.3 of [ABG08].

5.4 Pointwise confidence intervals for the Nelson–Aalen estimator

From (5.3) we see immediately that

[Â − A](t) = ∫_0^t (1/Y(s)²) dN(s).

Using formula 5.1,

σ̂²(t) := ∫_0^t (1/Y(s)²) dN(s) = ∑_{ti≤t} 1/Y(ti)², (5.8)

is an unbiased estimator for the variance of Â(t) for each t. Note that the variance will be a sum of O(n) random terms, each of which is on the order of 1/n², so the variance will be like constant/n for large n. In particular, the variance goes to 0, and the estimator is consistent.

Now we apply Theorem 5.3 to show that Â(t) is approximately normal. Define

M̃(n)(t) := √n (Â(t) − A(t)) = ∫_0^t H(n)(s) dM(n)(s),

where H(n)(s) = √n / Y(n)(s) and M(n) is a counting-process martingale with intensity λ(n)(t) = Y(n)(t) α(t) (corresponding to n individuals). Let y(s) := P{Yi(s) = 1}. By the Weak Law of Large Numbers

Y(n)(t)/n →P y(t) as n → ∞, for each t,

which implies that H(n)(t) →P 0 and

H(n)(t)² λ(n)(t) = α(t) / (Y(n)(t)/n) →P α(t)/y(t).

So both conditions are satisfied, and we may conclude that Â(t) is approximately normal with mean A(t) and variance v(t)/n, where v(t) = ∫_0^t (α(s)/y(s)) ds. Of course, we cannot compute this, because we don’t know α or y, but we have the estimator (5.8) for the variance.

Thus, we may write an approximate (1 − α)100% confidence interval for A(t) as

∑_{ti≤t} 1/Y(ti) ± z_{1−α/2} ( ∑_{ti≤t} 1/Y(ti)² )^{1/2}. (5.9)

When the data include ties, we treat the ties as though they were distinct events that have been rounded or crudely reported. Thus, we associate to the Nelson–Aalen estimator

Â(t) = ∑_{ti≤t} ∑_{k=0}^{di−1} 1/(Y(ti) − k)

the variance

Var(Â(t)) ≈ σ̂²(t) = ∑_{ti≤t} ∑_{k=0}^{di−1} 1/(Y(ti) − k)². (5.10)
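The estimator and its variance can be coded directly from (5.9) and (5.10). A Python sketch (the helper name nelson_aalen is ours; the data are the ten simulated observations used in section 6.3, with events processed before censorings at tied times):

```python
def nelson_aalen(times, events, t):
    """Nelson-Aalen estimate A(t) and variance estimate sigma^2(t),
    treating tied events as distinct events rounded to the same time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    A = var = 0.0
    i = 0
    while i < len(data):
        ti = data[i][0]
        d = sum(1 for s, e in data if s == ti and e == 1)  # events at ti
        c = sum(1 for s, e in data if s == ti and e == 0)  # censorings at ti
        if ti > t:
            break
        for k in range(d):
            A += 1.0 / (at_risk - k)
            var += 1.0 / (at_risk - k) ** 2
        at_risk -= d + c
        i += d + c
    return A, var

# The ten simulated observations: 21, 47, 47, 58+, 71, 71+, 125, 143+ x 3
times = [21, 47, 47, 58, 71, 71, 125, 143, 143, 143]
events = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
print(nelson_aalen(times, events, 100))   # approximately (0.503, 0.0657)
```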

5.4.1 Simulated data

We refer back to the data in section 4.3.3. Suppose we want a 95% confidence interval for A(100). Following (5.9) we have

σ̂²(100) = ∑_{ti≤100} 1/Y(ti)² = 1/10² + 1/9² + 1/8² + 1/6² = 0.0657.

Thus we have the 95% confidence interval for A(100)

0.503 ± 1.96 √0.0657 = (0, 1.01).

The corresponding confidence interval for the survival function S(100) is (e^{−1.01}, e^0) = (0.364, 1). When there are ties, as in section 4.3.5, the variance and the confidence interval for A(100) remain the same; there would only be a change if we sought to estimate A(t) for some t between 47 and 52.

Of course, the lower bound of 0 is completely useless. This reflects the fact that σ̂² is only a good estimate for the variance of Â(t) when the number at risk remains reasonably large. Once the number at risk gets below about ten, we get to a situation where the error in estimating A, the error in estimating σ(t), and A(t) itself, get to be about the same size, at which point the estimates are not very useful. (If we went further, the formula would give a lower bound on the confidence interval for A(100) that is negative.) In reality, it doesn’t make sense to use these techniques (or, really, any others) when the data are so thin. The only purpose of such examples is to see how these methods work, in a setting where the calculations can still be done more or less by hand. In the next section we will show how they look when we have larger sample sizes, which can be processed only by computer. In section 6.3 we look at how to use the survival package in R to do these computations.

5.4.2 More simulated data

Consider a large number n of individuals under observation with hazard rate α(t) = t at time t. Suppose they are randomly right-censored with constant rate 1, starting from time t = 1. The probability of an individual still being at risk at time t is

y(t) = P{T > t} P{C > t} = e^{−t²/2} if t ≤ 1, and e^{−t²/2 − t + 1} if t > 1.

Thus, by the calculations of section 5.4, the asymptotic variance at time t is

v(t) = ∫_0^t (α(s)/y(s)) ds = ∫_0^t s e^{s²/2 + (s−1)⁺} ds.
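This setup is easy to simulate by inverse-transform sampling through the cumulative hazard: if E ~ Exp(1), then T = √(2E) has cumulative hazard t²/2, and C = 1 + Exp(1) gives censoring at rate 1 starting from time 1. A Python sketch (n and the evaluation point are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Event times with hazard alpha(t) = t, i.e. S(t) = exp(-t^2 / 2):
T = np.sqrt(2 * rng.exponential(size=n))
# Censoring at constant rate 1 starting from time 1:
C = 1 + rng.exponential(size=n)

obs = np.minimum(T, C)
event = T <= C

# Nelson-Aalen estimate of A(1) = 1/2 (event times are a.s. distinct here).
order = np.argsort(obs)
at_risk = n - np.arange(n)         # number at risk just before each ordered time
use = (obs[order] <= 1.0) & event[order]
A_hat = np.sum(1.0 / at_risk[use])
print(A_hat)                       # close to the true value 0.5
```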

In Figure 5.1(b) we show a single realisation of this simulation with n = 100. The black curve is the Nelson–Aalen estimate for the cumulative hazard; the red curve is the true cumulative hazard A(t) = t²/2. The dashed curves show pointwise upper and lower 95% confidence limits, computed from (5.8). This particular realisation had 75 observed event times and 25 censored times, shown in Figure 5.1(a). In Figure 5.1(c) we show 20 additional realisations of the Nelson–Aalen estimator, based on 20 new simulations of the data, together with the original confidence limits for the first simulation. We see that the estimates mostly stay within the confidence limits, but occasionally go outside.


Figure 5.1: Nelson–Aalen estimator for 100 simulated individuals with hazard rate t and constant censoring rate 1 after time 1. (a) Event times; (b) single Nelson–Aalen estimator with 95% CI; (c) 20 different simulated Nelson–Aalen estimators.

In Figure 5.2 we examine our theoretical variances. The red curve is the known cumulative hazard; the green and blue dashed curves represent ±√v(t)/10 and ±√v(t)/5 — that is, ±1 or 2 SDs, since n = 100. At times t = i/10 for i = 1, . . . , 20 we plot Nelson–Aalen estimators from 100 independent samples. The number of estimates that fall outside these ranges is approximately what we would have expected. In Figure 5.3 we look at just 20 simulated populations, but we connect up the estimators for different times corresponding to the same simulation, showing how many of the estimators leave the confidence intervals at some time.

Figure 5.2: 100 Nelson–Aalen estimators, for 100 individuals each, at specific times, together with theoretical confidence intervals (observed cumulative hazard, true cumulative hazard, 68% and 95% confidence bands).

In Figures 5.4(a) and 5.4(b) we examine our data-derived estimates for the variance. We simulate the survival process 100 times, and compute the estimated variance σ̂²(t) at times t = i/10, i = 1, 2, . . . , 20. These are plotted, together with the known variance v(t)/n. We see that the red curve does lie approximately in the middle of each column of estimated variances, consistent with the fact that σ̂²(t) is an unbiased estimator for the variance. On the other hand, the errors are not inconsequential. A more careful analysis would use a Student-like distribution instead of the normal for the confidence intervals, to allow for the uncertainty in the variance. That is not entirely straightforward, though, and we will ignore this complication.

Figure 5.3: 20 Nelson–Aalen estimators, for 100 individuals each, at specific times, joined up by simulation, together with theoretical confidence intervals.

Figures 5.5(a) through 5.5(d) show the distributions of 1000 cumulative hazard estimates at t = 0.5 and t = 1.5 respectively, as histograms and as normal Q–Q plots. We see that the distributions are approximately normal, but somewhat right-skewed.


Figure 5.4: 100 variance estimates at different time-points; the red curve shows the true variance. (a) Full time interval; (b) blow-up of the low-variance interval.


Figure 5.5: Distribution of cumulative hazard estimates at times t = 0.5 and t = 1.5, based on 1000 simulated populations. Red lines show the true variance. (a) Histogram of cumulative hazard at time 0.5; (b) normal Q–Q plot at t = 0.5; (c) histogram of cumulative hazard at time 1.5; (d) normal Q–Q plot at t = 1.5.


Lecture 6

Nonparametric estimation, continued

6.1 The nobody-left problem

In most survival schemes that we consider, there is always at least a small probability that Y(t) will be 0. As discussed at various points in chapter 3 of [ABG08], the solution is to understand that we are “really” estimating not A(t), but rather a quantity

A*(t) := ∫_0^t 1{Y(u)>0} dA(u). (6.1)

With this correction, the mathematics we have done so far becomes rigorously correct. Of course, this adds some confusion because the thing we are estimating is not actually a deterministic quantity, but is itself a random variable. What you should keep in mind is:

(i). E[A*(t) − Â(t)] = 0. In that sense, Â is an “unbiased estimator” of A*.

(ii). A*(t) ≤ A(t), and it will eventually be strictly less. Thus Â is, formally, a biased estimator for A.

(iii). A*(t) and A(t) are genuinely the same, as long as the population hasn’t run out.

(iv). But we do often run survival experiments until there is no one left. After that point, we see the flat hazard rate of Â, corresponding to the flat hazard rate of A*.


Non-parametric estimation II 56

The crucial point is the last one: what happens to the Nelson–Aalen estimator at the end is not reflective of any truth about the true hazard rate.

6.2 The Kaplan–Meier estimator

6.2.1 Deriving the Kaplan–Meier estimator

The Nelson–Aalen estimator arises naturally from our mathematical framework, but it is not the most commonly used nonparametric survival estimator. We consider first the case when the event times are all distinct and possibly right-censored, and recapitulate the informal derivation of the Nelson–Aalen estimator.

Let t1 < t2 < · · · < tm be the (ordered) times at which an event is observed. If there are n individuals under observation, then of course m ≤ n.

Split up the time period under observation into equal subintervals of width ε > 0, each small enough that the probability of two events in any subinterval is negligible. Instead of estimating the increments to the hazard, we think of estimating the increments to the survival function S(t) = e^{−A(t)} = P{T > t}. The probability that an individual who has survived to time (k − 1)ε has its event before kε is 1 − S(kε)/S((k − 1)ε). If no event occurs in the interval [(k − 1)ε, kε) (though possibly a censoring time), it is natural to estimate 1 − S(kε)/S((k − 1)ε) ≈ 0; if there is a single event ti — out of Y((k − 1)ε) ≈ Y(ti) individuals at risk — it is natural to estimate 1 − S(kε)/S((k − 1)ε) ≈ 1/Y(ti). This leads us to the Kaplan–Meier estimator

Ŝ(t) = ∏_{ti≤t} (1 − 1/Y(ti)). (6.2)

The fraction 1 − S(t)/S(t−) is sometimes referred to as the discrete hazard at time t. It is the probability of an individual alive up to time t having its event at time t, where t is a time of discontinuity in the survival function S. Thus, the Kaplan–Meier estimator is the cumulative product up to time t of one minus the empirically estimated discrete hazards.

If there are ties, then we follow the same strategy as in section 4.3.4 for the Nelson–Aalen estimator. We would then be multiplying together terms


of the form

∏_{k=0}^{di−1} (1 − 1/(Y(ti) − k)) = ((Y(ti) − 1)/Y(ti)) · ((Y(ti) − 2)/(Y(ti) − 1)) · · · ((Y(ti) − di)/(Y(ti) − di + 1)) = 1 − di/Y(ti).

So we have the Kaplan–Meier estimator with ties as

Ŝ(t) = ∏_{ti≤t} (1 − di/Y(ti)). (6.3)
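A direct transcription of (6.3) as code (a Python sketch; the helper name kaplan_meier is ours, and events are processed before censorings at tied times):

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate (6.3): product over event times t_i <= t
    of (1 - d_i / Y(t_i))."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    S = 1.0
    i = 0
    while i < len(data):
        ti = data[i][0]
        d = sum(1 for s, e in data if s == ti and e == 1)  # events at ti
        c = sum(1 for s, e in data if s == ti and e == 0)  # censorings at ti
        if ti <= t and d > 0:
            S *= 1.0 - d / at_risk
        at_risk -= d + c
        i += d + c
    return S

# The tiny simulated data set used in section 6.3:
times = [21, 47, 47, 58, 71, 71, 125, 143, 143, 143]
events = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
print(kaplan_meier(times, events, 100))   # 0.5833..., the "0.583" in Figure 6.2
```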

6.2.2 The relation between Nelson–Aalen and Kaplan–Meier

If you have one clock you know what time it is. If you have two clocks you are never sure.

— Wisdom whose origins are lost to human memory

We now have two estimators for survival, each of which has a cogent argument in its favour. Which one is “right”?

In sections 3.2.4 and 3.2.6 of [ABG08] there is an interesting discussion of the relationship between the Nelson–Aalen and Kaplan–Meier estimators. There heavy use is made of the “product-integral” notation, a sort of multiplicative version of the integral, represented by a curvy product symbol that requires advanced computing skills even to print in LaTeX.

It’s interesting, and worth looking at (not examinable!), but it depends on a non-obvious redefinition of the cumulative hazard in order to make the Nelson–Aalen estimator precisely the cumulative hazard corresponding to the Kaplan–Meier estimator. In fact, it’s not completely clear what we should mean by a cumulative hazard with jumps — what exactly is the “hazard” here that is being accumulated? — but the most natural definition is that the cumulative hazard is A(t) = − log S(t). [ABG08] defines the cumulative hazard instead as the integral of −dS(t)/S(t−), which agrees with the above definition when A is differentiable, but not when it has jumps.

Trying to show that these estimators are, in some sense, the same distracts from the main point, which is simply that there is no exclusive criterion for a “best” estimator. Some of the criteria we need to balance against each other are

• Ease of computing from the data;


• Minimising bias;

• Minimising error — e.g., MSE;

• Consistency — error is asymptotically 0;

• Asymptotic normality, or more generally, ability to estimate the distribution of errors.

Minimising bias (or seeking unbiased estimators) sounds good — who wants more bias? — but is simply not a very important criterion. In any case, there is no way to make a cumulative hazard estimator that is also unbiased as an estimator of survival. The Kaplan–Meier estimator is typically unbiased for the survival function; the Nelson–Aalen estimator is unbiased for the cumulative hazard.

So what is the difference between Nelson–Aalen and Kaplan–Meier? If we accept the definition of cumulative hazard A(t) = − log S(t), the change in cumulative hazard at a jump-point Ti of the Kaplan–Meier estimator Ŝ is − log(Ŝ(Ti)/Ŝ(Ti−)); the Nelson–Aalen estimator, on the other hand, changes by 1 − Ŝ(Ti)/Ŝ(Ti−) — the estimated discrete hazard at time Ti. Of course, these will be very similar if Ŝ(Ti)/Ŝ(Ti−) is close to 1. The differences will be on the order of 1/Y(Ti)², and cumulatively on the order of the reciprocal of the smallest number of individuals ever at risk. We note for the future the relation

dA(t) = − dS(t)/S(t−). (6.4)

To summarise:

(i). The Nelson–Aalen estimator is always smaller than − log of the Kaplan–Meier estimator.

(ii). They are not very different, as long as the number of individuals at risk remains large.

(iii). If the number of individuals at risk isn’t large, there’s no good reason to prefer one over the other, although the Kaplan–Meier estimator is more natural as an estimator of survival, while the Nelson–Aalen estimator is more natural as an estimator of cumulative hazard.

(iv). The Nelson–Aalen estimator is easier to work with mathematically. . .

(v). . . . but the Kaplan–Meier estimator is substantially better known and more widely used, particularly in medical contexts. It’s also the default in the survival package in R.


We will generally treat the two estimators interchangeably, taking Ŝ(t) = e^{−Â(t)} as an estimator for survival, and − log Ŝ(t) (where Ŝ is the Kaplan–Meier estimator) as an estimator for cumulative hazard, using whichever seems more convenient.
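Point (i) above is easy to see numerically: since x < − log(1 − x) for 0 < x < 1, each Nelson–Aalen increment is smaller than the corresponding increment of − log of the Kaplan–Meier estimator. A Python sketch using the at-risk and event counts of the tiny simulated data set (the tie at 47 treated as rounded distinct events, as in the notes):

```python
import math

# At-risk counts Y(t_i) and event counts d_i at the event times 21, 47, 71, 125
# of the tiny simulated data set:
Y = [10, 9, 6, 4]
d = [1, 2, 1, 1]

# Nelson-Aalen, ties treated as rounded distinct events:
A_na = sum(sum(1.0 / (y - k) for k in range(di)) for y, di in zip(Y, d))
# Minus log of the Kaplan-Meier estimator (6.3):
A_km = -sum(math.log(1.0 - di / y) for y, di in zip(Y, d))

print(A_na, A_km)   # A_na < A_km
```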

6.2.3 Duhamel’s equation

A useful relation for comparing different survival curves may be found by differentiating the ratio:

d(S1/S2) = (S2 dS1 − S1 dS2)/S2² = (S1/S2)(dS1/S1 − dS2/S2).

As discussed in section 6.2.2, for continuous survival curves −dS(t)/S(t−) is simply the hazard; otherwise, it is the discrete hazard, and we could simply define this as the increment to the cumulative hazard dA. This is the approach taken by [ABG08, section A.1].

We integrate both sides (which is the only way that this equation makes formal sense) to obtain Duhamel’s equation:

S1(t)/S2(t) = 1 + ∫_0^t (S1(s−)/S2(s)) (dS1(s)/S1(s−) − dS2(s)/S2(s−))
            = 1 + ∫_0^t (S1(s−)/S2(s)) (dA2(s) − dA1(s)), (6.5)

where −dS(s)/S(s−) is the fraction of those alive at time s who die at time s (when s is a discontinuity of S).

When S1 and S2 are both continuous (and differentiable) at s this is just calculus. When S1 or S2 has a jump at s there is a calculation to be done, which is left as an exercise.
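The jump case can be checked numerically: for two purely discrete survival curves, the integral in (6.5) becomes a sum over jump times, and the identity is exact. A Python sketch with arbitrary discrete hazards (not from the notes):

```python
# Discrete hazards dA_1(s), dA_2(s) at jump times s = 1, ..., 5 (arbitrary):
h1 = [0.1, 0.2, 0.05, 0.3, 0.15]
h2 = [0.2, 0.1, 0.25, 0.1, 0.05]

S1 = S2 = 1.0
rhs = 1.0
for a, b in zip(h1, h2):
    S1_minus = S1
    S1 *= 1.0 - a              # S_1(s) = S_1(s-)(1 - dA_1(s))
    S2 *= 1.0 - b
    rhs += (S1_minus / S2) * (b - a)   # (S_1(s-)/S_2(s)) (dA_2(s) - dA_1(s))

print(S1 / S2, rhs)            # the two sides of (6.5) agree
```

The key step is that the increment of S1/S2 at each jump is exactly (S1(s−)/S2(s))(dA2(s) − dA1(s)), which is the calculation left as an exercise.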

6.2.4 Confidence intervals for the Kaplan–Meier estimator

We may apply Duhamel’s equation to Ŝ/S*, where Ŝ is the Kaplan–Meier estimator, S is the true survival function, and S* is the survival function that is decremented only when Y(t) > 0:

S*(t) := 1 + ∫_0^t 1{Y(u)>0} dS(u). (6.6)


As we will see, it is slightly more technically important to distinguish between S(t) and S*(t) when analysing the Kaplan–Meier estimator.

Then S*(t) > 0 as well, and

Ŝ(t)/S*(t) = 1 + ∫_0^t (Ŝ(s−)/S*(s)) (dŜ(s)/Ŝ(s−) − dS*(s)/S*(s−)).

We are assuming that our underlying survival function S is continuous, so −dS*(s)/S*(s−) = dA*(s); and we already pointed out in (6.4) that −dŜ(s)/Ŝ(s−) = dÂ(s), where Â is the Nelson–Aalen estimator. This gives us

Ŝ(t)/S*(t) − 1 = ∫_0^t (Ŝ(s−)/S*(s)) d(A* − Â)(s). (6.7)

Thus Ŝ(t)/S*(t) − 1 is a mean-zero martingale, and E[Ŝ(t)/S*(t)] = 1. Note that S* is itself random, since it depends on the random variable 1{Y(t)>0}.

For t small enough that Y(t) > 0 with high probability, we may conclude that

E[Ŝ(t)] ≈ S(t),

so that Ŝ(t) is almost unbiased. As with the Nelson–Aalen estimator, Ŝ(t) obviously becomes biased (when considered as an estimator for S(t)) — it overestimates survival — when t becomes large enough for Y(t) = 0.

Estimating the variance is not quite as straightforward as for the Nelson–Aalen estimator. If n is moderately large then we can approximate S*(t) ≈ S(t) and Ŝ(s−)/S*(s) ≈ 1, so

Ŝ(t)/S(t) − 1 ≈ A(t) − Â(t),

or

Var(Ŝ(t)) ≈ S(t)² Var(Â(t)).

Applying (5.8), the variance of the Kaplan–Meier estimator may be estimated by using

τ̂²(t) := Ŝ(t)² ∑_{ti≤t} 1/Y(ti)². (6.8)

Similarly, when there are ties, the estimator is

τ̂²(t) := Ŝ(t)² ∑_{ti≤t} ∑_{k=0}^{di−1} 1/(Y(ti) − k)². (6.9)


An alternative commonly used variance estimator is Greenwood’s formula, which takes the form

τ̂²(t) := Ŝ(t)² ∑_{ti≤t} 1/(Y(ti)(Y(ti) − 1)) (6.10)

when there are no ties. In the case of no censoring, this reduces to Ŝ(t)(1 − Ŝ(t))/n. Obviously these two estimators are asymptotically (as n → ∞) identical.

When there are ties, the Greenwood formula is

τ̂²(t) := Ŝ(t)² ∑_{ti≤t} di/(Y(ti)(Y(ti) − di)), (6.11)

where, again, di is the number of events recorded at time ti.
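For the tiny data set (with the tie at 47 counted as di = 2), the two variance estimators (6.9) and (6.11) at t = 100 can be compared directly; a Python sketch:

```python
# Kaplan-Meier estimate at t = 100 for the tiny data set, with d_i events
# out of Y(t_i) at risk at the event times 21, 47, 71:
Y = [10, 9, 6]
d = [1, 2, 1]
S_km = 1.0
for y, di in zip(Y, d):
    S_km *= 1.0 - di / y       # (1 - 1/10)(1 - 2/9)(1 - 1/6) = 0.5833...

# Greenwood's formula (6.11):
greenwood = S_km ** 2 * sum(di / (y * (y - di)) for y, di in zip(Y, d))
# The martingale-based estimator (6.9):
martingale = S_km ** 2 * sum(sum(1.0 / (y - k) ** 2 for k in range(di))
                             for y, di in zip(Y, d))
print(greenwood, martingale)   # similar, but not identical
```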

6.3 Computing survival estimators in R

The main package for doing survival analysis in R is survival. Once the package is installed on your computer, you include library(survival) at the start of your R code. This works with “survival objects”, which are created by the Surv command with the following syntax: Surv(time, event) or Surv(time, time2, event, type).

6.3.1 Survival objects with only right-censoring

We begin by discussing the first version, which may be applied to right-censored (or uncensored) survival data. The individual times (whether censoring or events) are entered as the vector time. The vector event (of the same length) has values 0 or 1, depending on whether the time is a censoring time or an event time respectively. These alternatives may also be labelled 1 and 2, or FALSE and TRUE.

For an example, we turn our tiny simulated data set

21, 47, 47, 58+, 71, 71+, 125, 143+, 143+, 143+

into a survival object with

sim.surv = Surv(c(21, 47, 47, 58, 71, 71, 125, 143, 143, 143), c(1, 1, 1, 0, 1, 0, 1, 0, 0, 0)).


Fitting models is done with the survfit command. This is designed for comparing distributions, so we need to put in some sort of covariate. Then we can write

sim.fit = survfit(sim.surv ~ 1, conf.int = .99)

and then plot(sim.fit), or

plot(sim.fit, main='Kaplan-Meier for simulated data set',
     xlab='Time', ylab='Survival')

to plot the Kaplan–Meier estimator of the survival function, as in Figure 6.1. The dashed lines are the Greenwood estimator of a 99% confidence interval. (The default for conf.int is 0.95.)

Figure 6.1: Plot of Kaplan–Meier estimates from data in (4.9). Dashed lines are 95% confidence interval from Greenwood’s estimate.

The Nelson–Aalen estimator can also be computed with survfit. The associated survival estimator Ŝ = e^{−Â} is called the Fleming–Harrington estimator, and it may be estimated with fit = survfit(formula, type='fleming-harrington'). The cumulative hazard — minus the log of the survival estimator — may be plotted with plot(fit, fun='cumhaz').

If you want to compute it more directly, you can extract the information in the survfit object. If you want to see what’s inside an R object, you can


use the str command. The output is shown in Figure 6.2.

> str(sim.fit)
List of 13
 $ n        : int 10
 $ time     : num [1:6] 21 47 58 71 125 143
 $ n.risk   : num [1:6] 10 9 7 6 4 3
 $ n.event  : num [1:6] 1 2 0 1 1 0
 $ n.censor : num [1:6] 0 0 1 1 0 3
 $ surv     : num [1:6] 0.9 0.7 0.7 0.583 0.438 ...
 $ type     : chr "right"
 $ std.err  : num [1:6] 0.105 0.207 0.207 0.276 0.399 ...
 $ upper    : num [1:6] 1 1 1 1 0.957 ...
 $ lower    : num [1:6] 0.732 0.467 0.467 0.34 0.2 ...
 $ conf.type: chr "log"
 $ conf.int : num 0.95
 $ call     : language survfit(formula = sim.surv ~ 1, conf.int = 0.95)
 - attr(*, "class")= chr "survfit"

Figure 6.2: Example of the structure of a survfit object.

We can then compute the Nelson–Aalen estimator with a function such as the one in Figure 6.3. This is plotted together with the Kaplan–Meier estimator in Figure 6.4. As you can see, the two estimators are similar, and the Nelson–Aalen survival estimate is always higher than the KM.
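The same computation can be checked outside R. Here is a hedged Python sketch of the Nelson–Aalen estimator using the same tie-handling as the function in Figure 6.3 (d tied events with n at risk contribute 1/n + 1/(n−1) + ... + 1/(n−d+1)); applied to the simulated data it confirms that e^{−Â} dominates the Kaplan–Meier estimate:

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard and variance, with the tie-breaking
    convention of Figure 6.3. Returns {event_time: (A_hat, variance)}."""
    A, var, out = 0.0, 0.0, {}
    for t in sorted(set(times)):
        n = sum(1 for u in times if u >= t)          # at risk just before t
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if d > 0:
            A += sum(1.0 / (n - k) for k in range(d))
            var += sum(1.0 / (n - k) ** 2 for k in range(d))
            out[t] = (A, var)
    return out

na = nelson_aalen([21, 47, 47, 58, 71, 71, 125, 143, 143, 143],
                  [1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
# At t = 21, A = 1/10 = 0.1, and exp(-0.1) = 0.905 exceeds the
# Kaplan-Meier value 0.9; the same holds at every event time.
```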

6.3.2 Other survival objects

Left-censored data are represented with Surv(time, event, type = 'left'). Here event can be 0/1 or 1/2 or TRUE/FALSE for alive/dead, i.e., censored/not censored.

Left truncation is represented with Surv(time, time2, event).


NAest = function(SF) {
  # keep only the times at which events occurred
  times = SF$time[SF$n.event > 0]
  events = SF$n.event[SF$n.event > 0]
  nrisk = SF$n.risk[SF$n.event > 0]
  # hazard increment: 1/Y + 1/(Y-1) + ... for tied events
  increment = sapply(seq(length(nrisk)), function(i)
    sum(1/seq(nrisk[i], nrisk[i] - events[i] + 1)))
  varianceincrement = sapply(seq(length(nrisk)), function(i)
    sum(1/seq(nrisk[i], nrisk[i] - events[i] + 1)^2))
  hazard = cumsum(increment)
  variance = cumsum(varianceincrement)
  list(time = times, Hazard = hazard, Var = variance)
}

Figure 6.3: Function to compute Nelson–Aalen estimator.

event is as before. The type is ’right’ by default.

Interval censoring also takes time and time2, with type = 'interval'. In this case, event can be 0 (right-censored), 1 (event at time), 2 (left-censored), or 3 (interval-censored).



Figure 6.4: Plot of Kaplan–Meier (black) and Nelson–Aalen (red) estimates from data in (4.9). Dashed lines are pointwise 95% confidence intervals.

6.4 Survival to ∞

Let T be a survival time, and define the conditional survival function

S0(t) := P( T > t | T < ∞ );

that is, the probability of surviving to time t given that the event eventually does occur. We have

S0(t) = P( ∞ > T > t ) / P( ∞ > T ).   (6.12)

How can we estimate S0? Nelson–Aalen estimators will never reach ∞ (which would mean 0 survival); Kaplan–Meier estimators will reach 0 if and only if the last individual at risk actually has an observed event. In either case, there is no mathematical principle for distinguishing between the actual survival to ∞ (that is, the probability that the event never occurs) and simply running out of data. Nonetheless, in many cases there can be good reasons for thinking that there is a time t∂ such that the event will never happen if it hasn't happened by that time. In that case we may use the fact that {T < ∞} = {T < t∂} to estimate

Ŝ0(t) = ( Ŝ(t) − Ŝ(t∂) ) / ( 1 − Ŝ(t∂) ).   (6.13)
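The rescaling in (6.13) is a one-liner. A hedged Python sketch (the curve below is invented for illustration, with a plateau like the one in the second-births example):

```python
def condition_on_occurrence(surv):
    """Rescale a survival curve {t: S(t)} by (6.13), treating its final
    plateau min S = S(t_d) as the probability the event never occurs."""
    s_inf = min(surv.values())
    return {t: (s - s_inf) / (1 - s_inf) for t, s in surv.items()}

# Invented curve levelling off at 0.293:
S = {0: 1.0, 1000: 0.6, 2000: 0.4, 3677: 0.293}
S0 = condition_on_occurrence(S)
# S0 now runs from 1.0 down to exactly 0.0 at the last event time.
```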

Example 6.1: Time to next birth

This is an example discussed repeatedly in [ABG08]. It has the advantage of being a large data set, where the asymptotic assumptions may be assumed to hold; it has the corresponding disadvantage that we cannot write down the data or perform calculations by hand.

The data set at http://folk.uio.no/borgan/abg-2008/data/second_births.txt lists, for 53,558 women listed in Norway's birth registry, the time (in days) from first to second birth. (Obviously, many women do not have a second birth, and the observations for these women will be treated as censored.)

In Figure 6.5(a) we show the Kaplan–Meier estimator computed and automatically plotted by the survfit command. Figure 6.5(b) shows a crude estimate for the distribution of time-to-second-birth for those women who actually had a second birth. We see that the last birth time recorded in the registry was 3677, after which time none of the remaining 131 women had a recorded second birth. Thus, the second curve is simply the same as the first curve, rescaled to go between 1 and 0, rather than between 1 and 0.293 as the original curve does.

The code used to generate the plots is in Figure 6.6.


(a) Original Kaplan–Meier curve: "Norwegian birth registry time to second birth"

(b) Kaplan–Meier curve conditioned on second birth occurring: "Time to second birth conditioned on occurrence"

Figure 6.5: Time (in days) between first and second birth from Norwegian registry data.


library('survival')
sb = read.table('second_births.dat', header = TRUE)
attach(sb)
sb.surv = Surv(time, status)
sb.fit1 = survfit(sb.surv ~ rep(1, 53558))
plot(sb.fit1, mark.time = FALSE, xlab = 'Time (days)',
     main = 'Norwegian birth registry time to second birth')
# Condition on last event
cle = function(SF) {
  minsurv = min(SF$surv)
  SF$surv = (SF$surv - minsurv)/(1 - minsurv)
  SF$upper = (SF$upper - minsurv)/(1 - minsurv)
  SF$lower = (SF$lower - minsurv)/(1 - minsurv)
  SF
}
sb.fit2 = cle(sb.fit1)
plot(sb.fit2, mark.time = FALSE, xlab = 'Time (days)',
     main = 'Time to second birth conditioned on occurrence')

Figure 6.6: Code to generate Figure 6.5.


Lecture 7

Comparing distributions: Excess mortality

7.1 Estimating excess mortality: One-sample setting

This section is taken directly from section 3.2.5 of [ABG08].

A common class of models splits the hazard rate into two pieces:

αi(t) = γ(t) + µi(t),   (7.1)

where µi(t) is a known baseline hazard rate associated to individual i, and γ(t) is an unknown increment to the hazard that we seek to estimate. For example, we might be measuring mortality, with µi the known population mortality for individuals of the same age and gender.

We will discuss in Lecture 10 the problem of testing the validity of such a relative survival model, and compare it to the relative mortality model that would have αi(t) = γ(t)µi(t). Here, we concern ourselves with how to estimate γ.

We have the intensity for individual i of

λi(t) = ( γ(t) + µi(t) ) Yi(t).

If we define the average population mortality

µ(t) = ∑_{i=1}^n µi(t) Yi(t) / Y(t),


then the total counting process N has intensity

λ(t) = ( γ(t) + µ(t) ) Y(t).

Now consider the relative survival function R(t) = e^{−Γ(t)}, where Γ(t) = ∫₀ᵗ γ(s) ds. We then have the martingale representation

M(t) := ∫₀ᵗ dN(s)/Y(s) − ∫₀ᵗ ( γ(s) + µ(s) ) ds.

Thus

Γ̂(t) := ∫₀ᵗ dN(s)/Y(s) − ∫₀ᵗ µ(s) ds = Γ(t) + M(t)

is an unbiased estimator for Γ(t). By the same arguments as in section 4.3 we see that Γ̂(t) is approximately normal, with variance estimated by

σ̂²(t) = ∑_{ti ≤ t} 1/Y(ti)².

Define the average survival function

Save(t) := exp{ −∫₀ᵗ µ(s) ds }.   (7.2)

Observe that this is a random function, since it depends on the random composition of the population. Save is the survival probability that would be observed if the population composition were the one observed, but everyone had the survival rates given by the baseline. Then we may estimate the relative survival function by

R̂(t) := Ŝ(t) / Save(t).   (7.3)
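The estimator Γ̂ and its variance are simple sums once the event times, risk sets, and integrated baseline are in hand. A hedged Python sketch (the data below are invented, and the helper name is our own):

```python
def excess_cum_hazard(event_times, at_risk, cum_baseline):
    """Gamma-hat(t) = sum over events of 1/Y(ti), minus the integrated
    average baseline hazard; variance estimate is sum of 1/Y(ti)^2.

    event_times: times of observed events up to t;
    at_risk: {ti: Y(ti)}; cum_baseline: the value of int_0^t mu(s) ds."""
    gamma = sum(1.0 / at_risk[t] for t in event_times) - cum_baseline
    var = sum(1.0 / at_risk[t] ** 2 for t in event_times)
    return gamma, var

# Invented data: events at times 1, 3, 7 with 50, 48, 45 at risk,
# and a known integrated baseline hazard of 0.04 over the window.
g, v = excess_cum_hazard([1, 3, 7], {1: 50, 3: 48, 7: 45}, 0.04)
# g is positive here, indicating excess mortality beyond the baseline.
```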

Example 7.1: Simulated survival data

In Figure 7.1 we plot the mortality rates (from 1990–2) in England and Wales, for males and females separately. Imagine that we had a population of elderly individuals, of various ages, who were observed over 10 years because of their ongoing exposure to an environmental danger. Suppose, in fact, that this pollutant increases mortality rates uniformly by 0.01 per year. How well can we estimate this effect?

We model individual intensities as ( γ(t) + µ_{si}(ai + t) ) Yi(t), where ai is the initial age of individual i, and si is the sex. We simulate a population that starts with 1000 males and 1000 females, with randomly chosen initial ages between 50 and 80 years, and assume censoring at a rate of 0.02 per year.

The results are plotted in Figure 7.2. The red curve shows the estimate Γ̂, while the black circles show the changing background mortality µ. Note that µ grows much more slowly than the mortality rates for individuals, owing to the shift in the population toward younger individuals and women, as the higher-mortality subgroups die off. For example, we start with equal numbers of men and women, but the remaining population after 10 years is close to 60% female. Note, too, that the estimates for Γ are reasonably good.


Figure 7.1: Age-specific mortality rates in England and Wales, 1990–2.


Figure 7.2: Estimating excess mortality. This combines simulated survival data with the mortality rates presented in Figure 7.1.


7.2 Excess mortality: Two-sample case

Example 7.2: NHANES survival data

Figure 7.3 shows male and female survival curves as a function of age, calculated from 17,427 subjects in the NHANES third wave (1988–94), after up to 12 years of follow-up. The subjects are divided into 3 ethnic categories (white, black, Mexican). Suppose we wish to estimate the difference in survival between males and females in this study population, correcting for differences in the proportions of the different ethnic groups.

Note that because time in this study is measured in calendar age, the data are both left-truncated and right-censored.


Figure 7.3: Estimated survival function from 17,427 subjects in the NHANES 3rd wave.


We have survival data on individuals i, categorised in two different ways: there is a nuisance categorisation ci (which might have several classes) and a binary categorisation of interest Gi, which is 0 or 1. We assume the multiplicative-intensity model, with hazard rates α(ci; t) + Gi γ(t). In the NHANES example, Gi is sex and ci is ethnicity.

We have the intensity for individual i of

λi(t) = ( γ(t) + α(ci; t) ) Yi(t)   if Gi = 1,
λi(t) = α(ci; t) Yi(t)   if Gi = 0.

For a possible category c and G = 0 or 1, define

Y(c, G; t) := ∑_{i : ci = c, Gi = G} Yi(t),   N(c, G; t) := ∑_{i : ci = c, Gi = G} Ni(t).

Define Y(c, −; t) := min{ Y(c, 0; t), Y(c, 1; t) }, and

kc(t) := Y(c, −; t) / ∑_{c′} Y(c′, −; t).   (7.4)

Then

Γ̂(t) = ∑_c ∫₀ᵗ kc(s) ( dN(c, 1; s)/Y(c, 1; s) − dN(c, 0; s)/Y(c, 0; s) )
     = ∑_{ti ≤ t} k_{ci}(ti) ( Gi/Y(ci, 1; ti) − (1 − Gi)/Y(ci, 0; ti) )   (7.5)

is an unbiased estimator for Γ*(t). Assuming no ties, we have

Var( Γ̂(t) ) ≤ E[ ∑_{ti ≤ t} ( ∑_c Y(c, −; ti) )^{−2} ],   (7.6)

which we approximate, as usual, by the realised value

∑_{ti ≤ t} ( ∑_c Y(c, −; ti) )^{−2}.   (7.7)

The derivation of these facts is left as an exercise.
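To make (7.4)–(7.5) concrete, here is a hedged Python sketch (the data structures and the helper name stratified_excess are our own invention): each event record carries its time, nuisance category, and group, and the risk-set sizes Y(c, G; t) are supplied as a lookup.

```python
def stratified_excess(events, Y):
    """Estimator (7.5): sum over events of k_c(t) * (Gi/Y(c,1;t) - (1-Gi)/Y(c,0;t)).

    events: list of (time, category, group) for the observed events;
    Y: dict mapping (category, group, time) -> number at risk just before time.
    """
    cats = sorted({c for _, c, _ in events})
    total = 0.0
    for t, c, g in sorted(events):
        # Y(c,-;t) is the smaller of the two group risk sets, eq. (7.4)
        y_min = {cc: min(Y[(cc, 0, t)], Y[(cc, 1, t)]) for cc in cats}
        k = y_min[c] / sum(y_min.values())
        total += k * (1.0 / Y[(c, 1, t)] if g == 1 else -1.0 / Y[(c, 0, t)])
    return total

# With a single category, k_c = 1 and the estimate reduces to
# sum dN(1)/Y(1) - sum dN(0)/Y(0) over the event times:
Y = {('a', 0, 1): 10, ('a', 1, 1): 10, ('a', 0, 2): 10, ('a', 1, 2): 9}
est = stratified_excess([(1, 'a', 1), (2, 'a', 0)], Y)
```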


7.3 Nonparametric tests for equality: One-sample setting

We have now described methods for estimating the difference between hazard rates in different settings. Once we have an estimator with variance, it is straightforward to turn it into a significance test. In the remainder of this chapter we describe the standard nonparametric tests for equality of survival distributions.

The simplest situation is when we are comparing observations to a given survival distribution (given either theoretically or by a large quantity of prior data). Following the approach in section 7.1 we suppose we have a null hypothesis

H0 : the intensity for individual i is λi(t) = µi(t) Yi(t).

We wish to test this null hypothesis against "non-crossing" alternatives, which may be understood as λi(t) = ( µi(t) + γi(t) ) Yi(t), where the γi(t) are all of the same sign for all times t. As before, we define

µ(t) = ∑_{i=1}^n µi(t) Yi(t) / Y(t).

7.3.1 No ties

If we choose any predictable weight function W(t) such that W(t) = 0 whenever Y(t) = 0 (and adopt the convention 0/0 = 0), then under the null hypothesis

M(t) := ∫₀ᵗ W(s) ( dN(s)/Y(s) − µ(s) ds ) = ∑_{ti ≤ t} W(ti)/Y(ti) − ∫₀ᵗ W(s) µ(s) ds

is a martingale, with mean zero and asymptotically normal.

The change from the estimation setting is that we estimate the variance not from the sample, but from the null hypothesis. Under the null hypothesis the intensity of the counting process is µ(s)Y(s). By Fact 5.2 we can compute the predictable variation as

⟨M⟩(t) = ∫₀ᵗ ( W(s)²/Y(s)² ) µ(s) Y(s) ds = ∫₀ᵗ W(s)² µ(s)/Y(s) ds.   (7.8)

The variance is the expectation of this, which will be difficult or impossible to compute in general, since it depends, among other things, on the probability of an individual being censored or truncated, something that we generally avoid modelling. Instead, we may approximate the expected value by the observed value of this integral.

Then we have the test statistic

Z(t) := ( ∫₀ᵗ W(s)² µ(s)/Y(s) ds )^{−1/2} ( ∑_{ti ≤ t} W(ti)/Y(ti) − ∫₀ᵗ W(s) µ(s) ds ),   (7.9)

which under the null hypothesis should have a standard normal distribution for any fixed t. This may be used for one-sided alternatives (hazard > µi or hazard < µi) or two-sided alternatives (hazard ≠ µi).

7.3.2 Weight functions and particular tests

A popular choice of weight is simply W(t) = Y(t). The resulting test is called the log-rank test. In this special case,

Z(t) = ( ∫₀ᵗ Y(s) µ(s) ds )^{−1/2} ( N(t) − ∫₀ᵗ Y(s) µ(s) ds ).

Note that the variance (which is also identical with the expectation term in parentheses) is equal to the sum of the cumulative hazards of all the individual null-hypothesis hazards, over the times that the individuals were at risk.

Another popular class of weight functions is the Harrington–Fleming family

W_HF(t) := Y(t) S(t)^p (1 − S(t))^q,   (7.10)

for nonnegative p, q, where S(t) is the survival probability under the null hypothesis. When p = q = 0 this is just the log-rank test. Larger values of p and q reduce the effect of deviations early and/or late in the process.

The case p = 1, q = 0 is essentially the popular Peto test (also known as the Peto–Peto test or Petos' test, as two different Petos were involved), except for slight modifications that are intended to improve the small-sample behaviour. (They take the version W(t) := S(t) Y(t)/( Y(t) + 1 ).)

Note that different sources disagree on which part of the test statistic is the "weight"; we are following the convention of [ABG08]. If you look in the book of [KM03] you will find the weights for the log-rank test given as W(t) = 1, as the factor Y(t) is simply absorbed into the weight. We adopt the convention here in order to emphasise the connection between the "base case" and the nonparametric estimators.


7.3.3 With ties

When tied observations are allowed, if we assume that these are merely "rounding ties" then the exact form of the test statistic will depend on how the weight function changes as the population is decremented, and it could actually depend on the order in which the events occur. Partly for this reason, it is conventional to perform significance tests as though the times were genuinely discrete, and the ties are real ties. In the one-sample case, where we have di events occurring at ordered times ti, this is still quite straightforward. We also need the discrete hazard hi := 1 − S(ti)/S(ti−1), which is the probability of an event at time ti, given survival up to time ti.

Then

M(t) := ∑_{ti ≤ t} ( di − hi Y(ti) )

is a martingale that is constant except at the times ti. As usual, we look at the weighted version

M*(t) := ∫₀ᵗ ( W(s)/Y(s) ) dM(s) = ∑_{ti ≤ t} ( W(ti) di/Y(ti) − W(ti) hi ).

The predictable variation is the sum of the conditional variances of the jumps. By assumption, W(ti), Y(ti) ∈ F_{ti−}, so

⟨M*⟩(t) = ∑_{ti ≤ t} ( W(ti)/Y(ti) )² Var( di | Y(ti) ).

In this case, the model is that, conditioned on events up to time ti, the number of events di has binomial distribution with parameters (Y(ti), hi), with variance Y(ti) hi (1 − hi). The predictable variation is thus

⟨M*⟩(t) = ∑_{ti ≤ t} W(ti)² hi (1 − hi) / Y(ti),

and, as usual, we may take this as an unbiased estimator for the variance of M*(t).

7.3.4 An example

Table 7.1 presents imaginary data for men aged 90 to 95. The number at risk increases and decreases, which may reflect either left truncation or that this is a "period table", reflecting different groups of individuals at risk at different ages. Here we use the weights W ≡ 1.

ti   Y(ti)  di   µ(ti)  hi     excess   M*(ti)  ⟨M*⟩(ti)
90   40     10   0.202  0.183   2.684   0.067   0.004
91   35      8   0.215  0.193   1.229   0.102   0.008
92   22      4   0.236  0.210  -0.625   0.074   0.016
93   14      6   0.261  0.230   2.784   0.273   0.028
94   11      4   0.279  0.243   1.322   0.393   0.045
95    7      3   0.291  0.252   1.233   0.569   0.072

Table 7.1: Table of mortality rates for an imaginary old-people's home, with standard British male mortality given as µ(x).

Thus, our test statistic is

Z := M*(95) / √⟨M*⟩(95) = 2.12.

If we are performing a two-tailed hypothesis test at the 0.95 level, we reject values of Z with modulus > 1.96. Thus, we conclude that the mortality rate in this population is significantly different from the rate in the general population.
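The computations in Table 7.1 are easy to reproduce. A hedged Python sketch with W ≡ 1, using the rounded discrete hazards hi from the table (function names are our own):

```python
def one_sample_tied_test(rows):
    """One-sample test with ties and W = 1, as in section 7.3.3:
    M*(t) = sum(d/Y - h), <M*>(t) = sum h(1-h)/Y.
    rows: (Y(ti), di, hi) at each time."""
    m, v = 0.0, 0.0
    for Y, d, h in rows:
        m += d / Y - h
        v += h * (1 - h) / Y
    return m, v

# (Y(ti), di, hi) for ages 90..95, from Table 7.1
rows = [(40, 10, 0.183), (35, 8, 0.193), (22, 4, 0.210),
        (14, 6, 0.230), (11, 4, 0.243), (7, 3, 0.252)]
m, v = one_sample_tied_test(rows)
z = m / v ** 0.5
# Recovers M*(95) = 0.57, <M*>(95) = 0.072, and hence Z = 2.12.
```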


Lecture 8

Excess mortality II: Two-sample setting

8.1 Non-parametric tests for equality: Two-sample setting

8.1.1 No ties

Assume we have two different groups, where the individuals have hazard rates α0(t) and α1(t) when at risk. Let w(t) be any predictable weight function, assumed to be such that w(t) = 0 whenever Y0(t)Y1(t) = 0 (that is, when either of the groups has no one left at risk); we adopt the convention 0/0 = 0. We have for each group g = 0, 1 the Nelson–Aalen estimator

Âg(t) = ∫₀ᵗ 1{Yg > 0} dNg(s)/Yg(s) = ∑_{ti^{(g)} ≤ t} 1/Yg(ti^{(g)}),

where the ti^{(g)} are the times of events belonging to individuals of group g. As before, we write

A*g(t) := ∫₀ᵗ 1{Yg > 0} αg(s) ds.

We know that for g = 0, 1,

Mg(t) = Âg(t) − A*g(t)

is a mean-zero martingale for each g. Thus,

M(t) := ∫₀ᵗ w(s) ( dM1(s) − dM0(s) )


is also a mean-zero martingale. We have

M(t) = ∫₀ᵗ w(s) ( dÂ1(s) − dÂ0(s) ) − ∫₀ᵗ w(s) ( 1{Y1 > 0} α1(s) − 1{Y0 > 0} α0(s) ) ds
     = ∫₀ᵗ w(s) ( dÂ1(s) − dÂ0(s) ) − ∫₀ᵗ w(s) ( α1(s) − α0(s) ) ds.

The last step follows from the assumption that w(s) = 0 whenever either of the indicators is 0. Under the null hypothesis, then,

M(t) = ∫₀ᵗ w(s) ( dÂ1(s) − dÂ0(s) ) = ∑_{tj ≤ t} (−1)^{Gj + 1} w(tj) / Y_{Gj}(tj),

where Gi is the index of the group whose event occurs at time ti.The predictable variation of M is the sum of the predictable variations of

M1 and M0 (since they are independent). Since we have the representation

Mg(t) =

∫ t

0

w(s)

Yg(s)

(dNg(s)− Yg(s)α(s)

),

for the counting process Ng with intensity Ygα, we have by Fact 5.2

〈Mg〉(t) =

∫ t

0

(w(s)

Yg(s)

)2

Yg(s)α(s)ds.

Thus, under the null hypothesis

⟨M⟩(t) = ∫₀ᵗ ( w(s)²/Y0(s) + w(s)²/Y1(s) ) α(s) ds = ∫₀ᵗ ( w(s)² Y·(s) / ( Y0(s) Y1(s) ) ) α(s) ds.   (8.1)

Of course, we generally don't know α; so we replace α(s) ds by the estimator dÂ(s) = dN·(s)/Y·(s) (where the dot represents summation over the group index). Thus we obtain

⟨M⟩(t) ≈ ∫₀ᵗ ( w(s)² Y·(s) / ( Y0(s) Y1(s) ) ) dÂ(s) = ∑_{ti ≤ t} w(ti)² / ( Y0(ti) Y1(ti) ).   (8.2)

We take this as an estimate for the variance:

Var( M(t) ) ≈ ∑_{ti ≤ t} w(ti)² / ( Y0(ti) Y1(ti) ).   (8.3)


8.1.2 Weight functions and particular tests

The weight functions look slightly different in the 2-sample case. The 2-sample log-rank test has weights

w_LR(t) = Y0(t) Y1(t) / Y·(t).   (8.4)

The log-rank test statistic is then

Z(t) = ( ∑_{tj ≤ t} Y0(tj) Y1(tj) / Y·(tj)² )^{−1/2} ( N1(t) − ∑_{tj ≤ t} Y1(tj)/Y·(tj) ).   (8.5)


Other weight functions are defined similarly. The Harrington–Fleming family is

w_HF(t) := Ŝ(t−)^p (1 − Ŝ(t−))^q Y0(t) Y1(t) / Y·(t),   (8.6)

for nonnegative p, q, where Ŝ(t) is an estimator for the survival probability under the null hypothesis. Since the null hypothesis states that the two populations have the same survival distribution, Ŝ is just an estimator for survival (for example, the Kaplan–Meier estimator), where we treat all the individuals as coming from a single population.

When p = q = 0 this is just the log-rank test. Larger values of p and q reduce the effect of deviations early and/or late in the process. The Peto test is essentially the Harrington–Fleming test with parameters (1, 0), except for the small modification to

w_Peto(t) := Ŝ(t−) Y0(t) Y1(t) / ( Y·(t) + 1 ).   (8.7)

Since the log-rank weights are already fairly complicated, [ABG08] simplifies matters by defining the weight to be

w(t) = K(t) Y0(t) Y1(t) / Y·(t).

This way we may think of just the weights K, which are now constant for the log-rank test, and much simpler for many other standard tests as well.

The test statistic is

Z(t) := ( ∑_{ti ≤ t} K(ti)² Y0(ti) Y1(ti) / Y·(ti)² )^{−1/2} ∑_{ti ≤ t} (−1)^{Gi + 1} K(ti) Y_{1−Gi}(ti) / Y·(ti).


8.1.3 With ties

As in section 7.3.3, we assume that the ties are genuine, so that we must treat the survival curve as having discrete hazards hj := 1 − S(tj)/S(tj−), which is the probability of an event at time tj, given survival up to time tj. We write dj^{(g)} (g = 0, 1) for the number of events in group g at time tj, and dj = dj^{(1)} + dj^{(0)}. (Note: we will only need to refer to the hazards hj under the null hypothesis, hence there is no need to define separate hazards for the two groups.)

Assuming the null hypothesis, the dj individuals having events at time tj are chosen uniformly from the Y·(tj) individuals at risk at time tj. This is like counting the number of red balls obtained when drawing dj at random from an urn containing Y0(tj) red and Y1(tj) blue balls. Thus

E[ dj^{(1)} − dj Y1(tj)/Y·(tj) | dj ] = 0,

so that

E[ dj^{(1)} − dj Y1(tj)/Y·(tj) | F_{tj−} ] = E[ E[ dj^{(1)} − dj Y1(tj)/Y·(tj) | dj ] | F_{tj−} ] = 0.

Thus

M(t) := ∑_{tj ≤ t} K(tj) ( dj^{(1)} − dj Y1(tj)/Y·(tj) )
      = ∑_{tj ≤ t} ( w(tj) Y·(tj) / ( Y0(tj) Y1(tj) ) ) ( dj^{(1)} − dj Y1(tj)/Y·(tj) )
      = ∑_{tj ≤ t} w(tj) ( dj^{(1)}/Y1(tj) − dj^{(0)}/Y0(tj) )

is a martingale that is constant except at the times tj.¹

The predictable variation is the sum of the conditional variances of the jumps. By assumption, w(tj), Y(tj) ∈ F_{tj−}, so

⟨M⟩(t) = ∑_{tj ≤ t} K(tj)² Var( dj^{(1)} | F_{tj−} ).

¹As discussed in section 7.3.3, the weight functions w and K are completely equivalent, and the choice of one or the other is purely a matter of convenience. Depending on context, one or the other may appear more natural.


Unlike the corresponding calculation in section 7.3.3, in this one we have no specific hazard rate given by the null hypothesis.

We note now that this derivation would also work to show that M is a martingale if we changed the filtration to G_{tj} = F_{tj} ∨ ⟨d_{j+1}⟩, where dj := dj^{(1)} + dj^{(0)}. That is, the number of events at each time is predictable, and the unpredictable random event at time tj is the allocation of those events to the two groups. We may do the conditioning in two steps, first conditioning on G_{j−1} (effectively, conditioning on Y0(tj), Y1(tj), and dj), and then on the smaller σ-algebra F_{tj−}.

Var( dj^{(1)} | F_{tj−} ) = E[ Var( dj^{(1)} | G_{j−1} ) | F_{tj−} ].   (8.8)

Because the sampling is without replacement, the distribution is not binomial (though obviously close when Y· is large), but hypergeometric, so

Var( dj^{(1)} | G_{j−1} ) = dj ( Y·(tj) − dj ) Y0(tj) Y1(tj) / ( Y·(tj)² ( Y·(tj) − 1 ) ).
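The hypergeometric variance formula can be checked directly against the probability mass function for small counts. A Python sketch (the numbers in the check are arbitrary, and the helper names are our own):

```python
from math import comb

def hypergeom_var_formula(d, y0, y1):
    """Variance of the group-1 event count when d of y0+y1 subjects fail,
    via the closed form d(Y-d)Y0Y1/(Y^2(Y-1))."""
    y = y0 + y1
    return d * (y - d) * y0 * y1 / (y ** 2 * (y - 1))

def hypergeom_var_direct(d, y0, y1):
    """Same variance computed from the hypergeometric pmf."""
    y = y0 + y1
    pmf = [comb(y1, k) * comb(y0, d - k) / comb(y, d) for k in range(d + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
```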

This gives us the increment to the variation

K(tj)² E[ dj ( Y·(tj) − dj ) | F_{tj−} ] Y0(tj) Y1(tj) / ( Y·(tj)² ( Y·(tj) − 1 ) ) = w(tj)² E[ dj ( Y·(tj) − dj ) | F_{tj−} ] / ( Y0(tj) Y1(tj) ( Y·(tj) − 1 ) ).

The true variance is given by the expected value of the predictable variation, but as usual, we take the realised value

σ̂² = ∑_{tj ≤ t} w(tj)² dj ( Y·(tj) − dj ) / ( Y0(tj) Y1(tj) ( Y·(tj) − 1 ) )   (8.9)

as an unbiased estimator for the variance. We have, for sufficiently large sample sizes, that the test statistic

( ∑_{tj ≤ t} w(tj)² dj ( Y·(tj) − dj ) / ( Y0(tj) Y1(tj) ( Y·(tj) − 1 ) ) )^{−1/2} ∑_{tj ≤ t} K(tj) ( dj^{(1)} − dj Y1(tj)/Y·(tj) )   (8.10)

has approximately a standard normal distribution.

8.1.4 The AML example

In the 1970s it was known that individuals who had gone into remission after chemotherapy for acute lymphatic leukemia would benefit, by longer remission times, from a course of continuing "maintenance" chemotherapy. A study [EEH+77] pointed out that "Despite a lack of conclusive evidence, it has been assumed that maintenance chemotherapy is useful in the management of acute myelogenous leukemia (AML)." The study set out to test this assumption, comparing the duration of remission between an experimental group that received the additional chemotherapy, and a control group that did not. (This analysis is based on the discussion in [MGM01].) We will analyse these data in various ways in this lecture.

The data come from a preliminary analysis, before completion of the study. The duration of complete remission in weeks was given for each patient (11 maintained, 12 non-maintained controls); those who were still in remission at the time of the analysis are censored observations. The data are given in Table 8.1. They are included in the survival package of R, under the name aml.

Table 8.1: Times of complete remission for preliminary analysis of AML data, in weeks. Censored observations denoted by +.

maintained 9 13 13+ 18 23 28+ 31 34 45+ 48 161+

non-maintained 5 5 8 8 12 16+ 23 27 30 33 43 45

The first thing we do is to estimate the survival curves. The summary data and computations are given in Table 8.2. The Kaplan–Meier survival curves are shown in Figure 8.1. In Table 8.3 we show the computations for confidence intervals just for the Kaplan–Meier curve of the maintenance group. The confidence intervals are based on the logarithm of survival. That is, the bounds on the confidence interval are

exp{ log Ŝ(t) ± z √( ∑_{ti ≤ t} di / ( ni ( ni − di ) ) ) },

where z is the appropriate quantile of the normal distribution. We could also use

Ŝ(t) ± z Ŝ(t) √( ∑_{ti ≤ t} di / ( ni ( ni − di ) ) ).
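The log-based interval is easy to compute by hand. A hedged Python sketch applied to the maintenance group (z = 1.96; the helper name km_log_ci is our own, and the small-sample caveat below applies):

```python
from math import exp, sqrt

def km_log_ci(rows, z=1.96):
    """Kaplan-Meier estimate with the log-based confidence interval
    exp(log S +/- z * sqrt(sum d/(n(n-d)))), capped above at 1.
    rows: (n, d) at each event time, in time order."""
    S, gsum, out = 1.0, 0.0, []
    for n, d in rows:
        S *= (n - d) / n
        gsum += d / (n * (n - d))
        half = z * sqrt(gsum)
        out.append((S, S * exp(-half), min(1.0, S * exp(half))))
    return out

# Maintenance group: (n, d) at the event times 9, 13, 18, 23, 31, 34, 48
ci = km_log_ci([(11, 1), (10, 1), (8, 1), (7, 1), (5, 1), (4, 1), (2, 1)])
# At t = 23 this gives S = 0.61 with lower bound about 0.38.
```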


Note that the approximation cannot be assumed to be very good in this case, since the number of individuals at risk is too small for the asymptotics to be reliable. We show the confidence intervals in Figure 8.2.

Table 8.2: Computations for the Kaplan–Meier and Nelson–Aalen survival curve estimates of the AML data. Columns: risk set Y, events d, discrete hazard h, Kaplan–Meier Ŝ, Nelson–Aalen Â, and S̃ = e^{−Â}.

        Maintenance                          Non-Maintenance (control)
ti   Y(ti) di  hi    Ŝ(ti) Â(ti) S̃(ti)    Y(ti) di  hi    Ŝ(ti) Â(ti) S̃(ti)
5    11    0   0.00  1.00  0.00  1.00      12    2   0.17  0.83  0.17  0.85
8    11    0   0.00  1.00  0.00  1.00      10    2   0.20  0.67  0.37  0.69
9    11    1   0.09  0.91  0.09  0.91       8    0   0.00  0.67  0.37  0.69
12   10    0   0.00  0.91  0.09  0.91       8    1   0.12  0.58  0.49  0.61
13   10    1   0.10  0.82  0.19  0.83       7    0   0.00  0.58  0.49  0.61
18    8    1   0.12  0.72  0.32  0.73       6    0   0.00  0.58  0.49  0.61
23    7    1   0.14  0.61  0.46  0.63       6    1   0.17  0.49  0.66  0.52
27    6    0   0.00  0.61  0.46  0.63       5    1   0.20  0.39  0.86  0.42
30    5    0   0.00  0.61  0.46  0.63       4    1   0.25  0.29  1.11  0.33
31    5    1   0.20  0.49  0.66  0.52       3    0   0.00  0.29  1.11  0.33
33    4    0   0.00  0.49  0.66  0.52       3    1   0.33  0.19  1.44  0.24
34    4    1   0.25  0.37  0.91  0.40       2    0   0.00  0.19  1.44  0.24
43    3    0   0.00  0.37  0.91  0.40       2    1   0.50  0.10  1.94  0.14
45    3    0   0.00  0.37  0.91  0.40       1    1   1.00  0.00  2.94  0.05
48    2    1   0.50  0.18  1.41  0.24       0    0

Important: The estimate of the variance is more generally reliable than the assumption of normality, particularly for small numbers of events. Thus, the first line in Table 8.3 indicates that the estimate of Â(9) is associated with a variance of 0.008. The error in this estimate is on the order of Y(ti)^{−3}, so it's potentially about 10% of the indicated value. On the other hand, the number of events observed has binomial distribution, with parameters around (11, 0.09), so it's very far from a normal distribution.

We can use these tests to compare the survival of the two groups. The relevant quantities are tabulated in Table 8.4. The column σ²ⱼ gives the increments to the approximate variance

\[ \sigma_j^2 = \frac{d_j\,Y_0(t_j)\,Y_1(t_j)\,\bigl(Y_\cdot(t_j)-d_j\bigr)}{Y_\cdot(t_j)^2\,\bigl(Y_\cdot(t_j)-1\bigr)}. \]
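These increments and the corresponding observed-minus-expected counts are easy to accumulate in code. The following Python sketch (an illustrative helper, not from the notes, which use R) reproduces the unweighted test statistic Z = −1.84 from the rows of Table 8.4:

```python
def logrank_z(rows):
    """Unweighted (log-rank) two-sample statistic: the sum over event
    times of observed-minus-expected events in group 0, divided by the
    square root of the summed hypergeometric variance increments
    sigma_j^2 = d_j Y0 Y1 (Y. - d_j) / (Y.^2 (Y. - 1))."""
    num = var = 0.0
    for Y0, Y1, D0, D1 in rows:   # (Y0, Y1, d0, d1) at each event time
        Y, D = Y0 + Y1, D0 + D1
        num += D0 - D * Y0 / Y    # observed minus expected in group 0
        if Y > 1:
            var += D * Y0 * Y1 * (Y - D) / (Y ** 2 * (Y - 1))
    return num / var ** 0.5

# Rows (Y0, Y1, d0, d1) of Table 8.4 for the AML data
aml = [(11, 12, 0, 2), (11, 10, 0, 2), (11, 8, 1, 0), (10, 8, 0, 1),
       (10, 7, 1, 0), (8, 6, 1, 0), (7, 6, 1, 1), (6, 5, 0, 1),
       (5, 4, 0, 1), (5, 3, 1, 0), (4, 3, 0, 1), (4, 2, 1, 0),
       (3, 2, 0, 1), (3, 1, 0, 1), (2, 0, 1, 0)]
z = logrank_z(aml)   # approximately -1.84
```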



Figure 8.1: Kaplan–Meier estimates of survival in maintenance (black) and non-maintenance groups in the AML study.

Table 8.3: Variance estimates for cumulative hazard of the maintenance population in the AML data. "Lower" and "upper" are bounds for 95% confidence intervals.

 ti  Y(ti)  di  1/Y(ti)²  σ²(ti)  lower  upper
  9   11    1   0.008     0.008   0.000  0.269
 13   10    1   0.010     0.018   0.000  0.456
 18    8    1   0.016     0.034   0.000  0.677
 23    7    1   0.020     0.054   0.002  0.915
 31    5    1   0.040     0.094   0.057  1.26
 34    4    1   0.062     0.157   0.133  1.68
 48    2    1   0.25      0.407   0.159  2.66



Figure 8.2: Estimates of 95% confidence intervals for survival in the maintenance group of the AML study. (Time axis in weeks.)


 ti  Y0(ti)  Y1(ti)  d0(ti)  d1(ti)   σ²ᵢ    Peto wt.  H–F (0,1) wt.
  5    11      12      0       2      0.476   0.958     0.000
  8    11      10      0       2      0.474   0.875     0.083
  9    11       8      1       0      0.244   0.792     0.167
 12    10       8      0       1      0.247   0.750     0.208
 13    10       7      1       0      0.242   0.708     0.250
 18     8       6      1       0      0.245   0.661     0.292
 23     7       6      1       1      0.456   0.614     0.339
 27     6       5      0       1      0.248   0.519     0.433
 30     5       4      0       1      0.247   0.467     0.481
 31     5       3      1       0      0.234   0.416     0.533
 33     4       3      0       1      0.245   0.364     0.584
 34     4       2      1       0      0.222   0.312     0.636
 43     3       2      0       1      0.240   0.260     0.688
 45     3       1      0       1      0.188   0.208     0.740
 48     2       0      1       0      0.000   0.139     0.792

Table 8.4: Data for testing equality of survival in the AML experiment.

When the weights are all taken equal, we compute Z = −1.84, whereas the Peto weights — which reduce the influence of later observations — give us Z = −1.67. This yields one-sided p-values of 0.033 and 0.048 respectively — a marginally significant difference — or two-sided p-values of 0.065 and 0.096.

8.1.5 Kidney dialysis example

This is example 7.2 from [KM03]. The data are from a clinical trial of alternative methods of placing catheters in kidney dialysis patients. The event time is the first occurrence of an exit-site infection. The data are in the KMsurv package, in the object kidney. (Note: There is a different data set with the same name in the survival package. To make sure you get the right one, enter data(kidney, package='KMsurv').) The Kaplan–Meier estimator is shown in Figure 8.3. There are two survival curves, corresponding to the two different methods.

We show the calculations for the nonparametric test of equality of distributions in Table 8.5. The log-rank test — obtained by simply dividing the sum of all the deviations by the square root of the sum of terms in the σ²ᵢ column — is only 1.59, so not significant. With the Peto weights the statistic is only 1.12. This is not surprising, because the survival curves are close together (and actually cross) early on. On the other hand, they diverge later, suggesting that weighting the later times more heavily would yield a significant result. It would not be responsible statistical practice to choose a different test after seeing the data. On the other hand, if we had started with the belief that the benefits of the percutaneous method are cumulative, so that it would make sense to expect the improved survival to appear later on, we might have planned from the beginning to use the Harrington–Fleming weights with, for example, p = 0, q = 1, tabulated in the last column. Applying these weights gives us a test statistic Z_FH = 3.11, implying a highly significant difference.

  ti   Y0(ti)  Y1(ti)  d0(ti)  d1(ti)   σ²ᵢ    Peto wt.  H–F (0,1) wt.
  0.5    43      76      0       6      1.326   0.992     0.000
  1.5    43      60      1       0      0.243   0.941     0.050
  2.5    42      56      0       2      0.485   0.931     0.059
  3.5    40      49      1       1      0.489   0.912     0.078
  4.5    36      43      2       0      0.490   0.890     0.099
  5.5    33      40      1       0      0.248   0.867     0.121
  6.5    31      35      0       1      0.249   0.854     0.133
  8.5    25      30      2       0      0.487   0.839     0.146
  9.5    22      27      1       0      0.247   0.807     0.176
 10.5    20      25      1       0      0.247   0.790     0.193
 11.5    18      22      1       0      0.247   0.770     0.210
 15.5    11      14      1       1      0.472   0.741     0.230
 16.5    10      13      1       0      0.246   0.681     0.289
 18.5     9      11      1       0      0.247   0.649     0.319
 23.5     4       3      1       0      0.245   0.568     0.351
 26.5     2       3      1       0      0.240   0.473     0.432

Table 8.5: Data for kidney dialysis study.

8.1.6 Nonparametric tests in R

The survival package includes a command survdiff that carries out the tests described in this lecture. The following code carries out the log-rank test and the Petos' test (actually, the Harrington–Fleming (1, 0) test).


> kS = Surv(kidney$time, kidney$delta)
> survdiff(kS ~ kidney$type)   # log-rank test
Call:
survdiff(formula = kS ~ kidney$type)

                N Observed Expected (O-E)^2/E (O-E)^2/V
kidney$type=1  43       15       11      1.42      2.53
kidney$type=2  76       11       15      1.05      2.53

 Chisq= 2.5  on 1 degrees of freedom, p= 0.112
> survdiff(kS ~ kidney$type, rho=1)   # Petos' test
Call:
survdiff(formula = kS ~ kidney$type, rho = 1)

                N Observed Expected (O-E)^2/E (O-E)^2/V
kidney$type=1  43     12.0     9.48     0.686      1.39
kidney$type=2  76     10.4    12.98     0.501      1.39

 Chisq= 1.4  on 1 degrees of freedom, p= 0.239

Note that the output reports the chi-squared statistic (O−E)²/V, which for two groups is just the square of the corresponding Z statistic (here 1.59² ≈ 2.5). We conclude with the R code used to generate Figure 8.3 and Table 8.5:

require('KMsurv')
data(kidney, package='KMsurv')
attach(kidney)
kid.surv = Surv(time, delta)
kid.fit = survfit(kid.surv ~ type)
plot(kid.fit, col=1:2, xlab='Time to infection (months)',
     main='Kaplan-Meier plot for kidney dialysis data')

t = sort(unique(time))
d = lapply(1:2, function(i) sapply(t, function(T) sum((time[type==i]==T) & (delta[type==i]==1))))
n = lapply(1:2, function(i) sapply(t, function(T) sum(time[type==i] >= T)))

keep = (n[[1]]*n[[2]] > 0)
t = t[keep]
d = lapply(d, function(D) D[keep])
n = lapply(n, function(D) D[keep])
ddot = d[[1]] + d[[2]]
ndot = n[[1]] + n[[2]]

si = ddot*n[[1]]*n[[2]]*(ndot-ddot)/ndot/ndot/(ndot-1)
petos = cumprod((ndot-ddot)/(ndot))
petow = c(1, petos[1:(length(petos)-1)])*ndot/(ndot+1)
fhw = c(0, (1-petos[1:(length(petos)-1)]))
ei = ddot*n[[1]]/ndot
wk = rep(1, length(ei))

zLR = sum(wk*(d[[1]]-ei))/sqrt(sum(wk^2*si))
zP = sum(petow*(d[[1]]-ei))/sqrt(sum(petow^2*si))
zFH = sum(fhw*(d[[1]]-ei))/sqrt(sum(fhw^2*si))

require('xtable')   # for formatting the table
xt = xtable(cbind(t, n[[1]], n[[2]], d[[1]], d[[2]], si, petow, fhw),
            display=c('d','f','d','d','d','d','f','f','f'),
            digits=c(0,1,0,0,0,0,3,3,3))
print(xt, include.rownames=FALSE, include.colnames=FALSE)



Figure 8.3: Plot of Kaplan–Meier survival curves for time to infection of dialysis patients, based on data described in section 1.4 of [KM03]. The black curve represents 43 patients with surgically placed catheter; the red curve 76 patients with percutaneously placed catheter.


Lecture 9

Additive hazards regression

In Lecture 7 we considered problems of measuring the effect of a categorical variable on the hazard for an event time: sex, ethnicity, receiving one treatment rather than another. This can be done in a nonparametric way, depending on very few assumptions. We considered only the analysis of two-state categories, but to a certain extent the same methods can be applied to the analysis of more than two categories (but still a small number).

In many situations we wish to measure the effect of a quantitative, often continuous, variable on hazard rates. This cannot be done in a nonparametric way, since the information provided by any single individual — with a single value of the covariate — is so low. We need a way to combine the effect of multiple individuals into a single estimate, and that means we need to make fairly strong modelling assumptions about how the covariates affect the hazard. Approaches that link covariates to an outcome measure (in this case, hazards) through strong functional assumptions about their influence are called regression models.

Even when we have a categorical model, it may be useful to apply a regression model that summarises the relationship between covariate and hazard in a more compact way.

In this chapter we will consider the additive-hazards approach, which is still very close to a nonparametric model. In Lecture 10 we will consider the more popular relative-risk or proportional-hazards models, which are clearly semi-parametric, in the sense that part of the model — the definition of the baseline survival — is nonparametric, while another part — the more important part, defining the influence of the covariates — is parametric, depending on a small number of numerical parameters.

Relative risk regression is mathematically elegant, and traditionally is very popular in medical statistics, but it cannot be said to be a universal model. The assumptions are very strong, and the model may give misleading results when applied to data that don't support it. (We describe in section 11.5 methods for evaluating the appropriateness of the proportional hazards assumption.) It is useful to have alternative models available.

In addition to the general issue of scientific appropriateness of a given model, the additive hazards model has numerous advantages, detailed at length in section 4.2 of [ABG08]. In particular:

• the statistical methods for fitting additive hazards regression models make it relatively easy to allow for effects that change with time;

• results of the additive model lend themselves to a natural interpretation as "excess mortality", forming a natural regression counterpart to the nonparametric excess mortality models we described in Lecture 7.

9.1 Describing the model

This section and the next two are based closely on section 4.2 of [ABG08]. The additive-hazards regression model assigns to each individual i a time-varying covariate vector

\[ \mathbf{x}_i(t) = \bigl(x_{i1}(t), \ldots, x_{ip}(t)\bigr). \]

The model parameters (unknown) are the baseline hazard β₀(t), and p functions β₁(t), …, βₚ(t). The hazard for individual i at time t is then

\[ \beta_0(t) + \beta_1(t)\,x_{i1}(t) + \cdots + \beta_p(t)\,x_{ip}(t). \tag{9.1} \]

Each βⱼ is a trajectory of excess risk attributable to changes in the covariate xᵢⱼ.

9.2 Fitting the model

As with other nonparametric estimation procedures, we naturally estimate the cumulative excess risk rather than the risk itself. We define

\[ B_k(t) := \int_0^t \beta_k(s)\,ds. \]


We write Nᵢ(t) for the counting process of individual i. By (9.1) and the multiplicative intensity model,

\[ M_i(t) := N_i(t) - \int_0^t Y_i(s)\,dB_0(s) - \sum_{j=1}^p \int_0^t Y_i(s)\,x_{ij}(s)\,dB_j(s) \]

is a martingale. We may rewrite this as

\[ dN_i(t) = Y_i(t)\,dB_0(t) + \sum_{j=1}^p Y_i(t)\,x_{ij}(t)\,dB_j(t) + dM_i(t). \tag{9.2} \]

The trick is to view this as an analogue of the usual regression equation for each fixed t, where dNᵢ(t) is a vector of observations of dependent variables, Yᵢ(t)xᵢⱼ(t) are the explanatory variables, dMᵢ(t) plays the role of random error, and dBⱼ(t) are the parameters to be estimated. As in a linear regression model it is convenient to represent this in matrix form

\[ d\mathbf{N}(t) = \mathbf{X}(t)\,d\mathbf{B}(t) + d\mathbf{M}(t). \tag{9.3} \]

Here X is the n × (p + 1) matrix whose (i, j) component is Yᵢ(t)xᵢⱼ(t), with xᵢ₀(t) ≡ 1.

We define a random matrix

\[ \mathbf{X}^-(t) := \begin{cases} \bigl(\mathbf{X}(t)^T\mathbf{X}(t)\bigr)^{-1}\mathbf{X}(t)^T & \text{if } \mathbf{X}(t) \text{ has full rank,}\\[2pt] 0 & \text{otherwise.}\end{cases} \tag{9.4} \]

In other words, it is the generalised inverse of X whenever this exists. Our usual least-squares solution for this equation is then

\[ d\hat{\mathbf{B}}(t) = \mathbf{X}^-(t)\,d\mathbf{N}(t), \]

yielding the estimator

\[ \hat{\mathbf{B}}(t) = \int_0^t \mathbf{X}^-(u)\,d\mathbf{N}(u) = \sum_{t_j\le t}\mathbf{X}^-(t_j)\,d\mathbf{N}(t_j) = \sum_{t_j\le t}\mathbf{X}^-(t_j)_{\cdot\, i_j}, \tag{9.5} \]

where dN(tⱼ) is a vector of all 0's, except for a 1 in the iⱼ component, where iⱼ is the individual having an event at time tⱼ. (Here we are assuming no ties, as is conventional. If there are ties, we could still obtain an unbiased estimator by summing over the set Dⱼ of individuals with events at time tⱼ. This would require, as usual, some decision about the slight adjustment to the variance. The R function aareg breaks ties randomly.)
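Estimator (9.5) is easy to implement directly. The following Python sketch is illustrative only: the data and function name are invented, and the notes' own computations use R's aareg. At each event time it builds the at-risk design matrix, forms the generalised inverse, and adds the column belonging to the event individual:

```python
import numpy as np

def aalen_estimator(times, event_idx, risk_sets, covariates):
    """Additive-hazards estimator (9.5): B-hat(t) = sum_j X^-(t_j) dN(t_j),
    where X^-(t) is the generalised inverse of the at-risk design matrix."""
    p1 = 1 + covariates.shape[1]          # intercept (baseline) + p covariates
    B = np.zeros(p1)
    path = []
    for t, ij, risk in zip(times, event_idx, risk_sets):
        X = np.zeros((covariates.shape[0], p1))
        X[risk, 0] = 1.0                  # at-risk indicator column
        X[risk, 1:] = covariates[risk]    # Y_i(t) x_ij(t)
        if np.linalg.matrix_rank(X) < p1: # design no longer full rank: skip
            continue
        Xminus = np.linalg.pinv(X)        # (X'X)^{-1} X' on the at-risk rows
        B = B + Xminus[:, ij]             # dN picks out the event individual
        path.append((t, B.copy()))
    return path

# Toy data (assumed, for illustration): three individuals with covariate
# x = 0, 1, 2, events in order 0, 1, 2, no censoring.  The last event
# leaves a rank-deficient design and is dropped.
x = np.array([[0.0], [1.0], [2.0]])
path = aalen_estimator([1.0, 2.0, 3.0], [0, 1, 2],
                       [[0, 1, 2], [1, 2], [2]], x)
```

On this toy example the two usable increments are (5/6, −1/2) and (2, −1), so the final estimate is B̂ = (17/6, −3/2).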


9.3 Variance estimation

9.3.1 Martingale representation of additive hazards model

As in section 6.1 we need to acknowledge that the estimation is only being carried out up to the point where the data become insufficient. We define J(s) = 1{X(s) has full rank}, and

\[ \mathbf{B}^*(t) = \int_0^t J(s)\,d\mathbf{B}(s). \]

The design matrix will stop being of full rank when the covariates represented in the individuals still at risk are no longer linearly independent. Intuitively, this means that there are not "enough" individuals still at risk. If the covariates represent categories, this will be equivalent to there no longer being individuals at risk in all categories, so the "stopping condition" will be the same as for the excess mortality calculation.

Now

\[ \hat{\mathbf{B}}(t) = \int_0^t \mathbf{X}^-(s)\,d\mathbf{N}(s) = \int_0^t \mathbf{X}^-(s)\bigl(\mathbf{X}(s)\,d\mathbf{B}(s) + d\mathbf{M}(s)\bigr) = \int_0^t \bigl(J(s)\,d\mathbf{B}(s) + \mathbf{X}^-(s)\,d\mathbf{M}(s)\bigr). \]

So

\[ \hat{\mathbf{B}}(t) - \mathbf{B}^*(t) = \int_0^t \mathbf{X}^-(s)\,d\mathbf{M}(s) \tag{9.6} \]

is a mean-zero martingale. In particular, B̂(t) is an unbiased estimator for B*(t).

9.3.2 Estimating the covariance matrix

Equation (9.6) looks a lot like other martingale representations that we have used for estimating the variance of survival estimators. The difference here is that the parameter being estimated is multidimensional, so a statistical analysis requires variances and covariances. This requires that we consider the covariation of different martingales. We will not do this formally, though the precise statements may be found in the appendix of [ABG08].


A single component of (9.6) is

\[ \bigl(\hat{\mathbf{B}}(t)-\mathbf{B}^*(t)\bigr)_j = \int_0^t \sum_{i=1}^n X^-_{ji}(s)\,dM_i(s), \]

where the Mᵢ are the counting martingales of individuals. Since these counting processes are assumed independent of each other, the optional variation has no interaction between individuals, and the optional variation is just a sum of the optional variations for the separate processes. (The same is true of the predictable variation, but we do not concern ourselves with it here.)

\[ \bigl[(\hat{\mathbf{B}}-\mathbf{B}^*)_k\bigr](t) = \int_0^t \sum_{i=1}^n X^-_{ki}(s)^2\,dN_i(s) = \sum_{t_j\le t} X^-_{k i_j}(t_j)^2, \]

where iⱼ is the individual with the event at time tⱼ. The optional covariation between components k and ℓ will similarly be a sum of variances of n independent terms. This gives us an estimator Σ̂ for the (p + 1) × (p + 1) covariance matrix, whose (k, ℓ) component may be written as

\[ \hat\Sigma_{k\ell}(t) = \bigl[(\hat{\mathbf{B}}-\mathbf{B}^*)_k,\ (\hat{\mathbf{B}}-\mathbf{B}^*)_\ell\bigr](t) = \sum_{t_j\le t} X^-_{k i_j}(t_j)\,X^-_{\ell i_j}(t_j). \tag{9.7} \]

Thus, for large n the martingale CLT implies that B̂(t) is approximately normal, with mean B(t) and covariance matrix Σ̂(t).
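A minimal numerical sketch of (9.7), in Python with purely hypothetical values: given the columns of X⁻(tⱼ) belonging to the event individuals, the covariance estimate is just the sum of their outer products.

```python
import numpy as np

# Hypothetical X^-(t_j) columns for the event individuals at two event
# times, in a fit with an intercept and one covariate (p + 1 = 2):
cols = [np.array([5/6, -1/2]), np.array([2.0, -1.0])]

# Equation (9.7): Sigma-hat(t) = sum over event times of the outer
# product of the event individual's column with itself
Sigma = sum(np.outer(c, c) for c in cols)

se_B1 = np.sqrt(Sigma[1, 1])   # estimated standard error of B1-hat
```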

9.4 Testing for a single effect

We consider here methods for testing the null hypothesis that a single covariate effect is 0, against noncrossing alternatives. (Simultaneously testing the hypothesis that all of the effects are 0 is somewhat more complicated, and we will not address it, though those who are interested may find a discussion — along with the material included in this section — in section 4.2.3 of [ABG08].)

Suppose we want to test H₀: βq(t) = 0 for t ∈ [0, t₀], where t₀ may or may not be the end of the study, or the last time for which data are available. We have a stream of estimates dB̂q(t). Some are nonzero, and we wish to determine whether they are biased in one direction more than would be expected by chance. As usual, we take a weighted sum of different estimates to form a test statistic:

\[ Z_q(t_0) = \int_0^{t_0} L_q(s)\,d\hat B_q(s) = \sum_{t_j\le t_0} L_q(t_j)\,d\hat B_q(t_j), \tag{9.8} \]

where Lq(s) is an arbitrary nonnegative predictable weight function. We obtain the optional variation by integrating the square of the weight function against the increments in optional variation for B̂q:

\[ V_q(t_0) = \int_0^{t_0} L_q(s)^2\,d\hat\Sigma_{qq}(s) = \sum_{t_j\le t_0} L_q(t_j)^2\,d\hat\Sigma_{qq}(t_j). \tag{9.9} \]

Thus

\[ \frac{Z_q(t_0)}{\sqrt{V_q(t_0)}} \]

has approximately a standard normal distribution under the null hypothesis.
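The statistic can be assembled in a few lines. A Python sketch with made-up increments (the weights, dB̂q and dΣ̂qq values here are purely illustrative, not taken from any data in the notes):

```python
def weighted_effect_test(L, dB, dSigma):
    """One-effect test (9.8)-(9.9): Z = sum_j L_j dB_j and
    V = sum_j L_j^2 dSigma_j; returns the standardised Z / sqrt(V)."""
    Z = sum(l * db for l, db in zip(L, dB))
    V = sum(l * l * ds for l, ds in zip(L, dSigma))
    return Z / V ** 0.5

# Hypothetical increments at three event times, with equal weights:
z = weighted_effect_test([1.0, 1.0, 1.0],
                         [0.2, -0.1, 0.3],
                         [0.04, 0.05, 0.06])
# compare z against standard normal quantiles
```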

9.5 Examples

9.5.1 Single covariate

Suppose each individual has a single covariate xᵢ(t) at time t, and the hazard for individual i at time t is β₀(t) + β₁(t)xᵢ(t). Then

\[ \mathbf{M}(t) := \mathbf{N}(t) - \int_0^t \mathbf{X}(s)\,d\mathbf{B}(s) \quad\text{is a martingale,} \]

where N(t) is the binary vector of counting processes (giving a 1 in place i if individual i has had an event by time t, and 0 otherwise), and X(t) is the n × 2 design matrix, with Yᵢ(t) (the at-risk indicator for individual i) in the first column and Yᵢ(t)xᵢ(t) in the second column.

If we let R(t) be the set of individuals at risk at time t, and define

\[ \mu_k(t) = \frac{1}{\#R(t)}\sum_{i\in R(t)} x_i(t)^k, \]

we have

\[ \mathbf{X}(t)^T\mathbf{X}(t) = \#R(t)\begin{pmatrix} 1 & \mu_1(t)\\ \mu_1(t) & \mu_2(t)\end{pmatrix}, \]


and the inverse is

\[ \frac{1}{\#R(t)\bigl(\mu_2(t)-\mu_1(t)^2\bigr)}\begin{pmatrix} \mu_2(t) & -\mu_1(t)\\ -\mu_1(t) & 1\end{pmatrix}. \]

If we assume that t is such that there are still multiple individuals whose event is after t, and that the xᵢ(t) are all distinct, then the Cauchy–Schwarz inequality tells us that the denominator is always > 0.

We also observe that X(t)ᵀ dN(t) is a 2 × 1 vector which is 0 except at times tⱼ, when it is

\[ \mathbf{X}(t_j)^T d\mathbf{N}(t_j) = \begin{pmatrix} d_j \\ \sum_{i\in D_j} x_i(t_j)\end{pmatrix}, \]

where Dⱼ is the set of individuals who have events at time tⱼ and dⱼ = #Dⱼ. The estimator (9.5) then becomes

\[ \begin{pmatrix}\hat B_0(t)\\ \hat B_1(t)\end{pmatrix} = \sum_{t_j\le t}\frac{d_j}{\#R(t_j)\bigl(\mu_2(t_j)-\mu_1(t_j)^2\bigr)}\begin{pmatrix}\mu_2(t_j)-\mu_1(t_j)\bar x_j\\ -\mu_1(t_j)+\bar x_j\end{pmatrix}, \tag{9.10} \]

where x̄ⱼ is the mean value of xᵢ(tⱼ) over i ∈ Dⱼ.

Intuitively, this result makes sense: it says that B̂₁(t) increases insofar as the average covariate value of the individuals having an event at time tⱼ is greater than the average covariate value of all the individuals at risk at that time. On the other hand, the increments to B̂₀ are like dⱼ/#Rⱼ — the increment in the Nelson–Aalen estimator — modified in proportion as the estimate of β₁ is negative or positive.

We estimate the covariance matrix of B̂(t) as

\[ \sum_{t_j\le t}\frac{1}{\#R(t_j)^2\bigl(\mu_2(t_j)-\mu_1(t_j)^2\bigr)^2}\sum_{i\in D_j}\begin{pmatrix}\bigl(\mu_2(t_j)-\mu_1(t_j)x_i\bigr)^2 & \bigl(\mu_2(t_j)-\mu_1(t_j)x_i\bigr)\bigl(x_i-\mu_1(t_j)\bigr)\\ \bigl(\mu_2(t_j)-\mu_1(t_j)x_i\bigr)\bigl(x_i-\mu_1(t_j)\bigr) & \bigl(x_i-\mu_1(t_j)\bigr)^2\end{pmatrix}. \]

9.5.2 Simulated data

We consider the single-covariate additive model from section 9.5.1. We consider a population of n individuals, where the hazard rate for individual i is

\[ \lambda_i(t) = 1 + \frac{x_i}{1+t}, \]


with xᵢ being i.i.d. covariates with N(0, 0.25) distribution (matching the code below). So the effect of the covariate decreases with time. We assume independent right censoring at constant rate 0.5. We consider two cases: n = 100 and n = 1000. First of all, we need to simulate the times. We use the result of problem (2) from Problem Sheet B.1. The cumulative intensity for individual i is

\[ \Lambda_i(t) = t + x_i\log(1+t). \]

We can't find Λᵢ⁻¹ in closed form, but it is easy to write this as a function xtime that computes it numerically:

censrate = 0.5
covmean = 0
covsd = 0.5

xtime = function(T, x) {
  u = uniroot(function(t) t + x*log(1+t) - T, c(0, max(T, 2*(T-x))))
  u$root
}

n = 1000

# Censoring times
C = rexp(n, censrate)
xi = rnorm(n, covmean, covsd)
T = rep(0, n)
for (i in 1:n)
  T[i] = xtime(rexp(1), xi[i])
t = pmin(T, C)
delta = (T < C)

We may compute B̂ by using the function aareg:

afit = aareg(Surv(t, delta) ~ xi)
plot(afit, xlim=c(0,1), ylim=c(-.5,1.5))
s = (0:120)/100
lines(s, log(1+s), col=2)

The results are in Figure 9.1. Note that the estimates for n = 100 are barely useful even just for distinguishing the effect of the covariates from 0; on the other hand, bear in mind that these are pointwise confidence intervals, so interpretations in terms of the entire time-course of B̂ are more complicated (and beyond the scope of this course). The estimates with n = 1000 are much more useful.

Applying print() to an aareg object gives useful summary information about the model fit. Applying it to the n = 100 simulation we get

> print(afit,maxtime=1)

Call:

aareg(formula = Surv(T, delta) ~ xi)

n= 100

70 out of 80 unique event times used

slope coef se(coef) z p

Intercept 1.76 0.00733 0.00433 1.69 0.0909

xi 1.24 0.00847 0.00428 1.98 0.0480

Chisq=3.91 on 1 df, p=0.048; test weights=aalen

The slope is a crude estimate of the rate of increase of B̂·(t) with t (based on fitting a weighted least-squares line to the estimates). We use the option maxtime=1 since about 80% of the events are in [0, 1], so that the estimates become extremely erratic after t = 1. If we leave out this option, the slope will not make much sense (though we could extend the range significantly further when n = 1000). In this case, we would get a slope estimate of 2.

Note that the p-value for the covariate coefficient (row "xi") is based on the SE for the cumulative weighted test statistic for that particular parameter, and has nothing to do with the slope estimate. The chi-squared statistic is based on a joint weighted cumulative test statistic for all effects being 0, and it has a chi-squared distribution with p degrees of freedom. In the case p = 1 it is just the square of the single-variable test statistic.


Figure 9.1: Estimated cumulative hazard increment per unit of covariate (B̂₁(t)) for two sample sizes, (a) n = 100 and (b) n = 1000, together with pointwise 95% confidence intervals. The true value is B₁(t) = log(1 + t), which is plotted in red.


Lecture 10

Relative-risk models

10.1 The relative-risk regression model

In the regression setting, the most mathematically tractable models are the relative-risk models. These are based on the multiplicative intensity model, with hazard modelled by

\[ \alpha_i(t) = \alpha(t \mid \mathbf{x}_i) = \alpha_0(t)\,r\bigl(\boldsymbol\beta, \mathbf{x}_i(t); t\bigr). \tag{10.1} \]

Here xᵢ is a vector of (possibly time-varying) covariates belonging to individual i, and β is a vector of parameters. Since this model has a nonparametric piece α₀ and a parametric piece β, it is called semiparametric. In this lecture we will generally be assuming that r is a function only of β and x (no direct dependence on t), and in that case we will drop t from the notation, writing r(β, x) or r(β, x(t)).

This model is called relative-risk or proportional hazards because there is an unchanging ratio of the hazard rate (or risk of the event) between individuals with covariate values xᵢ and xⱼ. (Of course, if the covariates themselves are time-varying, this looks more complicated when we look at individual hazard rates over time.)

Different choices for the relative risk function r are possible. We will focus mainly on the choice

\[ r(\boldsymbol\beta, \mathbf{x}) = e^{\boldsymbol\beta^T\mathbf{x}} = e^{\sum_j \beta_j x_j}, \tag{10.2} \]

which assigns a constant proportional change in the hazard rate to each unit change in the covariate. The regression model with this risk function, which is by far the most commonly used survival model in medical applications, is called the Cox proportional hazards regression model.


One might naively suppose that the choice of model should be guided by a belief about what is "really true" about the survival process. But, as George Box famously said, "All models are wrong, but some are useful." A wrong model will still summarise the data in a potentially non-misleading way. (On the other hand, the model can't be too wrong. We will discuss model diagnostics later on.)

For example, in medical statistics the excess relative risk of a category is defined as

\[ \frac{\text{observed rate} - \text{expected rate}}{\text{expected rate}}. \]

If we have a single parameter that is supposed to summarise the excess relative risk created per unit of covariate, we would take the risk function

\[ r(\beta, x) = 1 + \beta x. \]

In the multidimensional-covariate setting we can generalise this to the excess relative risk model (taking p to be the dimension of the covariate)

\[ r(\boldsymbol\beta, \mathbf{x}) = \prod_{j=1}^p (1 + \beta_j x_j). \tag{10.3} \]

This allows each covariate to contribute its own excess relative risk, independent of the others. Alternatively, we can define the linear relative risk function

\[ r(\boldsymbol\beta, \mathbf{x}) = 1 + \sum_{j=1}^p \beta_j x_j. \tag{10.4} \]

10.2 Partial likelihood

This section follows closely section 4.1.1 of [ABG08]. More detail may be found there.

We have a counting process for each individual i whose intensity is

\[ \lambda_i(t) = Y_i(t)\,r(\boldsymbol\beta, \mathbf{x}_i(t); t)\,\alpha_0(t). \]

The total counting process N(t) has intensity

\[ \lambda(t) = \sum_{i=1}^n Y_i(t)\,\alpha_0(t)\,r(\boldsymbol\beta, \mathbf{x}_i(t); t). \tag{10.5} \]


Conditioned on some event happening at time t, the probability that it is individual i is

\[ \pi(i \mid t) := \frac{\lambda_i(t)}{\lambda(t)} = \frac{Y_i(t)\,r(\boldsymbol\beta, \mathbf{x}_i(t); t)}{\sum_{\ell=1}^n Y_\ell(t)\,r(\boldsymbol\beta, \mathbf{x}_\ell(t); t)}. \]

We represent the data in the following form:

(i). A list of event times t₁ < t₂ < · · ·. (We are assuming no ties, for the moment.)

(ii). The identity iⱼ of the individual whose event is at time tⱼ.

(iii). The values of all individuals' covariates (at times tⱼ, if they are time-varying).

(iv). The process Yᵢ(t) for all subjects. These may be summarised in risk sets Rⱼ = {i : Yᵢ(tⱼ) = 1}, the set of individuals who are at risk at time tⱼ.

The most common way to fit a relative-risk model is to split the likelihood into two pieces: the likelihood of the event times, and the conditional likelihood of the choice of subjects given the event times. The first piece is assumed to contain relatively little information about the parameters, and its dependence on the parameters is quite complicated.

We use the second piece to estimate β. We have the partial likelihood

\[ L_P(\boldsymbol\beta) = \prod_{t_j} \pi(i_j \mid t_j) = \prod_{t_j} \frac{r(\boldsymbol\beta, \mathbf{x}_{i_j}(t_j); t_j)}{\sum_{l\in R_j} r(\boldsymbol\beta, \mathbf{x}_l(t_j); t_j)}. \tag{10.6} \]

The partial likelihood is useful because it involves only the parameters β, isolating them from the nonparametric (and often less interesting) α₀. In section 11.2 we derive a special case of the asymptotic behaviour of the maximum partial likelihood estimator. As we state here, it has the same essential properties as the MLE.

Theorem 10.1. Let β̂ maximise L_P, as given in (10.6). Then β̂ is a consistent estimator of the true parameter β₀, and √n(β̂ − β₀) converges to a multivariate normal distribution with mean 0 and covariance matrix consistently approximated by J(β̂)⁻¹, where J(β) is the observed information matrix, with (i, j) component given by

\[ -\frac{\partial^2}{\partial\beta_i\,\partial\beta_j}\log L_P(\boldsymbol\beta). \]


10.3 Significance testing

As with the MLE, when β is p-dimensional there are three (asymptotically equivalent) conventional test statistics used to test the null hypothesis β = β₀:

\[ \text{Wald statistic:}\quad \xi^2_W := (\hat{\boldsymbol\beta} - \boldsymbol\beta_0)^T J(\hat{\boldsymbol\beta})\,(\hat{\boldsymbol\beta} - \boldsymbol\beta_0); \]
\[ \text{Score statistic:}\quad \xi^2_{SC} := U(\boldsymbol\beta_0)^T J(\boldsymbol\beta_0)^{-1}\,U(\boldsymbol\beta_0); \]
\[ \text{Likelihood ratio statistic:}\quad \xi^2_{LR} := 2\bigl[\ell_P(\hat{\boldsymbol\beta}) - \ell_P(\boldsymbol\beta_0)\bigr], \quad\text{where } \ell_P := \log L_P. \]

Under the null hypothesis these are all asymptotically chi-squared distributed with p degrees of freedom. Here J(β) is the observed Fisher partial information matrix. There is a computable estimate for this, which is fairly straightforward, but notationally slightly tricky in the general (multivariate) case, so we do not include it here. (See equations (4.46) and (4.48) of [ABG08] if you are interested.) As usual, we can approximate the expected information by the observed information.

Here U(β) is the vector of score functions ∂ℓ_P/∂βⱼ.

10.4 Estimating baseline hazard

In relative-risk regression we are usually primarily interested in estimating the coefficients β — the difference between groups. Thus, if we are performing a clinical trial, we are interested to know whether the treatment group had better survival than the control. How long the groups actually survived in an absolute sense is irrelevant to the experimental outcome.

But that doesn't mean that the hazard rates are merely a nuisance. And even when survival rates are the primary concern, if hazard is affected by a known covariate there is no way to estimate the appropriate survival function for individuals without using a regression model to pool the data over different values of the covariate.¹ Even when the covariate has only a small number of categories, so that pooling within categories is possible, it may still be worth using a regression model — as long as it is plausibly accurate — which allows us effectively to pool more individuals together to estimate a single survival curve. Thus variance is reduced, at the cost of potentially introducing a bias from modelling error (because the regression assumptions won't hold exactly).

¹The case of hazards being affected by an unobservable covariate will be discussed in Lecture 16.


10.4.1 Breslow’s estimator

We start as usual with the total counting process N(t), with intensity

    λ(t) = α0(t) Σ_{i=1}^n Y_i(t) r(β, x_i(t)).

This is an example of the multiplicative intensity model, except that now the factors are no longer 0 or 1, but are 0 or a nonzero positive weight r(β, x). We still have the Nelson–Aalen estimator

    Â0(t; β) = ∫_0^t dN(u) / Σ_{i=1}^n Y_i(u) r(β, x_i(u)).

Since β is unknown, we use the estimator β̂, yielding Breslow’s estimator

    Â0(t) = ∫_0^t dN(u) / Σ_{i=1}^n Y_i(u) r(β̂, x_i(u)) = Σ_{tj≤t} 1 / Σ_{i∈Rj} r(β̂, x_i(tj)).   (10.7)
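As a concrete illustration, Breslow’s estimator (10.7) can be computed in a few lines of base R. The function and argument names below are our own (not from any package), and tied events are handled simply by putting the number of events d_j in the numerator:

```r
# Sketch of Breslow's estimator (10.7). Inputs (hypothetical names):
#   time   - observation times
#   status - 1 = event, 0 = censored
#   r      - estimated relative risks r(beta.hat, x_i), one per individual
breslow <- function(time, status, r) {
  tj <- sort(unique(time[status == 1]))          # distinct event times
  dj <- sapply(tj, function(s) sum(time == s & status == 1))
  # denominator: sum of relative risks over the risk set R_j at each t_j
  denom <- sapply(tj, function(s) sum(r[time >= s]))
  data.frame(time = tj, A0 = cumsum(dj / denom))
}

# With all relative risks equal to 1 this reduces to the Nelson-Aalen
# estimator: increments 1/5, 1/4, 1/2, 1 here
out <- breslow(c(1, 2, 3, 4, 5), c(1, 1, 0, 1, 1), rep(1, 5))
out$A0   # 0.20 0.45 0.95 1.95
```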

10.4.2 Individual risk ratios

A benefit of relative risk models is that they give a straightforward interpretation to the question of individual risk. In the case where covariates are constant in time, an individual with covariates x has a relative risk of r(β, x) compared with individuals with baseline covariates (those with r(β, x) = 1; typically corresponding to x = 0). Thus, their cumulative hazard to age t is approximated by

    Â(t | x) = r(β̂, x) Â0(t).   (10.8)

In the case of time-varying covariates we have an individual cumulative hazard

    A(t | x) = ∫_0^t r(β, x(u)) α0(u) du,

which we approximate by

    Â(t | x) = ∫_0^t r(β̂, x(u)) dÂ0(u) = Σ_{tj≤t} r(β̂, x(tj)) / Σ_{i∈Rj} r(β̂, x_i(tj)).   (10.9)


Lecture 11

Relative risk regression, continued

11.1 Dealing with ties

We discuss here the methods of dealing with tied event times for the Cox model.

Until now in this section we have been assuming that the times of events are all distinct. In situations where event times are equal, we can carry out the same computations for Cox regression, only using a modified version of the partial likelihood. Suppose R_j is the set of individuals at risk at time t_j, and D_j the set of individuals who have their event at that time. We assume that the ties are not real ties, but only the result of discreteness in the observation. Then the probability of having precisely those individuals at time t_j will depend on the order in which they actually occurred. For example, suppose there are 5 individuals at risk at the start, and two of them have their events at time t1. If the relative risks were r1, . . . , r5, then the first term in the partial likelihood would be

    r1/(r1 + r2 + r3 + r4 + r5) · r2/(r2 + r3 + r4 + r5) + r2/(r1 + r2 + r3 + r4 + r5) · r1/(r1 + r3 + r4 + r5).

The number of terms is d_j!, so it is easy to see that this computation quickly becomes intractable.

A very good alternative — accurate and easy to compute — was proposed by B. Efron. Observe that the terms differ in the denominator merely by a small change due to the individuals lost from the risk set. If the deaths at time t_j are not a large proportion of the risk set, then we can approximate


this by deducting the average of the risks that depart. In other words, in the above example, the first contribution to the partial likelihood becomes

    r1 r2 / [(r1 + r2 + r3 + r4 + r5)(½(r1 + r2) + r3 + r4 + r5)].

More generally, the log partial likelihood becomes

    ℓ_P(β) = Σ_{tj} { Σ_{i∈Dj} log r(β, x_i(tj)) − Σ_{k=0}^{dj−1} log[ Σ_{i∈Rj} r(β, x_i(tj)) − (k/d_j) Σ_{i∈Dj} r(β, x_i(tj)) ] }.

We take the same approach to estimating the baseline cumulative hazard:

    Â0(t) = Σ_{tj≤t} Σ_{k=0}^{dj−1} [ Σ_{i∈Rj} r(β̂, x_i(tj)) − (k/d_j) Σ_{i∈Dj} r(β̂, x_i(tj)) ]^{−1}.

An alternative approach, due to Breslow, makes no correction for the progressive loss of risk in the denominator:

    ℓ_P^{Breslow}(β) = Σ_{tj} { Σ_{i∈Dj} log r(β, x_i(tj)) − d_j log Σ_{i∈Rj} r(β, x_i(tj)) }.

This approximation is always too small, and tends to shift the estimates of β toward 0. It is widely used as a default in software packages (SAS, not R!) for purely historical reasons.
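The three treatments of a tie can be compared numerically. The base-R sketch below works through the example above — five at risk, two events at the first event time — with made-up relative risks. Note that the "exact" term sums over the 2! possible event orders, so the Efron value should be compared with exact/2!:

```r
# Hypothetical relative risks; individuals 1 and 2 have the tied events
r <- c(1.0, 2.0, 0.5, 1.5, 1.0)
S <- sum(r)

# Exact: sum over the two possible orders of the tied events
exact.term <- r[1]/S * r[2]/(S - r[1]) + r[2]/S * r[1]/(S - r[2])

# Efron: deduct the average of the departing risks from the second denominator
efron.term <- r[1] * r[2] / (S * (S - (r[1] + r[2])/2))

# Breslow: no correction to the second denominator at all
breslow.term <- r[1] * r[2] / S^2

c(exact = exact.term/2, efron = efron.term, breslow = breslow.term)
```

Here Efron’s value (about 0.0741) is very close to the exact average (0.075), while Breslow’s (about 0.0556) is noticeably too small — consistent with the remark above that the Breslow approximation shifts the estimates toward 0.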

11.2 Asymptotic properties of partial likelihood

This section closely follows section 4.1.5 of [ABG08], which in turn follows the treatment of [AG82]. We describe here only the Cox model in the one-dimensional parameter case. Those interested in the more general case can see the reference.

Suppose we have data sampled for n subjects from the Cox model with true parameter β0. Let β̂ be the point at which the partial likelihood is maximised. We show here that for large n, β̂ is approximately normally distributed, with mean β0 and variance 1/(nI(β0)), where I(β0) is the (expected) Fisher information, which we may estimate by the observed information.

The partial log likelihood may be written as

    ℓ(β) = Σ_{i=1}^n ∫_0^ω [ β x_i(u) − log S^{(0)}(β, u) ] dN_i(u),


where ω is an upper limit for time, and S^{(k)}(β, u) := Σ_{i=1}^n Y_i(u) x_i(u)^k e^{βx_i(u)} are the sums defined earlier. The partial score function and the observed information function are then

    U(β) = ℓ′(β) = Σ_{i=1}^n ∫_0^ω [ x_i(u) − S^{(1)}(β, u)/S^{(0)}(β, u) ] dN_i(u),

    J(β) = −n^{-1} ℓ″(β) = n^{-1} ∫_0^ω [ S^{(2)}(β, u)/S^{(0)}(β, u) − (S^{(1)}(β, u)/S^{(0)}(β, u))² ] dN(u).

Since β0 is the true parameter, the difference

    M_i(t) = N_i(t) − ∫_0^t Y_i(u) α0(u) e^{β0 x_i(u)} du

is a martingale. If we evaluate the score at β0 we get

    U(β0) = Σ_{i=1}^n ∫_0^ω [ x_i(u) − S^{(1)}(β0, u)/S^{(0)}(β0, u) ] dM_i(u)
          + ∫_0^ω Σ_{i=1}^n [ e^{β0 x_i(u)} x_i(u) Y_i(u) − (S^{(1)}(β0, u)/S^{(0)}(β0, u)) e^{β0 x_i(u)} Y_i(u) ] α0(u) du.

The second term is 0, because summing over i turns the first term in the brackets into S^{(1)}(β0, u), and the second term into S^{(1)}(β0, u)S^{(0)}(β0, u)/S^{(0)}(β0, u). Thus E[U(β0)] = 0. If we replace ω by a variable t we get a martingale, whose predictable variation is

    ⟨U(β0)⟩(t) = ∫_0^t Σ_{i=1}^n [ x_i(u) − S^{(1)}(β0, u)/S^{(0)}(β0, u) ]² Y_i(u) α0(u) e^{β0 x_i(u)} du
               = ∫_0^t [ S^{(2)}(β0, u) − S^{(1)}(β0, u)²/S^{(0)}(β0, u) ] α0(u) du.

A similar argument says that the observed information satisfies

    J(β0) = n^{-1} ∫_0^ω [ S^{(2)}(β0, u)/S^{(0)}(β0, u) − (S^{(1)}(β0, u)/S^{(0)}(β0, u))² ] dM(u)
          + n^{-1} ∫_0^ω [ S^{(2)}(β0, u) − S^{(1)}(β0, u)²/S^{(0)}(β0, u) ] α0(u) du,


where M = Σ M_i is a martingale. Thus the first term on the right is a martingale, while the second term is 1/n times the predictable variation of the score function. Taking expectations on both sides, we find that n times the expected information, nI(β0), equals the expected predictable variation, which is to say, the variance of the score function.

We now apply the martingale CLT to see that n^{-1/2}U(β0) is approximately normal, with mean 0 and variance I(β0). By definition, the maximum partial likelihood estimator β̂ is the solution to U(β̂) = 0. Since U′(β0) ≈ −nI(β0), we have

    0 = U(β̂) ≈ U(β0) − nI(β0)(β̂ − β0).

This implies that

    √n(β̂ − β0) ≈ n^{-1/2}U(β0)/I(β0)

is approximately normal for large n, with mean 0 and variance 1/I(β0). In particular, this means that β̂ is a consistent estimator, so that J(β̂) may be taken as an estimator for I(β0).
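The whole argument can be checked numerically in base R. The sketch below uses our own illustrative setup — a one-dimensional Cox model with baseline hazard α0(t) = t, β0 = 1 and no censoring — and solves U(β) = 0 directly, using the S^{(k)} sums:

```r
set.seed(5)
n <- 2000
x <- rnorm(n, 0, 0.5)
# Cumulative hazard (t^2/2) e^{beta0 x} with beta0 = 1, so T = sqrt(2 E e^{-x})
T <- sqrt(2 * rexp(n) * exp(-x))

# Partial score U(beta): with no censoring and times sorted increasingly,
# the risk set at the k-th event time is individuals k..n
U <- function(beta) {
  xs <- x[order(T)]
  S0 <- rev(cumsum(rev(exp(beta * xs))))        # S^(0) at each event time
  S1 <- rev(cumsum(rev(xs * exp(beta * xs))))   # S^(1) at each event time
  sum(xs - S1 / S0)
}

beta.hat <- uniroot(U, c(-2, 4))$root
round(beta.hat, 2)    # should be close to beta0 = 1
```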

11.3 The AML example

We continue looking at the leukemia study that we started to consider in section 8.1.4. First, in Figure 11.1 we plot the iterated logarithm of survival against time, to test the proportional hazards assumption. The PH assumption corresponds to the two curves differing by a vertical shift. The result makes this assumption at least credible. (This method, and others, for examining the PH assumption are discussed in detail in chapter 12.)

We code the data with covariate x = 0 for the maintained group, and x = 1 for the non-maintained group. Thus, the baseline hazard will correspond to the maintained group, and e^β will be the relative risk of the non-maintained group.

[Figure 11.1: Iterated log plot (log(−log(Survival)) against time) of survival of the two populations in the AML study, to test the proportional hazards assumption.]

From Table 8.2 we see that the partial likelihood is given by

    L_P(β) = [ e^{2β} / ((12e^β + 11)(11e^β + 11)) ] [ e^{2β} / ((10e^β + 11)(9e^β + 11)) ]
           × [ 1/(8e^β + 11) ] [ e^β/(8e^β + 10) ] [ 1/(7e^β + 10) ] [ 1/(6e^β + 8) ]
           × [ e^β · 1 / ((6e^β + 7)(5.5e^β + 6.5)) ] [ e^β/(5e^β + 6) ] [ e^β/(4e^β + 5) ]
           × [ 1/(3e^β + 5) ] [ e^β/(3e^β + 4) ] [ 1/(2e^β + 4) ] [ e^β/(2e^β + 3) ] [ e^β/(e^β + 3) ] [ 1/2 ].   (11.1)

A plot of LP (β) is shown in Figure 11.2.

[Figure 11.2: A plot of the partial likelihood from (11.1), L_P against β. Dashed line is at β̂ = 0.9155.]

In the one-dimensional setting it is straightforward to estimate β by direct computation. We see the maximum at β̂ = 0.9155 in the plot of Figure 11.2. In more complicated settings, there are good maximisation algorithms built in to the coxph function in the survival package of R. Applying this to the current problem, we obtain:


Table 11.1: Output of the coxph function run on the aml data set.

    coxph(formula = Surv(time, status) ~ x, data = aml)
                     coef  exp(coef)  se(coef)     z      p
    xNonmaintained  0.916        2.5     0.512  1.79  0.074

    Likelihood ratio test=3.38 on 1 df, p=0.0658, n= 23

The z is simply the Z-statistic for testing the hypothesis that β = 0, so z = β̂/SE(β̂). We see that z = 1.79 corresponds to a p-value of 0.074, so we would not reject the null hypothesis at level 0.05.
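The quoted z and p can be reproduced directly from the rounded coefficient and standard error in Table 11.1:

```r
z <- 0.916 / 0.512           # coef / se(coef)
p <- 2 * (1 - pnorm(z))      # two-sided p-value from the normal approximation
round(c(z = z, p = p), 3)    # z = 1.789, p = 0.074
```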

We show the estimated baseline hazard in Figure 11.3; the relevant numbers are given in Table 11.2. For example, the first hazard, corresponding to t1 = 5, is given by

    ĥ0(5) = 1/(12e^β̂ + 11) + 1/(11e^β̂ + 11) = 0.050,

substituting in β̂ = 0.9155.
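As a check, the partial likelihood (11.1) can be maximised numerically in base R — the same direct computation mentioned above, with each factor of (11.1) transcribed as a term of the log partial likelihood:

```r
# Log of the partial likelihood (11.1), term by term
lp <- function(beta) {
  e <- exp(beta)
  log(e^2 / ((12*e + 11) * (11*e + 11))) +
    log(e^2 / ((10*e + 11) * (9*e + 11))) +
    log(1 / (8*e + 11)) + log(e / (8*e + 10)) +
    log(1 / (7*e + 10)) + log(1 / (6*e + 8)) +
    log(e / ((6*e + 7) * (5.5*e + 6.5))) +
    log(e / (5*e + 6)) + log(e / (4*e + 5)) +
    log(1 / (3*e + 5)) + log(e / (3*e + 4)) +
    log(1 / (2*e + 4)) + log(e / (2*e + 3)) +
    log(e / (e + 3)) + log(1/2)
}
beta.hat <- optimize(lp, c(-2, 3), maximum = TRUE)$maximum
round(beta.hat, 4)   # approximately 0.9155
```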


Table 11.2: Computations for the baseline hazard MLE for the AML data, in the proportional hazards model, with maintained group as baseline, and relative risk e^β̂ = 2.498.

         Maintenance      Non-Maintenance       Baseline (control)
    ti   Y^M(ti)  d^M_i   Y^N(ti)  d^N_i   ĥ0(ti)  Â0(ti)  Ŝ0(ti)
     5     11       0       12       2      0.050    0.050   0.951
     8     11       0       10       2      0.058    0.108   0.898
     9     11       1        8       0      0.032    0.140   0.869
    12     10       0        8       1      0.033    0.174   0.841
    13     10       1        7       0      0.036    0.210   0.811
    18      8       1        6       0      0.043    0.254   0.776
    23      7       1        6       1      0.095    0.348   0.706
    27      6       0        5       1      0.054    0.403   0.669
    30      5       0        4       1      0.067    0.469   0.625
    31      5       1        3       0      0.080    0.549   0.577
    33      4       0        3       1      0.087    0.636   0.529
    34      4       1        2       0      0.111    0.747   0.474
    43      3       0        2       1      0.125    0.872   0.418
    45      3       0        1       1      0.182    1.054   0.348
    48      2       1        0       0      0.500    1.554   0.211


[Figure 11.3: Estimated baseline hazard under the PH assumption. The purple circles show the baseline hazard; blue crosses show the baseline hazard shifted up proportionally by a multiple of e^β̂ = 2.5. The dashed green line shows the estimated survival rate for the mixed population (mixing the two estimates by their proportions in the initial population).]

[Figure 11.4: Comparing the estimated population survival under the PH assumption (green dashed line) with the estimated survival for the combined population (blue dashed line), found by applying the Nelson–Aalen estimator to the population, ignoring the covariate.]


11.4 The Cox model in R

The survival package includes a function coxph that computes Cox regression models. To illustrate this, we simulate 100 individuals with hazard rate te^{x_i}, where the x_i are normal with mean 0 and variance 0.25. We also right-censor observations at constant rate 0.5. The simulations may be carried out with the following commands:

require('survival')

n=100

censrate=0.5

covmean=0

covsd=0.5

beta=1

x=rnorm(n,covmean,covsd)

T=sqrt(rexp(n)*2*exp(-beta*x))

# Censoring times

C=rexp(n,censrate)

t=pmin(C,T)

delta=1*(T<C)

Then the Cox model may be fit with the command

cfit=coxph(Surv(t,delta)~x)

> summary(cfit)

Call:

coxph(formula = Surv(t, delta) ~ x)

n= 100, number of events= 50

coef exp(coef) se(coef) z Pr(>|z|)

x 0.8998 2.4590 0.3172 2.836 0.00457 **

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

exp(coef) exp(-coef) lower .95 upper .95

x 2.459 0.4067 1.32 4.579

Page 126: Modern Survival Analysis notes... · O. O. Aalen, O. Borgan, H. K. Gjessing, Survival and Event History Analysis: A Process Point of View Other material will come from • J. P. Klein

Relative risk regression II 118

Concordance= 0.601 (se = 0.049 )

Rsquare= 0.079 (max possible= 0.966 )

Likelihood ratio test= 8.19 on 1 df, p=0.004203

Wald test = 8.04 on 1 df, p=0.004565

Score (logrank) test = 8.18 on 1 df, p=0.004225

If we give a coxph object to survfit it will automatically compute the survival curve estimate (equivalently, the cumulative hazard estimate), using the Efron method as default. We need to give it a data frame of “newdata”, with one column for each covariate. It will output one survival curve for each row of the data frame. In particular, inputting the new data x = 0 we get the baseline survival estimate. If we plot this object it comes by default with a 95% confidence interval. We show the plot in Figure 11.5.

plot(survfit(cfit,data.frame(x=0)),main='Cox example')

tt=(0:300)/100

lines(tt,exp(-tt^2/2),col=2)

legend(.1,.2,c('baseline survival estimate','true baseline'),col=1:2,lwd=2)

[Figure 11.5: Survival estimated from 100 individuals simulated from the Cox proportional hazards model. True baseline survival e^{−t²/2} is plotted in red.]


Suppose now we have a categorical variable — for example, three different treatment groups, labelled 0, 1, 2 — with relative risks 1, 2, 3, let us say. If we were to use the command

cfit=coxph(Surv(t,delta)~x)

we would get the wrong result:

coef exp(coef) se(coef) z p

x 0.335 1.4 0.0865 3.88 0.00011

Likelihood ratio test=15 on 1 df, p=0.000107 n= 300, number of events= 191

The problem is that this assumes the effect is log-linear in the group label: Group 1 has relative risk e^β and Group 2 has relative risk e^{2β}, which is simply wrong.

The survival for the three different groups may be estimated correctly by defining new covariates x1=(x==1) and x2=(x==2) — that is, indicator variables for x being 1 or 2 respectively — and we would then use the command

cfit2=coxph(Surv(t,delta)~x1+x2)

Even easier is to use the factor command, which tells R to treat the vector as non-numeric. If we give the command

cfit2=coxph(Surv(t,delta)~factor(x))

it produces the output

coef exp(coef) se(coef) z p

factor(x)2 0.801 2.23 0.187 4.29 1.8e-05

factor(x)3 0.707 2.03 0.185 3.82 1.3e-04

Likelihood ratio test=23 on 2 df, p=1.01e-05 n= 300, number of events= 191

This comes close to estimating correctly the relative risks 2 and 2.5, which in the first version were estimated as 1.4 and 1.4² = 1.96.

If you want to include time-varying covariates, this may be done crudely by having multiple time intervals for each individual, with all having a start and a stop time, and all but (perhaps) the last being right-censored. (Of course, if individuals have multiple events, then there may be multiple intervals that end with an event.) This allows the covariate to be changed stepwise.
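For instance, an individual whose covariate switches from 0 to 1 at time 3 and who has their event at time 5 would contribute two rows in this "counting process" layout (the variable names here are our own, purely illustrative):

```r
# One individual, covariate change at time 3, event at time 5.
# Each row covers the interval (start, stop]; only the last row is an event.
newdat <- data.frame(
  id    = c(1, 1),
  start = c(0, 3),
  stop  = c(3, 5),
  event = c(0, 1),   # the first interval is "censored" at the change point
  x     = c(0, 1)    # covariate value in force on that interval
)
newdat
# With the survival package this would then be fit as, e.g.,
#   coxph(Surv(start, stop, event) ~ x, data = newdat)
```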


11.5 Graphical tests of the proportional hazards assumption

We describe here several plots that can be made from survival data to determine whether the proportional hazards assumption is reasonable.

11.5.1 Log cumulative hazard plot

The simplest graphical tests require that the covariate take on a few discrete values, with a substantial number of subjects observed in each category. If the covariate is continuous we stratify it, defining a new categorical covariate by the original covariate being in some fixed region.

The first approach is to consider, for categories 1, . . . , m, the Nelson–Aalen estimators Â_i(t) of the cumulative hazard for individuals in category i. If any relative-risk model holds then A_i(t) = r(β, i)A0(t), so that

    log Â_i(t) − log Â_j(t) ≈ log r(β, i) − log r(β, j)

should be approximately constant.
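A quick simulated check of this idea in base R, with two groups whose hazards really are proportional (α2 = 2α1, with α1(t) = t — our own illustrative choice) and no censoring:

```r
set.seed(1)
n <- 5000
T1 <- sqrt(2 * rexp(n))        # cumulative hazard t^2/2
T2 <- sqrt(2 * rexp(n) / 2)    # cumulative hazard t^2: relative risk 2

# Nelson-Aalen estimator for an uncensored sample, evaluated at t:
# sum of 1/Y over the event times <= t
na.est <- function(T, t) {
  s <- sort(T)
  sum((s <= t) / (n - seq_along(s) + 1))
}

# The log difference should be roughly constant at log 2 = 0.693
sapply(c(0.5, 1, 1.5), function(t) log(na.est(T2, t)) - log(na.est(T1, t)))
```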

11.5.2 Andersen plot

In the Andersen plot we plot all the pairs (Â_i(t), Â_j(t)). If the proportional hazards assumption holds then each pair (i, j) should produce (approximately) a straight line through the origin. It is known (cf. section 11.4 of [KM03]) that when the ratio of hazard rates α_i(t)/α_j(t) is increasing, the corresponding Andersen plot is a convex function; decreasing ratios produce concave Andersen plots.

11.5.3 Arjas plot

The Arjas plot is a more sophisticated graphical test, that is capable of testing the proportional hazards assumption for a single categorical covariate, within a model that includes other covariates (that might follow a different sort of model).

Suppose we have fit the model α_i(t) = α(t | x_i) for the hazard rate of individual i — for example, it might be α_i(t) = e^{β^T x_i}α0(t), but it might be something else — and we are interested to decide whether an additional (categorical) covariate z_i (taking on values 0 and 1) ought to be included as well. For each individual we have an estimated cumulative hazard Â(t | x_i).


Define the weighted time on test for individual i at event time t_j as Â(t_j ∧ T_i | x_i), and the total time on test for level g (of the covariate z) as

    TOT_g(t_j) = Σ_{i: z_i = g} Â(t_j ∧ T_i | x_i);

the number of events at level g will be

    N_g(t_j) = Σ_{i: z_i = g} δ_i 1{T_i ≤ t_j}.

The idea is that if the covariate z has no effect, the difference N_g(t_j) − TOT_g(t_j) is a martingale, so a plot of N_g against TOT_g would lie close to a straight line with 45° slope. If levels of z have proportional hazards effects, we expect to see lines of different slopes. If the effects are not proportional, we expect to see curves that are not lines.
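A minimal base-R illustration of why N_g tracks TOT_g when the model is right: take a single level, constant hazard 1 (so the true A(t) = t is used in place of an estimate) and no censoring. Then at the j-th event time the total time on test should stay close to the event count j (a toy setup of our own):

```r
set.seed(2)
n <- 5000
T <- sort(rexp(n))      # event times, true cumulative hazard A(t) = t
N <- seq_len(n)         # number of events by each ordered event time

# Total time on test at a few event times t_j: sum over i of A(t_j ^ T_i)
TOT <- sapply(T[c(1000, 2500, 4000)], function(s) sum(pmin(s, T)))

TOT / N[c(1000, 2500, 4000)]   # each ratio should be close to 1
```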

We give an example in Figure 11.6. We have simulated a population of 100 males and 100 females, whose hazard rates are α_i(t) = e^{x_i + β_male·I_male} t, where x_i ∼ N(0, 0.25). In Figure 11.6(a) we show the Arjas plot in the case β_male = 0; in Figure 11.6(b) we show the plot for the case β_male = 1.

require('survival')

n=100
#n=1000

censrate=0.5

covmean=0

covsd=0.5

maleeff=1

beta=1

xmale=rnorm(n,covmean,covsd)

xfem=rnorm(n,covmean,covsd)

x=c(xmale,xfem)

Tmale=sqrt(rexp(n)*2*exp(-beta*xmale-maleeff))

Tfem=sqrt(rexp(n)*2*exp(-beta*xfem))


[Figure 11.6: Arjas plots for simulated data: (a) β_male = 0; (b) β_male = 1. Each panel plots cumulative hazard (total time on test) against number of events, for the male and female groups.]

T=c(Tmale,Tfem)

# Censoring times
C=rexp(2*n,censrate)

t=pmin(C,T)

delta=1*(T<C)

sex=factor(c(rep('M',n),rep('F',n)))

cfit=coxph(Surv(t,delta)~x)

# Make a plot to see how close the estimated baseline survival is to the correct one
plot(survfit(cfit,data.frame(x=0)))

tt=(0:300)/100

lines(tt,exp(-tt^2/2),col=2)


beta=as.numeric(cfit$coef)

relrisk=exp(beta*x)*exp(maleeff*c(rep(1,n),rep(0,n)))

eventord=order(t)

et=t[eventord]

es=sex[eventord]

ec=x[eventord]

cumhaz=-log(survfit(cfit,data.frame(x=ec))$surv)

haztrunc=sapply(1:(2*n),function(i) pmin(cumhaz[i,i],cumhaz[,i]))

haztruncmale=haztrunc[,es=='M']

haztruncfem=haztrunc[,es=='F']

# Maximum cumulative hazard comes for individual i when we've gotten
# to the row corresponding to eventord[i]
TOTmale=apply(haztruncmale,1,sum)

TOTfem=apply(haztruncfem,1,sum)

# Count the events in each group, in event-time order
Nmale=cumsum((es=='M')*delta[eventord])

Nfem=cumsum((es=='F')*delta[eventord])

plot(Nmale,TOTmale,xlab='Number of events',ylab='Cumulative hazard',type='l',col=2,lwd=2,main=paste('Male effect=',maleeff))

abline(0,1)

lines(Nfem,TOTfem,col=3,lwd=2)

legend(.1*n,.4*n,c('male','female'),col=2:3,lwd=2)

11.5.4 Leukaemia example

Section 1.9 of [KM03] describes data on 101 leukaemia patients, comparing the disease-free survival time between 50 who received allogenic bone marrow transplants and 51 who received autogenic transplants. We follow section 11.4 of that book in testing the proportional hazards model with two graphical tests. The data are available in the object alloauto in KMsurv.

Both plots show that the data are badly suited to a proportional hazards model. The Andersen plot looks clearly convex, suggesting that the hazard ratio of autogenic to allogenic is increasing. This is also what one would infer from the crossing of the two log cumulative hazard curves.

[Figure 11.7: Graphical tests for the proportional hazards assumption in the alloauto data: (a) log cumulative hazard plot (log cumulative hazard against time in weeks); (b) Andersen plot (log cumulative autologous hazard against log cumulative allogenic hazard).]


Lecture 12

Model diagnostics

At the end of the previous lecture we looked at graphical plots that can be used to test the appropriateness of the proportional hazards model. We now look in greater depth at issues of model diagnostics.

12.1 General principles of model selection

12.1.1 The idea of model diagnostics

A statistical model is a family of probability distributions, whose realisations are data of the sort that we are trying to analyse. Fitting a model means that we find the member of the family that is closest to the data in terms of some criterion. If the model is good, then this best-fit model will be a reasonable proxy — for all sorts of inference purposes — for the process that originally generated the data. That means that other properties of the data that were not used in choosing the representative member of the family should also be close to the corresponding properties of the model fit.

(An alternative approach, that we won’t discuss here, is model averaging, where we accept up front that no model is really correct, and so give up the search for the “one best”. Instead, we draw our statistical inferences from all the models in the family simultaneously, appropriately weighted for how well they suit the data.)

The idea, then, is to look at some deviations of the data from the best-fit model — the residuals — which may be represented in terms of test statistics or graphical plots whose properties are known under the null hypothesis that the data came from the family of distributions in question, and then evaluate the performance. Often, this is done not in the sense of formal hypothesis testing — after all, we don’t expect the data to have come exactly from the model, so the difference between rejecting and not rejecting the null hypothesis is really just a matter of sample size — but of evaluating whether the deviations from the model seem sufficiently crass to invalidate the analysis. In addition, residuals may show not merely that the model does not fit the data adequately, but also what the systematic difference is, pointing the way to an improved model. This is the main application of martingale residuals, which we will discuss in section 13.1. Alternatively, it may show that the failure is confined to a few individuals. Together with other information, this may lead us to analyse these outliers as a separate group, or to discover inconsistencies in the data collection that would make it appropriate to analyse the remaining data without these few outliers. The main tool for detecting outliers is the deviance residual, which we discuss in section 13.2.1.

12.1.2 A simulated example

Suppose we have data simulated from a very simple survival model, where individual i has constant hazard 1 + x_i, where x_i is an observed positive covariate, with independent right censoring at constant rate 0.2. Now suppose we choose to fit the data to a proportional hazards model. What would go wrong? Not surprisingly, for such a simple model, the main conclusion — that the covariate has a positive effect on the hazard — would still be qualitatively accurate. But what about the estimate of baseline hazard?

We simulated this process with 1000 individuals, where the covariates were the absolute values of independent normal random variables. We must first recognise that it is not entirely clear what it even means to evaluate the accuracy of fit of such a misspecified model. If we plug the simulated data into the Cox model, we necessarily get an exponential parameter out, telling us that the hazard rate corresponding to covariate x is e^{β̂x}. Since the hazard is actually 1 + βx, it is not clear what it would mean to say that the parameter was well estimated. Certainly, positive β should be translated into positive β̂. Similarly, the baseline hazard of the Cox model agrees with the baseline hazard of the additive-hazards model, in that both are supposed to be the hazard rates for an individual with covariates 0, but their roles in the two models are sufficiently different that any comparison is on uncertain ground.

Still, the fitted Cox model makes a prediction about the hazard rate of an individual whose covariates are all 0, and that prediction is wrong. When we fit the data from 1000 individuals to the Cox model, we get this output:

      coef  exp(coef)  se(coef)     z  p
    x 0.581      1.79    0.0559  10.4  0

    Likelihood ratio test=98.6 on 1 df, p=0, n= 1000, number of events= 882

In Figure 12.1(a) we see the baseline survival curve estimated from these data by the Cox model. The confidence region is quite narrow, but we see that the true survival curve — the red curve — is nowhere near it. In Figure 12.1(b) we have a smoothed version of the hazard estimate, and we see that forcing the data into a misspecified Cox model has turned the constant baseline hazard into an increasing hazard.

[Figure 12.1: Baseline survival and hazard estimated from the Cox proportional hazards model for data simulated from the additive hazards model: (a) Cox baseline survival estimate; (b) Cox smoothed hazard estimate. Red is the true baseline hazard.]

12.2 Cox–Snell residuals

Given a linear regression model Y_i = βX_i + ε_i, we stress-test the model by looking at the residuals Y_i − β̂X_i. If the model is reasonably appropriate, the residuals should look something like a sample from the distribution posited for the errors ε_i. There are many ways the residuals can fail to have the “right” distribution — wrong tails, dependence between different residuals, changing distributions over time or depending on X_i — and consequently many ways to test them, ranging from formal test statistics to graphical plots that are evaluated by eye.

In evaluating regression models for survival we are doing something similar, except that the connection between the individual observations and the parameters being estimated is much more indirect, thus demanding more ingenuity to even define the residuals, and then to evaluate them. There is a large family of different residuals that have been defined for survival models, each of which is useful for different parts of the task of model diagnostics, including:

• Generally evaluating the appropriateness of a regression model (such as Cox proportional hazards or additive hazards);

• Specifically evaluating assumptions of the regression model (such as the proportional hazards assumption, or the log-linear action of the covariates);

• Finding specific outlier individuals in an otherwise reasonably well specified model.

The most basic version is called the Cox–Snell residual. It is based on the observation that if T is a sample from a distribution with cumulative hazard function H, then H(T) has an exponential distribution with parameter 1.

Given a parametric model H(T, β) we would then generate and evaluate Cox–Snell residuals as follows:

(i). We use the samples (T_i, δ_i) to estimate a best fit β̂;

(ii). Compute the residuals r_i := H(T_i, β̂);

(iii). If the model is well specified — a good fit to the data — then (r_i, δ_i) should be like a right-censored sample from a distribution with constant hazard 1.

(iv). A standard way of evaluating the residuals is to compute and plot a Nelson–Aalen estimator for the cumulative hazard rate of the residuals. The null hypothesis — that the data came from the parametric model under consideration — would predict that this plot should lie close to the line y = x.
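Here is a worked sketch of steps (i)–(iv) in base R for the simplest parametric model, a constant hazard λ (so H(t, λ) = λt) with right censoring; the setup and names are ours, chosen for illustration:

```r
set.seed(4)
n <- 10000
T <- rexp(n, rate = 2)              # true constant hazard 2
C <- rexp(n, rate = 1)              # independent right censoring
t <- pmin(T, C); delta <- 1 * (T < C)

# (i) maximum likelihood fit: events divided by total time at risk
lambda.hat <- sum(delta) / sum(t)

# (ii) Cox-Snell residuals
r <- lambda.hat * t

# (iii)-(iv) Nelson-Aalen estimator of the residuals' cumulative hazard;
# under a correct model it should track the line y = x
ord <- order(r)
A <- cumsum(delta[ord] / (n - seq_len(n) + 1))
cbind(r = r[ord], A = A)[c(2000, 5000, 8000), ]   # the two columns roughly agree
```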

Of course, things aren’t quite so straightforward when evaluating a semiparametric model (such as the Cox model) or nonparametric model (such as the Aalen additive hazards model). We describe here the standard procedure for computing Cox–Snell residuals for the Cox proportional hazards model (the original application):

(i). We use the samples (T_i, x_i, δ_i) to estimate a best-fit β̂;

(ii). We compute the Breslow estimator Â_0(t) for the baseline hazard;

(iii). Compute the residuals r_i := e^{β̂^T x_i} Â_0(T_i).

After this, we proceed as above. Of course, there is nothing special about the Cox log-linear form of the relative risk. Given any relative risk function r(β, x), we may define residuals r_i := r(β̂, x_i) Â_0(T_i).

12.3 Bone marrow transplantation example

(This is based on Example 11.1 of [KM03].)

The data here are in the object bmt of the KMsurv package. They describe the disease-free survival time of leukaemia patients following a bone marrow transplant. We consider the model with proportional effects for the following variables:

Z1 = patient age at transplant (centred at 28 years);

Z2 = donor age at transplant (centred at 28 years);

Z1 × Z2;

Z3 = patient sex (1=male, 0=female);

Z7 = waiting time to transplant.

We fit the model in R as follows:


> bmcox=coxph(Surv(t2,d3)~z1+z2+z1*z2+z3+z7,data=bmt)
> summary(bmcox)

Call:

coxph(formula = Surv(t2, d3) ~ z1 + z2 + z1 * z2 + z3 + z7, data = bmt)

n= 137, number of events= 83

coef exp(coef) se(coef) z Pr(>|z|)

z1 -0.1142944 0.8919953 0.0356292 -3.208 0.00134 **

z2 -0.0859570 0.9176336 0.0302892 -2.838 0.00454 **

z3 -0.3062033 0.7362370 0.2298515 -1.332 0.18280

z7 0.0001135 1.0001135 0.0003274 0.347 0.72875

z1:z2 0.0036177 1.0036243 0.0009203 3.931 8.46e-05 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

exp(coef) exp(-coef) lower .95 upper .95

z1 0.8920 1.1211 0.8318 0.9565

z2 0.9176 1.0898 0.8647 0.9738

z3 0.7362 1.3583 0.4692 1.1552

z7 1.0001 0.9999 0.9995 1.0008

z1:z2 1.0036 0.9964 1.0018 1.0054

Concordance= 0.605 (se = 0.033 )

Rsquare= 0.104 (max possible= 0.996 )

Likelihood ratio test= 15.06 on 5 df, p=0.01012

Wald test = 18.02 on 5 df, p=0.00292

Score (logrank) test = 18.9 on 5 df, p=0.002009

So we see that patient age and donor age both seem to have strong effects on disease-free survival time (with increasing age acting negatively — that is, increasing the length of disease-free survival — as we might expect if we consider that many forms of cancer progress more rapidly in younger patients). Somewhat surprisingly, the effect of a year of donor age is almost as strong as the effect of a year of patient age. There is also a strong positive interaction term, suggesting that the prognosis for an old patient receiving a transplant from an old donor is not as favourable as we would expect from


simply adding their effects. Thus, for example, the oldest patient was 80 years old, while the oldest donor was 84. The youngest patient was 35, the youngest donor just 30. The model suggests that the 80-year-old patient should have a hazard rate for relapse that is a factor of e^{−0.1143×45} = 0.006 that of the youngest. Indeed, the youngest patient relapsed after just 43 days, while the oldest died after 363 days without recurrence of the disease. The patient with the oldest donor would be predicted to have his or her hazard rate of recurrence reduced by a factor of e^{−0.0860×54} = 0.0096. Indeed, the youngest patient did have the youngest donor, and the oldest patient nearly had the oldest. Had that been the case, the ratio of the hazard rate for the oldest to that of the youngest would have been 0.0096 × 0.006 = 0.00006. Taking account of the interaction term, we see that the actual hazard ratio predicted by the model is

exp{ −0.1143 × 45 − 0.0860 × 54 + 0.0036177 × (45 × 54) } ≈ 0.37.


Lecture 13

Model diagnostics, continued

13.1 Martingale residuals

13.1.1 Definition of martingale residuals

A close approximation to the residuals that we use in the linear-regression setting is given by the martingale residuals. (A good source for the practicalities of martingale residuals is chapter 11 of [KM03]. The mathematical details are best presented in section 4.5 of [FH91].)

We present the martingale residuals for the Cox model, and leave the additive-hazards variant for the exercises. We start with the “individual martingale”

M_i(t) = N_i(t) − ∫_0^t Y_i(s) e^{β^T x_i(s)} dA_0(s).

This represents the difference between the number of events observed for individual i up to time t and the expected number. We turn this into a residual by replacing the true distribution by the estimator:

M̂_i(t) := N_i(t) − ∫_0^t Y_i(s) e^{β̂^T x_i(s)} dÂ_0(s).    (13.1)

Usually we refer to M̂_i := M̂_i(∞) as the martingale residual. When the covariates are constant in time,

M̂_i = δ_i − e^{β̂^T x_i} Â_0(T_i),

so it differs from the Cox–Snell residual only by a constant. An individual martingale residual is a very crude measure of deviation, since for any given t the only observation here that is relevant to the survival model is N_i(t), which is a binary observation.


If there are no ties, or if we use the Breslow method for resolving ties, the sum of all the martingale residuals is 0. Assuming no ties, we have

∑_{i=1}^n M̂_i(t) = ∑_{i=1}^n [ N_i(t) − ∫_0^t Y_i(s) e^{β̂^T x_i(s)} dÂ_0(s) ]

= ∑_{i=1}^n [ N_i(t) − ∫_0^t Y_i(s) e^{β̂^T x_i(s)} ( ∑_{j=1}^n dN_j(s) / ∑_{k=1}^n Y_k(s) e^{β̂^T x_k(s)} ) ] = 0.
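The same cancellation can be checked numerically in a simpler parametric setting. In the Python sketch below (illustrative; not from the notes, which use R), the exponential MLE plays the role of the Cox/Breslow fit, and the martingale residuals δ_i − λ̂T_i sum to zero exactly at the fitted value:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
T_event = rng.exponential(2.0, n)
C = rng.exponential(3.0, n)
T = np.minimum(T_event, C)
delta = (T_event <= C).astype(float)

# exponential MLE: rate_hat = sum(delta) / sum(T)
rate_hat = delta.sum() / T.sum()

cox_snell = rate_hat * T        # r_i = H_hat(T_i)
mart = delta - cox_snell        # martingale residuals M_i = delta_i - r_i

# sum_i M_i = sum(delta) - rate_hat * sum(T) = 0 by definition of the MLE,
# mirroring the identity above for the Cox/Breslow fit
print(abs(mart.sum()) < 1e-8)
```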

13.1.2 Application of martingale residuals for estimating covariate transforms

Martingale residuals are not very useful in the way that linear-regression residuals are, because there is no natural distribution to compare them to. The main application is to estimate appropriate modifications to the proportional hazards model by way of covariate transformation: instead of a relative risk of e^{βx}, it might be e^{f(x)}, where f(x) could be 1_{x<x_0} or √x, or something else.

We assume that in the population x_i and z_i are independent. This won’t be exactly true in reality, but obviously a strong correlation between two variables complicates efforts to disentangle their effects through a regression model. We derive the formula under the assumption of independence, understanding that the results will be less reliable the more intertwined the variable z_i is with the others.

Suppose the data (T_i, x_i(·), z_i, δ_i) are sampled from a relative-risk model with two covariates: a vector x_i and an additional one-dimensional covariate z_i, with

log r(β, x, z) = β^T x + f(z);

that is, the Cox regression model holds except with regard to the last covariate, which acts as h(z) := e^{f(z)}. Let β̂ be the p-dimensional vector corresponding to the Cox model fit without the covariate z, and let M̂_i be the corresponding martingale residuals.

Another complication is that we use β̂ instead of β (which we don’t know). We will derive the relationship under the assumption that they are equal; again, errors in estimating β will make the conclusions less correct. For large n we may assume that β̂ and β are close.


Let

h(s, x) := E[ Y(s) e^{f(z)} | x ] / E[ Y(s) | x ] = E[ h(z) | Y(s) = 1, x ];

h̄(s) := E[ h(s, x) ].

We assume that h(s, x) is approximately the constant h̄(s).

Fact 13.1.

E[ M̂ | z ] ≈ ( ∑ δ_i / n ) ( f(z) − log h̄(∞) ).    (13.2)

Thus, we may estimate f(z_0) by estimating the local average of M̂, averaged over z close to z_0. For instance, we can compute a LOESS smooth curve fit to the scatterplot of points (z_i, M̂_i).

The basic idea is that the martingale residuals measure the excess events: the difference between the observed and expected number of events. If we compute the expectation without taking account of z, then individuals whose z value has f(z) large positive will seem to have a large number of excess events; and those whose f(z) is large negative will seem to have fewer events than expected.

A reasonably formal proof may be found in [TGF90]; it is also reproduced in chapter 4 of [FH91].
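A small simulation makes the mechanism visible. In this hedged Python sketch (all details are illustrative, not from the notes), the true effect f(z) = 1_{z>0} is omitted from the fitted model, and local averages of the martingale residuals against z recover the step shape, as Fact 13.1 predicts:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
z = rng.normal(size=n)
f = np.where(z > 0, 1.0, 0.0)             # true omitted effect f(z) = 1{z>0}
T = rng.exponential(1.0 / np.exp(f), n)   # hazard exp(f(z)); no censoring

# fit the model that omits z: a single constant hazard
rate_hat = n / T.sum()
mart = 1.0 - rate_hat * T                 # delta_i = 1 for everyone here

# local averages of M_i against z (a crude stand-in for a LOESS smooth)
low = mart[z < -0.5].mean()
high = mart[z > 0.5].mean()
print(low < high)      # subjects with f(z) = 1 show excess events
```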

13.2 Outliers and leverage

We don’t expect a model to be exactly right, but what does it mean for it to be “close enough”? An important way a model can go badly wrong is if the model fit is dominated by a very small number of the observations. This can happen either if there are extreme values of the covariates, or if an outcome (the time in a survival model) is exceptionally far from the predicted value. The former are called high-leverage observations, the latter are outliers. We need tools for identifying these observations, and for determining how much influence they exert over the model fit.

13.2.1 Deviance residuals

The martingale residual for an individual is

observed # events − expected # events

for individual i. In principle, large values indicate outliers — results that individually are unexpected if the model is true. The problem is that these are highly skewed variables — they take values between −∞ and 1 — and it is hard to determine what their distribution should look like.

We define the deviance residual for individual i as

d_i := sgn(M̂_i) ( −2 [ M̂_i + δ_i log(δ_i − M̂_i) ] )^{1/2}.    (13.3)

This choice of scaling is inspired by the intuition that each d_i should represent the contribution of that individual to the total model deviance. Whereas the martingale residuals are between −∞ and 1, the deviance residuals should have a similar range to a standard normal random variable. Thus, we treat values outside a range of about −2.5 to +2.5 as potential outliers.
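Formula (13.3) translates directly into code. Here is a hedged Python helper (the function name is ours; in R the survival package supplies the same quantity via resid(fit, type="deviance")):

```python
import numpy as np

def deviance_residuals(mart, delta):
    """d_i = sgn(M_i) * sqrt(-2 * [M_i + delta_i * log(delta_i - M_i)]),
    with the convention that the log term vanishes when delta_i = 0."""
    mart = np.asarray(mart, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # delta * log(delta - M): evaluates to 0 for censored subjects (delta = 0)
    log_term = delta * np.log(np.where(delta > 0, delta - mart, 1.0))
    inner = mart + log_term               # always <= 0, so -2*inner >= 0
    return np.sign(mart) * np.sqrt(-2.0 * inner)

# martingale residuals lie in (-inf, 1]; the transform symmetrises them
m = np.array([0.9, 0.2, -0.5, -2.0])
d = deviance_residuals(m, np.array([1, 1, 0, 0]))
print(np.round(d, 3))
```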

Recall that deviance is defined as

D = 2 [ log likelihood(saturated) − log likelihood(β̂) ].

Applied to the Cox model, the saturated model is the one where each individual has an individual parameter β*_i. Then

D = 2 sup_{β*} ∑_{i=1}^n { ∫_0^∞ ( log e^{β*_i x_i} − log e^{β̂ x_i} ) dN_i(s) − ∫_0^∞ Y_i(s) ( e^{β*_i x_i} − e^{β̂ x_i} ) dA_0(s) }.

Since the terms separate, this is maximised when

δ_i = N_i(∞) = ∫_0^∞ Y_i(s) e^{β*_i x_i} dA_0(s).

If A_0 is replaced by the Breslow estimator Â_0, we have

β*_i x_i − β̂ x_i = log [ ∫_0^∞ Y_i(s) e^{β*_i x_i} dÂ_0(s) / ∫_0^∞ Y_i(s) e^{β̂ x_i} dÂ_0(s) ] = log [ N_i(∞) / ( N_i(∞) − M̂_i ) ],

so

D = 2 ∑_{i=1}^n { − log( e^{β̂ x_i} / e^{β*_i x_i} ) ∫_0^∞ dN_i(s) − M̂_i }

= 2 ∑_{i=1}^n { ( log [ N_i(∞) / ( N_i(∞) − M̂_i ) ] ) N_i(∞) − M̂_i }

= ∑_{i=1}^n d_i^2.


13.2.2 Schoenfeld residuals

At least as important as checking the log-linearity of a covariate’s influence in the Cox proportional hazards model is checking that the influence is constant in time. For this purpose the standard tool is the Schoenfeld residuals. The formal derivation is left as an exercise, but the definition of the j-th Schoenfeld residual for parameter β_k is

S_{kj}(t_j) := X_{i_j k} − X̄_k(t_j),

where as usual i_j is the individual with an event at time t_j, and

X̄_k(t) = ∑_{i=1}^n Y_i(t) X_{ik}(t) e^{β̂^T X_i(t)} / ∑_{i=1}^n Y_i(t) e^{β̂^T X_i(t)}

is the weighted mean of covariate X_k at time t. Thus the Schoenfeld residual measures the difference between the covariate of the individual with an event at time t_j and the average covariate at that time. If β_k is constant we expect this to be 0. If the effect of X_k is increasing, we expect the estimated parameter β̂_k to be an overestimate early on — so the individuals with events then have lower X_k than we would have expected, producing negative Schoenfeld residuals; at later times the residuals would tend to be positive. Thus, increasing effect is associated with increasing Schoenfeld residuals. Likewise, decreasing effect is associated with decreasing Schoenfeld residuals. As with the martingale residuals, we typically make a smoothed plot of the Schoenfeld residuals, to get a general picture of the time trend.
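To make the definition concrete, here is a small Python sketch (illustrative, not from the notes) that computes Schoenfeld residuals for a single fixed covariate with no censoring: at each event time, the covariate of the subject with the event minus the risk-set-weighted mean.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)                         # a single fixed covariate
beta = 0.7
T = rng.exponential(1.0 / np.exp(beta * x), n)
order = np.argsort(T)
T, x = T[order], x[order]                      # sort by event time; no censoring

def schoenfeld_residuals(x, beta_hat):
    """S_j = x_{i_j} - xbar(t_j), where xbar is the exp(beta*x)-weighted
    mean of x over the risk set {i : T_i >= t_j} (here: positions j, ..., n-1)."""
    res = np.empty(len(x))
    for j in range(len(x)):
        w = np.exp(beta_hat * x[j:])           # risk-set weights
        res[j] = x[j] - np.average(x[j:], weights=w)
    return res

s = schoenfeld_residuals(x, beta)
print(len(s), round(float(s.mean()), 3))
```

Under a constant true effect, a smooth of s against the event times should show no trend.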

We can also make a formal test of the hypothesis that β_k is constant by fitting a linear regression line to the Schoenfeld residuals as a function of time, and testing the null hypothesis of zero slope against the alternative of nonzero slope. Of course, such a test will have little or no power to detect nonlinear deviations from the hypothesis of constant effect — for instance, threshold effects, or effects that change direction.

13.2.3 Delta–beta residuals

A variant of the Schoenfeld residual is the score residual, discussed in problem 1 of sheet 6. We assume here that we have a process satisfying the multiplicative intensity model, with fitted regression model Â(s | x) for the intensity. The score residuals for covariate x_k are

r^{Sc,k}_i = ∫_0^∞ Y_i(s) [ x_{ik}(s) − x̄_k(s) ] ( dN_i(s) − dÂ(s | x_i) ).    (13.4)


The sum of these (over all individuals i) is the score function — the derivative of the log likelihood — so the sum will be 0 if Â is an MLE. This can be thought of as a martingale residual weighted by the leverage (given by the Schoenfeld residual) of that observation.

The main use typically made of the score residuals is to approximate the so-called “delta–beta” residuals for measuring the influence of an individual observation. The delta–beta residual for parameter β_k and subject i is defined as

Δβ_{ki} := β̂_k − β̂_{k(i)},

where β̂_{k(i)} is the estimate of β_k with individual i removed. This is expensive to compute, since it requires that we recalculate the model n times for n subjects. We can approximate it by

Δβ_{ki} ≈ ∑_{ℓ=1}^p (J^{−1})_{kℓ} r^{Sc,ℓ}_i.

Individuals with high values of Δβ should be looked at more closely. They may reveal evidence of data-entry errors, interactions between different parameters, or just the influence of extreme values of the covariate. You should be particularly worried if there are high-influence individuals pushing the parameters of interest in one direction.
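The quality of this one-step approximation is easy to check in a model simple enough that the leave-one-out estimates have a closed form. The Python sketch below uses a one-parameter exponential model as a hedged stand-in for the Cox fit: the approximation J^{-1} × (per-subject score) tracks the exact case deletions closely.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
T = rng.exponential(2.0, n)
delta = np.ones(n)                   # no censoring, for simplicity

lam = delta.sum() / T.sum()          # MLE of the exponential rate

# one-step approximation: delta-lambda_i ~ J^{-1} * score_i
score = delta / lam - T              # per-subject score at the MLE
J = delta.sum() / lam**2             # observed Fisher information
approx = score / J

# exact leave-one-out values, recomputed n times (what we want to avoid)
exact = lam - (delta.sum() - delta) / (T.sum() - T)

print(np.corrcoef(approx, exact)[0, 1] > 0.99)
```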

13.3 Residuals in R

As an object-oriented language, R has functions in place for any well-written package to produce standard kinds of outputs in an appropriate way. In particular, if fit is the output of a model-fitting procedure, resid(fit) should produce sensible residuals. If fit is an object of type coxph, the residuals can be "martingale", "deviance", "score", "schoenfeld", "dfbeta", "dfbetas", or "scaledsch" (scaled Schoenfeld).

13.3.1 Dutch Cancer Institute (NKI) breast cancer data

We apply these methods to the data on survival of Dutch breast cancer patients collected in the nki data set, and discussed in [vHS14]. The data are available in the dynpred package; this package has been removed from CRAN, but may still be found in the archive at http://cran.r-project.org/src/contrib/Archive/dynpred/. The most interesting thing about the study that these data come from is that it was one of the first to relate


survival of breast cancer patients to gene expression data. The data and some results are described in [VDVHvV+02].

The data frame includes the following covariates:

patnr Patient identification number

d Survival status; 1 = death; 0 = censored

tyears Time in years until death or last follow-up

diameter Diameter of the primary tumor

posnodes Number of positive lymph nodes

age Age of the patient

mlratio Oestrogen expression level

chemotherapy Chemotherapy used (yes/no)

hormonaltherapy Hormonal therapy used (yes/no)

typesurgery Type of surgery (excision or mastectomy)

histolgrade Histological grade (Intermediate, poorly, or well differentiated)

vasc.invasion Vascular invasion (-, +, or +/-)

We begin by fitting the Cox model, including all the potentially significant covariates:

nki.surv=with(nki,Surv(tyears,d))
nki.cox=with(nki,coxph(nki.surv~posnodes+chemotherapy+hormonaltherapy+
    histolgrade+age+mlratio+diameter+posnodes+vasc.invasion+typesurgery))

> summary(nki.cox)

Call:
coxph(formula = nki.surv ~ posnodes + chemotherapy + hormonaltherapy +
    histolgrade + age + mlratio + diameter + posnodes + vasc.invasion + typesurgery)

n= 295, number of events= 79

coef exp(coef) se(coef) z Pr(>|z|)

posnodes 0.07443 1.07727 0.05284 1.409 0.158971

chemotherapyYes -0.42295 0.65511 0.29769 -1.421 0.155381

hormonaltherapyYes -0.17160 0.84232 0.44233 -0.388 0.698062


histolgradePoorly diff 0.26550 1.30409 0.28059 0.946 0.344030

histolgradeWell diff -1.30782 0.27041 0.54782 -2.387 0.016972 *

age -0.03937 0.96139 0.01953 -2.016 0.043816 *

mlratio -0.75031 0.47222 0.21138 -3.550 0.000386 ***

diameter 0.01976 1.01996 0.01323 1.493 0.135334

vasc.invasion+ 0.60286 1.82733 0.25315 2.381 0.017245 *

vasc.invasion+/- -0.14580 0.86433 0.49104 -0.297 0.766530

typesurgerymastectomy 0.15043 1.16233 0.24864 0.605 0.545183

We could remove the insignificant covariates stepwise, or use AIC, or some other model-selection method, but let us suppose we have reduced it to the model including just histological grade, vascular invasion, age, and mlratio (the crucial measure of oestrogen-receptor gene expression). We then get

summary(nki.cox)

Call:

coxph(formula = nki.surv ~ histolgrade + age + mlratio + vasc.invasion)

n= 295, number of events= 79

coef exp(coef) se(coef) z Pr(>|z|)

histolgradePoorly diff 0.42559 1.53050 0.26914 1.581 0.113810

histolgradeWell diff -1.31213 0.26925 0.54782 -2.395 0.016611 *

age -0.04302 0.95790 0.01961 -2.194 0.028271 *

mlratio -0.72960 0.48210 0.20754 -3.515 0.000439 ***

vasc.invasion+ 0.64816 1.91203 0.24205 2.678 0.007410 **

vasc.invasion+/- -0.04951 0.95169 0.48593 -0.102 0.918840

13.3.2 Complementary log-log plot

The first model diagnostic is to plot the cumulative hazard on a log scale for different values of the covariate. We show this in Figure 13.1. Figure 13.1(a) shows the cumulative hazard calculated from the Cox model fit for the 10th, 50th, and 90th percentiles of mlratio. Since this is calculated from the model, the curves are exact vertical shifts of each other; this is what the plot would look like in the ideal case. Figure 13.1(b) shows the Nelson–Aalen estimator for the three sub-populations formed from the upper, middle, and lower tertiles of mlratio. It does not fit perfectly, but it is not obviously wrong to see them as vertical shifts of one another. Note that the second


and third tertiles are very similar; the big difference is between the lower tertile and the rest.

Figure 13.1: Log cumulative hazards of NKI data. (a) Model fit for different mlratio levels (−1.2, −0.1, 0.35). (b) Nelson–Aalen estimator for tertiles of mlratio. Axes: Time (Years) against log(cum hazard).

Figure 13.2 shows a similar plot where the population has been stratified by vascular invasion status. We see here that the effect of the vascular invasion variable — reflected in the gap between the log cumulative hazard curves — seems to increase over time. The code to generate this plot is:

nki.km3=survfit(nki.surv~factor(nki$vasc.invasion))
plot(nki.km3,mark.time=FALSE,xlab='Time (Years)',ylab='log(cum hazard)',
    main='Vascular invasion',conf.int=FALSE,col=c(2,1,3),fun=myfun,firstx=1)
legend(10,-3,c('-','+','+/-'),col=c(2,1,3),title='Vascular invasion',lwd=2)

13.3.3 Andersen plot

We show an Andersen plot for the vascular invasion covariate in Figure 13.3. We have already observed that the effect seems to increase with time, and this is reflected in the convex shape of the curve.


Figure 13.2: Log cumulative hazards of NKI data for subpopulations stratified by vascular invasion status (levels −, +, +/−). Axes: Time (Years) against log(cum hazard).

13.3.4 Cox–Snell residuals

We compute the Cox–Snell residuals by simply extracting the martingale residuals automatically computed by the resid function, and subtracting them from the censoring indicator:

nki.mart=residuals(nki.cox)

nki.CS=nki$d-nki.mart

We then test for the goodness of fit by computing a Nelson–Aalen estimator for the residuals, and plotting the line y = x for comparison.

nki.CStest=survfit(Surv(nki.CS,nki$d)~1)
plot(nki.CStest,fun='cumhaz',mark.time=FALSE,xmax=.8,
    main='Cox-Snell residual plot for NKI data')
abline(0,1,col=2)

13.3.5 Martingale residuals

We know that the effect of the gene-activity covariate mlratio is highly significant. But does it fit the Cox log-linear model assumption? We test


Figure 13.3: Andersen plot of NKI data for subpopulations stratified by vascular invasion status. Axes: cumulative hazard (−) against cumulative hazard (+).

this by fitting the model without mlratio, and then plotting the martingale residuals against mlratio, using a local smoother to get a picture of how the martingale residuals change, on average, with values of this covariate. The code is:

nki.nogen=with(nki,coxph(nki.surv~histolgrade+age+vasc.invasion))
nki.NGmart=resid(nki.nogen,type='martingale')
ord=order(nki$mlratio)
m1=nki.NGmart[ord]
m2=nki$mlratio[ord]
plot(nki$mlratio,nki.NGmart,xlab='mlratio',ylab='Martingale residual')
lines(lowess(m2,m1),lwd=2,col=2)

This produces the result in Figure 13.5. We see that the effect declines up to a level of about −0.5, and is 0 after that. The original paper [VDVHvV+02] did not introduce mlratio as a log-linear effect, but created a binary variable, distinguishing between oestrogen-receptor-negative tumours (mlratio < −0.65) and oestrogen-receptor-positive tumours. The receptor-negative subjects are about 23% of the total. Our analysis suggests that this is a reasonable approach. This fits very well with our observation in section 13.3.2 that there was a clear


Figure 13.4: Nelson–Aalen estimator for the Cox–Snell residuals for the NKI data, together with the line y = x.

difference between the subjects in the lowest tertile of mlratio and all the others, but hardly any difference between the upper two tertiles.


Figure 13.5: Martingale residuals for the model without mlratio, plotted against mlratio, with a LOWESS smoother in red.

13.4 Schoenfeld residuals

Schoenfeld residuals are produced in the survival package by the cox.zph command applied to the output of coxph. The command z=cox.zph(nki.cox) produces the output

rho chisq p

histolgradePoorly diff -0.0661 0.3629 0.547

histolgradeWell diff 0.1686 2.3944 0.122

age 0.0313 0.0919 0.762

mlratio 0.1357 1.5480 0.213

vasc.invasion+ 0.0459 0.1664 0.683

vasc.invasion+/- 0.1312 1.3028 0.254

GLOBAL NA 9.7256 0.137


The column rho gives the correlation of the scaled Schoenfeld residual with time. The chisq and p columns give a test of the null hypothesis that the correlation is zero, meaning that the proportional hazards condition holds (with constant β). GLOBAL gives the result of a chi-squared test of the hypothesis that all of the coefficients are constant. Thus, we may accept the hypothesis that all of the proportionality parameters are constant in time.

Plotting the output of cox.zph gives a smoothed picture of the scaled Schoenfeld residuals as a function of time. For example, plot(z[4]) gives the output in Figure 13.6, showing an estimate of the parameter for mlratio as a function of time. We see that the plot is not perfectly constant, but taking into account the uncertainty indicated by the confidence intervals, it is plausible that the parameter is constant.

Figure 13.6: Scaled Schoenfeld residuals for the mlratio parameter as a function of time.


Lecture 14

Censoring and truncation revisited

This lecture follows closely the presentation in Chapter 5 of [KM03]. Until now, we have only considered right censoring and left truncation. Other circumstances are common, but are not so natural to address in the framework of the multiplicative intensity model. The MIM assumes

(i). that individuals are either under observation at their event time — in which case their event time is observed exactly — or not under observation, in which case there is either no information about the event time (in the case of censoring) or the subject is excluded from consideration altogether (in the case of truncation); and

(ii). the set of individuals under observation at time t is in Ft−.

The first assumption is still satisfied in the special case of pure left censoring, and this allows us to apply essentially the same techniques, by a time-reversal trick that makes the second assumption true. We describe this in section 14.1. On the other hand, in the case of combined left- and right-censoring, when an event time is unobserved we still get the information of whether it occurred before the left-censoring time or after the right-censoring time. Similarly with interval censoring, where the event time is not observed exactly, but is only known to lie within a given interval.

14.1 Left censoring

In pure left censoring, each individual has a censoring time C_i and an event time T_i. We get to observe U_i := T_i ∨ C_i (that is, the maximum of the times),


and the censoring indicator δ_i := 1_{T_i ≥ C_i}. We assume that all individuals have the same hazard rate α(t) at time t. In this setting it is more natural to think of estimating the survival function S(t) := e^{−A(t)}, which is assumed (as usual) to be continuous.

If we fix a time τ that is greater than any time U_i, then U*_i := τ − U_i is equal to (τ − T_i) ∧ (τ − C_i), and δ_i = 1_{τ−T_i ≤ τ−C_i}. Thus (U*_i, δ_i) is a collection of right-censored observations of a random variable whose survival function is F(t) = 1 − S(τ − t).

The procedure is then straightforward:

(i). Choose τ > max U_i, and let U*_i := τ − U_i.

(ii). Let F̂(t) be the Kaplan–Meier estimator for the right-censored observations (U*_i, δ_i).

(iii). The estimator for the survival function is then Ŝ(t) = 1 − F̂(τ − t).
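The three steps translate directly into code. The Python sketch below (illustrative; a minimal Kaplan–Meier helper assuming distinct times is included so the example is self-contained) recovers S(t) for left-censored exponential data by the time reversal:

```python
import numpy as np

def km_curve(times, events):
    """Minimal Kaplan-Meier estimator (distinct times assumed).
    Returns sorted times and the survival estimate just after each."""
    order = np.argsort(times)
    t = np.asarray(times)[order]
    d = np.asarray(events, dtype=float)[order]
    n = len(t)
    surv = np.cumprod(1.0 - d / (n - np.arange(n)))
    return t, surv

def step_eval(t_grid, s_vals, x):
    """Evaluate the right-continuous step function at x (1 before the first jump)."""
    i = np.searchsorted(t_grid, x, side="right") - 1
    return 1.0 if i < 0 else float(s_vals[i])

rng = np.random.default_rng(5)
n = 4000
T_event = rng.exponential(1.0, n)        # true S(t) = exp(-t)
C = rng.exponential(0.3, n)              # left-censoring times
U = np.maximum(T_event, C)               # we observe the maximum
delta = (T_event >= C).astype(float)     # 1 iff the event time is seen exactly

tau = U.max() + 1e-9                     # (i) any tau > max U_i
t_rev, F_rev = km_curve(tau - U, delta)  # (ii) KM on the reversed data

# (iii) S_hat(t) = 1 - F_hat(tau - t); compare with the true S(1) = exp(-1)
s1 = 1.0 - step_eval(t_rev, F_rev, tau - 1.0)
print(abs(s1 - np.exp(-1.0)) < 0.05)
```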

By the same reasoning, we may estimate the variance of Ŝ(t) by

∑_{t_i ≥ t} ( #{ j : U_j ≤ t_i } )^{−2}.

14.2 Right truncation

Survival times are right truncated when individuals are excluded from the study if their event time exceeds a certain threshold. This depends on the particular study design. We will consider here an example, following Chapter 5.3 of [KM03] (and available in the dataset aids), of AIDS induction time — time to disease onset — among 258 adults and 37 children who received infected blood transfusions during an eight-year period from 1979 to 1987. The exact time of infection is known, because it was the time of transfusion, and the time of disease onset is known, but only for those who developed the disease during the study period. The small number of patients who developed the disease later are indistinguishable from the very large number who never developed the disease. In principle we are looking to analyse P{T_i > t | T_i < ∞}, but what we get to estimate is P{T_i > t | T_i < R_i}, where R_i is the truncation time.

If there is a fixed truncation time R, this requires simply that we reinterpret our results: where our estimate Ŝ(t) was formerly an estimate for S(t) = P{T_i > t}, it is now an estimate for P{T_i > t | T_i ≤ R}. When R_i is different for each individual …


Not surprisingly, given that censoring and truncation are mirror images of one another, right truncation can also be dealt with by time reversal. The situation is that there are event times T_i and truncation times R_i (assumed independent), but we only observe the subset of pairs (T_i, R_i) such that R_i > T_i. (That is, we observe both times or neither. This is in contrast to censoring, where we always observe exactly one of the times.)

Let τ = max R_i, and let T*_i = τ − T_i and R*_i = τ − R_i. Then (T*_i, R*_i) are left-truncated observations of a random variable with survival function S*(t) = 1 − S(τ − t).

The procedure is then straightforward:

(i). Choose τ = max R_i, and let T*_i := τ − T_i, R*_i := τ − R_i.

(ii). Let F̂(t) be the Kaplan–Meier estimator for the left-truncated observations (T*_i, R*_i).

(iii). The estimator for the survival function is then Ŝ(t) = 1 − F̂(τ − t).

14.3 Doubly-censored data: Turnbull’s algorithm

Pure left censoring is rare. When some event times are left-censored and others are right-censored, obviously time reversal is no help. We have the problem that times that are left-censored affect the interpretation of any event times that precede the left-censoring times. Since we don’t know exactly when the event occurred, we don’t know exactly when the individual stopped being at risk.

On the other hand, if we knew the survival distribution, we would be able to make a good inference about when the left-censored event “really” occurred. This suggests an iterative algorithm, first proposed by [Tur74]: we use our survival estimate to allocate the left-censored observations, and then use our new beliefs about the left-censored observations to improve our survival estimates.

More precisely, the procedure is as follows:

(i). Start with a grid of times 0 = τ_0 < τ_1 < τ_2 < ··· < τ_m, including all the event times and censoring times. Let d_j be the number of deaths, r_j the number of right-censored observations, and c_j the number of left-censored observations at time τ_j.

(ii). Produce an initial estimate Ŝ_0(τ_j) for the survival function at time τ_j. This could be the Kaplan–Meier estimator computed in the absence of the left-censored observations.


(iii). Suppose now we have the estimator Ŝ_k. We estimate the probability that a left-censored observation at time τ_j actually occurred in the interval (τ_{ℓ−1}, τ_ℓ], where ℓ ≤ j, by

p^(k)_{jℓ} := (Ŝ_k(τ_{ℓ−1}) − Ŝ_k(τ_ℓ)) / (1 − Ŝ_k(τ_j)) ≈ P{τ_{ℓ−1} < T ≤ τ_ℓ | T ≤ τ_j}.

Of course, this approximation would be an equality if Ŝ_k were the true survival function. Note that for each j this is a probability distribution over 1 ≤ ℓ ≤ j.

(iv). We distribute the left-censored observations at time τ_j over all the earlier intervals in proportion to p^(k)_{jℓ}. That is, a death that is imputed to (τ_{ℓ−1}, τ_ℓ] is taken to have occurred at time τ_ℓ. We produce an effective number of deaths

d̃_ℓ := d_ℓ + Σ_{j=ℓ}^{m} c_j p^(k)_{jℓ}.

(v). The estimated number of individuals at risk at time τ_ℓ is

ñ_ℓ = Σ_{j=ℓ}^{m} (d̃_j + r_j).

(vi). We then have the Kaplan–Meier-like estimator

Ŝ_{k+1}(t) = ∏_{τ_j ≤ t} (1 − d̃_j / ñ_j).


Lecture 15

Censoring and truncation, continued

15.1 Interval-censored data

In interval censoring there are no exact observations. For each individual we observe an interval (L_i, R_i] that is known to contain the true event time T_i. If these intervals are non-overlapping then this is not really different from exact observation, from the point of view of non-parametric estimation. (Consider, for example, event times that are simply rounded off to the nearest time unit, as observations inevitably are, and so reported as intervals (T − 1/2, T + 1/2]. Right or left censoring are simply interval censoring on intervals that are unbounded on one side.)

Turnbull adapted his iterative algorithm for interval censoring as follows:

(i). Start with a grid of times 0 = τ_0 < τ_1 < τ_2 < ··· < τ_m, including all the endpoints of the censoring intervals L_i and R_i.

(ii). Produce an initial estimate Ŝ_0(τ_j) for the survival function at time τ_j. This could be, for example, Ŝ_0(τ_j) = (1 − j/m)·(d/n), where d is the total number of events observed and n the number of individuals.

(iii). Suppose now we have the estimator Ŝ_k. We reallocate an event known to have occurred in the interval (L_i, R_i] to the subintervals (τ_{ℓ−1}, τ_ℓ] in proportion to the conditional probability of its having occurred in that subinterval. We define

p^(k)_{iℓ} := (Ŝ_k(τ_{ℓ−1}) − Ŝ_k(τ_ℓ)) / (Ŝ_k(L_i) − Ŝ_k(R_i)) ≈ P{τ_{ℓ−1} < T ≤ τ_ℓ | L_i < T ≤ R_i}


for all ℓ such that L_i ≤ τ_{ℓ−1} < τ_ℓ ≤ R_i. (We take it to be 0 otherwise.) For each i this is a probability distribution over 1 ≤ ℓ ≤ m.

(iv). We distribute each censored observation i over the subintervals (τ_{ℓ−1}, τ_ℓ] in proportion to p^(k)_{iℓ}. That is, a death that is imputed to (τ_{ℓ−1}, τ_ℓ] is taken to have occurred at time τ_ℓ. We produce an effective number of deaths

d^(k)_ℓ := Σ_{i=1}^{n} p^(k)_{iℓ}.

(v). The effective number of individuals at risk at time τ_ℓ is

ñ_ℓ = Σ_{j=ℓ}^{m} d^(k)_j.

(vi). We then have the Kaplan–Meier-like estimator

Ŝ_{k+1}(t) = (1/n) Σ_{j=ℓ+1}^{m} d^(k)_j for t ∈ [τ_ℓ, τ_{ℓ+1}).

That is, we estimate the survival past time τ_ℓ by the fraction of deaths that are estimated to have occurred in intervals after time τ_ℓ.

Note that we have no evidence for when the events occurred within the interval. We may choose instead the continuous estimator

Ŝ_{k+1}(t) = (1/n) [ Σ_{j=ℓ+1}^{m} d^(k)_j − ((t − τ_ℓ)/(τ_{ℓ+1} − τ_ℓ)) d^(k)_{ℓ+1} ] for t ∈ [τ_ℓ, τ_{ℓ+1}).

This is what is done by the icfit function in the interval package in R.
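For concreteness, the interval-censoring iteration (with the step-function estimator of step (vi)) can be sketched as follows. This is an illustrative Python version with our own helper names; icfit is the production implementation:

```python
def turnbull_interval(L, R, tau, n_iter=50):
    """Events known to lie in (L[i], R[i]]; tau is the grid of all interval
    endpoints, with tau[0] = 0. Returns S evaluated at the grid points."""
    n, m = len(L), len(tau) - 1
    S = [1.0 - 0.5 * j / m for j in range(m + 1)]  # any decreasing start
    for _ in range(n_iter):
        # steps (iii)-(iv): effective deaths allocated to each (tau_{l-1}, tau_l]
        dt = [0.0] * (m + 1)
        for Li, Ri in zip(L, R):
            denom = S[tau.index(Li)] - S[tau.index(Ri)]
            for l in range(1, m + 1):
                if Li <= tau[l - 1] and tau[l] <= Ri and denom > 0:
                    dt[l] += (S[l - 1] - S[l]) / denom
        # steps (v)-(vi): survival at tau_l = fraction of mass after tau_l
        S = [sum(dt[l + 1:]) / n for l in range(m + 1)]
    return S
```

With non-overlapping intervals the allocation is trivial and the estimate stabilises immediately at the empirical survival function.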

15.2 Current status data

In [HFB+11] the authors analyse data on survival of avalanche victims. For each incident, in addition to several covariates (location, meaning Switzerland or Canada; type of snow; type of activity), there is a single time observed for each individual, together with whether the individual died (and, if so, the cause of death).

Assume that each individual has an event time T_i and a screening time C_i, with event times and screening times independent (the independent screening assumption). We observe for each individual only C_i and δ_i := 1{T_i ≤ C_i}. Observations for different individuals are assumed independent.

We want to estimate the survival function S(t) for the event times. The independent screening assumption means that for estimating S we only need to consider the conditional likelihood

CL = ∏_{i=1}^{n} S(C_i)^{1−δ_i} (1 − S(C_i))^{δ_i}. (15.1)

15.2.1 Parametric approaches

Suppose now the survival times are exponentially distributed. We then have the log likelihood and its first two derivatives

ℓ(λ) = Σ_{i=1}^{n} δ_i log(1 − e^{−λC_i}) − λ Σ_{i=1}^{n} (1 − δ_i) C_i,

ℓ′(λ) = Σ_{i=1}^{n} δ_i C_i / (e^{λC_i} − 1) − Σ_{i=1}^{n} (1 − δ_i) C_i,

ℓ″(λ) = − Σ_{i=1}^{n} δ_i C_i² e^{−λC_i} / (1 − e^{−λC_i})².

We can solve this numerically, though there is no closed-form analytic solution. If the screening times are all the same, C_i = c, then we can solve ℓ′(λ) = 0 to obtain the MLE

λ̂ = c^{−1} log( n / (n − k) ),

where k = Σ δ_i is the number of individuals for whom events are observed.

The observed Fisher information will then be

I(λ̂) = k c² e^{−λ̂c} / (1 − e^{−λ̂c})².

If these had been ordinary right-censored observations, then the observed Fisher information would have been k/λ̂². Thus, the relative efficiency of the current status data is

λ̂² c² e^{−λ̂c} / (1 − e^{−λ̂c})² = (1 − p) log²(1 − p) / p²,

where p = 1 − e^{−λ̂c} is the probability of the event occurring before the screening time. This is a declining function of p, unsurprisingly. What might seem more surprising is that even when p = 0.5 the relative efficiency is still over 96%; even when p = 0.9, so that 90% of the survival times are being replaced by an indicator 1{T_i ≤ c}, the relative efficiency is still above 65%. This is a consequence of the very strong assumption of the exponential distribution.
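The closed-form MLE and the relative-efficiency formula are easy to check numerically. This is an illustrative sketch (the sample values n = 100, k = 50, c = 2 are made up):

```python
import math

def mle_lambda(n, k, c):
    """Closed-form current-status MLE with equal screening times C_i = c."""
    return math.log(n / (n - k)) / c

def relative_efficiency(p):
    """(1 - p) log^2(1 - p) / p^2, with p = 1 - exp(-lambda * c)."""
    return (1 - p) * math.log(1 - p) ** 2 / p ** 2

print(mle_lambda(100, 50, 2.0))   # log(2)/2, about 0.3466
print(relative_efficiency(0.5))   # about 0.961: over 96%
print(relative_efficiency(0.9))   # about 0.655: still above 65%
```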

Of course, the general conditional likelihood (15.1) may be applied to more complicated distributions. For example, if we wish to fit a Weibull distribution, so that the cumulative hazard at time t is Λ(t) = (λt)^r and the hazard rate is rλ^r t^{r−1}, we have log likelihood

ℓ(λ, r) = Σ_{i=1}^{n} δ_i log(1 − e^{−λ^r C_i^r}) − λ^r Σ_{i=1}^{n} (1 − δ_i) C_i^r.

This may be maximised numerically. We give an example in section 15.2.3.

15.2.2 Nonparametric approaches

In principle, current status data is an extreme form of interval censoring, and so we could compute a nonparametric survival estimator with Turnbull's algorithm. The convergence is very slow, though. Fortunately, there is another approach that is much faster.

Let c_1 < c_2 < ··· < c_k be the screening times for the n individuals. Let n_j be the number of individuals observed at time c_j, and r_j the number of those found not to have failed. We may rewrite the conditional likelihood as

L(S) = ∏_{j=1}^{k} S(c_j)^{r_j} [1 − S(c_j)]^{n_j − r_j}.

As pointed out by [Sun06, section 3.2], we are seeking to maximise this function over all possible values of S(c_j) ∈ [0, 1] under the constraint that S(c_1) ≥ S(c_2) ≥ ··· ≥ S(c_k). The solution to this problem is well known, called the pool adjacent violators algorithm. The idea is: we first assign each S(c_j) the value that would maximise L without the constraint, so Ŝ^(0)(c_j) = r_j/n_j. Of course, this will probably lead to some of the estimates going in the wrong order. The next step is then to look for clusters of time points that go in the wrong order, and pool their observations. Suppose we have some j such that Ŝ^(0)(c_j) < Ŝ^(0)(c_{j+1}) (but Ŝ^(0)(c_{j−1}) ≥ Ŝ^(0)(c_j) and Ŝ^(0)(c_{j+1}) ≥ Ŝ^(0)(c_{j+2})). Then at the next stage we would set

Ŝ^(1)(c_j) = Ŝ^(1)(c_{j+1}) = (r_j + r_{j+1}) / (n_j + n_{j+1}),


while the other values are unchanged. The estimator is then

Ŝ(c_j) = min_{u≤j} max_{v≥j} (Σ_{l=u}^{v} r_l) / (Σ_{l=u}^{v} n_l). (15.2)

For example, suppose we have n = 20 individuals and screening times 1, 2, 4, 5, with five individuals screened at each time and r_j = 5, 2, 3, 1 still alive at those times. We show in Figure 15.1 the estimate r_j/n_j, followed by the correction to a decreasing function.

[Figure 15.1 here: two panels, (a) First step and (b) Corrected, each plotting S against t.]

Figure 15.1: Illustration of survival curve estimation from current status data r_j = 5, 2, 3, 1, n_j = 5, 5, 5, 5, t_j = 1, 2, 4, 5.
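The min-max formula (15.2) is short enough to transcribe directly. This illustrative Python sketch (the function name is ours) reproduces the worked example:

```python
def pava_current_status(r, n):
    """Formula (15.2): S(c_j) = min over u <= j of max over v >= j of
    sum(r[u:v+1]) / sum(n[u:v+1]); r[j] of the n[j] screened at c_j are alive."""
    k = len(r)
    return [
        min(
            max(sum(r[u:v + 1]) / sum(n[u:v + 1]) for v in range(j, k))
            for u in range(j + 1)
        )
        for j in range(k)
    ]

print(pava_current_status([5, 2, 3, 1], [5, 5, 5, 5]))  # [1.0, 0.5, 0.5, 0.2]
```

The violating pair r_2/n_2 = 0.4 < r_3/n_3 = 0.6 is pooled to (2 + 3)/(5 + 5) = 0.5, as in Figure 15.1(b).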

15.2.3 Example of current status data

We simulated n = 100 or n = 1000 samples with hazard rate λ(t) = t, with status observed at a screening time exponentially distributed with parameter 1. The results are shown in Figure 15.2. We used the function gpava from the package isotone. The code is given here.

library(isotone)        # provides gpava
n = 100
t = sqrt(2 * rexp(n))   # event times with hazard rate lambda(t) = t
C = rexp(n)             # screening times with constant hazard 1
delta = (t > C)
## gpava computes increasing solutions; we need to reverse it, then reverse back
gp = gpava(C, -delta)
gp$x = -gp$x
gp$y = -gp$y
plot(gp, main='Current status survival estimator', ylab='S', xlab='time', col=1)
tt = (0:300)/100
lines(tt, exp(-tt^2/2), col=2)
legend(2.5, .8, c('estimator', 'true survival', 'observed'),
       lwd=c(1, 1, NA), pch=c(NA, NA, 1), col=c(1, 2, 1))

[Figure 15.2 here: two panels, (a) n=100 and (b) n=1000, each titled "Current status survival estimator", plotting S against time, with legend: estimator, true survival, observed, Weibull.]

Figure 15.2: Illustration of survival curve estimation from current status data. The data were simulated from a hazard rate λ(t) = t, with screening times independent with constant hazard rate 1. The green dashed lines show parametric estimates from the Weibull distribution.

We see that the nonparametric procedure is working: clearly it is estimating the correct survival function. At the same time, the convergence is extremely slow, with the n = 1000 estimate still very coarse. The parametric estimate from the Weibull family is vastly superior. One may wonder, then, why bother with the nonparametric approach. The answer is that the parametric fit will not be consistent if the data were not drawn from a distribution in the appropriate family. In some cases the distortions can be quite substantial. As an example, we show in Figure 15.3 the results of a simulation from a nonmonotone hazard rate. We have taken α(t) = 0.5 except for t ∈ [1, 1.5], where α(t) = 2.

[Figure 15.3 here: "Current status survival estimator", plotting S against time, with legend: estimator, true survival, observed, Weibull.]

Figure 15.3: Illustration of survival curve estimation from current status data. The data were simulated from a nonmonotone hazard rate. Note that the parametric estimates from the Weibull distribution deviate very substantially from the true survival curve, shown in red.


15.3 Dependent censoring

Back in section 1.4.2 we discussed our assumption that censoring and truncation produced data missing at random. The assumption is that a censoring event at time C is independent of the future event time, given the information available at time C; that is, conditioned on F_C. A version of this assumption was embedded in our multiplicative-intensity model, and in its generalisation to the relative-risk model. It is also satisfied by the additive hazards model.

What does this mean in practice, given that the "future event time" may not even exist? (For example, if we are measuring the time until side-effects from a medication are reported, then once the subject has dropped out or the study has ended, the medication is discontinued and there is no further possibility of an event time.) Individual event intensities (and, in particular, individual at-risk indicators) are predictable processes. Censoring events must not be affected by an unobserved covariate that also influences event times. If there are time-dependent covariates that influence survival, then this also excludes a situation where censoring is bound up with future trajectories of the covariate.

15.3.1 Censoring plot

Dependent censoring by a categorical covariate may be detected with a censoring plot: essentially, a Kaplan–Meier curve with the roles of events and censoring switched. In Figure 15.4(a) we show the censoring "survival curves" for the three different levels of the histological grade variable of the NKI data set. They are very similar, suggesting that censoring may be treated as independent of this covariate. (The plots follow section 4.2.7 of [ABG08].) In Figure 15.4(b) we show the censoring curves for the two different groups in the kidney-dialysis study described in section 8.1.5.


[Figure 15.4 here: panel (a) NKI data, titled "Noncensoring for NKI data (histolgrade variable)", with groups Intermediate, Poorly diff, Well diff; panel (b) kidney data set, titled "Kidney data censoring plot", with groups surgical and percutaneous.]

Figure 15.4: Censoring survival functions for NKI and kidney data sets with respect to a single categorical covariate.

We can also consider whether censoring depends on a quantitative covariate, by fitting a regression model to the censoring times. The additive hazards regression model is convenient because it is flexible in the way it represents effects changing over time. For example, we fit the NKI censoring times to an additive hazards model that includes age, mlratio, and histological grade:

nki.aacens = with(nki, aareg(nki.survcens ~ histolgrade + mlratio + age))
summary(nki.aacens)
$table
                           slope     coef  se(coef)        z       p
Intercept                0.01017  0.00114   0.00467  0.24308 0.80794
histolgradePoorly diff   0.00258  0.00011   0.00126  0.08421 0.93289
histolgradeWell diff     0.04709  0.00261   0.00141  1.84735 0.06470
mlratio                  0.00241 -0.00080   0.00107 -0.74792 0.45451
age                      0.00291  0.00013   0.00010  1.24793 0.21206

Figure 15.5 shows the estimated coefficients of all the covariates over time. (In principle, the plot for histolgradeWell diff is showing the same information as the difference between the green and red curves in Figure 15.4(a), though with some adjustment due to correcting for other covariates.)


[Figure 15.5 here: panels for Intercept, histolgradePoorly diff, histolgradeWell diff, mlratio, and age, each plotting the estimated cumulative coefficient against time.]

Figure 15.5: Parameters for covariates in censoring times for NKI data fitted to additive hazards model. Plotted with the autoplot command in the ggfortify package.

For another example, we consider the pbc data set, the Mayo Clinic Primary Biliary Cirrhosis Data, discussed in Appendix D of [FH91]. This includes 10 years of follow-up on 418 PBC patients, of whom 312 were enrolled in a clinical trial, where they randomly received either the drug D-penicillamine or a placebo; the other 106 were followed up with no particular treatment. Some died, and some were censored by the ending of the trial, which of course is non-informative. But some were censored by having received a transplant, which could be informative. We fit the model described in section 4.2.7 of [ABG08]:

Call:
aareg(formula = pbcsurv ~ albumin + age + factor(edema) + bili)

  n= 418
  25 out of 25 unique event times used

                      slope      coef  se(coef)      z        p
Intercept          5.31e-04  0.032500  0.011500   2.83 4.70e-03
albumin           -7.63e-05 -0.003860  0.002650  -1.46 1.45e-01
age               -4.76e-06 -0.000344  0.000088  -3.91 9.39e-05
factor(edema)0.5   6.80e-05  0.003910  0.003830   1.02 3.08e-01
factor(edema)1    -1.47e-04 -0.008240  0.002270  -3.64 2.76e-04
bili               9.16e-06  0.000568  0.000358   1.59 1.13e-01

Chisq=18.66 on 5 df, p=0.0022; test weights=aalen

The parameter estimates are plotted in Figure 15.6.

[Figure 15.6 here: panels for Intercept, albumin, age, factor(edema)0.5, factor(edema)1, and bili, each plotting the estimated cumulative coefficient against time.]

Figure 15.6: Parameters for covariates in censoring times for PBC data fitted to additive hazards model. Plotted with the autoplot command in the ggfortify package.


15.3.2 Corrected survival estimators

Consider the pbc data set discussed in the previous section. We look just at the 106 individuals who were not included in the study. If we fit an additive hazards model to the survival times we get

Call:
aareg(formula = pbcsurv2 ~ albumin + age + bili + protime)

  n=104 (2 observations deleted due to missingness)
  34 out of 35 unique event times used

               slope      coef  se(coef)       z       p
Intercept  -1.76e-03 -0.064700  0.039200  -1.650 0.09840
albumin    -2.20e-04 -0.004060  0.005540  -0.733 0.46300
age         1.77e-05  0.000631  0.000219   2.880 0.00396
bili        9.89e-05  0.002600  0.000925   2.810 0.00497
protime     1.18e-04  0.004430  0.002210   2.000 0.04520

Chisq=14.54 on 4 df, p=0.0057; test weights=aalen

(We removed the edema variable, since the category edema==1 does not arise in this group, and added the variable protime.) The covariate curves are given in Figure 15.7.


[Figure 15.7 here: panels for Intercept, albumin, age, bili, and protime, each plotting the estimated cumulative coefficient against time.]

Figure 15.7: Parameters for covariates in survival times for PBC data fitted to additive hazards model.

What would it mean to compute a survival curve for this population? Simply computing the Kaplan–Meier estimator would have no sensible interpretation: it would combine survival probabilities for different proportions of the two groups at different times. For the kidney data set we could compute separate survival curves for the two different treatment groups, but that wouldn't work for censoring influenced by a quantitative covariate.

A reasonable ambition would be to estimate the survival curve that we would have observed for this population in the absence of censoring. For a discrete covariate this would mean computing separate survival curves for each category, and combining those curves in the same proportions as the initial population. For continuous covariates, or multiple covariates fitted by some sort of regression model, there are two standard approaches. We follow section 4.7 of [ABG08] in our treatment in the following sections.


Cumulative individual survival

Suppose we have n individuals, with time-constant covariates x_i. We fit a model that yields an individual survival estimator Ŝ(t | x). In the case of the additive hazards model we have our cumulative hazard estimator Â(t | x) = B̂_0(t) + Σ_{k=1}^{p} x_k B̂_k(t), and we can use either Ŝ(t | x) = e^{−Â(t | x)} or the Kaplan–Meier-type estimator

Ŝ(t | x) = ∏_{t_j ≤ t} (1 − ∆Â(t_j | x)).

For the Cox model we can use Ŝ(t | x) = Ŝ_0(t)^{exp(β̂·x)}. Then we have the adjusted survival estimate

Ŝ_adj(t) := n^{−1} Σ_{i=1}^{n} Ŝ(t | x_i).

If we are working with the additive hazards model the individual survival estimators, hence Ŝ_adj, might not be monotone. We can make it monotone by taking the running minimum

Ŝ_mon(t) := min_{s≤t} Ŝ_adj(s).

15.3.3 Inverse probability of censoring weighting

A proposal by Robins and Finkelstein [RF00], called inverse probability of censoring weighting (IPCW), has become the standard approach to correcting for dependent censoring. The idea is to treat informative censoring as a sort of sampling bias, which may be corrected by reweighting the individuals by the inverse probability of their having been included in the sample. Thus individuals whose characteristics make them unlikely to have been caught in our sample (not censored, in this case) are counted more heavily in our estimates. (In sampling theory this is known as the Horvitz–Thompson estimator.)

We start by estimating the survival process of the censoring times for each individual,

K̂_i(t) = estimated P{C_i > t}.

Then we define the adjusted aggregated counting process N* and adjusted number-at-risk process Y* by

N*(t) = Σ_{i=1}^{n} ∫_0^t dN_i(u) / K̂_i(u−)   and   Y*(t) = Σ_{i=1}^{n} Y_i(t) / K̂_i(t−).


We then define the adjusted Nelson–Aalen and adjusted Kaplan–Meier estimators by using the adjusted counts and numbers at risk.
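A sketch of the adjusted Nelson–Aalen estimator (illustrative Python with our own names; K stands for a fitted censoring-survival estimate K̂_i(t−), passed in as a function of the individual index and time):

```python
def ipcw_nelson_aalen(times, status, K):
    """Adjusted cumulative hazard built from dN_i(u)/K_i(u-) and Y_i(t)/K_i(t-).
    times[i]: observed time; status[i] = 1 for an event, 0 for censoring."""
    A, out = 0.0, {}
    for t in sorted({ti for ti, si in zip(times, status) if si == 1}):
        Ystar = sum(1.0 / K(i, t) for i, ti in enumerate(times) if ti >= t)
        dNstar = sum(1.0 / K(i, t)
                     for i, (ti, si) in enumerate(zip(times, status))
                     if ti == t and si == 1)
        A += dNstar / Ystar
        out[t] = A
    return out

# With K identically 1 this reduces to the ordinary Nelson-Aalen estimator
A = ipcw_nelson_aalen([1, 2, 3], [1, 1, 0], lambda i, t: 1.0)
```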


Lecture 16

Frailty and recurrent events

16.1 Proportional frailty model

This section and the following three are based largely on Chapter 6 of [ABG08]. Individual i is supposed to have an unobservable random variable Z_i, the individual frailty, yielding a hazard rate

α_i(t) = Z_i · α(t).

We assume the Z_i are i.i.d. We write A(t) := ∫_0^t α(s) ds, and

S(t | z) = e^{−zA(t)}

is the survival function of an individual with frailty z. This conditional survival function is unobservable. The population survival function is

S(t) = E[S(t | Z)] = E[e^{−Z A(t)}] = L(A(t)), (16.1)

where L(c) is the Laplace transform of Z_i. The population hazard rate is

µ(t) = α(t) · ( −L′(A(t)) / L(A(t)) ). (16.2)

16.2 Examples of frailty distributions

16.2.1 Gamma frailty

A popular class of frailty distributions used for modelling is the gamma frailty, with density

f_{r,λ}(z) = (λ^r / Γ(r)) z^{r−1} e^{−λz}, (16.3)


and Laplace transform

L(c) = (1 + c/λ)^{−r}. (16.4)

We commonly wish to scale Z to have mean 1 (since a scaling factor may be absorbed into the baseline hazard α). We then have one free parameter for the gamma distribution, which we may identify with the precision θ = λ = r = 1/variance. The Laplace transform is then

L(c) = (1 + c/θ)^{−θ}. (16.5)

We then have the population hazard, from (16.2),

µ(t) = α(t) / (1 + A(t)/θ). (16.6)
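Formula (16.6) can be checked against the general relation (16.2) numerically. This is an illustrative sketch with made-up values θ = 2 and constant baseline α = 1 (so A(t) = t):

```python
import math

theta = 2.0
L = lambda c: (1.0 + c / theta) ** (-theta)  # gamma-frailty Laplace transform (16.5)

def mu_from_laplace(alpha, A, h=1e-6):
    """mu = alpha * (-L'(A)/L(A)), via a central difference on log L."""
    return alpha * -(math.log(L(A + h)) - math.log(L(A - h))) / (2 * h)

def mu_closed_form(alpha, A):
    """mu = alpha / (1 + A/theta), formula (16.6)."""
    return alpha / (1.0 + A / theta)

# constant individual hazard alpha = 1 at A(t) = 4: the population hazard
# has declined from 1 to 1/3 as the high-frailty individuals die off
print(mu_closed_form(1.0, 4.0))   # 1/3
print(mu_from_laplace(1.0, 4.0))  # also approximately 1/3
```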

16.2.2 PVF family

A convenient generalisation of the gamma frailty is the power variance function (PVF) distribution. This is a kind of compound Poisson distribution, defined as the sum of a Poisson random number of independent gamma-distributed random variables. The parameters are (ρ, λ, r), where ρ is the Poisson parameter. The Laplace transform is easy to work with, being just

L_{ρ,λ,r}(c) = e^{−ρ(1 − (1 + c/λ)^{−r})}. (16.7)

One useful feature of the PVF family is that it includes an atom at 0 (of size e^{−ρ}). This makes it natural for application to populations that are thought to include individuals whose hazard is simply 0.

Note that this distribution is neither continuous nor discrete. It does not exist in the world of Part A probability, though it is perfectly natural. We can analyse it indirectly with elementary probability methods, though it only fits comfortably in the context of measure-theoretic probability.

The hazard rate is

µ(t) = (ρr/λ) (1 + A(t)/λ)^{−r−1} α(t). (16.8)

16.3 Effects on the hazard ratio

One of the key goals of survival analysis is the comparison of survival rates between groups. It is important to recognise that in the presence of unobserved heterogeneity, the comparison on a group level looks different from the comparison on the individual level (which is inherently unobservable).


Or, conversely, if we think there is a simple relationship (such as proportional hazards) between the hazard rates in two groups, the relationship will be hidden by the selection bias induced by frailty.

16.3.1 Changing relative risk

The example we consider here is the effect of heterogeneity on relative risk. Suppose the population is composed of two groups, and individual i has frailty Z_i. The hazard for individual i at time t is Z_i α(t) if in Group 1, and kZ_i α(t) if in Group 2. So k is the "real" relative risk, which we will assume is > 1.

Suppose Z ∼ Gamma(θ, θ). The hazard rates at time t for groups 1 and 2 respectively will be

µ_1(t) = α(t) / (1 + A(t)/θ),    µ_2(t) = kα(t) / (1 + kA(t)/θ).

The hazard ratio will be

µ_2(t)/µ_1(t) = k(1 + A(t)/θ) / (1 + kA(t)/θ),

which converges to 1 as long as A(t) → ∞. Thus, the effect of group identity on hazard seems to dissipate, as group 2 becomes more concentrated with low-frailty individuals.

If Z has compound Poisson distribution with parameters (ρ, ν, r) the effect is even more extreme. Now there is a zero-frailty class, which becomes ever more concentrated in group 2, because of the higher selection pressure. Thus, we get a hazard crossover:

µ_2(t)/µ_1(t) = k ( (ν + A(t)) / (ν + kA(t)) )^{r+1} → k^{−r} < 1 as A(t) → ∞.
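These two limits are easy to see numerically. This is an illustrative sketch with made-up parameter values k = 2 and θ = ν = r = 1:

```python
def ratio_gamma(A, k, theta):
    """Gamma frailty: k(1 + A/theta)/(1 + kA/theta) -> 1 as A -> infinity."""
    return k * (1 + A / theta) / (1 + k * A / theta)

def ratio_pvf(A, k, nu, r):
    """Compound Poisson frailty: k((nu + A)/(nu + kA))^(r+1) -> k^(-r) < 1."""
    return k * ((nu + A) / (nu + k * A)) ** (r + 1)

print(ratio_gamma(0.0, 2.0, 1.0))       # 2.0: the true relative risk at t = 0
print(ratio_gamma(100.0, 2.0, 1.0))     # about 1.005: the group effect has dissipated
print(ratio_pvf(100.0, 2.0, 1.0, 1.0))  # about 0.505, near the limit k^{-r} = 0.5
```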

16.3.2 Hazard and frailty of survivors

The population will be selectively filtered as time passes, leading to a concentration of individuals of lower frailty. If the frailty at time 0 has density f(x), at time t the frailty of the survivors will have density

f(x) e^{−xA(t)} / ∫_0^∞ f(y) e^{−yA(t)} dy. (16.9)


16.4 Repeated events

If we have no information about an individual's frailty other than the observed survival, then the frailty model is just a convenient way of motivating potentially interesting survival distributions. (Of course, we may have some substantive scientific reasons for generating the survival distribution in this way.) On the other hand, if the individual frailty is a function of known covariates, perhaps with unknown parameters, then multiplicative frailty just becomes a proportional hazards regression model. (Similarly, we could have an additive frailty model. Unobserved additive frailty is essentially what is being described in question (3) of problem sheet B.4.)

As we have remarked before, there is nothing in our past assumptions that requires the at-risk indicator to take any particular form. In particular, if the at-risk indicator stays at 1 after the individual has an event, this will correspond to repeated events. But the basic assumptions of the multiplicative-intensity model are violated by hidden frailty, so we need to consider how to adapt our procedures when we wish to allow for individual frailty.

We assume that for each individual we observe the individual's counting process N_i(t) (which may be thought of as a possibly empty set of event times), the at-risk process Y_i(t) (assumed predictable), and the covariate process x_i(t) (also predictable, possibly constant, possibly vector-valued).

16.4.1 The Poisson model

The simplest model for repeated events is the Poisson model, taking the intensity for each individual to be a fixed constant α when at risk. Assuming changes in the at-risk process are independent of the counting process, the number of events observed for each individual, and in total, will be Poisson distributed, conditioned on the total time at risk. If we observe individual i for a total time T_i = ∫_0^∞ Y_i(t) dt, and observe N = N(∞) events, then the log likelihood is

ℓ(α) = N log α − α Σ T_i,

with the MLE

α̂ = N / Σ T_i.

While the independence assumption is necessary to derive a Poisson distribution, our basic non-informative assumption tells us that α ∫_0^t Y_i(s) ds is the compensator for N_i, so that

∫_0^t (dN(s) − α Y(s) ds)

is a martingale. Hence N(∞) − α Σ T_i has expectation 0. This shows that α̂ is asymptotically unbiased.

16.4.2 The Poisson regression model

A simple generalisation would be to say that each individual has Poissonnumber of events, with the intensity being constant during the time whenthat individual is at risk, and a function — to be determined — of somemeasured covariates. If individual i is at risk for total time Ti, the numberof events Ni is Poisson distributed with parameter Tiα(xi). Conditioned onthe time at risk the log likelihood is then

\[
\ell(\alpha) = \sum_{i=1}^n \bigl(N_i \log \alpha(x_i) - T_i \alpha(x_i)\bigr).
\]

The most common parametric form is $\alpha = \exp(\beta \cdot x)$, where $\beta = (\beta_0, \dots, \beta_p)$, and we take $x_{i0} \equiv 1$. The log likelihood then becomes
\[
\ell(\beta) = \sum_{k=0}^p \beta_k \sum_{i=1}^n N_i x_{ik} - \sum_{i=1}^n T_i e^{\beta \cdot x_i}. \tag{16.10}
\]

The MLE then satisfies the equations
\[
\sum_{i=1}^n N_i x_{ik} = \sum_{i=1}^n x_{ik} T_i e^{\beta \cdot x_i}, \qquad k = 0, \dots, p.
\]
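These score equations have no closed-form solution, but the log likelihood is concave, so Newton's method converges quickly. A minimal sketch (Python rather than R, one binary covariate plus an intercept, synthetic data with invented coefficients; in practice one would use a standard GLM routine):

```python
import math
import random

random.seed(2)

def rpois(mu):
    # Hand-rolled Poisson draw (Knuth's method)
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# Synthetic data: binary covariate x, exposure time T, true beta = (0.2, -0.5)
data = []
for _ in range(5000):
    x = random.randint(0, 1)
    T = random.uniform(1.0, 4.0)
    data.append((x, T, rpois(T * math.exp(0.2 - 0.5 * x))))

beta = [0.0, 0.0]
for _ in range(25):
    # Score U_k = sum_i (N_i - T_i e^{beta.x_i}) x_ik, and the Fisher
    # information H (the negative Hessian of the log likelihood)
    U = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, T, N in data:
        xs = (1.0, x)
        w = T * math.exp(beta[0] + beta[1] * x)
        for k in range(2):
            U[k] += (N - w) * xs[k]
            for l in range(2):
                H[k][l] += w * xs[k] * xs[l]
    # Newton step: beta <- beta + H^{-1} U  (2x2 solve by hand)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    beta[0] += (H[1][1] * U[0] - H[0][1] * U[1]) / det
    beta[1] += (H[0][0] * U[1] - H[1][0] * U[0]) / det
```

At convergence the score is essentially zero and the estimates sit close to the invented true values.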

This fits into the framework of GLMs (generalised linear models), and may be fitted in R using any of the standard GLM functions. Note that we are modelling $N_i \sim \mathrm{Po}(\mu_i)$, where
\[
\log\mu_i = \log T_i + \beta \cdot x_i. \tag{16.11}
\]
We call $\log T_i$ an offset in the model.

As an example, we consider data from a trial to determine the effect of nutritional supplements on prisoners' rate of disciplinary offences. There were 771 prisoners, observed for anywhere from two weeks to half a year for a baseline period, after which half were randomly given the supplements and the others were given placebos, and they were observed (and offences recorded) for a variable period, mostly at least 10 weeks. We consider here only the treatment (second) period, and try to model the effect of the treatment, and of the difference in rates between prisons (which we call here A, B, and C). Thus our model is
\[
\log\alpha = \beta_0 + \beta_1 \mathbf{1}_{\text{treatment}} + \beta_2 \mathbf{1}_{\text{prison B}} + \beta_3 \mathbf{1}_{\text{prison C}}.
\]

# sdb0 = start date baseline, sdt0 = start date treatment, edt0 = end date
# base.count = # events baseline, treat.count = # events treatment
# pgroup = indicator of active treatment
risktime = edt0 - sdt0
poisreg = glm(treat.count ~ pris + pgroup, family = poisson,
              offset = log(risktime))
summary(poisreg)

Call:
glm(formula = treat.count ~ pris + pgroup, family = poisson,
    offset = log(risktime))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.5092  -1.5806  -0.7360   0.4907   8.3629

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.39045    0.05971 -40.031   <2e-16 ***
prisB       -0.99509    0.06611 -15.051   <2e-16 ***
prisC       -1.99417    0.07069 -28.212   <2e-16 ***
pgroup      -0.10239    0.05051  -2.027   0.0426 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2887.0  on 770  degrees of freedom
Residual deviance: 2133.2  on 767  degrees of freedom
AIC: 3349.5

Number of Fisher Scoring iterations: 6

This fitted model gives us a predicted expected number of events for each individual. The difference between the observed number of events and the expected number predicted by the model is the residual. In Figure 16.1 we plot the residuals against the fitted values. (This is the automatic output of the command plot(poisreg), where poisreg is the output of the glm fit above.)

The poor fit would be obvious from a casual examination of the data. The mean number of events is about 2, but some individuals have as many as 25, which is not something you would see in a Poisson distribution. These data are over-dispersed, meaning that their variance is higher than it would be for a Poisson distribution of the same mean.

We also note that the deviance residuals (which should be approximately standard normally distributed if the model is correct) range from $-4.5$ to $+8.36$. The sum of their squares, called the residual deviance, is 2133.2, which is much too large for a chi-squared variable on 767 degrees of freedom.

[Figure: "Residuals vs Fitted" diagnostic plot for glm(treat.count ~ pris + pgroup); deviance residuals (roughly $-5$ to $8$) plotted against predicted values, with observations 374, 650, and 245 flagged as extreme.]

Figure 16.1: Residual plot for the prison-data Poisson regression.

16.4.3 Negative-binomial model

Of course, the Poisson model doesn't really make much sense in the example discussed in the previous section. Individuals may be presumed to have differing predispositions to offend. Thus, it is not surprising that the number of offences is more spread out than you would expect under the Poisson model, which posits that everyone offended at the same rate.

We may generalise the Poisson regression model to better fit overdispersed data by adding a frailty term. That is, in place of (16.11) we represent the individual intensity by
\[
\log\mu_i = \log\lambda_i + \log T_i + \beta \cdot x_i. \tag{16.12}
\]

The term $\lambda_i$, called a multiplicative frailty, represents the individual relative rate of producing events. The $\lambda_i$ are treated as random effects, meaning that they are not to be estimated individually (which would not make sense), but rather are taken to be i.i.d. samples from a simple parametric distribution. When the frailty $\lambda$ has a gamma distribution (with parameters $(\theta, \theta)$, because we conventionally take the frailty distribution to have mean 1), and $N$ is a Poisson count conditioned on $\lambda$ with mean $\lambda\alpha$, then $N$ has probability mass function
\[\begin{aligned}
P\{N = n\} &= \int_0^\infty \frac{(\lambda\alpha)^n}{n!} e^{-\lambda\alpha} \cdot \frac{\theta^\theta \lambda^{\theta-1}}{\Gamma(\theta)} e^{-\theta\lambda}\,d\lambda \\
&= \frac{\theta^\theta \alpha^n}{n!\,\Gamma(\theta)} \int_0^\infty \lambda^{n+\theta-1} e^{-\lambda(\theta+\alpha)}\,d\lambda \\
&= \frac{\Gamma(n+\theta)}{n!\,\Gamma(\theta)} \Bigl(\frac{\theta}{\theta+\alpha}\Bigr)^\theta \Bigl(\frac{\alpha}{\theta+\alpha}\Bigr)^n,
\end{aligned}\]
which is the negative binomial distribution with parameters $\theta$ and $\alpha/(\theta+\alpha)$. Therefore this is called the negative binomial regression model. We can fit it with the glm.nb command (from the MASS package) in R. If we apply it to the same data as before, we get the following output:

> summary(poisreg2)

Call:
glm.nb(formula = treat.count ~ pris + pgroup + offset(log(risktime)),
    init.theta = 0.8418678047, link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1179  -1.2228  -0.4695   0.2785   3.6767

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.28974    0.17977 -12.737  < 2e-16 ***
prisB       -1.08624    0.19047  -5.703 1.18e-08 ***
prisC       -2.08056    0.18716 -11.117  < 2e-16 ***
pgroup      -0.15331    0.09984  -1.536    0.125
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.8419) family taken to be 1)

    Null deviance: 956.27  on 770  degrees of freedom
Residual deviance: 755.61  on 767  degrees of freedom
AIC: 2645.4

Number of Fisher Scoring iterations: 1

              Theta:  0.8419
          Std. Err.:  0.0781

 2 x log-likelihood:  -2635.3550

We note that, while the largest deviance residual of 3.68 suggests a possible outlier, the total residual deviance is now quite plausible.
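As a numerical check on the gamma-mixture derivation above, integrating the Poisson mass function against the Gamma$(\theta,\theta)$ density should reproduce the closed-form negative binomial probabilities. A sketch in Python (the values $\theta = 1.5$ and $\alpha = 2$ are arbitrary):

```python
import math

def nb_pmf(n, theta, alpha):
    # Closed-form negative binomial from the derivation in the text
    return (math.exp(math.lgamma(n + theta) - math.lgamma(theta) - math.lgamma(n + 1))
            * (theta / (theta + alpha)) ** theta
            * (alpha / (theta + alpha)) ** n)

def mixture_pmf(n, theta, alpha, grid=100000, upper=40.0):
    # Midpoint-rule integration of the Poisson(lam * alpha) mass at n
    # against the Gamma(theta, theta) frailty density
    h = upper / grid
    total = 0.0
    for j in range(grid):
        lam = (j + 0.5) * h
        log_pois = -lam * alpha + n * math.log(lam * alpha) - math.lgamma(n + 1)
        log_gamma = (theta * math.log(theta) + (theta - 1) * math.log(lam)
                     - math.lgamma(theta) - theta * lam)
        total += math.exp(log_pois + log_gamma) * h
    return total
```

For each small $n$, `mixture_pmf(n, 1.5, 2.0)` and `nb_pmf(n, 1.5, 2.0)` agree to several decimal places.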

16.4.4 The Andersen-Gill model

Another popular generalisation of the Poisson model, introduced in 1982 by Andersen and Gill [AG82], is a semi-parametric relative-risk regression model, essentially equivalent to the Cox proportional hazards regression model. The only change is that the at-risk indicator for an individual will not, in general, become 0 after an event. The partial likelihood is defined exactly as in (10.6), and Breslow's formula still defines an estimate of cumulative intensity (rather than cumulative hazard).

We can fit the model in R by using the coxph command. All we need to do is to represent the data appropriately in a Surv object. To do this, the record for an individual gets duplicated, with one row for each event time or censoring time. An event time will be the "stop" time in one row, and will then be the "start" time in the next row. The covariates will repeat from row to row for the same individual.
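The reshaping into start/stop rows can be sketched as follows (Python rather than R, with a hypothetical helper name; each event closes one row and opens the next, and the last row ends in censoring):

```python
def to_start_stop(event_times, censor_time):
    # One (start, stop, status) row per event, plus a final censored row,
    # mirroring the start/stop encoding for a Surv object described above
    rows, start = [], 0.0
    for t in sorted(event_times):
        rows.append((start, t, 1))   # status 1: interval ends in an event
        start = t                    # the event time opens the next row
    rows.append((start, censor_time, 0))   # status 0: censored at the end
    return rows

# An individual with events at times 2 and 5, censored at time 8:
rows = to_start_stop([2.0, 5.0], 8.0)
# -> [(0.0, 2.0, 1), (2.0, 5.0, 1), (5.0, 8.0, 0)]
```

Any time-varying covariates would be attached to these rows, repeating for the same individual.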


Bibliography

[Aal93] Odd O. Aalen. Further results on the non-parametric linear regression model in survival analysis. Statistics in Medicine, 12:1569–88, 1993.

[ABG08] Odd O. Aalen, Ørnulf Borgan, and Håkon K. Gjessing. Survival and Event History Analysis: A Process Point of View. Springer Verlag, 2008.

[AG82] Per Kragh Andersen and Richard D. Gill. Cox's regression model for counting processes: a large sample study. Annals of Statistics, 10(4):1100–1120, 1982.

[Dur10] Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition, 2010.

[EEH+77] Stephen H. Embury, Laurence Elias, Philip H. Heller, Charles E. Hood, Peter L. Greenberg, and Stanley L. Schrier. Remission maintenance therapy in acute myelogenous leukemia. The Western Journal of Medicine, 126:267–72, April 1977.

[FH91] Thomas R. Fleming and David P. Harrington. Counting Processes and Survival Analysis. Wiley, 1991.

[HFB+11] Pascal Haegeli, Markus Falk, Hermann Brugger, Hans-Jurg Etter, and Jeff Boyd. Comparison of avalanche survival patterns in Canada and Switzerland. Canadian Medical Association Journal, 183(7):789–795, 2011.

[KM03] John P. Klein and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. Springer Verlag, 2nd edition, 2003.



[LR02] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley, 2002.

[MGM01] Rupert G. Miller, Gail Gong, and Alvaro Munoz. Survival Analysis. Wiley, 2001.

[MSD90] Jens Modvig, Lone Schmidt, and Mogens T. Damsgaard. Measurement of total risk of spontaneous abortion: The virtue of conditional risk estimation. Journal of Epidemiology, 132(6):1021–1038, December 1990.

[Pit93] Jim Pitman. Probability. Springer Verlag, 1993.

[RF00] James M. Robins and Dianne M. Finkelstein. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics, pages 779–788, 2000.

[Sun06] Jianguo Sun. The Statistical Analysis of Interval-censored Failure Time Data. Springer Verlag, 2006.

[TGF90] Terry M. Therneau, Patricia M. Grambsch, and Thomas R. Fleming. Martingale-based residuals for survival models. Biometrika, 77(1):147–60, 1990.

[TK98] Howard M. Taylor and Samuel Karlin. An Introduction to Stochastic Modeling. Academic Press, 3rd edition, 1998.

[Tur74] Bruce W. Turnbull. Nonparametric estimation of a survivorship function with doubly censored data. Journal of the American Statistical Association, 69(345):169–173, 1974.

[VDVHvV+02] Marc J. Van De Vijver, Yudong D. He, Laura J. van 't Veer, Hongyue Dai, Augustinus A. M. Hart, Dorien W. Voskuil, George J. Schreiber, Johannes L. Peterse, Chris Roberts, Matthew J. Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.

[vHS14] Hans C. van Houwelingen and Theo Stijnen. Cox regression model. In John P. Klein, Hans C. van Houwelingen, Joseph G. Ibrahim, and Thomas H. Scheike, editors, Handbook of Survival Analysis, chapter 1, pages 5–26. CRC Press, 2014.


[ZKJ07] Achim Zeileis, Christian Kleiber, and Simon Jackman. Regression models for count data in R. Journal of Statistical Software, 2007.


Appendix A

Notes on the Poisson Process

(Mainly copied from the BS3b Lecture Notes)

A.1 Point processes

A joint $n$-dimensional distribution may be thought of as a model for a random point in $\mathbb{R}^n$. A point process is a model for choosing a random set of points. Some examples of phenomena that may be modelled this way are:

• The pattern of faults on a silicon chip;

• The accumulation of mutations within an evolutionary tree;

• Appearance of a certain pattern of pixels in a photograph;

• The times when a radioactive sample emits a particle;

• The arrival times of customers at a bank;

• Times when customers register financial transactions.

Note that a point process, thought of as a random subset of $\mathbb{R}^n$ (or some more general space), may also be identified with a counting process $N$ mapping regions of $\mathbb{R}^n$ to natural numbers. For any region $A$, $N(A)$ is the (random) number of points in $A$.

A Poisson process is the simplest possible point process. It has the following properties (which we will formalise later on):

• Points in disjoint regions are selected independently;


• The expected number of points in a region is proportional to the size of the region;

• There are finitely many points in any bounded region, and the points are all distinct.

We will be focusing in this course on one-dimensional point processes, that is, random collections of points in $\mathbb{R}$. Note that the counting process is uniquely determined by its restriction to half-infinite intervals: we write
\[
N_t := N\bigl((-\infty, t]\bigr) = \#\{\text{points} \le t\}.
\]
If we imagine starting from 0 and progressing to the right, the succession of points may be identified with a succession of "interarrival times".

A.2 The Poisson process on R+

The Poisson process on $\mathbb{R}_+$ is the simplest nontrivial point process. We have three equivalent ways of representing a point process on $\mathbb{R}_+$:

Random set of points ←→ Interarrival times ←→ Counting process Nt

A.2.1 Local definition of the Poisson process

A Poisson arrival process with parameter $\lambda$ is an integer-valued stochastic process $N(t)$ that satisfies the following properties:

(PPI.1) N(0) = 0.

(PPI.2) Independent increments: If $0 \le s_1 < t_1 \le s_2 < t_2 \le \cdots \le s_k < t_k$, then the random variables $\bigl(N(t_i) - N(s_i)\bigr)_{i=1}^k$ are independent.

(PPI.3) Constant rate: $P\{N(t+h) - N(t) = 1\} = \lambda h + o(h)$. That is,
\[
\lim_{h\downarrow 0} h^{-1} P\{N(t+h) - N(t) = 1\} = \lambda.
\]

(PPI.4) No clustering: $P\{N(t+h) - N(t) \ge 2\} = o(h)$.

The corresponding point process has a discrete set of points at those $t$ where $N(t)$ jumps; that is, $\{t : N(t) = N(t^-) + 1\}$.


A.2.2 Global definition of the Poisson process

A Poisson arrival process with parameter $\lambda$ is an integer-valued stochastic process $N(t)$ that satisfies the following properties:

(PPII.1) N(0) = 0.

(PPII.2) Independent increments: If $0 \le s_1 < t_1 \le s_2 < t_2 \le \cdots \le s_k < t_k$, then the random variables $\bigl(N(t_i) - N(s_i)\bigr)_{i=1}^k$ are independent.

(PPII.3) Poisson distribution:
\[
P\{N(t+s) - N(s) = n\} = e^{-\lambda t}\frac{(\lambda t)^n}{n!}.
\]

The corresponding point process has a discrete set of points at those $t$ where $N(t)$ jumps; that is, $\{t : N(t) = N(t^-) + 1\}$.

A.2.3 Defining the Interarrival process

A Poisson process with parameter $\lambda$ may be defined by letting $\tau_1, \tau_2, \dots$ be i.i.d. random variables with exponential distribution with parameter $\lambda$. Then the point process is made up of the cumulative sums $T_k := \sum_{i=1}^k \tau_i$, and the counting process is
\[
N(t) = \#\{k : 0 \le T_k \le t\}.
\]
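The interarrival definition translates directly into a simulation, which we can use to check the global definition's claim that $N(t)$ has the $\mathrm{Poi}(\lambda t)$ distribution, whose mean and variance are both $\lambda t$. A sketch in Python ($\lambda = 1.5$ and $t = 10$ are arbitrary choices):

```python
import random

random.seed(3)

lam, t_end = 1.5, 10.0
counts = []
for _ in range(20000):
    # accumulate i.i.d. Exp(lam) interarrival times until we pass t_end
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)
        if total > t_end:
            break
        n += 1
    counts.append(n)

mean = sum(counts) / len(counts)                          # near lam * t_end = 15
var = sum((c - mean) ** 2 for c in counts) / len(counts)  # Poisson: also near 15
```

Both the sample mean and the sample variance of the counts come out close to $\lambda t = 15$, as the Poisson distribution requires.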


[Figure: a staircase plot of the counting process $N(t)$, rising from 1 to 9, with arrival times $T_1, \dots, T_9$ marked on the time axis and interarrival times $\tau_1, \dots, \tau_9$ between them.]

Figure A.1: Representations of a Poisson process. $T_1, T_2, \dots$ are the locations of the points (the arrival times). $\tau_1, \tau_2, \dots$ are the i.i.d. interarrival times. The green line represents the counting process, which increases by 1 at each arrival time.

A.2.4 Equivalence of the definitions

Proposition 1. The local, global, and interarrival definitions define the same stochastic process.

Proof. We show that the local definition is equivalent to the other definitions. We start with the local definition and show that it implies the global definition. The first two conditions are the same. Consider an interval $[s, s+t]$, and suppose the process satisfies the local definition. Choose a large integer $K$, and define
\[
\xi_i = N\Bigl(s + \frac{ti}{K}\Bigr) - N\Bigl(s + \frac{t(i-1)}{K}\Bigr).
\]
Then $N(t+s) - N(s) = \sum_{i=1}^K \xi_i$, and $\xi_i$ is close to being Bernoulli with parameter $\lambda t/K$, so $N(t+s) - N(s)$ should be close to a $\mathrm{Binom}(K, \lambda t/K)$ distribution, which we know converges to $\mathrm{Poisson}(\lambda t)$. Formally, we can


assume that the $\xi_i$ are all 0 or 1, since
\[
P\bigl\{\xi_i \ge 2 \text{ for some } 1 \le i \le K\bigr\} \le K \cdot o(t/K) \xrightarrow{K\to\infty} 0.
\]
The new $\xi_i$, which are no more than 1, have mgf
\[
M_\xi(\theta) = 1 + \frac{\lambda t}{K}(1 + o(1))(e^\theta - 1).
\]
Thus the mgf of $N(t+s) - N(s)$ is
\[
M_\xi(\theta)^K = \Bigl(1 + \frac{\lambda t}{K}(1 + o(1))(e^\theta - 1)\Bigr)^K \xrightarrow{K\to\infty} e^{\lambda t(e^\theta - 1)},
\]

which is the mgf of a $\mathrm{Poi}(\lambda t)$ random variable.

Now assume the global definition. Since there are a finite number of points in any finite interval, we may list them in order, giving us a sequence of random variables $T_1, T_2, \dots$, and the interarrival times $\tau_1, \tau_2, \dots$. We need to show that these are independent with $\mathrm{Exp}(\lambda)$ distribution. It will suffice to show that for any positive numbers $t_1, t_2, \dots, t_k$,
\[
P\bigl\{\tau_k > t_k \bigm| \tau_1 = t_1, \dots, \tau_{k-1} = t_{k-1}\bigr\} = e^{-\lambda t_k}.
\]

(This means that, independently of the $\tau_i$ for $i \le k-1$, $\tau_k$ has $\mathrm{Exp}(\lambda)$ distribution.) Let $t = t_1 + \cdots + t_{k-1}$. By property (PPII.2), the numbers and locations of the points on $[0, t]$ and on $(t, t+t_k]$ are independent.¹ The event $\{\tau_k > t_k, \tau_1 = t_1, \dots, \tau_{k-1} = t_{k-1}\}$ is identical to the event $\{N(t_k + t) - N(t) = 0, \tau_1 = t_1, \dots, \tau_{k-1} = t_{k-1}\}$, so by independence the probability is simply the probability of a $\mathrm{Poi}(\lambda t_k)$ random variable being 0, which is
\[
P\bigl\{N(t_k + t) - N(t) = 0\bigr\} = e^{-\lambda t_k}.
\]

Finally, suppose the process is defined by the interarrival definition. It's trivial that $N(0) = 0$. The independent-increments condition is satisfied because of the memoryless property of the exponential distribution. (Formally, we can show that, conditioned on any placement of all the points on the interval $[0, t]$, the next point still has distribution $t + \tau$, where $\tau$ is $\mathrm{Exp}(\lambda)$. Then we proceed by induction on the number of intervals.) Finally, $P\{N(t+h) - N(t) \ge 1\} = 1 - e^{-\lambda h} = \lambda h + o(h)$, while $P\{N(t+h) - N(t) \ge 2\} \le P\{\tau < h\}^2$, where $\tau$ is an $\mathrm{Exp}(\lambda)$ random variable, so this is on the order of $\lambda^2 h^2$.

¹If we're going to be completely rigorous, we would need to take an infinitesimal interval around each $t_i$, and show that the events of the $\tau_i$ being in all of these intervals are independent.


A.2.5 The Poisson process as Markov process

We note that the Poisson process in one dimension is also a Markov process. It is the simplest version of a Markov process in continuous time. We do not formally define the Markov property in continuous time, but intuitively it must be that the past contains no information about the future of the process that is not in the current position. For a counting process, the "future" is merely the time when it will make the next jump (and the next one after that, and so on), while the past is simply how long since the last jump, and the one before that, and so on. So the time to the next jump must be independent of the time since the last jump, which can only happen if the distribution of the time is the "memoryless" exponential distribution.

Thus, if a counting process is to be Markov, it must be something like a Poisson process. The only exception is that it would be possible to change the arrival rate depending on the cumulative count of the arrivals.

More generally, a continuous-time Markov process on a discrete state space is defined by linking a Markov chain of transitions among the states with independent exponentially distributed waiting times between transitions, with the rate parameter of the waiting times determined by the state currently occupied. But this goes beyond the scope of this course.

A.3 Examples and extensions

Part of learning about any probability distribution or process is knowing the standard situations where that distribution is considered to be the default model. The one-dimensional Poisson process is the default model for a process of identical events happening at a constant rate, such that none of the events influences the timing of the others. Standard examples:

• Arrival times of customers in a shop.

• Calls coming into a telephone exchange (the traditional version) or service requests coming into an internet server.

• Particle emissions from a radioactive material.

• Times when surgical accidents occur in a hospital.

Some of these examples we would expect to be not exactly like a Poisson process, particularly as regards homogeneity: customers might be more likely to come in the morning than in the afternoon, or accidents might be more likely to occur in the hospital late at night than in mid-afternoon.


Generalising to Poisson-like processes where the arrival rates may change with time, for instance, is fairly straightforward. As with many other modelling problems, we might best think of the Poisson process as a kind of "null model": probably too simplistic to be really accurate, but providing a baseline for starting the modelling process, from which we can consider whether more detailed realism is worth the effort.

A.4 Some basic calculations

Example 1.1: Queueing at the bank

Customers arrive at a bank throughout the day according to a Poisson process, at a rate of 2 per minute. Calculate:

(i) The probability that the first customer to arrive after 12 noon arrives after 12:02;

(ii) The probability that exactly 4 customers arrive between 12:03 and 12:06;

(iii) The probability that at least 3 customers arrive between 12:00 and 12:01;

(iv) The expected time between the 4th customer arrival after noon and the 7th;

(v) The distribution of the time between the 4th customer arrival after noon and the 7th.

Solution:

(i) The number of arrivals in a 2-minute period has $\mathrm{Poi}(4)$ distribution, so the probability that this number is 0 is $e^{-4} = 0.018$. Alternatively, we can think in terms of waiting times. The waiting time has $\mathrm{Exp}(2)$ distribution. Regardless of how long since the last arrival at noon, the remaining waiting time is still $\mathrm{Exp}(2)$, so the probability that this is at least 2 minutes is $e^{-4}$.

(ii) The number $N_3$ arriving in 3 minutes has $\mathrm{Poi}(6)$ distribution. The probability that this is exactly 4 is $e^{-6}6^4/4! = 0.134$.


(iii) The number is $\mathrm{Poi}(2)$. Then
\[
P\{N_1 \ge 3\} = 1 - P\{N_1 \le 2\} = 1 - e^{-2}\Bigl(1 + \frac{2}{1!} + \frac{2^2}{2!}\Bigr) = 0.323.
\]

(iv) The expected time between arrivals is 1/2 minute. Thus the expected sum of three interarrival times is 1.5 minutes.

(v) This is the sum of three independent $\mathrm{Exp}(2)$ random variables, so it has a gamma distribution with shape parameter 3 and rate parameter 2, with density $4t^2 e^{-2t}$.
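The numerical answers in (i)-(iii) can be checked directly from the Poisson mass function:

```python
import math

# (i) no arrivals in 2 minutes: the count is Poi(4)
p_i = math.exp(-4)                                  # about 0.018
# (ii) exactly 4 arrivals in 3 minutes: the count is Poi(6)
p_ii = math.exp(-6) * 6 ** 4 / math.factorial(4)    # about 0.134
# (iii) at least 3 arrivals in 1 minute: the count is Poi(2)
p_iii = 1 - math.exp(-2) * (1 + 2 + 2 ** 2 / 2)     # about 0.323
```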

Example 1.2: The waiting time paradox

Central stations of the Moscow subway have (or had, when I visited a dozen or so years ago) an electronic sign that shows the time since the last train arrived on the line. That is, it is a timer that resets itself to 0 every time a train arrives, and then counts the seconds up to the next arrival. Suppose trains arrive on average once every 2 minutes, and arrivals are distributed as a Poisson process. A woman comes every day, boards the next train, and writes down the number on the timer as the train arrives. What is the long-term average of these numbers?

Solution: You might think that the average should be 2 minutes. After all, she is observing an interarrival time between trains, and the expected interarrival time is 2 minutes. It is true that if she stood there all day and wrote down the numbers on the timer when the trains arrive, the average would converge to 2 minutes. But she isn't averaging all the interarrival times, only the ones that span the moments when she comes to the station. Isn't that the same thing?

No! The intervals that span her arrival times are size-biased. Imagine the train arrival times being marked on the time-line. Now the woman picks a time to come into the station at random, independent of the train arrivals. (We are implicitly assuming that she uses no information about the actual arrival times. This would be true if she comes at the same time every day, or at a random time independent of the trains.) This is like dropping a point at random onto the real line. There will be some wide intervals and some narrow intervals, and the point will naturally be more likely to end up in one of the wider intervals. In fact, if the interarrival times have density $f(t)$, the size-biased interarrival times will have density $tf(t)/\int_0^\infty sf(s)\,ds$. In the Poisson process, the interarrival times have exponential distribution, so the density of the observed interarrival times is
\[
\frac{t \cdot \lambda e^{-\lambda t}}{\int_0^\infty s\lambda e^{-\lambda s}\,ds} = \lambda^2 t e^{-\lambda t}.
\]

An easier way to see this is to think of the time that the woman waits for the train, and the time since the previous train at the moment when she enters the station. By the memoryless property of the exponential distribution, the waiting time until the next train, at the moment when she enters, is still precisely exponential with mean 2 minutes. By the symmetry of the Poisson process, it's clear that if we go backwards looking for the previous train, that time will also be exponential with mean 2, and the two times will be independent. Thus, the interarrival times observed by the woman will actually be the sum of two independent exponential random variables with mean 2, so they will have a gamma distribution with shape parameter 2 and mean 4 minutes.
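A simulation makes the size bias visible: the interval spanning the woman's arrival averages about 4 minutes, twice the mean interarrival time. A sketch in Python (the day length and number of repetitions are invented):

```python
import random

random.seed(4)

rate = 0.5       # trains per minute: one every 2 minutes on average
day = 200.0
spans = []
for _ in range(4000):
    # simulate one day's train arrivals from exponential interarrival times
    times, t = [], 0.0
    while t < day:
        t += random.expovariate(rate)
        times.append(t)
    # the woman arrives at a uniform random time well inside the day
    obs = random.uniform(50.0, 150.0)
    prev = max(s for s in times if s <= obs)   # last train before her
    nxt = min(s for s in times if s > obs)     # next train after her
    spans.append(nxt - prev)

avg = sum(spans) / len(spans)   # near 4 minutes, not 2
```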

Example 1.3: Genetic recombination model

A simple model of genetic recombination is illustrated in Figure A.2. Each individual has two copies of each chromosome (maternal and paternal, one from the mother and one from the father). Genes are lined up on the chromosomes, so for the genes on a given chromosome, your children should inherit either all the genes you got from your mother, or all the genes you got from your father. Not exactly, though, because of recombination.

During meiosis (the process that creates sperm and ova) the chromosomes are broken at random points where they "cross over", making new chromosomes out of pieces of the maternal and paternal chromosomes. In early genetic research biologists worked to situate genes on chromosomes by measuring how likely the genes were to stay together, generation after generation.


[Figure: three sketches of a maternal and a paternal chromosome, with crossover locations marked by x's and gene positions labelled x, y, z.]

Figure A.2: Illustration of the genetic recombination model. At top we see the maternal and paternal chromosomes lined up, with the locations of crossover events, determined by a Poisson process, marked by x's. The middle sketch shows them crossing over. The bottom sketch shows the new chromosomes that result. Genes are inherited together if they are at positions where the same colour goes to the same new chromosome. Thus the positions marked by x and y are inherited together, while x and z are not.


Genes that were on different chromosomes should be passed on independently (Mendel's law). Genes that were close together on a chromosome should almost always be passed on as a unit. And genes that were farther apart should be more likely than chance to be inherited together, but not certain.

In our model, the chromosomes are the unit interval, and the crossover points are a Poisson process with intensity $\lambda$. Consider two points $x < y$ on the interval, representing the locations of two genes. We might first ask for the probability that there is no crossover between $x$ and $y$. Since $N_y - N_x$ has Poisson distribution with parameter $\lambda(y-x)$,
\[
P\{\text{no crossover}\} = P\{N_y - N_x = 0\} = e^{-\lambda(y-x)}.
\]

But this isn't really what we want to compute. Looking at the inheritance of $x$ and $y$, we can't tell whether there were no crossovers between those points, or 2, or 4, or any even number. So we compute
\[\begin{aligned}
P\{\text{even number of crossovers}\} &= \sum_{k=0}^\infty P\{N_y - N_x = 2k\} \\
&= \sum_{k=0}^\infty e^{-\lambda(y-x)} \frac{(\lambda(y-x))^{2k}}{(2k)!} \\
&= e^{-\lambda(y-x)} \cdot \frac{1}{2}\bigl(e^{\lambda(y-x)} + e^{-\lambda(y-x)}\bigr) \\
&= \frac{1}{2}\bigl(1 + e^{-2\lambda(y-x)}\bigr).
\end{aligned}\]

Thus, if we observe that genes at $x$ and $y$ are inherited together with probability $p > \frac12$, we can estimate the distance between them as
\[
-\frac{1}{2\lambda}\log(2p - 1).
\]
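The even-crossover probability and the distance estimate invert each other exactly, which is easy to confirm numerically (the intensity and gene positions below are invented):

```python
import math

lam = 3.0            # crossover intensity (invented)
x, y = 0.2, 0.7      # gene positions (invented)
d = y - x

# direct sum over even numbers of crossovers, truncated far into the tail
p_even = sum(math.exp(-lam * d) * (lam * d) ** (2 * k) / math.factorial(2 * k)
             for k in range(40))
closed_form = 0.5 * (1 + math.exp(-2 * lam * d))

# invert the co-inheritance probability to recover the distance
d_hat = -math.log(2 * closed_form - 1) / (2 * lam)
```

The truncated series matches the closed form, and the inversion recovers $d = y - x$ exactly.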

A.5 Thinning and merging

Consider the following problems: A hospital intensive care unit admits 4 patients per day, according to a Poisson process. One patient in twenty, on average, develops a dangerous infection. What is the probability that there will be more than 2 dangerous infections in the course of a week?

Or: The casualty department takes in victims of accidents at the rate of 4 per hour through the night, and heart attack and stroke victims at the rate of 2 per hour, each of them according to a Poisson process. What is the distribution of the total number of patients that arrive during an 8-hour shift? What can we say about the distribution of patient arrivals, ignoring their cause?
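Anticipating the theorem below: infections are a thinning of the admissions process (rate $4 \times \frac{1}{20} = 0.2$ per day, so the weekly count is $\mathrm{Poi}(1.4)$), and the two overnight streams merge into a single Poisson process of rate 6 per hour. In Python:

```python
import math

# Thinning: each admission (rate 4/day) independently becomes a dangerous
# infection with probability 1/20, so infections arrive at rate 0.2/day
# and the weekly infection count is Poi(1.4)
lam_week = 4 * (1 / 20) * 7
p_more_than_2 = 1 - sum(math.exp(-lam_week) * lam_week ** k / math.factorial(k)
                        for k in range(3))
# about 0.1665

# Merging: accident arrivals (4/hour) and heart attack/stroke arrivals
# (2/hour) merge into a single Poisson process of rate 6/hour, so the
# number of patients in an 8-hour shift is Poi(48)
shift_mean = (4 + 2) * 8
```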

Theorem A.1. Suppose we have a Poisson process with parameter $\lambda$. Denote the arrival times $T_1 < T_2 < \cdots$. We thin the process as follows: we are given a probability distribution on $\{1, \dots, K\}$ (so, we have numbers $p_i = P(i)$ for $i = 1, \dots, K$), and each arrival time is assigned to category $i$ with probability $p_i$, the assignments being independent. Let $T_1^{(i)} < T_2^{(i)} < \cdots$ be the arrival times in category $i$. Then these are independent Poisson processes, with rate parameter $\lambda p_i$ for process #$i$.

Conversely, suppose we have independent Poisson processes $T_1^{(i)} < T_2^{(i)} < \cdots$ with rate parameters $\lambda_i$. We form the merged process $T_1 < T_2 < \cdots$ by taking the union of all the times, ignoring which process they come from. Then the merged process is also a Poisson process, with parameter $\sum \lambda_i$.

Proof. Thinning: We use the local definition of the Poisson process. Start from the process T_1 < T_2 < ⋯, and let T^{(i)}_1 < T^{(i)}_2 < ⋯ be the i-th thinned process. Clearly N_i(0) is still 0. If we look at the numbers of events occurring in disjoint intervals, they are thinned from independent random variables. Since a function applied to independent random variables produces independent random variables, we still have independent increments. We have

\[
\begin{aligned}
P\{N_i(t+h) - N_i(t) = 1\} &= P\{N(t+h) - N(t) = 1\} \cdot P\{\text{assign category } i\}\\
&\quad + P\{N(t+h) - N(t) \ge 2\} \cdot P\{\text{assign category } i \text{ to exactly one}\}\\
&= (\lambda h + o(h))\,p_i + o(h)\\
&= p_i \lambda h + o(h).
\end{aligned}
\]

And by the same approach, we see that P{N_i(t+h) − N_i(t) ≥ 2} = o(h).

Independence is slightly less obvious. In general, if you take a fixed number of points and allocate them to categories, the numbers in the different categories will not be independent. The key is that there is not a fixed number of points; moving from left to right, there is always the same chance λ dt of getting an event at the next moment, and these may be allocated to any of the categories, independent of the points already allocated. A rigorous proof is easiest with the global definition. Consider N_1(t), N_2(t), …, N_K(t) for fixed t. These may be generated by the following process: Let N(t) be Poi(λt), and let (N_1(t), N_2(t), …, N_K(t)) be multinomial with parameters (N(t); (p_i)). That is, supposing N(t) = n, allocate points to bins 1, …, K according to the probabilities p_1, …, p_K. Then

\[
\begin{aligned}
P\{N_1(t) = n_1, \dots, N_K(t) = n_K\} &= P\{N(t) = n\}\, P\{N_1(t) = n_1, \dots, N_K(t) = n_K \mid N(t) = n\}\\
&= \frac{e^{-\lambda t}(\lambda t)^n}{n!} \cdot \frac{n!}{n_1! \cdots n_K!}\, p_1^{n_1} \cdots p_K^{n_K}\\
&= \prod_{i=1}^{K} \frac{e^{-\lambda_i t}(\lambda_i t)^{n_i}}{n_i!} \qquad \text{where } \lambda_i := \lambda p_i\\
&= \prod_{i=1}^{K} P\{N_i(t) = n_i\}.
\end{aligned}
\]

Since counts involving distinct intervals are clearly independent, this completes the proof.

Merging: This is left as an exercise.

Thus, in the questions originally posed, the arrivals of patients who develop dangerous infections (assuming they are independent) form a Poisson process with rate 4/20 = 0.2 per day. The number of such patients in the course of a week is then Poi(1.4), so the probability that this is > 2 is
\[
1 - e^{-1.4}\left(1 + 1.4 + \frac{1.4^2}{2}\right) = 0.167.
\]

The casualty department takes in two independent Poisson streams of patients with total rate 6, so the merged stream is a Poisson process with parameter 6. The number of patients in 8 hours has a Poi(48) distribution.
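These two answers are easy to check numerically (a Python sketch; the helper function is ours):

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

# Thinned ICU process: rate 4/day times 1/20 = 0.2/day, so Poi(1.4) per week.
mu = 0.2 * 7
p_more_than_2 = 1.0 - sum(poisson_pmf(k, mu) for k in range(3))
print(round(p_more_than_2, 3))  # 0.167

# Merged casualty process: rates 4 + 2 = 6/hour, so Poi(48) per 8-hour shift.
mean_patients = 6 * 8
```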

A.6 Poisson process and the uniform distribution

The presentation here is based on section V.4 of [TK98].

Example 1.4: Discounted income


Companies evaluate their future income stream in terms of present value: Earning £1 next year is not worth as much as £1 today; put simply, £1 of income t years in the future is worth £e^{−θt} today, where θ is the interest rate.

A company makes deals at random times, according to a Poisson process with parameter λ. For simplicity, let us say that each deal is worth £1. What are the expectation and variance of the total present value of all its future deals?

The problem here is that, while we know the distribution of the number of deals over a span [0, t], the quantity of interest depends on the precise times.

Theorem A.2. Let T_1 < T_2 < ⋯ be the arrival process of a Poisson process. For any s < t, conditioned on the event {N(t) − N(s) = n}, the set of points in the interval (s, t] is jointly distributed as n independent points uniform on (s, t].

Proof. Intuitively this is clear: The process is uniform, so there can't be a higher probability density at one point than at another. And finding a point in (t, t + δt] doesn't affect the locations of any other points.

We can prove this formally by calculating that for any s < u < t, conditioned on N(t) − N(s) = n, the number of points in (s, u] (that is, N(u) − N(s)) has binomial distribution with parameters n and (u − s)/(t − s). That is, the number of points in the subinterval has exactly the same distribution as you would find by allocating the n points independently and uniformly. Then we need to argue that the distribution of the number of points in every subinterval determines the joint distribution. The details are left as an exercise.
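A small simulation illustrates the theorem (a Python sketch with seeded randomness; all names and parameter values are ours): among realisations with exactly n points on (0, t], the count falling in a subinterval (0, u] should be close to its Binomial(n, u/t) mean nu/t.

```python
import random

def poisson_points(lam, t, rng):
    """Arrival times of a rate-lam Poisson process on (0, t],
    built from i.i.d. exponential interarrival times."""
    pts, s = [], rng.expovariate(lam)
    while s <= t:
        pts.append(s)
        s += rng.expovariate(lam)
    return pts

rng = random.Random(1)
lam, t, u, n = 2.0, 5.0, 2.0, 10
counts = []
for _ in range(200_000):
    pts = poisson_points(lam, t, rng)
    if len(pts) == n:                       # condition on N(t) = n
        counts.append(sum(1 for p in pts if p <= u))

mean_count = sum(counts) / len(counts)
# Theorem A.2 predicts Binomial(n, u/t) counts, with mean n*u/t = 4.
print(round(mean_count, 2))
```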

Solution to the Discounted income problem: Consider the present value of all deals in the interval [0, t]. Conditioned on there being n deals, these are independent and uniformly distributed on [0, t]. The expected present value of a single deal is then
\[
\int_0^t t^{-1} e^{-\theta s}\, ds = (\theta t)^{-1}\left(1 - e^{-\theta t}\right),
\]
and the variance of a single deal's value is
\[
\begin{aligned}
\sigma^2(t) &:= (2\theta t)^{-1}\left(1 - e^{-2\theta t}\right) - (\theta t)^{-2}\left(1 - e^{-\theta t}\right)^2\\
&= \frac{1}{2\theta^2 t}\left((\theta - 2t^{-1}) + 4t^{-1}e^{-\theta t} - (\theta + 2t^{-1})e^{-2\theta t}\right).
\end{aligned}
\]


Thus, the present value up to time t, call it V_t, has conditional expectation and variance
\[
E\left[V_t \mid N(t)\right] = N(t)(\theta t)^{-1}\left(1 - e^{-\theta t}\right), \qquad
\operatorname{Var}\left(V_t \mid N(t)\right) = N(t)\,\sigma^2(t).
\]
(The formula for the conditional variance depends on independence.) So we have
\[
E[V_t] = E\left[E[V_t \mid N(t)]\right] = \lambda\theta^{-1}\left(1 - e^{-\theta t}\right).
\]
Of course, as t → ∞ this will converge to λ/θ. For the variance, we use the formula Var(V) = E[Var(V | X)] + Var(E[V | X]), so that
\[
\begin{aligned}
\operatorname{Var}(V_t) &= E[N(t)]\,\sigma^2(t) + (\theta t)^{-2}\left(1 - e^{-\theta t}\right)^2 \operatorname{Var}(N(t))\\
&= \lambda \cdot \frac{1}{2\theta^2}\left((\theta - 2t^{-1}) + 4t^{-1}e^{-\theta t} - (\theta + 2t^{-1})e^{-2\theta t}\right) + \theta^{-2}t^{-2}\left(1 - e^{-\theta t}\right)^2 \lambda t\\
&\xrightarrow{\ t\to\infty\ } \frac{\lambda}{2\theta}.
\end{aligned}
\]
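The limiting formulas E[V_t] → λ/θ and Var(V_t) → λ/(2θ) can be checked by simulation (a Python sketch with seeded randomness; the parameter values λ = 2, θ = 1, t = 8 are arbitrary choices of ours):

```python
import math, random

def present_value(lam, theta, t, rng):
    """Total discounted value of deals (each worth 1) on [0, t]:
    sum of exp(-theta * T_i) over the Poisson(lam) arrival times T_i."""
    v, s = 0.0, rng.expovariate(lam)
    while s <= t:
        v += math.exp(-theta * s)
        s += rng.expovariate(lam)
    return v

rng = random.Random(42)
lam, theta, t, n = 2.0, 1.0, 8.0, 100_000
vals = [present_value(lam, theta, t, rng) for _ in range(n)]
mean = sum(vals) / n
var = sum((v - mean) ** 2 for v in vals) / (n - 1)

# Theory: E[V_t] = (lam/theta)(1 - exp(-theta*t)) ~ 2, Var(V_t) ~ lam/(2*theta) = 1.
print(round(mean, 2), round(var, 2))
```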


Appendix B

Assignments


B.1 Modern Survival Problem sheet 1:Counting processes and martingales

Due at noon, Friday 23 October

(1) We have a point process with intensity

\[
\lambda(t) = \begin{cases} 1 & \text{if } 0 \le t < 5,\\ 2 & \text{if } 5 \le t < 10,\\ 0 & \text{otherwise.} \end{cases}
\]

Let (N(t))_{t≥0} be the counting process and let T_i be the time of the i-th event, or ∞ if there is no i-th event.

(a) i. Find the distribution of the total number of events.

ii. Find the distribution of T2 − T1.

iii. Show that Ti →∞ in probability.

(b) In one realisation we observe the points 1.3, 1.7, 3.2, 4.8, 5.5, 6.2, 6.4, 7.0, 7.8, 8.3, 8.4, 8.8, 9.2, 9.9.

i. What is the compensator A(t) for N(t)?

ii. Sketch a typical realisation of N(t).

iii. Sketch the martingale N(t)−A(t).

(2) Let λ be any positive function and Λ(t) = ∫_0^t λ(s) ds. Suppose X is a random variable with exponential distribution with parameter 1. Show that Λ^{−1}(X) is a random variable with hazard rate λ(t).

(3) Suppose X and Y are a pair of random variables with joint density f(x, y). Let g : R → R be a function such that E[|g(X)|] < ∞, and let Y be the σ-algebra generated by Y. Show that E[g(X) | Y] = h(Y), where
\[
h(y) := \frac{\int_{-\infty}^{\infty} g(x) f(x, y)\, dx}{\int_{-\infty}^{\infty} f(x, y)\, dx}.
\]

Explain the relationship between this formula and the conditional expectations you learned in prelims and Part A probability.

(4) Let (N(t)) be the counting process associated with the Poisson process with intensity λ, and let (M(t)) = (N(t) − λt) be the associated martingale. Show that λt is the compensator for M². (Hint: Expand M(t)² = (N(t) − λt)², and use known properties of the Poisson process.)

(5) Later in the course we will discuss current status data, which is a form of extreme censoring. Individuals have an unobserved (assumed i.i.d.) event time Ui. What is observed is a census time Ci (independent of Ui), and δi = 1{Ui ≤ Ci}.

(a) Assuming Ui has density fλ and cdf Fλ, write an expression for the log likelihood of Ui.


(b) Suppose the distribution is exponential, so f_λ(u) = λe^{−λu}. What is the relative efficiency of estimation (that is, ratio of expected Fisher information) based on the current status data, compared with complete observation of Ui?

(c) Suppose you can choose the distribution of Ci (but still independent of Ui). How would you maximise the expected information?

(6) Suppose we have an inhomogeneous Poisson process N(t), whose intensity starts out as either 1 or 2, each with probability 1/2, and immediately after each event the intensity is determined again by an independent coin flip. Let λ(t) be the intensity at time t.

(a) Suppose λ(t) is observed (i.e., λ(t) ∈ F_t). Find the compensator (i.e., the cumulative intensity process) of N(t). Find the predictable variation process and the optional variation process for the martingale M(t) obtained by subtracting the compensator from N(t). Compute Var(M(t)).

(b) Now suppose λ(t) is unobserved (i.e., λ(t) ∉ F_t). Find the compensator. (Remember that the compensator must be F_t-adapted.) Find the predictable variation process and the optional variation process of the martingale obtained by subtracting this compensator from N(t). Is its variance the same as Var(M(t)) from part (a)? Why or why not?


B.2 Modern Survival Problem sheet 2:Nonparametric estimation of survival curves

To be turned in by noon on Friday, 30 October, 2015

(1) Consider a situation where the multiplicative intensity model holds, and there is unobserved right censoring. That is, for some individuals we observe the event time Ti and δi = 1; for others, we observe δi = 0 and no event time. Suppose the right censoring is independent of event times, all n individuals are independent, and the distribution of censoring times is known to have cdf G. (So G(c) is the probability of being censored before time c.) Let t_1 < t_2 < ⋯ < t_k be the observed event times (assumed distinct).

Show that

\[
\hat{A}(t) = \sum_{t_i \le t} \left((n - i + 1)\left(1 - G(t_i)\right)\right)^{-1}
\]

is an unbiased estimator for the cumulative hazard, and derive an estimator for the variance.

(2) Nonparametric estimators are inevitably less efficient (that is, have larger errors, on average) than parametric estimators. Consider the case when n individuals are observed up to time t. Their event times are independent and exponentially distributed with unknown parameter λ, and we observe all event times. We wish to compare two different possible estimators for the cumulative hazard up to time t: First, taking advantage of the knowledge that the data come from an exponential distribution; and second, using the nonparametric Nelson–Aalen estimator.

(a) i. Show that the MLE for Λ(t) under the exponential model is
\[
\frac{nt}{\sum_{i=1}^n T_i}.
\]

ii. Compute the (approximate) variance for this estimator.

(b) Using the inequality
\[
\log n + \gamma \le \sum_{i=1}^{n} \frac{1}{i} \le \log n + \gamma + \frac{1}{2n},
\]
show that the Nelson–Aalen estimator for S(t) is approximately Y(t+)/n (the empirical fraction surviving to time t), and find a bound for the error — that is, for the maximum difference between the estimate of S(s) and Y(s+)/n on 0 ≤ s ≤ t.

(c) Use this to estimate the variance of A(t), the Nelson–Aalen estimator. Show that this variance is larger than the variance for the parametric estimator above.

(d) Plot the ratio of the variances for a range of values of t, for the case λ = 1 and n = 1000.

(e) Why might one prefer to use the nonparametric estimator, even when there is no censoring?


(3) You may find R code at http://steinsaltz.me.uk/survival/countingprocess.R that simulates and plots a survival process that starts with n = 50 individuals, each with constant mortality rate α = 1. What is the intensity λ of the survival process at time t?

(a) R code runs faster if you replace loops with vector operations. Can you get rid of the loopin this code?

(b) Modify the code to plot the martingale N(t) − ∫_0^t λ(s) ds.

(c) Add a routine that computes the optional variation process. Run a simulation and plot it.

(d) Modify the program to apply to α(t) = 2t.

(4) The data set ovarian, included in the survival package, presents data for 26 ovarian cancer patients, receiving one of two treatments, which we will refer to as the single and double treatments. (They appear in the data set as the rx variable, taking on values 1 and 2 respectively.)

(a) Create a survival object for the times in this database.

(b) Compute and plot the Kaplan–Meier estimator for the survival curves. (For a small extra challenge, plot the single-treatment survival curve in black, and the double-treatment curve in red.) You may use the survfit function.

(c) Compute the Nelson–Aalen survival curve estimate. Make a table of the relevant data (time of events, number of events, number at risk).

(d) Compute the standard error for the probability of survival past 400 days in each group, as estimated by the Nelson–Aalen and Kaplan–Meier estimators.


B.3 Modern Survival Problem sheet 3:Estimating quantiles and excess mortality

To be turned in by noon on 6 November, 2015

(1) Show that Duhamel’s equation (6.5) holds at a point s where S1 or S2 is discontinuous.

(2) Below is code that calculates confidence intervals for quantiles of the survival curve. Here SF is the output of the survfit(S ~ 1) command for a survival object S.

quantileCI = function(SF, p, alpha = 0.05) {
  sb.fitNA = NAest(SF)
  z = qnorm(1 - alpha/2)
  a = -log(p)
  se = sqrt(sb.fitNA$Var)
  xpless = sb.fitNA$Hazard + z * se
  xpmore = sb.fitNA$Hazard - z * se
  upper = min(which(xpmore >= a))
  lower = max(which(xpless <= a))
  c(sb.fitNA$time[lower], sb.fitNA$time[upper])
}

The function NAest is a homemade function, given in Figure 6.3, to compute Nelson–Aalen estimators.

(a) Explain why this is a reasonable estimator for the quantiles of the survival function. (For more information about quantile estimation, see Section 3.2.3 of Aalen's book. Note that the book is available electronically through the Bodleian website.)

(b) Use this to compute a 95% confidence interval for the time when 80% of the population survive in the ovarian data set (ignore treatment type);

(c) Use this to compute a 95% confidence interval for median survival in a collection of data simulated from an exponential distribution with parameter 1, available on the course web site. Note that this file has been produced with the save command, and can be loaded into R with the command load('filename'). You will then have two vectors of length 1000, called T and ev. The former is the time of event or right-censoring, the latter is TRUE (for event) or FALSE (for censoring).

(d) Suppose we ignored the censoring, so only included the uncensored times. What would you estimate for the median survival time?

(3) Look back to the derivation in the lecture notes section 7.2 of the estimator Γ(t) for cumulative excess mortality in the two-sample setting. Think of kc(t) as arbitrary predictable random variables.

(a) Construct a martingale to show that the estimator (7.5) for excess mortality in the two-sample case is unbiased for an appropriate choice of kc(t). What conditions must kc(t) satisfy?


(b) Find an expression for estimating the variance of Γ.

(c) Show that for the particular choice (7.4) the bound (7.6)
\[
\operatorname{Var}(\Gamma(t)) \approx \sum_{t_i \le t} \left(\sum_{c} Y(c, -; t_i)\right)^{-2}
\]
is a conservative estimate for the variance of the estimator. That is, it is a good estimate for large samples, and tends not to underestimate the variance.

(d) Since any choice of kc yields an estimator, we are free to make a convenient choice. Why is the choice (7.4) a good one?

(e) Supposing the groups to be of approximately equal size, what will the relation be between the variance of our estimator for the cumulative excess mortality, and the variance we would estimate for the difference in the cumulative hazards between the groups Gi = 0 and Gi = 1, ignoring the classification ci?


B.4 Modern Survival Problem sheet 4:Nonparametric testing and semiparametric models

To be turned in by noon on 20 November, 2015

(1) The object tongue in the package KMsurv lists survival or right-censoring times in weeks after diagnosis for 80 patients with tongue tumours. The type variable is 1 or 2, depending on whether the tumour was aneuploid or diploid respectively.

(a) Use the log-rank test to test whether the difference in survival distributions is significant atthe 0.05 level.

(b) Repeat the above with a test that emphasises differences shortly after diagnosis.

(c) Calculate and plot the estimated excess mortality for aneuploid compared with diploid.

(2) You may be familiar with the Wilcoxon rank-sum test (also called the Mann–Whitney U test). This is a nonparametric substitute for the t-test, for comparing two samples, to test whether they came from the same distribution, without assuming that the distributions are normal. Look up the properties of this test. Show that the class of non-parametric test statistics that we have defined in section 8.1 includes the Wilcoxon rank-sum statistic, in the special case where there is no censoring or truncation. What weight function do we need to take to recover the rank-sum statistic? Derive the sampling distribution of the rank-sum statistic. Optional: Try out the two statistics on some simulated data. They may be drawn from any distribution you like.

(3) (Based on problem 4.6 of [ABG08].) Suppose we have an additive-hazards model where an individual has covariates (X_1, …, X_p) and the individual hazards are then
\[
\alpha(t) = \beta_0(t) + \beta_1(t)X_1 + \dots + \beta_p(t)X_p,
\]
where the X_k are random variables. An observation consists of a single right-censored event.

(a) Suppose the variable Xp is not observed, so is not included in the model. If the random variables Xk are all independent, show that the remaining model is still an additive-hazards model with a different baseline hazard β0(t).

(b) Suppose the random variables Xk are multivariate normal (but not independent). How does the model change when Xp is dropped?

(4) In section 9.5.1 we describe fitting the Aalen additive hazards model for the special case of a single (possibly time-varying) covariate. Suppose we constrain the assumptions further, to assume that xi is constant in time, and takes on only the values 0 and 1. Explain how this is related to the excess mortality model. Compare the results we would obtain from the methods described in this section, to those obtained from the methods of section 7.2.


(5) Refer to the AML study, which is described at length in Example 8.1.4 and analysed with the Cox model in section 11.3. Using the data described in those places, estimate the difference in cumulative hazard to 20 weeks between the two groups by

(a) The nonparametric method described in section 7.2;

(b) The semiparametric method based on the relative-risk regression.

(c) Using the proportional hazards method, suppose an individual were to switch from maintenance to non-maintenance after 10 weeks, and suppose the hazard rates change instantaneously. Estimate the difference in cumulative hazard to 20 weeks between that individual and one who had always been in the non-maintenance group.


B.5 Modern Survival Problem sheet 5:Relative risks and diagnostics

To be turned in by 2pm on 27 November, 2015

(1) Let N(t) be a counting process with additive hazards λ_i(t) = λ_0(t) + Σ_{k=1}^p x_{ik}(t)β_k(t), with B_k(t) = ∫_0^t β_k(s) ds. As in Lecture 13 we define N(t) to be the vector of the individual counting processes (so it is a binary vector), and similarly X(t) the matrix of covariates, and B̂(t) the vector of regression coefficient estimators. Define the martingale residual
\[
M_{\mathrm{res}}(t) = \int_0^t J(s)\, dN(s) - \int_0^t J(s)\, X(s)\, d\hat{B}(s),
\]
where J(s) is the indicator of X(s)^T X(s) having full rank, hence of X^−(s) being nonzero.

(a) Using the fact that
\[
J(s)\left(I - X(s)X^{-}(s)\right)X(s) \equiv 0,
\]
show that M_res is a martingale. (That is, every component is a martingale.)

(b) Suppose now that all covariates are fixed and the data are right-censored, and let τ be the final time under consideration (such that J(τ) = 1). Show that
\[
X(0)^T M_{\mathrm{res}}(\tau) = 0.
\]
(For time-fixed covariates we define X(t) := Y(t)X, where Y(t) is the matrix with the at-risk indicators Y_i(t) on the diagonal.)

(c) How might this fact be used as a model-diagnostic for the additive-hazards assumption?

(2) Let
\[
\begin{aligned}
X_i(t) &= \text{vector of observed covariates for individual } i \text{ at time } t;\\
N_i(t) &= \text{counting process for individual } i \text{ at time } t;\\
\hat\beta &= \text{estimate of Cox regression coefficients};\\
\hat{A}_0(t) &= \text{estimate of baseline hazard in Cox model};\\
\hat{M}_i(t) &= N_i(t) - \int_0^t Y_i(s)\, e^{\hat\beta^T X_i(s)}\, d\hat{A}_0(s), \text{ the martingale residuals};\\
\bar{X}_k(t) &= \frac{\sum_{i=1}^n Y_i(t) X_{ik}(t)\, e^{\hat\beta^T X_i(t)}}{\sum_{i=1}^n Y_i(t)\, e^{\hat\beta^T X_i(t)}};\\
U_k(t) &= \sum_{i=1}^n \int_0^t \left[X_{ik}(s) - \bar{X}_k(s)\right] d\hat{M}_i(s).
\end{aligned}
\]

Uk is called the score process.


(a) Show that
\[
U_k(t) = \sum_{t_j \le t} \left(X_{i_j k}(t_j) - \bar{X}_k(t_j)\right),
\]
where i_j is the individual with an event at time t_j. (The summands here are called Schoenfeld residuals.)

(b) Show that the score process is the conditional expectation of the partial derivative of the log likelihood with respect to the coefficient β_k, conditioned on F_t.

(c) Conclude that Uk(0) = Uk(∞) = 0.

(d) Explain why a plot of U_k(t), suitably scaled, would be expected to look like a random walk conditioned to start and end at 0 (a discrete bridge) if the proportional hazards assumption holds.

(3) (Based on Exercise 11.1 of [KM03].) The dataset larynx in the package KMsurv includes times of death (or censoring by the end of the study) of 90 males diagnosed with cancer of the larynx between 1970 and 1978 at a single hospital. One important covariate is the stage of the cancer, coded as 1, 2, 3, 4.

(a) Why would it probably not be a good idea to fit the Cox model with relative risk e^{β·stage}?

(b) Use a martingale residual plot to show that stage does not enter as a linear covariate.

(c) An alternative is to define three new binary covariates, coding for the patient being in stage 2, 3, or 4 respectively (leaving stage 1, where all three covariates are 0, as the baseline group). Fit this model. Are all of these covariates statistically significant?

(d) An equivalent approach is to replace stage in the model definition by factor(stage).Show that this produces the same result.

(e) Try adding year of diagnosis or age at diagnosis as a linear covariate (in the exponent of the relative risk). Is either statistically significant?

(f) Use a residual plot to test whether one or the other of these covariates might more appropriately enter the model in a different functional form — for example, as a step function.

(g) Use a Cox-Snell residual plot to test whether the Cox model is appropriate to these data.


B.6 Modern Survival Problem sheet 6:Censoring and truncation, frailty and repeated events

To be turned in by noon on 15 January, 2016

(1) A sample of patients taking a new blood pressure medication is asked whether they have experienced any vertigo since they started taking it; and if so, when the symptoms were first noticed. Some have not experienced symptoms yet, some report an exact time (in weeks after starting treatment), and some only say they know it was before a certain time.

Table B.1: Reports of vertigo

weeks | # taking it this many weeks who never had symptoms | # whose symptoms started at this many weeks | # whose symptoms started before this many weeks
    1 | 45 |  6 |  0
    2 | 22 | 11 |  0
    3 | 23 | 10 |  0
    4 | 19 | 22 |  3
    5 | 12 | 37 |  2
    6 | 10 | 33 |  6
    7 |  3 | 16 |  9
    8 |  5 | 13 |  4
    9 |  3 |  8 |  9
   10 |  0 |  9 | 15

Which observations are left-censored? Right-censored? Estimate the survival function (that is, the probability of remaining symptom-free for x weeks):

(a) Ignoring the left-censored observations;

(b) Ignoring the right-censored observations;

(c) Taking all observations into account.

(2) In order to control the spread of a virus in a wild population, researchers spread food items laced with a vaccine. Once a week they capture a small number of animals and test whether they have developed an immune response:

week            1  2  3  4  5  6  7  8  9  10
number sampled  5  4  7  3  4  6  3  8  5   4
number immune   0  1  2  0  2  1  2  4  4   3


Estimate the probability of being immune at week t:

(a) using an exponential model;

(b) using a Weibull model;

(c) using the nonparametric MLE.

(3) A population has multiplicative frailty, so that the mortality rate for individual i is B_i α(x) at age x, where the B_i are i.i.d. positive random variables and lim_{x→∞} α(x) = ∞.

(a) Show that the population mortality goes to ∞ as t → ∞ if the distribution of B_i is bounded away from 0.

(b) Show that the population mortality converges to a finite constant as t → ∞ if the distribution of B_i has nonzero density at 0 and the hazard rate does not grow too quickly as x → ∞. Give a formal condition for what "too quickly" would be.

(c) Suppose now that the baseline hazard is Gompertz, i.e., α(x) = e^{θx}.

i. If the B_i have Gamma distribution with parameters (r, λ) — λ is the rate parameter — compute the population mortality rate µ(t) at age t.

ii. What is the hazard ratio between a subpopulation whose frailty has Gamma distribution with parameters (r, λ) and one with parameters (r′, λ)?

(4) The paper [ZKJ07] includes a dataset, available to download from the Journal of Statistical Science, on the healthcare demand of 4406 patients in the public old-age health insurance scheme Medicare in the US. When you load this file in, the data will be in a data-frame DebTrivedi.

(a) The number of physician office visits is enumerated in the variable ofp, while numchron gives the number of chronic conditions, and health gives self-reported health (poor, average, excellent). Do one or more exploratory plots to illustrate the distributions of these variables, and their relationship.

(b) Fit a Poisson regression model to predict the number of office visits as a function of health, numchron, gender, school (number of years of schooling), and privins (indicator of whether the patient has private insurance). Interpret the result.

(c) Explain why you might want to fit a negative binomial model instead. Do the fit, and interpret the result.


Appendix C

Solutions


C.1 Modern Survival Problem sheet 1:Counting processes and martingales

(1) (a) We have a point process with intensity

\[
\lambda(t) = \begin{cases} 1 & \text{if } 0 \le t < 5,\\ 2 & \text{if } 5 \le t < 10,\\ 0 & \text{otherwise.} \end{cases}
\]

Let (N(t))_{t≥0} be the counting process and let T_i be the time of the i-th event, or ∞ if there is no i-th event.

i. Find the distribution of the total number of events.

The cumulative intensity is
\[
\Lambda(t) = \begin{cases} t & \text{if } t \le 5,\\ 2t - 5 & \text{if } 5 < t \le 10,\\ 15 & \text{if } t > 10. \end{cases}
\]
The distribution of the total number of events is Poisson with parameter 15.

ii. Find the distribution of T_2 − T_1.

The inverse cumulative intensity is
\[
\Lambda^{-1}(s) = \begin{cases} s & \text{if } s \le 5,\\ s/2 + 2.5 & \text{if } 5 < s \le 15,\\ \infty & \text{if } s > 15. \end{cases}
\]
We may think of N(t) as N′(Λ(t)), where N′ is the counting process of a Poisson process with unit intensity. Thus

\[
\begin{aligned}
P\{T_2 - T_1 \ge t\} &= P\{\Lambda^{-1}(T'_2) - \Lambda^{-1}(T'_1) \ge t\}\\
&= \int_0^{15} P\{T'_2 \ge \Lambda(t + \Lambda^{-1}(s)) \mid T'_1 = s\}\, e^{-s}\, ds\\
&= \int_0^{15} e^{-\Lambda(t + \Lambda^{-1}(s)) + s}\, e^{-s}\, ds\\
&= \int_0^{5} e^{-\Lambda(t+s)}\, ds + \int_5^{15} e^{-\Lambda(t + s/2 + 2.5)}\, ds.
\end{aligned}
\]


For t < 5 this is
\[
\begin{aligned}
&\int_t^5 e^{-s}\, ds + \int_5^{5+t} e^{-(2s-5)}\, ds + \int_5^{15-2t} e^{-(2t+s)}\, ds + \int_{15-2t}^{15} e^{-15}\, ds\\
&= \left(e^{-t} - e^{-5}\right) + \frac{1}{2}\left(e^{-5} - e^{-5-2t}\right) + \left(e^{-5-2t} - e^{-15}\right) + 2te^{-15}\\
&= e^{-t} - \frac{1}{2}e^{-5} + \frac{1}{2}e^{-5-2t} + (2t - 1)e^{-15}.
\end{aligned}
\]

For 5 ≤ t ≤ 10 it is
\[
\int_t^{10} e^{-(2s-5)}\, ds + (t+5)e^{-15} = \frac{1}{2}e^{-(2t-5)} + (t + 4.5)e^{-15}.
\]
Finally, for t ≥ 10 we have P{T_2 − T_1 ≥ t} = 15e^{-15}.

This may seem not quite right, since we might be inclined to say
\[
P\{T_2 = \infty\} = P\{N(\infty) \le 1\} = 16e^{-15}.
\]
The difference is the event T_1 = ∞, which has probability e^{-15}. The calculation above integrated over values of T_1 on (0, ∞), so implicitly excluded the event T_1 = ∞. In fact, it is not clear that T_2 − T_1 is defined on this event.

iii. Show that T_i → ∞ in probability.

For any fixed positive real K,
\[
P\{T_i \le K\} = P\{N(K) \ge i\} \le P\{N(\infty) \ge i\} \xrightarrow{\ i\to\infty\ } 0.
\]

(b) i. NOTE: This question was a bit inconsistent. The numbers relate to a version of the question in which you were supposed to sketch this particular realisation. I changed it to sketching a "typical realisation" but left the numbers in. Apologies if this was confusing.

ii. What is the compensator A(t) for N(t)?

The compensator for the counting process is the cumulative hazard rate
\[
A(t) = \Lambda(t) = \begin{cases} t & \text{if } 0 \le t < 5,\\ 2t - 5 & \text{if } 5 \le t < 10,\\ 15 & \text{if } t \ge 10. \end{cases}
\]

iii. Sketch N(t).

The counting process is N(t) = Σ_{i=1}^n 1{T_i ≤ t}.


[Figure: step plot of the realised counting process N(t) against time on [0, 12], increasing by 1 at each observed event, from 0 up to 14.]

iv. Sketch the martingale associated with N(t).

The martingale associated with N(t) is N(t) − A(t):

[Figure: plot of the martingale N(t) − A(t) against time on [0, 12], jumping by +1 at each event and decreasing at rate λ(t) in between, ranging between about −5 and 1.]


(2) Let λ be any positive function and Λ(t) = ∫_0^t λ(s) ds. Suppose X is a random variable with exponential distribution with parameter 1. Show that Λ^{−1}(X) is a random variable with hazard rate λ(t).

We have X ∼ Exp(1). We want the distribution of Y = Λ^{−1}(X).

If we let F_X and F_Y be the corresponding cdfs, we have F_X(x) = P{X ≤ x} = 1 − e^{−x}, so
\[
\begin{aligned}
F_Y(y) = P\{Y \le y\} &= P\{\Lambda^{-1}(X) \le y\}\\
&= P\{X \le \Lambda(y)\} \qquad \text{because } \Lambda \text{ is strictly increasing}\\
&= 1 - e^{-\Lambda(y)},
\end{aligned}
\]
which is the cdf of a random variable with hazard rate λ.
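This is exactly the inverse-transform recipe for simulating event times with a given hazard. A Python sketch for the hazard λ(t) = 2t, so Λ(t) = t² and Λ^{−1}(s) = √s (seeded randomness; the function name is ours):

```python
import math, random

rng = random.Random(0)

def sample_event_time(rng):
    """Simulate a time with hazard rate lambda(t) = 2t:
    draw X ~ Exp(1) and return Lambda^{-1}(X) = sqrt(X)."""
    return math.sqrt(rng.expovariate(1.0))

n = 100_000
samples = [sample_event_time(rng) for _ in range(n)]

# The cdf should be F(t) = 1 - exp(-Lambda(t)) = 1 - exp(-t^2).
t = 1.0
empirical = sum(1 for s in samples if s <= t) / n
theoretical = 1.0 - math.exp(-t * t)
assert abs(empirical - theoretical) < 0.01
```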

(3) Suppose X and Y are a pair of random variables with joint density f(x, y). Let g : R → R be a function such that E[|g(X)|] < ∞, and let Y be the σ-algebra generated by Y. Show that E[g(X) | Y] = h(Y), where
\[
h(y) := \frac{\int_{-\infty}^{\infty} g(x) f(x, y)\, dx}{\int_{-\infty}^{\infty} f(x, y)\, dx}.
\]

Explain the relationship between this formula the conditional expectations you learned inprelims and Part A probability.Trivially, h(Y ) ∈ Y, so we need to show that for any other random variable Z ∈ Y, E[g(X)Z] =E[h(Y )Z]. Note that

h(y) :=

∫∞−∞ f(x, y)g(x)dx

fY (y),

where fY (y) =∫∞−∞ f(x, y)dx is the marginal density of Y ,

We may write a random variable Z ∈ Y as z(Y ), for some function z : R → R. Thush(Y )Z = h(Y )z(Y ), and

E[h(Y )Z

]=

∫ ∞−∞

fY (y)h(y)z(y)dy

=

∫ ∞−∞

∫ ∞−∞

f(x, y)g(x)z(y)dxdy

= E[g(X)z(Y )

]= E

[g(X)Z

].

(4) Let (N(t)) be the counting process associated with the Poisson process with intensity λ, and let (M(t)) = (N(t) − λt) be the associated martingale. Show that λt is the compensator for M².

Trivially, λt is predictable. (It is both continuous and deterministic.) To prove that λt is a compensator for M(t)², we need to prove that M(t)² − λt is a martingale. Let s < t; then


M(t)² = (N(t) − λt)²
      = (N(t) − N(s))² + 2(N(t) − N(s))(N(s) − λt) + N(s)² − 2λtN(s) + (λt)².

Conditioned on Fs, N(t) − N(s) has Poisson distribution with parameter λ(t − s), so

E[ (N(t) − N(s))² | Fs ] = Var( N(t) − N(s) | Fs ) + E[ N(t) − N(s) | Fs ]² = λ(t − s) + (λ(t − s))².

Thus

E[ M(t)² − λt | Fs ] = λ(t − s) − λt + (λ(t − s))² + 2λ(t − s)(N(s) − λt) + N(s)² − 2λtN(s) + (λt)²
                     = N(s)² − 2λsN(s) + (λs)² − λs
                     = M(s)² − λs.
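A quick Monte Carlo sanity check (Python, not part of the course's R code) of the identity E[M(t)²] = λt implied by this compensator, with the illustrative values λ = 2 and t = 3:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t = 2.0, 3.0
N = rng.poisson(lam*t, size=500_000)   # N(t) for a rate-lam Poisson process
M = N - lam*t                          # the associated martingale at time t
mean_M2 = (M**2).mean()                # should be close to lam*t = 6
```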

(5) Later in the course we will discuss current status data, which is a form of extreme censoring. Individuals have an unobserved (assumed i.i.d.) event time Ui. What is observed is a census time Ci (independent of Ui), and δi = 1{Ui ≤ Ci}.

(a) Assuming Ui has density fλ and cdf Fλ, write an expression for the log likelihood of Ui.

(b) Suppose the distribution is exponential, so fλ(u) = λe^{−λu}. What is the relative efficiency of estimation (that is, ratio of expected Fisher information) based on the current status data, compared with complete observation of Ui?

(c) Suppose you can choose the distribution of Ci (but still independent of Ui). How would you maximise the expected information?

(a)

ℓ(λ) = Σ_{i=1}^n [ δi log Fλ(Ci) + (1 − δi) log(1 − Fλ(Ci)) ].

(b)

ℓ(λ) = Σ_{i=1}^n δi log(1 − e^{−λCi}) − λ Σ_{i=1}^n (1 − δi) Ci.

ℓ'(λ) = Σ_{i=1}^n δi Ci/(e^{λCi} − 1) − Σ_{i=1}^n (1 − δi) Ci.

ℓ''(λ) = − Σ_{i=1}^n δi Ci² e^{λCi}/(e^{λCi} − 1)².


The expected information is thus

I(λ) = n E[ C² e^{λC}/(e^{λC} − 1)² · 1{U < C} ]
     = n E[ C² e^{λC}/(e^{λC} − 1)² · P{U < C | C} ]
     = n E[ C² e^{λC}/(e^{λC} − 1)² · (1 − e^{−λC}) ]
     = n E[ C²/(e^{λC} − 1) ].

With complete observation the expected information is n/λ². Thus, the relative efficiency is

E[ λ²C²/(e^{λC} − 1) ].

(c) The function x²/(e^x − 1) has a unique maximum value of about 0.648, at x0 ≈ 1.594. Thus the relative efficiency is no more than 0.648, and may be made as close as we like to that value by making C close to being deterministically 1.594/λ. Of course, we can't do that without knowing λ, in which case we wouldn't need to do the experiment. One could imagine using an adaptive procedure, where λ̂(k) is the MLE based on the first k observations, and then we choose C_{k+1} = 1.594/λ̂(k).
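The maximiser of x²/(e^x − 1) can be checked numerically; a quick Python grid search (not part of the course's R code):

```python
import numpy as np

x = np.linspace(0.01, 6.0, 600_001)
f = x**2 / np.expm1(x)        # x^2/(e^x - 1), computed stably via expm1
i = int(np.argmax(f))
x0, fmax = float(x[i]), float(f[i])   # location and value of the maximum
```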

(6) Suppose we have an inhomogeneous Poisson process N(t), whose intensity starts out as either 1 or 2, each with probability 1/2, and immediately after each event the intensity is determined again by an independent coin flip. Let λ(t) be the intensity at time t.

(a) Suppose λ(t) is observed (i.e., λ(t) ∈ Ft). Find the compensator (i.e., the cumulative intensity process) of N(t). Find the predictable variation process and the optional variation process for the martingale M(t) obtained by subtracting the compensator from N(t). Compute Var(M(t)).

Let (Gt) be the natural filtration of (N(t)), and let Ft = Gt ∨ 〈λ(s) : s ≤ t〉; that is, Ft is obtained from Gt by adding the variables λ(s) for s ≤ t. Then λ^F(t) = λ(t) ∈ Ft, so

Λ^F(t) = ∫_0^t λ(s) ds = t + ∫_0^t 1{λ(s)=2} ds.

Defining the martingale M(t) = N(t) − Λ(t),

d⟨M⟩(t) = Var(dM(t) | Ft) = λ(t) dt,   so   ⟨M⟩(t) = Λ(t).


[M](t) = Σ_{Ti≤t} ΔM(Ti)² = N(t), since the jumps have size 1.

Var(M(t)) = E[⟨M⟩(t)] = E[ ∫_0^t λ(s) ds ] = ∫_0^t E[λ(s)] ds.

You might think that E[λ(s)] = 3/2, since each time a new λ is chosen, it has equal chances of being 1 or 2. But in fact, when λ takes on the value 2 it spends only half as long on average in that state as when it takes on the value 1, so we may expect that E[λ(t)] → 4/3 as t → ∞.

Formally, we observe that λ(t) is a Markov chain on the states {1, 2}, with Q-matrix

Q = ( −1   1
       2  −2 ).

The transition probabilities at time t are given by

e^{tQ} = (1/3) ( 2 + e^{−3t}    1 − e^{−3t}
                 2 − 2e^{−3t}   1 + 2e^{−3t} ).

Given that we start in the distribution (1/2, 1/2), the distribution at time t is (2/3 − e^{−3t}/6, 1/3 + e^{−3t}/6). Thus

E[λ(t)] = 4/3 + e^{−3t}/6,

and finally

Var(M(t)) = ∫_0^t ( 4/3 + e^{−3s}/6 ) ds = (4/3)t + (1/18)(1 − e^{−3t}).

(b) Now suppose λ(t) is unobserved (i.e., λ(t) ∉ Ft). Find the compensator. Find the predictable variation process and the optional variation process of M̃, where M̃ is obtained by subtracting the compensator from N(t). Is Var(M̃(t)) = Var(M(t))? Why or why not?

By the Innovation Theorem,

λ^G(t) = E[ λ^F(t) | Gt ] = 1 + P{ λ(t) = 2 | Gt }.

We know that λ(t) is independent of anything that happened before the last jump, so conditioning on Gt is equivalent to conditioning on the stopping time T_*(t) := last jump before time t, or, equivalently, conditioning on τ(t) := t − T_*(t) ∈ Gt. That is, if we write T^*(t) for the first event time after t,

P{λ(t) = 2 | Gt} = P{λ(t) = 2 | τ}
                 = P{λ(T_*) = 2 | T^* − T_* > τ}
                 = (1/2)e^{−2τ} / ( (1/2)e^{−τ} + (1/2)e^{−2τ} )
                 = e^{−τ}/(e^{−τ} + 1).

Note that

∫_0^t e^{−s}/(e^{−s} + 1) ds = log( 2/(1 + e^{−t}) ),

so if we define the inter-event times τi := Ti − T_{i−1}, we get

Λ̃(t) = t + log( 2/(1 + e^{−τ(t)}) ) + Σ_{i: Ti≤t} log( 2/(1 + e^{−τi}) ).

This is the compensator. Thus M̃(t) = N(t) − Λ̃(t) is a martingale, which has predictable variation Λ̃(t).

The optional variation does not depend on the choice of σ-algebra, and is still [M̃](t) = N(t).

Despite the fact that M̃ is a very different process from M, the variance of M̃(t) is the same as that of M(t), since both are equal to E[N(t)].


C.2 Modern Survival Problem sheet 2: Nonparametric estimation of survival curves

(1) Consider a situation where the multiplicative intensity model holds, and there is unobserved right censoring. That is, for some individuals we observe the event time Ti and δi = 1; for others, we observe δi = 0 and no event time. Suppose the right censoring is independent of event times, all n individuals are independent, and the distribution of censoring times is known to have cdf G. (So G(c) is the probability of being censored before time c.) Let t1 < t2 < · · · < tk be the observed event times (assumed distinct).

Show that

Â(t) = Σ_{ti≤t} ( (n − i + 1)(1 − G(ti)) )^{−1}

is an unbiased estimator for the cumulative hazard, and derive an estimator for the variance.

Let Ft be the σ-algebra generated by the events up to time t, and Gt the σ-algebra generated by events and censoring times up to time t.

We know that the counting process N(t) — the number of events at times ≤ t — has Gt-compensator

Λ(t) = ∫_0^t Y(s) dA(s).

Thus, the Ft-compensator, by the Innovation Theorem, is

Λ̃(t) = ∫_0^t E[ Y(s) | Fs− ] dA(s)

(since A(s) is deterministic).

Conditioned on Ft, which includes only the times of the events, the probability that an individual is still at risk at time s is 0 if they have already had their event by time s, and (1 − G(s)) if they have not. There are n − N(s−) individuals who have not yet had an event at time s, so

Λ̃(t) = ∫_0^t ( n − N(s−) )( 1 − G(s) ) dA(s).

Thus M(t) := N(t) − Λ̃(t) is an Ft-martingale. Since (n − N(s−))(1 − G(s)) is predictable, it follows that

M̃(t) := ∫_0^t ( n − N(s−) )^{−1}( 1 − G(s) )^{−1} dM(s)
       = Σ_{ti≤t} ( (n − N(ti−))(1 − G(ti)) )^{−1} − A(t)
       = Â(t) − A(t)


is also a martingale. Thus, its expectation is 0.

The optional variation is

[M̃](t) = Σ_{ti≤t} ( n − i + 1 )^{−2}( 1 − G(ti) )^{−2},

which may thus serve as an unbiased estimator for the variance of Â(t).

Note that this estimator does not use all of the data. We could improve our estimation by using the filtration F*t := Ft ∨ 〈δi : i = 1, . . . , n〉. That is, we include at all times the information about who has ultimately been censored. Conditioned on F*t we know that there are C := n − Σ δi individuals who will ultimately be censored. Thus it makes sense to write

Y(s) = Σ_{i: δi=1} 1{Ti > s} + Σ_{i: δi=0} 1{Ci > s},

where Ci is the (unobserved) censoring time for individual i. The only variables that are not in F*t are the Ci. Thus, for s ≤ t,

E[ Y(s) | F*t ] = Σ_{i: δi=1} 1{Ti > s} + Σ_{i: δi=0} P{ Ci > s | δi = 0 }
               = ( n − N(s−) − C ) + C · P{ Ci > s | Ti > Ci }.

In fact (and contrary to what I somewhat glibly claimed in the lecture), it's not straightforward to turn this into an estimator for A!

(2) Nonparametric estimators are inevitably less efficient (that is, have larger errors, on average) than parametric estimators. Consider the case when n individuals are observed up to time t. Their event times are independent and exponentially distributed with unknown parameter λ, and we observe all event times. We wish to compare two different possible estimators for the cumulative hazard up to time t: first, taking advantage of the knowledge that the data come from an exponential distribution; and second, using the nonparametric Nelson–Aalen estimator.

(a) i. Show that the MLE for Λ(t) under the exponential model is

nt / Σ_{i=1}^n Ti.

The log likelihood is

ℓ(λ) = n log λ − λ Σ Ti.

Setting the derivative to 0 and solving for λ we get λ̂ = n/Σ Ti. The result follows, since Λ(t) = λt.


ii. Compute the (approximate) variance for this estimator.

We know that G := Σ_{i=1}^n Ti has Gamma distribution with parameters (n, λ), so has density

g(x) = λ^n/(n − 1)! · x^{n−1} e^{−λx}.

Thus

E[λ̂] = n ∫_0^∞ λ^n/(n − 1)! · x^{n−2} e^{−λx} dx = nλ/(n − 1),

E[λ̂²] = n² ∫_0^∞ λ^n/(n − 1)! · x^{n−3} e^{−λx} dx = n²λ²/((n − 1)(n − 2)),

Var(λ̂) = n²λ²/((n − 1)²(n − 2)) ≈ λ²/n.

Thus Var(Λ̂(t)) ≈ λ²t²/n. The mean squared error (which is what we really should be looking at) adds the squared bias to this, which is λ²t²/(n − 1)², but this doesn't change anything significant for large n.
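A Python Monte Carlo check (not part of the course's R code) of these Gamma-moment calculations, with the illustrative values n = 20 and λ = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, reps = 20, 2.0, 200_000
G = rng.gamma(shape=n, scale=1.0/lam, size=reps)  # sum of n Exp(lam) variables
lam_hat = n / G                                   # the MLE of lam

mean_theory = n*lam/(n - 1)                           # = 40/19
var_theory = n**2 * lam**2 / ((n - 1)**2 * (n - 2))   # = 1600/6498
```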

(b) Using the inequality

log n + γ ≤ Σ_{i=1}^n 1/i ≤ log n + γ + 1/(2n),

show that the Nelson–Aalen estimator for S(t) is approximately Y(t+)/n (the empirical fraction surviving to time t), and find a bound for the error — that is, for the maximum difference between Ŝ(s) and Y(s+)/n on 0 ≤ s ≤ t.

We have

− log Ŝ(s) = Σ_{j=1}^{N(s)} 1/(n + 1 − j).

Since Y(s+) = n − N(s),

| log Ŝ(s) − log( Y(s+)/n ) | ≤ 1/(2Y(s+)).

Using the general inequality |e^{−x} − e^{−y}| ≤ |x − y| for any x, y ≥ 0, we conclude that for all 0 ≤ s ≤ t,

| Ŝ(s) − Y(s+)/n | ≤ 1/(2Y(s+)).

We can improve this, if we wish, by a further factor of Y(s+)/n.
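The harmonic-sum inequality used above can itself be checked numerically; a quick Python check (γ is the Euler–Mascheroni constant):

```python
import math

gamma = 0.5772156649015329          # Euler-Mascheroni constant
H = 0.0
ok = True
for n in range(1, 2001):
    H += 1.0/n                      # harmonic number H_n
    ok = ok and (math.log(n) + gamma <= H <= math.log(n) + gamma + 1/(2*n))
```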


(c) Use this to estimate the variance of Â(t), the Nelson–Aalen estimator. Show that this variance is larger than the variance for the parametric estimator above.

Writing Â(t) = − log Ŝ(t) ≈ − log( Y(t+)/n ), we have

( Â(t) − Λ(t) )² ≈ log²( Ŝ(t)/S(t) ) = log²( 1 + (Ŝ(t) − S(t))/S(t) ) ≈ ( (Ŝ(t) − S(t))/S(t) )² + O( |Ŝ(t)/S(t) − 1|³ ),

so

Var(Â(t)) ≈ S(t)^{−2} Var(Ŝ(t)) = n^{−2} S(t)^{−2} Var(Y(t+)).

Since Y(t+) has binomial distribution with parameters (n, S(t)), its variance is nS(t)(1 − S(t)), so we have

Var(Â(t)) ≈ n^{−1}( 1/S(t) − 1 ) = n^{−1}( e^{λt} − 1 ).

(d) Show that this variance is larger than the variance for the parametric estimator above.

We note that for any u > 0 we have

e^u − 1 − u² = u − u²/2 + Σ_{i=3}^∞ u^i/i! > Σ_{i=1}^∞ (−1)^{i−1} u^i/i! = 1 − e^{−u} > 0.

So e^{λt} − 1 > (λt)². The parametric estimator has variance approximately n^{−1}(λt)², while the nonparametric estimator has variance approximately n^{−1}(e^{λt} − 1), so the nonparametric variance is larger.

(e) Plot the ratio of the variances for a range of values of t, for the case λ = 1 and n = 1000.

(f) Why might one prefer to use the nonparametric estimator, even when there is no censoring?

If the exponential model is not actually a good fit, then the Λ̂(t) obtained from it will be distorted.

(3) What is the intensity λ of the survival process at time t?

The intensity of the process is equal to α × (number of subjects at risk at time t), or

λ(t) = α( n − N(t−) ).


[Plot: the ratio of the variances against t, for 0 ≤ t ≤ 4.]

(a) R code runs faster if you replace loops with vector operations. Can you get rid of the loop in this code?

(b) Modify the code to plot the martingale N(t) − ∫_0^t λ(s) ds.

(c) Add a routine that computes the optional variation process. Run a simulation and plot it.

(d) Modify the program to apply to α(t) = 2t.

################################################

########### Original code ######################

################################################

n=50

alpha=1

taus=NULL #Accumulates the inter-event times

set.seed(0)# this can be used to validate similarity of the output

for (i in 1:n)

taus=c(taus,rexp(1,alpha*(n+1-i))) # The next inter-event time


T=cumsum(taus) # Turn the inter-event times into event times

# First create an empty set of axes.

# y ranges from n down to 0, x ranges from 0 up to max(T)

# Make an empty set of axes

plot(NULL,NULL,xlim=c(0,max(T)),
     ylim=c(0,n),
     xlab='Time',
     ylab='Number of events')

# Now plot flat lines for all the segments between arrivals

segments(c(0,T[1:n]),0:n,c(T[1:n],T[n]+1),0:n)

# T[n]+1 is there to extend the final interval one

# time unit from the last time

# In principle it extends forever

################################################

################################################

############### (3 a ) ####################

################################################

n=50

alpha=1

taus=NULL #Accumulates the inter-event times

set.seed(0) #this can be used to validate similarity of the output

# It should produce the same taus as the for loop

## ~~~~~ replacing the loop ~~~~~
taus=rexp(n,alpha*( n+1 - c(1:n) )) # All the inter-event times at once


################################################

############### (3 b ) ####################

################################################

# Lambda(T_i) = alpha * sum_{j<=i} (n+1-j)*tau_j, accumulated over the
# inter-event intervals (the intensity is alpha*(n+1-i) on the i-th interval)
M <- c(1:n) - cumsum( alpha*(n + 1 - c(1:n))*taus )

plot(NULL,NULL,xlim=c(0,max(T)),
     ylim=c(min(M), max(M) ),
     xlab='Time',ylab='M(t)')

# Prepend M(0)=0 so the coordinate vectors have matching lengths
segments(c(0,T[1:n]), c(0,M), c(T[1:n],T[n]+1), c(0,M))

################################################

############### (3 c ) ####################

################################################

# The optional variation process is just sum( Y_i^2 )

# where Y_i is the changes in the process

optvar <- cumsum( ( 1:n - 0:(n-1) )^2 )

optvar <- c(0,optvar)

#Plotting it:

plot(NULL,NULL,xlim=c(0,max(T)),
     ylim=c(0,n),xlab='Time',ylab='Optional Variation process')

segments(c(0,T[1:n]), optvar ,c(T[1:n],T[n]+1), optvar )

################################################

############### (3 d ) ####################

################################################

# Now that we have a more complex hazard rate,

# we will have to use what we learned in question 4

#Treating N(t) as observed:

#lambda(t) = 2t * (n - N(t))


# Lambda picks up (n+1-i)*(t^2 - T_{i-1}^2) on the i-th inter-event interval,
# so (n+1-i)*(T_i^2 - T_{i-1}^2) ~ Exp(1), and T_i^2 is a cumulative sum:

n <- 50

T <- sqrt( cumsum( rexp(n,1)/(n + 1 - c(1:n)) ) ) # event times

taus <- diff( c(0,T) ) # inter-event times

# First create an empty set of axes.

# y ranges from n down to 0, x ranges from 0 up to max(T)

# Make an empty set of axes

plot(NULL,NULL,xlim=c(0,max(T)),
     ylim=c(0,n),
     xlab='Time',
     ylab='Number of events')

# Now plot flat lines for all the segments between arrivals

segments(c(0,T[1:n]),0:n,c(T[1:n],T[n]+1),0:n)

# T[n]+1 is there to extend the final interval

# one time unit from the last time

# In principle it extends forever

## Plotting the martingale

# Compensator at the i-th event time: sum_{j<=i} (n+1-j)*(T_j^2 - T_{j-1}^2)
Lambda_at_events <- cumsum( (n + 1 - c(1:n))*(T^2 - c(0,T[-n])^2) )

M <- c(1:n) - Lambda_at_events

plot(NULL,NULL,xlim=c(0,max(T)),
     ylim=c(min(M), max(M) ),
     xlab='Time',ylab='M(t)')

# Prepend M(0)=0 so the coordinate vectors have matching lengths
segments(c(0,T[1:n]), c(0,M), c(T[1:n],T[n]+1), c(0,M))


## Plotting the optional variation process

# The optional variation process is just sum( Y_i^2 )

# where Y_i is the changes in the process

optvar <- cumsum( ( 1:n - 0:(n-1) )^2 )

optvar <- c(0,optvar)

#Plotting it:

plot(NULL,NULL,xlim=c(0,max(T)),ylim=c(0,n),xlab='Time',
     ylab='Optional Variation process')

segments(c(0,T[1:n]), optvar ,c(T[1:n],T[n]+1), optvar )

(4) The data set ovarian, included in the survival package, presents data for 26 ovarian cancer patients, receiving one of two treatments, which we will refer to as the single and double treatments.

(a) Create a survival object for the times in this database.

(b) Compute and plot the Kaplan–Meier estimator for the survival curves. (For a small extra challenge, plot the single-treatment survival curve black, and the double-treatment curve red.) You may use the survfit function.

library(survival)

## a ##

surv_object <- Surv(ovarian$futime, ovarian$fustat)

# To have a look at what has been computed about survival

## b ##

plot(survfit(surv_object~ovarian$rx), main="Kaplan-Meier")

> summary(survfit(surv_object ~ ovarian$rx))

Call: survfit(formula = Surv(futime, fustat) ~ rx)

# rx=1

# time n.risk n.event survival std.err lower 95% CI upper 95% CI

# 59 13 1 0.923 0.0739 0.789 1.000

# 115 12 1 0.846 0.1001 0.671 1.000

# 156 11 1 0.769 0.1169 0.571 1.000


# 268 10 1 0.692 0.1280 0.482 0.995

# 329 9 1 0.615 0.1349 0.400 0.946

# 431 8 1 0.538 0.1383 0.326 0.891

# 638 5 1 0.431 0.1467 0.221 0.840

#

# rx=2

# time n.risk n.event survival std.err lower 95% CI upper 95% CI

# 353 13 1 0.923 0.0739 0.789 1.000

# 365 12 1 0.846 0.1001 0.671 1.000

# 464 9 1 0.752 0.1256 0.542 1.000

# 475 8 1 0.658 0.1407 0.433 1.000

# 563 7 1 0.564 0.1488 0.336 0.946

#for extra challenge:

plot(survfit(surv_object~ovarian$rx) ,

col=c("black","red"),

main="Kaplan-Meier")

legend("bottomright",

c( "single-treatment", "double-treatment"),

col=c("black","red") , lty=1 )

## c ##

plot(survfit(surv_object~ovarian$rx, type='fleming-harrington'),
     main="Nelson-Aalen")

### The rest is to do this more ’by hand’, computing the relevant quantities

### and directly computing the Nelson-Aalen estimator.

attach(ovarian)

x=order(futime)

futime=futime[x]

fustat=fustat[x]

rx=rx[x]

ns=rev(cumsum(rev(rx==1)))

nd=rev(cumsum(rev(rx==2)))

hs=round(fustat*(rx==1)/ns,2)

hd=round(fustat*(rx==2)/nd,2)


NelsonAalenTable =

subset(data.frame(t_i=futime, n_single=ns, n_double=nd,

h_single=hs,h_double=hd,A_single=cumsum(hs),A_double=cumsum(hd)), h_single+h_double>0)

> NelsonAalenTable

t_i n_single n_double h_single h_double A_single A_double vars vard

1 59 13 13 0.08 0.00 0.08 0.00 0.01 0.00

2 115 12 13 0.08 0.00 0.16 0.00 0.01 0.00

3 156 11 13 0.09 0.00 0.25 0.00 0.02 0.00

4 268 10 13 0.10 0.00 0.35 0.00 0.03 0.00

5 329 9 13 0.11 0.00 0.46 0.00 0.04 0.00

6 353 8 13 0.00 0.08 0.46 0.08 0.04 0.01

7 365 8 12 0.00 0.08 0.46 0.16 0.04 0.01

10 431 8 9 0.12 0.00 0.58 0.16 0.06 0.01

12 464 6 9 0.00 0.11 0.58 0.27 0.06 0.03

13 475 6 8 0.00 0.12 0.58 0.39 0.06 0.04

15 563 5 7 0.00 0.14 0.58 0.53 0.06 0.06

16 638 5 6 0.20 0.00 0.78 0.53 0.10 0.06


[Plot: Kaplan–Meier survival curves; single treatment in black, double treatment in red.]

(c) Compute the Nelson–Aalen survival curve estimate. Make a table of the relevant data (time of events, number of events, Nelson–Aalen estimates).

(d) Compute the standard error for the probability of survival past 400 days, as estimated by the Nelson–Aalen and Kaplan–Meier estimators.

The standard errors are in the code printout above. For type 1 the variance estimate for the Nelson–Aalen estimator is 0.04 at t = 400; for type 2 it is 0.01. So the corresponding standard errors for the cumulative hazard are 0.2 and 0.1. The standard errors for survival are obtained by multiplying these by Ŝ(400), obtaining 0.13 and 0.097. The standard errors computed by the survfit function for the Kaplan–Meier estimator are in the printout above. They are 0.135 and 0.100.


C.3 Modern Survival Problem sheet 3: Estimating quantiles and excess mortality

(1) Show that Duhamel's equation (6.5) holds at a point s where S1 or S2 is discontinuous.

At a discontinuity point of S1 or S2,

d( S1/S2 )(s) = S1(s)/S2(s) − S1(s−)/S2(s−)

= dS1(s)/S2(s) + S1(s−) ( 1/S2(s) − 1/S2(s−) )

= S1(s−)/S2(s) · dS1(s)/S1(s−) − S1(s−)/S2(s) · ( S2(s)/S2(s−) − 1 )

= S1(s−)/S2(s) · ( dS1(s)/S1(s−) − dS2(s)/S2(s−) ).
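The discrete identity can be confirmed with arbitrary jump values; a quick Python check (hypothetical values for S1, S2 just before and at the discontinuity point s):

```python
# Values of S1, S2 just before (S1m, S2m) and at (S1, S2) a common
# discontinuity point s; any values in (0, 1] illustrate the identity
S1m, S1 = 0.9, 0.7
S2m, S2 = 0.8, 0.5
dS1, dS2 = S1 - S1m, S2 - S2m

lhs = S1/S2 - S1m/S2m                 # d(S1/S2)(s)
rhs = (S1m/S2)*(dS1/S1m - dS2/S2m)    # the Duhamel form
```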

(2) (a) Explain why this is a reasonable estimator for the quantiles of the survival function.

This code computes two vectors xpless and xpmore, corresponding to the event times in the data. These are lower and upper confidence limits for the cumulative hazard, according to the Nelson–Aalen estimator. The p-th quantile of survival is the time when the cumulative hazard crosses − log p. Thus, we define two indices, upper and lower, to be the last time when the lower confidence bound is below − log p, and the first time when the upper confidence bound is above − log p, respectively.

(b) Use this to compute a 95% confidence interval for median survival in the ovarian data set(ignore treatment type);

library(survival)
data(ovarian)  # load the ovarian dataset into the workspace

ov.surv <- Surv(ovarian$futime, ovarian$fustat)
ov.fit <- survfit(ov.surv ~ 1)
quantileCI(ov.fit, 0.5)
[1] 431  NA
>
> # It actually makes more sense to look at something like the 80th percentile of
> # survival, since survival barely gets down below .5 in the study time
> # (hence the NA for the upper end of the confidence interval)
>
> quantileCI(ov.fit, 0.8)
[1] 115 563


(c) Use this to compute a 95% confidence interval for median survival in a collection of data simulated from an exponential distribution with parameter 1, available on the course website.

> load("expdata.dat")
>
> exp.surv <- Surv(t, ev)
> exp.fit <- survfit(exp.surv ~ 1)
>
> quantileCI(exp.fit, 0.5)
[1] 0.5 0.7

(d) Suppose we ignored the censoring, so only included the uncensored times. What would you estimate for the median survival time?

T.new <- t[ev]  # only use the events

exp.surv.new <- Surv(T.new, rep(TRUE, sum(ev)))
exp.fit.new <- survfit(exp.surv.new ~ 1)

quantileCI(exp.fit.new, 0.5)
[1] 0.3 0.5



Figure C.1: Plot of Kaplan–Meier estimator with 95% confidence interval for theexponential simulation (black), and the false plot based on ignoring censored data(red).


(3) Look back to the derivation in the lecture notes section 7.2 of the estimator Γ̂(t) for cumulative excess mortality in the two-sample setting. Think of kc(t) as arbitrary predictable random variables.

(a) Construct a martingale to show that the estimator (7.5) for excess mortality in the two-sample case is unbiased for appropriate choice of kc(t). What conditions must kc(t) satisfy?

For each c,

N(c, 0; t) − ∫_0^t α(c; s) Y(c, 0; s) ds   and   N(c, 1; t) − ∫_0^t ( α(c; s) + γ(s) ) Y(c, 1; s) ds

are both martingales. Dividing the increments by Y, we see that

∫_0^t dN(c, 0; s)/Y(c, 0; s) − ∫_0^t α(c; s) ds   and   ∫_0^t dN(c, 1; s)/Y(c, 1; s) − ∫_0^t α(c; s) ds − Γ(t)

are martingales. Thus, the difference

∫_0^t ( dN(c, 1; s)/Y(c, 1; s) − dN(c, 0; s)/Y(c, 0; s) ) − Γ(t)

is also a martingale. Thus, if kc is predictable with Σ_c kc = 1, then

M(t) := Σ_c ∫_0^t kc(s) ( dN(c, 1; s)/Y(c, 1; s) − dN(c, 0; s)/Y(c, 0; s) ) − Γ(t) = Γ̂(t) − Γ(t)

is a martingale, and its expectation is 0 for any t, implying that Γ̂(t) is an unbiased estimator for Γ(t).

(b) Find an expression for estimating the variance of Γ̂.

The optional variation of M is

[M](t) = Σ_c ( Σ_{t_i^{(c,1)} ≤ t} kc(t_i^{(c,1)})² / Y(c, 1; t_i^{(c,1)})² + Σ_{t_i^{(c,0)} ≤ t} kc(t_i^{(c,0)})² / Y(c, 0; t_i^{(c,0)})² )

= Σ_{ti≤t} k_{ci}(ti)² / Y(ci, Gi; ti)².


(c) Show that for the particular choice (7.4) the bound (7.6)

Var(Γ̂(t)) ≈ Σ_{ti≤t} ( Σ_c Y(c, −; ti) )^{−2}

is a conservative estimate for the variance of the estimator. That is, it is a good estimate for large samples, and tends not to underestimate the variance.

We need to show that the optional variation is bounded by

Σ_{ti≤t} ( Σ_c Y(c, −; ti) )^{−2}

when we take

kc(t) = Y(c, −; t) / Σ_{c'} Y(c', −; t).

This follows immediately from the above formula, since Y(ci, −; ti) ≤ Y(ci, Gi; ti), so

k_{ci}(ti)² / Y(ci, Gi; ti)² ≤ ( Σ_c Y(c, −; ti) )^{−2}.

(d) Since any choice of kc yields an estimator, we are free to make a convenient choice. Why is the choice (7.4) a good one?

A sensible strategy is to minimise the expected next variance increment. Since the next individual to have an event is approximately uniformly chosen from the available individuals, hence with probability proportional to Y(c, G; t), we want to minimise

Σ_{c,G} Y(c, G; t) · kc² / Y(c, G; t)²

subject to Σ_c kc = 1, leading us to choose kc proportional to Y(c, G; t). However, kc can only depend on c, not on G, so we take kc(t) = Y(c, −; t)/Σ_{c'} Y(c', −; t). (Otherwise, it wouldn't be predictable.) In that case, we get the variance bound

Σ_{ti≤t} ( Σ_c Y(c, −; ti) )^{−2}.   (C.1)

If we said instead we wanted to minimise the maximum of the next squared increment, we would also choose kc(t) proportional to Y(c, −; t).
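The constrained minimisation can be checked numerically: with k ∝ Y, no other weight vector summing to 1 does better. A Python sketch (not part of the course's R code) with hypothetical at-risk counts and the G-index suppressed:

```python
import numpy as np

rng = np.random.default_rng(42)
Y = np.array([3.0, 5.0, 2.0])   # hypothetical at-risk counts Y(c,-;t)
k_star = Y / Y.sum()            # the claimed minimiser, k_c proportional to Y

def objective(k):
    return np.sum(k**2 / Y)     # sum_c k_c^2 / Y(c,-;t)

best = objective(k_star)        # equals 1/sum(Y), by Cauchy-Schwarz
worst_random = min(objective(rng.dirichlet(np.ones(3))) for _ in range(1000))
```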


(e) Supposing the groups to be of approximately equal size, what will the relation be between the variance of our estimator for the cumulative excess mortality, and the variance we would estimate for the difference in the cumulative hazards between the groups Gi = 0 and Gi = 1, ignoring the classification ci?

If we ignore the classification, we will estimate the difference in cumulative hazards as

Σ_{ti≤t} (−1)^{Gi} / Σ_c Y(c, Gi; ti),

with variance estimated by

Σ_{ti≤t} 1 / ( Σ_c Y(c, Gi; ti) )².

The bound we have just derived replaces Σ_c Y(c, Gi; ti) by Σ_c Y(c, −; ti). Thus, the variance estimate when we stratify by category is always larger, but they differ only to the extent that the numbers at risk in the two groups differ.


C.4 Modern Survival Problem sheet 4: Nonparametric testing and semiparametric models

(1) The object tongue in the package KMsurv lists survival or right-censoring times in weeks after diagnosis for 80 patients with tongue tumours. The type random variable is 1 or 2, according as the tumour was aneuploid or diploid respectively.

(a) Use the log-rank test to test whether the difference in survival distributions is significant at the 0.05 level.

We give below R code for computing this in two different ways: using the function survdiff, which does the computation automatically, and by extracting the relevant quantities from the survival object and doing the computation directly. We get Z = −1.67, which corresponds to a p-value of 0.09. Using survdiff we get the same result, but it is reported as a chi-squared statistic of 2.8 (which is 1.67²) on 1 degree of freedom.

SURVDIFF CODE

> require(’survival’)

> require(’KMsurv’)

> data(tongue)

> attach(tongue)

>

> tongue.surv=Surv(time,delta)

> tongue.fit=survfit(tongue.surv~type)
> tdiff=survdiff(tongue.surv~type)
> tdiff

Call:

survdiff(formula = tongue.surv ~ type)

N Observed Expected (O-E)^2/E (O-E)^2/V

type=1 52 31 36.6 0.843 2.79

type=2 28 22 16.4 1.873 2.79

Chisq= 2.8 on 1 degrees of freedom, p= 0.0949

DIRECT COMPUTATION

# Problem sheet 4, question 1

require(’survival’)


require(’KMsurv’)

data(tongue)

attach(tongue)

tongue.surv=Surv(time,delta)

tongue.fit=survfit(tongue.surv~type)

n1=tongue.fit$strata[1]

n2=tongue.fit$strata[2]

# Input two vectors of times t1,t2, and

# numbers at risk n1,n2 whose length is 1 longer than the t’s

# Output four vectors I1, I2, (of same length as t1,t2) and Y1,Y2

# I1[k] gives an index of I2 corresponding to

# the last time in t2 that precedes t1[k]

# Thus, we have t2[I1[k]]<=t1[k] < t2[I1[k]+1],

# and r2[I1[k]+1] is the number of type 2 individuals at risk

# at the time t1[k] (when there are r1[k] type 1 individuals)

# Y1=r1[I1]

crossrisk=function(t1,t2,r1,r2){
  I1=rep(0,length(t1))
  I2=rep(0,length(t2))
  for(i in seq(length(t1))) I1[i]=1+sum(t1[i]>t2)
  for(i in seq(length(t2))) I2[i]=1+sum(t2[i]>t1)
  list(I1,I2,r1[I2],r2[I1])
}

r1=tongue.fit$n.risk[seq(n1)]

r2=tongue.fit$n.risk[seq(n1+1,n1+n2)]

r1=c(r1,r1[n1]-tongue.fit$n.event[n1]-tongue.fit$n.censor[n1])

r2=c(r2,r2[n2]-tongue.fit$n.event[n1+n2]-tongue.fit$n.censor[n1+n2])

t1=tongue.fit$time[seq(n1)]

t2=tongue.fit$time[seq(n1+1,n1+n2)]


cr=crossrisk(t1,t2,r1,r2)

Y1=c(r1[-n1],cr[[3]])

Y2=c(cr[[4]],r2[-n2])

# Note: r1 and r2 had an extra count added on to make crossrisk work

d1=c(tongue.fit$n.event[seq(n1)],rep(0,n2))

d2=c(rep(0,n1),tongue.fit$n.event[seq(n1+1,n1+n2)])

t=c(t1,t2)

# We have to deal with the problem of ties between times for the two groups

dup1=which(duplicated(t,fromLast=TRUE))

dup2=which(duplicated(t))

ndup=length(dup1)

# Type 2 Event counts are removed from the second appearance

# and placed in the first appearance

d2[dup1]=d2[dup2]

d2=d2[-dup2]

d1=d1[-dup2]

# Type 2 at-risk counts are removed from the second appearance

# and placed in the first appearance

Y2[dup1]=Y2[dup2]

Y2=Y2[-dup2]

Y1=Y1[-dup2]

t=t[-dup2]

tord=order(t)

t=t[tord] #put times in order

## Now put everything else in the same order
# (Y and d are not yet defined at this point; they are computed
# below from the reordered vectors Y1,Y2,d1,d2.)

Y1=Y1[tord]

Y2=Y2[tord]

d1=d1[tord]

d2=d2[tord]


Y=Y1+Y2

d=d1+d2

# Product of number at risk

atriskprod=Y1*Y2

includes=(atriskprod>0)&(d>0)

# We only get contributions if someone’s at risk and events occurred at that time

Y=Y[includes]

Y1=Y1[includes]

Y2=Y2[includes]

d=d[includes]

d2=d2[includes]

d1=d1[includes]

t=t[includes]

wLR=Y1*Y2/Y

p=1

q=0

S=c(1,cumprod((Y-d)/Y))[-length(Y)] #K-M estimator for survival

wFH=(1-S)^q*S^p*wLR

# Now compute the test statistic

w=wLR

M=w*(d1/Y1-d2/Y2)

sigma=w*w*d*(Y-d)/Y2/Y1/(Y-1)

sK=d*Y1*Y2*(Y-d)/Y^2/(Y-1)

Z=sum(M)/sqrt(sum(sigma))

> Z

[1] -1.670246
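Once the per-event-time aggregates (d1, d2, Y1, Y2) are in hand, the statistic itself is language-independent. As a sketch, here is the same computation in Python; the function name `logrank_z` and the toy numbers are ours, not from the notes:

```python
import math

def logrank_z(d1, d2, Y1, Y2):
    """Log-rank Z statistic from per-event-time aggregates.

    d1, d2: event counts in each group at each distinct event time;
    Y1, Y2: numbers at risk in each group just before that time.
    Uses the same weight w = Y1*Y2/Y as the R code above."""
    num = 0.0   # sum of weighted differences of hazard increments
    var = 0.0   # sum of the variance terms
    for a, b, y1, y2 in zip(d1, d2, Y1, Y2):
        y, d = y1 + y2, a + b
        if y1 == 0 or y2 == 0 or d == 0 or y < 2:
            continue  # no contribution unless both groups are at risk
        w = y1 * y2 / y
        num += w * (a / y1 - b / y2)
        var += w * w * d * (y - d) / (y1 * y2 * (y - 1))
    return num / math.sqrt(var)

# Tiny example worked by hand: the statistic comes out to 1/sqrt(17)
z = logrank_z(d1=[1, 0], d2=[0, 1], Y1=[2, 1], Y2=[2, 2])
```

The hand computation for the toy data gives sum of increments 1/6 and variance 17/36, so Z = 1/√17.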

(b) Repeat the above with a test that emphasises differences shortly after diagnosis.

This can also be done with survdiff.


> tdiff2=survdiff(tongue.surv~type,rho=1)
> # rho=1 corresponds to p=1 for Fleming-Harrington weights

> tdiff2

Call:

survdiff(formula = tongue.surv ~ type, rho = 1)

N Observed Expected (O-E)^2/E (O-E)^2/V

type=1 52 20.2 24.4 0.731 3.3

type=2 28 15.1 10.9 1.643 3.3

Chisq= 3.3 on 1 degrees of freedom, p= 0.0694

> w=wFH

> M=w*(d1/Y1-d2/Y2)

> sigma=w*w*d*(Y-d)/Y2/Y1/(Y-1)

> sK=d*Y1*Y2*(Y-d)/Y^2/(Y-1)

>

> Z=sum(M)/sqrt(sum(sigma))

> Z

[1] -1.805118

> Z^2

[1] 3.25845

(c) Calculate and plot the estimated excess mortality for aneuploid compared with diploid.

In this case there are no nuisance covariates, so the excess mortality is simply the difference between the Nelson–Aalen estimators, with increments at time $t_i$ weighted by a predictable function $k(t)$:
$$\hat\Gamma(t)=\sum_{t_i\le t}k(t_i)\left(\frac{G_i}{Y(1;t_i)}-\frac{1-G_i}{Y(0;t_i)}\right).$$
Here $G_i=1$ when there is an aneuploid-type event at time $t_i$, and $0$ when there is a diploid-type event. Because there are ties in the data, we change this slightly, to
$$\hat\Gamma(t)=\sum_{t_i\le t}k(t_i)\left(\frac{d_i^{\mathrm{aneup}}}{Y^{\mathrm{aneup}}(t_i)}-\frac{d_i^{\mathrm{dip}}}{Y^{\mathrm{dip}}(t_i)}\right).$$
We could choose anything for $k$, but we just take $k\equiv1$ here. One approach, then, would be to take the differences between the estimators calculated by survfit in the previous part. Extracting the components from the survfit object would be fairly opaque, though, so we give code below that does the computation directly.

etimes=tongue$time[tongue$delta==1]  # Event times
aneup=subset(tongue,type==1)
dip=subset(tongue,type==2)
## For each event time, count number of events of each type at that time
d.aneup=sapply(etimes,function(t) sum(aneup$delta[aneup$time==t]))
d.dip=sapply(etimes,function(t) sum(dip$delta[dip$time==t]))
## For each event time, count number at risk of each type at that time
## (at risk = still under observation at t, including the later censored)
Y.aneup=sapply(etimes,function(t) sum(aneup$time>=t))
Y.dip=sapply(etimes,function(t) sum(dip$time>=t))

# Stop when we run out of individuals
tmax=max(which(Y.aneup*Y.dip>0))

Gammaincrement=(d.aneup/Y.aneup-d.dip/Y.dip)[1:tmax]
Gamma=cumsum(Gammaincrement)
varinc=(d.aneup/Y.aneup^2+d.dip/Y.dip^2)[1:tmax]  # second denominator corrected to Y.dip
vargamma=cumsum(varinc)
sdgamma=sqrt(vargamma)

conflevel=.95
z=-qnorm((1-conflevel)/2)
upper=Gamma+sdgamma*z
lower=Gamma-sdgamma*z


Figure C.2: Excess mortality for aneuploid vs diploid tumours from the tongue dataset (plot of $\hat\Gamma$ against time in weeks).


(2) Show that the class of non-parametric test statistics that we have defined in section 8.1 includes the Wilcoxon rank-sum statistic, in the special case where there is no censoring or truncation. What weight function do we need to take to recover the rank-sum statistic? Derive the sampling distribution of the rank-sum statistic. Optional: Try out the two statistics on some simulated data. They may be drawn from any distribution you like.

When computing the rank-sum statistic for $n$ subjects, with $n_i$ from group $i$ ($i=0,1$), we start by assigning ranks to each of the observations $T_1,\dots,T_n$, and defining $R_i$ to be the sum of the ranks of subjects of type $i$. We then define
$$U_i := n_0n_1+\frac{n_i(n_i+1)}{2}-R_i.$$
Either one of these may be used as the rank-sum statistic, corresponding to the two different tails of the distribution.

Observe that when there is no censoring or truncation, the rank of an observation is simply the number of individuals still at risk; that is, $Y_\cdot(t_i)$. Thus
$$R_0=\sum_{i:G_i=0}Y_0(t_i)+\sum_{i:G_i=0}Y_1(t_i).$$
Note that $Y_0$ is decremented by one at each time $t_i$ such that $G_i=0$, so
$$\sum_{i:G_i=0}Y_0(t_i)=\frac{n_0(n_0+1)}{2},$$
and
$$U_1-U_0=\sum_{i=1}^n(-1)^{G_i}\frac{w(t_i)}{Y_{G_i}(t_i)},\qquad U_1+U_0=n_0n_1,$$
where $w(t_i)=Y_1(t_i)Y_0(t_i)$. Thus $E[U_j]=n_0n_1/2$, and the variance may be computed as the expected value of the predictable variation
$$\frac14\sum_{i=1}^nE\bigl[Y_{1-G_i}(t_i)^2\bigm|\mathcal F_{t_i-}\bigr]
=\frac14\sum_{i=1}^n\left(\frac{Y_0(t_i)^2Y_1(t_i)}{Y_0(t_i)+Y_1(t_i)}+\frac{Y_1(t_i)^2Y_0(t_i)}{Y_0(t_i)+Y_1(t_i)}\right)
=\frac14\sum_{i=1}^nY_0(t_i)Y_1(t_i).$$

The variance is the expected value of this sum. Since $Y_1(t_i)=n-i+1-Y_0(t_i)$, this is
$$\mathrm{Var}(U_1)=\frac14\sum_{i=1}^n(n-i+1)E[Y_0(t_i)]-\frac14\sum_{i=1}^nE[Y_0(t_i)]^2-\frac14\sum_{i=1}^n\mathrm{Var}(Y_0(t_i)).$$

We observe now that under the null hypothesis, the $n-i+1$ survivors up to time $t_i$ are a uniform random pick from the $n$ subjects. Thus $Y_0(t_i)$ has hypergeometric distribution with parameters $(n,n_0,n-i+1)$, so
$$E\bigl[Y_0(t_i)\bigr]=\frac{(n-i+1)n_0}{n},\qquad
\mathrm{Var}\bigl(Y_0(t_i)\bigr)=\frac{(n-i+1)n_0n_1(i-1)}{n^2(n-1)}.$$
(Properties of the hypergeometric distribution may be found at http://en.wikipedia.org/wiki/Hypergeometric_distribution.) Thus
$$\mathrm{Var}(U_j)=\frac{n_0}{4n}\sum_{i=1}^n\left(i^2-\frac{i^2n_0}{n}-\frac{n_1i(n-i)}{n(n-1)}\right)
=\frac{n_0n_1}{4n(n-1)}\sum_{i=1}^n(i^2-i)
=\frac{n_0n_1(n+1)}{12}$$
after some algebra.

Thus, when $n_0$ and $n_1$ are both large, we may use
$$\frac{U_j-n_0n_1/2}{\sqrt{n_0n_1(n+1)/12}}$$
as a test statistic; it should have approximately a standard normal distribution if the null hypothesis holds.
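The mean and variance formulas can be checked by brute force for small samples. The following Python enumeration (ours, not in the notes) confirms $E[U]=n_0n_1/2$ and $\mathrm{Var}(U)=n_0n_1(n+1)/12$ exactly for $n_0=4$, $n_1=5$:

```python
from fractions import Fraction
from itertools import combinations

def ranksum_moments(n0, n1):
    """Exact null mean and variance of the rank-sum statistic U,
    by enumerating all equally likely assignments of ranks to group 0."""
    n = n0 + n1
    us = []
    for ranks0 in combinations(range(1, n + 1), n0):
        R0 = sum(ranks0)
        U = n0 * n1 + n0 * (n0 + 1) // 2 - R0
        us.append(Fraction(U))
    m = sum(us) / len(us)
    v = sum((u - m) ** 2 for u in us) / len(us)
    return m, v

mean, var = ranksum_moments(4, 5)
# Formulas from the text: E[U] = n0*n1/2 = 10, Var(U) = n0*n1*(n+1)/12 = 50/3
```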

(3) Suppose we have an additive-hazards model where an individual has covariates $(X_1,\dots,X_p)$ and the individual hazards are then
$$\alpha(t)=\beta_0(t)+\beta_1(t)X_1+\cdots+\beta_p(t)X_p,$$
where the $X_k$ are random variables. An observation consists of a single right-censored event.

(a) Suppose the variable $X_p$ is not observed, so is not included in the model. If the random variables $X_k$ are all independent, show that the remaining model is still an additive-hazards model with a different baseline hazard $\beta_0(t)$.

Consider a single individual under observation, producing a time $T$ that is either an event time ($\delta=1$) or a censoring time ($\delta=0$). Let $\mathcal F_t$ be the $\sigma$-algebra for the observations up to time $t$, including the fixed covariates $X_1,\dots,X_{p-1}$, and $\mathcal G_t$ be the extension of $\mathcal F_t$ to include the covariate $X_p$. We are given that the fully observed process has intensity for individual $i$
$$\alpha(t)=\alpha^{\mathcal G}(t)=Y_i(t)\bigl(\beta_0(t)+\beta_1(t)X_1+\cdots+\beta_p(t)X_p\bigr).$$


(We take the regression coefficients $\beta_k(t)$ to be nonrandom.) We need to show that the hazard rate conditioned on the reduced information, $\alpha^{\mathcal F}$, also fits into the additive-hazards model. By the Innovation Theorem (Theorem 3.1),
$$\alpha^{\mathcal F}(t)=E\bigl[\alpha^{\mathcal G}_i(t)\bigm|\mathcal F_{t-}\bigr]
=\sum_{k=0}^pE\bigl[\beta_k(t)X_k\bigm|\mathcal F_{t-}\bigr]
=\sum_{k=0}^{p-1}\beta_k(t)X_k+\beta_p(t)E\bigl[X_p\bigm|\mathcal F_{t-}\bigr].$$
We only need to consider this conditioning on the event $T\ge t$, since the intensity is 0 on the event $T<t$. Since we have assumed that the covariates $X_k$ are independent, on the event $Y_i(t)=1$,
$$E\bigl[X_p\bigm|\mathcal F_{t-}\bigr]=E\bigl[X_p\bigm|(T\wedge t,\,\delta\mathbf 1_{T<t})\bigr]
=E\left[X_p\frac{S(t|X_p)}{S(t)}\right]$$
by Bayes' Law, where $S(t|x)=\int_t^\infty f(s|x)\,ds$ is the conditional survival function. We have
$$S(t|x)=\exp\left(-\int_0^t\alpha(s)\,ds\right)
=\exp\left(-B_0(t)-\sum_{k=1}^{p-1}X_kB_k(t)-xB_p(t)\right);$$
thus, by independence of the $X_k$,
$$S(t)=E\bigl[e^{-X_pB_p(t)}\bigr]\exp\left(-B_0(t)-\sum_{k=1}^{p-1}X_kB_k(t)\right),$$
and
$$\frac{S(t|x)}{S(t)}=\frac{e^{-xB_p(t)}}{M_{X_p}(-B_p(t))},$$
where $M_{X_p}$ is the moment generating function of $X_p$. Thus, the hazard rate for the reduced model is
$$\alpha^{\mathcal F}(t)=\left(\beta_0(t)+\frac{E[X_pe^{-X_pB_p(t)}]}{E[e^{-X_pB_p(t)}]}\,\beta_p(t)\right)+\sum_{k=1}^{p-1}\beta_k(t)X_k.$$
The first term in brackets is the new baseline hazard.
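The new baseline term $E[X_pe^{-X_pB_p}]/E[e^{-X_pB_p}]$ is easy to check numerically for a specific distribution. A Python sketch, assuming (our choice, not the notes') that $X_p\sim\mathrm{Exponential}(1)$, for which the ratio has the closed form $1/(1+B_p)$:

```python
import math

def mgf_ratio(B, rate=1.0, upper=60.0, n=200000):
    """Approximate E[X e^{-XB}] / E[e^{-XB}] for X ~ Exp(rate)
    by trapezoidal integration against the density rate*exp(-rate*x)."""
    h = upper / n
    num = den = 0.0
    for i in range(n + 1):
        x = i * h
        wgt = 0.5 if i in (0, n) else 1.0  # trapezoid endpoint weights
        f = rate * math.exp(-rate * x) * math.exp(-B * x)
        num += wgt * x * f * h
        den += wgt * f * h
    return num / den

B = 0.7
approx = mgf_ratio(B)
exact = 1.0 / (1.0 + B)   # closed form for the exponential distribution
```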


(b) Suppose the random variables $X_k$ are multivariate normal (but not independent). How does the model change when $X_p$ is dropped?

We may assume without loss of generality that the $X_k$ have mean 0. We can represent $X_p=c_pZ+\sum_{k=1}^{p-1}c_kX_k$, where $Z$ is standard normal, independent of $X_1,\dots,X_{p-1}$, for constants $c_1,\dots,c_p$. We have then
$$E\bigl[X_p\bigm|\mathcal F_{t-}\bigr]=\sum_{k=1}^{p-1}c_kX_k+c_pE\left[Z\frac{S(t|Z)}{S(t)}\right].$$
By the same argument as before, we have
$$\frac{S(t|z)}{S(t)}=\frac{e^{-zc_pB_p(t)}}{M_Z(-c_pB_p(t))}=e^{-zc_pB_p(t)-c_p^2B_p(t)^2/2}.$$
Thus the hazard rate for the reduced model is
$$\alpha^{\mathcal F}(t)=\left(\beta_0(t)+E\bigl[Ze^{-Zc_pB_p(t)}\bigr]c_p\beta_p(t)e^{-c_p^2B_p(t)^2/2}\right)+\sum_{k=1}^{p-1}\bigl(\beta_k(t)+c_k\beta_p(t)\bigr)X_k.$$
Thus, we still have an additive-hazards model, but now all the coefficients have changed.

(4) In section 9.5.1 we describe fitting the Aalen additive hazards model for the special case of a single (possibly time-varying) covariate. Suppose we constrain the assumptions further, to assume that $x_i$ is constant in time, and takes on only the values 0 and 1. Explain how this is related to the excess mortality model. Compare the results we would obtain from the methods described in this section, to those obtained from the methods of section 7.2.

Assume there are no ties. Since $x_i(t)=x_i$ is 0 or 1, we may write $Y_1(t)=\#\mathcal R(t)\cdot\mu_1(t)$ for the number of individuals in group 1, and $Y_0(t)=\#\mathcal R(t)\cdot(1-\mu_1(t))$. Also $\mu_2(t)=\mu_1(t)$. We have then by (9.10)
$$\begin{pmatrix}\hat B_0(t)\\\hat B_1(t)\end{pmatrix}
=\sum_{t_j\le t}\frac{1}{Y_0(t_j)Y_1(t_j)}
\begin{pmatrix}(1-x_j)Y_1(t_j)\\-(1-x_j)Y_1(t_j)+x_jY_0(t_j)\end{pmatrix}.$$
If we think of this as an excess mortality model, $B_1$ is the same as what was called $\Gamma$. We have
$$\hat B_1(t)=\sum_{t_j\le t}\left(\frac{\mathbf 1_{G_j=1}}{Y_1(t_j)}-\frac{\mathbf 1_{G_j=0}}{Y_0(t_j)}\right).$$
This is the same as the estimator we worked out for the two-sample case for excess mortality, where the weight function is 1.


(5) Refer to the AML study, which is described at length in Example 8.1.4 and analysed with the Cox model in section 11.3. Using the data described in those places, estimate the difference in cumulative hazard to 20 weeks between the two groups by

(a) The nonparametric method described in section 7.2;

In the terminology of section 8.1.4 there is no nuisance categorisation, so by (7.5) the difference may be estimated by the difference between the Nelson–Aalen estimators:
$$\hat\Gamma(t)=\int_0^t\frac{dN_1(s)}{Y_1(s)}-\int_0^t\frac{dN_0(s)}{Y_0(s)}.$$
Calling the Maintenance group number 1, and Nonmaintenance number 0, we read off of Table 8.2 $\hat A_1(20)=\hat A_1(18)=0.32$ and $\hat A_0(20)=0.49$, yielding
$$\hat\Gamma(20)=-0.17.$$

The variance will be the sum of the variances of the two estimators (since they are independent). As long as there are no ties between events from different groups, this may be estimated by
$$\sum_{t_i\le t}\sum_{k=0}^{d_i-1}\bigl(Y(G_i;t_i)-k\bigr)^{-2}.$$
From the Table we can see that this is
$$\hat\sigma_1^2(20)+\hat\sigma_0^2(20)=\frac{1}{12^2}+\frac{1}{11^2}+\frac{1}{10^2}+\frac{1}{9^2}+\frac{1}{8^2}+\frac{1}{10^2}+\frac{1}{8^2}=0.0788.$$
Thus, an approximate 95% confidence interval for $\hat\Gamma(20)$ would be
$$-0.17\pm0.28\cdot1.96=-0.17\pm0.550.$$
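The arithmetic can be checked directly; a quick Python transcription (the at-risk counts are the ones read off Table 8.2 in the sum above):

```python
import math

# At-risk counts at the event times contributing to the two variances
risk_counts = [12, 11, 10, 9, 8, 10, 8]
var_gamma = sum(1 / k**2 for k in risk_counts)   # estimated Var of Gamma-hat(20)
halfwidth = 1.96 * math.sqrt(var_gamma)          # 95% CI half-width
```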

(b) The semiparametric method based on the relative-risk regression.

The Cox model fit by coxph produced the outcome

coxph(formula = Surv(time, status) ~ x, data = aml)

                 coef exp(coef) se(coef)    z     p
xNonmaintained  0.916       2.5    0.512 1.79 0.074

Likelihood ratio test=3.38 on 1 df, p=0.0658  n= 23

In Table 11.2 we tabulated the estimators for the baseline hazard, obtaining $\hat A_0(18)=0.254$. A central estimate for the difference in cumulative hazard between the two groups would be
$$(1-e^{\hat\beta})\hat A_0(18)=-1.5\cdot0.254=-0.38.$$


We see that this is a substantially larger estimate than we made in the nonparametric model. This is consistent with the plot in Figure 11.3, where the purple circles and blue crosses (representing the survival estimates from the proportional hazards model for the two groups) are further apart at $t_i=18$ than the black and red lines (representing the Kaplan–Meier estimators). This reflects the fact that the separate Kaplan–Meier estimators are cruder, making larger jumps at less frequent intervals.

To estimate the standard error, we begin by assuming (with little justification) that the estimators $\hat\beta$ and $\hat A_0(t_i)$ are approximately independent. Then we can use the delta method to estimate the variance. Let $\sigma_\beta^2$ be the variance of $\hat\beta$, and $\sigma_A^2$ the variance of $\hat A_0(18)$. So we can represent
$$\hat\beta\approx\beta_0+\sigma_\beta Z,\qquad \hat A_0(18)=A_0(18)+\sigma_AZ',$$
where $Z$ and $Z'$ are standard normal (also approximately independent). We already have the estimate $\sigma_\beta\approx0.512$. We haven't given a formula for an estimator of $\sigma_A(18)$, but we can easily compute it with R.

require(survival)

cp=coxph(Surv(time,status)~x,data=aml)

aml.fit=survfit(cp)

aml.fit$std.err[aml.fit$time==18]

[1] 0.150247

Then our estimator for the difference in cumulative hazard is
$$\begin{aligned}
(1-e^{\hat\beta})\hat A_0(18)&\approx\bigl(1-e^{\beta_0+\sigma_\beta Z}\bigr)\bigl(A_0(18)+\sigma_AZ'\bigr)\\
&\approx\bigl(1-e^{\beta_0}(1+\sigma_\beta Z)\bigr)\bigl(A_0(18)+\sigma_AZ'\bigr)\\
&\approx\bigl(1-e^{\beta_0}\bigr)A_0(18)-e^{\beta_0}\sigma_\beta A_0(18)Z+\bigl(1-e^{\beta_0}\bigr)\sigma_AZ'-e^{\beta_0}\sigma_\beta\sigma_AZZ'.
\end{aligned}$$
(Note that the approximation in the first line is based on assuming $\sigma_\beta$ is much smaller than $\beta_0$, which isn't really very true here.) As long as we are assuming independence of $Z$ and $Z'$, the variance will be approximately
$$\bigl(e^{\beta_0}\sigma_\beta A_0(18)\bigr)^2+\bigl((1-e^{\beta_0})\sigma_A\bigr)^2=0.325^2+0.225^2=0.156,$$
so the standard error is about 0.395.

A better estimate, also taking into account the dependence between $\hat\beta$ and $\hat A_0$, could be obtained by not using the delta method, but instead treating the normal distribution of $\hat\beta$ as a Bayesian posterior distribution on $\beta_0$. For a range of possible $\beta_0$ we can compute an approximate mean and variance for $\hat A_0$, and then compute a Monte Carlo estimator of the variance of $\hat\Gamma$.
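The delta-method numbers are reproduced by the following Python sketch (values taken from the fitted model and Table 11.2 above):

```python
import math

beta0, sigma_beta = 0.916, 0.512   # Cox coefficient and its standard error
A0, sigma_A = 0.254, 0.150         # baseline cumulative hazard at 18 weeks and its SE

term1 = math.exp(beta0) * sigma_beta * A0    # contribution from uncertainty in beta
term2 = abs(1 - math.exp(beta0)) * sigma_A   # contribution from uncertainty in A0
var_delta = term1**2 + term2**2
se = math.sqrt(var_delta)
```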

(c) Using the proportional hazards method, suppose an individual were to switch from maintenance to non-maintenance after 10 weeks, and suppose the hazard rates change instantaneously. Estimate the difference in cumulative hazard to 20 weeks between that individual and one who had always been in the non-maintenance group.

We let $x_0(t)$ be the covariate trajectory for this individual; recalling that the maintained group is the baseline, this means that
$$x_0(t)=\begin{cases}0&\text{if }t\le10,\\1&\text{if }t>10.\end{cases}$$

Using the formula (10.9) we estimate for this individual
$$\begin{aligned}
\hat A\bigl(20\bigm|x_0\bigr)&=\int_0^{20}e^{\hat\beta x_0(u)}\,d\hat A_0(u)\\
&=\int_0^{10}d\hat A_0(u)+\int_{10}^{20}e^{\hat\beta}\,d\hat A_0(u)\\
&\approx\hat A_0(10)+2.5\bigl(\hat A_0(20)-\hat A_0(10)\bigr)\\
&=0.14+2.5(0.114)\\
&=0.425.
\end{aligned}$$
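The integral splits at the switch time, so the numerical computation is one line (Python, with the values from the text):

```python
import math

beta = 0.916                  # Cox coefficient for non-maintenance
A0_10, A0_20 = 0.14, 0.254    # baseline cumulative hazard at 10 and 20 weeks

# Baseline rate up to the switch, then multiplied by exp(beta) afterwards
A_switch = A0_10 + math.exp(beta) * (A0_20 - A0_10)
```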


C.5 Modern Survival Problem sheet 5: Relative risks and diagnostics

(1) Let $N(t)$ be a counting process with additive hazards $\lambda_i(t)=\lambda_0(t)+\sum_{k=1}^px_{ik}(t)\beta_k(t)$, with $B_k(t)=\int_0^t\beta_k(s)\,ds$. As in Lecture 12 we define $\mathbf N(t)$ to be the vector of the individual counting processes (so it is a binary vector), and similarly $\mathbf X(t)$ the matrix of covariates, and $\hat{\mathbf B}(t)$ the vector of regression coefficient estimators. Define the martingale residual
$$\mathbf M_{\mathrm{res}}(t)=\int_0^tJ(s)\,d\mathbf N(s)-\int_0^tJ(s)\mathbf X(s)\,d\hat{\mathbf B}(s),$$
where $J(s)$ is the indicator of $\mathbf X(s)^T\mathbf X(s)$ having full rank, hence of $\mathbf X^-(s)$ existing.

(a) Using the fact that
$$J(s)\bigl(\mathbf I-\mathbf X(s)\mathbf X^-(s)\bigr)\mathbf X(s)\equiv0,$$
show that $\mathbf M_{\mathrm{res}}$ is a martingale. (That is, every component is a martingale.)

By (9.3) we have that
$$\mathbf M(t):=\mathbf N(t)-\int_0^t\mathbf X(s)\,d\mathbf B(s)$$
is a martingale, and the estimator $\hat{\mathbf B}$ is defined in (9.5) as
$$\hat{\mathbf B}(t)=\int_0^t\mathbf X^-(s)\,d\mathbf N(s).$$
Thus
$$\begin{aligned}
\mathbf M_{\mathrm{res}}(t)&=\int_0^tJ(s)\,d\mathbf N(s)-\int_0^tJ(s)\mathbf X(s)\mathbf X^-(s)\,d\mathbf N(s)\\
&=\int_0^tJ(s)\bigl(\mathbf I-\mathbf X(s)\mathbf X^-(s)\bigr)\,d\mathbf N(s)\\
&=\int_0^tJ(s)\bigl(\mathbf I-\mathbf X(s)\mathbf X^-(s)\bigr)\,d\mathbf M(s)+\int_0^tJ(s)\bigl(\mathbf I-\mathbf X(s)\mathbf X^-(s)\bigr)\mathbf X(s)\,d\mathbf B(s).
\end{aligned}$$
The second term is identically 0, so we have
$$\mathbf M_{\mathrm{res}}(t)=\int_0^tJ(s)\bigl(\mathbf I-\mathbf X(s)\mathbf X^-(s)\bigr)\,d\mathbf M(s),$$
which is an integral of martingale increments, hence itself a martingale.


(b) Suppose now that all covariates are fixed and the data are right-censored, and let $\tau$ be the final time under consideration (such that $J(\tau)=1$). Show that
$$\mathbf X(0)^T\mathbf M_{\mathrm{res}}(\tau)=0.$$
(For time-fixed covariates we define $\mathbf X(t):=\mathbf Y(t)\mathbf X$, where $\mathbf Y(t)$ is the matrix with the at-risk indicators $Y_i(t)$ on the diagonal.)

We note that for right-censored data $\mathbf Y(s')\mathbf Y(s)=\mathbf Y(s)$ for $s'\le s$, and $\mathbf Y(s)\,d\mathbf N(s)=d\mathbf N(s)$, because $Y_i(s)=0$ implies that $dN_i(s)=0$. Thus
$$\mathbf Y(s)\,d\mathbf M(s)=\mathbf Y(s)\,d\mathbf N(s)-\mathbf Y(s)\mathbf X(s)\,d\mathbf B(s)=d\mathbf M(s).$$
For any $s$ with $J(s)=1$,
$$\begin{aligned}
\mathbf X^T\,d\mathbf M_{\mathrm{res}}(s)&=\bigl(\mathbf X^T-\mathbf X^T\mathbf Y(s)\mathbf X\mathbf X^-(s)\bigr)\,d\mathbf M(s)\\
&=\bigl(\mathbf X^T-\mathbf X^T\mathbf Y(s)\mathbf X\bigl(\mathbf X^T\mathbf Y(s)\mathbf X\bigr)^{-1}\mathbf X^T\bigr)\mathbf Y(s)\,d\mathbf M(s)\\
&=\mathbf X^T\,d\mathbf M(s)-\mathbf X^T\mathbf Y(s)\,d\mathbf M(s)\\
&=0.
\end{aligned}$$
Since this is true for all $s$, and since $\mathbf M_{\mathrm{res}}(0)=0$, it must be true that $\mathbf X(0)^T\mathbf M_{\mathrm{res}}(t)=0$ for all $t\le\tau$.

(c) How might this fact be used as a model-diagnostic for the additive-hazards assumption?

The equation $\mathbf X^T\mathbf M_{\mathrm{res}}(\tau)=0$ means that the $n$-dimensional vector $\mathbf M_{\mathrm{res}}(\tau)$ is orthogonal to each of the $p+1$ distinct $n$-dimensional vectors of covariate values. There is no linear trend with respect to the covariates. (In other words, in the linear regression model predicting $\mathbf M_{\mathrm{res}}(\tau)$ as a function of the covariates, the coefficients are all 0.)

If the additive hazards model is true, there should be no nonlinear effect of the covariates on the martingale residuals. So one possible model test is to plot the martingale residuals against nonlinear functions of the covariates — for instance, the square of a covariate, or a product of two covariates — and look for trends. This is described briefly in section 4.2.4 of [ABG08], and more extensively in [Aal93].


(2) Let
$$\begin{aligned}
\mathbf X_i(t)&=\text{vector of observed covariates for individual }i\text{ at time }t;\\
N_i(t)&=\text{counting process for individual }i\text{ at time }t;\\
\hat\beta&=\text{estimate of Cox regression coefficients};\\
\hat A_0(t)&=\text{estimate of baseline hazard in Cox model};\\
\hat M_i(t)&=N_i(t)-\int_0^tY_i(s)e^{\hat\beta^T\mathbf X_i(s)}\,d\hat A_0(s)\quad\text{the martingale residuals};\\
\bar X_k(t)&=\frac{\sum_{i=1}^nY_i(t)X_{ik}(t)e^{\hat\beta^T\mathbf X_i(t)}}{\sum_{i=1}^nY_i(t)e^{\hat\beta^T\mathbf X_i(t)}};\\
U_k(t)&=\sum_{i=1}^n\int_0^t\bigl[X_{ik}(s)-\bar X_k(s)\bigr]\,d\hat M_i(s).
\end{aligned}$$
$U_k$ is called the score process.

(a) Show that
$$U_k(t)=\sum_{t_j\le t}\bigl(X_{i_jk}(t_j)-\bar X_k(t_j)\bigr).$$

(The summands here are called Schoenfeld residuals.)

By definition,
$$\begin{aligned}
U_k(t)&=\sum_{i=1}^n\int_0^t\bigl[X_{ik}(s)-\bar X_k(s)\bigr]\bigl(dN_i(s)-Y_i(s)e^{\hat\beta\cdot\mathbf X_i(s)}\,d\hat A_0(s)\bigr)\\
&=\sum_{i=1}^n\int_0^t\left(\bigl[X_{ik}(s)-\bar X_k(s)\bigr]\,dN_i(s)-\bigl[X_{ik}(s)-\bar X_k(s)\bigr]Y_i(s)e^{\hat\beta\cdot\mathbf X_i(s)}\frac{\sum_{j=1}^ndN_j(s)}{\sum_{\ell=1}^ne^{\hat\beta\cdot\mathbf X_\ell(s)}Y_\ell(s)}\right).
\end{aligned}$$
We have
$$\sum_{i=1}^n\bigl[X_{ik}(s)-\bar X_k(s)\bigr]Y_i(s)e^{\hat\beta\cdot\mathbf X_i(s)}
=\sum_{i=1}^nX_{ik}(s)Y_i(s)e^{\hat\beta\cdot\mathbf X_i(s)}-\sum_{i=1}^nY_i(s)e^{\hat\beta\cdot\mathbf X_i(s)}\cdot\frac{\sum_{\ell=1}^nY_\ell(s)X_{\ell k}(s)e^{\hat\beta\cdot\mathbf X_\ell(s)}}{\sum_{\ell=1}^nY_\ell(s)e^{\hat\beta\cdot\mathbf X_\ell(s)}}=0.$$
Thus
$$U_k(t)=\sum_{i=1}^n\int_0^t\bigl[X_{ik}(s)-\bar X_k(s)\bigr]\,dN_i(s)
=\sum_{i=1}^n\bigl[X_{ik}(T_i)-\bar X_k(T_i)\bigr]\mathbf 1_{T_i\le t}
=\sum_{t_j\le t}\bigl(X_{i_jk}(t_j)-\bar X_k(t_j)\bigr).$$
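The key cancellation — that the weighted covariates sum to zero around their risk-set mean — holds for any numbers, as a quick Python check illustrates (the sample values are arbitrary, chosen by us):

```python
import math

# Arbitrary at-risk indicators, covariate values, and linear predictors beta.X_i
Y = [1, 1, 0, 1, 1]
X = [0.3, -1.2, 5.0, 2.4, 0.7]
eta = [0.5, -0.1, 0.9, 0.2, -0.4]

w = [y * math.exp(e) for y, e in zip(Y, eta)]        # weights Y_i exp(beta.X_i)
xbar = sum(x * wi for x, wi in zip(X, w)) / sum(w)   # risk-set mean X-bar
resid_sum = sum((x - xbar) * wi for x, wi in zip(X, w))  # should vanish
```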

(b) Show that the score process is the conditional expectation of the partial derivative of the log partial likelihood with respect to the coefficient $\beta_k$, conditioned on $\mathcal F_t$.

The log partial likelihood is
$$\ell_P(\beta)=\sum_{t_j}\left(\beta\cdot\mathbf X_{i_j}(t_j)-\log\sum_{i=1}^nY_i(t_j)e^{\beta\cdot\mathbf X_i(t_j)}\right).$$
The derivative with respect to $\beta_k$ is
$$\frac{\partial\ell_P}{\partial\beta_k}
=\sum_{t_j}\left(X_{i_jk}(t_j)-\frac{\sum_{i=1}^nY_i(t_j)X_{ik}(t_j)e^{\beta\cdot\mathbf X_i(t_j)}}{\sum_{i=1}^nY_i(t_j)e^{\beta\cdot\mathbf X_i(t_j)}}\right)
=\sum_{t_j}\bigl[X_{i_jk}(t_j)-\bar X_k(t_j)\bigr]
=\sum_{i=1}^n\int_0^\infty\bigl[X_{ik}(s)-\bar X_k(s)\bigr]\,dN_i(s).$$
The conditional expectation with respect to $\mathcal F_t$ is then
$$E\left[\frac{\partial\ell_P}{\partial\beta_k}\;\middle|\;\mathcal F_t\right]
=\sum_{t_j\le t}\bigl[X_{i_jk}(t_j)-\bar X_k(t_j)\bigr]+E\left[\sum_{t_j>t}\bigl[X_{i_jk}(t_j)-\bar X_k(t_j)\bigr]\;\middle|\;\mathcal F_t\right].$$
If we condition on an event at time $t_j>t$, since $i_j$ is distributed among the elements of $\{1,\dots,n\}$ in proportion to $Y_i(t_j)e^{\beta\cdot\mathbf X_i(t_j)}$,
$$E\bigl[X_{i_jk}(t_j)-\bar X_k(t_j)\bigm|\mathcal F_{t_j-}\bigr]
=\sum_{i=1}^nX_{ik}(t_j)\frac{Y_i(t_j)e^{\beta\cdot\mathbf X_i(t_j)}}{\sum_{i'}Y_{i'}(t_j)e^{\beta\cdot\mathbf X_{i'}(t_j)}}-\bar X_k(t_j)=0$$
for all $j$ and $k$. (We are here conditioning on the past up to a random stopping time $t_j$, something that was mentioned in section 2.1.10, but not formally introduced.) Since conditioning on the smaller $\sigma$-algebra $\mathcal F_t$ may be achieved by first conditioning on the larger, and then on the smaller, by formula (2.5), we see that all the conditional expectations for $t_j>t$ contribute 0 to the sum. Thus, we are left with only the first term
$$E\left[\frac{\partial\ell_P}{\partial\beta_k}\;\middle|\;\mathcal F_t\right]=\sum_{t_j\le t}\bigl[X_{i_jk}(t_j)-\bar X_k(t_j)\bigr]=U_k(t).$$

(c) Conclude that $U_k(0)=U_k(\infty)=0$.

$U_k(0)$ is trivially 0, by definition. $\hat\beta$ is chosen to satisfy the equation $\partial\ell_P/\partial\beta_k=0$. This is the same as $U_k(\infty)=0$.

(d) Explain why a plot of $U_k(t)$, suitably scaled, would be expected to look like a random walk conditioned to start and end at 0 (a discrete bridge) if the proportional hazards assumption holds.

The function starts at 0 and ends at 0. Since it is a martingale, it will behave like a time-changed Brownian motion (by the martingale CLT), except for being conditioned to end at 0.

(3) The dataset larynx in the package KMsurv includes times of death (or censoring by the end of the study) of 90 males diagnosed with cancer of the larynx between 1970 and 1978 at a single hospital. One important covariate is the stage of the cancer, coded as 1, 2, 3, 4.

(a) Why would it probably not be a good idea to fit the Cox model with relative risk $e^{\beta\cdot\mathrm{stage}}$?

That would treat the categorical variable as though it were quantitative, forcing the relative risks into particular proportions that have no empirical basis. There may be good reason to expect the relative risk to increase with stage, but not to expect particular proportions.

(b) Use a martingale residual plot to show that stage does not enter as a linear covariate.

We could fit the model without any covariates — so just find the Nelson–Aalen estimator — and use that as a basis for adding in the stage as a covariate and checking the martingale residuals. Here we will use age as an additional covariate. So we will fit the model $\alpha_i(t)=\alpha_0(t)e^{\beta\cdot\mathrm{age}}$, and check for the behaviour of stage as an additional covariate. We show a box plot in Figure C.3, showing the distributions of martingale residuals for the 4 different stages. What we see is that the residuals have essentially the same mean for stages 1 and 2, rise substantially for stage 3, and somewhat less for stage 4.


require(survival)
require(KMsurv)

data(larynx)
lar.cph=coxph(Surv(time,delta)~age,data=larynx)

      coef exp(coef) se(coef)    z    p
age 0.0233      1.02   0.0145 1.61 0.11

Likelihood ratio test=2.63 on 1 df, p=0.105  n= 90, number of events= 50

lar.fit=survfit(lar.cph)

# The coxph object has a list of times
# We want to find the index of the time corresponding to individual i.
whichtime=sapply(larynx$time,function(t) which(lar.fit$time==t))

cumhaz=-log(lar.fit$surv[whichtime])

beta=lar.cph$coefficients
relrisk=exp(beta*(larynx$age-mean(larynx$age)))
# Baseline hazard is for mean value of covariate

resids=larynx$delta-cumhaz*relrisk
# Note: We could get the same numbers out as lar.cph$residuals
resids.bystage=lapply(1:4,function(i) resids[larynx$stage==i])
boxplot(resids.bystage,xlab='Stage',ylab='Martingale residual')


Figure C.3: Box plot of martingale residuals for larynx data, stratified by stage.

(c) The alternative is to define three new binary covariates, coding for the patient being in stage 2, 3, or 4 respectively (leaving stage 1, where all three covariates are 0, as the baseline group). Fit this model. Are all of these covariates statistically significant?

(d) An equivalent approach is to replace stage in the model definition by factor(stage). Show that this produces the same result.

The R computation below shows that the coefficient for stage 2 is clearly not statistically significant; the coefficient for stage 3 is borderline (p = 0.071); and the coefficient for stage 4 is highly significant (p = 0.000053).


lar.cph=coxph(Surv(time,delta)~factor(stage)+age,data=larynx)

                 coef exp(coef) se(coef)     z       p
factor(stage)2  0.140      1.15   0.4625 0.303 7.6e-01
factor(stage)3  0.642      1.90   0.3561 1.804 7.1e-02
factor(stage)4  1.706      5.51   0.4219 4.043 5.3e-05
age             0.019      1.02   0.0143 1.335 1.8e-01

Likelihood ratio test=18.3 on 4 df, p=0.00107  n= 90, number of events= 50

(e) Try adding year of diagnosis or age at diagnosis as a linear covariate (in the exponent of the relative risk). Is either statistically significant?

In the code below we fit the model including age and stage. Again, only the coefficient for stage 4 is significantly greater than 0.

(f) Use a residual plot to test whether one or the other of these covariates might more appropriately enter the model in a different functional form — for example, as a step function.

The plot is shown in Figure C.4. We see that there seems to be no effect of the age variable until age 70, after which it seems to increase linearly.


Figure C.4: Plot of martingale residuals against age (years) for larynx data.

########## Residual plot to test age
# (lar.cph2 is the model fitted above with factor(stage)+age)

aord=order(age)

resids=lar.cph2$residuals[aord]

plot(age[aord],resids,xlab='Age (Yrs)',ylab='martingale residual')

lines(lowess(resids~age[aord]),col=2)

########## New model with age starting from 70

newage=pmax(larynx$age-70,0)  # keep rows in the original order of larynx

lar.cph=coxph(Surv(time,delta)~factor(stage)+newage,data=larynx)

(g) Use a Cox–Snell residual plot to test whether the Cox model is appropriate to these data.

There seems to be a marked curvature of the residual plot, suggesting that the model is underestimating the cumulative hazard later on.


lar.cph=coxph(Surv(time,delta)~factor(stage),data=larynx)
lar.fit=survfit(lar.cph)

whichtime=sapply(larynx$time,function(t) which(lar.fit$time==t))

cumhaz=-log(lar.fit$surv[whichtime])

beta=lar.cph$coefficients

# Binary stage indicators, as defined in part (c)
st2=as.numeric(larynx$stage==2)
st3=as.numeric(larynx$stage==3)
st4=as.numeric(larynx$stage==4)

relrisk=exp(matrix(beta,1,3)%*%rbind(st2-mean(st2),st3-mean(st3),st4-mean(st4)))

coxsnell=c(relrisk*cumhaz)

CS.surv=Surv(coxsnell,larynx$delta)  # delta in the same order as coxsnell

CS.fit=survfit(CS.surv~1)

plot(CS.fit$time,-log(CS.fit$surv),xlab='Time',
     ylab='Fitted cumulative hazard for Cox-Snell residuals')

abline(0,1,col=2)


Figure C.5: Cox–Snell residual plot for larynx data (fitted cumulative hazard of the Cox–Snell residuals against time, with reference line of slope 1).


C.6 Modern Survival Problem sheet 6: Censoring and truncation, frailty and repeated events

(1) A sample of patients taking a new blood pressure medication is asked whether they have experienced any vertigo since they started taking it; and if so, when the symptoms were first noticed. Some have not experienced symptoms yet, some report an exact time (in weeks after starting treatment), and some can only say they know it was before a certain time.

Which observations are left-censored? Right-censored? Estimate the survival function (that is, the probability of remaining symptom-free for $x$ weeks)

The first column are the right-censored observations. The second column are the left-censored observations.

(a) Ignoring the left-censored observations;

weeks   # at risk   px      S(x)
1       307         0.980   0.980
2       256         0.957   0.938
3       223         0.955   0.896
4       190         0.884   0.792
5       149         0.752   0.596
6       100         0.670   0.399
7       57          0.719   0.287
8       38          0.658   0.189
9       20          0.600   0.113
10      9           0.000   0.000
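As a quick sanity check on the table (sketched in Python rather than the R used elsewhere in these solutions), the S(x) column is just the running product of the conditional survival probabilities px:

```python
# p_x values read off the table above (conditional probabilities of
# remaining symptom-free through week x, given symptom-free before it).
px = [0.980, 0.957, 0.955, 0.884, 0.752, 0.670, 0.719, 0.658, 0.600, 0.000]

S = []
acc = 1.0
for p in px:
    acc *= p          # Kaplan-Meier: S(x) = product of p_1 ... p_x
    S.append(acc)
# S now tracks the S(x) column, up to the rounding of the p_x values.
```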

(b) Ignoring the right-censored observations;

We reverse time from 11 weeks. Letting T be the time of first vertigo and τi = 11 − Ti, we compute a Kaplan-Meier survival estimator for τ.


weeks   # at risk   px      Sτ(x)
1       213         0.958   0.958
2       189         0.958   0.917
3       172         0.924   0.848
4       155         0.897   0.760
5       130         0.746   0.567
6       91          0.593   0.337
7       52          0.577   0.194
8       27          0.630   0.122
9       17          0.353   0.043
10      6           0.000   0.000

Now, Sτ(x) is an estimator for

P(11 − T > x) = P(T < 11 − x) = 1 − P(T ≥ 11 − x) = 1 − ST((11 − x)−).

Thus we can estimate

ST(y) ≈ 1 − Sτ((11 − y)−) = 1 − Sτ(10 − y);

that is, the estimate of survival for τ just before time 11 − y, which in this case is the same as the survival estimate at time 10 − y, since there is no change between integer times. This yields

weeks   ST(x)
1       0.957
2       0.878
3       0.806
4       0.663
5       0.433
6       0.240
7       0.152
8       0.083
9       0.042
10      0.000

Note that there is a certain amount of ambiguity here in the way we have dealt with the discreteness of the censoring and event times.
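The reversal identity ST(y) = 1 − Sτ(10 − y) can be checked directly against the two tables; a small Python sketch, with the Sτ values read off the reversed-time table (and Sτ(0) = 1):

```python
# Reversed-time survival estimates S_tau(x), keyed by x = 0..10.
S_tau = {0: 1.0, 1: 0.958, 2: 0.917, 3: 0.848, 4: 0.760, 5: 0.567,
         6: 0.337, 7: 0.194, 8: 0.122, 9: 0.043, 10: 0.000}

# Forward-time estimate: S_T(y) = 1 - S_tau(10 - y) for y = 1..10.
S_T = {y: round(1 - S_tau[10 - y], 3) for y in range(1, 11)}
# S_T reproduces the S_T(x) table above.
```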

(c) Taking all observations into account.

We apply Turnbull's algorithm for doubly censored data, beginning with the solution from part (a), calling that S0. Because the data come at discrete times, the grid of times will just be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.


We compute

p^{(0)}_{jℓ} = (S0(t_{ℓ−1}) − S0(t_ℓ)) / (1 − S0(t_j)) =

    1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    0.32 0.68 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    0.19 0.41 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    0.09 0.20 0.20 0.50 0.00 0.00 0.00 0.00 0.00 0.00
    0.05 0.10 0.10 0.26 0.49 0.00 0.00 0.00 0.00 0.00
    0.03 0.07 0.07 0.17 0.33 0.33 0.00 0.00 0.00 0.00
    0.03 0.06 0.06 0.15 0.28 0.28 0.16 0.00 0.00 0.00
    0.02 0.05 0.05 0.13 0.24 0.24 0.14 0.12 0.00 0.00
    0.02 0.05 0.05 0.12 0.22 0.22 0.13 0.11 0.09 0.00
    0.02 0.04 0.04 0.10 0.20 0.20 0.11 0.10 0.08 0.11
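The matrix can be recomputed from the part (a) estimate S0; a Python sketch, with the S0 values read from the part (a) table (and S0(0) = 1):

```python
# S0[t] = part (a) estimate of P(symptom-free through week t), t = 0..10.
S0 = [1.0, 0.980, 0.938, 0.896, 0.792, 0.596, 0.399, 0.287, 0.189, 0.113, 0.000]

# p[j-1][l-1] = probability that an observation left-censored at week j
# had its event in week l: (S0(t_{l-1}) - S0(t_l)) / (1 - S0(t_j)) for l <= j.
p = [[(S0[l - 1] - S0[l]) / (1 - S0[j]) if l <= j else 0.0
      for l in range(1, 11)]
     for j in range(1, 11)]
# Each row sums to 1, and rounding to two places reproduces the matrix above.
```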

Note that we have treated left-censoring at time t as meaning T ≤ t. This may seem inappropriate: Perhaps someone who at week 3 cannot recall when symptoms began should be assumed to have started them in weeks 1 or 2. On the other hand, it is plausible that someone would have symptoms beginning during week 3, but would report at the end of week 3 that she can't remember which week they started in. This is part of the more general problem, that our modelling assumption of non-informative left censoring is probably not very appropriate to this story.

Now, we reassign the left-censored observations according to this distribution, obtaining the following numbers of estimated events:

weeks   number of events
1       7.40
2       14.00
3       13.00
4       29.50
5       48.30
6       43.40
7       20.80
8       16.00
9       9.90
10      10.70

Computing the Kaplan–Meier estimator again, with these new numbers of events, we get


weeks   # at risk   px      S1(x)
1       355         0.979   0.979
2       302         0.954   0.934
3       266         0.951   0.888
4       230         0.872   0.774
5       182         0.734   0.569
6       121         0.644   0.366
7       68          0.696   0.255
8       44          0.642   0.164
9       23          0.581   0.095
10      10          0.000   0.000

This is our second estimator, S1. It is slightly different from S0. We can iterate the procedure, carrying out exactly the same calculation with S1 in place of S0. The new estimated numbers of events are

weeks   number of events
1       7.41
2       14.04
3       13.03
4       29.48
5       48.34
6       43.36
7       20.78
8       15.95
9       9.90
10      10.70

The redistribution is minuscule, so it is probably not worth continuing with another iteration of the survival estimation.

(2) In order to control the spread of a virus in a wild population, researchers spread food items laced with a vaccine. Once a week they capture a small number of animals and test whether they have developed an immune response.

week             1   2   3   4   5   6   7   8   9   10
number sampled   5   4   7   3   4   6   3   8   5   4
number immune    0   1   2   0   2   1   2   4   4   3

Estimate the probability of being immune at week t


(a) using an exponential model;

Using the results in section 15.2.1 we have the log likelihood

ℓ(λ) = ∑_{c=1}^{10} kc log(1 − e^{−λc}) − λ ∑_{c=1}^{10} (nc − kc) c,

where nc is the number of animals sampled at week c, and kc the number found to be immune. We can find the maximum numerically:

n = c(5, 4, 7, 3, 4, 6, 3, 8, 5, 4)
k = c(0, 1, 2, 0, 2, 1, 2, 4, 4, 3)

loglik = function(lambda) {
  c = 1:10
  -sum(k * log(1 - exp(-lambda * c)) - lambda * c * (n - k))
}
# Note: This is negative log likelihood because optimize finds minima

optimize(loglik, c(0, 2))
$minimum
[1] 0.097283

$objective
[1] 27.88519

So λ = 0.097.
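The same maximisation can be reproduced outside R; a Python sketch using golden-section search, which suffices here because the log likelihood is concave (hence unimodal) in λ:

```python
import math

n = [5, 4, 7, 3, 4, 6, 3, 8, 5, 4]   # sampled in weeks 1..10
k = [0, 1, 2, 0, 2, 1, 2, 4, 4, 3]   # found immune

def negloglik(lam):
    # Negative of l(lambda) = sum_c k_c log(1 - e^{-lam c}) - lam sum_c (n_c - k_c) c
    return -sum(kc * math.log(1 - math.exp(-lam * c)) - lam * c * (nc - kc)
                for c, (nc, kc) in enumerate(zip(n, k), start=1))

# Golden-section search for the minimum on (0, 2).
lo, hi = 1e-6, 2.0
phi = (math.sqrt(5) - 1) / 2
while hi - lo > 1e-8:
    a = hi - phi * (hi - lo)
    b = lo + phi * (hi - lo)
    if negloglik(a) < negloglik(b):
        hi = b
    else:
        lo = a
lam_hat = (lo + hi) / 2   # agrees with optimize: about 0.0973
```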

(b) using a Weibull model;

The Weibull log likelihood with cumulative hazard parametrised as Λ(t) = (λt)^r is

ℓ(λ, r) = ∑_{c=1}^{10} kc log(1 − e^{−(λc)^r}) − λ^r ∑_{c=1}^{10} (nc − kc) c^r.

We use the nlm function to minimise a function of two variables. (We need to give a starting point, for which we take the exponential solution that we found in the previous example.)

loglik = function(lambda, r) {
  c = 1:10
  -sum(k * log(1 - exp(-(lambda * c)^r)) - (lambda * c)^r * (n - k))
}

> nlm(function(x) loglik(x[1], x[2]), c(.1, 1))
$minimum
[1] 27.52188

$estimate
[1] 0.1135262 1.4416975

$gradient
[1] 9.006129e-09 3.998258e-08

$code
[1] 1

$iterations
[1] 11

Thus the MLE for the Weibull distribution has parameters (λ, r) = (0.11, 1.44). We note that the log likelihood has been increased only from −27.8 to −27.5, so by the likelihood ratio test we would not take the Weibull distribution as an improvement.
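The likelihood-ratio comparison in that last sentence can be made explicit; a Python sketch using the two minimised negative log likelihoods reported above (3.841 is the 95% point of the χ² distribution with one degree of freedom, for the single extra parameter r):

```python
negloglik_exp = 27.88519    # from optimize, exponential model
negloglik_weib = 27.52188   # from nlm, Weibull model

# LR statistic: twice the gain in log likelihood from freeing r.
lrt = 2 * (negloglik_exp - negloglik_weib)
chi2_crit_95 = 3.841               # 95% point of chi-squared with 1 df
improvement = lrt > chi2_crit_95   # False: keep the exponential model
```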

(c) using the nonparametric MLE.

We apply the Pool Adjacent Violators Algorithm. We start by calculating the fraction "surviving" at each census time

week                      1     2     3     4     5     6     7     8     9     10
number sampled            5     4     7     3     4     6     3     8     5     4
fraction not yet immune   1.0   0.75  0.71  1.0   0.5   0.83  0.33  0.50  0.20  0.25

We see that weeks (3, 4), (5, 6), (7, 8), and (9, 10) are all increasing. So we pool these observations:

week                      1     2     3-4   5-6   7-8   9-10
number sampled            5     4     10    10    11    9
fraction not yet immune   1.0   0.75  0.8   0.7   0.45  0.22

There remains one increasing sequence, so we pool (2, 3, 4):

week                      1     2-4   5-6   7-8   9-10
number sampled            5     14    10    11    9
fraction not yet immune   1.0   0.79  0.7   0.45  0.22
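The pooling above can be written as a short Pool Adjacent Violators routine; a Python sketch on the (not-yet-immune, sampled) counts:

```python
n = [5, 4, 7, 3, 4, 6, 3, 8, 5, 4]   # sampled in weeks 1..10
k = [0, 1, 2, 0, 2, 1, 2, 4, 4, 3]   # found immune
blocks = [[nc - kc, nc] for nc, kc in zip(n, k)]   # [not yet immune, sampled]

# The fitted fractions must be non-increasing; pool adjacent violators.
i = 0
while i < len(blocks) - 1:
    s1, m1 = blocks[i]
    s2, m2 = blocks[i + 1]
    if s1 / m1 < s2 / m2:             # violation: estimate increases
        blocks[i] = [s1 + s2, m1 + m2]
        del blocks[i + 1]
        i = max(i - 1, 0)             # pooling may create a violation to the left
    else:
        i += 1

fractions = [round(s / m, 2) for s, m in blocks]
# fractions: [1.0, 0.79, 0.7, 0.45, 0.22], matching the final table.
```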

(3) A population has multiplicative frailty, so that the mortality rate for individual i is Bi α(x) at age x, where the Bi are i.i.d. positive random variables and lim_{x→∞} α(x) = ∞.

(a) Show that the population mortality goes to ∞ as x → ∞ if the distribution of Bi is bounded away from 0.


Suppose Bi ≥ b > 0 with probability 1. Since e^{−Biθ} > 0,

−L′(θ)/L(θ) = E[Bi e^{−Biθ}] / E[e^{−Biθ}] ≥ E[b e^{−Biθ}] / E[e^{−Biθ}] = b.

By equation (16.2), it follows that the population mortality rate is bounded below by b α(x), which goes to ∞.

(b) Show that the population mortality converges to a finite constant as x → ∞ if the distribution of Bi has nonzero density at 0 and the hazard rate does not grow too quickly as x → ∞. Give a formal condition for what "too quickly" would be.

Let f : R+ → R+ be the density of Bi, with f(0) > 0. For large θ, e^{−θx} f(x) is almost the same as e^{−θx} f(0) (since e^{−θx} f(x) is nearly 0 except for x ≈ 0), so

−L′(θ)/L(θ) = [∫_0^∞ x e^{−θx} f(x) dx] / [∫_0^∞ e^{−θx} f(x) dx] ≈ (f(0)/θ²) / (f(0)/θ) = 1/θ.

Thus, by (16.2) the population mortality for large x is

μ(x) ≈ α(x)/A(x), where A(x) = ∫_0^x α(y) dy.

This will be bounded, unless α(x) grows extremely fast. The condition that needs to be satisfied is that

lim_{x→∞} [∫_0^x α(y) dy] / α(x) > 0.

This will certainly be true if α is a Gompertz hazard (so grows exponentially with x).

(c) Suppose now that the baseline hazard is Gompertz, i.e., α(x) = e^{θx}.

i. If the Bi have Gamma distribution with parameters (r, λ) (λ is the rate parameter), compute the population mortality rate μ(t) at age t.

The Laplace transform is L(c) = (1 + c/λ)^{−r}. Thus the population mortality is

μ(x) = (r e^{θx}/λ) (1 + (e^{θx} − 1)/(θλ))^{−1} = θr / ((θλ − 1) e^{−θx} + 1).
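A quick numerical sanity check (Python sketch; the parameter values r = 2, λ = 3, θ = 0.1 are arbitrary choices for the check) that the two forms of μ(x) agree, computing the first form via A(x) = (e^{θx} − 1)/θ:

```python
import math

r, lam, theta = 2.0, 3.0, 0.1   # arbitrary test values for the check

gaps = []
for x in [0.0, 5.0, 20.0, 50.0]:
    A = (math.exp(theta * x) - 1) / theta             # cumulative baseline hazard
    mu_direct = math.exp(theta * x) * r / (lam + A)   # alpha(x) * r / (lam + A(x))
    mu_closed = theta * r / ((theta * lam - 1) * math.exp(-theta * x) + 1)
    gaps.append(abs(mu_direct - mu_closed))
# The two expressions agree to floating-point precision,
# and mu(x) tends to theta * r as x grows.
```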

ii. What is the hazard ratio between a subpopulation whose frailty has Gamma distribution with parameters (r, λ) and one with parameters (r′, λ)?

From the above formula it will be r/r′.

(4) The paper [ZKJ07] includes a dataset, available to download from the Journal of Statistical Science, on the healthcare demand of 4406 patients in the public old-age health insurance scheme Medicare in the US. When you load this file in, the data will be in a data-frame DebTrivedi.


(a) The number of physician office visits is enumerated in the variable ofp, while numchron gives the number of chronic conditions, and health gives self-reported health (poor, average, excellent). Do one or more exploratory plots to illustrate the distributions of these variables, and their relationship.

attach(DebTrivedi)
# box plot of visits by health status
boxplot(ofp ~ health)
# box plot of visits by number of conditions
boxplot(ofp ~ factor(numchron), xlab = 'Number of chronic conditions',
        ylab = 'Number of physician visits')
# histogram (bar plot) of number of conditions, stratified by health
plot(-1, -1, xlim = c(-.5, 8.5), ylim = c(0, 1),
     xlab = 'Number of chronic conditions', ylab = 'fraction')
c = -1
for (L in levels(health)) {
  h = hist(numchron[health == L], breaks = 0:9, plot = FALSE)
  rect((0:8) + c/3 - 1/6, 0, (0:8) + c/3 + 1/6, h$density, col = c + 3)
  c = c + 1
}
legend(4, .6, c('poor health', 'average health', 'excellent health'),
       col = 2:4, lwd = 3)

Figure C.6: Box plot of visits by health status.


Figure C.7: Box plot of visits by number of conditions.

Figure C.8: Histograms of number of chronic conditions, stratified by health status.


(b) Fit a Poisson regression model to predict the number of office visits as a function of health, numchron, gender, school (number of years of schooling), and privins (indicator of whether the patient has private insurance). Interpret the result.

> preg = glm(ofp ~ health + numchron + gender + school + privins, family = poisson)
> summary(preg)

Call:
glm(formula = ofp ~ health + numchron + gender + school + privins,
    family = poisson)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-6.2816  -2.0370  -0.7143   0.7301  16.2655

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      1.034542   0.023857  43.364   <2e-16 ***
healthpoor       0.318205   0.017479  18.205   <2e-16 ***
healthexcellent -0.379045   0.030291 -12.514   <2e-16 ***
numchron         0.168793   0.004471  37.755   <2e-16 ***
gendermale      -0.108014   0.012943  -8.346   <2e-16 ***
school           0.025754   0.001843  13.972   <2e-16 ***
privinsyes       0.216007   0.016872  12.803   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23808  on 4399  degrees of freedom
AIC: 36597

Number of Fisher Scoring iterations: 5

All of these effects seem to be significant. Unsurprisingly, excellent health is associated with a reduction in the rate of physician visits, and poor health with an increase. Male patients have slightly fewer (by a factor of e^{−0.108} = 0.898). More schooling and private insurance are both associated with an increase in the number of office visits.
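The factor e^{−0.108} = 0.898 quoted above comes from exponentiating the fitted coefficients, which turns log-rate differences into rate ratios; a Python sketch with the estimates from the summary:

```python
import math

coef = {                      # Poisson regression estimates from summary(preg)
    'healthpoor':       0.318205,
    'healthexcellent': -0.379045,
    'numchron':         0.168793,
    'gendermale':      -0.108014,
    'school':           0.025754,
    'privinsyes':       0.216007,
}

# Multiplicative effect of each covariate on the expected number of visits.
rate_ratio = {name: round(math.exp(b), 3) for name, b in coef.items()}
# e.g. gendermale -> 0.898 (about 10% fewer visits for men),
#      privinsyes -> 1.241 (about 24% more visits with private insurance).
```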

(c) Explain why you might want to fit a negative binomial model instead. Do the fit, and interpret the result.

We would expect individuals to have differing propensities to go to see a physician, separate from the factors included in the model. And the tail of ofp seems much too long (ofp is overdispersed relative to Poisson) to be explained by any Poisson distribution.


Solutions 6 – Survival Analysis – Oxford MT 2013

> preg2 = glm.nb(ofp ~ health + numchron + gender + school + privins, link = log)
> summary(preg2)

Call:
glm.nb(formula = ofp ~ health + numchron + gender + school +
    privins, link = log, init.theta = 1.164195333)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6730  -1.0062  -0.3002   0.2859   5.6124

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.940307   0.055296  17.005  < 2e-16 ***
healthpoor       0.367665   0.048733   7.544 4.54e-14 ***
healthexcellent -0.373647   0.061669  -6.059 1.37e-09 ***
numchron         0.195760   0.012067  16.223  < 2e-16 ***
gendermale      -0.115130   0.031609  -3.642  0.00027 ***
school           0.027179   0.004451   6.106 1.02e-09 ***
privinsyes       0.250154   0.040008   6.253 4.04e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.1642) family taken to be 1)

    Null deviance: 5607.2  on 4405  degrees of freedom
Residual deviance: 5039.2  on 4399  degrees of freedom
AIC: 24470

Number of Fisher Scoring iterations: 1

              Theta:  1.1642
          Std. Err.:  0.0320

 2 x log-likelihood:  -24453.9070

We see a huge reduction in AIC, indicating a superior fit. The deviance residuals are much more controlled. On the other hand, the parameter estimates remain fairly similar.
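As a check on the reported fit, the AIC line can be reconstructed from the 2 × log-likelihood line; a Python sketch (the parameter count 8 = 7 regression coefficients plus the dispersion parameter θ):

```python
two_loglik = -24453.9070   # "2 x log-likelihood" from summary(preg2)
n_params = 7 + 1           # 7 coefficients plus theta

aic = -two_loglik + 2 * n_params   # AIC = -2 log L + 2p
# aic is 24469.9, matching the reported AIC of 24470 after rounding;
# compare 36597 for the Poisson fit.
```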