statistics for biology and healthlinear, logistic, survival, and repeated measures models...

30
Statistics for Biology and Health Series Editors M. Gail, K. Krickeberg, J. Sarmet, A. Tsiatis, W. Wong

Upload: others

Post on 31-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Statistics for Biology and HealthSeries EditorsM. Gail, K. Krickeberg, J. Sarmet, A. Tsiatis, W. Wong

Page 2: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Statistics for Biology and Health

Bacchieri/Cioppa: Fundamentals of Clinical ResearchBorchers/Buckland/Zucchini: Estimating Animal Abundance: Closed PopulationsBurzykowski/Molenberghs/Buyse: The Evaluation of Surrogate Endpoints

Everitt/Rabe-Hesketh: Analyzing Medical Data Using S-PLUSEwens/Grant: Statistical Methods in Bioinformatics: An Introduction, 2nd ed.Gentleman/Carey/Huber/Irizarry/Dudoit: Bioinformatics and Computational Biology

Solutions Using R and BioconductorHougaard: Analysis of Multivariate Survival DataKeyfitz/Caswell: Applied Mathematical Demography, 3rd ed.Klein/Moeschberger: Survival Analysis: Techniques for Censored and Truncated

Data, 2nd ed.Kleinbaum/Klein: Survival Analysis: A Self-Learning Text, 2nd ed.Kleinbaum/Klein: Logistic Regression: A Self-Learning Text, 2nd ed.Lange: Mathematical and Statistical Methods for Genetic Analysis, 2nd ed.Manton/Singer/Suzman: Forecasting the Health of Elderly PopulationsMartinussen/Scheike: Dynamic Regression Models for Survival DataMoyé: Multiple Analyses in Clinical Trials: Fundamentals for InvestigatorsNiels.en: Statistical Methods in Molecular Evolution

Parmigiani/Garrett/Irizarry/Zeger: The Analysis of Gene Expression Data: Methodsand Software

Proschan/LanWittes: Statistical Monitoring of Clinical Trials: A Unified ApproachSiegmund/Yakir: The Statistics of Gene MappingSimon/Korn/McShane/Radmacher/Wright/Zhao: Design and Analysis of DNA

Microarray InvestigationsSorensen/Gianola: Likelihood, Bayesian, and MCMC Methods in Quantitative

GeneticsStallard/Manton/Cohen: Forecasting Product Liability Claims: Epidemiology and

Modeling in the Manville Asbestos CaseSun: The Statistical Analysis of Interval-censored Failure Time DataTherneau/Grambsch: Modeling Survival Data: Extending the Cox ModelTing: Dose Finding in Drug DevelopmentVittinghoff/Glidden/Shiboski/McCulloch: Regression Methods in Biostatistics:

Linear, Logistic, Survival, and Repeated Measures ModelsWu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTLZhang/Singer: Recursive Partitioning in the Health SciencesZuur/Ieno/Smith: Analyzing Ecological Data

Cook/Lawless : The Statistical Analysis of Recurrent Events

'O Quigley: Proportional Hazards Regression

Duchateau/Janssen: The Frailty Model

Page 3: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Luc DuchateauPaul Janssen

The Frailty Model

ABC

Page 4: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Series EditorsM. Gail K. KrickebergNational Cancer Institute Le Chatelet Department of EpidemiologyRockville, MD 20892 F-63270 Manglieu School of Public HealthUSA France Johns Hopkins University

615 Wolfe StreetBaltimore, MD 21205-2103USA

A. Tsiatis W. WongDepartment of Statistics Department of StatisticsNorth Carolina State Stanford University

University Stanford, CA 94305-4065Raleigh, NC 27695 USAUSA

J. Sarmet

c© 2008 Springer Science+Business Media, LLC

[email protected]

Ghent University

Belgium

Center for Statistics

[email protected]

Paul Janssen

Hasselt University

Luc Duchateau

Agoralaan – Building D

Faculty of Veterinary Medicine

B-9820 MerelbekeBelgium

Salisburylaan 133B-3590 Diepenbeek

ISBN 978-0-387-72834-6 e-ISBN 978-0-387-72835-3

Library of Congress Control Number: 2007929935

All rights reserved. This work may not be translated or copied in whole or in part without thethe publisher (Springer Science + Business Media, LLC, 233 Spring

Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews orscholarly analysis. Use in connection with any form of information storage and retrieval,electronic adaptation, computer software, or by similar or dissimilar methodology now knownor hereafter developed is forbidden.The use in this publication of trade names, trademarks, service marks, and similar terms, even ifthey are not identified as such, is not to be taken as an expression of opinion as to whether ornot they are subject to proprietary rights.

written permission of

Printed on acid-free paper.

9 8 7 6 5 4 3 2 1

springer.com

Page 5: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

To Pascale, Veerle, and Katelijne (L.D.)

To Mia, Hans, and Marleen (P.J.)

Page 6: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Preface

Survival analysis techniques are used in a variety of disciplines including hu-man and veterinary medicine, epidemiology, engineering, biology and econ-omy. Proportional hazards models and accelerated failure time models areclassical models that are frequently used for the analysis of univariate (cen-sored) survival data.

In recent years a number of papers appeared, extending these models tomodels that are suitable to handle more complex survival data. In this contexta lot of attention has been paid to frailty models and copula models. Thesemodels provide a powerful tool to analyse clustered survival data.

In this book the focus is on frailty models, but similarities as well asdifferences between frailty models and copula models are highlighted. Frailtymodels provide a nice way to capture and to describe the dependence ofobservations within a cluster and/or the heterogeneity between clusters. Thespecification of a frailty model is rather easy. Frailty models are hazard modelshaving a multiplicative frailty factor. This factor specifies how frail subjects ina specific cluster are. The hazard factor in the model can depend on covariatesand can be modelled in a parametric or a semiparametric way. Frailty modelsare conditional models. The frailty factor is random and therefore a frailtydistribution needs to be specified in the frailty model. Different choices offrailty distributions are studied in detail in this book; attention is paid to therelation between the choice of the distribution and the type of dependence itgenerates between observations within a cluster.

Once the frailty model is specified we need to fit the model. This is oftennot an easy task and in many cases standard software to fit such models ismissing. For particular choices of the frailty distribution, the frailties can beintegrated out with respect to the frailty distribution from the conditionallikelihood resulting in the marginal likelihood. For parametric frailty mod-els, maximisation of the marginal likelihood then leads to estimates of theparameters in the model. However, for semiparametric frailty models, morecomplex estimation techniques are needed. These techniques include the EMalgorithm, penalised likelihood techniques, and Gibbs sampling in a Bayesian

vii

Page 7: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

viii Preface

context. This book does not give an exhaustive overview of all models andtechniques that are available for the analysis of clustered survival data. Werather aim at an in-depth discussion of the most current techniques and wewant to demonstrate, based on real data sets, how results obtained from thestatistical analysis should be interpreted. By doing so, we hope to bring theseimportant tools closer to the applied statisticians.

In the two final chapters of the book we consider more complex frailtymodels including hierarchical models with more than one level of clustering.

The data sets used in the book are taken from the different disciplines inwhich the authors worked during the last ten years. Three data sets are pro-vided by the European Organisation for Research and Treatment of Cancer(EORTC, Brussels, Belgium): the periop data set (Example 1.5), the DCISdata set (Example 1.6), and the bladder cancer data set (Example 1.10). An-other study in human medicine is a large asthma prevention trial that was runby Union Chimique Belge (UCB) to assess the effect of a new drug (Example1.9). An infant mortality data set from Jimma (Ethiopia) resulted from a largelongitudinal study run by the Public Health investigators at Jimma University(Example 1.11). The other data sets are from veterinary sciences. The firststudy (Example 1.1) deals with the epidemiology of East Coast Fever, an im-portant disease of cattle in eastern and southern Africa. The data are providedby the Institute of Tropical Medicine (ITM, Antwerp, Belgium). The Depart-ment of Medical Imaging of Domestic Animals (Ghent University, Belgium)has provided data on time to diagnosis of fracture healing in dogs (Example1.2). We further include two data sets from the Department of Physiologyand Biometrics (Ghent University, Belgium) on mastitis and udder infection(Examples 1.3 and 1.4). Finally the Department of Reproduction, Obstetricsand Herd Health (Ghent University, Belgium) has kindly provided a cullingdata set (Example 1.7) and an insemination data set (Example 1.8).

This book is intended for students and for applied statisticians. It can beused as a graduate course for students who have had a course on survivalanalysis and a course on statistical inference. Applied statisticians will finda lot of examples in the book, which will help to get a better understandingof the theoretical ideas. The code of the programs used in the examples ofthe book is freely available from the Springer Verlag website. This gives thereaders the opportunity to try the examples themselves and to adapt the codeor write new code for their particular problems.

Selected parts of the book have been used as teaching material for a seriesof lectures at the Biometrics and Clinical Informatics Department of Johnson& Johnson Pharmaceutical Research and Development in Beerse, Belgium(October-November 2005) and for workshops in Nairobi (Kenya, 2004) andAddis Ababa (Ethiopia, 2005), sponsored by the Flemish InteruniversitaryCouncil - University Development Cooperation. We also gratefully acknowl-edge the financial support from the IAP research network P6/03 of the BelgianGovernment (Belgian Science Policy).

Page 8: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Preface ix

We thank all those who provided in one way or another support tothis book project: Bart Ampe, Sarne De Vliegher, Els Goetghebeur, KlaraGoethals, Hans Laevens, Geert Opsomer, and Marije Risselada (Ghent Univer-sity, Belgium), Tomasz Burzykowski, Goele Massonnet, and Noel Veraverbeke(Hasselt University, Belgium), Arnost Komarek (Catholic University Leuven,Belgium), Catherine Legrand and Ingrid Van Keilegom (Universite Catholiquede Louvain, Belgium), Dirk Berkvens (ITM, Belgium), Richard Sylvester(EORTC, Belgium), Catherine Fortpied (UCB, Belgium), Vincent Ducrocq(INRA, Jouy-en-Josas, France), Andreas Wienke (Martin-Luther-UniversitatHalle-Wittenberg, Germany), Samuli Ripatti (Karolinska Institute, Sweden),Virginie Rondeau (INSERM, Bordeaux, France), Elisa Molanes (University ofLa Coruna, Spain), Mekonnen Assefa and Fasil Tessema (Jimma University,Ethiopia), and Daan de Waal (University of the Free State, South Africa).

A word of special thanks from Paul Janssen goes to his next-door colleaguefor many years, Noel Veraverbeke, for many discussions on different issues insurvival analysis and in statistics in general.

Luc Duchateau would like to thank Herman Ramon, Paul Darius, BrunoGoddeeris, Jef Vercruysse, and Christian Burvenich for their stimulating sup-port in his different research undertakings.

Luc DuchateauPaul Janssen

September 2007

Page 9: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Glossary of Definitions and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . xv

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.4.1 Survival likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181.4.2 Proportional hazards models . . . . . . . . . . . . . . . . . . . . . . . . 201.4.3 Accelerated failure time models . . . . . . . . . . . . . . . . . . . . . 261.4.4 The loglinear model representation . . . . . . . . . . . . . . . . . . 30

1.5 Semantics and history of the term frailty . . . . . . . . . . . . . . . . . . . 32

2 Parametric proportional hazards models with gammafrailty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432.1 The parametric proportional hazards model with frailty term . 442.2 Maximising the marginal likelihood: the frequentist approach . 452.3 Extension of the marginal likelihood approach to

interval-censored data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612.4 Posterior densities: the Bayesian approach . . . . . . . . . . . . . . . . . . 65

2.4.1 The Metropolis algorithm in practice for theparametric gamma frailty model . . . . . . . . . . . . . . . . . . . . . 65

2.4.2 ∗ Theoretical foundations of the Metropolis algorithm . . . 742.5 Further extensions and references . . . . . . . . . . . . . . . . . . . . . . . . . . 75

3 Alternatives for the frailty model . . . . . . . . . . . . . . . . . . . . . . . . . . 773.1 The fixed effects model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.1.1 The model specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

xi

Page 10: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

xii Contents

3.1.2 ∗ Asymptotic efficiency of fixed effects model parameterestimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.2 The stratified model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873.3 The copula model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

3.3.1 Notation and definitions for the conditional, joint, andpopulation survival functions . . . . . . . . . . . . . . . . . . . . . . . 93

3.3.2 Definition of the copula model . . . . . . . . . . . . . . . . . . . . . . 953.3.3 The Clayton copula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973.3.4 The Clayton copula versus the gamma frailty model . . . 99

3.4 The marginal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1043.4.1 Defining the marginal model . . . . . . . . . . . . . . . . . . . . . . . . 1043.4.2 ∗ Consistency of parameter estimates from marginal

model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053.4.3 Variance of parameter estimates adjusted for

correlation structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1073.5 Population hazards from conditional models . . . . . . . . . . . . . . . . 111

3.5.1 Population versus conditional hazard from frailtymodels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

3.5.2 Population versus conditional hazard ratio from frailtymodels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

3.6 Further extensions and references . . . . . . . . . . . . . . . . . . . . . . . . . . 116

4 Frailty distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1174.1 General characteristics of frailty distributions . . . . . . . . . . . . . . . 118

4.1.1 Joint survival function and the Laplace transform . . . . . 1194.1.2 Population survival function and the copula . . . . . . . . . . 1204.1.3 Conditional frailty density changes over time . . . . . . . . . . 1224.1.4 Measures of dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

4.2 The gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1304.2.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . 1304.2.2 Joint and population survival function . . . . . . . . . . . . . . . 1314.2.3 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1344.2.4 Copula form representation . . . . . . . . . . . . . . . . . . . . . . . . . 1374.2.5 Dependence measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1384.2.6 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1414.2.7 ∗ Estimation of the cross ratio function: some theoretical

considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1474.3 The inverse Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . 150

4.3.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . 1504.3.2 Joint and population survival function . . . . . . . . . . . . . . . 1524.3.3 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1584.3.4 Copula form representation . . . . . . . . . . . . . . . . . . . . . . . . . 1584.3.5 Dependence measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1614.3.6 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Page 11: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Contents xiii

4.4 The positive stable distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 1644.4.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . 1644.4.2 Joint and population survival function . . . . . . . . . . . . . . . 1674.4.3 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1714.4.4 Copula form representation . . . . . . . . . . . . . . . . . . . . . . . . . 1714.4.5 Dependence measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1734.4.6 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

4.5 The power variance function distribution . . . . . . . . . . . . . . . . . . . 1774.5.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . 1774.5.2 Joint and population survival function . . . . . . . . . . . . . . . 1814.5.3 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1844.5.4 Copula form representation . . . . . . . . . . . . . . . . . . . . . . . . . 1854.5.5 Dependence measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1864.5.6 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

4.6 The compound Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . 1904.6.1 Definitions and basic properties . . . . . . . . . . . . . . . . . . . . . 1904.6.2 Joint and population survival functions . . . . . . . . . . . . . . 1924.6.3 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

4.7 The lognormal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1954.8 Further extensions and references . . . . . . . . . . . . . . . . . . . . . . . . . . 196

5 The semiparametric frailty model . . . . . . . . . . . . . . . . . . . . . . . . . . 1995.1 The EM algorithm approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

5.1.1 Description of the EM algorithm . . . . . . . . . . . . . . . . . . . . 1995.1.2 Expectation and maximisation for the gamma frailty

model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2005.1.3 Why the EM algorithm works for the gamma frailty

model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075.2 The penalised partial likelihood approach . . . . . . . . . . . . . . . . . . . 210

5.2.1 The penalised partial likelihood for the normalrandom effects density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

5.2.2 The penalised partial likelihood for the gamma frailtydistribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

5.2.3 Performance of the penalised partial likelihoodestimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

5.2.4 Robustness of the frailty distribution assumption . . . . . . 2285.3 Bayesian analysis for the semiparametric gamma frailty

model through Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2335.3.1 The frailty model with a gamma process prior for the

cumulative baseline hazard for grouped data . . . . . . . . . . 2345.3.2 The frailty model with a gamma process prior for the

cumulative baseline hazard for observed event times . . . 2395.3.3 The normal frailty model based on Poisson likelihood . . 2445.3.4 Sampling techniques used for semiparametric frailty

models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

Page 12: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

xiv Contents

5.3.5 Gibbs sampling, a special case of the Metropolis–Hastings algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

5.4 Further extensions and references . . . . . . . . . . . . . . . . . . . . . . . . . . 258

6 Multifrailty and multilevel models . . . . . . . . . . . . . . . . . . . . . . . . . 2596.1 Multifrailty models with one clustering level . . . . . . . . . . . . . . . . 260

6.1.1 Bayesian analysis based on Laplacian integration . . . . . . 2606.1.2 Frequentist approach using Laplacian integration . . . . . . 268

6.2 Multilevel frailty models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2776.2.1 Maximising the marginal likelihood with penalised

splines for the baseline hazard . . . . . . . . . . . . . . . . . . . . . . 2776.2.2 The Bayesian approach for multilevel frailty models

using Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2796.3 Further extensions and references . . . . . . . . . . . . . . . . . . . . . . . . . . 286

7 Extensions of the frailty model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2877.1 Censoring and truncation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2877.2 Correlated frailty models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2887.3 Joint modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2907.4 The accelerated failure time model . . . . . . . . . . . . . . . . . . . . . . . . . 292

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

Applications and Examples Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314

Page 13: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Glossary of Definitions and Notation

Definitions are given for the case of the shared frailty model with one cluster-ing level, with the first index i referring to the cluster and the second index jreferring to the jth observation within that cluster. For univariate frailty mod-els or models without clustering, the first index i refers to the ith observation.For hierarchical models with two clustering levels, the first index i refers tothe highest clustering level, the second index j to the second clustering leveland the third index k to the kth observation within cluster j which is nestedwithin cluster i.

The indexing for the conditional, joint and population density and survivalfunctions can be deduced from the corresponding hazard functions.

β the fixed effects vector (of dimension p)Γ (.) the gamma functionδij censoring indicator with δij = 1 if the event time

is observed, otherwise zeroζ (ξ, θ,β)ζ (t1, t2) the cross ratio functionθ the variance of a frailty density with mean one

(except in the case of the positive stable frailty distribution)ξ vector containing the parameters of the baseline hazard h0(t)τ Kendall’s tauΦ the acceleration factor

in the accelerated failure time modelφ (t1, t2) the ratio of the joint density function and the product

of the two population density functionsψ (t1, t2) the ratio of the joint survival function

and the product of the two population survival functions

xv

Page 14: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

xvi Glossary of Definitions and Notation

cij the censoring timeCθ(.) the copula function with parameter vector θd the total number of eventsdi the number of events in cluster ih0(t) the baseline hazard functionh0i(t) the baseline hazard function for cluster ihij(t) the conditional hazard functionhij,c(t) the conditional hazard function for observation j

from cluster i with ui = 1hi(tni

) the joint conditional hazard function for cluster ih�,f (tn) the joint hazard function for clusters of size n

with covariate information � based on the frailty modelh�,p (tn) the joint hazard function for clusters of size n

with covariate information � based on the marginal modelhx,f (t) the population hazard function for subjects

with covariate information x based on the frailty modelhx,p (t) the population hazard function for subjects

with covariate information x based on the marginal modelH0(t) the cumulative baseline hazardH the Hessian matrixI the Fisher information matrixI the observed information matrixI(condition) the indicator function equal to one if the condition is true,

zero otherwiselij the lowerbound of the interval that contains the event timel(.) the loglikelihood functionL(.) the likelihood functionLmarg(.) the marginal likelihood functionLpart(.) the partial likelihood functionL(.) the Laplace transformni the number of observations in cluster ip the dimension of βr the total number of distinct event timesrij the upperbound of the interval that contains the event timeR (yij) the risk set for observation yij

including all subjects at risk at time yij

Ri (yij) the risk set for observation yij

including only subjects from cluster i at risk at time yij

s the number of clustersS the vector of scorestij the event timeui the frailty term for cluster iwi the random effect for cluster iW (λ, ρ) the Weibull density with scale parameter λ

and shape parameter ρ

Page 15: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

Glossary of Definitions and Notation xvii

xij vector containing covariate informationxij (tij1) , . . . ,xij

(tijkij

)time-varying covariate vectors with kij timepoints

� (xt1, . . . ,x

tn)t

�i

(xt

i1, . . . ,xtini

)t

yij the minimum of the event time and censoring timey(1), . . . , y(r) the ordered event timeszij (yij , δij)

Page 16: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1

Introduction

1.1 Goals

Survival analysis is an important statistical field that is required for dataanalysis in different disciplines. There are a large number of textbooks onthe basic concepts of survival analysis, e.g., Collett (2003) and Machin et al.(2006). In recent years a number of papers have been published extending theclassical survival models to models that are suitable for the analysis of clus-tered survival data: frailty models. A wide variety of frailty models and severalnumerical techniques to fit these models have been studied in these papers.Only a limited number of textbooks have chapters devoted to frailty mod-els, all of them containing only some specific aspects on frailty models. Kleinand Moeschberger (1997) have a chapter on the use of the EM algorithm insemiparametric frailty models. Ibrahim et al. (2001) discuss both parametricand semiparametric frailty models but only in a Bayesian context. Hougaard(2000) discusses, among other techniques to model multivariate survival data,the frailty model mainly assuming a parametric model. Finally, Therneau andGrambsch (2000) discuss inference for the semiparametric frailty model usingpenalised partial likelihood.

The main objective of this book is to study frailty models and the nu-merical techniques, needed to fit such models, in a unified and detailed way.Both proportional hazards models and accelerated failure time models arediscussed; we consider parametric as well as semiparametric modelling. We fitmodels using both a frequentist and a Bayesian approach. Furthermore, dif-ferent density functions for the frailty term are studied as each of them puts adifferent dependence structure on the data. We also discuss hierarchical mod-els, i.e., models with more than one level of clustering. We do not attempt togive an exhaustive overview of the literature on frailty models as it is far tooambitious to do so; we rather prefer to explain the basic ideas and statisticaltechniques in detail. Some sections of the book are marked with an asteriskto indicate that they deal with more theoretical aspects. These sections canbe skipped without losing the flow of the book.

1

Page 17: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

2 1 Introduction

A very important further objective is to make the frailty model techniquesavailable to a wider audience. In spite of the amount of published material,the bottleneck is the lack of available public software. We will mainly use R, afreeware package, to fit the different models, and all the programs used are alsofreely available from the Springer Verlag website. Also software available inSTATA and SAS is used to fit frailty models. WinBUGS, a freeware package,is used to perform most of the Bayesian analyses (Spiegelhalter et al., 2003).

1.2 Outline

In this first chapter, we introduce different data sets with time to event out-comes that will be used throughout the book to demonstrate the frailty modelfitting techniques and the practical interpretation of the results. The introduc-tory chapter further describes and explains some basic and general concepts insurvival analysis to lay the foundation for the more advanced frailty models.

In Chapter 2, the simplest frailty model is introduced: the parametricmodel with clusters sharing the same gamma distributed frailty term. Thesimplicity is due to the fact that the frailties appearing in the conditionallikelihood can be integrated out in a simple way to obtain the marginal like-lihood that can then be maximised to obtain parameter estimates. Both thefrequentist and the Bayesian approach are considered.

In Chapter 3, we consider models that, although different from frailtymodels, take the clustering of the data into account: the fixed effects model,the stratified model, the copula model, and the marginal model with robustvariance estimators. We also demonstrate the relation between conditionaland marginal models: given the hazard function and the hazard ratio in theconditional model and given the frailty distribution, the hazard function andthe hazard ratio in the population can easily be obtained. These functions areimportant to understand how the population evolves over time.

In Chapter 4, different distributions for the frailty term are discussed andthe type of dependence that they induce on the event times in the clusteris studied. The gamma distribution is most often used in practice. It be-longs to the power variance function family proposed by Hougaard (1986a);other members of this family are also discussed: the inverse Gaussian and thepositive stable distribution. The compound Poisson distribution is a furtherextension of the power variance function family. Characteristics of these distri-butions follow easily from their corresponding Laplace transforms. In practicealso the lognormal distribution is often used for the frailty term. The use ofthe lognormal distribution mainly originates from work by McGilchrist andAisbett (1991). They look at survival data from a mixed models viewpointand therefore assume a normal distribution for the random effects which ap-pear in the model formulation of the loghazard function. A drawback of thelognormal frailty distribution is that the Laplace transform does not take a

Page 18: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 3

simple form and hence the dependence imposed by the lognormal distributionis difficult to evaluate.

In Chapter 5, the semiparametric model is studied. Different numericaltechniques have been proposed to fit semiparametric models, such as theEM algorithm (Klein, 1992), the penalised partial likelihood maximisation(Therneau et al., 2003), in a Bayesian context Laplacian integration (Ducrocqand Solkner, 1994) or the MCMC algorithm (Clayton, 1991). The differenttechniques are studied in detail and the specific features of the different ap-proaches will be compared.

Chapter 6 deals with frailty models with more than one frailty term. Thechapter has two parts. First we discuss the case of two frailty terms withinthe same cluster, which occurs for instance if the objective in a multicentretrial consists of both looking at the heterogeneity between centres but alsoat the heterogeneity of the treatment effect between centres (Legrand et al.,2005). In the second part, nested or hierarchical models are discussed, withone clustering level nested within another clustering level.

In Chapter 7, further extensions of frailty models are discussed: frailtymodels for data with different censoring and truncation characteristics, cor-related frailty models, joint modelling, and the accelerated failure time frailtymodel.

1.3 Examples

Different techniques to fit frailty models for clustered survival data are dis-cussed in this book. All these techniques are applied to real data sets. We madea selection of examples so that we can demonstrate in later chapters how spe-cific aspects of the data determine the appropriate inferential methodologyused for the statistical analysis. A few important aspects that we will men-tion at this stage are (i) the cluster size, e.g., bivariate clusters versus clusterswith large cluster sizes; (ii) the nesting or the hierarchy present in the designof the experiment; (iii) presence or absence of event ordering within a cluster,e.g., ordering is present in recurrent event data sets. Most of the examplesdo not have such ordering. If there exists an ordering, either in space or intime, we can still use techniques that do not take the ordering into account,although more specific models might be more relevant. We will therefore men-tion, whenever needed, that events in a cluster are ordered.

It is not our objective to give the complete analysis of the data sets; thedata sets are rather used to illustrate how the different inferential techniquescan be applied in practice.

We first consider data sets with one level of clustering. The simplest suchdata set consists of clusters of size one. Such a data set is given in Example1.1. The models fitted to these data are called univariate frailty models. Thesimplest extension of this type of data set is to enlarge the cluster from one unitto two units. This type of data, called bivariate survival data, has received

Page 19: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

4 1 Introduction

considerable attention. Typical examples include twin data (Wienke et al.,2003) and matched measurements on similar organs like eyes and kidneys(Mahe and Chevret, 1999a). We introduce two such data sets in Examples 1.2and 1.3.

A data set with clusters of size four is presented in Example 1.4. In thisexample, we study time to infection in the four quarters of the udder of a dairycow, the cow being the cluster. There is a certain ordering of the quarters, inthe sense that front or rear udder quarters might have different characteristics.

In all the previous examples the cluster sizes are balanced, i.e., each clusterhas the same number of units. However, many other examples exist where thecluster size differs from cluster to cluster and where it is often substantiallylarger than four. In fact this situation occurs most in practice. Four such datasets will be used. In Example 1.5 a breast cancer clinical trial data set isstudied, with the hospital being the cluster and the patient the sampling orobservational unit within the cluster. This data set has a restricted numberof large clusters. In Example 1.6 a different breast cancer clinical trial isdescribed. This particular breast cancer type, ductal carcinoma in situ, is arather rare cancer, so that compared to the previous example, this study ischaracterised by the presence of many more hospitals with each of them havingrather few patients. Examples 1.7 and 1.8 are based on studies in veterinaryepidemiology; typical for these two examples is that we have a large numberof rather small clusters. Example 1.9 is an asthma study: the patient is thecluster for which a varying number of event times are available. The eventtimes clustered within the patient are now ordered in time and models takingthis ordering into account will be the adequate ones.

In all the examples considered so far, one level of clustering is present.The clustering can be modelled by introducing a frailty term for each clusterinducing correlation between the units within the cluster. But for an appropri-ate analysis of the data it might be necessary to include more than one frailtyterm in the model, even for data sets having only one level of clustering. Wewill discuss such situation in Example 1.10. This data set is a collection ofseven bladder cancer clinical trials. The data set is used to evaluate prognos-tic indices for bladder cancer. Apart from introducing a frailty term for thecentre, an additional frailty term for the prognostic index is included for eachcluster to study heterogeneity of the prognostic index effect. Therefore, al-though there is only one level of clustering, there are two frailty terms withineach cluster. Finally, the data in Example 1.11 are hierarchical in the sensethat there is more than one clustering level, in this example a collection ofclusters is nested within a larger cluster.

The data sets could not be made publicly available, but data sets with ex-actly the same structure can be downloaded from the Springer Verlag website.The examples in the book are based on the real data sets.

Page 20: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 5

Example 1.1 East Coast Fever data set: East Coast Fever transmis-sion dynamics

Theileriosis or East Coast Fever (ECF) is a major cattle disease in east-ern and southern Africa. The disease is caused by Theileria parva which istransmitted by the ticks Rhipicephalus appendiculatus and the closely relatedRhipicephalus zambeziensis (Fandamu et al., 2005a). In order to study thetransmission of the disease in southern Zambia, cows are followed up frombirth until the time of first ECF contact in Nteme, a region in southern Zam-bia (Fandamu et al., 2005b). On a weekly basis, blood is collected and testedfor the presence of antibodies to Theileria parva using the Indirect Fluores-cent Antibody (IFA) test. After three consecutive positive IFA test results ananimal is considered seroconverted. The time to first ECF contact is defined asthe timespan from birth to one month before the first of the three consecutivepositive test results (i.e., we set the date of contact with Theileria parva onemonth before the time of the first positive test result). Animals are followedup until one year after birth; if they do not seroconvert by that time, theirtime of seroconversion is right-censored. The data for a few animals are givenin Table 1.1. We also consider the binary covariate breed.

Table 1.1. East Coast Fever data set. The first column contains the cow identifica-tion number, the second column gives the time (in days) to ECF contact, the thirdcolumn gives the censoring status taking value one (status=1) if seroconversion isobserved and zero (status=0) otherwise. The last two columns give the breed andthe month of birth.

Cowid Time to ECF contact Status Breed Month of birth

1 309 1 1 52 240 1 1 63 126 1 2 94 365 0 2 95 62 1 1 4

. . .232 365 0 1 3

Example 1.2 Diagnosis data set: Diagnosis of fracture healing

Medical imaging has become an important tool in the veterinary hospital toassess whether and when a fracture has healed. The standard technique in dogsis based on radiography (RX). Newer techniques based on ultrasound (US) arecheaper and do not require radioprotection. To investigate the performance ofUS for this purpose and to compare it to RX, Risselada et al. (2006) set up a

Page 21: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

6 1 Introduction

trial in which fracture healing is evaluated by both US and RX. In total, 106dogs, treated in the veterinary university hospital of Ghent, are included inthe trial and evaluated for time to fracture healing with the two techniques.Only 7 dogs are censored for time to fracture healing evaluated by RX; nocensoring occurs for time to fracture healing evaluated by US. The censoringis due to the fact that dog owners do not show up anymore. The data for afew dogs are given in Table 1.2.

Table 1.2. Diagnosis data set. The first column contains the dog identificationnumber, the second column gives the time (in days) to diagnosis, the third columngives the censoring status taking value one (status=1) if healing is observed andzero (status=0) otherwise. The last column gives the diagnostic technique.

Dogid Time to diagnosis Status Method

1 63 1 RX1 30 1 US2 83 1 RX2 83 1 US

. . .106 35 0 RX106 35 1 US

Example 1.3 Reconstitution data set: Reconstitution of blood–milkbarrier after mastitis

When an udder quarter of a cow is infected (mastitis), the blood–milk barrieris partially destroyed and particular ions can flow freely from blood to milkand vice versa, leading to higher concentrations of, for instance, the sodiumconcentration Na+.

The objective of this study is to demonstrate that the local applicationof a drug based on corticosteroids decreases the time to reconstitution of theblood–milk barrier in dairy cows. We therefore consider as outcome the timeuntil the Na+ concentration goes below a certain threshold (a concentrationbelow the threshold value is considered to be normal again). Each udder quar-ter is separated from the three other quarters so that a quarter can be used asexperimental unit to which a treatment is assigned. The Na+ concentrationin each of the experimental units is followed up. The rear udder quarters of100 cows are experimentally infected with Escherichia coli (Vangroenweghe etal., 2005). After nine hours, one of the two infected udder quarters is treatedlocally with the active compound whereas the other is treated with placebo.Cows are followed up for 6.5 days, and are censored at that point in time ifthe Na+ concentration is still above the threshold level. We further include

Page 22: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 7

parity in the study as covariate. The parity of a cow is the number of calvings(and therefore the number of lactation periods) that the cow has already ex-perienced. Parity is often converted into a binary covariate, grouping all thecows with more than one calving in the group of multiparous cows (heifer=0)compared to the group of primiparous cows or heifers, cows with only onecalving (heifer=1). The data for a few cows are given in Table 1.3.

Table 1.3. Reconstitution data set. The first column contains the cow identifica-tion number, the second column gives the time (in days) to reconstitution, the thirdcolumn gives the censoring status taking value one (status=1) if reconstitution isobserved and zero (status=0) otherwise. The last two columns give the drug ap-plication (active compound (drug=A) or placebo (drug=P)) and the heifer status(multiparous cow (heifer=0) or primiparous cow (heifer=1)).

Cowid Time to reconstitution Status Drug Heifer

1 6.50 0 A 11 6.50 0 P 12 6.50 0 A 02 0.93 1 P 03 1.90 1 A 13 0.41 1 P 1

. . .99 3.80 1 A 199 6.50 0 P 1100 4.93 1 A 1100 6.50 0 P 1

Example 1.4 Mastitis data set: Correlated infection times in fourcow udder quarters

Mastitis, the infection of the udder, is economically the most important diseasein the dairy sector of the western world. Mastitis can be caused by many or-ganisms, most of them bacteria, such as Escherichia coli, Streptococcus uberis,and Staphylococcus aureus. Since each udder quarter is separated from thethree other quarters, one quarter might be infected with the other quartersfree of infection. In an extensive study, 100 cows are followed up for infections.

The objective of this observational study is to estimate the incidence of thedifferent organisms causing mastitis in the dairy cattle population in Flanders.Also the correlation between the infection times of the four udder quarters of acow is an important parameter to take preventive measures against mastitis.With high correlation, a lot of attention should be given to the uninfectedudder quarters of a cow that has an infected quarter.

Page 23: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

8 1 Introduction

From each quarter, a milk sample is taken monthly and is screened forthe presence of different bacteria. We model the time to infection with anybacteria, with the cow being the cluster and the quarter the experimental unitwithin the cluster.

Observations can be right-censored if no infection occurs before the endof the lactation period, which is roughly 300–350 days but different for everycow, or if the cow is lost to follow-up during the study, for example dueto culling. Due to the periodic follow-up, udder quarters that experience anevent are interval-censored with lowerbound the time of the last milk samplewith a negative result and upperbound the time of the first milk sample witha positive result. In some examples, the midpoint (average of lowerboundand upperbound of the interval) is used for simplicity; in other examples theinterval-censored nature of the data is taken into account.

In the analysis, two types of covariates are considered. Cow level covari-ates take the same value for every udder quarter of the cow (e.g., number ofcalvings or parity). Several studies have shown that prevalence as well as in-cidence of intramammary infections increase with parity (Weller et al., 1992).Several hypotheses have been suggested to explain these findings, e.g., teatend condition deteriorates with increasing parity (Neijenhuis et al., 2001). Be-cause the teat end is a physical barrier that prevents organisms from invadingthe udder, impaired teat ends make the udder more vulnerable for intramam-mary infections. For simplicity, parity is dichotomised into primiparous cows(heifer=1) and multiparous cows (heifer=0).

Udder quarter level covariates change within the cow (e.g., position of theudder quarter, front or rear). The difference in teat end condition betweenfront and rear quarters has also been put forward to explain the difference ininfection status (Adkinson et al., 1993). In total, 317 out of 400 udder quartersare infected during the lactation period. A subset of the data is presented inTable 1.4.

Example 1.5 Periop data set: Perioperative breast cancer clinicaltrial

Cancer clinical trials are often international multicentre trials due to the factthat quite a few patients are needed to have sufficient power to demonstratea treatment effect. In survival analysis the power of a test used to study thetreatment effect depends on the number of events rather than the number ofpatients within the study when the logrank test is used (Freedman, 1982).Therefore, especially trials with a low hazard rate such as early breast cancertrials are often large trials of a few thousand patients. Although the maininterest is in showing a beneficial treatment effect, these data sets contain alot of extra relevant information that is often not used. Indeed, in spite ofthe fact that all the patients in the trial are treated according to the sameprotocol, there still remains quite a lot of variability due to the fact that thetrial is run in different cancer centres over the world. The scientific discipline

Page 24: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 9

that studies this type of heterogeneity is termed treatment outcome research(Legrand et al., 2002).

We investigate heterogeneity over centres for an early breast cancer clini-cal trial from the European Organisation for Research and Treatment of Can-cer (EORTC), a randomised phase III trial comparing perioperative (periop)chemotherapy versus no perioperative chemotherapy for early breast cancer:2795 patients are entered by 14 institutions (Legrand et al., 2005). The end-point used is disease-free survival, which corresponds to the time from ran-domisation to time to death or recurrence, whatever comes first. Patients thatare still at risk at the end of the study are censored at that time. Apart fromthe treatment effect, also the effect of prognostic factors at the level of thepatient is studied. As an example, we consider nodal status: cancer cells arefound already in the lymph nodes (nodal status=1) or not (nodal status=0).Additionally, the effect of factors at the institute level can be considered. Herewe study the effect of country. The data for a few patients from the first centreand the last centre are given in Table 1.5.

Table 1.4. Mastitis data set. The first column contains the cow identification num-ber, the second, third, and fourth columns contain the time (in days) to infection(the lowerbound, upperbound, and midpoint of the interval, resp.), the fifth columngives the censoring status taking value one (status=1) if infection is observed andzero (status=0) otherwise. The last two columns give the parity (multiparous cow(heifer=0) or primiparous cow (heifer=1)) and the udder quarter (LF=Left-Front,LR=Left-Rear, RF=Right-Front, RR=Right-Rear).

Time to infectionCowid Lower Upper Midpoint Status Heifer Quarter

1 261 297 279 0 1 LF1 261 297 279 1 1 LR1 48 78 63 1 1 RF1 141 165 153 1 1 RR2 303 330 316.5 0 0 LF2 303 330 316.5 0 0 LR2 303 330 316.5 0 0 RF2 303 330 316.5 0 0 RR

. . .99 39 72 55.5 1 1 LF99 189 228 208.5 1 1 LR99 129 156 142.5 1 1 RF99 72 129 100.5 1 1 RR100 60 93 76.5 1 1 LF100 60 93 76.5 1 1 LR100 60 93 76.5 1 1 RF100 60 93 76.5 1 1 RR

Page 25: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

10 1 Introduction

Table 1.5. Early breast cancer data set. The first column contains the patient iden-tification number, the second column gives the time (in days) to death or recurrence,the third column gives the censoring status taking value one (status=1) if death orrecurrence is observed and zero (status=0) otherwise. The fourth and fifth columnsgive the institute where the patient was treated and the country in which the insti-tute is located (country=F for France, ..., country=N for the Netherlands. The lasttwo columns give the treatment received (perioperative treatment (periop=1) or not(periop=0)) and the nodal status (cancer cells in the lymph nodes (nodal status=1)or not (nodal status=0)) of the patient.

Patid Time to death/ Status Institute Country Periop Nodalrecurrence status

1 230 1 1 F Y 12 3100 0 1 F Y 03 560 1 1 F N 1...

2793 1800 0 14 N N 12794 130 1 14 N N 02795 1900 0 14 N Y 0

Example 1.6 DCIS data set: Breast conserving therapy with or with-out radiotherapy for ductal carcinoma in situ

Ductal carcinoma in situ (DCIS) is a not frequently occurring type of breastcancer. The data set studied here is based on an EORTC randomised clinicaltrial (Julien et al., 2000) in which all patients underwent breast conservingtherapy but half of them were randomly assigned to radiotherapy and theother half to no further treatment. Due to the fact that this cancer typeis rather rare, a large number of institutes (equal to 46) recruit patients inthis study to obtain a sufficiently large number of patients. In total, 1010patients are included in the study. We consider time to local recurrence asmain endpoint and investigate the effect of radiotherapy. As this cancer is abenign form of breast cancer, only 155 events are observed after a medianfollow-up time of 5.75 years. The data for a few patients are given in Table1.6.

Example 1.7 Culling data set: Culling of dairy heifer cows

The time to culling is studied in heifers as a function of the somatic cell count(SCC) measured between 5 and 15 days (measurement day) after calving (DeVliegher et al., 2005). High somatic cell count (we use the logarithm of so-matic cell count as covariate) might be a surrogate marker for intramammaryinfections. Heifers which have intramammary infections or which are expectedto develop intramammary infections in the future are quite expensive to keep

Page 26: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 11

Table 1.6. Ductal carcinoma in situ (DCIS) data set. The first column contains thepatient identification number, the second column gives the time (in days) to localrecurrence, the third column gives the censoring status taking value one (status=1)if local recurrence is observed and zero (status=0) otherwise. The fourth columngives the institute where the patient is treated. The last column gives the treatmentreceived (radiotherapy (radiotherapy=1) or not (radiotherapy=0)).

Patid Time to local Status Institute Radiotherapyrecurrence

1 1996 1 1 02 1114 0 2 13 2535 0 2 0...

1000 428 1 46 11001 755 0 46 01002 771 0 46 01003 948 1 46 01004 1078 0 46 01005 1343 0 46 11006 1475 0 46 01007 1566 0 46 11008 1621 0 46 11009 3056 0 46 01010 893 0 46 1

due to the high costs for drugs and the loss in milk production. Cows arefollowed up for an entire lactation period (roughly 300–350 days) and if theyare still alive at the end of the lactation period they are censored at thattime. Cows are further clustered within herds (140 herds in total) and thisclustering needs to be taken into account as culling policy and also SCC inearly lactation might differ substantially between the herds. The data for afew heifer cows from the first and the last herd are given in Table 1.7.

Example 1.8 Insemination data set: Time to first insemination indairy heifer cows

In a dairy farm, the calving interval (the time between two calvings) should beoptimally between 12 and 13 months. One of the main factors determining thelength of the calving interval is the time from parturition to the time of firstinsemination. The objective of this study is to look for cow factors that mightpredict the time to first insemination, so that actions can be taken based onthese predictors. As no inseminations take place in the first 29 days after calv-ing, we subtract 29 days (and not 30 days as the first event would then havefirst insemination time zero) from the time to first insemination since at risk

Page 27: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

12 1 Introduction

Table 1.7. Culling data set. The first column contains the cow identification num-ber, the second column gives the time (in days) to culling, the third column gives thecensoring status taking value one (status=1) if the cow is culled and zero (status=0)otherwise. The fourth column gives the herd to which the cow belongs. The last twocolumns give the day on which the somatic cell count is assessed (between 5 and 15days after parturition) and the logarithm of the observed somatic cell count on thatday.

Cowid Time to death Status Herd Measurement day log(SCC)

1 230 1 1 5 6.22 300 0 1 7 5.33 300 0 1 13 4.1...

14244 158 0 140 9 4.414245 65 1 140 6 5.914246 88 1 140 12 8.0

time starts only then. Cows which are culled without being inseminated arecensored at their culling time. Furthermore, cows that are not yet inseminated300 days after calving are censored at that time. The time to first insemina-tion is studied in dairy cows as a function of two types of covariates. The firsttype of covariates is fixed over time. An example is the parity of the cow, cor-responding to the number of calvings the cow has had already. As we observeonly one lactation period for each cow in the study, it is indeed a constantcow characteristic within the time framework of the study. We dichotomiseparity into primiparous cows or heifers (only one calving (heifer=1)) and mul-tiparous cows (more than one calving (heifer=0)). Other covariates that areused in the analysis are the different milk constituents such as protein andureum concentration at parturition (Duchateau and Janssen, 2004; Duchateauet al., 2005). For a few dairy cows from the first and the last herd the datawith fixed covariates are given in Table 1.8.

The second type of covariates that is relevant in this study does changeover time. The protein and ureum concentrations are measured, during theexperiment, at a number of points in time. It might be more adequate to modelthe hazard at a particular time using the concentration at that particular pointin time. In order to accommodate for time-varying covariates, the risk timefor each cow is split into time intervals with a start and an end time and ineach such interval a constant value for the concentration. We therefore havefor each cow as many data lines as there are risk intervals. Using time-varyingcovariates the complete data information available for the first cow (cowid=1)in Table 1.8 is given in Table 1.9.

Page 28: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 13

Table 1.8. Insemination data set with fixed covariates. The first column containsthe cow identification number, the second column gives the time (in days) to first in-semination, the third column gives the censoring status taking value one (status=1)if the cow is inseminated and zero (status=0) otherwise. The fourth column givesthe herd to which the cow belongs. The fifth and sixth columns give the milk ureumand protein concentration (%) at the start of the lactation period. The seventhand eighth columns contain the parity (number of calvings) and heifer information(multiparous cow (heifer=0) or primiparous cow (heifer=1)).

Cowid Time Status Herd Ureum (%) Protein (%) Parity Heifer

1 69 1 1 3.12 3.22 6 02 201 0 1 6.60 3.02 5 03 22 1 1 3.97 2.74 5 04 24 0 1 4.04 3.00 4 0

. . .51 71 1 1 3.61 3.42 1 152 34 1 2 2.33 2.87 6 053 40 1 2 2.43 3.61 6 0. . .

10503 37 0 181 2.43 3.40 1 110504 27 1 181 2.44 3.11 1 110505 39 1 181 3.09 3.01 4 010506 61 1 181 2.64 2.91 1 110507 48 1 181 2.47 2.81 1 110508 45 1 181 2.56 3.02 1 110509 30 1 181 2.85 3.41 1 110510 55 1 181 2.95 3.13 1 110511 48 1 181 2.19 3.55 1 110512 78 1 181 3.53 3.52 1 110513 156 1 181 1.67 3.09 1 1

Example 1.9 Asthma data set: Recurrent asthma attacks in children

Asthma is occurring more and more frequently in very young children (be-tween 6 and 24 months). Therefore, a new application of an existing anti-allergic drug is administered to children who are at higher risk to developasthma in order to prevent it. A prevention trial is set up with such childrenrandomised to placebo or drug, and the asthma events that developed overtime are recorded in a diary. Typically, a patient has more than one asthmaevent. The different events are thus clustered within a patient and are orderedin time. This ordering can be taken into account in the model. Such data canbe presented in different formats, but in Table 1.10, we choose to use thecalendar time representation (see Duchateau et al. (2003) for a discussion onother data formats and a comparison of the different formats). In the calendartime representation, the time at risk for a particular event is the time fromthe end of the previous event (asthma attack) to the start of the next event

Page 29: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

14 1 Introduction

Table 1.9. Insemination data set with time-varying covariates. The first columncontains the cow identification number, the second and third columns give the startand the end (in days) of the interval, the fourth column gives the censoring statustaking value one (status=1) if the cow is inseminated at the end time of the intervaland zero (status=0) otherwise. The fifth column gives the herd to which the cowbelongs. The sixth and seventh columns give the milk ureum and protein concen-tration (%) at the begin time of the interval. The eighth column contains the heiferinformation (multiparous cow (heifer=0) or primiparous cow (heifer=1)).

Cowid Start End Status Herd Ureum (%) Protein (%) Parity

1 0 9 0 1 3.12 3.22 61 10 18 0 1 3.23 3.13 61 19 27 0 1 3.34 3.05 61 28 33 0 1 3.44 2.97 61 34 42 0 1 3.51 2.92 61 43 50 0 1 3.57 3.01 61 51 58 0 1 3.62 3.09 61 59 63 0 1 3.68 3.17 61 64 69 1 1 3.71 3.22 62 0 7 0 1 6.60 3.02 52 8 14 0 1 6.01 3.03 5

. . .

(start of the next asthma attack). In describing recurrent event data, we need asomewhat more complex data structure to keep track of the sequence of eventswithin a patient. A particular patient has different periods at risk during thetotal observation period which are separated either by an asthmatic eventthat lasts one or more days or by a period in which the patient was notunder observation. The start and end of each such risk period is required,together with the status indicator to denote whether the end of the risk periodcorresponds to an asthma attack or not. The data for a few patients are givenin Table 1.10.

Example 1.10 Bladder cancer data set: Prognostic index evaluationfor bladder cancer

Bladder cancer is a common urological malignancy and about 70–80% of allbladder cancers are superficial (stage Ta–T1). We consider a pooled databaseof seven trials conducted in this patient population by the EORTC Genito-Urinary Group. These trials are designed to investigate the use of prophylac-tic treatment following transurethral resection (TUR). A total of 2649 eligi-ble patients are included in these trials. However, our analysis is restrictedto the 2501 patients without missing information for the prognostic indexwe consider. These patients are recruited from 63 centres. Prognostic factorsin superficial bladder cancer have been the subject of numerous publications

Page 30: Statistics for Biology and HealthLinear, Logistic, Survival, and Repeated Measures Models Wu/Ma/Casella: Statistical Genetics of Quantitative Traits: Linkage, Map and QTL Zhang/Singer:

1.3 Examples 15

Table 1.10. Asthma data set. The first column contains the patient identificationnumber, the second and third columns give the start and the end (in days) of therisk period, the fourth column gives the censoring status taking value one (status=1)if the patient experiences an asthma attack at the end time of the risk period andzero (status=0) otherwise. The fifth column contains the drug information (placebo(drug=P) or drug (drug=D)).

Patid Start End Status Drug

1 0 168 1 P1 172 420 0 P2 0 56 1 D2 57 134 1 D2 148 325 0 D...

111 0 465 0 D

over the past years (Sylvester et al., 2006) with the objective of adapting thetreatment to the risk group in which a patient is classified. Allard et al. (1998)develop a prognostic index using the disease-free interval (DFI) as outcomevariable. The disease-free interval corresponds to the time from randomisationuntil the time of disease or death, whatever comes first. They consider thefollowing adverse tumour characteristics (ATC) present at initial resection:tumour multiplicity, tumour diameter > 3 cm, stage T1, and histological grade2 or 3. The prognostic model is developed based on a cohort of 382 patientswith primary Ta and T1 bladder cancer. They propose grouping the patientsinto four risk groups, each category being simply defined by the number ofATC: no ATC, 1 ATC, 2 ATC, 3–4 ATC. Along the same lines, we regrouppatients to define only two risk groups, considering patients without any ATCat initial resection as good prognosis patients and patients with at least oneATC as poor prognosis patients. This leads to about a 15%–85% distributionof patients over prognostic groups and could therefore be used to save one sixthof the patients from more aggressive, more toxic, or more expensive treatment.Our interest is mainly focused on the heterogeneity of this prognostic indexfrom centre to centre. The data for a few patients are given in Table 1.11.

Example 1.11 Infant mortality data set: Infant mortality in Ethiopia

A large birth cohort study is conducted in southwest Ethiopia including 46urban and 64 rural “kebeles” or villages, which further cluster into “woredas”or districts, with an estimated size of 300,200 persons (Asefa et al., 1996).The main focus is on studying possible causes of infant mortality in orderto develop strategies for intervention (Asefa et al., 2000). The cohort studycomprises in total 8162 newborns. From these 8162 newborns 856 infants diedin their first year.