
Epidemiologic Reviews
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 20, No. 1
Printed in U.S.A.

New Techniques for the Analysis of Cohort Studies

Duncan Thomas

INTRODUCTION

Cohort studies involve the key element of follow-up of individuals over time to study an outcome in relation to some earlier exposure factor or a fixed host characteristic (such as genotype). While the outcome under study could be the change in some continuous variable, we shall restrict this review to studies of disease incidence. Subjects can be randomly assigned to different exposures, as in a clinical or prevention trial, or the exposure histories of free-living individuals can be passively observed, as in most epidemiologic cohort studies. Although the issues of confounding and comparability are very different in randomized and observational studies, the basic analysis methods are similar (except, perhaps, for a greater emphasis on adjustment for covariates in observational studies) and the distinction will be ignored. Similarly, follow-up can be conducted "prospectively" or "retrospectively" in real time, but this too has no significance for methods of analysis.

We begin with a brief description of several cohort studies with different types of data structures and different analysis problems that will be used to illustrate the statistical issues. Following a review of the basic approaches to the analysis of the different types of cohort data, we focus on empirical and mechanistic approaches to model specification. Some special problems, such as measurement error, dependent outcomes, and the unique problems of reproductive data, are addressed. We conclude with a more in-depth treatment of approaches to the analysis based on cohort sampling methods—the nested case-control and case-cohort designs, and variants thereof.

Received for publication July 28, 1997, and accepted for publication May 14, 1998.

From the Department of Preventive Medicine, University of Southern California, Los Angeles, CA.

Reprint requests to Dr. Duncan Thomas, Department of Preventive Medicine, University of Southern California, 1540 Alcazar Street, Suite 220, Los Angeles, CA 90033-9987.

EXAMPLES OF COHORT STUDIES

Atomic bomb survivors

One of the largest cohort studies ever conducted is of the 120,128 survivors of the atomic bombing of Hiroshima and Nagasaki, Japan (1). The cohort comprises all those residents of the two cities at the time of bombing who survived to 1950. Passive follow-up has been conducted using the resources of the Japanese family registration (koseki) system and is planned to continue to the extinction of the cohort. Exposure was estimated for each survivor based on information obtained at entry about location at the time of bombing, combined with elaborate physical models for dose as a function of distance, position, and shielding. This study illustrates an important class of studies involving a single instantaneous exposure, as well as the analysis efficiencies that can result from the use of grouped data.

Diet cohorts

In contrast with the study of a single instantaneous exposure are several cohort studies involving relatively short-term follow-up of subjects in relation to their reported dietary habits at entry (2). Although several of these cohorts have now been followed for many years, with updated dietary information and substantial losses along the way, we use them here to illustrate analytical approaches where the exposure variables under study represent "usual" lifetime exposure (assumed constant over the period of follow-up) and where the prospective observation means subjects are all at risk for essentially the same period of time.

Uranium miners

Occupational cohort studies comprise a more complicated situation, typically involving extended and time-varying exposures and variable periods of time at risk. A good example is the US Public Health Service study of mortality in 3,347 uranium miners on the Colorado plateau (3-7). This study was initiated in the 1950s because of a concern about the risks of lung cancer from the high levels of radon and its daughter products. Exposure information was obtained from the mining companies' payroll records, combined with measurements, extrapolations, and "guesstimates" of radon levels in the mines over time.

Smoking information was also obtained at entry and updated several times. The primary endpoint is lung cancer mortality, with 329 cases having occurred by 1987.

A reproductive cohort

Reproductive endpoints raise several new issues relating to the nature of the endpoints. In addition to continuous outcomes, such as birthweight, there are two main types of binary endpoints—those manifest only at birth, such as congenital malformations, and those that can occur throughout the pregnancy, such as spontaneous abortions. The former is normally treated as a dichotomous endpoint, with all subjects having been at risk for the same period of time (except for variation in gestational age, which is usually treated as a confounder), whereas spontaneous abortions need to be treated as censored event-time data. However, the two endpoints are closely interrelated since fetuses with severe malformations are likely to be aborted. Furthermore, the time of entry into the cohort (recognition of a pregnancy, not conception) will generally vary and may be difficult to pinpoint. These problems are well illustrated by a cohort of 7,450 pregnancies in San Francisco Bay Area women enrolled in a health maintenance organization who were exposed to varying degrees of aerial spraying of the pesticide malathion during the early 1980s (8).

Family cohort

Genetic studies usually involve family data, often including extended pedigrees. The analysis of genetic segregation and linkage models is beyond the scope of this review (see, for example, Ott (9) and Khoury et al. (10)), but some unique issues arise in analyzing the effects of family history or measured genes within families. These issues are well illustrated by analyses of breast cancer in the families of the cases and controls from the Cancer and Steroid Hormone study (11-13). In addition to many other "environmental" factors, data on the history of cancer in first-degree relatives of cases and controls were obtained. The standard case-control analysis (11) simply combines all this information into a classification of family history as positive or negative (and subcategories of positive); no special problems of dependency arise in this analysis, since only the cases and controls themselves are included in the analysis and they are independent. Claus et al. (12), however, excluded the original cases and controls (since their outcomes were determined by design) and treated their family members as a cohort of subjects exposed since birth to the risk factor of having an affected or unaffected proband; although more informative, these analyses must then deal with any residual dependency in outcomes within families not accounted for by this risk factor. In a subsequent paper (13), such dependencies were addressed using segregation analysis to infer whether they could be explained in terms of a single major gene and/or polygenic effects.

BASIC ANALYTICAL APPROACHES

All these designs involve the collection of a set of data for each individual i = 1, ..., I comprising an event time or censoring time t_i, a censoring indicator d_i = 1 if the subject is affected, zero otherwise, and a vector of covariates Z_i (exposures, confounders, and modifiers), which can be time-dependent.

Risk models are used to describe the incidence rate λ(t, Z) for times-to-event (disease diagnosis or death) as a function of time t and covariates Z = (Z₁, ..., Z_p), which may themselves be time-dependent. A special class of risk models that has been widely used in epidemiology is known as "relative risk models," which are based on the proportional hazards assumption,

λ(t, Z) = λ₀(t) r[Z(t); β]    (1)

where β represents a vector of parameters to be estimated and λ₀(t) is an unknown set of age-specific "baseline" rates for subjects with Z = 0. In the standard proportional hazards model, the relative risk term takes the log-linear form r(Z; β) = exp(Z'β). This has the convenient property that it is positive for all possible covariate and parameter values, since the hazard rate itself must be non-negative. However, in particular applications, some alternative form of relative risk model may be more appropriate. Although time since entry to the study is commonly used as the time axis t in the analysis of clinical trial data, age is a more appropriate axis for most cohort studies (14). Other temporal factors, such as calendar date or time since exposure began, may also be relevant and can generally be handled either by treating them as covariates or by stratification. One might even consider some biologic time scale related to the underlying disease process; for example, Krailo et al. (15) fitted data on breast cancer using a model for the "breast tissue aging rate" proposed by Pike et al. (16) based on reproductive history. Note, however, that if any scale other than time since entry to the cohort is used, one must then deal with the problem of staggered entry times (e.g., age at first employment in the occupational example or gestational age at diagnosis of pregnancy in the reproductive example).

The process of specifying an analysis entails two distinct steps. First, one must choose a form of analysis appropriate to the particular data structure available. Second, one must specify a particular model for the relations amongst the variables. These two steps overlap in the sense that most any data structure can be fitted to most any model, although conventionally the two are often treated as linked in the sense that a particular data structure is often analyzed with a particular model. To avoid this narrow perspective, we have organized the following discussion by first discussing forms of analysis appropriate to the most common data structures and then focusing the rest of the review on models for disease incidence or mortality data.

Likelihoods and data structures

The appropriate likelihood depends on the sampling design and data structure. The key elements in determining the appropriate analysis are:

• whether the subjects are to be treated as individuals or grouped on the basis of their exposure histories in some way; for example, the basic analyses of the atomic bomb survivor cohort have all been based on grouped data (by dose, age, gender, city, and time) because the large size of the cohort essentially precludes extensive exploratory analyses on an individual basis;

• if grouped, whether a single exposure variable is of particular interest (the others being treated as confounders), or the joint effects of multiple variables are to be modeled; for example, malathion was the primary exposure of interest in the reproductive cohort, whereas unscrambling the effects of multiple dietary components is a key aim of the dietary cohort studies;

• whether subjects are considered to be at risk for an essentially constant period of time (as in a short-term cohort study or trial) with relatively little censoring, or the periods of observation vary considerably between individuals; for example, both the reproductive and dietary cohorts were followed for a relatively short period with few censoring events, whereas the two radiation cohorts entail lifetime follow-up;

• if observation time is extended, whether the covariates are constant or time-dependent (the atomic bomb survivor and uranium miner cohorts, respectively), and whether assumptions are to be made about the baseline risk as a function of time or age; and

• whether all subjects are to be included in the analysis or only a sample of them (we will illustrate cohort sampling options below using the uranium miner cohort).

These various questions will then influence the approach to the analysis. The most commonly used alternatives are briefly reviewed in the rest of this section. A more detailed treatment of the standard methods of analysis can be found in the standard text of Breslow and Day (17) and recent epidemiologic textbooks.

Standardized mortality ratio analysis. The simplest analysis of cohort data is a comparison of the numbers of observed events N with their corresponding expected numbers E, via the standardized mortality ratio, SMR = N/E. Expected numbers are usually estimated by multiplying a set of "standard" rates λ*_s by the person-time at risk T_s in strata s defined by age, gender, calendar time, and perhaps other factors, and summing over strata to produce E = Σ_s T_s λ*_s. External rates (e.g., national) are commonly used to determine whether the cohort rates are different, but if the primary interest concerns internal comparisons between subcohorts with different exposures, the rates for the entire cohort can be used as the standard. The method described above, known as "indirect standardization," is but one of several ways of standardizing for the stratifying factors, but it is the most commonly used method and the one that is most closely related to the multivariate methods to be discussed below. Breslow and Day (17) provide a comprehensive treatment of the alternative methods of standardization, as well as methods for significance testing, confidence interval estimation, comparison between subcohorts, and modeling.
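
[Editorial illustration, not from the original article.] A minimal Python sketch of the indirect standardization arithmetic, with wholly hypothetical person-years, standard rates, and event count:

    # Indirect standardization: SMR = N / E with E = sum over strata of T_s * lambda*_s.
    person_years = {("50-59", "M"): 1200.0, ("60-69", "M"): 800.0}   # T_s (hypothetical)
    std_rates    = {("50-59", "M"): 0.002,  ("60-69", "M"): 0.005}   # lambda*_s, events/person-year
    observed = 9                                                     # N, observed events

    expected = sum(person_years[s] * std_rates[s] for s in person_years)
    smr = observed / expected
    print(f"E = {expected:.2f}, SMR = {smr:.2f}")   # E = 6.40, SMR = 1.41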

Poisson regression. If there are several risk factors under study, it may be more revealing to model their joint effects than simply to describe the effect of each, adjusting for the others. For large datasets, it may be more convenient to analyze the data in grouped form using Poisson regression (17). This technique provides the natural multivariate generalization of the standardized mortality ratio method. For this purpose, the total person-time of follow-up is grouped into k = 1, ..., K categories on the basis of time and covariates, and the number of events N_k and person-time T_k in each category is recorded, together with the corresponding values of the person-time-weighted averages of age t_k and covariates z_k. The proportional hazards model now leads to a Poisson likelihood for the grouped data of the form

L = ∏_k Pr(N_k | E_k) = ∏_k E_k^{N_k} exp(−E_k) / N_k!    (2)

where E_k = λ_k T_k r(z_k; β) and λ_k = λ₀(t_k) denote a set of baseline hazard parameters that must be estimated together with β.
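
[Editorial illustration.] Grouped data of this kind can be fitted with any Poisson regression routine by entering the log person-time as an offset; the sketch below assumes the statsmodels package, with hypothetical counts and a single covariate, and a single intercept standing in for the stratum-specific baseline parameters λ_k:

    import numpy as np
    import statsmodels.api as sm

    # Grouped cohort data (hypothetical): events N_k, person-time T_k, covariate z_k.
    N = np.array([4, 12, 9, 20])
    T = np.array([1000.0, 2000.0, 800.0, 1200.0])
    z = np.array([0.0, 1.0, 2.0, 3.0])

    # log E_k = log T_k + intercept + beta * z_k; the offset carries log T_k.
    X = sm.add_constant(z)
    fit = sm.GLM(N, X, family=sm.families.Poisson(), offset=np.log(T)).fit()
    print(fit.params)   # [intercept (log baseline rate), beta]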

Logistic regression. For a clinical trial or cohort study with the same period of observation for all subjects, but where only the disease status, not the event time itself, is observed, a logistic model for the probability of an event of the form Pr(D = 0 | Z) = [1 + α r(Z; β)]⁻¹ might be used, where α is the odds of the event for a subject with Z = 0. Again, the standard form is obtained using r(Z; β) = exp(Z'β). The likelihood for this design would then be

L(α, β) = ∏_i Pr(D = d_i | Z = z_i; α, β) = ∏_i [α r(z_i; β)]^{d_i} / [1 + α r(z_i; β)]    (3)
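
[Editorial illustration.] Equation 3 is easy to program directly; a minimal sketch with hypothetical data, the loglinear form for r, and numerical maximization via scipy:

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical binary outcomes d_i and covariates z_i; loglinear r(z; beta).
    d = np.array([0, 1, 0, 1, 1, 0])
    z = np.array([0.1, 1.2, 0.3, 1.5, 0.9, 0.2])

    def negloglik(params):
        log_alpha, beta = params
        ar = np.exp(log_alpha + beta * z)               # alpha * r(z; beta)
        return -np.sum(d * np.log(ar) - np.log1p(ar))   # minus the log of equation 3

    fit = minimize(negloglik, x0=np.zeros(2))
    print(fit.x)   # [log alpha, beta]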

Survival analysis. In a clinical trial or cohort study in which the event times are observed, the proportional hazards model (equation 1) leads to a full likelihood of the form

L(β, λ₀) = ∏_i [λ₀(t_i) r(z_i(t_i); β)]^{d_i} exp(−∫_{s_i}^{t_i} λ₀(t) r(z_i(t); β) dt)    (4)

where s_i denotes the entry time of subject i. Use of the full likelihood requires specification of the form of the baseline hazard, for example, constant (exponential survival), step function, Weibull, or Gompertz (18). Cox (19) proposed instead a "partial likelihood" of the form

L(β) = ∏_n [ r(z_{i_n}(t_n); β) / Σ_{j∈R_n} r(z_j(t_n); β) ]    (5)

where n = 1, ..., N indexes the observed event times, i_n denotes the individual who fails at time t_n, and R_n denotes the set of subjects at risk at time t_n. (When using a time scale like age, this may raise the issue of staggered entry times, discussed earlier, where R_n includes only subjects who have entered the cohort by time t_n. Although some Cox regression programs do not explicitly allow for staggered entry times, it is often possible to deal with this by creating a time-dependent indicator for times when the subject is not in view and fixing its regression coefficient to a large negative value, thereby reducing their contributions at such times to zero.) This likelihood does not require any specification of the form of the baseline hazard; the estimation of β is said to be "semi-parametric," as the relative risk factor is still specified parametrically (e.g., the loglinear model in the standard form).
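
[Editorial illustration.] The partial likelihood, including the staggered-entry restriction on the risk sets, is straightforward to compute directly; a minimal sketch with hypothetical entry times, event times, and a single covariate:

    import numpy as np

    # Hypothetical cohort: entry s_i, event/censoring time t_i, status d_i, covariate z_i.
    s = np.array([0.0, 2.0, 0.0, 4.0])
    t = np.array([5.0, 8.0, 3.0, 9.0])
    d = np.array([1, 1, 0, 1])
    z = np.array([1.0, 0.0, 2.0, 1.0])

    def log_partial_likelihood(beta):
        ll = 0.0
        for n in np.where(d == 1)[0]:
            # Risk set R_n: entered before t_n and not yet failed or censored.
            at_risk = (s < t[n]) & (t >= t[n])
            ll += beta * z[n] - np.log(np.sum(np.exp(beta * z[at_risk])))
        return ll

    print(log_partial_likelihood(0.5))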

Nested case-control and case-cohort sampling. This partial likelihood can also be used to fit relative risk models for nested case-control studies within a cohort, where n now indexes the cases and R_n indicates the set comprising the nth case and his/her matched controls. This approach is discussed more fully below.

Models

Why model relative risks? Before proceeding further, it is worth pausing to inquire why one might wish to adopt the proportional hazards model at all. Certainly, there are examples where some other form of model provides a better description of the underlying biologic process. Although any risk model can be reparameterized in proportional hazards form, it may be that a more parsimonious model can be found using some alternative formulation, such as an excess risk model λ(t, Z) = λ₀(t) + Z'α. In this case, whether the proportional hazards or excess risk model provides a more parsimonious description of the data depends on which is more nearly constant over time (or requires the fewest time-dependent interaction effects).

The advantages of relative risk models are both mathematical and empirical. Mathematically, the proportional hazards model allows "semi-parametric" estimation of covariate effects via partial likelihood without requiring parametric assumptions about the form of the baseline hazard. Furthermore, the asymptotic distribution theory for estimating confidence regions and significance testing generally seems to apply at smaller sample sizes than for most alternative models. Empirically, it appears that many survival-time processes do indeed show rough proportionality of the hazard to time and covariate effects, at least with appropriate specification of the covariates. Evidence of this phenomenon for cancer incidence is reviewed in Breslow and Day (20, chapter 2): age-specific incidence rates from a variety of populations have more nearly constant ratios than differences. Such considerations have led to the view amongst most cancer epidemiologists that the relative risk model is the "right" one biologically for that endpoint, but this is not necessarily the case for other endpoints (14).

Some alternatives. Two alternative models that have received some attention are the excess risk model and the accelerated failure time model. The excess risk model λ(t, Z) = λ₀(t) + Z'α was often used in early work in the radiation field, in part because of the simplicity of the resulting risk assessment calculations. However, in addition to growing evidence that it did not fit radiation data as well as the relative risk model, it does not allow the types of semiparametric inference on exposure effects that are possible under the relative risk model, where no parametric assumptions about baseline risks are needed. Recent work (21, 22), however, provides quite a general framework for semiparametric inference in additive models that merits further consideration. In particular, the model allows the magnitude of the regression coefficients α to vary over time in an arbitrary fashion, so that one can get a visual feel for whether a constant excess risk model would be appropriate.

The accelerated failure time model is generally written in the form ln t = Z'γ + ε (for uncensored observations), where the residuals ε are assumed to have some common, but unspecified, distribution f(ε). This expression provides a natural interpretation of the regression coefficients γ in terms of the effects of covariates on the mean survival times. The same model can also be expressed in terms of the incidence rate as λ(t, Z) = λ₀(t e^{−Z'γ}) e^{−Z'γ}, where the baseline rate λ₀(t) is related to the distribution of residuals f(ε). The model is easily fitted to uncensored event times under parametric assumptions about the distribution of residuals and can be extended in a straightforward manner to censored event times (18, chapter 3). Its extension to semiparametric (rank) regression for censored data is more complex, but methods are now available (18, chapter 6; 23; 24). Alternatively, semiparametric regression models have recently been developed for median, rather than mean, survival times, median(t) = Z'γ + ε, where the error terms again have some common but unknown distribution (25).
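
[Editorial illustration.] A small numerical check of this time-scaling interpretation, with hypothetical parameter values: under the model, a covariate shift stretches the entire baseline time scale by the factor exp(Z'γ).

    import numpy as np

    rng = np.random.default_rng(1)
    gamma, z = 0.7, 1.0                       # hypothetical coefficient and covariate
    t0 = np.exp(rng.gumbel(size=100_000))     # baseline survival times (z = 0)
    tz = np.exp(z * gamma) * t0               # same residuals eps, shifted by z * gamma

    print(np.median(tz) / np.median(t0))      # = exp(0.7), about 2.01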

GENERAL MODELING ISSUES

For any of these likelihoods, it suffices to substitute some appropriate function for r(Z; β) and then use the standard methods of maximum likelihood to estimate its parameters and test hypotheses. In the remainder of this section, we discuss various approaches to specifying this function. The major distinction we make is between empiric and mechanistic approaches. Empiric models are not based on any particular biologic theory for the underlying disease process, but simply attempt to provide a parsimonious description of it, particularly to identify and quantify the effects of covariates that affect the relative risk. Perhaps the best-known empiric model is the loglinear model for relative risks, but other forms may be appropriate for testing particular hypotheses or for more parsimonious modeling in particular datasets, as discussed in the following section. With a small number of covariates, it may also be possible to model the relative risk nonparametrically. Mechanistic models, on the other hand, aim to describe the observed data in terms of some unobservable underlying disease process, such as the multistage theory of carcinogenesis. A more mathematical treatment of risk modeling can be found in Thomas (26).

Empiric models

The log-linear model, r(Z; β) = exp(Z'β), is probably the most widely used empiric model and is the standard form included in all statistical packages for logistic, Cox, and Poisson regression. As noted earlier, it is nonnegative and it produces a nonzero likelihood for all possible parameter values, which doubtless contributes to the observation that in most applications, parameter estimates are reasonably normally distributed, even with relatively sparse data. However, the model involves two key assumptions that merit testing in any particular application:

• For a continuous covariate Z, the relative risk depends exponentially on the value of Z; and

• For a pair of covariates, Z₁ and Z₂, the relative risk depends multiplicatively on the marginal risks from each covariate separately (i.e., r(Z; β) = r(Z₁; β₁) r(Z₂; β₂)).

Neither of these assumptions is relevant for a single categorical covariate. In other cases, the two assumptions can be tested by nesting the model in some more general model that includes the fitted model as a special case, for example, by adding covariate transformations or interaction terms to a model of the same form.

If these tests reveal significant lack of fit of the original model, one might still be satisfied with the expanded model as a reasonable description of the data, but one should then also consider the possibility that the data might be more parsimoniously described by some completely different form of model. For example, a negative quadratic term might suggest that a linear model be tried, and a negative interaction term might suggest an additive model. Thus, one might be led to a model of the form r(Z; β) = 1 + Z'β. In other circumstances, one might consider a linear-multiplicative or loglinear-additive model.

In a rich dataset, the number of possible alternative models can quickly get out of hand, so some structured approach to model building is needed. The key is to adopt a general class of models that would include all the alternatives one might be interested in as special cases, allowing specific submodels to be tested within nested alternatives. A general model that has achieved some popularity recently consists of a mixture of linear and loglinear terms of the form

r(Z, W; β, γ) = exp(W₀'γ₀) [1 + Σ_m Z_m'β_m exp(W_m'γ_m)]    (6)

where β_m and γ_m denote vectors of regression coefficients corresponding to the subsets of covariates Z_m and W_m included in the mth linear and loglinear terms, respectively. A special case that has been widely used in radiobiology (7, 27, 28) is of the form

r(Z, W; β, γ) = 1 + (β₁Z + β₂Z²) exp(−β₃Z + W'γ)

where Z represents radiation dose (believed from microdosimetry considerations to have a linear-quadratic effect on mutation rates at low doses, multiplied by a negative exponential survival term to account for cell killing at high doses) and W comprises modifiers of the slope of the dose-response relation, such as attained age, sex, latency, or age at exposure. For example, including the log of latency and its square in W allows for a lognormal dependence of excess relative risk on latency.
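
[Editorial illustration.] A sketch evaluating this radiobiologic form at a few doses; all parameter values and the latency modifier are hypothetical:

    import numpy as np

    def excess_rr(dose, w, beta, gamma):
        # 1 + (b1*Z + b2*Z^2) * exp(-b3*Z + W'gamma): linear-quadratic dose response
        # with cell-killing attenuation and loglinear effect modification.
        b1, b2, b3 = beta
        return 1.0 + (b1 * dose + b2 * dose**2) * np.exp(-b3 * dose + w @ gamma)

    # Hypothetical modifiers: W = (log latency, log latency squared).
    w = np.array([np.log(15.0), np.log(15.0) ** 2])
    gamma = np.array([0.8, -0.15])
    for dose in (0.1, 1.0, 5.0):
        print(dose, excess_rr(dose, w, beta=(0.5, 0.1, 0.2), gamma=gamma))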

Comparisons of alternative models that are nested within such a general class can be accomplished using standard likelihood ratio tests. Models that are of a fundamentally different form can always be nested within some more general class, such as the exponential mixture of linear-additive and loglinear-multiplicative models proposed by Thomas (29),

r(Z; β, θ) = (1 + Z'β)^{1−θ} exp(θ Z'β)    (7)

which produces the linear model when θ = 0 and the loglinear model when θ = 1. Several alternative mixtures have been proposed, of which the Guerrero-Johnson (30) mixture

r(Z; β, θ) = exp(Z'β) if θ = 0;  (1 + θ Z'β)^{1/θ} if θ ≠ 0    (8)

appears to have the most satisfactory statistical properties (26, 31, 32). However, a word of warning is needed concerning inference on the parameters of most nonstandard models. Their likelihood is generally far from normal (32), leading to highly skewed confidence regions and Wald tests that are seriously weakened (33, 34). Thus, inference should be based on the likelihood ratio test and likelihood-based confidence limits. For example, Lubin and Gaffey (35) describe an application of the exponential mixture of linear-additive and linear-multiplicative models (29) to testing the joint effect of radon and smoking on lung cancer risk in uranium miners; the point estimate of θ was 0.4, apparently closer to additivity than multiplicativity, but the likelihood ratio tests rejected the additive model (χ²₁ = 9.8) but not the multiplicative model (χ²₁ = 1.1). A linear mixture showed an even more skewed likelihood, with θ = 0.1 (apparently nearly additive) but with very similar likelihood ratio tests that rejected the additive but not the multiplicative model.
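
[Editorial illustration.] Both mixtures are simple to program, which makes their limiting cases easy to verify; a sketch with a hypothetical value of Z'β:

    import numpy as np

    def rr_exponential_mixture(zb, theta):
        # Equation 7: linear model at theta = 0, loglinear at theta = 1.
        return (1.0 + zb) ** (1.0 - theta) * np.exp(theta * zb)

    def rr_guerrero_johnson(zb, theta):
        # Equation 8: loglinear at theta = 0, linear at theta = 1.
        return np.exp(zb) if theta == 0.0 else (1.0 + theta * zb) ** (1.0 / theta)

    zb = 0.5   # hypothetical Z'beta
    print(rr_exponential_mixture(zb, 0.0), rr_exponential_mixture(zb, 1.0))   # 1.5, e^0.5
    print(rr_guerrero_johnson(zb, 0.0), rr_guerrero_johnson(zb, 1.0))         # e^0.5, 1.5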

Extended exposure histories. Chronic disease epidemiology often involves measurement of an entire history of exposure {X(u), u < t}, which we wish to incorporate into a relative risk model through one or more time-dependent covariates Z(t). How this is done depends upon one's assumptions about the underlying disease mechanism.

Most approaches to exposure-response modeling in epidemiology are based on an implicit assumption of dose additivity, i.e., that the excess relative risk at time t is a sum of independent contributions from each increment of exposure at earlier times u, possibly modified in some fashion by temporal factors. In a relative risk model, this hypothesis might be expressed generally as

r[t, X(·); β, α, γ] = R[∫₀ᵗ f[X(u); α] g(t, u; γ) du; β]

where R(Z; β) is some known relative risk function such as the linear or loglinear models discussed above, f is a known function describing the modifying effect of dose-rate, and g is a known function describing the modifying effect of temporal factors. For example, the choice f(X) = X and g(t, u) = 1 leads to the standard relative risk model based on cumulative exposure, probably the most widely used exposure index in epidemiology. For many diseases with long latency, such as cancer, it is common to use lagged cumulative exposure, corresponding to a weighting function of the form g(t, u; γ) = 1 if t − u > γ, zero otherwise. Other simple exposure indices might include latency-weighted exposure ∫₀ᵗ X(u)(t − u) du or age-weighted exposure ∫₀ᵗ X(u) u du. Similarly, the function f can be used to test dose-rate effects (the phenomenon that a long, low-intensity exposure has a different risk from a short, high-intensity exposure for the same cumulative dose). For example, letting f[X(u); α] = X(u)^α and R(Z; β, α) = 1 + βZ^{1/α} generates a family of exposure-response functions, ranging from a cumulative linear relation (for α = 1) to those which show conventional dose-rate (α > 1) and inverse dose-rate (α < 1) effects. Unfortunately, the additivity assumption has seldom been tested, although in principle this could be done by nesting the dose-additive model in some more general alternative. (See Thomas (26) for further details and a discussion of fitting methods.) For example, in the uranium miner data, we have tested this hypothesis by adding a covariate of the form ∫₀ᵗ ∫₀ᵛ X(u) X(v) f(u) g(v − u) h(t − v) du dv to the equation, but found no significant improvement in the fit for any of several simple choices of the weight functions f, g, or h, suggesting that the dose additivity assumption is reasonable for these data.
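
[Editorial illustration.] Lagged cumulative exposure can be computed from a piecewise-constant exposure history by summing the dose increments that satisfy the lag condition; a sketch with hypothetical data:

    import numpy as np

    def lagged_cumulative(exposure, ages, t, lag):
        # Z(t) = sum of X(u) over unit age intervals with t - u > lag
        # (piecewise-constant history; each interval contributes rate * 1 year).
        z = 0.0
        for x, u in zip(exposure, ages):
            if t - u > lag:
                z += x
        return z

    ages = np.arange(20, 40)          # exposure received from age 20 through 39
    exposure = np.full(20, 2.0)       # constant rate of 2 units/year (hypothetical)
    print(lagged_cumulative(exposure, ages, t=45, lag=10))   # ages 20-34 count -> 30.0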

Nonparametric models. The appeal of Cox's partial likelihood is that no assumptions are needed about the form of the dependence of risk on time, but it remains parametric in modeling covariate effects. Even more appealing would be a nonparametric model for both time and covariate effects. For categorical data, no parametric assumptions are needed, of course, although the effects of multiple covariates are commonly estimated using the loglinear (i.e., multiplicative) model, with additional interaction terms as needed. Similarly, continuous covariates are frequently categorized to provide a visual impression of the exposure-response relation, but the choice of cutpoints is arbitrary. However, nonparametric smoothing techniques are now available to allow covariate effects to be estimated without such arbitrary grouping.

One approach relies only on an assumption of monotonicity. Thomas (36) adapted the technique of isotonic regression to relative risk modeling, and showed that the maximum likelihood estimate of the exposure-response relation under this constraint was a step function with jumps at the observed covariate values of a subset of the cases. The technique has been extended to two dimensions (37), but in higher dimensions the resulting function is difficult to visualize and can be quite unstable.
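
[Editorial illustration.] As a stand-in for the likelihood-based isotonic fit described in the text, the sketch below applies weighted least-squares isotonic regression (assuming the scikit-learn package) to hypothetical grouped rates; the monotone solution pools adjacent categories that violate the ordering:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Hypothetical grouped relative rates by dose, weighted by person-time;
    # note the non-monotone dip at dose 1.0.
    dose = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
    rate = np.array([1.0, 1.4, 1.2, 2.5, 3.1])
    pt = np.array([500.0, 400.0, 300.0, 200.0, 100.0])

    iso = IsotonicRegression(increasing=True)
    print(iso.fit_transform(dose, rate, sample_weight=pt))
    # -> nondecreasing step function; the 0.5 and 1.0 categories pool to ~1.31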

Cubic splines and other means of smoothing provide attractive alternatives which produce smooth, but not necessarily monotonic, relations. The generalized additive model (38) has been widely used for this purpose. For example, Schwartz (39) described the effect of air pollution on daily mortality rates using a generalized additive model, after controlling for weather variables and other factors using similar models. A complex dependence on dew point temperature was found, with multiple maxima and minima, whereas the smoothed plot of the particulate air pollution effect was seen to be almost perfectly linear over the entire range of concentrations.

Mechanistic models

In contrast with the empiric models discussed above, there are circumstances where the underlying disease process is well enough understood to allow it to be characterized mathematically. Probably the greatest activity along these lines has been in the field of cancer epidemiology. Two models in particular have dominated this development, the multistage model of Armitage and Doll (40) and the two-event model of Moolgavkar and Knudson (41). For thorough reviews of this literature, see Whittemore and Keller (42), Moolgavkar (43), and Thomas (44); here, we merely sketch the basic ideas.

The Armitage-Doll multistage model postulates that cancer arises from a single cell that undergoes a sequence of k heritable changes, such as point mutations, chromosomal rearrangements, or deletions, in a particular sequence. The model further postulates that the rate of one or more of these changes may depend on exposure to carcinogens. Then the model predicts that the hazard rate for the incidence of cancer (or more precisely, the appearance of the first truly malignant cell) following continuous exposure at rate X is of the form

λ(t, X) = a t^{k−1} ∏_{i=1}^{k} (1 + β_i X)    (9)

Thus, the hazard has a power-function dependence on age and a polynomial dependence on exposure rate, with order equal to the number of dose-dependent stages. It further implies that two carcinogens would produce an additive effect if they act at the same stage and a multiplicative effect if they act at different stages. If exposure is instantaneous with intensity X(u) at age u, its effect is modified by the age at and time since exposure: if it acts at a single stage i, then the excess relative risk at time t is proportional to Z_ik(t) = X(u) u^{i−1} (t − u)^{k−i−1} / t^{k−1}, and for an extended exposure at varying dose rates, the excess relative risk is obtained by integrating this expression over u (45, 46). Analogous expressions are available for time-dependent exposures to multiple agents acting at multiple stages (47). Note, however, that the expressions given above are only approximations to the far more complex exact solution of the stochastic differential equations (48); the approximate expressions given above are valid only when the mutation rates are all small.
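
[Editorial illustration.] A sketch evaluating this excess relative risk expression (up to its proportionality constant) for a hypothetical 6-stage process shows how the contribution of an exposure at age 30 depends on the stage at which it acts:

    import numpy as np

    def excess_rr_single_stage(x, u, t, i, k):
        # Z_ik(t) = X(u) * u**(i-1) * (t-u)**(k-i-1) / t**(k-1), up to a constant.
        return x * u ** (i - 1) * (t - u) ** (k - i - 1) / t ** (k - 1)

    # Hypothetical: k = 6 stages, unit exposure at age 30, risk evaluated at age 60.
    for stage in (1, 3, 5):
        print(stage, excess_rr_single_stage(1.0, 30.0, 60.0, stage, 6))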

The Moolgavkar-Knudson two-stage model postulates that cancer results from a clone of cells from which one descendent has undergone two mutational events at rates μ₁[Z(t)] and μ₂[Z(t)], either or both of which may depend on exposure to carcinogens. The clone of intermediate cells is subject to a birth-and-death process with net proliferation rate ρ[Z(t)] that may also depend on carcinogenic exposures. The number of normal stem cells at risk N(t) varies with age, depending on the rate of development of the target tissue. Finally, in genetically susceptible individuals (carriers), all cells carry the first mutation at birth. An approximate expression for the resulting incidence rate at age t is then

λ(t, Z) = μ₂[Z(t)] ∫₀ᵗ μ₁[Z(u)] N(u) exp(∫ᵤᵗ ρ[Z(v)] dv) du    (noncarriers)

λ(t, Z) = μ₂[Z(t)] N(0) exp(∫₀ᵗ ρ[Z(v)] dv)    (carriers)    (10)

Again, note that this expression is only an approximate solution to the stochastic process (49), the validity of which depends upon all the rates being small.
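
[Editorial illustration.] With constant μ₁, μ₂, ρ and N(u) = N, the noncarrier hazard in equation 10 has the closed form μ₂ μ₁ N (e^{ρt} − 1)/ρ, which provides a quick check on any numerical implementation; the parameter values below are hypothetical:

    import numpy as np

    mu1, mu2, rho, N, t = 1e-7, 1e-6, 0.1, 1e7, 60.0   # hypothetical constants

    u = np.linspace(0.0, t, 200_000)
    integrand = mu1 * N * np.exp(rho * (t - u))        # inner integrand of eq. 10
    numeric = mu2 * np.sum(integrand) * (u[1] - u[0])  # simple Riemann sum
    closed = mu2 * mu1 * N * np.expm1(rho * t) / rho
    print(numeric, closed)                             # agree to several figures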

There have been a number of interesting applications of these models to various carcinogenic exposures. For example, the multistage model has been fitted to data on lung cancer in relation to asbestos and smoking (47), arsenic (50), coke oven emissions (51), and smoking (52, 53), as well as to data on leukemia and benzene (54) and nonleukemic cancers and radiation (55). The two-stage model has been fitted to data on lung cancer in relation to smoking (56), radon (57, 58), and cadmium (59), as well as to data on breast (60) and colon cancers (61). Few of these reports have provided any formal assessment of goodness of fit, focusing instead on comparisons between alternative models. This can be done, however, by grouping the subjects in various ways and comparing the numbers of observed and predicted cases; for example, Moolgavkar et al. (58) grouped uranium miners by the temporal sequence of their radon and smoking exposure histories and reported good agreement with the predictions of their two-stage model.

As in any other form of statistical modeling, the analyst should be cautious in interpretation. A good fit to a particular model does not, of course, establish the truth of the model. Instead, the value of models, whether descriptive or mechanistic, lies in their ability to organize a range of hypotheses into a systematic framework in which simpler models can be tested against more complex alternatives. The usefulness of the multistage model of carcinogenesis, for example, lies not in our belief that it is an accurate description of the process but, rather, in its ability to distinguish whether a carcinogen appears to act early or late in the process or at more than one stage. Similarly, the importance of the Moolgavkar-Knudson model lies in its ability to test whether a carcinogen acts as an "initiator" (i.e., on the mutation rates) or a "promoter" (i.e., on proliferation rates). Such inferences can be valuable, even if the model itself is an incomplete description of the process, as must always be the case.

Special problems

Measurement error. The above treatment has assumed either that the covariates Z are accurately measured or that the exposure-response relation that is sought refers to the measured values of the covariates, not to their true values. There is a large and growing literature on methods of adjustment of relative risk models for measurement error, which is beyond the scope of this review (62). However, some general observations are worth making:

• It is well known that the usual effect of measurement error is to bias the relative risk towards the null and weaken power. However, there are some important exceptions. First, in multivariate models with correlated exposures and possibly correlated errors, the bias is not necessarily towards the null; instead, there is a general tendency for the more precisely measured variables to absorb proportionally more of the effect of variables with which they are correlated, but the magnitude and direction of the effects depend upon the correlational structure. Second, one must carefully consider whether the errors are independent of the true values ("classical error"), the measured values ("Berkson error"), or neither, as the effect of measurement error will differ; in linear models, for example, Berkson error does not tend to produce any bias in relative risk estimates. Third, measurement error can distort the shape of an exposure-response relation in various ways, particularly for nonlinear models or error variances that are proportional to the true values.

• Many methods have been proposed for correcting for measurement errors; most involve some form of replacement of the measured values by estimates of the corresponding true values. For example, if validation data are available, one might use them to build a model for true given measured values, and then use this model to impute "true" values for subjects in the main study; adjustments to the standard errors of the relative risks are needed to allow for the uncertainty in this imputation (63, 64). If only a summary estimate of the variance of the error distribution is available, a Bayesian estimate of the expectation of true given measured values can be used instead. (This method has been applied to the analysis of the atomic bomb survivor data, to show that the slope of the dose-response relation may have been underestimated by about 15 percent if the dose errors were lognormally distributed with a coefficient of variation of 35 percent (65).) These methods are considerably simpler than the full likelihood methods, which entail integration of the likelihood over the unobserved true exposure variables, but can be seen as approximations to these more sophisticated methods which will be valid if the error variances are not too large.

• Monte Carlo methods can be very useful for more complex problems where likelihood methods are intractable and these approximate methods are dubious. Essentially, one would randomly impute values for the true exposures of each subject, given their measured values and all other relevant factors, and then analyze the resulting data to obtain a point estimate of the relative risk; this process is then repeated many times to build up an entire distribution of risk estimates, which incorporates the uncertainty in the various imputations. This approach is currently being applied to the data on the Colorado plateau uranium miners. (A toy version of the repeated-imputation machinery is sketched below.)
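
[Editorial illustration.] A toy version of this scheme, with a hypothetical classical lognormal error model whose variance is assumed known, so that the conditional draws of true given measured dose reduce to simple shrinkage draws; everything here is illustrative, not the analysis used for the miner data:

    import numpy as np

    rng = np.random.default_rng(7)

    n, sigma0, sigma_e = 2000, 0.5, 0.35                   # hypothetical prior/error sd
    true_x = rng.lognormal(0.0, sigma0, n)
    measured = true_x * rng.lognormal(0.0, sigma_e, n)     # classical multiplicative error
    events = rng.poisson(0.01 * (1.0 + 0.5 * true_x))      # rate = 0.01 * (1 + 0.5 X)

    w = sigma0**2 / (sigma0**2 + sigma_e**2)               # shrinkage toward prior mean
    post_sd = np.sqrt(w) * sigma_e                         # posterior sd of log true dose
    betas = []
    for _ in range(100):
        log_imp = w * np.log(measured) + rng.normal(0.0, post_sd, n)
        slope, intercept = np.polyfit(np.exp(log_imp), events, 1)
        betas.append(slope / intercept)                    # beta in r = 1 + beta * X
    print(np.mean(betas), np.std(betas))                   # point estimate and imputation spread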

Modeling baseline risks. If one adopts a parametric assumption for the baseline risk function λ₀(t) (for example, a simple step function dependence on age t and perhaps a small number of additional stratification variables), then estimation of the parameters of this function together with the regression coefficients involves no unusual complexities. In the semiparametric approach of Cox, however, the estimated baseline hazard rate is discrete, involving infinite "spikes" at each of the observed event times, zero elsewhere. The cumulative baseline hazard remains finite, however, and provides a natural extension of the now familiar Kaplan-Meier (66) survival curve to models involving covariates. (See Langholz and Borgan (67) for a discussion of estimation of baseline hazards in excess risk models.)

Reproductive outcomes. As noted earlier, the analysis of reproductive outcomes entails two types of endpoints, those manifest only at birth and those that can occur throughout the pregnancy. If one ignores their interdependency, then the former can be analyzed in a straightforward manner by comparing risks (with fetus, not fetus-time, denominators) between exposure groups or using unconditional logistic regression for multivariate analysis, assuming all fetuses have been at risk for essentially the same duration. Gestational age is a common risk factor for many malformations, but this is more appropriately handled as a covariate than as a time scale in survival analysis, since the true time at which the malformation developed is unobserved. Exposures, however, are likely to be time-dependent, and it is important to examine such exposures during the critical periods of organogenesis for each malformation type. For example, in the malathion study, limb and orofacial malformations were found to be more strongly associated with first trimester exposures, whereas gastrointestinal anomalies were more associated with second trimester exposures; the latter observation is plausible for the seven pyloric stenosis cases in that group, but not for the four tracheoesophageal fistulas.

Spontaneous abortion data require survival analysis techniques, since the set of fetuses to be used for comparison will vary over time because of the variable times of recognition of pregnancy and because of the elimination of earlier spontaneous or induced abortions. Since the malathion study was conducted within a health maintenance organization, the entry time to the cohort could be easily defined in terms of the date of the pregnancy confirmation visit. This was important, since it is possible that any causal effect of malathion exposure might be strongest for very early spontaneous abortions, which would never have been observed by this study; if gestational age at pregnancy diagnosis was also associated with malathion exposure (e.g., through socioeconomic correlates), then an analysis which included the fetus-time prior to pregnancy diagnosis would have produced biased estimates of the relative risk. Spontaneous abortions also need to be treated as event-time data with time-dependent covariates, since it would be inappropriate to compare the exposures of late abortions with those of fetuses that had aborted earlier and did not have the same opportunity for exposure. Although a crude comparison suggested that the spontaneous abortion group tended to be less exposed than the live births, this difference disappeared when the data were properly analyzed, allowing for the shorter opportunity for exposure among the spontaneous abortion group.

Conceptually, the close relation between the processes leading to spontaneous abortions and congenital malformations cries out for a joint analysis of the two endpoints, but statistical methods remain undeveloped in this area. Such an analysis would be considerably strengthened if data could be obtained on the characteristics of aborted fetuses. Such data are not routinely available, but have been obtained in special studies.

Dependent outcomes. Dependent outcomes can arise in various ways. The endpoint may be a recurrent event (accidents or heart attacks, for example). Dependency can arise either because the occurrence of the first event alters the risk of subsequent events or because individuals differ in their underlying risks (e.g., "accident proneness"). Next, there may be several correlated endpoints under study: for mortality data, only the first of the possible competing risks is observed, so such dependency cannot be studied; for incidence data, however, it may be desirable to consider related events (e.g., multiple congenital anomalies) jointly, although the usual practice is to restrict the analysis to the first event. Finally, there may be correlations between the outcomes of different individuals. This most commonly arises in the context of family studies, due to shared genetic or unmeasured environmental factors. Several approaches to this problem have been considered. Setting aside the more specialized genetic models, which are beyond the scope of this review, the three most commonly used alternatives are regressive models, latent variable models, and marginal models; all three are generally applicable to any form of dependency, but we limit our discussion to the case of family data.

Regressive models are based on an ordering of the subjects within a family in some natural order, such as parents (f, m) before offspring (1, ..., s) and older sibs before younger, and postulate a direct dependence of the outcomes of the later members on the outcomes of the earlier (68, 69). For example, for binary outcomes, one might add, to a logistic regression model describing the dependence of each subject's outcome on his or her own risk factors, additional covariates for the outcomes of their spouse, parents, and older sibs:

logit Pr(d_f = 1 | Z_f) = α + Z_f'β

logit Pr(d_m = 1 | Z_m, d_f) = α + Z_m'β + γ_sp d*_f

logit Pr(d₁ = 1 | Z₁, d_f, d_m) = α + Z₁'β + γ_po (d*_f + d*_m)

logit Pr(d₂ = 1 | Z₂, d_f, d_m, d₁) = α + Z₂'β + γ_po (d*_f + d*_m) + γ_sib d*₁

...

logit Pr(d_s = 1 | Z_s, d_f, d_m, d₁, ..., d_{s−1}) = α + Z_s'β + γ_po (d*_f + d*_m) + γ_sib (d*₁ + ... + d*_{s−1})

where d*_i = d_i − Pr(d_i = 1 | Z_i, d₁, ..., d_{i−1}), or zero if d_i is unknown.

The latent variables approach assumes that dependencies arise because members of a family share one or more unobservable risk factors. In the context of survival data, such a factor has come to be called "frailty." The simplest frailty model assumes that all members of the family share a common frailty which has a gamma distribution and acts as a constant relative risk. Methods of fitting frailty models have been described by Clayton (70) and others; recent work is aimed at relaxing the assumption of a particular parametric distribution for the frailties (71) and allowing for more complex models of sharing individual frailties within families (72, 73).
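
[Editorial illustration.] A small simulation, with hypothetical parameter values, shows the mechanism: a shared gamma frailty with mean 1 induces positive dependence among event times within a family:

    import numpy as np

    rng = np.random.default_rng(3)

    # Each family carries an unobserved multiplicative frailty W, E[W] = 1,
    # Var[W] = var_w; members' event times are exponential with rate W * lambda0.
    n_fam, size, lambda0, var_w = 20_000, 3, 0.02, 0.25
    w = rng.gamma(shape=1.0 / var_w, scale=var_w, size=n_fam)
    t = rng.exponential(1.0 / (w[:, None] * lambda0), size=(n_fam, size))

    # Shared frailty makes event times of two members of the same family correlate:
    print(np.corrcoef(t[:, 0], t[:, 1])[0, 1])   # about 0.25 for this frailty variance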

The marginal models approach treats the outcomes of all the members of a family as a vector of observations with some simple covariance structure. By using generalized estimating equations methods (74), estimates of the parameters of the relative risk model for the measured risk factors can be obtained which are robust to misspecification of the covariance structure (75, 76). Using higher moments, it is also possible to obtain robust estimates of the parameters in the covariance structure as well, which can be of interest for testing hypotheses about residual familial risks not explained by the measured factors (77).
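
[Editorial illustration.] A minimal sketch of such an analysis, assuming the GEE implementation in the statsmodels package and simulated family data with a hypothetical shared family effect:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(11)

    # Simulated binary outcomes clustered by family (all parameters hypothetical).
    n_fam, size = 300, 4
    fam = np.repeat(np.arange(n_fam), size)
    z = rng.normal(size=n_fam * size)
    u = np.repeat(rng.normal(scale=0.8, size=n_fam), size)   # shared family effect
    p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.6 * z + u)))
    y = rng.binomial(1, p)

    X = sm.add_constant(z)
    gee = sm.GEE(y, X, groups=fam, family=sm.families.Binomial(),
                 cov_struct=sm.cov_struct.Exchangeable()).fit()
    print(gee.params, gee.bse)   # robust SEs account for the familial clustering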

COHORT SAMPLING REVISITED

For most chronic diseases, the number of events expected during the period of observation is small in relation to the size of the cohort. Thus, most of the study resources, both in terms of data collection and, perhaps, biologic sample collection as well as data analysis, would normally be devoted to subjects who will have relatively little influence on the final results. For this reason, Liddell et al. (78) introduced the nested case-control design, and, subsequently, Prentice (79) introduced the case-cohort design. Both designs involve comparison of the cases in the cohort with controls sampled in different ways from within the cohort, thus requiring risk factor information to be available only on the cases and the selected controls. Although the entire cohort must still be followed to identify the cases, the burden of data collection and analysis is considerably reduced.

The nested case-control design entails matched selection of controls from the "risk sets" for each case, comprising those who are at risk and disease free at the time the case occurred. The analysis of the nested case-control design uses standard conditional logistic regression methods that are identical to those used for any matched case-control study (see Borgan et al. (80) for the relevant statistical theory). The case-cohort design entails selection of a single unmatched control sample at random from the entire cohort at entry, and uses a form of Cox regression to compare each case with the subset of controls who are still at risk at the time that case occurred. The analysis of the case-cohort design is more complex, owing to the dependency between the contributions from each case-subcohort comparison. There are practical and statistical issues in choosing between the two designs (81, 82): for example, the case-cohort design is more convenient for studying multiple diseases, because the same control group can be used for each one, but for long-term cohort studies, the case-cohort design may leave few controls at risk for the later cases, and subtleties arise in its application to studies with variable entry times. Generally speaking, however, the differences in statistical efficiency between the two designs are modest.
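
[Editorial illustration.] Risk-set sampling itself is simple to implement; the sketch below draws m controls per case from a hypothetical cohort, treating a subject as at risk if he or she has entered and is still under follow-up at the case's failure time:

    import numpy as np

    rng = np.random.default_rng(5)

    n, m = 1000, 2
    entry = rng.uniform(0.0, 5.0, n)
    exit_ = entry + rng.exponential(10.0, n)
    is_case = rng.random(n) < 0.05            # flag ~5% as cases, for illustration

    matched_sets = []
    for i in np.where(is_case)[0]:
        t = exit_[i]
        # Risk set: entered before t, still under follow-up at t, excluding the case.
        risk_set = np.where((entry < t) & (exit_ >= t) & (np.arange(n) != i))[0]
        if len(risk_set) >= m:
            controls = rng.choice(risk_set, size=m, replace=False)
            matched_sets.append((i, tuple(controls)))
    print(len(matched_sets), matched_sets[0])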

Much greater efficiency gains are possible, however, if one exploits information on exposure (or surrogates thereof) that is readily available for the entire cohort. The original two-stage designs were developed for population-based case-control studies (83, 84), but they are equally applicable to case-control studies nested within cohorts, where the cohort essentially plays the role of the first-stage sample. The basic idea of the unmatched two-stage case-control design is to select different sampling fractions for the two-way classification of subjects defined by case-control status and the surrogate exposure variable, and then to assess the exposure variable of primary interest only in this subgroup. The known sampling fractions are then used in the analysis to obtain unbiased estimates of the relative risk for the primary exposure variable. By appropriate selection of the sampling fractions, considerable efficiency gains (per subject included in the second stage) are possible relative to simple random sampling of cases and controls.
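A minimal numerical sketch of the two-stage weighting argument follows, with invented sampling fractions and effect sizes. Case status and an error-prone surrogate are known for the whole cohort; the "true" exposure is measured only in the second-stage sample, and each sampled subject's log-likelihood contribution is weighted by the inverse of its known sampling fraction (the variance treatment is deliberately omitted here):

```python
# A sketch of the unmatched two-stage design with cell-specific sampling
# fractions and inverse-probability-weighted estimation; all numbers invented.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 200_000
expo = rng.binomial(1, 0.2, n)                # true (expensive) exposure
surr = np.where(rng.random(n) < 0.8, expo,    # cheap surrogate, 80% concordant
                rng.binomial(1, 0.2, n))
p = 1.0 / (1.0 + np.exp(-(-4.0 + np.log(3.0) * expo)))
case = rng.binomial(1, p)

# Stage 2: oversample the informative (case, surrogate) cells
frac = {(0, 0): 0.01, (0, 1): 0.05, (1, 0): 0.5, (1, 1): 0.5}
f = np.array([frac[(c, s)] for c, s in zip(case, surr)])
stage2 = rng.random(n) < f
y, z, w = case[stage2], expo[stage2], 1.0 / f[stage2]

def neg_weighted_loglik(params):
    a, b = params
    eta = a + b * z
    # inverse-probability-weighted Bernoulli log likelihood
    return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))

fit = minimize(neg_weighted_loglik, x0=[0.0, 0.0])
print(f"weighted log odds ratio: {fit.x[1]:.2f} (truth {np.log(3.0):.2f})")
```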

This basic idea would, in principle, be applicable to the case-cohort design, although the statistical theory has not yet been developed. However, a variant of this design, known as "counter-matching," has been developed for the nested case-control design (85). The basic idea is to select a matched control for each case drawn from the subset of the risk set that is discordant for the surrogate exposure, and to incorporate the corresponding sampling fractions into the usual conditional likelihood for matched case-control designs. For example, if the surrogate exposure variable were dichotomous, each exposed case would be matched with an unexposed control from the case's risk set, and vice versa. This approach ensures a high degree of variability in the primary exposure variable within matched sets, thereby producing great efficiency for the main effect of exposure and its interactions with other variables, but generally at the cost of some loss of efficiency for estimating the effects of confounders (86, 87).
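The following sketch illustrates 1:1 counter-matching on a binary surrogate in a cohort simulated as above (again with invented parameters). One subject is sampled per surrogate stratum, so each enters the conditional likelihood with a weight equal to the number at risk in its stratum, carried as an offset on the log scale:

```python
# A sketch of 1:1 counter-matching: the control is drawn from the stratum of
# the risk set discordant with the case on the surrogate, and each sampled
# subject is weighted by the number at risk in its stratum (one sampled per
# stratum). Cohort and parameters are invented for illustration.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
n, beta_true = 5000, np.log(2.0)
x = rng.normal(0.0, 1.0, n)                   # exposure of interest
surr = (x + rng.normal(0.0, 1.0, n)) > 0      # cheap binary surrogate
T = rng.exponential(1.0 / (0.001 * np.exp(beta_true * x)))
C = rng.uniform(0, 400, n)
time, event = np.minimum(T, C), T <= C

terms = []  # per case: covariates and log weights of (case, control)
for i in np.flatnonzero(event):
    at_risk = time >= time[i]
    discordant = np.flatnonzero(at_risk & (surr != surr[i]) & (np.arange(n) != i))
    if len(discordant) == 0:
        continue
    j = rng.choice(discordant)
    n_case = np.sum(at_risk & (surr == surr[i]))  # at risk in case's stratum
    n_ctrl = np.sum(at_risk & (surr == surr[j]))  # at risk in control's stratum
    terms.append((np.array([x[i], x[j]]), np.log([n_case, n_ctrl])))

def neg_loglik(beta):
    total = 0.0
    for xv, logw in terms:
        eta = beta * xv + logw   # the log weight enters as an offset
        total += eta[0] - np.log(np.exp(eta).sum())
    return -total

fit = minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded")
print(f"counter-matched estimate: {fit.x:.2f} (truth {beta_true:.2f})")
```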

In the uranium miner cohort, a 1:1 nested case-control design counter-matched on radon produced an estimate of the standard error of the radon effect that was only 27 percent larger than that from the analysis of the full cohort, compared with 82 percent larger for the standard 1:1 matched case-control study. A 1:3 counter-matched study was nearly fully efficient (18 percent larger standard error than from the cohort), compared with 45 percent larger for the standard 1:3 matched study. In contrast, the standard errors for the estimate of the smoking effect were very similar for both designs (88). In an application of this method to a cohort study of gold miners, Steenland and Deddens (89) found that a 1:3 counter-matched case-control study provided efficiency approximately equivalent to that of a standard nested case-control study with 10 controls per case.
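To place these figures on an efficiency scale, note that for an approximately unbiased estimator the efficiency of a design relative to the full cohort is the inverse squared ratio of standard errors: 1:1 counter-matching thus retained roughly (1/1.27)² ≈ 62 percent of the full-cohort information on the radon effect, versus (1/1.82)² ≈ 30 percent for standard 1:1 matching, and the corresponding figures for the 1:3 designs are about 72 percent and 48 percent.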

These analyses provide an idea of the potential efficiency gains that are possible through efficient selection of controls in nested case-control studies, although in the miner study the basic data had already been collected. In other studies, such as the ongoing International Nuclear Worker Study (90), in which a summary dose estimate is available for all nuclear workers but extensive efforts are underway to characterize the exposure measurement errors over time, the cost savings from having to undertake this characterization only for a small sample of highly informative cohort members could be very substantial indeed.

CONCLUSIONS

Although the basic statistical methods for cohort studies have been well established for many years, new methods are continuing to be developed. The proportional hazards model has proven to be a very useful framework for unifying approaches to the analysis of time-to-event data for individuals as well as for grouped count and person-year data. However, particular applications may call for imagination in the development of relative risk models that are biologically appropriate and fit the available data. Although most routine analyses proceed with empirical model building techniques using standard relative risk regression models, models more strongly grounded in biologic theory would be appropriate where there are strong effects, the data are of high quality, and there is a solid biologic theory. Beyond the proportional hazards model, future work may well be usefully directed toward such alternatives as the excess risk and accelerated failure time models. Methods for dealing with exposure measurement error and dependent outcomes have recently become a very active area of statistical research, but applications are in their infancy. Perhaps the most important methodological development in recent years, however, has been in the area of cohort sampling methods, where considerable cost savings are possible through efficient study design. A close interaction between epidemiologists and statisticians will be needed to fully realize the potential of these new methodological developments.

ACKNOWLEDGMENTS

This work was supported by grants CA42949 and CA52862 from the National Institutes of Health.

REFERENCES

1. Shimizu Y, Kato H, Schull WJ. Studies of the mortality of A-bomb survivors. 9. Mortality, 1950-1985: Part 2. Cancer mortality based on the recently revised doses (DS86). Radiat Res 1990;121:120-41.
2. Willett W. Nutritional epidemiology. New York, NY: Oxford University Press, 1990.
3. Lundin FE, Wagoner JK, Archer VE. Radon daughter exposure and respiratory cancer, quantitative and temporal aspects. National Institute for Occupational Safety and Health-National Institute of Environmental Health Sciences joint monograph no. 1. Washington, DC: US Department of Health, Education, and Welfare, Public Health Service, 1971.
4. Whittemore AS, McMillan A. Lung cancer mortality among US uranium miners: a reappraisal. J Natl Cancer Inst 1983;71:489-99.
5. Hornung RW, Meinhardt TJ. Quantitative risk assessment of lung cancer in US uranium miners. Health Phys 1987;52:417-30.
6. National Research Council (US) Committee on the Biological Effects of Ionizing Radiations. Health risks of radon and other internally deposited alpha-emitters: BEIR IV. Washington, DC: National Academy Press, 1988.
7. Lubin JH, Boice JD Jr, Edling C. Radon and lung cancer risk: a joint analysis of 11 underground miners studies. Bethesda, MD: US Department of Health and Human Services, 1994. (NIH publication no. 94-3644).
8. Thomas DC, Petitti DB, Goldhaber M, et al. Reproductive outcomes in relation to malathion spraying in the San Francisco Bay Area, 1981-1982. Epidemiology 1992;3:32-9.
9. Ott J. Analysis of human genetic linkage. Baltimore, MD: Johns Hopkins University Press, 1991.
10. Khoury MJ, Beaty TH, Cohen BH. Fundamentals of genetic epidemiology. New York, NY: Oxford University Press, 1993.
11. Sattin RW, Rubin CL, Webster LA, et al. Family history and the risk of breast cancer. JAMA 1985;253:1908-13.
12. Claus EB, Risch NJ, Thompson WD. Age at onset as an indicator of familial risk of breast cancer. Am J Epidemiol 1990;131:961-72.
13. Claus EB, Risch N, Thompson WD. Genetic analysis of breast cancer in the Cancer and Steroid Hormone Study. Am J Hum Genet 1991;48:232-42.
14. Samet JM, Munoz A. Evolution of the cohort study. Epidemiol Rev 1998;20:1-14.
15. Krailo M, Thomas DC, Pike MC. Fitting models of carcinogenesis to a case-control study of breast cancer. J Chronic Dis 1987;40(Suppl 2):181S-9S.
16. Pike MC, Krailo MD, Henderson BE, et al. 'Hormonal' risk factors, 'breast tissue age' and the age-incidence of breast cancer. Nature 1983;303:767-70.
17. Breslow NE, Day NE. Statistical methods in cancer research. Vol II—The design and analysis of cohort studies. Lyon, France: International Agency for Research on Cancer, 1987. (IARC scientific publications no. 82).
18. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. New York, NY: Wiley, 1980.
19. Cox DR. Regression models and life tables. J R Stat Soc [B] 1972;34:187-220.
20. Breslow NE, Day NE. Statistical methods in cancer research. Vol I—The analysis of case-control studies. Lyon, France: International Agency for Research on Cancer, 1980. (IARC scientific publications no. 32).
21. Aalen OO. A linear regression model for the analysis of life times. Stat Med 1989;8:907-25.
22. Borgan O, Langholz B. Estimation of excess risk from case-control data using Aalen's linear regression model. Biometrics 1997;53:690-7.
23. Buckley JD, James I. Linear regression with censored data. Biometrika 1979;66:429-36.
24. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med 1992;11:1871-9.
25. Ying Z, Jung SH, Wei LJ. Survival analysis with median regression models. J Am Stat Assoc 1995;90:178-84.
26. Thomas DC. Relative risk modeling. In: Encyclopedia of biostatistics. Chichester, United Kingdom: Wiley, 1997:3763-71.
27. National Research Council (US) Committee on the Biological Effects of Ionizing Radiations. Health effects of exposure to low levels of ionizing radiation: BEIR V. Washington, DC: National Academy Press, 1990.
28. Boice JD Jr, Blettner M, Kleinerman RA, et al. Radiation dose and leukemia risk in patients treated for cancer of the cervix. J Natl Cancer Inst 1987;79:1295-311.
29. Thomas DC. General relative-risk models for survival time and matched case-control analysis. Biometrics 1981;37:673-86.
30. Guerrero VM, Johnson RA. Use of the Box-Cox transformation with binary response models. Biometrika 1982;69:309-14.
31. Breslow NE, Storer BE. General relative risk functions for case-control studies. Am J Epidemiol 1985;122:149-62.
32. Moolgavkar SH, Venzon DJ. General relative risk regression models for epidemiologic studies. Am J Epidemiol 1987;126:949-61.
33. Hauck WW Jr, Donner A. Wald's test as applied to hypotheses in logit analysis. J Am Stat Assoc 1977;72:851-3.
34. Vaeth M. On the use of Wald's test in exponential families. Int Stat Rev 1985;53:199-214.
35. Lubin JH, Gaffey W. Relative risk models for assessing the joint effects of multiple factors. Am J Ind Med 1988;13:149-67.
36. Thomas DC. Nonparametric estimation and tests of fit for dose-response relations. Biometrics 1983;39:263-8.
37. Ulm K. Nonparametric analysis of dose-response relations in epidemiology. Math Model 1986;7:777-83.
38. Hastie TJ, Tibshirani RJ. Generalized additive models. New York, NY: Chapman and Hall, 1990.
39. Schwartz J. Air pollution and daily mortality in Birmingham, Alabama. Am J Epidemiol 1993;137:1136-47.
40. Armitage P, Doll R. Stochastic models of carcinogenesis. In: Neyman J, ed. Proceedings of the 4th Berkeley symposium on mathematics, statistics, and probability. Berkeley, CA: University of California Press, 1961:18-32.
41. Moolgavkar SH, Knudson AG Jr. Mutation and cancer: a model for human carcinogenesis. J Natl Cancer Inst 1981;66:1037-52.
42. Whittemore A, Keller JB. Quantitative theories of carcinogenesis. SIAM Rev 1978;20:1-30.
43. Moolgavkar SH. Carcinogenesis modeling: from molecular biology to epidemiology. Annu Rev Public Health 1986;7:151-69.
44. Thomas DC. Models for exposure-time-response relationships with applications to cancer epidemiology. Annu Rev Public Health 1988;9:451-82.


45. Whittemore AS. The age distribution of human cancer for carcinogenic exposures of varying intensity. Am J Epidemiol 1977;106:418-32.
46. Day NE, Brown CC. Multistage models and primary prevention of cancer. J Natl Cancer Inst 1980;64:977-89.
47. Thomas DC. Statistical methods for analyzing effects of temporal patterns of exposure on cancer risks. Scand J Work Environ Health 1983;9:353-66.
48. Moolgavkar SH. The multistage theory of carcinogenesis and the age distribution of cancer in man. J Natl Cancer Inst 1978;61:49-52.
49. Moolgavkar SH, Dewanji A, Venzon DJ. A stochastic two-stage model for cancer risk assessment. I. The hazard function and the probability of tumor. Risk Anal 1988;8:383-92.
50. Brown CC, Chu KC. A new method for the analysis of cohort studies: implications of the multistage theory of carcinogenesis applied to occupational arsenic exposure. Environ Health Perspect 1983;50:293-308.
51. Dong MH, Redmond CK, Mazumdar S, et al. A multistage approach to the cohort analysis of lifetime lung cancer risk among steelworkers exposed to coke oven emissions. Am J Epidemiol 1988;128:860-73.
52. Brown CC, Chu KC. Use of multistage models to infer stage affected by carcinogenic exposure: example of lung cancer and cigarette smoking. J Chronic Dis 1987;40(Suppl 2):171S-9S.
53. Freedman DA, Navidi WC. Multistage models for carcinogenesis. Environ Health Perspect 1989;81:169-88.
54. Crump KS, Allen BC, Howe RB, et al. Time-related factors in quantitative risk assessment. J Chronic Dis 1987;40(Suppl 2):101S-11S.
55. Thomas DC. A model for dose rate and duration of exposure effects in radiation carcinogenesis. Environ Health Perspect 1990;87:163-71.
56. Moolgavkar SH, Dewanji A, Luebeck G. Cigarette smoking and lung cancer: reanalysis of the British doctors' data. J Natl Cancer Inst 1989;81:415-20.
57. Moolgavkar SH, Cross FT, Luebeck G, et al. A two-mutation model for radon-induced lung tumors in rats. Radiat Res 1990;121:28-37.
58. Moolgavkar SH, Luebeck EG, Krewski D, et al. Radon, cigarette smoke, and lung cancer: a re-analysis of the Colorado Plateau uranium miners' data. Epidemiology 1993;4:204-17.
59. Stayner L, Smith R, Bailer AJ, et al. Modeling epidemiologic studies of occupational cohorts for the quantitative assessment of carcinogenic hazards. Am J Ind Med 1995;27:155-70.
60. Moolgavkar SH, Day NE, Stevens RG. Two-stage model for carcinogenesis: epidemiology of breast cancer in females. J Natl Cancer Inst 1980;65:559-69.
61. Moolgavkar SH, Luebeck EG. Multistage carcinogenesis: population-based model for colon cancer. J Natl Cancer Inst 1992;84:610-18.
62. Thomas DC, Stram D, Dwyer J. Exposure measurement error: influence on exposure-disease relationships and methods of correction. Annu Rev Public Health 1993;14:69-93.
63. Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med 1989;8:1051-69.
64. Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error. Am J Epidemiol 1990;132:734-45.
65. Pierce DA, Stram DO, Vaeth M. Allowing for random errors in radiation dose estimates for the atomic bomb survivor data. Radiat Res 1990;123:275-84.
66. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457-81.
67. Langholz B, Borgan O. Estimation of absolute risk from nested case-control data. Biometrics 1997;53:767-74.
68. Rosner B. Multivariate methods in ophthalmology with applications to other paired-data situations. Biometrics 1984;40:1025-35.
69. Bonney GE. Regressive logistic models for familial disease and other binary traits. Biometrics 1986;42:611-25.
70. Clayton DG. A Monte Carlo method for Bayesian inference in frailty models. Biometrics 1991;47:467-85.
71. Hougaard P, Harvald B, Holm N. Measuring similarities between the lifetimes of adult Danish twins born between 1881-1930. J Am Stat Assoc 1992;87:17-24.
72. Yashin AI, Vaupel JW, Iachine IA. Correlated individual frailty: an advantageous approach to survival analysis of bivariate data. Math Popul Stud 1995;5:145-60.
73. Korsgaard IR, Anderson AH. The additive genetic gamma frailty model. Scand J Stat 1998;25:254-69.
74. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13-22.
75. Prentice RL, Zhao LP. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 1991;47:825-39.
76. Lin DY. Cox regression analysis of multivariate failure time data: the marginal approach. Stat Med 1994;13:2233-47.
77. Prentice RL, Cai J. Covariance and survivor function estimation using censored multivariate failure time data. Biometrika 1992;79:495-512.
78. Liddell FDK, McDonald JC, Thomas DC. Methods of cohort analysis: appraisal by application to asbestos mining (with discussion). J R Stat Soc [A] 1977;140:469-91.
79. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986;73:1-11.
80. Borgan O, Goldstein L, Langholz B. Methods for the analysis of sampled cohort data in the Cox proportional hazards model. Ann Stat 1995;23:1749-78.
81. Langholz B, Thomas DC. Nested case-control and case-cohort methods of sampling from a cohort: a critical comparison. Am J Epidemiol 1990;131:169-76.
82. Langholz B, Thomas DC. Efficiency of cohort sampling designs: some surprising results. Biometrics 1991;47:1563-71.
83. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol 1982;115:119-28.
84. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika 1988;75:11-20.
85. Langholz B, Borgan O. Counter-matching: a stratified nested case-control sampling method. Biometrika 1995;82:69-79.
86. Langholz B, Clayton D. Sampling strategies in nested case-control studies. Environ Health Perspect 1994;102(Suppl 8):47-51.
87. Cologne JB. Counterintuitive matching. (Editorial). Epidemiology 1997;8:227-9.
88. Borgan O, Langholz B. Risk set sampling designs for proportional hazards models. In: Everitt B, Dunn G, eds. Recent advances in the statistical analysis of medical data. London, England: Edward Arnold, 1998:75-100.
89. Steenland K, Deddens JA. Increased precision using counter-matching in nested case-control studies. Epidemiology 1997;8:238-42.
90. Cardis E, Gilbert ES, Carpenter L, et al. Combined analyses of cancer mortality among nuclear industry workers in Canada, the United Kingdom, and the United States of America. Lyon, France: International Agency for Research on Cancer, 1995. (IARC technical report no. 25).
