
01/2015 1

EPI 5344: Survival Analysis in Epidemiology
Age as time scale

March 31, 2015

Dr. N. Birkett, School of Epidemiology, Public Health & Preventive Medicine, University of Ottawa


Objectives

• Choice of time scale for observational epidemiology

• Risk-set based analysis approaches


Example Study (1)

• Are Uranium miners at risk for dying from lung cancer?
  – Uranium is radioactive and has a complex decay process
  – Miners work in enclosed areas with high levels of radioactive dust
  – Is there evidence that their health is affected?

Example Study (2)

• Colorado Plateau study
  – Subject eligibility
    • Worked underground in uranium mines in the four-state Colorado Plateau area
      – at least one month of work
    • 2,500 mines in target area
    • Examined at least once by Public Health Service MDs between 1950 and 1960
  – Followed-up to Dec 31, 1982
    • Vital Stats records
      – Death
      – Lung cancer death

Example Study (3)

• Entry date:
  – latest of:
    • one month of work and exam by MD
    • January 1, 1952
• Main outcome
  – death from lung cancer

Example Study (4)

• Exposure:
  – 43,000 direct measurements of radon levels in mines between 1951 and 1968
  – Converted to annual exposure
  – Combined with worker’s ‘in mine’ work time
  – Generated Working-Level months (WLM)
    • WL = 20.8 µJ (microjoules) alpha energy per cubic meter (m³) air
    • WLM = 1 WL exposure for 170 hours
  – Cumulated in five year age intervals
    • 0-5; 5-10; 10-15; 15-20; 20-25; …
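The WLM definition above is simple arithmetic, and a short Python sketch may make it concrete. Python is used here only for illustration (the slides themselves use SAS); the function name and the example miner are hypothetical, not taken from the study.

```python
# From the slide: 1 WLM = exposure to 1 WL for 170 hours.
HOURS_PER_WORKING_MONTH = 170


def working_level_months(mean_wl: float, hours_in_mine: float) -> float:
    """Exposure in WLM for a mean concentration (in WL) and hours underground."""
    return mean_wl * hours_in_mine / HOURS_PER_WORKING_MONTH


# Hypothetical miner: 2,000 hours underground in one year at a mean of 5 WL.
annual_wlm = working_level_months(5.0, 2000.0)
print(round(annual_wlm, 1))
```

Annual values computed this way would then be summed within each five year age interval to give variables such as rexp20.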


agest = age at entry to study

ageexit = age at exit from study

ind = died from lung cancer (=1)

rexp20 = WLMs from age 15-20

Example study (5)

Item                                      Number   Percent
Sample size                               3,347
Dying (any cause)                         1,258    38%
Lung cancer death                         258      7.7%
Lung cancer as proportion of all deaths            20.5%

How to apply survival analysis methods to this data?

Example study (6)

• Based on course to now:
  – Time is the number of years (months, days, etc.) from initial entry into the study
  – Time ‘0’ is the entry date
  – End of follow-up
    • Death (or death from lung cancer)
    • Censored if
      – lost
      – died from ‘wrong cause’

Example study (7)

• Based on course to now:
  – Exposure is time varying
    • Cumulative
    • Mean
    • Peak
  – We will look at exposure to more than 500 WLM
  – Use PHREG to generate HR estimates

Choosing a time scale (1)

• Time scale choices include:
  – Age
  – Calendar year
  – Time since entry into study
  – Time since initial employment


• Cox model is: h(t | x) = h0(t) exp(βx)

• Choice of time scale affects the shape of the baseline hazard h0(t)

• It also affects which people belong together in a risk set

• Betas will have different values


• Time on study
  – Hazard affected by
    • Cumulative exposure
    • Length of time for disease to develop post-exposure
  – Usually a ‘gentle’ increase
  – Risk set groups people with same time post-entry
    • Combines people of different ages
    • Averages age-specific hazards


• The actual year (calendar time)
  – Hazard affected by
    • Temporal changes in exposure or risk
      – increased air pollution
      – climate change
      – legislation
  – Changes usually slow
  – Hazard is fairly constant, controlling for age, etc.
  – Risk set groups people in same years
  – Most commonly used for trend analyses with Poisson regression models


• Age
  – Hazard affected by
    • Cumulative exposure
    • Aging
  – Often shows a very strong effect on hazard
    • Prostate cancer hazard increases ‘super-exponentially’
  – Risk set groups people of the same age
    • Ignores how long you have been ‘on study’


• Choices are not independent
  – One year of follow-up increases all three time scale measures by one year
• Cox models ‘work’ best if the baseline hazard captures a lot of hazard variation


• For an RCT, ‘time on study’ is appropriate
  – follow-up time is usually short
  – Intervention has a strong effect, overwhelms age effect
• For etiological studies
  – Risk increases with age
  – Risk relates to exposure, not to length of time since study entry
  – Length of time is a proxy for cumulative exposure
• For etiological studies, several people have studied the choice of time scale


• Breslow et al (1983)
  – Time-on-study as time scale
    • fine for RCTs, etc.
  – Not optimal for cohort studies
    • Most outcome death rates increase rapidly with age
      – Want to maximize control of the age effect
    • Time-on-study often strongly correlated with cumulative exposure
      – Can produce negative bias if used as time scale


• Breslow et al (1983)
  – Recommendation
    • Use age as time scale
    • Stratify by calendar time (5 year groups)
      – Risk sets consist of people at the same age in each calendar group
  – Ignores length of time since entry as factor
  – Subjects are left truncated (‘late entry’)
    • Time ‘0’ is ‘birth date’
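With age as the time scale and late entry, a subject belongs to the risk set at an event age only if they have already entered the study and have not yet exited. A minimal Python sketch of that rule follows (Python is an illustration language here, not the SAS used in the slides). The field names agest, ageexit and ind follow the data file described earlier; the four subjects are invented, and the convention that entry age must be strictly less than the event age is an assumption.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    sid: int
    agest: float    # age at entry to study (left-truncation time)
    ageexit: float  # age at exit from study
    ind: int        # 1 = died from lung cancer, 0 = censored


def risk_sets(subjects):
    """Map each event age to the ids of subjects at risk at that age."""
    sets = {}
    for case in subjects:
        if case.ind != 1:
            continue  # only lung cancer deaths define a risk set
        t = case.ageexit
        # At risk: already entered (agest < t) and not yet exited (t <= ageexit).
        sets[t] = [s.sid for s in subjects if s.agest < t <= s.ageexit]
    return sets


# Hypothetical cohort, for illustration only.
cohort = [
    Subject(1, 30.0, 52.0, 1),
    Subject(2, 45.0, 70.0, 0),
    Subject(3, 55.0, 61.0, 1),
    Subject(4, 40.0, 58.0, 0),
]
sets = risk_sets(cohort)
print(sets)
```

Subject 3 enters at age 55, so they are not in the risk set for the death at age 52 even though they are in the cohort. This is what 'late entry' means in practice.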


• Korn et al (1997)
  – Cox models don’t specify a form for h(t)
  – Best choice of time scale is the one which has the biggest impact on the hazard function shape
    • NOT the biggest impact on the HR!
  – Which would differ the most:
    • hazard for people aged 50 vs. aged 60, both with 10 years of follow-up?
    • hazard for two 55 year olds, one with 5 years of follow-up and one with 15 years?
  – Cannot study the effect of the time scale variable


• Korn et al
  – Recommendation
    • Use age as time scale
    • Stratify by year of birth (birth cohort)
      – 5 year groups are commonly used
  – Essentially the same model as proposed by Breslow et al


• Korn et al
  – Considered a second model (commonly used):
    • ‘Time-on-study’ as time scale
    • Adjust for age at entry in model
  – Results are the same as having age as time scale if:
    • h0(t) is exponential in age
  – Otherwise, it can give strong bias, especially with time-dependent covariates.
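A short derivation may help show why an exponential baseline makes the two models agree. The symbols a (attained age), a_0 (age at entry), t (time on study), c and γ are notation introduced here for illustration and do not appear in the slides.

```latex
% Age-scale Cox model whose baseline hazard is exponential in age a
h(a \mid x) = c\, e^{\gamma a}\, e^{\beta x}

% Attained age is entry age plus time on study: a = a_0 + t
h(a_0 + t \mid x) = \underbrace{c\, e^{\gamma t}}_{\text{baseline in } t} \; e^{\gamma a_0 + \beta x}
```

Under this assumption the age-scale model can be rewritten as a time-on-study model with age at entry as a covariate (coefficient γ), and β is unchanged. If the baseline is not exponential in age, this factorization is not available.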

• Uranium miners study
• 4 different time scales
• Differences are not big
• HR/RR all around 3.6-5.2
• ‘Age’ is used as time scale in rest of session

Time scale                RR    95% CI
Time since entry          4.7   3.5 – 6.3
Time since first mining   3.6   2.7 – 4.9
Calendar year             5.2   3.9 – 6.9
Age                       4.3   3.2 – 5.7

Implications for Analysis

• Age is the time variable
  – Left truncated
  – Requires ‘late entry’ methods
• Compute exposure as a time varying variable
  – Cumulative
  – Mean
• Analysis option #1:
  – Use regular Cox model
• Other options
  – risk set modelling methods

Regular Cox models (1)

• Uses the Phreg approach
• Time varying exposure
  – e.g. use ‘500 WLM’ as time varying cut-point
• SAS code uses programming statements within Phreg
• Data file uses layout shown earlier

* Model has ageexit as failure time, ind as failure indicator and agest as entry time;
proc phreg data=u.uminers;
  model ageexit*ind(0)=cr500 / entry=agest risklimits;

  * Time-dependent programming steps - see PHREG documentation;
  array rexp {18} rexp5  rexp10 rexp15 rexp20 rexp25 rexp30
                  rexp35 rexp40 rexp45 rexp50 rexp55 rexp60
                  rexp65 rexp70 rexp75 rexp80 rexp85 rexp90;

  m = min((ageexit-2)/5, 18);
  i = 0;
  cradon = 0;
  do while (i < m);
    if (m > (i+1)) then do;
      cradon = cradon + rexp[i+1];
    end;
    else do;
      cradon = cradon + (m-i)*(rexp[i+1]);
    end;
    i = i+1;
  end;

  * Determine whether cumulative radon is >= 500 WLM;
  cr500 = (cradon >= 500);
run;
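The loop in the PHREG programming statements can be easier to follow when traced outside SAS. The Python sketch below mirrors that logic: complete five year intervals contribute their full WLM, and the interval in progress contributes proportionally. The subtraction of 2 is copied from the slide code as-is; the exposure values are invented for illustration, and the function name is hypothetical.

```python
from typing import Sequence


def cumulative_radon(age: float, rexp: Sequence[float]) -> float:
    """Cumulative WLM at a given age, mirroring the SAS loop on the slide.

    rexp holds 18 values: WLM accumulated in ages 0-5, 5-10, ..., 85-90.
    """
    m = min((age - 2) / 5, 18)  # number of intervals to count (may be fractional)
    cradon = 0.0
    i = 0
    while i < m:
        if m > i + 1:
            cradon += rexp[i]            # interval fully counted
        else:
            cradon += (m - i) * rexp[i]  # last interval, counted proportionally
        i += 1
    return cradon


# Hypothetical exposure history: 100, 200 and 300 WLM in ages 15-20, 20-25, 25-30.
history = [0, 0, 0, 100, 200, 300] + [0] * 12

for age in (27.0, 29.5, 32.0):
    total = cumulative_radon(age, history)
    print(age, total, total >= 500)  # last column plays the role of cr500
```

Because the calculation depends on the age at which each risk set occurs, the same miner can be below the 500 WLM cut-point in one risk set and above it in a later one.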

proc phreg data=u.uminers;
  model ageexit*ind(0)=cr500 / entry=agest risklimits;

  /***** CODE REMOVED FOR CLARITY *****/

  cr500 = (cradon >= 500);
run;

Regular Cox models (2)

• Could do with counting process style input
  – Need to create one record for each subject for each year.
  – Code gets complex
  – I won’t show this
• Either way, Phreg needs to:
  – create risk set data for each risk set
  – compute time varying covariates
  – do the MLE algorithm
• Time consuming process
  – BUT, not a big issue with modern computers.

Risk Set Methods (1)

A Different Approach
• Use the data step to create new data set with the risk set data
• Risk set grouped data
  – series of records for each risk set
  – one line for each subject in the risk set
• Code is complex (not shown)


• How can we use this data?
• Consider any risk set (take risk set #1)
• Can represent data as 2x2 table

             Lung CA   no Lung CA   Total
>=500 WLM    1         4            5
< 500 WLM    0         8            8
Total        1         12           13


• Treat each risk set as a stratum
  – matched on age (the time scale variable)
• Combine tables into an overall estimate
  – Mantel-Haenszel methods could be used
• Better approach
  – Conditional logistic regression.
• Can do this using either:
  – Proc Logistic
  – Proc Phreg
• Likelihood functions are identical
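As a rough illustration of combining per-risk-set 2x2 tables, here is a Python sketch of the Mantel-Haenszel odds ratio (Python is used for illustration only). The first table is risk set #1 from the slides; the other two tables are hypothetical and included only so that the estimate is defined. The slides recommend conditional logistic regression over this approach.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds ratio across strata.

    Each table is (a, b, c, d):
      a = exposed cases,   b = exposed non-cases,
      c = unexposed cases, d = unexposed non-cases.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


tables = [
    (1, 4, 0, 8),  # risk set #1 from the slides (>=500 WLM vs <500 WLM)
    (0, 3, 1, 8),  # hypothetical risk set with an unexposed case
    (1, 3, 0, 7),  # hypothetical risk set with an exposed case
]
or_mh = mantel_haenszel_or(tables)
print(or_mh)
```

Note that a single risk set such as #1 has a zero cell, so its own odds ratio is undefined. Information only emerges when the strata are pooled.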

• Three approaches can be used to do these analyses:
  – the ‘bit of time’ method (phreg)
  – the ‘separate strata’ method (phreg)
  – the ‘binary data’ method (logistic)


• Approach #1 (‘bit of time’ method)
  – Use Phreg
  – Treat the risk set file as a counting process structure
  – Need to add an ‘entry’ and ‘exit’ time for each subject in each risk set
    • exit time
      – age when the risk set occurred
    • entry time
      – exit time - 0.001
      – 0.001 is arbitrary but the math works (trust me)
  – Ignores all of the time between risk sets
  – Seems weird but the math works (trust me)

proc phreg data=cumexp;
  model _rstime*_cc(0)=cr500 / entry=_rsentry rl;
run;
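The entry and exit times needed by the 'bit of time' method are simple to derive once the risk set file exists. A minimal Python sketch, for illustration only: the output names _rstime and _rsentry follow the SAS call above, while the input field rs_age and the sample records are hypothetical, since the slides do not show how the risk set file was built.

```python
def add_bit_of_time(records, width=0.001):
    """Give every risk-set record a tiny (entry, exit] interval ending at the event age."""
    out = []
    for rec in records:
        new = dict(rec)                         # leave the input records untouched
        new["_rstime"] = rec["rs_age"]          # exit: age when the risk set occurred
        new["_rsentry"] = rec["rs_age"] - width  # entry: just before the exit time
        out.append(new)
    return out


# Hypothetical records: one line per subject per risk set.
records = [
    {"_setno": 1, "id": 17, "rs_age": 52.0, "_cc": 1, "cr500": 1},
    {"_setno": 1, "id": 23, "rs_age": 52.0, "_cc": 0, "cr500": 0},
    {"_setno": 2, "id": 23, "rs_age": 61.0, "_cc": 0, "cr500": 1},
]
for row in add_bit_of_time(records):
    print(row["_setno"], row["id"], row["_rsentry"], row["_rstime"])
```

The width must be smaller than the gap between any two distinct event ages, so that each interval covers exactly one risk set.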


• Approach #2 (separate strata method)
  – Use Phreg
  – Number the risk sets from 1 to n
  – Use the risk set ID number as the time variable!
    • Seems weird
    • Risk set ID is not actually a ‘time’
    • But the math works (trust me)
  – No need for a late entry variable

proc phreg data=cumexp nosummary;
  model _setno*_cc(0)=cr500 / rl;
  strata _setno;
run;

Identical to Method #1


• Approach #3 (binary data method)
  – Uses Proc Logistic
  – Treats each risk set as a stratum
    • Remember my 2x2 table from an earlier slide
  – Uses conditional logistic regression
    • Condition on the risk set ID
    • Not interested in OR or RR for each risk set
      – just ‘nuisance’ parameters
    • Including strata parameter can lead to strong bias


• Approach #3 (binary data method)
  – Stratify by the risk set ID
    • similar to STRATA statement in Phreg
  – Model yields an OR.
    • with this sampling approach, OR = RR
    • the math works (trust me)

proc logistic data=cumexp descending;
  model _cc=cr500 / clodds=wald;
  strata _setno;
run;

Identical to Method #1


• All three methods gave the same results.
  – Results are not quite the same as initial Phreg analysis (with age as the time scale):

Method          HR (RR)   95% CI
Regular Phreg   4.263     3.175 – 5.722
Risk sets       4.267     3.179 – 5.728
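One way to see why the three risk set methods agree is that, with one case per risk set, each maximizes a likelihood that depends only on who is in each risk set and who the case is. The Python sketch below (illustration only, not the SAS procedures) fits that likelihood for a single binary covariate by Newton-Raphson. The five risk sets are invented, not the miners data, and the function name is hypothetical.

```python
import math


def fit_conditional_logistic(risk_sets, tol=1e-10, max_iter=50):
    """Maximize the conditional (partial) likelihood for one binary covariate.

    Each risk set is (case_exposed, n_exposed, n_unexposed); the counts
    include the case. Returns the estimated log hazard ratio beta.
    """
    beta = 0.0
    for _ in range(max_iter):
        score = 0.0
        info = 0.0
        for case_exposed, n1, n0 in risk_sets:
            # Conditional probability that the case is an exposed member of the set.
            p = n1 * math.exp(beta) / (n1 * math.exp(beta) + n0)
            score += case_exposed - p  # first derivative of the log-likelihood
            info += p * (1.0 - p)      # minus the second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical risk sets: (case exposed?, number exposed, number unexposed).
example_sets = [(1, 5, 8), (0, 4, 8), (1, 4, 7), (1, 3, 7), (0, 2, 7)]
beta_hat = fit_conditional_logistic(example_sets)
print(beta_hat, math.exp(beta_hat))  # log HR and HR
```

Nothing in this likelihood uses the event times themselves, only the grouping into risk sets, which is consistent with the risk set ID or a tiny time interval working equally well as the 'time' variable.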


• Why bother with risk set method?
  – Some people claim it is faster
    • I didn’t see this effect
    • If true, is this an issue with modern computers?
    • does 1 sec vs. 2 secs matter?

Regular   RS #1   RS #2   RS #3
0.39      1.65    0.47    1.71


• Why bother with risk set method?
  – Can handle random effects code better (I am told)
  – More easily extends to nested case-control and case-cohort methods.

Full risk data
• 1 ‘case’ per risk set
• Multiple non-cases

Nested case-control (1)

• Most studies will have hundreds or thousands of non-cases in each risk set.
• Suppose we needed to collect new exposure information on all subjects
  – genotyping
• Gets very expensive to use whole cohort.


• Do we need all of the non-cases in each risk set?
• NO!!!


• Select a random sample of non-cases from each risk set
  – Usually a small number
    • 4 is common
    • up to 20 in pharmaco-epidemiology studies
• A person can be used more than once
  – Multiple times as control
  – As control and case
• Collect new exposure information only on selected subjects
• Analyze using only these subjects
• Use any of the three risk set methods shown here
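The sampling step described above can be sketched in a few lines of Python (illustration only). The data layout, the function name and the subject ids are hypothetical. Sampling is done independently within each risk set, which is what allows the same person to serve as a control more than once, or as a control in one set and a case in a later one.

```python
import random


def sample_controls(risk_sets, k=4, seed=0):
    """Pick up to k non-cases at random from each risk set.

    risk_sets maps a set id to {"case": id, "at_risk": [ids, including the case]}.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = {}
    for set_id, rs in risk_sets.items():
        non_cases = [sid for sid in rs["at_risk"] if sid != rs["case"]]
        controls = rng.sample(non_cases, min(k, len(non_cases)))
        sampled[set_id] = {"case": rs["case"], "controls": controls}
    return sampled


# Hypothetical risk sets. Subject 9 is at risk in set 1 and is the case in set 2.
full_sets = {
    1: {"case": 3, "at_risk": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    2: {"case": 9, "at_risk": [2, 4, 5, 6, 8, 9, 10]},
}
nested = sample_controls(full_sets, k=4)
for set_id, members in nested.items():
    print(set_id, members["case"], sorted(members["controls"]))
```

New exposure data would then be collected only for the ids that appear in the sampled sets.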

• Will give an unbiased estimate of the true HR/RR
• 95% confidence intervals will be larger
• Why does it work?
• Go back to the Partial Likelihood for Cox models

• The final likelihood contribution from each risk set is:

• For the nested case-control, the likelihood contribution is given by:
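The equations on this slide did not survive in the transcript. As a sketch, the standard Cox partial likelihood contribution and its nested case-control analogue can be written as below. The notation (i_j for the case in the j-th risk set, R_j for the full risk set, R̃_j for the case plus its sampled controls) is assumed here and may differ from the original slide.

```latex
% Full cohort: the case i_j fails at the j-th event time; R_j is everyone at risk then
L_j(\beta) = \frac{\exp(\beta x_{i_j})}{\sum_{k \in R_j} \exp(\beta x_k)}

% Nested case-control: \tilde{R}_j contains the case and its sampled controls only
\tilde{L}_j(\beta) = \frac{\exp(\beta x_{i_j})}{\sum_{k \in \tilde{R}_j} \exp(\beta x_k)}
```

The only difference is the set over which the denominator is summed.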


• Likelihoods are the same form
  – denominator sums over the available risk set
• Can vary method of non-case selection
  – random sample
  – matched
  – counter-matched
• Easily extends to case-cohort design
  – Select a random sample from initial cohort
  – Entire sample is retained as the risk set members throughout follow-up
    • treats case status as a time varying covariate

Summary

• Observational epidemiology analysis is more complex than an RCT
• Survival methods generalize
  – discrete time methods
  – risk set approaches
• Choice of time scale
• More information on Langholz’s web site
  – Risk set analysis course, Langholz, USC

http://hydra.usc.edu/pm518b/
