survival analysis - purdue universityovitek/stat526-spring11_files/pdfs/9-surviva… · survival...

Survival Analysis

STAT 526

Professor Olga Vitek

May 4, 2011

Survival Data andSurvival Functions

• Statistical analysis of time-to-event data

– Lifetime of machines and/or parts(called failure time analysis in engineering)

– Time to default on bonds or credit card(called duration analysis in economics)

– Patients survival time under different treatment(called survival analysis in clinical trial)

– Event-history analysis (sociology)

• Why a special topic on survival analysis?

– Non-normal and skewed distribution

– Need to answer questions related to

P (X > t+ t0|X ≥ t0)

– Censored/truncated data

– Here we only focus on right-censored data

Survival Function

• Continuous survival time T

– its probability density function is f(t)

– its cumulative probability function is F (t)

F (t) = P (T ≤ t) =

0f(s)ds

• The survival function of T is

S(t) = P (T > t) = 1− F (t)

– also called survival rate

– steeper curve → shorter survival

– S′(t) = −f(t)

• Mean survival time

µ = E{T} =

∫ ∞0

tf(t) dt =

∫ ∞0

t dF (t) =

∫ ∞0

t d[1− S(t)]

∫ ∞0

[∫ t

]d[1− S(t)] =

∫ ∞0

[∫ ∞x

d[1− S(t)]

∫ ∞0

S(x) dx

Hazard Function

• The hazard function of T is

λ(t) = lim∆t↘0

P (t ≤ T < t+ ∆t |T ≥ t)∆t

– proportion of subjects with the event per unittime, around time t; λ(t) ≥ 0

– measure of ’proneness’ to the even as functionof time

– λ(t) 6= f(t): λ(t) is conditional on survival to t

• Relates to the survival function

λ(t) = lim∆t↘0

F (t+ ∆t)− F (t)

/S(t) =

S(t)= −

S′(t)

= −d

dtlogS(t) (negative slope of log-survival)

• The cumulative hazard function of T is

defined as Λ(t) =∫ t0 λ(s)ds

– Λ(t) = − logS(t)

– S(t) = exp{−Λ(t)}

– λ(t), Λ(t) or S(t) define the distribution

Parametric Survival Models

• Exponential distribution:

– λ(t) = ρ, where ρ > 0 is a constant, and t > 0

– S(t) = e−ρt, ⇒ f(t) = −S′(t) = ρe−ρt

• Weibull distribution:

– λ(t) = λp(λt)p−1; λ, p > 0 are constants, t > 0

– S(t) = e−(λt)p, ⇒ f(t) = −S′(t) = e−(λt)pλp(λt)p−1

– is the Exponential distribution when p = 1

• Gompertz distribution:

– λ(t) = αeβt; α, β > 0 are constants, t > 0

– is the Exponential distribution when β = 0

• Gompertz-Makeham distribution:

– λ(t) = λ+ αeβt; α, β > 0 are constants, t > 0

– adds an initial fixed component to the hazard

Non-Parametric SurvivalModels

• Approximate time by discrete intervals

– f(xi) =

{P{T = xi}0 otherwise

– S(xi) =∑

j: xj≥t f(xj)

– λ(xi) = f(xi) / S(xi)

• After substitution:

– f(xi) = λ(xi)∏i−1j=1 (1− λj)

– S(xi) =∏i−1j=1(1− λj)

– S(t) =∏j:xj<t

(1− λj), t > 0

Right-Censored Data

• Survival time of i-th subject Ti, i = 1, · · · , n.

• Censoring time of i-th subject Ci, i = 1, · · · , n.

• Observed event for ith subject:

Yi = min(Ti, Ci), δi = 1{Ti≤Ci} =

{1, if Ti ≤ Ci0, if Ti > Ci

– Data are reported in pairs (Yi, δi)

= (observedTime, hasEvent)

• Assumption:

– Ci are predetermined and fixed, or

– Ci are random, mutually independent, and inde-pendent of Ti

Non-parametric Estimation

of Survival Function

Comparison Between Groups

With Censoring

Example: Remission Timesof Leukaemia Patients

• 21 leukemia patients treated with drug (6-mercaptopurine)

• 21 matched controls, no covariates.

> library(MASS) #Described in Venable & Ripley, Ch.13> data(gehan)

> gehanpair time cens treat

1 1 1 1 control2 1 10 1 6-MP3 2 22 1 control4 2 7 1 6-MP5 3 3 1 control6 3 32 0 6-MP7 4 12 1 control8 4 23 1 6-MP9 5 8 1 control10 5 22 1 6-MP11 6 17 1 control12 6 6 1 6-MP13 7 2 1 control14 7 16 1 6-MP15 8 11 1 control16 8 34 0 6-MP17 9 8 1 control.........................

Kaplan-Meier Estimator ofSurvival Function

• Also called product limit estimator

• Re-organize the data

Distinct Event Times t1 · · · ti · · · tk# of Events d1 · · · di · · · dk

# Survivors Right Before ti n1 · · · ni · · · nk

• Kaplan-Meier estimator S(t) :

– Estimate P (T > t1) by p1 = (n1 − d1)/n1

– Estimate P (T > ti|T > ti−1) by pi = (ni − di)/ni,i = 2, · · · , k

– If ti−1 < t < ti,

S(t) =

P (T > t) = P (T > t1)P (T > t|T > t1) = · · ·

P (T > t1){i−1∏m=2

P (T > tm|T > tm−1)}P (T > t|T > ti−1)

– Estimate S(t) by S(t) = S(ti−1)ni−dini

=∏i:ti≤t

ni−dini

(Kaplan & Meier, 1958, JASA)

Non-Parametric Fit ofSurvival Curves

• Represent data in the censored survival form– create a data structure ’Surv’

– ’+’ represents censored time

> library(survival)> Surv(gehan$time, gehan$cens)1 10 22 7 3 32+ 12 23 8 22 17 6 2 1611 34+ 8 32+ 12 25+ 2 11+ 5 20+ 4 19+ 15 68 17+ 23 35+ 5 6 11 13 4 9+ 1 6+ 8 10+

• K-M estimation of survival function– subset of results for the treatment group:

> fit <- survfit(Surv(time, cens) ~ treat, data=gehan)> summary(fit)

treat=6-MPtime n.risk n.event survival std.err lower95%CI upper95%CI

6 21 3 0.857 0.0764 0.720 1.0007 17 1 0.807 0.0869 0.653 0.996

10 15 1 0.753 0.0963 0.586 0.96813 12 1 0.690 0.1068 0.510 0.93516 11 1 0.627 0.1141 0.439 0.89622 7 1 0.538 0.1282 0.337 0.85823 6 1 0.448 0.1346 0.249 0.807

Interval K-M Estimatorof Survival Function

• Direct interval estimation of S(t)

– The variance of S(t) can be estimated by

var(S(t)) = [S(t)]2∑i:ti≤t

ni(ni − di)

– This is Greenwood’s formula (Greenwood, 1926)

– CI for S(t) is S(t)± 1.96√var[S(t)],

– S(t) ∈ [0,1], but var(S(t)) may be large enough

that the usual CI will exceed this interval

• Interval estimation using log S(t) (preferred)

– The variance of log S(t) can be estimated as

var[ log S(t) ] =∑i:ti≤t

ni(ni − di)

– Construct the CI for log S(t)

[L,U ] = log S(t)± 1.96√var[log S(t)],

– For CI of S(t), transform the limits with [eL, eU ].

Estimate the Cumulative Hazard

Function Λ(t) = − logS(t)

• Using the Kaplan-Meier estimator of S(t):

Λ(t) = − log S(t)

• Alternatively, the Nelson-Aalen Estimator:

Λ(t) =

{0, if t ≤ t1∑

i:ti≤tdini, if t ≥ ti

V ar(Λ(t) =∑i:ti≤t

– better small-sample-size performance than basedon the K-M procedure

– useful in comparing the fit of a parametric modelto its non-parametric alternative

Non-Parametric Fit ofSurvival Curves

>plot(fit, conf.int=TRUE, lty=3:2, log=TRUE,xlab="Remission (weeks)", ylab="Log-Survival", main="Gehan")

• Plot curves and CI; default CI are on the log scale

• ’log=TRUE’ plots y axis (i.e. S(t)) on the log scale(i.e. y axis shows the hegative cumulative hazard)

0 5 10 15 20 25 30 35

Remission (weeks)

control6−MP

Log-Rank Test for Homogeneity

• Non-parametric test

– Compare two populations with hazard functionsλi(t), i = 1,2.

– Collect two samples from each population.

– Construct a pooled sample with k distinct eventtimes

Distinct Failure t1 · · · ti · · · tkTime

Pool # of Failures d1 · · · di · · · dkSample # survivors n1 · · · ni · · · nk

right before tiSample # of Failures d11 · · · d1i · · · d1k

1 # survivors ti n11 · · · n1i · · · n1k

right before tiSample # of Failures d21 · · · d2i · · · d2k

2 # survivors n21 · · · n2i · · · n2k

right before ti

– Note di = d1i + d2i, ni = n1i + n2i, i = 1, · · · , k

– Test the hypotheses

H0 : λ1(t) = λ2(t), t ≤ τvs. Ha : λ1(t) 6= λ2(t) for some t ≤ τ

Log-Rank Test for

Homogeneity: Procedure

• Consider [ti, ti + ∆) for small ∆Sample Sample Pooled

1 2 Sample# failures d1i d2i di

# survivors at ti n1i − d1i n2i − d2i ni − di# of survivors ti n1i n2i ni

right before ti

• If λ1(t) 6= λ2(t) around ti, there should be associ-ation between sample and event in this 2 × 2 table

• d1i|(ni, di, n1i)H0∼ Hypergeometric(ni, di, n1i):

P (d1i = x|ni, di, n1i) =

)(ni − din1i − x

)=⇒ E[d1i|ni, di, n1i] =

• Define U =∑k

[d1i − din1i

]E[U ] = 0, var(U) =

∑ki=1

din1i(ni−di)(ni−n1i)n2i (ni−1)

U/var(U)1/2 H0, asy∼ N(0,1)

Log-Rank Test in R

(Venable & Ripley Sec.13.2)

> ?survdiff

> survdiff(Surv(time,cens)~treat,data=gehan)

N Observed Expected (O-E)^2/E (O-E)^2/Vtreat=6-MP 21 9 19.3 5.46 16.8treat=control 21 21 10.7 9.77 16.8

Chisq= 16.8 on 1 degrees of freedom, p= 4.17e-05

• Conclusion: Reject H0

• Warning: this test does not adjust for covariates,and may often be inappropriate

– Appropriate for this study, since the individualsare matched pairs

Parametric Estimation of

Survival Function

Comparison Between Groups

The Likelihood Function

• The likelihood of i-th observation (yi, δi, xi):{f(yi|Xi), if δi = 1, (not censored)

S(yi|Xi), if δi = 0, (right− censored)

• The likelihood of the data is

L(β) ∝n∏i=1

{[f(yi|Xi)]δi[S(yi|Xi)]1−δi

}– The inference approaches based on the likeli-

hood function can be applied.

∗ See chapter by Lee on ML parameterestimation

– Use AIC or BIC to compare nested models.

– Graphical check of model adequacy

∗ Plot Λ(t) vs. t for exponential models;

∗ Plot log(Λ) vs. log(t) for Weibull models;

∗ Can also plot deviance residuals.

Accounting for Covariates:Models for Hazard Function

• General form: λ(t) = λ0(t)ex′β

– λ0(t): the baseline hazard

– ex′β: multiplicative effect, independent of time

• Implications for the Survival Function:

– S(t) = e−Λ(t) = e−∫ t

0λ(s) ds = e−Λ0(t)ex

– logΛ(t) = log[−log S(t)] = logΛ0 + x′β

– If the model is appropriate, a plot of log[−log S(t)KM ]for different groups yields roughly parallel lines

Accounting for Covariates:Models for Hazard Function

• Exponential distribution, no predictors:

– λ(t) = ρ - constant hazard function

– S(t) = e−ρ t - survival function

• Exponential distribution, one predictor:

– λ(t) = eβ0+β1X = eβ0 · eβ1X,

– eβ0 plays the role of λ0,i.e. the constant baseline hazard

• Weibull distribution, no predictors:

– λ(t) = λp p tp−1 - hazard function

– S(t) = e−(λ t)p - survival function

• Weibull distribution, one predictor:

– λ(t) = p tp−1 · e(β0+β1X)p = p tp−1eβp0 · epβ1X

– eβ0+β1X plays the role of λ in overall hazard

– p tp−1eβp0 is the baseline hazard

– eβ0 plays the role of λ in the baseline hazard

survival analysis - purdue universityovitek/stat526-spring11_files/pdfs/9-surviva… · survival...

Documents

pocket survival pak instructions - doug … survival pak...

court survival guide -...

a long-term survival guide – military survival kitsa...

the swiss army survival tampon — 10 survival uses

longest survival with renal aa amyloidosis: … · survival...

survival gadget page survival gadgetsurvival gadgetsurvival...

serbian 'survival' phrases serbian 'survival' phrases

homework 5 - solution - purdue...

survival project -...

survival spanish for law enforcement...

into the wild survival course elite survival training

survival analysis chapitre 1 : introduction · chapitre 1 :...

% survival

classification of survival data by comparison of survival

survival skills -...

acr sr203 survival radio -...

survival analysis - university of...

10 survival phrases survival answers 5/6

survival sullivan | advanced survival and prepping from

a long-term survival guide - military survival kits