lecture 5 – categorical data and survival analyses

126
Lecture 5 – Categorical Data and Survival Analyses

Upload: aggie

Post on 06-Jan-2016

61 views

Category:

Documents


1 download

DESCRIPTION

Lecture 5 – Categorical Data and Survival Analyses. OUTLINE. Definition Common CDA Descriptive summaries Tests of Association Modeling Extensions Other examples in CDA. What is Categorical Data Analysis?. Statistical analysis of data that are categorical - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Lecture 5 – Categorical Data and Survival Analyses

Lecture 5 – Categorical Data and Survival Analyses

Page 2: Lecture 5 – Categorical Data and Survival Analyses

OUTLINEOUTLINE

• Definition• Common CDA

– Descriptive summaries– Tests of Association– Modeling

• Extensions• Other examples in CDA

Page 3: Lecture 5 – Categorical Data and Survival Analyses
Page 4: Lecture 5 – Categorical Data and Survival Analyses

What is Categorical Data Analysis?What is Categorical Data Analysis?

• Statistical analysis of data that are categorical (cannot be summarized with mean +/- SD)

• Includes dichotomous, ordinal, nominal outcomes

• Examples: Disease prevalence, Discharge location, Treatment adherence (yes/no)

Page 5: Lecture 5 – Categorical Data and Survival Analyses

Examples of Studies with CDAExamples of Studies with CDA

• MI after CABG

• Diagnostic studies looking Sensitivity, Specificity of a new test/procedure

• Discharge location after new surgical intervention.

Page 6: Lecture 5 – Categorical Data and Survival Analyses

How to analyze words?How to analyze words?

• Order vs. no order

• Breakdown mean +/- SD for two groups

• Do the same: Breakdown Outcome %’s for two groups

Page 7: Lecture 5 – Categorical Data and Survival Analyses

How to analyze words?How to analyze words?

• Comparing length of stay after CABG:– New Trt = 19.2 +/- 2.7– SOC = 21.3 +/- 3.3

• Comparing prevalence of MI:– New Trt = 16%– SOC = 24%

• Are these differences statistically significant? clinically significant?

Page 8: Lecture 5 – Categorical Data and Survival Analyses

Choice of End PointChoice of End Point

• Some designs have a binary response variable– MI after 3 years

– Overall Survival – Time to CVD– Time to recurrent MI

• Can Dichotomize as 1 year rate (Yes/No)

Page 10: Lecture 5 – Categorical Data and Survival Analyses

Common CDACommon CDA

• Descriptive summaries

• Tests for association

• Modeling

Page 11: Lecture 5 – Categorical Data and Survival Analyses

Descriptive summariesDescriptive summaries

Page 12: Lecture 5 – Categorical Data and Survival Analyses

Let’s Talk Data…Let’s Talk Data…

Page 13: Lecture 5 – Categorical Data and Survival Analyses

Descriptive Summaries in CDADescriptive Summaries in CDA

Nominal – Categorical Data Measured in unordered categories

Ordinal – Categorical Data Measured in ordered categories

Continuous – Quantitative Data Measured on a continuum

(summarize with %’s)

(summarize with %’s)

summarize with many measures

Page 14: Lecture 5 – Categorical Data and Survival Analyses

Types of DataTypes of DataNominal – Categorical data

measured in unordered categoriesRace Blood Type

Ordinal – Categorical data measured in ordered categoriesCancer StagesSocio-economic Status (low, medium,

high)

Continuous – Quantitative data measured on a continuumSerum CreatinineHeight/Weight/BMI

Gender

Likert (unlikely, neutral, likely)

Diastolic Blood PressureTumor measurements

Page 15: Lecture 5 – Categorical Data and Survival Analyses

What the data might look like…What the data might look like…

Page 16: Lecture 5 – Categorical Data and Survival Analyses

Compare Categorical Outcomes between groupsCompare Categorical Outcomes between groups

• How to assess if a predictor is associated with a categorical outcome?

• Intuitive?: Get the %’s of the outcome prevalence within each predictor group.

• Example: New Trt and MI.– New Trt response rate = 16%– SOC response rate = 24%

Page 17: Lecture 5 – Categorical Data and Survival Analyses

Contingency TablesContingency Tables

MI

Yes No

New a b a+b

Old c d c+d

a+c b+d n=a+b+c+dGro

up

MI

Yes No

New 12 8 20

Old 4 16 20

16 24 N=40Gro

up

Page 18: Lecture 5 – Categorical Data and Survival Analyses

CDA Summary with Contingency TableCDA Summary with Contingency Table

• Research question

Is there a relationship between Group and Attacked Heart?

• Better to convert the table into percentages (easier to see)

Page 19: Lecture 5 – Categorical Data and Survival Analyses

What the data might look like…What the data might look like…

Page 20: Lecture 5 – Categorical Data and Survival Analyses

MI No MI

New TRT 12 8

Old TRT 4 16

Step 1. Breakdown the frequenciesStep 1. Breakdown the frequencies

Page 21: Lecture 5 – Categorical Data and Survival Analyses

(Cell %)Row %Col %

No MI MI Total

New TRT

12(30%)60%75%

8( 20%)40%33%

20

Old TRT

4(10%)20%25%

16(40%)80%67%

20

Total 16 24 40

Step 2. Get the different %’sStep 2. Get the different %’s

Page 22: Lecture 5 – Categorical Data and Survival Analyses

Row vs. Column %’s: It’s your choiceRow vs. Column %’s: It’s your choice

• Row %’s: – 40% of New trt patients had MI vs. 80% of

Old trt patients had MI

• Col %’s:– 75% of No MI were in the New trt group vs. 33% of

MI were in New trt group

• P-value for test of association is the same!

Page 23: Lecture 5 – Categorical Data and Survival Analyses

Tests for AssociationTests for Association

Page 24: Lecture 5 – Categorical Data and Survival Analyses

CDA tests for AssociationCDA tests for Association

• Is there a significant association between Group and MI?

• What is a good way to test for an association between the two?

Page 25: Lecture 5 – Categorical Data and Survival Analyses

Test for significant differencesTest for significant differences

• The most common tests are the Chi-square test and Fisher’s Exact test.

• Research question: Is there an association between treatment group and MI?

• To answer this: Compare what you would expect if there was no

association to what you observed

Page 26: Lecture 5 – Categorical Data and Survival Analyses

No MI MI Total

New TRT20

Old TRT 20

Total 40

Expect if no relationship?Expect if no relationship?

Page 27: Lecture 5 – Categorical Data and Survival Analyses

No MI MI Total

New TRT20

Old TRT 20

Total 16 24 40

Expect if no relationship?Expect if no relationship?

Page 28: Lecture 5 – Categorical Data and Survival Analyses

(Cell %)Row %Col %

No MI MI Total

New TRT

8(20%)40%50%

12( 30%)60%50%

20

Old TRT

8(20%)40%50%

12(30%)60%50%

20

Total 16 24 40

Same % with MI by GroupSame % with MI by Group

Page 29: Lecture 5 – Categorical Data and Survival Analyses

Test for significant differencesTest for significant differences

• Have exact same response % would favor “no association”

• There is another general way to calculate what you “expect”

• Use Row totals, Column totals, Grand total to calculate “Expected” frequencies

Page 30: Lecture 5 – Categorical Data and Survival Analyses

Observed vs. Expected FrequenciesObserved vs. Expected Frequencies

• Observed frequencies = actual counts

• “Expected” frequencies:

= Row total x Column total / Grand total (why?)

Page 31: Lecture 5 – Categorical Data and Survival Analyses

Actual No MI MI Total

New TRT 12 8 20

Old TRT 4 16 20

Total 16 24 40

What you actually observed in StudyWhat you actually observed in Study

Page 32: Lecture 5 – Categorical Data and Survival Analyses

ActualExpected No MI MI Total

New TRT 128

812

20

Old TRT 48

1612

20

Total 16 24 40

““Expected” frequenciesExpected” frequencies

Page 33: Lecture 5 – Categorical Data and Survival Analyses

Chi-square testChi-square test

• Quantify if the actual frequencies are far enough away from the Expected (assuming no association)

• We can quantify using the Chi-square test statistic

• We can get the p-value to determine if there is a significant association.

Page 34: Lecture 5 – Categorical Data and Survival Analyses

Chi-square test for association in RxC tableChi-square test for association in RxC table

• H0: There is no association between row and columns

• The classic Pearson’s chi-squared test of independence

• For a 2x2 table, df = (2-1) x (2-1) = 1• Conservatively, we require expected ≥ 5 for all i, j

OVERALL

COLROWij Total

TotalTotal *expected

21

2

1

2

1

2)(

dist

i j ij

ijij

Expected

ExpectedObserved

Page 35: Lecture 5 – Categorical Data and Survival Analyses

Chi-square TestChi-square Test

67.6

12

1216

8

84

12

128

8

812

)(

2222

2

1

2

1

22

i j ij

ijijTS Expected

ExpectedObserved

•Associated P-value for this Chi-square value is p=0.0098.

Thus, we conclude group and MI are significantly associated (given α = 0.05).

Page 36: Lecture 5 – Categorical Data and Survival Analyses

ActualExpected No MI MI Total

New TRT 128

812

20

Old TRT 48

1612

20

Total 16 24 40

““Expected” frequenciesExpected” frequencies

Page 37: Lecture 5 – Categorical Data and Survival Analyses

Fisher’s Exact TestFisher’s Exact Test

• Fisher’s Exact test will test similar hypotheses as the Chi-square test.

• Use Fisher’s Exact test when assumptions of Chi-square test are not satisfied.

• That is, when you have Expected < 5 (basically implying when cell sample size is small).

Page 38: Lecture 5 – Categorical Data and Survival Analyses

Confidence Intervals for Confidence Intervals for %’s%’s

Page 39: Lecture 5 – Categorical Data and Survival Analyses

Confidence Interval for %’sConfidence Interval for %’s

• You conduct your follow-up after CABG study and accrue 40 patients.

• After 3 years 20 out of all 40 patients have had a MI.

• Q1. What is your best guess at the true (population) MI rate at 3 years? A. Based on your sample, 20/40 = 50%

Page 40: Lecture 5 – Categorical Data and Survival Analyses

Sampling VariabilitySampling Variability

MI at 3 yrs = ?MI = 50%

Inference

Sample

Population

Page 41: Lecture 5 – Categorical Data and Survival Analyses

Sampling VariabilitySampling Variability

MI at 3 yrs = ?MI = 44%

Inference

Sample

Population

Page 42: Lecture 5 – Categorical Data and Survival Analyses

Confidence Interval for %’sConfidence Interval for %’s

• A good way to make inference about what the range of plausible values of the population % is to calculate a Confidence Interval (CI).

• Q2. How much precision do you have in terms of estimating the MI rate at 3 yrs. in the population based on your sample?

Page 43: Lecture 5 – Categorical Data and Survival Analyses

95% Confidence Intervals95% Confidence Intervals

• 95% Confidence Interval for Mean:

• 95% Confidence Interval for Proportion (Standard “Wald” CI):

n

sdX 2

n

ppp

ˆ1ˆ2ˆ

Page 44: Lecture 5 – Categorical Data and Survival Analyses

Confidence Interval for %’sConfidence Interval for %’s

• Q2. How much precision do you have in terms of estimating the MI rate in the population based on your sample? (Remember, 20 of 40 total had MI)

A. A 95% Wilson CI for population MI rate is (35.2%, 64.8%).

Thus, if we have repeated our study over and over again, each time drawing a sample of 40 patients, then the true population MI rate at 3 yrs. would be between 35.2% and 64.8% approximately 95% of the time.

Page 45: Lecture 5 – Categorical Data and Survival Analyses

Confidence Interval for %’sConfidence Interval for %’s

• What’s interesting is that there are “lucky” and “unlucky” combinations of p (response rate) and N (sample size)

• That is, for a given sample size: * for some p you may higher ability to make inference

* for some p you may have less ability!

• Not to scald the Wald, but not all CI’s are created equal

• Paper

Page 46: Lecture 5 – Categorical Data and Survival Analyses

Modeling in CDAModeling in CDA

Page 47: Lecture 5 – Categorical Data and Survival Analyses

Modeling in CDAModeling in CDA

• Modeling is done with variations of Logistic Regression:• Dichotomous• Ordinal (Proportional odds)• Nominal (Generalized logit)• Conditional (Matched-pairs)• Exact (small sample size/rare outcome)• Longitudinal (GEE, GLMM)

• Simple (1 predictor) vs. Multivariable (>1 predictor/adjusted)

Page 48: Lecture 5 – Categorical Data and Survival Analyses

Why use adjusted analysis?Why use adjusted analysis?

• Do you think patient demographics or clinical characteristics at baseline would affect MI?

• What if half of the patients are all <30 yrs. old and half are all >80 yrs. old?

• What are some possible confounders of response? Effect modifiers?

• These are testable in adjusted analyses.

Page 49: Lecture 5 – Categorical Data and Survival Analyses

You may not need adjusted.You may not need adjusted.

• Typically have well-defined specific patient populations of interest.

• Thus, inclusion/exclusion criteria might have removed variability from potential confounders

• A well designed, well executed trial usually does not require intensive and complex analysis.

Page 50: Lecture 5 – Categorical Data and Survival Analyses

What is Logistic Regression?What is Logistic Regression?

• In a nutshell:

A statistical method used to model dichotomous or binary outcomes (but not limited to) using predictor variables.

Used when the research method is focused on whether or not an event occurred, rather than when it occurred (time course information is not used).

Page 51: Lecture 5 – Categorical Data and Survival Analyses

What is Logistic Regression?What is Logistic Regression?

• What is the “Logistic” component?

Instead of modeling the outcome, Y, directly, the method models the Pr(Y) using the logistic function.

Page 52: Lecture 5 – Categorical Data and Survival Analyses

What is Logistic Regression?What is Logistic Regression?

• What is the “Regression” component?

Methods used to quantify association between an outcome and predictor variables. Could be used to build predictive models as a function of predictors.

Page 53: Lecture 5 – Categorical Data and Survival Analyses

What can we use Logistic Regression for?What can we use Logistic Regression for?

• To estimate adjusted prevalence rates, adjusted for potential confounders

(sociodemographic or clinical characteristics)

• To estimate the effect of a treatment on a dichotomous outcome, adjusted for other covariates

• Explore how well characteristics predict a categorical outcome

Page 54: Lecture 5 – Categorical Data and Survival Analyses

Fig 1. Logistic regression curves for the three drug combinations. The dashed reference line represents the probability of DLT of .33. The estimated MTD can be obtained as the value on the horizontal axis that coincides with a vertical line drawn through the point where the dashed line intersects the logistic curve. Taken from “Parallel Phase I Studies of Daunorubicin Given With Cytarabine and Etoposide With or Without the Multidrug Resistance Modulator PSC-833 in Previously Untreated Patients 60 Years of Age or Older With Acute Myeloid Leukemia: Results of Cancer and Leukemia Group B Study 9420” Journal of Clinical Oncology, Vol 17, Issue 9 (September), 1999: 283. http://www.jco.org/cgi/content/full/17/9/2831

Page 55: Lecture 5 – Categorical Data and Survival Analyses

Logistic Regression quantifies “effects” Logistic Regression quantifies “effects” using Odds Ratiosusing Odds Ratios

• Does not model the outcome directly, which leads to effect estimates quantified by means (i.e., differences in means)

• Estimates of effect are instead quantified by “Odds Ratios”

Page 56: Lecture 5 – Categorical Data and Survival Analyses

Logistic Regression &Logistic Regression &Odds Ratio (OR)Odds Ratio (OR)

• The odds ratio is equally valid for retrospective, prospective, or cross-sectional sampling designs

• That is, regardless of the design it estimates the same population parameter

(not true for Relative Risk)

Page 57: Lecture 5 – Categorical Data and Survival Analyses

Relationship between Relationship between Odds & ProbabilityOdds & Probability

Probability eventOdds event =

1-Probability event

Odds eventProbability event

1+Odds event

Page 58: Lecture 5 – Categorical Data and Survival Analyses

The Odds RatioThe Odds RatioDefinition of Odds Ratio: Ratio of two odds estimates.

Example:

Suppose 16 out of 40 people in the trt group had a MI and only 5 out of 25 in the placebo group had a MI.

16Pr MI | trt group 0.40

40

5Pr MI | placebo group 0.20

25

Page 59: Lecture 5 – Categorical Data and Survival Analyses

The Odds RatioThe Odds Ratio

Example Cont’d:

So, if Pr(MI | trt) = 0.40 and Pr(MI | placebo) = 0.20

Then:

0.40Odds MI | trt group 0.667

1 0.40

0.20Odds MI | placebo group 0.25

1 0.20

0.667 OR Trt vs. Placebo 2.67

0.25

Page 60: Lecture 5 – Categorical Data and Survival Analyses

Interpretation of the Odds RatioInterpretation of the Odds Ratio

•Example cont’d:

Outcome = MI, 67.2OR Plb Trt vs.

Then, the odds of a MI in the treatment group were estimated to be 2.67 times the odds of having a MI in the placebo group.

Alternatively, the odds of having a MI were 167% higher in the treatment group than in the placebo group.

Page 61: Lecture 5 – Categorical Data and Survival Analyses

Odds Ratio vs. Relative RiskOdds Ratio vs. Relative Risk

• An Odds Ratio of 2.67 for trt. vs. placebo does NOT mean that MI is 2.67 times as LIKELY to occur.

• It DOES mean that the ODDS of MI are 2.67 times as high for trt. vs. placebo.

Page 62: Lecture 5 – Categorical Data and Survival Analyses

Odds Ratio vs. Relative RiskOdds Ratio vs. Relative Risk

• The Odds Ratio is NOT mathematically equivalent to the Relative Risk (Risk Ratio)

• However, for “rare” events, the Odds ratio can approximate the Relative risk (RR)

1-P MI | plbOR=RR

1-P MI | trt

Page 63: Lecture 5 – Categorical Data and Survival Analyses

The Logistic Regression ModelThe Logistic Regression Model

0 1 1 2 2 K K

0 1 1 2 2 K K

Logistic Regression:

P Yln

1-P Y

Linear Regression:

Y

X X X

X X X

Page 64: Lecture 5 – Categorical Data and Survival Analyses

The Logistic Regression ModelThe Logistic Regression Model

0 1 1 2 2 K K

P Yln

1-P YX X X

predictor variables

YP1

YPln is the log(odds) of the outcome.

dichotomous outcome

Page 65: Lecture 5 – Categorical Data and Survival Analyses

The Logistic Regression ModelThe Logistic Regression Model

0 1 1 2 2 K K

P Yln

1-P YX X X

intercept

YP1

YPln is the log(odds) of the outcome.

model coefficients

Page 66: Lecture 5 – Categorical Data and Survival Analyses

The Logistic Regression ModelThe Logistic Regression Model

0 1 1 2 2 K K

0 1 1 2 2 K K

0 1 1 2 2 K K

P Yln

1-P Y

expP Y

1 exp

X X X

X X X

X X X

In this latter form, the logistic regression model directly relates the probability of Y to the predictor variables.

Page 68: Lecture 5 – Categorical Data and Survival Analyses

Extensions of Extensions of Logistic RegressionLogistic Regression

• Outcomes with more than 2 categories (polytomous or polychotomous)

• Cumulative logit model – Proportional odds model for ordinal outcomes (ordered categories)

• Generalized logit model for nominal outcomes or non-proportional odds models (unordered categories)

Page 69: Lecture 5 – Categorical Data and Survival Analyses

Extensions of Extensions of Logistic RegressionLogistic Regression

• Ordinal Logistic Regression model:

– Fits a logistic regression model with g-1 intercepts for a g category outcome and one model coefficient for each predictor

– Models cumulative probability of being in a “higher” category

Page 70: Lecture 5 – Categorical Data and Survival Analyses

Discharge Location as OrdinalDischarge Location as Ordinal(Died, Assisted, Home)(Died, Assisted, Home)

• There is no law that says you can’t model all categories of Discharge Location

• Ordinal logistic regression example:

Predictor OR P-value

Trt vs. Control 1.24 0.036

M vs. F 0.87 0.163

Page 71: Lecture 5 – Categorical Data and Survival Analyses

Ordinal Outcome Ordinal Outcome (Died, Assisted, Home)(Died, Assisted, Home)

OR(Trt vs C) = 1.24 means there was 24% higher odds of being in a higher DL category for Treatment vs. Control (adjusting for gender).

OR(M vs F) = 0.87 means there was 13% lower odds of being in a higher DL category for Males vs. Females (adjusting for Trt group).

Predictor OR P-value

Trt vs. Control 1.24 0.036

M vs. F 0.87 0.163

Page 72: Lecture 5 – Categorical Data and Survival Analyses

Extensions of Extensions of Logistic RegressionLogistic Regression

• Nominal Logistic Regression Model:

– Fits a logistic regression model with g-1 intercepts and g-1 model coefficients for a g category outcome

– Model captures the multinomial probability of being in a particular category using generalized logits

Page 73: Lecture 5 – Categorical Data and Survival Analyses

Nominal Logistic RegressionNominal Logistic Regression• Doesn’t make “Proportional odds” assumption

• Separate OR’s for C-1 categories of C category outcome (get OR for every group except Referent)

• Example:

Predictor OR P-value

Trt vs. Control

Home vs. Died

Assisted vs. Died

1.22

1.56

Overall=0.048

0.236

0.034

Page 74: Lecture 5 – Categorical Data and Survival Analyses

Nominal Logistic RegressionNominal Logistic Regression

• Thus, there was 56% higher odds of being discharged to Assisted Living compared to Dying for Trt. vs. Control.

Predictor OR P-value

Trt vs. Control

Home vs. Died

Assisted vs. Died

1.22

1.56

Overall=0.048

0.236

0.034

Page 75: Lecture 5 – Categorical Data and Survival Analyses

Extensions of Extensions of Logistic RegressionLogistic Regression

• Longitudinal data / repeated measures data / Clustered data with binary outcomes

• Multilevel models (nested data structures)

GEE (Generalized Estimating Equations)GLMM (Generalized Linear Mixed Models)

Page 76: Lecture 5 – Categorical Data and Survival Analyses

Repeated Measures /Repeated Measures /Longitudinal dataLongitudinal data

• Longitudinal data = data on subjects over time

• Repeated measures need to be taken into account when testing for differences

• Need to investigate correlation of repeated measures

Page 77: Lecture 5 – Categorical Data and Survival Analyses

Extensions to Extensions to Logistic RegressionLogistic Regression

• Exact Logistic Regression

• Small Sample Size

• Adequate sample size but rare event (sparse data)

Page 78: Lecture 5 – Categorical Data and Survival Analyses
Page 79: Lecture 5 – Categorical Data and Survival Analyses

Questions?

Page 80: Lecture 5 – Categorical Data and Survival Analyses

Part II. Analysis of Time-to-event Data

(A.K.A., Survival Analysis)

Page 81: Lecture 5 – Categorical Data and Survival Analyses

What do we mean by Time?

• Length of follow-up till the event of interest occurs

• Follow-up can start at (for example)1. Randomization into a clinical trial2. Time of employment3. First contact on record in retrospective cohort

• Age of the individual at the time of the event

Page 82: Lecture 5 – Categorical Data and Survival Analyses

What is Survival Analysis?• Survival analysis is a collection of statistical

analysis techniques where the outcome is time to an event.

• Survival or time-to-event outcomes are defined by the pair of random variables (ti, δi) that give the observation time and an indicator of whether or not the event occurred

Page 83: Lecture 5 – Categorical Data and Survival Analyses

What do we mean by Event?• Usually we mean death – thus the name

“survival” analysis• Other examples:

– Cancer relapse or recurrence– Disease incidence

• Can also be a positive outcome:– Discharge from psychiatric counseling– Normalization of WBC count(in these examples, death would be a censored

outcome)

Page 84: Lecture 5 – Categorical Data and Survival Analyses

Censoring

• In the pair of random variables (ti, δi) that constitute survival outcomes:– ti is an observed variable representing time (e.g.,

actual time until death or time until last follow-up)– δi is a Bernoulli random variable (0,1) or indicator

of whether the observation is censored or not – 1 if we observed a failure, 0 if we have a censored observation

Page 85: Lecture 5 – Categorical Data and Survival Analyses

Censoring Occurs• When we have incomplete information about the

exact survival time due to a random factor– Non-informative censoring – whether an observation is

censored or not is independent of the value of the observation.

– Informative censoring – whether an observation is censored or not is dependent on the value of the observation

– E.g., we use dates seen in clinic provide censoring times (without attempted phone contact to verify vital status)

• We will require non-informative censoring mechanisms. If censoring is informative, then these methods will generate biased results.

Page 86: Lecture 5 – Categorical Data and Survival Analyses

Types of censoring

• Right censoring – true survival time is greater than what we observed

• Left censoring – true survival time is less than what we observed (less common)

• Interval censoring – subjects are not observed continuously and we only know the event happened between time A and time B (e.g., annual testing of partner of an HIV+ individual)

Page 87: Lecture 5 – Categorical Data and Survival Analyses

Three common reasons for right censoring

• Person does not experience the event before the study ends

• Person is lost to follow-up during the study period

• Person withdraws from the study because of death (if death is not the outcome of interest) or some other reason (e.g., adverse drug reaction)

Page 88: Lecture 5 – Categorical Data and Survival Analyses

What do the data look like?

end of study

drop out

5 10 15 20

2

1

3

4

5

0

event occurred

Page 89: Lecture 5 – Categorical Data and Survival Analyses

Example: Survival of Patients With Renal Resistive Index ≥ 0.8

• 86 hypertensive patients with open or percutaneous repair of RVD

• RA resistive index (RI) defined as 1-EDV/PSV in the D-segment

• RI dichotomized: <0.8 or ≥0.8• Outcome of interest is time to death (any

cause)

Page 90: Lecture 5 – Categorical Data and Survival Analyses

Example: Survival of Patients With Renal Resistive Index ≥ 0.8

Hosp UNITNO

Preop RI

<08 or ≥ 0.8Time to Event

Death Indicator

006-52-90 1 41.0 1

013-39-46 1 77.1 0

014-57-80 0 112.3 0

022-39-81 0 90.3 0

023-80-78 0 88.3 0

026-66-88 0 104.3 0

028-75-46 0 57.8 0

030-10-56 0 26.8 0

033-83-68 0 38.1 1

034-30-42 1 90.9 0

δi = 1 if death

δi = 0 if censoredRI=1 if ≥ 0.8

RI=0 if < 0.8

Page 91: Lecture 5 – Categorical Data and Survival Analyses

Survival Distribution

• Distribution of times to event – called “survival times,” even when the “event” is not “death”

• Let T = survival time (T ≥ 0) t = specified value for T

• Survival times follow a continuous distribution with times ranging from zero to infinity

• Ordinary methods for estimating and comparing continuous distributions cannot be used with survival data due to the presence of censoring

Page 92: Lecture 5 – Categorical Data and Survival Analyses

Probability Density Function f (t )

0

1( ) lim [ ]

tf t P t T t t

t

Difficult to estimate density directly because of censoring – histogram not direct estimate of f(t)

Cumulative Distribution Function F(t )

0

( ) [ ] ( )t

F t P T t f s ds Defined in the same way we would any CDF

Page 93: Lecture 5 – Categorical Data and Survival Analyses

Survival Function S(t )

0

( ) Pr[ ] ( ) 1 ( ) 1 ( )t

t

S t T t f u du f u du F t

• Monotone non-increasing function• S(0) = 1• S(+∞) = 0

Hazard Function λ(t)

Instantaneous death rate at time t, given alive at time t

0

0

Prob event in ( ) given survived to ( ) lim

Pr( | )lim

t

t

t, t t tt

tt T t t T t

t

Page 94: Lecture 5 – Categorical Data and Survival Analyses

Hazard Function λ(t )• So, you survived to time t, what is the probability

that you survive another increment of time t?• Standardize this conditional probability to a per unit

of time.• As unit of time gets very small (i.e., goes to 0) this

conditional probability becomes an instantaneous rate.

• Some simple features of λ(t)– λ(t) takes on values in the interval (0, ∞)– λ(t) could be instantaneously increasing, decreasing, or

constant

Page 95: Lecture 5 – Categorical Data and Survival Analyses

Survival Distribution

• Any one of these four functions is enough to specify the survival distribution. There exists an equivalence relationship between the them.

• Survival analysis techniques focus on survival distribution S(t) and hazard rate λ(t)– When λ(t) is high, S(t) decreases faster.– When λ(t) is low, S(t) decreases slower.

Page 96: Lecture 5 – Categorical Data and Survival Analyses

How do censored cases affect survival estimation?

• Censored patients do not make the survival curve drop in steps

• Censored cases do reduce the number of patients left who are contributing to the survival curve

• Thus every event after that censored case will result in a “larger” step down than it would have been without the censored case

Page 97: Lecture 5 – Categorical Data and Survival Analyses

• The reduction in sample size due to amount of censored cases present will result in reduced reliability of the estimates of survival.

• That is, larger amount of censored cases make wider CI’s about survival estimates.

• End of the survival curve is most affected yet is of great interest

How do censored cases affect the survival estimation?

Page 98: Lecture 5 – Categorical Data and Survival Analyses

Survival Estimation: The Kaplan-Meier (K-M) Method

• Also known as “Product-limit” Method• Most popular method of estimating

survival / time-to-event• Good statistical properties: estimates

converge to true survival distribution as sample size grows

• Nonparametric - does not require knowledge of the underlying distribution

Page 99: Lecture 5 – Categorical Data and Survival Analyses

K-M Estimation: How it Works• Order death/censoring times from smallest to

largest• Update survival estimate at each distinct

failure time

( ) ( ) ( )1

( 1) ( ) ( )

ˆ ˆ( ) [ | ]

ˆ ˆ( ) ( | )

j

j i ii

j j j

S t P T t T t

S t P T t T t

Page 100: Lecture 5 – Categorical Data and Survival Analyses

K-M Estimation: RI Example

Product-Limit Survival Estimates

Time (months)

Censored (*) Survival

Survival SE

NumberFailed

NumberLeft

0.000 1.0000 0 0 272.628 * . . 0 267.392 0.9615 0.0377 1 25

13.470 0.9231 0.0523 2 2414.587 * . . 2 2316.164 0.8829 0.0636 3 2216.296 0.8428 0.0722 4 21

ˆ

S(t) = (prop. alive after this death) (surv. estimate at prior death)21

= 0.8829 22

Page 101: Lecture 5 – Categorical Data and Survival Analyses

Pro

po

rtio

n A

live

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Months Post-surgery

0 6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96

Plot of K-M Survival Estimate Group with RI≥0.8

Survival estimate updated at each death time

Open diamonds mark censoring times

Page 102: Lecture 5 – Categorical Data and Survival Analyses

Assumptions of Kaplan-Meier

• Non-informative Censoring: The probability of being censored does not depend upon a patient’s prognosis for the event.

• Deaths of patients in a sample occur independently of each other

• Does not make assumptions about the distribution of survival times

Page 103: Lecture 5 – Categorical Data and Survival Analyses

Before Kaplan-Meier• Life-table (“actuarial”) method of

estimating time to death

• Break follow-up time into pre-defined intervals

• Number of subjects alive at beginning of interval

• Number of subjects dying during interval

• Estimate survival in similar fashion to Kaplan-Meier

Page 104: Lecture 5 – Categorical Data and Survival Analyses

Testing for difference between two survival curves: Log-rank test

• Are two survivor curves the same?• Use the times of events: t1, t2, ... (do not include censoring

times)• Treat each event and its “set of persons still at risk” (i.e., risk

set) at each time tj as an independent table

• Make a 2×2 table at each tj (i.e., each distinct death time)

Event No Event Total

Group A aj njA- aj njA

Group B cj njB-cj njB

Total dj nj-dj nj

Page 105: Lecture 5 – Categorical Data and Survival Analyses

Log-rank test for comparing survivor curves

• At each event time t j, under assumption of equal survival (i.e., SA(t) = SB(t) ), the expected number of events in Group A out of the total events (dj=aj +cj) is proportional to the numbers at risk in group A to the total at risk at time tj:

E(aj)= dj x njA / nj

• Differences between aj and E(aj) represent evidence against the null hypothesis of equal survival in the two groups

Page 106: Lecture 5 – Categorical Data and Survival Analyses

Log-rank test for comparing survivor curves

• Use the Cochran Mantel-Haenszel idea of pooling over events j to get the log-rank chi-squared statistic with one degree of freedom

21

2

2 ~ˆ

)(

jj

jjj

a

aEa

raV

Page 107: Lecture 5 – Categorical Data and Survival Analyses

Log-rank test for comparing survivor curves

• Idea summary:– Create a 2x2 table at each uncensored failure time– The construct of each 2x2 table is based on the

corresponding risk set– Combine information from all the tables

• The null hypothesis is SA(t) = SB(t) for all times t (i.e., tests for differences across entire distribution)

Page 108: Lecture 5 – Categorical Data and Survival Analyses

(N=84) (N=76) (N=63) (N=48) (N=37) (N=30) (N=15) (N=8) (N=3)

Pro

po

rtio

n A

live

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Months Post-surgery

0 12 24 36 48 60 72 84 96 108 120

RI < 0.8

RI ≥ 0.8

Resistive Index example

Log-rank Χ2= 15.2 p-value<0.0001

Page 109: Lecture 5 – Categorical Data and Survival Analyses

Other Tests to Compare Survival Curves You May Encounter

• Wilcoxon (a.k.a., Peto) Test– Weights analysis by the number of subjects at risk at

each distinct death time– More sensitive than log-rank to early differences in

survival curves; log-rank is more sensitive to late differences in curves

• Likelihood Ratio Test– Assumes exponential distribution– Optimal if survival is, in fact, exponentially distributed

Page 110: Lecture 5 – Categorical Data and Survival Analyses

From Stratification to Modeling

• Goal: extend survival analysis to an approach that allows for multiple covariates of mixed forms (i.e., continuous, ordinal and nominal categorical)

• We have two options for our expansion– Model the survival function or time– Model the hazard function (between 0 to ∞)

We will model the hazard function

Page 111: Lecture 5 – Categorical Data and Survival Analyses

What are Proportional Hazards?

• The constant C does not depend on time• The model is multiplicative

C11 2

2

( | )( | ) ( | )

( | )

tC S t S t

t

x

x xx

Page 112: Lecture 5 – Categorical Data and Survival Analyses

Cox Proportional Hazards Model• D.R. Cox assumed proportionality was constant

across time and proposed the following model:

where λ0(t) is the baseline hazard and involves t but not X

• is the exponential function; involves

X’s but not t (as long as the are time independent)

)exp()();(

p

iiio xtXt

1

)exp(

p

iii x

1

Page 113: Lecture 5 – Categorical Data and Survival Analyses

Cox Proportional Hazards Model

• The regression model for the hazard function as a function of p explanatory (X) variables is specified as follows:log hazard: log λ(t; X) = log h0(t) + 1X1 + 2X2 + … + pXp

hazard: pp2211 XβXβXβ

o e...ee(t)λX)λ(t;

Page 114: Lecture 5 – Categorical Data and Survival Analyses

Cox Proportional Hazards Model

• Interpretation of – The relative hazard (i.e., hazard ratio) associated

with a 1 unit change in X1 (i.e., X1+1 vs. X1), holding other Xs constant, independent of time

– The relative (instantaneous) risk for X1+1 vs. X1, holding other Xs constant, independent of time

• Other ’s have similar interpretations

1βe

Page 115: Lecture 5 – Categorical Data and Survival Analyses

Cox Proportional Hazards Model• “multiplies” the baseline hazard λ0(t) by the

same amount regardless of the time t, thus we have a “proportional hazards” model – the effect of any (fixed) X is the same at any time during follow-up

• is the focus whereas λ 0(t) is a nuisance variable

• Cox (1972) showed how to estimate without having to assume a model for λ 0(t) (e.g., estimating λ 0(t) with a step function)

• Let # steps get large —partial likelihood for depends on , not λ0(t)

1e

Page 116: Lecture 5 – Categorical Data and Survival Analyses

Partial likelihood

• The likelihood function used in Cox PH models is called a partial likelihood

• We use only the part of the likelihood function that contains the ’s

• It depends only on the ranks of the data and not the actual time values.

Page 117: Lecture 5 – Categorical Data and Survival Analyses

Partial likelihood• Let the survival times (times to failure) be:

t1 < t2 < ... < tk

• And let the “risk sets” corresponding to these times be:R1, R2, ..., Rk where Rj = list of persons at risk just before tj

• The “partial likelihood” for is

(Assumes no ties in event times)• To estimate , find the values of s that maximize L() above.

k

i

Rj

XXX

XXX

i

pjpjj

pipii

e

eL

1...

...

2211

2211

)(

Page 118: Lecture 5 – Categorical Data and Survival Analyses

Partial likelihood

• Why does the partial likelihood make sense?

• Choose so that the one who failed at each time was most likely - relative to others who might have failed!

it at failed have could who ones of hazardsperson failed of hazard

i

pjpjj

pipii

i

pjpjj

pipii

Rj

XXXi

XXXi

Rj

XXX

XXX

et

et

e

e

...

...

...

...

)(

)(2211

2211

2211

2211

0

0

Page 119: Lecture 5 – Categorical Data and Survival Analyses

Some General Comments Thoughts

• Similar to logistic regression, a simple function of the has a particularly nice interpretation

• can be interpreted as a relative risk (risk ratio) for a one unit change in the predictor

e

β

0.60

0.60

ˆ 0.60 0.55 (protective effect)

ˆ 0.60 1.82 (increased risk)

e

e

Page 120: Lecture 5 – Categorical Data and Survival Analyses

Some General Comments Thoughts

• Estimates of βs are asymptotically normal (i.e., are normally distributed)

• Two important implications of asymptotic normality– We can use the likelihood ratio, score, and Wald tests to

make inference about our data – Wald test: “thing/SE(thing)”– We can use the usual method to construct a 95%

confidence intervalˆ ˆ1.96 ( )SEe

Page 121: Lecture 5 – Categorical Data and Survival Analyses

Resistive Index: Univariable PH Regression

Summary of the Number of Event and Censored Values

Total Event CensoredPercent

Censored86 22 64 74.42

Testing Global Null Hypothesis: BETA=0

TestChi-

Square DF Pr > ChiSqLikelihood Ratio 12.6328 1 0.0004

Score 15.2341 1 <.0001

Wald 12.6195 1 0.0004

Analysis of Maximum Likelihood Estimates

Parameter DFParameter

EstimateStandard

Error Chi-Square Pr > ChiSqHazard

Ratio

95% Hazard Ratio

Confidence Limits

PRERIGEP8 1 1.56144 0.43954 12.6195 0.0004 4.766 2.014 11.279

β

“Score” test is equivalent to log-rank

ˆ ˆ1.96 ( )SEe ˆ

e

Page 122: Lecture 5 – Categorical Data and Survival Analyses

Resistive Index: Multivariable PH RegressionAnalysis of Maximum Likelihood Estimates

Parameter DFParameter

EstimateStandard

Error Chi-Square Pr > ChiSq

HazardRatio

95% Hazard Ratio Confidence

LimitsRI >=0.8 1 1.89910 0.48164 15.5471 <.0001 6.680 2.60 17.17History of Coronary Disease 1 1.65099 0.75558 4.7745 0.0289 5.212 1.19 22.92History of PVD 1 0.78338 0.44869 3.0483 0.0808 2.189 0.91 5.27Preop - Postop Num Meds 1 -0.46046 0.20628 4.9830 0.0256 0.631 0.42 0.94Failed BP Response 1 1.43046 0.57694 6.1474 0.0132 4.181 1.35 12.95

• Each effect is estimated controlling for other effects in model• Note: increase in hazard ratio for RI in multivariable model• Proportional hazards assumption should be verified

Examine survival plots by covariate values Plot –log(-log(S(t)) vs. time, should be parallel if hazards are proportional

Page 123: Lecture 5 – Categorical Data and Survival Analyses

(N=84) (N=76) (N=63) (N=48) (N=37) (N=30) (N=15) (N=8) (N=3)

Pro

po

rtio

n A

live

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Months Post-surgery

0 12 24 36 48 60 72 84 96 108 120

RI < 0.8

RI ≥ 0.8

Resistive Index example

Log-rank Χ2= 15.2 p-value<0.0001

Page 124: Lecture 5 – Categorical Data and Survival Analyses

Example of proportional hazards violation

Unadjusted all-cause mortality survival curve, by annual hospital volume of Medicare breast cancer cases: United States, 1994–1996

Page 125: Lecture 5 – Categorical Data and Survival Analyses

Remedial Measures for Non-proportionality

• Stratified analysis– Uses stratification to control for the non-

proportional factor– Removes factor as covariate (i.e., get no effect

estimate)

• Add time dependent covariate– Time dependent covariates change value over

time– Can use indicator to fit new effect after change

point (e.g., at 40 months in previous plot)

Page 126: Lecture 5 – Categorical Data and Survival Analyses