
MODEL SELECTION STRATEGIES

Tony Panzarella


Preamble

• Although the focus will be on time-to-event data, the same principles apply to other outcome data

Lab Course March 20, 2014 2


Developing a multivariable prediction model

• Select clinically relevant predictors for possible inclusion in the model
• Evaluate the quality of the data and how to handle missing data
• Data handling decisions
• Choosing a strategy for selecting the important variables in the final model
• Deciding how to model continuous variables
• Selecting measures of model performance or predictive accuracy


AUTOMATIC SELECTION ROUTINES


Forward Selection

• Variables are added to the model one at a time
• At each stage the variable added is the one which gives the largest decrease in the value of -2LogL on its inclusion
• The process ends when each of the remaining variables fails to reduce -2LogL by a pre-specified amount (typically couched as a significance level)
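The forward selection loop can be sketched in a few lines of Python. This is a minimal illustration, not the routine used by any particular package: `neg2loglik` is a hypothetical user-supplied function that fits a model and returns its -2LogL, the toy reductions are invented numbers, and the threshold 3.84 (the 5% critical value of a chi-square with 1 degree of freedom) is an assumed choice of the pre-specified amount.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_select(candidates: Iterable[str],
                   neg2loglik: Callable[[FrozenSet[str]], float],
                   min_drop: float = 3.84) -> List[str]:
    """Greedy forward selection on -2LogL (sketch)."""
    selected: List[str] = []
    remaining = list(candidates)
    current = neg2loglik(frozenset())  # null model
    while remaining:
        # Fit every one-variable extension; the best one has the smallest -2LogL.
        fits = {v: neg2loglik(frozenset(selected + [v])) for v in remaining}
        best = min(fits, key=fits.get)
        if current - fits[best] < min_drop:
            break  # no remaining variable reduces -2LogL by enough
        selected.append(best)
        remaining.remove(best)
        current = fits[best]
    return selected


# Toy stand-in for a model fit (invented numbers): each variable lowers
# -2LogL by a fixed amount, regardless of what else is in the model.
toy_drop = {"LogBUN": 8.0, "HGB": 4.5, "Age": 1.0}


def toy_fit(subset: FrozenSet[str]) -> float:
    return 200.0 - sum(toy_drop[v] for v in subset)


chosen = forward_select(toy_drop, toy_fit)
print(chosen)
```

In this toy setting the third variable is not added because its reduction of 1.0 falls short of the threshold.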


Backward elimination

• Full model is fit first
• Variables are excluded one at a time
• At each stage the variable omitted is the one that increases -2LogL by the smallest amount by its exclusion
• The process ends when the next candidate for deletion increases the value of -2LogL by more than a pre-specified amount.
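Backward elimination can be sketched the same way. As before, `neg2loglik` is a hypothetical function returning the -2LogL of a fitted model, the toy numbers are invented, and 3.84 is an assumed value for the pre-specified amount.

```python
from typing import Callable, FrozenSet, Iterable, List


def backward_eliminate(variables: Iterable[str],
                       neg2loglik: Callable[[FrozenSet[str]], float],
                       max_rise: float = 3.84) -> List[str]:
    """Greedy backward elimination on -2LogL (sketch)."""
    kept = list(variables)
    current = neg2loglik(frozenset(kept))  # full model is fit first
    while kept:
        # Refit with each variable left out; the weakest raises -2LogL the least.
        fits = {v: neg2loglik(frozenset(x for x in kept if x != v)) for v in kept}
        weakest = min(fits, key=fits.get)
        if fits[weakest] - current > max_rise:
            break  # deleting the next candidate would cost too much
        kept.remove(weakest)
        current = fits[weakest]
    return kept


# Toy stand-in for a model fit (invented numbers).
toy_drop = {"LogBUN": 8.0, "HGB": 4.5, "Age": 1.0}


def toy_fit(subset: FrozenSet[str]) -> float:
    return 200.0 - sum(toy_drop[v] for v in subset)


kept = backward_eliminate(toy_drop, toy_fit)
print(kept)
```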


Stepwise

• Operates similarly to forward selection
• However, a variable that is included can be considered for exclusion at a later stage
• Thus after adding a variable, the procedure then checks whether any previously included variable can be deleted
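A rough sketch of the stepwise idea follows. The lookup table of -2LogL values is invented so that a variable added early becomes redundant later; the entry and stay thresholds (3.84 and 2.71, roughly the 5% and 10% critical values of a chi-square with 1 degree of freedom) and the `max_steps` guard are assumptions for illustration only.

```python
from typing import Callable, FrozenSet, List, Sequence


def stepwise_select(candidates: Sequence[str],
                    neg2loglik: Callable[[FrozenSet[str]], float],
                    enter: float = 3.84,
                    stay: float = 2.71,
                    max_steps: int = 100) -> List[str]:
    """Forward selection with a deletion check after every addition (sketch)."""
    selected: List[str] = []
    current = neg2loglik(frozenset())
    for _ in range(max_steps):  # guard against add/drop cycles
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        fits = {v: neg2loglik(frozenset(selected + [v])) for v in remaining}
        best = min(fits, key=fits.get)
        if current - fits[best] < enter:
            break
        selected.append(best)
        current = fits[best]
        # Deletion check: can any previously included variable now be dropped?
        while len(selected) > 1:
            refits = {v: neg2loglik(frozenset(x for x in selected if x != v))
                      for v in selected if v != best}
            weakest = min(refits, key=refits.get)
            if refits[weakest] - current >= stay:
                break
            selected.remove(weakest)
            current = refits[weakest]
    return selected


# Invented -2LogL values for every subset of three hypothetical covariates.
toy = {
    frozenset(): 100.0,
    frozenset({"A"}): 90.0,
    frozenset({"B"}): 93.0,
    frozenset({"C"}): 93.0,
    frozenset({"A", "B"}): 85.0,
    frozenset({"A", "C"}): 85.5,
    frozenset({"B", "C"}): 80.0,
    frozenset({"A", "B", "C"}): 79.0,
}

result = stepwise_select(["A", "B", "C"], lambda s: toy[s])
print(result)
```

Here A enters first, but once B and C are both in the model, removing A raises -2LogL by only 1.0, so A is deleted.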


Best Subsets

• Provides a computationally efficient way to screen all possible models
• The procedure requires a criterion to judge a model
• Given the criterion, the software screens all models containing q covariates and reports the covariates in the best, say n, models for q = 1, 2, 3, …, p, where p denotes the number of covariates
• SAS uses the score test

/* best=3 lists the three best models of each size */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC
                           Frac LogPBM Protein SCalc
         / selection=score best=3;
run;


The PHREG Procedure

Regression Models Selected by Score Criterion

Number of    Score
Variables    Chi-Square    Variables Included in Model
1             8.5164       LogBUN
1             5.0664       HGB
1             3.1816       Platelet
2            12.7252       LogBUN HGB
2            11.1842       LogBUN Platelet
2             9.9962       LogBUN SCalc
3            15.3053       LogBUN HGB SCalc
3            13.9911       LogBUN HGB Age
3            13.5788       LogBUN HGB Frac
4            16.9873       LogBUN HGB Age SCalc
4            16.0457       LogBUN HGB Frac SCalc
4            15.7619       LogBUN HGB LogPBM SCalc
5            17.6291       LogBUN HGB Age Frac SCalc
5            17.3519       LogBUN HGB Age LogPBM SCalc
5            17.1922       LogBUN HGB Age LogWBC SCalc
6            17.9120       LogBUN HGB Age Frac LogPBM SCalc
6            17.7947       LogBUN HGB Age LogWBC Frac SCalc
6            17.7744       LogBUN HGB Platelet Age Frac SCalc
7            18.1517       LogBUN HGB Platelet Age Frac LogPBM SCalc
7            18.0568       LogBUN HGB Age LogWBC Frac LogPBM SCalc
7            18.0223       LogBUN HGB Platelet Age LogWBC Frac SCalc
8            18.3925       LogBUN HGB Platelet Age LogWBC Frac LogPBM SCalc
8            18.1636       LogBUN HGB Platelet Age Frac LogPBM Protein SCalc
8            18.1309       LogBUN HGB Platelet Age LogWBC Frac Protein SCalc
9            18.4550       LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
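One way to read the score table is to look at how much the best score chi-square improves each time a covariate is added. The sketch below copies the top score for each model size from the PROC PHREG output; the helper `score_gains` is a hypothetical name, and looking at increments is offered only as an informal aid, not as a formal stopping rule.

```python
# Best score chi-square for each model size q, copied from the output above.
best_score = {1: 8.5164, 2: 12.7252, 3: 15.3053, 4: 16.9873, 5: 17.6291,
              6: 17.9120, 7: 18.1517, 8: 18.3925, 9: 18.4550}


def score_gains(best: dict) -> dict:
    """Gain in the best score statistic when moving from q-1 to q covariates."""
    sizes = sorted(best)
    return {q: round(best[q] - best[p], 4) for p, q in zip(sizes, sizes[1:])}


gains = score_gains(best_score)
for q, gain in gains.items():
    print(f"{q - 1} -> {q} covariates: best score chi-square rises by {gain:.4f}")
```

The increments shrink quickly after the first few covariates, which is the pattern a reader would typically look for when choosing a model size from this kind of output.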


Disadvantages of automatic routines

• They typically lead to one particular subset of variables, rather than a set of equally good ones
• The subsets found might be different for different selection routines
• They generally tend not to account for the hierarchic principle
• They depend on the stopping rule
• They do not foster critical thinking about the problem


Collett

• The model selection strategy depends to some extent on the purpose of the study


Collett

• Chow et al. (2002)
• Main goal: Investigate what explanatory variables, in a palliative care setting, are associated with overall survival


Collett

• Fosker et al. (2013)
• The Importance of Poor Performance Status in Personalizing Palliative Radiotherapy Towards the End of Life


Collett

Step 0: Identify a set of explanatory variables that have the potential for being included in a model

This approach assumes that all variables are considered to be on an equal footing, and there is no a priori reason to include any specific variables (like treatment).

Steps 1-4: Determine the combination of variables to be included

In practice, there will not be a unique combination of variables; there are likely to be a number of equally good models


Collett

• If the number of potential explanatory variables (including interactions, non-linear terms etc.) is not too large, it might be feasible to consider all combinations of terms

• Pay due regard to the hierarchic principle and use the statistic -2Log(Likelihood)

• Use AIC to compare possible models
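A minimal sketch of comparing candidate models by AIC, taking AIC = -2LogL + 2q with q the number of estimated coefficients. The penalty of 2 per coefficient is the conventional choice and is an assumption here, as are the -2LogL values, which are invented for illustration.

```python
def aic(neg2loglik: float, n_params: int, penalty: float = 2.0) -> float:
    """AIC = -2LogL + penalty * (number of estimated coefficients)."""
    return neg2loglik + penalty * n_params


# Hypothetical fits: (model label, -2LogL, number of coefficients).
fits = [
    ("LogBUN", 202.1, 1),
    ("LogBUN + HGB", 197.3, 2),
    ("LogBUN + HGB + Age", 196.9, 3),
]

ranked = sorted(fits, key=lambda f: aic(f[1], f[2]))
for label, n2ll, q in ranked:
    print(f"{label:<22} -2LogL = {n2ll:6.1f}  AIC = {aic(n2ll, q):6.1f}")
```

The three-covariate model has the smallest -2LogL, but the penalty for the extra coefficient puts the two-covariate model first.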


Collett

• When the number of variables is relatively large, fitting all possible models can be computationally expensive

• Automatic selection routines might seem to be an attractive option
  • Forward selection
  • Backward elimination
  • Stepwise


Collett

Step 1: Fit a univariate model for each covariate, and identify the predictors significant at some level p1, say 0.20.
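The screening in Step 1 amounts to a simple filter on the univariate p-values. In the sketch below the p-values are invented, and `screen_univariate` is a hypothetical helper name; variables that fail the screen are held back rather than discarded, since they can still be reconsidered later.

```python
def screen_univariate(p_values: dict, p1: float = 0.20):
    """Split covariates into those significant at level p1 and those held back."""
    passed = sorted((v for v, p in p_values.items() if p < p1), key=p_values.get)
    held_back = [v for v in p_values if v not in passed]
    return passed, held_back


# Hypothetical univariate p-values, one model per covariate.
p_values = {"LogBUN": 0.004, "HGB": 0.03, "Age": 0.18, "Protein": 0.62}

passed, held_back = screen_univariate(p_values)
print("carry forward:", passed)
print("hold back:", held_back)
```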


Collett

Step 2: Fit a multivariate model with all significant univariate predictors, and use backward selection to eliminate nonsignificant variables at some level p2, say 0.10.


Collett

Step 3: Starting with the final model from Step 2, consider each of the non-significant variables from Step 1 using forward selection, with significance level p3, say 0.10.


Collett

Step 4: Do final pruning of the main-effects model (omit variables that are non-significant, add any that are significant), using stepwise regression with significance level p4.


Collett

• At this stage, you may also consider adding interactions between any of the main effects currently in the model, under the hierarchical principle.


Collett

• Collett recommends using a likelihood ratio test for all variable inclusion/exclusion decisions.
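For a single coefficient, the likelihood ratio test compares the -2LogL of the two nested models, and the difference is referred to a chi-square distribution with 1 degree of freedom. The sketch below handles only that 1-df case, using the identity P(X > x) = erfc(sqrt(x/2)); the -2LogL values are invented.

```python
import math


def lr_test_1df(neg2loglik_without: float, neg2loglik_with: float):
    """Likelihood ratio test for adding (or dropping) one coefficient."""
    stat = neg2loglik_without - neg2loglik_with
    # Survival function of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical -2LogL values for the model without and with the variable.
stat, p_value = lr_test_1df(203.84, 200.00)
print(f"LR statistic = {stat:.2f}, p = {p_value:.3f}")
```

A variable with several coefficients (for example a categorical factor) would need the chi-square distribution with the matching degrees of freedom, which this sketch does not cover.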


Collett

• Statistical criteria alone should not guide the model selection strategy
• It may not be appropriate to include particular combinations of variables
• It might be unwise to omit some statistically non-significant variables


Hosmer, Lemeshow and May

Purposeful selection

Step 1: Fit a multivariable model containing all variables significant in the univariable analysis at the 0.20 to 0.25 significance level, and any other variables not selected using this criterion but judged to be of clinical importance


Hosmer, Lemeshow and May

• Note:
• If there are many covariates that show a statistically significant association with survival, you can rank-order the covariates by p-value and use only the most highly significant ones. Include one covariate per ten events.
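The "one covariate per ten events" guideline is simple arithmetic: the number of observed events divided by ten gives the most covariates to carry forward. A small sketch, with invented p-values and an invented event count:

```python
def cap_by_events(p_values: dict, n_events: int, events_per_covariate: int = 10):
    """Rank covariates by univariable p-value and keep at most n_events // 10."""
    max_covariates = n_events // events_per_covariate
    ranked = sorted(p_values, key=p_values.get)  # most significant first
    return ranked[:max_covariates]


# Hypothetical univariable p-values and number of observed events (deaths).
p_values = {"LogBUN": 0.004, "HGB": 0.03, "SCalc": 0.04, "Age": 0.12, "Frac": 0.19}
n_events = 35

shortlist = cap_by_events(p_values, n_events)
print(shortlist)
```

With 35 events the cap is three covariates. Note that the count is of events, not of subjects.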


Hosmer, Lemeshow and May

Step 2: Use Wald test p-values of the individual coefficients to identify covariates that might be deleted

• Be cautious not to delete too many seemingly non-significant variables at one time

• Confirm the above using the partial likelihood ratio test


Hosmer, Lemeshow and May

Step 3: Assess whether removal of the covariate has produced an “important” change in the coefficients of the variables remaining in the model. A value of 20% is used as an indicator of important change.

If the variable excluded is an important confounder, reintroduce it into the model.

This process continues until no variables can be deleted.
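The 20% rule can be checked by comparing each remaining coefficient before and after the deletion. The sketch below assumes the percentage change is measured relative to the coefficient in the larger model; the coefficient values are invented and `important_changes` is a hypothetical helper name.

```python
def important_changes(full: dict, reduced: dict, threshold: float = 20.0):
    """Percent change in each remaining coefficient after deleting a covariate."""
    changes = {}
    for name, b_reduced in reduced.items():
        b_full = full[name]
        changes[name] = 100.0 * (b_reduced - b_full) / b_full
    flagged = [name for name, pct in changes.items() if abs(pct) > threshold]
    return changes, flagged


# Hypothetical coefficients (log hazard ratios) before and after deleting Age.
full_model = {"LogBUN": 1.80, "HGB": -0.12, "Age": 0.01}
reduced_model = {"LogBUN": 1.75, "HGB": -0.15}

changes, flagged = important_changes(full_model, reduced_model)
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
print("coefficients changed by more than 20%:", flagged)
```

In this toy case the HGB coefficient moves by about 25%, so the deleted variable would be treated as a confounder and put back.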


Hosmer, Lemeshow and May

Step 4: Add to the model, one at a time, all variables excluded from the initial multivariable model to confirm that they are neither statistically significant nor a confounder

Result referred to as the preliminary main effects model


Hosmer, Lemeshow and May

Step 5: Test linearity of the continuous covariates

This is referred to as the main effects model


Hosmer, Lemeshow and May

Step 6: Are interactions needed? Use 0.05 significance level. Use Wald p-value and partial likelihood ratio test as described earlier


Hosmer, Lemeshow and May

Step 7: Final Model

Check model assumptions, goodness-of-fit


Machin, Cheung, Parmar

• Explanatory variables are categorized
  1. Fundamental to research design (D)
  2. Those that influence outcome or are confounders (K)
  3. Uncertain influence (Q)


Strategies

• Forced-entry
• Significance tests
• Change in estimates of hazard ratios


Forced-entry

• Include variables in the model according to research design or prior opinion. This could include a statistically non-significant variable.
  • E.g. treatment variable in an RCT

• Include variables known to be influential in their ability to confound the primary association of interest

• The resulting model (with statistically non-significant effects) could have a reduced efficiency


Significance testing

• Step-up or step-down procedures where selection is ‘manual’, not automated


Change in estimates

• If our purpose is to obtain a suitable estimate of the HR for a key variable, the significance-testing strategy may not be successful in selecting confounders

• Compare HR_Crude with the adjusted estimate HR_Adjusted for a clinically important difference. A 10% change is suggested.
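A sketch of the change-in-estimates check for a key variable. It assumes the change is expressed as a percentage of the adjusted hazard ratio (other denominators are in use), and all hazard ratios shown are invented.

```python
def percent_change(hr_crude: float, hr_adjusted: float) -> float:
    """Absolute change between crude and adjusted HR, as a percentage of the adjusted HR."""
    return 100.0 * abs(hr_crude - hr_adjusted) / hr_adjusted


# Hypothetical crude HR for the key variable, and its HR after adjusting
# for each candidate confounder in turn.
hr_crude = 1.80
hr_adjusted_for = {"Age": 1.50, "Gender": 1.78}

confounders = [name for name, hr in hr_adjusted_for.items()
               if percent_change(hr_crude, hr) >= 10.0]
print(confounders)
```

Adjusting for the first candidate moves the estimate by 20%, so it would be retained even if its own coefficient were not statistically significant.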


Practical considerations

• Because of the effects of bias, if more than 20% of the data points are missing for a variable, exclude it from the modeling process. If missing data comprise < 5% then the bias introduced will likely be small.

• Check to see how any automatic selection routines handle missing data

• In practice one can start with missing data excluded at the early stages of the selection process but bring them back into the process as it becomes more clear which variables are likely to be in the final model
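The 20% and 5% guidelines can be applied with a quick tally of missing values per variable. In this sketch missing values are represented by `None`, the data are invented, and the "review" label for the band between 5% and 20% is an assumption, since the guideline above does not say how to treat it.

```python
def missing_fraction(values: list) -> float:
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def triage(columns: dict, exclude_above: float = 0.20, small_below: float = 0.05) -> dict:
    """Label each variable according to its fraction of missing data."""
    labels = {}
    for name, values in columns.items():
        frac = missing_fraction(values)
        if frac > exclude_above:
            labels[name] = "exclude"
        elif frac < small_below:
            labels[name] = "keep (bias likely small)"
        else:
            labels[name] = "review"  # assumed label for the in-between band
    return labels


# Hypothetical data for 20 patients.
columns = {
    "HGB": [12.0] * 20,                    # nothing missing
    "Frac": [None] * 2 + [0] * 18,         # 10% missing
    "Protein": [None] * 5 + [1.0] * 15,    # 25% missing
}

labels = triage(columns)
print(labels)
```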


Practical considerations

• Significance level to use? Err on the side of caution. Use 0.10 generally and 0.2 for the change-in-estimates method


Practical considerations

• Univariable analysis per se is not recommended
• Rationale for univariable screening
  • If an explanatory variable is associated with an outcome variable, this association may be the result of confounding
  • However, if an explanatory variable is not associated with an outcome variable in a univariable analysis, there is no gain in further examining it in a multivariable analysis
    • This argument is flawed; it overlooks the possibility of confounding which may suppress a genuine relation; so-called ‘negative’ confounding


Positive vs. Negative Confounding

• Positive confounding – An association is found between an exposure variable and outcome but in reality there is no association. The spurious association is caused by the confounder, OR the association appears stronger than it really is because of the confounder

• Negative confounding – An association is not found between an exposure variable and outcome but in reality there is an association. The true association is suppressed by the confounder, OR the association appears weaker than it really is because of the confounder


[Figure: diagram contrasting the true magnitude and the apparent magnitude of an association. Labels: Higher education in women; Nulliparous; Outcome: higher breast cancer incidence; Outcome: lower breast cancer incidence.]


Steyerberg

• The problem of overfitting already starts with considering too many candidate predictors in a data set.
• The problem is difficult to solve with standard statistical techniques which are used by default in medical research.
• The uncertainty of model selection is an important source of overfitting.


Steyerberg

• Improvements can be sought by limiting the necessity for selection by using subject matter knowledge, especially in relatively smaller data sets (also advocated by Harrell)

• Use better algorithms to discover patterns in the data (e.g. LASSO)

• LASSO is a penalized estimation technique where the estimated regression coefficients are constrained such that the sum of their scaled absolute values falls below some constant k chosen by cross-validation

• This type of constraint forces some regression coefficients towards zero (which helps with overfitting problem) and some to exactly zero (helping with variable selection)
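The shrink-and-select behaviour of the LASSO can be seen in the soft-thresholding operator. The sketch below relies on a simplifying assumption: with an orthonormal design in ordinary linear regression, the LASSO estimate is the unpenalized estimate soft-thresholded at a penalty `lam`, which plays the role of the constraint k. It is not a Cox-model LASSO fit, and the coefficients and penalty are invented.

```python
def soft_threshold(beta: float, lam: float) -> float:
    """Shrink beta towards zero by lam; set it to exactly zero if |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


# Hypothetical unpenalized coefficients and penalty.
unpenalized = {"LogBUN": 1.2, "HGB": -0.5, "Age": 0.05}
lam = 0.1

shrunk = {name: soft_threshold(b, lam) for name, b in unpenalized.items()}
for name in unpenalized:
    print(f"{name}: {unpenalized[name]:+.2f} -> {shrunk[name]:+.2f}")
```

The two larger coefficients are pulled towards zero, and the smallest one becomes exactly zero, which is how the method drops a variable. In practice the penalty would be chosen by cross-validation, as noted above.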


Royston et al.

• No consensus exists on the best method for selecting variables
• Two main strategies:
  • Full model approach – all candidate variables are included. This model is claimed to avoid overfitting and selection bias and provide correct standard errors and P values. However, the full model is not always easy to define
  • Backward elimination approach – the choice of significance level has a major effect on the number of variables selected.
    • Selection of predictors by significance testing is known to produce selection bias (regression coefficients overestimated) and optimism as a result of overfitting. Overfitting leads to worse prediction in independent data


Example 1 - Chow et al. (Collett approach)


Example 2 – Fosker et al. (Harrell approach)

• The Importance of Poor Performance Status in Personalising Palliative Radiotherapy Towards the End of Life

• The goal of our project is to define a clinically relevant ECOG PS based algorithm that would enable accurate prediction of patients with shorter life expectancies (< 3-4 months).


Multivariate Analysis: Cox Proportional Hazards model results
NOTE: ECOG=0 as reference category for variable ECOG

Parameter        P-value   Hazard Ratio   95% CI Lower   95% CI Upper
Age 67-74        0.0009    1.168          1.066          1.28
Age 75+          <.0001    1.265          1.15           1.391
Brain mets Yes   <.0001    1.354          1.252          1.465
ECOG 1           <.0001    1.575          1.317          1.885
ECOG 2           <.0001    2.258          1.881          2.712
ECOG 3           <.0001    3.59           2.989          4.312
ECOG 4           <.0001    5.925          4.743          7.401
Gender Male      <.0001    1.337          1.24           1.44
Primary Lung     <.0001    1.249          1.158          1.348
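The hazard ratios in the Cox results can be converted back to model coefficients, since a hazard ratio is the exponential of a coefficient. The sketch below also recovers an approximate standard error, under the assumption that the reported interval is a Wald interval, symmetric on the log scale at 1.96 standard errors; the results themselves do not state how the interval was computed.

```python
import math


def log_hr_and_se(hr: float, lower: float, upper: float, z: float = 1.959964):
    """Recover the coefficient (log HR) and its approximate standard error."""
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return beta, se


# ECOG 4 versus ECOG 0, values taken from the Cox model results above.
beta, se = log_hr_and_se(5.925, 4.743, 7.401)
print(f"log HR = {beta:.3f}, approximate SE = {se:.3f}")
```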


Conclusions

• One size doesn’t fit all – hard to conclude there is a “best” approach.

• “Cutting to the chase” is not appropriate to describe multivariable model building

• “A good model is one chosen by using a careful, well thought out covariate selection process that gives thoughtful consideration to issues of adjustment and interactions and thoroughly evaluates the model for assumptions, influential observations, and tests for goodness-of-fit” (Hosmer and Lemeshow 2008)


References

• Collett D. Modelling Survival Data in Medical Research. Chapman and Hall 1991.
• Hosmer DW, Lemeshow S, May S. Applied Survival Analysis – Regression Modeling of Time-to-event Data. 2nd edition. Wiley.
• Machin D, Cheung YB, Parmar MKB. Survival Analysis – A Practical Approach. Wiley 2006.
• Steyerberg EW. Clinical Prediction Models. Springer 2009.
• Royston P, Moons KGM, Altman DG, Vergouwe Y. Prognosis and prognostic research: Developing a prognostic model. BMJ June 2009, Volume 338, pp 1373-1377.
• Harrell FE. Regression Modeling Strategies. Springer 2001. New York.
