7071

Validation of predictive regression models

Ewout W. Steyerberg, PhD

Clinical epidemiologist

Frank E. Harrell, PhD

Biostatistician

Personal background

Ewout Steyerberg: Erasmus MC, Rotterdam, the Netherlands

Frank Harrell: Health Evaluation Sciences, University of Virginia, Charlottesville, VA, USA

“Validation of predictions from regression models is of paramount importance”

Learning objectives: knowledge of

common types of regression models

fundamental assumptions of regression models

performance criteria of predictive models

principles of different types of validation

Performance objectives

To be able to explain why validation is necessary for predictive models

To be able to judge the adequacy of a validation procedure

Predictive models provide quantitative estimates of an outcome, e.g.

Quality of life one year after surgery

Death at 30 days after surgery

Long term survival

Predictive models are often based on regression analysis

$y \sim a + \sum_{i=1}^{k} b_i x_i$

$y$: outcome variable

$a$: intercept

$b_i$: regression coefficient $i$

$x_i$: predictor variable $i$

$i = 1, \dots, k$, where the number of predictors $k$ is usually 2 to 20
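The right-hand side of the model formula can be evaluated directly. A minimal sketch, assuming plain Python lists for the coefficients and the predictor values of one patient; the function name and all numbers are hypothetical, not estimates from a real model:

```python
def linear_predictor(a, b, x):
    """Compute a + sum_i b_i * x_i, the right-hand side of the model formula."""
    if len(b) != len(x):
        raise ValueError("need exactly one coefficient per predictor")
    return a + sum(b_i * x_i for b_i, x_i in zip(b, x))


# Hypothetical values, for illustration only: k = 2 predictors.
a = -5.0         # intercept
b = [0.5, 0.04]  # regression coefficients b_1, b_2
x = [1, 60]      # predictor values x_1, x_2 for one patient
print(linear_predictor(a, b, x))  # approximately -2.1
```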

3 examples of regression

Quality of life one year after surgery:

continuous outcome, linear regression

Death at 30 days after surgery:

binary outcome, logistic regression

Long term survival:

time-to-outcome, Cox regression
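In all three models the linear predictor (intercept plus weighted sum of predictors) has the same form; what differs is how it is turned into a prediction. A minimal sketch using the standard link functions (identity for linear regression, inverse logit for logistic regression, and the proportional hazards form for Cox regression). For Cox regression the intercept is absorbed into the baseline hazard, and the baseline survival value passed in is a hypothetical input that would have to be estimated separately:

```python
import math


def predict_linear(lp):
    # Linear regression: the linear predictor is the expected outcome itself.
    return lp


def predict_logistic(lp):
    # Logistic regression: the linear predictor is the log odds; invert the logit.
    return 1.0 / (1.0 + math.exp(-lp))


def predict_cox_survival(lp, baseline_survival):
    # Cox regression: S(t | x) = S0(t) ** exp(lp), where S0(t) is the baseline
    # survival at the time t of interest and lp contains no intercept.
    return baseline_survival ** math.exp(lp)


lp = 0.0
print(predict_linear(lp))             # 0.0
print(predict_logistic(lp))           # 0.5
print(predict_cox_survival(lp, 0.9))  # 0.9
```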

Predictive models make assumptions

Distribution of the outcome

Linearity of continuous variables

Additivity of effects

Example: a simple logistic regression model

30-day mortality ~ a + b1*sex + b2*age

Assumptions:

30-day mortality follows a binomial distribution

Age has a linear effect on the log odds of mortality

The effects of sex and age are additive (on the log odds scale)

Assessing model assumptions

Examine model residuals

Perform specific tests:

add nonlinear terms, e.g. age + age²

add interaction terms, e.g. sex × age
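A minimal sketch of these two tests: extra columns are added to the predictor values, the model is refitted with them, and the two fits are compared with a likelihood ratio test. The likelihood ratio test is one common choice, assumed here; the slide does not say which test is meant. The function names are hypothetical, and the log-likelihoods in the example are placeholders for values a model-fitting routine would report:

```python
import math


def design_row(sex, age, nonlinear=False, interaction=False):
    """Predictor values for one patient in the sex + age model, optionally extended."""
    row = {"sex": sex, "age": age}
    if nonlinear:
        row["age^2"] = age ** 2     # quadratic term: relaxes linearity in age
    if interaction:
        row["sex*age"] = sex * age  # product term: relaxes additivity of sex and age
    return row


def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """LR statistic and chi-square p-value when df = 1 or 2 terms were added."""
    # For nested models the full model fits at least as well; clamp rounding noise.
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail
    elif df == 2:
        p = math.exp(-stat / 2.0)             # chi-square(2) upper tail
    else:
        raise NotImplementedError("closed-form p-value only for df = 1 or 2")
    return stat, p


print(design_row(1, 60, nonlinear=True, interaction=True))
# {'sex': 1, 'age': 60, 'age^2': 3600, 'sex*age': 60}
print(likelihood_ratio_test(-400.0, -397.0, df=2))  # hypothetical log-likelihoods
```

A small p-value suggests that the linearity or additivity assumption is violated for these data.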

Model assumptions and predictions

Better predictions if assumptions are met

Some violation is inherent in empirical data

Evaluate predictions in new data

Evaluation of predictions

Calibration

Is the average prediction correct?

Are low and high predictions correct?

Discrimination

Can the model distinguish low-risk from high-risk patients?
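Both criteria can be computed from a list of predicted probabilities and observed outcomes. A minimal sketch with hypothetical data: calibration is summarized here only "in the large" (average prediction vs observed event rate), and discrimination by the c-statistic (area under the ROC curve), computed by comparing all event/non-event pairs. The pairwise loop is simple but O(n²); the function names are illustrative:

```python
def calibration_in_the_large(pred, obs):
    """Return (mean predicted probability, observed event rate)."""
    return sum(pred) / len(pred), sum(obs) / len(obs)


def c_statistic(pred, obs):
    """Area under the ROC curve: the proportion of (event, non-event) pairs in which
    the patient with the event received the higher prediction; ties count as 1/2."""
    events = [p for p, y in zip(pred, obs) if y == 1]
    non_events = [p for p, y in zip(pred, obs) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(non_events))


# Hypothetical predictions and outcomes (1 = died within 30 days).
pred = [0.05, 0.10, 0.20, 0.30, 0.35]
obs = [0, 0, 0, 1, 0]
print(calibration_in_the_large(pred, obs))  # both values approximately 0.2
print(c_statistic(pred, obs))               # 0.75
```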

Example: predicted probabilities

[Figure: actual vs predicted probability of 30-day mortality, both axes 0.0 to 0.4. Area under ROC: 0.77; calibration: OK]
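A calibration plot like this one compares observed with predicted risk. One common way to obtain its points, assumed here because the slide does not say how the plot was made, is to sort patients by predicted risk, form groups of roughly equal size, and compare the mean predicted risk with the observed proportion in each group. Points near the diagonal indicate good calibration. The function name and data are hypothetical:

```python
def calibration_groups(pred, obs, n_groups=4):
    """Sort by predicted risk, split into n_groups groups of nearly equal size, and
    return (mean predicted risk, observed proportion) for each non-empty group."""
    pairs = sorted(zip(pred, obs))
    n = len(pairs)
    points = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if group:
            mean_pred = sum(p for p, _ in group) / len(group)
            observed = sum(y for _, y in group) / len(group)
            points.append((mean_pred, observed))
    return points


# Hypothetical data: 8 patients, 4 groups of 2.
pred = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4]
obs = [0, 0, 0, 1, 0, 1, 1, 1]
for mean_pred, observed in calibration_groups(pred, obs):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}")
```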

3 types of validation

Apparent: performance on the sample used to develop the model

Internal: performance in the population underlying the sample

External: performance in a related but slightly different population

Apparent validity

Easy to calculate

Results in optimistic performance estimates

Apparent estimates are optimistic because the same data are used for:

Definition of model structure:

e.g. selection and coding of variables

Estimation of model parameters:

e.g. regression coefficients

Evaluation of model performance:

e.g. calibration and discrimination

Internal validity

More difficult to calculate

Test the model in new data drawn at random from the underlying population

Why internal validation?

An honest estimate of performance should be obtained, at least for a population similar to the development sample

Internally validated performance sets an upper limit to what may be expected in other settings (external validity)

External validity

Moderately easy to calculate when new data are available

Test the model in new data from a population different from the development population

Why external validation?

Various factors may differ from the development population, including:

different selection of patients

different definitions of variables

different diagnostic or therapeutic procedures

Internal validation techniques

Split-sample:

development / validation

Cross-validation:

alternating development / validation

extreme: develop on n-1 patients, validate on 1 (‘jack-knife’)
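The index bookkeeping behind these schemes is simple. A minimal sketch in which patients are referred to by position 0..n-1; fitting the model on each development part and measuring performance on each validation part are left out. The function names, the seed, and the 50/50 default split are arbitrary choices for illustration:

```python
import random


def split_sample(n, dev_fraction=0.5, seed=1):
    """Randomly split patient indices 0..n-1 into a development and a validation part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * dev_fraction)
    return idx[:cut], idx[cut:]


def cross_validation_folds(n, k, seed=1):
    """Yield k (development, validation) index splits; every patient is validated once.
    With k = n this is the extreme case: develop on n-1 patients, validate on 1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for fold in range(k):
        validation = idx[fold::k]
        held_out = set(validation)
        development = [i for i in idx if i not in held_out]
        yield development, validation


dev, val = split_sample(10)
print(len(dev), len(val))  # 5 5
for dev, val in cross_validation_folds(10, k=5):
    print(len(dev), len(val))  # 8 2, five times
```

In cross-validation the model is refitted on each development part and the k validation results are averaged, so every patient contributes to validation once.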

Bootstrap

Bootstrap is the preferred internal validation technique

Bootstrap sample for model development: n patients drawn with replacement

Original sample for validation: n patients

Difference in performance: optimism

Efficiency: both development and validation use all n patients
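The procedure can be sketched as follows, assuming a logistic model and the c-statistic (area under the ROC curve) as the performance measure. The gradient-ascent fitting routine is a stand-in for a proper statistical package, used only to keep the example self-contained; all function names are hypothetical:

```python
import math
import random


def sigmoid(z):
    # Numerically stable inverse logit.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, iters=300):
    """Illustrative logistic regression by gradient ascent; returns [a, b_1, ..., b_k]."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(y)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def predict(w, X):
    return [sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi))) for xi in X]


def c_statistic(pred, obs):
    """Proportion of (event, non-event) pairs ranked correctly; ties count 1/2."""
    events = [p for p, y in zip(pred, obs) if y == 1]
    non_events = [p for p, y in zip(pred, obs) if y == 0]
    score = sum(1.0 if pe > pn else 0.5 if pe == pn else 0.0
                for pe in events for pn in non_events)
    return score / (len(events) * len(non_events))


def bootstrap_optimism(X, y, n_boot=200, seed=1):
    """Return (apparent, mean optimism, optimism-corrected) c-statistic."""
    rng = random.Random(seed)
    n = len(y)
    apparent = c_statistic(predict(fit_logistic(X, y), X), y)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # n patients drawn with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip samples without both events and non-events
        w = fit_logistic(Xb, yb)                   # develop in bootstrap sample
        in_boot = c_statistic(predict(w, Xb), yb)  # performance in bootstrap sample
        in_orig = c_statistic(predict(w, X), y)    # same model tested in original sample
        optimism.append(in_boot - in_orig)
    if not optimism:
        raise ValueError("no usable bootstrap samples")
    mean_opt = sum(optimism) / len(optimism)
    return apparent, mean_opt, apparent - mean_opt
```

Called as `bootstrap_optimism(X, y)`, with X a list of predictor lists and y a list of 0/1 outcomes containing both values, it returns the apparent area, the mean optimism, and the corrected area. If variables were selected or recoded using the data, those steps should be repeated inside each bootstrap sample for the optimism estimate to be honest.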

Example: bootstrap results for logistic regression model

30-day mortality ~ a + b1*sex + b2*age

Apparent area under the ROC curve: 0.77

Mean area in the 200 bootstrap samples: 0.772

Mean area when the 200 bootstrap models are tested in the original sample: 0.762

Optimism in apparent performance: 0.01

Optimism-corrected area: 0.76
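The correction in this example, written out (bars denote means over the 200 bootstrap replications):

```latex
\begin{aligned}
\text{optimism} &= \overline{\text{AUC}}_{\text{bootstrap}} - \overline{\text{AUC}}_{\text{original}} = 0.772 - 0.762 = 0.010 \\
\text{AUC}_{\text{corrected}} &= \text{AUC}_{\text{apparent}} - \text{optimism} = 0.77 - 0.010 = 0.76
\end{aligned}
```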

External validation techniques

Temporal validation: same investigators, validate in patients from more recent years

Spatial validation (other place): same investigators, cross-validate across centers

Fully external: other investigators, other centers
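A minimal sketch of how the first two designs partition a dataset, assuming each patient record is a dict with hypothetical `year` and `center` fields. Cross-validating across centers is implemented here as leaving one center out at a time; fully external validation needs data from other investigators and so has no counterpart in code:

```python
def temporal_split(records, cutoff_year):
    """Develop on patients treated before cutoff_year, validate on later patients."""
    development = [r for r in records if r["year"] < cutoff_year]
    validation = [r for r in records if r["year"] >= cutoff_year]
    return development, validation


def leave_one_center_out(records):
    """Yield (center, development, validation): each center in turn is held out
    for validation while the model is developed on all other centers."""
    for center in sorted({r["center"] for r in records}):
        development = [r for r in records if r["center"] != center]
        validation = [r for r in records if r["center"] == center]
        yield center, development, validation


# Hypothetical records.
records = [
    {"year": 1990, "center": "A"}, {"year": 1995, "center": "B"},
    {"year": 1998, "center": "A"}, {"year": 2000, "center": "C"},
]
dev, val = temporal_split(records, cutoff_year=1996)
print(len(dev), len(val))  # 2 2
for center, dev, val in leave_one_center_out(records):
    print(center, len(dev), len(val))
```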

Example: external validity of logistic regression model

30-day mortality ~ a + b1*sex + b2*age

Apparent area in 785 patients: 0.77

Tested in 20,318 other patients: 0.74

Tested by other investigators: ?

Example: external validation

[Figure: actual vs predicted probability of 30-day mortality in the external validation data, both axes 0.0 to 0.4. Area under ROC: 0.74; calibration: reasonable]

Summary

Apparent validity gives an optimistic estimate of model performance

Internal validity may be estimated by bootstrapping

External validity should be determined in other populations

Key references

Tutorial and book on multivariable models (Harrell 1996, Stat Med 15:361-87; Harrell 2001, Regression Modeling Strategies, Springer)

Empirical evaluations of strategies (Steyerberg 2000, Stat Med 19:1059-79)

Internal validation (Steyerberg 2001, J Clin Epidemiol 54:774-81)

External validation (Justice 1999, Ann Intern Med 130:515-24; Altman 2000, Stat Med 19:453-73)

Links

Interactive textbook on predictive modeling: http://www.neri.org/symptom/mockup/Chapter_8/

Harrell’s Regression Modeling Strategies: http://hesweb1.med.virginia.edu/biostat/rms/
