choosing regression models

Regression models 1 Choosing regression models An elementary introduction Stephen Senn

Upload: stephen-senn

Post on 11-Jan-2017


Page 1: Choosing Regression Models

Regression models 1

Choosing regression models

An elementary introduction

Stephen Senn

Page 2: Choosing Regression Models

Explanation

• I am not presenting these things because I think you don’t know them

• I am presenting them because the people you work with don’t know them

• And you need to explain these things to them

Page 3: Choosing Regression Models

Outline

• Basic considerations in modelling
• Choosing predictors
• Transformation of the predictor(s)
• Transformation of the outcome
• Advice

Page 4: Choosing Regression Models

Basic considerations

Thinking before you model

Page 5: Choosing Regression Models

Some Modelling Tasks

• Choose a generally suitable probability model
• Choose a set of suitable predictors
• Consider whether these need to be transformed
• Consider whether the outcome needs to be transformed
• Choose a technique for fitting the model
• Fit the model
• Assess goodness of fit of model
• Make causal inferences
• Issue predictions

Page 6: Choosing Regression Models

Factors Affecting Choice of Model

• Purpose of model
– Causal, predictive, classification
• Design of study
– Designed experiment, observational study, survey
• Temporal sequence
• Prior knowledge
• Type of data
– Continuous measurements, binary, ordinal, counts, censored lifetimes
• Case ascertainment
• Results of model fitting

Page 7: Choosing Regression Models

Preliminaries

• Choosing good regression models is not a question of throwing some data at a stepwise selection algorithm
• Two things are important
– Being clear about the purpose
– Insight (which in turn is based on)
  • Experience
  • Understanding
  • Logic

Page 8: Choosing Regression Models

Two Extremes

Causal analysis
• The putative causal factor(s) must be in the model
• Other factors are in the model because they help us understand the causal factor(s)
• They are of no interest in themselves
• We pay particular attention to the significance of the putative causal factor(s)

Predictive modelling
• We are trying to find predictors of some outcome
• It is their joint value as predictors that is important
• We simply want the most predictive model
• We compare entire models to judge which is best

Page 9: Choosing Regression Models

Example

• Modelling the effect of treatment in a clinical trial
• Treatment must be in any model whether or not it is significant
• Other factors will be in the model to help me improve my estimate of the effect of treatment
– They are of little interest in themselves
– They are nearly always predetermined

Page 10: Choosing Regression Models

Does Smoking Cause Lung Cancer? A Tale of Two Statisticians

Works in public health
• I wish to establish whether it is causal
• If so I can warn smokers to quit and this will benefit their health
• It is important for me to rule out possible confounding factors

Works in life insurance
• I don’t care if it is causal or not
• The data show that smokers are much more likely to get lung cancer
• That’s enough for me to take account of it in setting the premiums

Page 11: Choosing Regression Models

Warning

• Regression models are there to help you use your insight, experience and prior knowledge to understand your datasets

• They are not a substitute for scientific understanding

Page 12: Choosing Regression Models

Choosing predictors

It’s not just a matter of significance

Page 13: Choosing Regression Models

An Example

• Multicentre trial in asthma comparing formoterol, salbutamol and placebo for their effects on forced expiratory volume in one second (FEV1)
• Randomisation stratified by steroid use (yes/no) and centre
• Sex, age, height of patient and baseline FEV1 also measured
• Definitely in the model
– Blocking factors: centre & steroid use
– Treatment factor (3 levels: formoterol, salbutamol, placebo)
• Possibly in the model
– Covariates: sex, age, height of patient and baseline FEV1
– NB sex, age and height are very predictive of baseline FEV1; therefore, if you put them all in the model, none may be significant
– This does not matter

Page 14: Choosing Regression Models

Temporal Sequence I

• If we are interested in causal inferences it is usually inappropriate to include in a model variables that were measured later than the putative causal variables.

• The later variables cannot have caused the earlier variables and so should not be included.

Page 15: Choosing Regression Models

Example

• It is desired to study whether the type of school attended (private or state school) affects students’ chances of success in final degree examinations at university
• Data are obtained for a large group of students
• In addition to information on degree results and type of school attended, information is obtained on
– sex of student
– high school results
– parents’ income
• Which of these factors is it inappropriate to include in the model and why?

Page 16: Choosing Regression Models

Temporal Sequence II

• The same does not apply if the purpose of the model is simply classification
• It may then be helpful to have factors in the model even if they are measured after the “outcome variable of interest”
• Indeed they can be included even if they have been “caused” by the variable of interest

Page 17: Choosing Regression Models

Example

• We wish to develop a model for classifying patients who present with abdominal pain as either suffering from appendicitis or non-specific abdominal pain
• We use location of pain, degree of pain, absence/presence of nausea and body temperature as “predictor” variables
– Even though these are consequences rather than causes of appendicitis

Page 18: Choosing Regression Models

Prior Knowledge

• Frequently when fitting models we already have strong opinions about the effect of some factors even if we are ignorant about others.
– For example we may be examining the effect of a previously unstudied environmental exposure on health
– We know, however, that age is an important determinant of health

• We will tend to put factors we believe are important in the model irrespective of their significance according to the current data set.

• Similarly, implicitly, there will always be a host of factors we believe are irrelevant.

• We will not put these in the model on prior grounds

Page 19: Choosing Regression Models

Type of Data and Choice of Basic Model

Type of Data              Possible Basic Model
Continuous measurement    General linear model (Normal outcomes)
Count data                Poisson regression
Binary data               Logistic regression
Ordered categorical       Proportional odds
Censored lifetimes        Proportional hazards
Multinomial               Log-linear

Page 20: Choosing Regression Models

Case Ascertainment

• The way in which data are obtained (ascertained) can affect the way that we build a model
• For example in a case-control study we sample by outcome (cases and controls) and then measure how these two differ by exposure
– Example
  • Case: lung cancer, Control: other cancer
  • Exposure: smoker versus non-smoker
• We cannot model relative risk using such data
• We can only model (log) odds ratios
• For a cohort study where we sample by exposure we could model either
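The point about odds ratios can be checked with a little arithmetic. The sketch below uses invented population counts (they are not from any study) and assumes controls are drawn from non-cases at the same rate whatever the exposure. Under those assumptions the odds ratio from the case-control sample equals the population odds ratio, whereas a "relative risk" computed from the same sample is distorted by the sampling fractions.

```python
# Hypothetical population counts, invented purely for illustration.
population = {
    "smoker": {"cases": 300, "non_cases": 9_700},
    "non_smoker": {"cases": 100, "non_cases": 29_900},
}


def relative_risk(table):
    """Risk of disease in smokers divided by risk in non-smokers."""
    exposed, unexposed = table["smoker"], table["non_smoker"]
    risk_exposed = exposed["cases"] / (exposed["cases"] + exposed["non_cases"])
    risk_unexposed = unexposed["cases"] / (unexposed["cases"] + unexposed["non_cases"])
    return risk_exposed / risk_unexposed


def odds_ratio(table):
    """Cross-product ratio of the 2 x 2 table."""
    exposed, unexposed = table["smoker"], table["non_smoker"]
    return (exposed["cases"] * unexposed["non_cases"]) / (
        exposed["non_cases"] * unexposed["cases"]
    )


def case_control_sample(table, one_in):
    """Keep every case but only one in `one_in` non-cases as controls."""
    return {
        group: {"cases": counts["cases"], "non_cases": counts["non_cases"] // one_in}
        for group, counts in table.items()
    }


sample = case_control_sample(population, one_in=100)

rr_population = relative_risk(population)
rr_sample = relative_risk(sample)  # distorted: depends on the sampling fraction
or_population = odds_ratio(population)
or_sample = odds_ratio(sample)  # unchanged by sampling on outcome

print(f"relative risk: population {rr_population:.2f}, sample {rr_sample:.2f}")
print(f"odds ratio:    population {or_population:.2f}, sample {or_sample:.2f}")
```

Changing `one_in` changes the sample "relative risk" but leaves the odds ratio alone, which is why logistic regression on the (log) odds scale is the natural tool for such data.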

Page 21: Choosing Regression Models

Social Status: Longer life expectancy for Oscar winners

A study of actors and actresses found that Oscar winners lived, on average, almost four years longer than nominees who went home empty-handed, reports the March issue of the Harvard Health Letter. Actors aren’t the only people who reap benefits. Dr. Donald Redelmeier of Toronto’s Sunnybrook and Women’s College Health Sciences Centre found that Oscar-winning directors live longer than non-winners, and male directors live 4.5 years longer on average than actors. These findings add to a large body of evidence delineating connections between social status and health and longevity, reports the Harvard Health Letter. Redelmeier theorizes that an Oscar on the mantel moves the winner up the Hollywood pecking order. Winners find it easier to get work, and when they do, they’re better appreciated and better paid.

Page 22: Choosing Regression Models

Not Harvard Health Publications

A study has shown that getting a telegram from The Queen can add 20 years to your life

An extensive study of individuals who have received telegrams from The Queen has shown that an astonishing proportion of them have lived to be 100. Age at death of a control group of non-recipients was typically 20 years less.

Researchers have postulated that esteem is an important determinant of health

Joked lead researcher, Prof Morton Gullible, ‘our advice to her Majesty is send yourself a telegram, Ma'am’
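The flaw in the telegram "study" can be made concrete with a small simulation. The sketch below assumes, purely for illustration, that ages at death are Normal with mean 80 and standard deviation 10, and that a telegram is sent on reaching 100. The telegram has no effect on survival by construction, yet recipients still "live longer" because one has to survive to 100 to receive it.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical ages at death; the distribution is an assumption, not data.
ages_at_death = [random.gauss(80, 10) for _ in range(100_000)]

# A telegram is received only by those who are still alive at 100.
recipients = [age for age in ages_at_death if age >= 100]
non_recipients = [age for age in ages_at_death if age < 100]

gap = mean(recipients) - mean(non_recipients)

print(f"recipients:     n = {len(recipients)}, mean age at death = {mean(recipients):.1f}")
print(f"non-recipients: n = {len(non_recipients)}, mean age at death = {mean(non_recipients):.1f}")
print(f"apparent 'benefit' of a telegram: {gap:.1f} years")
```

The same logic applies, less obviously, whenever group membership can only be acquired by surviving long enough to acquire it.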

Page 23: Choosing Regression Models

Results of Model Fitting

• Statisticians have developed a number of techniques for assessing the adequacy of various models using the data in hand
– Standard errors, significance tests on coefficients
– Analysis of variance/deviance on factors
– Goodness of fit generally
– Residual plots
– AIC, BIC

• These are important tools but are by no means the only tools for assessing the adequacy of a model

Page 24: Choosing Regression Models

Transforming predictors

The X Files

Page 25: Choosing Regression Models

Luxembourg Temperature Example

Data on temperatures in Luxembourg

Month        Normal temperature (deg C)
January       0.6
February      1.4
March         4.7
April         7.7
May          12.4
June         15.1
July         17.5
August       17.3
September    13.5
October       8.9
November      4.0
December      1.8

Page 26: Choosing Regression Models

Modelling the temperature

Note that in the yearly rhythm, January follows December even though January is point 1 and December point 12.

The data are periodic and we need a model that reflects this.

The simplest periodic pattern is a sine wave.

$$Y = \mu + b \sin(X + \phi)$$

\(\mu\) = level (the average temperature)
\(b\) = amplitude (the difference max to average)
\(\phi\) = phase (governs point at which maximum is reached)

Page 27: Choosing Regression Models

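Fitting a sine wave

A sine wave model can be fitted by using the fact that

$$\sin(A + B) = \sin A \cos B + \cos A \sin B$$

so that, with level \(\mu\), amplitude \(b\) and phase \(\phi\),

$$Y = \mu + b \sin(X + \phi) = \mu + (b \cos\phi) \sin X + (b \sin\phi) \cos X$$

This is linear in \(\sin X\) and \(\cos X\). Hence by regressing Y on these two variables we can obtain a periodic fit.

Note that X must be transformed from linear to angular measure. So, with the months numbered 1 to 12, we can write

$$X = \frac{360 \times \text{month}}{12} = 30 \times \text{month}$$

if we measure in degrees or

$$X = \frac{2\pi \times \text{month}}{12}$$

radians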

Page 28: Choosing Regression Models

3 parameters fit 12 points rather well
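The fit can be reproduced from the Luxembourg figures with a few lines of code. The sketch below assumes the model Y = level + amplitude × sin(X + phase) with X = 2π × month/12 and months numbered 1 to 12. It also takes a shortcut that is valid only for this layout: with 12 equally spaced points covering one full cycle, the regressors 1, sin X and cos X are mutually orthogonal, so the least-squares coefficients reduce to simple sums. The variable names are illustrative.

```python
import math

# Normal monthly temperatures in Luxembourg (deg C), January to December.
temps = [0.6, 1.4, 4.7, 7.7, 12.4, 15.1, 17.5, 17.3, 13.5, 8.9, 4.0, 1.8]
n = len(temps)

# Transform month number (1..12) to angular measure in radians.
angles = [2 * math.pi * month / n for month in range(1, n + 1)]

# Least-squares coefficients of Y on 1, sin X, cos X.
# The simple sums below rely on the orthogonality of the three regressors.
level = sum(temps) / n
beta_sin = 2 / n * sum(y * math.sin(x) for x, y in zip(angles, temps))
beta_cos = 2 / n * sum(y * math.cos(x) for x, y in zip(angles, temps))

# Recover amplitude and phase: beta_sin = b cos(phase), beta_cos = b sin(phase).
amplitude = math.hypot(beta_sin, beta_cos)
phase = math.atan2(beta_cos, beta_sin)

fitted = [level + amplitude * math.sin(x + phase) for x in angles]
ss_total = sum((y - level) ** 2 for y in temps)
ss_residual = sum((y - f) ** 2 for y, f in zip(temps, fitted))
r_squared = 1 - ss_residual / ss_total

# The maximum is reached where X + phase = pi/2; convert back to a month number.
peak_month = ((math.pi / 2 - phase) % (2 * math.pi)) * n / (2 * math.pi)

print(f"level = {level:.2f}, amplitude = {amplitude:.2f}, phase = {phase:.3f} rad")
print(f"peak at month {peak_month:.2f}, R-squared = {r_squared:.3f}")
```

For unequally spaced or incomplete data the orthogonality shortcut no longer holds and an ordinary multiple regression on sin X and cos X would be needed instead.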

Page 29: Choosing Regression Models

Transforming the outcome

Being wise about Ys

Page 30: Choosing Regression Models

An Example of a One-way Layout

• Four experimental p38 kinase inhibitors
• Vehicle and marketed product as controls
• Thromboxane B2 (TXB2) is used as a marker of COX-1 activity
• Six rats per group were treated for a total of 36 rats
• At the end of the study rats are sacrificed and TXB2 is measured

Page 31: Choosing Regression Models

Page 32: Choosing Regression Models

GenStat® ANOVA (Original data)

Analysis of variance
Variate: TXB2

Source of variation   d.f.      s.s.      m.s.    v.r.   F pr.
Treatment                5   184596.    36919.    6.31   <.001
Residual                30   175439.     5848.
Total                   35   360035.

A2WAY [TREATMENTS=Treatment] TXB2

Page 33: Choosing Regression Models

GenStat plot of residuals

Page 34: Choosing Regression Models

Page 35: Choosing Regression Models

GenStat ANOVA (log transformed)

A2WAY [TREATMENTS=Treatment] logTXB2

Analysis of variance
Variate: logTXB2

Source of variation   d.f.      s.s.      m.s.    v.r.
Treatment                5   62.6760   12.5352   40.09
Residual                30    9.3800    0.3127
Total                   35   72.0559

Signal to noise ratio is now much higher
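The "signal to noise ratio" here is the variance ratio (v.r.) of the two ANOVA tables: the treatment mean square divided by the residual mean square. As a quick check, the sketch below recomputes it from the degrees of freedom and sums of squares reported by GenStat for the original and log-transformed data. The function name is illustrative.

```python
def variance_ratio(ss_treatment, df_treatment, ss_residual, df_residual):
    """Treatment mean square divided by residual mean square."""
    ms_treatment = ss_treatment / df_treatment
    ms_residual = ss_residual / df_residual
    return ms_treatment / ms_residual


# Sums of squares and degrees of freedom as reported in the two ANOVA tables.
vr_original = variance_ratio(184596.0, 5, 175439.0, 30)
vr_log = variance_ratio(62.6760, 5, 9.3800, 30)

print(f"v.r. on original scale: {vr_original:.2f}")
print(f"v.r. on log scale:      {vr_log:.2f}")
print(f"ratio of the two:       {vr_log / vr_original:.1f}")
```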

Page 36: Choosing Regression Models

GenStat plot of residuals

Page 37: Choosing Regression Models

Homogeneity of Variances (Bartlett’s Test: GenStat)

Untransformed
*** Bartlett's Test for homogeneity of variances ***
Chi-square 50.87 on 5 degrees of freedom: probability < 0.001

Log-transformed
*** Bartlett's Test for homogeneity of variances ***
Chi-square 8.95 on 5 degrees of freedom: probability 0.111
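Why a log transformation can remove heterogeneity of variance is easiest to see on constructed data. The sketch below implements Bartlett's statistic directly and applies it to six invented groups of six values whose scatter is identical on the log scale but whose means differ. These are not the rat data, and the equal log-scale variances are true by design. On the raw scale the spread grows with the mean and the statistic is large; on the log scale it is essentially zero.

```python
import math
from statistics import variance


def bartlett_statistic(groups):
    """Bartlett's chi-square statistic for homogeneity of variances (k - 1 d.f.)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances (divisor n - 1)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction


# Hypothetical data: the same deviations around a different mean in each group,
# constructed on the log scale and then exponentiated to give the "raw" scale.
deviations = [-0.6, -0.3, -0.05, 0.1, 0.3, 0.55]
log_means = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
log_groups = [[m + d for d in deviations] for m in log_means]
raw_groups = [[math.exp(value) for value in group] for group in log_groups]

stat_raw = bartlett_statistic(raw_groups)
stat_log = bartlett_statistic(log_groups)

print(f"Bartlett chi-square, raw scale: {stat_raw:.2f} on {len(raw_groups) - 1} d.f.")
print(f"Bartlett chi-square, log scale: {stat_log:.2f} on {len(log_groups) - 1} d.f.")
```

Real data will not behave this cleanly, but the pattern (standard deviation roughly proportional to the mean) is the usual sign that a log transformation is worth trying.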

Page 38: Choosing Regression Models

Data-filtering examples

or find the flaw

• A 20 year follow-up study of women in an English village found higher survival amongst smokers than non-smokers

• Transplant recipients on the highest doses of cyclosporine had a higher probability of graft rejection than those on lower doses

• Left-handers observed to die younger on average than right-handers

• Obese infarct survivors have better prognosis than non-obese

Page 39: Choosing Regression Models

Advice

Statistics is a way of improving your thinking, not a substitute for it

Page 40: Choosing Regression Models

Advice

• Think before you model
• Purpose is key
– Causal
– Predictive
– Classification
• Think about time
• Think about case ascertainment
• Testing is a small part of discerning
• Don’t use stepwise regression as a substitute for understanding