
Unit 1a: Fitting Sensible Taxonomies of Multiple Regression Models

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 1

http://xkcd.com/943/


Scared!

This sounds familiar!

Logistic regression isn’t so ba-

Ack, Discrete Time what now?

Clustering… seems intuitive

Principal components?!

Final project

Whoa, fixed and random effects?

Course Principle 1: Layering

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 2

• In my experience learning and teaching statistics, fluency comes only after repeated exposure to concepts in varying contexts. Think of this as “layering.”

So: Be Patient. Get Exposed.


Course Principle 2: Language

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 3

• Learning statistics is like learning a language. Fluency comes from active use of new vocabulary in genuine conversations about meaningful content.

So: Find a partner. Find a study group. Speak, read, write, and hear statistics. Immerse yourself.


Course Principle 3: Disciplined Perception

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 4

• As you learn, layer, and gain fluency, you will hopefully begin to view graphical representations, text, and real-life situations with new eyes.

So: If at first it seems like gibberish, don’t despair.

This takes practice.


Course Principle 4: Exploratory Data Analysis

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 5

• Attributed to the famous statistician John Tukey, EDA is an analytic approach that emphasizes graphical representations and descriptive statistics to give analysts an intuitive sense of their data before they begin fitting models.

• When faced with new data, you will learn to reflexively:
– Get your eyes on the data.
– Note summary statistics.
– Visualize univariate distributions.
– Visualize bivariate and multivariate relationships.
– Graphically summarize the fit of the model to data.

[Figures: frequency histograms of "Index of self-reported sexual activity at baseline (two years prior)" and "Reported importance of religion and religious attendance at baseline"; a plot with "Index of self-reported sexual activity" on the vertical axis.]

          activity   sexmedia   baseact    age        pubearly   male       black      sesindex   schengag   religion   parengag   pdisapp    fabstain
activity  1
sexmedia  0.294***   1
          (0.000)
baseact   0.537***   0.276***   1
          (0.000)    (0.000)
age       0.164***   0.147***   0.269***   1
          (0.000)    (0.000)    (0.000)
pubearly  0.183***   0.127***   0.128***   0.0349     1
          (0.000)    (0.000)    (0.000)    (0.300)
male      0.0313     -0.361***  0.0821*    0.0960**   0.0125     1
          (0.351)    (0.000)    (0.014)    (0.004)    (0.709)
black     0.110**    0.379***   0.113***   0.0526     0.0667*    -0.00328   1
          (0.001)    (0.000)    (0.001)    (0.117)    (0.047)    (0.922)
sesindex  0.0482     0.0199     0.00226    -0.0197    -0.00344   0.00984    0.205***   1
          (0.152)    (0.554)    (0.946)    (0.558)    (0.918)    (0.770)    (0.000)
schengag  -0.191***  -0.211***  -0.201***  -0.138***  -0.0543    -0.0994**  -0.337***  -0.0395    1
          (0.000)    (0.000)    (0.000)    (0.000)    (0.106)    (0.003)    (0.000)    (0.240)
religion  0.0206     0.138***   0.0261     -0.00503   0.0675*    -0.0436    0.274***   0.0302     -0.0105    1
          (0.540)    (0.000)    (0.437)    (0.881)    (0.044)    (0.195)    (0.000)    (0.369)    (0.754)
parengag  -0.232***  -0.207***  -0.288***  -0.159***  -0.0888**  0.00962    -0.116***  0.0280     0.307***   0.0798*    1
          (0.000)    (0.000)    (0.000)    (0.000)    (0.008)    (0.775)    (0.001)    (0.405)    (0.000)    (0.017)
pdisapp   -0.235***  -0.101**   -0.274***  -0.171***  -0.137***  -0.286***  -0.173***  -0.0642    0.255***   0.0741*    0.218***   1
          (0.000)    (0.003)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.056)    (0.000)    (0.027)    (0.000)
fabstain  -0.310***  -0.354***  -0.341***  -0.258***  -0.0995**  -0.0241    -0.394***  -0.0673*   0.344***   -0.170***  0.246***   0.290***   1
          (0.000)    (0.000)    (0.000)    (0.000)    (0.003)    (0.474)    (0.000)    (0.045)    (0.000)    (0.000)    (0.000)    (0.000)
N         887
p-values in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. summarize, separator(14)

    Variable |   Obs        Mean   Std. Dev.      Min      Max
-------------+------------------------------------------------
          id |   887    2626.545   1495.928         6     5217
    activity |   887    4.324961   1.801506      1.01        7
     baseact |   887    2.818489   1.567498         1        7
         age |   887    15.71466   .6995205     14.04    18.35
    pubearly |   887    2.011274   .3709047         1        3
    sexmedia |   887    .5314092   .1451268       .14      .94
        male |   887    .4937993   .5002436         0        1
       black |   887    .5039459   .5002665         0        1
    sesindex |   887    .0047689   .6077198     -1.44     1.47
    schengag |   887    .0132243   .7676371        -2     1.66
    religion |   887      .01708   .8769921     -2.21       .8
    parengag |   887   -.0019955   .7886763        -2     1.42
     pdisapp |   887    .7192785   .4496052         0        1
    fabstain |   887    .4092446   .4919719         0        1


Course Roadmap: Unit 1a

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 6

Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical datapoints.

If your residuals are not independent, replace OLS by GLS regression analysis.
• Use Individual growth modeling.
• Specify a Multi-level Model.

If time is a predictor, you need discrete-time survival analysis…

If your outcome is categorical, you need to use…
• Binomial logistic regression analysis (dichotomous outcome).
• Multinomial logistic regression analysis (polytomous outcome).

If you have more predictors than you can deal with,
• Create taxonomies of fitted models and compare them. (Today’s Topic Area)
• Form composites of the indicators of any common construct.
• Conduct a Principal Components Analysis.
• Use Cluster Analysis.
• Use Factor Analysis: EFA or CFA?

If your outcome vs. predictor relationship is non-linear,
• Use non-linear regression analysis.
• Transform the outcome or predictor.

You can keep track of your progress through this course by referencing the outline in the course syllabus.


A Simple Path through S-052

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 7

1. Taxonomies of Regression Models
2. Nonlinear Regression
3. Nonindependent Residuals

4. Logistic Regression
5. Discrete-Time Survival Analysis

6. Forming Composites
7. Cluster Analysis
8. Factor Analysis


The ILLCAUSE Dataset and Research Question: Know Your Dataset

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 8

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

Dataset: ILLCAUSE.txt

Overview: Data for investigating differences in children’s understanding of the causes of illness, by their health status.

Source: Perrin E.C., Sayer A.G., and Willett J.B. (1991). Sticks And Stones May Break My Bones: Reasoning About Illness Causality And Body Functioning In Children Who Have A Chronic Illness, Pediatrics, 88(3), 608-19.

Sample size: 301 children, including a sub-sample of 205 who were described as asthmatic, diabetic, or healthy. After further reductions due to the list-wise deletion of cases with missing data on one or more variables, the analytic sub-sample used in class ends up containing 33 diabetic children, 68 asthmatic children and 93 healthy children.

More info: Chronically-ill children were recruited into the study through their pediatricians; healthy children were a matched random sample drawn from the same schools as the ill children.

Updated: September 16, 2005

Dataset on website: ILLCAUSE.txt. Codebook on website: ILLCAUSE_info


The ILLCAUSE Dataset: Know Your Variables

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 9

Col 1, ID
  Variable Description: Child identification code.
  Variable Metric/Labels: Integers.

Col 2, ILLCAUSE
  Variable Description: Child’s score on a measure of the understanding of illness causality.
  Variable Metric/Labels: Ordinal score obtained by averaging child responses to 7 interview questions on the causes of illness, with responses rated on a “developmental” scale:
    1 = No response.
    2 = Phenomenistic or circular response.
    3 = External agency cited as sole cause.
    4 = Internalization in understanding illness, once agent internalized, illness is inevitable.
    5 = Interaction of host and agent described.
    6 = Mechanisms of illness causation described, including notions of treatment and bodily response.

Col 3, SES
  Variable Description: Family socio-economic status, rated using the education and employment levels of the primary bread-winner with Hollingshead Two-Factor Index of Social Position. (Hollingshead & Frederick. Social Class and Mental Illness. NY: Wiley, 1958)
  Variable Metric/Labels: Ordinal rating of social class:
    1 = upper
    2 = upper middle
    3 = middle
    4 = lower middle
    5 = low
    (Notice the ordering of the numerical values is counterintuitive.)

Col 4, PPVT
  Variable Description: Child’s normed score on the Peabody Picture Vocabulary Test.
  Variable Metric/Labels: Continuous score, mean of 100 & standard deviation of 15 in population.

Col 5, AGE
  Variable Description: Child age.
  Variable Metric/Labels: Continuous variable, months since birth.

Col 6, GENREAS
  Variable Description: Child’s score on a measure of general reasoning.
  Variable Metric/Labels: Ordinal score, from 1 through 6. Similar to ILLCAUSE, but requires general reasoning, rather than reasoning about illness.

Col 7, HEALTH
  Variable Description: Child Health Status Indicator.
  Variable Metric/Labels: Categorical variable with multiple categories, of which we are interested in:
    3 = Diabetic
    5 = Asthmatic
    6 = Healthy


The Data Analytic Handout: Boiler Plate

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 10

*--------------------------------------------------------------------------------
* S-052: APPLIED DATA ANALYSIS
* Data-Analytic Handout Unit 1a
*
* Unit 1    Conducting Sensible Multiple Regression Analyses.
* Unit 1a   Fitting Taxonomies of Multiple Regression Models.
*           Introducing the Data.
*
* RQ: Do Children Who Are Chronically Ill Understand the Causes of Illness
*     Better Than Do Healthy Children?
*
* Programming:
*   Stata Version: Stata 11 SE.
*   Original Authors: Andres Molano, Monica Yudron & John B. Willett.
*   Modifications by Andrew Ho
*   Last Modified: Jan 28, 2013.
*--------------------------------------------------------------------------------

*--------------------------------------------------------------------------------
* Set the critical parameters of the computing environment.
*--------------------------------------------------------------------------------
* Specify the version of Stata to be used in the analysis:
  version 11.0
* Clear all computer memory and delete any existing stored graphs and matrices:
  clear
  graph drop _all
  clear matrix
* Define the local directory:
  cd "C:\Users\hoan\Documents\My Dropbox\S-052\Stata\"

*--------------------------------------------------------------------------------
* Open a log to contain a permanent record of the syntax and analytic output.
*--------------------------------------------------------------------------------
  log using "Unit1a.log", replace

You can title STATA programs & code, using comments. Include:
• Name of the Handout.
• Link to the Syllabus.
• Substantive Theme (RQ).
• Programming logistics.

Any current version of STATA can recognize code written according to the rules of any previous version of the software.

Clear everything out, before the current program executes.

Define the local directory, so STATA knows where to write its logs, etc.

Begin a log file to record output. Your lab book.

All units will include a “Data Analytic Handout” that records the Stata code used in lectures. As you download, adapt, and annotate these handouts, they will serve as a lasting code library.

Any line that begins with an asterisk is a comment.

It doesn’t matter what it ends with.
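A common refinement of the log step, offered as a sketch rather than as part of the original handout: if an earlier run stopped with an error, the log may still be open, and `log using` will then fail. Prefixing `log close` with `capture` closes an open log if there is one and does nothing otherwise. The file name `Unit1a.log` comes from the handout; the closing `log close` is an assumption about how the .do file ends.

```stata
* Close any log left open by an earlier, interrupted run (no error if none is open).
capture log close

* Open the log as in the handout.
log using "Unit1a.log", replace

* ... analysis commands go here ...

* Close the log at the end of the .do file (assumed; not shown in this excerpt).
log close
```

With this pattern the .do file can be rerun from the top without manually closing the log first.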


The Data Analytic Handout: Reading In Data

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 11

*--------------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and their values.
*--------------------------------------------------------------------------------
* Input the dataset:
  infile ID ILLCAUSE SES PPVT AGE GENREAS HEALTH ///
      using "C:\Users\hoan\Documents\My Dropbox\S-052\Raw Data\ILLCAUSE.txt"

* Label the variables in the dataset:
  label variable ID "Child Identification Code"
  label variable ILLCAUSE "Understanding of Illness Causality Score"
  label variable SES "Hollingshead SES"
  label variable PPVT "Peabody Picture Vocabulary Test Score"
  label variable AGE "Chronological Age (Months)"
  label variable GENREAS "General Reasoning Ability Score"
  label variable HEALTH "Health Status"
* Label the values of categorical question predictor HEALTH:
  label define HEALTHLBL 3 "Diabetic" 5 "Asthmatic" 6 "Healthy"
  label values HEALTH HEALTHLBL

*--------------------------------------------------------------------------------------
* Subset and save an analytic sample with only healthy, asthmatic,
* & diabetic children.
*--------------------------------------------------------------------------------------
  keep if HEALTH==3 | HEALTH==5 | HEALTH==6
  save "ILLCAUSE S052 subset.dta", replace

Try to run full .do files all the way from the raw data, preserving each step in your analysis as if in a lab book. A carefully annotated .do file should be all one needs to replicate your analysis.

In the infile command, specify the names of the variables (in order of their appearance in the dataset) and the location of the data-file that contains the raw data.

The data are read into the Stata active file, and held in memory.

Note "///" allows a command to continue on the next line.

Variables are easily labeled with informative names.

We capitalize variable names, although this is not common Stata convention.

You can also label the values of categorical variables.

Here, we will subsequently focus on only the Diabetic, Asthmatic, and Healthy children to simplify the analysis, so I have named only those values of HEALTH.

I keep only Diabetic, Asthmatic, and Healthy children in the active file and save the data subset, which will reside in the default directory set earlier in this .do file.
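A quick way to confirm that the subsetting step did what was intended is sketched below. It is not part of the original handout; it assumes the subset was saved under the handout's file name in the current working directory.

```stata
* Reload the saved subset from the working directory (file name from the handout).
use "ILLCAUSE S052 subset.dta", clear

* Stop with an error if any HEALTH code other than 3, 5, or 6 remains.
assert inlist(HEALTH, 3, 5, 6)

* Cross-check the group sizes and the total number of children.
tabulate HEALTH, missing
count
```

The count should agree with the 205 children reported in the summary output later in this unit.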


EDA Reflex 1: List Data; Univariate Summary Statistics

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 12

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

. list in 1/10, clean

        ID   ILLCAUSE   SES   PPVT   AGE   GENREAS     HEALTH
  1.   301          .     2    138   128     4.802   Diabetic
  2.   302      2.857     2    102    79     2.188   Diabetic
  3.   303      3.429     3     84   151     3.302   Diabetic
  4.   304      4.286     3     98   178     5.219   Diabetic
  5.   305      4.286     4     80   113       2.5   Diabetic
  6.   306      3.286     2    106    81     3.833   Diabetic
  7.   307      2.857     4     75   194     3.833   Diabetic
  8.   308      3.429     2    102    90     3.677   Diabetic
  9.   309      3.429     2    109   133     4.396   Diabetic
 10.   310      3.143     1    125   147     4.146   Diabetic

. summarize, separator(`c(k)')

    Variable |   Obs       Mean   Std. Dev.     Min     Max
-------------+---------------------------------------------
          ID |   205   558.1073   138.8618      301     748
    ILLCAUSE |   194   4.133284   1.021903    1.571       6
         SES |   205   2.287805   .9852329        1       5
        PPVT |   205   112.7659   15.81678       73     162
         AGE |   205   131.7902    40.4458       61     203
     GENREAS |   203   4.124384   1.095722     1.75       6
      HEALTH |   205   5.117073   1.078284        3       6

List the first 10 observations. Tell yourself the story of an observation, e.g., “Observation 304 is a diabetic, middle-class child with an average vocabulary and high general reasoning who has a fairly high understanding of the causes of illness, 4.3 on a 1-6 scale.”

Note missing data “.”, get a sense of scale (mean and sd), and note the ranges for any anomalies.
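The missing-data check can also be made explicit. The sketch below is an illustration, not handout code; it assumes the variable names above and a Stata version that includes `misstable` (Stata 11 or later). From the Obs column of the summary output, ILLCAUSE is missing for 11 of the 205 children and GENREAS for 2.

```stata
* Count the children with a missing outcome.
count if missing(ILLCAUSE)

* Report missing-value counts for every variable that has any.
misstable summarize

* List the children with a missing outcome or a missing reasoning score.
list ID ILLCAUSE GENREAS HEALTH if missing(ILLCAUSE) | missing(GENREAS), clean
```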


EDA Reflex 2a: Univariate Visualizations (Start with the Outcome)

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 13

• Ordinal score obtained by averaging child responses to 7 interview questions on the causes of illness, with responses rated on a “developmental” scale:
– 1 = No response.
– 2 = Phenomenistic or circular response.
– 3 = External agency cited as sole cause.
– 4 = Internalization in understanding illness, once agent internalized, illness is inevitable.
– 5 = Interaction of host and agent described.
– 6 = Mechanisms of illness causation described, including notions of treatment and bodily response.

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

[Figures: two frequency histograms of Understanding of Illness Causality Score (ILLCAUSE).]

hist ILLCAUSE, discrete freq name(Unit1a_g1,replace)
hist ILLCAUSE, freq


EDA Reflex 2b: Predictor Variable Distributions

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 14

[Figures: histograms of Hollingshead SES (percent), Peabody Picture Vocabulary Test Score (frequency), Chronological Age (Months) (frequency), and General Reasoning Ability Score (frequency).]

. tabulate HEALTH

     Health |
     Status |      Freq.     Percent        Cum.
------------+-----------------------------------
   Diabetic |         36       17.56       17.56
  Asthmatic |         73       35.61       53.17
    Healthy |         96       46.83      100.00
------------+-----------------------------------
      Total |        205      100.00

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

The Question Predictor

Remember that non-normal predictor variable distributions are not necessarily a threat to the regression model. The regression assumption is: independent and identically normally distributed residuals.

EDA Reflex 3a: Bivariate Visualizations (Outcome and Question Predictor)

[Figures: box plots of Understanding of Illness Causality Score by health status (Diabetic, Asthmatic, Healthy); kernel density plots of Understanding of Illness Causality Score for Diabetic, Asthmatic, and Healthy children.]

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 15

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

You are looking for overall mean differences while staying attuned to absolute and relative sample sizes.

Note that the order of box plots for categorical predictors is arbitrary and does not raise concerns about nonlinearity.

Be cautious about overinterpreting Kernel Density plots with low sample sizes.

. tabulate HEALTH, summarize(ILLCAUSE)

            |  Summary of Understanding of Illness
     Health |           Causality Score
     Status |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Diabetic |   3.7663333   .76587771          33
  Asthmatic |   3.6680588   .86861936          68
    Healthy |   4.6036559   1.0026495          93
------------+------------------------------------
      Total |   4.1332835    1.021903         194
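The commands behind the box plots and kernel density plots are not shown in this excerpt. One way to draw comparable displays is sketched here, assuming the HEALTH codes and value labels defined earlier; the graph names `box_sketch` and `kdens_sketch` are hypothetical.

```stata
* Box plots of the outcome, one box per health status.
graph box ILLCAUSE, over(HEALTH) name(box_sketch, replace)

* Overlaid kernel density estimates, one curve per health status.
twoway (kdensity ILLCAUSE if HEALTH==3) ///
       (kdensity ILLCAUSE if HEALTH==5) ///
       (kdensity ILLCAUSE if HEALTH==6), ///
       legend(order(1 "Diabetic" 2 "Asthmatic" 3 "Healthy")) ///
       xtitle("Understanding of Illness Causality Score") ///
       ytitle("Kernel Density") name(kdens_sketch, replace)
```

With only 33 diabetic children, the density curve for that group is the one to read most cautiously.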


© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 16

So chronic illness decreases kids’ understanding of the causes of illness?

We’re not that naïve! We haven’t randomly assigned illnesses to kids, so no causal inference is warranted. We might wish to include additional predictors to account for the confounding of illness and other variables. Perhaps add AGE and SES as your control predictors.

And proceed with a multiple regression analysis …

… and then, what about non-linear expressions of the continuous predictors, or categorical versions, or what if you add another predictor like gender or race, or … How Many Potential Models Would There Be, Then?

The task seems ok until you begin to enumerate how many possible models you can actually specify using just these few predictors …

Three models with 1 main effect:

$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{AGE}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{SES}_i + \varepsilon_i$

Three models with 2 main effects:

$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{SES}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{SES}_i + \varepsilon_i$

Three models with 2 main effects and 1 two-way interaction:

$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 (\text{HEALTH}_i \times \text{AGE}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{SES}_i + \beta_3 (\text{HEALTH}_i \times \text{SES}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{SES}_i + \beta_3 (\text{AGE}_i \times \text{SES}_i) + \varepsilon_i$

One model with 3 main effects:

$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \varepsilon_i$

Three models with 3 main effects and 1 two-way interaction:

$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \beta_4 (\text{HEALTH}_i \times \text{AGE}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \beta_4 (\text{HEALTH}_i \times \text{SES}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \beta_4 (\text{AGE}_i \times \text{SES}_i) + \varepsilon_i$

Three models with 3 main effects and 2 two-way interactions

One model with 3 main effects and 3 two-way interactions

and so on …


How big is the landscape of possible models?

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 17

Willett’s Rule: The approximate number of feasible regression models that can be specified increases exponentially with the number of potential predictors:

$\text{number of models} \approx \tfrac{1}{4}\, e^{1.5 \times (\text{number of predictors})}$

5 Predictors: 450 Models
10 Predictors: 817,250 Models
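The two examples can be checked with Stata's `display` command. The sketch assumes the rule reads as one quarter of e raised to 1.5 times the number of predictors; that reading is an inference from the slide, chosen because it reproduces both quoted figures.

```stata
* Willett's Rule as read here: models = (1/4) * exp(1.5 * predictors).
* This reading is an assumption checked against the slide's two examples.
display 0.25*exp(1.5*5)
display 0.25*exp(1.5*10)
```

The two expressions evaluate to roughly 452 and 817,254, which the slide rounds to 450 and 817,250. Each additional predictor multiplies the count by e^1.5, or about 4.5.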

How can we navigate the landscape of potential models to find the “best” model, theoretically, practically, and empirically?

One-Predictor Model

Two-Predictor Model

Two Predictors + an Interaction Model

Three Predictors and an Interaction?


Model selection algorithms (that you should never use!)

© Andrew Ho, Harvard Graduate School of Education

Automated model building strategies (that you may see in journal articles)

1. All possible subsets: all regression models

2. Best subsets: Best m models with 1, 2, …, k predictors, along some criterion: $R^2$, adjusted $R^2$, (Cp, AIC, BIC…)…

3. Forward selection: start with no predictors and sequentially add them so that each maximally increases the $R^2$ statistic at that step

4. Backwards elimination: start with all predictors and sequentially drop them so that each minimally decreases the $R^2$ statistic at that step

5. Stepwise regression: Some combination of 3 and 4, such as forward selection with occasional “glances” backwards to see if anything has changed.

“All models are wrong, but some are useful.” George E.P. Box (1979)

“Far better an approximate answer to the right question … than an exact answer to the wrong question.” John W. Tukey (1962)

“The hallmark of good science is that it uses models and ‘theory’ but never believes them.” Attributed to Martin Wilk in Tukey (1962)

Occam’s razor: entia non sunt multiplicanda praeter necessitatem (“Entities must not be multiplied beyond necessity”). If two competing theories lead to the same predictions, the simpler one is better. William of Occam (14th century)

Your instincts should tell you that this is both tempting and loony. Of these approaches, Best Subsets can be the most informative in a many-predictor context, reminding you that many possible models can lead to very similar global fit statistics. In general, however, these data mining approaches should be avoided.

Unit 1– Slide 18


Managing and Categorizing Multiple Predictors: A Framework

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 19

• If you are managing the study from the design stage, you should have included your variables with intent from the beginning.
• If, in contrast, you are coming in cold as a secondary data analyst, prioritizing and classifying variables is part of your job.

First, identify your outcome variable: here, ILLCAUSE.

Second, consider classes of predictors, based on substance (i.e., your research questions and theoretical framework), your research design, … past literature, etc., and establish the priorities among these predictors.

Priority: High
  Predictor: HEALTH is the Key Question Predictor.
  Comment: Without the presence of HEALTH in the final model, we cannot address the research question!

Priority: Medium
  Predictor: AGE is a key “Design” Control Predictor because it represents the multi-cohort nature of the research design.
  Comment: Our sample contains multiple sub-samples (“cohorts”) of children at different ages. By controlling for AGE, we can pool all the children into the same analysis, regardless of their age, rather than doing separate “age-by-age slice” analysis (not recommended!)

Priority: Low
  Predictor: SES is a subsidiary substantive Control Predictor. It is often worth including because, as John Willett likes to say, “some twit will always ask you if it matters.”
  Comment: Ill children may have lower SES, on average. If understanding illness also depends on home resources, the effect of SES could masquerade as an effect of HEALTH.

Third, choose a sensible order in which to enter the predictor classes into the regression model, again based on substance and your research questions.

Fourth, enter the predictors systematically within their classes, exhausting each class before proceeding to the next. At each step, once the main effects have been exhausted, consider including the interactions, again driven by theory: does the effect of one predictor on the outcome depend on the levels of another predictor or, equivalently, vice versa?


Approaches to variable selection and model building

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 20


EDA Reflex 3b: (Categorical) Question Predictor and Other Predictors

• We can see that Healthy children have higher (lower-valued) SES, higher PPVT, and higher GENREAS.

• If HEALTH were randomly assigned to children, what would the table look like?

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 21

[Figure: box plots of Understanding of Illness Causality Score by health status (Diabetic, Asthmatic, Healthy).]

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

. tabstat SES PPVT AGE GENREAS, statistics(mean sd count) by(HEALTH)

Summary statistics: mean, sd, N
  by categories of: HEALTH (Health Status)

    HEALTH |       SES       PPVT        AGE    GENREAS
-----------+-------------------------------------------
  Diabetic |  2.805556   106.4722     136.75   3.855029
           |  1.141914   18.48163    35.9105   1.022861
           |        36         36         36         35
-----------+-------------------------------------------
 Asthmatic |   2.69863   110.3014   129.0822   3.709361
           |  .8446717   12.77659   40.72187   .9798192
           |        73         73         73         72
-----------+-------------------------------------------
   Healthy |   1.78125        117   131.9896   4.533854
           |  .7567677   15.80673   42.02267   1.064994
           |        96         96         96         96
-----------+-------------------------------------------
     Total |  2.287805   112.7659   131.7902   4.124384
           |  .9852329   15.81678    40.4458   1.095722
           |       205        205        205        203


. pwcorr ILLCAUSE SES PPVT AGE GENREAS, sig obs star(.05)

             | ILLCAUSE       SES      PPVT       AGE   GENREAS
-------------+-------------------------------------------------
    ILLCAUSE |   1.0000
             |
             |      194
             |
         SES |  -0.2471*   1.0000
             |   0.0005
             |      194       205
             |
        PPVT |   0.3140*  -0.3785*   1.0000
             |   0.0000    0.0000
             |      194       205       205
             |
         AGE |   0.6711*   0.0598    0.1199    1.0000
             |   0.0000    0.3941    0.0869
             |      194       205       205       205
             |
     GENREAS |   0.8242*  -0.2975*   0.3891*   0.7369*   1.0000
             |   0.0000    0.0000    0.0000    0.0000
             |      192       203       203       203       203

EDA Reflex 3c: Outcome and Other Predictors

• (Low) SES is a slight negative predictor, and age as well as vocabulary and reasoning ability correlate positively as expected.

• Also a strong relationship between general reasoning and age, as expected.

• When interpreting correlation matrices, learn to look past the STARS to the magnitude of the correlation itself.

• Example: 0.3 is significant (we’re confident it’s not 0) but we’re also confident it’s a pretty weak correlation.
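A short calculation makes the contrast between significance and magnitude concrete. The sketch uses the round value r = 0.3 from the example above and assumes n = 194, the number of children with an ILLCAUSE score; both are illustrative inputs rather than results from the handout.

```stata
* Share of variance in one variable associated with the other.
display 0.3^2

* t statistic for testing r = 0 with n = 194 (n is an illustrative assumption).
display 0.3*sqrt(194-2)/sqrt(1-0.3^2)

* Two-sided p-value from the t distribution with n - 2 degrees of freedom.
display 2*ttail(194-2, 0.3*sqrt(194-2)/sqrt(1-0.3^2))
```

The squared correlation is 0.09, so only about 9% of the variance is shared, even though the t statistic of about 4.4 puts the p-value well below 0.001.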

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 22

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?

EDA Reflex 3c: Outcome and Other Predictors

[Figures: scatterplots of Understanding of Illness Causality Score against Hollingshead SES, Peabody Picture Vocabulary Test Score, Chronological Age (Months), and General Reasoning Ability Score.]

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 23

“Jitter” discrete scatterplots to distinguish among overlapping datapoints.

Decrease msize or “sample down” until you can feel the “grain” of the data.
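The two plotting tips above can be sketched outside Stata as well. The following Python fragment is only an illustration, under the assumption that jittering means adding a small uniform random offset and that sampling down means drawing a random subset of points. The helper names, the jitter width, and the example scores are all hypothetical.

```python
import random

def jitter(values, width=0.2, seed=0):
    """Add uniform noise in [-width, width] so tied points no longer overlap exactly."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

def sample_down(points, fraction, seed=0):
    """Keep a random subset of the points so that dense regions become readable."""
    rng = random.Random(seed)
    keep = max(1, round(len(points) * fraction))
    return rng.sample(points, keep)

# Hypothetical discrete SES codes (1 to 5) with many ties
ses = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5]
ses_jittered = jitter(ses)
subset = sample_down(list(zip(ses_jittered, ses)), fraction=0.5)
```

Keep the jitter width small relative to the spacing of the discrete values, so the plot still shows which category each point belongs to.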


Preparing Dummy Variables, Transformations, and Interactions

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 24

*--------------------------------------------------------------------------------------
* Create Dummy Variables, Transformations, and Interactions
*--------------------------------------------------------------------------------------
* Dummy variables for health status: Diabetes, Asthma, Healthy
gen D=(HEALTH==3)
gen A=(HEALTH==5)
gen H=(HEALTH==6)
* Log of age
gen LAGE = log(AGE)
* Health status by logAGE interactions
gen DxLAGE = D*LAGE
gen AxLAGE = A*LAGE
gen HxLAGE = H*LAGE
* Health status by SES interactions
gen DxSES = D*SES
gen AxSES = A*SES
gen HxSES = H*SES
* logAGE by SES interaction
gen LAGExSES = LAGE*SES
* Health status by SES by logAGE interactions
gen DxLAGExSES = D*LAGE*SES
gen AxLAGExSES = A*LAGE*SES
gen HxLAGExSES = H*LAGE*SES

Creating dummy variables for Diabetes, Asthma, and Healthy. Recall that only $k-1$ of $k$ categories are necessary to specify the model, but we overspecify here for illustration.

Creating interaction variables to ask the question: Does the “effect” of $X$ on $Y$ depend upon $Z$? Interpretation often requires graphical displays, as we will demonstrate.

Create new variables on the fly with the “generate” command. As you can see, many commands can be abbreviated. You’ll learn more about these codes in section next week.

Creating a log-AGE variable to account for the slight (marginal) curvilinearity and the occasional theoretical support for log-AGE or otherwise nonlinear variables on developmental scales.
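For readers who want to see what these generated variables contain, here is a minimal Python sketch that mirrors the `gen` statements for single records. It assumes the same coding as the Stata code (HEALTH equal to 3, 5, or 6 for Diabetes, Asthma, and Healthy) and natural logs. The function name and the three example records are hypothetical, not taken from the dataset.

```python
import math

def make_predictors(health, age, ses):
    """Build dummies, log-age, and interaction terms for one child's record."""
    d = int(health == 3)   # diabetic
    a = int(health == 5)   # asthmatic
    h = int(health == 6)   # healthy
    lage = math.log(age)   # natural log, as with Stata's log()
    return {
        "D": d, "A": a, "H": h, "LAGE": lage,
        # two-way interactions: each product is zero unless the child is in that group
        "DxLAGE": d * lage, "AxLAGE": a * lage, "HxLAGE": h * lage,
        "DxSES": d * ses, "AxSES": a * ses, "HxSES": h * ses,
        "LAGExSES": lage * ses,
        # three-way interactions
        "DxLAGExSES": d * lage * ses,
        "AxLAGExSES": a * lage * ses,
        "HxLAGExSES": h * lage * ses,
    }

# Hypothetical records: (HEALTH code, AGE in months, SES)
records = [(3, 120, 2), (5, 96, 4), (6, 150, 1)]
rows = [make_predictors(*rec) for rec in records]

# D + A + H is 1 for every child, which is why all three dummies cannot
# enter a model that also has an intercept.
dummy_sums = [row["D"] + row["A"] + row["H"] for row in rows]
```

The constant sum of the three dummies is the overspecification noted on the slide: one of them must be left out of any model with an intercept.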


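. regress ILLCAUSE D A

      Source |       SS       df       MS              Number of obs =     194
-------------+------------------------------           F(  2,   191) =   23.45
       Model |  39.7373141     2   19.868657           Prob > F      =  0.0000
    Residual |  161.809826   191  .847171864           R-squared     =  0.1972
-------------+------------------------------           Adj R-squared =  0.1888
       Total |   201.54714   193   1.0442857           Root MSE      =  .92042

------------------------------------------------------------------------------
    ILLCAUSE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           D |  -.8373226   .1864973    -4.49   0.000    -1.205182   -.4694638
           A |  -.9355971   .1468597    -6.37   0.000    -1.225272   -.6459219
       _cons |   4.603656    .095443    48.23   0.000     4.415398    4.791914
------------------------------------------------------------------------------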

The Baseline Regression Model

© Andrew Ho, Harvard Graduate School of Education Unit 1– Slide 25

We test the effect of a categorical variable by creating dummy variables.

In an unadjusted model, the constant refers to the excluded group’s mean, and the coefficients refer to the simple mean difference between the dummy group’s mean and the excluded group’s mean. When in doubt, write out the prediction equation:

$$\widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84\,D_i - 0.94\,A_i$$

The ratio of these variances (the Model and Residual mean squares) gives us our $F$ statistic, the higher the better (over 4ish generally suffices to reject the omnibus null hypothesis).

Model and Residual sums of squares: “good” variation and “bad” variation, respectively. Good/Total = R-squared.

My favorite alternative to the ubiquitous and overused R-squared statistic, the RMSE is interpretable on the scale of $Y$ as the standard deviation after accounting for the predictors.

The $t$ statistic: the number of standard errors away from the null value. An absolute $t$ greater than 2 generally does the trick.

The probability of observing a $t$ that big or greater (in absolute value), assuming that the null hypothesis is true, and given all else in the model. To reject the null hypothesis, our cutoff is .05 by convention.
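As a check on how the pieces of the regression output fit together, the following Python sketch recomputes the summary statistics from the sums of squares, degrees of freedom, coefficients, and standard errors printed in the table. Only the variable names are new; every number is copied from the Stata output.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA block
ss_model, df_model = 39.7373141, 2
ss_resid, df_resid = 161.809826, 191
ss_total, df_total = ss_model + ss_resid, df_model + df_resid

ms_model = ss_model / df_model             # "good" variance
ms_resid = ss_resid / df_resid             # "bad" variance
f_stat = ms_model / ms_resid               # ratio of the two mean squares
r_squared = ss_model / ss_total            # Good / Total
adj_r_squared = 1 - ms_resid / (ss_total / df_total)
rmse = math.sqrt(ms_resid)                 # residual SD on the ILLCAUSE scale

# t = coefficient / standard error, for each row of the coefficient table
coefs = {
    "D": (-0.8373226, 0.1864973),
    "A": (-0.9355971, 0.1468597),
    "_cons": (4.603656, 0.095443),
}
t_stats = {name: b / se for name, (b, se) in coefs.items()}

print(f"F = {f_stat:.2f}, R2 = {r_squared:.4f}, adj R2 = {adj_r_squared:.4f}, RMSE = {rmse:.5f}")
print({name: round(t, 2) for name, t in t_stats.items()})
```

Each value agrees with the corresponding entry in the output to the precision Stata prints.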


© Andrew Ho, Harvard Graduate School of Education

The first fitted regression model from the Data Analytic Handout is:

$$\widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84\,D_i - 0.94\,A_i$$

From it, you can estimate the predicted value of ILLCAUSE in each health status group by substituting numerical values of the health status predictors that represent prototypical individuals in the dataset:

$$\begin{aligned}
\text{Healthy } (D=0,\ A=0):\quad & \widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84(0) - 0.94(0) = 4.60 \\
\text{Diabetic } (D=1,\ A=0):\quad & \widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84(1) - 0.94(0) = 4.60 - 0.84 = 3.76 \\
\text{Asthmatic } (D=0,\ A=1):\quad & \widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84(0) - 0.94(1) = 4.60 - 0.94 = 3.66
\end{aligned}$$

Notice that the predicted outcome value for one of the groups (the reference, omitted, or comparison group; here, healthy children) is obtained when the two dichotomous predictors that distinguish the chronically ill children are both set to zero. This means that, if you have an intercept in the model, you need one fewer dummy predictor in the model than there are groups compared, as the fitted value for the “reference (or omitted) group” is provided by the estimated intercept.

Another way of thinking about this is to understand that, although there are three distinct health status groups present, only two independent pieces of information are needed to indicate the health status of a child, because if a child is neither diabetic nor asthmatic then s/he must be healthy, by default.

Of course, you get to choose which of the health status groups serves as the reference, because you are the one who picks which dummy predictor is omitted from the regression model. Typically, you make this choice for substantive, not statistical, reasons.
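The substitution of prototypical values described above can be sketched in a few lines of Python. This uses the full-precision coefficients from the `regress ILLCAUSE D A` output, so the group predictions can differ in the second decimal from arithmetic done with rounded coefficients. The last part, which re-expresses the same fit with asthmatic children as the reference group, illustrates the point about choosing the reference. It is computed here from the fitted means and is not output from the original slides.

```python
# Coefficients from the fitted model: intercept, D (diabetic), A (asthmatic)
b0, b_d, b_a = 4.603656, -0.8373226, -0.9355971

def predict(d, a):
    """Predicted ILLCAUSE for a child with dummy values D = d and A = a."""
    return b0 + b_d * d + b_a * a

groups = {"Healthy": (0, 0), "Diabetic": (1, 0), "Asthmatic": (0, 1)}
fitted = {name: predict(d, a) for name, (d, a) in groups.items()}

# Same three group means, re-expressed with Asthmatic as the reference group:
# the intercept becomes the asthmatic mean, and the slopes become differences from it.
alt_intercept = fitted["Asthmatic"]
alt_slopes = {
    "Healthy": fitted["Healthy"] - alt_intercept,
    "Diabetic": fitted["Diabetic"] - alt_intercept,
}

for name, value in fitted.items():
    print(f"{name}: {value:.3f}")
```

Changing the reference group changes what the intercept and slopes mean, but the three predicted group values stay the same.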

Appendix 1: Two Dummy Predictors Distinguish Among Three Groups

Unit 1– Slide 26


© Andrew Ho, Harvard Graduate School of Education

Inspection of the fitted values computed on the previous slide indicates that the fitted regression parameters that we obtained in the analysis (that is, the estimated intercept parameter and the two estimated slope parameters associated with the dummy predictors representing health status) can be interpreted as follows:

$$\widehat{\mathit{ILLCAUSE}}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i + \hat{\beta}_2 A_i$$

The fitted slope parameter associated with dummy predictor A represents the difference in the predicted value of ILLCAUSE between the asthmatic and “reference” healthy children: it is our best estimate of the difference between asthmatic and healthy children, on average, in the population (-0.94).

The fitted slope parameter associated with dummy predictor D represents the difference in the predicted value of ILLCAUSE between diabetic and “reference” healthy children: it is our best estimate of the difference between diabetic and healthy children, on average, in the population (-0.84).

The fitted intercept represents the predicted value of ILLCAUSE (4.60) for those in the reference (or omitted) category: it is our best estimate of the understanding of healthy children, on average, in the population.

Unit 1– Slide 27

Appendix 1: Two Dummy Predictors Distinguish Among Three Groups


© Willett, Harvard University Graduate School of Education

“Baseline Control Model” Approach:

• Form a baseline control model by sequentially adding control predictors, highest priority first, and testing for appropriate interactions as you go along.

• Then, add the main effects of the question predictors to the new baseline control model.

• Then, add interactions between the question predictors and the control predictors in the baseline control model, sequentially.

• Finally, add interactions between the question predictors.

Here, the objective is to obtain a parsimonious model that controls away all extraneous variation first, and then focus attention on the impact of the question predictors. While this approach refines your view of the impact of the question predictors, removing that part of their effect that may depend on the inter-relationships with the controls, it never reveals the “total” impact of the question predictors on the outcome for a person who has been randomly selected from the population without regard to any of their other characteristics.

“Work Back From The End” Approach:

• Include all possible predictors in the model, both their main effects and interactions.

• Then, remove statistically unimportant predictors sequentially to achieve a more parsimonious model, starting with those of lowest declared priority that do not appear to have statistically significant effects (i.e., remove question predictors last).

• Make sure that you remove any statistically unimportant interactions ahead of any of the main effects from which they are constituted.

Here, the objective is to obtain a final parsimonious model by sequentially removing predictors that appear unimportant. The idea is that you get to see the impact of “everything” to start with, and then you can “slim down” the fitted model to a final model. However, the impact of main effects is always masked when interactions are present in the model, and you still may remove an important predictor whose correlation with another predictor makes it look unimportant.

Devise Your Own Strategy?

• It’s acceptable to devise your own strategy; in fact, it’s probably the best approach, as you know the field best!

But remember that your strategy must be systematic and sensible, and you must explain it explicitly to your reader, describing the logic that underpins its organization.

Unit 1– Slide 28

Appendix 2: Reasonable Strategies For Selecting Among Regression Models