Unit 1a: Fitting Sensible Taxonomies of Multiple Regression Models
© Andrew Ho, Harvard Graduate School of Education Unit 1 – Slide 1
http://xkcd.com/943/
Scared!
This sounds familiar!
Logistic regression isn’t so ba-
Ack, Discrete Time what now?
Clustering… seems intuitive
Principal components?!
Final project
Whoa, fixed and random effects?
Course Principle 1: Layering
• In my experience learning and teaching statistics, fluency comes only after repeated exposure to concepts in varying contexts. Think of this as “layering.”
So: Be Patient. Get Exposed.
Course Principle 2: Language
• Learning statistics is like learning a language. Fluency comes from active use of new vocabulary in genuine conversations about meaningful content.
So: Find a partner. Find a study group. Speak, read, write, and hear statistics. Immerse yourself.
Course Principle 3: Disciplined Perception
• As you learn, layer, and gain fluency, you will hopefully begin to view graphical representations, text, and real-life situations with new eyes.
So: If at first it seems like gibberish, don’t despair.
This takes practice.
Course Principle 4: Exploratory Data Analysis
• Attributed to the famous statistician John Tukey, EDA is an analytic approach that emphasizes graphical representations and descriptive statistics to give analysts an intuitive sense of their data before they begin fitting models.
• When faced with new data, you will learn to reflexively:
  – Get your eyes on the data.
  – Note summary statistics.
  – Visualize univariate distributions.
  – Visualize bivariate and multivariate relationships.
  – Graphically summarize the fit of the model to data.
[Figures: histogram (Frequency) of the index of self-reported sexual activity at baseline (two years prior); histogram (Frequency) of reported importance of religion and religious attendance at baseline; plot of the index of self-reported sexual activity (vertical axis) against a predictor scaled from .2 to 1.]
          activity   sexmedia   baseact    age        pubearly   male       black      sesindex   schengag   religion   parengag   pdisapp    fabstain
activity  1
sexmedia  0.294***   1
          (0.000)
baseact   0.537***   0.276***   1
          (0.000)    (0.000)
age       0.164***   0.147***   0.269***   1
          (0.000)    (0.000)    (0.000)
pubearly  0.183***   0.127***   0.128***   0.0349     1
          (0.000)    (0.000)    (0.000)    (0.300)
male      0.0313     -0.361***  0.0821*    0.0960**   0.0125     1
          (0.351)    (0.000)    (0.014)    (0.004)    (0.709)
black     0.110**    0.379***   0.113***   0.0526     0.0667*    -0.00328   1
          (0.001)    (0.000)    (0.001)    (0.117)    (0.047)    (0.922)
sesindex  0.0482     0.0199     0.00226    -0.0197    -0.00344   0.00984    0.205***   1
          (0.152)    (0.554)    (0.946)    (0.558)    (0.918)    (0.770)    (0.000)
schengag  -0.191***  -0.211***  -0.201***  -0.138***  -0.0543    -0.0994**  -0.337***  -0.0395    1
          (0.000)    (0.000)    (0.000)    (0.000)    (0.106)    (0.003)    (0.000)    (0.240)
religion  0.0206     0.138***   0.0261     -0.00503   0.0675*    -0.0436    0.274***   0.0302     -0.0105    1
          (0.540)    (0.000)    (0.437)    (0.881)    (0.044)    (0.195)    (0.000)    (0.369)    (0.754)
parengag  -0.232***  -0.207***  -0.288***  -0.159***  -0.0888**  0.00962    -0.116***  0.0280     0.307***   0.0798*    1
          (0.000)    (0.000)    (0.000)    (0.000)    (0.008)    (0.775)    (0.001)    (0.405)    (0.000)    (0.017)
pdisapp   -0.235***  -0.101**   -0.274***  -0.171***  -0.137***  -0.286***  -0.173***  -0.0642    0.255***   0.0741*    0.218***   1
          (0.000)    (0.003)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.056)    (0.000)    (0.027)    (0.000)
fabstain  -0.310***  -0.354***  -0.341***  -0.258***  -0.0995**  -0.0241    -0.394***  -0.0673*   0.344***   -0.170***  0.246***   0.290***   1
          (0.000)    (0.000)    (0.000)    (0.000)    (0.003)    (0.474)    (0.000)    (0.045)    (0.000)    (0.000)    (0.000)    (0.000)
N         887
p-values in parentheses
* p<0.05, ** p<0.01, *** p<0.001
. summarize, separator(14)

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       887    2626.545    1495.928          6       5217
    activity |       887    4.324961    1.801506       1.01          7
     baseact |       887    2.818489    1.567498          1          7
         age |       887    15.71466    .6995205      14.04      18.35
    pubearly |       887    2.011274    .3709047          1          3
    sexmedia |       887    .5314092    .1451268        .14        .94
        male |       887    .4937993    .5002436          0          1
       black |       887    .5039459    .5002665          0          1
    sesindex |       887    .0047689    .6077198      -1.44       1.47
    schengag |       887    .0132243    .7676371         -2       1.66
    religion |       887      .01708    .8769921      -2.21         .8
    parengag |       887   -.0019955    .7886763         -2       1.42
     pdisapp |       887    .7192785    .4496052          0          1
    fabstain |       887    .4092446    .4919719          0          1
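The five EDA reflexes listed on this slide can be sketched as a short do-file. This is only an illustration: it assumes Stata's built-in auto dataset as a stand-in (it is not the data shown on the slide), and the graph names are made up for the example.

```stata
* Stand-in data: Stata's built-in auto dataset, not the lecture's data.
sysuse auto, clear

* Reflex 1: get your eyes on the data.
list make price mpg foreign in 1/5, clean

* Reflex 2: note summary statistics.
summarize price mpg weight

* Reflex 3: visualize a univariate distribution.
histogram mpg, frequency name(eda_hist, replace)

* Reflex 4: visualize a bivariate relationship.
scatter price weight, name(eda_scatter, replace)

* Reflex 5: graphically summarize the fit of a model to the data.
twoway (scatter price weight) (lfit price weight), name(eda_fit, replace)
```

The same sequence applies to any new dataset once the variable names are swapped in.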
Course Roadmap: Unit 1a
Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
• Do your residuals meet the required assumptions?
  – Test for residual normality.
  – Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis.
  – Use individual growth modeling.
  – Specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use…
  – Binomial logistic regression analysis (dichotomous outcome).
  – Multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with,
  – Create taxonomies of fitted models and compare them.
  – Form composites of the indicators of any common construct.
  – Conduct a Principal Components Analysis.
  – Use Cluster Analysis.
  – Use Factor Analysis: EFA or CFA?
• If your outcome vs. predictor relationship is non-linear,
  – Use non-linear regression analysis.
  – Transform the outcome or predictor.
Today’s Topic Area
You can keep track of your progress through this course by referencing the outline in the course syllabus.
A Simple Path through S-052
1. Taxonomies of Regression Models
2. Nonlinear Regression
3. Nonindependent Residuals
4. Logistic Regression
5. Discrete-Time Survival Analysis
6. Forming Composites
7. Cluster Analysis
8. Factor Analysis
The ILLCAUSE Dataset and Research Question: Know Your Dataset
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
Dataset ILLCAUSE.txt
Overview Data for investigating differences in children’s understanding of the causes of illness, by their health status.
Source Perrin E.C., Sayer A.G., and Willett J.B. (1991). Sticks And Stones May Break My Bones: Reasoning About Illness Causality And Body Functioning In Children Who Have A Chronic Illness, Pediatrics, 88(3), 608-19.
Sample size 301 children, including a sub-sample of 205 who were described as asthmatic, diabetic, or healthy. After further reductions due to the list-wise deletion of cases with missing data on one or more variables, the analytic sub-sample used in class ends up containing 33 diabetic children, 68 asthmatic children and 93 healthy children.
More info Chronically-ill children were recruited into the study through their pediatricians; healthy children were a matched random sample drawn from the same schools as the ill children.
Updated September 16, 2005
Dataset on website: ILLCAUSE.txt. Codebook on website: ILLCAUSE_info
The ILLCAUSE Dataset: Know Your Variables
Each entry gives: Col, Variable, Variable Description, Variable Metric/Labels.

Col 1, ID
  Description: Child identification code.
  Metric/Labels: Integers.

Col 2, ILLCAUSE
  Description: Child’s score on a measure of the understanding of illness causality.
  Metric/Labels: Ordinal score obtained by averaging child responses to 7 interview questions on the causes of illness, with responses rated on a “developmental” scale:
    1 = No response.
    2 = Phenomenistic or circular response.
    3 = External agency cited as sole cause.
    4 = Internalization in understanding illness, once agent internalized, illness is inevitable.
    5 = Interaction of host and agent described.
    6 = Mechanisms of illness causation described, including notions of treatment and bodily response.

Col 3, SES
  Description: Family socio-economic status, rated using the education and employment levels of the primary bread-winner with the Hollingshead Two-Factor Index of Social Position. (Hollingshead & Redlich. Social Class and Mental Illness. NY: Wiley, 1958)
  Metric/Labels: Ordinal rating of social class (notice that the ordering of the numerical values is counterintuitive):
    1 = upper
    2 = upper middle
    3 = middle
    4 = lower middle
    5 = low

Col 4, PPVT
  Description: Child’s normed score on the Peabody Picture Vocabulary Test.
  Metric/Labels: Continuous score, mean of 100 and standard deviation of 15 in the population.

Col 5, AGE
  Description: Child age.
  Metric/Labels: Continuous variable, months since birth.

Col 6, GENREAS
  Description: Child’s score on a measure of general reasoning.
  Metric/Labels: Ordinal score, from 1 through 6. Similar to ILLCAUSE, but requires general reasoning, rather than reasoning about illness.

Col 7, HEALTH
  Description: Child Health Status Indicator.
  Metric/Labels: Categorical variable with multiple categories, of which we are interested in:
    3 = Diabetic
    5 = Asthmatic
    6 = Healthy
The Data Analytic Handout: Boiler Plate
*--------------------------------------------------------------------------------
* S-052: APPLIED DATA ANALYSIS
* Data-Analytic Handout Unit 1a
*
* Unit 1 Conducting Sensible Multiple Regression Analyses.
* Unit 1a Fitting Taxonomies of Multiple Regression Models.
* Introducing the Data.
*
* RQ: Do Children Who Are Chronically Ill Understand the Causes of Illness
* Better Than Do Healthy Children?
*
* Programming:
* Stata Version: Stata 11 SE.
* Original Authors: Andres Molano, Monica Yudron & John B. Willett.
* Modifications by Andrew Ho
* Last Modified: Jan 28, 2013.
*--------------------------------------------------------------------------------

*--------------------------------------------------------------------------------
* Set the critical parameters of the computing environment.
*--------------------------------------------------------------------------------
* Specify the version of Stata to be used in the analysis:
version 11.0
* Clear all computer memory and delete any existing stored graphs and matrices:
clear
graph drop _all
clear matrix
* Define the local directory:
cd "C:\Users\hoan\Documents\My Dropbox\S-052\Stata\"

*--------------------------------------------------------------------------------
* Open a log to contain a permanent record of the syntax and analytic output.
*--------------------------------------------------------------------------------
log using "Unit1a.log", replace
You can title Stata programs and code using comments. Include:
• Name of the Handout.
• Link to the Syllabus.
• Substantive Theme (RQ).
• Programming logistics.
Any current version of Stata can recognize code written according to the rules of any previous version of the software.
Clear everything out before the current program executes.
Define the local directory, so Stata knows where to write its logs, etc.
Begin a log file to record output. Your lab book.
All units will include a “Data Analytic Handout” that records the Stata code used in lectures. As you download, adapt, and annotate these handouts, they will serve as a lasting code library.
Any line that begins with an asterisk is a comment.
It doesn’t matter what it ends with.
The Data Analytic Handout: Reading In Data
*--------------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and their values.
*--------------------------------------------------------------------------------
* Input the dataset:
infile ID ILLCAUSE SES PPVT AGE GENREAS HEALTH ///
    using "C:\Users\hoan\Documents\My Dropbox\S-052\Raw Data\ILLCAUSE.txt"

* Label the variables in the dataset:
label variable ID "Child Identification Code"
label variable ILLCAUSE "Understanding of Illness Causality Score"
label variable SES "Hollingshead SES"
label variable PPVT "Peabody Picture Vocabulary Test Score"
label variable AGE "Chronological Age (Months)"
label variable GENREAS "General Reasoning Ability Score"
label variable HEALTH "Health Status"
* Label the values of categorical question predictor HEALTH:
label define HEALTHLBL 3 "Diabetic" 5 "Asthmatic" 6 "Healthy"
label values HEALTH HEALTHLBL

*--------------------------------------------------------------------------------------
* Subset and save an analytic sample with only healthy, asthmatic,
* & diabetic children.
*--------------------------------------------------------------------------------------
keep if HEALTH==3 | HEALTH==5 | HEALTH==6
save "ILLCAUSE S052 subset.dta", replace
Try to run full .do files all the way from the raw data, preserving each step in your analysis as if in a lab book. A carefully annotated .do file should be all one needs to replicate your analysis.
In the infile command, specify the names of the variables (in order of their appearance in the dataset) and the location of the data-file that contains the raw data.
The data are read into the Stata active file, and held in memory.
Note "///" allows a command to continue on the next line.
Variables are easily labeled with informative names.
We capitalize variable names, although this is not common Stata convention.
You can also label the values of categorical variables. Here, we will subsequently focus on only the Diabetic, Asthmatic, and Healthy children to simplify the analysis, so I have named only those values of HEALTH.
I keep only Diabetic, Asthmatic, and Healthy children in the active file and save the data subset, which will reside in the default directory set earlier in this .do file.
EDA Reflex 1: List Data; Univariate Summary Statistics
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
. list in 1/10, clean

        ID   ILLCAUSE   SES   PPVT   AGE   GENREAS     HEALTH
  1.   301          .     2    138   128     4.802   Diabetic
  2.   302      2.857     2    102    79     2.188   Diabetic
  3.   303      3.429     3     84   151     3.302   Diabetic
  4.   304      4.286     3     98   178     5.219   Diabetic
  5.   305      4.286     4     80   113       2.5   Diabetic
  6.   306      3.286     2    106    81     3.833   Diabetic
  7.   307      2.857     4     75   194     3.833   Diabetic
  8.   308      3.429     2    102    90     3.677   Diabetic
  9.   309      3.429     2    109   133     4.396   Diabetic
 10.   310      3.143     1    125   147     4.146   Diabetic

. summarize, separator(`c(k)')

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          ID |       205    558.1073    138.8618        301        748
    ILLCAUSE |       194    4.133284    1.021903      1.571          6
         SES |       205    2.287805    .9852329          1          5
        PPVT |       205    112.7659    15.81678         73        162
         AGE |       205    131.7902     40.4458         61        203
     GENREAS |       203    4.124384    1.095722       1.75          6
      HEALTH |       205    5.117073    1.078284          3          6
List the first 10 observations. Tell yourself the story of an observation, e.g., “The child with ID 304 (observation 4) is a diabetic, middle-class child with an average vocabulary and high general reasoning who has a fairly high understanding of the causes of illness, 4.3 on a 1-6 scale.”
Note missing data (“.”), get a sense of scale (mean and sd), and check the ranges for any anomalies.
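One way to make the "note missing data" reflex concrete is to count missing values explicitly instead of reading them off the Obs column. The sketch below is self-contained and uses only the first four ID and ILLCAUSE values from the listing; the variable name `score` is a placeholder chosen for the example.

```stata
* Toy data: the first four ID and ILLCAUSE values from the listing.
clear
input id score
301 .
302 2.857
303 3.429
304 4.286
end

* Count observations with a missing outcome.
count if missing(score)

* Tabulate missingness for every variable that has any.
misstable summarize

* summarize uses only nonmissing values, so Obs is 3 here, not 4.
summarize score
```

In the full summary table the same logic explains why ILLCAUSE shows 194 observations while ID shows 205: 11 children have no ILLCAUSE score.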
EDA Reflex 2a: Univariate Visualizations (Start with the Outcome)
• Ordinal score obtained by averaging child responses to 7 interview questions on the causes of illness, with responses rated on a “developmental” scale:
  – 1 = No response.
  – 2 = Phenomenistic or circular response.
  – 3 = External agency cited as sole cause.
  – 4 = Internalization in understanding illness, once agent internalized, illness is inevitable.
  – 5 = Interaction of host and agent described.
  – 6 = Mechanisms of illness causation described, including notions of treatment and bodily response.
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
[Figures: two frequency histograms of the Understanding of Illness Causality Score, one per command below.]
hist ILLCAUSE, discrete freq name(Unit1a_g1,replace)
hist ILLCAUSE, freq
EDA Reflex 2b: Predictor Variable Distributions
[Figures: histograms of Hollingshead SES (Percent), Peabody Picture Vocabulary Test Score (Frequency), Chronological Age in months (Frequency), and General Reasoning Ability Score (Frequency).]
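. tabulate HEALTH

     Health |
     Status |      Freq.     Percent        Cum.
------------+-----------------------------------
   Diabetic |         36       17.56       17.56
  Asthmatic |         73       35.61       53.17
    Healthy |         96       46.83      100.00
------------+-----------------------------------
      Total |        205      100.00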
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
The Question Predictor
Remember that non-normal predictor variable distributions are not necessarily a threat to the regression model. The regression assumption is: independent and identically normally distributed residuals.
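Because the assumption concerns residuals rather than predictors, the check happens after a model is fitted. A minimal sketch, assuming Stata's built-in auto dataset as a stand-in and an arbitrary one-predictor model; the variable name `resid` and the graph names are chosen for the example.

```stata
* Stand-in data and model; not the ILLCAUSE analysis.
sysuse auto, clear
regress price weight

* Save the raw residuals from the fitted model.
predict resid, residuals

* Compare the residual distribution with a normal curve.
histogram resid, normal name(resid_hist, replace)

* Normal quantile plot of the residuals.
qnorm resid, name(resid_qq, replace)
```

The histograms of the predictors themselves play no part in this check.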
[Figures: box plots and kernel density plots of the Understanding of Illness Causality Score for Diabetic, Asthmatic, and Healthy children.]
EDA Reflex 3a: Bivariate Visualizations (Outcome and Question Predictor)
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
You are looking for overall mean differences while staying attuned to absolute and relative sample sizes.
Note that the order of box plots for categorical predictors is arbitrary and does not raise concerns about nonlinearity.
Be cautious about overinterpreting Kernel Density plots with low sample sizes.
. tabulate HEALTH, summarize(ILLCAUSE)

            | Summary of Understanding of Illness
     Health |          Causality Score
     Status |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Diabetic |   3.7663333   .76587771          33
  Asthmatic |   3.6680588   .86861936          68
    Healthy |   4.6036559   1.0026495          93
------------+------------------------------------
      Total |   4.1332835    1.021903         194
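As a worked check on the group means reported by this command, the raw mean differences relative to healthy children follow by subtraction. This is arithmetic on the rounded means only, not an adjusted estimate:

```latex
\begin{align*}
\bar{Y}_{\text{Diabetic}} - \bar{Y}_{\text{Healthy}} &= 3.766 - 4.604 \approx -0.84 \\
\bar{Y}_{\text{Asthmatic}} - \bar{Y}_{\text{Healthy}} &= 3.668 - 4.604 \approx -0.94
\end{align*}
```

Both chronically ill groups average a little under one point lower than the healthy group on the 1-6 scale, which is close to one overall standard deviation (1.02).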
So chronic illness decreases kids’ understanding of the causes of illness?
We’re not that naïve! We haven’t randomly assigned illnesses to kids, so no causal inference is warranted. We might wish to include additional predictors to account for the confounding of illness and other variables. Perhaps add AGE and SES as your control predictors.
And proceed with a multiple regression analysis …
… and then, what about non-linear expressions of the continuous predictors, or categorical versions, or what if you add another predictor like gender or race, or … How Many Potential Models Would There Be, Then?
The task seems ok until you begin to enumerate how many possible models you can actually specify using just these few predictors …

Three models with 1 main effect:
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{AGE}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{SES}_i + \varepsilon_i$

Three models with 2 main effects:
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{SES}_i + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{SES}_i + \varepsilon_i$

Three models with 2 main effects and 1 two-way interaction:
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 (\text{HEALTH}_i \times \text{AGE}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{SES}_i + \beta_3 (\text{HEALTH}_i \times \text{SES}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{SES}_i + \beta_3 (\text{AGE}_i \times \text{SES}_i) + \varepsilon_i$

One model with 3 main effects:
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \varepsilon_i$

Three models with 3 main effects and 1 two-way interaction:
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \beta_4 (\text{HEALTH}_i \times \text{AGE}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \beta_4 (\text{HEALTH}_i \times \text{SES}_i) + \varepsilon_i$
$\text{ILLCAUSE}_i = \beta_0 + \beta_1 \text{HEALTH}_i + \beta_2 \text{AGE}_i + \beta_3 \text{SES}_i + \beta_4 (\text{AGE}_i \times \text{SES}_i) + \varepsilon_i$

Three models with 3 main effects and 2 two-way interactions
One model with 3 main effects and 3 two-way interactions
and so on …
How big is the landscape of possible models?
Willett’s Rule: The approximate number of feasible regression models that can be specified increases exponentially with the number of potential predictors:
$\text{number of models} \approx \tfrac{1}{4}\, e^{1.5 \times (\text{number of predictors})}$
5 Predictors: 450 Models
10 Predictors: 817,250 Models
How can we navigate the landscape of potential models to find the “best” model, theoretically, practically, and empirically?
One-Predictor Model
Two-Predictor Model
Two-Predictors + an Interaction Model
Three Predictors and an Interaction?
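The two figures quoted for Willett's Rule can be checked numerically. The sketch below assumes the rule reads as one quarter of e raised to 1.5 times the number of predictors; that reading is an assumption, adopted because it reproduces the slide's 450 and 817,250 after rounding.

```stata
* Assumed form of the rule: models = (1/4) * exp(1.5 * k),
* where k is the number of potential predictors.
foreach k of numlist 5 10 {
    display "k = `k': about " %12.0fc 0.25*exp(1.5*`k') " models"
}
```

Under this reading the values are roughly 452 and 817,254. Every additional predictor multiplies the count by e^1.5, about 4.5.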
Model selection algorithms (that you should never use!)
Automated model building strategies (that you may see in journal articles)
1. All possible subsets: all regression models.
2. Best subsets: best m models with 1, 2, …, k predictors, along some criterion: R², adjusted R², (Cp, AIC, BIC…)…
3. Forward selection: start with no predictors and sequentially add them so that each maximally increases the R² statistic at that step.
4. Backwards elimination: start with all predictors and sequentially drop them so that each minimally decreases the R² statistic at that step.
5. Stepwise regression: some combination of 3 and 4, such as forward selection with occasional “glances” backwards to see if anything has changed.
Your instincts should tell you that this is both tempting and loony. Of these approaches, Best Subsets can be the most informative in a many-predictor context, reminding you that many possible models can lead to very similar global fit statistics. In general, however, these data mining approaches should be avoided.
“All models are wrong, but some are useful.” George E.P. Box (1979)
“Far better an approximate answer to the right question…than an exact answer to the wrong question.” John W. Tukey (1962)
“The hallmark of good science is that it uses models and ‘theory’ but never believes them.” Attributed to Martin Wilk in Tukey (1962)
Occam’s razor: entia non sunt multiplicanda praeter necessitatem. Entities must not be multiplied beyond necessity. If two competing theories lead to the same predictions, the simpler one is better. William of Occam (14th century)
Managing and Categorizing Multiple Predictors: A Framework
• If you are managing the study from the design stage, you should have included your variables with intent from the beginning.
• If in contrast you are coming in cold as a secondary data analyst, prioritizing and classifying variables is part of your job.
First, identify your outcome variable: here, ILLCAUSE.
Second, consider classes of predictors, based on substance (i.e., your research questions and theoretical framework), your research design, … past literature, etc., and establish the priorities among these predictors.

Priority: High
  Predictor: HEALTH is the Key Question Predictor.
  Comment: Without the presence of HEALTH in the final model, we cannot address the research question!
Priority: Medium
  Predictor: AGE is a key “Design” Control Predictor because it represents the multi-cohort nature of the research design.
  Comment: Our sample contains multiple sub-samples (“cohorts”) of children at different ages. By controlling for AGE, we can pool all the children into the same analysis, regardless of their age, rather than doing separate “age-by-age slice” analysis (not recommended!)
Priority: Low
  Predictor: SES is a subsidiary substantive Control Predictor. It is often worth including because, as John Willett likes to say, “some twit will always ask you if it matters.”
  Comment: Ill children may have lower SES, on average. If understanding illness also depends on home resources, the effect of SES could masquerade as an effect of HEALTH.

Third, choose a sensible order in which to enter the predictor classes into the regression model, again based on substance and your research questions.
Fourth, enter the predictors systematically within their classes, exhausting each class before proceeding to the next. At each step, once the main effects have been exhausted, consider including the interactions, again driven by theory: does the effect of one predictor on the outcome depend on the levels of another predictor or, equivalently, vice versa?
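The class-by-class entry described in these steps can be sketched as a small taxonomy of stored models. This is an illustration only, assuming Stata's built-in auto dataset as a stand-in: `foreign` plays the role of the question predictor and `weight` a control. The names `M1` to `M3` and `wtXfor` are made up for the example.

```stata
* Stand-in data; not the ILLCAUSE analysis.
sysuse auto, clear

* Hand-built interaction term, in the style of the course handouts.
generate wtXfor = weight*foreign

* Step 1: the question predictor alone.
regress price foreign
estimates store M1

* Step 2: add the control predictor.
regress price foreign weight
estimates store M2

* Step 3: with main effects exhausted, consider the interaction.
regress price foreign weight wtXfor
estimates store M3

* Lay the fitted models side by side for comparison.
estimates table M1 M2 M3, b(%9.3f) se stats(N r2 rmse)
```

Storing each fit and tabulating them together keeps the sequence of decisions visible, which is the point of a taxonomy as opposed to an automated search.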
Approaches to variable selection and model building
EDA Reflex 3b: (Categorical) Question Predictor and Other Predictors
• We can see that Healthy children have higher (lower-valued) SES, higher PPVT, and higher GENREAS.
• If HEALTH were randomly assigned to children, what would the table look like?
[Figure: box plots of the Understanding of Illness Causality Score for Diabetic, Asthmatic, and Healthy children.]
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
. tabstat SES PPVT AGE GENREAS, statistics(mean sd count) by(HEALTH)

Summary statistics: mean, sd, N
  by categories of: HEALTH (Health Status)

   HEALTH |       SES      PPVT       AGE   GENREAS
----------+----------------------------------------
 Diabetic |  2.805556  106.4722    136.75  3.855029
          |  1.141914  18.48163   35.9105  1.022861
          |        36        36        36        35
----------+----------------------------------------
Asthmatic |   2.69863  110.3014  129.0822  3.709361
          |  .8446717  12.77659  40.72187  .9798192
          |        73        73        73        72
----------+----------------------------------------
  Healthy |   1.78125       117  131.9896  4.533854
          |  .7567677  15.80673  42.02267  1.064994
          |        96        96        96        96
----------+----------------------------------------
    Total |  2.287805  112.7659  131.7902  4.124384
          |  .9852329  15.81678   40.4458  1.095722
          |       205       205       205       203

. pwcorr ILLCAUSE SES PPVT AGE GENREAS, sig obs star(.05)

             | ILLCAUSE      SES     PPVT      AGE  GENREAS
-------------+---------------------------------------------
    ILLCAUSE |   1.0000
             |
             |      194
             |
         SES |  -0.2471*  1.0000
             |   0.0005
             |      194      205
             |
        PPVT |   0.3140* -0.3785*  1.0000
             |   0.0000   0.0000
             |      194      205      205
             |
         AGE |   0.6711*  0.0598   0.1199   1.0000
             |   0.0000   0.3941   0.0869
             |      194      205      205      205
             |
     GENREAS |   0.8242* -0.2975*  0.3891*  0.7369*  1.0000
             |   0.0000   0.0000   0.0000   0.0000
             |      192      203      203      203      203
EDA Reflex 3c: Outcome and Other Predictors
• (Low) SES is a slight negative predictor, and age as well as vocabulary and reasoning ability correlate positively as expected.
• Also a strong relationship between general reasoning and age, as expected.
• When interpreting correlation matrices, learn to look past the STARS to the magnitude of the correlation itself.
• Example: 0.3 is significant (we’re confident it’s not 0) but we’re also confident it’s a pretty weak correlation.
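One way to look past the stars is to square the correlation, which gives the proportion of variance the two variables share. As a worked example using the ILLCAUSE and PPVT correlation of 0.3140 from the matrix:

```latex
r = 0.314 \quad\Longrightarrow\quad r^2 = 0.314^2 \approx 0.099
```

So a correlation that is clearly distinguishable from zero still corresponds to only about 10% shared variance.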
RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and, if so, by how much?
[Figures: scatterplots of the Understanding of Illness Causality Score against Hollingshead SES, Peabody Picture Vocabulary Test Score, Chronological Age (Months), and General Reasoning Ability Score.]
“Jitter” discrete scatterplots to distinguish among overlapping datapoints.
Decrease msize or “sample down” until you can feel the “grain” of the data.
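Both tips can be tried in a few lines. The sketch below assumes Stata's built-in auto dataset as a stand-in, where `rep78` takes only a handful of integer values and so produces heavily overlapping points. The jitter amount, the 50% sample, and the seed are arbitrary choices for illustration.

```stata
* Stand-in data; rep78 is discrete, so points stack on top of each other.
sysuse auto, clear

* Unjittered plot for comparison.
scatter mpg rep78, name(plain, replace)

* Jitter the points and shrink the markers.
scatter mpg rep78, jitter(5) msize(small) name(jittered, replace)

* "Sample down": plot a random half, then restore the full data.
preserve
set seed 52
sample 50
scatter mpg rep78, jitter(5) msize(small) name(sampled, replace)
restore
```

The preserve and restore pair matters here because `sample` drops observations from memory.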
Preparing Dummy Variables, Transformations, and Interactions
*--------------------------------------------------------------------------------------
* Create Dummy Variables, Transformations, and Interactions
*--------------------------------------------------------------------------------------
gen D=(HEALTH==3)
gen A=(HEALTH==5)
gen H=(HEALTH==6)
gen LAGE = log(AGE)
* Health status by logAGE interactions
gen DxLAGE = D*LAGE
gen AxLAGE = A*LAGE
gen HxLAGE = H*LAGE
* Health status by SES interactions
gen DxSES = D*SES
gen AxSES = A*SES
gen HxSES = H*SES
* logAGE by SES interaction
gen LAGExSES = LAGE*SES
* Health status by SES by logAGE interactions
gen DxLAGExSES = D*LAGE*SES
gen AxLAGExSES = A*LAGE*SES
gen HxLAGExSES = H*LAGE*SES
Creating dummy variables for Diabetic, Asthmatic, and Healthy. Recall that only k − 1 dummy variables are necessary to specify the model for k categories, but we overspecify here for illustration.
Creating interaction variables to ask the question: Does the “effect” of one predictor on the outcome depend upon the level of another predictor?
Interpretation often requires graphical displays, as we will demonstrate.
Create new variables on the fly with the “generate” command. As you can see, many commands can be abbreviated. You’ll learn more about these codes in section next week.
Creating a log-AGE variable to account for the slight curvilinearity and the occasional theoretical support for log-AGE or otherwise nonlinear variables on developmental scales. (Marginal)
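A quick sanity check on hand-built dummies is that every observation falls in exactly one category. The sketch below is self-contained: the four HEALTH codes are toy values chosen for the example, and the `HG` stub is a made-up name.

```stata
* Toy data: four hypothetical children using the HEALTH codes 3, 5, 6.
clear
input HEALTH
3
5
6
6
end

* Hand-built dummies, as in the handout.
generate D = (HEALTH==3)
generate A = (HEALTH==5)
generate H = (HEALTH==6)

* Each child belongs to exactly one group, so the dummies sum to 1.
assert D + A + H == 1

* tabulate can build the same set in one step: HG1, HG2, HG3.
tabulate HEALTH, generate(HG)
assert HG1 == D & HG2 == A & HG3 == H
```

Because the three dummies always sum to 1, they duplicate the intercept, which is why only two of them can enter a model that includes a constant.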
The Baseline Regression Model

. regress ILLCAUSE D A

      Source |       SS       df       MS              Number of obs =     194
-------------+------------------------------           F(  2,   191) =   23.45
       Model |  39.7373141     2   19.868657           Prob > F      =  0.0000
    Residual |  161.809826   191  .847171864           R-squared     =  0.1972
-------------+------------------------------           Adj R-squared =  0.1888
       Total |   201.54714   193   1.0442857           Root MSE      =  .92042

------------------------------------------------------------------------------
    ILLCAUSE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           D |  -.8373226   .1864973    -4.49   0.000    -1.205182   -.4694638
           A |  -.9355971   .1468597    -6.37   0.000    -1.225272   -.6459219
       _cons |   4.603656    .095443    48.23   0.000     4.415398    4.791914
------------------------------------------------------------------------------
We test the effect of a categorical variable by creating dummy variables.
In an unadjusted model, the constant refers to the excluded group’s mean, and the coefficients refer to the simple mean difference between the dummy group’s mean and the excluded group’s mean. When in doubt, write out the prediction equation:
$\widehat{\text{ILLCAUSE}}_i = 4.60 - 0.84\,D_i - 0.94\,A_i$
Model and Residual sums of squares: “good” variation and “bad” variation, respectively. Good/Total = R-squared.
The ratio of these variances (the Model and Residual mean squares) gives us our F statistic, the higher the better (over 4ish generally suffices to reject the omnibus null hypothesis).
My favorite alternative to the ubiquitous and overused R-squared statistic, the RMSE is interpretable on the scale of the outcome as the standard deviation of the residuals after accounting for the predictors.
The t statistic is the number of standard errors away from the null value; an absolute value greater than 2 generally does the trick.
The p-value is the probability of observing a t statistic that big or greater, assuming that the null hypothesis is true, and given all else in the model. To reject the null hypothesis, our cutoff is .05 by convention.
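Each of these summary statistics can be recomputed by hand from the entries in the `regress ILLCAUSE D A` output. The worked arithmetic below uses the printed sums of squares, mean squares, coefficients, and standard errors, rounded to four decimals:

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{39.7373}{201.5471} \approx 0.197 \\
F &= \frac{MS_{\text{Model}}}{MS_{\text{Residual}}} = \frac{19.8687}{0.8472} \approx 23.45 \\
\text{RMSE} &= \sqrt{MS_{\text{Residual}}} = \sqrt{0.8472} \approx 0.920 \\
t_D &= \frac{-0.8373}{0.1865} \approx -4.49, \qquad
t_A = \frac{-0.9356}{0.1469} \approx -6.37
\end{align*}
```

The RMSE of about 0.92 sits only slightly below the outcome's overall standard deviation of 1.02, consistent with an R-squared near 0.20.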
© Andrew Ho, Harvard Graduate School of Education
The first fitted regression model from the Data Analytic Handout is:
$$\widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84\,D_i - 0.94\,A_i$$
From it, you can estimate the predicted value of ILLCAUSE in each health status group by substituting numerical values of the health status predictors that represent prototypical individuals in the dataset:
Healthy ($D_i = 0$; $A_i = 0$): $\widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84(0) - 0.94(0) = 4.60$
Diabetic ($D_i = 1$; $A_i = 0$): $\widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84(1) - 0.94(0) = 4.60 - 0.84 = 3.76$
Asthmatic ($D_i = 0$; $A_i = 1$): $\widehat{\mathit{ILLCAUSE}}_i = 4.60 - 0.84(0) - 0.94(1) = 4.60 - 0.94 = 3.66$
Notice that the predicted outcome values corresponding to one of the groups – the reference, omitted or comparison group (here, healthy children) – are obtained when the two dichotomous predictors that distinguish the chronically-ill children are both set to zero. This means that, if you have an intercept in the model, you need one less dummy predictor in the model than there are groups compared, as the fitted value for the “reference (or omitted) group” is provided by the estimated intercept.
Another way of thinking about this is to understand that, although there are three distinct health status groups present, only two independent pieces of information are needed to indicate the health status of a child because if a child is neither diabetic nor asthmatic then s/he must be healthy, by default.
Of course, you get to choose which of the health status groups serves as the reference, because you are the one who picks which dummy predictor is omitted from the regression model. Typically, you make this choice for substantive, not statistical, reasons.
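To check that an intercept plus two dummies is enough to recover three group means, you can fit the regression by least squares and compare the coefficients with the raw means. The sketch below is a minimal illustration under stated assumptions: the scores are hypothetical (not the course data), and `solve` and `ols` are small helper functions written for this example rather than library calls.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical scores for three health status groups (illustrative only).
data = {
    "healthy": [5.0, 4.0, 5.0, 4.0],     # reference group: D = 0, A = 0
    "diabetic": [4.0, 3.0, 4.0, 3.0],    # D = 1, A = 0
    "asthmatic": [3.0, 4.0, 2.0, 3.0],   # D = 0, A = 1
}

rows, y = [], []
for group, scores in data.items():
    d = 1.0 if group == "diabetic" else 0.0
    a = 1.0 if group == "asthmatic" else 0.0
    for score in scores:
        rows.append([1.0, d, a])         # columns: intercept, D, A
        y.append(score)

b0, b1, b2 = ols(rows, y)
means = {g: sum(s) / len(s) for g, s in data.items()}

print("intercept:", b0, "healthy mean:", means["healthy"])
print("D slope:  ", b1, "diabetic - healthy:", means["diabetic"] - means["healthy"])
print("A slope:  ", b2, "asthmatic - healthy:", means["asthmatic"] - means["healthy"])
```

Up to floating-point rounding, the intercept matches the healthy group's mean and each slope matches that group's mean difference from the healthy group, which is why a third dummy would carry no new information.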
Appendix 1: Two Dummy Predictors Distinguish Among Three Groups
Unit 1– Slide 26
Inspection of the fitted values computed on the previous slide indicates that the fitted regression parameters that we obtained in the analysis (that is, the estimated intercept parameter and the two estimated slope parameters associated with the dummy predictors representing health status) can be interpreted as follows:
$$\widehat{\mathit{ILLCAUSE}}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i + \hat{\beta}_2 A_i$$
The fitted slope parameter associated with dummy
predictor A represents the difference in the predicted
value of ILLCAUSE between the asthmatic and
“reference” healthy children – it is our best estimate of
the difference between asthmatic and healthy
children, on average, in the population
(-0.94).
The fitted slope parameter associated with dummy
predictor D represents the difference in the predicted
value of ILLCAUSE between diabetic and “reference” healthy
children – it is our best estimate of the difference
between diabetic and healthy children, on
average, in the population (-0.84).
The fitted intercept represents the predicted
value of ILLCAUSE (4.60) for those in the reference
(or omitted) category – it is our best estimate of the
understanding of healthy children, on average, in the
population.
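One way to see why each slope is a difference from the reference group is to write out the fitted value for each group and subtract. The worked lines below use the estimates quoted on this slide (4.60, -0.84, -0.94). The symbols $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ are assumed labels for the intercept, the D slope and the A slope.

```latex
\begin{align*}
\text{Healthy } (D = 0,\ A = 0):\quad & \widehat{\mathit{ILLCAUSE}} = \hat{\beta}_0 = 4.60 \\
\text{Diabetic } (D = 1,\ A = 0):\quad & \widehat{\mathit{ILLCAUSE}} = \hat{\beta}_0 + \hat{\beta}_1 = 4.60 - 0.84 = 3.76 \\
\text{Asthmatic } (D = 0,\ A = 1):\quad & \widehat{\mathit{ILLCAUSE}} = \hat{\beta}_0 + \hat{\beta}_2 = 4.60 - 0.94 = 3.66 \\
\text{Diabetic} - \text{Healthy}:\quad & (\hat{\beta}_0 + \hat{\beta}_1) - \hat{\beta}_0 = \hat{\beta}_1 = -0.84 \\
\text{Asthmatic} - \text{Healthy}:\quad & (\hat{\beta}_0 + \hat{\beta}_2) - \hat{\beta}_0 = \hat{\beta}_2 = -0.94
\end{align*}
```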
Unit 1– Slide 27
Appendix 1: Two Dummy Predictors Distinguish Among Three Groups
© Willett, Harvard University Graduate School of Education
“Baseline Control Model” Approach:
• Form a baseline control model by sequentially adding control predictors, highest priority first, and testing for appropriate interactions as you go along.
• Then, add the main effects of the question predictors to the new baseline control model.
• Then, add interactions between the question predictors and the control predictors in the baseline control model, sequentially.
• Finally, add interactions between the question predictors.
Here, the objective is to obtain a parsimonious model that controls away all extraneous variation first, and then focus attention on the impact of the question predictors. While this approach refines your view of the impact of the question predictors, removing that part of their effect that may depend on the inter-relationships with the controls, it never reveals the “total” impact of the question predictors on the outcome for a person who has been randomly selected from the population without regard to any of their other characteristics.
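The order of the steps above can be written down as a taxonomy of candidate models before any fitting happens. The sketch below only enumerates that order, under simplifying assumptions: it does not fit anything, it does not test or drop terms along the way, and the predictor names (`C1`, `C2`, `Q1`, `Q2`) and the function `baseline_control_sequence` are hypothetical placeholders.

```python
from itertools import combinations


def baseline_control_sequence(controls, questions):
    """Return candidate models (lists of terms) in the order the approach adds terms.

    `controls` and `questions` are lists of predictor names, highest priority first.
    """
    models, terms = [], []

    def add(term):
        terms.append(term)
        models.append(list(terms))   # snapshot of the model after adding this term

    for c in controls:               # step 1: control predictors, one at a time
        add(c)
    for q in questions:              # step 2: main effects of question predictors
        add(q)
    for q in questions:              # step 3: question-by-control interactions
        for c in controls:
            add(f"{q}*{c}")
    for q1, q2 in combinations(questions, 2):   # step 4: question-by-question
        add(f"{q1}*{q2}")
    return models


models = baseline_control_sequence(["C1", "C2"], ["Q1", "Q2"])
for i, model in enumerate(models, start=1):
    print(f"M{i}: " + " + ".join(model))
```

In practice each candidate would be fitted and tested before moving on, and terms that do not earn their place would be dropped, so the real taxonomy is usually shorter than this full list.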
“Work Back From The End” Approach:
• Include all possible predictors in the model, both their main effects and interactions.
• Then, remove statistically unimportant predictors sequentially to achieve a more parsimonious model, starting with those of lowest declared priority that do not appear to have statistically significant effects (i.e., remove question predictors last).
• Make sure that you remove any statistically unimportant interactions ahead of any of the main effects from which they are constituted.
Here, the objective is to obtain a final parsimonious model by sequentially removing predictors that appear unimportant. The idea is that you get to see the impact of “everything” to start with, and then you can “slim down” the fitted model to a final model. However, the impact of main effects is always masked when interactions are present in the model, and you still may remove an important predictor whose correlation with another predictor makes it look unimportant.
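The removal rule can be sketched as a small helper that decides which term to drop next. This is an illustration under assumptions, not a prescribed procedure: the term names, the p-values and the priority ordering are all hypothetical, `droppable_terms` and `next_to_drop` are names invented for this example, and in real use the model would be refitted (and the p-values updated) after every removal.

```python
def droppable_terms(terms):
    """Terms that can be removed without orphaning an interaction.

    A main effect is locked while any interaction built from it remains in the model.
    """
    locked = {part for t in terms if "*" in t for part in t.split("*")}
    return [t for t in terms if t not in locked]


def next_to_drop(terms, p_values, priority, alpha=0.05):
    """Pick the lowest-priority, non-significant term that is allowed to go.

    `priority` lists every term from highest to lowest declared priority.
    Returns None when nothing qualifies for removal.
    """
    candidates = [t for t in droppable_terms(terms) if p_values[t] >= alpha]
    if not candidates:
        return None
    return max(candidates, key=priority.index)


# Hypothetical model: one question predictor (Q1), two controls, one interaction.
terms = ["C1", "C2", "Q1", "C1*Q1"]
priority = ["Q1", "C1", "C2", "C1*Q1"]                          # question predictor last to go
p_values = {"C1": 0.40, "C2": 0.30, "Q1": 0.01, "C1*Q1": 0.60}  # made-up values

first = next_to_drop(terms, p_values, priority)
print("drop first:", first)

remaining = [t for t in terms if t != first]
second = next_to_drop(remaining, p_values, priority)   # p-values would be refitted here
print("drop next:", second)
```

Note that `C1` cannot be dropped in the first pass even though its p-value is large, because the `C1*Q1` interaction still depends on it.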
Devise Your Own Strategy?
• It’s acceptable to devise your own strategy; in fact, it’s probably the best approach, as you know the field the best!
But remember that your strategy must be systematic and sensible, and you must explain it explicitly to your reader, describing the logic that underpins its organization.
Unit 1– Slide 28
Appendix 2: Reasonable Strategies For Selecting Among Regression Models