Review: Correlation vs. Regression
What are the main differences between correlation & regression?
What are the data requirements for each?
What are their principal vulnerabilities?
How do we establish causality?
Pearson correlation: a linear association between two quantitative variables.
Data Requirements
A probability sample—if the analysis will be inferential, as opposed to descriptive.
For OLS regression, the outcome variable must be quantitative (interval or ratio); the explanatory variables may be quantitative or categorical (nominal or ordinal).
What are the disadvantages of using correlation to study the relationships between two or more variables?
See Moore/McCabe, chapter 2.
Hypothesis tests for correlation
We can use Pearson correlation not only descriptively but also inferentially.
To use it inferentially, use a scatterplot to check the bivariate relationship for linearity.
If the relationship is sufficiently linear:
$H_0: \rho_{xy} = 0$
$H_a: \rho_{xy} \neq 0$
In Stata, ‘pwcorr’ vs. ‘corr’
‘correlate’ (‘corr’): uses ‘listwise,’ or ‘casewise,’ deletion; i.e. any observation (individual, case) with missing data on any of the correlated variables is dropped, so ‘corr’ uses only observations with complete data for the examined variables.
If, for the relationship between math and reading scores, observation #27 has, say, a missing math score, then ‘corr’ or ‘regress’ will automatically drop observation #27.
This is how regression works, so ‘corr’ corresponds to regression.
Moreover, ‘corr’ does not permit hypothesis tests.
pwcorr: (‘pairwise’) uses all of the non-missing observations for the examined variables (e.g., it would use observation #27’s reading score, even though #27’s math score is missing).
This does not correspond to the way that regression works.
Moreover, ‘pwcorr’ permits hypothesis tests.
Note: There is a way to use ‘pwcorr’ so that, like regression analysis, it is based on casewise (i.e. listwise) deletion of missing observations. We’ll demonstrate this later.
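One possible way to get listwise behavior from ‘pwcorr’ is sketched below, assuming the hsb2 variables used in these slides: flag the complete cases first, then restrict ‘pwcorr’ to them. The variable name nmiss is made up for this illustration. Recent Stata releases also document a listwise option for ‘pwcorr’; check the help file for your version before relying on it.

```stata
* Sketch: make pwcorr use the same cases a regression on these variables would use.
* nmiss is a hypothetical helper variable, not part of hsb2.
use hsb2, clear
egen nmiss = rowmiss(read write math science socst)   // number of missing values per observation
pwcorr read write math science socst if nmiss == 0, obs sig star(.05) bonf
drop nmiss
```

With the if-condition in place, every correlation in the matrix is based on the same set of observations, so the obs rows should all show the same count.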
Use a Bonferroni or other multiple-test adjustment when simultaneously testing multiple correlation hypotheses:
. pwcorr read write math science socst, obs sig star(.05) bonf
Why is the multiple-test adjustment important?
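A rough worked illustration, assuming (for simplicity only) that the tests are independent: the five variables in the command above yield 10 pairwise correlations, so testing each at the .05 level gives a sizeable chance of at least one false positive.

```latex
\[
\binom{5}{2} = 10 \text{ tests}, \qquad
P(\text{at least one Type I error}) = 1 - (1 - 0.05)^{10} \approx 0.40
\]
\[
\text{Bonferroni: test each correlation at } \alpha^{*} = \frac{0.05}{10} = 0.005
\]
```

Equivalently, each p-value can be multiplied by the number of comparisons and then judged against .05. In practice the tests are correlated, so 0.40 is only an approximation.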
If the data have no missing values, then there’s no problem using pwcorr.
Contingency Table vs. Pearson Correlation
What if the premises of parametric statistics don’t hold?
E.g., what if the quantitative variables are based on a small sample (say, <30)?
Or what if the relationship is non-linear, and possible transformations (such as logarithmic) aren’t applicable or don’t work, and/or it doesn’t make sense to eliminate extreme outliers?
What if the quantitative variables are ordinal rather than interval in measurement?
In such cases it may be useful to: (1) use non-parametric procedures (such as Spearman rho [in Stata: ‘spearman x y’]); or (2) categorize the data & assess the bivariate association via a contingency table.
Non-parametric procedures are not premised on the approximate normality of the sampling distribution of sample means (see the Moore/McCabe chap. 7 & CD-Rom chapter).
As for contingency tables, here’s an example of how to use one in response to violated parametric assumptions.
What’s the association between science & reading scores?
. xtile xsci=science, nq(4)
. xtile xread=read, nq(4)
. tab1 xsci xread
. bys xsci: su science
. bys xread: su read
. tab xread xsci, col chi2
Always have a good reason for how you categorize a quantitative variable.
Check # observations per cell for the validity of the Chi-square hypothesis test.
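A minimal sketch of one way to run that check, assuming xread and xsci were created with the xtile commands above: ask ‘tabulate’ to print the expected count under independence in each cell. A common rule of thumb is that expected counts should be about 5 or more; the exact threshold varies by textbook.

```stata
* Sketch: print expected counts below the observed counts in each cell,
* together with the Pearson chi-square test.
* Assumes xread and xsci already exist (see the xtile commands above).
tab xread xsci, expected chi2
```

If several cells have small expected counts, a coarser categorization (e.g., nq(3) instead of nq(4)) is one option, provided there is a substantive reason for the new cut points.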
Measures of Correlation & Association Involving Categorical Variables
Here are some alternatives to Pearson correlation (including non-parametric [‘rank’] statistics):
There are other correlation coefficients & measures of association (besides ttest & prtest) for categorical variables & for combinations of categorical & continuous variables. E.g.:
Spearman correlation (i.e. ‘rank correlation’; see Moore/McCabe chap. 7 and CD-Rom chapter, Hamilton, chap. 6 & Stata Manual). It is also an outlier-resistant alternative to ‘corr’ or ‘pwcorr’ for quantitative variables:
. spearman ordinalscore ses
Kendall’s tau (like Spearman, but can be slower in Stata; see Hamilton, chap. 6 & Stata Manual)
. ktau ordinalscore ses
eta-squared: when one variable is quantitative continuous & the other is multi-level categorical (see Moore/McCabe, chap. 12, ‘ANOVA’; Hamilton, chap. 5, ‘ANOVA’; Stata Manual, ‘oneway’).
. oneway read ses, tabulate bonf [Bartlett test must be insignificant; see also ANOVA and loneway.]
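point-biserial correlation: when one variable is quantitative continuous & the other is binary. Just use ‘corr’ or ‘pwcorr’: Pearson’s r computed with a 0/1 variable is the point-biserial correlation, so no separate command is needed. (The biserial correlation proper assumes a latent continuous variable behind the binary one & is not what these commands report.)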
. pwcorr read female, obs sig star(.05) [same result as ttest read,by(female)]
phi coefficient: two categorical binary variables.
. pwcorr female white, obs sig star(.05) [same result as tab female white, col/row/cell chi2]
. tab female ses, all [output includes ‘Cramer’s V,’ which is an adaptation of phi coefficient for tables that are larger than two-by-two, but it likewise works in two-by-two tables]
Caution: Recall the ramifications of restricted-range data and ecological data for correlation results, & recall the need to consider lurking variables.
Non-parametric: rank data.
Parametric: premised on approximately normal sampling distribution of sample means (i.e. Central Limit Theorem).
CI’s for Pearson & Spearman correlations:
. findit ci2 [& download]
. ci2 read write, corr
Confidence interval for Pearson's product-moment correlation of read and write, based on Fisher's transformation.
Correlation = 0.597 on 200 observations (95% CI: 0.499 to 0.679)
. ci2 read ses, corr spearman
Confidence interval for Spearman's rank correlation of read and ses, based on Fisher's transformation.
Correlation = 0.280 on 200 observations (95% CI: 0.147 to 0.403)
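The Pearson interval above can be roughly reproduced by hand. The sketch below assumes the usual large-sample form of Fisher's transformation, with standard error 1/sqrt(n - 3) and a 1.96 multiplier; the output names the transformation but not these details.

```latex
\[
z = \tfrac{1}{2}\ln\frac{1 + r}{1 - r} = \tfrac{1}{2}\ln\frac{1.597}{0.403} \approx 0.6885,
\qquad
SE_z = \frac{1}{\sqrt{n - 3}} = \frac{1}{\sqrt{197}} \approx 0.0712
\]
\[
z \pm 1.96\,SE_z \approx (0.549,\; 0.828)
\]
\[
\tanh(0.549) \approx 0.50, \qquad \tanh(0.828) \approx 0.68
\]
```

Back-transforming with tanh gives roughly 0.50 to 0.68, which agrees with the reported 0.499 to 0.679 up to rounding of r.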
Regression Analysis
What are regression analysis’s major advantages over the alternatives for examining the relationships between two or more variables (see Moore/McCabe, chapter 2)?
Regression: examines how the values of an outcome variable y depend on the values of one or more explanatory variables x (i.e. the slope & direction of the y/x straight line).
On average, how does risk of heart disease (y) change with every unit of increase or decrease in amount of fat consumption (x1) & in amount of exercise (x2)?
On average, how does earnings level (y) change with every unit of increase or decrease in years of education (x2) & in years of person’s age (x1)?
Recall the problems of causality that we discussed in Chapter Two.
Always ask: What is the conceptual basis of the hypothesized or implied causal relationship? What if it were reversed?
See, e.g., King et al., Designing Social Inquiry; McClendon, Multiple Regression and Causal Analysis; Berk, Regression Analysis: A Constructive Critique.
Let’s start by interpreting the following simple regression model (i.e. regression model with one explanatory variable x).
. use hsb2, clear
. for varlist math read: kdensity X, norm \ more
[Figure: kernel density estimate with normal density overlay, math score]
[Figure: kernel density estimate with normal density overlay, reading score]
. gr box math read
[Figure: box plots of math score and reading score]
. su math read, d

                         math score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           36             33
 5%           39             35
10%           40             37       Obs                 200
25%           45             38       Sum of Wgt.         200

50%           52                      Mean             52.645
                        Largest       Std. Dev.      9.368448
75%           59             72
90%         65.5             73       Variance       87.76781
95%         70.5             75       Skewness       .2844115
99%           74             75       Kurtosis       2.337319

                        reading score
-------------------------------------------------------------
      Percentiles      Smallest
 1%         32.5             28
 5%           36             31
10%           39             34       Obs                 200
25%           44             34       Sum of Wgt.         200

50%           50                      Mean              52.23
                        Largest       Std. Dev.      10.25294
75%           60             73
90%           67             73       Variance       105.1227
95%           68             76       Skewness       .1948373
99%         74.5             76       Kurtosis       2.363052
. scatter math read || qfit math read
[Figure: scatterplot of math score against reading score, with fitted values]
. corr math read
(obs=200)

             |     math     read
-------------+------------------
        math |   1.0000
        read |   0.6623   1.0000

. pwcorr math read, obs sig

             |     math     read
-------------+------------------
        math |   1.0000
             |
             |      200
             |
        read |   0.6623   1.0000
             |   0.0000
             |      200      200
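. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------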
Interpretation?
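One way to start reading the output, offered as a sketch rather than a full interpretation: write out the fitted line from the Coef. column and plug in a value of read. The value read = 50 is chosen only for illustration.

```latex
\[
\widehat{\text{math}} = 21.038 + 0.605 \times \text{read}
\]
\[
\text{read} = 50: \quad \widehat{\text{math}} \approx 21.038 + 0.605 \times 50 \approx 51.3
\]
```

So, on average, each additional reading-score point is associated with about 0.6 of a point more on the math score, and R-squared indicates that reading score accounts for about 44% of the variance in math score.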
The simple linear regression model assumes that the mean of the outcome variable is a linear function of one explanatory variable.
The multiple linear regression model—as we’ll later see—assumes that the mean of the outcome variable is a linear function of multiple explanatory variables: implication?
Regression Analysis
Does Not Assume the Following!
Regression analysis does not assume that the sample values of the outcome & explanatory variables have normal distributions!
It does assume an approximately linear y/x relationship.
And it does assume that the distribution of the residuals is approximately normal and is constant across the values of each explanatory variable, with an expected value of zero.
Simple linear regression model:
$\mu_y = \beta_0 + \beta_1 x$
Multiple linear regression model:
$\mu_y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
Regression model: a set of variables & their hypothesized relationships.
Basic research strategy in multiple regression: compare models—which model provides the best explanation (or prediction) for a research question & the data?
What’s the advantage of multiple regression over simple regression?
Multiple regression allows us to examine how values of an outcome variable vary in association with changes in the values of more than one explanatory variable.
The effect of each x on the value of y is measured holding the other x’s constant.
Thus a given x may perform differently within differing sets of x’s.
In multiple regression, the value of each explanatory x’s slope (i.e. beta or regression) coefficient is its partial (i.e. net) effect, holding the other explanatory variables constant.
So, in multiple regression, the value of a slope (i.e. regression) coefficient may vary according to which other explanatory variables are included in the model.
What is accomplished by holding the model’s other variables constant?
How does this compare to experimental design?
Thus, when interpreting the effect of any one explanatory variable on y, consider the model’s other explanatory variables as held constant.
On average, how does risk of heart disease (y) change with every unit of increase in amount of fat consumption (x1), holding constant amount of exercise (x2)?
On average, how does earnings level (y) change with every additional year of a person’s age (x2), holding constant years of education (x1)?
The characteristics of scatterplots & correlations don’t necessarily predict whether an explanatory variable will test significant in multiple regression.
That’s because multiple regression expresses the joint, linear effect of a set of explanatory variables on an outcome variable y.
That is, the regression model’s whole is more than the sum of its parts.
In fact, significant bivariate relationships may become insignificant in multiple regression.
Or insignificant bivariate relationships may become significant.
Or positive bivariate relationships may become negative, & vice versa (‘Simpson’s Paradox’).
On such complexities within multiple regression models, see McClendon, Multiple Regression and Causal Analysis.
And see Agresti/Finlay, Statistical Methods for the Social Sciences, chapter 10.
The estimated (i.e. probabilistic) regression line:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$
The estimated (i.e. probabilistic) regression line contains a component of uncertainty, or error: the deviations between the observed values of y & the estimated values of y.
The most important statistical assumptions of the linear regression model: the distribution of residuals (i.e. prediction ‘errors’) is (1) approximately normal & (2) is constant for all the values of each explanatory variable x, with an expected value of zero.
These are the principal assumptions that we check in our diagnostic graphs after estimating a regression model.
Why are the assumptions of constant, normal distribution of residuals & zero expected value of residuals so important?
(1) The expected value of the residuals equals zero: guarantees that the estimates of the y-intercept & the slope coefficients are unbiased estimates of the corresponding population values.
(2) Constant spread of residuals: minimizes the standard errors of the estimates of the y-intercept & the slope coefficients, which is necessary for the usefulness of confidence intervals & tests of significance.
What if the assumption of normal, constant distribution of residuals with an expected value of zero does not hold for a given estimated regression model?
Violations of assumptions are a matter of degree. Assessing the degree of violation & taking proper corrective action are advanced topics of regression diagnostics.
How to check if there’s a constant, normal distribution of residuals & zero expected value of residuals?
. reg math read
. predict e, resid [e = ‘errors’]
. hist e, norm
. rvfplot, yline(0)
[Figure: histogram of residuals (Density vs. Residuals) with normal density overlay]
If the distribution is approximately normal, as it roughly is above (note some negative skewness), then the assumption basically holds. In this specific case, the slight negative skewness does alert us to possible problems with other diagnostics.
[Figure: residuals plotted against fitted values (Residuals vs. Fitted values)]
If the distribution of the residuals is approximately random, which it roughly is above (note the degree of rightward expansion), then the assumption basically holds. In this specific case, we would want to check other diagnostics to confirm that there are no serious violations of the assumptions.
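As a companion to the graphs, here is a hedged sketch of some numerical checks. These are not part of the slides' own workflow: ‘swilk’ and ‘estat hettest’ are standard Stata commands, and e_chk is a made-up variable name. Formal tests can be oversensitive in large samples, so they complement the plots rather than replace them.

```stata
* Sketch: numerical companions to the residual plots.
* e_chk is a hypothetical name for the residuals.
quietly regress math read
predict double e_chk, resid
summarize e_chk        // with an intercept in the model, the mean is zero by construction
swilk e_chk            // Shapiro-Wilk test of normality of the residuals
estat hettest          // Breusch-Pagan / Cook-Weisberg test for non-constant variance
drop e_chk
```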
Another potential problem we check in regression diagnostics is that of x-outliers.
x-outliers that fall far from the mean in the y/x scatterplot may be influential: i.e. they may exert an excessive effect on the slope coefficients.
Always ask: Why are there outliers? What is their effect? How should we deal with them?
How to check for regression outliers in Stata? Here’s the preliminary way:
. scatter math read, ml(id)
(In this data set, id identifies each subject or observation.)
Look for x-observations that fall far away from the pack to the left or right.
There are no notable outliers.
[Figure: scatterplot of math score against reading score, each point labeled with its id]
If there were notable outliers, we would estimate the regression model both with & without the outliers, then compare the models.
. reg math read
. reg math read if id~=15 & id~=167
More important, however, are post-estimation diagnostics that assess the influence of outliers within a regression model (e.g., lvr2plot, avplots, dfbeta). E.g.:
. reg math read
. lvr2plot, ml(id)
Notable outliers would be located in the right-hand quadrant.
[Figure: leverage plotted against normalized residual squared, each point labeled with its id]
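The influence diagnostics can also be listed numerically. The sketch below uses Cook's distance, which is a swapped-in alternative to the commands named above rather than the slides' own method. The names cook_d and lev are hypothetical, and the 4/n cutoff is only a conventional screening threshold.

```stata
* Sketch: list observations with unusually large Cook's distance after the regression.
* cook_d and lev are hypothetical variable names; 4/n is a rule-of-thumb cutoff.
quietly regress math read
predict cook_d, cooksd
predict lev, leverage
list id read math lev cook_d if cook_d > 4/e(N) & !missing(cook_d)
```

Any observations that are listed are candidates for the with-and-without comparison described earlier, not automatic deletions.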
In any case, we use the estimated regression line—which must be based on a random sample of observations measured on the same individuals or subjects—to estimate the population regression line.
Sampling error (as well as non-sampling error) causes uncertainty in the estimated regression line, as in inferential statistics in general.
Regression measures a linear association: non-linearity—which if present emerges in non-normal distributions of residuals—creates misleading results.
Before we proceed, recall that there are two regression lines.
What are the two regression lines? Why are there two lines? How does this distinguish regression from correlation? What are the pitfalls of this?
As with correlation, beware of lurking variables in regression analysis.
As we’ll see, multiple (rather than simple) regression addresses the problem of lurking variables, though not as effectively as experimental design.
How to do simple (i.e. one explanatory variable) linear regression in Stata?
Let’s assume that the preliminary graphical & numerical descriptions have been carried out, & that the scatterplot indicates an approximately linear y/x relationship.
. reg math read
Interpretation?
The variation in response variable y that the model leaves unexplained is measured by the residuals (i.e. deviations or errors): the deviations between each observed y & each predicted y (‘yhat’).
REGRESSION DATA = FIT + RESIDUALS
Fit: the model’s estimate of y’s average value for each level of an x-variable.
Residuals: the deviations (‘errors’) of the predicted y values (yhat) from the observed y values (e.g., the deviations of the predicted math scores from the observed math scores)
SST = SSM + SSE
This formula underpins various diagnostic measures of a model’s explanatory/ predictive worth.
Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; sum the squared values.
Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; sum the squared values.
Sum of Squares for Errors (SSE) (i.e. Sum of Squares for Residuals): take each observed y minus its predicted y; square each deviation; sum the squared values.
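A minimal sketch of how the three sums of squares could be rebuilt by hand in Stata and compared with the regression output. All names created here (ybar, yhat_chk, sst_i, ssm_i, sse_i) are hypothetical and used only for this illustration.

```stata
* Sketch: rebuild SST, SSM, and SSE observation by observation.
quietly regress math read
predict double yhat_chk, xb                    // predicted y (yhat)
quietly summarize math
scalar ybar = r(mean)                          // mean of observed y
generate double sst_i = (math - ybar)^2        // squared deviation of observed y from the mean
generate double ssm_i = (yhat_chk - ybar)^2    // squared deviation of predicted y from the mean
generate double sse_i = (math - yhat_chk)^2    // squared residual
tabstat sst_i ssm_i sse_i, statistics(sum)
display "SSM + SSE = " e(mss) + e(rss)
```

The three column sums should correspond to the Total, Model, and Residual rows of the SS column (17465.795, 7660.759, and 9805.036 in the output shown in these slides), and the last line should reproduce the Total.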
Linear regression is called ‘ordinary least squares’ (OLS) because the equation chooses the straight line that minimizes the squared deviations between the observed values of y & the model’s predicted values of y (yhat; e.g., predicted math test scores).
We use the sample data—i.e. the observations on outcome variable y & explanatory variables x—to estimate the following:
(1) The slope coefficients:
$b_k = r \cdot \dfrac{s_y}{s_x}$
r = correlation of x & y; s = standard deviation
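As a check on the formula in the simple-regression case, the numbers reported earlier for math (y) and read (x) can be plugged in: r = 0.6623, with standard deviations 9.368 for math and 10.253 for read.

```latex
\[
b = r \cdot \frac{s_y}{s_x} = 0.6623 \times \frac{9.368}{10.253} \approx 0.6623 \times 0.9137 \approx 0.605
\]
```

This matches the coefficient on read in the regression output (.6051473) up to rounding. With more than one explanatory variable the slopes are partial coefficients, and this simple formula no longer applies directly.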
Here’s how to graph the confidence interval around the fitted regression line (‘qfitci’ fits a quadratic; ‘lfitci’ is the linear counterpart):
. twoway qfitci math read
[Figure: fitted values with 95% CI band, plotted against reading score]
See the course document ‘Graphing confidence intervals in Stata’ for other options.
(2) Y-intercept (the formula being for simple regression):
$b_0 = \bar{y} - b\,\bar{x}$
The value of predicted y when the explanatory x’s are zero.
It typically has no substantive meaning. Why not?
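The intercept can be checked the same way, using the means reported earlier (52.645 for math, 52.23 for read) and the estimated slope for read.

```latex
\[
b_0 = \bar{y} - b\,\bar{x} = 52.645 - 0.6051473 \times 52.23 \approx 52.645 - 31.607 = 21.038
\]
```

This agrees with _cons in the output (21.03816). In these data the smallest observed reading score is 28, so a reading score of zero lies far outside the range the line was fitted to.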
_cons: y-intercept
(3) The residuals:
$e_i = y_i - \hat{y}_i$
$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$
The residuals $e_i$ are the deviations of each predicted y (i.e. each yhat) from each observed y.
The residuals ei must have an approximately normal distribution with an expected value of zero (over an infinite number of observations).
How to Obtain the Predicted Values of Y & the Residuals in Stata
. reg math read
. predict yhat [yhat=predicted values of y]
. predict e, resid [e=residuals]
. for varlist read yhat e: hist X, norm
The method of least squares chooses the values of the regression coefficients that make the sum of the squares of the residuals as small as possible:
$\sum_i \left(y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \dots - b_k x_{ki}\right)^2$
Software regression output typically refers to the residual-terms—which go into the computation of model fit indicators— more or less as follows:
s2 = Mean Square Error (MSE): the sum of squared residuals divided by its degrees of freedom, i.e. the estimated variance of y around the regression line
s = Root Mean Square Error (Root MSE): the square root of MSE, i.e. the standard error of the regression
. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------
MS Error (Residual) = SS Residual/df = 9805.04/198 = 49.52
Root MSE: sqrt 49.52 = 7.0371
Stata labels the Residuals, Mean Square Error & Root Mean Square Error as follows:
Top-left table
SS for Residual: sum of squared errors (i.e. sum of squared residuals)
MS for Residual: SS for Residual divided by its degrees of freedom (the estimated variance of the residuals)
Top-right column
Root MSE: square root of MS for Residual (the standard error of the regression, i.e. the typical size of a prediction error)
r-squared: A Measure of Model Fit
Two basic ways of assessing model fit: (1) the slope of the regression line & (2) the amount of cluster around the regression line.
The regression coefficient=slope of the regression line (higher coefficient=greater slope).
r-squared=the degree of cluster around the regression line (higher r-squared=greater cluster).
r-squared=what percentage of the variance of y is explained by the explanatory variables?
That is, how much of the variance of y is explained by the model versus how much is explained by merely using the mean of y?
r-squared=square of correlation yx in simple regression
Note: r-squared for multiple regression is more complicated to compute, as later discussed.
Regression output table:
r-squared = SS Model/SS Total
Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; sum the squared values.
Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; sum the squared values.
Sum of Squares for Errors (SSE): take each observed y minus its predicted y; square each deviation; sum the squared values.
. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------
r-squared=7660.759/17465.79=0.4386
This is the proportion of the variance in y that the model explains (here, about 44%).
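The two routes to r-squared in simple regression can be checked against each other with the numbers already shown: the squared correlation between math and read, and the ratio of sums of squares.

```latex
\[
r^2 = (0.6623)^2 \approx 0.4386
\qquad\text{and}\qquad
\frac{SSM}{SST} = \frac{7660.759}{17465.795} \approx 0.4386
\]
```

Both agree with the R-squared reported in the output.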
Multiple Regression
What did we say was the advantage of multiple regression over simple regression?
Multiple regression allows us to examine how values of response variable y vary according to changes in the values of more than one explanatory variable x.
The relationship of each x to the value of y is measured holding the other x’s constant. A given x may therefore perform differently within differing sets of x’s.
Multiple regression, then, allows us not only to examine more than one explanatory variable but in doing so to control (that is, hold constant) otherwise lurking variables.
This approximates experimental control for lurking variables. Why is it not as effective as experimental design in controlling lurking variables?
And what are the intrinsic problems regarding causality?
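. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------

. reg math read write science

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   81.36
       Model |  9687.25053     3  3229.08351           Prob > F      =  0.0000
    Residual |  7778.54447   196  39.6864514           R-squared     =  0.5546
-------------+------------------------------           Adj R-squared =  0.5478
       Total |   17465.795   199  87.7678141           Root MSE      =  6.2997

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3064101   .0604347     5.07   0.000     .1872243    .4255959
       write |   .2609042   .0618005     4.22   0.000     .1390248    .3827836
     science |   .2543792   .0611411     4.16   0.000     .1338003    .3749582
       _cons |    9.68242   2.817438     3.44   0.001     4.126034    15.23881
------------------------------------------------------------------------------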
What happened to the coefficient for ‘read’? Why?
How do we evaluate how well an estimated regression model fits the data?
(1) F-test: overall significance of the model
(2) t-tests of each slope coefficient
(3) r2: overall explanatory/predictive effectiveness of the model
(4) Post-estimation diagnostics: to assess residuals for non-linearity & to assess the influence of outliers.
What is the problem with conducting several hypothesis tests of slope coefficients in a single equation?
The probability of Type I errors.
Begin, then, with an F-test for overall model significance, before testing the slope coefficients.
$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
$H_a$: at least one $\beta_j \neq 0$
F=Mean Square for Model/Mean Square for Error
The F-test examines if at least one of the explanatory variables has a statistically significant relationship with the outcome variable y.
Only if the F-test is significant do we then test for t-significance of the individual slope coefficients.
. reg math read write science
Interpretation?
. reg math read write science
F-statistic = 3229.08/39.69 = 81.36
Like other tests of significance, F is the magnitude of the effect divided by the error term.
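As a check on the arithmetic, here is a short Python sketch (not Stata) that rebuilds the F-statistic from the sums of squares and degrees of freedom reported in the output for `reg math read write science`. The variable names are illustrative.

```python
# Sums of squares and degrees of freedom copied from the Stata ANOVA table.
ss_model, df_model = 9687.25053, 3
ss_resid, df_resid = 7778.54447, 196

ms_model = ss_model / df_model    # mean square for the model
ms_resid = ss_resid / df_resid    # mean square for error (residual)
f_stat = ms_model / ms_resid      # F = MS model / MS error

print(round(ms_model, 2), round(ms_resid, 2), round(f_stat, 2))
```

The result agrees with the F(3, 196) = 81.36 shown in the output.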
To repeat, conduct the F-test for overall model significance, before testing the slope coefficients.
Ho: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
Ha: at least one $\beta_j \neq 0$
F = Mean Square for Model / Mean Square for Error
The F-test examines if at least one of the regression coefficients has a statistically significant relationship with the outcome variable y.
Only if the F-test is significant do we then test for t-significance of the individual slope coefficients.
t-tests of Slope Coefficients
Conduct a significance test (one or two-sided) for each slope coefficient.
Ho: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$
(or one-sided test: > or <)
Beware of Type I errors.
. reg math read write science
t-value for read = .306/.060 = 5.07
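The t-values in the coefficient table are simply each coefficient divided by its standard error. A small Python sketch (not Stata) using the figures from the output; the dictionary layout is only for illustration.

```python
# (coefficient, standard error) pairs copied from the Stata coefficient table.
estimates = {
    "read":    (0.3064101, 0.0604347),
    "write":   (0.2609042, 0.0618005),
    "science": (0.2543792, 0.0611411),
    "_cons":   (9.68242,   2.817438),
}

# t = coefficient / standard error
t_values = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:8s} t = {t:.2f}")
```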
R2
R2: the squared multiple correlation (capital R, vs. the earlier lower-case r) measures the proportion of the variance of the outcome variable that is explained by the explanatory variables (i.e. the degree of clustering around the regression line).
R2 is the square of the correlation of the predicted values of y with the observed values of y.
It tells us what percentage of y’s variance is accounted for by the model (i.e. the explanatory variables): higher R2, greater fit.
. reg math read write science
R2: Model SS/Total SS=9687.25/17465.79 = 0.55
R-squared, however, is much less preferred than the slope coefficients as an indicator of model fit.
Recall the difference between ‘historicist’ & ‘generalizing’ explanation. Simply adding more explanatory variables—even if they’re meaningless—will increase R-squared.
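Because R-squared never falls when a variable is added, Stata also reports an adjusted R-squared that penalizes the number of explanatory variables. The Python sketch below (not Stata) reproduces both figures from the output for `reg math read write science`, using the standard adjustment formula with n observations and k explanatory variables.

```python
# Sums of squares copied from the Stata ANOVA table.
ss_model, ss_total = 9687.25053, 17465.795
n, k = 200, 3   # observations, explanatory variables

r2 = ss_model / ss_total
# Adjusted R-squared shrinks R-squared according to the degrees of freedom used.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2, 4), round(adj_r2, 4))
```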
How to do multiple regression in Stata?
Example: How much do science achievement test scores depend on reading achievement test scores & math achievement test scores in a random sample of 200 high school students?
If there are any missing observations:
. mark complete
. markout complete science read math
Alternatively: perhaps save a working data set that excludes the missing observations.
Regarding mark & markout:
‘complete’ is a binary, or dummy, variable:
1=complete data
0=incomplete data
. tab complete
. gr matrix science read math if complete==1, half
[Figure: scatterplot matrix (lower half) of science score, reading score, and math score]
. kdensity science if complete==1, norm
. gr box science if complete==1
. kdensity read if complete==1, norm
. gr box read if complete==1
. kdensity math if complete==1, norm
. gr box math if complete==1
. su science read math if complete==1, detail
. pwcorr science read math if complete==1, obs sig bonf star(.05)
             |  science     read     math
-------------+---------------------------
     science |   1.0000
             |
             |      200
             |
        read |   0.6302*  1.0000
             |   0.0000
             |      200      200
             |
        math |   0.6307*  0.6623*  1.0000
             |   0.0000   0.0000
             |      200      200      200
Note this way of using pwcorr (‘if complete==1’) corresponds to how regression uses the observations.
Correlation ci’s
. ci2 read write if complete==1, corr
. ci2 read science if complete==1, corr
. ci2 write science if complete==1, corr
Confidence interval for Pearson's product-moment correlation of read and write, based on Fisher's transformation. Correlation = 0.597 on 200 observations (95% CI: 0.499 to 0.679)
Confidence interval for Pearson's product-moment correlation of write and science, based on Fisher's transformation. Correlation = 0.570 on 200 observations (95% CI: 0.469 to 0.657)
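For readers curious how such intervals arise, here is a minimal Python sketch (not Stata) of a Fisher-transformation confidence interval. The function name is hypothetical, and because the correlations above are rounded to three decimals, the results can differ from the `ci2` output in the third decimal place.

```python
import math
from statistics import NormalDist

def fisher_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                     # Fisher's z transformation
    se = 1 / math.sqrt(n - 3)             # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

lo_rw, hi_rw = fisher_ci(0.597, 200)      # read and write
lo_ws, hi_ws = fisher_ci(0.570, 200)      # write and science
print(lo_rw, hi_rw)
print(lo_ws, hi_ws)
```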
‘Partial coefficients’—correlation coefficients of the outcome variable with the explanatory variables, for each variable holding the others constant—are also helpful:
. pcorr science read math if complete==1
Partial correlation of science with
Variable | Corr. Sig.
-------------+------------------
read | 0.3654 0.000
math | 0.3668 0.000
Compare partial correlations to correlations.
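With only two explanatory variables, the partial correlations can be recovered from the three pairwise correlations in the `pwcorr` table. The Python sketch below (not Stata) uses the standard first-order partial correlation formula; the function name is illustrative.

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y, holding z constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Pairwise correlations copied from the pwcorr output.
r_sci_read, r_sci_math, r_read_math = 0.6302, 0.6307, 0.6623

p_read = partial_corr(r_sci_read, r_sci_math, r_read_math)  # science-read | math
p_math = partial_corr(r_sci_math, r_sci_read, r_read_math)  # science-math | read
print(round(p_read, 3), round(p_math, 3))
```

Each partial correlation (about .37) is well below the corresponding simple correlation (about .63) because read and math are themselves correlated.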
Here are the hypotheses that we’ll test.
. Ho: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
Ha: at least one $\beta_j \neq 0$
check for F-test significance
. Ho: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$
check for t-test significance
. reg science read if complete==1
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6085207   .0532864    11.42   0.000      .503439    .7136024
       _cons |   20.06696   2.836003     7.08   0.000     14.47432    25.65961
------------------------------------------------------------------------------
. reg science math if complete==1
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
. reg science read math if complete==1
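      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   90.27
       Model |  9328.73944     2  4664.36972           Prob > F      =  0.0000
    Residual |  10178.7606   197  51.6688353           R-squared     =  0.4782
-------------+------------------------------           Adj R-squared =  0.4729
       Total |    19507.50   199  98.0276382           Root MSE      =  7.1881
------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3654205   .0663299     5.51   0.000     .2346128    .4962282
        math |   .4017207   .0725922     5.53   0.000     .2585632    .5448782
       _cons |    11.6155   3.054262     3.80   0.000     5.592255    17.63875
------------------------------------------------------------------------------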
What might happen to the slope coefficients if we add other explanatory variables? Why?
Here’s a quick, simple way to graph regression coefficient ci’s in multiple regression:
. reg science read math if complete==1
. gorciv read math
[Figure: coefficient estimates with confidence intervals for read and math]
See ‘Graphing confidence intervals in Stata’ for the commands ‘parmby,’ ‘sencode,’ & ‘ecplot’ to make more useful graphs.
. Check that N-observations is correct in the model.
. predict yhat if e(sample)
. predict e if e(sample), resid
. hist yhat, norm
. hist e, norm
. rvfplot, yline(0)
. sort science [to order low-high values]
. list science yhat e
. lvr2plot, ml(id)
. lincom _cons + read*45 + math*45
. lincom _cons + read*65 + math*65
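The two `lincom` commands ask for the predicted science score at chosen values of read and math. As a rough illustration in Python (not Stata), the point estimates can be computed directly from the coefficients of `reg science read math`; unlike `lincom`, this sketch gives no standard errors or confidence intervals.

```python
# Coefficients copied from the output of: reg science read math if complete==1
b_cons, b_read, b_math = 11.6155, 0.3654205, 0.4017207

def predict_science(read: float, math_score: float) -> float:
    """Predicted science score for given reading and math scores."""
    return b_cons + b_read * read + b_math * math_score

pred_45 = predict_science(45, 45)
pred_65 = predict_science(65, 65)
print(round(pred_45, 2), round(pred_65, 2))
```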
[Figure: histogram of residuals with normal density overlay]
The residuals are approximately normal in distribution.
[Figure: residuals vs. fitted values]
There might be some rightward tilt worth exploring. We can use ‘rvpplot’ to explore individual variables.
. rvpplot read, yline(0)
[Figure: residuals vs. reading score]
. rvpplot math, yline(0)
[Figure: residuals vs. math score]
There might be some minor problem with read’s relationship with math.
[Figure: leverage vs. normalized residual squared, with observations labeled by id]
lvr2plot: let’s estimate the model with & without id=167 to see if there’s a notable difference.
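. reg science read math if id~=167
      Source |       SS       df       MS              Number of obs =     199
-------------+------------------------------           F(  2,   196) =   93.95
       Model |  9449.46198     2  4724.73099           Prob > F      =  0.0000
    Residual |  9856.80938   196  50.2898438           R-squared     =  0.4895
-------------+------------------------------           Adj R-squared =  0.4842
       Total |  19306.2714   198   97.506421           Root MSE      =  7.0915
------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3280934   .0670811     4.89   0.000     .1958001    .4603867
        math |   .4475735   .0738742     6.06   0.000     .3018831    .5932638
       _cons |   11.05814    3.02127     3.66   0.000     5.099772    17.01651
------------------------------------------------------------------------------
. reg science read math if complete==1
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   90.27
       Model |  9328.73944     2  4664.36972           Prob > F      =  0.0000
    Residual |  10178.7606   197  51.6688353           R-squared     =  0.4782
-------------+------------------------------           Adj R-squared =  0.4729
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1881
------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3654205   .0663299     5.51   0.000     .2346128    .4962283
        math |   .4017207   .0725922     5.53   0.000     .2585632    .5448782
       _cons |    11.6155   3.054262     3.80   0.000     5.592255    17.63875
------------------------------------------------------------------------------
There’s some difference, but nothing that will change the interpretation of the results.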
Results: the regression model provides a good fit: there are significant relationships between y & the explanatory variables, with meaningful magnitudes.
The residuals are more or less properly distributed; & the one influential outlier doesn’t make any major difference.
Let’s try to improve the model by adding a new explanatory variable, ‘white,’ which is coded 0=nonwhite 1=white.
A binary categorical variable coded 0/1 is called an indicator, or dummy, variable.
. tab white
. su science, d
. ttest science, by(white) [exploratory hypothesis test]
Two-sample t test with equal variances
   Group |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+----------------------------------------------------------------
nonwhite |    55   45.65455   1.255887    9.313905    43.13664    48.17245
   white |   145       54.2   .7552879     9.09487    52.70712    55.69288
---------+----------------------------------------------------------------
combined |   200      51.85   .7000987    9.900891    50.46944    53.23056
---------+----------------------------------------------------------------
    diff |        -8.545455    1.44982               -11.40452   -5.686385
Degrees of freedom: 198
Ho: mean(nonwhite) - mean(white) = diff = 0
 Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
  t = -5.8941            t = -5.8941             t = -5.8941
P < t = 0.0000        P > |t| = 0.0000        P > t = 1.0000
. reg science read math white if complete==1
Source | SS df MS Number of obs = 200
------------+------------------------------ F( 3, 196) = 70.84
Model | 10148.3965 3 3382.79884 Prob > F = 0.0000
Residual | 9359.10349 196 47.750528 R-squared = 0.5202
------------+------------------------------ Adj R-squared = 0.5129
Total | 19507.50 199 98.0276382 Root MSE = 6.9102
------------------------------------------------------------------------------
science | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
read | .3227622 .0645911 5.00 0.000 .1953793 .450145
math | .380627 .0699709 5.44 0.000 .2426345 .5186194
white | 4.719807 1.139193 4.14 0.000 2.473158 6.966456
_cons | 11.53217 2.936238 3.93 0.000 5.741491 17.32284
------------------------------------------------------------------------------
Interpretation?
Interpretation of the dummy variable: the science scores of ‘whites’ are 4.7 points higher than those of ‘nonwhites,’ on average, other explanatory variables held constant.
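To make the dummy-variable interpretation concrete, the Python sketch below (not Stata) computes predicted science scores for two hypothetical students with identical read and math scores who differ only on ‘white’. The scores of 50 are arbitrary illustrative values; the coefficients and group means are copied from the outputs above.

```python
# Coefficients copied from: reg science read math white if complete==1
coef = {"_cons": 11.53217, "read": 0.3227622, "math": 0.380627, "white": 4.719807}

def predict_science(read: float, math_score: float, white: int) -> float:
    return (coef["_cons"] + coef["read"] * read
            + coef["math"] * math_score + coef["white"] * white)

# Two hypothetical students with the same read and math scores (50 is arbitrary).
adjusted_gap = predict_science(50, 50, 1) - predict_science(50, 50, 0)

# Raw difference in group means from the two-sample t test.
unadjusted_gap = 54.2 - 45.65455

print(round(adjusted_gap, 2), round(unadjusted_gap, 2))
```

The gap shrinks from about 8.5 points in the t test to about 4.7 points once read and math are held constant.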
. drop yhat e
. Check that N-observations is correct in the model.
. predict yhat if e(sample)
. predict e if e(sample), resid
. hist yhat, norm
. hist e, norm
. sort science
. l science yhat e
. rvfplot, yline(0)
. lvr2plot, ml(id)
. lincom _cons + math*45 + white
. lincom _cons + math*45 - white
Go through the model-assessment steps: Is this an improved model or not, & why?
Our discussion of linear regression brings us back to an earlier topic: linear transformations (Moore & McCabe, pages 51-55).
Recall that multiplying each observation by a positive number b multiplies measures of center & spread (e.g. mean, median, standard deviation, interquartile range) by b, & multiplies the variance by b squared; if b is greater than 1, this increases dispersion (i.e. inequality).
Recall also that adding or subtracting the same number a to observations adds a to measures of center & to quartiles & other percentiles but does not change measures of spread.
. gen xsci=5*science
. univar science xsci
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
science 200 51.85 9.90 26.00 44.00 53.00 58.00 74.00
xsci 200 259.25 49.50 130.00 220.00 265.00 290.00 370.00
. gen asci=5 + science
. univar science asci
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
science 200 51.85 9.90 26.00 44.00 53.00 58.00 74.00
asci 200 56.85 9.90 31.00 49.00 58.00 63.00 79.00
-------------------------------------------------------------------------------
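The same rules can be checked on a handful of numbers. This Python sketch (not Stata) uses five illustrative scores, borrowed from the minimum, quartiles, and maximum in the univar output rather than the full data set of 200 students.

```python
import math
from statistics import fmean, stdev, variance

# Illustrative values only (min, quartiles, max of science), not the real sample.
scores = [26, 44, 53, 58, 74]
times5 = [5 * s for s in scores]   # like: gen xsci = 5*science
plus5 = [5 + s for s in scores]    # like: gen asci = 5 + science

# Multiplying by 5 multiplies the mean and SD by 5, and the variance by 25.
print(fmean(times5) / fmean(scores), stdev(times5) / stdev(scores),
      variance(times5) / variance(scores))

# Adding 5 shifts the mean by 5 but leaves the SD unchanged.
print(fmean(plus5) - fmean(scores), stdev(plus5) - stdev(scores))
```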
Other kinds of transformations, however, are not linear but, rather, nonlinear (see Moore & McCabe, pages 187-203).
In contrast to linear transformations, nonlinear transformations potentially can normalize a variable’s distribution & straighten a curved relationship between two variables.
Why might we need to straighten out a skewed univariate distribution or a curved relationship between two variables?
(1) A skewed distribution may be difficult to examine because many observations may be piled up in one place or some observations may be hidden; & summary measures such as mean & standard deviation are distorted by skewed distributions.
(2) Linear relationships are easier to interpret; statistical theory is better developed for linear relationships; & nonparametric statistics are not as insightful as parametric statistics.
(3) A curvilinear relationship causes invalid results for correlation and regression.
The nonlinear transformations that we’ll briefly consider will consist of logarithms, powers & roots.
Remember: for correlation and regression, what really matters are not univariate distributions but rather bivariate relationships as displayed in scatterplots.
So, for correlation and regression, don’t make decisions about transformations until you’ve inspected the scatterplots.
Keeping the preceding point in mind, Stata makes choosing among potential non-linear transformations relatively easy.
Using data on household consumption per capita in Tegucigalpa:
. kdensity dhc, norm
. qladder dhc
[Figure: kernel density estimate of dollars household consumption per capita, with normal density overlay]
. kdensity dhc, norm scheme(economist)
[Figure: quantile-normal plots of dhc (new dollar-weighted welfare measure) by transformation: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic]
. qladder dhc
qladder dhc suggests that a log transformation of dhc will make its distribution more normal.
. gen ldhc = ln(dhc)
. su dhc ldhc
. for varlist ldhc-dhc: kdensity X, norm \ more
. kdensity dhc, norm
[Figure: kernel density estimate of dollars household consumption per capita (dhc), with normal density overlay]
. kdensity ldhc, norm
[Figure: kernel density estimate of ldhc, with normal density overlay]
The log transformation did considerably normalize the variable’s distribution (by dampening the effect of the right-skewed distribution’s high-end values).
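A small Python sketch (not Stata) of the dampening effect, using made-up right-skewed values rather than the Tegucigalpa data. The skewness function is the simple moment-based version, and both it and the sample values are assumptions for illustration.

```python
import math
from statistics import fmean

def skewness(values: list[float]) -> float:
    """Moment-based skewness: third central moment over variance^(3/2)."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Hypothetical consumption values with a long right tail (not the real dhc data).
raw = [100, 150, 200, 300, 500, 1000, 4000, 8000]
logged = [math.log(v) for v in raw]   # like: gen ldhc = ln(dhc)

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

Taking logs pulls in the high-end values, so the skewness of the logged values is much closer to zero than that of the raw values.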
The ‘qladder’ command is based on John Tukey’s ‘ladder of power transformations’ (see Moore & McCabe, pages 191-95; & see Stata’s command ‘ladder’).
See the ‘ladder’ command for a hypothesis-test based approach to selection.
The ladder of transformations recommends particular non-linear transformations for particular non-linear relationships.
While the ‘ladder’ does help, normalizing a variable’s distribution & linearizing a bivariate relationship generally involve trial & error.
And not all skewed distributions or non-linear bivariate relationships can be straightened out in a satisfactory way.
There’s always a trade-off, moreover: A nonlinear transformation may indeed normalize a variable or linearize a relationship between variables, but a significant cost may be diminished clarity of interpretation.
Remember: what really matters is the scatterplot, not the univariate frequency distributions.
So don’t make decisions about transformations until you’ve inspected the scatterplot.
There’ll be lots more transforming next semester.
Interaction Effects
One more thing: what if the relationship of y to x varies according to the level of another variable, z?
This is an ‘interaction effect’.
E.g., do not drink alcoholic beverages while taking medication X.
E.g., the effect of an educational intervention varies according to the gender of students.
A regression example: the outcome variable is the log of the annual household living standard measurement per capita, for a stratified random sample of households in several collective agricultural communities (‘ejidos’) in Quintana Roo, Mexico.
The value of the outcome variable increases with higher farm levels of mahogany production & with presence (vs. absence) of a community saw mill.
Is there an interaction effect between mahogany production level & mill presence/absence?
. univar lsm
                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     lsm     199     3.98     1.97     0.01     2.49     3.67     5.49     8.69
-------------------------------------------------------------------------------
. tabstat lsm, s(mean sd med iqr min max) by(mvc) f(%9.2f)
Summary for variables: lsm
     by categories of: mvc (Mahogany volume categories)
     mvc |      mean        sd       p50       iqr       min       max
---------+------------------------------------------------------------
       0 |      3.61      1.74      3.50      1.78      1.16      8.27
       1 |      3.33      1.77      3.16      2.36      0.01      8.15
       2 |      4.09      1.70      3.70      2.14      1.33      7.88
       3 |      2.83      0.93      2.64      1.27      1.53      5.47
       4 |      6.59      1.26      6.75      1.93      4.10      8.69
---------+------------------------------------------------------------
   Total |      3.98      1.97      3.67      3.01      0.01      8.69
[Figure: living standard measure vs. mahogany volume categories, with fitted values, in separate panels for No mill and Mill]
. twoway scatter lsm mvc || lfit lsm mvc, by(mill)
. xi3: reg lsm hway i.mill*mvc nmaya resworks
i.mill            _Imill_0-1          (naturally coded; _Imill_0 omitted)
(1 missing value generated)
      Source |       SS       df       MS              Number of obs =     199
-------------+------------------------------           F(  6,   192) =   20.87
       Model |  302.499837     6  50.4166394           Prob > F      =  0.0000
    Residual |  463.861506   192  2.41594534           R-squared     =  0.3947
-------------+------------------------------           Adj R-squared =  0.3758
       Total |  766.361343   198  3.87051183           Root MSE      =  1.5543
------------------------------------------------------------------------------
         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hway |   .6115082   .3280433     1.86   0.064    -.0355233     1.25854
    _Imill_1 |  -2.725266   .6579602    -4.14   0.000    -4.023024   -1.427508
         mvc |   -.019945   .1623813    -0.12   0.902    -.3402253    .3003354
    _Imi1Xmv |   1.236629   .2460735     5.03   0.000     .7512742    1.721983
       nmaya |   .9817913   .4365363     2.25   0.026     .1207687    1.842814
    resworks |   .6022761   .2480852     2.43   0.016     .1129537    1.091598
       _cons |   2.369156   .4068352     5.82   0.000     1.566716    3.171596
------------------------------------------------------------------------------
Interaction: coef=1.24, p=.000
postgr3 mvc, by(mill) table: graph of saw mill X mahogany volume categories (see next slide)
[Figure: predicted lsm (yhat) by mahogany volume categories, plotted separately for No mill and Mill]
. postgr3 mvc, by(mill) table [predicted average values of lsm by millXmvc]
Variables left asis: mvc _Imill_1 _Imi1Xmv
Holding hway constant at .79396985
Holding nmaya constant at .46231156
Holding resworks constant at .71356784
---------------------------------
Mahogany   |
volume     |        Mill
categories |   No mill       Mill
-----------+---------------------
         0 |  3.738333
         1 |  3.718388
         2 |             3.446435
         3 |  3.678499
         4 |             5.879803
---------------------------------
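The predicted values in this table can be rebuilt from the regression coefficients and the values at which the other variables are held constant. The Python sketch below (not Stata, and not how postgr3 itself is implemented) does that arithmetic; the short names `mill` and `millXmvc` stand in for Stata's `_Imill_1` and `_Imi1Xmv`.

```python
# Coefficients copied from: xi3: reg lsm hway i.mill*mvc nmaya resworks
coef = {"_cons": 2.369156, "hway": 0.6115082, "mill": -2.725266,
        "mvc": -0.019945, "millXmvc": 1.236629,
        "nmaya": 0.9817913, "resworks": 0.6022761}

# Values at which postgr3 reports holding the other variables constant.
held = {"hway": 0.79396985, "nmaya": 0.46231156, "resworks": 0.71356784}

def predict_lsm(mill: int, mvc: int) -> float:
    base = coef["_cons"] + sum(coef[name] * value for name, value in held.items())
    return (base + coef["mill"] * mill + coef["mvc"] * mvc
            + coef["millXmvc"] * mill * mvc)

for mvc in range(5):
    print(mvc, round(predict_lsm(0, mvc), 6), round(predict_lsm(1, mvc), 6))
```

Matching these values against the table indicates which column each reported prediction belongs to: categories 0, 1, and 3 match the no-mill predictions, and categories 2 and 4 match the mill predictions.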
. findit xi3 [download]
. help xi3
. findit postgr3 [download]
. help postgr3
. xi3: reg y x i.xcat*z
. postgr3 x, by(xcat)
[Figure: average living standard measure (fitted values) by mahogany volume categories, for No mill and Mill]
. twoway scatter yhat mvc if mill==0 || mspline yhat mvc if mill==0, clpatt(solid) || scatter yhat mvc if mill==1, ms(oh) || mspline yhat mvc if mill==1, clpatt(solid) || , ytitle("Ave. Living Standard Measure")
Another way to graph the interaction
As an alternative to postgr3, table or to complement interaction graphs in general, use lincom to explore predicted outcomes.
See Paul Allison, Multiple Regression: A Primer.
See the next slide…
. reg lsm hway mvc mill mvcXmill nmaya resworks
. lincom mvc*1 + mvcXmill*1
 ( 1)  mvc + mvcXmill = 0
         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.216684    .185311     6.57   0.000      .851177    1.582191
. lincom mill*1 + mvcXmill*2
 ( 1)  mill + 2 mvcXmill = 0
         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.2520085   .5373451    -0.47   0.640    -1.311866    .8078492
. lincom mill*1 + mvcXmill*3
 ( 1)  mill + 3 mvcXmill = 0
         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .9846203   .6311184     1.56   0.120    -.2601954    2.229436
. lincom mill*1 + mvcXmill*4
 ( 1)  mill + 4 mvcXmill = 0
         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   2.221249    .793086     2.80   0.006      .656969    3.785529
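The `lincom` point estimates are linear combinations of the regression coefficients, so they can be checked by hand. This Python sketch (not Stata) reproduces only the point estimates; the standard errors need the coefficient covariance matrix, which the slides do not show.

```python
# Coefficients copied from the interaction model's output.
b_mill, b_mvc, b_mvcXmill = -2.725266, -0.019945, 1.236629

# Slope of mvc where a mill is present: mvc + mvcXmill
slope_with_mill = b_mvc + b_mvcXmill

# Difference between mill and no mill at a given mvc level: mill + level*mvcXmill
mill_effect = {level: b_mill + level * b_mvcXmill for level in (2, 3, 4)}

print(round(slope_with_mill, 6))
for level, effect in mill_effect.items():
    print(level, round(effect, 6))
```

The mill difference is negative but insignificant at category 2 and grows to about 2.2 at category 4, which is the interaction effect in numeric form.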
You’ll spend lots of time analyzing interaction effects next semester.
On making correlation & regression tables, see the class document ‘Making working & publication-style tables in Stata’.
. findit esttab [download]
. findit eststo [download]
. reg lsm hway mill mvc nmaya resworks
. eststo
. reg lsm hway mill mvc millXmvc nmaya resworks
. eststo
. esttab, se starlevels(+ .10 * .05 ** .01 *** .001) r2 nodepvars nomtitles compress
                   (1)          (2)
hway             0.946**      0.612+
               (0.341)      (0.328)
mill            -0.753       -2.725***
               (0.560)      (0.658)
mvc              0.517***    -0.0199
               (0.130)      (0.162)
nmaya            1.517***     0.982*
               (0.449)      (0.437)
resworks         0.636*       0.602*
               (0.263)      (0.248)
millXmvc                      1.237***
                            (0.246)
_cons            1.477***     2.369***
               (0.388)      (0.407)
N                  199          199
R-sq             0.315        0.395
Standard errors in parentheses
+ p<.10, * p<.05, ** p<.01, *** p<.001