Review: Correlation vs. Regression

What are the main differences between correlation & regression?

What are the data requirements for each?

What are their principal vulnerabilities?

How do we establish causality?

Pearson correlation: a linear association between two quantitative variables.

Data Requirements

A probability sample—if the analysis will be inferential, as opposed to descriptive.

For OLS regression, the outcome variable must be quantitative (interval or ratio); the explanatory variables may be quantitative or categorical (nominal or ordinal).

What are the disadvantages of using correlation to study the relationships between two or more variables?

See Moore/McCabe, chapter 2.

Hypothesis tests for correlation

We can use Pearson correlation not only descriptively but also inferentially.

To use it inferentially, first use a scatterplot to check the bivariate relationship for linearity.

If the relationship is sufficiently linear, test:

$H_0: \rho_{xy} = 0$

$H_a: \rho_{xy} \neq 0$

In Stata, ‘pwcorr’ vs. ‘corr’

‘correlate’ (corr): uses ‘listwise,’ or ‘casewise,’ deletion. Any observation (i.e. individual, case) for which any of the correlated variables has missing data is dropped, so ‘corr’ only uses observations with complete data on all of the examined variables.

If, for the relationship between math and reading scores, observation #27 has, say, a missing math score, then ‘corr’ or ‘regress’ will automatically drop observation #27.

This is how regression works, so ‘corr’ corresponds to regression.

Moreover, ‘corr’ does not permit hypothesis tests.

pwcorr: (‘pairwise’) uses all of the non-missing observations for the examined variables (e.g., it would use observation #27’s reading score, even though #27’s math score is missing).

This does not correspond to the way that regression works.

Moreover, pwcorr permits hypothesis tests.

Note: There is a way to use ‘pwcorr’ so that, like regression analysis, it is based on casewise (i.e. listwise) deletion of missing observations. We’ll demonstrate this later.
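To make the difference between the two deletion rules concrete, here is a minimal sketch in Python (not Stata) that only counts which observations each rule would keep. The data and the helper names pairwise_n and listwise_n are hypothetical, invented for illustration.

```python
# Hypothetical scores; None marks a missing value (data are illustrative only).
rows = [
    {"read": 57, "write": 52, "math": 41},
    {"read": 68, "write": 59, "math": None},   # like observation #27: math is missing
    {"read": 44, "write": None, "math": 44},
    {"read": 63, "write": 44, "math": 54},
    {"read": 47, "write": 52, "math": 57},
]

def pairwise_n(rows, a, b):
    """Count observations with both a and b present (the pairwise rule)."""
    return sum(1 for r in rows if r[a] is not None and r[b] is not None)

def listwise_n(rows, variables):
    """Count observations complete on every listed variable (the listwise rule)."""
    return sum(1 for r in rows if all(r[v] is not None for v in variables))

print(pairwise_n(rows, "read", "write"))            # sample size for this pair only
print(pairwise_n(rows, "read", "math"))             # sample size for this pair only
print(listwise_n(rows, ["read", "write", "math"]))  # one common sample for all pairs
```

Under the pairwise rule each correlation can rest on a different subset of cases; under the listwise rule every correlation, like the regression, rests on the same cases.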

Use a Bonferroni or other multiple-test adjustment when simultaneously testing multiple correlation hypotheses:

. pwcorr read write math science socst, obs sig star(.05) bonf

Why is the multiple-test adjustment important?
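One way to see the point is to work out the chance of at least one false positive across all the pairwise tests. The sketch below is in Python rather than Stata. It assumes, purely for illustration, that the tests are independent and that every null hypothesis is true.

```python
from math import comb

alpha = 0.05
n_vars = 5                    # read, write, math, science, socst
n_tests = comb(n_vars, 2)     # number of distinct pairwise correlations

# Probability of at least one Type I error across all tests
# (independence of the tests is an assumption made for illustration).
familywise = 1 - (1 - alpha) ** n_tests

# Bonferroni idea: judge each test against alpha divided by the number of tests.
per_test_alpha = alpha / n_tests

print(n_tests, round(familywise, 3), per_test_alpha)
```

With five variables there are ten correlations, so the unadjusted chance of some spurious ‘significant’ result is far above the nominal .05.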

If the data have no missing values, then there’s no problem using pwcorr.

Contingency Table vs. Pearson Correlation

What if the premises of parametric statistics don’t hold?

E.g., what if the quantitative variables are based on a small sample (say, <30)?

Or what if the relationship is non-linear, and possible transformations (such as logarithmic) aren’t applicable or don’t work, and/or it doesn’t make sense to eliminate extreme outliers?

What if the quantitative variables are ordinal rather than interval in measurement?

In such cases it may be useful to: (1) use non-parametric procedures (such as Spearman rho [in Stata: ‘spearman x y’]); or (2) categorize the data & assess the bivariate association via a contingency table.

Non-parametric procedures are not premised on the approximate normality of the sampling distribution of sample means (see the Moore/McCabe chap. 7 & CD-Rom chapter).

As for contingency tables, here’s an example of how to do a contingency table in response to violated parametric assumptions: What’s the association between science & reading scores?

. xtile xsci=science, nq(4)

. xtile xread=read, nq(4)

. tab1 xsci xread

. bys xsci: su science

. bys xread: su read

. tab xread xsci, col chi2

Always have a good reason for how you categorize a quantitative variable.

Check # observations per cell for the validity of the Chi-square hypothesis test.
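The cell check concerns the expected counts under independence, which the Chi-square test relies on. The following Python sketch (not Stata) computes them for a hypothetical two-by-two table; the table values and function names are invented for illustration, and the threshold of about 5 per cell is a common rule of thumb rather than a fixed law.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

table = [[10, 20], [30, 40]]          # hypothetical observed counts
exp = expected_counts(table)
smallest_expected = min(min(row) for row in exp)

print(exp)
print(round(chi_square(table), 4))
print(smallest_expected >= 5)         # rule-of-thumb check on the smallest cell
```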

Measures of Correlation & Association Involving Categorical Variables

Here are some alternatives to Pearson correlation (including non-parametric [‘rank’] statistics):

There are other correlation coefficients & measures of association (besides ttest & prtest) for categorical variables & for combinations of categorical & continuous variables. E.g.:

Spearman correlation (i.e. ‘rank correlation’; see Moore/McCabe chap. 7 and CD-Rom chapter, Hamilton, chap. 6 & Stata Manual) (It is also an outlier-resistant alternative to ‘corr’ or ‘pwcorr’ for quantitative variables.):

. spearman ordinalscore ses

Kendall’s tau (like Spearman, but can be slower in Stata; see Hamilton, chap. 6 & Stata Manual)

. ktau ordinalscore ses

eta-squared: when one variable is quantitative continuous & the other is multi-level categorical (see Moore/McCabe, chap. 12, ‘ANOVA’; Hamilton, chap. 5, ‘ANOVA’; Stata Manual, ‘oneway’).

. oneway read ses, tabulate bonf [Bartlett test must be insignificant; see also ANOVA and loneway.]

biserial correlation & point biserial correlation: when one variable is quantitative continuous & the other is binary. Just use ‘corr’ or ‘pwcorr’—Stata or any other major software automatically makes the adjustment:

. pwcorr read female, obs sig star(.05) [same result as ttest read,by(female)]

phi coefficient: two categorical binary variables.

. pwcorr female white, obs sig star(.05) [same result as tab female white, col/row/cell chi2]

. tab female ses, all [output includes ‘Cramer’s V,’ which is an adaptation of phi coefficient for tables that are larger than two-by-two, but it likewise works in two-by-two tables]

Caution: Recall the ramifications of restricted-range data and ecological data for correlation results, & recall the need to consider lurking variables.

Non-parametric: rank data.

Parametric: premised on approximately normal sampling distribution of sample means (i.e. Central Limit Theorem).

CI’s for Pearson & Spearman correlations:

. findit ci2 [& download]

. ci2 read write, corr

Confidence interval for Pearson's product-moment correlation of read and write, based on Fisher's transformation.
Correlation = 0.597 on 200 observations (95% CI: 0.499 to 0.679)

. ci2 read ses, corr spearman

Confidence interval for Spearman's rank correlation of read and ses, based on Fisher's transformation.
Correlation = 0.280 on 200 observations (95% CI: 0.147 to 0.403)
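For the Pearson case, the Fisher-transformation interval can be sketched by hand. The Python below (not Stata) uses the textbook form: transform r with atanh, use a standard error of 1/sqrt(n - 3), and transform back. The function name fisher_ci is hypothetical, and whether ci2 computes the interval in exactly this way is an assumption.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = atanh(r)                                   # Fisher transformation of r
    se = 1 / sqrt(n - 3)                           # approximate standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return tanh(z - crit * se), tanh(z + crit * se)

low, high = fisher_ci(0.597, 200)                  # r and n as reported for read & write
print(round(low, 3), round(high, 3))
```

Because the input r is rounded to three decimals, the result can differ from the reported interval in the third decimal place.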

Regression Analysis

What are regression analysis’s major advantages over the alternatives for examining the relationships between two or more variables (see Moore/McCabe, chapter 2)?

Regression: examines how the values of an outcome variable y depend on the values of one or more explanatory variables x (i.e. the slope & direction of the y/x straight line).

On average, how does risk of heart disease (y) change with every unit of increase or decrease in amount of fat consumption (x1) & in amount of exercise (x2)?

On average, how does earnings level (y) change with every unit of increase or decrease in years of education (x2) & in years of person’s age (x1)?

Recall the problems of causality that we discussed in Chapter Two.

Always ask: What is the conceptual basis of the hypothesized or implied causal relationship? What if it were reversed?

See, e.g., King et al., Designing Social Inquiry; McClendon, Multiple Regression and Causal Analysis; Berk, Regression Analysis: A Constructive Critique.

Let’s start by interpreting the following simple regression model (i.e. regression model with one explanatory variable x).

. use hsb2, clear

. for varlist math read: kdensity X, norm \ more

[Figure: kernel density estimate with normal density overlay, math score]

[Figure: kernel density estimate with normal density overlay, reading score]


. gr box math read

[Figure: box plots of math score and reading score]

. su math read, d

                         math score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           36             33
 5%           39             35
10%           40             37       Obs                 200
25%           45             38       Sum of Wgt.         200

50%           52                      Mean             52.645
                        Largest       Std. Dev.      9.368448
75%           59             72
90%         65.5             73       Variance       87.76781
95%         70.5             75       Skewness       .2844115
99%           74             75       Kurtosis       2.337319

                        reading score
-------------------------------------------------------------
      Percentiles      Smallest
 1%         32.5             28
 5%           36             31
10%           39             34       Obs                 200
25%           44             34       Sum of Wgt.         200

50%           50                      Mean              52.23
                        Largest       Std. Dev.      10.25294
75%           60             73
90%           67             73       Variance       105.1227
95%           68             76       Skewness       .1948373
99%         74.5             76       Kurtosis       2.363052

. scatter math read || qfit math read

[Figure: scatterplot of math score against reading score, with fitted values]

. corr math read
(obs=200)

             |     math     read
-------------+------------------
        math |   1.0000
        read |   0.6623   1.0000

. pwcorr math read, obs sig

             |     math     read
-------------+------------------
        math |   1.0000
             |
             |      200
             |
        read |   0.6623   1.0000
             |   0.0000
             |      200      200

. reg math read

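      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------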

Interpretation?
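As one possible reading of the output, the coefficients define a prediction line: predicted math = _cons + (coefficient on read) x read. The Python sketch below (not Stata) plugs in the reported estimates; the function name predicted_math and the example reading scores are illustrative choices.

```python
B0 = 21.03816    # _cons: predicted math score when read = 0
B1 = 0.6051473   # coefficient on read: average change in math per one-point change in read

def predicted_math(read):
    """Predicted (average) math score at a given reading score."""
    return B0 + B1 * read

at_50 = predicted_math(50)
gain_per_10 = predicted_math(60) - predicted_math(50)

print(round(at_50, 2))         # predicted math score for a student with read = 50
print(round(gain_per_10, 2))   # average difference in math for a 10-point difference in read
```

The slope is an average association in this sample; by itself it says nothing about causality.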

The simple linear regression model assumes that the mean of the outcome variable is a linear function of one explanatory variable.

The multiple linear regression model—as we’ll later see—assumes that the mean of the outcome variable is a linear function of multiple explanatory variables: implication?

Regression Analysis Does Not Assume the Following!

Regression analysis does not assume that the sample values of the outcome & explanatory variables have normal distributions!

It does assume an approximately linear y/x relationship.

And it does assume that the distribution of the residuals is approximately normal and is constant across the values of each explanatory variable, with an expected value of zero.

More on this later…

Simple linear regression model:

$\mu_y = \beta_0 + \beta_1 x$

Multiple linear regression model:

$\mu_y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$

Regression model: a set of variables & their hypothesized relationships.

Basic research strategy in multiple regression: compare models—which model provides the best explanation (or prediction) for a research question & the data?

What’s the advantage of multiple regression over simple regression?

Multiple regression allows us to examine how values of an outcome variable vary in association with changes in the values of more than one explanatory variable.

The effect of each x on the value of y is measured holding the other x’s constant.

Thus a given x may perform differently within differing sets of x’s.

In multiple regression, the value of each explanatory x’s slope (i.e. beta or regression) coefficient is its partial (i.e. net) effect, holding the other explanatory variables constant.

So, in multiple regression, the value of a slope (i.e. regression) coefficient may vary according to which other explanatory variables are included in the model.

What is accomplished by holding the model’s other variables constant?

How does this compare to experimental design?

Thus, when interpreting the effect of any one explanatory variable on y, consider the model’s other explanatory variables as held constant.

On average, how does risk of heart disease (y) change with every unit of increase in amount of fat consumption (x1), holding constant amount of exercise (x2)?

On average, how does earnings level (y) change with every additional year of a person’s age (x2), holding constant years of education (x1)?

The characteristics of scatterplots & correlations don’t necessarily predict whether an explanatory variable will test significant in multiple regression.

That’s because multiple regression expresses the joint, linear effect of a set of explanatory variables on an outcome variable y.

That is, the regression model’s whole is more than the sum of its parts.

In fact, significant bivariate relationships may become insignificant in multiple regression.

Or insignificant bivariate relationships may become significant.

Or positive bivariate relationships may become negative, & vice versa (‘Simpson’s Paradox’).

On such complexities within multiple regression models, see McClendon, Multiple Regression and Causal Analysis.

And see Agresti/Finlay, Statistical Methods for the Social Sciences, chapter 10.

To repeat:

Regression model: a set of variables & their hypothesized relationships.

Basic research strategy: compare models—which model provides the best explanation (or prediction) for a research question & the data?

The estimated (i.e. probabilistic) regression line:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$

The estimated (i.e. probabilistic) regression line contains a component of uncertainty, or error: the deviations between the observed values of y & the estimated values of y.

The most important statistical assumptions of the linear regression model: the distribution of residuals (i.e. prediction ‘errors’) is (1) approximately normal & (2) is constant for all the values of each explanatory variable x, with an expected value of zero.

These are the principal assumptions that we check in our diagnostic graphs after estimating a regression model.

Why are the assumptions of constant, normal distribution of residuals & zero expected value of residuals so important?

(1) The expected value of the residuals equals zero: guarantees that the estimates of the y-intercept & the slope coefficients are unbiased estimates of the corresponding population values.

(2) Constant spread of residuals: minimizes the standard errors of the estimates of the y-intercept & the slope coefficients, which is necessary for the usefulness of confidence intervals & tests of significance.

What if the assumption of normal, constant distribution of residuals with an expected value of zero does not hold for a given estimated regression model?

Violations of assumptions are a matter of degree. Assessing the degree of violation & taking proper corrective action are advanced topics of regression diagnostics.

How to check if there’s a constant, normal distribution of residuals & zero expected value of residuals?

. reg math read

. predict e, resid [e = ‘errors’]

. hist e, norm

. rvfplot, yline(0)

If the distribution is approximately normal, as it roughly is above (note some negative skewness), then the assumption basically holds. In this specific case, the slight negative skewness does alert us to possible problems with other diagnostics.

[Figure: histogram of residuals with normal density overlay]

[Figure: residuals vs. fitted values (rvfplot)]

If the distribution of the residuals is approximately random, which it roughly is above (note the degree of rightward expansion), then the assumption basically holds. In this specific case, we would want to check other diagnostics to confirm that there are no serious violations of the assumptions.

Another potential problem we check in regression diagnostics is that of x-outliers.

x-outliers that fall far from the mean in the y/x scatterplot may be influential: i.e. they may exert an excessive effect on the slope coefficients.

Always ask: Why are there outliers? What is their effect? How should we deal with them?

How to check for regression outliers in Stata? Here’s the preliminary way:

. scatter math read, ml(id)

(In this data set, id identifies each subject or observation.)

Look for x-observations that fall far away from the pack to the left or right.

There are no notable outliers.

[Figure: scatterplot of math score against reading score, points labeled by id]


If there were notable outliers, we would estimate the regression model both with & without the outliers, then compare the models.

. reg math read

. reg math read if id~=15 & id~=167

More important, however, are post-estimation diagnostics that assess the influence of outliers within a regression model (e.g., lvr2plot, avplots, dfbeta). E.g.:

. reg math read

. lvr2, ml(id)

Notable outliers would be located in the right-hand quadrant.

[Figure: leverage vs. normalized residual squared, points labeled by id]

In any case, we use the estimated regression line—which must be based on a random sample of observations measured on the same individuals or subjects—to estimate the population regression line.

Sampling error (as well as non-sampling error) causes uncertainty in the estimated regression line, as in inferential statistics in general.

Regression measures a linear association: non-linearity—which if present emerges in non-normal distributions of residuals—creates misleading results.

Before we proceed, recall that there are two regression lines.

What are the two regression lines? Why are there two lines? How does this distinguish regression from correlation? What are the pitfalls of this?

As with correlation, beware of lurking variables in regression analysis.

As we’ll see, multiple (rather than simple) regression addresses the problem of lurking variables, though not as effectively as experimental design.

How to do simple (i.e. one explanatory variable) linear regression in Stata?

Let’s assume that the preliminary graphical & numerical descriptions have been carried out, & that the scatterplot indicates an approximately linear y/x relationship.

. reg math read


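      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------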

Interpretation?

Sample-estimated variation in the x-based values of response variable y is measured by residuals (i.e. deviations or errors): the deviations between each observed y & each predicted y (‘yhat’).

REGRESSION DATA = FIT + RESIDUALS

Fit: the model’s estimate of y’s average value for each level of an x-variable.

Residuals: the deviations (‘errors’) of the predicted y values (yhat) from the observed y values (e.g., the deviations of the predicted math scores from the observed math scores)

SST = SSM + SSE


This formula underpins various diagnostic measures of a model’s explanatory/ predictive worth.
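The decomposition can be checked numerically on a tiny example. The Python sketch below (not Stata) fits a one-variable least-squares line by hand and confirms that SST = SSM + SSE, with each deviation squared before summing. The five data points are hypothetical.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical explanatory values
y = [2.0, 4.0, 5.0, 4.0, 5.0]   # hypothetical outcome values

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for simple regression.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                      # fit
residuals = [yi - fi for yi, fi in zip(y, y_hat)]       # data minus fit

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssm = sum((fi - y_bar) ** 2 for fi in y_hat)            # variation explained by the model
sse = sum(e ** 2 for e in residuals)                    # unexplained variation

print(round(sst, 6), round(ssm, 6), round(sse, 6))
print(round(ssm / sst, 6))                              # r-squared
```

The residuals also sum to (numerically) zero, which is a property of any least-squares fit that includes an intercept.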

. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------

Sum of Squares Total (SST): each observed y minus the mean of y; square each deviation; sum the squared values.

Sum of Squares for Model (SSM): each predicted y minus the mean of y; square each deviation; sum the squared values.

Sum of Squares for Errors (SSE) (i.e. Sum of Squares for Residuals): each observed y minus its predicted y; square each deviation; sum the squared values.


Linear regression is called ‘ordinary least squares’ (OLS) because the equation chooses the straight line that minimizes the sum of the squared deviations between the observed values of y & the model’s predicted values of y (yhat; e.g., predicted math test scores).


We use the sample data—i.e. the observations on outcome variable y & explanatory variables x—to estimate the following:

(1) The slope coefficients:

$b_k = r \cdot \dfrac{s_y}{s_x}$

r = correlation of x & y; s = standard deviation.


Here’s how to graph the confidence interval of a regression coefficient:

. twoway qfitci math read

[Figure: fitted values of math score against reading score, with 95% CI]

See the course document ‘Graphing confidence intervals in Stata’ for other options.

(2) Y-intercept (the formula being for simple regression):

$b_0 = \bar{y} - b\,\bar{x}$

The value of predicted y when the explanatory x’s are zero.

It typically has no substantive meaning. Why not?
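For simple regression, the slope and intercept can be recovered from summary statistics alone: slope = r x (s of y / s of x), and intercept = mean of y - slope x mean of x. The Python sketch below (not Stata) plugs in the correlation, standard deviations and means reported earlier for math and read.

```python
r = 0.6623                      # correlation of math & read (from corr)
s_y, s_x = 9.368448, 10.25294   # standard deviations of math & read (from su, d)
y_bar, x_bar = 52.645, 52.23    # means of math & read (from su, d)

b1 = r * s_y / s_x              # slope coefficient on read
b0 = y_bar - b1 * x_bar         # y-intercept (_cons)

print(round(b1, 3), round(b0, 2))
```

The results agree with the ‘reg math read’ output up to the rounding of r. Note that read = 0 lies far outside the observed range of reading scores, which bears on the question about the intercept.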

. reg math read



_cons: y-intercept

(3) The residuals:

$e_i = y_i - \hat{y}_i$

$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_k x_{ik}$

The residuals ei correspond to the deviations of each predicted y (i.e. each yhat) from each observed y.

The residuals ei must have an approximately normal distribution with an expected value of zero (over an infinite number of observations).

How to Obtain the Predicted Values of Y & the Residuals in Stata

. reg math read

. predict yhat [yhat=predicted values of y]

. predict e, resid [e=residuals]

. for varlist read yhat e: hist X, norm

The method of least squares chooses the values of the regression coefficients that make the sum of the squares of the residuals as small as possible:

$\sum_i e_i^2 = \sum_i (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_k x_{ik})^2$

Software regression output typically refers to the residual-terms—which go into the computation of model fit indicators— more or less as follows:

s2 = Mean Square Error (MSE: sum of squared residuals/df, i.e. the estimated variance of the residuals)

s = Root Mean Square Error (Root MSE: the estimated standard deviation of the residuals, which equals the square root of MSE)

. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198      49.520           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371


MS Error (Residual) = 9805.04/198 = 49.52

Root MSE: sqrt 49.52 = 7.0371
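The same arithmetic as a small Python sketch (not Stata), using the residual sum of squares and degrees of freedom from the output table. The residual df is n minus the number of estimated coefficients, here 200 - 2 = 198.

```python
from math import sqrt

ss_residual = 9805.03595       # SS for Residual, from the output table
n, k = 200, 1                  # observations & explanatory variables
df_residual = n - k - 1        # 198

mse = ss_residual / df_residual    # MS for Residual (Mean Square Error)
root_mse = sqrt(mse)               # Root MSE

print(round(mse, 4), round(root_mse, 4))
```

Root MSE is in the units of y: here, the typical size of a prediction error is about 7 math-score points.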

Stata labels the Residuals, Mean Square Error & Root Mean Square Error as follows:

Top-left table

SS for Residual: sum of squared errors (i.e. sum of squared residuals)

MS for Residual: SS for Residual/df (i.e. the estimated variance of the residuals)

Top-right column

Root MSE: square root of MS for Residual (i.e. the estimated standard deviation of the residuals)

To repeat:

Top-left table

SS for Residual: sum of squared errors

MS for Residual: SS for Residual/d.f.

Top-right column

Root MSE: square root of MS for Residual

r-squared: A Measure of Model Fit

Two basic ways of assessing model fit: (1) the slope of the regression line & (2) the amount of cluster around the regression line.

The regression coefficient=slope of the regression line (higher coefficient=greater slope).

r-squared=the degree of cluster around the regression line (higher r-squared=greater cluster).

r-squared=what percentage of the variance of y is explained by the explanatory variables?

That is, how much of the variance of y is explained by the model versus how much is explained by merely using the mean of y?

r-squared=square of correlation yx in simple regression

Note: r-squared for multiple regression is more complicated to compute, as discussed later.

Regression output table:

r-squared = SS Model/SS Total

Sum of Squares Total (SST): each observed y minus the mean of y; square each deviation; sum the squared values.

Sum of Squares for Model (SSM): each predicted y minus the mean of y; square each deviation; sum the squared values.

Sum of Squares for Errors (SSE): each observed y minus its predicted y; square each deviation; sum the squared values.

. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |    7660.759     1     7660.75           Prob > F      =  0.0000
    Residual |  9805.03595   198      49.520           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |    17465.79   199  87.7678141           Root MSE      =  7.0371


r-squared=7660.759/17465.79=0.4386

This is the percentage of variance in y that the model explains.
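The other fit statistics in the top-right column follow from the same sums of squares. The Python sketch below (not Stata) reproduces R-squared, adjusted R-squared and F for ‘reg math read’. The adjusted R-squared formula used here is the standard one, 1 - (1 - R2)(n - 1)/(n - k - 1).

```python
ss_model = 7660.75905          # SS for Model, from the output table
ss_total = 17465.795           # SS Total, from the output table
n, k = 200, 1                  # observations & explanatory variables

r2 = ss_model / ss_total
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

ms_model = ss_model / k
ms_residual = (ss_total - ss_model) / (n - k - 1)
f_stat = ms_model / ms_residual

print(round(r2, 4), round(adj_r2, 4), round(f_stat, 2))
```

In simple regression R-squared is also the square of the correlation: 0.6623 squared is about 0.4386.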

Multiple Regression

What did we say was the advantage of multiple regression over simple regression?

Multiple regression allows us to examine how values of response variable y vary according to changes in the values of more than one explanatory variable x.

The relationship of each x to the value of y is measured holding the other x’s constant. A given x may therefore perform differently within differing sets of x’s.

Multiple regression, then, allows us not only to examine more than one explanatory variable but in doing so to control—that is, hold constant—otherwise lurking variables.

This approximates experimental control for lurking variables. Why is it not as effective as experimental design in controlling lurking variables?

And what are the intrinsic problems regarding causality?

. reg math read

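      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------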

. reg math read write science

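------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3064101   .0604347     5.07   0.000     .1872243    .4255959
       write |   .2609042   .0618005     4.22   0.000     .1390248    .3827836
     science |   .2543792   .0611411     4.16   0.000     .1338003    .3749582
       _cons |    9.68242   2.817438     3.44   0.001     4.126034    15.23881
------------------------------------------------------------------------------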


What happened to the coefficient for ‘read’? Why?

How do we evaluate how well an estimated regression model fits the data?

(1) F-test: overall significance of the model

(2) t-tests of each slope coefficient

(3) r2: overall explanatory/predictive effectiveness of the model

(4) Post-estimation diagnostics: to assess residuals for non-linearity & to assess the influence of outliers.

What is the problem with conducting several hypothesis tests of slope coefficients in a single equation?

The probability of Type I errors.

Begin, then, with an F-test for overall model significance, before testing the slope coefficients.

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$

$H_a$: at least one $\beta_j \neq 0$

F=Mean Square for Model/Mean Square for Error

The F-test examines whether at least one of the regression coefficients has a statistically significant relationship with the outcome variable y.

Only if the F-test is significant do we then test for t-significance of the individual slope coefficients.


Interpretation?




F-statistic = 3229.08/39.69 = 81.36



Like other tests of significance, F is the magnitude of the effect divided by the error term.

To repeat, conduct the F-test for overall model significance, before testing the slope coefficients.

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$

$H_a$: at least one $\beta_j \neq 0$

F=Mean Square for Model/Mean Square for Error

The F-test examines whether at least one of the regression coefficients has a statistically significant relationship with the outcome variable y.

Only if the F-test is significant do we then test for t-significance of the individual slope coefficients.

t-tests of Slope Coefficients

Conduct a significance test (one or two-sided) for each slope coefficient.

$H_0: \beta_1 = 0$

$H_a: \beta_1 \neq 0$

(or one-sided test: > or <)

Beware of Type I errors.


t-value for read = .306/.060 = 5.07
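The same division gives the t-value for every row of the coefficient table. The Python sketch below (not Stata) applies t = coefficient / standard error to the estimates reported for ‘reg math read write science’.

```python
# (coefficient, standard error) pairs copied from the regression output.
estimates = {
    "read":    (0.3064101, 0.0604347),
    "write":   (0.2609042, 0.0618005),
    "science": (0.2543792, 0.0611411),
    "_cons":   (9.68242, 2.817438),
}

# t statistic for H0: coefficient = 0 is the estimate divided by its standard error.
t_values = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_values.items():
    print(name, round(t, 2))
```

Each t-value is then compared with a t distribution on the residual degrees of freedom to obtain the P>|t| column.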



R2

R2: the squared multiple correlation (capital R, vs. the previous lowercase r) measures the proportion of the variance of the outcome variable that is explained by the explanatory variables (i.e. degree of cluster around the regression line).

R2 is the square of the correlation of the predicted values of y with the observed values of y.

It tells us what percentage of y’s variance is accounted for by the model (i.e. the explanatory variables): higher R2, greater fit.


R2: Model SS/Total SS=9687.25/17465.79 = 0.55



R-squared, however, is much less preferred than the slope coefficients as an indicator of model fit.

Recall the difference between ‘historicist’ & ‘generalizing’ explanation. Simply adding more explanatory variables—even if they’re meaningless—will increase R-squared.

How to do multiple regression in Stata?

Example: How much do science achievement test scores depend on reading achievement test scores & math achievement test scores in a random sample of 200 high school students?

If there are any missing observations:

. mark complete

. markout complete science read math

Alternatively: perhaps save a working data set that excludes the missing observations.

Regarding mark & markout:

‘complete’ is a binary, or dummy, variable:

1=complete data

0=incomplete data

. tab complete

. gr matrix science read math if complete==1, half

[Figure: scatterplot matrix of science score, reading score & math score]

. kdensity science if complete==1, norm

. gr box science if complete==1

. kdensity read if complete==1, norm

. gr box read if complete==1

. kdensity math if complete==1, norm

. gr box math if complete==1


. su science read math if complete==1, detail

. pwcorr science read math if complete==1, obs sig bonf star(.05)

             |  science     read     math
-------------+---------------------------
     science |   1.0000
             |
             |      200
             |
        read |  0.6302*  1.0000
             |   0.0000
             |      200      200
             |
        math |  0.6307*  0.6623*  1.0000
             |   0.0000   0.0000
             |      200      200      200

Note this way of using pwcorr (‘if complete==1’) corresponds to how regression uses the observations.

Correlation ci’s

. ci2 read write if complete==1, corr

. ci2 read science if complete==1, corr

. ci2 write science if complete==1, corr

Confidence interval for Pearson's product-moment correlation of read and write, based on Fisher's transformation.Correlation = 0.597 on 200 observations (95% CI: 0.499 to 0.679)

Confidence interval for Pearson's product-moment correlation of write and science, based on Fisher's transformation. Correlation = 0.570 on 200 observations (95% CI: 0.469 to 0.657)


‘Partial coefficients’—correlation coefficients of the outcome variable with the explanatory variables, for each variable holding the others constant—are also helpful:

. pcorr science read math if complete==1

Partial correlation of science with

Variable | Corr. Sig.

-------------+------------------

read | 0.3654 0.000

math | 0.3668 0.000

Compare partial correlations to correlations.

Here are the hypotheses that we’ll test.

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$

$H_a$: at least one $\beta_j \neq 0$

check for F-test significance

$H_0: \beta_1 = 0$

$H_a: \beta_1 \neq 0$

check for t-test significance

. reg science read if complete==1

science | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+--------------------------------------------------------------------- read | .6085207 .0532864 11.42 0.000 .503439 .7136024 _cons | 20.06696 2.836003 7.08 0.000 14.47432 25.65961------------------------------------------------------------------------------------

. reg science math if complete==1

     science |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
        math |    .66658  .0582822   11.44   0.000    .5516466    .7815135
       _cons |  16.75789  3.116229    5.38   0.000    10.61264    22.90315

. reg science read math if complete==1


                                                   F(  2,   197) =   90.27
       Model |  9328.73944     2  4664.36972       Prob > F      =  0.0000
       Total |    19507.50   199  98.0276382       Root MSE      =  7.1881

     science |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
        read |  .3654205  .0663299    5.51   0.000    .2346128    .4962282
        math |  .4017207  .0725922    5.53   0.000    .2585632    .5448782
       _cons |   11.6155  3.054262    3.80   0.000    5.592255    17.63875

What might happen to the slope coefficients if we add other explanatory variables? Why?

Here’s a quick, simple way to graph regression coefficient ci’s in multiple regression:


. gorciv read math

[Figure: coefficient estimates with confidence intervals for read & math; y-axis: Estimate]

See ‘Graphing confidence intervals in Stata’ for the commands ‘parmby,’ ‘sencode,’ & ‘ecplot’ to make more useful graphs.

. Check that N-observations is correct in the model.

. predict yhat if e(sample)

. predict e if e(sample), resid

. hist yhat, norm

. hist e, norm

. rvfplot, yline(0)

. sort science [to order low-high values]

. list science yhat e

. lvr2plot, ml(id)

. lincom _cons + read*45 + math*45

. lincom _cons + read*65 + math*65
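The two lincom commands ask for the predicted science score at chosen values of read & math. As a rough check on the point estimates only, here is a small Python sketch of the same arithmetic, using the coefficients from the ‘reg science read math’ output; `predict_science` is a hypothetical helper, & unlike lincom it gives no standard error or confidence interval.

```python
# Coefficients copied from ". reg science read math if complete==1".
B_CONS = 11.6155
B_READ = 0.3654205
B_MATH = 0.4017207


def predict_science(read, math_score):
    """Predicted science score from the fitted two-predictor model."""
    return B_CONS + B_READ * read + B_MATH * math_score


low = predict_science(45, 45)    # mirrors: lincom _cons + read*45 + math*45
high = predict_science(65, 65)   # mirrors: lincom _cons + read*65 + math*65

print(f"read=45, math=45: {low:.2f}")
print(f"read=65, math=65: {high:.2f}")
print(f"difference:       {high - low:.2f}")
```

The difference between the two predictions is just 20 times the sum of the two slopes, since both predictors rise by 20 points.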

[Figure: histogram of the residuals with normal curve overlaid; x-axis: Residuals, y-axis: Density]

The residuals are approximately normal in distribution.

[Figure: residual-versus-fitted plot; x-axis: Fitted values, y-axis: Residuals]

There might be some rightward tilt worth exploring. We can use ‘rvpplot’ to explore individual variables.

. rvpplot read, yline(0)

. rvpplot math, yline(0)

[Figure: two residual-versus-predictor plots, one per command; y-axis: Residuals; x-axis of the second panel: math score]

There might be some minor problem with read’s relationship with math.

[Figure: leverage-versus-squared-residual plot, points labeled by id (1 to 200); x-axis: Normalized residual squared, y-axis: Leverage]

lvr2plot: let’s estimate the model with & without id=167 to see if there’s a notable difference.

. reg science read math if id~=167

     science |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
        read |  .3280934  .0670811    4.89   0.000    .1958001    .4603867
        math |  .4475735  .0738742    6.06   0.000    .3018831    .5932638
       _cons |  11.05814   3.02127    3.66   0.000    5.099772    17.01651

. reg science read math if complete==1

     science |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
        read |  .3654205  .0663299    5.51   0.000    .2346128    .4962283
        math |  .4017207  .0725922    5.53   0.000    .2585632    .5448782
       _cons |   11.6155  3.054262    3.80   0.000    5.592255    17.63875

There’s some difference, but nothing that will change the interpretation of the results.



Results: the regression model provides a good fit: there are significant relationships between y & the explanatory variables, with meaningful magnitudes.

The residuals are more or less properly distributed; & the one influential outlier doesn’t make any major difference.

Let’s try to improve the model by adding a new explanatory variable, ‘white,’ which is coded 0=nonwhite 1=white.

A binary categorical variable coded 0/1 is called an indicator, or dummy, variable.

. tab white

. su science, d

. ttest science, by(white) [exploratory hypothesis test]

. ttest science, by(white)

Two-sample t test with equal variances

   Group |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+-----------------------------------------------------------------
nonwhite |    55   45.65455    1.255887    9.313905   43.13664    48.17245
   white |   145       54.2    .7552879     9.09487   52.70712    55.69288
---------+-----------------------------------------------------------------
combined |   200      51.85    .7000987    9.900891   50.46944    53.23056
---------+-----------------------------------------------------------------
    diff |        -8.545455     1.44982              -11.40452   -5.686385

Degrees of freedom: 198

Ho: mean(nonwhite) - mean(white) = diff = 0

Ha: diff < 0          Ha: diff != 0         Ha: diff > 0
t = -5.8941           t = -5.8941           t = -5.8941
P < t = 0.0000        P > |t| = 0.0000      P > t = 1.0000
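As a quick check on how the t statistic in this output is built, the arithmetic can be sketched from the reported group means & the standard error of the difference (values copied from the table; the notation is introduced here for illustration):

```latex
\[
t \;=\; \frac{\bar{y}_{\text{nonwhite}} - \bar{y}_{\text{white}}}{SE_{\text{diff}}}
  \;=\; \frac{45.65455 - 54.2}{1.44982}
  \;=\; \frac{-8.545455}{1.44982}
  \;\approx\; -5.894,
\qquad
df \;=\; n_{\text{nonwhite}} + n_{\text{white}} - 2 \;=\; 55 + 145 - 2 \;=\; 198 .
\]
```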

. reg science read math white if complete==1

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   70.84
       Model |  10148.3965     3  3382.79884           Prob > F      =  0.0000
    Residual |  9359.10349   196   47.750528           R-squared     =  0.5202
-------------+------------------------------           Adj R-squared =  0.5129
       Total |    19507.50   199  98.0276382           Root MSE      =  6.9102

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3227622   .0645911     5.00   0.000     .1953793     .450145
        math |    .380627   .0699709     5.44   0.000     .2426345    .5186194
       white |   4.719807   1.139193     4.14   0.000     2.473158    6.966456
       _cons |   11.53217   2.936238     3.93   0.000     5.741491    17.32284
------------------------------------------------------------------------------
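The fit statistics in the header of this output follow directly from the sums of squares. A sketch of the arithmetic, with n = 200 observations & k = 3 explanatory variables (the symbols n & k are shorthand introduced here):

```latex
\[
R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{10148.3965}{19507.50} \approx 0.5202,
\qquad
F = \frac{MS_{\text{Model}}}{MS_{\text{Residual}}} = \frac{3382.79884}{47.750528} \approx 70.84,
\]
\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
                 = 1 - (1 - 0.5202)\,\frac{199}{196} \approx 0.5129,
\qquad
\text{Root MSE} = \sqrt{MS_{\text{Residual}}} = \sqrt{47.750528} \approx 6.910 .
\]
```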

Interpretation?

Interpretation of the dummy variable: the science scores of ‘whites’ are 4.7 points higher than those of ‘nonwhites,’ on average, other explanatory variables held constant.

. drop yhat e

. Check that N-observations is correct in the model.

. predict yhat if e(sample)

. predict e if e(sample), resid

. hist yhat, norm

. hist e, norm

. sort science

. l science yhat e

. rvfplot, yline(0)

. lvr2plot, ml(id)

. lincom _cons + math*45 + white

. lincom _cons + math*45 - white

Go through the model-assessment steps: Is this an improved model or not, & why?

Our discussion of linear regression brings us back to an earlier topic: linear transformations (Moore & McCabe, pages 51-55).

Recall that multiplying each observation by a positive number b multiplies measures of center (mean, median) & of spread (standard deviation, interquartile range) by b, & multiplies the variance by b²; if b > 1, this increases absolute dispersion.

Recall also that adding or subtracting the same number a to observations adds a to measures of center & to quartiles & other percentiles but does not change measures of spread.

. gen xsci=5*science

. univar science xsci

Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
 science     200    51.85     9.90    26.00    44.00    53.00    58.00    74.00
    xsci     200   259.25    49.50   130.00   220.00   265.00   290.00   370.00
-------------------------------------------------------------------------------

. gen asci=5 + science

. univar science asci

Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
 science     200    51.85     9.90    26.00    44.00    53.00    58.00    74.00
    asci     200    56.85     9.90    31.00    49.00    58.00    63.00    79.00
-------------------------------------------------------------------------------
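The same two rules can be checked on any small set of numbers. Here is a minimal Python sketch (not Stata) that mirrors ‘gen xsci=5*science’ & ‘gen asci=5 + science’; the five scores are hypothetical values chosen for illustration, not the class dataset.

```python
import statistics as st

# Hypothetical scores for illustration only (not the class data).
scores = [26, 44, 53, 58, 74]

scaled = [5 * x for x in scores]    # like: gen xsci = 5*science
shifted = [5 + x for x in scores]   # like: gen asci = 5 + science

for label, values in [("scores", scores), ("scaled", scaled), ("shifted", shifted)]:
    print(f"{label:8s} mean={st.mean(values):7.2f}  "
          f"sd={st.stdev(values):6.2f}  var={st.variance(values):8.2f}")
```

Multiplying by 5 multiplies the mean & standard deviation by 5 & the variance by 25; adding 5 shifts the mean by 5 & leaves the standard deviation & variance unchanged.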

Other kinds of transformations, however, are not linear but, rather, nonlinear (see Moore & McCabe, pages 187-203).

In contrast to linear transformations, nonlinear transformations potentially can normalize a variable’s distribution & straighten a curved relationship between two variables.

Why might we need to straighten out a skewed univariate distribution or a curved relationship between two variables?

(1) A skewed distribution may be difficult to examine because many observations may be piled up in one place or some observations may be hidden; & summary measures such as mean & standard deviation are distorted by skewed distributions.

(2) Linear relationships are easier to interpret; statistical theory is better developed for linear relationships; & nonparametric statistics are not as insightful as parametric statistics.

(3) Pearson correlation & OLS regression assume a linear relationship, so applying them to a curvilinear relationship gives misleading results.

The nonlinear transformations that we’ll briefly consider will consist of logarithms, powers & roots.

Remember: for correlation and regression, what really matters are not univariate distributions but rather bivariate relationships as displayed in scatterplots.

So, for correlation and regression, don’t make decisions about transformations until you’ve inspected the scatterplots.

Keeping the preceding point in mind, Stata makes choosing among potential non-linear transformations relatively easy.

Using data on household consumption per capita in Tegucigalpa:

. kdensity dhc, norm

. qladder dhc

[Figure: kernel density estimate of dhc with normal density overlaid; x-axis: Dollars household consumption per capita, y-axis: Density]

. kdensity dhc, norm scheme(economist)

[Figure: quantile-normal plots of dhc (new dollar-weighted welfare measure) by transformation; panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic]

. qladder dhc

qladder dhc suggests that a log transformation of dhc will make its distribution more normal.

. gen ldhc = ln(dhc)

. su dhc ldhc

. for varlist ldhc-dhc: kdensity X, norm \ more

. kdensity dhc, norm

[Figure: kernel density estimate of dhc with normal density overlaid; x-axis: Dollars household consumption per capita, y-axis: Density]


. kdensity ldhc, norm

[Figure: kernel density estimate of ldhc with normal density overlaid; x-axis: ldhc, y-axis: Density]

The log transformation did considerably normalize the variable’s distribution (by dampening the effect of the right-skewed distribution’s high-end values).
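To make the ‘dampening’ concrete, here is a minimal Python sketch (not Stata) that compares the skewness of a right-skewed variable before & after taking natural logs. The ten consumption values are hypothetical, made up for illustration (they are not the Tegucigalpa data), & `skewness` is a hand-written helper using the simple moment formula.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed consumption values (illustration only).
consumption = [200, 350, 500, 650, 800, 1000, 1300, 1800, 3000, 7500]
log_consumption = [math.log(v) for v in consumption]  # like: gen ldhc = ln(dhc)

skew_raw = skewness(consumption)
skew_log = skewness(log_consumption)

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

The log pulls in the few very large values much more than it moves the small ones, so the skewness of the logged variable is much closer to zero.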

The ‘qladder’ command is based on John Tukey’s ‘ladder of power transformations’ (see Moore & McCabe, pages 191-95; & see Stata’s command ‘ladder’).

See the ‘ladder’ command for a hypothesis-test based approach to selection.

The ladder of transformations recommends particular non-linear transformations for particular non-linear relationships.

While the ‘ladder’ does help, normalizing a variable’s distribution & linearizing a bivariate relationship generally involve trial & error.

And not all skewed distributions or non-linear bivariate relationships can be straightened out in a satisfactory way.

There’s always a trade-off, moreover: A nonlinear transformation may indeed normalize a variable or linearize a relationship between variables, but a significant cost may be diminished clarity of interpretation.

Remember: what really matters is the scatterplot, not the univariate frequency distributions.

So don’t make decisions about transformations until you’ve inspected the scatterplot.

There’ll be lots more transforming next semester.

Interaction Effects

One more thing: what if the relationship of y to x varies according to the level of another variable, z?

This is an ‘interaction effect’.

E.g., do not drink alcoholic beverages while taking medication X.

E.g., the effect of an educational intervention varies according to the gender of students.

A regression example: the outcome variable is the log annual household living standard measure per capita, for a stratified random sample of households in several collective agricultural communities (‘ejidos’) in Quintana Roo, Mexico.

The value of the outcome variable increases with higher farm levels of mahogany production & with presence (vs. absence) of a community saw mill.

Is there an interaction effect between mahogany production level & mill presence/absence?

. univar lsm

                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     lsm     199     3.98     1.97     0.01     2.49     3.67     5.49     8.69
-------------------------------------------------------------------------------

. tabstat lsm, s(mean sd med iqr min max) by(mvc) f(%9.2f)

Summary for variables: lsm
     by categories of: mvc (Mahogany volume categories)

     mvc |      mean        sd       p50       iqr       min       max
---------+------------------------------------------------------------
       0 |      3.61      1.74      3.50      1.78      1.16      8.27
       1 |      3.33      1.77      3.16      2.36      0.01      8.15
       2 |      4.09      1.70      3.70      2.14      1.33      7.88
       3 |      2.83      0.93      2.64      1.27      1.53      5.47
       4 |      6.59      1.26      6.75      1.93      4.10      8.69
---------+------------------------------------------------------------
   Total |      3.98      1.97      3.67      3.01      0.01      8.69

. twoway scatter lsm mvc || lfit lsm mvc, by(mill)

[Figure: scatterplots of living standard measure against mahogany volume categories with fitted values, in two panels by mill (No mill, Mill)]

. xi3: reg lsm hway i.mill*mvc nmaya resworks

i.mill            _Imill_0-1          (naturally coded; _Imill_0 omitted)
(1 missing value generated)

         lsm |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
        hway |  .6115082  .3280433    1.86   0.064   -.0355233     1.25854
    _Imill_1 | -2.725266  .6579602   -4.14   0.000   -4.023024   -1.427508
         mvc |  -.019945  .1623813   -0.12   0.902   -.3402253    .3003354
    _Imi1Xmv |  1.236629  .2460735    5.03   0.000    .7512742    1.721983
       nmaya |  .9817913  .4365363    2.25   0.026    .1207687    1.842814
    resworks |  .6022761  .2480852    2.43   0.016    .1129537    1.091598
       _cons |  2.369156  .4068352    5.82   0.000    1.566716    3.171596

Interaction: coef=1.24, p=.000

postgr3 mvc, by(mill) table: graph of saw mill X mahogany volume categories (see next slide)

[Figure: predicted values (yhat_) of lsm by mahogany volume categories, plotted separately for mill == No mill & mill == Mill; x-axis: Mahogany volume categories]

. postgr3 mvc, by(mill) table [predicted average values of lsm by millXmvc]

Variables left asis: mvc _Imill_1 _Imi1Xmv
Holding hway constant at .79396985
Holding nmaya constant at .46231156
Holding resworks constant at .71356784

------------------------------
Mahogany  |
volume    |
categorie |       Mill
s         |  No mill      Mill
----------+-------------------
        0 | 3.738333
        1 | 3.718388
        2 |           3.446435
        3 | 3.678499
        4 |           5.879803
------------------------------
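The predicted values in this table can be reproduced by hand from the xi3 regression coefficients, holding hway, nmaya & resworks at the constants listed. Here is a minimal Python sketch of that arithmetic (not Stata, & not how postgr3 is implemented); the dictionary keys & the helper `predict_lsm` are hypothetical names chosen for illustration.

```python
# Coefficients copied from the xi3 regression output
# (_Imill_1 is written as "mill", _Imi1Xmv as "millXmvc").
coef = {
    "_cons": 2.369156,
    "hway": 0.6115082,
    "mill": -2.725266,
    "mvc": -0.019945,
    "millXmvc": 1.236629,
    "nmaya": 0.9817913,
    "resworks": 0.6022761,
}

# Values at which postgr3 reports holding the other variables constant.
held = {"hway": 0.79396985, "nmaya": 0.46231156, "resworks": 0.71356784}


def predict_lsm(mill, mvc):
    """Predicted lsm for a given mill status (0/1) and mahogany category."""
    base = coef["_cons"] + sum(coef[name] * value for name, value in held.items())
    return (base
            + coef["mill"] * mill
            + coef["mvc"] * mvc
            + coef["millXmvc"] * mill * mvc)


for mvc in range(5):
    print(f"mvc={mvc}  no mill: {predict_lsm(0, mvc):.3f}   mill: {predict_lsm(1, mvc):.3f}")
```

The five values in the postgr3 table match the no-mill predictions at mvc = 0, 1 & 3 & the mill predictions at mvc = 2 & 4. Without a mill the predicted lsm is nearly flat across mahogany categories; with a mill it rises by about 1.2 per category.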

. findit xi3 [download]

. help xi3

. findit postgr3 [download]

. help postgr3

. xi3: reg y x i.xcat*z

. postgr3 z, by(xcat)

[Figure: fitted values of lsm by mahogany volume categories, one line for No Mill & one for Mill; x-axis: Mahogany volume categories, y-axis: Ave. Living Standard Measure]

. twoway scatter yhat mvc if mill==0 || mspline yhat mvc if mill==0, clpatt(solid) || scatter yhat mvc if mill==1, ms(oh) || mspline yhat mvc if mill==1, clpatt(solid) || , ytitle("Ave. Living Standard Measure")

Another way to graph the interaction

As an alternative to ‘postgr3, table’, or to complement interaction graphs in general, use lincom to explore predicted outcomes.

See Paul Allison, Multiple Regression: A Primer.

See the next slide…

. reg lsm hway mvc mill mvcXmill nmaya resworks

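. lincom mvc*1 + mvcXmill*1

 ( 1)  mvc + mvcXmill = 0

         lsm |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
         (1) |  1.216684   .185311    6.57   0.000     .851177    1.582191

. lincom mill*1 + mvcXmill*2

 ( 1)  mill + 2 mvcXmill = 0

         lsm |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
         (1) | -.2520085  .5373451   -0.47   0.640   -1.311866    .8078492

. lincom mill*1 + mvcXmill*3

 ( 1)  mill + 3 mvcXmill = 0

         lsm |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
         (1) |  .9846203  .6311184    1.56   0.120   -.2601954    2.229436

. lincom mill*1 + mvcXmill*4

 ( 1)  mill + 4 mvcXmill = 0

         lsm |     Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+-------------------------------------------------------------
         (1) |  2.221249   .793086    2.80   0.006     .656969    3.785529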
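The lincom estimates can be read as conditional effects in a model with an interaction term. A sketch of the algebra, using the coefficients from the regression output (mill = -2.725266, mvc = -.019945, interaction = 1.236629); the b symbols are shorthand introduced here:

```latex
\[
\widehat{lsm} = b_0 + b_{mvc}\,mvc + b_{mill}\,mill + b_{int}\,(mvc \times mill) + \dots
\]
\[
\text{slope of } mvc \text{ when } mill = 1:\quad
b_{mvc} + b_{int} = -0.019945 + 1.236629 = 1.216684
\]
\[
\text{effect of } mill \text{ at a given } mvc:\quad b_{mill} + b_{int}\,mvc
\;\;\Rightarrow\;\;
\begin{cases}
mvc = 2: & -2.725266 + 2(1.236629) \approx -0.252 \\
mvc = 3: & -2.725266 + 3(1.236629) \approx 0.985 \\
mvc = 4: & -2.725266 + 4(1.236629) \approx 2.221
\end{cases}
\]
```

So the estimated mill effect is not one number: it is near zero at mvc = 2 & positive & statistically significant only at mvc = 4.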

You’ll spend lots of time analyzing interaction effects next semester.

On making correlation & regression tables, see the class document ‘Making working & publication-style tables in Stata’.

. findit esttab [download]

. findit eststo [download]

. reg lsm hway mill mvc nmaya resworks

. eststo

. reg lsm hway mill mvc millXmvc nmaya resworks

. eststo

. esttab, se starlevels(+ .10 * .05 ** .01 *** .001) r2 nodepvars nomtitles compress

--------------------------------------
                 (1)          (2)
--------------------------------------
hway           0.946**      0.612+
             (0.341)      (0.328)

mill          -0.753       -2.725***
             (0.560)      (0.658)

mvc            0.517***   -0.0199
             (0.130)      (0.162)

nmaya          1.517***     0.982*
             (0.449)      (0.437)

resworks       0.636*       0.602*
             (0.263)      (0.248)

millXmvc                    1.237***
                          (0.246)

_cons          1.477***     2.369***
             (0.388)      (0.407)
--------------------------------------
N                199          199
R-sq           0.315        0.395
--------------------------------------
Standard errors in parentheses
+ p<.10, * p<.05, ** p<.01, *** p<.001
