
Page 1: STATA Red Tutorial

STATA: The Red tutorial

Page 2: STATA Red Tutorial

This tutorial presentation is prepared by

Mohammad Ehsanul [email protected]

STATA: The Red tutorial


Page 4: STATA Red Tutorial

Contents

1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification

Linear Regression Analysis

Page 5: STATA Red Tutorial

1. Introduction to Linear Regression

Page 6: STATA Red Tutorial

Linear Regression

The command regress is used to perform linear regressions. The first variable after the regress command is always the dependent variable (left-hand-side variable), and the list of the independent variables that we chose to include in the estimation model follows (right-hand-side variables).
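In schematic form (a sketch, not from the original slides; depvar, x1, and x2 are hypothetical placeholder names):

* general form: dependent variable first, then the independent variables
regress depvar indepvar1 indepvar2 ...

* for example, with hypothetical variables y, x1, and x2:
regress y x1 x2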

Page 7: STATA Red Tutorial

Linear Regression

. clear
. use hs1, clear
. regress write read female

Page 8: STATA Red Tutorial

Linear Regression

. clear
. use hs1, clear
. regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------


Page 10: STATA Red Tutorial

2. Tests for Normality of Residuals

Page 11: STATA Red Tutorial

Tests for Normality of Residuals

We use the predict command with the resid option to generate residuals, and we name the residuals r.

. predict r, resid
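As an optional, informal first look (not part of the original slides), the new residual variable can be summarized before running the formal tests; the detail option of summarize reports skewness and kurtosis, which should be near 0 and 3 for normally distributed data.

* optional sketch: quick look at the residuals' skewness and kurtosis
summarize r, detail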

Page 12: STATA Red Tutorial

Tests for Normality of Residuals

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

Page 13: STATA Red Tutorial

Tests for Normality of Residuals

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

Page 14: STATA Red Tutorial

Tests for Normality of Residuals

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z    Prob>z
-------------+--------------------------------------------------
           r |    200    0.98714      1.919     1.499    0.06692

Page 15: STATA Red Tutorial

Tests for Normality of Residuals

To verify that the residuals are normally distributed, which is a very important assumption for regression,

the kdensity command with the normal option displays a density graph of the residuals with a normal distribution superimposed on the graph.

Page 16: STATA Red Tutorial

Tests for Normality of Residuals

. kdensity r, normal

Page 17: STATA Red Tutorial

Tests for Normality of Residuals

. kdensity r, normal

Page 18: STATA Red Tutorial

Tests for Normality of Residuals

The pnorm command produces a normal probability plot and it is another method of testing whether the residuals from the regression are normally distributed.

Page 19: STATA Red Tutorial

Tests for Normality of Residuals

. pnorm r

Page 20: STATA Red Tutorial

Tests for Normality of Residuals

. pnorm r

Page 21: STATA Red Tutorial

Tests for Normality of Residuals

The qnorm command produces a normal quantile plot.

It is yet another method for testing if the residuals are normally distributed.

Page 22: STATA Red Tutorial

Tests for Normality of Residuals

. qnorm r

Page 23: STATA Red Tutorial

Tests for Normality of Residuals

. qnorm r

Page 24: STATA Red Tutorial

Summary of Tests for Normality of Residuals

swilk performs the Shapiro-Wilk W test for normality.

kdensity produces a kernel density plot with a normal distribution overlaid.

pnorm graphs a standardized normal probability (P-P) plot.

qnorm plots the quantiles of varname against the quantiles of a normal distribution.

Tests for Normality of Residuals
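Taken together, a minimal sketch of the whole normality check (assuming r holds the residuals created with predict above) is:

* Shapiro-Wilk test
swilk r
* kernel density of r with a normal density overlaid
kdensity r, normal
* standardized normal probability (P-P) plot
pnorm r
* normal quantile (Q-Q) plot
qnorm r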

Page 25: STATA Red Tutorial

3. Tests for Heteroscedasticity

Page 26: STATA Red Tutorial

Tests for Heteroscedasticity

One of the basic assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals.

There are graphical and non-graphical methods for detecting heteroscedasticity.

Page 27: STATA Red Tutorial

Tests for Heteroscedasticity

Cook-Weisberg test for heteroskedasticity

Page 28: STATA Red Tutorial

Tests for Heteroscedasticity

Cook-Weisberg test for heteroskedasticity

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of write

Ho: Constant variance
chi2(1)      =     5.79
Prob > chi2  =   0.0161

Page 29: STATA Red Tutorial

Tests for Heteroscedasticity

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

Page 30: STATA Red Tutorial

Tests for Heteroscedasticity

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Page 31: STATA Red Tutorial

Tests for Heteroscedasticity

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Page 32: STATA Red Tutorial

Summary of Tests for Heteroscedasticity

hettest performs the Cook and Weisberg test.
rvfplot graphs a residual-versus-fitted plot.

Tests for Heteroscedasticity
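If hettest rejects constant variance, as it does here at the 5% level (p = 0.0161), one common response is to refit the model with heteroskedasticity-robust standard errors. The sketch below is one possible remedy, not something shown in the original slides.

* refit with robust (Huber/White) standard errors; the coefficients are
* unchanged, only the standard errors and test statistics are recomputed
regress write read female, vce(robust)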

Page 33: STATA Red Tutorial

4. Tests for Multicollinearity

Page 34: STATA Red Tutorial

Tests for Multicollinearity

Multicollinearity is a concern for multiple regression, not for its existence, but for its degree.

For a severe degree of multicollinearity, the regression estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

Page 35: STATA Red Tutorial

Tests for Multicollinearity

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

Page 36: STATA Red Tutorial

Tests for Multicollinearity

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

Page 37: STATA Red Tutorial

Tests for Multicollinearity

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

A variable whose VIF value is greater than 10 may merit further investigation. Tolerance = 1/VIF is used to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10.
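As an illustration of what a problematic VIF looks like (a hypothetical sketch, not in the original slides; read2 is an invented variable, and hs1 is assumed to be loaded), adding a predictor that is almost a linear function of read pushes the VIFs well above the values seen here:

* hypothetical example: read2 is nearly collinear with read
gen read2 = read + rnormal(0, 1)
regress write read read2 female
vif
* read and read2 would now show large VIFs and small 1/VIF (tolerance) values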

Page 38: STATA Red Tutorial

Tests for Multicollinearity

Summary of Tests for Multicollinearity

vif calculates the variance inflation factor for the independent variables in the linear model.

Page 39: STATA Red Tutorial

5. Tests for Autocorrelation

Page 40: STATA Red Tutorial

Tests for Autocorrelation

. tsset id
        time variable:  id, 1 to 200

. dwstat

Durbin-Watson d-statistic(  3,   200) =  1.93992
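dwstat is the older name for this statistic; in more recent Stata releases the same Durbin-Watson d statistic is available as a postestimation command (a sketch; it still requires the data to be tsset and a preceding regress):

* modern equivalent of dwstat, run after the regression
tsset id
regress write read female
estat dwatson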

Page 41: STATA Red Tutorial

6. Detecting Unusual and Influential Data

Page 42: STATA Red Tutorial

Detecting Unusual and Influential Data

• Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.

• Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.

• Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Page 43: STATA Red Tutorial

Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure          Value
leverage         > (2k+2)/n
abs(rstu)        > 2
Cook's D         > 4/n
abs(DFITS)       > 2*sqrt(k/n)
abs(DFBETA)      > 2/sqrt(n)

Detecting Unusual and Influential Data
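For the model used in this tutorial these cutoffs can be computed directly (a sketch assuming k = 2 predictors and n = 200 observations, as in the regression above):

* compute the rule-of-thumb cutoffs for this model
scalar k = 2
scalar n = 200
display "leverage cutoff:  " (2*k + 2)/n
display "Cook's D cutoff:  " 4/n
display "DFITS cutoff:     " 2*sqrt(k/n)
display "DFBETA cutoff:    " 2/sqrt(n)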

Page 44: STATA Red Tutorial

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Studentized residuals are a type of standardized residual that can be used to identify outliers.

Detecting Unusual and Influential Data

Page 45: STATA Red Tutorial

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Studentized residuals are a type of standardized residual that can be used to identify outliers.

. predict r, rstudent

Detecting Unusual and Influential Data

Page 46: STATA Red Tutorial

. stem r

Stem-and-leaf plot for r (Studentized residuals)

r rounded to nearest multiple of .01
plot in units of .01

-2** | 50,42
-2** | 26,21
-2** | 18
-1** | 92,85,84,83
-1** | 75,72,69,61,61,60
-1** | 50,48,46,46,42
-1** | 33,32,22,20,20,20
-1** | 17,16,13,12,10,01
-0** | 97,97,96,96,93,93,92,92,90,89,89,89,86,86,84,82,82,80,80
-0** | 74,74,71,70,67
-0** | 59,59,58,53,49,49,47,42,42,40
-0** | 35,35,33,31,31,31,30,28,28,28,28,27,25,23,23,22
-0** | 19,17,16,16,16,16,14,13,13,09,09,07,04,03,03,02
 0** | 00,02,02,04,04,04,04,07,09,11,14,16,16,19
 0** | 21,23,23,24,24,26,28,29,30,33,33,35,35
 0** | 40,44,44,51,51,54,54,54,54,56,56,57,57,57
 0** | 61,63,64,64,64,64,64,66,70,70,71,73,73,73,74,78
 0** | 88,88,89,93,94,94,97,98,99
 1** | 01,06,06,08,08,13,13,13,13,15,19
 1** | 23,29,32,36,36,37,37,39
 1** | 42,43,44,48,51,52,53,55
 1** | 60,68,73,73,75,77
 1** | 80,84
 2** | 16

Detecting Unusual and Influential Data

Page 47: STATA Red Tutorial

. stem r

. sort r

. list r in 1/10

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
  6.   -1.916192
  7.   -1.848524
  8.   -1.843611
  9.   -1.831068
 10.   -1.750652

Detecting Unusual and Influential Data

Page 48: STATA Red Tutorial

. stem r

. sort r

. list r in 1/10

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
  6.   -1.916192
  7.   -1.848524
  8.   -1.843611
  9.   -1.831068
 10.   -1.750652

Detecting Unusual and Influential Data

. list r in -10/l

              r

191.    1.551833
192.    1.602682
193.    1.677923
194.    1.726393
195.    1.730591
196.    1.749522
197.    1.774811
198.    1.798141
199.    1.840841
200.    2.160904

Page 49: STATA Red Tutorial

. stem r

. sort r

. list r in 1/10

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
  6.   -1.916192
  7.   -1.848524
  8.   -1.843611
  9.   -1.831068
 10.   -1.750652

Detecting Unusual and Influential Data

. list r in -10/l

              r

191.    1.551833
192.    1.602682
193.    1.677923
194.    1.726393
195.    1.730591
196.    1.749522
197.    1.774811
198.    1.798141
199.    1.840841
200.    2.160904

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 50: STATA Red Tutorial

Detecting Unusual and Influential Data

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 51: STATA Red Tutorial

. list r if r<-2 | r>2

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
200.    2.160904

Detecting Unusual and Influential Data

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 52: STATA Red Tutorial

. list r if r<-2 | r>2

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
200.    2.160904

Detecting Unusual and Influential Data

. list r if r<-2.5 | r>2.5

              r

  1.   -2.503566

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 53: STATA Red Tutorial

To get leverage points, we use the predict command with the leverage option, and we name them lev.

Detecting Unusual and Influential Data

Page 54: STATA Red Tutorial

To get leverage points, we use the predict command with the leverage option, and we name them lev.

. predict lev, leverage

Detecting Unusual and Influential Data

Page 55: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

Detecting Unusual and Influential Data

Page 56: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict d, cooksd

Detecting Unusual and Influential Data

Page 57: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict d, cooksd

Detecting Unusual and Influential Data

. list female read d if d>4/_N

         female   read          d

 13.       male     50   .0234054
 39.       male     47   .0212312
123.     female     57   .0202435
142.       male     76   .0327483

Page 58: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict dfit, dfits

. list dfit if abs(dfit)>2*sqrt(3/51)

Detecting Unusual and Influential Data
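The cutoff 2*sqrt(3/51) in the command above appears to carry over k and n from a different example; for the model used throughout this tutorial, the DFITS rule of thumb would presumably be computed with k = 2 and n = 200, as in this sketch:

* 2*sqrt(2/200) is about 0.2 for this model
list dfit if abs(dfit) > 2*sqrt(2/200)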

Page 59: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict dfit, dfits

. list dfit if abs(dfit)>2*sqrt(3/51)

Detecting Unusual and Influential Data

The above measures are general measures of influence.

Page 60: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors.

Detecting Unusual and Influential Data

Page 61: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. This is more computationally intensive than summary statistics such as Cook's D.

Detecting Unusual and Influential Data

Page 62: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.

Detecting Unusual and Influential Data

Page 63: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.

. dfbeta

DFread: DFbeta(read)

DFfemale: DFbeta(female)

Detecting Unusual and Influential Data

Page 64: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors.

. list DFread DFfemale in 1/5

          DFread    DFfemale

  1.    .0492348    .1971976
  2.   -.0887463   -.1617497
  3.    .0915453    .1802994
  4.    .0434659    .1740918
  5.    .0717626   -.1374498

Detecting Unusual and Influential Data
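Applying the DFBETA rule of thumb from the earlier table (a sketch assuming the DFread and DFfemale variables created by dfbeta above, with n = 200, so the cutoff 2/sqrt(200) is about 0.14):

* flag observations whose DFBETAs exceed the 2/sqrt(n) rule of thumb
list DFread DFfemale if abs(DFread) > 2/sqrt(200) | abs(DFfemale) > 2/sqrt(200)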

Page 65: STATA Red Tutorial

There are also several graphs that can be used to search for unusual and influential observations.

The avplot command graphs an added-variable plot.

Detecting Unusual and Influential Data

Page 66: STATA Red Tutorial

The avplot command works not only for the variables in the model, but also for variables that are not in the model, which is why it is called an added-variable plot.

We can do an avplot on the variable grade.

Detecting Unusual and Influential Data

Page 67: STATA Red Tutorial

Detecting Unusual and Influential Data

. avplot grade

Page 68: STATA Red Tutorial

Detecting Unusual and Influential Data

. avplot grade

Added-Variable plot

Page 69: STATA Red Tutorial

rvpplot is another convenience command which produces a plot of the residuals versus a specified predictor, and it is also used after regress or anova.

Detecting Unusual and Influential Data

Page 70: STATA Red Tutorial

Detecting Unusual and Influential Data

. rvpplot read

Page 71: STATA Red Tutorial

Detecting Unusual and Influential Data

. rvpplot read

Page 72: STATA Red Tutorial

lvr2plot stands for leverage-versus-residual-squared plot.

Detecting Unusual and Influential Data

Page 73: STATA Red Tutorial

Detecting Unusual and Influential Data

. lvr2plot

Page 74: STATA Red Tutorial

Detecting Unusual and Influential Data

. lvr2plot

Page 75: STATA Red Tutorial

Detecting Unusual and Influential Data

Summary of Detecting Unusual and Influential Data

predict creates predicted values, residuals, and measures of influence.

dfbeta calculates DFBETAs for all the independent variables.
avplot graphs an added-variable plot.
lvr2plot graphs a leverage-versus-squared-residual plot.
rvpplot graphs a residual-versus-predictor plot.
rvfplot graphs a residual-versus-fitted plot.

Page 76: STATA Red Tutorial

7. Tests for Model Specification

Page 77: STATA Red Tutorial

Tests for Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model.

Page 78: STATA Red Tutorial

Tests for Model Specification

There are several methods to detect specification errors.

The linktest command performs a model specification link test for single-equation models.

Page 79: STATA Red Tutorial

Tests for Model Specification

. linktest

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   79.86
       Model |  8005.11739     2  4002.55869           Prob > F      =  0.0000
    Residual |  9873.75761   197   50.120597           R-squared     =  0.4477
-------------+------------------------------           Adj R-squared =  0.4421
       Total |   17878.875   199   89.843593           Root MSE      =  7.0796

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   2.807497   1.052071     2.67   0.008     .7327302    4.882264
      _hatsq |  -.0170281   .0098827    -1.72   0.086    -.0365176    .0024615
       _cons |  -47.29516   27.77544    -1.70   0.090    -102.0705    7.480201
------------------------------------------------------------------------------

Page 80: STATA Red Tutorial

Tests for Model Specification

The ovtest command performs a regression specification error test (RESET) for omitted variables.

Page 81: STATA Red Tutorial

Tests for Model Specification

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Page 82: STATA Red Tutorial

Tests for Model Specification

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Ramsey RESET test using powers of the fitted values of write

Ho: model has no omitted variables
            F(3, 194) =      1.95
            Prob > F  =    0.1233

Page 83: STATA Red Tutorial

Tests for Model Specification

Summary of Tests for Model Specification

linktest performs a link test for model specification.
ovtest performs a regression specification error test (RESET) for omitted variables.
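Pulling the commands from the tutorial together, one possible end-to-end diagnostic pass might look like the sketch below (assuming the hs1 dataset and the write/read/female model used throughout; the ordering and the intermediate drop of r are choices made here, not part of the original slides):

* load the data and fit the model
use hs1, clear
regress write read female

* normality of residuals
predict r, resid
swilk r
kdensity r, normal
pnorm r
qnorm r

* heteroscedasticity
hettest
rvfplot, yline(0)

* multicollinearity
vif

* autocorrelation (id is treated as the time variable, as in the slides)
tsset id
dwstat

* unusual and influential data
drop r
predict r, rstudent
predict lev, leverage
predict d, cooksd
predict dfit, dfits
dfbeta

* model specification
ovtest
linktest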

Page 84: STATA Red Tutorial

STATA: The Red tutorial