
Page 1: STATA Red Tutorial

STATA: The Red tutorial

Page 2: STATA Red Tutorial

This tutorial presentation is prepared by

Mohammad Ehsanul [email protected]

STATA: The Red tutorial


Page 4: STATA Red Tutorial

Contents

1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification

Linear Regression Analysis

Page 5: STATA Red Tutorial

1. Introduction to Linear Regression

Page 6: STATA Red Tutorial

Linear Regression

The command regress is used to perform linear regressions. The first variable after the regress command is always the dependent variable (left-hand-side variable), and the list of the independent variables that we chose to include in the estimation model follows (right-hand-side variables).
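In schematic form (a sketch, not from the original slides; depvar, x1, and x2 are hypothetical placeholder names):

* general form: dependent variable first, then the independent variables
regress depvar indepvar1 indepvar2 ...

* for example, with hypothetical variables y, x1, and x2:
regress y x1 x2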

Page 7: STATA Red Tutorial

Linear Regression

. clear
. use hs1, clear
. regress write read female

Page 8: STATA Red Tutorial

Linear Regression

. clear
. use hs1, clear
. regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------


Page 10: STATA Red Tutorial

2. Tests for Normality of Residuals

Page 11: STATA Red Tutorial

Tests for Normality of Residuals

We use the predict command with the resid option to generate residuals, and we name the residuals r.

. predict r, resid
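As an optional, informal first look (not part of the original slides), the new residual variable can be summarized before running the formal tests; the detail option of summarize reports skewness and kurtosis, which should be near 0 and 3 for normally distributed data.

* optional sketch: quick look at the residuals' skewness and kurtosis
summarize r, detail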

Page 12: STATA Red Tutorial

Tests for Normality of Residuals

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

Page 13: STATA Red Tutorial

Tests for Normality of Residuals

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

Page 14: STATA Red Tutorial

Tests for Normality of Residuals

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, which is a very important assumption for regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z    Prob>z
-------------+--------------------------------------------------
           r |    200    0.98714      1.919     1.499    0.06692

Page 15: STATA Red Tutorial

Tests for Normality of Residuals

To verify that the residuals are normally distributed, which is a very important assumption for regression,

the kdensity command with the normal option displays a density graph of the residuals with a normal distribution superimposed on the graph.

Page 16: STATA Red Tutorial

Tests for Normality of Residuals

. kdensity r, normal

Page 17: STATA Red Tutorial

Tests for Normality of Residuals

. kdensity r, normal

Page 18: STATA Red Tutorial

Tests for Normality of Residuals

The pnorm command produces a normal probability plot and it is another method of testing whether the residuals from the regression are normally distributed.

Page 19: STATA Red Tutorial

Tests for Normality of Residuals

. pnorm r

Page 20: STATA Red Tutorial

Tests for Normality of Residuals

. pnorm r

Page 21: STATA Red Tutorial

Tests for Normality of Residuals

The qnorm command produces a normal quantile plot.

It is yet another method for testing if the residuals are normally distributed.

Page 22: STATA Red Tutorial

Tests for Normality of Residuals

. qnorm r

Page 23: STATA Red Tutorial

Tests for Normality of Residuals

. qnorm r

Page 24: STATA Red Tutorial

Summary of Tests for Normality of Residuals

swilk performs the Shapiro-Wilk W test for normality.

kdensity produces a kernel density plot with a normal distribution overlaid.

pnorm graphs a standardized normal probability (P-P) plot.

qnorm plots the quantiles of varname against the quantiles of a normal distribution.

Tests for Normality of Residuals
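Taken together, a minimal sketch of the whole normality check (assuming r holds the residuals created with predict above) is:

* Shapiro-Wilk test
swilk r
* kernel density of r with a normal density overlaid
kdensity r, normal
* standardized normal probability (P-P) plot
pnorm r
* normal quantile (Q-Q) plot
qnorm r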

Page 25: STATA Red Tutorial

3. Tests for Heteroscedasticity

Page 26: STATA Red Tutorial

Tests for Heteroscedasticity

One of the basic assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals.

There are graphical and non-graphical methods for detecting heteroscedasticity.

Page 27: STATA Red Tutorial

Tests for Heteroscedasticity

Cook-Weisberg test for heteroskedasticity

Page 28: STATA Red Tutorial

Tests for Heteroscedasticity

Cook-Weisberg test for heteroskedasticity

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of write

Ho: Constant variance
chi2(1)      =     5.79
Prob > chi2  =   0.0161

Page 29: STATA Red Tutorial

Tests for Heteroscedasticity

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

Page 30: STATA Red Tutorial

Tests for Heteroscedasticity

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Page 31: STATA Red Tutorial

Tests for Heteroscedasticity

We use the rvfplot command with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Page 32: STATA Red Tutorial

Summary of Tests for Heteroscedasticity

hettest performs the Cook and Weisberg test.
rvfplot graphs a residual-versus-fitted plot.

Tests for Heteroscedasticity
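If hettest rejects constant variance, as it does here at the 5% level (p = 0.0161), one common response is to refit the model with heteroskedasticity-robust standard errors. The sketch below is one possible remedy, not something shown in the original slides.

* refit with robust (Huber/White) standard errors; the coefficients are
* unchanged, only the standard errors and test statistics are recomputed
regress write read female, vce(robust)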

Page 33: STATA Red Tutorial

4. Tests for Multicollinearity

Page 34: STATA Red Tutorial

Tests for Multicollinearity

Multicollinearity is a concern for multiple regression, not for its existence, but for its degree.

For a severe degree of multicollinearity, the regression estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

Page 35: STATA Red Tutorial

Tests for Multicollinearity

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

Page 36: STATA Red Tutorial

Tests for Multicollinearity

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

Page 37: STATA Red Tutorial

Tests for Multicollinearity

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

A variable whose VIF value is greater than 10 may merit further investigation. Tolerance = 1/VIF is used to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10.
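As an illustration of what a problematic VIF looks like (a hypothetical sketch, not in the original slides; read2 is an invented variable, and hs1 is assumed to be loaded), adding a predictor that is almost a linear function of read pushes the VIFs well above the values seen here:

* hypothetical example: read2 is nearly collinear with read
gen read2 = read + rnormal(0, 1)
regress write read read2 female
vif
* read and read2 would now show large VIFs and small 1/VIF (tolerance) values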

Page 38: STATA Red Tutorial

Tests for Multicollinearity

Summary of Tests for Multicollinearity

vif calculates the variance inflation factor for the independent variables in the linear model.

Page 39: STATA Red Tutorial

5. Tests for Autocorrelation

Page 40: STATA Red Tutorial

Tests for Autocorrelation

. tsset id
        time variable:  id, 1 to 200

. dwstat

Durbin-Watson d-statistic(  3,   200) =  1.93992
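dwstat is the older name for this statistic; in more recent Stata releases the same Durbin-Watson d statistic is available as a postestimation command (a sketch; it still requires the data to be tsset and a preceding regress):

* modern equivalent of dwstat, run after the regression
tsset id
regress write read female
estat dwatson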

Page 41: STATA Red Tutorial

6. Detecting Unusual and Influential Data

Page 42: STATA Red Tutorial

Detecting Unusual and Influential Data

• Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.

• Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.

• Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Page 43: STATA Red Tutorial

Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure          Value
leverage         > (2k+2)/n
abs(rstu)        > 2
Cook's D         > 4/n
abs(DFITS)       > 2*sqrt(k/n)
abs(DFBETA)      > 2/sqrt(n)

Detecting Unusual and Influential Data
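For the model used in this tutorial these cutoffs can be computed directly (a sketch assuming k = 2 predictors and n = 200 observations, as in the regression above):

* compute the rule-of-thumb cutoffs for this model
scalar k = 2
scalar n = 200
display "leverage cutoff:  " (2*k + 2)/n
display "Cook's D cutoff:  " 4/n
display "DFITS cutoff:     " 2*sqrt(k/n)
display "DFBETA cutoff:    " 2/sqrt(n)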

Page 44: STATA Red Tutorial

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Studentized residuals are a type of standardized residual that can be used to identify outliers.

Detecting Unusual and Influential Data

Page 45: STATA Red Tutorial

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Studentized residuals are a type of standardized residual that can be used to identify outliers.

. predict r, rstudent

Detecting Unusual and Influential Data

Page 46: STATA Red Tutorial

. stem r

Stem-and-leaf plot for r (Studentized residuals)

r rounded to nearest multiple of .01
plot in units of .01

-2** | 50,42
-2** | 26,21
-2** | 18
-1** | 92,85,84,83
-1** | 75,72,69,61,61,60
-1** | 50,48,46,46,42
-1** | 33,32,22,20,20,20
-1** | 17,16,13,12,10,01
-0** | 97,97,96,96,93,93,92,92,90,89,89,89,86,86,84,82,82,80,80
-0** | 74,74,71,70,67
-0** | 59,59,58,53,49,49,47,42,42,40
-0** | 35,35,33,31,31,31,30,28,28,28,28,27,25,23,23,22
-0** | 19,17,16,16,16,16,14,13,13,09,09,07,04,03,03,02
 0** | 00,02,02,04,04,04,04,07,09,11,14,16,16,19
 0** | 21,23,23,24,24,26,28,29,30,33,33,35,35
 0** | 40,44,44,51,51,54,54,54,54,56,56,57,57,57
 0** | 61,63,64,64,64,64,64,66,70,70,71,73,73,73,74,78
 0** | 88,88,89,93,94,94,97,98,99
 1** | 01,06,06,08,08,13,13,13,13,15,19
 1** | 23,29,32,36,36,37,37,39
 1** | 42,43,44,48,51,52,53,55
 1** | 60,68,73,73,75,77
 1** | 80,84
 2** | 16

Detecting Unusual and Influential Data

Page 47: STATA Red Tutorial

. stem r

. sort r

. list r in 1/10

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
  6.   -1.916192
  7.   -1.848524
  8.   -1.843611
  9.   -1.831068
 10.   -1.750652

Detecting Unusual and Influential Data

Page 48: STATA Red Tutorial

. stem r

. sort r

. list r in 1/10

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
  6.   -1.916192
  7.   -1.848524
  8.   -1.843611
  9.   -1.831068
 10.   -1.750652

Detecting Unusual and Influential Data

. list r in -10/l

              r

191.    1.551833
192.    1.602682
193.    1.677923
194.    1.726393
195.    1.730591
196.    1.749522
197.    1.774811
198.    1.798141
199.    1.840841
200.    2.160904

Page 49: STATA Red Tutorial

. stem r

. sort r

. list r in 1/10

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
  6.   -1.916192
  7.   -1.848524
  8.   -1.843611
  9.   -1.831068
 10.   -1.750652

Detecting Unusual and Influential Data

. list r in -10/l

              r

191.    1.551833
192.    1.602682
193.    1.677923
194.    1.726393
195.    1.730591
196.    1.749522
197.    1.774811
198.    1.798141
199.    1.840841
200.    2.160904

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 50: STATA Red Tutorial

Detecting Unusual and Influential Data

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 51: STATA Red Tutorial

. list r if r<-2 | r>2

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
200.    2.160904

Detecting Unusual and Influential Data

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 52: STATA Red Tutorial

. list r if r<-2 | r>2

              r

  1.   -2.503566
  2.   -2.421219
  3.   -2.255832
  4.   -2.210221
  5.   -2.178212
200.    2.160904

Detecting Unusual and Influential Data

. list r if r<-2.5 | r>2.5

              r

  1.   -2.503566

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3.

Page 53: STATA Red Tutorial

To get leverage points, we use the predict command with the leverage option, and we name them lev.

Detecting Unusual and Influential Data

Page 54: STATA Red Tutorial

To get leverage points, we use the predict command with the leverage option, and we name them lev.

. predict lev, leverage

Detecting Unusual and Influential Data

Page 55: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

Detecting Unusual and Influential Data

Page 56: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict d, cooksd

Detecting Unusual and Influential Data

Page 57: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict d, cooksd

Detecting Unusual and Influential Data

. list female read d if d>4/_N

         female   read          d

 13.       male     50   .0234054
 39.       male     47   .0212312
123.     female     57   .0202435
142.       male     76   .0327483

Page 58: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict dfit, dfits

. list dfit if abs(dfit)>2*sqrt(3/51)

Detecting Unusual and Influential Data
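The cutoff 2*sqrt(3/51) in the command above appears to carry over k and n from a different example; for the model used throughout this tutorial, the DFITS rule of thumb would presumably be computed with k = 2 and n = 200, as in this sketch:

* 2*sqrt(2/200) is about 0.2 for this model
list dfit if abs(dfit) > 2*sqrt(2/200)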

Page 59: STATA Red Tutorial

The Cook's D and DFITS measures both combine information on the residual and leverage. Cook's D and DFITS are very similar; they are scaled differently, but they give similar answers.

. predict dfit, dfits

. list dfit if abs(dfit)>2*sqrt(3/51)

Detecting Unusual and Influential Data

The above measures are general measures of influence.

Page 60: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors.

Detecting Unusual and Influential Data

Page 61: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. This is more computationally intensive than summary statistics such as Cook's D.

Detecting Unusual and Influential Data

Page 62: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.

Detecting Unusual and Influential Data

Page 63: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.

. dfbeta

DFread: DFbeta(read)

DFfemale: DFbeta(female)

Detecting Unusual and Influential Data

Page 64: STATA Red Tutorial

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors.

. list DFread DFfemale in 1/5

          DFread    DFfemale

  1.    .0492348    .1971976
  2.   -.0887463   -.1617497
  3.    .0915453    .1802994
  4.    .0434659    .1740918
  5.    .0717626   -.1374498

Detecting Unusual and Influential Data
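Applying the DFBETA rule of thumb from the earlier table (a sketch assuming the DFread and DFfemale variables created by dfbeta above, with n = 200, so the cutoff 2/sqrt(200) is about 0.14):

* flag observations whose DFBETAs exceed the 2/sqrt(n) rule of thumb
list DFread DFfemale if abs(DFread) > 2/sqrt(200) | abs(DFfemale) > 2/sqrt(200)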

Page 65: STATA Red Tutorial

There are also several graphs that can be used to search for unusual and influential observations.

The avplot command graphs an added-variable plot.

Detecting Unusual and Influential Data

Page 66: STATA Red Tutorial

The avplot command works not only for the variables in the model, but also for variables that are not in the model, which is why it is called an added-variable plot.

We can do an avplot on the variable grade.

Detecting Unusual and Influential Data

Page 67: STATA Red Tutorial

Detecting Unusual and Influential Data

. avplot grade

Page 68: STATA Red Tutorial

Detecting Unusual and Influential Data

. avplot grade

Added-Variable plot

Page 69: STATA Red Tutorial

rvpplot is another convenience command which produces a plot of the residuals versus a specified predictor, and it is also used after regress or anova.

Detecting Unusual and Influential Data

Page 70: STATA Red Tutorial

Detecting Unusual and Influential Data

. rvpplot read

Page 71: STATA Red Tutorial

Detecting Unusual and Influential Data

. rvpplot read

Page 72: STATA Red Tutorial

lvr2plot stands for leverage-versus-residual-squared plot.

Detecting Unusual and Influential Data

Page 73: STATA Red Tutorial

Detecting Unusual and Influential Data

. lvr2plot

Page 74: STATA Red Tutorial

Detecting Unusual and Influential Data

. lvr2plot

Page 75: STATA Red Tutorial

Detecting Unusual and Influential Data

Summary of Detecting Unusual and Influential Data

predict creates predicted values, residuals, and measures of influence.

dfbeta calculates DFBETAs for all the independent variables.
avplot graphs an added-variable plot.
lvr2plot graphs a leverage-versus-squared-residual plot.
rvpplot graphs a residual-versus-predictor plot.
rvfplot graphs a residual-versus-fitted plot.

Page 76: STATA Red Tutorial

7. Tests for Model Specification

Page 77: STATA Red Tutorial

Tests for Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model.

Page 78: STATA Red Tutorial

Tests for Model Specification

There are several methods to detect specification errors.

The linktest command performs a model specification link test for single-equation models.

Page 79: STATA Red Tutorial

Tests for Model Specification

. linktest

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   79.86
       Model |  8005.11739     2  4002.55869           Prob > F      =  0.0000
    Residual |  9873.75761   197   50.120597           R-squared     =  0.4477
-------------+------------------------------           Adj R-squared =  0.4421
       Total |   17878.875   199   89.843593           Root MSE      =  7.0796

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   2.807497   1.052071     2.67   0.008     .7327302    4.882264
      _hatsq |  -.0170281   .0098827    -1.72   0.086    -.0365176    .0024615
       _cons |  -47.29516   27.77544    -1.70   0.090    -102.0705    7.480201
------------------------------------------------------------------------------

Page 80: STATA Red Tutorial

Tests for Model Specification

The ovtest command performs a regression specification error test (RESET) for omitted variables.

Page 81: STATA Red Tutorial

Tests for Model Specification

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Page 82: STATA Red Tutorial

Tests for Model Specification

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Ramsey RESET test using powers of the fitted values of write

Ho: model has no omitted variables
            F(3, 194) =      1.95
            Prob > F  =    0.1233

Page 83: STATA Red Tutorial

Tests for Model Specification

Summary of Tests for Model Specification

linktest performs a link test for model specification.
ovtest performs a regression specification error test (RESET) for omitted variables.
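Pulling the commands from the tutorial together, one possible end-to-end diagnostic pass might look like the sketch below (assuming the hs1 dataset and the write/read/female model used throughout; the ordering and the intermediate drop of r are choices made here, not part of the original slides):

* load the data and fit the model
use hs1, clear
regress write read female

* normality of residuals
predict r, resid
swilk r
kdensity r, normal
pnorm r
qnorm r

* heteroscedasticity
hettest
rvfplot, yline(0)

* multicollinearity
vif

* autocorrelation (id is treated as the time variable, as in the slides)
tsset id
dwstat

* unusual and influential data
drop r
predict r, rstudent
predict lev, leverage
predict d, cooksd
predict dfit, dfits
dfbeta

* model specification
ovtest
linktest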

Page 84: STATA Red Tutorial

STATA: The Red tutorial