Diagnostics
Checking Assumptions and Bad Data
Questions
• What is the linearity assumption? How can you tell if it seems met?
• What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem?
• What is an outlier?
• What is leverage?
• What is a residual?
• How can you use residuals in assuring that the regression model is a good representation of the data?
• Why consider a standardized residual?
• What is a studentized residual?
Linear Model
• Linear relation between X and Y
• Normal distribution of errors of prediction
• Homoscedasticity (homogeneity of error in Y across levels of X)
Good-Looking Graph
[Figure: scatterplot of Y against X with the regression line]
No apparent departures from line.
Same Data, Different Graph
[Figure: residuals plotted against X]
No systematic relations between X and residuals.
Problem with Linearity
[Figure: scatterplot of Miles per Gallon against Horsepower with a linear fit]
R Sq Linear = 0.595
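A minimal sketch of how a linearity problem shows up in the residuals. The data are hypothetical (a curved relation, Y = 2500/X, chosen only to resemble the shape of a plot like this one; they are not the horsepower data). When a straight line is fit to a curved relation, the residuals tend to be positive at both ends of X and negative in the middle, rather than scattering around zero.

```python
# Hypothetical curved data (an assumption for illustration only).
X = [50, 75, 100, 125, 150, 175, 200, 225, 250]
Y = [2500 / x for x in X]

# Least-squares line for one predictor.
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x

# Residual = observed Y minus the Y predicted by the straight line.
resid = [y - (a + b * x) for x, y in zip(X, Y)]

# Long runs of same-signed residuals across X signal a systematic departure.
signs = "".join("+" if e > 0 else "-" for e in resid)
print(signs)
```

With real data the pattern is noisier, so a plot of residuals against X is usually easier to read than a string of signs.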
Problem with Heteroscedasticity
[Figure: scatterplot of Y against X]
Common problem when Y is a dollar amount ($).
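A rough sketch of an informal check for heteroscedasticity, assuming hypothetical data in which the error grows in proportion to X. It compares the typical size of the residuals in the lower and upper halves of X. This is an eyeball-style comparison, not a formal test.

```python
# Hypothetical data (an assumption): the error grows in proportion to X.
X = list(range(1, 13))
Y = [2 + 0.5 * x + 0.2 * x * (1 if x % 2 == 0 else -1) for x in X]

# Least-squares line for one predictor.
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x
resid = [y - (a + b * x) for x, y in zip(X, Y)]

# X is already sorted, so the first half of the list holds the low-X cases.
half = n // 2
spread_low = sum(abs(e) for e in resid[:half]) / half
spread_high = sum(abs(e) for e in resid[half:]) / (n - half)

print(f"mean |residual|, low X:  {spread_low:.2f}")
print(f"mean |residual|, high X: {spread_high:.2f}")
```

A clearly larger spread at one end of X than at the other is the fan shape that the scatterplot shows.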
Outliers
[Figure: scatterplot of Y against X with one point labeled as an outlier]
Outlier = pathological point
Review
• What is the linearity assumption? How can you tell if it seems met?
• What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem?
• What is an outlier?
Residuals
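• Zresid (standardized residual): $z_{resid} = \frac{Y - \hat{Y}}{SD_{resid}} = \frac{e}{SD_{resid}} = \frac{e}{S_{Y.X}}$, where $\hat{Y}$ is the predicted value
• Look for large values (some say |z|>2)
• Studentized residual (Student Residual): $\frac{e_i}{S_{e_i}}$, where $S_{e_i} = S_{Y.X}\sqrt{1 - \frac{1}{N} - \frac{(X_i - \bar{X})^2}{SS_x}}$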
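The studentized residual takes into account how far the point's X value lies from the mean of X. The farther X is from the mean, the smaller the standard error of the residual, and so the larger the studentized residual for a given raw residual. Look for large values. See also the studentized deleted residual (RStudent), which computes the standard error with the point itself left out of the regression.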
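The two kinds of residual can be sketched as follows, using the seven (X, Y) pairs from the Example slide. The variable names are illustrative. The standardized residual divides every residual by the same standard error of estimate; the studentized residual divides each residual by its own standard error, which shrinks as X moves away from its mean.

```python
import math

# Data from the Example slide.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x

resid = [y - (a + b * x) for x, y in zip(X, Y)]

# Standard error of estimate S_{Y.X}, with N - 2 degrees of freedom.
s_yx = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Standardized residual: one common denominator for all cases.
z_resid = [e / s_yx for e in resid]

# Studentized residual: each case has its own standard error.
student = [
    e / (s_yx * math.sqrt(1 - 1 / n - (x - mean_x) ** 2 / ss_x))
    for x, e in zip(X, resid)
]

for x, e, z, t in zip(X, resid, z_resid, student):
    print(f"X={x}  resid={e:7.4f}  z={z:6.3f}  student={t:6.3f}")
```

For the last case (X = 8), which lies farthest from the mean of X, the studentized residual is roughly twice the standardized one.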
Influence Analysis
• Leverage: $h_i = \frac{1}{N} + \frac{(X_i - \bar{X})^2}{\sum x^2}$, where $\sum x^2$ is the sum of squared deviations of X from its mean
• Leverage is an index of the importance of an observation to a regression analysis.
– Function of X only
– Large deviations from mean are influential
– Maximum is 1; min is 1/N
– Average value is (k+1)/N, where k is the number of IVs
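A short sketch of the leverage computation for one predictor, using the X values from the Example slide. It also checks the stated properties: each value lies between 1/N and 1, and the average is (k+1)/N with k = 1.

```python
# X values from the Example slide; leverage depends on X only.
X = [2, 3, 3, 4, 4, 5, 8]

n = len(X)
k = 1  # number of independent variables
mean_x = sum(X) / n
ss_x = sum((x - mean_x) ** 2 for x in X)

# h_i = 1/N + (X_i - mean of X)^2 / (sum of squared deviations of X)
leverage = [1 / n + (x - mean_x) ** 2 / ss_x for x in X]

for x, h in zip(X, leverage):
    print(f"X={x}  h={h:.4f}")

print(f"average h = {sum(leverage) / n:.4f}, (k+1)/N = {(k + 1) / n:.4f}")
```

The case with X = 8 has by far the largest leverage because it sits farthest from the mean of X.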
Influence Analysis (2)
• DFBETA and standardized DFBETA
• Change in slope or intercept resulting when you delete the ith person.
• Allow for influence of both X and Y
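DFBETA can be sketched by brute force: refit the line with each case left out and record how much the intercept and slope change. The data are from the Example slide. The standardization used here is an assumption, since the slides do not state it: each change is divided by the residual standard deviation from the deleted-case fit times the square root of the matching diagonal element of (X'X)^-1, a convention used by common statistics packages.

```python
import math

# Data from the Example slide.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]


def fit(x, y):
    """Least-squares line; returns (intercept, slope, sum of squared errors)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    ss_x = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ss_x
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, sse


n = len(X)
mean_x = sum(X) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
a_full, b_full, _ = fit(X, Y)

# Square roots of the diagonal of (X'X)^-1 for intercept and slope (full data).
c_a = math.sqrt(1 / n + mean_x ** 2 / ss_x)
c_b = math.sqrt(1 / ss_x)

dfbeta_a, dfbeta_b, dfbetas_a, dfbetas_b = [], [], [], []
for i in range(n):
    # Refit with case i deleted.
    a_i, b_i, sse_i = fit(X[:i] + X[i + 1:], Y[:i] + Y[i + 1:])
    s_i = math.sqrt(sse_i / (n - 1 - 2))  # residual SD without case i
    dfbeta_a.append(a_full - a_i)
    dfbeta_b.append(b_full - b_i)
    dfbetas_a.append((a_full - a_i) / (s_i * c_a))
    dfbetas_b.append((b_full - b_i) / (s_i * c_b))

for i in range(n):
    print(f"case {i + 1}: DFBETA a={dfbeta_a[i]:8.4f}  b={dfbeta_b[i]:8.4f}  "
          f"standardized a={dfbetas_a[i]:8.4f}  b={dfbetas_b[i]:8.4f}")
```

Refitting N times is fine for small data sets; packages typically use closed-form shortcuts based on leverage instead.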
Example
       X      Y
       2      2
       3      3
       3      1
       4      1
       4      3
       5      2
       8      8
M=  4.14   2.86

r = .82; r² = .67; p < .05.
S_X = 1.95, S_Y = 2.41
b = 1.01, a = -1.34
[Figure: scatterplot of Y against X]
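The summary statistics for this example can be reproduced with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the standard deviations assume N - 1 in the denominator.

```python
import math

# Data from the Example slide.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Sums of squares and cross-products about the means.
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

b = sp_xy / ss_x                      # slope
a = mean_y - b * mean_x               # intercept
r = sp_xy / math.sqrt(ss_x * ss_y)    # correlation

# Sample standard deviations (N - 1 in the denominator).
s_x = math.sqrt(ss_x / (n - 1))
s_y = math.sqrt(ss_y / (n - 1))

print(f"means: X = {mean_x:.2f}, Y = {mean_y:.2f}")
print(f"b = {b:.2f}, a = {a:.2f}, r = {r:.2f}, r^2 = {r * r:.2f}")
print(f"S_X = {s_x:.2f}, S_Y = {s_y:.2f}")
```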
Example (2)
Y    Pred      Resid      Student Residual   Rstudent   DFBETA a   DFBETA b
2     .6875     1.3125     1.072              1.0923      .7577     -.6044
3    1.7        1.3         .962               .9526      .3943     -.2546
1    1.7        -.7        -.518              -.476      -.1970      .1272
1    2.7125    -1.7125    -1.224             -1.3086     -.2524      .0423
3    2.7125      .2875      .206               .1846      .0356     -.006
2    3.725     -1.725     -1.256             -1.3584      .0198     -.2681
8    6.7625     1.2375     1.803              2.7249    -3.5303     4.8407
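The Rstudent column can be sketched without refitting the model, using the leverage of each case to get the residual standard deviation with that case removed. The data are from the Example slide. Flagging values beyond 2 in absolute size borrows the rule of thumb quoted earlier for standardized residuals; applying it to Rstudent is an assumption made here for illustration.

```python
import math

# Data from the Example slide.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x

resid = [y - (a + b * x) for x, y in zip(X, Y)]
sse = sum(e ** 2 for e in resid)

rstudent = []
for x, e in zip(X, resid):
    h = 1 / n + (x - mean_x) ** 2 / ss_x          # leverage of this case
    # Residual SD from the fit without this case (N - 3 degrees of freedom).
    s_deleted = math.sqrt((sse - e ** 2 / (1 - h)) / (n - 3))
    rstudent.append(e / (s_deleted * math.sqrt(1 - h)))

# Cases whose studentized deleted residual exceeds 2 in absolute size.
flagged = [i + 1 for i, t in enumerate(rstudent) if abs(t) > 2]

for i, t in enumerate(rstudent, start=1):
    print(f"case {i}: Rstudent = {t:7.4f}")
print("flagged cases:", flagged)
```

Only the last case (X = 8, Y = 8) is flagged, which agrees with its large DFBETA values in the table.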
Remedies
• Fit Curves if needed.
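• Note heteroscedasticity in applied problems: predictions are less precise where the error variance is larger.
• Investigate all outliers. Whether to delete them depends on what the investigation finds (for example, a data-entry error versus a legitimate extreme case). Report your actions.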
Review
• What is leverage?
• What is a residual?
• How can you use residuals in assuring that the regression model is a good representation of the data?
• Why consider a standardized residual?
• What is a studentized residual?