Model Checking: Using residuals to check the validity of the linear regression model assumptions
Model Checking
Using residuals to check the validity of the linear regression model assumptions
The simple linear regression model
• The mean of the responses, $E(Y_i)$, is a linear function of the $x_i$.
• The errors, $\varepsilon_i$, and hence the responses $Y_i$, are independent.
• The errors, $\varepsilon_i$, and hence the responses $Y_i$, are normally distributed.
• The errors, $\varepsilon_i$, and hence the responses $Y_i$, have equal variances ($\sigma^2$) for all x values.
The simple linear regression model
Assume (!!) the response is a linear function of trend and error:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
with the independent error terms $\varepsilon_i$ following a normal distribution with mean 0 and equal variance $\sigma^2$.
Why do we have to check our model?
• All estimates, intervals, and hypothesis tests have been developed assuming that the model is correct.
• If the model is incorrect, then the formulas and methods we use are at risk of being incorrect.
When should we worry most?
• All tests and intervals are very sensitive to:
  – departures from independence.
  – moderate departures from equal variance.
• Tests and intervals for $\beta_0$ and $\beta_1$ are fairly robust against departures from normality.
• Prediction intervals are quite sensitive to departures from normality.
What can go wrong with the model?
• Regression function is not linear.
• Error terms are not independent.
• Error terms are not normal.
• Error terms do not have equal variance.
• The model fits all but one or a few outlier observations.
• An important predictor variable has been left out of the model.
The basic idea of residual analysis
The observed residuals:
$$e_i = y_i - \hat{y}_i$$
should reflect the properties assumed for the unknown true error terms:
$$\varepsilon_i = Y_i - E(Y_i)$$
So, investigate the observed residuals to see if they behave “properly.”
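Distinction between true errors $\varepsilon_i$ and residuals $e_i$
[Figure: scatter plot of college entrance test score versus high school GPA, annotated with the population line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and the estimated line $\hat{y} = b_0 + b_1 x$.]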
The sample mean of the residuals $e_i$ is always 0.
[Figure: regression plot of y versus x. Fitted line: y = 3.73711 + 3.65464 x; S = 1.55768, R-Sq = 94.4%, R-Sq(adj) = 93.6%.]

| x | y | RESIDUAL |
|---|----|----------|
| 1 | 9 | 1.60825 |
| 1 | 7 | -0.39175 |
| 1 | 8 | 0.60825 |
| 2 | 10 | -1.04639 |
| 3 | 15 | 0.29897 |
| 3 | 12 | -2.70103 |
| 4 | 19 | 0.64433 |
| 5 | 24 | 1.98969 |
| 5 | 21 | -1.01031 |
| Sum | | 0.00001 (round-off error) |
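The zero-sum property can be checked directly on the nine (x, y) points in the table. The following is a minimal sketch in plain Python (the slides themselves use Minitab); the variable names are illustrative choices, and the closed-form least-squares formulas are the standard ones rather than anything specific to this deck.

```python
# Least-squares fit of the nine (x, y) points listed on this slide.
x = [1, 1, 1, 2, 3, 3, 4, 5, 5]
y = [9, 7, 8, 10, 15, 12, 19, 24, 21]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope and intercept from the usual closed-form formulas.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed response minus fitted value.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"y = {b0:.5f} + {b1:.5f} x")
print("sum of residuals:", sum(residuals))
```

Computed at full floating-point precision, the sum differs from 0 only by machine round-off, which is consistent with the 0.00001 in the table coming from the five-decimal rounding of each residual.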
The residuals are not independent.
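[Figure: regression plot of y versus x. Fitted line: y = 3.73711 + 3.65464 x; S = 1.55768, R-Sq = 94.4%, R-Sq(adj) = 93.6%.]
Every residual is computed from the same two estimates, $b_0$ and $b_1$:
$$e_1 = y_1 - b_0 - b_1 x_1$$
$$e_2 = y_2 - b_0 - b_1 x_2$$
$$\vdots$$
$$e_n = y_n - b_0 - b_1 x_n$$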
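The dependence among the residuals can be made explicit. As a supplement to the slide (this is a standard property of least-squares fits that include an intercept, not something stated in the deck), choosing $b_0$ and $b_1$ by least squares forces the residuals to satisfy two linear constraints, so the $n$ residuals carry only $n - 2$ degrees of freedom:

```latex
\sum_{i=1}^{n} e_i = 0,
\qquad
\sum_{i=1}^{n} x_i e_i = 0
```

The first constraint is the reason the sample mean of the residuals is always 0.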
A residuals vs. fits plot
• A scatter plot with residuals on the y axis and fitted values on the x axis.
• Helps to identify non-linearity, outliers, and non-constant variance.
Example: Alcoholism and muscle strength?
[Figure: regression plot of strength versus alcohol. Fitted line: strength = 26.3695 - 0.295868 alcohol; S = 3.87372, R-Sq = 41.2%, R-Sq(adj) = 39.9%.]
A well-behaved residuals vs. fits plot
[Figure: residuals versus the fitted values (response is strength).]
Characteristics of a well-behaved residual vs. fits plot
• The residuals “bounce randomly” around the 0 line. (Linear is reasonable).
• No one residual “stands out” from the basic random pattern of residuals. (No outliers).
• The residuals roughly form a “horizontal band” around 0 line. (Constant variance).
A residuals vs. predictor plot
• A scatter plot with residuals on the y axis and the values of a predictor on the x axis.
• If the predictor on the x axis is the same predictor used in model, offers nothing new.
• If the predictor on the x axis is a new and different predictor, can help to determine whether the predictor should be added to model.
A residuals vs. predictor plot offering nothing new.
[Figure: residuals versus alcohol (response is strength). Same predictor!]
Example: What are good predictors of blood pressure?
• n = 20 hypertensive individuals
• age = age of individual
• weight = weight of individual
• duration = years with high blood pressure
Regression of BP on Age
[Figure: regression plot of BP versus Age. Fitted line: BP = 44.4545 + 1.43098 Age; S = 4.19480, R-Sq = 43.4%, R-Sq(adj) = 40.3%.]
Residuals (age only) vs. weight plot
[Figure: residuals versus Weight (response is BP). New predictor!]
Residuals (age, weight) vs. duration plot
[Figure: residuals versus Duration (response is BP). New predictor!]
How a non-linear function shows up on a residual vs. fits plot
• The residuals depart from 0 in some systematic manner:
  – such as being positive for small x values, negative for medium x values, and positive again for large x values.
Example: A linear relationship between tread wear and mileage?
| mileage | groove |
|---------|--------|
| 0 | 394.33 |
| 4 | 329.50 |
| 8 | 291.00 |
| 12 | 255.17 |
| 16 | 229.33 |
| 20 | 204.83 |
| 24 | 179.00 |
| 28 | 163.83 |
| 32 | 150.33 |
X = mileage in 1000 miles
Y = groove depth in mils
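The systematic pattern described for non-linear relationships can be seen numerically in the tread-wear data. The sketch below, in plain Python with a hypothetical helper `fit_line`, fits the straight line and prints the sign of each residual in order of mileage; it is an illustration of the idea, not the Minitab procedure used in the slides.

```python
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]      # X, in 1000 miles
groove = [394.33, 329.50, 291.00, 255.17, 229.33,
          204.83, 179.00, 163.83, 150.33]        # Y, groove depth in mils


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = fit_line(mileage, groove)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(mileage, groove)]

# Long runs of same-signed residuals suggest the straight line is missing curvature.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"intercept = {b0:.3f}, slope = {b1:.5f}")
print(signs)
```

The residuals are positive at the lowest mileage, negative through the middle of the range, and positive again at the highest mileages, which is the positive, negative, positive pattern that points to a non-linear relationship.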
Is tire tread wear linearly related to mileage?
[Figure: regression plot of groove versus mileage. Fitted line: groove = 360.637 - 7.28062 mileage; S = 19.0170, R-Sq = 95.3%, R-Sq(adj) = 94.6%.]
A residual vs. fits plot suggesting relationship is not linear
[Figure: residuals versus the fitted values (response is groove).]
How non-constant error variance shows up on a residual vs. fits plot
• The plot has a “fanning” effect.
  – Residuals are close to 0 for small x values and are more spread out for large x values.
• The plot has a “funneling” effect.
  – Residuals are spread out for small x values and close to 0 for large x values.
• Or, the spread of the residuals can vary in some complex fashion.
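A rough numerical counterpart to the visual check is to compare the size of the residuals in the lower and upper halves of the fitted values. The sketch below is an informal illustration only: the helper names `rms` and `spread_by_half` are hypothetical, the numbers are made up to show a fanning pattern (they are not the plutonium data), and the slides themselves rely on the plot rather than on any such summary.

```python
import math


def rms(values):
    """Root-mean-square size of a list of residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def spread_by_half(fitted, residuals):
    """Return (RMS residual in the lower half of fitted values, RMS in the upper half)."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    mid = len(pairs) // 2
    low = [e for _, e in pairs[:mid]]
    high = [e for _, e in pairs[mid:]]
    return rms(low), rms(high)


# Made-up numbers for illustration only (not the plutonium data).
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.2, -1.1]

low_spread, high_spread = spread_by_half(fitted, residuals)
print(f"lower half: {low_spread:.3f}, upper half: {high_spread:.3f}")
```

A much larger value in the upper half corresponds to fanning, a much larger value in the lower half to funneling; similar values are what a horizontal band would give.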
Example: How is plutonium activity related to alpha particle counts?
[Figure: regression plot of alpha versus plutonium. Fitted line: alpha = 0.0070331 + 0.0055370 plutonium; S = 0.0125713, R-Sq = 91.6%, R-Sq(adj) = 91.2%.]
A residual vs. fits plot suggesting non-constant error variance
[Figure: residuals versus the fitted values (response is alpha).]
How an outlier shows up on a residuals vs. fits plot
• The observation’s residual stands apart from the basic random pattern of the rest of the residuals.
• The random pattern of the residual plot can even disappear if one outlier really deviates from the pattern of the rest of the data.
Example: Relationship between tobacco use and alcohol use?
| Region | Alcohol | Tobacco |
|--------|---------|---------|
| North | 6.47 | 4.03 |
| Yorkshire | 6.13 | 3.76 |
| Northeast | 6.19 | 3.77 |
| EastMidlands | 4.89 | 3.34 |
| WestMidlands | 5.63 | 3.47 |
| EastAnglia | 4.52 | 2.92 |
| Southeast | 5.89 | 3.20 |
| Southwest | 4.79 | 2.71 |
| Wales | 5.27 | 3.53 |
| Scotland | 6.08 | 4.51 |
| Northern Ireland | 4.02 | 4.56 |

• Family Expenditure Survey of British Dept. of Employment
• X = average weekly expenditure on tobacco
• Y = average weekly expenditure on alcohol
Example: Relationship between tobacco use and alcohol use?
[Figure: regression plot of Alcohol versus Tobacco. Fitted line: Alcohol = 4.35117 + 0.301938 Tobacco; S = 0.819630, R-Sq = 5.0%, R-Sq(adj) = 0.0%.]
A residual vs. fits plot suggesting an outlier exists
[Figure: residuals versus the fitted values (response is Alcohol), with one point labeled “outlier”.]
How large does a residual need to be before being flagged?
• The magnitude of the residuals depends on the units of the response variable.
• Make the residuals “unitless” by dividing by their standard deviation. That is, use “standardized residuals.”
• Then, an observation with a standardized residual greater than 2 or smaller than -2 should be flagged for further investigation.
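The flagging rule can be tried on the tobacco and alcohol data from the earlier example. The sketch below is plain Python and rests on one assumption that the slides do not spell out: the standard deviation of residual $e_i$ is taken to be $S\sqrt{1 - h_i}$, where $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$ is the leverage of observation $i$. All variable names are illustrative.

```python
import math

# (Tobacco, Alcohol) values from the Family Expenditure Survey example.
regions = ["North", "Yorkshire", "Northeast", "EastMidlands", "WestMidlands",
           "EastAnglia", "Southeast", "Southwest", "Wales", "Scotland",
           "Northern Ireland"]
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]

n = len(tobacco)
x_bar = sum(tobacco) / n
y_bar = sum(alcohol) / n
s_xx = sum((x - x_bar) ** 2 for x in tobacco)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(tobacco, alcohol)) / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(tobacco, alcohol)]
s = math.sqrt(sum(e * e for e in residuals) / (n - 2))  # residual standard error S

# Assumed definition: standardized residual = e_i / (S * sqrt(1 - h_i)).
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in tobacco]
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Flag observations whose standardized residual is beyond +/- 2.
flagged = [(name, round(r, 2)) for name, r in zip(regions, std_resid) if abs(r) > 2]
print(flagged)
```

Under that assumption only Northern Ireland is flagged, with a standardized residual of about -2.6.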
Standardized residuals vs. fits plot
[Figure: standardized residuals versus the fitted values (response is Alcohol).]
Minitab identifies observations with large standardized residuals
Unusual Observations

| Obs | Tobacco | Alcohol | Fit | SE Fit | Resid | St Resid |
|-----|---------|---------|-----|--------|-------|----------|
| 11 | 4.56 | 4.020 | 5.728 | 0.482 | -1.708 | -2.58R |
R denotes an observation with a large standardized residual.
Anscombe data set #3
[Figure: regression plot of y3 versus x3. Fitted line: y3 = 3.00245 + 0.499727 x3; S = 1.23631, R-Sq = 66.6%, R-Sq(adj) = 62.9%.]
A residual vs. fits plot suggesting an outlier exists
[Figure: residuals versus the fitted values (response is y3).]
Residuals vs. order plot
• Helps assess serial correlation of error terms.
• If the data are obtained in a time (or space) sequence, a “residuals vs. order” plot helps to see if there is any correlation between error terms that are near each other in the sequence.
• A horizontal band bouncing randomly around 0 suggests errors are independent, while a systematic pattern suggests not.
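One simple number that summarizes correlation between neighbouring residuals is the lag-1 sample autocorrelation. The sketch below is an assumption-laden illustration rather than part of the slides: the function name is hypothetical, the residuals are assumed to be listed in collection order and to average 0, and the two example sequences are made up.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of residuals taken in collection order.

    Assumes the residuals average to 0, as least-squares residuals do
    when the model includes an intercept.
    """
    numerator = sum(a * b for a, b in zip(residuals[:-1], residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Made-up residual sequences for illustration only (not from the slides).
drifting = [-2.0, -1.0, 1.0, 2.0]      # neighbours tend to share a sign
alternating = [1.0, -1.0, 1.0, -1.0]   # neighbours tend to flip sign

print(lag1_autocorrelation(drifting))     # 0.3
print(lag1_autocorrelation(alternating))  # -0.75
```

Values near 0 are in line with a random horizontal band; values well away from 0 in either direction correspond to the systematic patterns that suggest dependence.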
Residuals vs. order plots suggesting non-independence of error terms
Normal (probability) plot of residuals
• Helps assess normality of error terms.
• If data are Normal($\mu$, $\sigma^2$), then percentiles of the normal distribution should plot linearly against sample percentiles (with sampling variation).
• The parameters $\mu$ and $\sigma^2$ are unknown. Theory shows it’s okay to assume $\mu = 0$ and $\sigma^2 = 1$.
Normal (probability) plot of residuals
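| x | y | i | RESI1 | PCT | MTB_PCT | NSCORE |
|---|----|---|-------|-----|---------|--------|
| 3 | 12 | 1 | -2.70103 | 0.1 | 0.060976 | -1.54664 |
| 2 | 10 | 2 | -1.04639 | 0.2 | 0.158537 | -1.00049 |
| 5 | 21 | 3 | -1.01031 | 0.3 | 0.256098 | -0.65542 |
| 1 | 7 | 4 | -0.39175 | 0.4 | 0.353659 | -0.37546 |
| 3 | 15 | 5 | 0.29897 | 0.5 | 0.451220 | -0.12258 |
| 1 | 8 | 6 | 0.60825 | 0.6 | 0.548780 | 0.12258 |
| 4 | 19 | 7 | 0.64433 | 0.7 | 0.646341 | 0.37546 |
| 1 | 9 | 8 | 1.60825 | 0.8 | 0.743902 | 0.65542 |
| 5 | 24 | 9 | 1.98969 | 0.9 | 0.841463 | 1.00049 |

The RESI1 column is ordered from smallest to largest (Ordered!).

$$\text{PCT} = \frac{i}{n + 1}, \qquad \text{MTB\_PCT} = \frac{i - 3/8}{n + 1/4}$$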
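The normal scores can be reproduced with Python's standard library. This is a minimal sketch, assuming that NSCORE is the standard normal quantile of $(i - 3/8)/(n + 1/4)$; the function name `normal_scores` is an illustrative choice. One caveat: the MTB_PCT and NSCORE values tabulated on the slide appear to correspond to $n = 10$ in that formula (for example $0.625 / 10.25 \approx 0.060976$), although nine residuals are listed, so the scores computed here with $n = 9$ differ slightly from the table.

```python
from statistics import NormalDist


def normal_scores(n):
    """Normal scores: standard normal quantiles of (i - 3/8) / (n + 1/4), i = 1..n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


# The nine residuals from the earlier fit, in their original order.
residuals = [1.60825, -0.39175, 0.60825, -1.04639, 0.29897,
             -2.70103, 0.64433, 1.98969, -1.01031]

ordered = sorted(residuals)            # sample percentiles: the ordered residuals
scores = normal_scores(len(ordered))   # theoretical percentiles: the normal scores

for e, z in zip(ordered, scores):
    print(f"{e:9.5f}  {z:8.5f}")
```

Plotting `scores` against `ordered` gives the normal probability plot; a roughly straight line is what normally distributed errors would produce.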
Normal (probability) plot of residuals (cont’d)
• Plot normal scores (theoretical percentiles) on vertical axis against ordered residuals (sample percentiles) on horizontal axis.
• Plot that is nearly linear suggests normality of error terms.
Normal (probability) plot
[Figure: normal probability plot of the residuals (response is y), with normal score on the vertical axis and residual on the horizontal axis.]
Normal (probability) plot
[Figure: normal probability plot of RESI1 (probability versus residual). Anderson-Darling Normality Test: A-Squared = 0.205, P-Value = 0.813. Average = -0.0000000, StDev = 1.45707, N = 9.]
Normal (probability) plot
[Figure: normal probability plot for RESI1 (percent versus data), ML estimates with 95% CI. ML estimates: Mean = -0.0000000, StDev = 1.37374. Goodness of fit: AD* = 1.349.]
A normal (probability) plot with non-normal error terms
[Figure: normal probability plot for RESI1 (percent versus data), ML estimates with 95% CI. ML estimates: Mean = -0.0000000, StDev = 16.1636. Goodness of fit: AD* = 3.417.]
Residual plots in Minitab’s regression command
• Select Stat >> Regression >> Regression
• Specify predictor and response
• Under Graphs…
  – select either Regular or Standardized
  – select desired types of residual plots (normal plot, versus fits, versus order, versus predictor variable)
Normal plots outside of Minitab’s regression command
• Select Stat >> Regression >> Regression...
• Specify predictor and response
• Under Storage…
  – select Regular or Standardized residuals
  – Select OK. Residuals will appear in worksheet.
• (Either) Select Graph >> Probability plot…
  – Specify RESI as variable and select Normal distribution. Select OK.
• (Or) Select Stat >> Basic Stat >> Normality Test
  – Specify RESI as variable and select OK.