stats 330: lecture 12

04/21/23 330 lecture 12 1

STATS 330: Lecture 12


Diagnostics 4

Aim of today’s lecture

To discuss diagnostics for independence

Independence

One of the regression assumptions is that the errors are independent.

Data collected sequentially over time often have errors that are not independent.

If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable.

Thus, we need to be able to detect lack of independence.


Types of dependence

If large positive errors have a tendency to follow large positive errors, and large negative errors a tendency to follow large negative errors, we say the data has positive autocorrelation

If large positive errors have a tendency to follow large negative errors, and large negative errors a tendency to follow large positive errors, we say the data has negative autocorrelation

Diagnostics

If the errors are positively autocorrelated,

• Plotting the residuals against time will show long runs of positive and negative residuals

• Plotting residuals against the previous residual (i.e. e_i vs e_{i-1}) will show a positive trend

• A correlogram of the residuals will show positive spikes, gradually decaying

Diagnostics (2)

If the errors are negatively autocorrelated,

• Plotting the residuals against time will show alternating positive and negative residuals

• Plotting residuals against the previous residual (i.e. e_i vs e_{i-1}) will show a negative trend

• A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying


Residuals against time

res <- residuals(lm.obj)
# The "x" vector 1:length(res) can be omitted if it is just sequence numbers
plot(1:length(res), res, xlab = "time", ylab = "residuals", type = "b")  # dots and lines
lines(1:length(res), res)
abline(h = 0, lty = 2)  # dashed line at 0 (the mean residual)

[Figure: three plots of residual against time, including panels titled "Autocorrelation = 0.9" and "Autocorrelation = -0.9".]


Residuals against previous

res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]   # element 1 has no previous
prev.res <- res[-n]   # have to be equal length
plot(prev.res, plot.res,
     xlab = "previous residual", ylab = "residual")

Plots for different degrees of autocorrelation

[Figure: three scatterplots of residual against previous residual, one for each degree of autocorrelation.]


Correlogram

acf(residuals(lm.obj))

The correlogram (autocorrelation function, ACF) is a plot of the lag k autocorrelation against k.

The lag k autocorrelation is the correlation between residuals k time units apart.
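To make the definition concrete, here is a minimal sketch that computes the lag k autocorrelation by hand and compares it with what acf() returns. The residual vector is simulated as a stand-in for residuals(lm.obj), and the helper name lag_autocorr is hypothetical, not part of the lecture code.

```r
# Simulated stand-in for residuals(lm.obj); any numeric vector would do.
set.seed(1)
res <- as.numeric(arima.sim(model = list(ar = 0.6), n = 100))

# Lag-k autocorrelation in the form used by acf(): the lag-k sum of
# cross-products of deviations from the overall mean, divided by the
# lag-0 sum of squared deviations.
lag_autocorr <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)
  sum(d[(k + 1):n] * d[1:(n - k)]) / sum(d^2)
}

r_manual <- sapply(1:5, function(k) lag_autocorr(res, k))

# acf() stores lag 0 first, so lags 1 to 5 are elements 2 to 6.
r_acf <- acf(res, lag.max = 5, plot = FALSE)$acf[2:6]

print(cbind(lag = 1:5, manual = r_manual, acf = r_acf))
```

The two columns should agree, which shows that each spike in the correlogram is just one of these lagged correlations.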

[Figure: three correlograms (ACF against lag, lags 0 to 20).]

Durbin-Watson test

We can also do a formal hypothesis test (the Durbin-Watson test) for independence.

The test assumes the errors follow a model of the form

\[ \varepsilon_i = \rho\,\varepsilon_{i-1} + u_i \]

where the u_i's are independent, normal and have constant variance. Here ρ is the lag 1 correlation: this is the autoregressive model of order 1, AR(1).
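As an illustration of the AR(1) error model, the sketch below simulates errors with lag 1 correlation 0.9 and -0.9 using arima.sim() and checks the sign of the sample lag 1 correlation. The seed, series length and object names are arbitrary choices for this example, not values from the lecture.

```r
set.seed(330)
n <- 500

# AR(1) errors: each error is rho times the previous error plus
# independent normal noise.
pos_err <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
neg_err <- as.numeric(arima.sim(model = list(ar = -0.9), n = n))

# Sample lag 1 correlation: correlate each error with the one before it.
lag1_pos <- cor(pos_err[-1], pos_err[-n])
lag1_neg <- cor(neg_err[-1], neg_err[-n])

print(c(positive = lag1_pos, negative = lag1_neg))
```

Plotting pos_err and neg_err against time would be expected to show the long runs and the alternating pattern, respectively, described in the diagnostics slides.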


Durbin-Watson test (2)

When ρ = 0, the errors are independent.

The DW test tests independence by testing ρ = 0.

ρ is estimated by

\[ \hat{\rho} = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=2}^{n} e_i^2} \]

Durbin-Watson test (3)

DW test statistic is

\[ DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=2}^{n} e_i^2} \approx 2(1 - \hat{\rho}) \]

Value of DW is between 0 and 4

Values of DW around 2 are consistent with independence

Values close to 4 indicate negative serial correlation

Values close to 0 indicate positive serial correlation
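The relationship between DW and the estimated lag 1 correlation can be checked numerically. The following is a sketch on simulated data (the regression, seed and variable names are assumptions made for illustration); it computes DW from the sum-of-squares formula and compares it with 2(1 - rho_hat).

```r
set.seed(2)
n <- 200
x <- runif(n)
# Errors with positive lag 1 correlation (AR(1) with rho = 0.5).
err <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- 1 + 2 * x + err
e <- residuals(lm(y ~ x))

e_now  <- e[-1]   # e_i     for i = 2, ..., n
e_prev <- e[-n]   # e_{i-1} for i = 2, ..., n

rho_hat   <- sum(e_now * e_prev) / sum(e_now^2)
dw        <- sum((e_now - e_prev)^2) / sum(e_now^2)
dw_approx <- 2 * (1 - rho_hat)

print(c(rho_hat = rho_hat, DW = dw, approx = dw_approx))
```

The two versions differ only through the first and last squared residuals, so they should be close for a series of this length. With positively correlated errors, DW is expected to fall well below 2.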

04/21/23 330 lecture 12 16

Durbin-Watson test (4) There exist values dL, dU depending on the

number of variables k in the regression and the sample size n – see table on next slide

Use the value of DW to decide on independence as follows:

0 44-dU 4-dLdL dU

Positive autocorrelation

Negative autocorrelation

Independence

Inconclusive
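The decision rule above can be written as a small function. This is only a sketch: the function name dw_decision is hypothetical, and the critical values used in the calls are made-up illustrative numbers, not entries from the Durbin-Watson table.

```r
# Classify a DW value given the lower and upper critical values dL and dU.
dw_decision <- function(dw, dL, dU) {
  stopifnot(dw >= 0, dw <= 4, dL < dU)
  if (dw < dL) {
    "positive autocorrelation"
  } else if (dw < dU) {
    "inconclusive"
  } else if (dw <= 4 - dU) {
    "independence"
  } else if (dw <= 4 - dL) {
    "inconclusive"
  } else {
    "negative autocorrelation"
  }
}

# Hypothetical critical values, for illustration only.
low  <- dw_decision(1.0, dL = 1.3, dU = 1.6)
mid  <- dw_decision(2.1, dL = 1.3, dU = 1.6)
high <- dw_decision(3.5, dL = 1.3, dU = 1.6)
print(c(low, mid, high))
```

In practice dL and dU would be read from the table for the relevant n and k.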


Durbin-Watson table


Example: the advertising data

Sales and advertising data

Data on monthly sales and advertising spend for 35 months.

Model is Sales ~ spend + prev.spend

(prev.spend = spend in previous month)

Advertising data

> ad.df
   spend prev.spend sales
1     16         15  20.5
2     18         16  21.0
3     27         18  15.5
4     21         27  15.3
5     49         21  23.5
6     21         49  24.5
7     22         21  21.3
8     28         22  23.5
9     36         28  28.0
10    40         36  24.0
11     3         40  15.5
12    21          3  17.3
… 35 lines in all

R code for residual vs previous plot

advertising.lm <- lm(sales ~ spend + prev.spend, data = ad.df)
res <- residuals(advertising.lm)
n <- length(res)
plot.res <- res[-1]
prev.res <- res[-n]
plot(prev.res, plot.res,
     xlab = "previous residual", ylab = "residual",
     main = "Residual versus previous residual \n for the advertising data")
abline(coef(lm(plot.res ~ prev.res)), col = "red", lwd = 2)

[Figure: Residual versus previous residual for the advertising data (residual against previous residual, with fitted line).]

Time series plot, correlogram – R code

par(mfrow = c(2, 1))
plot(res, type = "b", xlab = "Time Sequence", ylab = "Residual",
     main = "Time series plot of residuals for the advertising data")
abline(h = 0, lty = 2, lwd = 2, col = "blue")
acf(res, main = "Correlogram of residuals for the advertising data")

[Figure: Time series plot of residuals for the advertising data (Residual against Time Sequence, annotated "Increasing trend?"), above the Correlogram of residuals for the advertising data (ACF against Lag).]

Calculating DW

> rhohat<-cor(plot.res,prev.res)
> rhohat
[1] 0.4450734
> DW<-2*(1-rhohat)
> DW
[1] 1.109853

For n=35 and k=2, dL = 1.34. Since DW = 1.110 < dL = 1.34, there is strong evidence of positive serial correlation.

Durbin-Watson table

Use (1.28 + 1.39)/2 ≈ 1.34 (the average of the two adjacent table entries).

Remedy (1)

If we detect serial correlation, we need to fit special time series models to the data.

For full details see STATS 326/726.

Assuming that the AR(1) model is ok, we can use the arima function in R to fit the regression.

Fitting a regression with AR(1) errors

> arima(ad.df$sales,order=c(1,0,0), xreg=cbind(spend,prev.spend))

Call:
arima(x = ad.df$sales, order = c(1, 0, 0), xreg = cbind(spend, prev.spend))

Coefficients:
         ar1  intercept   spend  prev.spend
      0.4966    16.9080  0.1218      0.1391
s.e.  0.1580     1.6716  0.0308      0.0316

sigma^2 estimated as 9.476: log likelihood = -89.16, aic = 188.32
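Since ad.df itself is not reproduced in full here, the following sketch repeats the same kind of fit on simulated data, so it can be run as is. All names and true coefficient values (x1, x2, an AR(1) coefficient of 0.5) are assumptions made for the illustration. It fits the ordinary regression and the regression with AR(1) errors, then prints both sets of estimates for comparison.

```r
set.seed(3)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
# Errors follow an AR(1) model with lag 1 correlation 0.5.
err <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- 10 + 2 * x1 - 1 * x2 + err

# Ordinary least squares, which ignores the serial correlation.
ols_fit <- lm(y ~ x1 + x2)

# Regression with AR(1) errors: order = c(1, 0, 0) asks for one
# autoregressive term; the explanatory variables go in xreg.
ar1_fit <- arima(y, order = c(1, 0, 0), xreg = cbind(x1, x2))

print(coef(ols_fit))
print(coef(ar1_fit))              # ar1, intercept, then the slopes
print(sqrt(diag(vcov(ar1_fit))))  # standard errors of the arima estimates
```

The ar1 coefficient is the estimate of the lag 1 correlation of the errors; the remaining coefficients play the same role as those from lm.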

Comparisons

                        lm              arima
Const (std err)         15.60 (1.34)    16.91 (1.67)
Spend (std err)         0.142 (0.035)   0.122 (0.031)
Prev Spend (std err)    0.166 (0.036)   0.139 (0.032)
1st order correlation   0.442           0.497
Sigma                   3.652           3.078

Remedy (2)

Recall there was a trend in the time series plot of the residuals: the residuals seem related to time.

Thus, time is a “lurking variable”, a variable that should be in the regression but isn’t.

Try model

Sales ~ spend + prev.spend + time

Fitting new model

time <- 1:35
new.advertising.lm <- lm(sales ~ spend + prev.spend + time, data = ad.df)
res <- residuals(new.advertising.lm)
n <- length(res)
plot.res <- res[-1]
prev.res <- res[-n]
DW <- 2 * (1 - cor(plot.res, prev.res))

DW Retest

DW is now 1.73.

For a model with 3 explanatory variables, dU is about 1.66 (refer to the table). Since DW = 1.73 lies between dU and 4 - dU, there is no evidence of serial correlation.

Time is a highly significant variable in the regression.

Problem is fixed!
