
MATH 423/533 - ASSIGNMENT 2 SOLUTIONS
The following gives data on average public teacher annual salary in dollars, recorded in the data frame salary as the variable SALARY, and spending per pupil (SPENDING, in thousands of dollars) on public schools in 1985 in the 50 US states and the District of Columbia.
The objective of the analysis is to understand whether there is a relationship between teacher pay, y, and per-pupil spending, x. An analysis in R is presented below; some of the output has been deleted and replaced by XXXXX.
 1 > salary<-read.csv('salary.csv',header=TRUE)
 2 > x1<-salary$SPENDING/1000
 3 > y<-salary$SALARY
 4 > fit.Salary<-lm(y~x1);summary(fit.Salary)
 5
 6 Call:
 7 lm(formula = y ~ x1)
 8
 9 Residuals:
10     Min      1Q  Median      3Q     Max
11 -3848.0 -1844.6  -217.5  1660.0  5529.3
12
13 Coefficients:
14             Estimate Std. Error t value Pr(>|t|)
15 (Intercept)  12129.4      XXXXX   10.13 1.31e-13 ***
16 x1            3307.6      311.7   10.61 2.71e-14 ***
17 ---
18 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
19
20 Residual standard error: XXXXX on 49 degrees of freedom
21 Multiple R-squared: 0.6968, Adjusted R-squared: 0.6906
22 F-statistic: XXXXX on 1 and 49 DF, p-value: 2.707e-14
In answering the following questions, you may not use the lm function or its result on these data (or the functions coef(), residuals() etc.), but instead should use vector and matrix calculations.
(a) Write R code to verify the calculation of the entries in the Estimate column, and show that your code produces the correct results. 2 Marks
(b) Write R code to compute the value of the omitted entry for the Residual standard error on line 20. 2 Marks
(c) Compute the value of the entry in the Std. Error column on line 15 first using entries already given in the table, and then using the data directly. 2 Marks
(d) The entry for Multiple R-squared on line 21 is computed using the formula

R² = SS_R / SS_T,

where SS_R is the ‘regression sum-of-squares’ and SS_T is the ‘total sum of squares’ as defined in lectures. Write R code to verify the calculation of R². 2 Marks
(e) Prove for a simple linear regression that, in the notation from lectures,

SS_R = β̂1 S_xy,

and show this result holds numerically for the salary data. 2 Marks
MATH 423/533 ASSIGNMENT 2 Solutions Page 1 of 15
(f) The F-statistic on line 22 is computed using the sums-of-squares decomposition

SS_T = SS_Res + SS_R,    F = (SS_R/(p − 1)) / (SS_Res/(n − p)),

where here p = 2 for simple linear regression. Write R code to compute the omitted value for F. 2 Marks
(g) In the notation from lectures, we have that the sums-of-squares decomposition can be written

yᵀ(I_n − H₁)y = yᵀ(I_n − H)y + yᵀ(H − H₁)y.

Show, mathematically and numerically, that

trace(I_n − H₁) = n − 1,    trace(H − H₁) = p − 1

for this example, where p = 2 for simple linear regression. 2 Marks
(h) Using residual plots, assess the validity of the assumptions underlying the least squares analysis. Verify numerically the orthogonality results concerning the residuals, that is, in vector form

1_nᵀ e = 0,    Xᵀ e = 0_p,    ŷᵀ e = 0.

2 Marks
(i) Using the fitted model, predict what the average public teacher annual salary would be in a state where the spending per pupil is $4800.
1 Mark
(j) The prediction at an arbitrary new x value, x1_new, can be written in terms of the estimates β̂ as

ŷ_new = x_new β̂ = [1  x1_new] β̂ = β̂0 + β̂1 x1_new,

with β̂ the least squares estimate. Compute the estimated standard prediction error for ŷ_new, that is, the square root of the estimated variance of the corresponding random variable

Ŷ_new = x_new β̂ = [1  x1_new] β̂,

now with β̂ the least squares estimator, if x1_new is $4800. 3 Marks
EXTRA QUESTION FOR STUDENTS IN MATH 533 The figure below plots the percent differences on the log scale between successive recorded quarterly Gross Domestic Product (GDP) values in the US between the first quarter of 1947 and the first quarter of 2016 (277 data points).
[Figure: US GDP Growth: log scale differences, 1947−2016]
The data may be read in from the file US-GDP.txt as follows. For regression purposes, we define the predictor x1 by considering time (in quarters) since Q1, 1947.
y0<-scan('US-GDP.txt')
y<-100*log(y0[-1]/y0[-278])
x1<-c(1:277)
Is there any statistical evidence that there is a ‘changepoint’ in the GDP series at the year 1980 (when x1 = 133), that is, that the relationship between y and x1 prior to Q1 1980 is different from the relationship after that time? Investigate this possibility using straight line regression modelling (not a single simple linear regression), and report the result of an appropriate hypothesis test.
5 Marks
SOLUTION
The results to verify are given in the output below:
salary<-read.csv('salary.csv',header=TRUE)
x1<-salary$SPENDING/1000
y<-salary$SALARY
fit.Salary<-lm(y~x1);summary(fit.Salary)

+ Call:
+ lm(formula = y ~ x1)
+ 
+ Residuals:
+     Min      1Q  Median      3Q     Max 
+ -3848.0 -1844.6  -217.5  1660.0  5529.3 
+ 
+ Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+ (Intercept)  12129.4     1197.4   10.13 1.31e-13 ***
+ x1            3307.6      311.7   10.61 2.71e-14 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+ 
+ Residual standard error: 2325 on 49 degrees of freedom
+ Multiple R-squared: 0.6968, Adjusted R-squared: 0.6906
+ F-statistic: 112.6 on 1 and 49 DF, p-value: 2.707e-14
The values here are rounded quite aggressively; to show more digits:
print(summary(fit.Salary),digits=8)
+ Call:
+ lm(formula = y ~ x1)
+ 
+ Residuals:
+         Min          1Q      Median          3Q         Max 
+ -3847.97573 -1844.55654  -217.51923  1659.97327  5529.34250 
+ 
+ Coefficients:
+                Estimate  Std. Error  t value   Pr(>|t|)    
+ (Intercept) 12129.37102  1197.35080 10.13017 1.3081e-13 ***
+ x1           3307.58500   311.70427 10.61129 2.7069e-14 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+ 
+ Residual standard error: 2324.779 on 49 degrees of freedom
+ Multiple R-squared: 0.69678128, Adjusted R-squared: 0.69059314
+ F-statistic: 112.59952 on 1 and 49 DF, p-value: 2.7068708e-14
In the following calculations, as much precision is carried forward as possible, with no rounding of values in the calculation (although there is some rounding in printing).
(a) To compute the estimate, use the formula for the least squares solution
X<-cbind(1,x1)
XTX<-t(X)%*%X
beta.hat<-solve(XTX,t(X)%*%y)
print(beta.hat)

+         [,1]
+    12129.371
+ x1  3307.585
The estimates are therefore (12129.371, 3307.585) for β̂0 and β̂1 respectively, matching the estimated coefficients (12129.371, 3307.585) in the output.
2 Marks
(b) The residual standard error is σ̂ = √(SS_Res/(n − p)), with n = 51 and p = 2 here; in R,

residual.vec<-y-X %*% beta.hat
sigma.hat<-sqrt(sum(residual.vec^2)/(length(y)-2))
sigma.hat

+ [1] 2324.779

Therefore we have σ̂ = 2324.7789, matching the value from the output. 2 Marks
(c) From the table, we have simply that t0 = β̂0/e.s.e.(β̂0), so that therefore

e.s.e.(β̂0) = β̂0/t0 = 1.2129371 × 10⁴ / 10.1301732 = 1197.3508043,

that is, the missing standard error value is 1197.3508. From first principles, we have
(estimated.covariance<-sigma.hat^2 * solve(XTX))
(ese.vals<-sqrt(diag(estimated.covariance)))

+ (Intercept)          x1 
+   1197.3508    311.7043 
that is, that the missing standard error value is 1197.3508 as before. 2 Marks
(d) Here we have

SS.T<-sum((y-mean(y))^2)
SS.Res<-sum((y-X %*% beta.hat)^2)
SS.R<-SS.T-SS.Res
SS.R/SS.T

+ [1] 0.6967813

and R² is confirmed as 0.69678128. Here we have that

SS_T = 873380264.627,    SS_Res = 264825249.995,    SS_R = 608555014.633.
To confirm the calculation using the formula

yᵀ(I_n − H₁)y = yᵀ(I_n − H)y + yᵀ(H − H₁)y,
we have the alternate code
H<-X %*% (solve(XTX) %*% t(X))
One<-cbind(rep(1,length(y)))
H1<-(One %*% t(One))/length(y)
SSQ.T<-t(y) %*% (diag(1,length(y))-H1) %*% y
SSQ.Res<-t(y) %*% (diag(1,length(y))-H) %*% y
SSQ.R<-t(y) %*% (H-H1) %*% y
print(c(SSQ.T,SSQ.Res,SSQ.R),digits=12)
+ [1] 873380264.627 264825249.995 608555014.633
(e) By definition,

SS_R = Σ_{i=1}^n (ŷi − ȳ)² = Σ_{i=1}^n (β̂0 + β̂1 xi1 − ȳ)²,

but, from an earlier result, we know that the fitted straight line passes through the point (x̄1, ȳ), that is, we know that

ȳ = β̂0 + β̂1 x̄1.

Substituting this in, we obtain

SS_R = Σ_{i=1}^n β̂1²(xi1 − x̄1)² = β̂1² S_xx.

But also from a previous result, β̂1 = S_xy/S_xx, so therefore

SS_R = β̂1(β̂1 S_xx) = β̂1 S_xy.

Numerically, we have
(S.xy<-sum((y-mean(y))*(x1-mean(x1))))

+ [1] 183987.7

print(beta.hat[2]*S.xy,digits=12)

+ [1] 608555014.633

which matches the value of SS_R computed earlier.
2 Marks
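The identity SS_R = β̂1 S_xy is purely algebraic, so it can also be checked on synthetic data; a minimal NumPy sketch, independent of the salary data (all names illustrative):

```python
import numpy as np

# Synthetic data: the identity holds for any (x, y), not just the salary data
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 2.0 + 3.0 * x + rng.normal(size=40)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx                         # hat(beta)_1
b0 = y.mean() - b1 * x.mean()          # hat(beta)_0

yhat = b0 + b1 * x
SSR_direct = np.sum((yhat - y.mean()) ** 2)    # regression sum-of-squares
SSR_identity = b1 * Sxy                        # the identity SS_R = hat(beta)_1 S_xy

assert np.isclose(SSR_direct, SSR_identity)
```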
(f) Here

Fstat<-(SS.R/(2-1))/(SS.Res/(length(y)-2))
Fstat

+ [1] 112.5995

and the F statistic is confirmed as 112.5995. (Note that in simple linear regression F = t² for the slope: 10.61129² = 112.59952.)
2 Marks
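For part (g), note that trace is linear and trace(AB) = trace(BA): trace(H) = trace(X(XᵀX)⁻¹Xᵀ) = trace((XᵀX)⁻¹XᵀX) = trace(I_p) = p, and trace(H₁) = n(1/n) = 1, so trace(I_n − H₁) = n − 1 and trace(H − H₁) = p − 1. A minimal numerical sketch with a synthetic design matrix (the computation is identical for the salary X):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2                            # p = 2 for simple linear regression
X = np.column_stack([np.ones(n), rng.normal(size=n)])

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix; trace(H) = p
one = np.ones((n, 1))
H1 = one @ one.T / n                    # projection onto the constant; trace(H1) = 1

assert np.isclose(np.trace(np.eye(n) - H1), n - 1)
assert np.isclose(np.trace(H - H1), p - 1)
```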
(h) Here is a plot of the fitted line:

par(mar=c(4,4,0,0))
plot(x1,y,pch=19,cex=0.75);abline(coef(fit.Salary),col='red')
[Figure: scatterplot of y against x1 with the fitted regression line]
which seems to do a good job in reflecting the relationship. For the residual plots versus the predictor and versus the fitted values:

par(mar=c(4,4,0,0))
plot(x1,residual.vec,pch=19,cex=0.75,ylim=range(-6000,6000));abline(h=0,lty=2)
plot(X %*% beta.hat,residual.vec,pch=19,cex=0.75,ylim=range(-6000,6000));abline(h=0,lty=2)
[Figure: residuals plotted against x1 and against the fitted values]
and we can conclude that there is no evidence to suggest that the assumptions concerning the model and residual errors are invalid. To demonstrate the orthogonality, there are different ways to compute:
#Result 1
sum(residual.vec)
+ [1] 1.599318e-09
#Result 2
sum(x1*residual.vec)

+ [1] 3.29758e-06
#Result 3
t(X %*% beta.hat) %*% residual.vec

+              [,1]
+ [1,] 3.311783e-06
The last result is several orders of magnitude away from the first, although still very close to the required zero value. This is in part due to the fitted and residual values having been stored in an earlier calculation. For example, recomputing using differently stored objects, we have
#Result 3 recomputed
sum((cbind(1,x1) %*% coef(fit.Salary))*(y-(cbind(1,x1) %*% coef(fit.Salary))))

+ [1] 1.30618e-07
2 Marks
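These orthogonality relations are properties of any least squares fit, and in practice hold only up to floating-point rounding; a self-contained NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta                      # residual vector

# 1'e = 0, X'e = 0 and yhat'e = 0, up to floating-point rounding
assert abs(e.sum()) < 1e-8
assert np.allclose(X.T @ e, 0.0, atol=1e-8)
assert abs((X @ beta) @ e) < 1e-6
```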
(i) Recalling that in the original data, the predictors are recorded in thousands of dollars, we have that the prediction is computed using the fitted straight line as

format(as.numeric(c(1,4.8) %*% coef(fit.Salary)),digits=8)

+ [1] "28005.779"

that is, a predicted average salary of $28005.78. 1 Mark
(j) As we may write the predicted value Ŷ_new as a linear combination of the estimator vector elements

Ŷ_new = x_new β̂ = [1  x1_new] β̂,

we can use standard results from lectures to deduce that

Var_{Y|X}[Ŷ_new | X, x1_new] = x_new Var_{Y|X}[β̂ | X] {x_new}ᵀ = x_new {σ²(XᵀX)⁻¹} {x_new}ᵀ,
which would then be estimated by replacing σ2 by its estimate. That is, for the problem at hand
x1new<-matrix(c(1,4.8),nrow=1)
pred.var<-sigma.hat^2*x1new %*% solve(XTX) %*% t(x1new)
pred.var
+          [,1]
+ [1,] 224261.7

sqrt(pred.var)

+          [,1]
+ [1,] 473.5628
so that the required estimated standard prediction error is 473.5628. 3 Marks
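The variance formula x_new {σ̂²(XᵀX)⁻¹} {x_new}ᵀ can be coded generically; a NumPy sketch on synthetic data (xnew here is an arbitrary illustrative design row, not the salary value):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma2_hat = resid @ resid / (n - 2)          # hat(sigma)^2

xnew = np.array([1.0, 0.7])                   # illustrative new design row
se_fit = np.sqrt(sigma2_hat * xnew @ np.linalg.solve(X.T @ X, xnew))

# Equivalent computation via the full estimated covariance matrix
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
assert np.isclose(se_fit, np.sqrt(xnew @ cov_beta @ xnew))
```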
To check the results for (i) and (j) using predict, we would write
predict(fit.Salary,newdata=data.frame(x1=4.8),se.fit=TRUE)
+ $fit
+        1 
+ 28005.78 
+ 
+ $se.fit
+ [1] 473.5628
+ 
+ $df
+ [1] 49
+ 
+ $residual.scale
+ [1] 2324.779
EXTRA QUESTION FOR STUDENTS IN MATH 533

We can proceed in (at least) three different ways:
1. Fit using a ‘flexible’ model as in one of the earlier Extra Hour classes, assuming that there are two discontinuous straight lines either side of x1 = 133, so that the modelled expected value is

E_{Y|X}[Yi | xi] = I_(0,133](xi1)(β10 + β11 xi1) + I_(133,277)(xi1)(β20 + β21 xi1)
                 = { β10 + β11 xi1,      0 ≤ xi1 ≤ 133
                   { β20 + β21 xi1,    133 < xi1 ≤ 277
However, an alternative parameterization in terms of contrasts more readily permits the assessment of interest: that is, we write

E_{Y|X}[Yi | xi] = β0 + β1 xi1 + I_(133,277)(xi1)(δ0 + δ1 xi1)
                 = { β0 + β1 xi1,                     0 ≤ xi1 ≤ 133
                   { (β0 + δ0) + (β1 + δ1) xi1,     133 < xi1 ≤ 277
so that δ0 measures the change in intercept and δ1 measures the change in slope.
y0<-scan('US-GDP.txt')
y<-100*log(y0[-1]/y0[-278])
x1<-c(1:277)
fit.c1<-lm(y~(x1>133)*x1)
summary(fit.c1)
+ Call:
+ lm(formula = y ~ (x1 > 133) * x1)
+ 
+ Residuals:
+     Min      1Q  Median      3Q     Max 
+ -3.2428 -0.4601  0.0010  0.4981  4.5554 
+ 
+ Coefficients:
+                  Estimate Std. Error t value Pr(>|t|)    
+ (Intercept)      1.268748   0.176850   7.174 6.86e-12 ***
+ x1 > 133TRUE     2.027377   0.461446   4.394 1.60e-05 ***
+ x1               0.008489   0.002290   3.707 0.000254 ***
+ x1 > 133TRUE:x1 -0.018159   0.003062  -5.930 9.14e-09 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+ 
+ Residual standard error: 1.014 on 273 degrees of freedom
+ Multiple R-squared: 0.1681, Adjusted R-squared: 0.1589
+ F-statistic: 18.39 on 3 and 273 DF, p-value: 6.806e-11
confint(fit.c1)
+                        2.5 %      97.5 %
+ (Intercept)      0.920584549  1.61691226
+ x1 > 133TRUE     1.118931885  2.93582160
+ x1               0.003980031  0.01299743
+ x1 > 133TRUE:x1 -0.024187618 -0.01213028
The second and fourth lines of the coefficients table correspond to the estimates of δ0 and δ1 respectively; we have δ̂0 = 2.027377 and δ̂1 = −0.01815895. Both values are significantly different from zero, so there is significant evidence of a changepoint.
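The claim that the contrast parameterization carries the same information as two separately parameterized lines can be checked generically; a NumPy sketch on synthetic changepoint data (all names and the cut value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, cut = 200, 100
x = np.arange(1, n + 1, dtype=float)
after = (x > cut).astype(float)
y = 1.0 + 0.01 * x - 0.02 * after * x + 2.0 * after + rng.normal(scale=0.5, size=n)

# Contrast parameterization: intercept, slope, and changes delta0, delta1
Xc = np.column_stack([np.ones(n), x, after, after * x])
beta_c, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# Separate-lines parameterization: two intercepts and two slopes
before = 1.0 - after
Xs = np.column_stack([before, before * x, after, after * x])
beta_s, *_ = np.linalg.lstsq(Xs, y, rcond=None)

# Same column space, so identical fitted values; deltas are the differences
assert np.allclose(Xc @ beta_c, Xs @ beta_s)
assert np.isclose(beta_c[2], beta_s[2] - beta_s[0])  # delta0 = beta20 - beta10
assert np.isclose(beta_c[3], beta_s[3] - beta_s[1])  # delta1 = beta21 - beta11
```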
2. We can repeat the above, but make the expectation continuous at the changepoint. The easiest way to do this is to form the model

E_{Y|X}[Yi | xi] = β0 + β1 xi1 + I_(133,277)(xi1) δ1(xi1 − 133)
                 = { β0 + β1 xi1,                       0 ≤ xi1 ≤ 133
                   { (β0 − 133δ1) + (β1 + δ1) xi1,    133 < xi1 ≤ 277

Note that the continuity assumption restricts us to a three-parameter model.
fit.c2<-lm(y~x1+I((x1>133)*(x1-133)))
print(summary(fit.c2,digits=4))
+ Call:
+ lm(formula = y ~ x1 + I((x1 > 133) * (x1 - 133)))
+ 
+ Residuals:
+     Min      1Q  Median      3Q     Max 
+ -3.3257 -0.4900 -0.0313  0.5300  4.4862 
+ 
+ Coefficients:
+                              Estimate Std. Error t value Pr(>|t|)    
+ (Intercept)                  1.369618   0.165547   8.273 5.74e-15 ***
+ x1                           0.006230   0.001802   3.458 0.000632 ***
+ I((x1 > 133) * (x1 - 133)) -0.017855   0.003065  -5.826 1.59e-08 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+ 
+ Residual standard error: 1.017 on 274 degrees of freedom
+ Multiple R-squared: 0.1604, Adjusted R-squared: 0.1542
+ F-statistic: 26.17 on 2 and 274 DF, p-value: 3.982e-11
confint(fit.c2)
+                                    2.5 %      97.5 %
+ (Intercept)                  1.043711323  1.69552456
+ x1                           0.002683043  0.00977787
+ I((x1 > 133) * (x1 - 133)) -0.023888774 -0.01182157
Now we have δ̂1 = −0.01785517, and again a significant change in slope.
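A direct way to test the continuity restriction is the partial F-test between the nested three- and four-parameter models; a NumPy sketch on synthetic data with a built-in jump (the cut value and effect sizes are illustrative; only the F statistic is computed, to be compared with an F(1, n − 4) critical value):

```python
import numpy as np

rng = np.random.default_rng(5)
n, cut = 200, 100
x = np.arange(1, n + 1, dtype=float)
after = (x > cut).astype(float)
# Built-in discontinuity at the cut, so continuity should be rejected
y = 1.0 + 0.01 * x + after * (4.0 - 0.02 * x) + rng.normal(scale=0.5, size=n)

def rss(X, y):
    """Residual sum-of-squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X_full = np.column_stack([np.ones(n), x, after, after * x])    # 4 parameters
X_cont = np.column_stack([np.ones(n), x, after * (x - cut)])   # 3, continuous at cut

F = (rss(X_cont, y) - rss(X_full, y)) / (rss(X_full, y) / (n - 4))
assert F > 10.0   # far beyond the F(1, 196) 5% critical value (about 3.89)
```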
3. We could perform two separate straight-line fits on the first half and second half of the data: we achieve this using the subset argument in the lm call.
fit.c3<-lm(y~x1,subset=(x1<=133))   #First half
print(summary(fit.c3,digits=4))
+ Call:
+ lm(formula = y ~ x1, subset = (x1 <= 133))
+ 
+ Residuals:
+     Min      1Q  Median      3Q     Max 
+ -3.2428 -0.6401 -0.0697  0.6753  4.5554 
+ 
+ Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+ (Intercept) 1.268748   0.219464   5.781 5.16e-08 ***
+ x1          0.008489   0.002842   2.987  0.00337 ** 
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+ 
+ Residual standard error: 1.258 on 131 degrees of freedom
+ Multiple R-squared: 0.06376, Adjusted R-squared: 0.05661
+ F-statistic: 8.921 on 1 and 131 DF, p-value: 0.003366
confint(fit.c3)
fit.c4<-lm(y~x1,subset=(x1>133))   #Second half
print(summary(fit.c4,digits=4))
+ Call:
+ lm(formula = y ~ x1, subset = (x1 > 133))
+ 
+ Residuals:
+      Min       1Q   Median       3Q      Max 
+ -2.90201 -0.33119  0.01855  0.38653  2.57152 
+ 
+ Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+ (Intercept)  3.29612    0.30194  10.917  < 2e-16 ***
+ x1          -0.00967    0.00144  -6.715 4.19e-10 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+ 
+ Residual standard error: 0.7183 on 142 degrees of freedom
+ Multiple R-squared: 0.241, Adjusted R-squared: 0.2357
+ F-statistic: 45.09 on 1 and 142 DF, p-value: 4.186e-10
confint(fit.c4)
From the reported confidence intervals for the parameters, we can conclude that the two lines are significantly different; for example, the confidence intervals for the slope parameters do not overlap.
par(mar=c(4,4,3,0))
plot(x1,y,ylab='Percent GDP growth',type='l')
points(x1,y,pch=19,cex=0.75)
title('US GDP Growth: log scale differences, 1947-2016')
x1v<-seq(0,277,by=0.01)
y1v<-predict(fit.c1,newdata=data.frame(x1=x1v))
lines(x1v,y1v,col='red')
y2v<-predict(fit.c2,newdata=data.frame(x1=x1v))
lines(x1v,y2v,col='blue')
legend(150,6,c('Discontinuous','Continuous'),col=c('red','blue'),lty=1)
[Figure: US GDP Growth: log scale differences, 1947−2016, with the discontinuous (red) and continuous (blue) fitted lines]
The residual plots from the first two fits do not offer conclusive evidence that the assumptions are incorrect; however, it appears that the residuals in the first half have a larger variance than the residuals in the second half. This is backed up by the fact that the two estimates of σ are different: 1.258361 and 0.7183496 for the first and second halves respectively.
par(mar=c(4,4,3,0),mfrow=c(2,1))
plot(x1,residuals(fit.c1),pch=19,ylim=range(-4,4),main='Discontinuous model: residuals')
abline(v=133,lty=3);abline(h=0,lty=2)
plot(x1,residuals(fit.c2),pch=19,ylim=range(-4,4),main='Continuous model: residuals')
abline(v=133,lty=3);abline(h=0,lty=2)
[Figure: residuals for the discontinuous model (top) and the continuous model (bottom), with the changepoint x1 = 133 marked]