TRANSCRIPT
[1]
Simple Linear Regression
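The general equation of a line is Y = c + mX or Y = α + βX.
[Figure: four example lines, one for each sign combination of the coefficients: (α > 0, β > 0), (α > 0, β = 0), (α = 0, β < 0), (α > 0, β < 0)]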
[3]
Regression analysis is a technique for quantifying the relationship between a response variable (or dependent variable) and one or more predictor (independent or explanatory) variables.
Two Main Purposes:
• To predict the dependent variable based on specified values for the predictor variable(s).
• To understand how the predictor variable(s) influence or relate to the dependent variable.
Example - Humidity Data
The raw material used in the production of a certain synthetic fiber is stored in a location without humidity control. Measurements of the relative humidity in the storage location and the moisture content (in %) of a sample of the raw material were taken over 15 days.
Rel. Humidity:  46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44
Mois. Content:  12, 15,  7, 17, 10, 11, 11, 12, 14,  9, 16,  8, 18, 14, 12
Relative Humidity takes the role of explanatory variable.
Moisture Content takes the role of dependent variable.
[5]
[Figure: scatter plot of Moisture Content (%) against Relative Humidity]
[6]
The Regression Model
The Simple Linear Regression Model can be stated as
Yi = α + βXi + εi
Yi is the value of the response variable in the ith trial
α and β are the intercept and slope parameters
Xi is a known constant, namely the value of the explanatory variable in the ith trial
εi is an unobservable random error term such that εi ~ N(0, σ²).
εi is also referred to as the stochastic element of the regression model Yi = α + βXi + εi.
[7]
Minimise Vertical Distances of Data to ‘Best Fit Line’
[Figure: scatter plot of Moisture Content (%) against Relative Humidity, showing the vertical distance from each data point to the line Yi = α + βXi + εi]
[8]
Formulae For Least Squares Method
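The most common method for computing the estimated intercept, α̂, and the estimated slope, β̂, is least squares.
The fitted regression line is calculated as Ŷ = α̂ + β̂X, where
\[ \hat{\beta} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2} = \frac{\sum_{i=1}^{n}X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n}X_i^2 - n\bar{X}^2} \]
and α̂ = Ȳ − β̂X̄.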
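As an illustration of the formulae above, the following R sketch applies both forms of the slope formula to the humidity data from the example. The variable names (`beta_dev`, `beta_comp`, `alpha_hat`) are illustrative choices and do not come from the slides.

```r
# Humidity data from the example (15 days)
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

n     <- length(humidity)
x_bar <- mean(humidity)
y_bar <- mean(moisture)

# Deviation form: sum of cross-products over sum of squared deviations
beta_dev <- sum((humidity - x_bar) * (moisture - y_bar)) /
  sum((humidity - x_bar)^2)

# Computational form: raw sums corrected by n times the means
beta_comp <- (sum(humidity * moisture) - n * x_bar * y_bar) /
  (sum(humidity^2) - n * x_bar^2)

# Intercept from the fitted slope and the two sample means
alpha_hat <- y_bar - beta_dev * x_bar

print(c(alpha_hat = alpha_hat, beta_dev = beta_dev, beta_comp = beta_comp))
```

The two slope values should agree up to floating-point error, since the second form is only an algebraic rearrangement of the first.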
LEAST SQUARES ESTIMATES
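Seek to minimise the sum of the squared residuals
\[ SSE = \sum_{i=1}^{n}\left(Y_i - \alpha - \beta X_i\right)^2 \]
Take derivatives with respect to α and β and set them to zero.
\[ \frac{\partial SSE}{\partial \alpha} = -2\sum_{i=1}^{n} Y_i + 2n\alpha + 2\beta\sum_{i=1}^{n} X_i = 0 \quad \text{at Max or Min} \]
\[ \Rightarrow \hat{\alpha} = \frac{\sum_{i=1}^{n} Y_i}{n} - \hat{\beta}\,\frac{\sum_{i=1}^{n} X_i}{n} = \bar{Y} - \hat{\beta}\bar{X} \]
\[ \frac{\partial SSE}{\partial \beta} = -2\sum_{i=1}^{n} X_iY_i + 2\alpha\sum_{i=1}^{n} X_i + 2\beta\sum_{i=1}^{n} X_i^2 = 0 \quad \text{at Max or Min} \]
Substituting α̂ = Ȳ − β̂X̄ into the second equation and solving for β gives
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} X_iY_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)/n}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2/n} \]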
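The two derivative conditions form a pair of linear equations in α and β (the normal equations). A minimal R sketch, assuming the humidity data from the example, solves them with `solve()` and checks that both derivatives vanish at the solution. The names `A`, `b`, `est` and `grad` are illustrative only.

```r
# Humidity data from the example
x <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
y <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)
n <- length(x)

# Normal equations written as A %*% c(alpha, beta) = b:
#   n * alpha      + sum(x)   * beta = sum(y)
#   sum(x) * alpha + sum(x^2) * beta = sum(x * y)
A <- matrix(c(n, sum(x), sum(x), sum(x^2)), nrow = 2)
b <- c(sum(y), sum(x * y))
est <- solve(A, b)  # est[1] is alpha-hat, est[2] is beta-hat

# Partial derivatives of SSE evaluated at the solution (should be ~0)
e <- y - est[1] - est[2] * x
grad <- c(-2 * sum(e), -2 * sum(x * e))

print(est)
print(grad)
```

Because SSE is a sum of squares that is quadratic in α and β, the stationary point found this way is a minimum rather than a maximum.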
[11]
  Xi    Yi     Xi²    Yi²    XiYi
  46    12    2116    144     552
  53    15    2809    225     795
  29     7     841     49     203
  61    17    3721    289    1037
  36    10    1296    100     360
  39    11    1521    121     429
  47    11    2209    121     517
  49    12    2401    144     588
  52    14    2704    196     728
  38     9    1444     81     342
  55    16    3025    256     880
  32     8    1024     64     256
  57    18    3249    324    1026
  54    14    2916    196     756
  44    12    1936    144     528
 692   186   33212   2454    8997   (totals)

X̄ = 692/15 = 46.1333, Ȳ = 186/15 = 12.4
ΣXi² = 33212, ΣYi² = 2454, ΣXiYi = 8997
\[ \hat{\beta} = \frac{8997 - 15(46.1333)(12.4)}{33212 - 15(46.1333)^2} = \frac{416.2}{1287.8} = 0.3232 \]
α̂ = Ȳ − β̂X̄ = 12.4 − 0.3232(46.1333) = −2.51
[12]
The fitted regression equation can be written Ŷ = α̂ + β̂X, or in the case of the humidity data Ŷ = −2.51 + 0.3232X.
When the relative humidity (X) is known to be 45, the average predicted moisture content would be
Ŷ = −2.51 + 0.3232(45) = 12.03%.
[13]
When the relative humidity is known to be 36, the average predicted moisture content would be
Ŷ = −2.51 + 0.3232(36) = 9.13%.
We actually have recorded data for X = 36, and the moisture content was found to be 10%.
We form a residual for the fifth observation as
ei = Yi − Ŷi = 10 − 9.13 = 0.87
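The prediction and residual calculations can be sketched in R as follows, assuming the humidity data from the example. The object names (`slr`, `pred`, `resid_5`) are illustrative. Because `lm()` keeps the coefficients at full precision, the printed values can differ in the second decimal place from hand calculations that use rounded coefficients.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

# Fit the simple linear regression of moisture on humidity
slr <- lm(moisture ~ humidity)

# Average predicted moisture content at relative humidity 45 and 36
pred <- unname(predict(slr, data.frame(humidity = c(45, 36))))
print(pred)

# Residual = data - model for the fifth observation (humidity 36, moisture 10)
resid_5 <- moisture[5] - pred[2]
print(resid_5)
```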
[14]
RESIDUAL = DATA - MODEL
  Xi    Yi       Ŷi        ei       ei²
  46    12   12.3572   -0.3572   0.12759
  53    15   14.6196    0.3804   0.14470
  29     7    6.8628    0.1372   0.01882
  61    17   17.2052   -0.2052   0.04211
  36    10    9.1252    0.8748   0.76528
  39    11   10.0948    0.9052   0.81939
  47    11   12.6804   -1.6804   2.82374
  49    12   13.3268   -1.3268   1.76040
  52    14   14.2964   -0.2964   0.08785
  38     9    9.7716   -0.7716   0.59537
  55    16   15.2660    0.7340   0.53876
  32     8    7.8324    0.1676   0.02809
  57    18   15.9124    2.0876   4.35807
  54    14   14.9428   -0.9428   0.88887
  44    12   11.7108    0.2892   0.08364
 Totals                 0.0     13.083 = SSE
[15]
[Figure: three panels of Moisture against Humidity showing, for one observation, the total deviation Yi − Ȳ, the explained deviation Ŷi − Ȳ, and the residual ei = Yi − Ŷi about the fitted line]
\[ SSTO = \sum_{i=1}^{n}(Y_i-\bar{Y})^2 = \sum_{i=1}^{n}Y_i^2 - n\bar{Y}^2 = 147.60 \]
\[ SSE = \sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2 = 13.08 \]
\[ SSR = SSTO - SSE = 134.52 \]
[16]
\[ \sum_{i=1}^{n}(Y_i-\bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2 + \sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2 \]
SSTO = SSR + SSE
Total Variation = Explained Variation + Unexplained Variation
(n − 1) df's = (p − 1) df's + (n − 2) df's
where p equals the number of parameters being estimated, in our case p = 2 (i.e. intercept and slope).
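A short R sketch of this decomposition, assuming the humidity data from the example (the names `ssto`, `ssr` and `sse` are illustrative), computes each sum of squares directly from its definition:

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

slr   <- lm(moisture ~ humidity)
y_hat <- fitted(slr)
y_bar <- mean(moisture)

ssto <- sum((moisture - y_bar)^2)  # total variation
ssr  <- sum((y_hat - y_bar)^2)     # explained variation
sse  <- sum((moisture - y_hat)^2)  # unexplained variation

# Degrees of freedom: (n - 1) = (p - 1) + (n - p), with p = 2 parameters
n <- length(moisture)
p <- 2
print(c(SSTO = ssto, SSR = ssr, SSE = sse))
print(c(df_total = n - 1, df_regression = p - 1, df_error = n - p))
```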
[17]
A Measure of the Relative Goodness-Of-Fit
\[ R^2 = \frac{\text{Explained Variation in } Y}{\text{Total Variation in } Y} = \frac{SSR}{SSTO} = \frac{SSTO - SSE}{SSTO} = \frac{134.52}{147.6} = 0.911 \]
or 91.1%.
R² is interpreted as the percentage of the variation in the response variable Y that is explained by the simple linear regression on the explanatory variable X.
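As a small check on this measure, the R sketch below (assuming the humidity data from the example; `r_squared` is an illustrative name) computes R² from the sums of squares and compares it with the squared sample correlation, which coincides with R² in simple linear regression.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

slr  <- lm(moisture ~ humidity)
sse  <- sum(resid(slr)^2)
ssto <- sum((moisture - mean(moisture))^2)

# R-squared as the explained share of the total variation
r_squared <- (ssto - sse) / ssto

# With a single explanatory variable this equals the squared correlation
r_from_cor <- cor(humidity, moisture)^2

print(c(r_squared = r_squared, r_from_cor = r_from_cor))
```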
[18]
The regression equation is
Moisture = - 2.51 + 0.323 Humidity

Predictor       Coef     StDev       T       P
Constant      -2.510     1.315   -1.91   0.079
Humidity     0.32320   0.02796   11.56   0.000

S = 1.003   R-Sq = 91.1%

Analysis of Variance
Source        DF       SS       MS        F       P
Regression     1   134.52   134.52   133.67   0.000
Error         13    13.08     1.01
Total         14   147.60

σ̂ = √MSE = √1.0064 = 1.003 on 13 degrees of freedom
[19]
Estimating A Confidence Interval for β
Using statistical theory we can derive a formula for the standard error of β̂:
\[ SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}} \]
We may use a confidence interval to quantify the uncertainty associated with the slope.
A confidence interval is calculated as the point estimate ± a value from the t tables times the standard error of the point estimate:
\[ \left(\hat{\beta} - t_{n-2\,df}\,SE(\hat{\beta}),\ \ \hat{\beta} + t_{n-2\,df}\,SE(\hat{\beta})\right) \]
[20]
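\[ \hat{\beta} \pm t_{n-2\,df}\,SE(\hat{\beta}) = 0.3232 \pm 2.16(0.02796) = (0.2628,\ 0.3836) \]
\[ SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum (X_i-\bar{X})^2}} = \frac{1.003}{\sqrt{1287.78}} \approx 0.0280 \]
\[ \sum (X_i-\bar{X})^2 = \sum X_i^2 - n\bar{X}^2 = 33212 - 15(46.1333)^2 = 1287.78 \]
The value 2.16 comes from a t-distribution on (n − 2) = 13 degrees of freedom.
The values 1.003 and 0.02796 can be read from the MINITAB output.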
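The interval can be reproduced step by step in R. This is a sketch assuming the humidity data from the example; the names `s`, `sxx`, `se_beta` and `ci` are illustrative. The result is compared with R's built-in `confint()`.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)
n <- length(humidity)

slr <- lm(moisture ~ humidity)
beta_hat <- unname(coef(slr)["humidity"])

# Residual standard deviation on n - 2 degrees of freedom
s <- sqrt(sum(resid(slr)^2) / (n - 2))

# Sum of squared deviations of X and the standard error of the slope
sxx <- sum((humidity - mean(humidity))^2)
se_beta <- s / sqrt(sxx)

# 97.5th percentile of the t-distribution on n - 2 df
t_crit <- qt(0.975, df = n - 2)

ci <- beta_hat + c(-1, 1) * t_crit * se_beta
print(c(se_beta = se_beta, t_crit = t_crit))
print(ci)
print(confint(slr)["humidity", ])  # built-in equivalent
```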
[21]
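Hypothesis Testing About β
H0: β = 0 (% Moist. per Rel. Hum.)
Ha: β ≠ 0 (% Moist. per Rel. Hum.)
With a 0.05 level of significance the decision rule is: reject H0 if t* < −2.16 or if t* > +2.16.
\[ t^* = \frac{\hat{\beta} - 0}{SE(\hat{\beta})} = \frac{0.3232}{0.02796} = 11.56, \qquad t^* \sim t_{n-2\,df} \]
[Figure: t-distribution on 13 df, with 95% of the area between −2.16 and +2.16 and 2.5% in each tail]
Since 11.56 > +2.16, reject H0: β = 0.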
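A minimal R sketch of this test, assuming the humidity data from the example (`t_star`, `p_value` and `reject` are illustrative names), reads the slope and its standard error from the fitted model and applies the two-sided decision rule:

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

slr <- lm(moisture ~ humidity)
coefs <- summary(slr)$coefficients
beta_hat <- coefs["humidity", "Estimate"]
se_beta  <- coefs["humidity", "Std. Error"]
df <- df.residual(slr)  # n - 2 = 13

# Test statistic for H0: beta = 0
t_star <- (beta_hat - 0) / se_beta

# Two-sided decision rule at the 0.05 level, plus the p-value
t_crit  <- qt(0.975, df)
reject  <- abs(t_star) > t_crit
p_value <- 2 * pt(abs(t_star), df, lower.tail = FALSE)

print(c(t_star = t_star, t_crit = t_crit, p_value = p_value))
print(reject)
```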
[23]
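Statistical Inference for α
\[ SE(\hat{\alpha}) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}} \]
Test Statistic:
\[ t^* = \frac{\hat{\alpha} - 0}{SE(\hat{\alpha})} \sim t_{n-2\,df} \]
Confidence Interval:
\[ \hat{\alpha} \pm t_{n-2\,df}\,SE(\hat{\alpha}) \]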
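Applied to the humidity data, the intercept formulae can be sketched in R as below. The names `se_alpha`, `t_alpha` and `ci_alpha` are illustrative, and the data are assumed to be those of the humidity example.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)
n <- length(humidity)

slr <- lm(moisture ~ humidity)
alpha_hat <- unname(coef(slr)["(Intercept)"])
s   <- sqrt(sum(resid(slr)^2) / (n - 2))
sxx <- sum((humidity - mean(humidity))^2)

# Standard error of the intercept
se_alpha <- s * sqrt(1 / n + mean(humidity)^2 / sxx)

# Test statistic for H0: alpha = 0 and a 95% confidence interval
t_alpha  <- (alpha_hat - 0) / se_alpha
ci_alpha <- alpha_hat + c(-1, 1) * qt(0.975, n - 2) * se_alpha

print(c(se_alpha = se_alpha, t_alpha = t_alpha))
print(ci_alpha)
```

In this sketch the interval contains zero, which agrees with the MINITAB P-value of 0.079 for the Constant row.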
[24]
The regression equation is
Moisture = - 2.51 + 0.323 Humidity

Predictor       Coef     StDev       T       P
Constant      -2.510     1.315   -1.91   0.079
Humidity     0.32320   0.02796   11.56   0.000

S = 1.003   R-Sq = 91.1%

Analysis of Variance
Source        DF       SS       MS        F       P
Regression     1   134.52   134.52   133.67   0.000
Error         13    13.08     1.01
Total         14   147.60

Note: 11.56 × 11.56 = 133.63, i.e. the square of the T value for Humidity matches the F value up to rounding.
[25]
F-Test:
H0: β = 0 (% Moist. per Rel. Hum.)
Ha: β ≠ 0 (% Moist. per Rel. Hum.)
\[ F^* = \frac{MSR}{MSE}, \quad \text{where} \quad MSR = \frac{SSR}{\text{Regression DFs}} = \frac{134.52}{1} = 134.52 \]
Note: Large values of F* lead to the rejection of H0.
Critical Value = F.05 = 4.67 (1 df Numerator, 13 df Denominator)
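[26]
[Figure: F-distribution with 1 and 13 df; "Do Not Reject H0" to the left of the critical value 4.67, "Reject H0" to the right, where the tail Area = 5%]
Decision Rule: Do not reject H0 if F* = MSR/MSE < 4.67. Reject H0 if F* = MSR/MSE > 4.67.
\[ F^* = \frac{MSR}{MSE} = \frac{134.52}{1.01} = 133.67 \quad \text{(computed from the unrounded MSE)} \]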
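The F-test can be sketched in R as follows, again assuming the humidity data from the example; `f_star` and `f_crit` are illustrative names. The sketch also checks that F* equals the square of the t statistic for the slope, which holds for simple linear regression.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)
n <- length(humidity)

slr <- lm(moisture ~ humidity)
sse <- sum(resid(slr)^2)
ssr <- sum((fitted(slr) - mean(moisture))^2)

msr <- ssr / 1        # regression mean square, 1 df
mse <- sse / (n - 2)  # error mean square, 13 df
f_star <- msr / mse

# 5% critical value with 1 numerator df and n - 2 denominator df
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)

# t statistic for the slope, for comparison with sqrt(F*)
t_star <- summary(slr)$coefficients["humidity", "t value"]

print(c(f_star = f_star, f_crit = f_crit, t_squared = t_star^2))
print(f_star > f_crit)  # TRUE means reject H0
```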
[27]
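# Fit the simple linear regression of moisture on humidity
options(show.signif.stars = FALSE)
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)
slr <- lm(moisture ~ humidity)
slr
summary(slr)   # coefficient table, residual standard error and R-squared
anova(slr)     # analysis of variance table
# Scatter plot with the fitted line
plot(x = humidity, y = moisture)
abline(slr, col = "red", lwd = 2)
confint(slr)   # 95% confidence intervals for intercept and slope
# Approximate 95% bands for the mean response: fit +/- 2 standard errors
grid <- seq(30, 60, by = 0.1)
fits <- predict(slr, data.frame(humidity = grid), se.fit = TRUE)
lines(grid, fits$fit + 2 * fits$se.fit, col = "blue", lty = 2)
lines(grid, fits$fit - 2 * fits$se.fit, col = "blue", lty = 2)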
[28]
Mail Processing Hours (Fiscal Years 1962-63)

                  Fiscal Year 1962                   Fiscal Year 1963
Four-week     Pieces of mail   Manhours         Pieces of mail   Manhours
accounting    handled          used             handled          used
period        (in millions)    (in thousands)   (in millions)    (in thousands)
  1             157               572              154               569
  2             161               570              157               564
  3             168               645              164               573
  4             186               645              188               667
  5             183               645              191               700
  6             184               671              180               765
  7             268              1053              270              1070
  8             180               675              180               637
  9             175               670              172               650
 10             193               710              184               655
 11             184               656              179               665
 12             179               640              169               599
 13             164               599              160               605
[29]
[Figure: line plots of Manhours and Volume against four-week periods]
[30]
[Figure: line plots of Manhours and Volume against four-week periods, Christmas excluded]
[31]
[Figure: scatter plot of Manhours against Volume]
[32]
[Figure: scatter plot of Manhours against Volume, with curve representing return to scale]
[33]
Simple linear regression model with Normal model for chance variation
[Figure: plot of Manhours against Volume with the regression line and Normal curves for chance variation]
Y = α + βX + ε
[34]
The simple linear regression model
• Y = α + βX + ε
Y is the Response variable
X is the Explanatory variable
• Model parameters:
α and β are the linear parameters
hidden parameter, standard deviation σ,
measures spread of Normal curve
[35]
The simple linear regression model
• Choosing values for the regression coefficients
– the method of least squares
• Interpreting the fitted line
• Using the fitted line; prediction
• A model for chance causes of variation
• Estimating σ
[36]
Case study: Mail processing costs in a U.S. Post Office
[38]
[Figure: scatter plot with grid (to assist in reading x- and y-values)]
[41]
Choosing values for the regression coefficients
Given values for α and β, the fitted values of Y are
α + βX1,
α + βX2,
α + βX3,
…,
α + βXn
[42]
Choosing values for the regression coefficients
Find values for α and β that minimise the deviations
Y1 − α − βX1,
Y2 − α − βX2,
Y3 − α − βX3,
…,
Yn − α − βXn
[43]
Trial regression lines, with "residuals"
[44]
The method of least squares
Find values for α and β that minimise the sum of the squared deviations:
(Y1 − α − βX1)²
+ (Y2 − α − βX2)²
+ (Y3 − α − βX3)²
+ …
+ (Yn − α − βXn)²
[45]
"Least squares" regression line, with "residuals"
[46]
The method of least squares
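Solution:
\[ \hat{\beta} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} \]
For these data,
β̂ = 3.3
α̂ = 50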
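The solution can be sketched in R for the mail processing data. One assumption is made loudly here: the fit drops the two Christmas periods (period 7 in each year) and the period with 765 thousand manhours at a volume of 180 million (period 6 under Fiscal Year 1963 in the data table). This exclusion rule is inferred from the "Christmas excluded" plot, the equipment-breakdown exercise and the 21 degrees of freedom quoted later, and is not stated outright in the slides. The names `mail`, `keep` and `fit` are illustrative.

```r
# Mail processing data, 13 four-week periods in each fiscal year
mail <- data.frame(
  year     = rep(c(1962, 1963), each = 13),
  period   = rep(1:13, times = 2),
  volume   = c(157, 161, 168, 186, 183, 184, 268, 180, 175, 193, 184, 179, 164,
               154, 157, 164, 188, 191, 180, 270, 180, 172, 184, 179, 169, 160),
  manhours = c(572, 570, 645, 645, 645, 671, 1053, 675, 670, 710, 656, 640, 599,
               569, 564, 573, 667, 700, 765, 1070, 637, 650, 655, 665, 599, 605)
)

# ASSUMPTION: exclude both Christmas periods and the 765-manhour period
keep <- mail$period != 7 & !(mail$year == 1963 & mail$period == 6)
used <- mail[keep, ]

# Least squares slope and intercept from the solution formulae
beta_hat  <- sum((used$volume - mean(used$volume)) *
                 (used$manhours - mean(used$manhours))) /
  sum((used$volume - mean(used$volume))^2)
alpha_hat <- mean(used$manhours) - beta_hat * mean(used$volume)

fit <- lm(manhours ~ volume, data = used)  # same fit via lm()
print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(coef(fit))
```

Under this assumption the estimates round to the slide values of 50 and 3.3.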
[47]
Interpretation
β̂ is the marginal change in Y for a unit change in X.
Check the measurement units!
α̂ is overheads.
WARNING
[48]
"Least squares" regression line, with non-linear extensions
[49]
Using the fitted line; prediction
Prediction equation: Ŷ = α̂ + β̂X
Prediction equation allowing for chance variation: Ŷ = α̂ + β̂X ± 2σ̂
Original model: Y = α + βX + ε, SD(ε) = σ
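A hedged R sketch of the prediction rule follows. It assumes the 23 mail processing periods that remain after dropping the two Christmas periods and the 765-manhour period (an exclusion inferred from the slides, not stated in them), and it uses a hypothetical volume of 170 million pieces purely for illustration. The ± 2σ̂ band follows the slide's simple rule and ignores the extra uncertainty in the estimated coefficients.

```r
# ASSUMPTION: 23 retained periods (Christmas and the 765-manhour period dropped)
volume   <- c(157, 161, 168, 186, 183, 184, 180, 175, 193, 184, 179, 164,
              154, 157, 164, 188, 191, 180, 172, 184, 179, 169, 160)
manhours <- c(572, 570, 645, 645, 645, 671, 675, 670, 710, 656, 640, 599,
              569, 564, 573, 667, 700, 637, 650, 655, 665, 599, 605)

fit <- lm(manhours ~ volume)
s   <- summary(fit)$sigma  # estimate of sigma

# Hypothetical new volume (millions of pieces), chosen for illustration
new_volume <- 170
y_hat <- unname(predict(fit, data.frame(volume = new_volume)))

# Prediction allowing for chance variation: y_hat +/- 2 * sigma-hat
lower <- y_hat - 2 * s
upper <- y_hat + 2 * s
print(c(prediction = y_hat, lower = lower, upper = upper))
```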
[51]
Estimating σ
σ measures spread of deviations from the true line.
Estimate σ by s, the standard deviation of deviations from the fitted line, via
fitted values: Ŷi = α̂ + β̂Xi
and
residuals: ei = Yi − Ŷi
\[ \hat{\sigma} = s = \sqrt{\frac{\sum e_i^2}{n-2}} \]
= 20 for our example
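The estimate s can be computed from the residuals in R, as sketched below. The sketch assumes the 23 mail processing periods left after excluding the two Christmas periods and the 765-manhour period (an inferred exclusion rule); the names `e` and `s` are illustrative.

```r
# ASSUMPTION: 23 retained periods (Christmas and the 765-manhour period dropped)
volume   <- c(157, 161, 168, 186, 183, 184, 180, 175, 193, 184, 179, 164,
              154, 157, 164, 188, 191, 180, 172, 184, 179, 169, 160)
manhours <- c(572, 570, 645, 645, 645, 671, 675, 670, 710, 656, 640, 599,
              569, 564, 573, 667, 700, 637, 650, 655, 665, 599, 605)
n <- length(volume)

fit <- lm(manhours ~ volume)

# Fitted values and residuals from the estimated coefficients
y_hat <- coef(fit)[["(Intercept)"]] + coef(fit)[["volume"]] * volume
e     <- manhours - y_hat

# Standard deviation of the residuals, dividing by n - 2
s <- sqrt(sum(e^2) / (n - 2))
print(s)
print(summary(fit)$sigma)  # the same quantity as reported by lm()
```

Under this assumption s comes out close to 19, which the slides round to 20.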
[52]
The estimated model:
Y = 50 + 3.3X ± 40
Exercise
Use the prediction formula to estimate the loss incurred through equipment breakdown in Period 6, Fiscal 1963, when Y was 765 and X was 180.
[53]
Homework
Given the Volume figures for periods 1, 6 and 7 of Fiscal Year 1963, what predictions, including prediction errors, would you make for the Manhours requirement?
Recall: α̂ = 50, β̂ = 3.3, σ̂ = s = 20
How do these predictions relate to the actual manhours used?
Comment.
[57]
Calculating the regression
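by formula:
\[ \hat{\beta} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} \]
For these data,
β̂ = 3.3
α̂ = 50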
[58]
Calculating the regression
by computer
[59]
The "constant" variable?
Y = α + βX + ε
Y = α × 1 + β × X + ε
[60]
Calculating the prediction formula
Manhours = 50.4394 + 3.34544 × Volume ± 2 × 18.93
[61]
Standard errors of estimated regression coefficients
• Regression coefficient estimate subject to chance variation
• Normal model applies
• Standard deviation of the Normal model is the standard error of the coefficient estimate
[62]
Application 1: Confidence interval for marginal change
Recall confidence interval for
or
Confidence interval for β:
β̂ ± 2 SE(β̂)
[63]
More results
Exercise: Calculate a 95% confidence interval for β.
Calculate a 95% CI for the change in manhours corresponding to a 10m (10 million) increase in pieces of mail handled.
[64]
Point Estimate    β̂               3.34544
Standard Error    SE(β̂)           0.3401
95% CI            β̂ ± 2 SE(β̂)     3.34544 ± 2 × 0.3401
                                   3.34544 ± 0.6802
                                   2.665 to 4.026
[65]
Point Estimate    β̂                          3.34544
Standard Error    SE(β̂)                      0.3401
95% CI            β̂ ± t(n−2 df) SE(β̂)        3.34544 ± 2.0796 × 0.3401
                                              3.34544 ± 0.7073
                                              2.638 to 4.053
Here t = 2.0796 is taken from the t-distribution on 21 df.
Compare 2.665 to 4.026 using β̂ ± 2 SE(β̂).
[66]
Point Estimate    10β̂                   33.4544
Standard Error    10 SE(β̂)              3.401
95% CI            10β̂ ± 2 × 10 SE(β̂)    33.4544 ± 2 × 3.401
                                         33.4544 ± 6.802
                                         26.65 to 40.26
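These intervals can be reproduced with the R sketch below. It assumes the 23 mail processing periods left after excluding the two Christmas periods and the 765-manhour period (an inferred exclusion rule); `ci_approx`, `ci_t` and `ci_10m` are illustrative names.

```r
# ASSUMPTION: 23 retained periods (Christmas and the 765-manhour period dropped)
volume   <- c(157, 161, 168, 186, 183, 184, 180, 175, 193, 184, 179, 164,
              154, 157, 164, 188, 191, 180, 172, 184, 179, 169, 160)
manhours <- c(572, 570, 645, 645, 645, 671, 675, 670, 710, 656, 640, 599,
              569, 564, 573, 667, 700, 637, 650, 655, 665, 599, 605)

fit   <- lm(manhours ~ volume)
coefs <- summary(fit)$coefficients
beta_hat <- coefs["volume", "Estimate"]
se_beta  <- coefs["volume", "Std. Error"]

# Approximate 95% CI using a multiplier of 2
ci_approx <- beta_hat + c(-1, 1) * 2 * se_beta

# 95% CI using the t-distribution on n - 2 = 21 df
t_crit <- qt(0.975, df.residual(fit))
ci_t   <- beta_hat + c(-1, 1) * t_crit * se_beta

# Change in manhours for a 10 million increase in volume: scale by 10
ci_10m <- 10 * ci_approx

print(c(beta_hat = beta_hat, se_beta = se_beta, t_crit = t_crit))
print(rbind(ci_approx, ci_t, ci_10m))
```

Multiplying the slope by 10 multiplies its standard error by 10 as well, so the interval for the 10 million change is simply the slope interval scaled by 10.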
[67]
Application 2: Testing the statistical significance of the slope
Formal test:
H0: β = 0
Test statistic:
\[ t^* = \frac{\hat{\beta} - 0}{SE(\hat{\beta})} = \frac{\hat{\beta}}{SE(\hat{\beta})} \]
Calculated value: 9.84
Critical value: 2.0796 (t-dist, 21 df) or 2 (approx)
Comparison: |9.84| > 2.0796 cutoff
Conclusion: REJECT H0