
[1]

Simple Linear Regression

The general equation of a line is $Y = c + mX$ or $Y = \alpha + \beta X$.

[Figure: four example lines, labelled by the signs of the two coefficients: (> 0, > 0), (> 0, = 0), (= 0, < 0), (> 0, < 0).]

[3]

Regression analysis is a technique for quantifying the relationship between a response variable (or dependent variable) and one or more predictor (independent or explanatory) variables.

Two Main Purposes:

• To predict the dependent variable based on specified values for the predictor variable(s).

• To understand how the predictor variable(s) influence or relate to the dependent variable.

Example - Humidity Data

The raw material used in the production of a certain synthetic fiber is stored in a location without humidity control. Measurements of the relative humidity in the storage location and the moisture content (in %) of a sample of the raw material were taken over 15 days.

Rel. Humidity:  46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44.

Mois. Content:  12, 15,  7, 17, 10, 11, 11, 12, 14,  9, 16,  8, 18, 14, 12.

Relative Humidity takes the role of explanatory variable.

Moisture Content takes the role of dependent variable.

[5]

[Figure: scatter plot of Moisture Content (%) against Relative Humidity.]

[6]

The Regression Model

The Simple Linear Regression Model can be stated as

$Y_i = \alpha + \beta X_i + \varepsilon_i$

$Y_i$ is the value of the response variable in the $i$th trial

$\alpha$ and $\beta$ are the intercept and slope parameters

$X_i$ is a known constant, namely the value of the explanatory variable in the $i$th trial

$\varepsilon_i$ is an unobservable random error term such that $\varepsilon_i \sim N(0, \sigma^2)$.

$\varepsilon_i$ is also referred to as the stochastic element of the regression model $Y_i = \alpha + \beta X_i + \varepsilon_i$.

[7]

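[Figure: scatter plot of Moisture Content (%) against Relative Humidity, with a candidate line and the vertical distances from each point to it.]

Minimise Vertical Distances of Data to 'Best Fit Line'

$Y_i = \alpha + \beta X_i + \varepsilon_i$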

[8]

Formulae For Least Squares Method

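The most common method for computing the estimated intercept, $\hat{\alpha}$, and the estimated slope, $\hat{\beta}$, is least squares.

The fitted regression line is calculated as $\hat{Y} = \hat{\alpha} + \hat{\beta} X$, where

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$$

and $\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$.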

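LEAST SQUARES ESTIMATES

Seek to minimise the sum of the squared residuals

$$SSE = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$$

Take derivatives wrt $\alpha$ and $\beta$ and set to zero...

$$\frac{\partial SSE}{\partial \alpha} = -2\sum_{i=1}^{n} Y_i + 2n\alpha + 2\beta\sum_{i=1}^{n} X_i = 0 \quad \text{at Max or Min}$$

$$\Rightarrow \hat{\alpha} = \sum_{i=1}^{n} Y_i / n - \hat{\beta} \sum_{i=1}^{n} X_i / n = \bar{Y} - \hat{\beta}\bar{X}$$

$$\frac{\partial SSE}{\partial \beta} = -2\sum_{i=1}^{n} X_i Y_i + 2\alpha\sum_{i=1}^{n} X_i + 2\beta\sum_{i=1}^{n} X_i^2 = 0 \quad \text{at Max or Min}$$

Substituting $\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$ into the second equation:

$$\sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n} = \hat{\beta}\left(\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right)$$

$$\Rightarrow \hat{\beta} = \frac{\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i / n}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}$$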

[11]

  Xi    Yi    Xi^2    Yi^2    XiYi
  46    12    2116     144     552
  53    15    2809     225     795
  29     7     841      49     203
  61    17    3721     289    1037
  36    10    1296     100     360
  39    11    1521     121     429
  47    11    2209     121     517
  49    12    2401     144     588
  52    14    2704     196     728
  38     9    1444      81     342
  55    16    3025     256     880
  32     8    1024      64     256
  57    18    3249     324    1026
  54    14    2916     196     756
  44    12    1936     144     528
 ---   ---   -----    ----    ----
 692   186   33212    2454    8997

$\sum X_i = 692$, so $\bar{X} = 46.1333$; $\sum Y_i = 186$, so $\bar{Y} = 12.4$

$\sum X_i^2 = 33212$, $\sum Y_i^2 = 2454$, $\sum X_i Y_i = 8997$

$$\hat{\beta} = \frac{8997 - 15(46.1333)(12.4)}{33212 - 15(46.1333)^2} = 0.3232$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} = 12.4 - 0.3232(46.1333) = -2.51$$
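The hand calculation can be checked with a few lines of R. This is a minimal sketch using the humidity data from the example; the variable names (`sxy`, `sxx`, `alpha_hat`, `beta_hat`) are illustrative choices, not part of the original slides.

```r
# Humidity data from the example (15 daily measurements).
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

n    <- length(humidity)
xbar <- mean(humidity)   # 692 / 15
ybar <- mean(moisture)   # 186 / 15

# Corrected sum of cross-products and corrected sum of squares.
sxy <- sum(humidity * moisture) - n * xbar * ybar
sxx <- sum(humidity^2) - n * xbar^2

beta_hat  <- sxy / sxx                 # least squares slope
alpha_hat <- ybar - beta_hat * xbar    # least squares intercept

print(round(c(alpha = alpha_hat, beta = beta_hat), 4))
```

The same estimates should be returned by `coef(lm(moisture ~ humidity))`.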

[12]

The fitted regression equation can be written $\hat{Y} = \hat{\alpha} + \hat{\beta} X$, or in the case of the humidity data $\hat{Y} = -2.51 + 0.3232X$.

When the relative humidity (X) is known to be 45, the average predicted moisture content would be

$\hat{Y} = -2.51 + 0.3232(45) = 12.03$%.

[13]

When the relative humidity is known to be 36, the average predicted moisture content would be

$\hat{Y} = -2.51 + 0.3232(36) = 9.13$%.

We actually have recorded data for X = 36, and the moisture content was found to be 10 %.

We form a residual for the fifth observation as

$e_5 = Y_5 - \hat{Y}_5 = 10 - 9.13 = 0.87$
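As a quick check of the prediction and residual arithmetic, here is a small R sketch. It assumes the rounded coefficients reported in the MINITAB fit later in the slides (intercept -2.51, slope 0.3232); the helper name `predict_moisture` is hypothetical.

```r
# Rounded least squares estimates (assumed from the MINITAB output).
alpha_hat <- -2.51
beta_hat  <- 0.3232

# Hypothetical helper: average predicted moisture content at humidity x.
predict_moisture <- function(x) alpha_hat + beta_hat * x

y_hat_45 <- predict_moisture(45)   # about 12.03
y_hat_36 <- predict_moisture(36)   # about 9.13

# Residual for the fifth observation (X = 36, observed Y = 10).
e5 <- 10 - y_hat_36                # about 0.87

print(round(c(y_hat_45 = y_hat_45, y_hat_36 = y_hat_36, e5 = e5), 4))
```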

[14]

RESIDUAL = DATA - MODEL

  Xi    Yi    Fitted Yi      ei        ei^2
  46    12     12.3572    -0.3572    0.12759
  53    15     14.6196     0.3804    0.14470
  29     7      6.8628     0.1372    0.01882
  61    17     17.2052    -0.2052    0.04211
  36    10      9.1252     0.8748    0.76528
  39    11     10.0948     0.9052    0.81939
  47    11     12.6804    -1.6804    2.82374
  49    12     13.3268    -1.3268    1.76040
  52    14     14.2964    -0.2964    0.08785
  38     9      9.7716    -0.7716    0.59537
  55    16     15.2660     0.7340    0.53876
  32     8      7.8324     0.1676    0.02809
  57    18     15.9124     2.0876    4.35807
  54    14     14.9428    -0.9428    0.88887
  44    12     11.7108     0.2892    0.08364
                          -------   --------
                            0.0      13.083 = SSE

[15]

[Figure: three panels plotting Moisture against Humidity, showing the deviations $Y_i - \bar{Y}$, $\hat{Y}_i - \bar{Y}$ and $e_i = Y_i - \hat{Y}_i$ relative to the fitted line $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$.]

$$SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = 147.60$$

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 13.08$$

$$SSR = SSTO - SSE = 134.52$$

[16]

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

SSTO = SSR + SSE

Total Variation = Explained Variation + Unexplained Variation

(n - 1) df's = (p - 1) df's + (n - 2) df's

where p equals the number of parameters being estimated, in our case p = 2 (i.e. intercept and slope).

[17]

A Measure of the Relative Goodness-Of-Fit

$$R^2 = \frac{\text{Explained Variation in } Y}{\text{Total Variation in } Y} = \frac{SSR}{SSTO} = \frac{SSTO - SSE}{SSTO} = \frac{134.523}{147.6} = 0.911$$

or 91.1%.

R2 is interpreted as the percentage variation in the response variable Y, explained through the simple linear regression on the explanatory variable X.
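The sums of squares and $R^2$ for the humidity data can be reproduced as follows. This is a sketch only; it uses base R's `lm()`, `fitted()` and `residuals()`, and the variable names are illustrative.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

fit <- lm(moisture ~ humidity)

ssto <- sum((moisture - mean(moisture))^2)      # total variation
sse  <- sum(residuals(fit)^2)                   # unexplained variation
ssr  <- sum((fitted(fit) - mean(moisture))^2)   # explained variation

r2 <- ssr / ssto                                # proportion of variation explained

print(round(c(SSTO = ssto, SSR = ssr, SSE = sse, R2 = r2), 3))
```

The value of `r2` should agree with `summary(fit)$r.squared`.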

[18]

The regression equation is
Moisture = - 2.51 + 0.323 Humidity

Predictor       Coef      StDev        T        P
Constant      -2.510      1.315    -1.91    0.079
Humidity     0.32320    0.02796    11.56    0.000

S = 1.003    R-Sq = 91.1%

Analysis of Variance
Source        DF        SS        MS         F        P
Regression     1    134.52    134.52    133.67    0.000
Error         13     13.08      1.01
Total         14    147.60

$\hat{\sigma} = \sqrt{MSE} = \sqrt{1.0064} = 1.003$ on 13 degrees of freedom

[19]

Estimating A Confidence Interval for $\beta$

Using statistical theory we can derive a formula for the standard error of $\hat{\beta}$:

$$SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum (X_i - \bar{X})^2}}$$

We may use a confidence interval to quantify the uncertainty associated with the slope.

A confidence interval will be calculated as the point estimate ± a value from the t tables times the standard error of the point estimate:

$$\left( \hat{\beta} - t_{n-2\,df}\, SE(\hat{\beta}),\ \ \hat{\beta} + t_{n-2\,df}\, SE(\hat{\beta}) \right)$$

[20]

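$$\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2 = 33212 - 15(46.1333)^2 \approx 1287.7$$

$$SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum (X_i - \bar{X})^2}} = \frac{1.003}{\sqrt{1287.7}} \approx 0.0280$$

Here $\hat{\sigma} = 1.003$ is read from the MINITAB output, which reports the standard error of the slope as 0.02796.

The multiplier 2.16 comes from a t-distribution on (n-2) = 13 degrees of freedom.

$$\hat{\beta} \pm t_{n-2\,df}\, SE(\hat{\beta}) = 0.3232 \pm 2.16(0.02796) = (0.2628,\ 0.3836)$$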
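The interval can be reproduced in R as a check. This sketch computes the standard error from its formula and uses `qt()` for the t multiplier; variable names are illustrative.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

fit <- lm(moisture ~ humidity)
n   <- length(humidity)

beta_hat <- unname(coef(fit)["humidity"])
s        <- sqrt(sum(residuals(fit)^2) / (n - 2))   # estimate of sigma
sxx      <- sum((humidity - mean(humidity))^2)      # corrected sum of squares of X
se_beta  <- s / sqrt(sxx)                           # standard error of the slope

t_crit <- qt(0.975, df = n - 2)                     # two-sided 95% multiplier, 13 df
ci     <- beta_hat + c(-1, 1) * t_crit * se_beta

print(round(c(se = se_beta, t = t_crit, lower = ci[1], upper = ci[2]), 4))
```

`confint(fit)["humidity", ]` should give the same interval.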

[21]

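Hypothesis Testing About $\beta$

Ho: $\beta = 0$ (% Moist. per Rel. Hum.)

Ha: $\beta \neq 0$ (% Moist. per Rel. Hum.)

With a 0.05 level of significance the decision rule is: reject Ho if t* < -2.16 or if t* > +2.16.

$$t^* = \frac{\hat{\beta} - 0}{SE(\hat{\beta})} \sim t_{n-2\,df}$$

$$t^* = \frac{0.3232}{0.02796} = 11.56$$

[Figure: t-distribution on 13 df, with 2.5% in each tail beyond -2.16 and +2.16 and 95% in between.]

Reject Ho: $\beta = 0$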
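A short R sketch of the same two-sided t test, assuming the humidity data above. The p-value line goes slightly beyond the slide, which only uses the critical-value rule.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

fit <- lm(moisture ~ humidity)
n   <- length(humidity)

beta_hat <- unname(coef(fit)["humidity"])
s        <- sqrt(sum(residuals(fit)^2) / (n - 2))
se_beta  <- s / sqrt(sum((humidity - mean(humidity))^2))

t_star    <- (beta_hat - 0) / se_beta       # test statistic for Ho: beta = 0
t_crit    <- qt(0.975, df = n - 2)          # cutoff for a 0.05 level two-sided test
reject_h0 <- abs(t_star) > t_crit

# Two-sided p-value (an addition to the slide's critical-value rule).
p_value <- 2 * pt(-abs(t_star), df = n - 2)

print(c(t_star = round(t_star, 2), t_crit = round(t_crit, 2)))
print(reject_h0)
```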

[22]

The regression equation is
Moisture = - 2.51 + 0.323 Humidity

S = 1.003    R-Sq = 91.1%

Analysis of Variance

Source        DF        SS        MS         F        P
Regression     1    134.52    134.52    133.67    0.000
Error         13     13.08      1.01
Total         14    147.60

[23]

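Statistical Inference for $\alpha$

$$SE(\hat{\alpha}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)}$$

Test Statistic:

$$t^* = \frac{\hat{\alpha} - 0}{SE(\hat{\alpha})} \sim t_{n-2\,df}$$

Confidence Interval:

$$\hat{\alpha} \pm t_{n-2\,df}\, SE(\hat{\alpha})$$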
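To see these formulas at work, the sketch below applies them to the humidity data; the results can be compared with the Constant row of the MINITAB output (StDev 1.315, T -1.91). Variable names are illustrative.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

fit <- lm(moisture ~ humidity)
n   <- length(humidity)

alpha_hat <- unname(coef(fit)["(Intercept)"])
s2        <- sum(residuals(fit)^2) / (n - 2)        # estimate of sigma^2 (MSE)
sxx       <- sum((humidity - mean(humidity))^2)

# Standard error of the intercept.
se_alpha <- sqrt(s2 * (1 / n + mean(humidity)^2 / sxx))

t_star <- (alpha_hat - 0) / se_alpha                # test statistic for Ho: alpha = 0
ci     <- alpha_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_alpha

print(round(c(se = se_alpha, t = t_star, lower = ci[1], upper = ci[2]), 3))
```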

[24]

The regression equation is
Moisture = - 2.51 + 0.323 Humidity

S = 1.003    R-Sq = 91.1%

11.56 × 11.56 = 133.63

Analysis of Variance

Source        DF        SS        MS         F        P
Regression     1    134.52    134.52    133.67    0.000
Error         13     13.08      1.01
Total         14    147.60

[25]

F-Test:

Ho: $\beta = 0$ (% Moist. per Rel. Hum.)

Ha: $\beta \neq 0$ (% Moist. per Rel. Hum.)

$$F^* = \frac{MSR}{MSE}, \quad \text{where } MSR = \frac{SSR}{\text{Regression DFs}} = \frac{134.52}{1} = 134.52$$

Note: Large values of F* lead to the rejection of Ho.

Critical Value = F.05 = 4.67 (1 df Numerator, 13 df Denominator)

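[26]

[Figure: F-distribution with 1 and 13 df. The critical value 4.67 separates the "Do Not Reject H0" region from the "Reject H0" region; the area to the right of 4.67 is 5%.]

Decision Rule: Fail to reject Ho if F* = MSR/MSE < 4.67. Reject Ho if F* = MSR/MSE > 4.67.

$$F^* = \frac{MSR}{MSE} = \frac{134.52}{1.0064} = 133.67$$

(MSE is shown rounded to 1.01 in the ANOVA table.)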
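The F statistic and its critical value can be checked in R. This is a sketch on the humidity data using `qf()` for the cutoff; it also confirms that, in simple linear regression, F* equals the square of the slope's t statistic.

```r
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

fit <- lm(moisture ~ humidity)
n   <- length(humidity)

ssr <- sum((fitted(fit) - mean(moisture))^2)
sse <- sum(residuals(fit)^2)

msr <- ssr / 1          # regression mean square, 1 df
mse <- sse / (n - 2)    # error mean square, 13 df

f_star <- msr / mse
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)   # upper 5% point of F(1, 13)

# t statistic for the slope, taken from the coefficient table.
t_slope <- summary(fit)$coefficients["humidity", "t value"]

print(round(c(F = f_star, F_crit = f_crit, t_squared = t_slope^2), 2))
```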

[27]

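options(show.signif.stars = FALSE)

# Humidity data: relative humidity and moisture content (%) over 15 days
humidity <- c(46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44)
moisture <- c(12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12)

# Fit the simple linear regression and inspect the results
slr <- lm(moisture ~ humidity)
slr
summary(slr)
anova(slr)

# Scatter plot with the fitted line
plot(x = humidity, y = moisture)
abline(slr, col = "red", lwd = 2)

# 95% confidence intervals for the intercept and slope
confint(slr)

# Approximate 95% band for the fitted line: fit +/- 2 standard errors
grid <- seq(30, 60, by = 0.1)
fits <- predict(slr, data.frame(humidity = grid), se.fit = TRUE)
lines(grid, fits$fit + 2 * fits$se.fit, col = "blue", lty = 2)
lines(grid, fits$fit - 2 * fits$se.fit, col = "blue", lty = 2)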

[28]

Mail Processing Hours (Fiscal Years 1962-63)

                    Fiscal Year 1962                   Fiscal Year 1963
Four-week      Pieces of mail   Manhours         Pieces of mail   Manhours
accounting     handled          used             handled          used
period         (in millions)    (in thousands)   (in millions)    (in thousands)

 1                  157              572              154              569
 2                  161              570              157              564
 3                  168              645              164              573
 4                  186              645              188              667
 5                  183              645              191              700
 6                  184              671              180              765
 7                  268             1053              270             1070
 8                  180              675              180              637
 9                  175              670              172              650
10                  193              710              184              655
11                  184              656              179              665
12                  179              640              169              599
13                  164              599              160              605

[29]

Line plots of Manhours and Volume

[Figure: Manhours and Volume plotted against four-week periods (0 to 26).]

[30]

Line plots of Manhours and Volume, Christmas excluded

[Figure: Manhours and Volume plotted against four-week periods (0 to 26).]

[31]

Scatter plots of Manhours and Volume

[Figure: scatter plot of Manhours against Volume.]

[32]

Scatter plots of Manhours and Volume, with curve representing return to scale

[Figure: scatter plot of Manhours against Volume with a fitted curve.]

[33]

Simple linear regression model with Normal model for chance variation

[Figure: plot of Manhours against Volume.]

Y = α + βX + ε

[34]

The simple linear regression model

• Y = α + βX + ε

Y is the Response variable

X is the Explanatory variable

• Model parameters:

α and β are the linear parameters

hidden parameter, standard deviation σ,

measures spread of Normal curve

[35]


• Choosing values for the regression coefficients

– the method of least squares

• Interpreting the fitted line

• Using the fitted line; prediction

• A model for chance causes of variation

• Estimating σ

[36]

Case study: Mail processing costs in a U.S. Post Office

[Table: Mail Processing Hours, Fiscal Years 1962-63, as on slide [28].]

[37]


[Figure: scatter plot of Manhours against Volume.]

[38]

Scatter plot with grid (to assist in reading x- and y-values)

[39]


[Figure: plot of Manhours against Volume.]

Y = α + βX + ε

[40]


• Y = α + βX + ε

Y is the Response variable

X is the Explanatory variable

• Model parameters:

α and β are the linear parameters

hidden parameter, standard deviation σ,

measures spread of Normal curve

[41]

Choosing values for the regression coefficients

Given values for α and β, the fitted values of Y are

α + βX1,

α + βX2,

α + βX3,

...

α + βXn

[42]

Choosing values for the regression coefficients

Find values for α and β that minimise the deviations

Y1 − α − βX1,

Y2 − α − βX2,

Y3 − α − βX3,

...

Yn − α − βXn

[43]

Trial regression lines, with "residuals"

[44]

The method of least squares

Find values for α and β that minimise the sum of the squared deviations:

(Y1 − α − βX1)²

+ (Y2 − α − βX2)²

+ (Y3 − α − βX3)²

+ ...

+ (Yn − α − βXn)²

[45]

"Least squares" regression line, with "residuals"

[46]

The method of least squares

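Solution:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$

For these data,

$\hat{\beta} = 3.3$

$\hat{\alpha} = 50$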

[47]

Interpretation

$\hat{\beta}$ is the marginal change in Y for a unit change in X.

Check the measurement units!

$\hat{\alpha}$ is overheads.

WARNING

[48]

"Least squares" regression line, with non-linear extensions

[49]

Using the fitted line; prediction

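Prediction equation:

$\hat{Y} = \hat{\alpha} + \hat{\beta} X$

Prediction equation allowing for chance variation:

$\hat{Y} = \hat{\alpha} + \hat{\beta} X \pm 2\hat{\sigma}$

Original model:

$Y = \alpha + \beta X + \varepsilon$, SD($\varepsilon$) = $\sigma$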
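The prediction equation with its allowance for chance variation can be written as a small function. This is only a sketch: `predict_band` is a hypothetical helper, and its default arguments are the rounded estimates quoted in these slides (intercept 50, slope 3.3, standard deviation 20), taken here as assumptions.

```r
# Hypothetical helper: point prediction with a +/- 2 sigma allowance.
# Defaults are the rounded estimates quoted in the slides (assumed).
predict_band <- function(x, alpha_hat = 50, beta_hat = 3.3, sigma_hat = 20) {
  fit <- alpha_hat + beta_hat * x
  c(lower = fit - 2 * sigma_hat, fit = fit, upper = fit + 2 * sigma_hat)
}

# Example: a period in which 170 million pieces of mail are handled.
band_170 <- predict_band(170)
print(band_170)   # manhours in thousands: 571, 611, 651
```

The band is a rough guide only: it ignores the uncertainty in the estimated coefficients themselves.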

[50]


[Figure: plot of Manhours against Volume.]

Y = α + βX + ε

[51]

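Estimating $\sigma$

$\sigma$ measures spread of deviations from the true line.

Estimate $\sigma$ by $s$, the standard deviation of deviations from the fitted line, via

fitted values: $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$

and

residuals: $e_i = Y_i - \hat{Y}_i$

$$\hat{\sigma} = s = \sqrt{\frac{\sum e_i^2}{n - 2}} = 20 \text{ for our example}$$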

[52]

The estimated model:

$\hat{Y} = 50 + 3.3X \pm 40$

Exercise

Use the prediction formula to estimate the loss incurred through equipment breakdown in Period 6, Fiscal 1962, when Y was 765 and X was 180.

[53]

Homework

Given the Volume figures for periods 1, 6 and 7 of Fiscal Year 1963, what predictions, including prediction errors, would you make for the Manhours requirement?

Recall: $\hat{\alpha} = 50$, $\hat{\beta} = 3.3$, $\hat{\sigma} = s = 20$

How do these predictions relate to the actual manhours used?

Comment.

[54]

Case study: Mail processing costs in a U.S. Post Office

[Table: Mail Processing Hours, Fiscal Years 1962-63, as on slide [28].]

[55]


[Figure: scatter plot of Manhours against Volume.]

[56]


[Figure: plot of Manhours against Volume.]

Y = α + βX + ε

[57]

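Calculating the regression by formula:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$

For these data,

$\hat{\beta} = 3.3$

$\hat{\alpha} = 50$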

[58]

Calculating the regression by computer

[59]

The "constant" variable?

Y = α + βX + ε

Y = α × 1 + β × X + ε

[60]

Calculating the prediction formula

Manhours = 50.4394 + 3.34544 × Volume ± 2 × 18.93
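The fitted prediction formula can be reproduced from the mail processing table with the sketch below. One assumption is needed: the slides do not list which periods were dropped, so the code excludes period 7 in both years (the Christmas peak) and period 6 in the Fiscal Year 1963 column (the 765 manhours value), which leaves 23 periods and the 21 degrees of freedom quoted for the t-distribution. Variable names are illustrative.

```r
# Mail processing data: volume in millions of pieces, manhours in thousands.
mail <- data.frame(
  year     = rep(c(1962, 1963), each = 13),
  period   = rep(1:13, times = 2),
  volume   = c(157, 161, 168, 186, 183, 184, 268, 180, 175, 193, 184, 179, 164,
               154, 157, 164, 188, 191, 180, 270, 180, 172, 184, 179, 169, 160),
  manhours = c(572, 570, 645, 645, 645, 671, 1053, 675, 670, 710, 656, 640, 599,
               569, 564, 573, 667, 700, 765, 1070, 637, 650, 655, 665, 599, 605)
)

# Assumed exclusions: both Christmas periods and the period with 765 manhours.
keep <- !(mail$period == 7 | (mail$year == 1963 & mail$period == 6))

fit <- lm(manhours ~ volume, data = mail[keep, ])

alpha_hat <- unname(coef(fit)["(Intercept)"])
beta_hat  <- unname(coef(fit)["volume"])
s         <- summary(fit)$sigma    # residual standard deviation

print(round(c(alpha = alpha_hat, beta = beta_hat, s = s), 4))
print(df.residual(fit))
```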

[61]

Standard errors of estimated regression coefficients

• Regression coefficient estimate subject to chance variation

• Normal model applies

• Standard deviation of the Normal model is the standard error of the coefficient estimate

[62]

Application 1: Confidence interval for marginal change

Recall confidence interval for

or

Confidence interval for $\beta$:

$\hat{\beta} \pm 2\,SE(\hat{\beta})$

[63]

More results

Exercise: Calculate a 95% confidence interval for $\beta$.

Calculate a 95% CI for change in manhours corresponding to a 10m increase in pieces of mail handled.

[64]

Point Estimate: $\hat{\beta} = 3.34544$

Standard Error: $SE(\hat{\beta}) = 0.3401$

95% CI: $\hat{\beta} \pm 2\,SE(\hat{\beta})$

3.34544 ± 2 × 0.3401

3.34544 ± 0.6802

2.665 to 4.026

[65]

Point Estimate: $\hat{\beta} = 3.34544$

Standard Error: $SE(\hat{\beta}) = 0.3401$

95% CI: $\hat{\beta} \pm t_{(n-2)\,df}\, SE(\hat{\beta})$, with 21 df

3.34544 ± 2.0796 × 0.3401

3.34544 ± 0.7073

2.638 to 4.053

(2.665 to 4.026 using $\hat{\beta} \pm 2\,SE(\hat{\beta})$)

[66]

Point Estimate: $10\hat{\beta} = 33.4544$

Standard Error: $10\,SE(\hat{\beta}) = 3.401$

95% CI:

33.4544 ± 2 × 3.401

33.4544 ± 6.802

26.65 to 40.26
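The three intervals above follow from the point estimate and standard error alone, as this R sketch shows. It takes the reported values 3.34544 and 0.3401 as given and uses `qt()` for the exact multiplier on 21 df; variable names are illustrative.

```r
beta_hat <- 3.34544   # estimated slope, as reported
se_beta  <- 0.3401    # its standard error, as reported
df       <- 21        # residual degrees of freedom

# Approximate 95% CI using the multiplier 2.
ci_approx <- beta_hat + c(-1, 1) * 2 * se_beta

# 95% CI using the t-distribution multiplier.
t_crit <- qt(0.975, df = df)
ci_t   <- beta_hat + c(-1, 1) * t_crit * se_beta

# Change in manhours for a 10 million increase in pieces handled:
# both the estimate and its standard error scale by 10.
ci_10 <- 10 * beta_hat + c(-1, 1) * 2 * (10 * se_beta)

print(round(ci_approx, 3))
print(round(ci_t, 3))
print(round(ci_10, 2))
```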

[67]

Application 2: Testing the statistical significance of the slope

Formal test:

H0: β = 0

Test statistic:

$$t^* = \frac{\hat{\beta} - 0}{SE(\hat{\beta})} = \frac{\hat{\beta}}{SE(\hat{\beta})}$$

Calculated value: 9.84

Critical value: 2.0796 (t-dist, 21df) or 2 (approx)

Comparison: | 9.84 | > 2.0796 cutoff

Conclusion: REJECT H0
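The calculated value and the comparison can be verified from the reported estimate and standard error. A minimal sketch, with illustrative variable names:

```r
beta_hat <- 3.34544   # estimated slope, as reported
se_beta  <- 0.3401    # its standard error, as reported

t_star <- (beta_hat - 0) / se_beta   # test statistic for H0: beta = 0
t_crit <- qt(0.975, df = 21)         # two-sided 5% cutoff on 21 df

reject_h0 <- abs(t_star) > t_crit

print(round(c(t_star = t_star, t_crit = t_crit), 4))
print(reject_h0)
```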
