civl 7012/8012 - memphis linear...𝑖= observed value of dependent variable (tip amount)....

CIVL 7012/8012

Simple Linear Regression

Lecture 2

Correlation

• Correlation is the degree to which two continuous variables are linearly associated.

• It is most often represented by a scatterplot and the Pearson correlation coefficient, denoted by \(r\).

• The scatterplot shows visually how the two continuous variables are correlated.

• The coefficient measures the strength of the linear association between the two variables.

Correlation

• If there is no correlation between the two variables, the points will form a horizontal or vertical line, or will look completely random (no obvious pattern).

• Note that it does not matter which variable is on the x-axis and which is on the y-axis.

• The pattern the two variables form determines the strength and direction of their correlation.

Correlation

• The stronger the correlation, the more distinctly linear the pattern will be.

• The coefficient is between -1 and 1.

  +1 indicates a perfect positive correlation

  -1 indicates a perfect negative correlation

  0 indicates no correlation

• There are no strict rules for interpretation; however, as a guideline:

  0 < |r| < 0.3: weak correlation

  0.3 < |r| < 0.7: moderate correlation

  |r| > 0.7: strong correlation
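The guideline above can be written as a small helper. This is only a sketch: the function name `correlation_strength` is hypothetical, and because the slide does not say where exactly 0.3 and 0.7 belong, the boundary handling below is an assumption.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the weak/moderate/strong guideline."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size == 0:
        return "none"
    if size < 0.3:
        return "weak"
    if size < 0.7:  # assumption: exactly 0.3 is treated as moderate
        return "moderate"
    return "strong"  # assumption: exactly 0.7 is treated as strong


for r in (0.0, -0.25, 0.698, 0.892):
    print(r, correlation_strength(r))
```

The two positive values are the coefficients that appear in this lecture: 0.892 for bill vs. tip and 0.698 for Ozone vs. Temp.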

Correlation

Snapshot from Multivariate Lecture 6

\(\rho_{XY}\) is the correlation notation for the entire population.

The Pearson correlation coefficient (\(r\)) is for our sample representing the population.

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}} \]

Correlation calculation

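| Meal | Bill ($) \(x\) | Tip ($) \(y\) | Bill deviations \(x_i - \bar{x}\) | Tip deviations \(y_i - \bar{y}\) | Deviations products \((x_i - \bar{x})(y_i - \bar{y})\) | Bill deviations squared \((x_i - \bar{x})^2\) | Tip deviations squared \((y_i - \bar{y})^2\) |
|---|---|---|---|---|---|---|---|
| 1 | 35 | 6 | -37.5 | -4 | 150 | 1406.25 | 16 |
| 2 | 110 | 18 | 37.5 | 8 | 300 | 1406.25 | 64 |
| 3 | 66 | 11 | -6.5 | 1 | -6.5 | 42.25 | 1 |
| 4 | 75 | 7 | 2.5 | -3 | -7.5 | 6.25 | 9 |
| 5 | 100 | 14 | 27.5 | 4 | 110 | 756.25 | 16 |
| 6 | 49 | 4 | -23.5 | -6 | 141 | 552.25 | 36 |
| Sum | | | | | 687 | 4169.5 | 142 |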

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}} = \frac{687}{\sqrt{(4169.5)(142)}} \approx 0.892 \]
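As a check on the hand calculation, here is a minimal Python sketch that recomputes the Pearson coefficient from the six bill/tip pairs in the table. The function name `pearson_r` is hypothetical; only the standard library is used.

```python
import math

bills = [35, 110, 66, 75, 100, 49]  # x: bill amounts ($)
tips = [6, 18, 11, 7, 14, 4]        # y: tip amounts ($)


def pearson_r(x, y):
    """Pearson correlation from sums of deviation products and squares."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 687
    sxx = sum((xi - x_bar) ** 2 for xi in x)                        # 4169.5
    syy = sum((yi - y_bar) ** 2 for yi in y)                        # 142
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(bills, tips)
print(f"r = {r:.4f}")
```

Carrying four decimals gives 0.8928, which the slide reports as 0.892.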

Correlation significance test (t-test)

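• Is the correlation statistically significant?

• Conduct a t-test

• \(H_0: \rho = 0\) vs. \(H_1: \rho \neq 0\) at \(\alpha = 0.05\)

• \(t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), with \(df = n - 2\)

• With \(r = 0.892\) and \(n = 6\): \(t = \dfrac{0.892\sqrt{6-2}}{\sqrt{1-0.892^2}} = 3.947\)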


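• \(t_{calc} = 3.947 > t_{crit} = 2.776\) (the critical value \(t_{0.975,\,4}\) for \(\alpha = 0.05\) and \(df = 4\)), so we reject the null hypothesis.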
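The test statistic for the correlation can be reproduced with a few lines of Python. This is a sketch under stated assumptions: `correlation_t` is a hypothetical helper name, and the critical value 2.776 is taken from a t-table for df = 4 rather than computed.

```python
import math

r = 0.892       # sample correlation, as rounded on the slide
n = 6           # number of meals
t_crit = 2.776  # t(0.975, df = 4), taken from a t-table


def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_calc = correlation_t(r, n)
reject_null = abs(t_calc) > t_crit  # two-tailed decision at alpha = 0.05
print(f"t = {t_calc:.3f}, df = {n - 2}, reject H0: {reject_null}")
```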

SLR Lecture 1 Recap

Recap - Quick Review

• SLR is a comparison of two models:

  • One in which the independent variable does not exist

  • One that uses the best-fit regression line

• If there is only one variable, the best prediction for any other value is the mean of the dependent variable.

• The distance between the best-fit line and an observed value is called the residual (or error).

• The residuals are squared and added together to give the sum of squared residuals/errors (SSE).

• SLR is designed to find the best-fitting line through the data, the one that minimizes the SSE.

Recap - Example

[Figure: "Tips for service ($)", tip ($) plotted against meal #, with the best-fit line \(\bar{y} = 10\)]

| Meal # | Tip ($) |
|---|---|
| 1 | 6 |
| 2 | 18 |
| 3 | 11 |
| 4 | 7 |
| 5 | 14 |
| 6 | 4 |

[Figure: "Tips for service ($)", the same plot annotated with the squared residual of each tip from the mean line (16, 64, 1, 9, 16, 36)]

Recap - Residuals (Errors)

Squared Residuals (Errors)

| Meal # | Residual | Residual² |
|---|---|---|
| 1 | −4 | 16 |
| 2 | +8 | 64 |
| 3 | +1 | 1 |
| 4 | −3 | 9 |
| 5 | +4 | 16 |
| 6 | −6 | 36 |

Sum of squared errors (SSE) = \(\sum \text{Residuals}^2\) = 142
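To make the "mean-only" model concrete, the sketch below recomputes the residuals and the SSE of 142 from the six tips. It also tries two other horizontal lines to illustrate that the mean gives the smallest SSE. The helper name `sse_about` is hypothetical.

```python
tips = [6, 18, 11, 7, 14, 4]  # tip amounts ($) for meals 1 to 6


def sse_about(values, line):
    """Sum of squared residuals of the values around a horizontal line."""
    return sum((v - line) ** 2 for v in values)


mean_tip = sum(tips) / len(tips)
residuals = [y - mean_tip for y in tips]
sse_mean_only = sse_about(tips, mean_tip)

print("mean tip:", mean_tip)
print("residuals:", residuals)
print("SSE about the mean:", sse_mean_only)

# Any other horizontal line gives a larger SSE than the mean does.
for candidate in (9, 11):
    print(f"SSE about y = {candidate}:", sse_about(tips, candidate))
```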

Recap – Population vs. Sample Eq.

• If we knew our “population” parameters, \(\beta_0\) and \(\beta_1\), then we could use the SLR equation as is:

\[ E(y) = \beta_0 + \beta_1 x \]

• In reality, we almost never have the population parameters. Therefore we have to estimate them using sample data. With sample data, the SLR equation changes a bit:

\[ \hat{y} = b_0 + b_1 x \]

• Here \(\hat{y}\) (“y-hat”) is the point estimator of \(E(y)\).

• In other words, \(\hat{y}\) estimates the mean value of \(y\) for a given \(x\).

Recap – OLS criterion

\(y_i\) = observed value of the dependent variable (tip amount).

\(\hat{y}_i\) = estimated (predicted) value of the dependent variable (predicted tip amount based on the regression model).

\[ \min \sum (y_i - \hat{y}_i)^2 \]

[Figure: scatterplot with "observed" and "predicted" series]

Recap - SLR parameter equations

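\[ \hat{y}_i = b_0 + b_1 x \]

Slope:

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

Intercept:

\[ b_0 = \bar{y} - b_1 \bar{x} \]

\(\bar{x}\) = mean of the independent variable ($ bill)

\(\bar{y}\) = mean of the dependent variable ($ tip)

\(x_i\) = value of the independent variable

\(y_i\) = value of the dependent variable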

Recap - OLS Calculations

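| Meal | Bill ($) \(x\) | Tip ($) \(y\) | Bill deviations \(x_i - \bar{x}\) | Tip deviations \(y_i - \bar{y}\) | Deviations products \((x_i - \bar{x})(y_i - \bar{y})\) | Bill deviations squared \((x_i - \bar{x})^2\) |
|---|---|---|---|---|---|---|
| 1 | 35 | 6 | -37.5 | -4 | 150 | 1406.25 |
| 2 | 110 | 18 | 37.5 | 8 | 300 | 1406.25 |
| 3 | 66 | 11 | -6.5 | 1 | -6.5 | 42.25 |
| 4 | 75 | 7 | 2.5 | -3 | -7.5 | 6.25 |
| 5 | 100 | 14 | 27.5 | 4 | 110 | 756.25 |
| 6 | 49 | 4 | -23.5 | -6 | 141 | 552.25 |
| Mean / Sum | \(\bar{x} = 72.5\) | \(\bar{y} = 10\) | | | 687 | 4169.5 |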


Slope, from the sum of the deviations products (687) and the sum of the squared bill deviations (4169.5):

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{687}{4169.5} = 0.1648 \]

Intercept, from the means \(\bar{x} = 72.5\) and \(\bar{y} = 10\):

\[ b_0 = \bar{y} - b_1 \bar{x} = 10 - 0.1648(72.5) = 10 - 11.9457 = -1.9457 \]

The product 11.9457 is computed with the unrounded slope (0.164768).
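The slope and intercept calculation above can be sketched in Python as follows. The function name `fit_slr` is hypothetical; the formulas are the two parameter equations from the recap, applied to the six bill/tip pairs.

```python
bills = [35, 110, 66, 75, 100, 49]  # x: bill amounts ($)
tips = [6, 18, 11, 7, 14, 4]        # y: tip amounts ($)


def fit_slr(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept (note the minus sign)
    return b0, b1


b0, b1 = fit_slr(bills, tips)
print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
```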

Recap – New Best-Fit Line & Parameters

\[ \hat{y}_i = b_0 + b_1 x \]

Intercept: \(b_0 = -1.9457\)

Slope: \(b_1 = 0.1648\)

\[ \hat{y}_i = -1.9457 + 0.1648x \quad \text{or} \quad \hat{y}_i = 0.1648x - 1.9457 \]

Recap - Final SLR line

[Figure: "Bill vs. Tip Amount ($)", tip ($) plotted against bill ($) with the fitted line \(\hat{y}_i = -1.9457 + 0.1648x\), intercept \(b_0 = -1.9457\), slope \(b_1 = 0.1648\)]

Recap - SLR Model Interpretation

\[ \hat{y}_i = -1.9457 + 0.1648x \]

For every $1 the bill amount (\(x\)) increases, we would expect the tip amount to increase by $0.1648, or about 16 cents (positive coefficient).

If the bill amount (\(x\)) is zero, then the expected/predicted tip amount is −$1.9457, or about negative $1.95!

Does this make any sense? No. In real-world problems, the intercept may or may not make sense.

SLR – Lecture 2

[Figure: two panels, "Bills vs Tips ($)" and "Tips ($)"]

Model fit and Coefficient of Determination

Tips only (DV alone): \(SSE = 142\) and \(SSE = SST\). With only the DV, the only sum of squares is due to error. Therefore, it is also the total, and maximum, sum of squares for this data sample: \(SST = 142\).

Bills vs. tips (IV and DV): with both the IV and DV, SST remains the same, but the SSE is reduced significantly. The difference between the SST and the SSE is due to regression (SSR).

\(SST = 142\)

\(SSE = \;?\)

\(SST - SSE = SSR\)

Estimate regression values

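| Meal | Bill ($) \(x_i\) | Tip ($) \(y_i\) | \(\hat{y}_i = -1.9457 + 0.1648x\) | \(\hat{y}_i\) (predicted tip $) |
|---|---|---|---|---|
| 1 | 35 | 6 | \(-1.9457 + 0.1648(35)\) | 3.8212 |
| 2 | 110 | 18 | \(-1.9457 + 0.1648(110)\) | 16.1788 |
| 3 | 66 | 11 | \(-1.9457 + 0.1648(66)\) | 8.9290 |
| 4 | 75 | 7 | \(-1.9457 + 0.1648(75)\) | 10.4119 |
| 5 | 100 | 14 | \(-1.9457 + 0.1648(100)\) | 14.5311 |
| 6 | 49 | 4 | \(-1.9457 + 0.1648(49)\) | 6.1280 |
| Mean | \(\bar{x} = 72.5\) | \(\bar{y} = 10\) | | |

The predicted values are computed with the unrounded coefficients, so they differ slightly from what the four-decimal coefficients give.

\[ \min \sum (y_i - \hat{y}_i)^2 \]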

Regression errors (residuals)

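| Meal | Bill ($) \(x\) | Tip ($) \(y\) | \(\hat{y}_i\) (predicted tip $) | Error \(y - \hat{y}_i\) (observed − predicted) |
|---|---|---|---|---|
| 1 | 35 | 6 | 3.8212 | 6 − 3.8212 = 2.1788 |
| 2 | 110 | 18 | 16.1788 | 18 − 16.1788 = 1.8212 |
| 3 | 66 | 11 | 8.9290 | 11 − 8.9290 = 2.0710 |
| 4 | 75 | 7 | 10.4119 | 7 − 10.4119 = -3.4119 |
| 5 | 100 | 14 | 14.5311 | 14 − 14.5311 = -0.5311 |
| 6 | 49 | 4 | 6.1280 | 4 − 6.1280 = -2.1280 |
| Mean | \(\bar{x} = 72.5\) | \(\bar{y} = 10\) | | |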

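Regression errors (residuals) - SSE

| Meal | Bill ($) \(x\) | Tip ($) \(y\) | \(\hat{y}_i\) (predicted tip $) | Error \(y - \hat{y}_i\) | \((y - \hat{y}_i)^2\) |
|---|---|---|---|---|---|
| 1 | 35 | 6 | 3.8212 | 2.1788 | 4.7472 |
| 2 | 110 | 18 | 16.1788 | 1.8212 | 3.3168 |
| 3 | 66 | 11 | 8.9290 | 2.0710 | 4.2890 |
| 4 | 75 | 7 | 10.4119 | -3.4119 | 11.6412 |
| 5 | 100 | 14 | 14.5311 | -0.5311 | 0.2821 |
| 6 | 49 | 4 | 6.1280 | -2.1280 | 4.5282 |
| Mean / Sum | \(\bar{x} = 72.5\) | \(\bar{y} = 10\) | | | \(SSE = 28.8044\) |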
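The predicted tips, the residuals, and the regression SSE in the tables above can be reproduced with a short Python sketch. It assumes the coefficients carried to six significant digits (−1.94568 and 0.164768), which is what the tabulated values appear to be based on.

```python
bills = [35, 110, 66, 75, 100, 49]  # x: bill amounts ($)
tips = [6, 18, 11, 7, 14, 4]        # y: observed tips ($)

# Assumption: coefficients with more digits than the rounded -1.9457 and 0.1648
b0, b1 = -1.94568, 0.164768

predicted = [b0 + b1 * x for x in bills]                   # y-hat for each meal
errors = [y - y_hat for y, y_hat in zip(tips, predicted)]  # observed - predicted
sse = sum(e ** 2 for e in errors)                          # sum of squared errors

for meal, (y_hat, e) in enumerate(zip(predicted, errors), start=1):
    print(f"meal {meal}: predicted = {y_hat:.4f}, error = {e:.4f}, squared = {e ** 2:.4f}")
print(f"SSE = {sse:.4f}")
```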

Sum of squared error (SSE) comparison

D.V. (tip $) ONLY:

16 + 64 + 1 + 9 + 16 + 36 = SSE = 142

D.V. & I.V. (tip $ as a function of bill $):

4.7472 + 3.3168 + 4.2890 + 11.6412 + 0.2821 + 4.5282 = SSE ≈ 28.8044

Comparison of two lines

• When we conducted the regression, the SSE decreased from 142 to 28.8044.

• 28.8044 remains unexplained and is allocated to ERROR.

• What happened to the difference (113.1956)?

• 113.1956 is the sum of squares due to REGRESSION (SSR).

• \(SST = SSR + SSE\)

• In this case: 142 = 113.1956 + 28.8044
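The partition of the total sum of squares can be verified numerically. The sketch below fits the line from the raw data, computes each sum of squares from its own definition, and then checks that they add up. Variable names are illustrative.

```python
import math

bills = [35, 110, 66, 75, 100, 49]  # x: bill amounts ($)
tips = [6, 18, 11, 7, 14, 4]        # y: tip amounts ($)

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)

# Least-squares estimates from the deviation sums
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
sxx = sum((x - x_bar) ** 2 for x in bills)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in bills]

sst = sum((y - y_bar) ** 2 for y in tips)              # total: observed vs. mean
sse = sum((y - f) ** 2 for y, f in zip(tips, fitted))  # error: observed vs. fitted
ssr = sum((f - y_bar) ** 2 for f in fitted)            # regression: fitted vs. mean

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print("SST equals SSR + SSE:", math.isclose(sst, ssr + sse))
```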

[Figure: two panels, "Bills vs Tips ($)" and "Tips ($)"]

Comparison of two lines

Tips only (mean line): \(SSE = 142\), \(SSE = SST\), \(SST = 142\)

Bills vs. tips (regression line): \(SST = 142\), \(SSE = 28.8044\), \(SST - SSE = SSR = 113.1956\)

Coefficient of Determination (𝑟2)

• How well does the estimated regression equation fit our data?

• This is where regression starts to look a lot like ANOVA, where the SST is partitioned into SSE and SSR.

• The larger the SSR, the smaller the SSE.

• The coefficient of determination quantifies this ratio as a percentage (%).

[Figure: SST partitioned into SSE and SSR]

\[ \text{Coefficient of determination} = r^2 = \frac{SSR}{SST} \]


ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 113.1956 | 113.1956 | 15.7192 | 0.016611541 |
| Residual | 4 | 28.80441 | 7.201103 | | |
| Total | 5 | 142 | | | |

𝑟2 Interpretation

• \(\text{Coefficient of determination} = r^2 = \dfrac{SSR}{SST} = \dfrac{113.1956}{142} = 0.7972\), or 79.72%

• We can conclude that 79.72% of the total sum of squares can be explained by using the estimates from the regression equation to predict the tip amount. The remainder (20.28%) is error.

• This is a “Good fit”!
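In simple linear regression the coefficient of determination is also the square of the Pearson correlation coefficient. A small sketch, using only the three sums from the correlation table (687, 4169.5 and 142), illustrates both routes to the same number. The variable names are illustrative.

```python
import math

sxy = 687.0    # sum of deviations products
sxx = 4169.5   # sum of squared bill deviations
sst = 142.0    # sum of squared tip deviations (total sum of squares)

ssr = sxy ** 2 / sxx             # sum of squares due to regression
r2_from_ss = ssr / sst           # coefficient of determination, SSR / SST

r = sxy / math.sqrt(sxx * sst)   # Pearson correlation coefficient
r2_from_r = r ** 2

print(f"SSR = {ssr:.4f}")
print(f"r^2 = SSR/SST = {r2_from_ss:.4f} ({r2_from_ss:.2%})")
print(f"r = {r:.4f}, r squared = {r2_from_r:.4f}")
```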

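[Figure: "Bills vs. Tips ($)", tip ($) plotted against bill ($), showing the regression line \(\hat{y}_i = -1.9457 + 0.1648x\), the mean line \(\bar{y} = 10\), and the 3 squared differences]

\[ SSE = \sum (y_i - \hat{y}_i)^2 \]

\[ SST = \sum (y_i - \bar{y})^2 \]

\[ SSR = \sum (\hat{y}_i - \bar{y})^2 \]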

Model fit

\[ \hat{y}_i = -1.9457 + 0.1648x \]

Questions:

• Once a regression line is calculated, how much better is it than using only the mean of the dependent variable? (coefficient of determination, \(r^2\))

• How confident are we in the significance of the relationship between \(x\) and \(y\)? (t-test of the slope)

Regression with Excel

• Produce SLR model in Excel.

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.892834 |
| R Square | 0.797152 |
| Adjusted R Square | 0.74644 |
| Standard Error | 2.683487 |
| Observations | 6 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 113.1956 | 113.1956 | 15.7192 | 0.016611541 |
| Residual | 4 | 28.80441 | 7.201103 | | |
| Total | 5 | 142 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | -1.94568 | 3.205964 | -0.60689 | 0.576683 | -10.84685887 | 6.955504991 | -10.84685887 | 6.955504991 |
| X Variable 1 | 0.164768 | 0.041558 | 3.964745 | 0.016612 | 0.049383684 | 0.280152232 | 0.049383684 | 0.280152232 |
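The "Regression Statistics" block and the F value follow directly from the sums of squares in the ANOVA table. The sketch below recomputes them in Python from the reported SST, SSE, and number of observations. It is an illustration of the standard formulas, not a reproduction of Excel's internal procedure.

```python
import math

n = 6            # observations
k = 1            # number of independent variables
sst = 142.0      # total sum of squares
sse = 28.80441   # residual sum of squares

ssr = sst - sse                    # regression sum of squares
msr = ssr / k                      # mean square, regression
mse = sse / (n - k - 1)            # mean square, residual

r_square = ssr / sst
multiple_r = math.sqrt(r_square)   # equals |r| when there is one predictor
adj_r_square = 1 - mse / (sst / (n - 1))
standard_error = math.sqrt(mse)    # standard error of the regression
f_stat = msr / mse

print(f"Multiple R        {multiple_r:.6f}")
print(f"R Square          {r_square:.6f}")
print(f"Adjusted R Square {adj_r_square:.6f}")
print(f"Standard Error    {standard_error:.6f}")
print(f"F                 {f_stat:.4f}")
```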

Testing slope -1

• Is the relationship between \(y\) and \(x\) significant?

• Test the slope \(\beta_1\) (two-tailed t-test).

• Remember: \(b_1\) is for our sample and \(\beta_1\) is for the population.

• We will use our sample slope \(b_1\) to test whether the true population slope \(\beta_1\) is significantly different from 0.

\[ \hat{y}_i = -1.9457 + 0.1648x \]

Testing slope -2

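Steps to conduct a t-test on the slope \(\beta_1\):

• Step 1: Specify the hypotheses:

  • \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\) at \(\alpha = 0.05\)

• Step 2: Determine the test statistic:

\[ t = \frac{b_1 - \beta_1}{SE_{b_1}} \]

  • where \(\beta_1\) is the true slope for the whole population (0 under \(H_0\))

  • where \(SE_{b_1} = \sqrt{\dfrac{SSE/(n-2)}{\sum (x_i - \bar{x})^2}}\) = standard error of the slope \(b_1\)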

Testing slope -3

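• Step 2 calculation:

  • \(SE_{b_1} = \sqrt{\dfrac{SSE/(n-2)}{\sum (x_i - \bar{x})^2}} = \sqrt{\dfrac{28.8044/(6-2)}{4169.5}} = 0.0416\)

  • \(t = \dfrac{b_1 - \beta_1}{SE_{b_1}} = \dfrac{0.1648 - 0}{0.0416} = 3.9615\)

• Step 3: Quantify the evidence of the test

  • Method 1: Critical value method

  • Compare calculated \(t\) to critical \(t\)

  • \(\pm t_{1-\alpha/2,\; n-2} = \pm t_{0.975,\,4}\)

\[ \hat{y}_i = -1.9457 + 0.1648x \]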

Testing slope -4

• Step 3: Quantify the evidence of the test

  • Method 1: Critical value method

  • Compare calculated \(t\) to critical \(t\) (remember \(\alpha = 0.05\))

  • \(\pm t_{1-\alpha/2,\; n-2} = \pm t_{0.975,\,4} = \pm 2.776\)

Testing slope -5

• Step 3: Method 1: Critical value method

  • Compare calculated \(t\) to critical \(t\) (remember \(\alpha = 0.05\))

  • \(t_{calculated} = 3.9615 > t_{critical} = 2.776\)

  • The calculated \(t\) is in the critical region, so we reject the null hypothesis \(H_0: \beta_1 = 0\). We conclude that \(\beta_1 \neq 0\) and that there is a statistically significant relationship between \(x\) and \(y\).

[Figure: t distribution with area 0.95 in the center and 0.025 in each tail (the critical regions)]
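Steps 2 and 3 of the critical value method can be sketched in Python. The inputs are the rounded values from the slides, and the critical value 2.776 is taken from a t-table for df = 4 rather than computed. Without rounding the standard error to 0.0416, the statistic comes out near 3.966 instead of 3.9615, and the conclusion is the same.

```python
import math

sse = 28.8044      # sum of squared errors of the regression
sxx = 4169.5       # sum of squared bill deviations
n = 6              # number of observations
b1 = 0.1648        # sample slope (rounded)
beta1_null = 0.0   # slope under the null hypothesis
t_crit = 2.776     # t(0.975, df = 4), taken from a t-table

se_b1 = math.sqrt((sse / (n - 2)) / sxx)   # standard error of the slope
t_calc = (b1 - beta1_null) / se_b1         # test statistic
reject_null = abs(t_calc) > t_crit         # two-tailed test at alpha = 0.05

print(f"SE(b1) = {se_b1:.4f}")
print(f"t = {t_calc:.4f}, critical t = {t_crit}, reject H0: {reject_null}")
```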

Testing slope -6

• Step 3: Method 2: p-value method

  • Compare the calculated/estimated \(p\)-value to the desired significance level (remember \(\alpha = 0.05\))

  • \(p_{calculated} = 2P(t > \text{computed } t) = 2P(t > 3.9615) \approx 0.017\) with \(df = 4\), in line with the P-value of 0.0166 in the Excel output

  • The \(p\)-value of 0.017 < \(\alpha = 0.05\), therefore we reject the null hypothesis \(H_0: \beta_1 = 0\). We conclude that \(\beta_1 \neq 0\) and that there is a statistically significant relationship between \(x\) and \(y\).
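The two-tailed p-value can be approximated without a statistics package by integrating the Student t density numerically. The sketch below uses Simpson's rule with only the standard library. The function names and the number of integration steps are illustrative choices, not part of the lecture.

```python
import math


def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t_value, df, steps=2000):
    """2 * P(T > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t_value)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3     # P(0 < T < |t|)
    return 1.0 - 2.0 * area  # the remaining probability lies in the two tails


p_value = two_tailed_p(3.9615, df=4)
print(f"p-value = {p_value:.4f}, reject H0 at alpha = 0.05: {p_value < 0.05}")
```

With the unrounded statistic from the Excel output (3.964745), the same function gives approximately 0.0166, matching the reported P-value.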

SLR Example with R

• Start an R session

• Load the dataset “airquality” included in base R

• Explore and plot the data

• Run a simple linear regression model with

  “Ozone” as the DV (\(y\))

  “Temp” as the IV (\(x\))

• Follow along in the R session; the model results are as follows:

SLR Example with R

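• Dataset: airquality (153 observations of 6 variables)

• Start an R session and follow the instructions in the code

• Use simple linear regression to predict ozone levels (“Ozone”) based on the temperature (“Temp”).

| ID | Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|---|
| 1 | 41 | 190 | 7.4 | 67 | 5 | 1 |
| 2 | 36 | 118 | 8 | 72 | 5 | 2 |
| 3 | 12 | 149 | 12.6 | 74 | 5 | 3 |
| 4 | 18 | 313 | 11.5 | 62 | 5 | 4 |
| 5 | NA | NA | 14.3 | 56 | 5 | 5 |
| 6 | 28 | NA | 14.9 | 66 | 5 | 6 |
| 7 | 23 | 299 | 8.6 | 65 | 5 | 7 |
| 8 | 19 | 99 | 13.8 | 59 | 5 | 8 |
| 9 | 8 | 19 | 20.1 | 61 | 5 | 9 |
| 10 | NA | 194 | 8.6 | 69 | 5 | 10 |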

Step 1: scatter plot

| Ozone | Temp |
|---|---|
| 41 | 67 |
| 36 | 72 |
| 12 | 74 |
| 18 | 62 |
| NA | 56 |
| 28 | 66 |
| 23 | 65 |
| 19 | 59 |
| 8 | 61 |
| NA | 69 |

STEP 3: CORRELATION (Ozone vs Temp)

• What is the correlation coefficient (\(r\)) for Ozone vs. Temp? (see R session)

  In this case, \(r = 0.698\)

• Is the relationship strong?

  Moderate: \(|r| = 0.698\) falls just below the 0.7 guideline for a strong correlation. Next, run the model (see R session).

Model results (model m1)

• \(\hat{y} = b_0 + b_1 x\)

• \(b_0 = -146.996\) (intercept), \(b_1 = +2.429\) (slope)

• Regression line for this model: \(\hat{y} = -146.996 + 2.429x\)

Results interpretation (model m1) -1

Residuals:

• Residuals are the differences between the actual observed response values (Ozone levels in our case) and the response values that the model predicted.

• The “Residuals” section of the model output gives 5 summary points (minimum, first quartile, median, third quartile, maximum) to assess how well the model fits the data.

• A well-fitting model shows symmetry from the minimum to the maximum around the mean value (0).

• We do not have very good symmetry here.

• So, for certain points, the model's predictions fall far away from the actual observed values.


Model Coefficients:

• \(b_0 = -146.996\) (y-intercept)

  This is the predicted Ozone level when Temp = 0; it has no practical interpretation here.

• \(b_1 = +2.429\) (slope)

  For every 1 °F increase in temperature (\(x\)), the Ozone level is expected to increase by 2.429 units.

• Std. error = 0.2331

  This is the standard error of the slope estimate. It measures how much the estimated slope would vary from sample to sample, not how much the Ozone level varies.

• t-value for “Temp” = coefficient / std. error = 2.429 / 0.2331 ≈ 10.42 (10.418 in the R output, which uses unrounded values)

  The t-value is significant, with Pr(>|t|) < 2e−16. It is significant at any conventional level of significance (for example, at the 99.99% confidence level, \(\alpha = 0.0001\)).


• Residual standard error = 23.71 on 114 degrees of freedom

• The residual standard error is the typical amount by which the response “Ozone” deviates from the fitted regression line.

• In our example, the actual Ozone level deviates from the fitted regression line by approximately 23.71 units, on average.

• Degrees of freedom are the number of data points (observations) used minus 2 (one for each estimated parameter: the intercept and the slope for “Temp”).

  We started with 153 data points in the “airquality” dataset.

  We removed 37 data points that were NA.

  We are left with 116 data points.

  116 data points − 2 parameters = 114 DF.


• R-squared = 0.4877 (\(R^2\) = coefficient of determination)

  \(R^2\) varies from 0 to 1; in this case, 48.77% of the variation in \(y\) (Ozone) is explained by \(x\) (Temp).

• Adjusted \(R^2\) = 0.4832

  Adjusted \(R^2\) accounts for how many independent variables entered the model. It is typically lower than \(R^2\), depending on how much the additional independent variables (\(x\)'s) contribute to explaining \(y\).

  A sharp drop from \(R^2\) to adjusted \(R^2\) is a warning sign of a bad model.

F-Test (the F-value measures the overall significance of the model).

• At the desired level of significance (say \(\alpha = 0.05\), i.e., 95% confidence), the F-test shows whether the model as a whole is statistically significant.

• In this model, the F-statistic = 108.5 on 1 and 114 degrees of freedom.

• The p-value of the F-statistic is Pr(>F) < 2.2e−16; that is, the F-statistic is significant at any reasonable level of significance (or you could say at 99.99% confidence).
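Several of the reported numbers for model m1 are tied together by simple identities, which makes for a quick consistency check. The sketch below uses only the rounded values quoted in these slides, not the airquality data itself, so small rounding differences are expected. In simple linear regression the F-statistic equals the square of the slope's t-value.

```python
n_total = 153      # rows in airquality
n_missing = 37     # rows dropped because of NA values
n_params = 2       # intercept and slope

n_used = n_total - n_missing    # observations used in the fit
df_resid = n_used - n_params    # residual degrees of freedom

slope, se_slope = 2.429, 0.2331
t_value = slope / se_slope      # reported as 10.418
f_from_t = t_value ** 2         # reported F-statistic: 108.5

r, r_squared = 0.698, 0.4877
r2_from_r = r ** 2              # close to the reported R-squared

# Adjusted R-squared with one predictor
adj_r_squared = 1 - (1 - r_squared) * (n_used - 1) / df_resid

print(f"n used = {n_used}, residual df = {df_resid}")
print(f"t = {t_value:.2f}, t^2 = {f_from_t:.1f}")
print(f"r^2 = {r2_from_r:.4f}, adjusted R^2 = {adj_r_squared:.4f}")
```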

SLR – R code
