Chapter 8 Linear Regression HOW CAN A MODEL BE CREATED WHICH REPRESENTS THE LINEAR RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES?

Upload: jayson-wilkins

Post on 18-Jan-2016


Fat Versus Protein: An Example

The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:

How many grams of fat would an item with 25 grams of protein have?


What Is Linear Regression?

Remember that correlation measures the strength and direction of the linear relationship between two quantitative variables.

We can say more about the linear relationship between two quantitative variables with a model.

The linear relationship is modeled by a straight line through the data.

The data points do not all line up on the line, but a straight line summarizes the overall direction of the data.


Regression and Residuals

Some points will be above the line and some points will be below the line.

The estimate made from a model is the predicted value (denoted as ŷ).

The difference between the observed value and the predicted value is known as the residual:

$\text{residual} = \text{observed} - \text{predicted} = y - \hat{y}$


Residuals (cont.)

A negative residual means the predicted value’s too big (an overestimate).

A positive residual means the predicted value’s too small (an underestimate).


Line of Best Fit

Some residuals are positive (above the predicted line) and some are negative (below the predicted line).

To measure how well a line fits, we cannot simply add up the residuals, because the negative and positive residuals cancel each other out. Instead, we add the squared residuals.

The line of best fit is the line where the sum of the squared residuals is the smallest.

The regression line is also known as the Least Squares Regression Line (LSRL).
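The least-squares idea can be illustrated with a short Python sketch. The five points below are made up for illustration (they are not from the Burger King data), and the helper names are hypothetical. The sketch fits the line, shows that the raw residuals sum to essentially zero, and checks that nudging the intercept or slope only increases the sum of squared residuals.

```python
# Hypothetical (x, y) points, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line for the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus predicted, one residual per point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of the squared residuals for a candidate line."""
    return sum(e ** 2 for e in residuals(xs, ys, intercept, slope))


b0, b1 = fit_least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# The raw residuals cancel out (their sum is essentially zero) ...
print("sum of residuals:   ", sum(residuals(xs, ys, b0, b1)))
# ... so candidate lines are compared on squared residuals instead.
print("least-squares line: ", best)
print("shifted intercept:  ", sum_squared_residuals(xs, ys, b0 + 0.5, b1))
print("steeper slope:      ", sum_squared_residuals(xs, ys, b0, b1 + 0.1))
```

Any other line tried in place of the fitted one should give a larger sum of squared residuals.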


Line of best fit

It is written as $\hat{y} = a + bx$, or equivalently $\hat{y} = b_0 + b_1 x$.


Slope of the regression line

Our slope is always in units of y per unit of x.

$b_1 = r \dfrac{s_y}{s_x}$


Y intercept

Our intercept is always in units of y.

$b_0 = \bar{y} - b_1 \bar{x}$
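The slope and intercept formulas can be checked numerically. The sketch below uses made-up data and hypothetical variable names; it computes the slope as r times s_y over s_x, the intercept as the mean of y minus the slope times the mean of x, and compares the slope with the direct least-squares calculation.

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only.
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [10.0, 13.0, 17.0, 20.0, 26.0]

n = len(xs)
mean_x, mean_y = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations

# Correlation: sum of products of z-scores, divided by n - 1.
r = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

# Slope and intercept from the summary statistics.
b1 = r * s_y / s_x
b0 = mean_y - b1 * mean_x

# Direct least-squares slope, for comparison.
slope_direct = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                / sum((x - mean_x) ** 2 for x in xs))

print("slope from r, s_y, s_x:", b1)
print("slope computed directly:", slope_direct)
print("intercept:", b0)
```

Because the intercept is defined from the two means, the fitted line always passes through the point made of the mean of x and the mean of y.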


Residuals Revisited

The linear model treats the relationship as a perfect straight line. The residuals are the part of the data that the model has not accounted for.

Data = Model + Residual

Residual = Data - Model

In symbols:

$e = y - \hat{y}$


Example

Given the regression line for the previous scatterplot:

ŷ = 6.413 + 0.9769x

Predicted fat = 6.413 + 0.9769 × protein

What does the slope represent?

What does the y-intercept mean?


Example continued

Given the regression line for the previous scatterplot:

ŷ = 6.413 + 0.9769x

Predicted fat = 6.413 + 0.9769 × protein

How much fat would we expect an item with 12 grams of protein to have?

How much protein would an item with 15 grams of fat have?


Example continued

Given the regression line for the previous scatterplot:

ŷ = 6.413 + 0.9769x

Predicted fat = 6.413 + 0.9769 × protein

A Double Whopper sandwich has 48 grams of protein and 58 grams of fat. What is the residual in fat for this sandwich?
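A minimal Python sketch of the prediction and residual calculations, using the regression line given on the slide. The function and variable names are illustrative and not part of the original.

```python
# Coefficients of the regression line stated on the slide.
INTERCEPT = 6.413   # grams of fat
SLOPE = 0.9769      # grams of fat per gram of protein


def predicted_fat(protein_g):
    """Predicted grams of fat for an item with the given grams of protein."""
    return INTERCEPT + SLOPE * protein_g


def residual(observed_fat_g, protein_g):
    """Residual = observed - predicted."""
    return observed_fat_g - predicted_fat(protein_g)


# Prediction for an item with 12 grams of protein.
print("predicted fat at 12 g protein:", round(predicted_fat(12), 2))

# Double Whopper: 48 g protein, 58 g fat observed.
print("predicted fat at 48 g protein:", round(predicted_fat(48), 2))
print("Double Whopper residual:      ", round(residual(58, 48), 2))
```

The Double Whopper's residual is positive, so the model underestimates its fat content.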


Example Burger King

The following are select items from the Burger King menu with total calories and grams of fat:

Item                           Calories   Fat (g)
Whopper                        650        37
Whopper with cheese            730        44
Big King                       530        31
Hamburger                      230        9
Cheeseburger                   270        12
Tendergrill chicken Sandwich   460        21
Original chicken Sandwich      660        40
Big fish Sandwich              520        28
BK Veggie Burger               390        16


Example Continued

What is the regression line for the data?

What is the slope in the context of the problem?

What is the y-intercept in the context of the problem?

A sandwich with 15 grams of fat would be expected to have how many calories?

A sandwich with 450 calories would be expected to have how many grams of fat?

A Bacon Cheeseburger has 13 grams of fat and 290 total calories. What is the residual in calories for this sandwich?
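One way to work these questions is sketched below in Python, assuming fat is the explanatory variable (x) and calories the response (y), as the questions about predicting calories from fat suggest. The data are the nine items from the table; the helper names are hypothetical. The sketch covers the regression line, the 15-gram prediction, and the Bacon Cheeseburger residual.

```python
# Grams of fat and calories for the nine menu items in the table.
fat = [37, 44, 31, 9, 12, 21, 40, 28, 16]
calories = [650, 730, 530, 230, 270, 460, 660, 520, 390]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(fat, calories)


def predicted_calories(fat_g):
    """Predicted calories for a sandwich with the given grams of fat."""
    return intercept + slope * fat_g


print(f"predicted calories = {intercept:.2f} + {slope:.2f} * fat")
print(f"15 g of fat -> {predicted_calories(15):.1f} calories")

# Bacon Cheeseburger: 13 g of fat, 290 calories observed.
bacon_residual = 290 - predicted_calories(13)
print(f"Bacon Cheeseburger residual: {bacon_residual:.1f} calories")
```

The slope is in calories per gram of fat, and the intercept is the predicted calorie count for a sandwich with no fat.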


Conditions Required

1. Quantitative variable condition

2. Straight enough condition

3. Outlier condition


R-Squared

R² gives the fraction of the data’s variation accounted for by the model, and 1 - R² is the fraction of the original variation left in the residuals.

Example: Burger King sandwich example

r = 0.9881

r² = 0.9763

97.63% of the variation in the calorie content of Burger King sandwiches is accounted for by the linear model on fat content. The remaining 2.37% is left in the residuals.
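The r² value can be reproduced from the nine sandwiches in the earlier table. The Python sketch below (helper names are hypothetical) computes R² as one minus the ratio of the residual variation to the total variation in calories, assuming fat is x and calories is y.

```python
# Grams of fat and calories for the nine menu items in the earlier table.
fat = [37, 44, 31, 9, 12, 21, 40, 28, 16]
calories = [650, 730, 530, 230, 270, 460, 660, 520, 390]

n = len(fat)
mean_x = sum(fat) / n
mean_y = sum(calories) / n

# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(fat, calories))
         / sum((x - mean_x) ** 2 for x in fat))
intercept = mean_y - slope * mean_x

# Variation left in the residuals versus total variation in calories.
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(fat, calories))
ss_total = sum((y - mean_y) ** 2 for y in calories)

r_squared = 1 - ss_residual / ss_total
print(f"R^2 = {r_squared:.4f}")
print(f"fraction left in residuals = {1 - r_squared:.4f}")
```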


Residual Plot

A scatterplot of the residuals from the regression line, plotted against the x-values or the predicted values.

A noticeable pattern in the residual plot may indicate that the regression line is not a good model.

The residual plot of a well-fitting model shows random scatter with no apparent pattern.


What not to do

Don’t fit a straight line to a nonlinear relationship.

Beware of extraordinary points.

Don’t extrapolate beyond the data.

Don’t infer that x causes y just because there is a good linear model for their relationship.

Don’t choose a model based on r² alone.


Breakfast Cereals, sugar and Calories

The following data come from 77 breakfast cereals and relate the sugar content of each cereal to its calories.

r = 0.564

Calories: mean = 107.0, SD = 19.5

Sugar: mean = 7.0 grams, SD = 4.4 grams

What is the slope of the regression line?

What is the y-intercept?

Write the regression equation.

Interpret.
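A short Python sketch of how the summary statistics lead to the regression line, assuming sugar is the explanatory variable (x) and calories the response (y). The function name `line_from_summary` is hypothetical.

```python
def line_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (intercept, slope) from the correlation, means, and SDs."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Summary statistics from the slide (x = sugar in grams, y = calories).
intercept, slope = line_from_summary(
    r=0.564, mean_x=7.0, sd_x=4.4, mean_y=107.0, sd_y=19.5
)

print(f"slope     = {slope:.2f} calories per gram of sugar")
print(f"intercept = {intercept:.1f} calories")
print(f"predicted calories = {intercept:.1f} + {slope:.2f} * sugar")
```

Under these assumptions, each additional gram of sugar is associated with about 2.5 more calories, and a cereal with no sugar is predicted to have about 89.5 calories.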


Urban planning

We want to estimate the costs per person associated with traffic delays.

2002 Urban Mobility Report (70 cities in 2000)

Annual cost per person: mean = $298.96, SD = $180.83

Average speed per person: mean = 54.34 mph, SD = 4.494 mph

r = -0.90

Write an equation to model this situation.

What does the slope mean?
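A worked version of the calculation, assuming average speed is the explanatory variable (x) and annual cost per person is the response (y); the slide does not say which is which, so this assignment is an assumption.

```latex
\begin{align*}
b_1 &= r\,\frac{s_y}{s_x} = -0.90 \times \frac{180.83}{4.494} \approx -36.214 \\
b_0 &= \bar{y} - b_1\bar{x} \approx 298.96 - (-36.214)(54.34) \approx 2266.8 \\
\widehat{\text{cost}} &\approx 2266.8 - 36.21 \times \text{speed}
\end{align*}
```

Under this assumption, each additional mile per hour of average speed is associated with an annual cost per person that is about $36 lower.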


What to watch out for in Regression

Interpreting beyond the data: extrapolating

Influential points

Lurking variables

Linear regression that is not “linear”: what to do


Extrapolation

We cannot assume that a linear relationship in the data exists beyond the range of the data.

Once we venture into new x territory, such a prediction is called an extrapolation.
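One simple safeguard is to flag any prediction whose x-value falls outside the range of the data used to fit the line. The Python sketch below is illustrative only: the function name, the line's coefficients, and the observed range are all made up.

```python
def predict_with_range_check(x, intercept, slope, x_min, x_max):
    """Return (prediction, is_extrapolation).

    is_extrapolation is True when x lies outside [x_min, x_max],
    the range of x-values the line was fitted on.
    """
    prediction = intercept + slope * x
    is_extrapolation = not (x_min <= x <= x_max)
    return prediction, is_extrapolation


# Hypothetical line fitted on x-values between 10 and 40.
for x in (25, 60):
    y_hat, flagged = predict_with_range_check(
        x, intercept=5.0, slope=2.0, x_min=10, x_max=40
    )
    note = "extrapolation, treat with caution" if flagged else "within the data"
    print(f"x = {x}: predicted y = {y_hat} ({note})")
```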


Extrapolation (cont.)

A regression of mean age at first marriage for men vs. year fit to the first 4 decades of the 20th century does not hold for later years:


Influential Outliers

We say that a point is influential if omitting it from the data gives a very different model.
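Influence can be checked by fitting the line twice, once with the suspect point and once without it. The Python sketch below uses made-up data in which one point sits far from the others in the x-direction; the names are hypothetical.

```python
def fit_line(points):
    """Return (intercept, slope) of the least-squares line for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: five ordinary points plus one far-off point.
ordinary = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6)]
suspect = (20, 1)

_, slope_without = fit_line(ordinary)
_, slope_with = fit_line(ordinary + [suspect])

print("slope without the point:", round(slope_without, 3))
print("slope with the point:   ", round(slope_with, 3))
```

Here the single far-off point changes the sign of the slope, which is the kind of change that marks a point as influential.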


Outliers, Leverage, and Influence (cont.)

The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…


Lurking Variable

No matter how straight the line, no matter how strong the association, or how high the R-squared value is, there is no way to conclude from regression alone that one variable causes the other.

There is always the possibility that some third variable is driving both of the variables being observed.


What to do when the linear regression line is not straight

Re-express the data with logs, square roots, reciprocals

We will look at square roots and logarithms, primarily

Example: taking the square root of the response variable and re-expressing the data in a scatterplot and examining the residual plot.

Example: re-expressing the data using a combination of logarithms, log(x) and log(y).

Fit a line to the curved graph: more difficult.


The Ladder of Powers

Power   Name                          Comment
2       Square of data values         Try with unimodal distributions that are skewed to the left.
1       Raw data                      Data with positive and negative values and no bounds are less likely to benefit from re-expression.
1/2     Square root of data values    Counts often benefit from a square root re-expression.
“0”     We’ll use logarithms here     Measurements that cannot be negative often benefit from a log re-expression.
-1/2    Reciprocal square root        An uncommon re-expression, but sometimes useful.
-1      The reciprocal of the data    Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
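The ladder can be written as a single re-expression function. The Python sketch below is a minimal illustration with a hypothetical function name; it treats power “0” as the base-10 logarithm and assumes positive data values, since logs and fractional or negative powers are not defined for every number.

```python
import math


def re_express(value, power):
    """Re-express a positive data value using a rung of the Ladder of Powers.

    power 2 -> square, 1 -> raw data, 0.5 -> square root,
    0 -> logarithm (by convention), -0.5 -> reciprocal square root,
    -1 -> reciprocal.
    """
    if value <= 0:
        raise ValueError("this sketch assumes positive data values")
    if power == 0:
        return math.log10(value)
    return value ** power


for power in (2, 1, 0.5, 0, -0.5, -1):
    print(f"power {power:>4}: 100 -> {re_express(100, power)}")
```

In practice each rung is tried in turn, and the scatterplot and residual plot of the re-expressed data are examined to see which one is straight enough.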


Plan B: Attack of the Logarithms (cont.)


Why Not Just a Curve?

If there’s a curve in the scatterplot, why not just fit a curve to the data?


Why Not Just a Curve? (cont.)

The mathematics and calculations for “curves of best fit” are considerably more difficult than “lines of best fit.”

Besides, straight lines are easy to understand.

We know how to think about the slope and the y-intercept.


Example: Data collected in the study of water pollution from commercial and domestic waste

Day Oxygen Demand

1 109

2 149

3 149

5 191

7 213

10 224
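The slide does not say what is to be done with this data. Assuming it is meant as re-expression practice, with Day as x and Oxygen Demand as y, the Python sketch below compares a straight-line fit on the raw days with a fit on log10(day). Taking the log of x is only one possible re-expression, chosen here for illustration; the helper names are hypothetical.

```python
import math

# Data from the table above.
day = [1, 2, 3, 5, 7, 10]
oxygen_demand = [109, 149, 149, 191, 213, 224]


def fit_summary(xs, ys):
    """Return (intercept, slope, r_squared, residuals) for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return intercept, slope, sxy ** 2 / (sxx * syy), resid


# Straight line on the raw days.
_, _, r2_raw, resid_raw = fit_summary(day, oxygen_demand)

# Straight line after re-expressing x as log10(day).
log_day = [math.log10(d) for d in day]
_, _, r2_log, resid_log = fit_summary(log_day, oxygen_demand)

print("raw x:  R^2 =", round(r2_raw, 3), "residuals:", [round(e, 1) for e in resid_raw])
print("log x:  R^2 =", round(r2_log, 3), "residuals:", [round(e, 1) for e in resid_log])
```

With the raw days, the residuals are negative at both ends and mostly positive in the middle, which suggests a bend. R² is higher after the log re-expression, but the residual plot should still be checked, since a model should not be chosen on r² alone.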