
CHAPTER 10 REGRESSION AND CORRELATION

SIMPLE LINEAR REGRESSION: TWO VARIABLES (SECTIONS 10.1–10.3 OF UNDERSTANDABLE STATISTICS)

Chapter 10 of Understandable Statistics introduces linear regression. The formula for the correlation coefficient r is given in Section 10.1. Formulas to find the equation of the least squares line,

y = a + bx

are given in Section 10.2. This section also contains the formula for the coefficient of determination r². The equation for the standard error of estimate as well as the procedure to find a confidence interval for the predicted value of y are given in Section 10.3.

The menu selection Stat > Regression > Regression gives the equation of the least-squares line, the value of the standard error of estimate (s = standard error of estimate), the value of the coefficient of determination r² (R – sq), as well as several other values such as R – sq adjusted (an adjusted, approximately unbiased estimate of the population r²). For simple regression with a response variable and one explanatory variable, we can get the value of the Pearson product moment correlation coefficient r by taking the square root of R – sq (written as a decimal rather than a percentage) and giving it the same sign as the slope b of the least-squares line.

The standard deviation, t-ratio and P values of the coefficients are also given. The P value is useful for testing the coefficients to see that the population coefficient is not zero (see Section 10.4 of Understandable Statistics for a discussion about testing the coefficients). For the time being we will not use these values.

Depending on the amount of output requested (controlled by the options selected under the [Results] button), you will also see an analysis of variance chart, as well as a table of x and y values with the fitted values y_p and residuals (y – y_p). We will not use the analysis of variance chart in our introduction to regression. However, in more advanced treatments of regression, you will find it useful.

To find the equation of the least-squares line and the value of the correlation coefficient, use the menu options

Stat > Regression > Regression

Dialog Box Responses

Response: Enter the column number C# of the column containing the responses (that is, Y values).

Predictor: Enter the column number C# of the column containing the explanatory variables (that is, X values).

[Graphs]: Do not click on at this time.

[Results]: Click on and select the second option, Regression equation, etc.

[Options]: We will click on this option when we wish to do predictions for new variables.

[Storage]: Click on and select the fits and residual options if you wish.

To graph the scatter plot and show the least-squares line on the graph, use the menu options

Stat > Regression > Fitted Line Plot

Dialog Box Responses

Response: List the column number C# of the column containing the Y values.

Predictor: List the column number C# of the column containing the X values.

Type of Regression model: Select Linear.

254 Copyright © Houghton Mifflin Company. All rights reserved.


Part III: MINITAB Guide 255

[Options]: Click on and select Display Prediction Interval, and specify the confidence level, to show a prediction band. Do not select this option if you do not want the prediction band.

[Storage]: This button gives you the same storage options as found under regression.

To find the value of the correlation coefficient directly and to find its corresponding P value, use the menu selection

Stat > Basic Statistics > Correlation

Dialog Box Responses

Variables: List the column number C# of the column containing the X variable and the column number C# of the column containing the Y variable.

Select the Display p-values option.

Example

Merchandise loss due to shoplifting, damage, and other causes is called shrinkage. Shrinkage is a major concern to retailers. The managers of H.R. Merchandise think there is a relationship between shrinkage and number of clerks on duty. To explore this relationship, a random sample of 7 weeks was selected. During each week the staffing level of sales clerks was kept constant and the dollar value (in hundreds of dollars) of the shrinkage was recorded.

X 10 12 11 15 9 13 8

Y 19 15 20 9 25 12 31 (in hundreds)

Store the values of X in C1 and name C1 as X. Store the values of Y in C2 and name C2 as Y.

Use menu choices to give descriptive statistics regarding the values of X and Y. Use commands to draw an (X, Y) scatter plot and then to find the equation of the regression line. Find the value of the correlation coefficient, and test to see if it is significant.

(a) First we will use Stat > Basic Statistics > Display Descriptive Statistics for each of the columns X and Y. Note that we select both C1 and C2 in the variables box.
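For readers without MINITAB at hand, the descriptive statistics of part (a) can be cross-checked with a short Python sketch. This is not part of the MINITAB procedure; the helper name `describe` and the choice of summary values are assumptions made for illustration, and only the Python standard library is used.

```python
import statistics

# Shrinkage example data from the text.
clerks = [10, 12, 11, 15, 9, 13, 8]        # X: clerks on duty
shrinkage = [19, 15, 20, 9, 25, 12, 31]    # Y: shrinkage, hundreds of dollars

def describe(values):
    """Return n, mean, sample standard deviation, minimum, median, maximum."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }

for name, column in (("X", clerks), ("Y", shrinkage)):
    summary = describe(column)
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

The values printed here should agree with the corresponding entries of the MINITAB descriptive statistics display, which also reports additional quantities such as quartiles.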


256 Technology Guide Understandable Statistics, 8th Edition

(b) Next we will use Stat > Regression > Fitted Line Plot to graph the scatter plot and show the least-squares line on the graph. We will not use prediction bands.

Notice that the equation of the regression line is given on the figure, as well as the value of r².

(c) However, to find out more information about the linear regression model, we use the menu selection Stat > Regression > Regression. Enter C2 for Response and C1 for Predictor.


The results follow.

Notice that the regression equation is given as

y = 52.5 – 3.03x

The value of the standard error of estimate S_e is given as S = 2.228. The value of r² is given as R-Sq = 92.8%. Find the magnitude of r by taking the square root of 0.928, which is 0.963. Because the slope of the regression line is negative, r = –0.963.
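The quantities read from the regression output can be reproduced from the standard least-squares formulas. The following Python sketch is an illustration only, not MINITAB's own procedure; the function name `simple_regression` and the dictionary keys are assumptions chosen for this example.

```python
import math

# Shrinkage example data from the text.
clerks = [10, 12, 11, 15, 9, 13, 8]        # X: clerks on duty
shrinkage = [19, 15, 20, 9, 25, 12, 31]    # Y: shrinkage, hundreds of dollars

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x, with r, r^2, and standard error of estimate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = ss_xy / ss_x                     # slope
    a = mean_y - b * mean_x              # intercept
    r = ss_xy / math.sqrt(ss_x * ss_y)   # correlation coefficient (carries the sign of b)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))       # standard error of estimate
    return {"a": a, "b": b, "r": r, "r_sq": r * r, "s_e": s_e}

fit = simple_regression(clerks, shrinkage)
print(f"a = {fit['a']:.3f}, b = {fit['b']:.3f}")
print(f"r = {fit['r']:.3f}, r^2 = {fit['r_sq']:.3f}, s_e = {fit['s_e']:.3f}")

# The sign of r can be recovered from r^2 by copying the sign of the slope.
r_from_r_sq = math.copysign(math.sqrt(fit["r_sq"]), fit["b"])
```

The intercept and slope should round to the regression equation y = 52.5 – 3.03x, and `r_sq` should round to the 92.8% reported as R-Sq.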

(d) Next, let’s use the prediction option to find the shrinkage when 14 clerks are available.

Use Stat > Regression > Regression. Your previous selections should still be listed. Now press [Options]. Enter 14 in the prediction window.


The results follow.


The predicted value of the shrinkage when 14 clerks are on duty is 10.05 hundred dollars, or $1005. A 95% prediction interval goes from 3.33 hundred dollars to 16.77 hundred dollars—that is, from $333 to $1677.
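The prediction and its 95% prediction interval can be checked by hand with the usual formula, predicted value ± t · S_e · sqrt(1 + 1/n + (x – x̄)²/SS_x). The sketch below is a hedged illustration: the function name `prediction_interval` is hypothetical, and the critical value 2.571 is taken from a standard t table for 5 degrees of freedom rather than computed.

```python
import math

# Shrinkage example data from the text.
clerks = [10, 12, 11, 15, 9, 13, 8]        # X: clerks on duty
shrinkage = [19, 15, 20, 9, 25, 12, 31]    # Y: shrinkage, hundreds of dollars

# Assumption: two-tailed 95% critical value of Student's t with n - 2 = 5
# degrees of freedom, read from a t table.
T_CRIT_95_DF5 = 2.571

def prediction_interval(x, y, x_new, t_crit):
    """Return (predicted y, lower limit, upper limit) for a new x value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = ss_xy / ss_x
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    predicted = a + b * x_new
    margin = t_crit * s_e * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / ss_x)
    return predicted, predicted - margin, predicted + margin

predicted, lower, upper = prediction_interval(clerks, shrinkage, 14, T_CRIT_95_DF5)
print(f"predicted = {predicted:.2f}, 95% P.I. = ({lower:.2f}, {upper:.2f})")
```

The interval widens as the new x value moves away from the mean of the X values, because of the (x – x̄)² term under the square root.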

(e) Graph a prediction band for predicted values. Now we use Stat > Regression > Fitted Line Plot with the [Options] Display Prediction Interval selected.

(f) Find the correlation coefficient and test it against the hypothesis that there is no correlation. We use the menu options Stat > Basic Statistics > Correlation.


The results are

Notice r = –0.963 and the P value is so small that it rounds to 0. We reject the null hypothesis of no correlation and conclude that there is a linear correlation between the number of clerks on duty and the amount of shrinkage.
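The test behind this conclusion uses the statistic t = r · sqrt(n – 2) / sqrt(1 – r²) with n – 2 degrees of freedom. A minimal sketch follows, assuming the rounded value r = –0.963 from the text and a table critical value of 2.571 (two-tailed, 5% level, 5 degrees of freedom); the function name is hypothetical, and the sketch compares t with the critical value instead of computing the P value that MINITAB reports.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing that the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r = -0.963      # sample correlation coefficient reported in the text
n = 7           # number of (X, Y) data pairs
T_CRIT = 2.571  # assumption: t table value, two-tailed, alpha = 0.05, d.f. = 5

t = correlation_t_statistic(r, n)
reject_null = abs(t) > T_CRIT
print(f"t = {t:.2f}, reject H0 at the 5% level: {reject_null}")
```

A sample statistic this far beyond the critical value is consistent with the very small P value shown in the MINITAB output.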

LAB ACTIVITIES FOR SIMPLE LINEAR REGRESSION: TWO VARIABLES

1. Open or retrieve the worksheet Slr01.mtp from the CD-ROM. This worksheet contains the following data, with the list price in column C1 and the best price in the column C2. The best price is the best price negotiated by a team from the magazine.

List Price versus Best Price for a New GMC Pickup Truck

In the following data pairs (X, Y),
X = List Price (in $1000) for a GMC Pickup Truck
Y = Best Price (in $1000) for a GMC Pickup Truck

SOURCE: CONSUMERS DIGEST, FEBRUARY 1994

(12.4, 11.2) (14.3, 12.5) (14.5, 12.7)
(14.9, 13.1) (16.1, 14.1) (16.9, 14.8)
(16.5, 14.4) (15.4, 13.4) (17.0, 14.9)
(17.9, 15.6) (18.8, 16.4) (20.3, 17.7)
(22.4, 19.6) (19.4, 16.9) (15.5, 14.0)
(16.7, 14.6) (17.3, 15.1) (18.4, 16.1)
(19.2, 16.8) (17.4, 15.2) (19.5, 17.0)
(19.7, 17.2) (21.2, 18.6)

(a) Use MINITAB to find the least-squares regression line using the best price as the response variable and list price as the explanatory variable.

(b) Use MINITAB to draw a scatter plot of the data.

(c) What is the value of the standard error of estimate?

(d) What is the value of the coefficient of determination r²? Of the correlation coefficient r?

(e) Use the least-squares model to predict the best price for a truck with a list price of $20,000. Note: Enter this value as 20 since X is assumed to be in thousands of dollars. Find a 95% confidence interval for the prediction.

2. Other MINITAB worksheets appropriate to use for simple linear regression are

Cricket Chirps Versus Temperature: Slr02.mtp
Source: The Song of Insects by Dr. G.W. Pierce, Harvard College Press
The chirps per second for the striped ground cricket are stored in C1; the corresponding temperature in degrees Fahrenheit is stored in C2.

Diameter of Sand Granules Versus Slope on a Beach: Slr03.mtp
Source: Physical Geography by A.M. King, Oxford Press
The median diameter (mm) of granules of sand is stored in C1; the corresponding gradient of beach slope in degrees is stored in C2.

National Unemployment Rate Male Versus Female: Slr04.mtp
Source: Statistical Abstract of the United States
The national unemployment rate for adult males is stored in C1; the corresponding unemployment rate for adult females for the same period of time is stored in C2.

The data in these worksheets are described in the Appendix of this Guide. Select these worksheets and repeat parts (a)–(d) of problem 1, using C1 as the explanatory variable and C2 as the response variable.

3. A psychologist interested in job stress is studying the possible correlation between interruptions and job stress. A clerical worker who is expected to type, answer the phone, and do reception work has many interruptions. A store manager who has to help out in various departments as customers make demands also has interruptions. An accountant who is given tasks to accomplish each day and who is not expected to interact with other colleagues or customers except during specified meeting times has few interruptions. The psychologist rated a group of jobs for interruption level. The results follow, with X being interruption level of the job on a scale of 1 to 20, with 20 having the most interruptions, and Y the stress level on a scale of 1 to 50, with 50 the most stressed.

Person   1   2   3   4   5   6   7   8   9  10  11  12
X        9  15  12  18  20   9   5   3  17  12  17   6
Y       20  37  45  42  35  40  20  10  15  39  32  25

(a) Enter the X values into C1 and the Y values into C2. Use the menu selections Stat > Basic Statistics > Display Descriptive Statistics on the two columns. What is the mean of the Y-values? Of the X-values? What are the respective standard deviations?

(b) Make a scatter plot of the data using the Stat > Regression > Fitted Line Plot menu selection. From the diagram do you expect a positive or negative correlation?


(c) Use the Stat > Basic Statistics > Correlation menu choices to get the value of r. Is this value consistent with your response in part (b)?

(d) Use the Stat > Regression > Regression menu choices with Y as the response variable and X as the explanatory variable. Use the [Options] button with predictions 5, 10, 15, 20 to get the predicted stress level of jobs with interruption levels of 5, 10, 15, and 20. Look at the 95% P.I. intervals. Which are the longest? Why would you expect these results? Find the standard error of estimate. Is R – sq equal to the square of r as you found in part (c)? What is the equation of the least-squares line?

(e) Redo the Stat > Regression > Regression menu option, this time using X as the response variable and Y as the explanatory variable. Is the equation different from that of part (d)? What about the value of the standard error of estimate (s on your output)? Did it change? Did R – sq change?

4. The researcher of problem 3 was able to add to her data. Another random sample of 11 people had their jobs rated for interruption level and were then evaluated for stress level.

Person  13  14  15  16  17  18  19  20  21  22  23
X        4  15  19  13  10   9   3  11  12  15   4
Y       20  35  42  37  40  23  15  32  28  38  12

Add these data to the data in problem 3, and repeat parts (a) through (e). Compare the values of s, the standard error of estimate, in part (d). Did more data tend to reduce the value of s? Look at the 95% P.I. intervals. How do they compare to the corresponding ones of problem 3? Are they shorter or longer? Why would you expect this result?

MULTIPLE REGRESSION (SECTION 10.4 OF UNDERSTANDABLE STATISTICS)

The Stat > Regression > Regression menu choices also do multiple regression.

Stat > Regression > Regression

Dialog Box Responses

Response: Enter the column number C# of the column containing the responses (that is Y values).

Predictor: Enter the column number C# of the columns containing the explanatory variables.

[Graphs]: Do not click on at this time.

[Results]: Click on and select the second option, Regression equation, etc.

[Options]: We will click on this option when we wish to do predictions for new variables.

[Storage]: Click on and select the fits and residual options if you wish.

Example

Bowman Brothers is a large sporting goods store in Denver that has a giant ski sale every year during the month of October. The chief executive officer at Bowman Brothers is studying the following variables regarding the ski sale:

X1 = Total dollar receipts from October ski sale

X2 = Total dollar amount spent advertising ski sale on local TV

X3 = Total dollar amount spent advertising ski sale on local radio

X4 = Total dollar amount spent advertising ski sale in Denver newspapers

Data for the past eight years is shown below (in thousands of dollars):

Year 1 2 3 4 5 6 7 8

X1 751 768 801 832 775 718 739 780

X2 19 23 27 32 25 18 20 24

X3 14 17 20 24 19 9 10 19

X4 11 15 16 18 12 5 7 14

(a) Enter the data in C1–C4. Name C1 = ‘X1’, C2 = ‘X2’, C3 = ‘X3’, C4 = ‘X4’. Use Stat > Basic Statistics > Display Descriptive Statistics to study the data.

(b) Next use the Stat > Basic Statistics > Correlation menu option to see the correlation between each pair of columns of data.
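The pairwise correlations of part (b) can also be sketched outside MINITAB. The following Python example is an illustration under stated assumptions: the names `ski`, `pearson_r`, and `corr` are hypothetical, and the layout of the printed table is not meant to match the MINITAB display.

```python
import math

# Ski sale data from the text (thousands of dollars), years 1 through 8.
ski = {
    "X1": [751, 768, 801, 832, 775, 718, 739, 780],
    "X2": [19, 23, 27, 32, 25, 18, 20, 24],
    "X3": [14, 17, 20, 24, 19, 9, 10, 19],
    "X4": [11, 15, 16, 18, 12, 5, 7, 14],
}

def pearson_r(x, y):
    """Pearson product moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / math.sqrt(ss_x * ss_y)

names = list(ski)
corr = {(p, q): pearson_r(ski[p], ski[q]) for p in names for q in names}

# Print the full correlation matrix, one row per variable.
print("    " + "  ".join(f"{q:>6}" for q in names))
for p in names:
    print(f"{p:<4}" + "  ".join(f"{corr[(p, q)]:6.3f}" for q in names))
```

High correlations among the advertising variables themselves are worth noting before fitting a multiple regression model, since they affect how the individual coefficients should be interpreted.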


(c) Finally, we use Stat > Regression > Regression. Use X1 as the response variable with 3 predictors X2, X3, X4. Use the [Options] button and select Prediction values 21, 11, 8 so that you can see the predicted value of X1 for X2 = 21, X3 = 11, and X4 = 8. For this regression model, note the least-squares equation, the standard error of estimate, and the coefficient of multiple determination R – sq. Look at the P values of the coefficients. Remember we are testing the null hypothesis H0: β1 = 0 against the alternate hypothesis H1: β1 ≠ 0. A P value less than α is evidence to reject H0.
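To see what the multiple regression computation involves, the least-squares coefficients can be obtained by solving the normal equations (XᵀX)b = Xᵀy. The sketch below is one possible illustration in plain Python, not a description of how MINITAB computes its results; all function and variable names are assumptions, and no expected output is quoted because the MINITAB display is the reference for the numerical values.

```python
import math

# Ski sale data from the text (thousands of dollars), years 1 through 8.
x1 = [751, 768, 801, 832, 775, 718, 739, 780]   # response
x2 = [19, 23, 27, 32, 25, 18, 20, 24]
x3 = [14, 17, 20, 24, 19, 9, 10, 19]
x4 = [11, 15, 16, 18, 12, 5, 7, 14]

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_multiple_regression(response, predictors):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    rows = [[1.0] + [float(col[i]) for col in predictors]
            for i in range(len(response))]
    k = len(rows[0])
    xtx = [[sum(row[a] * row[b] for row in rows) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(rows, response)) for a in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, values):
    """Evaluate b0 + b1*v1 + ... + bk*vk for one set of predictor values."""
    return coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], values))

coefficients = fit_multiple_regression(x1, [x2, x3, x4])
fitted = [predict(coefficients, vals) for vals in zip(x2, x3, x4)]
residuals = [y - f for y, f in zip(x1, fitted)]

mean_y = sum(x1) / len(x1)
sse = sum(e ** 2 for e in residuals)
sst = sum((y - mean_y) ** 2 for y in x1)
r_sq = 1 - sse / sst                                   # coefficient of multiple determination
s = math.sqrt(sse / (len(x1) - len(coefficients)))     # standard error of estimate

print("coefficients:", [round(b, 3) for b in coefficients])
print(f"R-sq = {100 * r_sq:.1f}%, s = {s:.3f}")
print(f"prediction for X2=21, X3=11, X4=8: {predict(coefficients, (21, 11, 8)):.2f}")
```

Solving the normal equations directly is adequate for a small, well-scaled data set like this one; statistical packages generally use more numerically stable decompositions.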


Note that we will not use the results of the Analysis of Variance.

LAB ACTIVITIES FOR MULTIPLE REGRESSION

Use the Section 10.4 problems 3–6. Each of these problems has a MINITAB worksheet stored on the CD-ROM.

Section 10.4 problem #3 (Systolic Blood Pressure Data)

MINITAB worksheet: Mlr02.mtp

Section 10.4 problem #4 (Test Scores for General Psychology)

MINITAB worksheet: Mlr03.mtp

Section 10.4 problem #5 (Hollywood Movies Data)

MINITAB worksheet: Mlr04.mtp

Section 10.4 problem #6 (All Greens Franchise Data)

MINITAB worksheet: Mlr05.mtp

Two additional case studies are available on the CD-ROM. The data are listed in the Appendix. For each of these studies, explore the relationships among the variables.

MINITAB WORKSHEET Mlr07.mtp

This is a case study of public health, income, and population density for small cities in eight midwestern states: Ohio, Indiana, Illinois, Iowa, Missouri, Nebraska, Kansas, and Oklahoma. The data is for a sample of 53 small cities in these states.

X1 = Death Rate per 1000 Residents

X2 = Doctor Availability per 100,000 Residents

X3 = Hospital Availability per 100,000 Residents

X4 = Annual per Capita Income in Thousands of Dollars

X5 = Population Density People per Square Mile

MINITAB WORKSHEET Mlr06.mtp

This is a case study of education, crime, and police funding for small cities in ten eastern and southeastern states. The states are New Hampshire, Connecticut, Rhode Island, Maine, New York, Virginia, North Carolina, South Carolina, Georgia, and Florida. The data is for a sample of 50 small cities in these states.

X1 = Total Overall Reported Crime Rate per 1 Million Residents

X2 = Reported Violent Crime Rate per 100,000 Residents

X3 = Annual Police Funding in Dollars per Resident

X4 = Percent of People 25 Years and Older that have had 4 years of High School

X5 = Percent of 16 to 19 Year-Olds Not in High School and Not High School Graduates

X6 = Percent of 18 to 24 Year-Olds Enrolled in College

X7 = Percent of People 25 Years and Older with at Least 4 Years of College


COMMAND SUMMARY

To Perform Simple or Multiple Regression

REGRESS C K C…C

does regression with the first column containing the response variable, K explanatory variables in the remaining columns. Following are some of the subcommands.

PREDICT E…E predicts the response variable for the given values of the explanatory variable(s).

RESIDUALS C stores the residuals in column C.

WINDOWS menu selection: Stat > Regression > Regression

Use the dialog box to list the response and explanatory (prediction) variables. Mark the residuals box. In the Options dialog box list the values of the explanatory variable(s) for which you wish to make a prediction. Select the P.I. confidence interval.

BRIEF K controls the amount of output for K = 0, 1, 2, 3 with 3 giving the most output. Default selection is K=2. This command is not available from a menu.

There are other subcommands for REGRESS. See the MINITAB Help for your release of MINITAB for a list of the subcommands and their descriptions.

To Find the Pearson Product Moment Correlation Coefficient

CORRELATION C…C calculates the correlation coefficient for all pairs of columns.

WINDOWS menu selection: Stat > Basic Statistics > Correlation

To Graph the Scatter Plot for Simple Regression

With GSTD use the PLOT C C command.

WINDOWS menu selection: Stat > Regression > Fitted Line Plot
