Practice Problems, Part I


Upload: hondafanatics

Post on 14-Apr-2017


Page 1: Practice Problems, Part I

Practice Problems for Stat 112 Midterm I

I. Multiple choice

The following three questions refer to the following data. Asking prices were collected from a random sample of classified advertisements for used Honda Accords being offered for sale in the spring of 2002. The asking price in thousands of dollars and the age of the vehicle in years are the data values. Assume that all the Honda Accords were comparable (that is, all were four-door sedans and had the same collection of accessories).

Bivariate Fit of Price By Age

[Figure: scatterplot of Price (vertical axis) versus Age (horizontal axis) with the fitted line.]

Linear Fit
Price = 15.188647 - 1.1970134 Age

Summary of Fit
RSquare                      0.886348
RSquare Adj                  0.881407
Root Mean Square Error       1.271025
Mean of Response             7.432
Observations (or Sum Wgts)   25

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model       1        289.77780       289.778   179.3730   <.0001
Error      23         37.15660         1.616
C. Total   24        326.93440

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   15.188647   0.632489      24.01   <.0001
Age         -1.197013   0.089376     -13.39   <.0001

[Figure: residual plot, Residual (vertical axis) versus Age (horizontal axis).]
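The entries in the JMP output above are tied together by a few identities, which can be useful when checking a reading of the tables. The following is a minimal Python sketch that uses only numbers copied from the output; the variable names are illustrative and are not part of JMP.

```python
import math

# Values copied from the Analysis of Variance and Parameter Estimates tables.
ss_model, ss_error, ss_total = 289.77780, 37.15660, 326.93440
df_error = 23
slope, se_slope = -1.197013, 0.089376

r_square = ss_model / ss_total   # share of variation in Price explained
mse = ss_error / df_error        # mean square error
rmse = math.sqrt(mse)            # root mean square error
f_ratio = ss_model / mse         # model mean square (1 df) over MSE
t_ratio = slope / se_slope       # t ratio for the Age coefficient

print(f"RSquare  {r_square:.6f}")
print(f"RMSE     {rmse:.6f}")
print(f"F Ratio  {f_ratio:.3f}")
print(f"t Ratio  {t_ratio:.2f}   (t squared: {t_ratio**2:.2f})")
```

In simple linear regression the F ratio equals the square of the slope's t ratio, up to rounding in the printed values.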

(i) What is the interpretation of the coefficient of Age in the fitted model?

(a) For each additional year of age, mean asking price given age declines by about $1271.
(b) For each additional year of age, mean asking price given age increases by about $1271.
(c) For each additional year of age, mean asking price given age declines by about $1197.
(d) For each additional year of age, mean asking price given age increases by about $1197.
(e) For each additional year of age, mean asking price given age increases by about $1519.

___

(ii) The R squared value indicates that

(a) approximately 89 percent of the asking price values are correctly predicted by the model.
(b) approximately 89 percent of the variation in asking price is explained by the linear regression of asking price on age.
(c) approximately 89 percent of the age values are correctly determined by the model.
(d) approximately 89 percent of the asking price values are within two root mean square errors of their predicted values.
(e) none of the above is correct.

___


(iii) The residual plot indicates nonlinearity. Which of the following transformations do you think would be the best to try?

(a) Transform age to log(age) and transform asking price to asking price squared.
(b) Transform age to age squared and transform asking price to log(asking price).
(c) Transform age to age squared and keep asking price as it is.
(d) Keep age as it is and transform asking price to asking price squared.
(e) Transform age to square root of age and keep asking price as it is.

(iv) It is easy to measure the “diameter at breast height” (x) of a tree. It’s hard to measure the total “above ground biomass” (y) of a tree, because to do this you must cut and weigh the tree. Ecologists commonly estimate the biomass using a transformed simple linear regression of ln y on ln x. Based on data on 378 trees in tropical rain forests, the following relationship between biomass y measured in kilograms and diameter x measured in centimeters was estimated by least squares

Approximately what would you predict the biomass of a tropical tree 40 centimeters in diameter to be?

(a) 7
(b) 95
(c) 1020
(d) 1,020,000

(v) The first scatterplot below (next page) plots, for each of the 50 states, the infant mortality rate (deaths per 1000) X in 1990 in the state vs. the percent of 18-year-olds in the state Y in 1990 that graduated from high school. The least squares line was fit to the data in the scatterplot and the residuals were computed. The second scatterplot below is a plot of the residuals versus the 1990 population in the state. This second scatterplot suggests

(a) high infant mortality rates imply low nutrition and hence higher drop-out rates later in life, but only for states with large populations
(b) high infant mortality rates imply low nutrition and hence higher drop-out rates later in life, but only for states with small populations
(c) population may be a lurking variable in understanding the association between infant mortality rate and percent graduating from high school
(d) none of the above.


Bivariate Fit of Percent Graduating from High School By Infant Mortalit

[Figure: scatterplot of Percent Graduating from High School (vertical axis) versus Infant Mortality Rate (horizontal axis).]

Bivariate Fit of Residuals Percent Graduating from High School By Popul

[Figure: scatterplot of Residuals Percent Graduating from High School (vertical axis) versus Population (horizontal axis).]

II. Short answer.

1. (16) It is extremely difficult to measure the volume of a child because of a child’s irregular shape but it is easy to measure the weight of a child. Regression analysis can be used to estimate a child’s volume based on a child’s weight. A study was done in which the volume, measured in cubic decimeters, of 18 children was determined by an elaborate procedure along with the children’s weights, measured in kilograms (Boyd, Human Biology, 1933). The data is analyzed below.


Bivariate Fit of Volume By Weight

[Figure: scatterplot of Volume (vertical axis) versus Weight (horizontal axis) with the fitted line.]

Linear Fit
Volume = -0.104046 + 0.9880519 Weight

Summary of Fit
RSquare                      0.993127
RSquare Adj                  0.992697
Root Mean Square Error       0.201749
Mean of Response             14.72222
Observations (or Sum Wgts)   18

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model       1        94.099871       94.0999   2311.895   <.0001
Error      16         0.651240        0.0407
C. Total   17        94.751111

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -0.104046   0.311998      -0.33   0.7431
Weight      0.9880519   0.020549      48.08   <.0001


[Figure: residual plot, Residual (vertical axis) versus Weight (horizontal axis).]

Distribution of Residuals

Distributions
Residuals Volume

[Figure: histogram of Residuals Volume.]

Moments
Mean             6.908e-16
Std Dev          0.1957249
Std Err Mean     0.0461328
upper 95% Mean   0.0973317
lower 95% Mean   -0.097332
N                18


(a) (4) Do the regression diagnostics indicate any problems with the ideal simple linear regression model holding for this data? Comment on both the residual plot and the histogram of the residuals.

For the remaining parts of the question, ignore any problems (if any) with the regression assumptions as indicated by the regression diagnostics. That is, go ahead and assume the ideal simple linear regression model holds (for the purposes of this question) even if you think the regression model should be improved.

(b) (3) Is there evidence that a child’s weight is a useful predictor of a child’s volume? Justify your answer.


(c) (3) Predict the volume of a child who weighs 25 kilograms using the ideal simple linear regression model. Why should you not trust this prediction?

(d) (3) A doctor needs to estimate the volume of a sick child weighing 14.0 kilograms in order to determine the appropriate treatment for the child. If there is more than a slight possibility that the child’s volume is less than 13.0 cubic decimeters, the doctor would like to use a treatment which is safer for children who have low volume. Which would be more useful to the doctor – a 95% confidence interval for the mean volume of the population of children who weigh 14 kilograms or a 95% prediction interval for the volume of a randomly chosen child who weighs 14 kilograms? Explain.
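The two intervals named in part (d) can be compared numerically from the printed output alone. The sketch below is one way to do that, under stated assumptions: the t critical value 2.120 (16 degrees of freedom) is taken from a standard t table rather than from the output, and the mean weight and the sum of squares Sxx are not printed by JMP but are backed out from the fitted line and from SS Model. All names are illustrative.

```python
import math

# Numbers copied from the JMP output for Volume on Weight.
b0, b1 = -0.104046, 0.9880519
rmse, n = 0.201749, 18
ss_model, mean_volume = 94.099871, 14.72222

# Assumption: 0.975 quantile of the t distribution with n - 2 = 16 df.
T_CRIT = 2.120

# Quantities implied by the output: the least squares line passes through
# (mean weight, mean volume), and SS Model = b1^2 * Sxx.
mean_weight = (mean_volume - b0) / b1
sxx = ss_model / b1**2


def intervals(weight):
    """Return (fit, CI half-width for the mean, PI half-width for one child)."""
    fit = b0 + b1 * weight
    spread = 1 / n + (weight - mean_weight) ** 2 / sxx
    ci_half = T_CRIT * rmse * math.sqrt(spread)
    pi_half = T_CRIT * rmse * math.sqrt(1 + spread)
    return fit, ci_half, pi_half


fit, ci_half, pi_half = intervals(14.0)
print(f"fit {fit:.3f}")
print(f"95% CI for mean volume:    {fit - ci_half:.3f} to {fit + ci_half:.3f}")
print(f"95% PI for a single child: {fit - pi_half:.3f} to {fit + pi_half:.3f}")
```

The prediction interval is always the wider of the two because it adds the child-to-child variation (the extra 1 under the square root) to the uncertainty in the estimated mean.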


(e) (3) Another doctor insisted on using a transformation of volume to log(volume), where log denotes the natural logarithm. The least squares line of the regression of log(volume) on weight is shown below. Assuming that the ideal simple linear regression holds for the regression of log(volume) on weight, predict the volume of a child who weighs 10 kilograms.

Transformed Fit Log
Log(Volume) = 1.5948941 + 0.0720483 Weight
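A prediction from a fit on the log scale has to be converted back to the original units. The following is a minimal sketch of that back-transformation, using the coefficients printed above; the function name is illustrative, and no bias correction is applied when exponentiating.

```python
import math

# Coefficients copied from the transformed fit: Log(Volume) = a + b * Weight.
a, b = 1.5948941, 0.0720483


def predict_volume(weight_kg):
    """Predict on the log scale, then exponentiate back to cubic decimeters."""
    log_volume = a + b * weight_kg
    return math.exp(log_volume)


print(f"{predict_volume(10):.2f}")
```

Because log here is the natural logarithm, `math.exp` is the matching inverse; a base-10 fit would need `10 ** log_volume` instead.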

2. (6) For each of the following situations, a simple linear regression has been carried out. State whether (i) the simple linear regression is well suited to answer the question of interest or (ii) the simple linear regression is not well suited to answer the question of interest and needs to be modified. If you answer (ii), state what the most salient problem with simple linear regression is and discuss briefly how you would try to fix it.


(a) (3) We want to predict a car’s fuel consumption from its speed. The scatterplot below shows data on the British Ford Escort.

Bivariate Fit of Fuel Used By Speed

[Figure: scatterplot of Fuel Used (vertical axis) versus Speed (horizontal axis) with the fitted line; observations 13 and 14 are labeled.]

Linear Fit
Fuel Used = 7.5102418 + 0.0186022 Speed

Summary of Fit
RSquare                      0.109542
RSquare Adj                  0.035337
Root Mean Square Error       2.309302
Mean of Response             9.091429
Observations (or Sum Wgts)   14

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1         7.872450       7.87245    1.4762   0.2477
Error      12        63.994521       5.33288
C. Total   13        71.866971

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   7.5102418   1.440329       5.21   0.0002
Speed       0.0186022   0.015311       1.21   0.2477

[Figure: residual plot, Residual (vertical axis) versus Speed (horizontal axis); observations 13 and 14 are labeled.]

Observations with Largest Cook’s Distances
Observation Number   Cook’s Distance   Leverage
13                   0.280             0.257
14                   0.083             0.204

(b) (3) One of the most dangerous contaminants deposited over European countries following the Chernobyl accident of April 1986 was radioactive cesium. To study cesium transfer from contaminated soil to plants, researchers collected soil samples and samples of mushroom mycelia from 17 wooded locations in Umbria, Central Italy, from August 1986 to November 1989. The researchers measured concentrations (Bq/kg) of cesium in the soil and in the mushrooms. The researchers’ goal is to predict Y = concentration in mushrooms based on X = concentration in soil. The output from a simple linear regression is shown below.

Bivariate Fit of MUSHROOM By SOIL

[Figure: scatterplot of MUSHROOM (vertical axis) versus SOIL (horizontal axis) with the fitted line; observations 16 and 17 are labeled.]

Linear Fit
MUSHROOM = 16.725686 + 0.0959027 SOIL

Summary of Fit
RSquare                      0.406386
RSquare Adj                  0.366812
Root Mean Square Error       36.56475
Mean of Response             44.58824
Observations (or Sum Wgts)   17

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1        13729.399       13729.4   10.2690   0.0059
Error      15        20054.718        1337.0
C. Total   16        33784.118

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   16.725686   12.41954       1.35   0.1981
SOIL        0.0959027   0.029927       3.20   0.0059

[Figure: residual plot, Residual (vertical axis) versus SOIL (horizontal axis); observations 16 and 17 are labeled.]

Observations with Largest Cook’s Distances
Observation Number   Cook’s Distance   Leverage
16                   0.081             0.082
17                   10.08             0.755
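One way to read the Cook's distance tables in parts (a) and (b) is to compare each row with rule-of-thumb cutoffs. The sketch below assumes two common conventions that are not stated in this handout: a Cook's distance above 1 marks an influential point, and a leverage above 3(p + 1)/n, with p = 1 predictor, marks a high-leverage point. Other texts use different cutoffs, so treat the thresholds and the function names as hypothetical.

```python
# Rule-of-thumb cutoff (an assumption; textbooks differ on the exact value).
COOKS_CUTOFF = 1.0


def leverage_cutoff(n, predictors=1):
    """High-leverage cutoff of 3 * (predictors + 1) / n (assumed convention)."""
    return 3 * (predictors + 1) / n


def flag(rows, n):
    """Map each observation number to the list of cutoffs it exceeds."""
    out = {}
    for number, cooks, leverage in rows:
        notes = []
        if cooks > COOKS_CUTOFF:
            notes.append("influential")
        if leverage > leverage_cutoff(n):
            notes.append("high leverage")
        out[number] = notes
    return out


# Rows copied from the two "Largest Cook's Distances" tables.
escort = flag([(13, 0.280, 0.257), (14, 0.083, 0.204)], n=14)
mushroom = flag([(16, 0.081, 0.082), (17, 10.08, 0.755)], n=17)
print("Ford Escort:", escort)
print("Mushrooms:  ", mushroom)
```

A flag from such a check is only a prompt to look at the scatterplot and residual plot; it does not by itself say how the regression should be modified.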


3. One general belief held by observers of the business world is that taller men earn more money than shorter men. In a University of Pittsburgh study (reported in the Wall Street Journal, December 30, 1986), 250 MBA graduates, all about 30 years old, were polled and asked to report their height (in inches) and their annual income (to the nearest $1,000). The JMP output from a simple linear regression of income on height is shown below.

Bivariate Fit of Income By Height

[Figure: scatterplot of Income (vertical axis) versus Height (horizontal axis) with the fitted line.]

Linear Fit
Income = 17.933325 + 0.6041112 Height

Summary of Fit
RSquare                      0.050545
RSquare Adj                  0.046717
Root Mean Square Error       8.282066
Mean of Response             59.588
Observations (or Sum Wgts)   250

Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Model        1          905.597       905.597   13.2025   0.0003
Error      248        17010.967        68.593
C. Total   249        17916.564

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept   17.933325   11.47593       1.56   0.1194     -4.669388   40.536039
Height      0.6041112   0.16626        3.63   0.0003     0.2766492   0.9315731

[Figure: residual plot, Residual (vertical axis) versus Height (horizontal axis).]

Distributions
Residuals Income

[Figure: distribution of Residuals Income with a Normal Quantile Plot.]

(a) (3) Do the regression diagnostics indicate any problems with the regression model? Comment on both the residual plot and the plots of the distribution of the residuals.


For the remaining parts of the question, ignore any problems (if any) with the regression assumptions as indicated by the regression diagnostics. That is, go ahead and assume that the simple linear regression model holds even if you think the regression model should be improved.

(b) (3) Do these data provide strong evidence that taller MBAs earn more money than shorter ones? State this question in terms of a hypothesis test and answer the question.


(c) (2) The root mean square error ( ) is left blank. What is a reasonable estimate for this value?

(i) 2.42
(ii) 8.29
(iii) 25.73
(iv) 321.46

(d) (3) Find a 95% confidence interval for the difference between the mean salaries of MBAs who are 6’0’’ tall and the mean salaries of MBAs who are 5’2’’ tall.
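Under the simple linear regression model, the difference between two mean incomes depends only on the slope, because the intercept cancels. The sketch below illustrates that scaling with the 95% limits for the Height coefficient printed in the Parameter Estimates table; the variable names are illustrative, and incomes are in thousands of dollars as in the output.

```python
# 95% limits for the Height coefficient, copied from the Parameter Estimates
# table (units: thousands of dollars per inch of height).
slope_lower, slope_upper = 0.2766492, 0.9315731

tall_inches = 6 * 12 + 0    # 6'0''
short_inches = 5 * 12 + 2   # 5'2''
gap = tall_inches - short_inches

# Mean(tall) - Mean(short) = slope * gap under the model, so the slope's
# interval is scaled by the gap in inches.
diff_lower = gap * slope_lower
diff_upper = gap * slope_upper
print(f"{gap} inches: {diff_lower:.3f} to {diff_upper:.3f} thousand dollars")
```

The same scaling works for any pair of heights, since only the gap between them enters the calculation.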

(e) (3) A group of 20 men who are five foot, two inches and 30 years old bring a class action suit against a large company, claiming that men this short are discriminated against because of their height. The court defines discrimination as occurring when the mean salary of a particular group (e.g., a group of a particular height, race or sex) is less than the mean salary of all comparable employees in the company. The group of short men takes a random sample of 100 30-year-old men in the company, records their height and salary and computes the least squares regression of salary on height. Based on their least squares regression, they compute a 95% confidence interval for the mean earnings of five foot, two inch 30-year-old men in the company of ($53,890, $56,940). It is known that the mean earnings of all 30-year-old men in the company is $60,000. The group of five foot, two inch men argues that this analysis provides strong evidence of discrimination because the upper end of the confidence interval is less than $60,000.

The company’s defense lawyers counter this argument by calculating the 95% prediction interval for the mean earnings of a 30-year old man in the company who is five foot, two inches, which turns out to be ($38,330, $70,580). The defense lawyers argue that because the upper end of the prediction interval is greater than $60,000, there is not strong evidence that men who are five feet, two inches are discriminated against.

Which side makes a more compelling argument – the group of five foot, two inch men or the company’s defense lawyers? Explain briefly.