
Page 1: Tue Mar 16

Lecture 17: Tues., March 16

• Inference for simple linear regression (Ch. 7.3-7.4)

• R² statistic (Ch. 8.6.2)

• Association is not causation (Ch. 7.5.3)

• Next class: Diagnostics for assumptions of simple linear regression model (Ch. 8.2-8.3)

Page 2: Tue Mar 16

Regression

• Goal of regression: Estimate the mean response Y for subpopulations X=x, μ{Y|X}.

• Example: Y = catheter length required, X = height

• Simple linear regression model: μ{Y|X} = β₀ + β₁X; Y|X has a normal distribution with mean β₀ + β₁X and SD σ.

• Estimate β₀ and β₁ by least squares – choose β̂₀, β̂₁ to minimize the sum of squared residuals (prediction errors)
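The least-squares recipe above can be sketched in a few lines of code. This is a minimal illustration with toy numbers, not the car data from the slides:

```python
# Least-squares estimates of beta0 and beta1 (sketch; data are made up).
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope estimate beta1_hat
    b0 = ybar - b1 * xbar    # intercept estimate beta0_hat
    return b0, b1

# Toy example: y = 2 + 3x exactly, so the fit recovers those coefficients.
b0, b1 = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
print(b0, b1)  # -> 2.0 3.0
```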

Page 3: Tue Mar 16

Car Price Example

• A used-car dealer wants to understand how odometer reading affects the selling price of used cars.

• The dealer randomly selects 100 three-year-old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning.

• carprices.JMP contains the price and number of miles on the odometer of each car.

Page 4: Tue Mar 16

Bivariate Fit of Price By Odometer

[Scatterplot of Price (13500–16000) versus Odometer (15000–40000) with linear fit omitted.]

Linear Fit: Price = 17066.766 - 0.0623155 Odometer

Summary of Fit
RSquare                      0.650132
RSquare Adj                  0.646562
Root Mean Square Error       303.1375
Mean of Response             14822.82
Observations (or Sum Wgts)   100

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  17066.766   169.0246    100.97    <.0001
Odometer   -0.062315   0.004618    -13.49    <.0001
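Plugging an odometer reading into the fitted line from the JMP output above gives a point prediction of price; a quick sketch:

```python
# Fitted line from the JMP output: Price = 17066.766 - 0.0623155 * Odometer
def predicted_price(odometer):
    return 17066.766 - 0.0623155 * odometer

print(round(predicted_price(40000)))  # -> 14574
```

This is the same 14,574 point estimate that appears in the prediction slides later in the lecture.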

Page 5: Tue Mar 16

Inference for Simple Linear Regression

• Inference based on the ideal simple linear regression model holding.

• Inference based on taking repeated random samples (y₁, …, y_n) from the same subpopulations (x₁, …, x_n) as in the observed data.

• Types of inference:

– Hypothesis tests for intercept and slope
– Confidence intervals for intercept and slope
– Confidence interval for mean of Y at X=X₀
– Prediction interval for future Y for which X=X₀

Page 6: Tue Mar 16

Ideal Simple Linear Regression Model

• Assumptions of ideal simple linear regression model – There is a normally distributed subpopulation of

responses for each value of the explanatory variable– The means of the subpopulations fall on a straight-line

function of the explanatory variable.– The subpopulation standard deviations are all equal (to )– The selection of an observation from any of the

subpopulations is independent of the selection of any other observation.

Page 7: Tue Mar 16

Sampling Distributions of β̂₀ and β̂₁

• See handout.

• See Display 7.7

• SD(β̂₁) = σ·√(1/((n−1)s_x²))

• The standard deviation is smaller for (i) larger n, (ii) smaller σ, (iii) larger spread in x (higher s_x²)
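The formula SD(β̂₁) = σ·√(1/((n−1)s_x²)) from Display 7.7 can be sketched directly; the numbers below are illustrative, not from the car data:

```python
import math

# SD of the slope estimate: SD(beta1_hat) = sigma / sqrt((n - 1) * s_x^2).
# sigma, n, and s_x here are made-up illustrative values.
def sd_slope(sigma, n, s_x):
    return sigma / math.sqrt((n - 1) * s_x ** 2)

# Quadrupling n - 1 halves the standard deviation, all else equal.
print(sd_slope(1.0, 101, 2.0) / sd_slope(1.0, 26, 2.0))  # -> 0.5
```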

Page 8: Tue Mar 16

Hypothesis Tests for β₀ and β₁

• Hypothesis test of H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0

– Based on the t-test statistic, |t| = |β̂₁|/SE(β̂₁) = |Estimate|/SE(Estimate)
– The p-value has the usual interpretation: the probability under the null hypothesis that |t| would be at least as large as its observed value; a small p-value is evidence against the null hypothesis
– Interpretation of the null hypothesis: X is not a useful predictor of Y; the mean of Y is not associated with X.

• The test for H₀: β₀ = 0 vs. Hₐ: β₀ ≠ 0 is based on an analogous test statistic.

• Test statistics and p-values can be found on JMP output under Parameter Estimates, obtained by using Fit Line after Fit Y by X.

• For the car price data, there is convincing evidence that both the intercept and slope are not zero (p-value < .0001 for both).
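The t-ratio for the slope can be reproduced from the JMP output. A sketch, using the stdlib normal distribution as an approximation to the t distribution (with n − 2 = 98 df the two are very close):

```python
from statistics import NormalDist

# t-ratio for H0: beta1 = 0 is t = estimate / SE(estimate).
# Estimate and SE are taken from the JMP Parameter Estimates output.
estimate, se = -0.062315, 0.004618
t_ratio = estimate / se
# Two-sided p-value, normal approximation to the t(98) distribution.
p_value = 2 * NormalDist().cdf(-abs(t_ratio))

print(round(t_ratio, 2))  # -> -13.49
```

The p-value is far below .0001, matching the Prob>|t| column.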

Page 9: Tue Mar 16

Confidence Intervals for β₀ and β₁

• Confidence intervals provide a range of plausible values for β₀ and β₁.

• 95% Confidence Intervals:
β̂₀ ± t_{n−2}(.975)·SE(β̂₀) ≈ β̂₀ ± 2·SE(β̂₀)
β̂₁ ± t_{n−2}(.975)·SE(β̂₁) ≈ β̂₁ ± 2·SE(β̂₁)

• Table A.2 lists t_{n−2}(.975). It is approximately 2.

• Finding CIs in JMP: Go to Parameter Estimates, right click, click Columns and then click Lower 95% and Upper 95%.

• For the car price data set, the CIs are:
β₀: (16731, 17402)
β₁: (−0.071, −0.053)
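A quick sketch of the "estimate ± 2·SE" approximation for the slope; it is slightly wider than the slide's interval, which uses the exact t₉₈(.975) ≈ 1.98:

```python
# Approximate 95% CI: estimate +/- 2 * SE (t_{98}(.975) ~ 1.98, rounded to 2).
def approx_ci(estimate, se):
    return estimate - 2 * se, estimate + 2 * se

# Slope estimate and SE from the JMP output for the car price data.
lo, hi = approx_ci(-0.062315, 0.004618)
print(round(lo, 3), round(hi, 3))  # -> -0.072 -0.053
```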

Page 10: Tue Mar 16

Two prediction problems

a) The used-car dealer has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Ford Tauruses, all equipped with automatic transmission, air conditioning and AM/FM cassette tape players. All of the cars in this lot have about 40,000 miles on the odometer. The dealer would like an estimate of the average selling price of all cars in this lot (or, virtually equivalently, the average selling price of the population of Ford Tauruses with the above equipment and 40,000 miles on the odometer).

b) The used-car dealer is about to bid on a 3-year-old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car.

Page 11: Tue Mar 16

Prediction problem (a)

• Goal is to estimate the conditional mean of selling price given odometer reading = 40,000, μ{Y|X=40000}.

• Point estimate: μ̂{Y|X=40000} = β̂₀ + β̂₁·40000 = 14,574

• What is a range of plausible values for μ{Y|X=40000}?

Page 12: Tue Mar 16

Confidence Intervals for Mean of Y at X=X0

• What is a plausible range of values for μ{Y|X₀}?

• 95% CI for μ{Y|X₀}: μ̂{Y|X₀} ± t_{n−2}(.975)·SE[μ̂{Y|X₀}]

• μ̂{Y|X₀} = β̂₀ + β̂₁X₀,  SE[μ̂{Y|X₀}] = σ̂·√(1/n + (X₀ − X̄)²/((n−1)s_x²))

• Note about the formula:

– Precision in estimating μ{Y|X} is not constant for all values of X. Precision decreases as X₀ gets farther away from the sample average of the X's.

• JMP implementation: Use the Confid Curves Fit command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the confidence interval endpoints for a given X₀.
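The standard-error formula for the estimated mean can be sketched directly; the arguments below are illustrative toy values, not the car data:

```python
import math

# SE of the estimated mean of Y at X = X0:
# SE = sigma_hat * sqrt(1/n + (x0 - xbar)^2 / ((n - 1) * s_x^2))
def se_mean(sigma_hat, n, x0, xbar, s_x):
    return sigma_hat * math.sqrt(
        1 / n + (x0 - xbar) ** 2 / ((n - 1) * s_x ** 2)
    )

# Precision is best at X0 = xbar and worsens as X0 moves away (toy numbers).
print(se_mean(1.0, 100, 0.0, 0.0, 1.0) < se_mean(1.0, 100, 3.0, 0.0, 1.0))  # -> True
```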

Page 13: Tue Mar 16

Prediction Problem (b)

• Goal is to predict the selling price of a given car with odometer reading = 40,000.

• What are likely values for a future value Y₀ at some specified value of X (=X₀)?

• Best prediction is the estimated mean response for X=X₀: μ̂{Y|X=40000} = β̂₀ + β̂₁·40000 = 14,574

• A prediction interval is an interval of likely values along with a measure of the likelihood that the interval will contain the response.

• 95% prediction interval for X₀: If repeated samples (x₁, …, x_n) are obtained from the subpopulations and a prediction interval is formed, the prediction interval will contain the value of Y₀ for a future observation from the subpopulation X₀ 95% of the time.

Page 14: Tue Mar 16

Prediction Intervals Cont.

• A prediction interval must account for two sources of uncertainty:

– Uncertainty about the location of the subpopulation mean μ{Y|X₀}
– Uncertainty about where the future value will be in relation to its mean

• Prediction: Pred{Y|X₀} = μ̂{Y|X₀}

• Prediction Error = Random Sampling Error + Estimation Error:
Y₀ − μ̂{Y|X₀} = [Y₀ − μ{Y|X₀}] + [μ{Y|X₀} − μ̂{Y|X₀}]

Page 15: Tue Mar 16

Prediction Interval Formula

• 95% prediction interval at X₀: μ̂{Y|X₀} ± t_{n−2}(.975)·√(σ̂² + SE[μ̂{Y|X₀}]²)

• Compare to the 95% CI for the mean at X₀: μ̂{Y|X₀} ± t_{n−2}(.975)·SE[μ̂{Y|X₀}]

– The prediction interval is wider due to random sampling error in the future response
– As the sample size n becomes large, the margin of error of the CI for the mean goes to zero, but the margin of error of the PI does not.

• JMP implementation: Use the Confid Curves Indiv command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the prediction interval endpoints for a given X₀.
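The extra σ̂² term under the square root is what makes the prediction interval wider than the CI for the mean. A sketch with t₉₈(.975) approximated by 2 and illustrative (not fitted) numbers:

```python
import math

# Half-widths of the two 95% intervals at X0, with t_{n-2}(.975) ~ 2.
def ci_half_width(se_mean):
    # CI for the mean uses only the SE of the estimated mean.
    return 2 * se_mean

def pi_half_width(sigma_hat, se_mean):
    # PI adds sigma_hat^2 under the square root, so it is always wider.
    return 2 * math.sqrt(sigma_hat ** 2 + se_mean ** 2)

# Illustrative values in the spirit of the car data (RMSE ~ 300).
print(pi_half_width(300.0, 35.0) > ci_half_width(35.0))  # -> True
```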

Page 16: Tue Mar 16

Bivariate Fit of Price By Odometer

[Scatterplot of Price (13500–16000) versus Odometer (15000–40000) with confidence and prediction bands omitted.]

95% Confidence Interval for μ{Y|X=40000}: (14514, 14653)
95% Prediction Interval for X = 40000: (13972, 15194)

Page 17: Tue Mar 16

R-Squared

• The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.

• Unitless measure of strength of relationship between x and y

• Total sum of squares = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)². Best sum of squared prediction errors without using x.

• Residual sum of squares = Σᵢ₌₁ⁿ resᵢ² = Σᵢ₌₁ⁿ (Yᵢ − β̂₀ − β̂₁Xᵢ)²

• R² = (Total sum of squares − Residual sum of squares) / Total sum of squares
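The R² definition above translates directly into code; toy data here, not the car data:

```python
# R^2 from sums of squares (sketch of the definition above).
def r_squared(y, fitted):
    ybar = sum(y) / len(y)
    total_ss = sum((yi - ybar) ** 2 for yi in y)          # total sum of squares
    resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual SS
    return (total_ss - resid_ss) / total_ss

# Toy data: fitted values reproduce y exactly except one point.
print(round(r_squared([1, 2, 3, 4], [1, 2, 3, 5]), 2))  # -> 0.8
```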

Page 18: Tue Mar 16

R-Squared Example

• R² = .6501. Read as "65.01 percent of the variation in car prices was explained by the linear regression on odometer."

Bivariate Fit of Price By Odometer

[Scatterplot of Price (13500–16000) versus Odometer (15000–40000) omitted.]

Summary of Fit
RSquare                      0.650132
RSquare Adj                  0.646562
Root Mean Square Error       303.1375
Mean of Response             14822.82
Observations (or Sum Wgts)   100

Page 19: Tue Mar 16

Interpreting R²

• R² takes on values between 0 and 1, with higher R² indicating a stronger linear association.

• If the residuals are all zero (a perfect fit), then R² is 1. If the least squares line has slope 0, R² will be 0.

• R² is useful as a unitless summary of the strength of linear association.

Page 20: Tue Mar 16

Caveats about R²

– R² is not useful for assessing model adequacy, i.e., whether the simple linear regression model holds (use residual plots) or whether or not there is an association (use the test of H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0).

– A good R² depends on the context. In precise laboratory work, R² values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of variation in the response, R² values of 50% may be considered remarkably good.

Page 21: Tue Mar 16

Association is not causation

• A high R² means that x has a strong linear relationship with y – there is a strong association between x and y. It does not imply that x causes y.

• Alternative explanations for a high R²:

– The reverse is true: Y causes X.
– There may be a lurking (confounding) variable related to both x and y which is the common cause of x and y.

• No cause-and-effect relationship can be inferred unless X is randomly assigned to units in a randomized experiment.

• A researcher measures the number of television sets per person X and the average life expectancy Y for the world's nations. The regression line has a positive slope – nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?

Page 22: Tue Mar 16

Example

• A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community may be able to cover the costs of increased police protection by gains in tax revenues from higher property values. Data on the average housing price and crime rate (per 1000 population) for communities in Pennsylvania near Philadelphia for 1996 are shown in housecrime.JMP.

Page 23: Tue Mar 16

Bivariate Fit of HousePrice By CrimeRate

[Scatterplot of HousePrice (0–500000) versus CrimeRate (10–70) omitted.]

Summary of Fit
RSquare                      0.184229
RSquare Adj                  0.175731
Root Mean Square Error       78861.53
Mean of Response             158464.5
Observations (or Sum Wgts)   98

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  225233.55   16404.02    13.73     <.0001
CrimeRate  -2288.689   491.5375    -4.66     <.0001

[Plot of Residual (−100000–300000) versus CrimeRate and histograms of Residuals and HousePrice omitted.]

Page 24: Tue Mar 16

Questions

1. Can you deduce a cause-and-effect relationship from these data? What are other explanations for the association between housing prices and crime rate other than that high crime rates cause low housing prices?

2. Does the ideal simple linear regression model appear to hold?