Part 3: Regression and Correlation -1/41 Regression Models Professor William Greene Stern School of Business IOMS Department Department of Economics

Upload: coleen-gray

Post on 24-Dec-2015


Part 3: Regression and Correlation 3-1/41

Regression Models

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics


Regression and Forecasting Models

Part 3 – Model Fit and Correlation


Correlation and Linear Association

Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.

Ht.  Inc.    Ht.  Inc.    Ht.  Inc.
70   2990    68   2910    75   3150
67   2870    66   2840    68   2860
69   2950    71   3180    69   2930
70   3140    68   3020    76   3210
65   2790    73   3220    71   3180
73   3230    73   3370    66   2670
64   2880    70   3180    69   3050
70   3140    71   3340    65   2750
69   3000    69   2970    67   2960
73   3170    73   3240    70   3050

Correlation = 0.845


Correlation Coefficient for Two Variables

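$$
r_{xy} = \text{Correlation}(x,y)
= \frac{\text{Sample Cov}[x,y]}{[\text{Sample standard deviation}(x)]\,[\text{Sample standard deviation}(y)]}
= \frac{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(y_i-\bar{y})^2}}
$$

$$-1 \le r_{xy} \le 1$$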


Correlation and Linear Association

Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86. (Same 30 observations as on the first "Correlation and Linear Association" slide.)

Standard Deviation Height = 2.978
Standard Deviation Income = 176.903
Covariance of Height and Income = 445.034

Correlation = 445.034 / (2.978 x 176.903) = 0.845
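The arithmetic on this slide can be reproduced from the 30 observations. What follows is a minimal Python sketch using only the standard library; the names `height`, `income`, and `sample_cov` are illustrative and do not come from the slides. The data are read row by row from the table shown earlier.

```python
import math

# Height (inches) and income ($/mo.), read row by row from the slide's table.
height = [70, 68, 75, 67, 66, 68, 69, 71, 69, 70,
          68, 76, 65, 73, 71, 73, 73, 66, 64, 70,
          69, 70, 71, 65, 69, 69, 67, 73, 73, 70]
income = [2990, 2910, 3150, 2870, 2840, 2860, 2950, 3180, 2930, 3140,
          3020, 3210, 2790, 3220, 3180, 3230, 3370, 2670, 2880, 3180,
          3050, 3140, 3340, 2750, 3000, 2970, 2960, 3170, 3240, 3050]

def sample_cov(x, y):
    """Sample covariance with the N-1 divisor (hypothetical helper)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

s_x = math.sqrt(sample_cov(height, height))   # sample standard deviation of height
s_y = math.sqrt(sample_cov(income, income))   # sample standard deviation of income
s_xy = sample_cov(height, income)             # sample covariance
r = s_xy / (s_x * s_y)                        # sample correlation

print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}, s_xy = {s_xy:.3f}, r = {r:.3f}")
```

Up to rounding, these should agree with the standard deviations, covariance, and correlation reported on the slide.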


Sample Correlation Coefficients

[Figure: four scatterplot panels. Titled panels: Scatterplot of Overseas vs Domestic; Fitted Line Plot, GDPC = 19826 - 34508 GINI (S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%); Scatterplot of C5 vs C6.]

Reported correlations: rxy = 0.723 (Overseas vs Domestic); rxy = -.402 (GDPC vs GINI); rxy = +1.000; rxy = -.06 (close to 0)


Inference About a Correlation Coefficient

Population parameter: $\rho$ = correlation of y and x

Estimator: $r$ = sample correlation = $\dfrac{s_{xy}}{s_x s_y}$

Sampling standard error: $s_r = \sqrt{\dfrac{1-r^2}{N-2}}$

Hypothesis: $H_0: \rho = 0$, $H_1: \rho \neq 0$

Test statistic: $t = \dfrac{r-0}{s_r}$; t statistic with N-2 D.F.

Rejection region: |t| > critical value from table

Note for later.

Equivalent test: $H_0: \rho^2 = 0$, $H_1: \rho^2 \neq 0$

Test statistic: $t^2 = \dfrac{(r-0)^2}{s_r^2} = \dfrac{r^2/1}{(1-r^2)/(N-2)}$


Correlation and Linear Association

Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86. (Same 30 observations as on the first "Correlation and Linear Association" slide.)

Correlation = 0.845

t = .845 / sqrt((1 - .845²)/(30 - 2)) = 8.361
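As a quick check of this calculation, here is a small Python sketch. The function name `corr_t_stat` is an assumption made for illustration; the formula is the sampling standard error and t ratio from the "Inference About a Correlation Coefficient" slide.

```python
import math

def corr_t_stat(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    s_r = math.sqrt((1.0 - r ** 2) / (n - 2))  # sampling standard error of r
    return (r - 0.0) / s_r

t = corr_t_stat(0.845, 30)
print(f"t = {t:.3f} with {30 - 2} degrees of freedom")
```

The result is far above the usual two-sided 5% critical value for 28 degrees of freedom (roughly 2.05), so the hypothesis of zero correlation is rejected.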


Correlation is Not Causality

Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86. (Same 30 observations as on the first "Correlation and Linear Association" slide.)

Correlation = 0.845


Linear regression is about correlation

[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]

[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]

The variables are highly correlated because the regression does a good job of predicting changes in the y variable associated with changes in the x variable.


Regression Algebra

$$y_i = \hat{y}_i + e_i = \text{prediction} + \text{error}$$

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$$

A few algebra steps later...

$$\sum_{i=1}^{N}(y_i-\bar{y})^2 = \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{N}e_i^2$$

TOTAL = LARGE?? + small??

TOTAL = Regression + Residual

This is the analysis of (the) variance (of y); ANOVA
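The skipped algebra is short. Here is a sketch, assuming the fitted values come from least squares with an intercept, written here as $\hat{y}_i = a + b x_i$ (the symbols $a$ and $b$ are this note's notation for the intercept and slope):

```latex
\begin{aligned}
\sum_{i=1}^{N}(y_i-\bar{y})^2
  &= \sum_{i=1}^{N}\bigl[(\hat{y}_i-\bar{y}) + e_i\bigr]^2 \\
  &= \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{N}e_i^2
     + 2\sum_{i=1}^{N}(\hat{y}_i-\bar{y})\,e_i , \\[4pt]
\sum_{i=1}^{N}(\hat{y}_i-\bar{y})\,e_i
  &= (a-\bar{y})\sum_{i=1}^{N}e_i + b\sum_{i=1}^{N}x_i e_i = 0 .
\end{aligned}
```

The last line uses the two least squares normal equations, $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$. The cross term therefore vanishes, leaving TOTAL = Regression + Residual.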


Variance Decomposition


ANOVA Table


Fit of the Model to the Data

The original question about the model fit to the data:

$$\sum_{i=1}^{N}(y_i-\bar{y})^2 = \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{N}e_i^2$$

TOTAL = LARGE?? + small??

TOTAL = Regression + Residual

$$\frac{\text{TOTAL SS}}{\text{TOTAL SS}} = \frac{\text{Regression SS}}{\text{TOTAL SS}} + \frac{\text{Residual SS}}{\text{TOTAL SS}}$$

1 = Proportion Explained + Proportion Unexplained


Explained Variation

The proportion of variation “explained” by the regression is called R-squared (R²)

It is also called the Coefficient of Determination (It is the square of something – to be shown later.)
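To make the definition concrete, the sketch below fits a least squares line and computes the three sums of squares and R². The five (x, y) pairs are hypothetical numbers chosen only for illustration; they are not from the slides, and the variable names are likewise assumptions.

```python
# Hypothetical (x, y) pairs, used only to illustrate the arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

yhat = [a + b * xi for xi in x]                # predictions
resid = [yi - yh for yi, yh in zip(y, yhat)]   # residuals e_i

total_ss = sum((yi - ybar) ** 2 for yi in y)
regression_ss = sum((yh - ybar) ** 2 for yh in yhat)
residual_ss = sum(e ** 2 for e in resid)

r_squared = regression_ss / total_ss           # proportion explained
print(f"R-squared = {r_squared:.4f}")
print(f"explained + unexplained = {r_squared + residual_ss / total_ss:.4f}")
```

The second printed value should equal 1 up to floating point error, which is the "1 = Proportion Explained + Proportion Unexplained" identity.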


Movie Madness Fit

R²


Pretty Good Fit: R² = .722

Regression of Fuel Bill on Number of Rooms


Regression Fits

[Figure: four regression panels.]

Regression of Foreign Box Office on Domestic: Overseas = 6.693 + 1.051 Domestic (S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%). R² = 0.522

Fitted Line Plot: G = 1.928 + 0.000179 Income (S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%). R² = 0.880

Scatterplot of Cost vs Output. R² = 0.924

One further panel is labeled only R² = 0.424


R² = 0.338

R² is still positive even if the correlation is negative.


R Squared Benchmarks

Aggregate time series: expect .9+

Cross sections: .5 is good. Sometimes we do much better.

Large survey data sets: .2 is not bad.

[Figure: Scatterplot of Cost vs Output.] R² = 0.924 in this cross section.


R-Squared is rxy²

R-squared is the square of the correlation between yi and the predicted yi, which is a + bxi.

The correlation between yi and (a + bxi) equals the correlation between yi and xi in absolute value (the sign differs only when b is negative), so its square is rxy².

Therefore, a regression with a high R² predicts yi well.
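The claim can be shown in a few lines. This is a sketch, assuming the least squares slope is $b = s_{xy}/s_x^2$ and the fitted line passes through $(\bar{x}, \bar{y})$, so that $\hat{y}_i - \bar{y} = b\,(x_i - \bar{x})$:

```latex
R^2
= \frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}
= \frac{b^2\sum_{i=1}^{N}(x_i-\bar{x})^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}
= \frac{s_{xy}^2}{s_x^4}\cdot\frac{(N-1)\,s_x^2}{(N-1)\,s_y^2}
= \frac{s_{xy}^2}{s_x^2\,s_y^2}
= r_{xy}^2 .
```

Here $s_{xy}$, $s_x$, and $s_y$ are the sample covariance and standard deviations used in the definition of $r_{xy}$ on the earlier slide.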


Squared Correlations

[Figure: three panels.]

Scatterplot of Cost vs Output: rxy² = .924

Scatterplot of Overseas vs Domestic: rxy² = 0.522

Fitted Line Plot, GDPC = 19826 - 34508 GINI (S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%): rxy² = .161


Regression Fits

[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]

[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]


Is R² Large?

Is there really a relationship between x and y? We cannot be 100% certain. We can be “statistically certain” (within limits) by examining R². F is used for this purpose.


The F Ratio

The original question about the model fit to the data:

$$\sum_{i=1}^{N}(y_i-\bar{y})^2 = \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{N}e_i^2$$

TOTAL = LARGE?? + small??

TOTAL = Regression + Residual

We would like the Regression SS to be large and the Residual SS to be small.

$$F = \frac{(N-2)\,\text{Regression SS}}{\text{Residual SS}}
    = \frac{(N-2)\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N}e_i^2}
    = \frac{(N-2)R^2}{1-R^2}$$


Is R² Large?

Since F = (N-2)R²/(1 – R²), if R² is “large,” then F will be large.

For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
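As an illustration, the sketch below applies this formula to the height and income example from the earlier slides (r = 0.845, N = 30, so R² = 0.845²) and compares F with the benchmark of 4. The function name `f_ratio` is an assumption for this example.

```python
def f_ratio(r2, n):
    """F statistic for a model with one explanatory variable: (N-2) R^2 / (1 - R^2)."""
    return (n - 2) * r2 / (1.0 - r2)

r, n = 0.845, 30                 # correlation and sample size from the height/income slides
f = f_ratio(r ** 2, n)

# t statistic for the same correlation, from the inference slide.
t = r / ((1.0 - r ** 2) / (n - 2)) ** 0.5

print(f"F = {f:.2f}, larger than 4: {f > 4.0}")
print(f"t squared = {t ** 2:.2f}")
```

The two printed statistics coincide, which is the F = t² relationship noted later in the deck.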


Movie Madness Fit

R²

F


Why Use F and not R²?

When is R² “large”? We have no benchmarks to decide.

We have a table for F statistics to determine when F is statistically large: yes or no.


F Table

The “critical value” depends on the number of observations. If F is larger than the value in the table, conclude that there is a “statistically significant” relationship.

There is a huge table on pages 826-833 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.

n2 is N-2


Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1
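The entries in this output are tied together by the formulas from the earlier slides. The Python sketch below recomputes R², F, and S from the sums of squares in the ANOVA table. Variable names are illustrative, and small discrepancies are expected because the printed coefficients are rounded.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_reg, ss_res, ss_tot = 7913.6, 10751.5, 18665.1
df_res = 60
n = df_res + 2                        # N - 2 = 60 residual degrees of freedom

r2 = ss_reg / ss_tot                  # proportion explained (R-Sq)
ms_res = ss_res / df_res              # residual mean square
f_from_ms = ss_reg / ms_res           # regression has 1 d.f., so its MS equals its SS
f_from_r2 = (n - 2) * r2 / (1 - r2)   # same F computed from R-squared
s = ms_res ** 0.5                     # standard error of the regression (S)
t_buzz = 72.72 / 10.94                # Coef / SE Coef for Buzz

print(f"R-Sq = {100 * r2:.1f}%  F = {f_from_ms:.2f}  S = {s:.4f}")
print(f"t = {t_buzz:.2f}, t squared = {t_buzz ** 2:.2f}")
```

The squared t ratio for Buzz is close to F, as expected in a regression with a single explanatory variable.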

n2 is N-2


Inference About a Correlation Coefficient

Population parameter: $\rho$ = correlation of y and x

Estimator: $r$ = sample correlation = $\dfrac{s_{xy}}{s_x s_y}$

Sampling standard error: $s_r = \sqrt{\dfrac{1-r^2}{N-2}}$

Hypothesis: $H_0: \rho = 0$, $H_1: \rho \neq 0$

Test statistic: $t = \dfrac{r-0}{s_r}$; t statistic with N-2 D.F.

Rejection region: |t| > critical value from table

Note for later.

Equivalent test: $H_0: \rho^2 = 0$, $H_1: \rho^2 \neq 0$

Test statistic: $t^2 = \dfrac{(r-0)^2}{s_r^2} = \dfrac{r^2/1}{(1-r^2)/(N-2)}$

This is F


$135 Million

http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss

Klimt, to Ronald Lauder


$100 Million … sort of

Stephen Wynn with a Prized Possession, 2007


An Enduring Art Mystery

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

The Persistence of Econometrics. Greene, 2011

Graphics show relative sizes of the two works.


Monet in Large and Small

[Figure: Fitted Line Plot of ln (US$) against ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%]

Log of $price = a + b log surface area + e

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.


The Data

[Figure: Histogram of ln (SurfaceArea)]

[Figure: Histogram of ln (US$)]

Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)
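The question in parentheses can be answered from the fitted line reported for the Monet data, ln (US$) = 2.825 + 1.725 ln (SurfaceArea). The sketch below assumes that slope and treats it as an elasticity; the function name `pct_price_change` is hypothetical.

```python
SLOPE = 1.725   # coefficient on ln(SurfaceArea) from the fitted line plot

def pct_price_change(pct_area_change, slope=SLOPE):
    """Predicted % change in price for a given % change in area in a log-log model."""
    ratio = (1.0 + pct_area_change / 100.0) ** slope   # price ratio implied by the model
    return 100.0 * (ratio - 1.0)

premium = pct_price_change(10.0)
print(f"A painting 10% larger is predicted to sell for about {premium:.1f}% more")
```

For small changes, the slope times the percentage change (1.725 × 10 = 17.25%) gives nearly the same answer, which is why the slope in a log-log regression is read as a percentage response.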


Application: Monet Paintings

Does the size of the painting really explain the sale prices of Monet’s paintings?

Investigate: Compute the regression

Hypothesis: The slope is actually zero.

Rejection region: Slope estimates that are very far from zero.

The hypothesis that β = 0 is rejected


An Equivalent Test

Is there a relationship?

H0: No correlation

Rejection region: Large R².

Test: $F = \dfrac{(N-2)R^2}{1-R^2}$

Reject H0 if F > 4

Math result: F = t².

Degrees of Freedom for the F statistic are 1 and N-2


Monet Regression: There seems to be a regression. Is there a theory?