TRANSCRIPT
Part 3: Regression and Correlation3-1/41
Regression Models
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Part 3: Regression and Correlation3-2/41
Regression and Forecasting Models
Part 3 – Model Fit and Correlation
Part 3: Regression and Correlation3-3/41
Correlation and Linear Association
Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.
Ht.  Inc.    Ht.  Inc.    Ht.  Inc.
70   2990    68   2910    75   3150
67   2870    66   2840    68   2860
69   2950    71   3180    69   2930
70   3140    68   3020    76   3210
65   2790    73   3220    71   3180
73   3230    73   3370    66   2670
64   2880    70   3180    69   3050
70   3140    71   3340    65   2750
69   3000    69   2970    67   2960
73   3170    73   3240    70   3050
Correlation = 0.845
Part 3: Regression and Correlation3-4/41
Correlation Coefficient for Two Variables
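$$r_{xy} = \text{Correlation}(x,y) = \frac{\text{Sample Cov}[x,y]}{[\text{Sample standard deviation}(x)]\,[\text{Sample standard deviation}(y)]}$$
$$r_{xy} = \frac{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(y_i-\bar{y})^2}}$$
$$-1 \le r_{xy} \le 1$$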
Part 3: Regression and Correlation3-5/41
Correlation and Linear Association
Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86. (Same 30 observations as on slide 3-3.)
Standard Deviation Height = 2.978
Standard Deviation Income = 176.903
Covariance of Height and Income = 445.034
Correlation = 445.034 / (2.978 × 176.903) = 0.845
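The arithmetic on this slide can be reproduced from the 30 height and income pairs listed on slide 3-3. The following is a minimal sketch in plain Python; the function names `sample_cov`, `sample_sd`, and `correlation` are illustrative choices, not something the slides define.

```python
import math

# Height (inches) and income ($/month), transcribed pair by pair from slide 3-3.
height = [70, 68, 75, 67, 66, 68, 69, 71, 69, 70, 68, 76, 65, 73, 71,
          73, 73, 66, 64, 70, 69, 70, 71, 65, 69, 69, 67, 73, 73, 70]
income = [2990, 2910, 3150, 2870, 2840, 2860, 2950, 3180, 2930, 3140,
          3020, 3210, 2790, 3220, 3180, 3230, 3370, 2670, 2880, 3180,
          3050, 3140, 3340, 2750, 3000, 2970, 2960, 3170, 3240, 3050]

def sample_cov(x, y):
    """Sample covariance with the N-1 divisor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def sample_sd(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return math.sqrt(sample_cov(x, x))

def correlation(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))

sd_height = sample_sd(height)
sd_income = sample_sd(income)
cov_hi = sample_cov(height, income)
r = correlation(height, income)

print(f"sd(height) = {sd_height:.3f}")
print(f"sd(income) = {sd_income:.3f}")
print(f"cov        = {cov_hi:.3f}")
print(f"r          = {r:.3f}")
```

The printed values should agree with the figures reported on the slide to three decimals.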
Part 3: Regression and Correlation3-6/41
Sample Correlation Coefficients
[Figure: Scatterplot of Overseas vs Domestic. r_xy = 0.723]
[Figure: Fitted Line Plot, GDPC = 19826 - 34508 GINI. S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%. r_xy = -.402]
[Figure: Scatterplot of C5 vs C6. r_xy = +1.000]
[Figure: fourth scatterplot. r_xy = -.06 (close to 0)]
Part 3: Regression and Correlation3-7/41
Inference About a Correlation Coefficient
Population Parameter: $\rho$ = Correlation of y and x
Estimator: $r$ = Sample Correlation = $\dfrac{s_{xy}}{s_x s_y}$
Sampling standard error: $s_r = \sqrt{\dfrac{1-r^2}{N-2}}$
Hypothesis: $H_0: \rho = 0$, $H_1: \rho \ne 0$
Test Statistic: $t = \dfrac{r - 0}{s_r}$; t statistic with N-2 D.F.
Rejection Region: |t| > Critical Value from Table
Note for later.
Equivalent test: $H_0: \rho^2 = 0$, $H_1: \rho^2 \ne 0$
Test statistic: $t^2 = \dfrac{(r-0)^2}{s_r^2} = \dfrac{r^2/1}{(1-r^2)/(N-2)}$
Part 3: Regression and Correlation3-8/41
Correlation and Linear Association
Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86. (Same 30 observations as on slide 3-3.)
Correlation = 0.845
t = 0.845 / sqrt((1 - 0.845²)/(30 - 2)) = 8.361
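The t statistic on this slide follows directly from r and N. A small sketch, assuming only the formula for the sampling standard error of r given on slide 3-7; the function name `correlation_t_stat` is a hypothetical helper.

```python
import math

def correlation_t_stat(r, n):
    """t statistic for H0: rho = 0, to be compared with a t table at n - 2 degrees of freedom."""
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    return r / standard_error

# Values from the slide: r = 0.845 with N = 30 observations.
t = correlation_t_stat(0.845, 30)
print(f"t = {t:.3f}")
```

The slide reports t = 8.361, far beyond the usual critical values for 28 degrees of freedom, so the hypothesis of zero correlation is rejected.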
Part 3: Regression and Correlation3-9/41
Correlation is Not Causality
Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86. (Same 30 observations as on slide 3-3.)
Correlation = 0.845
Part 3: Regression and Correlation3-10/41
Linear regression is about correlation
[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]
[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]
The variables are highly correlated because the regression does a good job of predicting changes in the y variable associated with changes in the x variable.
Part 3: Regression and Correlation3-11/41
Regression Algebra
$$y_i = \hat{y}_i + e_i = \text{prediction} + \text{error}$$
$$y_i - \bar{y} = \hat{y}_i - \bar{y} + e_i$$
A few algebra steps later...
$$\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$
TOTAL = LARGE?? + small??
TOTAL = Regression + Residual
This is the analysis of (the) variance (of y); ANOVA
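The decomposition can be checked numerically. The sketch below fits a least squares line to the height and income data from slide 3-3 and confirms that the total sum of squares equals the regression plus residual sums of squares. The slope and intercept formulas are the standard least squares ones, which this slide does not show, and the variable names are illustrative.

```python
import math

# Height (inches) and income ($/month) from slide 3-3.
height = [70, 68, 75, 67, 66, 68, 69, 71, 69, 70, 68, 76, 65, 73, 71,
          73, 73, 66, 64, 70, 69, 70, 71, 65, 69, 69, 67, 73, 73, 70]
income = [2990, 2910, 3150, 2870, 2840, 2860, 2950, 3180, 2930, 3140,
          3020, 3210, 2790, 3220, 3180, 3230, 3370, 2670, 2880, 3180,
          3050, 3140, 3340, 2750, 3000, 2970, 2960, 3170, 3240, 3050]

n = len(height)
x_bar = sum(height) / n
y_bar = sum(income) / n

# Standard least squares slope and intercept (assumed, not stated on the slide).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, income))
s_xx = sum((x - x_bar) ** 2 for x in height)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in height]
residuals = [y - y_hat for y, y_hat in zip(income, predicted)]

total_ss = sum((y - y_bar) ** 2 for y in income)
regression_ss = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
residual_ss = sum(e ** 2 for e in residuals)

print(f"Total SS      = {total_ss:.1f}")
print(f"Regression SS = {regression_ss:.1f}")
print(f"Residual SS   = {residual_ss:.1f}")
print("Decomposition holds:",
      math.isclose(total_ss, regression_ss + residual_ss, rel_tol=1e-9))
print(f"Proportion explained = {regression_ss / total_ss:.3f}")
```

The proportion explained is the square of the correlation reported for these data, which anticipates the R-squared discussion on the following slides.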
Part 3: Regression and Correlation3-12/41
Variance Decomposition
Part 3: Regression and Correlation3-13/41
ANOVA Table
Part 3: Regression and Correlation3-14/41
Fit of the Model to the Data
The original question about the model fit to the data:
$$\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$
TOTAL = LARGE?? + small??
TOTAL = Regression + Residual
$$\frac{\text{TOTAL SS}}{\text{TOTAL SS}} = \frac{\text{Regression SS}}{\text{TOTAL SS}} + \frac{\text{Residual SS}}{\text{TOTAL SS}}$$
1 = Proportion Explained + Proportion Unexplained
Part 3: Regression and Correlation3-15/41
Explained Variation
The proportion of variation “explained” by the regression is called R-squared (R²)
It is also called the Coefficient of Determination (It is the square of something – to be shown later.)
Part 3: Regression and Correlation3-16/41
Movie Madness Fit
R²
Part 3: Regression and Correlation3-17/41
Pretty Good Fit: R² = .722
Regression of Fuel Bill on Number of Rooms
Part 3: Regression and Correlation3-18/41
Regression Fits
[Figure: Regression of Foreign Box Office on Domestic. Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%. R² = 0.522]
[Figure: Fitted Line Plot, G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%. R² = 0.880]
[Figure: panel with R² = 0.424]
[Figure: Scatterplot of Cost vs Output. R² = 0.924]
Part 3: Regression and Correlation3-19/41
R² = 0.338
R² is still positive even if the correlation is negative.
Part 3: Regression and Correlation3-20/41
R Squared Benchmarks
Aggregate time series: expect .9+
Cross sections: .5 is good. Sometimes we do much better.
Large survey data sets: .2 is not bad.
[Figure: Scatterplot of Cost vs Output. R² = 0.924 in this cross section.]
Part 3: Regression and Correlation3-21/41
R-Squared is r_xy²
R-squared is the square of the correlation between yi and the predicted yi, which is b0 + b1xi.
The correlation between yi and (b0 + b1xi) is the same as the correlation between yi and xi, apart from its sign when b1 is negative.
Therefore, R² = r_xy². A regression with a high R² predicts yi well.
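The argument on this slide can be written out in a few lines. This is a sketch that assumes the standard least squares slope $b_1 = s_{xy}/s_x^2$, which the slide does not state; $s_{xy}$, $s_x$, and $s_y$ are the sample covariance and standard deviations from slide 3-4.

```latex
% A linear transformation of x changes the correlation with y only through the sign of b_1.
\operatorname{Cov}(y,\, b_0 + b_1 x) = b_1 s_{xy}, \qquad
\operatorname{sd}(b_0 + b_1 x) = |b_1|\, s_x
\;\Longrightarrow\;
\operatorname{Corr}(y, \hat{y}) = \frac{b_1 s_{xy}}{|b_1|\, s_x s_y} = \operatorname{sign}(b_1)\, r_{xy} = |r_{xy}|

% Because \hat{y}_i - \bar{y} = b_1 (x_i - \bar{x}), the explained proportion is
R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}
    = \frac{b_1^2 s_x^2}{s_y^2}
    = \frac{s_{xy}^2}{s_x^2 s_y^2}
    = r_{xy}^2
```

The last equality in the first line holds because $b_1$ and $s_{xy}$ always have the same sign, which is also why R² stays positive when the correlation is negative.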
Part 3: Regression and Correlation3-22/41
Squared Correlations
[Figure: Scatterplot of Cost vs Output. r_xy² = .924]
[Figure: Scatterplot of Overseas vs Domestic. r_xy² = 0.522]
[Figure: Fitted Line Plot, GDPC = 19826 - 34508 GINI; S = 6574.43, R-Sq = 16.2%, R-Sq(adj) = 15.8%. r_xy² = .161]
Part 3: Regression and Correlation3-23/41
Regression Fits
[Figure: Scatterplot of SALARY vs YEARS. Regression of salary vs. years of experience.]
[Figure: Scatterplot of FUELBILL vs ROOMS. Regression of fuel bill vs. number of rooms for a sample of homes.]
Part 3: Regression and Correlation3-24/41
Is R² Large?
Is there really a relationship between x and y? We cannot be 100% certain. We can be “statistically certain” (within limits) by examining R². F is used for this purpose.
Part 3: Regression and Correlation3-25/41
The F Ratio
The original question about the model fit to the data:
$$\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} e_i^2$$
TOTAL = LARGE?? + small??
TOTAL = Regression + Residual
We would like the Regression SS to be large and the Residual SS to be small.
$$F = \frac{(N-2)\,\text{Regression SS}}{\text{Residual SS}} = \frac{(N-2)\sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} e_i^2} = \frac{(N-2)R^2}{1-R^2}$$
Part 3: Regression and Correlation3-26/41
Is R² Large?
Since F = (N-2)R²/(1 – R²), if R² is “large,” then F will be large.
For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
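As a quick illustration of the formula, the sketch below computes F from R² and N and compares it with the benchmark of 4. The inputs are the height and income figures from the earlier slides (r = 0.845, N = 30); the function name `f_ratio` is a hypothetical helper.

```python
def f_ratio(r_squared, n):
    """F statistic for a regression with one explanatory variable: (N-2) R^2 / (1 - R^2)."""
    return (n - 2) * r_squared / (1 - r_squared)

# Height and income example: r = 0.845, so R^2 = 0.845 squared, with N = 30.
r_squared = 0.845 ** 2
f = f_ratio(r_squared, 30)

print(f"R-squared = {r_squared:.3f}")
print(f"F         = {f:.2f}")
print("Larger than the benchmark of 4:", f > 4)
```

An R² of about 0.71 with only 30 observations already gives an F far above 4, so the fit is "large" in the statistical sense.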
Part 3: Regression and Correlation3-27/41
Movie Madness Fit
R²
F
Part 3: Regression and Correlation3-28/41
Why Use F and not R²?
When is R² “large”? We have no benchmarks to decide.
We have a table for F statistics to determine when F is statistically large: yes or no.
Part 3: Regression and Correlation3-29/41
F Table
The “critical value” depends on the number of observations. If F is larger than the value in the table, conclude that there is a “statistically significant” relationship.
There is a huge table on pages 826-833 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.
n2 is N-2
Part 3: Regression and Correlation3-30/41
Internet Buzz Regression
Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1
n2 is N-2
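The numbers in this output are tied together by the formulas from the earlier slides. The sketch below recomputes R² and F from the sums of squares in the Analysis of Variance table and compares F with the squared t ratio for Buzz. All inputs are copied from the output above; the variable names are illustrative.

```python
# Values copied from the regression output.
regression_ss = 7913.6
residual_ss = 10751.5
total_ss = 18665.1
residual_df = 60          # N - 2, with N = 62 observations
t_buzz = 6.65             # t ratio on the Buzz coefficient

# R-squared is the proportion of the total variation explained by the regression.
r_squared = regression_ss / total_ss

# F computed two ways: from the mean squares, and from R-squared.
f_from_ss = regression_ss / (residual_ss / residual_df)
f_from_r2 = residual_df * r_squared / (1 - r_squared)

print(f"R-squared       = {r_squared:.3f}")
print(f"F (from SS)     = {f_from_ss:.2f}")
print(f"F (from R-sq)   = {f_from_r2:.2f}")
print(f"t squared       = {t_buzz ** 2:.2f}")
```

The two F values agree with the 44.16 in the table. The squared t ratio differs slightly only because the reported t is rounded to two decimals.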
Part 3: Regression and Correlation3-31/41
Inference About a Correlation Coefficient
Population Parameter: $\rho$ = Correlation of y and x
Estimator: $r$ = Sample Correlation = $\dfrac{s_{xy}}{s_x s_y}$
Sampling standard error: $s_r = \sqrt{\dfrac{1-r^2}{N-2}}$
Hypothesis: $H_0: \rho = 0$, $H_1: \rho \ne 0$
Test Statistic: $t = \dfrac{r - 0}{s_r}$; t statistic with N-2 D.F.
Rejection Region: |t| > Critical Value from Table
Note for later.
Equivalent test: $H_0: \rho^2 = 0$, $H_1: \rho^2 \ne 0$
Test statistic: $t^2 = \dfrac{(r-0)^2}{s_r^2} = \dfrac{r^2/1}{(1-r^2)/(N-2)}$
This is F
Part 3: Regression and Correlation3-32/41
$135 Million
http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss
Klimt, to Ronald Lauder
Part 3: Regression and Correlation3-33/41
$100 Million … sort of
Stephen Wynn with a Prized Possession, 2007
Part 3: Regression and Correlation3-34/41
An Enduring Art Mystery
Why do larger paintings command higher prices?
The Persistence of Memory. Salvador Dali, 1931
The Persistence of Econometrics. Greene, 2011
Graphics show relative sizes of the two works.
Part 3: Regression and Correlation3-35/41
Part 3: Regression and Correlation3-36/41
Part 3: Regression and Correlation3-37/41
Monet in Large and Small
[Figure: Fitted Line Plot of ln (US$) vs ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%]
Log of $price = a + b log surface area + e
Sale prices of 328 signed Monet paintings
The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.
Part 3: Regression and Correlation3-38/41
The Data
[Figure: Histogram of ln (SurfaceArea)]
[Figure: Histogram of ln (US$)]
Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)
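The question in the note can be answered from the fitted line on slide 3-37, where the slope on ln (SurfaceArea) is 1.725. In a log-log model the slope acts as an elasticity, so a proportional change in size translates into a proportional change in predicted price. A minimal sketch, with `percent_premium` as a hypothetical helper name:

```python
import math

# Slope from the fitted line on slide 3-37: ln(US$) = 2.825 + 1.725 ln(SurfaceArea).
slope = 1.725

def percent_premium(size_increase, elasticity):
    """Proportional change in predicted price when surface area grows by size_increase."""
    # The change in ln(price) is elasticity * change in ln(area); exponentiate to undo the log.
    return math.exp(elasticity * math.log(1 + size_increase)) - 1

premium = percent_premium(0.10, slope)
print(f"A painting 10% larger is predicted to sell for about {premium:.1%} more.")
```

The common shortcut of multiplying the slope by the percentage change (1.725 × 10% = 17.25%) is close, but the exact calculation gives a somewhat larger figure because the relationship is multiplicative.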
Part 3: Regression and Correlation3-39/41
Application: Monet Paintings
Does the size of the painting really explain the sale prices of Monet’s paintings?
Investigate: Compute the regression.
Hypothesis: The slope is actually zero.
Rejection region: Slope estimates that are very far from zero.
The hypothesis that β = 0 is rejected.
Part 3: Regression and Correlation3-40/41
An Equivalent Test
Is there a relationship?
H0: No correlation
Rejection region: Large R².
Test: $F = \dfrac{(N-2)R^2}{1-R^2}$
Reject H0 if F > 4
Math result: F = t².
Degrees of Freedom for the F statistic are 1 and N-2
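The math result F = t² can be seen by squaring the t statistic for the correlation coefficient from slide 3-7. This sketch relies on the earlier result that R² equals the squared sample correlation $r^2$.

```latex
t = \frac{r}{\sqrt{(1-r^2)/(N-2)}}
\quad\Longrightarrow\quad
t^2 = \frac{r^2}{(1-r^2)/(N-2)}
    = \frac{(N-2)\,r^2}{1-r^2}
    = \frac{(N-2)\,R^2}{1-R^2}
    = F
```

For the height and income data, t = 8.361 gives t² of about 69.9, which matches F computed from R² = 0.845² with N = 30.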
Part 3: Regression and Correlation3-41/41
Monet Regression: There seems to be a regression. Is there a theory?