Lecture 9: Explaining Variation in Y
BUEC 333 Summer 2009, Simon Woodcock
Explaining Variation in Y
We’ve said several times that the goal of regression analysis is to “explain” variation in the dependent variable $Y_i$ on the basis of variation in the independent variables $X_{1i}, X_{2i}, \ldots, X_{ki}$.
What does this mean?
And how do we know whether we’re doing a good job?
These are today’s topics.
The Total Sum of Squares
When we talk about the “variation” in $Y_i$ to be “explained” by the independent variables, we’re talking about how $Y_i$ varies around its mean. That is, we want to explain departures of $Y_i$ from its population mean $\mu_Y$.
Why? Because we can already “explain” the mean pretty well using the sample mean $\bar{Y}$.
Of course, we don’t know $\mu_Y$, so we look at departures of $Y_i$ from its sample mean (i.e., $Y_i - \bar{Y}$).
However, we always have
$$\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$$
(why?), so trying to “explain” the total of these departures is pretty useless.
Instead, we focus on what’s usually called the Total Sum of Squares (TSS),
$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
which isn’t zero unless there’s no variation in $Y_i$ at all.
When TSS is big, there is lots of variation in $Y_i$ around its mean, and this is what we want to explain using the independent variables.
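As a small numerical illustration of the two sums above, here is a minimal Python sketch. The five data values and the function name `total_sum_of_squares` are made up for this example; they are not from the lecture.

```python
def total_sum_of_squares(y):
    """Return TSS: the sum of squared deviations of y from its sample mean."""
    y_bar = sum(y) / len(y)
    return sum((y_i - y_bar) ** 2 for y_i in y)


# Hypothetical sample of five observations on Y (made-up numbers).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_bar = sum(y) / len(y)                    # sample mean = 4.0
deviations = [y_i - y_bar for y_i in y]    # departures from the sample mean

sum_of_deviations = sum(deviations)        # always zero, whatever the data
tss = total_sum_of_squares(y)              # 4 + 0 + 1 + 0 + 1 = 6

print(sum_of_deviations, tss)
```

The raw deviations cancel out because the sample mean is defined as the value that balances them; squaring removes the signs, so TSS is zero only when every observation equals the mean.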
The Decomposition of Variance
We can always write:
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) = (\hat{Y}_i - \bar{Y}) + e_i$$
(all I’ve done is add and subtract the predicted value $\hat{Y}_i$; draw a picture).
It follows that (you should be able to show this yourself, and I recommend you try it):
$$\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{TSS} = \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_{i=1}^{n} e_i^2}_{RSS}$$
where ESS is the explained sum of squares, and RSS is the residual sum of squares.
We’ve decomposed the total (squared) variation in $Y_i$ around its mean into a component that our regression model explains (ESS), and a component that our regression model cannot explain (RSS).
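The decomposition can be checked numerically. The sketch below assumes a regression with a single independent variable and an intercept, fitted by least squares using the usual closed-form formulas; the data and the helper name `simple_ols` are made up for illustration.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for the model y = b0 + b1*x + e."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data (made-up numbers).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = simple_ols(x, y)
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]               # predicted values
e = [yi - fi for yi, fi in zip(y, y_hat)]        # residuals

tss = sum((yi - y_bar) ** 2 for yi in y)         # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in y_hat)     # explained sum of squares
rss = sum(ei ** 2 for ei in e)                   # residual sum of squares

print(tss, ess + rss)   # the two agree up to floating-point rounding
```

For these numbers the fitted line is 2.2 + 0.6x, with ESS = 3.6 and RSS = 2.4, which add up to TSS = 6. The equality relies on the model being fitted by least squares with an intercept; for arbitrary predicted values the cross term does not vanish.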
The Proportion of Variance Explained: $R^2$
When we build a regression model, we frequently want to know how well it “fits” the data. Does our model do a good job of explaining the variation in $Y_i$?
We can use our decomposition of TSS into ESS and RSS to measure the proportion of the variation in $Y_i$ that is explained by our model.
We call the proportion of the variation in $Y_i$ that is explained by the regression model $R^2$:
$$R^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2}$$
Notice that $0 \le R^2 \le 1$.
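To make the formula concrete, the following Python sketch computes the proportion of variation explained in both equivalent ways, ESS/TSS and 1 - RSS/TSS. The data are made up, and the fitted values come from the line 2.2 + 0.6x, which is the least-squares line for these particular numbers; the function name `r_squared` is hypothetical.

```python
def r_squared(y, y_hat):
    """Return R^2 computed two ways: (ESS/TSS, 1 - RSS/TSS).

    The two values coincide when y_hat comes from a least-squares
    regression that includes an intercept.
    """
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ess = sum((fi - y_bar) ** 2 for fi in y_hat)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return ess / tss, 1.0 - rss / tss


# Hypothetical data (made-up numbers) and their least-squares fitted values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.2 + 0.6 * xi for xi in x]

r2_from_ess, r2_from_rss = r_squared(y, y_hat)
print(r2_from_ess, r2_from_rss)   # both are 0.6 up to rounding
```

Here the model explains 60% of the variation in the dependent variable around its mean.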
Using $R^2$ to Assess Model Fit
$R^2$ is a useful measure to assess how well our model “fits” the data, that is, how well it explains the variation in $Y_i$.
When $R^2 = 0$, the regression explains none of the variation in $Y_i$:
the regression model explains variation in $Y_i$ no better than the sample mean does (draw a picture).
When $R^2 = 1$, the regression explains all of the variation in $Y_i$:
this means there is an exact relationship between $Y_i$ and the independent variables (no errors; draw a picture).
Typically, we don’t encounter either of these extremes in real data (draw a picture).
Usually, bigger values of $R^2$ are “better” in the sense that our regression model does a “better” job of predicting $Y_i$,
but all it tells us is that there is a strong linear relationship between $Y_i$ and the independent variables; it doesn’t imply anything causal.
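In place of a picture, the two extreme cases can be reproduced with tiny made-up datasets. This is a sketch for a single independent variable with an intercept; the helper names `simple_ols` and `r_squared` and all data values are assumptions made for the example.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for the model y = b0 + b1*x + e."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x, y):
    """R^2 = 1 - RSS/TSS for the least-squares line of y on x."""
    b0, b1 = simple_ols(x, y)
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - rss / tss


x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Exact linear relationship (y = 1 + 2x): every residual is zero.
y_exact = [3.0, 5.0, 7.0, 9.0, 11.0]

# Values chosen so the sample covariance with x is exactly zero:
# the fitted line is flat at the sample mean of y.
y_unrelated = [3.0, 1.0, 2.0, 1.0, 3.0]

r2_exact = r_squared(x, y_exact)          # 1: all variation explained
r2_unrelated = r_squared(x, y_unrelated)  # 0: no better than the mean
print(r2_exact, r2_unrelated)
```

In the second dataset the dependent variable clearly depends on x (the pattern is U-shaped), yet the fit measure is zero, which illustrates that it only captures the linear relationship.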
More About $R^2$
How big should $R^2$ be to be confident in our model?
That depends on the context.
In wage regressions (regress wage on education, experience, etc.), there are so many things that affect what a person earns that are hard to measure (luck, ability, motivation, etc.) that we are happy when $R^2$ is above 0.4.
In “macro” or financial regressions (e.g., regress the unemployment rate on inflation, economic growth, etc.), we are suspicious if $R^2$ is below 0.9.
There is a temptation to build a model (i.e., choose your independent variables) to maximize $R^2$.
Avoid this temptation!
If you add another independent variable to your model, $R^2$ never decreases, even if the new variable has no “real” relationship with the dependent variable!
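The claim that adding a regressor never lowers the fit measure can be checked by simulation. The sketch below is an illustration under stated assumptions: the data are randomly generated (seeded for repeatability), the added regressor is pure noise by construction, and the helper `ols_fitted`, which solves the normal equations by Gauss-Jordan elimination, is a hypothetical teaching implementation rather than a production-quality solver.

```python
import random


def ols_fitted(columns, y):
    """Fitted values from least squares of y on an intercept plus `columns`.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination
    with partial pivoting.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return [sum(b * v for b, v in zip(beta, row)) for row in X]


def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - rss / tss


random.seed(0)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]
noise = [random.gauss(0, 1) for _ in range(n)]   # unrelated to y by construction

r2_small = r_squared(y, ols_fitted([x], y))          # y on x
r2_big = r_squared(y, ols_fitted([x, noise], y))     # y on x and noise
print(r2_small, r2_big)   # the second is never smaller than the first
```

The larger model can always reproduce the smaller one by setting the extra coefficient to zero, so least squares can only lower (or leave unchanged) the residual sum of squares.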
Motivating Adjusted $R^2$
There are other reasons to avoid building a model to maximize $R^2$:
Occam’s Razor: “one should not increase, beyond what is necessary, the number of entities required to explain anything” (all else equal, we prefer smaller, simpler models).
Losing degrees of freedom: a model’s degrees of freedom is the number of observations ($n$) minus the number of parameters you estimate ($k$ slope parameters + 1 intercept), that is, $n - k - 1$. When we add independent variables to the model, we lose degrees of freedom and (as we’ll see soon) our parameter estimates are less precise.
So if we add extra variables to the model, we need to trade off a better fit (in terms of $R^2$) against parsimony (having a small, simple model).
An alternative to $R^2$ that takes this into account is adjusted $R^2$.
Adjusted $R^2$
Another way to measure the quality of a model’s fit is adjusted $R^2$:
$$\bar{R}^2 = 1 - \frac{RSS/(n-k-1)}{TSS/(n-1)} = 1 - \frac{\sum_i e_i^2/(n-k-1)}{\sum_i (Y_i - \bar{Y})^2/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$
where $n - k - 1$ is the model’s degrees of freedom.
Adjusted $R^2$ (pronounced “R-bar squared”) penalizes for having lots of independent variables (or few degrees of freedom).
It can increase, decrease, or stay the same when we add an extra regressor to the model.
If we add an extra independent variable that is only weakly related to the dependent variable, adjusted $R^2$ will decrease.
Like $R^2$, adjusted $R^2$ is never greater than 1, but it is not necessarily positive (if $R^2$ is very close to zero, adjusted $R^2$ can be negative).
It’s not the “be all and end all”. To assess whether a regression model is “good” we need to look at plenty of other things: do regression coefficients have plausible sign and magnitude? Does the model give sensible predictions? Is it missing independent variables that we know matter? Etc.
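The adjustment is simple enough to compute by hand. The sketch below applies the formula 1 - (1 - R²)(n - 1)/(n - k - 1) to a few made-up combinations of fit, sample size, and number of regressors; the function name `adjusted_r_squared` and all input values are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k slope parameters (plus an intercept)."""
    dof = n - k - 1                      # the model's degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / dof


# A fit of 0.6 with n = 5 and k = 1 is penalized down to about 0.467.
adj_small_sample = adjusted_r_squared(0.6, n=5, k=1)

# With a fit very close to zero, the adjusted measure goes negative.
adj_negative = adjusted_r_squared(0.02, n=20, k=3)

# Adding a regressor that barely raises the fit (0.600 to 0.601, n = 30)
# lowers the adjusted measure.
adj_before = adjusted_r_squared(0.600, n=30, k=2)
adj_after = adjusted_r_squared(0.601, n=30, k=3)

print(adj_small_sample, adj_negative, adj_before, adj_after)
```

In the last comparison the tiny improvement in fit does not compensate for the lost degree of freedom, so the adjusted measure falls from roughly 0.570 to roughly 0.555.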