Lecture 9: Explaining Variation in Y

BUEC 333, Summer 2009, Simon Woodcock

Explaining Variation in Y

We’ve said several times that the goal of regression analysis is to “explain” variation in the dependent variable $Y_i$ on the basis of variation in the independent variables $X_{1i}, X_{2i}, \ldots, X_{ki}$.

What does this mean? And how do we know whether we’re doing a good job? These are today’s topics.

The Total Sum of Squares

When we talk about the “variation” in $Y_i$ to be “explained” by the independent variables, we’re talking about how $Y_i$ varies around its mean. That is, we want to explain departures in $Y_i$ from its population mean $\mu_Y$.

Why? Because we can already “explain” the mean pretty well using $\bar{Y}$.

Of course, we don’t know $\mu_Y$, so we look at departures in $Y_i$ from its sample mean (i.e., $Y_i - \bar{Y}$).

However, we always have

$$\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$$

(why?), so trying to “explain” the total of these departures is pretty useless.

Instead, we focus on what’s usually called the Total Sum of Squares (TSS),

$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

which isn’t zero unless there’s no variation in $Y_i$ at all.

When TSS is big, there is lots of variation in $Y_i$ around its mean, and this is what we want to explain using the independent variables.
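A quick numerical check can make the two sums concrete. The sketch below uses a made-up sample of five values of Y (the numbers are purely illustrative, not course data). The deviations from the sample mean add up to zero because the sample mean is defined as the total of the $Y_i$ divided by $n$, while the squared deviations do not.

```python
# Hypothetical sample of five observations on Y (made up for illustration).
y = [3.0, 5.0, 4.0, 8.0, 10.0]

n = len(y)
y_bar = sum(y) / n                        # sample mean of Y

deviations = [yi - y_bar for yi in y]     # Y_i - Y_bar for each observation
sum_of_deviations = sum(deviations)       # zero by construction of the mean
tss = sum(d ** 2 for d in deviations)     # total sum of squares

print(y_bar)               # 6.0
print(sum_of_deviations)   # 0.0
print(tss)                 # 34.0
```

With other data the sum of deviations may differ from zero by a tiny rounding error, so compare it to zero with a tolerance rather than with `==`.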

The Decomposition of Variance

We can always write:

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) = (\hat{Y}_i - \bar{Y}) + e_i$$

(all I’ve done is add and subtract the predicted value $\hat{Y}_i$; draw a picture).

It follows that (you should be able to show this yourself, and I recommend you try it):

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} e_i^2$$

$$TSS = ESS + RSS$$

where ESS is the explained sum of squares, and RSS is the residual sum of squares.

We’ve decomposed the total (squared) variation in $Y_i$ around its mean into a component that our regression model explains (ESS), and a component that our regression model cannot explain (RSS).
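If you want to check your own attempt, here is a sketch of the step from $Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + e_i$ to TSS = ESS + RSS. It assumes the model is estimated by OLS with an intercept, so that the residuals satisfy $\sum_i e_i = 0$ and $\sum_i X_{ji} e_i = 0$ for every regressor $j$. Those two conditions are an assumption brought in here; they are not stated on this slide.

```latex
\begin{align*}
\sum_{i=1}^{n} (Y_i - \bar{Y})^2
  &= \sum_{i=1}^{n} \left[ (\hat{Y}_i - \bar{Y}) + e_i \right]^2 \\
  &= \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} e_i^2
     + 2 \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y}) e_i \\
  &= ESS + RSS + 2 \sum_{i=1}^{n} \hat{Y}_i e_i - 2 \bar{Y} \sum_{i=1}^{n} e_i \\
  &= ESS + RSS
\end{align*}
```

The last line uses $\sum_i e_i = 0$ and $\sum_i \hat{Y}_i e_i = \hat{\beta}_0 \sum_i e_i + \sum_j \hat{\beta}_j \sum_i X_{ji} e_i = 0$, so the cross term vanishes.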

The Proportion of Variance Explained: $R^2$

When we build a regression model, we frequently want to know how well it “fits” the data. Does our model do a good job of explaining the variation in $Y_i$?

We can use our decomposition of TSS into ESS and RSS to measure the proportion of the variation in $Y_i$ that is explained by our model.

We call the proportion of the variation in $Y_i$ that is explained by the regression model $R^2$:

$$R^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2}$$

Notice that $0 \le R^2 \le 1$, because ESS and RSS are both non-negative and add up to TSS.
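To see the pieces fit together, here is a minimal sketch that fits a one-regressor OLS line to a small made-up data set and computes TSS, ESS, RSS and both forms of $R^2$. The data and the variable names are hypothetical, and the slope and intercept use the standard formulas for a regression with a single regressor and an intercept.

```python
# Hypothetical data: five observations on one regressor X and on Y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 4.0, 8.0, 10.0]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept for a one-regressor model with an intercept.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]          # predicted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]     # e_i

tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((fi - y_bar) ** 2 for fi in fitted)
rss = sum(e ** 2 for e in residuals)

r2_from_ess = ess / tss
r2_from_rss = 1 - rss / tss

print(round(tss, 4), round(ess, 4), round(rss, 4))     # 34.0 28.9 5.1
print(round(r2_from_ess, 4), round(r2_from_rss, 4))    # 0.85 0.85
```

In this made-up example the regression explains 85% of the variation in Y around its mean, and the two ways of computing $R^2$ agree because TSS = ESS + RSS.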

Using $R^2$ to Assess Model Fit

$R^2$ is a useful measure to assess how well our model “fits” the data, that is, how well it explains the variation in $Y_i$.

When $R^2 = 0$, the regression explains none of the variation in $Y_i$: the regression model explains variation in $Y_i$ no better than the sample mean does (draw a picture).

When $R^2 = 1$, the regression explains all of the variation in $Y_i$: this means there is an exact relationship between $Y_i$ and the independent variables (no errors; draw a picture).

Typically, we don’t encounter either of these extremes in real data (draw a picture).

Usually, bigger values of $R^2$ are “better” in the sense that our regression model does a “better” job of predicting $Y_i$. But all it tells us is that there is a strong linear relationship between $Y_i$ and the independent variables; it doesn’t imply anything causal.
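In place of a picture, the two extremes can be reproduced with made-up numbers. This sketch assumes a one-regressor OLS model with an intercept. The helper `r_squared` is a hypothetical name written for this illustration, and both data sets are constructed by hand: one lies exactly on a line, the other has zero sample covariance with the regressor.

```python
def r_squared(x, y):
    """R^2 from an OLS regression of y on an intercept and one regressor x."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [1.0 + 2.0 * xi for xi in x]     # exact linear relationship, no errors
y_unrelated = [2.0, 1.0, 4.0, 1.0, 2.0]    # zero sample covariance with x

print(r_squared(x, y_exact))       # 1.0
print(r_squared(x, y_unrelated))   # 0.0
```

In the second case the fitted slope is zero, so every predicted value equals the sample mean of Y and the regression does no better than the mean.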

More About $R^2$

How big should $R^2$ be to be confident in our model? That depends on the context.

In wage regressions (regress wage on education, experience, etc.), there are so many things that affect what a person earns that are hard to measure (luck, ability, motivation, etc.) that we are happy when $R^2$ is above 0.4.

In “macro” or financial regressions (e.g., regress the unemployment rate on inflation, economic growth, etc.), we are suspicious if $R^2$ is below 0.9.

There is a temptation to build a model (i.e., choose your independent variables) to maximize $R^2$. Avoid this temptation!

If you add another independent variable to your model, $R^2$ never decreases, even if the new variable has no “real” relationship with the dependent variable!
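The claim that $R^2$ never decreases can be checked by simulation. The sketch below is only an illustration under stated assumptions: the data are simulated, the variable `junk` is generated independently of Y, and `ols_r_squared` is a hypothetical helper that solves the OLS normal equations by Gauss-Jordan elimination so that the example needs nothing beyond the Python standard library.

```python
import random


def ols_r_squared(columns, y):
    """R^2 from an OLS regression of y on an intercept plus the given columns."""
    n = len(y)
    # Rows of the design matrix: [1, x_1i, ..., x_ki].
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(row[r] * row[c] for row in X) for c in range(p)]
         + [sum(row[r] * yi for row, yi in zip(X, y))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


random.seed(0)  # fixed seed so the simulated data are reproducible
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, noise)]
junk = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y by construction

r2_small = ols_r_squared([x], y)         # regress y on x only
r2_big = ols_r_squared([x, junk], y)     # add the unrelated regressor

print(r2_small, r2_big)
print(r2_big >= r2_small)
```

The second regression can always reproduce the first by setting the coefficient on `junk` to zero, so its RSS cannot be larger and its $R^2$ cannot be smaller.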

Motivating Adjusted $R^2$

There are other reasons to avoid building a model to maximize $R^2$.

Occam’s Razor: “one should not increase, beyond what is necessary, the number of entities required to explain anything” (all else equal, we prefer smaller, simpler models).

Losing degrees of freedom: a model’s degrees of freedom is the number of observations ($n$) minus the number of parameters you estimate ($k$ slope parameters + 1 intercept), i.e., $n - k - 1$. When we add independent variables to the model, we lose degrees of freedom and (as we’ll see soon) our parameter estimates become less precise.

So if we add extra variables to the model, we need to trade off a better fit (in terms of $R^2$) against parsimony (having a small, simple model).

An alternative to $R^2$ that takes this into account is adjusted $R^2$.

Adjusted $R^2$

Another way to measure the quality of a model’s fit is adjusted $R^2$:

$$\bar{R}^2 = 1 - \frac{RSS/(\text{model degrees of freedom})}{TSS/(n-1)} = 1 - \frac{\sum_i e_i^2/(n-k-1)}{\sum_i (Y_i - \bar{Y})^2/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$

Adjusted $R^2$ (pronounced “R-bar-squared”) penalizes for having lots of independent variables (or few degrees of freedom).

It can increase, decrease, or stay the same when we add an extra regressor to the model.

If we add an extra independent variable that is only weakly related to the dependent variable, adjusted $R^2$ will decrease, because the small improvement in fit does not offset the lost degree of freedom.

Like $R^2$, adjusted $R^2$ is never greater than 1, but it is not necessarily positive (if $R^2$ is very close to zero, adjusted $R^2$ can be negative).

It’s not the “be all and end all”. To assess whether a regression model is “good” we need to look at plenty of other things: Do regression coefficients have plausible sign and magnitude? Does the model give sensible predictions? Is it missing independent variables that we know matter? Etc.
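The last form of the formula, $1 - (1 - R^2)(n-1)/(n-k-1)$, is easy to evaluate by hand or in code. The sketch below uses made-up values of $R^2$, $n$ and $k$ (they do not come from any real regression), and `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with n observations, k slopes and an intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: a third regressor nudges R^2 up from 0.500 to 0.505.
before = adjusted_r_squared(0.500, n=20, k=2)
after = adjusted_r_squared(0.505, n=20, k=3)
print(round(before, 3), round(after, 3))   # 0.441 0.412

# Hypothetical case: R^2 close to zero gives a negative adjusted R^2.
near_zero = adjusted_r_squared(0.02, n=20, k=3)
print(round(near_zero, 3))                 # -0.164
```

In the first case $R^2$ rises slightly but adjusted $R^2$ falls, which is the penalty for the lost degree of freedom at work.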
