III. General Regression (transcript)
James B. McDonald
Brigham Young University
9/29/2010
III. Classical Normal Linear Regression Model Extended to the Case of k
Explanatory Variables
A. Basic Concepts
Let y denote an n × 1 vector of random variables, i.e., y = (y1, y2, . . ., yn)'.
1. The expected value of y is defined by

   E(y) = (E(y1), E(y2), . . ., E(yn))'.
2. The variance of the vector y is defined by

            | Var(y1)      Cov(y1, y2)   . . .  Cov(y1, yn) |
   Var(y) = | Cov(y2, y1)  Var(y2)       . . .  Cov(y2, yn) |
            | . . .                                         |
            | Cov(yn, y1)  Cov(yn, y2)   . . .  Var(yn)     |
NOTE: Let μ = E(y); then

   Var(y) = E[(y - μ)(y - μ)']

              | y1 - μ1 |
          = E |  . . .  | (y1 - μ1, . . ., yn - μn)
              | yn - μn |

            | E(y1 - μ1)²           E(y1 - μ1)(y2 - μ2)  . . .  E(y1 - μ1)(yn - μn) |
          = | E(y2 - μ2)(y1 - μ1)   E(y2 - μ2)²          . . .  E(y2 - μ2)(yn - μn) |
            | . . .                                                                 |
            | E(yn - μn)(y1 - μ1)   E(yn - μn)(y2 - μ2)  . . .  E(yn - μn)²         |

            | Var(y1)      Cov(y1, y2)   . . .  Cov(y1, yn) |
          = | Cov(y2, y1)  Var(y2)       . . .  Cov(y2, yn) | .
            | . . .                                         |
            | Cov(yn, y1)  Cov(yn, y2)   . . .  Var(yn)     |
3. The n × 1 vector of random variables, y, is said to be distributed as a multivariate
normal with mean vector μ and variance-covariance matrix Σ (denoted y ~
N(μ, Σ)) if the probability density function of y is given by

   f(y; μ, Σ) = exp[-(1/2)(y - μ)'Σ⁻¹(y - μ)] / [(2π)^(n/2) |Σ|^(1/2)].
Special case (n = 1): y = (y1), μ = (μ1), Σ = (σ²):

   f(y1; μ1, σ²) = exp[-(y1 - μ1)²/(2σ²)] / (2πσ²)^(1/2).
4. Some Useful Theorems
a. If y ~ N(μy, Σy), then z = Ay ~ N(μz = Aμy; Σz = AΣyA') where A is a
matrix of constants.
b. If y ~ N(0,I) and A is a symmetric idempotent matrix, then y'Ay ~ χ2(m)
where m = Rank(A) = trace (A).
c. If y ~ N(0,I) and L is a k x n matrix of rank k, then Ly and y'Ay are
independently distributed if LA = 0.
d. If y ~ N(0,I), then the idempotent quadratic forms y'Ay and y'By are
independently distributed χ2 variables if AB = 0.
NOTE:
(1) Proof of (a)
(2) Example: Let y1, . . ., yn denote a random sample drawn from
N(μ,ζ2).
The "Useful" Theorem 4.a implies that:

   ȳ = (1/n)y1 + . . . + (1/n)yn = (1/n, . . ., 1/n)y ~ N(μ, σ²/n).
Verify that
E(z) = E(Ay) = AE(y) = Aµy
VAR(z) = E[(z - E(z))(z - E(z))']
= E[(Ay - Aµy)(Ay - Aµy)']
= E[A(y - µy)(y - µy)'A']
= AE[(y - µy)(y - µy)']A'
= AΣyA' = Σz.
Here

   y = (y1, . . ., yn)' ~ N((μ, . . ., μ)', σ²I),

so with A = (1/n, . . ., 1/n):

   (a) Aμy = (1/n, . . ., 1/n)(μ, . . ., μ)' = μ

   (b) AΣyA' = σ²(1/n, . . ., 1/n)I(1/n, . . ., 1/n)' = σ²/n.
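Theorem 4.a can be checked numerically for the sample-mean example. The sketch below uses made-up values (μ = 3, σ² = 5, n = 4; all of these are assumptions, not from the notes) and exact rational arithmetic to form Aμy and AΣyA':

```python
from fractions import Fraction

# Illustrative values (assumptions, not from the notes).
n = 4
mu = Fraction(3)
sigma2 = Fraction(5)

A = [Fraction(1, n)] * n                       # A = (1/n, ..., 1/n), a 1 x n matrix
mu_y = [mu] * n                                # E(y) = (mu, ..., mu)'
Sigma_y = [[sigma2 if i == j else Fraction(0)  # Var(y) = sigma^2 I
            for j in range(n)] for i in range(n)]

# mean of z = Ay is A mu_y
mu_z = sum(a * m for a, m in zip(A, mu_y))

# variance of z = Ay is A Sigma_y A' (a 1 x 1 quantity)
A_Sigma = [sum(A[i] * Sigma_y[i][j] for i in range(n)) for j in range(n)]
var_z = sum(A_Sigma[j] * A[j] for j in range(n))

print(mu_z, var_z)   # 3 5/4, i.e., mu and sigma^2/n
```

The result is exactly μ and σ²/n, as Theorem 4.a predicts for this choice of A.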
B. The Basic Model
Consider the model defined by
(1) yt = β1xt1 + β2xt2 + . . . + βkxtk + εt (t = 1, . . ., n).
If we want to include an intercept, define xt1 = 1 for all t and we obtain
(2) yt = β1 + β2xt2 + . . . + βkxtk + εt.
Note that βi can be interpreted as the marginal impact of a unit increase in xi on the
expected value of y.
The error terms (εt) in (1) will be assumed to satisfy:
(A.1) εt is distributed normally
(A.2) E(εt) = 0 for all t
(A.3) Var(εt) = σ² for all t
(A.4) Cov(εt, εs) = 0 for t ≠ s.
Rewriting (1) for each t (t = 1, 2, . . ., n) we obtain
y1 = β1x11 + β2x12 + . . . + βkx1k + ε1
y2 = β1x21 + β2x22 + . . . + βkx2k + ε2
. . . .
. . . .
(3) . . . .
yn = β1xn1 + β2xn2 + . . . + βkxnk + εn.
The system of equations (3) is equivalent to the matrix representation
y = Xβ + ε
where the matrices y, X, β and ε are defined as follows:
        | y1 |         | x11 . . . x1k |        | β1 |        | ε1 |
   y =  | .  |,   X =  | x21 . . . x2k |,  β =  | .  |,  ε =  | .  |
        | .  |         |  .         .  |        | .  |        | .  |
        | yn |         | xn1 . . . xnk |        | βk |        | εn |
       (n×1)               (n×k)

Each column of X contains the n observations on one of the k individual variables;
each row contains the observations at a given point in time.

NOTE: (1) Assumptions (A.1)-(A.4) can be written much more compactly as

   (A.1)' ε ~ N(0; Σ = σ²I).

(2) The model to be discussed can then be summarized as

   y = Xβ + ε
   (A.1)' ε ~ N(0; Σ = σ²I)
   (A.5)' The xtj's are nonstochastic and lim(n→∞) X'X/n is nonsingular.
7 III
C. Estimation
We will derive the least squares, MLE, BLUE and instrumental variables estimators in
this section.
1. Least Squares:
The basic model can be written as
y = Xβ + ε
  = Xβ̂ + e = Ŷ + e

where Ŷ = Xβ̂ is an n×1 vector of predicted values for the dependent variable and
e denotes a vector of residuals or estimated errors.
The sum of squared errors is defined by

   SSE(β̂) = Σt et² = (e1, e2, . . ., en)(e1, e2, . . ., en)' = e'e
          = (y - Xβ̂)'(y - Xβ̂)
          = y'y - β̂'X'y - y'Xβ̂ + β̂'X'Xβ̂
          = y'y - 2β̂'X'y + β̂'X'Xβ̂.
The least squares estimator of β is defined as the β̂ which minimizes SSE(β̂). A
necessary condition for SSE(β̂) to be a minimum is that

   dSSE(β̂)/dβ̂ = 0

(see Appendix A for how to differentiate a real-valued function with respect to a
vector):

   dSSE(β̂)/dβ̂ = -2X'y + 2X'Xβ̂ = 0, or

   X'Xβ̂ = X'y        (Normal Equations)

   β̂ = (X'X)⁻¹X'y

is the least squares estimator. Note that β̂ is a vector of least squares estimators
of β1, β2, . . ., βk.
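As a concrete sketch (the data are invented, not from the notes), the normal equations can be solved directly for the intercept-plus-one-regressor case, where X'Xβ̂ = X'y is a 2×2 linear system in the sums of the data:

```python
# Invented data for y_t = b1 + b2 x_t + e_t, with X = [1 x].
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n = len(x)

# Elements of X'X and X'y reduce to sums.
sx = sum(x)
sxx = sum(v * v for v in x)
sy = sum(y)
sxy = sum(u * v for u, v in zip(x, y))

# Solve [n sx; sx sxx][b1; b2] = [sy; sxy] by Cramer's rule.
det = n * sxx - sx * sx
b1 = (sy * sxx - sx * sxy) / det   # intercept
b2 = (n * sxy - sx * sy) / det     # slope

print(b1, b2)   # 0.5 1.4
```

The same β̂ = (X'X)⁻¹X'y formula applies unchanged for any number of regressors; only the size of the linear system grows.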
2. Maximum Likelihood Estimation (MLE)

Likelihood Function: (Recall y ~ N(Xβ; Σ = σ²I))

   L(y; β, Σ = σ²I) = exp[-(1/2)(y - Xβ)'Σ⁻¹(y - Xβ)] / [(2π)^(n/2) |Σ|^(1/2)]

                    = exp[-(1/(2σ²))(y - Xβ)'(y - Xβ)] / [(2π)^(n/2) |σ²I|^(1/2)]

                    = exp[-(y - Xβ)'(y - Xβ)/(2σ²)] / [(2π)^(n/2) (σ²)^(n/2)].
The natural log of the likelihood function,

   ℓ = ln L = -(y - Xβ)'(y - Xβ)/(2σ²) - (n/2)ln(2π) - (n/2)ln(σ²)

            = -(1/(2σ²))(y'y - 2β'X'y + β'X'Xβ) - (n/2)ln(2π) - (n/2)ln(σ²),

is known as the log-likelihood function. ℓ is a function of β and σ².
The MLEs of β and σ², denoted β̃ and σ̃², are defined by the two equations
(necessary conditions for a maximum):

   ∂ℓ/∂β = -(1/(2σ²))(-2X'y + 2(X'X)β) = 0

   ∂ℓ/∂σ² = (y - Xβ)'(y - Xβ)/(2(σ²)²) - n/(2σ²) = 0

i.e.,

   β̃ = (X'X)⁻¹X'y

   σ̃² = (y - Xβ̃)'(y - Xβ̃)/n = e'e/n = Σt et²/n = SSE/n.

NOTE: (1) β̃ = β̂ (the MLE of β is the least squares estimator).

(2) σ̃² is a biased estimator of σ²; whereas,

   s² = e'e/(n - k) = (y - Xβ̂)'(y - Xβ̂)/(n - k) = SSE/(n - k)

is an unbiased estimator of σ². A proof of the unbiasedness of s² is given in
Appendix B. Only n - k of the estimated residuals are independent. The
necessary conditions for least squares estimates impose k restrictions on the
estimated residuals (e). The restrictions are summarized by the normal
equations X'Xβ̂ = X'y, or equivalently X'e = 0.

(3) Substituting σ² = SSE/n into the log-likelihood function yields what is
known as the concentrated log-likelihood function

   ℓc = -(n/2)[1 + ln(2π) + ln(SSE/n)],

which expresses the log-likelihood value as a function of β only. This
equation also clearly demonstrates the equivalence of maximizing ℓ and
minimizing SSE.
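A quick numerical sketch of notes (1)-(3), using invented data whose OLS fit is ŷ = 0.5 + 1.4x with k = 2: the MLE σ̃² = SSE/n is smaller than the unbiased s² = SSE/(n - k), and the concentrated log-likelihood equals the full log-likelihood evaluated at the MLE:

```python
import math

# Invented data; the OLS fit for these points is yhat = 0.5 + 1.4x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, k = len(x), 2

e = [yt - (0.5 + 1.4 * xt) for xt, yt in zip(x, y)]   # residuals
sse = sum(v * v for v in e)                           # SSE (about 0.2)

sigma2_mle = sse / n        # biased MLE estimator SSE/n
s2 = sse / (n - k)          # unbiased estimator SSE/(n - k)

# Full log-likelihood at the MLE vs the concentrated form.
ll_full = (-sse / (2 * sigma2_mle)
           - (n / 2) * math.log(2 * math.pi)
           - (n / 2) * math.log(sigma2_mle))
ll_conc = -(n / 2) * (1 + math.log(2 * math.pi) + math.log(sse / n))

print(round(sigma2_mle, 6), round(s2, 6), abs(ll_full - ll_conc) < 1e-9)
```

Here σ̃² = 0.05 versus s² = 0.1: with only n = 4 observations and k = 2 coefficients, dividing by n rather than n - k understates the error variance by half.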
3. BLUE Estimators of β, denoted β*.

We will demonstrate that assumptions (A.2)-(A.5) imply that the best
(least variance) linear unbiased estimator (BLUE) of β is the least squares
estimator. We first consider the desired properties and then derive the associated
estimator.

Linear:  β* = Ay, where A is a k×n matrix of constants.

Unbiased:  E(β*) = AE(y) = AXβ. We note that E(β*) = β requires AX = I.

Minimum Variance:

   Var(βi*) = Ai Var(y) Ai' = σ²AiAi'

where Ai = the ith row of A and βi* = Aiy.

Thus, the construction of the BLUE is equivalent to selecting the matrix A so that the
rows of A solve

   min AiAi'   i = 1, 2, . . ., k
   s.t. AX = I

or equivalently, min Var(βi*) s.t. AX = I (unbiased). The solution to this problem
is given by A = (X'X)⁻¹X'; hence, the BLUE of β is given by

   β* = Ay = (X'X)⁻¹X'y.

The details of this derivation are contained in Appendix C.

NOTE: (1) β* = β̂ = β̃ = (X'X)⁻¹X'y.

(2) AX = (X'X)⁻¹X'X = I; thus β* is unbiased.
4. Method of Moments Estimation
Method of moments parameter estimators are selected to equate sample and
corresponding theoretical moments. The open question is what theoretical
moments should be considered and what are the corresponding sample moments.
With the regression model we might consider the following theoretical moments
which follow from the underlying theoretical assumptions:
   (A.2) E(εt) = 0

   (A.5) Cov(Xit, εt) = 0.

The sample moment associated with (A.2) is

   Σt et / n = ē = 0.

The sample covariances can be written as

   Σt (Xit - X̄i)(et - ē)/n = Σt Xit et/n - X̄i ē = Σt Xit et/n = 0.

These sample moments can be summarized using matrix notation as follows:

             | 1    1    . . .  1   | | e1 |
   X'e/n  =  | x12  x22  . . .  xn2 | | .  | / n = 0,
             | .    .    . . .  .   | | .  |
             | x1k  x2k  . . .  xnk | | en |

which is equivalent to X'e = 0. These are also known as the normal equations in the
OLS framework and yield the OLS estimator by solving

   X'e = X'(Y - Xβ̂) = 0

for β̂.
5. Instrumental Variables Estimators
y = Xβ + ε
Let Z denote an n x k matrix of “instruments” or "instrumental" variables.
Consider the solution of the modified normal equations:

   Z'Y = Z'Xβ̂z; hence, β̂z = (Z'X)⁻¹Z'y.

β̂z is referred to as the instrumental variables estimator of β based on the
instrumental variables Z. Instrumental variables can be very useful if the
variables on the right hand side include "endogenous" variables or in the case of
measurement error. In this case OLS will yield biased and inconsistent
estimators; whereas, instrumental variables can yield consistent estimators.
NOTE: (1) The motivation for the selection of the instruments (Z) is
that the covariance of Z and ε approaches 0 and Z and X are
correlated. Thus Z'Y = Z'(Xβ + ε) = Z'Xβ + Z'ε ≈ Z'Xβ.

(2) If lim(n→∞) Z'X/n is nonsingular and plim Z'ε/n = 0, then
β̂z is a consistent estimator of β.
(3) Many calculate an R2 after instrumental variables
estimation using the formula R2 = 1 – SSE/SST. Since this
can be negative, there is not a natural interpretation of R2
for instrumental variables estimators. Further, the R2 can’t
be used to construct F-statistics for IV estimators.
(4) If Z includes "weak" instruments (weakly correlated
with the X's), then the variances of the IV estimator can
be large and the corresponding asymptotic biases can be
large if the Z and error are correlated. This can be
seen by noting that the bias of the instrumental variables
estimator is given by

   E(β̂z) - β = E[(Z'X/n)⁻¹(Z'ε/n)].

(5) As a special case, if Z = X, then β̂z = β̂ = β̃ = β*.
(6) If Z is an n × k* matrix where k < k* (Z contains more
variables than X), then the IV estimator defined above must
be modified. The most common approach in this case is to
replace Z in the "IV" equation by the projections** of X on
the columns of Z, i.e., X̂ = Z(Z'Z)⁻¹Z'X. This substitution
yields the IV estimator

   β̂IV = (X̂'X̂)⁻¹X̂'Y
        = [X'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'Y,

which yields estimates for k ≤ k*.
The Stata command for the instrumental variables estimator
is given by

   ivregress estimator depvar (varlist_1 = varlist_iv) [varlist_2]

where estimator = 2sls, gmm, or liml, with 2sls the default,
for the model

   depvar = (varlist_1)b1 + (varlist_2)b2 + error

where varlist_iv are the instrumental variables for varlist_1.
A specific example is given by:

   ivregress 2sls y1 (y2 = z1 z2 z3) x1 x2 x3

Identical results could be obtained with the command

   ivregress 2sls y1 (y2 x1 x2 x3 = z1 z2 z3)

which is equivalent to regressing all of the right hand side
variables on the set of instrumental variables. This can be
thought of as being of the form

   ivregress 2sls y (X = Z)
**The projections of X on Z can be obtained by obtaining estimates of Π
in the "reduced form" equation X = ZΠ + V to yield Π̂ = (Z'Z)⁻¹Z'X; hence, the
estimate of X is given by

   X̂ = ZΠ̂ = Z(Z'Z)⁻¹Z'X.
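A minimal numerical sketch of the IV formula (the data are invented; one regressor and no intercept, so (Z'X)⁻¹Z'y collapses to a ratio of sums). It also illustrates note (5): choosing Z = X reproduces the least squares estimator:

```python
# Invented data: one regressor, no intercept.
x = [1.0, 2.0, 3.0]
y = [2.0, 3.0, 5.0]

def iv_estimate(z):
    """beta_z = (Z'X)^{-1} Z'y, i.e., sum(z*y)/sum(z*x) in this scalar case."""
    return sum(zt * yt for zt, yt in zip(z, y)) / sum(zt * xt for zt, xt in zip(z, x))

b_ols = iv_estimate(x)                # Z = X gives OLS: 23/14
b_iv = iv_estimate([1.0, 1.0, 1.0])   # a constant instrument: 10/6

print(round(b_ols, 4), round(b_iv, 4))   # 1.6429 1.6667
```

Different choices of instrument produce different estimates in any finite sample; consistency of β̂z rests on the instrument conditions in notes (1)-(2) above.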
D. Distribution of β̂ (= β̃ = β*)

Recall that under the assumptions (A.1)-(A.5), y ~ N(Xβ, Σ = σ²I) and

   β̂ = β̃ = β* = (X'X)⁻¹X'y;

hence, by useful theorem (II'.A.4.a), we conclude that

   β̂ ~ N(Aμy, AΣyA') = N[AXβ, σ²AIA']

where A = (X'X)⁻¹X'. The desired derivations can be simplified by noting that

   AXβ = (X'X)⁻¹X'Xβ = β

   σ²AA' = σ²(X'X)⁻¹X'[(X'X)⁻¹X']'
         = σ²(X'X)⁻¹X'X[(X'X)⁻¹]'
         = σ²[(X'X)⁻¹]'
         = σ²[(X'X)']⁻¹
         = σ²(X'X)⁻¹.

Therefore,

   β̂ ~ N(β; σ²(X'X)⁻¹).
NOTE: (1) σ²(X'X)⁻¹ can be shown to be the Cramer-Rao matrix, the matrix
of lower bounds for the variances of unbiased estimators.

(2) β̂ = β̃ = β* are
   - unbiased
   - consistent
   - minimum variance of all (linear and nonlinear) unbiased estimators
   - normally distributed.

(3) An unbiased estimator of σ²(X'X)⁻¹ is given by

   s²(X'X)⁻¹

where s² = e'e/(n - k); this is the formula used to calculate the
"estimated variance covariance matrix" in many computer programs.

(4) To report s²(X'X)⁻¹ in Stata type

   . reg y x
   . estat vce
(5) Distribution of the variance estimator:

   (n - k)s²/σ² ~ χ²(n - k).

NOTE: This can be proven using the theorem (II'.A.4(b)) and noting that

   (n - k)s² = e'e = (Y - Xβ̂)'(Y - Xβ̂)
             = (Xβ + ε)'(I - X(X'X)⁻¹X')(Xβ + ε)
             = ε'(I - X(X'X)⁻¹X')ε.

Therefore,

   (n - k)s²/σ² = (ε/σ)'(I - X(X'X)⁻¹X')(ε/σ) = ν'Mν

where ν = ε/σ ~ N[0, I]; hence

   (n - k)s²/σ² ~ χ²(n - k)

because M is idempotent with rank and trace equal to n - k.
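The properties of M can be checked numerically: for any design matrix, M = I - X(X'X)⁻¹X' should satisfy M² = M with trace n - k. A sketch with an invented 4×2 X (intercept plus one regressor; the 2×2 inverse is computed by hand):

```python
# Invented design matrix: intercept plus one regressor, n = 4, k = 2.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
n, k = 4, 2

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Xt = [list(r) for r in zip(*X)]                      # X'
XtX = matmul(Xt, X)                                  # 2x2
d = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]    # determinant
XtX_inv = [[XtX[1][1] / d, -XtX[0][1] / d],
           [-XtX[1][0] / d, XtX[0][0] / d]]
H = matmul(matmul(X, XtX_inv), Xt)                   # projection onto columns of X
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]

MM = matmul(M, M)
max_diff = max(abs(MM[i][j] - M[i][j]) for i in range(n) for j in range(n))
trace_M = sum(M[i][i] for i in range(n))
print(max_diff < 1e-9, round(trace_M, 9))   # True 2.0
```

Idempotency plus trace n - k is exactly what useful theorem (b) needs for ν'Mν to be χ²(n - k).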
E. Statistical Inference
1. Ho: β2 = β3 = . . . = βk = 0
This hypothesis tests for the statistical significance of overall explanatory power
of the explanatory variables by comparing the model with all variables included to
the model without any of the explanatory variables, i.e., yt = β1 + εt (all non-
intercept coefficients = 0). Recall that the total sum of squares (SST) can be
partitioned as follows:
   Σt (yt - ȳ)² = Σt (yt - ŷt)² + Σt (ŷt - ȳ)²

or

   SST = SSE + SSR.

Dividing both sides of the equation by σ² yields quadratic forms, each having a
chi-square distribution:

   SST/σ² = SSE/σ² + SSR/σ²

   χ²(n - 1) = χ²(n - k) + χ²(k - 1).
This result provides the basis for using

   F = [SSR/(k - 1)] / [SSE/(n - k)] ~ F(k - 1, n - k)

to test the hypothesis that β2 = β3 = . . . = βk = 0.

NOTE: (1) R² = SSR/SST = 1 - SSE/SST; hence, the F-statistic for this hypothesis
can also be rewritten as

   F = [R²/(k - 1)] / [(1 - R²)/(n - k)] ~ F(k - 1, n - k).

Recall that this decomposition of SST can be summarized in an ANOVA table as
follows:

   Source of Variation    SS     d.f.     MSE
   Model                  SSR    K - 1    SSR/(K - 1)
   Error                  SSE    n - K    SSE/(n - K) = s²
   Total                  SST    n - 1

   K = number of coefficients in model

where the ratio of the model and error MSE's yields the F statistic just discussed.
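The decomposition and the two F formulas can be illustrated with invented data (the OLS fit for these points is ŷ = 0.5 + 1.4x, with K = 2 coefficients): SST splits into SSR + SSE, and the SSR/SSE and R² forms of the F statistic agree:

```python
# Invented data; the fitted OLS line is yhat = 0.5 + 1.4x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, K = len(y), 2
ybar = sum(y) / n
yhat = [0.5 + 1.4 * xt for xt in x]

sst = sum((yt - ybar) ** 2 for yt in y)
sse = sum((yt - yh) ** 2 for yt, yh in zip(y, yhat))
ssr = sum((yh - ybar) ** 2 for yh in yhat)

F_anova = (ssr / (K - 1)) / (sse / (n - K))   # ratio of the two MSE's
r2 = ssr / sst
F_r2 = (r2 / (K - 1)) / ((1 - r2) / (n - K))  # same statistic via R^2

print(sst, round(F_anova, 6), round(F_r2, 6))   # 10.0 98.0 98.0
```

Here SST = 10 = 9.8 + 0.2, R² = 0.98, and both versions of F equal 98.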
Additionally, remember that the adjusted R² (R̄²), defined by

   R̄² = 1 - [Σt et²/(n - K)] / [Σt (Yt - Ȳ)²/(n - 1)],

will only increase with the addition of a new variable if the t-statistic associated with
the new variable is greater than 1 in absolute value. This result follows from the
equation

   1 - R̄²New = [(n - 1)/(n - K - 1)](SSENew/SST)
             = (1 - R̄²Old)(n - K)/(n - K - 1 + t²Newvar),

where tNewvar = β̂Newvar/s_β̂Newvar, K denotes the number of coefficients in the "old"
regression model and the "new" regression model includes K + 1 coefficients; the last
factor shows that R̄²New exceeds R̄²Old exactly when t²Newvar > 1.
The Lagrangian Multiplier (LM) and likelihood ratio (LR) tests can also be
used to test this hypothesis, where

   LM = NR² ~a χ²(k - 1)

   LR = -N ln(1 - R²) ~a χ²(k - 1).
2. Testing hypotheses involving individual βi's

Recall that

   β̂ ~ N(β; σ²(X'X)⁻¹)

where

                | σ²β̂1     σβ̂1β̂2   . . .  σβ̂1β̂k |
   σ²(X'X)⁻¹ =  | σβ̂2β̂1   σ²β̂2     . . .  σβ̂2β̂k |
                | . . .                            |
                | σβ̂kβ̂1   σβ̂kβ̂2   . . .  σ²β̂k   |

which can be estimated by

                | s²β̂1     sβ̂1β̂2   . . .  sβ̂1β̂k |
   s²(X'X)⁻¹ =  | sβ̂2β̂1   s²β̂2     . . .  sβ̂2β̂k | .
                | . . .                            |
                | sβ̂kβ̂1   sβ̂kβ̂2   . . .  s²β̂k   |

Hypotheses of the form H0: βi = β0i can be tested using the result

   (β̂i - β0i)/s_β̂i ~ t(n - k).

The validity of this distributional result follows from

   N(0,1) / [χ²(d)/d]^(1/2) ~ t(d)

since

   (β̂i - βi)/σ_β̂i ~ N(0,1)   and   (n - k)s²β̂i/σ²β̂i ~ χ²(n - k).
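A worked sketch with invented data (same toy regression as earlier; k = 2): the slope's standard error comes from the corresponding diagonal element of s²(X'X)⁻¹, which for simple regression is s²/Σ(xt - x̄)². With only one slope, t² equals the overall F statistic:

```python
import math

# Invented data; OLS fit is b1 = 0.5, b2 = 1.4 with k = 2 coefficients.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, k = len(x), 2
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xt - xbar) ** 2 for xt in x)
sxy = sum((xt - xbar) * (yt - ybar) for xt, yt in zip(x, y))
b2 = sxy / sxx
b1 = ybar - b2 * xbar

sse = sum((yt - b1 - b2 * xt) ** 2 for xt, yt in zip(x, y))
s2 = sse / (n - k)                  # unbiased estimate of sigma^2
se_b2 = math.sqrt(s2 / sxx)         # estimated standard error of b2
t = b2 / se_b2                      # compare with t(n - k) to test H0: beta2 = 0

print(round(b2, 6), round(t ** 2, 6))   # 1.4 98.0
```

The squared t of about 98 matches the F statistic from the ANOVA illustration, as it must when k = 2.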
3. Tests of hypotheses involving linear combinations of coefficients

A linear combination of the βi's can be written as

   δ'β = (δ1, . . ., δk)(β1, . . ., βk)' = Σi δiβi.

We now consider testing hypotheses of the form

   H0: δ'β = γ.

Recall that

   β̂ ~ N(β; σ²(X'X)⁻¹);

therefore,

   δ'β̂ ~ N(δ'β; σ²δ'(X'X)⁻¹δ);

hence,

   (δ'β̂ - δ'β)/s_δ'β̂ = (δ'β̂ - γ)/[s²δ'(X'X)⁻¹δ]^(1/2) ~ t(n - k).

The t-test of a hypothesis involving a linear combination of the coefficients
involves running one regression and estimating the variance of δ'β̂ from s²(X'X)⁻¹
to construct the test statistics.
4. More general tests
a. Introduction
We have considered tests of the overall explanatory power of the
regression model (Ho: β2 = β3 = . . . βk = 0), tests involving individual parameters
(e.g., Ho: β3 = 6), and testing the validity of a linear constraint on the coefficients
(Ho: δ’β = γ). In this section we will consider how more general tests can be
performed. The testing procedures will be based on the Chow and Likelihood
ratio (LR) tests. The hypotheses may be of many different types and involve the
previous tests as special cases. Other examples might include joint hypotheses of
the form: Ho: β2 + 6 β5 = 4, β3 = β7 = 0. The basic idea is that if the hypothesis is
really valid, then goodness of fit measures such as SSE, R² and log-likelihood
values (ℓ) will not be significantly impacted by imposing the valid hypothesis in
estimation. Hence, the SSE, R² or ℓ values will not be significantly different for
constrained (via the hypothesis) and unconstrained estimation of the underlying
regression model. The tests of the validity of the hypothesis are based on
constructing test statistics, with known exact or asymptotic distributions, to
evaluate the statistical significance of changes in SSE, R², or ℓ.
Consider the model
y = X β + ε
and a hypothesis, Ho: g(β) = 0 which imposes individual and/or multiple
constraints on the β vector.
The Chow and likelihood ratio tests for testing Ho: g(β) = 0 can be
constructed from the output obtained from estimating the two following
regression models.
(1) Estimate the regression model y = Xβ + ε without imposing any
constraints on the vector β. Let the associated sum of square errors,
coefficient of determination, log-likelihood value and degrees of freedom
be denoted by SSE, R², ℓ, and (n - k).

(2) Estimate the same regression model where the β is constrained as
specified by the hypothesis (Ho: g(β) = 0) in the estimation process. Let
the associated sum of squared errors, R², log-likelihood value and degrees
of freedom be denoted by SSE*, R²*, ℓ* and (n - k)*, respectively.
b. Chow test
The Chow test is defined by the following statistic:

   Chow = [(SSE* - SSE)/r] / [SSE/(n - k)] ~ F(r, n - k)

where r = (n - k)* - (n - k) is the number of independent restrictions imposed on β by
the hypothesis. For example, if the hypothesis was Ho: β2 + 6β5 = 4, β3 = β7 = 0,
then the numerator degrees of freedom (r) is equal to 3. In applications where the
SST is unaltered by imposing the restrictions, we can divide the numerator and
denominator by SST to yield the Chow test rewritten in terms of the change in the
R² between the constrained and unconstrained regressions:

   F = [(R² - R²*)/r] / [(1 - R²)/(n - k)] ~ F(r, n - k).

Note that if the hypothesis (H0: g(β) = 0) is valid, then we would expect R² (SSE)
and R²* (SSE*) to not be significantly different from each other. Thus, it is only
large values (greater than the critical value) of F which provide the basis for
rejecting the hypothesis. Again, the R² form of the Chow test is only valid if the
dependent variable is the same in the constrained and unconstrained regression.

References:
(1) Chow, G. C., "Tests of Equality Between Subsets of Coefficients in Two
Linear Regressions," Econometrica, 28 (1960), 591-605.
(2) Fisher, F. M., "Tests of Equality Between Sets of Coefficients in Two Linear
Regressions: An Expository Note," Econometrica, 38 (1970), 361-66.
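The agreement between the SSE and R² forms of the Chow statistic is easy to verify with hypothetical numbers (all invented: r = 3 restrictions, SSE = 20, SSE* = 26, n - k = 20, and a common SST = 100):

```python
# Invented values for a restricted/unrestricted pair of regressions.
sse, sse_star, sst = 20.0, 26.0, 100.0
r, n_minus_k = 3, 20

# SSE form of the Chow statistic.
F_sse = ((sse_star - sse) / r) / (sse / n_minus_k)

# R^2 form, valid because SST is the same in both regressions.
r2 = 1 - sse / sst
r2_star = 1 - sse_star / sst
F_r2 = ((r2 - r2_star) / r) / ((1 - r2) / n_minus_k)

print(F_sse, round(F_r2, 9))   # 2.0 2.0
```

An F of 2 would then be compared with the F(3, 20) critical value at the chosen significance level.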
c. Likelihood ratio (LR) test.

The LR test is a common method of statistical inference in classical
statistics. The motivation behind the LR test is similar to that of the Chow test
except that it is based on determining whether there has been a significant
reduction in the log-likelihood value as a result of imposing the
hypothesized constraints on β in the estimation process. The LR test statistic is
defined to be twice the difference between the unconstrained and constrained
log-likelihood values (2(ℓ - ℓ*)) and, under fairly general
regularity conditions, is asymptotically distributed as a chi-square with degrees of
freedom equal to the number of independent restrictions (r) imposed by the
hypothesis. This may be summarized as follows:

   LR = 2(ℓ - ℓ*) ~a χ²(r).

The LR test is more general than the Chow test, and for the case of
independent and identically distributed normal errors with known σ², LR is equal
to

   LR = [SSE* - SSE]/σ².

Recall that s² = SSE/(n - k) appears in the denominator of the Chow test statistic
and that for large values of (n - k), s² is "close" to σ²; hence, we can see the
similarity of the LR and Chow tests. If σ² is unknown, substituting the
concentrated log-likelihood function into LR yields

   LR = 2(ℓ - ℓ*)
      = n[ln(SSE*) - ln(SSE)]
      = n ln(SSE*/SSE).
If the hypothesis Ho: β2 = β3 = . . . = βk = 0 is being tested in the classical
normal linear regression model, then SSE* = SST and LR can be rewritten in
terms of the R² as follows:

   LR = n ln[1/(1 - R²)] = -n ln(1 - R²) ~a χ²(k - 1).

In this case, the Chow test is identical to the F test for overall explanatory power
discussed earlier.
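The equivalence of the two LR expressions can be sketched with invented values (n = 4, SSE = 0.2, SST = 10, and SSE* = SST under this H0):

```python
import math

# Invented values: under H0 (all slopes zero), SSE* = SST.
n = 4
sse, sst = 0.2, 10.0
r2 = 1 - sse / sst           # 0.98

lr_sse = n * math.log(sst / sse)   # n ln(SSE*/SSE)
lr_r2 = -n * math.log(1 - r2)      # -n ln(1 - R^2)

print(round(lr_sse, 6), round(lr_r2, 6))
```

Both expressions return the same value (about 15.65 here), which would be compared with a χ²(k - 1) critical value.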
Thus the Chow test and LR test are similar in structure and purpose. The
LR test is more general than the Chow test; however, its distribution is
asymptotically (not exact) chi-square even for non-normally distributed errors.
The LR test provides a unified method of testing hypotheses.
d. Applications of the Chow and LR tests:
(1) Model: yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt
Ho: β2 = β3 = 0 (two independent constraints)
(a) Estimate yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt to obtain

   SSE = Σ et² = (n - 4)s²,   R²,
   ℓ = -(n/2)[1 + ln(2π) + ln(SSE/n)],
   n - k = n - 4.

(b) Estimate yt = β1 + β4xt4 + εt to obtain

   SSE* = Σ et*² = (n - 2)s*²,   R²*,   ℓ*   and   (n - k)* = n - 2.
(c) Construct the test statistics

   Chow = [(SSE* - SSE)/((n - k)* - (n - k))] / [SSE/(n - k)]
        = [(SSE* - SSE)/2] / [SSE/(n - 4)]
        = [(R² - R²*)/2] / [(1 - R²)/(n - 4)] ~ F(2, n - 4)

   LR = 2(ℓ - ℓ*) ~a χ²(2).
(2) Tests of equality of the regression coefficients in two different regression
models.

(a) Consider the two regression models

   y(1) = X(1)β(1) + ε(1)     (n1 observations, k independent variables)
   y(2) = X(2)β(2) + ε(2)     (n2 observations, k independent variables)

   Ho: β(1) = β(2)   (k independent restrictions)

(b) Rewrite the model as

   (1)'   y = | y(1) | = | X(1)   0   | | β(1) | + | ε(1) |
              | y(2) |   |  0    X(2) | | β(2) |   | ε(2) |

Estimate (1)' using least squares and determine SSE, R², ℓ
and (n - k) = n1 + n2 - 2k.

Now impose the hypothesis that β(1) = β(2) = β and write (1)' as

   (2)'   y = | y(1) | = | X(1) | β + | ε(1) |
              | y(2) |   | X(2) |     | ε(2) |

Estimate (2)' using least squares to obtain the constrained
sum of squared errors (SSE*), R²*, ℓ* and (n - k)* = n1 + n2 - k.

(c) Construct the test statistics

   Chow = [(SSE* - SSE)/((n - k)* - (n - k))] / [SSE/(n - k)]
        = [(R² - R²*)/k] / [(1 - R²)/(n1 + n2 - 2k)] ~ F(k, n1 + n2 - 2k)

   LR = 2(ℓ - ℓ*) ~a χ²(k).
5. Testing Hypotheses using Stata
a. Stata reports the log likelihood values when the command
estat ic
follows a regression command and can be used in constructing LR tests.
b. Stata can also perform many tests based on t or Chow-type tests.
Consider the model
(1) Yt = β1 + β2Xt2 + β3Xt3 + β4Xt4 + εt
with the hypotheses:
(2) H1: β2 = 1
H2: β3 = 0
H3: β3 + β4 = 1
H4: β3β4 = 1
H5: β2 = 1 and β3 = 0
The Stata commands to perform tests of these hypotheses follow OLS
estimation of the unconstrained model.
reg Y X2 X3 X4
estimates the unconstrained model
test X2 = 1 (Tests H1)
test X3 = 0 (Tests H2)
test X3 + X4 = 1 (Tests H3)
testnl _b[X3]*_b[X4] = 1 (Tests H4. The "testnl" command is
for testing nonlinear hypotheses. The
coefficients must be referenced with the
"_b[...]" notation, with brackets, when
testing nonlinear hypotheses)
test (X2 = 1) (X3 = 0) (Tests H5)
95% confidence intervals on coefficient estimates are automatically calculated in
Stata. To change the confidence level, use the “level” option as follows:
reg Y X2 X3 X4, level(90) (changes the confidence level
to 90%)
F. Stepwise Regression
Stepwise regression is a method for determining which variables might be
considered as being included in a regression model. It is a purely mechanical approach,
adding or removing variables in the model solely determined by their statistical
significance and not according to any theoretical reason. While stepwise regression can be
considered when deciding among many variables to include in a model, theoretical
considerations should be the primary factor for such a decision.
A stepwise regression may use forward selection or backward selection. Using
forward selection, a stepwise regression will add one independent variable at a time to see
if it is significant. If the variable is significant, it is kept in the model and another variable
is added. If the variable is not significant, or if a previously added variable becomes
insignificant, it is not included in the model. This process continues until no additional
variables are significant.
Stepwise regression using Stata
To perform a stepwise regression in Stata, use the following commands:
Forward:
stepwise, pe(#): reg dep_var indep_vars
stepwise, pe(#) lockterm: reg dep_var (forced in
variables) other indep_vars
Backward:
stepwise, pr(#): reg dep_var indep_vars
stepwise, pr(#) lockterm: reg dep_var (forced in
variables) other indep_vars
where the "#" in "pr(#)" is the significance level at which variables are removed,
e.g., 0.051, and the "#" in "pe(#)" is the significance level at which variables are
entered or added to the model. If pr(#1) and pe(#2) are both included in a stepwise
regression command, #1 must be greater than #2. Also, "dep_var" represents the
dependent variable, "forced in variables" represents the independent variables which
the user wishes to remain in the model no matter what their significance level may
be, and "other indep_vars" represents the other independent variables which the
stepwise regression will consider including or excluding. Forward and backward
stepwise regression may yield different results.
G. Forecasting
Let yt = F(Xt, β) + εt
denote the stochastic relationship between the variable yt and the vector of variables Xt
where Xt = (xt1,..., xtk). β represents a vector of unknown parameters.
Forecasts are generally made by estimating the vector of parameters β (by β̂),
determining the appropriate vector Xt (or an estimate X̂t) and then evaluating

   ŷt = F(X̂t, β̂).

The forecast error is FE = yt - ŷt.
There are at least four factors which contribute to forecast error.
1. Incorrect functional form (This is an example of specification error and will be
discussed later.)
2. Existence of random disturbance (εt)
Even if the "appropriate" future value of Xt and true parameter values, β,
were known with certainty
   FE = yt - ŷt = yt - F(Xt, β) = εt

   σ²FE = Variance(FE) = Var(εt) = σ².

In this case confidence intervals for yt would be obtained from

   Pr[F(Xt, β) - Z(α/2)σ < yt < F(Xt, β) + Z(α/2)σ] = 1 - α,

which could be visualized as follows for the linear case:

[Figure: constant-width confidence band around the population regression line,
plotted in the (Xt, Yt) plane.]
3. Uncertainty about β

Assume F(Xt, β) = Xtβ in the model yt = F(Xt, β) + εt; then the predicted
value of yt for a given value of Xt is given by

   ŷt = Xtβ̂,

and the variance of ŷt (sample regression line), σ²ŷt, is given by

   σ²ŷt = Xt Var(β̂) Xt' = σ²Xt(X'X)⁻¹Xt',

with the variance of the forecast error (actual y) given by:

   σ²FE = σ² + σ²ŷt.

Note that σ²FE takes account of the uncertainty associated with the unknown
regression line and the error term and can be used to construct confidence
intervals for the actual value of Y rather than just the regression line.
Unbiased sample estimators of σ²ŷt and σ²FE can be easily obtained by replacing σ²
with its unbiased estimator s².

Confidence intervals for E(Yt | Xt), the population regression line:

   Pr[Xtβ̂ - t(α/2) sŷt < E(Yt | Xt) < Xtβ̂ + t(α/2) sŷt] = 1 - α.

Confidence intervals for Yt:

   Pr[Xtβ̂ - t(α/2) sFE < Yt < Xtβ̂ + t(α/2) sFE] = 1 - α.
[Figure: confidence bands for the population regression line and for the actual
value of Yt, plotted in the (Xt, Yt) plane.]
4. A comparison of confidence intervals.

Some students have found the following table facilitates their understanding of the
different confidence intervals for the population regression line and actual value of Y.
The column for the estimated coefficients is only included to compare the
organizational parallels between the different confidence intervals.

Estimated coefficients:
   Statistic:     β̂ = (X'X)⁻¹X'Y
   Distribution:  β̂ ~ N(β, σ²(X'X)⁻¹)
   t-stat:        1 - α = Pr[-t(α/2) < (β̂i - βi)/sβ̂i < t(α/2)]
                        = Pr[β̂i - t(α/2)sβ̂i < βi < β̂i + t(α/2)sβ̂i]
   C.I. for βi:   (β̂i - t(α/2)sβ̂i, β̂i + t(α/2)sβ̂i)

Sample regression line (Ŷt = Xtβ̂ = predicted Y values corresponding to Xt):
   Statistic:     Ŷt = Xtβ̂
   Distribution:  Ŷt ~ N(Xtβ, σ²Ŷt = σ²Xt(X'X)⁻¹Xt')
   t-stat:        1 - α = Pr[-t(α/2) < (Ŷt - Xtβ)/sŶt < t(α/2)]
                        = Pr[Xtβ̂ - t(α/2)sŶt < Xtβ < Xtβ̂ + t(α/2)sŶt]
   C.I. for Xtβ:  (Xtβ̂ - t(α/2)sŶt, Xtβ̂ + t(α/2)sŶt)

Forecast error (FE = Yt - Ŷt = Yt - Xtβ̂):
   Distribution:  FE ~ N(0, σ²FE = σ² + σ²Ŷt)
   t-stat:        1 - α = Pr[-t(α/2) < (FE - 0)/sFE < t(α/2)]
                        = Pr[Xtβ̂ - t(α/2)sFE < Yt < Xtβ̂ + t(α/2)sFE]
   C.I. for Yt:   (Xtβ̂ - t(α/2)sFE, Xtβ̂ + t(α/2)sFE)

where sŶt is used to compute confidence intervals for the regression line (E(Yt | Xt))
and sFE is used in the calculation of confidence intervals for the actual value of Y.
Recall that s²FE = s² + s²Ŷt; hence, s²FE > s²Ŷt and the confidence intervals for
Y are larger than for the population regression line.
5. Uncertainty about X. In many situations the value of the independent variable also
needs to be predicted along with the value of y. Not surprisingly, a "poor" estimate of
Xt will likely result in a poor forecast for y. This can be represented graphically as
follows:

[Figure: forecast uncertainty when Xt must itself be predicted.]
6. Hold out samples and a predictive test.
One way to explore the predictive ability of a model is to estimate the model on a
subset of the data and then use the estimated model to predict known outcomes which
are not used in the initial estimation.
7. Example:  ŷt = 10 + 2.5Gt + 6Mt = β̂1 + β̂2Gt + β̂3Mt

where yt, Gt, Mt denote GDP, government expenditure, and money supply.
Assume that

                 | 10   5   2 |
   s²(X'X)⁻¹ =   |  5  20   3 | × 10⁻³,   s² = 10.
                 |  2   3  15 |

a. Calculate an estimate of GDP (y) which corresponds to
Gt = 100, Mt = 200, i.e., Xt = (1, 100, 200).

                             | 10  |
   ŷt = Xtβ̂ = (1, 100, 200) | 2.5 | = 10 + 250 + 1200 = 1460.
                             |  6  |

b. Evaluate s²ŷt and s²FE corresponding to the Xt in question (a).

                                           | 10   5   2 |        |  1  |
   s²ŷt = Xt[s²(X'X)⁻¹]Xt' = (1, 100, 200) |  5  20   3 | × 10⁻³ | 100 |
                                           |  2   3  15 |        | 200 |

        = 921.81

   sŷt = 30.36

   s²FE = s² + s²ŷt = 10 + 921.81 = 931.81

   sFE = 30.53
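The example's arithmetic can be re-checked directly. In the sketch below, the 3×3 matrix is the s²(X'X)⁻¹ from the example (stored ×10³, a reconstruction from the garbled notes) and β̂ = (10, 2.5, 6):

```python
import math

# Values taken from the worked example above (matrix reconstructed).
s2 = 10.0
M = [[10.0, 5.0, 2.0],    # s^2 (X'X)^{-1} scaled by 10^3
     [5.0, 20.0, 3.0],
     [2.0, 3.0, 15.0]]
xt = [1.0, 100.0, 200.0]
beta_hat = [10.0, 2.5, 6.0]

y_hat = sum(b * v for b, v in zip(beta_hat, xt))             # X_t beta_hat
s2_yhat = sum(xt[i] * M[i][j] * xt[j]                        # X_t s^2(X'X)^{-1} X_t'
              for i in range(3) for j in range(3)) / 1000.0
s2_fe = s2 + s2_yhat                                         # s^2 + s^2_yhat

print(y_hat, s2_yhat, round(math.sqrt(s2_fe), 2))   # 1460.0 921.81 30.53
```

The quadratic form reproduces s²ŷt = 921.81, and adding s² = 10 gives s²FE = 931.81 with sFE ≈ 30.53, matching the example.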
8. Forecasting—basic Stata commands
a) The data file should include values for the explanatory variables
corresponding to the desired forecast period, say in observations n1 + 1 to n2.
b) Estimate the model using least squares
reg Y X1 . . . XK, [options]
c) Use the predict command, picking the names you want for the predictions, in
this case yhat, e, sfe, and syhat (for Ŷ, e, sFE, and sŷ).
predict yhat, xb this option predicts Y
predict e, resid this option predicts the residuals (e)
predict sfe, stdf this option predicts the standard
error of the forecast ( FEs )
predict syhat, stdp   this option predicts the standard
                      error of the prediction (sŷ)

list y yhat sfe       this command lists the indicated variables

These commands result in the calculation and reporting of Ŷ, e, sFE and
sŷ for observations 1 through n2. The predictions will show up in the Data
Editor of Stata under the variable names you picked (in this case, yhat,
e, sfe and syhat).
You may want to restrict the calculations to t= n1 + 1, .. , n2 by using
predict yhat if(_n> n1), xb
where “n1” is the numerical value of n1.
d) The variance of the predicted value can be calculated as follows:

   s²ŷt = s²FE - s².
H. PROBLEM SETS: MULTIVARIATE REGRESSION
Problem Set 3.1
Theory
OBJECTIVE: The objective of problems 1 & 2 is to demonstrate that the matrix equations and
summation equations for the estimators and variances of the estimators are equivalent.
Remember that Σt Xt = NX̄, and don't get discouraged!!
1. BACKGROUND: Consider the model (1) Yt = β1 + β2 Xt+ εt (t = 1, . . ., N) or
equivalently,
(1)'

   | Y1 |   | 1  X1 |          | ε1 |
   | Y2 | = | 1  X2 | | β1 | + | ε2 |
   | .  |   | .  .  | | β2 |   | .  |
   | Yn |   | 1  Xn |          | εn |

(1)''  Y = Xβ + ε

The least squares estimator of β is β̂ = (β̂1, β̂2)' = (X'X)⁻¹X'Y.

If (A.1)-(A.5) (see class notes) are satisfied, then

   Var(β̂) = | Var(β̂1)        Cov(β̂1, β̂2) | = σ²(X'X)⁻¹.
            | Cov(β̂1, β̂2)   Var(β̂2)      |
QUESTIONS: Verify the following:
*Hint: It might be helpful to work backwards on part c and e.
a.  X'X = | N     NX̄     |    and    X'Y = | NȲ      |
          | NX̄    Σt Xt² |                 | Σt XtYt |
b.  β̂2 = (Σt XtYt - NX̄Ȳ) / (Σt Xt² - NX̄²)

c.  β̂1 = Ȳ - β̂2X̄

d.  Var(β̂2) = σ² / (Σt Xt² - NX̄²)

e.  Var(β̂1) = σ²[1/n + X̄²/(Σt Xt² - NX̄²)] = Var(Ȳ) + X̄² Var(β̂2)

f.  Cov(β̂1, β̂2) = -X̄ Var(β̂2)

(JM II'-A, JM Stats)
2. Consider the model: Yt = βXt + εt

a. Show that this model is equivalent to Y = Xβ + ε where

   Y = | Y1 |,   X = | X1 |,   ε = | ε1 |
       | .  |        | .  |       | .  |
       | Yn |        | Xn |       | εn |

b. Using the matrices in 2(a), evaluate (X'X)⁻¹X'Y and compare your answer with
the results obtained in question 4 in Problem Set 2.1.

c. Using the matrices in 2(a), evaluate σ²(X'X)⁻¹.

(JM II'-A)
Applied
3. Use the data in HPRICE1.RAW to estimate the model
price = β0 + β1sqrft + β2bdrms + u
where price is the house price measured in thousands of dollars, sqrft is
the floorspace measured in square feet, and bdrms is the number of bedrooms.
a. Write out the results in equation form.
b. What is the estimated increase in price for a house with one more bedroom, holding
square footage constant?
c. What is the estimated increase in price for a house with an additional bedroom that is 140
square feet in size? Compare this to your answer in part (b).
d. What percentage variation in price is explained by square footage and number of
bedrooms?
e. The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling
price for this house from the OLS regression line.
f. The actual selling price of the first house in the sample was $300,000 (so price = 300).
Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for
the house?
Problem Set 3.2
Theory
1. R², Adjusted R² (R̄²), F Statistic, and LR

The R² (coefficient of determination) is defined by

	R² = SSR/SST = 1 - SSE/SST

where SST = Σt (Yt - Ȳ)², SSR = Σt (Ŷt - Ȳ)², and SSE = Σt et².
Given that SST = SSR + SSE when using OLS,
a. Demonstrate that 0 ≤ R² ≤ 1.

b. Demonstrate that n = k implies R² = 1. (Hint: n = k implies that X is square. Be careful! Show Ŷ = Xβ̂ = Y.)
c. If an additional independent variable is included in the regression equation, will
the R2 increase, decrease, or remain unaltered? (Hint: What is the effect upon
SST, SSE?)
d. The adjusted R², R̄², is defined by

	R̄² = 1 - [SSE/(n-k)] / [SST/(n-1)] .

Demonstrate that (1-k)/(n-k) ≤ R̄² ≤ 1, i.e., the adjusted R² can be negative.

	(Hint: 1 - R̄² = (SSE/SST) · (n-1)/(n-k) = ((n-1)/(n-k)) (1 - R²).)
e. Verify that

	LR = (SSE* - SSE)/σ²     if σ² is known
	   = n ln(SSE*/SSE)      if σ² is unknown,

where SSE* denotes the restricted SSE.
f. For the hypothesis H0: β2 = . . . = βk = 0, verify that the corresponding LR statistic can be written as

	LR = n ln(1/(1 - R²)) = -n ln(1 - R²) .
FYI: The corresponding Lagrange multiplier (LM) test statistic for this
hypothesis can be written in terms of the coefficient of determination as LM = NR².
(JM II-B)
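The relationships among R², R̄², and LR in parts d and f can be checked numerically. A numpy sketch with simulated data (hypothetical), assuming the restricted model in part f is the intercept-only regression, for which SSE* = SST:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
SSE = e @ e
SST = ((y - y.mean())**2).sum()
SSR = ((X @ b - y.mean())**2).sum()

R2 = 1 - SSE / SST
assert np.isclose(R2, SSR / SST)     # SST = SSR + SSE under OLS with an intercept

R2_adj = 1 - (SSE / (n - k)) / (SST / (n - 1))
assert R2_adj <= R2 <= 1             # adjusting can only lower R^2

# Part f: restricted model is intercept-only, so SSE* = SST and
# LR = n ln(SSE*/SSE) = -n ln(1 - R^2)
LR = n * np.log(SST / SSE)
assert np.isclose(LR, -n * np.log(1 - R2))
```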
2. Demonstrate that

a. X'e = 0 is equivalent to the normal equations X'Xβ̂ = X'Y.

b. X'e = 0 implies that the sum of estimated error terms will equal zero if the regression
equation includes an intercept.

Remember: e = Y - Ŷ = Y - Xβ̂.
(JM II-B)
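Both properties are immediate to verify on any data set. A minimal numpy sketch with simulated data (hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # first column: intercept
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'Xb = X'y
e = y - X @ b

assert np.allclose(X.T @ e, 0)   # X'e = 0: residuals orthogonal to every regressor
assert np.isclose(e.sum(), 0)    # column of ones in X forces the residuals to sum to zero
```

The second assertion fails if the intercept column is dropped, which is exactly the point of part b.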
Applied
3. The following model can be used to study whether campaign expenditures affect election
outcomes:
voteA = β0 + β1ln(expendA) + β2 ln(expendB) + β3 prtystrA + u
where voteA is the percent of the vote received by Candidate A, expendA and expendB are
campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percent of the most recent presidential vote that went to A's
party).
i) What is the interpretation of β1?
ii) In terms of the parameters, state the null hypothesis that a 1% increase in A's
expenditures is offset by a 1% increase in B's expenditures.
iii) Estimate the model above using the data in VOTE1.RAW and report the results in
the usual form. Do A's expenditures affect the outcome? What about B's
expenditures? Can you use these results to test the hypothesis in part (ii)?
iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part
(ii). What do you conclude? (Use a two-sided alternative.) A possible approach:
test H0: D ≡ β1 + β2 = 0, substitute D - β2 for β1, and simplify to obtain
voteA = β0 + D ln(expendA) + β2 [ln(expendB) - ln(expendA)] + β3 prtystrA + u.
Regress voteA on ln(expendA), [ln(expendB) - ln(expendA)], and prtystrA,
and test the hypothesis that the coefficient, D, of ln(expendA) is 0.
You can check your results by constructing the “high-tech” t-test or by using the
Stata command, test ln(expendA) + ln(expendB) =0 following the estimation of
the unconstrained regression model.
(Wooldridge C. 4.1)
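The reparameterization trick in part (iv) can be sketched in numpy. The data below are simulated stand-ins (hypothetical), not VOTE1.RAW; the point is only that the coefficient on ln(expendA) in the reparameterized regression equals β1 + β2 from the original one, so its ordinary t statistic tests H0: β1 + β2 = 0 directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
lexpA = rng.normal(6, 1, n)      # hypothetical stand-in for ln(expendA)
lexpB = rng.normal(6, 1, n)      # hypothetical stand-in for ln(expendB)
prty = rng.uniform(30, 70, n)    # hypothetical stand-in for prtystrA
voteA = 40 + 6*lexpA - 6*lexpB + 0.15*prty + rng.normal(0, 3, n)

# Reparameterized model: voteA on ln(expendA), [ln(expendB)-ln(expendA)], prtystrA.
# The coefficient on ln(expendA) is D = beta1 + beta2.
X = np.column_stack([np.ones(n), lexpA, lexpB - lexpA, prty])
b = np.linalg.solve(X.T @ X, X.T @ voteA)
e = voteA - X @ b
s2 = e @ e / (n - X.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
t_D = b[1] / se[1]               # t statistic for H0: beta1 + beta2 = 0

# Sanity check: D equals beta1 + beta2 from the original parameterization
X0 = np.column_stack([np.ones(n), lexpA, lexpB, prty])
b0 = np.linalg.solve(X0.T @ X0, X0.T @ voteA)
assert np.isclose(b[1], b0[1] + b0[2])
```

With the real data, the same t statistic is what Stata's `test` command reproduces after the unconstrained regression.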
4. Consider the data (data from Solow’s paper on economic growth)
t Output (Yt) Labor (Lt) Capital (Kt)
1 40.26 64.63 133.14
2 40.84 66.30 139.24
3 42.83 65.27 141.64
4 43.89 67.32 148.77
5 46.10 67.20 151.02
6 44.45 65.18 143.38
7 43.87 65.57 148.19
8 49.99 71.42 167.12
9 52.64 77.52 171.33
10 57.93 79.46 176.41
The Cobb-Douglas production function is defined by

(1)	Yt = e^(β1 + β2 t) · Lt^β3 · Kt^β4 · εt

where (β2 t) takes account of changes in output for any reason other than a change in Lt or
Kt; εt denotes a random disturbance having the property that ln εt is distributed N(0, σ²).
Labor's share (total wage receipts / total sales receipts) is given by β3 if β3 + β4 (the returns to scale) is
equal to one. β2 is frequently referred to as the rate of technological change
((dYt/dt)/Yt for fixed K and L). Taking the natural logarithm of equation (1), we obtain

(2)	ln Yt = β1 + β2 t + β3 ln(Lt) + β4 ln(Kt) + ln(εt) .
If β3 + β4 is equal to 1, then equation (2) can be rewritten as

(3)	ln(Yt/Kt) = β1 + β2 t + β3 ln(Lt/Kt) + ln εt .
a. Estimate equation (2) using the technique of least squares.
b. Corresponding to equation (2)
1) Test the hypothesis Ho: β2 = β3 = β4 = 0. Explain the implications of this
hypothesis. (95% confidence level)
2) Perform and interpret individual tests of significance of β2, β3, and β4, i.e., test
Ho: βi = 0 (α = .05).
3) test the hypothesis of constant returns to scale, i.e., Ho: β3 + β4 = 1, using
a. a t-test for a general linear hypothesis, with restriction vector δ = (0, 0, 1, 1);
b. a Chow test;
c. a LR test.
c. Estimate equation (3) and test the hypothesis that labor’s share is equal to .75, i.e., β3 =
.75.
d. Re-estimate the model (equation 2) with the first nine observations and check to see if the actual
log(output) for the 10th observation lies in the 95% forecast confidence interval.
(JM II)
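Part a amounts to a least squares fit of equation (2). A numpy sketch using the data from the table above (the course uses Stata; numpy is used here only for illustration, and no estimates are asserted in advance):

```python
import numpy as np

# Data from the table in problem 4
Y = np.array([40.26, 40.84, 42.83, 43.89, 46.10, 44.45, 43.87, 49.99, 52.64, 57.93])
L = np.array([64.63, 66.30, 65.27, 67.32, 67.20, 65.18, 65.57, 71.42, 77.52, 79.46])
K = np.array([133.14, 139.24, 141.64, 148.77, 151.02, 143.38, 148.19, 167.12, 171.33, 176.41])
t = np.arange(1, 11)

# Equation (2): ln Y = b1 + b2 t + b3 ln(L) + b4 ln(K) + ln(eps)
X = np.column_stack([np.ones(10), t, np.log(L), np.log(K)])
b, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
e = np.log(Y) - X @ b
s2 = e @ e / (10 - 4)               # s^2 = SSE/(n-k) with n=10, k=4
print("estimates:", b.round(4), " s^2 =", round(s2, 6))
```

The same fitted coefficients and s² feed the t, Chow, and LR tests asked for in parts b and c.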
5. The translog production function corresponding to the previous problem is given by
	ln(Y) = β1 + β2 t + β3 ln(L) + β4 ln(K) + β5 (ln(L))² + β6 (ln(K))² + β7 ln(L) ln(K) + ln(εt)
a. What restrictions on the translog production function result in a Cobb-Douglas
production function?
b. Estimate the translog production function using the data in problem 4 and use the Chow and
LR tests to determine whether it provides a statistically significant improvement in fit,
relative to the Cobb-Douglas function.
(JM II)
6. The transcendental production function corresponding to the data in problem 4 is defined by

	Y = e^(β1 + β2 t + β5 L + β6 K) · L^β3 · K^β4
a. What restrictions on the transcendental production function result in a Cobb-Douglas
production function?
b. Estimate the transcendental production function using the data in problem 4 and use the Chow
and LR tests to compare it with the Cobb-Douglas production function.
(JM II)
APPENDIX A
Some important derivatives:

Let	a = | a1 | ,  X = | x1 | ,  A = | a11  a12 |   (symmetric: a12 = a21)
	    | a2 |        | x2 |        | a21  a22 |

1.	d(a'X)/dX = d(X'a)/dX = a

2.	d(X'AX)/dX = 2AX

Proof of d(a'X)/dX = a:

Note that a'X = X'a = a1x1 + a2x2, so

	d(a'X)/dX = | ∂(a'X)/∂x1 | = | a1 | = a .
	            | ∂(a'X)/∂x2 |   | a2 |

Proof of d(X'AX)/dX = 2AX:

Note that X'AX = a11x1² + (a12 + a21)x1x2 + a22x2², so

	d(X'AX)/dX = | ∂(X'AX)/∂x1 | = | 2a11x1 + 2a12x2 | = 2 | a11  a12 | | x1 | = 2AX .
	             | ∂(X'AX)/∂x2 |   | 2a21x1 + 2a22x2 |     | a21  a22 | | x2 |
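The two derivative rules above can be checked against numerical gradients. A small numpy sketch, using central finite differences on an arbitrary symmetric A (values hypothetical):

```python
import numpy as np

a = np.array([1.0, 2.0])
A = np.array([[2.0, 0.5],
              [0.5, 3.0]])      # symmetric, so a12 = a21
x = np.array([0.3, -1.2])

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of scalar f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

# Rule 1: d(a'X)/dX = a
assert np.allclose(grad(lambda v: a @ v, x), a)

# Rule 2: d(X'AX)/dX = 2AX (A symmetric)
assert np.allclose(grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-5)
```

The same rules, applied with M or X'X in place of A, drive the derivations in Appendices B and C.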
APPENDIX B
An unbiased estimator of σ² is given by

	s² = (1/(n-k)) y'(I - X(X'X)⁻¹X')y = SSE/(n-k) .

Proof: To show this, we need some results on traces: tr(A) = Σi aii

1) tr(I) = n
2) If A is idempotent, tr(A) = rank of A
3) tr(A+B) = tr(A) + tr(B)
4) tr(AB) = tr(BA) if both AB and BA are defined
5) tr(ABC) = tr(CAB)
6) tr(kA) = k tr(A)

Now, remember that

	σ̂² = (1/n) e'e   and   s² = (1/(n-k)) e'e

	e = y - Xβ̂ = y - X(X'X)⁻¹X'y = My
	  = M(Xβ + ε) = MXβ + Mε
	  = Mε ,

where M = I - X(X'X)⁻¹X', so that MX = 0. Note that M is symmetric and idempotent (problem set R.2).

So

	σ̂² = (1/n) e'e = (1/n) ε'M'Mε
	    = (1/n) ε'MMε
	    = (1/n) ε'Mε

and

	s² = (1/(n-k)) ε'Mε .

	E(σ̂²) = (1/n) E(ε'Mε) = (1/n) E(tr(ε'Mε))
	      = (1/n) E(tr(Mεε')) = (1/n) tr(M E(εε'))
	      = (1/n) tr(M σ²I) = (σ²/n) tr(M)        [E(εε') = σ²I because Cov(εi, εj) = 0, i ≠ j]
	      = (σ²/n) tr(I - X(X'X)⁻¹X')
	      = (σ²/n) (n - tr(X(X'X)⁻¹X'))
	      = (σ²/n) (n - tr(X'X(X'X)⁻¹))
	      = (σ²/n) (n - tr(Ik))
	      = (σ²/n) (n - k) .

So E(σ̂²) = ((n-k)/n) σ², and hence E(s²) = (n/(n-k)) E(σ̂²) = σ².

Therefore σ̂² is biased, but s² is unbiased.
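The matrix facts the proof leans on — M symmetric, idempotent, MX = 0, and tr(M) = n - k — can be confirmed numerically for any X of full column rank. A numpy sketch with an arbitrary simulated X (hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 20, 3
X = rng.normal(size=(n, k))                         # any full-column-rank X
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T    # residual-maker matrix

assert np.allclose(M, M.T)                # M is symmetric
assert np.allclose(M @ M, M)              # M is idempotent
assert np.isclose(np.trace(M), n - k)     # tr(M) = n - k, the key step in the proof
assert np.allclose(M @ X, 0)              # MX = 0, so e = My = M(Xb + eps) = M eps
```

The trace identity is exactly why dividing SSE by n - k, rather than n, removes the bias.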
APPENDIX C
β̂ = AY = (X'X)⁻¹X'Y is BLUE.

Proof: Let β̂i = Ai Y where Ai denotes the ith row of the matrix A. Since the result will be
symmetric for each β̂i (hence, for each Ai), denote Ai by a' where a is an (n x 1) vector.
The problem then becomes:

	Min a'Ia        where I is n x n
	s.t. AX = I     where X is n x k (for unbiasedness)

or

	min a'Ia
	s.t. X'a = i    where i is the ith column of the identity matrix.

Let ℒ = a'Ia + λ'(X'a - i) denote the associated Lagrangian function, where λ is k x 1.
The necessary conditions for a solution are:

	∂ℒ/∂a = 2Ia + Xλ = 0
	∂ℒ/∂λ = X'a - i = 0 .

The first condition implies

	a = (-1/2) Xλ .

Now substitute a = (-1/2)Xλ into the expression ∂ℒ/∂λ = 0 and we obtain

	(-1/2) X'Xλ = i
	λ = -2 (X'X)⁻¹ i
	a = (-1/2)(-2) X(X'X)⁻¹ i = X(X'X)⁻¹ i ,

so a' = i'(X'X)⁻¹X' = Ai, which implies

	A = (X'X)⁻¹X'

and hence

	β̂ = (X'X)⁻¹X'y .
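The conclusion can be probed numerically: A = (X'X)⁻¹X' satisfies the unbiasedness constraint AX = I, and perturbing any row of A within that constraint can only increase a'a (hence the variance σ²a'a). A numpy sketch with simulated X (hypothetical), using the decomposition a = a* + d with X'd = 0, so a'a = a*'a* + d'd ≥ a*'a*:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 15, 2
X = rng.normal(size=(n, k))
A = np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(A @ X, np.eye(k))    # unbiasedness constraint AX = I

i0 = np.eye(k)[:, 0]
a_star = A[0]                           # optimal weight vector for beta_1
assert np.allclose(X.T @ a_star, i0)    # satisfies X'a = i

# Build a perturbation d in the null space of X' (orthogonal to the columns of X)
d = rng.normal(size=n)
d -= X @ np.linalg.solve(X.T @ X, X.T @ d)
assert np.allclose(X.T @ d, 0)

a_alt = a_star + d                      # another linear unbiased estimator's weights
assert np.allclose(X.T @ a_alt, X.T @ a_star)   # still unbiased
assert a_alt @ a_alt >= a_star @ a_star         # but with larger variance
```

This is the "best" in BLUE: among all linear unbiased weight vectors, the least squares row has minimum length, hence minimum variance.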