INTRODUCTORY GRADUATE ECONOMETRICS
Craig Burnside
Department of Economics
University of Pittsburgh
Pittsburgh, PA 15260
January 2, 1994
∗These are incomplete notes intended for use in an introductory graduate econometrics
course. As notes, the style of presentation is deliberately informal and lacking in proper
citations. Please point out any errors you find.
1. Ordinary Least Squares
We have a sequence of observations on a random variable
yt, t = 1, 2, . . . , T.
The T indicates I’m a time series guy. Furthermore we have an economic model which
tells us that y is a linear function of explanatory variables plus a random component. I.e.
yt = xt'β + εt

where xt is a k × 1 vector of explanatory variables, β is a k × 1 vector, and yt and εt are scalars. Written out this is

yt = ( x1t  x2t  · · ·  xkt ) ( β1  β2  · · ·  βk )' + εt.
Typically x1t = 1 for all t. If the observations are stacked we get

y1 = β1x11 + β2x21 + . . . + βkxk1 + ε1
y2 = β1x12 + β2x22 + . . . + βkxk2 + ε2
...
yT = β1x1T + β2x2T + . . . + βkxkT + εT

or

y = Xβ + ε

with y and ε T × 1 vectors, and X a T × k matrix.
1.1. Assumptions
(1) X is a matrix of fixed variables (unrealistic) and has full column rank (i.e. the columns of X are linearly independent)
(2) E(ε) = 0
(3) E(εε') = σ²IT, or more strongly, the εt are i.i.d.
(4) β is an unknown constant parameter vector, and σ² is an unknown scalar.
1.2. Estimation
We want to estimate β. The first method is to use the least squares criterion. If I choose some estimate of β, the difference between y and Xβ is called a residual: e = y − Xβ. Minimize the sum of the squared residuals by choice of β:

SSE = Σ_{t=1}^T et² = e'e = (y − Xβ)'(y − Xβ) = y'y − 2β'X'y + β'X'Xβ

Minimize SSE by choosing β so that ∂SSE/∂β = 0:

∂SSE/∂β = −2X'y + 2X'Xβ = 0
The solution is β̂ = (X'X)⁻¹X'y. To verify that this is a minimum compute the matrix of second derivatives,

∂²SSE/∂β∂β' = 2X'X

which is positive definite. See Rule 5, p. 961 of Red Judge to see that this follows from our assumptions. Our solution is therefore the unique minimum.

We will also want to have an estimate of the variance σ². The estimator we will use is

σ̂² = (1/(T − k)) Σ_{t=1}^T et² = e'e/(T − k)
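As a numerical sketch of these formulas (the simulated design, seed, and all variable names are illustrative, not from the notes; `lstsq` is used only as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3

# Fixed regressors with an intercept column, as in assumption (1).
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=T)

# beta_hat = (X'X)^{-1} X'y  (solving the normal equations is preferred
# numerically to forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# sigma2_hat = e'e / (T - k)
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (T - k)
```

The residuals are orthogonal to the columns of X by construction, which is a useful sanity check on any implementation.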
1.3. Sampling Properties
Since the vector y is random, β̂ is a vector-valued random variable. Similarly, σ̂² is a random variable. Therefore, we can ask questions about the distributions of these random variables. The first property we will consider is the mean of these random variables. We will show that both estimators are unbiased. An estimator θ̂ is unbiased if E(θ̂) = θ.
E(β̂) = E[(X'X)⁻¹X'y] = (X'X)⁻¹X'E(y)
      = (X'X)⁻¹X'E(Xβ + ε)
      = (X'X)⁻¹X'Xβ
      = β

E(σ̂²) = E[(1/(T − k)) e'e] = (1/(T − k)) E(e'e)
Consider the definition of the residuals:

e = y − Xβ̂ = y − X(X'X)⁻¹X'y
  = [I − X(X'X)⁻¹X']y = My
  = [I − X(X'X)⁻¹X'](Xβ + ε) = Mε

Some facts about M. It is T × T but has rank T − k. It is symmetric, as M = M'. Furthermore it is idempotent, as MM = M. As a result of all this (the trace is the sum of the elements on the diagonal of a square matrix)

e'e = tr(e'e) = tr(ε'M'Mε) = tr(ε'Mε)

E(e'e) = E[tr(ε'Mε)] = E[tr(Mεε')]    Rule: tr(ABC) = tr(CAB) = tr(BCA)
       = tr[M E(εε')] = tr(Mσ²I) = σ²tr(M)
       = σ²(T − k)

The last step follows from the fact that an idempotent matrix of rank s has trace equal to s. Therefore E(σ̂²) = σ².
We will also derive the variance-covariance matrix of β̂, i.e. we will compute (give explanation of what's in there) V(β̂) = E[(β̂ − β)(β̂ − β)']. To do this, note that β̂ − β = (X'X)⁻¹X'y − β = (X'X)⁻¹X'(Xβ + ε) − β = (X'X)⁻¹X'ε. Therefore,

V(β̂) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹] = (X'X)⁻¹X'E(εε')X(X'X)⁻¹
      = (X'X)⁻¹X'σ²IX(X'X)⁻¹
      = σ²(X'X)⁻¹
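The unbiasedness and variance results can be illustrated with the repeated-sampling thought experiment: hold X fixed, redraw the errors many times, and compare the Monte Carlo mean and covariance of β̂ with β and σ²(X'X)⁻¹. A minimal sketch (the simulated design and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma = 50, 1.0

# Hold X fixed across repeated samples, as the fixed-regressor assumption requires.
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta = np.array([0.5, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the errors many times and re-estimate beta each time.
draws = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=T))
    for _ in range(20000)
])

mean_beta_hat = draws.mean(axis=0)   # should be close to beta (unbiasedness)
cov_beta_hat = np.cov(draws.T)       # should be close to sigma^2 (X'X)^{-1}
```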
1.4. The Gauss-Markov Theorem
Recall that the elements of β̂ are linear combinations of the elements of y, i.e. it is a linear estimator. Suppose we consider any other linear estimator β̃ = Ay which is also unbiased, i.e. E(β̃) = β. Then V(β̃) ≥ V(β̂). Proof: First note that A is a k × T matrix. Since β̃ = Ay = AXβ + Aε, we have E(β̃) = AXβ = β by the assumption of unbiasedness. This means AX = Ik. However this does not imply A = X⁻¹ since neither A nor X is invertible. Notice that the last result means that β̃ = β + Aε, or β̃ − β = Aε. Therefore

V(β̃) = E[(β̃ − β)(β̃ − β)'] = E(Aεε'A')
      = σ²AA'.

Now define C = A − (X'X)⁻¹X'. This means that A = C + (X'X)⁻¹X'. Some neat facts about C: CC' is a k × k positive semi-definite matrix, and CX = AX − (X'X)⁻¹X'X = 0! Therefore,

V(β̃) = σ²[C + (X'X)⁻¹X'][C + (X'X)⁻¹X']'
      = σ²[C + (X'X)⁻¹X'][C' + X(X'X)⁻¹]
      = σ²[CC' + (X'X)⁻¹X'C' + CX(X'X)⁻¹ + (X'X)⁻¹X'X(X'X)⁻¹]
      = σ²[CC' + (X'X)⁻¹]

Therefore, V(β̃) − V(β̂) is a positive semidefinite matrix. This proves that OLS is BLUE given our assumptions.
1.5. Further Statistical Properties
Suppose we make a further assumption that
(5) The errors are normally distributed.
This last assumption implies that β̂ is a normally distributed random vector since it is a linear combination of the εt.

Also define C² = (T − k)σ̂²/σ² = e'e/σ². Recall that e = Mε. Therefore,

C² = ε'M'Mε/σ² = (ε/σ)'M(ε/σ).

Notice that ε/σ ∼ N(0, I) and that M is idempotent of rank T − k. This implies that C² ∼ χ²(T − k). Check Red Judge for this trivia tidbit!
We could use both of these last results to conduct hypothesis tests if we so desired.
Unfortunately we can't use the first result directly since V(β̂) = σ²(X'X)⁻¹ involves the unknown parameter σ². Recall that a t-distributed random variable is defined as t(n) = z/√(x/n) where z is standard normal, x is χ²(n), and z and x are independent. Thus, you see why we use the t-statistics

ti = (β̂i − βi)/√(σ̂²[(X'X)⁻¹]ii)
   = [(β̂i − βi)/√(σ²[(X'X)⁻¹]ii)] ÷ √(σ̂²/σ²)
   = [(β̂i − βi)/√(σ²[(X'X)⁻¹]ii)] ÷ √(C²/(T − k))
   = z/√(x/n).
Thus if we wanted to test H0 : βi = βi0, we could exploit the fact that ti is distributed
t(T − k). Notice that I didn’t verify independence in my proof. This would make a good
assignment question.
We might also want to conduct tests such as H0 : Rβ = r, where R is a j×k matrix.
I.e. we're looking at j joint linear restrictions on β. Notice that

Rβ̂ − Rβ = R(β̂ − β) = R(X'X)⁻¹X'ε ∼ N(0, σ²R(X'X)⁻¹R')

The next step involves finding the Cholesky decomposition of σ²R(X'X)⁻¹R', i.e. find C, lower triangular and invertible, such that CC' = σ²R(X'X)⁻¹R'. Now define Z = C⁻¹(Rβ̂ − Rβ). From the definition of C we have Z ∼ N(0, Ij). This means Z'Z is χ²(j). Or

(Rβ̂ − Rβ)'C⁻¹'C⁻¹(Rβ̂ − Rβ) = (Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ)/σ² ∼ χ²(j).
Again we have a problem since we don't know σ². When we use σ̂² instead we get the familiar F-statistic. To see this recall that F(n1, n2) = (χ₁²/n1)/(χ₂²/n2) where χ₁² and χ₂² are independent. Therefore,

F(j, T − k) = [(Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ)/(σ²j)] / {[(T − k)σ̂²/σ²]/(T − k)}
            = (Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ)/(jσ̂²)

Are they independent χ²'s? Good assignment question!
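A numerical sketch of the F-statistic (illustrative simulated data). As a cross-check it also computes the statistic from the restricted and unrestricted sums of squared residuals, F = [(SSEr − SSEu)/j]/[SSEu/(T − k)] — a standard equivalence for linear restrictions that is not derived above:

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 80, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = (e @ e) / (T - k)                  # unbiased estimate of sigma^2

# H0: beta_2 = beta_3 = 0, i.e. R beta = r with j = 2 restrictions.
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
j = 2

diff = R @ beta_hat - r
mid = R @ np.linalg.inv(X.T @ X) @ R.T
F = diff @ np.linalg.solve(mid, diff) / (j * s2)

# Equivalent form: restricted regression (intercept only here) vs unrestricted SSE.
e_r = y - y.mean()
F_sse = ((e_r @ e_r - e @ e) / j) / (e @ e / (T - k))
```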
1.6. Measures of Fit
Measures of fit are used to summarize the extent to which the estimated model ‘fits’
the data. Notice that by the definition of the residuals, y = Xβ̂ + e. This implies that y − ȳ1 = Xβ̂ − ȳ1 + e, where 1 is a T × 1 vector of ones and ȳ is the sample mean of yt. Thus the total sum of squared (SST) deviations of yt around its mean is given by

SST = (y − ȳ1)'(y − ȳ1)
    = (Xβ̂ − ȳ1 + e)'(Xβ̂ − ȳ1 + e)
    = (ŷ − ȳ1 + e)'(ŷ − ȳ1 + e)
    = (ŷ − ȳ1)'(ŷ − ȳ1) + e'e − 2ȳ1'e
    = (ŷ − ȳ1)'(ŷ − ȳ1) + e'e    if 1 ∈ X
    = SSR + SSE

The R² is defined to be R² = SSR/SST = 1 − SSE/SST. Clearly this can be increased simply by adding regressors indiscriminately, because of the extra degree of freedom to reduce SSE that is introduced by adding a parameter. Thus the adjusted R² is introduced to adjust for the number of regressors. It is defined as

R̄² = ((T − 1)/(T − k))R² − ((k − 1)/(T − k)) = 1 − ((T − 1)/(T − k))(1 − R²).
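A quick numerical check of the fit measures and the adjusted-R² identity (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 60, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

SST = np.sum((y - y.mean()) ** 2)
SSE = e @ e
SSR = SST - SSE                      # valid because X contains a column of ones
R2 = 1 - SSE / SST
R2_bar = 1 - (T - 1) / (T - k) * (1 - R2)
```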
1.7. Geometry
We saw earlier that e = y − Xβ̂ = y − X(X'X)⁻¹X'y = [I − X(X'X)⁻¹X']y = My. M was shown to be symmetric and idempotent. Now, let's also consider the fitted values ŷ = Xβ̂ = X(X'X)⁻¹X'y = Py. P is also symmetric and idempotent, but this time with rank k.

The matrix P projects a T × 1 vector y onto the space spanned by the k T × 1 vectors which are the columns of X. I.e. P produces a linear combination of the columns of X, namely ŷ,
whose deviation from y, namely e, is orthogonal to the space spanned by those columns. M would project any vector onto the orthogonal complement of that space (because it produces the residuals, which are orthogonal to X). Notice that ŷ = Py so that y − ŷ = y − Py = (I − P)y = My = e.
Things to note are the following.
1. X'e = X'My = X'[I − X(X'X)⁻¹X']y = [X' − X'X(X'X)⁻¹X']y = 0. Notice that only if X contains a 1 vector does Σ_{t=1}^T et = 1'e = 0. Furthermore we have X'ŷ = X'Py = X'X(X'X)⁻¹X'y = X'y.
2. y'y = (Xβ̂ + e)'(Xβ̂ + e)
       = β̂'X'Xβ̂ + e'Xβ̂ + β̂'X'e + e'e
       = β̂'X'Xβ̂ + e'e.

I.e. (raw) SST = SSR + SSE.
3. Suppose I transform the variables before regression by applying a nonsingular k × k matrix A to X, i.e. I generate X* = XA. Then notice that if you regress y on X* you get

P* = X*(X*'X*)⁻¹X*'
   = XA(A'X'XA)⁻¹A'X'
   = XAA⁻¹(X'X)⁻¹A'⁻¹A'X'
   = X(X'X)⁻¹X' = P.

Thus, the predicted values from the two regressions will be the same. What about the coefficients? β̂* = (X*'X*)⁻¹X*'y = A⁻¹(X'X)⁻¹A'⁻¹A'X'y = A⁻¹β̂. The coefficients are changed by the inverse of the linear transformation.
4. Suppose I regress y on X and Z where the model is y = Xβ + Zγ + ε. Define W = (X Z). Also define θ = (β' γ')'. Then y = Wθ + ε. Thus we'll get θ̂ = (W'W)⁻¹W'y. Expanding this we get

( β̂ )   ( X'X  X'Z )⁻¹ ( X' )          ( X' )
(    ) = (           )  (    ) y = H⁻¹ (    ) y
( γ̂ )   ( Z'X  Z'Z )   ( Z' )          ( Z' )
This means that the estimate of γ is given by

γ̂ = (H⁻¹)₂₁X'y + (H⁻¹)₂₂Z'y
  = −(H₂₂ − H₂₁H₁₁⁻¹H₁₂)⁻¹H₂₁H₁₁⁻¹X'y + (H₂₂ − H₂₁H₁₁⁻¹H₁₂)⁻¹Z'y
  = −[Z'Z − Z'X(X'X)⁻¹X'Z]⁻¹(Z'X)(X'X)⁻¹X'y + [Z'Z − Z'X(X'X)⁻¹X'Z]⁻¹Z'y
  = −(Z'[I − X(X'X)⁻¹X']Z)⁻¹Z'Pxy + (Z'[I − X(X'X)⁻¹X']Z)⁻¹Z'y
  = −(Z'MxZ)⁻¹Z'Pxy + (Z'MxZ)⁻¹Z'y
  = (Z'MxZ)⁻¹Z'(I − Px)y
  = (Z'MxZ)⁻¹Z'Mxy

where Px = X(X'X)⁻¹X' and Mx = I − Px.
Suppose that instead I did the following. First regress y onto X and compute the residuals y* = Mxy from that regression. Also regress each of the columns of Z onto X and make a matrix out of all the different residual series, Z* = MxZ. Now regress y* onto Z*. You'll get

γ̂ = (Z*'Z*)⁻¹Z*'y*
  = (Z'Mx'MxZ)⁻¹Z'Mx'Mxy
  = (Z'MxZ)⁻¹Z'Mxy

Notice the equivalence. This result is called the Frisch-Waugh-Lovell Theorem.
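The Frisch-Waugh-Lovell equivalence is easy to verify numerically; a sketch with an illustrative simulated design:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 70
X = np.column_stack([np.ones(T), rng.normal(size=T)])   # first block of regressors
Z = rng.normal(size=(T, 2))                             # second block
y = X @ np.array([1.0, 0.5]) + Z @ np.array([2.0, -1.0]) + rng.normal(size=T)

# Full regression of y on W = (X Z): gamma_hat is the last two coefficients.
W = np.column_stack([X, Z])
theta_hat = np.linalg.solve(W.T @ W, W.T @ y)
gamma_full = theta_hat[2:]

# Two-step route: residualize y and Z on X, then regress residuals on residuals.
Mx = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
y_star, Z_star = Mx @ y, Mx @ Z
gamma_fwl = np.linalg.solve(Z_star.T @ Z_star, Z_star.T @ y_star)
```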
2. Hypothesis Testing
There are different approaches to hypothesis testing. The ones you will be introduced
to in this course are all classical hypothesis tests. The term classical refers to the basic
approach to testing. Essentially, the classical approach works as follows.
1. Determine the null hypothesis to be tested.
2. Find some testable implication of the null hypothesis. Usually this will pertain to
the distribution of some statistic under the null.
3. Set up the test. What this means is that you have to have a rejection region for the
test. I.e. you must define a region such that if the statistic ever lies in that region
you will reject the null hypothesis.
4. Actually compute the test statistic for your sample and determine the outcome of
the test.
How does one actually construct the test? In the classical approach, the parameters
of the model are treated as unknown fixed constants. This is as opposed to the Bayesian
approach where the econometrician actually assigns a prior probability distribution to
the parameters, and then updates this to form a posterior given the observed sample
of data. Because of the inherent assumption of the classical approach there is always
assumed to exist some unknown true value of the parameter in question, unless the model
is misspecified.
Classical hypothesis tests are constructed with the following two considerations in
mind.
1. SIZE - this is the probability of rejecting the null hypothesis even though the null
hypothesis is true, i.e. it is the probability of type I error.
2. POWER - this is the probability of rejecting the null hypothesis when it is false, i.e.
it is 1 minus the probability of type II error.
To evaluate the size or power of tests one clearly needs to compute the probabilities
in question. One way to do this would be to perform the conceptual experiment of an
infinite number of repeated samples. For example, suppose I wanted to know the weight
assigned to heads on a coin. I either know the true weight (i.e. I know the probability
distribution governing the outcome) or I can flip the coin an infinite number of times. We
know that the proportion of outcomes which are heads will converge to the true proportion
with a large enough number of flips. That’s what’s really going on in hypothesis testing.
Size is determined by assuming the null hypothesis is true. If it is true, then the
question is what is the probability in any given sample that I will end up rejecting the null.
Sometimes, this can be calculated by carefully working out the probability distribution of
the test statistic. Other times, it can be determined by doing a lot of repeated sampling
in a controlled experiment. This is what is referred to as Monte-Carlo simulation.
For a given null hypothesis, size is a fixed number α. However, power clearly depends
on what the true parameter actually is. For example, suppose I set up the following test
for whether the mean of a normally distributed random variable was zero or not. Assume that we know the variable is distributed normally, and that the variance is 1. The null hypothesis is H0 : µ = 0 versus H1 : µ ≠ 0. Suppose I decide to reject the null hypothesis whenever the sample mean x̄ from T observations on X lies below −0.1 or above 0.1. What do we know about the sample mean when the draws are i.i.d.? A testable implication is that if H0 is true, the sample mean x̄ is distributed as N(0, 1/T). This is a standard result: we know that the distribution of the sample mean of a N(µ, σ²) is N(µ, σ²/T). Therefore the size in my example is just

1 − Pr(−0.1 < x̄ < 0.1) = 1 − Pr(−0.1√T < Z < 0.1√T)
which is just some number for any fixed T. However what is the power of the test? It is the probability of rejecting the null hypothesis when it is false. When it is false, we want to standardize x̄ with the correct (true) mean, rather than 0. Therefore power is

1 − Pr(−0.1 < x̄ < 0.1) = 1 − Pr((−0.1 − µ)√T < Z < (0.1 − µ)√T)
which is clearly a function of µ. Suppose µ is very close to zero. Then power will approxi-
mately equal the size of the test. For the test I’ve set up, the power gets bigger the further
µ gets from zero.
The way hypothesis tests are typically designed is to choose a test with a given
amount of size. Then among tests of a given size the natural thing to do would be to look
for the most powerful test. This isn’t always the approach, especially when asymptotic
tests are being used. However, there is a theory of uniformly most powerful tests which
tries to find exactly that. It turns out that usually such a test is not available. However,
for certain classes of distributions it’s often the case that there is a UMP test among the
class of unbiased tests, i.e. tests for which the power is always greater than or equal to the
size.
This is what provides the logic behind the typical tail rejection regions that we use.
For example, to test whether µ = 0 in the example above, you usually want to set up the statistic ZS = x̄/(σ/√T). Usually we set up a 5% test where the rejection region is {z : |z| > 1.96}. Clearly when the null hypothesis is true the probability of rejection is 5%. However, there are an infinite number of critical regions which give tests whose sizes are 5%. To be specific, our rejection region C would only have to satisfy the condition that ∫_C φ(z)dz = .05. There are many such regions. Let's try out a few, and examine their power properties:

i. C1 = {z : |z| > 1.96},
ii. C2 = {z : |z| < 0.0625},
iii. C3 = {z : 0.5960 < z < 0.7535}.
All three critical regions have size of 5%, but their power properties are dramatically
different. Look at the power functions on the following page. These are constructed for the
case where σ is known to be 1 and T = 100, and xt is normally distributed and i.i.d. The
region C1 is the standard rejection region, and is seen to have maximal power, approaching
1, against the most distant alternatives to H0 : µ = 0, and weakest power, approaching
0.05 against those alternatives which are closest to H0. The region C2 leads to size of 5%
but has obviously undesirable power properties. Because it is centered around zero, power
is maximal at µ = 0, and is never greater than 0.05. Furthermore, power is zero against
the most distant alternatives. The region C3, which is off-centered but involves no tail
area, again leads to terrible power against the most distant alternatives, although power
is now maximized to the right of H0. However, it has slightly higher power against some
nearby alternatives than does C1, although this gain in power is very hard to discern in
the graphs.
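The size and power calculations behind those comparisons can be sketched directly from the normal CDF (built here from the error function; T = 100 and σ = 1 as in the text, everything else is illustrative):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

T, sigma = 100, 1.0

def power(mu, region):
    """Pr(Z_S lands in `region`) when the true mean is mu.

    Z_S = x_bar / (sigma/sqrt(T)); under mean mu, Z_S ~ N(mu*sqrt(T)/sigma, 1),
    so Pr(a < Z_S < b) = Phi(b - shift) - Phi(a - shift)."""
    shift = mu * math.sqrt(T) / sigma
    return sum(Phi(b - shift) - Phi(a - shift) for a, b in region)

# The three critical regions, written as unions of intervals of Z_S values.
C1 = [(-math.inf, -1.96), (1.96, math.inf)]
C2 = [(-0.0625, 0.0625)]
C3 = [(0.5960, 0.7535)]

sizes = [power(0.0, C) for C in (C1, C2, C3)]   # all should be about 0.05
```

Evaluating `power` over a grid of µ values reproduces the qualitative shapes described in the text: C1 dominates against distant alternatives, C2 never exceeds its size, and C3 edges out C1 only against certain nearby alternatives.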
Figure 2.1: Power
3. Maximum Likelihood
We have a sequence of observations on a random variable
yt, t = 1, 2, . . . , T.
Suppose the joint density function of this sequence of random variables is given by
f(y1, y2, . . . , yT |θ)
where θ is some parameter (vector) of that joint distribution. f is called the likelihood function when written as a function of θ, i.e. L(θ|y1, y2, . . . , yT) = f(y1, y2, . . . , yT|θ). The log-likelihood function is given by ℒ(θ|y) = ln[L(θ|y)], where y is the vector of observations as in section 1.
3.1. The Linear Model
In the linear model we had yt = xt'β + εt. To determine the likelihood function for y let us first consider the joint distribution, or likelihood, of the vector ε. Given assumptions (1)–(5) in section 1, we have

f(ε) = Π_{t=1}^T φ(εt)    (by independence)
     = Π_{t=1}^T (2π)^(−1/2) σ⁻¹ exp(−εt²/(2σ²))
     = (2π)^(−T/2) σ^(−T) exp(−ε'ε/(2σ²))

Now to get the distribution of y we have to use the following fact: if the variable y is a function of ε, say y = g(ε), then the joint distribution of the y's is

fY(y) = f[g⁻¹(y)] |J|    where J = ∂ε/∂y = ∂g⁻¹(y)/∂y.

In the linear model y = Xβ + ε so that ε = y − Xβ. Therefore J = I and |J| = 1. Thus

L(θ|y) = f(y) = (2π)^(−T/2) σ^(−T) exp(−(y − Xβ)'(y − Xβ)/(2σ²))
and the log-likelihood is

ℒ(θ|y) = −(T/2)ln(2π) − (T/2)lnσ² − (1/(2σ²))(y − Xβ)'(y − Xβ)
       = −(T/2)ln(2π) − (T/2)lnσ² − (1/(2σ²)) Σ_{t=1}^T (yt − xt'β)².

In this case, θ = (β' σ²)'. The maximum likelihood estimator (MLE) for θ is obtained by maximizing the likelihood function. Since ln is a monotonic transformation we can maximize the log-likelihood instead. We will do this by solving the first order conditions for choices of β and σ².
First expand the definition of ℒ:

ℒ = −(T/2)ln(2π) − (T/2)lnσ² − (1/(2σ²))(y'y − 2β'X'y + β'X'Xβ)

∂ℒ/∂β = −(1/(2σ²))(−2X'y + 2X'Xβ)

∂ℒ/∂σ² = −(T/2)σ⁻² + (1/2)σ⁻⁴(y − Xβ)'(y − Xβ)

If we solve these FONCs for β and σ² we get the ML estimators β̂ = (X'X)⁻¹X'y and σ̂² = (1/T)e'e, where e = y − Xβ̂. Notice that β̂ can be solved for independently of σ² and is the same as the OLS estimator. We already know that E(β̂) = β. Since σ̂² is (T − k)/T times the unbiased estimator e'e/(T − k) of section 1, we also know that E(σ̂²) = (T − k)σ²/T. Furthermore, V(β̂) = σ²(X'X)⁻¹. The variance of the variance estimator is
E[(σ̂² − Eσ̂²)²] = E[((1/T)e'e − ((T − k)/T)σ²)²]
               = (σ⁴/T²) E[(e'e/σ² − (T − k))²]
               = (σ⁴/T²) E[(χ²(T − k) − (T − k))²]
               = 2(T − k)σ⁴/T²
because the mean of a χ²(T − k) is T − k and its variance is 2(T − k). As for the covariance,

Cov(β̂, σ̂²) = E[(β̂ − β)(σ̂² − Eσ̂²)]
           = E[((X'X)⁻¹X'ε)((1/T)e'e − ((T − k)/T)σ²)]
           = 0

This is because the first term is a linear combination of a normal, while the second term is a quadratic form (based on M) in the same normal. These are independent if the product of the two matrices involved is 0. Of course (X'X)⁻¹X'M = 0. Therefore the variance-covariance matrix is

V(θ̂) = ( σ²(X'X)⁻¹         0         )
        (     0      2(T − k)σ⁴/T²  )
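A numerical sketch of the MLE formulas: compute β̂ and σ̂² from the first order conditions and check that the log-likelihood is (weakly) lower at perturbed parameter values (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 40
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=T)

def loglik(beta, sigma2):
    # Log-likelihood of the normal linear model at (beta, sigma2).
    resid = y - X @ beta
    return -T/2*np.log(2*np.pi) - T/2*np.log(sigma2) - (resid @ resid)/(2*sigma2)

# The MLEs from the first order conditions.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = (e @ e) / T             # divides by T, not T - k

L_max = loglik(beta_hat, sigma2_hat)
```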
3.2. Cramer-Rao Lower Bound
The Cramer-Rao Lower Bound simply states that any unbiased estimator of a parameter vector θ will have variance no smaller than the inverse of a matrix called I(θ). I.e. if θ̂ is an unbiased estimator of θ, then V(θ̂) − I(θ)⁻¹ is a p.s.d. matrix, where

I(θ) = −E( ∂²ℒ/∂θ∂θ' ).
This is a general principle applying to any model with a likelihood. To find the matrix I(θ), called the information matrix, we need to compute the second derivatives.

∂²ℒ/∂β∂β' = −(1/σ²)X'X

∂²ℒ/∂β∂σ² = (1/σ⁴)(−X'y + X'Xβ)

∂²ℒ/∂σ²∂β' = (1/σ⁴)(−y'X + β'X'X)

∂²ℒ/∂(σ²)² = (T/2)σ⁻⁴ − σ⁻⁶(y − Xβ)'(y − Xβ)

The negative of the expectation of these terms is

I(θ) = ( σ⁻²X'X              0              )
       (    0     −(T/2)σ⁻⁴ + σ⁻⁶Tσ²  )

     = ( σ⁻²X'X       0       )
       (    0     (T/2)σ⁻⁴ ).
Thus the CRLB for θ is

I(θ)⁻¹ = ( σ²(X'X)⁻¹     0     )
         (     0      2σ⁴/T  ).
Another useful concept in maximum likelihood theory is the sufficient statistic. A sufficient statistic is a statistic in terms of which the likelihood can be completely characterized, i.e. the random data no longer appear directly in the formula for the likelihood. In this case

L(θ|y) = (2π)^(−T/2) σ^(−T) exp(−(y − Xβ)'(y − Xβ)/(2σ²)).

Now y − Xβ = (y − Xβ̂) + (Xβ̂ − Xβ) = e + X(β̂ − β), so that

(y − Xβ)'(y − Xβ) = e'e + 2e'X(β̂ − β) + (β̂ − β)'X'X(β̂ − β)
                  = Tσ̂² + (β̂ − β)'X'X(β̂ − β)

since e'X = 0 and σ̂² = e'e/T. Substituting this into L gets rid of the data y. Thus β̂ and σ̂² are jointly sufficient statistics for L. A theorem in statistics states that if a sufficient statistic is unbiased it is minimum variance unbiased. Therefore, β̂ and the unbiased variance estimator e'e/(T − k) are the MVUE of β and σ².
Since the OLS estimators are the MVUEs, this might raise the question: why use
ML? It turns out that in cases where we have a growing sample size and we need to
use asymptotic theory (we’ll see cases like this later), ML estimators have the following
properties:
1. Consistency (sort of asymptotic unbiasedness)
2. Asymptotic normality of the estimator regardless of distribution of
3. Asymptotic efficiency - achieving the CRLB.
We will discuss asymptotic theories more completely when we get to cases with
stochastic regressors.
3.3. A Brief Discussion of Asymptotics
What is all this asymptotic stuff? Well, we have to digress a little bit to talk about it. We need definitions. This is different from what we've been talking about because the
sample size T is treated as variable. What does this mean for things like X fixed? It just
means that in repeated samples of any size we’d get the same sequence of xt’s. We will
also need to redefine some things because of scale factors.
CONSISTENCY: An estimator θ̂T converges in probability to c if, for every ε > 0,

lim_{T→∞} Pr(|θ̂T − c| > ε) = 0.

This is sometimes denoted c = plim θ̂T. θ̂T is consistent if θ = plim θ̂T.

ASYMPTOTIC NORMALITY: A random variable XT converges in distribution to a random variable X if

lim_{T→∞} |FT(x) − F(x)| = 0

at all continuity points of F. Typically, we'll see cases where θ̂T is asymptotically normal in the sense that √T(θ̂T − θ) converges in distribution to N(0, V).
To illustrate this let's take a scalar i.i.d. random variable xt ∼ N(µ, σ²). The estimator I'll consider is x̄T = Σ_{t=1}^T xt/T. Because I've assumed normality we're obviously going to get an exact normal distribution for x̄T for any T, because it's a linear combination of normal random variables. The mean is clearly µ and because the xt are i.i.d. we easily get V(x̄T) = σ²/T. Can we prove consistency? Yes. x̄T − µ = wT ∼ N(0, σ²/T). Therefore,

Pr(|x̄T − µ| > ε) = Pr(x̄T − µ > ε) + Pr(x̄T − µ < −ε)
                 = Pr(√T wT/σ > √T ε/σ) + Pr(√T wT/σ < −√T ε/σ)
                 = Pr(Z > √T ε/σ) + Pr(Z < −√T ε/σ).

The limit as T → ∞ of this quantity is clearly 0.

The important thing about asymptotic normality is that you must blow up by √T. Note in the example the variance of x̄T − µ goes to zero if you don't blow it up, while the variance of √T(x̄T − µ) is σ² if you do. This would be unnecessary if you always had an exact small sample distribution as
we have in the example and all the work so far. But in cases where we can't characterize the small sample distribution, we use the asymptotic distribution to approximate the small sample distribution, which is only useful if the asymptotic distribution is nondegenerate.
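The √T point can be seen in a small simulation: the spread of x̄T − µ collapses as T grows, while the spread of √T(x̄T − µ) stays near σ (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, reps = 2.0, 3.0, 5000

raw_sds, scaled_sds = [], []
for T in (10, 100, 1000):
    # reps independent samples of size T, reduced to their sample means.
    xbar = rng.normal(mu, sigma, size=(reps, T)).mean(axis=1)
    raw_sds.append((xbar - mu).std())                    # shrinks like sigma/sqrt(T)
    scaled_sds.append((np.sqrt(T) * (xbar - mu)).std())  # stays near sigma
```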
3.4. The 3 Tests
We will now discuss three methods for testing restrictions in the context of ML
estimation. Specifically we're interested in tests of the form H0 : g(θ) = 0 vs HA : g(θ) ≠ 0. The function g : R^k → R^j.
3.4.1. The Likelihood Ratio Test
Denote θ̂ = argmax_θ ℒ(θ|y), while θ̃ = argmax_θ ℒ(θ|y) s.t. g(θ) = 0. Clearly ℒ(θ̂|y) ≥ ℒ(θ̃|y). The test is based on how big the difference is. If it is large the hypothesis is brought into question.

LR = −2[ℒ(θ̃|y) − ℒ(θ̂|y)] = −2 log(r)    where r = L(θ̃|y)/L(θ̂|y).
It turns out that LR converges in distribution to χ2(j) as T →∞. The LR test requires
that you estimate the model under both the null and the alternative. Sometimes this isn’t
practical depending on the form of the function g.
In our linear model suppose we used the LR test. Having estimated a restricted and an unrestricted model we would have two parameter vectors β̃ and β̂, with residuals ẽ = y − Xβ̃ and ê = y − Xβ̂ and variance estimates σ̃² = ẽ'ẽ/T and σ̂² = ê'ê/T. Then LR is given by

LR = −2[ℒ(β̃) − ℒ(β̂)]
   = −2(−(T/2)log(σ̃²) − ẽ'ẽ/(2σ̃²) + (T/2)log(σ̂²) + ê'ê/(2σ̂²))
   = T log(ẽ'ẽ/ê'ê)

since the terms ẽ'ẽ/(2σ̃²) and ê'ê/(2σ̂²) both equal T/2 and cancel.
3.4.2. The Wald Test
The Wald test, rather than using the fact that the likelihoods should be close if the hypothesis is true, uses the fact that when we estimate the unrestricted model, if the restriction is true, then g(θ̂) should be close to zero anyway. Redefine the information matrix as

I_A(θ) = −E( (1/T) ∂²ℒ/∂θ∂θ' ).

Notice the 1/T part. Then

W = T g(θ̂)' [ (∂g/∂θ')|θ̂ I_A(θ̂)⁻¹ (∂g/∂θ')'|θ̂ ]⁻¹ g(θ̂).
What does this look like when the restriction is Rβ = r (i.e. when the restrictions are linear)? g(θ) = ( R  0 )θ − r, where R is j × k and the zero block is j × 1 (for σ²). As a result ∂g/∂θ' = ( R  0 ), and

I_A(θ̂) = ( σ̂⁻²T⁻¹X'X       0       )
          (      0      (1/2)σ̂⁻⁴  )

As a result

W = (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/σ̂².
Under some assumptions, W is χ²(j) asymptotically, but it relies only on estimation of the model under the alternative (i.e. the unrestricted model). Thus, it is an especially convenient test when estimation under the null is problematic.
3.4.3. The Lagrange Multiplier Test
Finally, we have a test which only requires estimates of the restricted model. It is based on the fact that if the restriction is valid, the slope of the likelihood function shouldn't be much different at the restricted estimator than at the unrestricted estimator (where it is exactly zero). Let's set up the restricted estimation in Lagrangean form:

Λ(θ) = ℒ(θ|y) − λ'g(θ)

where λ is a j × 1 vector of Lagrange multipliers.
The first order condition for this problem is

∂Λ(θ)/∂θ |θ̃ = ∂ℒ(θ)/∂θ |θ̃ − (∂g(θ)/∂θ')'|θ̃ λ̃ = 0.

The test is measured as

LM = (1/T) λ̃' (∂g(θ)/∂θ')|θ̃ I_A(θ̃)⁻¹ (∂g(θ)/∂θ')'|θ̃ λ̃.
As you might expect, given our results for the other tests, LM is asymptotically χ²(j). What is the form of the test when the restriction is Rβ = r? Since in this case σ² is unrestricted, the partial of g with respect to σ² is 0, and thus (through the first order condition) so is the partial of ℒ with respect to σ² at θ̃. However, the partial of ℒ with respect to β, evaluated at the restricted estimates, is (X'y − X'Xβ̃)/σ̃² = X'ẽ/σ̃². These two form the leading term in the formula. Also

I_A(θ̃) = ( σ̃⁻²T⁻¹X'X       0       )
          (      0      (1/2)σ̃⁻⁴  )

Therefore

LM = (1/T) ( ẽ'X/σ̃²  0 ) ( Tσ̃²(X'X)⁻¹    0   ) ( X'ẽ/σ̃² )
                          (     0        2σ̃⁴ ) (    0    )
   = ẽ'X(X'X)⁻¹X'ẽ/σ̃²

Under certain conditions, which happen to be satisfied by any linear model, although the tests have the same distributional form in large samples, in small samples we have the following ordering:

W ≥ LR ≥ LM.

Therefore, the Wald test will tend to reject more often than the likelihood ratio test, which will tend to reject more often than the Lagrange multiplier test.
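For a linear restriction all three statistics are cheap to compute, and the small-sample ordering can be checked directly. The sketch below (illustrative simulated data) also verifies the SSE-ratio forms W = T(ẽ'ẽ − ê'ê)/ê'ê and LM = T(ẽ'ẽ − ê'ê)/ẽ'ẽ, which are standard equivalences rather than results derived above:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.3, 0.0]) + rng.normal(size=T)

# Unrestricted fit, and a restricted fit imposing beta_2 = beta_3 = 0.
beta_u = np.linalg.solve(X.T @ X, X.T @ y)
e_u = y - X @ beta_u
e_r = y - y.mean()                       # restricted model: intercept only

# ML variance estimates divide by T.
s2_u, s2_r = e_u @ e_u / T, e_r @ e_r / T

R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
d = R @ beta_u
W = d @ np.linalg.solve(R @ np.linalg.inv(X.T @ X) @ R.T, d) / s2_u
LR = T * np.log((e_r @ e_r) / (e_u @ e_u))
LM = e_r @ X @ np.linalg.solve(X.T @ X, X.T @ e_r) / s2_r
```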
4. Generalized Least Squares
We now change one of the assumptions of the linear model. In particular we change
the assumption made about the covariance matrix of the error term. We now assume that E(εε') = σ²Ω, where Ω is a T × T, positive definite, symmetric matrix.
4.1. Ω Known
First consider the case where Ω is known to the econometrician. Let's reconsider the OLS estimator, β̂O = (X'X)⁻¹X'y. It's still unbiased, by the same proof. However, what about its variance?

V(β̂O) = E[(β̂O − β)(β̂O − β)']
       = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
       = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹.

This doesn't look so good.
Since Ω is positive definite we can decompose it as PΩP' = I, where P is a T × T nonsingular matrix, or Ω = P⁻¹P'⁻¹, or Ω⁻¹ = P'P. Since P is known, consider premultiplying the data by P so that we have

Py = PXβ + Pε
y* = X*β + ε*

where E(ε*ε*') = E(Pεε'P') = σ²PΩP' = σ²I. Now if we use OLS on this transformed model we'll get β̂G = (X*'X*)⁻¹X*'y*. Clearly β̂G is BLUE because the transformed model satisfies the conditions of the Gauss-Markov Theorem. It's clearly unbiased and its covariance matrix is clearly σ²(X*'X*)⁻¹.

Now going back to untransformed variables we see that

β̂G = (X*'X*)⁻¹X*'y*
   = (X'P'PX)⁻¹X'P'Py
   = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
and V(β̂G) = σ²(X*'X*)⁻¹ = σ²(X'P'PX)⁻¹ = σ²(X'Ω⁻¹X)⁻¹. How about proving that this means GLS is more efficient than OLS?

V(β̂O) − V(β̂G) = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹ − σ²(X'Ω⁻¹X)⁻¹
              = σ²[(X'X)⁻¹X' − (X'Ω⁻¹X)⁻¹X'Ω⁻¹]Ω[(X'X)⁻¹X' − (X'Ω⁻¹X)⁻¹X'Ω⁻¹]'
              = σ²AΩA'

which is p.s.d. because Ω is p.d.

Good assignment question: prove that σ̂²O is biased. What is the GLS estimator for the variance? Well, you go back to the transformed model:

σ̂²G = e*'e*/(T − k)
    = (y* − X*β̂G)'(y* − X*β̂G)/(T − k)
    = (y'P' − β̂G'X'P')(Py − PXβ̂G)/(T − k)
    = (y − Xβ̂G)'P'P(y − Xβ̂G)/(T − k)
    = eG'Ω⁻¹eG/(T − k)

Clearly, this estimator can be shown to be unbiased using the standard argument (the OLS proof) on the transformed model.
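A sketch of the transformation argument with a known Ω (a diagonal, purely heteroskedastic Ω is chosen purely for illustration): the direct GLS formula and OLS on the P-transformed data coincide:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 30
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta = np.array([1.0, 2.0])

# A known diagonal Omega: a different error variance for each observation.
omega_diag = np.linspace(0.5, 4.0, T)
Omega = np.diag(omega_diag)
eps = rng.normal(size=T) * np.sqrt(omega_diag)
y = X @ beta + eps

# Direct formula: beta_G = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y.
Omega_inv = np.diag(1.0 / omega_diag)
beta_g = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent route: transform by P with P'P = Omega^{-1}, then run OLS.
P = np.diag(1.0 / np.sqrt(omega_diag))
Xs, ys = P @ X, P @ y
beta_g2 = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```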
4.2. Ω Unknown
In general, how many unknown elements could Ω have? It is a T × T matrix but it is symmetric, so there are actually only T + (T − 1) + (T − 2) + . . . + 1 = T(T + 1)/2 distinct elements of Ω. Actually, since we've normalized by pulling out σ², there is one less than that. You cannot model Ω freely when it is unknown because you only have T observations to work with and you've already used up k of those degrees of freedom on β. Therefore, you need assumptions as to the form of Ω. Then using those assumptions we'll construct estimators for Ω, say Ω̂.
The GLS estimator in this case will become

β̂G = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y.
It is no longer possible to pass the expectations operator all the way through to y because now there is an additional random component due to Ω̂. At this point we must rely on asymptotic arguments.
Take for example the GLS estimator:

β̂G = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y
   = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹(Xβ + ε)
   = β + (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹ε

β̂G − β = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹ε

√T(β̂G − β) = (X'Ω̂⁻¹X/T)⁻¹ (1/√T) X'Ω̂⁻¹ε

To show that this thing will be asymptotically normal you need

1. a law of large numbers so that (X'Ω̂⁻¹X)/T converges in probability to some fixed positive definite matrix D. This is usually the same matrix to which (X'Ω⁻¹X)/T would converge deterministically if Ω were known, and

2. a central limit theorem so that (X'Ω̂⁻¹ε)/√T converges in distribution to the same random variable to which (X'Ω⁻¹ε)/√T would converge, say a N(0, σ²V), for some positive definite symmetric matrix V.

Generally we can get these two results if we obtain Ω̂ in such a way that it converges in probability to Ω.
4.3. Heteroskedasticity
Suppose that we have the same linear model, but E(εt²) = σt², where ln σt² = ln σ² + zt'α, and there is no covariance across the ε's. The typical procedure works like this:

1. Estimate the model by OLS, and obtain the OLS residuals e = y − Xβ̂O.

2. Notice that ln et² = ln σt² + ln(et²/σt²) = ln σ² + zt'α + vt, where vt = ln(et²/σt²). Now regress ln et² on [1 Z] and obtain OLS estimates α̂ and ĉ, where ĉ estimates ln σ².

3. Construct σ̂t² = exp(ĉ + zt'α̂), and a matrix Ω̂ with the exp(zt'α̂)'s on the diagonal.

4. Compute β̂G and σ̂²G using Ω̂ in place of Ω in the formulas.

Generally, these estimators will have the desired properties.
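The four steps can be sketched as follows (the data-generating process and all names are illustrative, not from the notes; note that the intercept of the ln e² regression also picks up E[ln(et²/σt²)] and so is a biased estimate of ln σ², but only α̂ is needed to build Ω̂):

```python
import numpy as np

rng = np.random.default_rng(9)
T = 400
X = np.column_stack([np.ones(T), rng.normal(size=T)])
z = rng.normal(size=T)                        # variance regressor
alpha_true = 0.8
sig2_t = np.exp(alpha_true * z)               # ln sigma_t^2 = 0 + alpha * z_t
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T) * np.sqrt(sig2_t)

# Step 1: OLS residuals.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

# Step 2: regress ln e_t^2 on [1 z].
Zmat = np.column_stack([np.ones(T), z])
coef = np.linalg.solve(Zmat.T @ Zmat, Zmat.T @ np.log(e**2))
alpha_hat = coef[1]

# Steps 3-4: build the diagonal of Omega_hat and compute feasible GLS.
omega_hat = np.exp(alpha_hat * z)
w = 1.0 / omega_hat                           # GLS weights
beta_fgls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))
```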
4.4. Autocorrelation
The classic case you’ve all seen is where the error term is a first-order autoregression.
I.e. we have a model of the form
yt = x0tβ + ut
ut = ρut−1 + t − 1 < ρ < 1
where xt satisfies all the assumptions we’ve made up until now and t is i.i.d. N(0, σ2.
Since the model is now y = Xβ+u we must of course be concerned with the properties of
u. It is clearly mean zero. However what is E(uu0)?
    u_t = ρu_{t−1} + ε_t
        = ρ²u_{t−2} + ε_t + ρε_{t−1}
        ...
        = Σ_{i=0}^∞ ρ^i ε_{t−i}

As a result of this we can compute the variances and covariances of the u_t's:

    σ_u² = V(u_t) = E(u_t²) = Σ_{i=0}^∞ ρ^{2i} σ_ε² = σ_ε²/(1 − ρ²)

Furthermore, we have E(u_t u_{t−1}) = ρσ_u², etc., with E(u_t u_{t−i}) = ρ^i σ_u². Therefore,

    Ω = [ 1         ρ         ρ²        ...  ρ^{T−1} ]
        [ ρ         1         ρ         ...  ρ^{T−2} ]
        [ ρ²        ρ         1         ...  ρ^{T−3} ]
        [ ...                                ...     ]
        [ ρ^{T−1}   ρ^{T−2}   ρ^{T−3}   ...  1       ]

To get a consistent estimator for Ω we just need a consistent estimate of one parameter:
ρ. We could use the following steps.

1. Estimate the model by OLS and obtain the residuals.
2. Compute a consistent estimator for ρ. A bunch of possibilities are available:

    r = Σ_{t=2}^T e_t e_{t−1} / Σ_{t=1}^T e_t²
    Theil's r* = [(T − k)/(T − 1)] r
    r** = 1 − d/2

where d is the Durbin-Watson statistic, d = Σ_{t=2}^T (e_t − e_{t−1})² / Σ_{t=1}^T e_t².
3. Then in the last step there are various methods used
a. Full GLS,
b. Full GLS dropping the first observation, or
c. Full ML, Beach and Mackinnon (1978).
The rationale for (3b) versus (3a) is that (3b) involves a simpler transformation of
the data. It turns out that in this case
    P = [ √(1−ρ²)   0    0   ...   0   0 ]
        [   −ρ      1    0   ...   0   0 ]
        [    0     −ρ    1   ...   0   0 ]
        [   ...                 ...      ]
        [    0      0    0   ...  −ρ   1 ]

As a result applying the transformation P to the data just means taking any variable z_t
and replacing it with z_t − ρz_{t−1}. However the first observation is treated differently.
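The transformation, and the choice between keeping and dropping the first observation (methods (3a) vs. (3b)), can be sketched as follows. This is an illustrative helper of my own, not code from the notes; applying it to both y and the columns of X and then running OLS gives the GLS estimates.

```python
import numpy as np

def ar1_transform(v, rho, drop_first=False):
    """Apply the P matrix for AR(1) errors to a vector or (T x k) array:
    row t becomes v_t - rho * v_{t-1}; row 1 becomes sqrt(1 - rho^2) * v_1
    (full GLS), or is dropped entirely (method (3b))."""
    v = np.asarray(v, dtype=float)
    out = v.copy()
    out[1:] = v[1:] - rho * v[:-1]
    if drop_first:
        return out[1:]
    out[0] = np.sqrt(1.0 - rho**2) * v[0]
    return out
```

In practice ρ would be replaced by one of the consistent estimates (r, r*, or r**) from step 2.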
To do full ML note that the likelihood of the u vector is a multivariate normal

    f_U(u) = (2πσ²)^{−T/2} |Ω|^{−1/2} exp[ −(1/2σ²) u'Ω⁻¹u ]

but since the Jacobian of the transformation between y and u is just I, we have

    f_Y(y) = (2πσ²)^{−T/2} |Ω|^{−1/2} exp[ −(1/2σ²) (y − Xβ)'Ω⁻¹(y − Xβ) ]

The likelihood is maximized by choosing β, σ² and ρ to maximize f or ln L. Similarly, one
could do ML for any heteroskedastic model. However, ML tends to be computationally
burdensome.
4.5. Testing for Heteroskedasticity
4.5.1. White’s Test
Take the residuals from OLS (the restricted model), and square them to get e2t .
Regress these on all unique combinations in x_t ⊗ x_t. If there are p regressors in all, then TR²
from that regression will have a χ²(p − 1) distribution under the null of homoskedasticity.
Effectively the hypotheses being compared here are H₀: σ_t² = σ², ∀t, vs. H₁: σ_t² ≠ σ_s² for
at least one s ≠ t. Unfortunately (?), this test tends to pick up all kinds of specification
error, and is not suggestive about the source of heteroskedasticity.
4.5.2. Goldfeld-Quandt (A Special Case)
Here the null is H₀: σ_t² = σ², ∀t while the alternative is H₁: σ_t² = σ₁², ∀t ∈ τ,
σ_t² = σ₂², ∀t ∉ τ, where τ is some subsample. Here the idea is to estimate 2 separate
OLS regressions on the two subsamples and compute separate estimates σ̂₁² and σ̂₂². Under
H₀ the ratio [(T₁ − k)σ̂₁²/σ²(T₁ − k)] / [(T₂ − k)σ̂₂²/σ²(T₂ − k)] = σ̂₁²/σ̂₂² will be distributed
F(T₁ − k, T₂ − k). This would be tested using a 2-sided rejection region since F could be
small or large under the alternative.
4.5.3. Breusch-Pagan
Here the model for the variance is σ_t² = exp(α₁ + z_t'α₂). The relevant hypotheses are
H₀: α₂ = 0 versus H₁: α₂ ≠ 0. Assuming that α₂ is an s × 1 vector, it turns out that if you
regress ln(e_t²/σ̂²) on z_t then (SST − SSE)/2 from this regression is asymptotically χ²(s).
As it turns out, so is TR² from a regression of the squared residuals on a constant and z_t.
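The TR² version of the test can be sketched as below. This is a minimal illustration of my own, assuming a single variance regressor; under H₀ the statistic is asymptotically χ²(s) with s the number of columns of Z.

```python
import numpy as np

def breusch_pagan_TR2(e, Z):
    """TR^2 version of the Breusch-Pagan test: regress the squared OLS
    residuals on a constant and the variance regressors Z (T x s), and
    return T times the R^2 of that regression."""
    T = len(e)
    W = np.column_stack([np.ones(T), Z])
    u = e**2
    g = np.linalg.solve(W.T @ W, W.T @ u)
    ssr = np.sum((u - W @ g)**2)
    sst = np.sum((u - u.mean())**2)
    return T * (1.0 - ssr / sst)
```

On homoskedastic residuals the statistic behaves like a χ²(s) draw; when the variance really moves with z it blows up.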
4.6. Testing for Autocorrelation
4.6.1. Durbin-Watson
The test statistic is given by

    d = Σ_{t=2}^T (e_t − e_{t−1})² / Σ_{t=1}^T e_t² = 2(1 − r) − (e₁² + e_T²)/Σ_{t=1}^T e_t²
Note that 0 < d < 4. Near 0 suggests positive serial correlation. Near 4 suggests negative
serial correlation. The distribution of d in any finite sample depends on the data matrix.
However, upper and lower bounds for relevant critical values have been obtained for dif-
ferent sample sizes and numbers of regressors. SHAZAM can actually compute the critical
value appropriate for a particular case numerically when you run your job. This test is
not valid when the X data matrix is not fixed, especially with lagged dependent variables.
4.6.2. Box-Pierce
This is a test for general autocorrelation of any variety in the data. I.e. it tests a
null of no autocorrelation versus an alternative of any autocorrelation. The test statistic
is Q = T Σ_{j=1}^L r_j², where r_j = Σ_{t=j+1}^T e_t e_{t−j} / Σ_{t=1}^T e_t². It turns out that in large samples
Q ∼ χ²(L).
4.6.3. Ljung-Box
Just like the Box-Pierce test except Q = T(T + 2) Σ_{j=1}^L r_j²/(T − j).
You might not see any reason to choose between (2) and (3). These tests have
different small sample properties. There is also the problem of how to choose L. It is an
arbitrary choice. Obviously it cannot be chosen too large, because with large L we are
using fewer and fewer observations to estimate r_j. On the other hand too small an L can miss
autocorrelation at more distant lags. The Ljung-Box appears to be closer to a χ²(L) in
small samples.
In general you do not need to use any of these tests. You can use any convenient LR,
Wald, or LM test if you estimate by ML. One of the easiest ways to do this is to estimate
under the null (restricted) model. This lets you use OLS to estimate β. Typically it is
straightforward to derive an appropriate LM test which lets you do some easy regression
involving the residuals, and then to take TR2 as the test statistic.
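Both Q statistics are easy to compute directly from the residuals. The sketch below is my own illustration of the formulas in the text, not code from the notes:

```python
import numpy as np

def q_stats(e, L):
    """Box-Pierce and Ljung-Box statistics for residuals e, summing over
    autocorrelations at lags 1..L; both are asymptotically chi^2(L) under
    the null of no autocorrelation."""
    e = np.asarray(e, dtype=float)
    T = len(e)
    denom = np.sum(e**2)
    r = np.array([np.sum(e[j:] * e[:-j]) / denom for j in range(1, L + 1)])
    q_bp = T * np.sum(r**2)
    q_lb = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, L + 1)))
    return q_bp, q_lb
```

For a strongly alternating series of length 8 (r₁ = −7/8) the two statistics differ noticeably, which is exactly the small-sample discrepancy discussed above.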
5. Systems of Equations
5.1. Seemingly Unrelated Regression Equations
Suppose we have a set of equations rather than a single equation. For example,
forgetting about simultaneous equations bias suppose you had a set of demand equations
for several goods.
    y_it = x_it'β_i + ε_it,   i = 1, ..., n,  t = 1, ..., T

or

    y₁ = X₁β₁ + ε₁
    y₂ = X₂β₂ + ε₂
    ...
    y_n = X_nβ_n + ε_n,

where the y_i are T × 1 vectors, each X_i is a T × k_i matrix, each β_i is a k_i × 1 vector and
the ε_i are T × 1. This can be rewritten as

    [y₁]   [X₁  0  ...  0  ] [β₁]   [ε₁]
    [y₂] = [0   X₂ ...  0  ] [β₂] + [ε₂]
    [⋮ ]   [⋮   ⋮   ⋱   ⋮  ] [⋮ ]   [⋮ ]
    [y_n]  [0   0  ...  X_n] [β_n]  [ε_n]

or

    y = Xβ + ε,

where y and ε are nT × 1 vectors, X is an nT × Σ_i k_i matrix, and β is a Σ_i k_i × 1 vector.

Assumptions about the error terms: E(ε_it) = 0, ∀i, t, and
    E(ε_it ε_js) = σ_ij if t = s, and 0 otherwise.

Let

    Σ = [σ₁₁   σ₁₂   ...  σ₁n ]
        [σ₂₁   σ₂₂   ...  σ₂n ]
        [⋮     ⋮     ⋱    ⋮   ]
        [σ_n1  σ_n2  ...  σ_nn]
Then

    E(εε') = [E(ε₁ε₁')   E(ε₁ε₂')   ...  E(ε₁ε_n') ]
             [E(ε₂ε₁')   E(ε₂ε₂')   ...  E(ε₂ε_n') ]
             [⋮                          ⋮         ]
             [E(ε_nε₁')  E(ε_nε₂')  ...  E(ε_nε_n')]

           = [σ₁₁I_T   σ₁₂I_T   ...  σ₁nI_T ]
             [σ₂₁I_T   σ₂₂I_T   ...  σ₂nI_T ]
             [⋮                      ⋮      ]
             [σ_n1I_T  σ_n2I_T  ...  σ_nnI_T]

           = Σ ⊗ I_T = Φ
If Σ is known, then Φ is known and we just have a huge GLS model: β̂_G =
(X'Φ⁻¹X)⁻¹X'Φ⁻¹y. If Σ is not known we do something like what we did in simple
GLS. Estimate the model by OLS equation by equation. Get the n residual series e_i.
Estimate σ̂_ij = e_i'e_j/T. Then construct Σ̂ = [σ̂_ij] and Φ̂ = Σ̂ ⊗ I_T. Then construct
β̂_G = (X'Φ̂⁻¹X)⁻¹X'Φ̂⁻¹y. Why do we do it all simultaneously rather than equation by
equation? Because you can exploit the information in the error structure to help predict
the errors in the other equations.
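The feasible SUR estimator just described can be sketched as follows; this is an illustrative implementation of my own, written for small systems (it builds Φ̂⁻¹ explicitly, which you would avoid for large nT):

```python
import numpy as np

def sur_fgls(ys, Xs):
    """Feasible GLS for the SUR system: OLS equation by equation,
    sigma_ij-hat = e_i'e_j / T, then
    beta_G = (X' Phi^-1 X)^-1 X' Phi^-1 y with Phi-hat = Sigma-hat kron I_T."""
    n, T = len(ys), len(ys[0])
    # equation-by-equation OLS residuals, stored as a T x n matrix
    E = np.column_stack([y - X @ np.linalg.solve(X.T @ X, X.T @ y)
                         for y, X in zip(ys, Xs)])
    Sigma = E.T @ E / T
    # stack the system into the big block-diagonal X
    y = np.concatenate(ys)
    Xbig = np.zeros((n * T, sum(X.shape[1] for X in Xs)))
    col = 0
    for i, X in enumerate(Xs):
        Xbig[i * T:(i + 1) * T, col:col + X.shape[1]] = X
        col += X.shape[1]
    Phi_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
    return np.linalg.solve(Xbig.T @ Phi_inv @ Xbig, Xbig.T @ Phi_inv @ y)
```

A nice check is case 2 below: when every equation has the same regressor matrix Z, the SUR/GLS answer collapses to equation-by-equation OLS no matter what Σ̂ is.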
Some rules about Kronecker products, for A n × n, B T × T, and conformable C and D:
1. (A ⊗ B)(C ⊗ D) = AC ⊗ BD
2. A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
3. A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C)
4. (B + C) ⊗ A = (B ⊗ A) + (C ⊗ A)
5. (A ⊗ B)' = A' ⊗ B'
6. (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹
When is OLS equation by equation the same as GLS?

1. When Σ is a diagonal matrix. Notice that in this case the matrix Φ is a block diagonal
matrix. As a result

    Φ⁻¹ = [σ₁₁⁻¹I_T   0          ...  0        ]
          [0          σ₂₂⁻¹I_T   ...  0        ]
          [⋮                     ⋱    ⋮        ]
          [0          0          ...  σ_nn⁻¹I_T]
Therefore

    X'Φ⁻¹X = [σ₁₁⁻¹X₁'X₁   0            ...  0           ]
             [0            σ₂₂⁻¹X₂'X₂   ...  0           ]
             [⋮                         ⋱    ⋮           ]
             [0            0            ...  σ_nn⁻¹X_n'X_n]

and

    (X'Φ⁻¹X)⁻¹ = [σ₁₁(X₁'X₁)⁻¹   0              ...  0             ]
                 [0              σ₂₂(X₂'X₂)⁻¹   ...  0             ]
                 [⋮                             ⋱    ⋮             ]
                 [0              0              ...  σ_nn(X_n'X_n)⁻¹]

and

    X'Φ⁻¹y = [σ₁₁⁻¹X₁'y₁ ]
             [σ₂₂⁻¹X₂'y₂ ]
             [⋮          ]
             [σ_nn⁻¹X_n'y_n]

so that

    β̂_G = [(X₁'X₁)⁻¹X₁'y₁ ]
           [(X₂'X₂)⁻¹X₂'y₂ ]
           [⋮              ]
           [(X_n'X_n)⁻¹X_n'y_n]

2. X_i = Z, ∀i. I.e. the regressors are the same in every equation, so X = I_n ⊗ Z. Then

    β̂_G = [(I_n ⊗ Z)'(Σ ⊗ I_T)⁻¹(I_n ⊗ Z)]⁻¹(I_n ⊗ Z)'(Σ ⊗ I_T)⁻¹y
         = [(I_n ⊗ Z')(Σ⁻¹ ⊗ I_T)(I_n ⊗ Z)]⁻¹(I_n ⊗ Z')(Σ⁻¹ ⊗ I_T)y
         = [(Σ⁻¹ ⊗ Z')(I_n ⊗ Z)]⁻¹(Σ⁻¹ ⊗ Z')y
         = (Σ⁻¹ ⊗ Z'Z)⁻¹(Σ⁻¹ ⊗ Z')y
         = [Σ ⊗ (Z'Z)⁻¹](Σ⁻¹ ⊗ Z')y
         = [I_n ⊗ (Z'Z)⁻¹Z']y

which is just OLS equation by equation.

3. Kruskal's Theorem. This applies to standard GLS problems, but to this one also. If there
exists a nonsingular G (k × k) such that ΩX = XG, then β̂_G = β̂_O. Notice that ΩX = XG
implies Ω⁻¹X = XG⁻¹, and hence X'Ω⁻¹ = G⁻¹'X'. Then

    β̂_G = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
         = (G⁻¹'X'X)⁻¹G⁻¹'X'y
         = (X'X)⁻¹G'G⁻¹'X'y
         = (X'X)⁻¹X'y
So in the context of SURE we're looking for a G of dimension nk × nk such that (Σ ⊗ I_T)X = XG.
5.2. Panel Data
In this case we have observations which are sorted both by time t and by individual i.
However, the coefficients do not vary across the individuals, except in very specific ways.
yit = x0itβ + z0iγ + αi + it.
The αi are unobservable effects which vary by individual. Therefore, we can think of them
as unobservable person specific regression coefficients. However, β and γ do not vary by
individual.
We will assume that E(ε_it) = 0, E(ε_it ε_js) = 0 for i ≠ j or t ≠ s, and E(ε_it²) = σ², ∀i, t. There are two ways to approach the estimation:
a. Fixed Effects - treat the αi as unknown constants, and put in dummy variables, di,
which are 1 in observations involving person i and 0 otherwise.
b. Random effects - assume that the differences in α across individuals are random, and
make assumptions about the distribution they come from.
5.2.1. Fixed Effects
The model can be written in stacked form (stacking the T observations for individual 1,
then those for individual 2, and so on) as

    y = Xβ + Zγ + Dα + ε

where y is the nT × 1 vector of the y_it, X stacks the x_it', Z stacks the z_i' (each z_i'
repeated T times), D is the nT × n matrix of individual dummies d_i, α = (α₁, α₂, ..., α_n)',
and ε stacks the ε_it.
If we could estimate the whole thing in one big block it would be BLUE. Least squares
involves minimizing S = Σ_i Σ_t e_it² = Σ_i Σ_t (y_it − x_it'β − z_i'γ − α_i)². Differentiate
with respect to α_j:

    ∂S/∂α_j = −2 Σ_t (y_jt − x_jt'β − z_j'γ − α_j) = 0.
This is easily solved for α_j:

    α_j = (1/T)(Σ_t y_jt − Σ_t x_jt'β − Σ_t z_j'γ) = ȳ_j − x̄_j'β − z_j'γ

When I substitute this back into S I get S = Σ_i Σ_t [y_it − ȳ_i − (x_it − x̄_i)'β]². As a result,
γ is not identified because z_j is time invariant. Therefore, in fixed effects stuff you have to
drop any variables which are individual specific and don't vary in the time dimension. Just
do least squares after subtracting the means across individuals from y and x as suggested
in the solution for α_j above.
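The within (demeaning) estimator just described can be sketched as below. This is a minimal illustration of my own; the recovery of the α_j follows the formula α_j = ȳ_j − x̄_j'β.

```python
import numpy as np

def within_estimator(y, X, ids):
    """Fixed-effects (within) estimator: subtract individual means from
    y and X, then run OLS on the demeaned data. Time-invariant regressors
    are wiped out by the demeaning and must be excluded from X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    yd, Xd = y.copy(), X.copy()
    for i in np.unique(ids):
        m = ids == i
        yd[m] -= y[m].mean()
        Xd[m] -= X[m].mean(axis=0)
    beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)
    # recover the fixed effects: alpha_i = ybar_i - xbar_i' beta
    alphas = {i: y[ids == i].mean() - X[ids == i].mean(axis=0) @ beta
              for i in np.unique(ids)}
    return beta, alphas
```

On noiseless data generated with individual intercepts (1, 2, 3) and slope 2, the estimator recovers both exactly.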
A simple test for whether fixed effects exist would be to test whether all the α's are
the same. This is n − 1 restrictions. Just run the restricted regression and the unrestricted
regression and do an F-test of the restrictions.

Summarizing fixed effects: individual specific effects are constants, which implies you can't
estimate γ. Furthermore, the assumed lack of covariance across individuals means you
can't exploit it to improve estimates of β.
5.2.2. Random Effects
In this case the α's are random variables. The model is y_it = x_it'β + z_i'γ + α_i + ε_it =
α + x_it'β + z_i'γ + u_it, where u_it = (α_i − α) + ε_it. The same assumptions are made about
ε_it. In addition ε_it is independent of α_i. The distribution of α_i is such that E(α_i) = α,
∀i, E(α_i − α)² = σ_α², and the α_i are i.i.d. across individuals. Therefore, E(u_it) = 0,
E(u_it²) = σ_α² + σ², E(u_it u_is) = σ_α² for t ≠ s, and E(u_it u_js) = 0 for i ≠ j.

If we write the model as a whole we have y = α1 + Xβ + Zγ + u = Wθ + u. Do the
stacking by individual. E(uu') = I_n ⊗ Σ = Ω, where

    Σ = [σ_α² + σ²   σ_α²        ...]
        [σ_α²        σ_α² + σ²   ...]
        [⋮                       ⋱  ]
      = σ²I_T + σ_α² 1 1'

with 1 a T × 1 vector of ones. To estimate the model you just do a big GLS.
6. Nonlinear Models
What if the model we have is nonlinear of the form yt = f(xt, β) + t. We will have
two ways of estimating models like this. The first method is called nonlinear least squares,
while the other method is maximum likelihood.
6.1. Nonlinear Least Squares
The criterion function as before is the sum of squared residuals,

    S(β) = Σ_t e_t² = Σ_t [y_t − f(x_t, β)]² = [y − f(X, β)]'[y − f(X, β)].

We will make the usual assumptions about X and ε. If we minimize the sum of squared
residuals by choosing β the first order condition is

    ∂S(β)/∂β = −2 (∂f(X, β)/∂β)'[y − f(X, β)] = 0.
Do a first-order Taylor series expansion of f(X, β) around β = β₁:

    f(X, β) ≈ f(X, β₁) + (∂f(X, β₁)/∂β)(β − β₁)

which implies that

    y ≈ f(X, β₁) + (∂f(X, β₁)/∂β)(β − β₁) + ε
    y − f(X, β₁) + (∂f(X, β₁)/∂β)β₁ ≈ (∂f(X, β₁)/∂β)β + ε
    y* ≈ (∂f(X, β₁)/∂β)β + ε

Notice that for any β₁, ∂f(X, β₁)/∂β and y* are data. Therefore starting from some arbitrary
β₁, create ∂f(X, β₁)/∂β and y*. Then run a regression and get a new estimate

    β₂ = [(∂f(X, β₁)/∂β)'(∂f(X, β₁)/∂β)]⁻¹(∂f(X, β₁)/∂β)'y*
       = [(∂f(X, β₁)/∂β)'(∂f(X, β₁)/∂β)]⁻¹(∂f(X, β₁)/∂β)'[y − f(X, β₁)] + β₁
In general you keep going, following the rule

    β_{n+1} = β_n + [(∂f(X, β_n)/∂β)'(∂f(X, β_n)/∂β)]⁻¹(∂f(X, β_n)/∂β)'[y − f(X, β_n)]
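The iteration above can be sketched as follows. This is an illustrative implementation of my own; the user supplies the model f and its T × k derivative matrix, and the example model (an exponential regression, fit here to noiseless data) is just an assumption for demonstration.

```python
import numpy as np

def gauss_newton(f, jac, y, beta0, tol=1e-10, max_iter=100):
    """Gauss-Newton iterations for nonlinear least squares:
    beta_{n+1} = beta_n + (F'F)^{-1} F'(y - f(beta_n)),
    where F is the T x k matrix df/dbeta evaluated at beta_n."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        F = jac(beta)
        step = np.linalg.solve(F.T @ F, F.T @ (y - f(beta)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example model: f(x, beta) = beta_1 * exp(beta_2 * x)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)          # noiseless data from beta = (2, -1.5)
f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
bhat = gauss_newton(f, jac, y, [1.5, -1.0])
```

Because the data are noiseless, the algorithm converges to the generating parameters.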
Suppose that at some point β_{n+1} = β_n, i.e. the algorithm converges. What does that
imply? It must be the case that

    (∂f(X, β_n)/∂β)'[y − f(X, β_n)] = 0

which simply means the first-order condition for minimization is satisfied. Can we be sure
it is a minimum? It can't be a maximum because we'll assume that ∂f(X, β)/∂β has full column
rank, which means [(∂f/∂β)'(∂f/∂β)]⁻¹ is positive definite. This means you are always going
downhill. On the other hand, you can't show that it's not just a local minimum. Under regularity
conditions the asymptotic variance-covariance matrix of β̂_N is consistently estimated by

    σ̂² [(∂f(X, β̂)/∂β)'(∂f(X, β̂)/∂β)]⁻¹

where σ̂² = S(β̂_N)/(T − k). What we have just described is the Gauss-Newton algorithm.

Next we will examine the Newton-Raphson algorithm. Here we do a second-order Taylor
series expansion of S(β) around β = β₁:
    S(β) ≈ S(β₁) + (∂S(β₁)/∂β)'(β − β₁) + (1/2)(β − β₁)'(∂²S(β₁)/∂β∂β')(β − β₁).

Suppose we tried to minimize with respect to β:

    ∂S(β)/∂β ≈ ∂S(β₁)/∂β + (∂²S(β₁)/∂β∂β')(β − β₁) = 0
    β − β₁ ≈ −[∂²S(β₁)/∂β∂β']⁻¹ ∂S(β₁)/∂β
    β₂ = β₁ − H(β₁)⁻¹ ∂S(β₁)/∂β

or in general

    β_{n+1} = β_n − H(β_n)⁻¹ ∂S(β_n)/∂β.
If the algorithm ever converges what will be true? Clearly, ∂S(β_n)/∂β = 0, which solves the
first order condition. Is it a minimum? It depends on whether H is positive definite. If it
is positive definite everywhere then you could guarantee it, but usually it's only locally
a minimum because H is non-p.d. in places. There are some methods for always ensuring
that at each step you go downhill. These are usually of the form

    β_{n+1} = β_n − t_n H(β_n)⁻¹ ∂S(β_n)/∂β

where at each n, t_n is varied until S(β_{n+1}) < S(β_n), often by trying the sequence
1, 0.5, 0.25, 0.125, .... The variance of β̂_N is consistently estimated by 2σ̂²H(β̂_N)⁻¹
where σ̂² = S(β̂_N)/(T − k).
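The step-halving safeguard can be sketched as follows; this is an illustrative helper of my own with a generic objective, gradient, and Hessian supplied by the user.

```python
import numpy as np

def newton_with_halving(S, grad, hess, b0, max_iter=50):
    """Newton-Raphson with the 1, 0.5, 0.25, ... step-length rule: shrink
    t_n until S(beta_{n+1}) < S(beta_n), so every accepted step goes downhill."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        g = grad(b)
        if np.max(np.abs(g)) < 1e-12:
            break                         # first-order condition satisfied
        direction = np.linalg.solve(hess(b), g)
        t = 1.0
        while S(b - t * direction) >= S(b) and t > 1e-8:
            t *= 0.5
        b = b - t * direction
    return b
```

On a positive definite quadratic the full Newton step (t = 1) is always accepted and the minimizer is reached in one iteration.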
What is the relationship between Gauss-Newton and Newton-Raphson? Notice that

    H(β) = ∂²S/∂β∂β' = (∂/∂β')( −2 Σ_{t=1}^T [y_t − f(x_t, β)] ∂f(x_t, β)/∂β )
         = 2 Σ_{t=1}^T (∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)' − 2 Σ_{t=1}^T [y_t − f(x_t, β)] ∂²f(x_t, β)/∂β∂β'

Therefore,

    E[H(β)] = 2 (∂f(X, β)/∂β)'(∂f(X, β)/∂β),

which shows you that the methods are really quite similar. In general there are a multitude
of methods of the form

    β_{n+1} = β_n − t_n P_n ∂S(β_n)/∂β.

Note that even the Gauss-Newton method fits into this form. See Judge Appendix B for
a description of all the methods.
6.2. Maximum Likelihood
Since the model is of the form y = f(X, β) + ε we can simply take the likelihood for
the error terms and replace them by y − f(X, β). Therefore,

    ln L = −(T/2) ln 2π − (T/2) ln σ² − (1/2σ²) S(β).
As usual we'll get σ̂² = S(β̂)/T, but the formula for β̂ is not a closed form. Plug the
solution for σ̂² into ln L and remaximize:

    ln L* = −(T/2) ln 2π − (T/2)(1 + ln T⁻¹) − (T/2) ln S(β).
What’s clear is that maximizing the likelihood is equivalent to minimizing the sum of
squares. There are three methods for finding the MLE for β. Going back to the original
likelihood
6.2.1. Newton-Raphson
In this case you really just do the same thing as you do in NLLS. Notice that

    ∂ln L/∂β = −(1/2σ²) ∂S(β)/∂β
    ∂²ln L/∂β∂β' = −(1/2σ²) ∂²S(β)/∂β∂β'

Therefore,

    β_{n+1} = β_n − [∂²ln L(β_n)/∂β∂β']⁻¹ ∂ln L(β_n)/∂β
            = β_n − [∂²S(β_n)/∂β∂β']⁻¹ ∂S(β_n)/∂β

Here the estimator for the covariance matrix is given by

    2σ̂² [∂²S(β̂)/∂β∂β']⁻¹

6.2.2. Method of Scoring
This method uses the inverse of the expectation of the Hessian instead of the inverse
of the Hessian to weight the first derivatives:

    β_{n+1} = β_n − [E ∂²ln L(β_n)/∂β∂β']⁻¹ ∂ln L(β_n)/∂β
            = β_n − [−E (1/2σ²) ∂²S(β_n)/∂β∂β']⁻¹ (−(1/2σ²)) ∂S(β_n)/∂β
            = β_n − (1/2)[(∂f(X, β_n)/∂β)'(∂f(X, β_n)/∂β)]⁻¹ ∂S(β_n)/∂β

Here the estimator for the covariance matrix is given by

    σ̂² [(∂f(X, β̂)/∂β)'(∂f(X, β̂)/∂β)]⁻¹
6.2.3. Berndt, Hall, Hall and Hausman (BHHH)

This is a method with a seemingly arbitrary choice of weighting matrix but it does
relate to the method of scoring. Define

    l_t = −(1/2) ln 2π − (1/2) ln σ² − [y_t − f(x_t, β)]²/(2σ²)

so that Σ_t l_t = ln L. Here

    β_{n+1} = β_n − [−Σ_t (∂l_t/∂β)(∂l_t/∂β')]⁻¹|_{β_n,σ_n} ∂ln L(β_n)/∂β
            = β_n − (1/2σ_n²) [Σ_t ([y_t − f(x_t, β)]²/σ⁴)(∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)']⁻¹|_{β_n,σ_n} ∂S(β_n)/∂β
            = β_n − (σ_n²/2) [Σ_t [y_t − f(x_t, β)]² (∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)']⁻¹|_{β_n,σ_n} ∂S(β_n)/∂β

Notice that

    E[Σ_t (∂l_t/∂β)(∂l_t/∂β')] = (1/σ²) Σ_t (∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)'

which looks just like the method of scoring. Therefore, the estimate of the variance-covariance
matrix is just

    σ̂⁴ [Σ_t [y_t − f(x_t, β̂)]² (∂f(x_t, β̂)/∂β)(∂f(x_t, β̂)/∂β)']⁻¹
7. Stochastic Regressors
7.1. Independent Regressors
We now change one of the previously vital assumptions. Up until now we have been
assuming that the matrix X is fixed. Now we will instead assume that it is stochastic, but
also that X is independent of ε. This implies that any function of X is independent of any
function of ε. In the plain linear model we have y = Xβ + ε. Therefore, the OLS estimator
is β̂ = (X'X)⁻¹X'y = β + (X'X)⁻¹X'ε. Therefore, E(β̂) = β + E[(X'X)⁻¹X']E(ε) = β.
Furthermore, we have

    E(e'e) = E(ε'Mε)
           = E[tr(ε'Mε)]
           = E[tr(Mεε')]
           = tr[E(M)E(εε')]
           = σ² tr(EM) = σ² E tr(M)
           = σ²(T − k).

Therefore, E(σ̂²) = σ². And

    V(β̂) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
          = σ² E(X'X)⁻¹.

One can estimate this by σ̂²(X'X)⁻¹, which is also unbiased.
7.2. Nonindependent Regressors (of a particular form)
In this case we imagine a world where y = Xβ + ε but our previous assumptions don't
hold. Now we only assume that ε_t is i.i.d. mean zero, variance σ², with plim ε'ε/T = σ²,
plim X'X/T = Σ_xx and plim X'ε/T = 0. Under these circumstances

    plim β̂ = plim (X'X)⁻¹X'y
            = β + plim (X'X/T)⁻¹(X'ε/T)
            = β + plim (X'X/T)⁻¹ plim (X'ε/T)
            = β + Σ_xx⁻¹ · 0 = β.

You can check the textbook for proofs that the estimates of the variances converge in
probability to the right objects. One last thing is the asymptotic distribution. We have

    √T(β̂ − β) = (X'X/T)⁻¹ (X'ε/√T).

Therefore, by the quick and dirty asymptotic method we have an estimator which will be
asymptotically normal and the vcv matrix will be σ²Σ_xx⁻¹.
7.3. Instrumental Variables
7.3.1. Errors in Variables
Let us discuss all this in a univariate setting. Suppose the model is y_t = x_tβ + ε_t.
However, you do not observe x_t but a measurement-error-ridden version of it called z_t =
x_t + u_t. Therefore, we can write y_t = z_tβ + ε_t − u_tβ = z_tβ + v_t. We'll assume the ε's and
the u's are i.i.d. and independent of each other and of x_t. If the econometrician regresses y_t
on z_t he gets

    β̂ = Σz_t y_t / Σz_t² = Σz_t(z_tβ + v_t) / Σz_t² = β + Σz_t v_t / Σz_t²

    β̂ − β = (T⁻¹ Σz_t v_t) / (T⁻¹ Σz_t²)

Now T⁻¹Σz_t v_t = T⁻¹Σ(x_t + u_t)(ε_t − u_tβ) →p −βσ_u². Also, T⁻¹Σz_t² = T⁻¹Σ(x_t + u_t)² →p σ_x² +
σ_u². Therefore, β̂ − β →p −σ_u²β/(σ_x² + σ_u²). This means that β̂ is not consistent except when
β = 0. As the signal-to-noise ratio σ_x²/σ_u² gets small, plim β̂ → 0, while as it gets
large, plim β̂ → β. The basic problem arises because the regressor z_t is not orthogonal to
the error term v_t.
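The attenuation result is easy to see in a simulation. This is an illustrative sketch of my own; with equal signal and noise variances the plim of the OLS slope is βσ_x²/(σ_x² + σ_u²) = β/2.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200_000
beta = 1.0
sig2_x, sig2_u = 1.0, 1.0                 # equal signal and noise variance
x = rng.normal(scale=np.sqrt(sig2_x), size=T)
z = x + rng.normal(scale=np.sqrt(sig2_u), size=T)   # mismeasured regressor
y = x * beta + rng.normal(size=T)

b_hat = np.sum(z * y) / np.sum(z**2)      # OLS of y on the noisy z
plim = beta * sig2_x / (sig2_x + sig2_u)  # = 0.5 here
```

With T this large the estimate sits essentially on top of its probability limit, well below the true β.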
7.3.2. Instrumental Variables
Now we will consider general models of the form y_t = x_tβ + ε_t where plim T⁻¹Σx_tε_t =
σ_xε ≠ 0. This will imply that OLS is inconsistent.

What is instrumental variables estimation? The trick is to find some variable z_t s.t.
(1) T⁻¹Σz_tε_t →p 0,
(2) T⁻¹Σz_tx_t →p σ_zx ≠ 0.
The IV estimator is generated by the following formula:

    β̂_IV = Σz_t y_t / Σz_t x_t.

As a result, if we substitute in the definition of y_t we get

    β̂_IV = Σz_t(x_tβ + ε_t) / Σz_t x_t = β + Σz_tε_t / Σz_t x_t

    β̂_IV − β = (T⁻¹Σz_tε_t) / (T⁻¹Σz_t x_t) →p 0.

Also as far as distribution theory is concerned we'll get

    √T(β̂_IV − β) = (T^{−1/2}Σz_tε_t) / (T⁻¹Σz_t x_t).

We can for the purposes of this course assert that T^{−1/2}Σz_tε_t →d N(0, σ²σ_z²) while
T⁻¹Σz_t x_t →p σ_zx. Therefore,

    √T(β̂_IV − β) →d N(0, σ²σ_z²/σ_zx²).
Notice that the variance is equal to

    σ²σ_z²/σ_zx² = (σ²/σ_x²)(σ_x²σ_z²/σ_zx²) = V_OLS/ρ_zx²

Obviously, OLS is just IV with X as the instrument. We want a Z which is as highly
correlated with X as possible without being correlated with ε.
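The univariate IV formula can be illustrated with a small simulation of my own design (the data-generating process is an assumption chosen so that cov(x, ε) = 1, making the OLS plim β + 1/3 while IV remains consistent):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100_000
beta = 2.0
z = rng.normal(size=T)                  # instrument: correlated with x,
eps = rng.normal(size=T)                # uncorrelated with the error
x = z + rng.normal(size=T) + eps        # regressor correlated with eps
y = x * beta + eps

b_ols = np.sum(x * y) / np.sum(x**2)    # inconsistent: plim = beta + 1/3
b_iv = np.sum(z * y) / np.sum(z * x)    # consistent
```

The contrast between the two estimates is exactly the orthogonality failure discussed in the errors-in-variables example.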
In a multivariate setting the story is really no different. We have the model y =
Xβ + ε. We define Σ_xx = plim T⁻¹X'X and Σ_xε = plim T⁻¹X'ε ≠ 0. Suppose we have
the replacements for the X matrix, called Z (T × k). We'll define Σ_zx = plim T⁻¹Z'X and
Σ_zz = plim T⁻¹Z'Z, which are assumed to have full rank. Furthermore, Σ_zε = plim T⁻¹Z'ε = 0.
Also assume enough regularity conditions such that T^{−1/2}Z'ε →d N(0, σ²Σ_zz). Then

    β̂_IV = (Z'X)⁻¹Z'y
          = (Z'X)⁻¹(Z'Xβ + Z'ε)
          = β + (Z'X)⁻¹Z'ε

    β̂_IV − β = (Z'X)⁻¹Z'ε

Clearly plim(β̂_IV − β) = 0. Furthermore,

    √T(β̂_IV − β) →d N(0, σ²Σ_zx⁻¹Σ_zzΣ_xz⁻¹)
One natural question that arises is how to choose the instrument which replaces each
column in the matrix X. Suppose I have l variables (called W's) which I can reasonably
assume to be uncorrelated with the ε's and which are correlated with the X's. If l >
k, then clearly we have more candidates than we have room for. How can we reduce
the number from l to k in the optimal way? One way to think about this is to look for
the k linear combinations of the l variables which minimize the size of the variance-
covariance matrix of β̂_IV. I.e. the goal is to construct a matrix Z (T × k) = W (T × l) A (l × k)
in such a way as to minimize the variance-covariance matrix. Clearly this requires choosing
A optimally. Now it's clear that Σ_zz = plim T⁻¹Z'Z = plim T⁻¹A'W'WA = A'Σ_ww A. Also,
Σ_zx = plim T⁻¹Z'X = plim T⁻¹A'W'X = A'Σ_wx. Therefore, the variance of the IV estimator
is

    V = σ²(A'Σ_wx)⁻¹A'Σ_ww A(Σ_xw A)⁻¹,

which can be minimized by choosing A* = Σ_ww⁻¹Σ_wx, so that the variance is

    V* = σ²(Σ_xwΣ_ww⁻¹Σ_wx)⁻¹.

To prove that this is smallest in a cheesy way, consider the difference between its inverse
and the inverse of the variance for any other choice of A:

    V*⁻¹ − V⁻¹ ∝ Σ_xwΣ_ww⁻¹Σ_wx − Σ_xzΣ_zz⁻¹Σ_zx = plim T⁻¹ X'[W(W'W)⁻¹W' − Z(Z'Z)⁻¹Z']X

which is positive semidefinite as the matrix in the middle is idempotent. To check this
notice that

    [W(W'W)⁻¹W' − Z(Z'Z)⁻¹Z']² = W(W'W)⁻¹W' − W(W'W)⁻¹W'Z(Z'Z)⁻¹Z'
                                   − Z(Z'Z)⁻¹Z'W(W'W)⁻¹W' + Z(Z'Z)⁻¹Z'
                                 = W(W'W)⁻¹W' − WA(Z'Z)⁻¹Z' − Z(Z'Z)⁻¹A'W' + Z(Z'Z)⁻¹Z'
                                 = W(W'W)⁻¹W' − Z(Z'Z)⁻¹Z'.
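With A* = Σ_ww⁻¹Σ_wx, the sample counterpart of the optimal instrument is Z = W(W'W)⁻¹W'X, i.e. the fitted values from regressing each column of X on W — which is two-stage least squares. A minimal sketch of my own:

```python
import numpy as np

def two_sls(y, X, W):
    """IV with the optimal linear combination of the l instruments in W:
    Z = W (W'W)^{-1} W'X (first-stage fitted values), then
    beta = (Z'X)^{-1} Z'y."""
    Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
```

Two algebraic checks: with W = X the estimator is OLS, and in the exactly identified case (l = k) it collapses to the simple IV formula (W'X)⁻¹W'y.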
8. Time Series
We'll be looking at univariate time series. We're looking at a sequence {y_t}, t = 1, ..., T. We're
going to look at autoregressive moving average (ARMA) processes. These are processes
which allow us to characterize the distribution of y_t in terms of its own past. First we need
to define a certain type of stationarity.

A stochastic process is stationary if

    f(y_t, y_{t−1}, ..., y_{t−k}) = f(y_{t+s}, y_{t+s−1}, ..., y_{t+s−k})

for all s, k.

A stochastic process {y_t} is covariance stationary if E(y_t) = µ, Var(y_t) = γ₀ <
∞ and E[(y_t − µ)(y_{t−s} − µ)] = γ_s, ∀t.

Let's look at different types of time series processes.
8.1. Time Series Processes
8.1.1. White Noise
We will generically represent a white noise process with the symbol ε_t. It has the
following properties:

    E(ε_t) = 0
    γ_s = σ² if s = 0, and 0 if s ≠ 0
8.1.2. Autoregressive 1st Order - AR(1)
    y_t = φy_{t−1} + ε_t
        = φ(φy_{t−2} + ε_{t−1}) + ε_t
        = φ²y_{t−2} + ε_t + φε_{t−1}
        = φ^T y_{t−T} + Σ_{j=0}^{T−1} φ^j ε_{t−j}
If y_{t−T} were fixed then E(y_t) = φ^T y_{t−T}, which implies that if φ ≠ 1 then E(y_t) ≠ E(y_s)
for t ≠ s. Suppose then that y_{t−T} is not fixed for any T. If |φ| < 1 we can write

    y_t = Σ_{j=0}^∞ φ^j ε_{t−j}

and thus Ey_t = Ey_s = 0, ∀s, t.

    Var(y_t) = E[Σ_{j=0}^∞ φ^j ε_{t−j}]²
             = E[Σ_{j=0}^∞ φ^{2j} ε²_{t−j}]
             = (1 + φ² + φ⁴ + ...)σ²
             = σ²/(1 − φ²) = γ₀

    γ_s = E(y_t y_{t−s})
        = E[(φ^s y_{t−s} + ε_t + φε_{t−1} + ... + φ^{s−1}ε_{t−s+1}) y_{t−s}]
        = φ^s γ₀
We're going to use lag operator notation: Ly_t = y_{t−1}, L^i y_t = y_{t−i}.

    y_t = φLy_t + ε_t
    (1 − φL)y_t = ε_t
    y_t = (1 − φL)⁻¹ε_t
    y_t = (1 + φL + φ²L² + φ³L³ + ...)ε_t

The polynomial (1 − φL)⁻¹ exists iff ∀|Z| ≤ 1, (1 − φZ) ≠ 0, i.e. all the roots of
the polynomial lie outside the unit circle. Furthermore, an AR(1) process is covariance
stationary iff ∀|Z| ≤ 1, 1 − φZ ≠ 0.
8.1.3. Autoregressive pth Order - AR(p)
    y_t = φ₁y_{t−1} + φ₂y_{t−2} + ... + φ_p y_{t−p} + ε_t
    φ(L)y_t = ε_t

The polynomial φ(L)⁻¹ exists iff ∀|Z| ≤ 1, φ(Z) ≠ 0, i.e. all the roots of the polynomial
lie outside the unit circle. Furthermore, an AR(p) process is covariance stationary iff
∀|Z| ≤ 1, φ(Z) ≠ 0.
8.1.4. Moving Average 1st Order - MA(1)
    y_t = ε_t + θε_{t−1}
    Ey_t = 0
    Var(y_t) = Ey_t² = E(ε_t² + 2θε_tε_{t−1} + θ²ε²_{t−1}) = σ²(1 + θ²) = γ₀
    γ₁ = E[(ε_t + θε_{t−1})(ε_{t−1} + θε_{t−2})] = θσ²
    γ_s = 0, s > 1

An MA(1) process is always covariance stationary. On the other hand the value of
θ will matter for invertibility. We want to write (1 + θL)⁻¹y_t = ε_t. This MA
process is invertible iff ∀|Z| ≤ 1, 1 + θZ ≠ 0.
8.1.5. Moving Average qth Order - MA(q)
    y_t = ε_t + θ₁ε_{t−1} + ... + θ_qε_{t−q} = θ(L)ε_t
    Ey_t = 0
    Var(y_t) = Ey_t² = σ²(1 + θ₁² + ... + θ_q²) = γ₀
    γ_i = 0, i > q

An MA(q) process is always covariance stationary. We want to write θ(L)⁻¹y_t = ε_t.
This MA process is invertible iff ∀|Z| ≤ 1, θ(Z) ≠ 0.
8.1.6. ARMA(p,q)
    y_t = φ₁y_{t−1} + ... + φ_p y_{t−p} + ε_t + θ₁ε_{t−1} + ... + θ_qε_{t−q}
    φ(L)y_t = θ(L)ε_t

    STATIONARY: ∀|Z| ≤ 1, φ(Z) ≠ 0, in which case y_t = φ(L)⁻¹θ(L)ε_t
    INVERTIBLE: ∀|Z| ≤ 1, θ(Z) ≠ 0, in which case ε_t = θ(L)⁻¹φ(L)y_t
8.2. Wold Decomposition Theorem
Any covariance stationary stochastic process x_t can be represented as x_t = y_t +
z_t, where z_t is linearly deterministic [i.e. Var(z_t|z_{t−1}, ...) = 0] and y_t is purely non-
deterministic with an MA(∞) representation y_t = ψ(L)ε_t where Σ_{j=1}^∞ ψ_j² < ∞.

ARMA models represent approximations to the MA(∞), i.e. ψ(L) ≈ φ(L)⁻¹θ(L).
8.3. Box-Jenkins Procedure
The procedure consists of three basic parts
(1) Identification (Model Selection) — make a judgment, based on the autocorrelations and
partial autocorrelations, about the model that could have generated them,
(2) Estimation — estimate the model chosen in (1),
(3) Diagnostic Checking — do various diagnostic checks to determine whether the resid-
uals resemble white noise.
8.3.1. Model Selection
We’ll select models based on the autocorrelations and partial autocorrelations of a
time series. We first need to define what these objects are.
The autocorrelation function (at lag i) is defined as ρ_i = γ_i/γ₀. Notice that |ρ_i| ≤ 1,
ρ_{−i} = ρ_i and ρ₀ = 1.

    WN:     ρ₀ = 1,  ρ_i = 0 for i ≠ 0
    AR(1):  ρ₀ = 1,  ρ_i = φ^i
    MA(1):  ρ₀ = 1,  ρ₁ = θ/(1 + θ²),  ρ_i = 0 for i > 1
    MA(q):  ρ_i = 0 for i > q

For an AR(p) there is some more complex form of decay.
So if you thought your data followed an MA(q) process you would look for a drop-off
in the autocorrelation function at some lag in order to identify q. How can you tell
whether an autocorrelation is 0 or not? Do a statistical test! If y_t is mean zero then the
ith autocovariance is estimated by γ̂_i = [1/(T − i)] Σ_{t=i+1}^T y_t y_{t−i}. You would estimate
ρ̂_i = γ̂_i/γ̂₀. To test the null hypothesis H₀: ρ_i = 0 for some i > 0 it turns out that under
the null hypothesis √T ρ̂_i →d N(0, 1). This is because the scaled numerator converges in
distribution to N(0, σ_y⁴) and the denominator converges in probability to σ_y². Thus an
easy way to check whether the autocorrelations are significant is to construct an acceptance
region based on the 5% significance level around 0, equal to ±1.96/√T.
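The sample ACF and the ±1.96/√T band can be sketched as follows; this is a minimal illustration of my own using a simulated white noise series, so essentially no autocorrelation should land outside the band.

```python
import numpy as np

def sample_acf(y, nlags):
    """Sample autocorrelations rho_i = gamma_i / gamma_0 for a mean-zero
    series, with gamma_i = [1/(T-i)] sum_{t=i+1} y_t y_{t-i} as in the text."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    gamma0 = np.mean(y**2)
    return np.array([np.sum(y[i:] * y[:-i]) / (T - i) / gamma0
                     for i in range(1, nlags + 1)])

rng = np.random.default_rng(4)
y = rng.normal(size=1000)            # white noise
rho = sample_acf(y, 10)
band = 1.96 / np.sqrt(len(y))        # 5% acceptance band around 0
n_signif = np.sum(np.abs(rho) > band)
```

Each ρ̂_i has standard error roughly 1/√T ≈ 0.032 here, so the estimates cluster tightly around zero.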
Suppose you want to do a joint test H₀: ρ_i = 0, i = 1, ..., N. Then construct the
Box-Pierce statistic Q_BP = T Σ_{i=1}^N ρ̂_i² or the Ljung-Box Q_LB = T(T + 2) Σ_{i=1}^N ρ̂_i²/(T − i).
Both of these are χ²_N asymptotically, although the Ljung-Box is closer to a χ² in small
samples. It's because both are sums of squared standard normals which are independent
asymptotically.
Obviously these methods are more helpful for MA processes than AR processes.
The partial autocorrelation function φ_kk is defined by the Yule-Walker equations

    ρ_j = Σ_{i=1}^k φ_ki ρ_{j−i},   j = 1, ..., k.

Notice that we have k equations and there are k unknowns φ_ki, i = 1, ..., k. This can be
written in matrix notation as

    [ρ₁ ]   [1         ρ₁    ρ₂   ...  ρ_{k−1}] [φ_k1]
    [ρ₂ ] = [ρ₁        1     ρ₁   ...  ρ_{k−2}] [φ_k2]
    [⋮  ]   [⋮                     ⋱    ⋮     ] [⋮   ]
    [ρ_k]   [ρ_{k−1}   ...        ...  1     ] [φ_kk]

or ρ_k = P_k φ_k, i.e. φ_k = P_k⁻¹ρ_k, but remember we're really only interested in φ_kk, the last
element. Notice that for k = 1 this just reduces to ρ₁ = φ₁₁, so that the first partial
autocorrelation is the same as the first autocorrelation. Might want to illustrate k = 2
also.
Of course we'd be doing this in sample, so all the quantities would be sample
estimates. It looks pretty formidable but it isn't really. In fact, these things can
be computed from simple OLS regressions. Start with φ̂₁₁. Since it equals ρ̂₁ it is
clear that φ̂₁₁ = ρ̂₁ = φ̂₁, the OLS estimator in a regression of y_t on y_{t−1}. I.e.
φ̂₁ = (Σy_{t−1}y_t/T)/(Σy²_{t−1}/T) →p γ₁/γ₀ = ρ₁ = φ₁₁. Suppose we estimated an AR(2)
model y_t = φ₁y_{t−1} + φ₂y_{t−2} + ε_t. Notice we'd have

    [φ̂₁]   [Σy²_{t−1}/T         Σy_{t−1}y_{t−2}/T]⁻¹ [Σy_{t−1}y_t/T]
    [φ̂₂] = [Σy_{t−1}y_{t−2}/T   Σy²_{t−2}/T      ]   [Σy_{t−2}y_t/T]

        →p [γ₀  γ₁]⁻¹ [γ₁]    [ρ₀  ρ₁]⁻¹ [ρ₁]
           [γ₁  γ₀]   [γ₂] =  [ρ₁  ρ₀]   [ρ₂]

Therefore φ̂₂ acts as an estimate of φ₂₂. In general, φ̂_k in an OLS regression of an AR(k) acts
as an estimate of φ_kk. In other words, φ_kk = Corr(y_t, y_{t−k}|y_{t−1}, ..., y_{t−k+1}). An implication of
all of this is that φ_kk = 0 for k > p if y_t is an AR(p) process. On the other hand the φ_kk
will decay in some manner if y_t is an MA process.
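The regression route to the PACF is easy to sketch; this is an illustration of my own, fitting AR(k) models of increasing order to a simulated AR(1) with φ = 0.7, so the first partial autocorrelation should be near 0.7 and the rest near zero.

```python
import numpy as np

def pacf_via_ols(y, kmax):
    """phi_kk estimated as the last OLS coefficient in an AR(k) regression
    of y_t on (y_{t-1}, ..., y_{t-k}), for k = 1..kmax."""
    y = np.asarray(y, dtype=float)
    out = []
    for k in range(1, kmax + 1):
        # column j holds the lag-j values, aligned with y_t = y[k:]
        Yk = np.column_stack([y[k - j:len(y) - j] for j in range(1, k + 1)])
        phi = np.linalg.solve(Yk.T @ Yk, Yk.T @ y[k:])
        out.append(phi[-1])
    return np.array(out)

rng = np.random.default_rng(7)
T = 5000
e = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + e[t]
pacf_hat = pacf_via_ols(y, 5)
```

The cutoff after lag p is exactly the AR(p) identification pattern described above.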
Another small point is that when we’re selecting a model we may need to difference
the time series to render it stationary. Most macroeconomic time series look like this
because they grow steadily through time. There is an issue as to whether this growth can
be modelled as a deterministic trend or must involve explicitly modelling the first difference
of the log of a series. We will not discuss this issue. Let me just mention a definition.
The stochastic process y_t is integrated of order d iff Δ^d y_t = (1 − L)^d y_t is station-
ary. This definition leads us to refer to ARIMA(p, d, q) processes, which have the form
φ(L)Δ^d y_t = θ(L)ε_t. Two special ARIMA processes are the (misnamed by economists)
random walk Δy_t = ε_t and the random walk with drift Δy_t = µ + ε_t. Use of these models
is especially prevalent in finance.
8.3.2. Estimation
We will consider maximum likelihood estimation. This requires the assumption that
ε_t ∼ N(0, σ²). The estimation of autoregressive models is simple. To estimate them we
need to write down the joint density function of the y's, L(y) = f(y₁, y₂, ..., y_T). Let
me introduce the notation Y_t = (y₁, y₂, ..., y_t). We can use the definition of conditional
density functions to derive f:

    f(Y_T) = f(y_T|Y_{T−1})f(Y_{T−1})
           = f(y_T|Y_{T−1})f(y_{T−1}|Y_{T−2})f(Y_{T−2})
           = [Π_{t=2}^T f(y_t|Y_{t−1})] f(y₁)

Since each of these conditional distributions is easy to characterize in an AR model, this
is a convenient way to proceed.
Take for example the AR(1) model. What is f(y_t|Y_{t−1})? Well, in the AR(1) model
y_t = φy_{t−1} + ε_t. Given the information on all y's up to time t − 1 it is clear that E_{t−1}y_t =
φy_{t−1} and that Var_{t−1}y_t = σ². Therefore, (y_t|Y_{t−1}) ∼ N(φy_{t−1}, σ²). We also need to
consider y₁. We know that y₁ = Σ_{i=0}^∞ φ^i ε_{1−i}. Therefore, the marginal distribution of y₁
is N[0, σ²/(1 − φ²)]. Thus the joint density is

    f(Y_T) = [Π_{t=2}^T (2πσ²)^{−1/2} exp( −(1/2σ²)(y_t − φy_{t−1})² )]
             × (2πσ²)^{−1/2}(1 − φ²)^{1/2} exp( −[(1 − φ²)/2σ²] y₁² )

so that the log of the likelihood function is simply

    ln L = −(T/2) ln(2π) − (T/2) ln(σ²) + (1/2) ln(1 − φ²)
           − (1/2σ²) Σ_{t=2}^T (y_t − φy_{t−1})² − [(1 − φ²)/2σ²] y₁².

Notice that if we were to drop the two extra terms from the likelihood we'd have the
OLS criterion function. Since that term dominates as the sample size grows there is
going to be numerical similarity between the ML estimate and the OLS estimate in large
samples. They are asymptotically equivalent. You should verify, by computing the first
derivatives and second derivatives of the likelihood function and then the information matrix,
that the asymptotic variance of the ML and OLS estimators is σ²γ₀⁻¹, which is consistently
estimated by σ̂²(Σy²_{t−1}/T)⁻¹ — which in small samples means use σ̂²(X'X)⁻¹ from OLS!
For higher order AR processes the story is the same although there are more complex
issues to deal with concerning the first p observations.
For a moving average model it is difficult to write $y$ in terms of lagged $y$'s and current
shocks. For example, if we have an MA(1) model, $y_t = \epsilon_t + \theta\epsilon_{t-1}$, the appropriate way
to do it is to invert the MA process to get $(1+\theta L)^{-1}y_t = (1 - \theta L + \theta^2L^2 - \ldots)y_t = \epsilon_t$.
Even in an MA(1), $y$ depends on an infinite sequence of lagged $y$'s when the process is
inverted. Therefore, the easier way to go is just to directly write down the likelihood for
the process. Clearly, since the $\epsilon$'s are normal, the $y$'s are normal. Each $y$ has mean zero,
variance $\sigma^2(1+\theta^2)$ and first order covariance $\theta\sigma^2$. Therefore, $Y_T \sim N(0, \sigma^2\Omega)$ where
$$\Omega = \begin{pmatrix} 1+\theta^2 & \theta & 0 & \ldots & 0 \\ \theta & 1+\theta^2 & \theta & \ldots & 0 \\ 0 & \theta & 1+\theta^2 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1+\theta^2 \end{pmatrix}$$
so that
$$\mathcal{L} = -\frac{T}{2}\ln(2\pi\sigma^2) - \frac{1}{2}\ln|\Omega| - \frac{1}{2\sigma^2}Y_T'\Omega^{-1}Y_T.$$
There are some kinky ways to estimate this guy but you won't learn them here. Shazam
seems a little useless to me, try RATS.
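Kinky or not, the likelihood is easy to evaluate with modern software by building $\Omega$ directly. A sketch, with made-up parameter values; for simplicity I hold $\sigma^2$ at its true value and profile only over $\theta$, whereas a serious implementation would estimate both (typically via the Kalman filter rather than a $T \times T$ solve):

```python
import numpy as np

def ma1_loglik(theta, sig2, y):
    """Exact Gaussian log-likelihood of y_t = eps_t + theta*eps_{t-1}."""
    T = len(y)
    # tridiagonal Omega: 1+theta^2 on the diagonal, theta just off it
    omega = ((1 + theta**2) * np.eye(T)
             + theta * (np.eye(T, k=1) + np.eye(T, k=-1)))
    sign, logdet = np.linalg.slogdet(omega)
    quad = y @ np.linalg.solve(omega, y)
    return -0.5 * T * np.log(2 * np.pi * sig2) - 0.5 * logdet - 0.5 * quad / sig2

rng = np.random.default_rng(2)
T, theta0 = 300, 0.5
eps = rng.normal(size=T + 1)
y = eps[1:] + theta0 * eps[:-1]

grid = np.linspace(-0.9, 0.9, 181)
theta_hat = grid[np.argmax([ma1_loglik(th, 1.0, y) for th in grid])]
print(theta_hat)   # should land near theta0
```

This is the multivariate-normal likelihood written above, nothing more; the only trick is letting the linear algebra handle $|\Omega|$ and $\Omega^{-1}$.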
8.3.3. Diagnostic Testing
One thing you could do which is straightforward is to redo the Ljung-Box or Box-
Pierce tests on the residuals to check for leftover autocorrelations in the data. Unfortunately, these tests have very low power, so they often fail to reject the null when it is
false. You really only get much out of these tests if you have a large sample and you
can sum over many autocorrelations.
A better approach is to use LM tests. You will have chosen a particular model which
you might think of as a restricted version of a more general model. If you want to test
whether your version is adequate you can use an LM test, which only requires estimation of
the restricted model along with the derivatives of the likelihood function at the parameter
estimates. We will only examine a test based on a selected model of AR(1). I will test
this specification versus an AR(2). Dropping the starting values from consideration the
likelihood is just
$$\mathcal{L} = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2.$$
In this case the usual block separation of the information matrix into $\phi$ components and a $\sigma^2$
component will hold, so we only need
$$\frac{\partial\mathcal{L}}{\partial\phi_1} = \frac{1}{\sigma^2}\sum_{t=1}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})y_{t-1}$$
$$\frac{\partial\mathcal{L}}{\partial\phi_2} = \frac{1}{\sigma^2}\sum_{t=1}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})y_{t-2}$$
$$\frac{\partial^2\mathcal{L}}{\partial\phi_1^2} = -\frac{1}{\sigma^2}\sum_{t=1}^T y_{t-1}^2 \qquad \frac{\partial^2\mathcal{L}}{\partial\phi_1\partial\phi_2} = -\frac{1}{\sigma^2}\sum_{t=1}^T y_{t-1}y_{t-2} \qquad \frac{\partial^2\mathcal{L}}{\partial\phi_2^2} = -\frac{1}{\sigma^2}\sum_{t=1}^T y_{t-2}^2$$
At the AR(1) estimates $\hat\phi_2 = 0$ and $\hat\phi_1$ sets the first order condition for $\phi_1$ to zero:
$$\frac{\partial\mathcal{L}}{\partial\phi_1} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T (y_t - \hat\phi_1 y_{t-1})y_{t-1} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T e_t y_{t-1} = 0.$$
The first order condition for $\phi_2$ reduces to
$$\frac{\partial\mathcal{L}}{\partial\phi_2} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T (y_t - \hat\phi_1 y_{t-1})y_{t-2} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T e_t y_{t-2}.$$
This makes the LM statistic
$$LM = \Big(\sum e_t y_{t-1} \;\; \sum e_t y_{t-2}\Big)\begin{pmatrix}\sum y_{t-1}^2 & \sum y_{t-1}y_{t-2} \\ \sum y_{t-1}y_{t-2} & \sum y_{t-2}^2\end{pmatrix}^{-1}\begin{pmatrix}\sum e_t y_{t-1} \\ \sum e_t y_{t-2}\end{pmatrix}\Big/\hat\sigma^2$$
$$= T\Big(\sum e_t y_{t-1} \;\; \sum e_t y_{t-2}\Big)\begin{pmatrix}\sum y_{t-1}^2 & \sum y_{t-1}y_{t-2} \\ \sum y_{t-1}y_{t-2} & \sum y_{t-2}^2\end{pmatrix}^{-1}\begin{pmatrix}\sum e_t y_{t-1} \\ \sum e_t y_{t-2}\end{pmatrix}\Big/\sum e_t^2$$
$$= Te'X(X'X)^{-1}X'e/(e'e) = TR^2$$
from the regression of $e$ on $y_{t-1}$ and $y_{t-2}$.
Suppose I tried an ARMA(1,1) model. In this case (and I won't go into why) it turns
out that the relevant regression is of the residuals $e_t$ on $y_{t-1}$ and $e_{t-1}$. See if you can work out
analogs for more complex models.
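The $TR^2$ version of the AR(1)-versus-AR(2) test takes only a few lines to compute. A sketch on simulated AR(1) data, so the null is true and the statistic should usually sit below the $\chi^2(1)$ critical value (parameter values and sample size are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
T, phi0 = 500, 0.5
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi0 * y[t - 1] + rng.normal()

# restricted model: AR(1) by OLS (asymptotically the ML estimate)
yy, y1, y2 = y[2:], y[1:-1], y[:-2]
phi1 = np.sum(yy * y1) / np.sum(y1 * y1)
e = yy - phi1 * y1

# auxiliary regression of e on (y_{t-1}, y_{t-2}); LM = T * R^2,
# i.e. the uncentered T e'X(X'X)^{-1}X'e / (e'e) form of the statistic
X = np.column_stack([y1, y2])
b = np.linalg.solve(X.T @ X, X.T @ e)
r2 = 1 - np.sum((e - X @ b) ** 2) / np.sum(e**2)
lm = len(e) * r2
print(lm)   # compare with the chi-squared(1) 5% critical value, 3.84
```

There is one restriction ($\phi_2 = 0$), so the statistic is asymptotically $\chi^2(1)$; the $y_{t-1}$ regressor contributes nothing because $e$ is already orthogonal to it.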
9. Vector Autoregressions
At this point we move to a multivariate setting, since univariate models aren't particularly interesting from an economic perspective. If we just wrote down a multivariate
ARMA model it would look like
$$\Phi_0 x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t + \Theta_1\epsilon_{t-1} + \ldots + \Theta_q\epsilon_{t-q}$$
where $\Phi_0$ isn't necessarily $I$, since we might want to allow contemporaneous responses of
one variable to another, and $E(\epsilon_t\epsilon_t') = \Sigma$. Notice how many free parameters we have:
$n^2(1+p+q) + n(n+1)/2$. This could be hideous. I won't discuss the restrictions on
the matrix polynomials $\Phi(z)$ and $\Theta(z)$ required to render this process stationary and
invertible; suffice it to say that they are analogous to the univariate setting.
9.1. Identification
Identification is going to be a problem, just as it is in simultaneous equations models.
Suppose we restrict our attention to models with no MA components (they’d be hideous
to deal with) and write a vector autoregression (VAR)
$$\Phi_0 x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t.$$
The likelihood function for the $\epsilon$'s is
$$L = \prod_{t=1}^T (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\Big(-\tfrac{1}{2}\epsilon_t'\Sigma^{-1}\epsilon_t\Big).$$
The Jacobian of the transformation from $x_t$ to $\epsilon_t$ is $\Phi_0$, so that the likelihood of $X$ is
(ignoring starting values)
$$L = \prod_{t=1}^T (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Phi_0|\exp\Big(-\tfrac{1}{2}[\Phi(L)x_t]'\Sigma^{-1}[\Phi(L)x_t]\Big)$$
$$= \prod_{t=1}^T (2\pi)^{-n/2}|\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}|^{-1/2}\exp\Big(-\tfrac{1}{2}[\Phi_0 x_t - \Phi_1 x_{t-1} - \ldots - \Phi_p x_{t-p}]'\Sigma^{-1}[\Phi_0 x_t - \Phi_1 x_{t-1} - \ldots - \Phi_p x_{t-p}]\Big)$$
$$= \prod_{t=1}^T (2\pi)^{-n/2}|\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}|^{-1/2}\exp\Big(-\tfrac{1}{2}[x_t - \Phi_0^{-1}\Phi_1 x_{t-1} - \ldots - \Phi_0^{-1}\Phi_p x_{t-p}]'\Phi_0'\Sigma^{-1}\Phi_0[x_t - \Phi_0^{-1}\Phi_1 x_{t-1} - \ldots - \Phi_0^{-1}\Phi_p x_{t-p}]\Big)$$
$$= \prod_{t=1}^T (2\pi)^{-n/2}|V|^{-1/2}\exp\Big(-\tfrac{1}{2}[x_t - C_1 x_{t-1} - \ldots - C_p x_{t-p}]'V^{-1}[x_t - C_1 x_{t-1} - \ldots - C_p x_{t-p}]\Big)$$
where $C_i = \Phi_0^{-1}\Phi_i$ and $V = \Phi_0^{-1}\Sigma\Phi_0^{-1\prime}$.
Suppose I generated new parameters using any non-singular $n\times n$ matrix $G$,
$$\tilde\Phi_i = G\Phi_i, \qquad \tilde\Sigma = G\Sigma G',$$
then
$$\tilde C_i = \tilde\Phi_0^{-1}\tilde\Phi_i = (G\Phi_0)^{-1}G\Phi_i = \Phi_0^{-1}G^{-1}G\Phi_i = \Phi_0^{-1}\Phi_i = C_i$$
$$\tilde V = \tilde\Phi_0^{-1}\tilde\Sigma\tilde\Phi_0^{-1\prime} = \Phi_0^{-1}G^{-1}G\Sigma G'G'^{-1}\Phi_0^{-1\prime} = \Phi_0^{-1}\Sigma\Phi_0^{-1\prime} = V.$$
This means that just as in simultaneous equations models we must put some restrictions
upon the parameters to achieve identification. The idea is to put enough restrictions on
the $\Phi_i$ and $\Sigma$ to imply that the only admissible $G$ that delivers the same $C_i$ and $V$ is $G = I$. Unlike
simultaneous equations models, people who use VAR models don't typically have in mind
restrictions on the $\Phi_i$ matrices for $i \geq 1$. This leaves us with two possibilities.
i) Restrictions on $\Phi_0$, and/or
ii) Restrictions on $\Sigma$.
The macroeconomist typically wants an interpretation of his model which allows him
to identify error terms as particular kinds of shocks. For example, money shocks, supply
shocks, technology shocks etc. In choosing what restrictions to impose, the economist
needs to be aware of what he wants to do with his model once it’s identified.
9.1.1. Restriction: $\Phi_0 = I$
Clearly this achieves identification. Now, this means that the model can be written
as
$$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t.$$
This model has an infinite moving average representation which can be obtained by inverting
the autoregressive matrix polynomial in the lag operator. If we write $\Phi(L)x_t = \epsilon_t$ then we
have $x_t = \Phi(L)^{-1}\epsilon_t = \Lambda(L)\epsilon_t$ or
$$x_t = \Lambda_0\epsilon_t + \Lambda_1\epsilon_{t-1} + \Lambda_2\epsilon_{t-2} + \ldots$$
One of the ways in which economists like to interpret VARs is to look at this moving
average representation. It's quickly obvious that
$$\frac{\partial x_{i,t}}{\partial\epsilon_{j,t-k}} = \Lambda_k[ij] = \lambda_{k,ij}.$$
Typically $\lambda_{k,ij}$ is plotted as a function of $k$ and is called the impulse response function
(IRF) of variable $i$ with respect to shocks in equation $j$. The only problem with this
concept is that with no restrictions on $\Sigma$ this is a rather weird thing to do. If there's a
shock to $\epsilon_{jt}$ and there's covariance between $\epsilon_j$ and the other shocks, how can we look at the response
to one shock in isolation from the responses to other shocks which are correlated with it?
To be able to isolate individual shocks we need to identify shocks which are uncorrelated
with each other. Suppose we do a decomposition $\Sigma = C'DC$, where $C$ is invertible and upper
triangular with 1's on the diagonal and $D$ is a diagonal matrix. If we create new shocks
$v_t = C'^{-1}\epsilon_t$, notice that $E(v_tv_t') = E(C'^{-1}\epsilon_t\epsilon_t'C^{-1}) = D$. We can then write
$$x_t = \Lambda_0C'v_t + \Lambda_1C'v_{t-1} + \Lambda_2C'v_{t-2} + \ldots = \tilde\Lambda_0 v_t + \tilde\Lambda_1 v_{t-1} + \tilde\Lambda_2 v_{t-2} + \ldots$$
which creates a new impulse response function with respect to the elements of $v_t$, given
by $\tilde\lambda_{k,ij}$. This does allow the computation of a response function to shocks which are
mutually orthogonal, but it still creates another problem of interpretation. Unfortunately,
the impulse responses will be sensitive to the ordering of the equations. The decomposition
we did will lead to different results if the variables are ordered differently in the VAR. This
leads us to wonder what the heck getting impulse responses to $v$'s really means. Sometimes
theory can shed light on what ordering is appropriate, but clear choices are rare.
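Mechanically, the decomposition and the orthogonalized responses are a few lines of linear algebra. A sketch for a bivariate VAR(1) with made-up parameter values, building the unit-diagonal factor $C'$ and the diagonal $D$ from the Cholesky factor of $\Sigma$:

```python
import numpy as np

# a bivariate VAR(1): x_t = A x_{t-1} + eps_t, with E(eps eps') = Sigma
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

# Sigma = C'DC with C' unit lower triangular and D diagonal,
# recovered from the Cholesky factor P (Sigma = P P')
P = np.linalg.cholesky(Sigma)
d = np.diag(P)
Ct = P / d                      # C': unit lower triangular
D = np.diag(d**2)
assert np.allclose(Ct @ D @ Ct.T, Sigma)

# impulse responses: Lambda_k = A^k for a VAR(1); orthogonalized IRF = Lambda_k C'
horizon = 10
irf = np.empty((horizon, 2, 2))
Ak = np.eye(2)
for k in range(horizon):
    irf[k] = Ak @ Ct            # response of x_{t+k} to the orthogonal v shocks
    Ak = Ak @ A
print(irf[0])                   # impact responses; note the zero above the diagonal
```

Swapping the order of the two variables changes $P$, and hence the impact responses: that is the ordering sensitivity just discussed, made concrete.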
A second thing people like to look at is called the variance decomposition. Let's go
back to our representation
$$x_t = \Lambda_0\epsilon_t + \Lambda_1\epsilon_{t-1} + \Lambda_2\epsilon_{t-2} + \ldots$$
Suppose we wanted to forecast $x_t$ with information up to time $t-k$. The forecast error
would be
$$e_{k,t} = x_t - E(x_t|I_{t-k}) = \Lambda_0\epsilon_t + \ldots + \Lambda_{k-1}\epsilon_{t-k+1}.$$
The variance of this $k$-step ahead forecast error¹ is given by
$$V_k = \Lambda_0\Sigma\Lambda_0' + \ldots + \Lambda_{k-1}\Sigma\Lambda_{k-1}'.$$
Clearly $V_k$ is a linear function of $\sigma_{ij}$, $\forall i,j$. The variance decomposition looks at the
percentage of the variance of the $k$-step ahead forecast error in forecasting variable $i$ which
is due to variation in shock $j$. This is not well defined if there are covariance terms floating
around in the expressions on the diagonal of $V_k$. We'd want to look at $V_k(ii)$, see
something like $a_1\sigma_{11} + a_2\sigma_{22} + \ldots + a_n\sigma_{nn}$, and take the ratio of $a_j\sigma_{jj}$ to $V_k(ii)$. The only
way to ensure that the diagonal of $V_k$ has this appearance is to have a diagonal $\Sigma$, such as
we had in our decomposed model.
¹ Notice that at $k = \infty$ you get the unconditional variance of $x_t$.
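As a sketch, here is the forecast error variance decomposition for a made-up bivariate VAR(1). I scale the orthogonal shocks to unit variance (folding $D^{1/2}$ into the loadings, a cosmetic variation on the unit-diagonal convention used above), which makes each diagonal element of $V_k$ a sum of squares with no covariance terms:

```python
import numpy as np

# assumed bivariate VAR(1): x_t = A x_{t-1} + eps_t
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

P = np.linalg.cholesky(Sigma)   # orthogonalize: eps_t = P v_t, Var(v_t) = I
k = 8
Lam = np.eye(2)
contrib = np.zeros((2, 2))      # contrib[i, j]: variance of x_i due to shock j
for _ in range(k):
    B = Lam @ P                 # MA coefficients on the orthogonal shocks
    contrib += B**2             # with Var(v) = I the cross terms vanish
    Lam = Lam @ A

shares = contrib / contrib.sum(axis=1, keepdims=True)
print(shares)                   # each row sums to one: the variance decomposition
```

Row $i$ of `shares` answers "what fraction of the $k$-step forecast error variance of variable $i$ comes from each shock", which is exactly the object described above.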
9.1.2. Restrict $\Phi_0$ and $\Sigma$
In this case, the most common restriction is to assume that $\Phi_0$ is lower triangular
with 1's on the diagonal and that $\Sigma$ is diagonal. This setup has the advantage of allowing
contemporaneous relationships between the variables. For this reason it is often referred
to as a structural VAR. Because $\Sigma$ is assumed to be diagonal we can interpret the shocks
in the model as independent processes. This helps in thinking about the impulse response
functions.
$$\Phi_0 x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t$$
$$x_t = \Phi_0^{-1}\Phi_1 x_{t-1} + \Phi_0^{-1}\Phi_2 x_{t-2} + \ldots + \Phi_0^{-1}\Phi_p x_{t-p} + \Phi_0^{-1}\epsilon_t = A_1 x_{t-1} + A_2 x_{t-2} + \ldots + A_p x_{t-p} + u_t$$
So our structural model can be written as a plain VAR model with variance covariance
matrix for the $u$'s equal to $\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}$. So suppose we then wrote down the moving average
representation of $x_t$ in terms of $u_t$. We'd have
$$x_t = \sum_{k=0}^\infty B_k u_{t-k}.$$
Getting the moving average representation of $x_t$ in terms of the $\epsilon_t$'s is quite simple. It is simply
$$x_t = \sum_{k=0}^\infty B_k\Phi_0^{-1}\epsilon_{t-k} = \sum_{k=0}^\infty \tilde B_k\epsilon_{t-k}.$$
Notice that when we computed the IRF in the $\Phi_0 = I$ model we took it with respect to the
$v_t$, which were orthogonal, by decomposing the nonspherical $\epsilon$'s using $C'^{-1}$.
Here we are decomposing the nonspherical $u_t$'s using the matrix $\Phi_0^{-1}$, which by assumption
looks a lot like $C'$ did. In both cases the model, when written in terms of nonspherical
disturbances, was a plain VAR model. Thus if we estimated a plain VAR model with
nonspherical disturbances we would simply have to decide how to decompose the variance
covariance matrix of the errors, either by obtaining an estimate of $\Phi_0$ or of $C$. Notice
that in the structural model, if the variance decomposition is computed as
$$V_k = \tilde B_0\Sigma\tilde B_0' + \ldots + \tilde B_{k-1}\Sigma\tilde B_{k-1}'$$
then the expression will be linear in the $\sigma_{ii}$ and no covariance terms will appear, since $\Sigma$ is
assumed to be diagonal.
9.2. Estimation
We’re going to work with either set of restrictions and we’ll write both models in
their plain VAR form, generically as
$$x_t = D_1 x_{t-1} + D_2 x_{t-2} + \ldots + D_p x_{t-p} + w_t.$$
Notice that in either setup, this implies that $w_t$ is an error term which is nonspherical.
As a result GLS seems appropriate. However, since in each equation (remember $x_t$ is a
vector here) the regressors are identical, GLS (which is SURE) is going to be numerically
identical to OLS equation by equation. Therefore in either case simply compute the $D_i$ by
OLS and get an estimate $\hat\Sigma_w = \sum_t \hat w_t\hat w_t'/T$, where the $\hat w_t$'s are the residuals.
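In code this is a single least-squares solve, precisely because the regressors are the same in every equation. A sketch for a made-up bivariate VAR(1):

```python
import numpy as np

rng = np.random.default_rng(7)
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])
T = 2000
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A1 @ x[t - 1] + rng.normal(size=2)

# OLS equation by equation = one multivariate least-squares solve
X = x[:-1]                                   # regressors: one lag of each variable
Y = x[1:]
D1 = np.linalg.solve(X.T @ X, X.T @ Y).T     # row i: coefficients of equation i
W = Y - X @ D1.T                             # residuals w_t
Sigma_w = W.T @ W / len(W)                   # estimate of the error covariance
print(D1)
print(Sigma_w)
```

Each row of `D1` is exactly what running OLS on that equation alone would give, which is the SURE-equals-OLS point in the text.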
9.2.1. Restriction 9.1.1.
In this case the $D_i$ represent estimates of the $\Phi_i$ and $\hat\Sigma_w$ is an estimate of $\Sigma$. To compute
the impulse response functions simply requires finding $C$ and $D$ such that
$C'DC = \hat\Sigma_w$.
9.2.2. Restriction 9.1.2.
In this case the $D_i$ represent estimates of the $\Phi_0^{-1}\Phi_i$ and $\hat\Sigma_w$ is an estimate of
$\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}$. Clearly, to get out the parameters we're interested in requires an estimate
of $\Phi_0$. But remember simultaneous equations: if $\Phi_0$ is lower triangular we can do OLS
equation by equation and get the right estimates.
9.3. Structural VARs
Suppose we didn't have such a wonderful situation. Suppose that $\Phi_0$ had enough
restrictions in it to achieve identification but it wasn't lower triangular. This would be
tricky. It's hard because the triangularity bought a lot. It meant that if you looked in any
equation, like the one for $x_{it}$, you'd see
$$x_{it} = \beta_{1i}x_{1t} + \ldots + \beta_{i-1,i}x_{i-1,t} + f(\text{lags}) + \epsilon_{it}.$$
The whole secret to a system like this is that everything on the right hand side is either
exogenous or is determined by an equation higher in the order which does not depend on
$x_{it}$. Without triangularity OLS won't work. So what will we do? Blanchard and Watson
(1986) suggest a way of doing this which only works for some kinds of nontriangular
systems. Let me illustrate their procedure with a 4-variable system:
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & -\alpha \\ 0 & 0 & 1 & -\beta \\ -\phi_1 & -\phi_2 & -\phi_3 & 1 \end{pmatrix} y_t = Z_t\gamma + \epsilon_t$$
where $Z_t$ represents all the lags. This can be written as 4 equations:
$$y_1 = Z\gamma_1 + \epsilon_1$$
$$y_2 = \alpha y_4 + Z\gamma_2 + \epsilon_2$$
$$y_3 = \beta y_4 + Z\gamma_3 + \epsilon_3$$
$$y_4 = \phi_1 y_1 + \phi_2 y_2 + \phi_3 y_3 + Z\gamma_4 + \epsilon_4$$
If you work out the reduced form for $y_4$ it is
$$y_4 = \frac{Z(\phi_1\gamma_1 + \phi_2\gamma_2 + \phi_3\gamma_3 + \gamma_4) + (\phi_1\epsilon_1 + \phi_2\epsilon_2 + \phi_3\epsilon_3 + \epsilon_4)}{1 - \alpha\phi_2 - \beta\phi_3}.$$
Now suppose I estimated each equation by OLS but left out the endogenous RHS variables,
creating what are really reduced form coefficient estimates which I'll call $\hat\gamma_i$. These guys
wouldn't converge to the $\gamma_i$'s; rather they'd converge to the reduced form coefficients, the
analogs of the $\Phi_0^{-1}\Phi_i$'s. Create four residual series:
$$e_1 = y_1 - Z(Z'Z)^{-1}Z'y_1 = My_1 = M(Z\gamma_1 + \epsilon_1) = M\epsilon_1$$
$$e_2 = My_2 = M(Z\gamma_2 + \alpha y_4 + \epsilon_2) = \alpha My_4 + M\epsilon_2 = \alpha e_4 + M\epsilon_2$$
$$e_3 = My_3 = M(Z\gamma_3 + \beta y_4 + \epsilon_3) = \beta My_4 + M\epsilon_3 = \beta e_4 + M\epsilon_3$$
$$e_4 = My_4 = M(Z\gamma_4 + \phi_1 y_1 + \phi_2 y_2 + \phi_3 y_3 + \epsilon_4) = \phi_1 e_1 + \phi_2 e_2 + \phi_3 e_3 + M\epsilon_4$$
How could we get $\alpha$? Regress $e_2$ on $e_4$?
$$\hat\alpha = (e_4'e_4)^{-1}e_4'e_2 = (e_4'e_4)^{-1}e_4'(\alpha e_4 + M\epsilon_2) = \alpha + (y_4'My_4)^{-1}y_4'M\epsilon_2$$
Clearly this means there will be bias in $\hat\alpha$: look at the covariance of the reduced form for
$y_4$ with $\epsilon_2$. Could we instrument it out? Yes. Use $e_1$ as an instrument for $e_4$!
$$\hat\alpha_{IV} = (e_1'e_4)^{-1}e_1'e_2 = (e_1'e_4)^{-1}e_1'(\alpha e_4 + M\epsilon_2) = \alpha + (\epsilon_1'My_4)^{-1}\epsilon_1'M\epsilon_2$$
This is going to work because the second part is zero in expectation and the first part is
nonsingular because $\epsilon_1$ appears in $y_4$'s reduced form. This last part is critical! We could
continue like this by next taking $e_1$ and $\tilde e_2 = e_2 - \hat\alpha_{IV}e_4$ as instruments (using optimal IV
methods) for $e_4$ to get an estimate of $\beta$. This would lead to $\tilde e_3 = e_3 - \hat\beta_{IV}e_4$. Finally, $e_1$,
$\tilde e_2$ and $\tilde e_3$ could be used (with optimal IV methods) as instruments for $e_1$, $e_2$ and $e_3$ and
thereby get estimates of the $\phi_i$'s.
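The first step of the procedure is easy to check by simulation. In this sketch I drop the lags entirely (so the "residuals" are just the series themselves) and use made-up values of the structural parameters; the OLS estimate of $\alpha$ is biased and the IV estimate using $e_1$ as the instrument is not:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 200_000                               # big sample, to make the bias visible
alpha, beta, p1, p2, p3 = 0.8, -0.5, 0.4, 0.3, 0.2

e1, e2, e3, e4 = rng.normal(size=(4, T))  # structural shocks, diagonal Sigma
y1 = e1
# reduced form for y4 from the text, then y2 and y3 from their equations
y4 = (p1 * e1 + p2 * e2 + p3 * e3 + e4) / (1 - alpha * p2 - beta * p3)
y2 = alpha * y4 + e2
y3 = beta * y4 + e3

a_ols = (y4 @ y2) / (y4 @ y4)             # biased: y4 is correlated with eps_2
a_iv = (y1 @ y2) / (y1 @ y4)              # y1 = eps_1 is a valid instrument for y4
print(a_ols, a_iv)                        # a_iv near 0.8; a_ols noticeably above it
```

The instrument works for exactly the reason stated in the text: $\epsilon_1$ enters $y_4$'s reduced form (so the first stage is nondegenerate) and is uncorrelated with $\epsilon_2$.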
10. Generalized Method of Moments
There is a large class of macroeconomic models which lead to Euler equations, or in
general, nonlinear restrictions of the form
$$E_{t-i}f(Y_{t-1}, y_t|\theta) = 0$$
for some $i \geq 0$. Because of the law of iterated expectations it follows that $E_{t-i}[f(Y_{t-1}, y_t|\theta)x_{t-i}] = 0$ for any $x_{t-i}$ which is in the time $t-i$ information set. Furthermore we can write $E[f(Y_{t-1}, y_t|\theta)x_{t-i}] = 0$. With restrictions like these we could write $f(Y_{t-1}, y_t|\theta)x_{t-i} = u_t$,
where the restriction is rewritten as $Eu_t = 0$. This just looks like a nonlinear model. The
problem that GMM estimation is meant to solve is that the model in its most primitive
form often makes no statement as to the distribution function of $u_t$. I.e., we don't know that
it's normal or virtually anything else about it (e.g. we don't know whether it is heteroskedastic
or autocorrelated). We do know that $E(u_t) = 0$. Furthermore, $E(u_tu_{t-j}') = 0$ for $|j| \geq i$,
otherwise nonzero, because $u_t$ is orthogonal to anything known up to time $t-i$.
10.1. Estimation
The intuition behind GMM estimation is to have an estimator of θ which sets the
sample equivalent of the moment condition equal to zero. I.e. choose θ so that
$$\frac{1}{T}\sum_{t=1}^T f(Y_{t-1}, y_t|\theta)\otimes x_{t-i} = \frac{1}{T}\sum_{t=1}^T u_t(\theta)$$
is set equal to zero. I'm now making explicit the dependence of the error on the choice of
$\theta$. Suppose $f$ is $m\times 1$ and $x_{t-i}$ is $n\times 1$. If $\theta$ is $k\times 1$ then, in general, we can only set
the average of the $u_t$ equal to zero if $k = mn$. Usually what we have is a situation where
$k < mn$. As a result, we can only set $(1/T)\sum u_t$ close to zero. But how? One way to
proceed is to minimize
$$J = \Big[\frac{1}{T}\sum_{t=1}^T u_t(\theta)\Big]'W_T\Big[\frac{1}{T}\sum_{t=1}^T u_t(\theta)\Big]$$
where $W_T$ is some positive definite symmetric matrix which may be sample dependent.
What is the first order condition?
$$\frac{\partial J}{\partial\theta} = \underbrace{\frac{1}{T}\sum_{t=1}^T \frac{\partial u_t(\theta)}{\partial\theta}'}_{k\times mn}\;\underbrace{W_T}_{mn\times mn}\;\underbrace{\frac{1}{T}\sum_{t=1}^T u_t(\theta)}_{mn\times 1} = 0$$
As you can see, the first-order condition sets $k$ linear combinations of the $mn$ restrictions
to zero.
10.2. Asymptotics
It turns out that as long as $u_t(\theta_0)$ is stationary, the $\hat\theta$ which solves the first order
condition will be consistent. As a result,
$$\frac{1}{T}\sum_{t=1}^T \frac{\partial u_t(\hat\theta)}{\partial\theta} \xrightarrow{p} D_0$$
where
$$D_0 = E\frac{\partial u_t(\theta_0)}{\partial\theta}.$$
Furthermore, with the assumption that $u_t(\theta_0)$ is stationary we can assume that
$$\frac{1}{\sqrt{T}}\sum_{t=1}^T u_t(\theta_0) \xrightarrow{d} N(0, S_0)$$
where
$$S_0 = E\Big(\sum_{j=-\infty}^{\infty} u_t(\theta_0)u_{t-j}(\theta_0)'\Big).$$
It turns out that if $W_T$ is chosen (optimally) to be a consistent estimator of $S_0^{-1}$, then
$$\sqrt{T}(\hat\theta - \theta_0) \xrightarrow{d} N\Big[0, (D_0'S_0^{-1}D_0)^{-1}\Big].$$
Of course you'd have to estimate the variance covariance matrix as usual. This is usually
done by computing the sample counterparts of $D_0$ and $S_0$.
10.3. Examples
Some examples serve to illustrate GMM.
10.3.1. Estimating the Mean of a Random Variable
Suppose we are interested in estimating the mean of a scalar random variable $x_t$. If
we parameterize its mean as $\mu$, we have the moment restriction
$$E(x_t - \mu) = 0.$$
As a result, $mn = 1$ and $k = 1$, so that $\mu$ is exactly identified. The GMM estimator for $\mu$
is simply the sample mean
$$\hat\mu = \bar x = \frac{1}{T}\sum_{t=1}^T x_t.$$
Clearly, $D_0 = -1$. Also, assuming $x_t$ is stationary, we have
$$S_0 = E\Big(\sum_{j=-\infty}^{\infty}(x_t - \mu_0)(x_{t-j} - \mu_0)\Big).$$
One special case would be that $x_t$ is i.i.d. In this case,
$$S_0 = E\big[(x_t - \mu_0)^2\big] = \sigma^2 \equiv \mathrm{Var}(x_t).$$
Thus, in the i.i.d. case we have
$$\sqrt{T}(\hat\mu - \mu_0) \xrightarrow{d} N(0, \sigma^2),$$
which is the familiar result from undergraduate statistics, where $\hat\mu$ is casually written as
having a normal distribution with variance $\sigma^2/T$ in large samples.
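When $x_t$ is autocorrelated the whole sum defining $S_0$ matters. A common way to estimate it is a weighted sum of sample autocovariances (the Newey-West/Bartlett estimator; this particular choice is my addition, not something derived in these notes). A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(10)
T, mu0 = 5000, 2.0
# an autocorrelated x_t, so that S_0 exceeds the plain variance
x = np.empty(T)
x[0] = mu0 + rng.normal()
for t in range(1, T):
    x[t] = mu0 + 0.5 * (x[t - 1] - mu0) + rng.normal()

mu_hat = x.mean()               # the GMM estimator: sets the sample moment to zero
u = x - mu_hat

# Newey-West estimate of S_0 = sum_j E(u_t u_{t-j}), Bartlett weights, lag window L
L = 20
S = u @ u / T
for j in range(1, L + 1):
    w = 1 - j / (L + 1)
    S += 2 * w * (u[j:] @ u[:-j]) / T

se = np.sqrt(S / T)             # asymptotic standard error of mu_hat
print(mu_hat, se)               # se exceeds the naive std(x)/sqrt(T) here
```

With positive autocorrelation the long-run variance $S_0$ is larger than $\mathrm{Var}(x_t)$, so the naive i.i.d. standard error would be too small.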
10.3.2. The Linear Model With Stochastic Regressors
Suppose the model is $y_t = x_t'\beta + \epsilon_t$ where $\epsilon_t$ is white noise. Clearly $E_t(y_t - x_t'\beta) = 0$.
Of course this implies that $E_tx_t(y_t - x_t'\beta) = Ex_t(y_t - x_t'\beta) = 0$. This is a case where $m = 1$
and $n = k$, so we have exact identification. Therefore we can set
$$\frac{1}{T}\sum_{t=1}^T x_t(y_t - x_t'\beta) = 0$$
by choosing $\hat\beta = (\sum_{t=1}^T x_tx_t')^{-1}\sum_{t=1}^T x_ty_t$. What's the variance-covariance matrix? It
had better be the usual one. We have
$$\frac{\partial u_t}{\partial\beta} = -x_tx_t'$$
so that $D_0 = -\Sigma_{xx}$. What about $S_0$? We have $E(u_tu_t') = E(x_t\epsilon_t^2x_t') = \sigma^2\Sigma_{xx}$, while
$E(u_tu_{t-i}') = 0$ for $i \neq 0$. Therefore, $S_0 = \sigma^2\Sigma_{xx}$, and the variance-covariance
matrix is $(D_0'S_0^{-1}D_0)^{-1} = \sigma^2\Sigma_{xx}^{-1}$.
10.3.3. The Consumption-Based Asset Pricing Model
Suppose consumers have time-separable constant relative risk aversion preferences, so
that $u(c) = c^{1-\gamma}/(1-\gamma)$. Then the intertemporal Euler equation which governs the
behavior of returns is
$$1 = E_t\beta z_{t+1}^{-\gamma}R_{t+1},$$
where $z_{t+1} = c_{t+1}/c_t$. In this case, the parameters of interest are $\beta$ and $\gamma$. To
estimate these parameters, the econometrician can use Euler equations for different assets,
thus increasing the number of moment restrictions, or can identify
different variables $x_t$ that are in the time $t$ information set. As a result, the econometrician
uses a set of equations given by
$$E\big[(\beta z_{t+1}^{-\gamma}R_{t+1} - 1)\otimes x_t\big] = 0,$$
where $x_t$ is $n\times 1$. Typical choices for the elements of $x_t$ would be lagged values of
consumption growth and the asset returns. In general, $R_{t+1}$ can be an $m\times 1$ vector, so that we have $mn$ moment restrictions. As a result, there will be $mn - 2$
overidentifying restrictions.
10.4. Testing With Overidentifying Restrictions
Last, but not least, when $mn > k$ and the weighting matrix is chosen optimally, it
turns out that $TJ(\hat\theta) \xrightarrow{d} \chi^2(mn - k)$. This provides a statistic which can be used to test the
overidentifying moment conditions. The idea is that when the overidentifying restrictions
hold, the minimized value of $J$ should be small even though it won't be exactly zero. The
factor $T$ is required, since $J$ converges to 0 with probability 1 under the null hypothesis
that the moment restrictions all hold. On the other hand, $TJ$ converges to a random
variable which can be used to test this hypothesis. When $TJ$ is large, the hypothesis is
rejected.
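To see the test in action, here is a sketch of two-step GMM in a linear model with one endogenous regressor and two instruments, so $mn = 2$, $k = 1$ and there is one overidentifying restriction. All the data-generating numbers are made up, and the moments are serially uncorrelated so no HAC correction is needed for $S_0$:

```python
import numpy as np

rng = np.random.default_rng(11)
T, theta0 = 5000, 1.5

# linear model with an endogenous regressor and two valid instruments z
z = rng.normal(size=(T, 2))
eps = rng.normal(size=T)
x = z @ np.array([1.0, 0.5]) + 0.7 * eps + rng.normal(size=T)
y = theta0 * x + eps

def gbar(th):                        # (1/T) sum z_t (y_t - th x_t): the mn-vector
    return z.T @ (y - th * x) / T

a, b = z.T @ y / T, z.T @ x / T      # gbar(th) = a - th*b, linear in th

# step 1: identity weighting; step 2: optimal W = S^{-1}
th1 = (b @ a) / (b @ b)
u = z * (y - th1 * x)[:, None]       # u_t(th1), one row per observation
S = u.T @ u / T                      # sample S_0 (no autocorrelation terms here)
W = np.linalg.inv(S)
th2 = (b @ W @ a) / (b @ W @ b)

TJ = T * gbar(th2) @ W @ gbar(th2)   # overidentification test, chi-squared(1)
print(th2, TJ)                       # th2 near 1.5; TJ small when the model is true
```

Since the instruments really are valid here, `TJ` is typically far below the $\chi^2(1)$ critical value; feeding in an instrument correlated with $\epsilon_t$ would blow it up.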