INTRODUCTORY GRADUATE ECONOMETRICS
Craig Burnside
Department of Economics
University of Pittsburgh
Pittsburgh, PA 15260
January 2, 1994
∗These are incomplete notes intended for use in an introductory graduate econometrics
course. As notes, the style of presentation is deliberately informal and lacking in proper
citations. Please point out any errors you find.
1. Ordinary Least Squares
We have a sequence of observations on a random variable
yt, t = 1, 2, . . . , T.
The T indicates I’m a time series guy. Furthermore we have an economic model which
tells us that y is a linear function of explanatory variables plus a random component. I.e.
yt = xt'β + εt

where xt is a k × 1 vector of explanatory variables, β is a k × 1 vector, and yt and εt are scalars. Written out this is

yt = ( x1t  x2t  · · ·  xkt ) ( β1  β2  · · ·  βk )' + εt.
Typically x1t = 1 for all t. If the observations are stacked we get

y1 = β1x11 + β2x21 + . . . + βkxk1 + ε1
y2 = β1x12 + β2x22 + . . . + βkxk2 + ε2
...
yT = β1x1T + β2x2T + . . . + βkxkT + εT

or

y = Xβ + ε

with y and ε T × 1 vectors, and X a T × k matrix.
1.1. Assumptions
(1) X is a matrix of fixed variables (unrealistic) and has full column rank (i.e. the columns of X are linearly independent)
(2) E(ε) = 0
(3) E(εε') = σ²IT, or more strongly, the εt are i.i.d.
(4) β is an unknown constant parameter vector, and σ² is an unknown scalar.
1.2. Estimation
We want to estimate β. The first method is to use the least squares criterion. If I choose some estimate of β, the difference between y and Xβ is called a residual: e = y − Xβ. Minimize the sum of the squared residuals by choice of β:

SSE = Σ_{t=1}^T et² = e'e = (y − Xβ)'(y − Xβ) = y'y − 2β'X'y + β'X'Xβ

Minimize SSE by choosing β so that ∂SSE/∂β = 0:

∂SSE/∂β = −2X'y + 2X'Xβ = 0
The solution is β̂ = (X'X)⁻¹X'y. To verify that this is a minimum compute the matrix of second derivatives,

∂²SSE/∂β∂β' = 2X'X

which is positive definite. See Rule 5, p. 961 of Red Judge to see that this follows from our assumptions. Our solution is therefore the unique minimum.

We will also want to have an estimate of the variance σ². The estimator we will use is

σ̂² = (1/(T − k)) Σ_{t=1}^T et² = e'e/(T − k)
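As a numerical sketch of these formulas (the simulated design, seed, and all variable names are illustrative, not from the notes; `lstsq` is used only as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3

# Fixed regressors with an intercept column, as in assumption (1).
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=T)

# beta_hat = (X'X)^{-1} X'y  (solving the normal equations is preferred
# numerically to forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# sigma2_hat = e'e / (T - k)
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (T - k)
```

The residuals are orthogonal to the columns of X by construction, which is a useful sanity check on any implementation.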
1.3. Sampling Properties
Since the vector y is random, β̂ is a vector-valued random variable. Similarly, σ̂² is a random variable. Therefore, we can ask questions about the distributions of these random variables. The first property we will consider is the mean of these random variables. We will show that both estimators are unbiased. An estimator θ̂ is unbiased if E(θ̂) = θ.
E(β̂) = E[(X'X)⁻¹X'y] = (X'X)⁻¹X'E(y)
      = (X'X)⁻¹X'E(Xβ + ε)
      = (X'X)⁻¹X'Xβ
      = β

E(σ̂²) = E[(1/(T − k)) e'e] = (1/(T − k)) E(e'e)
Consider the definition of the residuals:

e = y − Xβ̂ = y − X(X'X)⁻¹X'y
  = [I − X(X'X)⁻¹X']y = My
  = [I − X(X'X)⁻¹X'](Xβ + ε) = Mε

Some facts about M. It is T × T but has rank T − k. It is symmetric, as M = M'. Furthermore it is idempotent, as MM = M. As a result of all this (the trace is the sum of the elements on the diagonal of a square matrix)

e'e = tr(e'e) = tr(ε'M'Mε) = tr(ε'Mε)

E(e'e) = E[tr(ε'Mε)] = E[tr(Mεε')]    Rule: tr(ABC) = tr(CAB) = tr(BCA)
       = tr[M E(εε')] = tr(Mσ²I) = σ²tr(M)
       = σ²(T − k)

The last step follows from the fact that an idempotent matrix of rank s has trace equal to s. Therefore E(σ̂²) = σ².
We will also derive the variance-covariance matrix of β̂, i.e. we will compute (give explanation of what's in there) V(β̂) = E[(β̂ − β)(β̂ − β)']. To do this, note that β̂ − β = (X'X)⁻¹X'y − β = (X'X)⁻¹X'(Xβ + ε) − β = (X'X)⁻¹X'ε. Therefore,

V(β̂) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹] = (X'X)⁻¹X'E(εε')X(X'X)⁻¹
      = (X'X)⁻¹X'σ²IX(X'X)⁻¹
      = σ²(X'X)⁻¹
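The unbiasedness and variance results can be illustrated with the repeated-sampling thought experiment: hold X fixed, redraw the errors many times, and compare the Monte Carlo mean and covariance of β̂ with β and σ²(X'X)⁻¹. A minimal sketch (the simulated design and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma = 50, 1.0

# Hold X fixed across repeated samples, as the fixed-regressor assumption requires.
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta = np.array([0.5, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the errors many times and re-estimate beta each time.
draws = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=T))
    for _ in range(20000)
])

mean_beta_hat = draws.mean(axis=0)   # should be close to beta (unbiasedness)
cov_beta_hat = np.cov(draws.T)       # should be close to sigma^2 (X'X)^{-1}
```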
1.4. The Gauss-Markov Theorem
Recall that the elements of β̂ are linear combinations of the elements of y, i.e. it is a linear estimator. Suppose we consider any other linear estimator β̃ = Ay which is also unbiased, i.e. E(β̃) = β. Then V(β̃) ≥ V(β̂). Proof: First note that A is a k × T matrix. Since β̃ = Ay = AXβ + Aε, we have E(β̃) = AXβ = β by the assumption of unbiasedness. This means AX = Ik. However this does not imply A = X⁻¹ since neither A nor X is invertible. Notice that the last result means that β̃ = β + Aε, or β̃ − β = Aε. Therefore

V(β̃) = E[(β̃ − β)(β̃ − β)'] = E(Aεε'A')
      = σ²AA'.

Now define C = A − (X'X)⁻¹X'. This means that A = C + (X'X)⁻¹X'. Some neat facts about C: CC' is a k × k positive semi-definite matrix, and CX = AX − (X'X)⁻¹X'X = 0! Therefore,

V(β̃) = σ²[C + (X'X)⁻¹X'][C + (X'X)⁻¹X']'
      = σ²[C + (X'X)⁻¹X'][C' + X(X'X)⁻¹]
      = σ²[CC' + (X'X)⁻¹X'C' + CX(X'X)⁻¹ + (X'X)⁻¹X'X(X'X)⁻¹]
      = σ²[CC' + (X'X)⁻¹]

Therefore, V(β̃) − V(β̂) is a positive semidefinite matrix. This proves that OLS is BLUE given our assumptions.
1.5. Further Statistical Properties
Suppose we make a further assumption that
(5) The errors are normally distributed.
This last assumption implies that β̂ is a normally distributed random vector since it is a linear combination of the εt.

Also define C² = (T − k)σ̂²/σ² = e'e/σ². Recall that e = Mε. Therefore,

C² = ε'M'Mε/σ² = (ε/σ)'M(ε/σ).

Notice that ε/σ ∼ N(0, I) and that M is idempotent of rank T − k. This implies that C² ∼ χ²(T − k). Check Red Judge for this trivia tidbit!
We could use both of these last results to conduct hypothesis tests if we so desired.
Unfortunately we can't use the first result directly since V(β̂) = σ²(X'X)⁻¹ involves the unknown parameter σ². Recall that a t-distributed random variable is defined as t(n) = z/√(x/n) where z is standard normal, x is χ²(n), and z and x are independent. Thus, you see why we use the t-statistics

ti = (β̂i − βi)/√(σ̂²[(X'X)⁻¹]ii)
   = [(β̂i − βi)/√(σ²[(X'X)⁻¹]ii)] ÷ √(σ̂²/σ²)
   = [(β̂i − βi)/√(σ²[(X'X)⁻¹]ii)] ÷ √(C²/(T − k))
   = z/√(x/n).
Thus if we wanted to test H0 : βi = βi0, we could exploit the fact that ti is distributed
t(T − k). Notice that I didn’t verify independence in my proof. This would make a good
assignment question.
We might also want to conduct tests such as H0 : Rβ = r, where R is a j×k matrix.
I.e. we're looking at j joint linear restrictions on β. Notice that

Rβ̂ − Rβ = R(β̂ − β) = R(X'X)⁻¹X'ε ∼ N(0, σ²R(X'X)⁻¹R')

The next step involves finding the Cholesky decomposition of σ²R(X'X)⁻¹R', i.e. find C, lower triangular and invertible, such that CC' = σ²R(X'X)⁻¹R'. Now define Z = C⁻¹(Rβ̂ − Rβ). From the definition of C we have Z ∼ N(0, Ij). This means Z'Z is χ²(j). Or

(Rβ̂ − Rβ)'C⁻¹'C⁻¹(Rβ̂ − Rβ) = (Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ)/σ² ∼ χ²(j).
Again we have a problem since we don't know σ². When we use σ̂² instead we get the familiar F-statistic. To see this recall that F(n1, n2) = (χ₁²/n1)/(χ₂²/n2) where χ₁² and χ₂² are independent. Therefore,

F(j, T − k) = [(Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ)/(σ²j)] / {[(T − k)σ̂²/σ²]/(T − k)}
            = (Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ)/(jσ̂²)

Are they independent χ²'s? Good assignment question!
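A numerical sketch of the F-statistic (illustrative simulated data). As a cross-check it also computes the statistic from the restricted and unrestricted sums of squared residuals, F = [(SSEr − SSEu)/j]/[SSEu/(T − k)] — a standard equivalence for linear restrictions that is not derived above:

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 80, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = (e @ e) / (T - k)                  # unbiased estimate of sigma^2

# H0: beta_2 = beta_3 = 0, i.e. R beta = r with j = 2 restrictions.
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
j = 2

diff = R @ beta_hat - r
mid = R @ np.linalg.inv(X.T @ X) @ R.T
F = diff @ np.linalg.solve(mid, diff) / (j * s2)

# Equivalent form: restricted regression (intercept only here) vs unrestricted SSE.
e_r = y - y.mean()
F_sse = ((e_r @ e_r - e @ e) / j) / (e @ e / (T - k))
```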
1.6. Measures of Fit
Measures of fit are used to summarize the extent to which the estimated model ‘fits’
the data. Notice that by the definition of the residuals, y = Xβ̂ + e. This implies that y − ȳ1 = Xβ̂ − ȳ1 + e, where 1 is a T × 1 vector of ones and ȳ is the sample mean of yt. Thus the total sum of squared (SST) deviations of yt around its mean is given by

SST = (y − ȳ1)'(y − ȳ1)
    = (Xβ̂ − ȳ1 + e)'(Xβ̂ − ȳ1 + e)
    = (ŷ − ȳ1 + e)'(ŷ − ȳ1 + e)
    = (ŷ − ȳ1)'(ŷ − ȳ1) + e'e − 2ȳ1'e
    = (ŷ − ȳ1)'(ŷ − ȳ1) + e'e    if 1 ∈ X
    = SSR + SSE

The R² is defined to be R² = SSR/SST = 1 − SSE/SST. Clearly this can be increased simply by adding regressors indiscriminately, because of the extra degree of freedom to reduce SSE that is introduced by adding a parameter. Thus the adjusted R² is introduced to adjust for the number of regressors. It is defined as

R̄² = ((T − 1)/(T − k))R² − ((k − 1)/(T − k)) = 1 − ((T − 1)/(T − k))(1 − R²).
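A quick numerical check of the fit measures and the adjusted-R² identity (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 60, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

SST = np.sum((y - y.mean()) ** 2)
SSE = e @ e
SSR = SST - SSE                      # valid because X contains a column of ones
R2 = 1 - SSE / SST
R2_bar = 1 - (T - 1) / (T - k) * (1 - R2)
```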
1.7. Geometry
We saw earlier that e = y − Xβ̂ = y − X(X'X)⁻¹X'y = [I − X(X'X)⁻¹X']y = My. M was shown to be symmetric and idempotent. Now, let's also consider the fitted values ŷ = Xβ̂ = X(X'X)⁻¹X'y = Py. P is also symmetric and idempotent, but this time with rank k.

The matrix P projects a T × 1 vector y onto the space spanned by the k T × 1 vectors which are the columns of X. I.e. P produces a linear combination of the columns of X, namely ŷ,
whose deviation from y, namely e, is orthogonal to the space spanned by those columns. M would project any vector onto the orthogonal complement of that space (because it produces the residuals, which are orthogonal to X). Notice that ŷ = Py so that y − ŷ = y − Py = (I − P)y = My = e.
Things to note are the following.
1. X'e = X'My = X'[I − X(X'X)⁻¹X']y = [X' − X'X(X'X)⁻¹X']y = 0. Notice that only if X contains a 1 vector does Σ_{t=1}^T et = 1'e = 0. Furthermore we have X'ŷ = X'Py = X'X(X'X)⁻¹X'y = X'y.
2. y'y = (Xβ̂ + e)'(Xβ̂ + e)
       = β̂'X'Xβ̂ + e'Xβ̂ + β̂'X'e + e'e
       = β̂'X'Xβ̂ + e'e.

I.e. (raw) SST = SSR + SSE.
3. Suppose I transform the variables before regression by applying a nonsingular k × k matrix A to X, i.e. I generate X* = XA. Then notice that if you regress y on X* you get

P* = X*(X*'X*)⁻¹X*'
   = XA(A'X'XA)⁻¹A'X'
   = XAA⁻¹(X'X)⁻¹A'⁻¹A'X'
   = X(X'X)⁻¹X' = P.

Thus, the predicted values from the two regressions will be the same. What about the coefficients? β̂* = (X*'X*)⁻¹X*'y = A⁻¹(X'X)⁻¹A'⁻¹A'X'y = A⁻¹β̂. The coefficients are changed by the inverse of the linear transformation.
4. Suppose I regress y on X and Z where the model is y = Xβ + Zγ + ε. Define W = (X Z). Also define θ = (β' γ')'. Then y = Wθ + ε. Thus we'll get θ̂ = (W'W)⁻¹W'y. Expanding this we get

( β̂ )   ( X'X  X'Z )⁻¹ ( X' )          ( X' )
(    ) = (           )  (    ) y = H⁻¹ (    ) y
( γ̂ )   ( Z'X  Z'Z )   ( Z' )          ( Z' )
This means that the estimate of γ is given by

γ̂ = (H⁻¹)₂₁X'y + (H⁻¹)₂₂Z'y
  = −(H₂₂ − H₂₁H₁₁⁻¹H₁₂)⁻¹H₂₁H₁₁⁻¹X'y + (H₂₂ − H₂₁H₁₁⁻¹H₁₂)⁻¹Z'y
  = −[Z'Z − Z'X(X'X)⁻¹X'Z]⁻¹(Z'X)(X'X)⁻¹X'y + [Z'Z − Z'X(X'X)⁻¹X'Z]⁻¹Z'y
  = −(Z'[I − X(X'X)⁻¹X']Z)⁻¹Z'Pxy + (Z'[I − X(X'X)⁻¹X']Z)⁻¹Z'y
  = −(Z'MxZ)⁻¹Z'Pxy + (Z'MxZ)⁻¹Z'y
  = (Z'MxZ)⁻¹Z'(I − Px)y
  = (Z'MxZ)⁻¹Z'Mxy

where Px = X(X'X)⁻¹X' and Mx = I − Px.
Suppose that instead I did the following. First regress y onto X and compute the residuals y* = Mxy from that regression. Also regress each of the columns of Z onto X and make a matrix out of all the different residual series, Z* = MxZ. Now regress y* onto Z*. You'll get

γ̂ = (Z*'Z*)⁻¹Z*'y*
  = (Z'Mx'MxZ)⁻¹Z'Mx'Mxy
  = (Z'MxZ)⁻¹Z'Mxy

Notice the equivalence. This result is called the Frisch-Waugh-Lovell Theorem.
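The Frisch-Waugh-Lovell equivalence is easy to verify numerically; a sketch with an illustrative simulated design:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 70
X = np.column_stack([np.ones(T), rng.normal(size=T)])   # first block of regressors
Z = rng.normal(size=(T, 2))                             # second block
y = X @ np.array([1.0, 0.5]) + Z @ np.array([2.0, -1.0]) + rng.normal(size=T)

# Full regression of y on W = (X Z): gamma_hat is the last two coefficients.
W = np.column_stack([X, Z])
theta_hat = np.linalg.solve(W.T @ W, W.T @ y)
gamma_full = theta_hat[2:]

# Two-step route: residualize y and Z on X, then regress residuals on residuals.
Mx = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
y_star, Z_star = Mx @ y, Mx @ Z
gamma_fwl = np.linalg.solve(Z_star.T @ Z_star, Z_star.T @ y_star)
```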
2. Hypothesis Testing
There are different approaches to hypothesis testing. The ones you will be introduced
to in this course are all classical hypothesis tests. The term classical refers to the basic
approach to testing. Essentially, the classical approach works as follows.
1. Determine the null hypothesis to be tested.
2. Find some testable implication of the null hypothesis. Usually this will pertain to
the distribution of some statistic under the null.
3. Set up the test. What this means is that you have to have a rejection region for the
test. I.e. you must define a region such that if the statistic ever lies in that region
you will reject the null hypothesis.
4. Actually compute the test statistic for your sample and determine the outcome of
the test.
How does one actually construct the test? In the classical approach, the parameters
of the model are treated as unknown fixed constants. This is as opposed to the Bayesian
approach where the econometrician actually assigns a prior probability distribution to
the parameters, and then updates this to form a posterior given the observed sample
of data. Because of the inherent assumption of the classical approach there is always
assumed to exist some unknown true value of the parameter in question, unless the model
is misspecified.
Classical hypothesis tests are constructed with the following two considerations in
mind.
1. SIZE - this is the probability of rejecting the null hypothesis even though the null
hypothesis is true, i.e. it is the probability of type I error.
2. POWER - this is the probability of rejecting the null hypothesis when it is false, i.e.
it is 1 minus the probability of type II error.
To evaluate the size or power of tests one clearly needs to compute the probabilities
in question. One way to do this would be to perform the conceptual experiment of an
infinite number of repeated samples. For example, suppose I wanted to know the weight
assigned to heads on a coin. I either know the true weight (i.e. I know the probability
distribution governing the outcome) or I can flip the coin an infinite number of times. We
know that the proportion of outcomes which are heads will converge to the true proportion
with a large enough number of flips. That’s what’s really going on in hypothesis testing.
Size is determined by assuming the null hypothesis is true. If it is true, then the
question is what is the probability in any given sample that I will end up rejecting the null.
Sometimes, this can be calculated by carefully working out the probability distribution of
the test statistic. Other times, it can be determined by doing a lot of repeated sampling
in a controlled experiment. This is what is referred to as Monte-Carlo simulation.
For a given null hypothesis, size is a fixed number α. However, power clearly depends
on what the true parameter actually is. For example, suppose I set up the following test
for whether the mean of a normally distributed random variable was zero or not. Assume that we know the variable is distributed normally, and that the variance is 1. The null hypothesis is H0 : µ = 0 versus H1 : µ ≠ 0. Suppose I decide to reject the null hypothesis whenever the sample mean x̄ from T observations on X lies below −0.1 or above 0.1. What do we know about the sample mean when the draws are i.i.d.? A testable implication is that if H0 is true, the sample mean x̄ is distributed as N(0, 1/T). This is a standard result: we know that the distribution of the sample mean of a N(µ, σ²) is N(µ, σ²/T). Therefore the size in my example is just

1 − Pr(−0.1 < x̄ < 0.1) = 1 − Pr(−0.1√T < Z < 0.1√T)
which is just some number for any fixed T. However what is the power of the test? It is the probability of rejecting the null hypothesis when it is false. When it is false, we want to standardize x̄ with the correct (true) mean, rather than 0. Therefore power is

1 − Pr(−0.1 < x̄ < 0.1) = 1 − Pr((−0.1 − µ)√T < Z < (0.1 − µ)√T)
which is clearly a function of µ. Suppose µ is very close to zero. Then power will approxi-
mately equal the size of the test. For the test I’ve set up, the power gets bigger the further
µ gets from zero.
The way hypothesis tests are typically designed is to choose a test with a given
amount of size. Then among tests of a given size the natural thing to do would be to look
for the most powerful test. This isn’t always the approach, especially when asymptotic
tests are being used. However, there is a theory of uniformly most powerful tests which
tries to find exactly that. It turns out that usually such a test is not available. However,
for certain classes of distributions it’s often the case that there is a UMP test among the
class of unbiased tests, i.e. tests for which the power is always greater than or equal to the
size.
This is what provides the logic behind the typical tail rejection regions that we use.
For example, to test whether µ = 0 in the example above, you usually want to set up the statistic ZS = x̄/(σ/√T). Usually we set up a 5% test where the rejection region is {z : |z| > 1.96}. Clearly when the null hypothesis is true the probability of rejection is 5%. However, there are an infinite number of critical regions which give tests whose sizes are 5%. To be specific, our rejection region C would only have to satisfy the condition that ∫_C φ(z)dz = .05. There are many such regions. Let's try out a few, and examine their power properties:

i. C1 = {z : |z| > 1.96},
ii. C2 = {z : |z| < 0.0625},
iii. C3 = {z : 0.5960 < z < 0.7535}.
All three critical regions have size of 5%, but their power properties are dramatically
different. Look at the power functions on the following page. These are constructed for the
case where σ is known to be 1 and T = 100, and xt is normally distributed and i.i.d. The
region C1 is the standard rejection region, and is seen to have maximal power, approaching
1, against the most distant alternatives to H0 : µ = 0, and weakest power, approaching
0.05 against those alternatives which are closest to H0. The region C2 leads to size of 5%
but has obviously undesirable power properties. Because it is centered around zero, power
is maximal at µ = 0, and is never greater than 0.05. Furthermore, power is zero against
the most distant alternatives. The region C3, which is off-centered but involves no tail
area, again leads to terrible power against the most distant alternatives, although power
is now maximized to the right of H0. However, it has slightly higher power against some
nearby alternatives than does C1, although this gain in power is very hard to discern in
the graphs.
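The size and power calculations behind those comparisons can be sketched directly from the normal CDF (built here from the error function; T = 100 and σ = 1 as in the text, everything else is illustrative):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

T, sigma = 100, 1.0

def power(mu, region):
    """Pr(Z_S lands in `region`) when the true mean is mu.

    Z_S = x_bar / (sigma/sqrt(T)); under mean mu, Z_S ~ N(mu*sqrt(T)/sigma, 1),
    so Pr(a < Z_S < b) = Phi(b - shift) - Phi(a - shift)."""
    shift = mu * math.sqrt(T) / sigma
    return sum(Phi(b - shift) - Phi(a - shift) for a, b in region)

# The three critical regions, written as unions of intervals of Z_S values.
C1 = [(-math.inf, -1.96), (1.96, math.inf)]
C2 = [(-0.0625, 0.0625)]
C3 = [(0.5960, 0.7535)]

sizes = [power(0.0, C) for C in (C1, C2, C3)]   # all should be about 0.05
```

Evaluating `power` over a grid of µ values reproduces the qualitative shapes described in the text: C1 dominates against distant alternatives, C2 never exceeds its size, and C3 edges out C1 only against certain nearby alternatives.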
Figure 2.1: Power
3. Maximum Likelihood
We have a sequence of observations on a random variable
yt, t = 1, 2, . . . , T.
Suppose the joint density function of this sequence of random variables is given by
f(y1, y2, . . . , yT |θ)
where θ is some parameter (vector) of that joint distribution. f is called the likelihood function when written as a function of θ, i.e. L(θ|y1, y2, . . . , yT) = f(y1, y2, . . . , yT|θ). The log-likelihood function is given by ℒ(θ|y) = ln[L(θ|y)], where y is the vector of observations as in section 1.
3.1. The Linear Model
In the linear model we had yt = xt'β + εt. To determine the likelihood function for y let us first consider the joint distribution, or likelihood, of the vector ε. Given assumptions (1)–(5) in section 1, we have

f(ε) = Π_{t=1}^T φ(εt)    (by independence)
     = Π_{t=1}^T (2π)^(−1/2) σ⁻¹ exp(−εt²/(2σ²))
     = (2π)^(−T/2) σ^(−T) exp(−ε'ε/(2σ²))

Now to get the distribution of y we have to use the following fact: if the variable y is a function of ε, say y = g(ε), then the joint distribution of the y's is

fY(y) = f[g⁻¹(y)] |J|    where J = ∂ε/∂y = ∂g⁻¹(y)/∂y.

In the linear model y = Xβ + ε so that ε = y − Xβ. Therefore J = I and |J| = 1. Thus

L(θ|y) = f(y) = (2π)^(−T/2) σ^(−T) exp(−(y − Xβ)'(y − Xβ)/(2σ²))
and the log-likelihood is

ℒ(θ|y) = −(T/2)ln(2π) − (T/2)lnσ² − (1/(2σ²))(y − Xβ)'(y − Xβ)
       = −(T/2)ln(2π) − (T/2)lnσ² − (1/(2σ²)) Σ_{t=1}^T (yt − xt'β)².

In this case, θ = (β' σ²)'. The maximum likelihood estimator (MLE) for θ is obtained by maximizing the likelihood function. Since ln is a monotonic transformation we can maximize the log-likelihood instead. We will do this by solving the first order conditions for choices of β and σ².
First expand the definition of ℒ:

ℒ = −(T/2)ln(2π) − (T/2)lnσ² − (1/(2σ²))(y'y − 2β'X'y + β'X'Xβ)

∂ℒ/∂β = −(1/(2σ²))(−2X'y + 2X'Xβ)

∂ℒ/∂σ² = −(T/2)σ⁻² + (1/2)σ⁻⁴(y − Xβ)'(y − Xβ)

If we solve these FONCs for β and σ² we get the ML estimators β̂ = (X'X)⁻¹X'y and σ̂² = (1/T)e'e, where e = y − Xβ̂. Notice that β̂ can be solved for independently of σ² and is the same as the OLS estimator. We already know that E(β̂) = β. Since σ̂² is (T − k)/T times the unbiased estimator e'e/(T − k) of section 1, we also know that E(σ̂²) = (T − k)σ²/T. Furthermore, V(β̂) = σ²(X'X)⁻¹. The variance of the variance estimator is
E[(σ̂² − Eσ̂²)²] = E[((1/T)e'e − ((T − k)/T)σ²)²]
               = (σ⁴/T²) E[(e'e/σ² − (T − k))²]
               = (σ⁴/T²) E[(χ²(T − k) − (T − k))²]
               = 2(T − k)σ⁴/T²
because the mean of a χ²(T − k) is T − k and its variance is 2(T − k). As for the covariance,

Cov(β̂, σ̂²) = E[(β̂ − β)(σ̂² − Eσ̂²)]
           = E[((X'X)⁻¹X'ε)((1/T)e'e − ((T − k)/T)σ²)]
           = 0

This is because the first term is a linear combination of a normal, while the second term is a quadratic form (based on M) in the same normal. These are independent if the product of the two matrices involved is 0. Of course (X'X)⁻¹X'M = 0. Therefore the variance-covariance matrix is

V(θ̂) = ( σ²(X'X)⁻¹         0         )
        (     0      2(T − k)σ⁴/T²  )
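A numerical sketch of the MLE formulas: compute β̂ and σ̂² from the first order conditions and check that the log-likelihood is (weakly) lower at perturbed parameter values (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 40
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=T)

def loglik(beta, sigma2):
    # Log-likelihood of the normal linear model at (beta, sigma2).
    resid = y - X @ beta
    return -T/2*np.log(2*np.pi) - T/2*np.log(sigma2) - (resid @ resid)/(2*sigma2)

# The MLEs from the first order conditions.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = (e @ e) / T             # divides by T, not T - k

L_max = loglik(beta_hat, sigma2_hat)
```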
3.2. Cramer-Rao Lower Bound
The Cramer-Rao Lower Bound simply states that any unbiased estimator of a parameter vector θ will have variance no smaller than the inverse of a matrix called I(θ). I.e. if θ̂ is an unbiased estimator of θ, then V(θ̂) − I(θ)⁻¹ is a p.s.d. matrix, where

I(θ) = −E( ∂²ℒ/∂θ∂θ' ).
This is a general principle applying to any model with a likelihood. To find the matrix I(θ), called the information matrix, we need to compute the second derivatives.

∂²ℒ/∂β∂β' = −(1/σ²)X'X

∂²ℒ/∂β∂σ² = (1/σ⁴)(−X'y + X'Xβ)

∂²ℒ/∂σ²∂β' = (1/σ⁴)(−y'X + β'X'X)

∂²ℒ/∂(σ²)² = (T/2)σ⁻⁴ − σ⁻⁶(y − Xβ)'(y − Xβ)

The negative of the expectation of these terms is

I(θ) = ( σ⁻²X'X              0              )
       (    0     −(T/2)σ⁻⁴ + σ⁻⁶Tσ²  )

     = ( σ⁻²X'X       0       )
       (    0     (T/2)σ⁻⁴ ).
Thus the CRLB for θ is

I(θ)⁻¹ = ( σ²(X'X)⁻¹     0     )
         (     0      2σ⁴/T  ).
Another useful concept in maximum likelihood theory is the sufficient statistic. A sufficient statistic is a statistic in terms of which the likelihood can be completely characterized, i.e. the random data no longer appear directly in the formula for the likelihood. In this case

L(θ|y) = (2π)^(−T/2) σ^(−T) exp(−(y − Xβ)'(y − Xβ)/(2σ²)).

Now y − Xβ = (y − Xβ̂) + (Xβ̂ − Xβ) = e + X(β̂ − β), so that

(y − Xβ)'(y − Xβ) = e'e + 2e'X(β̂ − β) + (β̂ − β)'X'X(β̂ − β)
                  = Tσ̂² + (β̂ − β)'X'X(β̂ − β)

since e'X = 0 and σ̂² = e'e/T. Substituting this into L gets rid of the data y. Thus β̂ and σ̂² are jointly sufficient statistics for L. A theorem in statistics states that if a sufficient statistic is unbiased it is minimum variance unbiased. Therefore, β̂ and the unbiased variance estimator e'e/(T − k) are the MVUE of β and σ².
Since the OLS estimators are the MVUEs, this might raise the question: why use
ML? It turns out that in cases where we have a growing sample size and we need to
use asymptotic theory (we’ll see cases like this later), ML estimators have the following
properties:
1. Consistency (sort of asymptotic unbiasedness)
2. Asymptotic normality of the estimator regardless of distribution of
3. Asymptotic efficiency - achieving the CRLB.
We will discuss asymptotic theories more completely when we get to cases with
stochastic regressors.
3.3. A Brief Discussion of Asymptotics
What is all this asymptotic stuff? Well, we have to digress a little bit to talk about it. We need definitions. This is different from what we've been talking about because the
sample size T is treated as variable. What does this mean for things like X fixed? It just
means that in repeated samples of any size we’d get the same sequence of xt’s. We will
also need to redefine some things because of scale factors.
CONSISTENCY: An estimator θ̂T converges in probability to c if, for every ε > 0,

lim_{T→∞} Pr(|θ̂T − c| > ε) = 0.

This is sometimes denoted c = plim θ̂T. θ̂T is consistent if θ = plim θ̂T.

ASYMPTOTIC NORMALITY: A random variable XT converges in distribution to a random variable X if

lim_{T→∞} |FT(x) − F(x)| = 0

at all continuity points of F. Typically, we'll see cases where θ̂T is asymptotically normal in the sense that √T(θ̂T − θ) converges in distribution to N(0, V).
To illustrate this let's take a scalar i.i.d. random variable xt ∼ N(µ, σ²). The estimator I'll consider is x̄T = Σ_{t=1}^T xt/T. Because I've assumed normality we're obviously going to get an exact normal distribution for x̄T for any T, because it's a linear combination of normal random variables. The mean is clearly µ and because the xt are i.i.d. we easily get V(x̄T) = σ²/T. Can we prove consistency? Yes. x̄T − µ = wT ∼ N(0, σ²/T). Therefore,

Pr(|x̄T − µ| > ε) = Pr(x̄T − µ > ε) + Pr(x̄T − µ < −ε)
                 = Pr(√T wT/σ > √T ε/σ) + Pr(√T wT/σ < −√T ε/σ)
                 = Pr(Z > √T ε/σ) + Pr(Z < −√T ε/σ).

The limit as T → ∞ of this quantity is clearly 0.

The important thing about asymptotic normality is that you must blow up by √T. Note in the example the variance of x̄T − µ goes to zero if you don't blow it up, while the variance of √T(x̄T − µ) is σ² if you do. This would be unnecessary if you always had an exact small sample distribution as
we have in the example and all the work so far. But in cases where we can't characterize the small sample distribution, we use the asymptotic distribution to approximate the small sample distribution, which is only useful if the asymptotic distribution is nondegenerate.
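The √T point can be seen in a small simulation: the spread of x̄T − µ collapses as T grows, while the spread of √T(x̄T − µ) stays near σ (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, reps = 2.0, 3.0, 5000

raw_sds, scaled_sds = [], []
for T in (10, 100, 1000):
    # reps independent samples of size T, reduced to their sample means.
    xbar = rng.normal(mu, sigma, size=(reps, T)).mean(axis=1)
    raw_sds.append((xbar - mu).std())                    # shrinks like sigma/sqrt(T)
    scaled_sds.append((np.sqrt(T) * (xbar - mu)).std())  # stays near sigma
```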
3.4. The 3 Tests
We will now discuss three methods for testing restrictions in the context of ML
estimation. Specifically we're interested in tests of the form H0 : g(θ) = 0 vs HA : g(θ) ≠ 0. The function g : R^k → R^j.
3.4.1. The Likelihood Ratio Test
Denote θ̂ = argmax_θ ℒ(θ|y), while θ̃ = argmax_θ ℒ(θ|y) s.t. g(θ) = 0. Clearly ℒ(θ̂|y) ≥ ℒ(θ̃|y). The test is based on how big the difference is. If it is large the hypothesis is brought into question.

LR = −2[ℒ(θ̃|y) − ℒ(θ̂|y)] = −2 log(r)    where r = L(θ̃|y)/L(θ̂|y).
It turns out that LR converges in distribution to χ2(j) as T →∞. The LR test requires
that you estimate the model under both the null and the alternative. Sometimes this isn’t
practical depending on the form of the function g.
In our linear model suppose we used the LR test. Having estimated a restricted and an unrestricted model we would have two parameter vectors β̃ and β̂, with residuals ẽ = y − Xβ̃ and ê = y − Xβ̂ and variance estimates σ̃² = ẽ'ẽ/T and σ̂² = ê'ê/T. Then LR is given by

LR = −2[ℒ(β̃) − ℒ(β̂)]
   = −2(−(T/2)log(σ̃²) − ẽ'ẽ/(2σ̃²) + (T/2)log(σ̂²) + ê'ê/(2σ̂²))
   = T log(ẽ'ẽ/ê'ê)

since the terms ẽ'ẽ/(2σ̃²) and ê'ê/(2σ̂²) both equal T/2 and cancel.
3.4.2. The Wald Test
The Wald test, rather than using the fact that the likelihoods should be close if the hypothesis is true, uses the fact that when we estimate the unrestricted model, if the restriction is true, then g(θ̂) should be close to zero anyway. Redefine the information matrix as

I_A(θ) = −E( (1/T) ∂²ℒ/∂θ∂θ' ).

Notice the 1/T part. Then

W = T g(θ̂)' [ (∂g/∂θ')|θ̂ I_A(θ̂)⁻¹ (∂g/∂θ')'|θ̂ ]⁻¹ g(θ̂).
What does this look like when the restriction is Rβ = r (i.e. when the restrictions are linear)? g(θ) = ( R  0 )θ − r, where R is j × k and the zero block is j × 1 (for σ²). As a result ∂g/∂θ' = ( R  0 ), and

I_A(θ̂) = ( σ̂⁻²T⁻¹X'X       0       )
          (      0      (1/2)σ̂⁻⁴  )

As a result

W = (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/σ̂².
Under some assumptions, W is χ²(j) asymptotically, but it relies only on estimation of the model under the alternative (i.e. the unrestricted model). Thus, it is an especially convenient test when estimation under the null is problematic.
3.4.3. The Lagrange Multiplier Test
Finally, we have a test which only requires estimates of the restricted model. It is based on the fact that if the restriction is valid, the slope of the likelihood function shouldn't be much different at the restricted estimator than at the unrestricted estimator (where it is exactly zero). Let's set up the restricted estimation in Lagrangean form:

Λ(θ) = ℒ(θ|y) − λ'g(θ)

where λ is a j × 1 vector of Lagrange multipliers.
The first order condition for this problem is

∂Λ(θ)/∂θ |θ̃ = ∂ℒ(θ)/∂θ |θ̃ − (∂g(θ)/∂θ')'|θ̃ λ̃ = 0.

The test is measured as

LM = (1/T) λ̃' (∂g(θ)/∂θ')|θ̃ I_A(θ̃)⁻¹ (∂g(θ)/∂θ')'|θ̃ λ̃.
As you might expect, given our results for the other tests, LM is asymptotically χ²(j). What is the form of the test when the restriction is Rβ = r? Since in this case σ² is unrestricted, the partial of g with respect to σ² is 0, and thus (through the first order condition) so is the partial of ℒ with respect to σ² at θ̃. However, the partial of ℒ with respect to β, evaluated at the restricted estimates, is (X'y − X'Xβ̃)/σ̃² = X'ẽ/σ̃². These two form the leading term in the formula. Also

I_A(θ̃) = ( σ̃⁻²T⁻¹X'X       0       )
          (      0      (1/2)σ̃⁻⁴  )

Therefore

LM = (1/T) ( ẽ'X/σ̃²  0 ) ( Tσ̃²(X'X)⁻¹    0   ) ( X'ẽ/σ̃² )
                          (     0        2σ̃⁴ ) (    0    )
   = ẽ'X(X'X)⁻¹X'ẽ/σ̃²

Under certain conditions, which happen to be satisfied by any linear model, although the tests have the same distributional form in large samples, in small samples we have the following ordering:

W ≥ LR ≥ LM.

Therefore, the Wald test will tend to reject more often than the likelihood ratio test, which will tend to reject more often than the Lagrange multiplier test.
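For a linear restriction all three statistics are cheap to compute, and the small-sample ordering can be checked directly. The sketch below (illustrative simulated data) also verifies the SSE-ratio forms W = T(ẽ'ẽ − ê'ê)/ê'ê and LM = T(ẽ'ẽ − ê'ê)/ẽ'ẽ, which are standard equivalences rather than results derived above:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.3, 0.0]) + rng.normal(size=T)

# Unrestricted fit, and a restricted fit imposing beta_2 = beta_3 = 0.
beta_u = np.linalg.solve(X.T @ X, X.T @ y)
e_u = y - X @ beta_u
e_r = y - y.mean()                       # restricted model: intercept only

# ML variance estimates divide by T.
s2_u, s2_r = e_u @ e_u / T, e_r @ e_r / T

R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
d = R @ beta_u
W = d @ np.linalg.solve(R @ np.linalg.inv(X.T @ X) @ R.T, d) / s2_u
LR = T * np.log((e_r @ e_r) / (e_u @ e_u))
LM = e_r @ X @ np.linalg.solve(X.T @ X, X.T @ e_r) / s2_r
```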
4. Generalized Least Squares
We now change one of the assumptions of the linear model. In particular we change
the assumption made about the covariance matrix of the error term. We now assume that E(εε') = σ²Ω, where Ω is a T × T, positive definite, symmetric matrix.
4.1. Ω Known
First consider the case where Ω is known to the econometrician. Let's reconsider the OLS estimator, β̂O = (X'X)⁻¹X'y. It's still unbiased, by the same proof. However, what about its variance?

V(β̂O) = E[(β̂O − β)(β̂O − β)']
       = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
       = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹.

This doesn't look so good.
Since Ω is positive definite we can decompose it as PΩP' = I, where P is a T × T nonsingular matrix, or Ω = P⁻¹P'⁻¹, or Ω⁻¹ = P'P. Since P is known, consider premultiplying the data by P so that we have

Py = PXβ + Pε
y* = X*β + ε*

where E(ε*ε*') = E(Pεε'P') = σ²PΩP' = σ²I. Now if we use OLS on this transformed model we'll get β̂G = (X*'X*)⁻¹X*'y*. Clearly β̂G is BLUE because the transformed model satisfies the conditions of the Gauss-Markov Theorem. It's clearly unbiased and its covariance matrix is clearly σ²(X*'X*)⁻¹.

Now going back to untransformed variables we see that

β̂G = (X*'X*)⁻¹X*'y*
   = (X'P'PX)⁻¹X'P'Py
   = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
and V(β̂G) = σ²(X*'X*)⁻¹ = σ²(X'P'PX)⁻¹ = σ²(X'Ω⁻¹X)⁻¹. How about proving that this means GLS is more efficient than OLS?

V(β̂O) − V(β̂G) = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹ − σ²(X'Ω⁻¹X)⁻¹
              = σ²[(X'X)⁻¹X' − (X'Ω⁻¹X)⁻¹X'Ω⁻¹]Ω[(X'X)⁻¹X' − (X'Ω⁻¹X)⁻¹X'Ω⁻¹]'
              = σ²AΩA'

which is p.s.d. because Ω is p.d.

Good assignment question: prove that σ̂²O is biased. What is the GLS estimator for the variance? Well, you go back to the transformed model:

σ̂²G = e*'e*/(T − k)
    = (y* − X*β̂G)'(y* − X*β̂G)/(T − k)
    = (y'P' − β̂G'X'P')(Py − PXβ̂G)/(T − k)
    = (y − Xβ̂G)'P'P(y − Xβ̂G)/(T − k)
    = eG'Ω⁻¹eG/(T − k)

Clearly, this estimator can be shown to be unbiased using the standard argument (the OLS proof) on the transformed model.
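A sketch of the transformation argument with a known Ω (a diagonal, purely heteroskedastic Ω is chosen purely for illustration): the direct GLS formula and OLS on the P-transformed data coincide:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 30
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta = np.array([1.0, 2.0])

# A known diagonal Omega: a different error variance for each observation.
omega_diag = np.linspace(0.5, 4.0, T)
Omega = np.diag(omega_diag)
eps = rng.normal(size=T) * np.sqrt(omega_diag)
y = X @ beta + eps

# Direct formula: beta_G = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y.
Omega_inv = np.diag(1.0 / omega_diag)
beta_g = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent route: transform by P with P'P = Omega^{-1}, then run OLS.
P = np.diag(1.0 / np.sqrt(omega_diag))
Xs, ys = P @ X, P @ y
beta_g2 = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```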
4.2. Ω Unknown
In general, how many unknown elements could Ω have? It is a T × T matrix but it is symmetric, so there are actually only T + (T − 1) + (T − 2) + . . . + 1 = T(T + 1)/2 distinct elements of Ω. Actually, since we've normalized by pulling out σ², there is one less than that. You cannot model Ω freely when it is unknown because you only have T observations to work with and you've already used up k of those degrees of freedom on β. Therefore, you need assumptions as to the form of Ω. Then using those assumptions we'll construct estimators for Ω, say Ω̂.
The GLS estimator in this case will become

β̂G = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y.
It is no longer possible to pass the expectations operator all the way through to y because now there is an additional random component due to Ω̂. At this point we must rely on asymptotic arguments.
Take for example the GLS estimator:

β̂G = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y
   = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹(Xβ + ε)
   = β + (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹ε

β̂G − β = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹ε

√T(β̂G − β) = (X'Ω̂⁻¹X/T)⁻¹ (1/√T) X'Ω̂⁻¹ε

To show that this thing will be asymptotically normal you need

1. a law of large numbers so that (X'Ω̂⁻¹X)/T converges in probability to some fixed positive definite matrix D. This is usually the same matrix to which (X'Ω⁻¹X)/T would converge deterministically if Ω were known, and

2. a central limit theorem so that (X'Ω̂⁻¹ε)/√T converges in distribution to the same random variable to which (X'Ω⁻¹ε)/√T would converge, say a N(0, σ²V), for some positive definite symmetric matrix V.

Generally we can get these two results if we obtain Ω̂ in such a way that it converges in probability to Ω.
4.3. Heteroskedasticity
Suppose that we have the same linear model, but E(εt²) = σt², where ln σt² = ln σ² + zt'α, and there is no covariance across the ε's. The typical procedure works like this:

1. Estimate the model by OLS, and obtain the OLS residuals e = y − Xβ̂O.

2. Notice that ln et² = ln σt² + ln(et²/σt²) = ln σ² + zt'α + vt, where vt = ln(et²/σt²). Now regress ln et² on [1 Z] and obtain OLS estimates α̂ and ĉ, where ĉ estimates ln σ².

3. Construct σ̂t² = exp(ĉ + zt'α̂), and a matrix Ω̂ with the exp(zt'α̂)'s on the diagonal.

4. Compute β̂G and σ̂²G using Ω̂ in place of Ω in the formulas.

Generally, these estimators will have the desired properties.
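The four steps can be sketched as follows (the data-generating process and all names are illustrative, not from the notes; note that the intercept of the ln e² regression also picks up E[ln(et²/σt²)] and so is a biased estimate of ln σ², but only α̂ is needed to build Ω̂):

```python
import numpy as np

rng = np.random.default_rng(9)
T = 400
X = np.column_stack([np.ones(T), rng.normal(size=T)])
z = rng.normal(size=T)                        # variance regressor
alpha_true = 0.8
sig2_t = np.exp(alpha_true * z)               # ln sigma_t^2 = 0 + alpha * z_t
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T) * np.sqrt(sig2_t)

# Step 1: OLS residuals.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

# Step 2: regress ln e_t^2 on [1 z].
Zmat = np.column_stack([np.ones(T), z])
coef = np.linalg.solve(Zmat.T @ Zmat, Zmat.T @ np.log(e**2))
alpha_hat = coef[1]

# Steps 3-4: build the diagonal of Omega_hat and compute feasible GLS.
omega_hat = np.exp(alpha_hat * z)
w = 1.0 / omega_hat                           # GLS weights
beta_fgls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))
```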
4.4. Autocorrelation
The classic case you’ve all seen is where the error term is a first-order autoregression.
I.e. we have a model of the form
yt = x0tβ + ut
ut = ρut−1 + t − 1 < ρ < 1
where xt satisfies all the assumptions we’ve made up until now and t is i.i.d. N(0, σ2.
Since the model is now y = Xβ+u we must of course be concerned with the properties of
u. It is clearly mean zero. However what is E(uu0)?
    u_t = ρu_{t−1} + ε_t
        = ρ²u_{t−2} + ε_t + ρε_{t−1}
        ...
        = Σ_{i=0}^∞ ρ^i ε_{t−i}

As a result of this we can compute the variances and covariances of the u_t's:

    σ_u² = V(u_t) = E(u_t²) = Σ_{i=0}^∞ ρ^{2i} σ_ε² = σ_ε²/(1 − ρ²)

Furthermore, we have E(u_t u_{t−1}) = ρσ_u², etc., with E(u_t u_{t−i}) = ρ^i σ_u². Therefore,

    Ω = [ 1         ρ         ρ²        ...  ρ^{T−1} ]
        [ ρ         1         ρ         ...  ρ^{T−2} ]
        [ ρ²        ρ         1         ...  ρ^{T−3} ]
        [ ...                                ...     ]
        [ ρ^{T−1}   ρ^{T−2}   ρ^{T−3}   ...  1       ]

To get a consistent estimator for Ω we just need a consistent estimate of one parameter:
ρ. We could use the following steps.

1. Estimate the model by OLS and obtain the residuals.
2. Compute a consistent estimator for ρ. A bunch of possibilities are available:

    r = Σ_{t=2}^T e_t e_{t−1} / Σ_{t=1}^T e_t²
    Theil's r* = [(T − k)/(T − 1)] r
    r** = 1 − d/2

where d is the Durbin-Watson statistic, d = Σ_{t=2}^T (e_t − e_{t−1})² / Σ_{t=1}^T e_t².
3. Then in the last step there are various methods used
a. Full GLS,
b. Full GLS dropping the first observation, or
c. Full ML, Beach and Mackinnon (1978).
The rationale for (3b) versus (3a) is that (3b) involves a simpler transformation of
the data. It turns out that in this case
    P = [ √(1−ρ²)   0    0   ...   0   0 ]
        [   −ρ      1    0   ...   0   0 ]
        [    0     −ρ    1   ...   0   0 ]
        [   ...                 ...      ]
        [    0      0    0   ...  −ρ   1 ]

As a result applying the transformation P to the data just means taking any variable z_t
and replacing it with z_t − ρz_{t−1}. However the first observation is treated differently.
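The transformation, and the choice between keeping and dropping the first observation (methods (3a) vs. (3b)), can be sketched as follows. This is an illustrative helper of my own, not code from the notes; applying it to both y and the columns of X and then running OLS gives the GLS estimates.

```python
import numpy as np

def ar1_transform(v, rho, drop_first=False):
    """Apply the P matrix for AR(1) errors to a vector or (T x k) array:
    row t becomes v_t - rho * v_{t-1}; row 1 becomes sqrt(1 - rho^2) * v_1
    (full GLS), or is dropped entirely (method (3b))."""
    v = np.asarray(v, dtype=float)
    out = v.copy()
    out[1:] = v[1:] - rho * v[:-1]
    if drop_first:
        return out[1:]
    out[0] = np.sqrt(1.0 - rho**2) * v[0]
    return out
```

In practice ρ would be replaced by one of the consistent estimates (r, r*, or r**) from step 2.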
To do full ML note that the likelihood of the u vector is a multivariate normal

    f_U(u) = (2πσ²)^{−T/2} |Ω|^{−1/2} exp[ −(1/2σ²) u'Ω⁻¹u ]

but since the Jacobian of the transformation between y and u is just I, we have

    f_Y(y) = (2πσ²)^{−T/2} |Ω|^{−1/2} exp[ −(1/2σ²) (y − Xβ)'Ω⁻¹(y − Xβ) ]

The likelihood is maximized by choosing β, σ² and ρ to maximize f or ln L. Similarly, one
could do ML for any heteroskedastic model. However, ML tends to be computationally
burdensome.
4.5. Testing for Heteroskedasticity
4.5.1. White’s Test
Take the residuals from OLS (the restricted model), and square them to get e2t .
Regress these on all unique combinations in x_t ⊗ x_t. If there are p regressors in all, then TR²
from that regression will have a χ²(p − 1) distribution under the null of homoskedasticity.
Effectively the hypotheses being compared here are H₀: σ_t² = σ², ∀t, vs. H₁: σ_t² ≠ σ_s² for
at least one s ≠ t. Unfortunately (?), this test tends to pick up all kinds of specification
error, and is not suggestive about the source of heteroskedasticity.
4.5.2. Goldfeld-Quandt (A Special Case)
Here the null is H₀: σ_t² = σ², ∀t while the alternative is H₁: σ_t² = σ₁², ∀t ∈ τ,
σ_t² = σ₂², ∀t ∉ τ, where τ is some subsample. Here the idea is to estimate 2 separate
OLS regressions on the two subsamples and compute separate estimates σ̂₁² and σ̂₂². Under
H₀ the ratio [(T₁ − k)σ̂₁²/σ²(T₁ − k)] / [(T₂ − k)σ̂₂²/σ²(T₂ − k)] = σ̂₁²/σ̂₂² will be distributed
F(T₁ − k, T₂ − k). This would be tested using a 2-sided rejection region since F could be
small or large under the alternative.
4.5.3. Breusch-Pagan
Here the model for the variance is σ_t² = exp(α₁ + z_t'α₂). The relevant hypotheses are
H₀: α₂ = 0 versus H₁: α₂ ≠ 0. Assuming that α₂ is an s × 1 vector, it turns out that if you
regress ln(e_t²/σ̂²) on z_t then (SST − SSE)/2 from this regression is asymptotically χ²(s).
As it turns out, so is TR² from a regression of the squared residuals on a constant and z_t.
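The TR² version of the test can be sketched as below. This is a minimal illustration of my own, assuming a single variance regressor; under H₀ the statistic is asymptotically χ²(s) with s the number of columns of Z.

```python
import numpy as np

def breusch_pagan_TR2(e, Z):
    """TR^2 version of the Breusch-Pagan test: regress the squared OLS
    residuals on a constant and the variance regressors Z (T x s), and
    return T times the R^2 of that regression."""
    T = len(e)
    W = np.column_stack([np.ones(T), Z])
    u = e**2
    g = np.linalg.solve(W.T @ W, W.T @ u)
    ssr = np.sum((u - W @ g)**2)
    sst = np.sum((u - u.mean())**2)
    return T * (1.0 - ssr / sst)
```

On homoskedastic residuals the statistic behaves like a χ²(s) draw; when the variance really moves with z it blows up.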
4.6. Testing for Autocorrelation
4.6.1. Durbin-Watson
The test statistic is given by

    d = Σ_{t=2}^T (e_t − e_{t−1})² / Σ_{t=1}^T e_t² = 2(1 − r) − (e₁² + e_T²)/Σ_{t=1}^T e_t²
Note that 0 < d < 4. Near 0 suggests positive serial correlation. Near 4 suggests negative
serial correlation. The distribution of d in any finite sample depends on the data matrix.
However, upper and lower bounds for relevant critical values have been obtained for dif-
ferent sample sizes and numbers of regressors. SHAZAM can actually compute the critical
value appropriate for a particular case numerically when you run your job. This test is
not valid when the X data matrix is not fixed, especially with lagged dependent variables.
4.6.2. Box-Pierce
This is a test for general autocorrelation of any variety in the data. I.e. it tests a
null of no autocorrelation versus an alternative of any autocorrelation. The test statistic
is Q = T Σ_{j=1}^L r_j², where r_j = Σ_{t=j+1}^T e_t e_{t−j} / Σ_{t=1}^T e_t². It turns out that in large samples
Q ∼ χ²(L).
4.6.3. Ljung-Box
Just like the Box-Pierce test except Q = T(T + 2) Σ_{j=1}^L r_j²/(T − j).
You might not see any reason to choose between (2) and (3). These tests have
different small sample properties. There is also the problem of how to choose L. It is an
arbitrary choice. Obviously it cannot be chosen too large, because with large L we are
using fewer and fewer observations to estimate r_j. On the other hand too small an L can miss
autocorrelation at more distant lags. The Ljung-Box appears to be closer to a χ²(L) in
small samples.
In general you do not need to use any of these tests. You can use any convenient LR,
Wald, or LM test if you estimate by ML. One of the easiest ways to do this is to estimate
under the null (restricted) model. This lets you use OLS to estimate β. Typically it is
straightforward to derive an appropriate LM test which lets you do some easy regression
involving the residuals, and then to take TR2 as the test statistic.
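Both Q statistics are easy to compute directly from the residuals. The sketch below is my own illustration of the formulas in the text, not code from the notes:

```python
import numpy as np

def q_stats(e, L):
    """Box-Pierce and Ljung-Box statistics for residuals e, summing over
    autocorrelations at lags 1..L; both are asymptotically chi^2(L) under
    the null of no autocorrelation."""
    e = np.asarray(e, dtype=float)
    T = len(e)
    denom = np.sum(e**2)
    r = np.array([np.sum(e[j:] * e[:-j]) / denom for j in range(1, L + 1)])
    q_bp = T * np.sum(r**2)
    q_lb = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, L + 1)))
    return q_bp, q_lb
```

For a strongly alternating series of length 8 (r₁ = −7/8) the two statistics differ noticeably, which is exactly the small-sample discrepancy discussed above.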
5. Systems of Equations
5.1. Seemingly Unrelated Regression Equations
Suppose we have a set of equations rather than a single equation. For example,
forgetting about simultaneous equations bias suppose you had a set of demand equations
for several goods.
    y_it = x_it'β_i + ε_it,   i = 1, ..., n,  t = 1, ..., T

or

    y₁ = X₁β₁ + ε₁
    y₂ = X₂β₂ + ε₂
    ...
    y_n = X_nβ_n + ε_n,

where the y_i are T × 1 vectors, each X_i is a T × k_i matrix, each β_i is a k_i × 1 vector and
the ε_i are T × 1. This can be rewritten as

    [y₁]   [X₁  0  ...  0  ] [β₁]   [ε₁]
    [y₂] = [0   X₂ ...  0  ] [β₂] + [ε₂]
    [⋮ ]   [⋮   ⋮   ⋱   ⋮  ] [⋮ ]   [⋮ ]
    [y_n]  [0   0  ...  X_n] [β_n]  [ε_n]

or

    y = Xβ + ε,

where y and ε are nT × 1 vectors, X is an nT × Σ_i k_i matrix, and β is a Σ_i k_i × 1 vector.

Assumptions about the error terms: E(ε_it) = 0, ∀i, t, and
    E(ε_it ε_js) = σ_ij if t = s, and 0 otherwise.

Let

    Σ = [σ₁₁   σ₁₂   ...  σ₁n ]
        [σ₂₁   σ₂₂   ...  σ₂n ]
        [⋮     ⋮     ⋱    ⋮   ]
        [σ_n1  σ_n2  ...  σ_nn]
Then

    E(εε') = [E(ε₁ε₁')   E(ε₁ε₂')   ...  E(ε₁ε_n') ]
             [E(ε₂ε₁')   E(ε₂ε₂')   ...  E(ε₂ε_n') ]
             [⋮                          ⋮         ]
             [E(ε_nε₁')  E(ε_nε₂')  ...  E(ε_nε_n')]

           = [σ₁₁I_T   σ₁₂I_T   ...  σ₁nI_T ]
             [σ₂₁I_T   σ₂₂I_T   ...  σ₂nI_T ]
             [⋮                      ⋮      ]
             [σ_n1I_T  σ_n2I_T  ...  σ_nnI_T]

           = Σ ⊗ I_T = Φ
If Σ is known, then Φ is known and we just have a huge GLS model: β̂_G =
(X'Φ⁻¹X)⁻¹X'Φ⁻¹y. If Σ is not known we do something like what we did in simple
GLS. Estimate the model by OLS equation by equation. Get the n residual series e_i.
Estimate σ̂_ij = e_i'e_j/T. Then construct Σ̂ = [σ̂_ij] and Φ̂ = Σ̂ ⊗ I_T. Then construct
β̂_G = (X'Φ̂⁻¹X)⁻¹X'Φ̂⁻¹y. Why do we do it all simultaneously rather than equation by
equation? Because you can exploit the information in the error structure to help predict
the errors in the other equations.
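The feasible SUR estimator just described can be sketched as follows; this is an illustrative implementation of my own, written for small systems (it builds Φ̂⁻¹ explicitly, which you would avoid for large nT):

```python
import numpy as np

def sur_fgls(ys, Xs):
    """Feasible GLS for the SUR system: OLS equation by equation,
    sigma_ij-hat = e_i'e_j / T, then
    beta_G = (X' Phi^-1 X)^-1 X' Phi^-1 y with Phi-hat = Sigma-hat kron I_T."""
    n, T = len(ys), len(ys[0])
    # equation-by-equation OLS residuals, stored as a T x n matrix
    E = np.column_stack([y - X @ np.linalg.solve(X.T @ X, X.T @ y)
                         for y, X in zip(ys, Xs)])
    Sigma = E.T @ E / T
    # stack the system into the big block-diagonal X
    y = np.concatenate(ys)
    Xbig = np.zeros((n * T, sum(X.shape[1] for X in Xs)))
    col = 0
    for i, X in enumerate(Xs):
        Xbig[i * T:(i + 1) * T, col:col + X.shape[1]] = X
        col += X.shape[1]
    Phi_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
    return np.linalg.solve(Xbig.T @ Phi_inv @ Xbig, Xbig.T @ Phi_inv @ y)
```

A nice check is case 2 below: when every equation has the same regressor matrix Z, the SUR/GLS answer collapses to equation-by-equation OLS no matter what Σ̂ is.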
Some rules about Kronecker products, for A n × n, B T × T, and conformable C and D:
1. (A ⊗ B)(C ⊗ D) = AC ⊗ BD
2. A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
3. A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C)
4. (B + C) ⊗ A = (B ⊗ A) + (C ⊗ A)
5. (A ⊗ B)' = A' ⊗ B'
6. (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹
When is OLS equation by equation the same as GLS?

1. When Σ is a diagonal matrix. Notice that in this case the matrix Φ is a block diagonal
matrix. As a result

    Φ⁻¹ = [σ₁₁⁻¹I_T   0          ...  0        ]
          [0          σ₂₂⁻¹I_T   ...  0        ]
          [⋮                     ⋱    ⋮        ]
          [0          0          ...  σ_nn⁻¹I_T]
Therefore

    X'Φ⁻¹X = [σ₁₁⁻¹X₁'X₁   0            ...  0           ]
             [0            σ₂₂⁻¹X₂'X₂   ...  0           ]
             [⋮                         ⋱    ⋮           ]
             [0            0            ...  σ_nn⁻¹X_n'X_n]

and

    (X'Φ⁻¹X)⁻¹ = [σ₁₁(X₁'X₁)⁻¹   0              ...  0             ]
                 [0              σ₂₂(X₂'X₂)⁻¹   ...  0             ]
                 [⋮                             ⋱    ⋮             ]
                 [0              0              ...  σ_nn(X_n'X_n)⁻¹]

and

    X'Φ⁻¹y = [σ₁₁⁻¹X₁'y₁ ]
             [σ₂₂⁻¹X₂'y₂ ]
             [⋮          ]
             [σ_nn⁻¹X_n'y_n]

so that

    β̂_G = [(X₁'X₁)⁻¹X₁'y₁ ]
           [(X₂'X₂)⁻¹X₂'y₂ ]
           [⋮              ]
           [(X_n'X_n)⁻¹X_n'y_n]

2. X_i = Z, ∀i. I.e. the regressors are the same in every equation, so X = I_n ⊗ Z. Then

    β̂_G = [(I_n ⊗ Z)'(Σ ⊗ I_T)⁻¹(I_n ⊗ Z)]⁻¹(I_n ⊗ Z)'(Σ ⊗ I_T)⁻¹y
         = [(I_n ⊗ Z')(Σ⁻¹ ⊗ I_T)(I_n ⊗ Z)]⁻¹(I_n ⊗ Z')(Σ⁻¹ ⊗ I_T)y
         = [(Σ⁻¹ ⊗ Z')(I_n ⊗ Z)]⁻¹(Σ⁻¹ ⊗ Z')y
         = (Σ⁻¹ ⊗ Z'Z)⁻¹(Σ⁻¹ ⊗ Z')y
         = [Σ ⊗ (Z'Z)⁻¹](Σ⁻¹ ⊗ Z')y
         = [I_n ⊗ (Z'Z)⁻¹Z']y

which is just OLS equation by equation.

3. Kruskal's Theorem. This applies to standard GLS problems, but to this one also. If there
exists a nonsingular G (k × k) such that ΩX = XG, then β̂_G = β̂_O. Notice that ΩX = XG
implies Ω⁻¹X = XG⁻¹, and hence X'Ω⁻¹ = G⁻¹'X'. Then

    β̂_G = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
         = (G⁻¹'X'X)⁻¹G⁻¹'X'y
         = (X'X)⁻¹G'G⁻¹'X'y
         = (X'X)⁻¹X'y
So in the context of SURE we're looking for a G of dimension nk × nk such that (Σ ⊗ I_T)X = XG.
5.2. Panel Data
In this case we have observations which are sorted both by time t and by individual i.
However, the coefficients do not vary across the individuals, except in very specific ways.
yit = x0itβ + z0iγ + αi + it.
The αi are unobservable effects which vary by individual. Therefore, we can think of them
as unobservable person specific regression coefficients. However, β and γ do not vary by
individual.
We will assume that E(ε_it) = 0, E(ε_it ε_js) = 0 for i ≠ j or t ≠ s, and E(ε_it²) = σ², ∀i, t. There are two ways to approach the estimation:
a. Fixed Effects - treat the αi as unknown constants, and put in dummy variables, di,
which are 1 in observations involving person i and 0 otherwise.
b. Random effects - assume that the differences in α across individuals are random, and
make assumptions about the distribution they come from.
5.2.1. Fixed Effects
The model can be written in stacked form (stacking the T observations for individual 1,
then those for individual 2, and so on) as

    y = Xβ + Zγ + Dα + ε

where y is the nT × 1 vector of the y_it, X stacks the x_it', Z stacks the z_i' (each z_i'
repeated T times), D is the nT × n matrix of individual dummies d_i, α = (α₁, α₂, ..., α_n)',
and ε stacks the ε_it.
If we could estimate the whole thing in one big block it would be BLUE. Least squares
involves minimizing S = Σ_i Σ_t e_it² = Σ_i Σ_t (y_it − x_it'β − z_i'γ − α_i)². Differentiate
with respect to α_j:

    ∂S/∂α_j = −2 Σ_t (y_jt − x_jt'β − z_j'γ − α_j) = 0.
This is easily solved for α_j:

    α_j = (1/T)(Σ_t y_jt − Σ_t x_jt'β − Σ_t z_j'γ) = ȳ_j − x̄_j'β − z_j'γ

When I substitute this back into S I get S = Σ_i Σ_t [y_it − ȳ_i − (x_it − x̄_i)'β]². As a result,
γ is not identified because z_j is time invariant. Therefore, in fixed effects stuff you have to
drop any variables which are individual specific and don't vary in the time dimension. Just
do least squares after subtracting the means across individuals from y and x as suggested
in the solution for α_j above.
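The within (demeaning) estimator just described can be sketched as below. This is a minimal illustration of my own; the recovery of the α_j follows the formula α_j = ȳ_j − x̄_j'β.

```python
import numpy as np

def within_estimator(y, X, ids):
    """Fixed-effects (within) estimator: subtract individual means from
    y and X, then run OLS on the demeaned data. Time-invariant regressors
    are wiped out by the demeaning and must be excluded from X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    yd, Xd = y.copy(), X.copy()
    for i in np.unique(ids):
        m = ids == i
        yd[m] -= y[m].mean()
        Xd[m] -= X[m].mean(axis=0)
    beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)
    # recover the fixed effects: alpha_i = ybar_i - xbar_i' beta
    alphas = {i: y[ids == i].mean() - X[ids == i].mean(axis=0) @ beta
              for i in np.unique(ids)}
    return beta, alphas
```

On noiseless data generated with individual intercepts (1, 2, 3) and slope 2, the estimator recovers both exactly.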
A simple test for whether fixed effects exist would be to test whether all the α's are
the same. This is n − 1 restrictions. Just run the restricted regression and the unrestricted
regression and do an F-test of the restrictions.

Summarizing fixed effects: individual specific effects are constants, which implies you can't
estimate γ. Furthermore, the assumed lack of covariance across individuals means you
can't exploit it to improve estimates of β.
5.2.2. Random Effects
In this case the α's are random variables. The model is y_it = x_it'β + z_i'γ + α_i + ε_it =
α + x_it'β + z_i'γ + u_it, where u_it = (α_i − α) + ε_it. The same assumptions are made about
ε_it. In addition ε_it is independent of α_i. The distribution of α_i is such that E(α_i) = α,
∀i, E(α_i − α)² = σ_α², and the α_i are i.i.d. across individuals. Therefore, E(u_it) = 0,
E(u_it²) = σ_α² + σ², E(u_it u_is) = σ_α² for t ≠ s, and E(u_it u_js) = 0 for i ≠ j.

If we write the model as a whole we have y = α1 + Xβ + Zγ + u = Wθ + u. Do the
stacking by individual. E(uu') = I_n ⊗ Σ = Ω, where

    Σ = [σ_α² + σ²   σ_α²        ...]
        [σ_α²        σ_α² + σ²   ...]
        [⋮                       ⋱  ]
      = σ²I_T + σ_α² 1 1'

with 1 a T × 1 vector of ones. To estimate the model you just do a big GLS.
6. Nonlinear Models
What if the model we have is nonlinear of the form yt = f(xt, β) + t. We will have
two ways of estimating models like this. The first method is called nonlinear least squares,
while the other method is maximum likelihood.
6.1. Nonlinear Least Squares
The criterion function as before is the sum of squared residuals,

    S(β) = Σ_t e_t² = Σ_t [y_t − f(x_t, β)]² = [y − f(X, β)]'[y − f(X, β)].

We will make the usual assumptions about X and ε. If we minimize the sum of squared
residuals by choosing β the first order condition is

    ∂S(β)/∂β = −2 (∂f(X, β)/∂β)'[y − f(X, β)] = 0.
Do a first-order Taylor series expansion of f(X, β) around β = β₁:

    f(X, β) ≈ f(X, β₁) + (∂f(X, β₁)/∂β)(β − β₁)

which implies that

    y ≈ f(X, β₁) + (∂f(X, β₁)/∂β)(β − β₁) + ε
    y − f(X, β₁) + (∂f(X, β₁)/∂β)β₁ ≈ (∂f(X, β₁)/∂β)β + ε
    y* ≈ (∂f(X, β₁)/∂β)β + ε

Notice that for any β₁, ∂f(X, β₁)/∂β and y* are data. Therefore starting from some arbitrary
β₁, create ∂f(X, β₁)/∂β and y*. Then run a regression and get a new estimate

    β₂ = [(∂f(X, β₁)/∂β)'(∂f(X, β₁)/∂β)]⁻¹(∂f(X, β₁)/∂β)'y*
       = [(∂f(X, β₁)/∂β)'(∂f(X, β₁)/∂β)]⁻¹(∂f(X, β₁)/∂β)'[y − f(X, β₁)] + β₁
In general you keep going, following the rule

    β_{n+1} = β_n + [(∂f(X, β_n)/∂β)'(∂f(X, β_n)/∂β)]⁻¹(∂f(X, β_n)/∂β)'[y − f(X, β_n)]
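The iteration above can be sketched as follows. This is an illustrative implementation of my own; the user supplies the model f and its T × k derivative matrix, and the example model (an exponential regression, fit here to noiseless data) is just an assumption for demonstration.

```python
import numpy as np

def gauss_newton(f, jac, y, beta0, tol=1e-10, max_iter=100):
    """Gauss-Newton iterations for nonlinear least squares:
    beta_{n+1} = beta_n + (F'F)^{-1} F'(y - f(beta_n)),
    where F is the T x k matrix df/dbeta evaluated at beta_n."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        F = jac(beta)
        step = np.linalg.solve(F.T @ F, F.T @ (y - f(beta)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example model: f(x, beta) = beta_1 * exp(beta_2 * x)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)          # noiseless data from beta = (2, -1.5)
f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
bhat = gauss_newton(f, jac, y, [1.5, -1.0])
```

Because the data are noiseless, the algorithm converges to the generating parameters.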
Suppose that at some point β_{n+1} = β_n, i.e. the algorithm converges. What does that
imply? It must be the case that

    (∂f(X, β_n)/∂β)'[y − f(X, β_n)] = 0

which simply means the first-order condition for minimization is satisfied. Can we be sure
it is a minimum? It can't be a maximum because we'll assume that ∂f(X, β)/∂β has full column
rank, which means [(∂f/∂β)'(∂f/∂β)]⁻¹ is positive definite. This means you are always going
downhill. On the other hand, you can't show that it's not just a local minimum. Under regularity
conditions the asymptotic variance-covariance matrix of β̂_N is consistently estimated by

    σ̂² [(∂f(X, β̂)/∂β)'(∂f(X, β̂)/∂β)]⁻¹

where σ̂² = S(β̂_N)/(T − k). What we have just described is the Gauss-Newton algorithm.

Next we will examine the Newton-Raphson algorithm. Here we do a second-order Taylor
series expansion of S(β) around β = β₁:
    S(β) ≈ S(β₁) + (∂S(β₁)/∂β)'(β − β₁) + (1/2)(β − β₁)'(∂²S(β₁)/∂β∂β')(β − β₁).

Suppose we tried to minimize with respect to β:

    ∂S(β)/∂β ≈ ∂S(β₁)/∂β + (∂²S(β₁)/∂β∂β')(β − β₁) = 0
    β − β₁ ≈ −[∂²S(β₁)/∂β∂β']⁻¹ ∂S(β₁)/∂β
    β₂ = β₁ − H(β₁)⁻¹ ∂S(β₁)/∂β

or in general

    β_{n+1} = β_n − H(β_n)⁻¹ ∂S(β_n)/∂β.
If the algorithm ever converges what will be true? Clearly, ∂S(β_n)/∂β = 0, which solves the
first order condition. Is it a minimum? It depends on whether H is positive definite. If it
is positive definite everywhere then you could guarantee it, but usually it's only locally
a minimum because H is non-p.d. in places. There are some methods for always ensuring
that at each step you go downhill. These are usually of the form

    β_{n+1} = β_n − t_n H(β_n)⁻¹ ∂S(β_n)/∂β

where at each n, t_n is varied until S(β_{n+1}) < S(β_n), often by trying the sequence
1, 0.5, 0.25, 0.125, .... The variance of β̂_N is consistently estimated by 2σ̂²H(β̂_N)⁻¹
where σ̂² = S(β̂_N)/(T − k).
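The step-halving safeguard can be sketched as follows; this is an illustrative helper of my own with a generic objective, gradient, and Hessian supplied by the user.

```python
import numpy as np

def newton_with_halving(S, grad, hess, b0, max_iter=50):
    """Newton-Raphson with the 1, 0.5, 0.25, ... step-length rule: shrink
    t_n until S(beta_{n+1}) < S(beta_n), so every accepted step goes downhill."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        g = grad(b)
        if np.max(np.abs(g)) < 1e-12:
            break                         # first-order condition satisfied
        direction = np.linalg.solve(hess(b), g)
        t = 1.0
        while S(b - t * direction) >= S(b) and t > 1e-8:
            t *= 0.5
        b = b - t * direction
    return b
```

On a positive definite quadratic the full Newton step (t = 1) is always accepted and the minimizer is reached in one iteration.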
What is the relationship between Gauss-Newton and Newton-Raphson? Notice that

    H(β) = ∂²S/∂β∂β' = (∂/∂β')( −2 Σ_{t=1}^T [y_t − f(x_t, β)] ∂f(x_t, β)/∂β )
         = 2 Σ_{t=1}^T (∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)' − 2 Σ_{t=1}^T [y_t − f(x_t, β)] ∂²f(x_t, β)/∂β∂β'

Therefore,

    E[H(β)] = 2 (∂f(X, β)/∂β)'(∂f(X, β)/∂β),

which shows you that the methods are really quite similar. In general there are a multitude
of methods of the form

    β_{n+1} = β_n − t_n P_n ∂S(β_n)/∂β.

Note that even the Gauss-Newton method fits into this form. See Judge Appendix B for
a description of all the methods.
6.2. Maximum Likelihood
Since the model is of the form y = f(X, β) + ε we can simply take the likelihood for
the error terms and replace them by y − f(X, β). Therefore,

    ln L = −(T/2) ln 2π − (T/2) ln σ² − (1/2σ²) S(β).
As usual we'll get σ̂² = S(β̂)/T, but the formula for β̂ is not a closed form. Plug the
solution for σ̂² into ln L and remaximize:

    ln L* = −(T/2) ln 2π − (T/2)(1 + ln T⁻¹) − (T/2) ln S(β).
What’s clear is that maximizing the likelihood is equivalent to minimizing the sum of
squares. There are three methods for finding the MLE for β. Going back to the original
likelihood
6.2.1. Newton-Raphson
In this case you really just do the same thing as you do in NLLS. Notice that

    ∂ln L/∂β = −(1/2σ²) ∂S(β)/∂β
    ∂²ln L/∂β∂β' = −(1/2σ²) ∂²S(β)/∂β∂β'

Therefore,

    β_{n+1} = β_n − [∂²ln L(β_n)/∂β∂β']⁻¹ ∂ln L(β_n)/∂β
            = β_n − [∂²S(β_n)/∂β∂β']⁻¹ ∂S(β_n)/∂β

Here the estimator for the covariance matrix is given by

    2σ̂² [∂²S(β̂)/∂β∂β']⁻¹

6.2.2. Method of Scoring
This method uses the inverse of the expectation of the Hessian instead of the inverse
of the Hessian to weight the first derivatives:

    β_{n+1} = β_n − [E ∂²ln L(β_n)/∂β∂β']⁻¹ ∂ln L(β_n)/∂β
            = β_n − [−E (1/2σ²) ∂²S(β_n)/∂β∂β']⁻¹ (−(1/2σ²)) ∂S(β_n)/∂β
            = β_n − (1/2)[(∂f(X, β_n)/∂β)'(∂f(X, β_n)/∂β)]⁻¹ ∂S(β_n)/∂β

Here the estimator for the covariance matrix is given by

    σ̂² [(∂f(X, β̂)/∂β)'(∂f(X, β̂)/∂β)]⁻¹
6.2.3. Berndt, Hall, Hall and Hausman (BHHH)

This is a method with a seemingly arbitrary choice of weighting matrix but it does
relate to the method of scoring. Define

    l_t = −(1/2) ln 2π − (1/2) ln σ² − [y_t − f(x_t, β)]²/(2σ²)

so that Σ_t l_t = ln L. Here

    β_{n+1} = β_n − [−Σ_t (∂l_t/∂β)(∂l_t/∂β')]⁻¹|_{β_n,σ_n} ∂ln L(β_n)/∂β
            = β_n − (1/2σ_n²) [Σ_t ([y_t − f(x_t, β)]²/σ⁴)(∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)']⁻¹|_{β_n,σ_n} ∂S(β_n)/∂β
            = β_n − (σ_n²/2) [Σ_t [y_t − f(x_t, β)]² (∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)']⁻¹|_{β_n,σ_n} ∂S(β_n)/∂β

Notice that

    E[Σ_t (∂l_t/∂β)(∂l_t/∂β')] = (1/σ²) Σ_t (∂f(x_t, β)/∂β)(∂f(x_t, β)/∂β)'

which looks just like the method of scoring. Therefore, the estimate of the variance-covariance
matrix is just

    σ̂⁴ [Σ_t [y_t − f(x_t, β̂)]² (∂f(x_t, β̂)/∂β)(∂f(x_t, β̂)/∂β)']⁻¹
7. Stochastic Regressors
7.1. Independent Regressors
We now change one of the previously vital assumptions. Up until now we have been
assuming that the matrix X is fixed. Now we will instead assume that it is stochastic, but
also that X is independent of ε. This implies that any function of X is independent of any
function of ε. In the plain linear model we have y = Xβ + ε. Therefore, the OLS estimator
is β̂ = (X'X)⁻¹X'y = β + (X'X)⁻¹X'ε. Therefore, E(β̂) = β + E[(X'X)⁻¹X']E(ε) = β.
Furthermore, we have

    E(e'e) = E(ε'Mε)
           = E[tr(ε'Mε)]
           = E[tr(Mεε')]
           = tr[E(M)E(εε')]
           = σ² tr(EM) = σ² E tr(M)
           = σ²(T − k).

Therefore, E(σ̂²) = σ². And

    V(β̂) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
          = σ² E(X'X)⁻¹.

One can estimate this by σ̂²(X'X)⁻¹, which is also unbiased.
7.2. Nonindependent Regressors (of a particular form)
In this case we imagine a world where y = Xβ + ε but our previous assumptions don't
hold. Now we only assume that ε_t is i.i.d. mean zero, variance σ², with plim ε'ε/T = σ²,
plim X'X/T = Σ_xx and plim X'ε/T = 0. Under these circumstances

    plim β̂ = plim (X'X)⁻¹X'y
            = β + plim (X'X/T)⁻¹(X'ε/T)
            = β + plim (X'X/T)⁻¹ plim (X'ε/T)
            = β + Σ_xx⁻¹ · 0 = β.

You can check the textbook for proofs that the estimates of the variances converge in
probability to the right objects. One last thing is the asymptotic distribution. We have

    √T(β̂ − β) = (X'X/T)⁻¹ (X'ε/√T).

Therefore, by the quick and dirty asymptotic method we have an estimator which will be
asymptotically normal and the vcv matrix will be σ²Σ_xx⁻¹.
7.3. Instrumental Variables
7.3.1. Errors in Variables
Let us discuss all this in a univariate setting. Suppose the model is y_t = x_tβ + ε_t.
However, you do not observe x_t but a measurement-error-ridden version of it called z_t =
x_t + u_t. Therefore, we can write y_t = z_tβ + ε_t − u_tβ = z_tβ + v_t. We'll assume the ε's and
the u's are i.i.d. and independent of each other and of x_t. If the econometrician regresses y_t
on z_t he gets

    β̂ = Σz_t y_t / Σz_t² = Σz_t(z_tβ + v_t) / Σz_t² = β + Σz_t v_t / Σz_t²

    β̂ − β = (T⁻¹ Σz_t v_t) / (T⁻¹ Σz_t²)

Now T⁻¹Σz_t v_t = T⁻¹Σ(x_t + u_t)(ε_t − u_tβ) →p −βσ_u². Also, T⁻¹Σz_t² = T⁻¹Σ(x_t + u_t)² →p σ_x² +
σ_u². Therefore, β̂ − β →p −σ_u²β/(σ_x² + σ_u²). This means that β̂ is not consistent except when
β = 0. As the signal-to-noise ratio σ_x²/σ_u² gets small, plim β̂ → 0, while as it gets
large, plim β̂ → β. The basic problem arises because the regressor z_t is not orthogonal to
the error term v_t.
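The attenuation result is easy to see in a simulation. This is an illustrative sketch of my own; with equal signal and noise variances the plim of the OLS slope is βσ_x²/(σ_x² + σ_u²) = β/2.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200_000
beta = 1.0
sig2_x, sig2_u = 1.0, 1.0                 # equal signal and noise variance
x = rng.normal(scale=np.sqrt(sig2_x), size=T)
z = x + rng.normal(scale=np.sqrt(sig2_u), size=T)   # mismeasured regressor
y = x * beta + rng.normal(size=T)

b_hat = np.sum(z * y) / np.sum(z**2)      # OLS of y on the noisy z
plim = beta * sig2_x / (sig2_x + sig2_u)  # = 0.5 here
```

With T this large the estimate sits essentially on top of its probability limit, well below the true β.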
7.3.2. Instrumental Variables
Now we will consider general models of the form y_t = x_tβ + ε_t where plim T⁻¹Σx_tε_t =
σ_xε ≠ 0. This will imply that OLS is inconsistent.

What is instrumental variables estimation? The trick is to find some variable z_t s.t.
(1) T⁻¹Σz_tε_t →p 0,
(2) T⁻¹Σz_tx_t →p σ_zx ≠ 0.
The IV estimator is generated by the following formula:

    β̂_IV = Σz_t y_t / Σz_t x_t.

As a result, if we substitute in the definition of y_t we get

    β̂_IV = Σz_t(x_tβ + ε_t) / Σz_t x_t = β + Σz_tε_t / Σz_t x_t

    β̂_IV − β = (T⁻¹Σz_tε_t) / (T⁻¹Σz_t x_t) →p 0.

Also as far as distribution theory is concerned we'll get

    √T(β̂_IV − β) = (T^{−1/2}Σz_tε_t) / (T⁻¹Σz_t x_t).

We can for the purposes of this course assert that T^{−1/2}Σz_tε_t →d N(0, σ²σ_z²) while
T⁻¹Σz_t x_t →p σ_zx. Therefore,

    √T(β̂_IV − β) →d N(0, σ²σ_z²/σ_zx²).
Notice that the variance is equal to

    σ²σ_z²/σ_zx² = (σ²/σ_x²)(σ_x²σ_z²/σ_zx²) = V_OLS/ρ_zx²

Obviously, OLS is just IV with X as the instrument. We want a Z which is as highly
correlated with X as possible without being correlated with ε.
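The univariate IV formula can be illustrated with a small simulation of my own design (the data-generating process is an assumption chosen so that cov(x, ε) = 1, making the OLS plim β + 1/3 while IV remains consistent):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100_000
beta = 2.0
z = rng.normal(size=T)                  # instrument: correlated with x,
eps = rng.normal(size=T)                # uncorrelated with the error
x = z + rng.normal(size=T) + eps        # regressor correlated with eps
y = x * beta + eps

b_ols = np.sum(x * y) / np.sum(x**2)    # inconsistent: plim = beta + 1/3
b_iv = np.sum(z * y) / np.sum(z * x)    # consistent
```

The contrast between the two estimates is exactly the orthogonality failure discussed in the errors-in-variables example.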
In a multivariate setting the story is really no different. We have the model y =
Xβ + ε. We define Σ_xx = plim T⁻¹X'X and Σ_xε = plim T⁻¹X'ε ≠ 0. Suppose we have
the replacements for the X matrix, called Z (T × k). We'll define Σ_zx = plim T⁻¹Z'X and
Σ_zz = plim T⁻¹Z'Z, which are assumed to have full rank. Furthermore, Σ_zε = plim T⁻¹Z'ε = 0.
Also assume enough regularity conditions such that T^{−1/2}Z'ε →d N(0, σ²Σ_zz). Then

    β̂_IV = (Z'X)⁻¹Z'y
          = (Z'X)⁻¹(Z'Xβ + Z'ε)
          = β + (Z'X)⁻¹Z'ε

    β̂_IV − β = (Z'X)⁻¹Z'ε

Clearly plim(β̂_IV − β) = 0. Furthermore,

    √T(β̂_IV − β) →d N(0, σ²Σ_zx⁻¹Σ_zzΣ_xz⁻¹)
One natural question that arises is how to choose the instrument which replaces each
column in the matrix X. Suppose I have l variables (called W's) which I can reasonably
assume to be uncorrelated with the ε's and which are correlated with the X's. If l >
k, then clearly we have more candidates than we have room for. How can we reduce
the number from l to k in the optimal way? One way to think about this is to look for
the k linear combinations of the l variables which minimize the size of the variance-
covariance matrix of β̂_IV. I.e. the goal is to construct a matrix Z (T × k) = W (T × l) A (l × k)
in such a way as to minimize the variance-covariance matrix. Clearly this requires choosing
A optimally. Now it's clear that Σ_zz = plim T⁻¹Z'Z = plim T⁻¹A'W'WA = A'Σ_ww A. Also,
Σ_zx = plim T⁻¹Z'X = plim T⁻¹A'W'X = A'Σ_wx. Therefore, the variance of the IV estimator
is

    V = σ²(A'Σ_wx)⁻¹A'Σ_ww A(Σ_xw A)⁻¹,

which can be minimized by choosing A* = Σ_ww⁻¹Σ_wx, so that the variance is

    V* = σ²(Σ_xwΣ_ww⁻¹Σ_wx)⁻¹.

To prove that this is smallest in a cheesy way, consider the difference between its inverse
and the inverse of the variance for any other choice of A:

    V*⁻¹ − V⁻¹ ∝ Σ_xwΣ_ww⁻¹Σ_wx − Σ_xzΣ_zz⁻¹Σ_zx = plim T⁻¹ X'[W(W'W)⁻¹W' − Z(Z'Z)⁻¹Z']X

which is positive semidefinite as the matrix in the middle is idempotent. To check this
notice that

    [W(W'W)⁻¹W' − Z(Z'Z)⁻¹Z']² = W(W'W)⁻¹W' − W(W'W)⁻¹W'Z(Z'Z)⁻¹Z'
                                   − Z(Z'Z)⁻¹Z'W(W'W)⁻¹W' + Z(Z'Z)⁻¹Z'
                                 = W(W'W)⁻¹W' − WA(Z'Z)⁻¹Z' − Z(Z'Z)⁻¹A'W' + Z(Z'Z)⁻¹Z'
                                 = W(W'W)⁻¹W' − Z(Z'Z)⁻¹Z'.
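With A* = Σ_ww⁻¹Σ_wx, the sample counterpart of the optimal instrument is Z = W(W'W)⁻¹W'X, i.e. the fitted values from regressing each column of X on W — which is two-stage least squares. A minimal sketch of my own:

```python
import numpy as np

def two_sls(y, X, W):
    """IV with the optimal linear combination of the l instruments in W:
    Z = W (W'W)^{-1} W'X (first-stage fitted values), then
    beta = (Z'X)^{-1} Z'y."""
    Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
```

Two algebraic checks: with W = X the estimator is OLS, and in the exactly identified case (l = k) it collapses to the simple IV formula (W'X)⁻¹W'y.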
8. Time Series
We'll be looking at univariate time series. We're looking at a sequence {y_t}, t = 1, ..., T. We're
going to look at autoregressive moving average (ARMA) processes. These are processes
which allow us to characterize the distribution of y_t in terms of its own past. First we need
to define a certain type of stationarity.

A stochastic process is stationary if

    f(y_t, y_{t−1}, ..., y_{t−k}) = f(y_{t+s}, y_{t+s−1}, ..., y_{t+s−k})

for all s, k.

A stochastic process {y_t} is covariance stationary if E(y_t) = µ, Var(y_t) = γ₀ <
∞ and E[(y_t − µ)(y_{t−s} − µ)] = γ_s, ∀t.

Let's look at different types of time series processes.
8.1. Time Series Processes
8.1.1. White Noise
We will generically represent a white noise process with the symbol ε_t. It has the
following properties:

    E(ε_t) = 0
    γ_s = σ² if s = 0, and 0 if s ≠ 0
8.1.2. Autoregressive 1st Order - AR(1)
    y_t = φy_{t−1} + ε_t
        = φ(φy_{t−2} + ε_{t−1}) + ε_t
        = φ²y_{t−2} + ε_t + φε_{t−1}
        = φ^T y_{t−T} + Σ_{j=0}^{T−1} φ^j ε_{t−j}
If y_{t−T} were fixed then E(y_t) = φ^T y_{t−T}, which implies that if φ ≠ 1 then E(y_t) ≠ E(y_s)
for t ≠ s. Suppose then that y_{t−T} is not fixed for any T. If |φ| < 1 we can write

    y_t = Σ_{j=0}^∞ φ^j ε_{t−j}

and thus Ey_t = Ey_s = 0, ∀s, t.

    Var(y_t) = E[Σ_{j=0}^∞ φ^j ε_{t−j}]²
             = E[Σ_{j=0}^∞ φ^{2j} ε²_{t−j}]
             = (1 + φ² + φ⁴ + ...)σ²
             = σ²/(1 − φ²) = γ₀

    γ_s = E(y_t y_{t−s})
        = E[(φ^s y_{t−s} + ε_t + φε_{t−1} + ... + φ^{s−1}ε_{t−s+1}) y_{t−s}]
        = φ^s γ₀
We're going to use lag operator notation: Ly_t = y_{t−1}, L^i y_t = y_{t−i}.

    y_t = φLy_t + ε_t
    (1 − φL)y_t = ε_t
    y_t = (1 − φL)⁻¹ε_t
    y_t = (1 + φL + φ²L² + φ³L³ + ...)ε_t

The polynomial (1 − φL)⁻¹ exists iff ∀|Z| ≤ 1, (1 − φZ) ≠ 0, i.e. all the roots of
the polynomial lie outside the unit circle. Furthermore, an AR(1) process is covariance
stationary iff ∀|Z| ≤ 1, 1 − φZ ≠ 0.
8.1.3. Autoregressive pth Order - AR(p)
    y_t = φ₁y_{t−1} + φ₂y_{t−2} + ... + φ_p y_{t−p} + ε_t
    φ(L)y_t = ε_t

The polynomial φ(L)⁻¹ exists iff ∀|Z| ≤ 1, φ(Z) ≠ 0, i.e. all the roots of the polynomial
lie outside the unit circle. Furthermore, an AR(p) process is covariance stationary iff
∀|Z| ≤ 1, φ(Z) ≠ 0.
8.1.4. Moving Average 1st Order - MA(1)
    y_t = ε_t + θε_{t−1}
    Ey_t = 0
    Var(y_t) = Ey_t² = E(ε_t² + 2θε_tε_{t−1} + θ²ε²_{t−1}) = σ²(1 + θ²) = γ₀
    γ₁ = E[(ε_t + θε_{t−1})(ε_{t−1} + θε_{t−2})] = θσ²
    γ_s = 0, s > 1

An MA(1) process is always covariance stationary. On the other hand the value of
θ will matter for invertibility. We want to write (1 + θL)⁻¹y_t = ε_t. This MA
process is invertible iff ∀|Z| ≤ 1, 1 + θZ ≠ 0.
8.1.5. Moving Average qth Order - MA(q)
    y_t = ε_t + θ₁ε_{t−1} + ... + θ_qε_{t−q} = θ(L)ε_t
    Ey_t = 0
    Var(y_t) = Ey_t² = σ²(1 + θ₁² + ... + θ_q²) = γ₀
    γ_i = 0, i > q

An MA(q) process is always covariance stationary. We want to write θ(L)⁻¹y_t = ε_t.
This MA process is invertible iff ∀|Z| ≤ 1, θ(Z) ≠ 0.
8.1.6. ARMA(p,q)
    y_t = φ₁y_{t−1} + ... + φ_p y_{t−p} + ε_t + θ₁ε_{t−1} + ... + θ_qε_{t−q}
    φ(L)y_t = θ(L)ε_t

    STATIONARY: ∀|Z| ≤ 1, φ(Z) ≠ 0, in which case y_t = φ(L)⁻¹θ(L)ε_t
    INVERTIBLE: ∀|Z| ≤ 1, θ(Z) ≠ 0, in which case ε_t = θ(L)⁻¹φ(L)y_t
8.2. Wold Decomposition Theorem
Any covariance stationary stochastic process x_t can be represented as x_t = y_t +
z_t, where z_t is linearly deterministic [i.e. Var(z_t|z_{t−1}, ...) = 0] and y_t is purely non-
deterministic with an MA(∞) representation y_t = ψ(L)ε_t where Σ_{j=1}^∞ ψ_j² < ∞.

ARMA models represent approximations to the MA(∞), i.e. ψ(L) ≈ φ(L)⁻¹θ(L).
8.3. Box-Jenkins Procedure
The procedure consists of three basic parts
(1) Identification (Model Selection) — make a judgment, based on the autocorrelations and
partial autocorrelations, about the model that could have generated them,
(2) Estimation — estimate the model chosen in (1),
(3) Diagnostic Checking — do various diagnostic checks to determine whether the resid-
uals resemble white noise.
8.3.1. Model Selection
We’ll select models based on the autocorrelations and partial autocorrelations of a
time series. We first need to define what these objects are.
The autocorrelation function (at lag i) is defined as ρ_i = γ_i/γ₀. Notice that |ρ_i| ≤ 1,
ρ_{−i} = ρ_i and ρ₀ = 1.

    WN:     ρ₀ = 1,  ρ_i = 0 for i ≠ 0
    AR(1):  ρ₀ = 1,  ρ_i = φ^i
    MA(1):  ρ₀ = 1,  ρ₁ = θ/(1 + θ²),  ρ_i = 0 for i > 1
    MA(q):  ρ_i = 0 for i > q

For an AR(p) there is some more complex form of decay.
So if you thought your data followed an MA(q) process you would look for a drop-off
in the autocorrelation function at some lag in order to identify q. How can you tell
whether an autocorrelation is 0 or not? Do a statistical test! If y_t is mean zero then the
ith autocovariance is estimated by γ̂_i = [1/(T − i)] Σ_{t=i+1}^T y_t y_{t−i}. You would estimate
ρ̂_i = γ̂_i/γ̂₀. To test the null hypothesis H₀: ρ_i = 0 for some i > 0 it turns out that under
the null hypothesis √T ρ̂_i →d N(0, 1). This is because the scaled numerator converges in
distribution to N(0, σ_y⁴) and the denominator converges in probability to σ_y². Thus an
easy way to check whether the autocorrelations are significant is to construct an acceptance
region based on the 5% significance level around 0, equal to ±1.96/√T.
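The sample ACF and the ±1.96/√T band can be sketched as follows; this is a minimal illustration of my own using a simulated white noise series, so essentially no autocorrelation should land outside the band.

```python
import numpy as np

def sample_acf(y, nlags):
    """Sample autocorrelations rho_i = gamma_i / gamma_0 for a mean-zero
    series, with gamma_i = [1/(T-i)] sum_{t=i+1} y_t y_{t-i} as in the text."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    gamma0 = np.mean(y**2)
    return np.array([np.sum(y[i:] * y[:-i]) / (T - i) / gamma0
                     for i in range(1, nlags + 1)])

rng = np.random.default_rng(4)
y = rng.normal(size=1000)            # white noise
rho = sample_acf(y, 10)
band = 1.96 / np.sqrt(len(y))        # 5% acceptance band around 0
n_signif = np.sum(np.abs(rho) > band)
```

Each ρ̂_i has standard error roughly 1/√T ≈ 0.032 here, so the estimates cluster tightly around zero.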
Suppose you want to do a joint test H₀: ρ_i = 0, i = 1, ..., N. Then construct the
Box-Pierce statistic Q_BP = T Σ_{i=1}^N ρ̂_i² or the Ljung-Box Q_LB = T(T + 2) Σ_{i=1}^N ρ̂_i²/(T − i).
Both of these are χ²_N asymptotically, although the Ljung-Box is closer to a χ² in small
samples. It's because both are sums of squared standard normals which are independent
asymptotically.
Obviously these methods are more helpful for MA processes than AR processes.
The partial autocorrelation function φ_kk is defined by the Yule-Walker equations

    ρ_j = Σ_{i=1}^k φ_ki ρ_{j−i},   j = 1, ..., k.

Notice that we have k equations and there are k unknowns φ_ki, i = 1, ..., k. This can be
written in matrix notation as

    [ρ₁ ]   [1         ρ₁    ρ₂   ...  ρ_{k−1}] [φ_k1]
    [ρ₂ ] = [ρ₁        1     ρ₁   ...  ρ_{k−2}] [φ_k2]
    [⋮  ]   [⋮                     ⋱    ⋮     ] [⋮   ]
    [ρ_k]   [ρ_{k−1}   ...        ...  1     ] [φ_kk]

or ρ_k = P_k φ_k, i.e. φ_k = P_k⁻¹ρ_k, but remember we're really only interested in φ_kk, the last
element. Notice that for k = 1 this just reduces to ρ₁ = φ₁₁, so that the first partial
autocorrelation is the same as the first autocorrelation. Might want to illustrate k = 2
also.
Of course we'd be doing this in sample, so all the quantities would be sample
estimates. It looks pretty formidable but it isn't really. In fact, these things can
be computed from simple OLS regressions. Start with φ̂₁₁. Since it equals ρ̂₁ it is
clear that φ̂₁₁ = ρ̂₁ = φ̂₁, the OLS estimator in a regression of y_t on y_{t−1}. I.e.
φ̂₁ = (Σy_{t−1}y_t/T)/(Σy²_{t−1}/T) →p γ₁/γ₀ = ρ₁ = φ₁₁. Suppose we estimated an AR(2)
model y_t = φ₁y_{t−1} + φ₂y_{t−2} + ε_t. Notice we'd have

    [φ̂₁]   [Σy²_{t−1}/T         Σy_{t−1}y_{t−2}/T]⁻¹ [Σy_{t−1}y_t/T]
    [φ̂₂] = [Σy_{t−1}y_{t−2}/T   Σy²_{t−2}/T      ]   [Σy_{t−2}y_t/T]

        →p [γ₀  γ₁]⁻¹ [γ₁]    [ρ₀  ρ₁]⁻¹ [ρ₁]
           [γ₁  γ₀]   [γ₂] =  [ρ₁  ρ₀]   [ρ₂]

Therefore φ̂₂ acts as an estimate of φ₂₂. In general, φ̂_k in an OLS regression of an AR(k) acts
as an estimate of φ_kk. In other words, φ_kk = Corr(y_t, y_{t−k}|y_{t−1}, ..., y_{t−k+1}). An implication of
all of this is that φ_kk = 0 for k > p if y_t is an AR(p) process. On the other hand the φ_kk
will decay in some manner if y_t is an MA process.
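The regression route to the PACF is easy to sketch; this is an illustration of my own, fitting AR(k) models of increasing order to a simulated AR(1) with φ = 0.7, so the first partial autocorrelation should be near 0.7 and the rest near zero.

```python
import numpy as np

def pacf_via_ols(y, kmax):
    """phi_kk estimated as the last OLS coefficient in an AR(k) regression
    of y_t on (y_{t-1}, ..., y_{t-k}), for k = 1..kmax."""
    y = np.asarray(y, dtype=float)
    out = []
    for k in range(1, kmax + 1):
        # column j holds the lag-j values, aligned with y_t = y[k:]
        Yk = np.column_stack([y[k - j:len(y) - j] for j in range(1, k + 1)])
        phi = np.linalg.solve(Yk.T @ Yk, Yk.T @ y[k:])
        out.append(phi[-1])
    return np.array(out)

rng = np.random.default_rng(7)
T = 5000
e = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + e[t]
pacf_hat = pacf_via_ols(y, 5)
```

The cutoff after lag p is exactly the AR(p) identification pattern described above.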
Another small point is that when we’re selecting a model we may need to difference
the time series to render it stationary. Most macroeconomic time series look like this
because they grow steadily through time. There is an issue as to whether this growth can
be modelled as a deterministic trend or must involve explicitly modelling the first difference
of the log of a series. We will not discuss this issue. Let me just mention a definition.
The stochastic process y_t is integrated of order d iff Δ^d y_t = (1 − L)^d y_t is station-
ary. This definition leads us to refer to ARIMA(p, d, q) processes, which have the form
φ(L)Δ^d y_t = θ(L)ε_t. Two special ARIMA processes are the (misnamed by economists)
random walk Δy_t = ε_t and the random walk with drift Δy_t = µ + ε_t. Use of these models
is especially prevalent in finance.
8.3.2. Estimation
We will consider maximum likelihood estimation. This requires the assumption that
ε_t ∼ N(0, σ²). The estimation of autoregressive models is simple. To estimate them we
need to write down the joint density function of the y's, L(y) = f(y₁, y₂, ..., y_T). Let
me introduce the notation Y_t = (y₁, y₂, ..., y_t). We can use the definition of conditional
density functions to derive f:

    f(Y_T) = f(y_T|Y_{T−1})f(Y_{T−1})
           = f(y_T|Y_{T−1})f(y_{T−1}|Y_{T−2})f(Y_{T−2})
           = [Π_{t=2}^T f(y_t|Y_{t−1})] f(y₁)

Since each of these conditional distributions is easy to characterize in an AR model, this
is a convenient way to proceed.
Take for example the AR(1) model. What is f(y_t|Y_{t−1})? Well, in the AR(1) model
y_t = φy_{t−1} + ε_t. Given the information on all y's up to time t − 1 it is clear that E_{t−1}y_t =
φy_{t−1} and that Var_{t−1}y_t = σ². Therefore, (y_t|Y_{t−1}) ∼ N(φy_{t−1}, σ²). We also need to
consider y₁. We know that y₁ = Σ_{i=0}^∞ φ^i ε_{1−i}. Therefore, the marginal distribution of y₁
is N[0, σ²/(1 − φ²)]. Thus the joint density is

    f(Y_T) = [Π_{t=2}^T (2πσ²)^{−1/2} exp( −(1/2σ²)(y_t − φy_{t−1})² )]
             × (2πσ²)^{−1/2}(1 − φ²)^{1/2} exp( −[(1 − φ²)/2σ²] y₁² )

so that the log of the likelihood function is simply

    ln L = −(T/2) ln(2π) − (T/2) ln(σ²) + (1/2) ln(1 − φ²)
           − (1/2σ²) Σ_{t=2}^T (y_t − φy_{t−1})² − [(1 − φ²)/2σ²] y₁².

Notice that if we were to drop the two extra terms from the likelihood we'd have the
OLS criterion function. Since that term dominates as the sample size grows there is
going to be numerical similarity between the ML estimate and the OLS estimate in large
samples. They are asymptotically equivalent. You should verify, by computing the first
derivatives and second derivatives of the likelihood function and then the information matrix,
that the asymptotic variance of the ML and OLS estimators is σ²γ₀⁻¹, which is consistently
estimated by σ̂²(Σy²_{t−1}/T)⁻¹ — which in small samples means use σ̂²(X'X)⁻¹ from OLS!
For higher order AR processes the story is the same although there are more complex
issues to deal with concerning the first p observations.
For a moving average model it is difficult to write $y$ in terms of lagged $y$'s and current
shocks. For example, if we have an MA(1) model, $y_t = \epsilon_t + \theta\epsilon_{t-1}$, the appropriate way
to do it is to invert the MA process to get $(1+\theta L)^{-1}y_t = (1 - \theta L + \theta^2L^2 - \ldots)y_t = \epsilon_t$.
Even in an MA(1), $y$ depends on an infinite sequence of lagged $y$'s when the process is
inverted. Therefore, the easier way to go is just to directly write down the likelihood for
the process. Clearly, since the $\epsilon$'s are normal, the $y$'s are normal. Each $y$ has mean zero,
variance $\sigma^2(1+\theta^2)$ and first order covariance $\theta\sigma^2$. Therefore, $Y_T \sim N(0, \sigma^2\Omega)$ where
$$\Omega = \begin{pmatrix} 1+\theta^2 & \theta & 0 & \ldots & 0 \\ \theta & 1+\theta^2 & \theta & \ldots & 0 \\ 0 & \theta & 1+\theta^2 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1+\theta^2 \end{pmatrix}$$
so that
$$\mathcal{L} = -\frac{T}{2}\ln(2\pi\sigma^2) - \frac{1}{2}\ln|\Omega| - \frac{1}{2\sigma^2}Y_T'\Omega^{-1}Y_T.$$
There are some kinky ways to estimate this guy but you won't learn them here. Shazam
seems a little useless to me, try RATS.
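Kinky or not, the likelihood is easy to evaluate with modern software by building $\Omega$ directly. A sketch, with made-up parameter values; for simplicity I hold $\sigma^2$ at its true value and profile only over $\theta$, whereas a serious implementation would estimate both (typically via the Kalman filter rather than a $T \times T$ solve):

```python
import numpy as np

def ma1_loglik(theta, sig2, y):
    """Exact Gaussian log-likelihood of y_t = eps_t + theta*eps_{t-1}."""
    T = len(y)
    # tridiagonal Omega: 1+theta^2 on the diagonal, theta just off it
    omega = ((1 + theta**2) * np.eye(T)
             + theta * (np.eye(T, k=1) + np.eye(T, k=-1)))
    sign, logdet = np.linalg.slogdet(omega)
    quad = y @ np.linalg.solve(omega, y)
    return -0.5 * T * np.log(2 * np.pi * sig2) - 0.5 * logdet - 0.5 * quad / sig2

rng = np.random.default_rng(2)
T, theta0 = 300, 0.5
eps = rng.normal(size=T + 1)
y = eps[1:] + theta0 * eps[:-1]

grid = np.linspace(-0.9, 0.9, 181)
theta_hat = grid[np.argmax([ma1_loglik(th, 1.0, y) for th in grid])]
print(theta_hat)   # should land near theta0
```

This is the multivariate-normal likelihood written above, nothing more; the only trick is letting the linear algebra handle $|\Omega|$ and $\Omega^{-1}$.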
8.3.3. Diagnostic Testing
One thing you could do which is straightforward is to redo the Ljung-Box or Box-
Pierce tests on the residuals to check for leftover autocorrelations in the data. Unfortunately, these tests have very low power, so they often fail to reject the null when it is
false. You really only get much out of these tests if you have a large sample and you
can sum over many autocorrelations.
A better approach is to use LM tests. You will have chosen a particular model which
you might think of as a restricted version of a more general model. If you want to test
whether your version is adequate you can use an LM test, which only requires estimation of
the restricted model along with the derivatives of the likelihood function at the parameter
estimates. We will only examine a test based on a selected model of AR(1). I will test
this specification versus an AR(2). Dropping the starting values from consideration the
likelihood is just
$$\mathcal{L} = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2.$$
In this case the usual block separation of the information matrix into $\phi$ components and a $\sigma^2$
component will hold, so we only need
$$\frac{\partial\mathcal{L}}{\partial\phi_1} = \frac{1}{\sigma^2}\sum_{t=1}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})y_{t-1}$$
$$\frac{\partial\mathcal{L}}{\partial\phi_2} = \frac{1}{\sigma^2}\sum_{t=1}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})y_{t-2}$$
$$\frac{\partial^2\mathcal{L}}{\partial\phi_1^2} = -\frac{1}{\sigma^2}\sum_{t=1}^T y_{t-1}^2 \qquad \frac{\partial^2\mathcal{L}}{\partial\phi_1\partial\phi_2} = -\frac{1}{\sigma^2}\sum_{t=1}^T y_{t-1}y_{t-2} \qquad \frac{\partial^2\mathcal{L}}{\partial\phi_2^2} = -\frac{1}{\sigma^2}\sum_{t=1}^T y_{t-2}^2$$
At the AR(1) estimates $\hat\phi_2 = 0$ and $\hat\phi_1$ sets the first order condition for $\phi_1$ to zero:
$$\frac{\partial\mathcal{L}}{\partial\phi_1} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T (y_t - \hat\phi_1 y_{t-1})y_{t-1} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T e_t y_{t-1} = 0.$$
The first order condition for $\phi_2$ reduces to
$$\frac{\partial\mathcal{L}}{\partial\phi_2} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T (y_t - \hat\phi_1 y_{t-1})y_{t-2} = \frac{1}{\hat\sigma^2}\sum_{t=1}^T e_t y_{t-2}.$$
This makes the LM statistic
$$LM = \Big(\sum e_t y_{t-1} \;\; \sum e_t y_{t-2}\Big)\begin{pmatrix}\sum y_{t-1}^2 & \sum y_{t-1}y_{t-2} \\ \sum y_{t-1}y_{t-2} & \sum y_{t-2}^2\end{pmatrix}^{-1}\begin{pmatrix}\sum e_t y_{t-1} \\ \sum e_t y_{t-2}\end{pmatrix}\Big/\hat\sigma^2$$
$$= T\Big(\sum e_t y_{t-1} \;\; \sum e_t y_{t-2}\Big)\begin{pmatrix}\sum y_{t-1}^2 & \sum y_{t-1}y_{t-2} \\ \sum y_{t-1}y_{t-2} & \sum y_{t-2}^2\end{pmatrix}^{-1}\begin{pmatrix}\sum e_t y_{t-1} \\ \sum e_t y_{t-2}\end{pmatrix}\Big/\sum e_t^2$$
$$= Te'X(X'X)^{-1}X'e/(e'e) = TR^2$$
from the regression of $e$ on $y_{t-1}$ and $y_{t-2}$.
Suppose I tried an ARMA(1,1) model. In this case (and I won't go into why) it turns
out that the relevant regression is of the residuals $e_t$ on $y_{t-1}$ and $e_{t-1}$. See if you can work out
analogs for more complex models.
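The $TR^2$ version of the AR(1)-versus-AR(2) test takes only a few lines to compute. A sketch on simulated AR(1) data, so the null is true and the statistic should usually sit below the $\chi^2(1)$ critical value (parameter values and sample size are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
T, phi0 = 500, 0.5
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi0 * y[t - 1] + rng.normal()

# restricted model: AR(1) by OLS (asymptotically the ML estimate)
yy, y1, y2 = y[2:], y[1:-1], y[:-2]
phi1 = np.sum(yy * y1) / np.sum(y1 * y1)
e = yy - phi1 * y1

# auxiliary regression of e on (y_{t-1}, y_{t-2}); LM = T * R^2,
# i.e. the uncentered T e'X(X'X)^{-1}X'e / (e'e) form of the statistic
X = np.column_stack([y1, y2])
b = np.linalg.solve(X.T @ X, X.T @ e)
r2 = 1 - np.sum((e - X @ b) ** 2) / np.sum(e**2)
lm = len(e) * r2
print(lm)   # compare with the chi-squared(1) 5% critical value, 3.84
```

There is one restriction ($\phi_2 = 0$), so the statistic is asymptotically $\chi^2(1)$; the $y_{t-1}$ regressor contributes nothing because $e$ is already orthogonal to it.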
9. Vector Autoregressions
At this point we move to a multivariate setting, since univariate models aren't particularly interesting from an economic perspective. If we just wrote down a multivariate
ARMA model it would look like
$$\Phi_0 x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t + \Theta_1\epsilon_{t-1} + \ldots + \Theta_q\epsilon_{t-q}$$
where $\Phi_0$ isn't necessarily $I$, since we might want to allow contemporaneous responses of
one variable to another, and $E(\epsilon_t\epsilon_t') = \Sigma$. Notice how many free parameters we have:
$n^2(1+p+q) + n(n+1)/2$. This could be hideous. I won't discuss the restrictions on
the matrix polynomials $\Phi(z)$ and $\Theta(z)$ required to render this process stationary and
invertible; suffice it to say that they are analogous to the univariate setting.
9.1. Identification
Identification is going to be a problem, just as it is in simultaneous equations models.
Suppose we restrict our attention to models with no MA components (they’d be hideous
to deal with) and write a vector autoregression (VAR)
$$\Phi_0 x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t.$$
The likelihood function for the $\epsilon$'s is
$$L = \prod_{t=1}^T (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\Big(-\tfrac{1}{2}\epsilon_t'\Sigma^{-1}\epsilon_t\Big).$$
The Jacobian of the transformation from $x_t$ to $\epsilon_t$ is $\Phi_0$, so that the likelihood of $X$ is
(ignoring starting values)
$$L = \prod_{t=1}^T (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Phi_0|\exp\Big(-\tfrac{1}{2}[\Phi(L)x_t]'\Sigma^{-1}[\Phi(L)x_t]\Big)$$
$$= \prod_{t=1}^T (2\pi)^{-n/2}|\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}|^{-1/2}\exp\Big(-\tfrac{1}{2}[\Phi_0 x_t - \Phi_1 x_{t-1} - \ldots - \Phi_p x_{t-p}]'\Sigma^{-1}[\Phi_0 x_t - \Phi_1 x_{t-1} - \ldots - \Phi_p x_{t-p}]\Big)$$
$$= \prod_{t=1}^T (2\pi)^{-n/2}|\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}|^{-1/2}\exp\Big(-\tfrac{1}{2}[x_t - \Phi_0^{-1}\Phi_1 x_{t-1} - \ldots - \Phi_0^{-1}\Phi_p x_{t-p}]'\Phi_0'\Sigma^{-1}\Phi_0[x_t - \Phi_0^{-1}\Phi_1 x_{t-1} - \ldots - \Phi_0^{-1}\Phi_p x_{t-p}]\Big)$$
$$= \prod_{t=1}^T (2\pi)^{-n/2}|V|^{-1/2}\exp\Big(-\tfrac{1}{2}[x_t - C_1 x_{t-1} - \ldots - C_p x_{t-p}]'V^{-1}[x_t - C_1 x_{t-1} - \ldots - C_p x_{t-p}]\Big)$$
where $C_i = \Phi_0^{-1}\Phi_i$ and $V = \Phi_0^{-1}\Sigma\Phi_0^{-1\prime}$.
Suppose I generated new parameters using any non-singular $n\times n$ matrix $G$,
$$\tilde\Phi_i = G\Phi_i, \qquad \tilde\Sigma = G\Sigma G',$$
then
$$\tilde C_i = \tilde\Phi_0^{-1}\tilde\Phi_i = (G\Phi_0)^{-1}G\Phi_i = \Phi_0^{-1}G^{-1}G\Phi_i = \Phi_0^{-1}\Phi_i = C_i$$
$$\tilde V = \tilde\Phi_0^{-1}\tilde\Sigma\tilde\Phi_0^{-1\prime} = \Phi_0^{-1}G^{-1}G\Sigma G'G'^{-1}\Phi_0^{-1\prime} = \Phi_0^{-1}\Sigma\Phi_0^{-1\prime} = V.$$
This means that just as in simultaneous equations models we must put some restrictions
upon the parameters to achieve identification. The idea is to put enough restrictions on
the $\Phi_i$ and $\Sigma$ to imply that the only admissible $G$ that delivers the same $C_i$ and $V$ is $G = I$. Unlike
simultaneous equations models, people who use VAR models don't typically have in mind
restrictions on the $\Phi_i$ matrices for $i \geq 1$. This leaves us with two possibilities.
i) Restrictions on $\Phi_0$, and/or
ii) Restrictions on $\Sigma$.
The macroeconomist typically wants an interpretation of his model which allows him
to identify error terms as particular kinds of shocks. For example, money shocks, supply
shocks, technology shocks etc. In choosing what restrictions to impose, the economist
needs to be aware of what he wants to do with his model once it’s identified.
9.1.1. Restriction: $\Phi_0 = I$
Clearly this achieves identification. Now, this means that the model can be written
as
$$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t.$$
This model has an infinite moving average representation which can be obtained by inverting
the autoregressive matrix polynomial in the lag operator. If we write $\Phi(L)x_t = \epsilon_t$ then we
have $x_t = \Phi(L)^{-1}\epsilon_t = \Lambda(L)\epsilon_t$ or
$$x_t = \Lambda_0\epsilon_t + \Lambda_1\epsilon_{t-1} + \Lambda_2\epsilon_{t-2} + \ldots$$
One of the ways in which economists like to interpret VARs is to look at this moving
average representation. It's quickly obvious that
$$\frac{\partial x_{i,t}}{\partial\epsilon_{j,t-k}} = \Lambda_k[ij] = \lambda_{k,ij}.$$
Typically $\lambda_{k,ij}$ is plotted as a function of $k$ and is called the impulse response function
(IRF) of variable $i$ with respect to shocks in equation $j$. The only problem with this
concept is that with no restrictions on $\Sigma$ this is a rather weird thing to do. If there's a
shock to $\epsilon_{jt}$ and there's covariance between $\epsilon_j$ and the other shocks, how can we look at the response
to one shock in isolation from the responses to other shocks which are correlated with it?
To be able to isolate individual shocks we need to identify shocks which are uncorrelated
with each other. Suppose we do a decomposition $\Sigma = C'DC$, where $C$ is invertible and upper
triangular with 1's on the diagonal and $D$ is a diagonal matrix. If we create new shocks
$v_t = C'^{-1}\epsilon_t$, notice that $E(v_tv_t') = E(C'^{-1}\epsilon_t\epsilon_t'C^{-1}) = D$. We can then write
$$x_t = \Lambda_0C'v_t + \Lambda_1C'v_{t-1} + \Lambda_2C'v_{t-2} + \ldots = \tilde\Lambda_0 v_t + \tilde\Lambda_1 v_{t-1} + \tilde\Lambda_2 v_{t-2} + \ldots$$
which creates a new impulse response function with respect to the elements of $v_t$, given
by $\tilde\lambda_{k,ij}$. This does allow the computation of a response function to shocks which are
mutually orthogonal, but it still creates another problem of interpretation. Unfortunately,
the impulse responses will be sensitive to the ordering of the equations. The decomposition
we did will lead to different results if the variables are ordered differently in the VAR. This
leads us to wonder what the heck getting impulse responses to $v$'s really means. Sometimes
theory can shed light on what ordering is appropriate, but clear choices are rare.
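Mechanically, the decomposition and the orthogonalized responses are a few lines of linear algebra. A sketch for a bivariate VAR(1) with made-up parameter values, building the unit-diagonal factor $C'$ and the diagonal $D$ from the Cholesky factor of $\Sigma$:

```python
import numpy as np

# a bivariate VAR(1): x_t = A x_{t-1} + eps_t, with E(eps eps') = Sigma
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

# Sigma = C'DC with C' unit lower triangular and D diagonal,
# recovered from the Cholesky factor P (Sigma = P P')
P = np.linalg.cholesky(Sigma)
d = np.diag(P)
Ct = P / d                      # C': unit lower triangular
D = np.diag(d**2)
assert np.allclose(Ct @ D @ Ct.T, Sigma)

# impulse responses: Lambda_k = A^k for a VAR(1); orthogonalized IRF = Lambda_k C'
horizon = 10
irf = np.empty((horizon, 2, 2))
Ak = np.eye(2)
for k in range(horizon):
    irf[k] = Ak @ Ct            # response of x_{t+k} to the orthogonal v shocks
    Ak = Ak @ A
print(irf[0])                   # impact responses; note the zero above the diagonal
```

Swapping the order of the two variables changes $P$, and hence the impact responses: that is the ordering sensitivity just discussed, made concrete.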
A second thing people like to look at is called the variance decomposition. Let's go
back to our representation
$$x_t = \Lambda_0\epsilon_t + \Lambda_1\epsilon_{t-1} + \Lambda_2\epsilon_{t-2} + \ldots$$
Suppose we wanted to forecast $x_t$ with information up to time $t-k$. The forecast error
would be
$$e_{k,t} = x_t - E(x_t|I_{t-k}) = \Lambda_0\epsilon_t + \ldots + \Lambda_{k-1}\epsilon_{t-k+1}.$$
The variance of this $k$-step ahead forecast error¹ is given by
$$V_k = \Lambda_0\Sigma\Lambda_0' + \ldots + \Lambda_{k-1}\Sigma\Lambda_{k-1}'.$$
Clearly $V_k$ is a linear function of $\sigma_{ij}$, $\forall i,j$. The variance decomposition looks at the
percentage of the variance of the $k$-step ahead forecast error in forecasting variable $i$ which
is due to variation in shock $j$. This is not well defined if there are covariance terms floating
around in the expressions on the diagonal of $V_k$. We'd want to look at $V_k(ii)$, see
something like $a_1\sigma_{11} + a_2\sigma_{22} + \ldots + a_n\sigma_{nn}$, and take the ratio of $a_j\sigma_{jj}$ to $V_k(ii)$. The only
way to ensure that the diagonal of $V_k$ has this appearance is to have a diagonal $\Sigma$, such as
we had in our decomposed model.
¹ Notice that at $k = \infty$ you get the unconditional variance of $x_t$.
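As a sketch, here is the forecast error variance decomposition for a made-up bivariate VAR(1). I scale the orthogonal shocks to unit variance (folding $D^{1/2}$ into the loadings, a cosmetic variation on the unit-diagonal convention used above), which makes each diagonal element of $V_k$ a sum of squares with no covariance terms:

```python
import numpy as np

# assumed bivariate VAR(1): x_t = A x_{t-1} + eps_t
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

P = np.linalg.cholesky(Sigma)   # orthogonalize: eps_t = P v_t, Var(v_t) = I
k = 8
Lam = np.eye(2)
contrib = np.zeros((2, 2))      # contrib[i, j]: variance of x_i due to shock j
for _ in range(k):
    B = Lam @ P                 # MA coefficients on the orthogonal shocks
    contrib += B**2             # with Var(v) = I the cross terms vanish
    Lam = Lam @ A

shares = contrib / contrib.sum(axis=1, keepdims=True)
print(shares)                   # each row sums to one: the variance decomposition
```

Row $i$ of `shares` answers "what fraction of the $k$-step forecast error variance of variable $i$ comes from each shock", which is exactly the object described above.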
9.1.2. Restrict $\Phi_0$ and $\Sigma$
In this case, the most common restriction is to assume that $\Phi_0$ is lower triangular
with 1's on the diagonal and that $\Sigma$ is diagonal. This setup has the advantage of allowing
contemporaneous relationships between the variables. For this reason it is often referred
to as a structural VAR. Because $\Sigma$ is assumed to be diagonal we can interpret the shocks
in the model as independent processes. This helps in thinking about the impulse response
functions.
$$\Phi_0 x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t$$
$$x_t = \Phi_0^{-1}\Phi_1 x_{t-1} + \Phi_0^{-1}\Phi_2 x_{t-2} + \ldots + \Phi_0^{-1}\Phi_p x_{t-p} + \Phi_0^{-1}\epsilon_t = A_1 x_{t-1} + A_2 x_{t-2} + \ldots + A_p x_{t-p} + u_t$$
So our structural model can be written as a plain VAR model with variance covariance
matrix for the $u$'s equal to $\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}$. So suppose we then wrote down the moving average
representation of $x_t$ in terms of $u_t$. We'd have
$$x_t = \sum_{k=0}^\infty B_k u_{t-k}.$$
Getting the moving average representation of $x_t$ in terms of the $\epsilon_t$'s is quite simple. It is simply
$$x_t = \sum_{k=0}^\infty B_k\Phi_0^{-1}\epsilon_{t-k} = \sum_{k=0}^\infty \tilde B_k\epsilon_{t-k}.$$
Notice that when we computed the IRF in the $\Phi_0 = I$ model we took it with respect to the
$v_t$, which were orthogonal, by decomposing the nonspherical $\epsilon$'s using $C'^{-1}$.
Here we are decomposing the nonspherical $u_t$'s using the matrix $\Phi_0^{-1}$, which by assumption
looks a lot like $C'$ did. In both cases the model, when written in terms of nonspherical
disturbances, was a plain VAR model. Thus if we estimated a plain VAR model with
nonspherical disturbances we would simply have to decide how to decompose the variance
covariance matrix of the errors, either by obtaining an estimate of $\Phi_0$ or of $C$. Notice
that in the structural model, if the variance decomposition is computed as
$$V_k = \tilde B_0\Sigma\tilde B_0' + \ldots + \tilde B_{k-1}\Sigma\tilde B_{k-1}'$$
then the expression will be linear in the $\sigma_{ii}$ and no covariance terms will appear, since $\Sigma$ is
assumed to be diagonal.
9.2. Estimation
We’re going to work with either set of restrictions and we’ll write both models in
their plain VAR form, generically as
$$x_t = D_1 x_{t-1} + D_2 x_{t-2} + \ldots + D_p x_{t-p} + w_t.$$
Notice that in either setup, this implies that $w_t$ is an error term which is nonspherical.
As a result GLS seems appropriate. However, since in each equation (remember $x_t$ is a
vector here) the regressors are identical, GLS (which is SURE) is going to be numerically
identical to OLS equation by equation. Therefore in either case simply compute the $D_i$ by
OLS and get an estimate $\hat\Sigma_w = \sum_t \hat w_t\hat w_t'/T$, where the $\hat w_t$'s are the residuals.
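In code this is a single least-squares solve, precisely because the regressors are the same in every equation. A sketch for a made-up bivariate VAR(1):

```python
import numpy as np

rng = np.random.default_rng(7)
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])
T = 2000
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A1 @ x[t - 1] + rng.normal(size=2)

# OLS equation by equation = one multivariate least-squares solve
X = x[:-1]                                   # regressors: one lag of each variable
Y = x[1:]
D1 = np.linalg.solve(X.T @ X, X.T @ Y).T     # row i: coefficients of equation i
W = Y - X @ D1.T                             # residuals w_t
Sigma_w = W.T @ W / len(W)                   # estimate of the error covariance
print(D1)
print(Sigma_w)
```

Each row of `D1` is exactly what running OLS on that equation alone would give, which is the SURE-equals-OLS point in the text.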
9.2.1. Restriction 9.1.1.
In this case the $D_i$ represent estimates of the $\Phi_i$ and $\hat\Sigma_w$ is an estimate of $\Sigma$. To compute
the impulse response functions simply requires finding $C$ and $D$ such that
$C'DC = \hat\Sigma_w$.
9.2.2. Restriction 9.1.2.
In this case the $D_i$ represent estimates of the $\Phi_0^{-1}\Phi_i$ and $\hat\Sigma_w$ is an estimate of
$\Phi_0^{-1}\Sigma\Phi_0^{-1\prime}$. Clearly, to get out the parameters we're interested in requires an estimate
of $\Phi_0$. But remember simultaneous equations: if $\Phi_0$ is lower triangular we can do OLS
equation by equation and get the right estimates.
9.3. Structural VARs
Suppose we didn't have such a wonderful situation. Suppose that $\Phi_0$ had enough
restrictions in it to achieve identification but it wasn't lower triangular. This would be
tricky. It's hard because the triangularity bought a lot. It meant that if you looked in any
equation, like the one for $x_{it}$, you'd see
$$x_{it} = \beta_{1i}x_{1t} + \ldots + \beta_{i-1,i}x_{i-1,t} + f(\text{lags}) + \epsilon_{it}.$$
The whole secret to a system like this is that everything on the right hand side is either
exogenous or is determined by an equation higher in the order which does not depend on
$x_{it}$. Without triangularity OLS won't work. So what will we do? Blanchard and Watson
(1986) suggest a way of doing this which only works for some kinds of nontriangular
systems. Let me illustrate their procedure with a 4-variable system:
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & -\alpha \\ 0 & 0 & 1 & -\beta \\ -\phi_1 & -\phi_2 & -\phi_3 & 1 \end{pmatrix} y_t = Z_t\gamma + \epsilon_t$$
where $Z_t$ represents all the lags. This can be written as 4 equations:
$$y_1 = Z\gamma_1 + \epsilon_1$$
$$y_2 = \alpha y_4 + Z\gamma_2 + \epsilon_2$$
$$y_3 = \beta y_4 + Z\gamma_3 + \epsilon_3$$
$$y_4 = \phi_1 y_1 + \phi_2 y_2 + \phi_3 y_3 + Z\gamma_4 + \epsilon_4$$
If you work out the reduced form for $y_4$ it is
$$y_4 = \frac{Z(\phi_1\gamma_1 + \phi_2\gamma_2 + \phi_3\gamma_3 + \gamma_4) + (\phi_1\epsilon_1 + \phi_2\epsilon_2 + \phi_3\epsilon_3 + \epsilon_4)}{1 - \alpha\phi_2 - \beta\phi_3}.$$
Now suppose I estimated each equation by OLS but left out the endogenous RHS variables,
creating what are really reduced form coefficient estimates which I'll call $\hat\gamma_i$. These guys
wouldn't converge to the $\gamma_i$'s; rather they'd converge to the reduced form coefficients, the
analogs of the $\Phi_0^{-1}\Phi_i$'s. Create four residual series:
$$e_1 = y_1 - Z(Z'Z)^{-1}Z'y_1 = My_1 = M(Z\gamma_1 + \epsilon_1) = M\epsilon_1$$
$$e_2 = My_2 = M(Z\gamma_2 + \alpha y_4 + \epsilon_2) = \alpha My_4 + M\epsilon_2 = \alpha e_4 + M\epsilon_2$$
$$e_3 = My_3 = M(Z\gamma_3 + \beta y_4 + \epsilon_3) = \beta My_4 + M\epsilon_3 = \beta e_4 + M\epsilon_3$$
$$e_4 = My_4 = M(Z\gamma_4 + \phi_1 y_1 + \phi_2 y_2 + \phi_3 y_3 + \epsilon_4) = \phi_1 e_1 + \phi_2 e_2 + \phi_3 e_3 + M\epsilon_4$$
How could we get $\alpha$? Regress $e_2$ on $e_4$?
$$\hat\alpha = (e_4'e_4)^{-1}e_4'e_2 = (e_4'e_4)^{-1}e_4'(\alpha e_4 + M\epsilon_2) = \alpha + (y_4'My_4)^{-1}y_4'M\epsilon_2$$
Clearly this means there will be bias in $\hat\alpha$: look at the covariance of the reduced form for
$y_4$ with $\epsilon_2$. Could we instrument it out? Yes. Use $e_1$ as an instrument for $e_4$!
$$\hat\alpha_{IV} = (e_1'e_4)^{-1}e_1'e_2 = (e_1'e_4)^{-1}e_1'(\alpha e_4 + M\epsilon_2) = \alpha + (\epsilon_1'My_4)^{-1}\epsilon_1'M\epsilon_2$$
This is going to work because the second part is zero in expectation and the first part is
nonsingular because $\epsilon_1$ appears in $y_4$'s reduced form. This last part is critical! We could
continue like this by next taking $e_1$ and $\tilde e_2 = e_2 - \hat\alpha_{IV}e_4$ as instruments (using optimal IV
methods) for $e_4$ to get an estimate of $\beta$. This would lead to $\tilde e_3 = e_3 - \hat\beta_{IV}e_4$. Finally, $e_1$,
$\tilde e_2$ and $\tilde e_3$ could be used (with optimal IV methods) as instruments for $e_1$, $e_2$ and $e_3$ and
thereby get estimates of the $\phi_i$'s.
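The first step of the procedure is easy to check by simulation. In this sketch I drop the lags entirely (so the "residuals" are just the series themselves) and use made-up values of the structural parameters; the OLS estimate of $\alpha$ is biased and the IV estimate using $e_1$ as the instrument is not:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 200_000                               # big sample, to make the bias visible
alpha, beta, p1, p2, p3 = 0.8, -0.5, 0.4, 0.3, 0.2

e1, e2, e3, e4 = rng.normal(size=(4, T))  # structural shocks, diagonal Sigma
y1 = e1
# reduced form for y4 from the text, then y2 and y3 from their equations
y4 = (p1 * e1 + p2 * e2 + p3 * e3 + e4) / (1 - alpha * p2 - beta * p3)
y2 = alpha * y4 + e2
y3 = beta * y4 + e3

a_ols = (y4 @ y2) / (y4 @ y4)             # biased: y4 is correlated with eps_2
a_iv = (y1 @ y2) / (y1 @ y4)              # y1 = eps_1 is a valid instrument for y4
print(a_ols, a_iv)                        # a_iv near 0.8; a_ols noticeably above it
```

The instrument works for exactly the reason stated in the text: $\epsilon_1$ enters $y_4$'s reduced form (so the first stage is nondegenerate) and is uncorrelated with $\epsilon_2$.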
10. Generalized Method of Moments
There is a large class of macroeconomic models which lead to Euler equations, or in
general, nonlinear restrictions of the form
$$E_{t-i}f(Y_{t-1}, y_t|\theta) = 0$$
for some $i \geq 0$. Because of the law of iterated expectations it follows that $E_{t-i}[f(Y_{t-1}, y_t|\theta)x_{t-i}] = 0$ for any $x_{t-i}$ which is in the time $t-i$ information set. Furthermore we can write $E[f(Y_{t-1}, y_t|\theta)x_{t-i}] = 0$. With restrictions like these we could write $f(Y_{t-1}, y_t|\theta)x_{t-i} = u_t$,
where the restriction is rewritten as $Eu_t = 0$. This just looks like a nonlinear model. The
problem that GMM estimation is meant to solve is that the model in its most primitive
form often makes no statement as to the distribution function of $u_t$. I.e., we don't know that
it's normal or virtually anything else about it (e.g. we don't know whether it is heteroskedastic
or autocorrelated). We do know that $E(u_t) = 0$. Furthermore, $E(u_tu_{t-j}') = 0$ for $|j| \geq i$,
otherwise nonzero, because $u_t$ is orthogonal to anything known up to time $t-i$.
10.1. Estimation
The intuition behind GMM estimation is to have an estimator of θ which sets the
sample equivalent of the moment condition equal to zero. I.e. choose θ so that
$$\frac{1}{T}\sum_{t=1}^T f(Y_{t-1}, y_t|\theta)\otimes x_{t-i} = \frac{1}{T}\sum_{t=1}^T u_t(\theta)$$
is set equal to zero. I'm now making explicit the dependence of the error on the choice of
$\theta$. Suppose $f$ is $m\times 1$ and $x_{t-i}$ is $n\times 1$. If $\theta$ is $k\times 1$ then, in general, we can only set
the average of the $u_t$ equal to zero if $k = mn$. Usually what we have is a situation where
$k < mn$. As a result, we can only set $(1/T)\sum u_t$ close to zero. But how? One way to
proceed is to minimize
$$J = \Big[\frac{1}{T}\sum_{t=1}^T u_t(\theta)\Big]'W_T\Big[\frac{1}{T}\sum_{t=1}^T u_t(\theta)\Big]$$
where $W_T$ is some positive definite symmetric matrix which may be sample dependent.
What is the first order condition?
$$\frac{\partial J}{\partial\theta} = \underbrace{\frac{1}{T}\sum_{t=1}^T \frac{\partial u_t(\theta)}{\partial\theta}'}_{k\times mn}\;\underbrace{W_T}_{mn\times mn}\;\underbrace{\frac{1}{T}\sum_{t=1}^T u_t(\theta)}_{mn\times 1} = 0$$
As you can see, the first-order condition sets $k$ linear combinations of the $mn$ restrictions
to zero.
10.2. Asymptotics
It turns out that as long as $u_t(\theta_0)$ is stationary, the $\hat\theta$ which solves the first order
condition will be consistent. As a result,
$$\frac{1}{T}\sum_{t=1}^T \frac{\partial u_t(\hat\theta)}{\partial\theta} \xrightarrow{p} D_0$$
where
$$D_0 = E\frac{\partial u_t(\theta_0)}{\partial\theta}.$$
Furthermore, with the assumption that $u_t(\theta_0)$ is stationary we can assume that
$$\frac{1}{\sqrt{T}}\sum_{t=1}^T u_t(\theta_0) \xrightarrow{d} N(0, S_0)$$
where
$$S_0 = E\Big(\sum_{j=-\infty}^{\infty} u_t(\theta_0)u_{t-j}(\theta_0)'\Big).$$
It turns out that if $W_T$ is chosen (optimally) to be a consistent estimator of $S_0^{-1}$, then
$$\sqrt{T}(\hat\theta - \theta_0) \xrightarrow{d} N\Big[0, (D_0'S_0^{-1}D_0)^{-1}\Big].$$
Of course you'd have to estimate the variance covariance matrix as usual. This is usually
done by computing the sample counterparts of $D_0$ and $S_0$.
10.3. Examples
Some examples serve to illustrate GMM.
10.3.1. Estimating the Mean of a Random Variable
Suppose we are interested in estimating the mean of a scalar random variable $x_t$. If
we parameterize its mean as $\mu$, we have the moment restriction
$$E(x_t - \mu) = 0.$$
As a result, $mn = 1$ and $k = 1$, so that $\mu$ is exactly identified. The GMM estimator for $\mu$
is simply the sample mean
$$\hat\mu = \bar x = \frac{1}{T}\sum_{t=1}^T x_t.$$
Clearly, $D_0 = -1$. Also, assuming $x_t$ is stationary, we have
$$S_0 = E\Big(\sum_{j=-\infty}^{\infty}(x_t - \mu_0)(x_{t-j} - \mu_0)\Big).$$
One special case would be that $x_t$ is i.i.d. In this case,
$$S_0 = E\big[(x_t - \mu_0)^2\big] = \sigma^2 \equiv \mathrm{Var}(x_t).$$
Thus, in the i.i.d. case we have
$$\sqrt{T}(\hat\mu - \mu_0) \xrightarrow{d} N(0, \sigma^2),$$
which is the familiar result from undergraduate statistics, where $\hat\mu$ is casually written as
having a normal distribution with variance $\sigma^2/T$ in large samples.
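When $x_t$ is autocorrelated the whole sum defining $S_0$ matters. A common way to estimate it is a weighted sum of sample autocovariances (the Newey-West/Bartlett estimator; this particular choice is my addition, not something derived in these notes). A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(10)
T, mu0 = 5000, 2.0
# an autocorrelated x_t, so that S_0 exceeds the plain variance
x = np.empty(T)
x[0] = mu0 + rng.normal()
for t in range(1, T):
    x[t] = mu0 + 0.5 * (x[t - 1] - mu0) + rng.normal()

mu_hat = x.mean()               # the GMM estimator: sets the sample moment to zero
u = x - mu_hat

# Newey-West estimate of S_0 = sum_j E(u_t u_{t-j}), Bartlett weights, lag window L
L = 20
S = u @ u / T
for j in range(1, L + 1):
    w = 1 - j / (L + 1)
    S += 2 * w * (u[j:] @ u[:-j]) / T

se = np.sqrt(S / T)             # asymptotic standard error of mu_hat
print(mu_hat, se)               # se exceeds the naive std(x)/sqrt(T) here
```

With positive autocorrelation the long-run variance $S_0$ is larger than $\mathrm{Var}(x_t)$, so the naive i.i.d. standard error would be too small.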
10.3.2. The Linear Model With Stochastic Regressors
Suppose the model is $y_t = x_t'\beta + \epsilon_t$ where $\epsilon_t$ is white noise. Clearly $E_t(y_t - x_t'\beta) = 0$.
Of course this implies that $E_tx_t(y_t - x_t'\beta) = Ex_t(y_t - x_t'\beta) = 0$. This is a case where $m = 1$
and $n = k$, so we have exact identification. Therefore we can set
$$\frac{1}{T}\sum_{t=1}^T x_t(y_t - x_t'\beta) = 0$$
by choosing $\hat\beta = (\sum_{t=1}^T x_tx_t')^{-1}\sum_{t=1}^T x_ty_t$. What's the variance-covariance matrix? It
had better be the usual one. We have
$$\frac{\partial u_t}{\partial\beta} = -x_tx_t'$$
so that $D_0 = -\Sigma_{xx}$. What about $S_0$? We have $E(u_tu_t') = E(x_t\epsilon_t^2x_t') = \sigma^2\Sigma_{xx}$, while
$E(u_tu_{t-i}') = 0$ for $i \neq 0$. Therefore, $S_0 = \sigma^2\Sigma_{xx}$, and the variance-covariance
matrix is $(D_0'S_0^{-1}D_0)^{-1} = \sigma^2\Sigma_{xx}^{-1}$.
10.3.3. The Consumption-Based Asset Pricing Model
Suppose consumers have time-separable constant relative risk aversion preferences, so
that $u(c) = c^{1-\gamma}/(1-\gamma)$. Then the intertemporal Euler equation which governs the
behavior of returns is
$$1 = E_t\beta z_{t+1}^{-\gamma}R_{t+1},$$
where $z_{t+1} = c_{t+1}/c_t$. In this case, the parameters of interest are $\beta$ and $\gamma$. To
estimate these parameters, the econometrician can use Euler equations for different assets,
thus increasing the number of moment restrictions, or can identify
different variables $x_t$ that are in the time $t$ information set. As a result, the econometrician
uses a set of equations given by
$$E\big[(\beta z_{t+1}^{-\gamma}R_{t+1} - 1)\otimes x_t\big] = 0,$$
where $x_t$ is $n\times 1$. Typical choices for the elements of $x_t$ would be lagged values of
consumption growth and the asset returns. In general, $R_{t+1}$ can be an $m\times 1$ vector, so that we have $mn$ moment restrictions. As a result, there will be $mn - 2$
overidentifying restrictions.
10.4. Testing With Overidentifying Restrictions
Last, but not least, when $mn > k$ and the weighting matrix is chosen optimally, it
turns out that $TJ(\hat\theta) \xrightarrow{d} \chi^2(mn - k)$. This provides a statistic which can be used to test the
overidentifying moment conditions. The idea is that when the overidentifying restrictions
hold, the minimized value of $J$ should be small even though it won't be exactly zero. The
factor $T$ is required, since $J$ converges to 0 with probability 1 under the null hypothesis
that the moment restrictions all hold. On the other hand, $TJ$ converges to a random
variable which can be used to test this hypothesis. When $TJ$ is large, the hypothesis is
rejected.
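To see the test in action, here is a sketch of two-step GMM in a linear model with one endogenous regressor and two instruments, so $mn = 2$, $k = 1$ and there is one overidentifying restriction. All the data-generating numbers are made up, and the moments are serially uncorrelated so no HAC correction is needed for $S_0$:

```python
import numpy as np

rng = np.random.default_rng(11)
T, theta0 = 5000, 1.5

# linear model with an endogenous regressor and two valid instruments z
z = rng.normal(size=(T, 2))
eps = rng.normal(size=T)
x = z @ np.array([1.0, 0.5]) + 0.7 * eps + rng.normal(size=T)
y = theta0 * x + eps

def gbar(th):                        # (1/T) sum z_t (y_t - th x_t): the mn-vector
    return z.T @ (y - th * x) / T

a, b = z.T @ y / T, z.T @ x / T      # gbar(th) = a - th*b, linear in th

# step 1: identity weighting; step 2: optimal W = S^{-1}
th1 = (b @ a) / (b @ b)
u = z * (y - th1 * x)[:, None]       # u_t(th1), one row per observation
S = u.T @ u / T                      # sample S_0 (no autocorrelation terms here)
W = np.linalg.inv(S)
th2 = (b @ W @ a) / (b @ W @ b)

TJ = T * gbar(th2) @ W @ gbar(th2)   # overidentification test, chi-squared(1)
print(th2, TJ)                       # th2 near 1.5; TJ small when the model is true
```

Since the instruments really are valid here, `TJ` is typically far below the $\chi^2(1)$ critical value; feeding in an instrument correlated with $\epsilon_t$ would blow it up.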