III. General Regression (transcript)
James B. McDonald
Brigham Young University
9/29/2010
III. Classical Normal Linear Regression Model Extended to the Case of k
Explanatory Variables
A. Basic Concepts
Let y denote an n × 1 vector of random variables, i.e., y = (y1, y2, . . ., yn)'.
1. The expected value of y is defined by

   E(y) = (E(y1), E(y2), . . ., E(yn))'.
2. The variance of the vector y is defined by

            | Var(y1)      Cov(y1, y2)   . . .  Cov(y1, yn) |
   Var(y) = | Cov(y2, y1)  Var(y2)       . . .  Cov(y2, yn) |
            | . . .                                         |
            | Cov(yn, y1)  Cov(yn, y2)   . . .  Var(yn)     |
NOTE: Let μ = E(y); then

   Var(y) = E[(y - μ)(y - μ)']

              | y1 - μ1 |
          = E |  . . .  | (y1 - μ1, . . ., yn - μn)
              | yn - μn |

            | E(y1 - μ1)²           E(y1 - μ1)(y2 - μ2)  . . .  E(y1 - μ1)(yn - μn) |
          = | E(y2 - μ2)(y1 - μ1)   E(y2 - μ2)²          . . .  E(y2 - μ2)(yn - μn) |
            | . . .                                                                 |
            | E(yn - μn)(y1 - μ1)   E(yn - μn)(y2 - μ2)  . . .  E(yn - μn)²         |

            | Var(y1)      Cov(y1, y2)   . . .  Cov(y1, yn) |
          = | Cov(y2, y1)  Var(y2)       . . .  Cov(y2, yn) | .
            | . . .                                         |
            | Cov(yn, y1)  Cov(yn, y2)   . . .  Var(yn)     |
3. The n × 1 vector of random variables, y, is said to be distributed as a multivariate
normal with mean vector μ and variance-covariance matrix Σ (denoted y ~
N(μ, Σ)) if the probability density function of y is given by

   f(y; μ, Σ) = exp[-(1/2)(y - μ)'Σ⁻¹(y - μ)] / [(2π)^(n/2) |Σ|^(1/2)].
Special case (n = 1): y = (y1), μ = (μ1), Σ = (σ²):

   f(y1; μ1, σ²) = exp[-(y1 - μ1)²/(2σ²)] / (2πσ²)^(1/2).
4. Some Useful Theorems
a. If y ~ N(μy, Σy), then z = Ay ~ N(μz = Aμy; Σz = AΣyA') where A is a
matrix of constants.
b. If y ~ N(0,I) and A is a symmetric idempotent matrix, then y'Ay ~ χ2(m)
where m = Rank(A) = trace (A).
c. If y ~ N(0,I) and L is a k x n matrix of rank k, then Ly and y'Ay are
independently distributed if LA = 0.
d. If y ~ N(0,I), then the idempotent quadratic forms y'Ay and y'By are
independently distributed χ2 variables if AB = 0.
NOTE:
(1) Proof of (a)
(2) Example: Let y1, . . ., yn denote a random sample drawn from
N(μ,ζ2).
The "Useful" Theorem 4.a implies that:

   ȳ = (1/n)y1 + . . . + (1/n)yn = (1/n, . . ., 1/n)y ~ N(μ, σ²/n).
Verify that
E(z) = E(Ay) = AE(y) = Aµy
VAR(z) = E[(z - E(z))(z - E(z))']
= E[(Ay - Aµy)(Ay - Aµy)']
= E[A(y - µy)(y - µy)'A']
= AE[(y - µy)(y - µy)']A'
= AΣyA' = Σz.
Here

   y = (y1, . . ., yn)' ~ N((μ, . . ., μ)', σ²I),

so with A = (1/n, . . ., 1/n):

   (a) Aμy = (1/n, . . ., 1/n)(μ, . . ., μ)' = μ

   (b) AΣyA' = σ²(1/n, . . ., 1/n)I(1/n, . . ., 1/n)' = σ²/n.
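Theorem 4.a can be checked numerically for the sample-mean example. The sketch below uses made-up values (μ = 3, σ² = 5, n = 4; all of these are assumptions, not from the notes) and exact rational arithmetic to form Aμy and AΣyA':

```python
from fractions import Fraction

# Illustrative values (assumptions, not from the notes).
n = 4
mu = Fraction(3)
sigma2 = Fraction(5)

A = [Fraction(1, n)] * n                       # A = (1/n, ..., 1/n), a 1 x n matrix
mu_y = [mu] * n                                # E(y) = (mu, ..., mu)'
Sigma_y = [[sigma2 if i == j else Fraction(0)  # Var(y) = sigma^2 I
            for j in range(n)] for i in range(n)]

# mean of z = Ay is A mu_y
mu_z = sum(a * m for a, m in zip(A, mu_y))

# variance of z = Ay is A Sigma_y A' (a 1 x 1 quantity)
A_Sigma = [sum(A[i] * Sigma_y[i][j] for i in range(n)) for j in range(n)]
var_z = sum(A_Sigma[j] * A[j] for j in range(n))

print(mu_z, var_z)   # 3 5/4, i.e., mu and sigma^2/n
```

The result is exactly μ and σ²/n, as Theorem 4.a predicts for this choice of A.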
B. The Basic Model
Consider the model defined by
(1) yt = β1xt1 + β2xt2 + . . . + βkxtk + εt (t = 1, . . ., n).
If we want to include an intercept, define xt1 = 1 for all t and we obtain
(2) yt = β1 + β2xt2 + . . . + βkxtk + εt.
Note that βi can be interpreted as the marginal impact of a unit increase in xi on the
expected value of y.
The error terms (εt) in (1) will be assumed to satisfy:
(A.1) εt is distributed normally
(A.2) E(εt) = 0 for all t
(A.3) Var(εt) = σ² for all t
(A.4) Cov(εt, εs) = 0 for t ≠ s.
Rewriting (1) for each t (t = 1, 2, . . ., n) we obtain
y1 = β1x11 + β2x12 + . . . + βkx1k + ε1
y2 = β1x21 + β2x22 + . . . + βkx2k + ε2
. . . .
. . . .
(3) . . . .
yn = β1xn1 + β2xn2 + . . . + βkxnk + εn.
The system of equations (3) is equivalent to the matrix representation
y = Xβ + ε
where the matrices y, X, β and ε are defined as follows:
        | y1 |         | x11 . . . x1k |        | β1 |        | ε1 |
   y =  | .  |,   X =  | x21 . . . x2k |,  β =  | .  |,  ε =  | .  |
        | .  |         |  .         .  |        | .  |        | .  |
        | yn |         | xn1 . . . xnk |        | βk |        | εn |
       (n×1)               (n×k)

Each column of X contains the n observations on one of the k individual variables;
each row contains the observations at a given point in time.

NOTE: (1) Assumptions (A.1)-(A.4) can be written much more compactly as

   (A.1)' ε ~ N(0; Σ = σ²I).

(2) The model to be discussed can then be summarized as

   y = Xβ + ε
   (A.1)' ε ~ N(0; Σ = σ²I)
   (A.5)' The xtj's are nonstochastic and lim(n→∞) X'X/n is nonsingular.
7 III
C. Estimation
We will derive the least squares, MLE, BLUE and instrumental variables estimators in
this section.
1. Least Squares:
The basic model can be written as
y = Xβ + ε
  = Xβ̂ + e = Ŷ + e

where Ŷ = Xβ̂ is an n×1 vector of predicted values for the dependent variable and
e denotes a vector of residuals or estimated errors.
The sum of squared errors is defined by

   SSE(β̂) = Σt et² = (e1, e2, . . ., en)(e1, e2, . . ., en)' = e'e
          = (y - Xβ̂)'(y - Xβ̂)
          = y'y - β̂'X'y - y'Xβ̂ + β̂'X'Xβ̂
          = y'y - 2β̂'X'y + β̂'X'Xβ̂.
The least squares estimator of β is defined as the β̂ which minimizes SSE(β̂). A
necessary condition for SSE(β̂) to be a minimum is that

   dSSE(β̂)/dβ̂ = 0

(see Appendix A for how to differentiate a real-valued function with respect to a
vector):

   dSSE(β̂)/dβ̂ = -2X'y + 2X'Xβ̂ = 0, or

   X'Xβ̂ = X'y        (Normal Equations)

   β̂ = (X'X)⁻¹X'y

is the least squares estimator. Note that β̂ is a vector of least squares estimators
of β1, β2, . . ., βk.
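As a concrete sketch (the data are invented, not from the notes), the normal equations can be solved directly for the intercept-plus-one-regressor case, where X'Xβ̂ = X'y is a 2×2 linear system in the sums of the data:

```python
# Invented data for y_t = b1 + b2 x_t + e_t, with X = [1 x].
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n = len(x)

# Elements of X'X and X'y reduce to sums.
sx = sum(x)
sxx = sum(v * v for v in x)
sy = sum(y)
sxy = sum(u * v for u, v in zip(x, y))

# Solve [n sx; sx sxx][b1; b2] = [sy; sxy] by Cramer's rule.
det = n * sxx - sx * sx
b1 = (sy * sxx - sx * sxy) / det   # intercept
b2 = (n * sxy - sx * sy) / det     # slope

print(b1, b2)   # 0.5 1.4
```

The same β̂ = (X'X)⁻¹X'y formula applies unchanged for any number of regressors; only the size of the linear system grows.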
2. Maximum Likelihood Estimation (MLE)

Likelihood Function: (Recall y ~ N(Xβ; Σ = σ²I))

   L(y; β, Σ = σ²I) = exp[-(1/2)(y - Xβ)'Σ⁻¹(y - Xβ)] / [(2π)^(n/2) |Σ|^(1/2)]

                    = exp[-(1/(2σ²))(y - Xβ)'(y - Xβ)] / [(2π)^(n/2) |σ²I|^(1/2)]

                    = exp[-(y - Xβ)'(y - Xβ)/(2σ²)] / [(2π)^(n/2) (σ²)^(n/2)].
The natural log of the likelihood function,

   ℓ = ln L = -(y - Xβ)'(y - Xβ)/(2σ²) - (n/2)ln(2π) - (n/2)ln(σ²)

            = -(1/(2σ²))(y'y - 2β'X'y + β'X'Xβ) - (n/2)ln(2π) - (n/2)ln(σ²),

is known as the log-likelihood function. ℓ is a function of β and σ².
The MLEs of β and σ², denoted β̃ and σ̃², are defined by the two equations
(necessary conditions for a maximum):

   ∂ℓ/∂β = -(1/(2σ²))(-2X'y + 2(X'X)β) = 0

   ∂ℓ/∂σ² = (y - Xβ)'(y - Xβ)/(2(σ²)²) - n/(2σ²) = 0

i.e.,

   β̃ = (X'X)⁻¹X'y

   σ̃² = (y - Xβ̃)'(y - Xβ̃)/n = e'e/n = Σt et²/n = SSE/n.

NOTE: (1) β̃ = β̂ (the MLE of β is the least squares estimator).

(2) σ̃² is a biased estimator of σ²; whereas,

   s² = e'e/(n - k) = (y - Xβ̂)'(y - Xβ̂)/(n - k) = SSE/(n - k)

is an unbiased estimator of σ². A proof of the unbiasedness of s² is given in
Appendix B. Only n - k of the estimated residuals are independent. The
necessary conditions for least squares estimates impose k restrictions on the
estimated residuals (e). The restrictions are summarized by the normal
equations X'Xβ̂ = X'y, or equivalently X'e = 0.

(3) Substituting σ² = SSE/n into the log-likelihood function yields what is
known as the concentrated log-likelihood function

   ℓc = -(n/2)[1 + ln(2π) + ln(SSE/n)],

which expresses the log-likelihood value as a function of β only. This
equation also clearly demonstrates the equivalence of maximizing ℓ and
minimizing SSE.
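A quick numerical sketch of notes (1)-(3), using invented data whose OLS fit is ŷ = 0.5 + 1.4x with k = 2: the MLE σ̃² = SSE/n is smaller than the unbiased s² = SSE/(n - k), and the concentrated log-likelihood equals the full log-likelihood evaluated at the MLE:

```python
import math

# Invented data; the OLS fit for these points is yhat = 0.5 + 1.4x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, k = len(x), 2

e = [yt - (0.5 + 1.4 * xt) for xt, yt in zip(x, y)]   # residuals
sse = sum(v * v for v in e)                           # SSE (about 0.2)

sigma2_mle = sse / n        # biased MLE estimator SSE/n
s2 = sse / (n - k)          # unbiased estimator SSE/(n - k)

# Full log-likelihood at the MLE vs the concentrated form.
ll_full = (-sse / (2 * sigma2_mle)
           - (n / 2) * math.log(2 * math.pi)
           - (n / 2) * math.log(sigma2_mle))
ll_conc = -(n / 2) * (1 + math.log(2 * math.pi) + math.log(sse / n))

print(round(sigma2_mle, 6), round(s2, 6), abs(ll_full - ll_conc) < 1e-9)
```

Here σ̃² = 0.05 versus s² = 0.1: with only n = 4 observations and k = 2 coefficients, dividing by n rather than n - k understates the error variance by half.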
3. BLUE Estimators of β, denoted β*.

We will demonstrate that assumptions (A.2)-(A.5) imply that the best
(least variance) linear unbiased estimator (BLUE) of β is the least squares
estimator. We first consider the desired properties and then derive the associated
estimator.

Linear:  β* = Ay, where A is a k×n matrix of constants.

Unbiased:  E(β*) = AE(y) = AXβ. We note that E(β*) = β requires AX = I.

Minimum Variance:

   Var(βi*) = Ai Var(y) Ai' = σ²AiAi'

where Ai = the ith row of A and βi* = Aiy.

Thus, the construction of the BLUE is equivalent to selecting the matrix A so that the
rows of A solve

   min AiAi'   i = 1, 2, . . ., k
   s.t. AX = I

or equivalently, min Var(βi*) s.t. AX = I (unbiased). The solution to this problem
is given by A = (X'X)⁻¹X'; hence, the BLUE of β is given by

   β* = Ay = (X'X)⁻¹X'y.

The details of this derivation are contained in Appendix C.

NOTE: (1) β* = β̂ = β̃ = (X'X)⁻¹X'y.

(2) AX = (X'X)⁻¹X'X = I; thus β* is unbiased.
4. Method of Moments Estimation
Method of moments parameter estimators are selected to equate sample and
corresponding theoretical moments. The open question is what theoretical
moments should be considered and what are the corresponding sample moments.
With the regression model we might consider the following theoretical moments
which follow from the underlying theoretical assumptions:
   (A.2) E(εt) = 0

   (A.5) Cov(Xit, εt) = 0.

The sample moment associated with (A.2) is

   Σt et / n = ē = 0.

The sample covariances can be written as

   Σt (Xit - X̄i)(et - ē)/n = Σt Xit et/n - X̄i ē = Σt Xit et/n = 0.

These sample moments can be summarized using matrix notation as follows:

             | 1    1    . . .  1   | | e1 |
   X'e/n  =  | x12  x22  . . .  xn2 | | .  | / n = 0,
             | .    .    . . .  .   | | .  |
             | x1k  x2k  . . .  xnk | | en |

which is equivalent to X'e = 0. These are also known as the normal equations in the
OLS framework and yield the OLS estimator by solving

   X'e = X'(Y - Xβ̂) = 0

for β̂.
5. Instrumental Variables Estimators
y = Xβ + ε
Let Z denote an n x k matrix of “instruments” or "instrumental" variables.
Consider the solution of the modified normal equations:

   Z'Y = Z'Xβ̂z; hence, β̂z = (Z'X)⁻¹Z'y.

β̂z is referred to as the instrumental variables estimator of β based on the
instrumental variables Z. Instrumental variables can be very useful if the
variables on the right hand side include "endogenous" variables or in the case of
measurement error. In this case OLS will yield biased and inconsistent
estimators; whereas, instrumental variables can yield consistent estimators.
NOTE: (1) The motivation for the selection of the instruments (Z) is
that the covariance of Z and ε approaches 0 and Z and X are
correlated. Thus Z'Y = Z'(Xβ + ε) = Z'Xβ + Z'ε ≈ Z'Xβ.

(2) If lim(n→∞) Z'X/n is nonsingular and plim Z'ε/n = 0, then
β̂z is a consistent estimator of β.
(3) Many calculate an R2 after instrumental variables
estimation using the formula R2 = 1 – SSE/SST. Since this
can be negative, there is not a natural interpretation of R2
for instrumental variables estimators. Further, the R2 can’t
be used to construct F-statistics for IV estimators.
(4) If Z includes "weak" instruments (weakly correlated
with the X's), then the variances of the IV estimator can
be large and the corresponding asymptotic biases can be
large if the Z and error are correlated. This can be
seen by noting that the bias of the instrumental variables
estimator is given by

   E(β̂z) - β = E[(Z'X/n)⁻¹(Z'ε/n)].

(5) As a special case, if Z = X, then β̂z = β̂ = β̃ = β*.
(6) If Z is an n × k* matrix where k < k* (Z contains more
variables than X), then the IV estimator defined above must
be modified. The most common approach in this case is to
replace Z in the "IV" equation by the projections** of X on
the columns of Z, i.e., X̂ = Z(Z'Z)⁻¹Z'X. This substitution
yields the IV estimator

   β̂IV = (X̂'X̂)⁻¹X̂'Y
        = [X'Z(Z'Z)⁻¹Z'X]⁻¹X'Z(Z'Z)⁻¹Z'Y,

which yields estimates for k ≤ k*.
The Stata command for the instrumental variables estimator
is given by

   ivregress estimator depvar (varlist_1 = varlist_iv) [varlist_2]

where estimator = 2sls, gmm, or liml, with 2sls the default,
for the model

   depvar = (varlist_1)b1 + (varlist_2)b2 + error

where varlist_iv are the instrumental variables for varlist_1.
A specific example is given by:

   ivregress 2sls y1 (y2 = z1 z2 z3) x1 x2 x3

Identical results could be obtained with the command

   ivregress 2sls y1 (y2 x1 x2 x3 = z1 z2 z3)

which is equivalent to regressing all of the right hand side
variables on the set of instrumental variables. This can be
thought of as being of the form

   ivregress 2sls y (X = Z)
**The projections of X on Z can be obtained by obtaining estimates of Π
in the "reduced form" equation X = ZΠ + V to yield Π̂ = (Z'Z)⁻¹Z'X; hence, the
estimate of X is given by

   X̂ = ZΠ̂ = Z(Z'Z)⁻¹Z'X.
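A minimal numerical sketch of the IV formula (the data are invented; one regressor and no intercept, so (Z'X)⁻¹Z'y collapses to a ratio of sums). It also illustrates note (5): choosing Z = X reproduces the least squares estimator:

```python
# Invented data: one regressor, no intercept.
x = [1.0, 2.0, 3.0]
y = [2.0, 3.0, 5.0]

def iv_estimate(z):
    """beta_z = (Z'X)^{-1} Z'y, i.e., sum(z*y)/sum(z*x) in this scalar case."""
    return sum(zt * yt for zt, yt in zip(z, y)) / sum(zt * xt for zt, xt in zip(z, x))

b_ols = iv_estimate(x)                # Z = X gives OLS: 23/14
b_iv = iv_estimate([1.0, 1.0, 1.0])   # a constant instrument: 10/6

print(round(b_ols, 4), round(b_iv, 4))   # 1.6429 1.6667
```

Different choices of instrument produce different estimates in any finite sample; consistency of β̂z rests on the instrument conditions in notes (1)-(2) above.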
D. Distribution of β̂ (= β̃ = β*)

Recall that under the assumptions (A.1)-(A.5), y ~ N(Xβ, Σ = σ²I) and

   β̂ = β̃ = β* = (X'X)⁻¹X'y;

hence, by useful theorem (II'.A.4.a), we conclude that

   β̂ ~ N(Aμy, AΣyA') = N[AXβ, σ²AIA']

where A = (X'X)⁻¹X'. The desired derivations can be simplified by noting that

   AXβ = (X'X)⁻¹X'Xβ = β

   σ²AA' = σ²(X'X)⁻¹X'[(X'X)⁻¹X']'
         = σ²(X'X)⁻¹X'X[(X'X)⁻¹]'
         = σ²[(X'X)⁻¹]'
         = σ²[(X'X)']⁻¹
         = σ²(X'X)⁻¹.

Therefore,

   β̂ ~ N(β; σ²(X'X)⁻¹).
NOTE: (1) σ²(X'X)⁻¹ can be shown to be the Cramer-Rao matrix, the matrix
of lower bounds for the variances of unbiased estimators.

(2) β̂ = β̃ = β* are
   - unbiased
   - consistent
   - minimum variance of all (linear and nonlinear) unbiased estimators
   - normally distributed.

(3) An unbiased estimator of σ²(X'X)⁻¹ is given by

   s²(X'X)⁻¹

where s² = e'e/(n - k); this is the formula used to calculate the
"estimated variance covariance matrix" in many computer programs.

(4) To report s²(X'X)⁻¹ in Stata type

   . reg y x
   . estat vce
(5) Distribution of the variance estimator:

   (n - k)s²/σ² ~ χ²(n - k).

NOTE: This can be proven using the theorem (II'.A.4(b)) and noting that

   (n - k)s² = e'e = (Y - Xβ̂)'(Y - Xβ̂)
             = (Xβ + ε)'(I - X(X'X)⁻¹X')(Xβ + ε)
             = ε'(I - X(X'X)⁻¹X')ε.

Therefore,

   (n - k)s²/σ² = (ε/σ)'(I - X(X'X)⁻¹X')(ε/σ) = ν'Mν

where ν = ε/σ ~ N[0, I]; hence

   (n - k)s²/σ² ~ χ²(n - k)

because M is idempotent with rank and trace equal to n - k.
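The properties of M can be checked numerically: for any design matrix, M = I - X(X'X)⁻¹X' should satisfy M² = M with trace n - k. A sketch with an invented 4×2 X (intercept plus one regressor; the 2×2 inverse is computed by hand):

```python
# Invented design matrix: intercept plus one regressor, n = 4, k = 2.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
n, k = 4, 2

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Xt = [list(r) for r in zip(*X)]                      # X'
XtX = matmul(Xt, X)                                  # 2x2
d = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]    # determinant
XtX_inv = [[XtX[1][1] / d, -XtX[0][1] / d],
           [-XtX[1][0] / d, XtX[0][0] / d]]
H = matmul(matmul(X, XtX_inv), Xt)                   # projection onto columns of X
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]

MM = matmul(M, M)
max_diff = max(abs(MM[i][j] - M[i][j]) for i in range(n) for j in range(n))
trace_M = sum(M[i][i] for i in range(n))
print(max_diff < 1e-9, round(trace_M, 9))   # True 2.0
```

Idempotency plus trace n - k is exactly what useful theorem (b) needs for ν'Mν to be χ²(n - k).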
E. Statistical Inference
1. Ho: β2 = β3 = . . . = βk = 0
This hypothesis tests for the statistical significance of overall explanatory power
of the explanatory variables by comparing the model with all variables included to
the model without any of the explanatory variables, i.e., yt = β1 + εt (all non-
intercept coefficients = 0). Recall that the total sum of squares (SST) can be
partitioned as follows:
   Σt (yt - ȳ)² = Σt (yt - ŷt)² + Σt (ŷt - ȳ)²

or

   SST = SSE + SSR.

Dividing both sides of the equation by σ² yields quadratic forms, each having a
chi-square distribution:

   SST/σ² = SSE/σ² + SSR/σ²

   χ²(n - 1) = χ²(n - k) + χ²(k - 1).
This result provides the basis for using

   F = [SSR/(k - 1)] / [SSE/(n - k)] ~ F(k - 1, n - k)

to test the hypothesis that β2 = β3 = . . . = βk = 0.

NOTE: (1) R² = SSR/SST = 1 - SSE/SST; hence, the F-statistic for this hypothesis
can also be rewritten as

   F = [R²/(k - 1)] / [(1 - R²)/(n - k)] ~ F(k - 1, n - k).

Recall that this decomposition of SST can be summarized in an ANOVA table as
follows:

   Source of Variation    SS     d.f.     MSE
   Model                  SSR    K - 1    SSR/(K - 1)
   Error                  SSE    n - K    SSE/(n - K) = s²
   Total                  SST    n - 1

   K = number of coefficients in model

where the ratio of the model and error MSE's yields the F statistic just discussed.
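The decomposition and the two F formulas can be illustrated with invented data (the OLS fit for these points is ŷ = 0.5 + 1.4x, with K = 2 coefficients): SST splits into SSR + SSE, and the SSR/SSE and R² forms of the F statistic agree:

```python
# Invented data; the fitted OLS line is yhat = 0.5 + 1.4x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, K = len(y), 2
ybar = sum(y) / n
yhat = [0.5 + 1.4 * xt for xt in x]

sst = sum((yt - ybar) ** 2 for yt in y)
sse = sum((yt - yh) ** 2 for yt, yh in zip(y, yhat))
ssr = sum((yh - ybar) ** 2 for yh in yhat)

F_anova = (ssr / (K - 1)) / (sse / (n - K))   # ratio of the two MSE's
r2 = ssr / sst
F_r2 = (r2 / (K - 1)) / ((1 - r2) / (n - K))  # same statistic via R^2

print(sst, round(F_anova, 6), round(F_r2, 6))   # 10.0 98.0 98.0
```

Here SST = 10 = 9.8 + 0.2, R² = 0.98, and both versions of F equal 98.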
Additionally, remember that the adjusted R² (R̄²), defined by

   R̄² = 1 - [Σt et²/(n - K)] / [Σt (Yt - Ȳ)²/(n - 1)],

will only increase with the addition of a new variable if the t-statistic associated with
the new variable is greater than 1 in absolute value. This result follows from the
equation

   1 - R̄²New = [(n - 1)/(n - K - 1)](SSENew/SST)
             = (1 - R̄²Old)(n - K)/(n - K - 1 + t²Newvar),

where tNewvar = β̂Newvar/s_β̂Newvar, K denotes the number of coefficients in the "old"
regression model and the "new" regression model includes K + 1 coefficients; the last
factor shows that R̄²New exceeds R̄²Old exactly when t²Newvar > 1.
The Lagrangian Multiplier (LM) and likelihood ratio (LR) tests can also be
used to test this hypothesis, where

   LM = NR² ~a χ²(k - 1)

   LR = -N ln(1 - R²) ~a χ²(k - 1).
2. Testing hypotheses involving individual βi's

Recall that

   β̂ ~ N(β; σ²(X'X)⁻¹)

where

                | σ²β̂1     σβ̂1β̂2   . . .  σβ̂1β̂k |
   σ²(X'X)⁻¹ =  | σβ̂2β̂1   σ²β̂2     . . .  σβ̂2β̂k |
                | . . .                            |
                | σβ̂kβ̂1   σβ̂kβ̂2   . . .  σ²β̂k   |

which can be estimated by

                | s²β̂1     sβ̂1β̂2   . . .  sβ̂1β̂k |
   s²(X'X)⁻¹ =  | sβ̂2β̂1   s²β̂2     . . .  sβ̂2β̂k | .
                | . . .                            |
                | sβ̂kβ̂1   sβ̂kβ̂2   . . .  s²β̂k   |

Hypotheses of the form H0: βi = β0i can be tested using the result

   (β̂i - β0i)/s_β̂i ~ t(n - k).

The validity of this distributional result follows from

   N(0,1) / [χ²(d)/d]^(1/2) ~ t(d)

since

   (β̂i - βi)/σ_β̂i ~ N(0,1)   and   (n - k)s²β̂i/σ²β̂i ~ χ²(n - k).
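A worked sketch with invented data (same toy regression as earlier; k = 2): the slope's standard error comes from the corresponding diagonal element of s²(X'X)⁻¹, which for simple regression is s²/Σ(xt - x̄)². With only one slope, t² equals the overall F statistic:

```python
import math

# Invented data; OLS fit is b1 = 0.5, b2 = 1.4 with k = 2 coefficients.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n, k = len(x), 2
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xt - xbar) ** 2 for xt in x)
sxy = sum((xt - xbar) * (yt - ybar) for xt, yt in zip(x, y))
b2 = sxy / sxx
b1 = ybar - b2 * xbar

sse = sum((yt - b1 - b2 * xt) ** 2 for xt, yt in zip(x, y))
s2 = sse / (n - k)                  # unbiased estimate of sigma^2
se_b2 = math.sqrt(s2 / sxx)         # estimated standard error of b2
t = b2 / se_b2                      # compare with t(n - k) to test H0: beta2 = 0

print(round(b2, 6), round(t ** 2, 6))   # 1.4 98.0
```

The squared t of about 98 matches the F statistic from the ANOVA illustration, as it must when k = 2.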
3. Tests of hypotheses involving linear combinations of coefficients

A linear combination of the βi's can be written as

   δ'β = (δ1, . . ., δk)(β1, . . ., βk)' = Σi δiβi.

We now consider testing hypotheses of the form

   H0: δ'β = γ.

Recall that

   β̂ ~ N(β; σ²(X'X)⁻¹);

therefore,

   δ'β̂ ~ N(δ'β; σ²δ'(X'X)⁻¹δ);

hence,

   (δ'β̂ - δ'β)/s_δ'β̂ = (δ'β̂ - γ)/[s²δ'(X'X)⁻¹δ]^(1/2) ~ t(n - k).

The t-test of a hypothesis involving a linear combination of the coefficients
involves running one regression and estimating the variance of δ'β̂ from s²(X'X)⁻¹
to construct the test statistics.
4. More general tests
a. Introduction
We have considered tests of the overall explanatory power of the
regression model (Ho: β2 = β3 = . . . βk = 0), tests involving individual parameters
(e.g., Ho: β3 = 6), and testing the validity of a linear constraint on the coefficients
(Ho: δ’β = γ). In this section we will consider how more general tests can be
performed. The testing procedures will be based on the Chow and Likelihood
ratio (LR) tests. The hypotheses may be of many different types and involve the
previous tests as special cases. Other examples might include joint hypotheses of
the form: Ho: β2 + 6 β5 = 4, β3 = β7 = 0. The basic idea is that if the hypothesis is
really valid, then goodness of fit measures such as SSE, R² and log-likelihood
values (ℓ) will not be significantly impacted by imposing the valid hypothesis in
estimation. Hence, the SSE, R² or ℓ values will not be significantly different for
constrained (via the hypothesis) and unconstrained estimation of the underlying
regression model. The tests of the validity of the hypothesis are based on
constructing test statistics, with known exact or asymptotic distributions, to
evaluate the statistical significance of changes in SSE, R², or ℓ.
Consider the model
y = X β + ε
and a hypothesis, Ho: g(β) = 0 which imposes individual and/or multiple
constraints on the β vector.
The Chow and likelihood ratio tests for testing Ho: g(β) = 0 can be
constructed from the output obtained from estimating the two following
regression models.
(1) Estimate the regression model y = Xβ + ε without imposing any
constraints on the vector β. Let the associated sum of square errors,
coefficient of determination, log-likelihood value and degrees of freedom
be denoted by SSE, R², ℓ, and (n - k).

(2) Estimate the same regression model where the β is constrained as
specified by the hypothesis (Ho: g(β) = 0) in the estimation process. Let
the associated sum of squared errors, R², log-likelihood value and degrees
of freedom be denoted by SSE*, R²*, ℓ* and (n - k)*, respectively.
b. Chow test
The Chow test is defined by the following statistic:

   Chow = [(SSE* - SSE)/r] / [SSE/(n - k)] ~ F(r, n - k)

where r = (n - k)* - (n - k) is the number of independent restrictions imposed on β by
the hypothesis. For example, if the hypothesis was Ho: β2 + 6β5 = 4, β3 = β7 = 0,
then the numerator degrees of freedom (r) is equal to 3. In applications where the
SST is unaltered by imposing the restrictions, we can divide the numerator and
denominator by SST to yield the Chow test rewritten in terms of the change in the
R² between the constrained and unconstrained regressions:

   F = [(R² - R²*)/r] / [(1 - R²)/(n - k)] ~ F(r, n - k).

Note that if the hypothesis (H0: g(β) = 0) is valid, then we would expect R² (SSE)
and R²* (SSE*) to not be significantly different from each other. Thus, it is only
large values (greater than the critical value) of F which provide the basis for
rejecting the hypothesis. Again, the R² form of the Chow test is only valid if the
dependent variable is the same in the constrained and unconstrained regression.

References:
(1) Chow, G. C., "Tests of Equality Between Subsets of Coefficients in Two
Linear Regressions," Econometrica, 28 (1960), 591-605.
(2) Fisher, F. M., "Tests of Equality Between Sets of Coefficients in Two Linear
Regressions: An Expository Note," Econometrica, 38 (1970), 361-66.
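The agreement between the SSE and R² forms of the Chow statistic is easy to verify with hypothetical numbers (all invented: r = 3 restrictions, SSE = 20, SSE* = 26, n - k = 20, and a common SST = 100):

```python
# Invented values for a restricted/unrestricted pair of regressions.
sse, sse_star, sst = 20.0, 26.0, 100.0
r, n_minus_k = 3, 20

# SSE form of the Chow statistic.
F_sse = ((sse_star - sse) / r) / (sse / n_minus_k)

# R^2 form, valid because SST is the same in both regressions.
r2 = 1 - sse / sst
r2_star = 1 - sse_star / sst
F_r2 = ((r2 - r2_star) / r) / ((1 - r2) / n_minus_k)

print(F_sse, round(F_r2, 9))   # 2.0 2.0
```

An F of 2 would then be compared with the F(3, 20) critical value at the chosen significance level.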
c. Likelihood ratio (LR) test.

The LR test is a common method of statistical inference in classical
statistics. The motivation behind the LR test is similar to that of the Chow test
except that it is based on determining whether there has been a significant
reduction in the log-likelihood value as a result of imposing the
hypothesized constraints on β in the estimation process. The LR test statistic is
defined to be twice the difference between the unconstrained and constrained
log-likelihood values (2(ℓ - ℓ*)) and, under fairly general
regularity conditions, is asymptotically distributed as a chi-square with degrees of
freedom equal to the number of independent restrictions (r) imposed by the
hypothesis. This may be summarized as follows:

   LR = 2(ℓ - ℓ*) ~a χ²(r).

The LR test is more general than the Chow test, and for the case of
independent and identically distributed normal errors with known σ², LR is equal
to

   LR = [SSE* - SSE]/σ².

Recall that s² = SSE/(n - k) appears in the denominator of the Chow test statistic
and that for large values of (n - k), s² is "close" to σ²; hence, we can see the
similarity of the LR and Chow tests. If σ² is unknown, substituting the
concentrated log-likelihood function into LR yields

   LR = 2(ℓ - ℓ*)
      = n[ln(SSE*) - ln(SSE)]
      = n ln(SSE*/SSE).
If the hypothesis Ho: β2 = β3 = . . . = βk = 0 is being tested in the classical
normal linear regression model, then SSE* = SST and LR can be rewritten in
terms of the R² as follows:

   LR = n ln[1/(1 - R²)] = -n ln(1 - R²) ~a χ²(k - 1).

In this case, the Chow test is identical to the F test for overall explanatory power
discussed earlier.
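The equivalence of the two LR expressions can be sketched with invented values (n = 4, SSE = 0.2, SST = 10, and SSE* = SST under this H0):

```python
import math

# Invented values: under H0 (all slopes zero), SSE* = SST.
n = 4
sse, sst = 0.2, 10.0
r2 = 1 - sse / sst           # 0.98

lr_sse = n * math.log(sst / sse)   # n ln(SSE*/SSE)
lr_r2 = -n * math.log(1 - r2)      # -n ln(1 - R^2)

print(round(lr_sse, 6), round(lr_r2, 6))
```

Both expressions return the same value (about 15.65 here), which would be compared with a χ²(k - 1) critical value.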
Thus the Chow test and LR test are similar in structure and purpose. The
LR test is more general than the Chow test; however, its distribution is
asymptotically (not exact) chi-square even for non-normally distributed errors.
The LR test provides a unified method of testing hypotheses.
d. Applications of the Chow and LR tests:
(1) Model: yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt
Ho: β2 = β3 = 0 (two independent constraints)
(a) Estimate yt = β1 + β2xt2 + β3xt3 + β4xt4 + εt to obtain

   SSE = Σ et² = (n - 4)s²,   R²,
   ℓ = -(n/2)[1 + ln(2π) + ln(SSE/n)],
   n - k = n - 4.

(b) Estimate yt = β1 + β4xt4 + εt to obtain

   SSE* = Σ et*² = (n - 2)s*²,   R²*,   ℓ*   and   (n - k)* = n - 2.
(c) Construct the test statistics

   Chow = [(SSE* - SSE)/((n - k)* - (n - k))] / [SSE/(n - k)]
        = [(SSE* - SSE)/2] / [SSE/(n - 4)]
        = [(R² - R²*)/2] / [(1 - R²)/(n - 4)] ~ F(2, n - 4)

   LR = 2(ℓ - ℓ*) ~a χ²(2).
(2) Tests of equality of the regression coefficients in two different regression
models.

(a) Consider the two regression models

   y(1) = X(1)β(1) + ε(1)     (n1 observations, k independent variables)
   y(2) = X(2)β(2) + ε(2)     (n2 observations, k independent variables)

   Ho: β(1) = β(2)   (k independent restrictions)

(b) Rewrite the model as

   (1)'   y = | y(1) | = | X(1)   0   | | β(1) | + | ε(1) |
              | y(2) |   |  0    X(2) | | β(2) |   | ε(2) |

Estimate (1)' using least squares and determine SSE, R², ℓ
and (n - k) = n1 + n2 - 2k.

Now impose the hypothesis that β(1) = β(2) = β and write (1)' as

   (2)'   y = | y(1) | = | X(1) | β + | ε(1) |
              | y(2) |   | X(2) |     | ε(2) |

Estimate (2)' using least squares to obtain the constrained
sum of squared errors (SSE*), R²*, ℓ* and (n - k)* = n1 + n2 - k.

(c) Construct the test statistics

   Chow = [(SSE* - SSE)/((n - k)* - (n - k))] / [SSE/(n - k)]
        = [(R² - R²*)/k] / [(1 - R²)/(n1 + n2 - 2k)] ~ F(k, n1 + n2 - 2k)

   LR = 2(ℓ - ℓ*) ~a χ²(k).
5. Testing Hypotheses using Stata
a. Stata reports the log likelihood values when the command
estat ic
follows a regression command and can be used in constructing LR tests.
b. Stata can also perform many tests based on t or Chow-type tests.
Consider the model
(1) Yt = β1 + β2Xt2 + β3Xt3 + β4Xt4 + εt
with the hypotheses:
(2) H1: β2 = 1
H2: β3 = 0
H3: β3 + β4 = 1
H4: β3β4 = 1
H5: β2 = 1 and β3 = 0
The Stata commands to perform tests of these hypotheses follow OLS
estimation of the unconstrained model.
reg Y X2 X3 X4
estimates the unconstrained model
test X2 = 1 (Tests H1)
test X3 = 0 (Tests H2)
test X3 + X4 = 1 (Tests H3)
testnl _b[X3]*_b[X4] = 1 (Tests H4. The "testnl" command is
for testing nonlinear hypotheses. The
coefficients must be referenced with the
"_b[...]" notation, with brackets, when
testing nonlinear hypotheses)
test (X2 = 1) (X3 = 0) (Tests H5)
95% confidence intervals on coefficient estimates are automatically calculated in
Stata. To change the confidence level, use the “level” option as follows:
reg Y X2 X3 X4, level(90) (changes the confidence level
to 90%)
F. Stepwise Regression
Stepwise regression is a method for determining which variables might be
considered as being included in a regression model. It is a purely mechanical approach,
adding or removing variables in the model solely determined by their statistical
significance and not according to any theoretical reason. While stepwise regression can be
considered when deciding among many variables to include in a model, theoretical
considerations should be the primary factor for such a decision.
A stepwise regression may use forward selection or backward selection. Using
forward selection, a stepwise regression will add one independent variable at a time to see
if it is significant. If the variable is significant, it is kept in the model and another variable
is added. If the variable is not significant, or if a previously added variable becomes
insignificant, it is not included in the model. This process continues until no additional
variables are significant.
Stepwise regression using Stata
To perform a stepwise regression in Stata, use the following commands:
Forward:
stepwise, pe(#): reg dep_var indep_vars
stepwise, pe(#) lockterm: reg dep_var (forced in
variables) other indep_vars
Backward:
stepwise, pr(#): reg dep_var indep_vars
stepwise, pr(#) lockterm: reg dep_var (forced in
variables) other indep_vars
where the "#" in "pr(#)" is the significance level at which variables are removed,
e.g., 0.051, and the "#" in "pe(#)" is the significance level at which variables are
entered or added to the model. If pr(#1) and pe(#2) are both included in a stepwise
regression command, #1 must be greater than #2. Also, "dep_var" represents the
dependent variable, "forced in variables" represents the independent variables which
the user wishes to remain in the model no matter what their significance level may
be, and "other indep_vars" represents the other independent variables which the
stepwise regression will consider including or excluding. Forward and backward
stepwise regression may yield different results.
G. Forecasting
Let yt = F(Xt, β) + εt
denote the stochastic relationship between the variable yt and the vector of variables Xt
where Xt = (xt1,..., xtk). β represents a vector of unknown parameters.
Forecasts are generally made by estimating the vector of parameters β (by β̂),
determining the appropriate vector Xt (or an estimate X̂t) and then evaluating

   ŷt = F(X̂t, β̂).

The forecast error is FE = yt - ŷt.
There are at least four factors which contribute to forecast error.
1. Incorrect functional form (This is an example of specification error and will be
discussed later.)
2. Existence of random disturbance (εt)
Even if the "appropriate" future value of Xt and true parameter values, β,
were known with certainty
   FE = yt - ŷt = yt - F(Xt, β) = εt

   σ²FE = Variance(FE) = Var(εt) = σ².

In this case confidence intervals for yt would be obtained from

   Pr[F(Xt, β) - Z(α/2)σ < yt < F(Xt, β) + Z(α/2)σ] = 1 - α,

which could be visualized as follows for the linear case:

[Figure: constant-width confidence band around the population regression line,
plotted in the (Xt, Yt) plane.]
3. Uncertainty about β

Assume F(Xt, β) = Xtβ in the model yt = F(Xt, β) + εt; then the predicted
value of yt for a given value of Xt is given by

   ŷt = Xtβ̂,

and the variance of ŷt (sample regression line), σ²ŷt, is given by

   σ²ŷt = Xt Var(β̂) Xt' = σ²Xt(X'X)⁻¹Xt',

with the variance of the forecast error (actual y) given by:

   σ²FE = σ² + σ²ŷt.

Note that σ²FE takes account of the uncertainty associated with the unknown
regression line and the error term and can be used to construct confidence
intervals for the actual value of Y rather than just the regression line.
Unbiased sample estimators of σ²ŷt and σ²FE can be easily obtained by replacing σ²
with its unbiased estimator s².

Confidence intervals for E(Yt | Xt), the population regression line:

   Pr[Xtβ̂ - t(α/2) sŷt < E(Yt | Xt) < Xtβ̂ + t(α/2) sŷt] = 1 - α.

Confidence intervals for Yt:

   Pr[Xtβ̂ - t(α/2) sFE < Yt < Xtβ̂ + t(α/2) sFE] = 1 - α.
[Figure: confidence bands for the population regression line and for the actual
value of Yt, plotted in the (Xt, Yt) plane.]
4. A comparison of confidence intervals.

Some students have found the following table facilitates their understanding of the
different confidence intervals for the population regression line and actual value of Y.
The column for the estimated coefficients is only included to compare the
organizational parallels between the different confidence intervals.

Estimated coefficients:
   Statistic:     β̂ = (X'X)⁻¹X'Y
   Distribution:  β̂ ~ N(β, σ²(X'X)⁻¹)
   t-stat:        1 - α = Pr[-t(α/2) < (β̂i - βi)/sβ̂i < t(α/2)]
                        = Pr[β̂i - t(α/2)sβ̂i < βi < β̂i + t(α/2)sβ̂i]
   C.I. for βi:   (β̂i - t(α/2)sβ̂i, β̂i + t(α/2)sβ̂i)

Sample regression line (Ŷt = Xtβ̂ = predicted Y values corresponding to Xt):
   Statistic:     Ŷt = Xtβ̂
   Distribution:  Ŷt ~ N(Xtβ, σ²Ŷt = σ²Xt(X'X)⁻¹Xt')
   t-stat:        1 - α = Pr[-t(α/2) < (Ŷt - Xtβ)/sŶt < t(α/2)]
                        = Pr[Xtβ̂ - t(α/2)sŶt < Xtβ < Xtβ̂ + t(α/2)sŶt]
   C.I. for Xtβ:  (Xtβ̂ - t(α/2)sŶt, Xtβ̂ + t(α/2)sŶt)

Forecast error (FE = Yt - Ŷt = Yt - Xtβ̂):
   Distribution:  FE ~ N(0, σ²FE = σ² + σ²Ŷt)
   t-stat:        1 - α = Pr[-t(α/2) < (FE - 0)/sFE < t(α/2)]
                        = Pr[Xtβ̂ - t(α/2)sFE < Yt < Xtβ̂ + t(α/2)sFE]
   C.I. for Yt:   (Xtβ̂ - t(α/2)sFE, Xtβ̂ + t(α/2)sFE)

where sŶt is used to compute confidence intervals for the regression line (E(Yt | Xt))
and sFE is used in the calculation of confidence intervals for the actual value of Y.
Recall that s²FE = s² + s²Ŷt; hence, s²FE > s²Ŷt and the confidence intervals for
Y are larger than for the population regression line.
5. Uncertainty about X. In many situations the value of the independent variable also
needs to be predicted along with the value of y. Not surprisingly, a "poor" estimate of
Xt will likely result in a poor forecast for y. This can be represented graphically as
follows:

[Figure: forecast uncertainty when Xt must itself be predicted.]
6. Hold out samples and a predictive test.
One way to explore the predictive ability of a model is to estimate the model on a
subset of the data and then use the estimated model to predict known outcomes which
are not used in the initial estimation.
7. Example:  ŷt = 10 + 2.5Gt + 6Mt = β̂1 + β̂2Gt + β̂3Mt

where yt, Gt, Mt denote GDP, government expenditure, and money supply.
Assume that

                 | 10   5   2 |
   s²(X'X)⁻¹ =   |  5  20   3 | × 10⁻³,   s² = 10.
                 |  2   3  15 |

a. Calculate an estimate of GDP (y) which corresponds to
Gt = 100, Mt = 200, i.e., Xt = (1, 100, 200).

                             | 10  |
   ŷt = Xtβ̂ = (1, 100, 200) | 2.5 | = 10 + 250 + 1200 = 1460.
                             |  6  |

b. Evaluate s²ŷt and s²FE corresponding to the Xt in question (a).

                                           | 10   5   2 |        |  1  |
   s²ŷt = Xt[s²(X'X)⁻¹]Xt' = (1, 100, 200) |  5  20   3 | × 10⁻³ | 100 |
                                           |  2   3  15 |        | 200 |

        = 921.81

   sŷt = 30.36

   s²FE = s² + s²ŷt = 10 + 921.81 = 931.81

   sFE = 30.53
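The example's arithmetic can be re-checked directly. In the sketch below, the 3×3 matrix is the s²(X'X)⁻¹ from the example (stored ×10³, a reconstruction from the garbled notes) and β̂ = (10, 2.5, 6):

```python
import math

# Values taken from the worked example above (matrix reconstructed).
s2 = 10.0
M = [[10.0, 5.0, 2.0],    # s^2 (X'X)^{-1} scaled by 10^3
     [5.0, 20.0, 3.0],
     [2.0, 3.0, 15.0]]
xt = [1.0, 100.0, 200.0]
beta_hat = [10.0, 2.5, 6.0]

y_hat = sum(b * v for b, v in zip(beta_hat, xt))             # X_t beta_hat
s2_yhat = sum(xt[i] * M[i][j] * xt[j]                        # X_t s^2(X'X)^{-1} X_t'
              for i in range(3) for j in range(3)) / 1000.0
s2_fe = s2 + s2_yhat                                         # s^2 + s^2_yhat

print(y_hat, s2_yhat, round(math.sqrt(s2_fe), 2))   # 1460.0 921.81 30.53
```

The quadratic form reproduces s²ŷt = 921.81, and adding s² = 10 gives s²FE = 931.81 with sFE ≈ 30.53, matching the example.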
8. Forecasting—basic Stata commands
a) The data file should include values for the explanatory variables
corresponding to the desired forecast period, say in observations n1 + 1 to n2.
b) Estimate the model using least squares
reg Y X1 . . . XK, [options]
c) Use the predict command, picking the names you want for the predictions, in
this case yhat, e, sfe, and syhat (for Ŷ, e, sFE, and sŷ).
predict yhat, xb this option predicts Y
predict e, resid this option predicts the residuals (e)
predict sfe, stdf this option predicts the standard
error of the forecast ( FEs )
predict syhat, stdp   this option predicts the standard
                      error of the prediction (sŷ)

list y yhat sfe       this command lists the indicated variables

These commands result in the calculation and reporting of Ŷ, e, sFE and
sŷ for observations 1 through n2. The predictions will show up in the Data
Editor of Stata under the variable names you picked (in this case, yhat,
e, sfe and syhat).
You may want to restrict the calculations to t= n1 + 1, .. , n2 by using
predict yhat if(_n> n1), xb
where “n1” is the numerical value of n1.
d) The variance of the predicted value can be calculated as follows:

   s²ŷt = s²FE - s².
H. PROBLEM SETS: MULTIVARIATE REGRESSION
Problem Set 3.1
Theory
OBJECTIVE: The objective of problems 1 & 2 is to demonstrate that the matrix equations and
summation equations for the estimators and variances of the estimators are equivalent.
Remember that Σt Xt = NX̄, and don't get discouraged!!
1. BACKGROUND: Consider the model (1) Yt = β1 + β2 Xt+ εt (t = 1, . . ., N) or
equivalently,
(1)'

   | Y1 |   | 1  X1 |          | ε1 |
   | Y2 | = | 1  X2 | | β1 | + | ε2 |
   | .  |   | .  .  | | β2 |   | .  |
   | Yn |   | 1  Xn |          | εn |

(1)''  Y = Xβ + ε

The least squares estimator of β is β̂ = (β̂1, β̂2)' = (X'X)⁻¹X'Y.

If (A.1)-(A.5) (see class notes) are satisfied, then

   Var(β̂) = | Var(β̂1)        Cov(β̂1, β̂2) | = σ²(X'X)⁻¹.
            | Cov(β̂1, β̂2)   Var(β̂2)      |
QUESTIONS: Verify the following:
*Hint: It might be helpful to work backwards on part c and e.
a.  X'X = | N     NX̄     |    and    X'Y = | NȲ      |
          | NX̄    Σt Xt² |                 | Σt XtYt |
b.  β̂2 = (Σt XtYt - NX̄Ȳ) / (Σt Xt² - NX̄²)

c.  β̂1 = Ȳ - β̂2X̄

d.  Var(β̂2) = σ² / (Σt Xt² - NX̄²)

e.  Var(β̂1) = σ²[1/n + X̄²/(Σt Xt² - NX̄²)] = Var(Ȳ) + X̄² Var(β̂2)

f.  Cov(β̂1, β̂2) = -X̄ Var(β̂2)

(JM II'-A, JM Stats)
2. Consider the model: Yt = βXt + εt

a. Show that this model is equivalent to Y = Xβ + ε where

   Y = | Y1 |,   X = | X1 |,   ε = | ε1 |
       | .  |        | .  |       | .  |
       | Yn |        | Xn |       | εn |

b. Using the matrices in 2(a), evaluate (X'X)⁻¹X'Y and compare your answer with
the results obtained in question 4 in Problem Set 2.1.

c. Using the matrices in 2(a), evaluate σ²(X'X)⁻¹.

(JM II'-A)
Applied
3. Use the data in HPRICE1.RAW to estimate the model
price = β0 + β1sqrft + β2bdrms + u
where price is the house price measured in thousands of dollars, sqrft is
the floorspace measured in square feet, and bdrms is the number of bedrooms.
a. Write out the results in equation form.
b. What is the estimated increase in price for a house with one more bedroom, holding
square footage constant?
c. What is the estimated increase in price for a house with an additional bedroom that is 140
square feet in size? Compare this to your answer in part (b).
d. What percentage variation in price is explained by square footage and number of
bedrooms?
e. The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling
price for this house from the OLS regression line.
f. The actual selling price of the first house in the sample was $300,000 (so price = 300).
Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for
the house?
Problem Set 3.2
Theory
1. R², Adjusted R² (R̄²), F Statistic, and LR

The R² (coefficient of determination) is defined by

	R² = SSR/SST = 1 - SSE/SST

where SST = Σt (Yt - Ȳ)², SSR = Σt (Ŷt - Ȳ)², and SSE = Σt et².
Given that SST = SSR + SSE when using OLS,
a. Demonstrate that 0 ≤ R² ≤ 1.

b. Demonstrate that n = k implies R² = 1. (Hint: n = k implies that X is square. Be careful! Show Ŷ = Xβ̂ = Y.)
c. If an additional independent variable is included in the regression equation, will
the R2 increase, decrease, or remain unaltered? (Hint: What is the effect upon
SST, SSE?)
d. The adjusted R², R̄², is defined by

	R̄² = 1 - [SSE/(n-k)] / [SST/(n-1)] .

Demonstrate that (1-k)/(n-k) ≤ R̄² ≤ 1, i.e., the adjusted R² can be negative.

	(Hint: 1 - R̄² = (SSE/SST) · (n-1)/(n-k) = ((n-1)/(n-k)) (1 - R²).)
e. Verify that

	LR = (SSE* - SSE)/σ²     if σ² is known
	   = n ln(SSE*/SSE)      if σ² is unknown,

where SSE* denotes the restricted SSE.
f. For the hypothesis H0: β2 = . . . = βk = 0, verify that the corresponding LR statistic can be written as

	LR = n ln(1/(1 - R²)) = -n ln(1 - R²) .
FYI: The corresponding Lagrange multiplier (LM) test statistic for this
hypothesis can be written in terms of the coefficient of determination as LM = NR².
(JM II-B)
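The relationships among R², R̄², and LR in parts d and f can be checked numerically. A numpy sketch with simulated data (hypothetical), assuming the restricted model in part f is the intercept-only regression, for which SSE* = SST:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
SSE = e @ e
SST = ((y - y.mean())**2).sum()
SSR = ((X @ b - y.mean())**2).sum()

R2 = 1 - SSE / SST
assert np.isclose(R2, SSR / SST)     # SST = SSR + SSE under OLS with an intercept

R2_adj = 1 - (SSE / (n - k)) / (SST / (n - 1))
assert R2_adj <= R2 <= 1             # adjusting can only lower R^2

# Part f: restricted model is intercept-only, so SSE* = SST and
# LR = n ln(SSE*/SSE) = -n ln(1 - R^2)
LR = n * np.log(SST / SSE)
assert np.isclose(LR, -n * np.log(1 - R2))
```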
2. Demonstrate that

a. X'e = 0 is equivalent to the normal equations X'Xβ̂ = X'Y.

b. X'e = 0 implies that the sum of estimated error terms will equal zero if the regression
equation includes an intercept.

Remember: e = Y - Ŷ = Y - Xβ̂.
(JM II-B)
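Both properties are immediate to verify on any data set. A minimal numpy sketch with simulated data (hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # first column: intercept
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'Xb = X'y
e = y - X @ b

assert np.allclose(X.T @ e, 0)   # X'e = 0: residuals orthogonal to every regressor
assert np.isclose(e.sum(), 0)    # column of ones in X forces the residuals to sum to zero
```

The second assertion fails if the intercept column is dropped, which is exactly the point of part b.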
Applied
3. The following model can be used to study whether campaign expenditures affect election
outcomes:
voteA = β0 + β1ln(expendA) + β2 ln(expendB) + β3 prtystrA + u
where voteA is the percent of the vote received by Candidate A, expendA and expendB are
campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percent of the most recent presidential vote that went to A's
party).
i) What is the interpretation of β1?
ii) In terms of the parameters, state the null hypothesis that a 1% increase in A's
expenditures is offset by a 1% increase in B's expenditures.
iii) Estimate the model above using the data in VOTE1.RAW and report the results in
the usual form. Do A's expenditures affect the outcome? What about B's
expenditures? Can you use these results to test the hypothesis in part (ii)?
iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part
(ii). What do you conclude? (Use a two-sided alternative.) A possible approach:
test H0: D ≡ β1 + β2 = 0, substitute D - β2 for β1, and simplify to obtain
voteA = β0 + D ln(expendA) + β2 [ln(expendB) - ln(expendA)] + β3 prtystrA + u.
Regress voteA on ln(expendA), [ln(expendB) - ln(expendA)], and prtystrA,
and test the hypothesis that the coefficient, D, of ln(expendA) is 0.
You can check your results by constructing the “high-tech” t-test or by using the
Stata command, test ln(expendA) + ln(expendB) =0 following the estimation of
the unconstrained regression model.
(Wooldridge C. 4.1)
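The reparameterization trick in part (iv) can be sketched in numpy. The data below are simulated stand-ins (hypothetical), not VOTE1.RAW; the point is only that the coefficient on ln(expendA) in the reparameterized regression equals β1 + β2 from the original one, so its ordinary t statistic tests H0: β1 + β2 = 0 directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
lexpA = rng.normal(6, 1, n)      # hypothetical stand-in for ln(expendA)
lexpB = rng.normal(6, 1, n)      # hypothetical stand-in for ln(expendB)
prty = rng.uniform(30, 70, n)    # hypothetical stand-in for prtystrA
voteA = 40 + 6*lexpA - 6*lexpB + 0.15*prty + rng.normal(0, 3, n)

# Reparameterized model: voteA on ln(expendA), [ln(expendB)-ln(expendA)], prtystrA.
# The coefficient on ln(expendA) is D = beta1 + beta2.
X = np.column_stack([np.ones(n), lexpA, lexpB - lexpA, prty])
b = np.linalg.solve(X.T @ X, X.T @ voteA)
e = voteA - X @ b
s2 = e @ e / (n - X.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
t_D = b[1] / se[1]               # t statistic for H0: beta1 + beta2 = 0

# Sanity check: D equals beta1 + beta2 from the original parameterization
X0 = np.column_stack([np.ones(n), lexpA, lexpB, prty])
b0 = np.linalg.solve(X0.T @ X0, X0.T @ voteA)
assert np.isclose(b[1], b0[1] + b0[2])
```

With the real data, the same t statistic is what Stata's `test` command reproduces after the unconstrained regression.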
4. Consider the data (data from Solow’s paper on economic growth)
t Output (Yt) Labor (Lt) Capital (Kt)
1 40.26 64.63 133.14
2 40.84 66.30 139.24
3 42.83 65.27 141.64
4 43.89 67.32 148.77
5 46.10 67.20 151.02
6 44.45 65.18 143.38
7 43.87 65.57 148.19
8 49.99 71.42 167.12
9 52.64 77.52 171.33
10 57.93 79.46 176.41
The Cobb-Douglas production function is defined by

(1)	Yt = e^(β1 + β2 t) · Lt^β3 · Kt^β4 · εt

where (β2 t) takes account of changes in output for any reason other than a change in Lt or
Kt; εt denotes a random disturbance having the property that ln εt is distributed N(0, σ²).
Labor's share (total wage receipts / total sales receipts) is given by β3 if β3 + β4 (the returns to scale) is
equal to one. β2 is frequently referred to as the rate of technological change
((dYt/dt)/Yt for fixed K and L). Taking the natural logarithm of equation (1), we obtain

(2)	ln Yt = β1 + β2 t + β3 ln(Lt) + β4 ln(Kt) + ln(εt) .
If β3 + β4 is equal to 1, then equation (2) can be rewritten as

(3)	ln(Yt/Kt) = β1 + β2 t + β3 ln(Lt/Kt) + ln εt .
a. Estimate equation (2) using the technique of least squares.
b. Corresponding to equation (2)
1) Test the hypothesis Ho: β2 = β3 = β4 = 0. Explain the implications of this
hypothesis. (95% confidence level)
2) Perform and interpret individual tests of significance of β2, β3, and β4, i.e., test
Ho: βi = 0 (α = .05).
3) test the hypothesis of constant returns to scale, i.e., Ho: β3 + β4 = 1, using
a. a t-test for a general linear hypothesis, with restriction vector δ = (0, 0, 1, 1);
b. a Chow test;
c. a LR test.
c. Estimate equation (3) and test the hypothesis that labor’s share is equal to .75, i.e., β3 =
.75.
d. Re-estimate the model (equation 2) with the first nine observations and check to see if the actual
log(output) for the 10th observation lies in the 95% forecast confidence interval.
(JM II)
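Part a amounts to a least squares fit of equation (2). A numpy sketch using the data from the table above (the course uses Stata; numpy is used here only for illustration, and no estimates are asserted in advance):

```python
import numpy as np

# Data from the table in problem 4
Y = np.array([40.26, 40.84, 42.83, 43.89, 46.10, 44.45, 43.87, 49.99, 52.64, 57.93])
L = np.array([64.63, 66.30, 65.27, 67.32, 67.20, 65.18, 65.57, 71.42, 77.52, 79.46])
K = np.array([133.14, 139.24, 141.64, 148.77, 151.02, 143.38, 148.19, 167.12, 171.33, 176.41])
t = np.arange(1, 11)

# Equation (2): ln Y = b1 + b2 t + b3 ln(L) + b4 ln(K) + ln(eps)
X = np.column_stack([np.ones(10), t, np.log(L), np.log(K)])
b, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
e = np.log(Y) - X @ b
s2 = e @ e / (10 - 4)               # s^2 = SSE/(n-k) with n=10, k=4
print("estimates:", b.round(4), " s^2 =", round(s2, 6))
```

The same fitted coefficients and s² feed the t, Chow, and LR tests asked for in parts b and c.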
5. The translog production function corresponding to the previous problem is given by
	ln(Y) = β1 + β2 t + β3 ln(L) + β4 ln(K) + β5 (ln(L))² + β6 (ln(K))² + β7 ln(L) ln(K) + ln(εt)
a. What restrictions on the translog production function result in a Cobb-Douglas
production function?
b. Estimate the translog production function using the data in problem 4 and use the Chow and
LR tests to determine whether it provides a statistically significant improvement in fit,
relative to the Cobb-Douglas function.
(JM II)
6. The transcendental production function corresponding to the data in problem 4 is defined by

	Y = e^(β1 + β2 t + β5 L + β6 K) · L^β3 · K^β4
a. What restrictions on the transcendental production function result in a Cobb-Douglas
production function?
b. Estimate the transcendental production function using the data in problem 4 and use the Chow
and LR tests to compare it with the Cobb-Douglas production function.
(JM II)
APPENDIX A
Some important derivatives:

Let	a = | a1 | ,  X = | x1 | ,  A = | a11  a12 |   (symmetric: a12 = a21)
	    | a2 |        | x2 |        | a21  a22 |

1.	d(a'X)/dX = d(X'a)/dX = a

2.	d(X'AX)/dX = 2AX

Proof of d(a'X)/dX = a:

Note that a'X = X'a = a1x1 + a2x2, so

	d(a'X)/dX = | ∂(a'X)/∂x1 | = | a1 | = a .
	            | ∂(a'X)/∂x2 |   | a2 |

Proof of d(X'AX)/dX = 2AX:

Note that X'AX = a11x1² + (a12 + a21)x1x2 + a22x2², so

	d(X'AX)/dX = | ∂(X'AX)/∂x1 | = | 2a11x1 + 2a12x2 | = 2 | a11  a12 | | x1 | = 2AX .
	             | ∂(X'AX)/∂x2 |   | 2a21x1 + 2a22x2 |     | a21  a22 | | x2 |
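The two derivative rules above can be checked against numerical gradients. A small numpy sketch, using central finite differences on an arbitrary symmetric A (values hypothetical):

```python
import numpy as np

a = np.array([1.0, 2.0])
A = np.array([[2.0, 0.5],
              [0.5, 3.0]])      # symmetric, so a12 = a21
x = np.array([0.3, -1.2])

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of scalar f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

# Rule 1: d(a'X)/dX = a
assert np.allclose(grad(lambda v: a @ v, x), a)

# Rule 2: d(X'AX)/dX = 2AX (A symmetric)
assert np.allclose(grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-5)
```

The same rules, applied with M or X'X in place of A, drive the derivations in Appendices B and C.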
APPENDIX B
An unbiased estimator of σ² is given by

	s² = (1/(n-k)) y'(I - X(X'X)⁻¹X')y = SSE/(n-k) .

Proof: To show this, we need some results on traces: tr(A) = Σi aii

1) tr(I) = n
2) If A is idempotent, tr(A) = rank of A
3) tr(A+B) = tr(A) + tr(B)
4) tr(AB) = tr(BA) if both AB and BA are defined
5) tr(ABC) = tr(CAB)
6) tr(kA) = k tr(A)

Now, remember that

	σ̂² = (1/n) e'e   and   s² = (1/(n-k)) e'e

	e = y - Xβ̂ = y - X(X'X)⁻¹X'y = My
	  = M(Xβ + ε) = MXβ + Mε
	  = Mε ,

where M = I - X(X'X)⁻¹X', so that MX = 0. Note that M is symmetric and idempotent (problem set R.2).

So

	σ̂² = (1/n) e'e = (1/n) ε'M'Mε
	    = (1/n) ε'MMε
	    = (1/n) ε'Mε

and

	s² = (1/(n-k)) ε'Mε .

	E(σ̂²) = (1/n) E(ε'Mε) = (1/n) E(tr(ε'Mε))
	      = (1/n) E(tr(Mεε')) = (1/n) tr(M E(εε'))
	      = (1/n) tr(M σ²I) = (σ²/n) tr(M)        [E(εε') = σ²I because Cov(εi, εj) = 0, i ≠ j]
	      = (σ²/n) tr(I - X(X'X)⁻¹X')
	      = (σ²/n) (n - tr(X(X'X)⁻¹X'))
	      = (σ²/n) (n - tr(X'X(X'X)⁻¹))
	      = (σ²/n) (n - tr(Ik))
	      = (σ²/n) (n - k) .

So E(σ̂²) = ((n-k)/n) σ², and hence E(s²) = (n/(n-k)) E(σ̂²) = σ².

Therefore σ̂² is biased, but s² is unbiased.
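The matrix facts the proof leans on — M symmetric, idempotent, MX = 0, and tr(M) = n - k — can be confirmed numerically for any X of full column rank. A numpy sketch with an arbitrary simulated X (hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 20, 3
X = rng.normal(size=(n, k))                         # any full-column-rank X
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T    # residual-maker matrix

assert np.allclose(M, M.T)                # M is symmetric
assert np.allclose(M @ M, M)              # M is idempotent
assert np.isclose(np.trace(M), n - k)     # tr(M) = n - k, the key step in the proof
assert np.allclose(M @ X, 0)              # MX = 0, so e = My = M(Xb + eps) = M eps
```

The trace identity is exactly why dividing SSE by n - k, rather than n, removes the bias.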
APPENDIX C
β̂ = AY = (X'X)⁻¹X'Y is BLUE.

Proof: Let β̂i = Ai Y where Ai denotes the ith row of the matrix A. Since the result will be
symmetric for each β̂i (hence, for each Ai), denote Ai by a' where a is an (n x 1) vector.
The problem then becomes:

	Min a'Ia        where I is n x n
	s.t. AX = I     where X is n x k (for unbiasedness)

or

	min a'Ia
	s.t. X'a = i    where i is the ith column of the identity matrix.

Let ℒ = a'Ia + λ'(X'a - i) denote the associated Lagrangian function, where λ is k x 1.
The necessary conditions for a solution are:

	∂ℒ/∂a = 2Ia + Xλ = 0
	∂ℒ/∂λ = X'a - i = 0 .

The first condition implies

	a = (-1/2) Xλ .

Now substitute a = (-1/2)Xλ into the expression ∂ℒ/∂λ = 0 and we obtain

	(-1/2) X'Xλ = i
	λ = -2 (X'X)⁻¹ i
	a = (-1/2)(-2) X(X'X)⁻¹ i = X(X'X)⁻¹ i ,

so a' = i'(X'X)⁻¹X' = Ai, which implies

	A = (X'X)⁻¹X'

and hence

	β̂ = (X'X)⁻¹X'y .
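The conclusion can be probed numerically: A = (X'X)⁻¹X' satisfies the unbiasedness constraint AX = I, and perturbing any row of A within that constraint can only increase a'a (hence the variance σ²a'a). A numpy sketch with simulated X (hypothetical), using the decomposition a = a* + d with X'd = 0, so a'a = a*'a* + d'd ≥ a*'a*:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 15, 2
X = rng.normal(size=(n, k))
A = np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(A @ X, np.eye(k))    # unbiasedness constraint AX = I

i0 = np.eye(k)[:, 0]
a_star = A[0]                           # optimal weight vector for beta_1
assert np.allclose(X.T @ a_star, i0)    # satisfies X'a = i

# Build a perturbation d in the null space of X' (orthogonal to the columns of X)
d = rng.normal(size=n)
d -= X @ np.linalg.solve(X.T @ X, X.T @ d)
assert np.allclose(X.T @ d, 0)

a_alt = a_star + d                      # another linear unbiased estimator's weights
assert np.allclose(X.T @ a_alt, X.T @ a_star)   # still unbiased
assert a_alt @ a_alt >= a_star @ a_star         # but with larger variance
```

This is the "best" in BLUE: among all linear unbiased weight vectors, the least squares row has minimum length, hence minimum variance.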