Non-linear Least Squares and Durbin’s Problem
Asymptotic Theory — Part V

James J. Heckman
University of Chicago
Econ 312
This draft, April 18, 2006


This lecture consists of two parts:

1. Non-linear least squares: This part examines non-linear least squares (NLLS) estimation in detail; and

2. Durbin’s problem: This part examines the correction of asymptotic variances in the case of two-stage estimators.


1 Nonlinear Least Squares

In this section, we examine in detail the Non-linear Least Squares estimator. The section is organized as follows:

• Section 1.1: Recap of the analog principle motivation for the NLLS estimator (using the extremum principle);

• Section 1.2: Consistency of the NLLS estimator;

• Section 1.3: Analogy with the OLS estimator;

• Section 1.4: Asymptotic normality of the NLLS estimator;

• Section 1.5: Discussion of asymptotic efficiency;

• Section 1.6: Estimation of $\hat\beta$.


1.1 NLLS estimator as an application of the Extremum principle

Here we recap the derivation of the NLLS estimator as an application of the Extremum principle, from Section 3.2 of the notes Asymptotic Theory II, with slight modification in notation. As noted there, we could also motivate NLLS as a moment estimator (refer to Section 3.2 of Asymptotic Theory II).

1. The model: We assume that in the population the following model holds:

$$y_t = f(x_t;\beta_0) + \varepsilon_t \qquad (1)$$

$$\phantom{y_t} = f(x_t;\beta) + [f(x_t;\beta_0) - f(x_t;\beta)] + \varepsilon_t,$$

where $x_t$ is a vector of exogenous variables. Unlike in the linear regression model, $\beta$ may not necessarily be of the same dimension as $x_t$. Since $f(x_t;\beta)$ is a nonlinear function of $x_t$ and $\beta$, (1) is called the nonlinear regression model. Assume $(x_t, \varepsilon_t)$ i.i.d., with $\varepsilon_t$ independent of $f(x_t;\beta)$. Then we can write out a least squares criterion function as below.

2. Criterion function: We choose the criterion function:

$$Q(\beta) = E\big[(y - f(x;\beta))^2\big] = E\big[f(x;\beta_0) - f(x;\beta)\big]^2 + \sigma^2.$$

Then $Q$ possesses the property that it is minimized at $\beta = \beta_0$ (the true parameter value). If $\beta = \beta_0$ is the only such value, the model is identified (wrt the criterion).


3. Analog in sample: Pick

$$Q_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}\big(y_t - f(x_t;\beta)\big)^2$$

as the analog to $Q$ in the sample. As established in the OLS case in the notes Asymptotic Theory II (Section 3.2), we can show that $\operatorname{plim} Q_n(\beta) = Q(\beta)$.

4. The estimator: We construct the NLLS estimator as:

$$\hat\beta = \operatorname*{argmin}_{\beta}\, Q_n(\beta).$$

Thus we choose $\hat\beta$ to minimize $Q_n(\beta)$.

In the next few sections, we establish consistency and asymptotic normality for the NLLS estimator (under certain conditions), and discuss conditions for asymptotic efficiency. A computational sketch of the estimator follows.
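As a concrete illustration of the analog principle, here is a minimal sketch of NLLS by direct numerical minimization of the sample criterion $Q_n(\beta)$. The model $f(x;\beta) = \beta_1 e^{\beta_2 x}$, the data-generating values, and all variable names are illustrative assumptions, not part of the lecture; the same running example is reused in later sections.

```python
# Minimal NLLS sketch: minimize Q_n(b) = (1/n) * sum_t (y_t - f(x_t; b))^2.
# Model and data below are illustrative assumptions, not from the lecture.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
beta_true = np.array([2.0, 0.5])
x = rng.uniform(0.0, 2.0, n)
y = beta_true[0] * np.exp(beta_true[1] * x) + rng.normal(0.0, 0.3, n)  # i.i.d. errors

def f(x, b):
    """Nonlinear regression function f(x; b) = b1 * exp(b2 * x)."""
    return b[0] * np.exp(b[1] * x)

def Q_n(b):
    """Sample analog of the population criterion Q."""
    return np.mean((y - f(x, b)) ** 2)

beta_hat = minimize(Q_n, x0=np.array([1.0, 0.1]), method="BFGS").x
print(beta_hat)  # should be close to (2.0, 0.5)
```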


1.2 Consistency of NLLS estimator

Assume:

1. $\varepsilon_t$ i.i.d., $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = \sigma^2$;

2. $\beta_0$ is a vector of unknown parameters;

3. $\beta_0$ is interior to the parameter space $B$;

4. $\partial f/\partial\beta$ exists and is continuous in a nbd of $\beta_0$;

5. $f(x_t;\beta)$ is continuous in $\beta$ uniformly in $t$ (i.e., for every $\epsilon > 0$ there exists $\delta > 0$ such that $|f(x_t;\beta_1) - f(x_t;\beta_2)| < \epsilon$ for $\beta_1, \beta_2$ closer than $\delta$ (i.e. $\|\beta_1 - \beta_2\| < \delta$), for all $\beta_1, \beta_2$ in a nbd of $\beta_0$ and for all $t$);

6. $\frac{1}{n}\sum_{t=1}^{n} f(x_t;\beta_1)\,f(x_t;\beta_2)$ converges uniformly in $\beta_1, \beta_2$ in a nbd of $\beta_0$;

7. $\lim \frac{1}{n}\sum_t \big(f(x_t;\beta_0) - f(x_t;\beta)\big)^2 \neq 0$ if $\beta \neq \beta_0$.

Then we have that there exists a unique root $\hat\beta$ such that:

$$\hat\beta = \operatorname*{argmin}_{\beta}\sum_t\big(y_t - f(x_t;\beta)\big)^2,$$

and that it is consistent, i.e.:

$$\hat\beta \xrightarrow{p} \beta_0.$$

Proof: Amemiya, p. 129. The proof is an application of the Extremum/Analogy Theorem for the class of estimators defined as $\hat\beta = \operatorname{argmin} Q_n(\beta)$.


1.3 Analogy with OLS estimator

Gallant (1975): Consider the NLLS model from (1) above:

$$y_t = f(x_t;\beta) + \varepsilon_t.$$

Now expand $f$ in a nbd of $\beta^*$ in a Taylor series to get:

$$y_t \simeq f(x_t;\beta^*) + \frac{\partial f(x_t;\beta)}{\partial\beta'}\bigg|_{\beta^*}(\beta - \beta^*) + \varepsilon_t.$$

Rewrite the equation as:

$$y_t - f(x_t;\beta^*) + \frac{\partial f(x_t;\beta)}{\partial\beta'}\bigg|_{\beta^*}\beta^* = \frac{\partial f(x_t;\beta)}{\partial\beta'}\bigg|_{\beta^*}\beta + \varepsilon_t.$$


Now, by analogy with the classical linear regression model, we have:

• $y_t - f(x_t;\beta^*) + \dfrac{\partial f(x_t;\beta)}{\partial\beta'}\Big|_{\beta^*}\beta^*$ is analogous to the dependent variable in OLS.

• $\dfrac{\partial f(x_t;\beta)}{\partial\beta'}\Big|_{\beta^*}$ is analogous to the independent-variables matrix in OLS.


The NLLS estimator is:

$$\hat\beta = \left[\sum_{t=1}^{n}\left(\frac{\partial f_t}{\partial\beta}\right)\left(\frac{\partial f_t}{\partial\beta'}\right)\right]^{-1}\left(\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\,\tilde y_t\right), \qquad (2)$$

so that in comparison to the OLS estimator we have:

• $X'X$ replaced by $\displaystyle\sum_{t=1}^{n}\left(\frac{\partial f_t}{\partial\beta}\right)\left(\frac{\partial f_t}{\partial\beta'}\right)$; and

• $X'y$ replaced by $\displaystyle\sum_{t=1}^{n}\left(\frac{\partial f_t}{\partial\beta}\right)\tilde y_t$,

where $f_t = f(x_t;\beta)$ and $\tilde y_t = y_t - f(x_t;\beta^*) + \dfrac{\partial f_t}{\partial\beta'}\Big|_{\beta^*}\beta^*$ is the constructed dependent variable.


Then the analogy with OLS goes through exactly. Now, as in the OLS case, we can do hypothesis testing, etc., using derivatives in a nbd of the optimum.

Using the analogy, we also obtain the estimator for Asy. var$(\hat\beta)$ as:

$$\widehat{\text{Asy. var}}(\hat\beta) = \hat\sigma^2\,(\tilde G'\tilde G)^{-1}, \qquad \text{where } \tilde G = \frac{\partial f(x;\beta)}{\partial\beta'}\bigg|_{\hat\beta}.$$

A sketch of this computation follows.
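Continuing the running example above, a sketch of this variance estimate: the pseudoregressor matrix (called G here, an illustrative name) stacks the rows $\partial f(x_t;\beta)/\partial\beta'$ evaluated at $\hat\beta$, and the OLS analogy gives $\hat\sigma^2(G'G)^{-1}$.

```python
# Asymptotic variance via the OLS analogy (continues the running example).
G = np.column_stack([
    np.exp(beta_hat[1] * x),                     # df/db1
    beta_hat[0] * x * np.exp(beta_hat[1] * x),   # df/db2
])
resid = y - f(x, beta_hat)
sigma2_hat = resid @ resid / n                   # error-variance estimate
avar_hat = sigma2_hat * np.linalg.inv(G.T @ G)
print(np.sqrt(np.diag(avar_hat)))                # standard errors for beta_hat
```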


1.4 Asymptotic normality

To justify large-sample normality, we need additional conditions on the model. Assuming the conditions for consistency hold, the required conditions for asymptotic normality are the following.

1. $\lim \frac{1}{n}\sum_{t=1}^{n}\dfrac{\partial f_t}{\partial\beta}\Big|_{\beta_0}\dfrac{\partial f_t}{\partial\beta'}\Big|_{\beta_0} = C$, a positive definite matrix;

2. $\frac{1}{n}\sum_{t=1}^{n}\dfrac{\partial f_t}{\partial\beta}\dfrac{\partial f_t}{\partial\beta'}$ converges uniformly to a finite matrix in an open nbd of $\beta_0$;

3. $\dfrac{\partial^2 f_t}{\partial\beta\,\partial\beta'}$ is continuous in $\beta$ in an open nbd of $\beta_0$, uniformly in $t$ ($\therefore$ we need uniform continuity of the first and second partials);

4. $\lim \frac{1}{n^2}\sum_{t=1}^{n}\left[\dfrac{\partial^2 f_t}{\partial\beta\,\partial\beta'}\right]^2 = 0$ for all $\beta$ in an open nbd of $\beta_0$; and

5. $\frac{1}{n}\sum_{t=1}^{n} f(x_t;\beta_1)\dfrac{\partial^2 f_t}{\partial\beta\,\partial\beta'}\Big|_{\beta_2}$ converges to a finite matrix uniformly for all $\beta_1, \beta_2$ in an open nbd of $\beta_0$.

Then:

$$\sqrt n\,\big(\hat\beta - \beta_0\big) \xrightarrow{d} N\big(0,\ \sigma^2 C^{-1}\big),$$

where $\sigma^2 = E(\varepsilon_t^2)$.

Sketch of proof (for a rigorous proof, see Amemiya, pp. 132-4): The intuition for this result is exactly as in Cramér’s Theorem (refer to Section 2 of notes Asymptotic Theory III).


Look at the first-order condition:

$$\frac{\partial Q_n}{\partial\beta} = -\frac{2}{n}\sum_{t=1}^{n}\big(y_t - f(x_t;\beta)\big)\frac{\partial f_t}{\partial\beta}.$$

Then, as in Cramér’s theorem (Theorem 3 in handout III), we get:

$$\sqrt n\,\frac{\partial Q_n}{\partial\beta}\bigg|_{\beta_0} = -\frac{2}{\sqrt n}\sum_{t=1}^{n}\varepsilon_t\,\frac{\partial f_t}{\partial\beta}\bigg|_{\beta_0}.$$


This is asymptotically normal (a sum of i.i.d. r.v.’s) by the Lindeberg-Levy Central Limit Theorem. Then, using equation (2), we obtain:

$$\sqrt n\,\big(\hat\beta - \beta_0\big) = \left[\frac1n\sum_{t=1}^{n}\left(\frac{\partial f_t}{\partial\beta}\right)\left(\frac{\partial f_t}{\partial\beta'}\right)\right]^{-1} \times \frac{1}{\sqrt n}\sum_{t=1}^{n}\left(\frac{\partial f_t}{\partial\beta}\right)\varepsilon_t.$$

We get that this is asymptotically normal in a nbd of $\beta_0$ if $\left[\frac1n\sum\left(\frac{\partial f_t}{\partial\beta}\right)\left(\frac{\partial f_t}{\partial\beta'}\right)\right]$ converges uniformly to a non-singular matrix (which is true by assumption). This completes the analogy with Cramér’s theorem proved in the earlier lecture. (See Amemiya for a rigorous derivation; also see the result in Gallant.)


1.5 Asymptotic efficiency of NLLS estimator

The analogy of the NLLS estimator with MLE is complete if we assume $\varepsilon$ is normal. Then we get the log likelihood function:

$$\ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_t\big(y_t - f(x_t;\beta)\big)^2,$$

so that here we get $\hat\beta_{\text{MLE}} = \hat\beta_{\text{NLLS}}$ (FOC and asymptotic theory as before).


Thus we obtain the general result that NLLS coincides with MLE in any nonlinear regression model if we have that $\varepsilon$ is normal. Though nonlinear regression is picking another criterion, the estimator is identical to the MLE estimator.

$\therefore$ The NLLS estimator is efficient in the normal case. In general, Greene (pp. 305-8) shows that (unless $\varepsilon$ is normal) NLLS is not necessarily asymptotically efficient.


1.6 Estimation of $\hat\beta$

Now consider the problem of numerical estimation: how do we obtain $\hat\beta$? The two commonly used methods are:

i. Newton-Raphson; and

ii. Gauss-Newton.


1.6.1 Newton-Raphson Method

In the NLLS case, we wish to find a solution to the equation $\dfrac{\partial Q_n(\beta)}{\partial\beta} = 0$. The same is true for many criteria outside of NLLS (all the criteria in Asymptotic Theory handout III).

We expand the criterion function $Q_n(\beta)$ in a nbd of an initial starting value $\hat\beta_1$, by a second-order (quadratic) Taylor series approximation, to get:

$$Q_n(\beta) \simeq Q_n(\hat\beta_1) + \frac{\partial Q_n}{\partial\beta'}\bigg|_{\hat\beta_1}(\beta - \hat\beta_1) + \frac12(\beta - \hat\beta_1)'\,\frac{\partial^2 Q_n}{\partial\beta\,\partial\beta'}\bigg|_{\hat\beta_1}(\beta - \hat\beta_1). \qquad (3)$$

This quadratic problem has a solution if the Hessian matrix $\dfrac{\partial^2 Q_n}{\partial\beta\,\partial\beta'}$ is a definite matrix (pos. def. for a min).


In equation (3), we minimize $Q_n(\beta)$ wrt $\beta$ (by taking the FOC) and obtain the algorithm:

$$\hat\beta_2 = \hat\beta_1 - \left[\frac{\partial^2 Q_n}{\partial\beta\,\partial\beta'}\right]^{-1}_{\hat\beta_1}\frac{\partial Q_n}{\partial\beta}\bigg|_{\hat\beta_1}.$$

We continue the iteration until convergence occurs. The method assumes that we can approximate $Q_n$ with a quadratic; a sketch of the iteration in code is given below. Some of the drawbacks of the method and possible fixes are then discussed.
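A minimal sketch of the iteration for the running example, with analytic gradient and full Hessian of $Q_n$ (both derived from the assumed model $f(x;\beta) = \beta_1 e^{\beta_2 x}$); the starting value is assumed to be reasonably close to the optimum, which matters for the drawbacks discussed next.

```python
# Newton-Raphson for Q_n (continues the running example).
def grad_f(x, b):
    """Rows are df(x_t; b)/db'."""
    e = np.exp(b[1] * x)
    return np.column_stack([e, b[0] * x * e])

def newton_raphson(b, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        r = y - f(x, b)                      # residuals
        G = grad_f(x, b)
        g = -(2.0 / n) * G.T @ r             # gradient of Q_n
        # Full Hessian of Q_n: outer-product term minus residual-weighted
        # second derivatives of f (d2f/db1db1 = 0 for this model).
        e = np.exp(b[1] * x)
        H12 = np.sum(r * x * e)              # sum_t r_t * d2f/db1 db2
        H22 = np.sum(r * b[0] * x**2 * e)    # sum_t r_t * d2f/db2 db2
        H = (2.0 / n) * (G.T @ G - np.array([[0.0, H12], [H12, H22]]))
        step = np.linalg.solve(H, g)
        b = b - step
        if np.max(np.abs(step)) < tol:
            break
    return b

print(newton_raphson(np.array([1.5, 0.4])))  # start near the optimum
```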

(A) Singular Hessian: There is a problem if the Hessian is singular: the method fails, as we are then unable to obtain $\left[\dfrac{\partial^2 Q_n}{\partial\beta\,\partial\beta'}\right]^{-1}_{\hat\beta_1}$.


In case the Hessian is singular, the following correction could be used: use $\alpha$ such that

$$\left[\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'} - \alpha I\right]$$

is neg. def. (correspondingly, pos. def. with $+\alpha I$ when minimizing $Q_n$). Usually we pick a scalar $\alpha$ (obviously one can pick vectors). One can then fiddle with this to get out of a nbd of local singularity. In applications of the Newton-Raphson method, one could use an idea due to T.W. Anderson (on the reading list) and note that asymptotically:

$$-E\left(\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\right) = E\left(\frac{\partial\ln L}{\partial\theta}\,\frac{\partial\ln L}{\partial\theta'}\right)$$

to arrive at an alternative estimator for the Hessian (sometimes called BHHH, but the method is due to Anderson).


(B) Algorithm Overshoots: In this case, one could scale the step back by $\lambda$:

$$\hat\beta_2 = \hat\beta_1 - \lambda\left[\frac{\partial^2 Q_n}{\partial\beta\,\partial\beta'}\right]^{-1}_{\hat\beta_1}\frac{\partial Q_n}{\partial\beta}\bigg|_{\hat\beta_1}.$$

We choose $0 < \lambda < 1$ so that the iteration differences get dampened, reducing the chances of overshooting.


1.6.2 Gauss-Newton Method

The motivation for the Gauss-Newton method mimics exactly the NLLS set-up in Section 1.3, where we drew the analogy with OLS. Expanding $f$ in a nbd of some initial starting value $\hat\beta_1$, we get:

$$y_t - f(x_t;\hat\beta_1) + \frac{\partial f_t}{\partial\beta'}\bigg|_{\hat\beta_1}\hat\beta_1 = \frac{\partial f_t}{\partial\beta'}\bigg|_{\hat\beta_1}\beta_2 + \varepsilon_t.$$

This set-up is analogous to OLS; the LHS and part of the RHS are data once one knows (guesses) the starting value $\hat\beta_1$. Then do OLS to get the next iteration in the algorithm:

$$\hat\beta_2 = \left[\frac1n\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\,\frac{\partial f_t}{\partial\beta'}\right]^{-1}_{\hat\beta_1}\frac1n\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\bigg|_{\hat\beta_1}\left(y_t - f(x_t;\hat\beta_1) + \frac{\partial f_t}{\partial\beta'}\bigg|_{\hat\beta_1}\hat\beta_1\right),$$

so that we get:

$$\hat\beta_2 = \hat\beta_1 + \left[\frac1n\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\,\frac{\partial f_t}{\partial\beta'}\right]^{-1}_{\hat\beta_1}\frac1n\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\bigg|_{\hat\beta_1}\Big[y_t - f(x_t;\hat\beta_1)\Big].$$

Revise, update, and start all over again; a sketch in code follows. This method has the same problems as Newton-Raphson.
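A sketch for the running example: each iteration is an OLS regression of the current residuals on the pseudoregressors, reusing the grad_f helper defined in the Newton-Raphson sketch.

```python
# Gauss-Newton (continues the running example).
def gauss_newton(b, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        G = grad_f(x, b)                          # pseudoregressors at current b
        r = y - f(x, b)                           # current residuals
        step = np.linalg.solve(G.T @ G, G.T @ r)  # OLS of r on G
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

print(gauss_newton(np.array([1.0, 0.1])))
```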

(A) Singular Hessian: As in the Newton-Raphson method, to solve for the optimum use

$$\left[\sum_t \frac{\partial f_t}{\partial\beta}\,\frac{\partial f_t}{\partial\beta'} + \alpha I\right], \qquad \alpha \text{ a scalar.}$$

(B) Algorithm Overshoots: To avoid overshooting, use the


Hartley modification. Form:

$$\Delta_1 = \left[\frac1n\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\,\frac{\partial f_t}{\partial\beta'}\right]^{-1}_{\hat\beta_1}\frac1n\sum_{t=1}^{n}\frac{\partial f_t}{\partial\beta}\bigg|_{\hat\beta_1}\Big[y_t - f(x_t;\hat\beta_1)\Big].$$

Then choose $0 < \lambda < 1$ such that:

$$S(\hat\beta_1 + \lambda\Delta_1) < S(\hat\beta_1), \qquad \text{where } S(\beta) = \sum_t\big(y_t - f(x_t;\beta)\big)^2.$$

Update by setting $\hat\beta_2 = \hat\beta_1 + \lambda\Delta_1$; a sketch follows. Then the algorithm converges to a root of the equation. General global convergence is a mess, unresolved.
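A sketch of the Hartley modification for the running example: compute the Gauss-Newton direction $\Delta_1$, then shrink the step until the sum of squares $S(\beta)$ actually falls (simple step-halving stands in for the choice of $\lambda$).

```python
# Hartley-modified Gauss-Newton (continues the running example).
def S(b):
    return np.sum((y - f(x, b)) ** 2)

def gauss_newton_hartley(b, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        G = grad_f(x, b)
        r = y - f(x, b)
        delta = np.linalg.solve(G.T @ G, G.T @ r)  # Delta_1
        lam = 1.0
        while S(b + lam * delta) >= S(b) and lam > 1e-8:
            lam *= 0.5                             # dampen until S decreases
        b_new = b + lam * delta
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
    return b

print(gauss_newton_hartley(np.array([1.0, 0.1])))
```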


1.6.3 Efficiency theorems for estimation methods

Theorem 1. One Newton-Raphson step toward an optimum is fully efficient if you start from an initial consistent estimator.

This theorem suggests a strategy for quick convergence: get a cheap (low computational cost) estimator which is consistent but not efficient, then iterate once, which avoids computational cost. (This is true also for Gauss-Newton; a sketch follows.) Note that here one must use unmodified Hessians (without corrections for overshooting or singularity).
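A sketch of the one-step idea for the running example. The pilot estimate below is just a hand-picked stand-in for a cheap consistent estimator (e.g. from a coarse grid search); a single Gauss-Newton step from it is then asymptotically equivalent to the fully iterated estimator.

```python
# One efficient step from a consistent pilot estimate (running example).
b_pilot = np.array([1.8, 0.45])     # stand-in for a cheap consistent estimator
G = grad_f(x, b_pilot)
r = y - f(x, b_pilot)
b_one_step = b_pilot + np.linalg.solve(G.T @ G, G.T @ r)
print(b_one_step)                   # close to the fully iterated NLLS estimate
```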

Proof. Suppose $\hat\beta_1 \xrightarrow{p} \beta_0$ and $\sqrt n\,(\hat\beta_1 - \beta_0) \xrightarrow{d} N(0, \Sigma)$. It is consistent but not necessarily efficient. Now expand the root


of the likelihood equation in a nbd of $\hat\beta_1$ to get:

$$\frac{\partial\ln L}{\partial\beta}\bigg|_{\hat\beta_1} = \frac{\partial\ln L}{\partial\beta}\bigg|_{\beta_0} + \frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\bigg|_{\beta_1^*}(\hat\beta_1 - \beta_0).$$

$\hat\beta_1$ does not necessarily set the left-hand side to zero; if it did, we would have an efficient estimator. As before, $\beta_1^*$ is an intermediate value.


Now look at the Newton-Raphson criterion:

$$\hat\beta_2 - \beta_0 = (\hat\beta_1 - \beta_0) - \left[\frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\right]^{-1}_{\hat\beta_1}\left[\frac{\partial\ln L}{\partial\beta}\right]_{\hat\beta_1}$$

$$\phantom{\hat\beta_2 - \beta_0} = (\hat\beta_1 - \beta_0) - \left[\frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\right]^{-1}_{\hat\beta_1}\left[\frac{\partial\ln L}{\partial\beta}\bigg|_{\beta_0} + \frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\bigg|_{\beta_1^*}(\hat\beta_1 - \beta_0)\right]$$

Multiplying by $\sqrt n$ and collecting terms, we get:

$$\sqrt n\,\big(\hat\beta_2 - \beta_0\big) = -\left[\frac1n\frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\bigg|_{\hat\beta_1}\right]^{-1}\frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\beta}\bigg|_{\beta_0} + \left[I - \left[\frac1n\frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\bigg|_{\hat\beta_1}\right]^{-1}\frac1n\frac{\partial^2\ln L}{\partial\beta\,\partial\beta'}\bigg|_{\beta_1^*}\right]\sqrt n\,\big(\hat\beta_1 - \beta_0\big)$$


$$\xrightarrow{p}\ \mathcal I_{\beta_0\beta_0}^{-1}\,\frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\beta}\bigg|_{\beta_0} + \Big[\big(\mathcal I_{\beta_0\beta_0}^{-1}\,\mathcal I_{\beta_0\beta_0} - I\big)\,\sqrt n\,\big(\hat\beta_1 - \beta_0\big)\Big].$$

The second term vanishes, as $\sqrt n\,(\hat\beta_1 - \beta_0)$ is $O_p(1)$ while the bracketed matrix converges to zero. Therefore, one Newton-Raphson step satisfies the likelihood equation at $\beta_0$ asymptotically, and $\hat\beta_2$ has the efficient limiting distribution.


The same result obviously holds for Gauss-Newton: one G-N step from a consistent estimator is fully efficient (or at least as efficient as NLLS).

Thus starting from a consistent estimator (where possible) saves computer time, avoids problems of nonlinear optimization, and also avoids the local optimization problem (i.e., the possibility of arriving at an inconsistent local optimum).


2 “Durbin Problem”

Durbin’s problem is the question of arriving at the correct variance-covariance matrix for a set of parameters estimated in the second step of a two-step estimation procedure.

For example, let $\theta_0 = (\bar\theta_1, \bar\theta_2)$, where $\bar\theta_1, \bar\theta_2$ are “true values”, as in the case of the composite hypothesis considered in the earlier lecture (Asymptotic Theory IV).


Suppose we use an initial consistent estimator $\tilde\theta_2$ for $\theta_2$. Then, if we treat the likelihood as if $\theta_2$ were known (but it is estimated by $\tilde\theta_2$), we have:

$$\underbrace{\frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\theta_1}\bigg|_{\hat\theta_1,\tilde\theta_2}}_{=0} = \frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\theta_1}\bigg|_{\theta_0} + \frac1n\frac{\partial^2\ln L}{\partial\theta_1\,\partial\theta_1'}\bigg|_{*}\sqrt n\,\big(\hat\theta_1 - \bar\theta_1\big) + \underbrace{\frac1n\frac{\partial^2\ln L}{\partial\theta_1\,\partial\theta_2'}\bigg|_{*}\sqrt n\,\big(\tilde\theta_2 - \bar\theta_2\big)}_{\text{“Durbin Problem”}}$$

We assume the sample sizes are the same in both samples.


which implies

$$\sqrt n\,\big(\hat\theta_1 - \bar\theta_1\big) = \mathcal I_{11}^{-1}\,\frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\theta_1}\bigg|_{\theta_0} - \mathcal I_{11}^{-1}\mathcal I_{12}\,\sqrt n\,\big(\tilde\theta_2 - \bar\theta_2\big),$$

where $\mathcal I_{ij} \equiv -\operatorname{plim}\dfrac1n\dfrac{\partial^2\ln L}{\partial\theta_i\,\partial\theta_j'}$, and $\tilde L$ is the likelihood (with its own sample size) used to produce $\tilde\theta_2$.


Thus, to obtain the right covariance matrix for $\hat\theta_1$, we need the covariance between the two score vectors. We have:

$$\sqrt{\tilde n}\,\big(\tilde\theta_2 - \bar\theta_2\big) = -\left(\frac{1}{\tilde n}\frac{\partial^2\ln\tilde L}{\partial\theta_2\,\partial\theta_2'}\right)^{-1}\frac{1}{\sqrt{\tilde n}}\frac{\partial\ln\tilde L}{\partial\theta_2} = \tilde{\mathcal I}_{22}^{-1}\,\frac{1}{\sqrt{\tilde n}}\frac{\partial\ln\tilde L}{\partial\theta_2},$$

which implies

$$\sqrt n\,\big(\hat\theta_1 - \bar\theta_1\big) = \mathcal I_{11}^{-1}\,\frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\theta_1}\bigg|_{\theta_0} - \mathcal I_{11}^{-1}\mathcal I_{12}\,\tilde{\mathcal I}_{22}^{-1}\,\frac{1}{\sqrt{\tilde n}}\frac{\partial\ln\tilde L}{\partial\theta_2}\bigg|_{\theta_0}.$$


We need to compute this covariance to get the right standard errors. Just form a new covariance matrix:

$$V\big(\hat\theta_1\big) = \mathcal I_{11}^{-1} + \mathcal I_{11}^{-1}\mathcal I_{12}\,\tilde{\mathcal I}_{22}^{-1}\,\mathcal I_{21}\,\mathcal I_{11}^{-1} - \mathcal I_{11}^{-1}\mathcal I_{12}\,\tilde{\mathcal I}_{22}^{-1}\operatorname{Cov}(S_2, S_1)\,\mathcal I_{11}^{-1} - \mathcal I_{11}^{-1}\operatorname{Cov}(S_1, S_2)\,\tilde{\mathcal I}_{22}^{-1}\,\mathcal I_{21}\,\mathcal I_{11}^{-1} \qquad (*)$$

where (now we assume two different sample sizes):


$$S_1 = \frac{1}{\sqrt n}\frac{\partial\ln L}{\partial\theta_1}\bigg|_{\hat\theta_1,\tilde\theta_2},$$

where $n$ is the sample size for the primary sample, and

$$S_2 = \frac{1}{\sqrt{\tilde n}}\frac{\partial\ln\tilde L}{\partial\theta_2}\bigg|_{\tilde\theta_2},$$

where $\tilde n$ is the sample size for the sample used to get $\tilde\theta_2$.

In the independent-sample case, the last two terms in (*) vanish, so that we get:

$$V\big(\hat\theta_1\big) = \mathcal I_{11}^{-1} + \mathcal I_{11}^{-1}\mathcal I_{12}\,\tilde{\mathcal I}_{22}^{-1}\,\mathcal I_{21}\,\mathcal I_{11}^{-1}.$$

A numerical sketch of this correction follows.
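A schematic sketch of the independent-samples correction. The information blocks below are made-up illustrative numbers; the point is only the matrix algebra: the corrected variance adds the sandwich term to the naive $\mathcal I_{11}^{-1}$.

```python
# Durbin correction, independent-samples case (illustrative numbers).
import numpy as np

I11 = np.array([[4.0, 1.0],
                [1.0, 3.0]])            # information block for theta_1
I12 = np.array([[0.5],
                [0.2]])                 # cross block theta_1 x theta_2
I22_tilde = np.array([[2.0]])           # block from the first-step sample

I11_inv = np.linalg.inv(I11)
V_naive = I11_inv                        # wrong: ignores estimation of theta_2
V_corrected = I11_inv + I11_inv @ I12 @ np.linalg.inv(I22_tilde) @ I12.T @ I11_inv

print(np.diag(V_naive))      # naive variances
print(np.diag(V_corrected))  # corrected variances are larger
```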


2.1 Concentrated Likelihood Problem

This problem seems similar to, but is actually different from, the Durbin problem. Here we have a log likelihood function which has two sets of parameters, $\ln L(\theta_1, \theta_2)$.

In the first step here, we solve

$$\frac{\partial\ln L}{\partial\theta_2}(\theta_1, \theta_2) = 0 \quad\text{to get}\quad \theta_2(\theta_1).$$

We then optimize $\ln L(\theta_1, \theta_2(\theta_1))$ with respect to $\theta_1$.

While this looks like the two-step estimator in Durbin’s problem, here we are not using an estimate of $\theta_2$, but rather using the function $\theta_2(\theta_1)$. In fact, here we can show that this is the same as joint maximization of $\ln L$.


Using the Envelope Theorem (i.e., utilizing the fact that $\theta_2(\theta_1)$ is arrived at through an optimization), we get:

$$\frac{d\ln L(\theta_1, \theta_2(\theta_1))}{d\theta_1} = \frac{\partial\ln L(\theta_1, \theta_2(\theta_1))}{\partial\theta_1}$$

$$\frac{d^2\ln L(\theta_1, \theta_2(\theta_1))}{d\theta_1\,d\theta_1'} = \frac{\partial^2\ln L(\theta_1, \theta_2(\theta_1))}{\partial\theta_1\,\partial\theta_1'} + \frac{\partial^2\ln L(\theta_1, \theta_2(\theta_1))}{\partial\theta_1\,\partial\theta_2'}\frac{\partial\theta_2}{\partial\theta_1'}$$

Now, we also have (differentiating the first-step FOC):

$$\frac{d}{d\theta_1'}\left[\frac{\partial\ln L}{\partial\theta_2}\right] = \frac{\partial^2\ln L}{\partial\theta_2\,\partial\theta_1'} + \frac{\partial^2\ln L}{\partial\theta_2\,\partial\theta_2'}\frac{\partial\theta_2}{\partial\theta_1'} = 0$$

$$\Rightarrow\quad \frac{\partial\theta_2}{\partial\theta_1'} = -\left(\frac{\partial^2\ln L}{\partial\theta_2\,\partial\theta_2'}\right)^{-1}\frac{\partial^2\ln L}{\partial\theta_2\,\partial\theta_1'}$$


Substituting into the previous expression, we get:

$$\operatorname*{plim}\,\frac1n\frac{d^2\ln L(\theta_1, \theta_2(\theta_1))}{d\theta_1\,d\theta_1'} = -\big(\mathcal I_{11} - \mathcal I_{12}\,\mathcal I_{22}^{-1}\,\mathcal I_{21}\big)$$

$\Rightarrow$ The asymptotic distribution is the same for $\theta_1$ whether we estimate jointly or through the concentrated likelihood approach. A sketch follows.

(Refer to Asymptotic Theory — Lecture IV, section on the composite hypothesis, for the distribution of a sub-vector of parameters when estimation is done jointly.)
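A sketch of concentration in a case where it can be checked directly: a normal linear regression, concentrating out $\sigma^2$. Solving $\partial\ln L/\partial\sigma^2 = 0$ gives $\sigma^2(\beta) = \text{RSS}(\beta)/n$; maximizing the concentrated likelihood over $\beta$ reproduces the joint MLE (OLS here). All names and data are illustrative assumptions.

```python
# Concentrated vs. joint likelihood (illustrative normal regression).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(0.0, 0.7, n)

def neg_concentrated_loglik(b):
    sigma2_b = np.mean((y - X @ b) ** 2)   # inner solution sigma2(b) = RSS/n
    return 0.5 * n * np.log(sigma2_b)      # -lnL(b, sigma2(b)) up to constants

b_conc = minimize(neg_concentrated_loglik, x0=np.zeros(2)).x
b_joint = np.linalg.lstsq(X, y, rcond=None)[0]   # joint MLE of b is OLS here
print(b_conc, b_joint)                            # the two agree
```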
