The Asymptotic Distribution of Estimators with Overlapping Simulation Draws∗

Tim Armstrong^a, A. Ronald Gallant^b, Han Hong^c, Huiyu Li^d

First draft: September 2011. Current draft: August 2015.
Abstract
We study the asymptotic distribution of simulation estimators, where the same set of draws is used for all observations, under general conditions that do not require the function used in the simulation to be smooth. We consider two cases: estimators that solve a system of equations involving simulated moments and estimators that maximize a simulated likelihood. Many simulation estimators used in empirical work involve both overlapping simulation draws and nondifferentiable moment functions. Developing sampling theorems under these two conditions provides an important complement to the existing results in the literature on the asymptotics of simulation estimators.
Keywords: U-Process, Simulation estimators.
JEL Classification: C12, C15, C22, C52.
∗Corresponding author is A. Ronald Gallant, The Pennsylvania State University, Department of Economics, 613 Kern Graduate Building, University Park PA 16802-3306, USA, [email protected]. We thank Donald Andrews, Bernard Salanie, and Joe Romano in particular, as well as seminar participants, for insightful comments. We also acknowledge support by the National Science Foundation (SES 1024504) and SIEPR.

^a Yale University  ^b The Pennsylvania State University  ^c Stanford University  ^d Federal Reserve Bank of San Francisco
1 Introduction
Simulation estimation is popular in economics; it was developed by Lerman and Manski (1981), McFadden (1989), Laroque and Salanie (1989, 1993), Duffie and Singleton (1993), Gourieroux and Monfort (1996) and Gallant and Tauchen (1996), among others. Pakes and Pollard (1989) provided a general asymptotic approach for generalized method of simulated moments estimators, and verified the conditions of the general theory when a fixed number of independent simulations is used for each of the independent observations. A recent insightful paper by Lee and Song (2015) also developed results for a class of simulated maximum-likelihood-like estimators. In practice, however, researchers sometimes use the same set of simulation draws for all the observations in the dataset.
Independent simulation draws are doubly indexed, i.e. ω_ir, so that there are n × R simulations in total, where n is the number of observations and R is the number of simulations for each observation. Overlapping simulation draws are singly indexed, i.e. ω_r, so that there are R simulations in total, and the same R simulations are used for every observation. The properties of simulation based estimators using overlapping and independent simulation draws are studied by Lee (1992), Lee (1995) and Kristensen and Salanie (2010) under the condition that the simulated moment conditions are smooth and continuously differentiable functions of the parameters. This is, however, a strong assumption that is likely to be violated by many simulation estimators used in practice. We extend the above results to nonsmooth moment functions using the empirical process and U-process theories developed in a sequence of papers by Pollard (1984), Nolan and Pollard (1987, 1988) and Neumeyer (2004). In particular, the main insight relies on verifying the high level conditions in Pakes and Pollard (1989), Chen, Linton, and Van Keilegom (2003) and Ichimura and Lee (2010) by combining the results in Neumeyer (2004) with results from the empirical process literature (e.g. Andrews (1994)).
Even for the simulated method of moments estimator, the classical results in Pakes and Pollard (1989) and McFadden (1989) are for independent simulation draws. However, their results only apply to a finite number of independent simulations for each observation, since the proof depends crucially on the fact that a finite sum of functions with limited complexity also has limited complexity. Whether their analysis can be extended to a larger number of simulation draws is a challenging open question. With overlapping simulation draws, this difficulty is resolved by appealing to two-sample U-process theory.
A main application of maximum simulated likelihood estimators is multinomial probit discrete choice models and their various panel data versions (Newey and McFadden (1994)). Whether or not using overlapping simulations improves computational efficiency depends on the specific model. Generating the random numbers is easy, but computing the moment conditions or the likelihood function is typically difficult. To equate the order of computational effort, we adopt the notation of letting R denote either the total number of overlapping simulations or the number of independent simulations for each observation. For a given R, Lee (1995) and Kristensen and Salanie (2010) pointed out that the leading terms of the asymptotic expansion are smaller with independent draws than with overlapping draws. This suggests that independent draws are more desirable, and lead to smaller confidence intervals, whenever they are feasible.
There are still two reasons, one theoretical and one computational, to consider overlapping draws, especially for simulated maximum likelihood estimators. Despite the theoretical advantage of the method of simulated moments, the method of simulated maximum likelihood remains appealing in empirical research, partly because it minimizes a well defined distance between the model and the data even when the model is misspecified. The asymptotic theory with independent draws in this case is difficult and, to our knowledge, has not been fully worked out in the literature. In particular, Pakes and Pollard (1989) only provided an analysis for simulated GMM, but did not provide an analysis for simulated MLE, which can in fact be far more involved. Only the very recent insightful paper by Lee and Song (2015) studies an unbiased approximation to the simulated maximum likelihood, which still differs from most empirical implementations of simulated maximum likelihood methods using nonsmooth crude frequency simulators. Smoothing typically requires the choice of kernel and bandwidth parameters and introduces biases. For example, the Stern (1992) decomposition simulator, while smooth and unbiased, requires repeated calculations of eigenvalues and is computationally prohibitive. Significant progress was made only recently, in an important paper by Kristensen and Salanie (2010), who develop bias reduction techniques for simulation estimators.
When computing the simulated likelihood function is very difficult, overlapping simulations can be used to trade off computational feasibility against statistical accuracy. Using independent draws requires that R increase faster than √n, where n is the sample size, in order for the estimator to have an asymptotic normal distribution. With overlapping draws, the estimator is asymptotically normal as long as R increases to infinity. A caveat, of course, is that when R is much smaller than n, the asymptotic distribution mostly represents the simulation noise rather than the sampling error, which reflects the cost in statistical accuracy that is the price of more feasible computation.
2 Simulated Moments and Simulated Likelihood
We begin by formally defining the method of simulated moments and maximum simulated likelihood using overlapping simulation draws. These methods are defined in Lee (1992) and Lee (1995) in the context of multinomial discrete choice models. We use a more general notation that allows for both continuous and discrete dependent variables. Let z_i = (y_i, x_i), i = 1, ..., n, be the i.i.d. random variables in the observed sample, where the y_i are the dependent variables and the x_i are the covariates or regressors. We are concerned with estimating an unknown parameter θ ∈ Θ ⊂ R^k. As discussed in the introduction, independent draws are typically preferable in the method of simulated moments. The method of moments results are developed both for completeness and as an expositional transition to the simulated maximum likelihood section.
The method of moments estimator is based on a set of moment conditions g(z_i, θ) ∈ R^d such that g(θ) ≡ Pg(·, θ) is zero if and only if θ = θ_0, where θ_0 is construed as the true parameter value. In the above, Pg(·, θ) = ∫ g(z, θ) dP(z) denotes expectation with respect to the true distribution of z_i. In models where the moment g(z_i, θ) cannot be analytically evaluated, it can often be approximated using simulations. Let ω_r, r = 1, ..., R, be a set of simulation draws, and let q(z_i, ω_r, θ) be a function that is an unbiased estimator of g(z_i, θ) for all z_i:
\[ Qq(z,\cdot,\theta) \equiv \int q(z,\omega,\theta)\, dQ(\omega) = g(z,\theta). \]
Then the unknown moment condition g(z, θ) can be estimated by
\[ \hat g(z,\theta) = Q_R\, q(z,\cdot,\theta) \equiv \frac{1}{R}\sum_{r=1}^{R} q(z,\omega_r,\theta), \]
which in turn is used to form an estimate of the population moment condition g(θ):
\[ \hat g(\theta) = P_n\, \hat g(\cdot,\theta) \equiv \frac{1}{n}\sum_{i=1}^{n} \hat g(z_i,\theta) = \frac{1}{nR}\sum_{i=1}^{n}\sum_{r=1}^{R} q(z_i,\omega_r,\theta). \]
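For concreteness, a minimal Python sketch of this construction (the function name and array interface are our own illustration; q is the user's unbiased moment kernel):

```python
import numpy as np

def simulated_moment(q, z, omega, theta):
    """ghat(theta) = (1/(n R)) * sum_i sum_r q(z_i, w_r, theta), overlapping draws.

    q(z_i, omega, theta) returns an (R, d) array of kernel values; every
    observation z_i is paired with the SAME block of draws omega.
    """
    return np.mean([q(zi, omega, theta).mean(axis=0) for zi in z], axis=0)
```

An independent-draw implementation would instead pass a different (R, d_ω) block of draws to each observation.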
In the above, both z_1, ..., z_n and ω_1, ..., ω_R are i.i.d. and they are independent of each other. The method of simulated moments (MSM) estimator with overlapping simulation draws is defined with the usual quadratic norm as in Pakes and Pollard (1989):
\[ \hat\theta = \operatorname*{argmin}_{\theta\in\Theta}\; \|\hat g(\theta)\|^2_{W_n}, \qquad \text{where } \|x\|^2_W = x'Wx, \]
and both W_n and W are d × d weighting matrices such that W_n → W in probability.
In the maximum simulated likelihood method, we reinterpret g(z_i; θ) as the likelihood function of θ at the observation z_i, and ĝ(z_i; θ) as the simulated likelihood function, which is an unbiased estimator of g(z_i; θ). The MSL estimator is usually defined as, for i.i.d. data,
\[ \hat\theta = \operatorname*{argmax}_{\theta\in\Theta}\; P_n \log \hat g(\cdot;\theta) = \frac{1}{n}\sum_{i=1}^{n} \log \hat g(z_i;\theta). \]
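The MSL objective is a one-line variation on the same theme; the sketch below (again with hypothetical names, and q now returning the length-R vector of simulated likelihood contributions) makes explicit that the log is taken after averaging over the common draws:

```python
import numpy as np

def msl_objective(q, z, omega, theta):
    """Lhat(theta) = (1/n) sum_i log( (1/R) sum_r q(z_i, w_r, theta) )."""
    ghat = np.array([q(zi, omega, theta).mean() for zi in z])  # simulated likelihoods
    return np.log(ghat).mean()
```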
While g(z_i; θ) is typically a smooth function of z_i and θ, ĝ(z_i; θ) often is not. In these situations it is difficult to obtain the exact optimum for both MSM and MSL, and these definitions will be relaxed below to only require that the MSM and MSL estimators attain a "near-optimum" of the respective objective functions. The likelihood function g(z; θ) can be either the density for continuous data or the probability mass function for discrete data. It can also be either the joint likelihood of the data or the conditional likelihood g(z; θ) = g(y|x; θ) when z = (y, x).
In the following we develop conditions under which both MSM and MSL are consistent as both n → ∞ and R → ∞. Under the conditions given below, they both converge at the rate √m, where m = min(n, R), to a limiting normal distribution. These results are developed separately for MSM and MSL. For MSL, the condition R ≫ √n is required for asymptotic normality with independent simulation draws, e.g. Laroque and Salanie (1989) and Train (2003). With overlapping draws, asymptotic normality holds as long as both R and n converge to infinity. If R ≪ n, then the convergence rate becomes √R instead of √n. A simulation estimator with overlapping simulations can also be viewed as a profiled two step estimator, in order to invoke the high level conditions in Chen, Linton, and Van Keilegom (2003). The derivations in the remaining sections are tantamount to verifying these high level conditions. For maximum likelihood with independent simulations, the bias reduction condition R/√n → ∞ is derived in Laroque and Salanie (1989, 1993) and Gourieroux and Monfort (1996), and is strengthened by Lee and Song (2015), who require R to grow faster than √n log R, for nonsmooth maximum-likelihood-like estimators. To summarize, the following assumption is maintained throughout the paper.
ASSUMPTION 1 Let z_i = (y_i, x_i), i = 1, ..., n and ω_r, r = 1, ..., R be two independent sequences of i.i.d. random variables with distributions P and Q respectively. The function q(z_i, ω_r, θ) satisfies Qq(z, ·, θ) ≡ ∫ q(z, ω, θ) dQ(ω) = g(z, θ) for all z and all θ ∈ Θ.
3 Asymptotics of MSM with Overlapping Simulations
The MSM objective function takes the form of a two-sample U-process studied extensively in Neumeyer (2004):
\[ \hat g(\theta) \equiv \frac{1}{nR}\, S_{nR}(\theta), \qquad \text{where } S_{nR}(\theta) \equiv \sum_{i=1}^{n}\sum_{r=1}^{R} q(z_i,\omega_r,\theta), \]
with kernel function q(z_i, ω_r, θ) and its associated projections
\[ g(z_i,\theta) = Q\, q(z_i,\cdot,\theta) \qquad \text{and} \qquad h(\omega_r,\theta) \equiv P\, q(\cdot,\omega_r,\theta). \]
The following assumption restricts the complexity of the kernel function and its projections
viewed as classes indexed by the parameter θ.
ASSUMPTION 2 For each j = 1, ..., d, the following three classes of functions
F = {q_j(z_i, ω_r, θ) : θ ∈ Θ},
QF = {g_j(z_i, θ) : θ ∈ Θ},
PF = {h_j(ω_r, θ) : θ ∈ Θ},
are Euclidean, cf. Lemma 25 (p. 27), Lemma 36 (p. 34), and Theorem 37 (p. 34) of Pollard (1984). Their envelope functions, denoted respectively by F, QF and PF, have at least two moments.
By Definition (2.7) in Pakes and Pollard (1989) (hereafter P&P), a class of functions F is called Euclidean for the envelope F if there exist positive constants A and V that do not depend on the measure µ, such that if µ is a measure for which ∫F dµ < ∞, then for each ε > 0 there are functions f_1, ..., f_k in F such that (i) k ≤ Aε^{−V}; and (ii) for each f in F there is an f_i with ∫|f − f_i| dµ ≤ ε ∫F dµ.
This assumption is satisfied by many known functions; a counterexample is given on page 2252 of Andrews (1994). In the case of binary choice models, it is satisfied given common low level conditions on the random utility functions. For example, when the random utility function is linear with an additive error term, q(z_i, ω_r, θ) typically takes a form that resembles 1(z_i'θ + ω_r ≥ 0), which is Euclidean by Lemma 18 in Pollard (1984). As another example, in random coefficient binary choice models, the conditional choice probability is typically the integral of a distribution function of a single index Λ(x_i'β) over the distribution of the random coefficient β. Suppose β follows a normal distribution with mean v_i'θ_1 and a variance matrix with Cholesky factor θ_2; then, for φ(·; µ, Σ) the normal density function with mean µ and variance matrix Σ, the choice probability is given by
\[ \int \Lambda(x_i'\beta)\, \phi(\beta;\, v_i'\theta_1,\, \theta_2'\theta_2)\, d\beta. \]
In this model, for draws ω_r from the standard normal density, and for z_i = (x_i, v_i), q(z_i, ω_r, θ) takes a form that resembles
\[ \Lambda\big(x_i'(v_i\theta_1 + \theta_2'\omega_r)\big) = \Lambda\Big(x_i' v_i \theta_1 + \sum_{k=1}^{K} x_{ik}\, \theta_{2k}'\, \omega_r\Big). \]
As long as Λ(·) is a monotone function, this function is Euclidean according to Lemma 2.6.18 in Van der Vaart and Wellner (1996).
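As a hedged numerical illustration of this simulator (all names, dimensions, and the logistic choice of Λ are our own assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_choice_prob(x, v, theta1, theta2_chol, omega):
    """Simulated choice probability for a random coefficient binary choice model:
    beta_r = v @ theta1 + theta2_chol.T @ omega_r, averaged over the common draws."""
    Lambda = lambda u: 1.0 / (1.0 + np.exp(-u))   # logistic cdf standing in for Lambda(.)
    betas = v @ theta1 + omega @ theta2_chol      # (R, K) simulated coefficients
    return Lambda(betas @ x).mean()               # average of Lambda(x' beta_r)

# Example with K = 2 random coefficients and R = 1000 overlapping draws.
K, R = 2, 1000
omega = rng.standard_normal((R, K))
x, v = np.array([1.0, 0.5]), np.eye(K)
theta1, theta2_chol = np.array([0.2, -0.1]), 0.5 * np.eye(K)
print(sim_choice_prob(x, v, theta1, theta2_chol, omega))
```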
Under Assumption 2, which implies that the class F and its projections QF and PF are Euclidean (see Neumeyer (2004), p. 79), the following lemma is analogous to Theorems 2.5, 2.7 and 2.9 of Neumeyer (2004).
LEMMA 1 Under Assumption 2 the following statements hold:
a. Define the centered kernel
\[ \tilde q(z,\omega,\theta) = q(z,\omega,\theta) - g(z,\theta) - h(\omega,\theta) + g(\theta); \]
then
\[ \sup_{\theta\in\Theta} \|\tilde S_{nR}(\theta)\| = O_p(\sqrt{nR}), \qquad \text{where } \tilde S_{nR}(\theta) \equiv \sum_{i=1}^{n}\sum_{r=1}^{R} \tilde q(z_i,\omega_r,\theta). \]

b. Define
\[ U_{nR}(\theta) \equiv \sqrt{m}\left(\frac{1}{nR}\, S_{nR}(\theta) - g(\theta)\right); \]
then
\[ \sup_{d(\theta_1,\theta_2)=o(1)} \|U_{nR}(\theta_1) - U_{nR}(\theta_2)\| = o_p(1), \]
where d(θ_1, θ_2) denotes the Euclidean distance √((θ_1 − θ_2)'(θ_1 − θ_2)).

c. Further,
\[ \sup_{\theta\in\Theta} \left\| \frac{1}{nR}\, S_{nR}(\theta) - g(\theta) \right\| = o_p(1). \]
Proof Consider first the case when the moment condition q(z, ω, θ) is univariate, so that d = 1. The first statement (a) follows from Theorem 2.5 in Neumeyer (2004). The proof of part (b) resembles Theorem 2.7 in Neumeyer (2004) but does not require n/(n + R) → κ ∈ (0, 1). First define
\[ \tilde U_{nR}(\theta) = \frac{\sqrt{m}}{nR}\, \tilde S_{nR}(\theta). \]
It follows from part (a) that
\[ \sup_{\theta\in\Theta} \|\tilde U_{nR}(\theta)\| = O_p\left(\sqrt{\frac{m}{nR}}\right) = o_p(1). \]
Since U_{nR}(θ) = Ũ_{nR}(θ) + √m(P_n − P)g(·, θ) + √m(Q_R − Q)h(·, θ), it only remains to verify the stochastic equicontinuity conditions for the two projection terms:
\[ \sup_{d(\theta_1,\theta_2)=o(1)} \|\sqrt{m}\,(P_n - P)(g(\cdot,\theta_1) - g(\cdot,\theta_2))\| = o_p(1) \]
and
\[ \sup_{d(\theta_1,\theta_2)=o(1)} \|\sqrt{m}\,(Q_R - Q)(h(\cdot,\theta_1) - h(\cdot,\theta_2))\| = o_p(1). \]
This in turn follows from m ≤ n, R and the equicontinuity lemma of Pollard (1984), p. 150. Part (c) mimics Theorem 2.9 in Neumeyer (2004), noting that
\[ \frac{1}{nR}\, S_{nR}(\theta) - g(\theta) = \frac{1}{nR}\, \tilde S_{nR}(\theta) + (P_n - P)g(\cdot,\theta) + (Q_R - Q)h(\cdot,\theta), \]
and invoking part (a) and Theorem 24 of Pollard (1984), p. 25.

When the moment conditions q(z, ω, θ) are multivariate, so that d > 1, the above arguments apply to each univariate element of the vector moment condition q(z, ω, θ). In the vector case, the notation ‖·‖ (e.g. ‖S_{nR}(θ)‖) denotes the Euclidean norm. The stated results in the lemma then follow from the equivalence between the L2 norm and the L1 norm, since for g ∈ R^d, ‖g‖ ≤ ∑_{j=1}^{d} |g_j|. □
Lemma 1 will be applied in combination with the following restatement of a version of
Theorem 7.2 of Newey and McFadden (1994) and Theorem 3.3 of Pakes and Pollard (1989).
THEOREM 1 Let θ̂ → θ_0 in probability, where g(θ) = 0 if and only if θ = θ_0, which is an interior point of the compact Θ. If

i. ‖ĝ(θ̂)‖_{W_n} ≤ inf_θ ‖ĝ(θ)‖_{W_n} + o_p(m^{−1/2});

ii. W_n = W + o_p(1), where W is positive definite;

iii. g(θ) is continuously differentiable at θ_0 with a full rank derivative matrix G;

iv. sup_{d(θ,θ_0)=o(1)} √m ‖ĝ(θ) − g(θ) − ĝ(θ_0)‖_W = o_p(1);

v. √m ĝ(θ_0) →_d N(0, Σ);

then the following result holds:
\[ \sqrt{m}\,(\hat\theta - \theta_0) \xrightarrow{d} N\big(0,\; (G'WG)^{-1} G'W\Sigma WG (G'WG)^{-1}\big). \qquad \square \]
Remark: The original Theorem 3.3 of Pakes and Pollard (1989) uses the Euclidean norm to define the GMM objective function, which amounts to using an identity weighting matrix W_n ≡ I. However, generalizing their proof arguments to a general random W_n is straightforward. First, note that their Theorem 3.3 is isomorphic to using a fixed positive definite weighting matrix W to define the norm. This is because if one uses the square root A of W (such that A'A = W) to form a linear combination of the original moment conditions ĝ(θ), the moment condition in Theorem 3.3 can be reinterpreted as Aĝ(θ), and exactly the same arguments in the proof go through, with the matrices Γ and V in P&P being AG and AΣA'.

Second, a close inspection of the proof of Theorem 3.3 in P&P shows that their Condition (i) is only used to the extent of requiring both
\[ \|\hat g(\hat\theta)\|_W \le \|\hat g(\theta_0)\|_W + o_p(m^{-1/2}) \qquad\text{and}\qquad \|\hat g(\hat\theta)\|_W \le \|\hat g(\theta^*)\|_W + o_p(m^{-1/2}), \]
where θ* is the minimizer of the quadratic approximation to the objective function ‖ĝ(θ)‖_W defined on p. 1042 of P&P. These will follow from Condition [i] if
\[ \|\hat g(\hat\theta)\|_{W_n} = \|\hat g(\hat\theta)\|_W + o_p(m^{-1/2}), \qquad \|\hat g(\theta_0)\|_{W_n} = \|\hat g(\theta_0)\|_W + o_p(m^{-1/2}) \]
and
\[ \|\hat g(\theta^*)\|_{W_n} = \|\hat g(\theta^*)\|_W + o_p(m^{-1/2}), \]
all of which follow in turn from combining Conditions [ii], [iv], and [v].
Consistency, under the conditions stated in Corollary 1, is an immediate consequence of
part (c) of Lemma 1 and Corollary 3.2 of Pakes and Pollard (1989). Asymptotic normality
is an immediate consequence of Theorem 1.
COROLLARY 1 Given Assumption 2, θ̂ → θ_0 in probability under the following conditions: (a) g(θ) = 0 if and only if θ = θ_0; (b) W_n → W in probability for W positive definite; and (c) ‖ĝ(θ̂)‖_{W_n} = ‖ĝ(θ_0)‖_{W_n} + o_p(1).

Furthermore, if ‖ĝ(θ̂)‖_{W_n} = ‖ĝ(θ_0)‖_{W_n} + o_p(m^{−1/2}), and if R/n → κ ∈ [0, ∞] as n → ∞, R → ∞, then the conclusion of Theorem 1 holds under Assumption 2, with Σ = (1 ∧ κ)Σ_g + (1 ∧ 1/κ)Σ_h, where Σ_g = Var(g(z_i, θ_0)) and Σ_h = Var(h(ω_r, θ_0)). □
In particular, Lemma 1.b delivers condition [iv]. Condition [v] is implied by Lemma 1.a because
\[ \sqrt{m}\,\hat g(\theta_0) = \tilde U_{nR}(\theta_0) + \sqrt{m}\,(P_n - P)g(\cdot,\theta_0) + \sqrt{m}\,(Q_R - Q)h(\cdot,\theta_0) \]
\[ = \sqrt{m}\,(P_n - P)g(\cdot,\theta_0) + \sqrt{m}\,(Q_R - Q)h(\cdot,\theta_0) + o_p(1) \xrightarrow{d} N\big(0,\; (1\wedge\kappa)\Sigma_g + (1\wedge 1/\kappa)\Sigma_h\big). \]
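To see what this variance formula delivers in the extreme regimes, consider the following short worked example (our own illustration of the display above). When κ = 0, so that R grows more slowly than n, m = R and only the simulation projection survives; when κ = ∞, m = n and the classical GMM variance is recovered:
\[ \kappa = 0:\quad \Sigma = (1\wedge 0)\,\Sigma_g + (1\wedge\infty)\,\Sigma_h = \Sigma_h, \qquad \sqrt{R}\,(\hat\theta - \theta_0) \xrightarrow{d} N\big(0,\,(G'WG)^{-1}G'W\Sigma_h WG(G'WG)^{-1}\big); \]
\[ \kappa = \infty:\quad \Sigma = \Sigma_g, \qquad \sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\big(0,\,(G'WG)^{-1}G'W\Sigma_g WG(G'WG)^{-1}\big). \]
The first case formalizes the caveat from the introduction: when R ≪ n, the limiting variance reflects only simulation noise.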
3.1 MSM Variance Estimation
Each component of the asymptotic variance can be estimated using sample analogs. A consistent estimate Ĝ of G, with individual columns Ĝ_j, can be formed by numerical differentiation: for e_j a d_θ × 1 vector with 1 in the jth position and 0 otherwise, and δ a step size parameter,
\[ \hat G_j \equiv \hat G_j(\hat\theta, \delta) = \frac{1}{2\delta}\Big[\hat g(\hat\theta + e_j\delta) - \hat g(\hat\theta - e_j\delta)\Big]. \]
A sufficient, although likely not necessary, condition for Ĝ(θ̂) → G(θ_0) in probability is that both δ → 0 and √m δ → ∞. Under these conditions, Lemma 1.b implies that Ĝ_j − G_j(θ̂) → 0 in probability, and G_j(θ̂) → G_j(θ_0) in probability as δ → 0 and θ̂ → θ_0. Σ can be consistently estimated by
\[ \hat\Sigma = (1 \wedge R/n)\, \hat\Sigma_g + (1 \wedge n/R)\, \hat\Sigma_h, \]
where
\[ \hat\Sigma_g = \frac{1}{n}\sum_{i=1}^{n} \hat g(z_i,\hat\theta)\, \hat g'(z_i,\hat\theta) \qquad\text{and}\qquad \hat\Sigma_h = \frac{1}{R}\sum_{r=1}^{R} \hat h(\omega_r,\hat\theta)\, \hat h'(\omega_r,\hat\theta), \]
and
\[ \hat h(\omega,\theta) = \frac{1}{n}\sum_{i=1}^{n} q(z_i,\omega,\theta). \]
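A hedged Python sketch of these sample analogs (our own helper names; ghat is the simulated moment function and q the kernel, with the same interface as the earlier sketch):

```python
import numpy as np

def msm_variance(q, ghat, z, omega, theta_hat, delta):
    """Numerical Jacobian G_hat and variance pieces for overlapping-draw MSM."""
    k = theta_hat.size
    G = np.column_stack([(ghat(theta_hat + delta * e) - ghat(theta_hat - delta * e))
                         / (2 * delta) for e in np.eye(k)])   # central differences
    qs = np.stack([q(zi, omega, theta_hat) for zi in z])      # (n, R, d) kernel values
    gz = qs.mean(axis=1)                                      # ghat(z_i, theta_hat)
    hw = qs.mean(axis=0)                                      # hhat(w_r, theta_hat)
    Sigma_g = gz.T @ gz / len(z)
    Sigma_h = hw.T @ hw / len(omega)
    kappa = len(omega) / len(z)
    return G, min(1, kappa) * Sigma_g + min(1, 1 / kappa) * Sigma_h
```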
Resampling methods, such as the bootstrap and subsampling, or MCMC, can also be used for inference. Note that in Σ̂ above, R has to go to infinity with overlapping draws. In contrast, with independent draws, a finite R only incurs an efficiency loss of the order of 1/R.
4 Asymptotics of MSL with Overlapping Simulations
In this section we derive the asymptotic properties of maximum simulated likelihood estimators with overlapping simulations, which requires a different approach due to the nonlinearity of the log function. Recall that MSL is defined as
\[ \hat\theta = \operatorname*{argmax}_{\theta\in\Theta}\; \hat L(\theta), \]
where
\[ \hat L(\theta) = P_n \log Q_R\, q(\cdot,\cdot,\theta) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{1}{R}\sum_{r=1}^{R} q(z_i,\omega_r,\theta) = \frac{1}{n}\sum_{i=1}^{n} \log \hat g(z_i,\theta); \]
L̂(θ) and θ̂ are implicitly indexed by m = min(n, R).
To begin with, the class of functions q(z, ·, θ) of ω indexed by both θ and z is required to be a VC-class, as defined in Van der Vaart and Wellner (1996) (pp. 134, 141). Frequently g(z, θ) is a conditional likelihood of the form g(y|x, θ), where z = (y, x) includes both the dependent variable and the covariates. The "densities" g(z_i; θ) are broadly interpreted to also include probability mass functions for discrete choice models, or a mixture of probability density functions and probability mass functions for mixed discrete-continuous models.
ASSUMPTION 3 The class of functions indexed by both θ and z, L = {q(z, ·, θ) : z ∈ Z, θ ∈ Θ}, is VC with a uniformly bounded envelope function L. The classes {g(·, θ) : θ ∈ Θ} and {log g(·, θ) : θ ∈ Θ} are also both VC with a uniformly bounded envelope.
The following boundedness assumption is restrictive, but is difficult to relax for nonsmooth simulators using empirical process theory. It is also assumed in Lee (1992, 1995).

ASSUMPTION 4 There is an M < ∞ such that sup_{z,θ} |1/g(z, θ)| < M.
Let L(θ) = P log g(·; θ). The VC property and the boundedness assumption ensure uniform convergence.

LEMMA 2 Under Assumptions 2, 3, and 4, L̂(θ) − L̂(θ_0) converges in probability to L(θ) − L(θ_0) as m → ∞, uniformly over Θ.
Proof Consider the decomposition
\[ \hat L(\theta) - L(\theta) - \hat L(\theta_0) + L(\theta_0) = A(\theta) + B(\theta), \tag{1} \]
where
\[ A(\theta) = (P_n - P)\big[\log g(\cdot,\theta) - \log g(\cdot,\theta_0)\big], \]
\[ B(\theta) = P_n\big[\log \hat g(\cdot,\theta) - \log \hat g(\cdot,\theta_0) - \log g(\cdot,\theta) + \log g(\cdot,\theta_0)\big]. \]
First, by Theorem 19.13 of van der Vaart (1999), A(θ) converges uniformly to 0 in probability; by the monotonicity of the log transformation and Lemma 2.6.18 (v) and (viii) in Van der Vaart and Wellner (1996), log ∘ QF − log g(·, θ_0) is VC-subgraph.
Second, we show that B(θ) converges uniformly to 0 in probability as R → ∞. By Taylor's theorem and Assumption 4,
\[ \sup_\theta |B(\theta)| \le 2 \sup_{z,\theta} |\log \hat g(z,\theta) - \log g(z,\theta)| = 2 \sup_{z,\theta} \left| \frac{\hat g(z,\theta) - g(z,\theta)}{g^*(z,\theta)} \right| \le 2M \sup_{z,\theta} |\hat g(z,\theta) - g(z,\theta)|, \]
where g*(z, θ) is a mean value between ĝ(z, θ) and g(z, θ). Moreover, by Assumption 3 and Theorem 19.13 of van der Vaart (1999), as R → ∞,
\[ \sup_{z,\theta} |\hat g(z,\theta) - g(z,\theta)| \xrightarrow{p} 0. \]
Therefore B(θ) converges uniformly to 0 as R → ∞. The lemma then follows from the triangle inequality. □
Consistency is a direct consequence of Theorem 2.1 in Newey and McFadden (1994), given uniform convergence, when the true parameter is uniquely identified.

COROLLARY 2 Under Assumptions 2, 3, and 4, if

1. L̂(θ̂) ≥ L̂(θ_0) − o_p(1);

2. for any δ > 0, sup_{‖θ−θ_0‖≥δ} L(θ) < L(θ_0);

then θ̂ − θ_0 → 0 in probability.
As pointed out in Pollard (1984) (p. 10), the requirement that sup_{θ∈Θ} |L̂(θ) − L(θ)| = o_p(1) can be weakened to lim sup_{n→∞} P{sup_{θ∈Θ} [L̂(θ) − L(θ)] ≥ ε} = 0 for all ε > 0. In the remainder of this section, we investigate the asymptotic normality of MSL, which requires that the limiting population likelihood be at least twice differentiable. First we recall a general result (see for example Sherman (1993) for optimization estimators and Chernozhukov and Hong (2003) for MCMC estimators, among others).
THEOREM 2 √m(θ̂ − θ_0) →_d N(0, H^{−1}ΣH^{−1}) under the following conditions:

1. L̂(θ̂) ≥ sup_{θ∈Θ} L̂(θ) − o_p(1/m);

2. θ̂ → θ_0 in probability;

3. θ_0 is an interior point of Θ;

4. L(θ) is twice continuously differentiable in an open neighborhood of θ_0 with positive definite Hessian H(θ);

5. there exists D̂ such that √m D̂ →_d N(0, Σ); and

6. for any δ → 0 and for R̂(θ) = L̂(θ) − L(θ) − L̂(θ_0) + L(θ_0) − D̂'(θ − θ_0),
\[ \sup_{\|\theta-\theta_0\|\le\delta} \frac{m\,\hat R(\theta)}{1 + m\|\theta-\theta_0\|^2} = o_p(1). \tag{2} \]

(If θ̂ is known to be r_m consistent, i.e., θ̂ − θ_0 = o_p(1/r_m) for r_m → ∞, then Condition 6 only has to hold for δ = o_p(1/r_m).)
The following analysis consists of verifying the conditions in the above general theorem. The finite sample likelihood, without simulation, is required to satisfy the stochastic differentiability condition stated in the following high level assumption. It is typically satisfied when the true non-simulated log likelihood function is pointwise differentiable.
ASSUMPTION 5 There exists a mean zero random variable D_0(z_i) with finite variance such that for any δ → 0 we have
\[ \sup_{\|\theta-\theta_0\|\le\delta} \frac{n\,\hat R_n(\theta)}{1 + n\|\theta-\theta_0\|^2} = o_p(1) \tag{3} \]
for
\[ \hat R_n(\theta) \equiv (P_n - P)\big(\log g(\cdot,\theta) - \log g(\cdot,\theta_0)\big) - \hat D_0'(\theta - \theta_0), \qquad\text{where}\qquad \hat D_0 = \frac{1}{n}\sum_{i=1}^{n} D_0(z_i). \]
A primitive condition for this assumption is given in Lemma 3.2.19, p. 302, of Van der Vaart and Wellner (1996). To account for the simulation error we need an intermediate step, which is a modification of Theorem 1 of Sherman (1993).
THEOREM 3 Let {a_m}, {b_m}, and {c_m} be sequences of positive numbers that tend to infinity. Suppose

1. L̂(θ̂) ≥ L̂(θ_0) − O_p(a_m^{−1});

2. θ̂ → θ_0 in probability;

3. in a neighborhood of θ_0 there is a κ > 0 such that L(θ) ≤ L(θ_0) − κ‖θ − θ_0‖²;

4. for every sequence of positive numbers {δ_m} that converges to zero, ‖θ_m − θ_0‖ < δ_m implies
\[ \big|\hat L(\theta_m) - \hat L(\theta_0) - L(\theta_m) + L(\theta_0)\big| \le O_p(\|\theta_m - \theta_0\|/b_m) + o_p(\|\theta_m - \theta_0\|^2) + O_p(1/c_m). \]

Then
\[ \|\hat\theta - \theta_0\| = O_p\Big(\frac{1}{\sqrt{d_m}}\Big), \qquad\text{where } d_m = \min(a_m,\, b_m^2,\, c_m). \]
Proof The proof is a modification of Sherman (1993). Condition 2 implies that there is a sequence of positive numbers {δ_m} that converges to zero slowly enough that P(‖θ̂ − θ_0‖ ≤ δ_m) → 1. When ‖θ̂ − θ_0‖ ≤ δ_m we have from Conditions 1, 3, and 4 that
\[ \kappa\|\hat\theta-\theta_0\|^2 - O_p(1/a_m) \le \hat L(\hat\theta) - \hat L(\theta_0) - L(\hat\theta) + L(\theta_0) \le O_p\big(\|\hat\theta-\theta_0\|/b_m\big) + o_p\big(\|\hat\theta-\theta_0\|^2\big) + O_p(1/c_m), \]
whence
\[ \big[\kappa + o_p(1)\big]\, \|\hat\theta-\theta_0\|^2 \le O_p(1/a_m) + O_p\big(\|\hat\theta-\theta_0\|/b_m\big) + O_p(1/c_m) \le O_p(1/d_m) + O_p\big(\|\hat\theta-\theta_0\|/\sqrt{d_m}\big). \]
Letting W denote an O_p(1/√d_m) random variable, the expression above implies that
\[ \tfrac{1}{2}\kappa\,\|\hat\theta-\theta_0\|^2 - W\|\hat\theta-\theta_0\| \le O_p(1/d_m) \]
on an event that has probability one in the limit. Completing the square gives
\[ \tfrac{1}{2}\kappa\big(\|\hat\theta-\theta_0\| - W/\kappa\big)^2 \le O_p\Big(\frac{1}{d_m}\Big) + \frac{W^2}{2\kappa} = O_p\Big(\frac{1}{d_m}\Big), \]
whence √d_m ‖θ̂ − θ_0‖ ≤ √d_m W + O_p(1) = O_p(1). □
The next assumption requires that the simulated likelihood is not only unbiased, but is also a proper likelihood function.

ASSUMPTION 6 For all simulation lengths R and all parameters θ, both g(z_i; θ) and Q_R q(z_i, ·; θ) are proper (possibly conditional) density functions.
We also need to regulate the amount of irregularity allowed for the simulation function q(z, ω, θ). In particular, the following assumption allows q(z, ω, θ) to be an indicator function.

ASSUMPTION 7 Define f(z, ω, θ) = q(z, ω, θ)/g(z, θ) − q(z, ω, θ_0)/g(z, θ_0). Then (1) Q × P[sup_{‖θ−θ_0‖=o(1)} f(·, ·, θ)²] = o(1), and (2) sup_{‖θ−θ_0‖≤δ, z∈Z} Var_ω f(z, ω, θ) = O(δ).
ASSUMPTION 8 Define ψ(ω, θ) = ∫ [q(z, ω, θ)/g(z, θ)] f(z) dz, where f(z) is the joint density or probability mass function of the data. There exists a random vector D_1(ω_r) with finite variance such that, for D̂_1 = (1/R)∑_{r=1}^{R} D_1(ω_r) − QD_1(ω_r),
\[ \sup_{\|\theta-\theta_0\| = o((\log R)^{-1})} \frac{R\,(Q_R - Q)\big(\psi(\cdot,\theta) - \psi(\cdot,\theta_0)\big) - R\,\hat D_1'(\theta - \theta_0)}{1 + R\|\theta-\theta_0\|^2} = o_p(1). \]
Remark When g(z; θ) represents the joint likelihood of the data, f(z) = g(z; θ_0). When g(z; θ) = g(y|x; θ) represents a conditional likelihood, f(z) = g(z; θ_0)f(x), where f(x) is the marginal density or probability mass function of the conditioning variables, in which case ψ(ω, θ) = ∫∫ [q(z, ω, θ)/g(z, θ)] g(y|x; θ_0) dy f(x) dx, with the understanding that integrals become summations in the case of discrete data. Assumption 8 can be further simplified when the true likelihood g(z, θ) is twice continuously differentiable (with bounded derivatives for simplicity). In this case
\[ D_1(\omega_r) = -\int \frac{q(\omega_r, z, \theta_0)}{g^2(z;\theta_0)}\, \frac{\partial}{\partial\theta} g(z;\theta_0)\, f(z)\, dz. \tag{4} \]
When g(z; θ) is the joint likelihood of the data, D_1(ω_r) = −∫ [q(ω_r, z, θ_0)/g(z; θ_0)] (∂/∂θ) g(z; θ_0) dz. When g(z; θ) is a conditional likelihood g(z; θ) = g(y|x; θ), D_1(ω_r) = −∫ [q(ω_r, z, θ_0)/g(z; θ_0)] (∂/∂θ) g(z; θ_0) f(x) dz.
To see (4), note that
\[ (Q_R - Q)\big(\psi(\cdot,\theta) - \psi(\cdot,\theta_0)\big) = P\left[\frac{1}{g(\cdot,\theta)} - \frac{1}{g(\cdot,\theta_0)}\right]\big(\hat g(\cdot,\theta_0) - g(\cdot,\theta_0)\big) \]
\[ + P\,\frac{1}{g(\cdot,\theta_0)}\big(\hat g(\cdot,\theta) - g(\cdot,\theta) - \hat g(\cdot,\theta_0) + g(\cdot,\theta_0)\big) \]
\[ + P\left(\frac{1}{g(\cdot,\theta)} - \frac{1}{g(\cdot,\theta_0)}\right)\big(\hat g(\cdot,\theta) - g(\cdot,\theta) - \hat g(\cdot,\theta_0) + g(\cdot,\theta_0)\big). \]
The second line is zero because of Assumption 6. The third line can be bounded by
\[ M\,\|\theta-\theta_0\| \sup_{\|\theta-\theta_0\|=o((\log R)^{-1}),\, z\in Z} \big|(Q_R - Q)\big(q(\cdot,z,\theta) - q(\cdot,z,\theta_0)\big)\big| = o_p\Big(\frac{1}{\sqrt R}\Big)\,\|\theta-\theta_0\|, \]
using the same arguments that handle the term B_22(θ, z) in the proof below. Finally, the first line becomes
\[ P\left[\frac{1}{g(\cdot,\theta)} - \frac{1}{g(\cdot,\theta_0)}\right]\big(\hat g(\cdot,\theta_0) - g(\cdot,\theta_0)\big) = (Q_R - Q)\, D_1(\cdot)'(\theta - \theta_0) + \tilde R(\theta), \]
where ‖R̃(θ)‖ ≤ o_p(‖θ − θ_0‖) sup_{z∈Z} |(Q_R − Q) q(·, z, θ_0)| = o_p(‖θ − θ_0‖/√R).
THEOREM 4 Under Assumptions 2, 3, 4, 5, 6, 7, and 8, and Conditions 1, 2, 3 and 4 of Theorem 2, the conclusion of Theorem 2 holds with D̂ = P_n D_0(·) + Q_R D_1(·) and
\[ \Sigma = (1 \wedge \kappa)\, \mathrm{Var}(D_0(z_i)) + (1 \wedge 1/\kappa)\, \mathrm{Var}(D_1(\omega_r)). \]
Proof Consistency is given in Corollary 2. Consider again the decomposition given by Equation (1). Because of the linearity structure of Conditions (5) and (6) of Theorem 2, it suffices to verify them separately for the terms A(θ) and B(θ).

It follows immediately from Assumption 5 that Conditions (5) and (6) of Theorem 2 hold for the first term A(θ), because n ≥ m and (3) is increasing in n:
\[ \sup_{\|\theta-\theta_0\|\le\delta} \frac{m\big(A(\theta) - \hat D_0'(\theta-\theta_0)\big)}{1 + m\|\theta-\theta_0\|^2} \le \sup_{\|\theta-\theta_0\|\le\delta} \frac{\hat R_0(\theta)}{1/n + \|\theta-\theta_0\|^2} = o_P(1), \tag{5} \]
for R̂_0(θ) = A(θ) − D̂_0'(θ − θ_0). Next we verify these conditions for the B(θ) term.
Decompose B further into B(θ) = B_1(θ) + B_2(θ) + B_3(θ), where
\[ B_1(\theta) = P_n\left[\frac{1}{g(\cdot,\theta)}\big(\hat g(\cdot,\theta) - g(\cdot,\theta)\big) - \frac{1}{g(\cdot,\theta_0)}\big(\hat g(\cdot,\theta_0) - g(\cdot,\theta_0)\big)\right], \]
\[ B_2(\theta) = -\frac{1}{2}\, P_n\left[\frac{1}{\bar g(\cdot,\theta)^2}\big(\hat g(\cdot,\theta) - g(\cdot,\theta)\big)^2 - \frac{1}{\bar g(\cdot,\theta_0)^2}\big(\hat g(\cdot,\theta_0) - g(\cdot,\theta_0)\big)^2\right], \]
\[ B_3(\theta) = \frac{1}{3}\, P_n\left[\frac{1}{\bar g(\cdot,\theta)^3}\big(\hat g(\cdot,\theta) - g(\cdot,\theta)\big)^3 - \frac{1}{\bar g(\cdot,\theta_0)^3}\big(\hat g(\cdot,\theta_0) - g(\cdot,\theta_0)\big)^3\right]. \]
In the above, ḡ(z, θ) and ḡ(z, θ_0) are mean values, dependent on z, lying between ĝ(z, θ) and g(z, θ), and between ĝ(z, θ_0) and g(z, θ_0), respectively. By Assumption 4,
\[ \sup_{\theta\in\Theta} |B_3(\theta)| \le \frac{2}{3} M^3 \Big(\sup_{\theta\in\Theta,\, z\in Z} |\hat g(z,\theta) - g(z,\theta)|\Big)^3 \le O_p\Big(\frac{1}{R\sqrt R}\Big), \]
where the last inequality follows from sup_{θ∈Θ, z∈Z} |ĝ(z, θ) − g(z, θ)| = O_p(1/√R) due, e.g., to Theorem 2.14.1 of Van der Vaart and Wellner (1996). By Theorem 2.14.1 it also holds that
\[ \sup_{\theta\in\Theta} |B_1(\theta)| = O_p\Big(\frac{1}{\sqrt R}\Big) \qquad\text{and}\qquad \sup_{\theta\in\Theta} |B_2(\theta)| = O_p\Big(\frac{1}{R}\Big). \]
This allows us to invoke Theorem 3, with d_m = √m, to claim that ‖θ̂ − θ_0‖ = O_p(m^{−1/4}). Next we bound the second term, up to a constant, within ‖θ − θ_0‖ = o_p(1/log R):
\[ \sup_{\|\theta-\theta_0\| \lesssim (\log R)^{-1}} |B_2(\theta)| = o_p\Big(\frac{1}{R}\Big). \tag{6} \]
To show (6), first note that
\[ \sup_{\|\theta-\theta_0\| \lesssim (\log R)^{-1}} |B_2(\theta)| \le \sup_{\|\theta-\theta_0\| \lesssim (\log R)^{-1},\, z\in Z} B_{21}(\theta, z) \times B_{22}(\theta, z), \]
where
\[ B_{21}(\theta, z) = \left| (Q_R - Q)\left(\frac{q(z,\cdot,\theta)}{g(z,\theta)} + \frac{q(z,\cdot,\theta_0)}{g(z,\theta_0)}\right) \right| \quad\text{and}\quad B_{22}(\theta, z) = \left| (Q_R - Q)\left(\frac{q(z,\cdot,\theta)}{g(z,\theta)} - \frac{q(z,\cdot,\theta_0)}{g(z,\theta_0)}\right) \right|. \]
It follows again from Theorem 2.14.1 that
\[ \sup_{\|\theta-\theta_0\| \lesssim (\log R)^{-1},\, z\in Z} |B_{21}(\theta, z)| = O_p\Big(\frac{1}{\sqrt R}\Big). \]
Next we consider B_{22}(θ, z) in light of arguments similar to Theorem 2.37 in Pollard (1984). For δ = o((log R)^{−1}), for
\[ f(z,\omega,\theta) = q(z,\omega,\theta)/g(z,\theta) - q(z,\omega,\theta_0)/g(z,\theta_0) \]
with ‖θ − θ_0‖ ≤ δ, and for ε_R = ε/√R, we have Var(Q_R f(z, ·, θ))/ε_R² → 0 for each ε > 0. Therefore the symmetrization inequalities (30) on p. 31 of Pollard (1984) apply, and subsequently, for F_R = {f(z, ω, θ) : z ∈ Z, ‖θ − θ_0‖ ≤ δ},
\[ P\left(\sup_{f\in F_R} \big|(Q_R - Q) f\big| > 8\frac{\varepsilon}{\sqrt R}\right) \le 4 P\left(\sup_{f\in F_R} |Q_R^0 f| > 2\frac{\varepsilon}{\sqrt R}\right) \le 8A\varepsilon^{-W} R^{W/2} \exp\left(-\frac{1}{128}\,\varepsilon^2\delta^{-1}\right) + P\left(\sup_{f\in F_R} Q_R f^2 > 64\delta\right). \]
The second term goes to zero for the same reason as in Pollard. The first also goes to zero since log R − 1/δ → −∞. Thus we have shown that B_{22}(θ, z) = o_p(1/√R) uniformly in ‖θ − θ_0‖ ≤ δ and z ∈ Z, and consequently (6) holds. By considering n ≪ R, n ≫ R and n ≈ R separately, (6) also implies that for some α > 0:
\[ \sup_{\|\theta-\theta_0\| \lesssim m^{-\alpha}} |B_2(\theta)| = o_p\Big(\frac{1}{m}\Big). \]
It remains to investigate B_1(θ) = (1/n)∑_{i=1}^{n} (1/R)∑_{r=1}^{R} f(z_i, ω_r, θ), which, using Assumption 6, can be written as
\[ B_1(\theta) = \frac{1}{nR}\, S_{nR}\big(\tilde f_\theta\big) + B_0(\theta), \]
where
\[ \tilde f(z,\omega,\theta) = f(z,\omega,\theta) - Qf(z,\cdot,\theta) - Pf(\cdot,\omega,\theta) + PQf(\cdot,\cdot,\theta), \]
B_0(θ) = (Q_R − Q)(ψ(·, θ) − ψ(·, θ_0)), and ψ(ω, θ) = ∫ [q(z, ω, θ)/g(z, θ)] f(z) dz. Upon noting that Q q(z, ·, θ)/g(z, θ) = 1 identically, Qf(z, ·, θ) = 0 and PQf(·, ·, θ) = 0. The proof of Theorem 2.5 (p. 83) of Neumeyer (2004) shows that, since by Assumption 7
\[ Q\times P\left[\sup_{\|\theta-\theta_0\|=o(1)} f(\cdot,\cdot,\theta)^2\right] = o(1), \]
the envelope expectation Q ⊗ P(F)² converges to zero. Hence
\[ \frac{1}{nR}\, S_{nR}\big(\tilde f_\theta\big) = o_p\Big(\frac{1}{\sqrt{nR}}\Big) = o_p\Big(\frac{1}{m}\Big). \]
Finally, B_0 is handled by Assumption 8. Since B(θ) = B_1(θ) − B_0(θ) + B_2(θ) + B_3(θ) + B_0(θ), and each of B_1(θ) − B_0(θ), B_2(θ) and B_3(θ) is o_P(1/R) uniformly in ‖θ − θ_0‖ ≤ δ for any δ → 0, we have, for R̂_1(θ) = B(θ) − D̂_1'(θ − θ_0),
\[ \sup_{\|\theta-\theta_0\|\le\delta} \frac{R\,\hat R_1(\theta)}{1 + R\|\theta-\theta_0\|^2} = \sup_{\|\theta-\theta_0\|\le\delta} \frac{R\,B_0(\theta) - R\,\hat D_1'(\theta-\theta_0)}{1 + R\|\theta-\theta_0\|^2} + o_P(1) = o_P(1). \]
This together with (5) implies that Condition 6 of Theorem 2 is satisfied with D̂ = D̂_0 + D̂_1, since we can bound (2) by
\[ \sup_{\|\theta-\theta_0\|\le\delta} \frac{\hat R_0(\theta) + \hat R_1(\theta)}{1/m + \|\theta-\theta_0\|^2} \le \sup_{\|\theta-\theta_0\|\le\delta} \frac{\hat R_0(\theta)}{1/n + \|\theta-\theta_0\|^2} + \sup_{\|\theta-\theta_0\|\le\delta} \frac{\hat R_1(\theta)}{1/R + \|\theta-\theta_0\|^2}. \]
Finally, to verify Condition 5 in Theorem 2, write
\[ \sqrt{m}\,\hat D = \sqrt{\Big(1\wedge\frac{R}{n}\Big)}\; \frac{1}{\sqrt n}\sum_{i=1}^{n} D_0(z_i) + \sqrt{\Big(\frac{n}{R}\wedge 1\Big)}\; \frac{1}{\sqrt R}\sum_{r=1}^{R} D_1(\omega_r). \]
That √m D̂ →_d N(0, Σ) follows from 1 ∧ R/n → 1 ∧ κ, n/R ∧ 1 → 1/κ ∧ 1, the continuous mapping theorem, Slutsky's lemma, and CLTs applied to √n D̂_0 = (1/√n)∑_{i=1}^{n} D_0(z_i) and √R D̂_1 = (1/√R)∑_{r=1}^{R} D_1(ω_r). □
4.1 MSL Variance Estimation
A consistent estimate of the asymptotic variance can be formed from sample analogs. In general, each of
\[ H = P_n\, \frac{\partial^2}{\partial\theta\,\partial\theta'} \log Q_R\, q(\cdot,\cdot,\hat\theta), \qquad D_0(z_i) = \frac{\partial}{\partial\theta} \log \hat g(z_i,\hat\theta) \qquad\text{and}\qquad D_1(\omega_r) = \frac{\partial}{\partial\theta}\, P_n\, \frac{q(\omega_r,\cdot,\hat\theta)}{\hat g(\cdot,\hat\theta)} \]
cannot be computed analytically, and has to be replaced by numerical estimates:
\[ \hat H_{ij} = \frac{1}{4\varepsilon^2}\Big( P_n \log Q_R q(\cdot,\cdot,\hat\theta + e_i\varepsilon + e_j\varepsilon) - P_n \log Q_R q(\cdot,\cdot,\hat\theta - e_i\varepsilon + e_j\varepsilon) \]
\[ \qquad\qquad - P_n \log Q_R q(\cdot,\cdot,\hat\theta + e_i\varepsilon - e_j\varepsilon) + P_n \log Q_R q(\cdot,\cdot,\hat\theta - e_i\varepsilon - e_j\varepsilon) \Big), \]
\[ \hat D_{0j}(z_i) = \frac{1}{2h}\Big( \log\hat g(z_i,\hat\theta + e_j h) - \log\hat g(z_i,\hat\theta - e_j h) \Big), \]
\[ \hat D_{1j}(\omega_r) = \frac{1}{2h}\left( P_n\, \frac{q(\omega_r,\cdot,\hat\theta + e_j h)}{\hat g(\cdot,\hat\theta + e_j h)} - P_n\, \frac{q(\omega_r,\cdot,\hat\theta - e_j h)}{\hat g(\cdot,\hat\theta - e_j h)} \right). \]
Let, for κ = R/n,
\[ \hat\Sigma_h = P_n\, \hat D_0(\cdot)\hat D_0(\cdot)', \qquad \hat\Sigma_g = Q_R\, \hat D_1(\cdot)\hat D_1(\cdot)', \qquad \hat\Sigma = (1\wedge\kappa)\,\hat\Sigma_h + (1\wedge 1/\kappa)\,\hat\Sigma_g. \]
Under the given assumptions, if ε → 0, h → 0, √n h → ∞ and n^{1/4} ε → ∞, then Ĥ = H + o_p(1), Σ̂_h = Σ_h + o_p(1), and Σ̂_g = Σ_g + o_p(1). Hence Σ̂ = Σ + o_p(1) by the continuous mapping theorem.
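A hedged Python sketch of these finite difference estimates (the helper names are ours: loglik_hat(theta) returns the n-vector log ĝ(z_i, θ), and ratio(w, theta) returns the scalar P_n q(w, ·, θ)/ĝ(·, θ)):

```python
import numpy as np

def msl_variance(loglik_hat, ratio, theta_hat, omega, n, eps, h):
    """Finite-difference estimates of H, the scores D0 and D1, and Sigma for MSL."""
    k = theta_hat.size
    E = np.eye(k)
    H = np.empty((k, k))
    for i in range(k):          # second-order central differences for the Hessian
        for j in range(k):
            H[i, j] = (loglik_hat(theta_hat + eps * (E[i] + E[j])).mean()
                       - loglik_hat(theta_hat + eps * (E[j] - E[i])).mean()
                       - loglik_hat(theta_hat + eps * (E[i] - E[j])).mean()
                       + loglik_hat(theta_hat - eps * (E[i] + E[j])).mean()) / (4 * eps ** 2)
    D0 = np.column_stack([(loglik_hat(theta_hat + h * e) - loglik_hat(theta_hat - h * e))
                          / (2 * h) for e in E])                     # (n, k) scores
    D1 = np.column_stack([[(ratio(w, theta_hat + h * e) - ratio(w, theta_hat - h * e))
                           / (2 * h) for w in omega] for e in E])    # (R, k) scores
    Sigma_h, Sigma_g = D0.T @ D0 / n, D1.T @ D1 / len(omega)
    kappa = len(omega) / n
    return H, min(1, kappa) * Sigma_h + min(1, 1 / kappa) * Sigma_g
```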
5 MCMC
Simulated objective functions that are nonsmooth can be difficult to optimize by numerical methods. An alternative to optimizing the objective function is to run it through an MCMC routine, as in Chernozhukov and Hong (2003). Under the assumptions given in the previous sections, the MCMC Laplace estimators can also be shown to be consistent and asymptotically normal. The Laplace estimator is defined as
\[ \bar\theta = \operatorname*{argmin}_{\theta\in\Theta} \int \rho\big(\sqrt{m}\,(u - \theta)\big)\, \exp\big(m\hat L(u)\big)\, \pi(u)\, du. \]
In the above, ρ(·) is a convex symmetric loss function such that ρ(h) ≤ 1 + |h|^p for some p ≥ 1, and π(·) is a continuous density function with compact support that is positive at θ_0. The objective function can be either GMM,
\[ \hat L(\theta) = -\frac{1}{2}\, \big(P_n Q_R\, q(\cdot,\cdot,\theta)\big)'\, W_n\, \big(P_n Q_R\, q(\cdot,\cdot,\theta)\big), \]
or the log likelihood function L̂(θ) = (1/n)∑_{i=1}^{n} log ĝ(z_i, θ).
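For concreteness, a minimal random-walk Metropolis sketch of this estimator (implementation choices are ours: Gaussian proposals, quadratic loss ρ so that θ̄ is the posterior mean, and the prior π folded into a log_prior function):

```python
import numpy as np

def laplace_estimator(Lhat, log_prior, theta0, m, n_mcmc=20000, scale=0.1, seed=0):
    """Posterior mean/variance of the quasi-posterior exp(m * Lhat(theta)) * pi(theta),
    sampled by random-walk Metropolis in the spirit of Chernozhukov and Hong (2003)."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0.copy(), m * Lhat(theta0) + log_prior(theta0)
    draws = np.empty((n_mcmc, theta0.size))
    for t in range(n_mcmc):
        prop = theta + scale * rng.standard_normal(theta0.size)
        lp_prop = m * Lhat(prop) + log_prior(prop)
        if np.log(rng.uniform()) < lp_prop - lp:        # Metropolis accept/reject
            theta, lp = prop, lp_prop
        draws[t] = theta
    draws = draws[n_mcmc // 2:]                         # discard burn-in
    return draws.mean(axis=0), np.cov(draws.T)          # theta_bar, estimate of J0^{-1}/m
```

No derivative of the nonsmooth simulated objective is ever needed, which is the point of the MCMC alternative.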
The asymptotic distribution of the posterior and of θ̄ follows immediately from Assumption 2, which leads to Theorem 1, and Chernozhukov and Hong (2003). Define h = √m(θ − θ̂), and consider the posterior distribution on the localized parameter space:
\[ p_n(h) = \frac{\pi\big(\hat\theta + \tfrac{h}{\sqrt m}\big)\, \exp\big(m\hat L(\hat\theta + h/\sqrt m) - m\hat L(\hat\theta)\big)}{C_m}, \]
where
\[ C_m = \int_{\hat\theta + h/\sqrt m \in \Theta} \pi\Big(\hat\theta + \frac{h}{\sqrt m}\Big)\, \exp\big(m\hat L(\hat\theta + h/\sqrt m) - m\hat L(\hat\theta)\big)\, dh. \]
Desirable properties of the MCMC method include the following, for any α > 0:
\[ \int |h|^\alpha\, |p_n(h) - p_\infty(h)|\, dh \xrightarrow{p} 0, \qquad\text{where}\qquad p_\infty(h) = \sqrt{\frac{|\det(J_0)|}{(2\pi)^{\dim\theta}}}\, \exp\Big(-\frac{1}{2}\, h' J_0 h\Big). \tag{7} \]
In the above, J_0 = G'WG for the GMM model and J_0 = −(∂²/∂θ∂θ') L(θ_0) for the likelihood model.
THEOREM 5 Under Assumption 2 for the GMM model, or Assumptions 2 and 8 and Conditions 1, 2, 4 of Theorem 2 for the MLE model, (7) holds. Consequently, √m(θ̄ − θ̂) → 0 in probability, and the variance of p_{n,R}(h) converges to J_0^{−1} in probability.
Proof For the GMM model, the stated results follow immediately from Assumption 2, which leads to Theorem 1, and Chernozhukov and Hong (2003) (CH). The MLE case is also almost identical to CH but requires a small modification. When Condition (6) in Theorem 2 holds for δ = o(1), the original proof shows (7) over the areas of integration {|h| ≤ √m δ} and {|h| ≥ δ√m} separately. When Condition 6 in Theorem 2 only holds for δ = a_m = (log m)^{−d}, we need to consider separately, for a fixed δ, {|h| ≤ √m a_m}, {√m a_m ≤ |h| ≤ √m δ} and {|h| ≥ δ√m}. The arguments for the first and third regions, {|h| ≤ √m a_m} and {|h| ≥ δ√m}, are identical to the ones in CH. Hence we only need to show that (since the prior density is assumed bounded around θ_0)
\[ \int_{\sqrt m\, a_m \le |h| \le \sqrt m\, \delta} \pi\Big(\hat\theta + \frac{h}{\sqrt m}\Big)\, \exp\big(m\hat L(\hat\theta + h/\sqrt m) - m\hat L(\hat\theta)\big)\, dh \xrightarrow{p} 0. \]
By the arguments that handle the term B in the proof of Theorem 4, in this region,
\[ \omega(h) \equiv m\hat L(\hat\theta + h/\sqrt m) - m\hat L(\hat\theta) = -\frac{1}{2}\,(1 + o_p(1))\, h' J_0 h + m\, O_p\Big(\frac{1}{\sqrt m}\Big). \]
Hence the left hand side integral can be bounded, up to a finite constant, by
\[ \int_{\sqrt m\, a_m \le |h| \le \sqrt m\, \delta} \exp(\omega(h))\, dh = \exp\big(O_p(\sqrt m)\big) \int_{\sqrt m\, a_m \le |h|} \exp\Big(-\frac{1}{2}\,(1 + o_p(1))\, h' J_0 h\Big)\, dh. \]
The tail of the normal distribution can be estimated, with probability tending to one, by
\[ \int_{\sqrt m\, a_m \le |h|} \exp\Big(-\frac{1}{2}\,(1 + o_p(1))\, h' J_0 h\Big)\, dh \le \int_{\sqrt m\, a_m \le |h|} \exp\Big(-\frac{1}{4}\, h' J_0 h\Big)\, dh \le C\,(\sqrt m\, a_m)^{-1} \exp(-m a_m^2), \]
for a_m ≫ m^{−α} for any α > 0; hence for some α > 0,
\[ \int_{\sqrt m\, a_m \le |h| \le \sqrt m\, \delta} \exp(\omega(h))\, dh \le C \exp\big(O_p(\sqrt m)\big)\, \big(m^{\frac{1}{2}-\alpha}\big)^{-1} \exp\big(-m^{1-2\alpha}\big) = o_p(1). \]
The rest of the proof is identical to CH. □
The MCMC method can always be used to obtain consistent and asymptotically normal parameter estimates. For the GMM model with W being the inverse of the asymptotic variance of √m ĝ(θ_0), or for the likelihood model where n ≫ R, the posterior distribution from the MCMC can also be used to obtain valid asymptotic confidence intervals for θ_0.

For the GMM model where W is not the inverse of the asymptotic variance of √m ĝ(θ_0), or for the likelihood model where R ≫ n or R ∼ n, the posterior distribution does not resemble the asymptotic distribution of θ̄ or θ̂. However, in this case the variance of the posterior distribution can still be used to estimate the inverse of the Hessian term, (G'WG)^{−1} or H(θ_0)^{−1}, in Condition (4) of Theorem 2.
6 Monte Carlo Simulations
In this section we report the results from a set of Monte Carlo simulations of a univariate probit model to illustrate the finite sample properties of the asymptotic distributions derived in this paper. The true data generating process is specified as
\[ y_i = 1\{\alpha_0 + x_i\beta_0 + \epsilon_i \ge 0\}, \qquad \epsilon_i \perp\!\!\!\perp x_i, \qquad \epsilon_i \overset{iid}{\sim} N(0,1). \]
Define z_i = (y_i, x_i) and x_i = (1, x_i), θ := (α, β)'. In the earlier notation the likelihood function is g(z_i, θ) = Φ(x_i'θ)^{y_i}(1 − Φ(x_i'θ))^{1−y_i}, where Φ(u) = ∫_{−∞}^{u} (1/√(2π)) e^{−v²/2} dv, and the true parameter θ_0 maximizes
\[ L(\theta) = E_{z_i} \log g(z_i,\theta) = E_{z_i}\big[ y_i \log\Phi(x_i'\theta) + (1-y_i)\log(1-\Phi(x_i'\theta)) \big]. \]
The likelihood function is simulated by ĝ(z_i, θ) = (1/R)∑_{r=1}^{R} q(z_i, ω_r, θ), where
\[ q(z_i, \omega_r, \theta) = 1\{x_i'\theta + \omega_r \ge 0\}^{y_i}\, \big(1 - 1\{x_i'\theta + \omega_r \ge 0\}\big)^{1-y_i}, \]
and ω_r ~ iid N(0, 1). Note that Qĝ(z_i, θ) = Qq(z_i, ·, θ) = g(z_i, θ). These functions fall into the class of multinomial discrete choice models studied in Pakes and Pollard (1989) and hence satisfy Assumptions 2 and 3. Assumption 6 is immediate. Assumptions 4 and 7 are satisfied because of the indicator function in q(z_i, ω_r, θ) when θ and z have bounded support. Furthermore, since g(z_i, θ) is differentiable, Assumptions 5 and 8 can be directly verified, with
\[ D_0(z_i) = \frac{y_i - \Phi(x_i'\theta_0)}{\Phi(x_i'\theta_0)\,(1-\Phi(x_i'\theta_0))}\, \phi(x_i'\theta_0)\, x_i \]
and
\[ D_1(\omega_r) = \int \frac{1\{x_i'\theta_0 + \omega_r \ge 0\} - \Phi(x_i'\theta_0)}{\Phi(x_i'\theta_0)\,(1-\Phi(x_i'\theta_0))}\, \phi(x_i'\theta_0)\, x_i\, f(x_i)\, dx_i. \]
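A hedged Python sketch of this design (our own code mirroring the formulas above; the paper uses Matlab's simulated annealing, for which we substitute a derivative-free Nelder-Mead search for brevity, and the log(0) floor is our own numerical safeguard):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def simulate_data(n, alpha0=0.5, beta0=1.0):
    x = rng.standard_normal(n)
    y = (alpha0 + beta0 * x + rng.standard_normal(n) >= 0).astype(float)
    return y, np.column_stack([np.ones(n), x])     # design matrix with intercept

def ghat(theta, y, X, omega):
    """Crude frequency simulator (1/R) sum_r q(z_i, w_r, theta)."""
    ind = ((X @ theta)[:, None] + omega[None, :] >= 0).astype(float)   # (n, R)
    p1 = ind.mean(axis=1)                           # simulated P(y = 1 | x)
    return np.where(y == 1.0, p1, 1.0 - p1)

def neg_msl(theta, y, X, omega, floor=1e-12):
    return -np.mean(np.log(np.maximum(ghat(theta, y, X, omega), floor)))

n, R = 400, 400
y, X = simulate_data(n)
omega = rng.standard_normal(R)                      # one set of overlapping draws
res = minimize(neg_msl, x0=np.zeros(2), args=(y, X, omega), method="Nelder-Mead")
theta_smle = res.x
```

Because the simulated objective is a step function of θ, a derivative-free optimizer (or the MCMC approach of Section 5) is essential here.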
The simulated maximum likelihood estimator maximizes
\[ \hat L_{nR}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log \hat g(z_i,\theta) \]
and is computed using the simulated annealing routine in Matlab's global optimization toolbox. The starting value for the optimization is taken to be the OLS estimates. The numerical results are not sensitive to the choice of starting values when the temperature parameter in the simulated annealing routine is reduced sufficiently slowly. Obviously, this simple example can be estimated by the probit command in Stata. The goal of this section is to illustrate the finite sample properties of the simulated maximum likelihood estimator when we are agnostic about the normal distribution function and density function.
We compute an estimate of the asymptotic variance using the empirical analog of Theorem 2:
\[ \sqrt{m}\,\big(\hat\theta_{SMLE} - \theta\big) \overset{A}{\sim} N\Big(0,\; H^{-1}\big((1\wedge\kappa)\,\Sigma_0 + (1\wedge 1/\kappa)\,\Sigma_1\big)\, H^{-1}\Big), \]
where κ = R/n and m = min(R, n).
While analytical derivatives can easily be computed in this example, in practice the analytical derivatives of the likelihood function are usually unknown. In our baseline results, we are agnostic about the analytical derivatives and estimate the asymptotic variance using numerical differentiation:
\[ \hat\Sigma_0 = \frac{1}{n}\sum_{i=1}^{n} \hat D_0(z_i)\,\hat D_0(z_i)', \qquad \hat H = -\hat\Sigma_0, \]
\[ \hat\Sigma_1 = \frac{1}{R}\sum_{r=1}^{R} \hat D_1(\omega_r)\,\hat D_1(\omega_r)', \qquad \hat D_1(\omega_r) := -\frac{1}{n}\sum_{i=1}^{n} \frac{q(z_i,\omega_r,\hat\theta_{SMLE})}{\hat g(z_i,\hat\theta_{SMLE})}\, \hat D_0(z_i). \]
In the above, for an estimate of the derivative of ĝ(z_i, θ̂_{SMLE}) with respect to θ, denoted ∇ĝ(z_i, θ̂_{SMLE}), we define
\[ \hat D_0(z_i) := \frac{1}{\hat g(z_i,\hat\theta_{SMLE})}\, \nabla\hat g(z_i,\hat\theta_{SMLE}). \]
We use the index structure of g(z_i, θ) and numerical differentiation to obtain ∇ĝ(z_i, θ̂_{SMLE}). For this purpose we note that
\[ \frac{\partial g(z_i,\theta)}{\partial\theta} = \Big(-\frac{\partial}{\partial\theta}\Phi(-x_i'\theta)\Big)^{y_i}\Big(\frac{\partial}{\partial\theta}\Phi(-x_i'\theta)\Big)^{1-y_i} = (-1)^{1-y_i}\, x_i\, \phi(x_i'\theta). \]
Therefore we use
\[ \nabla\hat g(z_i,\hat\theta_{SMLE}) = (-1)^{1-y_i}\, x_i\, \hat\phi(x_i'\hat\theta_{SMLE}), \]
where, to be agnostic about knowledge of φ(·), we use a first order two-sided formula to define
\[ \hat\phi(w) \equiv \frac{1}{R}\sum_{r=1}^{R} \frac{1\{\omega_r \le w+\varepsilon\} - 1\{\omega_r \le w-\varepsilon\}}{2\varepsilon} = \frac{1}{2\varepsilon}\, \frac{\#\{w-\varepsilon \le \omega_r \le w+\varepsilon\}}{R}, \]
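In code, this kernel-free density estimate is one line (a sketch under the formula above, with the step size following the rule ε = R^{−α} used in the tables):

```python
import numpy as np

def phi_hat(w, omega, eps):
    """Two-sided frequency estimate of the density at w from draws omega:
    #{w - eps <= w_r <= w + eps} / (2 * eps * R)."""
    return np.mean(np.abs(omega - w) <= eps) / (2 * eps)

# Example: R = 10_000 standard normal draws, eps = R**(-1/15) as reported below.
rng = np.random.default_rng(0)
omega = rng.standard_normal(10_000)
print(phi_hat(0.0, omega, eps=10_000 ** (-1 / 15)))  # compare with 1/sqrt(2*pi) ~ 0.399
```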
where ε is a step size parameter. In the simulations we experiment with a range of step sizes ε = R^{−α}, where α ranges over [2, 3/2, 1, 3/4, 1/2, 1/3, 1/4, 1/8, 1/10, 1/15]. It turns out that the larger step sizes produce more accurate coverage for a larger range of n and R. Therefore we use α = 1/15 in the results reported in the following tables.
For comparison, we also provide the empirical coverage when analytical derivatives are used to compute H and Σ_0. Here
\[ \hat H = -\frac{1}{n}\sum_{i=1}^{n} \frac{\phi(x_i'\hat\theta)^2}{\Phi(x_i'\hat\theta)\,(1-\Phi(x_i'\hat\theta))}\, x_i x_i', \qquad \hat\Sigma_0 = -\hat H, \]
and
\[ \hat\Sigma_1 = \frac{1}{R}\sum_{r=1}^{R} \hat D_1(\omega_r)\,\hat D_1(\omega_r)', \qquad \hat D_1(\omega_r) = -\frac{1}{n}\sum_{i=1}^{n} \frac{q(z_i,\omega_r,\hat\theta)}{\hat g(z_i,\hat\theta)}\, \frac{\big(y_i - \Phi(x_i'\hat\theta)\big)\,\phi(x_i'\hat\theta)\, x_i}{\Phi(x_i'\hat\theta)\,(1-\Phi(x_i'\hat\theta))}. \]
Table 1 reports the empirical coverage of the 95% confidence interval constructed from the estimate of the asymptotic distribution using numerical derivatives, over 5000 Monte Carlo repetitions. The rows correspond to the sample size n and the columns to the ratio κ between R and n. The two rows for each sample size correspond to the intercept and the slope coefficient, respectively. The results show that the asymptotic distribution accurately represents the finite sample distribution when m = min(R, n) is not too small.

Table 1: Empirical coverage frequency for 95% confidence interval, numerical derivatives

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.9334   0.9428   0.9436   0.9458   0.9476   0.9508   0.9488   0.9526   0.9524   0.9546
        0.9946   0.9844   0.979    0.9736   0.9712   0.9672   0.9638   0.9656   0.9676   0.9682
100     0.9254   0.9266   0.9342   0.9326   0.9356   0.9342   0.9378   0.939    0.939    0.9408
        0.983    0.9628   0.9564   0.956    0.9536   0.9538   0.9524   0.9536   0.9514   0.9558
200     0.9342   0.9388   0.9378   0.9382   0.938    0.937    0.9408   0.9366   0.9424   0.943
        0.9672   0.9528   0.9462   0.9456   0.9442   0.9472   0.9466   0.9462   0.9474   0.9488
400     0.9448   0.9412   0.9388   0.9398   0.9358   0.9388   0.936    0.9362   0.9346   0.9378
        0.9512   0.9442   0.9516   0.9472   0.9498   0.9482   0.9514   0.9528   0.9512   0.9536
800     0.9438   0.9442   0.9384   0.9386   0.9458   0.9412   0.9454   0.9408   0.9422   0.9434
        0.9468   0.9462   0.9394   0.9442   0.948    0.9518   0.9512   0.9492   0.9538   0.9508

The total number of Monte Carlo repetitions is 5000.

Table 2 reports the false empirical coverage of the 95% confidence interval when the simulation noise is ignored in the asymptotic distribution of the estimator. As expected, when R/n is large, in particular above 10, the improvement from accounting for Σ_1 in the asymptotic distribution is very small. When R/n is very small, the size distortion from ignoring Σ_1 is very sizable. The size distortion is quite visible when R/n is as big as 2, and still visible even when R/n = 5.

Table 2: False empirical coverage frequency for 95% confidence interval, numerical derivatives

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.662    0.7822   0.8344   0.8504   0.8954   0.9298   0.939    0.9474   0.9508   0.9538
        0.9614   0.9578   0.9566   0.9562   0.9626   0.964    0.9622   0.9652   0.9672   0.968
100     0.594    0.742    0.799    0.8236   0.878    0.9136   0.9282   0.9352   0.9362   0.9402
        0.9246   0.9348   0.9414   0.9406   0.9456   0.951    0.9518   0.9534   0.951    0.9554
200     0.5892   0.737    0.8016   0.8232   0.8784   0.9154   0.9292   0.931    0.9384   0.9426
        0.9172   0.931    0.9338   0.9334   0.9378   0.9452   0.9454   0.9454   0.9474   0.9486
400     0.5744   0.727    0.794    0.8218   0.8734   0.9156   0.9224   0.9292   0.932    0.9362
        0.9134   0.9308   0.9412   0.9394   0.9472   0.947    0.9508   0.9524   0.9512   0.9536
800     0.575    0.7302   0.8022   0.8182   0.8828   0.9192   0.9314   0.9362   0.9394   0.942
        0.9242   0.9362   0.9328   0.9376   0.9442   0.9498   0.9504   0.949    0.9538   0.9508

The total number of Monte Carlo repetitions is 5000.

Tables 3 and 4 report the counterparts of Tables 1 and 2 when analytical derivatives are used instead to compute the asymptotic variances. In Table 3, we see that using analytical derivatives does not necessarily give more accurate coverage than using numerical derivatives. The results in Table 4 are similar to those in Table 2: ignoring the variance due to simulation when R is smaller than n can lead to significant errors in the confidence interval.

Table 3: Empirical coverage frequency for 95% confidence interval, analytical derivatives

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.9956   0.953    0.9542   0.9574   0.9646   0.9662   0.9604   0.9542   0.9518   0.9526
        0.9944   0.9824   0.976    0.9724   0.9694   0.9582   0.9586   0.9558   0.9574   0.9582
100     0.94     0.943    0.9522   0.9572   0.9576   0.9436   0.9402   0.9422   0.9402   0.9408
        0.9856   0.9722   0.9628   0.962    0.9534   0.952    0.9506   0.952    0.9528   0.954
200     0.9326   0.9472   0.9572   0.951    0.9464   0.9398   0.9428   0.9378   0.9418   0.9422
        0.9742   0.9568   0.9446   0.9456   0.9392   0.9434   0.9412   0.9418   0.9432   0.9436
400     0.9356   0.9512   0.954    0.9494   0.9346   0.941    0.9368   0.9374   0.9352   0.9376
        0.9638   0.9488   0.9514   0.9498   0.9482   0.9504   0.9492   0.9514   0.9526   0.9526
800     0.9416   0.948    0.943    0.943    0.9458   0.9416   0.9452   0.9422   0.9418   0.9434
        0.9542   0.9462   0.9402   0.943    0.9468   0.9496   0.9502   0.95     0.95     0.9504

The total number of Monte Carlo repetitions is 5000.
Empirically, choosing an optimal step size for numerical gradient calculation can be difficult and depends on knowledge of the underlying function to be simulated. Without this knowledge, we recommend a rule of thumb of the form Cn^{−α} with α < 1/2 for the step size choice, where we have chosen α = 1/15 in this simulation example. The optimal choice of C and α is a challenging open theoretical question in this context, which is beyond what we are able to obtain in this paper. We can also adopt Silverman's rule of thumb or use cross-validation methods for data driven automated bandwidth selection. However, these methods are designed for nonparametric curve fitting; they are not known to be optimal in semiparametric problems such as the variance estimation that we consider. We conducted a small experiment in which we looked for the step size that minimizes the difference between the numerical and the analytical variance in each Monte Carlo trial across the above values of α, and found substantial variation in the best step size in this sense, which is tabulated in Table 7. However, minimizing this difference does not seem to translate into coverage accuracy. Comparing Table 1 (for α = 1/15) with Table 8 (for α = 1/2) shows that the larger step size produces more accurate confidence intervals than the smaller step size in general. We leave theoretical guidance on the properties of the step sizes for future research.

Table 4: False empirical coverage frequency for 95% confidence interval, analytical derivatives

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.6676   0.7916   0.842    0.8606   0.9002   0.9314   0.9424   0.9474   0.9504   0.9508
        0.9186   0.9316   0.9394   0.9364   0.9472   0.9506   0.9546   0.9534   0.9566   0.958
100     0.6056   0.754    0.8064   0.8274   0.88     0.9166   0.9298   0.937    0.938    0.9392
        0.91     0.922    0.9322   0.932    0.9418   0.9488   0.9488   0.9508   0.9528   0.954
200     0.5948   0.7442   0.8052   0.8258   0.8804   0.9156   0.9288   0.9318   0.9396   0.9412
        0.9114   0.9224   0.9244   0.9272   0.9322   0.941    0.9404   0.9414   0.9428   0.9436
400     0.5736   0.7236   0.795    0.8234   0.8758   0.916    0.9232   0.93     0.9324   0.9364
        0.9132   0.9294   0.94     0.9368   0.9446   0.9498   0.9482   0.9512   0.9526   0.9524
800     0.5724   0.7308   0.8028   0.8226   0.8826   0.919    0.9316   0.9364   0.9394   0.942
        0.9198   0.9338   0.9338   0.936    0.9428   0.9486   0.9494   0.949    0.9498   0.9504

The total number of Monte Carlo repetitions is 5000.
Overlapping draws are applicable in situations in which independent draws are not computationally practicable, or with nonsmooth moment conditions for which the theoretical validity of independent draws is more difficult to establish and beyond the scope of the current paper. In spite of the lack of a theoretical proof, in Tables 5 and 6 we report the counterparts with independent draws. When R is relatively small compared to n, the point estimate is sufficiently biased that the confidence interval does not center on the true parameter value. A larger R reduces the bias and leads to more accurate empirical coverage frequencies. In particular, the independent draws method does not perform well relative to overlapping draws when both n and R/n are small. In unreported simulations we also experimented with the EM algorithm. While the EM algorithm converges quickly, we were not able to obtain empirical coverage frequencies that are close to the nominal level.

We also report the mean bias and root mean square error for both overlapping and independent draws in Tables 9 to 12. Especially for small n and R/n, overlapping draws tend to have smaller bias. While the intercept term tends to have a larger RMSE with overlapping draws, the larger variance can be accounted for in constructing confidence intervals.

Table 5: False empirical coverage frequency for 95% confidence interval, analytical, independent draws

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.7816   0.9182   0.9208   0.922    0.9332   0.942    0.9506   0.9506   0.9502   0.948
        0.912    0.9266   0.9384   0.94     0.9434   0.9506   0.9514   0.9558   0.957    0.9598
100     0.8838   0.908    0.8992   0.9068   0.9208   0.9366   0.9354   0.9358   0.9392   0.9426
        0.9102   0.9184   0.9288   0.9254   0.9378   0.945    0.951    0.9518   0.952    0.9542
200     0.8622   0.8966   0.9108   0.914    0.9198   0.9352   0.937    0.9414   0.9426   0.941
        0.8912   0.9152   0.918    0.9266   0.9356   0.9378   0.9422   0.9448   0.9454   0.9458
400     0.8764   0.9056   0.9188   0.9214   0.929    0.934    0.9338   0.9372   0.9386   0.9378
        0.9068   0.9256   0.9366   0.936    0.9426   0.9486   0.9486   0.9528   0.952    0.9528
800     0.9056   0.9228   0.9288   0.932    0.9352   0.9406   0.9422   0.9454   0.9448   0.9442
        0.9116   0.9332   0.9378   0.94     0.9428   0.9488   0.9476   0.9498   0.9496   0.9488

The total number of Monte Carlo repetitions is 5000.

Table 6: False empirical coverage frequency for 95% confidence interval, numerical, independent draws

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.2892   0.7306   0.8178   0.8348   0.8944   0.9202   0.935    0.94     0.9462   0.9488
        0.417    0.749    0.853    0.8786   0.9252   0.947    0.9524   0.9614   0.9602   0.9652
100     0.5322   0.8216   0.855    0.8704   0.8986   0.9196   0.9276   0.9324   0.9358   0.9382
        0.5858   0.8462   0.8936   0.898    0.9272   0.9342   0.9434   0.9494   0.9516   0.9534
200     0.75     0.8646   0.8908   0.8978   0.9128   0.9316   0.934    0.9386   0.9404   0.9384
        0.7778   0.893    0.9052   0.9156   0.9302   0.9334   0.9438   0.9456   0.9482   0.9482
400     0.832    0.8932   0.9076   0.917    0.926    0.9324   0.9324   0.9376   0.9376   0.9378
        0.8722   0.914    0.9292   0.9316   0.9402   0.9474   0.9476   0.9522   0.9522   0.9532
800     0.8908   0.9196   0.924    0.9302   0.9334   0.94     0.9422   0.944    0.9458   0.9442
        0.901    0.9318   0.936    0.9368   0.9426   0.9498   0.9472   0.95     0.9488   0.9504

The total number of Monte Carlo repetitions is 5000.

Table 7: Step size parameter (α) minimizing difference between numerical and asymptotic variance in each trial of (n, κ)

n \ κ   0.2     0.5     0.8     1       2       5       10      20      50      100
50      0.75    0.75    0.75    0.5     0.5     0.5     0.5     0.5     0.5     0.5
100     0.75    0.75    0.75    0.5     0.5     0.5     0.5     0.5     0.5     0.5
200     0.75    0.75    0.75    0.5     0.5     0.5     0.5     0.067   0.067   0.067
400     0.75    0.75    0.75    0.5     0.5     0.067   0.067   0.067   0.067   0.067
800     0.75    0.75    0.75    0.125   0.067   0.067   0.067   0.067   0.067   0.067

The total number of Monte Carlo repetitions is 5000.

Table 8: Coverage with α = 1/2, overlapping, correct variance

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.854    0.871    0.881    0.891    0.910    0.932    0.937    0.947    0.951    0.954
        0.963    0.949    0.948    0.945    0.953    0.956    0.958    0.964    0.967    0.968
100     0.846    0.873    0.890    0.891    0.907    0.927    0.930    0.938    0.938    0.941
        0.948    0.939    0.935    0.940    0.940    0.945    0.948    0.953    0.951    0.954
200     0.863    0.890    0.906    0.906    0.915    0.928    0.936    0.932    0.939    0.941
        0.944    0.935    0.929    0.931    0.932    0.939    0.940    0.944    0.945    0.949
400     0.887    0.900    0.912    0.920    0.922    0.934    0.930    0.935    0.935    0.935
        0.942    0.936    0.941    0.939    0.944    0.943    0.948    0.951    0.949    0.953
800     0.897    0.914    0.922    0.915    0.936    0.935    0.941    0.939    0.942    0.942
        0.945    0.940    0.934    0.938    0.942    0.948    0.947    0.947    0.952    0.950

The total number of Monte Carlo repetitions is 5000.
Table 9: Mean Bias, overlapping draws

n \ κ   0.2       0.5       0.8       1         2         5         10        20        50        100
50      0.0636    0.0777    0.0782    0.077     0.0726    0.0639    0.0637    0.0592    0.0579    0.0575
        0.0027    0.0015    0.00066   -0.00051  0.0010    0.00036   -5.4E-05  0.00048   -2.2E-05  0.00022
100     0.0323    0.033     0.0331    0.0312    0.0299    0.0279    0.0269    0.027     0.0256    0.0257
        0.0009    0.0005    0.0004    0.0004    0.0006    0.0011    0.0005    0.0006    0.0007    0.0005
200     0.0129    0.0139    0.015     0.0147    0.0145    0.0142    0.014     0.0127    0.0131    0.0131
        -0.0009   -0.0011   -0.0008   -0.00065  -0.00078  -0.00067  -0.0009   -0.0009   -0.00104  -0.0011
400     0.0056    0.0047    0.0062    0.0051    0.0062    0.0062    0.0051    0.0052    0.0052    0.0052
        -0.00035  -0.0003   -0.0004   -0.0004   -0.00045  -0.00053  -0.00053  -0.00066  -0.00059  -0.00057
800     0.0023    0.0025    0.0023    0.0029    0.0029    0.0026    0.0024    0.0028    0.0025    0.0026
        -0.00023  -0.00033  -0.00023  -0.00038  -0.00034  -0.00031  -0.00039  -0.00040  -0.00049  -0.00045

The total number of Monte Carlo repetitions is 5000.

Table 10: Mean Bias, independent draws

n \ κ   0.2       0.5       0.8       1         2         5         10        20        50        100
50      -0.16800  0.02191   0.05585   0.05852   0.05531   0.06072   0.06648   0.06469   0.06100   0.05951
        0.00202   0.00049   0.00000   -0.00143  0.00001   0.00059   -0.00016  -0.00009  0.00046   -0.00002
100     -0.06624  0.00892   0.01600   0.02348   0.03354   0.03154   0.02970   0.02825   0.02566   0.02617
        -0.00037  0.00120   0.00018   0.00101   0.00005   0.00109   0.00091   0.00058   0.00061   0.00067
200     -0.02114  0.01151   0.01541   0.01802   0.01637   0.01534   0.01412   0.01387   0.01352   0.01381
        -0.00156  -0.00128  -0.00135  -0.00106  -0.00110  -0.00104  -0.00089  -0.00111  -0.00096  -0.00084
400     -0.00027  0.00542   0.00358   0.00488   0.00515   0.00538   0.00487   0.00527   0.00567   0.00506
        -0.00094  -0.00046  -0.00049  -0.00005  -0.00059  -0.00055  -0.00066  -0.00082  -0.00050  -0.00061
800     -0.00361  0.00095   0.00253   0.00220   0.00267   0.00242   0.00309   0.00313   0.00267   0.00284
        -0.00099  -0.00057  -0.00033  -0.00050  -0.00040  -0.00058  -0.00055  -0.00041  -0.00040  -0.00038

The total number of Monte Carlo repetitions is 5000.
Table 11: Root Mean Square Error, overlapping draws

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.5407   0.5407   0.5407   0.5407   0.5407   0.5407   0.5407   0.5407   0.5407   0.5407
        0.1577   0.1533   0.1489   0.1489   0.1398   0.1347   0.1327   0.1301   0.1287   0.1284
100     0.3832   0.2829   0.2523   0.2386   0.2092   0.1878   0.1787   0.1749   0.1718   0.1711
        0.1024   0.0954   0.0923   0.0909   0.0879   0.0857   0.0844   0.0833   0.0828   0.0825
200     0.2632   0.1929   0.1690   0.1592   0.1397   0.1260   0.1204   0.1181   0.1153   0.1148
        0.0660   0.0630   0.0618   0.0611   0.0598   0.0583   0.0578   0.0573   0.0569   0.0566
400     0.1851   0.1350   0.1175   0.1106   0.0972   0.0872   0.0844   0.0822   0.0814   0.0804
        0.0449   0.0427   0.0413   0.0415   0.0404   0.0395   0.0391   0.0388   0.0386   0.0386
800     0.1316   0.0944   0.0819   0.0780   0.0670   0.0604   0.0579   0.0567   0.0562   0.0557
        0.0305   0.0292   0.0289   0.0286   0.0280   0.0274   0.0272   0.0272   0.0270   0.0270

The total number of Monte Carlo repetitions is 5000.

Table 12: Root Mean Square Error, independent draws

n \ κ   0.2      0.5      0.8      1        2        5        10       20       50       100
50      0.3204   0.2869   0.2987   0.3025   0.2864   0.2756   0.2712   0.2694   0.2661   0.2640
        0.1384   0.1427   0.1418   0.1407   0.1390   0.1340   0.1323   0.1299   0.1293   0.1280
100     0.1862   0.1902   0.1931   0.1947   0.1851   0.1773   0.1753   0.1731   0.1708   0.1700
        0.0936   0.0939   0.0924   0.0925   0.0886   0.0855   0.0841   0.0833   0.0824   0.0823
200     0.1407   0.1336   0.1293   0.1260   0.1221   0.1171   0.1163   0.1150   0.1148   0.1152
        0.0680   0.0633   0.0624   0.0605   0.0592   0.0586   0.0573   0.0569   0.0567   0.0568
400     0.0971   0.0894   0.0853   0.0849   0.0821   0.0814   0.0814   0.0810   0.0804   0.0803
        0.0456   0.0417   0.0413   0.0406   0.0395   0.0392   0.0391   0.0389   0.0387   0.0385
800     0.0629   0.0595   0.0587   0.0577   0.0568   0.0559   0.0556   0.0549   0.0551   0.0551
        0.0309   0.0287   0.0285   0.0283   0.0280   0.0274   0.0273   0.0271   0.0271   0.0270

The total number of Monte Carlo repetitions is 5000.
7 Conclusion
We provide an asymptotic theory for simulated GMM and simulated MLE with nonsmooth simulated objective functions. The total number of simulations, R, has to increase without bound but can be much smaller than the total number of observations. In this case, the error in the parameter estimates is dominated by the simulation errors. This is a necessary cost of inference when the simulation model is computationally intensive.
References
Andrews, D. (1994): "Empirical Process Methods in Econometrics," in Handbook of Econometrics, Vol. 4, ed. by R. Engle and D. McFadden, pp. 2248–2292. North Holland.

Chen, X., O. Linton, and I. Van Keilegom (2003): "Estimation of Semiparametric Models when the Criterion Function Is Not Smooth," Econometrica, 71(5), 1591–1608.

Chernozhukov, V., and H. Hong (2003): "An MCMC Approach to Classical Estimation," Journal of Econometrics, 115(2), 293–346.

Duffie, D., and K. J. Singleton (1993): "Simulated Moments Estimation of Markov Models of Asset Prices," Econometrica, 61(4), 929–952.

Gallant, R., and G. Tauchen (1996): "Which Moments to Match," Econometric Theory, 12, 363–390.

Gourieroux, C., and A. Monfort (1996): Simulation-Based Econometric Methods. Oxford University Press, USA.

Ichimura, H., and S. Lee (2010): "Characterization of the Asymptotic Distribution of Semiparametric M-Estimators," Journal of Econometrics, 159(2), 252–266.

Kristensen, D., and B. Salanie (2010): "Higher Order Improvements for Approximate Estimators," CAM Working Papers.

Laroque, G., and B. Salanie (1989): "Estimation of Multi-Market Fix-Price Models: An Application of Pseudo Maximum Likelihood Methods," Econometrica, 57, 831–860.

Laroque, G., and B. Salanie (1993): "Simulation-Based Estimation of Models with Lagged Latent Variables," Journal of Applied Econometrics, 8(S1), S119–S133.

Lee, D., and K. Song (2015): "Simulated MLE for Discrete Choices using Transformed Simulated Frequencies," Journal of Econometrics, 187, 131–153.

Lee, L. (1992): "On Efficiency of Methods of Simulated Moments and Maximum Simulated Likelihood Estimation of Discrete Response Models," Econometric Theory, 8, 518–552.

Lee, L. (1995): "Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models," Econometric Theory, 11, 437–483.

Lerman, S., and C. Manski (1981): "On the Use of Simulated Frequencies to Approximate Choice Probabilities," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. Manski and D. McFadden. MIT Press.

McFadden, D. (1989): "A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration," Econometrica, 57, 995–1026.

Neumeyer, N. (2004): "A Central Limit Theorem for Two-Sample U-Processes," Statistics & Probability Letters, 67, 73–85.

Newey, W., and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, ed. by R. Engle and D. McFadden, pp. 2113–2241. North Holland.

Nolan, D., and D. Pollard (1987): "U-Processes: Rates of Convergence," The Annals of Statistics, pp. 780–799.

Nolan, D., and D. Pollard (1988): "Functional Limit Theorems for U-Processes," The Annals of Probability, pp. 1291–1298.

Pakes, A., and D. Pollard (1989): "Simulation and the Asymptotics of Optimization Estimators," Econometrica, 57, 1027–1057.

Pollard, D. (1984): Convergence of Stochastic Processes. Springer-Verlag.

Sherman, R. P. (1993): "The Limiting Distribution of the Maximum Rank Correlation Estimator," Econometrica, 61, 123–137.

Stern, S. (1992): "A Method for Smoothing Simulated Moments of Discrete Probabilities in Multinomial Probit Models," Econometrica, 60(4), 943–952.

Train, K. (2003): Discrete Choice Methods with Simulation. Cambridge University Press.

van der Vaart, A. (1999): Asymptotic Statistics. Cambridge University Press, Cambridge, UK.

Van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer-Verlag, New York.