
Bayesian Estimation and Comparison of Conditional Moment Models

Siddhartha Chib∗ Minchul Shin† Anna Simoni‡

October 2019

Abstract

We provide a Bayesian analysis of models in which the unknown distribution of the outcomes is specified up to a set of conditional moment restrictions. This analysis is based on the nonparametric exponentially tilted empirical likelihood (ETEL) function, which is constructed to satisfy a sequence of unconditional moments, obtained from the conditional moments by an increasing (in sample size) vector of approximating functions (such as tensor splines based on the splines of each conditioning variable). The posterior distribution is shown to satisfy the Bernstein-von Mises theorem, subject to a growth rate condition on the number of approximating functions, even under misspecification of the conditional moments. A large-sample theory for comparing different conditional moment models is also developed. The central result is that the marginal likelihood criterion selects the model that is less misspecified, that is, the model that is closer to the unknown true distribution in terms of the Kullback-Leibler divergence. Several examples are provided to illustrate the framework and results.

Keywords: Bayesian inference, Bernstein-von Mises theorem, Conditional moment restrictions, Exponentially tilted empirical likelihood, Marginal likelihood, Misspecification, Posterior consistency.

1 Introduction

We tackle the problem of prior-posterior inference in models where the only available information about the unknown parameter θ ∈ Θ ⊂ R^p is supplied in the form of a set of conditional moment restrictions

E_P[ρ(X, θ) | Z] = 0,   (1.1)

where ρ(X, θ) is a d-vector of known functions of an R^{dx}-valued random vector X and the unknown θ, and P is the unknown conditional distribution of X given an R^{dz}-valued random vector Z. We suppose that d ≥ p, letting the number of conditional moments exceed the number of parameters. This is a conditional moment restricted model because the moments constrain the set of possible distributions P. We say that the model is correctly specified if the true data generating process P∗ is in the set of distributions constrained to satisfy these moment conditions for some θ ∈ Θ, while the model is misspecified if P∗ is not in the set of implied distributions for any θ ∈ Θ.

∗ Olin Business School, Washington University in St. Louis, Campus Box 1133, 1 Brookings Drive, St. Louis, MO 63130. e-mail: [email protected].
† Federal Reserve Bank of Philadelphia, e-mail: [email protected]. The views expressed in this paper are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia or the Federal Reserve System.
‡ CREST, CNRS, École Polytechnique, ENSAE, 5, Avenue Henry Le Chatelier, 91120 Palaiseau, France, e-mail: [email protected].

A different starting point is one with unconditional moments, say E_P[g(X, θ)] = 0. Such models have recently attracted interest in the Bayesian community because distributional assumptions are entirely bypassed. Prior-posterior analysis is based on the empirical likelihood, as in Lazar (2003) and many others, or on the exponentially tilted empirical likelihood (ETEL), as in Schennach (2005) and Chib, Shin and Simoni (2018). The latter paper provides the large sample theory under misspecification and introduces the use of marginal likelihoods for comparing unconditional moment models, including the relevant framework for comparing such models and the large-sample model consistency of the marginal likelihoods.

Although the conditional moments imply that the functions ρ(X, θ) are uncorrelated with Z, i.e., E_P[ρ(X, θ) ⊗ Z] = 0, where ⊗ is the Kronecker product operator, it is inappropriate to use these unconditional restrictions alone as a substitute for the conditional moments. This is because the conditional moments assert even more: that ρ(X, θ) is uncorrelated with any measurable, bounded function of Z or, if Z is square-integrable, that it is uncorrelated with any L²-function of Z. Thus, in order to assemble the set of unconditional moments that are equivalent to the conditional moments, we must consider all such functions, which is only feasible as the sample size n goes to infinity. A general result due to Donald, Imbens and Newey (2003) states that this equivalent set of unconditional moments can be constructed by approximating functions q^K(Z) = (q_{K1}(Z), . . . , q_{KK}(Z))′, such as tensor product splines obtained from splines of each variable in Z, with the number of such functions, denoted by K, increasing with n. Then, instead of (1.1), inference based on the expanded unconditional moment conditions

E_P[ρ(X, θ) ⊗ q^K(Z)] = 0,   (1.2)

is valid. This is how we proceed in this paper.

The transformation into unconditional moments introduces, however, some challenges for the prior-posterior analysis that are different from those addressed in Chib, Shin and Simoni (2018). For one, quantities that are bounded with fixed moment restrictions now diverge with K. On determining the rate of this divergence (and thus stabilizing the growth), we show that under correct specification of the conditional moments, the posterior distribution of θ satisfies the Bernstein-von Mises (BvM) theorem with asymptotic posterior variance equal to the semiparametric efficiency bound derived in Chamberlain (1987). As a result, Bayesian credible sets are asymptotically valid efficient confidence sets. Conversely, sets based on unconditional versions of the conditional moments with a fixed K are, in general, not optimal.

Second, we consider misspecified conditional moment models, which occur widely in practice, and establish a similar BvM-type phenomenon. Just as in Kleijn and van der Vaart (2012), our theorem establishes that the posterior distribution of the centered and scaled parameter √n(θ − θ°), where θ° is the pseudo-true value, converges to a Normal distribution with a random mean that is bounded in probability. Again, the proof of this result, which requires fixing conditions under which the ETEL function satisfies a stochastic LAN expansion with increasing moments, is substantially more complicated than in the unconditional moment case.

Third, we develop a large sample theory for comparing competing conditional moment models by marginal likelihoods. Despite some similarities, the model comparison framework here is different from the one in our previous work. For one, the comparison of these models does not require that the models be reformulated to have the same number of conditional moments. The models can be compared directly, as stated, in terms of marginal likelihoods. We show that, under regularity conditions which are different from those in the unconditional case, the model picked by the marginal likelihood, in the limit, is the model that is less misspecified. This is also the model that is closest to the true distribution in the Kullback-Leibler divergence.

Conditional moments oftentimes supply the only source of information about P. Examples of this include causal inference, as in Rosenbaum and Rubin (1983), where one assumes that the potential outcomes are independent of the treatment variable conditional on covariates, and missing at random problems, as considered by Hristache and Patilea (2017) and numerous others. Our analysis makes possible Bayesian inference for a large class of models that appeared to be outside the scope of the Bayesian paradigm. In addition, our results here extend and complete the work of Chib et al. (2018) on unconditional moments. Our treatment also complements the papers on conditional moment models from a Bayesian viewpoint that are based on non-ETEL approaches and reflect other concerns. For instance, Liao and Jiang (2011), Florens and Simoni (2012, 2016), Kato (2013), Chen, Christensen and Tamer (2018) and Liao and Simoni (2019) allow the moment function to contain a nonparametric component that is estimated by a sieve-type approximation method, or a Gaussian process prior, permitting the possibility that the nonparametric component is only partially identified, but within a quasi-Bayesian formulation based on a pseudolikelihood. None of these papers tackles the problem of misspecification or the problem of model comparisons.

The rest of the paper is organized as follows. In Section 2 we sketch the conditional moment setting more formally and provide a motivating example. In Section 3 we describe the construction of the sequence of unconditional moments by an increasing (in sample size) vector of approximating functions. We then supply results on the large sample behavior of the posterior distribution in both the correct and misspecified moment models. In Section 4 we turn to the problem of model comparisons and determine the large sample behavior of the marginal likelihood. In Section 5 we provide an application of our techniques to a causal inference problem. Section 6 concludes. Technical proofs of the theorems are included in the supplementary appendix.

2 Setting and motivation

Let X := (X₁′, X_z′)′ be an R^{dx}-valued random vector and Z := (Z₁′, X_z′)′ be an R^{dz}-valued random vector. The vectors Z and X have elements in common if the dimension of the subvector X_z is non-zero. Moreover, we denote W := (X′, Z₁′)′ ∈ R^{dw} and its (unknown) joint distribution by P. By abuse of notation we use P also to denote the associated conditional distribution. We suppose that we are given a random sample W_{1:n} = (W₁, . . . , W_n) of W. Hereafter, we denote by E_P[·] the expectation with respect to P and by E_P[·|·] the conditional expectation with respect to the conditional distribution associated with P.

The parameter of interest is θ ∈ Θ ⊂ R^p, which is related to the conditional distribution P through the conditional moment restrictions

E_P[ρ(X, θ) | Z] = 0,   (2.1)

where ρ(X, θ) is a d-vector of known functions. Many interesting and important models in statistics fall into this framework.

Example 1 (Linear model with heteroscedastic error) Suppose that

E_P[(Y − θ₀ − θ₁X) | Z] = 0,   (2.2)

where ρ(X, θ) = (Y − θ₀ − θ₁X), Z = (1, X) and d = 1. This conditional moment restriction model is consistent with the data generating process (DGP) Y = θ₀ + θ₁X + ε, where ε = h(X)U, and (X, U) follow some unknown distribution P, with E(U) = 0, and the function h(X), the heteroskedasticity function, is unknown. If we specify the restrictions

E_P[(Y − θ₀ − θ₁X) | Z] = 0 and E_P[(Y − θ₀ − θ₁X)³ | Z] = 0,   (2.3)

where now ρ(X, θ) is a (2 × 1) vector of functions, we additionally impose conditional symmetry of ε.

Of course, the conditional moment model is different from the unconditional moment model. In Example 1, for instance, the unconditional moment conditions

E_P[(Y − θ₀ − θ₁X) ⊗ (1, X)′] = 0,   (2.4)

which impose the assumptions that ε has mean zero and is uncorrelated with X, are weaker and, if the conditional model is correct, less informative about θ.

3 Prior-Posterior analysis

3.1 Expanded Moment Conditions

One way to estimate the conditional moment model is by estimating the conditional expectation directly, as in the frequentist approach of Kitamura, Tripathi and Ahn (2004). This approach does not seem to generalize easily, if at all, to the Bayesian setting. An alternative approach, which we adopt, is based on recognizing that the conditional moments in (2.1) are a functional equation and, therefore, constitute a continuum of unconditional moment conditions. Under certain circumstances (see Bierens, 1982, Chamberlain, 1987), a countable number of unconditional moment restrictions that are equivalent to the conditional moment restrictions in (2.1) is guaranteed to exist. This is the basis of the frequentist approaches in Donald and Newey (2001), Ai and Chen (2003) and Carrasco and Florens (2000) where, after transforming the conditional moment restrictions into unconditional moment restrictions, the resulting set of unconditional moments is analyzed under a sieve approach or a Tikhonov regularization approach. Following Donald, Imbens and Newey (2003), the equivalent set of unconditional moments can be obtained through approximating functions.

Let q^K(Z) = (q_{K1}(Z), . . . , q_{KK}(Z))′, K > 0, denote a K-vector of real-valued functions of Z, for instance splines, truncated power series, or Fourier series. Suppose that these functions satisfy the following condition for the distribution P.


Assumption 3.1 For all K, E_P[q^K(Z)′q^K(Z)] is finite, and for any function a(z) : R^{dz} → R with E_P[a(Z)²] < ∞ there are K × 1 vectors γ_K such that, as K → ∞, E_P[(a(Z) − q^K(Z)′γ_K)²] → 0.

Now, if E_P[ρ(X, θ)′ρ(X, θ)] < ∞, then Donald, Imbens and Newey (2003, Lemma 2.1) established that: (1) if equation (2.1) is satisfied with θ = θ∗ then E_P[ρ(X, θ∗) ⊗ q^K(Z)] = 0 for all K; (2) if equation (2.1) is not satisfied then E_P[ρ(X, θ∗) ⊗ q^K(Z)] ≠ 0 for all K large enough.

Thus, under the stated assumptions, the conditional moment restrictions are equivalent to the limit of a sequence of unconditional moment restrictions

E_P[g(W, θ)] = 0,   (3.1)

where g(W, θ) := ρ(X, θ) ⊗ q^K(Z), with K → ∞, are the expanded functions.

Example 1 (continued) Let x = (x₁, . . . , x_n) ∈ R^n denote the sample data, and (τ₁, . . . , τ_K) the K knots, where the exterior knots τ₁ and τ_K are the minimum and maximum values of x, and the interior knots are specified quantile points of x. Let q^K(x) = (q₁(x), . . . , q_K(x))′ denote (say) K natural cubic spline basis functions, where q_j(x) is the cubic spline basis function located at τ_j. Let B denote the (n × K) matrix of these basis functions evaluated at x, where the ith row of B is given by q^K(x_i)′. Let (y − θ₀ − θ₁x) and (y − θ₀ − θ₁x)³ each denote n × 1 vectors, where y = (y₁, . . . , y_n). Then, the expanded functions g(W, θ) = g(x, θ) for the sample observations are the n × 2K functions

g(W, θ) = [ρ(x, θ) ⊗ q^K(x)] = [(y − θ₀ − θ₁x) ⊙ B ⋮ (y − θ₀ − θ₁x)³ ⊙ B],   (3.2)

where a ⊙ B denotes the Hadamard product of the n × 1 vector a with each column of B, and ⋮ denotes matrix concatenation (column binding).

In our numerical examples we use the natural cubic spline basis of Chib and Greenberg (2010) based on Z to construct q^K(Z), with K fixed at a given value, as in sieve estimation. If Z consists of more than one element, say (Z₁, Z₂, Z₃), where Z₁ and Z₂ are continuous variables and Z₃ is binary, then the basis matrix B is constructed as follows. Let z_j denote the n × 1 sample data on Z_j (j ≤ 3). Let Z = (z₁, z₂, z₁ ⊙ z₂, z₁ ⊙ z₃, z₂ ⊙ z₃) denote the n × 5 matrix of the continuous data and interactions of the continuous data and the binary data. Now suppose (τ_{j1}, . . . , τ_{jK}) are K knots based on each column of Z and let B_j denote the corresponding n × K matrix of cubic spline basis functions. Then, the basis matrix B is given by

B = [B₁ ⋮ B₂∗ ⋮ B₃∗ ⋮ B₄∗ ⋮ B₅∗ ⋮ z₃],

where B_j∗ (j = 2, 3, 4, 5) is the n × (K − 1) matrix in which each column of B_j is subtracted from its first and then the first column is dropped; see Chib and Greenberg (2010). Thus, the dimension of this basis matrix is n × (5K − 4 + 1). To define the expanded functions, let ρ_l(X, θ) (l ≤ d) denote the n × 1 vector of the lth element of ρ(X, θ) evaluated at the sample data matrix X. Then, the expanded functions for the sample observations are obtained by multiplying ρ_l(X, θ) by the matrix B and concatenating. We use versions of this approach to construct the expanded functions in our examples.
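To make the construction concrete, the following Python sketch builds a one-dimensional natural cubic spline basis and the expanded moment matrix. The function names are ours, and the basis is the standard truncated-power natural cubic spline construction, used here as an illustrative stand-in for the exact Chib and Greenberg (2010) parameterization.

```python
import numpy as np

def natural_cubic_basis(z, knots):
    """K-column natural cubic spline basis at knots t_1 < ... < t_K.

    Standard truncated-power construction: 1, z, and K-2 functions that
    are linear beyond the boundary knots (illustrative stand-in for the
    Chib-Greenberg basis used in the paper)."""
    z = np.asarray(z, dtype=float)
    t = np.asarray(knots, dtype=float)
    K = len(t)

    def d(k):
        # ((z - t_k)_+^3 - (z - t_K)_+^3) / (t_K - t_k)
        return (np.maximum(z - t[k], 0.0) ** 3
                - np.maximum(z - t[-1], 0.0) ** 3) / (t[-1] - t[k])

    cols = [np.ones_like(z), z] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)                        # (n, K)

def expanded_moments(rho, B):
    """g(W, theta): Hadamard product of each moment column of rho (n, d)
    with the basis matrix B (n, K), column-bound into an (n, dK) matrix."""
    return np.column_stack([rho[:, [l]] * B for l in range(rho.shape[1])])

# usage: 5 knots at sample quantiles of a conditioning variable z
# B = natural_cubic_basis(z, np.quantile(z, [0, .25, .5, .75, 1]))
# g = expanded_moments(np.column_stack([resid, resid**3]), B)
```

For the multivariate case described above, one would apply natural_cubic_basis to each column of Z, difference and drop columns as described, and column-bind the resulting blocks.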

3.2 Posterior distribution

The conditional model (2.1) is semiparametric and is characterized by two parameters: the data distribution P and the structural parameter θ, which is assumed to be finite dimensional. For a given value of K, the prior on (θ, P) is specified as π(θ)π(P | θ, K), where the prior on θ is standard. Our default prior on θ is a product of independent Student-t distributions with 2.5 degrees of freedom on each component of θ. The conditional prior on P, given (θ, K), can be viewed as a sieve-type prior where the hierarchical parameter K has a degenerate prior with a point mass at a given value. In establishing the asymptotic properties of the posterior distribution, however, we let K grow to infinity with the sample size to ensure that (2.1) and (3.1) are equivalent in the limit. Fixing K at a specific value in a finite-sample analysis only impacts the posterior variance.

Priors on P designed to incorporate overidentifying moment restrictions are those of Schennach (2005), Kitamura and Otsu (2011), Shin (2014) and Florens and Simoni (2019). Our prior π(P | θ, K) follows from Schennach (2005). To construct this prior, we first model the joint data distribution P of W as a mixture of uniform probability densities, a construction that is capable of approximating any distribution as the number of mixing components increases. Then, a prior is placed on the centers of the dw-dimensional hypercubes such that the corresponding mixture satisfies the moment restrictions for a given (θ, K). The resulting posterior is well-defined for every value of K.

By integrating out P with respect to this prior π(P | θ, K), one gets the integrated likelihood

p(W_{1:n} | θ, K) = ∏_{i=1}^n p_i(θ),   (3.3)

which is the exponentially tilted empirical likelihood (ETEL) function, and where p_i(θ), i = 1, . . . , n, are the probabilities that minimize the KL divergence between the probabilities (p₁, . . . , p_n) assigned to each sample observation and the empirical probabilities (1/n, . . . , 1/n), subject to the conditions that the probabilities (p₁, . . . , p_n) sum to one and that the expectation under these probabilities satisfies the given unconditional moment conditions (3.1). That is, p_i(θ), i = 1, . . . , n, are the solution of the following problem:

max_{p₁,...,p_n} ∑_{i=1}^n [−p_i log(np_i)]   subject to:   ∑_{i=1}^n p_i = 1,   ∑_{i=1}^n p_i g(w_i, θ) = 0,   p_i ≥ 0   (3.4)

(see Schennach (2005) for a proof). In practice, we compute the ETEL probabilities from the dual (saddlepoint) representation as

p_i(θ) := exp(λ̂(θ)′g(w_i, θ)) / ∑_{j=1}^n exp(λ̂(θ)′g(w_j, θ)),   i = 1, . . . , n,   (3.5)

where λ̂(θ) = arg min_{λ∈R^{dK}} (1/n) ∑_{i=1}^n exp(λ′g(w_i, θ)) is the estimated tilting parameter (see, e.g., Csiszar (1984)).
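For concreteness, the saddlepoint computation in (3.5) can be coded along the following lines. This is a minimal sketch under our own naming; a log-sum-exp reformulation of the dual (which has the same minimizer) is used for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def log_etel(G):
    """Log ETEL from the (n, dK) matrix G of expanded moments g(w_i, theta).

    Returns the log likelihood (3.3), the probabilities (3.5), and the
    estimated tilting parameter from the dual (saddlepoint) problem."""
    n = G.shape[0]
    # dual: minimize log[(1/n) sum_i exp(lam' g_i)]; log is monotone, so
    # this has the same minimizer as (1/n) sum_i exp(lam' g_i)
    dual = lambda lam: logsumexp(G @ lam) - np.log(n)
    lam_hat = minimize(dual, np.zeros(G.shape[1]), method="BFGS").x
    v = G @ lam_hat
    log_p = v - logsumexp(v)                 # log p_i(theta), eq. (3.5)
    return log_p.sum(), np.exp(log_p), lam_hat
```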

By multiplying the ETEL function by the prior density of θ, the posterior distribution now takes the form

π(θ | w_{1:n}, K) ∝ π(θ) ∏_{i=1}^n exp(λ̂(θ)′g(w_i, θ)) / ∑_{j=1}^n exp(λ̂(θ)′g(w_j, θ)),   (3.6)

which we summarize by MCMC simulations, for example, the one-block tailored Metropolis-Hastings (M-H) algorithm of Chib and Greenberg (1995), or the tailored randomized-block M-H algorithm of Chib and Ramamurthy (2010).
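A minimal random-walk M-H sampler for (3.6) is sketched below. It is a simple stand-in for the tailored samplers cited above; log_post is assumed to return log π(θ) plus the log ETEL (e.g., built from the log_etel sketch).

```python
import numpy as np

def rw_mh(log_post, theta0, n_draws=20000, burn=1000, step=0.1, seed=0):
    """Random-walk Metropolis for the ETEL posterior (3.6)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    out = np.empty((n_draws, theta.size))
    for t in range(n_draws + burn):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # M-H accept/reject
            theta, lp = prop, lp_prop
        if t >= burn:
            out[t - burn] = theta
    return out
```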

Example 1 (continued) To illustrate the prior-posterior analysis, we create a set of simulated data {y_i, x_i}_{i=1}^n with covariates X ∼ U(−1, 2.5), intercept θ₀ = 1, slope θ₁ = 1, and ε_i distributed according to ε_i ∼ SN(m(x_i), h(x_i), s(x_i)), where SN(m, h, s) is the skew normal distribution with location, scale, and shape parameters given by (m, h, s), each depending on x_i. When s is zero, ε_i is normal with mean m and standard deviation h. We set m(x_i) = −h(x_i)√(2/π) s(x_i)/√(1 + s(x_i)²) so that E_P[ε|X] = 0. Suppose that h(x) = √(exp(1 + 0.7x + 0.2x²)) and s(x) = 1 + x². We estimate the model using E_P[ε|Z] = 0, Z = (1, X), without the need to model the heteroskedasticity or the skewness functions.
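For reference, a hedged sketch of this DGP, using scipy's skew normal, whose shape parameter a plays the role of s:

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1.0, 2.5, size=n)                    # X ~ U(-1, 2.5)

h = np.sqrt(np.exp(1 + 0.7 * x + 0.2 * x**2))         # scale h(x)
s = 1 + x**2                                          # shape s(x)
m = -h * np.sqrt(2 / np.pi) * s / np.sqrt(1 + s**2)   # location => E[eps | x] = 0

eps = skewnorm.rvs(a=s, loc=m, scale=h, random_state=rng)
y = 1.0 + 1.0 * x + eps                               # theta0 = theta1 = 1
```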

Under the default independent Student-t prior with mean 0, dispersion 5, and degrees of freedom 2.5, implying a prior standard deviation of (25(2.5)/(2.5 − 2))^{1/2} = 11.18, the marginal posterior distributions of θ₀ and θ₁ are summarized in panel (a) of Table 1 for two different sample sizes. We note from the dispersion of the posterior distribution that the posterior distributions of both θ₀ and θ₁ shrink to the true value at the √n-rate. In the next section we formally establish this behavior. For comparison, we also compute the posterior distribution of (θ₀, θ₁) under the weaker assumption that ε_i is mean zero and uncorrelated with x_i. The relevant moment restrictions, given as in (2.4), are a subset of the expanded moment conditions. As can be seen from panels (a) and (b) of Table 1, imposing the (correct) conditional moment restrictions leads to about a 25% reduction in the posterior standard deviation of θ₁, for each of the two sample sizes.

Panel (a): E_P[ε|z] = 0

              Mean    SD     Median  Lower  Upper  Ineff
n = 500   θ₀  0.896   0.073  0.895   0.755  1.040  1.107
          θ₁  1.127   0.084  1.126   0.964  1.296  1.117
n = 2000  θ₀  0.976   0.034  0.976   0.910  1.042  1.119
          θ₁  1.040   0.041  1.040   0.961  1.121  1.093

Panel (b): E_P[ε] = 0, E_P[εx] = 0

              Mean    SD     Median  Lower  Upper  Ineff
n = 500   θ₀  0.854   0.079  0.854   0.704  1.010  1.092
          θ₁  1.198   0.115  1.196   0.980  1.432  1.141
n = 2000  θ₀  0.962   0.036  0.962   0.893  1.032  1.092
          θ₁  1.053   0.055  1.053   0.947  1.162  1.101

Table 1: Difference between inferences from conditional (top panel) vs unconditional moments (bottom panel). Data is generated from a regression model with conditional heteroscedasticity and skewness. The true value of θ₀ is 1 and that of θ₁ is 1. Results are based on 20,000 MCMC draws beyond a burn-in of 1000. "Lower" and "Upper" refer to the 0.05 and 0.95 quantiles of the simulated draws, respectively, and "Ineff" to the inefficiency factor.

3.3 Asymptotic properties

Consider now the large sample behavior of the posterior distribution of θ. We let θ∗ and P∗, respectively, denote the true value of θ and the true data distribution P. As for notation, when the true distribution P∗ is involved, expectations E_P[·] (resp. E_P[·|·]) are taken with respect to P∗ (resp. the conditional distribution associated with P∗). In addition, we denote

ρ_θ(X, θ) := ∂ρ(X, θ)/∂θ′,   D(Z) := E_P[ρ_θ(X, θ∗) | Z],   Σ(Z) := E[ρ(X, θ∗)ρ(X, θ∗)′ | Z],   and   ρ_{θθ}^j(X, θ∗) := ∂²ρ_j(X, θ)/∂θ∂θ′.


For a vector a, ‖a‖ denotes the Euclidean norm. For a matrix A, ‖A‖ denotes the operator norm (the largest singular value of the matrix). Finally, Z := supp(Z) denotes the support of Z and ℓ_{n,θ}(W_i) := log p_i(θ).

The first assumption is a normalization for the second moment matrix of the approximating functions, which is standard in the literature; see, e.g., Newey (1997) and Donald et al. (2003).

Assumption 3.2 For each K there is a constant scalar ζ(K) such that sup_{z∈Z} ‖q^K(z)‖ ≤ ζ(K), E_P[q^K(Z)q^K(Z)′] has smallest eigenvalue bounded away from zero uniformly in K, and √K ≤ ζ(K).

The bound ζ(K) is known explicitly in a number of cases, depending on the approximating functions used. Donald et al. (2003) provide a discussion and explicit formulas for ζ(K) in the case of splines, power series and Fourier series. We also refer to Newey (1997) for primitive conditions for regression splines and power series.

Assumption 3.3 The data W_i := (X_i, Z_i), i = 1, . . . , n, are i.i.d. according to P∗ and (a) there exists a unique θ∗ ∈ Θ that satisfies E_P[ρ(X, θ)|Z] = 0 for the true P∗; (b) Θ is compact; (c) E_P[sup_{θ∈Θ} ‖ρ(X, θ)‖² | Z] is uniformly bounded on Z.

This assumption is the same as Donald, Imbens and Newey (2003, Assumption 3). Part (d) of this assumption imposes a Lipschitz condition which, together with part (c), allows us to apply uniform convergence results. The following three assumptions are also the same as the ones required by Donald, Imbens and Newey (2003) to establish asymptotic normality of the Generalized Empirical Likelihood estimator.

Assumption 3.4 (a) θ∗ ∈ int(Θ); (b) ρ(X, θ) is twice continuously differentiable in a neighborhood U of θ∗, and E_P[sup_{θ∈U} ‖ρ_θ(X, θ)‖² | Z] and E_P[‖ρ_{θθ}^j(X, θ∗)‖² | Z], j = 1, . . . , d, are uniformly bounded on Z; (c) E_P[D(Z)D(Z)′] is nonsingular.

Assumption 3.5 (a) Σ(Z) has smallest eigenvalue bounded away from zero; (b) for a neighborhood U of θ∗, E_P[sup_{θ∈U} ‖ρ(X, θ)‖⁴ | z] is uniformly bounded on Z, and for all θ ∈ U, ‖ρ(X, θ) − ρ(X, θ∗)‖ ≤ δ(X)‖θ − θ∗‖ with E_P[δ(X)² | Z] < ∞.

Assumption 3.6 There is γ > 2 such that E_P[sup_{θ∈Θ} ‖ρ(X, θ)‖^γ] < ∞ and ζ(K)²K/n^{1−2/γ} → 0.

The last assumption is about the prior distribution of θ and is standard in the Bayesian literature on frequentist asymptotic properties of Bayes procedures.


Assumption 3.7 (a) π is a continuous probability measure that admits a density with respect to the Lebesgue measure; (b) π is positive on a neighborhood of θ∗.

We are now able to state our first major result, in which we establish the asymptotic normality and efficiency of the posterior distribution of the local parameter h := √n(θ − θ∗).

Theorem 3.1 (Bernstein-von Mises) Under Assumptions 3.1-3.7, if K → ∞, ζ(K)K²/√n → 0, and if for any δ > 0 there exists ε > 0 such that, as n → ∞,

P( sup_{‖θ−θ∗‖>δ} (1/n) ∑_{i=1}^n (ℓ_{n,θ}(W_i) − ℓ_{n,θ∗}(W_i)) ≤ −ε ) → 1,   (3.7)

then the posterior distribution π(√n(θ − θ∗) | W_{1:n}) converges in total variation towards a random Normal distribution, that is,

sup_B | π(√n(θ − θ∗) ∈ B | W_{1:n}) − N_{∆_{n,θ∗}, V_{θ∗}}(B) | →^p 0,   (3.8)

where B ⊆ Θ is any Borel set, ∆_{n,θ∗} := −(1/√n) ∑_{i=1}^n V_{θ∗} D(Z_i)′Σ(Z_i)^{−1}ρ(X_i, θ∗) is bounded in probability, and V_{θ∗} := (E_P[D(Z)′Σ(Z)^{−1}D(Z)])^{−1}.

We note that the centering ∆_{n,θ∗} of the limiting normal distribution satisfies (1/√n) ∑_{i=1}^n d log p_i(θ∗)/dθ − V_{θ∗}^{−1}∆_{n,θ∗} →^p 0. We also note that the condition ζ(K)K²/√n → 0 in the theorem implies K/n → 0, which is a classical condition in the sieve literature. This condition is required to establish a stochastic Local Asymptotic Normality (LAN) expansion, which is an intermediate step in proving the Bernstein-von Mises result, as we explain below. The LAN expansion is not required to establish asymptotic normality of the Generalized Empirical Likelihood (GEL) estimators, which explains why our condition is slightly stronger than the condition ζ(K)K/√n → 0 required by Donald, Imbens and Newey (2003). On the other hand, our condition is weaker than the condition ζ(K)²K²/√n → 0 required by Donald, Imbens and Newey (2009) to establish the mean square error of the GEL estimators. The asymptotic covariance of the posterior distribution coincides with the semiparametric efficiency bound given in Chamberlain (1987) for conditional moment condition models. This means that, for every α ∈ (0, 1), (1 − α)-credible regions constructed from the posterior of θ are asymptotically (1 − α)-confidence sets. Indeed, they are correctly centered and have the correct volume.

The proof of this theorem is given in the Appendix and consists of three steps. In the first step we show consistency of the posterior distribution of θ, namely:

π(√n ‖θ − θ∗‖ > M_n | W_{1:n}) →^p 0   (3.9)

for any M_n → ∞, as n → ∞. To show this, the identification assumption (3.7) is used. In the second step we show that the ETEL function satisfies a stochastic LAN expansion:

sup_{h∈H} | ∑_{i=1}^n ℓ_{n,θ∗+h/√n}(W_i) − ∑_{i=1}^n ℓ_{n,θ∗}(W_i) − h′V_{θ∗}^{−1}∆_{n,θ∗} + (1/2) h′V_{θ∗}^{−1}h | = o_p(1),   (3.10)

where H denotes a compact subset of R^p and V_{θ∗}^{−1}∆_{n,θ∗} →^d N(0, V_{θ∗}^{−1}). As the ETEL function is an integrated likelihood, the expansion (3.10) is better known as integral LAN in the semiparametric Bayesian literature; see, e.g., Bickel and Kleijn (2012, Section 4). In the third step of the proof we use arguments as in the proof of Van der Vaart (1998, Theorem 10.1) to show that (3.9) and (3.10) imply asymptotic normality of π(√n(θ − θ∗) ∈ B | W_{1:n}). While these three steps are classical in proving the Bernstein-von Mises phenomenon, establishing (3.10) raises challenges that are otherwise absent. This is because the ETEL function is a nonstandard likelihood that involves the estimated tilting parameter λ̂(θ), whose dimension dK increases with n. Therefore, we first need to determine the rates of growth of ‖λ̂‖, ‖(1/n) ∑_{i=1}^n g(W_i, θ)‖, and of the norms of the empirical counterparts of D(Z) and Σ(Z). While ‖λ̂(θ∗)‖ is expected to converge to zero in the correctly specified case, the rate of convergence is slower than n^{−1/2}. In the Appendix we show that ‖λ̂(θ∗)‖ = O_p(√(K/n)) under the previous assumptions.

3.4 Misspecified model

We now generalize the preceding BvM result to the important class of misspecified conditional moment models, building on the theory derived in Chib, Shin and Simoni (2018) in connection with misspecified unconditional moment models.

Definition 3.1 (Misspecified model) We say that the conditional moment condition model is misspecified if the set of probability measures implied by the moment restrictions does not contain the true data generating process P∗ for any θ ∈ Θ, that is, P∗ ∉ 𝒫, where 𝒫 = ∪_{θ∈Θ} 𝒫_θ and 𝒫_θ = {Q ∈ M_{X|Z} : E^Q[ρ(X, θ)|Z] = 0 a.s.}, with M_{X|Z} the set of all conditional probability measures of X|Z.

In essence, if (2.1) is misspecified then there is no θ ∈ Θ such that E_P[ρ(X, θ) ⊗ q^K(Z)] = 0 almost surely for every K large enough. Now, for every θ ∈ Θ, define Q∗(θ) as the minimizer of the Kullback-Leibler divergence of P∗ to the model 𝒫_θ := {Q ∈ M : E^Q[g(W, θ)] = 0}, where M denotes the set of all the probability measures on R^{dw}. That is, Q∗(θ) := arginf_{Q∈𝒫_θ} K(Q||P∗), where K(Q||P∗) := ∫ log(dQ/dP∗)dQ. If we suppose that the dual representation of the Kullback-Leibler minimization problem holds, then the P∗-density of Q∗(θ) has the closed form

[dQ∗(θ)/dP∗](W_i) = exp(λ°(θ)′g(W_i, θ)) / E_P[exp(λ°(θ)′g(W_j, θ))],

where λ°(θ) denotes the tilting parameter and is defined in the same way as in the correctly specified case:

λ°(θ) := arg min_{λ∈R^{dK}} E_P[exp(λ′g(W_i, θ))].   (3.11)

We also impose a condition to ensure that the probability measures in 𝒫 := ∪_{θ∈Θ} 𝒫_θ, which are implied by the model, are dominated by the true probability measure P∗. This is required for the validity of the dual theorem. Therefore, following Sueishi (2013, Theorem 3.1), we replace Assumption 3.3 (a) by the following.

Assumption 3.8 For a fixed θ ∈ Θ, there exists Q ∈ 𝒫_θ such that Q is mutually absolutely continuous with respect to P∗, where 𝒫_θ := {Q ∈ M : E^Q[g(W, θ)] = 0} and M denotes the set of all the probability measures on R^{dw}.

This assumption implies that 𝒫_θ is non-empty. A similar assumption is also made by Kleijn and van der Vaart (2012) and Chib et al. (2018) to establish the BvM under misspecification. The pseudo-true value of the parameter θ ∈ Θ is denoted by θ° and is defined as the minimizer of the Kullback-Leibler divergence between the true P∗ and Q∗(θ):

θ° := arginf_{θ∈Θ} K(P∗||Q∗(θ)),   (3.12)

where K(P∗||Q∗(θ)) := ∫ log(dP∗/dQ∗(θ))dP∗. Under the preceding absolute continuity assumption, the pseudo-true value θ° is available as

θ° = argmax_{θ∈Θ} E_P[ log( exp(λ°(θ)′g(W_i, θ)) / E_P[exp(λ°(θ)′g(W_j, θ))] ) ].   (3.13)

Note that λ°(θ°), the value of the tilting parameter at the pseudo-true value θ°, is nonzero because the moment conditions do not hold.

Assumption 3.8 implies that K(Q∗(θ)||P∗) < ∞. We supplement this with the assumption that K(P∗||Q∗(θ°)) < ∞ and that K(P∗||Q∗(θ)) < ∞ for all θ ∈ Θ. Because consistency in misspecified models is defined with respect to the pseudo-true value θ°, we need to replace Assumption 3.7 (b) by the following assumption which, together with Assumption 3.9 (a), requires the prior to put enough mass on balls around θ°.

Assumption 3.9 (a) π is a continuous probability measure that admits a density with respect to the Lebesgue measure; (b) the prior distribution π is positive on a neighborhood of θ°, where θ° is as defined in (3.13).


In the next assumption we denote by int(Θ) the interior of Θ and by U a ball centered at θ° with radius h/√n for some h ∈ H, with H a compact subset of R^p.

Assumption 3.10 The data W_i := (X_i, Z_i), i = 1, . . . , n, are i.i.d. according to P∗, and:

(a) the pseudo-true value θ° ∈ int(Θ) is the unique maximizer of λ°(θ)′E_P[g(W, θ)] − log E_P[exp(λ°(θ)′g(W, θ))], where Θ is compact;

(b) λ°(θ) ∈ int(Λ(θ)), where Λ(θ) is a compact set for every θ ∈ Θ and λ°(θ) is as defined in (3.11);

(c) ρ(X, θ) is continuous at each θ ∈ Θ with probability one;

(d) ρ(X, θ) is twice continuously differentiable in the neighborhood U of θ°, and E_P[sup_{θ∈U} ‖ρ_θ(X, θ)‖⁴ | Z] and E_P[sup_{θ∈U} e^{λ°(θ)′g(W,θ)} ‖ρ_{θθ}^j(X, θ)‖² | Z], j = 1, . . . , d, are uniformly bounded on Z;

(e) for the neighborhood U of θ°, E_P[e^{λ°(θ°)′g(W,θ°)} ‖ρ(X, θ°)‖² ‖q^K(Z)‖] = O(K) and, for all θ ∈ U, ‖ρ(X, θ) − ρ(X, θ°)‖ ≤ δ(X)‖θ − θ°‖ with E_P[δ(X)² | Z] < ∞ and E_P[e^{λ°(θ°)′g(W,θ°)} δ(X)² ‖q^K(Z)‖²] = O(K);

(f) for the neighborhood U of θ°, for κ = 1, 2 and j = 2, 4, it holds that E_P[sup_{θ∈U} e^{κλ°(θ)′g(W,θ)} ‖g(W, θ)‖^j] = O(ζ(K)^{j−2}K) and E_P[sup_{θ∈U} e^{κλ°(θ)′g(W,θ)} ‖G(W, θ)‖^j] = O(ζ(K)^{j−2}K), where ζ(K) is as defined in Assumption 3.2 and G(W, θ) := ρ_θ(X, θ) ⊗ q^K(Z);

(g) E[e^{λ°(θ°)′g(W,θ°)} ρ(X, θ°)ρ(X, θ°)′ | Z] has smallest eigenvalue bounded away from zero;

(h) for H a compact subset of R^p, it holds that

sup_{h∈H} E[g(W_i, θ°)′] ( dλ̂(θ)/dθ′ − dλ°(θ)/dθ′ )|_{θ=θ°} h = O_p(n^{−1/2}),

where λ̂(θ) is the solution of E_n[e^{λ̂(θ)′g(W_i,θ)} g(W_i, θ)] = 0 and E_n[·] := (1/n) ∑_{i=1}^n [·] is the empirical mean operator.

Assumption 3.10 (a) guarantees uniqueness of the pseudo-true value and is a standard assumption in the literature on misspecified models (see, e.g., White (1982)). Assumption 3.10 (d) is the misspecified counterpart of Assumptions 3.4 (b) and 3.5 (b). Remark that the presence of the exponential e^{λ°(θ)′g(W,θ)} inside the expectations in Assumptions 3.10 (e)-(g) is due to the fact that, in the misspecified case, the pseudo-true value of the tilting parameter λ°(θ°) is not equal to zero, as it is in the correctly specified case. Assumptions 3.10 (e) and (f) impose an upper bound on the rate at which the norms of the dK-vectors and (dK × p)-matrices involved are allowed to increase. Assumption 3.10 (g) is the misspecified counterpart of Assumption 3.5 (a). Finally, Assumption 3.10 (h) guarantees that one of the terms in the random vector ∆_{n,θ°}, which is introduced in Theorem 3.2 below, is bounded in probability.

Our next important theorem, the Bernstein-von Mises theorem for misspecified models, now follows.

Theorem 3.2 (Bernstein-von Mises (misspecified)) Let Assumptions 3.1, 3.2, 3.6, 3.8, 3.9 and 3.10 hold. Assume that there exists a constant C > 0 such that, for any sequence M_n → ∞,

P( sup_{‖θ−θ°‖>M_n/√n} (1/n) ∑_{i=1}^n (ℓ_{n,θ}(W_i) − ℓ_{n,θ°}(W_i)) ≤ −C M_n²/n ) → 1,   (3.14)

as n → ∞. If K → ∞ and ζ(K)K²√(K/n) → 0, then the posterior converges in total variation towards a Normal distribution, that is,

sup_B | π(√n(θ − θ°) ∈ B | W_{1:n}) − N_{∆_{n,θ°}, A_{θ°}^{−1}}(B) | →^p 0,   (3.15)

where B ⊆ Θ is any Borel set, ∆_{n,θ°} is a random vector bounded in probability, and A_{θ°}^{−1} is a nonsingular matrix.

The expression for A_{θ°} is given in (E.30) in the Appendix. Just as in Kleijn and van der Vaart (2012), this theorem establishes that the posterior distribution of the centered and scaled parameter √n(θ − θ°) converges to a Normal distribution with a random mean that is bounded in probability. Its proof is based on the same three steps as the proof of Theorem 3.1 in the correctly specified case, with θ∗ replaced by the pseudo-true value θ°. There are, however, important differences in proving that the ETEL function satisfies a stochastic LAN expansion in the misspecified case. First of all, the limit of λ̂(θ°) is λ°(θ°), which is different from zero. Therefore, several terms that were equal to zero in the LAN expansion for the correctly specified case are non-zero in the misspecified case, and we have to deal with their limits in distribution. Second, the quantity (1/√n) ∑_{i=1}^n g(W_i, θ°) is no longer centered on zero, which leads to an additional bias term. Part of the behavior of this term is controlled by Assumption 3.10 (h).


Furthermore, our proof makes use of a stochastic LAN expansion of the ETEL function, which we prove (under the assumptions of the theorem) takes the form

sup_{h∈H} | ∑_{i=1}^n ℓ_{n,θ°+h/√n}(W_i) − ∑_{i=1}^n ℓ_{n,θ°}(W_i) − h′A_{θ°}∆_{n,θ°} + (1/2) h′A_{θ°}h | = o_p(1),

where ∆_{n,θ°} and A_{θ°} are as in the statement of Theorem 3.2.

4 Model Comparisons

We now turn our attention to the problem of comparing competing conditional moment models. We suppose that the models in the model space are misspecified, which is arguably the most pervasive case in practice. We are concerned with establishing the large sample optimality of the formal Bayesian rule of picking the model with the largest value of the marginal likelihood.

Let M_ℓ denote the ℓ-th model in the model space. Each model is characterized by a parameter θ_ℓ and an expanded moment function g_ℓ(W, θ_ℓ). For each model M_ℓ, we impose a prior distribution on θ_ℓ and obtain the posterior distribution based on (3.6). Let m(W_{1:n}|M_ℓ) denote the marginal likelihood of model M_ℓ, which we calculate from the marginal likelihood identity of Chib (1995):

log m(W_{1:n}|M_ℓ) = log π(θ̃_ℓ|M_ℓ) + log p(W_{1:n}|θ̃_ℓ, M_ℓ) − log π(θ̃_ℓ|W_{1:n}, M_ℓ),   (4.1)

and the method of Chib and Jeliazkov (2001). In this expression, θ̃_ℓ is any point in the support of the posterior (such as the posterior mean).
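As an illustration, the three terms in (4.1) can be assembled as follows for a single-block random-walk M-H chain, with the posterior ordinate estimated by the Chib and Jeliazkov (2001) ratio of acceptance-probability averages. The function names and the Gaussian proposal are our assumptions; the tailored proposals used in the paper would change the q(·) terms accordingly.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def log_marglik(draws, theta_bar, log_prior, log_lik, step, seed=0):
    """Log marginal likelihood via identity (4.1); the posterior ordinate
    at theta_bar is estimated as in Chib-Jeliazkov (2001) for a
    random-walk M-H chain with N(0, step^2 I) increments."""
    rng = np.random.default_rng(seed)
    kernel = lambda th: log_prior(th) + log_lik(th)    # log posterior kernel

    def alpha(frm, to):                                # M-H acceptance prob.
        return min(1.0, np.exp(kernel(to) - kernel(frm)))

    cov = step**2 * np.eye(len(theta_bar))
    # numerator: mean of alpha(theta_g -> theta_bar) * q(theta_bar | theta_g)
    num = np.mean([alpha(th, theta_bar) * mvn.pdf(theta_bar, mean=th, cov=cov)
                   for th in draws])
    # denominator: mean of alpha(theta_bar -> theta_j), theta_j ~ q(.|theta_bar)
    props = theta_bar + step * rng.standard_normal((len(draws), len(theta_bar)))
    den = np.mean([alpha(theta_bar, th) for th in props])

    return log_prior(theta_bar) + log_lik(theta_bar) - np.log(num / den)
```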

Remark 4.1 The comparison of conditional moment condition models differs in one important respect from the framework for comparing unconditional moment condition models established in Chib et al. (2018), where it is shown that, to make unconditional moment condition models comparable, it is necessary to linearly transform the moment functions so that all the transformed moments are included in each model. This linear transformation consists of adding an extra parameter, different from zero, to the components of the vector g(θ, W) that correspond to the restrictions not included in a specific model. When comparing conditional moment models, however, this transformation is not necessary because the convex hulls associated with different expanded models have the same dimension asymptotically.

4.1 Model selection consistency

Suppose that our collection of models among which we want to make a selection contains J models. At least J − 1 of these models are misspecified, and one can be either misspecified or correctly specified. Moreover, suppose that the best model is selected by the size of the marginal likelihoods. Then, in Theorem 4.1 we show that this criterion in the limit picks the model M_ℓ with the smallest KL divergence between P∗ and the corresponding Q∗(θ°_ℓ), where Q∗(θ°_ℓ) is such that K(Q∗(θ°_ℓ)||P∗) = inf_{Q∈𝒫_{θ_ℓ}} K(Q||P∗) and 𝒫_{θ_ℓ} is defined in Section 3.4.

Theorem 4.1 Let the assumptions of Theorem 3.2 hold. Consider the comparison of J < ∞ models M_ℓ, ℓ = 1, . . . , J, such that J − 1 of these models each has at least one misspecified moment condition, that is, M_ℓ does not satisfy Assumption 3.3 (a) for all ℓ ≠ j, while model M_j can be either correctly specified or contain some misspecified moment condition. Then,

lim_{n→∞} P( log m(W_{1:n}; M_j) > max_{ℓ≠j} log m(W_{1:n}; M_ℓ) ) = 1

if and only if K(P∗||Q∗(θ°_j)) < min_{ℓ≠j} K(P∗||Q∗(θ°_ℓ)), where K(P∗||Q) := ∫ log(dP∗/dQ)dP∗.

Note that if one model in the contending set of models is correctly specified, then this model will have zero KL divergence and, therefore, according to Theorem 4.1, that model will have the largest marginal likelihood and will be selected by our procedure.

To understand the ramifications of the preceding result, suppose that we are interested in comparing models with the same moment conditions but different conditioning variables:

Model 1: E_P[ρ(X, θ)|Z₁] = 0,   Model 2: E_P[ρ(X, θ)|Z₂] = 0,   (4.2)

where Z₁ and Z₂ may have some elements in common; in particular, Z₂ might be a subvector of Z₁ (or vice versa). A situation of this type, where we are unsure about the validity of instrumental variables, is the following.

Example 2 (Comparing IV models) Consider the following model with three instruments (Z₁, Z₂, Z₃):

Y = θ₀ + θ₁X + e₁,
X = f(Z₁, Z₂, Z₃) + e₂,
Z₁ ∼ U[0, 1], Z₂ ∼ U[0, 1], and Z₃ ∼ B(0.4),

where (e₁, e₂)′ are non-Gaussian and correlated, which makes X in the outcome model correlated with the error e₁. We let the true value of θ = (θ₀, θ₁) be (1, 1). Moreover, suppose that the Z_j's are relevant instruments, that is, cov(X, Z_j) ≠ 0 for j ≤ 3, and

f(Z₁, Z₂, Z₃) = 6(√0.3 Z₁ + √0.7 Z₂)³ (1 − √0.3 Z₁ − √0.7 Z₂) Z₃ + Z₁Z₂(1 − Z₃).   (4.3)


We consider a situation in which some instruments are valid and some are not, and we are interested in selecting the valid instruments from a set of instruments. To this end, we generate (e₁, e₂, Z₁) from a Gaussian copula whose covariance matrix is Σ = [1, 0.7, 0.7; 0.7, 1, 0; 0.7, 0, 1], such that the marginal distribution of e₁ is the skewed mixture of two normal distributions 0.5N(0.5, 0.5²) + 0.5N(−0.5, 1.118²) and the marginal distribution of e₂ is N(0, 1). Under this setup, Z₁ is an invalid instrument. We consider the following three models:

M₁ : E_P[(Y − θ₀ − θ₁X)|Z₁, Z₂, Z₃] = 0,   (4.4)
M₂ : E_P[(Y − θ₀ − θ₁X)|Z₁, Z₃] = 0,   (4.5)
M₃ : E_P[(Y − θ₀ − θ₁X)|Z₂, Z₃] = 0.   (4.6)

Because Z₁ is an invalid instrument, both M₁ and M₂ are wrong.

In M₁, our basis matrix B is made from the variables (z₁, z₂, z₁ ⊙ z₂, z₁ ⊙ z₃, z₂ ⊙ z₃), each using five knots, concatenated with the vector z₃. This matrix B has 22 columns, which equals the number of expanded moment conditions. The prior for θ₀ and θ₁ is the product of Student-t distributions with mean zero, dispersion 5, and degrees of freedom 2.5. Estimation and calculation of the marginal likelihood for M₂ and M₃ are special cases of M₁.
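For reference, a sketch of this data generating process; the construction is our own, with the e₁ marginal inverted numerically from its mixture CDF.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

rng = np.random.default_rng(2)
n = 500
Sigma = np.array([[1.0, 0.7, 0.7],
                  [0.7, 1.0, 0.0],
                  [0.7, 0.0, 1.0]])
# Gaussian copula uniforms for (e1, e2, Z1)
u = norm.cdf(rng.multivariate_normal(np.zeros(3), Sigma, size=n))

mix_cdf = lambda x: 0.5 * norm.cdf(x, 0.5, 0.5) + 0.5 * norm.cdf(x, -0.5, 1.118)
e1 = np.array([brentq(lambda x, ui=ui: mix_cdf(x) - ui, -15, 15) for ui in u[:, 0]])
e2 = norm.ppf(u[:, 1])                                 # N(0, 1) marginal
z1, z2, z3 = u[:, 2], rng.uniform(size=n), rng.binomial(1, 0.4, n)

s = np.sqrt(0.3) * z1 + np.sqrt(0.7) * z2
x = 6 * s**3 * (1 - s) * z3 + z1 * z2 * (1 - z3) + e2  # first stage, eq. (4.3)
y = 1.0 + 1.0 * x + e1                                 # outcome, theta = (1, 1)
```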

Table 2 reports the marginal likelihoods of all three models for two simulated samples. Note that the model with the valid instruments (M₃) is correctly specified, and it has the highest marginal likelihood, in conformity with our theory.

Table 2: Model comparison: IV regression example

                                  M₁          M₂          M₃
n = 500    Marginal Likelihood   -3160.65    -3130.36    -3118.76
                                 (0.032)     (0.123)     (0.004)
n = 2000   Marginal Likelihood   -15350.08   -15262.06   -15217.79
                                 (0.188)     (0.370)     (0.001)

Note: The posterior summaries are based on 20,000 MCMC draws beyond a burn-in of 1000. Numerical standard errors are in parentheses.

5 Application: Moment-based Causal Inference

An important application of our methods is to problems that arise in causal inference. For specificity, we consider here the estimation of causal parameters in the sharp regression-discontinuity (RD) design. Another example, average treatment effect (ATE) estimation under a conditional independence assumption, is deferred to the supplementary appendix.

RD-ATE in a sharp design. Suppose that the data arise from the following data generating mechanism:

Y = (1 − X)g₀(Z) + Xg₁(Z) + ε,

where X = 1{Z ≥ τ} and E_P[ε|Z] = 0. We define the RD-ATE as

RD-ATE = g₁(τ) − g₀(τ),

where g₀(τ) is the left limit of g₀(Z) and g₁(τ) is the right limit of g₁(Z) at the cutoff τ.

For illustrative purposes, suppose that

g₀(Z) = 0.5 + Z and g₁(Z) = 0.8 + 2Z,

where Z = 2(Z∗ − 1) with Z∗ ∼ 2·Beta(2, 4), and ε is independently drawn from SN(m(Z), h(Z), s(Z)) with m(Z) = −h(Z)√(2/π) s(Z)/√(1 + s(Z)²), h(Z) = 0.7(2 − Z²), and s(Z) = 3 + Z². Under this setup, the true value of the RD-ATE at the break-point (τ = 0) is 0.3. We estimate the RD-ATE with three different sample sizes, n = 500, 2000, 8000.

Our prior-posterior analysis is based on the conditional mean independence assumption E_P[ε|Z] = 0, without any further assumptions about ε. We estimate g₀(Z) and g₁(Z) separately for the data on either side of τ using the conditional moment restrictions E_P[Y − θ_{j0} − θ_{j1}Z | Z] = 0, where j = 0, 1. We use 5 knots to convert the conditional expectation into the expanded moment conditions when n = 500, 2000, and 10 knots when n = 8000. The prior on (θ₀₀, θ₀₁, θ₁₀, θ₁₁) is an independent Student-t prior with mean 0, dispersion 5, and degrees of freedom 2.5.
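The posterior RD-ATE computation can then be sketched as follows, reusing the natural_cubic_basis, expanded_moments, log_etel and rw_mh sketches from Section 3 (make_log_post is an illustrative helper, not the paper's code, and y, z denote the simulated sample).

```python
import numpy as np

def make_log_post(y, z, knots):
    """Log posterior for E[Y - th0 - th1*Z | Z] = 0 on one side of tau,
    combining the Student-t(2.5) prior (up to a constant) with log ETEL."""
    B = natural_cubic_basis(z, knots)
    def log_post(th):
        resid = y - th[0] - th[1] * z
        g = expanded_moments(resid[:, None], B)        # (n, K) expanded moments
        lp_prior = np.sum(-1.75 * np.log(1 + (th / 5.0) ** 2 / 2.5))
        return lp_prior + log_etel(g)[0]
    return log_post

tau = 0.0
qk = lambda v: np.quantile(v, np.linspace(0, 1, 5))    # 5 knots per side
left, right = z < tau, z >= tau
d0 = rw_mh(make_log_post(y[left],  z[left],  qk(z[left])),  np.zeros(2))
d1 = rw_mh(make_log_post(y[right], z[right], qk(z[right])), np.zeros(2))

# RD-ATE = g1(tau) - g0(tau) = (th10 + th11*tau) - (th00 + th01*tau)
rd_ate = (d1[:, 0] + d1[:, 1] * tau) - (d0[:, 0] + d0[:, 1] * tau)
print(rd_ate.mean(), np.quantile(rd_ate, [0.05, 0.95]))
```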

The results from this analysis are reported in Figure 1 and Table 3. The left panels of the figure show a scatter plot of the data and the estimated regression functions at the posterior mean of the parameters. The right panels of the figure show the histogram approximation to the posterior distribution of the RD-ATE. One can see that the posterior distribution puts high mass around the true RD-ATE value of 0.3, and that the posterior distribution shrinks around this value as n grows.


Figure 1: Panels (a) and (c) show (z_i, y_i) scatter plots for n = 500 and n = 8000; grey dots represent realizations of (z_i, y_i), and blue and red lines are g₀(z_i) and g₁(z_i) evaluated at the posterior mean. Panels (b) and (d) show the posterior distributions of the RD-ATE. Results are based on 20,000 MCMC draws beyond a burn-in of 1000.

Table 3: Posterior summaries for RD-ATE

            Mean   SD     Median  Lower  Upper  Ineff
n = 500     0.311  0.147  0.314   0.016  0.594  1.137
n = 2000    0.324  0.088  0.324   0.153  0.496  1.093
n = 8000    0.293  0.040  0.293   0.214  0.373  1.073

6 Conclusion

In this paper we have developed a Bayesian framework for analyzing an important and broad class of semiparametric models in which the distribution of the outcomes is defined only up to a set of conditional moments, some of which may be misspecified. We have derived Bernstein-von Mises theorems for the behavior of the posterior distribution under both correct and incorrect specification of the conditional moments, and developed the theory for comparing different conditional moment models through a comparison of model marginal likelihoods.


Our theory and examples, taken together, show that our framework makes possible the formal Bayesian analysis of a new, large class of problems that were hitherto difficult, or not possible, to tackle from the Bayesian viewpoint.

Supplementary Material

Technical proofs of all the results developed in the paper are in the supplementary appendix.

References

Ai, C. and Chen, X. (2003), 'Efficient estimation of models with conditional moment restrictions containing unknown functions', Econometrica 71(6), 1795–1843.
Bickel, P. J. and Kleijn, B. J. K. (2012), 'The semiparametric Bernstein-von Mises theorem', Annals of Statistics 40(1), 206–237.
Bierens, H. J. (1982), 'Consistent model specification tests', Journal of Econometrics 20, 105–134.
Carrasco, M. and Florens, J.-P. (2000), 'Generalization of GMM to a continuum of moment conditions', Econometric Theory 16(6), 797–834.
Chamberlain, G. (1987), 'Asymptotic efficiency in estimation with conditional moment restrictions', Journal of Econometrics 34(3), 305–334.
Chen, X., Christensen, T. and Tamer, E. T. (2018), 'Monte Carlo confidence sets for identified sets', Econometrica 86(6), 1965–2018.
Chib, S. (1995), 'Marginal likelihood from the Gibbs output', Journal of the American Statistical Association 90, 1313–1321.
Chib, S. and Greenberg, E. (1995), 'Understanding the Metropolis-Hastings algorithm', The American Statistician 49, 327–335.
Chib, S. and Greenberg, E. (2010), 'Additive cubic spline regression with Dirichlet process mixture errors', Journal of Econometrics 156, 322–336.
Chib, S. and Jeliazkov, I. (2001), 'Marginal likelihood from the Metropolis-Hastings output', Journal of the American Statistical Association 96, 270–281.
Chib, S. and Ramamurthy, S. (2010), 'Tailored randomized block MCMC methods with application to DSGE models', Journal of Econometrics 155(1), 19–38.
Chib, S., Shin, M. and Simoni, A. (2018), 'Bayesian estimation and comparison of moment condition models', Journal of the American Statistical Association 113, 1656–1668.
Csiszar, I. (1984), 'Sanov property, generalized I-projection and a conditional limit theorem', Annals of Probability 12(3), 768–793.
Donald, S. G., Imbens, G. W. and Newey, W. K. (2003), 'Empirical likelihood estimation and consistent tests with conditional moment restrictions', Journal of Econometrics 117(1), 55–93.
Donald, S. G., Imbens, G. W. and Newey, W. K. (2009), 'Choosing instrumental variables in conditional moment restriction models', Journal of Econometrics 152(1), 28–36.
Donald, S. G. and Newey, W. K. (2001), 'Choosing the number of instruments', Econometrica 69(5), 1161–1191.
Florens, J.-P. and Simoni, A. (2012), 'Nonparametric estimation of an instrumental regression: A quasi-Bayesian approach based on regularized posterior', Journal of Econometrics 170(2), 458–475.
Florens, J.-P. and Simoni, A. (2016), 'Regularizing priors for linear inverse problems', Econometric Theory 32(1), 71–121.
Florens, J.-P. and Simoni, A. (2019), 'Gaussian processes and Bayesian moment estimation', Journal of Business & Economic Statistics (forthcoming).
Hristache, M. and Patilea, V. (2017), 'Conditional moment models with data missing at random', Biometrika 104(3), 735–742.
Kato, K. (2013), 'Quasi-Bayesian analysis of nonparametric instrumental variables models', Annals of Statistics 41(5), 2359–2390.
Kitamura, Y. and Otsu, T. (2011), 'Bayesian analysis of moment condition models using nonparametric priors', Technical report.
Kitamura, Y., Tripathi, G. and Ahn, H. (2004), 'Empirical likelihood-based inference in conditional moment restriction models', Econometrica 72(6), 1667–1714.
Kleijn, B. and van der Vaart, A. (2012), 'The Bernstein-von Mises theorem under misspecification', Electronic Journal of Statistics 6, 354–381.
Lazar, N. A. (2003), 'Bayesian empirical likelihood', Biometrika 90(2), 319–326.
Liao, Y. and Jiang, W. (2011), 'Posterior consistency of nonparametric conditional moment restricted models', The Annals of Statistics 39(6), 3003–3031.
Liao, Y. and Simoni, A. (2019), 'Bayesian inference for partially identified smooth convex models', Journal of Econometrics 211(2), 338–360.
Newey, W. K. (1997), 'Convergence rates and asymptotic normality for series estimators', Journal of Econometrics 79(1), 147–168.
Rosenbaum, P. R. and Rubin, D. B. (1983), 'The central role of the propensity score in observational studies for causal effects', Biometrika 70(1), 41–55.
Schennach, S. M. (2005), 'Bayesian exponentially tilted empirical likelihood', Biometrika 92(1), 31–46.
Shin, M. (2014), 'Bayesian GMM', Technical report, University of Pennsylvania.
Sueishi, N. (2013), 'Identification problem of the exponential tilting estimator under misspecification', Economics Letters 118(3), 509–511.
Van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics.
White, H. (1982), 'Maximum likelihood estimation of misspecified models', Econometrica 50(1), 1–25.

23

Page 24: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Supplement to

Bayesian Estimation and Comparison of

Conditional Moment Models

Siddhartha Chib∗

Olin Business School

Minchul Shin†

FRB Philadelphia

Anna Simoni‡

CREST, CNRS

This version: October 31, 2019

Contents

D Notation 1

E Proofs for Section 3.3 2

E.1 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

E.2 The Misspecied case: Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . 8

F Proofs for Section 4 12

F.1 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

G Technical Lemmas 14

G.1 Technical Lemmas for the correctly specied case . . . . . . . . . . . . . . . . . . . . 15

G.2 Technical Lemmas for the misspecied case . . . . . . . . . . . . . . . . . . . . . . . 20

H Additional empirical example 31

D Notation

For each positive integer K let qK(Z) := (q1(Z), . . . , qK(Z))′ be a K-vector of approximating

functions. For every ε > 0 and a constant C > 0, let Θ(C, ε) := ‖θ − θ∗‖ ≤ Cε.∗Olin Business School, Washington University in St. Louis, Campus Box 1133, 1 Brookings Dr. St. Louis, MO

63130, USA, e-mail: [email protected]†Research Department, Federal Reserve Bank of Philadelphia, Ten Independence Mall, Philadelphia, PA 19106,

e-mail: [email protected]‡CREST, CNRS, École Polytechnique, ENSAE 5, avenue Henry Le Chatelier, 91120 Palaiseau France, e-mail:

[email protected]

1

Page 25: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

We denote: W := (X ′, Z ′)′, W1:n := (W1, . . . ,Wn) the n-sample of i.i.d. observations of W ,

g(W, θ) := ρ(X, θ)⊗ qK(Z) the expanded moment functions and gi(θ) = g(Wi, θ).

Denote p(W1:n|θ) :=∏ni=1 pi(θ),

`n,θ(Wi) := log pi(θ) = logeλ(θ)′g(Wi,θ)∑nj=1 e

λ(θ)′g(Wj ,θ)

where λ(θ) := arg minλ∈RdK1n

∑ni=1 e

λ′g(Wi,θ) is the estimated tilting parameter. Moreover,

τ(λ, θ,W ) := eλ(θ)′g(W,θ) and τn(λ, θ) := 1n

∑ni=1 τ(λ, θ,Wi). We use the notation En[·] := 1

n

∑ni=1[·]

for the empirical mean and E[·] for the population mean with respect to the true distribution P∗.

For a matrix A, we denote by λmin(A) and λmax(A) the minimum and maximum eigenvalue of A,

respectively.

Moreover, g(θ) := En[gi(θ)], ρθ(X, θ) := ∂ρ(X,θ)∂θ′ , G(θ) := En[G(Wi, θ)] with G(W, θ) :=

ρθ(X, θ)⊗ qK(Z) a dK × p matrix, G(θ) := En[τ(λ, θ,Wi)G(Wi, θ)], Ω(θ) := En[g(Wi, θ)g(Wi, θ)′]

a dK × dK matrix and Ω(θ) := En[τ(λ, θ,Wi)g(Wi, θ)g(Wi, θ)′]. Their population counterparts in

the correctly specied model are G∗ := E[ρθ(X, θ∗) ⊗ qK(Z)] and Ω∗ := E[g(Wi, θ∗)g(Wi, θ∗)′],

respectively. In addition Σ(Z) := E[ρ(X, θ∗)ρ(X, θ∗)′|Z], D(Z) := E[ρθ(X, θ∗)|Z], V −1

θ∗:=

EP [D(Z)′Σ(Z)−1D(Z)], and ρjθθ(x, θ∗) := ∂2ρj(x, θ)/∂θ∂θ′.

Finally, let CS, M, and MVT refer to the Cauchy-Schwartz, Markov, and Mean Value Theorem,

respectively.

E Proofs for Section 3.3

E.1 Proof of Theorem 3.1

The main steps of this proof proceed as in the proof of Van der Vaart (1998, Theorem 10.1)

while the proofs of the technical theorems and lemmas that we use all along this proof are new. Let

us consider a reparametrization of the model centred around the true value θ∗ and dene a local

parameter h =√n(θ − θ∗). Denote by πh and πh(·|W1:n) the prior and posterior distribution of h,

respectively. Denote by Φn the normal distribution N∆n,θ∗ ,Vθ∗and by φn its Lebesgue density. For

a compact subset K ⊂ Rp such that πh(h ∈ K|W1:n) > 0 dene, for any Borel set B ⊆ Ψ,

πhK(B|W1:n) :=πh(K ∩B|W1:n)

πh(K|W1:n)

and let ΦKn be the Φn distribution conditional onK. The proof consists of two steps. In the rst step

we show that the Total Variation (TV) norm of πhK(·|W1:n)− ΦKn converges to zero in probability.

In the second step we use this result to show that the TV norm of πh(·|W1:n) − Φn converges to

zero in probability.

2

Page 26: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Let Assumptions 3.3 (a) and 3.4 (a) hold. For every open neighborhood U ⊂ Θ of θ∗ and a

compact subset K ⊂ Rp, there exists an N such that for every n ≥ N :

θ∗ +K1√n⊂ U . (E.1)

Dene the function fn : K ×K → R as, ∀k1, k2 ∈ K:

fn(k1, k2) :=

(1− φn(k2)sn(k1)πh(k1)

φn(k1)sn(k2)πh(k2)

)+

where (a)+ := max(a, 0), here πh denotes the Lebesgue density of the prior πh for h and sn(h) :=

p(W1:n|θ∗ + h/√n)/p(W1:n|θ∗). The function fn is well dened for n suciently large because of

(E.1) and Assumption 3.7. Remark that by (E.1) and since the prior for θ puts enough mass on U ,then πh puts enough mass on K and as n → ∞: πh(k1)/πh(k2) → 1. Because of this and by the

stochastic LAN expansion in Lemma E.2 below:

logφn(k2)sn(k1)πh(k1)

φn(k1)sn(k2)πh(k2)= −1

2(k2 −∆n,θ∗)

′V −1θ∗

(k2 −∆n,θ∗) +1

2(k1 −∆n,θ∗)

′V −1θ∗

(k1 −∆n,θ∗)

+ k′1V−1θ∗

∆n,θ∗ −1

2k′1V

−1θ∗k1 − k′2V −1

θ∗∆n,θ∗ +

1

2k′2V

−1θ∗k2 + op(1) = op(1). (E.2)

Since, for every n, fn is continuous in (k1, k2) and K ×K is compact, then

supk1,k2∈K

fn(k1, k2)p→ 0, as n→∞. (E.3)

Suppose that the subset K contains a neighborhood of 0 (which guarantees that Φn(K) > 0 and

then that ΦKn is well dened) and let Ξn := πh(K|W1:n) > 0. Moreover, for a given η > 0 dene

the event Ωn :=

supk1,k2∈K fn(k1, k2) ≤ η. The TV distance ‖ · ‖TV between two probability

measures P and Q, with Lebesgue densities p and q respectively, can be expressed as: ‖P −Q‖TV =

2∫

(1− p/q)+dQ. Therefore, by the Jensen inequality and convexity of the functions (·)+,

1

2EP ‖πhK(·|W1:n)− ΦK

n ‖TV 1Ωn∩Ξn = EP

∫K

(1− dΦK

n (k2)

dπhK(k2|W1:n)

)+

dπhK(k2|W1:n)1Ωn∩Ξn

≤ EP∫K

∫K

fn(k1, k2)dΦKn (k1)dπhK(k2|W1:n)1Ωn∩Ξn

≤ EP supk1,k2∈K

fn(k1, k2)1Ωn∩Ξn ≤ η. (E.4)

Moreover, by remembering that ‖ · ‖TV is upper bounded by 2,

EP ‖πhK(·|W1:n) − ΦKn ‖TV 1Ξn ≤ EP ‖πhK(·|W1:n) − ΦK

n ‖TV 1Ωn∩Ξn + 2P (Ωcn ∩ Ξn), (E.5)

3

Page 27: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

where the rst term is upper bounded by 2η by (E.4) and the second term is o(1) by (E.3). In the

second step of the proof let Kn be a sequence of closed balls in the parameter space of h centred

at 0 with radii Mn → ∞ and redene Ξn accordingly. For each n ≥ 1, (E.5) holds for these balls.

Moreover, by (E.7) in Theorem E.1 below: P (Ξn)→ 1. Therefore, by the triangular inequality, the

TV distance is upper bounded by

EP ‖πh(·|W1:n)− Φn‖TV = EP ‖πh(·|W1:n)− Φn‖TV 1Ξn + EP ‖πh(·|W1:n)− Φn‖TV 1Ξcn

≤ EP ‖πh(·|W1:n)− πhKn(·|W1:n)‖TV + EP ‖πhKn(·|W1:n)− ΦKnn ‖TV 1Ξn

+ EP ‖ΦKnn − Φn‖TV + 2P (Ξcn)

≤ 2EP (πhKcn(·|W1:n)) + EP ‖πhKn(·|W1:n)− ΦKn

n ‖TV 1Ξn + op(1)p→ 0

since EP ‖πhKn(·|W1:n)−ΦKnn ‖TV 1Ξn = op(1) by (E.5) and (E.4), and where in the third line we have

used the fact that: EP ‖πh(·|W1:n) − πhKn(·|W1:n)‖TV = 2EP (πhKcn(·|W1:n)) and ‖ΦKn

n − Φn‖TV =

‖ΦKcn

n ‖TV = op(1) by Kleijn and van der Vaart (2012, Lemma 5.2) since ∆n,θ∗ is uniformly tight.

The next theorem establishes that the posterior of θ concentrates and puts all its mass on

Θn := ‖θ − θ∗‖ ≤Mn/√n as n→∞.

Theorem E.1 ((Posterior Consistency)). Let the Assumptions of Lemma E.2 and Assumption 3.7

hold. Moreover, assume that there exists a constant C > 0 such that for any sequence Mn →∞,

P

(sup

‖θ−θ∗‖>Mn/√n

1

n

n∑i=1

(`n,θ(Wi)− `n,θ∗(Wi)) ≤ −CM2n/n

)→ 1, (E.6)

as n→∞. Then,

π(√

n‖θ − θ∗‖ > Mn

∣∣W1:n

) p→ 0 (E.7)

for any Mn →∞, as n→∞.

Proof. Dene the events An,1 :=

sup‖θ−θ∗‖>Mn/√n

1n

∑ni=1 (`n,θ(Wi)− `n,θ∗(Wi)) ≤ −CM2

n/n

and

An,2 :=

∫Θ

p(W1:n|θ)p(W1:n|θ∗)

π(θ)d(θ) ≥ e−CM2n/2

.

By (E.6), P (Acn,1)→ 0 and by Lemma E.1 below, P (Acn,2)→ 0. Therefore,

EP[π(√

n‖θ − θ∗‖ > Mn

∣∣x1:n

)]≤ EP

[π(√

n‖θ − θ∗‖ > Mn

∣∣W1:n

)∣∣An,1 ∩An,2]P (An,1 ∩An,2) + o(1)

= EP

∫Θcn

p(W1:n|θ)p(W1:n|θ∗)π(θ)dθ∫

Θp(W1:n|θ)p(W1:n|θ∗)π(θ)dθ

∣∣∣∣∣∣An,1 ∩An,2P (An,1 ∩An,2) + o(1)

4

Page 28: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

≤ e−CM2nπ(Θc

n)EP

[(∫Θ

p(W1:n|θ)p(W1:n|θ∗)

π(θ)dθ

)−1∣∣∣∣∣An,1 ∩An,2

]P (An,1 ∩An,2) + o(1)

≤ e−CM2neCM

2n/2π(Θc

n)P (An,1 ∩An,2) + o(1) = o(1) (E.8)

which proves the result of the theorem.

Lemma E.1. Let the Assumptions of Lemma E.2 hold and suppose Assumptions 3.7 is satised.

Then,

P

(∫Θ

p(W1:n|θ)p(W1:n|θ∗)

π(θ)dθ < an

)→ 0 (E.9)

for every sequence an → 0.

Proof. For a given M > 0 dene C = h ∈ Rp : ‖h‖ ≤ M. Denote by h 7→ Rem(h) the

remaining term in (E.12) and remark that suph∈C |Rem(h)| p→ 0 by the result in Lemma E.2

and compactness of C. Therefore, for a sequence κn that converges to zero slowly enough, the

corresponding event Bn := suph∈C |Rem(h)| ≤ κn has probability P (Bn)→ 1. Consider the local

parameter h =√n(θ− θ∗) and denote by πh both its prior distribution and prior Lebesgue density

which exists under Assumption 3.7 (a). By making the change of variable θ 7→ θ∗ + h/√n so that

π(θ)dθ = πh(h)dh, we upper bound the probability in (E.9) as follows: for a sequence Kn →∞,

P

(∫Θ

p(W1:n|θ)p(W1:n|θ∗)

π(θ)dθ < e−K2n

)≤ P

(∫C

p(W1:n|θ∗ + h/√n)

p(W1:n|θ∗)πh(h)dh < e−K

2n

)= P

(∫C

e∑ni=1(`n,θ∗+h/

√n−`n,θ∗ )πh(h)dh < e−K

2n

∩Bn

)+ op(1). (E.10)

By replacing the LAN expansion (E.12) and by noting that for n suciently large, κn ≤ 12K

2n on Bn

and suph∈C h′V −1θ∗h ≤ suph∈C ‖h‖2‖V −1

θ∗‖ ≤ M2‖V −1

θ∗‖ ≤ 1

2K2n (where ‖V −1

θ∗‖ denotes the operator

norm) we obtain:

P

(∫C

e∑ni=1(`n,θ∗+h/

√n−`n,θ∗ )πh(h)dh < e−K

2n

∩Bn

)≤ P

(∫C

eh′V −1θ∗ ∆n,θ∗πh(h)dh < e−3K2

n/4

)+op(1)

= P

(∫C

eh′V −1θ∗ ∆n,θ∗πh(h|C)dh < e− log πh(C)e−K

2n/4

)+ op(1)

≤ P(

exp

∫C

h′V −1θ∗

∆n,θ∗πh(h|C)dh

< eK

2n/8e−K

2n/4

)+ op(1)

≤ P(∫

C

h′V −1θ∗

∆n,θ∗πh(h|C)dh < −K2

n/8

)+ op(1)

≤ 64

K4n

EP(∫

C

(h′V −1

θ∗∆n,θ∗

)2πh(h|C)dh

)+ op(1) ≤ 64

K4n

M2V −1θ∗

+ op(1)→ 0 (E.11)

5

Page 29: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

where in the second inequality we have used that, for n large enough, − log πh(C) ≤ K2n/8 and the

Jensen's inequality. In the last line we have used the Markov's inequality and then the Jensen's

inequality. The result follows by Assumption 3.7 (b) and because V −1θ∗

∆n,θ∗d→ N (0, V −1

θ∗).

Lemma E.2 ((Stochastic LAN)). Let Assumptions 3.1, 3.2, 3.3, 3.5 and 3.6 be satised and assume

ζ(K)K2/√n→ 0. Let H denote a compact subset of Rp. Then,

suph∈H

∣∣∣∣∣n∑i=1

`n,θ∗+h/√n(Wi)−

n∑i=1

`n,θ∗(Wi)− h′V −1θ∗

∆n,θ∗ −1

2h′V −1

θ∗h

∣∣∣∣∣ = op(1) (E.12)

where V −1θ∗

∆n,θ∗d→ N (0, V −1

θ∗) and 1√

n

∑ni=1

d`n,θ∗ (Wi)dθ − V −1

θ∗∆n,θ∗

p→ 0.

Proof. Dene τi(λ, θ) := eλ′gi(θ)

En[eλ′gj(θ)]

, Ω(θ, λ) := En[τi(λ, θ)gi(θ)gi(θ)′] and G(θ, λ) :=

En[τi(λ, θ)G(Wi, θ)]. By the MVT expansion of∑n

i=1 `n,θ(Wi) around λ(θ) = 0, there exists a

λθ lying on the line between λ(θ) and zero such that:

n∑i=1

`n,θ(Wi) =

n∑i=1

log τi(λ,θ)− n log(n) = −n log(n) + ng(θ)′λ(θ)− ng(θ)′λ(θ)

− 1

2nλ(θ)′Ω(θ, λθ)λ(θ) +

1

2n∣∣∣En[τi(λθ, θ)gi(θ)

′]λ(θ)∣∣∣2 . (E.13)

By expanding the rst order condition for λ(θ) around λ(θ) = 0, there exists a λθ lying on the line

between λ(θ) and zero such that : g(θ) + Ω(θ, λθ)λ(θ) = 0 which gives λ(θ) = Ω(θ, λθ)−1g(θ). By

replacing this in (E.13) we obtain:

n∑i=1

`n,θ(Wi) = −n log(n)− 1

2ng(θ)′Ω(θ, λθ)

−1Ω(θ, λθ)Ω(θ, λθ)−1g(θ)

+1

2n∣∣∣En[τi(λθ, θ)gi(θ)

′]Ω(θ, λθ)−1g(θ)

∣∣∣2 . (E.14)

Hence, by replacing in∑n

i=1 `n,θ∗+hn(Wi) the following MVT expansion g(θ∗ + h/√n) = g(θ∗) +

G(θ)h/√n, for θ lying between θ∗ + h/

√n and θ∗, and by denoting θ1 := θ∗ + hn, hn := h/

√n we

get

n∑i=1

`n,θ∗+hn(Wi)−n∑i=1

`n,θ∗(Wi)

=1

2ng(θ∗)

′[Ω(θ∗, λθ∗)

−1Ω(θ∗, λθ∗)Ω(θ∗, λθ∗)−1 − Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1

]g(θ∗)

−√ng(θ∗)

′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

6

Page 30: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

− 1

2h′G(θ)′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

+1

2n∣∣∣En[τi(λθ1 , θ1)gi(θ1)′]Ω(θ1, λθ1)−1g(θ1)

∣∣∣2 − 1

2n∣∣∣En[τi(λθ∗ , θ∗)gi(θ∗)

′]Ω(θ∗, λθ∗)−1g(θ∗)

∣∣∣2 .(E.15)

By using the equality A−1BA−1−C−1DC−1 = A−1(B−D)A−1+(A−1−C−1)DA−1+C−1D(A−1−C−1) for matrices A,B,C,D we can write

n∑i=1

`n,θ∗+hn(Wi)−n∑i=1

`n,θ∗(Wi)

=1

2ng(θ∗)

′Ω(θ∗, λθ∗)−1[Ω(θ∗, λθ∗)− Ω(θ1, λθ1)

]Ω(θ∗, λθ∗)

−1g(θ∗)

+1

2ng(θ∗)

′ [Ω(θ∗, λθ∗)−1 − Ω(θ1, λθ1)−1

]Ω(θ1, λθ1)Ω(θ∗, λθ∗)

−1g(θ∗)

+1

2ng(θ∗)

′ [Ω(θ∗, λθ∗)−1 − Ω(θ1, λθ1)−1

]Ω(θ1, λθ1)Ω(θ1, λθ1)−1g(θ∗)

−√ng(θ∗)

′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

− 1

2h′G(θ)′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

+1

2n∣∣∣En[τi(λθ1 , θ1)gi(θ1)′]Ω(θ1, λθ1)−1g(θ1)

∣∣∣2 − 1

2n∣∣∣En[τi(λθ∗ , θ∗)gi(θ∗)

′]Ω(θ∗, λθ∗)−1g(θ∗)

∣∣∣2 .(E.16)

Let us analyse the rst three terms in (E.16). Since λθ∗ , λθ1 , λθ1 ∈ Λn, where Λn is as dened in

Lemma G.2, in the following we can use the results in Lemma G.4 to get Ω(θ∗, λθ∗)−1 ≤ CΩ(θ∗)

−1,

Ω(θ1, λθ1)−1 ≤ CΩ(θ1)−1 and Ω(θ1, λθ1)−1 ≤ CΩ(θ1)−1 with probability approaching 1 for any

1 < C <∞. We start from the rst term:

suph∈H

Rn,1(h) :=1

2suph∈H

∣∣∣ng(θ∗)′Ω(θ∗, λθ∗)

−1[Ω(θ∗, λθ∗)− Ω(θ1, λθ1)

]Ω(θ∗, λθ∗)

−1g(θ∗)∣∣∣

≤ 1

2‖Ω(θ∗, λθ∗)

−1√ng(θ∗)‖2 suph∈H‖Ω(θ∗, λθ∗)− Ω(θ1, λθ1)‖

≤(

min1≤i≤n

τi(λθ∗ , θ∗)

)−2

‖Ω(θ∗)−1√ng(θ∗)‖2Op

(ζ(K)K/

√n)

= Op(ζ(K)K2/

√n)

by using the rst result in Lemma G.7 and because ‖Ω(θ∗)−1√ng(θ∗)‖ = ‖Ω−1

∗√ng(θ∗)‖ with

probability approaching 1 by Donald et al. (2003, Lemma A.6) and ‖Ω−1∗√ng(θ∗)‖ = Op(

√K) by

M. For the second term we use the identity (A−1−B−1) = A−1(B−A)B−1 for two matrices A,B,

and again the rst result in Lemma G.7:

suph∈H

Rn,2(h) :=1

2suph∈H

∣∣∣ng(θ∗)′Ω(θ∗, λθ∗)

−1[Ω(θ1, λθ1)− Ω(θ∗, λθ∗)

−1]

Ω(θ1, λθ1)−1

7

Page 31: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

× Ω(θ1, λθ1)Ω(θ∗, λθ∗)−1g(θ∗)

∣∣∣≤ 1

2‖Ω(θ∗, λθ∗)

−1√ng(θ∗)‖2Op(ζ(K)K/

√n)

= Op(ζ(K)K2/

√n).

The third term can be treated in a similar way and gives the same rate.

Next, we analyze the last two terms in (E.16). We use again Lemma G.4. Therefore, because

‖Ω(θ1, λθ1)−1g(θ∗)‖ = ‖Ω−1∗√ng(θ∗)‖ with probability approaching 1 by Donald et al. (2003, Lemma

A.6) and ‖Ω−1∗√ng(θ∗)‖ = Op(

√K) by M.

suph∈H

1

2n∣∣∣En[τi(λθ1 , θ1)gi(θ1)′]Ω(θ1, λθ1)−1g(θ1)

∣∣∣2≤ 1

2CEn[τi(λθ1 , θ1)gi(θ1)′]2‖Ω(θ1, λθ1)−1g(θ1)‖2 = Op(K

2/n) (E.17)

where we have used the MVT expansion g(θ1) = g(θ∗) + G(θ)h/√n for a θ lying between θ1 and θ∗

and the result of Lemma G.5. We conclude that

suph∈H

∣∣∣ n∑i=1

`n,θ∗+hn(Wi)−n∑i=1

`n,θ∗(Wi)−√ng(θ∗)

′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

− 1

2h′G(θ)′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

∣∣∣= Op

(ζ(K)K2/

√n)

+Op(K/√n) = Op

(ζ(K)K2/

√n). (E.18)

Under the assumptions of the theorem the term Op(ζ(K)K2/

√n)converges to zero. Moreover, by

Lemma G.7 and ζ(K)K/√n→ 0:

−√ng(θ∗)

′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h

= −√ng(θ∗)

′Ω(θ∗)−1G(θ∗)h+ op(1)

= − h′√n

n∑i=1

D(Zi)′Σ(Zi)

−1ρ(Xi, θ∗)′ + op(1)

where the op(1) term is uniform in h ∈ H and where to get the second equality we have used

arguments similar to the ones in Donald et al. (2003, Proof of Theorem 5.6). By the Lindberg-Levy

central limit theorem, ∆n,θ∗ := − 1√n

∑ni=1 Vθ∗D(Zi)

′Σ(Zi)−1ρ(Xi, θ∗)

d→ N (0, Vθ∗). Similarly as in

Donald et al. (2003, Proof of Theorem 5.6) it is possible to show that (by using compactness of H)

h′G(θ)′Ω(θ1, λθ1)−1Ω(θ1, λθ1)Ω(θ1, λθ1)−1G(θ)h = h′Vθ∗h+ op(1)

8

Page 32: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

where the op(1) term is uniform in h ∈ H. Finally, remark that

1√n

n∑i=1

d`n,θ∗(Wi)

dθ− V −1

θ∗∆n,θ∗

p→ 0.

This establishes the result of the Lemma.

E.2 The Misspecied case: Proof of Theorem 3.2

For the misspecied case we use the following notation: Gi(θ) := G(Wi, θ), G := E[Gi(θ)],

G := E[τi(λ(θ), θ)Gi(θ)], Ω := E[τ(λ(θ), θ,Wi)gi(θ)gi(θ)′], τi(λ, θ) := eλ

′gi(θ)

En[eλ′gj(θ)]

,

G(λ, θ) := En[τi(λ, θ)Gi(θ)], Ω(θ, λ) := En[τ(λ, θ,Wi)gi(θ)gi(θ)′]. We also use standard no-

tation in empirical process theory: Pn := En[δXi ] where δx is the Dirac measure at x, and

Gng :=√n(Pnf −EP f) for every function f .

The proof of Theorem 3.2 proceeds as the proof of Theorem 3.1 and so we omit it. The only

dierences between the proofs of the two theorems consist in replacing Lemma E.2 with Lemma

E.4 and Theorem E.1 with Theorem E.2. In the following of this section we prove Lemma E.4 and

state Theorem E.2.

We omit the proof of Theorem E.2 because it proceeds as the proof of Theorem E.1 with Lemma

E.1 replaced by Lemma E.3. The proof of Lemma E.3 is the same as the proof of Lemma E.1 with

the LAN expansion (E.12) replaced by the LAN expansion (E.22) and then it is also omitted.

Theorem E.2 (Posterior Consistency). Let the Assumptions of Lemma E.4 and Assumption 3.9

hold. Moreover, assume that there exists a constant C > 0 such that for any sequence Mn →∞,

P

(sup

‖θ−θ‖>Mn/√n

1

n

n∑i=1

(`n,θ(Wi)− `n,θ(Wi)) ≤ −CM2n/n

)→ 1, (E.19)

as n→∞. Then,

π(√

n‖θ − θ‖ > Mn

∣∣W1:n

) p→ 0 (E.20)

for any Mn →∞, as n→∞.

Lemma E.3. Let the Assumptions of Lemma E.4 hold and suppose Assumptions 3.9 is satised.

Then,

P

(∫Θ

p(W1:n|θ)p(W1:n|θ)

π(θ)dθ < an

)→ 0 (E.21)

for every sequence an → 0.

9

Page 33: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Lemma E.4 (Stochastic LAN). Let Assumptions 3.1, 3.2, 3.6 and 3.10 hold and assume

ζ(K)K2√K/n → 0. Let H denote a compact subset of Rp and θ1 := θ∗ + h/

√n with h ∈ H.

Then,

suph∈H

∣∣∣∣∣n∑i=1

`n,θ1(Wi)−n∑i=1

`n,θ(Wi)− h′Aθ∆n,θ0 −1

2h′Aθh

∣∣∣∣∣ = op(1) (E.22)

where ∆n,θ0 is a random vector bounded in probability and Aθ is a nonsingular matrix.

Proof. We have to analyse the dierence∑n

i=1 `n,θ1(Wi) −∑n

i=1 `n,θ(Wi). Because Wi are i.i.d.

then E[gi(θ)] = E[gj(θ)], and so we can write:

n∑i=1

`n,θ(Wi) =n∑i=1

log τi(λ, θ)− n log(n) =n∑i=1

logeλ′(gi(θ)−E[gi(θ)])

En[eλ′(gj(θ)−E[gj(θ)])]

− n log n

= nλ(θ)′En(gi(θ)−E[gi(θ)])− n logEn[eλ(θ)′(gj(θ)−E[gj(θ)])]− n log(n). (E.23)

Denote gi(θ) := gi(θ) − E[gi(θ)] and Gi(θ) := Gi(θ) − E[Gi(θ)], so that∑n

i=1 `n,θ(Wi) =

nλ(θ)′En[gi(θ)] − n logEn[eλ(θ)′gi(θ)] − n log(n). By the MVT there exists a t ∈ [0, 1] such that

θ := θ + th/√n satises

n∑i=1

`n,θ1(Wi) =

n∑i=1

`n,θ(Wi) +h′√n

n∑i=1

˙n,θ(Wi) +

1

2

h′√n

n∑i=1

¨n,θ

(Wi)h√n

(E.24)

where

˙n,θ(Wi) :=

n∑i=1

d`n,θ1(Wi)

∣∣∣∣∣θ1=θ

=dλ(θ)

dθgi(θ) +Gi(θ)

′λ(θ)

+dλ(θ)

dθE[gi(θ)]−En

[τi(λ(θ), θ)Gi(θ)

′]λ(θ)

=dλ(θ)

dθgi(θ) +Gi(θ)

′λ(θ) +Gi(θ)′(λ(θ)− λ(θ)) +

(dλ(θ)

dθ− dλ(θ)

)E[gi(θ)]

−En

[(τi(λ(θ), θ)− τi(λ, θ)

)Gi(θ)

′]λ(θ)−En

[τi(λ, θ)Gi(θ)

′ − G′]λ(θ)

−En[τi(λ, θ)Gi(θ)′ −G′](λ(θ)− λ(θ)) (E.25)

with

dλ(θ)′

dθ= −En

[τi(λ, θ)Gi(θ)

′(I + λ(θ)gi(θ)′)] (

En[τi(λ, θ)gi(θ)gi(θ)′])−1

,

dλ(θ)′

dθ= −E

[τ(λ, θ,Wi)Gi(θ)

′(I + λ(θ)gi(θ)′)]

Ω−1 (E.26)

10

Page 34: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

and where we have used the rst order condition of the pseudo true value θ, that is:

dλ(θ)′

dθE[gi(θ)] +G′λ(θ)−

dλ(θ)′

dθE[τi(λ, θ)gi(θ)]−E[τi(λ, θ)Gi(θ)

′]λ(θ) = 0

and E[τi(λ, θ)gi(θ)] = 0 because it is the rst order condition for λ. Moreover,

¨n,θ

(Wi) :=

n∑i=1

d2`n,θ1(Wi)

dθdθ′

∣∣∣∣∣θ1=θ

=

dK∑j=1

d2λ(θ)′

dθdθ′gi,j(θ) +

dλ(θ)′

dθGi(θ) +Gi(θ)

′dλ(θ)′

+dK∑j=1

d2gi,j(θ)

dθdθ′λj(θ) + En

[Gi(θ)

′λ(θ)dτi(λ(θ), θ)

dλ′

dλ(θ)

dθ′

]+ En

[τi(λ(θ), θ)

dθλ(θ)′Gi(θ)

]

+ En

τi(λ(θ), θ)

dK∑j=1

d2gi,j(θ)

dθdθ′

λj(θ) + En

[τi(λ(θ), θ)Gi(θ)

′] dλ(θ)

dθ′. (E.27)

We start with analyzing term (E.25). First, remark that by Lemma G.12 it holds: h′(dλ(θ)′

dθ −dλ(θ)′

dθ )√nEn[gi(θ)] = op(1) uniformly in h ∈ H. Next, we analyse the third term in (E.25). By a

MVT expansion of the rst order condition for λ(θ) there exists τ ∈ [0, 1] such that λτ := τ(λ(θ)−λ(θ))+λ(θ) satisesEn[eλ(θ)′gi(θ)gi(θ)] = 0 = En[eλ(θ)

′gi(θ)gi(θ)]+Ω(θ, λτ )(λ(θ)−λ(θ))which implies:

(λ(θ)− λ(θ)) = −Ω(θ, λτ )−1En[eλ(θ)′gi(θ)gi(θ)]. (E.28)

Therefore,

h′1√n

n∑i=1

Gi(θ)′(λ(θ)− λ(θ)) = −

√nEn[eλ(θ)

′gi(θ)gi(θ)′]Ω(θ, λτ )−1En[Gi(θ)]h

= −Gn[eλ(θ)′gi(θ)gi(θ)

′]Ω(θ, λτ )−1En[Gi(θ)]h = Op(K/√n).

Here, to get the term Op(K/√n) we have used the inequality

suph∈H

E∣∣∣Gn[eλ(θ)

′gi(θ)gi(θ)′]Ω(θ, λτ )−1En[Gi(θ)]h

∣∣∣≤ C−2 sup

h∈H

√E∥∥Gn[eλ(θ)′gi(θ)gi(θ)′]

∥∥2√

E∥∥En[Gi(θ)]h

∥∥2= Op(

√K√K/n)

for which we have used Lemma G.9 and Assumption 3.10 (d) and (f). To control the fourth term

in (E.25) we use Assumption 3.10 (h). We now control the fth term in (E.25). For this, we use

again (E.28) and a MVT expansion of τi(λ(θ), θ) around λ(θ):

√nλ(θ)

′En[(τi(λ(θ), θ)− τi(λ(θ), θ)

)Gi(θ)]h

11

Page 35: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

=√n(λ(θ)− λ(θ))′En[

∂τi(λt, θ)

∂λλ(θ)

′Gi(θ)]h

= −√nEn[eλ(θ)

′gi(θ)gi(θ)′]Ω(θ, λτ )−1En[

∂τi(λt, θ)

∂λλ(θ)

′Gi(θ)]h

= −Gn[eλ(θ)′gi(θ)gi(θ)

′]Ω(θ, λτ )−1En[∂τi(λt, θ)

∂λλ(θ)

′Gi(θ)]h+ op(1) (E.29)

where λt = t(λ(θ)− λ(θ)) + λ(θ) for some t ∈ [0, 1]. To control the last term in (E.25) we use

again (E.28) to get

− h′√nEn[τi(λ, θ)Gi(θ)

′ −G′](λ(θ)− λ(θ))

= h′En[τi(λ, θ)Gi(θ)′ −G′]Ω(θ, λτ )−1Gn[eλ(θ)

′gi(θ)gi(θ)].

By putting together these arguments we get:

h√n

n∑i=1

˙n,θ(Wi) = h′Gn(Ln,θ(Wi)) +

√nh′

(dλ(θ)

dθ− dλ(θ)

)E[gi(θ)]

+ Gn[eλ(θ)′gi(θ)gi(θ)

′]Ω(θ, λτ )−1En

[∂τi(λt, θ)

∂λλ(θ)

′Gi(θ)

]h

− h′Gn(τi(λ(θ), θ)Gi(θ)′)λ(θ)

+ h′En[τi(λ, θ)Gi(θ)′ −G′]Ω(θ, λτ )−1Gn[eλ(θ)

′gi(θ)gi(θ)] + op(1) =: h′Aθ∆n,θ0 + op(1)

(E.30)

where the op(1) is uniform in h ∈ H, and Ln,θ(Wi) := ddθLn,θ(Wi)

∣∣θ=θ

with Ln,θ(Wi) :=

log(dQ∗(θ)/dP∗)(Wi) = log eλ(θ)′gi(θ)

EP [eλ(θ)′gi(θ)]

. Moreover, as shown above ∆n,θ0 = Op(1) and Aθis dened below.

We now analyse the limit of (E.27). For this, we use Lemma G.12, the fact that

(λ(θ)− λ(θ)) = −Ω(θ, λτ )−1En[eλ(θ)′gi(θ)gi(θ)]

as shown above, and the fact that λ(θ)′ − λ(θ)′ = (θ − θ0)dλ(θ2)′

dθ for θ2 = θ + th/√n and some

t ∈ [0, 1], to get

h′1

n

n∑i=1

¨n,θ

(Wi)h = h′dK∑j=1

d2λ(θ)′

dθdθ′E[gi,j(θ)]h+ h′

dλ(θ)′

dθE[Gi(θ)]h

+ h′E

[Gi(θ)

′λ(θ)dτi(λ(θ), θ)

dλ′

dλ(θ)

dθ′

]h+ h′E

[dτi(λ(θ), θ)

dθλ(θ)

′Gi(θ)

]h

12

Page 36: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

+h′E

τi(λ(θ), θ) dK∑j=1

d2gi,j(θ)

dθdθ′

λ,j(θ)+h′E [τi(λ(θ), θ)Gi(θ)′] dλ(θ)dθ′

h+op(1) =: h′Aθh

(E.31)

where the op(1) is uniform in h ∈ H. By replacing (E.30) and (E.31) in (E.24) we get the result of

the Lemma.

F Proofs for Section 4

F.1 Proof of Theorem 4.1

We can write log p(W1:n|θ`;M`) = −n log n+ n log L(θ`) where

L(θ`) := expλ(θ`)′gi(θ`)

[1

n

n∑i=1

expλ(θ`)′gi(Wi, θ`)

]−1

and L(θ`) = expλ(θ`)′EP [g(W, θ`)](EP[expλ(θ`)′g(W, θ`)

])−1. Then, we have:

P

(logm(W1:n;Mj) > max

6=jlogm(W1:n;M`)

)= P

(n log L(θj)+log π(θj|Mj)−log π(θj|W1:n,Mj)

> max`6=j

[n log L(θ`) + log π(θ`|M`)− log π(θ`|W1:n,M`)])

= P(n logL(θj) + n log

L(θj)

L(θj)+ Bj > max

`6=j

[n logL(θ`) + B` + n log

L(θ`)

L(θ`)

])(F.1)

where ∀`, B` := log π(θ`|M`) − log π(θ`|W1:n,M`) and B` = Op(1) under the as-

sumptions of Theorem 3.2. By denition of dQ∗(θ) in Section 3.4 we have that:

logL(θ`) = EP [log dQ∗(θ`)/dP ] = −EP [log dP/dQ∗(θ`)] = −K(P ||Q∗(θ`)). Remark that

EP [log(dP/dQ∗(θ2))] > EP [log(dP/dQ∗(θ1

))] means that the KL divergence between P and Q∗(θ`),

is smaller for model M1 than for model M2, where Q∗(θ`) minimizes the KL divergence between

Q ∈ Pθ` and P for ` ∈ 1, 2 (notice the inversion of the two probabilities).

First, suppose that min 6=j EP[log(dP/dQ∗(θ`)

)]> EP

[log(dP/dQ∗(θj)

)]. By (F.1):

P

(logm(W1:n;Mj) > max

6=jlogm(W1:n;M`)

)≥

P

logL(θj)

L(θj)−max

6=jlog

L(θ`)

L(θ`)+

1

n(Bj −max

`6=jB`) > max

` 6=jlogL(θ`)− logL(θj)︸ ︷︷ ︸

=:In

. (F.2)

13

Page 37: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

This probability converges to 1 because In = K(P ||Q∗(θj))−min`6=jK(P ||Q∗(θ`)) < 0 by assump-

tion, and[log L(θ`)− logL(θ`)

]p→ 0, for every θ` ∈ Θ` and every ` ∈ 1, 2 by Lemma G.10 below

and by K/√n→ 0.

To prove the second direction of the statement, suppose that limn→∞ P (logm(W1:n;Mj) >

max 6=j logm(W1:n;M`)) = 1. By (F.1) it holds, ∀` 6= j

P

(logm(W1:n;Mj) > max

6=jlogm(W1:n;M`)

)≤

P(

logL(θj)

L(θj)− log

L(θ`)

L(θ`)+

1

n(Bj − B`) > log

L(θ`)

L(θj)

). (F.3)

Convergence to 1 of the left hand side implies convergence to 1 of the right hand side which is

possible only if logL(θ`)− logL(θj) < 0. Since this is true for every model `, then this implies that

K(P ||Q∗(θj)) < min 6=jK(P ||Q∗(θ`)) which concludes the proof.

G Technical Lemmas

Lemma G.1. If Assumptions 3.2 and 3.6 are satised then

maxi≤n

supθ∈Θ‖ρ(Xi, θ)‖ = Op(n

1/γ)

and

maxi≤n

supθ∈Θ‖g(Wi, θ)‖ = Op(n

1/γζ(K)). (G.1)

Proof. To get the rst conclusion, let bi = supθ∈Θ ‖ρ(Xi, θ)‖, so for every γ > 1, P (maxi≤n bi >

ε) = P (maxi≤n bγi > εγ) ≤ P (

∑i≤n b

γi > εγ). By the Markov's inequality this is upper bounded by

E[∑

i≤n bγi ]/εγ which, under Assumption 3.6 is bounded by Cn/εγ for a constant 0 < C <∞. So,

maxi≤n bi = Op(n1/γ). The second result follows from the rst result, the inequality

maxi≤n

supθ∈Θ‖g(Wi, θ)‖ ≤ max

i≤nsupθ∈Θ‖ρ(Xi, θ)‖max

i≤n‖qK(Zi)‖

and Assumption 3.2.

Lemma G.2. If Assumptions 3.2 and 3.6 is satised then for any sequence δn = o(n−1/γζ(K)−1)

14

Page 38: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

and Λn := λ ∈ RdK ; ‖λ‖ ≤ δn we have

maxi≤n,λ∈Λn

supθ∈Θ|λ′g(Wi, θ)| → 0

in P∗-probability.

Proof. By using the second result of Lemma G.1 we get

maxi≤n,λ∈Λn

supθ∈Θ|λ′g(Wi, θ)| ≤ max

i≤n,λ∈Λnsupθ∈Θ‖λ‖ ‖g(Wi, θ)‖ = Op(δnn

1/γζ(K))

= Op(δnn1/γζ(K))

which converges to zero.

G.1 Technical Lemmas for the correctly specied case

Lemma G.3. Assume that Σ(Z) is bounded and let λ(θ) = argminλ∈RdKEn[eλ′g(Wi,θ)]. If Assump-

tions 3.2, 3.3, 3.5 (b) and 3.6 are satised, then

‖λ(θ∗)‖ = Op(√K/n) (G.2)

and

supθ∈Θ

1

n

n∑i=1

eλ(θ)′g(Wi,θ) ≥ 1

n

n∑i=1

eλ(θ∗)′g(Wi,θ∗) ≥ 1 +Op(K/n). (G.3)

Proof. Choose a sequence δn = o(n−1/γζ(K)−1) and√K/n = o(δn), which is possible by

ζ(K)2K/n1−2γ → 0. Let λ be on the line joining λ(θ) and 0. By a second order Mean Value

expansion around λ(θ) = 0, we have ∀θ ∈ Θ:

1 ≥ En

[eλ(θ)′g(Wi,θ)

]= 1 + λ(θ)′En [g(Wi, θ)] +

1

2λ(θ)′En

[eλ(θ)′g(Wi,θ)g(Wi, θ)g(Wi, θ)

′]λ(θ)

≥ 1− ‖λ(θ)‖ ‖En [g(Wi, θ)]‖+ min1≤i≤n

e−|λ(θ)′g(Wi,θ)|λmin(Ω(θ))‖λ(θ)‖2. (G.4)

By Donald et al. (2003, Lemma A.6): λmin(Ω(θ∗)) ≥ 1/C with probability approaching 1.

Let Λn be as dened in Lemma G.2 and consider λ(θ∗) := arg minλ∈Λn En

[eλ′g(Wi,θ∗)

]. Let

λ(θ∗) be on the line joining λ(θ∗) and zero. Then, by Lemma G.2: mini≤n e−|λ(θ∗)′g(Wi,θ)| ≥

e−maxi≤n,λ∈Λn supθ∈Θ |λ(θ)′g(Wi,θ)| ≤ C for some nite constant C > 0. Therefore, after simplica-

15

Page 39: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

tions, (G.4) evaluated at (λ(θ∗), θ∗) gives:

‖λ(θ∗)‖ ≤C

C‖En [g(Wi, θ∗)]‖ = Op(

√K/n)

where the last equality uses the result of Donald et al. (2003, Lemma A.9). Then, ‖λ(θ∗)‖ < δn

and λ(θ∗) ∈ int(Λn). By convexity of En[eλ′g(Wi,θ)] it then follows that λ(θ∗) = λ(θ∗), which gives

result (G.15).

Moreover, by using (G.4):

supθ∈Θ

1

n

n∑i=1

eλ(θ)′g(Wi,θ) ≥ 1

n

n∑i=1

eλ(θ∗)′g(Wi,θ∗) ≥ 1 +Op(K/n)

which establishes result (G.3).

Lemma G.4. Suppose the assumptions of Lemma G.2 are satised and let Λn be as dened in

Lemma G.2 and τi(λ, θ) := eλ′gi(θ)

En[eλ′gj(θ)]

. Then, for every bounded constant C > 1:

P

(supλ∈Λn

τi(λ, θ) > C

)→ 0. (G.5)

P

(infλ∈Λn

min1≤i≤n

τi(λ, θ) ≤ 1/C

)→ 0 (G.6)

Proof. First, remark that supλ∈Λn τi(λ, θ) ≤supλ∈Λn

eλ′gi(θ)

En[einfλ∈Λnλ′gj(θ)]

. Then, we can upper bound the

numerator as

supλ∈Λn

eλ′gi(θ) ≤ emax1≤i≤n,λ∈Λn supθ∈Θ |λ′g(Wi,θ)|,

and lower bound the denominator as

En[einfλ∈Λn λ′gj(θ)] ≥ exp− max

1≤i≤n,λ∈Λnsupθ∈Θ|λ′g(Wi, θ)|

so that supλ∈Λn τi(λ, θ) ≤ e2 max1≤i≤n,λ∈Λn supθ∈Θ |λ′g(Wi,θ)|. By Lemma G.2, these two quantities

converge to zero in probability. Therefore, for every 1 < C < ∞ there exists a N such that for

every n ≥ N , P(supλ∈Λn τi(λ, θ) > C

)→ 0.

Let us now consider the second result:

infλ∈Λn

min1≤i≤n

τi(λ, θ) ≥e−max1≤i≤n,λ∈Λn supθ∈Θ |λ′g(Wi,θ)|

En[emax1≤i≤n,λ∈Λn supθ∈Θ |λ′g(Wi,θ)|]= e−2 max1≤i≤n,λ∈Λn supθ∈Θ |λ′g(Wi,θ)| > 1/C

with probability approaching 1 by Lemma G.2.

16

Page 40: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Lemma G.5. Let U be a neighborhood of θ∗. Let Assumptions 3.2 and 3.4 (b) be satised. Then,

for every θ ∈ U :‖G(θ)‖ = Op(

√K).

Proof. We have to control E‖G(θ)‖. For every θ ∈ U :

E‖G(θ)‖ ≤ E‖ρθ(X, θ)⊗ qK(Z)‖ ≤ E(‖ρθ(X, θ)‖ ‖qK(Z)‖

)≤√

E‖ρθ(X, θ)‖2√E‖qK(Z)‖2

≤ C√trE[qK(Z)qK(Z)′] = Op(

√K)

under Assumption 3.4 (b) and by Donald et al. (2003, Lemma A.2). The Markov's inequality allows

to obtain the result of the Lemma.

Lemma G.6. Let H be a compact subset of ⊂ Rp. For h ∈ H dene θ1 := θ∗ + h/√n. Suppose

assumptions 3.2, 3.3 (d) and 3.4 (b) hold. Then, suph∈H ‖λ(θ1)‖ = Op(√K/n).

Proof. By expanding the rst order condition for λ(θ) around λ(θ) = 0, there exists a λθ lying

between λ(θ) and zero such that: g(θ) + Ω(θ, λθ)λ(θ) = 0 which gives λ(θ) = −Ω(θ, λθ)−1g(θ). By

using the expansion g(θ1) = g(θ∗) + G(θ)h/√n for θ lying between θ1 and θ∗ we get:

‖λ(θ1)‖ ≤ ‖Ω(θ1, λθ1)−1‖(‖g(θ∗)‖+ ‖G(θ)h‖n−1/2

).

By continuity of θ 7→ λθ, due to the implicit function theorem, and continuity of θ 7→ gi(θ), due to

Assumption 3.3 (d), τi(λ, θ1) → 0 and so Ω(θ1, λθ1) → Ω(θ1) which then implies by Donald et al.

(2003, Lemma A.6) that ‖Ω(θ1, λθ1)−1‖ = λmin(Ω(θ1, λθ1))−1 ≤ C. Moreover, by Lemma G.5, there

exists N > 0 such that for every n ≥ N , ‖G(θ)‖ = Op(√K) and by Donald et al. (2003, Lemma

A.9) ‖g(θ∗)‖ = Op(√K/n). We then conclude that ‖λ(θ1)‖ = Op(

√K/n). Compactness of H and

continuity of θ 7→ λ(θ) allow to conclude.

Lemma G.7. Let Assumptions 3.2, 3.3, 3.5 and 3.6 be satised and assume K/n → 0. Dene

Ω(θ, λ) := En[τi(λ, θ)gi(θ)gi(θ)′] with τi(λ, θ) := eλ

′gi(θ)

En[eλ′gj(θ)]

. Let H denote a compact subset of Rp

and θ1 := θ∗ + h/√n with h ∈ H. Then, for all t1, t2 ∈ (0, 1)

suph∈H‖Ω(θ1, t1λ(θ1))− Ω(θ∗, t2λ(θ∗))‖ = Op

(ζ(K)K/

√n)

(G.7)

17

Page 41: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

and

suph∈H‖g(θ1)− g(θ∗)‖ = Op(

√K/n). (G.8)

Proof. Dene ρi(θ) := ρ(xi, θ), and G(θ, λ) := En[τi(λ, θ)G(Wi, θ)]. The quantity that we have to

control is:

suph∈H‖Ω(θ1, t1λ(θ1))− Ω(θ∗, t2λ(θ∗))‖ =

suph∈H‖En[τi(t1λ(θ1), θ1)gi(θ1)gi(θ1)′]−En[τi(t2λ(θ∗), θ∗)gi(θ∗)gi(θ∗)

′]‖

≤ suph∈H‖En[

(τi(t1λ(θ1), θ1)− τi(t2λ(θ∗), θ∗)

)gi(θ1)gi(θ1)′]‖

+ suph∈H‖En[τi(t2λ(θ∗), θ∗)

(gi(θ1)gi(θ1)′ − gi(θ∗)gi(θ∗)′

)]‖ =: A1 +A2 (G.9)

We start by considering term A1 in (G.9). By a MVT expansion there exists λ between t1λ(θ1)

and t2λ(θ∗) such that

τi(t1λ(θ1), θ∗)− τi(t2λ(θ∗), θ∗) = τi(λ, θ∗)gi,n(θ∗)′(t1λ(θ1)− t2λ(θ∗)) (G.10)

where gi,n(θ∗) := gi(θ)′ − En[τj(λ, θ∗)gj(θ∗)]. Then, by Lemma G.6, by compactness of H and

by using the condition Kζ(K)2/n1−2/γ → 0 (which holds under Assumption 3.6) it follows that

suph∈H λ(θ1) ∈ Λn, where Λn is as dened in Lemma G.2 and so, λ ∈ Λn. Hence,

∥∥∥En[(τi(t1λ(θ1), θ1)− τi(t∗λ(θ∗), θ∗)

)gi(θ1)gi(θ1)′]

∥∥∥≤∥∥∥En[

(τi(t1λ(θ1), θ1)− τi(t1λ(θ1), θ∗)

)gi(θ1)gi(θ1)′]

∥∥∥+∥∥∥En[

(τi(λ, θ∗)gi,n(θ∗)

′(t1λ(θ1)− t2λ(θ∗))

)gi(θ1)gi(θ1)′]

∥∥∥ =: A11(h) +A12(h) (G.11)

where we have used rst τi(t1λ(θ1), θ1) − τi(t2λ(θ∗), θ∗) = τi(t1λ(θ1), θ1) − τi(t1λ(θ1), θ∗) +

τi(t1λ(θ1), θ∗) − τi(t2λ(θ∗), θ∗) and then (G.10). We start by analyzing the second term in (G.11).

By CS and the triangle inequality, we have:

suph∈H

A12(h) ≤ suph∈H

En[∣∣∣(τi(λ, θ∗)gi,n(θ∗)

′(t1λ(θ1)− t2λ(θ∗))

)∣∣∣ ‖gi(θ1)gi(θ1)′‖]

≤(

suph∈H

max1≤i≤n

τi(λ, θ∗)

)1/2

suph∈H

√(t1λ(θ1)− t2λ(θ∗))′En[τi(λ, θ∗)gi(θ∗)gi(θ∗)′](t1λ(θ1)− t2λ(θ∗))

×√En[‖gi(θ1)gi(θ1)′‖2]

≤ suph∈H

max1≤i≤n

τi(λ, θ∗)λ1/2max(Ω(θ∗)) sup

h∈H‖(t1λ(θ1)− t2λ(θ∗))‖

√En[‖gi(θ1)gi(θ1)′‖2].

18

Page 42: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

By using Lemma G.4 and continuity of λ in h (by the implicit function theorem) we get:

suph∈H

max1≤i≤n

τi(λ, θ∗) < C, for 1 < C <∞; (G.12)

by Lemmas G.3 and G.6 it holds: suph∈H ‖(t1λ(θ1)− t2λ(θ∗))‖ = Op(√K/n) and by Donald et al.

(2003, Lemma A.6): λ1/2max(Ω(θ∗)) = Op(1). Moreover, similarly as in the proof of Donald et al.

(2003, Lemma A.16):

suph∈H

E[∥∥gi(θ1)gi(θ1)′

∥∥2] ≤ E

[E[sup

θ∈U‖ρi(θ)‖4|Zi]‖qK(Zi)‖4

]≤ Cζ(K)2K. (G.13)

Therefore, by denoting A12 := suph∈HA12(h) ≤ C2ζ(K)K/√n, the previous results and M imply

that P (Ac12) = o(1).

Let us now analyze term A11(h) in (G.11). Let Gi(λ, θ) := G(Wi, θ) − En[τi(λ, θ)G(Wj , θ)] and

consider a MVT expansion around θ = θ∗:

τi(t1λ(θ1), θ1)− τi(t1λ(θ1), θ∗) = τi(t1λ(θ1), θ)h√n

′Gi(t1λ(θ1), θ)t1λ(θ1),

where θ lies between θ1 and θ∗. By the Cauchy-Schwartz inequality we get:

A11(h) :=∥∥∥En[

(τi(t1λ(θ1), θ1)− τi(t1λ(θ1), θ∗)

)gi(θ1)gi(θ1)′]

∥∥∥=

∥∥∥∥En[τi(t1λ(θ1), θ)h′√nGi(t1λ(θ1), θ)t1λ(θ1)gi(θ1)gi(θ1)′]

∥∥∥∥≤

(En[τi(t1λ(θ1), θ)

∣∣∣∣ h′√nGi(θ)′λ(θ1)

∣∣∣∣2]

)1/2 (En[τi(t1λ(θ1), θ)‖gi(θ1)gi(θ1)′‖2]

)1/2.

Moreover, Lemma G.4 implies (G.12) so that

suph∈H

A11(h) ≤ C suph∈H

(En[

∣∣∣∣ h′√nGi(θ)′λ(θ1)

∣∣∣∣2]

)1/2

suph∈H

(En[‖gi(θ1)gi(θ1)′‖2]

)1/2= Op(

√K/n

√K/n)Op(ζ(K)

√K)

for which we have used (G.13) and

suph∈H

En[

∣∣∣∣ h′√nGi(θ)′λ(θ1)

∣∣∣∣2] ≤ suph∈H

λmax(λ(θ1)λ(θ1)′)h′√nEn

[Gi(θ)

′Gi(θ)] h√

n

≤ suph∈H‖λ(θ1)‖2 h

′√nEn

[ρθ(Xi, θ)

′ρθ(Xi, θ)‖qK(Zi)‖2] h√

n

19

Page 43: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

suph∈H≤ ‖λ(θ1)‖2

∥∥∥∥ h√n

∥∥∥∥2

En

[‖ρθ(Xi, θ)‖2‖qK(Zi)‖2

]= Op(K/n)Op

(1

nE

[E

[suph∈H‖ρθ(Xi, θ)‖2

∣∣∣∣Zi] ‖qK(Zi)‖2])

= Op(K/n)Op(K/n)

where the last line holds by Assumption 3.4 (b) and a similar strategy as in the proof of Donald et al.

(2003, Lemma A.16), and since λmax(λ(θ1)λ(θ1)′) = ‖λ(θ1)λ(θ1)′‖ ≤ tr(λ(θ1)λ(θ1)′) = ‖λ(θ1)‖2 =

Op(K/n) by Lemma G.6. Therefore, by denoting A11 := suph∈HA11(h) ≤ C1ζ(K)K√K/n, the

previous results and M imply that P (Ac11) = o(1).

Next, we analyse term A2 in (G.9):

∥∥∥En[τi(t2λ(θ∗), θ∗)[ρi(θ1)ρi(θ1)′ − ρi(θ∗)ρi(θ∗)′

]⊗ qK(zi)q

K(zi)′]∥∥∥

≤ En[τi(t2λ(θ∗), θ∗)∥∥ρi(θ1)ρi(θ1)′ − ρi(θ∗)ρi(θ∗)′

∥∥ ‖qK(zi)‖2]

≤ En[τi(t2λ(θ∗), θ∗)(‖ρi(θ1)− ρi(θ∗)‖2 + 2‖ρi(θ1)− ρi(θ∗)‖‖ρi(θ∗)‖

)‖qK(zi)‖2]

≤ C‖θ1 − θ∗‖En[τi(t2λ(θ∗), θ∗)Mi(θ∗)‖qK(Zi)‖2] = Op(‖θ1 − θ∗‖K) = Op(hK/√n) (G.14)

under Assumption 3.5 (b) by noting that Mi(θ∗) := δ(Xi)2 + 2δ(Xi)‖ρi(θ∗)‖ has

E[Mi(θ∗)|Zi] bounded by Cauchy-Schwartz (where δ(Xi) is dened in Assumption 3.5) so that

E[Mi(θ2)‖qK(Zi)‖2] = E[E[Mi(θ∗)|Zi]‖qK(Zi)‖2] ≤ CE[‖qK(Zi)‖2] ≤ CK.

Finally, by using the upper bounds in (G.11) and (G.14) we get:

P (suph∈H‖Ω(θ1, t1λ(θ1))− Ω(θ∗, t2λ(θ∗))‖ ≥ εn) ≤ P (A1 +A2 ≥ εn)

≤ P ( suph∈H

(A11(h) +A12(h)) +A2 ≥ εn∣∣∣∣A11 ∩ A12) + P (Ac11) + P (Ac12)

≤ P

(A2 ≥ εn −

(C1

√K

n+ C2

)ζ(K)K/

√n

)+ o(1)

which converges to 0 for ε = C3ζ(K)K/√n with C3 >

(C1

√Kn + C2

)since

√K/n → 0, because

of (G.38) and because P (Ac11) = o(1) and P (Ac12) = o(1).

Next, we show the second result of the lemma. Under Assumption 3.5 (b),

‖g(θ1)− g(θ∗)‖ =∥∥En[(ρi(θ1)− ρi(θ∗))⊗ qK(Zi)]

∥∥ ≤ En[‖ρi(θ1)− ρi(θ∗)‖∥∥qK(Zi)

∥∥]

≤ ‖θ1 − θ∗‖En[δ(Xi)∥∥qK(Zi)

∥∥] = Op(‖θ1 − θ∗‖

√E(E[δ(Xi)2|Zi]‖qK(Zi)‖2)

)= Op(‖θ1 − θ∗‖

√K).

20

Page 44: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Therefore, because H is compact we get:

suph∈H‖g(θ1)− g(θ∗)‖ = Op(

√K/n).

G.2 Technical Lemmas for the misspecied case

Lemma G.8. Let Assumptions 3.2 and 3.10 be satised. Then,

‖λ(θ)− λ(θ)‖ = Op(√K/n) (G.15)

Proof. By a MVT expansion of the rst order condition for λ(θ) there exists τ ∈ [0, 1] such that

λτ := τ(λ(θ) − λ(θ)) + λ(θ) satises En[eλ(θ)′gi(θ)gi(θ)] = 0 = En[eλ(θ)′gi(θ)gi(θ)] +

Ω(θ, λτ )(λ(θ)− λ(θ)) which implies:

(λ(θ)− λ(θ)) = −Ω(θ, λτ )−1En[eλ(θ)′gi(θ)gi(θ)]. (G.16)

By CS and Lemma G.9, it holds∥∥∥λ(θ)− λ(θ)∥∥∥ ≤ C ∥∥∥En[eλ(θ)

′gi(θ)gi(θ)]∥∥∥ = Op(

√K/n)

by using Assumption 3.10 (f).

Lemma G.9. Let Assumptions 3.2 and 3.10 be satised. Let H denote a compact subset of Rp andθ2 := θ + th/

√n with h ∈ H and t ∈ [0, 1]. Then, λmin(Ω) ≥ C−1 and if ζ(K)K

√K/n→ 0 then

with probability approaching 1, λmin(Ω(θ2, λ)) ≥ C−1 uniformly in h ∈ H.

Proof. Consider the matrix Ω and note that by Assumption 3.10 (g):

E[eλ(θ)′g(W,θ)ρi(θ)ρi(θ)

′|Z] ≥ C−1Id. Hence,

Ω ≥ C−1E[Id ⊗ qK(Z)qK(Z)′] = C−1IdK

and λmin(Ω) ≥ C−1. Also, if ζ(K)K√K/n→ 0 then suph∈H ‖Ω(θ2, λ)−Ω‖

p→ 0 by Lemma G.13

and the Markov inequality. Then, by ‖λ(A)−λ(B)‖ ≤ ‖A−B‖, where λ(A) denotes the minimum

or maximum eigenvalue, it follows that

suph∈H|λmin(Ω(θ2, λ))− λmin(Ω)|

p→ 0.

21

Page 45: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Lemma G.10. Let Assumptions 3.2 and 3.10 be satised. Then,

‖λ(θ)′g(θ)− λ(θ)′g(W, θ)‖ = Op(

√K/n) (G.17)

and

log1n

∑ni=1 expλ(θ)

′gi(θ)EP [expλ(θ)′g(W, θ)]

= Op(K/√n).

Proof. We start by proving the rst result. By the triangular inequality and CS we get:

‖λ(θ)′g(θ)− λ(θ)′g(W, θ)‖ ≤ ‖λ(θ)− λ(θ)‖‖g(θ)‖+ ‖λ(θ)‖‖g(θ)−EP [g(W, θ)]‖

= Op(√K/n)

where we have used Lemmas G.8 and G.11 and Assumption 3.10 (b). To get the second result

we use the inequality log(a) ≤ a− 1 for every a > 0:

log1n

∑ni=1 expλ(θ)

′gi(θ)EP [expλ(θ)′g(W, θ)]

≤1n

∑ni=1 expλ(θ)

′gi(θ) −EP [expλ(θ)′g(W, θ)]EP [expλ(θ)′g(W, θ)]

≤En

[eλ(θ)′gi(θ)

]−En

[eλ(θ)

′gi(θ)]

+ En

[eλ(θ)

′gi(θ)]−EP

[eλ(θ)

′g(W,θ)]

EP [expλ(θ)′g(W, θ)]. (G.18)

By a MVT expansion of λ 7→ eλ′gi(θ) around λ, there exists a t ∈ [0, 1] such that λ := t(λ(θ) −

λ(θ)) +λ(θ) satises En

[eλ(θ)′gi(θ)

]−En

[eλ(θ)

′gi(θ)]

= En

[eλ′gi(θ)gi(θ)

′]

(λ(θ)−λ(θ))and so

‖En[expλ(θ)

′gi(θ)]−En

[expλ(θ)′gi(θ)

]‖

≤ emax1≤i≤n ‖λ(θ)−λ(θ)‖ ‖gi(θ)‖En

[eλ(θ)

′gi(θ)‖gi(θ)‖]‖λ(θ)− λ(θ)‖ = Op(K/

√n)

by using Lemmas G.2, G.8 and Assumption 3.10 (f). Finally, by M

En

[eλ(θ)

′gi(θ)]−EP

[eλ(θ)

′g(W,θ)]

= Op(1/√n).

Lemma G.11. If Assumptions 3.2 and 3.10 (d) are satised , then ‖g(θ) − E[g(Wi, θ)]‖ =

Op(√K/n).

Proof. By the Markov's inequality, for every ε > 0

22

Page 46: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

P (‖g(θ)−E[g(Wi, θ)]‖ > ε) ≤ 1

ε2E‖g(θ)−E[g(Wi, θ)]‖2 =

1

ε2tr(V ar[g(θ)])

=1

nε2tr(V ar [g(Wi, θ)]) ≤

1

nε2tr(E

[(ρ(Xi, θ)⊗ qK(Zi))(ρ(Xi, θ)⊗ qK(Zi))

′])=

1

nε2tr(E

[(ρ(Xi, θ)ρ(Xi, θ)

′)⊗ (qK(Zi)qK(Zi)

′)]

=1

nε2E[E[‖ρ(Xi, θ)‖2|Zi]tr(qK(Zi)q

K(Zi)′)]

≤ C

nε2trE(qK(Zi)q

K(Zi)′) =

C

nε2tr(IK) =

CK

ε2n.

where we have used Donald et al. (2003, Lemma A.2) in the last inequality.

Lemma G.12. Let Assumptions 3.2, 3.6, 3.8 and 3.10 be satised. Let H denote a compact subset

of Rp and θ2 := θ + τh/√n with h ∈ H and τ ∈ [0, 1]. Then

suph∈H|h′(dλ(θ2)′

dθ− dλ(θ)

dθ)√nEn[gi(θ2)]| = Op(ζ(K)K2

√K/n). (G.19)

Proof. First, remark that we can simplify the expression of dλ(θ2)′

dθ (by simplifying the denominator

in τi(λ, θ)):

dλ(θ2)′

dθ− dλ(θ)

dθ= −En

[τ(λ, θ2,Wi)Gi(θ2)′(I + λ(θ2)gi(θ2)′)

]Ω−1(θ2, λ)

+ E[τ(λ, θ,Wi)Gi(θ)

′(I + λ(θ)gi(θ)′)]

Ω−1

= −En[(τ(λ, θ2,Wi)− τ(λ(θ), θ2,Wi)

)Gi(θ2)′(I + λ(θ2)gi(θ2)′)

]Ω−1(θ2, λ)

−(En

[τ(λ(θ), θ2,Wi)Gi(θ2)′(I + λ(θ2)gi(θ2)′)

]Ω−1(θ2, λ)

+ E[τ(λ, θ,Wi)Gi(θ)

′(I + λ(θ)gi(θ)′)]

Ω−1

)=: A1 +A2. (G.20)

Further, we dene

A11 := −En[(τ(λ, θ2,Wi)− τ(λ(θ), θ2,Wi)

)Gi(θ2)′

]Ω−1(θ2, λ)

A12 := −En[(τ(λ, θ2,Wi)− τ(λ(θ), θ2,Wi)

)Gi(θ2)′λ(θ2)gi(θ2)′

]Ω−1(θ2, λ)

so that A1 = A11 +A12. In the following we use the argument that gi(θ2) = gi(θ)+Gi(θ)′h/√n+

op(h/√n) to replace gi(θ2) by gi(θ).

By a MVT expansion there exists a τ ∈ [0, 1] such that λ := τ λ(θ2) + (1− τ)λ(θ) satises

eλ(θ2)′gi(θ2) − eλ(θ)′gi(θ2) = eλ′gi(θ2)gi(θ2)′(λ(θ2)− λ(θ)). (G.21)

23

Page 47: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

Therefore, |h′A11√nEn[gi(θ2)]| can be upper bounded as:

suph∈H|h′A11

√nEn[gi(θ2)]|

≤ suph∈H

∥∥∥En[eλ′gi(θ2)gi(θ2)′(λ(θ2)− λ(θ))Gi(θ2)]h

∥∥∥ ‖Ω−1(θ2, λ)√nEn[gi(θ2)]‖

≤ suph∈H

max1≤i≤n

eτ [λ(θ2)−λ(θ)]′gi(θ2)‖h‖En[eλ(θ)

′gi(θ2)‖gi(θ2)‖‖Gi(θ2)‖]

× ‖λ(θ2)− λ(θ)‖‖Ω−1(θ2, λ)√nEn[gi(θ2)]‖

= Op(√K/n)

√En

[eλ(θ)′gi(θ2)‖Gi(θ2)‖2

]√En

[eλ(θ)′gi(θ2)‖gi(θ2)‖2

]× ‖Ω−1(θ2, λ)Ω

1/2 ‖ ‖Ω−1/2

√nEn[gi(θ2)]‖ = Op(K

2/√n)

where we have used: (i) Lemma G.2 (ii) the compactness of H, (iii) the fact that every dierentiablemap is locally Lipschitz which allows to show that ‖λ(θ2)− λ(θ)‖ ≤ Chn, (iv) the inequality

‖λ(θ2)− λ(θ)‖ ≤ ‖λ(θ2)− λ(θ)‖ − ‖λ(θ)− λ(θ)‖ = Op(√K/n)

by Lemma G.8 and (iii), (v) Assumption 3.10 (f), (vi) the fact that√nEn[gi(θ2)] = Op(

√K). To

control term A12 we use again (G.34) and the Cauchy-Schwartz inequality to get:

|h′A12

√nEn[gi(θ2)]| :=

|h′En[(τ(λ, θ2,Wi)− τ(λ, θ,Wi)

)Gi(θ2)′λ(θ2)gi(θ2)′

]Ω−1(θ2, λ)

√nEn[gi(θ2)]|

= |h′En[eλ′gi(θ2)gi(θ2)′(λ(θ2)− λ(θ))Gi(θ2)′λ(θ2)gi(θ2)′

]Ω−1(θ2, λ)

√nEn[gi(θ2)]|

≤ suph∈H

max1≤i≤n

eτ(λ(θ2)−λ(θ))′gi(θ2)×

En

[eλ(θ)

′gi(θ2)‖gi(θ2)‖ ‖λ(θ2)− λ(θ)‖ ‖h‖ ‖Gi(θ2)‖ ‖λ(θ2)‖ ‖gi(θ2)‖]‖Ω−1(θ2, λ)

√nEn[gi(θ2)]‖

≤ Op(√K√K/n)

√En

[eλ(θ)′gi(θ2)‖gi(θ2)‖4

] √En

[eλ(θ)′gi(θ2)‖Gi(θ2)‖2

]= Op(K

2ζ(K)/√n) (G.22)

where we have used Lemma G.8, the fact that√nEn[gi(θ2)] = Op(

√K) and argument (vi) above

to get the penultimate line, and we have used Assumption 3.10 (f) to get the last line. Next, we

analyse the term A2 in (G.20), which we decompose further as follows:

A2 = −En[τ(λ(θ), θ2,Wi)[Gi(θ2)−Gi(θ)]′(I + λ(θ2)gi(θ2)′)

]Ω−1(θ2, λ)

−En

[(τ(λ(θ), θ2,Wi)− τ(λ, θ,Wi))Gi(θ)

′(I + λ(θ2)gi(θ2)′)]

Ω−1(θ2, λ)

−En

[τ(λ, θ,Wi)Gi(θ)

′(I + λ(θ2)gi(θ2)′)]

(Ω−1(θ2, λ)− Ω−1 )

24

Page 48: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

−En[τ(λ, θ,Wi)Gi(θ)′(I + λ(θ)gi(θ)

′)−E[τ(λ, θ,Wi)Gi(θ)

′(I + λ(θ)gi(θ)′)]]Ω−1

=: A21 +A22 +A23 +A24. (G.23)

We start with A21. By the Cauchy-Schwartz inequality applied twice we get

suph∈H|h′A21

√nEn[gi(θ2)]|

= suph∈H|h′En

[τ(λ(θ), θ2,Wi)[Gi(θ2)−Gi(θ)]′(I + λ(θ2)gi(θ2)′)

]Ω−1(θ2, λ)

√nEn[gi(θ2)]|

≤ suph∈H‖h‖(En[eλ(θ)

′gi(θ2)‖Gi(θ2)−Gi(θ)‖] + ‖λ(θ2)‖√En[eλ(θ)′gi(θ2)‖Gi(θ2)−Gi(θ)‖2]

×√

En[eλ(θ)′gi(θ2)‖gi(θ2)′‖2])‖Ω−1(θ2, λ)

√nEn[gi(θ2)]‖. (G.24)

Because every dierentiable function is locally Lipschitz then, by Assumption 3.10 (d), there exists

a θ between θ2 and θ such that ‖Gi(θ2) − Gi(θ)‖ = ‖[ρθ(Xi, θ2) − ρθ(Xi, θ)] ⊗ qK(Zi)‖ =

n−1/2‖(h′ρjθθ(Xi, θ)

)dj=1‖∥∥qK(Zi)

∥∥. Then, by Assumption 3.2

suph∈H

√En[eλ(θ)′gi(θ2)‖Gi(θ2)−Gi(θ)‖2] =

Op(n−1/2

√E[E[sup

h∈Heλ(θ)′gi(θ2)‖

(ρjθθ(Xi, θ)

)dj=1‖2|Zi] ‖qK(Zi)‖2]) = Op(

√K/n). (G.25)

By this, the Jensen's inequality, Assumption 3.10 (f) and the fact that√nEn[gi(θ2)] = Op(

√K) we

get:

suph∈H|h′A21

√nEn[gi(θ2)]| = Op(K

√K/n). (G.26)

Next, we analyze term A22. By a MVT expansion around θ there exists a θ between θ2 and θ such

that : τ(λ(θ), θ2,Wi)− τ(λ, θ,Wi) = eλ(θ)′gi(θ)λ(θ)

′Gi(θ)τh/√n for τ ∈ [0, 1]. Therefore,

suph∈H|h′A22

√nEn[gi(θ2)]|

= suph∈H|h′En

[eλ(θ)

′gi(θ)λ(θ)′Gi(θ)τh/

√nGi(θ)

′(I + λ(θ2)gi(θ2)′)]

Ω−1(θ2, λ)√nEn[gi(θ2)]|

≤ suph∈H

√En[eλ(θ)′gi(θ)(λ(θ)′Gi(θ)τh/

√n)2](√

En

[eλ(θ)′gi(θ)(h′Gi(θ)′Ω−1(θ2, λ)

√nEn[gi(θ2)])2

]+

√En

[eλ(θ)′gi(θ)(h′Gi(θ)′λ(θ2)gi(θ2)′Ω−1(θ2, λ)

√nEn[gi(θ2)])2

])≤ sup

h∈Hn−1/2‖λ(θ)‖‖h‖2

√En[eλ(θ)′gi(θ)‖Gi(θ)‖2

](√En[eλ(θ)′gi(θ)‖Gi(θ)‖2

]+‖λ(θ2)‖

(En

[eλ(θ)

′gi(θ)‖Gi(θ)‖4])1/4 (

En

[eλ(θ)

′gi(θ)‖gi(θ2)‖4])1/4 )

‖Ω−1(θ2, λ)√nEn[gi(θ2)])‖

25

Page 49: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

= Op(K√K/n+ ζ(K)K/

√n) (G.27)

where we have used Assumption 3.10 (f) and the fact that√nEn[gi(θ2)] = Op(

√K). Next, we

analyse term A23:

suph∈H|h′A23

√nEn[gi(θ2)]|

= suph∈H|En

[eλ(θ)

′gi(θ)h′Gi(θ)′(I + λ(θ2)gi(θ2)′)

](Ω−1(θ2, λ)− Ω−1

)√nEn[gi(θ2)]|

≤ suph∈H‖h‖

(En

[eλ(θ)

′gi(θ)‖Gi(θ)‖]

+ ‖λ(θ2)‖√En[eλ(θ)′gi(θ)‖Gi(θ)‖2

]√En

[eλ(θ)′gi(θ)‖gi(θ2)‖2

])× ‖Ω−1(θ2, λ)− Ω−1

‖ ‖Ω1/2 ‖ ‖Ω−1/2

√nEn[gi(θ2)]‖. (G.28)

From the two results of Lemma G.13 and the result of Lemma G.9 we nd the rate for ‖Ω−1(θ2, λ)−Ω−1 ‖:

suph∈H‖Ω−1(θ2, λ)− Ω−1

‖ ≤ ‖Ω−1(θ2, λ)‖ ‖Ω(θ2, λ)− Ω‖ ‖Ω−1 ‖

≤ suph∈H

λ−1min

(Ω(θ2, λ)

)λ−1

min(Ω)(‖Ω(θ2, λ)− Ω(θ, λ(θ))‖+ ‖Ω(θ, λ(θ))− Ω‖

)= Op(ζ(K)K

√K/n+ ζ(K)K/

√n) = Op(ζ(K)K

√K/n).

Therefore, by plugging this result in (G.28) and by using Assumption 3.10 (f), the rst result of

Lemma G.13 and the fact that√nEn[gi(θ2)] = Op(

√K) we obtain:

suph∈H|h′A23

√nEn[gi(θ2)]| = Op((

√K +K)

√Kζ(K)K/

√n).

Finally, we analyse term A24:

suph∈H|h′A24

√nEn[gi(θ2)]| = sup

h∈H

∣∣∣h′En[eλ(θ)′gi(θ)Gi(θ)

′(I + λ(θ)gi(θ)′)

−E[eλ(θ)

′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)

′)]]Ω−1√nEn[gi(θ2)]

∣∣∣= Op(

1√n

(√E[e2λ(θ)′gi(θ)‖Gi(θ)‖2]+

√E[e2λ(θ)′gi(θ)‖Gi(θ)‖2‖gi(θ)‖2]))‖Ω−1

√nEn[gi(θ2)]‖

= Op(1√n

(√E[e2λ(θ)′gi(θ)‖Gi(θ)‖2] +

(E[e2λ(θ)′gi(θ)‖Gi(θ)‖4]

)1/4

×(E[e2λ(θ)′gi(θ)‖gi(θ)‖4]

)1/4))Op(

√K)

= Op((ζ(K)√K/n+ ζ(K)

√K/n)

√K) = Op(ζ(K)K/

√n) (G.29)

where we have used Assumption 3.10 (f) and the fact that√nEn[gi(θ2)] = Op(

√K). By putting

26

Page 50: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

together all the rates we have:

suph∈H

∣∣∣∣∣h′(dλ(θ2)′

dθ− dλ(θ)

)√nEn[gi(θ2)]

∣∣∣∣∣= Op(K

2/√n+K2ζ(K)/

√n+K

√K/n+ ζ(K)K/

√n+ ζ(K)K2

√K/n+K/

√n)

= Op(ζ(K)K2√K/n)

after elimination of the negligible terms. This concludes the proof.

Lemma G.13. Let Assumptions 3.2 and 3.10 be satised. Let H denote a compact subset of Rp

and θ2 := θ + th/√n with h ∈ H and t ∈ [0, 1]. Then,

suph∈H‖Ω(θ2, λ(θ2))− Ω(θ, λ(θ))‖ = Op

(ζ(K)K

√K/n

), (G.30)

‖Ω(θ, λ(θ))− Ω‖ = Op(ζ(K)K/√n) (G.31)

and

suph∈H‖g(θ2)− g(θ)‖ = Op(

√K/n). (G.32)

Proof. Dene ρi(θ) := ρ(Xi, θ). Remark that θ2 ∈ Θ(h, n−1/2). The quantity that we have to

control is:

suph∈H‖Ω(θ2, λ(θ2))− Ω(θ, λ(θ))‖ =

suph∈H‖En[eλ(θ2)′gi(θ2)gi(θ2)gi(θ2)′]−En[eλ(θ)

′gi(θ)gi(θ)gi(θ)′]‖

≤ suph∈H‖En[

(eλ(θ2)′gi(θ2) − eλ(θ)′gi(θ)

)gi(θ2)gi(θ2)′]‖

+ suph∈H‖En[eλ(θ)

′gi(θ)(gi(θ2)gi(θ2)′ − gi(θ)gi(θ)′

)]‖ =: A1 +A2 (G.33)

Remark that by a MVT expansion there exists τ ∈ [0, 1] such that λ := τ(λ(θ2)− λ(θ)) + λ(θ)

satises:

eλ(θ2)′gi(θ) − eλ(θ)′gi(θ) = eλ′gi(θ)gi(θ)

′(λ(θ2)− λ(θ)). (G.34)

We start by considering term A1 in (G.33):

∥∥∥En[(eλ(θ2)′gi(θ2) − eλ(θ)′gi(θ)

)gi(θ2)gi(θ2)′]

∥∥∥27

Page 51: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

≤∥∥∥En[

(eλ(θ2)′gi(θ2) − eλ(θ2)′gi(θ)

)gi(θ2)gi(θ2)′]

∥∥∥+∥∥∥En[eλ

′gi(θ)gi(θ)′(λ(θ2)− λ(θ))gi(θ2)gi(θ2)′]

∥∥∥ =: A11(h) +A12(h) (G.35)

where we have used rst eλ(θ2)′gi(θ2) − eλ(θ)′gi(θ) = eλ(θ2)′gi(θ2) − eλ(θ2)′gi(θ) + eλ(θ2)′gi(θ) −

eλ(θ)′gi(θ) and then (G.34). We start by analyzing the second term in (G.35). By the trian-

gle inequality and the Cauchy-Schwartz inequality, we have:

suph∈H

A12(h) ≤ suph∈H

max1≤i≤n

eτ(λ(θ2)−λ(θ))′gi(θ)(En[eλ(θ)

′gi(θ)‖gi(θ)‖2])1/2‖λ(θ2)− λ(θ)‖

×(En[eλ(θ)

′gi(θ)∥∥∥gi(θ2)gi(θ2)′

∥∥∥2]

)1/2

= Op(√K)Op(

√K/n)Op(ζ(K)

√K)

where we have used: (i) the continuity and dierentiability of θ 7→ λ(θ) (by the Implicit Function

Theorem), (ii) the compactness of H, (iii) the fact that every dierentiable map is locally Lipschitz

which allows to show that there exists a N such that for every n > N : ‖λ(θ2)− λ(θ)‖ ≤ Chn, (iv)the inequality

suph∈H‖λ(θ2)− λ(θ)‖ ≤ ‖λ(θ2)− λ(θ)‖+ ‖λ(θ)− λ(θ)‖ = Op(

√K/n) (G.36)

which is valid by Lemma G.8 and (i)-(iii), (v) Lemma G.2 which is valid since λ(θ2) − λ(θ) ∈ Λ

with δn =√K/n by (i)-(iv), (vi) Assumption 3.10 (f) which implies En[eλ(θ)

′gi(θ) ‖gi(θ)‖2] =

Op(E[eλ(θ)′gi(θ) ‖gi(θ)‖2]) = Op(K) and suph∈HEn[eλ(θ)

′gi(θ)‖gi(θ2)‖4] = Op(ζ(K)2K).

Therefore, by denoting A12 := suph∈HA12(h) ≤ C2ζ(K)K√K/n, the previous results and the

Markov's inequality imply that P (Ac12) = o(1).

Next, we analyse term A11(h) in (G.35). Consider a MVT expansion around θ:

eλ(θ2)′gi(θ2) − eλ(θ2)′gi(θ) = eλ(θ2)′gi(θ)λ(θ2)′Gi(θ)h√n,

where θ := τ(θ2 − θ) + θ for some τ ∈ [0, 1]. By this and the triangle inequality and the Cauchy-

Schwartz inequality we get:

suph∈H

A11(h) := suph∈H

∥∥∥En[(eλ(θ2)′gi(θ2) − eλ(θ2)′gi(θ)

)gi(θ2)gi(θ2)′]

∥∥∥=

∥∥∥∥En[eλ(θ2)′gi(θ)λ(θ2)′Gi(θ)h√ngi(θ2)gi(θ2)′]

∥∥∥∥≤ sup

h∈Hmax

1≤i≤ne(λ(θ2)−λ(θ))′gi(θ)‖λ(θ2)‖ ‖h‖ 1√

nEn

[eλ(θ)

′gi(θ)‖Gi(θ)‖ ‖gi(θ2)gi(θ2)′‖]

28

Page 52: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

≤ Op(n−1/2) suph∈H

√En[eλ(θ)′gi(θ)‖Gi(θ)‖2]

√En[eλ(θ)′gi(θ)‖gi(θ2)gi(θ2)′‖2]

where we have used Lemma G.2 which is valid since λ(θ2)− λ(θ) ∈ Λ with δn =√K/n as shown

in (i)-(iv) above. Moreover, we have used the fact that ‖λ(θ2)‖ = Op(1) by (G.36), continuity of

θ 7→ λ(θ) and compactness of H. Therefore, by Assumption 3.10 (f):

suph∈H

A11(h) = Op(K/√nζ(K)). (G.37)

Therefore, by denoting A11 := suph∈HA11(h) ≤ C1ζ(K)K/√n, the previous results and the

Markov's inequality imply that P (Ac11) = o(1).

Next, we analyse term A2 in (G.33):

A2 = suph∈H

∥∥∥En[eλ(θ)′gi(θ)

(ρi(θ2)ρi(θ2)′ − ρi(θ)ρi(θ)′

)⊗[qK(zi)q

K(zi)′]]∥∥∥

≤ suph∈H

En[eλ(θ)′gi(θ)

∥∥∥ρi(θ2)ρi(θ2)′ − ρi(θ)ρi(θ)′∥∥∥ ‖qK(zi)‖2]

≤ suph∈H

En[eλ(θ)′gi(θ)

(‖ρi(θ2)− ρi(θ)‖2 + 2‖ρi(θ2)− ρi(θ)‖‖ρi(θ)‖

)‖qK(zi)‖2]

≤ suph∈H‖θ2 − θ‖2En[eλ(θ)

′gi(θ)δ(Xi)2‖qK(Zi)‖2]

+ suph∈H‖θ2 − θ‖2En[eλ(θ)

′gi(θ)δ(Xi)‖ρi(θ)‖‖qK(Zi)‖2] = Op(K/√n) (G.38)

under Assumption 3.10 (e).

Finally, by using the upper bounds in (G.35) and (G.38) we get:

P (suph∈H‖Ω(θ2, λ(θ2))− Ω(θ, λ(θ))‖ ≥ εn) ≤ P (A1 +A2 ≥ εn)

≤ P ( suph∈H

(A11(h) +A12(h)) +A2 ≥ εn∣∣∣∣A11 ∩ A12) + P (Ac11) + P (Ac12)

≤ P

(A2 ≥ εn −

(C1√K

+ C2

)ζ(K)K

√K

n

)+ o(1)

which converges to 0 for ε = C3ζ(K)K√K/n with C3 > ( C1√

K+C2) because of (G.38) and because

P (Ac11) = o(1) and P (Ac12) = o(1).

Next, we show the second result of the lemma. Remark that ‖Ω(θ, λ(θ)) − Ω‖ =

Op(√

E[‖Ω(θ, λ(θ))− Ω‖2]) and

E[‖Ω(θ, λ(θ))− Ω‖2] ≤dK∑j=1

dK∑l=1

E[∣∣∣En (eλ′gi(θ)gij(θ)gil(θ)−E[eλ

′gi(θ)gij(θ)gil(θ)]

)∣∣∣2

29

Page 53: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

≤ 1

n

dK∑j=1

dK∑l=1

E[e2λ′gi(θ)(gij(θ)gil(θ))2] =

1

n

dK∑j=1

dK∑l=1

E[e2λ′gi(θ)(e′jgi(θ)gi(θ)′el)

2]

1

n

dK∑j=1

E[e2λ′gi(θ)‖gi(θ)‖4] = O(ζ(K)2K2/n)

by using the fact that e′jgi(θ)gi(θ)′el ≤ tr(ele′j)λmax(gi(θ)gi(θ)

′) = ‖gi(θ)‖2e′lej and Assumption

3.10 (f).

Finally, we show the last result of the Lemma. Under Assumption 3.10 (e),

‖g(θ2)− g(θ)‖ =∥∥∥En[(ρi(θ2)− ρi(θ))⊗ qK(zi)]

∥∥∥ ≤ En[‖ρi(θ2)− ρi(θ)‖∥∥qK(zi)

∥∥]

≤ ‖θ2 − θ‖En[δ(xi)∥∥qK(zi)

∥∥] = Op(‖θ2 − θ‖

√E(E[δ(Xi)2|Zi]‖qK(Zi)‖2)

)= Op(‖θ2 − θ‖

√K).

Therefore, because H is compact we get:

suph∈H‖g(θ2)− g(θ)‖ = Op(

√K/n).

Lemma G.14. Let Assumptions 3.2 and 3.10 (a)-(g) be satised and let H denote a compact subset

of Rp. Then,

suph∈H

h′

(dλ(θ)

dθ− dλ(θ)

)EP [gi(θ)] = Op(n

−1/2).

Proof. By the triangular inequality:∣∣∣∣∣√nh′(dλ(θ)

dθ− dλ(θ)

)E[gi(θ)]

∣∣∣∣∣≤

∣∣∣∣∣√nh′(dλ(θ)

dθ+ En

[eλ(θ)

′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)

′)]

Ω(θ, λ)−1

)E[gi(θ)]

∣∣∣∣∣+

∣∣∣∣√nh′(En [eλ(θ)′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)′)]

Ω(θ, λ)−1 − dλ(θ)

)E[gi(θ)]

∣∣∣∣ . (G.39)

We start with the analysis of the rst term. By the MVT expansion around λ(θ) there exists a

t ∈ [0, 1] such that λ := tλ(θ) + (1− t)λ(θ) satises

√nh′

dλ(θ)′

dθE[gi(θ)] = −

√nh′En

[eλ(θ)

′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)

′)]

Ω(θ, λ)−1E[gi(θ)]

−√nh′(

(λ(θ)− λ(θ))′En[eλ′gi(θ)gi(θ)Gi(θ)

′(I + λ(θ)gi(θ)′)]

30

Page 54: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

+ En

[eλ′gi(θ)Gi(θ)

′(λ(θ)− λ(θ))gi(θ)′] )

Ω(θ, λ)−1E[gi(θ)]

+√nh′En

[eλ′gi(θ)Gi(θ)

′(I + λgi(θ)′)]

Ω(θ, λ)−2En[eλ′gi(θ)gi(θ)gi(θ)

′gi(θ)′](λ(θ)−λ(θ))E[gi(θ)].

By using (G.16) it follows that the last three terms are equal to

h′(Gn[eλ(θ)

′gi(θ)gi(θ)′]Ω(θ, λτ )−1En

[eλ′gi(θ)gi(θ)Gi(θ)

′(I + λ(θ)gi(θ)′)]

−En

[eλ′gi(θ)Gi(θ)

′Ω(θ, λτ )−1Gn[eλ(θ)′gi(θ)gi(θ)]gi(θ)

′] )

Ω(θ, λ)−1E[gi(θ)]

− h′En[eλ′gi(θ)Gi(θ)

′(I + λgi(θ)′)]

Ω(θ, λ)−2En[eλ′gi(θ)gi(θ)gi(θ)

′gi(θ)′]

× Ω(θ, λτ )−1Gn[eλ(θ)′gi(θ)gi(θ)]E[gi(θ)]

and they are bounded in probability. We now control the second term in (G.39). This is upper

bounded by∣∣∣∣√nh′(En [eλ(θ)′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)′)]

Ω(θ, λ)−1 − dλ(θ)

)E[gi(θ)]

∣∣∣∣≤∣∣∣√nh′En [eλ(θ)′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)

′)] (

Ω(θ, λ)−1 − Ω−1

)E[gi(θ)]

∣∣∣+∣∣∣h′Gn

[eλ(θ)

′gi(θ)Gi(θ)′(I + λ(θ)gi(θ)

′)]

Ω−1 E[gi(θ)]

∣∣∣which, by using Lemma G.13 is bounded in probability.

H Additional empirical example

In this section we illustrate an extra application of our techniques in the estimation of causal

parameters: the average treatment eect (ATE) estimation under a conditional independence as-

sumption.

Average treatment eect (ATE) estimation. A standard problem in causal inference with a

binary treatment xi ∈ 0, 1, for control and treated, respectively, and covariates zi : d× 1 assumes

that the two potential outcomes yi0 and yi1 for n randomly chosen subjects satisfy the conditional

independence assumption (Rosenbaum and Rubin, 1983)

(yi0, yi1)⊥xi|zi.

31

Page 55: Bayesian Estimation and Comparison of Conditional Moment ...Bayesian Estimation and Comparison of Conditional Moment Models Siddhartha Chib Minchul Shiny Anna Simoniz October 2019

If we let EP (yi1|zi)−EP (yi0|zi) denote the ATE conditional on zi for the ith subject, then the goal

of the analysis is to calculate the ATE given by

ATE =1

n

n∑i=1

(EP (yi1|zi)−EP (yi0|zi)

).

We show that the conditional moment technique developed above is ideally suited for calculating the

posterior distribution of this quantity, under minimal assumptions. We just need to make assump-

tions about the conditional expectations EP (yij |zi) (j = 0, 1) without specifying (or restricting) the

conditional distributions of yij |zi in any further way. For illustration, suppose that

EP (yij |zi) = z′iβj , j = 0, 1.

Also suppose that there are $n_0$ control subjects, and that the data are organized such that the observations $i \leq n_0$ are the data on the controls, and the observations $i > n_0$ are the data on the treated. Then, the latter conditional expectations imply that estimation of $\beta_0$ can be based on the conditional moment conditions
\[
E_P\big((y_{i0} - z_i'\beta_0)\,|\,z_i\big) = 0 \quad (i \leq n_0)
\]
since $y_{i0}$ is observed for such subjects, and that, independently, estimation of $\beta_1$ can be based on the conditional moment conditions
\[
E_P\big((y_{i1} - z_i'\beta_1)\,|\,z_i\big) = 0 \quad (i > n_0)
\]
since $y_{i1}$ is observed for these subjects. Now, suppose that our prior-posterior analysis is applied to these sets of moment conditions to produce the MCMC samples
\[
\{\beta_0^{(g)}\}_{g=1}^{M} \sim \pi\big(\beta_0 \,|\, \{y_{i0}, z_i\}_{i=1}^{n_0}\big) \quad \text{and} \quad \{\beta_1^{(g)}\}_{g=1}^{M} \sim \pi\big(\beta_1 \,|\, \{y_{i1}, z_i\}_{i > n_0}\big).
\]

Then, the Bayes posterior sample of the ATE is given by the sequence of values
\[
\mathrm{ATE}^{(g)} = \frac{1}{n}\sum_{i=1}^{n}\big(z_i'\beta_1^{(g)} - z_i'\beta_0^{(g)}\big), \quad g = 1, 2, \ldots, M.
\]
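As a minimal computational sketch (assuming the two MCMC runs are stored as $M \times d$ arrays, with hypothetical names beta0_draws and beta1_draws, and the covariates as an $n \times d$ array Z), the posterior ATE sample can be assembled as follows:

```python
# Minimal sketch: assemble posterior ATE draws from the two MCMC samples.
# The array names and shapes are illustrative assumptions, not the authors' code.
import numpy as np

def ate_draws(beta0_draws: np.ndarray,  # (M, d): draws from pi(beta_0 | controls)
              beta1_draws: np.ndarray,  # (M, d): draws from pi(beta_1 | treated)
              Z: np.ndarray) -> np.ndarray:  # (n, d): covariates of all n subjects
    """Return ATE^(g) = (1/n) sum_i (z_i' beta_1^(g) - z_i' beta_0^(g))."""
    mu1 = beta1_draws @ Z.T          # (M, n) fitted conditional means, treated state
    mu0 = beta0_draws @ Z.T          # (M, n) fitted conditional means, control state
    return (mu1 - mu0).mean(axis=1)  # average over subjects i for each draw g
```

Posterior summaries of the ATE (mean, quantiles, credible intervals) then follow directly from the returned vector of $M$ draws.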

As an illustration of this approach, consider $n = 500$, $1000$, and $2000$ observations generated from the following DGP. First, suppose that $z_1$ and $z_2$ are generated from a Gaussian copula whose covariance matrix has 1 on the diagonal and 0.8 on the off-diagonal, such that the marginal distribution of $z_1$ is uniform on $(0, 1)$ and the marginal distribution of $z_2$ is standard normal. Next,


conditional on $(z_1, z_2)$, suppose that $x$ is generated as independent Bernoulli
\[
x \sim \mathcal{B}(p)
\]
where the propensity score, the probability $p$ of being treated, is given by
\[
p = \Phi\Big(0.5\big(\sqrt{0.3}\,z_{1,i} + \sqrt{0.7}\,z_{2,i}\big)^3\big(1 - \sqrt{0.3}\,z_{1,i} - \sqrt{0.7}\,z_{2,i}\big)\Big)
\]
and $\Phi(\cdot)$ is the cdf of the standard normal. Finally, suppose that the potential outcomes for each individual in the sample are given by
\[
y_0 = 10 + z_1 + 1.5 z_2 + \varepsilon_0, \qquad y_1 = 10 + 1.5 z_1 - z_2 + \varepsilon_1 \tag{H.1}
\]

where the conditional distribution of $\varepsilon_j$ is skew normal with conditional variance and conditional skewness depending on $z = (z_1, z_2)$. In particular,
\[
\varepsilon_j \sim \mathcal{SN}\big(m_j(z), s_j(z), w_j(z)\big) \tag{H.2}
\]
where
\[
s_0(z) = \exp\big(0.5(1 + 0.5 z_1 + 0.1 z_1^2 + 0.3 z_2)\big), \qquad w_0(z) = 1 + z_1^2 + 0.5 z_2,
\]
and
\[
s_1(z) = \exp\big(0.5(1 + z_1 + 0.2 z_1^2 + 0.3 z_2)\big), \qquad w_1(z) = 1 + z_1^2 + z_2,
\]
and $m_j(z)$ is fixed based on these functions to ensure that $E(\varepsilon_j|z) = 0$. The observed data is
\[
y = x y_1 + (1 - x) y_0.
\]
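For concreteness, a simulation of this DGP might look as follows. This is a sketch under stated assumptions: it uses the Azzalini (location, scale, shape) parametrization of the skew normal via scipy.stats.skewnorm, whose mean is $m + s\,\delta\sqrt{2/\pi}$ with $\delta = w/\sqrt{1+w^2}$ (so the centering $m_j(z)$ is available in closed form), and the forms of $s_j$ and $w_j$ follow our reconstruction displayed above.

```python
# Sketch of the DGP under stated assumptions (skew-normal pieces in the
# Azzalini location/scale/shape parametrization; s_j, w_j as reconstructed above).
import numpy as np
from scipy.stats import norm, skewnorm

def simulate(n: int, rng: np.random.Generator):
    # Gaussian copula: unit variances, 0.8 correlation
    u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
    z1 = norm.cdf(u[:, 0])   # uniform (0, 1) marginal
    z2 = u[:, 1]             # standard normal marginal

    # propensity score and Bernoulli treatment indicator
    s = np.sqrt(0.3) * z1 + np.sqrt(0.7) * z2
    p = norm.cdf(0.5 * s**3 * (1.0 - s))
    x = rng.binomial(1, p)

    def eps(s_z, w_z):
        # location m(z) chosen so that E(eps | z) = 0: the SN(m, s, w) mean is
        # m + s * delta * sqrt(2/pi), with delta = w / sqrt(1 + w^2)
        delta = w_z / np.sqrt(1.0 + w_z**2)
        m = -s_z * delta * np.sqrt(2.0 / np.pi)
        return skewnorm.rvs(w_z, loc=m, scale=s_z, random_state=rng)

    # scale and shape functions (as reconstructed in the text above)
    s0 = np.exp(0.5 * (1 + 0.5 * z1 + 0.1 * z1**2 + 0.3 * z2))
    w0 = 1 + z1**2 + 0.5 * z2
    s1 = np.exp(0.5 * (1 + z1 + 0.2 * z1**2 + 0.3 * z2))
    w1 = 1 + z1**2 + z2

    y0 = 10 + z1 + 1.5 * z2 + eps(s0, w0)   # potential outcome, control
    y1 = 10 + 1.5 * z1 - z2 + eps(s1, w1)   # potential outcome, treated
    y = x * y1 + (1 - x) * y0               # observed outcome
    return y, x, np.column_stack([z1, z2])
```

The prior-posterior analysis is then run separately on the control and treated subsamples produced in this way.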

Approximately 42 percent of the subjects emerge as treated under this design. Also note that, because of the extreme nonlinearity of the propensity score function, standard propensity score matching does not perform well with data generated from this design. In addition, any method that is based on direct modeling of the outcome distributions and that is not robust to covariate-dependent heteroskedasticity, or to covariate-dependent skewness, would also not perform well.

Our results in Table 1, which are based on 5 knots for the $n = 500$ case (implying 13 expanded moment conditions created from $z_1$, $z_2$, and $z_1 z_2$) and 7 knots for the larger sample sizes (implying 19 expanded moment conditions), show that the ATE is well inferred in this problem.

            True    Mean    SD      Median  Lower   Upper   Ineff.
n = 500     0.19    0.12    0.12    0.12    -0.11   0.35    1.28
n = 1000    0.20    0.14    0.10    0.14    -0.06   0.33    1.15
n = 2000    0.23    0.25    0.08    0.25     0.10   0.40    1.10

Table 1: Posterior summary for ATE estimation with three data sets. The true ATE for each sample size is indicated by True; Lower and Upper are the bounds of the credible interval, and Ineff. is the MCMC inefficiency factor. The summaries are based on 10,000 MCMC draws beyond a burn-in of 1000. The M-H acceptance rate is around 90% in the estimation of the control and treated models.

References

Donald, S. G., Imbens, G. W. and Newey, W. K. (2003), 'Empirical likelihood estimation and consistent tests with conditional moment restrictions', Journal of Econometrics 117(1), 55–93.

Kleijn, B. and van der Vaart, A. (2012), 'The Bernstein-von-Mises theorem under misspecification', Electronic Journal of Statistics 6, 354–381.

Rosenbaum, P. R. and Rubin, D. B. (1983), 'The central role of the propensity score in observational studies for causal effects', Biometrika 70(1), 41–55.

Van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
