Nonparametric Estimation of Triangular Simultaneous Equations Models under Weak Identification

Sukjin Han∗

Department of Economics

University of Texas at Austin

[email protected]

April 8, 2014

Abstract

This paper analyzes the problem of weak instruments on identification, estimation,

and inference in a simple nonparametric model of a triangular system. The paper de-

rives a necessary and sufficient rank condition for identification, based on which weak

identification is established. Then, nonparametric weak instruments are defined as a

sequence of reduced-form functions where the associated rank shrinks to zero. The

problem of weak instruments is characterized as concurvity and is shown to be similar to the

ill-posed inverse problem, which motivates the introduction of a regularization scheme.

The paper proposes a penalized series estimation method to alleviate the effects of weak

instruments and shows that it achieves desirable asymptotic properties. Monte Carlo

results are presented, and an empirical example is given in which the effect of class size

on test scores is estimated nonparametrically.

∗I am very grateful to my advisors, Donald Andrews and Edward Vytlacil, and committee members, Xiaohong Chen and Yuichi Kitamura, for their inspiration, guidance and support. I am deeply indebted to Donald Andrews for his thoughtful advice throughout the project. The earlier version of this paper has benefited from discussions with Joseph Altonji, Ivan Canay, Philip Haile, Keisuke Hirano, Han Hong, Joel Horowitz, Seokbae Simon Lee, Oliver Linton, Whitney Newey, Byoung Park, Peter Phillips, Andres Santos, and Alex Torgovitsky. I gratefully acknowledge financial support from a Carl Arvid Anderson Prize from the Cowles Foundation. I also thank the seminar participants at Yale, UT Austin, Chicago Booth, Notre Dame, SUNY Albany, Sogang, SKKU, and Yonsei, as well as the participants at NASM and Cowles Summer Conference.

Keywords: Triangular models, nonparametric identification, weak identification, weak

instruments, series estimation, inverse problem, regularization, concurvity.

JEL Classification Numbers: C13, C14, C36.

1 Introduction

Instrumental variables (IVs) are widely used to identify and estimate models with endogenous

explanatory variables. In linear simultaneous equations models, it is well known that standard

asymptotic approximations break down when instruments are weak in the sense that (partial)

correlation between the instruments and endogenous variables is weak. The consequences

of and solutions for weak instruments in linear settings have been extensively studied in

the literature over the past two decades; see, e.g., Bound et al. (1995), Staiger and Stock

(1997), Dufour (1997), Kleibergen (2002, 2005), Moreira (2003), Stock and Yogo (2005),

and Andrews and Stock (2007).1 Weak instruments in nonlinear parametric models have

been studied less in the literature, either in the context of weak identification—e.g., by

Stock and Wright (2000), Han and Phillips (2006), Newey and Windmeijer (2009), Andrews

and Cheng (2012)—or in a particular limited-dependent-variables version of simultaneous

equations models by Han (2012).

One might expect that nonparametric models with endogenous explanatory variables will

generally require stronger identification power than parametric models as there is an infinite

number of unknown parameters to identify, and hence, stronger instruments may be required.2

Despite the problem’s importance and the growing popularity of nonparametric models, weak

instruments in nonparametric settings have not received much attention.3 Furthermore,

surprisingly little attention has been paid to the consequences of weak instruments in applied

research using nonparametric models. Part of the theoretical neglect is due to the existing

complications embedded in nonparametric models.

1 See Andrews and Stock (2007) for an exhaustive survey of the literature on weak instruments.

2 This conjecture is shown to be true in the setting considered in this paper; see Theorem 5.1 and Corollary 5.2.

3 Chesher (2003, 2007) mentions the issue of weak instruments in applying his key identification condition in the empirical example of Angrist and Krueger (1991). Blundell et al. (2007) determine whether weak instruments are present in the Engel curve dataset of their empirical section. They do this by applying the Stock and Yogo (2005) test developed in linear models to their reduced form, which is linearized by sieve approximation. Darolles et al. (2011) briefly discuss weak instruments that are indirectly characterized within their source condition.

In a simple nonparametric framework, this paper analyzes the problem of weak instruments on identification, estimation, and inference, and proposes an estimation strategy to mitigate the effect. Identification results are obtained so that the concept of weak identification can subsequently be introduced via localization. The problem of weak instruments

is characterized as concurvity and is shown to be similar to the ill-posed inverse problem.

An estimation method is proposed through regularization and the resulting estimators are

shown to have desirable asymptotic properties even when instruments are possibly weak.

The model we consider is a triangular simultaneous equations model with additive errors.

Having a form analogous to its popular parametric counterpart, the model is also broadly used

in applied research such as Blundell and Duncan (1998), Yatchew and No (2001), Lyssiotou

et al. (2004), Dustmann and Meghir (2005), Skinner et al. (2005), Blundell et al. (2008),

Del Bono and Weber (2008), Frazer (2008), and Mazzocco (2012). The specification of

weak instruments is intuitive in the triangular model because it has an explicit reduced-form

relationship. Additionally, clear interpretation of the effect of weak instruments can be made

through a specific structure produced by the control function approach. This particular

model is considered in Newey et al. (1999) (NPV) in a situation without weak instruments.

One of the contributions of this paper is that it derives novel identification results in

nonparametric triangular models that complement the existing results in the literature. With

a mild support condition, we show that a particular rank condition is necessary and sufficient

for the identification of the structural relationship. This rank condition is substantially

weaker than what is established in NPV. The rank condition covers economically relevant

situations such as outcomes resulting from corner solutions or kink points in certain economic

models. More importantly, deriving such a rank condition is the key to establishing the notion

of weak identification. Since the condition is minimal, a “slight violation” of it has a binding

effect on identification, hence resulting in weak identification.

To characterize weak identification, we consider a drifting sequence of reduced-form func-

tions that converges to a non-identification region, namely, a space of reduced-form func-

tions that violate the rank condition for identification. Under this localization, the signal

diminishes relative to the noise in the system, and hence, the model is weakly identified.

A particular rate is designated relative to the sample size, which effectively measures the

strength of the instruments, so that it appears in asymptotic results for the estimator of the

structural function. The concept of nonparametric weak instruments generalizes the concept

of weak instruments in linear models such as in Staiger and Stock (1997).

In general, the weak instrument problem can be seen as an inverse problem that is ill-

posed. In the nonparametric control function framework, the problem becomes a nonpara-

metric analogue of a multicollinearity problem known as concurvity (Hastie and Tibshirani

(1986)). Once the endogeneity is controlled by a control function, the model can be rewritten

as an additive nonparametric regression, where the endogenous variables and reduced-form

errors comprise two regressors, and weak instruments result in the variation of the former

regressor being mainly driven by the variation of the latter. This problem of concurvity is

related to the ill-posed inverse problem inherent in other nonparametric models with endo-

geneity or, in general, to settings where smoothing operators are involved; see Carrasco et al.

(2007) for a survey of inverse problems. The similarity of the problems suggests that the

regularization methods used in the literature to solve the ill-posed inverse problem can be

introduced to our problem. The two problems, however, have distinct features, and among

the regularization methods, only penalization (i.e., Tikhonov-type regularization) alleviates

the effect of weak instruments.

This paper proposes a penalized series estimator for the structural function and establishes

its asymptotic properties. We develop a modified version of the standard L2 penalization

to control the penalty bias. Our results on the rate of convergence of the estimator suggest

that weak instruments characterized as concurvity slow down the overall convergence rate,

exacerbating bias and variance “symmetrically.” We then show that a faster convergence

rate is achieved with penalization, while the penalty bias is dominated by the standard

approximation bias. In showing the gain from penalization, this paper derives the decay rates

of coefficients of series expansions, which are related to source conditions (Engl et al. (1996))

in the literature on ill-posed inverse problems. The corresponding rate, which is assumed

to be part of a source condition, is a rather abstract smoothness condition and is agnostic

about dimensionality; see, e.g., Hall and Horowitz (2005). In contrast to the literature, we

derive the decay rates from a conventional smoothness condition such as differentiability and also

incorporate dimensionality. Along with the convergence rate results, we derive consistency

and asymptotic normality with mildly weak instruments.

The problem of concurvity in additive nonparametric models is also recognized in the

statistics literature where different estimation methods are proposed to address the problem—

e.g., the backfitting methods (Linton (1997), Nielsen and Sperlich (2005)) and the integration

method (Jiang et al. (2010)). See Sperlich et al. (1999) for further discussions of those

methods in the context of correlated designs (i.e., covariates). In the present paper, where

an additive model results from the control function approach, the problem of concurvity

is addressed in a more direct manner via penalization. In addition, although the main

conclusions of this paper do not depend on the choice of nonparametric estimation method,

using series estimation in our penalization procedure is also justified in the context of design

density. In situations where the joint density of x and v becomes singular, such as in our case

with weak instruments, it is known that series and local linear estimators are less sensitive

than conventional kernel estimators; see, e.g., Hengartner and Linton (1996) and Imbens and

Newey (2009) for related discussions.

The findings of this paper provide useful implications for empirical work. First, when

estimating a nonparametric structural function, the results of IV estimation and subsequent

inference can be misleading even when the instruments are strong in terms of conventional

criteria for linear models. Second, the symmetric effect of weak instruments on bias and

variance implies that the bias–variance trade-off is the same across different strengths of

instruments, and hence, weak instruments cannot be alleviated by exploiting the trade-off.

Third, penalization, on the other hand, can alleviate weak instruments by significantly reducing

variance and sometimes bias as well. Fourth, there is a trade-off between the smoothness

of the structural function and the requirement of strong instruments. Fifth, the strength of

instruments can be improved by having a nonparametric reduced form so that the nonlinear

relationship between the endogenous variable and instruments can be fully exploited. This

is related to the identification results of this paper.

The rest of the paper is organized as follows. Section 2 introduces the model and obtains

new identification results. Section 3 discusses weak identification and Section 4 relates the

weak instrument problem to the ill-posed inverse problem and defines our penalized series

estimator. Sections 5 and 6 establish the rate of convergence and consistency of the penalized

series estimator and the asymptotic normality of some functionals of it. Section 7 presents

the Monte Carlo simulation results. Section 8 discusses the empirical application of nonpara-

metrically estimating the effect of class size on test scores. In this section, we compare the

estimates with those of Horowitz (2011) which are based on a different nonparametric model.

Finally, Section 9 concludes.

2 Identification

We consider a nonparametric triangular simultaneous equations model

y = g0(x, z1) + ε, x = Π0(z) + v, (2.1a)

E[ε|v, z] = E[ε|v] a.s., E[v|z] = 0 a.s., (2.1b)

where g0(·, ·) is an unknown structural function of interest, Π0(·) is an unknown reduced-

form function, x is a dx-vector of endogenous variables, z = (z1, z2) is a (dz1 + dz2)-vector of

exogenous variables, and z2 is a vector of excluded instruments. The stochastic assumptions

(2.1b) are more general than the assumption of full independence between (ε, v) and z and

E[v] = 0. Following the control function approach,

E[y|x, z] = g0(x, z1) + E[ε|Π0(z) + v, z] = g0(x, z1) + E[ε|v] = g0(x, z1) + λ0(v), (2.2)

where λ0(v) = E[ε|v] and the second equality is from the first part of (2.1b). In effect, we

capture endogeneity (E[ε|x, z] ≠ 0) by an unknown function λ0(v), which serves as a control

function. Another intuition for this approach is that once v is controlled for or conditioned

on, the only variation of x comes from the exogenous variation of z. Based on equation (2.2)

we establish identification, weak identification, and estimation results.
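As an illustration of the two-step logic behind (2.2), the following Python sketch simulates a toy triangular model and compares a series regression that includes the first-stage residual as a control with one that ignores endogeneity. The functional forms g0(x) = x^2 and Π0(z) = 2z, the error distributions, the polynomial orders, and the helper names simulate_triangular and poly are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def simulate_triangular(n, rng):
    """Draw (y, x, z) from a toy triangular model (all functional forms are assumed)."""
    z = rng.uniform(-1.0, 1.0, size=n)
    v = rng.normal(0.0, 0.5, size=n)
    eps = 0.8 * v + rng.normal(0.0, 0.3, size=n)  # E[eps | v] = 0.8 v, so x is endogenous
    x = 2.0 * z + v                                # assumed reduced form Pi_0(z) = 2 z
    y = x**2 + eps                                 # assumed structural function g_0(x) = x^2
    return y, x, z

def poly(u, degree):
    """Power-series basis u, u^2, ..., u^degree without a constant."""
    return np.column_stack([u**k for k in range(1, degree + 1)])

rng = np.random.default_rng(0)
y, x, z = simulate_triangular(20_000, rng)

# First stage: series regression of x on z; the residual estimates v.
Z = np.column_stack([np.ones_like(z), poly(z, 3)])
pi_coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
v_hat = x - Z @ pi_coef

# Second stage: additive series regression of y on a basis in x and a basis in v_hat.
P_cf = np.column_stack([np.ones_like(x), poly(x, 2), poly(v_hat, 2)])
beta_cf, *_ = np.linalg.lstsq(P_cf, y, rcond=None)

# Benchmark: the same basis in x, but without the control function terms.
P_naive = np.column_stack([np.ones_like(x), poly(x, 2)])
beta_naive, *_ = np.linalg.lstsq(P_naive, y, rcond=None)

print("coefficients on (x, x^2) with control function:", beta_cf[1:3])
print("coefficients on (x, x^2) without control function:", beta_naive[1:3])
```

Under this design the true coefficients on (x, x^2) are (0, 1). The regression with the control function should recover them closely, while the regression without it should show a coefficient on x that is visibly away from zero.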

First, we obtain identification results that complement the results of NPV. For useful

comparisons, we first restate the identification condition of NPV which is written in terms of

Π(·). Given (2.2), the identification of g0(x, z1) is achieved if one can separately vary (x, z1)

and v in g(x, z1)+λ(v). Since x = Π0(z)+v, a suitable condition on Π0(·) will guarantee this

via the separate variation of z and v. In light of this intuition, NPV propose the following

identification condition.

Proposition 2.1 (Theorem 2.3 in NPV) If g(x, z1), λ(v), and Π(z) are differentiable,

the boundary of the support of (z, v) has probability zero, and

Pr [rank (∂Π0(z)/∂z2′) = dx] = 1, (2.3)

then g0(x, z1) is identified up to an additive constant.

The identification condition can be seen as a nonparametric generalization of the rank

condition.4 Note that this condition is only a sufficient condition, which suggests that the

model can possibly be identified with a relaxed rank condition. This observation motivates

our identification analysis.

We find a necessary and sufficient rank condition for identification by introducing a mild

support condition. The identification analysis of this section is also important for our later

purpose of defining the notion of weak identification. Henceforth, in order to keep our pre-

sentation succinct, we focus on the case where the included exogenous variable z1 is dropped

from model (2.1) and z = z2. With z1 included, all the results of this paper readily follow

similar lines; e.g., the identification analysis follows conditional on z1.

We first state and discuss the assumptions that we impose.

4 One can readily show that the order condition (dz2 ≥ dx) is incorporated in this rank condition.

Assumption ID1 The functions g(x), λ(v), and Π(z) are continuously differentiable in

their arguments.

This condition is also assumed in Proposition 2.1 above. Before we state a key additional

assumption for identification, we first define the supports that are associated with x and z.

Let X ⊂ R^dx and Z ⊂ R^dz be the marginal supports of x and z, respectively. Also, let Xz be

the conditional support of x given z ∈ Z. We partition Z into two regions where the rank

condition is satisfied, i.e., where z is relevant, and otherwise.

Definition 2.1 (Relevant set) Let Zr be the subset of Z defined by

Zr = Zr(Π0(·)) = {z ∈ Z : rank (∂Π0(z)/∂z′) = dx}.

Let Z0 = Z\Zr be the complement of the relevant set. Let X r be the subset of X defined by X r = {x ∈ Xz : z ∈ Zr}. Given the definitions, we introduce an additional support

condition.

Assumption ID2 The supports X and X r differ only on a set of probability zero, i.e.,

Pr [x ∈ X\X r] = 0.

Intuitively, when z is in the relevant set, x = Π(z) + v varies as z varies, and therefore,

the support of x corresponding to the relevant set is large. Assumption ID2 assures that the

corresponding support is large enough to almost surely cover the entire support of x. ID2

is not as strong as it may appear to be. Below, we show this by providing mild sufficient

conditions for ID2.

If we identify g0(x) for any x ∈ X r, then we achieve identification of g0(x) by Assumption

ID2.5 Now, in order to identify g0(x) for x ∈ X r, we need a rank condition, which will be

minimal. The following is the identification result:

Theorem 2.2 In model (2.1), suppose Assumptions ID1 and ID2 hold. Then, g0(x) is iden-

tified on X up to an additive constant if and only if

Pr [rank (∂Π0(z)/∂z′) = dx] > 0. (2.4)

5 The support on which an unknown function is identified is usually left implicit in the literature. To make it more explicit, g0(x) is identified if g0(x) is identified on the support of x almost surely.

This and all subsequent proofs can be found in the Appendix.

The rank condition (2.4) is necessary and sufficient. By Definition 2.1, it can alternatively

be written as Pr [z ∈ Zr] > 0. The condition is substantially weaker than (2.3) in Proposition

2.1, which is Pr [z ∈ Zr] = 1 (with z = z2). That is, Theorem 2.2 extends the result of NPV

in the sense that when Zr = Z, ID2 is trivially satisfied with X = X r. Theorem 2.2 shows

that it is enough for identification of g0(x) to have any fixed positive probability with which

the rank condition is satisfied. This condition can be seen as the local rank condition as in

Chesher (2003), and we achieve global identification with a local rank condition. Although

this gain comes from having the additional support condition, this support condition is

shown below to be mild, and the trade-off is appealing given the later purpose of building a

weak identification notion. Even without Assumption ID2, maintaining the assumptions of

Theorem 2.2, we still achieve identification of g0(x), but on the set {x ∈ X r}.

Lastly, in order to identify the level of g0(x), we need to introduce some normalization as in

NPV. Either E[ε] = 0 or λ0(v̄) = λ̄ suffices to pin down g0(x). With the latter normalization,

it follows that g0(x) = E[y|x, v = v̄] − λ̄, which we apply in estimation as it is convenient to

implement.

The following is a set of sufficient conditions for Assumption ID2. Let Vz be the condi-

tional support of v given z ∈ Z.

Assumption ID2′ Either (a) or (b) holds. (a) (i) x is univariate and x and v are continuously distributed, (ii) Z is a Cartesian product of connected intervals, and (iii) Vz = Vz̃ for all z, z̃ ∈ Z0; (b) Vz = R^dx for all z ∈ Z.

Lemma 2.1 Under Assumption ID1, Assumption ID2′ implies Assumption ID2.

In Assumption ID2′, the continuity of the r.v. is closely related to the support condition

in Proposition 2.1 that the boundary of support of (z, v) has probability zero. For example,

when z or v is discrete their condition does not hold. Assumption ID2′(a)(i) assumes that

the endogenous variable is univariate, which is most empirically relevant in nonparametric

models. An additional condition is required with multivariate x, which is omitted in this

paper. Even under ID2′(a)(i), however, the exogenous covariate z1 in g(x, z1), which is

omitted in the discussion, can still be a vector. ID2′(a)(ii) and (iii) are rather mild. ID2′(a)(ii)

assumes that z has a connected support, which in turn requires that the excluded instruments

vary smoothly. The assumptions on the continuity of the r.v. and the connectedness of Z are also useful in deriving the asymptotic theory of the series estimator considered in this paper; see Assumption B below. ID2′(a)(iii) means that the conditional support of v given z is invariant when z is in Z0. This support invariance condition is the key to obtaining a rank condition that is considerably weaker than that of NPV. ID2′(a)(iii), along with the control function assumptions (2.1b), is a weaker orthogonality condition for z than the full independence condition z ⊥ v.6 Note that Vz = {x − Π0(z) : x ∈ Xz}. Therefore, ID2′(a)(iii) equivalently means that Xz is invariant for z such that E[x|z] = const. This condition can be checked from the data.

Figure 1: Identification under Assumption ID2′(a), univariate z and no z1.

Given ID2′(b) that v has a full conditional support, ID2 is trivially satisfied and no

additional restriction is imposed on the joint support of z and v. ID2′(b) also does not

require univariate x or the connectedness of Z. This assumption on Vz is satisfied with, for

example, a normally distributed error term (conditional on regressors).

Figure 1 illustrates the intuition of the identification proof under ID2′(a) in a simple case

where z is univariate.7 In the figure, the local rank condition (2.4) ensures global identification

of g0(x). The intuition is as follows. First, by ∂E[y|v, z]/∂z = (∂g0(x)/∂x) · (∂Π0(z)/∂z)

and the rank condition, g0(x) is locally identified on x corresponding to a point of z in the

relevant set Zr. As such a point of z varies within Zr, the x corresponding to it also varies

enough to cover almost the entire support of x. At the same time, for any x corresponding

to an irrelevant z (i.e., z outside of Zr), one can always find z inside of Zr that gives the

same value of such an x. The probability Pr [z ∈ Zr] being small but bounded away from

zero only affects the efficiency of estimators in the estimation stage. This issue is related to

the weak identification concept discussed later; see Section 3.

6 With ID2′(a)(iii), the heteroskedasticity of v which is allowed by having E[ε|v, z] = E[ε|v] may or may not be restricted.

7 With ID2′(b), the analysis is even more straightforward; see the proof of Lemma 2.1 in the Appendix.

Note that the strength of identification of g0(x) is different for different subsets of X .

For instance, identification must be strong in a subset of X corresponding to a subset of Z where Π0(·) is steep. In addition, g0(x) is over-identified on a subset of X that corresponds

to multiple subsets of Z where Π0(·) has a nonzero slope, since each association of x and z

contributes to identification. This discussion implies that the shape of Π0(·) provides useful

information on the strength of identification in different parts of the domain of g0(x).

Lastly, it is worth mentioning that the separable structure of the reduced form along

with ID2′(a)(iii) allows “global extrapolation” in a manner that is analogous to that in a

linear model. With a linear model for the reduced form, the local rank condition (2.4)

is a global rank condition. That is, the linearity of the function contributes to globally

extrapolating the reduced-form relationship. Likewise, the identification results of this paper

imply that although the reduced-form function is nonparametric, the way that the additive

error interacts with the invariant support enables the global extrapolation of the relationship.

The identification results of this section apply to economically relevant situations. Let

x be an economic agent’s optimal decision induced by an economic model and z be a set

of exogenous components in the model that affects the decision x. One is interested in a

nonlinear effect of the optimal choice on a certain outcome y in the model. We present two

situations where the resulting Pr [z ∈ Zr] is strictly less than unity in this economic problem:

(a) x is realized as a corner solution beyond a certain range of z. In a returns-to-schooling

example, x can be the schooling decision of a potential worker, z the tuition cost or distance

to school, and y the future earnings. When the tuition cost is too high or the distance to

school is too far beyond a certain threshold, such an instrument may no longer affect the

decision to go to school. (b) The budget set has kink points. In a labor supply curve example,

x is the before-tax income, which is determined by the labor supply decision, z the worker’s

characteristics that shift her utility function, and y the wage. If an income tax schedule

has kink points, then the x realized at such points will possibly be invariant at the shift of

the utility. The identification results of this paper imply that even in these situations, the

returns to schooling or the labor supply curve can be fully identified nonparametrically as

long as Pr [z ∈ Zr] > 0.
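To make the corner-solution case concrete, the sketch below estimates Pr [z ∈ Zr] by Monte Carlo for a scalar z and a scalar x, in which case the rank condition reduces to a nonzero slope of the reduced form. The reduced form min(z, threshold), the uniform distribution of z, and the function names reduced_form and relevant_share are hypothetical choices for illustration. The kink itself is not continuously differentiable, so this stylized function only approximates a setting that satisfies Assumption ID1.

```python
import numpy as np

def reduced_form(z, threshold=1.0):
    """Hypothetical reduced form with a corner: x responds to z only below the threshold."""
    return np.minimum(z, threshold)

def relevant_share(z_draws, pi, step=1e-4, tol=1e-8):
    """Monte Carlo estimate of Pr[z in Z^r] for scalar z and scalar x.

    With d_x = d_z = 1 the rank condition holds exactly where dPi/dz is nonzero,
    which is checked here with a central finite difference.
    """
    slope = (pi(z_draws + step) - pi(z_draws - step)) / (2.0 * step)
    return float(np.mean(np.abs(slope) > tol))

rng = np.random.default_rng(1)
z = rng.uniform(0.0, 4.0, size=100_000)   # assumed distribution of the instrument
share = relevant_share(z, reduced_form)
print("estimated Pr[z in Z^r]:", share)
```

In this design the instrument is relevant only for z below the threshold, so the estimated probability should be near one quarter: strictly positive, as (2.4) requires, but far from the probability-one requirement in (2.3).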

3 Weak Identification

The structure of the joint distribution of x and z that contributes to the identification of g(·) is discussed in the previous section. Specifically, the rank condition (2.4) imposes a minimal

restriction for identification on the shape of the conditional mean function E[x|z] = Π(z).

This necessity result suggests that a “slight violation” of (2.4) will result in weak identification

of g(·). Note that this approach will not be successful with (2.3) of NPV, since violating the

condition, i.e., Pr [rank (∂Π0(z)/∂z′) = dx] < 1, can still result in strong identification.

In this section, we formally construct the notion of weak identification via localization.

We define nonparametric weak instruments as a drifting sequence of reduced-form functions

that are localized around a function with no identification power. Such a sequence of models

or drifting data-generating process (Davidson and MacKinnon (1993)) is introduced to define

weak instruments relative to the sample size n. As a result, the strength of instruments is

represented in terms of the rate of localization, and hence, it can eventually be reflected in

the local asymptotics of the estimator of g0(·).

Let C(Z) be the class of conditional mean functions Π(·) on Z that are bounded, Lipschitz, and continuously differentiable. Note that the derivative of Π(·) is bounded by the

Lipschitz condition. Define a non-identification region C0(Z) as a class of functions that

satisfy the lack-of-identification condition, namely, the contraposition of the necessity part

of the rank condition8: C0(Z) = {Π(·) ∈ C(Z) : Pr [rank (∂Π(z)/∂z′) < dx] = 1}. Define an

identification region as C1(Z) = C(Z)\C0(Z). We consider a sequence of triangular models

y = g0(x) + ε and x = Πn(z) + v with corresponding stochastic assumptions, where Πn(·) is

a drifting sequence of functions in C1(Z) that drifts in a certain sense to a function Π̄(·) in

C0(Z). Although g(x) is identified with Πn(·) ∈ C1(Z) for any fixed n by Theorem 2.2, g(x)

is only weakly identified as Πn(·) drifts toward Π̄(·). Namely, the noise (i.e., v) contributes

more than the signal (i.e., Πn(z)) to the total variation of x ∈ {Πn(z) + v : z ∈ Z, v ∈ V} as

n → ∞. In order to facilitate a meaningful asymptotic theory in which the effect of weak

instruments is reflected, we further proceed by considering a specific uniform convergent

sequence of Πn(·).

Assumption L (Localization) For some δ > 0, the true reduced-form function Πn(·) satisfies the following. For some Π(·) ∈ C1(Z) that does not depend on n and for z ∈ Z,

∂Πn(z)/∂z′ = n^{−δ} · ∂Π(z)/∂z′ + op(n^{−δ}).

8 The lack of identification condition is satisfied either when the order condition fails (dz < dx), or when z are jointly irrelevant for one or more of x, almost everywhere in their support.

The value of δ measures the strength of identification.9 Assumption L is equivalent to

Πn(z) = n^{−δ} · Π(z) + c + op(n^{−δ}) (3.1)

for some constant vector c. This “local nesting” device is also used in Stock and Wright (2000)

and Jun and Pinkse (2012) among others. Unlike a linear reduced form, to characterize weak

instruments in a more general nonparametric reduced form, we need to control the complete

behavior of the reduced-form function, and the derivation of local asymptotic theory seems

to be more demanding. Nevertheless, Assumption L makes the weak instrument asymptotic

theory straightforward while embracing the most interesting local alternatives against non-

identification.
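The role of δ in (3.1) can be visualized with a small simulation that tracks how much of the variation in x is due to the instrument as n grows. This is only a sketch: the choice Π(z) = sin(πz), c = 0, the standard normal v, and the name signal_share are assumptions for illustration, and the remainder term in (3.1) is set to zero.

```python
import numpy as np

def signal_share(n, delta, rng):
    """Share of Var(x) explained by the signal when Pi_n(z) = n^(-delta) * Pi(z)."""
    z = rng.uniform(-1.0, 1.0, size=n)
    v = rng.normal(0.0, 1.0, size=n)
    signal = n ** (-delta) * np.sin(np.pi * z)   # assumed Pi(z) = sin(pi z), c = 0
    x = signal + v
    return float(np.var(signal) / np.var(x))

rng = np.random.default_rng(2)
sample_sizes = [1_000, 10_000, 100_000]
shares_weak = [signal_share(n, 0.25, rng) for n in sample_sizes]    # delta = 0.25
shares_strong = [signal_share(n, 0.0, rng) for n in sample_sizes]   # delta = 0, no drift

for n, weak, strong in zip(sample_sizes, shares_weak, shares_strong):
    print(f"n={n:>7}: share with delta=0.25: {weak:.4f}, with delta=0: {strong:.4f}")
```

With δ = 0 the share stays roughly constant across sample sizes, whereas with δ > 0 it shrinks toward zero, which is the sense in which the noise dominates the signal under the localization.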

4 Estimation

Once the endogeneity is controlled by the control function in (2.2), the problem becomes one

of estimating the additive nonparametric regression function E[y|x, z] = g0(x) + λ0(v). In

a weak instrument environment, however, we face a nonstandard problem called concurvity: x = Πn(z) + v → v a.s. as n → ∞ under the weak instrument specification (3.1) of Assumption L with c = 0 as normalization. With a series representation g0(x) + λ0(v) = ∑_{j=1}^∞ {β1j pj(x) + β2j pj(v)}, where the pj(·)’s are the approximating functions, it becomes a familiar problem of multicollinearity as pj(x) → pj(v) a.s. for all j. More precisely, pj(x) − pj(v) = Oa.s.(n^{−δ}) by the mean value expansion pj(v) = pj(x − n^{−δ}Π(z)) = pj(x) − n^{−δ}Π(z)∂pj(x̃)/∂x with an intermediate value x̃.10 Alternatively, by plugging pj(v) back into the series, we obtain E[y|x, z] = ∑_{j=1}^∞ {(β1j + β2j)pj(x) − β2j n^{−δ}Π(z)∂pj(x̃)/∂x}, and n^{−δ}Π(z)∂pj(x̃)/∂x can be seen as a regressor whose variation shrinks as n → ∞.
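The multicollinearity described above can be checked numerically. The sketch below builds a power-series design in x and v and reports the correlation between matching basis terms together with the condition number of the design matrix. It assumes Π(z) = z, c = 0, a power-series basis, and an observed v (in practice v would be replaced by a first-stage residual); the name concurvity_diagnostics is hypothetical.

```python
import numpy as np

def concurvity_diagnostics(n, delta, rng, degree=3):
    """Return corr(p_j(x), p_j(v)) for j = 1..degree and the design condition number."""
    z = rng.uniform(-1.0, 1.0, size=n)
    v = rng.normal(0.0, 1.0, size=n)
    x = n ** (-delta) * z + v                     # assumed Pi(z) = z with c = 0
    corr = [float(np.corrcoef(x**j, v**j)[0, 1]) for j in range(1, degree + 1)]
    # Additive design: constant, power series in x, power series in v.
    P = np.column_stack(
        [np.ones(n)]
        + [x**j for j in range(1, degree + 1)]
        + [v**j for j in range(1, degree + 1)]
    )
    return corr, float(np.linalg.cond(P))

rng = np.random.default_rng(3)
corr_strong, cond_strong = concurvity_diagnostics(50_000, 0.0, rng)   # no drift
corr_weak, cond_weak = concurvity_diagnostics(50_000, 0.4, rng)       # drifting reduced form

print("delta=0.0: correlations", np.round(corr_strong, 4), "condition number", round(cond_strong, 1))
print("delta=0.4: correlations", np.round(corr_weak, 4), "condition number", round(cond_weak, 1))
```

When the reduced form drifts, each pj(x) becomes nearly indistinguishable from pj(v) and the design matrix becomes close to singular, even though the number of series terms is small and fixed.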

This feature is reminiscent of the ill-posed inverse problem in the literature. The ill-

posed inverse problem is a function space inversion problem that commonly occurs, e.g.,

in estimating a standard nonparametric IV (NPIV) model of y = g0(x) + ε with endogenous x and E[ε|z] = 0. Again by writing g0(x) = ∑_{j=1}^∞ β1j pj(x), it follows that E[y|z] = ∑_{j=1}^∞ β1j E[pj(x)|z]. Analogous to the weak instruments problem, the variation of the regressor E[pj(x)|z] shrinks since E[E[pj(x)|z]^2] → 0 as j → ∞ (Kress (1999, p. 235)). Blundell and Powell (2003, p. 321) also acknowledge that the ill-posed inverse problem is a functional analogue to the multicollinearity problem in a classical linear regression model.

9 It would be interesting to have different rates across columns or rows of ∂Π(·)/∂z2′. One can also consider different rates for different elements of the matrix. The analyses in these cases can analogously be done by slight modifications of the arguments.

10 This expansion is useful later in deriving the local asymptotic results in Sections 5 and 6.

In order to tackle the ill-posed inverse problem, two approaches are taken in the NPIV

literature: the truncation method and the penalization method. Using the foregoing example,

the truncation method is to regularize the problem by truncating the infinite sum, that is,

by considering j ≤ Jn for some Jn <∞ for all n. In this way, the variation of the regressor

E [pj(x)|z] is prevented from shrinking. The penalization method is used to directly control

the behavior of coefficients β1j’s for all j < ∞ by penalizing them while maintaining the

original infinite-dimensional approximation. For further discussion, see, e.g., Chen and Pouzo

(2012).11

Given the connections between the weak instruments, concurvity, and ill-posed inverse

problems, the regularization methods used in the realm of research concerning inverse prob-

lems (Carrasco et al. (2007)) are suitable for use with weak instruments. In this paper, we

introduce the penalization scheme. The nature of our problem is such that the truncation

method does not work properly. Unlike in the ill-posed inverse problem, the estimators of

β1j and β2j above can still be unstable even after truncating the series since we still have

pj(x) → pj(v) a.s. for j ≤ J < ∞ as n → ∞. On the other hand, the penalization directly

controls the behavior of the β1j’s and β2j’s, and hence, it successfully regularizes the weak

instrument problem.

We propose a penalized series estimation procedure for h0(w) = g0(x) + λ0(v) where

w = (x, v). We choose to use series estimation rather than other nonparametric methods

as it is more suitable in our particular framework. Because x → v a.s., the joint density

of w becomes concentrated along a lower dimensional manifold as n tends to infinity. As

mentioned in the introduction, series estimators are less sensitive to this problem of singular

density. See Assumption B below for related discussions. Furthermore, with series estimation,

it is easy to impose the additivity of h0(·) and to characterize the problem of weak instruments

as a multicollinearity problem.

Since the reduced-form error v is unobserved, the procedure takes two steps. In the first

stage, we estimate the reduced form Πn(·) using a standard series estimation method and

11In Chen and Pouzo (2011) closely related concepts are used with different terminologies: minimizing a criterion over finite sieve space and minimizing a criterion over infinite sieve space with a Tikhonov-type penalty. They introduce the penalized sieve minimum distance estimator, which essentially incorporates both the truncation and penalization methods. See also, e.g., Newey and Powell (2003), Ai and Chen (2003), Blundell et al. (2007), Hall and Horowitz (2005), Horowitz and Lee (2007) for regularization methods used in the NPIV framework. The regularization approach for multicollinearity in linear regression models is a biased estimation method called ridge regression (Hoerl and Kennard (1970)), which can be seen as a penalization method.


obtain the residual $\hat{v}$. In the second stage, we estimate the structural function $h_0(\cdot)$ using a penalized series estimation method with $\hat{w} = (x, \hat{v})$ as the regressors. The theory that follows uses orthogonal polynomials as approximating functions in series estimation. The orthogonal polynomials considered in this paper are the Chebyshev and Legendre polynomials.12 Let $\{(y_i, x_i, z_i)\}_{i=1}^{n}$ be the data with $n$ observations, and let $r^L(z_i) = (r_1(z_i), ..., r_L(z_i))'$ be a vector of approximating functions of order $L$ for the first stage. Define an $n \times L$ matrix $R = (r^L(z_1), ..., r^L(z_n))'$. Then, regressing $x_i$ on $r^L(z_i)$ gives $\hat{\Pi}(\cdot) = r^L(\cdot)'\hat{\gamma}$ where $\hat{\gamma} = (R'R)^{-1}R'(x_1, ..., x_n)'$, and we obtain $\hat{v}_i = x_i - \hat{\Pi}(z_i)$. Define a vector of approximating functions of order $K$ for the second stage as $p^K(w) = (p_1(w), ..., p_K(w))'$. To reflect the additive structure of $h_0(\cdot)$, there are no interaction terms between the approximating functions for $g_0(\cdot)$ and those for $\lambda_0(\cdot)$ in this vector; see the Appendix for an explicit expression. This structure is the key to the dimension reduction featured in the asymptotic results. Denote an $n \times K$ matrix of approximating functions as $\hat{P} = (p^K(\hat{w}_1), ..., p^K(\hat{w}_n))'$ where $\hat{w}_i = (x_i, \hat{v}_i)$. Note that $L = L(n)$ and $K = K(n)$ grow with $n$, which implies that the size of the matrix $\hat{P}$ as well as that of $R$ also grows with $n$.
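As a rough illustration of the two matrices just described, the following sketch builds a first-stage Chebyshev series regression and an additive second-stage basis without interaction terms. It is only a sketch under our own assumptions: the function names, the rescaling of each variable to [-1, 1] by its sample range, the layout of the basis (one constant, then K1 terms in x, then K2 terms in the residual), and the simulated data are all hypothetical and are not taken from the paper or its Appendix.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def to_unit_interval(u):
    """Affinely rescale a sample to [-1, 1], the natural domain of Chebyshev polynomials."""
    lo, hi = u.min(), u.max()
    return 2.0 * (u - lo) / (hi - lo) - 1.0

def cheb_basis(u, order, include_constant=True):
    """Columns T_0(u), ..., T_order(u); drop T_0 when include_constant is False."""
    B = C.chebvander(u, order)  # shape (n, order + 1)
    return B if include_constant else B[:, 1:]

def first_stage(x, z, L):
    """Series regression of x on r^L(z); returns fitted values and residuals."""
    R = cheb_basis(to_unit_interval(z), L - 1)          # n x L matrix R
    gamma_hat, *_ = np.linalg.lstsq(R, x, rcond=None)   # least squares coefficients
    pi_hat = R @ gamma_hat
    return pi_hat, x - pi_hat

def additive_basis(x, v_hat, K1, K2):
    """Constant, K1 terms in x, K2 terms in v_hat; no interaction terms (assumed layout)."""
    Px = cheb_basis(to_unit_interval(x), K1, include_constant=False)
    Pv = cheb_basis(to_unit_interval(v_hat), K2, include_constant=False)
    return np.column_stack([np.ones(len(x)), Px, Pv])

# Hypothetical data, used only to exercise the functions.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 2.0 + 0.5 * z + v

pi_hat, v_hat = first_stage(x, z, L=4)
P_hat = additive_basis(x, v_hat, K1=3, K2=3)
print(P_hat.shape)
```

Because the basis contains no products of functions of x and functions of the residual, the number of columns grows like K1 + K2 rather than K1 times K2, which is the source of the dimension reduction mentioned above.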

We define a penalized series estimator:

$$\hat{h}_\tau(w) = p^K(w)'\hat{\beta}_\tau, \tag{4.1}$$

where the “interim” estimator $\hat{\beta}_\tau$ optimizes a penalized criterion function

$$\hat{\beta}_\tau = \arg\min_{\beta\in\mathbb{R}^K}\ (y - \hat{P}\beta)'(y - \hat{P}\beta)/n + \tau_n\beta' D_{K^*}\beta, \tag{4.2}$$

where $y = (y_1, ..., y_n)'$, $D_{K^*} = \mathrm{Diag}\{0, ..., 0, 1, ..., 1\}$ is a diagonal matrix that satisfies $\beta' D_{K^*}\beta = \sum_{j=K^*+1}^{K}\beta_j^2$, and $\tau_n \geq 0$ is the penalization parameter, which converges to zero as $n$ grows. In (4.2), the standard $L_2$ penalty is tailored to meet our purpose. The penalty term $\tau_n\beta' D_{K^*}\beta$ penalizes the coefficients on the higher-order series, which effectively imposes smoothness restrictions on $h_0(\cdot)$. At the same time, $K^* = K^*(n)$ is allowed to grow with $n$ so that we can control the bias created by the penalization. See Assumption D below for related discussions. For the same purpose, $\tau_n$ is assumed to converge to zero. The optimization problem (4.2) yields a closed form solution:

$$\hat{\beta}_\tau = (\hat{P}'\hat{P} + n\tau_n D_{K^*})^{-1}\hat{P}'y.$$
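The closed form can be checked numerically against the first-order condition of (4.2). The sketch below is a minimal illustration on simulated inputs; the function names `penalized_beta` and `criterion`, the random design matrix, and the particular values of n, K, K* and τ are our own hypothetical choices.

```python
import numpy as np

def penalized_beta(P, y, tau, K_star):
    """Closed form (P'P + n * tau * D)^{-1} P'y with D = Diag{0,...,0,1,...,1}."""
    n, K = P.shape
    d = np.zeros(K)
    d[K_star:] = 1.0  # the first K_star coefficients are left unpenalized
    return np.linalg.solve(P.T @ P + n * tau * np.diag(d), P.T @ y)

def criterion(beta, P, y, tau, K_star):
    """Objective in (4.2): mean squared residual plus the tailored L2 penalty."""
    resid = y - P @ beta
    return resid @ resid / P.shape[0] + tau * np.sum(beta[K_star:] ** 2)

# Hypothetical inputs, used only to exercise the formula.
rng = np.random.default_rng(1)
n, K, K_star, tau = 200, 6, 2, 0.05
P = rng.normal(size=(n, K))
y = P @ np.arange(1.0, K + 1.0) + rng.normal(size=n)

beta_tau = penalized_beta(P, y, tau, K_star)

# Gradient of the criterion at beta_tau; it should vanish at the minimizer.
penalty_grad = np.concatenate([np.zeros(K_star), beta_tau[K_star:]])
grad = -2.0 * P.T @ (y - P @ beta_tau) / n + 2.0 * tau * penalty_grad
print(np.max(np.abs(grad)))
```

Setting τ to zero in `penalized_beta` returns the ordinary least squares coefficients, so the unpenalized series estimator is nested as a special case.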

12The results of this paper also follow analogously using Fourier series. For detailed descriptions of power series in general, see Newey (1997).


The multicollinearity feature discussed above is manifested here by the fact that the matrix

P ′P is nearly singular under Assumption L, since the two columns of P become nearly

identical. The term nτnDK∗ mitigates such singularity, without which the performance of

the estimator of h(·) would deteriorate severely.13 The relative effects of weak instruments

and penalization determine the asymptotic performance of $\hat{h}_\tau(\cdot)$. Given $\hat{h}_\tau(\cdot)$, with the normalization that $\lambda_0(\bar{v}) = \bar{\lambda}$, we have $\hat{g}_\tau(x) = \hat{h}_\tau(x, \bar{v}) - \bar{\lambda}$.
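The near singularity and its mitigation by the penalty can be seen in a small numerical experiment. The sketch below is a deliberately simplified setting of our own choosing: a linear basis (1, x, v), the true v used in place of the first-stage residual, a linear reduced form x = πz + v, and D = Diag{0, 1, 1}. None of these choices comes from the paper.

```python
import numpy as np

def min_eig_second_moment(pi, tau=0.0, n=2000, seed=0):
    """Smallest eigenvalue of P'P/n + tau * D for the basis (1, x, v) with x = pi*z + v."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    x = pi * z + v                           # as pi shrinks, the x column approaches the v column
    P = np.column_stack([np.ones(n), x, v])
    Q = P.T @ P / n
    D = np.diag([0.0, 1.0, 1.0])             # constant term left unpenalized (assumed K* = 1)
    return np.linalg.eigvalsh(Q + tau * D)[0]  # eigvalsh returns eigenvalues in ascending order

strengths = (1.0, 0.1, 0.01)
unpenalized = [min_eig_second_moment(pi) for pi in strengths]
penalized = [min_eig_second_moment(pi, tau=0.01) for pi in strengths]

for pi, lam0, lam1 in zip(strengths, unpenalized, penalized):
    print(f"pi={pi:5.2f}  lambda_min(Q)={lam0:.2e}  lambda_min(Q + tau*D)={lam1:.2e}")
```

In this toy design the smallest eigenvalue without the penalty falls roughly in proportion to π squared, while the penalized matrix keeps its smallest eigenvalue near τ even when the instrument is very weak.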

5 Consistency and Rate of Convergence

First, we state the regularity conditions and key preliminary results under which we find the

rate of convergence of the penalized series estimator introduced in the previous section. Let

X = (x, z).

Assumption A {(yi, xi, zi) : i = 1, 2, ...} are i.i.d. and var(x|z) and var(y|X) are bounded

functions of z and X, respectively.14

Assumption B (z, x, v) is continuously distributed with density that is bounded away from

zero on Z × X × V, and Z × X × V is a Cartesian product of compact, connected intervals.15

Assumption B is useful to bound below and above the eigenvalues of the “transformed”

second moment matrix of approximating functions. This condition is worthy of discussion

in the context of identification and weak identification. Let fu and fw denote the density

functions of u = (z, v) and w = (x, v), respectively. An identification condition like As-

sumption ID2′ in Section 2 is embodied in Assumption B. To see this, note that fu being

13One can easily see that a similar feature can be found in a linear setting analogous to the present triangular model on applying a control function approach there. Note that the control function estimator is equivalent to the usual two-stage least squares (TSLS) estimator in such a linear model. Therefore, the problem of weak instruments with the TSLS estimator analyzed in Staiger and Stock (1997) can be seen as a multicollinearity problem with the “control function estimator.” In linear settings, however, the introduction of a regularization method discussed in the present paper is not appealing, since it creates the well-known biased estimator of ridge regression. In contrast, we do not directly interpret $\hat{\beta}_\tau$ in the current nonparametric setting, since it is only an interim estimator calculated to obtain $\hat{h}_\tau(\cdot)$. More importantly, the overall bias of $\hat{h}_\tau(\cdot)$ is unlikely to be worsened in the sense that the additional bias introduced by penalization can be dominated by the existing series estimation bias. This is, in fact, proved to be true in the convergence rate by letting $K^*$ grow with $n$ as $K$ does.

14As Newey (1997) points out, the bounded conditional variance assumption is difficult to relax without affecting the convergence rates.

15The assumption for the Cartesian products of supports, namely Z × X × V, and its compactness can be replaced by introducing a trimming function as in NPV that ensures bounded rectangular supports. Assumption B can be weakened to hold only for some component of the distribution of z. Some components of z can be allowed to be discrete as long as they have finite supports.


bounded away from zero means that there is no functional relationship between z and v,

which in turn implies Assumption ID2′(a)(iii).16 On the other hand, an assumption written

in terms of fw like Assumption 2 in NPV (p. 574) cannot be imposed here. Observe that

w = (Πn(z) + v, v) depends on the behavior of Πn(·), and hence, fw is not bounded away

from zero uniformly over n under Assumption L, but instead approaches a singular density.

Technically, making use of the mean value expansion and a transformation matrix (see the

Appendix), an assumption is made in terms of fu, which is not affected by weak instruments,

and the effect of weak instruments can be handled separately in the asymptotics proof.

Next, Assumption C is a smoothness assumption on the structural and reduced-form functions. For a generic dimension $d$ and a $d$-vector $\mu$ of nonnegative integers, let $|\mu| = \sum_{l=1}^{d}\mu_l$. Define the derivative $\partial^{\mu}g(x) = \partial^{|\mu|}g(x)/\partial x_1^{\mu_1}\partial x_2^{\mu_2}\cdots\partial x_{d_x}^{\mu_{d_x}}$ of order $|\mu|$. Let $\omega(\cdot)$ be a positive continuous weight function on $\mathcal{X}$ and let $\|g\|_{\mathcal{X},\omega} = \int_{\mathcal{X}}|\partial^{(1,...,1)'}g(x)|\,\omega(x)\,dx$ be a weighted seminorm of $g(\cdot)$.17

Assumption C $g_0(x)$ and $\lambda_0(v)$ are Lipschitz and absolutely continuously differentiable of order $s$ on $\mathcal{W}$, and $\|\partial^{\mu}g\|_{\mathcal{X},\omega} + \|\partial^{\mu}\lambda\|_{\mathcal{V},\omega}$ is bounded by a fixed constant for $|\mu| = s$. $\Pi_n(z)$ is bounded, Lipschitz, and absolutely continuously differentiable of order $s_\pi$ on $\mathcal{Z}$, and for the $m$-th element $\Pi_{n,m}(z)$, $\|\partial^{\mu}\Pi_{n,m}\|_{\mathcal{Z},\omega}$ is bounded by a fixed constant for $|\mu| = s_\pi$, and for all $n$ and $m \leq d_x$.

As shown in the following lemma, this assumption on differentiability and bounded varia-

tion ensures that the coefficients on the series expansions have particular decay rates and that

the approximation errors shrink at particular rates as the number of approximating functions

increases. Recall that pj(w) and rj(z) are (the tensor products of) orthogonal polynomials

on W = X ×V and Z, respectively. We present results for the Chebyshev polynomials here,

and results for the Legendre polynomials can be found in the Appendix.

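Lemma 5.1 Under Assumptions B and C, the series expansions $\sum_{j=1}^{\infty}\beta_j p_j(w)$ of $h_0(w)$ and $\sum_{j=1}^{\infty}\gamma_{nj}r_j(z)$ of $\Pi_n(z)$ for all $n$ are uniformly convergent, and for a generic positive constant $c$ and for all $n$, (i) $\beta_j \leq c j^{-s/d_x-1}$ and $\gamma_{nl} \leq c l^{-s_\pi/d_z-1}$; (ii) $\sup_{w\in\mathcal{W}}\left|h_0(w) - \sum_{j=1}^{K}\beta_j p_j(w)\right| \leq cK^{-s/d_x}$ and $\sup_{z\in\mathcal{Z}}\left\|\Pi_n(z) - \sum_{j=1}^{L}\gamma_{nj}r_j(z)\right\| \leq cL^{-s_\pi/d_z}$ for all positive integers $K$ and $L$.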

Lemma 5.1(i) is closely related to source conditions used in the NPIV literature, where

the decay rate of coefficients is assumed to be part of them. For example, Hall and Horowitz

16The definition of a functional relationship can be found, e.g., in NPV (p. 568).

17For a univariate example of this seminorm, see (4.5) of Trefethen (2008).


(2005) assume that the rate is $O(j^{-\beta})$ for some constant $\beta > 1/2$. Darolles et al. (2011) incorporate a similar assumption within a source condition. See also Chen and Pouzo (2012). Such

an assumption is an abstract assumption on smoothness and is agnostic about dimensionality.

This paper does not assume but explicitly derives the decay rate from a conventional smooth-

ness condition of Assumption C. In this way, Lemma 5.1 connects the abstract smoothness

condition (i.e., a source condition) with a more standard smoothness concept (i.e., differen-

tiability). Note that the result is also informative about dimensionality. This result is novel

in the literature and is useful later in showing that the rate of the penalty bias is dominated

by the rate of the approximation bias. The approximation error bound of (ii) is implied by (i).18 In Lemma 5.1(i) and hence in (ii), the dimension of w is reduced to the dimension of

x as the additive structure of h0(w) is exploited.19

The next regularity condition restricts the rate of growth of the numbers K and L of

the approximating functions and $K^*$ in the penalty term. Let $a_n \asymp b_n$ denote that $a_n/b_n$ is bounded below and above by constants that are independent of $n$.

Assumption D $n^{2\delta}K^{7/2}(\sqrt{L/n} + L^{-s_\pi/d_z}) \to 0$, $L^3/n \to 0$, and $n^{-\delta}K^6 \to 0$. Also, $K > 2(K^* + 1)$ for any $n$ and $K^* \asymp K$.

The conditions on K and L are more restrictive than the corresponding assumption for

the power series in NPV (Assumption 4, p. 575) where weak instruments are not considered.

The last inequality requires that K be sufficiently larger than K∗ so that the penalty term

τnβ′DK∗β plays a sufficient role. Technically, this condition allows the derivation of the

asymptotic results to be similar to a simpler case of a fixed K∗.

Now, we provide results for the rate of convergence in probability of the penalized series estimator $\hat{h}_\tau(w)$ in terms of $L_2$ and uniform distance. Let $F_w(w) = F(w)$ be the distribution function of $w$.

Theorem 5.1 Suppose Assumptions A–D and L are satisfied. Let $R_n = \min\{n^{\delta}, \tau_n^{-1/2}\}$. Then

$$\left\{\int\left[\hat{h}_\tau(w) - h_0(w)\right]^2 dF(w)\right\}^{1/2} = O_p\left(R_n\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right).$$

Also,

$$\sup_{w\in\mathcal{W}}\left|\hat{h}_\tau(w) - h_0(w)\right| = O_p\left(R_n \cdot K\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right).$$

18No result analogous to (i) is derived in Lorentz (1986) when showing (ii), but a different approach is used under an assumption similar to Assumption C of the present paper. See Theorem 8 of Lorentz (1986, p. 90).

19Relevant discussion can be found in Assumption A.2 and p. 472 of Andrews and Whang (1990) and in Theorem 3.2 of Powell (1981, p. 26).

Suppose there is no penalization ($\tau_n = 0$). Then with $R_n = n^{\delta}$, Theorem 5.1 provides the rates of convergence of the unpenalized series estimator $\hat{h}(\cdot)$. For example, with $\|\cdot\|_{L_2}$ denoting the $L_2$ norm above,

$$\left\|\hat{h} - h_0\right\|_{L_2} = O_p\left(n^{\delta}\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right). \tag{5.1}$$

Compared to the strong instrument case of NPV (Lemma 4.1, p. 575), the rate deteriorates by the leading $n^{\delta}$ rate, the weak instrument rate. Note that the terms $\sqrt{K/n}$ and $K^{-s/d_x}$ correspond to the variance and bias of the second stage estimator, respectively, and $\sqrt{L/n}$ and $L^{-s_\pi/d_z}$ are those of the first stage estimator.20 The way that $n^{\delta}$ enters into the rate implies that the effect of weak instruments (hence concurvity) exacerbates not only the variance but also the bias. This is different from a linear case where multicollinearity only

results in imprecise estimates but does not introduce bias. Moreover, the symmetric effect of

weak instruments on bias and variance implies that the problem of weak instruments cannot

be resolved by the choice of the number of terms in the series estimator. This is also related

to the discussion in Section 4 that the truncation method does not work as a regularization

method for weak instruments.

More importantly, in the case where penalization is in operation (τn > 0), the way that Rn

enters into the convergence rates implies that penalization can reduce both bias and variance

by the same mechanism working in an opposite direction to the effect of weak instruments.

Explicitly, when the effect of penalization dominates the effect of weak instruments, i.e., when $R_n = \min\{n^{\delta}, \tau_n^{-1/2}\} = \tau_n^{-1/2}$, penalization plays a role and yields the rate

$$\left\|\hat{h}_\tau - h_0\right\|_{L_2} = O_p\left(\tau_n^{-1/2}\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right). \tag{5.2}$$

Here, the overall rate is improved since the multiplying rate $\tau_n^{-1/2}$ is of smaller order than the multiplying rate $n^{\delta}$ of the previous case. Interestingly, the additional bias term introduced by penalization, namely, the penalty bias, is not present in the rate as it is shown to be dominated by the approximation bias term $K^{-s/d_x}$. This result follows from Lemma 5.1; see the Appendix. When the effect of weak instruments dominates the effect of penalization, the rate becomes (5.1).

20The rate $\sqrt{L/n} + L^{-s_\pi/d_z}$ appears here due to the fact that the residuals $\hat{v}_i$ are generated regressors obtained from the first-stage nonparametric estimation.

Next, we find the optimal $L_2$ convergence rate. For a more concrete comparison between the rates $n^{\delta}$ and $\tau_n^{-1/2}$, let $\tau_n = n^{-2\delta_\tau}$ for some $\delta_\tau > 0$. For example, the larger $\delta_\tau$ is, the faster the penalization parameter converges to zero, and hence, the smaller the effect of penalization is.

Corollary 5.2 (Consistency) Suppose the Assumptions of Theorem 5.1 are satisfied. Let $K = O(n^{1/(1+2s/d_x)})$ and $L = O(n^{1/(1+2s_\pi/d_z)})$. Then

$$\left\|\hat{h}_\tau - h_0\right\|_{L_2} = O_p(n^{-q}) = o_p(1),$$

where $q = \min\left\{\frac{s}{1+2s/d_x}, \frac{s_\pi}{1+2s_\pi/d_z}\right\} - \min\{\delta, \delta_\tau\}$.

Given the choice of $K$ and $L$ in the corollary, Assumption D implies that $\delta < \frac{1}{4} - \frac{7}{4(1+2s)} - \frac{1}{4(1+2s_\pi)}$. It can be readily shown that $q > 0$ from this bound on $\delta$. With weak instruments, optimal rates in the sense of Stone (1982) are not attainable.21

Corollary 5.2 has several implications. Let $d_x = d_z = 1$ and $s = s_\pi$ for simplicity so that $q = \frac{s}{1+2s} - \min\{\delta, \delta_\tau\}$. Consider a weak instruments-dominating case of $\delta < \delta_\tau$. When the structural function is less smooth (i.e., small $s$, and hence, small $\frac{s}{1+2s}$), instruments should not be too weak (i.e., small $\delta$) to achieve the same optimal rate (i.e., holding $q$ fixed).

This implies a trade-off between the smoothness of the structural function and the required

strength of instruments. This, in turn, implies that the weak instrument problem can be

mitigated with some smoothness restrictions, which is in fact one of our justifications for

introducing the penalization method. Once the penalization effect is dominating (δτ < δ), q

increases and a faster rate is achieved.
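The trade-off can be made concrete with a small calculation in the simplified case $d_x = d_z = 1$ and $s = s_\pi$. The sketch below only evaluates the two expressions stated in the text; the function names and the numerical values of $s$, $\delta$ and $\delta_\tau$ are hypothetical and chosen purely for illustration.

```python
def delta_upper_bound(s, s_pi):
    """Bound on delta stated in the text: 1/4 - 7/(4(1+2s)) - 1/(4(1+2s_pi))."""
    return 0.25 - 7.0 / (4.0 * (1.0 + 2.0 * s)) - 1.0 / (4.0 * (1.0 + 2.0 * s_pi))

def rate_exponent(s, delta, delta_tau):
    """Exponent q = s/(1+2s) - min(delta, delta_tau) for d_x = d_z = 1 and s = s_pi."""
    return s / (1.0 + 2.0 * s) - min(delta, delta_tau)

s = 8.0        # hypothetical smoothness
delta = 0.10   # hypothetical weak instrument rate

bound = delta_upper_bound(s, s)
q_weak_iv_dominates = rate_exponent(s, delta, delta_tau=0.20)   # delta < delta_tau
q_penalty_dominates = rate_exponent(s, delta, delta_tau=0.05)   # delta_tau < delta

print(f"bound on delta: {bound:.4f}")
print(f"q (weak instruments dominate): {q_weak_iv_dominates:.4f}")
print(f"q (penalization dominates):    {q_penalty_dominates:.4f}")
```

With these illustrative values, the chosen $\delta$ respects the bound, and moving from the weak-instruments-dominating case to the penalization-dominating case raises $q$ by exactly $\delta - \delta_\tau$.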

When implementing the penalized series estimator in practice, there remains the issue of

choosing tuning parameters, namely, the penalization parameter τn and the orders K and L

of the series. In the simulations, we present results with a few chosen values of τ , K, and

L. A data-driven procedure such as the cross-validation method can also be used (Arlot and

Celisse (2010)). This method is used for choosing τ in the empirical section below. There is,

however, no optimal method in the literature for choosing the tuning parameters (Blundell

et al. (2007, pp. 1636–1637)). It would be interesting to further investigate the sensitivity

of the penalized estimator to the choice of $\tau$. Denote $\hat{Q} = \hat{P}'\hat{P}/n$ and let $D_{K^*} = I$ for simplicity. Recall that $\hat{h}_\tau(\cdot) = p^K(\cdot)'\hat{\beta}_\tau = p^K(\cdot)'(\hat{Q} + \tau I)^{-1}\hat{P}'y/n$ is a function of $\tau$. To determine the sensitivity of $\hat{h}_\tau(\cdot)$ to $\tau$, we consider the sharp bound on $\lambda_{\max}((\hat{Q} + \tau I)^{-1})$, which is $\{\lambda_{\min}(\hat{Q}) + \tau\}^{-1}$. As a measure of sensitivity, we calculate the absolute value of the first derivative of the bound:

$$\left|\frac{\partial[\lambda_{\min}(\hat{Q}) + \tau]^{-1}}{\partial\tau}\right| = \left\{\lambda_{\min}(\hat{Q}) + \tau\right\}^{-2}. \tag{5.3}$$

21The uniform convergence rate does not attain Stone’s (1982) bound even without the weak instrument factor (Newey (1997, p. 151)).

Note that the sensitivity is a decreasing function of λmin(Q). That is, as instruments become

weaker (i.e., λmin(Q) becomes smaller because of increased singularity), the performance of

hτ (·) becomes more sensitive to a change in τ . This has certain implications in practice.
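A quick numerical check of the bound and of (5.3) is sketched below. The matrix is a randomly generated positive definite stand-in for the second moment matrix, and the function name `sensitivity` and all numerical values are our own illustrative assumptions.

```python
import numpy as np

def sensitivity(lam_min, tau):
    """Right-hand side of (5.3): {lambda_min(Q) + tau}^(-2)."""
    return (lam_min + tau) ** -2

# Hypothetical second moment matrix (symmetric positive definite by construction).
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 4))
Q = A.T @ A / 50
tau = 0.05

lam_min = np.linalg.eigvalsh(Q)[0]                                         # smallest eigenvalue of Q
lam_max_inv = np.linalg.eigvalsh(np.linalg.inv(Q + tau * np.eye(4)))[-1]   # largest eigenvalue of the inverse

# Central finite difference of the bound 1 / (lambda_min + tau) with respect to tau.
eps = 1e-6
fd = abs((1.0 / (lam_min + tau + eps) - 1.0 / (lam_min + tau - eps)) / (2.0 * eps))

print(lam_max_inv, 1.0 / (lam_min + tau))
print(fd, sensitivity(lam_min, tau))
print(sensitivity(0.001, 0.01), sensitivity(0.1, 0.01))
```

The last line compares a nearly singular case with a well-conditioned one at the same τ; the former is several orders of magnitude more sensitive, in line with the discussion above.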

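Theorem 5.1 leads to the following theorem, which focuses on the rate of convergence of the structural estimator $\hat{g}_\tau(\cdot)$ after subtracting the constant term which is not identified.

Theorem 5.3 Suppose Assumptions A–D and L are satisfied. Let $R_n = \min\{n^{\delta}, \tau_n^{-1/2}\}$. For $\Delta(x) = \hat{g}_\tau(x) - g_0(x)$,

$$\left\{\int\left[\Delta(x) - \int\Delta(x)\,dF_w\right]^2 dF_w\right\}^{1/2} = O_p\left(R_n\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right).$$

Also, if $\hat{g}_\tau(x) = \hat{h}_\tau(x, \bar{v}) - \bar{\lambda}$ and $\bar{\lambda} = \lambda_0(\bar{v})$, then

$$\sup_{x\in\mathcal{X}}\left|\hat{g}_\tau(x) - g_0(x)\right| = O_p\left(R_n K\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right).$$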

The optimal rate results for $\hat{g}_\tau(\cdot)$ and the related analyses can be derived analogously, and we omit them here. The convergence rate is net of the constant term. We can further

assume E[ε] = 0 to identify the constant.

Before closing this section, we discuss one of the practical implications of the identifi-

cation and asymptotic results of this paper. In applied research that uses nonparametric

triangular models, a linear specification of the reduced form is largely prevalent.22 While

cases are rare where a linear reduced-form relationship is justified by economic theory, linear

specification is introduced to avoid the curse of dimensionality with many covariates at hand,

or for an ad hoc reason that it is easy to implement, or simply because the nonparametric

structural equation is of primary interest. The convergence rates in Theorems 5.1 and 5.3 are simplified with a linear reduced form, since the nonparametric rate of the first stage ($\sqrt{L/n} + L^{-s_\pi/d_z}$) disappears as the parametric rate ($1/\sqrt{n}$) is dominated by the second-stage nonparametric rate ($\sqrt{K/n} + K^{-s/d_x}$). When the reduced form is linearly specified, however, any true nonlinear relationship is “flattened out,” and the situation is more likely to have the problem of weak instruments, let alone the problem of misspecification. On the other hand, one can achieve a significant gain in the performance of the estimator by nonparametrically estimating the relationship between x and z. According to the rank condition (2.4), i.e., $\Pr[\partial\Pi_0(z)/\partial z \neq 0] > 0$ with univariate x and z, a small region where x and z are relevant contributes to the identification of g(·). This implies that identification power can be enhanced by exploiting the entire nonlinear relationship between x and z.23 Note that the nonparametric first stage estimation is not likely to worsen the overall convergence rate of the estimator, since the nonparametric rate from the second stage is already present.

22See, for example, NPV, Blundell and Duncan (1998), Blundell et al. (1998), Yatchew and No (2001), Lyssiotou et al. (2004), Dustmann and Meghir (2005), and Del Bono and Weber (2008).

6 Asymptotic Distributions

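In this section, we establish the asymptotic normality of the linear functionals of the penalized series estimator $\hat{h}_\tau(\cdot)$. We consider linear functionals of $h_0(\cdot)$ that include $h_0(\cdot)$ at a certain value (i.e., $h_0(\bar{w})$) and the weighted average derivative of $h_0(\cdot)$ (i.e., $\int\vartheta(w)[\partial h_0(w)/\partial x]\,dw$). The linear functionals of $h = h_0(\cdot)$ are denoted as $a(h)$. Then, the estimator $\hat{\theta}_\tau = a(\hat{h}_\tau)$ of $\theta_0 = a(h)$ is the natural “plug-in” estimator. Let $A = (a(p_{1K}), a(p_{2K}), ..., a(p_{KK}))$, where $p_{jK}(\cdot)$ is an element of $p^K(\cdot)$. Then,

$$\hat{\theta}_\tau = a(\hat{h}_\tau) = a(p^K(w)'\hat{\beta}_\tau) = A\hat{\beta}_\tau,$$

which implies that $\hat{\theta}_\tau$ is a linear combination of the two-step penalized least squares estimator $\hat{\beta}_\tau$. Then the following variance estimator of $a(\hat{h}_\tau)$ can be naturally defined:

$$\hat{V}_\tau = A\hat{Q}_\tau^{-1}\left(\hat{\Sigma}_\tau + \hat{H}_\tau\hat{Q}_1^{-1}\hat{\Sigma}_1\hat{Q}_1^{-1}\hat{H}_\tau'\right)\hat{Q}_\tau^{-1}A',$$

$$\hat{\Sigma}_\tau = \sum_{i=1}^{n}p^K(\hat{w}_i)p^K(\hat{w}_i)'[y_i - \hat{h}_\tau(\hat{w}_i)]^2/n, \qquad \hat{\Sigma}_1 = \sum_{i=1}^{n}\hat{v}_i^2\,r^L(z_i)r^L(z_i)'/n,$$

$$\hat{H}_\tau = \sum_{i=1}^{n}p^K(\hat{w}_i)\left\{[\partial\hat{h}_\tau(\hat{w}_i)/\partial w]'\,\partial w(X_i, \hat{\Pi}(z_i))/\partial\pi\right\}r^L(z_i)'/n, \qquad \hat{Q}_1 = R'R/n,$$

where $X$ is a vector of variables that includes $x$ and $z$, and $w(X, \pi)$ is a vector of functions of $X$ and $\pi$, where $\pi$ is a possible value of $\Pi(z)$.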

23This phenomenon may be interpreted in terms of the “optimal instruments” in the GMM settings of Amemiya (1977); see also Newey (1990) and Jun and Pinkse (2012).


The following are additional regularity conditions for the asymptotic normality of a(hτ ).

Let η = y − h.

Assumption E $\sigma^2(X) = \mathrm{var}(y|X)$ is bounded away from zero, $E[\eta^4|X]$ is bounded, and $E[\|v\|^4|X]$ is bounded. Also, $h_0(w)$ is twice continuously differentiable in $v$ with bounded first and second derivatives.

This assumption strengthens the boundedness of conditional second moments in Assumption A. For the next assumption, let $|h|_r = \max_{|\mu|\leq r}\sup_{w\in\mathcal{W}}|\partial^{\mu}h(w)|$. Also let $p^K(w)$ be a generic vector of approximating functions.

Assumption F $a(h)$ is a scalar, $|a(h)| \leq |h|_r$ for some $r \geq 0$, and there exists $\beta_K$ such that as $K \to \infty$, $a(p^{K\prime}\beta_K)$ is bounded away from zero while $E[\{p^K(w)'\beta_K\}^2] \to 0$.

This assumption includes the case of h at a certain value. The next condition restricts the rate of growth of $K$ and $L$ and the rate of convergence of $\tau_n$.

Assumption G $\sqrt{n}K^{-s/d_x} \to 0$, $\sqrt{n}L^{-s_\pi/d_z} \to 0$, and $n^{\delta+\frac{1}{2}}\tau_n \to 0$; also, $n^{3(\delta-\frac{1}{6})}K^4L^{1/2} \to 0$, $n^{4(\delta-\frac{1}{4})}(K^3 + L^3) \to 0$, and $n^{\delta-\frac{1}{2}}\left[K^3L^{3/2} + K^4L^{1/2}(K+L)^{1/2}\right] \to 0$.

Assumption G introduces “overfitting.” The first two conditions, $\sqrt{n}K^{-s/d_x} \to 0$ and $\sqrt{n}L^{-s_\pi/d_z} \to 0$, imply overfitting in that the bias ($K^{-s/d_x}$) shrinks faster than $1/\sqrt{n}$, the usual rate of standard deviation of the estimator. The same feature is found in the corresponding assumption in NPV (Assumption 8, p. 582), while the overall conditions on $K$ and $L$ in Assumption G are stronger than those in NPV. At the same time, $\tau_n$ needs to converge fast enough to satisfy $n^{\delta+\frac{1}{2}}\tau_n \to 0$, which is stronger than what is required in the previous section. This condition also implies overfitting in that it requires less penalization, hence implicitly imposing less smoothness. The value of $\delta$ that satisfies Assumption G needs to be small, i.e., the instruments must be mildly weak, to obtain the asymptotic normality of $\hat{\theta}_\tau$.

Theorem 6.1 If Assumptions A–G and L are satisfied, then $\hat{\theta}_\tau = \theta_0 + O_p(n^{\delta-\frac{1}{2}}K^{1+2r})$ and

$$\sqrt{n}\,\hat{V}_\tau^{-1/2}(\hat{\theta}_\tau - \theta_0) \to_d N(0, 1).$$

The proofs of this theorem and the following corollary can be found in the Supplemental

Appendix.

In addition to asymptotic normality, the results also provide the bound on the convergence

rate of θτ . The bound on the rate achieved here is slower than that of NPV. The fact that


the rate is slower (i.e., the multiplier of (θτ − θ0) is of smaller order) in the case of weak

instruments can be seen as the effective sample size being small.

Under the following assumption, $\sqrt{n}$-consistency is achieved. Let $p^{*K}(z, v)$ be a “transformation” of $p^K(w)$ purged of the weak instruments effect (see the Appendix).

Assumption H There exist $\nu(w)$ and $\alpha_K$ such that $E[\|\nu(w)\|^2] < \infty$, $a(h) = E[\nu(w)h_0(w)]$, $a(p_j) = E[\nu(w)p_j(w)]$, and $E\left[\left\|\nu(\Pi_n(z) + v, v) - p^{*K}(z, v)'\alpha_K\right\|^2\right] \to 0$ as $K \to \infty$.

This assumption includes the case of the weighted average derivative of $h$, in which case $\nu(w) = -f_w(w)^{-1}\partial\vartheta(w)/\partial w$. Let $\rho(z) = E[\nu(w)\,\partial h_0(w)/\partial v'\,|\,z]$. Under this assumption, the asymptotic variance of $\hat{\theta}_\tau$ can be expressed as

$$V = E[\nu(w)\nu(w)'\,\mathrm{var}(y|X)] + E[\rho(z)\,\mathrm{var}(x|z)\,\rho(z)'].$$

Corollary 6.2 If Assumptions A–E, G, H and L are satisfied, then

$$\sqrt{n}(\hat{\theta}_\tau - \theta_0) \to_d N(0, V), \qquad \hat{V}_\tau \to_p V.$$

In this case, instruments can be regarded as nearly strong in that $\sqrt{n}$-consistency is achieved as if instruments were strong, as in NPV.

There still remain issues when the results of Theorem 6.1 and Corollary 6.2 are used

for inference, e.g., for constructing pointwise asymptotic confidence intervals. As long as

nuisance parameters are present, such an inferential procedure may depend on the strength

of instruments or on the choice of the penalization parameter. Developing a robust procedure

against weak instruments in nonparametric models is beyond the scope of our paper, and we

leave it to future research.

Lastly, one work closely related to the asymptotic results of this paper is Jiang et al.

(2010), where a model for DNA microarray data is transformed into an additive nonpara-

metric model with highly correlated regressors.24 The authors establish pointwise asymptotic

normality for local linear and integral estimators. Their findings show the way that the cor-

related regressors affect bias and variance. Once our problem is transformed to an additive

nonparametric regression with concurvity, we can also develop weak instrument asymptotic

theory for local linear and integral estimators of the functionals of h0(·). Their results can

be applied after taking into account the generated regressors (vi) that are involved in our

24A more obvious example in their paper is a time series model such as $y_t = m_0 + \sum_{j=1}^{d}m_j(x_{t-j})$ where the lagged variables of $x_t$ are highly correlated.


problem.

7 Monte Carlo Simulations

In this section, we document the problems of weak instruments in nonparametric estimation

and investigate the finite sample performance of the penalized estimator. We are particularly

interested in the finite sample gain in terms of the bias, variance, and mean squared errors

(MSE) of the penalized series estimators defined in Section 4 (“penalized IV (PIV) estima-

tors”) relative to those of the unpenalized series estimators (“IV estimators”) for a wide

range of strength of instruments. We also compare the IV estimators with series estimators

that ignore endogeneity (“least squares (LS) estimators”). LS estimators are calculated by

nonparametrically estimating the outcome equation without considering the reduced-form

equation.

We consider the following data generating process:

$$y = \Phi\left(\frac{x - \mu_x}{\sigma_x}\right) + \varepsilon, \qquad x = \pi_1 + z\pi + v,$$

where $y$, $x$, and $z$ are univariate, $z \sim N(\mu_z, \sigma_z^2)$ with $\mu_z = 0$ and $\sigma_z^2 = 1$, and $(\varepsilon, v)' \sim N(0, \Sigma)$ with $\Sigma = \begin{bmatrix}1 & \rho\\ \rho & 1\end{bmatrix}$. Note that $|\rho|$ measures the degree of endogeneity, and we consider $\rho \in \{0.2, 0.5, 0.95\}$. The sample $\{z_i, \varepsilon_i, v_i\}$ is i.i.d. with size $n = 1000$. The number of simulation repetitions is $s \in \{500, 1000\}$.

We consider different strengths of the instrument by considering different values of $\pi$.

Let the intercept $\pi_1 = \mu_x - \pi\mu_z$ with $\mu_x = 2$ so that $E[x] = \mu_x$ does not depend on the choice of $\pi$. Note that $\sigma_x^2 = \pi^2\sigma_z^2 + 1$ still depends on $\pi$, which is reasonable since the signal contributed to the total variation of $x$ is a function of $\pi$. More specifically, to measure the strength of the instrument, we define the concentration parameter (Stock and Yogo (2005)):

$$\mu^2 = \frac{\pi^2\sum_{i=1}^{n}z_i^2}{\sigma_v^2}.$$

Note that since the dimension of $z$ is one, the concentration parameter value and the first-stage $F$-statistic are similar to each other. For example, in Staiger and Stock (1997), for $F = 30.53$ (strong instrument), a 97.5% confidence interval for $\mu^2$ is [17.3, 45.8], and for $F = 4.747$ (weak instrument), a confidence interval for $\mu^2$ is [2.26, 5.64]. The candidate values of $\mu^2$ are {4, 8, 16, 32, 64, 128, 256}, which range from a weak to a strong instrument in the conventional sense.25 As for the penalization parameter $\tau$, we consider candidate values of {0.001, 0.005, 0.01, 0.05, 0.1}. We choose $K^* \in \{1, 3, 5\}$.

The approximating functions used for $g_0(x)$ and $\lambda_0(v)$ are polynomials with different choices of $(K_1, K_2)$, where $K_1$ is the number of terms for $g_0(\cdot)$, $K_2$ for $\lambda_0(\cdot)$, and $K = K_1 + K_2$. Since $g_0(\cdot)$ and $\lambda_0(\cdot)$ are separately identified only up to an additive constant, we introduce the normalization $\lambda_0(1) = \rho$, where $\rho$ is chosen because of the joint normality of $(\varepsilon, v)$. Then, $g_0(x) = h_0(x, 1) - \rho$, where $h(x, v) = g(x) + \lambda(v)$.
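For readers who wish to replicate the design, one draw from the data generating process might be sketched as follows. The function names, the seed handling, and in particular the way $\pi$ is backed out from a target value of the concentration parameter (solving $\mu^2 = \pi^2\sum_i z_i^2/\sigma_v^2$ for $\pi$ with $\sigma_v^2 = 1$) are our own assumptions; the text does not spell out how $\pi$ is set in the actual simulations.

```python
import numpy as np
from math import erf, sqrt

_erf = np.vectorize(erf)

def std_normal_cdf(u):
    """Standard normal distribution function Phi, computed from the error function."""
    return 0.5 * (1.0 + _erf(u / sqrt(2.0)))

def simulate(n, mu2, rho, mu_x=2.0, mu_z=0.0, sigma_z=1.0, seed=0):
    """One sample from y = Phi((x - mu_x)/sigma_x) + eps, x = pi_1 + z*pi + v."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_z, sigma_z, size=n)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    eps, v = rng.multivariate_normal(np.zeros(2), cov, size=n).T
    # Assumed rule: choose pi so that pi^2 * sum(z_i^2) / sigma_v^2 equals mu2 (sigma_v^2 = 1).
    pi = np.sqrt(mu2 / np.sum(z ** 2))
    pi_1 = mu_x - pi * mu_z                      # keeps E[x] = mu_x for any pi
    x = pi_1 + z * pi + v
    sigma_x = np.sqrt(pi ** 2 * sigma_z ** 2 + 1.0)
    y = std_normal_cdf((x - mu_x) / sigma_x) + eps
    return y, x, z, pi

y, x, z, pi = simulate(n=1000, mu2=16.0, rho=0.5)
print(pi, pi ** 2 * np.sum(z ** 2))
```

Under this rule, a concentration parameter of 16 with n = 1000 corresponds to a first-stage slope of roughly 0.13, which gives a sense of how little variation in x is driven by z even at a conventionally acceptable instrument strength.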

In the first part of the simulation, we calculate $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$, the penalized and unpenalized IV estimates, respectively, and compare their performances. For different strengths of the instrument, we compute estimates with different values of the penalization parameter. We choose $K_1 = K_2 = 6$ and $\rho = 0.5$.26 As one might expect, the choice of orders of the series is not significant as long as we are only interested in comparing $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$.

Figures 2–5 present some representative results. Results with different values of $\mu^2$ and $\tau$ are similar and hence are omitted to save space. In Figure 2, we plot the mean of $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$ with concentration parameter $\mu^2 = 16$ and penalization parameter $\tau = 0.001$. The plots for the unpenalized estimate indicate that with the given strength of the instrument, the variance is very large, which implies that functions with any trends can fit within the 0.025–0.975 quantile ranges; it indicates that the bias is also large. In Figure 2, the graph for the penalized estimate shows that the penalization significantly reduces the variance so that the quantile range implies the upward trend of the true $g_0(\cdot)$. Note that the bias of $\hat{g}_\tau(\cdot)$ is no larger than that of $\hat{g}(\cdot)$. Although $\mu^2 = 16$ is considered to be strong according to the conventional criterion, this range of the concentration parameter value can be seen as the case where the instrument is “nonparametrically” weak in the sense that the penalization induces a significant difference between $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$.

Figure 3 is drawn with $\mu^2 = 256$, while all else remains the same. In this case, the penalization induces no significant difference between $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$. This can be seen as the case where the instrument is “nonparametrically” strong. It is noteworthy that the bias of the penalized estimate is no larger than that of the unpenalized one even in this case.

Figures 4–5 present similar plots but with penalization parameter τ = 0.005. Figure 4 shows that with a larger value of τ than in the previous case, the variance is significantly reduced, while the biases of the two estimates are comparable to each other. The change in the patterns of the graphs from Figure 4 to Figure 5 is similar to that in the previous case. Furthermore, the comparison between Figures 2–3 and Figures 4–5 shows that the results are more sensitive to the change of τ in the weak instrument case than in the strong instrument case. This provides evidence for the theoretical discussion on sensitivity; see (5.3) in Section 5.

25 The simulation results below seem to be unstable when µ2 = 4 (presumably because instruments in this range are severely weak in nonparametric settings), and hence caution is needed when interpreting them.

26 Because of the bivariate normal assumption for (ε, v)′, we implicitly impose linearity in the function E[ε|v] = λ(v). Although K2 being smaller than K1 would better reflect the fact that λ0(·) is smoother than g0(·), we assume that we are agnostic about such knowledge.

The fact that the penalized and unpenalized estimates differ significantly when the in-

strument is weak has a practical implication: practitioners can be informed about whether

the instrument they are using is worryingly weak by comparing penalized series estimates

with unpenalized estimates. A similar approach can be found in the linear weak instruments

literature; for example, the biased TSLS estimates and the approximately median-unbiased

LIML estimates of Staiger and Stock (1997) can be compared to detect weak instruments.

Table 1 reports the integrated squared bias, integrated variance, and integrated MSE of

the penalized and unpenalized IV estimators and LS estimators of g0(·). The LS estimates

are calculated by series estimation of the outcome equation (with order K1), ignoring the

endogeneity. We also calculate the relative integrated MSE for comparisons. We use K1 =

K2 = 6 and ρ = 0.5 as before. Results with different choices of orders K1 and K2 between

3 and 10 and a different degree of endogeneity ρ in {0.2, 0.95} show similar patterns. Note

that the usual bias and variance trade-offs are present as the order of the series changes.
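The integrated criteria reported in the tables can be computed from Monte Carlo output along the following lines. This is a minimal sketch, not the paper's code: the function name `integrated_metrics`, the array layout, and the use of the trapezoid rule for the integrals are all assumptions made for illustration.

```python
import numpy as np

def integrated_metrics(estimates, truth, grid):
    """Integrated squared bias, variance, and MSE of curve estimates.

    estimates: (R, G) array, one fitted curve per Monte Carlo replication.
    truth:     (G,) array, the true function evaluated on the grid.
    grid:      (G,) increasing evaluation points.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    grid = np.asarray(grid, dtype=float)

    mean_curve = estimates.mean(axis=0)
    sq_bias = (mean_curve - truth) ** 2
    variance = estimates.var(axis=0)  # ddof=0, so MSE = bias^2 + variance pointwise
    mse = ((estimates - truth) ** 2).mean(axis=0)

    def integrate(f):
        # Trapezoid rule over the evaluation grid (an assumed quadrature choice).
        return float(np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(grid)))

    return {"bias2": integrate(sq_bias),
            "var": integrate(variance),
            "mse": integrate(mse)}
```

A relative integrated MSE such as MSEIV/MSELS is then the ratio of the `"mse"` entries from two calls on the same grid.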

As the instrument becomes weaker, the bias and variance of the unpenalized IV (τ = 0)

increase with a greater proportion in variance. The integrated MSE ratios between the IV

and LS estimators (MSEIV /MSELS) indicate the relative performance of the IV estimator

compared to the LS estimator. A ratio larger than unity implies that IV performs worse

than LS. In the table, the IV estimator does poorly in terms of MSE even when µ2 = 16,

which is in the range of conventionally strong instruments; therefore, this can be considered

as the case where the instrument is nonparametrically weak.

The rest of the results in Table 1 are for the penalized IV (PIV) estimator gτ (·). Overall

the variance is reduced significantly compared to that of IV without sacrificing much bias.

In the case of τ = 0.001, the variance is reduced for the entire range of instrument strength,

and the bias is no larger and is reduced when the instrument is weak. This provides evidence

for the theoretical discussion in Section 5 that the penalty bias can be dominated by the

existing series estimation bias. This feature diminishes as the increased value of τ introduces

more bias. The integrated MSE ratios between PIV and IV (MSEPIV /MSEIV ) in Table 1

suggest that PIV outperforms IV in terms of MSE for all the values of τ considered here. For

example, when µ2 = 8, the MSE of PIV with τ = 0.001 is only about 4.57% of that of IV,

while the bias of PIV is only about 20% of that of IV. These results imply that PIV performs substantially better than LS, unlike the previous case of IV.

Table 2 presents similar results for K∗ = 3. Interestingly, while the variance is uniformly

increased, increasing K∗ does not seem to control the penalty bias in a monotonic fashion.

Compared to the results of Table 1, there are cases where the bias is reduced, but these cases

are for some strength of instruments and are mostly observed in the case where τ = 0.005 or

0.01.

The simulation results can be summarized as follows. Even with a strong instrument in

a conventional sense, unpenalized IV estimators do poorly in terms of mean squared errors

compared to LS estimators. Variance seems to be a bigger problem, but bias is also worrisome.

Penalization alleviates much of the variance problem induced by the weak instrument, and

it also works well in terms of bias for relatively weak instruments and for some values of the

penalization parameter.

8 Application: Effect of Class Size

To illustrate our approach and apply the theoretical findings, we nonparametrically estimate

the effect of class size on students’ test scores. Estimating the effect of class size has been an

interesting topic in the schooling literature, since among school inputs that affect students’

performance, class size is thought to be easier to manipulate. Angrist and Lavy (1999) analyze

the effect of class size on students’ reading and math scores in Israeli primary schools. With

linear models, they find that the estimated effect is negative in most of the specifications

they consider.

In this section, we investigate whether the results of Angrist and Lavy (1999) are driven

by their parametric assumptions. It is also more reasonable to allow a nonlinear effect of

class size, since it is unlikely that the marginal effect is constant across class-size levels. We

nonparametrically extend their linear model: for school s and class c,

$$\mathit{score}_{sc} = g(\mathit{classize}_{sc}, \mathit{disadv}_{sc}) + \alpha_s + \varepsilon_{sc},$$

where $\mathit{score}_{sc}$ is the average test score within class, $\mathit{classize}_{sc}$ the class size, $\mathit{disadv}_{sc}$ the fraction of disadvantaged students, and $\alpha_s$ an unobserved school-specific effect. Note that this model allows for different patterns for different subgroups of school/class characteristics (here, $\mathit{disadv}_{sc}$).

Class size is endogenous because it results from choices made by parents, schooling

providers or legislatures, and hence is correlated with other determinants of student achievement. Angrist and Lavy (1999) use Maimonides’ rule on maximum class size in Israeli schools

to construct an IV. According to the rule, class size increases one-for-one with enrollment

until 40 students are enrolled, but when 41 students are enrolled, the class size is dropped to

an average of 20.5 students. Similarly, classes are split when enrollment reaches 80, 120, 160,

and so on, so that each class does not exceed 40. This rule can be expressed by a nonlinear function of enrollment, which produces the IV (denoted as $f_{sc}$ following their notation): $f_{sc} = e_s/\{\mathrm{int}((e_s - 1)/40) + 1\}$, where $e_s$ is the beginning-of-the-year enrollment count. This rule generates discontinuity in the enrollment/class-size relationship, which serves as exogenous variation. Note that with the sample around the discontinuity points, IV exogeneity is more credible in addressing the endogeneity issue.
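The instrument defined by Maimonides' rule is simple to compute. The sketch below implements the formula stated above; the function name `maimonides_instrument` and the `cap` argument are illustrative choices, not names from Angrist and Lavy (1999).

```python
def maimonides_instrument(enrollment: int, cap: int = 40) -> float:
    """Predicted class size f = e / (int((e - 1) / cap) + 1)."""
    if enrollment < 1:
        raise ValueError("enrollment must be a positive integer")
    n_classes = (enrollment - 1) // cap + 1  # classes needed so that none exceeds cap
    return enrollment / n_classes


# The predicted class size drops sharply just above each multiple of the cap.
for e in (40, 41, 80, 81, 120, 121):
    print(e, maimonides_instrument(e))
```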

The dataset we use is the 1991 Israel Central Bureau of Statistics survey of Israeli public

schools from Angrist and Lavy (1999). We only consider fourth graders. The sample size is

n = 2019 for the full sample and 650 for the discontinuity sample. Given a linear reduced

form, first stage tests have F = 191.66 with the discontinuity sample (±7 students around

the discontinuity points) and F = 2150.4 with the full sample. Lessons from the theoretical

analyses of the present paper suggest that an instrument that is strong in a conventional

sense (F = 191.66) can still be weak in nonparametric estimation of the class-size effect, and

a nonparametric reduced form can enhance identification power. We consider the following

nonparametric reduced form

$$\mathit{classize}_{sc} = \Pi(f_{sc}, \mathit{disadv}_{sc}) + v_{sc}.$$

The sample is clustered, an aspect which is reflected in αs of the outcome equation. Hence,

we use the block bootstrap when computing standard errors and take schools as bootstrap

sampling units to preserve within-cluster (school) correlation. This produces cluster-robust

standard errors. We use b = 500 bootstrap repetitions.
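A block (cluster) bootstrap of this kind can be sketched as follows. This is only an illustration under stated assumptions: the function `cluster_bootstrap_se` and its arguments are hypothetical, and `estimator` stands in for whatever statistic is being bootstrapped (in the application, the series estimate evaluated at chosen points).

```python
import numpy as np

def cluster_bootstrap_se(cluster_ids, estimator, data, n_boot=500, seed=0):
    """Bootstrap standard errors that resample whole clusters (e.g., schools).

    cluster_ids: (n,) array of cluster labels.
    estimator:   callable mapping a resampled data array to a 1-D array of statistics.
    data:        (n, d) array of observations, row i belonging to cluster_ids[i].
    """
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    data = np.asarray(data)
    labels = np.unique(cluster_ids)
    rows_by_cluster = [np.flatnonzero(cluster_ids == c) for c in labels]

    draws = []
    for _ in range(n_boot):
        # Draw clusters with replacement and keep all rows of each drawn cluster,
        # which preserves the within-cluster dependence.
        picked = rng.integers(0, len(labels), size=len(labels))
        idx = np.concatenate([rows_by_cluster[k] for k in picked])
        draws.append(np.atleast_1d(estimator(data[idx])))
    return np.std(np.asarray(draws), axis=0, ddof=1)
```

Resampling schools rather than classes is what makes the resulting standard errors robust to correlation among classes in the same school.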

With the same example and dataset (only the full sample), Horowitz (2011, Section 5.2)

uses the model and assumptions of the NPIV approach to nonparametrically estimate the

effect of class size. To solve the ill-posed inverse problem, he conducts regularization by

replacing the operator with a finite-dimensional approximation. First, we compare the NPIV

estimate of Horowitz (2011) with the IV estimate obtained by the control function approach

of this paper. Figure 8 (Figure 3 in Horowitz (2011)) is the NPIV estimate of the function of

class size (g(·, ·)) for disadv = 1.5(%) with the full sample. The solid line is the estimate of

g and the dots show the cluster-robust 95% confidence band. As noted in his paper (p. 374), “the result suggests that the data and the instrumental variable assumption, by themselves, are uninformative about the form of any dependence of test scores on class size.”

Figure 9 is the (unpenalized) IV estimate calculated with the full sample using the tri-

angular model (2.1) and the control function approach. Although not entirely flexible, the

nonparametric reduced form above is justified for use in the comparison with the NPIV esti-

mate, since the NPIV approach does not specify any reduced-form relationship. The sample,

the orders of the series, and the value of disadv are identical to those for the NPIV estimate.

The dashed lines in the figure indicate the cluster-robust 95% confidence band. The result

clearly presents a nonlinear shape of the effect of class size and suggests that the marginal

effect diminishes as class size increases. The overall trend seems to be negative, which is

consistent with the results of Angrist and Lavy (1999).

It is important to note that the control function and NPIV approaches maintain different sets of assumptions (e.g., different orthogonality conditions for IV27), and hence, this comparison does not imply that one estimate performs better than the other. It does, however, imply that if the triangular model and control function assumptions are reasonable, they make the data informative about the relationship of interest. One of the control function assumptions, E [εsc|vsc, fsc] = E [εsc|vsc], is satisfied if (εsc, vsc) are jointly independent

of fsc. Here, εsc captures the factors that determine the average test scores of a class other

than the class size and the fraction of disadvantaged students, and vsc captures other factors

that are correlated with enrollment. Moreover, since the NPIV approach suffers from the

ill-posed inverse problem even without the problem of weak instruments, the control func-

tion approach may be a more appealing framework than the NPIV approach in the possible

presence of weak instruments.

We proceed by calculating the penalized IV estimates from the proposed estimation

method of this paper. For all cases below, we find estimates for disadv = 1.5(%) as before. To

better justify the usage of our method in this part, we use the discontinuity sample where the

instrument is possibly weak in this nonparametric setting. For the penalization parameter τ ,

we use cross-validation to choose a value from among {0.005, 0.01, 0.015, 0.02, 0.05}.28 The

following results of cross-validation (Table 3) suggest that τ = 0.015 is the MSE-minimizing

value. We penalize β−1 = (β2, β3, ..., βK) by choosing K∗ = 1. Figure 10 depicts the penalized and unpenalized IV estimates. There is a certain difference between the estimates, but the amount is small. It is possible that either the instrument is not very weak in this example or that cross-validation chooses a smaller value of τ. As before, the results suggest a nonlinear effect of class size with an overall negative trend.

27 Without further conditions, assumptions (2.1b) are neither stronger nor weaker than E[ε|z] = 0, the orthogonality condition introduced in the NPIV model. It is easy to show that if v ⊥ z is assumed, then E[ε|v, z] = E[ε|v] with E[ε] = 0 implies E[ε|z] = 0.

28 Specifically, we use 10-fold CV. See Arlot and Celisse (2010) for details.
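The choice of τ by 10-fold cross-validation can be sketched as below. This is a simplified illustration, not the paper's procedure: it treats the regressor matrix `P` as given (in the paper's setting it would also contain functions of first-stage residuals), and the helper names `penalized_fit` and `cv_choose_tau` are hypothetical.

```python
import numpy as np

def penalized_fit(P, y, tau, k_star=1):
    """Series coefficients with a ridge-type penalty on all but the first k_star terms."""
    n, K = P.shape
    D = np.diag((np.arange(K) >= k_star).astype(float))
    return np.linalg.solve(P.T @ P / n + tau * D, P.T @ y / n)

def cv_choose_tau(P, y, taus, n_folds=10, k_star=1, seed=0):
    """Return the candidate tau with the smallest out-of-fold mean squared error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = []
    for tau in taus:
        sq_err = 0.0
        for hold in folds:
            train = np.setdiff1d(np.arange(n), hold)
            beta = penalized_fit(P[train], y[train], tau, k_star)
            sq_err += np.sum((y[hold] - P[hold] @ beta) ** 2)
        scores.append(sq_err / n)
    scores = np.array(scores)
    return taus[int(np.argmin(scores))], scores
```

For the grid used in the text, one would call `cv_choose_tau(P, y, [0.005, 0.01, 0.015, 0.02, 0.05])`.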

9 Conclusions

This paper analyzes identification, estimation, and inference in a nonparametric triangular

model in the presence of weak instruments and proposes an estimation strategy to mitigate

the effect. The findings and implications of this paper are not restricted to the present

additively separable triangular models. The results can be adapted to the nonparametric

limited dependent variable framework of Das et al. (2003) and Blundell and Powell (2004).

Weak instruments can also be studied in other nonparametric models with endogeneity, such

as the IV quantile regression model of Chernozhukov and Hansen (2005) and Lee (2007).

The results of this paper are directly applicable in several semiparametric specifications of

the model of this paper. With a large number of covariates, one can consider a semiparametric

outcome equation or reduced form that is additively nonparametric in some components and

parametric in others. One can also consider a single-index model for one equation or the

other. As more structure is imposed on the model, the identification condition of Section

2 and the regularity condition of Sections 5 and 6 can be weakened. When the reduced

form is of a single-index structure Π(z′γ), the strength of instruments is determined by the

combination of ∂Π(·)/∂(z′γ) and γ.
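To make the last statement concrete, the chain rule gives the Jacobian of a single-index reduced form. The scalar index $u = z'\gamma$ is notation introduced here for illustration only.

```latex
\frac{\partial \Pi(z'\gamma)}{\partial z'}
  = \left.\frac{\partial \Pi(u)}{\partial u}\right|_{u = z'\gamma} \cdot \gamma' ,
\qquad u = z'\gamma .
```

With univariate x, this row vector is nonzero at a given z only if both the derivative with respect to the index and γ are nonzero, so either factor being close to zero weakens the instrument.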

Subsequent research can consider two specification tests: a test for the relevance of the

instruments and a test for endogeneity. These tests can be conducted by adapting the

existing literature on specification tests where the test statistics can be constructed using

the series estimators of this paper; see, e.g., Hong and White (1995). Testing whether

instruments are relevant can be conducted with the nonparametric reduced-form estimate

Π(·). A possible null hypothesis is H0 : Pr {Π(z) = const.} = 1, which is motivated by

our rank condition for identification. Testing whether the model suffers from endogeneity

can be conducted with the control function estimate λ(v) = h(w) − g(x). A possible null

hypothesis is H0 : Pr {λ(v) = const.} = 1. In this case, one needs to take into account the

generated regressors v when using existing results on the specification test. Constructing

a test for instrument weakness would be more demanding than the above-mentioned tests.

Developing inference procedures that are robust to identification of arbitrary strength is also

an important research question.


A Appendix

This Appendix provides the proofs of the theorems and lemmas in Sections 2 and 5. The

Supplemental Appendix contains some of the proofs of lemmas introduced in this Appendix

and the proofs of the theorem and corollary in Section 6. Throughout the Appendices, we

suppress the subscript “0” for the true functions and, when no confusion arises, the subscript

“n” of the true reduced-form function in Assumption L and of the coefficients on its expansion

for notational simplicity.

A.1 Proofs in Identification Analysis (Section 2)

In this subsection, we prove Lemma 2.1 and Theorem 2.2. In order to prove the sufficiency of

Assumption ID2′ for ID2 (Lemma 2.1), we first introduce a preliminary lemma. For nonempty

sets A and B, define the following set

A+B = {a+ b : (a, b) ∈ A×B} . (A.1)

Then, the following rules are useful in proving the lemma. For nonempty sets A, B, and C,

A+B = B + A (commutative) (Rule 1)

A+ (B ∪ C) = (A+B) ∪ (A+ C) (distributive 1) (Rule 2)

A+ (B ∩ C) = (A+B) ∩ (A+ C) (distributive 2) (Rule 3)

(A+B)c ⊂ A+Bc, (Rule 4)

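where the last rule is less obvious than the others but can be shown to hold: if $x \notin A + B$, then for any $a \in A$ we must have $x - a \notin B$, so that $x = a + (x - a) \in A + B^c$. The distributive rules of Rules 2 and 3 do not hold when the operators are switched; e.g., $A \cup (B + C) \neq (A \cup B) + (A \cup C)$. Recall that $\mathcal{Z}^r = \{z \in \mathcal{Z} : \mathrm{rank}(\partial\Pi(z)/\partial z') = d_x\}$ and $\mathcal{Z}^0 = \mathcal{Z}\backslash\mathcal{Z}^r$.

Let $\lambda_{Leb}$ denote the Lebesgue measure on $\mathbb{R}^{d_x}$, and let $\partial\mathcal{V}$ and $\mathrm{int}(\mathcal{V})$ denote the boundary and interior of $\mathcal{V}$, respectively.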
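The set-sum rules can be checked numerically on small examples. The sketch below is an illustration only: it works in the integers modulo N (an assumed finite stand-in for $\mathbb{R}^{d_x}$, chosen so that complements are finite sets), and the helper names are hypothetical. In this finite check, Rules 1, 2, and 4 hold as stated, while for Rule 3 only the inclusion $A + (B \cap C) \subset (A+B) \cap (A+C)$ appears to be guaranteed in general; the example sets at the end show the two sides differing. The proofs in this Appendix use only Rules 2 and 4.

```python
from itertools import product

N = 12  # modulus of the finite group used as a stand-in universe (an assumption)
UNIVERSE = frozenset(range(N))

def set_sum(A, B):
    """A + B = {a + b : (a, b) in A x B}, with addition taken modulo N."""
    return frozenset((a + b) % N for a, b in product(A, B))

def complement(A):
    return UNIVERSE - A

def check_rules(A, B, C):
    """Return the truth values of Rules 1, 2, the inclusion form of Rule 3, and Rule 4."""
    rule1 = set_sum(A, B) == set_sum(B, A)
    rule2 = set_sum(A, B | C) == set_sum(A, B) | set_sum(A, C)
    rule3_inclusion = set_sum(A, B & C) <= (set_sum(A, B) & set_sum(A, C))
    rule4 = complement(set_sum(A, B)) <= set_sum(A, complement(B))
    return rule1, rule2, rule3_inclusion, rule4

# Example in which the two sides of Rule 3 differ.
A, B, C = frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})
lhs = set_sum(A, B & C)                   # A + (B ∩ C)
rhs = set_sum(A, B) & set_sum(A, C)       # (A + B) ∩ (A + C)
print(check_rules(A, B, C), sorted(lhs), sorted(rhs))
```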

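Lemma A.1 Suppose Assumptions ID1 and ID2′(a)(i) and (ii) hold. Suppose $\mathcal{Z}^r \neq \emptyset$ and $\mathcal{Z}^0 \neq \emptyset$. Then, (a) $\{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\} \subset \mathcal{X}^r$, and (b) $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$ and $\partial\mathcal{V}$ is countable.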

We prove the main lemma first.


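Proof of Lemma 2.1: When $\mathcal{Z}^r = \emptyset$ or $\mathcal{Z}^r = \mathcal{Z}$, we trivially have $\mathcal{X}^r = \mathcal{X}$. Suppose $\mathcal{Z}^r \neq \emptyset$ and $\mathcal{Z}^0 \neq \emptyset$. First, under Assumption ID2′(b) that $\mathcal{V} = \mathbb{R}^{d_x}$, we have the conclusion by

$$\mathcal{X}^r = \{\Pi(z) + v : z \in \mathcal{Z}^r, v \in \mathbb{R}^{d_x}\} = \mathbb{R}^{d_x} = \{\Pi(z) + v : z \in \mathcal{Z}, v \in \mathbb{R}^{d_x}\} = \mathcal{X}.$$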

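Now suppose Assumption ID2′(a) holds. By Assumption ID2′(a)(iii), for $z \in \mathcal{Z}^0$, the joint support of $(z, v)$ is $\mathcal{Z}^0 \times \mathcal{V}$. Hence

$$\{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\} = \{\Pi(z) + v : (z, v) \in \mathcal{Z}^0 \times \mathrm{int}(\mathcal{V})\} = \Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}).$$

But by Lemma A.1(a), $\Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}) \subset \mathcal{X}^r$ or contrapositively, $\mathcal{X}\backslash\mathcal{X}^r \subset (\Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}))^c$. Also, by (Rule 4), $(\Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}))^c \subset \Pi(\mathcal{Z}^0) + \partial\mathcal{V}$. Therefore,

$$\mathcal{X}\backslash\mathcal{X}^r \subset \Pi(\mathcal{Z}^0) + \partial\mathcal{V}. \tag{A.2}$$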

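Let $\partial\mathcal{V} = \{\nu_1, \nu_2, \ldots, \nu_k, \ldots\} = \cup_{k=1}^{\infty}\{\nu_k\}$ by Lemma A.1(b). Then,

$$\lambda_{Leb}\big(\Pi(\mathcal{Z}^0) + \partial\mathcal{V}\big) = \lambda_{Leb}\big(\Pi(\mathcal{Z}^0) + (\cup_{k=1}^{\infty}\{\nu_k\})\big) = \lambda_{Leb}\big(\cup_{k=1}^{\infty}(\Pi(\mathcal{Z}^0) + \{\nu_k\})\big) \le \sum_{k=1}^{\infty}\lambda_{Leb}\big(\Pi(\mathcal{Z}^0) + \{\nu_k\}\big) = \sum_{k=1}^{\infty}\lambda_{Leb}\big(\Pi(\mathcal{Z}^0)\big) = 0,$$

where the second equality is from (Rule 2), the inequality is by countable subadditivity, and the third equality is by the translation invariance of Lebesgue measure. The last equality is by Lemma A.1(b) that $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$. Since x is continuously distributed, by (A.2), $\Pr[x \in \mathcal{X}\backslash\mathcal{X}^r] \le \Pr[x \in \Pi(\mathcal{Z}^0) + \partial\mathcal{V}] = 0$. □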

In the following proofs, we explicitly distinguish random variables from their realizations. Let ξ, ζ, and ν denote the realizations of x, z, and v, respectively. We now prove Lemma A.1. The result of the following lemma is needed:

Lemma A.2 Under the assumptions of Lemma A.1, Z0 is closed in Z.

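Proof of Lemma A.2: Consider any $\cup_{n=1}^{\infty}\{\zeta_n\} \subset \mathcal{Z}^0 \subset \mathcal{Z}$ such that $\lim \zeta_n = \zeta \in \mathcal{Z}$. Then for each n, $\partial\Pi(\zeta_n)/\partial\zeta' = (0, 0, \ldots, 0) = 0$ by the definition of $\mathcal{Z}^0$, and

$$\partial\Pi(\zeta)/\partial\zeta' = \partial\Pi(\lim_{n\to\infty}\zeta_n)/\partial\zeta' = \lim_{n\to\infty}\partial\Pi(\zeta_n)/\partial\zeta' = 0,$$

where the second equality is by continuity of $\partial\Pi(\cdot)/\partial\zeta'$. Therefore, $\zeta \in \mathcal{Z}^0$, and hence $\mathcal{Z}^0$ is closed. □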

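Proof of Lemma A.1(a): First, we claim that for any $\pi \in \Pi(\mathcal{Z}^0)$ there exists $\cup_{n=1}^{\infty}\{\pi_n\} \subset \Pi(\mathcal{Z}^r)$ such that $\lim_{n\to\infty}\pi_n = \pi$.

By Proposition 4.21(a) of Lee (2011, p. 92), for any space S, the path components of S form a partition of S. Note that a path component of S is a maximal nonempty path connected subset of S. Then for $\mathcal{Z}^0 \subset \mathbb{R}^{d_z}$, we have $\mathcal{Z}^0 = \cup_{\iota \in I}\mathcal{Z}^0_\iota$ where the partitions $\mathcal{Z}^0_\iota$ are path components. Note that, since $\mathcal{Z}^0_\iota$ is path connected, for any $\zeta$ and $\tilde\zeta$ in $\mathcal{Z}^0_\iota$, there exists a path in $\mathcal{Z}^0_\iota$, namely, a piecewise continuously differentiable function $\gamma : [0, 1] \to \mathcal{Z}^0_\iota$ such that $\gamma(0) = \zeta$ and $\gamma(1) = \tilde\zeta$. Note that $\{\gamma(t) : t \in [0, 1]\} \subset \mathcal{Z}^0_\iota$. Consider a composite function $\Pi \circ \gamma : [0, 1] \to \Pi(\mathcal{Z}^0_\iota) \subset \mathbb{R}^{d_x}$. Then, $\Pi(\gamma(\cdot))$ is differentiable, and by the mean value theorem, there exists $t^* \in [0, 1]$ such that

$$\Pi(\gamma(1)) - \Pi(\gamma(0)) = \frac{\partial\Pi(\gamma(t^*))}{\partial t}(1 - 0) = \frac{\partial\Pi(\gamma(t^*))}{\partial\zeta'}\frac{\partial\gamma(t^*)}{\partial t}.$$

Note that $\partial\Pi(\gamma(t^*))/\partial\zeta' = 0_{d_x \times d_x}$ since $\gamma(t^*) \in \mathcal{Z}^0_\iota \subset \mathcal{Z}^0$ and $d_x = 1$. This implies that $\Pi(\gamma(1)) = \Pi(\gamma(0))$, or $\Pi(\zeta) = \Pi(\tilde\zeta)$. Therefore for any $\zeta \in \mathcal{Z}^0_\iota$,

$$\Pi(\zeta) = c_\iota \tag{A.3}$$

for some constant $c_\iota$.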

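Also, since $\mathcal{Z}^0$ is closed by Lemma A.2, $\mathcal{Z}^0_\iota$ is closed for each $\iota$. That is, $\mathcal{Z}^0$ is partitioned into a closed disjoint union of $\mathcal{Z}^0_\iota$'s. But Assumption ID2′(a)(ii) says $\mathcal{Z}$ is a connected set in Euclidean space (i.e., $\mathbb{R}^{d_z}$). Therefore, for each $\iota \in I$, $\mathcal{Z}^0_\iota$ must contain accumulation points of $\mathcal{Z}^r$ (Taylor (1965, p. 76)). Now, for any $\pi = \Pi(\zeta) \in \Pi(\mathcal{Z}^0)$, it satisfies that $\zeta \in \mathcal{Z}^0_\iota$ for some $\iota \in I$. Let $\zeta_c \in \mathcal{Z}^0_\iota$ be an accumulation point of $\mathcal{Z}^r$, that is, there exists $\cup_{n=1}^{\infty}\{\zeta_n\} \subset \mathcal{Z}^r$ such that $\lim_{n\to\infty}\zeta_n = \zeta_c$. Then, it follows that

$$\pi = \Pi(\zeta) = c_\iota = \Pi(\zeta_c) = \Pi(\lim_{n\to\infty}\zeta_n) = \lim_{n\to\infty}\Pi(\zeta_n),$$

where the second and third equalities are from (A.3) and the fourth by continuity of $\Pi(\cdot)$. Let $\pi_n = \Pi(\zeta_n)$, then $\pi_n \in \Pi(\mathcal{Z}^r)$ for every $n \ge 1$. Therefore, we can conclude that for any $\pi \in \Pi(\mathcal{Z}^0)$, there exists $\cup_{n=1}^{\infty}\{\pi_n\} \subset \Pi(\mathcal{Z}^r)$ such that $\lim_{n\to\infty}\pi_n = \pi$.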

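Next, we prove that if $\xi \in \{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\}$ then $\xi \in \mathcal{X}^r$. Suppose $\xi \in \{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\}$, i.e., $\xi = \pi + \nu$ for $\pi \in \Pi(\mathcal{Z}^0)$ and $\nu \in \mathrm{int}(\mathcal{V})$. Then, by the result above, there exists $\cup_{n=1}^{\infty}\{\pi_n\} \subset \Pi(\mathcal{Z}^r)$ such that $\lim_{n\to\infty}\pi_n = \pi$. Define a sequence $\nu_n = \xi - \pi_n$ for $n \ge 1$. Notice that $\nu_n$ is not necessarily in $\mathcal{V}$. But, $\nu_n = (\pi + \nu) - \pi_n = \nu + (\pi - \pi_n)$, hence $\lim_{n\to\infty}\nu_n = \nu$. Since $\nu \in \mathrm{int}(\mathcal{V})$, there exists an open neighborhood $B_\varepsilon(\nu)$ of $\nu$ for some $\varepsilon$ such that $B_\varepsilon(\nu) \subset \mathrm{int}(\mathcal{V})$. Also, by the fact that $\lim_n \nu_n = \nu$, there exists $N_\varepsilon$ such that for all $n \ge N_\varepsilon$, $\nu_n \in B_\varepsilon(\nu)$. Therefore, by conveniently taking $n = N_\varepsilon$, $\xi$ satisfies that

$$\xi = \pi_{N_\varepsilon} + \nu_{N_\varepsilon},$$

where $\pi_{N_\varepsilon} \in \Pi(\mathcal{Z}^r)$ and $\nu_{N_\varepsilon} \in B_\varepsilon(\nu) \subset \mathrm{int}(\mathcal{V}) \subset \mathcal{V}$. That is, $\xi \in \mathcal{X}^r$. □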

Proof of Lemma A.1(b): Recall dx = 1. Note that V ⊂ R can be expressed by a union

of disjoint intervals. Since we are able to choose a rational number in each interval, the

union is a countable union. But note that each interval has at most two end points which

are the boundary of it. Therefore ∂V is countable. To prove that λLeb(Π(Z0)) = 0, we use

the following proposition:

Proposition A.1 (Lusin-Saks, Corollary 6.1.3 in Garg (1998)) Let X be a normed

vector space. Let f : X → R and E ⊂ X. If at each point of E at least one unilateral

derivative of f is zero, then λLeb(f(E)) = 0.

Note that $\mathcal{Z}^0$ is the support where $\partial\Pi(z)/\partial z_k = 0$ for $k \le d_z$. Therefore, its bilateral (directional) derivative $D_\alpha\Pi(z)$ in the direction $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{d_z})'$ satisfies $D_\alpha\Pi(z) = \sum_{k=1}^{d_z}\alpha_k \cdot \partial\Pi(z)/\partial z_k = 0$. Since the bilateral derivative is zero, each unilateral derivative is also zero; see, e.g., Giorgi et al. (2004, p. 94) for the definitions of various derivatives. Then, by Proposition A.1, $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$. □

Proof of Theorem 2.2: Consider equation (2.2) with z = z2,

$$E[y|x, z] = E[y|v, z] = g(\Pi(z) + v) + \lambda(v), \tag{2.2}$$

and note that the conditional expectations and $\Pi(\cdot)$ are consistently estimable, and v can also be estimated. By differentiating both sides of (2.2) with respect to z, we have

$$\frac{\partial E[y|v, z]}{\partial z'} = \frac{\partial g(x)}{\partial x'} \cdot \frac{\partial\Pi(z)}{\partial z'}. \tag{A.4}$$

Now, suppose $\Pr[z \in \mathcal{Z}^r] > 0$. For any fixed value z such that $z \in \mathcal{Z}^r$, we have $\mathrm{rank}(\partial\Pi(z)/\partial z') = d_x$ by definition, hence the system of equations (A.4) has a unique solution $\partial g(x)/\partial x'$ for x in the conditional support $\mathcal{X}_z$. That is, $\partial g(x)/\partial x'$ is locally identified for $x \in \mathcal{X}_z$. Now, since the above argument is true for any $z \in \mathcal{Z}^r$, we have that $\partial g(x)/\partial x'$ is identified on $x \in \mathcal{X}^r$. Now by Assumption ID2, the difference between $\mathcal{X}^r$ and $\mathcal{X}$ has probability zero, thus $\partial g(x)/\partial x'$ is identified.29

Next, we prove the necessity part of the theorem. Suppose Pr[z ∈ Zr] = 0. This implies

Pr[z ∈ Z0] = 1, but since Z0 is closed, Z0 = Z. Therefore, for any z ∈ Z, the system of

equations (A.4) either has multiple solutions or no solution. Therefore g(Π(z) + v) is not

identified. �

A.2 Key Theoretical Step and Technical Assumptions (Section 5)

For asymptotic theory, we require a key preliminary step to separate out the weak instrument

factor from the second moment matrices of interest. In order to do so, we first introduce a

transformation matrix and apply the mean value expansion to those matrices.

Define a vector of approximating functions of order $K = K_1 + K_2 + 1$ for the second stage,

$$p^K(w) = (1, p_1(x), \ldots, p_{K_1}(x), p_1(v), \ldots, p_{K_2}(v))' = \big[\,1 \;\vdots\; p^{K_1}(x)' \;\vdots\; p^{K_2}(v)'\,\big]',$$

where $p^{K_1}(x)$ and $p^{K_2}(v)$ are vectors of approximating functions for $g_0(\cdot)$ and $\lambda_0(\cdot)$ of orders $K_1$ and $K_2$, respectively. Note that this rewrites $p^K(w) = (p_1(w), \ldots, p_K(w))'$ of the main body for expositional convenience in illustrating the step of this subsection. When not stated explicitly, $p^K(w)$ should be read as its generic expression (such as in the proof of Lemma A.4).30 Define a $K \times K$ sample second moment matrix

$$\hat{Q} = \frac{\hat{P}'\hat{P}}{n} = \frac{\sum_{i=1}^{n} p^K(\hat{w}_i)p^K(\hat{w}_i)'}{n}. \tag{A.5}$$

Then, $\hat\beta_\tau = (\hat{Q} + \tau_n D_{K^*})^{-1}\hat{P}'y/n$.
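The closed form for the penalized coefficients can be sketched in code as follows. This is an illustrative two-step implementation under stated assumptions, not the paper's code: power series are used for all bases, z is taken to be scalar, $D_{K^*}$ is taken to be the diagonal matrix with zeros in its first $K^*$ entries and ones elsewhere (consistent with how it is used in the proof of Lemma A.4), and the names `poly_basis` and `penalized_control_function_fit` are hypothetical.

```python
import numpy as np

def poly_basis(u, order):
    """Columns u, u^2, ..., u^order (no constant term)."""
    u = np.asarray(u, dtype=float)
    return np.column_stack([u ** j for j in range(1, order + 1)])

def penalized_control_function_fit(y, x, z, K1=6, K2=6, L=6, tau=0.001, k_star=1):
    """Two-step penalized series estimate: beta = (Q + tau * D)^{-1} P'y / n."""
    n = len(y)
    # First stage: series regression of x on z; residuals act as the control variable.
    R = np.column_stack([np.ones(n), poly_basis(z, L)])
    v_hat = x - R @ np.linalg.lstsq(R, x, rcond=None)[0]
    # Second stage regressors: a single constant, then terms in x, then terms in v_hat.
    P = np.column_stack([np.ones(n), poly_basis(x, K1), poly_basis(v_hat, K2)])
    K = P.shape[1]
    # Penalize every coefficient except the first k_star ones.
    D = np.diag((np.arange(K) >= k_star).astype(float))
    Q = P.T @ P / n
    beta = np.linalg.solve(Q + tau * D, P.T @ y / n)
    return beta, v_hat
```

Setting `tau=0` recovers the unpenalized series estimate, so the two fits can be compared directly on the same data.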

For the rest of this subsection, we consider univariate x for simplicity.31 Note that z is still a vector. Under Assumption L, $\Pi_n(\cdot) = n^{-\delta}\Pi(\cdot)$ after applying a normalization c = 0 and suppressing $o_p(n^{-\delta})$ for simplicity in (3.1). Omitting $o_p(n^{-\delta})$ does not affect the asymptotic results developed in the paper. Assume that the approximating function $p_j(x_i)$ is twice differentiable for all j, and for $r \in \{1, 2\}$ define its r-th derivative as $\partial^r p_j(x) = d^r p_j(x)/dx^r$. By mean value expanding each element of $p^{K_1}(x_i)$ around $v_i$, we have, for $j \le K_1$,

$$p_j(x_i) = p_j(n^{-\delta}\Pi(z_i) + v_i) = p_j(v_i) + n^{-\delta}\Pi(z_i)\,\partial p_j(\tilde{v}_i), \tag{A.6}$$

where $\tilde{v}_i$ is a value between $x_i$ and $v_i$. Define $\partial^r p^{K_1}(x) = [\partial^r p_1(x), \partial^r p_2(x), \ldots, \partial^r p_{K_1}(x)]'$. Then, by (A.6) the vector of regressors $p^K(w_i)$ for estimating $h(\cdot)$ can be written as

$$p^K(w_i)' = \big[\,1 \;\vdots\; p^{K_1}(x_i)' \;\vdots\; p^{K_2}(v_i)'\,\big] = \big[\,1 \;\vdots\; p^{K_1}(v_i)' + n^{-\delta}\Pi(z_i)\,\partial p^{K_1}(\tilde{v}_i)' \;\vdots\; p^{K_2}(v_i)'\,\big]. \tag{A.7}$$

Let $\kappa = K_1 = K_2 = (K - 1)/2$. Again, $K_1$, $K_2$, L, K, and $\kappa$ all depend on n. Note that $K \asymp K_1 \asymp K_2 \asymp \kappa$. This setting can be justified by the assumption that the functions $g_0(\cdot)$ and $\lambda_0(\cdot)$ have the same smoothness, which is in fact imposed in Assumption C. The general case of $K_1 \neq K_2$ and multivariate x is discussed in the Supplemental Appendix. Now we choose a transformation matrix $T_n$ to be

$$T_n = \begin{bmatrix} 1 & 0_{1\times\kappa} & 0_{1\times\kappa} \\ 0_{\kappa\times 1} & n^{\delta}I_\kappa & 0_{\kappa\times\kappa} \\ 0_{\kappa\times 1} & -n^{\delta}I_\kappa & I_\kappa \end{bmatrix}.$$

After multiplying both sides of (A.7) by $T_n$, the weak instrument factor is separated from $p^K(w_i)'$: with $u_i = (z_i, v_i)$,

$$\begin{aligned} p^K(w_i)'T_n &= \big[\,1 \;\vdots\; p^{\kappa}(v_i)' + n^{-\delta}\Pi(z_i)\,\partial p^{\kappa}(\tilde{v}_i)' \;\vdots\; p^{\kappa}(v_i)'\,\big] \cdot T_n \\ &= \big[\,1 \;\vdots\; \Pi(z_i)\,\partial p^{\kappa}(\tilde{v}_i)' \;\vdots\; p^{\kappa}(v_i)'\,\big] \\ &= p^{*K}(u_i)' + m_i^{K\prime}, \end{aligned} \tag{A.8}$$

where $p^{*K}(u_i)' = \big[\,1 \;\vdots\; \Pi(z_i)\,\partial p^{\kappa}(v_i)' \;\vdots\; p^{\kappa}(v_i)'\,\big]$ and $m_i^{K\prime} = \big[\,0 \;\vdots\; \Pi(z_i)\,(\partial p^{\kappa}(\tilde{v}_i)' - \partial p^{\kappa}(v_i)') \;\vdots\; (0_{\kappa\times 1})'\,\big]$. To illustrate the role of this linear transformation, rewrite the original vector of regressors in (A.7) as

$$p^K(w_i)' = p^K(w_i)'T_nT_n^{-1} = \{p^{*K}(u_i) + m_i^K\}'T_n^{-1}.$$

Ignoring the remainder vector $m_i^K$, which is shown to be asymptotically negligible below, the original vector $p^K(w_i)$ is separated into two components, $p^{*K}(u_i)$ and $T_n^{-1}$. Note that $p^{*K}(u_i)$ is not affected by the weak instruments and can be seen as a new set of regressors.32

29 Once $\partial g(x)/\partial x'$ is identified, we can identify $\partial\lambda(v)/\partial v'$ by differentiating (2.2) with respect to v:

$$\frac{\partial E[y|v, z]}{\partial v'} = \frac{\partial g(x)}{\partial x'} + \frac{\partial\lambda(v)}{\partial v'}.$$

30 Since $g_0(\cdot)$ and $\lambda_0(\cdot)$ can only be separately identified up to a constant, when estimating $h_0(\cdot)$, we include only one constant function.

31 This corresponds to Assumption ID2′(a). The analysis can also be generalized to the case of a vector x by using multivariate mean value expansion. See the Supplemental Appendix for discussion of this generalization.
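The effect of the transformation can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a power-series basis and hypothetical scalar values for $\Pi(z_i)$ and $v_i$, and the function names are not from the paper. It confirms that post-multiplying the regressor vector by $T_n$ replaces the block in x with $n^{\delta}(p^{\kappa}(x_i) - p^{\kappa}(v_i))$, which is close to $\Pi(z_i)\,\partial p^{\kappa}(v_i)$ when $n^{-\delta}$ is small.

```python
import numpy as np

def transformation_matrix(kappa, n, delta):
    """T_n with blocks [[1, 0, 0], [0, n^delta I, 0], [0, -n^delta I, I]]."""
    s = n ** delta
    I = np.eye(kappa)
    Z = np.zeros((kappa, kappa))
    col = np.zeros((kappa, 1))
    row = np.zeros((1, kappa))
    return np.block([[np.ones((1, 1)), row, row],
                     [col, s * I, Z],
                     [col, -s * I, I]])

def power_basis(u, kappa):
    return np.array([u ** j for j in range(1, kappa + 1)])

def power_basis_derivative(u, kappa):
    return np.array([j * u ** (j - 1) for j in range(1, kappa + 1)])

def separate(n, delta, pi_z, v, kappa=3):
    """Return p^K(w)'T_n and the weak-instrument-free vector p^{*K}(u)'."""
    x = n ** (-delta) * pi_z + v  # x_i = n^{-delta} Pi(z_i) + v_i
    p_w = np.concatenate([[1.0], power_basis(x, kappa), power_basis(v, kappa)])
    transformed = p_w @ transformation_matrix(kappa, n, delta)
    p_star = np.concatenate([[1.0],
                             pi_z * power_basis_derivative(v, kappa),
                             power_basis(v, kappa)])
    return transformed, p_star

# Hypothetical values Pi(z_i) = 0.8 and v_i = 0.3; the remainder m_i shrinks as n grows.
for n in (10**2, 10**4, 10**6):
    transformed, p_star = separate(n, 0.25, 0.8, 0.3)
    print(n, np.max(np.abs(transformed - p_star)))
```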

We apply this result to $\tilde{Q}$ (defined below) and to the population second moment matrix

$$Q = E\big[p^K(w_i)p^K(w_i)'\big]. \tag{A.9}$$

By equations (A.9) and (A.8), it follows that

$$\begin{aligned} T_n'QT_n &= E\big[T_n'p^K(w_i)p^K(w_i)'T_n\big] \\ &= Q^* + E\big[m_i^K p^{*K}(u_i)'\big] + E\big[p^{*K}(u_i)m_i^{K\prime}\big] + E\big[m_i^K m_i^{K\prime}\big], \end{aligned} \tag{A.10}$$

where the newly defined $Q^* = E[p^{*K}(u_i)p^{*K}(u_i)']$ is the population second moment matrix with the new regressors. Also, define

$$\tilde{Q} = \frac{\tilde{P}'\tilde{P}}{n},$$

where the $n \times K$ matrix $\tilde{P} = (p^K(w_1), \ldots, p^K(w_n))'$ is similar to $\hat{P}$ but with the unobservable v instead of the residual $\hat{v}$. In terms of notation, it is useful to note that “∗” denotes a matrix P, Q, or vector $p^K(\cdot)$ that is free from the weak instrument effect. Also, “∼” is for P or Q with $v_i$ and “ˆ” is for ones with $\hat{v}_i$.

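Furthermore, since $\Pi(\cdot) \in C^1(\mathcal{Z})$ can have nonempty $\mathcal{Z}^0$ as a subset of its domain, we define

$$Q^{r*} = E\big[p^{*K}(u_i)p^{*K}(u_i)' \,\big|\, z_i \in \mathcal{Z}^r\big], \qquad Q^{0*} = E\big[p^{*K}(u_i)p^{*K}(u_i)' \,\big|\, z_i \in \mathcal{Z}^0\big].$$

Also define the second moment matrix for the first-stage estimation as $Q_1 = E[r^L(z_i)r^L(z_i)']$.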

Assumptions B, C, D, and L of the main body serve as sufficient conditions for high-level assumptions that are directly used in the proofs for the convergence rate. In this section, we state Assumptions B†, C†, and D†, and the Supplemental Appendix proves that they are implied by Assumptions B, C, D, and L. For a symmetric matrix B, let $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ denote the minimum and maximum eigenvalues of B, respectively, and $\det(B)$ the determinant of B.

32 For justification that $p^{*K}(u_i)$ can be regarded as regressors, see Assumption B in Section 5 and Assumption B† below.

Assumption B† (i) λmin(Qr∗) is bounded away from zero for all K(n) and λmin(Q1) is

bounded away from zero for all L(n);

(ii) λmax(Q∗) and λmax(Q) are bounded by a fixed constant, for all K(n), and λmax(Q1)

bounded by a fixed constant, for all L(n); (iii) The diagonal elements qjj of Q are bounded

away from zero for j ≤ K∗.

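Assumption C† For all n, (i) $\beta_j \le Cj^{-s/d_x - 1}$ and $\gamma_{nl} \le Cl^{-s_\pi/d_z - 1}$; (ii) for all positive integers K and L, $\sup_{w \in \mathcal{W}}\big|h_0(w) - \sum_{j=1}^{K}\beta_j p_j(w)\big| \le CK^{-s/d_x}$ and $\sup_{z \in \mathcal{Z}}\big\|\Pi_n(z) - \sum_{j=1}^{L}\gamma_{nj}r_j(z)\big\| \le CL^{-s_\pi/d_z}$.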

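For the next assumption let $\zeta_r^v(\kappa)$ and $\xi_r(L)$ satisfy

$$\max_{|\mu| \le r}\sup_{v \in \mathcal{V}}\|\partial^\mu p^{\kappa}(v)\| \le \zeta_r^v(\kappa), \qquad \max_{|\mu| \le r}\sup_{z \in \mathcal{Z}}\|\partial^\mu r^L(z)\| \le \xi_r(L),$$

which impose nonstochastic uniform bounds on the vectors of approximating functions. Let $\Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}$.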

Assumption D† (i) $n^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi \to 0$; (ii) $n^{-\delta}\zeta_2^v(\kappa)\kappa^{1/2} \to 0$; (iii) $\xi_0(L)^2L/n \to 0$; also, (iv) $K > 2(K^* + 1)$ for any n and $K^* \asymp K$.

A.3 Key Lemmas for Asymptotic Theory (Section 5)

Lemmas A.3 and A.4 of this subsection describe the subsequent step and obtain the orders of magnitude of the eigenvalues of the second moment matrices in terms of the weak instrument and penalization rates. In proving these lemmas, we frequently apply two useful mathematical lemmas (Lemmas B.1 and B.2) that are stated and proved in the Supplemental Appendix. For any matrix A, let the matrix norm be the Euclidean norm $\|A\| = \sqrt{\mathrm{tr}(A'A)}$.

The following lemmas are the main results of this subsection.

Lemma A.3 Suppose Assumptions ID, A, B†, D†, and L are satisfied. Then, (a) $\lambda_{\max}(Q^{-1}) = O(n^{2\delta})$ and (b) $\lambda_{\max}(\hat{Q}^{-1}) = O_p(n^{2\delta})$.

We only present the proof of (a) here. The proof of (b), which uses a similar strategy but involves more work, can be found in the Supplemental Appendix.

Lemma A.4 Suppose Assumptions ID, A, B†, D†, and L are satisfied. Then,

$$\lambda_{\max}(Q_\tau^{-1}) = O_p\big(\min\{\lambda_{\max}(Q^{-1}), \tau_n^{-1}\}\big). \tag{A.11}$$

In all proofs, let C denote a generic positive constant that may differ from one use to another. TR, CS, and MK denote the triangle inequality, the Cauchy–Schwarz inequality, and the Markov inequality, respectively.

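Preliminary derivations for the proofs of Lemmas A.3 and A.4 and Theorem 5.1: Before proving the lemmas and theorems below, it is useful to list the implications of Assumption D†(i) that are used in the proofs. Define

$$\Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}, \qquad \Delta_{\hat{Q}} = \zeta_1^v(\kappa)^2\Delta_\pi^2 + \kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi, \qquad \Delta_{\tilde{Q}} = \sqrt{\zeta_1^v(\kappa)^2\kappa/n}.$$

Note that $\Delta_{\tilde{Q}} = \zeta_1^v(\kappa)\sqrt{\kappa/n} \to 0$ by

$$n^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi \to 0 \tag{A.12}$$

of Assumption D†(i). Also

$$n^{2\delta}\Delta_{\hat{Q}} = n^{2\delta}\big\{\zeta_1^v(\kappa)^2\Delta_\pi^2 + \kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi\big\} \to 0,$$

$$n^{2\delta}\Delta_{\tilde{Q}} = n^{2\delta}\sqrt{\zeta_1^v(\kappa)^2\kappa/n} \le Cn^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)/\sqrt{n} \to 0$$

and

$$n^{2\delta}\zeta_1^v(\kappa)^2\Delta_\pi^2/n \to 0 \tag{A.13}$$

by $n^{\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi \to 0$ and $n^{2\delta}\zeta_1^v(\kappa)\Delta_\pi \to 0$, which are implied by (A.12). □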

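Proof of Lemma A.3(a): Let $p_i^* = p^{*K}(u_i)$ and $m_i = m_i^K$ for brevity. Recall (A.10) that $T_n'QT_n = Q^* + E[m_ip_i^{*\prime}] + E[p_i^*m_i'] + E[m_im_i']$. Then,

$$\|T_n'QT_n - Q^*\| \le 2E\|m_i\|\|p_i^*\| + E\|m_i\|^2 \le 2\big(E\|m_i\|^2\big)^{1/2}\big(E\|p_i^*\|^2\big)^{1/2} + E\|m_i\|^2$$

by CS. But $m_i = m_i^K = \big[\,0 \;\vdots\; \Pi(z_i)(\partial p^{\kappa}(\tilde{v}_i)' - \partial p^{\kappa}(v_i)') \;\vdots\; (0_{\kappa\times 1})'\,\big]'$ where $\tilde{v}$ is the intermediate value between x and v. Then, by mean value expanding $\partial p^{\kappa}(\tilde{v}_i)$ around $v_i$ and using $|\tilde{v}_i - v_i| \le |x_i - v_i|$, we have

$$\begin{aligned} \|m_i\|^2 &= \big\|\Pi(z_i)\,\partial^2p^{\kappa}(\bar{v}_i)(\tilde{v}_i - v_i)\big\|^2 \le |\Pi(z_i)|^2\zeta_2^v(\kappa)^2|x_i - v_i|^2 \\ &= n^{-2\delta}|\Pi(z_i)|^4\zeta_2^v(\kappa)^2 \le Cn^{-2\delta}\zeta_2^v(\kappa)^2, \end{aligned} \tag{A.14}$$

where $\bar{v}$ is the intermediate value between $\tilde{v}$ and v, and the last inequality is by Assumption L that $\sup_z|\Pi(z)| < \infty$. Therefore

$$E\|m_i\|^2 \le Cn^{-2\delta}\zeta_2^v(\kappa)^2. \tag{A.15}$$

Then, by Assumption B†(ii),

$$E[p_i^{*\prime}p_i^*] = \mathrm{tr}(Q^*) \le \mathrm{tr}(I_{2\kappa})\lambda_{\max}(Q^*) \le C \cdot \kappa. \tag{A.16}$$

Therefore, by combining (A.15) and (A.16) it follows that

$$\|T_n'QT_n - Q^*\| \le O(\kappa^{1/2}n^{-\delta}\zeta_2^v(\kappa)) + O(n^{-2\delta}\zeta_2^v(\kappa)^2) = o(1) \tag{A.17}$$

by Assumption D†(ii), which shows that all the remainder terms are negligible.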

Now, by Lemma B.1, we have

|λmin(T ′nQTn)− λmin(Q∗)| ≤ ‖T ′nQTn −Q∗‖ (A.18)

Combine the results (A.17) and (A.18) to have $\lambda_{\min}(T_n'QT_n) = \lambda_{\min}(Q^*) + o(1)$. But note that, with simpler notations $p_1 = \Pr[z \in Z^r]$ and $p_0 = \Pr[z \in Z^0]$, we have $Q^* = p_1Q^{r*} + p_0Q^{0*}$. Then, by a variant of Lemma B.1 (with the fact that $\lambda_1(-B) = -\lambda_k(B)$ for any symmetric matrix $B$), it follows that $\lambda_{\min}(Q^*) \ge p_1\cdot\lambda_{\min}(Q^{r*}) + p_0\cdot\lambda_{\min}(Q^{0*}) = p_1\cdot\lambda_{\min}(Q^{r*})$, because $\lambda_{\min}(Q^{0*}) = 0$. Since $p_1 > 0$, it holds that $\lambda_{\min}(Q^*) \ge p_1\cdot\lambda_{\min}(Q^{r*}) \ge c > 0$ for all $K(n)$ by Assumption B†(i). Therefore,

λmin(T ′nQTn) ≥ c > 0. (A.19)

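Let

$$T_{0n} = \begin{bmatrix} n^\delta & 0 \\ -n^\delta & 1 \end{bmatrix} \otimes I_\kappa, \quad \text{so that} \quad T_n = \begin{bmatrix} 1 & 0_{1\times 2\kappa} \\ 0_{2\kappa\times 1} & T_{0n} \end{bmatrix}.$$

Then, by solving $\begin{vmatrix} n^\delta - \lambda & 0 \\ -n^\delta & 1-\lambda \end{vmatrix} = 0$, we have $\lambda = n^\delta$ or $1$ for the eigenvalues of $T_{0n}$, and since $\lambda_{\max}(I_\kappa) = 1$, it follows that

$$\lambda_{\max}(T_n) = \lambda_{\max}(T_{0n}) = O(n^\delta). \tag{A.20}$$

Note that $\lambda_{\max}(T_nT_n') = O(n^{2\delta})$ by Lemma B.2. Since (A.19) implies $\lambda_{\max}((T_n'QT_n)^{-1}) = O(1)$, it follows that

$$\lambda_{\max}(Q^{-1}) = \lambda_{\max}(T_n(T_n'QT_n)^{-1}T_n') \le O(1)\lambda_{\max}(T_nT_n') = O(n^{2\delta})$$

by applying Lemma B.2 again. □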
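As a quick numerical sanity check of the eigenvalue claim behind (A.20), the following minimal Python sketch computes the eigenvalues of the 2 × 2 block of T0n from its characteristic polynomial. The helper names `eig2x2` and `block_eigs` and the values n = 10,000 and δ = 0.25 are illustrative assumptions, not part of the paper.

```python
import math

def eig2x2(a, b, c, d):
    """Real eigenvalues (ascending) of [[a, b], [c, d]] via the characteristic polynomial."""
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4.0 * det)  # assumes a non-negative discriminant
    return (tr - disc) / 2.0, (tr + disc) / 2.0

def block_eigs(n, delta):
    """Eigenvalues of the 2 x 2 block [[n^delta, 0], [-n^delta, 1]] of T_0n."""
    s = n ** delta
    return eig2x2(s, 0.0, -s, 1.0)

# Hypothetical values chosen only for illustration.
small, large = block_eigs(n=10_000, delta=0.25)
print(small, large)  # approximately 1 and n^delta = 10
```

Because the block is lower triangular, its eigenvalues are its diagonal entries, and the Kronecker product with Iκ only repeats each of them κ times.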

Proof of Lemma A.4: In proving this lemma, we use generic $K$ and $K^*$ for expositional convenience. To deal with the singularity in $\tau_nD_{K^*}$, define a matrix $\Gamma_{K^*}$ whose elements $\gamma_{jk}$ satisfy the following: for $j \le K^*$ and $k \ne j$, $\gamma_{jk} = q_{jk}$, $\gamma_{kj} = q_{kj}$, and $\gamma_{jj} = q_{jj}/2$, where $q_{jk}$ is an element in $Q$; and for $j, k > K^*$, $\gamma_{jk} = 0$. Write

$$\lambda_{\max}(Q_\tau^{-1}) = \frac{1}{\lambda_{\min}(Q + \tau_nD_{K^*})} = \frac{1}{\lambda_{\min}(Q - \Gamma_{K^*} + \Gamma_{K^*} + \tau_nD_{K^*})} \le \frac{1}{\lambda_{\min}(Q - \Gamma_{K^*}) + \lambda_{\min}(\Gamma_{K^*} + \tau_nD_{K^*})}.$$

Note that $\lambda_{\min}(\Gamma_{K^*} + \tau_nD_{K^*}) = \tau_n\lambda_{\min}(\tau_n^{-1}\Gamma_{K^*} + D_{K^*})$. By Assumption D†(iv), we have that $\lambda_{\min}(\tau_n^{-1}\Gamma_{K^*} + D_{K^*})$ is either $1$ or $\tau_n^{-1}q_{jj}/2$ ($j \le K^*$) for sufficiently large $n$. This is because

$$\det(\tau_n^{-1}\Gamma_{K^*} + D_{K^*} - \lambda\cdot I_K) = \prod_{j=1}^{K^*}(\tau_n^{-1}q_{jj}/2 - \lambda_j)\prod_{j=K^*+1}^{K}(1 - \lambda_j),$$

and in calculating the determinant, note that off-diagonal, reverse diagonal, and reverse off-diagonal multiplications are all zero when $K > 2(K^*+1)$ for any $n$. More precisely, all off-diagonal multiplications are zero for $K > 2K^*+1$; all reverse-diagonal multiplications are zero for $K > 2K^*+1$; all reverse off-diagonal multiplications are zero for $K > 2K^*+2$. Therefore, $\lambda_{\min}(\Gamma_{K^*} + \tau_nD_{K^*}) = C\tau_n$.

Next, let $Q_{K^*}$ be the lower block of the block diagonal matrix $Q - \Gamma_{K^*}$. By the fact that the eigenvalues of a block diagonal matrix are those of its blocks, we have that

$$\lambda_{\max}((Q - \Gamma_{K^*})^{-1}) = \max\{2/q_{11}, 2/q_{22}, \ldots, 2/q_{K^*K^*}, \lambda_{\max}(Q_{K^*}^{-1})\}.$$

By Assumption B†(iii) that $q_{jj}$ for $j \le K^*$ are bounded away from zero and since $\lambda_{\min}(Q_{K^*}) \to_p 0$ as $n \to \infty$, for sufficiently large $n$, we have that $\lambda_{\min}(Q_{K^*}) < q_{jj}/2$ or $\lambda_{\max}(Q_{K^*}^{-1}) > 2/q_{jj}$ for all $j \le K^*(n)$. Consequently,

$$\lambda_{\max}(Q_\tau^{-1}) \le \min\{\lambda_{\max}((Q - \Gamma_{K^*})^{-1}), C\tau_n^{-1}\} \le \min\{\lambda_{\max}(Q_{K^*}^{-1}), C\tau_n^{-1}\} = O_p(\min\{n^{2\delta}, \tau_n^{-1}\}),$$

where $\lambda_{\max}(Q_{K^*}^{-1}) = O_p(n^{2\delta})$ can be shown similarly to the proof of $\lambda_{\max}(Q^{-1}) = O_p(n^{2\delta})$ by redefining $T_n$ accordingly. □
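The role of the penalty in this bound can be illustrated numerically. The sketch below is a simplified check under stated assumptions: it uses a 2 × 2 nearly singular matrix in place of Q and the identity in place of DK∗ (so every direction is penalized), and the function names and numbers are hypothetical. It confirms that the largest eigenvalue of the penalized inverse does not exceed the smaller of λmax(Q−1) and 1/τ.

```python
import math

def sym_eigs(a, b, d):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, d]]."""
    mean, half = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    return mean - half, mean + half

def lam_max_penalized_inverse(a, b, d, tau):
    """lambda_max((Q + tau*I)^{-1}) = 1 / lambda_min(Q + tau*I)."""
    lam_min, _ = sym_eigs(a + tau, b, d + tau)
    return 1.0 / lam_min

# Nearly collinear design: lambda_min(Q) is about 0.001.
a, b, d = 1.0, 0.999, 1.0
lam_min_q, _ = sym_eigs(a, b, d)

results = []
for tau in (1e-6, 1e-3, 1e-1):
    value = lam_max_penalized_inverse(a, b, d, tau)
    bound = min(1.0 / lam_min_q, 1.0 / tau)
    results.append((tau, value, bound))
    print(f"tau={tau:g}: lambda_max={value:.3f}, bound={bound:.3f}")
```

For small τ the weak-identification term 1/λmin(Q) is the binding bound, while for larger τ the penalty term 1/τ takes over, which mirrors the two regimes in the min{·, ·} rate.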

A.4 Decaying Rates of Coefficients and Approximation Errors (Section 5)

Proof of Lemma 5.1: Let $\|h\|_\infty = \sup_w|h(w)|$ for simplicity. We provide a proof for the expansion of $h(w)$; a similar proof applies to $\Pi_n(z)$. Let $d$ be a generic dimension of $w$ for now. Let $p_j(w) = \prod_{l=1}^d p_{\lambda_l(j)}(w_l)$ be the tensor product of orthogonal polynomials, where $\lambda_l(j)$ is the order of each univariate polynomial, and let $\omega(w) = \prod_{l=1}^d\omega_l(w_l)$ be the associated weight. Here let $\lambda(j) = (\lambda_1(j), \lambda_2(j), \ldots, \lambda_d(j))$ be the multi-index with norm $|\lambda(j)| = \sum_{l=1}^d\lambda_l(j)$, which is monotonically increasing in $j$, and $\{\lambda(j)\}_{j=1}^\infty$ are distinct such vectors. Also, $\lambda_l(j) \asymp \lambda(j)$ for all $l$ for some increasing sequence $\lambda(j)$. Let $\beta_j = \int_W h(w)p_j(w)\omega(w)\,dw$ and denote $h_K(w) = \sum_{j=1}^K\beta_jp_j(w)$. By Assumption C that $h(\cdot)$ is Lipschitz, $h_K(\cdot)$ is uniformly convergent to $h(\cdot)$ (Trefethen (2008, p. 75)).

We derive the decay rate of the coefficients on the Chebyshev polynomials. By Assumption B, let $W = \prod_{l=1}^dW_l$ where $W_l$ is the marginal support of $w_l$ and, without loss of generality, let $W_l = [-1, 1]$ for all $l$. Also let $\omega_l(w_l) = 1/\sqrt{1 - w_l^2}$ for all $l$. Then $\|h\|_{W,\omega}$ becomes the Chebyshev-weighted seminorm denoted as $\|h\|_T$. First, for a given $l \in \{1, \ldots, d\}$, we focus on $h(w)$ as a univariate function of $w_l$ on $W_l$ in

$$\beta_j = \int_{W_{-l}}\int_{W_l}h(w_l, w_{-l})\,p_{\lambda_l(j)}(w_l)\,\omega_l(w_l)\,dw_l\prod_{m\ne l}p_{\lambda_m(j)}(w_m)\,\omega_m(w_m)\,dw_{-l},$$

where $w_{-l} \in W_{-l}$ denotes the remaining vector. Let $\mu$ be such that $|\mu| = \sum_{l=1}^d\mu_l = s$. For a given $w_{-l} \in W_{-l}$, by Assumption C, $h(w_l, w_{-l})$ as a function of $w_l$ is $\mu_l$-times (and hence $(\mu_l - 1)$-times) absolutely continuously differentiable. Suppose also that $\|\partial^{\mu_l}h(\cdot, w_{-l})\|_T$ is bounded for $|\mu_l| = \mu_l$. Then, one can apply the univariate results of Trefethen (2008, Theorem 4.2) to yield

$$\int h(w_l, w_{-l})\,p_{\lambda_l(j)}(w_l)\,\omega_l(w_l)\,dw_l = O\left(\lambda_l(j)^{-\mu_l-1}\right).$$

Likewise, under Assumption C that $h(w)$ is $s$-times (and hence $(\mu_l - 1)$-times with respect to $w_l$ for each $l$) absolutely continuously differentiable and $\|\partial^\mu h\|_T$ is bounded for $|\mu| = s$, we can apply the integration by parts in his proof further with respect to $w_{-l}$ as well, which yields

$$\beta_j = \int_Wh(w)p_j(w)\omega(w)\,dw = O\left(\lambda_1(j)^{-\mu_1-1}\lambda_2(j)^{-\mu_2-1}\cdots\lambda_d(j)^{-\mu_d-1}\right).$$

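To illustrate this, suppose $d = 2$ and $s_1 = s_2 = 0$ for simplicity. Let $T_{\lambda_l(j)}(w_l)$ denote the univariate Chebyshev polynomial of order $\lambda_l(j)$. Then,

$$\begin{aligned}
\beta_j &= \left(\frac{2}{\pi}\right)^2\int_{-1}^1\int_{-1}^1h(w_1, w_2)\frac{T_{\lambda_1(j)}(w_1)}{\sqrt{1 - w_1^2}}\frac{T_{\lambda_2(j)}(w_2)}{\sqrt{1 - w_2^2}}\,dw_1dw_2 \\
&= \left(\frac{2}{\pi}\right)^2\int_0^\pi\int_0^\pi h(\cos\theta_1, \cos\theta_2)\cos(\lambda_1(j)\theta_1)\cos(\lambda_2(j)\theta_2)\,d\theta_1d\theta_2 \\
&= \left(\frac{2}{\pi}\right)^2\frac{1}{\lambda_1(j)\lambda_2(j)}\int_0^\pi\int_0^\pi h_{12}(\cos\theta_1, \cos\theta_2)\sin(\lambda_1(j)\theta_1)\sin\theta_1\sin(\lambda_2(j)\theta_2)\sin\theta_2\,d\theta_1d\theta_2,
\end{aligned}$$

where the last equality follows by iteratively applying integration by parts with respect to each argument, and where the boundary terms vanish because of the sine functions. Applying $2\sin\theta'\sin\theta'' = \cos(\theta' - \theta'') - \cos(\theta' + \theta'')$, we have

$$\begin{aligned}
|\beta_j| &\le \left(\frac{2}{\pi}\right)^2\frac{1}{\lambda_1(j)\lambda_2(j)}\int_0^\pi\int_0^\pi|h_{12}(\cos\theta_1, \cos\theta_2)|\left\|\frac{\cos((\lambda_1(j) - 1)\theta_1)}{2} - \frac{\cos((\lambda_1(j) + 1)\theta_1)}{2}\right\|_\infty \\
&\qquad\times\left\|\frac{\cos((\lambda_2(j) - 1)\theta_2)}{2} - \frac{\cos((\lambda_2(j) + 1)\theta_2)}{2}\right\|_\infty d\theta_1d\theta_2 \\
&\le \left(\frac{2}{\pi}\right)^2\frac{\|h(w)\|_T}{\lambda_1(j)\lambda_2(j)} \le \frac{C}{\lambda_1(j)\lambda_2(j)},
\end{aligned}$$

where the second-to-last inequality holds because the $\|\cdot\|_\infty$ terms are bounded by 1 and the last inequality is by Assumption C for $s = s_1 + s_2 = 0$. A similar argument follows with general $d$ and $s$ by iterating the procedure.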
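In one dimension, the decay of Chebyshev coefficients can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's computation: it uses the univariate test function h(w) = |w| (chosen here because its coefficients are known in closed form) and a midpoint quadrature in the angle variable, with a hypothetical helper `cheb_coeff`. For this function the even-order coefficients equal 4/(π(k² − 1)) in absolute value, so k²|a_k| stays bounded, which is the kind of polynomial decay the argument above relies on.

```python
import math

def cheb_coeff(h, k, n_nodes=20_000):
    """(2/pi) * integral over [0, pi] of h(cos t) cos(k t) dt, by the midpoint rule."""
    step = math.pi / n_nodes
    total = 0.0
    for i in range(n_nodes):
        t = (i + 0.5) * step
        total += h(math.cos(t)) * math.cos(k * t)
    return 2.0 / math.pi * total * step

# h(w) = |w| is Lipschitz on [-1, 1]; compare with the closed form for even k.
coeffs = {k: cheb_coeff(abs, k) for k in (2, 4, 8, 16, 32)}
for k, a_k in coeffs.items():
    print(k, abs(a_k), 4.0 / (math.pi * (k * k - 1)))
```

The change of variables w = cos t used inside `cheb_coeff` is the same one that turns the weighted integral over [−1, 1] into the cosine integral in the display above.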

Note that $j = \prod_{l=1}^d(\lambda_l(j) + 1)$, which manifests the curse of dimensionality. Since $j \asymp (\lambda(j))^d$ or $j^{1/d} \asymp \lambda(j)$,

$$\beta_j = O\left(j^{-(s+d)/d}\right) = O\left(j^{-s/d-1}\right).$$

Additionally, because of the additivity of $h(\cdot)$, we have reduced dimensionality such that $d = d_x$ and $j/2 = \prod_{l=1}^{d_x}(\lambda_l(j) + 1)$. Therefore, $\beta_j = O\left(j^{-s/d_x-1}\right)$.

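Now we calculate the approximation error. Since $\|h_{\bar K} - h_K\|_\infty = \left\|\sum_{j=K+1}^{\bar K}\beta_jp_j(w)\right\|_\infty \le C\sum_{j=K+1}^{\bar K}|\beta_j|$ by Assumption B†(ii) that $\|p_j\|_\infty \le C$, for all $\bar K > K$,

$$\|h - h_K\|_\infty \le \|h - h_{\bar K}\|_\infty + \|h_{\bar K} - h_K\|_\infty \le \|h - h_{\bar K}\|_\infty + C\sum_{j=K+1}^{\bar K}|\beta_j|.$$

Then, based on the results above that $\beta_j = O(j^{-s/d_x-1})$ and $\lim_{\bar K\to\infty}\|h - h_{\bar K}\|_\infty = 0$,

$$\|h - h_K\|_\infty \le \lim_{\bar K\to\infty}\|h - h_{\bar K}\|_\infty + C\lim_{\bar K\to\infty}\sum_{j=K+1}^{\bar K}|\beta_j| \le C\sum_{j=K+1}^\infty j^{-s/d_x-1} \le C\int_K^\infty x^{-s/d_x-1}\,dx = \tilde CK^{-s/d_x}.$$

□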
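The last step of the proof bounds a tail sum by an integral. A minimal numerical sketch of that comparison follows, assuming the illustrative value a = s/d_x = 2 and a finite truncation of the infinite sum; the function names `tail_sum` and `integral_bound` are hypothetical.

```python
def tail_sum(K, a, terms=200_000):
    """Truncated version of the sum over j > K of j^(-a-1)."""
    return sum(j ** (-a - 1.0) for j in range(K + 1, K + 1 + terms))

def integral_bound(K, a):
    """Closed form of the integral of x^(-a-1) over [K, infinity): K^(-a) / a."""
    return K ** (-a) / a

a = 2.0  # a = s / d_x, e.g. s = 2 and d_x = 1 (illustrative values)
for K in (5, 20, 80):
    print(K, tail_sum(K, a), integral_bound(K, a))
```

The integral bound scales exactly as K^(-s/d_x), so multiplying K by four reduces it by a factor of 4^(s/d_x), which is the approximation rate stated in the lemma.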

Lemma 5.1 delivers the case of the Chebyshev polynomials. For the Legendre polynomials, similar results follow by extending Theorem 2.1 of Wang and Xiang (2012) as above under the same assumptions as in Lemma 5.1. For example, $\beta_j = \int_Wh(w)p_j(w)\omega(w)\,dw = O(j^{-s/d-1/2})$, where $p_j(w)$ is the tensor product of the Legendre polynomials. We omit the full results.

A.5 Proofs of Rate of Convergence (Section 5)

Given the results of the lemmas above, we derive the rates of convergence. We first derive the rate of convergence of the unpenalized series estimator $\hat h(\cdot)$ defined as

$$\hat h(w) = p^K(w)'\hat\beta, \quad \text{where } \hat\beta = (\hat P'\hat P)^{-1}\hat P'y.$$

Then, we prove the main theorem with the penalized estimator $\hat h_\tau(\cdot)$ defined in Section 4.

Lemma A.5 Suppose Assumptions A–D, and L are satisfied. Then,

$$\|\hat h - h_0\|_{L_2} = O_p\left(n^\delta\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right).$$

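Proof of Lemma A.5: Let $\beta$ be such that $\beta = (\beta_1, \ldots, \beta_K)$. Then, by TR of the $L_2$ norm (first ineq.),

$$\begin{aligned}
\|\hat h - h_0\|_{L_2} &= \left\{\int[\hat h(w) - h_0(w)]^2\,dF(w)\right\}^{1/2} \\
&\le \left\{\int[p^K(w)'(\hat\beta - \beta)]^2\,dF(w)\right\}^{1/2} + \left\{\int[p^K(w)'\beta - h_0(w)]^2\,dF(w)\right\}^{1/2} \\
&= \left\{(\hat\beta - \beta)'E\,p^K(w)p^K(w)'(\hat\beta - \beta)\right\}^{1/2} + O(K^{-s/d_x}) \\
&\le C\|\hat\beta - \beta\| + O(K^{-s/d_x})
\end{aligned}$$

by Assumption B†(ii) and using Lemma B.2 (last ineq.). As $\hat\beta - \beta = (\hat P'\hat P)^{-1}\hat P'(y - \hat P\beta)$, it follows that

$$\begin{aligned}
\|\hat\beta - \beta\|^2 &= (\hat\beta - \beta)'(\hat\beta - \beta) \\
&= (y - \hat P\beta)'\hat P(\hat P'\hat P)^{-1}(\hat P'\hat P)^{-1}\hat P'(y - \hat P\beta) \\
&= (y - \hat P\beta)'\hat P\hat Q^{-1}\hat Q^{-1}\hat P'(y - \hat P\beta)/n^2 \\
&= (y - \hat P\beta)'\hat P\hat Q^{-1/2}\hat Q^{-1}\hat Q^{-1/2}\hat P'(y - \hat P\beta)/n^2 \\
&\le O_p(n^{2\delta})\,(y - \hat P\beta)'\hat P(\hat P'\hat P)^{-1}\hat P'(y - \hat P\beta)/n
\end{aligned}$$

by Lemma B.2 and Lemma A.3(b) (last ineq.).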

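Let $h = (h(w_1), \ldots, h(w_n))'$ and $\hat h = (h(\hat w_1), \ldots, h(\hat w_n))'$. Also let $\eta_i = y_i - h_0(w_i)$ and $\eta = (\eta_1, \ldots, \eta_n)'$. Let $W = (w_1, \ldots, w_n)'$; then $E[y_i|W] = h_0(w_i)$, which implies $E[\eta_i|W] = 0$. Also, similar to the proof of Lemma A1 in NPV (p. 594), by Assumption A, we have $E[\eta_i^2|W]$ being bounded and $E[\eta_i\eta_j|W] = 0$ for $i \ne j$, where the expectation is taken for $y$. Then, given that $y - \hat P\beta = (y - h) + (h - \hat h) + (\hat h - \hat P\beta)$, we have, by TR,

$$\begin{aligned}
\|\hat\beta - \beta\| &= O_p(n^\delta)\left\|\hat Q^{-1/2}\hat P'(y - \hat P\beta)/n\right\| \\
&\le O_p(n^\delta)\left\{\left\|\hat Q^{-1/2}\hat P'\eta/n\right\| + \left\|\hat Q^{-1/2}\hat P'(h - \hat h)/n\right\| + \left\|\hat Q^{-1/2}\hat P'(\hat h - \hat P\beta)/n\right\|\right\}.
\end{aligned} \tag{A.21}$$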

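For the first term of equation (A.21), consider

$$E\left[\left\|(PT_n - P^*)'\eta/n\right\|^2\,\middle|\,W\right] = E\left[\|M'\eta/n\|^2\,\middle|\,W\right] \le C\frac{1}{n^2}\sum_i\|m_i\|^2 = O_p(n^{-2\delta-1}\zeta^v_2(\kappa)^2) = o_p(1)$$

by (A.14), where $o_p(1)$ is implied by Assumption D†(ii). Therefore, by MK,

$$\left\|(PT_n - P^*)'\eta/n\right\| = o_p(1). \tag{A.22}$$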

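Also,

$$\begin{aligned}
E\left[\left\|(\hat PT_n - PT_n)'\eta/n\right\|^2\,\middle|\,W\right] &\le C\frac{1}{n^2}\sum_i\|(\hat p_i - p_i)'T_n\|^2 \le C\frac{1}{n^2}\sum_i\lambda_{\max}(T_n)^2\|\hat p_i - p_i\|^2 \\
&\le \frac{1}{n}O(n^{2\delta})O_p(\zeta^v_1(\kappa)^2\Delta_\pi^2) = O_p(n^{2\delta}\zeta^v_1(\kappa)^2\Delta_\pi^2/n)
\end{aligned} \tag{A.23}$$

by (A.20) and (B.6), and hence

$$\left\|(\hat PT_n - PT_n)'\eta/n\right\| = o_p(1) \tag{A.24}$$

by Assumption D†(i) (or (A.13)) and MK. Also

$$E\|P^{*\prime}\eta/n\|^2 = E\left[E[\|P^{*\prime}\eta/n\|^2\,|\,W]\right] = E\left[\sum_ip_i^{*\prime}p_i^*E[\eta_i^2|W]/n^2\right] \le C\frac{1}{n^2}\sum_iE[p_i^{*\prime}p_i^*] = C\,\mathrm{tr}(Q^*)/n = O(\kappa/n)$$

by Assumption A (first ineq.) and equation (A.16) (last eq.). By MK, this implies

$$\|P^{*\prime}\eta/n\| \le O_p(\sqrt{\kappa/n}). \tag{A.25}$$

Hence by TR with (A.22), (A.24), and (A.25),

$$\left\|T_n'\hat P'\eta/n\right\| \le \left\|(\hat PT_n - PT_n)'\eta/n\right\| + \left\|(PT_n - P^*)'\eta/n\right\| + \|P^{*\prime}\eta/n\| \le O_p(\sqrt{\kappa/n}).$$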

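Therefore, the first term of (A.21) becomes

$$\left\|\hat Q^{-1/2}\hat P'\eta/n\right\|^2 = \left(\frac{\eta'\hat PT_n}{n}\right)(T_n'\hat QT_n)^{-1}\left(\frac{T_n'\hat P'\eta}{n}\right) \le O_p(1)\left\|T_n'\hat P'\eta/n\right\|^2 = O_p(\kappa/n) \tag{A.26}$$

by Lemma B.2 and (B.10).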

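Because $I - \hat P(\hat P'\hat P)^{-1}\hat P'$ is a projection matrix, and hence is p.s.d., the second term of (A.21) becomes

$$\begin{aligned}
\left\|\hat Q^{-1/2}\hat P'(h - \hat h)/n\right\|^2 &= (h - \hat h)'\hat P(\hat P'\hat P)^{-1}\hat P'(h - \hat h)/n \le (h - \hat h)'(h - \hat h)/n \\
&= \sum_i(h(w_i) - h(\hat w_i))^2/n = \sum_i(\lambda(v_i) - \lambda(\hat v_i))^2/n \\
&\le C\sum_i|v_i - \hat v_i|^2/n = O_p(\Delta_\pi^2)
\end{aligned} \tag{A.27}$$

by (B.5) (last eq.) and Assumption C (Lipschitz continuity of $\lambda(v)$) (last ineq.). This term arises from the generated regressors $\hat v$ from the first-stage estimation, and hence follows the rate of the first-stage series estimation ($\Delta_\pi^2$). Similarly, the last term is

$$\begin{aligned}
\left\|\hat Q^{-1/2}\hat P'(\hat h - \hat P\beta)/n\right\|^2 &= (\hat h - \hat P\beta)'\hat P(\hat P'\hat P)^{-1}\hat P'(\hat h - \hat P\beta)/n \\
&\le (\hat h - \hat P\beta)'(\hat h - \hat P\beta)/n \\
&= \sum_i(h(\hat w_i) - p^K(\hat w_i)'\beta)^2/n = O_p(K^{-2s/d_x})
\end{aligned} \tag{A.28}$$

by Assumption C†. Therefore, by combining (A.26), (A.27), and (A.28),

$$\|\hat\beta - \beta\| \le O_p(n^\delta)\left[O_p(\sqrt{\kappa/n}) + O_p(\Delta_\pi) + O(K^{-s/d_x})\right].$$

Consequently, since $\kappa \asymp K$,

$$\|\hat h - h_0\|_{L_2} \le O_p(n^\delta)\left[O_p(\sqrt{K/n}) + O(K^{-s/d_x}) + O_p(\Delta_\pi)\right] + O(K^{-s/d_x})$$

and we have the conclusion of the lemma. □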


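Proof of Theorem 5.1: Recall $\hat Q_\tau = (\hat P'\hat P + n\tau_nD_{K^*})/n = \hat Q + \tau_nD_{K^*}$. Define $\hat P^\# = \hat P + n\tau_n\hat P(\hat P'\hat P)^{-1}D_{K^*}$. Note that the penalty bias emerges as $\hat P^\# \ne \hat P$. Consider

$$\begin{aligned}
\|\hat\beta_\tau - \beta\|^2 &= (\hat\beta_\tau - \beta)'(\hat\beta_\tau - \beta) \\
&= (y - \hat P^\#\beta)'\hat P(\hat P'\hat P + n\tau_nD_{K^*})^{-1}(\hat P'\hat P + n\tau_nD_{K^*})^{-1}\hat P'(y - \hat P^\#\beta) \\
&= (y - \hat P^\#\beta)'\hat P\hat Q_\tau^{-1/2}\hat Q_\tau^{-1}\hat Q_\tau^{-1/2}\hat P'(y - \hat P^\#\beta)/n^2 \\
&\le \lambda_{\max}(\hat Q_\tau^{-1})\left\|\hat Q_\tau^{-1/2}\hat P'(y - \hat P^\#\beta)/n\right\|^2.
\end{aligned}$$

Then, first note that by Lemma A.4,

$$\lambda_{\max}(\hat Q_\tau^{-1}) = O_p(\min\{n^{2\delta}, \tau_n^{-1}\}). \tag{A.29}$$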

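Hence $\|\hat\beta_\tau - \beta\| \le O_p(\min\{n^\delta, \tau_n^{-1/2}\})\left\|\hat Q_\tau^{-1/2}\hat P'(y - \hat P^\#\beta)/n\right\|$. But, by TR,

$$\begin{aligned}
\left\|\hat Q_\tau^{-1/2}\hat P'(y - \hat P^\#\beta)/n\right\| &\le \left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| + \left\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P^\#\beta)/n\right\| \\
&= \left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| + \left\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P\beta - n\tau_n\hat P(\hat P'\hat P)^{-1}D_{K^*}\beta)/n\right\| \\
&\le \left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| + \left\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P\beta)/n\right\| + \left\|\tau_n\hat Q_\tau^{-1/2}D_{K^*}\beta\right\|.
\end{aligned}$$

For the first and second terms, note that $c'\hat P'\hat Q_\tau^{-1}\hat Pc \le c'\hat P'\hat Q^{-1}\hat Pc$ for any vector $c$, since $(\hat Q^{-1} - \hat Q_\tau^{-1})$ is p.s.d. Therefore, by (A.26), (A.27), and (A.28) in Lemma A.5, we have $\left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| = O_p(\sqrt{K/n})$ and $\left\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P\beta)/n\right\| = O_p(K^{-s/d_x} + \Delta_\pi)$. The third term (squared) is

$$\left\|\tau_n\hat Q_\tau^{-1/2}D_{K^*}\beta\right\|^2 = \tau_n^2\beta'D_{K^*}\hat Q_\tau^{-1}D_{K^*}\beta \le \tau_n^2\lambda_{\max}(\hat Q_\tau^{-1})\beta'D_{K^*}\beta \le \tau_n^2\lambda_{\max}(\hat Q_\tau^{-1})\sum_{j=K^*+1}^K\beta_j^2.$$

Then, by Assumption C†,

$$\sum_{j=K^*+1}^K\beta_j^2 \le \sum_{j=K^*+1}^\infty Cj^{-2s/d_x-2} \le C\int_{K^*}^\infty x^{-2s/d_x-2}\,dx = CK^{*-2s/d_x-1}.$$

Therefore, $\left\|\tau_n\hat Q_\tau^{-1/2}D_{K^*}\beta\right\| = O_p(\tau_nK^{*-s/d_x-1/2}\min\{n^\delta, \tau_n^{-1/2}\})$. Consequently, analogous to the proof of Lemma A.5,

$$\|\hat h_\tau - h_0\|_{L_2} = O_p\left(R_n\left(\sqrt{K/n} + K^{-s/d_x} + \tau_nK^{*-s/d_x-1/2}R_n + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right).$$

Note that here, the condition under which the penalty bias is dominated by the approximation bias in the penalization-dominating case is the following: $(K^*/K)^{-s/d_x}(\tau_nK^{*-1})^{1/2} \to c$ for $c \in [0, \infty)$, which is guaranteed by Assumption D that $K^* \asymp K$. This proves the first part of the theorem. The conclusion of the second part follows from

$$\|\hat h_\tau(w) - h_0(w)\|_\infty \le \|p^K(w)'\beta - h_0(w)\|_\infty + \|p^K(w)'(\hat\beta_\tau - \beta)\|_\infty \le O(K^{-s/d_x}) + \zeta^v_0(K)\|\hat\beta_\tau - \beta\|.$$

□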
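For concreteness, the penalized coefficient formula used in this proof, (P′P + nτnDK∗)−1P′y, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a univariate power-series basis on a fixed grid rather than the generated regressors, takes DK∗ to be diagonal with zeros in the first K∗ positions and ones afterwards (as in the determinant expression in the proof of Lemma A.4), and the names `solve` and `penalized_series` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def penalized_series(P, y, tau, k_star):
    """beta_tau = (P'P + n*tau*D)^{-1} P'y, penalizing only terms after the first k_star."""
    n, K = len(P), len(P[0])
    G = [[sum(P[i][a] * P[i][b] for i in range(n)) for b in range(K)] for a in range(K)]
    for j in range(k_star, K):
        G[j][j] += n * tau  # adds n * tau * D to P'P
    Py = [sum(P[i][a] * y[i] for i in range(n)) for a in range(K)]
    return solve(G, Py)

# Power-series basis 1, w, w^2, w^3 on a grid; y is exactly quadratic (illustrative data).
ws = [i / 10.0 for i in range(-10, 11)]
P = [[w ** k for k in range(4)] for w in ws]
y = [1.0 + 2.0 * w + 3.0 * w * w for w in ws]
beta_ols = penalized_series(P, y, tau=0.0, k_star=2)
beta_pen = penalized_series(P, y, tau=10.0, k_star=2)
print(beta_ols)
print(beta_pen)
```

With τ = 0 the sketch reduces to the unpenalized series estimator; with τ > 0 the coefficients on the penalized higher-order terms are shrunk toward zero, which is the source of the penalty bias term in the rate.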

Proof of Theorem 5.3: The proof follows directly from the proofs of Theorems 4.2 and

4.3 of NPV (p. 602). As for notations, we use v instead of u of NPV and the other notations

are identical. □

B Supplemental Appendix

This Supplemental Appendix contains the remaining proofs of the lemmas introduced in the

Appendix of the main paper and the proofs of the theorem and corollary in Section 6. It also

briefly discusses the generalization of the analyses in Subsections A.2 and A.3.

B.1 Proofs of Sufficiency of Assumptions B, C, D, and L

Proof that B and L imply B†: For simplicity, assume $\Pr[z \in Z^r(\tilde\Pi)] = 1$ and we prove the result after replacing $Q^{r*}$ with $Q^*$ in Assumption B†(i). Note that $\tilde\Pi(\cdot)$ is piecewise one-to-one. Here, we prove the case where $\tilde\Pi(\cdot)$ is one-to-one. The general cases where $\tilde\Pi(\cdot)$ is piecewise one-to-one or where $0 < \Pr[z \in Z^r(\tilde\Pi)] < 1$ follow by conditioning on $z$ in an appropriate subset of $Z$.

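Consider the change of variables of $u = (z, v)$ into $\tilde u = (\tilde z, \tilde v)$ where $\tilde z = \tilde\Pi(z)$ and $\tilde v = v$. Then, it follows that $p^{*K}(u_i) = [1 \ \vdots\ \tilde z_ip^\kappa(\tilde v_i)' \ \vdots\ p^\kappa(\tilde v_i)']' = \tilde p^K(\tilde u_i)$, where $\tilde p^K(\tilde u_i)$ is one particular form of a vector of approximating functions as specified in NPV (pp. 572–573). Moreover, the joint density of $\tilde u$ is

$$f_{\tilde u}(\tilde z, \tilde v) = f_u(\tilde\Pi^{-1}(\tilde z), \tilde v)\cdot\begin{vmatrix}\dfrac{\partial\tilde\Pi^{-1}(\tilde z)}{\partial\tilde z} & 0 \\ 0 & 1\end{vmatrix} = f_u(\tilde\Pi^{-1}(\tilde z), \tilde v)\cdot\frac{\partial\tilde\Pi^{-1}(\tilde z)}{\partial\tilde z}.$$

Since $\partial\tilde\Pi^{-1}(\tilde z)/\partial\tilde z \ne 0$ by $\tilde\Pi \in C^1(Z)$ (bounded derivative), and $f_u$ is bounded away from zero and the support of $u$ is compact by Assumption B, the support of $\tilde u$ is also compact and $f_{\tilde u}$ is also bounded away from zero. Then, by the proof of Theorem 4 in Newey (1997, p. 167), $\lambda_{\min}(E\,\tilde p^K(\tilde u_i)\tilde p^K(\tilde u_i)')$ is bounded away from zero. Therefore $\lambda_{\min}(Q^*)$ is bounded away from zero for all $\kappa$.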

As for Q1 that does not depend on the effect of weak instruments, the density of z being

bounded away from zero implies that λmin(Q1) is bounded away from zero for all L by Newey

(1997, p. 167). The maximum eigenvalues of Q∗, Q and Q1 are bounded by fixed constants

by the fact that the polynomials or splines are defined on bounded sets. Lastly, note that qjj

is either $E[p_j(x)^2]$ or $E[p_j(v)^2]$. But by Assumption B, the density of $v$ is bounded below, and hence $E[p_j(v)^2]$ is bounded below. Similarly, $E[p_j(x)^2]$ is also bounded below since eventually $x$ converges to $v$ by Assumption L. □

Proof that B and C imply C†: The results follow from Lemma 5.1. □

Proof that D implies D†: It follows from Newey (1997, p. 157) that in the polynomial case

$$\zeta^v_r(K) = K^{1+2r}. \tag{B.1}$$

The same result holds for $\xi_r(L)$. □

B.2 Proofs of Technical Lemmas and Lemma A.3(b)

The following are mathematical lemmas that are useful in proving Lemmas A.3 and A.4 in

the Appendix.

Lemma B.1 For symmetric $k \times k$ matrices $A$ and $B$, let $\lambda_j(A)$ and $\lambda_j(B)$ denote their $j$th eigenvalues such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$. Then, the following inequality holds: for $1 \le j \le k$,

$$|\lambda_j(A) - \lambda_j(B)| \le |\lambda_1(A - B)| \le \|A - B\|.$$

By taking $j = 1$ and $j = k$, note that Lemma B.1 implies $|\lambda_{\max}(A) - \lambda_{\max}(B)| \le \|A - B\|$ and $|\lambda_{\min}(A) - \lambda_{\min}(B)| \le \|A - B\|$, respectively, which will be useful in several proofs below.


Lemma B.2 If a $K(n) \times K(n)$ symmetric random sequence of matrices $A_n$ satisfies $\lambda_{\max}(A_n) = O_p(n^\delta)$, then $\|B_nA_n\| \le \|B_n\|O_p(n^\delta)$ for a given sequence of matrices $B_n$.

Another useful corollary of this lemma is that, for $\lambda_{\max}(A_n) = O_p(n^\delta)$ and sequences of vectors $b_n$ and $c_n$, we have $b_n'A_nc_n \le O_p(n^\delta)b_n'c_n$.

Proof of Lemma B.1: We provide slightly more general results. Firstly, by Weyl (1912),

for symmetric k × k matrices C and D

λi+j−1(C +D) ≤ λi(C) + λj(D), (B.2)

where i+ j − 1 ≤ k. As for the second inequality, we prove

|λj(D)| ≤ ‖D‖ , (B.3)

for 1 ≤ j ≤ k. Note that, for any k × 1 vector a such that ‖a‖ = 1,

(a′Da)2 = tr(a′Daa′Da) = tr(DDaa′aa′) = tr(DDaa′) ≤ tr(DD)tr(aa′) = tr(DD).

Since λj(D) = a′Da for some a with ‖a‖ = 1, taking square root on both sides of the

inequality gives the desired result. Now, in inequalities (B.2) and (B.3), take j = 1, C = B,

and $D = A - B$, and we have the conclusions. □
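Lemma B.1 can be checked numerically on small matrices. The following is a minimal sketch, assuming 2 × 2 symmetric matrices (so the eigenvalues have a closed form) and the Frobenius norm for ‖·‖, which is the norm used in the proof through tr(DD); the function names and the two example matrices are illustrative.

```python
import math

def sym_eigs(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], in descending order."""
    mean, half = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    return mean + half, mean - half

def frob(a, b, d):
    """Frobenius norm of [[a, b], [b, d]]."""
    return math.sqrt(a * a + 2.0 * b * b + d * d)

def eigen_gap_vs_norm(A, B):
    """Largest ordered-eigenvalue gap between A and B, and the norm of A - B."""
    diff = tuple(x - y for x, y in zip(A, B))
    gaps = [abs(la - lb) for la, lb in zip(sym_eigs(*A), sym_eigs(*B))]
    return max(gaps), frob(*diff)

# Each matrix is stored as (a, b, d) for [[a, b], [b, d]].
gap, norm = eigen_gap_vs_norm((2.0, 0.5, 1.0), (1.5, -0.2, 1.2))
print(gap, norm)
```

In this example the largest eigenvalue gap is about 0.61 while the norm of the difference is about 1.13, consistent with the inequality.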

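Proof of Lemma B.2: Let $A_n$ have eigenvalue decomposition $A_n = UDU^{-1}$. Then,

$$\begin{aligned}
\|B_nA_n\|^2 &= \mathrm{tr}(B_nA_nA_nB_n') = \mathrm{tr}(B_nUDU^{-1}UDU^{-1}B_n') = \mathrm{tr}(B_nUD^2U^{-1}B_n') \\
&\le \mathrm{tr}(B_nUU^{-1}B_n')\cdot\lambda_{\max}(A_n)^2 = \|B_n\|^2O_p(n^\delta)^2.
\end{aligned}$$

□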

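Proof of Lemma A.3(b): Recall $\Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}$, $\Delta_Q = \zeta^v_1(\kappa)^2\Delta_\pi^2 + \kappa^{1/2}\zeta^v_1(\kappa)\Delta_\pi$, and $\Delta_Q = \sqrt{\{\zeta^v_1(\kappa)^2 + \zeta^v_0(\kappa)^2\}K/n}$. Let $p_i = p^K(w_i)$ for brevity, whose elements are denoted as $p_k(w_i)$ for $k = 1, \ldots, K$. Consider

$$\begin{aligned}
E\|\tilde Q - Q\|^2 &= E\sum_j\sum_k\left(\left[\frac{\sum_ip_ip_i'}{n} - E[p_ip_i']\right]_{jk}\right)^2 \\
&= \sum_j\sum_kE\left\{\left(\frac{\sum_ip_k(w_i)p_j(w_i)}{n} - E[p_k(w_i)p_j(w_i)]\right)^2\right\} \\
&\le \sum_j\sum_k\frac{1}{n}E[p_k(w_i)^2p_j(w_i)^2] = \frac{1}{n}E[p_i'p_ip_i'p_i]
\end{aligned}$$

by Assumption A (last ineq.). We bound the fourth moment with the bounds of a second moment and the following bound. With $w = (x, v)$,

$$\max_{|\mu|\le r}\sup_{w\in W}\|\partial^\mu p^K(w)\|^2 \le \sup_{x\in X}\|\partial^rp^\kappa(x)\|^2 + \sup_{v\in V}\|\partial^rp^\kappa(v)\|^2 \le \zeta^v_r(\kappa)^2 + \zeta^v_r(\kappa)^2 = 2\zeta^v_r(\kappa)^2.$$

By Assumption B†(ii), the second moment satisfies $E[p_i'p_i] = \mathrm{tr}(Q) \le \mathrm{tr}(I_{2\kappa})\lambda_{\max}(Q) \le C\cdot\kappa$. Hence

$$E\|\tilde Q - Q\|^2 \le \frac{1}{n}E[p_i'p_ip_i'p_i] \le O(\zeta^v_1(\kappa)^2\kappa/n).$$

Therefore, by MK, $\|\tilde Q - Q\|^2 = O_p(\zeta^v_1(\kappa)^2\kappa/n) = O_p(\Delta_Q^2)$. Now, by Lemma B.2,

$$\left\|T_n'(\tilde Q - Q)T_n\right\| \le \lambda_{\max}(T_n)^2\|\tilde Q - Q\| \le O(n^{2\delta})O_p(\Delta_Q) = o_p(1) \tag{B.4}$$

by Assumption D†(i) and (A.20).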

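Let $\hat p_i = p^K(\hat w_i) = [1 \ \vdots\ p^\kappa(x_i) \ \vdots\ p^\kappa(\hat v_i)]$ and by mean value expansion

$$\hat p_i = [1 \ \vdots\ p^\kappa(x_i) \ \vdots\ p^\kappa(v_i) + \partial p^\kappa(\tilde v_i)(\hat v_i - v_i)],$$

where $\tilde v$ is the intermediate value. Since

$$\frac{1}{n}\sum_i|\hat v_i - v_i|^2 = \frac{1}{n}\sum_i\left|\hat\Pi_n(z_i) - \Pi_n(z_i)\right|^2 = O_p(\Delta_\pi^2), \tag{B.5}$$

we have

$$\|\hat p_i - p_i\|^2 = \|p^\kappa(x_i) - p^\kappa(x_i)\|^2 + \|\partial p^\kappa(\tilde v_i)(\hat v_i - v_i)\|^2 \le C\zeta^v_1(\kappa)^2\frac{1}{n}\sum_i|\hat v_i - v_i|^2 \le O_p(\zeta^v_1(\kappa)^2\Delta_\pi^2). \tag{B.6}$$

Also, by MK,

$$\Pr[\|p_i\|^2 > \varepsilon] \le \frac{E\|p_i\|^2}{\varepsilon} = C\cdot\mathrm{tr}(Q) \le C\cdot\mathrm{tr}(I_{2\kappa})\lambda_{\max}(Q) = O(\kappa), \tag{B.7}$$

hence $\sum_{i=1}^n\|p_i\|^2/n = O_p(\kappa)$. Then, by TR, CS and by combining (B.6) and (B.7),

$$\begin{aligned}
\|\hat Q - \tilde Q\| &= \left\|\frac{1}{n}\sum_{i=1}^n(\hat p_i\hat p_i' - p_ip_i')\right\| \le \frac{1}{n}\sum_{i=1}^n\left\|(\hat p_i - p_i)(\hat p_i - p_i)' + p_i(\hat p_i - p_i)' + (\hat p_i - p_i)p_i'\right\| \\
&\le \frac{1}{n}\sum_{i=1}^n\|\hat p_i - p_i\|^2 + 2\frac{1}{n}\sum_{i=1}^n\|p_i\|\|\hat p_i - p_i\| \\
&\le \frac{1}{n}\sum_{i=1}^n\|\hat p_i - p_i\|^2 + 2\left(\frac{1}{n}\sum_{i=1}^n\|p_i\|^2\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n\|\hat p_i - p_i\|^2\right)^{1/2} \\
&= O_p(\zeta^v_1(\kappa)^2\Delta_\pi^2) + O_p(\kappa^{1/2})O_p(\zeta^v_1(\kappa)\Delta_\pi) = O_p(\Delta_Q).
\end{aligned}$$

Thus, by Lemma B.2,

$$\left\|T_n'(\hat Q - \tilde Q)T_n\right\| \le O(n^{2\delta})O_p(\Delta_Q) = o_p(1) \tag{B.8}$$

by Assumption D†(i) and (A.20). Therefore, by combining (B.4) and (B.8) and by TR,

$$\left\|T_n'\hat QT_n - T_n'QT_n\right\| \le \left\|T_n'\hat QT_n - T_n'\tilde QT_n\right\| + \left\|T_n'\tilde QT_n - T_n'QT_n\right\| = o_p(1). \tag{B.9}$$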

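Now by TR, (B.9) and (A.17) give $\left\|T_n'\hat QT_n - Q^*\right\| \le \left\|T_n'\hat QT_n - T_n'QT_n\right\| + \|T_n'QT_n - Q^*\| = o_p(1)$. Also, by Lemma B.1, we have $\left|\lambda_{\min}(T_n'\hat QT_n) - \lambda_{\min}(Q^*)\right| \le \left\|T_n'\hat QT_n - Q^*\right\|$. Combine the results to have $\lambda_{\min}(T_n'\hat QT_n) = \lambda_{\min}(Q^*) + o_p(1)$. But, by Assumption B†(i), we have

$$\lambda_{\min}(T_n'\hat QT_n) \ge c > 0, \tag{B.10}$$

with probability approaching one as $n \to \infty$. Then, similarly to the proof of Lemma A.3(a), we have

$$\lambda_{\max}(\hat Q^{-1}) = \lambda_{\max}(T_n(T_n'\hat QT_n)^{-1}T_n') \le O_p(1)\lambda_{\max}(T_nT_n') = O_p(n^{2\delta}).$$

□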

B.3 Generalization

The general case of $K_1 \ne K_2$ for Subsections A.2 and A.3 can be incorporated by the following argument. Recall that $K = K_1 + K_2 + 1$ and let $\kappa = \min\{K_1, K_2\}$. Note that we have $K \asymp \kappa$. Suppose $K_1 \le K_2$ without loss of generality; then we can rewrite $p^K(w)$ as

$$\begin{aligned}
p^K(w) &= (1, p_1(x), \ldots, p_{K_1}(x), p_1(v), \ldots, p_{K_2}(v))' \\
&= (1, p_1(x), \ldots, p_{K_1}(x), p_1(v), \ldots, p_{K_1}(v), p_{K_1+1}(v), \ldots, p_{K_2}(v))' \\
&= [1 \ \vdots\ p^{K_1}(x)' \ \vdots\ p^{K_1}(v)' \ \vdots\ (p_{K_1+1}(v), \ldots, p_{K_2}(v))']' \\
&= [p^{2K_1+1}(w)' \ \vdots\ d^{K_2-K_1}(v)']',
\end{aligned}$$

where $d^{K_2-K_1}(v) = (p_{K_1+1}(v), \ldots, p_{K_2}(v))'$. Let $p_i = p^{2\kappa+1}(w_i)$ and $d_i = d^{K_2-K_1}(v_i)$. Then,

$$Q = E[p^K(w)p^K(w)'] = \begin{bmatrix}E[p_ip_i'] & E[p_id_i'] \\ E[d_ip_i'] & E[d_id_i']\end{bmatrix}.$$

Note that the partition $E[d_id_i']$ is the usual second moment matrix free from the weak instruments. Therefore, the singularity of $Q$ is determined by the remaining three. Note that $p_i$ is subject to the weak instruments as before and, after mean value expanding it, the usual transformation matrix can be applied. Then, by applying the inverse of a partitioned matrix, the maximum eigenvalue of $Q^{-1}$ can eventually be expressed in terms of the $n^\delta$ rate of Assumption L. A similar argument holds for $K_1 \ge K_2$.

We can also have a more general case of multivariate $x$ for the results of Subsections A.2 and A.3 including the proof of Lemma A.4. Define

$$\partial^\mu p_j(x) = \frac{\partial^{|\mu|}p_j(x)}{\partial x_1^{\mu_1}\partial x_2^{\mu_2}\cdots\partial x_{d_x}^{\mu_{d_x}}},$$

where $|\mu| = \sum_{t=1}^{d_x}\mu_t$, and $\partial^\mu p^K(x) = [\partial^\mu p_1(x), \ldots, \partial^\mu p_K(x)]'$. Then, multivariate mean value expansion can be applied using the derivative $\partial^\mu p_j(x)$.

One can also have a simpler proof by considering the mean value expansion of just one element of $x$ and applying a proof similar to the univariate case. That is, $p_j(v) = p_j(x - \Pi_n(z)) = p_j(x) - \Pi_{m,n}(z)\,\partial p_j(\tilde x)/\partial x_m$, where $\tilde x$ is an intermediate value and $\Pi_{m,n}(z)$ is the $m$-th element of $\Pi_n(z)$.

B.4 Proof of Asymptotic Normality (Section 6)

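Assumption G in Section 6 implies the following technical assumption. Note that $\zeta_r(K) \le \zeta^v_r(K)$ by definition.

Assumption G† The following converge to zero as $n \to \infty$: $\sqrt{n}K^{-s/d_x}$, $\sqrt{n}L^{-s_\pi/d_z}$, $\sqrt{n}\tau_nn^\delta$, $\zeta^v_0(K)L/\sqrt{n}$, $n^\delta L^{1/2}\{L\zeta^v_1(\kappa) + K^{1/2}\xi(L)\}/\sqrt{n}$, $n^{3\delta}K\zeta^v_1(K)L^{1/2}/\sqrt{n}$, $n^{4\delta}\{\zeta^v_0(K)^2K + \xi(L)^2L\}/n$, and $n^\delta\zeta^v_0(K)L^{1/2}\zeta^v_1(K)(K + L)^{1/2}/\sqrt{n}$.

Proof that Assumption G implies Assumption G†: By (B.1), $\zeta^v_r(K) = K^{1+2r}$ and similarly, $\xi(L) = L$. Therefore,

$$\begin{aligned}
\zeta^v_0(K)L/\sqrt{n} &= n^{-1/2}KL, \\
n^\delta L^{1/2}\{L\zeta^v_1(\kappa) + K^{1/2}\xi(L)\}/\sqrt{n} &= n^{\delta-\frac{1}{2}}L^{1/2}\{LK^3 + K^{1/2}L\} = n^{\delta-\frac{1}{2}}L^{3/2}\{K^3 + K^{1/2}\} \le Cn^{\delta-\frac{1}{2}}K^3L^{3/2}, \\
n^{3\delta}K\zeta^v_1(K)L^{1/2}/\sqrt{n} &= n^{3(\delta-\frac{1}{6})}K^4L^{1/2}, \\
n^{4\delta}\{\zeta^v_0(K)^2K + \xi(L)^2L\}/n &= n^{4(\delta-\frac{1}{4})}\{K^3 + L^3\}, \\
n^\delta\zeta^v_0(K)L^{1/2}\zeta^v_1(K)(K + L)^{1/2}/\sqrt{n} &\le n^{\delta-\frac{1}{2}}K^4L^{1/2}(K + L)^{1/2}.
\end{aligned}$$

Then, the results of Assumption G† follow. □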
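The algebra in this proof can be checked numerically. The sketch below is an illustration under stated assumptions: it evaluates two of the Assumption G† sequences using the polynomial-case bound (B.1) and ξ(L) = L, and compares them with the closed forms in the display; the function names and the values of n, K, L, and δ are hypothetical.

```python
import math

def zeta(K, r):
    """Bound (B.1) for the polynomial case: zeta_r(K) = K^(1 + 2r)."""
    return float(K) ** (1 + 2 * r)

def g_dagger_terms(n, K, L, delta):
    """Two of the Assumption G-dagger sequences, taking xi(L) = L."""
    root_n = math.sqrt(n)
    third = n ** (3 * delta) * K * zeta(K, 1) * math.sqrt(L) / root_n
    fourth = n ** (4 * delta) * (zeta(K, 0) ** 2 * K + L ** 2 * L) / n
    return third, fourth

n, K, L, delta = 10 ** 6, 4, 3, 0.1
third, fourth = g_dagger_terms(n, K, L, delta)
print(third, n ** (3 * (delta - 1 / 6)) * K ** 4 * math.sqrt(L))
print(fourth, n ** (4 * (delta - 1 / 4)) * (K ** 3 + L ** 3))
```

Written this way, the exponents make the trade-off visible: for fixed K and L, the first of these sequences shrinks with n only when δ < 1/6 and the second only when δ < 1/4.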

Preliminary derivations for the proof of Theorem 6.1 : Define

∆Q = ∆Q + ∆Q, ∆Qτ = ∆Q + τn√K, ∆Q1 = ξ(L)L1/2/

√n

∆H = L1/2ζv1 (κ)∆π +K1/2ξ(L)/√n, ∆h = Rn(

√K/n+K−s/dx + ∆π).

First, n3δK1/2∆Q → 0 implies n2δ∆Q → 0. Also note that, by√nK−s/dx → 0 and

√nL−sπ/dz → 0, we have ∆π = (L1/2/

√n)(1 + L−1/2

√nL−sπ/dz) = O(L1/2/

√n) and ∆h =

Rn(√K/n +

√L/n). Therefore, Assumption G† implies

√nζv0 (K)∆2

π = O(ζv0 (K)L/√n) =


o(1). Also

nδL1/2∆H = nδL1/2{L1/2ζv1 (κ)∆π +K1/2ξ(L)/

√n}

= O(nδL1/2{L1/2ζv1 (κ)L1/2/

√n+K1/2ξ(L)/

√n}

) = o(1)

n3δκζv1 (κ)∆π = Op(n3δKζv1 (K)L1/2/

√n) = op(1)

by G†. These results imply nδζv1 (K)2∆2π = O(nδζv1 (K)2L/n) = o(1) and also

n3δκ1/2∆Q = n3δκ1/2{ζv1 (κ)2∆2

π + κ1/2ζv1 (κ)∆π

}→ 0

n3δκ1/2∆Q = n3δκ1/2√{ζv1 (κ)2 + ζv0 (κ)2}K/n ≤ Cn3δκζv1 (κ)/

√n→ 0

since n3δκ1/2ζv1 (κ)2∆2π → 0 and n3δκζv1 (κ)∆π → 0. Also, since

√nτnn

δ → 0 and nδK/√n→ 0

imply n2δKτn =√nτnn

δ · nδK/√n → 0, consequently, n2δκ1/2∆Qτ → 0. Also G† assumes

O(n4δ−1 {ζv0 (K)2K + ξ(L)2L}

)= o(1). And L1/2∆Q1 → 0 is also implied by G†. Also, since

√nτnn

δ = o(1) implies Rn = nδ,

ζv0 (K)L1/2ζv1 (K)∆h = nδζv0 (K)L1/2ζv1 (K)(K + L)1/2/√n→ 0

by G†, which in turn gives ζv0 (K)∆h → 0. Then, we also have n2δ {∆Q + ζv0 (K)2K/n} → 0.

Lastly, $\sqrt{n}\tau_nn^\delta \to 0$ implies $n^{2\delta}\tau_n \to 0$ since $n^\delta/\sqrt{n} \to 0$, and hence $R_n = n^\delta$. □

Proof of Theorem 6.1: The proofs of Theorem 6.1 and Corollary 6.2 are a modification

of the proof of Theorem 5.1 in NPV (with their trimming function being an identity function)

taking into account that instruments are weak. We use the components established in the

proof of the convergence rate, which are distinct from NPV. The rest of the notations are

the same as those of NPV. Let “MVE” abbreviate mean value expansion.

Let X be a vector of variables that includes x and z, and let w(X, π) be a vector of functions of X and π. Note that w(·, ·) is a vector of transformation functions of regressors and, trivially, w(X, π) = (x, x − Πn(z)). Recall

Vτ = AQ−1τ

(Σ + HQ−1

1 Σ1Q−11 H ′

)Q−1τ A′,

Hτ =∑

pi

{[∂hτ (wi)/∂w

]′∂w(Xi, Π(zi))/∂π

}r′i/n,

and Q = P ′P /n, Qτ = Q + τnI, Q = E [pip′i], Q1 = R′R/n, Στ =

∑pip′i

[yi − hτ (wi)

]2

/n,


and Σ1 =∑v2i rir

′i/n where ri = rL(zi). Then, define

V = AQ−1(Σ +HQ−1

1 Σ1Q−11 H ′

)Q−1A′,

Σ = E [pip′ivar(yi|Xi)] , H = E

[pi{

[∂h(wi)/∂w]′ ∂w(Xi,Πn(zi))/∂π}r′i]

,

where V does not depend on τ . Note that H is a channel through which the first-stage

estimation error kicks into the variance of the estimator of h(·) in the outcome equation.

We first prove√nV −1/2(θτ − θ0)→d N(0, 1).

For notational simplicity, let F = V −1/2. Let h = (h(w1), ..., h(wn))′ and h = (h(w1), ..., h(wn))′.

Also let ηi = yi−h0(wi) and η = (η1, ..., ηn)′. Let Πn = (Πn(z1), ...,Πn(zn))′, vi = xi−Πn(zi),

and U = (v1, ..., vn)′. Once we prove that ‖FAQ−1‖ = O(nδ),√nF[a(pK′β)− a(h0)

]=

op(1),√nFA(P ′P + nτnI)−1P ′(h − P#β) = op(1), FAQ−1

τ P ′(h − h)/√n = FAQ−1 ×

HR′U/√n+ op(1), and FAQ−1

τ P ′η/√n = FAQ−1P ′η/

√n+ op(1) below, then we will have,

√nV −1/2

(θτ − θ0

)=√nF(a(hτ )− a(h0)

)=√nF(a(pK′βτ )− a(pK′β) + a(pK′β)− a(h0)

)=√nFAβτ −

√nFAβ + op(1)

=√nFA(P ′P + nτnI)−1P ′(h+ η)−

√nFA(P ′P + nτnI)−1P ′h

+√nFA(P ′P + nτnI)−1P ′(h− P#β) + op(1)

= FAQ−1τ P ′η/

√n− FAQ−1

τ P ′(h− h)/√n+ op(1)

= FAQ−1(P ′η/√n+HR′U/

√n) + op(1). (B.11)

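Then, for any vector $\phi$ with $\|\phi\| = 1$, let $Z_{in} = \phi'FAQ^{-1}[p_i\eta_i + Hr_iu_i]/\sqrt{n}$. Note that $Z_{in}$ is i.i.d. for each $n$. Also $EZ_{in} = 0$ and $\mathrm{var}(Z_{in}) = 1/n$ (recall $\Sigma = E[p_ip_i'\mathrm{var}(y_i|X_i)]$). Furthermore, $\|FAQ^{-1}\| = O(n^\delta)$ and $\|FAQ^{-1}H\| \le C\|FAQ^{-1}\| = O(n^\delta)$ by $CI - HH'$ being p.s.d., so that, for any $\varepsilon > 0$,

$$\begin{aligned}
nE\left[1\{|Z_{in}| > \varepsilon\}Z_{in}^2\right] &= n\varepsilon^2E\left[1\{|Z_{in}/\varepsilon| > 1\}(Z_{in}/\varepsilon)^2\right] \le n\varepsilon^2E\left[(Z_{in}/\varepsilon)^4\right] \\
&\le \frac{n\varepsilon^2}{n^2\varepsilon^4}\|\phi\|^4\|FAQ^{-1}\|^4\left\{\|p_i\|^2E\left[\|p_i\|^2E[\eta_i^4|X_i]\right] + \|r_i\|^2E\left[\|r_i\|^2E[u_i^4|z_i]\right]\right\} \\
&\le CO(n^{4\delta})\{\zeta^v_0(K)^2E\|p_i\|^2 + \xi(L)^2E\|r_i\|^2\}/n \\
&\le CO(n^{4\delta})\{\zeta^v_0(K)^2\mathrm{tr}(Q) + \xi(L)^2\mathrm{tr}(Q_1)\}/n \\
&\le O(n^{4\delta-1}\{\zeta^v_0(K)^2K + \xi(L)^2L\}) = o(1)
\end{aligned}$$

by G†. Then, $\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0, 1)$ by the Lindeberg-Feller theorem and (B.11).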
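The first inequality in the Lindeberg display above rests on an elementary truncation bound, spelled out here for completeness. This is an added remark written in the notation of the proof, not a step quoted from the paper:

```latex
\mathbf{1}\{|Z_{in}/\varepsilon| > 1\}\,(Z_{in}/\varepsilon)^2
  \;\le\; (Z_{in}/\varepsilon)^4,
\qquad\text{so}\qquad
n E\bigl[\mathbf{1}\{|Z_{in}| > \varepsilon\} Z_{in}^2\bigr]
  \;\le\; n\varepsilon^2 E\bigl[(Z_{in}/\varepsilon)^4\bigr]
  \;=\; \frac{n}{\varepsilon^2}\,E\bigl[Z_{in}^4\bigr].
```

Because each Zin carries a factor 1/√n, its fourth moment carries a factor n−2, which is why the resulting bound is of order n−1 times fourth-moment terms of the summands.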

Now, we proceed with detailed proofs. For simplicity as before, the remainder of the proof will be given for the scalar $\Pi(z)$ case. Under Assumption F and by CS, $|a(h_K)| = |A\beta_K| \le \|A\|\|\beta_K\| = C\|A\|(Eh_K(w)^2)^{1/2}$ so $\|A\| \to \infty$. Since $\lambda_{\min}(Q^{-1})$ is bounded away from zero, $CI = \lambda_{\min}(Q^{-1})I \le Q^{-1}$. And also since $\sigma^2(X) = \mathrm{var}(y|X)$ is bounded away from zero by Assumption E, we have $\Sigma \ge CQ$. Hence

$$V \ge AQ^{-1}\Sigma Q^{-1}A' \ge CAQ^{-1}QQ^{-1}A' \ge CAQ^{-1}A' \ge C\|A\|^2. \tag{B.12}$$

Therefore, it follows that $F$ is bounded.

Also, by the previous proofs on the convergence rate,∥∥∥Qτ −Q∥∥∥ ≤ ∥∥∥Q−Q+ τnI

∥∥∥ ≤ ∥∥∥Q− Q∥∥∥+∥∥∥Q−Q∥∥∥+ ‖τnI‖

≤ Op(∆Q) +Op(∆Q) +O(τn√K) = Op(∆Qτ )

and, by letting Q1 = I,∥∥∥Q1 − I

∥∥∥ = Op(ξ(L)L1/2/√n) = Op(∆Q1). Furthermore similarly to

the proofs above, with H =∑pid(Xi)r

′i/n,

∥∥H −H∥∥ = Op(∆H) = op(1) as in NPV, where

∆H =[L1/2ζ1(K) + ζ0(K)ξ(L)2

]∆π +K1/2ξ(L)/

√n. Now, by (B.12)

∥∥FAQ−1/2∥∥2

= tr(FAQ−1A′F ) ≤ tr(CFV F ) = C.

By Assumption G†, $R_n = n^\delta$ and hence by (A.29) and Lemmas A.3 and A.4, $\lambda_{\max}(\hat Q_\tau^{-1}) = O_p(n^{2\delta})$ and $\lambda_{\max}(Q^{-1}) = O(n^{2\delta})$. Then,

∥∥FAQ−1∥∥ =

∥∥FAQ−1/2Q−1/2∥∥ ≤ λmax(Q−1)1/2

∥∥FAQ−1/2∥∥ ≤ CO(nδ).

Note that for any matrix B,∥∥∥BQ−1

τ

∥∥∥ ≤ ∥∥∥BQ−1∥∥∥, since (Q−1 − Q−1

τ ) is p.s.d. Hence

∥∥∥FA′Q−1τ

∥∥∥ ≤ ∥∥∥FA′Q−1∥∥∥ ≤ ∥∥FAQ−1

∥∥+∥∥∥FAQ−1

(Q−Q

)Q−1

∥∥∥≤ CO(nδ) + CO(nδ)Op(n

2δ)∥∥∥Q−Q∥∥∥

= O(nδ) +Op(n3δ∆Q) = O(nδ) + op(1) = Op(n

δ)

by Assumption G†. Also∥∥∥FAQ−1/2τ

∥∥∥2

≤∥∥∥FA′Q−1/2

∥∥∥2

≤∥∥FAQ−1/2

∥∥2+ tr(FAQ−1

(Q−Q

)Q−1A′F )

≤ C +∥∥FA′Q−1

∥∥∥∥∥Q−Q∥∥∥∥∥∥FA′Q−1∥∥∥ ≤ O(nδ)Op(∆Q)O(nδ) = op(1).

Firstly, by Assumptions C† and G†,

∥∥√nF [a(pK′β)− a(h0)]∥∥ =

∥∥√nF [a(pK′β − h0)]∥∥ ≤ √n |F | sup

w

∣∣pK(w)′β − h0(w)∣∣

≤ C√nK−s/dx = op(1).

Secondly,∥∥∥FAQ−1τ P ′(h− P#β)/

√n∥∥∥ ≤ ∥∥∥FAQ−1

τ P ′/√n∥∥∥√n sup

W

∣∣∣pK(w)′β − h0(w)∣∣∣

+∥∥∥nτnFAQ−1

τ β/√n∥∥∥

≤∥∥∥FAQ−1/2

τ

∥∥∥√nO(K−s/dx) +√nτn

∥∥∥FAQ−1τ

∥∥∥ ‖β‖≤ op(1)O(

√nK−s/dx) +Op(

√nτnn

δ) = op(1)

by G†. Thirdly, let γ = (γ1, ..., γL)′ and di = d(Xi). By a second order MVE of each h(wi)


around wi

FAQ−1τ P ′(h− h)/

√n = FAQ−1

τ

∑i

pidi

[Π(zi)− Πn(zi)

]/√n+ ρ

= FAQ−1τ HQ−1

1 R′U/√n+ FAQ−1

τ HQ−11 R′(Πn −R′γ)/

√n

+ FAQ−1τ

∑i

pidi [r′iγ − Πn(zi)] /

√n+ ρ.

But ‖ρ‖ ≤ C√n∥∥∥FAQ−1/2

τ

∥∥∥ ζv0 (K)∑

i

∥∥∥Π(zi)− Πn(zi)∥∥∥2

/n = op(1)Op(√nζv0 (K)∆2

π) = op(1).

Also, by di being bounded and nHQ−11 H ′ being equal to the matrix sum of squares from the

multivariate regression of pidi on ri, HQ−11 H ′ ≤

∑i pip

′id

2i /n ≤ CQ ≤ CQτ . Therefore,∥∥∥FAQ−1

τ HQ−11 R′(Πn −R′γ)/

√n∥∥∥ ≤ ∥∥∥FAQ−1

τ HQ−11 R′/

√n∥∥∥√n sup

Z

∣∣Πn(z)− rL(z)′γ∣∣

≤{tr(FAQ−1

τ HQ−11 Q1Q

−11 H ′Q−1

τ A′F ′)}1/2

O(√nL−sπ/dz)

≤ C{tr(FAQ−1

τ Qτ Q−1τ A′F ′

)}1/2

O(√nL−sπ/dz)

≤ C∥∥∥FAQ−1/2

τ

∥∥∥O(√nL−sπ/dz) = op(1)Op(

√nL−sπ/dz) = op(1)

by G†. Similarly,∥∥∥∥∥FAQ−1τ

∑i

pidi[r′iγ − Πn(zi)]/

√n

∥∥∥∥∥ ≤ C∥∥∥FAQ−1/2

τ

∥∥∥O(√nL−sπ/dz) = op(1)Op(

√nL−sπ/dz) = op(1).

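Next, we consider the term $FA\hat Q_\tau^{-1}H\hat Q_1^{-1}R'U/\sqrt{n}$. Note that $E\|R'U/\sqrt{n}\|^2 = \mathrm{tr}(\Sigma_1) \le C\,\mathrm{tr}(I_L) \le L$ by $E[u^2|z]$ bounded, so by MK, $\|R'U/\sqrt{n}\| = O_p(L^{1/2})$. Also, we have

$$\left\|FA\hat Q_\tau^{-1}H\hat Q_1^{-1}\right\| \le O_p(1)\left\|FA\hat Q_\tau^{-1/2}\right\| = o_p(1).$$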

Therefore ∥∥∥FAQ−1τ H(Q−1

1 − I)R′U/√n∥∥∥ ≤ ∥∥∥FAQ−1

τ HQ−11

∥∥∥∥∥∥Q1 − I∥∥∥∥∥R′U/√n∥∥

= op(1)Op(∆Q1)Op(L1/2) = op(1)


by G†. Similarly,∥∥∥FAQ−1τ (H −H)R′U/

√n∥∥∥ ≤ ∥∥∥FAQ−1

τ

∥∥∥∥∥H −H∥∥∥∥R′U/√n∥∥= Op(n

δ)Op(∆H)Op(L1/2) = op(1)

by G†. Note that HH ′ is the population matrix mean-square of the regression of pidi on

ri so that HH ′ ≤ C, it follows that E ‖HR′U/√n‖2

= tr(HΣ1H′) ≤ CK and therefore,

‖HR′U/√n‖ = Op(K

1/2). And Then,∥∥∥FA(Q−1τ −Q−1)HR′U/

√n∥∥∥ ≤ λmax(Q−1

τ )∥∥FAQ−1

∥∥∥∥∥Q− Qτ

∥∥∥∥∥HR′U/√n∥∥= O(n2δ)O(nδ)Op(∆Qτ )Op(K

1/2) = op(1).

Combining the results above and by TR, FAQ−1τ P ′(h− h)/

√n = FAQ−1HR′U/

√n+ op(1).

Lastly, similar to (A.23) in the convergence rate part,∥∥∥Q−1/2τ (P − P )′η/

√n∥∥∥ = Op(n

δζv1 (K)2∆2π) = op(1)

by G† (and by (A.6) of NPV), which implies∥∥∥FAQ−1τ (P − P )′η/

√n∥∥∥ ≤ ∥∥∥FAQ−1/2

τ

∥∥∥∥∥∥Q−1/2τ (P − P )′η/

√n∥∥∥ = op(1)op(1) = op(1).

Also, by E[η|X] = 0,

E

[∥∥∥FA(Q−1τ −Q−1)P ′η/

√n∥∥∥2

|Xn

]≤ tr

(FA(Q−1

τ −Q−1)[∑

pip′ivar(yi|Xi)/n

](Q−1

τ −Q−1)A′F)

≤ Ctr(FA(Q−1

τ −Q−1)Qτ (Q−1τ −Q−1)A′F

)= Ctr

(FAQ−1(Qτ −Q)Q−1

τ (Qτ −Q)Q−1A′F)

≤ Op(n2δ)∥∥FAQ−1

∥∥2∥∥∥Qτ −Q

∥∥∥2

≤ Op(n2δ∆Qτ )

2 = op(1)

by G†. Combining all of the previous results and by TR,

√nF[a(hτ )− a(h0)

]= FAQ−1(P ′η/

√n+HR′U/

√n) + op(1).

Next, recall V = AQ−1(Σ +HQ−1

1 Σ1Q−11 H ′

)Q−1A′. Note CI −HΣ1H

′ is p.s.d (that is

HΣ1H′ ≤ CI using Q1 = I) and since var(y|X) is bounded by Assumption E, Σ ≤ CQ.


Therefore,

V = AQ−1(Σ +HQ−1

1 Σ1Q−11 H ′

)Q−1A′ = AQ−1 (Σ +HΣ1H

′)Q−1A′

≤ AQ−1 (CQ+ CI)Q−1A′ = C∥∥AQ−1/2

∥∥2+ C

∥∥AQ−1∥∥2

.

Note that it is reasonable that the first-stage estimation error is not cancelled out with $Q^{-1}$, since the first-stage regressors do not suffer from multicollinearity. Recall $a(p^{K\prime}\beta) = A\beta$ so

a(pK′A′) = AA′. And ‖A‖2 ≤∣∣a(pK′A′)

∣∣ ≤ ∣∣ApK∣∣r≤ ζr(K) ‖A‖ so ‖A‖ ≤ ζr(K). Hence∥∥AQ−1/2

∥∥2= AQ−1A′ ≤ C ‖A‖2 = O(ζr(K)2). Likewise, ‖AQ−1‖2

= O(nδζr(K)2). Thus,

V = O(ζr(K)2) +O(n2δζr(K)2) = O(n2δζr(K)2). Therefore,

θτ − θ0 = Op(V1/2/√n) = Op(n

δζr(K)/√n) = Op(n

δ− 12 ζr(K)) ≤ Op(n

δ− 12 ζvr (K)).

This result for scalar a(h) covers the case of Assumption F.

Now we can prove√nV −1/2

τ (θτ − θ0)→d N(0, 1)

by showing∣∣∣FVτF − 1

∣∣∣→p 0. Then, V −1Vτ →p 1, so that√nV−1/2τ (θτ − θ0) =

√nV −1/2(θ−

θ0)/(V −1Vτ )1/2 →d N(0, 1).

The rest of the proof follows directly from the relevant part of the proof of NPV (pp. 600–601), except that in our case Q ≠ I because of weak instruments. Therefore the following replaces the corresponding parts in the proof: For any matrix B, we have

‖BΣ‖ ≤ C ‖BQ‖ by Σ ≤ CQ. Therefore,∥∥∥FA(Q−1τ ΣQ−1

τ −Q−1ΣQ−1)A′F ′∥∥∥

≤∥∥∥FA(Q−1

τ −Q−1)ΣQ−1τ A′F ′

∥∥∥+∥∥∥FAQ−1Σ(Q−1

τ −Q−1)A′F ′∥∥∥

≤∥∥∥FAQ−1

τ (Q− Qτ )Q−1ΣQ−1

τ A′F ′∥∥∥+

∥∥∥FAQ−1ΣQ−1(Q− Qτ )Q−1τ A′F ′

∥∥∥≤∥∥∥FAQ−1

τ

∥∥∥2 ∥∥∥(Q− Qτ )Q−1Σ

∥∥∥+∥∥FAQ−1

∥∥∥∥∥ΣQ−1(Q− Qτ )∥∥∥∥∥∥FAQ−1

τ

∥∥∥≤ C

∥∥∥FAQ−1τ

∥∥∥2 ∥∥∥(Q− Qτ )Q−1Q

∥∥∥+ C∥∥FAQ−1

∥∥∥∥∥QQ−1(Q− Qτ )∥∥∥∥∥∥FAQ−1

τ

∥∥∥≤ Op(n

2δ)Op(∆Qτ ) +Op(n2δ)Op(∆Qτ ) = op(1)

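by Assumption G†. Also note that in our proof, $\hat Q_\tau$ is introduced by penalization, but the treatment is the same as above. Specifically, one can apply

$$
\left\|B\hat Q_\tau^{-1}C\hat Q_\tau^{-1}B\right\| \le \left\|B\hat Q^{-1}C\hat Q^{-1}B\right\|
$$

for any matrices $B$ and $C$ of corresponding orders. Also, recall that $\zeta_r(K) \le \zeta_r^v(K)$ and that $\Delta_h$ and $\Delta_Q$ are redefined in this paper. That is, since $\zeta_0^v(K)\Delta_h$, $n^{2\delta}\{\Delta_Q + \zeta_0^v(K)^2K/n\}$, and $\zeta_0^v(K)L^{1/2}\zeta_1^v(K)\Delta_h$ converge to zero, we can prove the following:

$$
\left\|FA\hat Q_\tau^{-1}(\hat\Sigma - \tilde\Sigma)\hat Q_\tau^{-1}A'F'\right\| \le C\operatorname{tr}(D)\max_{i \le n}\left|\hat h_i - h_i\right| \le O_p(1)O_p(\zeta_0^v(K)\Delta_h) = o_p(1),
$$

$$
\left\|FA\hat Q_\tau^{-1}(\tilde\Sigma - \Sigma)\hat Q_\tau^{-1}A'F'\right\| \le \left\|FA\hat Q_\tau^{-1}\right\|^2\left\|\tilde\Sigma - \Sigma\right\| \le O_p(n^{2\delta})O_p\!\left(\Delta_Q + \zeta_0^v(K)^2K/n\right) = o_p(1),
$$

$$
\left\|\hat H - H\right\| \le C\left(\sum_{i=1}^n \|\hat p_i\|^2\|r_i\|^2/n\right)^{1/2}\left(\sum_{i=1}^n \left|\hat d_i - d_i\right|^2/n\right)^{1/2} = O_p\!\left(\zeta_0^v(K)L^{1/2}\right)O_p\!\left(\zeta_1^v(K)\Delta_h\right) = o_p(1).
$$

The rest of the proof thus follows. □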

Proof of Corollary 6.2: Now instead, suppose Assumption H is satisfied. For the functional $c'a(g)$, this assumption is satisfied with $\nu(w)$ replaced by $c'\nu(w)$. Therefore, it suffices to prove the result for scalar $\nu(w)$. Then, the $1 \times K$ vector $A$ is $A = a(p^K) = E[\nu(w_i)p^K(w_i)']$. Let $\nu_K(w) = AQ^{-1}p^K(w)$, which is (the transpose of) the mean square projection of $\nu(\cdot)$ on the approximating functions $p^K(w)$. Note that $Q$ is nearly singular with weak instruments. Therefore, define $A^* = a(p^{*K}) = E[\nu(w_i)p^{*K}(u_i)']$. Also, let $\nu_K^*(u) = A^*Q^{*-1}p^{*K}(u)$, which is (the transpose of) the mean square projection of $\nu(\cdot)$ on the approximating functions $p^{*K}(u)$. Then, $E\|\nu - \nu_K^*\|^2 \le E\|\nu - \alpha_K'p^{*K}\|^2 \to 0$ by the assumption. But,

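$$
\begin{aligned}
\nu_K &= E[\nu(w_i)p^K(w_i)']Q^{-1}p^K(w) = E[\nu(w_i)p^K(w_i)']T_n(T_n'QT_n)^{-1}T_n'p^K(w) \\
&= E[\nu(w_i)p^{*K}(u_i)'](T_n'QT_n)^{-1}p^{*K}(u).
\end{aligned}
$$

Let $R^* = T_n'QT_n - Q^* = E[m_ip_i^{*\prime}] + E[p_i^*m_i'] + E[m_im_i']$. Then, we have

$$
\begin{aligned}
E\|\nu_K^* - \nu_K\|^2 &= E\left\|A^*\left[Q^{*-1} - (T_n'QT_n)^{-1}\right]p^{*K}(u_i)\right\|^2 \\
&= E\left\|A^*Q^{*-1}R^*(T_n'QT_n)^{-1}p^{*K}(u_i)\right\|^2 \\
&\le C\|A^*\|^2\|R^*\|^2E\left\|p^{*K}(u_i)\right\|^2 \le C\|A^*\|^2\left(2E\|m_i\|\|p_i^*\| + E\|m_i\|^2\right)^2.
\end{aligned}
$$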

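But by CS and Assumptions B and H,

$$
\left|a(p_j^*)\right| = \left|E[\nu(w_i)p_j^*(u_i)]\right| \le \left(E[\nu(w_i)^2]E[p_j^*(u_i)^2]\right)^{1/2} < \infty,
$$

and therefore $\|A^*\|^2 = \|a(p^{*K\prime})\|^2 = O(K)$. Then, by using the previous results (A.15) and (A.16), and by Assumption D ($n^{-\delta}K^6 \to 0$), which implies that $n^{-\delta}\zeta_2^v(\kappa)\kappa^{1/2}K^{1/2} = n^{-\delta}\zeta_2^v(K)K$ and $n^{-2\delta}\zeta_2^v(\kappa)^2K^{1/2} = n^{-2\delta}\zeta_2^v(K)^2K^{1/2}$ converge to zero, we conclude that $E\|\nu_K^* - \nu_K\|^2 \to 0$. Given

$$
\left(E\|\nu - \nu_K\|^2\right)^{1/2} \le \left(E\|\nu - \nu_K^*\|^2\right)^{1/2} + \left(E\|\nu_K^* - \nu_K\|^2\right)^{1/2},
$$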

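we have the desired result that $E\|\nu - \nu_K\|^2 \to 0$. Let $d(X) = [\partial h(w)/\partial w]'\,\partial w(X, \Pi_0(z))/\partial\pi$, $b_{KL}(z) = E[d(X_i)\nu_K(w_i)r^L(z_i)']\,r^L(z)$, and $b_L(z) = E[d(X_i)\nu(w_i)r^L(z_i)']\,r^L(z)$. Then,

$$
E\left[\|b_{KL}(z_i) - b_L(z_i)\|^2\right] \le E\left[d(X_i)^2\|\nu_K(w_i) - \nu(w_i)\|^2\right] \le CE\left[\|\nu_K(w_i) - \nu(w_i)\|^2\right] \to 0
$$

as $K \to \infty$. Furthermore, by Assumption E, $E[\|b_L(z_i) - \rho(z_i)\|^2] \to 0$ as $L \to \infty$, where $\rho(z)$ is a matrix of projections of the elements of $\nu(w)d(X)$ on $\mathcal{L}$, the set of limit points of $r^L(z)'\gamma_L$. Therefore (as in (A.10) of NPV), by Assumption E,

$$
V = E\left[\nu_K(w_i)\nu_K(w_i)'\sigma^2(X_i)\right] + E\left[b_{KL}(z_i)\operatorname{var}(x_i|z_i)b_{KL}(z_i)'\right] \to \bar V,
$$

where $\bar V = E[\nu(w_i)\nu(w_i)'\sigma^2(X_i)] + E[\rho(z_i)\operatorname{var}(x_i|z_i)\rho(z_i)']$. This shows that $F$ is bounded. Then, given $\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0, 1)$ from the proof of Theorem 6.1, the conclusion follows from $F^{-1} \to \bar V^{1/2}$, so that $F^{-1}\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0, \bar V)$. □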

References

Ai, C. and X. Chen (2003): “Efficient estimation of models with conditional moment restrictions containing unknown functions,” Econometrica, 71, 1795–1843.

Amemiya, T. (1977): “The maximum likelihood and the nonlinear three-stage least squares estimator in the general nonlinear simultaneous equation model,” Econometrica, 955–968.

Andrews, D. W. K. and X. Cheng (2012): “Estimation and inference with weak, semi-strong, and strong identification,” Econometrica, 80, 2153–2211.

Andrews, D. W. K. and J. H. Stock (2007): “Inference with weak instruments,” in Advances in Econometrics: Proceedings of the Ninth World Congress of the Econometric Society.

Andrews, D. W. K. and Y.-J. Whang (1990): “Additive interactive regression models: circumvention of the curse of dimensionality,” Econometric Theory, 6, 466–479.

Angrist, J. D. and A. B. Krueger (1991): “Does compulsory school attendance affect schooling and earnings?” The Quarterly Journal of Economics, 106, 979–1014.

Angrist, J. D. and V. Lavy (1999): “Using Maimonides’ rule to estimate the effect of class size on scholastic achievement,” The Quarterly Journal of Economics, 114, 533–575.

Arlot, S. and A. Celisse (2010): “A survey of cross-validation procedures for model selection,” Statistics Surveys, 4, 40–79.

Blundell, R., M. Browning, and I. Crawford (2008): “Best nonparametric bounds on demand responses,” Econometrica, 76, 1227–1262.

Blundell, R., X. Chen, and D. Kristensen (2007): “Semi-nonparametric IV estimation of shape-invariant Engel curves,” Econometrica, 75, 1613–1669.

Blundell, R. and A. Duncan (1998): “Kernel regression in empirical microeconomics,” Journal of Human Resources, 33, 62–87.

Blundell, R., A. Duncan, and K. Pendakur (1998): “Semiparametric estimation and consumer demand,” Journal of Applied Econometrics, 13, 435–461.

Blundell, R. and J. L. Powell (2003): “Endogeneity in nonparametric and semiparametric regression models,” Econometric Society Monographs, 36, 312–357.

Blundell, R. W. and J. L. Powell (2004): “Endogeneity in semiparametric binary response models,” The Review of Economic Studies, 71, 655–679.

Bound, J., D. A. Jaeger, and R. M. Baker (1995): “Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak,” Journal of the American Statistical Association, 90, 443–450.

Carrasco, M., J.-P. Florens, and E. Renault (2007): “Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization,” Handbook of Econometrics, 6, 5633–5751.

Chen, X. and D. Pouzo (2012): “Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals,” Econometrica, 80, 277–321.

Chernozhukov, V. and C. Hansen (2005): “An IV model of quantile treatment effects,” Econometrica, 73, 245–261.

Chesher, A. (2003): “Identification in nonseparable models,” Econometrica, 71, 1405–1441.

Chesher, A. (2007): “Instrumental values,” Journal of Econometrics, 139, 15–34.

Darolles, S., Y. Fan, J.-P. Florens, and E. Renault (2011): “Nonparametric instrumental regression,” Econometrica, 79, 1541–1565.

Das, M., W. K. Newey, and F. Vella (2003): “Nonparametric estimation of sample selection models,” The Review of Economic Studies, 70, 33–58.

Davidson, R. and J. G. MacKinnon (1993): “Estimation and inference in econometrics,” OUP Catalogue.

Del Bono, E. and A. Weber (2008): “Do wages compensate for anticipated working time restrictions? Evidence from seasonal employment in Austria,” Journal of Labor Economics, 26, 181–221.

Dufour, J.-M. (1997): “Some impossibility theorems in econometrics with applications to structural and dynamic models,” Econometrica, 1365–1387.

Dustmann, C. and C. Meghir (2005): “Wages, experience and seniority,” The Review of Economic Studies, 72, 77–108.

Engl, H. W., M. Hanke, and A. Neubauer (1996): Regularization of inverse problems, vol. 375, Springer.

Frazer, G. (2008): “Used-clothing donations and apparel production in Africa,” The Economic Journal, 118, 1764–1784.

Garg, K. M. (1998): Theory of differentiation, Wiley.

Giorgi, G., A. Guerraggio, and J. Thierfelder (2004): Mathematics of Optimization: Smooth and Nonsmooth Case, Elsevier.

Hall, P. and J. L. Horowitz (2005): “Nonparametric methods for inference in the presence of instrumental variables,” The Annals of Statistics, 33, 2904–2929.

Han, C. and P. C. Phillips (2006): “GMM with many moment conditions,” Econometrica, 74, 147–192.

Han, S. (2012): “Robust inference in bivariate threshold crossing models under weak identification,” Unpublished manuscript, Department of Economics, Yale University.

Hastie, T. and R. Tibshirani (1986): “Generalized additive models,” Statistical Science, 297–310.

Hengartner, N. W. and O. B. Linton (1996): “Nonparametric regression estimation at design poles and zeros,” Canadian Journal of Statistics, 24, 583–591.

Hoerl, A. E. and R. W. Kennard (1970): “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, 12, 55–67.

Hong, Y. and H. White (1995): “Consistent specification testing via nonparametric series regression,” Econometrica, 1133–1159.

Horowitz, J. L. (2011): “Applied nonparametric instrumental variables estimation,” Econometrica, 79, 347–394.

Horowitz, J. L. and S. Lee (2007): “Nonparametric instrumental variables estimation of a quantile regression model,” Econometrica, 75, 1191–1208.

Imbens, G. W. and W. K. Newey (2009): “Identification and estimation of triangular simultaneous equations models without additivity,” Econometrica, 77, 1481–1512.

Jiang, J., Y. Fan, and J. Fan (2010): “Estimation in additive models with highly or nonhighly correlated covariates,” The Annals of Statistics, 38, 1403–1432.

Jun, S. J. and J. Pinkse (2012): “Testing under weak identification with conditional moment restrictions,” Econometric Theory, 28, 1229.

Kleibergen, F. (2002): “Pivotal statistics for testing structural parameters in instrumental variables regression,” Econometrica, 70, 1781–1803.

Kleibergen, F. (2005): “Testing parameters in GMM without assuming that they are identified,” Econometrica, 73, 1103–1123.

Kress, R. (1999): Linear integral equations, vol. 82, Springer.

Lee, J. M. (2011): “Topological Spaces,” in Introduction to Topological Manifolds, Springer, 19–48.

Lee, S. (2007): “Endogeneity in quantile regression models: A control function approach,” Journal of Econometrics, 141, 1131–1158.

Linton, O. B. (1997): “Miscellanea: Efficient estimation of additive nonparametric regression models,” Biometrika, 84, 469–473.

Lorentz, G. G. (1986): Approximation of functions, Chelsea Publishing Company, New York.

Lyssiotou, P., P. Pashardes, and T. Stengos (2004): “Estimates of the black economy based on consumer demand approaches,” The Economic Journal, 114, 622–640.

Mazzocco, M. (2012): “Testing efficient risk sharing with heterogeneous risk preferences,” The American Economic Review, 102, 428–468.

Moreira, M. J. (2003): “A conditional likelihood ratio test for structural models,” Econometrica, 71, 1027–1048.

Newey, W. K. (1990): “Efficient instrumental variables estimation of nonlinear models,” Econometrica, 809–837.

Newey, W. K. (1997): “Convergence rates and asymptotic normality for series estimators,” Journal of Econometrics, 79, 147–168.

Newey, W. K. and J. L. Powell (2003): “Instrumental variable estimation of nonparametric models,” Econometrica, 71, 1565–1578.

Newey, W. K., J. L. Powell, and F. Vella (1999): “Nonparametric estimation of triangular simultaneous equations models,” Econometrica, 67, 565–603.

Newey, W. K. and F. Windmeijer (2009): “Generalized method of moments with many weak moment conditions,” Econometrica, 77, 687–719.

Nielsen, J. P. and S. Sperlich (2005): “Smooth backfitting in practice,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 43–61.

Powell, M. J. D. (1981): Approximation theory and methods, Cambridge University Press.

Skinner, J. S., E. S. Fisher, and J. Wennberg (2005): “The efficiency of Medicare,” in Analyses in the Economics of Aging, University of Chicago Press, 129–160.

Sperlich, S., O. B. Linton, and W. Härdle (1999): “Integration and backfitting methods in additive models-finite sample properties and comparison,” Test, 8, 419–458.

Staiger, D. and J. H. Stock (1997): “Instrumental variables regression with weak instruments,” Econometrica, 65, 557–586.

Stock, J. H. and J. H. Wright (2000): “GMM with weak identification,” Econometrica, 68, 1055–1096.

Stock, J. H. and M. Yogo (2005): “Testing for weak instruments in linear IV regression,” Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 80–108.

Taylor, A. E. (1965): General theory of functions and integration, Dover Publications.

Trefethen, L. N. (2008): “Is Gauss quadrature better than Clenshaw-Curtis?” SIAM (Society for Industrial and Applied Mathematics) Review, 50, 67–87.

Wang, H. and S. Xiang (2012): “On the convergence rates of Legendre approximation,” Mathematics of Computation, 81, 861–877.

Weyl, H. (1912): “Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung),” Mathematische Annalen, 71, 441–479.

Yatchew, A. and J. A. No (2001): “Household gasoline demand in Canada,” Econometrica, 69, 1697–1709.

K* = 1

| τ | Statistic | μ² = 4 | μ² = 8 | μ² = 16 | μ² = 32 | μ² = 64 | μ² = 128 | μ² = 256 |
|---|---|---|---|---|---|---|---|---|
| 0 | Bias² | 0.4236 | 0.0325 | 0.0043 | 0.0003 | 0.0002 | 0.0000 | 0.0000 |
| 0 | Var | 943.0304 | 9.4828 | 0.3767 | 0.1145 | 0.0870 | 0.0311 | 0.0168 |
| 0 | MSE | 943.4540 | 9.5153 | 0.3810 | 0.1148 | 0.0872 | 0.0311 | 0.0169 |
| 0 | MSE_IV / MSE_LS | 3711.5352 | 35.4269 | 1.5029 | 0.4751 | 0.3598 | 0.1491 | 0.0904 |
| 0.001 | Bias² | 0.0400 | 0.0064 | 0.0009 | 0.0004 | 0.0002 | 0.0000 | 0.0000 |
| 0.001 | Var | 1.9542 | 0.4287 | 0.2029 | 0.0915 | 0.0773 | 0.0297 | 0.0163 |
| 0.001 | MSE | 1.9943 | 0.4351 | 0.2038 | 0.0918 | 0.0775 | 0.0298 | 0.0163 |
| 0.001 | MSE_PIV / MSE_IV | 0.0021 | 0.0457 | 0.5350 | 0.8001 | 0.8890 | 0.9564 | 0.9677 |
| 0.005 | Bias² | 0.1299 | 0.0759 | 0.0303 | 0.0113 | 0.0034 | 0.0016 | 0.0004 |
| 0.005 | Var | 0.2470 | 0.1313 | 0.1258 | 0.0647 | 0.0369 | 0.0436 | 0.0153 |
| 0.005 | MSE | 0.3768 | 0.2072 | 0.1561 | 0.0759 | 0.0403 | 0.0452 | 0.0158 |
| 0.005 | MSE_PIV / MSE_IV | 0.0001 | 0.1417 | 0.4796 | 0.5039 | 0.7093 | 0.6305 | 0.9049 |
| 0.01 | Bias² | 0.1887 | 0.1269 | 0.0771 | 0.0314 | 0.0143 | 0.0038 | 0.0015 |
| 0.01 | Var | 0.2063 | 0.0870 | 0.0700 | 0.0415 | 0.0346 | 0.0240 | 0.0164 |
| 0.01 | MSE | 0.3951 | 0.2139 | 0.1470 | 0.0729 | 0.0489 | 0.0278 | 0.0180 |
| 0.01 | MSE_PIV / MSE_IV | 0.0011 | 0.1069 | 0.5116 | 0.6535 | 0.7909 | 0.8231 | 0.9073 |

Table 1: Integrated squared bias, integrated variance, and integrated MSE of the penalized and unpenalized IV estimators $\hat g_\tau(\cdot)$ and $\hat g(\cdot)$.
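In Table 1 with μ² = 4, the integrated variance drops from 943.0304 at τ = 0 to 1.9542 at τ = 0.001. The mechanism behind such a drop can be illustrated with a much simpler object than the paper's estimator. The sketch below is an assumption-laden analogy, not the simulation design behind the table: it uses a plain ridge penalty `n * tau * I` on a two-regressor least squares problem whose regressors are nearly collinear, and it computes the exact squared bias and variance conditional on the design. The design, the coefficients `beta`, the error variance `sigma2`, and the helper name `ridge_bias2_var` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two nearly collinear regressors mimic the concurvity caused by a weak instrument.
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
P = np.column_stack([x1, x2])
beta = np.array([1.0, -1.0])  # hypothetical true coefficients
sigma2 = 1.0                  # hypothetical error variance


def ridge_bias2_var(P, beta, sigma2, tau):
    """Exact squared bias and total variance of (P'P + n*tau*I)^{-1} P'y, given P."""
    n_obs, k = P.shape
    G = P.T @ P
    penalized = G + n_obs * tau * np.eye(k)
    S = np.linalg.solve(penalized, G)    # shrinkage matrix: E[estimate] = S @ beta
    bias = S @ beta - beta
    M = np.linalg.solve(penalized, P.T)  # linear map from y to the estimate
    var = sigma2 * np.trace(M @ M.T)     # trace of the covariance matrix
    return float(bias @ bias), float(var)


results = {}
for tau in (0.0, 0.001, 0.005, 0.01):
    b2, v = ridge_bias2_var(P, beta, sigma2, tau)
    results[tau] = (b2, v)
    print(f"tau={tau:<6} bias2={b2:.4f} var={v:.4f} mse={b2 + v:.4f}")
```

In this toy setting the variance is strictly decreasing in τ and the squared bias is strictly increasing, which is the usual ridge trade-off. Whether the same monotonicity holds for the penalized series estimator in the paper depends on its penalty and design, so the sketch should be read only as intuition for the τ = 0 versus τ > 0 comparison.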


K* = 3

| τ | Statistic | μ² = 4 | μ² = 8 | μ² = 16 | μ² = 32 | μ² = 64 | μ² = 128 | μ² = 256 |
|---|---|---|---|---|---|---|---|---|
| 0.001 | Bias² | 0.0634 | 0.0022 | 0.0054 | 0.0004 | 0.0002 | 0.0002 | 0.0000 |
| 0.001 | Var | 21.6485 | 5.7572 | 0.2994 | 0.0975 | 0.0758 | 0.0343 | 0.0305 |
| 0.001 | MSE | 21.7118 | 5.7594 | 0.3048 | 0.0979 | 0.0760 | 0.0344 | 0.0306 |
| 0.001 | MSE_PIV / MSE_IV | 0.8103 | 0.9757 | 0.9097 | 0.9518 | 0.9446 | 0.9674 | 0.9870 |
| 0.005 | Bias² | 0.0015 | 0.1859 | 0.0020 | 0.0011 | 0.0002 | 0.0001 | 0.0001 |
| 0.005 | Var | 55.4223 | 286.3584 | 0.2392 | 0.0997 | 0.0520 | 0.0375 | 0.0287 |
| 0.005 | MSE | 55.4238 | 286.5443 | 0.2412 | 0.1008 | 0.0522 | 0.0376 | 0.0288 |
| 0.005 | MSE_PIV / MSE_IV | 0.5546 | 0.8021 | 0.8498 | 0.8506 | 0.8761 | 0.7105 | 0.9570 |
| 0.01 | Bias² | 0.0139 | 0.2316 | 0.0037 | 0.0002 | 0.0001 | 0.0003 | 0.0000 |
| 0.01 | Var | 82.1557 | 146.8736 | 0.2420 | 0.0991 | 0.0506 | 0.0500 | 0.0158 |
| 0.01 | MSE | 82.1696 | 147.1052 | 0.2457 | 0.0993 | 0.0507 | 0.0503 | 0.0158 |
| 0.01 | MSE_PIV / MSE_IV | 0.5646 | 1.0198 | 0.7376 | 0.8525 | 0.8328 | 0.8605 | 0.9538 |

Table 2: Integrated squared bias, integrated variance, and integrated MSE of the penalized and unpenalized IV estimators $\hat g_\tau(\cdot)$ and $\hat g(\cdot)$.
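As a reading aid for Table 2, the three reported quantities should satisfy MSE = Bias² + Var up to rounding. The short check below copies the Table 2 entries and verifies the decomposition cell by cell. The tolerance `TOL` is an assumption: each of the three entries is rounded to four decimals, so a gap of up to 1.5e-4 is compatible with rounding alone. The helper name `max_gap` is hypothetical.

```python
# (bias^2, variance, MSE) rows copied from Table 2, one tuple per tau,
# each list ordered by mu^2 = 4, 8, 16, 32, 64, 128, 256.
table2 = {
    0.001: (
        [0.0634, 0.0022, 0.0054, 0.0004, 0.0002, 0.0002, 0.0000],
        [21.6485, 5.7572, 0.2994, 0.0975, 0.0758, 0.0343, 0.0305],
        [21.7118, 5.7594, 0.3048, 0.0979, 0.0760, 0.0344, 0.0306],
    ),
    0.005: (
        [0.0015, 0.1859, 0.0020, 0.0011, 0.0002, 0.0001, 0.0001],
        [55.4223, 286.3584, 0.2392, 0.0997, 0.0520, 0.0375, 0.0287],
        [55.4238, 286.5443, 0.2412, 0.1008, 0.0522, 0.0376, 0.0288],
    ),
    0.01: (
        [0.0139, 0.2316, 0.0037, 0.0002, 0.0001, 0.0003, 0.0000],
        [82.1557, 146.8736, 0.2420, 0.0991, 0.0506, 0.0500, 0.0158],
        [82.1696, 147.1052, 0.2457, 0.0993, 0.0507, 0.0503, 0.0158],
    ),
}

TOL = 1.5e-4  # three entries, each rounded to four decimals


def max_gap(bias2, var, mse):
    """Largest absolute deviation of bias^2 + var from the reported MSE."""
    return max(abs(b + v - m) for b, v, m in zip(bias2, var, mse))


for tau, rows in table2.items():
    print(f"tau={tau}: max |bias2 + var - mse| = {max_gap(*rows):.4f}")
```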

| τ | CV Value |
|---|---|
| 0.015 | 37.267 |
| 0.05 | 37.246 |
| 0.1 | 37.286 |
| 0.15 | 37.330 |
| 0.2 | 37.373 |

Table 3: Cross-validation values for the choice of τ.
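A minimal sketch of how Table 3 can be read, assuming the usual rule that the penalization parameter is the grid point with the smallest cross-validation value (the variable names are hypothetical):

```python
# Cross-validation values copied from Table 3, keyed by tau.
cv_values = {0.015: 37.267, 0.05: 37.246, 0.1: 37.286, 0.15: 37.330, 0.2: 37.373}

# Pick the grid point that minimizes the criterion.
tau_star = min(cv_values, key=cv_values.get)
print(tau_star, cv_values[tau_star])  # prints: 0.05 37.246
```

On this grid the criterion is nearly flat, with values ranging only from 37.246 to 37.373.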

Figure 2: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a weak instrument, τ = 0.001. The (blue) dotted-dash line is the true $g_0(\cdot)$. The (black) solid line is the (simulated) mean of $\hat g(\cdot)$, with the dotted band representing the 0.025–0.975 quantile range. Note that the difference between $g_0(\cdot)$ and the mean of $\hat g(\cdot)$ is the (simulated) bias. The (red) solid line is the mean of $\hat g_\tau(\cdot)$, with the dashed band representing the 0.025–0.975 quantile range.

Figure 3: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a strong instrument, τ = 0.001.

Figure 4: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a weak instrument, τ = 0.005.

Figure 5: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a strong instrument, τ = 0.005.

Figure 6: NPIV estimates from Horowitz (2011), full sample (n = 2019), 95% confidence band.

Figure 7: Unpenalized IV estimates with nonparametric first-stage equations, full sample (n = 2019), 95% confidence band.

Figure 8: Penalized IV estimates with the discontinuity sample (n = 650, F = 191.66).
