
SPECIFICATION TESTING IN NONPARAMETRIC INSTRUMENTAL VARIABLES ESTIMATION

by

Joel L. Horowitz

Department of Economics Northwestern University

Evanston, IL 60208 USA

August 2008

ABSTRACT

In nonparametric instrumental variables estimation, the function being estimated is the solution to an integral equation. A solution may not exist if, for example, the instrument is not valid. This paper discusses the problem of testing the null hypothesis that a solution exists against the alternative that there is no solution. We give necessary and sufficient conditions for existence of a solution and show that uniformly consistent testing of an unrestricted null hypothesis is not possible. Uniformly consistent testing is possible, however, if the null hypothesis is restricted by assuming that any solution to the integral equation is smooth. Many functions of interest in applied econometrics, including demand functions and Engel curves, are expected to be smooth. The paper presents a statistic for testing the null hypothesis that a smooth solution exists. The test is consistent uniformly over a large class of probability distributions of the observable random variables for which the integral equation has no smooth solution. The finite-sample performance of the test is illustrated through Monte Carlo experiments.

Key Words: Inverse problem, instrumental variable, series estimator, linear operator

JEL Listing: C12, C14

I thank Whitney Newey for asking a question that led to this paper. Hidehiko Ichimura, Sokbae Lee, and Whitney Newey provided helpful comments. This research was supported in part by NSF grants SES-0352675 and DMS-0706348.


1. Introduction

Nonparametric instrumental variables (IV) estimation consists of estimating the unknown

function g that is identified by the relation

(1.1) $E[Y - g(X) \mid W = w] = 0$

for almost every $w$ in the support of the random variable $W$. Equivalently, $g$ satisfies

(1.2) $Y = g(X) + U; \quad E(U \mid W = w) = 0$

for almost every $w$. In this model, $Y$ is a scalar dependent variable, $X$ is a continuously distributed explanatory variable that may be endogenous (that is, $E(U \mid X = x)$ may not be zero), $W$ is an instrument for $X$, and $U$ is an unobserved random variable. The function $g$ is assumed to satisfy mild regularity conditions but does not belong to a known, finite-dimensional parametric family. The data are an independent random sample from the distribution of $(Y, X, W)$.

Methods for estimating g in (1.1) have been developed by Newey and Powell (2003);

Darolles, Florens, and Renault (2006); Hall and Horowitz (2005); and Blundell, Chen, and

Kristensen (2007). Newey, Powell, and Vella (1999) developed a nonparametric estimator for g

in a different setting in which a control function is available. Horowitz and Lee (2007); Chen and

Pouzo (2008); and Chernozhukov, Gagliardini, and Scaillet (2008) have developed methods for

estimating quantile-regression versions of model (1.1).

All methods for estimating g in model (1.1) assume the existence of a function that

satisfies (1.1). However, as is explained in Section 2 of this paper, a solution need not exist if, for

example, the instrument $W$ is not valid, that is, if $E(U \mid W = w) \neq 0$ on a set of $w$ values that has

non-zero probability. This raises the question whether it is possible to test for existence of a

solution to (1.1). This paper provides an answer to the question. The null hypothesis is that a

solution to (1.1) exists. The set of alternative hypotheses consists of the distributions of

$(Y, X, W)$ for which there is no solution to (1.1). We consider tests that are consistent uniformly

over this set. Uniform consistency is important because it ensures that there are no alternatives

against which a test has low power even with large samples. If a test is not uniformly consistent

over a specified set, then that set contains alternatives against which the test has low power.

Some such alternatives may depart from the null hypothesis in extreme ways, as is illustrated by

example in Section 3. We show that the null hypothesis cannot be tested consistently uniformly over the set of alternative hypotheses without placing restrictions on $g$ beyond those needed to estimate it when (1.1) has a solution. Specifically, there is always a distribution of $(Y, X, W)$ such that no solution to (1.1) exists but any $\alpha$-level test accepts the null hypothesis with a probability that is arbitrarily close to $1 - \alpha$.

We also show that it is possible to test the hypothesis that a “smooth” solution to (1.1)

exists. The test is consistent uniformly over a large class of non-smooth alternatives. The paper

presents such a test. Non-existence of a solution to (1.1) is an extreme form of non-smoothness,

so in sufficiently large samples, the test presented here rejects the null hypothesis that g is

smooth if no solution to (1.1) exists.

We define g to be smooth if it has sufficiently many square-integrable derivatives. With

a sufficiently large sample, the test presented here rejects the null hypothesis that (1.1) has a

smooth solution if no solution exists or if one exists but is not smooth. The possibility of

rejecting a non-smooth solution is desirable in many applications. For example, a demand

function or Engel curve is unlikely to be discontinuous or wiggly. Thus, rejection of the

hypothesis that a demand function or Engel curve is smooth implies misspecification of the model

that identifies the curve or function (e.g., that W is not a valid instrument for X ). Accordingly,

the test described here is likely to be useful in many applications.

The test presented here is related to Horowitz’s (2006) test of the hypothesis that g

belongs to a specified, finite-dimensional parametric family. A smooth function can be

approximated accurately by a finite-dimensional parametric model consisting of a truncated series

expansion with suitable basis functions. See Section 4 for details. The approximation to a non-

smooth function is less accurate. Therefore, one can test for existence of a smooth solution to

(1.1) by testing the adequacy of a truncated series approximation to g . The test statistic is

similar to Horowitz’s (2006) statistic for testing a finite-dimensional parametric model, but its

asymptotic behavior is different. In Horowitz (2006), the dimension of the parametric model is

fixed. In the present setting, the dimension (or length) of the series approximation increases as

the sample size increases. This is necessary to ensure that the truncation error remains smaller

than the smallest deviation from the null hypothesis that the test can detect. The increasing

dimension of the null hypothesis model changes the asymptotic behavior of the test statistic in

ways that are explained in Section 4.

Section 2 of this paper gives a necessary and sufficient condition for existence of a

function g that solves (1.1). It also explains why a solution may not exist. Section 3 presents an

example showing that it is not possible to construct a uniformly consistent test of the hypothesis


that (1.1) has a solution. No matter how large the sample is, there are always alternatives against

which any test has low power. Section 4 describes the statistic for testing the hypothesis that

(1.1) has a smooth solution. This section also explains the test’s asymptotic behavior under the

null and alternative hypotheses. Section 5 presents the results of a Monte Carlo investigation of

the test’s finite-sample behavior, and Section 6 presents conclusions. The proofs of theorems are

in the appendix, which is Section 7.

2. Necessary and Sufficient Conditions for a Solution to (1.1)

Necessary and sufficient conditions for existence of a solution to (1.1) are given by

Picard’s theorem (e.g., Kress 1999, Theorem 15.18). Before stating the theorem, we define

notation that will be used throughout the paper.

Assume that $X$ and $W$ are real-valued random variables and that the support of $(X, W)$ is contained in $[0,1]^2$. This assumption entails no loss of generality as it can always be satisfied, if necessary, by carrying out monotone transformations of the components of $(X, W)$. Let $f_{XW}$ and $f_W$, respectively, denote the probability density functions (with respect to Lebesgue measure) of $(X, W)$ and $W$. For $x, z \in [0,1]$, define

$t(x, z) = \int_0^1 f_{XW}(x, w) f_{XW}(z, w) \, dw$.

Define the operator $T: L_2[0,1] \to L_2[0,1]$ by

$(T\nu)(z) = \int_0^1 t(x, z) \nu(x) \, dx$,

where $\nu$ is any function in $L_2[0,1]$. Assume that $T$ is non-singular. Denote its eigenvalues and eigenvectors by $\{\lambda_j, \phi_j : j = 1, 2, \ldots\}$. Sort these so that $\lambda_1 \geq \lambda_2 \geq \ldots > 0$. Under the assumptions stated in Section 4, $T$ is a compact operator. Therefore, its eigenvectors form a complete, orthonormal basis for $L_2[0,1]$. Moreover, the eigenvalues are strictly positive and have 0 as their only limit point.

Now, for $z \in [0,1]$ define

$q(z) = E_{YW}[Y f_{XW}(z, W)]$,

where $E_{YW}$ denotes the expectation with respect to $(Y, W)$. Hall and Horowitz (2005) show that (1.1) is equivalent to the operator equation

(2.1) $q = Tg$.


Therefore, the conditions for existence of a solution to (1.1) are the same as the conditions for

existence of a function g that satisfies (2.1).

Let $\langle \cdot, \cdot \rangle$ denote the inner product in $L_2[0,1]$. The following theorem gives necessary and

sufficient conditions for existence of a function g that solves (1.1) and (2.1).

Theorem 2.1 (Picard): Let $T$ be a compact, non-singular operator, and assume that $q \neq 0$. Then (2.1) has a solution if and only if

$\sum_{j=1}^{\infty} \frac{\langle q, \phi_j \rangle^2}{\lambda_j^2} < \infty$.

If a solution exists, it is

$g(x) = \sum_{j=1}^{\infty} b_j \phi_j(x)$,

where

$b_j = \frac{\langle q, \phi_j \rangle}{\lambda_j}$. ■

Testing the hypothesis that (1.1) has a solution is equivalent to testing the hypothesis that $\sum_{j=1}^{\infty} b_j^2 < \infty$ against the alternative $\sum_{j=1}^{\infty} b_j^2 = \infty$. The quantities $\langle q, \phi_j \rangle$ are the generalized Fourier coefficients of $q$ using the basis functions $\{\phi_j\}$. That is,

$q(z) = \sum_{j=1}^{\infty} \langle q, \phi_j \rangle \phi_j(z)$.

Therefore, a solution to (1.1) exists if and only if the Fourier coefficients of $q$ converge sufficiently rapidly relative to the eigenvalues of $T$. It is easy to construct examples in which the generalized Fourier coefficients converge more slowly than the eigenvalues so that $\sum_{j=1}^{\infty} b_j^2 = \infty$.
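To make Picard's criterion concrete, the following minimal sketch computes partial sums of $b_j^2$ for two hypothetical decay patterns. The power-law forms of the eigenvalues and Fourier coefficients, and the function names, are assumptions chosen purely for illustration; they are not taken from the paper.

```python
def picard_partial_sum(fourier_coefs, eigenvalues):
    """Partial sum of b_j^2 = (<q, phi_j> / lambda_j)^2 over the supplied terms."""
    return sum((c / lam) ** 2 for c, lam in zip(fourier_coefs, eigenvalues))


def partial_sums(coef_power, eig_power, n_terms_list):
    """Partial sums when <q, phi_j> = j^(-coef_power) and lambda_j = j^(-eig_power).

    Both power laws are illustrative assumptions.
    """
    sums = []
    for n_terms in n_terms_list:
        js = range(1, n_terms + 1)
        coefs = [j ** (-coef_power) for j in js]
        eigs = [j ** (-eig_power) for j in js]
        sums.append(picard_partial_sum(coefs, eigs))
    return sums


# Coefficients decay faster than the eigenvalues: b_j = 1/j, so the
# partial sums stay bounded (they approach pi^2 / 6).
converging = partial_sums(3.0, 2.0, [10, 100, 1000])

# Coefficients decay at the same rate as the eigenvalues: b_j = 1, so the
# partial sum equals the number of terms and grows without bound.
diverging = partial_sums(2.0, 2.0, [10, 100, 1000])
```

In the first case a solution to (2.1) exists; in the second the partial sums grow linearly with the number of terms, which is the situation in which no solution exists.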

In applied econometric research g may be an Engel curve, demand function, or some

other economically meaningful function whose existence is not in question. In this case, (1.1)

may not have a solution if W is not a valid instrument. Specifically, suppose that

$E(U \mid W = w) \neq 0$ on a set of $w$ values with positive probability. Then arguments like those used to obtain (2.1) show that $g$ solves not (1.1) or (2.1) but

(2.2) $(Tg)(z) = q(z) - E_{UW}[U f_{XW}(z, W)]$.

The misspecified models (1.1) and (2.1) need not have solutions when $W$ is an invalid instrument and (2.2) is the correct specification.

3. The Impossibility of Uniformly Consistent Testing with Unrestricted Null and Alternative Hypotheses

This section presents an example in which uniformly consistent testing of the hypothesis

that (1.1) has a solution is not possible. The distributions used in the example are nested in any

reasonable class of probability distributions for $(Y, X, W)$ in (1.1), so the impossibility result

obtained with the example holds generally.

The example consists of a simple null hypothesis and a simple alternative hypothesis.

“Simple” in this context means that there are no unknown population parameters in either the null

or alternative hypotheses. The null hypothesis is that a specific function g solves (1.1). Under

the alternative hypothesis, (1.1) has no solution. It follows from the Neyman-Pearson lemma that

the likelihood ratio test is the most powerful test of the null hypothesis against the alternative.

We show that regardless of the sample size, it is always possible to choose an alternative

hypothesis against which the power of the likelihood ratio test is arbitrarily close to its size.

Therefore, the likelihood ratio test is not uniformly consistent. It follows that no other test is

uniformly consistent because no other test is more powerful than the likelihood ratio test.

To construct the example, write (1.1) in the equivalent form

(3.1) $Y = E[g(X) \mid W] + V; \quad E(V \mid W) = 0$,

where $V = Y - E(Y \mid W)$. Assume that $f_{XW}$ is known and is

$f_{XW}(x, w) = 1 + 2 \sum_{j=1}^{\infty} \lambda_j^{1/2} \cos(j \pi x) \cos(j \pi w)$,

where the $\lambda_j$'s are constants satisfying $\lambda_1 \geq \lambda_2 \geq \ldots > 0$ and $\sum_{j=1}^{\infty} \lambda_j^{1/2} < \infty$. With this density function, the eigenvalues of $T$ are $\{1, \lambda_1, \lambda_2, \ldots\}$. The eigenvectors are $\phi_1(x) = 1$ and $\phi_j(x) = 2^{1/2} \cos[(j-1) \pi x]$ for $j \geq 2$.

Assume that $V$ is known to be distributed as $N(0,1)$ and is independent of $X$ and $W$. Let $\{b_j : j = 0, 1, 2, \ldots\}$ denote the Fourier coefficients of $g$ with the cosine basis. That is,

(3.2) $g(x) = b_0 + 2^{1/2} \sum_{j=1}^{\infty} b_j \cos(j \pi x)$.

Then

$E[g(X) \mid W] = b_0 + 2^{1/2} \sum_{j=1}^{\infty} b_j \lambda_j^{1/2} \cos(j \pi W)$,

and model (1.1) becomes

(3.3) $Y = b_0 + 2^{1/2} \sum_{j=1}^{\infty} b_j \lambda_j^{1/2} \cos(j \pi W) + V; \quad V \sim N(0,1)$.

Now let $J > 0$ be an integer. Consider testing the simple null hypothesis

$H_0: \begin{cases} b_0 = 1 \\ b_j = j^{-2} & \text{if } 1 \leq j \leq J \\ b_j = 0 & \text{if } j > J \end{cases}$

against the simple alternative hypothesis

$H_1: \begin{cases} b_0 = 1 \\ b_j = j^{-2} & \text{if } 1 \leq j \leq J \\ b_j = 1 & \text{if } j > J. \end{cases}$

Under $H_0$, $g$ in (3.2) is an ordinary function on $[0,1]$, and $g$ solves (1.1). Under $H_1$, $g$ is a linear combination of an ordinary function and a delta function, so $g$ is not a function on $[0,1]$ in the usual sense and (1.1) has no solution.

Let the data be the independent random sample $\{Y_i, X_i, W_i : i = 1, \ldots, n\}$. We show that for any fixed $n$, no matter how large, it is possible to choose $J$ so that the power of the likelihood ratio test of $H_0$ against $H_1$ is arbitrarily close to its size.

The likelihood ratio statistic for testing $H_0$ against $H_1$ is

(3.4) $LR = (1/2) \sum_{i=1}^{n} \left\{ \left[ Y_i - 1 - 2^{1/2} \sum_{j=1}^{J} j^{-2} \lambda_j^{1/2} \cos(j \pi W_i) \right]^2 - \left[ Y_i - 1 - 2^{1/2} \sum_{j=1}^{J} j^{-2} \lambda_j^{1/2} \cos(j \pi W_i) - 2^{1/2} \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \cos(j \pi W_i) \right]^2 \right\}$.

Substituting (3.3) into (3.4) shows that under $H_0$, the likelihood ratio statistic is

$LR_0 = 2^{1/2} \sum_{i=1}^{n} \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \cos(j \pi W_i) V_i - \sum_{i=1}^{n} \left[ \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \cos(j \pi W_i) \right]^2$.

Under $H_1$, the likelihood ratio statistic is

$LR_1 = 2^{1/2} \sum_{i=1}^{n} \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \cos(j \pi W_i) V_i + \sum_{i=1}^{n} \left[ \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \cos(j \pi W_i) \right]^2$.

Therefore,

$LR_1 - LR_0 = 2 \sum_{i=1}^{n} \left[ \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \cos(j \pi W_i) \right]^2 \leq 2n \left( \sum_{j=J+1}^{\infty} \lambda_j^{1/2} \right)^2$.

Because $\sum_{j=1}^{\infty} \lambda_j^{1/2} < \infty$, $LR_1 - LR_0$ can be made arbitrarily small by making $J$ sufficiently large. Therefore, we obtain the following result, which is proved in the Appendix.

Proposition 3.1: Let $c_{n\alpha}$ denote the $\alpha$-level critical value of $LR$ when the sample size is $n$. That is, $P(LR > c_{n\alpha} \mid H_0) = \alpha$. Let $n$ be fixed. For each $\varepsilon > 0$ there is a $J_0$ such that the power of the $\alpha$-level likelihood ratio test of $H_0$ against $H_1$ is less than or equal to $\alpha + \varepsilon$ whenever $J \geq J_0$. That is, $P(LR > c_{n\alpha} \mid H_1) \leq \alpha + \varepsilon$ whenever $J \geq J_0$. ■
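The upper bound on $LR_1 - LR_0$ that precedes Proposition 3.1 can be evaluated numerically. The sketch below assumes the hypothetical eigenvalue sequence $\lambda_j = j^{-4}$, so that $\lambda_j^{1/2} = j^{-2}$ is summable, and replaces the tail sum by the integral bound $\sum_{j > J} j^{-2} \leq 1/J$. The eigenvalue sequence, the tolerance `delta`, and the function names are illustrative assumptions, not part of the paper.

```python
import math


def tail_sum_upper(J):
    """Upper bound on sum_{j > J} j^(-2), from the integral of x^(-2) over [J, inf)."""
    return 1.0 / J


def lr_gap_bound(n, J):
    """Upper bound 2 n (sum_{j > J} lambda_j^(1/2))^2, assuming lambda_j = j^(-4)."""
    return 2.0 * n * tail_sum_upper(J) ** 2


def smallest_J(n, delta):
    """A value of J that makes lr_gap_bound(n, J) no larger than delta."""
    return math.ceil(math.sqrt(2.0 * n / delta))


# For a fixed sample size, J can always be chosen large enough that the
# two likelihood ratio statistics are almost indistinguishable.
J_small_sample = smallest_J(1_000, 0.01)
J_large_sample = smallest_J(1_000_000, 0.01)
```

Under these assumptions the required $J$ grows only like $n^{1/2}$, which illustrates why no fixed sample size protects against alternatives of this kind.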

Now consider the class of alternative hypotheses consisting of distributions of $(Y, X, W)$ for which $H_1$ is true for some $J < \infty$. Because no test is more powerful than the likelihood ratio test, it follows from Proposition 3.1 that no test of $H_0$ is consistent uniformly over this class. Regardless of the sample size $n$, there are always distributions in $H_1$ for some finite $J$ against which the power of any test is arbitrarily close to the test's level. The intuitive reason is that Fourier components of $g$ corresponding to eigenvectors of $T$ with small eigenvalues have little effect on $Y$ and, therefore, are hard to detect empirically. This is illustrated by (3.3), where $Y$ is insensitive to changes in Fourier coefficients $b_j$ that are associated with very small eigenvalues $\lambda_j$. This problem can be overcome by restricting the null and alternative hypotheses so as to avoid the need for estimating or testing Fourier coefficients associated with very small eigenvalues of $T$. Section 4 presents a way of doing this.

4. A Uniformly Consistent Test of the Hypothesis That (1.1) Has a Smooth Solution

In this section, we restrict the null hypothesis by requiring g to be smooth in the sense

that it has s square-integrable derivatives, where s is a sufficiently large integer. Under the

alternative hypothesis, (1.1) has no smooth solution. In sufficiently large samples, the resulting


test rejects the null hypothesis if (1.1) has no solution or if (1.1) has a non-smooth solution. As

was explained in the introduction, a non-smooth solution to (1.1) is an indicator of

misspecification (possibly due to an invalid instrument) in many applications. Therefore, the

ability to reject non-smooth solutions to (1.1) can be a desirable property of a test.

4.1 Motivation

We begin with an informal discussion that provides intuition for the test that is developed

here. Let $\{\psi_j : j = 1, 2, \ldots\}$ be a complete, orthonormal basis for $L_2[0,1]$. Suppose for the moment that under $H_0$, the solution to (1.1) has the finite-dimensional representation

(4.1) $g(x) = \sum_{j=1}^{J} b_j \psi_j(x)$,

for some fixed $J < \infty$ and (generalized) Fourier coefficients $\{b_j : j = 1, \ldots, J\}$. Equation (4.1) restricts $g$ to a finite-dimensional parametric family. The null hypothesis that (4.1) is a solution to (1.1) for some set of $b_j$'s can be tested against the alternative that it is not by using the test of Horowitz (2006). The test statistic is

(4.2) $\tau_{Pn} = \int_0^1 S_n^2(z) \, dz$,

where

$S_n(z) = n^{-1/2} \sum_{i=1}^{n} \left[ Y_i - \sum_{j=1}^{J} \hat{b}_j \psi_j(X_i) \right] \hat{f}_{XW}^{(-i)}(z, W_i)$,

$\hat{b}_j$ is an estimator of $b_j$ that is $n^{-1/2}$-consistent under the null hypothesis, and $\hat{f}_{XW}^{(-i)}$ is a leave-observation-$i$-out nonparametric kernel estimator of $f_{XW}$. Specifically,

(4.3) $\hat{f}_{XW}^{(-i)}(x, w) = \frac{1}{(n-1) h^2} \sum_{j=1, \, j \neq i}^{n} K\left( \frac{x - X_j}{h}, \frac{w - W_j}{h} \right)$,

where $h$ is a bandwidth and $K$ is a kernel function of a 2-dimensional argument. Horowitz shows that if (4.1) is a solution to (1.1), then $\tau_{Pn}$ is asymptotically distributed as a weighted sum of independent chi-square random variables with one degree of freedom. Horowitz (2006) also shows that $\tau_{Pn}$ is consistent uniformly over a class of nonparametric alternative hypotheses whose distance from (4.1) is proportional to $n^{-1/2}$.


Now let g be nonparametric, but suppose that its derivatives through order s are square

integrable on $[0,1]$. Then $g$ has the infinite-dimensional Fourier representation

$g(x) = \sum_{j=1}^{\infty} b_j \psi_j(x)$

but can be approximated accurately by the finite-dimensional model that is obtained by truncating this series. Specifically, let $\{J_n : n = 1, 2, \ldots\}$ be a sequence of positive integers with $J_n \to \infty$ as $n \to \infty$. Define

(4.4) $g_n(x) = \sum_{j=1}^{J_n} b_j \psi_j(x)$.

Then for a wide variety of basis functions $\{\psi_j\}$ that includes trigonometric functions and orthogonal polynomials, the error of $g_n$ as an approximation to $g$ satisfies

$\| g_n - g \| = O(J_n^{-s})$,

where $\| \cdot \|$ denotes the norm in $L_2[0,1]$. Thus, a smooth function $g$ can be approximated accurately by a finite-dimensional parametric function. This suggests testing the null hypothesis that (1.1) has a smooth solution by using Horowitz's (2006) procedure to test the hypothesis that (4.4) is the solution. If $J_n$ is sufficiently large, the approximation error will be small compared

to the minimum deviation from (4.4) that the test can detect. On the other hand, if the solution to

(1.1) is non-smooth or does not exist, then (4.4) will be a poor approximation to the solution to

(1.1), and the test will reject the null hypothesis if the sample is sufficiently large.

The main difference between this version of Horowitz’s (2006) test and the test based on

$\tau_{Pn}$ in (4.2) is that when $g$ is nonparametric, $J_n$ must increase as $n$ increases to ensure that the approximation error remains too small to be detected by the test. This changes the asymptotic distributional properties of the test statistic. Among other things, the test statistic is asymptotically degenerate (its asymptotic distribution is concentrated at a single point) under the null hypothesis. We deal with this problem here by splitting the sample into halves. One half is used to estimate the $b_j$'s and the other half is used to construct the test statistic. The sample splitting procedure is explained in more detail in Section 4.3. The degeneracy problem is well-known in nonparametric testing, and a variety of solutions are possible. Sample-splitting leads to

a relatively simple test. Other potential solutions that may have some advantages but are much

more complicated analytically are discussed in Section 4.8.


4.2 The Null and Alternative Hypotheses

This section provides formal statements of the null and alternative hypotheses of the test

that is developed in this paper. The test statistic is presented in Section 4.3.

We use the following notation. For a function $\nu: [0,1] \to \mathbb{R}$ and integer $\ell \geq 0$, define

$D^{\ell} \nu(x) = \frac{\partial^{\ell} \nu(x)}{\partial x^{\ell}}$

whenever the derivative exists. Define $D^0 \nu(x) = \nu(x)$. Given an integer $s > 0$, define the Sobolev norm

(4.5) $\| \nu \|_s = \left\{ \sum_{\ell=0}^{s} \int_0^1 [D^{\ell} \nu(x)]^2 \, dx \right\}^{1/2}$

and the function space

$\mathcal{H}_s = \{ \nu: [0,1] \to \mathbb{R} : \| \nu \|_s \leq C_g \}$,

where $C_g < \infty$ is a constant.

The null hypothesis in the remainder of this paper is

$H_0$: Equation (1.1) has a solution $g \in \mathcal{H}_s$ for some integer $s > 0$.

The alternative hypothesis is

$H_1$: Equation (1.1) does not have a solution in $\mathcal{H}_s$.

4.3 The Test Statistic

This section presents the statistic for testing $H_0$. The data are the independent random sample $\{Y_i, X_i, W_i : i = 1, \ldots, n\}$. To avoid unimportant notational complexities, we assume that $n$ is even. If $n$ is odd, drop one randomly selected observation. This has a negligible effect on the power of the test. Define $\hat{f}_{XW}^{(-i)}$ as in (4.3). Define $\mathcal{S}_1 = \{Y_i, X_i, W_i : i = 1, \ldots, n/2\}$ and $\mathcal{S}_2 = \{Y_i, X_i, W_i : i = n/2 + 1, \ldots, n\}$. Let $\hat{b}_j$ ($j = 1, \ldots, J_n$) be consistent estimators of the Fourier coefficients $b_j$ in (4.4) that are obtained from the data in $\mathcal{S}_2$. The $\hat{b}_j$'s can be obtained by using the method of Blundell, Chen, and Kristensen (2007), but the derivation of the asymptotic distribution of the test statistic is simpler if the method explained in Section 4.4 is used. Define

$\hat{g}_n(x) = \sum_{j=1}^{J_n} \hat{b}_j \psi_j(x)$.

Now define

$S_n(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [Y_i - \hat{g}_n(X_i)] \hat{f}_{XW}^{(-i)}(z, W_i)$.

The test statistic is

$\tau_n = \int_0^1 S_n^2(z) \, dz$.

$H_0$ is rejected if $\tau_n$ is large.

The test works because $J_n$ can be chosen so that truncation error in the finite-series approximation to $g$ is negligibly small. Therefore, under $H_0$, $S_n(z)$ estimates

(4.6) $(2/n)^{1/2} \sum_{i \in \mathcal{S}_1} U_i f_{XW}(z, W_i)$,

where $U_i = Y_i - g(X_i)$. Under $H_0$, the quantity in (4.6) is a random variable with mean 0 and finite variance, so $\tau_n$ is bounded in probability. Under $H_1$, the truncation error is non-negligible, and $\tau_n$ diverges as $n \to \infty$.
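The construction of $\tau_n$ can be sketched in a few lines of code. The block below is a minimal illustration under stated assumptions: it uses a product biweight kernel (a second-order kernel, chosen here only for simplicity), treats the first half of the supplied arrays as $\mathcal{S}_1$, takes the fitted function `g_hat` as a user-supplied callable estimated elsewhere from $\mathcal{S}_2$, and approximates the integral over $z$ by a midpoint rule. The function names and the grid size are hypothetical.

```python
import math


def kappa(v):
    """Biweight kernel on [-1, 1]; an illustrative second-order choice."""
    return 15.0 / 16.0 * (1.0 - v * v) ** 2 if abs(v) <= 1.0 else 0.0


def f_loo(z, w, i, X, W, h):
    """Leave-observation-i-out product-kernel estimate of f_XW(z, w), as in (4.3)."""
    n = len(X)
    total = 0.0
    for j in range(n):
        if j != i:
            total += kappa((z - X[j]) / h) * kappa((w - W[j]) / h)
    return total / ((n - 1) * h * h)


def tau_n(Y, X, W, g_hat, h, n_grid=20):
    """Midpoint-rule approximation to the integral of S_n(z)^2 over [0, 1].

    Observations 0, ..., n/2 - 1 play the role of S_1; g_hat is assumed to
    have been estimated from the remaining observations.
    """
    n = len(Y)
    half = n // 2
    tau = 0.0
    for m in range(n_grid):
        z = (m + 0.5) / n_grid
        s_n = 0.0
        for i in range(half):
            s_n += (Y[i] - g_hat(X[i])) * f_loo(z, W[i], i, X, W, h)
        s_n *= math.sqrt(2.0 / n)
        tau += s_n * s_n / n_grid
    return tau
```

The cost is proportional to `n_grid` times $n^2$, so this direct form is only practical for moderate sample sizes. A series estimator of $f_{XW}$ would avoid the inner loop.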

4.4 A Sieve Estimator of g

This section describes a modified version of the estimator of Blundell, Chen, and

Kristensen (2007). The derivation of the asymptotic distribution of $\tau_n$ under $H_0$ is simpler with

the modified estimator than with the original one.

For $w \in [0,1]$, define

$m(w) = E(Y \mid W = w) f_W(w)$.

Define the operator $A: L_2[0,1] \to L_2[0,1]$ by

$(A\nu)(w) = \int_0^1 \nu(x) f_{XW}(x, w) \, dx$.

Then (1.1) is equivalent to the operator equation

(4.7) $Ag = m$.

The estimator of $g$ is defined in terms of series expansions of $g$, $m$, and $A$. As before, let $\{\psi_j\}$ denote a complete, orthonormal basis for $L_2[0,1]$. The expansions are

$g(x) = \sum_{j=1}^{\infty} b_j \psi_j(x)$,

$m(w) = \sum_{k=1}^{\infty} a_k \psi_k(w)$,

and

$f_{XW}(x, w) = \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} c_{jk} \psi_j(x) \psi_k(w)$,

where

$b_j = \int_0^1 g(x) \psi_j(x) \, dx$,

$a_k = \int_0^1 m(w) \psi_k(w) \, dw$,

and

$c_{jk} = \int_{[0,1]^2} f_{XW}(x, w) \psi_j(x) \psi_k(w) \, dw \, dx$.

We need estimators of $a_k$ and $c_{jk}$ ($j, k = 1, \ldots, J_n$). These are

$\hat{a}_k = (2/n) \sum_{i \in \mathcal{S}_2} Y_i \psi_k(W_i)$

and

$\hat{c}_{jk} = (2/n) \sum_{i \in \mathcal{S}_2} \psi_j(X_i) \psi_k(W_i)$.

In addition, define the operator $\hat{A}$ that estimates $A$ by

$(\hat{A}\nu)(w) = \int_0^1 \nu(x) \hat{f}_{XW}(x, w) \, dx$,

where

(4.8) $\hat{f}_{XW}(x, w) = \sum_{j=1}^{J_n} \sum_{k=1}^{J_n} \hat{c}_{jk} \psi_j(x) \psi_k(w)$.

Also define the estimator

$\hat{m}(w) = \sum_{k=1}^{J_n} \hat{a}_k \psi_k(w)$.

Finally, define the set of functions in $L_2[0,1]$

$\mathcal{H}_{ns} = \left\{ \nu = \sum_{j=1}^{J_n} \nu_j \psi_j : \| \nu \|_s \leq C_g \right\}$.

The estimator of $g$ is

(4.9) $\hat{g} = \arg\min_{\nu \in \mathcal{H}_{ns}} \| \hat{A}\nu - \hat{m} \|$.

The following result is used to obtain the asymptotic distribution of $\tau_n$ under $H_0$ and establish uniform consistency of $\tau_n$ under the alternative hypothesis defined in Section 4.5.
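The moment estimators and the minimization in (4.9) can be sketched as follows. This is a simplified illustration, not the paper's estimator: it uses a cosine basis (one admissible choice of $\{\psi_j\}$) and drops the Sobolev-ball constraint $\| \nu \|_s \leq C_g$. Because the basis is orthonormal, the unconstrained problem then reduces to the linear system $\sum_j \nu_j \hat{c}_{jk} = \hat{a}_k$. The function names and the small linear solver are hypothetical helpers.

```python
import math


def psi(j, x):
    """Cosine basis on [0, 1]: psi_1 = 1, psi_j = sqrt(2) cos((j - 1) pi x)."""
    return 1.0 if j == 1 else math.sqrt(2.0) * math.cos((j - 1) * math.pi * x)


def moment_estimates(Y, X, W, J):
    """Sample-average estimators a_hat_k and c_hat_jk from the data in S_2."""
    n2 = len(Y)  # size of S_2, so 1 / n2 equals 2 / n
    a = [sum(Y[i] * psi(k, W[i]) for i in range(n2)) / n2
         for k in range(1, J + 1)]
    c = [[sum(psi(j, X[i]) * psi(k, W[i]) for i in range(n2)) / n2
          for k in range(1, J + 1)]
         for j in range(1, J + 1)]
    return a, c


def solve(M, rhs):
    """Gaussian elimination with partial pivoting for M x = rhs."""
    n = len(rhs)
    A = [list(row) + [rhs[r]] for r, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= factor * A[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x


def sieve_coefficients(Y, X, W, J):
    """Unconstrained coefficients solving sum_j nu_j c_hat_jk = a_hat_k.

    Assumption: the constraint ||nu||_s <= C_g in (4.9) is ignored here.
    """
    a, c = moment_estimates(Y, X, W, J)
    c_transposed = [[c[j][k] for j in range(J)] for k in range(J)]
    return solve(c_transposed, a)
```

In practice the matrix of $\hat{c}_{jk}$ becomes badly conditioned as $J$ grows, which is the ill-posed inverse problem at work. The norm constraint in (4.9) is what regularizes the paper's estimator, so this unconstrained version should be read only as a starting point.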


Theorem 4.1: Let $f_{XW}$ have $r$ continuous derivatives with respect to any combination of its arguments. Let the regularity conditions of Section 4.4 hold. Then as $n \to \infty$,

$\| \hat{g} - g \| = O_p\left[ J_n^{-s} + J_n^{r} (J_n / n)^{1/2} \right]$. ■

4.5 The Asymptotic Distribution of $\tau_n$ under $H_0$

This section gives the asymptotic distribution of $\tau_n$ under $H_0$. We begin by defining additional notation and stating the assumptions that are used. Let $\| (x_1, w_1) - (x_2, w_2) \|_E$ denote the Euclidean distance between $(x_1, w_1)$ and $(x_2, w_2)$. Let $D_j f_{XW}$ denote any $j$'th partial or mixed partial derivative of $f_{XW}$. Let $D_0 f_{XW}(x, w) = f_{XW}(x, w)$.

Assumption 1: (i) The support of $(X, W)$ is contained in $[0,1]^2$. (ii) $(X, W)$ has a probability density function $f_{XW}$ with respect to Lebesgue measure. (iii) There are an integer $r \geq 2$ and a constant $C_f < \infty$ such that $| D_j f_{XW}(x, w) | \leq C_f$ for all $(x, w) \in [0,1]^2$ and $j = 0, 1, \ldots, r$. (iv) $| D_r f_{XW}(x_1, w_1) - D_r f_{XW}(x_2, w_2) | \leq C_f \| (x_1, w_1) - (x_2, w_2) \|_E$ for any order $r$ derivative and any $(x_1, w_1)$ and $(x_2, w_2)$ in $[0,1]^2$. (v) The operator $T$ is nonsingular.

Assumption 2: $E(Y^2 \mid W = w) \leq C_Y$ for each $w \in [0,1]$ and some constant $C_Y < \infty$.

Assumption 3: (1.1) has a solution $g \in \mathcal{H}_s$ with $\| g \|_s < C_g$ and $s \geq 2r + 1/2$.

Assumption 4: (i) The estimator $\hat{g}$ is as defined in (4.9). (ii) The basis functions $\{\psi_j\}$ are orthonormal and complete on $L_2[0,1]$. Moreover, there are coefficients $\{b_j : j = 1, 2, \ldots\}$ and $\{c_{jk} : j, k = 1, 2, \ldots\}$ such that

$\left\| g - \sum_{j=1}^{J} b_j \psi_j \right\| = O(J^{-s})$

and

$\left\| f_{XW}(x, w) - \sum_{j=1}^{J} \sum_{k=1}^{J} c_{jk} \psi_j(x) \psi_k(w) \right\| = O(J^{-r})$

as $J \to \infty$ for any $g$ and $s$ satisfying Assumption 3.


Assumption 5: The kernel function $K$ used to estimate $f_{XW}$ has the form $K(\xi) = \kappa(\xi^{(1)}) \kappa(\xi^{(2)})$, where $\xi^{(j)}$ is the $j$'th component of the vector $\xi$, $\kappa$ is a symmetrical, twice continuously differentiable function on $[-1, 1]$, and

$\int_{-1}^{1} v^j \kappa(v) \, dv = \begin{cases} 1 & \text{if } j = 0 \\ 0 & \text{if } 1 \leq j \leq r - 1. \end{cases}$

Assumption 6: (i) The bandwidth, $h$, satisfies $h = c_h n^{-1/(2r+2)}$, where $c_h$ is a constant, $0 < c_h < \infty$. (ii) $J_n = C_J n^{\gamma}$ for constants $C_J < \infty$ and $1/(2s) < \gamma < 1/(4r + 1)$.

Assumptions 1 and 2 are smoothness and boundedness conditions. Assumption 3 defines

the null hypothesis. It requires $\| g \|_s < C_g$ (strict inequality) to avoid complications that arise when $g$ is on the boundary of $\mathcal{H}_s$. Deriving the asymptotic distribution of $\tau_n$ when $\| g \|_s = C_g$ is a difficult task that is beyond the scope of this paper. The requirement that $s \geq 2r + 1/2$ is discussed below. Assumption 4(ii) is satisfied by orthogonal polynomials and trigonometric functions. It is also satisfied by B-splines that have been orthogonalized by, say, the Gram-Schmidt procedure. Assumption 5 requires $K$ to be a higher-order kernel if $f_{XW}$ is sufficiently smooth. $K$ can be replaced by a boundary kernel (Gasser and Müller 1979; Gasser, Müller, and Mammitzsch 1985) if $f_{XW}$ does not approach 0 smoothly on the boundary of its support. The rate of convergence of $h$ in Assumption 6(i) is asymptotically optimal for estimating $f_{XW}$. In applications, $h$ can be chosen by cross-validation or any of a variety of other bandwidth selection methods. Assumption 6(ii) requires $J_n$ to increase more rapidly than the asymptotically optimal rate for estimating $g$. This undersmoothing ensures that the truncation bias in the series approximation to $g$ is $o(n^{-1/2})$ as $n \to \infty$. Rapid convergence of the truncation bias is needed because $S_n(z) = O_p(n^{-1/2})$ under $H_0$ for each $z \in [0,1]$, so the $\tau_n$ test will reject $H_0$ due to truncation bias unless this bias converges more rapidly than $n^{-1/2}$. However, undersmoothing of $g$ increases the severity of the ill-posed inverse problem in estimating $g$, thereby decreasing the rate of convergence of $\hat{g}$ and increasing the probability that $\| \hat{g} \|_s = C_g$. This complicates the asymptotic distribution of $\tau_n$ in ways similar to those that occur when $\| g \|_s = C_g$. The conditions $s \geq 2r + 1/2$ in Assumption 3 and $\gamma < 1/(4r + 1)$ in Assumption 6 ensure that $\hat{g}$ converges

sufficiently rapidly to avoid this problem. The question whether a useful test can be constructed

under weaker smoothness assumptions is left to future research.


Now define $\sigma_U^2(w) = E(U^2 \mid W = w)$. For $z_1, z_2 \in [0,1]$ define

$V(z_1, z_2) = E[\sigma_U^2(W) f_{XW}(z_1, W) f_{XW}(z_2, W)]$.

Define the operator $\Omega$ on $L_2[0,1]$ by

(4.10) $(\Omega \nu)(z_2) = \int_0^1 V(z_1, z_2) \nu(z_1) \, dz_1$.

Let $\{\omega_j : j = 1, 2, \ldots\}$ denote the eigenvalues of $\Omega$ sorted so that $\omega_1 \geq \omega_2 \geq \ldots \geq 0$. Let $\{\chi_{1j}^2 : j = 1, 2, \ldots\}$ denote independent random variables that are distributed as chi-square with one degree of freedom. The following theorem gives the asymptotic distribution of $\tau_n$ under $H_0$.

Theorem 4.2: If $H_0$ is true and assumptions 1-5 hold, then

$\tau_n \to^d \sum_{j=1}^{\infty} \omega_j \chi_{1j}^2$. ■

4.6 Obtaining the Critical Value

The statistic $\tau_n$ is not asymptotically pivotal, so its asymptotic distribution cannot be tabulated. This section presents a method, similar to that of Horowitz (2006), for obtaining an approximate asymptotic critical value. The method is based on replacing the asymptotic distribution of $\tau_n$ with an approximate distribution. The difference between the true and approximate distributions can be made arbitrarily small, and the quantiles of the approximate distribution can be estimated consistently. The approximate $1 - \alpha$ critical value of the $\tau_n$ test is a consistent estimator of the $1 - \alpha$ quantile of the approximate distribution.

We now describe the approximate asymptotic distribution of $\tau_n$. Under $H_0$, $\tau_n$ is asymptotically distributed as

$\tau \equiv \sum_{j=1}^{\infty} \omega_j \chi_{1j}^2$.

Given any $\varepsilon > 0$, there is an integer $K_\varepsilon < \infty$ such that

$0 < P\left( \sum_{j=1}^{K_\varepsilon} \omega_j \chi_{1j}^2 \leq t \right) - P(\tau \leq t) < \varepsilon$

uniformly over $t$. Define

$\tau_\varepsilon = \sum_{j=1}^{K_\varepsilon} \omega_j \chi_{1j}^2$.

Let $z_{\varepsilon\alpha}$ denote the $1 - \alpha$ quantile of the distribution of $\tau_\varepsilon$. Then $0 < P(\tau > z_{\varepsilon\alpha}) - \alpha < \varepsilon$. Thus, using $z_{\varepsilon\alpha}$ to approximate the asymptotic $1 - \alpha$ critical value of $\tau_n$ creates an arbitrarily small error in the probability that a correct null hypothesis is rejected. Similarly, use of the approximation creates an arbitrarily small change in the power of the $\tau_n$ test when the null hypothesis is false. The approximate $1 - \alpha$ critical value for the $\tau_n$ test is a consistent estimator of the $1 - \alpha$ quantile of the distribution of $\tau_\varepsilon$. Specifically, let $\hat{\omega}_j$ ($j = 1, 2, \ldots, K_\varepsilon$) be a consistent estimator of $\omega_j$ under $H_0$. Then the estimator of the approximate critical value of $\tau_n$ is the $1 - \alpha$ quantile of the distribution of

$\hat{\tau}_n = \sum_{j=1}^{K_\varepsilon} \hat{\omega}_j \chi_{1j}^2$.

This quantile can be estimated with arbitrary accuracy by simulation.
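The simulation step can be sketched as follows. The block draws independent chi-square variables with one degree of freedom as squared standard normals, forms the weighted sum with a supplied list of estimated eigenvalues, and returns an empirical $1 - \alpha$ quantile. The function name, the number of draws, and the fixed seed are illustrative assumptions.

```python
import math
import random


def simulate_critical_value(omega_hat, alpha=0.05, n_draws=20000, seed=0):
    """Empirical (1 - alpha) quantile of sum_j omega_hat[j] * chi2_1 draws.

    omega_hat holds the K_eps estimated eigenvalues; each chi-square(1)
    draw is generated as the square of a standard normal draw.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        draws.append(sum(w * rng.gauss(0.0, 1.0) ** 2 for w in omega_hat))
    draws.sort()
    index = min(n_draws - 1, int(math.ceil((1.0 - alpha) * n_draws)) - 1)
    return draws[index]


# With a single unit eigenvalue the statistic is chi-square(1), whose
# 0.95 quantile is about 3.84; the simulated value should be close to it.
cv_single = simulate_critical_value([1.0])
```

The test then rejects $H_0$ at level $\alpha$ when the computed $\tau_n$ exceeds the simulated quantile. Increasing `n_draws` reduces the simulation error at proportional cost.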

In applications, $K_\varepsilon$ can be chosen informally by sorting the $\hat\omega_j$'s in decreasing order and plotting them as a function of $j$. They typically plot as random noise near $\hat\omega_j = 0$ when $j$ is sufficiently large. One can choose $K_\varepsilon$ to be a value of $j$ that is near the lower end of the "random noise" range. The rejection probability of the $\tau_n$ test is not highly sensitive to $K_\varepsilon$, so it is not necessary to attempt precision in making the choice.
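The informal procedure above is visual. One rough way to mimic it in code, offered only as a sketch, is to sort the estimated eigenvalues and take the first index at which they fall below a small fraction of the largest one. The function name `choose_K` and the relative tolerance `rel_tol` are assumptions made for this illustration; the paper does not prescribe a numerical threshold.

```python
import numpy as np

def choose_K(omega_hat, rel_tol=1e-3):
    """Return the first j (1-based) at which sorted eigenvalues look like noise.

    rel_tol is a hypothetical cutoff standing in for visual inspection.
    """
    w = np.sort(np.abs(np.asarray(omega_hat, dtype=float)))[::-1]
    noise = np.nonzero(w < rel_tol * w[0])[0]
    # If nothing falls below the cutoff, keep all eigenvalues.
    return int(noise[0]) + 1 if noise.size else int(w.size)
```

Because the rejection probability is reported to be insensitive to $K_\varepsilon$, the exact cutoff should matter little in practice.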

The estimated eigenvalues $\hat\omega_j$ are those of the estimate of $\Omega$ that is defined by

$$(\hat\Omega\nu)(z_2) = \int_0^1 \hat V(z_1, z_2)\,\nu(z_1)\,dz_1 ,$$

where

(4.11) $\hat V(z_1, z_2) = n^{-1}\sum_{i=1}^{n} \hat U_i^2\, \hat f_{XW}^{(-i)}(z_1, W_i)\, \hat f_{XW}^{(-i)}(z_2, W_i)$

and $\hat U_i = Y_i - \hat g(X_i)$. The $\hat\omega_j$'s can be computed easily by using a finite-dimensional series estimator, like (4.8), for $\hat f_{XW}$. The $\hat\omega_j$'s are then the eigenvalues of the finite-dimensional matrix whose $(j, k)$ element is

$$n^{-1}\sum_{i=1}^{n}\sum_{\ell=1}^{L_n}\sum_{m=1}^{L_n} \hat U_i^2\, \hat c_{j\ell}\, \hat c_{km}\, \psi_\ell(W_i)\, \psi_m(W_i),$$

where $L_n$ is the length of the series used in (4.11) to estimate $f_{XW}$.
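The finite-dimensional eigenvalue computation can be illustrated as below. This is a sketch under simplifying assumptions: the names `estimated_eigenvalues`, `U_hat`, `c_hat`, and `Psi_W` are introduced here, and a single coefficient matrix is used for all observations instead of a separate leave-one-out estimate for each $i$.

```python
import numpy as np

def estimated_eigenvalues(U_hat, c_hat, Psi_W):
    """Eigenvalues of the matrix with (j, k) element
    n^{-1} sum_i U_hat[i]^2 * (sum_l c_hat[j, l] psi_l(W_i))
                            * (sum_m c_hat[k, m] psi_m(W_i)).

    U_hat : (n,) residuals; c_hat : (J, L) series coefficients;
    Psi_W : (n, L) with Psi_W[i, l] = psi_l(W_i). All names are illustrative.
    """
    U_hat = np.asarray(U_hat, dtype=float)
    c_hat = np.asarray(c_hat, dtype=float)
    Psi_W = np.asarray(Psi_W, dtype=float)
    n = U_hat.size
    F = Psi_W @ c_hat.T                       # F[i, j] = sum_l c_hat[j, l] psi_l(W_i)
    M = (F * (U_hat ** 2)[:, None]).T @ F / n # symmetric, positive semidefinite
    return np.sort(np.linalg.eigvalsh(M))[::-1]
```

The matrix is symmetric by construction, so a symmetric eigenvalue routine is appropriate and the returned values are real and sorted in decreasing order.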

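To state the properties of the estimated eigenvalues, define

$$\tilde g = \arg\min_{\nu \in \mathcal{H}_{ns}} \|A\nu - m\|$$

and $\tilde U = Y - \tilde g(X)$. Let $\{\tilde\omega_j\}$ be the eigenvalues of the operator that is obtained by replacing $\hat V(z_1, z_2)$ in (4.10) with $E[\tilde U^2 f_{XW}(z_1, W) f_{XW}(z_2, W)]$. Then $\tilde\omega_j \to \omega_j$ as $n \to \infty$ if $H_0$ is true. Let $\tilde z_{\varepsilon\alpha}$ denote the $1-\alpha$ quantile of the distribution of the random variable

$$\tilde\tau_\varepsilon \equiv \sum_{j=1}^{K_\varepsilon} \tilde\omega_j \chi_{1j}^2 ,$$

and let $\hat z_{\varepsilon\alpha}$ denote the $1-\alpha$ quantile of the distribution of $\hat\tau_n$. Note that $\tilde\tau_\varepsilon = \tau_\varepsilon$ if $H_0$ is true. The following theorem shows that $\hat z_{\varepsilon\alpha}$ estimates $\tilde z_{\varepsilon\alpha}$ consistently if $L_n \to \infty$ as $n \to \infty$.

Theorem 4.3: Let assumptions 1-5 hold. As $n \to \infty$, (i) $\sup_{1 \le j \le K_\varepsilon} |\hat\omega_j - \tilde\omega_j| = O[(\log n)/(nh^2)^{1/2}]$ almost surely, (ii) $\sup_{1 \le j \le K_\varepsilon} |\tilde\omega_j - \omega_j| = O(J_n^{-s} + L_n^{-r})$, and (iii) $\hat z_{\varepsilon\alpha} \to^p \tilde z_{\varepsilon\alpha}$. ■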

4.7 Consistency of the $\tau_n$ Test

This section presents a theorem establishing the consistency of the $\tau_n$ test against a fixed alternative hypothesis. The section also shows that for any $\varepsilon > 0$, the $\tau_n$ test rejects $H_0$ with probability exceeding $1-\varepsilon$ uniformly over a large class of alternative hypotheses.

Consistency against a fixed alternative is given by the following theorem. Let $z_\alpha$ denote the $1-\alpha$ quantile of the asymptotic distribution of $\tau_n$ under sampling from the model $Y = g(X) + U$.

Theorem 4.4: Let Assumptions 1, 2, 4, and 5 hold. Then under $H_1$,

$$\lim_{n\to\infty} P(\tau_n > z_\alpha) = 1$$

for any $\alpha$ such that $0 < \alpha < 1$. ■

The conclusion of the theorem also holds if $z_\alpha$ is replaced by the estimated approximate critical value $\hat z_{\varepsilon\alpha}$.

We now consider uniform consistency. Let $A^*$ denote the adjoint of $A$. Define

$$\tilde g = \arg\min_{\nu \in \mathcal{H}_s} \|A\nu - m\| .$$

For each $n = 1, 2, \ldots$ and $C > 0$, define $\mathcal{F}_{nc}$ as the set of distributions of $Y$ conditional on $(X, W)$ satisfying (i) $E(Y^2 \mid W = w) \le C_Y$ for some constant $C_Y < \infty$ and all $w \in [0,1]$, (ii) $\|T\tilde g - A^* m\| \ge C n^{-1/2}$, and (iii) $h^r \|A\tilde g - m\| / \|T\tilde g - A^* m\| = o(1)$ as $n \to \infty$. Condition (ii) rules out alternatives that depend on $x$ only through sequences of eigenvectors of $T$ whose eigenvalues converge to 0 too rapidly. As the example of Section 3 shows, it is not possible to achieve consistency uniformly over these alternatives. Condition (iii) ensures that random sampling errors in $\hat f_{XW}^{(-i)}$ are asymptotically negligible relative to the effects of misspecification. The following theorem states the uniform consistency result.

Theorem 4.5: Let assumptions 1, 2, 4, 5, and 6 hold. Then given any $\delta > 0$, $\alpha$ such that $0 < \alpha < 1$, and any sufficiently large but finite constant $C$,

(4.11) $\lim_{n\to\infty} \inf_{\mathcal{F}_{nc}} P(\tau_n > z_\alpha) \ge 1 - \delta$

and

(4.12) $\lim_{n\to\infty} \inf_{\mathcal{F}_{nc}} P(\tau_n > \hat z_{\varepsilon\alpha}) \ge 1 - 2\delta$. ■

4.8 Alternative Approaches to Testing $H_0$

This section describes some alternative approaches to testing $H_0$ that do not require sample splitting. These approaches may have certain advantages over $\tau_n$ (possibly in terms of power or weaker smoothness assumptions) but are more complicated analytically. Their investigation is left to future research.

The degeneracy problem that is solved by sample-splitting in Section 4.2 is also present in the econometrics literature of the 1990s on testing a parametric or semiparametric model of a conditional mean function against a nonparametric alternative. In its simplest form, the hypothesis tested in that literature is that

(4.13) $E[Y - G(X, \theta) \mid X = x] = 0$

for almost every $x$, some known function $G$, and an unknown finite-dimensional parameter $\theta$ that is estimated from the data. The alternative hypothesis is that there is no $\theta$ satisfying (4.13). Fan and Li (1996) review tests that encounter the degeneracy problem and use sample-splitting to overcome it. Degeneracy and the need for sample-splitting can be avoided by using a test statistic that measures the distance from 0 of an empirical analog of $E[Y - G(X, \theta) \mid X = x]$. Tests that avoid degeneracy this way include, among others, Härdle and Mammen (1993) and Horowitz and Spokoiny (2001) for testing (4.13) and Fan and Li (1996) for testing a semiparametric model of a conditional mean function.

A test of the null hypothesis of this paper that is analogous to the Härdle-Mammen and Horowitz-Spokoiny tests of (4.13) can be based on an empirical analog of the conditional moment $E[Y - g(X) \mid W = w] f_W(w)$. The analog is

$$S_n^*(w) = (nh)^{-1/2} \sum_{i=1}^{n} [Y_i - \hat g(X_i)]\, K\Big(\frac{w - W_i}{h}\Big),$$

where $K$ is a kernel function of a scalar argument. Define $\tau_n^* = \int_0^1 S_n^*(w)^2\,dw$. Under $H_0$, one can expect that $\tau_n^*$ differs from 0 only by random sampling error, whereas $\tau_n^*$ is large if $H_0$ is false. Accordingly, one might use $\tau_n^*$ to test $H_0$. However, deriving the asymptotic distribution of $\tau_n^*$ under $H_0$ requires solving a difficult problem in the theory of empirical U processes and, consequently, is beyond the scope of this paper.
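Although its distribution theory is left open, the statistic $\tau_n^*$ itself is simple to compute. The sketch below is illustrative only: the function name `tau_star`, the Epanechnikov kernel, and the grid-based trapezoid integration are assumptions made here, and the residuals $Y_i - \hat g(X_i)$ are taken as given.

```python
import numpy as np

def tau_star(residuals, W, h, n_grid=401):
    """Compute int_0^1 S*_n(w)^2 dw on a grid, where
    S*_n(w) = (n h)^{-1/2} sum_i residuals[i] * K((w - W_i) / h).

    The Epanechnikov kernel is an assumption; the text only requires
    a kernel function of a scalar argument.
    """
    residuals = np.asarray(residuals, dtype=float)
    W = np.asarray(W, dtype=float)
    n = W.size
    w = np.linspace(0.0, 1.0, n_grid)
    u = (w[:, None] - W[None, :]) / h
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    S = (K @ residuals) / np.sqrt(n * h)
    dw = w[1] - w[0]
    # Trapezoid rule written out explicitly.
    return float(dw * (np.sum(S ** 2) - 0.5 * (S[0] ** 2 + S[-1] ** 2)))
```

The statistic is quadratic in the residuals, so doubling every residual multiplies it by four.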

Another possibility is to base a test on the optimized objective function in (4.9), $\|\hat A\hat g - \hat m\|$. One can also let the series approximation for $m$ be longer than that for $g$, thereby achieving a form of overidentification. However, results obtained with tests based on $\|\hat A\hat g - \hat m\|$ are highly sensitive to the value of $C_G$ and the lengths of the series used to estimate $g$ and $m$. One can obtain virtually any result one wants by choosing these regularization parameters appropriately. Thus, a method for choosing regularization parameters is crucial to the development of any test based on $\|\hat A\hat g - \hat m\|$. The choice of regularization parameters is a major unsolved problem in nonparametric IV estimation. It, too, is beyond the scope of this paper. The $\tau_n$ test described in Section 4.2 requires choosing the regularization parameter $J_n$, but the results of the Monte Carlo experiments discussed in Section 5 suggest that this can be done satisfactorily by using a simple heuristic procedure that is described in that section. Therefore, the need to choose $J_n$ does not present an obstacle to implementation of the $\tau_n$ test.

Finally, suppose there are several instruments, say $W^{(1)}, \ldots, W^{(L)}$ for some $L \ge 2$, and these are believed to satisfy the moment conditions $E[Y - g(X) \mid W^{(\ell)} = w^{(\ell)}] = 0$ $(\ell = 1, \ldots, L)$. Suppose, further, that each moment condition (or, possibly, a subset of more than one but fewer than $L$ conditions) identifies $g$ when such a function exists. Then one can consider testing the hypothesis that all of the moment conditions hold, thereby obtaining a version of the GMM test of overidentifying restrictions. However, such a test asks whether the same $g$ satisfies all the moment conditions, whereas the question being addressed in this paper is whether any $g$ satisfies at least one of the moment conditions. The availability of multiple instruments does not alter the issues that have been discussed concerning tests of the hypothesis that there is a $g$ satisfying at least one of the moment conditions.

5. Monte Carlo Experiments

This section reports the results of a Monte Carlo investigation of the finite-sample performance of the $\tau_n$ test. The experiments use a sample size of 1000. The nominal level of the test is 0.05, and there are 1000 Monte Carlo replications in each experiment.

Realizations of $(X, W)$ were generated by $X = \Phi(\xi)$ and $W = \Phi(\zeta)$, where $\Phi$ is the cumulative normal distribution function, $\zeta \sim N(0,1)$, $\xi = \rho\zeta + (1-\rho^2)^{1/2}\varepsilon$, $\varepsilon \sim N(0,1)$, and $\rho = 0.7$. Realizations of $Y$ were generated from

(5.1) $Y = g(X) + \sigma_U U$,

where $U = \eta\varepsilon + (1-\eta^2)^{1/2}\nu$, $\nu \sim N(0,1)$, $\sigma_U = 0.1$, and $\eta = 0.4$. In experiments where $H_0$ is true, the function $g$ is either

$$g_1(x) = 0.5 + \sum_{j=1}^{\infty} j^{-4} \cos(j\pi x)$$

or

$$g_2(x) = \sum_{j=1}^{\infty} (-1)^{j+1} j^{-2} \sin(j\pi x) .$$

The series were truncated at $j = 100$ for computational purposes. The resulting functions are displayed in Figure 1.
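The design under $H_0$ can be reproduced along the following lines. This is a sketch, not the author's code: the function names (`std_normal_cdf`, `g1`, `g2`, `simulate`) and the seed handling are assumptions, while the parameter values are those stated above.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal CDF evaluated elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(z, dtype=float) / sqrt(2.0)))

def g1(x, n_terms=100):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    j = np.arange(1, n_terms + 1, dtype=float)
    return 0.5 + np.cos(np.pi * np.outer(x, j)) @ j ** -4.0

def g2(x, n_terms=100):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    j = np.arange(1, n_terms + 1, dtype=float)
    signs = np.where(j % 2 == 1, 1.0, -1.0)   # (-1)^(j+1)
    return np.sin(np.pi * np.outer(x, j)) @ (signs * j ** -2.0)

def simulate(n, g, rho=0.7, eta=0.4, sigma_u=0.1, seed=0):
    """Draw (X, W, Y) from the design with H0 true."""
    rng = np.random.default_rng(seed)
    zeta, eps, nu = rng.standard_normal((3, n))
    xi = rho * zeta + sqrt(1.0 - rho ** 2) * eps
    X, W = std_normal_cdf(xi), std_normal_cdf(zeta)
    U = eta * eps + sqrt(1.0 - eta ** 2) * nu  # U shares eps with X, so X is endogenous
    Y = g(X) + sigma_u * U
    return X, W, Y
```

For example, `X, W, Y = simulate(1000, g1)` produces one sample of the size used in the experiments.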

In experiments where $H_0$ is false,

$$g(x) = g_k(x) + \Delta(x); \quad k = 1, 2,$$

where

$$\Delta(x) = \frac{1}{2d}\, I(0.5 - d < x \le 0.5 + d)$$

and $d = 0.02$ or $d = 0.005$, depending on the experiment. The function $\Delta$ places a rectangular spike in $g$. The spike is centered at $x = 0.5$. Its width is $2d$ and its height is $1/(2d)$. This mimics what happens if $g$ is highly non-smooth or (1.1) has no solution. In particular, if (1.1) has no solution, then as $J \to \infty$ the quantity

$$\sum_{j=1}^{J} \frac{\langle q, \phi_j \rangle}{\lambda_j}\, \phi_j(x)$$

becomes unbounded at one or more points $x \in [0,1]$. This behavior is mimicked by $g_k + \Delta$ as $d$ decreases.
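The spike $\Delta$ is straightforward to code. In the sketch below the function name `spike` is an assumption; the definition follows the indicator form given above. The area under the spike is 1 for every $d$, so the spike becomes taller and narrower as $d$ decreases.

```python
import numpy as np

def spike(x, d):
    """Rectangular spike of width 2d and height 1/(2d), centered at 0.5."""
    x = np.asarray(x, dtype=float)
    inside = (x > 0.5 - d) & (x <= 0.5 + d)
    return np.where(inside, 1.0 / (2.0 * d), 0.0)
```

An alternative regression function for the experiments is then `g1(x) + spike(x, d)` or `g2(x) + spike(x, d)`, where `g1` and `g2` stand for user-supplied implementations of the two smooth functions.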

The computation of $\tau_n$ is easiest if the kernel estimator of $f_{XW}$ is replaced by a series expansion that is equivalent up to truncation error. With basis functions $\{\nu_j\}$, this gives

(5.1) $\hat f_{XW}^{(-i)}(x, w) = \sum_{j=1}^{J}\sum_{k=1}^{J} c_{jk}\, \nu_j(x)\, \nu_k(w)$,

where $J$ is the point at which the series is truncated for computational purposes and

$$c_{jk} = \frac{1}{(n-1)h^2} \sum_{\ell=1,\, \ell \ne i}^{n} \int_{[0,1]^2} K\Big(\frac{x - X_\ell}{h}, \frac{w - W_\ell}{h}\Big)\, \nu_j(x)\, \nu_k(w)\, dx\, dw .$$

However, in preliminary Monte Carlo experiments the numerical performance of $\tau_n$ was unaffected by replacing $c_{jk}$ with its limit as $h \to 0$. This gives coefficients

(5.2) $c_{jk} = \frac{1}{n-1} \sum_{\ell=1,\, \ell \ne i}^{n} \nu_j(X_\ell)\, \nu_k(W_\ell)$.

Accordingly, the Monte Carlo experiments reported here use (5.1) and (5.2) to estimate $f_{XW}$. The basis functions are $\nu_1 = 1$ and $\nu_j(x) = \sqrt{2}\cos[(j-1)\pi x]$ for $j \ge 2$. The series was truncated at $J = 25$. Similarly, the asymptotic critical value of $\tau_n$ was estimated by setting $K_\varepsilon = 25$. The results of the experiments are not sensitive to the choice of $J$ and $K_\varepsilon$. The estimated eigenvalues $\hat\omega_j$ are very close to 0 when $j > 25$.
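The series estimator in (5.1) with the limiting coefficients (5.2) can be sketched as follows. The function names (`cosine_basis`, `loo_coefficients`, `f_hat`) are assumptions introduced for illustration; the basis and the leave-one-out averaging follow the formulas above.

```python
import numpy as np

def cosine_basis(x, J):
    """Columns are nu_1 = 1 and nu_j(x) = sqrt(2) cos((j - 1) pi x), j = 2..J."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(J)))
    B[:, 0] = 1.0
    return B

def loo_coefficients(X, W, J, i):
    """Coefficients c_jk of (5.2), omitting observation i."""
    BX, BW = cosine_basis(X, J), cosine_basis(W, J)
    total = BX.T @ BW                       # sums over all observations
    return (total - np.outer(BX[i], BW[i])) / (len(BX) - 1)

def f_hat(x, w, c):
    """Evaluate the series density estimate (5.1) on all pairs (x, w)."""
    J = c.shape[0]
    return cosine_basis(x, J) @ c @ cosine_basis(w, J).T
```

Computing the full sum once and subtracting the contribution of observation `i` avoids recomputing the coefficient matrix from scratch for every leave-one-out estimate.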

We now discuss the choice of the basis functions, $\{\psi_j\}$, and truncation parameter, $J_n$, for the series approximation to $g$. Consider, first, the choice of basis functions. Estimation of $g$ presents an ill-posed inverse problem because $T^{-1}$ is a discontinuous operator. One consequence of this is that with samples of moderate size, it is usually possible to estimate only low-order Fourier coefficients $b_j$ with reasonable precision. Therefore, it is desirable to choose basis functions that provide a good low-order approximation to $g$. Demand functions, Engel curves, and earnings functions, among other functions of interest in applied econometrics, are likely to have few inflection points. Functions with few inflection points are often well-approximated by low-degree polynomials. In preliminary Monte Carlo experiments, we found that approximating these functions accurately with trigonometric or spline bases requires longer series and leads to much noisier estimates. Accordingly, the experiments reported here use Legendre polynomials (centered and scaled to be orthonormal on $[0,1]$) for the basis functions.
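A basis of Legendre polynomials that is orthonormal on $[0,1]$ can be built by shifting and rescaling the standard polynomials on $[-1,1]$. The sketch below is illustrative; the function name `shifted_legendre_basis` is an assumption, and the scaling $\sqrt{2j+1}\,P_j(2x-1)$ is the standard one.

```python
import numpy as np
from numpy.polynomial import legendre

def shifted_legendre_basis(x, J):
    """Return an array whose columns are sqrt(2j + 1) * P_j(2x - 1), j = 0..J-1."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = 2.0 * x - 1.0                       # map [0, 1] onto [-1, 1]
    cols = []
    for j in range(J):
        coef = np.zeros(j + 1)
        coef[j] = 1.0                       # selects the degree-j Legendre polynomial
        cols.append(np.sqrt(2 * j + 1) * legendre.legval(t, coef))
    return np.column_stack(cols)
```

With `J = 4` the columns span the cubic polynomials, which is the approximation used in the experiments reported here.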

Now consider the choice of $J_n$. If $J_n$ is too small, the $\tau_n$ test will tend to reject a true $H_0$ because the truncated series, $\sum_{j=1}^{J_n} b_j\psi_j$, is a poor approximation to $g$. If $J_n$ is too large, $\hat g$ will be a very noisy estimate of $g$ and this, too, will tend to cause rejection of a correct $H_0$. The integrated variance of $\hat g$ is $E\|\hat g - E\hat g\|^2 = \sum_{j=1}^{J_n} \sigma_j^2$, where $\sigma_j^2 = Var(\hat b_j)$ and $\hat b_j$ is the estimator of $b_j$. The variance components $\sigma_j^2$ can be estimated by using the standard formulae of GMM estimation. We have found through Monte Carlo experiments that as $J_n$ increases from 1, $E\|\hat g - E\hat g\|^2$ changes little at first but increases by a factor of 10 or more when $J_n$ crosses a "critical value." This suggests the following heuristic procedure for choosing $J_n$ in applications: choose the largest $J_n$ that does not produce a very large increase in the estimated value of $E\|\hat g - E\hat g\|^2$. In the experiments reported here, the large increase in $E\|\hat g - E\hat g\|^2$ occurs when the degree of the approximating polynomial increases from 3 to 4. Accordingly, the experiments reported here approximate $g$ with a cubic polynomial.
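The heuristic for $J_n$ can be written as a short rule. This is a sketch under stated assumptions: the function name `choose_truncation` is hypothetical, the input is a list of estimated integrated variances for $J_n = 1, 2, \ldots$, and the jump threshold of 10 is taken from the "factor of 10 or more" observed in the experiments rather than from any formal result.

```python
def choose_truncation(integrated_variances, jump_factor=10.0):
    """Largest J_n before the estimated integrated variance jumps.

    integrated_variances[k] is the estimate of E||g_hat - E g_hat||^2
    obtained with J_n = k + 1 basis functions.
    """
    for k in range(1, len(integrated_variances)):
        if integrated_variances[k] >= jump_factor * integrated_variances[k - 1]:
            return k          # J_n = k is the last value before the jump
    return len(integrated_variances)
```

For instance, if the variance jumps when moving from four basis functions (a cubic polynomial) to five (a quartic), the rule returns 4.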

As in Blundell, Chen, and Kristensen (2007), it is convenient for computational purposes to replace the constrained estimation problem (4.9) with a penalized estimator. However, penalization has little effect on the results of the experiments. Accordingly, the results reported here are based on solving (4.9) without imposing the constraint $\nu \in \mathcal{H}_{ns}$.

The results of the experiments are shown in Table 1. When $H_0$ is true and $g = g_1$, the difference between the empirical and nominal levels of the $\tau_n$ test is very small. The difference is somewhat larger when $g = g_2$ because the error in the cubic polynomial approximation to $g_2$ is larger than the error in the cubic approximation to $g_1$. The use of a quartic or higher-degree polynomial reduces the approximation error when $g = g_2$ but increases $E\|\hat g - E\hat g\|^2$ by a factor of over 100. This illustrates the importance of choosing a basis that provides a good low-order approximation to $g$. The probability of rejecting $H_0$ when it is false is high in all cases. It is higher when $d = 0.02$ than when $d = 0.005$ because there are more data points in the interval containing the spike when $d = 0.02$.

6. Conclusions

This paper has been concerned with uniformly consistent testing of the null hypothesis that the identifying equation of nonparametric IV estimation has a solution. The paper has shown that no test can be uniformly consistent over all probability distributions of $(Y, X, W)$ for which the identifying equation has no solution. No matter how large the sample is, there are alternative distributions that depart from the null hypothesis in extreme ways but against which any test has low power. Uniformly consistent testing is possible if the null and alternative hypotheses are restricted in appropriate ways. In this paper, the null hypothesis is restricted by assuming that any solution to the identifying equation is smooth. The paper has presented a test of the hypothesis that a smooth solution exists. The test is uniformly consistent against a large class of distributions of $(Y, X, W)$ for which no smooth solution exists. Monte Carlo experiments have illustrated the test's finite sample performance as well as certain limitations of the test. The paper has also outlined several other testing approaches that are more complicated analytically than the one developed here but may have some advantages. The investigation of these tests is left to future research.

7. Appendix: Proofs of Theorems

Proof of Proposition 3.1: $LR_0$ is continuously distributed, so for each $\varepsilon > 0$, there is a $\delta > 0$ such that $P(LR_0 \le c_{n\alpha} - \delta) \ge P(LR_0 \le c_{n\alpha}) - \varepsilon = 1 - \alpha - \varepsilon$. Choose $J_0$ so that

0

1/ 2

12 j

j Jn λ δ

∞

= +

⎛ ⎞⎜ ⎟ <⎜ ⎟⎝ ⎠∑ .

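Then

$$\begin{aligned}
P(LR \le c_{n\alpha} \mid H_1) &= P(LR_1 \le c_{n\alpha}) \\
&= P[LR_0 + (LR_1 - LR_0) \le c_{n\alpha}] \\
&= P[LR_0 \le c_{n\alpha} - (LR_1 - LR_0)] \\
&\ge P(LR_0 \le c_{n\alpha} - \delta) \\
&\ge 1 - \alpha - \varepsilon .
\end{aligned}$$

Therefore,

$$P(LR > c_{n\alpha} \mid H_1) = 1 - P(LR \le c_{n\alpha} \mid H_1) \le \alpha + \varepsilon .$$

Q.E.D.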

Proof of Theorem 4.1:

Define

$$g_n = \sum_{j=1}^{J_n} b_j \psi_j .$$

Let $A_n$ be the operator whose kernel is

$$a_n(x, w) = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n} c_{jk}\, \psi_j(x)\, \psi_k(w) .$$

Also define

$$\hat a_n(x, w) = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n} \hat c_{jk}\, \psi_j(x)\, \psi_k(w),$$

$$a(x, w) = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} c_{jk}\, \psi_j(x)\, \psi_k(w),$$

$m_n = A_n g_n$, and

$$\rho_n = \sup_{h \in \mathcal{H}_{ns}} \frac{\|h\|}{\|(A^*A)^{1/2} h\|} .$$

Let $\mu_j$ denote the $j$'th singular value of $A$. For sequences of numbers $\{c_n\}$ and $\{d_n\}$, write $c_n \asymp d_n$ to mean that $c_n/d_n$ is bounded away from 0 and $\infty$ as $n \to \infty$.

Lemma 1: (i) $\mu_j \asymp j^{-r}$, $\rho_n = O(J_n^r)$, $\|A(g_n - g)\| = O(J_n^{-s-r})$, and $\|m - E\hat m\| = O(J_n^{-r-s})$. (ii) $\|\hat a_n - a_n\| = O_p(J_n/n^{1/2})$, and (iii) $\sup_{\nu \in \mathcal{H}_s} \|(A_n - A)\nu\| = O(J_n^{-r-s})$.

Proof: Part (i) is a slight modification of Theorem 3 of Blundell, Chen, and Kristensen (2007) and is proved the same way as that theorem. Part (ii) follows from Markov's inequality and the observation that $E\|\hat a_n - a_n\|^2 = \sum_{j,k=1}^{J_n} Var(\hat c_{jk}) = O(J_n^2/n)$.

Now consider part (iii). Write

$$\nu(x) = \sum_{j=1}^{\infty} \nu_j \psi_j(x),$$

where the $\nu_j$'s are Fourier coefficients. We have

$$A\nu = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} \nu_j c_{jk} \psi_k .$$

Observe that $A\nu$ belongs to a Sobolev space of smoothness $r + s$. Define

$$\tilde A_n \nu = \sum_{j=1}^{\infty}\sum_{k=1}^{J_n} \nu_j c_{jk} \psi_k .$$

Then $\|(\tilde A_n - A)\nu\| \le C_A J_n^{-r-s}$ for some constant $C_A$ not depending on $\nu$. Note that

$$A_n \nu = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n} \nu_j c_{jk} \psi_k .$$

Define

$$\tilde\nu(x) = \sum_{j=1}^{J_n} \nu_j \psi_j(x) .$$

Then

$$A\tilde\nu = \sum_{j=1}^{J_n}\sum_{k=1}^{\infty} \nu_j c_{jk} \psi_k .$$

Now

$$(A_n - A)\nu = (\tilde A_n - A)\nu - \sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n} \nu_j c_{jk} \psi_k ,$$

so

$$\begin{aligned}
\|(A_n - A)\nu\| &\le \|(\tilde A_n - A)\nu\| + \Big\|\sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n} \nu_j c_{jk} \psi_k\Big\| \\
&\le C_A J_n^{-r-s} + \Big\|\sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n} \nu_j c_{jk} \psi_k\Big\| \\
&= R_n + C_A J_n^{-r-s}, \qquad (7.1)
\end{aligned}$$

where

$$R_n = \Big\|\sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n} \nu_j c_{jk} \psi_k\Big\| .$$

Note that

$$R_n^2 = \sum_{k=1}^{J_n}\Big(\sum_{j=J_n+1}^{\infty} \nu_j c_{jk}\Big)^2 .$$

In addition

$$A(\nu - \tilde\nu) = \sum_{j=J_n+1}^{\infty}\sum_{k=1}^{\infty} \nu_j c_{jk} \psi_k ,$$

so

$$\|A(\nu - \tilde\nu)\|^2 = \sum_{k=1}^{\infty}\Big(\sum_{j=J_n+1}^{\infty} \nu_j c_{jk}\Big)^2$$

and

(7.2) $R_n^2 \le \|A(\nu - \tilde\nu)\|^2$.

Now use the argument on pp. 1664-1665 of Blundell, Chen, and Kristensen (2007) to get

(7.3) $\|A(\nu - \tilde\nu)\| = O(J_n^{-r-s})$.

Finally, combine (7.1)-(7.3) to get the result. Q.E.D.

Lemma 2: $\|\hat g - g_n\| = O_p[J_n^{-s} + J_n^r (J_n/n)^{1/2}]$.

Proof: We have

(7.4) $\|\hat g - g_n\| \le \rho_n \|A(\hat g - g_n)\|$.

But $P(\hat A\hat g = \hat m) \to 1$ as $n \to \infty$. Therefore, with probability approaching 1,

$$\begin{aligned}
A(\hat g - g_n) &= A\hat g - A g_n \\
&= (A - \hat A) g_n + \hat m - m - A(g_n - g) + (A - \hat A)(\hat g - g_n) .
\end{aligned}$$

The triangle inequality gives

$$\begin{aligned}
\|A(\hat g - g_n)\| &\le \|(\hat A - A) g_n\| + \|\hat m - m\| + \|A(g_n - g)\| + \|(\hat A - A)(\hat g - g_n)\| \\
&\le \|(\hat A - A_n) g_n\| + \|(A_n - A) g_n\| + \|\hat m - m\| + \|A(g_n - g)\| + \|(\hat A - A)(\hat g - g_n)\| \\
&\le \|(\hat A - A_n) g_n\| + \|(A_n - A) g_n\| + \|\hat m - m\| + \|A(g_n - g)\| + \|(\hat A_n - A_n)(\hat g - g_n)\| \\
&\qquad + \|(A_n - A)(\hat g - g_n)\| \\
&\le \|(\hat A - A_n) g_n\| + \|(A_n - A) g_n\| + \|\hat m - m\| + \|A(g_n - g)\| + \rho_n \|\hat A_n - A_n\|\, \|A(\hat g - g_n)\| \\
&\qquad + \|(A_n - A)(\hat g - g_n)\| . \qquad (7.5)
\end{aligned}$$

Straightforward calculations show that $\|(\hat A - A_n) g_n\| = O_p(J_n/n^{1/2})$ (should be $(J_n/n)^{1/2}$) and $\|\hat m - E\hat m\| = O_p[(J_n/n)^{1/2}]$. Lemma 1 gives the results that $\|(A_n - A) g_n\| = O(J_n^{-r-s})$, $\|(A_n - A)(\hat g - g_n)\| = O_p(J_n^{-r-s})$, $\|E\hat m - m\| = O(J_n^{-r-s})$, $\|A(g_n - g)\| = O(J_n^{-r-s})$, and $\rho_n \|\hat A_n - A_n\| = o_p(1)$. Substituting these rates into (7.5) gives

(7.6) $\|A(\hat g - g_n)\| = O_p[J_n^{-r-s} + (J_n/n)^{1/2}]$.

The lemma follows by combining (7.4), (7.6) and lemma 1. Q.E.D.

Proof of Theorem 4.1: By the triangle inequality,

(7.7) $\|\hat g - g\| \le \|\hat g - g_n\| + \|g_n - g\|$.

The theorem follows by combining (7.7) with lemma 2 and Assumption 4(ii). Q.E.D.

7.1 Proofs of Theorems 4.2 and 4.3

Let $H_0$ be true. Define

$$S_{n1}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} U_i f_{XW}(z, W_i),$$

$$S_{n2}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [g(X_i) - \hat g(X_i)]\, f_{XW}(z, W_i),$$

$$S_{n3}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} U_i\, [\hat f_{XW}^{(-i)}(z, W_i) - f_{XW}(z, W_i)],$$

and

$$S_{n4}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [g(X_i) - \hat g(X_i)]\, [\hat f_{XW}^{(-i)}(z, W_i) - f_{XW}(z, W_i)] .$$

Then $S_n(z) = \sum_{j=1}^{4} S_{nj}(z)$.

Lemma 3: As $n \to \infty$, $S_{n3}(z) = o_p(1)$ uniformly over $z \in [0,1]$.

Proof: See Lemma 3 of Horowitz (2006). Q.E.D.

Lemma 4: As $n \to \infty$, $S_{n4}(z) = o_p(1)$ uniformly over $z \in [0,1]$.

Proof: Standard properties of kernel estimators give the result that

$$\max_{1 \le i \le n}\ \sup_{(x, w) \in [0,1]^2} |\hat f_{XW}^{(-i)}(x, w) - f_{XW}(x, w)| = O\Big[\frac{\log n}{(nh^2)^{1/2}}\Big]$$

almost surely. Therefore,

$$\sup_{z \in [0,1]} |S_{n4}(z)| \le R_n \sum_{i \in \mathcal{S}_1} |\hat g(X_i) - g(X_i)| ,$$

where $R_n = O[(\log n)/(nh)]$ almost surely. For $i \in \mathcal{S}_1$, $X_i$ is independent of $\hat g$. By Theorem 2.4 of van de Geer (2000), the class of functions $\{|H - g| : H \in \mathcal{H}_r\}$ for fixed $g$ satisfies the conditions of the uniform law of large numbers (van de Geer 2000, Lemma 3.1). Therefore,

$$n^{-1} \sum_{i \in \mathcal{S}_1} |\hat g(X_i) - g(X_i)| \to \int_0^1 |\hat g(x) - g(x)|\, f_X(x)\, dx$$

almost surely. But

$$\int_0^1 |\hat g(x) - g(x)|\, f_X(x)\, dx \le C_4 \|\hat g - g\|$$

for some constant $C_4 < \infty$ by the Cauchy-Schwarz inequality. Therefore,

$$\sup_{z \in [0,1]} |S_{n4}(z)| = O_p\Big[\frac{\log n}{h}\, \|\hat g - g\|\Big] = o_p(1) .$$

Q.E.D.

Lemma 5: As $n \to \infty$,

$$S_{n2}(z) = -(2/n)^{1/2} \sum_{i \in \mathcal{S}_2} U_i f_{XW}(z, W_i) + r_n ,$$

where $r_n = o_p(1)$.

Proof: For [0,1]z∈ and sν ∈H , define the empirical process

2

2

1/ 2

1/ 2

( , ) (2 / ) [ ( , ) ( ) ( , ) ( )]

(2 / ) [ ( , ) ( ) ( )( )].

n XW i i XW XWi

XW i ii

H z n f z W X E f z W X

n f z W X T z

ν ν ν

ν ν

∈

∈

= −

= −

∑

∑

S

S

$H_n$ is stochastically equicontinuous by Theorem 5.12 of van de Geer (2000). Therefore, because $\hat g \in \mathcal{H}_s$, for each $\varepsilon > 0$ and $\eta > 0$ there is a $\delta > 0$ such that

$$P\Big[\sup_{z \in [0,1]} |H_n(z, \hat g) - H_n(z, g)| > \eta\Big] < \varepsilon$$

whenever $\|\hat g - g\| < \delta$. Because $\|\hat g - g\| = o_p(1)$, it follows that $H_n(z, \hat g) - H_n(z, g) = o_p(1)$ uniformly over $z \in [0,1]$ and

(7.8) $S_{n2}(z) = -(n/2)^{1/2} [T(\hat g - g)](z) + o_p(1)$

uniformly over $z \in [0,1]$.

Now ˆ ˆ ˆAg m= with probability approaching 1 as n →∞ . Some algebra now shows that

1

ˆ ˆˆ ˆ ˆ( ) ( )( )

ˆˆ ,n

A g g Ag r A A g g

r Ag r

− = − + + − −

= − +

where 1/ 21 ( )n pr o n−= . But

1 1

ˆˆ ˆ ˆ( )n nJ J

k j jk kk j

r Ag a b c ψ= =

− = −∑ ∑

By the definitions of ˆka and ˆ jkc ,

21 1

ˆ ˆ (2 / ) [ ( )] ( )n nJ J

k j jk i j j i k ij i j

a b c n Y b X Wψ ψ= ∈ =

− = −∑ ∑ ∑S

.

Define 1

nJj jj

g b gψ=

Δ = −∑ . Then because ( )i i iY g X U= + ,

(7.9) 2 21 1

ˆˆ( )( ) (2 / ) ( ) ( ) (2 / ) ( ) ( ) ( )n nJ J

i k i k i k i ki k i k

r Ag w n U W w n g X W wψ ψ ψ ψ∈ = ∈ =

− = − Δ∑∑ ∑∑S S

.

Define

2 1

( ) (2 / ) ( ) ( ) ( )nJ

n i k i ki k

D w n g X W wψ ψ∈ =

= Δ∑∑S

.

Then

2

2 2 2

22 2

1

2 2 2 2

1 1

1 2

(2 / ) ( ) ( )

(2 / ) ( ) ( ) (2 / ) [ ( ) ( )][ ( ) ( )]

.

n

n n

J

n i k ik i

J J

i k i i k i j k jk i k i j

j i

n n

D n g X W

n g X W n g X W g X W

D D

ψ

ψ ψ ψ

= ∈

= ∈ = ∈ ∈≠

⎡ ⎤⎢ ⎥= Δ⎢ ⎥⎣ ⎦

= Δ + Δ Δ

≡ +

∑ ∑

∑∑ ∑∑ ∑

S

S S S

Now


2

2 21 [0,1]

1

1 20

1

2

1

(2 / ) ( ) ( ) ( , )

(2 / ) ( )

(2 / )

( ).

n

n

J

n k XWk

J

fk

f n

ED n g x w f x w dxdw

n C g x dx

n C J g

o n

ψ=

=

−

= Δ

≤ Δ

= Δ

=

∑∫

∑∫

Therefore, Markov’s inequality gives

(7.10) 11 ( )n pD o n−= .

Moreover,

22

1[1 (2 / )] [ ( ) ( )]

nJ

n kk

ED n E g X Wψ=

= − Δ∑ .

But

2[0,1]

1

0

( ) ( ) ( ) ( ) ( , )

( ) ( ) ,

k k XW

k

E g X W g x w f x w dxdw

g x x dx

ψ ψ

δ

Δ = Δ

= Δ

∫

∫.

where 1

0( ) ( ) ( , )k k XWx w f x w dwδ ψ= ∫ .

Therefore, by the Cauchy-Schwarz inequality 122 20

[ ( ) ( )] ( )k kE g X W g x dxψ δΔ ≤ Δ ∫ ,

and

12 22 0

1[1 (2 / )] ( )

nJ

n kk

ED n g x dxδ=

≤ − Δ ∑∫ .

But ( )k xδ is the k ’th Fourier coefficient of ( , )XWf x ⋅ , and | ( , ) |XW ff x w C≤ . Therefore,


222

2

1

[1 (2 / )]

( )

( ).

n f

sn

ED n C g

O J

o n

−

−

≤ − Δ

=

=

Markov’s inequality now gives

(7.11) 12 ( )n pD o n−= .

Combining (7.9)-(7.11) yields

(7.12) 2

21

ˆˆ( )( ) (2 / ) ( ) ( )nJ

i k i k ni k

r Ag w n U W w rψ ψ∈ =

− = +∑∑S

,

where 1/ 22 ( )n pr o n−= .

Now

*ˆ ˆ( ) ( )T g g A A g g− = −

so

(7.13) *3

ˆˆ ˆ( ) ( ) ,nT g g A r Ag r− = − +

where 1/ 23 ( )n pr o n−= . By (7.12)

2

* *4

1

ˆˆ( )( ) (2 / ) ( )( )( ) ,nJ

i k i k ni k

A r Ag z n U W A z rψ ψ∈ =

− = +∑∑S

where 1/ 24 ( )n pr o n−= . Now

1*0

1

( )( ) ( , ) ( )

( ).

k XW k

jk jj

A z f z w w dw

c z

ψ ψ

ψ∞

=

=

=

∫

∑

Therefore,

2

*4

1 1

ˆˆ( )( ) (2 / ) ( ) ( )nJ

i jk j k i ni j k

A r Ag z n U c z W rψ ψ∞

∈ = =

− = +∑∑∑S

.

Define 1 1

( , ) ( ) ( ) ( , )nJXW jk j k XWj k

f x w c x w f x wψ ψ∞

= =Δ = −∑ ∑ . Then


2 2

2

*4

5 4

ˆˆ( )( ) (2 / ) ( , ) (2 / ) ( , )

(2 / ) ( , ) ( ) .

i XW i i XW i ni i

i XW i n ni

A r Ag z n U f z W n U f z W r

n U f z W r z r

∈ ∈

∈

− = + Δ +

≡ + +

∑ ∑

∑

S S

S

Note that

2

2 2[0,1]

[ ( , )] ( )rXW nf x w dxdw O J −Δ =∫ .

Therefore,

{ }2

2 2 25 [0,1]

1 2

(4 / ) ( )[ ( , )] ( )

( ),

n XW W

rn

r n w f x w f w dxdw

o n J

σ

− −

= Δ

=

∫

and

(7.14) 2

*6

ˆˆ( )( ) (2 / ) ( , ) ,i XW i ni

A r Ag z n U f z W r∈

− = +∑S

where 1/ 26 ( )n pr o n−= . Combining (7.8), (7.13), and (7.14) yields the lemma. Q.E.D.

Proof of Theorem 4.2: Define $U_i^* = U_i$ if $i \in \mathcal{S}_1$ and $U_i^* = -U_i$ if $i \in \mathcal{S}_2$. Define

$$B_n(z) = (2/n)^{1/2} \sum_{i=1}^{n} U_i^* f_{XW}(z, W_i) .$$

It follows from lemmas 3-5 that $\tau_n$ is asymptotically distributed as $\|B_n\|^2$, which is a degenerate U-statistic of order 2. The theorem follows from the asymptotic distribution of such a statistic. See, for example, Serfling (1980, pp. 193-194). Q.E.D.

Proof of Theorem 4.3: $|\hat\omega_j - \tilde\omega_j| = O(\|\hat\Omega - \tilde\Omega\|)$ by Theorem 5.1a of Bhatia, Davis, and McIntosh (1983). Moreover, standard calculations for kernel density estimators show that $\|\hat\Omega - \tilde\Omega\| = O[(\log n)/(nh^2)^{1/2}]$. Part (i) of the theorem follows by combining these two results. Part (ii) is a consequence of Assumption 4(ii). Part (iii) follows by combining parts (i) and (ii). Q.E.D.

7.2 Proofs of Theorems 4.4 and 4.5

Redefine

$$S_{n1}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [Y_i - \tilde g(X_i)]\, f_{XW}(z, W_i),$$

$$S_{n2}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [\tilde g(X_i) - \hat g(X_i)]\, f_{XW}(z, W_i),$$

$$S_{n3}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [Y_i - \tilde g(X_i)]\, [\hat f_{XW}^{(-i)}(z, W_i) - f_{XW}(z, W_i)],$$

and

$$S_{n4}(z) = (2/n)^{1/2} \sum_{i \in \mathcal{S}_1} [\tilde g(X_i) - \hat g(X_i)]\, [\hat f_{XW}^{(-i)}(z, W_i) - f_{XW}(z, W_i)] .$$

Proof of Theorem 4.4: It suffices to show that under $H_1$,

$$\operatorname{plim}_{n \to \infty} n^{-1}\tau_n > 0 .$$

Arguments like those leading to (7.8) show that $n^{-1}\|S_{n2}\|^2 = o_p(1)$. It is clear that $n^{-1}\|S_{n3}\|^2 = o_p(1)$ and $n^{-1}\|S_{n4}\|^2 = o_p(1)$. Therefore, $n^{-1}\tau_n = n^{-1}\|S_{n1}\|^2 + o_p(1)$. But $n^{-1/2} S_{n1}(z) \to (A^*m - T\tilde g)(z)$ almost surely uniformly in $z \in [0,1]$ by Jennrich's (1969) uniform strong law of large numbers. Therefore, $n^{-1}\tau_n \to^p \|A^*m - T\tilde g\|^2$. The theorem follows from the fact that $\|A^*m - T\tilde g\|^2 > 0$ under $H_1$. Q.E.D.

Proof of Theorem 4.5: We prove (4.11). The proof of (4.12) is similar. Define $D_n = S_{n2} + S_{n4} + E(S_{n1} + S_{n3})$ and $\tilde S_n = S_n - D_n$. Then $\tau_n = \|\tilde S_n + D_n\|^2$. Use the inequality

(7.15) $a^2 \ge 0.5 b^2 - (b - a)^2$

with $a = S_n$ and $b = D_n$ to obtain

$$P(\tau_n > z_\alpha) \ge P\big(0.5\|D_n\|^2 - \|\tilde S_n\|^2 > z_\alpha\big) .$$
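For completeness, inequality (7.15) follows from a standard two-line argument. The derivation below is a sketch written in norm notation, under the assumption that $a$ and $b$ are elements of a normed space (for real numbers, read $\|\cdot\|$ as absolute value).

```latex
\begin{aligned}
\|b\|^2 &= \|a + (b - a)\|^2
         \le \big(\|a\| + \|b - a\|\big)^2
         \le 2\|a\|^2 + 2\|b - a\|^2 , \\
\text{so}\quad \|a\|^2 &\ge 0.5\,\|b\|^2 - \|b - a\|^2 .
\end{aligned}
```

The second inequality in the first line uses $(u + v)^2 \le 2u^2 + 2v^2$ for real $u$ and $v$.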

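For any finite $M > 0$,

$$\begin{aligned}
P\big(0.5\|D_n\|^2 - \|\tilde S_n\|^2 \le z_\alpha\big) &= P\big(0.5\|D_n\|^2 \le z_\alpha + \|\tilde S_n\|^2,\ \|\tilde S_n\|^2 \le M\big) \\
&\quad + P\big(0.5\|D_n\|^2 \le z_\alpha + \|\tilde S_n\|^2,\ \|\tilde S_n\|^2 > M\big) \\
&\le P\big(0.5\|D_n\|^2 \le z_\alpha + M\big) + P\big(\|\tilde S_n\|^2 > M\big) .
\end{aligned}$$

$\tilde S_n$ is bounded in probability uniformly over $\mathcal{F}_{nc}$. Therefore, for each $\varepsilon > 0$ there is $M_\varepsilon < \infty$ such that for all $M > M_\varepsilon$

$$P\big(0.5\|D_n\|^2 - \|\tilde S_n\|^2 \le z_\alpha\big) \le P\big(.5\|D_n\|^2 \le z_\alpha + M\big) + \varepsilon .$$

Equivalently,

$$P\big(0.5\|D_n\|^2 - \|\tilde S_n\|^2 > z_\alpha\big) \ge P\big(.5\|D_n\|^2 > z_\alpha + M\big) - \varepsilon$$

and

(7.16) $P(\tau_n > z_\alpha) \ge P\big(.5\|D_n\|^2 > z_\alpha + M\big) - \varepsilon$.

Now a further application of (7.15) with $a = D_n$ and $b = E(S_{n1} + S_{n3})$ gives

$$\|D_n\|^2 \ge 0.5\|E(S_{n1} + S_{n3})\|^2 - \|S_{n2} + S_{n4}\|^2 .$$

Some algebra shows that $\|S_{n2} + S_{n4}\|^2 = O_p(1)$ as $n \to \infty$, $ES_{n1}(z) = (n/2)^{1/2}(T\tilde g - A^*m)(z)$, and $\|ES_{n3}\| = O(n^{1/2} h^r \|A\tilde g - m\|)$. Therefore,

(7.17) $\|D_n\|^2 \ge .125\, n\, \|T\tilde g - A^*m\|^2 + O_p(1)$

uniformly over $\mathcal{F}_{nc}$ as $n \to \infty$. Inequality (4.11) follows by substituting (7.17) into (7.16) and choosing $C$ to be sufficiently large. Q.E.D.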


REFERENCES

Bhatia, R., C. Davis, and A. McIntosh (1983). Perturbation of spectral subspaces and solution of linear operator equations, Linear Algebra and Its Applications, 52/53, 45-67.

Blundell, R., X. Chen, and D. Kristensen (2007). Semi-nonparametric IV estimation of shape-invariant Engel curves, Econometrica, 75, 1613-1669.

Chen, X. and D. Pouzo (2008). Estimation of nonparametric conditional moment models with possibly nonsmooth moments, working paper, Department of Economics, Yale University, New Haven, CT.

Chernozhukov, V., P. Gagliardini, and O. Scaillet (2008). Nonparametric instrumental variable estimation of quantile structural effects, working paper, Department of Economics, Massachusetts Institute of Technology, Cambridge, MA.

Darolles, S., J.-P. Florens, and E. Renault (2006). Nonparametric instrumental regression, working paper, GREMAQ, University of Social Science, Toulouse, France.

Fan, Y. and Q. Li (1996). Consistent model specification tests: omitted variables and semiparametric functional forms, Econometrica, 64, 865-890.

Gasser, T. and H.G. Müller (1979). Kernel Estimation of Regression Functions, in Smoothing Techniques for Curve Estimation. Lecture Notes in Mathematics, 757, 23-68. New York: Springer.

Gasser, T., H.G. Müller, and V. Mammitzsch (1985). Kernels and Nonparametric Curve Estimation, Journal of the Royal Statistical Society Series B, 47, 238-252.

Härdle, W. and E. Mammen (1993). Comparing nonparametric versus parametric regression fits, Annals of Statistics, 21, 1926-1947.

Hall, P. and J.L. Horowitz (2005). Nonparametric methods for inference in the presence of instrumental variables, Annals of Statistics, 33, 2904-2929.

Horowitz, J.L. (2006). Testing a parametric model against a nonparametric alternative with identification through instrumental variables, Econometrica, 521-538.

Horowitz, J.L. and S. Lee (2007). Nonparametric instrumental variables estimation of a quantile regression model, Econometrica, 75, 1191-1208.

Horowitz, J.L. and V.G. Spokoiny (2001). An adaptive rate-optimal test of a parametric mean-regression model against a nonparametric alternative, Econometrica, 69, 599-631.

Jennrich, R.I. (1969). Asymptotic properties of non-linear least squares estimators, Annals of Mathematical Statistics, 40, 633-643.

Kress, R. (1999). Linear Integral Equations, 2nd edition, New York: Springer-Verlag.

Newey, W.K. and J.L. Powell (2003). Instrumental variable estimation of nonparametric models, Econometrica, 71, 1565-1578.

Newey, W.K., J.L. Powell, and F. Vella (1999). Nonparametric estimation of triangular simultaneous equations models, Econometrica, 67, 565-603.

Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics, New York: Wiley.

van de Geer, S. (2000). Empirical Processes in M-Estimation, Cambridge, U.K.: Cambridge University Press.


TABLE 1: RESULTS OF MONTE CARLO EXPERIMENTS

                g under H_0    d        Empirical Rejection Probability
H_0 True        g_1            n.a.     0.053
                g_2            n.a.     0.070
H_0 False       g_1            0.02     0.895
                g_1            0.005    0.795
                g_2            0.02     0.896
                g_2            0.005    0.799


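[Figure 1a: Graph of $g_1(x)$. Horizontal axis: $X$ from 0 to 1. Vertical axis: $g(X)$ from -0.5 to 1.5.]

[Figure 1b: Graph of $g_2(x)$. Horizontal axis: $X$ from 0 to 1. Vertical axis: $g(X)$ from 0 to 1.]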
