SPECIFICATION TESTING IN NONPARAMETRIC INSTRUMENTAL VARIABLES ESTIMATION
by
Joel L. Horowitz
Department of Economics, Northwestern University
Evanston, IL 60208 USA
August 2008
ABSTRACT
In nonparametric instrumental variables estimation, the function being estimated is the solution to an integral equation. A solution may not exist if, for example, the instrument is not valid. This paper discusses the problem of testing the null hypothesis that a solution exists against the alternative that there is no solution. We give necessary and sufficient conditions for existence of a solution and show that uniformly consistent testing of an unrestricted null hypothesis is not possible. Uniformly consistent testing is possible, however, if the null hypothesis is restricted by assuming that any solution to the integral equation is smooth. Many functions of interest in applied econometrics, including demand functions and Engel curves, are expected to be smooth. The paper presents a statistic for testing the null hypothesis that a smooth solution exists. The test is consistent uniformly over a large class of probability distributions of the observable random variables for which the integral equation has no smooth solution. The finite-sample performance of the test is illustrated through Monte Carlo experiments.

Key Words: Inverse problem, instrumental variable, series estimator, linear operator

JEL Listing: C12, C14

I thank Whitney Newey for asking a question that led to this paper. Hidehiko Ichimura, Sokbae Lee, and Whitney Newey provided helpful comments. This research was supported in part by NSF grants SES-0352675 and DMS-0706348.
SPECIFICATION TESTING IN NONPARAMETRIC INSTRUMENTAL VARIABLES ESTIMATION
1. Introduction
Nonparametric instrumental variables (IV) estimation consists of estimating the unknown function $g$ that is identified by the relation

(1.1) $E[Y - g(X) \mid W = w] = 0$

for almost every $w$ in the support of the random variable $W$. Equivalently, $g$ satisfies

(1.2) $Y = g(X) + U; \quad E(U \mid W = w) = 0$
for almost every $w$. In this model, $Y$ is a scalar dependent variable, $X$ is a continuously distributed explanatory variable that may be endogenous (that is, $E(U \mid X = x)$ may not be zero), $W$ is an instrument for $X$, and $U$ is an unobserved random variable. The function $g$ is assumed to satisfy mild regularity conditions but does not belong to a known, finite-dimensional parametric family. The data are an independent random sample from the distribution of $(Y, X, W)$.
Methods for estimating g in (1.1) have been developed by Newey and Powell (2003);
Darolles, Florens, and Renault (2006); Hall and Horowitz (2005); and Blundell, Chen, and
Kristensen (2007). Newey, Powell, and Vella (1999) developed a nonparametric estimator for g
in a different setting in which a control function is available. Horowitz and Lee (2007); Chen and
Pouzo (2008); and Chernozhukov, Gagliardini, and Scaillet (2008) have developed methods for
estimating quantile-regression versions of model (1.1).
All methods for estimating $g$ in model (1.1) assume the existence of a function that satisfies (1.1). However, as is explained in Section 2 of this paper, a solution need not exist if, for example, the instrument $W$ is not valid, that is, if $E(U \mid W = w) \ne 0$ on a set of $w$ values that has non-zero probability. This raises the question of whether it is possible to test for existence of a solution to (1.1). This paper provides an answer to that question. The null hypothesis is that a solution to (1.1) exists. The set of alternative hypotheses consists of the distributions of $(Y, X, W)$ for which there is no solution to (1.1). We consider tests that are consistent uniformly
over this set. Uniform consistency is important because it ensures that there are not alternatives
against which a test has low power even with large samples. If a test is not uniformly consistent
over a specified set, then that set contains alternatives against which the test has low power.
Some such alternatives may depart from the null hypothesis in extreme ways, as is illustrated by
example in Section 3. We show that the null hypothesis cannot be tested consistently uniformly
over the set of alternative hypotheses without placing restrictions on $g$ beyond those needed to estimate it when (1.1) has a solution. Specifically, there is always a distribution of $(Y, X, W)$ such that no solution to (1.1) exists but any $\alpha$-level test accepts the null hypothesis with a probability that is arbitrarily close to $1 - \alpha$.
We also show that it is possible to test the hypothesis that a “smooth” solution to (1.1)
exists. The test is consistent uniformly over a large class of non-smooth alternatives. The paper
presents such a test. Non-existence of a solution to (1.1) is an extreme form of non-smoothness,
so in sufficiently large samples, the test presented here rejects the null hypothesis that g is
smooth if no solution to (1.1) exists.
We define g to be smooth if it has sufficiently many square-integrable derivatives. With
a sufficiently large sample, the test presented here rejects the null hypothesis that (1.1) has a
smooth solution if no solution exists or if one exists but is not smooth. The possibility of
rejecting a non-smooth solution is desirable in many applications. For example, a demand
function or Engel curve is unlikely to be discontinuous or wiggly. Thus, rejection of the
hypothesis that a demand function or Engel curve is smooth implies misspecification of the model
that identifies the curve or function (e.g., that W is not a valid instrument for X ). Accordingly,
the test described here is likely to be useful in many applications.
The test presented here is related to Horowitz’s (2006) test of the hypothesis that g
belongs to a specified, finite-dimensional parametric family. A smooth function can be
approximated accurately by a finite-dimensional parametric model consisting of a truncated series
expansion with suitable basis functions. See Section 4 for details. The approximation to a non-
smooth function is less accurate. Therefore, one can test for existence of a smooth solution to
(1.1) by testing the adequacy of a truncated series approximation to g . The test statistic is
similar to Horowitz’s (2006) statistic for testing a finite-dimensional parametric model, but its
asymptotic behavior is different. In Horowitz (2006), the dimension of the parametric model is
fixed. In the present setting, the dimension (or length) of the series approximation increases as
the sample size increases. This is necessary to ensure that the truncation error remains smaller
than the smallest deviation from the null hypothesis that the test can detect. The increasing
dimension of the null hypothesis model changes the asymptotic behavior of the test statistic in
ways that are explained in Section 4.
Section 2 of this paper gives a necessary and sufficient condition for existence of a
function g that solves (1.1). It also explains why a solution may not exist. Section 3 presents an
example showing that it is not possible to construct a uniformly consistent test of the hypothesis
that (1.1) has a solution. No matter how large the sample is, there are always alternatives against
which any test has low power. Section 4 describes the statistic for testing the hypothesis that
(1.1) has a smooth solution. This section also explains the test’s asymptotic behavior under the
null and alternative hypotheses. Section 5 presents the results of a Monte Carlo investigation of
the test’s finite-sample behavior, and Section 6 presents conclusions. The proofs of theorems are
in the appendix, which is Section 7.
2. Necessary and Sufficient Conditions for a Solution to (1.1)
Necessary and sufficient conditions for existence of a solution to (1.1) are given by
Picard’s theorem (e.g., Kress 1999, Theorem 15.18). Before stating the theorem, we define
notation that will be used throughout the paper.
Assume that $X$ and $W$ are real-valued random variables and that the support of $(X, W)$ is contained in $[0,1]^2$. This assumption entails no loss of generality, as it can always be satisfied, if necessary, by carrying out monotone transformations of the components of $(X, W)$. Let $f_{XW}$ and $f_W$, respectively, denote the probability density functions (with respect to Lebesgue measure) of $(X, W)$ and $W$. For $x, z \in [0,1]$, define

$$t(x, z) = \int_0^1 f_{XW}(x, w) f_{XW}(z, w)\,dw.$$

Define the operator $T : L^2[0,1] \to L^2[0,1]$ by

$$(T\nu)(z) = \int_0^1 t(x, z)\nu(x)\,dx,$$

where $\nu$ is any function in $L^2[0,1]$. Assume that $T$ is non-singular. Denote its eigenvalues and eigenvectors by $\{\lambda_j, \phi_j : j = 1, 2, \dots\}$. Sort these so that $\lambda_1 \ge \lambda_2 \ge \dots > 0$. Under the assumptions stated in Section 4, $T$ is a compact operator. Therefore, its eigenvectors form a complete, orthonormal basis for $L^2[0,1]$. Moreover, the eigenvalues are strictly positive and have 0 as their only limit point.
Now, for $z \in [0,1]$ define

$$q(z) = E_{YW}[Y f_{XW}(z, W)],$$

where $E_{YW}$ denotes the expectation with respect to the distribution of $(Y, W)$. Hall and Horowitz (2005) show that (1.1) is equivalent to the operator equation

(2.1) $q = Tg$.
Therefore, the conditions for existence of a solution to (1.1) are the same as the conditions for
existence of a function g that satisfies (2.1).
Let $\langle \cdot, \cdot \rangle$ denote the inner product in $L^2[0,1]$. The following theorem gives necessary and sufficient conditions for existence of a function $g$ that solves (1.1) and (2.1).

Theorem 2.1 (Picard): Let $T$ be a compact, non-singular operator, and assume that $q \ne 0$. Then (2.1) has a solution if and only if

$$\sum_{j=1}^{\infty} \frac{\langle q, \phi_j \rangle^2}{\lambda_j^2} < \infty.$$

If a solution exists, it is

$$g(x) = \sum_{j=1}^{\infty} b_j \phi_j(x),$$

where

$$b_j = \frac{\langle q, \phi_j \rangle}{\lambda_j}. \quad ■$$
Testing the hypothesis that (1.1) has a solution is equivalent to testing the hypothesis that $\sum_{j=1}^{\infty} b_j^2 < \infty$ against the alternative $\sum_{j=1}^{\infty} b_j^2 = \infty$. The quantities $\langle q, \phi_j \rangle$ are the generalized Fourier coefficients of $q$ using the basis functions $\{\phi_j\}$. That is,

$$q(z) = \sum_{j=1}^{\infty} \langle q, \phi_j \rangle \phi_j(z).$$

Therefore, a solution to (1.1) exists if and only if the Fourier coefficients of $q$ converge sufficiently rapidly relative to the eigenvalues of $T$. It is easy to construct examples in which the generalized Fourier coefficients converge more slowly than the eigenvalues, so that $\sum_{j=1}^{\infty} b_j^2 = \infty$.
In applied econometric research $g$ may be an Engel curve, demand function, or some other economically meaningful function whose existence is not in question. In this case, (1.1) may not have a solution if $W$ is not a valid instrument. Specifically, suppose that $E(U \mid W = w) \ne 0$ on a set of $w$ values with positive probability. Then arguments like those used to obtain (2.1) show that $g$ solves not (1.1) or (2.1) but

(2.2) $(Tg)(z) = q(z) - E_{UW}[U f_{XW}(z, W)]$.

The misspecified models (1.1) and (2.1) need not have solutions when $W$ is an invalid instrument and (2.2) is the correct specification.
3. The Impossibility of Uniformly Consistent Testing with Unrestricted Null and Alternative Hypotheses
This section presents an example in which uniformly consistent testing of the hypothesis that (1.1) has a solution is not possible. The distributions used in the example are nested in any reasonable class of probability distributions for $(Y, X, W)$ in (1.1), so the impossibility result obtained with the example holds generally.
The example consists of a simple null hypothesis and a simple alternative hypothesis. "Simple" in this context means that there are no unknown population parameters under either the null or the alternative hypothesis. The null hypothesis is that a specific function $g$ solves (1.1). Under
the alternative hypothesis, (1.1) has no solution. It follows from the Neyman-Pearson lemma that
the likelihood ratio test is the most powerful test of the null hypothesis against the alternative.
We show that regardless of the sample size, it is always possible to choose an alternative
hypothesis against which the power of the likelihood ratio test is arbitrarily close to its size.
Therefore, the likelihood ratio test is not uniformly consistent. It follows that no other test is
uniformly consistent because no other test is more powerful than the likelihood ratio test.
To construct the example, write (1.1) in the equivalent form

(3.1) $Y = E[g(X) \mid W] + V; \quad E(V \mid W) = 0$,

where $V = Y - E(Y \mid W)$. Assume that $f_{XW}$ is known and is

$$f_{XW}(x, w) = 1 + 2\sum_{j=1}^{\infty} \lambda_j^{1/2} \cos(j\pi x)\cos(j\pi w),$$

where the $\lambda_j$'s are constants satisfying $\lambda_1 \ge \lambda_2 \ge \dots > 0$ and $\sum_{j=1}^{\infty} \lambda_j^{1/2} < \infty$. With this density function, the eigenvalues of $T$ are $\{1, \lambda_1, \lambda_2, \dots\}$. The eigenvectors are $\phi_1(x) = 1$ and $\phi_j(x) = 2^{1/2}\cos[(j-1)\pi x]$ for $j \ge 2$.
Assume that $V$ is known to be distributed as $N(0,1)$ and is independent of $X$ and $W$. Let $\{b_j : j = 0, 1, 2, \dots\}$ denote the Fourier coefficients of $g$ with the cosine basis. That is,

(3.2) $g(x) = b_0 + 2^{1/2}\sum_{j=1}^{\infty} b_j \cos(j\pi x)$.

Then

$$E[g(X) \mid W] = b_0 + 2^{1/2}\sum_{j=1}^{\infty} \lambda_j^{1/2} b_j \cos(j\pi W),$$

and model (1.1) becomes

(3.3) $Y = b_0 + 2^{1/2}\sum_{j=1}^{\infty} \lambda_j^{1/2} b_j \cos(j\pi W) + V; \quad V \sim N(0,1)$.
Now let $J > 0$ be an integer. Consider testing the simple null hypothesis

$$H_0: \quad b_0 = 1; \quad b_j = j^{-2} \text{ if } 1 \le j \le J; \quad b_j = 0 \text{ if } j > J$$

against the simple alternative hypothesis

$$H_1: \quad b_0 = 1; \quad b_j = j^{-2} \text{ if } 1 \le j \le J; \quad b_j = 1 \text{ if } j > J.$$

Under $H_0$, $g$ in (3.2) is an ordinary function on $[0,1]$, and $g$ solves (1.1). Under $H_1$, $g$ is a linear combination of an ordinary function and a delta function, so $g$ is not a function on $[0,1]$ in the usual sense and (1.1) has no solution.
Let the data be the independent random sample $\{Y_i, X_i, W_i : i = 1, \dots, n\}$. We show that for any fixed $n$, no matter how large, it is possible to choose $J$ so that the power of the likelihood ratio test of $H_0$ against $H_1$ is arbitrarily close to its size.
The likelihood ratio statistic for testing $H_0$ against $H_1$ is

(3.4) $LR = \frac{1}{2}\sum_{i=1}^{n}\left\{\left[Y_i - 1 - \sum_{j=1}^{J} 2^{1/2}\lambda_j^{1/2} j^{-2}\cos(j\pi W_i)\right]^2 - \left[Y_i - 1 - \sum_{j=1}^{J} 2^{1/2}\lambda_j^{1/2} j^{-2}\cos(j\pi W_i) - \sum_{j=J+1}^{\infty} 2^{1/2}\lambda_j^{1/2}\cos(j\pi W_i)\right]^2\right\}.$

Substituting (3.3) into (3.4) shows that under $H_0$, the likelihood ratio statistic is

$$LR_0 = \sum_{i=1}^{n} V_i \sum_{j=J+1}^{\infty} 2^{1/2}\lambda_j^{1/2}\cos(j\pi W_i) - \frac{1}{2}\sum_{i=1}^{n}\left[\sum_{j=J+1}^{\infty} 2^{1/2}\lambda_j^{1/2}\cos(j\pi W_i)\right]^2.$$

Under $H_1$, the likelihood ratio statistic is

$$LR_1 = \sum_{i=1}^{n} V_i \sum_{j=J+1}^{\infty} 2^{1/2}\lambda_j^{1/2}\cos(j\pi W_i) + \frac{1}{2}\sum_{i=1}^{n}\left[\sum_{j=J+1}^{\infty} 2^{1/2}\lambda_j^{1/2}\cos(j\pi W_i)\right]^2.$$

Therefore,

$$LR_1 - LR_0 = \sum_{i=1}^{n}\left[\sum_{j=J+1}^{\infty} 2^{1/2}\lambda_j^{1/2}\cos(j\pi W_i)\right]^2 \le 2n\left(\sum_{j=J+1}^{\infty}\lambda_j^{1/2}\right)^2.$$

Because $\sum_{j=1}^{\infty}\lambda_j^{1/2} < \infty$, $LR_1 - LR_0$ can be made arbitrarily small by making $J$ sufficiently large. Therefore, we obtain the following result, which is proved in the Appendix.
Proposition 3.1: Let $c_{n\alpha}$ denote the $\alpha$-level critical value of $LR$ when the sample size is $n$. That is, $P(LR > c_{n\alpha} \mid H_0) = \alpha$. Let $n$ be fixed. For each $\varepsilon > 0$ there is a $J_0$ such that the power of the $\alpha$-level likelihood ratio test of $H_0$ against $H_1$ is less than or equal to $\alpha + \varepsilon$ whenever $J \ge J_0$. That is, $P(LR > c_{n\alpha} \mid H_1) \le \alpha + \varepsilon$ whenever $J \ge J_0$. ■
Now consider the class of alternative hypotheses consisting of distributions of $(Y, X, W)$ for which $H_1$ is true for some $J < \infty$. Because no test is more powerful than the likelihood ratio test, it follows from Proposition 3.1 that no test of $H_0$ is consistent uniformly over this class. Regardless of the sample size $n$, there are always distributions in $H_1$ for some finite $J$ against which the power of any test is arbitrarily close to the test's level. The intuitive reason is that Fourier components of $g$ corresponding to eigenvectors of $T$ with small eigenvalues have little effect on $Y$ and, therefore, are hard to detect empirically. This is illustrated by (3.3), where $Y$ is insensitive to changes in Fourier coefficients $b_j$ that are associated with very small eigenvalues $\lambda_j$. This problem can be overcome by restricting the null and alternative hypotheses so as to avoid the need for estimating or testing Fourier coefficients associated with very small eigenvalues of $T$. Section 4 presents a way of doing this.
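The argument behind Proposition 3.1 is quantitative: for fixed $n$, the gap $LR_1 - LR_0$ is bounded by $2n(\sum_{j>J}\lambda_j^{1/2})^2$, which collapses as $J$ grows. A short numerical sketch with the hypothetical choice $\lambda_j = j^{-4}$ (so that $\sum_j \lambda_j^{1/2} < \infty$, as the example requires):

```python
import numpy as np

def lr_gap_bound(n, tail_sum):
    """Upper bound 2n * (sum_{j>J} lambda_j^{1/2})^2 on LR_1 - LR_0."""
    return 2.0 * n * tail_sum ** 2

n = 10**6                              # sample size, held fixed and large
j = np.arange(1, 10**6 + 1, dtype=float)
sqrt_lam = j ** -2.0                   # lambda_j^{1/2} with lambda_j = j^{-4}

# Tail sums sum_{j>J} lambda_j^{1/2} shrink like 1/J, so the bound falls like 1/J^2.
bounds = [lr_gap_bound(n, sqrt_lam[J:].sum()) for J in (10, 100, 1000, 10000)]
print(bounds)  # roughly 2e6 / J^2, decreasing toward 0
```

Even with a million observations, choosing $J = 10{,}000$ makes the most powerful test nearly blind to the alternative.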
4. A Uniformly Consistent Test of the Hypothesis That (1.1) Has a Smooth Solution
In this section, we restrict the null hypothesis by requiring g to be smooth in the sense
that it has s square-integrable derivatives, where s is a sufficiently large integer. Under the
alternative hypothesis, (1.1) has no smooth solution. In sufficiently large samples, the resulting
test rejects the null hypothesis if (1.1) has no solution or if (1.1) has a non-smooth solution. As
was explained in the introduction, a non-smooth solution to (1.1) is an indicator of
misspecification (possibly due to an invalid instrument) in many applications. Therefore, the
ability to reject non-smooth solutions to (1.1) can be a desirable property of a test.
4.1 Motivation
We begin with an informal discussion that provides intuition for the test that is developed
here. Let $\{\psi_j : j = 1, 2, \dots\}$ be a complete, orthonormal basis for $L^2[0,1]$. Suppose for the moment that under $H_0$, the solution to (1.1) has the finite-dimensional representation

(4.1) $g(x) = \sum_{j=1}^{J} b_j \psi_j(x)$

for some fixed $J < \infty$ and (generalized) Fourier coefficients $\{b_j : j = 1, \dots, J\}$. Equation (4.1) restricts $g$ to a finite-dimensional parametric family. The null hypothesis that (4.1) is a solution to (1.1) for some set of $b_j$'s can be tested against the alternative that it is not by using the test of Horowitz (2006). The test statistic is

(4.2) $\tau_{Pn} = \int_0^1 S_n(z)^2\,dz$,
where

$$S_n(z) = n^{-1/2}\sum_{i=1}^{n}\left[Y_i - \sum_{j=1}^{J}\hat b_j \psi_j(X_i)\right]\hat f_{XW}^{(-i)}(z, W_i),$$

$\hat b_j$ is an estimator of $b_j$ that is $n^{-1/2}$-consistent under the null hypothesis, and $\hat f_{XW}^{(-i)}$ is a leave-observation-$i$-out nonparametric kernel estimator of $f_{XW}$. Specifically,

(4.3) $\hat f_{XW}^{(-i)}(x, w) = \frac{1}{(n-1)h^2}\sum_{j=1,\,j\ne i}^{n} K\!\left(\frac{x - X_j}{h}, \frac{w - W_j}{h}\right)$,

where $h$ is a bandwidth and $K$ is a kernel function of a 2-dimensional argument. Horowitz shows that if (4.1) is a solution to (1.1), then $\tau_{Pn}$ is asymptotically distributed as a weighted sum of independent chi-square random variables with one degree of freedom. Horowitz (2006) also shows that $\tau_{Pn}$ is consistent uniformly over a class of nonparametric alternative hypotheses whose distance from (4.1) is proportional to $n^{-1/2}$.
Now let $g$ be nonparametric, but suppose that its derivatives through order $s$ are square integrable on $[0,1]$. Then $g$ has the infinite-dimensional Fourier representation

$$g(x) = \sum_{j=1}^{\infty} b_j \psi_j(x)$$

but can be approximated accurately by the finite-dimensional model that is obtained by truncating this series. Specifically, let $\{J_n : n = 1, 2, \dots\}$ be a sequence of positive integers with $J_n \to \infty$ as $n \to \infty$. Define

(4.4) $g_n(x) = \sum_{j=1}^{J_n} b_j \psi_j(x)$.

Then for a wide variety of basis functions $\{\psi_j\}$ that includes trigonometric functions and orthogonal polynomials, the error of $g_n$ as an approximation to $g$ satisfies

$$\|g_n - g\| = O(J_n^{-s}),$$

where $\|\cdot\|$ denotes the norm in $L^2[0,1]$. Thus, a smooth function $g$ can be approximated accurately by a finite-dimensional parametric function. This suggests testing the null hypothesis that (1.1) has a smooth solution by using Horowitz's (2006) procedure to test the hypothesis that (4.4) is the solution. If $J_n$ is sufficiently large, the approximation error will be small compared to the minimum deviation from (4.4) that the test can detect. On the other hand, if the solution to (1.1) is non-smooth or does not exist, then (4.4) will be a poor approximation to the solution to (1.1), and the test will reject the null hypothesis if the sample is sufficiently large.
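By Parseval's theorem, the squared $L^2$ truncation error of $g_n$ is the tail sum $\sum_{j > J_n} b_j^2$, so the rate $O(J_n^{-s})$ can be checked directly in coefficient space. A small sketch with two hypothetical coefficient sequences (they correspond to no particular $g$): $b_j = j^{-(s+1/2)}$ mimics an $s$-smooth function, while $b_j = j^{-1}$ mimics a much rougher one:

```python
import numpy as np

def truncation_error(b, J):
    """L2 truncation error ||g_J - g|| = (sum_{j>J} b_j^2)^{1/2}, by Parseval."""
    return np.sqrt((b[J:] ** 2).sum())

j = np.arange(1, 10**6 + 1, dtype=float)
s = 2
b_smooth = j ** -(s + 0.5)   # coefficients of an s-smooth g: error of order J^{-s}
b_rough = j ** -1.0          # slowly decaying coefficients: error of order J^{-1/2}

# Doubling J cuts the smooth error by about 2^{-s} = 1/4,
# but cuts the rough error only by about 2^{-1/2}.
for J in (10, 20, 40, 80):
    print(J, truncation_error(b_smooth, J), truncation_error(b_rough, J))
```

The contrast in decay rates is exactly what lets a truncated series stand in for a smooth $g$ while remaining a poor approximation to a non-smooth one.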
The main difference between this version of Horowitz's (2006) test and the test based on $\tau_{Pn}$ in (4.2) is that when $g$ is nonparametric, $J_n$ must increase as $n$ increases to ensure that the approximation error remains too small to be detected by the test. This changes the asymptotic distributional properties of the test statistic. Among other things, the test statistic is asymptotically degenerate (its asymptotic distribution is concentrated at a single point) under the null hypothesis. We deal with this problem here by splitting the sample into halves. One half is used to estimate the $b_j$'s, and the other half is used to construct the test statistic. The sample-splitting procedure is explained in more detail in Section 4.3. The degeneracy problem is well-known in nonparametric testing, and a variety of solutions are possible. Sample splitting leads to a relatively simple test. Other potential solutions that may have some advantages but are much more complicated analytically are discussed in Section 4.8.
4.2 The Null and Alternative Hypotheses
This section provides formal statements of the null and alternative hypotheses of the test
that is developed in this paper. The test statistic is presented in Section 4.3.
We use the following notation. For a function $\nu : [0,1] \to \mathbb{R}$ and integer $\ell \ge 0$, define

$$D^{\ell}\nu(x) = \frac{\partial^{\ell}\nu(x)}{\partial x^{\ell}}$$

whenever the derivative exists. Define $D^0\nu(x) = \nu(x)$. Given an integer $s > 0$, define the Sobolev norm

(4.5) $\|\nu\|_s = \left\{\sum_{\ell=0}^{s}\int_0^1 [D^{\ell}\nu(x)]^2\,dx\right\}^{1/2}$

and the function space

$$\mathcal{H}_s = \{\nu : [0,1] \to \mathbb{R} : \|\nu\|_s \le C_g\},$$

where $C_g < \infty$ is a constant.

The null hypothesis in the remainder of this paper is

$H_0$: Equation (1.1) has a solution $g \in \mathcal{H}_s$ for some integer $s > 0$.

The alternative hypothesis is

$H_1$: Equation (1.1) does not have a solution in $\mathcal{H}_s$.
4.3 The Test Statistic
This section presents the statistic for testing $H_0$. The data are the independent random sample $\{Y_i, X_i, W_i : i = 1, \dots, n\}$. To avoid unimportant notational complexities, we assume that $n$ is even. If $n$ is odd, drop one randomly selected observation. This has a negligible effect on the power of the test. Define $\hat f_{XW}^{(-i)}$ as in (4.3). Define $\mathcal{S}_1 = \{Y_i, X_i, W_i : i = 1, \dots, n/2\}$ and $\mathcal{S}_2 = \{Y_i, X_i, W_i : i = n/2 + 1, \dots, n\}$. Let $\hat b_j$ ($j = 1, \dots, J_n$) be consistent estimators of the Fourier coefficients $b_j$ in (4.4) that are obtained from the data in $\mathcal{S}_2$. The $\hat b_j$'s can be obtained by using the method of Blundell, Chen, and Kristensen (2007), but the derivation of the asymptotic distribution of the test statistic is simpler if the method explained in Section 4.4 is used. Define

$$\hat g_n(x) = \sum_{j=1}^{J_n} \hat b_j \psi_j(x).$$

Now define

$$S_n(z) = (2/n)^{1/2}\sum_{i \in \mathcal{S}_1}[Y_i - \hat g_n(X_i)]\hat f_{XW}^{(-i)}(z, W_i).$$

The test statistic is

$$\tau_n = \int_0^1 S_n(z)^2\,dz.$$

$H_0$ is rejected if $\tau_n$ is large.

The test works because $J_n$ can be chosen so that the truncation error in the finite-series approximation to $g$ is negligibly small. Therefore, under $H_0$, $S_n(z)$ estimates

(4.6) $(2/n)^{1/2}\sum_{i \in \mathcal{S}_1} U_i f_{XW}(z, W_i)$,

where $U_i = Y_i - g(X_i)$. Under $H_0$, the quantity in (4.6) is a random variable with mean 0 and finite variance, so $\tau_n$ is bounded in probability. Under $H_1$, the truncation error is non-negligible, and $\tau_n$ diverges as $n \to \infty$.
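The construction above can be sketched in a few lines of Python. The sketch simplifies in two labeled ways: the $\hat b_j$'s come from an ordinary series regression of $Y$ on $\psi_j(X)$ over $\mathcal{S}_2$ rather than from the sieve IV estimator of Section 4.4, and a Gaussian product kernel replaces the compact-support kernel required by Assumption 5. The data-generating process is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data; in the paper's setting W instruments a possibly endogenous X.
n = 400
W = rng.uniform(size=n)
X = np.clip(W + 0.1 * rng.normal(size=n), 0.0, 1.0)
Y = np.sin(np.pi * X) + 0.2 * rng.normal(size=n)

def psi(j, x):
    """Cosine basis on [0,1]: psi_1 = 1, psi_j = sqrt(2) cos((j-1) pi x), j >= 2."""
    return np.ones_like(x) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

# Sample splitting: S1 builds S_n(z), S2 estimates the b_j's.
half = n // 2
S1, S2 = np.arange(half), np.arange(half, n)
Jn = 4

# Simplified b_j estimator: series regression of Y on psi_j(X) over S2.
Psi2 = np.column_stack([psi(j, X[S2]) for j in range(1, Jn + 1)])
b_hat, *_ = np.linalg.lstsq(Psi2, Y[S2], rcond=None)

def g_hat(x):
    return np.column_stack([psi(j, x) for j in range(1, Jn + 1)]) @ b_hat

def f_hat_loo(z, i):
    """Leave-observation-i-out Gaussian product-kernel estimate of f_XW(z, W_i)."""
    h = n ** (-1.0 / 6.0)
    k = np.exp(-0.5 * (((z - X) / h) ** 2 + ((W[i] - W) / h) ** 2)) / (2.0 * np.pi)
    k[i] = 0.0
    return k.sum() / ((n - 1) * h ** 2)

# S_n(z) on a grid of z values, then tau_n by the trapezoid rule.
zgrid = np.linspace(0.0, 1.0, 51)
resid = Y[S1] - g_hat(X[S1])
Sn = np.array([np.sqrt(2.0 / n) * sum(resid[i] * f_hat_loo(z, i) for i in S1)
               for z in zgrid])
Sn2 = Sn ** 2
tau_n = 0.5 * ((Sn2[:-1] + Sn2[1:]) * np.diff(zgrid)).sum()
print(tau_n)
```

In an application, $\tau_n$ would then be compared with the estimated approximate critical value described in Section 4.6.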
4.4 A Sieve Estimator of g
This section describes a modified version of the estimator of Blundell, Chen, and Kristensen (2007). The derivation of the asymptotic distribution of $\tau_n$ under $H_0$ is simpler with the modified estimator than with the original one.

For $w \in [0,1]$, define

$$m(w) = E(Y \mid W = w) f_W(w).$$

Define the operator $A : L^2[0,1] \to L^2[0,1]$ by

$$(A\nu)(w) = \int_0^1 \nu(x) f_{XW}(x, w)\,dx.$$

Then (1.1) is equivalent to the operator equation

(4.7) $Ag = m$.
The estimator of $g$ is defined in terms of series expansions of $g$, $m$, and $A$. As before, let $\{\psi_j\}$ denote a complete, orthonormal basis for $L^2[0,1]$. The expansions are

$$g(x) = \sum_{j=1}^{\infty} b_j \psi_j(x),$$

$$m(w) = \sum_{k=1}^{\infty} a_k \psi_k(w),$$

and

$$f_{XW}(x, w) = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} c_{jk}\psi_j(x)\psi_k(w),$$

where

$$b_j = \int_0^1 g(x)\psi_j(x)\,dx,$$

$$a_k = \int_0^1 m(w)\psi_k(w)\,dw,$$

and

$$c_{jk} = \int_{[0,1]^2} f_{XW}(x, w)\psi_j(x)\psi_k(w)\,dx\,dw.$$
We need estimators of $a_k$ and $c_{jk}$ ($j, k = 1, \dots, J_n$). These are

$$\hat a_k = (2/n)\sum_{i \in \mathcal{S}_2} Y_i \psi_k(W_i)$$

and

$$\hat c_{jk} = (2/n)\sum_{i \in \mathcal{S}_2} \psi_j(X_i)\psi_k(W_i).$$

In addition, define the operator $\hat A$ that estimates $A$ by

$$(\hat A\nu)(w) = \int_0^1 \nu(x)\hat f_{XW}(x, w)\,dx,$$

where

(4.8) $\hat f_{XW}(x, w) = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n} \hat c_{jk}\psi_j(x)\psi_k(w)$.

Also define the estimator

$$\hat m(w) = \sum_{k=1}^{J_n} \hat a_k \psi_k(w).$$

Finally, define the set of functions in $L^2[0,1]$

$$\mathcal{H}_{ns} = \left\{\nu = \sum_{j=1}^{J_n} \nu_j \psi_j : \|\nu\|_s \le C_g\right\}.$$

The estimator of $g$ is

(4.9) $\hat g = \arg\min_{\nu \in \mathcal{H}_{ns}} \|\hat A\nu - \hat m\|$.
The following result is used to obtain the asymptotic distribution of $\tau_n$ under $H_0$ and to establish uniform consistency of the $\tau_n$ test under the alternative hypotheses defined in Section 4.5.

Theorem 4.1: Let $f_{XW}$ have $r$ continuous derivatives with respect to any combination of its arguments. Let the regularity conditions of Section 4.5 hold. Then as $n \to \infty$,

$$\|\hat g - g\| = O_p\!\left[J_n^{-s} + J_n^{r}(J_n/n)^{1/2}\right]. \quad ■$$
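In the $\{\psi_j\}$ basis, $Ag = m$ becomes the linear system $\sum_j c_{jk} b_j = a_k$, so apart from the Sobolev-norm constraint that defines $\mathcal{H}_{ns}$, (4.9) amounts to solving $\hat C^{\mathsf T} b = \hat a$ by least squares. The sketch below implements this unconstrained simplification (the constraint matters for the asymptotic theory but is dropped here for brevity) on invented data, using the full sample rather than $\mathcal{S}_2$; for the smoke test $W = X$, a deliberately degenerate "perfect instrument" design in which the estimator must recover the Fourier coefficients of $g$:

```python
import numpy as np

rng = np.random.default_rng(1)

def psi(j, x):
    """Cosine basis on [0,1]: psi_1 = 1, psi_j = sqrt(2) cos((j-1) pi x), j >= 2."""
    return np.ones_like(x) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

def sieve_iv(Y, X, W, Jn):
    """Unconstrained version of estimator (4.9): solve C_hat' b = a_hat."""
    n = len(Y)
    Psi_X = np.column_stack([psi(j, X) for j in range(1, Jn + 1)])
    Psi_W = np.column_stack([psi(k, W) for k in range(1, Jn + 1)])
    a_hat = Psi_W.T @ Y / n          # estimates of a_k = E[Y psi_k(W)]
    C_hat = Psi_X.T @ Psi_W / n      # estimates of c_jk = E[psi_j(X) psi_k(W)]
    b_hat, *_ = np.linalg.lstsq(C_hat.T, a_hat, rcond=None)
    return b_hat

# Smoke-test design: W = X, g(x) = psi_2(x), so the true coefficients are (0, 1, 0, 0).
n = 20000
W = rng.uniform(size=n)
X = W.copy()
Y = psi(2, X) + 0.1 * rng.normal(size=n)

print(np.round(sieve_iv(Y, X, W, 4), 2))  # close to [0, 1, 0, 0]
```

With an endogenous $X$ and a genuinely noisy instrument, $\hat C$ becomes badly conditioned as $J_n$ grows, which is where the ill-posedness discussed above bites and the $\mathcal{H}_{ns}$ constraint becomes essential.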
4.5 The Asymptotic Distribution of $\tau_n$ under $H_0$

This section gives the asymptotic distribution of $\tau_n$ under $H_0$. We begin by defining additional notation and stating the assumptions that are used. Let $\|(x_1, w_1) - (x_2, w_2)\|_E$ denote the Euclidean distance between $(x_1, w_1)$ and $(x_2, w_2)$. Let $D^j f_{XW}$ denote any $j$'th partial or mixed partial derivative of $f_{XW}$. Let $D^0 f_{XW}(x, w) = f_{XW}(x, w)$.

Assumption 1: (i) The support of $(X, W)$ is contained in $[0,1]^2$. (ii) $(X, W)$ has a probability density function $f_{XW}$ with respect to Lebesgue measure. (iii) There are an integer $r \ge 2$ and a constant $C_f < \infty$ such that $|D^j f_{XW}(x, w)| \le C_f$ for all $(x, w) \in [0,1]^2$ and $j = 0, 1, \dots, r$. (iv) $|D^r f_{XW}(x_1, w_1) - D^r f_{XW}(x_2, w_2)| \le C_f\|(x_1, w_1) - (x_2, w_2)\|_E$ for any order-$r$ derivative and any $(x_1, w_1)$ and $(x_2, w_2)$ in $[0,1]^2$. (v) The operator $T$ is nonsingular.

Assumption 2: $E(Y^2 \mid W = w) \le C_Y$ for each $w \in [0,1]$ and some constant $C_Y < \infty$.

Assumption 3: Equation (1.1) has a solution $g \in \mathcal{H}_s$ with $\|g\|_s < C_g$ and $s \ge 2r + 1/2$.
Assumption 4: (i) The estimator $\hat g$ is as defined in (4.9). (ii) The basis functions $\{\psi_j\}$ are orthonormal and complete on $L^2[0,1]$. Moreover, there are coefficients $\{b_j : j = 1, 2, \dots\}$ and $\{c_{jk} : j, k = 1, 2, \dots\}$ such that

$$\left\|g - \sum_{j=1}^{J} b_j \psi_j\right\| = O(J^{-s})$$

and

$$\left\|f_{XW} - \sum_{j=1}^{J}\sum_{k=1}^{J} c_{jk}\psi_j\psi_k\right\| = O(J^{-r})$$

as $J \to \infty$ for any $g$ and $s$ satisfying Assumption 3.
Assumption 5: The kernel function $K$ used to estimate $f_{XW}$ has the form $K(\xi) = \kappa(\xi^{(1)})\kappa(\xi^{(2)})$, where $\xi^{(j)}$ is the $j$'th component of the vector $\xi$, $\kappa$ is a symmetrical, twice continuously differentiable function on $[-1, 1]$, and

$$\int_{-1}^{1} v^j \kappa(v)\,dv = \begin{cases} 1 & \text{if } j = 0 \\ 0 & \text{if } 1 \le j \le r - 1. \end{cases}$$

Assumption 6: (i) The bandwidth, $h$, satisfies $h = c_h n^{-1/(2r+2)}$, where $c_h$ is a constant, $0 < c_h < \infty$. (ii) $J_n = C_J n^{\gamma}$ for constants $C_J < \infty$ and $1/(2s) < \gamma < 1/(4r+1)$.
Assumptions 1 and 2 are smoothness and boundedness conditions. Assumption 3 defines the null hypothesis. It requires $\|g\|_s < C_g$ (strict inequality) to avoid complications that arise when $g$ is on the boundary of $\mathcal{H}_s$. Deriving the asymptotic distribution of $\tau_n$ when $\|g\|_s = C_g$ is a difficult task that is beyond the scope of this paper. The requirement that $s \ge 2r + 1/2$ is discussed below. Assumption 4(ii) is satisfied by orthogonal polynomials and trigonometric functions. It is also satisfied by B-splines that have been orthogonalized by, say, the Gram-Schmidt procedure. Assumption 5 requires $K$ to be a higher-order kernel if $f_{XW}$ is sufficiently smooth. $K$ can be replaced by a boundary kernel (Gasser and Müller 1979; Gasser, Müller, and Mammitzsch 1985) if $f_{XW}$ does not approach 0 smoothly on the boundary of its support. The rate of convergence of $h$ in Assumption 6(i) is asymptotically optimal for estimating $f_{XW}$. In applications, $h$ can be chosen by cross-validation or any of a variety of other bandwidth selection methods. Assumption 6(ii) requires $J_n$ to increase more rapidly than the asymptotically optimal rate for estimating $g$. This undersmoothing ensures that the truncation bias in the series approximation to $g$ is $o(n^{-1/2})$ as $n \to \infty$. Rapid convergence of the truncation bias is needed because the smallest deviations from the null hypothesis that the test can detect are of size $O(n^{-1/2})$, so the $\tau_n$ test would reject $H_0$ because of truncation bias unless this bias converges more rapidly than $n^{-1/2}$. However, undersmoothing increases the severity of the ill-posed inverse problem in estimating $g$, thereby decreasing the rate of convergence of $\hat g$ and increasing the probability that $\|\hat g\|_s = C_g$. This complicates the asymptotic distribution of $\tau_n$ in ways similar to those that occur when $\|g\|_s = C_g$. The conditions $s \ge 2r + 1/2$ in Assumption 3 and $\gamma < 1/(4r+1)$ in Assumption 6 ensure that $\hat g$ converges sufficiently rapidly to avoid this problem. The question whether a useful test can be constructed under weaker smoothness assumptions is left to future research.
Now define $\sigma_U^2(w) = E(U^2 \mid W = w)$. For $z_1, z_2 \in [0,1]$ define

$$V(z_1, z_2) = E[\sigma_U^2(W) f_{XW}(z_1, W) f_{XW}(z_2, W)].$$

Define the operator $\Omega$ on $L^2[0,1]$ by

(4.10) $(\Omega\nu)(z_2) = \int_0^1 V(z_1, z_2)\nu(z_1)\,dz_1$.

Let $\{\omega_j : j = 1, 2, \dots\}$ denote the eigenvalues of $\Omega$ sorted so that $\omega_1 \ge \omega_2 \ge \dots \ge 0$. Let $\{\chi^2_{1j} : j = 1, 2, \dots\}$ denote independent random variables that are distributed as chi-square with one degree of freedom. The following theorem gives the asymptotic distribution of $\tau_n$ under $H_0$.

Theorem 4.2: If $H_0$ is true and Assumptions 1-5 hold, then

$$\tau_n \to^d \sum_{j=1}^{\infty} \omega_j \chi^2_{1j}. \quad ■$$
4.6 Obtaining the Critical Value
The statistic $\tau_n$ is not asymptotically pivotal, so its asymptotic distribution cannot be tabulated. This section presents a method, similar to that of Horowitz (2006), for obtaining an approximate asymptotic critical value. The method is based on replacing the asymptotic distribution of $\tau_n$ with an approximate distribution. The difference between the true and approximate distributions can be made arbitrarily small, and the quantiles of the approximate distribution can be estimated consistently. The approximate $1 - \alpha$ critical value of the $\tau_n$ test is a consistent estimator of the $1 - \alpha$ quantile of the approximate distribution.

We now describe the approximate asymptotic distribution of $\tau_n$. Under $H_0$, $\tau_n$ is asymptotically distributed as

$$\tau \equiv \sum_{j=1}^{\infty} \omega_j \chi^2_{1j}.$$

Given any $\varepsilon > 0$, there is an integer $K_\varepsilon < \infty$ such that

$$0 \le P\!\left(\sum_{j=1}^{K_\varepsilon} \omega_j \chi^2_{1j} \le t\right) - P(\tau \le t) < \varepsilon$$

uniformly over $t$. Define

$$\tau_\varepsilon = \sum_{j=1}^{K_\varepsilon} \omega_j \chi^2_{1j}.$$
Let $z_{\varepsilon\alpha}$ denote the $1 - \alpha$ quantile of the distribution of $\tau_\varepsilon$. Then $0 \le P(\tau > z_{\varepsilon\alpha}) - \alpha < \varepsilon$. Thus, using $z_{\varepsilon\alpha}$ to approximate the asymptotic $1 - \alpha$ critical value of $\tau_n$ creates an arbitrarily small error in the probability that a correct null hypothesis is rejected. Similarly, use of the approximation creates an arbitrarily small change in the power of the $\tau_n$ test when the null hypothesis is false. The approximate $1 - \alpha$ critical value for the $\tau_n$ test is a consistent estimator of the $1 - \alpha$ quantile of the distribution of $\tau_\varepsilon$. Specifically, let $\hat\omega_j$ ($j = 1, 2, \dots, K_\varepsilon$) be a consistent estimator of $\omega_j$ under $H_0$. Then the estimator of the approximate critical value of $\tau_n$ is the $1 - \alpha$ quantile of the distribution of

$$\hat\tau_n = \sum_{j=1}^{K_\varepsilon} \hat\omega_j \chi^2_{1j}.$$

This quantile can be estimated with arbitrary accuracy by simulation.
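The simulation step can be sketched directly: draw many vectors of independent $\chi^2_1$ variates, form the weighted sums with the estimated $\hat\omega_j$'s, and take the empirical $1 - \alpha$ quantile. The weights below are placeholders, not estimates from any data:

```python
import numpy as np

def weighted_chi2_quantile(omega, alpha=0.05, draws=200000, seed=0):
    """1 - alpha quantile of sum_j omega_j * chi2_1j, estimated by simulation."""
    rng = np.random.default_rng(seed)
    chi2 = rng.chisquare(df=1, size=(draws, len(omega)))  # independent chi-square(1)
    return np.quantile(chi2 @ np.asarray(omega), 1.0 - alpha)

# Sanity check: with a single unit weight this is the chi-square(1) critical value.
print(weighted_chi2_quantile([1.0]))          # about 3.84

# Placeholder weights standing in for estimated eigenvalues omega_hat_j.
omega_hat = [0.5, 0.2, 0.05, 0.01]
print(weighted_chi2_quantile(omega_hat))
```

Because the weights beyond $K_\varepsilon$ are negligible, truncating the sum changes the simulated quantile by an arbitrarily small amount.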
In applications, $K_\varepsilon$ can be chosen informally by sorting the $\hat\omega_j$'s in decreasing order and plotting them as a function of $j$. They typically plot as random noise near $\hat\omega_j = 0$ when $j$ is sufficiently large. One can choose $K_\varepsilon$ to be a value of $j$ that is near the lower end of the "random noise" range. The rejection probability of the $\tau_n$ test is not highly sensitive to $K_\varepsilon$, so it is not necessary to attempt precision in making the choice.
The estimated eigenvalues ˆ jω are those of the estimate of Ω that is defined by
1
2 1 2 1 10ˆ ˆ( )( ) ( , ) ( )z V z z z dzν νΩ = ∫ ,
where
(4.11) ( ) ( )1 21 2 1 2
1
ˆ ˆˆ ˆ( , ) ( , ) ( , )n
i ii i iXW XW
iV z z n U f z W f z W− −−
=
= ∑
and $\hat U_i = Y_i - \hat g(X_i)$. The $\hat\omega_j$'s can be computed easily by using a finite-dimensional series estimator, like (4.8), for $\hat f_{XW}$. The $\hat\omega_j$'s are then the eigenvalues of the finite-dimensional matrix whose $(j,k)$ element is

$$n^{-1}\sum_{i=1}^{n}\sum_{\ell=1}^{L_n}\sum_{m=1}^{L_n}\hat U_i^2\,\hat c_{j\ell}\,\hat c_{km}\,\psi_\ell(W_i)\,\psi_m(W_i) ,$$

where $L_n$ is the length of the series used in (4.11) to estimate $f_{XW}$.
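The matrix computation can be sketched numerically as follows. The function and the randomly generated inputs are illustrative stand-ins for the estimated quantities $\hat U_i$, $\hat c_{j\ell}$, and $\psi_\ell(W_i)$:

```python
import numpy as np

def omega_hat_eigenvalues(U_hat, c_hat, psi_W):
    """Eigenvalues of the matrix with (j, k) element
    n^{-1} sum_i U_i^2 (sum_l c_{jl} psi_l(W_i)) (sum_m c_{km} psi_m(W_i)).

    U_hat : (n,) residuals; c_hat : (J, L) coefficients;
    psi_W : (n, L) basis functions evaluated at the W_i."""
    n = U_hat.shape[0]
    fhat = psi_W @ c_hat.T                 # (n, J): sum_l c_{jl} psi_l(W_i)
    M = (fhat * U_hat[:, None] ** 2).T @ fhat / n
    return np.linalg.eigvalsh(M)[::-1]     # sorted, largest first

rng = np.random.default_rng(1)
eig = omega_hat_eigenvalues(rng.standard_normal(50),
                            rng.standard_normal((4, 6)),
                            rng.standard_normal((50, 6)))
```

The matrix is positive semidefinite by construction, so the computed eigenvalues are nonnegative up to rounding error.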
To state the properties of the estimated eigenvalues, define

$$g = \arg\min_{\nu\in\mathcal{H}_{ns}}\|A\nu - m\|$$

and $U = Y - g(X)$. Let $\{\bar\omega_j\}$ be the eigenvalues of the operator that is obtained by replacing $V(z_1,z_2)$ in (4.10) with $E[U^2 f_{XW}(z_1,W)f_{XW}(z_2,W)]$. Then $\bar\omega_j\to\omega_j$ as $n\to\infty$ if $H_0$ is true.
Let $\bar z_{\varepsilon\alpha}$ denote the $1-\alpha$ quantile of the distribution of the random variable

$$\bar\tau_\varepsilon \equiv \sum_{j=1}^{K_\varepsilon}\bar\omega_j\chi_{1j}^2 ,$$

and let $\hat z_{\varepsilon\alpha}$ denote the $1-\alpha$ quantile of the distribution of $\hat\tau_n$. Note that $\bar\tau_\varepsilon = \tau_\varepsilon$ if $H_0$ is true. The following theorem shows that $\hat z_{\varepsilon\alpha}$ estimates $\bar z_{\varepsilon\alpha}$ consistently if $L_n\to\infty$ as $n\to\infty$.
Theorem 4.3: Let Assumptions 1-5 hold. As $n\to\infty$, (i) $\sup_{1\le j\le K_\varepsilon}|\hat\omega_j - \bar\omega_j| = O\{[(\log n)/(nh^2)]^{1/2}\}$ almost surely, (ii) $\sup_{1\le j\le K_\varepsilon}|\bar\omega_j - \omega_j| = O(J_n^{-s} + L_n^{-r})$, and (iii) $\hat z_{\varepsilon\alpha}\to_p \bar z_{\varepsilon\alpha}$. ■
4.7 Consistency of the $\tau_n$ Test
This section presents a theorem establishing the consistency of the $\tau_n$ test against a fixed alternative hypothesis. The section also shows that for any $\varepsilon > 0$, the $\tau_n$ test rejects $H_0$ with probability exceeding $1-\varepsilon$ uniformly over a large class of alternative hypotheses.
Consistency against a fixed alternative is given by the following theorem. Let $z_\alpha$ denote the $1-\alpha$ quantile of the asymptotic distribution of $\tau_n$ under sampling from the model $Y = g(X) + U$.
Theorem 4.4: Let Assumptions 1, 2, 4, and 5 hold. Then under $H_1$,

$$\lim_{n\to\infty} P(\tau_n > z_\alpha) = 1$$

for any $\alpha$ such that $0 < \alpha < 1$. ■
The conclusion of the theorem also holds if $z_\alpha$ is replaced by the estimated approximate critical value $\hat z_{\varepsilon\alpha}$.
We now consider uniform consistency. Let $A^*$ denote the adjoint of $A$. Define

$$g = \arg\min_{\nu\in\mathcal{H}_s}\|A\nu - m\| .$$

For each $n = 1,2,\ldots$ and $C > 0$, define $\mathcal{F}_{nc}$ as the set of distributions of $Y$ conditional on $(X,W)$ satisfying (i) $E(Y^2\mid W = w) \le C_Y$ for some constant $C_Y < \infty$ and all $w\in[0,1]$, (ii) $\|Tg - A^*m\| \ge Cn^{-1/2}$, and (iii) $h^r\|Ag - m\|/\|Tg - A^*m\| = o(1)$ as $n\to\infty$. Condition (ii) rules out alternatives that depend on $x$ only through sequences of eigenvectors of $T$ whose eigenvalues converge to 0 too rapidly. As the example of Section 3 shows, it is not possible to achieve consistency uniformly over these alternatives. Condition (iii) ensures that random sampling errors in $\hat f_{XW}^{(-i)}$ are asymptotically negligible relative to the effects of misspecification.
The following theorem states the uniform consistency result.
Theorem 4.5: Let Assumptions 1, 2, 4, 5, and 6 hold. Then given any $\delta > 0$, $\alpha$ such that $0 < \alpha < 1$, and any sufficiently large but finite constant $C$,

(4.11) $\displaystyle\liminf_{n\to\infty}\inf_{\mathcal{F}_{nc}} P(\tau_n > z_\alpha) \ge 1 - \delta$

and

(4.12) $\displaystyle\liminf_{n\to\infty}\inf_{\mathcal{F}_{nc}} P(\tau_n > \hat z_{\varepsilon\alpha}) \ge 1 - 2\delta$. ■
4.8 Alternative Approaches to Testing $H_0$
This section describes some alternative approaches to testing $H_0$ that do not require sample splitting. These approaches may have certain advantages over the $\tau_n$ test (possibly in terms of power or weaker smoothness assumptions) but are more complicated analytically. Their investigation is left to future research.
The degeneracy problem that is solved by sample splitting in Section 4.2 is also present in the econometrics literature of the 1990s on testing a parametric or semiparametric model of a conditional mean function against a nonparametric alternative. In its simplest form, the hypothesis tested in that literature is that

(4.13) $E[Y - G(X,\theta)\mid X = x] = 0$

for almost every $x$, some known function $G$, and an unknown finite-dimensional parameter $\theta$ that is estimated from the data. The alternative hypothesis is that there is no $\theta$ satisfying (4.13). Fan and Li (1996) review tests that encounter the degeneracy problem and use sample splitting to overcome it. Degeneracy and the need for sample splitting can be avoided by using a test statistic that measures the distance from 0 of an empirical analog of $E[Y - G(X,\theta)\mid X = x]$. Tests that avoid degeneracy this way include, among others, those of Härdle and Mammen (1993) and Horowitz and Spokoiny (2001) for testing (4.13) and Fan and Li (1996) for testing a semiparametric model of a conditional mean function.
A test of the null hypothesis of this paper that is analogous to the Härdle-Mammen and Horowitz-Spokoiny tests of (4.13) can be based on an empirical analog of the conditional moment $E[Y - g(X)\mid W = w]f_W(w)$. The analog is

$$S_n^*(w) = (nh)^{-1/2}\sum_{i=1}^{n}[Y_i - \hat g(X_i)]\,K\!\left(\frac{w - W_i}{h}\right) ,$$

where $K$ is a kernel function of a scalar argument. Define $\tau_n^* = \int_0^1 S_n^*(w)^2\,dw$.
Under $H_0$, one can expect that $\tau_n^*$ differs from 0 only by random sampling error, whereas $\tau_n^*$ is large if $H_0$ is false. Accordingly, one might use $\tau_n^*$ to test $H_0$. However, deriving the asymptotic distribution of $\tau_n^*$ under $H_0$ requires solving a difficult problem in the theory of empirical $U$-processes and, consequently, is beyond the scope of this paper.
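For concreteness, $\tau_n^*$ can be computed directly from data. The standard normal kernel, the bandwidth, and the grid-based approximation of the integral below are illustrative choices of ours, not specifications from the paper:

```python
import numpy as np

def tau_star(Y, X, W, g_hat, h=0.1, grid_size=200):
    """tau*_n = integral over [0, 1] of S*_n(w)^2, where
    S*_n(w) = (n h)^{-1/2} sum_i [Y_i - g_hat(X_i)] K((w - W_i) / h).
    The standard normal density is used as the kernel K here."""
    n = Y.shape[0]
    resid = Y - g_hat(X)                       # residuals Y_i - g_hat(X_i)
    w_grid = np.linspace(0.0, 1.0, grid_size)
    u = (w_grid[:, None] - W[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    S = (K @ resid) / np.sqrt(n * h)           # S*_n evaluated on the grid
    return float(np.mean(S**2))                # Riemann approximation of the integral

rng = np.random.default_rng(2)
X, W = rng.uniform(size=100), rng.uniform(size=100)
Y = X + 0.1 * rng.standard_normal(100)
t = tau_star(Y, X, W, g_hat=lambda x: x)
```

Under the null the residuals carry no systematic relation to $W$, so the statistic stays near 0; systematic misspecification of $g$ inflates it.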
Another possibility is to base a test on the optimized objective function in (4.9), $\|\hat A\hat g - \hat m\|$. One can also let the series approximation for $m$ be longer than that for $g$, thereby achieving a form of overidentification. However, results obtained with tests based on $\|\hat A\hat g - \hat m\|$ are highly sensitive to the value of $C_G$ and the lengths of the series used to estimate $g$ and $m$. One can obtain virtually any result one wants by choosing these regularization parameters appropriately. Thus, a method for choosing regularization parameters is crucial to the development of any test based on $\|\hat A\hat g - \hat m\|$. The choice of regularization parameters is a major unsolved problem in nonparametric IV estimation. It, too, is beyond the scope of this paper.
The $\tau_n$ test described in Section 4.2 requires choosing the regularization parameter $J_n$, but the results of the Monte Carlo experiments discussed in Section 5 suggest that this can be done satisfactorily by using a simple heuristic procedure that is described in that section. Therefore, the need to choose $J_n$ does not present an obstacle to implementation of the $\tau_n$ test.
Finally, suppose there are several instruments, say $W^{(1)},\ldots,W^{(L)}$ for some $L\ge 2$, and these are believed to satisfy the moment conditions $E[Y - g(X)\mid W^{(\ell)} = w] = 0$ ($\ell = 1,\ldots,L$). Suppose, further, that each moment condition (or, possibly, a subset of more than one but fewer than $L$ conditions) identifies $g$ when such a function exists. Then one can consider testing the hypothesis that all of the moment conditions hold, thereby obtaining a version of the GMM test of overidentifying restrictions. However, such a test asks whether the same $g$ satisfies all the moment conditions, whereas the question being addressed in this paper is whether any $g$ satisfies at least one of the moment conditions. The availability of multiple instruments does not alter the issues that have been discussed concerning tests of the hypothesis that there is a $g$ satisfying at least one of the moment conditions.
5. Monte Carlo Experiments
This section reports the results of a Monte Carlo investigation of the finite-sample performance of the $\tau_n$ test. The experiments use a sample size of 1000. The nominal level of the test is 0.05, and there are 1000 Monte Carlo replications in each experiment.
Realizations of $(X,W)$ were generated by $X = \Phi(\xi)$ and $W = \Phi(\zeta)$, where $\Phi$ is the cumulative normal distribution function, $\zeta\sim N(0,1)$, $\xi = \rho\zeta + (1-\rho^2)^{1/2}\varepsilon$, $\varepsilon\sim N(0,1)$, and $\rho = 0.7$. Realizations of $Y$ were generated from

(5.1) $Y = g(X) + \sigma_U U ,$

where $U = \eta\varepsilon + (1-\eta^2)^{1/2}\nu$, $\nu\sim N(0,1)$, $\sigma_U = 0.1$, and $\eta = 0.4$. In experiments where $H_0$ is
true, the function $g$ is either

$$g_1(x) = 0.5 + \sum_{j=1}^{\infty} j^{-4}\cos(j\pi x)$$

or

$$g_2(x) = \sum_{j=1}^{\infty}(-1)^{j+1} j^{-2}\sin(j\pi x) .$$

The series were truncated at $j = 100$ for computational purposes. The resulting functions are displayed in Figure 1.
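The data-generating process above can be sketched in code. The helper names (`Phi`, `g1`, `simulate`) are ours; only the design parameters come from the paper:

```python
import numpy as np
from math import erf

def Phi(t):
    """Cumulative normal distribution function, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t) / np.sqrt(2.0)))

def g1(x, jmax=100):
    """g_1(x) = 0.5 + sum_j j^{-4} cos(j pi x), truncated at j = 100."""
    j = np.arange(1, jmax + 1)
    return 0.5 + np.sum(j ** -4.0 * np.cos(np.pi * j * np.asarray(x)[..., None]), axis=-1)

def simulate(n=1000, rho=0.7, eta=0.4, sigma_u=0.1, seed=0):
    """One sample of (Y, X, W) from the design under H0 with g = g_1."""
    rng = np.random.default_rng(seed)
    zeta, eps, nu = rng.standard_normal((3, n))
    xi = rho * zeta + np.sqrt(1.0 - rho ** 2) * eps   # corr(xi, zeta) = rho
    X, W = Phi(xi), Phi(zeta)
    U = eta * eps + np.sqrt(1.0 - eta ** 2) * nu      # U correlated with eps, hence with X
    Y = g1(X) + sigma_u * U
    return Y, X, W
```

By construction $X$ is endogenous (it shares the $\varepsilon$ shock with $U$) while $W$ is a valid instrument.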
In experiments where $H_0$ is false,

$$g(x) = g_k(x) + \Delta(x);\quad k = 1,2 ,$$

where

$$\Delta(x) = \frac{1}{2d}\,I(0.5 - d < x \le 0.5 + d)$$

and $d = 0.02$ or $d = 0.005$, depending on the experiment. The function $\Delta$ places a rectangular spike in $g$. The spike is centered at $x = 0.5$. Its width is $2d$ and its height is $1/(2d)$. This mimics what happens if $g$ is highly non-smooth or (1.1) has no solution. In particular, if (1.1) has no solution, then as $J\to\infty$ the quantity

$$\sum_{j=1}^{J}\frac{\langle q,\phi_j\rangle}{\lambda_j}\,\phi_j(x)$$

becomes unbounded at one or more points $x\in[0,1]$. This behavior is mimicked by $g_k + \Delta$ as $d$ decreases.
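A minimal sketch of the spike, assuming numpy; the function name is ours:

```python
import numpy as np

def spike(x, d=0.02):
    """Delta(x) = 1/(2d) on (0.5 - d, 0.5 + d]: a rectangular spike of
    width 2d and height 1/(2d) centered at x = 0.5."""
    x = np.asarray(x, dtype=float)
    return ((0.5 - d < x) & (x <= 0.5 + d)) / (2.0 * d)
```

The spike integrates to 1 regardless of $d$, while its height $1/(2d)$ grows without bound as $d\to 0$, which is how it mimics the divergent partial sums above.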
The computation of $\tau_n$ is easiest if the kernel estimator of $f_{XW}$ is replaced by a series expansion that is equivalent up to truncation error. With basis functions $\{\nu_j\}$, this gives

(5.2) $\displaystyle\hat f_{XW}^{(-i)}(x,w) = \sum_{j=1}^{J}\sum_{k=1}^{J} c_{jk}^{(-i)}\nu_j(x)\nu_k(w) ,$

where $J$ is the point at which the series is truncated for computational purposes and

$$c_{jk}^{(-i)} = \frac{1}{(n-1)h^2}\sum_{\ell\ne i}\int_{[0,1]^2}\nu_j(x)\nu_k(w)\,K\!\left(\frac{x - X_\ell}{h},\frac{w - W_\ell}{h}\right)dx\,dw .$$

However, in preliminary Monte Carlo experiments the numerical performance of $\tau_n$ was unaffected by replacing $c_{jk}^{(-i)}$ with its limit as $h\to 0$. This gives the coefficients

(5.3) $\displaystyle c_{jk}^{(-i)} = \frac{1}{n-1}\sum_{\ell\ne i}\nu_j(X_\ell)\nu_k(W_\ell) .$

Accordingly, the Monte Carlo experiments reported here use (5.2) and (5.3) to estimate $f_{XW}$. The basis functions are $\nu_1 = 1$ and $\nu_j(x) = \sqrt{2}\cos[(j-1)\pi x]$ for $j\ge 2$. The series was truncated at $J = 25$. Similarly, the asymptotic critical value of $\tau_n$ was estimated by setting $K_\varepsilon = 25$. The results of the experiments are not sensitive to the choices of $J$ and $K_\varepsilon$. The estimated eigenvalues $\hat\omega_j$ are very close to 0 when $j > 25$.
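The coefficient computation can be sketched as follows. For simplicity the sketch uses the full sample rather than the paper's leave-one-out version with its $(n-1)^{-1}$ normalization; the function names are ours:

```python
import numpy as np

def nu(j, x):
    """Cosine basis: nu_1 = 1 and nu_j(x) = sqrt(2) cos((j - 1) pi x) for j >= 2."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

def c_matrix(X, W, J=25):
    """Sample analog of the coefficients c_jk = E[nu_j(X) nu_k(W)]."""
    NX = np.column_stack([nu(j, X) for j in range(1, J + 1)])  # (n, J)
    NW = np.column_stack([nu(k, W) for k in range(1, J + 1)])  # (n, J)
    return NX.T @ NW / X.shape[0]

rng = np.random.default_rng(3)
X, W = rng.uniform(size=2000), rng.uniform(size=2000)
C = c_matrix(X, W, J=5)
```

With independent uniform $X$ and $W$, the $(1,1)$ entry equals 1 exactly and the remaining entries are close to 0, reflecting the orthonormality of the basis.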
We now discuss the choice of the basis functions, $\{\psi_j\}$, and the truncation parameter, $J_n$, for the series approximation to $g$. Consider, first, the choice of basis functions. Estimation of $g$ presents an ill-posed inverse problem because $T^{-1}$ is a discontinuous operator. One consequence of this is that with samples of moderate size, it is usually possible to estimate only low-order Fourier coefficients $b_j$ with reasonable precision. Therefore, it is desirable to choose basis functions that provide a good low-order approximation to $g$. Demand functions, Engel curves, and earnings functions, among other functions of interest in applied econometrics, are likely to have few inflection points. Functions with few inflection points are often well approximated by low-degree polynomials. In preliminary Monte Carlo experiments, we found that approximating these functions accurately with trigonometric or spline bases requires longer series and leads to much noisier estimates. Accordingly, the experiments reported here use Legendre polynomials (centered and scaled to be orthonormal on $[0,1]$) for the basis functions.
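The centering and scaling mentioned above can be checked numerically; `numpy.polynomial.legendre` is assumed available:

```python
import numpy as np
from numpy.polynomial import legendre

def shifted_legendre(j, x):
    """Degree-j Legendre polynomial mapped to [0, 1] and scaled so that the
    family is orthonormal there: sqrt(2j + 1) * P_j(2x - 1)."""
    coef = np.zeros(j + 1)
    coef[j] = 1.0
    return np.sqrt(2.0 * j + 1.0) * legendre.legval(2.0 * np.asarray(x) - 1.0, coef)

# Numerical check of orthonormality on [0, 1] via a fine Riemann sum.
x = np.linspace(0.0, 1.0, 200001)
gram = np.array([[float(np.mean(shifted_legendre(j, x) * shifted_legendre(k, x)))
                  for k in range(4)] for j in range(4)])
```

The Gram matrix of the first four basis functions is the identity up to discretization error, confirming orthonormality on $[0,1]$.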
Now consider the choice of $J_n$. If $J_n$ is too small, the $\tau_n$ test will tend to reject a true $H_0$ because the truncated series, $\sum_{j=1}^{J_n} b_j\psi_j$, is a poor approximation to $g$. If $J_n$ is too large, $\hat g$ will be a very noisy estimate of $g$, and this, too, will tend to cause rejection of a correct $H_0$. The integrated variance of $\hat g$ is $E\|\hat g - E\hat g\|^2 = \sum_{j=1}^{J_n}\sigma_j^2$, where $\sigma_j^2 = Var(\hat b_j)$ and $\hat b_j$ is the estimator of $b_j$. The variance components $\sigma_j^2$ can be estimated by using the standard formulae of GMM estimation. We have found through Monte Carlo experiments that as $J_n$ increases from 1, $E\|\hat g - E\hat g\|^2$ changes little at first but increases by a factor of 10 or more when $J_n$ crosses a "critical value." This suggests the following heuristic procedure for choosing $J_n$ in applications: choose the largest $J_n$ that does not produce a very large increase in the estimated value of $E\|\hat g - E\hat g\|^2$. In the experiments reported here, the large increase in $E\|\hat g - E\hat g\|^2$ occurs when the degree of the approximating polynomial increases from 3 to 4. Accordingly, the experiments reported here approximate $g$ with a cubic polynomial.
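The heuristic can be sketched as follows. The jump threshold and the variance inputs are hypothetical illustrations; in practice the $\sigma_j^2$ would come from the GMM formulae mentioned above:

```python
import numpy as np

def choose_Jn(sigma2_hat, jump_factor=10.0):
    """Heuristic from the text: the integrated variance of g-hat at a given
    J_n is sum_{j <= J_n} sigma_j^2.  Return the largest J_n reached before
    this sum jumps by a large factor.  jump_factor is an illustrative choice."""
    iv = np.cumsum(np.asarray(sigma2_hat, dtype=float))
    for J in range(1, iv.size):
        if iv[J] > jump_factor * iv[J - 1]:
            return J                # last (1-based) J_n before the jump
    return int(iv.size)

# Hypothetical variance estimates: stable through J_n = 4, exploding at 5.
sigma2 = [0.01, 0.012, 0.011, 0.013, 2.0]
```

On this input the rule selects $J_n = 4$, the last value before the order-of-magnitude increase.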
As in Blundell, Chen, and Kristensen (2007), it is convenient for computational purposes to replace the constrained estimation problem (4.9) with a penalized estimator. However, penalization has little effect on the results of the experiments. Accordingly, the results reported here are based on solving (4.9) without imposing the constraint $\nu\in\mathcal{H}_{ns}$.
The results of the experiments are shown in Table 1. When $H_0$ is true and $g = g_1$, the difference between the empirical and nominal levels of the $\tau_n$ test is very small. The difference is somewhat larger when $g = g_2$ because the error in the cubic polynomial approximation to $g_2$ is larger than the error in the cubic approximation to $g_1$. The use of a quartic or higher-degree polynomial reduces the approximation error when $g = g_2$ but increases $E\|\hat g - E\hat g\|^2$ by a factor of over 100. This illustrates the importance of choosing a basis that provides a good low-order approximation to $g$. The probability of rejecting $H_0$ when it is false is high in all cases. It is higher when $d = 0.02$ than when $d = 0.005$ because there are more data points in the interval containing the spike when $d = 0.02$.
6. Conclusions
This paper has been concerned with uniformly consistent testing of the null hypothesis that the identifying equation of nonparametric IV estimation has a solution. The paper has shown that no test can be uniformly consistent over all probability distributions of $(Y,X,W)$ for which the identifying equation has no solution. No matter how large the sample is, there are alternative distributions that depart from the null hypothesis in extreme ways but against which any test has low power. Uniformly consistent testing is possible if the null and alternative hypotheses are restricted in appropriate ways. In this paper, the null hypothesis is restricted by assuming that any solution to the identifying equation is smooth. The paper has presented a test of the hypothesis that a smooth solution exists. The test is uniformly consistent against a large class of distributions of $(Y,X,W)$ for which no smooth solution exists. Monte Carlo experiments have illustrated the test's finite-sample performance as well as certain limitations of the test. The paper has also outlined several other testing approaches that are more complicated analytically than the one developed here but may have some advantages. The investigation of these tests is left to future research.
7. Appendix: Proofs of Theorems
Proof of Proposition 3.1: $LR_{0n}$ is continuously distributed, so for each $\varepsilon > 0$ there is a $\delta > 0$ such that $P(LR_{0n} \le c_\alpha - \delta) \ge P(LR_{0n} \le c_\alpha) - \varepsilon = 1 - \alpha - \varepsilon$. Choose $J_0$ so that

$$2\Bigl(n\sum_{j=J_0+1}^{\infty}\lambda_j\Bigr)^{1/2} < \delta .$$

Then

$$\begin{aligned}
P(LR_{1n} \le c_\alpha \mid H_1) &= P(LR_{1n} \le c_\alpha)\\
&= P[LR_{0n} + (LR_{1n} - LR_{0n}) \le c_\alpha]\\
&= P[LR_{0n} \le c_\alpha - (LR_{1n} - LR_{0n})]\\
&\ge P(LR_{0n} \le c_\alpha - \delta)\\
&\ge 1 - \alpha - \varepsilon .
\end{aligned}$$

Therefore,

$$P(LR_{1n} > c_\alpha \mid H_1) = 1 - P(LR_{1n} \le c_\alpha \mid H_1) \le \alpha + \varepsilon .$$

Q.E.D.
Proof of Theorem 4.1: We begin with some definitions and preliminary lemmas.
Define

$$g_n = \sum_{j=1}^{J_n} b_j\psi_j .$$

Let $A_n$ be the operator whose kernel is

$$a_n(x,w) = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n} c_{jk}\psi_j(x)\psi_k(w) .$$

Also define

$$\hat a_n(x,w) = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n}\hat c_{jk}\psi_j(x)\psi_k(w) ,$$

$$a(x,w) = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} c_{jk}\psi_j(x)\psi_k(w) ,$$

$m_n = A_n g_n$, and

$$\rho_n = \sup_{h\in\mathcal{H}_{ns}}\frac{\|h\|}{\|(A^*A)^{1/2}h\|} .$$

Let $\mu_j$ denote the $j$'th singular value of $A$. For sequences of numbers $\{c_n\}$ and $\{d_n\}$, write $c_n\asymp d_n$ to mean that $c_n/d_n$ is bounded away from 0 and $\infty$ as $n\to\infty$.
Let jμ denote the j ’th singular value of A . For sequences of numbers { }nc and { }nd , write
n nc d to mean that /n nc d is bounded away from 0 and ∞ as n →∞ .
Lemma 1: (i) rj jμ − , ( )r
n nO Jρ = , ( ) ( )s rn nA g g O J − −− = , and
ˆ ( )r snm Em O J − −− = . (ii) 1/ 2ˆ ( / )n n p na a O J n− = , and (iii) sup ( ) ( )
s
r sn nA A O Jν ν − −
∈ − =H .
Proof: Part (i) is a slight modification of Theorem 3 of Blundell, Chen, and Kristensen
(2007) and is proved the same way as that theorem. Part (ii) follows from Markov’s inequality
and the observation that 2 2, 1
ˆ ˆ( ) ( / )nJn n jk nj k
E a a Var c O J n=
− = =∑ .
Now consider part (iii). Write

$$\nu(x) = \sum_{j=1}^{\infty}\nu_j\psi_j(x) ,$$

where the $\nu_j$'s are Fourier coefficients. We have

$$A\nu = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty}\nu_j c_{jk}\psi_k .$$

Observe that $A\nu$ belongs to a Sobolev space of smoothness $r + s$. Define

$$\tilde A_n\nu = \sum_{j=1}^{\infty}\sum_{k=1}^{J_n}\nu_j c_{jk}\psi_k .$$
Then $\|(A - \tilde A_n)\nu\| \le C_A J_n^{-r-s}$ for some constant $C_A$ not depending on $\nu$. Note that

$$A_n\nu = \sum_{j=1}^{J_n}\sum_{k=1}^{J_n}\nu_j c_{jk}\psi_k .$$

Define

$$\bar\nu(x) = \sum_{j=1}^{J_n}\nu_j\psi_j(x) .$$

Then

$$A\bar\nu = \sum_{j=1}^{J_n}\sum_{k=1}^{\infty}\nu_j c_{jk}\psi_k .$$
Now

$$(A_n - A)\nu = (\tilde A_n - A)\nu - \sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n}\nu_j c_{jk}\psi_k ,$$

so

$$\begin{aligned}
\|(A_n - A)\nu\| &\le \|(\tilde A_n - A)\nu\| + \Bigl\|\sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n}\nu_j c_{jk}\psi_k\Bigr\|\\
&\le R_n + C_A J_n^{-r-s} , \qquad (7.1)
\end{aligned}$$

where

$$R_n = \Bigl\|\sum_{j=J_n+1}^{\infty}\sum_{k=1}^{J_n}\nu_j c_{jk}\psi_k\Bigr\| .$$
Note that

$$R_n^2 = \sum_{k=1}^{J_n}\Bigl(\sum_{j=J_n+1}^{\infty}\nu_j c_{jk}\Bigr)^2 .$$

In addition,

$$A(\nu - \bar\nu) = \sum_{j=J_n+1}^{\infty}\sum_{k=1}^{\infty}\nu_j c_{jk}\psi_k ,$$

so

$$\|A(\nu - \bar\nu)\|^2 = \sum_{k=1}^{\infty}\Bigl(\sum_{j=J_n+1}^{\infty}\nu_j c_{jk}\Bigr)^2$$

and

(7.2) $R_n^2 \le \|A(\nu - \bar\nu)\|^2 .$

Now use the argument on pp. 1664-1665 of Blundell, Chen, and Kristensen (2007) to get

(7.3) $\|A(\nu - \bar\nu)\| = O(J_n^{-r-s}) .$

Finally, combine (7.1)-(7.3) to get the result. Q.E.D.
Lemma 2: $\|\hat g - g_n\| = O_p[J_n^{-s} + J_n^{r}(J_n/n)^{1/2}]$.
Proof: We have

(7.4) $\|\hat g - g_n\| \le \rho_n\|A(\hat g - g_n)\| .$

But $P(\hat A\hat g = \hat m)\to 1$ as $n\to\infty$. Therefore, with probability approaching 1,

$$A(\hat g - g_n) = (\hat m - m_n) + (A - A_n)(\hat g - g_n) + (A_n - \hat A_n)g_n + (A_n - \hat A_n)(\hat g - g_n) .$$

The triangle inequality gives

$$\begin{aligned}
\|A(\hat g - g_n)\| &\le \|\hat m - E\hat m\| + \|E\hat m - m_n\| + \|(A - A_n)(\hat g - g_n)\|\\
&\quad + \|(\hat A_n - A_n)g_n\| + \rho_n\|\hat A_n - A_n\|\,\|A(\hat g - g_n)\| , \qquad (7.5)
\end{aligned}$$

where the last term uses (7.4) to bound $\|(\hat A_n - A_n)(\hat g - g_n)\| \le \|\hat A_n - A_n\|\,\|\hat g - g_n\|$. Straightforward calculations show that $\|(\hat A_n - A_n)g_n\| = O_p[(J_n/n)^{1/2}]$ and $\|\hat m - E\hat m\| = O_p[(J_n/n)^{1/2}]$. Lemma 1 gives the results that $\|(A - A_n)g_n\| = O(J_n^{-r-s})$, $\|(A - A_n)(\hat g - g_n)\| = O_p(J_n^{-r-s})$, $\|E\hat m - m_n\| = O(J_n^{-r-s})$, $\|A(g - g_n)\| = O(J_n^{-r-s})$, and $\rho_n\|\hat A_n - A_n\| = o_p(1)$, so the last term in (7.5) can be absorbed into the left-hand side. Substituting these rates into (7.5) gives

(7.6) $\|A(\hat g - g_n)\| = O_p[J_n^{-r-s} + (J_n/n)^{1/2}] .$

The lemma follows by combining (7.4), (7.6), and Lemma 1(i). Q.E.D.
Proof of Theorem 4.1: By the triangle inequality,

(7.7) $\|\hat g - g\| \le \|\hat g - g_n\| + \|g_n - g\| .$

The theorem follows by combining (7.7) with Lemma 2 and Assumption 4(ii). Q.E.D.
7.1 Proofs of Theorems 4.2 and 4.3
Let $H_0$ be true. Define

$$S_{n1}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1} U_i f_{XW}(z,W_i) ,$$

$$S_{n2}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[g(X_i) - \hat g(X_i)]f_{XW}(z,W_i) ,$$

$$S_{n3}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1} U_i[\hat f_{XW}^{(-i)}(z,W_i) - f_{XW}(z,W_i)] ,$$

and

$$S_{n4}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[g(X_i) - \hat g(X_i)][\hat f_{XW}^{(-i)}(z,W_i) - f_{XW}(z,W_i)] .$$

Then $S_n(z) = \sum_{j=1}^{4} S_{nj}(z)$.
Lemma 3: As $n\to\infty$, $S_{n3}(z) = o_p(1)$ uniformly over $z\in[0,1]$.
Proof: See Lemma 3 of Horowitz (2006). Q.E.D.
Lemma 4: As $n\to\infty$, $S_{n4}(z) = o_p(1)$ uniformly over $z\in[0,1]$.
Proof: Standard properties of kernel estimators give the result that

$$\max_{1\le i\le n}\ \sup_{(x,w)\in[0,1]^2}|\hat f_{XW}^{(-i)}(x,w) - f_{XW}(x,w)| = O\Bigl\{\Bigl[\frac{\log n}{nh^2}\Bigr]^{1/2}\Bigr\}$$

almost surely. Therefore,

$$\sup_{z\in[0,1]}|S_{n4}(z)| \le R_n\sum_{i\in\mathcal{S}_1}|g(X_i) - \hat g(X_i)| ,$$

where $R_n = O[(\log n)^{1/2}/(nh)]$ almost surely. For $i\in\mathcal{S}_1$, $X_i$ is independent of $\hat g$. By Theorem 2.4 of van de Geer (2000), the class of functions $\{|h - g| : h\in\mathcal{H}_s\}$ for fixed $g$ satisfies the conditions of the uniform law of large numbers (van de Geer 2000, Lemma 3.1). Therefore,

$$n^{-1}\sum_{i\in\mathcal{S}_1}|g(X_i) - \hat g(X_i)| \to \int_0^1 |g(x) - \hat g(x)| f_X(x)\,dx$$

almost surely. But

$$\int_0^1 |g(x) - \hat g(x)| f_X(x)\,dx \le C_4\|\hat g - g\|$$

for some constant $C_4 < \infty$ by the Cauchy-Schwarz inequality. Therefore,

$$\sup_{z\in[0,1]}|S_{n4}(z)| = O_p\Bigl[\Bigl(\frac{\log n}{h^2}\Bigr)^{1/2}\|\hat g - g\|\Bigr] = o_p(1) .$$

Q.E.D.
Lemma 5: As $n\to\infty$,

$$S_{n2}(z) = -(2/n)^{1/2}\sum_{i\in\mathcal{S}_2} U_i f_{XW}(z,W_i) + r_n ,$$

where $r_n = o_p(1)$.
Proof: For $z\in[0,1]$ and $\nu\in\mathcal{H}_s$, define the empirical process

$$\begin{aligned}
H_n(z,\nu) &= (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[f_{XW}(z,W_i)\nu(X_i) - E f_{XW}(z,W_i)\nu(X_i)]\\
&= (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[f_{XW}(z,W_i)\nu(X_i) - (T\nu)(z)] .
\end{aligned}$$

$H_n$ is stochastically equicontinuous by Theorem 5.12 of van de Geer (2000). Therefore, because $\hat g\in\mathcal{H}_s$, for each $\varepsilon > 0$ and $\eta > 0$ there is a $\delta > 0$ such that

$$P\Bigl[\sup_{z\in[0,1]}|H_n(z,\hat g) - H_n(z,g)| > \eta\Bigr] < \varepsilon$$

whenever $\|\hat g - g\| < \delta$. Because $\|\hat g - g\| = o_p(1)$, it follows that $H_n(z,\hat g) - H_n(z,g) = o_p(1)$ uniformly over $z\in[0,1]$ and

(7.8) $S_{n2}(z) = -(n/2)^{1/2}[T(\hat g - g)](z) + o_p(1)$

uniformly over $z\in[0,1]$.
Now $\hat A\hat g = \hat m$ with probability approaching 1 as $n\to\infty$. Some algebra now shows that

$$A(\hat g - g) = (\hat m - \hat A g_n) + \hat A(g_n - g) + (A - \hat A)(\hat g - g) = (\hat m - \hat A g_n) + r_{n1} ,$$

where $\|r_{n1}\| = o_p(n^{-1/2})$. But

$$\hat m - \hat A g_n = \sum_{k=1}^{J_n}\Bigl(\hat a_k - \sum_{j=1}^{J_n} b_j\hat c_{jk}\Bigr)\psi_k .$$

By the definitions of $\hat a_k$ and $\hat c_{jk}$,

$$\hat a_k - \sum_{j=1}^{J_n} b_j\hat c_{jk} = (2/n)\sum_{i\in\mathcal{S}_2}\Bigl[Y_i - \sum_{j=1}^{J_n} b_j\psi_j(X_i)\Bigr]\psi_k(W_i) .$$

Define $\Delta g = \sum_{j=1}^{J_n} b_j\psi_j - g$. Then, because $Y_i = g(X_i) + U_i$,

(7.9) $\displaystyle(\hat m - \hat A g_n)(w) = (2/n)\sum_{i\in\mathcal{S}_2}\sum_{k=1}^{J_n} U_i\psi_k(W_i)\psi_k(w) - (2/n)\sum_{i\in\mathcal{S}_2}\sum_{k=1}^{J_n}\Delta g(X_i)\psi_k(W_i)\psi_k(w) .$

Define

$$D_n(w) = (2/n)\sum_{i\in\mathcal{S}_2}\sum_{k=1}^{J_n}\Delta g(X_i)\psi_k(W_i)\psi_k(w) .$$
Then

$$\begin{aligned}
\|D_n\|^2 &= (2/n)^2\sum_{k=1}^{J_n}\Bigl[\sum_{i\in\mathcal{S}_2}\Delta g(X_i)\psi_k(W_i)\Bigr]^2\\
&= (2/n)^2\sum_{k=1}^{J_n}\sum_{i\in\mathcal{S}_2}\Delta g(X_i)^2\psi_k(W_i)^2 + (2/n)^2\sum_{k=1}^{J_n}\sum_{i\in\mathcal{S}_2}\sum_{\substack{j\in\mathcal{S}_2\\ j\ne i}}[\Delta g(X_i)\psi_k(W_i)][\Delta g(X_j)\psi_k(W_j)]\\
&\equiv D_{n1} + D_{n2} .
\end{aligned}$$

Now

$$\begin{aligned}
E D_{n1} &= (2/n)\sum_{k=1}^{J_n}\int_{[0,1]^2}\Delta g(x)^2\psi_k(w)^2 f_{XW}(x,w)\,dx\,dw\\
&\le (2/n)C_f\sum_{k=1}^{J_n}\int_0^1\Delta g(x)^2\,dx\\
&= (2/n)C_f J_n\|\Delta g\|^2\\
&= o(n^{-1}) .
\end{aligned}$$

Therefore, Markov's inequality gives

(7.10) $D_{n1} = o_p(n^{-1}) .$
Moreover,

$$E D_{n2} = [1 - (2/n)]\sum_{k=1}^{J_n}\{E[\Delta g(X)\psi_k(W)]\}^2 .$$

But

$$\begin{aligned}
E[\Delta g(X)\psi_k(W)] &= \int_{[0,1]^2}\Delta g(x)\psi_k(w) f_{XW}(x,w)\,dx\,dw\\
&= \int_0^1\Delta g(x)\delta_k(x)\,dx ,
\end{aligned}$$

where $\delta_k(x) = \int_0^1\psi_k(w) f_{XW}(x,w)\,dw$. Therefore, by the Cauchy-Schwarz inequality,

$$\{E[\Delta g(X)\psi_k(W)]\}^2 \le \|\delta_k\|^2\int_0^1\Delta g(x)^2\,dx ,$$

and

$$E D_{n2} \le [1 - (2/n)]\int_0^1\Delta g(x)^2\,dx\,\sum_{k=1}^{J_n}\|\delta_k\|^2 .$$

But $\delta_k(x)$ is the $k$'th Fourier coefficient of $f_{XW}(x,\cdot)$, and $|f_{XW}(x,w)| \le C_f$. Therefore,

$$\begin{aligned}
E D_{n2} &\le [1 - (2/n)]C_f^2\|\Delta g\|^2\\
&= O(J_n^{-2s})\\
&= o(n^{-1}) .
\end{aligned}$$

Markov's inequality now gives

(7.11) $D_{n2} = o_p(n^{-1}) .$
Combining (7.9)-(7.11) yields

(7.12) $\displaystyle(\hat m - \hat A g_n)(w) = (2/n)\sum_{i\in\mathcal{S}_2}\sum_{k=1}^{J_n} U_i\psi_k(W_i)\psi_k(w) + r_{n2} ,$

where $\|r_{n2}\| = o_p(n^{-1/2})$.
Now

$$T(\hat g - g) = A^*A(\hat g - g) ,$$

so

(7.13) $T(\hat g - g) = A^*(\hat m - \hat A g_n) + r_{n3} ,$

where $\|r_{n3}\| = o_p(n^{-1/2})$. By (7.12),

$$[A^*(\hat m - \hat A g_n)](z) = (2/n)\sum_{i\in\mathcal{S}_2}\sum_{k=1}^{J_n} U_i (A^*\psi_k)(z)\psi_k(W_i) + r_{n4} ,$$

where $\|r_{n4}\| = o_p(n^{-1/2})$. Now

$$\begin{aligned}
(A^*\psi_k)(z) &= \int_0^1 f_{XW}(z,w)\psi_k(w)\,dw\\
&= \sum_{j=1}^{\infty} c_{jk}\psi_j(z) .
\end{aligned}$$

Therefore,

$$[A^*(\hat m - \hat A g_n)](z) = (2/n)\sum_{i\in\mathcal{S}_2}\sum_{j=1}^{\infty}\sum_{k=1}^{J_n} U_i c_{jk}\psi_j(z)\psi_k(W_i) + r_{n4} .$$

Define $\Delta f_{XW}(x,w) = \sum_{j=1}^{\infty}\sum_{k=1}^{J_n} c_{jk}\psi_j(x)\psi_k(w) - f_{XW}(x,w)$. Then
$$\begin{aligned}
[A^*(\hat m - \hat A g_n)](z) &= (2/n)\sum_{i\in\mathcal{S}_2} U_i f_{XW}(z,W_i) + (2/n)\sum_{i\in\mathcal{S}_2} U_i\,\Delta f_{XW}(z,W_i) + r_{n4}\\
&\equiv (2/n)\sum_{i\in\mathcal{S}_2} U_i f_{XW}(z,W_i) + r_{n5}(z) + r_{n4} .
\end{aligned}$$

Note that

$$\int_{[0,1]^2}[\Delta f_{XW}(x,w)]^2\,dx\,dw = O(J_n^{-2r}) .$$

Therefore,

$$E\|r_{n5}\|^2 = (2/n)\int_{[0,1]^2}\sigma^2(w)[\Delta f_{XW}(x,w)]^2 f_W(w)\,dx\,dw = O(n^{-1}J_n^{-2r}) ,$$

where $\sigma^2(w) = E(U^2\mid W = w)$, and

(7.14) $\displaystyle[A^*(\hat m - \hat A g_n)](z) = (2/n)\sum_{i\in\mathcal{S}_2} U_i f_{XW}(z,W_i) + r_{n6} ,$

where $\|r_{n6}\| = o_p(n^{-1/2})$. Combining (7.8), (7.13), and (7.14) yields the lemma. Q.E.D.
Proof of Theorem 4.2: Define $U_i^* = U_i$ if $i\in\mathcal{S}_1$ and $U_i^* = -U_i$ if $i\in\mathcal{S}_2$. Define

$$B_n(z) = (2/n)^{1/2}\sum_{i=1}^{n} U_i^* f_{XW}(z,W_i) .$$

It follows from Lemmas 3-5 that $\tau_n$ is asymptotically distributed as $\|B_n\|^2$, which is a degenerate $U$-statistic of order 2. The theorem follows from the asymptotic distribution of such a statistic. See, for example, Serfling (1980, pp. 193-194). Q.E.D.
Proof of Theorem 4.3: $|\hat\omega_j - \bar\omega_j| = O(\|\hat\Omega - \bar\Omega\|)$ by Theorem 5.1a of Bhatia, Davis, and McIntosh (1983). Moreover, standard calculations for kernel density estimators show that $\|\hat\Omega - \bar\Omega\| = O\{[(\log n)/(nh^2)]^{1/2}\}$ almost surely. Part (i) of the theorem follows by combining these two results. Part (ii) is a consequence of Assumption 4(ii). Part (iii) follows by combining parts (i) and (ii). Q.E.D.
7.2 Proofs of Theorems 4.4 and 4.5
Redefine

$$S_{n1}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[Y_i - g(X_i)]f_{XW}(z,W_i) ,$$

$$S_{n2}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[g(X_i) - \hat g(X_i)]f_{XW}(z,W_i) ,$$

$$S_{n3}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[Y_i - g(X_i)][\hat f_{XW}^{(-i)}(z,W_i) - f_{XW}(z,W_i)] ,$$

and

$$S_{n4}(z) = (2/n)^{1/2}\sum_{i\in\mathcal{S}_1}[g(X_i) - \hat g(X_i)][\hat f_{XW}^{(-i)}(z,W_i) - f_{XW}(z,W_i)] .$$
Proof of Theorem 4.4: It suffices to show that under $H_1$,

$$\operatorname*{plim}_{n\to\infty}\, n^{-1}\tau_n > 0 .$$

Arguments like those leading to (7.8) show that $n^{-1}\|S_{n2}\|^2 = o_p(1)$. It is clear that $n^{-1}\|S_{n3}\|^2 = o_p(1)$ and $n^{-1}\|S_{n4}\|^2 = o_p(1)$. Therefore, $n^{-1}\tau_n = n^{-1}\|S_{n1}\|^2 + o_p(1)$. But $n^{-1/2}S_{n1}(z)\to (A^*m - Tg)(z)$ almost surely uniformly over $z\in[0,1]$ by Jennrich's (1969) uniform strong law of large numbers. Therefore, $n^{-1}\tau_n\to_p\|A^*m - Tg\|^2$. The theorem follows from the fact that $\|A^*m - Tg\|^2 > 0$ under $H_1$. Q.E.D.
Proof of Theorem 4.5: We prove (4.11). The proof of (4.12) is similar. Define $D_n = S_{n2} + S_{n4} + E(S_{n1} + S_{n3})$ and $\tilde S_n = S_n - D_n$. Then $\tau_n = \|\tilde S_n + D_n\|^2$. Use the inequality

(7.15) $\|a\|^2 \ge 0.5\|b\|^2 - \|b - a\|^2$

with $a = S_n$ and $b = D_n$ to obtain

$$P(\tau_n > z_\alpha) \ge P\bigl(0.5\|D_n\|^2 - \|\tilde S_n\|^2 > z_\alpha\bigr) .$$

For any finite $M > 0$,

$$\begin{aligned}
P\bigl(0.5\|D_n\|^2 - \|\tilde S_n\|^2 \le z_\alpha\bigr) &= P\bigl(0.5\|D_n\|^2 \le z_\alpha + \|\tilde S_n\|^2,\ \|\tilde S_n\|^2 \le M\bigr)\\
&\quad + P\bigl(0.5\|D_n\|^2 \le z_\alpha + \|\tilde S_n\|^2,\ \|\tilde S_n\|^2 > M\bigr)\\
&\le P\bigl(0.5\|D_n\|^2 \le z_\alpha + M\bigr) + P\bigl(\|\tilde S_n\|^2 > M\bigr) .
\end{aligned}$$

$\|\tilde S_n\|$ is bounded in probability uniformly over $\mathcal{F}_{nc}$. Therefore, for each $\varepsilon > 0$ there is an $M_\varepsilon < \infty$ such that for all $M > M_\varepsilon$,

$$P\bigl(0.5\|D_n\|^2 - \|\tilde S_n\|^2 \le z_\alpha\bigr) \le P\bigl(0.5\|D_n\|^2 \le z_\alpha + M\bigr) + \varepsilon .$$

Equivalently,

$$P\bigl(0.5\|D_n\|^2 - \|\tilde S_n\|^2 > z_\alpha\bigr) \ge P\bigl(0.5\|D_n\|^2 > z_\alpha + M\bigr) - \varepsilon ,$$

and

(7.16) $P(\tau_n > z_\alpha) \ge P\bigl(0.5\|D_n\|^2 > z_\alpha + M\bigr) - \varepsilon .$

Now a further application of (7.15) with $a = D_n$ and $b = E(S_{n1} + S_{n3})$ gives

$$\|D_n\|^2 \ge 0.5\|E(S_{n1} + S_{n3})\|^2 - \|S_{n2} + S_{n4}\|^2 .$$

Some algebra shows that $\|S_{n2} + S_{n4}\|^2 = O_p(1)$ as $n\to\infty$, $ES_{n1}(z) = (n/2)^{1/2}(Tg - A^*m)(z)$, and $\|ES_{n3}\| = O\bigl(n^{1/2}h^r\|Ag - m\|\bigr)$. Therefore,

(7.17) $\|D_n\|^2 \ge 0.125\,n\|Tg - A^*m\|^2 + O_p(1)$

uniformly over $\mathcal{F}_{nc}$ as $n\to\infty$. Inequality (4.11) follows by substituting (7.17) into (7.16) and choosing $C$ to be sufficiently large. Q.E.D.
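For completeness, inequality (7.15) is elementary. Writing $b = a + (b - a)$ and applying $\|u + v\|^2 \le 2\|u\|^2 + 2\|v\|^2$ gives

```latex
\|b\|^2 = \|a + (b - a)\|^2 \le 2\|a\|^2 + 2\|b - a\|^2
\quad\Longrightarrow\quad
\|a\|^2 \ge 0.5\,\|b\|^2 - \|b - a\|^2 .
```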
REFERENCES
Bhatia, R., C. Davis, and A. McIntosh (1983). Perturbation of spectral subspaces and solution of linear operator equations, Linear Algebra and Its Applications, 52/53, 45-67.
Blundell, R., X. Chen, and D. Kristensen (2007). Semi-nonparametric IV estimation of shape-invariant Engel curves, Econometrica, 75, 1613-1669.
Chen, X. and D. Pouzo (2008). Estimation of nonparametric conditional moment models with possibly nonsmooth moments, working paper, Department of Economics, Yale University, New Haven, CT.
Chernozhukov, V., P. Gagliardini, and O. Scaillet (2008). Nonparametric instrumental variable estimation of quantile structural effects, working paper, Department of Economics, Massachusetts Institute of Technology, Cambridge, MA.
Darolles, S., J.-P. Florens, and E. Renault (2006). Nonparametric instrumental regression, working paper, GREMAQ, University of Social Science, Toulouse, France.
Fan, Y. and Q. Li (1996). Consistent model specification tests: omitted variables and semiparametric functional forms, Econometrica, 64, 865-890.
Gasser, T. and H.G. Müller (1979). Kernel estimation of regression functions, in Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, 757, 23-68, New York: Springer.
Gasser, T., H.G. Müller, and V. Mammitzsch (1985). Kernels for nonparametric curve estimation, Journal of the Royal Statistical Society Series B, 47, 238-252.
Härdle, W. and E. Mammen (1993). Comparing nonparametric versus parametric regression fits, Annals of Statistics, 21, 1926-1947.
Hall, P. and J.L. Horowitz (2005). Nonparametric methods for inference in the presence of instrumental variables, Annals of Statistics, 33, 2904-2929.
Horowitz, J.L. (2006). Testing a parametric model against a nonparametric alternative with identification through instrumental variables, Econometrica, 74, 521-538.
Horowitz, J.L. and S. Lee (2007). Nonparametric instrumental variables estimation of a quantile regression model, Econometrica, 75, 1191-1208.
Horowitz, J.L. and V.G. Spokoiny (2001). An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative, Econometrica, 69, 599-631.
Jennrich, R.I. (1969). Asymptotic properties of non-linear least squares estimators, Annals of Mathematical Statistics, 40, 633-643.
Kress, R. (1999). Linear Integral Equations, 2nd edition, New York: Springer-Verlag.
Newey, W.K. and J.L. Powell (2003). Instrumental variable estimation of nonparametric models, Econometrica, 71, 1565-1578.
Newey, W.K., J.L. Powell, and F. Vella (1999). Nonparametric estimation of triangular simultaneous equations models, Econometrica, 67, 565-603.
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics, New York: Wiley.
van de Geer, S. (2000). Empirical Processes in M-Estimation, Cambridge, U.K.: Cambridge University Press.
TABLE 1: RESULTS OF MONTE CARLO EXPERIMENTS

                 g under $H_0$     d        Empirical Rejection Probability
 $H_0$ True      $g_1$             n.a.     0.053
                 $g_2$             n.a.     0.070
 $H_0$ False     $g_1$             0.02     0.895
                 $g_1$             0.005    0.795
                 $g_2$             0.02     0.896
                 $g_2$             0.005    0.799
Figure 1a: Graph of $g_1(x)$
Figure 1b: Graph of $g_2(x)$