DISCUSSION PAPER SERIES
Forschungsinstitut zur Zukunft der Arbeit
Institute for the Study of Labor

Additive Nonparametric Regression in the Presence of Endogenous Regressors
IZA DP No. 8144
April 2014
Deniz Ozabaci
Daniel J. Henderson
Liangjun Su
Additive Nonparametric Regression in the
Presence of Endogenous Regressors
Deniz Ozabaci
State University of New York, Binghamton
Daniel J. Henderson
University of Alabama and IZA
Liangjun Su
Singapore Management University
Discussion Paper No. 8144 April 2014
IZA
P.O. Box 7240 53072 Bonn
Germany
Phone: +49-228-3894-0 Fax: +49-228-3894-180
E-mail: [email protected]
Any opinions expressed here are those of the author(s) and not those of IZA. Research published in this series may include views on policy, but the institute itself takes no institutional policy positions. The IZA research network is committed to the IZA Guiding Principles of Research Integrity.

The Institute for the Study of Labor (IZA) in Bonn is a local and virtual international research center and a place of communication between science, politics and business. IZA is an independent nonprofit organization supported by Deutsche Post Foundation. The center is associated with the University of Bonn and offers a stimulating research environment through its international network, workshops and conferences, data service, project support, research visits and doctoral program. IZA engages in (i) original and internationally competitive research in all fields of labor economics, (ii) development of policy concepts, and (iii) dissemination of research results and concepts to the interested public.

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character. A revised version may be available directly from the author.
ABSTRACT
Additive Nonparametric Regression in the Presence of Endogenous Regressors
In this paper we consider nonparametric estimation of a structural equation model under a full additivity constraint. We propose estimators for both the conditional mean and gradient which are consistent, asymptotically normal, oracle efficient and free from the curse of dimensionality. Monte Carlo simulations support the asymptotic developments. We employ a partially linear extension of our model to study the relationship between child care and cognitive outcomes. Some of our (average) results are consistent with the literature (e.g., negative returns to child care when mothers have higher levels of education). However, as our estimators allow for heterogeneity both across and within groups, we are able to contradict many findings in the literature (e.g., we do not find any significant differences in returns between boys and girls or for formal versus informal child care).

JEL Classification: C14, C36, I21, J13

Keywords: additive regression, endogeneity, generated regressors, oracle estimation, structural equation, child care, nonparametric regression, test scores

Corresponding author:
Daniel J. Henderson
Department of Economics, Finance and Legal Studies
University of Alabama
Tuscaloosa, AL 35487-0224
USA
E-mail: [email protected]
1 Introduction
Nonparametric and semiparametric estimation of structural equation models is becoming in-
creasingly popular in the literature (e.g., Ai and Chen, 2003; Chen and Pouzo, 2012; Darolles et
al., 2011; Gao and Phillips, 2013; Hall and Horowitz, 2005; Martins-Filho and Yao, 2012; Newey
and Powell, 2003; Newey et al., 1999; Pinkse, 2000; Roehrig, 1988; Su and Ullah, 2008; Su et
al., 2013; Vella, 1991). In this paper, we are interested in improving efficiency by imposing a full
additivity constraint on each equation. Our starting point is the triangular system in Newey et
al. (1999). While the assumptions of their model are relatively restrictive as compared to other
examples in the literature, their estimator is typically easier to implement, which is useful for
applied work. While many existing estimators allow for full flexibility, they also suffer from the
curse of dimensionality.
To combat the curse, we impose an additivity constraint on each stage and propose a three
step estimation procedure for our additively separable nonparametric structural equation model.
We employ series/sieve estimators for our first two-stages. The first-stage involves separate
(additive) regressions of each endogenous regressor on each of the exogenous regressors in order
to obtain consistent estimates of the residuals. These residuals are used in our second-stage
regression where we perform a single (additive) regression of our response variable on each of
the endogenous regressors (not their predictions), the “included” exogenous regressors and each
of the residuals from the first-stage regressions. Our final-step (one stage backfitting) involves
(univariate) local-linear kernel regressions to estimate the conditional mean and gradient of each
of our additive components. This process allows our final-stage estimators to be free from the
curse of dimensionality. Further, our estimators have the oracle property. In other words, each
additive component can be estimated with the same asymptotic accuracy as if all the other
components in the regression model were known up to a location parameter (e.g., see Henderson
and Parmeter, 2014, Horowitz, 2014, or Li and Racine, 2007).
We prove that our conditional mean and gradient estimates are consistent and asymptoti-
cally normal. We provide the uniform convergence rate for the additive components and their
gradients. Our theoretical findings show that our final-stage estimator has asymptotic bias and
variance equivalent to those of a one-dimensional nonparametric local-linear regression estimator. We further propose a partially linear extension of our model. We argue that the parametric components can be estimated at the parametric root-$n$ rate and conclude that our estimates of
the additive components and associated gradients remain unaffected in the asymptotic sense.
Finite sample results for each of our proposed estimators are analyzed via a set of Monte Carlo
simulations and support the asymptotic developments.
To showcase our estimators with empirical data, we consider an application relating
child care use to cognitive outcomes for children (controlling for likely endogeneity). Specifically,
we use the data in Bernal and Keane (2011) to examine the relationship between child test
scores (our cognitive outcome) from single mothers and cumulative child care (both formal and
informal). The extensive set of instrumental variables in the data set allows us to have a stronger
set of instruments than what is typically used in the literature and our more flexible (partially
linear) estimator leads to more insights as we can exploit the heterogeneity present both between
and within groups (e.g., male versus female children).
Our empirical results show both similarities and differences from the existing literature.
When we look at the average values of our estimates, we find similar results to those in Bernal
and Keane (2011). On average we find mostly positive returns (to test scores) from marginal
changes in income, mother’s education and AFQT score. However, the mean is but one point
estimate. When we check the distribution of the estimated returns, we see that the main reason
behind lower returns to child care use is the amount of cumulative child care rather than the
type. Specifically, we show that as the amount of child care use increases, additional units of child
care lead to even lower returns. Bernal and Keane (2011) argue that those who use informal
child care (versus formal) and girls (versus boys) receive lower returns. Our distributions of
returns show no significant differences between these groups. We also find evidence of both
positive and negative returns to child care use. When we analyze the characteristics of the
children in each group (positive versus negative returns to child care), we find that children with
negative returns are those whose mothers have higher levels of education, experience and AFQT scores. Conversely, those children with positive returns typically have mothers with lower levels of education, experience and AFQT scores.
The paper is organized as follows. Section 2 describes our methodology whereas the third
section presents the asymptotic results. Section 4 considers an extension to a partially linear
model and Section 5 examines the finite sample performance of our estimators via Monte Carlo
simulations. The sixth section gives the empirical application and the final section concludes.
All the proofs of the main theorems are relegated to the appendix. Additional proofs for the
technical lemmas are provided in the online supplemental material.
Notation. For a real matrix $A$, we denote its transpose as $A'$, its Frobenius norm as $\|A\| \ (\equiv [\mathrm{tr}(AA')]^{1/2})$ and its spectral norm as $\|A\|_{sp} \ (\equiv \sqrt{\lambda_{\max}(AA')})$, where $\mathrm{tr}(\cdot)$ is the trace operator, $\equiv$ means "is defined as" and $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a real symmetric matrix (similarly, $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a real symmetric matrix). Note that the two norms are equal when $A$ is a vector. For any function $f(\cdot)$ defined on the real line, we use $\dot f(\cdot)$ and $\ddot f(\cdot)$ to denote its first and second derivatives, respectively. We use $\xrightarrow{d}$ and $\xrightarrow{p}$ to denote convergence in distribution and probability, respectively.
2 Methodology
In this section, we introduce our model and then propose a three-step estimation procedure that
is a combination of both series and kernel methods.
2.1 Model
We start with the basic set-up of Newey et al. (1999). They consider a triangular system of the
following form:

$Y = m(X, Z_1) + \varepsilon, \quad X = g(Z_1, Z_2) + U, \quad E(U \mid Z_1, Z_2) = 0, \quad E(\varepsilon \mid Z_1, Z_2, U) = E(\varepsilon \mid U), \quad (2.1)$

where $X = (X_1, \dots, X_d)'$ is a $d \times 1$ vector of endogenous regressors, $Z_1 = (Z_{11}, \dots, Z_{1d_1})'$ is a $d_1 \times 1$ vector of "included" exogenous regressors, $Z_2 \equiv (Z_{21}, \dots, Z_{2d_2})'$ is a $d_2 \times 1$ vector of "excluded" exogenous regressors, $m(\cdot, \cdot)$ denotes the true unknown structural function of interest, $g \equiv (g_1, \dots, g_d)'$ is a $d \times 1$ vector of smooth functions of the instruments $Z_1$ and $Z_2$, and $\varepsilon$ and $U \equiv (U_1, \dots, U_d)'$ are error terms. Newey et al. (1999) are interested in estimating $m(\cdot, \cdot)$ consistently.

Newey et al. (1999) show that $m(\cdot, \cdot)$ can be identified up to an additive constant under the key identification conditions that $E(U \mid Z_1, Z_2) = 0$ and $E(\varepsilon \mid Z_1, Z_2, U) = E(\varepsilon \mid U)$. If these conditions hold, then

$E(Y \mid X, Z_1, Z_2, U) = m(X, Z_1) + E(\varepsilon \mid X, Z_1, Z_2, U) = m(X, Z_1) + E(\varepsilon \mid Z_1, Z_2, U) = m(X, Z_1) + E(\varepsilon \mid U). \quad (2.2)$
If $U$ were observed, this would be a standard additive nonparametric regression model. However, in practice, $U$ is not observed and needs to be replaced by a consistent estimate. This motivates Su and Ullah (2008) to consider a three-stage procedure to obtain consistent estimates of $m(\cdot, \cdot)$ via local-polynomial regressions. In the first-stage, they regress $X$ on $(Z_1, Z_2)$ via local-polynomial regression and obtain the residuals $\hat U$ from this first-stage reduced-form regression. In the second-stage, they estimate $E(Y \mid X, Z_1, U)$ via another local-polynomial regression by regressing $Y$ on $X$, $Z_1$ and $\hat U$. In the third-stage, they obtain the estimates of $m(x, z_1)$ via the method of marginal integration. Unlike previous works in the literature, including Newey et al. (1999), Pinkse (2000) and Newey and Powell (2003), which are based upon two-stage series approximations and only establish mean square and uniform convergence, they establish the asymptotic distribution for their three-step local-polynomial estimator.
There are two drawbacks associated with the estimator of Su and Ullah (2008). First, it is subject to the notorious "curse of dimensionality". Without any extra restriction, the convergence rates of their second and third-stage estimators depend on $2d + d_1$ and $d + d_1$, respectively, which can be quite slow if either $d$ or $d_1$ is not small. As a result, their estimates may perform badly even for moderately large sample sizes when $d + d_1 \geq 3$. Second, their estimator does not have the oracle property which an optimal estimator of the additive components in a nonparametric regression model should exhibit. In this paper we try to address both issues.
To alleviate the curse of dimensionality problem, we propose to impose some amount of structure on $m(X, Z_1)$, $E(\varepsilon \mid U)$ and $g_l(Z_1, Z_2)$, where $l = 1, \dots, d$. Specifically, we assume that $E(\varepsilon) = 0$ and the above nonparametric objects have additive forms:

$m(X, Z_1) = c_m + m_1(X_1) + \cdots + m_d(X_d) + m_{d+1}(Z_{11}) + \cdots + m_{d+d_1}(Z_{1d_1}),$
$E(\varepsilon \mid U) = c_\varepsilon + m_{d+d_1+1}(U_1) + \cdots + m_{2d+d_1}(U_d),$ and
$g_l(Z_1, Z_2) = a_l + g_{l1}(Z_{11}) + \cdots + g_{ld_1}(Z_{1d_1}) + g_{l,d_1+1}(Z_{21}) + \cdots + g_{l,d_z}(Z_{2d_2}),$

where $l = 1, \dots, d$ and $d_z = d_1 + d_2$. Consequently, we have

$E(Y \mid X, Z_1, Z_2, U) = c + m_1(X_1) + \cdots + m_d(X_d) + m_{d+1}(Z_{11}) + \cdots + m_{d+d_1}(Z_{1d_1}) + m_{d+d_1+1}(U_1) + \cdots + m_{2d+d_1}(U_d) \equiv m^*(X, Z_1, U), \quad (2.3)$

where $c = c_m + c_\varepsilon$. Note that the $m_j(\cdot)$'s are not fully identified without further restriction. Depending on the method that is used to estimate the additive components, different identification conditions can be imposed. For example, for the method of marginal integration, a convenient set of identification conditions would be that each additive component (other than $c$) in (2.3) has expectation zero.
Horowitz (2014) reviews methods for estimating nonparametric additive models, including
the backfitting method, the marginal integration method, the series method and the mixture of
a series method and a backfitting method to obtain oracle efficiency. It is well known that it is
more difficult to study the asymptotic property of the backfitting estimator than the marginal
integration estimator, but the latter has a curse of dimensionality problem if additivity is not
imposed at the outset of estimation as in conventional kernel methods. Other problems that
are associated with the marginal integration estimator include its lack of oracle property and
its heavy computational burden. Kim et al. (1999) try to address the latter two problems by
proposing a fast instrumental variable (IV) pilot estimator. However, they cannot avoid the
curse of dimensionality problem. In fact, their IV pilot estimator depends on the estimation of
the density function of the regressors at all data points. In addition, their paper ignores the
notorious boundary bias problem for kernel density estimates and because their IV pilot estimate
is not uniformly consistent on the full support, they have to use a trimming scheme to obtain
the second-stage oracle estimator. To overcome the curse of dimensionality problem, Horowitz
and Mammen (2004) propose a two-step estimation procedure with series estimation of the
nonparametric additive components followed by a backfitting step that turns the series estimates
into kernel estimates that are both oracle efficient and free of the curse of dimensionality.
Below we follow the lead of Horowitz and Mammen (2004) and propose a three-stage estima-
tion procedure that is computationally efficient, oracle efficient and fully overcomes the curse of
dimensionality. We shall adopt the following identification restrictions: $m_j(0) = \dot m_j(v)\big|_{v=0} = 0$ for $j = 1, \dots, 2d + d_1$, and $g_{lj}(0) = 0$ for $l = 1, \dots, d$ and $j = 1, \dots, d_z$. Similar identification conditions are also adopted in Li (2000). The difference between our model and theirs is that
their model does not allow for endogenous regressors. This complicates our problem relative to
theirs as the endogeneity requires us to replace the unobserved errors in the second-stage with
residuals. Hence, we need to take care of the additional bias factor from the first-step. Further,
we also analyze the gradients of the additive nonparametric components.
2.2 Estimation
Given a random sample of $n$ observations $\{Y_i, X_i, Z_{1i}, Z_{2i}\}_{i=1}^{n}$, where $X_i = (X_{1i}, \dots, X_{di})'$, $Z_{1i} = (Z_{11,i}, \dots, Z_{1d_1,i})'$ and $Z_{2i} = (Z_{21,i}, \dots, Z_{2d_2,i})'$, we propose the following three-stage estimation procedure:

1. For $l = 1, \dots, d$, let $\tilde a_l$, $\tilde g_{lj}(z_{1j})$, $j = 1, \dots, d_1$, and $\tilde g_{l,d_1+j}(z_{2j})$, $j = 1, \dots, d_2$, denote the series estimates of $a_l$, $g_{lj}(z_{1j})$, $j = 1, \dots, d_1$, and $g_{l,d_1+j}(z_{2j})$, $j = 1, \dots, d_2$, in the nonparametric additive regression
$X_{li} = a_l + g_{l1}(Z_{11,i}) + \cdots + g_{ld_1}(Z_{1d_1,i}) + g_{l,d_1+1}(Z_{21,i}) + \cdots + g_{l,d_z}(Z_{2d_2,i}) + U_{li}.$
Let $\tilde U_{li} \equiv X_{li} - \tilde a_l - \tilde g_{l1}(Z_{11,i}) - \cdots - \tilde g_{ld_1}(Z_{1d_1,i}) - \tilde g_{l,d_1+1}(Z_{21,i}) - \cdots - \tilde g_{l,d_z}(Z_{2d_2,i})$ for $l = 1, \dots, d$ and $i = 1, \dots, n$.

2. Estimate $m_j(x_j)$, $j = 1, \dots, d$, $m_{d+j}(z_{1j})$, $j = 1, \dots, d_1$, and $m_{d+d_1+l}(\tilde u_l)$, $l = 1, \dots, d$, in the following additive regression model
$Y_i = c + m_1(X_{1i}) + \cdots + m_d(X_{di}) + m_{d+1}(Z_{11,i}) + \cdots + m_{d+d_1}(Z_{1d_1,i}) + m_{d+d_1+1}(\tilde U_{1i}) + \cdots + m_{2d+d_1}(\tilde U_{di}) + \varepsilon_i$
by the series method. Denote the estimates as $\tilde c$, $\tilde m_j(x_j)$, $j = 1, \dots, d$, $\tilde m_{d+j}(z_{1j})$, $j = 1, \dots, d_1$, and $\tilde m_{d+d_1+l}(\tilde u_l)$, $l = 1, \dots, d$.

3. Estimate $m_1(x_1)$ and its first-order derivative by the local-linear regression of $\tilde Y_{1i} = Y_i - \tilde c - \tilde m_2(X_{2i}) - \cdots - \tilde m_d(X_{di}) - \tilde m_{d+1}(Z_{11,i}) - \cdots - \tilde m_{d+d_1}(Z_{1d_1,i}) - \tilde m_{d+d_1+1}(\tilde U_{1i}) - \cdots - \tilde m_{2d+d_1}(\tilde U_{di})$ on $X_{1i}$. Estimates of the other additive components in (2.3) and their first-order derivatives are obtained analogously.
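To make the three stages concrete, here is a minimal numerical sketch. It is illustrative only: it simulates DGP 1 from Section 5, uses a low-order polynomial series in place of the paper's cubic B-splines, and the helper names (`poly_basis`, `ols_fit`, `local_linear`) are our own, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# DGP 1 of Section 5: Y = sin(X) + sin(Z1) + eps, X = sin(Z1) + sin(Z2) + U
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
u, eps = rng.normal(0, 1, n), rng.normal(0, 1, n)
x = np.sin(z1) + np.sin(z2) + u
y = np.sin(x) + np.sin(z1) + eps

def poly_basis(v, K=4):
    """Series terms v, v^2, ..., v^K for one additive component."""
    return np.column_stack([v**k for k in range(1, K + 1)])

def ols_fit(B, resp):
    """Least-squares fit with an intercept prepended to the basis B."""
    D = np.column_stack([np.ones(len(resp)), B])
    coef, *_ = np.linalg.lstsq(D, resp, rcond=None)
    return coef

# Stage 1: additive series regression of X on (Z1, Z2); keep the residuals.
B1 = np.column_stack([poly_basis(z1), poly_basis(z2)])
a = ols_fit(B1, x)
u_hat = x - np.column_stack([np.ones(n), B1]) @ a

# Stage 2: additive series regression of Y on (X, Z1, u_hat).
Bx, Bz, Bu = poly_basis(x), poly_basis(z1), poly_basis(u_hat)
b = ols_fit(np.column_stack([Bx, Bz, Bu]), y)

# Stage 3: local-linear regression of the partial residual on X1 yields
# the final estimates of m_1 and its gradient at a point x0.
K = Bx.shape[1]
partial = y - b[0] - np.column_stack([Bz, Bu]) @ b[1 + K:]

def local_linear(x0, xs, ys, h):
    w = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local-linear design
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ ys)
    return beta                                       # [m1_hat(x0), gradient]

h = 1.06 * x.std() * n ** (-0.2)                      # Silverman rule of thumb
m1_hat, grad_hat = local_linear(np.median(x), x, partial, h)
```

Up to the additive constant (pinned down by the identification restriction $m_1(0) = 0$ in the paper), `m1_hat` tracks $\sin(x_0)$ and `grad_hat` tracks $\cos(x_0)$ as $n$ grows.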
In relation to Horowitz and Mammen (2004), the above first-stage is new, as we have to replace the unobservable $U_{li}$'s by their consistent estimates in the second-stage. In addition,
Horowitz and Mammen (2004) are only interested in estimation of the nonparametric additive
components themselves, while we are also interested in estimating the first-order derivatives
(gradients). Alternatively, we could follow Kim et al. (1999) and use the kernel estimator in
the first two-stages. The oracle estimator of Kim et al. (1999) has gained popularity in recent
years. For example, Ozabaci and Henderson (2012) obtain the gradients of their estimator for
the local-constant case and Martins-Filho and Yang (2007) consider the local-linear version of
the oracle estimator, both assuming strictly exogenous regressors. However, as mentioned above,
using the kernel estimators in the first two-stages here has several disadvantages and does not
avoid the curse of dimensionality problem.
For notational simplicity, let $W = (X', Z_1', U')'$ and $w = (x', z_1', u')'$, where, e.g., $u = (u_1, \dots, u_d)'$ denotes a realization of $U$. We shall use $\mathcal{Z} \equiv \mathcal{Z}_1 \times \mathcal{Z}_2$ and $\mathcal{W} \equiv \mathcal{X} \times \mathcal{Z}_1 \times \mathcal{U}$ to denote the supports of $(Z_1, Z_2)$ and $W$, respectively. Let $\{p_j(\cdot), j = 1, 2, \dots\}$ denote a sequence of basis functions. Let $\kappa_1 = \kappa_1(n)$ and $\kappa = \kappa(n)$ be some integers such that $\kappa_1, \kappa \to \infty$ as $n \to \infty$. Let $p^{\kappa_1}(v) \equiv [p_1(v), \dots, p_{\kappa_1}(v)]'$. Define

$P_1(z_1, z_2) \equiv [1, p^{\kappa_1}(z_{11})', \dots, p^{\kappa_1}(z_{1d_1})', p^{\kappa_1}(z_{21})', \dots, p^{\kappa_1}(z_{2d_2})']'$ and
$\Phi(w) \equiv [1, p^{\kappa}(x_1)', \dots, p^{\kappa}(x_d)', p^{\kappa}(z_{11})', \dots, p^{\kappa}(z_{1d_1})', p^{\kappa}(u_1)', \dots, p^{\kappa}(u_d)']'.$

For each $(z_1, z_2) \in \mathcal{Z}$, we approximate $g_l(z_1, z_2)$ and $m^*(w)$ by $P_1(z_1, z_2)'\alpha_l$ and $\Phi(w)'\beta$, respectively, for $l = 1, \dots, d$, where $\alpha_l \equiv (a_l, \alpha_{l1}', \dots, \alpha_{ld_z}')'$ and $\beta = (\beta_0, \beta_1', \dots, \beta_{2d+d_1}')'$ are $(1 + \kappa_1 d_z) \times 1$ and $(1 + \kappa(2d + d_1)) \times 1$ vectors of unknown parameters to be estimated. Here, each $\alpha_{lj}$, $j = 1, \dots, d_z$, is a $\kappa_1 \times 1$ vector and each $\beta_j$, $j = 1, \dots, 2d + d_1$, is a $\kappa \times 1$ vector. Let $S_{1j}$ and $S_j$ denote $\kappa_1 \times (1 + \kappa_1 d_z)$ and $\kappa \times (1 + \kappa(2d + d_1))$ selection matrices, respectively, such that $S_{1j}\alpha_l = \alpha_{lj}$ and $S_j\beta = \beta_j$.
To obtain the first-stage estimators of the $g_{lj}(\cdot)$'s, let $\tilde\alpha_l \equiv (\tilde a_l, \tilde\alpha_{l1}', \dots, \tilde\alpha_{ld_z}')'$ be the solution to $\min_{\alpha_l}\, n^{-1}\sum_{i=1}^{n} [X_{li} - P_1(Z_{1i}, Z_{2i})'\alpha_l]^2$. The series estimator of $g_l(z_1, z_2)$ is given by

$\tilde g_l(z_1, z_2) = P_1(z_1, z_2)'\tilde\alpha_l = P_1(z_1, z_2)' \Big[ n^{-1}\sum_{i=1}^{n} P_1(Z_{1i}, Z_{2i}) P_1(Z_{1i}, Z_{2i})' \Big]^{-} n^{-1}\sum_{i=1}^{n} P_1(Z_{1i}, Z_{2i}) X_{li},$

where $A^{-}$ denotes the Moore-Penrose generalized inverse of $A$. Note that we can write $\tilde g_l(z_1, z_2)$ as $\tilde g_l(z_1, z_2) = \tilde a_l + \sum_{j=1}^{d_1} \tilde g_{lj}(z_{1j}) + \sum_{j=1}^{d_2} \tilde g_{l,d_1+j}(z_{2j})$, where $\tilde g_{lj}(z_{1j}) = p^{\kappa_1}(z_{1j})'\tilde\alpha_{lj}$ is a series estimator of $g_{lj}(z_{1j})$ for $j = 1, \dots, d_1$ and $\tilde g_{l,d_1+j}(z_{2j}) = p^{\kappa_1}(z_{2j})'\tilde\alpha_{l,d_1+j}$ is a series estimator of $g_{l,d_1+j}(z_{2j})$ for $j = 1, \dots, d_2$.
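As a quick numerical check, the closed form with the Moore-Penrose generalized inverse reproduces an ordinary least-squares fit whenever the design has full rank. The sketch below verifies this with NumPy; the names are illustrative and a simple polynomial series stands in for the paper's basis functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
x = np.sin(z1) + np.sin(z2) + rng.normal(0, 1, n)

# P1(z1, z2): intercept plus series terms in each exogenous regressor
P1 = np.column_stack([np.ones(n)]
                     + [z1**k for k in (1, 2, 3)]
                     + [z2**k for k in (1, 2, 3)])

# Closed form: Moore-Penrose inverse of the (scaled) Gram matrix, as displayed
alpha = np.linalg.pinv(P1.T @ P1 / n) @ (P1.T @ x / n)
g_tilde = P1 @ alpha                       # fitted values of g(z1, z2)

# The same fit via a standard least-squares solver
alpha_ls, *_ = np.linalg.lstsq(P1, x, rcond=None)
```

`P1 @ alpha` and `P1 @ alpha_ls` agree to machine precision; the generalized inverse only matters when the Gram matrix is singular or nearly so.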
To obtain the second-stage estimators of the $m_j(\cdot)$'s, let $\tilde\beta \equiv (\tilde c, \tilde\beta_1', \dots, \tilde\beta_{2d+d_1}')'$ be a solution to $\min_\beta\, n^{-1}\sum_{i=1}^{n} [Y_i - \Phi(\widetilde W_i)'\beta]^2$, where $\widetilde W_i = (X_i', Z_{1i}', \widetilde U_i')'$ and $\widetilde U_i = (\tilde U_{1i}, \dots, \tilde U_{di})'$. The series estimator of $m^*(w)$ is given by

$\tilde m^*(w) = \Phi(w)'\tilde\beta = \tilde c + \sum_{j=1}^{d} \tilde m_j(x_j) + \sum_{j=1}^{d_1} \tilde m_{d+j}(z_{1j}) + \sum_{l=1}^{d} \tilde m_{d+d_1+l}(u_l).$
Let $\theta_1(x_1) \equiv [m_1(x_1), \dot m_1(x_1)]'$. We use $\hat\theta_1(x_1) \equiv [\hat m_1(x_1), \hat{\dot m}_1(x_1)]'$ to denote the local-linear estimate of $\theta_1(x_1)$ in the third-stage, obtained by using the kernel function $k(\cdot)$ and bandwidth $h$. Let $\widetilde Y_1 \equiv (\tilde Y_{11}, \dots, \tilde Y_{1n})'$, $X_{1i}^*(x_1) \equiv (1, X_{1i} - x_1)'$, $\mathcal{X}_1(x_1) \equiv [X_{11}^*(x_1), \dots, X_{1n}^*(x_1)]'$ and $K_1 \equiv \mathrm{diag}(k_{11}, \dots, k_{1n})$, where $k_{1i} \equiv k_h(X_{1i} - x_1)$ and $k_h(\cdot) \equiv k(\cdot/h)/h$. Then

$\hat\theta_1(x_1) = [\mathcal{X}_1(x_1)' K_1 \mathcal{X}_1(x_1)]^{-1} \mathcal{X}_1(x_1)' K_1 \widetilde Y_1.$

Below we study the asymptotic properties of $\tilde\beta$ and $\hat\theta_1(x_1)$.

3 Asymptotic properties
In this section we state two theorems that give the main results of the paper. Even though
several results are available in the literature on nonparametric or semiparametric regressions
with nonparametrically generated regressors (see, e.g., Mammen et al., 2012 and Hahn and Rid-
der, 2013 for recent contributions), none of them can be directly applied to our framework. In
particular, Hahn and Ridder (2013) study the asymptotic distribution of three-step estimators
of a finite-dimensional parameter vector where the second-step consists of one or more nonpara-
metric generated regressions on a regressor that is estimated in the first-step. In sharp contrast,
our third-stage estimator is also a nonparametric estimator. Under fairly general conditions,
Mammen et al. (2012) focus on two-stage nonparametric regression where the first-stage can be
kernel or series estimation while the second-stage is local-linear estimation. In principle, we can
treat our second and third-stage estimation as their first and second-stage estimation, respec-
tively and then apply their results to our case. However, their results are built upon high-level
assumptions and are usually not optimal. For this reason, we derive the asymptotic properties
of our three-stage estimators under some primitive conditions specified in the preceding section.
Let $W_i \equiv (X_i', Z_{1i}', U_i')'$, $Z_{2i}$ and $\varepsilon_i$ denote the $i$th random observations of $W$, $Z_2$ and $\varepsilon$, respectively. Let $\varepsilon_i \equiv Y_i - m^*(X_i, Z_{1i}, U_i)$, $\Phi_i = \Phi(W_i)$ and $Q_{\Phi\Phi} \equiv E[\Phi_i \Phi_i']$. The asymptotic properties of the second-stage series estimator $\tilde\beta$ are reported in the following theorem.
Theorem 3.1 Suppose that Assumptions A.1-A.5(i) in Appendix A hold. Then

(i) $\tilde\beta - \beta = Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i \varepsilon_i + Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i [m^*(X_i, Z_{1i}, U_i) - \Phi_i'\beta] - Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i \sum_{l=1}^{d} \dot m_{d+d_1+l}(U_{li})(\tilde U_{li} - U_{li}) + R_n$;

(ii) $\|\tilde\beta - \beta\| = O_P(\eta_n + \eta_{1n})$;

(iii) $\sup_{w \in \mathcal{W}} |\tilde m^*(w) - m^*(w)| = O_P[\zeta_0(\kappa)(\eta_n + \eta_{1n})]$;

where $\|R_n\| = o_P(\eta_n + \eta_{1n})$, and $\eta_n$, $\eta_{1n}$ and $\zeta_0$ are defined in Assumption A.5(i).
To appreciate the effect of the first-stage series estimation on the second-stage series estimation, let $\bar\beta$ denote the series estimator of $\beta$ obtained by using $U_i$ together with $(X_i, Z_{1i})$ as the regressors. Then it is standard to show that

$\bar\beta - \beta = Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i \varepsilon_i + Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i [m^*(X_i, Z_{1i}, U_i) - \Phi_i'\beta] + \bar R_n$

and $\|\bar\beta - \beta\| = O_P(\eta_n)$, where $\|\bar R_n\| = o_P(n^{-1/2}) = o_P(\eta_n)$. The third term on the right hand side of the expression in Theorem 3.1(i) signifies the asymptotically non-negligible dominant effect of the first-stage estimation on the second-stage estimation.
With Theorem 3.1, it is straightforward to show the asymptotic joint distribution of our three-stage estimators of $m_1(x_1)$ and its gradient.

Theorem 3.2 Let $H \equiv \mathrm{diag}(1, h)$. Suppose that Assumptions A.1-A.5 in Appendix A hold. Then

(i) (Normality) $\sqrt{nh}\, H[\hat\theta_1(x_1) - \theta_1(x_1) - B_1(x_1)] \xrightarrow{d} N(0, \Omega_1(x_1))$, where

$B_1(x_1) \equiv \begin{pmatrix} \frac{h^2}{2}\mu_2\, \ddot m_1(x_1) \\ 0 \end{pmatrix}, \qquad \Omega_1(x_1) \equiv \begin{pmatrix} \frac{\nu_0\, \sigma^2(x_1)}{f_1(x_1)} & 0 \\ 0 & \frac{\nu_2\, \sigma^2(x_1)}{\mu_2^2\, f_1(x_1)} \end{pmatrix},$

$\sigma^2(x_1) \equiv E(\varepsilon_i^2 \mid X_{1i} = x_1)$, $f_1(\cdot)$ denotes the probability density function (PDF) of $X_{1i}$, and $\mu_j \equiv \int v^j k(v)\, dv$ and $\nu_j \equiv \int v^j k(v)^2\, dv$ for $j = 0, 1, 2$.

(ii) (Uniform consistency) Suppose that $Q_{\Phi\Phi,2} \equiv E(\Phi_i \Phi_i' \varepsilon_i^2)$ has bounded maximum eigenvalue. Then $\sup_{x_1 \in \mathcal{X}_1} \|H[\hat\theta_1(x_1) - \theta_1(x_1)]\| = O_P\big((nh/\log n)^{-1/2} + h^2\big)$.
Theorem 3.2(i) indicates that our three-step estimator of $\theta_1(x_1) = [m_1(x_1), \dot m_1(x_1)]'$ has the asymptotic oracle property. The asymptotic distribution of the local-linear estimator of $\theta_1(x_1)$ is not affected by random sampling errors in the first two-stage estimators. In fact, the three-step estimator of $\theta_1(x_1)$ has the same asymptotic distribution that we would have if the other components in $m^*(x, z_1, u)$ were known and a local-linear procedure were used to estimate $\theta_1(x_1)$. Theorem 3.2(ii) gives the uniform convergence rate for $\hat\theta_1(x_1)$. Similar properties can be established for the local-linear estimators of the other components of $m^*(x, z_1, u)$. In addition, following the standard exercise in the nonparametric kernel literature, we can also demonstrate that these estimators are asymptotically independently distributed.
4 Partially linear additive models
In this section we consider a slight extension of the model in (2.1) to the following partially linear additive model:

$Y = m(X, Z_1) + \theta'V + \varepsilon, \quad X = g(Z_1, Z_2) + \Psi V + U,$
$E(U \mid Z_1, Z_2, V) = 0, \quad E(\varepsilon \mid Z_1, Z_2, U, V) = E(\varepsilon \mid U), \quad E(\varepsilon) = 0, \quad (4.1)$

where $Y$, $X$, $Z_1$, $Z_2$, $U$ and $\varepsilon$ are defined as above, $V$ is a $d_v \times 1$ vector of exogenous variables, $\theta$ is a $d_v \times 1$ parameter vector and $\Psi = [\psi_1, \dots, \psi_d]'$ is a $d \times d_v$ matrix of parameters in the reduced form regression for $X$. To avoid the curse of dimensionality, we continue to assume that $g_l(Z_1, Z_2)$, $m(X, Z_1)$ and $E(\varepsilon \mid U)$ have the additive forms given in Section 2.1.

We remark that the results developed in previous sections extend straightforwardly to the model specified in (4.1). Note that

$E(Y \mid X, Z_1, Z_2, U, V) = m(X, Z_1) + E(\varepsilon \mid U) + \theta'V = m^*(X, Z_1, U) + \theta'V \quad \text{and} \quad (4.2)$

$E(X \mid Z_1, Z_2, V) = g(Z_1, Z_2) + \Psi V. \quad (4.3)$
Given a random sample $\{(Y_i, X_i, Z_{1i}, Z_{2i}, V_i)\}_{i=1}^{n}$, we can continue to adopt the three-step procedure outlined in Section 2.2 to estimate the above model. First, we choose $(\alpha_l, \psi_l)$ to minimize $n^{-1}\sum_{i=1}^{n} [X_{li} - P_1(Z_{1i}, Z_{2i})'\alpha_l - V_i'\psi_l]^2$. Let $(\tilde\alpha_l, \tilde\psi_l)$ denote the solution. The series estimator of $g_l(z_1, z_2)$ is given by $\tilde g_l(z_1, z_2) = P_1(z_1, z_2)'\tilde\alpha_l$. Define the residuals $\tilde U_{li} = X_{li} - \tilde g_l(Z_{1i}, Z_{2i}) - \tilde\psi_l'V_i$. Let $\widetilde U_i = (\tilde U_{1i}, \dots, \tilde U_{di})'$, $\widetilde W_i = (X_i', Z_{1i}', \widetilde U_i')'$ and $\Phi(\widetilde W_i)$ be defined as before. Second, we choose $(\beta, \theta)$ to minimize $n^{-1}\sum_{i=1}^{n} [Y_i - \Phi(\widetilde W_i)'\beta - V_i'\theta]^2$. Let $\tilde\beta \equiv (\tilde c, \tilde\beta_1', \dots, \tilde\beta_{2d+d_1}')'$ and $\tilde\theta$ denote the solution. Define $\tilde Y_{1i} = Y_i - \tilde\beta_{(-1)}'\Phi(\widetilde W_i) - \tilde\theta'V_i$, where $\tilde\beta_{(-1)}$ is defined as $\tilde\beta$ with its component $\tilde\beta_1$ being replaced by a $\kappa \times 1$ vector of zeros. Third, we estimate $m_1(x_1)$ and its first-order derivative by regressing $\tilde Y_{1i}$ on $X_{1i}$ via the local-linear procedure. Let $\hat\theta_1(x_1)$ denote the estimate of $\theta_1(x_1)$ via local-linear fitting.

It is well known that the finite dimensional parameter vectors $\psi_l$'s and $\theta$ can be estimated at the parametric $\sqrt{n}$-rate and that the appearance of the linear components in (4.1) will not affect the asymptotic properties of $\tilde\beta$ and $\hat\theta_1(x_1)$. To conserve space, we do not repeat the arguments here.
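Relative to the sketch of Section 2.2, the only change in each of the first two stages is appending the linear columns $V$ to the series design; the third stage is unchanged. A minimal illustration under DGP 3 of Section 5 (our own helper names; polynomial series in place of B-splines; $\Psi = 0$ in this DGP):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
v1 = rng.binomial(1, 0.5, n).astype(float)
v2 = rng.normal(0, 1, n)
u, eps = rng.normal(0, 1, n), rng.normal(0, 1, n)
x = np.sin(z1) + np.sin(z2) + u                    # reduced form (Psi = 0 here)
y = np.sin(x) + np.sin(z1) + 0.5 * v1 + v2 + eps   # structural equation, DGP 3

def basis(w, K=4):
    return np.column_stack([w**k for k in range(1, K + 1)])

def fit(D, resp):
    coef, *_ = np.linalg.lstsq(D, resp, rcond=None)
    return coef

V = np.column_stack([v1, v2])

# Stage 1: series terms in (Z1, Z2) plus the linear V columns
D1 = np.column_stack([np.ones(n), basis(z1), basis(z2), V])
u_hat = x - D1 @ fit(D1, x)                        # residuals net of V as well

# Stage 2: series terms in (X, Z1, u_hat) plus the linear V columns
D2 = np.column_stack([np.ones(n), basis(x), basis(z1), basis(u_hat), V])
b = fit(D2, y)
theta_hat = b[-2:]                                 # estimate of theta = (0.5, 1)
```

With the linear part partialled out this way, `theta_hat` converges at the root-$n$ rate, consistent with the remark above.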
5 Finite sample properties
In this section we evaluate the finite sample properties of our estimator by simulations. We
first look at four data generating processes (DGPs) to show the performance of our estimator.
We then consider higher-dimensional data and compare our estimator to a fully nonparametric
alternative of Su and Ullah (2008). We report the average bias, variance, and root mean square
error (RMSE) for the final-stage conditional mean and gradient estimates across 1000 Monte Carlo simulations. We consider three different sample sizes: 100, 200 and 400.
5.1 Baseline simulations
We consider four different DGPs for the structural equation relating $Y$ to $X$ and $Z_1$ (and, in DGP 3, to $V$ as well). Unless we state otherwise, $Z_1$ and $Z_2$ are independently distributed as uniform from zero to one ($U[0,1]$), $U$ and $\varepsilon$ are independently distributed as Gaussian with mean zero and variance one ($N(0,1)$), and all are mutually independent of one another and of $V$ and $(Z_1, Z_2)$. For the first three DGPs, each error distribution is assumed to be homoskedastic.

Our first DGP is our baseline model and is given as

$Y = \sin(X) + \sin(Z_1) + \varepsilon \quad \text{and} \quad X = \sin(Z_1) + \sin(Z_2) + U.$

Our second DGP considers a slightly more complicated first-stage regression model, as $Z_2$ enters the reduced form of $X$ via the PDF of a logistic distribution:

$Y = 0.25X^2 + 0.5Z_1^2 + \varepsilon \quad \text{and} \quad X = 0.2 + 0.2Z_1 + \frac{e^{-Z_2}}{(1 + e^{-Z_2})^2} + U.$

Our third DGP considers the partially linear extension, where $V_1$ and $V_2$ are distributed via a binomial and Gaussian distribution, respectively:

$Y = \sin(X) + \sin(Z_1) + 0.5V_1 + V_2 + \varepsilon \quad \text{and} \quad X = \sin(Z_1) + \sin(Z_2) + U.$

Finally, our fourth DGP is similar to the first, but allows for heteroskedasticity (note that our theory allows for heteroskedasticity). Specifically, we allow the variance of $\varepsilon$ to be a function of $Z_1$ and $Z_2$ via $0.1 + 0.5Z_1^2 + 0.5Z_2^2$. Clearly, $d = d_1 = d_2 = 1$ in DGPs 1-4.
We estimate the structural function in three steps. In the first two steps we use cubic B-
splines for the sieve estimation and in the third step we use local-linear kernel regression. For
the spline estimation, we specify the number of knots as $\lfloor 2n^{1/5} \rfloor$ so that $\kappa_1 = \kappa = \lfloor 2n^{1/5} \rfloor + 4$,
Table 1: Monte Carlo simulations for the final-stage conditional mean and gradient estimates

                       n = 100                                       n = 200                                       n = 400
        $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$
Bias
DGP 1 0.0522 -0.0611 0.2847 0.0419 -0.0586 0.0092 0.0297 -0.0358 -0.0078
DGP 2 -0.0649 0.0525 -0.0346 -0.0621 0.0178 -0.0483 -0.0434 0.0577 -0.0013
DGP 3 0.0526 -0.0544 -0.0019 0.0371 -0.0575 -0.0024 0.0278 -0.0502 -0.0015
DGP 4 0.0515 -0.0341 0.0542 0.0412 -0.0556 0.0166 0.0285 -0.0347 -0.0138
Variance
DGP 1 0.0526 0.0879 0.3017 0.0316 0.0593 0.1672 0.0195 0.0419 0.1120
DGP 2 0.1234 0.2485 0.7299 0.1211 0.2352 0.6851 0.0508 0.2242 0.3233
DGP 3 0.1170 0.1530 0.6956 0.0703 0.1054 0.4047 0.0443 0.0686 0.2732
DGP 4 0.1182 0.1520 0.6670 0.0692 0.1016 0.3773 0.0436 0.0714 0.2562
RMSE
DGP 1 0.2468 0.3699 0.6608 0.1870 0.2938 0.4880 0.1465 0.2405 0.3785
DGP 2 0.3708 0.6714 0.9763 0.3701 0.6684 0.9310 0.2365 0.6382 0.6197
DGP 3 0.3589 0.4947 0.9839 0.2779 0.4143 0.7609 0.2199 0.3235 0.5865
DGP 4 0.3659 0.5004 0.9740 0.2755 0.3974 0.7161 0.2163 0.3171 0.5669
where $\lfloor \cdot \rfloor$ denotes the integer part of its argument. For the kernel regression, we need to choose both the kernel function $k(\cdot)$ and the bandwidth parameter $h$. We apply the Gaussian kernel throughout the simulations and application: $k(v) = \exp(-v^2/2)/\sqrt{2\pi}$. There are two standard ways to choose the bandwidth. One is to apply Silverman's rule of thumb by setting $h = 1.06\, s_{X_1} n^{-1/5}$, where $s_{X_1}$ denotes the sample standard deviation of $X_1$; the other is to consider leave-one-out least-squares cross-validation (LSCV). To save time on computation, we consider Silverman's rule-of-thumb choice of bandwidth. In our application, we will use generalized cross-validation (GCV) to choose the number of sieve basis terms in each of the first two steps' sieve estimation and LSCV to choose the bandwidth in the last step's kernel estimation. For example, in the first step, we choose $\kappa_1$ to minimize the following GCV objective function:

$GCV(\kappa_1) = \frac{1}{n} \sum_{i=1}^{n} \frac{[X_{1i} - \tilde g_{\kappa_1}(Z_i)]^2}{[1 - p(\kappa_1)/n]^2},$

where $Z_i$ is a collection of all exogenous variables, $p(\kappa_1)$ is the number of estimated series coefficients, and $\tilde g_{\kappa_1}(\cdot)$ denotes the sieve estimate of $E(X_1 \mid Z)$ obtained by imposing the additive structure and using $\kappa_1$ terms of cubic B-spline basis functions to approximate each additive component. In the third step, we choose $h$ to minimize

$CV(h) = \frac{1}{n} \sum_{i=1}^{n} \big[\tilde Y_{1i} - \hat m_{1,-i}(X_{1i})\big]^2,$

where $\hat m_{1,-i}(X_{1i})$ is the leave-one-out version of $\hat m_1(X_{1i})$ defined in Section 2.2.
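In code, the two selection rules look roughly as follows. This is an illustrative univariate sketch: polynomial series stand in for the cubic B-splines, and we take the GCV correction to be the total number of estimated series coefficients divided by $n$ (an assumption in this sketch).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.uniform(0, 1, n)
x = np.sin(z) + rng.normal(0, 1, n)

def series_fit(kappa):
    """Polynomial series fit with kappa terms (cubic B-splines in the paper)."""
    D = np.column_stack([np.ones(n)] + [z**k for k in range(1, kappa + 1)])
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)
    return D @ coef, D.shape[1]

def gcv(kappa):
    fitted, p = series_fit(kappa)
    return np.mean((x - fitted) ** 2) / (1 - p / n) ** 2

kappa_star = min(range(1, 9), key=gcv)             # GCV choice of series terms

def lscv(h):
    """Leave-one-out least-squares CV for the local-linear bandwidth."""
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        w = np.exp(-0.5 * ((z[mask] - z[i]) / h) ** 2)
        X = np.column_stack([np.ones(n - 1), z[mask] - z[i]])
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ x[mask])
        err += (x[i] - beta[0]) ** 2               # leave-one-out squared error
    return err / n

grid = np.linspace(0.05, 1.0, 20)
h_star = grid[np.argmin([lscv(h) for h in grid])]
```

Both criteria trade off fit against flexibility: the GCV denominator penalizes large $\kappa_1$, while LSCV penalizes bandwidths that overfit the left-out observation.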
The simulation results for the final-stage regressions can be found in Table 1. The results are as expected. The average bias, variance and RMSE of each estimator decrease with the sample size. The conditional mean is estimated more precisely than its gradients. The
estimators in the homoskedastic DGP outperform those from the heteroskedastic DGP (DGP 1
versus 4).
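For reference, the summary statistics reported in the tables are the usual Monte Carlo ones; a compact sketch (with a synthetic stand-in for the estimator draws):

```python
import numpy as np

def mc_summary(estimates, truth):
    """Average bias, variance and RMSE over Monte Carlo replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    var = est.var()
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    return bias, var, rmse

# e.g., 1000 replications of an estimator of a truth of 1.0,
# with bias 0.05 and standard deviation 0.2 (synthetic numbers)
rng = np.random.default_rng(4)
draws = 1.05 + 0.2 * rng.standard_normal(1000)
bias, var, rmse = mc_summary(draws, 1.0)
```

By construction RMSE$^2$ = bias$^2$ + variance, which is a handy internal consistency check when reading such tables.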
5.2 Higher-dimensional performance
Now we look at the performance of our estimator with higher-dimensional data and compare
it with that of a fully nonparametric alternative, that of Su and Ullah (2008). Unless stated otherwise, $Z_1, Z_2, Z_3, Z_4$ and $Z_5$ are independently distributed as $U[0,1]$, and $U$ and $\varepsilon$ are independently distributed as $N(0,1)$ and are mutually independent of one another and of $V$ and $(Z_1, Z_2, Z_3, Z_4, Z_5)$. For DGPs 5-7, each error distribution is assumed to be homoskedastic.

Our fifth DGP is a variant of our baseline model and is given as

$Y = \sin(X) + \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + \varepsilon$ and
$X = \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + \sin(Z_5) + U.$

Our sixth DGP is specified as follows:

$Y = 0.25X^2 + 0.5Z_1^2 + \sin(Z_2) + Z_3^2 + 3Z_4^2 + \varepsilon$ and
$X = 0.2 + 0.2Z_1 + \frac{e^{-Z_5}}{(1 + e^{-Z_5})^2} + \cos(Z_2) + \sin(Z_3) + \sin(Z_4) + U.$

Our seventh DGP considers the partially linear extension, where $V_1$ and $V_2$ are distributed via a binomial and Gaussian distribution, respectively:

$Y = \sin(X) + \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + 0.5V_1 + V_2 + \varepsilon$ and
$X = \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + \sin(Z_5) + U.$

Finally, our eighth DGP is similar to the fifth, but allows for heteroskedasticity. Specifically, we allow the variance of $\varepsilon$ to be a function of $Z_1, Z_2, Z_3, Z_4$ and $Z_5$ via $0.1 + 0.5Z_1^2 + 0.5Z_2^2 + Z_3^2 + Z_4^2 + Z_5^2$. Clearly, $d = d_2 = 1$ and $d_1 = 4$ in DGPs 5-8.
The simulation results for the final-stage regressions can be found in Tables 2 and 3 for
our proposed estimates and Su and Ullah’s (2008) estimates, respectively. For Su and Ullah’s
Table 2: Monte Carlo simulations for higher-dimensional data for the final-stage conditional mean and gradient estimates

                       n = 100                                       n = 200                                       n = 400
        $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$
Bias
DGP 5 0.0468 0.0390 0.0453 0.0420 0.0492 0.0032 0.0364 0.0329 -0.0082
DGP 6 -0.0820 0.0022 0.0128 -0.0659 -0.0028 0.0093 -0.0583 0.0011 0.0024
DGP 7 0.0659 0.0509 0.0056 0.0613 0.0350 0.0190 0.0472 0.0281 -0.0254
DGP 8 0.0492 0.0470 -0.0599 0.03897 0.0303 0.0024 0.0332 0.0371 -0.0081
Variance
DGP 5 0.2510 0.2551 1.2890 0.1225 0.0897 0.5907 0.0699 0.0571 0.3391
DGP 6 0.2480 0.2453 1.2710 0.1198 0.0977 0.6142 0.0685 0.0583 0.3386
DGP 7 0.5752 0.2977 1.7930 0.3355 0.0994 0.7068 0.1809 0.0637 0.3713
DGP 8 0.3575 0.3000 1.7510 0.1750 0.1158 0.8258 0.0985 0.0776 0.4766
RMSE
DGP 5 0.5091 0.4364 1.1820 0.3577 0.3145 0.7953 0.2700 0.2498 0.5970
DGP 6 0.5111 0.4663 1.1620 0.3573 0.3206 0.8058 0.2710 0.2528 0.5948
DGP 7 0.7680 0.7292 0.5137 0.5691 0.5456 0.3315 0.4306 0.3987 0.5833
DGP 8 0.6052 0.8277 1.4001 0.4263 0.5833 0.9460 0.3195 0.2870 0.7138
estimates, we basically follow their suggestions in choosing the orders of local-polynomial regression (3 in the first stage and 1 in the second stage), the kernel, and the bandwidth, but use the technique of Kim et al. (1999) in the third stage to speed up the calculation. The findings for our estimator are similar to those in Table 1. As for the comparison between the two estimators, we see that the additive estimates, which exploit the additive structure of the data, have smaller bias, variance, and RMSE than Su and Ullah's fully nonparametric estimates with higher-dimensional data. In particular, Su and Ullah's estimates are subject to the curse of dimensionality and tend to have very large variance and RMSE even for moderate sample sizes (e.g., n = 400), as they need to estimate d_1 + d_2 = 5, d_1 + 2 = 6 and d_1 + 1 = 5 dimensional nonparametric objects in DGPs 5, 6 and 8 in their first-, second-, and third-stage estimation, respectively. [In DGP 7, the linear components W_1 and W_2 in the structural equation are also counted as a part of Z_1 in Su and Ullah's procedure. As a result, d_1 = 6 and even higher dimensional nonparametric objects have to be estimated.] We also consider models with an even higher number of covariates and the results are as expected: the variance and RMSE of Su and Ullah's estimates blow up quickly as the number of covariates increases, while those of ours remain well behaved.
Table 3: Monte Carlo simulations for higher-dimensional data for Su and Ullah's (2008) final-stage conditional mean and gradient estimates (columns as in Table 2)

                 n = 100                       n = 200                       n = 400
          f̂(·,·)   f̂_x     f̂_z1      f̂(·,·)   f̂_x     f̂_z1      f̂(·,·)   f̂_x     f̂_z1
Bias
DGP 5     0.4796  -0.0122   0.2364    0.3342  -0.0166   0.2500    0.2655  -0.0227   0.0066
DGP 6     1.0310   0.0549   0.2030   -0.9077   0.0649   0.2400    0.3724  -0.1362   0.2916
DGP 7     0.4767  -0.0238   0.2484    0.1881   0.0015   0.0963    0.2334   0.0998   1.0110
DGP 8     1.3720   0.4493   0.4159    0.7077   0.5225   0.5809    0.5766   0.5119   0.4634
Variance
DGP 5     0.8934   1.2996   4.9969    0.8995   1.1916   4.3957    0.8924   1.0031   2.0049
DGP 6     1.6010   1.5890   5.6349    1.4980   1.7370   5.0254    1.4130   1.6297   3.9871
DGP 7     0.9969   1.2118   4.9003    0.5720   0.7518   3.2076    0.5663   0.7199   2.9871
DGP 8     2.0390   2.8980  10.8100    2.0640   2.6460   9.4325    2.0050   2.5940   9.2876
RMSE
DGP 5     1.0218   1.3035   5.0366    0.9635   1.1989   4.4616    0.9334   1.1157   2.0134
DGP 6     1.0990   1.7050   5.6563    1.0130   1.6930   5.0655    0.9725   1.6354   4.5161
DGP 7     1.1134   1.2347   4.9763    0.6046   0.7565   3.2311    0.5997   0.7223   3.1100
DGP 8     2.4940   2.9580  10.9200    2.2019   2.7030   9.4439    2.0970   2.6510   9.3057
6 Application: Child care use and test scores
It is generally accepted in the literature that early childhood achievement is a strong predictor of later success (e.g., better labor market outcomes) (Keane and Wolpin, 1997, 2001; Bernal and Keane, 2011; Cameron and Heckman, 1998). Thus, researchers have focused on the determinants of childhood achievement. Various models have been developed and most focus on cognitive ability as the outcome measure. In the present context, we are concerned with whether child care improves or hurts a particular measure of cognitive ability: test scores. Although this is an interesting question, previously there were serious data limitations.
The two major limitations associated with cognitive ability production functions in this context are sample selection bias and endogeneity. Sample selection bias occurs when only mothers' labor force participation is used in the analysis, which implicitly treats participation as a direct indicator of child care use. The main problem here is that working and non-working mothers may differ substantially in the cognitive ability production process, and if only labor force participation is used, the analysis effectively rules out non-working mothers. Adding actual child care use can help address the selectivity problem (see Bernal, 2008 and Bernal and Keane, 2011).
The second issue is the potential endogeneity of the child care use variable. To the best of our knowledge, relatively few papers in this literature use instrumental variable estimation to address the endogeneity problem, and those that do find no benefit to IV regression. Two possible reasons for this are the use of restrictive methods (methods that likely mask the heterogeneity of mothers and obscure the sources of potential endogeneity) and data limitations. The three papers that we are aware of which use IV regressions are Blau and Grossberg (1992), James-Burdumy (2005) and Bernal and Keane (2011).
Blau and Grossberg (1992) use maternal labor supply as an indicator of child care use and analyze children's cognitive development. They frame endogeneity through the participation decision of mothers, viewed as a comparison between home and market production: differences between employed and unemployed mothers may create differences in child quality production. Hence they focus on the endogeneity of the mothers' decisions. They instrument for maternal labor supply and conclude that there is no statistical heterogeneity between employed and unemployed mothers (and hence reject the IV model). An issue with their paper (which they point out) is the possibility of weak instruments. They also do not have a detailed control for child care. James-Burdumy (2005) focuses on the same problem and uses labor market conditions as instruments in her fixed effects IV model. Potentially weak instruments are again blamed for rejection of the IV model.
In response to the issues mentioned above, Bernal and Keane (2011) obtain data on actual child care use (which helps correct for selectivity issues) as well as an extensive number of instrumental variables (which helps address the weak instrument issue). Further, they choose a larger age range (compared to existing studies) for children in their application (previous studies found stronger correlations for their target ages). Also, they focus only on single mothers, which arguably fits their set of instruments better. They conclude that their IV regressions perform well.
We start our analysis with Bernal and Keane's (2011) data, but consider a more flexible cognitive ability production function (f(·)). Consider the following cognitive ability production function

ln S = f(A, CC, ln I, G, N, B) + ε,      (6.1)

where ln S is the logarithm of child test scores (our measure of cognitive ability), A represents the child's age, CC (the primary variable of interest) is cumulative child care, ln I is the logarithm of cumulative income since childbirth, G is a vector of group characteristics of the child and the mother (e.g., mother's AFQT score), N is the number of children and B measures initial ability (e.g., birth weight). Equation (6.1) is the baseline cognitive ability production equation (minus any functional form assumptions) in Bernal and Keane (2011, pp. 474).
A primary concern with Equation (6.1) is that CC, ln I and N may be correlated with the error term. Hence we use instruments to correct for this potential endogeneity. Following Bernal and Keane (2011), we use local demand conditions and welfare rules as instruments. Our contribution here is to provide a more flexible version of the cognitive ability production function proposed by Bernal and Keane (2011). This allows us to obtain the effects of child care for each child. Standard least-squares estimation methods are best suited to data near the mean, and looking solely at the mean may be misleading. Further, it is arguable that we are more interested in the upper or lower tails of the distribution of returns to child care. Our approach allows us to observe the overall variation.
Here we will be using the partially linear additive nonparametric specification. In our first stage we estimate three separate regressions: one for each endogenous regressor (CC, ln I and N). In Equation (4.1), these are given as the regressions of X on Z_1 and Z_2. Specifically, our first-stage equations are written as

CC   = g_11(AFQT) + g_12(ED) + g_13(EX) + g_14(MA) + γ_1'V + u_1,
ln I = g_21(AFQT) + g_22(ED) + g_23(EX) + g_24(MA) + γ_2'V + u_2,
N    = g_31(AFQT) + g_32(ED) + g_33(EX) + g_34(MA) + γ_3'V + u_3,      (6.2)

where we allow the control variables of mother's AFQT score (AFQT), mother's education (ED), mother's experience (EX) and mother's age (MA) to enter nonlinearly. The remaining control variables, as well as each of the instruments, are contained in V. Note that V also includes interactions of each instrument with selected control variables. This results in a total of 99 regressors in each first-stage regression (clearly indicating the need for a partially linear model).
After obtaining consistent estimates of each of the residual vectors from Equation (6.2), we run the second-stage model via a nonparametric additively separable partially linear regression of log test scores (ln S) on the endogenous variables (not their predicted values), each of the residuals from the first stage and the remaining control variables as

ln S = f_1(CC) + f_2(ln I) + f_3(N) + f_4(AFQT) + f_5(ED) + f_6(EX) + f_7(MA)
       + f_8(û_1) + f_9(û_2) + f_10(û_3) + Ψ'Ṽ + ε,      (6.3)

where Ṽ contains the same (twenty) control variables included in V (and does not include any of the instruments Z_2) as well as linear interactions between the variables entering the other nonparametric functions (ln I, N, AFQT, ED, EX and MA) and child care (CC). Estimation (of the additive components and their gradients) in the final stage follows from Section 2.2.
Regarding implementation, note that in the first two stages we use cross-validation techniques
to choose the number of knots for our B-splines. In our final-stage estimates we use cross-
validation techniques to determine the bandwidth in our kernel function. R code which can be
used to estimate our model is available from the authors upon request.
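To fix ideas, the first two stages of a control-function procedure of this kind can be sketched as follows. This is a stylized illustration with simulated data, one endogenous regressor and a hand-rolled truncated-power spline basis with fixed quantile knots; it is not the authors' R code, and all function and variable names are ours.

```python
import numpy as np

def spline_basis(v, knots, degree=3):
    """Truncated-power spline basis (no intercept): v, ..., v^degree, (v - k)_+^degree."""
    cols = [v ** d for d in range(1, degree + 1)]
    cols += [np.clip(v - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 2000
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)      # instruments
u = rng.normal(size=n)
x = np.sin(z1) + np.sin(z2) + u                          # endogenous regressor
eps = 0.5 * u + 0.5 * rng.normal(size=n)                 # correlated with u
y = np.sin(x) + eps                                      # structural equation

# Stage 1: series regression of x on splines in the instruments; keep the residuals.
q3 = [0.25, 0.50, 0.75]
B1 = np.column_stack([np.ones(n),
                      spline_basis(z1, np.quantile(z1, q3)),
                      spline_basis(z2, np.quantile(z2, q3))])
uhat = x - B1 @ np.linalg.lstsq(B1, x, rcond=None)[0]

# Stage 2: regress y on splines in x plus splines in the control-function residual.
q5 = [0.1, 0.3, 0.5, 0.7, 0.9]
Bx = spline_basis(x, np.quantile(x, q5))
Bu = spline_basis(uhat, np.quantile(uhat, q3))
beta = np.linalg.lstsq(np.column_stack([np.ones(n), Bx, Bu]), y, rcond=None)[0]

# The additive component in x (identified up to a constant) on a grid of x values.
grid = np.linspace(0.0, 1.8, 37)
fhat = spline_basis(grid, np.quantile(x, q5)) @ beta[1:1 + Bx.shape[1]]
fhat -= fhat.mean()
truth = np.sin(grid) - np.sin(grid).mean()
print(round(float(np.max(np.abs(fhat - truth))), 3))
```

In the paper the knots and the final-stage kernel bandwidth are chosen by cross-validation; here they are fixed only to keep the sketch short.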
6.1 Data
Our data comes directly from Bernal and Keane (2011). The data are extensive and we will
attempt to summarize them in a concise manner. For those interested in the specific details, we
refer you to the excellent description in the aforementioned paper. Their primary data source
is the National Longitudinal Survey of Youth of 1979 (NLSY79). The exact instruments and
control variables can be found in Tables 1 and 2 in Bernal and Keane (2011).
As noted in the introduction, the data set consists of single mothers. Although this may
seem like a data restriction at first, it leads to stronger instruments. The main reason behind
this choice, as explained by Bernal and Keane (2011), is that single mothers fit their set of
instruments better. The primary instruments used here are welfare rules, which (as claimed by
Bernal and Keane, 2011) give exogenous variation for single mothers. The (1990s) welfare policy
changes resulted in increased employment rates for single mothers, hence higher child care use.
We describe our variables for the 2454 observations in our sample in more detail below.
6.1.1 Endogenous regressors
We consider three potentially endogenous variables: cumulative child care, cumulative income and number of children. These are the left-hand-side variables in our first-stage equations.
They are modeled in an additively separable nonparametric fashion in the second-stage regres-
sion.
6.1.2 Instruments
We group our instruments (Z_2) into four categories: time limits, work requirements, earnings disregards, and other policy variables and local demand conditions. We briefly explain each of
these categories and refer the reader to a more in depth description of the instruments in Bernal
and Keane (2011, pp. 466-469).
Time limits We consider (time limits for) two programs which aid in helping families with
children: Aid to Families with Dependent Children (AFDC) and Waivers and Temporary Aid
for Needy Families (TANF). Under AFDC, single mothers with children under age 18 may be
eligible to get help from the program as long as they fit certain criteria that are typically set by
the state and program regulations. TANF on the other hand, enables the states to set certain
time limits on the benefits they provide to eligible individuals. AFDC provides the benefits and
TANF creates the variability because states can set their own limits. The limits are important
for benefit receivers because an eligible female may become ineligible by hitting the limit and
she may choose to save some of the eligibility for later use. We include each of the eight (time
limit) instruments proposed by Bernal and Keane (2011).
Work requirements TANF requires eligible females to return to work after a certain time, as set by the state, in order to remain eligible. These rules are state dependent. While the standard requirement is that females start working within two years, several states choose shorter time limits. Some states lift this requirement for females with young children.
Besides the variation amongst states, even within states there exists variation. Here we include
each of the nine (work requirement) instruments.
Earnings disregards The AFDC and TANF benefits are adjusted by states depending upon the number of children and the earnings of the eligible females. While more children may lead to greater benefits, more earnings may lower them. States set the level for AFDC grants and
adjust the amount of reduction in benefits via TANF. Specifically, our first-stage regressions
include both the “flat amount of earnings disregarded in calculating the benefit amount” and
the “benefit reduction rate” instrumental variables.
Other policy variables and local demand conditions Our remaining instruments are
grouped in one generic category: other policy variables and local demand conditions. Here
we consider two additional programs for families with young children. These programs are
Child Support Enforcement (CSE) and the Child Care Development Fund (CCDF). Bernal
and Keane (2011) report CSE as a significant source of income for single mothers via the 2002
Current Population Survey. CSE’s goal is to find absent parents and establish relationships
with their children. CCDF on the other hand, is a subsidy based program which provides child
care for low-income families. States are independent in designing their own programs and hence
variation is present.
In addition to the policy variables, earned income tax credit (EITC), the unemployment rate
and hourly wage rate are listed as instruments. EITC is a wage support program for low-income
families. This is a subsidy based program and the subsidy amount varies with family size. The
benefit levels are not conditioned on actual family size since family size is endogenous. Our
first-stage regressions include six instruments from this category.
6.1.3 Control variables
In addition to the instrumental variables, we have twenty-four control variables (Z_1) which show up in each stage (four nonlinearly and twenty linearly). These variables primarily represent
characteristics of the mothers and children. These are each assumed to be exogenous. The four
variables that enter nonlinearly are the mother’s AFQT score, education level, work experience
and age. We treat these nonparametrically as we believe their impacts vary with their levels and
we are unaware of an economic theory which states exactly how they should enter. In order to
understand the intuition behind this specification, consider a model where these variables enter linearly. In that case, having schooling enter linearly would imply that each additional year of schooling a mother obtains leads to the same percentage change in her child's test score. The same would hold true for a linearly modeled AFQT score, age or experience. There is no reason to assume that this is true.
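The point can be made with a toy example (our own, with made-up data): a linear schooling term imposes one slope for every mother, while even the simplest nonlinear sieve lets the gradient vary with the level of schooling.

```python
import numpy as np

rng = np.random.default_rng(3)
educ = rng.integers(8, 21, size=1000).astype(float)       # hypothetical years of schooling
y = 0.8 * np.log(educ) + rng.normal(0, 0.1, size=1000)    # concave "true" effect on log score

# Linear specification: a single slope at every level of schooling.
X_lin = np.column_stack([np.ones_like(educ), educ])
b_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]

# Quadratic specification (the simplest nonlinear sieve): the slope varies with educ.
X_quad = np.column_stack([np.ones_like(educ), educ, educ ** 2])
b_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]

def grad_at(e):
    """Implied marginal effect of one more year of schooling at level e."""
    return b_quad[1] + 2.0 * b_quad[2] * e

print(b_lin[1], grad_at(8.0), grad_at(20.0))
```

The quadratic fit recovers a larger marginal effect at low schooling than at high schooling, which the linear fit cannot express.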
6.2 Results
There are a vast number of results that can be presented from this procedure and data. We plan
to limit our discussion to a few important issues. First, we want to determine the strength of
our instruments. To determine how well the instruments predict the endogenous regressors, we
propose an F-type (wild) bootstrap based test for the joint significance of the instruments. Second, we are interested in whether or not endogeneity exists. We check for this by testing for the joint significance of f_8(·), f_9(·) and f_10(·) in Equation (6.3). Finally, we are interested in potential heterogeneity in the return to child care use. We accomplish this by separating our observation-specific estimates amongst different pre-specified groups.
Our final-stage results will be given in three separate tables. The first two tables give the 10th, 25th, 50th, 75th and 90th percentiles of the estimated returns and their related (wild) bootstrapped standard errors. The first of those tables (Table 4) looks at the returns to (gradients of) the final-stage estimates of each of the nonlinear variables (excluding the residuals) from the second-stage regressions. The remaining tables (Tables 5 and 6) decompose the gradients for the child care use variable.
We also provide several figures of estimated densities and distribution functions. Figures 1-3
give a set of density plots for the estimated returns to child care use. This allows us to see the
overall distribution. We believe this is a more informative type of analysis as compared to solely
showing results at the mean or percentiles. Figure 4 looks at empirical cumulative distribution
functions (ECDF) for both positive and negative gradients with respect to the amount of child
care.
6.2.1 First and second-stage estimation
For our first-stage regressions, our main focus is the performance of our instruments. In order
to analyze this, we check the significance of our instruments in each of our three first-stage
regressions. Noting that we have 99 regressors in each first-stage, the percentage of significant
instruments in each first-stage regression is roughly one-half. This type of analysis, of course,
is informal. Rather than relying on univariate-type significances, we prefer to perform formal
tests to check for joint significance.
Here we perform a nonparametric F-type test, originally proposed by Ullah (1985). The test involves comparing the residual sum of squares between a restricted and an unrestricted model. Our restricted model sets the coefficient on each of our instruments to zero. The asymptotic distribution can be obtained by multiplying the statistic by a constant, but it is well
known that using the asymptotic distribution is problematic in practice. Instead, we use a wild bootstrap to determine the conclusion of our tests. For each first-stage regression we perform a test where the null is that each instrument is irrelevant. In each case our p-value is zero to at least four decimal places. Hence, we argue (as did Bernal and Keane, 2011) that our instruments are relevant in our prediction of our endogenous regressors.

Table 4: Final-stage gradient estimates for each of the nonlinear variables at various percentiles with corresponding wild bootstrapped standard errors (in parentheses); rows follow the order of f_1-f_7 in Equation (6.3)

        10%        25%        50%        75%        90%
f_1    -0.0089    -0.0061    -0.0005     0.0049     0.0079
       (0.0034)   (0.0038)   (0.0042)   (0.0027)   (0.0054)
f_2    -0.0415    -0.0045     0.0250     0.0463     0.0741
       (0.0225)   (0.0258)   (0.0249)   (0.0329)   (0.0304)
f_3    -0.0157    -0.0157    -0.0117    -0.0021    -0.0021
       (0.0077)   (0.0077)   (0.0046)   (0.0065)   (0.0065)
f_4    -0.0058    -0.0058    -0.0046     0.0037     0.0128
       (0.0062)   (0.0062)   (0.0061)   (0.0091)   (0.0080)
f_5    -0.0034    -0.0022    -0.0016     0.0003     0.0031
       (0.0031)   (0.0031)   (0.0025)   (0.0038)   (0.0038)
f_6    -0.0012     0.0000     0.0015     0.0034     0.0046
       (0.0008)   (0.0009)   (0.0009)   (0.0013)   (0.0019)
f_7    -0.0051    -0.0017     0.0035     0.0051     0.0063
       (0.0028)   (0.0024)   (0.0024)   (0.0020)   (0.0037)
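A minimal version of this restricted-versus-unrestricted RSS comparison with a wild bootstrap can be sketched as follows; the simulated data, the linear first stage and the Rademacher weights are our own simplifying choices, not the paper's exact specification.

```python
import numpy as np

def f_type_stat(y, X_restricted, X_full):
    """RSS-based statistic: (RSS_r - RSS_u) / RSS_u, as in an F-type comparison."""
    def rss(X):
        e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return float(e @ e)
    return (rss(X_restricted) - rss(X_full)) / rss(X_full)

rng = np.random.default_rng(1)
n = 300
W = np.column_stack([np.ones(n), rng.normal(size=n)])     # always-included controls
Z = rng.normal(size=(n, 3))                               # instruments under test
x = W @ np.array([1.0, 0.5]) + Z @ np.array([0.4, 0.3, 0.2]) + rng.normal(size=n)

stat = f_type_stat(x, W, np.column_stack([W, Z]))

# Wild bootstrap under the null: refit the restricted model, then perturb its
# residuals with Rademacher weights and recompute the statistic each time.
fit_r = W @ np.linalg.lstsq(W, x, rcond=None)[0]
res_r = x - fit_r
boot = np.empty(399)
for b in range(399):
    xb = fit_r + res_r * rng.choice([-1.0, 1.0], size=n)
    boot[b] = f_type_stat(xb, W, np.column_stack([W, Z]))
pval = float((boot >= stat).mean())
print(stat, pval)
```

Because the instruments truly enter the first stage here, the observed statistic sits far in the right tail of the bootstrap distribution and the test rejects the null of irrelevance.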
In the second-stage regression we are concerned with the joint significance of each of the
residuals from the first-stage regressions. We perform a similar Ullah (1985) type test as above
and reject the null that the three sets of residuals are jointly insignificant with a p-value that is
zero to four decimal places. We conclude that endogeneity is likely present and thus justify the
use of our procedure.
6.2.2 Final-stage estimation
Here we are interested in comparing our gradient estimates to those of Bernal and Keane (2011). Our average gradient estimates (Table 4) for the nonlinear variables are often similar in magnitude and sign to their results. We find mostly positive effects for mother's income, education and AFQT scores. For our primary variable of interest, the gradient on child care is also similar at the median (noting that our median result is insignificant). To put this in perspective, the median coefficient of -0.0005 is equivalent to a 0.2% decrease in test scores for an additional year of child care use (or a 0.05% decrease for an additional quarter). That being said, each of these statements ignores the heterogeneity in the gradient estimates allowed by our procedure.
When we look at the percentiles, we see both positive and negative estimates. This is not possible with standard linear models. We therefore want to determine the story behind these variable returns. To do so, we break the results up by child care type, amount of child care and child gender. We also examine whether heterogeneity is present amongst mothers (by level of education, experience and age). Finally, we try to determine which attributes are common amongst those receiving positive or negative returns to child care use.
Disaggregating the child care gradient In the first few rows of Table 5, we analyze the child care gradient with respect to child care type, child gender and amount of child care use. In this table we report the percentiles of the estimated gradients for the related groups and the associated standard errors. We also provide Figures 1-3, which show the overall variation for selected pairs of groups. Before we get into the details for different groups, we want to point out that many of the results are insignificant. In fact, only 634 of the 2454 estimates are statistically significant at the five-percent level. What this implies is that for a large portion of the sample, an additional unit of child care will have no impact on test scores. That being said, we find many cases where it does matter and we highlight those results below.
Bernal and Keane (2011) found that only informal child care (e.g., a grandparent) had
significantly negative effects. Specifically, they found that an additional year of informal child
care led to a 2.6% reduction in test scores. To compare our results, we separated the gradients
on child care use between those who received only formal or only informal child care. Although
there are some differences in our percentile estimates, Figure 1 shows essentially no difference
in the densities of estimates between those who received only formal (e.g., a licensed child care
facility) versus only informal child care. Bernal and Keane (2011) also find differences in returns
to child care between genders. Both Table 5 and Figure 2 show essentially no difference between
these two groups.
Where we do find a difference is with respect to the amount of child care. In Figure 3, we
Table 5: Final-stage gradient estimates for child care use for different types of child care, gender, amount of child care and different attributes of the mother, at various percentiles for the specific group, with corresponding wild bootstrapped standard errors (in parentheses)

                             10%        25%        50%        75%        90%
Formal                    -0.0087    -0.0067    -0.0016     0.0035     0.0065
                          (0.0052)   (0.0037)   (0.0036)   (0.0032)   (0.0031)
Informal                  -0.0090    -0.0061    -0.0007     0.0028     0.0067
                          (0.0089)   (0.0038)   (0.0036)   (0.0049)   (0.0042)
Female                    -0.0092    -0.0061    -0.0005     0.0057     0.0079
                          (0.0038)   (0.0038)   (0.0042)   (0.0046)   (0.0055)
Male                      -0.0089    -0.0067    -0.0005     0.0036     0.0079
                          (0.0034)   (0.0037)   (0.0042)   (0.0046)   (0.0055)
Above median child care   -0.0092    -0.0087    -0.0054     0.0028     0.0049
                          (0.0038)   (0.0052)   (0.0036)   (0.0037)   (0.0037)
Below median child care   -0.0061    -0.0010     0.0013     0.0076     0.0218
                          (0.0038)   (0.0035)   (0.0042)   (0.0034)   (0.0080)
Education < 12            -0.0067     0.0014     0.0028     0.0076     0.0218
                          (0.0037)   (0.0036)   (0.0037)   (0.0034)   (0.0080)
Education ≥ 12            -0.0090    -0.0068    -0.0014     0.0028     0.0067
                          (0.0089)   (0.0032)   (0.0036)   (0.0049)   (0.0042)
Experience < 5            -0.0090    -0.0068    -0.0014     0.0028     0.0067
                          (0.0036)   (0.0032)   (0.0036)   (0.0049)   (0.0042)
Experience ≥ 5            -0.0087    -0.0046     0.0011     0.0065     0.0218
                          (0.0052)   (0.0038)   (0.0040)   (0.0031)   (0.0080)
No experience             -0.0061    -0.0007     0.0028     0.0076     0.0218
                          (0.0038)   (0.0036)   (0.0037)   (0.0034)   (0.0080)
Age < 23                  -0.0087    -0.0053     0.0007     0.0063     0.0139
                          (0.0052)   (0.0036)   (0.0039)   (0.0036)   (0.0066)
Age 23-29                 -0.0090    -0.0061    -0.0007     0.0049     0.0079
                          (0.0089)   (0.0038)   (0.0053)   (0.0037)   (0.0055)
Age ≥ 30                  -0.0090    -0.0076    -0.0016     0.0028     0.0076
                          (0.0036)   (0.0034)   (0.0036)   (0.0037)   (0.0034)
Figure 1: Density of estimated returns to child care for those with only formal versus those with
only informal child care
look at the estimated gradients of child care use for those children who receive below and above the median total child care. We find that more child care use leads to lower returns. In fact, we see negative returns for those children receiving relatively more child care, which suggests decreasing returns to child care use. We consider this finding important since it provides evidence that the lower returns may be associated with the amount of child care use rather than the type of child care or the gender of the child.
In the remaining rows of Table 5, we separate our estimated gradients on child care based on
attributes of the mothers. Specifically, we analyze the estimated returns for mothers of different
levels of education, experience and age. We see variation in our estimates for mothers of different
age groups. As mothers get older, we see lower returns to child care use (more negative and
significant estimates) as compared to those of younger mothers.
We also see variation for different experience levels. Here we find more negative (and sig-
nificant) estimates for mothers with more experience. On the other hand, for mothers with no
experience, we see much larger (and significant) returns to child care.
Similarly, mothers with more education appear to have more negative returns. Many of
Figure 2: Density of estimated returns to child care for male and female children
the percentiles are negative for mothers with twelve or more years of education. For mothers
without a high school diploma, we see positive effects for all but the lowest percentile. In other
words, for less educated mothers, more child care may actually improve their child’s test scores.
All these results show us two important things. First, there is substantial heterogeneity in
our returns to child care use that cannot be captured with a single parameter estimate. Second,
the returns tend to be related to the amount of child care and the quality of maternal time. For
those mothers who have more education and experience, child care tends to hurt their child’s
test score. On the other hand, for those mothers with less education and experience, our results
tend to suggest that their children may be better off with more child care.
Positive and negative returns For most of the reported estimates, we tend to see both positive and negative returns. This is perhaps more important than the magnitudes of the estimated returns themselves. What this finding suggests is that there are some children who benefit from additional child care and some who are harmed. We hope to uncover who these children are and, hopefully, the drivers of such returns.
Table 6 separates the partial effects on child care by those which are positive and negative.
Figure 3: Density of estimated returns to child care for those with above and below the median
units of child care
Table 6: Median characteristics for groups with positive versus negative child care use gradients

Attribute                Positive   Negative
Mother's education           12         12
Mother's experience           4          6
Mother's AFQT                15         19
Mother's age                 22         24
Child's age                 5.8        5.8
Formal child care             0          0
Informal child care           6          9
Total child care            8.5       11.5
Number of children            1          1
Sample size                1141       1313
Figure 4: Empirical cumulative distribution functions for the amount of child care for those
with both positive and negative returns to child care
Perhaps the first point of interest is that slightly more than half of the gradients are negative.
If we were to run a simple ordinary least-squares regression, this would (back of the envelope)
suggest a negative coefficient. This is what is typically found in the literature.
As for the remaining values in the table, the rows represent characteristics of interest and each number represents the median value of that characteristic for children with positive and with negative returns. Many values are the same. For example, mother's education, child's age, quarters of formal child care and number of children are the same at the median in each group. However, we see that for negative returns, (median) mothers have more experience and are older. It is true that they have more informal child care (which likely reflects the result of Bernal and Keane, 2011), but it is also true that they have far more quarters of child care overall at the median.
Again, these are only point estimates. If we plot the ECDFs of each type of child care for the groups with negative and positive child care gradients, we see that the amounts of both formal and informal child care use are higher for those with negative returns. For example,
Figure 4 plots the distribution functions with respect to total child care use between those with
negative and those with positive returns to child care. Those children with negative returns to child care receive more child care overall. We fail to reject the null of first-order dominance via a Kolmogorov-Smirnov test, with a p-value near unity. We also find this same pattern of dominance when we look at formal versus informal child care or male versus female children. This is strong evidence that it is the amount of child care, and not necessarily the type, that matters.
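The dominance comparison can be illustrated with a hand-rolled one-sided Kolmogorov-Smirnov check on simulated child care amounts (the gamma distributions below are hypothetical, chosen only to mimic the group sizes and medians in Table 6):

```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at the points in t."""
    return np.searchsorted(np.sort(sample), t, side="right") / sample.size

rng = np.random.default_rng(2)
# Hypothetical quarters of child care; the negative-returns group receives more care.
care_pos = rng.gamma(shape=4.0, scale=2.0, size=1141)
care_neg = rng.gamma(shape=4.0, scale=2.9, size=1313)

t = np.sort(np.concatenate([care_pos, care_neg]))
F_pos, F_neg = ecdf(care_pos, t), ecdf(care_neg, t)

# First-order dominance of the negative-returns group requires F_neg <= F_pos
# everywhere; the one-sided KS statistic records the worst violation.
ks_violation = float(np.max(F_neg - F_pos))
print(ks_violation)
```

A violation near zero is consistent with failing to reject first-order dominance; in practice one would attach a p-value via the KS null distribution or a permutation scheme.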
7 Conclusion
In this paper, we develop oracle efficient estimators for additive nonparametric structural equa-
tion models. We show that our estimators of the conditional mean and gradient are consistent,
asymptotically normal and free from the curse of dimensionality. The finite sample results sup-
port the asymptotic development. We also consider a partially linear extension of our model
which we use in an empirical application relating child care to test scores. In our application we find that the amount of child care use, and not the type, is primarily responsible for the sign of the returns. Given that our nonparametric procedure gives us observation-specific estimates, we are able to uncover, in addition to the role of the amount of child care, which attributes of mothers are related to different returns. We find evidence that more educated, more experienced mothers with higher test scores (themselves) are associated with lower returns to child care for their children. On the other hand, the children of less educated, less experienced mothers with lower test scores often have positive returns to child care.
8 Acknowledgements
The authors would like to thank two anonymous referees, the joint editor Rong Chen, Peter Brummund, David Jacho-Chavez, Chris Parmeter and Anton Schick for useful comments and suggestions. They also thank participants in talks given at the State University of New York at Albany, the University of Alabama, the University of North Carolina at Greensboro, New York Camp Econometrics (Bolton, NY), the annual meeting of the Midwest Econometrics Group (Bloomington, IN) and the Western Economic Association International annual conference (Seattle, WA). Su acknowledges support from the Singapore Ministry of Education Academic Research Fund under grant number MOE2012-T2-2-021. The R code used in the paper can be obtained from the authors upon request.
Appendix
In this appendix, we first provide the assumptions that are used to prove the main results and then prove the main results stated in Section 3.
A Assumptions
A real-valued function $g(\cdot)$ on the real line is said to satisfy a Hölder condition with exponent $\gamma \in (0, 1]$ if there is a constant $c$ such that $|g(u) - g(\tilde{u})| \le c\,|u - \tilde{u}|^{\gamma}$ for all $u$ and $\tilde{u}$ on the support of $g(\cdot)$. $g(\cdot)$ is said to be $\kappa$-smooth, $\kappa = m + \gamma$, if it is $m$-times continuously differentiable on its support and its $m$th derivative, $g^{(m)}(\cdot)$, satisfies a Hölder condition with exponent $\gamma$. The $\kappa$-smooth class of functions is popular in econometrics because a $\kappa$-smooth function can be approximated well by various linear sieves; see, e.g., Chen (2007). For any scalar function $g(\cdot)$ on the real line that has $m$ derivatives and support $\mathcal{S}$, let $|g(\cdot)|_{m} \equiv \max_{j \le m} \sup_{s \in \mathcal{S}} |g^{(j)}(s)|$. Let $\mathcal{X}_{d}$ and $\mathcal{U}_{d}$ denote the supports of $X_{d}$ and $U_{d}$, respectively, for $d = 1, \ldots, D$. Let $\mathcal{Z}_{\ell}$ denote the support of $Z_{\ell d}$ for $d = 1, \ldots, D_{\ell}$ and $\ell = 1, 2$. Let $P_{i} \equiv p^{L_{1}}(Z_{1i}, Z_{2i}) = [p_{1}(Z_{1i}, Z_{2i}), \ldots, p_{L_{1}}(Z_{1i}, Z_{2i})]'$ and $\Omega_{P,d} \equiv E[p^{L_{1}}(Z_{1i}, Z_{2i})\,p^{L_{1}}(Z_{1i}, Z_{2i})'\,v_{di}^{2}]$ for $d = 1, \ldots, D$. Let $Z_{i} \equiv (Z_{1i}', Z_{2i}')'$.
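To illustrate the approximation property that motivates this smoothness class, the following sketch (our illustration, not part of the paper) projects a smooth function on a polynomial sieve of growing dimension and reports the sup-norm approximation error on a grid:

```python
import numpy as np

# Target: a smooth function on [0, 1]; smooth targets are well approximated by sieves.
def g(u):
    return np.sin(2 * np.pi * u) + 0.5 * u**2

u = np.linspace(0.0, 1.0, 501)
y = g(u)

errors = {}
for L in (3, 6, 12):
    # Polynomial sieve of dimension L: basis (1, u, ..., u^{L-1}); least-squares projection.
    P = np.vander(u, N=L, increasing=True)
    alpha, *_ = np.linalg.lstsq(P, y, rcond=None)
    errors[L] = np.max(np.abs(P @ alpha - y))

# The sup-norm approximation error shrinks as the sieve dimension grows,
# in line with the polynomial approximation rates for smooth functions.
assert errors[12] < errors[6] < errors[3]
print(errors)
```

Splines or wavelets could replace the polynomial basis here; the shrinking sup-norm error as the sieve dimension grows is the property the assumptions below exploit.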
We make the following assumptions.
Assumption A1. (i) $\{X_{i}, Z_{i}\}_{i=1}^{n}$ is an IID random sample. (ii) The supports $\mathcal{W}$ and $\mathcal{Z}$ of $W_{i}$ and $Z_{i}$ are compact. (iii) The distributions of $W_{i}$ and $Z_{i}$ are absolutely continuous with respect to the Lebesgue measure.
Assumption A2. (i) For every $L_{1}$ that is sufficiently large, there exist $\underline{c}_{1}$ and $\bar{c}_{1}$ such that $0 < \underline{c}_{1} \le \lambda_{\min}(Q_{PP}) \le \lambda_{\max}(Q_{PP}) \le \bar{c}_{1} < \infty$ and $\lambda_{\max}(\Omega_{P,d}) \le \bar{c}_{1} < \infty$ for $d = 1, \ldots, D$. (ii) For every $K$ that is sufficiently large, there exist $\underline{c}_{2}$ and $\bar{c}_{2}$ such that $0 < \underline{c}_{2} \le \lambda_{\min}(Q_{\Phi\Phi}) \le \lambda_{\max}(Q_{\Phi\Phi}) \le \bar{c}_{2} < \infty$. (iii) The functions $g_{d}(\cdot)$, $d = 1, \ldots, D_{1} + D_{2}$, and $m_{d}(\cdot)$, $d = 1, \ldots, 2D + D_{1}$, belong to the class of $\kappa$-smooth functions with $\kappa \ge 2$. (iv) There exist $\alpha$'s such that $\sup_{z \in \mathcal{Z}_{1}} |g_{d}(z) - p^{L_{1}}(z)'\alpha_{d}| = O(L_{1}^{-\kappa})$ for $d = 1, \ldots, D_{1}$, and $\sup_{z \in \mathcal{Z}_{2}} |g_{D_{1}+d}(z) - p^{L_{1}}(z)'\alpha_{D_{1}+d}| = O(L_{1}^{-\kappa})$ for $d = 1, \ldots, D_{2}$. (v) There exist $\beta$'s such that $\sup_{x \in \mathcal{X}_{d}} |m_{d}(x) - \varphi^{K}(x)'\beta_{d}| = O(K^{-\kappa})$ for $d = 1, \ldots, D$, $\sup_{z \in \mathcal{Z}_{1}} |m_{D+d}(z) - \varphi^{K}(z)'\beta_{D+d}| = O(K^{-\kappa})$ for $d = 1, \ldots, D_{1}$, and $|m_{D+D_{1}+d}(\cdot) - \varphi^{K}(\cdot)'\beta_{D+D_{1}+d}|_{1} = O(K^{-\kappa})$ for $d = 1, \ldots, D$. (vi) The basis functions $\varphi_{k}(\cdot)$, $k = 1, 2, \ldots$, are twice continuously differentiable everywhere on the support $\mathcal{U}_{d}$ of $U_{d}$ for $d = 1, \ldots, D$, and $\max_{1 \le d \le D} \sup_{u \in \mathcal{U}_{d}} \|\partial^{j}\varphi^{K}(u)\| \le \zeta_{j}$ for $j = 0, 1, 2$.
Assumption A3. (i) The joint PDF of any two elements in $W_{i}$ is bounded, bounded away from zero, and twice continuously differentiable. (ii) Let $\sigma_{di}^{2} \equiv \sigma_{d}^{2}(X_{i}, Z_{i}, U_{i}) \equiv E(\varepsilon_{di}^{2} \mid X_{i}, Z_{i}, U_{i})$ and $\Omega_{d\ell} \equiv E[p^{L_{1}}(Z_{\ell i})\,p^{L_{1}}(Z_{\ell i})'\,\sigma_{di}^{2}]$ for $d = 1, \ldots, D$ and $\ell = 1, 2$. The largest eigenvalue of $\Omega_{d\ell}$ is bounded uniformly in $L_{1}$.
Assumption A4. The kernel function $k(\cdot)$ is a PDF that is symmetric, bounded and has compact support $[-c_{k}, c_{k}]$. It satisfies the Lipschitz condition $|k(u_{1}) - k(u_{2})| \le C_{k}|u_{1} - u_{2}|$ for all $u_{1}, u_{2} \in [-c_{k}, c_{k}]$.
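As a concrete kernel satisfying these requirements (our illustrative choice, not one mandated by the paper), consider the Epanechnikov kernel, which is a symmetric, bounded density on a compact support and Lipschitz on that support:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric, bounded density supported on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-1.0, 1.0, 20001)
du = u[1] - u[0]
k = epanechnikov(u)

# A PDF: integrates to one (trapezoidal rule), symmetric, bounded by 3/4.
integral = np.sum((k[:-1] + k[1:]) * 0.5) * du
assert abs(integral - 1.0) < 1e-4
assert np.allclose(k, epanechnikov(-u))
assert k.max() <= 0.75 + 1e-12

# Lipschitz on the support: |k'(u)| = 1.5|u| <= 1.5, so successive grid
# differences are bounded by (roughly) 1.5 * du.
assert np.max(np.abs(np.diff(k))) <= 1.5 * du + du**2
```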
Assumption A5. (i) $L_{1} \le K$. As $n \to \infty$, $L_{1} \to \infty$, $L_{1}^{3}/n \to 0$ and $\xi_{n} \to c_{1} \in [0, \infty)$, where $\xi_{n} \equiv (K^{1/2}\zeta_{0} + \zeta_{1})\lambda_{1n} + \zeta_{0}\zeta_{2}\lambda_{1n}^{2}$, $\lambda_{1n} \equiv L_{1}^{1/2}n^{-1/2} + L_{1}^{-\kappa}$, and $\lambda_{n} \equiv K^{1/2}n^{-1/2} + K^{-\kappa}$. (ii) As $n \to \infty$, $h \to 0$, $nh^{3}/\log n \to \infty$, $Kh^{-2}/n \to 0$, $L_{1} = o(n^{1/2}h^{1/2})$, and $[\zeta_{1}L_{1}^{1/2}\lambda_{1n}(1 + K^{1/2}L_{1}^{-\kappa}) + \zeta_{2}L_{1}\lambda_{1n}^{2}](\lambda_{n} + \lambda_{1n}) \to 0$.
Assumptions A1(i)-(ii) impose IID sampling and compactness of the support of the exogenous independent variables. Either assumption can be relaxed at the cost of lengthy arguments; see, e.g., Su and Jin (2012), who allow for weakly dependent data and infinite support for their regressors. A1(iii) requires that the variables in W and Z be continuously valued, which is standard in the literature on sieve estimation. Assumptions A2(i)-(ii) ensure the existence and non-singularity of the asymptotic covariance matrices of the first two-stage estimators. They are standard in the literature; see, e.g., Newey (1997), Li (2000), and Horowitz and Mammen (2004). Note that all of these authors assume that the conditional variances of the error terms given the exogenous regressors are uniformly bounded, in which case the second part of A2(i) becomes redundant. A2(iii) imposes smoothness conditions on the relevant functions, and A2(iv)-(v) quantifies the approximation error for $\kappa$-smooth functions. These conditions are satisfied, for example, by polynomials, splines and wavelets. A2(vi) is needed for the application of Taylor expansions. It is well known that $\zeta_{j} = O(K^{j+1/2})$ and $\zeta_{j} = O(K^{2j+1})$ for B-splines and power series, respectively (see Newey, 1997). The rate at which splines uniformly approximate a function is the same as that for power series, so that the uniform convergence rate for splines is faster than that for power series. In addition, the low multicollinearity of B-splines and the recursive formula for their calculation lead to computational advantages (see Chapter 19 of Powell, 1981, and Chapter 4 of Schumaker, 2007). For these reasons, B-splines are widely used in the literature.
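The recursive formula behind these computational advantages is the Cox-de Boor recursion. The sketch below (our illustration, not the paper's code) evaluates a cubic B-spline basis and checks two properties that underlie the low multicollinearity: each basis function is locally supported, and the basis forms a partition of unity:

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    n_basis = len(t) - degree - 1
    # Degree-0 splines are indicators of the knot spans.
    B = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (x >= t[j]) & (x < t[j + 1])
    # Close the right end so x equal to the last knot belongs to the last span.
    B[x == t[-1], np.searchsorted(t, t[-1], side="left") - 1] = 1.0
    for d in range(1, degree + 1):
        Bn = np.zeros((len(x), len(t) - d - 1))
        for j in range(len(t) - d - 1):
            denom1 = t[j + d] - t[j]
            denom2 = t[j + d + 1] - t[j + 1]
            left = (x - t[j]) / denom1 * B[:, j] if denom1 > 0 else 0.0
            right = (t[j + d + 1] - x) / denom2 * B[:, j + 1] if denom2 > 0 else 0.0
            Bn[:, j] = left + right
        B = Bn
    return B[:, :n_basis]

# Cubic B-splines on [0, 1] with clamped (repeated) boundary knots.
knots = np.r_[0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
x = np.linspace(0.0, 1.0, 101)
B = bspline_basis(x, knots, degree=3)

assert B.shape == (101, 7)
assert np.allclose(B.sum(axis=1), 1.0)      # partition of unity
assert np.all(B[x > 0.25, 0] == 0.0)        # local support of the first basis function
```

Production implementations are available in, e.g., SciPy's interpolate module; the point here is only that the recursion is cheap and yields locally supported, well-conditioned regressors.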
Assumptions A3(i)-(ii) and A4 are needed to establish the asymptotic properties of the third-stage estimators. A3(ii) is redundant given Assumption A2(i) if one assumes that the conditional variances of the error terms given $(X_{i}, Z_{i}, U_{i})$ are uniformly bounded. A4 is standard for local-linear regression (see Fan and Gijbels, 1996, and Masry, 1996). The compact support condition facilitates the demonstration of the uniform convergence rate in Theorem 3.2 below, but it can be removed at the cost of some lengthy arguments (see, e.g., Hansen, 2008); in particular, the Gaussian kernel can then be applied. Assumptions A5(i)-(ii) specify conditions on $L_{1}$, $K$ and $h$. Note that we allow the use of different numbers of series approximation terms in the first- and second-stage estimation, which allows us to see clearly the effect of the first-stage estimates on the second-stage estimates. The first condition (namely, $L_{1} \le K$) in A5(i) is needed for the proof of a technical lemma (see Lemma B.5(iii)) and can be removed at the cost of some additional assumptions on the basis functions. The terms that are associated with $L_{1}$ arise because of the use of the nonparametrically generated regressors in the second-stage series estimation. The factor $\log n$ arises in order to establish the uniform consistency results in Theorem 3.2 below and can be replaced by 1 if we are only interested in the pointwise results. In the case where $\zeta_{j} = O(K^{j+1/2})$ in Assumption A2(vi), $\xi_{n} = O(K^{3/2}\lambda_{1n} + K^{3}\lambda_{1n}^{2})$. In practice, we recommend setting $L_{1} = K$. These restrictions, in conjunction with the condition $\kappa \ge 2$, imply that the conditions in Assumption A5 can be greatly simplified as follows:
Assumption A5∗. (i) As $n \to \infty$, $K \to \infty$ and $K^{4}/n \to c_{1} \in [0, \infty)$. (ii) As $n \to \infty$, $h \to 0$, $nh^{3}/\log n \to \infty$, $Kh^{-2}/n \to 0$, and $n^{1/2}h^{1/2}K^{-3/2} \to 0$.
B Proofs of the Results in Section 3
Let $P_{i} \equiv p^{L_{1}}(Z_{i})$, $\tilde{\Phi}_{i} \equiv \Phi(\tilde{W}_{i})$, $\hat{Q}_{PP} \equiv n^{-1}\sum_{i=1}^{n} P_{i}P_{i}'$, $\hat{Q}_{\Phi\Phi} \equiv n^{-1}\sum_{i=1}^{n} \Phi_{i}\Phi_{i}'$, and $\tilde{Q}_{\Phi\Phi} \equiv n^{-1}\sum_{i=1}^{n} \tilde{\Phi}_{i}\tilde{\Phi}_{i}'$. By Lemmas B.1(ii) and (v) and Lemma B.4(iii) below, $\hat{Q}_{PP}$, $\hat{Q}_{\Phi\Phi}$ and $\tilde{Q}_{\Phi\Phi}$ are invertible with probability approaching 1 (w.p.a.1), so that in large samples we can replace the generalized inverses $\hat{Q}_{PP}^{-}$, $\hat{Q}_{\Phi\Phi}^{-}$ and $\tilde{Q}_{\Phi\Phi}^{-}$ by $\hat{Q}_{PP}^{-1}$, $\hat{Q}_{\Phi\Phi}^{-1}$ and $\tilde{Q}_{\Phi\Phi}^{-1}$, respectively. We first state some technical lemmas that are used in the proof of the main results in Section 3. The proofs of all technical lemmas but Lemma B.6 are given in the online Supplemental Material.
Lemma B.1 Suppose that Assumptions A1 and A2(i)-(ii) and (vi) hold. Then
(i) $\|\hat{Q}_{PP} - Q_{PP}\|^{2} = O_{p}(L_{1}^{2}/n)$;
(ii) $\lambda_{\min}(\hat{Q}_{PP}) = \lambda_{\min}(Q_{PP}) + o_{p}(1)$ and $\lambda_{\max}(\hat{Q}_{PP}) = \lambda_{\max}(Q_{PP}) + o_{p}(1)$;
(iii) $\|\hat{Q}_{PP}^{-1} - Q_{PP}^{-1}\|_{\mathrm{sp}} = O_{p}(L_{1}n^{-1/2})$;
(iv) $\|\hat{Q}_{\Phi\Phi} - Q_{\Phi\Phi}\|^{2} = O_{p}(K^{2}/n)$;
(v) $\lambda_{\min}(\hat{Q}_{\Phi\Phi}) = \lambda_{\min}(Q_{\Phi\Phi}) + o_{p}(1)$ and $\lambda_{\max}(\hat{Q}_{\Phi\Phi}) = \lambda_{\max}(Q_{\Phi\Phi}) + o_{p}(1)$.
Lemma B.2 Let ≡ −1P
=1 and ≡ −1P
=1 [ (Z)− 0α] for = 1
Suppose that Assumptions A1-A2 hold. Then
() kk2 = (1);
() kk2 = (−21 );
() eα−α = −11 −1P
=1 +−11 −1P
=1 [ (Z)− 0α] + ;
where kk = (1+ −+121 12) and = 1
Lemma B.3 Suppose that Assumptions A1-A3 hold. Then for = 1
() −1P
=1
³e −
´2 £2¤=
¡21¢for = 0 1;
() −1P
=1
³e −
´2 kΦk =
¡0
21
¢for = 1 2;
() −1P
=1
°°° ³e
´− ()
°°°2 =
¡21
21
¢;
()°°°−1P
=1
h³e
´− ()
iΦ0°°° =
¡1201 + 02
21
¢;
()°°°−1P
=1
h³e
´− ()
i
°°° =
¡−1211
¢
Lemma B.4 Suppose Assumptions A1-A3 hold. Then
() −1P
=1
°°°eΦ −Φ°°°2 =
¡21
21
¢;
()°°°−1P
=1
³eΦ −Φ´Φ0°°°sp=
¡1211
¢;
()°°° eΦΦ −ΦΦ
°°°sp=
¡1201 + 02
21
¢;
()°°° e−1ΦΦ −−1ΦΦ
°°°sp=
¡1201 + 02
21
¢;
()°°°−1P
=1
³eΦ −Φ´ °°° =
¡−1211
¢;
() −1P
=1
³eΦ −Φ´ [ ( 1 )−Φ0β] =
³−1 11
´
Lemma B.5 Let ≡ −1P
=1Φ and ≡ −1P
=1Φ [ ( 1 )−Φ0β] SupposeAssumptions A1-A3 hold. Then
() kk = (1212);
() kk = (−);
()°°°−1ΦΦ−1P
=1ΦP
=1 ()
0 β+1+
³e −
´°°° = (1) for = 1
Lemma B.6 Let ≡ (1 2)0 be an arbitrary 2×1 nonrandom vector such that kk = 1 Supposethat Assumptions A1-A5 hold. Then for = 2
() 2 (1) ≡ −1212P
=110−1∗
1 (1) (e−)
() = 121 (1+12
−1 )
uniformly in 1;
() 2 (1) ≡ −1212P
=11
¯0−1∗
1 (1)¯(e − )
2 = 1212 (21) uni-
formly in 1
Proof. () Let (1) ≡ −1P
=10−1∗
1 (1) () and (1) ≡ [ (1)]
By straightforward moment calculations and Chebyshev inequality, we have (1) = (1)+
(1) where k (1)k = (12−12−12) In fact, sup1∈X1 k (1)k = (
12( log)−12)with a simple application of Bernstein inequality for independent observations [see, e.g., Serfling
(1980, p. 95)]. As an aside, note that the proof of Lemma 7 in Horowitz and Mammen (2004) contains various errors, as they ignore the fact that the number of series approximation terms diverges to infinity as $n \to \infty$. Note
that for = 2
(1) = £ (1 − 1)
0−1∗1 (1)
()¤
=
Z () (1 + 2)
() 1
³1 + 12
´
= 1
Z1 (1 )
() + 1
Z () [1 (1 + )− 1 (1 )]
()
+2
Z () [1 (1 + )− 1 (1 )]
()
≡ 11 (1) + 12 (1) + 23 (1)
As in Horowitz and Mammen (2004, p. 2435), in view of the fact that the components
of 1 (1) are the Fourier coefficients of a function that is bounded uniformly over X1 we
have sup1∈X1 k1 (1)k2 = (1) In addition, using Assumptions A2() and A3(), we can
readily show that sup1∈X1 k2 (1)k =
¡122
¢and sup1∈X1 k3 (1)k =
¡12
¢
It follows that sup1∈X1 k (1)k =
¡1 + 12
¢= (1) under Assumption A5() and
sup1∈X1 k (1)k = (1)
By (B.2), 1 (1) = −P5
=1 −1212
P=11
0−1∗1 (1)
() ≡ −P5
=1 1 (1)
say. Noting that 11 (1) = 1212 (1) (e − ) we have
sup1∈X1
k11 (1)k ≤ 1212 sup1∈X1
k (1)k |e − | = 1212 (1)
³−12
´= (1)
Next, note that 12 (1) =P1
=1 12 (1) where 12 (1) = −1212P
=110−1
∗1 (1)
() 1 (1)
0 S11 We decompose 12 as follows:
12 (1) = −1212X=1
10−1∗
1 ()
1 (1)0 S1−1
= 1212 (1)S1−1 + 1212 (1)S1³−1 −−1
´
≡ 121 (1) + 122 (1) say.
where (1) ≡ −1P
=110−1∗
1 (1) ()
1 (1)0 Let (1) ≡ [ (1)]
As in the analysis of (1), we can show that sup1∈X1°° (1)
°°sp= (1) and sup1∈X1°° (1)− (1)
°°sp≡ ((1 log)
−12) It follows that sup1∈X1 k (1)ksp = (1
+(1 log)−12) = (1) under Assumption A5() Then following the analysis of1 (1)
in the proof of Theorem 3.2, we can show that k121 (1)k =
¡121
¢uniformly in 1
In addition,
sup1∈X1
k122 (1)k ≤ 1212 sup1∈X1
k (1)ksp kS1ksp°°°−1 −−1
°°°spkk
= 1212 (1) (1) (1−12) (
121 −12)
= (1321 −1212)
It follows that sup1∈X1 k12 (1)k = (121) + (1
321 −1212) = (
121)
under Assumption A5(). Analogously,
sup1∈X1
k14 (1)k ≤1X=1
1212 sup1∈X1
k (1)k kS1ksp k2k
= 1212 (1) (−1 ) = (
12121−1 )
By the same token, we can show that sup1∈X1 k13 (1)k =
¡121
¢and sup1∈X1 ||15
(1) || = (12121
−1 ) It follows that sup1∈X1 k1 (1)k = 121 (1 + 12
−1 )
() By (B.2) and Cauchy-Schwarz inequality, 2 (1) ≤ 5P5
=1 −1212
P=11
2¯
0−1∗1 (1)
¯≡ 5P5=1 2 say. It is easy to show that sup1∈X1 21 (1) = (
−1212)
Note that 22 (1) = −1212P
=11
¯0−1∗
1 (1)¯22 ≤ 1
P1=1 22 (1) where
22 (1) = −1212X=1
1
¯0−1∗
1 (1)¯1 (1)
0 S1101S011 (1)
= 1212tr¡S1101S01 (1)
¢and (1) ≡ −1
P=11
¯0−1∗
1
¯1 (1)
1 (1)0 As in the analysis of (1),
we can show that sup1∈X1 k (1)ksp = (1) By the fact tr() ≤ max ()tr() and
kksp = max () for any symmetric matrix and conformable positive-semidefinite matrix
22 (1) ≤ 1212tr¡S1101S01
¢ k (1)ksp = 1212 kS11k2sp k (1)ksp≤ 1212 kS1k2sp k1k2sp k (1)ksp= 1212 (1)
¡1
−1¢ (1) =
³1
−1212´uniformly in 1
It follows that sup1∈X1 22 (1) =
¡1
−1212¢ Similarly, uniformly in 1
24 (1) ≤ 1212 kS1k2sp k2k2sp k (1)ksp= 1212 (1) (
−21 ) (1) =
³12
−21 12
´
By the same token, 23 (1) = (1−1212) and 25 (1) = (
12−21 12) uniformly
in 1 Consequently, sup1∈X1 2 (1) = 1212 (21)
Proof of Theorem 3.1. () Noting that = ( 1 )+ = eΦ0β+ + [ ( 1 )−eΦ0β] we haveeβ − β = e−1ΦΦ−1 X
=1
eΦ − β = e−1ΦΦ−1 X=1
eΦ + e−1ΦΦ−1 X=1
eΦ h ( 1 )− eΦ0βi= e−1ΦΦ + e−1ΦΦ + e−1ΦΦ−1 X
=1
Φ
³Φ − eΦ´0 β + e−1ΦΦ−1 X
=1
³eΦ −Φ´ + e−1ΦΦ−1 X
=1
³eΦ −Φ´ £ ( 1 )−Φ0β¤− e−1ΦΦ−1 X
=1
³eΦ −Φ´³eΦ −Φ´0 β≡ 1 + 2 + 3 + 4 + 5 − 6 say.
Note that 1 = −1ΦΦ+ 1 where 1 = ( e−1ΦΦ−−1ΦΦ) satisfies k1k ≤°°° e−1ΦΦ −−1ΦΦ
°°°sp
×kksp =
£(1201 + 02
21)
12−12¤by Lemmas B.4() and B.5(). Similarly,
2 = −1ΦΦ + 2 where 2 = ( e−1ΦΦ − −1ΦΦ) satisfies k2k ≤°°° e−1ΦΦ −−1ΦΦ
°°° kksp=
£(1201 + 02
21)
−¤ by Lemmas B.4() and B.5(). Next, note that 3 =−1ΦΦ
−1P=1Φ
³Φ − eΦ´0 β+³ e−1ΦΦ −−1ΦΦ
´−1
P=1Φ
³Φ − eΦ´0 β ≡ 31 + 32 We
further decompose 31 as follows:
31 = −−1ΦΦ−1X=1
Φ
X=1
h³e
´− ()
i0β+1+
=X=1
−1ΦΦ−1
X=1
Φ³†
´0β+1+
³ − e
´=
X=1
−1ΦΦ−1
X=1
Φ+1+ ()³ − e
´+
X=1
−1ΦΦ−1
X=1
Φ
h+1+
³†
´− +1+ ()
i ³ − e
´+
X=1
−1ΦΦ−1
X=1
Φ
∙³†
´0β+1+ − +1+
³†
´¸³ − e
´≡
X=1
311 +X=1
312 +X=1
313 say,
where † lies between
e and Noting that |+1+³†
´− +1+ () | ≤ |e − |
where = max1≤≤ max∈U |+1+ ()| = (1) by Assumptions A1() and A2()
k312k ≤ −1P
=1 kΦk (−e)2 = 0
¡21¢by Lemma B.3() By Assumption A2(),
Cauchy-Schwarz inequality and Lemma B.3()
k313k ≤ ¡−
¢ °°−1ΦΦ°°sp −1 X=1
kΦk¯ − e
¯
≤ ¡−
¢ °°−1ΦΦ°°sp(−1
X=1
kΦk2)12(
−1X=1
³ − e
´2)12=
¡−
¢ (1)
³12
´ (1) = −+12 (1)
By Lemma B.5() k311k = (1) which dominates both k312k and k313k Thusk32k ≤
°°° e−1ΦΦ −−1ΦΦ°°°sp
(1) = [(1201 + 02
21)1] It follows that 3 =P
=1−1ΦΦ
−1P=1Φ ×+1+ () (−e)+3 where
°°3°° = [(1201+02
21)1]
By Lemmas B.4()-() k4k =
¡−1211
¢and
k5k ≤°°° e−1ΦΦ°°°
sp
°°°°°−1X=1
³eΦ −Φ´ £ ( 1 )−Φ0β¤°°°°° =
¡−11
¢
where we use the fact that°°° e−1ΦΦ°°°
sp≤°°° e−1ΦΦ −−1ΦΦ
°°°sp+°°−1ΦΦ°°sp = (1)+ (1) = (1)
For 6 we have by Taylor expansion and triangle inequality that
k6k ≤X=1
°°°°° e−1ΦΦ−1X=1
³eΦ −Φ´ h ³e
´− ()
i0β+1+
°°°°°=
X=1
°°°°° e−1ΦΦ−1X=1
³eΦ −Φ´ ³ †´0 β+1+
³e −
´°°°°°≤
X=1
°°° e−1ΦΦ°°°sp
°°°°°−1X=1
³eΦ −Φ´ +1+ ³ †´³e −
´°°°°°+
X=1
°°° e−1ΦΦ°°°sp
°°°°°−1X=1
³eΦ −Φ´ ∙ ³ †´0 β+1+ − +1+
³†
´¸³e −
´°°°°°≡
X=1
61 +X=1
62 say.
By the triangle inequality, Lemmas B.3() and B.4()
61 ≤
°°° e−1ΦΦ°°°sp
(−1
X=1
°°°eΦ −Φ°°°2)12(−1 X=1
³e −
´2)12= (1) (11) (1) =
¡1
21
¢
Similarly, we can show that 62 = −
¡1
21
¢by Assumption A2() and Lemmas B.3()
and B.4() It follows that k6k =
¡1
21
¢ Combining the above results yield the conclu-
sion in ()
() Noting that°°−1ΦΦ°° ≤ °°−1ΦΦ°°sp kk =
¡1212
¢and ||−1ΦΦ|| ≤
°°−1ΦΦ°°sp×kk = (
−) by Lemmas B.5()-() the result in part () follows from part (), Lemma
B.4 and the fact that kRk = (1) under Assumption A5()
() By () and Assumptions A2() , supw∈W¯e (w)− (w)
¯= supw∈W |Φ (w)0 (eβ − β) +
[β0Φ (w)− (w)]|≤ supw∈W kΦ (w)k°°°eβ − β°°°+sup∈W ¯β0Φ (w)− (w)
¯= [0 ( + 1)]
as the second term is () ¥
Proof of Theorem 3.2.
Let 1 ≡ −−2 (2)−− () −+1 (11)−−+1 (11)−+1+1(1)−− 2+1() and Y1 ≡ (11 1)0 Using the notation defined at the end of section 2.2,we have
b1 (1) =£−1X1 (1)0K1X1 (1)−1¤−1−1X1 (1)0K1X1 (1)Y1+£−1X1 (1)0K1X1 (1)−1¤−1−1X1 (1)K1(
eY1 −Y1)≡ 1 (1) + 2 (1) say.
By standard results in local-linear regressions [e.g., Masry (1996) and Hansen (2008)], −1−1
×X1 (1)0K1X1 (1)−1 = 1 (1)
⎛⎝ 1 0
0R2 ()
⎞⎠ + (1) uniformly in 1 1212
× [1 (1)− 1 (1)]→ (0Ω1 (1)) and sup1∈X1 k1 (1)k =
³( log)−12 + 2
´
where 1 (1) and Ω1 (1) are defined in Theorem 3.2. It suffices to prove the theorem by showing
that −1212−1X1 (1)0K1(eY1 −Y1) = (1) uniformly in 1 (for part () of Theorem 3.2
we only need the pointwise result to hold).
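To fix ideas about the third-stage smoother that these expressions analyze, here is a stripped-down local-linear estimator (our illustration; the data-generating process and bandwidth are hypothetical, not from the paper). The intercept of the weighted least-squares fit estimates the conditional mean and the slope estimates the gradient, mirroring the role of the matrices above:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local-linear fit at x0: weighted LS of y on (1, x - x0) with kernel weights.

    Returns estimates of the conditional mean m(x0) and gradient m'(x0).
    """
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)  # Epanechnikov weights
    X = np.column_stack([np.ones_like(x), x - x0])
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    beta = np.linalg.solve(XtWX, XtWy)
    return beta[0], beta[1]

rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

# Estimate the regression function and its gradient at x0 = 0.25;
# the true values are sin(pi/4) and pi*cos(pi/4).
mhat, ghat = local_linear(0.25, x, y, h=0.15)
print(mhat, ghat)
```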
We make the following decomposition: ()−12−1X1 (1)K1(Y1−eY1) = −1212P
=1
1−1∗
1 (1) (1− e1) = (1) +P
=2 (1) +P1
=1 (1) +P
=1 (1) where
(1) =√ (e− )−112
X=1
1−1∗
1 (1)
(1) = −1212X=1
1−1∗
1 (1) [e ()− ()]
(1) = −1212X=1
1−1∗
1 (1) [e+ (1)− + (1)]
(1) = −1212X=1
1−1∗
1 (1)he+1+(e)− +1+()
i
We prove the first part of the theorem by showing that (1) (1) = (1) (2) (1) =
(1) for = 2 (3) (1) = (1) for = 1 1 and (4) (1) = (1) for
= 1 all uniformly in 1
(1) holds by noticing that√ (e− ) = (1) and −1
P=11
−1∗1 (1) = (1)
uniformly in 1 Let ≡ (1 2)0 be an arbitrary 2 × 1 nonrandom vector such that kk = 1
Recall that (1) ≡ −1P
=10−1∗
1 (1) () For (2) we make the following
decomposition
0 (1) = −1212X=1
10−1∗
1 (1) ()
0 S³eβ − β´
+−1212X=1
10−1∗
1 (1)£ ()
0 Sβ − ()¤
= 1212 (1)0 S−1ΦΦ + 1212 (1)S−1ΦΦ
−−1212 (1)0 S−1ΦΦX
=1
Φ
X=1
³e −
´+ −1212 (1)
0 SR
+−1212X=1
10−1∗
1 (1)£ ()
0 Sβ − ()¤
≡ 1 (1) +2 (1)−3 (1) +4 (1) +5 (1)
where recall ≡ ()0 β+1+ ≡ −1
P=1Φ and ≡ −1
P=1Φ [ ( 1 )
−β0Φ ] Let (1) ≡ [ (1)] and (1) = (1) − (1) By the proof of Lemma
B.6() k (1)k = (12( log)−12) k (1)k =
¡1 + 12
¢and k (1)k =
(1) uniformly in 1. Write 1 (1) = 1212 (1)0 S−1ΦΦ + 1212 (1)
0 S−1ΦΦ≡ 11 (1) +12 (1) say. Noting that
£211 (1)
¤= (1)
0 S−1ΦΦ¡ΦΦ
02
¢−1ΦΦS
0 (1)
≤ max¡¡ΦΦ
02
¢¢[min (ΦΦ)]
−2 max¡SS0
¢ k (1)k2= (1) (1) (1) = ()
we have |11 (1)| =
¡12
¢for each 1 ∈ X1 Let (1) ≡ −1ΦΦS
0 (1). Then we can
write (1)0 S−1ΦΦ as
−1P=1 (1)
0ΦNoting that[ (1)0Φ] = 0 and[ (1)
0Φ]2
= (1)0¡ΦΦ
02
¢ (1) ≤ max (ΦΦ)
°°−1ΦΦ°°2sp sup1∈X1 k (1)k = (1) we can read-
ily divide X into intervals of appropriate length and apply Bernstein inequality to show that (1)
0 S−1ΦΦ =
¡( log)−12
¢ Consequently, sup1∈X1 |11 (1)| = 1212 ((
log)−12) =
¡( log)−12
¢= (1) For 12 (1) we have by Lemma B.5()
sup1∈X1
k12 (1)k ≤ 1212 sup1∈X1
k (1)k kSksp°°−1ΦΦ°°sp kk
= 1212 (12( log)−12) (1) (1) (
12−12)
= (( log)−12) = (1)
It follows that sup1∈X1 |1 (1)| = (1) By Lemma B.5() and Assumptions A2() and
() and A5,
sup1∈X1
|2 (1)| ≤ 1212 sup1∈X1
k (1)k°°−1ΦΦ°°sp kSksp kk
= 1212 (1) (1)(1) (−) = (1)
sup1∈X1
|4 (1)| ≤ 1212 sup1∈X1
k (1)k kSksp kRk
= 1212 (1) (1)
³−12−12
´= (1)
and
sup1∈X1
|5 (1)| ≤ ¡−
¢1212 sup
1∈X1−1
X=1
1
¯0−1∗
1 (1)¯
=
³1212−
´= (1)
For 3 (1) we have 3 (1) =P
=13 (1) where 3 (1) = −1212 (1)0
×S−1ΦΦP
=1 Φ(e−)Using (B.2), 3 (1) = −
P5=1
−1212 (1)0 S−1ΦΦ
P=1
Φ ≡ −P5
=13 (1) say. First, noting that is uniformly bounded, we can show
°°°−1P=1Φ
°°°sp= (1) using arguments similar to those used in the proof of Lemma
B.5() It follows that
sup1∈X1
|31 (1)| ≤ 12 sup1∈X1
k (1)k kSksp°°−1ΦΦ°°sp
°°°°°°−1X
=1
Φ
°°°°°°sp
12 | − e|= 12 (1) (1) (1) (1) (1) (1) = (1)
Now notice that 32 (1) = (1)32 (1) +
(2)32 (1) where
(1)32 (1) =
1X=1
−1212 (1)0 S−1ΦΦ
X=1
Φ1 (1)
0 S11
(2)32 (1) =
1X=1
−1212 (1)0 S−1ΦΦX
=1
Φ1 (1)
0 S11
Let (1) = (1)0 S−1ΦΦ
−1P=1 Φ
1 (1) and (1) = [ (1)] Ar-
guments like those used to study (1) in the proof of Lemma B.6() show that k (1)k = (k (1)k) =
¡1 + 12
¢= (1) under Assumption A5() and || (1)− [ (1)] ||
= k (1)k ((12 log)−12) = ((
12 log)−12) uniformly in 1 We further make
the following decomposition: (1)32 (1) =
P1=1
−1212 (1)0 S−1ΦΦ
P=1 Φ
1 (1)0
S1−1 =P3
=1(1)32 (1) where
(11)32 (1) =
1X=1
1212 (1)0 S1−1
(12)32 (1) =
1X=1
1212 (1)0 S1(−1 −−1 )
(13)32 (1) =
1X=1
1212 (1)0 S1−1
Following the analysis of11 (1) we can show that sup1∈X1¯(11)32 (1)
¯=
¡( log)12
¢.
In addition,
sup1∈X1
¯(12)32 (1)
¯≤ 1212 sup
1∈X1
1X=1
k (1)k kS1ksp°°°−1 −−1
°°°spkk
= 1212 (1) (1)
³1
−12´ (
121 −12) = (1)
and
sup1∈X1
¯(13)32 (1)
¯≤ 1212 sup
1∈X1
1X=1
k (1)k kS1ksp°°°−1°°°
spkksp
= 1212 ((12 log)−12) (1) (1)
³121 −12
´= (1)
It follows that sup1∈X1¯(1)32 (1)
¯= (1) For
(2)32 (1) we have
sup1∈X1
¯(2)32 (1)
¯≤ 1212 sup
1∈X1k (1)k kSksp
°°−1ΦΦ°°sp 1X=1
kksp kS1ksp k1k
= 1212
³(12 log)−12
´ (1) (1) (1) (
121 −12) = (1)
where ≡ −1P
=1 Φ1 (1)
0 we use the fact that kksp = (1) by following
similar arguments to those used in the proof of Lemma B.5() and noticing that is uniformly
bounded. Consequently we have shown that sup1∈X1 |32 (1) | = (1) Analogously,
sup1∈X1
|34 (1)| ≤ 1212 sup1∈X1
k (1)ksp kSksp°°−1ΦΦ°°sp 1X
=1
kksp kS1ksp k2k
= 1212 (1) (1) (1) (1) (1)
³−1
´= (1)
By the same token, we can show that 33 (1) = (1) and 3 (1) = (1) uniformly
in 1 It follows that sup1∈X1 k3 (1)k = (1) for = 1 Analogously, we can show
that (3) : sup1∈X1 k (1)k = (1) for = 1 1
Now we show (4) Observe that 0 (1) = −1212P
=110−1∗
1 (1) [e+1+(e)
−+1+(e)] +−1212
P=11
0−1∗1 (1) [+1+(
e)−+1+()] ≡ 1 (1)+
2 (1) say. In view of the fact that e+1+(e)−+1+(e) = (e)0S+1+(eβ − β)+
[(e)0β −+1+(e)] we have 1 (1) =
P3=11 (1) where
11 (1) = −1212X=1
10−1∗
1 (1) ()
0 S+1+³eβ − β´
12 (1) = −1212X=1
10−1∗
1 (1)h(e)− ()
i0S+1+
³eβ − β´ 13 (1) = −−1212
X=1
10−1∗
1 (1)h+1+(
e)− (e)0β
i.
Analogous to the analysis of 1 (1) we can readily show that sup1∈X1 |11 (1)| = (1)
For 12 (1) by Taylor expansion,
12 (1) = −1212X=1
10−1∗
1 (1) (e − ) ()
0³eβ−β
´+1
2−1212
X=1
10−1∗
1 (1) (e − )2
³‡
´0 ³eβ−β
´≡ 121 (1) +
1
2122 (1) say,
where ‡ lies between
e and By Theorem 3.1 and Lemmas B.6()-(), sup1∈X1 |121 (1)|= 121 (1 + 12
−1 ) ( + 1) = (1) and
sup1∈X1
|122 (1)| ≤ 2 sup1∈X1
(−1212
X=1
10−1∗
1 (1) (e − )2
)°°°eβ−β
°°°= 2
1212 (1−1 +
−21 ) ( + 1) = (1)
In addition, sup1∈X1 k13 (1)k ≤ 1212 (−) sup1∈X1 −1P
=11
°°−1∗1 (1)
°° = (
12 12−) = (1) It follows that sup1∈X1 |1 (1)| = (1)
By Taylor expansion,
2 (1) = −1212X=1
10−1∗
1 (1) ()³e −
´+−1212
X=1
10−1∗
1 (1) +1+(‡)³e −
´2≡ 21 (1) +22 (1)
Arguments like those used to study 3 (1) show that sup1∈X1 |21 (1)| = (1) By
Lemma B.6(), sup1∈X1 |22 (2)| ≤ sup1∈X1−1212P
=11
¯0−1∗
1 (1)¯(e−
)2 = 1212 (
21) = (1) where = sup∈U +1+ () = (1) ¥
References
Ai, C., and Chen, X. (2003), "Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions," Econometrica, 71, 1795-1843.
Bernal, R. (2008), "The Effect of Maternal Employment and Child Care on Children's Cognitive Development," International Economic Review, 49, 1173-1209.
Bernal, R., and Keane, M. P. (2011), "Child Care Choices and Children's Cognitive Achievement: the Case of Single Mothers," Journal of Labor Economics, 29, 459-512.
Blau, F. D., and Grossberg, A. J. (1992), "Maternal Labor Supply and Children's Cognitive Development," Review of Economics and Statistics, 74, 474-481.
Cameron, S. V., and Heckman, J. J. (1998), "Life Cycle Schooling and Dynamic Selection Bias: Models and Evidence for Five Cohorts," NBER Working Paper 6385, National Bureau of Economic Research, Inc.
Chen, X. (2007), "Large Sample Sieve Estimation of Semi-Nonparametric Models," in J. J. Heckman and E. Leamer (eds.), Handbook of Econometrics, 6B (Chapter 76), 5549-5632, New York: Elsevier Science.
Chen, X., and Pouzo, D. (2012), "Estimation of Nonparametric Conditional Moment Models with Possibly Nonsmooth Generalized Residuals," Econometrica, 80, 277-321.
Darolles, S., Fan, Y., Florens, J. P., and Renault, E. (2011), "Nonparametric Instrumental Regression," Econometrica, 79, 1541-1565.
Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, London: Chapman & Hall.
Gao, J., and Phillips, P. C. B. (2013), "Semiparametric Estimation in Triangular System Equations with Nonstationarity," Journal of Econometrics, 176, 59-79.
Hahn, J., and Ridder, G. (2013), "Asymptotic Variance of Semiparametric Estimators with Generated Regressors," Econometrica, 81, 315-340.
Hall, P., and Horowitz, J. L. (2005), "Nonparametric Methods for Inference in the Presence of Instrumental Variables," Annals of Statistics, 33, 2904-2929.
Hansen, B. E. (2008), "Uniform Convergence Rates for Kernel Estimation with Dependent Data," Econometric Theory, 24, 726-748.
Henderson, D. J., and Parmeter, C. F. (2014), Applied Nonparametric Econometrics, New York: Cambridge University Press.
Horowitz, J. L. (2014), "Nonparametric Additive Models," in J. Racine, L. Su, and A. Ullah (eds.), The Oxford Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics, pp. 129-148, Oxford: Oxford University Press.
Horowitz, J. L., and Mammen, E. (2004), "Nonparametric Estimation of an Additive Model with a Link Function," Annals of Statistics, 32, 2412-2443.
James-Burdumy, S. (2005), "The Effect of Maternal Labor Force Participation on Child Development," Journal of Labor Economics, 23, 177-211.
Keane, M. P., and Wolpin, K. I. (2001), "The Effect of Parental Transfers and Borrowing Constraints on Educational Attainment," International Economic Review, 42, 1051-1103.
Keane, M. P., and Wolpin, K. I. (2002), "Estimating Welfare Effects Consistent with Forward-Looking Behavior," Journal of Human Resources, 37, 600-622.
Kim, W., Linton, O. B., and Hengartner, N. W. (1999), "A Computationally Efficient Estimator for Additive Nonparametric Regression with Bootstrap Confidence Intervals," Journal of Computational and Graphical Statistics, 8, 278-297.
Li, Q. (2000), "Efficient Estimation of Additive Partially Linear Models," International Economic Review, 41, 1073-1092.
Li, Q., and Racine, J. (2007), Nonparametric Econometrics: Theory and Practice, Princeton: Princeton University Press.
Mammen, E., Rothe, C., and Schienle, M. (2012), "Nonparametric Regression with Nonparametrically Generated Covariates," Annals of Statistics, 40, 1132-1170.
Martins-Filho, C., and Yang, K. (2007), "Finite Sample Performance of Kernel-Based Regression Methods for Nonparametric Additive Models under Common Bandwidth Selection Criterion," Journal of Nonparametric Statistics, 19, 23-62.
Martins-Filho, C., and Yao, F. (2012), "Kernel-Based Estimation of Semiparametric Regression in Triangular Systems," Economics Letters, 115, 24-27.
Masry, E. (1996), "Multivariate Local Polynomial Regression for Time Series: Uniform Strong Consistency Rates," Journal of Time Series Analysis, 17, 571-599.
Newey, W. K. (1997), "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 79, 147-168.
Newey, W. K., and Powell, J. L. (2003), "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71, 1565-1578.
Newey, W. K., Powell, J. L., and Vella, F. (1999), "Nonparametric Estimation of Triangular Simultaneous Equations Models," Econometrica, 67, 565-603.
Ozabaci, D., and Henderson, D. J. (2012), "Gradients via Oracle Estimation for Additive Nonparametric Regression with Application to Returns to Schooling," Working paper, State University of New York at Binghamton.
Pinkse, J. (2000), "Nonparametric Two-Step Regression Estimation When Regressors and Error Are Dependent," Canadian Journal of Statistics, 28, 289-300.
Powell, M. J. D. (1981), Approximation Theory and Methods, Cambridge: Cambridge University Press.
Roehrig, C. S. (1988), "Conditions for Identification in Nonparametric and Parametric Models," Econometrica, 56, 433-447.
Schumaker, L. L. (2007), Spline Functions: Basic Theory, 3rd ed., Cambridge: Cambridge University Press.
Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, New York: John Wiley & Sons.
Su, L., and Jin, S. (2012), "Sieve Estimation of Panel Data Models with Cross Section Dependence," Journal of Econometrics, 169, 34-47.
Su, L., Murtazashvili, I., and Ullah, A. (2013), "Local Linear GMM Estimation of Functional Coefficient IV Models with Application to the Estimation of Rate of Return to Schooling," Journal of Business & Economic Statistics, 31, 184-207.
Su, L., and Ullah, A. (2008), "Local Polynomial Estimation of Nonparametric Simultaneous Equations Models," Journal of Econometrics, 144, 193-218.
Ullah, A. (1985), "Specification Analysis of Econometric Models," Journal of Quantitative Economics, 1, 187-209.
Vella, F. (1991), "A Simple Two-Step Estimator for Approximating Unknown Functional Forms in Models with Endogenous Explanatory Variables," Working Paper, Department of Economics, Australian National University.
45
Supplemental Material On“Additive Nonparametric Regression in the Presence of Endogenous Regressors”
Deniz Ozabaci,1 Daniel J. Henderson,2 and Liangjun Su3
This appendix provides proofs for some technical lemmas in the above paper.
Proof of Lemma B.1. By straightforward moment calculations, we can show that $E\|\hat{Q}_{PP} - Q_{PP}\|^{2} = O(L_{1}^{2}/n)$ under Assumptions A1(i)-(ii) and A2(i). Then (i) follows from the Markov inequality. By the Weyl inequality [e.g., Bernstein (2005, Theorem 8.4.11)] and the fact that $\lambda_{\max}(A) \le \|A\|$ for any symmetric matrix $A$ (as $|\lambda_{\max}(A)|^{2} \le \lambda_{\max}(A^{2}) \le \|A\|^{2}$), we have
$$\lambda_{\min}(\hat{Q}_{PP}) \le \lambda_{\min}(Q_{PP}) + \lambda_{\max}(\hat{Q}_{PP} - Q_{PP}) \le \lambda_{\min}(Q_{PP}) + \|\hat{Q}_{PP} - Q_{PP}\| = \lambda_{\min}(Q_{PP}) + o_{p}(1).$$
Similarly,
$$\lambda_{\min}(\hat{Q}_{PP}) \ge \lambda_{\min}(Q_{PP}) + \lambda_{\min}(\hat{Q}_{PP} - Q_{PP}) \ge \lambda_{\min}(Q_{PP}) - \|\hat{Q}_{PP} - Q_{PP}\| = \lambda_{\min}(Q_{PP}) - o_{p}(1).$$
Analogously, we can prove the second part of (ii). Thus (ii) follows. By the submultiplicative property of the spectral norm, (i)-(ii) and Assumption A2(i),
$$\|\hat{Q}_{PP}^{-1} - Q_{PP}^{-1}\|_{\mathrm{sp}} = \|\hat{Q}_{PP}^{-1}(Q_{PP} - \hat{Q}_{PP})Q_{PP}^{-1}\|_{\mathrm{sp}} \le \|\hat{Q}_{PP}^{-1}\|_{\mathrm{sp}}\,\|\hat{Q}_{PP} - Q_{PP}\|_{\mathrm{sp}}\,\|Q_{PP}^{-1}\|_{\mathrm{sp}} = O_{p}(1)\,O_{p}(L_{1}n^{-1/2})\,O(1) = O_{p}(L_{1}n^{-1/2}),$$
where we use the fact that $\|\hat{Q}_{PP}^{-1}\|_{\mathrm{sp}} = [\lambda_{\min}(\hat{Q}_{PP})]^{-1} = [\lambda_{\min}(Q_{PP}) + o_{p}(1)]^{-1} = O_{p}(1)$ by (ii) and Assumption A2(i). Then (iii) follows. The proof of (iv)-(v) is analogous to that of (i)-(ii) and thus omitted. ¥
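The Weyl-inequality step in this proof can be illustrated numerically; the following sketch (ours, not part of the paper) checks the eigenvalue perturbation bounds used above on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    # A "population" symmetric matrix A and a symmetric perturbation E.
    M = rng.standard_normal((6, 6))
    A = M @ M.T
    N = 0.1 * rng.standard_normal((6, 6))
    E = (N + N.T) / 2
    eA, eE, eAE = (np.linalg.eigvalsh(S) for S in (A, E, A + E))  # ascending order
    # Weyl: lambda_min(A) + lambda_min(E) <= lambda_min(A + E)
    #       <= lambda_min(A) + lambda_max(E).
    assert eA[0] + eE[0] <= eAE[0] + 1e-10
    assert eAE[0] <= eA[0] + eE[-1] + 1e-10
    # Hence |lambda_min(A + E) - lambda_min(A)| <= ||E||, the bound used above.
    assert abs(eAE[0] - eA[0]) <= np.linalg.norm(E) + 1e-10
```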
Proof of Lemma B.2. () By Assumption A1() and A2() kk2 = −2trP=1(
0
2)
≤ −1 (1 + 1) max () = (1). Then kk2 = (1) by Markov inequality.
() By the facts that kk2sp = kk2 for any vector |0| ≤ kk kk for any two conformablevectors and and that κ0κ ≤ max () kκk2 for any p.s.d. matrix and conformable vector
1. Deniz Ozabaci, Dept of Economics, State Univ of New York, Binghamton, NY 13902-6000, (607) 777-2572, Fax: (607) 777-2681, e-mail: [email protected].
2. Daniel J. Henderson, Dept of Economics, Finance and Legal Studies, Univ of Alabama, Tuscaloosa, AL 35487-0224, (205) 348-8991, Fax: (205) 348-0186, e-mail: [email protected].
3. Liangjun Su, School of Economics, Singapore Management University, 90 Stamford Road, Singapore, 178903; Tel: (65) 6828-0386; e-mail: [email protected].
κ Cauchy-Schwarz inequality, Lemma B.1() and Assumptions A2(), we have
kk2 = kk2sp = max¡
0
¢= max
kκk=1−2
X=1
X=1
κ0 0κ£ (Z)− 0α
¤ £ (Z)− 0α
¤≤ max
kκk=1
(−1
X=1
nκ0 0κ
£ (Z)− 0α
¤2o12)2
≤ (−21 ) max
kκk=1
(−1
X=1
κ0 0κ
)≤ (
−21 )max ( ) = (
−21 )
() Noting that = (Z) + = 0α + + [ (Z)− 0α] by Lemma B.1(),
w.p.a.1 we have
eα−α =
ÃX=1
0
!− X=1
−α
= −1−1
X=1
+−1−1
X=1
£ (Z)− 0α
¤= −1 +−1 ≡ 1 + 2 say. (B.1)
Note that 1 = −11 + 1 where 1 =³−1 −−1
´ satisfies that
k1k ≤ =ntrh³−1 −−1
´
0
³−1 −−1
´io12≤ kksp
°°°−1 −−1°°° = (
121 12) (
121 12) = (1)
by Lemmas B.1() and B.2(). For 2 we have 2 = −11 +2 where 2 =¡−11 −−11
¢
satisfies that
k2k ≤ kksp°°°−1 −−1
°°° = (−1 ) (
121 12) = (
−+121 12)
by Lemmas B.1() and B.2(). The result follows. ¥
Proof of Lemma B.3. ()We only prove the = 1 case as the proof of the other case is almost
identical. By the definition of e and (B.1), we can decompose e − = [ − e (Z)]−
as follows
e − = ( − e) + 1X=1
[ (1)− e (1)] +
2X=1
[1+ (2)− e1+ (2)]
= − (e − )−1X=1
1 (1)0 S11 −
2X=1
1 (2)0 S11+1
−1X=1
1 (1)0 S12 −
2X=1
1 (2)0 S11+2
≡ −1 − 2 − 3 − 4 − 5 say. (B.2)
Then by Cauchy-Schwarz inequality, −1P
=1(e−)
22 ≤ 5P5
=1 −1P
=1 2
2 ≡ 5
P5=1
say. Apparently, 1 =
¡−1
¢as e − =
¡−12
¢
2 = −1X=1
Ã1X=1
1 (1)0 S11
!22
≤ 1
1X=1
−1X=1
¡1 (1)
0 S11¢22 = 1
1X=1
tr¡S1101S011
¢≤ 1
1X=1
max (1) tr¡1
01S01S1
¢ ≤ 1
1X=1
max (1) kS1k2sp k1k2
where 1 = −1P
=1 1 (1)
1 (1)0 2 such that max(1) = (1) by As-
sumption A3() and arguments analogous to those used in the proof of Lemma B.1(). In
addition, kS1k2sp = max (S1S01) = 1 and k1k2 ≤°°°−1°°°2
spkk2 = (1) (1) =
(1) by Lemma B.1() and B.2() and Assumption A2(). It follows that 2 = (1)×1× (1) = (1) Similarly, using the fact that k2k2 ≤
°°°−1°°°2spkk2 = (1)
(−21 ) we have
4 = −1X=1
Ã1X=1
1 (1)0 S12
!22 ≤ 1
1X=1
max (1) kS1k2sp tr¡2
02
¢= (1)× 1× (
−21 ) = (
−21 )
By the same token, 3 = (1−1) and 5 = (
−21 )
() The result follows from () and the fact that max1≤≤ kΦk = (0) under Assump-
tion A2()
() By Assumption A2() Taylor expansion and ()
−1X=1
°°° ³e
´− ()
°°°2 = −1X=1
°°° ³ †´³e −
´°°°2≤
¡21¢−1
X=1
³e −
´2=
¡21
21
¢
where † lies between
e and
() By Assumption A2() Taylor expansion and triangle inequality,°°°°°−1X=1
h³e
´− ()
iΦ0
°°°°°sp
is bounded by°°°−1P
=1 ()Φ
0
³e −
´°°°sp+ 12
°°°°−1P=1
³‡
´Φ0³e −
´2°°°°sp
≡
1 + 2 where ‡ lies between
e and By triangle and Cauchy-Schwarz inequalities
and ()
1 ≤ −1X=1
k ()ksp°°°Φ0 ³e −
´°°°sp
≤(−1
X=1
k ()k2)12(
−1X=1
kΦk2¯ e −
¯2)12=
³12
´ (01) =
³1201
´
By triangle inequality and () 2 ≤ (02)−1P
=1(e − )
2 = (0221). Then
() follows.
() Let Γ ≡ [(e1) − (1) [(e) − ()]]
0 and e = (1 )0 Then we
can write −1P
=1[(e)− ()] as
−1Γ0e Let D ≡ (XZU)=1 By the law ofiterated expectations, Taylor expansion, Assumptions A1() A3() and A2() and ()
n°°−1Γ0e°°2 |D
o= −2
£tr¡Γ0ee
0Γ¢¤= −2
£tr¡Γ0
¡ee0|D
¢Γ¢¤
= −2X=1
[(e)− ()]22
≤ (1)−2
X=1
³e −
´22 =
¡−121
21
¢
It follows that°°−1Γ0e°° =
¡−1211
¢by the conditional Chebyshev inequality. ¥
Proof of Lemma B.4. ()Noting that −1P
=1
°°°eΦ −Φ°°°2 =P=1
−1P=1
°°° ³e
´− ()
°°°2 the result follows from Lemma B.3().
() Noting that°°°−1P
=1
³eΦ −Φ´Φ0°°°2 = P=1
°°°−1P=1
h³e
´− ()
iΦ0°°°2
the result follows from Lemma B.3().
() Noting that eΦΦ −ΦΦ = −1P
=1(eΦeΦ0 − ΦΦ0) = −1
P=1(
eΦ − Φ)(eΦ − Φ)0+−1
P=1(
eΦ −Φ)Φ0+ −1P
=1Φ(eΦ−Φ)0 the result follows from ()-() and the triangle
inequality.
(iv) By the triangle inequality,
\[
\big\|\tilde Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
\le \big\|\tilde Q_{\Phi\Phi}^{-1}-\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
+ \big\|\hat Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}.
\]
Arguments like those used in the proof of Lemma B.1(ii) show that $\big\|\tilde Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}} = \big[\lambda_{\min}\big(\tilde Q_{\Phi\Phi}\big)\big]^{-1} = \big[\lambda_{\min}\big(Q_{\Phi\Phi}\big) + o_P(1)\big]^{-1} = O_P(1)$, where the second equality follows from (iii) and Lemma B.1. By the submultiplicative property of the spectral norm and (iii),
\[
\big\|\tilde Q_{\Phi\Phi}^{-1}-\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
= \big\|\tilde Q_{\Phi\Phi}^{-1}\big(\hat Q_{\Phi\Phi}-\tilde Q_{\Phi\Phi}\big)\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
\le \big\|\tilde Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}\big\|\tilde Q_{\Phi\Phi}-\hat Q_{\Phi\Phi}\big\|_{\mathrm{sp}}\big\|\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
= O_P\big(\kappa^{1/2}\xi_0\eta_{1n}+\xi_0\xi_2\eta_{1n}^2\big).
\]
Similarly, $\big\|\hat Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}} = O_P\big((\kappa/n)^{1/2}\big)$ by Lemma B.1. It follows that
\[
\big\|\tilde Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
= O_P\big(\kappa^{1/2}\xi_0\eta_{1n}+\xi_0\xi_2\eta_{1n}^2\big).
\]
(v) Noting that $\big\|n^{-1}\sum_{i=1}^{n}\big(\tilde\Phi_i-\Phi_i\big)e_i\big\|^2 = \sum_{l}\big\|n^{-1}\sum_{i=1}^{n}\big[p(\tilde U_{li})-p(U_{li})\big]e_i\big\|^2$, the result follows from Lemma B.3(v).
(vi) Let $v_i$ denote the sieve approximation error, $v_i \equiv m(X_i,U_i)-\Phi_i'\beta$. By the triangle inequality, Assumption A2, the Jensen inequality, and (i), we have
\[
\Big\|n^{-1}\sum_{i=1}^{n}\big(\tilde\Phi_i-\Phi_i\big)v_i\Big\|
\le O\big(\kappa^{-\gamma}\big)\,n^{-1}\sum_{i=1}^{n}\big\|\tilde\Phi_i-\Phi_i\big\|
= O\big(\kappa^{-\gamma}\big)\,O_P\big(\xi_1\eta_{1n}\big) = O_P\big(\kappa^{-\gamma}\xi_1\eta_{1n}\big). \quad ∎
\]
Proof of Lemma B.5. The proofs of (i)-(ii) are analogous to those of Lemma B.2(i)-(ii), respectively. Noting that $\big\|Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}} = O(1)$ by Assumption A2, we can prove (iii) by showing that $\|B_n\| = o_P(1)$, where $B_n \equiv n^{-1}\sum_{i=1}^{n}\Phi_i w_i\big(\tilde U_i-U_i\big)$ and $w_i \equiv \dot p(U_i)'\beta_{d+1}$. By the triangle inequality, Assumptions A1-A2, and (i),
\[
c_n \equiv \max_{1\le i\le n}|w_i|
\le \sup_{u\in\mathcal U}\big|\dot g_{d+1}(u)-\dot p(u)'\beta_{d+1}\big| + \sup_{u\in\mathcal U}\big|\dot g_{d+1}(u)\big|
= O\big(\kappa^{-\gamma}\big) + O(1) = O(1).
\]
By (B.2),
\[
B_n = n^{-1}\sum_{i=1}^{n}\Phi_i w_i\big(\tilde U_i-U_i\big) = \sum_{s=1}^{5} n^{-1}\sum_{i=1}^{n}\Phi_i w_i a_{si} \equiv \sum_{s=1}^{5} B_{ns},\ \text{say},
\]
where $a_{1i},\ldots,a_{5i}$ denote the five terms of the expansion (B.2) of $\tilde U_i-U_i$.
Let $M_n \equiv n^{-1}\sum_{i=1}^{n} w_i\Phi_i p_1(Z_{1i})'$ and $M \equiv E(M_n)$. Then $\|M_n-M\| = O_P\big((\kappa\kappa_1/n)^{1/2}\big)$ by the Chebyshev inequality, and
\[
\|M\|_{\mathrm{sp}}^2 = \big\|E\big[w_i\Phi_i p_1(Z_{1i})'\big]\big\|_{\mathrm{sp}}^2 \le c_n^2\,\lambda_{\max}(C) = O(1),
\]
where $C \equiv E\big[\Phi_i p_1(Z_{1i})'\big]E\big[p_1(Z_{1i})\Phi_i'\big]$ and we use the fact that $C$ has a bounded largest eigenvalue. To see the last point, first note that for $\kappa_1 \le \kappa$, $E\big[\Phi_i p_1(Z_{1i})'\big]$ is a submatrix of $Q \equiv E\big(\Phi_i\Phi_i'\big)$, which has a bounded largest eigenvalue. Partition $Q$ as follows:
\[
Q=\begin{bmatrix} Q_{11} & Q_{12} & Q_{13}\\ Q_{21} & Q_{22} & Q_{23}\\ Q_{31} & Q_{32} & Q_{33}\end{bmatrix},
\]
where $Q_{st} = Q_{ts}'$ for $s,t = 1,2,3$, and $E\big[\Phi_i p_1(Z_{1i})'\big] = \big[Q_{12}'\ \ Q_{22}\ \ Q_{32}'\big]'$. Then
\[
C=\begin{bmatrix}
Q_{12}Q_{12}' & Q_{12}Q_{22} & Q_{12}Q_{32}'\\
Q_{22}Q_{12}' & Q_{22}Q_{22} & Q_{22}Q_{32}'\\
Q_{32}Q_{12}' & Q_{32}Q_{22} & Q_{32}Q_{32}'
\end{bmatrix}.
\]
By Thompson and Freede (1970, Theorem 2), $\lambda_{\max}(C) \le \lambda_{\max}\big(Q_{12}Q_{12}'\big) + \lambda_{\max}\big(Q_{22}Q_{22}'\big) + \lambda_{\max}\big(Q_{32}Q_{32}'\big)$. By Fact 8.9.3 in Bernstein (2005), the positive definiteness of $Q$ ensures that both $Q_{12}Q_{12}'$ and $Q_{32}Q_{32}'$ have finite maximum eigenvalues, as both
\[
\begin{bmatrix} Q_{11} & Q_{12}\\ Q_{21} & Q_{22}\end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} Q_{22} & Q_{23}\\ Q_{32} & Q_{33}\end{bmatrix}
\]
are also positive definite. In addition, $\lambda_{\max}\big(Q_{22}Q_{22}'\big) = \big[\lambda_{\max}(Q_{22})\big]^2$ is finite, as $Q$ has a bounded maximum eigenvalue. It follows that $\lambda_{\max}(C) = O(1)$. Consequently, $\|M_n\| = O_P\big(1+(\kappa\kappa_1/n)^{1/2}\big) = O_P(1)$.
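The block eigenvalue bound invoked above, with $\lambda_{\max}$ of the partitioned matrix dominated by the sum of the $\lambda_{\max}$'s of its diagonal blocks, can be checked numerically. In the sketch below, $Q$ is a generic symmetric positive definite matrix standing in for $E(\Phi_i\Phi_i')$, and the block sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3                                  # block size; Q is (3m) x (3m)
X = rng.standard_normal((3 * m, 3 * m))
Q = X @ X.T + np.eye(3 * m)            # symmetric positive definite placeholder
Q12 = Q[0:m, m:2 * m]
Q22 = Q[m:2 * m, m:2 * m]
Q32 = Q[2 * m:3 * m, m:2 * m]
M = np.vstack([Q12, Q22, Q32])         # middle block column of Q
C = M @ M.T                            # the 3x3 block matrix from the proof

lam_max = lambda A: np.linalg.eigvalsh(A).max()
bound = lam_max(Q12 @ Q12.T) + lam_max(Q22 @ Q22.T) + lam_max(Q32 @ Q32.T)
assert lam_max(C) <= bound + 1e-8      # sum-of-diagonal-blocks eigenvalue bound
# Q22 is a symmetric PD diagonal block, so lam_max(Q22 Q22') = lam_max(Q22)^2:
assert np.isclose(lam_max(Q22 @ Q22.T), lam_max(Q22) ** 2)
```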
Analogously, noting that 1 is the first element of $\Phi_i$, we can show that $\big\|n^{-1}\sum_{i=1}^{n}w_i\Phi_i\big\|_{\mathrm{sp}} = O_P\big(1+(\kappa/n)^{1/2}\big) = O_P(1)$. It follows that
\[
\|B_{n1}\| = \Big\|n^{-1}\sum_{i=1}^{n}w_i\Phi_i\Big\|_{\mathrm{sp}}\big|\tilde\alpha-\alpha\big| = O_P(1)\,O_P\big(n^{-1/2}\big) = O_P\big(n^{-1/2}\big),
\]
\[
\|B_{n2}+B_{n4}\| \le n^{-1}\sum_{i=1}^{n}\|w_i\Phi_i\|\,\|S_1\|_{\mathrm{sp}}\big(\|\gamma_1\|+\|\gamma_2\|\big) = O(1)\,O_P(1)\,o_P(1) = o_P(1),
\]
and $\|B_{n3}+B_{n5}\| = o_P(1)$ by the same token. Thus we have shown that $\|B_n\| = o_P(1)$. ∎
References
Bernstein, D. S. (2005), Matrix Mathematics: Theory, Facts and Formulas with Application to Linear Systems Theory, Princeton: Princeton University Press.
Thompson, R. C., and Freede, L. J. (1970), “Eigenvalues of Partitioned Hermitian Matrices,” Bulletin of the Australian Mathematical Society, 3, 23-37.