DISCUSSION PAPER SERIES
Forschungsinstitut zur Zukunft der Arbeit
Institute for the Study of Labor

Additive Nonparametric Regression in the Presence of Endogenous Regressors
IZA DP No. 8144
April 2014
Deniz Ozabaci
Daniel J. Henderson
Liangjun Su
Additive Nonparametric Regression in the
Presence of Endogenous Regressors
Deniz Ozabaci
State University of New York, Binghamton
Daniel J. Henderson
University of Alabama and IZA
Liangjun Su
Singapore Management University
Discussion Paper No. 8144 April 2014
IZA
P.O. Box 7240 53072 Bonn
Germany
Phone: +49-228-3894-0 Fax: +49-228-3894-180
E-mail: [email protected]
Any opinions expressed here are those of the author(s) and not those of IZA. Research published in this series may include views on policy, but the institute itself takes no institutional policy positions. The IZA research network is committed to the IZA Guiding Principles of Research Integrity.

The Institute for the Study of Labor (IZA) in Bonn is a local and virtual international research center and a place of communication between science, politics and business. IZA is an independent nonprofit organization supported by Deutsche Post Foundation. The center is associated with the University of Bonn and offers a stimulating research environment through its international network, workshops and conferences, data service, project support, research visits and doctoral program. IZA engages in (i) original and internationally competitive research in all fields of labor economics, (ii) development of policy concepts, and (iii) dissemination of research results and concepts to the interested public.

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character. A revised version may be available directly from the author.
ABSTRACT
Additive Nonparametric Regression in the Presence of Endogenous Regressors
In this paper we consider nonparametric estimation of a structural equation model under a full additivity constraint. We propose estimators for both the conditional mean and gradient which are consistent, asymptotically normal, oracle efficient and free from the curse of dimensionality. Monte Carlo simulations support the asymptotic developments. We employ a partially linear extension of our model to study the relationship between child care and cognitive outcomes. Some of our (average) results are consistent with the literature (e.g., negative returns to child care when mothers have higher levels of education). However, as our estimators allow for heterogeneity both across and within groups, we are able to contradict many findings in the literature (e.g., we do not find any significant differences in returns between boys and girls or for formal versus informal child care).

JEL Classification: C14, C36, I21, J13

Keywords: additive regression, endogeneity, generated regressors, oracle estimation, structural equation, child care, nonparametric regression, test scores

Corresponding author:
Daniel J. Henderson
Department of Economics, Finance and Legal Studies
University of Alabama
Tuscaloosa, AL 35487-0224
USA
E-mail: [email protected]
1 Introduction
Nonparametric and semiparametric estimation of structural equation models is becoming in-
creasingly popular in the literature (e.g., Ai and Chen, 2003; Chen and Pouzo, 2012; Darolles et
al., 2011; Gao and Phillips, 2013; Hall and Horowitz, 2005; Martins-Filho and Yao, 2012; Newey
and Powell, 2003; Newey et al., 1999; Pinkse, 2000; Roehrig, 1988; Su and Ullah, 2008; Su et
al., 2013; Vella, 1991). In this paper, we are interested in improving efficiency by imposing a full
additivity constraint on each equation. Our starting point is the triangular system in Newey et
al. (1999). While the assumptions of their model are relatively restrictive as compared to other
examples in the literature, their estimator is typically easier to implement, which is useful for
applied work. While many existing estimators allow for full flexibility, they also suffer from the
curse of dimensionality.
To combat the curse, we impose an additivity constraint on each stage and propose a three
step estimation procedure for our additively separable nonparametric structural equation model.
We employ series/sieve estimators for our first two-stages. The first-stage involves separate
(additive) regressions of each endogenous regressor on each of the exogenous regressors in order
to obtain consistent estimates of the residuals. These residuals are used in our second-stage
regression where we perform a single (additive) regression of our response variable on each of
the endogenous regressors (not their predictions), the “included” exogenous regressors and each
of the residuals from the first-stage regressions. Our final-step (one stage backfitting) involves
(univariate) local-linear kernel regressions to estimate the conditional mean and gradient of each
of our additive components. This process allows our final-stage estimators to be free from the
curse of dimensionality. Further, our estimators have the oracle property. In other words, each
additive component can be estimated with the same asymptotic accuracy as if all the other
components in the regression model were known up to a location parameter (e.g., see Henderson
and Parmeter, 2014, Horowitz, 2014, or Li and Racine, 2007).
We prove that our conditional mean and gradient estimates are consistent and asymptoti-
cally normal. We provide the uniform convergence rate for the additive components and their
gradients. Our theoretical findings show that our final-stage estimator has asymptotic bias and
variance equivalent to those of a one-dimensional nonparametric local-linear regression estimator. We further propose a partially linear extension of our model. We argue that the parametric components can be estimated at the parametric root-$n$ rate and conclude that our estimates of
the additive components and associated gradients remain unaffected in the asymptotic sense.
Finite sample results for each of our proposed estimators are analyzed via a set of Monte Carlo
simulations and support the asymptotic developments.
To showcase our estimators with empirical data, we consider an application relating
child care use to cognitive outcomes for children (controlling for likely endogeneity). Specifically,
we use the data in Bernal and Keane (2011) to examine the relationship between child test
scores (our cognitive outcome) from single mothers and cumulative child care (both formal and
informal). The extensive set of instrumental variables in the data set allows us to have a stronger
set of instruments than what is typically used in the literature and our more flexible (partially
linear) estimator leads to more insights as we can exploit the heterogeneity present both between
and within groups (e.g., male versus female children).
Our empirical results show both similarities and differences from the existing literature.
When we look at the average values of our estimates, we find similar results to those in Bernal
and Keane (2011). On average we find mostly positive returns (to test scores) from marginal
changes in income, mother’s education and AFQT score. However, the mean is but one point
estimate. When we check the distribution of the estimated returns, we see that the main reason
behind lower returns to child care use is the amount of cumulative child care rather than the
type. Specifically, we show that as the amount of child care use increases, additional units of child
care lead to even lower returns. Bernal and Keane (2011) argue that those who use informal
child care (versus formal) and girls (versus boys) receive lower returns. Our distributions of
returns show no significant differences between these groups. We also find evidence of both
positive and negative returns to child care use. When we analyze the characteristics of the
children in each group (positive versus negative returns to child care), we find that children with
negative returns are those whose mothers have higher levels of education, experience and AFQT scores. Conversely, those children with positive returns typically have mothers with lower levels of education, experience and AFQT scores.
The paper is organized as follows. Section 2 describes our methodology whereas the third
section presents the asymptotic results. Section 4 considers an extension to a partially linear
model and Section 5 examines the finite sample performance of our estimators via Monte Carlo
simulations. The sixth section gives the empirical application and the final section concludes.
All the proofs of the main theorems are relegated to the appendix. Additional proofs for the
technical lemmas are provided in the online supplemental material.
Notation. For a real matrix $A$, we denote its transpose as $A'$, its Frobenius norm as $\|A\| \ (\equiv [\mathrm{tr}(AA')]^{1/2})$ and its spectral norm as $\|A\|_{sp} \ (\equiv \sqrt{\lambda_{\max}(AA')})$, where $\mathrm{tr}(\cdot)$ is the trace operator, $\equiv$ means "is defined as" and $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a real symmetric matrix (similarly, $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a real symmetric matrix). Note that the two norms are equal when $A$ is a vector. For any function $f(\cdot)$ defined on the real line, we use $\dot f(\cdot)$ and $\ddot f(\cdot)$ to denote its first and second derivatives, respectively. We use $\xrightarrow{d}$ and $\xrightarrow{p}$ to denote convergence in distribution and probability, respectively.
2 Methodology
In this section, we introduce our model and then propose a three-step estimation procedure that
is a combination of both series and kernel methods.
2.1 Model
We start with the basic set-up of Newey et al. (1999). They consider a triangular system of the
following form:

$Y = m(X, Z_1) + \varepsilon, \quad X = g(Z_1, Z_2) + U, \quad E(U \mid Z_1, Z_2) = 0, \quad E(\varepsilon \mid Z_1, Z_2, U) = E(\varepsilon \mid U), \quad (2.1)$

where $X = (X_1, \dots, X_d)'$ is a $d \times 1$ vector of endogenous regressors, $Z_1 = (Z_{11}, \dots, Z_{1d_1})'$ is a $d_1 \times 1$ vector of "included" exogenous regressors, $Z_2 \equiv (Z_{21}, \dots, Z_{2d_2})'$ is a $d_2 \times 1$ vector of "excluded" exogenous regressors, $m(\cdot, \cdot)$ denotes the true unknown structural function of interest, $g \equiv (g_1, \dots, g_d)'$ is a $d \times 1$ vector of smooth functions of the instruments $Z_1$ and $Z_2$, and $\varepsilon$ and $U \equiv (U_1, \dots, U_d)'$ are error terms. Newey et al. (1999) are interested in estimating $m(\cdot, \cdot)$ consistently.

Newey et al. (1999) show that $m(\cdot, \cdot)$ can be identified up to an additive constant under the key identification conditions that $E(U \mid Z_1, Z_2) = 0$ and $E(\varepsilon \mid Z_1, Z_2, U) = E(\varepsilon \mid U)$. If these conditions hold, then

$E(Y \mid X, Z_1, Z_2, U) = m(X, Z_1) + E(\varepsilon \mid X, Z_1, Z_2, U) = m(X, Z_1) + E(\varepsilon \mid Z_1, Z_2, U) = m(X, Z_1) + E(\varepsilon \mid U). \quad (2.2)$
If $U$ were observed, this would be a standard additive nonparametric regression model. However, in practice, $U$ is not observed and needs to be replaced by a consistent estimate. This motivates Su and Ullah (2008) to consider a three-stage procedure to obtain consistent estimates of $m(\cdot, \cdot)$ via local-polynomial regressions. In the first-stage, they regress $X$ on $(Z_1, Z_2)$ via local-polynomial regression and obtain the residuals $\hat U$ from this first-stage reduced-form regression. In the second-stage, they estimate $E(Y \mid X, Z_1, U)$ via another local-polynomial regression by regressing $Y$ on $X$, $Z_1$ and $\hat U$. In the third-stage, they obtain the estimates of $m(x, z_1)$ via the method of marginal integration. Unlike previous works in the literature, including Newey et al. (1999), Pinkse (2000) and Newey and Powell (2003), which are based upon two-stage series approximations and only establish mean square and uniform convergence, they establish the asymptotic distribution for their three-step local-polynomial estimator.
There are two drawbacks associated with the estimator of Su and Ullah (2008). First, it is subject to the notorious "curse of dimensionality". Without any extra restriction, the convergence rates of their second and third-stage estimators depend on $2d + d_1$ and $d + d_1$, respectively, which can be quite slow if either $d$ or $d_1$ is not small. As a result, their estimates may perform badly even for moderately large sample sizes when $d + d_1 \geq 3$. Second, their estimator does not have the oracle property which an optimal estimator of the additive components in a nonparametric regression model should exhibit. In this paper we try to address both issues.
To alleviate the curse of dimensionality problem, we propose to impose some amount of structure on $m(X, Z_1)$, $E(\varepsilon \mid U)$ and $g_l(Z_1, Z_2)$, where $l = 1, \dots, d$. Specifically, we assume that $E(\varepsilon) = 0$ and the above nonparametric objects have additive forms:

$m(X, Z_1) = c_m + m_1(X_1) + \cdots + m_d(X_d) + m_{d+1}(Z_{11}) + \cdots + m_{d+d_1}(Z_{1d_1}),$
$E(\varepsilon \mid U) = c_\varepsilon + m_{d+d_1+1}(U_1) + \cdots + m_{2d+d_1}(U_d),$ and
$g_l(Z_1, Z_2) = a_l + g_{l1}(Z_{11}) + \cdots + g_{ld_1}(Z_{1d_1}) + g_{l,d_1+1}(Z_{21}) + \cdots + g_{l,d_z}(Z_{2d_2}),$

where $l = 1, \dots, d$ and $d_z = d_1 + d_2$. Consequently, we have

$E(Y \mid X, Z_1, Z_2, U) = c + m_1(X_1) + \cdots + m_d(X_d) + m_{d+1}(Z_{11}) + \cdots + m_{d+d_1}(Z_{1d_1}) + m_{d+d_1+1}(U_1) + \cdots + m_{2d+d_1}(U_d) \equiv m^*(X, Z_1, U), \quad (2.3)$

where $c = c_m + c_\varepsilon$. Note that the $m_j(\cdot)$'s are not fully identified without further restriction. Depending on the method that is used to estimate the additive components, different identification conditions can be imposed. For example, for the method of marginal integration, a convenient set of identification conditions would be that each additive component (other than $c$) in (2.3) has expectation zero.
Horowitz (2014) reviews methods for estimating nonparametric additive models, including
the backfitting method, the marginal integration method, the series method and the mixture of
a series method and a backfitting method to obtain oracle efficiency. It is well known that it is
more difficult to study the asymptotic property of the backfitting estimator than the marginal
integration estimator, but the latter has a curse of dimensionality problem if additivity is not
imposed at the outset of estimation as in conventional kernel methods. Other problems that
are associated with the marginal integration estimator include its lack of oracle property and
its heavy computational burden. Kim et al. (1999) try to address the latter two problems by
proposing a fast instrumental variable (IV) pilot estimator. However, they cannot avoid the
curse of dimensionality problem. In fact, their IV pilot estimator depends on the estimation of
the density function of the regressors at all data points. In addition, their paper ignores the
notorious boundary bias problem for kernel density estimates and because their IV pilot estimate
is not uniformly consistent on the full support, they have to use a trimming scheme to obtain
the second-stage oracle estimator. To overcome the curse of dimensionality problem, Horowitz
and Mammen (2004) propose a two-step estimation procedure with series estimation of the
nonparametric additive components followed by a backfitting step that turns the series estimates
into kernel estimates that are both oracle efficient and free of the curse of dimensionality.
Below we follow the lead of Horowitz and Mammen (2004) and propose a three-stage estima-
tion procedure that is computationally efficient, oracle efficient and fully overcomes the curse of
dimensionality. We shall adopt the following identification restrictions: $m_j(0) = \dot m_j(v)\big|_{v=0} = 0$ for $j = 1, \dots, 2d + d_1$, and $g_{lj}(0) = 0$ for $l = 1, \dots, d$ and $j = 1, \dots, d_z$. Similar identification conditions are also adopted in Li (2000). The difference between our model and theirs is that
their model does not allow for endogenous regressors. This complicates our problem relative to
theirs as the endogeneity requires us to replace the unobserved errors in the second-stage with
residuals. Hence, we need to take care of the additional bias factor from the first-step. Further,
we also analyze the gradients of the additive nonparametric components.
2.2 Estimation
Given a random sample of $n$ observations $\{Y_i, X_i, Z_{1i}, Z_{2i}\}_{i=1}^{n}$, where $X_i = (X_{1i}, \dots, X_{di})'$, $Z_{1i} = (Z_{11,i}, \dots, Z_{1d_1,i})'$ and $Z_{2i} = (Z_{21,i}, \dots, Z_{2d_2,i})'$, we propose the following three-stage estimation procedure:

1. For $l = 1, \dots, d$, let $\tilde a_l$, $\tilde g_{lj}(z_{1j})$, $j = 1, \dots, d_1$, and $\tilde g_{l,d_1+j}(z_{2j})$, $j = 1, \dots, d_2$, denote the series estimates of $a_l$, $g_{lj}(z_{1j})$, $j = 1, \dots, d_1$, and $g_{l,d_1+j}(z_{2j})$, $j = 1, \dots, d_2$, in the nonparametric additive regression
$X_{li} = a_l + g_{l1}(Z_{11,i}) + \cdots + g_{ld_1}(Z_{1d_1,i}) + g_{l,d_1+1}(Z_{21,i}) + \cdots + g_{l,d_z}(Z_{2d_2,i}) + U_{li}.$
Let $\tilde U_{li} \equiv X_{li} - \tilde a_l - \tilde g_{l1}(Z_{11,i}) - \cdots - \tilde g_{ld_1}(Z_{1d_1,i}) - \tilde g_{l,d_1+1}(Z_{21,i}) - \cdots - \tilde g_{l,d_z}(Z_{2d_2,i})$ for $l = 1, \dots, d$ and $i = 1, \dots, n$.

2. Estimate $m_j(x_j)$, $j = 1, \dots, d$, $m_{d+j}(z_{1j})$, $j = 1, \dots, d_1$, and $m_{d+d_1+l}(\tilde u_l)$, $l = 1, \dots, d$, in the following additive regression model
$Y_i = c + m_1(X_{1i}) + \cdots + m_d(X_{di}) + m_{d+1}(Z_{11,i}) + \cdots + m_{d+d_1}(Z_{1d_1,i}) + m_{d+d_1+1}(\tilde U_{1i}) + \cdots + m_{2d+d_1}(\tilde U_{di}) + \varepsilon_i$
by the series method. Denote the estimates as $\tilde c$, $\tilde m_j(x_j)$, $j = 1, \dots, d$, $\tilde m_{d+j}(z_{1j})$, $j = 1, \dots, d_1$, and $\tilde m_{d+d_1+l}(\tilde u_l)$, $l = 1, \dots, d$.

3. Estimate $m_1(x_1)$ and its first-order derivative by the local-linear regression of $\tilde Y_{1i} = Y_i - \tilde c - \tilde m_2(X_{2i}) - \cdots - \tilde m_d(X_{di}) - \tilde m_{d+1}(Z_{11,i}) - \cdots - \tilde m_{d+d_1}(Z_{1d_1,i}) - \tilde m_{d+d_1+1}(\tilde U_{1i}) - \cdots - \tilde m_{2d+d_1}(\tilde U_{di})$ on $X_{1i}$. Estimates of the other additive components in (2.3) and their first-order derivatives are obtained analogously.
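To make the three stages concrete, here is a minimal numerical sketch. It is illustrative only: it simulates DGP 1 from Section 5, uses a low-order polynomial series in place of the paper's cubic B-splines, and the helper names (`poly_basis`, `ols_fit`, `local_linear`) are our own, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# DGP 1 of Section 5: Y = sin(X) + sin(Z1) + eps, X = sin(Z1) + sin(Z2) + U
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
u, eps = rng.normal(0, 1, n), rng.normal(0, 1, n)
x = np.sin(z1) + np.sin(z2) + u
y = np.sin(x) + np.sin(z1) + eps

def poly_basis(v, K=4):
    """Series terms v, v^2, ..., v^K for one additive component."""
    return np.column_stack([v**k for k in range(1, K + 1)])

def ols_fit(B, resp):
    """Least-squares fit with an intercept prepended to the basis B."""
    D = np.column_stack([np.ones(len(resp)), B])
    coef, *_ = np.linalg.lstsq(D, resp, rcond=None)
    return coef

# Stage 1: additive series regression of X on (Z1, Z2); keep the residuals.
B1 = np.column_stack([poly_basis(z1), poly_basis(z2)])
a = ols_fit(B1, x)
u_hat = x - np.column_stack([np.ones(n), B1]) @ a

# Stage 2: additive series regression of Y on (X, Z1, u_hat).
Bx, Bz, Bu = poly_basis(x), poly_basis(z1), poly_basis(u_hat)
b = ols_fit(np.column_stack([Bx, Bz, Bu]), y)

# Stage 3: local-linear regression of the partial residual on X1 yields
# the final estimates of m_1 and its gradient at a point x0.
K = Bx.shape[1]
partial = y - b[0] - np.column_stack([Bz, Bu]) @ b[1 + K:]

def local_linear(x0, xs, ys, h):
    w = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local-linear design
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ ys)
    return beta                                       # [m1_hat(x0), gradient]

h = 1.06 * x.std() * n ** (-0.2)                      # Silverman rule of thumb
m1_hat, grad_hat = local_linear(np.median(x), x, partial, h)
```

Up to the additive constant (pinned down by the identification restriction $m_1(0) = 0$ in the paper), `m1_hat` tracks $\sin(x_0)$ and `grad_hat` tracks $\cos(x_0)$ as $n$ grows.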
In relation to Horowitz and Mammen (2004), the above first-stage is new, as we have to replace the unobservable $U_{li}$'s by their consistent estimates in the second-stage. In addition,
Horowitz and Mammen (2004) are only interested in estimation of the nonparametric additive
components themselves, while we are also interested in estimating the first-order derivatives
(gradients). Alternatively, we could follow Kim et al. (1999) and use the kernel estimator in
the first two-stages. The oracle estimator of Kim et al. (1999) has gained popularity in recent
years. For example, Ozabaci and Henderson (2012) obtain the gradients of their estimator for
the local-constant case and Martins-Filho and Yang (2007) consider the local-linear version of
the oracle estimator, both assuming strictly exogenous regressors. However, as mentioned above,
using the kernel estimators in the first two-stages here has several disadvantages and does not
avoid the curse of dimensionality problem.
For notational simplicity, let $W = (X', Z_1', U')'$ and $w = (x', z_1', u')'$, where, e.g., $u = (u_1, \dots, u_d)'$ denotes a realization of $U$. We shall use $\mathcal{Z} \equiv \mathcal{Z}_1 \times \mathcal{Z}_2$ and $\mathcal{W} \equiv \mathcal{X} \times \mathcal{Z}_1 \times \mathcal{U}$ to denote the supports of $(Z_1, Z_2)$ and $W$, respectively. Let $\{p_j(\cdot), j = 1, 2, \dots\}$ denote a sequence of basis functions. Let $\kappa_1 = \kappa_1(n)$ and $\kappa = \kappa(n)$ be some integers such that $\kappa_1, \kappa \to \infty$ as $n \to \infty$. Let $p^{\kappa_1}(v) \equiv [p_1(v), \dots, p_{\kappa_1}(v)]'$. Define

$P_1(z_1, z_2) \equiv [1, p^{\kappa_1}(z_{11})', \dots, p^{\kappa_1}(z_{1d_1})', p^{\kappa_1}(z_{21})', \dots, p^{\kappa_1}(z_{2d_2})']'$ and
$\Phi(w) \equiv [1, p^{\kappa}(x_1)', \dots, p^{\kappa}(x_d)', p^{\kappa}(z_{11})', \dots, p^{\kappa}(z_{1d_1})', p^{\kappa}(u_1)', \dots, p^{\kappa}(u_d)']'.$

For each $(z_1, z_2) \in \mathcal{Z}$, we approximate $g_l(z_1, z_2)$ and $m^*(w)$ by $P_1(z_1, z_2)'\alpha_l$ and $\Phi(w)'\beta$, respectively, for $l = 1, \dots, d$, where $\alpha_l \equiv (a_l, \alpha_{l1}', \dots, \alpha_{ld_z}')'$ and $\beta = (\beta_0, \beta_1', \dots, \beta_{2d+d_1}')'$ are $(1 + \kappa_1 d_z) \times 1$ and $(1 + \kappa(2d + d_1)) \times 1$ vectors of unknown parameters to be estimated. Here, each $\alpha_{lj}$, $j = 1, \dots, d_z$, is a $\kappa_1 \times 1$ vector and each $\beta_j$, $j = 1, \dots, 2d + d_1$, is a $\kappa \times 1$ vector. Let $S_{1j}$ and $S_j$ denote $\kappa_1 \times (1 + \kappa_1 d_z)$ and $\kappa \times (1 + \kappa(2d + d_1))$ selection matrices, respectively, such that $S_{1j}\alpha_l = \alpha_{lj}$ and $S_j\beta = \beta_j$.
To obtain the first-stage estimators of the $g_{lj}(\cdot)$'s, let $\tilde\alpha_l \equiv (\tilde a_l, \tilde\alpha_{l1}', \dots, \tilde\alpha_{ld_z}')'$ be the solution to $\min_{\alpha_l}\, n^{-1}\sum_{i=1}^{n} [X_{li} - P_1(Z_{1i}, Z_{2i})'\alpha_l]^2$. The series estimator of $g_l(z_1, z_2)$ is given by

$\tilde g_l(z_1, z_2) = P_1(z_1, z_2)'\tilde\alpha_l = P_1(z_1, z_2)' \Big[ n^{-1}\sum_{i=1}^{n} P_1(Z_{1i}, Z_{2i}) P_1(Z_{1i}, Z_{2i})' \Big]^{-} n^{-1}\sum_{i=1}^{n} P_1(Z_{1i}, Z_{2i}) X_{li},$

where $A^{-}$ denotes the Moore-Penrose generalized inverse of $A$. Note that we can write $\tilde g_l(z_1, z_2)$ as $\tilde g_l(z_1, z_2) = \tilde a_l + \sum_{j=1}^{d_1} \tilde g_{lj}(z_{1j}) + \sum_{j=1}^{d_2} \tilde g_{l,d_1+j}(z_{2j})$, where $\tilde g_{lj}(z_{1j}) = p^{\kappa_1}(z_{1j})'\tilde\alpha_{lj}$ is a series estimator of $g_{lj}(z_{1j})$ for $j = 1, \dots, d_1$ and $\tilde g_{l,d_1+j}(z_{2j}) = p^{\kappa_1}(z_{2j})'\tilde\alpha_{l,d_1+j}$ is a series estimator of $g_{l,d_1+j}(z_{2j})$ for $j = 1, \dots, d_2$.
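As a quick numerical check, the closed form with the Moore-Penrose generalized inverse reproduces an ordinary least-squares fit whenever the design has full rank. The sketch below verifies this with NumPy; the names are illustrative and a simple polynomial series stands in for the paper's basis functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
x = np.sin(z1) + np.sin(z2) + rng.normal(0, 1, n)

# P1(z1, z2): intercept plus series terms in each exogenous regressor
P1 = np.column_stack([np.ones(n)]
                     + [z1**k for k in (1, 2, 3)]
                     + [z2**k for k in (1, 2, 3)])

# Closed form: Moore-Penrose inverse of the (scaled) Gram matrix, as displayed
alpha = np.linalg.pinv(P1.T @ P1 / n) @ (P1.T @ x / n)
g_tilde = P1 @ alpha                       # fitted values of g(z1, z2)

# The same fit via a standard least-squares solver
alpha_ls, *_ = np.linalg.lstsq(P1, x, rcond=None)
```

`P1 @ alpha` and `P1 @ alpha_ls` agree to machine precision; the generalized inverse only matters when the Gram matrix is singular or nearly so.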
To obtain the second-stage estimators of the $m_j(\cdot)$'s, let $\tilde\beta \equiv (\tilde c, \tilde\beta_1', \dots, \tilde\beta_{2d+d_1}')'$ be a solution to $\min_\beta\, n^{-1}\sum_{i=1}^{n} [Y_i - \Phi(\widetilde W_i)'\beta]^2$, where $\widetilde W_i = (X_i', Z_{1i}', \widetilde U_i')'$ and $\widetilde U_i = (\tilde U_{1i}, \dots, \tilde U_{di})'$. The series estimator of $m^*(w)$ is given by

$\tilde m^*(w) = \Phi(w)'\tilde\beta = \tilde c + \sum_{j=1}^{d} \tilde m_j(x_j) + \sum_{j=1}^{d_1} \tilde m_{d+j}(z_{1j}) + \sum_{l=1}^{d} \tilde m_{d+d_1+l}(u_l).$
Let $\theta_1(x_1) \equiv [m_1(x_1), \dot m_1(x_1)]'$. We use $\hat\theta_1(x_1) \equiv [\hat m_1(x_1), \hat{\dot m}_1(x_1)]'$ to denote the local-linear estimate of $\theta_1(x_1)$ in the third-stage, obtained by using the kernel function $k(\cdot)$ and bandwidth $h$. Let $\widetilde Y_1 \equiv (\tilde Y_{11}, \dots, \tilde Y_{1n})'$, $X_{1i}^*(x_1) \equiv (1, X_{1i} - x_1)'$, $\mathcal{X}_1(x_1) \equiv [X_{11}^*(x_1), \dots, X_{1n}^*(x_1)]'$ and $K_1 \equiv \mathrm{diag}(k_{11}, \dots, k_{1n})$, where $k_{1i} \equiv k_h(X_{1i} - x_1)$ and $k_h(\cdot) \equiv k(\cdot/h)/h$. Then

$\hat\theta_1(x_1) = [\mathcal{X}_1(x_1)' K_1 \mathcal{X}_1(x_1)]^{-1} \mathcal{X}_1(x_1)' K_1 \widetilde Y_1.$

Below we study the asymptotic properties of $\tilde\beta$ and $\hat\theta_1(x_1)$.

3 Asymptotic properties
In this section we state two theorems that give the main results of the paper. Even though
several results are available in the literature on nonparametric or semiparametric regressions
with nonparametrically generated regressors (see, e.g., Mammen et al., 2012 and Hahn and Rid-
der, 2013 for recent contributions), none of them can be directly applied to our framework. In
particular, Hahn and Ridder (2013) study the asymptotic distribution of three-step estimators
of a finite-dimensional parameter vector where the second-step consists of one or more nonpara-
metric generated regressions on a regressor that is estimated in the first-step. In sharp contrast,
our third-stage estimator is also a nonparametric estimator. Under fairly general conditions,
Mammen et al. (2012) focus on two-stage nonparametric regression where the first-stage can be
kernel or series estimation while the second-stage is local-linear estimation. In principle, we can
treat our second and third-stage estimation as their first and second-stage estimation, respec-
tively and then apply their results to our case. However, their results are built upon high-level
assumptions and are usually not optimal. For this reason, we derive the asymptotic properties
of our three-stage estimators under some primitive conditions specified in the preceding section.
Let $W_i \equiv (X_i', Z_{1i}', U_i')'$, $Z_{2i}$ and $\varepsilon_i$ denote the $i$th random observations of $W$, $Z_2$ and $\varepsilon$, respectively. Let $\varepsilon_i \equiv Y_i - m^*(X_i, Z_{1i}, U_i)$, $\Phi_i = \Phi(W_i)$ and $Q_{\Phi\Phi} \equiv E[\Phi_i \Phi_i']$. The asymptotic properties of the second-stage series estimator $\tilde\beta$ are reported in the following theorem.
Theorem 3.1 Suppose that Assumptions A.1-A.5(i) in Appendix A hold. Then

(i) $\tilde\beta - \beta = Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i \varepsilon_i + Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i [m^*(X_i, Z_{1i}, U_i) - \Phi_i'\beta] - Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i \sum_{l=1}^{d} \dot m_{d+d_1+l}(U_{li})(\tilde U_{li} - U_{li}) + R_n$;

(ii) $\|\tilde\beta - \beta\| = O_P(\eta_n + \eta_{1n})$;

(iii) $\sup_{w \in \mathcal{W}} |\tilde m^*(w) - m^*(w)| = O_P[\zeta_0(\kappa)(\eta_n + \eta_{1n})]$;

where $\|R_n\| = o_P(\eta_n + \eta_{1n})$, and $\eta_n$, $\eta_{1n}$ and $\zeta_0$ are defined in Assumption A.5(i).
To appreciate the effect of the first-stage series estimation on the second-stage series estimation, let $\bar\beta$ denote the series estimator of $\beta$ obtained by using $U_i$ together with $(X_i, Z_{1i})$ as the regressors. Then it is standard to show that

$\bar\beta - \beta = Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i \varepsilon_i + Q_{\Phi\Phi}^{-1} n^{-1}\sum_{i=1}^{n} \Phi_i [m^*(X_i, Z_{1i}, U_i) - \Phi_i'\beta] + \bar R_n$

and $\|\bar\beta - \beta\| = O_P(\eta_n)$, where $\|\bar R_n\| = o_P(n^{-1/2}) = o_P(\eta_n)$. The third term on the right hand side of the expression in Theorem 3.1(i) signifies the asymptotically non-negligible dominant effect of the first-stage estimation on the second-stage estimation.
With Theorem 3.1, it is straightforward to show the asymptotic joint distribution of our three-stage estimators of $m_1(x_1)$ and its gradient.

Theorem 3.2 Let $H \equiv \mathrm{diag}(1, h)$. Suppose that Assumptions A.1-A.5 in Appendix A hold. Then

(i) (Normality) $\sqrt{nh}\, H[\hat\theta_1(x_1) - \theta_1(x_1) - B_1(x_1)] \xrightarrow{d} N(0, \Omega_1(x_1))$, where

$B_1(x_1) \equiv \begin{pmatrix} \frac{h^2}{2}\mu_2\, \ddot m_1(x_1) \\ 0 \end{pmatrix}, \qquad \Omega_1(x_1) \equiv \begin{pmatrix} \frac{\nu_0\, \sigma^2(x_1)}{f_1(x_1)} & 0 \\ 0 & \frac{\nu_2\, \sigma^2(x_1)}{\mu_2^2\, f_1(x_1)} \end{pmatrix},$

$\sigma^2(x_1) \equiv E(\varepsilon_i^2 \mid X_{1i} = x_1)$, $f_1(\cdot)$ denotes the probability density function (PDF) of $X_{1i}$, and $\mu_j \equiv \int v^j k(v)\, dv$ and $\nu_j \equiv \int v^j k(v)^2\, dv$ for $j = 0, 1, 2$.

(ii) (Uniform consistency) Suppose that $Q_{\Phi\Phi,2} \equiv E(\Phi_i \Phi_i' \varepsilon_i^2)$ has bounded maximum eigenvalue. Then $\sup_{x_1 \in \mathcal{X}_1} \|H[\hat\theta_1(x_1) - \theta_1(x_1)]\| = O_P\big((nh/\log n)^{-1/2} + h^2\big)$.
Theorem 3.2(i) indicates that our three-step estimator of $\theta_1(x_1) = [m_1(x_1), \dot m_1(x_1)]'$ has the asymptotic oracle property. The asymptotic distribution of the local-linear estimator of $\theta_1(x_1)$ is not affected by random sampling errors in the first two-stage estimators. In fact, the three-step estimator of $\theta_1(x_1)$ has the same asymptotic distribution that we would have if the other components in $m^*(x, z_1, u)$ were known and a local-linear procedure were used to estimate $\theta_1(x_1)$. Theorem 3.2(ii) gives the uniform convergence rate for $\hat\theta_1(x_1)$. Similar properties can be established for the local-linear estimators of the other components of $m^*(x, z_1, u)$. In addition, following the standard exercise in the nonparametric kernel literature, we can also demonstrate that these estimators are asymptotically independently distributed.
4 Partially linear additive models
In this section we consider a slight extension of the model in (2.1) to the following partially linear additive model:

$Y = m(X, Z_1) + \theta'V + \varepsilon, \quad X = g(Z_1, Z_2) + \Psi V + U,$
$E(U \mid Z_1, Z_2, V) = 0, \quad E(\varepsilon \mid Z_1, Z_2, U, V) = E(\varepsilon \mid U), \quad E(\varepsilon) = 0, \quad (4.1)$

where $Y$, $X$, $Z_1$, $Z_2$, $U$ and $\varepsilon$ are defined as above, $V$ is a $d_v \times 1$ vector of exogenous variables, $\theta$ is a $d_v \times 1$ parameter vector and $\Psi = [\psi_1, \dots, \psi_d]'$ is a $d \times d_v$ matrix of parameters in the reduced form regression for $X$. To avoid the curse of dimensionality, we continue to assume that $g_l(Z_1, Z_2)$, $m(X, Z_1)$ and $E(\varepsilon \mid U)$ have the additive forms given in Section 2.1.

We remark that the results developed in previous sections extend straightforwardly to the model specified in (4.1). Note that

$E(Y \mid X, Z_1, Z_2, U, V) = m(X, Z_1) + E(\varepsilon \mid U) + \theta'V = m^*(X, Z_1, U) + \theta'V \quad \text{and} \quad (4.2)$

$E(X \mid Z_1, Z_2, V) = g(Z_1, Z_2) + \Psi V. \quad (4.3)$
Given a random sample $\{(Y_i, X_i, Z_{1i}, Z_{2i}, V_i)\}_{i=1}^{n}$, we can continue to adopt the three-step procedure outlined in Section 2.2 to estimate the above model. First, we choose $(\alpha_l, \psi_l)$ to minimize $n^{-1}\sum_{i=1}^{n} [X_{li} - P_1(Z_{1i}, Z_{2i})'\alpha_l - V_i'\psi_l]^2$. Let $(\tilde\alpha_l, \tilde\psi_l)$ denote the solution. The series estimator of $g_l(z_1, z_2)$ is given by $\tilde g_l(z_1, z_2) = P_1(z_1, z_2)'\tilde\alpha_l$. Define the residuals $\tilde U_{li} = X_{li} - \tilde g_l(Z_{1i}, Z_{2i}) - \tilde\psi_l'V_i$. Let $\widetilde U_i = (\tilde U_{1i}, \dots, \tilde U_{di})'$, $\widetilde W_i = (X_i', Z_{1i}', \widetilde U_i')'$ and $\Phi(\widetilde W_i)$ be defined as before. Second, we choose $(\beta, \theta)$ to minimize $n^{-1}\sum_{i=1}^{n} [Y_i - \Phi(\widetilde W_i)'\beta - V_i'\theta]^2$. Let $\tilde\beta \equiv (\tilde c, \tilde\beta_1', \dots, \tilde\beta_{2d+d_1}')'$ and $\tilde\theta$ denote the solution. Define $\tilde Y_{1i} = Y_i - \tilde\beta_{(-1)}'\Phi(\widetilde W_i) - \tilde\theta'V_i$, where $\tilde\beta_{(-1)}$ is defined as $\tilde\beta$ with its component $\tilde\beta_1$ being replaced by a $\kappa \times 1$ vector of zeros. Third, we estimate $m_1(x_1)$ and its first-order derivative by regressing $\tilde Y_{1i}$ on $X_{1i}$ via the local-linear procedure. Let $\hat\theta_1(x_1)$ denote the estimate of $\theta_1(x_1)$ via local-linear fitting.

It is well known that the finite dimensional parameter vectors $\psi_l$'s and $\theta$ can be estimated at the parametric $\sqrt{n}$-rate and that the appearance of the linear components in (4.1) will not affect the asymptotic properties of $\tilde\beta$ and $\hat\theta_1(x_1)$. To conserve space, we do not repeat the arguments here.
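Relative to the sketch of Section 2.2, the only change in each of the first two stages is appending the linear columns $V$ to the series design; the third stage is unchanged. A minimal illustration under DGP 3 of Section 5 (our own helper names; polynomial series in place of B-splines; $\Psi = 0$ in this DGP):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
v1 = rng.binomial(1, 0.5, n).astype(float)
v2 = rng.normal(0, 1, n)
u, eps = rng.normal(0, 1, n), rng.normal(0, 1, n)
x = np.sin(z1) + np.sin(z2) + u                    # reduced form (Psi = 0 here)
y = np.sin(x) + np.sin(z1) + 0.5 * v1 + v2 + eps   # structural equation, DGP 3

def basis(w, K=4):
    return np.column_stack([w**k for k in range(1, K + 1)])

def fit(D, resp):
    coef, *_ = np.linalg.lstsq(D, resp, rcond=None)
    return coef

V = np.column_stack([v1, v2])

# Stage 1: series terms in (Z1, Z2) plus the linear V columns
D1 = np.column_stack([np.ones(n), basis(z1), basis(z2), V])
u_hat = x - D1 @ fit(D1, x)                        # residuals net of V as well

# Stage 2: series terms in (X, Z1, u_hat) plus the linear V columns
D2 = np.column_stack([np.ones(n), basis(x), basis(z1), basis(u_hat), V])
b = fit(D2, y)
theta_hat = b[-2:]                                 # estimate of theta = (0.5, 1)
```

With the linear part partialled out this way, `theta_hat` converges at the root-$n$ rate, consistent with the remark above.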
5 Finite sample properties
In this section we evaluate the finite sample properties of our estimator by simulations. We
first look at four data generating processes (DGPs) to show the performance of our estimator.
We then consider higher-dimensional data and compare our estimator to a fully nonparametric
alternative of Su and Ullah (2008). We report the average bias, variance, and root mean square
error (RMSE) for the final-stage conditional mean and gradient estimates across 1000 Monte Carlo simulations. We consider three different sample sizes: 100, 200 and 400.
5.1 Baseline simulations
We consider four different DGPs for the structural equation relating $Y$ to $X$ and $Z_1$ (and, in DGP 3, to $V$ as well). Unless we state otherwise, $Z_1$ and $Z_2$ are independently distributed as uniform from zero to one ($U[0,1]$), $U$ and $\varepsilon$ are independently distributed as Gaussian with mean zero and variance one ($N(0,1)$), and all are mutually independent of one another and of $V$ and $(Z_1, Z_2)$. For the first three DGPs, each error distribution is assumed to be homoskedastic.

Our first DGP is our baseline model and is given as

$Y = \sin(X) + \sin(Z_1) + \varepsilon \quad \text{and} \quad X = \sin(Z_1) + \sin(Z_2) + U.$

Our second DGP considers a slightly more complicated first-stage regression model, as $Z_2$ enters the reduced form of $X$ via the PDF of a logistic distribution:

$Y = 0.25X^2 + 0.5Z_1^2 + \varepsilon \quad \text{and} \quad X = 0.2 + 0.2Z_1 + \frac{e^{-Z_2}}{(1 + e^{-Z_2})^2} + U.$

Our third DGP considers the partially linear extension, where $V_1$ and $V_2$ are distributed via a binomial and Gaussian distribution, respectively:

$Y = \sin(X) + \sin(Z_1) + 0.5V_1 + V_2 + \varepsilon \quad \text{and} \quad X = \sin(Z_1) + \sin(Z_2) + U.$

Finally, our fourth DGP is similar to the first, but allows for heteroskedasticity (note that our theory allows for heteroskedasticity). Specifically, we allow the variance of $\varepsilon$ to be a function of $Z_1$ and $Z_2$ via $0.1 + 0.5Z_1^2 + 0.5Z_2^2$. Clearly, $d = d_1 = d_2 = 1$ in DGPs 1-4.
We estimate the structural function in three steps. In the first two steps we use cubic B-
splines for the sieve estimation and in the third step we use local-linear kernel regression. For
the spline estimation, we specify the number of knots as $\lfloor 2n^{1/5} \rfloor$ so that $\kappa_1 = \kappa = \lfloor 2n^{1/5} \rfloor + 4$,
Table 1: Monte Carlo simulations for the final-stage conditional mean and gradient estimates

                       n = 100                                       n = 200                                       n = 400
        $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$
Bias
DGP 1 0.0522 -0.0611 0.2847 0.0419 -0.0586 0.0092 0.0297 -0.0358 -0.0078
DGP 2 -0.0649 0.0525 -0.0346 -0.0621 0.0178 -0.0483 -0.0434 0.0577 -0.0013
DGP 3 0.0526 -0.0544 -0.0019 0.0371 -0.0575 -0.0024 0.0278 -0.0502 -0.0015
DGP 4 0.0515 -0.0341 0.0542 0.0412 -0.0556 0.0166 0.0285 -0.0347 -0.0138
Variance
DGP 1 0.0526 0.0879 0.3017 0.0316 0.0593 0.1672 0.0195 0.0419 0.1120
DGP 2 0.1234 0.2485 0.7299 0.1211 0.2352 0.6851 0.0508 0.2242 0.3233
DGP 3 0.1170 0.1530 0.6956 0.0703 0.1054 0.4047 0.0443 0.0686 0.2732
DGP 4 0.1182 0.1520 0.6670 0.0692 0.1016 0.3773 0.0436 0.0714 0.2562
RMSE
DGP 1 0.2468 0.3699 0.6608 0.1870 0.2938 0.4880 0.1465 0.2405 0.3785
DGP 2 0.3708 0.6714 0.9763 0.3701 0.6684 0.9310 0.2365 0.6382 0.6197
DGP 3 0.3589 0.4947 0.9839 0.2779 0.4143 0.7609 0.2199 0.3235 0.5865
DGP 4 0.3659 0.5004 0.9740 0.2755 0.3974 0.7161 0.2163 0.3171 0.5669
where $\lfloor \cdot \rfloor$ denotes the integer part of its argument. For the kernel regression, we need to choose both the kernel function $k(\cdot)$ and the bandwidth parameter $h$. We apply the Gaussian kernel throughout the simulations and application: $k(v) = \exp(-v^2/2)/\sqrt{2\pi}$. There are two standard ways to choose the bandwidth. One is to apply Silverman's rule of thumb by setting $h = 1.06\, s_{X_1} n^{-1/5}$, where $s_{X_1}$ denotes the sample standard deviation of $X_1$; the other is to consider leave-one-out least-squares cross-validation (LSCV). To save time on computation, we consider Silverman's rule-of-thumb choice of bandwidth. In our application, we will use generalized cross-validation (GCV) to choose the number of sieve basis terms in each of the first two steps' sieve estimation and LSCV to choose the bandwidth in the last step's kernel estimation. For example, in the first step, we choose $\kappa_1$ to minimize the following GCV objective function:

$GCV(\kappa_1) = \frac{1}{n} \sum_{i=1}^{n} \frac{[X_{1i} - \tilde g_{\kappa_1}(Z_i)]^2}{[1 - p(\kappa_1)/n]^2},$

where $Z_i$ is a collection of all exogenous variables, $p(\kappa_1)$ is the number of estimated series coefficients, and $\tilde g_{\kappa_1}(\cdot)$ denotes the sieve estimate of $E(X_1 \mid Z)$ obtained by imposing the additive structure and using $\kappa_1$ terms of cubic B-spline basis functions to approximate each additive component. In the third step, we choose $h$ to minimize

$CV(h) = \frac{1}{n} \sum_{i=1}^{n} \big[\tilde Y_{1i} - \hat m_{1,-i}(X_{1i})\big]^2,$

where $\hat m_{1,-i}(X_{1i})$ is the leave-one-out version of $\hat m_1(X_{1i})$ defined in Section 2.2.
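In code, the two selection rules look roughly as follows. This is an illustrative univariate sketch: polynomial series stand in for the cubic B-splines, and we take the GCV correction to be the total number of estimated series coefficients divided by $n$ (an assumption in this sketch).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.uniform(0, 1, n)
x = np.sin(z) + rng.normal(0, 1, n)

def series_fit(kappa):
    """Polynomial series fit with kappa terms (cubic B-splines in the paper)."""
    D = np.column_stack([np.ones(n)] + [z**k for k in range(1, kappa + 1)])
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)
    return D @ coef, D.shape[1]

def gcv(kappa):
    fitted, p = series_fit(kappa)
    return np.mean((x - fitted) ** 2) / (1 - p / n) ** 2

kappa_star = min(range(1, 9), key=gcv)             # GCV choice of series terms

def lscv(h):
    """Leave-one-out least-squares CV for the local-linear bandwidth."""
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        w = np.exp(-0.5 * ((z[mask] - z[i]) / h) ** 2)
        X = np.column_stack([np.ones(n - 1), z[mask] - z[i]])
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ x[mask])
        err += (x[i] - beta[0]) ** 2               # leave-one-out squared error
    return err / n

grid = np.linspace(0.05, 1.0, 20)
h_star = grid[np.argmin([lscv(h) for h in grid])]
```

Both criteria trade off fit against flexibility: the GCV denominator penalizes large $\kappa_1$, while LSCV penalizes bandwidths that overfit the left-out observation.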
The simulation results for the final-stage regressions can be found in Table 1. The results are as expected. The average bias, variance and RMSE of each estimator decrease with the sample size. The conditional mean is estimated more precisely than its gradients. The
estimators in the homoskedastic DGP outperform those from the heteroskedastic DGP (DGP 1
versus 4).
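For reference, the summary statistics reported in the tables are the usual Monte Carlo ones; a compact sketch (with a synthetic stand-in for the estimator draws):

```python
import numpy as np

def mc_summary(estimates, truth):
    """Average bias, variance and RMSE over Monte Carlo replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    var = est.var()
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    return bias, var, rmse

# e.g., 1000 replications of an estimator of a truth of 1.0,
# with bias 0.05 and standard deviation 0.2 (synthetic numbers)
rng = np.random.default_rng(4)
draws = 1.05 + 0.2 * rng.standard_normal(1000)
bias, var, rmse = mc_summary(draws, 1.0)
```

By construction RMSE$^2$ = bias$^2$ + variance, which is a handy internal consistency check when reading such tables.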
5.2 Higher-dimensional performance
Now we look at the performance of our estimator with higher-dimensional data and compare
it with that of a fully nonparametric alternative, that of Su and Ullah (2008). Unless stated otherwise, $Z_1, Z_2, Z_3, Z_4$ and $Z_5$ are independently distributed as $U[0,1]$, and $U$ and $\varepsilon$ are independently distributed as $N(0,1)$ and are mutually independent of one another and of $V$ and $(Z_1, Z_2, Z_3, Z_4, Z_5)$. For DGPs 5-7, each error distribution is assumed to be homoskedastic.

Our fifth DGP is a variant of our baseline model and is given as

$Y = \sin(X) + \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + \varepsilon$ and
$X = \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + \sin(Z_5) + U.$

Our sixth DGP is specified as follows:

$Y = 0.25X^2 + 0.5Z_1^2 + \sin(Z_2) + Z_3^2 + 3Z_4^2 + \varepsilon$ and
$X = 0.2 + 0.2Z_1 + \frac{e^{-Z_5}}{(1 + e^{-Z_5})^2} + \cos(Z_2) + \sin(Z_3) + \sin(Z_4) + U.$

Our seventh DGP considers the partially linear extension, where $V_1$ and $V_2$ are distributed via a binomial and Gaussian distribution, respectively:

$Y = \sin(X) + \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + 0.5V_1 + V_2 + \varepsilon$ and
$X = \sin(Z_1) + \sin(Z_2) + \sin(Z_3) + \sin(Z_4) + \sin(Z_5) + U.$

Finally, our eighth DGP is similar to the fifth, but allows for heteroskedasticity. Specifically, we allow the variance of $\varepsilon$ to be a function of $Z_1, Z_2, Z_3, Z_4$ and $Z_5$ via $0.1 + 0.5Z_1^2 + 0.5Z_2^2 + Z_3^2 + Z_4^2 + Z_5^2$. Clearly, $d = d_2 = 1$ and $d_1 = 4$ in DGPs 5-8.
The simulation results for the final-stage regressions can be found in Tables 2 and 3 for
our proposed estimates and Su and Ullah’s (2008) estimates, respectively. For Su and Ullah’s
Table 2: Monte Carlo simulations for higher-dimensional data for the final-stage conditional mean and gradient estimates

                       n = 100                                       n = 200                                       n = 400
        $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$   $\hat m(\cdot,\cdot)$  $\partial\hat m/\partial x$  $\partial\hat m/\partial z_1$
Bias
DGP 5 0.0468 0.0390 0.0453 0.0420 0.0492 0.0032 0.0364 0.0329 -0.0082
DGP 6 -0.0820 0.0022 0.0128 -0.0659 -0.0028 0.0093 -0.0583 0.0011 0.0024
DGP 7 0.0659 0.0509 0.0056 0.0613 0.0350 0.0190 0.0472 0.0281 -0.0254
DGP 8 0.0492 0.0470 -0.0599 0.03897 0.0303 0.0024 0.0332 0.0371 -0.0081
Variance
DGP 5 0.2510 0.2551 1.2890 0.1225 0.0897 0.5907 0.0699 0.0571 0.3391
DGP 6 0.2480 0.2453 1.2710 0.1198 0.0977 0.6142 0.0685 0.0583 0.3386
DGP 7 0.5752 0.2977 1.7930 0.3355 0.0994 0.7068 0.1809 0.0637 0.3713
DGP 8 0.3575 0.3000 1.7510 0.1750 0.1158 0.8258 0.0985 0.0776 0.4766
RMSE
DGP 5 0.5091 0.4364 1.1820 0.3577 0.3145 0.7953 0.2700 0.2498 0.5970
DGP 6 0.5111 0.4663 1.1620 0.3573 0.3206 0.8058 0.2710 0.2528 0.5948
DGP 7 0.7680 0.7292 0.5137 0.5691 0.5456 0.3315 0.4306 0.3987 0.5833
DGP 8 0.6052 0.8277 1.4001 0.4263 0.5833 0.9460 0.3195 0.2870 0.7138
estimates, we basically follow their suggestions in choosing the orders of local-polynomial regression (3 in the first stage and 1 in the second stage), the kernel, and the bandwidth, but use the technique of Kim et al. (1999) in the third stage to speed up the calculation. The findings for our estimator are similar to those in Table 1. As for the comparison between the two estimators, we see that the additive estimates, which exploit the additive structure of the data, have smaller bias, variance, and RMSE than Su and Ullah's fully nonparametric estimates with higher-dimensional data. In particular, Su and Ullah's estimates are subject to the curse of dimensionality and tend to have very large variance and RMSE even for moderate sample sizes (e.g., n = 400), as they need to estimate d_1 + d_2 = 5, d_1 + 2 = 6 and d_1 + 1 = 5 dimensional nonparametric objects in DGPs 5, 6 and 8 in their first-, second-, and third-stage estimation, respectively. [In DGP 7, the linear components W_1 and W_2 in the structural equation are also counted as a part of Z_1 in Su and Ullah's procedure. As a result, d_1 = 6 and even higher dimensional nonparametric objects have to be estimated.] We also consider models with an even higher number of covariates and the results are as expected: the variance and RMSE of Su and Ullah's estimates blow up quickly as the number of covariates increases, while those of ours remain well behaved.
Table 3: Monte Carlo simulations for higher-dimensional data for Su and Ullah's (2008) final-stage conditional mean and gradient estimates (columns as in Table 2)

                 n = 100                       n = 200                       n = 400
          f̂(·,·)   f̂_x     f̂_z1      f̂(·,·)   f̂_x     f̂_z1      f̂(·,·)   f̂_x     f̂_z1
Bias
DGP 5     0.4796  -0.0122   0.2364    0.3342  -0.0166   0.2500    0.2655  -0.0227   0.0066
DGP 6     1.0310   0.0549   0.2030   -0.9077   0.0649   0.2400    0.3724  -0.1362   0.2916
DGP 7     0.4767  -0.0238   0.2484    0.1881   0.0015   0.0963    0.2334   0.0998   1.0110
DGP 8     1.3720   0.4493   0.4159    0.7077   0.5225   0.5809    0.5766   0.5119   0.4634
Variance
DGP 5     0.8934   1.2996   4.9969    0.8995   1.1916   4.3957    0.8924   1.0031   2.0049
DGP 6     1.6010   1.5890   5.6349    1.4980   1.7370   5.0254    1.4130   1.6297   3.9871
DGP 7     0.9969   1.2118   4.9003    0.5720   0.7518   3.2076    0.5663   0.7199   2.9871
DGP 8     2.0390   2.8980  10.8100    2.0640   2.6460   9.4325    2.0050   2.5940   9.2876
RMSE
DGP 5     1.0218   1.3035   5.0366    0.9635   1.1989   4.4616    0.9334   1.1157   2.0134
DGP 6     1.0990   1.7050   5.6563    1.0130   1.6930   5.0655    0.9725   1.6354   4.5161
DGP 7     1.1134   1.2347   4.9763    0.6046   0.7565   3.2311    0.5997   0.7223   3.1100
DGP 8     2.4940   2.9580  10.9200    2.2019   2.7030   9.4439    2.0970   2.6510   9.3057
6 Application: Child care use and test scores
It is generally accepted in the literature that early childhood achievement is a strong predictor of later success (e.g., better labor market outcomes) (Keane and Wolpin, 1997, 2001; Bernal and Keane, 2011; Cameron and Heckman, 1998). Thus, researchers have focused on the determinants of childhood achievement. Various models have been developed and most focus on cognitive ability as the outcome measure. In the present context, we are concerned with whether child care improves or hurts a particular measure of cognitive ability: test scores. Although this is an interesting question, previously there were serious data limitations.
The two major limitations associated with cognitive ability production functions in this context are sample selection bias and endogeneity. Sample selection bias occurs when only mothers' labor force participation is used in the analysis, which implicitly treats participation as a direct indicator of child care use. The main problem here is that working and non-working mothers may differ substantially in the cognitive ability production process, and if only labor force participation is used, the analysis effectively rules out non-working mothers. Adding actual child care use can help address the selectivity problem (see Bernal, 2008 and Bernal and Keane, 2011).
The second issue is the potential endogeneity of the child care use variable. To the best of our knowledge, relatively few papers in this literature use instrumental variable estimation to address the endogeneity problem, and those that do find no benefit to IV regression. Two possible reasons for this are the use of restrictive methods (methods that likely mask the heterogeneity of mothers and obscure the sources of potential endogeneity) and data limitations. The three papers that we are aware of which use IV regressions are Blau and Grossberg (1992), James-Burdumy (2005) and Bernal and Keane (2011).
Blau and Grossberg (1992) use maternal labor supply as an indicator of child care use and analyze children's cognitive development. They frame endogeneity through the participation decision of mothers, viewed as a comparison between home and market production: differences between employed and unemployed mothers may create differences in child quality production. Hence they focus on the endogeneity of the mothers' decisions. They instrument for maternal labor supply and conclude that there is no statistical heterogeneity between employed and unemployed mothers (and hence reject the IV model). An issue with their paper (which they point out) is the possibility of weak instruments. They also do not have a detailed control for child care. James-Burdumy (2005) focuses on the same problem and uses labor market conditions as instruments in her fixed effects IV model. Potentially weak instruments are again blamed for rejection of the IV model.
In response to the issues mentioned above, Bernal and Keane (2011) obtain data on actual child care use (which helps correct for selectivity issues) as well as an extensive number of instrumental variables (which helps address the weak instrument issue). Further, they choose a larger age range (compared to existing studies) for children in their application (previous studies found stronger correlations for their target ages). Also, they focus only on single mothers, which arguably fits their set of instruments better. They conclude that their IV regressions perform well.
We start our analysis with Bernal and Keane's (2011) data, but consider a more flexible cognitive ability production function (f(·)). Consider the following cognitive ability production function

ln S = f(A, CC, ln I, G, N, B) + ε,      (6.1)

where ln S is the logarithm of child test scores (our measure of cognitive ability), A represents the child's age, CC (the primary variable of interest) is cumulative child care, ln I is the logarithm of cumulative income since childbirth, G is a vector of group characteristics of the child and the mother (e.g., mother's AFQT score), N is the number of children and B measures initial ability (e.g., birth weight). Equation (6.1) is the baseline cognitive ability production equation (minus any functional form assumptions) in Bernal and Keane (2011, pp. 474).
A primary concern with Equation (6.1) is that CC, ln I and N may be correlated with the error term. Hence we use instruments to correct for this potential endogeneity. Following Bernal and Keane (2011), we use local demand conditions and welfare rules as instruments. Our contribution here is to provide a more flexible version of the cognitive ability production function proposed by Bernal and Keane (2011). This allows us to obtain the effects of child care for each child. Standard least-squares estimation methods are best suited to data near the mean, and looking solely at the mean may be misleading. Further, it is arguable that we are more interested in the upper or lower tails of the distribution of returns to child care. Our approach allows us to observe the overall variation.
Here we will be using the partially linear additive nonparametric specification. In our first stage we estimate three separate regressions: one for each endogenous regressor (CC, ln I and N). In Equation (4.1), these are given as the regressions of X on Z_1 and Z_2. Specifically, our first-stage equations are written as

CC   = g_11(AFQT) + g_12(ED) + g_13(EX) + g_14(MA) + γ_1'V + u_1,
ln I = g_21(AFQT) + g_22(ED) + g_23(EX) + g_24(MA) + γ_2'V + u_2,
N    = g_31(AFQT) + g_32(ED) + g_33(EX) + g_34(MA) + γ_3'V + u_3,      (6.2)

where we allow the control variables of mother's AFQT score (AFQT), mother's education (ED), mother's experience (EX) and mother's age (MA) to enter nonlinearly. The remaining control variables, as well as each of the instruments, are contained in V. Note that V also includes interactions of each instrument with selected control variables. This results in a total of 99 regressors in each first-stage regression (clearly indicating the need for a partially linear model).
After obtaining consistent estimates of each of the residual vectors from Equation (6.2), we run the second-stage model via a nonparametric additively separable partially linear regression of log test scores (ln S) on the endogenous variables (not their predicted values), each of the residuals from the first stage and the remaining control variables as

ln S = f_1(CC) + f_2(ln I) + f_3(N) + f_4(AFQT) + f_5(ED) + f_6(EX) + f_7(MA)
       + f_8(û_1) + f_9(û_2) + f_10(û_3) + Ψ'Ṽ + ε,      (6.3)

where Ṽ contains the same (twenty) control variables included in V (and does not include any of the instruments Z_2) as well as linear interactions between the variables entering the other nonparametric functions (ln I, N, AFQT, ED, EX and MA) and child care (CC). Estimation (of the additive components and their gradients) in the final stage follows from Section 2.2.
Regarding implementation, note that in the first two stages we use cross-validation techniques
to choose the number of knots for our B-splines. In our final-stage estimates we use cross-
validation techniques to determine the bandwidth in our kernel function. R code which can be
used to estimate our model is available from the authors upon request.
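To fix ideas, the first two stages of a control-function procedure of this kind can be sketched as follows. This is a stylized illustration with simulated data, one endogenous regressor and a hand-rolled truncated-power spline basis with fixed quantile knots; it is not the authors' R code, and all function and variable names are ours.

```python
import numpy as np

def spline_basis(v, knots, degree=3):
    """Truncated-power spline basis (no intercept): v, ..., v^degree, (v - k)_+^degree."""
    cols = [v ** d for d in range(1, degree + 1)]
    cols += [np.clip(v - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 2000
z1, z2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)      # instruments
u = rng.normal(size=n)
x = np.sin(z1) + np.sin(z2) + u                          # endogenous regressor
eps = 0.5 * u + 0.5 * rng.normal(size=n)                 # correlated with u
y = np.sin(x) + eps                                      # structural equation

# Stage 1: series regression of x on splines in the instruments; keep the residuals.
q3 = [0.25, 0.50, 0.75]
B1 = np.column_stack([np.ones(n),
                      spline_basis(z1, np.quantile(z1, q3)),
                      spline_basis(z2, np.quantile(z2, q3))])
uhat = x - B1 @ np.linalg.lstsq(B1, x, rcond=None)[0]

# Stage 2: regress y on splines in x plus splines in the control-function residual.
q5 = [0.1, 0.3, 0.5, 0.7, 0.9]
Bx = spline_basis(x, np.quantile(x, q5))
Bu = spline_basis(uhat, np.quantile(uhat, q3))
beta = np.linalg.lstsq(np.column_stack([np.ones(n), Bx, Bu]), y, rcond=None)[0]

# The additive component in x (identified up to a constant) on a grid of x values.
grid = np.linspace(0.0, 1.8, 37)
fhat = spline_basis(grid, np.quantile(x, q5)) @ beta[1:1 + Bx.shape[1]]
fhat -= fhat.mean()
truth = np.sin(grid) - np.sin(grid).mean()
print(round(float(np.max(np.abs(fhat - truth))), 3))
```

In the paper the knots and the final-stage kernel bandwidth are chosen by cross-validation; here they are fixed only to keep the sketch short.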
6.1 Data
Our data comes directly from Bernal and Keane (2011). The data are extensive and we will
attempt to summarize them in a concise manner. For those interested in the specific details, we
refer you to the excellent description in the aforementioned paper. Their primary data source
is the National Longitudinal Survey of Youth of 1979 (NLSY79). The exact instruments and
control variables can be found in Tables 1 and 2 in Bernal and Keane (2011).
As noted in the introduction, the data set consists of single mothers. Although this may
seem like a data restriction at first, it leads to stronger instruments. The main reason behind
this choice, as explained by Bernal and Keane (2011), is that single mothers fit their set of
instruments better. The primary instruments used here are welfare rules, which (as claimed by
Bernal and Keane, 2011) give exogenous variation for single mothers. The (1990s) welfare policy
changes resulted in increased employment rates for single mothers, hence higher child care use.
We describe our variables for the 2454 observations in our sample in more detail below.
6.1.1 Endogenous regressors
We consider three potentially endogenous variables: cumulative child care, cumulative income and number of children. These are the left-hand-side variables in our first-stage equations.
They are modeled in an additively separable nonparametric fashion in the second-stage regres-
sion.
6.1.2 Instruments
We group our instruments (Z_2) into four categories: time limits, work requirements, earnings disregards, and other policy variables and local demand conditions. We briefly explain each of
these categories and refer the reader to a more in depth description of the instruments in Bernal
and Keane (2011, pp. 466-469).
Time limits We consider (time limits for) two programs which aid in helping families with
children: Aid to Families with Dependent Children (AFDC) and Waivers and Temporary Aid
for Needy Families (TANF). Under AFDC, single mothers with children under age 18 may be
eligible to get help from the program as long as they fit certain criteria that are typically set by
the state and program regulations. TANF on the other hand, enables the states to set certain
time limits on the benefits they provide to eligible individuals. AFDC provides the benefits and
TANF creates the variability because states can set their own limits. The limits are important
for benefit receivers because an eligible female may become ineligible by hitting the limit and
she may choose to save some of the eligibility for later use. We include each of the eight (time
limit) instruments proposed by Bernal and Keane (2011).
Work requirements TANF requires eligible females to return to work after a certain time, as set by the state, in order to remain eligible. These rules are state dependent. While the standard requirement is that females start working within two years, several states choose shorter time limits. Some states lift this requirement for females with young children.
Besides the variation amongst states, even within states there exists variation. Here we include
each of the nine (work requirement) instruments.
Earnings disregards The AFDC and TANF benefits are adjusted by states depending upon the number of children and the earnings of the eligible females. While more children may lead to greater benefits, more earnings may lower them. States set the level for AFDC grants and
adjust the amount of reduction in benefits via TANF. Specifically, our first-stage regressions
include both the “flat amount of earnings disregarded in calculating the benefit amount” and
the “benefit reduction rate” instrumental variables.
Other policy variables and local demand conditions Our remaining instruments are
grouped in one generic category: other policy variables and local demand conditions. Here
we consider two additional programs for families with young children. These programs are
Child Support Enforcement (CSE) and the Child Care Development Fund (CCDF). Bernal
and Keane (2011) report CSE as a significant source of income for single mothers via the 2002
Current Population Survey. CSE’s goal is to find absent parents and establish relationships
with their children. CCDF on the other hand, is a subsidy based program which provides child
care for low-income families. States are independent in designing their own programs and hence
variation is present.
In addition to the policy variables, earned income tax credit (EITC), the unemployment rate
and hourly wage rate are listed as instruments. EITC is a wage support program for low-income
families. This is a subsidy based program and the subsidy amount varies with family size. The
benefit levels are not conditioned on actual family size since family size is endogenous. Our
first-stage regressions include six instruments from this category.
6.1.3 Control variables
In addition to the instrumental variables, we have twenty-four control variables (Z_1) which show up in each stage (four nonlinearly and twenty linearly). These variables primarily represent
characteristics of the mothers and children. These are each assumed to be exogenous. The four
variables that enter nonlinearly are the mother’s AFQT score, education level, work experience
and age. We treat these nonparametrically as we believe their impacts vary with their levels and
we are unaware of an economic theory which states exactly how they should enter. In order to
understand the intuition behind this specification, consider a model where these variables enter linearly. In that case, having schooling enter linearly would imply that each additional year of schooling a mother obtains leads to the same percentage change in her child's test score. The same would hold true for a linearly modeled AFQT score, age or experience. There is no reason to assume that this is true.
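The point can be made with a toy example (our own, with made-up data): a linear schooling term imposes one slope for every mother, while even the simplest nonlinear sieve lets the gradient vary with the level of schooling.

```python
import numpy as np

rng = np.random.default_rng(3)
educ = rng.integers(8, 21, size=1000).astype(float)       # hypothetical years of schooling
y = 0.8 * np.log(educ) + rng.normal(0, 0.1, size=1000)    # concave "true" effect on log score

# Linear specification: a single slope at every level of schooling.
X_lin = np.column_stack([np.ones_like(educ), educ])
b_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]

# Quadratic specification (the simplest nonlinear sieve): the slope varies with educ.
X_quad = np.column_stack([np.ones_like(educ), educ, educ ** 2])
b_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]

def grad_at(e):
    """Implied marginal effect of one more year of schooling at level e."""
    return b_quad[1] + 2.0 * b_quad[2] * e

print(b_lin[1], grad_at(8.0), grad_at(20.0))
```

The quadratic fit recovers a larger marginal effect at low schooling than at high schooling, which the linear fit cannot express.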
6.2 Results
There are a vast number of results that can be presented from this procedure and data. We plan
to limit our discussion to a few important issues. First, we want to determine the strength of
our instruments. To determine how well the instruments predict the endogenous regressors, we
propose an F-type (wild) bootstrap based test for the joint significance of the instruments. Second, we are interested in whether or not endogeneity exists. We check for this by testing for the joint significance of f_8(·), f_9(·) and f_10(·) in Equation (6.3). Finally, we are interested in potential heterogeneity in the return to child care use. We accomplish this by separating our observation-specific estimates amongst different pre-specified groups.
Our final-stage results will be given in three separate tables. The first two tables give the 10th, 25th, 50th, 75th and 90th percentiles of the estimated returns and their related (wild) bootstrapped standard errors. The first of those tables (Table 4) looks at the returns to (gradients of) the final-stage estimates of each of the nonlinear variables (excluding the residuals) from the second-stage regressions. The remaining tables (Tables 5 and 6) decompose the gradients for the child care use variable.
We also provide several figures of estimated densities and distribution functions. Figures 1-3
give a set of density plots for the estimated returns to child care use. This allows us to see the
overall distribution. We believe this is a more informative type of analysis as compared to solely
showing results at the mean or percentiles. Figure 4 looks at empirical cumulative distribution
functions (ECDF) for both positive and negative gradients with respect to the amount of child
care.
6.2.1 First and second-stage estimation
For our first-stage regressions, our main focus is the performance of our instruments. In order
to analyze this, we check the significance of our instruments in each of our three first-stage
regressions. Noting that we have 99 regressors in each first-stage, the percentage of significant
instruments in each first-stage regression is roughly one-half. This type of analysis, of course,
is informal. Rather than relying on univariate-type significances, we prefer to perform formal
tests to check for joint significance.
Here we perform a nonparametric F-type test, originally proposed by Ullah (1985). The test involves comparing the residual sum of squares between a restricted and an unrestricted model. Our restricted model sets the coefficient on each of our instruments to zero. The asymptotic distribution can be obtained by multiplying the statistic by a constant, but it is well
known that using the asymptotic distribution is problematic in practice. Instead, we use a wild bootstrap to determine the conclusion of our tests. For each first-stage regression we perform a test where the null is that each instrument is irrelevant. In each case our p-value is zero to at least four decimal places. Hence, we argue (as did Bernal and Keane, 2011) that our instruments are relevant in our prediction of our endogenous regressors.

Table 4: Final-stage gradient estimates for each of the nonlinear variables at various percentiles with corresponding wild bootstrapped standard errors (in parentheses); rows follow the order of f_1-f_7 in Equation (6.3)

        10%        25%        50%        75%        90%
f_1    -0.0089    -0.0061    -0.0005     0.0049     0.0079
       (0.0034)   (0.0038)   (0.0042)   (0.0027)   (0.0054)
f_2    -0.0415    -0.0045     0.0250     0.0463     0.0741
       (0.0225)   (0.0258)   (0.0249)   (0.0329)   (0.0304)
f_3    -0.0157    -0.0157    -0.0117    -0.0021    -0.0021
       (0.0077)   (0.0077)   (0.0046)   (0.0065)   (0.0065)
f_4    -0.0058    -0.0058    -0.0046     0.0037     0.0128
       (0.0062)   (0.0062)   (0.0061)   (0.0091)   (0.0080)
f_5    -0.0034    -0.0022    -0.0016     0.0003     0.0031
       (0.0031)   (0.0031)   (0.0025)   (0.0038)   (0.0038)
f_6    -0.0012     0.0000     0.0015     0.0034     0.0046
       (0.0008)   (0.0009)   (0.0009)   (0.0013)   (0.0019)
f_7    -0.0051    -0.0017     0.0035     0.0051     0.0063
       (0.0028)   (0.0024)   (0.0024)   (0.0020)   (0.0037)
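A minimal version of this restricted-versus-unrestricted RSS comparison with a wild bootstrap can be sketched as follows; the simulated data, the linear first stage and the Rademacher weights are our own simplifying choices, not the paper's exact specification.

```python
import numpy as np

def f_type_stat(y, X_restricted, X_full):
    """RSS-based statistic: (RSS_r - RSS_u) / RSS_u, as in an F-type comparison."""
    def rss(X):
        e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return float(e @ e)
    return (rss(X_restricted) - rss(X_full)) / rss(X_full)

rng = np.random.default_rng(1)
n = 300
W = np.column_stack([np.ones(n), rng.normal(size=n)])     # always-included controls
Z = rng.normal(size=(n, 3))                               # instruments under test
x = W @ np.array([1.0, 0.5]) + Z @ np.array([0.4, 0.3, 0.2]) + rng.normal(size=n)

stat = f_type_stat(x, W, np.column_stack([W, Z]))

# Wild bootstrap under the null: refit the restricted model, then perturb its
# residuals with Rademacher weights and recompute the statistic each time.
fit_r = W @ np.linalg.lstsq(W, x, rcond=None)[0]
res_r = x - fit_r
boot = np.empty(399)
for b in range(399):
    xb = fit_r + res_r * rng.choice([-1.0, 1.0], size=n)
    boot[b] = f_type_stat(xb, W, np.column_stack([W, Z]))
pval = float((boot >= stat).mean())
print(stat, pval)
```

Because the instruments truly enter the first stage here, the observed statistic sits far in the right tail of the bootstrap distribution and the test rejects the null of irrelevance.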
In the second-stage regression we are concerned with the joint significance of each of the
residuals from the first-stage regressions. We perform a similar Ullah (1985) type test as above
and reject the null that the three sets of residuals are jointly insignificant with a p-value that is
zero to four decimal places. We conclude that endogeneity is likely present and thus justify the
use of our procedure.
6.2.2 Final-stage estimation
Here we are interested in comparing our gradient estimates to those of Bernal and Keane (2011). Our average gradient estimates (Table 4) for the nonlinear variables are often similar in magnitude and sign to their results. We find mostly positive effects for mother's income, education and AFQT scores. For our primary variable of interest, the gradient on child care is also similar at the median (noting that our median result is insignificant). To put this in perspective, the median coefficient of -0.0005 is equivalent to a 0.2% decrease in test scores for an additional year of child care use (or a 0.05% decrease for an additional quarter). That being said, each of these statements ignores the heterogeneity in the gradient estimates allowed by our procedure.
When we look at the percentiles, we see both positive and negative estimates. This is not possible with standard linear models. We therefore want to determine the story behind these variable returns. To do so, we break the results up by child care type, amount of child care and child gender. We also examine whether heterogeneity is present amongst mothers (by level of education, experience and age). Finally, we try to determine which attributes are common amongst those receiving positive or negative returns to child care use.
Disaggregating the child care gradient In the first few rows of Table 5, we analyze the child care gradient with respect to child care type, child gender and amount of child care use. In this table we report the percentiles of the estimated gradients for the related groups and the associated standard errors. We also provide Figures 1-3, which show the overall variation for selected pairs of groups. Before we get into the details for different groups, we want to point out that many of the results are insignificant. In fact, only 634 of the 2454 estimates are statistically significant at the five-percent level. What this implies is that for a large portion of the sample, an additional unit of child care will have no impact on test scores. That being said, we find many cases where it does matter and we highlight those results below.
Bernal and Keane (2011) found that only informal child care (e.g., a grandparent) had
significantly negative effects. Specifically, they found that an additional year of informal child
care led to a 2.6% reduction in test scores. To compare our results, we separated the gradients
on child care use between those who received only formal or only informal child care. Although
there are some differences in our percentile estimates, Figure 1 shows essentially no difference
in the densities of estimates between those who received only formal (e.g., a licensed child care
facility) versus only informal child care. Bernal and Keane (2011) also find differences in returns
to child care between genders. Both Table 5 and Figure 2 show essentially no difference between
these two groups.
Where we do find a difference is with respect to the amount of child care. In Figure 3, we
Table 5: Final-stage gradient estimates for child care use for different types of child care, gender, amount of child care and different attributes of the mother, at various percentiles for the specific group, with corresponding wild bootstrapped standard errors (in parentheses)

                             10%        25%        50%        75%        90%
Formal                    -0.0087    -0.0067    -0.0016     0.0035     0.0065
                          (0.0052)   (0.0037)   (0.0036)   (0.0032)   (0.0031)
Informal                  -0.0090    -0.0061    -0.0007     0.0028     0.0067
                          (0.0089)   (0.0038)   (0.0036)   (0.0049)   (0.0042)
Female                    -0.0092    -0.0061    -0.0005     0.0057     0.0079
                          (0.0038)   (0.0038)   (0.0042)   (0.0046)   (0.0055)
Male                      -0.0089    -0.0067    -0.0005     0.0036     0.0079
                          (0.0034)   (0.0037)   (0.0042)   (0.0046)   (0.0055)
Above median child care   -0.0092    -0.0087    -0.0054     0.0028     0.0049
                          (0.0038)   (0.0052)   (0.0036)   (0.0037)   (0.0037)
Below median child care   -0.0061    -0.0010     0.0013     0.0076     0.0218
                          (0.0038)   (0.0035)   (0.0042)   (0.0034)   (0.0080)
Education < 12            -0.0067     0.0014     0.0028     0.0076     0.0218
                          (0.0037)   (0.0036)   (0.0037)   (0.0034)   (0.0080)
Education ≥ 12            -0.0090    -0.0068    -0.0014     0.0028     0.0067
                          (0.0089)   (0.0032)   (0.0036)   (0.0049)   (0.0042)
Experience < 5            -0.0090    -0.0068    -0.0014     0.0028     0.0067
                          (0.0036)   (0.0032)   (0.0036)   (0.0049)   (0.0042)
Experience ≥ 5            -0.0087    -0.0046     0.0011     0.0065     0.0218
                          (0.0052)   (0.0038)   (0.0040)   (0.0031)   (0.0080)
No experience             -0.0061    -0.0007     0.0028     0.0076     0.0218
                          (0.0038)   (0.0036)   (0.0037)   (0.0034)   (0.0080)
Age < 23                  -0.0087    -0.0053     0.0007     0.0063     0.0139
                          (0.0052)   (0.0036)   (0.0039)   (0.0036)   (0.0066)
Age 23-29                 -0.0090    -0.0061    -0.0007     0.0049     0.0079
                          (0.0089)   (0.0038)   (0.0053)   (0.0037)   (0.0055)
Age ≥ 30                  -0.0090    -0.0076    -0.0016     0.0028     0.0076
                          (0.0036)   (0.0034)   (0.0036)   (0.0037)   (0.0034)
Figure 1: Density of estimated returns to child care for those with only formal versus those with
only informal child care
look at the estimated gradients of child care use for those children who receive below and above the median total child care. We find that more child care use leads to lower returns. In fact, we see negative returns for those children receiving relatively more child care, which suggests decreasing returns to child care use. We consider this finding important since it provides evidence that the lower returns may be associated with the amount of child care use rather than the type of child care or the gender of the child.
In the remaining rows of Table 5, we separate our estimated gradients on child care based on
attributes of the mothers. Specifically, we analyze the estimated returns for mothers of different
levels of education, experience and age. We see variation in our estimates for mothers of different
age groups. As mothers get older, we see lower returns to child care use (more negative and
significant estimates) as compared to those of younger mothers.
We also see variation for different experience levels. Here we find more negative (and sig-
nificant) estimates for mothers with more experience. On the other hand, for mothers with no
experience, we see much larger (and significant) returns to child care.
Similarly, mothers with more education appear to have more negative returns. Many of
Figure 2: Density of estimated returns to child care for male and female children
the percentiles are negative for mothers with twelve or more years of education. For mothers
without a high school diploma, we see positive effects for all but the lowest percentile. In other
words, for less educated mothers, more child care may actually improve their child’s test scores.
All these results show us two important things. First, there is substantial heterogeneity in
our returns to child care use that cannot be captured with a single parameter estimate. Second,
the returns tend to be related to the amount of child care and the quality of maternal time. For
those mothers who have more education and experience, child care tends to hurt their child’s
test score. On the other hand, for those mothers with less education and experience, our results
tend to suggest that their children may be better off with more child care.
Positive and negative returns For most of the reported estimates, we tend to see both positive and negative returns. This is perhaps more important than the magnitudes of the estimated returns themselves. What this finding suggests is that there are some children who benefit from additional child care and some who are harmed. We hope to uncover who these children are and, hopefully, the drivers of such returns.
Table 6 separates the partial effects on child care by those which are positive and negative.
Figure 3: Density of estimated returns to child care for those with above and below the median
units of child care
Table 6: Median characteristics for groups with positive versus negative child care use gradients

Attribute                Positive   Negative
Mother's education           12         12
Mother's experience           4          6
Mother's AFQT                15         19
Mother's age                 22         24
Child's age                 5.8        5.8
Formal child care             0          0
Informal child care           6          9
Total child care            8.5       11.5
Number of children            1          1
Sample size                1141       1313
Figure 4: Empirical cumulative distribution functions for the amount of child care for those
with both positive and negative returns to child care
Perhaps the first point of interest is that slightly more than half of the gradients are negative.
If we were to run a simple ordinary least-squares regression, this would (back of the envelope)
suggest a negative coefficient. This is what is typically found in the literature.
As for the remaining values in the table, the rows represent characteristics of interest and each number represents the median value of that characteristic for children with positive and with negative returns. Many values are the same. For example, mother's education, child's age, quarters of formal child care and number of children are the same at the median in each group. However, we see that for negative returns, (median) mothers have more experience and are older. It is true that they have more informal child care (which likely reflects the result of Bernal and Keane, 2011), but it is also true that they have far more quarters of child care overall at the median.
Again, these are only point estimates. If we plot the ECDFs of each type of child care for the groups with negative and positive child care gradients, we see that the amounts of both formal and informal child care use are higher for those with negative returns. For example,
Figure 4 plots the distribution functions with respect to total child care use between those with
negative and those with positive returns to child care. Those children with negative returns to child care receive more child care overall. We fail to reject the null of first-order dominance via a Kolmogorov-Smirnov test, with a p-value near unity. We also find this same pattern of dominance when we look at formal versus informal child care or male versus female children. This is strong evidence that it is the amount of child care, and not necessarily the type, that matters.
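The dominance comparison can be illustrated with a hand-rolled one-sided Kolmogorov-Smirnov check on simulated child care amounts (the gamma distributions below are hypothetical, chosen only to mimic the group sizes and medians in Table 6):

```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at the points in t."""
    return np.searchsorted(np.sort(sample), t, side="right") / sample.size

rng = np.random.default_rng(2)
# Hypothetical quarters of child care; the negative-returns group receives more care.
care_pos = rng.gamma(shape=4.0, scale=2.0, size=1141)
care_neg = rng.gamma(shape=4.0, scale=2.9, size=1313)

t = np.sort(np.concatenate([care_pos, care_neg]))
F_pos, F_neg = ecdf(care_pos, t), ecdf(care_neg, t)

# First-order dominance of the negative-returns group requires F_neg <= F_pos
# everywhere; the one-sided KS statistic records the worst violation.
ks_violation = float(np.max(F_neg - F_pos))
print(ks_violation)
```

A violation near zero is consistent with failing to reject first-order dominance; in practice one would attach a p-value via the KS null distribution or a permutation scheme.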
7 Conclusion
In this paper, we develop oracle efficient estimators for additive nonparametric structural equa-
tion models. We show that our estimators of the conditional mean and gradient are consistent,
asymptotically normal and free from the curse of dimensionality. The finite sample results sup-
port the asymptotic development. We also consider a partially linear extension of our model
which we use in an empirical application relating child care to test scores. In our application we find that the amount of child care use, and not the type, is primarily responsible for the sign of the returns. Given that our nonparametric procedure gives us observation-specific estimates, we are able to uncover, in addition to the role of the amount of child care, which attributes of mothers are related to different returns. We find evidence that more educated, more experienced mothers with higher test scores (themselves) are associated with lower returns to child care for their children. On the other hand, the children of less educated, less experienced mothers with lower test scores often have positive returns to child care.
8 Acknowledgements
The authors would like to thank two anonymous referees, the joint editor Rong Chen, Peter Brummund, David Jacho-Chavez, Chris Parmeter and Anton Schick for useful comments and suggestions. They also thank participants in talks given at the State University of New York at Albany, the University of Alabama, the University of North Carolina at Greensboro, New York Camp Econometrics (Bolton, NY), the annual meeting of the Midwest Econometrics Group (Bloomington, IN) and the Western Economic Association International annual conference (Seattle, WA). Su acknowledges support from the Singapore Ministry of Education Academic Research Fund under grant number MOE2012-T2-2-021. The R code used in the paper can be obtained from the authors upon request.
Appendix
In this appendix, we first provide the assumptions that are used to prove the main results and then prove the main results stated in Section 3.
A Assumptions
A real-valued function $g(\cdot)$ on the real line is said to satisfy a Hölder condition with exponent $\gamma \in (0, 1]$ if there is a constant $c$ such that $|g(u) - g(\tilde{u})| \le c\,|u - \tilde{u}|^{\gamma}$ for all $u$ and $\tilde{u}$ on the support of $g(\cdot)$. $g(\cdot)$ is said to be $\kappa$-smooth, $\kappa = m + \gamma$, if it is $m$-times continuously differentiable on its support and its $m$th derivative, $g^{(m)}(\cdot)$, satisfies a Hölder condition with exponent $\gamma$. The $\kappa$-smooth class of functions is popular in econometrics because a $\kappa$-smooth function can be approximated well by various linear sieves; see, e.g., Chen (2007). For any scalar function $g(\cdot)$ on the real line that has $m$ derivatives and support $\mathcal{S}$, let $|g(\cdot)|_{m} \equiv \max_{j \le m} \sup_{s \in \mathcal{S}} |g^{(j)}(s)|$. Let $\mathcal{X}_{d}$ and $\mathcal{U}_{d}$ denote the supports of $X_{d}$ and $U_{d}$, respectively, for $d = 1, \ldots, D$. Let $\mathcal{Z}_{\ell}$ denote the support of $Z_{\ell d}$ for $d = 1, \ldots, D_{\ell}$ and $\ell = 1, 2$. Let $P_{i} \equiv p^{L_{1}}(Z_{1i}, Z_{2i}) = [p_{1}(Z_{1i}, Z_{2i}), \ldots, p_{L_{1}}(Z_{1i}, Z_{2i})]'$ and $\Omega_{P,d} \equiv E[p^{L_{1}}(Z_{1i}, Z_{2i})\,p^{L_{1}}(Z_{1i}, Z_{2i})'\,v_{di}^{2}]$ for $d = 1, \ldots, D$. Let $Z_{i} \equiv (Z_{1i}', Z_{2i}')'$.
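To illustrate the approximation property that motivates this smoothness class, the following sketch (our illustration, not part of the paper) projects a smooth function on a polynomial sieve of growing dimension and reports the sup-norm approximation error on a grid:

```python
import numpy as np

# Target: a smooth function on [0, 1]; smooth targets are well approximated by sieves.
def g(u):
    return np.sin(2 * np.pi * u) + 0.5 * u**2

u = np.linspace(0.0, 1.0, 501)
y = g(u)

errors = {}
for L in (3, 6, 12):
    # Polynomial sieve of dimension L: basis (1, u, ..., u^{L-1}); least-squares projection.
    P = np.vander(u, N=L, increasing=True)
    alpha, *_ = np.linalg.lstsq(P, y, rcond=None)
    errors[L] = np.max(np.abs(P @ alpha - y))

# The sup-norm approximation error shrinks as the sieve dimension grows,
# in line with the polynomial approximation rates for smooth functions.
assert errors[12] < errors[6] < errors[3]
print(errors)
```

Splines or wavelets could replace the polynomial basis here; the shrinking sup-norm error as the sieve dimension grows is the property the assumptions below exploit.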
We make the following assumptions.
Assumption A1. (i) $\{X_{i}, Z_{i}\}_{i=1}^{n}$ is an IID random sample. (ii) The supports $\mathcal{W}$ and $\mathcal{Z}$ of $W_{i}$ and $Z_{i}$ are compact. (iii) The distributions of $W_{i}$ and $Z_{i}$ are absolutely continuous with respect to the Lebesgue measure.
Assumption A2. (i) For every $L_{1}$ that is sufficiently large, there exist $\underline{c}_{1}$ and $\bar{c}_{1}$ such that $0 < \underline{c}_{1} \le \lambda_{\min}(Q_{PP}) \le \lambda_{\max}(Q_{PP}) \le \bar{c}_{1} < \infty$ and $\lambda_{\max}(\Omega_{P,d}) \le \bar{c}_{1} < \infty$ for $d = 1, \ldots, D$. (ii) For every $K$ that is sufficiently large, there exist $\underline{c}_{2}$ and $\bar{c}_{2}$ such that $0 < \underline{c}_{2} \le \lambda_{\min}(Q_{\Phi\Phi}) \le \lambda_{\max}(Q_{\Phi\Phi}) \le \bar{c}_{2} < \infty$. (iii) The functions $g_{d}(\cdot)$, $d = 1, \ldots, D_{1} + D_{2}$, and $m_{d}(\cdot)$, $d = 1, \ldots, 2D + D_{1}$, belong to the class of $\kappa$-smooth functions with $\kappa \ge 2$. (iv) There exist $\alpha$'s such that $\sup_{z \in \mathcal{Z}_{1}} |g_{d}(z) - p^{L_{1}}(z)'\alpha_{d}| = O(L_{1}^{-\kappa})$ for $d = 1, \ldots, D_{1}$, and $\sup_{z \in \mathcal{Z}_{2}} |g_{D_{1}+d}(z) - p^{L_{1}}(z)'\alpha_{D_{1}+d}| = O(L_{1}^{-\kappa})$ for $d = 1, \ldots, D_{2}$. (v) There exist $\beta$'s such that $\sup_{x \in \mathcal{X}_{d}} |m_{d}(x) - \varphi^{K}(x)'\beta_{d}| = O(K^{-\kappa})$ for $d = 1, \ldots, D$, $\sup_{z \in \mathcal{Z}_{1}} |m_{D+d}(z) - \varphi^{K}(z)'\beta_{D+d}| = O(K^{-\kappa})$ for $d = 1, \ldots, D_{1}$, and $|m_{D+D_{1}+d}(\cdot) - \varphi^{K}(\cdot)'\beta_{D+D_{1}+d}|_{1} = O(K^{-\kappa})$ for $d = 1, \ldots, D$. (vi) The basis functions $\varphi_{k}(\cdot)$, $k = 1, 2, \ldots$, are twice continuously differentiable everywhere on the support $\mathcal{U}_{d}$ of $U_{d}$ for $d = 1, \ldots, D$, and $\max_{1 \le d \le D} \sup_{u \in \mathcal{U}_{d}} \|\partial^{j}\varphi^{K}(u)\| \le \zeta_{j}$ for $j = 0, 1, 2$.
Assumption A3. (i) The joint PDF of any two elements in $W_{i}$ is bounded, bounded away from zero, and twice continuously differentiable. (ii) Let $\sigma_{di}^{2} \equiv \sigma_{d}^{2}(X_{i}, Z_{i}, U_{i}) \equiv E(\varepsilon_{di}^{2} \mid X_{i}, Z_{i}, U_{i})$ and $\Omega_{d\ell} \equiv E[p^{L_{1}}(Z_{\ell i})\,p^{L_{1}}(Z_{\ell i})'\,\sigma_{di}^{2}]$ for $d = 1, \ldots, D$ and $\ell = 1, 2$. The largest eigenvalue of $\Omega_{d\ell}$ is bounded uniformly in $L_{1}$.
Assumption A4. The kernel function $k(\cdot)$ is a PDF that is symmetric, bounded and has compact support $[-c_{k}, c_{k}]$. It satisfies the Lipschitz condition $|k(u_{1}) - k(u_{2})| \le C_{k}|u_{1} - u_{2}|$ for all $u_{1}, u_{2} \in [-c_{k}, c_{k}]$.
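As a concrete kernel satisfying these requirements (our illustrative choice, not one mandated by the paper), consider the Epanechnikov kernel, which is a symmetric, bounded density on a compact support and Lipschitz on that support:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric, bounded density supported on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-1.0, 1.0, 20001)
du = u[1] - u[0]
k = epanechnikov(u)

# A PDF: integrates to one (trapezoidal rule), symmetric, bounded by 3/4.
integral = np.sum((k[:-1] + k[1:]) * 0.5) * du
assert abs(integral - 1.0) < 1e-4
assert np.allclose(k, epanechnikov(-u))
assert k.max() <= 0.75 + 1e-12

# Lipschitz on the support: |k'(u)| = 1.5|u| <= 1.5, so successive grid
# differences are bounded by (roughly) 1.5 * du.
assert np.max(np.abs(np.diff(k))) <= 1.5 * du + du**2
```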
Assumption A5. (i) $L_{1} \le K$. As $n \to \infty$, $L_{1} \to \infty$, $L_{1}^{3}/n \to 0$ and $\xi_{n} \to c_{1} \in [0, \infty)$, where $\xi_{n} \equiv (K^{1/2}\zeta_{0} + \zeta_{1})\lambda_{1n} + \zeta_{0}\zeta_{2}\lambda_{1n}^{2}$, $\lambda_{1n} \equiv L_{1}^{1/2}n^{-1/2} + L_{1}^{-\kappa}$, and $\lambda_{n} \equiv K^{1/2}n^{-1/2} + K^{-\kappa}$. (ii) As $n \to \infty$, $h \to 0$, $nh^{3}/\log n \to \infty$, $Kh^{-2}/n \to 0$, $L_{1} = o(n^{1/2}h^{1/2})$, and $[\zeta_{1}L_{1}^{1/2}\lambda_{1n}(1 + K^{1/2}L_{1}^{-\kappa}) + \zeta_{2}L_{1}\lambda_{1n}^{2}](\lambda_{n} + \lambda_{1n}) \to 0$.
Assumptions A1(i)-(ii) impose IID sampling and compactness of the support of the exogenous independent variables. Either assumption can be relaxed at the cost of lengthy arguments; see, e.g., Su and Jin (2012), who allow for weakly dependent data and infinite support for their regressors. A1(iii) requires that the variables in W and Z be continuously valued, which is standard in the literature on sieve estimation. Assumptions A2(i)-(ii) ensure the existence and non-singularity of the asymptotic covariance matrices of the first two-stage estimators. They are standard in the literature; see, e.g., Newey (1997), Li (2000), and Horowitz and Mammen (2004). Note that all of these authors assume that the conditional variances of the error terms given the exogenous regressors are uniformly bounded, in which case the second part of A2(i) becomes redundant. A2(iii) imposes smoothness conditions on the relevant functions, and A2(iv)-(v) quantifies the approximation error for $\kappa$-smooth functions. These conditions are satisfied, for example, by polynomials, splines and wavelets. A2(vi) is needed for the application of Taylor expansions. It is well known that $\zeta_{j} = O(K^{j+1/2})$ and $\zeta_{j} = O(K^{2j+1})$ for B-splines and power series, respectively (see Newey, 1997). The rate at which splines uniformly approximate a function is the same as that for power series, so that the uniform convergence rate for splines is faster than that for power series. In addition, the low multicollinearity of B-splines and the recursive formula for their calculation lead to computational advantages (see Chapter 19 of Powell, 1981, and Chapter 4 of Schumaker, 2007). For these reasons, B-splines are widely used in the literature.
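The recursive formula behind these computational advantages is the Cox-de Boor recursion. The sketch below (our illustration, not the paper's code) evaluates a cubic B-spline basis and checks two properties that underlie the low multicollinearity: each basis function is locally supported, and the basis forms a partition of unity:

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    n_basis = len(t) - degree - 1
    # Degree-0 splines are indicators of the knot spans.
    B = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (x >= t[j]) & (x < t[j + 1])
    # Close the right end so x equal to the last knot belongs to the last span.
    B[x == t[-1], np.searchsorted(t, t[-1], side="left") - 1] = 1.0
    for d in range(1, degree + 1):
        Bn = np.zeros((len(x), len(t) - d - 1))
        for j in range(len(t) - d - 1):
            denom1 = t[j + d] - t[j]
            denom2 = t[j + d + 1] - t[j + 1]
            left = (x - t[j]) / denom1 * B[:, j] if denom1 > 0 else 0.0
            right = (t[j + d + 1] - x) / denom2 * B[:, j + 1] if denom2 > 0 else 0.0
            Bn[:, j] = left + right
        B = Bn
    return B[:, :n_basis]

# Cubic B-splines on [0, 1] with clamped (repeated) boundary knots.
knots = np.r_[0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
x = np.linspace(0.0, 1.0, 101)
B = bspline_basis(x, knots, degree=3)

assert B.shape == (101, 7)
assert np.allclose(B.sum(axis=1), 1.0)      # partition of unity
assert np.all(B[x > 0.25, 0] == 0.0)        # local support of the first basis function
```

Production implementations are available in, e.g., SciPy's interpolate module; the point here is only that the recursion is cheap and yields locally supported, well-conditioned regressors.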
Assumptions A3(i)-(ii) and A4 are needed to establish the asymptotic properties of the third-stage estimators. A3(ii) is redundant given Assumption A2(i) if one assumes that the conditional variances of the error terms given $(X_{i}, Z_{i}, U_{i})$ are uniformly bounded. A4 is standard for local-linear regression (see Fan and Gijbels, 1996, and Masry, 1996). The compact support condition facilitates the demonstration of the uniform convergence rate in Theorem 3.2 below, but it can be removed at the cost of some lengthy arguments (see, e.g., Hansen, 2008); in particular, the Gaussian kernel can then be applied. Assumptions A5(i)-(ii) specify conditions on $L_{1}$, $K$ and $h$. Note that we allow the use of different numbers of series approximation terms in the first- and second-stage estimation, which allows us to see clearly the effect of the first-stage estimates on the second-stage estimates. The first condition (namely, $L_{1} \le K$) in A5(i) is needed for the proof of a technical lemma (see Lemma B.5(iii)) and can be removed at the cost of some additional assumptions on the basis functions. The terms that are associated with $L_{1}$ arise because of the use of the nonparametrically generated regressors in the second-stage series estimation. The factor $\log n$ arises in order to establish the uniform consistency results in Theorem 3.2 below and can be replaced by 1 if we are only interested in the pointwise results. In the case where $\zeta_{j} = O(K^{j+1/2})$ in Assumption A2(vi), $\xi_{n} = O(K^{3/2}\lambda_{1n} + K^{3}\lambda_{1n}^{2})$. In practice, we recommend setting $L_{1} = K$. These restrictions, in conjunction with the condition $\kappa \ge 2$, imply that the conditions in Assumption A5 can be greatly simplified as follows:
Assumption A5∗. (i) As $n \to \infty$, $K \to \infty$ and $K^{4}/n \to c_{1} \in [0, \infty)$. (ii) As $n \to \infty$, $h \to 0$, $nh^{3}/\log n \to \infty$, $Kh^{-2}/n \to 0$, and $n^{1/2}h^{1/2}K^{-3/2} \to 0$.
B Proofs of the Results in Section 3
Let $P_{i} \equiv p^{L_{1}}(Z_{i})$, $\tilde{\Phi}_{i} \equiv \Phi(\tilde{W}_{i})$, $\hat{Q}_{PP} \equiv n^{-1}\sum_{i=1}^{n} P_{i}P_{i}'$, $\hat{Q}_{\Phi\Phi} \equiv n^{-1}\sum_{i=1}^{n} \Phi_{i}\Phi_{i}'$, and $\tilde{Q}_{\Phi\Phi} \equiv n^{-1}\sum_{i=1}^{n} \tilde{\Phi}_{i}\tilde{\Phi}_{i}'$. By Lemmas B.1(ii) and (v) and Lemma B.4(iii) below, $\hat{Q}_{PP}$, $\hat{Q}_{\Phi\Phi}$ and $\tilde{Q}_{\Phi\Phi}$ are invertible with probability approaching 1 (w.p.a.1), so that in large samples we can replace the generalized inverses $\hat{Q}_{PP}^{-}$, $\hat{Q}_{\Phi\Phi}^{-}$ and $\tilde{Q}_{\Phi\Phi}^{-}$ by $\hat{Q}_{PP}^{-1}$, $\hat{Q}_{\Phi\Phi}^{-1}$ and $\tilde{Q}_{\Phi\Phi}^{-1}$, respectively. We first state some technical lemmas that are used in the proof of the main results in Section 3. The proofs of all technical lemmas but Lemma B.6 are given in the online Supplemental Material.
Lemma B.1 Suppose that Assumptions A1 and A2(i)-(ii) and (vi) hold. Then
(i) $\|\hat{Q}_{PP} - Q_{PP}\|^{2} = O_{p}(L_{1}^{2}/n)$;
(ii) $\lambda_{\min}(\hat{Q}_{PP}) = \lambda_{\min}(Q_{PP}) + o_{p}(1)$ and $\lambda_{\max}(\hat{Q}_{PP}) = \lambda_{\max}(Q_{PP}) + o_{p}(1)$;
(iii) $\|\hat{Q}_{PP}^{-1} - Q_{PP}^{-1}\|_{\mathrm{sp}} = O_{p}(L_{1}n^{-1/2})$;
(iv) $\|\hat{Q}_{\Phi\Phi} - Q_{\Phi\Phi}\|^{2} = O_{p}(K^{2}/n)$;
(v) $\lambda_{\min}(\hat{Q}_{\Phi\Phi}) = \lambda_{\min}(Q_{\Phi\Phi}) + o_{p}(1)$ and $\lambda_{\max}(\hat{Q}_{\Phi\Phi}) = \lambda_{\max}(Q_{\Phi\Phi}) + o_{p}(1)$.
Lemma B.2 Let ≡ −1P
=1 and ≡ −1P
=1 [ (Z)− 0α] for = 1
Suppose that Assumptions A1-A2 hold. Then
() kk2 = (1);
() kk2 = (−21 );
() eα−α = −11 −1P
=1 +−11 −1P
=1 [ (Z)− 0α] + ;
where kk = (1+ −+121 12) and = 1
Lemma B.3 Suppose that Assumptions A1-A3 hold. Then for = 1
() −1P
=1
³e −
´2 £2¤=
¡21¢for = 0 1;
() −1P
=1
³e −
´2 kΦk =
¡0
21
¢for = 1 2;
() −1P
=1
°°° ³e
´− ()
°°°2 =
¡21
21
¢;
()°°°−1P
=1
h³e
´− ()
iΦ0°°° =
¡1201 + 02
21
¢;
()°°°−1P
=1
h³e
´− ()
i
°°° =
¡−1211
¢
Lemma B.4 Suppose Assumptions A1-A3 hold. Then
() −1P
=1
°°°eΦ −Φ°°°2 =
¡21
21
¢;
()°°°−1P
=1
³eΦ −Φ´Φ0°°°sp=
¡1211
¢;
()°°° eΦΦ −ΦΦ
°°°sp=
¡1201 + 02
21
¢;
()°°° e−1ΦΦ −−1ΦΦ
°°°sp=
¡1201 + 02
21
¢;
()°°°−1P
=1
³eΦ −Φ´ °°° =
¡−1211
¢;
() −1P
=1
³eΦ −Φ´ [ ( 1 )−Φ0β] =
³−1 11
´
Lemma B.5 Let ≡ −1P
=1Φ and ≡ −1P
=1Φ [ ( 1 )−Φ0β] SupposeAssumptions A1-A3 hold. Then
() kk = (1212);
() kk = (−);
()°°°−1ΦΦ−1P
=1ΦP
=1 ()
0 β+1+
³e −
´°°° = (1) for = 1
Lemma B.6 Let ≡ (1 2)0 be an arbitrary 2×1 nonrandom vector such that kk = 1 Supposethat Assumptions A1-A5 hold. Then for = 2
() 2 (1) ≡ −1212P
=110−1∗
1 (1) (e−)
() = 121 (1+12
−1 )
uniformly in 1;
() 2 (1) ≡ −1212P
=11
¯0−1∗
1 (1)¯(e − )
2 = 1212 (21) uni-
formly in 1
Proof. () Let (1) ≡ −1P
=10−1∗
1 (1) () and (1) ≡ [ (1)]
By straightforward moment calculations and Chebyshev inequality, we have (1) = (1)+
(1) where k (1)k = (12−12−12) In fact, sup1∈X1 k (1)k = (
12( log)−12)with a simple application of Bernstein inequality for independent observations [see, e.g., Serfling
(1980, p. 95)]. As an aside, note that the proof of Lemma 7 in Horowitz and Mammen (2004) contains various errors, as they ignore the fact that the number of series approximation terms diverges to infinity as $n \to \infty$. Note
that for = 2
(1) = £ (1 − 1)
0−1∗1 (1)
()¤
=
Z () (1 + 2)
() 1
³1 + 12
´
= 1
Z1 (1 )
() + 1
Z () [1 (1 + )− 1 (1 )]
()
+2
Z () [1 (1 + )− 1 (1 )]
()
≡ 11 (1) + 12 (1) + 23 (1)
As in Horowitz and Mammen (2004, p. 2435), in view of the fact that the components
of 1 (1) are the Fourier coefficients of a function that is bounded uniformly over X1 we
have sup1∈X1 k1 (1)k2 = (1) In addition, using Assumptions A2() and A3(), we can
readily show that sup1∈X1 k2 (1)k =
¡122
¢and sup1∈X1 k3 (1)k =
¡12
¢
It follows that sup1∈X1 k (1)k =
¡1 + 12
¢= (1) under Assumption A5() and
sup1∈X1 k (1)k = (1)
By (B.2), 1 (1) = −P5
=1 −1212
P=11
0−1∗1 (1)
() ≡ −P5
=1 1 (1)
say. Noting that 11 (1) = 1212 (1) (e − ) we have
sup1∈X1
k11 (1)k ≤ 1212 sup1∈X1
k (1)k |e − | = 1212 (1)
³−12
´= (1)
Next, note that 12 (1) =P1
=1 12 (1) where 12 (1) = −1212P
=110−1
∗1 (1)
() 1 (1)
0 S11 We decompose 12 as follows:
12 (1) = −1212X=1
10−1∗
1 ()
1 (1)0 S1−1
= 1212 (1)S1−1 + 1212 (1)S1³−1 −−1
´
≡ 121 (1) + 122 (1) say.
where (1) ≡ −1P
=110−1∗
1 (1) ()
1 (1)0 Let (1) ≡ [ (1)]
As in the analysis of (1), we can show that sup1∈X1°° (1)
°°sp= (1) and sup1∈X1°° (1)− (1)
°°sp≡ ((1 log)
−12) It follows that sup1∈X1 k (1)ksp = (1
+(1 log)−12) = (1) under Assumption A5() Then following the analysis of1 (1)
in the proof of Theorem 3.2, we can show that k121 (1)k =
¡121
¢uniformly in 1
In addition,
sup1∈X1
k122 (1)k ≤ 1212 sup1∈X1
k (1)ksp kS1ksp°°°−1 −−1
°°°spkk
= 1212 (1) (1) (1−12) (
121 −12)
= (1321 −1212)
It follows that sup1∈X1 k12 (1)k = (121) + (1
321 −1212) = (
121)
under Assumption A5(). Analogously,
sup1∈X1
k14 (1)k ≤1X=1
1212 sup1∈X1
k (1)k kS1ksp k2k
= 1212 (1) (−1 ) = (
12121−1 )
By the same token, we can show that sup1∈X1 k13 (1)k =
¡121
¢and sup1∈X1 ||15
(1) || = (12121
−1 ) It follows that sup1∈X1 k1 (1)k = 121 (1 + 12
−1 )
() By (B.2) and Cauchy-Schwarz inequality, 2 (1) ≤ 5P5
=1 −1212
P=11
2¯
0−1∗1 (1)
¯≡ 5P5=1 2 say. It is easy to show that sup1∈X1 21 (1) = (
−1212)
Note that 22 (1) = −1212P
=11
¯0−1∗
1 (1)¯22 ≤ 1
P1=1 22 (1) where
22 (1) = −1212X=1
1
¯0−1∗
1 (1)¯1 (1)
0 S1101S011 (1)
= 1212tr¡S1101S01 (1)
¢and (1) ≡ −1
P=11
¯0−1∗
1
¯1 (1)
1 (1)0 As in the analysis of (1),
we can show that sup1∈X1 k (1)ksp = (1) By the fact tr() ≤ max ()tr() and
kksp = max () for any symmetric matrix and conformable positive-semidefinite matrix
22 (1) ≤ 1212tr¡S1101S01
¢ k (1)ksp = 1212 kS11k2sp k (1)ksp≤ 1212 kS1k2sp k1k2sp k (1)ksp= 1212 (1)
¡1
−1¢ (1) =
³1
−1212´uniformly in 1
It follows that sup1∈X1 22 (1) =
¡1
−1212¢ Similarly, uniformly in 1
24 (1) ≤ 1212 kS1k2sp k2k2sp k (1)ksp= 1212 (1) (
−21 ) (1) =
³12
−21 12
´
By the same token, 23 (1) = (1−1212) and 25 (1) = (
12−21 12) uniformly
in 1 Consequently, sup1∈X1 2 (1) = 1212 (21)
Proof of Theorem 3.1. () Noting that = ( 1 )+ = eΦ0β+ + [ ( 1 )−eΦ0β] we haveeβ − β = e−1ΦΦ−1 X
=1
eΦ − β = e−1ΦΦ−1 X=1
eΦ + e−1ΦΦ−1 X=1
eΦ h ( 1 )− eΦ0βi= e−1ΦΦ + e−1ΦΦ + e−1ΦΦ−1 X
=1
Φ
³Φ − eΦ´0 β + e−1ΦΦ−1 X
=1
³eΦ −Φ´ + e−1ΦΦ−1 X
=1
³eΦ −Φ´ £ ( 1 )−Φ0β¤− e−1ΦΦ−1 X
=1
³eΦ −Φ´³eΦ −Φ´0 β≡ 1 + 2 + 3 + 4 + 5 − 6 say.
Note that 1 = −1ΦΦ+ 1 where 1 = ( e−1ΦΦ−−1ΦΦ) satisfies k1k ≤°°° e−1ΦΦ −−1ΦΦ
°°°sp
×kksp =
£(1201 + 02
21)
12−12¤by Lemmas B.4() and B.5(). Similarly,
2 = −1ΦΦ + 2 where 2 = ( e−1ΦΦ − −1ΦΦ) satisfies k2k ≤°°° e−1ΦΦ −−1ΦΦ
°°° kksp=
£(1201 + 02
21)
−¤ by Lemmas B.4() and B.5(). Next, note that 3 =−1ΦΦ
−1P=1Φ
³Φ − eΦ´0 β+³ e−1ΦΦ −−1ΦΦ
´−1
P=1Φ
³Φ − eΦ´0 β ≡ 31 + 32 We
further decompose 31 as follows:
31 = −−1ΦΦ−1X=1
Φ
X=1
h³e
´− ()
i0β+1+
=X=1
−1ΦΦ−1
X=1
Φ³†
´0β+1+
³ − e
´=
X=1
−1ΦΦ−1
X=1
Φ+1+ ()³ − e
´+
X=1
−1ΦΦ−1
X=1
Φ
h+1+
³†
´− +1+ ()
i ³ − e
´+
X=1
−1ΦΦ−1
X=1
Φ
∙³†
´0β+1+ − +1+
³†
´¸³ − e
´≡
X=1
311 +X=1
312 +X=1
313 say,
where † lies between
e and Noting that |+1+³†
´− +1+ () | ≤ |e − |
where = max1≤≤ max∈U |+1+ ()| = (1) by Assumptions A1() and A2()
k312k ≤ −1P
=1 kΦk (−e)2 = 0
¡21¢by Lemma B.3() By Assumption A2(),
Cauchy-Schwarz inequality and Lemma B.3()
k313k ≤ ¡−
¢ °°−1ΦΦ°°sp −1 X=1
kΦk¯ − e
¯
≤ ¡−
¢ °°−1ΦΦ°°sp(−1
X=1
kΦk2)12(
−1X=1
³ − e
´2)12=
¡−
¢ (1)
³12
´ (1) = −+12 (1)
By Lemma B.5() k311k = (1) which dominates both k312k and k313k Thusk32k ≤
°°° e−1ΦΦ −−1ΦΦ°°°sp
(1) = [(1201 + 02
21)1] It follows that 3 =P
=1−1ΦΦ
−1P=1Φ ×+1+ () (−e)+3 where
°°3°° = [(1201+02
21)1]
By Lemmas B.4()-() k4k =
¡−1211
¢and
k5k ≤°°° e−1ΦΦ°°°
sp
°°°°°−1X=1
³eΦ −Φ´ £ ( 1 )−Φ0β¤°°°°° =
¡−11
¢
where we use the fact that°°° e−1ΦΦ°°°
sp≤°°° e−1ΦΦ −−1ΦΦ
°°°sp+°°−1ΦΦ°°sp = (1)+ (1) = (1)
For 6 we have by Taylor expansion and triangle inequality that
k6k ≤X=1
°°°°° e−1ΦΦ−1X=1
³eΦ −Φ´ h ³e
´− ()
i0β+1+
°°°°°=
X=1
°°°°° e−1ΦΦ−1X=1
³eΦ −Φ´ ³ †´0 β+1+
³e −
´°°°°°≤
X=1
°°° e−1ΦΦ°°°sp
°°°°°−1X=1
³eΦ −Φ´ +1+ ³ †´³e −
´°°°°°+
X=1
°°° e−1ΦΦ°°°sp
°°°°°−1X=1
³eΦ −Φ´ ∙ ³ †´0 β+1+ − +1+
³†
´¸³e −
´°°°°°≡
X=1
61 +X=1
62 say.
By the triangle inequality, Lemmas B.3() and B.4()
61 ≤
°°° e−1ΦΦ°°°sp
(−1
X=1
°°°eΦ −Φ°°°2)12(−1 X=1
³e −
´2)12= (1) (11) (1) =
¡1
21
¢
Similarly, we can show that 62 = −
¡1
21
¢by Assumption A2() and Lemmas B.3()
and B.4() It follows that k6k =
¡1
21
¢ Combining the above results yield the conclu-
sion in ()
() Noting that°°−1ΦΦ°° ≤ °°−1ΦΦ°°sp kk =
¡1212
¢and ||−1ΦΦ|| ≤
°°−1ΦΦ°°sp×kk = (
−) by Lemmas B.5()-() the result in part () follows from part (), Lemma
B.4 and the fact that kRk = (1) under Assumption A5()
() By () and Assumptions A2() , supw∈W¯e (w)− (w)
¯= supw∈W |Φ (w)0 (eβ − β) +
[β0Φ (w)− (w)]|≤ supw∈W kΦ (w)k°°°eβ − β°°°+sup∈W ¯β0Φ (w)− (w)
¯= [0 ( + 1)]
as the second term is () ¥
Proof of Theorem 3.2.
Let 1 ≡ −−2 (2)−− () −+1 (11)−−+1 (11)−+1+1(1)−− 2+1() and Y1 ≡ (11 1)0 Using the notation defined at the end of section 2.2,we have
b1 (1) =£−1X1 (1)0K1X1 (1)−1¤−1−1X1 (1)0K1X1 (1)Y1+£−1X1 (1)0K1X1 (1)−1¤−1−1X1 (1)K1(
eY1 −Y1)≡ 1 (1) + 2 (1) say.
By standard results in local-linear regressions [e.g., Masry (1996) and Hansen (2008)], −1−1
×X1 (1)0K1X1 (1)−1 = 1 (1)
⎛⎝ 1 0
0R2 ()
⎞⎠ + (1) uniformly in 1 1212
× [1 (1)− 1 (1)]→ (0Ω1 (1)) and sup1∈X1 k1 (1)k =
³( log)−12 + 2
´
where 1 (1) and Ω1 (1) are defined in Theorem 3.2. It suffices to prove the theorem by showing
that −1212−1X1 (1)0K1(eY1 −Y1) = (1) uniformly in 1 (for part () of Theorem 3.2
we only need the pointwise result to hold).
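To fix ideas about the third-stage smoother that these expressions analyze, here is a stripped-down local-linear estimator (our illustration; the data-generating process and bandwidth are hypothetical, not from the paper). The intercept of the weighted least-squares fit estimates the conditional mean and the slope estimates the gradient, mirroring the role of the matrices above:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local-linear fit at x0: weighted LS of y on (1, x - x0) with kernel weights.

    Returns estimates of the conditional mean m(x0) and gradient m'(x0).
    """
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)  # Epanechnikov weights
    X = np.column_stack([np.ones_like(x), x - x0])
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    beta = np.linalg.solve(XtWX, XtWy)
    return beta[0], beta[1]

rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

# Estimate the regression function and its gradient at x0 = 0.25;
# the true values are sin(pi/4) and pi*cos(pi/4).
mhat, ghat = local_linear(0.25, x, y, h=0.15)
print(mhat, ghat)
```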
We make the following decomposition: ()−12−1X1 (1)K1(Y1−eY1) = −1212P
=1
1−1∗
1 (1) (1− e1) = (1) +P
=2 (1) +P1
=1 (1) +P
=1 (1) where
(1) =√ (e− )−112
X=1
1−1∗
1 (1)
(1) = −1212X=1
1−1∗
1 (1) [e ()− ()]
(1) = −1212X=1
1−1∗
1 (1) [e+ (1)− + (1)]
(1) = −1212X=1
1−1∗
1 (1)he+1+(e)− +1+()
i
We prove the first part of the theorem by showing that (1) (1) = (1) (2) (1) =
(1) for = 2 (3) (1) = (1) for = 1 1 and (4) (1) = (1) for
= 1 all uniformly in 1
(1) holds by noticing that√ (e− ) = (1) and −1
P=11
−1∗1 (1) = (1)
uniformly in 1 Let ≡ (1 2)0 be an arbitrary 2 × 1 nonrandom vector such that kk = 1
Recall that (1) ≡ −1P
=10−1∗
1 (1) () For (2) we make the following
decomposition
0 (1) = −1212X=1
10−1∗
1 (1) ()
0 S³eβ − β´
+−1212X=1
10−1∗
1 (1)£ ()
0 Sβ − ()¤
= 1212 (1)0 S−1ΦΦ + 1212 (1)S−1ΦΦ
−−1212 (1)0 S−1ΦΦX
=1
Φ
X=1
³e −
´+ −1212 (1)
0 SR
+−1212X=1
10−1∗
1 (1)£ ()
0 Sβ − ()¤
≡ 1 (1) +2 (1)−3 (1) +4 (1) +5 (1)
where recall ≡ ()0 β+1+ ≡ −1
P=1Φ and ≡ −1
P=1Φ [ ( 1 )
−β0Φ ] Let (1) ≡ [ (1)] and (1) = (1) − (1) By the proof of Lemma
B.6() k (1)k = (12( log)−12) k (1)k =
¡1 + 12
¢and k (1)k =
(1) uniformly in 1. Write 1 (1) = 1212 (1)0 S−1ΦΦ + 1212 (1)
0 S−1ΦΦ≡ 11 (1) +12 (1) say. Noting that
£211 (1)
¤= (1)
0 S−1ΦΦ¡ΦΦ
02
¢−1ΦΦS
0 (1)
≤ max¡¡ΦΦ
02
¢¢[min (ΦΦ)]
−2 max¡SS0
¢ k (1)k2= (1) (1) (1) = ()
we have |11 (1)| =
¡12
¢for each 1 ∈ X1 Let (1) ≡ −1ΦΦS
0 (1). Then we can
write (1)0 S−1ΦΦ as
−1P=1 (1)
0ΦNoting that[ (1)0Φ] = 0 and[ (1)
0Φ]2
= (1)0¡ΦΦ
02
¢ (1) ≤ max (ΦΦ)
°°−1ΦΦ°°2sp sup1∈X1 k (1)k = (1) we can read-
ily divide X into intervals of appropriate length and apply Bernstein inequality to show that (1)
0 S−1ΦΦ =
¡( log)−12
¢ Consequently, sup1∈X1 |11 (1)| = 1212 ((
log)−12) =
¡( log)−12
¢= (1) For 12 (1) we have by Lemma B.5()
sup1∈X1
k12 (1)k ≤ 1212 sup1∈X1
k (1)k kSksp°°−1ΦΦ°°sp kk
= 1212 (12( log)−12) (1) (1) (
12−12)
= (( log)−12) = (1)
It follows that sup1∈X1 |1 (1)| = (1) By Lemma B.5() and Assumptions A2() and
() and A5,
sup1∈X1
|2 (1)| ≤ 1212 sup1∈X1
k (1)k°°−1ΦΦ°°sp kSksp kk
= 1212 (1) (1)(1) (−) = (1)
sup1∈X1
|4 (1)| ≤ 1212 sup1∈X1
k (1)k kSksp kRk
= 1212 (1) (1)
³−12−12
´= (1)
and
sup1∈X1
|5 (1)| ≤ ¡−
¢1212 sup
1∈X1−1
X=1
1
¯0−1∗
1 (1)¯
=
³1212−
´= (1)
For 3 (1) we have 3 (1) =P
=13 (1) where 3 (1) = −1212 (1)0
×S−1ΦΦP
=1 Φ(e−)Using (B.2), 3 (1) = −
P5=1
−1212 (1)0 S−1ΦΦ
P=1
Φ ≡ −P5
=13 (1) say. First, noting that is uniformly bounded, we can show
°°°−1P=1Φ
°°°sp= (1) using arguments similar to those used in the proof of Lemma
B.5() It follows that
sup1∈X1
|31 (1)| ≤ 12 sup1∈X1
k (1)k kSksp°°−1ΦΦ°°sp
°°°°°°−1X
=1
Φ
°°°°°°sp
12 | − e|= 12 (1) (1) (1) (1) (1) (1) = (1)
Now notice that 32 (1) = (1)32 (1) +
(2)32 (1) where
(1)32 (1) =
1X=1
−1212 (1)0 S−1ΦΦ
X=1
Φ1 (1)
0 S11
(2)32 (1) =
1X=1
−1212 (1)0 S−1ΦΦX
=1
Φ1 (1)
0 S11
Let (1) = (1)0 S−1ΦΦ
−1P=1 Φ
1 (1) and (1) = [ (1)] Ar-
guments like those used to study (1) in the proof of Lemma B.6() show that k (1)k = (k (1)k) =
¡1 + 12
¢= (1) under Assumption A5() and || (1)− [ (1)] ||
= k (1)k ((12 log)−12) = ((
12 log)−12) uniformly in 1 We further make
the following decomposition: (1)32 (1) =
P1=1
−1212 (1)0 S−1ΦΦ
P=1 Φ
1 (1)0
S1−1 =P3
=1(1)32 (1) where
(11)32 (1) =
1X=1
1212 (1)0 S1−1
(12)32 (1) =
1X=1
1212 (1)0 S1(−1 −−1 )
(13)32 (1) =
1X=1
1212 (1)0 S1−1
Following the analysis of11 (1) we can show that sup1∈X1¯(11)32 (1)
¯=
¡( log)12
¢.
In addition,
sup1∈X1
¯(12)32 (1)
¯≤ 1212 sup
1∈X1
1X=1
k (1)k kS1ksp°°°−1 −−1
°°°spkk
= 1212 (1) (1)
³1
−12´ (
121 −12) = (1)
and
sup1∈X1
¯(13)32 (1)
¯≤ 1212 sup
1∈X1
1X=1
k (1)k kS1ksp°°°−1°°°
spkksp
= 1212 ((12 log)−12) (1) (1)
³121 −12
´= (1)
It follows that sup1∈X1¯(1)32 (1)
¯= (1) For
(2)32 (1) we have
sup1∈X1
¯(2)32 (1)
¯≤ 1212 sup
1∈X1k (1)k kSksp
°°−1ΦΦ°°sp 1X=1
kksp kS1ksp k1k
= 1212
³(12 log)−12
´ (1) (1) (1) (
121 −12) = (1)
where ≡ −1P
=1 Φ1 (1)
0 we use the fact that kksp = (1) by following
similar arguments to those used in the proof of Lemma B.5() and noticing that is uniformly
bounded. Consequently we have shown that sup1∈X1 |32 (1) | = (1) Analogously,
sup1∈X1
|34 (1)| ≤ 1212 sup1∈X1
k (1)ksp kSksp°°−1ΦΦ°°sp 1X
=1
kksp kS1ksp k2k
= 1212 (1) (1) (1) (1) (1)
³−1
´= (1)
By the same token, we can show that 33 (1) = (1) and 3 (1) = (1) uniformly
in 1 It follows that sup1∈X1 k3 (1)k = (1) for = 1 Analogously, we can show
that (3) : sup1∈X1 k (1)k = (1) for = 1 1
Now we show (4) Observe that 0 (1) = −1212P
=110−1∗
1 (1) [e+1+(e)
−+1+(e)] +−1212
P=11
0−1∗1 (1) [+1+(
e)−+1+()] ≡ 1 (1)+
2 (1) say. In view of the fact that e+1+(e)−+1+(e) = (e)0S+1+(eβ − β)+
[(e)0β −+1+(e)] we have 1 (1) =
P3=11 (1) where
11 (1) = −1212X=1
10−1∗
1 (1) ()
0 S+1+³eβ − β´
12 (1) = −1212X=1
10−1∗
1 (1)h(e)− ()
i0S+1+
³eβ − β´ 13 (1) = −−1212
X=1
10−1∗
1 (1)h+1+(
e)− (e)0β
i.
Analogous to the analysis of 1 (1) we can readily show that sup1∈X1 |11 (1)| = (1)
For 12 (1) by Taylor expansion,
12 (1) = −1212X=1
10−1∗
1 (1) (e − ) ()
0³eβ−β
´+1
2−1212
X=1
10−1∗
1 (1) (e − )2
³‡
´0 ³eβ−β
´≡ 121 (1) +
1
2122 (1) say,
where ‡ lies between
e and By Theorem 3.1 and Lemmas B.6()-(), sup1∈X1 |121 (1)|= 121 (1 + 12
−1 ) ( + 1) = (1) and
sup1∈X1
|122 (1)| ≤ 2 sup1∈X1
(−1212
X=1
10−1∗
1 (1) (e − )2
)°°°eβ−β
°°°= 2
1212 (1−1 +
−21 ) ( + 1) = (1)
In addition, sup1∈X1 k13 (1)k ≤ 1212 (−) sup1∈X1 −1P
=11
°°−1∗1 (1)
°° = (
12 12−) = (1) It follows that sup1∈X1 |1 (1)| = (1)
By Taylor expansion,
2 (1) = −1212X=1
10−1∗
1 (1) ()³e −
´+−1212
X=1
10−1∗
1 (1) +1+(‡)³e −
´2≡ 21 (1) +22 (1)
Arguments like those used to study 3 (1) show that sup1∈X1 |21 (1)| = (1) By
Lemma B.6(), sup1∈X1 |22 (2)| ≤ sup1∈X1−1212P
=11
¯0−1∗
1 (1)¯(e−
)2 = 1212 (
21) = (1) where = sup∈U +1+ () = (1) ¥
References
Ai, C., and Chen, X. (2003), "Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions," Econometrica, 71, 1795-1843.
Bernal, R. (2008), "The Effect of Maternal Employment and Child Care on Children's Cognitive Development," International Economic Review, 49, 1173-1209.
Bernal, R., and Keane, M. P. (2011), "Child Care Choices and Children's Cognitive Achievement: the Case of Single Mothers," Journal of Labor Economics, 29, 459-512.
Blau, F. D., and Grossberg, A. J. (1992), "Maternal Labor Supply and Children's Cognitive Development," Review of Economics and Statistics, 74, 474-481.
Cameron, S. V., and Heckman, J. J. (1998), "Life Cycle Schooling and Dynamic Selection Bias: Models and Evidence for Five Cohorts," NBER Working Paper 6385, National Bureau of Economic Research, Inc.
Chen, X. (2007), "Large Sample Sieve Estimation of Semi-Nonparametric Models," in J. J. Heckman and E. Leamer (eds.), Handbook of Econometrics, 6B (Chapter 76), 5549-5632, New York: Elsevier Science.
Chen, X., and Pouzo, D. (2012), "Estimation of Nonparametric Conditional Moment Models with Possibly Nonsmooth Generalized Residuals," Econometrica, 80, 277-321.
Darolles, S., Fan, Y., Florens, J. P., and Renault, E. (2011), "Nonparametric Instrumental Regression," Econometrica, 79, 1541-1565.
Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, London: Chapman & Hall.
Gao, J., and Phillips, P. C. B. (2013), "Semiparametric Estimation in Triangular System Equations with Nonstationarity," Journal of Econometrics, 176, 59-79.
Hahn, J., and Ridder, G. (2013), "Asymptotic Variance of Semiparametric Estimators with Generated Regressors," Econometrica, 81, 315-340.
Hall, P., and Horowitz, J. L. (2005), "Nonparametric Methods for Inference in the Presence of Instrumental Variables," Annals of Statistics, 33, 2904-2929.
Hansen, B. E. (2008), "Uniform Convergence Rates for Kernel Estimation with Dependent Data," Econometric Theory, 24, 726-748.
Henderson, D. J., and Parmeter, C. F. (2014), Applied Nonparametric Econometrics, New York: Cambridge University Press.
Horowitz, J. L. (2014), "Nonparametric Additive Models," in J. Racine, L. Su, and A. Ullah (eds.), The Oxford Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics, pp. 129-148, Oxford: Oxford University Press.
Horowitz, J. L., and Mammen, E. (2004), "Nonparametric Estimation of an Additive Model with a Link Function," Annals of Statistics, 32, 2412-2443.
James-Burdumy, S. (2005), "The Effect of Maternal Labor Force Participation on Child Development," Journal of Labor Economics, 23, 177-211.
Keane, M. P., and Wolpin, K. I. (2001), "The Effect of Parental Transfers and Borrowing Constraints on Educational Attainment," International Economic Review, 42, 1051-1103.
Keane, M. P., and Wolpin, K. I. (2002), "Estimating Welfare Effects Consistent with Forward-Looking Behavior," Journal of Human Resources, 37, 600-622.
Kim, W., Linton, O. B., and Hengartner, N. W. (1999), "A Computationally Efficient Estimator for Additive Nonparametric Regression with Bootstrap Confidence Intervals," Journal of Computational and Graphical Statistics, 8, 278-297.
Li, Q. (2000), "Efficient Estimation of Additive Partially Linear Models," International Economic Review, 41, 1073-1092.
Li, Q., and Racine, J. (2007), Nonparametric Econometrics: Theory and Practice, Princeton: Princeton University Press.
Mammen, E., Rothe, C., and Schienle, M. (2012), "Nonparametric Regression with Nonparametrically Generated Covariates," Annals of Statistics, 40, 1132-1170.
Martins-Filho, C., and Yang, K. (2007), "Finite Sample Performance of Kernel-Based Regression Methods for Nonparametric Additive Models under Common Bandwidth Selection Criterion," Journal of Nonparametric Statistics, 19, 23-62.
Martins-Filho, C., and Yao, F. (2012), "Kernel-Based Estimation of Semiparametric Regression in Triangular Systems," Economics Letters, 115, 24-27.
Masry, E. (1996), "Multivariate Local Polynomial Regression for Time Series: Uniform Strong Consistency Rates," Journal of Time Series Analysis, 17, 571-599.
Newey, W. K. (1997), "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 79, 147-168.
Newey, W. K., and Powell, J. L. (2003), "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71, 1565-1578.
Newey, W. K., Powell, J. L., and Vella, F. (1999), "Nonparametric Estimation of Triangular Simultaneous Equations Models," Econometrica, 67, 565-603.
Ozabaci, D., and Henderson, D. J. (2012), "Gradients via Oracle Estimation for Additive Nonparametric Regression with Application to Returns to Schooling," Working paper, State University of New York at Binghamton.
Pinkse, J. (2000), "Nonparametric Two-Step Regression Estimation When Regressors and Error Are Dependent," Canadian Journal of Statistics, 28, 289-300.
Powell, M. J. D. (1981), Approximation Theory and Methods, Cambridge: Cambridge University Press.
Roehrig, C. S. (1988), "Conditions for Identification in Nonparametric and Parametric Models," Econometrica, 56, 433-447.
Schumaker, L. L. (2007), Spline Functions: Basic Theory, 3rd ed., Cambridge: Cambridge University Press.
Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, New York: John Wiley & Sons.
Su, L., and Jin, S. (2012), "Sieve Estimation of Panel Data Models with Cross Section Dependence," Journal of Econometrics, 169, 34-47.
Su, L., Murtazashvili, I., and Ullah, A. (2013), "Local Linear GMM Estimation of Functional Coefficient IV Models with Application to the Estimation of Rate of Return to Schooling," Journal of Business & Economic Statistics, 31, 184-207.
Su, L., and Ullah, A. (2008), "Local Polynomial Estimation of Nonparametric Simultaneous Equations Models," Journal of Econometrics, 144, 193-218.
Ullah, A. (1985), "Specification Analysis of Econometric Models," Journal of Quantitative Economics, 1, 187-209.
Vella, F. (1991), "A Simple Two-Step Estimator for Approximating Unknown Functional Forms in Models with Endogenous Explanatory Variables," Working Paper, Department of Economics, Australian National University.
45
Supplemental Material On“Additive Nonparametric Regression in the Presence of Endogenous Regressors”
Deniz Ozabaci,1 Daniel J. Henderson,2 and Liangjun Su3
This appendix provides proofs for some technical lemmas in the above paper.
Proof of Lemma B.1. By straightforward moment calculations, we can show that $E\|\hat{Q}_{PP} - Q_{PP}\|^{2} = O(L_{1}^{2}/n)$ under Assumptions A1(i)-(ii) and A2(i). Then (i) follows from the Markov inequality. By the Weyl inequality [e.g., Bernstein (2005, Theorem 8.4.11)] and the fact that $\lambda_{\max}(A) \le \|A\|$ for any symmetric matrix $A$ (as $|\lambda_{\max}(A)|^{2} \le \lambda_{\max}(A^{2}) \le \|A\|^{2}$), we have
$$\lambda_{\min}(\hat{Q}_{PP}) \le \lambda_{\min}(Q_{PP}) + \lambda_{\max}(\hat{Q}_{PP} - Q_{PP}) \le \lambda_{\min}(Q_{PP}) + \|\hat{Q}_{PP} - Q_{PP}\| = \lambda_{\min}(Q_{PP}) + o_{p}(1).$$
Similarly,
$$\lambda_{\min}(\hat{Q}_{PP}) \ge \lambda_{\min}(Q_{PP}) + \lambda_{\min}(\hat{Q}_{PP} - Q_{PP}) \ge \lambda_{\min}(Q_{PP}) - \|\hat{Q}_{PP} - Q_{PP}\| = \lambda_{\min}(Q_{PP}) - o_{p}(1).$$
Analogously, we can prove the second part of (ii). Thus (ii) follows. By the submultiplicative property of the spectral norm, (i)-(ii) and Assumption A2(i),
$$\|\hat{Q}_{PP}^{-1} - Q_{PP}^{-1}\|_{\mathrm{sp}} = \|\hat{Q}_{PP}^{-1}(Q_{PP} - \hat{Q}_{PP})Q_{PP}^{-1}\|_{\mathrm{sp}} \le \|\hat{Q}_{PP}^{-1}\|_{\mathrm{sp}}\,\|\hat{Q}_{PP} - Q_{PP}\|_{\mathrm{sp}}\,\|Q_{PP}^{-1}\|_{\mathrm{sp}} = O_{p}(1)\,O_{p}(L_{1}n^{-1/2})\,O(1) = O_{p}(L_{1}n^{-1/2}),$$
where we use the fact that $\|\hat{Q}_{PP}^{-1}\|_{\mathrm{sp}} = [\lambda_{\min}(\hat{Q}_{PP})]^{-1} = [\lambda_{\min}(Q_{PP}) + o_{p}(1)]^{-1} = O_{p}(1)$ by (ii) and Assumption A2(i). Then (iii) follows. The proof of (iv)-(v) is analogous to that of (i)-(ii) and thus omitted. ¥
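The Weyl-inequality step in this proof can be illustrated numerically; the following sketch (ours, not part of the paper) checks the eigenvalue perturbation bounds used above on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    # A "population" symmetric matrix A and a symmetric perturbation E.
    M = rng.standard_normal((6, 6))
    A = M @ M.T
    N = 0.1 * rng.standard_normal((6, 6))
    E = (N + N.T) / 2
    eA, eE, eAE = (np.linalg.eigvalsh(S) for S in (A, E, A + E))  # ascending order
    # Weyl: lambda_min(A) + lambda_min(E) <= lambda_min(A + E)
    #       <= lambda_min(A) + lambda_max(E).
    assert eA[0] + eE[0] <= eAE[0] + 1e-10
    assert eAE[0] <= eA[0] + eE[-1] + 1e-10
    # Hence |lambda_min(A + E) - lambda_min(A)| <= ||E||, the bound used above.
    assert abs(eAE[0] - eA[0]) <= np.linalg.norm(E) + 1e-10
```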
Proof of Lemma B.2. () By Assumption A1() and A2() kk2 = −2trP=1(
0
2)
≤ −1 (1 + 1) max () = (1). Then kk2 = (1) by Markov inequality.
() By the facts that kk2sp = kk2 for any vector |0| ≤ kk kk for any two conformablevectors and and that κ0κ ≤ max () kκk2 for any p.s.d. matrix and conformable vector
1. Deniz Ozabaci, Dept of Economics, State Univ of New York, Binghamton, NY 13902-6000, (607) 777-2572, Fax: (607) 777-2681, e-mail: [email protected].
2. Daniel J. Henderson, Dept of Economics, Finance and Legal Studies, Univ of Alabama, Tuscaloosa, AL 35487-0224, (205) 348-8991, Fax: (205) 348-0186, e-mail: [email protected].
3. Liangjun Su, School of Economics, Singapore Management University, 90 Stamford Road, Singapore, 178903; Tel: (65) 6828-0386; e-mail: [email protected].
κ Cauchy-Schwarz inequality, Lemma B.1() and Assumptions A2(), we have
kk2 = kk2sp = max¡
0
¢= max
kκk=1−2
X=1
X=1
κ0 0κ£ (Z)− 0α
¤ £ (Z)− 0α
¤≤ max
kκk=1
(−1
X=1
nκ0 0κ
£ (Z)− 0α
¤2o12)2
≤ (−21 ) max
kκk=1
(−1
X=1
κ0 0κ
)≤ (
−21 )max ( ) = (
−21 )
() Noting that = (Z) + = 0α + + [ (Z)− 0α] by Lemma B.1(),
w.p.a.1 we have
eα−α =
ÃX=1
0
!− X=1
−α
= −1−1
X=1
+−1−1
X=1
£ (Z)− 0α
¤= −1 +−1 ≡ 1 + 2 say. (B.1)
Note that 1 = −11 + 1 where 1 =³−1 −−1
´ satisfies that
k1k ≤ =ntrh³−1 −−1
´
0
³−1 −−1
´io12≤ kksp
°°°−1 −−1°°° = (
121 12) (
121 12) = (1)
by Lemmas B.1() and B.2(). For 2 we have 2 = −11 +2 where 2 =¡−11 −−11
¢
satisfies that
k2k ≤ kksp°°°−1 −−1
°°° = (−1 ) (
121 12) = (
−+121 12)
by Lemmas B.1() and B.2(). The result follows. ¥
Proof of Lemma B.3. ()We only prove the = 1 case as the proof of the other case is almost
identical. By the definition of e and (B.1), we can decompose e − = [ − e (Z)]−
as follows
e − = ( − e) + 1X=1
[ (1)− e (1)] +
2X=1
[1+ (2)− e1+ (2)]
= − (e − )−1X=1
1 (1)0 S11 −
2X=1
1 (2)0 S11+1
−1X=1
1 (1)0 S12 −
2X=1
1 (2)0 S11+2
≡ −1 − 2 − 3 − 4 − 5 say. (B.2)
Then by Cauchy-Schwarz inequality, −1P
=1(e−)
22 ≤ 5P5
=1 −1P
=1 2
2 ≡ 5
P5=1
say. Apparently, 1 =
¡−1
¢as e − =
¡−12
¢
2 = −1X=1
Ã1X=1
1 (1)0 S11
!22
≤ 1
1X=1
−1X=1
¡1 (1)
0 S11¢22 = 1
1X=1
tr¡S1101S011
¢≤ 1
1X=1
max (1) tr¡1
01S01S1
¢ ≤ 1
1X=1
max (1) kS1k2sp k1k2
where 1 = −1P
=1 1 (1)
1 (1)0 2 such that max(1) = (1) by As-
sumption A3() and arguments analogous to those used in the proof of Lemma B.1(). In
addition, kS1k2sp = max (S1S01) = 1 and k1k2 ≤°°°−1°°°2
spkk2 = (1) (1) =
(1) by Lemma B.1() and B.2() and Assumption A2(). It follows that 2 = (1)×1× (1) = (1) Similarly, using the fact that k2k2 ≤
°°°−1°°°2spkk2 = (1)
(−21 ) we have
4 = −1X=1
Ã1X=1
1 (1)0 S12
!22 ≤ 1
1X=1
max (1) kS1k2sp tr¡2
02
¢= (1)× 1× (
−21 ) = (
−21 )
By the same token, 3 = (1−1) and 5 = (
−21 )
() The result follows from () and the fact that max1≤≤ kΦk = (0) under Assump-
tion A2()
() By Assumption A2() Taylor expansion and ()
−1X=1
°°° ³e
´− ()
°°°2 = −1X=1
°°° ³ †´³e −
´°°°2≤
¡21¢−1
X=1
³e −
´2=
¡21
21
¢
where † lies between
e and
() By Assumption A2() Taylor expansion and triangle inequality,°°°°°−1X=1
h³e
´− ()
iΦ0
°°°°°sp
is bounded by°°°−1P
=1 ()Φ
0
³e −
´°°°sp+ 12
°°°°−1P=1
³‡
´Φ0³e −
´2°°°°sp
≡
1 + 2 where ‡ lies between
e and By triangle and Cauchy-Schwarz inequalities
and ()
1 ≤ −1X=1
k ()ksp°°°Φ0 ³e −
´°°°sp
≤(−1
X=1
k ()k2)12(
−1X=1
kΦk2¯ e −
¯2)12=
³12
´ (01) =
³1201
´
By triangle inequality and () 2 ≤ (02)−1P
=1(e − )
2 = (0221). Then
() follows.
() Let Γ ≡ [(e1) − (1) [(e) − ()]]
0 and e = (1 )0 Then we
can write −1P
=1[(e)− ()] as
−1Γ0e Let D ≡ (XZU)=1 By the law ofiterated expectations, Taylor expansion, Assumptions A1() A3() and A2() and ()
n°°−1Γ0e°°2 |D
o= −2
£tr¡Γ0ee
0Γ¢¤= −2
£tr¡Γ0
¡ee0|D
¢Γ¢¤
= −2X=1
[(e)− ()]22
≤ (1)−2
X=1
³e −
´22 =
¡−121
21
¢
It follows that°°−1Γ0e°° =
¡−1211
¢by the conditional Chebyshev inequality. ¥
Proof of Lemma B.4. ()Noting that −1P
=1
°°°eΦ −Φ°°°2 =P=1
−1P=1
°°° ³e
´− ()
°°°2 the result follows from Lemma B.3().
() Noting that°°°−1P
=1
³eΦ −Φ´Φ0°°°2 = P=1
°°°−1P=1
h³e
´− ()
iΦ0°°°2
the result follows from Lemma B.3().
() Noting that eΦΦ −ΦΦ = −1P
=1(eΦeΦ0 − ΦΦ0) = −1
P=1(
eΦ − Φ)(eΦ − Φ)0+−1
P=1(
eΦ −Φ)Φ0+ −1P
=1Φ(eΦ−Φ)0 the result follows from ()-() and the triangle
inequality.
(iv) By the triangle inequality,
\[
\big\|\tilde Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
\le \big\|\tilde Q_{\Phi\Phi}^{-1}-\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
+ \big\|\hat Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}.
\]
Arguments like those used in the proof of Lemma B.1(ii) show that $\big\|\tilde Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}} = \big[\lambda_{\min}\big(\tilde Q_{\Phi\Phi}\big)\big]^{-1} = \big[\lambda_{\min}\big(Q_{\Phi\Phi}\big) + o_P(1)\big]^{-1} = O_P(1)$, where the second equality follows from (iii) and Lemma B.1. By the submultiplicative property of the spectral norm and (iii),
\[
\big\|\tilde Q_{\Phi\Phi}^{-1}-\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
= \big\|\tilde Q_{\Phi\Phi}^{-1}\big(\hat Q_{\Phi\Phi}-\tilde Q_{\Phi\Phi}\big)\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
\le \big\|\tilde Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}\big\|\tilde Q_{\Phi\Phi}-\hat Q_{\Phi\Phi}\big\|_{\mathrm{sp}}\big\|\hat Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
= O_P\big(\kappa^{1/2}\xi_0\eta_{1n}+\xi_0\xi_2\eta_{1n}^2\big).
\]
Similarly, $\big\|\hat Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}} = O_P\big((\kappa/n)^{1/2}\big)$ by Lemma B.1. It follows that
\[
\big\|\tilde Q_{\Phi\Phi}^{-1}-Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}}
= O_P\big(\kappa^{1/2}\xi_0\eta_{1n}+\xi_0\xi_2\eta_{1n}^2\big).
\]
(v) Noting that $\big\|n^{-1}\sum_{i=1}^{n}\big(\tilde\Phi_i-\Phi_i\big)e_i\big\|^2 = \sum_{l}\big\|n^{-1}\sum_{i=1}^{n}\big[p(\tilde U_{li})-p(U_{li})\big]e_i\big\|^2$, the result follows from Lemma B.3(v).
(vi) Let $v_i$ denote the sieve approximation error, $v_i \equiv m(X_i,U_i)-\Phi_i'\beta$. By the triangle inequality, Assumption A2, the Jensen inequality, and (i), we have
\[
\Big\|n^{-1}\sum_{i=1}^{n}\big(\tilde\Phi_i-\Phi_i\big)v_i\Big\|
\le O\big(\kappa^{-\gamma}\big)\,n^{-1}\sum_{i=1}^{n}\big\|\tilde\Phi_i-\Phi_i\big\|
= O\big(\kappa^{-\gamma}\big)\,O_P\big(\xi_1\eta_{1n}\big) = O_P\big(\kappa^{-\gamma}\xi_1\eta_{1n}\big). \quad ∎
\]
Proof of Lemma B.5. The proofs of (i)-(ii) are analogous to those of Lemma B.2(i)-(ii), respectively. Noting that $\big\|Q_{\Phi\Phi}^{-1}\big\|_{\mathrm{sp}} = O(1)$ by Assumption A2, we can prove (iii) by showing that $\|B_n\| = o_P(1)$, where $B_n \equiv n^{-1}\sum_{i=1}^{n}\Phi_i w_i\big(\tilde U_i-U_i\big)$ and $w_i \equiv \dot p(U_i)'\beta_{d+1}$. By the triangle inequality, Assumptions A1-A2, and (i),
\[
c_n \equiv \max_{1\le i\le n}|w_i|
\le \sup_{u\in\mathcal U}\big|\dot g_{d+1}(u)-\dot p(u)'\beta_{d+1}\big| + \sup_{u\in\mathcal U}\big|\dot g_{d+1}(u)\big|
= O\big(\kappa^{-\gamma}\big) + O(1) = O(1).
\]
By (B.2),
\[
B_n = n^{-1}\sum_{i=1}^{n}\Phi_i w_i\big(\tilde U_i-U_i\big) = \sum_{s=1}^{5} n^{-1}\sum_{i=1}^{n}\Phi_i w_i a_{si} \equiv \sum_{s=1}^{5} B_{ns},\ \text{say},
\]
where $a_{1i},\ldots,a_{5i}$ denote the five terms of the expansion (B.2) of $\tilde U_i-U_i$.
Let $M_n \equiv n^{-1}\sum_{i=1}^{n} w_i\Phi_i p_1(Z_{1i})'$ and $M \equiv E(M_n)$. Then $\|M_n-M\| = O_P\big((\kappa\kappa_1/n)^{1/2}\big)$ by the Chebyshev inequality, and
\[
\|M\|_{\mathrm{sp}}^2 = \big\|E\big[w_i\Phi_i p_1(Z_{1i})'\big]\big\|_{\mathrm{sp}}^2 \le c_n^2\,\lambda_{\max}(C) = O(1),
\]
where $C \equiv E\big[\Phi_i p_1(Z_{1i})'\big]E\big[p_1(Z_{1i})\Phi_i'\big]$ and we use the fact that $C$ has a bounded largest eigenvalue. To see the last point, first note that for $\kappa_1 \le \kappa$, $E\big[\Phi_i p_1(Z_{1i})'\big]$ is a submatrix of $Q \equiv E\big(\Phi_i\Phi_i'\big)$, which has a bounded largest eigenvalue. Partition $Q$ as follows:
\[
Q=\begin{bmatrix} Q_{11} & Q_{12} & Q_{13}\\ Q_{21} & Q_{22} & Q_{23}\\ Q_{31} & Q_{32} & Q_{33}\end{bmatrix},
\]
where $Q_{st} = Q_{ts}'$ for $s,t = 1,2,3$, and $E\big[\Phi_i p_1(Z_{1i})'\big] = \big[Q_{12}'\ \ Q_{22}\ \ Q_{32}'\big]'$. Then
\[
C=\begin{bmatrix}
Q_{12}Q_{12}' & Q_{12}Q_{22} & Q_{12}Q_{32}'\\
Q_{22}Q_{12}' & Q_{22}Q_{22} & Q_{22}Q_{32}'\\
Q_{32}Q_{12}' & Q_{32}Q_{22} & Q_{32}Q_{32}'
\end{bmatrix}.
\]
By Thompson and Freede (1970, Theorem 2), $\lambda_{\max}(C) \le \lambda_{\max}\big(Q_{12}Q_{12}'\big) + \lambda_{\max}\big(Q_{22}Q_{22}'\big) + \lambda_{\max}\big(Q_{32}Q_{32}'\big)$. By Fact 8.9.3 in Bernstein (2005), the positive definiteness of $Q$ ensures that both $Q_{12}Q_{12}'$ and $Q_{32}Q_{32}'$ have finite maximum eigenvalues, as both
\[
\begin{bmatrix} Q_{11} & Q_{12}\\ Q_{21} & Q_{22}\end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} Q_{22} & Q_{23}\\ Q_{32} & Q_{33}\end{bmatrix}
\]
are also positive definite. In addition, $\lambda_{\max}\big(Q_{22}Q_{22}'\big) = \big[\lambda_{\max}(Q_{22})\big]^2$ is finite, as $Q$ has a bounded maximum eigenvalue. It follows that $\lambda_{\max}(C) = O(1)$. Consequently, $\|M_n\| = O_P\big(1+(\kappa\kappa_1/n)^{1/2}\big) = O_P(1)$.
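The block eigenvalue bound invoked above, with $\lambda_{\max}$ of the partitioned matrix dominated by the sum of the $\lambda_{\max}$'s of its diagonal blocks, can be checked numerically. In the sketch below, $Q$ is a generic symmetric positive definite matrix standing in for $E(\Phi_i\Phi_i')$, and the block sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3                                  # block size; Q is (3m) x (3m)
X = rng.standard_normal((3 * m, 3 * m))
Q = X @ X.T + np.eye(3 * m)            # symmetric positive definite placeholder
Q12 = Q[0:m, m:2 * m]
Q22 = Q[m:2 * m, m:2 * m]
Q32 = Q[2 * m:3 * m, m:2 * m]
M = np.vstack([Q12, Q22, Q32])         # middle block column of Q
C = M @ M.T                            # the 3x3 block matrix from the proof

lam_max = lambda A: np.linalg.eigvalsh(A).max()
bound = lam_max(Q12 @ Q12.T) + lam_max(Q22 @ Q22.T) + lam_max(Q32 @ Q32.T)
assert lam_max(C) <= bound + 1e-8      # sum-of-diagonal-blocks eigenvalue bound
# Q22 is a symmetric PD diagonal block, so lam_max(Q22 Q22') = lam_max(Q22)^2:
assert np.isclose(lam_max(Q22 @ Q22.T), lam_max(Q22) ** 2)
```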
Analogously, noting that 1 is the first element of $\Phi_i$, we can show that $\big\|n^{-1}\sum_{i=1}^{n}w_i\Phi_i\big\|_{\mathrm{sp}} = O_P\big(1+(\kappa/n)^{1/2}\big) = O_P(1)$. It follows that
\[
\|B_{n1}\| = \Big\|n^{-1}\sum_{i=1}^{n}w_i\Phi_i\Big\|_{\mathrm{sp}}\big|\tilde\alpha-\alpha\big| = O_P(1)\,O_P\big(n^{-1/2}\big) = O_P\big(n^{-1/2}\big),
\]
\[
\|B_{n2}+B_{n4}\| \le n^{-1}\sum_{i=1}^{n}\|w_i\Phi_i\|\,\|S_1\|_{\mathrm{sp}}\big(\|\gamma_1\|+\|\gamma_2\|\big) = O(1)\,O_P(1)\,o_P(1) = o_P(1),
\]
and $\|B_{n3}+B_{n5}\| = o_P(1)$ by the same token. Thus we have shown that $\|B_n\| = o_P(1)$. ∎
References
Bernstein, D. S. (2005), Matrix Mathematics: Theory, Facts and Formulas with Application to Linear Systems Theory, Princeton: Princeton University Press.
Thompson, R. C., and Freede, L. J. (1970), “Eigenvalues of Partitioned Hermitian Matrices,” Bulletin of the Australian Mathematical Society, 3, 23-37.