
DEPARTMENT OF STATISTICS
University of Wisconsin
1300 University Ave.
Madison, WI 53706

TECHNICAL REPORT NO. 1184

September 12, 2017

Minimax Optimal Rates of Estimation in Functional ANOVA Models With Derivatives

Xiaowu Dai^1 and Peter Chien^2

Department of Statistics
University of Wisconsin, Madison

^1 Supported in part by NSF Grants DMS-1308877 and DMS-1564376.
^2 Supported in part by NSF Grant DMS-1564376.

MINIMAX OPTIMAL RATES OF ESTIMATION IN FUNCTIONAL ANOVA MODELS WITH DERIVATIVES

By Xiaowu Dai* and Peter Chien†

University of Wisconsin-Madison

We establish minimax optimal rates of convergence for nonparametric estimation in functional ANOVA models when data from first-order partial derivatives are available. Our results reveal that partial derivatives can improve convergence rates for function estimation with deterministic or random designs. In particular, for full $d$-interaction models, the optimal rates with first-order partial derivatives on $p$ covariates are identical to those for $(d-p)$-interaction models without partial derivatives. For additive models, using all first-order partial derivatives achieves the root-$n$ "parametric rate." We also investigate the minimax optimal rates for estimating first-order partial derivatives when derivative data are available. Those rates coincide with the optimal rate for estimating the first-order derivative of a univariate function.

1. Introduction. Derivative observations for complex systems are available in many applications. In dynamical systems and traffic engineering, real-time motion sensors can record velocity and acceleration in addition to positions [23, 27, 32]. In economics, there is a long tradition of studying costs and demands, where the factor demand function is the partial derivative of the cost function by Shephard's Lemma [31, 18, 14, 15]. In actuarial science, mortality force data can be obtained from demography, which together with samples from the survival distribution can yield derivatives of the survival distribution function [10]. In computer experiments, partial derivatives are available by using differentiation mechanisms at little additional cost [16, 12, 11]. Derivative data are commonly collected in geodetic engineering [30, 25]. In meteorology, the wind speed and direction, as functions of the gradient of barometric pressure, are measured over broad geographic regions while the pressure is also recorded [3]. Moreover, an evolving system is often modeled as a constrained optimization problem or a set of partial differential equations, which give data on the first-order condition or partial derivatives

*Supported in part by NSF Grant DMS-1308877.
†Supported in part by NSF Grant DMS-1564376.
MSC 2010 subject classifications: Primary 62G08, 62H12; secondary 62G05, 62P20.
Keywords and phrases: Nonparametric regression, smoothing spline ANOVA, partial derivative data, method of regularization, minimax rate.

as well as the objective function itself [9, 26].

Let $\partial f(t)/\partial t_j$ denote the $j$th first-order partial derivative of a scalar function $f(t)$ of $d$ variables $t = (t_1, \ldots, t_d)$. Consider the following multivariate regression model

(1.1)    $Y^{e_0} = f_0(t^{e_0}) + \epsilon^{e_0}, \qquad Y^{e_j} = \partial f_0/\partial t_j(t^{e_j}) + \epsilon^{e_j}, \quad 1 \le j \le p.$

Here, $e_j$ is a $d$-dimensional vector with $j$th entry one and all others zero, and $e_0$ is the zero vector. The response $Y^{e_0}$ is the function observation and $Y^{e_j}$ is the observation of the first-order partial derivative with respect to the $j$th covariate. Assume that the design points $t^{e_0}$ and the $t^{e_j}$'s lie in a compact product space $X_1^d$, where $X_1 = [0, 1]$. The random errors $\epsilon^{e_0}$ and the $\epsilon^{e_j}$'s are assumed to be independent centered noises with variances $\sigma_0^2$ and $\sigma_j^2$, respectively. Let $p \in \{1, \ldots, d\}$ denote the number of different types of first-order partial derivatives being observed. Without loss of generality, we focus on the first $p$ components for notational convenience. Let $\{(t_i^{e_j}, y_i^{e_j}) : i = 1, \ldots, n\}$ be independent copies of $(t^{e_j}, Y^{e_j})$ for $j = 1, \ldots, p$, and $\{(t_i^{e_0}, y_i^{e_0}) : i = 1, \ldots, n\}$ be independent copies of $(t^{e_0}, Y^{e_0})$.
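For concreteness, the following sketch simulates data from model (1.1) in a small bivariate case ($d = 2$, $p = 1$); the test function, noise levels, and sample size are illustrative choices, not quantities fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma0, sigma1 = 100, 0.1, 0.1

# an illustrative f0 on [0,1]^2 and its first partial derivative in t1
f0 = lambda t1, t2: np.sin(2 * np.pi * t1) * np.cos(2 * np.pi * t2)
df0_dt1 = lambda t1, t2: 2 * np.pi * np.cos(2 * np.pi * t1) * np.cos(2 * np.pi * t2)

# design points t^{e_0} and t^{e_1}, drawn uniformly on X_1^d
t_e0 = rng.uniform(size=(n, 2))
t_e1 = rng.uniform(size=(n, 2))

# function observations Y^{e_0} and derivative observations Y^{e_1}
y_e0 = f0(t_e0[:, 0], t_e0[:, 1]) + sigma0 * rng.standard_normal(n)
y_e1 = df0_dt1(t_e1[:, 0], t_e1[:, 1]) + sigma1 * rng.standard_normal(n)
```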

We now discuss two popular approaches for modeling the $d$-dimensional nonparametric unknown function $f_0(\cdot)$. The first uses a multivariate function with a smoothness assumption on all $d$ dimensions. The second uses a function with tensor product structure and smoothness properties on lower dimensions. The latter approach is represented by the smoothing spline analysis of variance (SS-ANOVA). See, for example, [45, 38, 20, 13] and references therein. As a general framework for nonparametric multivariate estimation, SS-ANOVA can adaptively control the complexity of the model with interpretable estimates. The SS-ANOVA model for a function $f(t)$ is

(1.2)    $f(t) = \mathrm{constant} + \sum_{k=1}^d f_k(t_k) + \sum_{k<j} f_{kj}(t_k, t_j) + \cdots,$

where the $f_k$'s are the main effects, the $f_{kj}$'s are the two-way interactions, and so on. Components on the right-hand side satisfy side conditions to ensure identifiability. The series is truncated at some order $r$ of interactions, $1 \le r \le d$, to enhance interpretability. This model generalizes the popular additive model, which corresponds to $r = 1$ and is fitted with smoothing splines (see, e.g., [4, 17]).
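As a minimal illustration of the truncated decomposition (1.2), the sketch below assembles a function with main effects and one two-way interaction ($r = 2$, $d = 2$); the particular components are hypothetical, chosen to integrate to zero over $[0,1]$ so the identifiability side conditions hold.

```python
import numpy as np

# main effects, each integrating to zero over [0, 1]
f1 = lambda t: np.sin(2 * np.pi * t)
f2 = lambda t: np.cos(2 * np.pi * t)

# a two-way interaction built as a product of zero-mean univariate terms,
# so it is orthogonal to the constant and to both main effects
f12 = lambda t1, t2: np.sin(2 * np.pi * t1) * np.sin(2 * np.pi * t2)

def f(t1, t2, constant=1.0):
    """A truncated SS-ANOVA model (1.2) with r = 2 and d = 2."""
    return constant + f1(t1) + f2(t2) + f12(t1, t2)

print(f(0.3, 0.7))
```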

We assume that the true function $f_0(\cdot)$ is an SS-ANOVA model and resides in a certain reproducing kernel Hilbert space (RKHS) $H$ on $X_1^d$. Let $H^{(k)}$ be an RKHS of functions of $t_k$ on $X_1$ with $\int_{X_1} f_k(t_k)\, dt_k = 0$ for $f_k(t_k) \in H^{(k)}$, and let $[1^{(k)}]$ be the one-dimensional space of constant functions on $R^1$. Construct $H$ as

(1.3)    $H = \prod_{k=1}^d \big( [1^{(k)}] \oplus H^{(k)} \big) = [1] \oplus \sum_{k=1}^d H^{(k)} \oplus \sum_{k<j} [H^{(k)} \otimes H^{(j)}] \oplus \cdots,$

where $[1]$ denotes the space of constant functions on $X_1^d$. The components of the SS-ANOVA decomposition (1.2) are now in mutually orthogonal subspaces of $H$ in (1.3). We further assume that all component functions come from a common RKHS $(H_1, \|\cdot\|_{H_1})$, that is, $H^{(k)} \equiv H_1$ for $k = 1, \ldots, d$. Let $K : X_1 \times X_1 \to R$ be a Mercer kernel generating the RKHS $H_1$ and write $K_d\big((t_1, \ldots, t_d)^\top, (t'_1, \ldots, t'_d)^\top\big) = K(t_1, t'_1) \cdots K(t_d, t'_d)$. Then $K_d$ is the reproducing kernel of the RKHS $(H, \|\cdot\|_H)$ (see, e.g., [1]).
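A sketch of the tensor-product construction of $K_d$, using for illustration the cubic smoothing spline kernel on $[0,1]$ built from scaled Bernoulli polynomials (one standard choice for a component RKHS; see Wahba [45] for this construction):

```python
import numpy as np

def k1(x): return x - 0.5
def k2(x): return 0.5 * (k1(x) ** 2 - 1.0 / 12)
def k4(x): return (k1(x) ** 4 - 0.5 * k1(x) ** 2 + 7.0 / 240) / 24

def K(s, t):
    """Univariate cubic smoothing spline kernel on [0, 1]."""
    return 1 + k1(s) * k1(t) + k2(s) * k2(t) - k4(np.abs(s - t))

def K_d(s, t):
    """Tensor-product kernel K_d(s, t) = prod_k K(s_k, t_k) on X_1^d."""
    return np.prod([K(sk, tk) for sk, tk in zip(s, t)])

print(K_d([0.2, 0.8], [0.3, 0.5]))
```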

1.1. Deterministic designs. We are interested in the minimax optimal convergence rates for estimating $f_0(\cdot)$ and its partial derivatives $\partial f_0/\partial t_j(\cdot)$. We begin by considering regular lattices, also known as tensor product designs [2, 28]. Suppose that the eigenvalues of $K$ decay polynomially, with the $\nu$th largest eigenvalue of the order $\nu^{-2m}$. We show that the minimax rate for estimating $f_0 \in H$ in the full $d$-interaction SS-ANOVA model is

(1.4)    $\inf_{\tilde f} \sup_{f_0 \in H} E \int_{X_1^d} \big[\tilde f(t) - f_0(t)\big]^2\, dt = \begin{cases} \big[n(\log n)^{1+p-d}\big]^{-2m/(2m+1)} & \text{if } 0 \le p < d, \\ n^{-1}(\log n)^{d-1} + n^{-2md/[(2m+1)d-2]} & \text{if } p = d, \end{cases}$

up to a constant scaling factor. If $0 \le p < d$, the above rate is the minimax optimal rate for estimating a $(d-p)$-dimensional full interaction SS-ANOVA model with only function observations; see, for example, [13, 20]. If $p = d$ and $d \ge 3$, the minimax optimal rate in (1.4) becomes

(1.5)    $\inf_{\tilde f} \sup_{f_0 \in H} E \int_{X_1^d} \big[\tilde f(t) - f_0(t)\big]^2\, dt \asymp n^{-2md/[(2m+1)d-2]}.$

For two positive sequences $a_n$ and $b_n$, we write $a_n \asymp b_n$ if $a_n/b_n$ is bounded away from zero and infinity. The rate given by (1.5) converges faster than the well-known optimal rate $n^{-2m/(2m+1)}$ for additive models given in [17, 35]. If $p = d$ and $d = 2$, the minimax optimal rate in (1.4) is $n^{-1}\log n$. If $p = d$ and $d = 1$, root-$n$ consistency is achieved in (1.4); this specific phenomenon has been observed earlier (see, e.g., [42, 14]).
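To make the gains concrete, the following few lines evaluate the exponents in (1.4) and (1.5) for a hypothetical case $m = 2$, $d = 3$; the printed numbers simply instantiate the displayed formulas.

```python
from fractions import Fraction

m, d = 2, 3
# exponent of n in [n (log n)^{1+p-d}]^{-2m/(2m+1)}, for 0 <= p < d
no_deriv = Fraction(-2 * m, 2 * m + 1)
# polynomial part of the rate in (1.5), for p = d and d >= 3
all_deriv = Fraction(-2 * m * d, (2 * m + 1) * d - 2)

print(f"p < d : n^{no_deriv} up to log factors")   # n^(-4/5)
print(f"p = d : n^{all_deriv}")                    # n^(-12/13), faster
```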

We are the first to systematically investigate the estimation of general $d$-dimensional SS-ANOVA models with derivatives. Other convergence rate results for truncated SS-ANOVA models ($r < d$) will be given in Section 2. In particular, for the additive model with $r = 1$ and $p = d$, the minimax optimal rate is $n^{-1}$, which coincides with the parametric convergence rate.

1.2. Random designs. We are interested in obtaining sharp results for random designs. Suppose that the design points $t^{e_0}$ and $t^{e_j}$ are independently drawn from distributions $\Pi^{e_0}$ and the $\Pi^{e_j}$'s, supported on $X_1^d$. We show that the minimax optimal rate for estimating the full $d$-interaction SS-ANOVA model is

(1.6)    $\inf_{\tilde f} \sup_{f_0 \in H} P\Big\{ \int_{X_1^d} \big[\tilde f(t) - f_0(t)\big]^2\, dt \ge C_1 \Big( \big[n(\log n)^{1+p-d}\big]^{-2m/(2m+1)} 1_{0 \le p < d} + \big[n^{-1}(\log n)^{d-1} + n^{-2md/[(2m+1)d-2]}\big] 1_{p=d} \Big) \Big\} = 0,$

where $C_1$ is a constant scalar not depending on $n$. The minimax optimal rates are also obtained for estimating $\partial f_0/\partial t_j(\cdot)$ for any $j \in \{1, \ldots, p\}$, for both full and truncated SS-ANOVA models with $r \le d$:

(1.7)    $\inf_{\tilde f} \sup_{f_0 \in H} P\Big\{ \int_{X_1^d} \big[\tilde f(t) - \partial f_0/\partial t_j(t)\big]^2\, dt \ge C_2\, n^{-2(m-1)/(2m-1)} \Big\} > 0,$

where $C_2$ does not depend on $n$. This result holds regardless of the values of $d$, $r$ and $p$. In particular, the rate is the same as the optimal rate for estimating $\partial f_0/\partial t_j(\cdot)$ if $f_0$ actually came from the univariate function space $H_1$ instead of the $d$-variate function space $H$. See, for example, [33, 34].

We achieve the minimax rates under deterministic designs (1.4) and random designs (1.6) by using the method of regularization in the framework of RKHS. Unlike the regularization method, alternative methods for modeling derivative data typically assume that the data have no random noise. See, for example, [5, 21, 22, 32], among others. Despite these existing works, theoretical understanding of observed first-order partial derivatives is limited. Our work fills a gap in this direction. It is worth pointing out the differences between this work and [14]. The estimator provided in [14] relies on observing the complete set of $2^s$ types of mixed derivatives on $s$ variables of a $d$-dimensional function with $s \le d$. This requirement can be infeasible for some problems in practice, whereas our setting accommodates any observed first-order partial derivatives. Moreover, [14] does not provide a minimax risk analysis and considers the estimation of $d$-dimensional functions without the tensor product structure. Thus, [14] concludes that adding more than one type of first-order partial derivative data does not further improve the convergence rate of their estimator. These results differ from our work in, for example, (1.4), (1.6) and (1.7) for functional ANOVA models.

The rest of the article is organized as follows. We give the main resultson estimating functions with deterministic designs in Section 2, where (1.4)and (1.5) are included. We present the main results with random designsin Section 3 including (1.6). We consider the optimal rates of estimatingfirst-order partial derivatives in Section 4, where (1.7) is elaborated. Proofsof the results with random designs are given in Section 5. Proofs of otherresults and auxiliary technical lemmas are relegated to the supplementarymaterial.

2. Minimax risks with regular lattices. This section provides the minimax optimal rates for estimating $f_0(\cdot)$ under model (1.1) and regular lattices. A regular lattice of size $n = l_1 \times \cdots \times l_d$ on $X_1^d$ is a collection of design points

(2.1)    $\{t_1, \ldots, t_n\} = \{(t_{i_1,1}, t_{i_2,2}, \ldots, t_{i_d,d}) : i_k = 1, \ldots, l_k,\ k = 1, \ldots, d\},$

where $t_{j,k} = j/l_k$ for $j = 1, \ldots, l_k$ and $k = 1, \ldots, d$. This design is often used in the statistical literature when the true function $f_0$ is a functional ANOVA model. This design is $D$-optimal in the sense of Kiefer and Wolfowitz [19]. Readers are referred to [2, 28] for further details. Under the regular lattice design, it is reasonable to assume that $f_0 : X_1^d \to R$ has a periodic boundary condition. This is because any finite-length sequence $\{f(t_1), \ldots, f(t_n)\}$ can be associated with a periodic sequence

$f^{per}\Big(\frac{i_1}{l_1}, \cdots, \frac{i_d}{l_d}\Big) = \sum_{q_1=-\infty}^{\infty} \cdots \sum_{q_d=-\infty}^{\infty} f\Big(\frac{i_1}{l_1} - q_1, \cdots, \frac{i_d}{l_d} - q_d\Big), \qquad \forall (i_1, \ldots, i_d) \in Z^d,$

by letting $f(\cdot) \equiv 0$ outside $X_1^d$ and at the unobserved boundaries of $X_1^d$. On the other hand, any finite-length sequence $\{f(t_1), \ldots, f(t_n)\}$ can be recovered from the periodic sequence $f^{per}(\cdot)$.

Recall that $K$ is the reproducing kernel of the component RKHS $H_1$, which is a symmetric, positive semi-definite, square integrable function on $X_1 \times X_1$.
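A regular lattice (2.1) is straightforward to generate; the sketch below builds one for illustrative sizes $l_1 = 4$, $l_2 = 3$, $l_3 = 2$, so $n = 24$.

```python
import itertools
import numpy as np

def regular_lattice(sizes):
    """Design points (i_1/l_1, ..., i_d/l_d), i_k = 1, ..., l_k, as in (2.1)."""
    axes = [np.arange(1, l + 1) / l for l in sizes]
    return np.array(list(itertools.product(*axes)))

design = regular_lattice([4, 3, 2])
print(design.shape)   # (24, 3): n = l_1 * l_2 * l_3 points in [0, 1]^3
```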

In our setting, we require an additional differentiability condition on the kernel $K$:

(2.2)    $\frac{\partial^2}{\partial t\, \partial t'} K(t, t') \in C(X_1 \times X_1).$

A straightforward interpretation of this condition is as follows. Denote by $\langle \cdot, \cdot \rangle_H$ the inner product of the RKHS $H$ in (1.3). Then, for any $g \in H$, we have

(2.3)    $\frac{\partial g(t)}{\partial t_j} = \frac{\partial \langle g, K_d(t, \cdot) \rangle_H}{\partial t_j} = \Big\langle g, \frac{\partial K_d(t, \cdot)}{\partial t_j} \Big\rangle_H,$

where the last step is by the continuity of $\langle \cdot, \cdot \rangle_H$. This implies that the composite functional of evaluation and partial differentiation, $\partial g/\partial t_j(t)$, is a bounded linear functional on $H$ with representer $\partial K_d(t, \cdot)/\partial t_j$ in $H$. From Mercer's theorem [29], $K$ admits a spectral decomposition

(2.4)    $K(t, t') = \sum_{\nu=1}^{\infty} \lambda_\nu \psi_\nu(t) \psi_\nu(t'),$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ are its eigenvalues and $\{\psi_\nu : \nu \ge 1\}$ are the corresponding eigenfunctions. A canonical example of $H_1$ is the $m$th order Sobolev space $W_2^m(X_1)$, whose eigenvalues satisfy $\lambda_\nu \asymp \nu^{-2m}$. See, for example, Wahba [45] for further examples. Here, (2.3) implies that $\partial g/\partial t_j(t)$ is a continuous function. Thus, if $H_1 = W_2^m(X_1)$, we require $m > 3/2$ by the Sobolev embedding theorem.
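One can check the polynomial eigenvalue decay $\lambda_\nu \asymp \nu^{-2m}$ numerically by eigendecomposing a kernel Gram matrix on a fine grid; the sketch below does this for the cubic spline kernel used earlier (for which $m = 2$, so $\nu^{4}\lambda_\nu$ should level off). The grid size is an arbitrary choice, and the discrete eigenvalues approximate $\lambda_\nu$ only up to the quadrature scaling $1/N$.

```python
import numpy as np

def k1(x): return x - 0.5
def k2(x): return 0.5 * (k1(x) ** 2 - 1.0 / 12)
def k4(x): return (k1(x) ** 4 - 0.5 * k1(x) ** 2 + 7.0 / 240) / 24

N = 2000
t = (np.arange(N) + 0.5) / N
S, T = np.meshgrid(t, t)
gram = 1 + k1(S) * k1(T) + k2(S) * k2(T) - k4(np.abs(S - T))

# eigenvalues of the integral operator are approximated by eig(gram)/N
lam = np.sort(np.linalg.eigvalsh(gram))[::-1] / N
for nu in [4, 8, 16, 32]:
    print(f"nu = {nu:3d}:  nu^4 * lambda_nu = {nu**4 * lam[nu - 1]:.4f}")
```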

We are now in a position to present our main results. We first state a minimax lower bound under regular lattices.

Theorem 2.1. Assume that $\lambda_\nu \asymp \nu^{-2m}$ for some $m > 3/2$, and that the design points $t^{e_0}$ and $t^{e_j}$, $j = 1, \ldots, d$, are from the regular lattice (2.1). Suppose that $f_0 \in H$ has periodic boundaries on $X_1^d$ and is truncated up to $r$ interactions in (1.2). Then, as $n \to \infty$,

$\inf_{\tilde f} \sup_{f_0 \in H} E \int_{X_1^d} \big[\tilde f(t) - f_0(t)\big]^2\, dt = \begin{cases} \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)} & \text{if } 0 \le p < d, \\ n^{-1}(\log n)^{r-1} + n^{-2mr/[(2m+1)r-2]} & \text{if } p = d, \end{cases}$

up to a constant factor which only depends on the bounded values $\sigma_0^2$, the $\sigma_j^2$'s, $m$, $r$, $p$, and $d$.

We relegate the proof to Section A.2.1 in the supplementary material. Next, we show that the lower bounds on convergence rates in Theorem 2.1 are attainable. In particular, we consider the method of regularization, which simultaneously minimizes the empirical losses of the function observations and the partial derivative observations with a single penalty:

(2.5)    $\hat f_{n\lambda} = \mathop{\mathrm{argmin}}_{f \in H} \bigg\{ \frac{1}{n(p+1)} \bigg[ \frac{1}{\sigma_0^2} \sum_{i=1}^n \{y_i^{e_0} - f(t_i^{e_0})\}^2 + \sum_{j=1}^p \frac{1}{\sigma_j^2} \sum_{i=1}^n \big\{ y_i^{e_j} - \partial f/\partial t_j(t_i^{e_j}) \big\}^2 \bigg] + \lambda J(f) \bigg\},$

where the weighted squared error loss may be replaced by other convex losses, $J(\cdot)$ is a quadratic penalty associated with the RKHS $H$, and $\lambda \ge 0$ is a tuning parameter. The following theorem shows that $\hat f_{n\lambda}$ in (2.5) is indeed minimax rate optimal.

Theorem 2.2. Under the conditions of Theorem 2.1, $\hat f_{n\lambda}$ given by (2.5) satisfies

$E \int_{X_1^d} \big[\hat f_{n\lambda}(t) - f_0(t)\big]^2\, dt = \begin{cases} \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)} & \text{if } 0 \le p < d, \\ n^{-1}(\log n)^{r-1} + n^{-2mr/[(2m+1)r-2]} & \text{if } p = d, \end{cases}$

up to a constant factor which only depends on the bounded values $\sigma_0^2$, the $\sigma_j^2$'s, $m$, $r$, $p$, and $d$, provided the tuning parameter $\lambda$ is chosen as $\lambda \asymp \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)}$ when $0 \le p < d$; $\lambda \asymp n^{-(2mr-2)/[(2m+1)r-2]}$ when $p = d$, $r \ge 3$; $\lambda \asymp (n \log n)^{-(2m-1)/2m}$ when $p = d$, $r = 2$; and $\lambda \lesssim n^{-(m-1)/m}$ when $p = d$, $r = 1$.
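The minimizer in (2.5) is computable in closed form once $f$ is expanded in a basis that diagonalizes $J(\cdot)$. The sketch below is a minimal univariate instance ($d = r = p = 1$) using the trigonometric basis of Section 5 and the spectral penalty $J(f) = \sum_\nu \rho_\nu f_\nu^2$ with $\rho_\nu \asymp \nu^{2m}$; the test function, truncation level, noise levels, and $\lambda$ are illustrative choices, not values prescribed by the paper.

```python
# A minimal sketch of the regularized estimator (2.5) for d = p = 1.
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma0, sigma1, N_basis, lam = 200, 2, 0.1, 0.1, 31, 1e-4

f0 = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
df0 = lambda t: 2 * np.pi * np.cos(2 * np.pi * t) - 2 * np.pi * np.sin(4 * np.pi * t)

t0 = rng.uniform(size=n)                      # design for function values
t1 = rng.uniform(size=n)                      # design for derivative values
y0 = f0(t0) + sigma0 * rng.standard_normal(n)
y1 = df0(t1) + sigma1 * rng.standard_normal(n)

def basis(t):
    """psi_1 = 1, psi_{2v} = sqrt(2) cos(2 pi v t), psi_{2v+1} = sqrt(2) sin(2 pi v t)."""
    cols = [np.ones_like(t)]
    for v in range(1, (N_basis - 1) // 2 + 1):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * v * t),
                 np.sqrt(2) * np.sin(2 * np.pi * v * t)]
    return np.column_stack(cols)

def dbasis(t):
    """First derivatives of the basis columns."""
    cols = [np.zeros_like(t)]
    for v in range(1, (N_basis - 1) // 2 + 1):
        w = 2 * np.pi * v
        cols += [-np.sqrt(2) * w * np.sin(w * t),
                 np.sqrt(2) * w * np.cos(w * t)]
    return np.column_stack(cols)

# penalty weights rho_nu ~ nu^(2m), indexed by the frequency of each column
freqs = np.array([0] + [v for v in range(1, (N_basis - 1) // 2 + 1) for _ in (0, 1)])
R = np.diag(freqs.astype(float) ** (2 * m))

P0, P1 = basis(t0), dbasis(t1)
A = (P0.T @ P0 / sigma0**2 + P1.T @ P1 / sigma1**2) / (n * 2)
b = (P0.T @ y0 / sigma0**2 + P1.T @ y1 / sigma1**2) / (n * 2)
c_hat = np.linalg.solve(A + lam * R, b)       # normal equations of (2.5)

grid = np.linspace(0, 1, 1000)
err = np.mean((basis(grid) @ c_hat - f0(grid)) ** 2)
print(f"integrated squared error of f_hat: {err:.2e}")
```

Rerunning this over a range of sample sizes with $\lambda$ scaled as $n^{-(m-1)/m}$ is one way to observe the near-parametric decay of the error predicted by Theorem 2.2 for $p = d$, $r = 1$.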

The proof of this theorem is presented in Section A.2.2 in the supplementary material. Theorems 2.1 and 2.2 together immediately imply that, with model (1.1) and regular lattices, the minimax optimal rate for estimating $f_0 \in H$ is

(2.6)    $E \int_{X_1^d} \big[\hat f(t) - f_0(t)\big]^2\, dt = \begin{cases} \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)} & \text{if } 0 \le p < d, \\ n^{-1}(\log n)^{r-1} + n^{-2mr/[(2m+1)r-2]} & \text{if } p = d, \end{cases}$

and the method of regularization achieves (2.6). We make several remarks on this result. First, suppose there are no derivative data, that is, $p = 0$ and $r = d$. Then (2.6) recovers $[n(\log n)^{1-d}]^{-2m/(2m+1)}$, and this rate is known in the literature (see, e.g., [13]). For large $n$, the $(\log n)^{d-1}$ factor makes the full $d$-interaction SS-ANOVA model impractical for large $d$. On the contrary, suppose partial derivative data are available with, for example, $p = d - 1$ and $r = d$. Then (2.6) gives $n^{-2m/(2m+1)}$ for any $d \ge 1$, which coincides with the classical optimal rate for additive models [17, 35] and is not affected by the dimension $d$.

Second, if partial derivative observations are available on all covariates, with $p = d$, then the optimal rate can be much improved. Besides (1.5) for $r = d$ and $d \ge 3$, we point out some other interesting cases. For the additive model with $r = 1$ and $d \ge 1$, (2.6) provides the minimax rate $n^{-1}$. For the pairwise interaction model with $r = 2$ and $d \ge 1$, (2.6) provides the minimax rate $n^{-1} \log n$, which differs from $n^{-1}$ only by a $\log n$ multiplier.

Third, we remark on an "interaction reduction" phenomenon. That is to say, the optimal rate for estimating an unknown SS-ANOVA model by incorporating partial derivative data is the same as the optimal rate for estimating a reduced-interaction SS-ANOVA model without derivative data. For example, with $r = d$ and $p = 1$, (2.6) gives $[n(\log n)^{1-(d-1)}]^{-2m/(2m+1)}$, which is the same rate as for $r = d - 1$ and $p = 0$, involving no derivative observations but a lower degree of interactions. With $r = d$ and $p = 2$, (2.6) gives $[n(\log n)^{1-(d-2)}]^{-2m/(2m+1)}$, which is the same rate as for $r = d - 2$ and $p = 0$, involving no derivative observations but two lower degrees of interactions. The same discussion extends to $p = 3, \ldots, d - 1$.

Fourth, by reviewing the proofs of Theorems 2.1 and 2.2, we find that when $p = d$, both the squared bias and the variance are smaller in magnitude than when $p < d$, and when $d - r < p < d$, only the variance is smaller in magnitude than when $0 \le p \le d - r$.

Finally, let $n_0$ denote the sample size on $(t^{e_0}, Y^{e_0})$ and $n_j$ denote the sample sizes on $(t^{e_j}, Y^{e_j})$, where $1 \le j \le p$. If $n_0$ and the $n_j$'s are not all identical to $n$, we can show that $n$ in (2.6) can be replaced by $\min_{1 \le j \le p} n_j$.
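A few lines suffice to tabulate the interaction-reduction phenomenon of the third remark; $m$, $d$ below are hypothetical values, and the printed log-exponent $1 - (d-p) \wedge r$ is read off (2.6).

```python
m, d = 2, 4
r = d   # full interaction model
for p in range(0, d):   # cases 0 <= p < d of (2.6)
    log_exp = 1 - min(d - p, r)
    print(f"p = {p}: rate [n (log n)^{log_exp}]^(-{2*m}/{2*m+1})")
# each extra derivative direction removes one log factor, matching the
# rate of a reduced-interaction SS-ANOVA model with no derivative data
```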

3. Minimax risks with random designs. We now turn to random designs for the minimax optimal rates of estimating $f_0(\cdot)$ with the regression model (1.1). Parallel to Theorem 2.1, we have the following minimax lower bound under random designs.

Theorem 3.1. Assume that $\lambda_\nu \asymp \nu^{-2m}$ for some $m > 3/2$, and that the design points $t^{e_0}$ and $t^{e_j}$, $j = 1, \ldots, d$, are independently drawn from $\Pi^{e_0}$ and the $\Pi^{e_j}$'s, respectively. Suppose that $\Pi^{e_0}$ and the $\Pi^{e_j}$'s have densities bounded away from zero and infinity, and $f_0 \in H$ is truncated up to $r$ interactions in (1.2). Then, as $n \to \infty$,

$\inf_{\tilde f} \sup_{f_0 \in H} P\Big\{ \int_{X_1^d} \big[\tilde f(t) - f_0(t)\big]^2\, dt \ge C_1 \Big( \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)} 1_{0 \le p < d} + \big[n^{-1}(\log n)^{r-1} + n^{-2mr/[(2m+1)r-2]}\big] 1_{p=d} \Big) \Big\} > 0,$

where the constant $C_1$ only depends on the bounded values $\sigma_0^2$, the $\sigma_j^2$'s, $m$, $r$, $p$, and $d$.

The lower bound is established via Fano's lemma; see, for example, [36, 6]. The proof is deferred to Section 5.1. Next, we show that the lower bounds on convergence rates in Theorem 3.1 can be achieved by using the regularized estimator in (2.5).

Theorem 3.2. Under the conditions of Theorem 3.1, assume further that $\Pi^{e_0}$ and the $\Pi^{e_j}$'s are known and $m > 2$. Then $\hat f_{n\lambda}$ in (2.5) satisfies

$\lim_{D_1 \to \infty} \limsup_{n \to \infty} \sup_{f_0 \in H} P\Big\{ \int_{X_1^d} \big[\hat f_{n\lambda}(t) - f_0(t)\big]^2\, dt > D_1 \Big( \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)} 1_{0 \le p < d} + \big[n^{-1}(\log n)^{r-1} + n^{-2mr/[(2m+1)r-2]}\big] 1_{p=d} \Big) \Big\} = 0$

if the tuning parameter $\lambda$ is chosen as $\lambda \asymp \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)}$ when $0 \le p < d$; $\lambda \asymp n^{-(2mr-2)/[(2m+1)r-2]}$ when $p = d$, $r \ge 3$; $\lambda \asymp (n \log n)^{-(2m-1)/2m}$ when $p = d$, $r = 2$; and $\lambda \lesssim n^{-(m-1)/m}$ when $p = d$, $r = 1$. In other words, $\hat f_{n\lambda}$ is rate optimal.

We use the linearization method in [8] to prove Theorem 3.2. The key ingredient of this method is to choose a suitable basis such that the expected loss of the regularization and the quadratic penalty $J(\cdot)$ can be simultaneously diagonalized. For applications where these two functionals are positive semi-definite, the existence of such a basis is guaranteed by classical operator theory (see, e.g., [46]); this is done in [20, 40, 13]. Our situation is different in the sense that the loss function in (2.5) is the sum of squared error losses for both the function and the partial derivatives, while we are only interested in estimating the function itself in Theorem 3.2. This induces a third positive semi-definite functional, the squared error loss of function estimation. But three functionals are not guaranteed to be simultaneously diagonalizable, which makes a direct application of the linearization method infeasible. We present a detailed proof in Section 5.2.

Theorems 3.1 and 3.2 together demonstrate that the fundamental limit rate of the squared error loss for estimating $f_0 \in H$ with model (1.1) and random designs is

(3.1)    $\big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)} 1_{0 \le p < d} + \big[n^{-1}(\log n)^{r-1} + n^{-2mr/[(2m+1)r-2]}\big] 1_{p=d}$

in a probabilistic sense, and the regularized estimator achieves (3.1). The minimax rate is the same as that with the regular lattice. We make several remarks on (3.1). First, all five remarks following (2.6) for the mean squared error setting hold for (3.1) in a probabilistic sense.

Second, for the special case $p = 0$, (3.1) recovers the minimax optimal rate of convergence $O_P\big( \big[n(\log n)^{1-r}\big]^{-2m/(2m+1)} \big)$ for SS-ANOVA models, which is known from [20].

Third, the squared error loss in Theorems 3.1 and 3.2 can be replaced by the squared prediction error $\int \{\hat f_{n\lambda}(t) - f_0(t)\}^2\, d\Pi^{e_0}(t)$, which achieves the same minimax optimal rate as (3.1).

Fourth, although (3.1) is established by assuming the design points are drawn independently, it also holds for designs in which function and derivative observations can be grouped into sets, where within each set the design points are drawn identically and across sets the design points are drawn independently. For example, when $p = 2$, (3.1) still holds if the designs can be grouped into $\{t^{e_0}$ drawn from $\Pi^{e_0}\}$ and $\{t^{e_1} \equiv t^{e_2}$ drawn from $\Pi^{e_1}\}$, with these two sets drawn independently.

As a byproduct of Theorem 3.2, we establish the following result on estimating the mixed partial derivative $\frac{\partial^d f_0}{\partial t_1 \cdots \partial t_d}(t)$ by its natural estimator $\frac{\partial^d \hat f_{n\lambda}}{\partial t_1 \cdots \partial t_d}(t)$.

Corollary 3.3. Under the conditions of Theorem 3.2 and $m > 3$, we have

$\lim_{D'_1 \to \infty} \limsup_{n \to \infty} \sup_{f_0 \in H} P\bigg\{ \int_{X_1^d} \bigg[ \frac{\partial^d \hat f_{n\lambda}(t)}{\partial t_1 \cdots \partial t_d} - \frac{\partial^d f_0(t)}{\partial t_1 \cdots \partial t_d} \bigg]^2 dt > D'_1 \Big( \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2(m-1)/(2m+1)} 1_{0 \le p < d} + n^{-2(m-1)r/[(2m+1)r-2]}\, 1_{p=d} \Big) \bigg\} = 0,$

if the tuning parameter $\lambda$ is chosen as $\lambda \asymp \big[n(\log n)^{1-(d-p)\wedge r}\big]^{-2m/(2m+1)}$ when $0 \le p < d$, and $\lambda \asymp n^{-(2mr-2)/[(2m+1)r-2]}$ when $p = d$.

4. Minimax risk for estimating partial derivatives. If one observes noisy data on the function and some partial derivatives in (1.1), it is natural to ask what the optimal rate is for estimating first-order partial derivatives using all observed data. For brevity, we only consider random designs, although similar results can be derived for regular lattices by using the techniques in Section 2. The following theorem gives the minimax lower bound for estimating $\partial f_0/\partial t_j$, $1 \le j \le p$.

Theorem 4.1. Assume that $\lambda_\nu \asymp \nu^{-2m}$ for some $m > 2$, and that the design points $t^{e_0}$ and $t^{e_j}$, $j = 1, \ldots, d$, are independently drawn from $\Pi^{e_0}$ and the $\Pi^{e_j}$'s, respectively. Suppose that $\Pi^{e_0}$ and the $\Pi^{e_j}$'s have densities bounded away from zero and infinity, and $f_0 \in H$ is truncated up to $r$ interactions in (1.2). Then, for any $j \in \{1, \ldots, p\}$ and $1 \le r \le d$, as $n \to \infty$,

$\inf_{\tilde f} \sup_{f_0 \in H} P\bigg\{ \int_{X_1^d} \bigg[ \tilde f(t) - \frac{\partial f_0(t)}{\partial t_j} \bigg]^2 dt \ge C_2\, n^{-2(m-1)/(2m-1)} \bigg\} > 0,$

where $C_2$ only depends on the bounded values $\sigma_0^2$, the $\sigma_j^2$'s, $m$, $r$, $p$, and $d$.

We prove this theorem in Section A.3.1 in the supplementary material. As a natural estimator for $\partial f_0/\partial t_j$, $\partial \hat f_{n\lambda}/\partial t_j$ achieves the lower bound on convergence rates in Theorem 4.1.

Theorem 4.2. Under the conditions of Theorem 4.1, $\hat f_{n\lambda}$ given by (2.5) satisfies, for any $j \in \{1, \ldots, p\}$ and $1 \le r \le d$,

$\lim_{D_2 \to \infty} \limsup_{n \to \infty} \sup_{f_0 \in H} P\bigg\{ \int_{X_1^d} \bigg[ \frac{\partial \hat f_{n\lambda}(t)}{\partial t_j} - \frac{\partial f_0(t)}{\partial t_j} \bigg]^2 dt > D_2\, n^{-2(m-1)/(2m-1)} \bigg\} = 0,$

if the tuning parameter $\lambda$ is chosen as $\lambda \asymp n^{-2(m-1)/(2m-1)}$.

The proof of this theorem is given in Section A.3.2 in the supplementary material. When $r = 1$, this result coincides with Corollary 3.3. Unlike Theorem 3.2 and Corollary 3.3, the distributions $\Pi^{e_0}$ and the $\Pi^{e_j}$'s are not assumed to be known here.

Theorems 4.1 and 4.2 together give the minimax optimal rate for estimating $\partial f_0/\partial t_j$, which is given in (1.7). To the best of our knowledge, there are few existing results in the literature on estimating first-order partial derivatives. Since the optimal rate in (1.7) holds regardless of the value of $p \ge 1$, first-order partial derivative data on different covariates do not improve the optimal rates for estimating each other. For example, given noisy data on $f_0(\cdot)$ and $\partial f_0/\partial t_j(\cdot)$, the data on $\partial f_0/\partial t_k(\cdot)$ do not improve the minimax optimal rate for estimating $\partial f_0/\partial t_j(\cdot)$ for $1 \le k \ne j \le p$.

5. Proofs for Section 3: random designs. Before proving the main results, we give some preliminary background on the RKHS $H$. Since the SS-ANOVA model (1.2) truncates the series at $r$ interactions, without loss of generality we still denote the corresponding function space in (1.3) by $H$, which is the direct sum of a subset of the orthogonal subspaces in the decomposition of $\otimes_{j=1}^d H_1$. Define $\|\cdot\|_{\otimes_{j=1}^d H_1}$ as the norm on $\otimes_{j=1}^d H_1$ induced by the component norms $\|\cdot\|_{H_1}$, and define $\|\cdot\|_H$ as the norm on $H$ obtained by restricting $\|\cdot\|_{\otimes_{j=1}^d H_1}$ to $H$. Then $H$ is an RKHS equipped with $\|\cdot\|_H$. The quadratic penalty $J(\cdot)$ in (2.5) is defined as a squared semi-norm on $H$ induced by a univariate penalty in $H_1$. For example, for $H_1 = W_2^m(X_1)$ it is common to choose $J(\cdot)$ to penalize only the smooth components of a function; an explicit form is given in Wahba [45].

Now we introduce some notation used in the proof. We define a family of multi-indices $\vec\nu$ by

(5.1)    $V = \{ \vec\nu = (\nu_1, \ldots, \nu_d)^\top \in N^d : \text{at most } r \text{ of the } \nu_k\text{'s are not equal to } 1 \},$

which will be referred to later, since $f_0$ in the model (1.2) is truncated up to $r$ interactions. For two nonnegative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ (or $a_n \gtrsim b_n$) if there exists a constant $c > 0$ (or $c' > 0$), independent of the problem parameters, such that $a_n \le c b_n$ (or $a_n \ge c' b_n$) for all $n$. The maximum of two scalars $\{a, b\}$ is denoted by $a \vee b$ and their minimum by $a \wedge b$.

5.1. Proof of the minimax lower bound: Theorem 3.1. We establish the lower bound for random designs via Fano's lemma. It suffices to consider a special case where the noises $\epsilon^{e_0}$ and $\epsilon^{e_j}$'s are Gaussian with $\sigma_0 = 1$ and $\sigma_j = 1$, the distributions $\Pi^{e_0}$ and $\Pi^{e_j}$'s are uniform, and $H_1$ is generated by periodic kernels.

Let $N$ be a natural number whose value will become clear later. We first derive the eigenvalue decay rate for the kernel $K_d$, which generates the RKHS $H$. For a given $\tau > 0$, the number of multi-indices $\vec\nu = (\nu_1, \ldots, \nu_r) \in N^r$ satisfying $\nu_1^{-2m} \cdots \nu_r^{-2m} \ge \tau$ is the same as the number of multi-indices such that $\nu_1 \cdots \nu_r \le \tau^{-1/(2m)}$, which amounts to

(5.2)    $\sum_{\nu_2 \cdots \nu_r \le \tau^{-1/(2m)}} \frac{\tau^{-1/(2m)}}{\nu_2 \cdots \nu_r} = \tau^{-1/(2m)} \bigg( \sum_{\nu \le \tau^{-1/(2m)}} \frac{1}{\nu} \bigg)^{r-1} \asymp \tau^{-1/(2m)} (\log 1/\tau)^{r-1}.$

Denote by $\lambda_N(K_d)$ the $N$th eigenvalue of $K_d$. By inverting (5.2), we obtain

$\lambda_N(K_d) \asymp \big[ N (\log N)^{1-r} \big]^{-2m}.$
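The counting estimate behind (5.2) — that the number of $r$-tuples with $\nu_1 \cdots \nu_r \le X$ grows like $X(\log X)^{r-1}$ — can be checked numerically; the brute-force sketch below uses $r = 3$ and a few arbitrary values of $X$ (the ratio settles toward a constant as $X$ grows).

```python
import math
from itertools import product

def count(X, r):
    """Number of r-tuples of positive integers with product <= X (brute force)."""
    return sum(1 for nu in product(range(1, X + 1), repeat=r)
               if math.prod(nu) <= X)

r = 3
for X in [20, 40, 80]:
    ratio = count(X, r) / (X * math.log(X) ** (r - 1))
    print(f"X = {X:3d}: count / [X (log X)^{r - 1}] = {ratio:.3f}")
```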

Hence, the multi-indices $\vec\nu = (\nu_1, \ldots, \nu_r) \in N^r$ satisfying $\nu_1 \cdots \nu_r \le N$ correspond to the first $c_0 N (\log N)^{r-1}$ eigenvalues of $K_d$ for some constant $c_0$. Let $b = \{b_{\vec\nu} : \nu_1 \cdots \nu_r \le N\} \in \{0,1\}^{c_0 N (\log N)^{r-1}}$ be a binary sequence of length $c_0 N (\log N)^{r-1}$, and let $\{\lambda_{\vec\nu} : \nu_1 \cdots \nu_r \le N\}$ be the first $c_0 N (\log N)^{r-1}$ eigenvalues of $K_d$. Denote by $\{\lambda_{\vec\nu + c_0 N (\log N)^{r-1}} : \nu_1 \cdots \nu_r \le N\}$ the $\{c_0 N (\log N)^{r-1} + 1\}$th, $\{c_0 N (\log N)^{r-1} + 2\}$th, $\ldots$, $\{2 c_0 N (\log N)^{r-1}\}$th eigenvalues of $K_d$.

For brevity, we only prove the case $p = d$ and $r \ge 3$; the other cases, $p = d$, $r \le 2$ and $0 \le p < d$, follow similar arguments. We deal with the differences among these cases for deterministic designs in Section A.2.1 of the supplementary material. Write

$f_b(t_1, \ldots, t_r) = N^{-1/2 + 1/r} \sum_{\nu_1 \cdots \nu_r \le N} b_{\vec\nu} \big( 1 + \nu_1^2 + \cdots + \nu_r^2 \big)^{-1/2} \times \lambda^{1/2}_{\vec\nu + c_0 N (\log N)^{r-1}}\, \phi_{\vec\nu + c_0 N (\log N)^{r-1}}(t_1, \ldots, t_r),$

where the $\phi_{\vec\nu + c_0 N (\log N)^{r-1}}(t_1, \ldots, t_r)$ are the eigenfunctions of $K_d$ corresponding to the eigenvalues $\lambda_{\vec\nu + c_0 N (\log N)^{r-1}}$. Note that

$\|f_b\|_H^2 = N^{-1+2/r} \sum_{\nu_1 \cdots \nu_r \le N} b_{\vec\nu}^2 (1 + \nu_1^2 + \cdots + \nu_r^2)^{-1} \le N^{-1+2/r} \sum_{\nu_1 \cdots \nu_r \le N} (1 + \nu_1^2 + \cdots + \nu_r^2)^{-1} \asymp 1,$

where the last step is by Lemma A.14 in the supplementary material; this implies $f_b(\cdot) \in H$.

By the Varshamov-Gilbert bound (see, e.g., [36]), there exists a collection of binary sequences $\{b^{(1)}, \ldots, b^{(M)}\} \subset \{0,1\}^{c_0 N (\log N)^{r-1}}$ such that $M \ge 2^{c_0 N (\log N)^{r-1}/8}$ and

$H(b^{(l)}, b^{(q)}) \ge c_0 N (\log N)^{r-1}/8, \qquad \forall\, 1 \le l < q \le M,$

where $H(\cdot, \cdot)$ is the Hamming distance. Then, for $b^{(l)}, b^{(q)} \in \{0,1\}^{c_0 N (\log N)^{r-1}}$, we have

$\|f_{b^{(l)}} - f_{b^{(q)}}\|_{L_2}^2 \ge N^{-1+2/r} (2N)^{-2m} \sum_{\nu_1 \cdots \nu_r \le N} (1 + \nu_1^2 + \cdots + \nu_r^2)^{-1} \big[ b^{(l)}_{\vec\nu} - b^{(q)}_{\vec\nu} \big]^2 \ge N^{-1+2/r} (2N)^{-2m} \sum_{c_1 7N/8 \le \nu_1 \cdots \nu_r \le N} (1 + \nu_1^2 + \cdots + \nu_r^2)^{-1} = c_2 N^{-2m}$

for some constants $c_1$ and $c_2$, where the last step is by Lemma A.14 in the supplementary material.

On the other hand, for any $b^{(l)} \in \{b^{(1)}, \ldots, b^{(M)}\}$ and by Lemma A.14,

$\|f_{b^{(l)}}\|_{L_2}^2 + \sum_{j=1}^p \|\partial f_{b^{(l)}}/\partial t_j\|_{L_2}^2 \le N^{-1+2/r} \sum_{\nu_1 \cdots \nu_r \le N} \nu_1^{-2m} \cdots \nu_r^{-2m} \big[ b^{(l)}_{\vec\nu} \big]^2 \le N^{-1+2/r} \sum_{\nu_1 \cdots \nu_r \le N} \nu_1^{-2m} \cdots \nu_r^{-2m} = c_3 N^{-2m+2/r} (\log N)^{r-1}$

for some constant $c_3$.

A standard argument reduces the lower bound to the error probability of a multi-way hypothesis test [36]. Specifically, let $\Theta$ be a random variable uniformly distributed on $\{1, \ldots, M\}$. Note that

(5.3)    $\inf_{\tilde f} \sup_{f_0 \in H} P\Big\{ \|\tilde f - f_0\|_{L_2}^2 \ge \tfrac{1}{4} \min_{b^{(l)} \ne b^{(q)}} \|f_{b^{(l)}} - f_{b^{(q)}}\|_{L_2}^2 \Big\} \ge \inf_{\hat\Theta} P\{\hat\Theta \ne \Theta\},$

where the infimum on the right-hand side is taken over all decision rules that are measurable functions of the data. By Fano's lemma,

(5.4)    $P\big\{ \hat\Theta \ne \Theta \mid t_1^{e_0}, \ldots, t_n^{e_0}; \ldots; t_1^{e_p}, \ldots, t_n^{e_p} \big\} \ge 1 - \frac{I_t\big( y_1^{e_0}, \ldots, y_n^{e_0}, \ldots, y_1^{e_p}, \ldots, y_n^{e_p};\, \Theta \big) + \log 2}{\log M},$

where $I_t(\cdot\,;\Theta)$ denotes the mutual information between $\Theta$ and $\{y_1^{e_0}, \ldots, y_n^{e_0}, \ldots, y_1^{e_p}, \ldots, y_n^{e_p}\}$ with the design points $t = \{t_1^{e_0}, \ldots, t_n^{e_0}; \ldots; t_1^{e_p}, \ldots, t_n^{e_p}\}$ held fixed. We can derive that

(5.5)    $E_t\big[ I_t(y_1^{e_0}, \ldots, y_n^{e_p};\, \Theta) \big] \le \binom{M}{2}^{-1} \sum_{b^{(l)} \ne b^{(q)}} E_t\, K\big( P_{f_{b^{(l)}}} \,\big|\, P_{f_{b^{(q)}}} \big) \le \frac{n(p+1)}{2} \binom{M}{2}^{-1} \sum_{b^{(l)} \ne b^{(q)}} E_t\, \|f_{b^{(l)}} - f_{b^{(q)}}\|_{*n}^2,$

where $K(\cdot\,|\,\cdot)$ is the Kullback-Leibler distance, $P_f$ is the conditional distribution of the $y_i^{e_0}$'s and $y_i^{e_j}$'s given $t$, and the norm $\|\cdot\|_{*n}$ is defined as

$\|f\|_{*n}^2 = \frac{1}{n(p+1)} \sum_{i=1}^n \bigg\{ [f(t_i^{e_0})]^2 + \sum_{j=1}^p [\partial f(t_i^{e_j})/\partial t_j]^2 \bigg\}, \qquad \forall f : X_1^r \to R.$

Thus,

(5.6)    $E_t\big[ I_t(y_1^{e_0}, \ldots, y_n^{e_p};\, \Theta) \big] \le \frac{n(p+1)}{2} \binom{M}{2}^{-1} \sum_{b^{(l)} \ne b^{(q)}} \Big\{ \|f_{b^{(l)}} - f_{b^{(q)}}\|_{L_2}^2 + \sum_{j=1}^p \|\partial f_{b^{(l)}}/\partial t_j - \partial f_{b^{(q)}}/\partial t_j\|_{L_2}^2 \Big\}$
$\le \frac{n(p+1)}{2} \max_{b^{(l)} \ne b^{(q)}} \Big\{ \|f_{b^{(l)}} - f_{b^{(q)}}\|_{L_2}^2 + \sum_{j=1}^p \|\partial f_{b^{(l)}}/\partial t_j - \partial f_{b^{(q)}}/\partial t_j\|_{L_2}^2 \Big\}$
$\le 2 n(p+1) \max_{b^{(l)} \in \{b^{(1)}, \ldots, b^{(M)}\}} \Big\{ \|f_{b^{(l)}}\|_{L_2}^2 + \sum_{j=1}^p \|\partial f_{b^{(l)}}/\partial t_j\|_{L_2}^2 \Big\}$
$\le 2 c_3 n(p+1) N^{-2m+2/r} (\log N)^{r-1}.$

Now, (5.4) yields

$\inf_{\tilde f} \sup_{f_0 \in H} P\Big\{ \|\tilde f - f_0\|_{L_2}^2 \ge \tfrac{1}{4} c_2 N^{-2m} \Big\} \ge \inf_{\hat\Theta} P\{\hat\Theta \ne \Theta\} \ge 1 - \frac{E_t\, I_t(y_1^{e_0}, \ldots, y_n^{e_p};\, \Theta) + \log 2}{\log M} \ge 1 - \frac{2 c_3 n(p+1) N^{-2m+2/r} (\log N)^{r-1} + \log 2}{c_0 (\log 2) N (\log N)^{r-1}/8}.$

Taking $N = c_4 n^{r/(2mr+r-2)}$ with an appropriate choice of $c_4$, we have

$\limsup_{n \to \infty} \inf_{\tilde f} \sup_{f_0 \in H} P\big\{ \|\tilde f - f_0\|_{L_2}^2 \ge C_1 n^{-2mr/(2mr+r-2)} \big\} > 0,$

where $C_1$ only depends on $\sigma_0^2$, the $\sigma_j^2$'s, $m$, $r$, $p$, and $d$. This completes the proof.

5.2. Proof of the minimax upper bound: Theorem 3.2.

Preliminaries for the proof. Denote by $\pi^{e_j}$ the density of the distribution $\Pi^{e_j}$, which by assumption is bounded away from zero and infinity, $j = 0, 1, \ldots, p$. First we introduce a norm on $H$: for any $f \in H$,

(5.7)    $\|f\|_R^2 = \frac{1}{p+1} \bigg[ \frac{1}{\sigma_0^2} \int f^2(t)\, \pi^{e_0}(t)\, dt + \sum_{j=1}^p \frac{1}{\sigma_j^2} \int \bigg( \frac{\partial f(t)}{\partial t_j} \bigg)^2 \pi^{e_j}(t)\, dt \bigg] + J(f).$

Note that $\|\cdot\|_R$ is a norm, since it is a quadratic form and equals zero if and only if $f = 0$. Let $\langle \cdot, \cdot \rangle_R$ be the inner product associated with $\|\cdot\|_R$. The following lemma shows that $\|\cdot\|_R$ is well defined on $H$ and is equivalent to the RKHS norm $\|\cdot\|_H$. In particular, $\|g\|_R < \infty$ if and only if $\|g\|_H < \infty$. The proof of this lemma is given in Section A.4.1 in the supplementary material.

Lemma 5.1. The norm $\|\cdot\|_R$ is equivalent to $\|\cdot\|_H$ on $H$.

We introduce another norm $\|\cdot\|_0$ as follows:

(5.8)    $\|f\|_0^2 = \frac{1}{p+1} \bigg[ \frac{1}{\sigma_0^2} \int f^2(t)\, \pi^{e_0}(t)\, dt + \sum_{j=1}^p \frac{1}{\sigma_j^2} \int \bigg( \frac{\partial f(t)}{\partial t_j} \bigg)^2 \pi^{e_j}(t)\, dt \bigg].$

Based on (5.8), we define a function space $F_0$ as the direct sum of the corresponding set of orthogonal subspaces in the decomposition of $\otimes_{j=1}^d L_2(X_1)$ as in (1.3), equipped with the norm $\|\cdot\|_0$. Let $\langle \cdot, \cdot \rangle_0$ be the inner product associated with $\|\cdot\|_0$ on $F_0$.

With the above two norms, we introduce one additional piece of notation. Denote the loss function in (2.5) by

$l_n(f) = \frac{1}{n(p+1)} \bigg[ \frac{1}{\sigma_0^2} \sum_{i=1}^n \{f(t_i^{e_0}) - y_i^{e_0}\}^2 + \sum_{j=1}^p \frac{1}{\sigma_j^2} \sum_{i=1}^n \bigg\{ \frac{\partial f(t_i^{e_j})}{\partial t_j} - y_i^{e_j} \bigg\}^2 \bigg],$

and write $l_{n\lambda}(f) = l_n(f) + \lambda J(f)$. The regularized estimator is then $\hat f_{n\lambda} = \mathrm{argmin}_{f \in H}\, l_{n\lambda}(f)$. Denote the expected loss by $l_\infty(f) = E\, l_n(f) = \|f - f_0\|_0^2 + 1$, and write $l_{\infty\lambda}(f) = l_\infty(f) + \lambda J(f)$. Note that $l_{\infty\lambda}(f)$ is a positive quadratic form in $f \in H$ and hence has a unique minimizer in $H$,

$\bar f_{\infty\lambda} = \mathop{\mathrm{argmin}}_{f \in H} l_{\infty\lambda}(f).$

Thus, we decompose

$\hat f_{n\lambda} - f_0 = (\hat f_{n\lambda} - \bar f_{\infty\lambda}) + (\bar f_{\infty\lambda} - f_0),$

where $(\hat f_{n\lambda} - \bar f_{\infty\lambda})$ is referred to as the stochastic error and $(\bar f_{\infty\lambda} - f_0)$ as the deterministic error. If the data $Y^{e_0}$ and $Y^{e_j}$'s in (1.1) are observed without random noise, as in deterministic computer experiments, then the total error consists only of the deterministic error, with $\hat f_{n\lambda} - f_0 = \bar f_{\infty\lambda} - f_0$. For brevity, we omit the subscripts of $\bar f_{\infty\lambda}$ and $\hat f_{n\lambda}$ hereafter if no confusion occurs.

Outline of the proof. Before proceeding, we make two remarks on the setup of Theorem 3.2. First, since the distributions $\Pi^{e_0}$ and the $\Pi^{e_j}$'s are known, by inverse transform sampling it suffices to consider uniform distributions. A detailed discussion of this inverse transform is given in Lemma A.12 in the supplementary material. Second, it suffices to consider $f_0$ having a periodic boundary on $X_1^d$ in the proof of this theorem. This is because $f_0$ is a tensor product function and each component function space is supported on a compact domain; thus we can smoothly extend $f_0$ to a larger compact support domain and achieve periodicity on the new boundary, for example, uniformly zero on the new boundary. These two simplifications make the proof easier to follow.

Recall the trigonometric basis on $L_2(X_1)$: $\psi_1(t) = 1$, $\psi_{2\nu}(t) = \sqrt{2} \cos 2\pi\nu t$ and $\psi_{2\nu+1}(t) = \sqrt{2} \sin 2\pi\nu t$ for $\nu \ge 1$. Write

(5.9)    $\phi_{\vec\nu}(t_1, \ldots, t_d) = \frac{\psi_{\nu_1}(t_1) \cdots \psi_{\nu_d}(t_d)}{\|\psi_{\nu_1}(t_1) \cdots \psi_{\nu_d}(t_d)\|_0}.$

Since $f_0$ has a periodic boundary on $X_1^d$ and $\pi^{e_j} \equiv 1$, the system $\{\phi_{\vec\nu}(t) : \vec\nu \in V\}$, with $V$ as in (5.1), forms an orthogonal basis for $H$ under $\langle \cdot, \cdot \rangle_R$; an orthogonal system for $L_2(X_1^d)$; and an orthonormal basis for $F_0$ under $\langle \cdot, \cdot \rangle_0$, that is, $\langle \phi_{\vec\nu}(t), \phi_{\vec\mu}(t) \rangle_0 = \delta_{\vec\nu\vec\mu}$, where $\delta_{\vec\nu\vec\mu}$ is Kronecker's delta. Hence, any $f \in H$ has the decomposition

(5.10)    $f(t_1, \ldots, t_d) = \sum_{\vec\nu \in V} f_{\vec\nu}\, \phi_{\vec\nu}(t_1, \ldots, t_d), \qquad \text{where } f_{\vec\nu} = \langle f(t), \phi_{\vec\nu}(t) \rangle_0.$

We denote by $\{\rho_{\vec\nu}\}_{\vec\nu \in V}$ the positive scalars such that $\langle \phi_{\vec\nu}, \phi_{\vec\mu} \rangle_R = (1 + \rho_{\vec\nu}) \delta_{\vec\nu\vec\mu}$. Then

(5.11)    $J(f) = \langle f, f \rangle_R - \langle f, f \rangle_0 = \sum_{\vec\nu \in V} \rho_{\vec\nu}\, f_{\vec\nu}^2.$
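A quick numerical check of the orthonormality claim for the univariate factors of (5.9) under the uniform density — the quadrature grid size is an arbitrary choice:

```python
import numpy as np

N = 100000
t = (np.arange(N) + 0.5) / N   # midpoint quadrature grid on [0, 1]

def psi(k, t):
    """psi_1 = 1, psi_{2v} = sqrt(2) cos(2 pi v t), psi_{2v+1} = sqrt(2) sin(2 pi v t)."""
    if k == 1:
        return np.ones_like(t)
    v, trig = k // 2, (np.cos if k % 2 == 0 else np.sin)
    return np.sqrt(2) * trig(2 * np.pi * v * t)

# Gram matrix of the first five basis functions; approximately the identity
G = np.array([[np.mean(psi(a, t) * psi(b, t)) for b in range(1, 6)]
              for a in range(1, 6)])
print(np.round(G, 6))
```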

First, we analyze the deterministic error $(\bar f - f_0)$. By (5.10), we write $f_0(t) = \sum_{\vec\nu \in V} f^0_{\vec\nu} \phi_{\vec\nu}(t)$ and $\bar f(t) = \sum_{\vec\nu \in V} \bar f_{\vec\nu} \phi_{\vec\nu}(t)$. Then $l_\infty(f) = \sum_{\vec\nu \in V} (f_{\vec\nu} - f^0_{\vec\nu})^2 + 1$, and

(5.12)    $\bar f_{\vec\nu} = \frac{f^0_{\vec\nu}}{1 + \lambda \rho_{\vec\nu}}, \qquad \vec\nu \in V.$

An upper bound for the deterministic error will be given in Lemma 5.2.

Second, we analyze the stochastic error $(\hat f - \bar f)$. The existence of the following Fréchet derivatives, for any $g, h \in H$, is guaranteed by Lemma A.1 in the supplementary material:

(5.13)    $D l_n(f) g = \frac{2}{n(p+1)} \bigg[ \frac{1}{\sigma_0^2} \sum_{i=1}^n \{f(t_i^{e_0}) - y_i^{e_0}\}\, g(t_i^{e_0}) + \sum_{j=1}^p \frac{1}{\sigma_j^2} \sum_{i=1}^n \bigg\{ \frac{\partial f(t_i^{e_j})}{\partial t_j} - y_i^{e_j} \bigg\} \frac{\partial g(t_i^{e_j})}{\partial t_j} \bigg],$

(5.14)    $D l_\infty(f) g = \frac{2}{p+1} \bigg[ \frac{1}{\sigma_0^2} \int \{f(t) - f_0(t)\}\, g(t)\, \pi^{e_0}(t) + \sum_{j=1}^p \frac{1}{\sigma_j^2} \int \bigg\{ \frac{\partial f(t)}{\partial t_j} - \frac{\partial f_0(t)}{\partial t_j} \bigg\} \frac{\partial g(t)}{\partial t_j}\, \pi^{e_j}(t) \bigg],$

(5.15)    $D^2 l_n(f) g h = \frac{2}{n(p+1)} \bigg[ \frac{1}{\sigma_0^2} \sum_{i=1}^n g(t_i^{e_0}) h(t_i^{e_0}) + \sum_{j=1}^p \frac{1}{\sigma_j^2} \sum_{i=1}^n \frac{\partial g(t_i^{e_j})}{\partial t_j} \frac{\partial h(t_i^{e_j})}{\partial t_j} \bigg],$

(5.16)    $D^2 l_\infty(f) g h = \frac{2}{p+1} \bigg[ \frac{1}{\sigma_0^2} \int g(t) h(t)\, \pi^{e_0}(t) + \sum_{j=1}^p \frac{1}{\sigma_j^2} \int \frac{\partial g(t)}{\partial t_j} \frac{\partial h(t)}{\partial t_j}\, \pi^{e_j}(t) \bigg] = 2 \langle g, h \rangle_0,$

where $D l_n(f)$, $D l_\infty(f)$, $D^2 l_n(f) g$, and $D^2 l_\infty(f) g$ are bounded linear operators on $H$. By the Riesz representation theorem, with slight abuse of notation, write

$D l_n(f) g = \langle D l_n(f), g \rangle_R, \quad D l_\infty(f) g = \langle D l_\infty(f), g \rangle_R, \quad D^2 l_n(f) g h = \langle D^2 l_n(f) g, h \rangle_R, \quad D^2 l_\infty(f) g h = \langle D^2 l_\infty(f) g, h \rangle_R.$

From [24, 46], there exists a bounded linear operator $U : F_0 \to H$ such that $U \phi_{\vec\nu} = (1 + \rho_{\vec\nu})^{-1} \phi_{\vec\nu}$ and $\langle f, U g \rangle_R = \langle f, g \rangle_0$ for any $f \in H$ and $g \in F_0$, and the restriction of $U$ to $H$ is self-adjoint and positive definite. By (5.16), we further derive

$D^2 l_{\infty\lambda}(f) \phi_{\vec\nu}(t) = 2 \big( U + \lambda (I - U) \big) \phi_{\vec\nu}(t) = 2 (1 + \rho_{\vec\nu})^{-1} (1 + \lambda \rho_{\vec\nu})\, \phi_{\vec\nu}(t).$

Define $G_\lambda \phi_{\vec\nu} = \frac{1}{2} D^2 l_{\infty\lambda}(f) \phi_{\vec\nu}$. By the Lax-Milgram theorem, $G_\lambda : H \to H$ has a bounded inverse $G_\lambda^{-1}$ on $H$, and

(5.17)    $G_\lambda^{-1} \phi_{\vec\nu} = (1 + \rho_{\vec\nu})(1 + \lambda \rho_{\vec\nu})^{-1} \phi_{\vec\nu}.$

Define

$f^* = \bar f - \frac{1}{2} G_\lambda^{-1} D l_{n\lambda}(\bar f).$

Then the stochastic error can be decomposed as

$\hat f - \bar f = (f^* - \bar f) + (\hat f - f^*).$

The two terms on the right-hand side will be studied separately; their upper bounds are given in Lemma 5.3 and Lemma 5.4, respectively.

Finally, we define the following norm, which is important in our later analysis: for $f \in H$,

(5.18)    $\|f\|_{L_2(a)}^2 = \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a f_{\vec\nu}^2\, \|\phi_{\vec\nu}\|_{L_2}^2, \qquad \text{for } 0 \le a \le 1,$

where $f_{\vec\nu} = \langle f, \phi_{\vec\nu} \rangle_0$. By direct calculation, when $a = 0$ this norm coincides with $\|\cdot\|_{L_2}$ on $F_0$, and when $a = 1$ it is equivalent to $\|\cdot\|_R$ on $H$.

Details of the proof. Now we give the details following the outline above. First, we present an upper bound for the deterministic error $(\bar f - f_0)$.

Lemma 5.2. For any $0 \le a \le 1$, the deterministic error satisfies

$\|\bar f - f_0\|_{L_2(a)}^2 = \begin{cases} O\big( \lambda^{1-a} J(f_0) \big) & \text{when } 0 \le p < d, \\ O\big( \lambda^{\frac{(1-a)mr}{mr-1}} J(f_0) \big) & \text{when } p = d. \end{cases}$

Proof. For any $0 \le a \le 1$, by (5.11) and (5.12), we have

(5.19)    $\|\bar f - f_0\|_{L_2(a)}^2 = \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a \bigg( \frac{\lambda \rho_{\vec\nu}}{1 + \lambda \rho_{\vec\nu}} \bigg)^2 (f^0_{\vec\nu})^2\, \|\phi_{\vec\nu}\|_{L_2}^2 \le \lambda^2 \sup_{\vec\nu \in V} \frac{(1 + \rho_{\vec\nu}/\|\phi_{\vec\nu}\|_{L_2}^2)^a\, \rho_{\vec\nu}\, \|\phi_{\vec\nu}\|_{L_2}^2}{(1 + \lambda \rho_{\vec\nu})^2} \sum_{\vec\nu \in V} \rho_{\vec\nu} (f^0_{\vec\nu})^2 \lesssim \lambda^2 J(f_0) \sup_{\vec\nu \in V} \frac{\big( \prod_{k=1}^d \nu_k^{2m} \big)^{1+a}}{\big( 1 + \sum_{j=1}^p \nu_j^2 + \lambda \prod_{k=1}^d \nu_k^{2m} \big)^2}.$

Write

$B_\lambda(\vec\nu) = \frac{\big( \prod_{k=1}^d \nu_k^{2m} \big)^{1+a}}{\big( 1 + \sum_{j=1}^p \nu_j^2 + \lambda \prod_{k=1}^d \nu_k^{2m} \big)^2}, \qquad \vec\nu \in V.$

We discuss $B_\lambda(\vec\nu)$ for $0 \le p \le d-1$ and $p = d$ separately.

For $0 \le p \le d-1$: since $\vec\nu \in V$, at most $r$ of $\nu_1, \ldots, \nu_d$ are not equal to 1. Fix any value of $x = \prod_{k=1}^d \nu_k^{-2m} > 0$. Then $B_\lambda(\vec\nu)$ is maximized by making $\sum_{j=1}^p \nu_j^2$ as small as possible, which implies $\nu_1 = \nu_2 = \cdots = \nu_p = 1$. Then

(5.20)    $\sup_{\vec\nu \in V} B_\lambda(\vec\nu) \asymp \sup_{(\nu_{p+1}, \ldots, \nu_{(p+r)\wedge d})^\top \in N^{r \wedge (d-p)}} \frac{\prod_{k=p+1}^{(p+r)\wedge d} \nu_k^{2m(1+a)}}{\big( 1 + \lambda \prod_{k=p+1}^{(p+r)\wedge d} \nu_k^{2m} \big)^2} \asymp \sup_{x > 0} \frac{x^{-(1+a)}}{(1 + \lambda x^{-1})^2} \asymp \lambda^{-(a+1)},$

where the last step is achieved when $x \asymp \lambda$.

For $p = d$: since $\vec\nu \in V$ and by the symmetry of the coordinates $\nu_1, \ldots, \nu_d$, assume that all indices except $\nu_1, \ldots, \nu_r$ equal 1. Letting $z = \prod_{j=1}^r \nu_j^{-2m} > 0$, we obtain

(5.21)    $\sup_{\vec\nu \in V} B_\lambda(\vec\nu) \asymp \sup_{z > 0} \frac{z^{-(1+a)}}{(z^{-1/mr} + \lambda z^{-1})^2} \asymp \lambda^{\frac{2 - (1+a)mr}{mr-1}},$

where the last step is achieved when $z \asymp \lambda^{mr/(mr-1)}$. Combining (5.19), (5.20) and (5.21), we complete the proof.

Second, we give an upper bound for $(f^* - \bar f)$, the first part of the stochastic error.

Lemma 5.3. When $0 \le p < d$, we have for any $0 \le a < 1 - 1/2m$,

$\|f^* - \bar f\|_{L_2(a)}^2 = O_P\big\{ n^{-1} \lambda^{-(a + 1/2m)} [\log(1/\lambda)]^{(d-p)\wedge r - 1} \big\}.$

When $p = d$, we have for any $0 \le a \le 1$,

$\|f^* - \bar f\|_{L_2(a)}^2 = \begin{cases} O_P\big\{ n^{-1} \lambda^{\frac{mr}{1-mr}\left(a + \frac{r-2}{2mr}\right)} \big\} & \text{if } r \ge 3; \\ O_P\big( n^{-1} \log(1/\lambda) \big) & \text{if } r = 2,\ a = 0; \qquad O_P(n^{-1}) & \text{if } r = 2,\ 0 < a \le 1; \\ O_P(n^{-1}) & \text{if } r = 1,\ a < \frac{1}{2m}; \qquad O_P\big( n^{-1} \log(1/\lambda) \big) & \text{if } r = 1,\ a = \frac{1}{2m}; \\ O_P\big\{ n^{-1} \lambda^{\frac{1-2ma}{2m-2}} \big\} & \text{if } r = 1,\ a > \frac{1}{2m}. \end{cases}$

Proof. Notice that $D l_{n\lambda}(\bar f) = D l_{n\lambda}(\bar f) - D l_{\infty\lambda}(\bar f) = D l_n(\bar f) - D l_\infty(\bar f)$. Hence, for any $g \in H$,

(5.22)    $E \Big| \frac{1}{2} D l_{n\lambda}(\bar f) g \Big|^2 = E \Big| \frac{1}{2} D l_n(\bar f) g - \frac{1}{2} D l_\infty(\bar f) g \Big|^2$
$= \frac{1}{n(p+1)^2} \mathrm{Var}\bigg[ \frac{1}{\sigma_0^2} \big\{ \bar f(t^{e_0}) - Y^{e_0} \big\} g(t^{e_0}) + \sum_{j=1}^p \frac{1}{\sigma_j^2} \bigg\{ \frac{\partial \bar f(t^{e_j})}{\partial t_j} - Y^{e_j} \bigg\} \frac{\partial g(t^{e_j})}{\partial t_j} \bigg]$
$\le \frac{1}{n(p+1)} \bigg[ \frac{1}{\sigma_0^4} E \big\{ \bar f(t^{e_0}) - f_0(t^{e_0}) \big\}^2 \{g(t^{e_0})\}^2 + \frac{1}{\sigma_0^2} E \{g(t^{e_0})\}^2 + \sum_{j=1}^p \frac{1}{\sigma_j^4} E \bigg\{ \frac{\partial \bar f(t^{e_j})}{\partial t_j} - \frac{\partial f_0(t^{e_j})}{\partial t_j} \bigg\}^2 \bigg\{ \frac{\partial g(t^{e_j})}{\partial t_j} \bigg\}^2 + \sum_{j=1}^p \frac{1}{\sigma_j^2} E \bigg\{ \frac{\partial g(t^{e_j})}{\partial t_j} \bigg\}^2 \bigg]$
$\le \frac{1}{n(p+1)} \bigg[ \frac{1}{\sigma_0^4} c_K^{2d} \|\bar f - f_0\|_R^2\, E \{g(t^{e_0})\}^2 + \frac{1}{\sigma_0^2} E \{g(t^{e_0})\}^2 + \sum_{j=1}^p \frac{1}{\sigma_j^4} c_K^{2d} \|\bar f - f_0\|_R^2\, E \bigg\{ \frac{\partial g(t^{e_j})}{\partial t_j} \bigg\}^2 + \sum_{j=0}^p \frac{1}{\sigma_j^2} E \bigg\{ \frac{\partial g(t^{e_j})}{\partial t_j} \bigg\}^2 \bigg]$
$\lesssim n^{-1} \|g\|_0^2,$

where the third step is by Lemma 5.1 and Lemma A.9 in the supplementary material, and the last step is by Lemma 5.2 and the definition of the norm $\|\cdot\|_0$. From the definition of $G_\lambda^{-1}$ in (5.17), we have, for all $g \in H$,

$\|G_\lambda^{-1} g\|_{L_2(a)}^2 = \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a (1 + \lambda \rho_{\vec\nu})^{-2}\, \|\phi_{\vec\nu}\|_{L_2}^2\, \langle g, \phi_{\vec\nu} \rangle_R^2.$

Then, by the definition of $f^*$,

$E \|f^* - \bar f\|_{L_2(a)}^2 = E \Big\| \frac{1}{2} G_\lambda^{-1} D l_{n\lambda}(\bar f) \Big\|_{L_2(a)}^2 = \frac{1}{4} E \bigg[ \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a (1 + \lambda \rho_{\vec\nu})^{-2} \|\phi_{\vec\nu}\|_{L_2}^2\, \langle D l_{n\lambda}(\bar f), \phi_{\vec\nu} \rangle_R^2 \bigg]$
$\le \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a (1 + \lambda \rho_{\vec\nu})^{-2} \|\phi_{\vec\nu}\|_{L_2}^2\, E \Big| \frac{1}{2} D l_{n\lambda}(\bar f) \phi_{\vec\nu} \Big|^2$
$\lesssim n^{-1} \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a (1 + \lambda \rho_{\vec\nu})^{-2} \|\phi_{\vec\nu}\|_{L_2}^2\, \|\phi_{\vec\nu}\|_0^2 \asymp n^{-1} N_a(\lambda),$

where the fourth step is by (5.22), and the last step uses $\|\phi_{\vec\nu}\|_0 = 1$, $\|\phi_{\vec\nu}\|_{L_2}^2 \asymp (1 + \sum_{j=1}^p \nu_j^2)^{-1}$, $\rho_{\vec\nu} \asymp (1 + \sum_{j=1}^p \nu_j^2)^{-1} \prod_{k=1}^d \nu_k^{2m}$, and $N_a(\lambda)$ is defined in Lemma A.7 in the supplementary material. Hence, by Lemma A.7, we complete the proof.

Next, we give an upper bound for $(\hat f - f^*)$, the other part of the stochastic error. Since $l_{n\lambda}(f)$ is a quadratic form in $f$, the Taylor expansion of $D l_{n\lambda}(\hat f) = 0$ at $\bar f$ gives

$D l_{n\lambda}(\bar f) + D^2 l_{n\lambda}(\bar f)(\hat f - \bar f) = 0,$

and by the definitions of $f^*$ and $G_\lambda$, we have

$D l_{n\lambda}(\bar f) + D^2 l_{\infty\lambda}(\bar f)(f^* - \bar f) = 0.$

Thus $G_\lambda(\hat f - f^*) = \frac{1}{2} D^2 l_\infty(\bar f)(\hat f - \bar f) - \frac{1}{2} D^2 l_n(\bar f)(\hat f - \bar f)$, and

(5.23)    $\hat f - f^* = G_\lambda^{-1} \Big[ \frac{1}{2} D^2 l_\infty(\bar f)(\hat f - \bar f) - \frac{1}{2} D^2 l_n(\bar f)(\hat f - \bar f) \Big].$

Lemma 5.4. If $n^{-1} \lambda^{-(2a + 3/2m)} [\log(1/\lambda)]^{r-1} \to 0$ and $1/2m < a < (2m-3)/4m$, then for any $0 \le c \le a + 1/m$,

$\|\hat f - f^*\|_{L_2(c)}^2 = o_P\big\{ \|f^* - \bar f\|_{L_2(c)}^2 \big\}.$

Proof. A sufficient condition for this lemma is that, for any $1/(2m) < a < (2m-3)/(4m)$ and $0 \le c \le a + 1/m$,

(5.24)    $\|\hat f - f^*\|_{L_2(c)}^2 = \begin{cases} O_P\big( n^{-1} \lambda^{-(c+a+1/2m)} [\log(1/\lambda)]^{r \wedge (d-p) - 1} \big)\, \|\hat f - \bar f\|_{L_2(a+1/m)}^2 & \text{if } 0 \le p < d, \\ O_P\big\{ n^{-1} \lambda^{\frac{mr}{1-mr}\left(a + c + \frac{r-2}{2mr}\right)} \big\}\, \|\hat f - \bar f\|_{L_2(a+1/m)}^2 & \text{if } p = d,\ r \ge 3, \\ O_P(n^{-1})\, \|\hat f - \bar f\|_{L_2(a+1/m)}^2 & \text{if } p = d,\ r = 2, \\ O_P\big( n^{-1} \lambda^{\frac{1 - 2m(a+c)}{2m-2}} \big)\, \|\hat f - \bar f\|_{L_2(a+1/m)}^2 & \text{if } p = d,\ r = 1. \end{cases}$

This is because, once (5.24) is established, letting $c = a + 1/m$ and using the assumption $n^{-1} \lambda^{-(2a+3/2m)} [\log(1/\lambda)]^{r-1} \to 0$, we have

$\|\hat f - f^*\|_{L_2(a+1/m)}^2 = o_P(1)\, \|\hat f - \bar f\|_{L_2(a+1/m)}^2.$

By the triangle inequality, $\|f^* - \bar f\|_{L_2(a+1/m)} \ge \|\hat f - \bar f\|_{L_2(a+1/m)} - \|\hat f - f^*\|_{L_2(a+1/m)} = [1 - o_P(1)]\, \|\hat f - \bar f\|_{L_2(a+1/m)}$, which implies $\|\hat f - \bar f\|_{L_2(a+1/m)}^2 = O_P\{ \|f^* - \bar f\|_{L_2(a+1/m)}^2 \}$. Thus, by (5.24) and Lemma 5.3, we complete the proof.

We are now in a position to prove (5.24). For any $0 \le c \le a + 1/m$, by (5.23), we have

(5.25)    $\|\hat f - f^*\|_{L_2(c)}^2 = \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^c (1 + \lambda \rho_{\vec\nu})^{-2} \|\phi_{\vec\nu}\|_{L_2}^2 \Big[ \frac{1}{2} D^2 l_\infty(\bar f)(\hat f - \bar f) \phi_{\vec\nu} - \frac{1}{2} D^2 l_n(\bar f)(\hat f - \bar f) \phi_{\vec\nu} \Big]^2$
$\le \sum_{\vec\nu \in V} \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^c (1 + \lambda \rho_{\vec\nu})^{-2} \|\phi_{\vec\nu}\|_{L_2}^2 \times \frac{1}{p+1} \bigg\{ \bigg[ \frac{1}{n \sigma_0^2} \sum_{i=1}^n (\hat f - \bar f)(t_i^{e_0}) \phi_{\vec\nu}(t_i^{e_0}) - \frac{1}{\sigma_0^2} \int (\hat f - \bar f)(t) \phi_{\vec\nu}(t) \bigg]^2 + \sum_{j=1}^p \bigg[ \frac{1}{n \sigma_j^2} \sum_{i=1}^n \frac{\partial (\hat f - \bar f)}{\partial t_j}(t_i^{e_j}) \frac{\partial \phi_{\vec\nu}}{\partial t_j}(t_i^{e_j}) - \frac{1}{\sigma_j^2} \int \frac{\partial (\hat f - \bar f)(t)}{\partial t_j} \frac{\partial \phi_{\vec\nu}(t)}{\partial t_j} \bigg]^2 \bigg\}.$

Denote $g_j(t) = \frac{1}{\sigma_j^2} \frac{\partial (\hat f - \bar f)}{\partial t_j} \frac{\partial \phi_{\vec\nu}}{\partial t_j}$ and $g_0(t) = \frac{1}{\sigma_0^2} (\hat f - \bar f)\, \phi_{\vec\nu}$. We can expand $g_j$ in the basis $\{\phi_{\vec\mu}\}_{\vec\mu \in N^d}$:

(5.26)    $g_j(t) = \sum_{\vec\mu \in N^d} Q^j_{\vec\mu}\, \phi_{\vec\mu}(t), \qquad \text{where } Q^j_{\vec\mu} = \langle g_j(t), \phi_{\vec\mu}(t) \rangle_0.$

Unlike (5.10), where the multi-index satisfies $\vec\nu \in V$, we require $\vec\mu \in N^d$ in (5.26), since $g_j(t)$ is a product function. By the Cauchy-Schwarz inequality,

(5.27)    $\bigg[ \frac{1}{n \sigma_j^2} \sum_{i=1}^n \frac{\partial (\hat f - \bar f)}{\partial t_j}(t_i^{e_j}) \frac{\partial \phi_{\vec\nu}}{\partial t_j}(t_i^{e_j}) - \frac{1}{\sigma_j^2} \int \frac{\partial (\hat f - \bar f)(t)}{\partial t_j} \frac{\partial \phi_{\vec\nu}(t)}{\partial t_j} \bigg]^2 = \bigg[ \sum_{\vec\mu \in N^d} Q^j_{\vec\mu} \bigg( \frac{1}{n} \sum_{i=1}^n \phi_{\vec\mu}(t_i^{e_j}) - \int \phi_{\vec\mu}(t) \bigg) \bigg]^2$
$\le \bigg[ \sum_{\vec\mu \in N^d} (Q^j_{\vec\mu})^2 \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^a \|\phi_{\vec\mu}\|_{L_2}^2 \bigg] \times \bigg[ \sum_{\vec\mu \in N^d} \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^{-a} \|\phi_{\vec\mu}\|_{L_2}^{-2} \bigg( \frac{1}{n} \sum_{i=1}^n \phi_{\vec\mu}(t_i^{e_j}) - \int \phi_{\vec\mu}(t) \bigg)^2 \bigg].$

For brevity, we write $f(t) = \partial f/\partial t_0$. By Lemma A.11 in the supplementary material, if $a > 1/2m$, the sum of the first part in (5.27) over $j = 0, \ldots, p$ is bounded by

(5.28)    $\sum_{j=0}^p \sum_{\vec\mu \in N^d} \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^a \|\phi_{\vec\mu}\|_{L_2}^2 \bigg\langle \frac{\partial (\hat f - \bar f)}{\partial t_j} \frac{\partial \phi_{\vec\nu}}{\partial t_j},\, \phi_{\vec\mu} \bigg\rangle_0^2$
$\lesssim \|\hat f - \bar f\|_{L_2(a+1/m)}^2 \sum_{j=0}^p \sum_{\vec\mu \in N^d} \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^a \|\phi_{\vec\mu}\|_{L_2}^2 \bigg\langle \frac{\partial \phi_{\vec\nu}}{\partial t_j},\, \phi_{\vec\mu} \bigg\rangle_0^2$
$\lesssim \|\hat f - \bar f\|_{L_2(a+1/m)}^2 \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a \|\phi_{\vec\nu}\|_{L_2}^2 \bigg( 1 + \sum_{j=1}^p \nu_j^2 \bigg) \asymp \|\hat f - \bar f\|_{L_2(a+1/m)}^2 \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a.$

The second part of (5.27) can be bounded by

(5.29)    $E \bigg[ \sum_{\vec\mu \in N^d} \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^{-a} \|\phi_{\vec\mu}\|_{L_2}^{-2} \bigg( \frac{1}{n} \sum_{i=1}^n \phi_{\vec\mu}(t_i^{e_j}) - \int \phi_{\vec\mu}(t) \bigg)^2 \bigg] \le \sum_{\vec\mu \in N^d} \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^{-a} \|\phi_{\vec\mu}\|_{L_2}^{-2}\, \frac{1}{n} \int \phi_{\vec\mu}^2(t)$
$\asymp n^{-1} \sum_{\vec\mu \in N^d} \bigg( 1 + \frac{\rho_{\vec\mu}}{\|\phi_{\vec\mu}\|_{L_2}^2} \bigg)^{-a} \lesssim n^{-1} \sum_{\vec\mu \in N^d} \mu_1^{-2ma} \cdots \mu_d^{-2ma} \le n^{-1} \bigg( \sum_{\mu_1=1}^\infty \mu_1^{-2ma} \bigg)^d \asymp n^{-1},$

where the third step uses $\rho_{\vec\mu}/\|\phi_{\vec\mu}\|_{L_2}^2 \asymp \mu_1^{2m} \cdots \mu_d^{2m}$, and the fourth step holds for $a > 1/2m$. Combining (5.27), (5.28) and (5.29), we have, for $a > 1/2m$,

(5.30)    $E \bigg\{ \bigg[ \frac{1}{n \sigma_0^2} \sum_{i=1}^n (\hat f - \bar f)(t_i^{e_0}) \phi_{\vec\nu}(t_i^{e_0}) - \frac{1}{\sigma_0^2} \int (\hat f - \bar f)(t) \phi_{\vec\nu}(t) \bigg]^2 + \sum_{j=1}^p \bigg[ \frac{1}{n \sigma_j^2} \sum_{i=1}^n \frac{\partial (\hat f - \bar f)}{\partial t_j}(t_i^{e_j}) \frac{\partial \phi_{\vec\nu}}{\partial t_j}(t_i^{e_j}) - \frac{1}{\sigma_j^2} \int \frac{\partial (\hat f - \bar f)(t)}{\partial t_j} \frac{\partial \phi_{\vec\nu}(t)}{\partial t_j} \bigg]^2 \bigg\} \lesssim \frac{1}{n} \|\hat f - \bar f\|_{L_2(a+1/m)}^2 \bigg( 1 + \frac{\rho_{\vec\nu}}{\|\phi_{\vec\nu}\|_{L_2}^2} \bigg)^a.$

Therefore, if $1/2m < a < (2m-3)/4m$ and $0 \le c \le a + 1/m$, (5.25) and (5.30) imply that

$E \|\hat f - f^*\|_{L_2(c)}^2 \lesssim n^{-1} \|\hat f - \bar f\|_{L_2(a+1/m)}^2\, N_{a+c}(\lambda).$

By Lemma A.7 in the supplementary material, we complete the proof of (5.24) and of this lemma.

Finally, combining Lemma 5.2, Lemma 5.3 and Lemma 5.4, we obtain the following proposition.

Proposition 5.5. Under the conditions of Theorem 3.1, assume that the distributions $\Pi^{e_0}$ and the $\Pi^{e_j}$'s are known. If $1/2m < a < (2m-3)/4m$, $m > 2$, and $n^{-1} \lambda^{-(2a+3/2m)} [\log(1/\lambda)]^{r-1} \to 0$, then for any $c \in [0, a + 1/m]$, the estimator $\hat f$ given by (2.5) satisfies, when $0 \le p < d$,

$\|\hat f - f_0\|_{L_2(c)}^2 = O\{ \lambda^{1-c} J(f_0) \} + O_P\big\{ n^{-1} \lambda^{-(c + 1/2m)} [\log(1/\lambda)]^{r \wedge (d-p) - 1} \big\},$

and when $p = d$,

$\|\hat f - f_0\|_{L_2(c)}^2 = \begin{cases} O\big( \lambda^{\frac{(1-c)mr}{mr-1}} J(f_0) \big) + O_P\big\{ n^{-1} \lambda^{\frac{mr}{1-mr}\left(c + \frac{r-2}{2mr}\right)} \big\} & \text{if } r \ge 3, \\ O\big\{ \lambda^{\frac{2m}{2m-1}} J(f_0) \big\} + O_P\big( n^{-1} \log(1/\lambda) \big) & \text{if } r = 2,\ c = 0, \\ O\big( \lambda^{\frac{2(1-c)m}{2m-1}} J(f_0) \big) + O_P\big\{ n^{-1} \lambda^{\frac{2mc}{1-2m}} \big\} & \text{if } r = 2,\ c > 0, \\ O\big( \lambda^{\frac{(1-c)m}{m-1}} J(f_0) \big) + O_P(n^{-1}) & \text{if } r = 1,\ c < \frac{1}{2m}, \\ O\big\{ \lambda^{\frac{2m-1}{2(m-1)}} J(f_0) \big\} + O_P\big( n^{-1} \log(1/\lambda) \big) & \text{if } r = 1,\ c = \frac{1}{2m}, \\ O\big( \lambda^{\frac{(1-c)m}{m-1}} J(f_0) \big) + O_P\big\{ n^{-1} \lambda^{\frac{1-2mc}{2m-2}} \big\} & \text{if } r = 1,\ c > \frac{1}{2m}. \end{cases}$

Many results on the regularized estimator $\hat f$, including Theorem 3.2, can be derived from Proposition 5.5. Indeed, consider $p = d$ and $r \ge 3$: letting $\lambda \asymp n^{-\frac{2mr-2}{(2m+1)r-2}}$, $a = 1/2m + \epsilon$ for some $\epsilon > 0$ and $c = 0$, the condition $n^{-1} \lambda^{-(2a+3/2m)} [\log(1/\lambda)]^{r-1} \to 0$ is equivalent to

(5.31)    $-1 + \frac{5(mr-1)}{2m^2 r + mr - 2m} < 0,$

and $m > 2$ is sufficient for (5.31). Thus the conditions of Proposition 5.5 are satisfied. Similarly, we can verify that when $p = d$ and $r = 2$, $\lambda \asymp [n(\log n)]^{-(2m-1)/2m}$ satisfies the conditions of Proposition 5.5. When $p = d$ and $r = 1$, $\lambda \lesssim n^{-(m-1)/m}$ satisfies the conditions of the proposition. When $0 \le p \le d - r$, $\lambda \asymp [n(\log n)^{1-r}]^{-2m/(2m+1)}$ satisfies the conditions of the proposition, as does $\lambda \asymp [n(\log n)^{1+p-d}]^{-2m/(2m+1)}$ when $d - r < p < d$. This completes the proof of Theorem 3.2.

5.3. Proof of Corollary 3.3. This corollary can be derived directly from Proposition 5.5. Observe that

$\int_{X_1^d} \bigg[ \frac{\partial^d \hat f_{n\lambda}(t)}{\partial t_1 \cdots \partial t_d} - \frac{\partial^d f_0(t)}{\partial t_1 \cdots \partial t_d} \bigg]^2 dt \asymp \|\hat f_{n\lambda} - f_0\|_{L_2(1/m)}^2.$

If $d - r < p < d$, we let $c = a = 1/m$ and $\lambda \asymp [n(\log n)^{1+p-d}]^{-2m/(2m+1)}$ in Proposition 5.5; then the condition $n^{-1} \lambda^{-(2a+3/2m)} [\log(1/\lambda)]^{r-1} \to 0$ is equivalent to

(5.32)    $-1 + 7/(2m+1) < 0,$

and $m > 3$ is sufficient for (5.32). Thus the conditions of Proposition 5.5 are satisfied, and Proposition 5.5 yields the rate of convergence

$O_P\big( [n(\log n)^{1+p-d}]^{-2(m-1)/(2m+1)} \big)$

for $\|\hat f_{n\lambda} - f_0\|_{L_2(1/m)}^2$. Similarly, if $0 \le p \le d - r$, we let $\lambda \asymp [n(\log n)^{1-r}]^{-2m/(2m+1)}$; if $p = d$ and $r \ge 3$, let $\lambda \asymp n^{-2(mr-1)/(2mr+r-2)}$; if $p = d$ and $r = 2$, let $\lambda \asymp n^{-(2m-1)/2m}$; if $p = d$ and $r = 1$, let $\lambda \asymp n^{-(2m-2)/(2m-1)}$; then the conditions of Proposition 5.5 are satisfied. This completes the proof of Corollary 3.3.

6. Discussion. This paper is the first to study minimax optimal rates for nonparametric estimation when data from first-order partial derivatives are available. We study function estimation and partial derivative estimation with functional ANOVA models, whereas few existing results in the literature concern partial derivative estimation.

In Theorems 2.1, 2.2, 3.1 and 3.2, we assume that all component functions come from a common RKHS $H_1$. We also assume that the eigenvalues decay at a polynomial rate, which is true for Sobolev kernels and other widely used kernels. More general settings are also interesting: for example, the component RKHSs may differ, the eigenvalues may decay at different polynomial rates or even exponentially, and the method of regularization in (2.5) may use other goodness-of-fit measures. It would of course be of great interest to extend our results to a broad class of bounded linear functionals and to multivariate function spaces without tensor product structure. We leave these open for future studies.

ACKNOWLEDGEMENTS

X. Dai would like to thank Yuhua Zhu and Cuize Han for helpful discus-sions. We thank Grace Wahba for very helpful comments on an early versionof the manuscript.

Page 29: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

28 X. DAI AND P. CHIEN

APPENDIX A: PROOFS OF TECHNICAL RESULTS

This appendix consists of five parts. In Section A.1, we give a brief reviewon Frechet derivative which is used in (5.13), (5.14), (5.15) and (5.16) in themain text. In Section A.2, we give the proofs for results with deterministicdesigns in Section 2. In Section A.3, we prove the results of estimating partialderivatives in Section 4. We present some key lemmas used for the proofs inSection A.4. All auxiliary technical lemmas are deferred to Section A.5.

A.1. Frechet derivative of an operator. LetX and Y be the normedlinear spaces. The Frechet derivative of an operator F : X 7! Y is a boundedlinear operator DF (a) : X 7! Y with

limh!0,h2X

kF (a+ h)� F (a)�DF (a)hkY

khkX

= 0.

For illustration, if F (a + h) � F (a) = Lh + R(a, h) with a linear opera-tor L and kR(a, h)k

Y

/khkX

! 0 as h ! 0, then by the above definition,L = DF (a) is the Frechet derivative of F (·). The reader is referred to el-ementary functional analysis textbooks such as Cartan [41] for a thoroughinvestigation on Frechet derivative.

Lemma A.1. With the norm k·kR

in (5.7), the first order Frechet deriva-tive of the functional l

n

(·) for any f, g 2 H is

Dln

(f)g =2

n(p+ 1)

"

1

�20

n

X

i=1

{f(te0i

)� ye0i

}g(te0i

)

+p

X

j=1

1

�2j

n

X

i=1

@f(tej

i

)

@tj

� yej

i

@g(tej

i

)

@tj

3

5 .

The second order Frechet derivative of ln

(·) for any f, g, h 2 H is

D2ln

(f)gh =2

n(p+ 1)

"

1

�20

n

X

i=1

g(te0i

)h(te0i

)

+p

X

j=1

1

�2j

n

X

i=1

@g(tej

i

)

@tj

@h(tej

i

)

@tj

3

5 .

Page 30: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 29

Proof. By direct calculations, we have

ln

(f + g)� ln

(f) =2

n(p+ 1)

"

1

�20

n

X

i=1

{f(te0i

)� ye0i

}g(te0i

)

+p

X

j=1

1

�2j

n

X

i=1

@f(tej

i

)

@tj

� yej

i

@g(tej

i

)

@tj

3

5+Rn

(f, g),

where

Rn

(f, g) =1

n(p+ 1)

2

4

1

�20

n

X

i=1

g2(te0i

) +p

X

j=1

1

�2j

n

X

i=1

@g(tej

i

)

@tj

2

3

5

= kgk20

+O(n�1/2).

Note that |Rn

(f, g)|/kgkR

! 0 as kgkR

! 0 and n1/2kgkR

! 1. This provesthe form of Dl

n

(f)g in the lemma. For the second order Frechet derivative,note that

Dln

(f + h)g �Dln

(f)g

=2

n(p+ 1)

2

4

1

�20

n

X

i=1

g(te0i

)h(te0i

) +p

X

j=1

1

�2j

n

X

i=1

@g(tej

i

)

@tj

@h(tej

i

)

@tj

3

5 ,

which is linear in h. By definition of Frechet derivatives, we conclude theform of D2l

n

(f)gh in the lemma.

We remark that following a similar derivation in the above proof, we canobtain the first and the second order Frechet derivatives of the functionall1(·) in (5.14) and (5.16), respectively.

A.2. Proofs for Section 2: regular lattices. For brevity, we shallassume the regular lattice (2.1) is l

1

= · · · = ld

= l and n = ld. The moregeneral case can be showed similarly. Write

(A.1) 1

(t) = 1, 2⌫

(t) =p2 cos 2⇡⌫t,

2⌫+1

(t) =p2 sin 2⇡⌫t,

for ⌫ � 1. Since f0

has periodic boundaries on X d

1

, { ⌫

(t)}⌫�1

forms anorthonormal system in L

2

(X1

) and an eigenfunction system for K. For ad-dimensional vector ~⌫ = (⌫

1

, . . . , ⌫d

) 2 Nd, write

(A.2) ~⌫(t) =

⌫1(t1) · · · ⌫d(td) and �~⌫ = �

⌫1�⌫2 · · ·�⌫d ,

Page 31: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

30 X. DAI AND P. CHIEN

where �⌫ks and

⌫k(tk)s are defined in (2.4) with k = 1, . . . , d. Then, anyfunction f(·) in H admits the Fourier expansion f(t) =

P

~⌫2Nd ✓~⌫ ~⌫(t),

where ✓~⌫ = hf(t),

~⌫(t)iL2 , and J(f) =P

~⌫2Nd ��1

~⌫ ✓2~⌫ . We also write f

0

(t) =P

~⌫2Nd ✓0~⌫ ~⌫(t).

By Page 23 of Wahba [45], it is known that

l�1

l

X

i=1

µ

(i/l) ⌫

(i/l) =

(

1, if µ = ⌫ = 1, . . . , l,

0, if µ 6= ⌫, µ, ⌫ = 1, . . . , l.

Define~ ~⌫ = (

~⌫(t1), . . . , ~⌫(tn))>,

where {t1

, . . . , tn

} are design points in (2.1). Thus, we have

h~ ~⌫ , ~ ~µin =

(

1, if ⌫k

= µk

= 1, . . . , l; k = 1, . . . , d,

0, if there exists some k such that ⌫k

6= µk

,

where h·, ·in

is the empirical inner product in Rn. This implies that {~ ~⌫ |⌫k =

1, . . . , l; k = 1, . . . , d} form an orthogonal basis in Rn with respect to the em-pirical norm k·k

n

. Denote the observed data vectors by ye0 = (ye01

, . . . , ye0n

)>

and yej = (yej

1

, . . . , yejn

)>, and write

(A.3)

8

>

<

>

:

ze0~⌫ = hye0 , ~

~⌫in,zej

⌫1,...,2⌫k�1,...,⌫d= (2⇡)�1hyej , ~

⌫1,...,2⌫k,...,⌫din,zej

⌫1,...,2⌫k,...,⌫d= �(2⇡)�1hyej , ~

⌫1,...,2⌫k�1,...,⌫din,

for ⌫k

= 1, . . . , l and k = 1, . . . , d. Then ze0~⌫ = ✓0

~⌫ + �e0~⌫ and z

ej

~⌫ = ⌫j

✓0~⌫ + �

ej

~⌫ ,

where ✓0~⌫ = ✓0

~⌫ +P

µk�l,k=1,...,d

✓0~µh~ ~⌫ , ~ ~µin, and �e0

~⌫ , �ej

~⌫ are all independent

with mean 0 and variance �20

/n and �2j

/n, respectively.

A.2.1. Proof of minimax lower bound: Theorem 2.1. We now prove thelower bound for estimating functions under the regular lattice. By the datatransformation (A.3), it su�ces to show the optimal rate in a special case

(A.4)

(

ze0~⌫ = ✓0

~⌫ + �e0~⌫ ,

zej

~⌫ = ⌫j

✓0~⌫ + �

ej

~⌫ , for 1 j p,

where �ej

~⌫ ⇠ N (0,�2j

/n) are independent. For any ~⌫ 2 Nd, if we have the

prior that |✓0~⌫ | ⇡

~⌫ , then the minimax linear estimator is

b✓L~⌫ =

��2

0

ze0~⌫ +

P

p

j=1

��2

j

⌫j

zej

~⌫

n�1⇡�2

~⌫ + ��2

0

+P

p

j=1

��2

j

⌫2j

,

Page 32: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 31

and the minimax linear risk is

n�1

2

4n�1⇡�2

~⌫ + ��2

0

+p

X

j=1

��2

j

⌫2j

3

5

�1

.

By Lemma 6 and Theorem 7 in Donoho, Liu and MacGibbon [43], if �2j

s are

known, the minimax risk of estimating ✓0~⌫ under the model (A.4) is larger

than 80% of the minimax linear risk of the hardest rectangle subproblem,and the latter linear risk is

(A.5) RL = n�1 maxP~⌫2V (1+�~⌫)⇡

2~⌫=1

X

~⌫2V

2

4n�1⇡�2

~⌫ + ��2

0

+p

X

j=1

��2

j

⌫2j

3

5

�1

,

where �~⌫ is the product of eigenvalues in (A.2) and recall that the set V is

defined in (5.1).We use the Lagrange multiplier method to find ⇡2

~⌫ for solving (A.5). Leta be the scalar multiplier and define

L(⇡2~⌫ , a) =

X

~⌫2V

2

4n�1⇡�2

~⌫ + ��2

0

+p

X

j=1

��2

j

⌫2j

3

5

�1

� a(1 + �~⌫)⇡

2

~⌫ .

Taking partial derivative with respect to ⇡2~⌫ gives

@L

@⇡2~⌫

= n�1

2

4n�1 +

0

@��2

0

+p

X

j=1

��2

j

⌫2j

1

A⇡2~⌫

3

5

�2

� a(1 + �~⌫) = 0.

This implies

b⇡2~⌫ =

0

@��2

0

+p

X

j=1

��2

j

⌫2j

1

A

�1

h

b(1 + �~⌫)

�1/2 � n�1

i

+

,

where b = (na)�1/2. On one hand, plugging the above formula into theconstraint

P

~⌫2V (1 + �~⌫)⇡

2

~⌫ = 1 gives

X

~⌫2V

d

Y

k=1

⌫2mk

0

@��2

0

+p

X

j=1

��2

j

⌫2j

1

A

�1

"

bd

Y

k=1

⌫�m

k

� n�1

#

+

⇣ 1.

Page 33: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

32 X. DAI AND P. CHIEN

By restrictingQ

d

k=1

⌫k

(nb)1/m, this becomes

(A.6)

X

~⌫2V,Qdk=1 ⌫k(nb)

1/m

0

@��2

0

+p

X

j=1

��2

j

⌫2j

1

A

�1

bd

Y

k=1

⌫mk

� n�1

d

Y

k=1

⌫2mk

!

⇣ 1.

On the other hand, the linear risk in (A.5) can be written as

(A.7)

RL ⇣ n�1

X

~⌫2V,Qdk=1 ⌫k(nb)

1/m

1� 1

nb

d

Y

k=1

⌫mk

!

⇥0

@��2

0

+p

X

j=1

��2

j

⌫2j

1

A

�1

.

We discuss for RL in the above (A.7) under the condition (A.6) for threecases with 0 p d� r, d� r < p < d and p = d.

If 0 p d� r, since ~⌫ 2 V , there are at most r of ⌫1

, . . . , ⌫d

not equalto 1, which implies that the number of combinations of non-1 indices beingsummed in (A.6) is no greater than C1

d

+ C2

d

+ · · · + Cr

d

< 1. Due to theterm (��2

0

+P

p

j=1

��2

j

⌫2j

)�1, the largest terms of the summation (A.6) over~⌫ 2 V correspond to the combinations of indices where as fewer ⌫

1

, . . . , ⌫p

being summed as possible, for example, vk

⌘ 1 for k p and k > p+ r, and(⌫

p+1

, . . . , ⌫p+r

) 2 Nr are non-1. Thus, (A.6) is equivalent to

X

Qrk=1 ⌫p+k(nb)

1/m

br

Y

k=1

⌫mp+k

� n�1

r

Y

k=1

⌫2mp+k

!

⇣ 1.

Using the integral approximation, we have

Z

Qrk=1 xp+k(nb)

1/m,xp+k�1

br

Y

k=1

xmp+k

� 1

n

r

Y

k=1

x2mp+k

!

dxp+1

· · · dxp+r

⇣ 1.

By letting zj

=Q

1kj

xp+k

, j = 1, 2, . . . , r, we have

Z

(nb)

1/m

1

Z

zr

1

· · ·Z

z2

1

bzmr

� 1

nz2mr

z�1

1

· · · z�1

r�1

dz1

· · · dzr�1

dzr

⇣ 1,

Page 34: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 33

where the LHS term is the order of n(m+1)/mb(2m+1)/m[log(nb)]r�1 and hence

(A.8) b ⇣ n�(m+1)/(2m+1)(log n)�m(r�1)/(2m+1).

The linear risk in (A.7) becomes

RL ⇣ n�1

Z

Qrk=1 xp+k(nb)

1/m,xp+k�1

1� 1

nb

r

Y

k=1

xmp+k

!

⇣ [log(nb)]r�1n�1+1/mb1/m ⇣ [n(log n)1�r]�2m/(2m+1),

where the last step is by (A.8).If d � r < p < d, as discussed in the previous case, the number of com-

binations of non-1 indices being summed is finite, and the largest terms ofthe summation (A.6) over ~⌫ 2 V correspond to the combinations of indiceswhere as fewer than ⌫

1

, . . . , ⌫p

being summed as possible, for example, vk

⌘ 1for k d� r, and (⌫

d�r+1

, . . . , ⌫d

) 2 Nr are non-1. Thus, (A.6) is equivalentto

X

Qrk=1 ⌫d�r+k(nb)

1/m

br

Y

k=1

⌫md�r+k

� n�1

r

Y

k=1

⌫2md�r+k

!

⇥0

@1 +p

X

j=d�r+1

⌫2j

1

A

�1

⇣ 1.

Using the integral approximation, we have

Z

Qrk=1 xd�r+k(nb)

1/m,xd�r+k�1

br

Y

k=1

xmd�r+k

� n�1

r

Y

k=1

x2md�r+k

!

⇥0

@1 +p

X

j=d�r+1

x2j

1

A

�1

dxd�r+1

· · · dxd

⇣ 1.

Page 35: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

34 X. DAI AND P. CHIEN

By letting zj

= xp+1

xp+2

· · ·xj

, j = p+ 1, . . . , d, we get

1 ⇣Z

xd�r+1···xpzd(nb)

1/m

Z

zd

1

· · ·Z

zp+2

1

bxmd�r+1

· · ·xmp

zmd

� 1

nx2md�r+1

· · ·x2mp

z2md

z�1

p+1

· · · z�1

d�1

⇥ �1 + x2d�r+1

+ · · ·+ x2p

��1

dzp+1

· · · dzd�1

dxd�r+1

· · · dxp

dzd

=

Z

xd�r+1···xpzd(nb)

1/mbxm

d�r+1

· · ·xmp

zmd

1� 1

nbxmd�r+1

· · ·xmp

zmd

⇥ (log zd

)d�p�1

1 + x2d�r+1

+ · · ·+ x2p

��1

dxd�r+1

· · · dxp

dzd

⇣ [log(nb)]d�p�1n1+1/mb2+1/m,

where the last step is by Lemma A.13 in Section A.5. Hence,

(A.9) b ⇣ n�(m+1)/(2m+1)(log n)�m(d�p�1)/(2m+1).

The linear risk in (A.7) becomes

RL ⇣ n�1

Z

Qdk=d�r+1 xk(nb)

1/m,xk�1

1� 1

nbxmd�r+1

· · ·xmd

· (1 + x2d�r+1

+ · · ·+ x2p

)�1dxd�r+1

· · · dxd

⇣ n�1

Z

xd�r+1···xpzd(nb)

1/m

1� 1

nbxmd�r+1

· · ·xmp

zmd

(log zd

)d�p�1

· (1 + x2d�r+1

+ · · ·+ x2p

)�1dxd�r+1

· · · dxp

dzd

⇣ [log(nb)]d�p�1n�1+1/mb1/m,

where the second step uses the same change of variables by letting zj

=xp+1

xp+2

· · ·xj

, j = p + 1, . . . , d, and the last step is by Lemma A.13 inSection A.5. By (A.9), we have

RL ⇣ [n(log n)1+p�d]�2m/(2m+1).

If p = d, as discussed in the previous two cases, the number of combina-tions of non-1 indices being summed is finite, and the largest terms of thesummation (A.6) over ~⌫ 2 V correspond to any combinations of r non-1indices, for example, ⌫

k

⌘ 1 for k � r+1, and (⌫1

, . . . , ⌫r

) 2 Nr. Thus, (A.6)is equivalent to

X

Qrk=1 ⌫k(nb)

1/m

br

Y

k=1

⌫mk

� n�1

r

Y

k=1

⌫2mk

!

1 +r

X

k=1

⌫2k

!�1

⇣ 1.

Page 36: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 35

Using the integral approximation, we have

1 ⇣Z

Qrk=1 xk(nb)

1/m,xk�1

br

Y

k=1

xmk

� n�1

r

Y

k=1

x2mk

!

1 +r

X

k=1

x2k

!�1

dx1

· · · dxr

⇣Z

Qrk=1 xk(nb)

1/m,xk�1

br

Y

k=1

xmk

1 +r

X

k=1

x2k

!�1

dx1

· · · dxr

By letting � = m > 1 and ↵ = 2 in Lemma A.14 in Section A.5, we havefor any r � 1,

(A.10) b ⇣ n�(mr+r�2)/(2mr+r�2).

The linear risk in (A.7) becomes

RL ⇣ n�1

Z

Qrk=1 xk(nb)

1/m,xk�1

1� 1

nbxm1

· · ·xmr

· (1 + x21

+ · · ·+ x2r

)�1dx1

· · · dxr

⇣ n�1

Z

Qrk=1 xk(nb)

1/m,xk�1

(1 + x21

+ · · ·+ x2r

)�1dx1

· · · dxr

⇣h

n�1(nb)(r�2)/(mr)

i

r�3

+⇥

n�1 log(nb)⇤

r=2

+�

n�1

r=1

,

where the last step uses Lemma A.14 in Section A.5 by letting � = 0 and↵ = 2. By (A.10), we have

RL ⇣h

n�(2mr)/[(2m+1)r�2]

i

r�3

+⇥

n�1 log(n)⇤

r=2

+ n�1

r=1

,

where the constant factor only depends on �20

, �2j

, m, r, p and d. Thiscompletes the proof.

A.2.2. Proof of minimax upper bound: Theorem 2.2. We now prove thetheorem for only r = d and p = d � 1. Other cases can be proved similarlywith slight changes.

Using the discrete transformed data (A.3), the regularized estimator bfn�

by (2.5) can be obtained through

b✓~⌫ = argmin

˜

✓~⌫2R

8

<

:

1

n(p+ 1)

2

4

1

�20

X

~⌫2V,⌫kl

ze0~⌫ � ✓

~⌫

2

+p

X

j=1

1

�2j

X

~⌫2V,⌫kl

zej

~⌫ � ⌫j

✓~⌫

2

3

5+ �X

~⌫2V,⌫kl

�~⌫✓

2

~⌫

9

=

;

Page 37: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

36 X. DAI AND P. CHIEN

and bfn�

(t) =P

~⌫2V,⌫kl

b✓~⌫ ~⌫(t), where V is defined in (5.1). Direct calculations

give

b✓~⌫ =

��2

0

ze0~⌫ +

P

p

j=1

��2

j

⌫j

zej

~⌫

��2

0

+P

p

j=1

��2

j

⌫2j

+ ���1

~⌫

.

The deterministic error of bfn�

can be analyzed by two parts. On the onehand, since f

0

2 H and �⌫

⇣ ⌫�2m, we knowP

~⌫2V,⌫k�l+1

(✓0~⌫)

2 ⇣ n�2m.

This is the truncation error due to b✓~⌫ = 0 for ⌫

k

� l+ 1, 1 k d. On theother hand, note that h~

~⌫ , ~ ~µi2n

1 and then

0

@

X

~µ2V,µk�l+1

✓0~µh~ ~⌫ , ~ ~µin

1

A

2

X

~µ2V,µk�l+1

(✓0~µ)

2 ⇣ n�2m.

Thus,

X

~⌫2V,⌫kl

Eb✓~⌫ � ✓0

~⌫

2

.X

~⌫2V,⌫kl

���1

~⌫

��2

0

+P

p

j=1

��2

j

⌫2j

+ ���1

~⌫

!

2

(✓0~⌫)

2 + n�2m+1

�2 sup~⌫2V

��1

~⌫⇣

��2

0

+P

p

j=1

��2

j

⌫2j

+ ���1

~⌫

2

X

~⌫2V��1

~⌫ (✓0~⌫)

2 + n�2m+1

⇣ �2J(f0

) sup~⌫2V

⌫2m1

· · · ⌫2md

(1 +P

p

j=1

⌫2j

+ �⌫2m1

· · · ⌫2md

)2+ n�2m+1.

Define that

B�

(~⌫) =⌫2m1

· · · ⌫2md

(1 +P

p

j=1

⌫2j

+ �⌫2m1

· · · ⌫2md

)2.

For the sup~⌫2V B

(~⌫) term above, suppose thatQ

d

j=1

⌫2mj

> 0 is fixed and

denoted by x�1, then B�

(~⌫) is maximized by lettingP

p

j=1

⌫2j

be as small aspossible, where p = d� 1. This suggests ⌫

1

= ⌫2

= · · · = ⌫p

= 1, and

sup~⌫2V

B�

(~⌫) ⇣ supx>0

x�1

(1 + �x�1)2⇣ ��1,

where the last step is achieved when x ⇣ �. Combining all parts of bias gives

(A.11)X

~⌫2V

Eb✓~⌫ � ✓0

~⌫

2

= O�

�J(f0

) + n�2m+1 + n�2m

,

Page 38: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 37

where the constant factor only depends on �20

, �2j

, m, r, p and d.The stochastic error is bounded as follows:

X

~⌫2V

b✓~⌫ � Eb✓

~⌫

2

=X

~⌫2V,⌫kl

n�1(��2

0

+P

p

j=1

��2

j

⌫2j

)

(��2

0

+P

p

j=1

��2

j

⌫2j

+ ���1

~⌫ )2

.X

~⌫2V,⌫kl

1 +P

p

j=1

⌫2j

n(1 +P

p

j=1

⌫2j

+ �⌫2m1

· · · ⌫2md

)2.

Using Lemma A.7 in Section A.4.3 with a = 0 and p = d� 1, we have

(A.12)X

~⌫2V

b✓~⌫ � Eb✓

~⌫

2

= On

n�1��1/2m

o

.

Combining (A.11) and (A.12) and letting � ⇣ n�2m/(2m+1) completes theproof.

A.3. Proofs of results in Section 4: estimating partial deriva-tives. We now turn to prove the results for estimating partial derivativesunder the random design.

A.3.1. Proof of minimax lower bound: Theorem 4.1. The minimax lowerbound will be established by using Fano’s lemma but the proof is di↵erentfrom Section 5.1 in construction details. It su�ces to consider a special casethat noises ✏e0 and ✏ej s are Gaussian with �

0

= 1 and �j

= 1, and ⇧e0 and⇧ej s are uniform, and H

1

is generated by periodic kernels. For simplicity, westill use the notation introduced in Section 5.1. In the rest of this section,without less of generality, we consider estimating @f

0

/@t1

(·) with p � 1.First, the number of multi-indices ~⌫ = (⌫

1

, . . . , ⌫r

) 2 Nr satisfying

⌫(m�1)/m

1

⌫2

· · · ⌫r

N

is c00

Nm/(m�1), where c00

is some constant. Define a length-{c00

Nm/(m�1)}binary sequence as

b = {b~⌫ : ⌫(m�1)/m

1

⌫2

· · · ⌫r

N} 2 {0, 1}c00Nm/(m�1).

We write

hb

(t1

, . . . , tr

) = N�m/2(m�1)

X

(m�1)/m1 ⌫2···⌫rN

b~⌫

1 + ⌫21

+ · · ·+ ⌫2r

��1/2

⇥h

⌫(m�1)/m

1

⌫2

· · · ⌫r

+Ni�m

⌫1(t1) ⌫2(t2) · · · ⌫r(tr).

Page 39: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

38 X. DAI AND P. CHIEN

where ⌫k(tj)s are the trigonometric basis in (A.1). Note that

khb

k2H . N�m/(m�1)

X

(m�1)/m1 ⌫2···⌫rN

b2~⌫⌫

2

1

1 + ⌫21

+ · · ·+ ⌫2r

��1

N�m/(m�1)

X

(m�1)/m1 ⌫2···⌫rN

⌫21

1 + ⌫21

+ · · ·+ ⌫2r

��1 ⇣ 1,

where the last step is by Lemma A.16 in Section A.5. Hence, hb

(·) 2 H.Then, using the Varshamov-Gilbert bound, there exists a collection of

binary sequences {b(1), . . . , b(M)} ⇢ {0, 1}c00Nm/(m�1)such that

M � 2c00N

m/(m�1)/8

andH(b(l), b(q)) � c0

0

Nm/(m�1)/8, 81 l < q M.

For b(l), b(q) 2 {0, 1}c00Nm/(m�1), we have

@hb

(l)

@t1

� @hb

(q)

@t1

2

L2

� c0N�m/(m�1)(2N)�2m

X

(m�1)/m1 ⌫2···⌫rN

⌫21

(1 + ⌫21

+ · · ·+ ⌫2r

)�1

h

b(l)~⌫ � b(q)

~⌫

i

2

� c0N�m/(m�1)(2N)�2m

X

c

017N/8⌫

(m�1)/m1 ⌫2···⌫rN

⌫21

(1 + ⌫21

+ · · ·+ ⌫2r

)�1

= c02

N�2m

for some constant c0, c01

and c02

, where the last step is by Lemma A.16 inSection A.5. On the other hand, for any b(l) 2 {b(1), . . . , b(M)},

khb

(l)k2L2

+p

X

j=1

k@hb

(l)/@tj

k2L2

N�m/(m�1)N�2m

X

(m�1)/m1 ⌫2···⌫rN

h

b(l)~⌫

i

2

c03

N�2m

with some constant c03

, where the last step is a corollary of Lemma A.16.

Page 40: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 39

Last, by the same argument in (5.3), (5.4), (5.5) and (5.6), we obtain

inf˜

f

supf02H

P(

f(t)� @f0

(t)

@t1

2

L2

� 1

4c02

N�2m

)

� 1� 2c03

n(p+ 1)N�2m + log 2

c00

(log 2)Nm/(m�1)/8.

Taking N = c04

n(m�1)/(2m

2�m) with an appropriately chosen c04

, we have

lim supn!1

inf˜

f

supf02H

P(

f(t)� @f0

(t)

@t1

2

L2

� C2

n�2(m�1)/(2m�1)

)

> 0,

where the constant factor C2

only depends on �20

, �2j

, m, and bounded valuesr, p and d. This completes the proof.

A.3.2. Proof of minimax upper bound: Theorem 4.2. We continue to usethe notation and definitions such as the minimizer f , the Frechet derivativesDl

n

(f)g, Dl1(f)g, D2ln

(f)gh, D2l1(f)gh, the operator G�1

and most im-

portantly f⇤ in Section 5.2. Unlike Section 5.2, here we do not require ⇡ej sare known nor f

0

has periodic boundaries on X d

1

by some transformation.By the assumption that ⇡ej s are bounded away from 0 and infinity, we

have for any 1 j p,

Z

X d1

"

@ bfn�

(t)

@tj

� @f0

(t)

@tj

#

2

dt . k bf � f0

k20

.

Hence, the following lemma is su�cient for proving Theorem 4.2.

Lemma A.2. Under the conditions of Theorem 4.1, then bfn�

given by(2.5) satisfies

limD2!1

lim supn!1

supf02H

Pn

k bf � f0

k20

> D2

n�2(m�1)/(2m�1)

o

= 0,

if the tuning parameter � is chosen by � ⇣ n�2(m�1)/(2m�1).

A lemma for the proof.. In H, the quadratic form hf, fi0

is completelycontinuous with respect to hf, fi

R

. By the theory in Section 3.3 of Wein-berger [46], there exists an eigen-decomposition for the generalized Rayleighquotient hf, fi

0

/hf, fiR

in H, where we denote the eigenvalues are {(1 +�⌫

)�1}⌫�1

and the corresponding eigenfunctions are {(1 + �⌫

)�1/2⇠⌫

}⌫�1

.Thus, h⇠

, ⇠µ

iR

= (1 + �⌫

)�⌫µ

and h⇠⌫

, ⇠µ

i0

= �⌫µ

, where �⌫µ

is Kronecker’sdelta. The following proposition gives the decay rate of �

and its proof isgiven in Section A.4.2.

Page 41: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

40 X. DAI AND P. CHIEN

Lemma A.3. By the well-ordering principle, the elements in the set

8

<

:

0

@1 +p

X

j=1

⌫2j

1

A

d

Y

k=1

⌫�2m

k

: ~⌫ 2 V

9

=

;

can be ordered from large to small, where V is defined in (5.1). Denote by{�0

}⌫�1

the ordered sequence. Then �⌫

⇣ (�0⌫

)�1.

The proof of this lemma is delegated to Section A.4.2. The lemma bridgesthe gap between the proof needed for Lemma A.2 and the proof for Theorem3.2 shown in Section 5.2 since the eigenvalues ⇢

~⌫ in Section 5.2 satisfies⇢~⌫ ⇣ (1 +

P

p

j=1

⌫2j

)�1

Q

d

k=1

⌫2mk

. Hence in later analysis, we can exchangethe use of {�

, ⌫ 2 N} and {⇢~⌫ : ~⌫ 2 V } in some asymptotic calculation

settings.For any function f 2 H, it can be decomposed as

f(t1

, . . . , td

) =X

⌫2Nf⌫

⇠⌫

(t1

, . . . , td

), where f⌫

= hf(t), ⇠⌫

(t)i0

,

and J(f) = hf, fiR

� hf, fi0

=P

⌫2N �⌫f2

.First, we present an upper bound of the deterministic error (f � f

0

).

Lemma A.4. The deterministic error satisfies

kf � f0

k20

= O {�J(f0

)} .

Proof. For any 0 a 1,

kf � f0

k20

=1X

⌫=1

��⌫

1 + ��⌫

2

(f0

)2

�2 sup⌫2N

�⌫

(1 + ��⌫

)2

1X

⌫=1

�⌫

(f0

)2

�2J(f0

) supx>0

x�1

(1 + �x�1)2

⇣ �2J(f0

)��1 = �J(f0

),

where the fourth step is achieved when x ⇣ �.

Second, we show an upper bound of (f⇤ � f), which accounts for a partof the stochastic error.

Page 42: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 41

Lemma A.5. For 1 p d, then if m > 5/4, we have

kf⇤ � fk20

= OPn

n�1��1/(2m�2)

o

.

Proof. As shown in (5.22), E[12

Dln,�

(f)g]2 = O{n�1kgk20

}. By the defi-nition of G�1

in (5.17),

kG�1

gk20

=1X

⌫=1

(1 + ��⌫

)�2 hg, ⇠⌫

i2R

, 8g 2 H.

Thus,

Ekf⇤ � fk20

=1

4E" 1X

⌫=1

(1 + ��⌫

)�2hDln�

(f), ⇠⌫

i2R

#

1X

⌫=1

(1 + ��⌫

)�2E

1

2Dl

n�

(f)⇠⌫

2

. n�1

1X

⌫=1

(1 + ��⌫

)�2

⇣ n�1M0

(�),

where the last step is because of Lemma A.3, and Ma

(�) for 0 a 1 isdefined in Lemma A.8 of Section A.4.4. Hence, we complete the proof byusing Lemma A.8.

Then, we give an upper bound of ( bf � f⇤), which accounts for anotherpart of the stochastic error.

Lemma A.6. If n�1��[a+ma/(m�1)+3/2m] [log(1/�)]r�1 ! 0 and 1/2m <a < (2m� 3)/2m, we have

k bf � f⇤k20

= oPn

n�1��1/(2m�2)

o

.

Page 43: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

42 X. DAI AND P. CHIEN

Proof. Observe that

Ek bf � fk20

⇣ EX

~⌫2V(1 + ��

~⌫)�2

1

2D2l1(f)( bf � f)�

~⌫ � 1

2D2l

n

(f)( bf � f)�~⌫

2

EX

~⌫2V(1 + ��

~⌫)�2

⇥ 1

p+ 1

8

<

:

"

1

n�20

n

X

i=1

( bf � f)(te0i

)�~⌫(t

e0i

)� 1

�20

Z

( bf � f)(t)�~⌫(t)⇡

e0(t)

#

2

+p

X

j=1

"

1

n�2j

n

X

i=1

@( bf � f)

@tj

(te0i

)@�

~⌫

@tj

(te0i

)� 1

�2j

Z

@( bf � f)(t)

@tj

@�~⌫(t)

@tj

⇡e0(t)

#

2

9

=

;

. n�1k bf � fk2L2(a+1/m)

X

~⌫2V

1 +⇢~⌫

k�~⌫k2

L2

!

a

(1 + �⇢~⌫)

�2

= n�1k bf � fk2L2(a+1/m)

Ma

(�)

n

n�1��[a+3/2m+ma/(m�1)][log(1/�)]r�1

o

n�1��1/(2m�2),

where the first step exchange the use of {�⌫

, ⌫ 2 N} and {⇢~⌫ : ~⌫ 2 V },

the third step is by (5.30), and the last step is Lemma 5.3, Lemma 5.4 andLemma A.8 in Section A.4.4. The above inequality holds for any 1/2m <a < (2m� 3)/2m. This completes the proof.

Last, we combine Lemma A.4, Lemma A.5 and Lemma A.6. By letting� ⇣ n�2(m�1)/(2m�1) and a = 1/2m+ ✏ for some ✏ > 0, then

n�1��(a+3/2m+ma/(m�1))[log(1/�)]r�1 ! 0

holds as long as m > 2. Therefore, we conclude that for any 1 p d andm > 2,

k bf � f0

k20

= O {�J(f0

)}+OPn

n�1��1/(2m�2)

o

+ oPn

n�1��1/(2m�2)

o

= OPn

n�2(m�1)/(2m�1)

o

.

This completes the proof for Lemma A.2 and the proof for Theorem 4.2 .

A.4. Key lemmas. Now we prove and show some keys lemmas usedfor the proofs in Section 5, Section A.2 and Section A.3. We remind thereader that the proofs in this section rely on some lemmas to be stated laterin Section A.5.

Page 44: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 43

A.4.1. Proof of Lemma 5.1.

The norm k · kR

is equivalent to k · kH in H.

Proof. Observe that for any g 2 H, by the assumption that ⇡e0 and⇡ej s are bounded away from 0 and infinity, we have

1

p+ 1

2

4

1

�20

Z

g2(t)⇡e0(t) +p

X

j=1

1

�2j

Z

@g(t)

@tj

2

⇡ej (t)

3

5

c1

2

4

Z

g2(t) +p

X

j=1

Z

@g(t)

@tj

2

3

5 c2

· c2dK

kgk2H,

for some constant c1

and c2

, where the last step is by Lemma A.9. Hence

(A.13) kgk2R

(c2

c2dK

+ 1)kgk2H.One the other hand, for any g 2 H we can do the orthogonal decomposi-

tion g = g0+ g1 where hg0, g1iH = 0, g0 is in the null space of J(·) and g1 isin the orthogonal space of the null space of J(·) in H. Since the null spaceof J(·) has a finite basis which forms a positive definite kernel matrix, weassume the minimal eigenvalue of the kernel matrix is µ0

min

> 0. Then thereexists a constant c

3

> 0 such that

(A.14) kg0k2R

� c3

kg0k2L2

� c3

µ0min

kg0k2H.For g1, we have kg1k2

R

� J(g1) = kg1k2H. Thus, for any g 2 H,

kgk2R

� c3

Z

g0 + g1�

2

+ kg1k2H

� c3

kg0k2L2

+1 + c

3

c3

kg1k2L2

� 2kg0kL2kg1kL2

� c3

1 + c3

kg0k2L2,

where the second inequality is by kg1k2H � kg1k2L2. Then by (A.14), we obtain

kgk2R

� (1 + c3

)�1c3

µ0min

kg0k2H. Together with kgk2R

� J(g1) = kg1k2H, wehave

(A.15) kgk2R

�✓

1 +1 + c

3

c3

µ0min

◆�1

kgk2H.

Combining (A.13) and (A.15) completes the proof.

Page 45: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

44 X. DAI AND P. CHIEN

A.4.2. Proof of Lemma A.3.

Proof. When d = 1, this problem is solved in Cox [42]. Their method isfinding an orthonormal basis in L

2

(X1

) to simultaneously diagonalize hf, fi0

and hf, fiR

, and then obtain the decay rate of �⌫

. However, their methodcannot be applied to our case when 2 p d. Alternatively, we use theCourant-Fischer-Weyl min-max principle to prove the lemma.

Note that for any f 2 H, the norm kfk20

is equivalent to

Z

f2 +p

X

j=1

Z

@f(t)

@tj

2

.

From Lemma 5.1, the norm k · k2R

is equivalent to k · k2H. Now by applyingthe mapping principle [see, e.g., Theorem 3.8.1 in Weinberger [46]], we mayreplace hf, fi

0

byR

f2 +P

p

j=1

R

(@f/@tj

)2 and hf, fiR

by kfk2H, and the

resulting eigenvalues {�00⌫

}⌫�1

of {R f2 +P

p

j=1

R

(@f/@tj

)2}/kfk2H satisfy

(A.16) �00⌫

⇣ (1 + �⌫

)�1.

Thus, we only need to study {�00⌫

}⌫�1

. Since f 2 H has the tensor prod-uct structure, we denote by �

~⌫ [{R

f2 +P

p

j=1

R

(@f/@tj

)2}/hf, fiH] the ~⌫theigenvalue of the generalized Rayleigh quotient, where ~⌫ 2 V and V is de-fined in (5.1).

Second, by the assumption that �⌫

⇣ ⌫�2m in (2.4), H1

is equivalent toa Sobolev space Wm

2

(X1

) and the trigonometric functions { ⌫

}⌫�1

in (A.1)form an eigenfunction basis of H

1

up to a m-dimensional linear space ofpolynomials of order less than m. See, for example, Wahba [45]. Denote thelatter linear space of polynomials by G. Denote by F

µ

and F?µ

the linearspaces spanned by {

: 1 ⌫ µ} and { ⌫

: ⌫ � µ + 1}, respectively.For any ~⌫ = (⌫

1

, ⌫2

, . . . , ⌫d

) 2 V , by the Courant-Fischer-Weyl min-maxprinciple,

�(⌫1�m)_0,(⌫2�m)_0,...,(⌫d�m)_0

2

4

8

<

:

Z

f2 +p

X

j=1

Z

@f

@tj

2

9

=

;

,

hf, fiH3

5

� minf2H\⌦d

k=1{F⌫k\G?}

2

4

8

<

:

Z

f2 +p

X

j=1

Z

@f

@tj

2

9

=

;

,

hf, fiH3

5

� c1

0

@1 +p

X

j=1

⌫2j

1

A

d

Y

k=1

⌫�2m

k

Page 46: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 45

for some constant c1

> 0, where the last inequality is by the fact thatd

2⌫�1

(t)/dt = 2⇡⌫ 2⌫

(t) and d 2⌫

(t)/dt = �2⇡⌫ 2⌫�1

(t). On the otherhand,

�⌫1+m,⌫2+m,...,⌫d+m

2

4

8

<

:

Z

f2 +p

X

j=1

Z

@f

@tj

2

9

=

;

,

hf, fiH3

5

maxf2H\⌦d{F?

k�1\G?}

2

4

8

<

:

Z

f2 +p

X

j=1

Z

@f

@tj

2

9

=

;

,

hf, fiH3

5

c2

0

@1 +p

X

j=1

⌫2j

1

A

d

Y

k=1

⌫�2m

k

for some constant c2

> 0. Thus, for any ~⌫ 2 V ,

�~⌫

2

4

8

<

:

Z

f2 +p

X

j=1

Z

@f

@tj

2

9

=

;

,

hf, fiH3

5 ⇣0

@1 +p

X

j=1

⌫2j

1

A

d

Y

k=1

⌫�2m

k

.

This implies �0⌫

= �00⌫

, where �0⌫

is defined in Lemma A.3. Together with(A.16), we complete the proof.

A.4.3. Definition of Na

(�) and its upper bound.

Lemma A.7. Recall that V as a family of multi-index ~⌫ is defined in(5.1). We let

(A.17) Na

(�) =X

~⌫2V

Q

d

k=1

⌫2mk

a

1 +P

p

j=1

⌫2j

1 +P

p

j=1

⌫2j

+ �Q

d

k=1

⌫2mk

2

.

Then, when 0 p < d, we have for any 0 a < 1� 1/2m,

Na

(�) = On

��a�1/2m [log(1/�)](d�p)^r�1

o

,

and when p = d, we have for any 0 a 1,

Na

(�) =

8

>

>

>

>

>

<

>

>

>

>

>

:

On

�mr

1�mr (a+r�22mr )

o

, if r � 3;

O {log(1/�)} , if r = 2, a = 0; O {1} , if r = 2, 0 < a 1;

O {1} , if r = 1, a < 1

2m

; O {log(1/�)} , if r = 1, a = 1

2m

;

On

�1�2ma2m�2

o

, if r = 1, a > 1

2m

.

Page 47: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

46 X. DAI AND P. CHIEN

Proof. We will discuss three separate cases for 0 p d � r, d � r <p < d and p = d.

First, consider 0 p d � r. Since ~⌫ 2 V , there are at most r of⌫1

, . . . , ⌫d

not equal to 1, which implies that the number of combinations ofnon-1 indices being summed in (A.17) is no greater than C1

d

+C2

d

+· · ·+Cr

d

<1. Due to the appearance of (1 +

P

p

j=1

⌫2j

) in the denominator of (A.17),the largest terms of the summation (A.17) over ~⌫ 2 V correspond to thecombinations of r indices where as few ⌫

1

, . . . , ⌫p

being summed as possible,which is the indices ~⌫ = (⌫

k1 , ⌫k2 , . . . , ⌫kr)> 2 Nr with k

1

, k2

, . . . , kr

> p.Thus, by the integral approximation,

Na

(�)

⇣1X

⌫p+1=1

· · ·1X

⌫p+r�1=1

1X

⌫p+r=1

Q

p+r

k=p+1

⌫2ma

k

1 + �Q

p+r

k=p+1

⌫2mk

2

⇣Z 1

1

Z 1

1

· · ·Z 1

1

1 + �xbp+1

· · ·xbp+r�1

xbp+r

⌘�2

dxp+1

· · · dxp+r�1

dxp+r

,

where b = 2m/(2ma+1). Let zk

= xp+1

xp+2

· · ·xk

for k = p+1, . . . , p+r. Byusing the change of variables to replace (x

p+1

, . . . , xp+r

) by (zp+1

, . . . , zp+r

)and z

p+r

by x = �1/bzp+r

,

Na

(�)

⇣Z 1

1

Z

zp+r

1

· · ·Z

zp+2

1

1 + �zbp+r

⌘�2

z�1

p+1

· · · z�1

p+r�1

dzp+1

· · · dzp+r�1

dzp+r

⇣Z 1

1

(1 + �zbp+r

)�2(log zp+r

)r�1dzp+r

⇣ ��1/b

Z 1

1/b(1 + xb)�2

log x� b�1 log ��

r�1

dx

⇣ ��a�1/2m [log(1/�)]r�1 ,

where the last step follows from the fact that 2b > 1 for any 0 a <(2m� 1)/(2m).

Second, we consider d� r < p < d. As discussed in the previous case, thenumber of combinations of non-1 indices being summed is finite, and thelargest terms of the summation (A.17) over ~⌫ 2 V correspond to the indices~⌫ = (⌫

k1 , . . . , ⌫kr+p�d, ⌫

p+1

, . . . , ⌫d

)> 2 Nr, where the indices k1

, . . . , kr+p�d

Page 48: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 47

p. Thus, by the integral approximation,

Na

(�)

⇣1X

vd�r+1=1

· · ·1X

vd=1

Q

d

k=d�r+1

⌫2ma

k

1 +P

p

k=d�r+1

⌫2k

1 +P

p

k=d�r+1

⌫2k

+ �Q

d

k=d�r+1

⌫2mk

2

⇣Z 1

1

· · ·Z 1

1

1 + xb/md�r+1

+ · · ·+ xb/mp

1 + xb/md�r+1

+ · · ·+ xb/mp

+ �xbd�r+1

· · ·xbd

2

dxd�r+1

· · · dxd

,

where b = 2m/(2ma+ 1). Set zk

= xp+1

xp+2

· · ·xk

for k = p+ 1, . . . , d. Byusing the change the variables to replace (x

p+1

, . . . , xd

) by (zp+1

, . . . , zd

),and z

d

by x = �1/bzd

, and x by u = xd�r+1

· · ·xp

· x. We have

Na

(�) ⇣Z 1

1

· · ·Z 1

1

Z 1

1

Z

zd

1

· · ·Z

zp+2

1

xb/md�r+1

1 + xb/md�r+1

+ · · ·xb/mp

+ �xbd�r+1

· · ·xbp

zbd

⌘�2

·z�1

p+1

· · · z�1

d�1

dzp+1

· · · dzd�1

dzd

dxd�r+1

· · · dxp

⇣ ��1/b

Z 1

1

· · ·Z 1

1

Z 1

1/b

xb/md�r+1

(1 + xb/md�r+1

+ · · ·xb/mp

+ xbd�r+1

· · ·xbp

xb)�2

· �log x� b�1 log ��

d�p�1

dx

dxd�r+1

· · · dxp

. ��1/b

Z 1

1/b

Z 1

1

· · ·Z 1

1

xb/md�r+1

1 + xb/md�r+1

+ · · ·+ xb/mp

+ ub⌘�2

x�1

d�r+1

· · ·x�1

p

· �log u� log xd�r+1

� · · ·� log xp

� b�1 log ��

d�p�1

dxd�r+1

· · · dxp

du.

By Lemma A.10, then for any 0 < ⌧ < 1,

1 + xb/md�r+1

+ xb/md�r+2

+ · · ·+ xb/mp

+ ub⌘�2

.⇣

1 + xb/md�r+2

+ · · ·+ xb/mp

+ ub⌘�1+⌧ ·

xb/md�r+1

⌘�(1+⌧)

.

Page 49: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

48 X. DAI AND P. CHIEN

Together with the factR11

t�1�⌧ (log t)kdt < 1 for any k < 1, we have

Na

(�) . ��1/b

Z 1

1/b

Z 1

1

· · ·Z 1

1

1 + xb/md�r+2

+ · · ·+ xb/mp

+ ub⌘�1+⌧

x�1

d�r+2

· · ·x�1

p

· �log u� log xd�r+2

� · · ·� log xp

� b�1 log ��

d�p�1

dxd�r+2

· · · dxp

du.

Continuing this procedure gives

Na

(�) . ��1/b

Z 1

1/b

1 + ub⌘�(1�⌧)

p�d+r�

log u� b�1 log ��

d�p�1

du.

Since for any ✏ > 0 and d� r < p < d, we know if ⌧ < ✏/d,

(1� ⌧)p�d+r � 1� ⌧(p� d+ r) � 1� ⌧(d� 1) > 1� ✏.

Hence, for any 0 a < (2m�1)/(2m), there exists ⌧ such that (1�⌧)p�d+r >a+ 1/(2m) = 1/b. Therefore,

Na

(�) . ��1/b [log(1/�)]d�p�1 = ��a�1/2m [log(1/�)]d�p�1 .

Finally, we consider p = d. As argued in the previous two cases, thenumber of combinations of non-1 indices being summed is finite. Now sincep = d, by the symmetry of indices, the largest terms of the summation(A.17) over ~⌫ 2 V correspond to any combinations of r non-1 indices, forexample, the first r indices. Thus, by the integral approximation,

Na

(�)

⇣1X

⌫1=1

· · ·1X

⌫r�1=1

1X

⌫r=1

Q

r

k=1

⌫2ma

k

1 +P

r

k=1

⌫2k

1 +P

r

k=1

⌫2k

+ �Q

r

k=1

⌫2mk

2

⇣Z 1

1

Z 1

1

· · ·Z 1

1

1 + xb/m1

+ · · ·+ xb/mr�1

+ xb/mr

1 + xb/m1

+ · · ·+ xb/mr

+ �xb1

· · ·xbr�1

xbr

2

dx1

· · · dxr�1

dxr

where b = 2m/(2ma+1). Observe that if x1

· · ·xr�1

xr

. �mr/[b(1�mr)], then

�xb1

· · ·xbr�1

xbr

. xb/m1

+ · · ·+ xb/mr�1

+ xb/mr

.

Page 50: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 49

By Lemma A.14 with � = 0 and ↵ = b/m 2, we have

(A.18)

Na

(�) ⇣Z

x1···xr�1xr.�

mr/[b(1�mr)]

1 + xb/m1

+ · · ·+ xb/mr�1

+ xb/mr

⌘�1

dx1

· · · dxr�1

dxr

8

>

>

>

>

<

>

>

>

>

:

�mr

1�mr (a+r�22mr ), if r � 3;

log(1/�), if r = 2, a = 0; �2ma1�2m , if r = 2, 0 < a 1;

1, if r = 1, a < 1

2m

; log(1/�), if r = 1, a = 1

2m

;

�1�2ma2m�2 , if r = 1, a > 1

2m

.

On the other hand, if �mr/[b(1�mr)](x1

· · ·xr�1

xr

)�1 = o(1), without less ofgenerality, we assume x

r

= min{x1

, · · · , xr

}. Let z = �1/bx1

· · ·xr�1

xr

. Bychanging x

r

to z, we have

(A.19)

Na

(�) ⇣Z

mr/[b(1�mr)](x1···xr�1xr)

�1=o(1)

1 + xb/m1

+ · · ·+ xb/mr

+ �xb1

· · ·xbr�1

xbr

⌘�1

dx1

· · · dxr�1

dxr

. ��1/b

Z

1/[b(1�mr)]z

�1=o(1),�

�(r�1)/(br)z

(r�1)/rx1···xr�1�

�1/bz

1 + xb/m1

+ · · ·+ xb/mr�1

+ zb⌘�1

x�1

1

· · ·x�1

r�1

dx1

· · · dxr�1

dz

. ��1/b

Z

1/[b(1�mr)]z

�1=o(1)

"

Z

�(r�1)/(br)z

(r�1)/rx1···xr�1�

�1/bz

xb/m1

+ · · ·+ xb/mr�1

⌘�⌧

x�1

1

· · ·x�1

r�1

dx1

· · · dxr�1

zb(�1+⌧)dz

. ��1/b

Z

1/[b(1�mr)]z

�1=o(1)

�⌧/(mr)z�⌧b/(mr) · zb(�1+⌧)dz

= oh

�mr

1�mr (a+r�22mr )

i

,

where the third step follows from the Lemma A.15 in Section A.5 for � = �1and ↵ = ⌧b/m. Combining (A.18) and (A.19), we complete the proof forp = d and this lemma.

A.4.4. Definition of Ma

(�) and its upper bound.

Page 51: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

50 X. DAI AND P. CHIEN

Lemma A.8. Recall that V as a family of multi-index ~⌫ is defined in(5.1). We let

Ma

(�) =X

~⌫2V

Q

d

k=1

⌫2mk

a

h

1 + �Q

d

k=1

⌫2mk

(1 +P

p

j=1

⌫2j

)�1

i

2

.

When m > 5/(4� 2a), we have for any 1 p d and 0 a 1,

Ma

(�) = On

��(2ma+1)/(2m�2)

o

.

Proof. We first show for any 1 s r,

(A.20)

1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

h

1 + �Q

r

k=1

⌫2mk

(1 +P

s

j=1

⌫2j

)�1

i

2

⇣1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

1 + �Q

r

k=1

⌫2mk

(1 + ⌫2s

)�1

2

.

Note that in (A.20), the LHS is greater than the RHS up to some constant.On the contrary, observe that

1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

h

1 + �Q

r

k=1

⌫2mk

(1 +P

s

j=1

⌫2j

)�1

i

2

⇣1X

⌫1=1

· · ·1X

⌫r=1

s

X

i=1

(1 + ⌫2i

)2Q

r

k=1

⌫2ma

k

1 +P

s

j=1

⌫2j

+ �Q

r

k=1

⌫2mk

2

⇣1X

⌫1=1

· · ·1X

⌫r=1

(1 + ⌫2s

)2Q

r

k=1

⌫2ma

k

1 +P

s

j=1

⌫2j

+ �Q

r

k=1

⌫2mk

2

1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

1 + �Q

r

k=1

⌫2mk

(1 + ⌫2s

)�1

2

.

This proves (A.20). Moreover, note that

(A.21)

1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

1 + �Q

r

k=1

⌫2mk

(1 + ⌫2s

)�1

2

�1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

1 + �Q

r

k=1

⌫2mk

2

.

Page 52: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 51

Now return to the proof of the lemma. Since ~⌫ 2 V and 1 p d, by(A.20), (A.21) and the integral approximation, we have

Ma

(�) ⇣1X

⌫1=1

· · ·1X

⌫r=1

Q

r

k=1

⌫2ma

k

1 + �Q

r

k=1

⌫2mk

(1 + ⌫2r

)�1

2

⇣Z 1

1

Z 1

1

· · ·Z 1

1

h

1 + �xb1

· · ·xbr�1

xb(m�1)/m

r

i�2

dx1

· · · dxr�1

dxr

,

where b = 2m/(2ma + 1). Let z = �m/[b(m�1)]xm/(m�1)

1

· · ·xm/(m�1)

r�1

xr

andchange x

r

to z. Then,

Ma

(�)

⇣ ��m/[b(m�1)]

Z 1

�m/[b(m�1)]

Z 1

1

· · ·Z 1

1

h

1 + zb(m�1)/m

i�2

x�m/(m�1)

1

· · ·x�m/(m�1)

d�1

dx1

· · · dxd�1

dz

⇣ ��m/[b(m�1)]

Z 1

�m/[b(m�1)]

h

1 + zb(m�1)/m

i�2

dz,

��m/[b(m�1)]

Z 1

0

h

1 + zb(m�1)/m

i�2

dz

= On

��(2ma+1)/(2m�2)

o

,

where the second step is because m/(m� 1) > 1 and the last step holds forany m > 5/(4� 2a).

A.4.5. Boundedness of functions in the RKHS H.

Lemma A.9. For any g 2 H, there exists a constant cK

which is inde-pendent of g such that

supt2X d

1

|g(t)| cdK

kgkH,

andsupt2X d

1

|@g/@tj

(t)| cdK

kgkH, 81 j d.

Proof. Since we assume that K is continuous in the compact domainX1

and satisfies (2.2), there exists some constant cK

such that

supt2X1

|K(t, t)| cK

and supt2X1

@2K(t, t)

@t@t0

cK

.

Page 53: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

52 X. DAI AND P. CHIEN

This implies for any t 2 X d

1

,

@Kd

(t, ·)@t

j

2

H=

@2K(tj

, tj

)

@tj

@t0j

Y

l 6=j

|K(tl

, tl

)| cdK

.

Thus, for any g 2 H, by the Cauchy-Schwarz inequality,

supt2X d

1

@g(t)

@tj

supt2X d

1

@Kd

(t, ·)@t

j

HkgkH cd

K

kgkH, 81 j d.

Similarly, we can show that supt |g(t)| cdK

kgkH.

A.5. Auxiliary technical lemmas.

Lemma A.10 (A variant of Young’s inequality). For any a, b � 0 and0 < ⌧ < 1, we have

(A.22) (a+ b)�2 (1� ⌧)1�⌧ (1 + ⌧)1+⌧

4a�(1+⌧)b�(1�⌧).

When ⌧ is small, the coe�cient (1� ⌧)1�⌧ (1 + ⌧)1+⌧/4 is close to 1/4.

Proof. To prove (A.22), it is su�cient to show

a+ b � 2(1� ⌧)�(1�⌧)/2(1 + ⌧)�(1+⌧)/2a(1+⌧)/2b(1�⌧)/2.

Letting p = 2/(1 + ⌧), a0 = a1/p, b0 = [b/(p � 1)](p�1)/p, the above formulais equivalent to

a0

p+

(b0)p/(p�1)

p/(p� 1)� a0b0,

which holds by Young’s inequality. This completes the proof.

Lemma A.11 (Bounding the norm of product of functions). For anyf, g 2 ⌦dH

1

, a > 1/2m, and 1 p d, we have that

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

k�~⌫k2

L2

@f(t)

@tj

@g(t)

@tj

,�~⌫(t)

2

0

. kfk2L2(a+1/m)

2

4

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

k�~⌫k2

L2

@g(t)

@tj

,�~⌫(t)

2

0

3

5 .

Page 54: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 53

Proof. Recall that { ⌫

(t)}⌫�1

is the trigonometrical basis on L2

(X1

)and �

~⌫(·) is defined in (5.9). Write ~⌫(t) =

⌫1(t1) ⌫2(t2) · · · ⌫d(td). Notethat

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

k�~⌫k2

L2hf,�

~⌫i20

=X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

f ~⌫

!

2

.

By Theorem A.2.2 and Corollary A.2.1 in Lin [44], if a > 1/2m, then forany f, g 2 ⌦dH

1

,

X

~⌫2Nd

(1 + ⇢~⌫)

a

Z

X d1

fg ~⌫

!

2

.

2

4

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

f ~⌫

!

2

3

5

2

4

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

g ~⌫

!

2

3

5 .

Thus,

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

k�~⌫k2

L2

@f(t)

@tj

@g(t)

@tj

,�~⌫(t)

2

0

=X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

@f(t)

@tj

@g(t)

@tj

~⌫(t)

!

2

.

2

4

X

~⌫2Nd

⌫2j

1 +d

Y

k=1

⌫2mk

!

a

Z

X d1

f(t) ~⌫(t)

!

2

3

5

⇥2

4

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

@g(t)

@tj

~⌫(t)

!

2

3

5

8

<

:

X

~⌫2Nd

"

1 +d

Y

k=1

⌫2mk

#

a+

1m

Z

X d1

f(t) ~⌫(t)

!

2

9

=

;

⇥2

4

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

@g(t)

@tj

~⌫(t)

!

2

3

5

⇣ kfk2L2(a+1/m)

2

4

X

~⌫2Nd

1 +⇢~⌫

k�~⌫k2

L2

!

a

Z

X d1

@g(t)

@tj

~⌫(t)

!

2

3

5 .

This completes the proof.

Page 55: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

54 X. DAI AND P. CHIEN

Lemma A.12 (Inverse transformation). Assume that design points tej shave known density ⇡ej s which are supported on X d

1

. Then, there exists alinear transformation to data (tej , Y ej ) such that transformed design pointsxej s are independently uniformly distributed on X d

1

and the transformed re-sponses Zej s are the jth first-order partial derivative data of some function.

Proof. As remarked after (3.1), the design under our consideration hasthe following structure: di↵erent types design points can be grouped to somesets, where within the sets di↵erent types design points are drawn identicallyand across the sets the design points are drawn independently. We give theproof for two cases as follows for the illustration.

First, we consider that function observations and partial derivatives datashare a common design, i.e., t

ej

i

= teki

, 81 i n, 0 j < k p. Writetej = (t

ej

1

, . . . , tej

d

) 2 X d

1

. We allow covariates of tej can be correlated, thatis the density of tej is decomposed as:

⇡ej (t1

, . . . , td

) = ⇡ej

d

(td

)⇡ej

d�1

(td�1

|td

) · · ·⇡ej1

(t1

|td

, td�1

, . . . , t2

).

Denote by ⇧ejq

the CDF corresponding to ⇡ejq

, 1 q d. Let

xej

d

= ⇧ej

d

(tej

d

), xej

d�1

= ⇧ej

d�1

(tej

d�1

|tejd

), . . . , xej

1

= ⇧ej

1

(tej

1

|tejd

, tej

d�1

. . . , tej

2

).

Then, xej = (xej

1

, xej

2

, . . . , xej

d

) is uniformly distributed on X d

1

. Define that

h(x1

, x2

, . . . , xd

)

= f�{⇧ej

1

}�1(x1

|xd

, . . . , x2

), {⇧ej

2

}�1(x2

|xd

, . . . , x3

), . . . , {⇧ej

d

}�1(xd

)�

.

Thus,

@h(x)

@xj

=j

X

k=1

@f(t)

@tk

· @tk@x

j

=j�1

X

k=1

@f

@tk

· @tk@x

j

+@f

@tj

· 1

⇡ej

j

(tj

|td

, . . . , tj+1

).

With the design xej defined, we transform the responses Y ej s to Zej s byletting Ze0 = Y e0 and for any j = 1, . . . , p,

Zej =j�1

X

k=1

Y ek@t

ej

k

(xej

d

, xej

d�1

. . . , xej

k

)

@xj

+Y ej

⇡ej

j

(tej

j

|tejd

, . . . , tej

j+1

).

Write

�2j

=j�1

X

k=1

�2k

"

@tej

k

@xj

(xej

d

, xej

d�1

, . . . , xej

k

)

#

2

+�2j

h

⇡ej

j

(tej

j

|(tejd

, . . . , tej

j+1

)i

2

.

Page 56: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 55

Then, it is clear that Zej = @h/@xj

(xej ) + f✏ej , where the errors f✏ej s areindependent centered noises with variance �2

j

s.Second, we consider that not all types of function observations and partial

derivatives data share a common design, i.e., 90 j 6= k p and 1 i nsuch that t

ej

i

6= teki

. We require the covariates of each tej are independent,that is the density of tej can be decomposed as:

⇡ej (t1

, . . . , td

) = ⇡ej

1

(t1

)⇡ej

2

(t2

) · · ·⇡ejd

(td

)

Now let

xej

1

= ⇧ej

1

(tej

1

), xej

2

= ⇧ej

2

(tej

2

), . . . , xej

d

= ⇧ej

d

(tej

d

).

Then xej = (xej

1

, xej

2

, . . . , xej

d

) is uniformly distributed on X d

1

. Define thefunction

h(x1

, . . . , xd

) = f�{⇧ej

1

}�1(x1

), {⇧ej

2

}�1(x2

), . . . , {⇧ej

d

}�1(xd

)�

.

Thus, we have

@h(x)

@xj

=@f(t)

@tj

· @tj(xj)@x

j

=@f(t)

@tj

· 1

⇡ej

j

(tj

).

Correspondingly, the responses Y ej is transformed to Zej , 0 j p, byletting Ze0 = Y e0 and Zej = Y ej/⇡

ej

j

(tej

j

) for 1 j d, and write the

transformed variance �2j

= �2j

/[⇡ej

j

(tej

j

)]2.

Lemma A.13. Suppose that s � 1, � � 0 and � 6= 1, and r � 1. ThenZ

x1···xr·z⌅,xk�1,z�1

x�1

· · ·x�r

z�(log z)s(x21

+ · · ·+ x2r

)�1dx1

· · · dxr

dz

⇣ ⌅�+1(log⌅)s, as ⌅ ! 1.

Proof. For any ⌧ � 1, we have {1 z ⌅⌧�r, 1 xk

⌧, k =1, . . . , r} ⇢ {x

1

· · ·xr

· z ⌅, z � 1, xk

� 1, k = 1, . . . , r}. Thus, if ⌅ ! 1,

Z

x1···xr·z⌅,xk�1,z�1

x�1

· · ·x�r

z�(log z)s(x21

+ · · ·+ x2r

)�1dx1

· · · dxr

dz

�Z

⌅⌧

�r

1

Z

1

· · ·Z

1

z�(log z)sx��2

1

· · ·x��2

r

dx1

· · · dxr

dz

⇣ ⌅�+1⌧�r(�+1)(log⌅� r log ⌧)s⌧ r(��1).

Page 57: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

56 X. DAI AND P. CHIEN

Let ⌧ ! 1, we haveR

x1···xr·z⌅,xk�1,z�1

(log z)s(x21

+· · ·+x2r

)�1dx1

· · · dxr

dz &⌅�+1(log⌅)s.

On the other hand, define u = x1

· · ·xr

· z and change the variable z to u.We have that as ⌅ ! 1,Z

x1···xr·z⌅,xk�1,z�1

x�1

· · ·x�r

z�(log z)s(x21

+ · · ·+ x2r

)�1dx1

· · · dxr

dz

=

Z

1

Z

u

1

Z

u/xr

1

· · ·Z

u/(xrxr�1···x2)

1

u�(log u� log xr

� · · ·� log x1

)s

· �x21

+ · · ·+ x2r�1

+ x2r

��1

x�1

1

· · ·x�1

r�1

x�1

r

dx1

· · · dxr�1

dxr

du

.Z

1

Z

u

1

Z

u/xr

1

· · ·Z

u/(xrxr�1···x2)

1

u�(log u� log xr

� · · ·� log x1

)s

· x�1�2/r

1

· · ·x�1�2/r

r�1

x�1�2/r

r

dx1

· · · dxr�1

dxr

du

.Z

1

u�(log u)sdu ⇣ ⌅�+1(log⌅)s,

where the second step is by Lemma A.10. This completes the proof.

Lemma A.14. Suppose that � � 0 and 0 < ↵ 2. Then, as ⌅ ! 1,

Z

x1···xr⌅,xk�1

r

Y

k=1

x�k

(x↵1

+ x↵2

+ · · ·+ x↵r

)�1dx1

· · · dxr

8

>

>

>

>

<

>

>

>

>

:

⌅�+1�↵/r, if r � 3;

log(⌅), if r = 2,� = ↵/2� 1; ⌅�+1�↵/2 if r = 2,� > ↵/2� 1;

1, if r = 1,� < ↵� 1; log(⌅) if r = 1,� = ↵� 1;

⌅��↵+1 if r = 1,� > ↵� 1.

Proof. By the symmetry of covariates,

Z

x1···xr⌅,xk�1

r

Y

k=1

x�k

(x↵1

+ x↵2

+ · · ·+ x↵r

)�1dx1

· · · dxr

⇣Z

x1···xr⌅,x1�x2�···�xr�1

r

Y

k=1

x�k

(x↵1

+ x↵2

+ · · ·+ x↵r

)�1dxr

· · · dx1

:= E .First we prove when r � 3, as ⌅ ! 1, we have

E . ⌅�+1�↵/r.(A.23)

Page 58: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 57

For this, define the setK =

0 k r � 2 :⇣

x1···xr�k�1

1/(k+1) xr�k�1

.

If K is not empty, we denote the smallest element in K by k⇤. Then 0 k⇤ r � 2. For any (x

1

, . . . , xr

) 2 {(x1

, . . . , xr

) : x1

· · ·xr

⌅, x1

� x2

�· · · � x

r

� 1, xr

xr�1

x1···xr�1}, we have

(A.24)

8

>

>

>

>

>

>

<

>

>

>

>

>

>

:

1 xr�k

xr�k�1

for 0 k k⇤ � 1,

1 xr�k

⇤ ⇣

x1···xr�k⇤�1

1/(k

⇤+1)

for k = k⇤,

xr�k

�⇣

x1···xr�k�1

1/(k+1)

for k⇤ + 1 k r � 2,

x1

� ⌅1/r for k = r � 1.

Thus, as ⌅ ! 1,

(A.25)

E .Z

x1···xr⌅,x1�x2�···�xr�1

n

(x1

)��↵/(r�1) · · · (xr�k

⇤�1

)��↵/(r�1)

o

x�r�k

·n

(xr�k

⇤+1

)��↵/(r�1) · · · (xr

)��↵/(r�1)

o

dx

⇣Z

x1···xr⌅,x1�x2�···�xr�1

n

(x1

)��↵/(r�1) · · · (xr�k

⇤�1

)��↵/(r�1)

o

· (xr�k

⇤)[�+1�↵/(r�1)]k

⇤+�dx

r�k

⇤dxr�k

⇤�1

· · · dx1

⇣Z

x1···xr⌅,x1�x2�···�xr�1

n

(x1

)�1�↵/[(r�1)(k

⇤+1)] · · · (x

r�k

⇤�1

)�1�↵/[(r�1)(k

⇤+1)]

o

· ⌅�+1�↵k

⇤/[(r�1)(k

⇤+1)]dx

r�k

⇤�1

· · · dx1

= ⌅�+1�↵/r,

where the first step uses xr�k

⇤ � 1 and Lemma A.10, the second step usesxr�k

xr�k�1

for all k k⇤ � 1 in (A.24), the third step uses the up-per bound on x

r�k

⇤ in (A.24), the fourth step uses the lowers bounds onxr�k

for all k⇤ + 1 k r � 2 in (A.24). If K is empty, then for any(x

1

, . . . , xr

) 2 {(x1

, . . . , xr

) : x1

· · ·xr

⌅, x1

� x2

� · · · � xr

� 1, xr

xr�1

⌅/(x1

· · ·xr�1

)}, it satisfies

1 xk

xk�1

for any 2 k r, and 1 x1

⌅1/r.

Page 59: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

58 X. DAI AND P. CHIEN

Thus, as ⌅ ! 1,

(A.26)

E =

Z

1/r

1

· · ·Z

xr�2

1

Z

xr�1

1

r

Y

k=1

x�k

(x↵1

+ x↵2

+ · · ·+ x↵r�1

+ x↵r

)�1dxr

dxr�1

· · · dx1

.Z

1/r

1

· · ·Z

xr�2

1

Z

xr�1

1

x��↵/r

1

· · ·x��↵/r

r�1

x��↵/r

r

dxr

dxr�1

· · · dx1

⇣ ⌅�+1�↵/r.

Combining (A.25) and (A.26) completes the proof for (A.23).On the other hand, when r � 3 and as ⌅ ! 1,

(A.27)

E �Z

1/r

1

· · ·Z

xr�2

1

Z

xr�1

1

r

Y

k=1

x�k

(x↵1

+ · · ·+ x↵r�1

+ x↵r

)�1dxr

dxr�1

· · · dx1

�Z

1/r

1

· · ·Z

xr�2

1

Z

xr�1

1

r

Y

k=1

x�k

· r�1x�↵

1

dxr

dxr�1

· · · dx1

⇣ ⌅�+1�↵/r.

Therefore, combining (A.23) and (A.27) completes the proof of the lemmafor r � 3.

Then we consider for r = 2. For 0 < ↵ 2,

E 2

Z

p⌅

1

Z

x1

1

x��↵

1

x�2

dx2

dx1

+ 2

Z

p⌅

Z

⌅/x1

1

x��↵

1

x�2

dx2

dx1

⇣(

log(⌅) when 2� + 2� ↵ = 0

⌅�+1�↵/2 when 2� + 2� ↵ > 0as ⌅ ! 1.(A.28)

On the other hand, we have

(A.29)

E �Z

p⌅

1

Z

x1

1

x�1

x�2

(x↵1

+ x↵2

)�1dx2

dx1

� 2�1

Z

p⌅

1

Z

x1

1

x��2

1

x�2

dx2

dx1

⇣(

log(⌅) when 2� + 2� ↵ = 0

⌅m when 2� + 2� ↵ > 0as ⌅ ! 1.

Page 60: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 59

Combining (A.28) and (A.29) completes the proof of the lemma for r = 2.

Finally, we consider for r = 1. Note thatR

1

x�1

x�↵

1

dx1

⇣ 1 when 0 � <

↵ � 1, andR

1

x�1

x�↵

1

dx1

⇣ log(⌅) when � = ↵ � 1, andR

1

x�1

x�↵

1

dx1

⇣⌅��↵+1 when � > ↵� 1. This complete the proof.

Lemma A.15. Suppose that � �1 and ↵ > 0. Then, as ⌅ ! 1,

Z

x1···xr�⌅,xk�1

r

Y

k=1

x�k

(x↵1

+ x↵2

+ · · ·+ x↵r

)�1dx1

· · · dxr

⇣ ⌅�+1�↵/r.

Proof. The proof is similar to the proof for Lemma A.14. We omit thedetails here.

Lemma A.16. Suppose that m > 1. Then, as ⌅ ! 1,

Z

x

(m�1)/m1 x2···xr⌅,xk�1

(x21

+ x22

+ · · ·+ x2r

)�1x21

dx1

· · · dxr

⇣ ⌅m/(m�1).

Proof. When r = 1, the lemma can be verified by direct calculations.In what follows, assume r � 2. First, we show that LHS of the formulaabove is larger than the RHS up to some constant. It su�ces to consider a

subset of (x1

, x2

, . . . , xr

) which satisfy x(m�1)/m

1

� x2

� · · · � xr

� 1. Let

u1

= x(m�1)/m

1

, and uj

= u1

x2

· · ·xj

for 2 j r. By changing variables(x

1

, x2

, . . . , xr

) to (u1

, u2

, . . . , ur

), the LHS in the lemma satisfies

Z

x

(m�1)/m1 x2···xr⌅,xk�1

(x21

+ x22

+ · · ·+ x2r

)�1x21

dx1

· · · dxr

�Z

x

(m�1)/m1 x2···xr⌅,xk�1

(rx21

)�1x21

dx1

· · · dxr

= r�1

Z

1

Z

ur

u

(r�1)/rr

· · ·Z

u2

u

1/22

u1/(m�1)

1

u�1

1

· · ·u�1

r�1

du1

· · · dur�1

dur

⇣ ⌅m/(m�1).

Second, we show that RHS of the formula above is larger than the LHS up

Page 61: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

60 X. DAI AND P. CHIEN

to some constant. Note that (x21

+x22

+ · · ·+x2r

)�1x21

1, so the LHS satisfies

Z

x

(m�1)/m1 x2···xr⌅,xk�1

(x21

+ x22

+ · · ·+ x2r

)�1x21

dx1

· · · dxr

Z

x

(m�1)/m1 x2···xr⌅,xk�1

1dx1

· · · dxr

= r�1

Z

1

Z

ur

u

(r�1)/rr

· · ·Z

u2

u

1/22

u1/(m�1)

1

u�1

1

· · ·u�1

r�1

du1

· · · dur�1

dur

⇣ ⌅m/(m�1).

This completes the proof.

REFERENCES

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68337–404. MR0051437

[2] Bates, R. A., Buck, R. J., Riccomagno, E. and Wynn, H. P. (1996). Experimen-tal design and observation for large systems. J. Roy. Statist. Soc. Ser. B. 58 77–94.MR1379235

[3] Breckling, J. (1989). The Analysis of Directional Time Series: Applications to WindSpeed and Direction 61. Springer-Verlag, Berlin. MR1027836

[4] Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additivemodels. Ann. Statist. 17 453–555. MR0994249

[5] Carr, J. C., Beatson, R. K., Cherrie, J. B., Mitchell, T. J., Fright, W. R.,McCallum, B. C. and Evans, T. R. (2001). Reconstruction and representation of3D objects with radial basis functions. In Proceedings of the 28th Annual Conferenceon Computer Graphics and Interactive Techniques 67–76. ACM, New York.

[6] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed.John Wiley & Son, New York. MR2239987

[7] Cox, D. D. (1988). Approximation of method of regularization estimators. Ann.Statist. 16 694–712. MR0947571

[8] Cox, D. D. and O’Sullivan, F. (1990). Asymptotic analysis of penalized likelihoodand related estimators. Ann. Statist. 18 1676–1695. MR1074429

[9] Forreste, A., Keane, A. and Sobester, A. (2008). Engineering Design via Surro-gate Modeling: A Practical Guide. John Wiley & Son, New York.

[10] Frees, E. W. and Valdez, E. A. (1998). Understanding relationships using copulas.N. Am. Actuar. J. 2 1–25. MR1988432

[11] Golub, G. H. and Ortega, J. M. (2014). Scientific Computing and Di↵eren-tial Equations: an Introduction to Numerical Methods. Academic Press, Cambridge.MR1133393

[12] Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles andTechniques of Algorithmic Di↵erentiation, 2nd ed. SIAM, Philadelphia. MR2454953

[13] Gu, C. (2013). Smoothing Spline ANOVA Models, 2nd ed. Springer, New York.MR3025869

[14] Hall, P. and Yatchew, A. (2007). Nonparametric estimation when data on deriva-tives are available. Ann. Statist. 35 300–323. MR2332277

Page 62: Minimax Optimal Rates of Estimation in Functional … ·  · 2017-09-13MINIMAX OPTIMAL RATES OF ESTIMATION IN ... method of regularization, ... The latter approach is represented

FUNCTIONAL ANOVA ESTIMATION WITH DERIVATIVES 61

[15] Hall, P. and Yatchew, A. (2010). Nonparametric least squares estimation in derivative families. J. Econometrics. 157 362–374. MR2661608

[16] Hansen, E. and Walster, G. W. (2003). Global Optimization Using Interval Analysis: Revised and Expanded 264. CRC Press, Boca Raton. MR2025041

[17] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London. MR1082147

[18] Jorgenson, D. W. (1986). Econometric methods for modeling producer behavior. Handbooks in Econom. 3 1841–1915. MR0858551

[19] Kiefer, J. and Wolfowitz, J. (1959). Optimum designs in regression problems. Ann. Math. Statist. 30 271–294. MR0104324

[20] Lin, Y. (2000). Tensor product space ANOVA models. Ann. Statist. 28 734–755. MR1792785

[21] Mitchell, T. J., Morris, M. D. and Ylvisaker, D. (1994). Asymptotically optimum experimental designs for prediction of deterministic functions given derivative information. J. Statist. Plann. Inference. 41 377–389. MR1309620

[22] Morris, M. D., Mitchell, T. J. and Ylvisaker, D. (1993). Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics. 35 243–255. MR1234641

[23] Murray-Smith, R. and Sbarbaro, D. (2002). Nonlinear adaptive control using non-parametric Gaussian process prior models. IFAC Proceedings Volumes. 35 325–330.

[24] Oden, J. T. and Reddy, J. N. (2012). An Introduction to the Mathematical Theory of Finite Elements. John Wiley & Sons, New York. MR0461950

[25] Plessix, R-E. (2006). A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophys. J. Int. 167 495–503.

[26] Ramsay, J. O., Hooker, G., Campbell, D. and Cao, J. (2007). Parameter estimation for differential equations: a generalized smoothing approach. J. Roy. Statist. Soc. Ser. B. 69 741–796. MR2368570

[27] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge. MR2514435

[28] Riccomagno, E., Schwabe, R. and Wynn, H. P. (1997). Lattice-based D-optimum design for Fourier regression. Ann. Statist. 25 2313–2327. MR1604453

[29] Riesz, F. and Sz.-Nagy, B. (1955). Functional Analysis. Dover Publications, New York. MR1068530

[30] Schwarz, K. P. (1979). Geodetic improperly posed problems and their regularization. Bollettino di Geodesia e Scienze Affini. 38 389–416.

[31] Shepherd, R. W. (2015). Theory of Cost and Production Functions. Princeton University Press, Princeton. MR0414052

[32] Solak, E., Murray-Smith, R., Leithead, W. E., Leith, D. J. and Rasmussen, C. E. (2003). Derivative observations in Gaussian process models of dynamic systems. In Advances in Neural Information Processing Systems 1057–1064.

[33] Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist. 8 1348–1360. MR0594650

[34] Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040–1053. MR0673642

[35] Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689–705. MR0790566

[36] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York. MR2724359

[37] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. MR1045442


[38] Wahba, G., Wang, Y., Gu, C., Klein, R. and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Ann. Statist. 23 1865–1895. MR1389856

[39] Weinberger, H. F. (1974). Variational Methods for Eigenvalue Approximation. SIAM, Philadelphia. MR0400004

[40] Yuan, M. and Cai, T. T. (2010). A reproducing kernel Hilbert space approach to functional linear regression. Ann. Statist. 38 3412–3444. MR2766857

[41] Cartan, H. P. (1971). Differential Calculus 1. Hermann, Paris. MR0344032

[42] Cox, D. D. (1988). Approximation of method of regularization estimators. Ann. Statist. 16 694–712. MR0947571

[43] Donoho, D. L., Liu, R. C. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist. 18 1416–1437. MR1062717

[44] Lin, Y. (1998). Tensor product space ANOVA models in multivariate function estimation. PhD thesis, University of Pennsylvania. MR2697355

[45] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. MR1045442

[46] Weinberger, H. F. (1974). Variational Methods for Eigenvalue Approximation. SIAM, Philadelphia. MR0400004

Department of Statistics
University of Wisconsin-Madison
1300 University Avenue
Madison, Wisconsin 53706
USA
E-mail: [email protected]
[email protected]