Penalized nonparametric drift estimation for a multidimensional diffusion process
TRANSCRIPT
This article was downloaded by: [UOV University of Oviedo] On: 29 October 2014, At: 09:23. Publisher: Taylor & Francis. Informa Ltd Registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Statistics: A Journal of Theoretical and Applied Statistics. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gsta20
Penalized nonparametric drift estimation for a multidimensional diffusion process. Emeline Schmisser, Laboratoire MAP5, UMR CNRS 8145, Université Paris Descartes, 45 rue des Saints Pères, 75270 Paris Cedex 06, France. Published online: 13 Jul 2011.
To cite this article: Emeline Schmisser (2013) Penalized nonparametric drift estimation for a multidimensional diffusion process, Statistics: A Journal of Theoretical and Applied Statistics, 47:1, 61-84, DOI: 10.1080/02331888.2011.591931
To link to this article: http://dx.doi.org/10.1080/02331888.2011.591931
PLEASE SCROLL DOWN FOR ARTICLE
Taylor & Francis makes every effort to ensure the accuracy of all the information (the "Content") contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.
This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions
Statistics, 2013, Vol. 47, No. 1, 61–84, http://dx.doi.org/10.1080/02331888.2011.591931
Penalized nonparametric drift estimation for a multidimensional diffusion process
Emeline Schmisser*
Laboratoire MAP5, UMR CNRS 8145, Université Paris Descartes, 45 rue des Saints Pères, 75270 Paris Cedex 06, France
(Received 10 July 2009; final version received 24 May 2011)
We consider a multidimensional diffusion process (X_t)_{t≥0} with drift vector b and diffusion matrix Σ. This process is observed at n + 1 discrete times with regular sampling interval Δ. We review sufficient conditions for the existence and uniqueness of an invariant density. In a second step, we assume that the process is stationary, and estimate the drift function b on a compact set K in a nonparametric way. For this purpose, we consider a family of finite-dimensional linear subspaces of L²(K), and compute a collection of drift estimators on every subspace by a penalized least-squares approach. We introduce a penalty function and select the best drift estimator. We obtain a bound for the risk of the resulting adaptive estimator. Our method works for any dimension d but, for the sake of clarity, we focus on the case d = 2. We also provide several examples of two-dimensional diffusions satisfying our assumptions, and carry out various simulations. Our results illustrate the theoretical properties of our estimators.
Keywords: drift; model selection; multidimensional diffusions; nonparametric estimation; stationary distribution
1. Introduction
Let us consider a d-dimensional diffusion process (X_t)_{t≥0} = (X_t^1, X_t^2, ..., X_t^d)_{t≥0} satisfying the stochastic differential equation (SDE):

dX_t = b(X_t) dt + Σ(X_t) dW_t,  X_0 = η,  (1)

where b(x) = (b_i(x))_{1≤i≤d} is a d-dimensional vector, Σ(x) = (σ_{ij}(x))_{1≤i,j≤d} a d × d matrix, η a d-dimensional random vector and (W_t) a Brownian motion of R^d independent of η. The process (X_t) is assumed to be strictly stationary and ergodic. Our aim is to carry out nonparametric estimation of the drift function b, given discretized observations.
For one-dimensional processes, nonparametric drift estimation has been the subject of several contributions (see, e.g., Kutoyants [1] and the references therein for kernel methods). Hoffmann [2] studied nonparametric adaptive estimators using projections on wavelet bases. However, these estimators are difficult to implement numerically. Comte et al. [3] proposed different nonparametric estimators, based on a penalized least-squares approach. Their estimators are easily computable
*Email: [email protected]
© 2013 Taylor & Francis
and have optimality properties. Our aim is to extend these results to multidimensional diffusions. Statistical inference for multidimensional ergodic diffusions is not often studied. This is partly because the characterization and computation of stationary laws is more difficult than for one-dimensional models. Dalalyan and Reiß [4] studied nonparametric drift estimation for multidimensional ergodic diffusions based on a continuous-time observation of the sample path. They use a kernel estimator and exhibit optimal minimax rates of convergence by means of asymptotic statistical equivalence. Here, we assume that the process is strictly stationary, ergodic and β-mixing, and is discretely observed: the discrete observations (X_0, X_Δ, ..., X_{(n+1)Δ}) have a sampling interval Δ. Our asymptotic framework is: n tends to infinity, Δ = Δ_n tends to 0 and nΔ tends to infinity. The drift vector is estimated on a compact set K ⊂ R^d. For each component b_i(x), i = 1, ..., d, of the drift, we define a collection of nonparametric estimators (b̂_{m,i})_m of b_i belonging to a family of linear subspaces (𝐒_m)_m of L²(K). Then, introducing a penalty, we select the best estimator b̂_{m̂_i,i}: the adaptive estimator risk reaches the usual optimal nonparametric rate. The case of an additive model, i.e. b(x_1, ..., x_d) = ∑_{i=1}^d c_i(x_i), is studied separately. Such models lead to better rates of convergence; see, e.g., Wang and Yang [5] and references therein for the regression model. In the case of an additive drift, the optimal rate obtained for our estimator is the same as in dimension d = 1.
In Section 2, we specify the model and its assumptions. We review sufficient conditions for the existence and uniqueness of an invariant density for the SDE (1). Section 3 describes the approximation spaces. Section 4 presents the estimator and studies its risk as in Comte et al. [3]. Section 5 gives some indications on the estimation algorithm and proposes examples of multidimensional diffusion processes for which data are simulated and estimators are implemented. Numerical simulation results are convincing, even if some theoretical assumptions are not satisfied. Proofs are given in Section 6.
2. Model and assumptions
We consider a diffusion process (X_t) satisfying (1). We denote by 〈x, y〉 = ∑_{i=1}^d x_i y_i the usual inner product of R^d, by |x| the associated norm and by |M|_mat a matrix norm. For a matrix M, M* denotes its transpose. Let us consider the following assumptions:
Assumption 1 The functions Σ(x) and b(x) are globally Lipschitz:

∃L, ∀(x, y) ∈ (R^d)², |Σ(x) − Σ(y)|_mat + |b(x) − b(y)| ≤ L|x − y|.
Assumption 2 There exist constants r > 0 and α ≥ 1 such that

∃M₀ ∈ R⁺, ∀x, |x| > M₀, 〈b(x), x〉 ≤ −r|x|^α.
Assumption 3 (i) The diffusion matrix A(x) = Σ(x)Σ*(x) = (a_{ij}(x))_{1≤i,j≤d} is bounded and positive. Let σ₀² be such that

∀x, Tr(A(x)) ≤ σ₀².

(ii) The matrix A satisfies:

∃λ₋, λ₊ > 0, ∀x ∈ R^d, λ₋|x|² ≤ 〈A(x)x, x〉 ≤ λ₊|x|².
Assumption 4 (i) b ∈ C¹(R^d, R^d), A ∈ C²(R^d, R^d ⊗ R^d).
(ii) There exists a function V ∈ C²(R^d, R) satisfying:

b = (1/2) ( ∑_{j=1}^d ∂_{x_j} a_{ij} )_{1≤i≤d} − A∇V.

(iii) c = ∫_{R^d} exp(−2V(x)) dx < +∞.
Assumption 1 implies existence and uniqueness of a process (X_t)_{t≥0}, solution of (1) (see Karatzas and Shreve [6], Theorem 2.5, p. 281). Under Assumptions 2–3, there exists a unique invariant density (see Pardoux and Veretennikov [7], Veretennikov [8]). For scalar diffusions (d = 1), the expression of the invariant density is simple and explicit (see, e.g., Kutoyants [1]). For multidimensional diffusions, one cannot have, in the general case, an explicit expression for the invariant density. The interest of Assumption 4 is that it allows one to obtain it. Let us set
π(x) = c^{−1} exp(−2V(x)).  (2)
Proposition 1 Under Assumptions 1–4, π is the unique invariant density of Equation (1).
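To see Assumption 4 and Proposition 1 at work in the simplest setting, here is a one-dimensional worked example (not in the original; d = 1 and a constant diffusion coefficient σ are assumed), for the Ornstein–Uhlenbeck equation dX_t = −θX_t dt + σ dW_t with θ, σ > 0:

```latex
% d = 1, A(x) = \sigma^2 constant, so Assumption 4(ii) reduces to b = -A V':
-\theta x = -\sigma^{2} V'(x)
\;\Longrightarrow\;
V(x) = \frac{\theta x^{2}}{2\sigma^{2}},
\qquad
c = \int_{\mathbb{R}} e^{-2V(x)}\,dx
  = \int_{\mathbb{R}} e^{-\theta x^{2}/\sigma^{2}}\,dx
  = \sqrt{\frac{\pi\sigma^{2}}{\theta}} < +\infty .
% Hence, by (2),
\pi(x) = c^{-1} e^{-2V(x)}
       = \sqrt{\frac{\theta}{\pi\sigma^{2}}}\, e^{-\theta x^{2}/\sigma^{2}},
\qquad
\text{i.e. } \pi = \mathcal{N}\!\Bigl(0, \frac{\sigma^{2}}{2\theta}\Bigr),
% the classical stationary law of the Ornstein--Uhlenbeck process.
```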
Results on stationary distributions for Markov processes and diffusions can be found, e.g., in Bhattacharya [9] or Ethier and Kurtz [10]. For the sake of clarity, we detail the proof of Proposition 1 in Section 6. Under Assumptions 1–4, on each compact subset of R^d, π is bounded from below and above by positive constants, a property which is crucial for the nonparametric estimation method (see, e.g., Comte et al. [3]). Let us assume:
Assumption 5 η ∼ π .
Under Assumptions 1–5, according to Pardoux and Veretennikov [7], the process (X_t) is strictly stationary and exponentially β-mixing: there exist positive constants C, θ such that, for all t > 0,

β_X(t) ≤ C e^{−θt},
where β_X(t) denotes the β-mixing coefficient of (X_t). Recall that, for a stationary diffusion process,

β_X(s) = (1/2) ‖P_{(X_0,X_s)} − P_{X_0} ⊗ P_{X_0}‖_TV,

where P_{(X_0,X_s)} is the joint law of (X_0, X_s), P_{X_0} the law of X_0 and ‖·‖_TV the total variation distance. Furthermore,
∃ν > 0, E[exp(ν|η|)] < ∞.
In particular, η has moments of any order.
3. Approximation spaces
Our aim is to estimate the drift function b on a compact set K of R^d. Without loss of generality, we consider K = [0, 1]^d.
As the stationary density π(x) is proportional to exp(−2V(x)), there exist constants π₀ and π₁ such that, for all x ∈ K,

0 < π₀ ≤ π(x) ≤ π₁ < +∞.
Below, we construct a family (𝐒_m)_{m∈M_n} of linear subspaces of L²(K), with D_m = dim(𝐒_m), where M_n is the index set of the collection:

M_n = M_n(r) = {m, D_m ≤ N_n},  (3)

where the maximal dimension N_n will be specified later. For each m ∈ M_n, we compute an estimator b̂_m of b belonging to 𝐒_m. Then we choose the 'best' possible estimator by introducing a penalty function pen(m). For simplicity, we describe the case d = 2. The construction of a collection would be exactly the same for any dimension d. We start by constructing subspaces of L²([0, 1]). Then we deduce subspaces of L²([0, 1]²).
3.1. Construction of univariate subspaces
For our construction, we use spline functions. We recall here some of their properties. The B-spline function of degree r is denoted by g_r, where

g_r = 1_{[0,1]} ∗ 1_{[0,1]} ∗ ⋯ ∗ 1_{[0,1]}  ((r + 1) factors)

is the (r + 1)-fold convolution of the indicator function of [0, 1]. This function is a piecewise polynomial of degree r with support [0, r + 1] and, for any r ≥ 1, it belongs to C^{r−1}. According to standard properties of convolution, we obtain by induction that, for any integer r, ∫_{−∞}^{+∞} g_r(x) dx = 1 and, for all x ∈ R, ∑_{k∈Z} g_r(x − k) = 1. Let us fix r ≥ 1 and denote, for k ∈ Z,

f_{0,k} = g_r(· − k) 1_{[0,1]} and S_0 = Vect{(f_{0,k}), k ∈ Z}.

Every function g ∈ S_0 has support in [0, 1] and can be written as

g = ∑_{k=−r}^{0} α_k f_{0,k}.

The functions f_{0,k}, for k ∉ {−r, ..., 0}, are identically null. Let us define, for m ∈ N, k ∈ Z,

f_{m,k}(x) = 2^{m/2} g_r(2^m x − k) 1_{[0,1]}(x),

which has support

[k/2^m ∨ 0, (k + r + 1)/2^m ∧ 1].

The non-null functions f_{m,k} correspond to k ∈ {−r, −r + 1, ..., 2^m − 1}. Their supports are not disjoint, but these functions are linearly independent. Let us set S_m = Vect{(f_{m,k}), k ∈ Z}, the vector space generated by the functions f_{m,k}. Its dimension is d_m = 2^m + r. Any function g ∈ S_m has support included in [0, 1] and can be written as

g = ∑_{k=−r}^{2^m−1} α_{m,k} f_{m,k}.

Moreover, as g_r² ≤ g_r ≤ 1, we have that

∫_R f_{m,k}²(x) dx ≤ ∫_R g_r²(x) dx ≤ 1.
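The convolution definition of g_r is easy to check numerically. The sketch below is an illustration, not part of the paper's procedure; the grid size and the name bspline_gr are our choices. It builds g_r by repeated discrete convolution of the indicator of [0, 1] and verifies the integral, the support and the partition of unity ∑_k g_r(x − k) = 1.

```python
import numpy as np

def bspline_gr(r, n_grid=1000):
    """Approximate g_r = 1_[0,1] * ... * 1_[0,1] ((r + 1)-fold convolution)
    on a grid of step 1/n_grid; the support of g_r is [0, r + 1]."""
    dx = 1.0 / n_grid
    ind = np.ones(n_grid)            # samples of the indicator of [0, 1)
    g = ind.copy()
    for _ in range(r):               # r further convolutions -> degree-r spline
        g = np.convolve(g, ind) * dx
    x = np.arange(g.size) * dx
    return x, g

x, g = bspline_gr(2)                 # quadratic B-spline g_2, support [0, 3]
area = g.sum() * (x[1] - x[0])       # Riemann sum: should be close to 1
```

For r = 2, the maximum of g_r is 3/4, attained at x = 3/2, and g_2(0.5) + g_2(1.5) + g_2(2.5) = 1, which illustrates the partition of unity up to discretization error.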
Proposition 2 There exists a positive constant φ₀ such that, for any t ∈ S_m:

‖t‖²_∞ ≤ φ₀² d_m ‖t‖²_{L²}.
Remark 1 Actually, we could use any sequence of subspaces (S_m) such that:

• The subspaces S_m have finite dimension d_m, and ∀m ≥ 0, S_m ⊆ S_{m+1}.
• The norms ‖·‖_∞ and ‖·‖_{L²} are connected: there exists a constant φ₀ such that, for any t ∈ S_m:

‖t‖²_∞ ≤ φ₀ d_m ‖t‖²_{L²}.

• If t belongs to the unit ball of a Besov space B^α_{2,∞}([0, 1]) (with α ≤ r), then

‖t − t_m‖²_{L²} ≤ c d_m^{−2α}.

The subspaces generated by piecewise polynomials, B-spline functions, trigonometric polynomials and compactly supported wavelets satisfy these assumptions. We focus on subspaces generated by B-splines of degree r for simplicity, as these functions are easily computable and the resulting estimators are smooth (C^{r−1}).
3.2. Bi-variate spaces
Now, we build subspaces 𝐒_m of L²([0, 1]²). For this purpose, we use two different constructions. The first one uses tensor products. It is more general than the second and allows us to approximate any function of L²([0, 1]²). Let us set 𝐒_m = S_m ⊗ S_m. Any function g ∈ 𝐒_m can be written as

g(x, y) = ∑_{k,l=−r}^{2^m−1} α_{m,k,l} f_{m,k}(x) f_{m,l}(y).

The family F_m = {f_{m,k}(x) f_{m,l}(y)} is a basis of 𝐒_m and we have D_m = dim(𝐒_m) = d_m² = (2^m + r)². For d-variate spaces, we would have

D_m = d_m^d = (2^m + r)^d.
The second construction only allows us to estimate drift functions of the additive form

b(x, y) = c(x) + e(y).

In that case, our estimator is of the form b̂(x, y) = ĉ(x) + ê(y). As a consequence, we can consider the family

𝐒_m = { (x, y) ↦ g(x, y), g(x, y) = ∑_{k=−r}^{2^m−1} α_{m,k} f_{m,k}(x) + ∑_{l=−r}^{2^m−1} β_{m,l} f_{m,l}(y) }.

The dimension of the latter 𝐒_m is D_m = 2(r + 2^m). In the d-variate case, we would have

D_m = d(r + 2^m).
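The gap between the two dimension counts is what drives the different convergence rates obtained later; a minimal sketch (the function names are ours) makes it concrete:

```python
def dim_tensor(m, r, d=2):
    """Dimension of the tensorized space: D_m = (2**m + r)**d."""
    return (2**m + r) ** d

def dim_additive(m, r, d=2):
    """Dimension of the additive space: D_m = d * (2**m + r)."""
    return d * (2**m + r)

# degree r = 3 splines at level m = 5, dimension d = 2:
print(dim_tensor(5, 3), dim_additive(5, 3))   # 1225 70
```

The tensorized dimension grows like 2^{md}, the additive one only like 2^m, which is why the additive model recovers the one-dimensional rate.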
In order to use generic notations, we consider F_m = {(φ_{m,k})_{0≤k≤D_m−1}}, a basis of 𝐒_m. According to Meyer [11], B-spline functions constitute a multiresolution analysis of L²(R) and, according to Proposition 4, p. 50, therein, we deduce
Proposition 3 Let t belong to the Besov space B^α_{2,∞}([0, 1]^d) and let t_m be its orthogonal projection (L²) over 𝐒_m, with r ≥ α. Then

∃C > 0, ‖t − t_m‖_{L²} ≤ C 2^{−mα}.
Remark 2 We could as well use anisotropic spaces 𝐒_{m₁,m₂} = S_{m₁} ⊗ S_{m₂} depending on m₁, m₂, r₁ and r₂, generated by F_{m₁,m₂} = {f_{m₁,k}(x) f_{m₂,l}(y)}, with D_m = (r₁ + 2^{m₁})(r₂ + 2^{m₂}). Additive spaces would be

𝐒_{m₁,m₂} = S_{m₁} + S_{m₂} = { (x, y) ↦ g(x, y) = ∑_{k=−r₁}^{2^{m₁}−1} α_{m₁,k} f_{m₁,k}(x) + ∑_{l=−r₂}^{2^{m₂}−1} β_{m₂,l} f_{m₂,l}(y) },

with dimension D_m = r₁ + r₂ + 2^{m₁} + 2^{m₂}. The following result replaces Proposition 3:
Proposition 4 Let t belong to the Besov space B^{α₁,α₂}_{2,∞}([0, 1]²), with projection t_{m₁,m₂} on 𝐒_{m₁,m₂}. Assume that r₁ ≥ α₁ and r₂ ≥ α₂. We have that

∃C > 0, ‖t − t_{m₁,m₂}‖_{L²} ≤ C(2^{−m₁α₁} + 2^{−m₂α₂})

(see Lacour [12], Lemma 9).
4. Drift estimation
4.1. Notations
Remember that K = [0, 1]^d and set

Y_{kΔ} = (X_{(k+1)Δ} − X_{kΔ})/Δ,  Z_{kΔ} = (1/Δ) ∫_{kΔ}^{(k+1)Δ} Σ(X_s) dW_s,  (4)

and, for any function t from R^d to R:

I_{kΔ}(t) = (1/Δ) ∫_{kΔ}^{(k+1)Δ} (t(X_s) − t(X_{kΔ})) ds.
By (1), for i = 1, ..., d,

Y^i_{kΔ} = b_i(X_{kΔ}) + I_{kΔ}(b_i) + Z^i_{kΔ},

where x^i denotes the ith component of the vector x. In this equation, b_i(X_{kΔ}) is the main term, Z^i_{kΔ} a noise term and I_{kΔ}(b_i) a remainder term. We estimate each component b_i of the drift b. Let us consider, for i = 1, ..., d, the contrast

γ_{n,i}(t) = (1/n) ∑_{k=1}^n (Y^i_{kΔ} − t(X_{kΔ}))²  (5)

and define the estimator b̂_{m,i}:

b̂_{m,i} = arg min_{t∈𝐒_m} γ_{n,i}(t).
We can always find a function b̂_{m,i} which minimizes γ_{n,i}, but it may not be unique. On the contrary, setting Y = (Y^i_Δ, Y^i_{2Δ}, ..., Y^i_{nΔ}), the random vector (b̂_{m,i}(X_Δ), ..., b̂_{m,i}(X_{nΔ})) = Π_m(Y), where
Π_m is the Euclidean projection over the subspace {(t(X_Δ), ..., t(X_{nΔ})), t ∈ 𝐒_m}, is always uniquely defined. For this reason, as in Comte et al. [3], we choose the risk function equal to

R(b̂_{m,i}) = E(‖b̂_{m,i} − b_{i,K}‖²_n),

where ‖t‖²_n = (1/n) ∑_{k=1}^n t²(X_{kΔ}) and b_{i,K} = b_i 1_K.
4.2. Risk of the non-adaptive estimator
Using Equations (4) and (5), we obtain that

γ_{n,i}(t) − γ_{n,i}(b_i) = ‖t − b_i‖²_n + (2/n) ∑_{k=1}^n (b_i − t)(X_{kΔ}) Z^i_{kΔ} + (2/n) ∑_{k=1}^n (b_i − t)(X_{kΔ}) I_{kΔ}(b_i).
Set

ν_{n,i}(t) = (1/n) ∑_{k=1}^n t(X_{kΔ}) Z^i_{kΔ}.
The orthogonal projection (L²) of b_i over 𝐒_m is denoted b_{m,i}. We have

γ_{n,i}(b̂_{m,i}) ≤ γ_{n,i}(b_{m,i}),
γ_{n,i}(b̂_{m,i}) − γ_{n,i}(b_i) ≤ γ_{n,i}(b_{m,i}) − γ_{n,i}(b_i).

So we can write

‖b̂_{m,i} − b_i‖²_n ≤ ‖b_{m,i} − b_i‖²_n + 2ν_{n,i}(b̂_{m,i} − b_{m,i}) + (2/n) ∑_{k=1}^n (b̂_{m,i} − b_{m,i})(X_{kΔ}) I_{kΔ}(b_i).

As the supports of b̂_{m,i} and b_{m,i} are included in K,

‖b̂_{m,i} − b_{i,K}‖²_n ≤ ‖b_{m,i} − b_{i,K}‖²_n + 2ν_{n,i}(b̂_{m,i} − b_{m,i}) + (2/n) ∑_{k=1}^n (b̂_{m,i} − b_{m,i})(X_{kΔ}) I_{kΔ}(b_i).
Let us introduce the following assumption:

Assumption 6 (i) Δ = Δ_n ≤ 1, nΔ_n ≥ 1.
(ii) The maximal dimension N_n (see Equation (3)) satisfies

N_n² ≤ (π₀² / (216 θ π₁ φ₀²)) (nΔ_n / ln²(n)).

The use of this assumption is as follows. In the proofs, we introduce a set Ω_n on which the empirical norm ‖·‖_n and the norm ‖·‖_{L²} are equivalent for functions of the maximal space. The risk is studied on Ω_n, and Assumption 6 (ii) is used to prove that P(Ω_n^c) is negligible.
Theorem 1 Under Assumptions 1–6, the risk of the drift estimator b̂_{m,i} belonging to a space 𝐒_m satisfies, for all i = 1, ..., d:

E(‖b̂_{m,i} − b_{i,K}‖²_n) ≤ C(‖b_{m,i} − b_{i,K}‖²_{L²} + σ₀² D_m/(nΔ)) + C′Δ + C″/(nΔ)  (6)

with C, C′, C″ constants. Under the asymptotic framework Δ → 0 and nΔ → ∞, the estimator b̂_{m,i} converges to b_{i,K}.
The term C′Δ + C″/(nΔ) is a remainder term. It does not depend on m and, under the classical assumption nΔ² → 0, it is negligible. The term ‖b_{m,i} − b_{i,K}‖²_{L²} is the bias term; it decreases when the dimension D_m increases. On the contrary, the variance term, σ₀² D_m/(nΔ), increases when the dimension increases. To construct an adaptive estimator, we have to choose m, that is, to find the best compromise between the bias term and the variance term.
4.3. Optimization of the space dimension

For given (n, Δ), we wish to select m in order to obtain the best compromise between the bias term, ‖b_{m,i} − b_{i,K}‖²_{L²}, and the main variance term, D_m/(nΔ). In a first step, we assume that the regularity is known, i.e. that b_{i,K} ∈ B^α_{2,∞} and ‖b_{i,K}‖²_{B^α_{2,∞}} ≤ 1, with r ≥ α. Thanks to Proposition 3, we have that

‖b_{i,K} − b_{m,i}‖²_{L²} ≤ C 2^{−2mα}.

Let us distinguish two cases. If D_m = (2^m + r)^d, i.e. if D_m is of order 2^{md} (tensor product), m has to satisfy the equation

m = (1/(d + 2α)) log₂(nΔ).

If D_m = d(2^m + r) (additive model), D_m is of order 2^m and we must have

m = (1/(1 + 2α)) log₂(nΔ).

Using Equation (6), we obtain, for a basis obtained by tensor product,

E(‖b̂_{m,i} − b_{i,K}‖²_n) ≤ K(nΔ)^{−2α/(2α+d)} + C′Δ + C″/(nΔ)

or, for the additive model,

E(‖b̂_{m,i} − b_{i,K}‖²_n) ≤ K(nΔ)^{−2α/(2α+1)} + C′Δ + C″/(nΔ).

In the latter case, our estimator converges at the same rate as for a one-dimensional model. In the nonadditive model, the estimator b̂_{m,i} reaches the optimal rate of convergence obtained by Dalalyan and Reiß [4].
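The balancing level can be evaluated directly; the helper below (the name optimal_level is ours, not the paper's) simply computes the m solving 2^{−2mα} ≍ 2^{md}/(nΔ) in the tensorized case and 2^{−2mα} ≍ 2^m/(nΔ) in the additive case:

```python
import math

def optimal_level(n, delta, alpha, d=2, additive=False):
    """Level m balancing the squared bias 2**(-2*m*alpha) against the variance,
    of order 2**(m*d)/(n*delta) (tensorized) or 2**m/(n*delta) (additive)."""
    denom = (1 + 2 * alpha) if additive else (d + 2 * alpha)
    return math.log2(n * delta) / denom

# n = 10_000, delta = 0.1, alpha = 2, d = 2 (so n*delta = 1000):
m_tensor = optimal_level(10_000, 0.1, alpha=2)                   # log2(1000)/6 ~ 1.66
m_additive = optimal_level(10_000, 0.1, alpha=2, additive=True)  # log2(1000)/5 ~ 1.99
```

At this level the squared bias is of order (nΔ)^{−2α/(2α+d)} (respectively (nΔ)^{−2α/(2α+1)}), the rate displayed above.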
Remark 3 Using anisotropic tensorized bases, we would have that

E(‖b̂_{m,i} − b_{i,K}‖²_n) ≤ K(nΔ)^{−2α/(2α+d)} + C′Δ + C″/(nΔ),

with α defined by d/α = ∑_{i=1}^d 1/α_i. For anisotropic additive models, we would find that

E(‖b̂_{m,i} − b_{i,K}‖²_n) ≤ K(nΔ)^{−2α/(2α+1)} + C′Δ + C″/(nΔ),

where α = min(α_i): in this case, using anisotropic bases does not modify the theoretical convergence rate.
4.4. Adaptive estimation
Since we do not know the regularity of b_i, it is important to construct an algorithm which selects m automatically, without any knowledge of the regularity of b_i. For that purpose, we introduce a penalty function pen(m), depending on the dimension D_m, on the number of observations n and on the discretization step Δ. Then, we define

m̂_i = arg min_{m∈M_n} [γ_{n,i}(b̂_{m,i}) + pen(m)]

with the penalty function pen(m) such that

pen(m) ≥ κσ₀² D_m/(nΔ).

We denote by b̂_i := b̂_{m̂_i,i} the resulting estimator. In our simulations, we used pen(m) = κσ₀² D_m/(nΔ) with κ = 5. (This constant was chosen by numerical calibration; see Comte and Rozenholc [13,14] for a complete discussion.)
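In code, the selection step is a one-line argmin; the sketch below (our function name; the minimized contrasts gammas are assumed to have been computed beforehand, one per space) mirrors the rule m̂ = argmin [γ + pen] with pen(m) = κσ₀²D_m/(nΔ):

```python
import numpy as np

def select_model(gammas, dims, n, delta, sigma2_0, kappa=5.0):
    """Return the index m minimizing gamma_n(m) + pen(m),
    with pen(m) = kappa * sigma2_0 * D_m / (n * delta)."""
    pen = kappa * sigma2_0 * np.asarray(dims, dtype=float) / (n * delta)
    return int(np.argmin(np.asarray(gammas) + pen))

# toy values: the contrast decreases with the dimension, the penalty grows,
# and the criterion settles on an intermediate model
gammas = [1.0, 0.5, 0.45, 0.44]
dims = [2, 4, 8, 16]
m_hat = select_model(gammas, dims, n=1000, delta=0.1, sigma2_0=1.0)
print(m_hat)   # 1
```

With κ = 0 the criterion degenerates to pure contrast minimization and always picks the largest space, which is exactly the overfitting the penalty is there to prevent.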
Theorem 2 Under Assumptions 1–6, the risk of the adaptive estimator satisfies, for i = 1, ..., d,

E(‖b̂_i − b_{i,K}‖²_n) ≤ C inf_{m∈M_n} (‖b_{i,K} − b_{m,i}‖²_{L²} + pen(m)) + C′Δ + C″/(nΔ),

where C, C′, C″ are constants. This estimator is really adaptive if nΔ/ln²(n) → ∞ (in that case, N_n → ∞). If Δ → 0 and nΔ → ∞, our estimator is convergent.
The adaptive estimator automatically realizes the bias–variance compromise: whenever b_{i,K} belongs to some Besov ball, if r ≥ α, b̂_i achieves the corresponding optimal nonparametric rate.
5. Examples and simulation
5.1. Algorithms
In this section, we set X_k = X_{kΔ} and Y_k = Y^i_{kΔ} for the ith component of the vector Y_{kΔ}. Any function g ∈ 𝐒_m can be written g(x) = ∑_{k=0}^{D_m−1} α_{m,k} φ_{m,k}(x) and is characterized by the vector α = (α_{m,k})_{k=0,...,D_m−1}. To compute the estimator b̂_{m,i}, we minimize with respect to α the expression

∑_{i=1}^n (Y_i − ∑_{k=0}^{D_m−1} α_{m,k} φ_{m,k}(X_i))².

We have to solve, for l = 0, ..., D_m − 1:

∑_{j=1}^n ∑_{k=0}^{D_m−1} α_{m,k} [φ_{m,k}(X_j) φ_{m,l}(X_j)] = ∑_{j=1}^n Y_j [φ_{m,l}(X_j)].

Let us set P = (φ_{m,l}(X_j))_{l=0,...,D_m−1, j=1,...,n} and Y = (Y_1, ..., Y_n)*, and solve the equation PP*α = PY.

We simulate a process (X_t) by an Euler discretization scheme with sampling interval δ and consider X_{kΔ} with Δ = pδ, p an integer, and k = 1, ..., n. The number of observations n varies from
100 to 50,000 and Δ from 0.01 to 0.1. When n ≥ 10,000, we have chosen δ = Δ; otherwise, we choose δ = 0.01 and p = 5, 8, 10. To have enough points in our estimation compact, we keep 95% of the data points, suppressing 5% of extreme values. To estimate the drift on any rectangle K, we have two solutions: either center and renormalize the points X_k in order to have values in [0, 1]², or modify the functions φ_{m,k}. We have modified the points X_k, and given afterwards the drift estimator on the rectangle K. Actually, our algorithm is adaptive with respect to m and r. We let r vary from 1 to R_max and m from 0 to max(M_n(r)). Then, to compute b̂_{m,i}, we solve in α the equation PP*α = PY and compute γ_{n,i}(m, r), with pen(m) = pen(m, r). We minimize γ_{n,i}(m, r) + pen(m, r) with respect to m and r, and return the obtained estimator b̂_i = b̂_{m̂_i,i}.
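As a toy illustration of this pipeline (not the paper's implementation: we take d = 1, a hand-picked compact K = [−2, 2], fixed m and r, and an Ornstein–Uhlenbeck path with drift −0.5x as data), the following sketch builds the design matrix P and solves the normal equations PP*α = PY by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# data: Euler path of a 1D Ornstein-Uhlenbeck process dX = -0.5 X dt + dW
# (d = 1 for brevity; the paper's examples are two-dimensional)
n, delta = 10_000, 0.1
X = np.empty(n + 2)
X[0] = 0.0
for k in range(n + 1):
    X[k + 1] = X[k] - 0.5 * X[k] * delta + np.sqrt(delta) * rng.standard_normal()
Y = (X[1:] - X[:-1]) / delta           # responses Y_k = (X_{(k+1)D} - X_{kD}) / D
X_obs = X[:-1]

# degree-1 B-spline (hat) basis f_{m,k}(u) = 2^{m/2} g_1(2^m u - k) on [0, 1]
def g1(u):
    return np.where((u >= 0) & (u < 1), u,
                    np.where((u >= 1) & (u < 2), 2.0 - u, 0.0))

a, b = -2.0, 2.0                       # estimation compact K (our choice)
U = (X_obs - a) / (b - a)              # rescale K to [0, 1]
keep = (U >= 0) & (U <= 1)
U, Y_K = U[keep], Y[keep]

m, r = 3, 1
ks = np.arange(-r, 2**m)               # non-null indices k = -r, ..., 2^m - 1
P = np.stack([2**(m / 2) * g1(2**m * U - k) for k in ks])   # design matrix (D_m, n)

# least squares <=> the normal equations P P* alpha = P Y
alpha, *_ = np.linalg.lstsq(P.T, Y_K, rcond=None)

def b_hat(x):
    """Estimated drift at a point x of K."""
    u = (x - a) / (b - a)
    return sum(al * 2**(m / 2) * g1(2**m * u - k) for al, k in zip(alpha, ks))
```

Here D_m = 2^m + r = 9, and on K the fitted b_hat should be close to the true drift −0.5x; np.linalg.lstsq solves exactly the least-squares problem whose normal equations are PP*α = PY.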
5.2. Examples
5.2.1. Constant diffusion matrix
We consider the SDE

dX_t = −A∇V(X_t) dt + Σ dW_t,  X_0 = η,  (7)

with A = ΣΣ* a constant matrix. According to Proposition 1, if the function ρ(x) = exp(−2V(x)) is integrable, the process (X_t)_{t≥0} solution of Equation (7) is reversible, with stationary density proportional to ρ. The following examples are proposed in Fearnhead et al. [15].
Model 1: Ornstein–Uhlenbeck process
We consider the stochastic process with parameters

b(x, y) = [ −0.2  0.2 ; 0.1  −0.2 ] [ x ; y ],  Σ = [ 1  1 ; 0  1 ].

Its invariant density π(x) is proportional to exp(−(3/10)x² − (3/5)y² + (4/5)xy), i.e.

π ∼ N( [ 0 ; 0 ], (5/2) [ 6  4 ; 4  3 ] ).

We simulate a path with n = 10,000 points and Δ = 0.1. Figure 1 shows the estimation of the first drift component b₁ for the additive model. Figure 2 represents sections for the same model.
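Sample paths for Model 1 can be produced with the Euler scheme of Section 5.1; the sketch below is our illustration (variable names, path length and seed are arbitrary choices) and compares the empirical covariance of the simulated path with the stationary law N(0, (5/2)[[6, 4], [4, 3]]) stated above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Model 1: dX_t = B X_t dt + Sigma dW_t with the parameters of the text
B = np.array([[-0.2, 0.2],
              [0.1, -0.2]])
Sigma = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

def euler_path(n, delta, x0=(0.0, 0.0)):
    """Euler scheme: X_{k+1} = X_k + B X_k * delta + Sigma * sqrt(delta) * xi_k."""
    X = np.empty((n + 1, 2))
    X[0] = x0
    sq = np.sqrt(delta)
    for k in range(n):
        X[k + 1] = X[k] + (B @ X[k]) * delta + sq * (Sigma @ rng.standard_normal(2))
    return X

X = euler_path(20_000, 0.1)
# stationary law: N(0, (5/2) [[6, 4], [4, 3]]), i.e. Var(X^1) = 15, Var(X^2) = 7.5
C = np.cov(X[5_000:].T)                # empirical covariance after burn-in
```

The mixing of this process is slow (the smallest eigenvalue of −B is about 0.06), so the empirical covariance only matches the stationary one up to a sizeable Monte Carlo error.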
Model 2: Fixman potential
Let us consider

b(x, y) = [ −x/2 − 2 cos(2x) − 6 cos(3x) + y/2 ; x/2 − y/2 ],  Σ = Id.

The invariant density is

π(x) ∝ exp(−(1/2)(y − x)² − 2 sin(2x) − 4 sin(3x)).

Figure 3 shows the (non-normalized) graph of this stationary density. We simulate a path with n = 10,000 points and Δ = 0.08. Figures 4 and 5 represent the estimation of the first drift component b₁, with the additive model, on the compact set [−π, π]². Risks are computed on the same rectangle.
Figure 1. Flows comparison for Model 1. Estimated function: b₁(x, y) = −0.2x + 0.2y. Light area: estimated drift at observed points. Dark area: estimated drift on the whole rectangle.
Figure 2. Sections for Model 1 (panels x = 0.9 and y = 0.7). Estimated function: b₁. Solid line: true drift. Dots: estimated drift.
5.2.2. Non-constant diffusion matrix

Model 3: Multivariate Student invariant distribution
The following example comes from Jacobsen and Sorensen [16] and Jacobsen [17]. We consider a stochastic process such that

dX_t = −BB*X_t dt + B √(v(X_t)) dW_t,  X_0 = η,
Figure 3. Stationary density of Model 2.
Figure 4. Flows comparison for Model 2. Estimated function: b₁(x, y) = −x/2 − 2 cos(2x) − 6 cos(3x) + y/2. Light area: estimated drift at observed points. Dark area: estimated drift on the whole rectangle.
where B is a constant, symmetric and positive-definite matrix, and v(x) = 2(ν + d − 2)^{−1}(ν + ‖x‖²). The associated invariant density is a multivariate Student law with parameter ν and density

π_ν(x) ∝ (ν + ‖x‖²)^{−(ν+d)/2}.

The multivariate Student law with dimension d and ν degrees of freedom is the distribution of X/√(Y/ν), where X has law N(0, I_d), Y has law χ²(ν), and X and Y are independent. This model satisfies the equation b = −A∇V with diffusion matrix A(x) = BB*v(x) and V(x) =
Figure 5. Sections for Model 2. Estimated function: b₁. Solid line: true drift. Dots: estimated drift.
Figure 6. Flows comparison for Model 3. Estimated function: b₁(x, y) = −x − 0.9y. Light area: estimated drift at simulated points. Dark area: estimated drift on the whole rectangle.
((ν + d)/4) ln(ν + ‖x‖²). We choose

ν = 10, BB* = [ 1  0.9 ; 0.9  1 ]

and estimate the first component b₁(x, y) = −x − 0.9y. Figures 6 and 7 correspond to estimation with reduced spline functions, with n = 10,000 observations and sampling interval Δ = 0.1. Risks are computed on the rectangle [−1.6, 2.4] × [−2.4, 1.6]. In this example, A(x) is not bounded. We set σ₀² = 8 for the penalty, which is larger than the maximal value of Tr(A(x)) over the estimation domain.
Figure 7. Sections for Model 3 (panels x = −0.2 and y = −0.2). Estimated function: b₁. Solid line: true drift. Dots: estimated drift.
5.3. Results and comments
The values selected by the algorithm are denoted m̂ and r̂. We compute the error measured by the empirical norm:

error = ‖b̂_i − b_{i,K}‖²_n.

In order to check that the algorithm is adaptive, we also compute

e_min = min_m {‖b̂_{m,i} − b_{i,K}‖²_n}.
In Table 1, we choose a fixed compact K = K₁ × K₂, specified for each example, and compute the mean of m̂ and r̂. We also compute 'ris', the mean of error over 50 estimations, and an oracle 'or', the mean of error/e_min over 50 estimations. We used spline functions (additive sp, tensorized sp) and the piecewise polynomial bases (tensorized poly) described in Comte et al. [3] (for the latter bases, we use κ = 10).
When the drift is linear (Models 1 and 3), our risks are nearly proportional to 1/(nΔ). Moreover, the estimated functions are always linear. Approximating a trigonometric function (see Model 2) is a little more difficult. Moreover, the smaller nΔ, the smaller r̂ and m̂. Unexpectedly, risks decrease when Δ gets smaller. In fact, the smaller the discretization step, the smaller the compact K, and the fewer observed oscillations of the drift function. The best estimations are obtained with additive models with spline functions.
In general, we cannot compare oracle values for our three different estimators, because these estimators are not chosen over the same function spaces. We can observe that the oracle for tensorized bases is pretty good, better than for the additive model. Nevertheless, risks are in general smaller for the additive model with spline functions. Other simulation results are available in Schmisser [18].
Table 1. Empirical risks.

Risks for Model 1: b₁(x, y) = −2x + y

                Additive sp              Tensorized sp            Tensorized poly
n      Δ      ris     or    m̂   r̂     ris    or    m̂   r̂     ris    or    m̂   r̂
10⁴    0.1    0.0009  1.05  0    1     0.001  1.00  0    1     0.001  1.00  0    1
10⁴    0.01   0.01    1.07  0    1     0.01   1.00  0    1     0.01   1.00  0    1
10³    0.1    0.01    1.00  0    1     0.01   1.00  0    1     0.01   1.00  0    1
10³    0.01   0.1     1.7   0    1     0.1    1.00  0    1     0.1    1.11  0    1
10²    0.1    0.07    1.00  0    1     0.09   1.00  0    1     0.09   1.02  0    1

Risks for Model 2: b₁(x, y) = −x/2 − 2 cos(2x) − 6 cos(3x) + y/2
10⁴    0.08   1.1     1.07  1.5  4.3   6.2    1.00  0.8  2.1   2      1.01  1.7  1.3
10⁴    0.05   0.2     1.01  0.2  3.3   1.2    1.01  0.8  1.8   0.6    1.01  1    1.3
10⁴    0.01   0.2     2.2   0    1.9   0.3    1.02  0    1.1   0.2    1.04  0    1
10³    0.08   1.8     1.02  0.6  2.9   3.5    1.00  0.4  1.6   2.1    1.06  1    1.2
10³    0.05   1.3     1.02  0    1.2   1.4    1.04  0    1.1   1.4    1.10  0    1
10²    0.01   0.6     1.13  0    1.1   0.7    1.00  0    1     0.7    1.02  0    1

Risks for Model 3: b₁(x, y) = −x − 0.9y
10⁴    0.1    0.01    1.00  0    1     0.01   1.00  0    1     0.01   1.00  0    1
10⁴    0.01   0.1     1.00  0    1     0.1    1.00  0    1     0.1    1.00  0    1
10³    0.1    0.1     1.00  0    1     0.2    1.00  0    1     0.1    1.00  0    1
10³    0.01   1.3     1.00  0    1     1.6    1.00  0    1     1.6    1.69  0    1
10²    0.1    1.3     1.02  0    1     1.8    1.00  0    1     1.9    1.87  0    1
6. Proofs
6.1. Proof of Proposition 1
We have to prove that π is an invariant density. The infinitesimal generator L associated with the SDE (1) can be written, for any function f ∈ C²_c(R^d, R) := C²_c,

Lf = ∑_{i=1}^d b_i ∂_i f + (1/2) ∑_{i,j=1}^d a_{ij} ∂_{ij} f,

where ∂_i and ∂_{ij} denote partial derivative operators. The general form of the adjoint L* of L with respect to L²(R^d) := L² is, for any function g ∈ C²,

L*g = −∑_{i=1}^d ∂_i(b_i g) + (1/2) ∑_{i,j=1}^d ∂_{ij}(a_{ij} g).
Lemma 1 If h ∈ C² is a stationary density, then L*h = 0.
Proof Assume that the process (X_t)_{t≥0} is stationary, with marginal density h. According to the Itô formula, for any function f ∈ C²_c:

f(X_t) = f(X_0) + ∫₀ᵗ Lf(X_s) ds + ∑_{i=1}^d ∫₀ᵗ ∂_i f(X_s) ∑_{j=1}^d σ_{ij}(X_s) dW^j_s.

Taking expectations, for any f ∈ C²_c, we obtain

E(f(X_t)) = E(f(X_0)) + ∫₀ᵗ E(Lf(X_s)) ds
and, thanks to stationarity,

0 = ∫₀ᵗ E(Lf(X_s)) ds = t ∫ Lf(x) h(x) dx = t 〈Lf, h〉_{L²}.

As 〈Lf, h〉_{L²} = 〈f, L*h〉_{L²}, we obtain the expected result. ∎
Let us make explicit the particular form of L* and the solutions of the equation L*h = 0 under Assumption 4.

Lemma 2 Set b̃_i = b_i − (1/2) ∑_{j=1}^d ∂_j a_{ij}. Then

L*g = −∑_{i=1}^d ∂_i(b̃_i g) + (1/2) ∑_{i=1}^d ∂_i( ∑_{j=1}^d a_{ij} ∂_j g ) = −div(l(g)),

where l(g) = g b̃ − (1/2) A∇g. Moreover, under Assumption 4, b̃ = −A∇V and the function

h(x) ∝ exp(−2V(x))

is a solution of the equation L*h = 0.
Proof We have, for any function f ∈ C²_c:

Lf = (1/2) ∑_{i,j=1}^d ∂_i(a_{ij} ∂_j f) + ∑_{i=1}^d b̃_i ∂_i f = (1/2) div(A∇f) + 〈b̃, ∇f〉.  (8)
Integrating by parts, for any function g ∈ C²:

〈g, Lf〉_{L²} = −∑_{i=1}^d ∫_{R^d} f ∂_i(b̃_i g) − (1/2) ∑_{i,j=1}^d ∫_{R^d} a_{ij} ∂_j f ∂_i g

= −∑_{i=1}^d ∫_{R^d} f ∂_i(b̃_i g) + (1/2) ∑_{i,j=1}^d ∫_{R^d} f ∂_j(a_{ij} ∂_i g) = 〈f, L*g〉_{L²}.

As A is symmetric, we obtain the predicted formula. Let us now solve l(h) = 0 under Assumption 4. As b̃ = −A∇V and A is invertible, we have to solve ∇h = −2h∇V. We find

h(x) ∝ exp(−2V(x)). ∎
Lemma 3 L is self-adjoint with respect to L²_π := L²(R^d, π(x)dx). Under Assumptions 1–4, π is the only invariant density associated with the SDE (1) (see (2)).

Proof The adjoint of L with respect to L²_π is denoted L*_π. We have:

L*_π(g) = (1/π) L*(πg).

It is known that ∇π/(2π) = −∇V = A^{−1}b̃. For any function g ∈ C²_c,

L*_π g = −(1/π) div( πg b̃ − (1/2) A∇(πg) ).
On one hand,
1
πdiv(πgb) = div(gb) +
⟨∇π
π, gb⟩
= gdiv(b) + 〈∇g, b〉 + 2g〈A−1b, b〉.
On the other hand,

(1/(2π)) div(A∇(πg)) = (1/(2π)) div(πA∇g + gA∇π)
 = (1/2) div(A∇g) + (1/2)⟨∇π/π, A∇g⟩ + (g/(2π)) div(A∇π) + (1/2)⟨∇g, A∇π/π⟩.
As A is symmetric,

⟨∇π/(2π), A∇g⟩ = ⟨A^{-1}b, A∇g⟩ = ⟨b, ∇g⟩,

and the last term is written

⟨∇g, A∇π/(2π)⟩ = ⟨b, ∇g⟩.
Furthermore:

(1/(2π)) div(A∇π) = (1/(2π)) div(2πb) = div(b) + ⟨∇π/π, b⟩ = div(b) + 2⟨A^{-1}b, b⟩.
Collecting terms, we obtain:

(1/(2π)) div(A∇(πg)) = (1/2) div(A∇g) + 2⟨b, ∇g⟩ + g div(b) + 2g⟨A^{-1}b, b⟩,

and, using (8),

L*_π g = ⟨b, ∇g⟩ + (1/2) div(A∇g) = Lg. □
Kent [19] demonstrates the following result:

Lemma 4 For a function h > 0 on R^d, L*h = 0 if and only if the transition density of (X_t), p(t, x, y), is h-symmetric, i.e.

∀t, x, y,  p(t, x, y)/h(y) = p(t, y, x)/h(x).

If h is integrable, the symmetry relation implies that h (normalized) is a stationary density. As L is self-adjoint with respect to π(x) dx, π is the only stationary density.
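The h-symmetry of Lemma 4 can be illustrated numerically on the one-dimensional Ornstein–Uhlenbeck process dX_t = −θX_t dt + σ dW_t, whose transition density is Gaussian with mean x e^{−θt} and variance σ²(1 − e^{−2θt})/(2θ). This is only a sanity-check sketch; the parameter values below are arbitrary illustrative choices, not taken from the paper.

```python
import math

theta, sigma = 1.3, 0.7  # illustrative OU parameters (assumptions)

def gauss(z, mean, var):
    return math.exp(-(z - mean)**2 / (2*var)) / math.sqrt(2*math.pi*var)

def p(t, x, y):
    # OU transition density p(t, x, y)
    mean = x * math.exp(-theta*t)
    var = sigma**2 * (1 - math.exp(-2*theta*t)) / (2*theta)
    return gauss(y, mean, var)

def h(z):
    # stationary density N(0, sigma^2 / (2 theta))
    return gauss(z, 0.0, sigma**2 / (2*theta))

t, x, y = 0.8, -0.4, 1.1
print(abs(p(t, x, y)/h(y) - p(t, y, x)/h(x)) < 1e-9)  # → True
```

Equivalently, h(x)p(t, x, y) = h(y)p(t, y, x): the OU process is time-reversible, the one-dimensional case of Kent's result.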
6.2. Proof of Proposition 2
Any function t ∈ S_m can be written t(x) = Σ_{k=−r}^{2^m−1} α_{m,k} f_{m,k}(x). Consequently, we have

‖t‖²_{L²} = 2^m ∫₀¹ (Σ_{k=−r}^{2^m−1} α_{m,k} g_r(2^m x − k))² dx = ∫₀^{2^m} u²(y) dy,
where

u(y) = Σ_{k=−r}^{2^m−1} α_{m,k} g_r(y − k) 1_{[0,2^m]}(y).

Let us notice that

‖t‖²_∞ = 2^m ‖u‖²_∞ and ‖t‖²_{L²} = ‖u‖²_{L²}. (9)

As d_m = r + 2^m, we only have to prove that there exists a positive constant c₀ such that ‖u‖_∞ ≤ c₀‖u‖_{L²}. The maximum of u is attained at a point x ∈ I = [j₀, j₀ + 1]. We have that

‖u‖_∞ = ‖u1_I‖_∞ and ‖u‖_{L²} ≥ ‖u1_I‖_{L²}.

Assume I = [0, 1]. We can write

u(x)1_{[0,1]}(x) = Σ_{k=−r}^{0} α_{m,k} g_r(x − k) 1_{[0,1]}(x).

Then u1_{[0,1]} ∈ S₀, a vector subspace of finite dimension d₀ = r + 1. In this subspace, all norms are equivalent; as a consequence, we obtain that

∃c₀ > 0, ‖u1_{[0,1]}‖_∞ ≤ c₀‖u1_{[0,1]}‖_{L²},

which, with (9), ends the proof.
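The resulting bound ‖t‖_∞ ≤ c₀ 2^{m/2} ‖t‖_{L²} can be checked numerically in the simplest case r = 1, where g_r is the hat function (a minimal sketch with illustrative parameters, not the paper's spline machinery); for hat functions one can take c₀ = √3.

```python
import numpy as np

rng = np.random.default_rng(0)

def hat(y):
    # degree-1 B-spline (hat function) supported on [0, 2]; plays the role of g_r for r = 1
    return np.clip(1 - np.abs(y - 1), 0, None)

m = 5                        # resolution level (illustrative)
ks = np.arange(-1, 2**m)     # translations k = -r, ..., 2**m - 1 with r = 1
xs = np.linspace(0, 1, 20001)
dx = xs[1] - xs[0]

ratios = []
for _ in range(50):
    alpha = rng.standard_normal(ks.size)
    # t(x) = sum_k alpha_k 2^{m/2} g_r(2^m x - k)
    t = sum(a * 2**(m/2) * hat(2**m * xs - k) for a, k in zip(alpha, ks))
    sup = np.abs(t).max()
    l2 = np.sqrt((t**2).sum() * dx)       # Riemann approximation of the L2 norm
    ratios.append(sup / l2)

# for hat functions, ||u||_inf <= sqrt(3) ||u||_L2, hence ||t||_inf <= sqrt(3) 2^{m/2} ||t||_L2
print(max(ratios) <= np.sqrt(3) * 2**(m/2))  # → True
```

The 2^{m/2} factor is exactly the √(d_m) rate: the sup-norm/L²-norm ratio grows with the dimension of the approximation space.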
6.3. Proof of Theorem 1
The following proposition is used several times in the proofs. It is proved, e.g., in Gloter [20] for a one-dimensional diffusion; the extension to the multidimensional case is straightforward.

Proposition 5 Under Assumptions 1–5, ∀k ≥ 1, ∃c(k) ∈ R, ∀h, 0 < h ≤ 1, ∀t ≥ 0:

E(sup_{s∈[t,t+h]} |b(X_s) − b(X_t)|^k) ≤ c(k) h^{k/2}.
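The h^{k/2} rate of Proposition 5 can be illustrated by simulation (a rough Monte Carlo sketch with an Euler scheme and illustrative parameters, not from the paper): for the Ornstein–Uhlenbeck drift b(x) = −x, the moment E sup_{s∈[t,t+h]} |b(X_s) − b(X_t)|² should scale like h, i.e. like h^{k/2} with k = 2, so multiplying h by 4 should roughly multiply the moment by 4.

```python
import numpy as np

rng = np.random.default_rng(1)

def moment(h, k=2, n_paths=20000, n_steps=64):
    # Euler scheme for dX = -X dt + dW (so b(x) = -x), stationary start N(0, 1/2)
    dt = h / n_steps
    x0 = rng.normal(0.0, np.sqrt(0.5), n_paths)
    x = x0.copy()
    sup = np.zeros(n_paths)
    for _ in range(n_steps):
        x = x - x*dt + np.sqrt(dt) * rng.standard_normal(n_paths)
        sup = np.maximum(sup, np.abs(x - x0))   # |b(X_s) - b(X_t)| = |X_s - X_t|
    return np.mean(sup**k)

r = moment(0.04) / moment(0.01)
print(2.0 < r < 8.0)   # → True: the ratio is close to 4, consistent with h^{k/2}, k = 2
```

The wide acceptance interval absorbs the Monte Carlo and discretization error; the point is only the order of magnitude of the scaling.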
All these proofs are adapted from Comte et al. [3]. Introduce the norm

‖t‖²_π = ∫ t²(x) π(x) dx

and the set

Ω_n = {ω : ∀(m, m′) ∈ M²_n, ∀t ∈ ⋃_{m,m′} (S_m + S_{m′}) \ {0}, |‖t‖²_n/‖t‖²_π − 1| ≤ 1/2},

on which the norms ‖·‖_n and ‖·‖_π are equivalent: on Ω_n, we have

‖t‖²_π ≤ 2‖t‖²_n ≤ 3‖t‖²_π. (10)
Proposition 6

E(‖b̂_{m,i} − b_{i,K}‖²_n 1_{Ω_n}) ≤ 7π₁ E(‖b_{m,i} − b_{i,K}‖²_{L²}) + 32σ₀² D_m/(nΔ) + 32cΔ.
Proof We have:

‖b̂_{m,i} − b_{i,K}‖²_n ≤ ‖b_{m,i} − b_{i,K}‖²_n + 2ν_{n,i}(b̂_{m,i} − b_{m,i}) + (2/n) Σ_{k=1}^n (b̂_{m,i} − b_{m,i})(X_{kΔ}) I_{kΔ}(b_i).
On one hand,

2ν_{n,i}(b̂_{m,i} − b_{m,i}) ≤ 2‖b̂_{m,i} − b_{m,i}‖_π sup_{t∈S_m, ‖t‖_π=1} |ν_{n,i}(t)|
 ≤ (1/8)‖b̂_{m,i} − b_{m,i}‖²_π + 8 sup_{t∈S_m, ‖t‖_π=1} |ν_{n,i}(t)|².
On the other hand, according to the Cauchy–Schwarz inequality,

(2/n) Σ_{k=1}^n (b̂_{m,i} − b_{m,i})(X_{kΔ}) I_{kΔ}(b_i) ≤ 2‖b̂_{m,i} − b_{m,i}‖_n ((1/n) Σ_{k=1}^n I_{kΔ}(b_i)²)^{1/2}
 ≤ (1/8)‖b̂_{m,i} − b_{m,i}‖²_n + (8/n) Σ_{k=1}^n I_{kΔ}(b_i)².
Introducing (10), we have that ‖b̂_{m,i} − b_{m,i}‖²_π ≤ 2‖b̂_{m,i} − b_{m,i}‖²_n and ‖b̂_{m,i} − b_{m,i}‖²_n ≤ 2‖b̂_{m,i} − b_{i,K}‖²_n + 2‖b_{i,K} − b_{m,i}‖²_n. Collecting terms, we obtain that

(1/4)‖b̂_{m,i} − b_{i,K}‖²_n ≤ (7/4)‖b_{i,K} − b_{m,i}‖²_n + 8 sup_{t∈S_m, ‖t‖_π=1} |ν_{n,i}(t)|² + (8/n) Σ_{k=1}^n I_{kΔ}(b_i)².
Hence, we have

‖b̂_{m,i} − b_{i,K}‖²_n 1_{Ω_n} ≤ 7‖b_{m,i} − b_{i,K}‖²_n + 32 sup_{t∈S_m, ‖t‖_π=1} |ν_{n,i}(t)|² + (32/n) Σ_{k=1}^n I_{kΔ}(b_i)².
Thanks to Proposition 5, the last term is easily bounded:

E[I²_{kΔ}(b_i)] ≤ (1/Δ) ∫_{kΔ}^{(k+1)Δ} E[(b_i(X_s) − b_i(X_{kΔ}))²] ds ≤ cΔ.
It remains to bound

E(sup_{t∈S_m, ‖t‖_π=1} ν²_{n,i}(t)).

The vector subspace S_m has an orthonormal basis with respect to L²_π, denoted {ϕ_λ, λ ∈ Λ_m}, with card(Λ_m) = D_m. Every function t ∈ S_m can be written as t = Σ_{λ∈Λ_m} a_λ ϕ_λ, and its norm is
obtained by the formula ‖t‖²_π = Σ_{λ∈Λ_m} a²_λ. So,

E(sup_{t∈S_m, ‖t‖_π=1} ν²_{n,i}(t)) ≤ Σ_{λ∈Λ_m} E[ν²_{n,i}(ϕ_λ)].

As the process is stationary,

E[ϕ²_λ(X_{kΔ})] = ∫ ϕ²_λ(x) π(x) dx = 1,
and we obtain

E[ν²_{n,i}(ϕ_λ)] = (1/(n²Δ²)) Σ_{k=1}^n E[ϕ²_λ(X_{kΔ}) ∫_{kΔ}^{(k+1)Δ} a_{ii}(X_s) ds] ≤ σ₀²/(nΔ),

where σ₀² is defined in Assumption 3. □
Proposition 7

E(‖b̂_{m,i} − b_{i,K}‖²_n 1_{Ω^c_n}) ≤ c/(nΔ).

Proof It is demonstrated in Comte et al. [3, Lemma 1, p. 533] (see Appendix) that

P(Ω^c_n) ≤ c/n². (11)
Let us set ε_{kΔ} = I_{kΔ}(b_i) + Z^i_{kΔ}, ε = (ε_Δ, …, ε_{nΔ})*, and Π_m Y = Π_m(Y^i_Δ, …, Y^i_{nΔ})* = (b̂_{m,i}(X_Δ), …, b̂_{m,i}(X_{nΔ}))*. We obtain

‖b_{i,K} − b̂_{m,i}‖²_n = ‖b_{i,K} − Π_m b_i‖²_n + ‖Π_m b_i − Π_m Y^i‖²_n
 = ‖b_{i,K} − Π_m b_i‖²_n + ‖Π_m ε‖²_n
 ≤ ‖b_{i,K}‖²_n + ‖ε‖²_n.
Using the Cauchy–Schwarz inequality, strict stationarity and (11), we have

E(‖b_{i,K}‖²_n 1_{Ω^c_n}) ≤ (E(b⁴_{i,K}(X_0)) P(Ω^c_n))^{1/2} ≤ c/n.

Besides,

E(‖ε‖²_n 1_{Ω^c_n}) ≤ (E[ε⁴_Δ] P(Ω^c_n))^{1/2}.
Let us compute E[ε⁴_Δ]. According to the Burkholder inequality, we know that:

E(ε⁴_Δ) ≤ C E[(I_{kΔ}(b_i))⁴] + (C/Δ⁴) E[(∫₀^Δ a_{ii}(X_s) ds)²]
 ≤ C E[sup_{0≤s≤Δ} (b_i(X_s) − b_i(X_0))⁴] + (C/Δ²) E[a²_{ii}(X_0)].

By stationarity,

E(ε⁴_Δ) ≤ CcΔ² + (C/Δ²) E[a²_{ii}(X_0)] ≤ C′/Δ².

Collecting terms, we have

E(‖b̂_{m,i} − b_{i,K}‖²_n 1_{Ω^c_n}) ≤ c/(nΔ). □
As the process is stationary, for any function t with support in K,

E(‖t‖²_n) = ‖t‖²_π ≤ π₁‖t‖²_{L²}.
Propositions 6 and 7 allow us to conclude the proof of Theorem 1.
6.4. Proof of Theorem 2
We have

‖b̂_i − b_{i,K}‖²_n = ‖b̂_i − b_{i,K}‖²_n 1_{Ω_n} + ‖b̂_i − b_{i,K}‖²_n 1_{Ω^c_n}.

We obtain, thanks to a proof similar to the previous one, that

E(‖b̂_i − b_{i,K}‖²_n 1_{Ω^c_n}) ≤ c/(nΔ).
By definition of the estimator b̂_i, for any m ∈ M_n, we have

γ_{n,i}(b̂_i) − γ_{n,i}(b_i) + pen(m̂_i) ≤ γ_{n,i}(b_{m,i}) − γ_{n,i}(b_i) + pen(m).

As previously, we obtain, for any m ∈ M_n,

‖b̂_i − b_{i,K}‖²_n ≤ ‖b_{m,i} − b_{i,K}‖²_n + 2ν_{n,i}(b̂_i − b_{m,i}) + (2/n) Σ_{k=1}^n (b̂_i − b_{m,i})(X_{kΔ}) I_{kΔ}(b_i) + pen(m) − pen(m̂_i).
We easily obtain, for any m ∈ M_n, that

‖b̂_i − b_{i,K}‖²_n 1_{Ω_n} ≤ 7‖b_{m,i} − b_{i,K}‖²_n + 4(pen(m) − pen(m̂_i))1_{Ω_n} + 32 sup_{t∈S_m+S_{m̂_i}, ‖t‖_π=1} |ν_{n,i}(t)|² 1_{Ω_n} + (32/n) Σ_{k=1}^n I_{kΔ}(b_i)².
We know that

E(I²_{kΔ}(b_{i,K})) ≤ cΔ and E(‖b_{m,i} − b_{i,K}‖²_n) ≤ π₁‖b_{m,i} − b_{i,K}‖²_{L²}.

Let us set

G_m(m′) = sup_{t∈S_m+S_{m′}, ‖t‖_π=1} |ν_{n,i}(t)|

and introduce a penalty function p(m, m′) such that

p(m, m′) = κ₁σ₀²(D_m + D_{m′})/(nΔ).
The following proposition is based on a result of Baraud et al. [21] and an inequality of the Bernstein type. We give a sketch of the proof in the Appendix.

Proposition 8 There exists a positive numerical constant κ₁ such that

E[(G²_m(m′) − p(m, m′))1_{Ω_n}]₊ ≤ cσ₀² e^{−D_{m′}}/(nΔ).
Applying this proposition, we choose the penalty function

pen(m) ≥ κσ₀² D_m/(nΔ)

with κ = 8κ₁. We have

E := E[(8 sup_{t∈S_m+S_{m̂_i}, ‖t‖_π=1} ν²_{n,i}(t) + (pen(m) − pen(m̂_i))) 1_{Ω_n}]
 ≤ E[8(G²_m(m̂_i) − p(m, m̂_i))1_{Ω_n} + (pen(m) − pen(m̂_i) + 8p(m, m̂_i))1_{Ω_n}].
As

[(G²_m(m̂_i) − p(m, m̂_i))1_{Ω_n}]₊ ≤ Σ_{m′∈M_n} [(G²_m(m′) − p(m, m′))1_{Ω_n}]₊,

we can write

E ≤ 8 Σ_{m′∈M_n} E[(G²_m(m′) − p(m, m′))1_{Ω_n}]₊ + 2pen(m).
Applying Proposition 8 and using the fact that Σ_m e^{−D_m} < +∞, we obtain

E ≤ 8 Σ_{m′∈M_n} cσ₀² e^{−D_{m′}}/(nΔ) + 2pen(m) ≤ c′σ₀²/(nΔ) + 2pen(m).
Then, collecting terms, we have

E(‖b̂_i − b_{i,K}‖²_n) ≤ inf_{m∈M_n} (7π₁‖b_{m,i} − b_{i,K}‖²_{L²} + 8pen(m)) + C/(nΔ) + C′Δ
and the proof of Theorem 2 is complete.
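The selection procedure behind Theorem 2 can be sketched in dimension one (a toy implementation with piecewise-constant spaces on K = [−1, 1]; the paper uses spline spaces, and κ, n, Δ, σ below are illustrative choices): simulate a diffusion, regress Y_{kΔ} = (X_{(k+1)Δ} − X_{kΔ})/Δ on X_{kΔ}, and pick m̂ minimizing the contrast plus the penalty pen(m) = κσ₀²D_m/(nΔ).

```python
import numpy as np

rng = np.random.default_rng(2)

n, delta, sigma = 50000, 0.01, 1.0   # illustrative sample size, step, diffusion coefficient
noise = rng.standard_normal(n)
x = np.empty(n + 1); x[0] = 0.0
for k in range(n):                   # Euler scheme for dX = -X dt + sigma dW, so b(x) = -x
    x[k+1] = x[k] - x[k]*delta + sigma*np.sqrt(delta)*noise[k]

xk, yk = x[:-1], (x[1:] - x[:-1]) / delta   # regression Y_k ~ b(X_k) + noise
inside = np.abs(xk) < 1                     # restrict to the compact K = [-1, 1]

def gamma_n(m):
    """Least-squares contrast of the projection estimator on D_m = 2**m bins of K."""
    bins = np.floor((xk + 1) / 2 * 2**m).astype(int)
    pred = np.zeros_like(yk)
    for j in range(2**m):
        sel = inside & (bins == j)
        if sel.any():
            pred[sel] = yk[sel].mean()      # least-squares fit = bin mean
    return np.mean((yk - pred)[inside]**2)

kappa = 8.0                                  # illustrative penalty constant
pen = lambda m: kappa * sigma**2 * 2**m / (n * delta)
m_hat = min(range(8), key=lambda m: gamma_n(m) + pen(m))
print(m_hat)   # a moderate level: large spaces are penalized away, tiny ones fit poorly
```

The bias term ‖b_{m,i} − b_{i,K}‖²_{L²} shrinks with m while pen(m) grows, so the selected m̂ balances the two, as in the oracle bound above.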
Acknowledgements
The author thanks F. Comte and V. Genon-Catalot for helpful discussions.
References
[1] Y.A. Kutoyants, Statistical Inference for Ergodic Diffusion Processes, Springer Series in Statistics, Springer, London, 2004.
[2] M. Hoffmann, Adaptive estimation in diffusion processes, Stochastic Process. Appl. 79(1) (1999), pp. 135–163.
[3] F. Comte, V. Genon-Catalot, and Y. Rozenholc, Penalized nonparametric mean square estimation of the coefficients of diffusion processes, Bernoulli 13(2) (2007), pp. 514–543.
[4] A. Dalalyan and M. Reiß, Asymptotic statistical equivalence for ergodic diffusions: the multidimensional case, Probab. Theory Related Fields 137(1–2) (2007), pp. 25–47.
[5] J. Wang and L. Yang, Efficient and fast spline-backfitted kernel smoothing of additive models, Ann. Inst. Statist. Math. 61(3) (2009), pp. 663–690.
[6] I. Karatzas and S.E. Shreve, Brownian Motion and Stochastic Calculus, Graduate Texts in Mathematics, Vol. 113, Springer, New York, 1988.
[7] E. Pardoux and A.Y. Veretennikov, On the Poisson equation and diffusion approximation. I, Ann. Probab. 29(3) (2001), pp. 1061–1085.
[8] A. Veretennikov, Bounds for the mixing rate in the theory of stochastic equations, Theory Probab. Appl. 32(2) (1987), pp. 273–281.
[9] R.N. Bhattacharya, Criteria for recurrence and existence of invariant measures for multidimensional diffusions, Ann. Probab. 6(4) (1978), pp. 541–553.
[10] S.N. Ethier and T.G. Kurtz, Markov Processes: Characterization and Convergence, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, 1986.
[11] Y. Meyer, Ondelettes et opérateurs. I: Ondelettes, Actualités Mathématiques, Hermann, Paris, 1990.
[12] C. Lacour, Adaptive estimation of the transition density of a Markov chain, Ann. Inst. H. Poincaré Probab. Statist. 43(5) (2007), pp. 571–597.
[13] F. Comte and Y. Rozenholc, Adaptive estimation of mean and volatility functions in (auto-)regressive models, Stochastic Process. Appl. 97(1) (2002), pp. 111–145.
[14] F. Comte and Y. Rozenholc, A new algorithm for fixed design regression and denoising, Ann. Inst. Statist. Math. 56(3) (2004), pp. 449–473.
[15] P. Fearnhead, G. Papaspiliopoulos, G.O. Roberts, and A. Stuart, Filtering systems of coupled stochastic differential equations partially observed at high frequency, 2007. Available at http://www2.warwick.ac.uk/fac/sci/statistics/crism/research/2007/paper07-11
[16] M. Jacobsen and M. Sorensen, Multivariate diffusions with linear drift and given marginal distribution, private communication, 2004.
[17] M. Jacobsen, Examples of multivariate diffusions: time-reversibility, a Cox–Ingersoll–Ross type process, preprint 2001-6, Department of Statistics and Operations Research, University of Copenhagen, 2001.
[18] E. Schmisser, Penalized nonparametric drift estimation for a multidimensional diffusion process, preprint 2009-02, MAP5, Université Paris Descartes, 2009.
[19] J. Kent, Time-reversible diffusions, Adv. Appl. Probab. 10(4) (1978), pp. 819–835.
[20] A. Gloter, Discrete sampling of an integrated diffusion process and parameter estimation of the diffusion coefficient, ESAIM Probab. Statist. 4 (2000), pp. 205–227.
[21] Y. Baraud, F. Comte, and G. Viennet, Model selection for (auto-)regression with dependent data, ESAIM Probab. Statist. 5 (2001), pp. 33–49.
[22] Y. Baraud, F. Comte, and G. Viennet, Adaptive estimation in autoregression or β-mixing regression via model selection, Ann. Statist. 29(3) (2001), pp. 839–875.
Appendix 1. Additional proofs
A.1. Sketch of proof of Proposition 8
We follow the steps of Comte et al. [3] and adapt them to dimension d. First, we prove the following lemma, which is a Bernstein-type inequality:
Lemma A1 Under the assumptions of Theorem 1, for any function t with support in K, ∀ε > 0, ∀ζ > 0,

P(Σ_{k=1}^n |t(X_{kΔ})Z^i_{kΔ}| ≥ nε, ‖t‖²_n ≤ ζ²) ≤ 2 exp(−nΔε²/(2σ₀²ζ²)).

Hence, for all x > 0,

P_n(|ν_{n,i}(t)| ≥ ζ(2σ₀²x/Δ)^{1/2}, ‖t‖²_n ≤ ζ²) ≤ 2 exp(−nx),

where P_n(·) := P(· ∩ Ω_n).
Proof Let us consider a martingale (M_s) such that (exp(λM_s − (λ²/2)⟨M⟩_s)) is also a martingale. We have that

E(exp(λM_s − λ²⟨M⟩_s/2)) = 1.

According to the Chebyshev inequality, for all λ > 0,

P[(M_s ≥ c), (⟨M⟩_s ≤ c′)] ≤ P[exp(λM_s − λ²⟨M⟩_s/2) ≥ exp(λc − λ²c′/2)]
 ≤ exp(−λc + λ²c′/2) E(exp(λM_s − (λ²/2)⟨M⟩_s))
 ≤ exp(−λc + λ²c′/2).
Hence, minimizing with respect to λ, we obtain

P(M_s ≥ c, ⟨M⟩_s ≤ c′) ≤ inf_{λ>0} exp(−λc + λ²c′/2) = exp(−c²/(2c′)).
Let us consider the process

(H_u)_j = Σ_{k=1}^n 1_{]kΔ,(k+1)Δ]}(u) t(X_{kΔ}) σ_{ij}(X_u),

which satisfies, for all positive real u, the inequality H_u · H_u ≤ σ₀²‖t‖²_∞ (in order to avoid confusion with the martingale bracket, the Euclidean inner product is denoted by '·' in this proof). Let us set M_s = ∫₀ˢ H_u · dW_u. This process satisfies:

M_{(n+1)Δ} = Σ_{j=1}^d Σ_{k=1}^n t(X_{kΔ}) ∫_{kΔ}^{(k+1)Δ} σ_{ij}(X_s) dW^j_s = Δ Σ_{k=1}^n t(X_{kΔ}) Z^i_{kΔ},

⟨M⟩_{(n+1)Δ} = Σ_{k=1}^n t²(X_{kΔ}) ∫_{kΔ}^{(k+1)Δ} a_{ii}(X_s) ds ≤ σ₀² nΔ ‖t‖²_n.

Moreover,

⟨M⟩_s = ∫₀ˢ H_u · H_u du ≤ nσ₀²Δ‖t‖²_∞.
Then M_s and exp(λM_s − (λ²/2)⟨M⟩_s) are martingales. We obtain

F := P[(Σ_{k=1}^n t(X_{kΔ})Z^i_{kΔ} ≥ nε), (‖t‖²_n ≤ ζ²)]
 ≤ P[(M_{(n+1)Δ} ≥ Δnε), (⟨M⟩_{(n+1)Δ} ≤ σ₀²nΔζ²)]
 ≤ exp(−nε²Δ/(2σ₀²ζ²)). □
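The exponential bound P(M_s ≥ c, ⟨M⟩_s ≤ c′) ≤ exp(−c²/(2c′)) used in this proof can be sanity-checked by Monte Carlo in the simplest case M = W, a Brownian motion with ⟨M⟩_s = s (an illustrative sketch, not part of the paper): taking s = c′, the bracket condition always holds and the left-hand side is just P(W_s ≥ c).

```python
import numpy as np

rng = np.random.default_rng(3)

s = 1.0          # horizon, chosen equal to c' so that <M>_s <= c' always holds
c = 2.0
n_paths = 200000
w = rng.normal(0.0, np.sqrt(s), n_paths)   # W_s ~ N(0, s)

emp = np.mean(w >= c)                      # Monte Carlo estimate of P(W_s >= c)
bound = np.exp(-c**2 / (2*s))              # exp(-c^2/(2c'))
print(emp <= bound)   # → True: P(W_1 >= 2) ≈ 0.023 <= e^{-2} ≈ 0.135
```

The Gaussian tail is of course much smaller than the Chernoff bound; the inequality is not tight, but it holds uniformly over martingales with controlled bracket, which is what Lemma A1 needs.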
To complete the proof of Proposition 8, we use that

E[(G²_m(m′) − p(m, m′))1_{Ω_n}]₊ = ∫₀^∞ P[(G²_m(m′) − p(m, m′))₊ 1_{Ω_n} ≥ x] dx.

By substituting x = κ₁σ₀²τ/(nΔ), and replacing p(m, m′) by its expression p(m, m′) = κ₁σ₀²(D_m + D_{m′})/(nΔ), we have

E[(G²_m(m′) − p(m, m′))1_{Ω_n}]₊ ≤ κ₁ (σ₀²/(nΔ)) ∫₀^∞ P_n[G²_m(m′) ≥ κ₁σ₀²(τ + D)/(nΔ)] dτ,

where D = D_m + D_{m′}. Lemma A1 and the L² chaining technique of Baraud et al. [21] allow one to obtain the announced result.
A.2. Proof of inequality (11)
The following lemma is proved in Baraud et al. [22].

Lemma A2 Let us set n = p_n q_n and consider S_n as the largest vector space generated by the families of functions F_m, for m ∈ M_n (see Section 3). The dimension of S_n is equal to N_n. For all positive Δ, we have

P(Ω^c_n) ≤ 2nβ_X(q_nΔ) + 2n² exp(−A₀π₀² n/(q_n L_n(φ))),

where L_n(φ) satisfies

L_n(φ) ≤ φ₀² N²_n.

Comte et al. [3] use this lemma to end the proof of inequality (11).