
Notes from Nonparametric Statistical Theory

Cambridge Part III Mathematical Tripos 2012-2013

Lecturer: Adam Bull

Vivak Patel

April 13, 2013


Contents

1 Introduction
   1.1 Convergence of Random Variables
   1.2 Uniform Law of Large Numbers

2 Classical Empirical Process Theory
   2.1 Empirical Distribution Function and Properties
       2.1.1 Convergence (LLN)
       2.1.2 Rate of Convergence (CLT)
       2.1.3 Finite Sample Error Bound (Hoeffding)
       2.1.4 Optimality
   2.2 Kolmogorov-Smirnov Test

3 Minimax Lower Bounds
   3.1 Introduction
   3.2 Reduction to Testing Problem
   3.3 Lower Bounds for Differentiable Densities

4 Approximation of Functions
   4.1 Introduction
   4.2 Regularisation by Convolution
   4.3 Approximation by Basis Functions
       4.3.1 Haar Basis
       4.3.2 B-Spline Basis on Dyadic Break Points
   4.4 Approximation by Orthonormal Wavelet Basis

5 Density Estimation
   5.1 Motivation
   5.2 Kernel Density Estimation
       5.2.1 Consistency and Error
       5.2.2 Asymptotic Behaviour of Kernel Density Estimator
   5.3 Histogram Density Estimation
   5.4 Wavelet Density Estimation

6 Nonparametric Regression
   6.1 Introduction
   6.2 Kernel Estimation
   6.3 Local Polynomials
   6.4 Smoothing Splines
   6.5 Wavelet Regression

7 Parameter Selection
   7.1 Introduction
   7.2 Global Bandwidth Choices
   7.3 Lepski's Method
   7.4 Wavelet Thresholding


1 Introduction

1.1 Convergence of Random Variables

Definition 1.1. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space, and let $X_1, X_2, \ldots, X : \Omega \to (S, d)$ be random variables taking values in a metric space $(S, d)$. Let $\mathbb{E}$ denote expectation with respect to $\mathbb{P}$.

1. If $\mathbb{P}[X_n \to X] = 1$, then $X_n \xrightarrow{a.s.} X$.

2. If $\mathbb{P}[d(X_n, X) > \varepsilon] \to 0$ for every $\varepsilon > 0$, then $X_n \xrightarrow{\mathbb{P}} X$.

3. If $\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]$ for every bounded $f \in C(S)$, then $X_n \xrightarrow{d} X$.

Proposition 1.1. For random variables $X_1, X_2, \ldots, X$:

1. $X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{\mathbb{P}} X \implies X_n \xrightarrow{d} X$.

2. For real-valued random variables, $X_n \xrightarrow{d} X$ if and only if the distribution functions $F_n$ converge pointwise to the distribution function $F$ at every continuity point of $F$.

Theorem 1.1 (Strong Law of Large Numbers). Suppose $X_1, X_2, \ldots, X$ are i.i.d. random variables taking values in $(\mathbb{R}^p, \|\cdot\|)$. If $X$ has mean $\mu$, then the empirical mean $\mathbb{E}_n X = \frac{1}{n}\sum_{i=1}^n X_i$ converges a.s. to $\mu$.

Theorem 1.2 (Central Limit Theorem). Suppose $X_1, X_2, \ldots, X$ are i.i.d. random variables taking values in $(\mathbb{R}^p, \|\cdot\|)$. If $X$ has mean $\mu$ and positive definite covariance matrix $\Sigma$, then the estimation error satisfies $\sqrt{n}(\mathbb{E}_n X - \mu) \xrightarrow{d} N(0, \Sigma)$.

Theorem 1.3 (Hoeffding's Inequality). Suppose $X_1, X_2, \ldots, X_n$ are independent zero-mean random variables with $X_i$ taking values in $[b_i, c_i]$. Then for every $n \in \mathbb{N}$ and every $u > 0$,
$$\mathbb{P}\left[\left|\sum_{i=1}^n X_i\right| > u\right] \le 2\exp\left(\frac{-2u^2}{\sum_{i=1}^n (c_i - b_i)^2}\right).$$
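As a quick sanity check on Theorem 1.3, the following is a minimal NumPy sketch (not part of the original notes) that compares the empirical tail probability of $|\sum_i X_i|$ for bounded zero-mean variables against the Hoeffding bound; the choice of Rademacher variables and the numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, u = 100, 20_000, 15.0

# Rademacher variables: zero mean, bounded in [b_i, c_i] = [-1, 1].
X = rng.choice([-1.0, 1.0], size=(reps, n))
sums = X.sum(axis=1)

empirical = np.mean(np.abs(sums) > u)
# Hoeffding bound: 2 exp(-2 u^2 / sum_i (c_i - b_i)^2), with (c_i - b_i) = 2.
bound = 2 * np.exp(-2 * u**2 / (n * 2**2))

print(f"empirical tail probability: {empirical:.4f}")
print(f"Hoeffding bound:            {bound:.4f}")  # the bound should dominate
```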

1.2 Uniform Law of Large Numbers

1. Motivation: In the nonparametric context, we want the LLN to hold uniformly over a class of functions on a measurable space $T$.

2. Notation and Conventions

   (a) Let $\mathcal{H}$ be a class of functions with $\mathbb{E}[|h(X)|] < \infty$ for all $h \in \mathcal{H}$.

   (b) The empirical mean is denoted $\mathbb{E}_n[h] = \frac{1}{n}\sum_{i=1}^n h(X_i)$.

   (c) Suppose $l, u : T \to \mathbb{R}$ are measurable. The bracket $[l, u] = \{f : T \to \mathbb{R} \mid l(x) \le f(x) \le u(x) \text{ for all } x\}$.

3. The Uniform Law of Large Numbers:

   Proposition 1.2. Suppose that for every $\varepsilon > 0$ there exists a finite set of brackets $[l_j, u_j]$ with $l_j, u_j \in L^1$ and $\mathbb{E}[u_j] - \mathbb{E}[l_j] < \varepsilon$, such that every $h \in \mathcal{H}$ lies in some bracket $[l_j, u_j]$. Then $\sup_{h \in \mathcal{H}} |\mathbb{E}_n[h] - \mathbb{E}[h]| \xrightarrow{a.s.} 0$.

   Proof. We want to show that for every $\varepsilon > 0$ there exists $n_0 \in \mathbb{N}$ such that for $n \ge n_0$,
   $$\sup_{h \in \mathcal{H}} |\mathbb{E}_n[h] - \mathbb{E}[h]| \le 2\varepsilon \quad \text{a.s.}$$

   (a) Let $\varepsilon > 0$ and let $[l_j, u_j]$ be a bracketing as above. By the SLLN, for each $j$ there exists $n_j$ such that for $n \ge n_j$, $|\mathbb{E}_n[l_j] - \mathbb{E}[l_j]| \le \varepsilon$, and similarly for $u_j$. Since there are finitely many $j$, let $n_0 = \max_j n_j$.

   (b) Let $h \in \mathcal{H}$. Then $h \in [l_j, u_j]$ for some $j$, and
   $$\mathbb{E}_n[h] \ge \mathbb{E}_n[l_j] \ge \mathbb{E}[l_j] - \varepsilon \ge \mathbb{E}[u_j] - 2\varepsilon \ge \mathbb{E}[h] - 2\varepsilon,$$
   $$\mathbb{E}_n[h] \le \mathbb{E}_n[u_j] \le \mathbb{E}[u_j] + \varepsilon \le \mathbb{E}[l_j] + 2\varepsilon \le \mathbb{E}[h] + 2\varepsilon.$$

   (c) Therefore $|\mathbb{E}_n[h] - \mathbb{E}[h]| \le 2\varepsilon$. Since $h$ was arbitrary, we can simply take the supremum.

2 Classical Empirical Process Theory

2.1 Empirical Distribution Function and Properties

Definition 2.1. The empirical distribution function is $F_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \le t]$.

2.1.1 Convergence (LLN)

Theorem 2.1 (Uniform Convergence, Glivenko-Cantelli). Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with an arbitrary distribution function $F$. Then
$$\sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \xrightarrow{a.s.} 0.$$

Proof. We want to apply the ULLN to the class of functions $\mathcal{H} = \{\mathbf{1}[x \le t] \mid t \in \mathbb{R}\}$, so we simply need to find brackets of arbitrarily small size. There are two cases: $F$ continuous, and $F$ with discontinuities.

1. When $F$ is continuous it takes every value in $[0, 1]$. So for any $\varepsilon > 0$ we can find a $k$ with $\frac{1}{k} < \varepsilon$. Let $t_j$ be such that $F(t_j) = \frac{j}{k}$ for $j = 1, \ldots, k-1$. Then $\mathcal{H}$ is covered by the brackets
$$[0, \mathbf{1}[x \le t_1]],\ [\mathbf{1}[x \le t_1], \mathbf{1}[x \le t_2]],\ \ldots,\ [\mathbf{1}[x \le t_{k-1}], 1],$$
for which
$$\left|\mathbb{E}[\mathbf{1}[x \le t_j]] - \mathbb{E}[\mathbf{1}[x \le t_{j+1}]]\right| \le \frac{1}{k}.$$

2. When $F$ has discontinuities, it may skip certain values $\frac{j}{k}$, so that $F(t_j^-) < \frac{j}{k}$ and $F(t_j^+) > \frac{j}{k}$. To deal with this we add the brackets $[\mathbf{1}[x \le t_{j-1}], \mathbf{1}[x \le t_j^-]]$ and $[\mathbf{1}[x \le t_j^+], \mathbf{1}[x \le t_{j+1}]]$. We still have a finite collection of brackets for which the expected distance between the upper and lower limits is $\le \frac{1}{k}$.

Theorem 2.2 (Multivariate Glivenko-Cantelli). Let $X_1, X_2, \ldots, X_n$ be i.i.d. in $\mathbb{R}^d$ with distribution function $F(t)$, and let $F_n(t)$ be the empirical distribution function. Then
$$\sup_{t \in \mathbb{R}^d} |F_n(t) - F(t)| \xrightarrow{a.s.} 0.$$

2.1.2 Rate of Convergence (CLT)

Proposition 2.1. For a fixed $t \in \mathbb{R}$,
$$\sqrt{n}\left(F_n(t) - F(t)\right) \xrightarrow{d} N\left(0, F(t)(1 - F(t))\right).$$

Theorem 2.3. Let $X_1, X_2, \ldots, X_n$ be i.i.d. with distribution function $F$. Then $\sqrt{n}(F_n - F)$ converges to the $F$-Brownian bridge process in $L^\infty$.

Definition 2.2. The $F$-Brownian bridge process $G_F$ is the zero-mean Gaussian process indexed by $\mathbb{R}$ with covariance $\mathbb{E}[G_F(t_i) G_F(t_j)] = F(t_i \wedge t_j) - F(t_i) F(t_j)$.

2.1.3 Finite Sample Error Bound (Hoeffding)

The following theorem is the empirical-distribution-process analogue of Hoeffding's inequality.

Theorem 2.4. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with distribution $F$. Then for all $n \in \mathbb{N}$ and all $\lambda > 0$,
$$\mathbb{P}\left[\|\sqrt{n}(F_n - F)\|_\infty > \lambda\right] \le 2\exp(-2\lambda^2).$$

2.1.4 Optimality

Given no prior information on $F$, the empirical distribution function is the asymptotically optimal choice for estimating $F$.

Theorem 2.5. Let $X_1, \ldots, X_n$ be i.i.d. random variables with distribution $F$, and let $T_n$ be any estimator of $F$ based on the random variables $X_i$. Then
$$\lim_{n \to \infty} \frac{\sup_F \mathbb{E}\left[\|F_n - F\|_\infty\right]}{\inf_{T_n}\sup_F \mathbb{E}\left[\|T_n - F\|_\infty\right]} = 1.$$

Note 2.1. The $\inf_{T_n}$ gives us the best estimator for $\sup_F$, the worst distribution function available.

2.2 Kolmogorov-Smirnov Test

Motivation: We know that $\sqrt{n}(F_n - F)$ behaves like $G_F$. However, we often do not know $F$, so determining $G_F$ is difficult. The next result gives us a way around this.

Theorem 2.6. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with continuous distribution $F$. Then
$$\sqrt{n}\,\|F_n - F\|_\infty \xrightarrow{d} \|G\|_\infty,$$
where $G$ is the standard Brownian bridge. Moreover, for $\lambda > 0$,
$$\mathbb{P}\left[\|G\|_\infty > \lambda\right] = 2\sum_{j=1}^\infty (-1)^{j+1}\exp(-2j^2\lambda^2).$$

Proof. We prove the first part.

1. Let $Y_i = F(X_i)$. Then the $Y_i$ are uniformly distributed, with distribution function $U(x) = x$.

2. Note that $U(F(t)) = F(t)$, and $U_n(F(t)) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[Y_i \le F(t)] = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \le t] = F_n(t)$. Therefore
$$\|U_n - U\|_\infty = \|U_n(F) - U(F)\|_\infty = \|F_n - F\|_\infty.$$

3. By the rate-of-convergence (CLT) result for empirical distributions, $\sqrt{n}(U_n - U)$ converges in $L^\infty$ to the standard Brownian bridge process. Since $f \mapsto \|f\|_\infty$ is continuous, the continuous mapping theorem yields the desired first result.
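The Kolmogorov-Smirnov statistic $\sqrt{n}\|F_n - F\|_\infty$ and the limiting tail formula of Theorem 2.6 are easy to compute directly. The following is a minimal NumPy sketch (not from the notes); the standard normal null hypothesis and the function names are illustrative assumptions.

```python
import numpy as np
from math import erf, exp, sqrt


def ks_statistic(x, cdf):
    """sqrt(n) * sup_t |F_n(t) - F(t)| for a continuous null cdf F."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    F = np.array([cdf(t) for t in x])
    # The sup is attained at a data point, just before or after a jump of F_n.
    d_plus = np.max(np.arange(1, n + 1) / n - F)
    d_minus = np.max(F - np.arange(0, n) / n)
    return sqrt(n) * max(d_plus, d_minus)


def ks_tail(lam, terms=100):
    """P(||G||_inf > lam) = 2 * sum_{j>=1} (-1)^(j+1) exp(-2 j^2 lam^2)."""
    return 2 * sum((-1) ** (j + 1) * exp(-2 * j**2 * lam**2)
                   for j in range(1, terms + 1))


rng = np.random.default_rng(1)
x = rng.normal(size=500)
norm_cdf = lambda t: 0.5 * (1 + erf(t / sqrt(2)))

lam = ks_statistic(x, norm_cdf)
print(f"KS statistic: {lam:.3f}, asymptotic p-value: {ks_tail(lam):.3f}")
```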

3 Minimax Lower Bounds

3.1 Introduction

1. In parametric theory and in distribution function estimation, the rate of convergence to a limiting distribution is $\frac{1}{\sqrt{n}}$. However, in other nonparametric cases our rate of convergence is strictly worse than this.

2. Minimax Risk

   Definition 3.1. Let $\mathcal{F}$ be a collection of densities or regression functions. Suppose that under $f \in \mathcal{F}$ the data have probability measure $\mathbb{P}_f$ and expectation $\mathbb{E}_f$. Let $d(f, g)$ be a metric on $\mathcal{F}$. The best worst-case performance under loss $d$, over estimators $\tilde{f}_n$, is the minimax risk
   $$R_n(\mathcal{F}, d) = \inf_{\tilde{f}_n}\sup_{f \in \mathcal{F}} \mathbb{E}_f\left[d(f, \tilde{f}_n)\right].$$

3. Minimax Rate of Convergence

   Definition 3.2. If the risk is bounded above and below by a deterministic sequence $r_n$ (i.e. $C r_n \le R_n(\mathcal{F}, d) \le D r_n$ for some $C, D > 0$), then $r_n$ is the minimax rate of convergence.

3.2 Reduction to Testing Problem

We will determine lower bounds by framing estimation as a testing problem, which allows us to lower bound the minimax risk. Then we will lower bound the testing error to get an estimate of the lower bound constant.

1. We reduce to a testing problem as follows:

   (a) Suppose we want to test the hypotheses $H_0 : f = f_0$ against $H_1 : f = f_1$, where $f_0, f_1 \in \mathcal{F}$ are such that $d(f_0, f_1) > 2r_n$.

   (b) Given an estimator $\tilde{f}_n$, we construct the statistic
   $$T_n = \begin{cases} 0 & \text{if } d(f_0, \tilde{f}_n) \le d(f_1, \tilde{f}_n), \\ 1 & \text{if } d(f_0, \tilde{f}_n) > d(f_1, \tilde{f}_n). \end{cases}$$

   (c) It follows that
   $$R_n(\mathcal{F}, d)/r_n \ge \inf_{T_n}\max_{j \in \{0,1\}} \mathbb{P}_{f_j}[T_n \ne j].$$

   Derivation 3.1. By Markov's inequality,
   $$\frac{R_n(\mathcal{F}, d)}{r_n} = \inf_{\tilde{f}_n}\sup_{f \in \mathcal{F}}\frac{\mathbb{E}_f[d(\tilde{f}_n, f)]}{r_n} \ge \inf_{\tilde{f}_n}\sup_{f \in \mathcal{F}}\mathbb{P}_f\left[d(\tilde{f}_n, f) > r_n\right].$$
   Also note that if $T_n = 1$ then $d(f_0, \tilde{f}_n) > d(f_1, \tilde{f}_n)$, and if $T_n = 0$ then $d(f_0, \tilde{f}_n) \le d(f_1, \tilde{f}_n)$. So by the triangle inequality, when $T_n$ is 1 and 0 respectively,
   $$d(f_0, \tilde{f}_n) \ge d(f_0, f_1) - d(f_1, \tilde{f}_n) \ge 2r_n - d(f_1, \tilde{f}_n) \ge 2r_n - d(f_0, \tilde{f}_n),$$
   $$d(f_1, \tilde{f}_n) \ge d(f_0, f_1) - d(f_0, \tilde{f}_n) \ge 2r_n - d(f_0, \tilde{f}_n) \ge 2r_n - d(f_1, \tilde{f}_n),$$
   so that $d(f_j, \tilde{f}_n) \ge r_n$ whenever $T_n \ne j$. We now recast the probability from the beginning in terms of $T_n$:
   $$\inf_{\tilde{f}_n}\sup_{f \in \mathcal{F}}\mathbb{P}_f\left[d(\tilde{f}_n, f) > r_n\right] \ge \inf_{\tilde{f}_n}\max_{j \in \{0,1\}}\mathbb{P}_{f_j}\left[d(f_j, \tilde{f}_n) \ge r_n\right] \ge \inf_{T_n}\max_{j \in \{0,1\}}\mathbb{P}_{f_j}[T_n \ne j].$$

2. To lower bound this probability, we have the following lemma.

   Lemma 3.1. Let $Z = \frac{d\mathbb{P}_{f_1}}{d\mathbb{P}_{f_0}}$ be the likelihood ratio. Then
   $$\inf_{T_n}\max_{j \in \{0,1\}}\mathbb{P}_{f_j}[T_n \ne j] \ge \frac{1}{2}\left(1 - \sqrt{\mathbb{E}_{f_0}\left[(Z - 1)^2\right]}\right).$$

   Proof. The penultimate line follows from Cauchy-Schwarz:
   $$\begin{aligned}
   2\max_j \mathbb{P}_{f_j}[T_n \ne j] &\ge \mathbb{P}_{f_0}[T_n = 1] + \mathbb{P}_{f_1}[T_n = 0] \\
   &= 1 - \mathbb{P}_{f_0}[T_n = 0] + \mathbb{E}_{f_1}[\mathbf{1}[T_n = 0]] \\
   &= \mathbb{E}_{f_0}[1 - \mathbf{1}[T_n = 0]] + \int \mathbf{1}[T_n = 0]\, d\mathbb{P}_{f_1} \\
   &\ge \mathbb{E}_{f_0}[1 - \mathbf{1}[T_n = 0]] + \int \mathbf{1}[T_n = 0]\,\frac{d\mathbb{P}_{f_1}}{d\mathbb{P}_{f_0}}\, d\mathbb{P}_{f_0} \\
   &= 1 - \mathbb{E}_{f_0}\left[(1 - Z)\mathbf{1}[T_n = 0]\right] \\
   &\ge 1 - \sqrt{\mathbb{E}_{f_0}\left[(1 - Z)^2\right]\,\mathbb{E}_{f_0}\left[\mathbf{1}[T_n = 0]^2\right]} \\
   &\ge 1 - \sqrt{\mathbb{E}_{f_0}\left[(Z - 1)^2\right]}.
   \end{aligned}$$

3. When the two densities $f_0$ and $f_1$ are close, the test is essentially impossible, and the right-hand side is close to $\frac{1}{2}$.

3.3 Lower Bounds for Differentiable Densities

The minimax rate of convergence depends on the number of derivatives the density has, and on a bound over those derivatives. We define the following accordingly.

Definition 3.3. For the class of $s$-times differentiable densities, let:

1. $\|f\|_{s,\infty} = \max_{k \le s}\|f^{(k)}\|_\infty$ (a norm);

2. $C^s(M) = \{f \mid f \text{ is a density},\ \|f\|_{s,\infty} \le M\}$, the subset of $s$-times differentiable densities with bounded derivatives.

Theorem 3.1. Let $s \in \mathbb{N}$, $M > 1$ and $x_0 \in [0, 1]$. Let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f$ on $[0, 1]$, and let $\tilde{f}_n$ be an estimator depending only on $X_1, \ldots, X_n$. Then for some constant $C > 0$ and $n$ large enough,
$$\inf_{\tilde{f}_n}\sup_{f \in C^s(M)} \mathbb{E}_f\left[\left|\tilde{f}_n(x_0) - f(x_0)\right|\right] > C n^{-s/(2s+1)}.$$

Proof. PROOF GOES HERE.

Remark 3.1. Similar results can be proven for other loss functions, such as $L^p$ or $L^\infty$ loss. They can also be proven for non-integer smoothness or in regression, but the rates will be no better than $n^{-s/(2s+1)}$, nor worse than $\left(\frac{n}{\log n}\right)^{-s/(2s+1)}$.

4 Approximation of Functions

4.1 Introduction

1. When estimating a density or regression function $f$, there is no natural estimator $\tilde{f}_n$ as there is for the distribution function. The "best" (maximum likelihood) natural estimators either:

   (a) place peaks at every data point, in density estimation; or

   (b) interpolate between data points, in regression estimation.

2. Although such MLEs have low bias, they have extremely high variance. So we exchange some bias for lower variance by estimating a nicer function near $f$, one which does have a natural estimator.

3. Notation:

   (a) $L^\infty$ is the space of $f : \mathbb{R} \to \mathbb{R}$ with $\|f\|_\infty = \sup_{x \in \mathbb{R}}|f(x)| < \infty$.

   (b) For $p \in [1, \infty)$, $L^p$ is the space of $f : \mathbb{R} \to \mathbb{R}$ with $\|f\|_p = \left(\int_{\mathbb{R}}|f|^p\right)^{1/p} < \infty$.

   (c) $f$ is locally integrable if $\int_C |f| < \infty$ for all bounded Borel sets $C \subset \mathbb{R}$.

4.2 Regularisation by Convolution

1. Convolution

   Definition 4.1. Let $f, g : \mathbb{R} \to \mathbb{R}$. Their convolution, if the integral exists, is $(f * g)(x) = \int_{\mathbb{R}} f(x - y)g(y)\, dy = \int_{\mathbb{R}} f(y)g(x - y)\, dy = (g * f)(x)$.

2. Convolution with $K_h$

   (a) Let $K : \mathbb{R} \to \mathbb{R}$ be such that $\int_{\mathbb{R}} K(x)\, dx = 1$. For $h > 0$ we have the localised kernel $K_h(x) = \frac{1}{h}K\left(\frac{x}{h}\right)$.

   (b) $h$ is called the bandwidth, and controls how localised the effects of $K_h$ are.

3. "Consistency" and "error" of the kernel convolution of a function, with respect to the original function (a numerical illustration is sketched at the end of this subsection).

   Proposition 4.1. Let $K : \mathbb{R} \to \mathbb{R}$ be a bounded kernel (integrating to 1) with compact support. Let $f : \mathbb{R} \to \mathbb{R}$ be locally integrable, and let $p \in [1, \infty]$. Then as $h \to 0$:

   (a) If $f$ is continuous at $x$, then $|(K_h * f)(x) - f(x)| \to 0$.

   (b) If $p = \infty$ and $f$ is uniformly continuous, OR $p < \infty$ and $f \in L^p$, then $\|K_h * f - f\|_p \to 0$.

   (c) Let $\int_{\mathbb{R}} u^r K(u)\, du = 0$ for $r = 1, \ldots, s-1$, and let $\kappa(s) = \int_{\mathbb{R}} |u^s K(u)|\, du$. Suppose that for some $M > 0$, $f$ is $s$-times differentiable with $\|f^{(s)}\|_p \le M$. If $p = \infty$, OR $f^{(s-1)}$ is absolutely continuous, then
   $$\|K_h * f - f\|_p \le \frac{M\kappa(s)}{s!}\,h^s.$$

   Proof. By the assumptions on the kernel $K$, there exists $C > 0$ such that $|K(x)| \le C$ for all $x \in \mathbb{R}$, and by compactness there exists $R > 0$ such that $\operatorname{supp}(K) \subset [-R, R]$. Finally, since $f$ is locally integrable, $K_h * f$ exists.

   (a) The last step follows from continuity at $x$:
   $$\begin{aligned}
   |(K_h * f)(x) - f(x)| &= \left|\int h^{-1}K\left(\frac{x - y}{h}\right)f(y)\, dy - f(x)\right| = \left|\int K(u)f(x - uh)\, du - f(x)\int K(u)\, du\right| \\
   &= \left|\int K(u)\left[f(x - uh) - f(x)\right] du\right| \le \int |K(u)|\,|f(x - uh) - f(x)|\, du \\
   &\le \left(\int |K(u)|\, du\right)\left(\sup_{|u| \le R}|f(x - uh) - f(x)|\right) \le 2RC\left(\sup_{|u| \le R}|f(x - uh) - f(x)|\right) \to 0.
   \end{aligned}$$

   (b) For $p = \infty$, the result follows from (a) since $f$ is uniformly continuous. For $p < \infty$, we start with $p = 1$; the remaining $p$ follow from Minkowski's integral inequality.
   $$\begin{aligned}
   \int_{\mathbb{R}}|(K_h * f)(x) - f(x)|\, dx &= \int_{\mathbb{R}}\left|\int_{-R}^R K(u)\left[f(x - uh) - f(x)\right] du\right| dx \\
   &\le \int_{\mathbb{R}}\int_{-R}^R |K(u)|\,|f(x - uh) - f(x)|\, du\, dx \\
   &\le \int_{\mathbb{R}}\left(\int |K(u)|\, du\right)\left(\sup_{|u| \le R}|f(x - uh) - f(x)|\right) dx \\
   &\le 2RC\int_{\mathbb{R}}\sup_{|u| \le R}|f(x - uh) - f(x)|\, dx \to 0.
   \end{aligned}$$
   The last integral tends to 0; see the example sheet.

   (c) We first prove $p = \infty$ and then $p = 1$ (use Minkowski for the remaining cases). Taylor's theorem is required.

   i. Taylor's theorem(s): for some $y$ between $x$ and $x - uh$,
   $$f(x - uh) - f(x) = \sum_{r=1}^{s-1}\frac{(-uh)^r}{r!}f^{(r)}(x) + \frac{(-uh)^s}{s!}f^{(s)}(y),$$
   and when $f^{(s-1)}$ is absolutely continuous,
   $$f(x - uh) - f(x) = \sum_{r=1}^{s-1}\frac{(-uh)^r}{r!}f^{(r)}(x) + \int_0^{-uh}\frac{t^{s-1}}{(s-1)!}f^{(s)}(x - t)\, dt.$$

   ii. The result follows for $p = \infty$, since for all $x$ (using the moment conditions on $K$)
   $$\begin{aligned}
   |(K_h * f)(x) - f(x)| &= \left|\int_{-R}^R K(u)\left(\sum_{r=1}^{s-1}\frac{(-uh)^r}{r!}f^{(r)}(x) + \frac{(-uh)^s}{s!}f^{(s)}(y)\right) du\right| \\
   &= \left|\int_{-R}^R K(u)\frac{(-uh)^s}{s!}f^{(s)}(y)\, du\right| \le \frac{h^s}{s!}\left\|f^{(s)}\right\|_\infty\kappa(s).
   \end{aligned}$$

   iii. By absolute continuity we have
   $$\begin{aligned}
   \int_{\mathbb{R}}|(K_h * f)(x) - f(x)|\, dx &= \int_{\mathbb{R}}\left|\int_{-R}^R K(u)\left(\int_0^{-uh}\frac{t^{s-1}}{(s-1)!}f^{(s)}(x - t)\, dt\right) du\right| dx \\
   &\le \int_{\mathbb{R}}\int_{-R}^R\int_0^{|uh|}|K(u)|\,\frac{|t|^{s-1}}{(s-1)!}\left|f^{(s)}(x - t)\right| dt\, du\, dx \\
   &\le \int_{-R}^R |K(u)|\left[\int_0^{|uh|}\frac{|t|^{s-1}}{(s-1)!}\left(\int_{\mathbb{R}}\left|f^{(s)}(x - t)\right| dx\right) dt\right] du \\
   &\le \left\|f^{(s)}\right\|_1\frac{h^s}{s!}\int_{-R}^R |K(u)u^s|\, du = \frac{h^s}{s!}\kappa(s)\left\|f^{(s)}\right\|_1.
   \end{aligned}$$

4. If the kernel is symmetric, it satisfies our moment conditions for $s = 1, 2$.
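As referenced in item 3 above, here is a minimal NumPy sketch (not part of the notes) illustrating Proposition 4.1 numerically: convolve a smooth function with a localised kernel $K_h$ and watch the sup-norm error shrink as $h \to 0$. The uniform kernel, the test function and the grid are illustrative assumptions.

```python
import numpy as np

def kernel_smooth(f, x, h, m=2001):
    """(K_h * f)(x) = int K(u) f(x - u*h) du, with K = uniform kernel on [-1/2, 1/2]."""
    u = np.linspace(-0.5, 0.5, m)
    w = np.full(m, 1.0 / m)          # quadrature weights for K(u) du, K == 1
    return np.array([np.sum(w * f(xi - u * h)) for xi in x])

f = lambda t: np.exp(-t**2)          # a smooth test function
x = np.linspace(-3, 3, 400)

for h in [1.0, 0.5, 0.1, 0.02]:
    err = np.max(np.abs(kernel_smooth(f, x, h) - f(x)))
    print(f"h = {h:5.2f}   sup-norm error = {err:.2e}")
# The error decays roughly like h^2 here: the uniform kernel is symmetric, so
# its first moment vanishes and Proposition 4.1(c) applies with s = 2.
```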

4.3 Approximation by Basis Functions

4.3.1 Haar Basis

1. Suppose we want to approximate a function in $L^2$. The Fourier series is optimal in $L^2$, but may not converge pointwise. Instead, we consider the piecewise-constant Haar basis.

2. Haar basis and approximation

   Definition 4.2. Suppose $f$ is a locally integrable function. Its Haar approximation at resolution level $j$ is
   $$H_j(f)(x) = \sum_{k \in \mathbb{Z}} c_k\varphi_{j,k}(x),$$
   where:

   (a) $\varphi_{j,k}(x) = 2^{j/2}\,\mathbf{1}\left[2^{-j}k \le x < 2^{-j}(k+1)\right]$;

   (b) $c_k = \langle f, \varphi_{j,k}\rangle$.

3. An equivalent expression, which approximates $f$ in terms of basis functions not depending on $j$, is
   $$H_j(f)(x) = \sum_{k \in \mathbb{Z}} a_k\varphi_{0,k}(x) + \sum_{l=0}^{j-1}\sum_{k \in \mathbb{Z}} b_{l,k}\psi_{l,k}(x),$$
   where:

   (a) $\varphi_{0,k}(x) = \mathbf{1}[k \le x < k+1]$ and $a_k = \langle f, \varphi_{0,k}\rangle$;

   (b) $\psi_{l,k}(x) = 2^{l/2}\left(\mathbf{1}\left[2^{-l}k \le x < 2^{-l}(k + \tfrac{1}{2})\right] - \mathbf{1}\left[2^{-l}(k + \tfrac{1}{2}) \le x < 2^{-l}(k+1)\right]\right)$;

   (c) $b_{l,k} = \langle f, \psi_{l,k}\rangle$.

   Derivation 4.1. Let $V_j = \overline{\operatorname{span}}(\varphi_{j,k})$. Then $V_j$ is the space of all $L^2$ functions which are constant on the intervals $(2^{-j}k, 2^{-j}(k+1))$, $k \in \mathbb{Z}$, so $V_j \supset V_{j-1}$. Letting $W_{j-1} = V_{j-1}^\perp \cap V_j$, we have $V_j = V_{j-1} \oplus W_{j-1}$, and $W_{j-1} = \overline{\operatorname{span}}(\psi_{j-1,k})$. Continuing in this fashion, $V_j = V_0 \oplus W_0 \oplus \cdots \oplus W_{j-1}$.

4. "Consistency" and "error" of the Haar approximation (a small numerical sketch follows this list).

   Proposition 4.2. Let $f$ be a locally integrable function and let $p \in [1, \infty]$. As $j \to \infty$:

   (a) If $f$ is continuous at $x$, then $|H_j(f)(x) - f(x)| \to 0$.

   (b) If $p = \infty$ and $f$ is uniformly continuous, OR $p < \infty$ and $f \in L^p$, then $\|H_j(f) - f\|_p \to 0$.

   (c) Suppose $f$ is differentiable. If $p = \infty$, OR $p < \infty$ and $f$ is absolutely continuous, then $\|H_j(f) - f\|_p \le \frac{1}{2}\|f'\|_p\, 2^{-j}$.

5. The Haar basis cannot make use of higher-order derivatives, so to approximate functions with higher-order derivatives we use a basis which can.
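As referenced in item 4 above, the following is a minimal NumPy sketch (not part of the notes) of the Haar approximation $H_j(f) = \sum_k \langle f, \varphi_{j,k}\rangle\varphi_{j,k}$, computed by averaging $f$ over dyadic cells; the test function and grid are illustrative assumptions.

```python
import numpy as np

def haar_approx(f, x, j):
    """H_j(f)(x): replace f by its average over the dyadic cell [k 2^-j, (k+1) 2^-j) containing x.

    <f, phi_{j,k}> phi_{j,k}(x) equals the mean of f over that cell, which we
    approximate by a fine Riemann sum on each cell.
    """
    k = np.floor(x * 2**j)                       # index of the cell containing each x
    lo, hi = k * 2.0**-j, (k + 1) * 2.0**-j
    t = np.linspace(0, 1, 64)                    # quadrature nodes inside each cell
    return np.mean(f(lo[:, None] + (hi - lo)[:, None] * t[None, :]), axis=1)

f = lambda t: np.sin(2 * np.pi * t)
x = np.linspace(0, 1, 1000, endpoint=False)

for j in [2, 4, 6, 8]:
    err = np.max(np.abs(haar_approx(f, x, j) - f(x)))
    bound = 0.5 * 2 * np.pi * 2.0**-j            # (1/2) ||f'||_inf 2^-j, Prop 4.2(c)
    print(f"j = {j}   sup-norm error = {err:.3e}   bound = {bound:.3e}")
```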

4.3.2 B-Spline Basis on Dyadic Break Points

1. The B-spline basis

   Definition 4.3. We have:

   (a) the B-spline scaling function of order 1: $N_1(x) = (\mathbf{1}[0,1] * \mathbf{1}[0,1])(x) = x\,\mathbf{1}[0 \le x \le 1] + (2 - x)\,\mathbf{1}[1 < x \le 2]$;

   (b) the B-spline basis of order 1: $\{\varphi_{j,k}(x) = N_1(2^j x - k) \mid k \in \mathbb{Z},\ j \in \mathbb{N}_0\}$;

   (c) the B-spline scaling function of order $r$: $N_r(x) = \mathbf{1}[0,1] * \mathbf{1}[0,1] * \cdots * \mathbf{1}[0,1]$ (the indicator convolved with itself $r$ times, so that $N_1$ is as in (a));

   (d) the B-spline basis of order $r$: $\{\varphi_{j,k}(x) = N_r(2^j x - k) \mid k \in \mathbb{Z},\ j \in \mathbb{N}_0\}$.

2. "Consistency" and "error" for B-splines

   Proposition 4.3. For $s \in \mathbb{N}$, define the B-spline basis of order $s - 1$ on dyadic break points. If $f \in L^\infty$ is $s$-times differentiable, then there exist coefficients $c_k \in \mathbb{R}$ and a constant $\kappa_s > 0$ such that
   $$\left\|\sum_{k \in \mathbb{Z}} c_k\varphi_{j,k} - f\right\|_\infty \le \kappa_s\left\|f^{(s)}\right\|_\infty 2^{-js}.$$

3. Note that determining the $c_k$ is difficult, since the basis is not orthonormal.

4. Note that the B-spline basis of order $r$ is $r - 1$ times differentiable.

4.4 Approximation by Orthonormal Wavelet Basis

1. The ideal basis is:

   (a) localised (unlike the Fourier basis), so that we can approximate $f$ pointwise;

   (b) smooth (unlike Haar), so that we can approximate a smooth $f$ well;

   (c) orthogonal (unlike B-splines), so that we can control the variance of estimators of the coefficients.

2. The ideal father wavelet (scaling function) $\varphi$:

   (a) is bounded and compactly supported;

   (b) has orthonormal translates $\varphi_{0,k}(x) = \varphi(x - k)$, $k \in \mathbb{Z}$;

   (c) generates nested linear spaces $V_j = \overline{\operatorname{span}}(\varphi_{j,k})$, i.e. $V_j \supset V_{j-1}$;

   (d) is such that $\bigcup_{j=0}^\infty V_j$ is dense in $L^2$.

3. Given an ideal father wavelet forming the ideal basis, we can do the following:

   (a) decompose $V_j$ so that $V_j = V_0 \oplus W_0 \oplus \cdots \oplus W_{j-1}$;

   (b) determine the mother wavelet $\psi : \mathbb{R} \to \mathbb{R}$ generating the spaces $W_l$;

   (c) write the approximation as
   $$W_j(f)(x) = \sum_{k \in \mathbb{Z}} a_k\varphi_{0,k}(x) + \sum_{l=0}^{j-1}\sum_{k \in \mathbb{Z}} b_{l,k}\psi_{l,k}(x),$$
   where $a_k$ and $b_{l,k}$ are determined by the orthogonal projections of $f$ onto $V_0$ and $W_l$.

4. "Consistency" and "error" of the orthonormal wavelet basis

   Proposition 4.4. Let $\{\varphi_{0,k}, \psi_{l,k} \mid k \in \mathbb{Z}\}$ be a compactly supported orthonormal wavelet basis which is either the Haar basis or has a continuous father wavelet. Let $f : \mathbb{R} \to \mathbb{R}$ be locally integrable, $p \in [1, \infty]$, $s \in \mathbb{N}$, and suppose $\int_{\mathbb{R}} u^r\psi(u)\, du = 0$ for $r = 0, \ldots, s-1$. As $j \to \infty$:

   (a) If $f$ is continuous at $x$, then $|W_j(f)(x) - f(x)| \to 0$.

   (b) If either $p = \infty$ and $f$ is uniformly continuous, OR $p < \infty$ and $f \in L^p$, then $\|W_j(f) - f\|_p \to 0$.

   (c) Suppose $f$ is $s$-times differentiable. If $p = \infty$, OR $f^{(s-1)}$ is absolutely continuous, then $\|W_j(f) - f\|_p \le \frac{\kappa_s}{s!}\|f^{(s)}\|_p\, 2^{-js}$.

   Proof. The wavelet projection is essentially a moving kernel estimator. Let $K(x, y) = \sum_{k \in \mathbb{Z}}\varphi(x - k)\varphi(y - k)$. Since $\varphi$ is bounded and compactly supported, so is $K(x, y)$; moreover $\int_{\mathbb{R}} K(x, y)\, dy = 1$ (see the example sheet). Therefore
   $$W_j(f)(x) = \int_{\mathbb{R}} 2^j K(2^j x, 2^j y) f(y)\, dy.$$

   (a) Since $f$ is continuous at $x$, and $K$ is bounded and compactly supported (the proof parallels the one for convolution with a kernel):
   $$|W_j(f)(x) - f(x)| = \left|\int_{\mathbb{R}} K(2^j x, 2^j x - u)\left(f(x - 2^{-j}u) - f(x)\right) du\right| \le C\sup_{u \in [-R, R]}|f(x - 2^{-j}u) - f(x)| \to 0.$$

   (b) This also follows from the analogous proof for convolution with a stationary kernel.

   (c) We need several things in order to reuse the proof for convolution with a stationary kernel:

   i. $\int_{\mathbb{R}} K(x, y)(y - x)^r\, dy = 0$ for $r = 1, \ldots, s-1$. For the Haar basis this holds vacuously; for higher-order orthonormal wavelet bases, see the example sheet.

   ii. Also, since $K(x+1, y+1) = K(x, y)$ is periodic in this sense, we let $\kappa_s = \sup_{x \in (0,1)}\int_{\mathbb{R}}|u^s K(x, x - u)|\, du$, which is finite.

   iii. Note that for the Haar basis (with $s = 1$):
   $$\begin{aligned}
   \kappa_1 &= \sup_{x \in (0,1)}\int_{\mathbb{R}}|u|\sum_{k \in \mathbb{Z}}\mathbf{1}[x \in [k, k+1)]\,\mathbf{1}[x - u \in [k, k+1)]\, du \\
   &= \sup_{x \in (0,1)}\int_{\mathbb{R}}|u|\,\mathbf{1}[x \in [0, 1)]\,\mathbf{1}[x - u \in [0, 1)]\, du, \quad \text{which, taking } x \to 1, \\
   &= \int_0^1 u\, du = \frac{1}{2}.
   \end{aligned}$$

5. An example of such wavelets is the family of Daubechies wavelets.

5 Density Estimation

5.1 Motivation

There is no natural estimator for a density given independent samples $X_1, X_2, \ldots, X_n$ with density $f$. So we approximate $f$ using kernels and wavelets. This gives us many choices: the approximation scheme, the smoothness parameter (i.e. resolution level for wavelets or bandwidth for kernels), and the loss measurement (e.g. pointwise, $L^p$, etc.).

5.2 Kernel Density Estimation

5.2.1 Consistency and Error

1. Framework: notice that
$$(K_h * f)(x) = \int_{\mathbb{R}} K_h(x - y)f(y)\, dy = \mathbb{E}_f\left[K_h(x - X)\right].$$
Therefore, based on the law of large numbers, we have the natural kernel density estimator
$$f^K_{n,h}(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i).$$
(A minimal implementation is sketched at the end of this subsection.)

2. "Consistency" of the kernel density estimator

   Theorem 5.1. Let $X_1, \ldots, X_n$ be i.i.d. random variables taking values in $\mathbb{R}$ with density $f$, and let $f^K_{n,h}$ be the kernel density estimator. Suppose:

   (a) $h$ is chosen in terms of $n$, so that $h \to 0$ and $nh \to \infty$ as $n \to \infty$;

   (b) $K$ is non-negative, bounded, compactly supported and integrates to 1.

   Then as $n \to \infty$, $\mathbb{E}_f\left[\|f^K_{n,h} - f\|_1\right] \to 0$.

   Proof. We first split the term of interest into variance and bias.

   (a) $\mathbb{E}\left[\|f^K_{n,h} - f\|_1\right] \le \mathbb{E}\left[\|f^K_{n,h} - (K_h * f)\|_1\right] + \|(K_h * f) - f\|_1$.

   (b) The bias term is deterministic, and by Proposition 4.1(b), as $n \to \infty$ we have $h \to 0$ and so $\|(K_h * f) - f\|_1 \to 0$.

   (c) For the variance term, we first need to establish several properties:

   i. $\mathbb{E}\left[f^K_{n,h}(x)\right] = \frac{1}{n}\sum_{i=1}^n\int_{\mathbb{R}} K_h(x - y)f(y)\, dy = (K_h * f)(x)$.

   ii. By Fubini's theorem and assumption (b),
   $$\int_{\mathbb{R}}\left(f^K_{n,h}(x) - (K_h * f)(x)\right) dx = \frac{1}{n}\sum_{i=1}^n\int_{\mathbb{R}} K_h(x - X_i)\, dx - \int\left(\int K_h(x - y)\, dx\right)f(y)\, dy = 1 - \int f(y)\, dy = 0.$$
   Therefore
   $$\int_{\mathbb{R}}\left(f^K_{n,h}(x) - (K_h * f)(x)\right)_+ dx = \int_{\mathbb{R}}\left(f^K_{n,h}(x) - (K_h * f)(x)\right)_- dx.$$

   iii. By Jensen's inequality,
   $$\mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)_+\right]^2 \le \mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)^2\right].$$

   iv. Since $K$ is bounded by some $\sqrt{C}$ with compact support in some $[-R, R]$, and the $X_i$ are i.i.d.,
   $$\begin{aligned}
   \mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)^2\right] &= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\right) \le \frac{1}{n^2}\sum_{i=1}^n\mathbb{E}\left[K_h^2(x - X)\right] \\
   &= \frac{1}{n}\int_{\mathbb{R}} K_h^2(x - y)f(y)\, dy = \frac{1}{nh}\int_{\mathbb{R}} K^2(u)f(x - uh)\, du \le \frac{C}{nh}\int_{-R}^R f(x - uh)\, du.
   \end{aligned}$$
   By the Lebesgue differentiation theorem, the last integral tends to $2Rf(x)$ at almost every $x$ as $h \to 0$; therefore, as $nh \to \infty$, the whole bound tends to 0.

   (d) We have by Fubini's theorem:
   $$\begin{aligned}
   \mathbb{E}\left[\|f^K_{n,h} - (K_h * f)\|_1\right] &= \int_{\mathbb{R}}\mathbb{E}\left[\left|f^K_{n,h}(x) - (K_h * f)(x)\right|\right] dx \\
   &= 2\int_{\mathbb{R}}\mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)_+\right] dx \qquad \text{by property (ii)} \\
   &\le 2\int_{\mathbb{R}}\min\left((K_h * f)(x),\ \mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)_+\right]\right) dx \\
   &\le 2\int_{\mathbb{R}}\min\left((K_h * f)(x),\ \sqrt{\mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)^2\right]}\right) dx \qquad \text{by property (iii)} \\
   &\le 2\int_{\mathbb{R}}\min\left(f(x),\ \sqrt{\mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)^2\right]}\right) dx + 2\|K_h * f - f\|_1.
   \end{aligned}$$
   By Proposition 4.1(b) the rightmost term tends to 0. By property (iv) the integrand tends pointwise (a.e.) to 0, and since it is dominated by $f(x)$, the dominated convergence theorem gives that the integral tends to 0.

3. Although the estimator converges for each fixed $f$, it is not "optimal" uniformly, in the sense that it can have the worst possible error over the class of all densities.

   Proposition 5.1. Let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f$, and let $f^K_{n,h}$ be the kernel density estimator. Suppose $K$ is non-negative and integrates to 1. Then
   $$\sup_f\inf_{h > 0}\mathbb{E}_f\left[\|f^K_{n,h} - f\|_1\right] = 2.$$

4. "Error" of the kernel density estimator over the class of smooth functions with bounded derivatives

   Proposition 5.2. Let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f$, and let $f^K_{n,h}$ be the kernel density estimator. Let $s \in \mathbb{N}$. Suppose that $K$ satisfies the conditions of Proposition 4.1(c) and that $f$ is $s$-times differentiable. Then:

   (a) For $x \in \mathbb{R}$,
   $$\mathbb{E}\left[|f^K_{n,h}(x) - f(x)|\right] \le \|K\|_2\sqrt{\frac{\|f\|_\infty}{nh}} + \frac{\kappa_s}{s!}\left\|f^{(s)}\right\|_\infty h^s.$$

   (b) If $f^{(s-1)}$ is absolutely continuous, then
   $$\mathbb{E}\left[\|f^K_{n,h} - f\|_2\right] \le \|K\|_2\sqrt{\frac{1}{nh}} + \frac{\kappa_s}{s!}\left\|f^{(s)}\right\|_2 h^s.$$

   Proof. Again we split into bias and variance components. The bias can be bounded by Proposition 4.1(c), since all of its conditions are satisfied.

   (a) We use property (iv) from the proof of Theorem 5.1:
   $$\mathbb{E}\left[\left|f^K_{n,h}(x) - (K_h * f)(x)\right|\right]^2 \le \mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)^2\right] \le \frac{1}{nh}\int_{\mathbb{R}} K^2(u)f(x - uh)\, du \le \frac{\|f\|_\infty}{nh}\|K\|_2^2.$$

   (b) Again we use property (iv), together with Fubini:
   $$\begin{aligned}
   \mathbb{E}\left[\|f^K_{n,h} - (K_h * f)\|_2\right]^2 &\le \mathbb{E}\left[\|f^K_{n,h} - (K_h * f)\|_2^2\right] = \int_{\mathbb{R}}\mathbb{E}\left[\left(f^K_{n,h}(x) - (K_h * f)(x)\right)^2\right] dx \\
   &\le \int_{\mathbb{R}}\frac{1}{nh}\int_{\mathbb{R}} K^2(u)f(x - uh)\, du\, dx = \frac{1}{nh}\int_{\mathbb{R}} K^2(u)\int_{\mathbb{R}} f(x - uh)\, dx\, du = \frac{1}{nh}\|K\|_2^2.
   \end{aligned}$$
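As referenced in item 1 above, the following is a minimal NumPy sketch (not from the notes) of the kernel density estimator $f^K_{n,h}(x) = \frac{1}{n}\sum_i K_h(x - X_i)$ with a uniform kernel; the data-generating density and the bandwidth choices are illustrative assumptions.

```python
import numpy as np

def kde(x_grid, data, h):
    """f^K_{n,h}(x) = (1/n) sum_i K_h(x - X_i), with K = uniform kernel on [-1/2, 1/2]."""
    u = (x_grid[:, None] - data[None, :]) / h          # (x - X_i)/h
    K = (np.abs(u) <= 0.5).astype(float)               # K(u) = 1 on [-1/2, 1/2]
    return K.mean(axis=1) / h

rng = np.random.default_rng(2)
data = rng.normal(size=2000)
x = np.linspace(-4, 4, 200)
true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# The bias-variance trade-off of Proposition 5.2: too small h is noisy,
# too large h oversmooths; h of order n^(-1/5) balances the terms for s = 2.
for h in [0.02, 2000 ** (-1 / 5), 2.0]:
    err = np.mean(np.abs(kde(x, data, h) - true))
    print(f"h = {h:6.3f}   mean abs error on grid = {err:.4f}")
```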


5.2.2 Asymptotic Behaviour of Kernel Density Estimator

1. By Proposition 5.2, the error of the kernel density estimator is bounded by two competing terms, and the bound is optimised when the terms are equal, i.e. when
$$h \sim \left(\frac{\|f\|_\infty\|K\|_2^2}{n\left\|f^{(s)}\right\|_\infty^2}\right)^{1/(2s+1)}.$$
However, we cannot estimate the smoothness $s$ from the data. So, if we want to do inference, we must look at the behaviour of $f^K_{n,h}$ in distribution as $n$ gets large.

2. Asymptotically, the kernel density estimator is normally distributed.

   Proposition 5.3. Let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f$, and let $f^K_{n,h}$ be the kernel density estimator. Suppose that:

   (a) $h$ is chosen so that, as $n \to \infty$, $nh^{2s+1} \to 0$ and $nh \to \infty$;

   (b) $K$ satisfies the conditions of Proposition 4.1(c);

   (c) $f$ is $s$-times differentiable with $f, f^{(s)} \in L^\infty$.

   Then
   $$\sqrt{nh}\left(f^K_{n,h}(x) - f(x)\right) \xrightarrow{d} N\left(0,\ f(x)\|K\|_2^2\right).$$

   Proof. We first do our standard bias-variance decomposition. We use Proposition 4.1(c) to show that the bias term tends to 0, and then the Lindeberg-Feller CLT to determine the distribution of the remaining variance term.

   (a) $\sqrt{nh}\left(f^K_{n,h}(x) - f(x)\right) = \sqrt{nh}\left(f^K_{n,h}(x) - (K_h * f)(x)\right) + \sqrt{nh}\left((K_h * f)(x) - f(x)\right)$.

   (b) By Proposition 4.1(c) and assumption (a), the bias term satisfies
   $$\sqrt{nh}\left|(K_h * f)(x) - f(x)\right| \le \frac{M\kappa_s}{s!}\sqrt{nh^{2s+1}} \to 0.$$

   (c) To use the Lindeberg-Feller CLT, we choose a $Y_{n,i}$, show that $\sum_{i=1}^n(Y_{n,i} - \mathbb{E}[Y_{n,i}]) = \sqrt{nh}\left(f^K_{n,h}(x) - (K_h * f)(x)\right)$, determine its variance (condition (b) of the CLT), and verify condition (a) of the CLT.

   i. Let $Y_{n,i} = \frac{1}{\sqrt{nh}}K\left(\frac{x - X_i}{h}\right)$.

   ii. Its sum, centred about its mean, is
   $$\sum_{i=1}^n\left(\frac{1}{\sqrt{nh}}K\left(\frac{x - X_i}{h}\right) - \mathbb{E}\left[\frac{1}{\sqrt{nh}}K\left(\frac{x - X_i}{h}\right)\right]\right) = \sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n K_h(x - X_i) - \sqrt{nh}\int\frac{1}{h}K\left(\frac{x - y}{h}\right)f(y)\, dy = \sqrt{nh}\left(f^K_{n,h}(x) - (K_h * f)(x)\right).$$

   iii. The variance is $n\operatorname{Var}[Y_{n,i}] = n\mathbb{E}[Y_{n,i}^2] - n\left(\mathbb{E}[Y_{n,i}]\right)^2$.

   A. For the first term, by continuity and dominated convergence, as $h \to 0$:
   $$n\mathbb{E}[Y_{n,i}^2] = \frac{1}{h}\int K^2\left(\frac{x - y}{h}\right)f(y)\, dy = \int K^2(u)f(x - uh)\, du \to f(x)\|K\|_2^2.$$

   B. For the second term, by Cauchy-Schwarz, as $h \to 0$:
   $$n\left(\mathbb{E}[Y_{n,i}]\right)^2 = h\left(\int_{\mathbb{R}} K(u)f(x - uh)\, du\right)^2 \le h\|f\|_\infty^2\|K\|_1^2 \to 0.$$

   C. Therefore $n\operatorname{Var}[Y_{n,i}] \to f(x)\|K\|_2^2$.

   iv. For condition (a) of the Lindeberg-Feller CLT, note that $|Y_{n,i}| \le \frac{1}{\sqrt{nh}}\|K\|_\infty \to 0$ as $n \to \infty$, so for any $\varepsilon > 0$ the indicator $\mathbf{1}[|Y_{n,i}| > \varepsilon]$ is eventually 0.

   (d) The result follows by the Lindeberg-Feller CLT.

3. Lindeberg-Feller Central Limit Theorem

   Theorem 5.2. For each $n$, let $Y_{n,1}, \ldots, Y_{n,n}$ be i.i.d. random variables with finite variance. Suppose that, as $n \to \infty$:

   (a) $n\mathbb{E}\left[Y_{n,i}^2\mathbf{1}[|Y_{n,i}| > \varepsilon]\right] \to 0$ for all $\varepsilon > 0$;

   (b) $n\operatorname{Var}[Y_{n,i}] \to \sigma^2$.

   Then $\sum_{i=1}^n\left(Y_{n,i} - \mathbb{E}[Y_{n,i}]\right) \xrightarrow{d} N(0, \sigma^2)$.

4. We can use the result of Proposition 5.3 to construct pointwise confidence intervals (a sketch follows).
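A minimal NumPy sketch (not from the notes) of the pointwise confidence interval suggested by Proposition 5.3: since $\sqrt{nh}(f^K_{n,h}(x) - f(x)) \approx N(0, f(x)\|K\|_2^2)$, plugging in $f^K_{n,h}(x)$ for $f(x)$ gives an approximate interval. The kernel, the undersmoothed bandwidth and the data below are illustrative assumptions, and the sketch ignores the bias term, which assumption (a) of Proposition 5.3 makes negligible only in the limit.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
data = rng.normal(size=n)
h = n ** (-0.4)                        # undersmoothing, so n*h^(2s+1) -> 0

def kde_at(x0, data, h):
    """Kernel density estimate at a single point, uniform kernel on [-1/2, 1/2]."""
    return np.mean(np.abs((x0 - data) / h) <= 0.5) / h

x0 = 0.5
fhat = kde_at(x0, data, h)
K_2sq = 1.0                            # ||K||_2^2 = int_{-1/2}^{1/2} 1 du = 1
se = np.sqrt(fhat * K_2sq / (n * h))   # asymptotic standard error
z = 1.96                               # approximate 97.5% normal quantile
print(f"f_hat({x0}) = {fhat:.4f},  95% CI = [{fhat - z*se:.4f}, {fhat + z*se:.4f}]")
print(f"true density at {x0}: {np.exp(-x0**2/2)/np.sqrt(2*np.pi):.4f}")
```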

5.3 Histogram Density Estimation

1. A simple estimator of the density divides the real line into bins $I_k = [x_k, x_{k+1})$ and estimates the density $f$ by the proportion of observations in each $I_k$.

   Definition 5.1. The histogram density estimator is
   $$f^H(x) = \sum_{k \in \mathbb{Z}}\left(\frac{1}{n(x_{k+1} - x_k)}\sum_i\mathbf{1}[X_i \in I_k]\right)\mathbf{1}[x \in I_k].$$

2. We can use dyadic break points of resolution $j$, giving instead the estimator
   $$f^H_{n,j}(x) = \sum_{k \in \mathbb{Z}}\left(2^j\,\frac{1}{n}\sum_{i=1}^n\mathbf{1}\left[X_i \in [2^{-j}k, 2^{-j}(k+1))\right]\right)\mathbf{1}\left[x \in [2^{-j}k, 2^{-j}(k+1))\right] = \sum_k\left(2^{j/2}\,\frac{1}{n}\sum_i\mathbf{1}\left[X_i \in [2^{-j}k, 2^{-j}(k+1))\right]\right)2^{j/2}\,\mathbf{1}\left[x \in [2^{-j}k, 2^{-j}(k+1))\right].$$

3. Notice that the histogram density estimator with dyadic break points is the Haar approximation of $f$ with the coefficients $c_k$ estimated by
   $$\hat{c}_k = 2^{j/2}\,\frac{1}{n}\sum_i\mathbf{1}\left[X_i \in [2^{-j}k, 2^{-j}(k+1))\right].$$
   Therefore its expectation is the Haar approximation of the function $f$.

4. "Error" of the histogram density estimator (a sketch of the estimator follows this list).

   Proposition 5.4. Let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f$, and let $f^H_{n,j}$ be the histogram estimator. Suppose $f$ is once differentiable. Then:

   (a) For $x \in \mathbb{R}$,
   $$\mathbb{E}\left[|f(x) - f^H_{n,j}(x)|\right] \le \sqrt{\frac{\|f\|_\infty}{n2^{-j}}} + \frac{1}{2}\|f'\|_\infty 2^{-j}.$$

   (b) If $f$ is absolutely continuous, then
   $$\mathbb{E}\left[\|f - f^H_{n,j}\|_2\right] \le \sqrt{\frac{1}{n2^{-j}}} + \frac{1}{2}\|f'\|_2\, 2^{-j}.$$
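As referenced in item 4 above, a minimal NumPy sketch (not from the notes) of the dyadic histogram estimator $f^H_{n,j}$, i.e. the Haar approximation with estimated coefficients $\hat{c}_k = 2^{j/2} n^{-1}\sum_i\mathbf{1}[X_i \in [2^{-j}k, 2^{-j}(k+1))]$; the data-generating density is an illustrative assumption.

```python
import numpy as np
from math import gamma

def dyadic_histogram(x_grid, data, j):
    """f^H_{n,j}(x): proportion of data in the dyadic cell containing x, times 2^j."""
    cell_of_data = np.floor(data * 2**j).astype(int)
    cell_of_x = np.floor(x_grid * 2**j).astype(int)
    counts = {}
    for c in cell_of_data:
        counts[c] = counts.get(c, 0) + 1
    n = len(data)
    return np.array([2**j * counts.get(c, 0) / n for c in cell_of_x])

rng = np.random.default_rng(4)
data = rng.beta(2, 5, size=4000)                 # a density supported on [0, 1]
x = np.linspace(0.01, 0.99, 300)
true = gamma(7) / (gamma(2) * gamma(5)) * x * (1 - x) ** 4

for j in [2, 4, 6, 8]:                           # resolution level: bin width 2^-j
    err = np.mean(np.abs(dyadic_histogram(x, data, j) - true))
    print(f"j = {j}   mean abs error = {err:.3f}")
```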

5.4 Wavelet Density Estimation

1. Motivation: as in histogram density estimation, we can use smoother wavelets to approximate smoother functions. To do so, we must estimate the coefficients of $W_j(f)(x) = \sum_{k \in \mathbb{Z}} a_{j,k}\varphi_{j,k}(x)$.

   Definition 5.2. Noting that $a_{j,k} = \langle f, \varphi_{j,k}\rangle = \mathbb{E}[\varphi_{j,k}(X)]$, the estimators of the coefficients are $\hat{a}_{j,k} = \frac{1}{n}\sum_{i=1}^n\varphi_{j,k}(X_i)$, and the wavelet density estimator is
   $$f^W_{n,j}(x) = \sum_{k \in \mathbb{Z}}\hat{a}_{j,k}\varphi_{j,k}(x).$$

2. "Error" of the wavelet density estimator

   Proposition 5.5. Let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f$, and let $f^W_{n,j}$ be the wavelet density estimator. Let $s \in \mathbb{N}$. Suppose $f$ is $s$-times differentiable and the wavelet basis satisfies the conditions of Proposition 4.4.

   (a) For $x \in \mathbb{R}$,
   $$\mathbb{E}\left[|f^W_{n,j}(x) - f(x)|\right] \le \kappa_s\left[\sqrt{\frac{\|f\|_\infty}{n2^{-j}}} + \frac{1}{2}\left\|f^{(s)}\right\|_\infty 2^{-js}\right].$$

   (b) If $f^{(s-1)}$ is absolutely continuous, then
   $$\mathbb{E}\left[\|f^W_{n,j} - f\|_2\right] \le \kappa_s\left[\sqrt{\frac{1}{n2^{-j}}} + \frac{1}{2}\left\|f^{(s)}\right\|_2\, 2^{-js}\right].$$

   Proof. Again we split into variance and bias. The bias term $W_j(f)(x) - f(x)$ is bounded by Proposition 4.4(c).

   (a) Notice that $f^W_{n,j}(x) = \frac{1}{n}\sum_i Z_i$ with $Z_i = 2^j K(2^j x, 2^j X_i)$, where
   $$K(x, y) = \sum_{k \in \mathbb{Z}}\varphi(x - k)\varphi(y - k).$$

   (b) Then, since the $Z_i$ are i.i.d. and $K$ is bounded with compact support,
   $$\begin{aligned}
   \mathbb{E}\left[\left(f^W_{n,j}(x) - W_j(f)(x)\right)^2\right] &= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n Z_i\right) \le \frac{1}{n}\mathbb{E}[Z^2] = \frac{1}{n}\int_{\mathbb{R}} 2^{2j}K^2(2^j x, 2^j y)f(y)\, dy \\
   &= \frac{2^j}{n}\int_{\mathbb{R}} K^2(2^j x, 2^j x - u)f(x - 2^{-j}u)\, du \le \frac{C\|f\|_\infty}{n2^{-j}},
   \end{aligned}$$
   where the constant $C$ depends only on $K$.

   (c) For the $L^2$ norm we integrate the pointwise variance bound over $x$; the resulting double integral is also bounded by a constant times $\frac{1}{n2^{-j}}$, since $K$ is bounded and non-zero only on a compact set.

6 Nonparametric Regression

6.1 Introduction

In typical regression we are given pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ and asked to estimate the relationship between $X_i$ and $Y_i$. There are two approaches, of which we mainly consider fixed design.

1. Fixed design: we assume the design points $X_i$ are fixed and known in advance, denoted $x_i$. We want to estimate the function $m$ in $Y_i = m(x_i) + \varepsilon_i$, where the $\varepsilon_i$ are independent, zero-mean random variables.

2. Random design: we assume the pairs $(X_i, Y_i)$ are i.i.d. and want to estimate the mean function $m(x) = \mathbb{E}[Y \mid X = x]$.

6.2 Kernel Estimation

1. One way of estimating $m$ is by local averaging (e.g. kernel convolution) of the response variables around a given $x$. Formally, this is the Nadaraya-Watson estimator (a sketch appears at the end of this subsection).

2. Consider the form of the mean function in the random design case,
$$m(x) = \frac{\int y f(x, y)\, dy}{\int f(x, y)\, dy}.$$
A natural estimator of $m$, the Nadaraya-Watson estimator, is
$$m^K_{n,h}(x) = \frac{\sum_i K_h(x - x_i)Y_i}{\sum_i K_h(x - x_i)},$$
where we set the estimator to 0 if the denominator is 0.

3. "Error" of the Nadaraya-Watson estimator.

   Theorem 6.1. Let $Y_1, \ldots, Y_n$ be independent random variables with $\mathbb{E}[Y_i] = m(x_i)$ and variance $v(x_i)$, where $x_i = \frac{i}{n}$ for $i = 1, \ldots, n$. Let $m^K_{n,h}$ be the Nadaraya-Watson estimator. Suppose:

   (a) $h \to 0$ as $n \to \infty$;

   (b) $K$ satisfies the conditions of Proposition 4.1(c), is absolutely continuous and differentiable, and $K' \in L^1$;

   (c) $m$ is $s$-times differentiable;

   (d) the $v(x_i)$ are all bounded by $\sigma^2 > 0$.

   Then for $x \in (0, 1)$,
   $$\mathbb{E}\left[|m^K_{n,h}(x) - m(x)|\right] \le \frac{\sigma\|K\|_2}{\sqrt{nh}} + \frac{\kappa_s}{s!}\left\|m^{(s)}\right\|_\infty h^s + O\left(\frac{1}{nh}\right).$$

   Proof. Denote the numerator of the estimator by $\hat{g}_{n,h}(x) = \frac{1}{n}\sum_i K_h(x - x_i)Y_i$ and the denominator by $\hat{f}_{n,h}(x) = \frac{1}{n}\sum_i K_h(x - x_i)$, so that $m^K_{n,h} = \hat{g}_{n,h}/\hat{f}_{n,h}$. We decompose the error into three terms: variance, bias, and control of the denominator. The key to many of the bounds is first extending the sum to an integral over $[0,1]$ using Riemann sums, incurring an $O((nh)^{-1})$ error, and then extending the integral from $[0,1]$ to all of $\mathbb{R}$, incurring another small error.

   (a) $|m^K_{n,h}(x) - m(x)| \le |m^K_{n,h}(x) - \hat{g}_{n,h}(x)| + |\hat{g}_{n,h}(x) - \mathbb{E}[\hat{g}_{n,h}(x)]| + |\mathbb{E}[\hat{g}_{n,h}(x)] - m(x)|$.

   (b) We first look at $\hat{g}_{n,h}$ and $\hat{f}_{n,h}$ individually:
   $$\begin{aligned}
   \mathbb{E}[\hat{g}_{n,h}(x)] &= \frac{1}{n}\sum_{i=1}^n K_h(x - x_i)\mathbb{E}[Y_i] = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i)m(x_i) \\
   &= \int_0^1 h^{-1}K\left(\frac{x - y}{h}\right)m(y)\, dy + O\left(\frac{1}{nh}\right) = \int_{(x-1)/h}^{x/h} K(u)m(x - uh)\, du + O\left(\frac{1}{nh}\right) \\
   &= \int_{-\infty}^\infty K(u)m(x - uh)\, du + O\left(\frac{1}{nh}\right) \quad \text{for } h \text{ small enough, since } x \in (0,1) \\
   &= (K_h * m)(x) + O\left(\frac{1}{nh}\right).
   \end{aligned}$$
   $$\begin{aligned}
   \operatorname{Var}[\hat{g}_{n,h}(x)] &= \mathbb{E}\left[\left(\frac{1}{n}\sum_i K_h(x - x_i)(Y_i - m(x_i))\right)^2\right] \\
   &= \frac{1}{n^2}\sum_i K_h^2(x - x_i)\operatorname{Var}[Y_i] \quad \text{by independence} \\
   &\le \frac{\sigma^2}{(nh)^2}\sum_i K^2\left(\frac{x - x_i}{h}\right) = \frac{\sigma^2}{nh}\int_0^1\frac{1}{h}K^2\left(\frac{x - y}{h}\right) dy + O\left(\frac{1}{(nh)^2}\right) \\
   &= \frac{\sigma^2}{nh}\int_{\mathbb{R}} K^2(u)\, du + O((nh)^{-2}) = \frac{\sigma^2}{nh}\|K\|_2^2 + O((nh)^{-2}).
   \end{aligned}$$
   $$\hat{f}_{n,h}(x) = \frac{1}{n}\sum_i K_h(x - x_i) = \int_0^1 K_h(x - y)\, dy + O((nh)^{-1}) = \int_{\mathbb{R}} K(u)\, du + O((nh)^{-1}) = 1 + O((nh)^{-1}).$$

   (c) Now we look at each term:

   i. Bias: $|\mathbb{E}[\hat{g}_{n,h}(x)] - m(x)| \le |(K_h * m)(x) - m(x)| + O((nh)^{-1})$, and the first term is bounded by Proposition 4.1(c).

   ii. Variance: by Jensen, $\mathbb{E}\left[|\hat{g}_{n,h}(x) - \mathbb{E}[\hat{g}_{n,h}(x)]|\right] \le \sqrt{\operatorname{Var}[\hat{g}_{n,h}(x)]} \le \frac{\sigma}{\sqrt{nh}}\|K\|_2 + O((nh)^{-1})$.

   iii. Denominator: since $m$ is continuous on a compact set, it is bounded; and we have bounded the other terms above, so
   $$\begin{aligned}
   \mathbb{E}\left[|m^K_{n,h}(x) - \hat{g}_{n,h}(x)|\right] &= \mathbb{E}\left[\left|\hat{g}_{n,h}(x)\,\frac{1 - \hat{f}_{n,h}(x)}{\hat{f}_{n,h}(x)}\right|\right] = O((nh)^{-1})\,\mathbb{E}\left[|\hat{g}_{n,h}(x)|\right] \\
   &\le O((nh)^{-1})\left(\mathbb{E}\left[|\hat{g}_{n,h}(x) - \mathbb{E}[\hat{g}_{n,h}(x)]|\right] + |\mathbb{E}[\hat{g}_{n,h}(x)] - m(x)| + |m(x)|\right) = O((nh)^{-1}).
   \end{aligned}$$

4. If we choose a bandwidth $h \sim Cn^{-1/(2s+1)}$, we attain the optimal convergence rate $n^{-s/(2s+1)}$, just as for kernel density estimation.

5. The estimator does not perform well at the boundaries. This can be fixed within the Nadaraya-Watson framework, but we will instead consider local polynomials.
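As referenced in item 1 above, a minimal NumPy sketch (not from the notes) of the Nadaraya-Watson estimator on a fixed design $x_i = i/n$; the regression function, noise level and bandwidth are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x_grid, x_design, y, h):
    """m^K_{n,h}(x) = sum_i K_h(x - x_i) Y_i / sum_i K_h(x - x_i), uniform kernel."""
    K = (np.abs((x_grid[:, None] - x_design[None, :]) / h) <= 0.5).astype(float)
    num = K @ y
    den = K.sum(axis=1)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den > 0)   # estimator set to 0 when den = 0
    return out

rng = np.random.default_rng(5)
n = 500
x_design = np.arange(1, n + 1) / n                # fixed design x_i = i/n
m = lambda t: np.sin(2 * np.pi * t)
y = m(x_design) + 0.3 * rng.normal(size=n)

x = np.linspace(0.05, 0.95, 200)
h = n ** (-1 / 5)                                  # rate-optimal order for s = 2
err = np.mean(np.abs(nadaraya_watson(x, x_design, y, h) - m(x)))
print(f"h = {h:.3f}, mean abs error away from the boundary = {err:.3f}")
```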

6.3 Local Polynomials

1. Framework: let $s \in \mathbb{N}$ and $Y = m(x) + \varepsilon$, where $m$ is $s$-times differentiable.

   (a) Let
   $$U(t) = \left[1, t, \ldots, \frac{t^{s-1}}{(s-1)!}\right]^T, \qquad M(x) = \left[m(x), hm'(x), \ldots, h^{s-1}m^{(s-1)}(x)\right]^T.$$

   (b) By Taylor's theorem, we can compute $m(y)$ for $y$ in a neighbourhood of $x$ via
   $$m(y) = U^T\left(\frac{y - x}{h}\right)M(x) + O\left(\left\|m^{(s)}\right\|_\infty h^s\right).$$

2. Estimator: for $K \in L^1$ non-negative with $\int K = 1$, and fixed design data $(x_i, Y_i)$:

   (a) We begin by estimating $M(x)$ by
   $$\hat{M}(x) \in \arg\min_{M \in \mathbb{R}^s}\sum_{i=1}^n K\left(\frac{x_i - x}{h}\right)\left[Y_i - U^T\left(\frac{x_i - x}{h}\right)M\right]^2.$$

   (b) Local polynomial estimator of the mean function:

   Definition 6.1. Given $\hat{M}(x)$ and $U(t)$ as defined above, and $s \in \mathbb{N}$, the local polynomial estimator of order $s - 1$ of the mean function $m(x)$ is
   $$m^P_{n,h}(x) = U^T(0)\hat{M}(x).$$
   When $s > 1$, we can estimate the other derivatives as well.

   Remark 6.1. When $s = 1$, this is the Nadaraya-Watson estimator with kernel $K$.

3. We can rewrite the estimator (see the example sheet) as
$$m^P_{n,h}(x) = \sum_{i=1}^n W_{n,i}(x)Y_i,$$
where
$$W_{n,i}(x) = \frac{1}{nh}K\left(\frac{x_i - x}{h}\right)U^T(0)B^{-1}U\left(\frac{x_i - x}{h}\right), \qquad B = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x_i - x}{h}\right)U\left(\frac{x_i - x}{h}\right)U^T\left(\frac{x_i - x}{h}\right).$$

   (a) Note that $B$ must be invertible. This is asymptotically guaranteed when the design is uniformly spaced; in fact $B$ will then be positive definite.

   (b) For $s > 1$, the weights $W_{n,i}$ can be thought of as adapting to the boundaries, removing any boundary difficulties.

4. "Error" of the local polynomial estimator (a sketch of the estimator follows this list).

   Proposition 6.1. Let $Y_1, Y_2, \ldots, Y_n$ be independent random variables, with $Y_i$ having mean $m(x_i)$ and variance $v(x_i)$ for $x_i = \frac{i}{n}$. Let $m^P_{n,h}$ be the local polynomial estimator of order $S - 1$ for $S \in \mathbb{N}$. Suppose that:

   (a) the smallest eigenvalue of $B$ is $\lambda > 0$;

   (b) $K$ is bounded and compactly supported;

   (c) $m$ is $s$-times differentiable for some $s \le S$;

   (d) the $v(x_i)$ are bounded by $\sigma^2 > 0$.

   Then for $x \in [0, 1]$ and $nh \ge 1$, we have
   $$\mathbb{E}\left[|m^P_{n,h}(x) - m(x)|\right] \le \kappa_s\left(\frac{\sigma}{\sqrt{nh}} + \left\|m^{(s)}\right\|_\infty h^s\right).$$

   Proof. We again decompose into bias and variance, and bound each component separately:
   $$|m^P_{n,h}(x) - m(x)| \le |m^P_{n,h}(x) - \mathbb{E}[m^P_{n,h}(x)]| + |\mathbb{E}[m^P_{n,h}(x)] - m(x)|.$$

   (a) First we need some properties of the weights $W_{n,i}(x)$.

   i. (Second example sheet.) For $r = 0, 1, \ldots, s-1$,
   $$\sum_i W_{n,i}(x)(x_i - x)^r = \mathbf{1}[r = 0].$$

   ii. Since $\operatorname{supp}(K) \subset [-R, R]$,
   $$|W_{n,i}(x)| \le \frac{1}{nh}\left\|B^{-1}\right\|\left\|U\left(\frac{x_i - x}{h}\right)\right\|\|K\|_\infty\mathbf{1}\left[\left|\frac{x_i - x}{h}\right| \le R\right] \le \frac{1}{nh\lambda}\left\|U\left(\frac{x_i - x}{h}\right)\right\|\|K\|_\infty\mathbf{1}\left[\left|\frac{x_i - x}{h}\right| \le R\right] \le \frac{C}{nh}\mathbf{1}[|x_i - x| \le Rh].$$
   By assumption $nh \ge 1$, so
   $$\sum_i|W_{n,i}(x)| \le \frac{C}{nh}\sum_i\mathbf{1}[|x_i - x| \le Rh] \le C'.$$

   (b) For the bias term we use a Taylor expansion together with property (i), which removes all terms $(x_i - x)^r$ for $r = 1, \ldots, s-1$; then we use property (ii) and $nh \ge 1$:
   $$\begin{aligned}
   |\mathbb{E}[m^P_{n,h}(x)] - m(x)| &= \left|\sum_i W_{n,i}(x)m(x_i) - m(x)\right| = \left|\sum_i W_{n,i}(x)m(x_i) - \left[\sum_i W_{n,i}(x)\right]m(x)\right| \\
   &= \left|\sum_i W_{n,i}(x)\left[m(x_i) - m(x)\right]\right| \le \sum_i|W_{n,i}(x)|\,\frac{\left\|m^{(s)}\right\|_\infty}{s!}|x_i - x|^s \\
   &\le \frac{C}{nh}\,\frac{\left\|m^{(s)}\right\|_\infty}{s!}\sum_i\mathbf{1}[|x_i - x| \le Rh]\,|x_i - x|^s \le C''\left\|m^{(s)}\right\|_\infty h^s.
   \end{aligned}$$

   (c) For the variance term we use Jensen's inequality and bound the variance of $m^P_{n,h}(x)$:
   $$\begin{aligned}
   \mathbb{E}\left[\left(m^P_{n,h}(x) - \mathbb{E}[m^P_{n,h}(x)]\right)^2\right] &= \mathbb{E}\left[\left(\sum_i W_{n,i}(x)\left[Y_i - m(x_i)\right]\right)^2\right] = \sum_i W_{n,i}^2(x)\operatorname{Var}[Y_i] \quad \text{by independence} \\
   &\le \sigma^2\sum_i W_{n,i}^2(x) \le \sigma^2\max_i|W_{n,i}(x)|\sum_i|W_{n,i}(x)| \le \sigma^2\,\frac{C}{nh}\,C'.
   \end{aligned}$$

5. Local polynomials have the same convergence rates as Nadaraya-Watson estimators, but the bound is uniform over $[0, 1]$; therefore we can also prove $L^p$ convergence results.
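As referenced in item 4 above, a minimal NumPy sketch (not from the notes) of a local polynomial estimator (here local linear, $s = 2$), solving the weighted least-squares problem of Definition 6.1 at each point; the design, kernel and bandwidth are illustrative assumptions.

```python
import numpy as np

def local_linear(x_grid, x_design, y, h):
    """Local polynomial of order 1: m^P_{n,h}(x) = U(0)^T M_hat(x), with U(t) = (1, t)."""
    est = np.empty(len(x_grid))
    for j, x0 in enumerate(x_grid):
        t = (x_design - x0) / h
        w = (np.abs(t) <= 0.5).astype(float)        # uniform kernel weights K(t)
        U = np.column_stack([np.ones_like(t), t])   # local design in rescaled coordinates
        A = U.T @ (w[:, None] * U)                  # proportional to the matrix B
        b = U.T @ (w * y)
        M_hat = np.linalg.solve(A, b)               # weighted least squares
        est[j] = M_hat[0]                           # U(0)^T M_hat = first coordinate
    return est

rng = np.random.default_rng(6)
n = 500
x_design = np.arange(1, n + 1) / n
m = lambda t: np.sin(2 * np.pi * t)
y = m(x_design) + 0.3 * rng.normal(size=n)

x = np.linspace(0.0, 1.0, 201)                      # including the boundary
fit = local_linear(x, x_design, y, h=n ** (-1 / 5))
print(f"max abs error on [0, 1]: {np.max(np.abs(fit - m(x))):.3f}")
```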

6.4 Smoothing Splines

1. Smoothing spline estimator

   Definition 6.2. The smoothing spline estimator is
   $$m^S_{n,\lambda} \in \arg\min_m\left[\sum_i(Y_i - m(x_i))^2\right] + \lambda\left\|m^{(s)}\right\|_2^2.$$

2. General splines:

   (a) A spline $g$ of order $2s - 1$ on $[0, 1]$ with breakpoints $0 < x_1 < x_2 < \cdots < x_n < 1$ is given by a polynomial of degree $2s - 1$ on each interval $(0, x_1), \ldots, (x_n, 1)$, and is $2s - 2$ times continuously differentiable across the breakpoints.

   (b) A spline $g$ is called natural if $g^{(r)}(0) = g^{(r)}(1) = 0$ for $r = s, \ldots, 2s - 1$.

3. Properties of the spline estimator:

   (a) (Example sheet.) $m^S_{n,\lambda}$ is a natural spline of order $2s - 1$.

   (b) Given B-splines $N_k^{2s-1}$ for $k = 1, \ldots, n + 2s$, we can rewrite the spline estimator as
   $$m^S_{n,\lambda}(x) = \sum_{k=1}^{n+2s} c_k N_k^{2s-1}(x).$$

   (c) To estimate the constants $c_k$, we can reformulate the original minimisation problem as follows (a sketch in code follows):

   i. Let $N$ be the $n \times (n + 2s)$ matrix with entries $N_{i,k} = N_k^{2s-1}(x_i)$.

   ii. Let $C$ be the vector of unknown coefficients $c_k$.

   iii. Then
   $$\hat{C} = \arg\min_C\,(Y - NC)^T(Y - NC) + \lambda C^T\Omega C, \qquad \Omega_{k,l} = \int_{\mathbb{R}}\left(N_k^{2s-1}\right)^{(s)}(x)\left(N_l^{2s-1}\right)^{(s)}(x)\, dx.$$
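As referenced in item 3(c) above, the penalised least-squares problem has the closed form $\hat{C} = (N^T N + \lambda\Omega)^{-1}N^T Y$ whenever that matrix is invertible. The following is a minimal NumPy sketch (not from the notes) of this linear-algebra step only; the basis matrix $N$ and penalty matrix $\Omega$ below are small hand-made placeholders rather than true B-spline quantities.

```python
import numpy as np

def penalised_ls(N, Y, Omega, lam):
    """Minimise (Y - N C)^T (Y - N C) + lam * C^T Omega C via the normal equations."""
    A = N.T @ N + lam * Omega
    return np.linalg.solve(A, N.T @ Y)

# Tiny illustrative placeholders (NOT real B-spline evaluations):
rng = np.random.default_rng(7)
n, p = 40, 8
N = rng.random((n, p))                       # stands in for N_{i,k} = N_k^{2s-1}(x_i)
D = np.diff(np.eye(p), 2, axis=0)            # second-difference matrix
Omega = D.T @ D                              # crude stand-in for the penalty integrals
Y = rng.normal(size=n)

for lam in [0.0, 1.0, 100.0]:
    C = penalised_ls(N, Y, Omega, lam)
    print(f"lambda = {lam:7.1f}   roughness C^T Omega C = {C @ Omega @ C:.4f}")
# Larger lambda shrinks the roughness term, trading data fit for smoothness.
```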


6.5 Wavelet Regression

1. Recall that for all $x \in \mathbb{R}$ we can approximate a function $m$ using father and mother wavelets:
$$W_j(m)(x) = \sum_{k \in \mathbb{Z}} a_k\varphi_{0,k}(x) + \sum_{l=0}^{j-1}\sum_{k \in \mathbb{Z}} b_{l,k}\psi_{l,k}(x).$$

2. In estimation we focus on $x \in [0, 1]$. A simplification is to assume $m = 0$ on $[0,1]^C$. We then approximate the coefficients as follows:
$$\hat{a}_k = \frac{1}{n}\sum_{i=1}^n\varphi_{0,k}(x_i)Y_i \approx \frac{1}{n}\sum_{i=1}^n\varphi_{0,k}(x_i)m(x_i) \approx \int_0^1\varphi_{0,k}(x)m(x)\, dx = \int_{\mathbb{R}}\varphi_{0,k}(x)m(x)\, dx = a_k,$$
since $m = 0$ outside $[0, 1]$; and similarly
$$\hat{b}_{l,k} = \frac{1}{n}\sum_{i=1}^n\psi_{l,k}(x_i)Y_i \approx \frac{1}{n}\sum_{i=1}^n\psi_{l,k}(x_i)m(x_i) \approx \int_0^1\psi_{l,k}(x)m(x)\, dx = \int_{\mathbb{R}}\psi_{l,k}(x)m(x)\, dx = b_{l,k}.$$

3. This approximation scheme is similar to the Nadaraya-Watson estimator. It has the same boundary problems, which can be fixed using $L^2[0,1]$ wavelet bases such as the Cohen-Daubechies-Vial basis.

7 Parameter Selection

7.1 Introduction

1. Motivation: the methods of the previous sections required choosing parameters such as the bandwidth $h$ or the resolution $j$. Optimal choices depend on unknown quantities, specifically the smoothness of the function being estimated. So how do we choose the parameter correctly? Additionally, if the smoothness of the function changes from region to region, how can we allow our smoothing parameter to vary in space?

2. Solutions:

   (a) Practical approach: implement procedures which have intuitive appeal but no theoretical guarantees. These techniques tend to perform well in practice.

   (b) Adaptation: obtain parameters whose asymptotic performance is as good as if we knew the smoothness a priori. Adaptation can be used to select global or local parameters, but may not improve practical performance.

7.2 Global Bandwidth Choices

We consider methods which give a global parameter choice, and we discuss them in the context of kernel density estimation, although they are easily applied to other methods.

1. Plug-in methods: guess a value for the smoothness $s$ and determine $h_0$ from this value. Calculate the estimator $f^K_{n,h_0}$, plug it back into the optimal bandwidth equation to get $h_1$, and continue iterating until $h_m$ converges.

   (a) This method depends heavily on the starting guess $s$.

   (b) Theoretical results often assume that $f$ has smoothness larger than the initial guess $s$, and so $h$ will always be sub-optimal.

2. Cross-validation: let $f^{K,-i}_{n,h}$ denote the kernel density estimator computed from the observations after leaving out the $i$th data point. Then for density or regression estimation respectively, we determine $h$ by minimising (a code sketch follows this list)
$$\int_{\mathbb{R}}\left[f^K_{n,h}(x)\right]^2 dx - \frac{2}{n}\sum_{i=1}^n f^{K,-i}_{n,h}(X_i) \qquad \text{or} \qquad \frac{1}{n}\sum_{i=1}^n\left[m^{K,-i}_{n,h}(x_i) - Y_i\right]^2.$$

   Derivation 7.1. Consider the $L^2$ error, which we can approximately estimate as follows:
   $$\int_{\mathbb{R}}\left[f^K_{n,h} - f\right]^2 dx = \int_{\mathbb{R}}\left[f^K_{n,h}(x)\right]^2 dx - 2\int_{\mathbb{R}} f^K_{n,h}(x)f(x)\, dx + C \approx \int_{\mathbb{R}}\left[f^K_{n,h}(x)\right]^2 dx - \frac{2}{n}\sum_{i=1}^n f^K_{n,h}(X_i) + C.$$
   We can ignore $C$ since it does not depend on $h$; and because $f^K_{n,h}$ is built from the data points, we cannot estimate its error using the same data, so we replace the last sum by one over the leave-one-out estimates $f^{K,-i}_{n,h}(X_i)$.

   (a) Generalisations of this method often include weights to account for boundary problems.

   (b) In general, the selected bandwidth will only be good for $L^2$, or for whatever loss function is being estimated.

3. Model selection: given an appropriate penalty function $\lambda$, we determine $h$ by minimising
$$\frac{1}{n}\sum_{i=1}^n\left(m^K_{n,h}(x_i) - Y_i\right)^2 + \lambda(h).$$

   (a) $\lambda$ should be chosen to penalise small values of $h$, which would otherwise result in a simple interpolation of the data in the regression case.

   (b) This method is again tied to the choice of loss function, and only performs well in the context of that specific loss function.

7.3 Lepski’s Method

1. Motivation: we need parameter selection methods which are good over all error metrics.

2. Lepski's method (for kernel density estimation); a code sketch follows at the end of this subsection.

   (a) Start with a finite collection of bandwidths $\mathcal{H} = \{h_0 > h_1 > \cdots > h_k\}$.

   (b) We want to compare the biases of the estimators $f^K_{n,h}$, $h \in \mathcal{H}$, and select the one with the smallest bias. We do this by starting with $f^K_{n,h_0}$ and comparing it with $f^K_{n,h_i}$ for $i > 0$, since the smaller the smoothing parameter, the smaller the bias.

   (c) If the difference between the estimators with $h_0$ and $h_i$ is significant (greater than some threshold), then we reject $h_0$ in favour of the $h_i$ that produced the significant difference.

   (d) We repeat this process for each $h_i$, comparing with $h_j$ for $j > i$, until we find the largest bandwidth for which no significant difference exists between this bandwidth and any smaller bandwidth. We use this bandwidth for Lepski's estimator $f^L_n$.

   (e) Summary: given a threshold function $\tau(h)$, we take
   $$\hat{h}_n = \max\left\{h \in \mathcal{H} \,\middle|\, \left|f^K_{n,h}(x) - f^K_{n,h'}(x)\right| \le \tau(h') \ \ \forall h' \in \mathcal{H} \text{ s.t. } h' < h\right\}.$$

3. Rate of convergence of Lepski's estimator

   Theorem 7.1. Let $X_1, \ldots, X_n$ be i.i.d. real-valued random variables with a bounded density $f$, and let $f^L_n$ be the Lepski estimator. Suppose that:

   (a) $K$ satisfies the conditions of Proposition 4.1(c) with order $S \in \mathbb{N}$;

   (b) the set $\mathcal{H}$ has bandwidths $h_0 \sim n^{-1/(2S+1)}$, $h_i = \alpha^i h_0$ for some $\alpha \in (0, 1)$, and $h_k \sim \left(\frac{n}{\log n}\right)^{-1}$;

   (c) $\tau(h) = \kappa\sqrt{\log(n)/(nh)}$ for some $\kappa > 0$ depending on $\|f\|_\infty$ and $K$;

   (d) $f$ is $s$-times differentiable for some $s \le S$, with $f^{(s)} \in L^\infty$.

   Then
   $$\mathbb{E}\left[|f^L_n(x) - f(x)|\right] = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right).$$

   Proof. We compare the performance of $\hat{h}_n$ to a nearly-optimal $h^*_n \in \mathcal{H}$, and study the behaviour of $\hat{h}_n$ when it is greater than and when it is less than this nearly-optimal choice.

   (a) Define
   $$h^*_n = \max\left\{h \in \mathcal{H} \,\middle|\, |(K_{h'} * f)(x) - f(x)| \le \tfrac{1}{4}\tau(h') \ \ \forall h' \in \mathcal{H} \text{ s.t. } h' \le h\right\}.$$
   By Proposition 4.1(c), we know that $|(K_h * f)(x) - f(x)| = O(h^s)$. So $h^*_n$ is well defined, and we can solve for its order by setting $O([h^*_n]^s) = \tau(h^*_n)$, yielding
   $$\frac{1}{h^*_n} = O\left(\left[\frac{n}{\log n}\right]^{1/(2s+1)}\right).$$

   (b) When $\hat{h}_n > h^*_n$: by the definition of $\hat{h}_n$, then Proposition 5.2(a), and then the definition of $h^*_n$,
   $$\begin{aligned}
   \mathbb{E}\left[|f^L_n(x) - f(x)|\,\mathbf{1}[\hat{h}_n > h^*_n]\right] &\le \mathbb{E}\left[\left|f^L_n(x) - f^K_{n,h^*_n}(x)\right|\right] + \mathbb{E}\left[\left|f^K_{n,h^*_n}(x) - (K_{h^*_n} * f)(x)\right|\right] + \left|(K_{h^*_n} * f)(x) - f(x)\right| \\
   &\le \tau(h^*_n) + \|K\|_2\sqrt{\frac{\|f\|_\infty}{nh^*_n}} + \frac{1}{4}\tau(h^*_n) = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right).
   \end{aligned}$$

   (c) When $\hat{h}_n < h^*_n$: by assumption (b), $\hat{h}_n = Ch^*_n$ for some $C \in (0, 1)$, and therefore
   $$\frac{1}{\hat{h}_n} = O\left(\left[\frac{n}{\log n}\right]^{1/(2s+1)}\right).$$
   Therefore, since $f^L_n = f^K_{n,\hat{h}_n}$ by definition, and by the definition of $h^*_n$ (which applies to $\hat{h}_n \le h^*_n$) and then Proposition 5.2(a),
   $$\begin{aligned}
   \mathbb{E}\left[\left|f^K_{n,\hat{h}_n}(x) - f(x)\right|\,\mathbf{1}[\hat{h}_n < h^*_n]\right] &\le \mathbb{E}\left[\left|f^K_{n,\hat{h}_n}(x) - (K_{\hat{h}_n} * f)(x)\right|\right] + \left|(K_{\hat{h}_n} * f)(x) - f(x)\right| \\
   &\le \|K\|_2\sqrt{\frac{\|f\|_\infty}{n\hat{h}_n}} + \frac{1}{4}\tau(\hat{h}_n) = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right).
   \end{aligned}$$

4. Similar results can be proven for other loss metrics, and for other regression and density estimators.

5. The results hold locally: the pointwise rate of convergence holds if $f$ is $s$-times differentiable in a neighbourhood of the point.

6. Choosing $\kappa$ depends on $\|f\|_\infty$, which we can estimate using $\|f^K_{n,h}\|_\infty$ for an $h$ that gives a consistent estimate of $f$.
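As referenced in item 2 above, a minimal NumPy sketch (not from the notes) of Lepski's bandwidth selection at a single point $x$: keep the largest bandwidth whose estimate differs from every smaller-bandwidth estimate by at most the threshold $\tau(h') = \kappa\sqrt{\log(n)/(nh')}$. The kernel, the geometric bandwidth grid and the value $\kappa = 1$ are illustrative assumptions.

```python
import numpy as np

def kde_at(x0, data, h):
    """Uniform-kernel density estimate at a point."""
    return np.mean(np.abs((x0 - data) / h) <= 0.5) / h

def lepski_bandwidth(x0, data, bandwidths, kappa=1.0):
    """Largest h such that |f_h(x0) - f_h'(x0)| <= tau(h') for all smaller h'."""
    n = len(data)
    hs = np.sort(np.asarray(bandwidths))[::-1]          # h_0 > h_1 > ... > h_k
    est = {h: kde_at(x0, data, h) for h in hs}
    tau = {h: kappa * np.sqrt(np.log(n) / (n * h)) for h in hs}
    for h in hs:                                        # from largest to smallest
        smaller = [hp for hp in hs if hp < h]
        if all(abs(est[h] - est[hp]) <= tau[hp] for hp in smaller):
            return h
    return hs[-1]

rng = np.random.default_rng(9)
data = rng.normal(size=2000)
grid = 0.8 * 0.7 ** np.arange(10)                       # geometric bandwidth grid
h_hat = lepski_bandwidth(0.0, data, grid)
print(f"Lepski bandwidth at x = 0: {h_hat:.3f}, estimate {kde_at(0.0, data, h_hat):.3f}")
```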

7.4 Wavelet Thresholding

1. Overview: we let $j_0 < j_1 \in \mathbb{N}$, where $j_0$ is the minimum resolution and $j_1$ the maximum resolution we want to use in estimating our function. Our linear estimator is then
$$f^W_{n,j_1}(x) = \sum_{k \in \mathbb{Z}}\hat{a}_k\varphi_{j_0,k}(x) + \sum_{l=j_0}^{j_1-1}\sum_{k \in \mathbb{Z}}\hat{b}_{l,k}\psi_{l,k}(x).$$

   (a) If $|\hat{b}_{l,k}| \le \tau$, we believe $b_{l,k}$ is 0; if $|\hat{b}_{l,k}| > \tau$, we believe it is non-zero. Therefore we use the thresholded wavelet coefficients
   $$\hat{b}^T_{l,k} = \hat{b}_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau\right].$$

   (b) The wavelet thresholding estimator is then
   $$f^T_n(x) = \sum_{k \in \mathbb{Z}}\hat{a}_k\varphi_{j_0,k}(x) + \sum_{l=j_0}^{j_1-1}\sum_{k \in \mathbb{Z}}\hat{b}^T_{l,k}\psi_{l,k}(x).$$
   (A code sketch of this estimator, in the Haar case, is given at the end of these notes.)

2. Rate of convergence of the wavelet thresholding estimator

   Theorem 7.2. Let $X_1, \ldots, X_n$ be real-valued i.i.d. random variables with bounded density $f$, and let $f^T_n$ be the wavelet thresholding estimator of $f$. Suppose:

   (a) the wavelet basis satisfies the conditions of Proposition 4.4 with $S$ vanishing moments ($r = 0, 1, \ldots, S-1$);

   (b) $j_0$ and $j_1$ are chosen in terms of $n$ so that $2^{j_0} \sim n^{1/(2S+1)}$ and $2^{j_1} \sim \frac{n}{\log n}$;

   (c) $\tau = \kappa\sqrt{\frac{\log n}{n}}$ for $\kappa$ depending only on $\|f\|_\infty$ and $\varphi$;

   (d) $f$ is $s$-times differentiable for some $s \le S$, with $f^{(s)} \in L^\infty$.

   Then
   $$\mathbb{E}\left[|f^T_n(x) - f(x)|\right] = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right).$$

   Proof. The proof relies on splitting the error into different resolution levels, then into bias and variance, and again into regions of high and low probability, which we can bound "easily".

   (a) We first divide the terms into low- and high-resolution parts:
   $$|f^T_n(x) - f(x)| = \left|f^W_{n,j_0}(x) + \sum_{l=j_0}^{j_1-1}\sum_{k \in \mathbb{Z}}\hat{b}^T_{l,k}\psi_{l,k}(x) - f(x)\right| \le \left|f^W_{n,j_0}(x) - W_{j_0}(f)(x)\right| + \left|\sum_{l=j_0}^{j_1-1}\sum_{k \in \mathbb{Z}}\left[\hat{b}^T_{l,k} - b_{l,k}\right]\psi_{l,k}(x)\right| + \left|W_{j_1}(f)(x) - f(x)\right|.$$

   (b) The low- and high-resolution terms:

   i. $\mathbb{E}\left[\left|f^W_{n,j_0}(x) - W_{j_0}(f)(x)\right|\right]$ is bounded by Proposition 5.5(a) and assumption (b);

   ii. $\left|W_{j_1}(f)(x) - f(x)\right| = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right)$ by Proposition 4.4(c) and assumption (b).

   (c) Variance-bias decomposition of the remaining term:
   $$\left|\sum_{l=j_0}^{j_1-1}\sum_k\left[\hat{b}^T_{l,k} - b_{l,k}\right]\psi_{l,k}(x)\right| \le \left|\sum\sum\left(\hat{b}_{l,k} - b_{l,k}\right)\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau\right]\right| + \left|\sum\sum b_{l,k}\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| \le \tau\right]\right|.$$

   (d) Low/high-probability decomposition of the variance term:
   $$\left|\sum\sum\left(\hat{b}_{l,k} - b_{l,k}\right)\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau\right]\right| \le \left|\sum\sum\left(\hat{b}_{l,k} - b_{l,k}\right)\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau,\ |b_{l,k}| < \tfrac{\tau}{2}\right]\right| + \left|\sum\sum\left(\hat{b}_{l,k} - b_{l,k}\right)\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau,\ |b_{l,k}| \ge \tfrac{\tau}{2}\right]\right|.$$

   (e) Low/high-probability decomposition of the bias term:
   $$\left|\sum\sum b_{l,k}\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| \le \tau\right]\right| \le \left|\sum\sum b_{l,k}\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| \le \tau,\ |b_{l,k}| > 2\tau\right]\right| + \left|\sum\sum b_{l,k}\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| \le \tau,\ |b_{l,k}| \le 2\tau\right]\right|.$$

   (f) Notice that we need to bound the (formally infinite) sum over $k \in \mathbb{Z}$, the coefficients $|b_{l,k}|$, and the deviations $|\hat{b}_{l,k} - b_{l,k}|$:

   i. Because $\psi$ is compactly supported in $[-R, R]$ for some $R \in \mathbb{N}$, for any $x \in \mathbb{R}$ at most $2R + 1$ of the terms $\psi_{l,k}(x)$, $k \in \mathbb{Z}$, are non-zero. So for fixed $x \in \mathbb{R}$, $\sum_{k \in \mathbb{Z}}|\psi_{l,k}(x)| \le (2R+1)\|\psi\|_\infty 2^{l/2}$.

   ii. To bound $|b_{l,k}|$ we use Taylor's theorem and the vanishing-moments assumption. Expanding $f$ about the fixed point $2^{-l}k$, for some $y$ between the arguments,
   $$\begin{aligned}
   |b_{l,k}| &= \left|\int_{\mathbb{R}}\psi_{l,k}(x)f(x)\, dx - 0\right| = \left|\int_{\mathbb{R}}\psi_{l,k}(x)\left[f(x) - f(2^{-l}k)\right] dx\right| \\
   &= 2^{-l/2}\left|\int_{\mathbb{R}}\psi(u)\left[f(2^{-l}u + 2^{-l}k) - f(2^{-l}k)\right] du\right| \\
   &= 2^{-l/2}\left|\int_{\mathbb{R}}\psi(u)\left[\sum_{r=1}^{s-1}\frac{f^{(r)}(2^{-l}k)}{r!}\left(2^{-l}u\right)^r + \frac{f^{(s)}(y)}{s!}\left(2^{-l}u\right)^s\right] du\right| \\
   &= 2^{-l/2}\left|\int_{\mathbb{R}}\psi(u)\frac{f^{(s)}(y)}{s!}\left(2^{-l}u\right)^s du\right| \le 2^{-l(s+1/2)}\frac{\left\|f^{(s)}\right\|_\infty}{s!}\int_{\mathbb{R}}|\psi(u)u^s|\, du \le C2^{-l(s+1/2)}.
   \end{aligned}$$

   iii. To bound the deviations, we use Jensen's inequality and bound the variance: since $\hat{b}_{l,k} = \frac{1}{n}\sum_{i=1}^n\psi_{l,k}(X_i)$ is an average of i.i.d. terms with mean $b_{l,k}$,
   $$\mathbb{E}\left[\left(\hat{b}_{l,k} - b_{l,k}\right)^2\right] \le \frac{1}{n}\mathbb{E}\left[\psi_{l,k}^2(X)\right] \le \frac{C}{n}\|f\|_\infty.$$

   (g) To bound the low-probability bias term, we use Bernstein's inequality: if $\kappa$ is large enough, then for some $D > 0$,

   i. notice that
   $$\mathbb{P}\left[|\hat{b}_{l,k}| \le \tau,\ |b_{l,k}| > 2\tau\right] \le \mathbb{P}\left[|\hat{b}_{l,k} - b_{l,k}| > \tau\right] \le 2\exp\left[-\kappa D\log n\right] = 2n^{-D\kappa};$$

   ii. therefore, for sufficiently large $\kappa$,
   $$\mathbb{E}\left[\left|\sum\sum b_{l,k}\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| \le \tau,\ |b_{l,k}| > 2\tau\right]\right|\right] \le (2R+1)\|\psi\|_\infty\,2n^{-D\kappa}\sum_{l=j_0}^{j_1-1}2^{l/2}\,C2^{-l(s+1/2)} \le (2R+1)\|\psi\|_\infty\,2Cn^{-D\kappa}\sum_{l=j_0}^{j_1-1}2^{-ls} = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right).$$

   (h) To bound the high-probability bias term:

   i. We have
   $$\left|\sum\sum b_{l,k}\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| \le \tau,\ |b_{l,k}| \le 2\tau\right]\right| \le (2R+1)\|\psi\|_\infty\sum_{l=j_0}^{j_1-1}2^{l/2}\min\left(2\tau,\ C2^{-l(s+1/2)}\right) \le (2R+1)\|\psi\|_\infty\left[\sum_{l=j_0}^{\bar{j}-1}2\kappa\sqrt{\frac{\log n}{n}}\,2^{l/2} + C\sum_{l=\bar{j}}^{j_1-1}2^{-ls}\right] = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right),$$

   ii. where $\bar{j} \in [j_0, j_1 - 1]$ is chosen so that $2^{\bar{j}} \sim \left(\frac{n}{\log n}\right)^{1/(2s+1)}$, so that the two partial sums balance:
   $$\sqrt{\frac{\log n}{n}}\,2^{\bar{j}/2} \sim C2^{-\bar{j}s} \sim \left(\frac{n}{\log n}\right)^{-s/(2s+1)}.$$

   (i) To bound the high-probability variance term:

   i. Note that if $|b_{l,k}| > \frac{\tau}{2}$ then, by the bound $|b_{l,k}| \le C2^{-l(s+1/2)}$, we must have $C2^{-l(s+1/2)} > \frac{\kappa}{2}\sqrt{\frac{\log n}{n}}$, which forces $l < \bar{j}$ for $\bar{j}$ as above.

   ii. Therefore, using Jensen for each coefficient,
   $$\mathbb{E}\left[\left|\sum_{l=j_0}^{j_1-1}\sum_k\left(\hat{b}_{l,k} - b_{l,k}\right)\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau,\ |b_{l,k}| > \tfrac{\tau}{2}\right]\right|\right] \le (2R+1)\|\psi\|_\infty\sum_{l=j_0}^{\bar{j}-1}2^{l/2}\sqrt{\mathbb{E}\left[\left(\hat{b}_{l,k} - b_{l,k}\right)^2\right]} \le (2R+1)\|\psi\|_\infty\sqrt{\frac{\|f\|_\infty}{n}}\sum_{l=j_0}^{\bar{j}-1}2^{l/2} = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right).$$

   (j) To bound the low-probability variance term, by Cauchy-Schwarz and Bernstein's inequality:
   $$\begin{aligned}
   &\mathbb{E}\left[\left|\sum_{l=j_0}^{j_1-1}\sum_k\left(\hat{b}_{l,k} - b_{l,k}\right)\psi_{l,k}\mathbf{1}\left[|\hat{b}_{l,k}| > \tau,\ |b_{l,k}| \le \tfrac{\tau}{2}\right]\right|\right] \\
   &\quad\le (2R+1)\|\psi\|_\infty\sum_{l=j_0}^{j_1-1}2^{l/2}\,\mathbb{E}\left[|\hat{b}_{l,k} - b_{l,k}|\,\mathbf{1}\left[|\hat{b}_{l,k}| > \tau,\ |b_{l,k}| \le \tfrac{\tau}{2}\right]\right] \\
   &\quad\le (2R+1)\|\psi\|_\infty\sum_{l=j_0}^{j_1-1}2^{l/2}\sqrt{\mathbb{E}\left[|\hat{b}_{l,k} - b_{l,k}|^2\right]\,\mathbb{P}\left[|\hat{b}_{l,k}| > \tau,\ |b_{l,k}| \le \tfrac{\tau}{2}\right]} \\
   &\quad\le (2R+1)\|\psi\|_\infty\sqrt{\frac{\|f\|_\infty}{n}\,2n^{-D'\kappa}}\sum_{l=j_0}^{j_1-1}2^{l/2} = O\left(\left[\frac{n}{\log n}\right]^{-s/(2s+1)}\right)
   \end{aligned}$$
   for sufficiently large $\kappa$, since $\sum_{l \le j_1}2^{l/2} = O\left(\sqrt{n/\log n}\right)$.
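Finally, a minimal NumPy sketch (not from the notes) of the hard-thresholding estimator $f^T_n$ of Section 7.4 in the simplest case of the Haar basis: estimate $\hat{a}_k$ and $\hat{b}_{l,k}$ by empirical means, zero the detail coefficients below the threshold $\tau = \kappa\sqrt{\log(n)/n}$, and reconstruct. The choice $j_0 = 0$, the value of $\kappa$, and the data-generating density are illustrative assumptions.

```python
import numpy as np

def haar_phi(x, k):           # father wavelet phi_{0,k} = 1[k, k+1)
    return ((x >= k) & (x < k + 1)).astype(float)

def haar_psi(x, l, k):        # mother wavelet psi_{l,k}
    lo, mid, hi = k * 2.0**-l, (k + 0.5) * 2.0**-l, (k + 1) * 2.0**-l
    return 2 ** (l / 2) * (((x >= lo) & (x < mid)).astype(float)
                           - ((x >= mid) & (x < hi)).astype(float))

def haar_threshold_density(x_grid, data, j1, kappa=0.5):
    """Hard-thresholded Haar density estimate on [0, 1), with j_0 = 0."""
    n = len(data)
    tau = kappa * np.sqrt(np.log(n) / n)
    est = np.mean(haar_phi(data, 0)) * haar_phi(x_grid, 0)       # a_hat_0 phi_{0,0}
    for l in range(0, j1):
        for k in range(2**l):
            b_hat = np.mean(haar_psi(data, l, k))                # b_hat_{l,k}
            if abs(b_hat) > tau:                                 # keep only large details
                est = est + b_hat * haar_psi(x_grid, l, k)
    return est

rng = np.random.default_rng(10)
data = rng.beta(2, 5, size=3000)                                  # density on [0, 1]
x = np.linspace(0, 0.999, 400)
fhat = haar_threshold_density(x, data, j1=7)
true = 30 * x * (1 - x) ** 4
print(f"mean abs error: {np.mean(np.abs(fhat - true)):.3f}")
```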
