ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation

Petra E. Todd

Fall, 2014


Contents

1 Review of Stochastic Order Symbols

2 Nonparametric Density Estimation
  2.1 Histogram Estimator
    2.1.1 What is the variance and bias of this estimator?
    2.1.2 Selecting the bin size
  2.2 Symmetric Binning Histogram Estimator
  2.3 Standard kernel density estimator
    2.3.1 Bias
    2.3.2 Variance
    2.3.3 Choice of Bandwidth
  2.4 Alternative Estimator: k nearest neighbor
  2.5 Asymptotic Distribution of Kernel Density

3 Nonparametric Regression
  3.1 Overview
  3.2 The Local approach
    3.2.1 Two types of potential problems
    3.2.2 Interpretation of Local Polynomial Estimators as a local regression
  3.3 Properties of Nonparametric Regression Estimators
    3.3.1 Consistency
    3.3.2 Asymptotic Normality
    3.3.3 How might we choose the bandwidth?

4 Bandwidth Selection for Density Estimation
  4.1 First Generation Methods
    4.1.1 Plug-in method (local version)
    4.1.2 Global plug-in method
    4.1.3 Rule-of-thumb Method (Silverman)
    4.1.4 Global Method #2: Least Squares Cross-validation
    4.1.5 Biased Cross-validation
    4.1.6 Second Generation Methods

5 Bandwidth Selection for Nonparametric Regression
  5.1 Plug-in Estimator (local method)
  5.2 Global Plug-in Estimator
  5.3 Bandwidth Choice in Nonparametric Regression
  5.4 Global MISE criterion
  5.5 Blocking method
  5.6 Bandwidth Selection for Nonparametric Regression: Cross-Validation


Chapter 1

Review of Stochastic Order Symbols

Let $\{X_n\}$ be a sequence of random variables. Then

$$X_n = o_P(1) \quad \text{if } X_n \stackrel{p}{\to} 0 \text{ as } n \to \infty$$

and

$$X_n = O_P(a_n) \quad \text{if } X_n/a_n \text{ is stochastically bounded.}$$

Recall that $X_n/a_n$ is stochastically bounded if

$$\forall \varepsilon > 0, \ \exists M < \infty : \ P(|X_n/a_n| > M) < \varepsilon \quad \forall n.$$

If a sequence is $o_P(1)$, then it is $O_P(1)$: convergence in probability implies stochastic boundedness, but not vice versa.
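For example, if $X_1, \dots, X_n$ are iid with mean $\mu$ and finite variance $\sigma^2$, the central limit theorem gives $\sqrt n(\bar X_n - \mu) \stackrel{d}{\to} N(0, \sigma^2)$, so

$$\bar X_n - \mu = O_P(n^{-1/2}),$$

and in particular $\bar X_n - \mu = o_P(1)$.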


Chapter 2

Nonparametric Density Estimation

Nonparametric density estimators allow estimation of densities without having to specify the functional form of the density. Instead, weaker assumptions such as continuity and differentiability can be assumed.

2.1 Histogram Estimator

The most commonly used nonparametric density estimator is the histogram estimator. Suppose we estimate the density by a histogram with bin width equal to $h$. The proportion of data in a bin satisfies

$$\frac{1}{n}\sum_{i=1}^{n} 1(x_i \in [x_0, x_0+h]) \approx f(x_0)\,h.$$

This suggests the density estimator

$$\hat f(x_0) = \frac{1}{nh}\sum_{i=1}^{n} 1(x_i \in [x_0, x_0+h]).$$
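A minimal sketch of this estimator (the function name and simulated data are illustrative, not from the notes):

```python
import numpy as np

def hist_density(x0, data, h):
    """Histogram density estimate at x0 with bin width h:
    f_hat(x0) = (1/(n*h)) * sum of 1{x_i in [x0, x0 + h]}."""
    n = len(data)
    return np.sum((data >= x0) & (data <= x0 + h)) / (n * h)

# illustration on simulated N(0,1) data
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
print(hist_density(0.0, x, h=0.2))  # should be near f(0) = 0.399
```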


2.1.1 What is the variance and bias of this estimator?

Bias:

$$E\hat f(x_0) = \frac{1}{h}\,E\,1(x_i \in [x_0, x_0+h]) = \frac{1}{h}\int_{-\infty}^{\infty} 1(x_i \in [x_0, x_0+h])\,f(x_i)\,dx_i$$

$$= \frac{1}{h}\int_{x_0}^{x_0+h} \{f(x_0) + f'(\bar x)(x_i - x_0)\}\,dx_i, \qquad \bar x \text{ between } x_0 \text{ and } x_i$$

$$= f(x_0)\underbrace{\frac{1}{h}\int_{x_0}^{x_0+h} dx_i}_{=1} \;+\; \underbrace{\frac{1}{h}\int_{x_0}^{x_0+h} f'(\bar x)(x_i - x_0)\,dx_i}_{R(x_0,h)}$$

Assume the first derivative is bounded:

$$|f'(x)| \le C < \infty \quad \forall x.$$

In that case, the remainder term can be bounded from above by

$$R(x_0,h) \le \frac{1}{h}\int_{x_0}^{x_0+h} |f'(\bar x)|\,|x_i - x_0|\,dx_i \le C \cdot h^2 \cdot \frac{1}{h} = O(h).$$

Thus,

$$\left| E\hat f(x_0) - f(x_0) \right| \le Ch.$$

Consistency of the histogram requires that $h \to 0$ as $n \to \infty$. Note that these derivations assumed that $f$ is differentiable with a bounded derivative.


Variance

Mean-square consistency requires that the variance go to zero as well. If the observations are iid, we can write

$$\operatorname{var} \hat f(x_0) = \frac{1}{n^2h^2} \sum_{i=1}^{n} \operatorname{var}\{1(x_i \in [x_0, x_0+h])\}.$$

We will show that the variance is bounded above by something that goes to zero (below, it is bounded by 0):

$$\operatorname{var}\{1(x_i \in [x_0,x_0+h])\} = E\!\left(1(x_i \in [x_0,x_0+h])^2\right) - \left[E\,1(\cdot)\right]^2 \le E\!\left(1(x_i \in [x_0,x_0+h])^2\right) \le E\,1(x_i \in [x_0,x_0+h]).$$

We already analyzed this term above in studying the bias. We showed that

$$E\,1(x_i \in [x_0,x_0+h]) = h f(x_0) + O(h^2).$$

Assume that $f(x_0) \le C'$, so

$$\operatorname{var}\hat f(x_0) \le \frac{C' \cdot n \cdot h}{n^2h^2} = \frac{C'}{nh} = O\!\left(\frac{1}{nh}\right).$$

Consistency therefore requires $nh \to \infty$.

Remarks

(i) The estimator is not root-$n$ consistent, because the variance gets large as $h \to 0$.

(ii) For consistency, we require both $h \to 0$ and $nh \to \infty$.

2.1.2 Selecting the bin size

How might you choose $h$? One way is to minimize the mean-squared error (MSE), focusing on the lowest order terms (the terms that converge the slowest):

$$MSE = C^2h^2 + \frac{C'}{nh}.$$


$$\frac{\partial}{\partial h} = 2C^2h - \frac{C'}{nh^2} = 0 \quad\Rightarrow\quad 2C^2h^3 = \frac{C'}{n} \quad\Rightarrow\quad \underbrace{h = \left[\frac{C'}{2C^2}\right]^{\frac{1}{3}} n^{-\frac{1}{3}}}_{\text{optimal choice of bin width}}$$
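Plugging this $h \propto n^{-1/3}$ back into the MSE, both terms are of order $n^{-2/3}$, so the best achievable rate for the histogram under these assumptions is

$$MSE = O\!\left(n^{-2/3}\right),$$

slower than the parametric rate $n^{-1}$ for the squared error.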

2.2 Symmetric Binning Histogram Estimator

Instead of constructing the bins as before, consider constructing the bins symmetrically around the point of evaluation:

$$\hat f(x_0) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2h}\,1(|x_0 - x_i| < h).$$

This can be rewritten as

$$\hat f(x_0) = \frac{1}{nh}\sum_{i=1}^{n} \underbrace{\frac{1}{2}\,1\!\left(\left|\frac{x_0-x_i}{h}\right| < 1\right)}_{\text{"uniform kernel"}}.$$

Note that the area under the kernel function integrates to 1 and the kernel function is symmetric.


Figure 2.1: Symmetric uniform kernel function

2.3 Standard kernel density estimator

Instead of the uniform kernel, use a kernel with weights that give more weight to closer observations and also go smoothly to zero. We also require that it integrate to 1 and be symmetric.

Let

$$s = \frac{x_0 - x_i}{h}.$$

Examples:

$$k(s) = \frac{15}{16}(s^2-1)^2 \;\text{ if } |s| \le 1, \quad 0 \text{ else} \qquad \text{("biweight" or "quartic" kernel)}$$

$$k(s) = \frac{1}{\sqrt{2\pi}}\,e^{-s^2/2} \qquad \text{("normal kernel")}$$

With the normal kernel, the weights are always positive.

The standard kernel density estimator is then given by

$$\hat f(x_0) = \frac{1}{nh_n}\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right).$$
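A minimal sketch of this estimator with the two kernels above (function names are illustrative):

```python
import numpy as np

def kernel_density(x0, data, h, kernel="normal"):
    """Kernel density estimator: f_hat(x0) = (1/(n*h)) * sum k((x0 - x_i)/h)."""
    s = (x0 - data) / h
    if kernel == "normal":
        k = np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)
    else:  # biweight/quartic kernel: (15/16)(s^2 - 1)^2 on |s| <= 1, 0 else
        k = (15 / 16) * (s**2 - 1) ** 2 * (np.abs(s) <= 1)
    return k.sum() / (len(data) * h)
```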


Figure 2.2: Kernel function with weights that go smoothly to zero

2.3.1 Bias

$$E\hat f(x_0) = \int_{-\infty}^{\infty} \frac{1}{h_n}\,k\!\left(\frac{x_0-x_i}{h_n}\right) f(x_i)\,dx_i$$

Let $s = \frac{x_0-x_i}{h_n}$, so that $ds = -\frac{1}{h_n}dx_i$ and $x_i = x_0 - sh_n$; when $x_i = -\infty$, $s = \infty$. Then

$$E\hat f(x_0) = -\int_{\infty}^{-\infty} k(s)\,f(x_0 - sh_n)\,ds = \int_{-\infty}^{\infty} k(s)\,f(x_0 - sh_n)\,ds.$$

Assume $f$ is $r$ times differentiable with a bounded $r$th derivative:

$$f(x_0 - sh_n) = f(x_0) + f'(x_0)(-h_ns) + \frac{1}{2}f''(x_0)h_n^2s^2 + \dots + \frac{1}{r!}f^{(r)}(x_0)(-h_ns)^r + \underbrace{\frac{1}{r!}\left[f^{(r)}(\bar x) - f^{(r)}(x_0)\right](-h_ns)^r}_{\text{remainder term } R(x_0,s)}$$

If

$$\int_{-\infty}^{\infty} k(s)\,s^j\,ds = 0, \qquad j = 1, \dots, r,$$


then the bias is

$$\left|\int_{-\infty}^{\infty} k(s)\,R(x_0,s)\,ds\right| \le h_n^r\,C_2 \int_{-\infty}^{\infty} |s|^r\,k(s)\,ds$$

(since we assumed a bounded $r$th derivative).

Thus, this estimator improves on the histogram if some moments of the kernel equal 0. Typically, $k$ satisfies

$$\int_{-\infty}^{\infty} k(s)\,ds = 1, \qquad \int_{-\infty}^{\infty} k(s)\,s\,ds = 0.$$

If we also require

$$\int_{-\infty}^{\infty} k(s)\,s^2\,ds = 0,$$

then $k(s)$ must be negative in places, as in the figure below.

Remark: If $f^{(r)}$ satisfies a Lipschitz condition, i.e.

$$\left|f^{(r)}(x) - f^{(r)}(x_0)\right| \le m\,|x - x_0|,$$

then

$$\text{Bias} \le C_0\,h_n^{r+1}\int_{-\infty}^{\infty} |s|^r\,k(s)\,ds = O(h_n^{r+1}) \qquad \text{(higher order)}.$$

2.3.2 Variance

V ar f(x0) = 1n2h2n

n∑i=1

V ark(x0−xihn

)

= 1nh2n

[Ek2(x0−xihn

)− Ek(x0−xihn

)2] can show using same techniques

≤ C1

nh

= Op(1nhn

)⇒ does not improve on histogram


Figure 2.3: Kernel function that is sometimes negative

2.3.3 Choice of Bandwidth

Choose $h$ to minimize the asymptotic mean-squared error (AMSE):

$$\min_h AMSE(h) = C_2^2\,h_n^{2r} + \frac{C_1}{nh_n}$$

$$\frac{\partial}{\partial h} = 2r\,C_2^2\,h_n^{2r-1} - \frac{C_1}{nh_n^2} = 0$$

$$h_n^{2r+1} = \frac{C_1}{2rC_2^2\,n} \quad\Rightarrow\quad h_n = \left(\frac{C_1}{2rC_2^2}\right)^{\frac{1}{2r+1}} n^{-\frac{1}{2r+1}}$$

$h_n$ shrinks to zero as $n$ grows large.

Now let's look at the order of the AMSE (without making the Lipschitz assumption). Plug in the optimal value for $h_n$ obtained above, and let $C_3 = \left(\frac{C_1}{2rC_2^2}\right)^{\frac{1}{2r+1}}$.


$$MSE(h_n) = C_2^2C_3^{2r}\,n^{-\frac{2r}{2r+1}} + \frac{C_1}{n\,C_3\,n^{-\frac{1}{2r+1}}} = C_2^2C_3^{2r}\,n^{-\frac{2r}{2r+1}} + C_1C_3^{-1}\,n^{-\frac{2r}{2r+1}} = O\!\left(n^{-\frac{2r}{2r+1}}\right)$$

so

$$\sqrt{MSE} = O\!\left(n^{-\frac{r}{2r+1}}\right), \qquad \frac{r}{2r+1} < \frac{1}{2}$$

(the parametric rate is $n^{-\frac{1}{2}}$).

Recall that for OLS,

$$(\hat\beta_{ols} - \beta_0) \sim \left(0,\; \underbrace{\sigma^2\left(\frac{X'X}{N}\right)^{-1}}_{Var}\cdot N^{-1}\right),$$

so $RMSE = \sqrt{MSE}$ is $O(N^{-\frac{1}{2}})$.

How do you generalize the kernel density estimator to more dimensions? Use a product kernel:

$$\hat f(x_0, z_0) = \frac{1}{n\,h_{n,x}\,h_{n,z}}\sum_{i=1}^{n} k_1\!\left(\frac{x_0-x_i}{h_{n,x}}\right) k_2\!\left(\frac{z_0-z_i}{h_{n,z}}\right).$$

2.4 Alternative Estimator: k nearest neighbor

Alternatively, the bandwidth can be chosen so as to include a certain number of "neighbors" in the estimation at each point. In this case, the bandwidth will be variable:


$$\hat f(x_0)\cdot 2r_k = \text{proportion of data in the bin} = \frac{k-1}{n} \quad\Rightarrow\quad \hat f(x_0) = \frac{k-1}{2r_k\cdot n},$$

where $r_k$ is the distance from $x_0$ to its $k$th nearest observation. For $m$-dimensional data,

$$\hat f(x_0) = \frac{k-1}{(2r_k)^m\cdot n}.$$
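A minimal sketch for the one-dimensional case (the function name is illustrative):

```python
import numpy as np

def knn_density(x0, data, k):
    """k-nearest-neighbor density estimate f_hat(x0) = (k-1)/(2 * r_k * n),
    where r_k is the distance from x0 to its k-th nearest observation."""
    n = len(data)
    r_k = np.sort(np.abs(data - x0))[k - 1]  # k-th nearest-neighbor distance
    return (k - 1) / (2 * r_k * n)
```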

2.5 Asymptotic Distribution of Kernel Density

We already did consistency (we showed the conditions required for convergence in mean square):

$$\hat f(x_0) = \frac{1}{nh_n}\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right) \stackrel{p}{\to} f(x_0)$$

by the LLN: $E\hat f(x_0) \to f(x_0)$ as $h_n \to 0$, and $\operatorname{Var}\hat f(x_0) \to 0$ as $nh_n \to \infty$.

Now, the asymptotic distribution:

$$\sqrt{nh_n}\left(\frac{1}{nh_n}\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right) - f(x_0)\right) = \underbrace{\sqrt{nh_n}\left(\frac{1}{nh_n}\sum_{i=1}^{n}\left[k\!\left(\frac{x_0-x_i}{h_n}\right) - E\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right]\right)}_{(1)} + \underbrace{\sqrt{nh_n}\left(\frac{1}{nh_n}\sum_{i=1}^{n} E\,k\!\left(\frac{x_0-x_i}{h_n}\right) - f(x_0)\right)}_{(2)}$$


Analyze term (1):

$$\sqrt n\cdot\frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{\sqrt{h_n}}\,k\!\left(\frac{x_0-x_i}{h_n}\right) - E\!\left(\frac{1}{\sqrt{h_n}}\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right)\right] \qquad \text{(can find a CLT for this part)}$$

$$\sim N\!\left(0,\;\operatorname{Var}\frac{1}{\sqrt{h_n}}\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right)$$

$$\operatorname{Var}\frac{1}{\sqrt{h_n}}\,k\!\left(\frac{x_0-x_i}{h_n}\right) = \underbrace{\frac{1}{h_n}\,E\,k^2\!\left(\frac{x_0-x_i}{h_n}\right)}_{\to\; f(x_0)\int_{-\infty}^{\infty}k^2(s)\,ds} - \frac{1}{h_n}\left[E\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right]^2$$

Assume

$$\int_{-\infty}^{\infty} k(s)\,ds = 1, \qquad \int_{-\infty}^{\infty} k(s)\,s\,ds = 0.$$

Then the second term is $\frac{1}{h_n}\cdot O(h_n^2) \to 0$ as $h_n \to 0$, so

$$\text{Term (1)} \sim N\!\left(0,\; f(x_0)\int_{-\infty}^{\infty} k^2(s)\,ds\right).$$

Now, show that Term (2) $\stackrel{p}{\to} 0$ (this will require more stringent conditions on the bandwidth):

$$\sqrt{nh_n}\left(\frac{1}{nh_n}\sum_{i=1}^{n} E\,k\!\left(\frac{x_0-x_i}{h_n}\right) - f(x_0)\right) = \sqrt{nh_n}\left\{E\,\frac{1}{h_n}k\!\left(\frac{x_0-x_i}{h_n}\right) - f(x_0)\right\}$$

$$E\,\frac{1}{h_n}k\!\left(\frac{x_0-x_i}{h_n}\right) = \frac{1}{h_n}\int_{-\infty}^{\infty} k\!\left(\frac{x_0-x_i}{h_n}\right) f(x_i)\,dx_i \qquad \left(s = \frac{x_0-x_i}{h_n},\; h_n\,ds = -dx_i\right)$$

$$= \int_{-\infty}^{\infty} k(s)\,f(x_0-sh_n)\,ds = \int_{-\infty}^{\infty} k(s)\left[f(x_0) + f'(x_0)(-sh_n) + \frac{f''(\bar x)}{2}\,s^2h_n^2\right]ds$$

$$= f(x_0)\int k(s)\,ds - f'(x_0)\,h_n\int_{-\infty}^{\infty} k(s)\,s\,ds + h_n^2\int_{-\infty}^{\infty}\frac{f''(\bar x)}{2}\,k(s)\,s^2\,ds,$$

where $\bar x$ lies between $x_0$ and $x_0 - sh_n$.

Assume $|f''(x)| \le M$ and $\int_{-\infty}^{\infty} k(s)\,s\,ds = 0$ (satisfied if $k$ is symmetric).


Then

$$E\,\frac{1}{h_n}k\!\left(\frac{x_0-x_i}{h_n}\right) - f(x_0) \;\text{ is }\; O(h_n^2),$$

which implies that

$$\sqrt{nh_n}\left(E\,\frac{1}{h_n}k\!\left(\frac{x_0-x_i}{h_n}\right) - f(x_0)\right) = O\!\left(\sqrt{nh_n}\cdot h_n^2\right) = O\!\left(\sqrt{nh_n^5}\right).$$

Hence, we require $nh_n^5 \to 0$ in addition to $h_n \to 0$ and $nh_n \to \infty$. The required condition for asymptotic normality is stronger than for consistency.
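In practice, this result is used with an undersmoothing bandwidth ($nh_n^5 \to 0$) so the bias term is asymptotically negligible; plugging in $\hat f$ for $f$ in the variance from Term (1) then gives a pointwise asymptotic confidence interval

$$\hat f(x_0) \;\pm\; z_{1-\alpha/2}\,\sqrt{\frac{\hat f(x_0)\int_{-\infty}^{\infty} k^2(s)\,ds}{nh_n}}.$$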


Chapter 3

Nonparametric Regression

3.1 Overview

Consider the model

$$y_i = g(x_i) + \varepsilon_i,$$

where we assume

$$E(\varepsilon_i|x_i) = 0, \qquad E(\varepsilon_i^2|x_i) = c < \infty,$$

and we would like to estimate $g(x)$ without having to impose functional form assumptions on $g$, except assumptions such as continuity and differentiability.

There are two types of nonparametric estimation methods: local and global. These two approaches reflect two different ways to reduce the problem of estimating a function to the estimation of real numbers.

Local approaches consider a real-valued function $g(x)$ at a single point $x = x_0$. The problem of estimating a function becomes estimating a real number $g(x_0)$. If we are interested in evaluating the function in the neighborhood of the point $x_0$, we can approximate the function by $g(x_0)$ or, if $g(x)$ is continuously differentiable at $x_0$, a better approximation might be $g(x_0) + g'(x_0)(x - x_0)$. Thus, the problem of estimating the function at a point may be thought of as estimating two real numbers $g(x_0)$ and $g'(x_0)$, making use of observations in the neighborhood. Either way, if we want to estimate the function over a wider range of $x$ values, the same pointwise problem can be solved at the different points of evaluation.

Global approaches introduce a coordinate system in a space of functions, which reduces the problem of estimating a function to that of estimating a set of real numbers. Any element $v$ in a $d$-dimensional vector space can be uniquely expressed using a system of independent vectors $\{b_j\}_{j=1}^d$ as $v = \sum_{j=1}^d \theta_j b_j$, where one can think of $\{b_j\}_{j=1}^d$ as a system of coordinates and $(\theta_1,\dots,\theta_d)'$ as the representation of $v$ in that coordinate system. Likewise, given an appropriate set of linearly independent functions $\{\phi_j(x)\}_{j=1}^{\infty}$ as coordinates, any square integrable real-valued function $g(x)$ has unique coefficients $\{\theta_j\}_{j=1}^{\infty}$ such that

$$g(x) = \sum_{j=1}^{\infty} \theta_j\,\phi_j(x).$$

One can think of $\{\phi_j(x)\}_{j=1}^{\infty}$ as a system of coordinates and $(\theta_1,\theta_2,\dots)'$ as the representation of $g(x)$ in that system. This observation allows us to translate the problem of estimating a function into a problem of estimating a sequence of real numbers $\{\theta_j\}_{j=1}^{\infty}$.

Well-known bases are polynomial series and Fourier series, which are infinitely differentiable everywhere. Other well-known bases are polynomial spline bases and wavelet bases.

The literature on nonparametric estimation methods is vast. These notes will focus on local methods, in particular kernel-based methods, which are among the most commonly used (at least by economists!).

3.2 The Local approach

Examples: kernel regression, local polynomial regression

Early smoothing methods date back to the 1860s in the actuarial literature (see Cleveland (1979, JASA)), where they were used to study life expectancies at different ages.


One possibility is to estimate the conditional mean by

$$E(y|x) = \int y\,f(y|x)\,dy = \int y\,\frac{f(x,y)}{f(x)}\,dy.$$

Or, with repeated data at each point, we could estimate $g(x_0)$ by

$$\hat g(x_0) = \frac{\sum_i y_i\,1(x_i = x_0)}{\sum_i 1(x_i = x_0)}.$$

A drawback is that we can only estimate this way at points where we have data. We need a way of interpolating or extrapolating between and outside the data observations. We could create cells as in the histogram estimator:

$$\hat g(x_0) = \frac{\sum_{i=1}^{n} y_i\,1(x_i \in [x_0-h_n, x_0+h_n])}{\sum_{i=1}^{n} 1(x_i \in [x_0-h_n, x_0+h_n])} = \frac{\sum_{i=1}^{n} y_i\,1\!\left(\left|\frac{x_0-x_i}{h_n}\right| < 1\right)}{\sum_{i=1}^{n} 1\!\left(\left|\frac{x_0-x_i}{h_n}\right| < 1\right)}$$

If we want $\hat g(x)$ to be smooth, we need to choose a kernel function that goes to 0 smoothly:


Figure 3.1: Repeated data estimator

$$\hat g(x_0) = \frac{\sum_{i=1}^{n} y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)}{\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right)} = \frac{\frac{1}{nh_n}\sum_{i=1}^{n} y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right) \;\left(\to g(x_0)f(x_0)\right)}{\frac{1}{nh_n}\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right) \;\left(\to f(x_0)\right)} = \sum_{i=1}^{n} y_i\,w_i(x_0)$$


where

$$w_i(x_0) = \frac{k\!\left(\frac{x_0-x_i}{h_n}\right)}{\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right)}, \qquad \text{note: } \sum_{i=1}^{n} w_i(x_0) = 1.$$

We don't need $\int k(s)\,ds = 1$, but we do require that the kernels used in the numerator and denominator integrate to the same value.
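A minimal sketch of this kernel (Nadaraya-Watson) regression estimator with a Gaussian kernel (function names are illustrative):

```python
import numpy as np

def nw_regression(x0, x, y, h):
    """Nadaraya-Watson estimator: kernel-weighted average of the y_i."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # unnormalized Gaussian kernel:
                                            # the constant cancels in the ratio
    return np.sum(y * k) / np.sum(k)
```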

3.2.1 Two types of potential problems

(i) When the density at $x_0$ is low, the denominator tends to be close to 0 and we get a division-by-zero problem.

(ii) At boundary points, the estimator has higher order bias, even with an appropriately chosen $k(\cdot)$. This problem is known as boundary bias.

3.2.2 Interpretation of Local Polynomial Estimators as a local regression

Solve this problem at each point of evaluation $x_0$ (which may or may not correspond to points in the data):

$$\hat a = \operatorname*{argmin}_{\{a, b_1, \dots, b_k\}} \sum_{i=1}^{n}\left(y_i - a - b_1(x_i-x_0) - b_2(x_i-x_0)^2 - \dots - b_k(x_i-x_0)^k\right)^2 k_i, \qquad k_i = k\!\left(\frac{x_0-x_i}{h_n}\right)$$

$\hat b_1$ provides an estimator for $g'(x_0)$; $\hat b_2$ provides an estimator for $\frac{g''(x_0)}{2}$.

If $k = 0$,

$$\hat a = \frac{\sum_{i=1}^{n} y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)}{\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right)} \qquad \leftarrow \text{the standard Nadaraya-Watson kernel regression estimator}$$

$$= \sum_{i=1}^{n} y_i\,w_i(x_0), \quad \text{where } \sum_{i=1}^{n} w_i(x_0) = 1.$$


If $k = 1$,

$$\hat a = \frac{\sum_{i=1}^{n} y_ik_i\sum_{j=1}^{n} k_j(x_j-x_0)^2 - \sum_{k=1}^{n} y_kk_k(x_k-x_0)\sum_{l=1}^{n} k_l(x_l-x_0)}{\sum_{i=1}^{n} k_i\sum_{j=1}^{n} k_j(x_j-x_0)^2 - \left\{\sum_{k=1}^{n} k_k(x_k-x_0)\right\}^2},$$

where

$$k_i = k\!\left(\frac{x_0-x_i}{h_n}\right).$$

We can again write the expression as $\sum_{i=1}^{n} y_i\,w_i(x_0)$, where the weights satisfy

$$\sum_{i=1}^{n} w_i(x_0) = 1 \qquad \text{and} \qquad \sum_{i=1}^{n} w_i(x_0)(x_i-x_0) = 0.$$

The $k = 0$ weights of the standard Nadaraya-Watson kernel estimator do not satisfy this second property.

The LLR estimator ($k = 1$) improves on the kernel estimator in two ways:

(a) the asymptotic bias does not depend on the design density of the data (i.e., $f(x)$ does not appear in the bias expression);

(b) the bias has a higher order of convergence at boundary points.

Intuitively, the standard kernel estimator fits a local constant, so when there is no data on the other side, the estimate will tend to be too high or too low. It does not capture the slope; LLR gets the slope.
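A minimal sketch of the local linear fit as a weighted least squares problem, with a Gaussian kernel and illustrative names (this mirrors the matrix formulation used later in Chapter 5):

```python
import numpy as np

def llr(x0, x, y, h):
    """Local linear regression at x0: weighted least squares of y on
    (1, x - x0). The intercept estimates g(x0), the slope g'(x0)."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)          # kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # local design matrix
    XtW = X.T * w                                   # X'W with W = diag(w)
    a_hat, b_hat = np.linalg.solve(XtW @ X, XtW @ y)
    return a_hat, b_hat  # (g_hat(x0), g_hat'(x0))
```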

$$y_i = g(x_i) + \varepsilon_i$$


Figure 3.2: Boundary bias - How LLR improves on kernel regression

$$\hat g_{LLR}(x_0) - g(x_0) = \sum_{i=1}^{n} w_i(x_0)\,y_i - g(x_0) = \sum_{i=1}^{n} w_i(x_0)\,\varepsilon_i + \sum_{i=1}^{n} w_i(x_0)\left(g(x_i)-g(x_0)\right)$$

$$= \underbrace{\sum_{i=1}^{n} w_i(x_0)\,\varepsilon_i}_{\text{variance part}} + g'(x_0)\underbrace{\sum_{i=1}^{n} w_i(x_0)(x_i-x_0)}_{=0 \text{ with LLR}} + \sum_{i=1}^{n} w_i(x_0)\left[g'(\bar x)-g'(x_0)\right](x_i-x_0)$$


If g′ satisfies a Lipschitz condition, the last term can be bounded.

3.3 Properties of Nonparametric Regression Estimators

1. Consistency (will show convergence in mean square)

2. Asymptotic Normality

In showing the asymptotic properties, we will work with the standard kernel regression estimator, because it is simpler to analyze than LLR and the basic techniques are the same.

$$\hat m(x_0) = \frac{\frac{1}{nh_n}\sum_i y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)}{\frac{1}{nh_n}\sum_i k\!\left(\frac{x_0-x_i}{h_n}\right)}$$

Assume:

(i) $h_n \to 0$, $nh_n \to \infty$;

(ii) $\int_{-\infty}^{\infty} s\,k(s)\,ds = 0$, $\int_{-\infty}^{\infty} k(s)\,ds \ne 0$, $\int_{-\infty}^{\infty} k(s)\,s^2\,ds < C$;

(iii) $f(x)$ and $g(x)$ are $C^2$ with bounded second derivatives.

3.3.1 Consistency

We already showed that

$$\frac{1}{nh_n}\sum_i k\!\left(\frac{x_0-x_i}{h_n}\right) \stackrel{MS}{\to} f(x_0)\int_{-\infty}^{\infty} k(s)\,ds \qquad \text{(denominator)}.$$

We will now show that

$$\frac{1}{nh_n}\sum_i y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right) \stackrel{MS}{\to} f(x_0)\,g(x_0)\int_{-\infty}^{\infty} k(s)\,ds \qquad \text{(numerator)}.$$

Then apply the Mann-Wald theorem (the plim of a continuous function is the continuous function of the plim) to conclude that the ratio $\stackrel{P}{\to} g(x_0)$.

Denominator:


$$\operatorname{Var}\!\left(\frac{1}{nh_n}\sum_{i=1}^{n} k\!\left(\frac{x_0-x_i}{h_n}\right)\right) \le \frac{1}{n^2h_n^2}\cdot n\,E\!\left(k^2\!\left(\frac{x_0-x_i}{h_n}\right)\right) = \frac{1}{nh_n^2}\cdot O(h_n) = O\!\left(\frac{1}{nh_n}\right)$$

Thus,

$$\frac{1}{nh_n}\sum_i k\!\left(\frac{x_0-x_i}{h_n}\right) \stackrel{MS}{\to} E\!\left(\frac{1}{nh_n}\sum_i k\!\left(\frac{x_0-x_i}{h_n}\right)\right) = f(x_0)\int k(s)\,ds + O(h_n).$$

Numerator:

$$E\!\left(\frac{1}{nh_n}\sum_{i=1}^{n} y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right) = \frac{1}{h_n}E\!\left(E(y_i|x_i)\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right) = \frac{1}{h_n}E\!\left(g(x_i)\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right) = \frac{1}{h_n}\int_{-\infty}^{\infty} g(x_i)\,k\!\left(\frac{x_0-x_i}{h_n}\right) f(x_i)\,dx_i$$

Let $\phi(x_i) = f(x_i)\,g(x_i)$ and Taylor expand (as before):

$$= \phi(x_0)\int_{-\infty}^{\infty} k(s)\,ds + O(h_n), \qquad \text{where } \phi(x_0) = g(x_0)f(x_0).$$

Also, one can show that

$$\operatorname{Var}\!\left(\frac{1}{nh_n}\sum_{i=1}^{n} y_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right) \le \frac{c}{nh_n} \to 0 \quad \text{as } nh_n \to \infty.$$

Hence,


$$\text{ratio} \to \frac{g(x_0)\,f(x_0)\int k(s)\,ds}{f(x_0)\int k(s)\,ds} = g(x_0),$$

provided that $h_n \to 0$ and $nh_n \to \infty$.

3.3.2 Asymptotic Normality

We will show that

$$\sqrt{nh_n}\left(\hat g(x_0) - g(x_0)\right) \sim N\!\left(\left[\frac{1}{2}g''(x_0) + \frac{g'(x_0)f'(x_0)}{f(x_0)}\right]h_n^2\sqrt{nh_n}\int_{-\infty}^{\infty} u^2k(u)\,du,\;\; f(x_0)^{-1}\sigma^2(x_0)\int_{-\infty}^{\infty} k^2(u)\,du\right),$$

where $\sigma^2(x_0) = E(\varepsilon_i^2|x_i = x_0)$ is the conditional variance.

$$\sqrt{nh_n}\left[\frac{\sum y_ik_i}{\sum k_i} - g(x_0)\right] = \sqrt{nh_n}\left[\frac{\sum (y_i - g(x_0))k_i}{\sum k_i}\right]$$

Using $y_i = g(x_i) + \varepsilon_i$ with $E(\varepsilon_i|x_i) = 0$,

$$= \underbrace{\sqrt{nh_n}\,\frac{\sum \varepsilon_ik_i}{\sum k_i}}_{\text{variance part}} + \underbrace{\sqrt{nh_n}\,\frac{\sum (g(x_i)-g(x_0))k_i}{\sum k_i}}_{\text{bias part}} = \sqrt{nh_n}\left(\frac{\frac{1}{nh_n}\sum\varepsilon_ik_i}{\frac{1}{nh_n}\sum k_i}\right) + \sqrt{nh_n}\left(\frac{\frac{1}{nh_n}\sum(g(x_i)-g(x_0))k_i}{\frac{1}{nh_n}\sum k_i}\right)$$

Assuming $\int_{-\infty}^{\infty} k(s)\,ds = 1$, we showed before that


$$\frac{1}{nh_n}\sum k_i \stackrel{MS}{\to} f(x_0) + O(h_n).$$

We will now analyze the distribution of the numerator:

$$\sqrt{nh_n}\,\frac{1}{nh_n}\sum\varepsilon_ik_i = \frac{1}{\sqrt{nh_n}}\sum_i \varepsilon_i\,k\!\left(\frac{x_0-x_i}{h_n}\right) = \frac{1}{\sqrt n}\sum_i \frac{1}{\sqrt{h_n}}\,\varepsilon_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)$$

and apply a CLT to get

$$\sim N\!\left(0,\;\operatorname{Var}\!\left(\frac{1}{\sqrt{h_n}}\,\varepsilon_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right)\right)$$

$$\operatorname{Var}\!\left(\frac{1}{\sqrt{h_n}}\,\varepsilon_i\,k\!\left(\frac{x_0-x_i}{h_n}\right)\right) = \frac{1}{h_n}\,E\!\left(\varepsilon_i^2\,k^2\!\left(\frac{x_0-x_i}{h_n}\right)\right) + \text{higher order terms}$$

$$= \frac{1}{h_n}\,E\Big(\underbrace{E(\varepsilon_i^2|x_i)}_{=\sigma^2(x_i)}\,k^2\!\left(\frac{x_0-x_i}{h_n}\right)\Big) = \frac{1}{h_n}\int_{-\infty}^{\infty}\sigma^2(x_i)\,k^2\!\left(\frac{x_0-x_i}{h_n}\right) f(x_i)\,dx_i$$

$$= \sigma^2(x_0)\,f(x_0)\int k^2(s)\,ds + O(h_n) \qquad \text{(by change of variables)}$$

Now, analyze the bias term


$$\sqrt{nh_n}\,\frac{1}{nh_n}\sum(g(x_i)-g(x_0))k_i = \frac{1}{\sqrt{nh_n}}\sum_i (g(x_i)-g(x_0))\,k\!\left(\frac{x_0-x_i}{h_n}\right)$$

$$\longrightarrow \frac{1}{\sqrt{nh_n}}\cdot n\int_{-\infty}^{\infty}(g(x_i)-g(x_0))\,k\!\left(\frac{x_0-x_i}{h_n}\right) f(x_i)\,dx_i = \sqrt{nh_n}\int_{-\infty}^{\infty}\left(g(x_0-sh_n)-g(x_0)\right)k(s)\,f(x_0-sh_n)\,ds$$

$$= \sqrt{nh_n}\int_{-\infty}^{\infty}\left\{g'(x_0)(-sh_n) + \frac{1}{2}g''(x_0)s^2h_n^2 + \frac{1}{2}\left[g''(\bar x)-g''(x_0)\right]s^2h_n^2\right\}\cdot\left\{f(x_0) - sh_nf'(x_0) + \frac{s^2h_n^2}{2}f''(\bar x)\right\}k(s)\,ds$$

plus higher order terms (which we will ignore).

If $g''$ satisfies a Lipschitz condition,

$$|g''(x) - g''(x_0)| \le C\,|x - x_0|,$$

then the $[g''(\bar x)-g''(x_0)]$ term will be of higher order. The other terms give

$$-g'(x_0)f(x_0)\,h_n\underbrace{\int s\,k(s)\,ds}_{=0} + g'(x_0)f'(x_0)\,h_n^2\int s^2k(s)\,ds - \underbrace{\frac{g'(x_0)f''(\bar x)}{2}h_n^3}_{\text{higher order}}\int k(s)\,s^3\,ds + \frac{1}{2}g''(x_0)f(x_0)\,h_n^2\int_{-\infty}^{\infty} k(s)\,s^2\,ds + \text{higher order terms},$$

so we get

$$\sqrt{nh_n}\,\frac{1}{nh_n}\sum(g(x_i)-g(x_0))k_i \to \sqrt{nh_n}\left[g'(x_0)f'(x_0) + \frac{1}{2}g''(x_0)f(x_0)\right]h_n^2\int_{-\infty}^{\infty} k(s)\,s^2\,ds.$$


We require

$$\sqrt{nh_n}\,h_n^2 \to 0, \quad\text{i.e., } nh_n^5 \to 0.$$

Putting these results together, we get

$$\sqrt{nh_n}\left(\hat g(x_0)-g(x_0)\right) \sim N\!\left(\left[\frac{g'(x_0)f'(x_0)}{f(x_0)} + \frac{1}{2}g''(x_0)\right]\sqrt{nh_n}\,h_n^2\int k(s)\,s^2\,ds,\;\; f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds\right)$$

(which is what we set out to show).

Remarks:

Getting the asymptotic distribution requires plug-in estimators of $g'$, $f'$, $f$, $g''$, $\sigma^2(x_0)$ (all the ingredients of the bias and variance):

$g'$: obtain by local linear or local quadratic regression.

$f'$: the derivative of the estimator for $f$ gives an estimator of $f'$:

$$\hat f'(x_0) = \frac{1}{nh_n^2}\sum_{i=1}^{n} k'\!\left(\frac{x_0-x_i}{h_n}\right)$$

$f$: obtain by the standard density estimator.

$\int_{-\infty}^{\infty} k^2(s)\,ds$, $\int_{-\infty}^{\infty} k(s)\,s\,ds$: constants that depend on the kernel function $k(s)$.

$\sigma^2(x_0)$: the conditional variance. Estimate it by

$$\hat E(\varepsilon_i^2|x_0) = \frac{\sum w_i(x_0)\,\hat\varepsilon_i^2}{\sum w_i(x_0)}, \qquad \hat\varepsilon_i = y_i - \hat m(x_i) \;\text{(fitted residuals)}.$$

We showed the distribution of the kernel estimator. What about the distribution of LLR? Fan (JASA, 1992) showed that

$$\sqrt{nh_n}\left(\hat g_{LLR}(x_0)-g(x_0)\right) \sim N\!\left(\underbrace{\frac{1}{2}g''(x_0)\int k(s)\,s^2\,ds\cdot\sqrt{nh_n}\,h_n^2}_{\text{one less term in the bias expression}},\;\; \underbrace{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}_{\text{same variance}}\right)$$

Remarks:


–The bias of LLR does not depend on $f(x_0)$ (the design density of the data). Because of this feature, Fan refers to the estimator as being "design-adaptive."

–Fan showed that, in general, it is better to use an odd-order local polynomial: there is no cost in variance, and the bias of an odd-order fit does not depend on $f(x)$.

–The LLR estimator does not suffer from the boundary bias problem.

3.3.3 How might we choose the bandwidth?

Could choose the bandwidth to minimize the pointwise asymptotic MSE (AMSE):

$$AMSE_{LLR} = \left[h_n^2\,\frac{1}{2}g''(x_0)\int k(s)\,s^2\,ds\right]^2 + \frac{f(x_0)^{-1}\sigma^2(x_0)}{nh_n}\int k^2(s)\,ds$$

$$\frac{\partial}{\partial h_n} = 4h_n^3\cdot\frac{1}{4}g''(x_0)^2\left[\int k(s)\,s^2\,ds\right]^2 - \frac{f(x_0)^{-1}\sigma^2(x_0)}{nh_n^2}\int k^2(s)\,ds = 0$$

$$\Rightarrow\quad h_n = \underbrace{\left[\frac{f(x_0)^{-1}\sigma^2(x_0)}{g''(x_0)^2}\cdot\frac{\int k^2(s)\,ds}{\left[\int k(s)\,s^2\,ds\right]^2}\right]^{\frac{1}{5}}}_{\text{constant}} n^{-\frac{1}{5}}$$

Remarks:

(1) higher variance σ2(x0) ⇒wider bandwidth

(2) more variability in function (g′′(x0))⇒wider bandwidth

(3) more data⇒narrower bandwidth (with kernel regression (i.e. local poly-nomial of degree 0), bias term also depends on f(x) and therefore op-timal bandwidth choice will depend on f(x))

(4) difficulty–obtaining optimal bandwidth requires estimates of σ2(x0), g′′(x0) (andin the case of the usual regression estimator f(x0). This is the problemof how to choose “pilot bandwidth.” One could assume normality inchoosing a pilot for f(x0) (this is Silverman’s rule-of-thumb method).Could also use fixed bw or nearest neighbor for σ2(x0), g′′(x0)


Figure 3.3: Functions of different smoothness

(5) Could minimize $AMISE = E\int\left(\hat g(x_0) - g(x_0)\right)^2 dx_0$ and pick a global bandwidth.

For further information on choosing the bandwidth, see below and the book by Fan and Gijbels (1996).


Chapter 4

Bandwidth Selection for Density Estimation

4.1 First Generation Methods

4.1.1 Plug-in method (local version)

One could derive the optimal bandwidth at each point of evaluation $x_0$, which would lead to a pointwise, localized bandwidth estimator:

$$h(x_0) = \arg\min AMSE\,\hat f(x_0), \qquad AMSE\,\hat f(x_0) = E\left(\hat f(x_0) - f(x_0)\right)^2 = \frac{h^4}{4}f''(x_0)^2\left[\int k(s)\,s^2\,ds\right]^2 + \frac{1}{nh}f(x_0)\int k^2(s)\,ds$$

$$\Rightarrow\quad h(x_0) = \left[\frac{\int k^2(s)\,ds}{\left[\int k(s)\,s^2\,ds\right]^2}\cdot\frac{f(x_0)}{(f''(x_0))^2}\right]^{\frac{1}{5}} n^{-\frac{1}{5}}$$

We need to estimate $f(x_0)$ and $f''(x_0)$ to get the optimal plug-in bandwidth, and the optimal bandwidth would be different at different points of estimation.
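A minimal sketch of this pointwise plug-in rule for a Gaussian kernel, using pilot kernel estimates of $f$ and $f''$ (the pilot bandwidth and function names are illustrative assumptions, and no safeguard is included for $\hat f''(x_0) \approx 0$, where the rule degenerates):

```python
import numpy as np

def local_plugin_bandwidth(x0, data, pilot_h):
    """Pointwise plug-in bandwidth for a Gaussian kernel:
    h(x0) = [R(k)/mu2(k)^2 * f(x0)/f''(x0)^2]^(1/5) * n^(-1/5),
    with f(x0) and f''(x0) replaced by pilot kernel estimates."""
    n = len(data)
    s = (x0 - data) / pilot_h
    phi = np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)
    f_hat = phi.sum() / (n * pilot_h)
    # second derivative of the density estimate: phi''(s) = (s^2 - 1) phi(s)
    f2_hat = ((s**2 - 1) * phi).sum() / (n * pilot_h**3)
    R_k = 1 / (2 * np.sqrt(np.pi))  # integral of phi(s)^2 ds
    mu2 = 1.0                       # integral of s^2 phi(s) ds
    return (R_k / mu2**2 * f_hat / f2_hat**2) ** 0.2 * n ** (-0.2)
```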

4.1.2 Global plug-in method

Because the above method is computationally intensive, one could instead derive an optimal bandwidth for all points of evaluation by minimizing a different criterion, the mean integrated squared error:


$$MISE\,\hat f = E\int\left(\hat f(x_0) - f(x_0)\right)^2 dx_0 \qquad \leftarrow\; x_0 \text{ is removed by integration}$$

$$= \frac{1}{nh}\underbrace{\int f(x_0)\,dx_0}_{=1}\underbrace{\int k^2(s)\,ds}_{A} + \frac{h^4}{4}\underbrace{\left(\int k(s)\,s^2\,ds\right)^2\left(\int f''(x_0)^2\,dx_0\right)}_{B}$$

$$h_{MISE} = \left(\frac{A}{B}\right)^{\frac{1}{5}} n^{-\frac{1}{5}}$$

Note: $\int f''(x_0)^2\,dx_0$ is a measure of the variation in $f$. If $f$ is highly variable, then $h$ will be small. Here we still need an estimate of $f''(x_0)$, but we no longer require an estimate of $f(x_0)$ (it was eliminated by integrating).

4.1.3 Rule-of-thumb Method (Silverman)

If

$$x \sim \text{Normal},$$

then Silverman shows (in his book Density Estimation) that

$$h_{MISE} \approx n^{-\frac{1}{5}}\,SD(x), \qquad SD = \text{standard deviation}$$

(for a Gaussian kernel the constant of proportionality is 1.06). This method is often applied to nonnormal data, but it has a tendency to "oversmooth," particularly when the data density is multimodal. Sometimes the standard deviation is replaced by the 75%-25% interquartile range.

This is an older method, now available in some software packages, but its performance is not so good. Other methods are better.
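A minimal sketch of the rule, in the common form that uses the smaller of the standard deviation and the scaled interquartile range (the 0.9 and 1.34 constants are the standard Gaussian-kernel values, an assumption not spelled out in the notes):

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb for a Gaussian kernel:
    h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    n = len(x)
    sd = np.std(x, ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(sd, (q75 - q25) / 1.34) * n ** (-1 / 5)
```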

4.1.4 Global Method #2: Least Squares Cross-validation

The idea is to minimize an estimate of the integrated squared error (ISE):

$$\min_h ISE = \int\left\{\hat f(x_0) - f(x_0)\right\}^2 dx_0 = \underbrace{\int \hat f(x_0)^2\,dx_0}_{\text{term \#1}} - \underbrace{2\int \hat f(x_0)\,f(x_0)\,dx_0}_{\text{term \#2}} + \underbrace{\int f(x_0)^2\,dx_0}_{\text{term doesn't depend on } h}$$


$$\text{Term \#1} = \int\frac{1}{n^2h^2}\sum_i\sum_j k\!\left(\frac{x_i-x_0}{h}\right)k\!\left(\frac{x_j-x_0}{h}\right)dx_0$$

Let $s = \frac{x_0-x_j}{h}$, so that $x_0 = x_j + sh$ and $\frac{x_i-x_0}{h} = \frac{x_i-x_j}{h} - s$. Using the symmetry of $k$,

$$= \frac{1}{n^2h}\sum_i\sum_j\int k\!\left(\frac{x_i-x_j}{h} - s\right)k(s)\,ds.$$

Note: $\int k(a-s)\,k(s)\,ds$ is the density of the sum of two independent random variables each with density $k$ (also known as the "convolution"): with $y = a + s$,

$$\Pr(y \le y_0) = \Pr(a + s \le y_0) = \Pr(a \le y_0 - s) = \int_{-\infty}^{\infty} F(y_0-s)\,k(s)\,ds \quad\Rightarrow\quad \frac{\partial}{\partial y_0} = \int_{-\infty}^{\infty} k(y_0-s)\,k(s)\,ds.$$

Now examine Term \#2: $-2\int\hat f(x_0)\,f(x_0)\,dx_0$. An unbiased estimator of term \#2 is

$$-\frac{2}{N(N-1)h}\sum_i\sum_{j\ne i} k\!\left(\frac{x_i-x_j}{h}\right) \approx -\frac{2}{N^2h}\sum_i\sum_{j\ne i} k\!\left(\frac{x_i-x_j}{h}\right).$$

Note that we need to use the "leave-one-out" estimator to get independence.

$$\hat h_{CV} = \arg\min_h\;\underbrace{\frac{1}{n^2h}\sum_i\sum_j\int k\!\left(\frac{x_i-x_j}{h} - s\right)k(s)\,ds}_{\text{need to calculate the convolution for this part}} \;-\; \frac{2}{n}\sum_i \underbrace{\hat f_{-i}(x_i)}_{\text{leave-one-out estimator: the } i\text{th point is not used}}$$

where the leave-one-out estimator satisfies

$$\hat f_{-i}(x_i) \approx \hat f(x_i) - k(0)\,\frac{1}{nh}.$$
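A minimal sketch for a Gaussian kernel, where the convolution term has the closed form $(k*k)(a) = \frac{1}{\sqrt{4\pi}}e^{-a^2/4}$ (the grid and data are illustrative):

```python
import numpy as np

def lscv_score(h, x):
    """Least-squares cross-validation criterion for a Gaussian kernel:
    integral of f_hat^2 (via the closed-form convolution) minus twice the
    average leave-one-out density at the data points."""
    n = len(x)
    d = (x[:, None] - x[None, :]) / h       # pairwise scaled differences
    term1 = np.exp(-0.25 * d**2).sum() / (n**2 * h * np.sqrt(4 * np.pi))
    k = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)                # leave-one-out: drop i = j terms
    term2 = 2.0 * k.sum() / (n * (n - 1) * h)
    return term1 - term2

# grid search, as usually implemented
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
grid = np.linspace(0.05, 1.0, 40)
h_cv = grid[np.argmin([lscv_score(h, x) for h in grid])]
```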


(1) It turns out that $\hat h_{CV}$ can only be estimated at rate $n^{-\frac{1}{10}}$ (a result due to Stone), which is very slow. However, in Monte Carlo studies LSCV often does better than the rule-of-thumb method.

(2) Marron (1990) found a tendency for local minima near small values of $h$, which leads to a tendency to undersmooth.

(3) LSCV is usually implemented by searching over a grid of bandwidth values.

4.1.5 Biased Cross-validation

$$AMISE = \frac{1}{nh}\int k^2(s)\,ds + \frac{h_n^4}{4}\left(\int s^2k(s)\,ds\right)^2\int f''(x)^2\,dx + o\!\left(\frac{1}{nh} + h^4\right)$$

What if we plug in $\hat f''(x)$? Note:

$$\hat f(x) = \frac{1}{nh}\sum_i k\!\left(\frac{x_i-x}{h}\right), \qquad \hat f'(x) = -\frac{1}{nh^2}\sum_i k'\!\left(\frac{x_i-x}{h}\right), \qquad \hat f''(x) = \frac{1}{nh^3}\sum_i k''\!\left(\frac{x_i-x}{h}\right)$$

$$\hat f''(x)^2 = \frac{1}{n^2h^6}\sum_{i=1}^{n}\sum_{j=1}^{n} k''\!\left(\frac{x_i-x}{h}\right) k''\!\left(\frac{x_j-x}{h}\right)$$

What is the bias in estimating $\int f''(x)^2\,dx$?

$$E\int\hat f''(x)^2\,dx = \int E\!\left(\frac{1}{n^2h^6}\sum_i k''\!\left(\frac{x_i-x}{h}\right)\sum_j k''\!\left(\frac{x_j-x}{h}\right)\right)dx,$$

where the diagonal ($i = j$) terms contribute

$$\frac{1}{nh^6}\int E\,k''\!\left(\frac{x_i-x}{h}\right)^2 dx.$$

One can show that

$$E\int\hat f''(x)^2\,dx = \int f''(x)^2\,dx + \frac{1}{nh^5}\int k''(s)^2\,ds + \text{higher order}.$$


$$\hat h_{BCV} = \arg\min_h\;\frac{1}{nh}\int k^2(s)\,ds + \frac{h^4}{4}\left(\int s^2k(s)\,ds\right)^2\Big[\int\hat f''(x)^2\,dx - \underbrace{\frac{1}{nh^5}\int k''(s)^2\,ds}_{\text{subtract this bias here}}\Big]$$

4.1.6 Second Generation Methods

Sheather and Jones (1991): "solve-the-equation" plug-in method.

$$\hat h_{SJPI} = \left[\frac{\int k^2(s)\,ds}{\int\hat f''_{\hat g(h)}(x)^2\,dx\,\left[\int s^2k(s)\,ds\right]^2}\right]^{\frac{1}{5}} n^{-\frac{1}{5}} \qquad \leftarrow \text{find the } h \text{ that solves this}$$

We need to plug in an estimate of $f''(x)$. The bandwidth that is optimal for estimating $f(x)$ is not the same as the optimal bandwidth for $f''$.

We can find an analogue of the AMSE for $R(f'') = \int f''(x)^2\,dx$ and get

$$\hat g = \left[C_1(k)\left\{R(f''')\right\}^{-\frac{1}{7}} C_2(k)\right] n^{-\frac{1}{7}},$$

where $C_1(k)$ and $C_2(k)$ are functions of the kernel and $R(f''') = \int f'''(x)^2\,dx$. Since $R(f'')$ enters the formula for the optimal bandwidth $h$, one can use that formula to substitute out $n$ and solve for $g$ as a function of $h$:

$$\hat g(h) = \left[C_3(k)\left\{\frac{R(f'')}{R(f''')}\right\}^{\frac{1}{7}} C_4(k)\right] h^{\frac{5}{7}}$$

Now we need an estimate of $f'''$. At this point, SJ suggest using a rule-of-thumb method based on a normal density assumption.

What is the main advantage of this method over other methods? For most methods,

$$\frac{\hat h - h_{MISE}}{h_{MISE}} \sim n^{-p}.$$


Recall that for CV, $p = \frac{1}{10}$. For SJPI,

$$p = \frac{5}{14} \quad\left(\text{close to } \frac{1}{2}\right),$$

so the rate of convergence of the bandwidth is faster. This method is available in some software packages.


Chapter 5

Bandwidth Selection for Nonparametric Regression

Recall that the asymptotic distribution of the LLR estimator was

$$\sqrt{nh_n}\left(\hat g_{LLR}(x_0)-g(x_0)\right) \sim N\!\left(\frac{1}{2}g''(x_0)\int k(s)\,s^2\,ds\cdot\sqrt{nh_n}\,h_n^2,\;\; f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds\right)$$

5.1 Plug-in Estimator (local method)

–Choose the bandwidth to minimize the MSE ("data-dependent"):

$$\min_{h_n}\;\underbrace{\left[\frac{1}{2}g''(x_0)\right]^2\left[\int k(s)\,s^2\,ds\right]^2 h_n^4}_{\text{Bias}^2 \,=\, C_0h_n^4} + \underbrace{\frac{f(x_0)^{-1}\sigma^2(x_0)}{nh_n}\int k^2(s)\,ds}_{\text{Variance} \,=\, C_1n^{-1}h_n^{-1}}$$

$$h_n^* = \left[\frac{f(x_0)^{-1}\sigma^2(x_0)\int k^2(s)\,ds}{g''(x_0)^2\left[\int k(s)\,s^2\,ds\right]^2}\right]^{\frac{1}{5}} n^{-\frac{1}{5}}, \qquad h_n^* = h_n^*(x_0): \text{ a variable bandwidth}$$

Note:

$$\min_h\; C_0h_n^4 + C_1n^{-1}h_n^{-1}: \qquad 4C_0h_n^3 - C_1n^{-1}h_n^{-2} = 0 \;\Rightarrow\; h_n^5 = \frac{C_1}{4C_0}\,n^{-1} \;\Rightarrow\; h_n = \left(\frac{C_1}{4C_0}\right)^{\frac{1}{5}} n^{-\frac{1}{5}}$$


Remarks: $g''$ high ⇒ smaller bandwidth (we want a smaller bandwidth in regions where the function is highly variable); $\sigma^2(x_0)$ high ⇒ use a larger bandwidth.

5.2 Global Plug-in Estimator

–Minimize a global criterion such as the MISE:

$$h_n = \left\{\frac{\int\sigma^2(x_0)\,dx_0\,\int k^2(s)\,ds}{\int g''(x_0)^2\,dx_0\,\left[\int k(s)\,s^2\,ds\right]^2}\right\}^{\frac{1}{5}} n^{-\frac{1}{5}}$$

(See Fan and Gijbels (1996) on local polynomial regression.)

5.3 Bandwidth Choice in Nonparametric Regression

$$y_i = m(x_i) + \varepsilon_i$$

$$\min\;\sum_{i=1}^{n}\left\{y_i - \alpha - \dots - \beta_p(x_i-x)^p\right\}^2 k\!\left(\frac{x_i-x}{h_n}\right) \qquad \leftarrow \text{how to choose } h_n, \text{ given } p?$$

5.4 Global MISE criterion

$$h_{MISE} = \left[\frac{(p+1)\,(p!)^2\,R(k_p)\overbrace{\int\sigma^2(x)\,dx}^{\operatorname{Var}(y|x)}}{2\,\mu_{p+1}(k_p)^2\int m^{(p+1)}(x)^2 f(x)\,dx}\right]^{\frac{1}{2p+3}} n^{-\frac{1}{2p+3}}$$

For LLR, $p = 1$, and we get $n^{-\frac{1}{5}}$; the formula applies for $p$ odd. See Fan and Gijbels (1996) on local polynomial regression. Here $k_p$ is a kernel that depends on the polynomial order $p$, with

$$R(k_p) = \int k_p(s)^2\,ds, \qquad \mu_l(k_p) = \int u^l\,k_p(u)\,du.$$

For $p = 1$ this reduces to

$$h_{MISE} = \left[\frac{R(k)\int\sigma^2(x)\,dx}{\mu_2(k)^2\int m''(x)^2 f(x)\,dx}\right]^{\frac{1}{5}} n^{-\frac{1}{5}}.$$


The unknowns are $\sigma^2(x)$, $m''(x)$, $f(x)$. Apply the same plug-in method to estimate $m''(x)$, then take the average. We still need $\int\sigma^2(x)\,dx$.

5.5 Blocking Method

(1) Divide the $x$ data into $N$ blocks.

(2) Estimate $y = m(x) + \varepsilon$ by a $Q$-degree polynomial within each block:

$$\hat\sigma^2(N) = (n - (Q+1)N)^{-1}\sum_i\sum_j\left\{y_i - \hat m_j^Q(x_i)\right\}^2 1(x_i \in X_j),$$

where $X_j$ denotes the $j$th block and $(Q+1)N$ is the total number of estimated coefficients (a degrees-of-freedom correction).

5.6 Bandwidth Selection for Nonparametric Regression: Cross-Validation

Recall that the LLR estimator is defined by

$$\hat a = \arg\min\;\sum_{i=1}^{n}\left(y_i - a - b(x_i-x_0)\right)^2 k\!\left(\frac{x_i-x_0}{h_n}\right)$$

Let

$$y = \begin{pmatrix} y_1\\ \vdots\\ y_N \end{pmatrix}, \qquad w = \begin{pmatrix} k\!\left(\frac{x_1-x_0}{h_n}\right) & & 0\\ & \ddots & \\ 0 & & k\!\left(\frac{x_N-x_0}{h_n}\right) \end{pmatrix} \;(N\times N), \qquad x = \begin{pmatrix} 1 & (x_1-x_0)\\ \vdots & \vdots\\ 1 & (x_N-x_0) \end{pmatrix} \;(N\times 2)$$

$$\hat g_{LLR}(x_0) = [1\;\;0]\,(x'wx)^{-1}(x'wy) \qquad \leftarrow \text{the intercept coefficient from a weighted least squares problem; the weights depend on } x_0, h$$

Let

$$H(h) = \underbrace{[1\;\;0]}_{1\times 2}\,\underbrace{(x'wx)^{-1}}_{2\times 2}\,\underbrace{x'}_{2\times N}\,\underbrace{w}_{N\times N} \qquad (1\times N),$$

so that $\hat g_{LLR}(x_0) = H(h)\,y$.

Choose $h$ to minimize $E\left((\hat g_{LLR}(x_0) - g(x_0))^2 \mid x\right)$, the conditional MSE at each point of evaluation. Stacking the points of evaluation (here, the data points), let $H(h)$ now denote the $N \times N$ smoother matrix whose $i$th row produces $\hat g(x_i)$:

$$E\left((H(h)y - g)'(H(h)y - g) \mid x\right) = E\left((H(h)g + H(h)\varepsilon - g)'(H(h)g + H(h)\varepsilon - g) \mid x\right)$$

$$= E\left(((H(h)-I)g + H(h)\varepsilon)'((H(h)-I)g + H(h)\varepsilon) \mid x\right)$$

$$= g'(H(h)-I)'(H(h)-I)g + E\left(\varepsilon' H(h)'H(h)\,\varepsilon \mid x\right)$$

$$= g'(H(h)-I)'(H(h)-I)g + E\left(\operatorname{tr}(H(h)'H(h)\,\varepsilon\varepsilon') \mid x\right) \qquad (\text{recall } \operatorname{tr} AB = \operatorname{tr} BA)$$

$$= g'(H(h)-I)'(H(h)-I)g + \sigma_\varepsilon^2\,\operatorname{tr}(H(h)'H(h)) \qquad (\text{using } E(\varepsilon\varepsilon'|x) = \sigma_\varepsilon^2 I \text{ and linearity of the trace}).$$

We need an estimator for this. Consider what is estimated by the sum of squared fitted residuals:

$$(y - H(h)y)'(y - H(h)y) = y'(I-H(h))'(I-H(h))\,y$$

$$= (g+\varepsilon)'(I-H(h))'(I-H(h))(g+\varepsilon)$$

$$= g'(I-H(h))'(I-H(h))g + g'(I-H(h))'(I-H(h))\varepsilon + \varepsilon'(I-H(h))'(I-H(h))g + \varepsilon'(I-H(h))'(I-H(h))\varepsilon$$

Taking conditional expectations term by term:

$$E\left(g'(I-H(h))'(I-H(h))g \mid x\right) = g'(I-H(h))'(I-H(h))g$$

$$E\left(g'(I-H(h))'(I-H(h))\varepsilon \mid x\right) = 0$$

$$E\left(\varepsilon'(I-H(h))'(I-H(h))\varepsilon \mid x\right) = \sigma^2\,\operatorname{tr}\left((I-H(h))'(I-H(h))\right) = \sigma^2\left[\operatorname{tr} I - 2\operatorname{tr} H(h) + \operatorname{tr}(H(h)'H(h))\right]$$

Together,

$$E\left[(y - H(h)y)'(y - H(h)y) \mid x\right] = g'(I-H(h))'(I-H(h))g + \sigma^2\operatorname{tr} I \underbrace{-\,2\sigma^2\operatorname{tr} H(h)}_{\text{don't want this term}} + \sigma^2\operatorname{tr}(H(h)'H(h))$$

If we use $\hat g_{-i}(x_i)$ instead of $\hat g(x_i)$ (the leave-one-out estimator), then the effective $\operatorname{tr} H(h) = 0$, and the criterion estimates the conditional MSE up to a term that does not depend on $h$:

$$\hat h_{CV} = \arg\min_{h\in\mathcal H}\;\sum_{i=1}^{n}\left(y_i - \hat g_{-i}(x_i)\right)^2$$
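A minimal sketch of this criterion, written with Nadaraya-Watson weights for brevity (for LLR one would replace the weighted average with the local WLS intercept; names and data are illustrative):

```python
import numpy as np

def loo_cv_score(h, x, y):
    """Leave-one-out CV criterion: sum of squared prediction errors,
    each y_i predicted by the kernel regression fit without observation i."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(k, 0.0)               # leave-one-out: zero own weight
    g_loo = (k @ y) / k.sum(axis=1)
    return np.sum((y - g_loo) ** 2)

# choose h over a grid
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-2, 2, 200))
y = np.sin(x) + 0.3 * rng.standard_normal(200)
grid = np.linspace(0.05, 1.0, 40)
h_cv = grid[np.argmin([loo_cv_score(h, x, y) for h in grid])]
```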