
Inference based on robust estimators

Matias Salibian-Barrera 1

Department of Statistics – University of British Columbia

ECARES - Dec 2007

Matias Salibian-Barrera (UBC) Robust inference ECARES - Dec 2007 1 / 190

UBC - University of British Columbia

Where we are

Who we are

- 43,000 students – 7,700 graduate students
- Department of Statistics
- 15 faculty members – joint appointments with CS, Hospitals, Research Institutes...
- Research: Spatial Statistics, Bayesian Statistics, Bioinformatics, Biostatistics, Functional Data Analysis, Missing and Longitudinal Data, Non-normal Multivariate Analysis, MCMC, Robustness
- http://www.stat.ubc.ca
- We are friendly! (come visit!)

Model

X_1, ..., X_n ∼ F ∈ H_ε

H_ε = { F : F(x) = (1 − ε) F_θ(x) + ε H(x) , H arbitrary }

Parameter of interest: θ ∈ R^p (or a subset of it)

Example

θ = (µ, σ) ∈ R × R⁺

F_(µ,σ)(x) = Φ((x − µ)/σ) ∼ N(µ, σ²)

F_(µ,σ)(x) = F_(0,1)((x − µ)/σ) , F_(0,1) fixed

Model: X = µ + σ W , where W ∼ F = (1 − ε) F_(0,1) + ε H

Parameter of interest: µ ∈ R; nuisance parameter: σ > 0.

Location-scale

X_i ∼ F_0((x − µ)/σ) , F_0 a fixed dist'n

MLE

(µ_n, σ_n) = arg max_{µ,σ} ∑_{i=1}^n log( f_0((X_i − µ)/σ) )

Score equations

∑_{i=1}^n g_0((X_i − µ_n)/σ_n) = 0 , g_0(t) = f_0′(t)/f_0(t)

Examples

F_0(x) = Φ(x) ∼ N(0, 1)

g_0(t) = −t ⇒ µ_n = ∑_{i=1}^n X_i / n

f_0(x) = exp(−|x|)/2

log(f_0(x)) = −|x| − log(2) ⇒ µ_n = arg min_µ ∑_{i=1}^n |X_i − µ|

⇒ µ_n = median(X_1, ..., X_n) = m_n

The MLE is most efficient at the model. How much do you trust the model?

- var(X̄_n) = σ²/n
- var(m_n) ≈ 1 / ( n 4 f(µ)² )
  - If data are normal: var(X̄_n)/var(m_n) ≈ 2/π ≈ 0.64
  - If data are double exponential: var(X̄_n)/var(m_n) ≈ 2
  - If data are F(x) = 0.85 Φ(x) + 0.15 Φ(x/3): var(X̄_n)/var(m_n) ≈ 1.13

[Figure: density of F(x) = 0.85 Φ(x) + 0.15 Φ(x/3)]

Tukey (1960): if ε > 0.10 then var(X̄_n) > var(m_n).
Efficiency over a range of plausible distributions.
Robustness measures: influence function, maximum bias, breakdown point.

M-estimators – Huber (1964)

MLE

(µ_n, σ_n) = arg min_{µ,σ} ∑_{i=1}^n − log( f_0((X_i − µ)/σ) )

∑_{i=1}^n g_0((X_i − µ_n)/σ_n) = 0 , g_0(t) = f_0′(t)/f_0(t)

M-estimators

µ_n = arg min_µ ∑_{i=1}^n ρ((X_i − µ)/σ_n)

∑_{i=1}^n Ψ((X_i − µ_n)/σ_n) = 0

Model ⇎ Estimator (the score function need not come from a likelihood)

M-estimators

Simultaneous scale estimation (Huber's Proposal II)

∑_{i=1}^n Ψ((X_i − µ_n)/σ_n) = 0

∑_{i=1}^n χ((X_i − µ_n)/σ_n) = b

[Figure: ρ(x) and Ψ(x) for the mean, the median, and a Huber-type M-estimator]

(Adaptive) weighted mean

[Figure: the Huber score function Ψ_c(x)]

Ψ_c(x) = x if |x| ≤ c , c sign(x) if |x| > c

∑_{i=1}^n Ψ((X_i − µ)/σ) = 0

∑_{i=1}^n [ Ψ((X_i − µ)/σ) / ((X_i − µ)/σ) ] ((X_i − µ)/σ) = 0

∑_{i=1}^n w_i (X_i − µ) = 0

w_i = w_i(µ, σ) = Ψ(r_i)/r_i = 1 if |X_i − µ|/σ ≤ c , c σ/|X_i − µ| if |X_i − µ|/σ > c

µ_n = ∑_{i=1}^n w_i(µ_n, σ_n) X_i / ∑_{j=1}^n w_j(µ_n, σ_n)

An iterative algorithm

- µ^(0) = median(X_1, ..., X_n) , σ_n = MAD(X_1, ..., X_n);
- µ^(j+1) = ∑_{i=1}^n w_i(µ^(j), σ_n) X_i / ∑_{j=1}^n w_j(µ^(j), σ_n) , j = 0, 1, ...
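The iterative reweighting above is easy to sketch. A minimal Python version (assumptions: Huber Ψ with c = 1.345, MAD scale normalized by Φ⁻¹(3/4) ≈ 0.6745 so that it estimates σ at the normal model; the function name is illustrative):

```python
import numpy as np

def huber_m_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means.

    mu(0) = median, scale = normalized MAD (held fixed), then
    mu(j+1) = sum(w_i * x_i) / sum(w_i) with w_i = Psi_c(r_i)/r_i.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    s = np.median(np.abs(x - mu)) / 0.6745   # MAD / Phi^{-1}(3/4)
    for _ in range(max_iter):
        r = (x - mu) / s
        # w_i = 1 inside [-c, c], c/|r_i| outside (see the weights above)
        w = np.where(np.abs(r) <= c, 1.0, c / np.abs(r))
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol * s:
            break
        mu = mu_new
    return mu
```

With a clean sample plus a cluster of outliers at 10, the estimate stays near the centre of the clean data, while the sample mean is dragged towards the outliers.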

Implementation:

> library(robustbase)
> set.seed(31)
> x <- c(rnorm(24), rnorm(6, mean=10, sd=.2))
> mean(x)
[1] 1.949270
> median(x)
[1] 0.1134845
> huberM(x, k=1.345)
$mu
[1] 0.3952441

$s
[1] 1.395404

$it
[1] 10
> huberM(x, k=0.01)$mu
[1] 0.1134845
> huberM(x, k=100)$mu
[1] 1.949270

Intuitively, the median is less affected by outliers than the mean.

Breakdown point – a formal measure of resistance to outliers. Hampel (1968, 1971); Donoho and Huber (1983).

"Smallest amount of outliers that is sufficient to make the estimator unbounded"

Finite-sample Breakdown Point – Donoho & Huber (1983)

µ_n = µ(X_1, ..., X_n)

ε*(X_1, ..., X_n) = inf { m/(n + m) : sup_{V_1,...,V_m} |µ(X_1, ..., X_n, V_1, ..., V_m)| = +∞ }

µ_n = X̄_n ⇒ ε*(X_1, ..., X_n) = 1/(n + 1) → 0

µ_n = m_n ⇒ ε*(X_1, ..., X_n) = n/(n + n) = 1/2
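The contrast between ε* = 1/(n + 1) and ε* = 1/2 is easy to see numerically. A small illustration (the data values are arbitrary):

```python
import numpy as np

# Ten well-behaved points, then one contaminating point V_1 = x0 that we
# let grow: the mean follows x0 (breakdown with m = 1), the median does not.
x = np.array([0.1, -0.3, 0.2, 0.0, -0.1, 0.4, -0.2, 0.3, 0.1, -0.4])
for x0 in (10.0, 1e3, 1e6):
    z = np.concatenate([x, [x0]])
    print(f"x0={x0:>9}: mean={z.mean():.4g}  median={np.median(z):.4g}")
```

The median of the contaminated sample never leaves the range of the clean data; to move it arbitrarily far one would need m = n additional points, i.e. the fraction n/(n + n) = 1/2.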

Asymptotic Breakdown Point

µ(X_1, ..., X_n) = µ(F_n) , F_n(x) = ∑_{i=1}^n I(X_i ≤ x)/n

µ : D → R , µ(F_n) → µ(F) as n → ∞

ε*(F) = inf { ε : 0 ≤ ε ≤ 1 , sup_G |µ(F_ε) − µ(F)| = ∞ }

F_ε = (1 − ε) F + ε G

Breakdown point of M-estimators

See Maronna, Martin and Yohai (2006).

If σ_n remains bounded (and away from zero), then µ_n given by

∑_{i=1}^n Ψ((X_i − µ_n)/σ_n) = 0

has

ε*(X_1, ..., X_n) = min(k_1, k_2)/(k_1 + k_2)

where

k_1 = − lim_{x→−∞} Ψ(x) , k_2 = lim_{x→+∞} Ψ(x)

and k_1 < +∞, k_2 < +∞.

Breakdown point of M-estimators

Consider F_ε ∈ H_ε,

F_ε = (1 − ε) F_0 + ε G ,

and let µ(F_ε) be the solution to

E_{F_ε}[ Ψ(X − µ(F_ε)) ] = 0

(1 − ε) E_{F_0}[ Ψ(X − µ(F_ε)) ] + ε E_G[ Ψ(X − µ(F_ε)) ] = 0

Take 0 ≤ ε < ε*. Then |µ(F_ε)| < A for some A < +∞. Take G = δ_{x_0}:

(1 − ε) E_{F_0}[ Ψ(X − µ(F_ε)) ] + ε Ψ(x_0 − µ(F_ε)) = 0

Letting x_0 → ∞, we have

Ψ(x_0 − µ(F_ε)) → k_2

Also −k_1 ≤ Ψ(u), thus

0 = (1 − ε) E_{F_0}[ Ψ(X − µ(F_ε)) ] + ε Ψ(x_0 − µ(F_ε))
  ≥ −k_1 (1 − ε) + ε Ψ(x_0 − µ(F_ε))
  → −k_1 (1 − ε) + ε k_2

k_1 (1 − ε) ≥ ε k_2

ε ≤ k_1 / (k_1 + k_2)

Letting x_0 → −∞, we have

ε ≤ k_2 / (k_1 + k_2)

Thus

ε ≤ min(k_1, k_2) / (k_1 + k_2)  ∀ ε < ε*

⇒ ε* ≤ min(k_1, k_2) / (k_1 + k_2)

Let ε ≥ ε* and let G_n be such that

µ_n = µ((1 − ε) F_0 + ε G_n) → +∞

0 = (1 − ε) E_{F_0}[ Ψ(X − µ_n) ] + ε E_{G_n}[ Ψ(X − µ_n) ]
  ≤ (1 − ε) E_{F_0}[ Ψ(X − µ_n) ] + ε k_2

⇒ 0 ≤ lim_n (1 − ε) E_{F_0}[ Ψ(X − µ_n) ] + ε k_2

By the Dominated Convergence Theorem,

0 ≤ (1 − ε) lim_n E_{F_0}[ Ψ(X − µ_n) ] + ε k_2 = −(1 − ε) k_1 + ε k_2

(1 − ε) k_1 ≤ ε k_2

k_1/(k_1 + k_2) ≤ ε  ∀ ε > ε*

If µ_n → −∞ we get

k_2/(k_1 + k_2) ≤ ε  ∀ ε > ε*

Hence, putting it all together, we obtain

ε* = min(k_1, k_2) / (k_1 + k_2)

Huber proposed a family of score functions Ψ_c:

Ψ_c(x) = x if |x| ≤ c , c sign(x) if |x| > c

Thus we have k_1 = k_2 = c and ε* = 1/2 (for any c > 0).

The median is associated with the score function

Ψ(x) = sign(x) ,

so that k_1 = k_2 = 1 and ε* = 1/2.

µ_n = arg min_µ ∑_{i=1}^n ρ((X_i − µ)/σ_n)

∑_{i=1}^n Ψ((X_i − µ_n)/σ_n) = 0

⇒ Need a (robust) scale estimator σ_n

Robust scale estimator

Consider r = (r_1, ..., r_n) and σ_n : R^n → R⁺ such that

- σ_n(r) ≥ 0;
- σ_n(b r) = |b| σ_n(r) for all b ∈ R;
- σ_n(|r_1|, ..., |r_n|) = σ_n(r); and
- σ_n is invariant under permutations.

Scale estimators

Different scales:

σ_n(r)² = ∑_{i=1}^n r_i² / n

σ_n(r) = median(|r_1|, ..., |r_n|)

M-scale (implicitly defined):

(1/n) ∑_{i=1}^n ρ(r_i / σ_n) = b

with ρ : R → R⁺ non-decreasing on [0, +∞); ρ(−r) = ρ(r); ρ(0) = 0; and b = E_{F_0} ρ(u) (consistency).

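The implicit M-scale equation can be solved by simple root finding, since (1/n) ∑ ρ(r_i/σ) is non-increasing in σ. A Python sketch (assumptions: bi-square ρ scaled to a maximum of 1, tuning constant d ≈ 1.5476 so that b = 1/2 gives consistency at the normal model; the names are illustrative):

```python
import numpy as np

def rho_bisquare(r, d):
    """Bounded bi-square loss: rho(0) = 0 and rho = 1 outside [-d, d]."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= d, 1 - (1 - (r/d)**2)**3, 1.0)

def m_scale(r, d=1.5476, b=0.5, tol=1e-10):
    """Solve (1/n) sum rho(r_i/s) = b for s by bisection."""
    r = np.asarray(r, dtype=float)
    lo, hi = 1e-12, 10 * np.max(np.abs(r)) + 1.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        # mean rho decreases as s grows: move towards the root
        if np.mean(rho_bisquare(r / mid, d)) > b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At the normal model this scale estimates σ, and because ρ is bounded it is not inflated by a minority of huge residuals.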

ρ(r) = 0 if |r| ≤ 1 , 1 if |r| > 1 (with b = 1/2) ⇒ σ_n = median(|r_1|, ..., |r_n|)

ρ(r) = r² (with b = 1) ⇒

(1/n) ∑_{i=1}^n ρ(r_i / σ_n) = 1

(1/n) ∑_{i=1}^n r_i² / σ_n² = 1

(1/n) ∑_{i=1}^n r_i² = σ_n²


Simultaneous estimation

∑_{i=1}^n Ψ((X_i − µ_n)/σ_n) = 0

(1/n) ∑_{i=1}^n ρ((X_i − µ_n)/σ_n) = b

⇒ µ_n has breakdown point lower than 1/2.

Preliminary scale:

σ_n = median(|X_1 − m_n|, ..., |X_n − m_n|) , m_n = median(X_1, ..., X_n)

∑_{i=1}^n Ψ((X_i − µ_n)/σ_n) = 0

⇒ µ_n has breakdown point 1/2.

Asymptotic distribution - heuristic Taylor expansion

Proper derivation - Huber (1967) / He and Shao (1996)

Allows us to compute the efficiency at the central model


Asymptotic distribution

µ_n → µ(F) and σ_n → σ(F), where

E_F( Ψ((X − µ(F))/σ(F)) ) = 0

0 = ∑_{i=1}^n Ψ((X_i − µ_n)/σ_n)
  = ∑_{i=1}^n Ψ((X_i − µ(F))/σ(F))
  − ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) (1/σ(F)) (µ_n − µ(F))
  − ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) ((X_i − µ(F))/σ(F)²) (σ_n − σ(F)) + R_n


(1/n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) (1/σ(F)) (µ_n − µ(F))
  = (1/n) ∑_{i=1}^n Ψ((X_i − µ(F))/σ(F))
  − (1/n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) ((X_i − µ(F))/σ(F)²) (σ_n − σ(F)) − R_n

a_n⁻¹ = (1/n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) > 0

(1/σ(F)) √n (µ_n − µ(F))
  = a_n (1/√n) ∑_{i=1}^n Ψ((X_i − µ(F))/σ(F))
  − (a_n/√n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) ((X_i − µ(F))/σ(F)²) (σ_n − σ(F))
  − a_n √n R_n

a_n⁻¹ → a(F)⁻¹ = E_F[ Ψ′((X − µ(F))/σ(F)) ]

If F is symmetric, then

(1/n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) ((X_i − µ(F))/σ(F)²) → 0

(Ψ(u) odd ⇒ Ψ′(u) even, and so Ψ′(u) u is odd.)

If, in addition, √n (σ_n − σ(F)) = O_p(1), then

(a_n/√n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) ((X_i − µ(F))/σ(F)²) (σ_n − σ(F))
  = (a_n/n) ∑_{i=1}^n Ψ′((X_i − µ(F))/σ(F)) ((X_i − µ(F))/σ(F)²) √n (σ_n − σ(F)) = o_p(1)

Finally, we will assume that √n R_n → 0.

Since

E_F[ Ψ((X − µ(F))/σ(F)) ] = 0 ,

then

(1/√n) ∑_{i=1}^n Ψ((X_i − µ(F))/σ(F)) →_D N(0, Q(F)²) as n → ∞

Q(F)² = E_F[ Ψ²((X − µ(F))/σ(F)) ]

(1/σ(F)) √n (µ_n − µ(F)) = a_n (1/√n) ∑_{i=1}^n Ψ((X_i − µ(F))/σ(F)) + o_p(1)

√n (µ_n − µ(F)) = σ(F) a_n (1/√n) ∑_{i=1}^n Ψ((X_i − µ(F))/σ(F)) + o_p(1)

√n (µ_n − µ(F)) →_D N(0, V(F)) as n → ∞

where

V(F) = σ(F)² E_F[ Ψ²((X − µ(F))/σ(F)) ] / { E_F[ Ψ′((X − µ(F))/σ(F)) ] }²

Simple CI for µ

µ_n ± 1.96 √( V(F_n)/n )

V(F_n) = σ_n² [ (1/n) ∑_{i=1}^n Ψ²((X_i − µ_n)/σ_n) ] / [ (1/n) ∑_{i=1}^n Ψ′((X_i − µ_n)/σ_n) ]²
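A plug-in version of this interval is short to write. A Python sketch (assumptions: Huber Ψ with c = 1.345, normalized-MAD scale, IRLS for the location estimate; `huber_ci` is an illustrative name, not a library function):

```python
import numpy as np

def huber_ci(x, c=1.345, z=1.96):
    """mu_n +/- z * sqrt(V(F_n)/n) with the plug-in variance
    V(F_n) = s^2 * mean(Psi^2(r)) / mean(Psi'(r))^2."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    s = np.median(np.abs(x - mu)) / 0.6745        # normalized MAD
    for _ in range(100):                          # IRLS for the M-estimate
        r = (x - mu) / s
        w = np.where(np.abs(r) <= c, 1.0, c / np.abs(r))
        mu = np.sum(w * x) / np.sum(w)
    r = (x - mu) / s
    psi = np.clip(r, -c, c)                       # Huber Psi
    num = np.mean(psi**2)                         # mean of Psi^2(r)
    den = np.mean(np.abs(r) <= c)**2              # (mean of Psi'(r))^2
    half = z * np.sqrt(s**2 * num / den / len(x))
    return mu - half, mu + half
```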

Empirical coverage of 95% CI for µ_0

Based on a 95%-efficient M-estimator

  ε      n = 20        n = 100       n = 200       n = 500
0.00   0.92 (0.86)   0.95 (0.40)   0.93 (0.28)   0.94 (0.18)
0.10   0.91 (1.05)   0.69 (0.49)   0.40 (0.35)   0.05 (0.22)
0.20   0.80 (1.44)   0.08 (0.67)   0.00 (0.47)   0.00 (0.30)

1000 random samples; outliers follow a N(10, 0.2²) distribution.

[Figure: confidence intervals for µ (horizontal axis from −1.0 to 1.0) for n = 50, 100, 500, 1000, 5000, 10000, 100000]

Bootstrap

Efron (1979)

Brief description (more / better comes later)

To approximate the sampling distribution of T(X_1, ..., X_n):

- For j in 1:B
  - Take a random sample X*_1, ..., X*_n from X_1, ..., X_n with replacement
  - Compute T*_j = T(X*_1, ..., X*_n)
- Use the "sample" T*_1, ..., T*_B to approximate the sampling distribution of T(X_1, ..., X_n)

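The recipe above is only a few lines of code. A minimal Python sketch (the helper name and the choice of statistic are illustrative):

```python
import numpy as np

def bootstrap_distribution(x, statistic, B=1000, seed=None):
    """Approximate the sampling distribution of statistic(X_1, ..., X_n)
    by recomputing it on B samples drawn with replacement (Efron, 1979)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    return np.array([statistic(rng.choice(x, size=n, replace=True))
                     for _ in range(B)])

# Usage: bootstrap "sample" for the median of 100 observations
x = np.random.default_rng(7).normal(size=100)
t_star = bootstrap_distribution(x, np.median, B=2000, seed=0)
```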

d(F*_{T,n}, F_{T,n}) → 0

In particular (depending on d), V(T) could be approximated by

V*(T) = (1/B) ∑_{j=1}^B ( T*_j − T̄* )²

where

T̄* = (1/B) ∑_{j=1}^B T*_j

A 95% confidence interval can be constructed as

µ_n ± 1.96 √( V*(T) )

or using estimated quantiles.


Empirical coverage of 95% bootstrap CI for µ_0

Based on a 95%-efficient M-estimator

  ε      n = 20        n = 100       n = 200       n = 500
0.00   0.92 (0.88)   0.93 (0.37)   0.95 (0.28)   0.94 (0.18)
0.10   0.95 (1.24)   0.63 (0.51)   0.45 (0.36)   0.05 (0.23)
0.20   0.99 (2.71)   0.27 (0.84)   0.00 (0.57)   0.00 (0.36)

100 random samples; outliers follow a N(10, 0.2²) distribution.

[Figure: bootstrap confidence intervals for µ (horizontal axis from −1.0 to 1.0) for n = 20, 100, 200, 500]

µ_n ± 1.96 √( V(F_n)/n )

We need to study both bias and variance.

For large samples, bias becomes more important.


Maximum Asymptotic Bias

X = µ_0 + ε , ε ∼ F ∈ H_ε(F_0)

H_ε(F_0) = { F : F = (1 − ε) F_0 + ε H }

µ_n = µ(F_n) → µ(F) ≠ µ(F_0) = µ_0

Maximum asymptotic bias

B_{F_0}(ε) = sup_{F ∈ H_ε(F_0)} |µ(F) − µ(F_0)| / σ_0

We can assume (wlog) that µ(F_0) = 0.

Let Ψ(u) be non-decreasing with

sup_u Ψ(u) = k < +∞

g(b) = E_{F_0}( Ψ(X + b) )

g(b) is increasing (if either Ψ is strictly increasing, or F_0′(u) = f_0(u) > 0 for all u ∈ R).

Let 0 ≤ ε < 1/2 and F(x) = (1 − ε) F_0(x) + ε H(x).

Then µ(F) solves

E_F Ψ(X − µ(F)) = 0 = (1 − ε) g(−µ(F)) + ε E_H Ψ(X − µ(F))

Since −k ≤ Ψ(u) ≤ k, we have

(1 − ε) g(−µ(F)) − ε k ≤ 0 ≤ (1 − ε) g(−µ(F)) + ε k

−k ε/(1 − ε) ≤ g(−µ(F)) ≤ k ε/(1 − ε)

|µ(F)| ≤ g⁻¹( k ε/(1 − ε) )

Taking H = δ_{x_0} with x_0 → ∞ shows that in that case

|µ(F)| = g⁻¹( k ε/(1 − ε) )

For the median, when F_0 = N(0, 1):

Ψ(u) = sign(u) ⇒ k = 1

g(b) = E_Φ sign(Z + b) = P_Φ(Z > −b) − P_Φ(Z < −b) = 1 − 2 Φ(−b) = 2 Φ(b) − 1

Setting g(b) = k ε/(1 − ε) = ε/(1 − ε):

2 Φ(b) − 1 = ε/(1 − ε) ⇒ Φ(b) = 1/[2 (1 − ε)]

b = Φ⁻¹( 1/[2 (1 − ε)] )

The same calculation works for any symmetric F_0.
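Plugging in values of ε reproduces the "Median" column of the table on the next slide; Φ⁻¹ is available in Python's standard library:

```python
from statistics import NormalDist

# Maximum asymptotic bias of the median at the normal model:
# b(eps) = Phi^{-1}( 1 / (2 (1 - eps)) )
phi_inv = NormalDist().inv_cdf
for eps in (0.05, 0.10, 0.20):
    print(eps, round(phi_inv(1 / (2 * (1 - eps))), 2))
# 0.05 -> 0.07, 0.10 -> 0.14, 0.20 -> 0.32
```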

  ε     Median   Ψ_1.345
0.00    0.00     0.00
0.05    0.07     0.09
0.10    0.14     0.18
0.20    0.32     0.42

F_0 = N(0, 1)

The median minimizes the maximum bias (Huber, 1981), but

√n (m_n − µ(F_0)) →_D N( 0, 1/(4 f(µ(F_0))²) ) as n → ∞

µ(F_0) = F_0⁻¹(1/2)

When F_0 = Φ:

- efficiency of the median: 2/π ≈ 0.64
- efficiency of the M-estimator with Ψ_1.345: 0.95
- difficulty of estimating f(µ(F_0)) for inference

Linear regression

Y = X′β_0 + ε

The errors are independent of the covariates.

β_n = arg min_{β ∈ R^p} ∑_{i=1}^n (Y_i − X′_i β)²

∑_{i=1}^n (Y_i − X′_i β_n) X_i = 0

Huber (1973)

β_n = arg min_{β ∈ R^p} ∑_{i=1}^n ρ_c( (Y_i − X′_i β)/σ_n )

∑_{i=1}^n Ψ_c( (Y_i − X′_i β_n)/σ_n ) X_i = 0

∑_{i=1}^n χ( (Y_i − X′_i β_n)/σ_n ) = b

Least squares

[Figure: Mean time vs. # of Shocks, with the least-squares fit]

Least squares + Huber

[Figure: Mean time vs. # of Shocks, with three fits: LS, LS minus outliers, Huber]

If Ψ is monotone and Y_i is a large outlier with high leverage (‖X_i‖ large), then

‖ Ψ( (Y_i − X′_i β)/σ_n ) X_i ‖ ≈ Ψ(+∞) ‖X_i‖ ,

which can then dominate the equation

∑_{i=1}^n Ψ_c( (Y_i − X′_i β)/σ_n ) X_i = 0

Breakdown of a monotone-Ψ M-estimator with high-leverage outliers

Let (Y_1, X_1) be such that Y_1/‖X_1‖ → ∞ while β_n remains bounded. Then

Y_1 − X′_1 β_n ≥ Y_1 − ‖X_1‖ ‖β_n‖ = ‖X_1‖ ( Y_1/‖X_1‖ − ‖β_n‖ ) → ∞ ,

thus

0 = Ψ_c( (Y_1 − X′_1 β_n)/σ_n ) X_1 + ∑_{i=2}^n Ψ_c( (Y_i − X′_i β_n)/σ_n ) X_i

cannot hold (the first term diverges while the second remains bounded).

We need a redescending score function Ψ (a bounded loss function ρ).

(Or we could downweight high-leverage points.)

Then the loss and score equations are not equivalent:

- multiple solutions to the score equations;
- need a criterion to select a robust solution;
- take the global minimum of the loss function.


Bi-square loss (Beaton and Tukey, 1974)

ρ_d(r) = 1 − [1 − (r/d)²]³ if |r| ≤ d , 1 if |r| > d

Ψ_d(r) = 6 r [1 − (r/d)²]² / d² if |r| ≤ d , 0 if |r| > d

[Figure: ρ_3(r) and Ψ_3(r), the bi-square loss and score functions with d = 3]
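The pair (ρ_d, Ψ_d) is straightforward to code, and it is worth seeing numerically that Ψ_d really vanishes outside [−d, d], so gross outliers drop out of the score equation entirely. A Python sketch (function names are illustrative):

```python
import numpy as np

def rho_bisquare(r, d=3.0):
    """Bi-square loss: bounded by 1, flat outside [-d, d]."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= d, 1 - (1 - (r/d)**2)**3, 1.0)

def psi_bisquare(r, d=3.0):
    """Derivative of rho_bisquare: redescending, zero outside [-d, d]."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= d, 6 * r * (1 - (r/d)**2)**2 / d**2, 0.0)
```

Note the contrast with the monotone Huber Ψ_c, which stays at the constant level c however large the residual gets.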

β_n = arg min_β ∑_{i=1}^n ρ_d( (Y_i − X′_i β)/σ_n )

is not equivalent to

∑_{i=1}^n Ψ_d( (Y_i − X′_i β_n)/σ_n ) X_i = 0

Non-convex problem – existence and uniqueness of a global minimum are an issue.
Need a good initial point.

[Figure: scatter plot of y vs. x, illustrating the objective f(β) = ∑_i ρ_d( (Y_i − X′_i β)/σ_n )]

[Figure: scatter plot of y vs. x, illustrating the objective f(β) = median_i |Y_i − X′_i β|]

Algorithms

Data-driven random search (Rousseeuw, 1984; Ruppert, 1992)

- Generate random lines using random pairs of points from the sample
- Find local minima near these random starts β_j
- Pick the best

Heuristics – simulated annealing, tabu search

Recent refinements of the random subsampling algorithm:

- fast-LTS, fast-MCD (Rousseeuw and van Driessen, 1999)
- fast-S (S-B and Yohai, 2006)
- fast-tau (S-B, Willems and Zamar, 2006)
- no-name-yet (Harrington and S-B, 2007)

The scale estimator σ_n

- measures the scale of the residuals;
- itself needs a regression / location estimator;
- a bit of a conundrum...

S-estimators

Rousseeuw and Yohai (1984)

Estimators based on minimizing a residual scale.

Let σ_n(r) be a scale estimator, and define

β_n = arg min_β σ_n( Y_1 − X′_1 β, ..., Y_n − X′_n β )

σ_n(r)² = ∑_{i=1}^n r_i² / n → LS:

β_n = arg min_β ∑_{i=1}^n (Y_i − X′_i β)²

σ_n(r) = ∑_{i=1}^n |r_i| / n → L1:

β_n = arg min_β ∑_{i=1}^n |Y_i − X′_i β|

σ_n(r)² = median(r_1², ..., r_n²) → LMS (Hampel, Rousseeuw):

β_n = arg min_β median_i (Y_i − X′_i β)²

σ_n(r)² = ∑_{i=1}^{[α n]} r²_(i) → LTS (Rousseeuw, 1984):

β_n = arg min_β ∑_{i=1}^{[α n]} (Y − X′β)²_(i)

σ_n(r) solves ∑_{i=1}^n ρ( r_i / σ_n(r) ) / n = b → S-estimators (Rousseeuw and Yohai, 1984)

LMS is not √n-consistent (Rousseeuw, 1984; Kim and Pollard, 1990).

LTS is less efficient than S-estimators.

High-breakdown S-estimators are not very efficient (Hössjer, 1992).

S-estimators are M-estimators

β_n = arg min_β σ_n(β) , where (1/n) ∑_{i=1}^n ρ( (Y_i − X′_i β)/σ_n(β) ) = b

β_n = arg min_β ∑_{i=1}^n ρ( (Y_i − X′_i β)/σ_n ) , where σ_n = σ_n(β_n)

[Figure: Mean time vs. # of Shocks, with Huber, LMS, LTS and S fits]

Breakdown point of S-estimators

Tuning of ρ (and b) to obtain LMS

Maximum asymptotic bias

Breakdown point

β_n = arg min_β σ_n(β) , (1/n) ∑_{i=1}^n ρ( (Y_i − X′_i β)/σ_n(β) ) = b

For consistency at the model, we need

b = E_{F_0} ρ(r/σ_0)

ε* = min( b/ρ(+∞) , 1 − b/ρ(+∞) )

ρ(+∞) = lim_{r→+∞} ρ(r)

Consider

ρ_d(r) = 0 if |r| ≤ d , 1 if |r| > d

Then, for normal errors,

E_Φ ρ_d(r) = P_Φ(|Z| > d) = 2 [1 − Φ(d)]

To obtain maximum BP we set

E_Φ ρ_d(r) = 1/2 ⇒ d = Φ⁻¹(3/4)


Thus

(1/n) ∑_{i=1}^n ρ_d(r_i / σ_n) = 1/2

# { i : |r_i| ≥ d σ_n } = n/2

# { i : |r_i|/d ≥ σ_n } = n/2

σ_n = median(|r_1|, ..., |r_n|) / Φ⁻¹(3/4)

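This is the usual normalized MAD. A quick numerical check that the normalization Φ⁻¹(3/4) ≈ 0.6745 makes it consistent for σ at the normal model (an illustration only):

```python
import numpy as np
from statistics import NormalDist

d = NormalDist().inv_cdf(0.75)         # Phi^{-1}(3/4), about 0.6745
r = np.random.default_rng(5).normal(scale=2.0, size=5000)
s = np.median(np.abs(r)) / d           # normalized MAD
print(s)                               # close to the true sigma = 2
```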

Maximum bias

        ε = 0.05   0.10   0.15   0.20
LTS       0.63     1.02   1.46   2.02
LMS       0.53     0.83   1.13   1.52
S         0.56     0.88   1.23   1.65

Maximum bias – 50% breakdown point

∑_{i=1}^n Ψ_d( (Y_i − X′_i β_n)/σ_n ) X_i = 0

0 = ∑_{i=1}^n Ψ_c( (Y_i − X′_i β_n)/σ_n ) X_i
  = ∑_{i=1}^n Ψ_c( (Y_i − X′_i β_0)/σ_0 ) X_i
  − ∑_{i=1}^n Ψ′_c( (Y_i − X′_i β_0)/σ_0 ) X_i X′_i/σ_0 (β_n − β_0)
  − ∑_{i=1}^n Ψ′_c( (Y_i − X′_i β_0)/σ_0 ) ( (Y_i − X′_i β_0)/σ_0² ) (σ_n − σ_0) X_i + R_n

√n (β_n − β_0) →_D N_p(0, Σ) as n → ∞

Σ = σ_0² [ E_{F_0}(Ψ_c²(r)) / ( E_{F_0}(Ψ′_c(r)) )² ] E_{G_0}(X X′)⁻¹

Efficiencies and Maximum bias

        ε = 0.05   0.10   0.15   0.20    Eff
LTS       0.63     1.02   1.46   2.02    0.07
LMS       0.53     0.83   1.13   1.52    0.00
S         0.56     0.88   1.23   1.65    0.29

Maximum bias & efficiencies – 50% breakdown point

Need to find

β_n = arg min_β ∑_{i=1}^n ρ_d( (Y_i − X′_i β)/σ_n )

or, at least, a robust solution to

∑_{i=1}^n Ψ_d( (Y_i − X′_i β_n)/σ_n ) X_i = 0

(and we need σ_n)


MM-estimators

(Yohai, 1987)

Let β_n0 be a consistent, high-BP estimator.

Let σ_n be a high-BP M-scale estimator using β_n0:

(1/n) ∑_{i=1}^n ρ_0( (Y_i − X′_i β_n0)/σ_n ) = 1/2

Find a local minimum β_n of

f(β) = ∑_{i=1}^n ρ_1( (Y_i − X′_i β)/σ_n )

such that f(β_n) ≤ f(β_n0).

Need ρ_1(r) ≤ ρ_0(r) ∀ r.
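The "find a local minimum starting from β_n0" step is typically done by iteratively reweighted least squares with the scale held fixed. A simplified Python sketch (assumptions: bi-square ρ_1 with d = 4.68 for roughly 95% efficiency, a least-squares fit standing in for the high-BP initial estimator, normalized-MAD scale; all names are illustrative):

```python
import numpy as np

def mm_step(X, y, beta0, s, d=4.68, n_iter=50):
    """IRLS descent on f(beta) = sum rho_1((y - X beta)/s), with s fixed.

    Weights w_i = Psi_1(r_i)/r_i for the bi-square: (1 - (r/d)^2)^2 inside
    [-d, d] and 0 outside, so gross outliers are excluded from the fit.
    """
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        r = (y - X @ beta) / s
        w = np.where(np.abs(r) <= d, (1 - (r/d)**2)**2, 0.0)
        XtWX = X.T @ (w[:, None] * X)     # weighted normal equations
        XtWy = X.T @ (w * y)
        beta = np.linalg.solve(XtWX, XtWy)
    return beta
```

Starting from a least-squares fit distorted by a handful of gross outliers, the iterations drive the outliers' weights to zero and recover the fit to the clean points.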


Retains the BP of β_n0.

Has efficiency given by

∑_{i=1}^n Ψ_1( (Y_i − X′_i β_n)/σ_n ) X_i = 0

where

Ψ_1(r) = ρ_1′(r)

(the efficiency can be set by the choice of ρ_1)

√n (β_n − β_0) →_D N_p(0, Σ) as n → ∞

Σ = σ_0² [ E_{F_0}(Ψ_1²(r)) / ( E_{F_0}(Ψ_1′(r)) )² ] E_{G_0}[X X′]⁻¹

  = σ_0² [ E_{H_0}(Ψ_1′(r) X X′) ]⁻¹ [ E_{H_0}(Ψ_1²(r) X X′) ] [ E_{H_0}(Ψ_1′(r) X X′) ]⁻¹

where H_0(r, x) = G_0(x) F_0(r)

Example with robustbase


> library(robustbase)
> toxi <- read.table('toxicity.txt', header=FALSE)
> names(toxi)[1] <- 'y'
> dim(toxi)
[1] 38 10
> a.lm <- lm(y ~ ., data=toxi)
> plot(a.lm)

[Figure: Residuals vs Fitted for lm(y ~ .) – observations 28, 34 and 38 stand out]

[Figure: Normal Q–Q plot of standardized residuals for lm(y ~ .) – observations 28, 34 and 38 stand out]

[Figure: Scale–Location plot for lm(y ~ .) – observations 28, 34 and 38 stand out]

[Figure: Residuals vs Leverage with Cook's distance contours for lm(y ~ .) – observations 28, 32 and 38 stand out]

Efficiencies for bisquare score functions

    Efficiency:           0.80  0.85  0.90  0.95
    Tuning constant:      3.14  3.44  3.88  4.68

> a.lmrob.85 <- lmrob(y ~ ., data=toxi,
+    control=lmrob.control(nResample=5000, tuning.psi=3.44, compute.rd=TRUE))
>
> a.lmrob.90 <- lmrob(y ~ ., data=toxi,
+    control=lmrob.control(nResample=5000, tuning.psi=3.88, compute.rd=TRUE))
>
> a.lmrob.95 <- lmrob(y ~ ., data=toxi,
+    control=lmrob.control(nResample=5000, compute.rd=TRUE))
>
> plot(a.lmrob.85)

[Figures: lmrob diagnostic plots for a.lmrob.85 (tuning.psi = 3.44) — Standardized residuals vs Robust Distances, Normal Q-Q of the residuals, Response vs Fitted Values, and Residuals vs Fitted Values.]

> summary(a.lm)
Call:
lm(formula = y ~ ., data = toxi)

Residuals:
     Min       1Q   Median       3Q      Max
-0.36704 -0.09072 -0.01605  0.05775  0.50947

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.973446   6.538420  -1.067  0.29529
V2           0.317054   0.136360   2.325  0.02754 *
V3           0.059883   0.184185   0.325  0.74751
V4          -0.201126   0.057242  -3.514  0.00152 **
V5          -0.027091   0.173513  -0.156  0.87705
V6           0.012661   0.036188   0.350  0.72906
V7          -0.014451   0.017489  -0.826  0.41562
V8           5.896792   5.156774   1.144  0.26251
V9          -0.014075   0.011667  -1.206  0.23777
V10          0.008387   0.013845   0.606  0.54957
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.184 on 28 degrees of freedom
Multiple R-Squared: 0.8463, Adjusted R-squared: 0.7969
F-statistic: 17.14 on 9 and 28 DF, p-value: 3.520e-09

> summary(a.lmrob.85)

Call:
lmrob(formula = y ~ ., data = toxi, control = lmrob.control(nResample = 5000,
    tuning.psi = 3.44, compute.rd = TRUE))

Weighted Residuals:
     Min       1Q   Median       3Q      Max
-0.13540 -0.01594  0.01612  0.25659  2.33151

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.763606   5.022955  -0.948  0.35106
V2           0.500946   0.032760  15.291 4.03e-15 ***
V3           0.140541   0.060796   2.312  0.02837 *
V4           0.495203   0.081339   6.088 1.44e-06 ***
V5           0.245450   0.195695   1.254  0.22012
V6          -0.028718   0.009201  -3.121  0.00415 **
V7          -0.027577   0.005072  -5.437 8.41e-06 ***
V8          -1.790614   5.920822  -0.302  0.76456
V9           0.023948   0.010537   2.273  0.03091 *
V10         -0.036026   0.022852  -1.576  0.12615
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust residual standard error: 0.09632
Convergence in 22 IRWLS iterations

(...)
Robustness weights:
9 observations c(12,13,23,28,32,34,35,36,37) are outliers
with |weight| < 2.632e-06; one weight is ~= 1;
the remaining 28 ones are summarized as
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01005 0.95930 0.98950 0.93200 0.99550 0.99990

Algorithmic parameters:
tuning.chi         bb tuning.psi refine.tol    rel.tol
 1.5476400  0.5000000  3.4400000  0.0000001  0.0000001
nResample  max.it  groups  n.group  best.r.s  k.fast.s  k.max
     5000      50       5      400         2         1    200
trace.lev  compute.rd
        0           1
seed : int(0)

[Figures: the same lmrob diagnostic plots for the 95%-efficient fit a.lmrob.95 — Standardized residuals vs Robust Distances, Normal Q-Q of the residuals, Response vs Fitted Values, and Residuals vs Fitted Values.]

> library(MASS)
> a.lms <- lmsreg(y ~ ., data=toxi)
> a.lms
Call:
lqs.formula(formula = y ~ ., data = toxi, method = "lms")

Coefficients:
(Intercept)       V2       V3       V4       V5       V6
   -4.44985  0.50840  0.15560  0.83908  0.41175 -0.02570
         V7       V8       V9      V10
   -0.03311 -5.02900  0.03002 -0.06489

Scale estimates 0.03314 0.02720

> summary(a.lms)
             Length Class      Mode
crit          1     -none-     numeric
sing          1     -none-     character
coefficients 10     -none-     numeric
[...]
xlevels       0     -none-     list
model        10     data.frame list

> plot(a.lms)
Error in plot.window(xlim, ylim, log, asp, ...) :
  need finite 'xlim' values
In addition: Warning messages:
1: no non-missing arguments to min; returning Inf
2: no non-missing arguments to max; returning -Inf
3: no non-missing arguments to min; returning Inf
4: no non-missing arguments to max; returning -Inf

MM-regression estimators combine

I high breakdown point

I √n-consistency and asymptotic normality

I high efficiency at the central model

[Figures: a sequence of seven scatterplots of simulated (x, y) data ("MM – LS"), comparing the MM fit with the LS fit as a cluster of outliers is moved; the LS line is attracted by the outliers while the MM line remains stable.]

Maximum bias & efficiencies – 50% breakdown point

           ε = 0.05  0.10  0.15  0.20    Eff
    LTS       0.63   1.02  1.46  2.02   0.07
    LMS       0.53   0.83  1.13  1.52   0.00
    S         0.56   0.88  1.23  1.65   0.29
    MM        0.78   1.24  1.77  2.42   0.95
    MM+S      0.56   0.88  1.23  1.65   0.95

[Figures: a second sequence of nine "MM – LS" scatterplots of simulated (x, y) data with a moving outlier cluster, again contrasting the stability of the MM fit with the LS fit.]

Asymptotics revisited

The problem of the scale outside the model (Croux, Dhaene and Hoorelbeke, 2003; S-B, 2000).

Expanding the score equations around (β0, σ0):

    0 = ∑_{i=1}^n Ψc( (Yi − X′i βn) / σ ) Xi
      = ∑_{i=1}^n Ψc( (Yi − X′i β0) / σ0 ) Xi
      + ∑_{i=1}^n Ψc′( (Yi − X′i β0) / σ0 ) Xi X′i / σ0 (βn − β0)
      + ∑_{i=1}^n Ψc′( (Yi − X′i β0) / σ0 ) ( (Yi − X′i β0) / σ0² ) (σ − σ0) Xi + Rn

At the central model,

    √n (βn − β0) →D Np(0, Σ)  as n → ∞,

    Σ = σ0² [ E_F0(Ψc²(r)) / (E_F0(Ψc′(r)))² ] E_G0[X X′]⁻¹
      = σ0² [E_H0(Ψc′(r) X X′)]⁻¹ [E_H0(Ψc²(r) X X′)] [E_H0(Ψc′(r) X X′)]⁻¹ ,

where H0(r, x) = G0(x) F0(r).

Taking the variability of the scale estimate into account, however,

    √n (βn − β0) →D Np(0, Σ)  as n → ∞,

    Σ = σ² ( [E_H(Ψc′(r) X X′)]⁻¹ [E_H(Ψc²(r) X X′)] [E_H(Ψc′(r) X X′)]⁻¹
        − a E_H(ρ(r) Ψc(r) X′) [E_H(Ψc′(r) X X′)]⁻¹
        − [E_H(Ψc′(r) X X′)]⁻¹ E_H(ρ(r) Ψc(r) X) a′
        + E_H(ρ(r) − b)² a a′ ) ,

where

    a = [E_H(Ψc′(r) X X′)]⁻¹ E_H(Ψc′(r) r X) / E_H(ρ′(r) r)

and r = (Y − X′β0)/σ0.

Uniform asymptotic results over contamination neighbourhoods

I location (S-B and Zamar, 2004)

I linear regression (first attempt: Omelka and S-B, 2006)

Under "certain regularity assumptions",

    lim_{n→∞} sup_{F∈Hε} sup_{x∈R} | P_F{ √n (µn − µ(F)) / V(F) ≤ x } − Φ(x) | = 0

Assumptions: stringent conditions are required for uniform consistency of the S-location estimator

I uniform unique minimum – uniform "minimal" convexity

Extension to linear regression.


Trade-off between BP and the size of Hε where uniform asymptotics hold

    BP     ε
    0.50  0.11
    0.45  0.14
    0.40  0.17
    0.35  0.20
    0.30  0.24
    0.25  0.25

Back to confidence intervals

    βn j ± 1.96 √(Σjj)

Empirical coverage of 95% CI for µ0, based on a 95%-efficient MM-estimator with 50% BP

             p = 1     2     5    10
    ε 0.00    0.93  0.95  0.95  0.93
      0.10    0.69  0.67  0.69  0.65
      0.20    0.04  0.05  0.03  0.04

500 samples of size n = 100 – outliers concentrated at (x, y) = (4, 3)

[Figure: scatterplot of one such contaminated sample with the outlier cluster near (4, 3), showing the MM and LS fits ("MM – LS").]

Bootstrap

    µn = µ(X1, . . . , Xn) ,   Xi ∼ F

Plug-in principle:

    Fn ≈ F  ⇒  L(µn, F) ≈ L(µn, Fn)

    µ∗n = µ(V1, . . . , Vn) ,   Vi ∼ Fn

For the sample mean,

    µn = ∑_{i=1}^n Xi/n ,    µ∗n = ∑_{i=1}^n Vi/n ,   Vi ∼ Fn ,

where Fn is the empirical distribution:

    P(Vi ≤ t) = ∑_{j=1}^n I(Xj ≤ t)/n ,

    P(Vi = t) = 1/n  if t = Xj for some j = 1, . . . , n;  0 otherwise.

    E(µ∗n) = ∑_{i=1}^n E(Vi)/n = X̄n

    E(Vi²) = ∑_{i=1}^n Xi²/n

    V(µ∗n) = V(Vi)/n
           = ( ∑_{i=1}^n Xi²/n − X̄n² ) / n
           = [ ∑_{i=1}^n (Xi − X̄n)² / n ] / n
           = s²/n ≈ V(X̄n) = σ²/n

Problem: L(µn, Fn) is generally unknown. It can be estimated (simulated) by re-computing µn on a large number of pseudo-random samples drawn from Fn:

    for(j in 1:B) {
        draw V1, . . . , Vn ∼ Fn
        mu[j] = µ(V1, . . . , Vn)
    }
    V̂(µn) = var(mu)
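As a concrete (non-robust) illustration of this recipe, here is a small Python sketch, under our own simulation setup, that bootstraps the sample mean and compares the Monte Carlo bootstrap variance with the closed-form answer s²/n derived above:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.5, size=200)   # X ~ N(0, 1.5^2), as in the slides' setup
B = 2000

# Draw B samples V_1,...,V_n from F_n (resample x with replacement)
# and recompute the estimator on each one.
boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(B)])
boot_var = boot.var(ddof=0)

plug_in = x.var(ddof=0) / x.size     # s^2/n: the exact bootstrap variance of the mean
print(boot_var, plug_in)             # the two estimates agree closely
```

For the sample mean the simulation is of course unnecessary (the bootstrap variance is available in closed form); the loop is the template that carries over to estimators like µn for which L(µn, Fn) has no closed form.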

Without outliers, X ∼ N(0, 1.5²):

      n   V(µ∗n)  V(Fn)    MC
     50     2.45   2.40   2.27
    100     2.43   2.42   2.48
    200     2.38   2.38   2.23
    500     2.37   2.37   2.45

With 10% outliers distributed as X ∼ Φ((x − 5)/0.5):

      n   V(µ∗n)  V(Fn)    MC
     50     4.57   4.56   3.08
    100     4.73   4.72   3.22
    200     4.66   4.67   3.17
    500     4.70   4.69   3.43

With 20% outliers distributed as X ∼ Φ((x − 5)/0.5):

      n   V(µ∗n)  V(Fn)    MC
     50     9.16   9.46   3.17
    100     9.34   9.42   3.32
    200     9.22   9.25   3.05
    500     9.16   9.24   3.56

With 30% outliers distributed as X ∼ Φ((x − 5)/0.5):

      n   V(µ∗n)  V(Fn)    MC
     50     11.6   11.1   2.25
    100     11.1   10.8   2.25
    200     10.7   10.6   2.12
    500     10.5   10.4   2.29

(500 samples – 200 bootstrap samples each)

Timing for n = 2000, p = 30: the average computing time of the estimator is 35 CPU seconds, so 2000 bootstrap samples take about 20 hours. Moreover, bootstrap samples can be highly affected by outliers.

Fast and Robust Bootstrap (S-B and Zamar, 2002)

I Faster than bootstrapping the estimator

I Able to downweight potential outliers in the bootstrap samples (they may come in larger proportions than in the original sample)

Fast and Robust Bootstrap

The MM- and S-estimating equations are

    ∑_{i=1}^n ρ1′( ri / σn ) Xi = 0 ,    (1/n) ∑_{i=1}^n ρ0( r̃i / σn ) = b ,

where ri = Yi − X′i βn and r̃i = Yi − X′i β̃n (β̃n the initial S-regression estimator). They can be rewritten as a weighted least-squares fit and a weighted average:

    βn = [ ∑_{i=1}^n ωi xi x′i ]⁻¹ ∑_{i=1}^n ωi xi yi ,

    σn = ∑_{i=1}^n vi (yi − β̃′n xi) ,

with weights

    ωi = ρ1′( ri / σn ) / ri ,    vi = (σn / (n b)) ρ0( r̃i / σn ) / r̃i .

[Figure: the weight function ω = ψ(r)/r plotted against r; it equals 1 near r = 0 and decreases to 0 for large |r|.]

The same expressions evaluated on a bootstrap sample, keeping the weights fixed, give

    β∗n = [ ∑_{i=1}^n ω∗i x∗i x∗′i ]⁻¹ ∑_{i=1}^n ω∗i x∗i y∗i ,

    σ∗n = ∑_{i=1}^n v∗i (y∗i − β̃′n x∗i) .

The Robust Bootstrap recalculation βR∗n is then defined through the linear correction

    βR∗n − βn = Kn (β∗n − βn) + dn (σ∗n − σn) ,

where

    Kn = σn [ ∑_{i=1}^n ρ1″( ri / σn ) xi x′i ]⁻¹ ∑_{i=1}^n ωi xi x′i ,

    dn = an⁻¹ [ ∑_{i=1}^n ρ1″( ri / σn ) xi x′i ]⁻¹ ∑_{i=1}^n ρ1″( ri / σn ) ri xi ,

    an = σn² (1/n) (1/b) ∑_{i=1}^n [ ρ0′( r̃i / σn ) r̃i / σn ] .

Adding the FRB column to the earlier simulations:

Without outliers, X ∼ N(0, 1.5²):

      n   V(µ∗n)  V(Fn)   FRB    MC
     50     2.45   2.40   2.27  2.27
    100     2.43   2.42   2.34  2.48
    200     2.38   2.38   2.35  2.23
    500     2.37   2.37   2.36  2.45

With 10% outliers distributed as X ∼ Φ((x − 5)/0.5):

      n   V(µ∗n)  V(Fn)   FRB    MC
     50     4.57   4.56   3.88  3.08
    100     4.73   4.72   3.88  3.22
    200     4.66   4.67   3.83  3.17
    500     4.70   4.69   3.81  3.43

Regression (slope), with 10% outliers at (10, 16):

      n   V(βn)  Σ(Fn)   FRB    MC
     50    1.74   0.53   0.56  0.77
    100    0.68   0.52   0.54  0.52
    200    0.52   0.51   0.52  0.48
    500    0.52   0.51   0.52  0.54

Regression (slope), with 20% outliers at (10, 16):

      n   V(βn)  Σ(Fn)   FRB    MC
     50    12.5   0.56   0.60  2.34
    100    8.69   0.57   0.59  0.55
    200    3.32   0.57   0.58  0.52
    500    0.60   0.57   0.57  0.57

(500 samples – 200 bootstrap samples each.) The bootstrap provides an estimator of the whole distribution, not only of the variance.

Theorem – consistency

Theorem (Salibian-Barrera and Zamar, 2002). Let ρ0 and ρ1 satisfy:

(R1) ρ is symmetric, twice continuously differentiable and ρ(0) = 0;
(R2) ρ is strictly increasing on [0, c] and constant on [c, ∞) for some finite constant c, with continuous third derivatives.

Let βn be the MM-regression estimator, σn the S-scale and β̃n the associated S-regression estimator, and assume that βn →P β, σn →P σ and β̃n →P β. Then, under certain regularity conditions, √n (βR∗n − βn) converges weakly, as n goes to infinity, to the same limit distribution as √n (βn − β).

Theorem – Breakdown point

               FR Bootstrap                 Classical Bootstrap
  p    n    q0.005  q0.025  q0.05       q0.005  q0.025  q0.05
  2   10     0.456   0.500  0.500        0.128   0.187  0.222
      20     0.500   0.500  0.500        0.217   0.272  0.302
      30     0.500   0.500  0.500        0.265   0.313  0.339
  5   10     0.191   0.262  0.304        0.011   0.025  0.036
      20     0.500   0.500  0.500        0.114   0.154  0.177
      30     0.500   0.500  0.500        0.185   0.226  0.249
     100     0.500   0.500  0.500        0.368   0.398  0.414
 10   20     0.257   0.315  0.347        0.005   0.012  0.018
      50     0.500   0.500  0.500        0.180   0.212  0.230
     100     0.500   0.500  0.500        0.294   0.322  0.336

Example

> attach(toxi)
> summary(a.lmrob.85)$coef[,2]
(Intercept)          V2          V3          V4          V5          V6
5.022954793 0.032760301 0.060795669 0.081339429 0.195694746 0.009200821
         V7          V8          V9         V10
0.005072355 5.920822057 0.010536611 0.022852485
> sqrt(diag(frb(a.lmrob.85)))
 [1] 6.74805639 0.11360617 0.20158115 0.26692828 0.18190995 0.02291613
 [7] 0.01246029 6.21339888 0.01485913 0.02480457
> dim(toxi)
[1] 38 10

Example

> summary(a.lmrob.95)$coef[,2]
(Intercept)          V2          V3          V4          V5          V6
 4.89848024  0.16891448  0.10310144  0.03644448  0.10455605  0.01855623
         V7          V8          V9         V10
 0.01190021  3.87072176  0.01061300  0.01624102
>
> sqrt(diag(frb(a.lmrob.95)))
 [1] 8.25505934 0.21430835 0.21686602 0.16039174 0.17656676 0.02965149
 [7] 0.01866741 6.78063852 0.01904172 0.02571575

General approach

Many robust estimators solve fixed-point equations

    θn = gn(θn) .

Bootstrapping these equations at the full-data estimator,

    θ∗n = g∗n(θn) ,

is fast (each evaluation is, e.g., a weighted mean or a weighted least-squares fit), but it underestimates the variability because the weights are not recomputed.

Linearizing around the limit θ,

    θn = gn(θn) = gn(θ) + ∇gn(θ) (θn − θ) + Rn ,

    √n (θn − θ) = [I − ∇gn(θ)]⁻¹ √n (gn(θ) − θ) + op(1) ,

and since

    √n (g∗n(θn) − θn) ≈ √n (g∗n(θ) − θ) ≈ √n (gn(θ) − θ) ,

we get

    √n (θ∗n − θn) ≈ √n (θn − θ) ≈ [I − ∇gn(θ)]⁻¹ √n (g∗n(θn) − θn) .

This suggests the corrected recalculation

    θR∗n − θn = [I − ∇gn(θn)]⁻¹ (g∗n(θn) − θn) .
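To make the recursion concrete, here is a small sketch — our own one-dimensional toy example, not the paper's regression implementation — for an M-location estimator with a fixed scale: the estimator is a fixed point of a weighted mean gn, the fast bootstrap applies gn once per resample at the full-data estimate, and the scalar factor [1 − gn′(θn)]⁻¹ (obtained here by numerical differentiation) supplies the linear correction. The Huber weights are an assumption for illustration.

```python
import numpy as np

def huber_weight(r, c=1.345):
    """Huber IRWLS weights psi(r)/r = min(1, c/|r|)."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def g(theta, x, s):
    """Fixed-point map: the M-location estimate solves theta = g(theta)."""
    w = huber_weight((x - theta) / s)
    return np.sum(w * x) / np.sum(w)

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 100)
s = 1.4826 * np.median(np.abs(x - np.median(x)))  # scale held fixed throughout

theta = np.median(x)            # full-data estimator: iterate to convergence
for _ in range(100):
    theta = g(theta, x, s)

eps = 1e-5                      # scalar analogue of [I - grad g]^{-1}, numerically
gprime = (g(theta + eps, x, s) - g(theta - eps, x, s)) / (2.0 * eps)
corr = 1.0 / (1.0 - gprime)

B = 1000
fast = np.empty(B)
for b in range(B):              # one cheap weighted-mean evaluation per resample
    xb = rng.choice(x, size=x.size, replace=True)
    fast[b] = theta + corr * (g(theta, xb, s) - theta)

print(fast.std(ddof=0))         # roughly the sampling sd of the estimator
```

Each bootstrap replicate costs a single weighted average rather than a full iterative fit, which is the source of the speed-up; the correction factor restores the variability lost by not recomputing the weights.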

Applications

Linear regression

I Standard errors (S-B and Zamar, 2002)

I Tests of hypotheses (S-B, 2005)

I Model selection (S-B and van Aelst, 2007)

Multivariate location / scatter – PCA (S-B, van Aelst and Willems, 2006)

Discriminant analysis (S-B, van Aelst and Willems, 2007)

Model selection

Linear regression data: (y1, x1), . . . , (yn, xn). Let α denote a subset of pα indices from {1, 2, . . . , p}, and consider the submodel

    yi = x′αi βα + σα εαi ,   i = 1, . . . , n .

All models α ∈ A are submodels of a "full" model, and σn is the S-scale estimate of the full model. For each model α ∈ A, the regression estimator βα,n solves

    (1/n) ∑_{i=1}^n ρ1′( (yi − x′αi βα,n) / σn ) xαi = 0 .

The expected prediction error (conditional on the observed data) is

    Mpe(α) = (σ²/n) E[ ∑_{i=1}^n ρ( (zi − x′αi βα) / σ ) | y, X ] ,

where z = (z1, . . . , zn)′ are future responses at X, independent of y.

Goodness of fit is measured by

    (σ²/n) E[ ∑_{i=1}^n ρ( (yi − x′αi βα) / σ ) ] .

Since parsimonious models are preferred (Müller and Welsh, 2005), penalize model size:

    Mppe(α) = (σ²/n) { E[ ∑_{i=1}^n ρ( (yi − x′αi βα) / σ ) ] + δ(n) pα } + Mpe(α) ,

where δ(n) → ∞ and δ(n)/n → 0 (e.g. δ(n) = log(n)).

Criteria

    Mpe_{m,n}(α) = (σn²/n) E∗[ ∑_{i=1}^n ρ( (yi − x′αi βα,n) / σn ) | y, X ] ,

    Mppe_{m,n}(α) = (σn²/n) { ∑_{i=1}^n ρ( (yi − x′αi βα,n) / σn ) + δ(n) pα } + Mpe_{m,n}(α) ,

where E∗ denotes the bootstrap mean. Select α ∈ A such that

    αpe_{m,n} = arg min_{α∈A} Mpe_{m,n}(α) ,    αppe_{m,n} = arg min_{α∈A} Mppe_{m,n}(α) .

Let Ac ⊂ A be the set of models α such that βα contains all non-zero components of β. In what follows we assume that Ac is not empty. The smallest model in Ac will be called the "true" model α0.

Theorem. Assume that:

(A1) n⁻¹ ∑ xαi x′αi → Γα > 0, n⁻¹ ∑ ωαi xαi x′αi → Γωα > 0, and n⁻¹ ∑ ‖xαi‖⁴ < ∞;
(A2) δ(n) = o(n/m) and m = o(n);
(A3) ∑_{i=1}^n ρ1′( ri(βα,n)/σn ) xαi = 0;
(A4) σn − σ = Op(1/√n) and βα,n − βα = Op(1/√n);
(A5) ρ1′ and ρ1″ are uniformly continuous, var(ρ1′(εα0)) < ∞, var(ρ1″(εα0)) < ∞ and E(ρ1″(εα0)) > 0; and
(A6) for any α ∉ Ac, var(ρ1′(εα)) < ∞ and, with probability one,

    lim inf_{n→∞} (1/n) ∑_{i=1}^n ρ1( ri(βα)/σn ) > lim_{n→∞} (1/n) ∑_{i=1}^n ρ1( ri(βα0,n)/σn ) .

Then

    lim_{n→∞} P(αppe_{m,n} = α0) = lim_{n→∞} P(αpe_{m,n} = α0) = 1 .

Example: Los Angeles Ozone Pollution Data

366 daily observations on 9 variables; the full model includes all second-order interactions, so p = 45, and computational complexity becomes a concern.

Example: backward elimination

Starting from the full model, select the size-(k − 1) model with the best selection criterion, and iterate. This reduces the search from 2^p to p(p + 1)/2 models.
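The elimination loop itself is generic. A minimal sketch, with a hypothetical `criterion` scoring function (lower is better) standing in for the robust Mpe/Mppe criteria above:

```python
def backward_elimination(criterion, p):
    """Backward elimination over variable subsets: start from the full model,
    repeatedly drop the single variable whose removal scores best, and return
    the best model seen along the way. Visits O(p^2) models instead of 2^p."""
    model = tuple(range(p))
    best_model, best_score = model, criterion(model)
    while len(model) > 1:
        # best size-(k-1) submodel of the current size-k model
        model = min((tuple(v for v in model if v != j) for j in model),
                    key=criterion)
        score = criterion(model)
        if score < best_score:
            best_model, best_score = model, score
    return best_model

# Toy criterion: variables {0, 1, 2} are "needed" (dropping one costs 10),
# every extra variable costs 1, so the ideal model is exactly (0, 1, 2).
needed = {0, 1, 2}
crit = lambda m: 10 * len(needed - set(m)) + len(set(m) - needed)
print(backward_elimination(crit, 6))  # -> (0, 1, 2)
```

In the slides' setting `criterion` would evaluate the bootstrap criterion Mppe_{m,n}(α) for the candidate subset, which is exactly where the fast and robust bootstrap makes the p(p + 1)/2 evaluations affordable.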

Using min_{α∈A} Mpe_{m,n}(α)   ⇒  p = 6
Using min_{α∈A} Mppe_{m,n}(α)  ⇒  p = 7
Full model                     ⇒  p = 45

Prediction error: 5-fold CV trimmed (γ) prediction error estimators

          αpe_{m,n} (p = 10)   αppe_{m,n} (p = 7)   Full model (p = 45)
    γ       TMSE      ρ          TMSE      ρ          TMSE      ρ
   0.05     11.67    5.36        10.45    5.03        10.78    5.03
   0.10      9.18                 8.35                 8.33

Diagnostic plots

[Figures: standardized residuals vs fitted values for each of the three fits (the two selected submodels and the full model).]

Average time (CPU seconds) to bootstrap an MM-regression estimator 1000 times on samples of size 200:

     p    FRB      CB
    25      8    1955
    35     28    4300
    45     35   10700

A full model selection analysis on the Ozone dataset (p = 45) is reduced from 15 days (360 hours) to 4 hours.
