
Inference based on robust estimators

Matias Salibian-Barrera

Department of Statistics – University of British Columbia

ECARES - Dec 2007


UBC - University of British Columbia



Where we are



Who we are

- 43000 students – 7700 graduate students
- Department of Statistics
- 15 faculty members – joint appointments with CS, Hospitals, Research Institutes, ...
- Research: Spatial S, Bayesian S, Bioinformatics, Biostatistics, Functional Data Analysis, Missing and Longitudinal data, Non-normal Multivariate, MCMC, Robustness
- http://www.stat.ubc.ca
- We are friendly! (come visit!)


Model

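\[ X_1, \dots, X_n \sim F \in \mathcal{H}_\varepsilon \]

\[ \mathcal{H}_\varepsilon = \left\{ F : F(x) = (1-\varepsilon)\, F_\theta(x) + \varepsilon\, H(x), \ H \text{ arbitrary} \right\} \]

Parameter of interest: $\theta \in \mathbb{R}^p$ (or a subset of it)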


Example

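\[ \theta = (\mu, \sigma) \in \mathbb{R} \times \mathbb{R}_+ \]

\[ F_{(\mu,\sigma)}(x) = \Phi((x-\mu)/\sigma), \quad \text{i.e. } N(\mu, \sigma^2) \]

\[ F_{(\mu,\sigma)}(x) = F_{(0,1)}((x-\mu)/\sigma), \qquad F_{(0,1)} \text{ fixed} \]

Model: $X = \mu + \sigma W$, where $W \sim F = (1-\varepsilon) F_{(0,1)} + \varepsilon H$

Parameter of interest: $\mu \in \mathbb{R}$

Nuisance parameter: $\sigma > 0$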


Location-scale

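\[ X_i \sim F_0((x-\mu)/\sigma), \qquad F_0 \text{ a fixed distribution} \]

MLE

\[ (\hat\mu, \hat\sigma) = \arg\max_{\mu,\sigma} \sum_{i=1}^n \log f_0((X_i-\mu)/\sigma) \]

Score equations

\[ \sum_{i=1}^n g_0((X_i-\hat\mu)/\hat\sigma) = 0, \qquad g_0(t) = f_0'(t)/f_0(t) \]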


Examples

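\[ F_0(x) = \Phi(x), \quad \text{i.e. } N(0,1) \]

\[ g_0(t) = -t \quad\Rightarrow\quad \hat\mu = \sum_{i=1}^n X_i / n \]

\[ f_0(x) = \exp(-|x|)/2 \]

\[ \log f_0(x) = -|x| - \log 2 \quad\Rightarrow\quad \hat\mu = \arg\min_\mu \sum_{i=1}^n |X_i - \mu| \]

\[ \Rightarrow\quad \hat\mu = \mathrm{median}(X_1, \dots, X_n) = m_n \]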


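MLE most efficient at the model

How much do you trust the model?

- $\mathrm{var}(\bar X_n) = \sigma^2/n$
- $\mathrm{var}(m_n) \approx 1/\left(4\, n\, f(\mu)^2\right)$

- If data are normal: $\mathrm{var}(\bar X_n)/\mathrm{var}(m_n) \approx 2/\pi \approx 0.64$
- If data are double exponential: $\mathrm{var}(\bar X_n)/\mathrm{var}(m_n) \approx 2$
- If data are $F(x) = 0.85\,\Phi(x) + 0.15\,\Phi(x/3)$: $\mathrm{var}(\bar X_n)/\mathrm{var}(m_n) \approx 1.13$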
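The three ratios above follow from $\mathrm{var}(\bar X_n)/\mathrm{var}(m_n) \approx 4 f(\mu)^2 \sigma^2$. A minimal sketch that evaluates this expression for the three distributions; the helper name `var_ratio` is introduced here for illustration only.

```r
# Asymptotic ratio var(mean) / var(median) = (sigma^2 / n) / (1 / (4 n f(mu)^2))
#                                          = 4 * f(mu)^2 * sigma^2
var_ratio <- function(density_at_center, variance) {
  4 * density_at_center^2 * variance
}

# Standard normal: f(0) = dnorm(0), variance 1
ratio_normal <- var_ratio(dnorm(0), 1)

# Double exponential with f0(x) = exp(-|x|) / 2: f(0) = 1/2, variance 2
ratio_laplace <- var_ratio(1 / 2, 2)

# Contaminated normal 0.85 N(0, 1) + 0.15 N(0, 3^2)
f0_mix  <- 0.85 * dnorm(0) + 0.15 * dnorm(0, sd = 3)
var_mix <- 0.85 * 1 + 0.15 * 3^2
ratio_mix <- var_ratio(f0_mix, var_mix)

print(c(normal = ratio_normal, laplace = ratio_laplace, mixture = ratio_mix))
```

A ratio below 1 favours the mean, a ratio above 1 favours the median.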


[Figure: density of $F(x) = 0.85\,\Phi(x) + 0.15\,\Phi(x/3)$]

Tukey (1960): if $\varepsilon > 0.10$ then $\mathrm{var}(\bar X_n) > \mathrm{var}(m_n)$.

Efficiency over a range of plausible distributions

Robustness measures: influence function, maximum bias, breakdown point.


M-estimators – Huber (1964)

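MLE

\[ (\hat\mu, \hat\sigma) = \arg\min_{\mu,\sigma} \sum_{i=1}^n -\log f_0((X_i-\mu)/\sigma) \]

\[ \sum_{i=1}^n g_0((X_i-\hat\mu)/\hat\sigma) = 0, \qquad g_0(t) = f_0'(t)/f_0(t) \]

M-estimators

\[ \hat\mu = \arg\min_\mu \sum_{i=1}^n \rho((X_i-\mu)/\hat\sigma) \]

\[ \sum_{i=1}^n \Psi((X_i-\hat\mu)/\hat\sigma) = 0 \]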

Model ⇐⇒| Estimator


M-estimators

Simultaneous scale estimation (Huber’s Proposal II)

\[ \sum_{i=1}^n \Psi((X_i-\hat\mu)/\hat\sigma) = 0 \]

\[ \sum_{i=1}^n \chi((X_i-\hat\mu)/\hat\sigma) = b \]


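[Figure: loss functions $\rho(x)$ and score functions $\Psi(x)$ for the mean, the median and a Huber-type M-estimator]

(Adaptive) weighted mean

[Figure: Huber score function $\Psi(x)$]

\[ \Psi_c(x) = \begin{cases} x & \text{if } |x| \le c, \\ c\, \mathrm{sign}(x) & \text{if } |x| > c. \end{cases} \]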


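\[ \sum_{i=1}^n \Psi((X_i-\mu)/\sigma) = 0 \]

\[ \sum_{i=1}^n \left[ \frac{\Psi((X_i-\mu)/\sigma)}{(X_i-\mu)/\sigma} \right] (X_i-\mu)/\sigma = 0 \]

\[ \sum_{i=1}^n w_i\, (X_i-\mu) = 0 \]

With $r_i = (X_i-\mu)/\sigma$,

\[ w_i = w_i(\mu,\sigma) = \Psi(r_i)/r_i = \begin{cases} 1 & \text{if } |X_i-\mu|/\sigma \le c, \\ c\,\sigma/|X_i-\mu| & \text{if } |X_i-\mu|/\sigma > c. \end{cases} \]

\[ \hat\mu = \sum_{i=1}^n w_i(\hat\mu,\hat\sigma)\, X_i \Big/ \sum_{j=1}^n w_j(\hat\mu,\hat\sigma) \]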

An iterative algorithm

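- $\hat\mu^{(0)} = \mathrm{median}(X_1, \dots, X_n)$, $\hat\sigma = \mathrm{MAD}(X_1, \dots, X_n)$;
- $\hat\mu^{(j+1)} = \sum_{i=1}^n w_i(\hat\mu^{(j)}, \hat\sigma)\, X_i \Big/ \sum_{l=1}^n w_l(\hat\mu^{(j)}, \hat\sigma)$, $j = 0, 1, \dots$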
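The iteration above can be sketched in a few lines of base R. This is an illustrative version under stated assumptions, not the code behind `huberM`: the function name `huber_location`, the stopping rule and the iteration cap are choices made here, and the scale is held fixed at the normalized MAD returned by `mad()`.

```r
# Iteratively reweighted mean for the Huber location M-estimator.
# Assumes mad(x) > 0; the scale is computed once and kept fixed.
huber_location <- function(x, k = 1.345, tol = 1e-8, max_it = 100) {
  sigma <- mad(x)      # normalized MAD
  mu <- median(x)      # starting value mu^(0)
  for (it in seq_len(max_it)) {
    r <- (x - mu) / sigma
    w <- pmin(1, k / abs(r))          # Huber weights Psi_k(r) / r
    mu_new <- sum(w * x) / sum(w)     # weighted mean
    converged <- abs(mu_new - mu) <= tol * sigma
    mu <- mu_new
    if (converged) break
  }
  list(mu = mu, s = sigma, it = it)
}

set.seed(1)
x <- c(rnorm(24), rnorm(6, mean = 10, sd = 0.2))
fit <- huber_location(x)
print(fit$mu)
```

With a very large `k` all weights equal 1 and the iteration returns the sample mean; with a very small `k` the result approaches the median.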


Implementation:

> library(robustbase)
> set.seed(31)
> x <- c(rnorm(24), rnorm(6, mean=10, sd=.2))
> mean(x)
[1] 1.949270
> median(x)
[1] 0.1134845
> huberM(x, k=1.345)
$mu
[1] 0.3952441

$s
[1] 1.395404

$it
[1] 10
> huberM(x, k=0.01)$mu
[1] 0.1134845
> huberM(x, k=100)$mu
[1] 1.949270


Intuitively, the median is less affected by outliers than the mean

Breakdown point – formal measure of resistance to outliers (Hampel, 1968, 1971; Donoho and Huber, 1983)

“Smallest amount of outliers that are sufficient to make the estimator unbounded”


Finite sample Breakdown Point Donoho & Huber (1983)

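\[ \hat\mu_n = \hat\mu(X_1, \dots, X_n) \]

\[ \varepsilon^*(X_1, \dots, X_n) = \inf\left\{ \frac{m}{n+m} : \sup_{V_1, \dots, V_m} |\hat\mu(X_1, \dots, X_n, V_1, \dots, V_m)| = +\infty \right\} \]

\[ \hat\mu_n = \bar X_n \ \Rightarrow\ \varepsilon^*(X_1, \dots, X_n) = 1/(n+1) \to 0 \]

\[ \hat\mu_n = m_n \ \Rightarrow\ \varepsilon^*(X_1, \dots, X_n) = n/(n+n) = 1/2 \]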


Asymptotic Breakdown Point

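\[ \hat\mu(X_1, \dots, X_n) = \mu(F_n), \qquad F_n(x) = \sum_{i=1}^n I(X_i \le x)/n \]

\[ \mu : \mathcal{D} \longrightarrow \mathbb{R} \]

\[ \mu(F_n) \xrightarrow[n\to\infty]{} \mu(F) \]

\[ \varepsilon^*(F) = \inf\left\{ \varepsilon : 0 \le \varepsilon \le 1,\ \sup_G |\mu(F_\varepsilon) - \mu(F)| = \infty \right\} \]

\[ F_\varepsilon = (1-\varepsilon)\, F + \varepsilon\, G \]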


Breakdown point of M-estimators

see Maronna, Martin and Yohai (2006)

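If $\hat\sigma$ remains bounded (and away from zero), then $\hat\mu_n$ given by

\[ \sum_{i=1}^n \Psi((X_i - \hat\mu_n)/\hat\sigma) = 0 \]

has

\[ \varepsilon^*(X_1, \dots, X_n) = \min(k_1, k_2)/(k_1 + k_2) \]

where

\[ k_1 = -\lim_{x\to-\infty} \Psi(x), \qquad k_2 = \lim_{x\to+\infty} \Psi(x) \]

and $k_1 < +\infty$, $k_2 < +\infty$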


Breakdown point of M-estimators

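Consider $F_\varepsilon \in \mathcal{H}_\varepsilon$

\[ F_\varepsilon = (1-\varepsilon) F_0 + \varepsilon G \]

and let $\mu(F_\varepsilon)$ be the solution to

\[ E_{F_\varepsilon}[\Psi(X - \mu(F_\varepsilon))] = 0 \]

\[ (1-\varepsilon)\, E_{F_0}[\Psi(X - \mu(F_\varepsilon))] + \varepsilon\, E_G[\Psi(X - \mu(F_\varepsilon))] = 0 \]

Take $0 \le \varepsilon < \varepsilon^*$. Then $|\mu(F_\varepsilon)| < A$ for some $A < +\infty$. Take $G = \delta_{x_0}$

\[ (1-\varepsilon)\, E_{F_0}[\Psi(X - \mu(F_\varepsilon))] + \varepsilon\, \Psi(x_0 - \mu(F_\varepsilon)) = 0 \]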


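Letting $x_0 \to \infty$, we have

\[ \Psi(x_0 - \mu(F_\varepsilon)) \to k_2 \]

Also $-k_1 \le \Psi(u)$, thus

\[ 0 = (1-\varepsilon)\, E_{F_0}[\Psi(X - \mu(F_\varepsilon))] + \varepsilon\, \Psi(x_0 - \mu(F_\varepsilon)) \]

\[ \ge -k_1 (1-\varepsilon) + \varepsilon\, \Psi(x_0 - \mu(F_\varepsilon)) \]

\[ \to -k_1 (1-\varepsilon) + \varepsilon\, k_2 \]

\[ k_1 (1-\varepsilon) \ge \varepsilon\, k_2 \]

\[ \varepsilon \le k_1/(k_1 + k_2) \]

Letting $x_0 \to -\infty$, we have

\[ \varepsilon \le k_2/(k_1 + k_2) \]

Thus

\[ \varepsilon \le \min(k_1, k_2)/(k_1 + k_2) \quad \forall\, \varepsilon < \varepsilon^* \]

\[ \Rightarrow\ \varepsilon^* \le \min(k_1, k_2)/(k_1 + k_2) \]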


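Let $\varepsilon > \varepsilon^*$ and let $G_n$ be such that

\[ \mu_n = \mu((1-\varepsilon) F_0 + \varepsilon G_n) \to +\infty \]

\[ 0 = (1-\varepsilon)\, E_{F_0}[\Psi(X - \mu_n)] + \varepsilon\, E_{G_n}[\Psi(X - \mu_n)] \]

\[ \le (1-\varepsilon)\, E_{F_0}[\Psi(X - \mu_n)] + \varepsilon\, k_2 \]

\[ \Rightarrow\ 0 \le \lim_n\, (1-\varepsilon)\, E_{F_0}[\Psi(X - \mu_n)] + \varepsilon\, k_2 \]

Dominated Convergence Theorem

\[ 0 \le (1-\varepsilon) \lim_n E_{F_0}[\Psi(X - \mu_n)] + \varepsilon\, k_2 \]

\[ \le -(1-\varepsilon)\, k_1 + \varepsilon\, k_2 \]

\[ (1-\varepsilon)\, k_1 \le \varepsilon\, k_2 \]

\[ k_1/(k_1 + k_2) \le \varepsilon \quad \forall\, \varepsilon > \varepsilon^* \]

If $\mu_n \to -\infty$ we get

\[ k_2/(k_1 + k_2) \le \varepsilon \quad \forall\, \varepsilon > \varepsilon^* \]

Hence, putting all together, we obtain

\[ \varepsilon^* = \min(k_1, k_2)/(k_1 + k_2) \]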


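Huber proposed a family of score functions $\Psi_c$

\[ \Psi_c(x) = \begin{cases} x & \text{if } |x| \le c, \\ c\, \mathrm{sign}(x) & \text{if } |x| > c. \end{cases} \]

Thus, we have $k_1 = k_2 = c$ and $\varepsilon^* = 1/2$ (for any $c > 0$)

The median is associated with the function

\[ \Psi(x) = \mathrm{sign}(x) \]

so that $k_1 = k_2 = 1$ and $\varepsilon^* = 1/2$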


\[ \hat\mu = \arg\min_\mu \sum_{i=1}^n \rho((X_i - \mu)/\hat\sigma) \]

\[ \sum_{i=1}^n \Psi((X_i - \hat\mu)/\hat\sigma) = 0 \]

$\Rightarrow$ Need a (robust) scale estimator $\hat\sigma$


Robust scale estimator

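Consider $r = (r_1, \dots, r_n)$

$\hat\sigma : \mathbb{R}^n \to \mathbb{R}_+$ such that

- $\hat\sigma(r) \ge 0$;
- $\hat\sigma(b\, r) = |b|\, \hat\sigma(r)$ for all $b \in \mathbb{R}$;
- $\hat\sigma(|r_1|, \dots, |r_n|) = \hat\sigma(r)$; and
- $\hat\sigma$ is invariant under permutations.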


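Scale estimators

Different scales:

\[ \hat\sigma(r)^2 = \sum_{i=1}^n r_i^2 / n \]

\[ \hat\sigma(r) = \mathrm{median}(|r_1|, \dots, |r_n|) \]

M-scale (implicitly defined):

\[ \frac{1}{n} \sum_{i=1}^n \rho(r_i/\hat\sigma) = b \]

- $\rho : \mathbb{R} \to \mathbb{R}_+$, non-decreasing on $[0, +\infty)$;
- $\rho(-r) = \rho(r)$;
- $\rho(0) = 0$; and
- $b = E_{F_0} \rho(u)$ (consistency)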


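\[ \rho(r) = \begin{cases} 0 & \text{if } |r| \le 1 \\ 1 & \text{if } |r| > 1 \end{cases} \quad\Rightarrow\quad \hat\sigma = \mathrm{median}(|r_1|, \dots, |r_n|) \]

\[ \rho(r) = r^2 \quad\Rightarrow\quad \frac{1}{n} \sum_{i=1}^n \rho(r_i/\hat\sigma) = 1 \]

\[ \frac{1}{n} \sum_{i=1}^n r_i^2/\hat\sigma^2 = 1 \]

\[ \frac{1}{n} \sum_{i=1}^n r_i^2 = \hat\sigma^2 \]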
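For a general $\rho$ the M-scale equation has no closed form and is solved numerically. A minimal sketch, assuming a bi-square $\rho$ with tuning constant 1.5476 and $b = 1/2$; the function names `rho_bisquare` and `m_scale` and the fixed-point update $\sigma^2 \leftarrow \sigma^2 \cdot \mathrm{mean}(\rho(r/\sigma))/b$ are choices made for this illustration.

```r
# Bi-square loss, bounded by 1
rho_bisquare <- function(r, d) {
  ifelse(abs(r) <= d, 1 - (1 - (r / d)^2)^3, 1)
}

# M-scale: solves mean(rho(r / s)) = b by fixed-point iteration.
# Assumes fewer than half of the residuals are exactly zero.
m_scale <- function(r, d = 1.5476, b = 0.5, tol = 1e-10, max_it = 500) {
  s <- median(abs(r)) / qnorm(0.75)   # starting value: normalized median of |r|
  for (it in seq_len(max_it)) {
    s_new <- s * sqrt(mean(rho_bisquare(r / s, d)) / b)
    done <- abs(s_new - s) <= tol * s
    s <- s_new
    if (done) break
  }
  s
}

set.seed(2)
r <- c(rnorm(80), rnorm(20, mean = 10))
s_hat <- m_scale(r)
print(c(m_scale = s_hat, sd = sd(r)))
```

The update increases the scale when the average loss exceeds $b$ and decreases it otherwise, so the fixed point satisfies the defining equation.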


Simultaneous estimation

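\[ \sum_{i=1}^n \Psi((X_i - \hat\mu)/\hat\sigma) = 0 \]

\[ \frac{1}{n} \sum_{i=1}^n \rho((X_i - \hat\mu)/\hat\sigma) = b \]

$\Rightarrow$ $\hat\mu$ has breakdown point lower than 1/2.

Preliminary scale:

\[ \hat\sigma = \mathrm{median}(|X_1 - m_n|, \dots, |X_n - m_n|) \]

\[ m_n = \mathrm{median}(X_1, \dots, X_n) \]

\[ \sum_{i=1}^n \Psi((X_i - \hat\mu)/\hat\sigma) = 0 \]

$\Rightarrow$ $\hat\mu$ has breakdown point 1/2.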


Asymptotic distribution - heuristic Taylor expansion

Proper derivation - Huber (1967) / He and Shao (1996)

Allows us to compute the efficiency at the central model


Asymptotic distribution

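$\hat\mu_n \to \mu(F)$ and $\hat\sigma_n \to \sigma(F)$ where

\[ E_F\left( \Psi((X - \mu(F))/\sigma(F)) \right) = 0 \]

\[ 0 = \sum_{i=1}^n \Psi\left( \frac{X_i - \hat\mu_n}{\hat\sigma_n} \right) = \sum_{i=1}^n \Psi\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) - \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \frac{1}{\sigma(F)} (\hat\mu_n - \mu(F)) \]
\[ \qquad - \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \left( \frac{X_i - \mu(F)}{\sigma(F)^2} \right) (\hat\sigma_n - \sigma(F)) + R_n \]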


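Dividing by $n$ and rearranging (here $R_n$ is relabelled to denote the remainder of the expansion divided by $-n$):

\[ \frac{1}{n} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \frac{1}{\sigma(F)} (\hat\mu_n - \mu(F)) = \frac{1}{n} \sum_{i=1}^n \Psi\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \]
\[ \qquad - \frac{1}{n} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \left( \frac{X_i - \mu(F)}{\sigma(F)^2} \right) (\hat\sigma_n - \sigma(F)) - R_n \]

\[ a_n^{-1} = \frac{1}{n} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) > 0 \]

\[ \frac{1}{\sigma(F)} \sqrt{n}\, (\hat\mu_n - \mu(F)) = a_n \frac{1}{\sqrt{n}} \sum_{i=1}^n \Psi\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \]
\[ \qquad - \frac{a_n}{\sqrt{n}} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \left( \frac{X_i - \mu(F)}{\sigma(F)^2} \right) (\hat\sigma_n - \sigma(F)) - a_n \sqrt{n}\, R_n \]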


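\[ a_n^{-1} \to a(F)^{-1} = E_F\left[ \Psi'\left( \frac{X - \mu(F)}{\sigma(F)} \right) \right] \]

If $F$ is symmetric then

\[ \frac{1}{n} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \left( \frac{X_i - \mu(F)}{\sigma(F)^2} \right) \to 0 \]

$\Psi(u)$ odd $\Rightarrow$ $\Psi'(u)$ even and so $\Psi'(u)\, u$ is odd

If, in addition, $\sqrt{n}\, (\hat\sigma_n - \sigma(F)) = O_p(1)$, then

\[ \frac{a_n}{\sqrt{n}} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \left( \frac{X_i - \mu(F)}{\sigma(F)^2} \right) (\hat\sigma_n - \sigma(F)) \]
\[ \qquad = \frac{a_n}{n} \sum_{i=1}^n \Psi'\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \left( \frac{X_i - \mu(F)}{\sigma(F)^2} \right) \sqrt{n}\, (\hat\sigma_n - \sigma(F)) = o_p(1) \]

Finally, we will assume that $\sqrt{n}\, R_n \to 0$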


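Since

\[ E_F\left[ \Psi\left( \frac{X - \mu(F)}{\sigma(F)} \right) \right] = 0 \]

then

\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n \Psi\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) \xrightarrow[n\to\infty]{D} N(0, Q(F)^2) \]

\[ Q(F)^2 = E_F\left[ \Psi^2\left( \frac{X - \mu(F)}{\sigma(F)} \right) \right] \]

\[ \frac{1}{\sigma(F)} \sqrt{n}\, (\hat\mu_n - \mu(F)) = a_n \frac{1}{\sqrt{n}} \sum_{i=1}^n \Psi\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) + o_p(1) \]

\[ \sqrt{n}\, (\hat\mu_n - \mu(F)) = \sigma(F)\, a_n \frac{1}{\sqrt{n}} \sum_{i=1}^n \Psi\left( \frac{X_i - \mu(F)}{\sigma(F)} \right) + o_p(1) \]

\[ \sqrt{n}\, (\hat\mu_n - \mu(F)) \xrightarrow[n\to\infty]{D} N(0, V(F)) \]

where

\[ V(F) = \sigma(F)^2\, \frac{ E_F\left[ \Psi^2\left( \frac{X - \mu(F)}{\sigma(F)} \right) \right] }{ \left\{ E_F\left[ \Psi'\left( \frac{X - \mu(F)}{\sigma(F)} \right) \right] \right\}^2 } \]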


Simple CI for µ

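\[ \hat\mu_n \pm 1.96 \sqrt{V(F_n)/n} \]

\[ V(F_n) = \hat\sigma^2\, \frac{ \sum_{i=1}^n \Psi^2\left( \frac{X_i - \hat\mu}{\hat\sigma} \right) / n }{ \left\{ \sum_{i=1}^n \Psi'\left( \frac{X_i - \hat\mu}{\hat\sigma} \right) / n \right\}^2 } \]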
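The plug-in interval above can be computed directly once $\hat\mu$ and $\hat\sigma$ are available. A minimal sketch for the Huber score function, in base R; the function name `huber_ci`, the fixed MAD scale and the inner reweighting loop are assumptions of this illustration.

```r
# Asymptotic confidence interval for a Huber location M-estimator.
# V(F_n) = sigma^2 * mean(Psi^2) / mean(Psi')^2, evaluated at the estimates.
huber_ci <- function(x, k = 1.345, level = 0.95) {
  sigma <- mad(x)
  mu <- median(x)
  for (it in 1:200) {                       # iteratively reweighted mean
    r <- (x - mu) / sigma
    w <- pmin(1, k / abs(r))
    mu_new <- sum(w * x) / sum(w)
    done <- abs(mu_new - mu) <= 1e-10 * sigma
    mu <- mu_new
    if (done) break
  }
  r <- (x - mu) / sigma
  psi  <- pmin(pmax(r, -k), k)              # Psi_k(r)
  dpsi <- as.numeric(abs(r) <= k)           # Psi_k'(r)
  V <- sigma^2 * mean(psi^2) / mean(dpsi)^2
  half <- qnorm(1 - (1 - level) / 2) * sqrt(V / length(x))
  c(estimate = mu, lower = mu - half, upper = mu + half, V = V)
}

set.seed(31)
x <- c(rnorm(24), rnorm(6, mean = 10, sd = 0.2))
out <- huber_ci(x)
print(out)
```

For very large `k` the score function is the identity on the data, and `V` reduces to the (biased) sample variance, so the interval becomes the usual normal-theory interval for the mean.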


Empirical coverage of 95% CI for µ0

Based on a 95%-efficient M-estimator

  ε      n = 20        n = 100       n = 200       n = 500
 0.00   0.92 (0.86)   0.95 (0.40)   0.93 (0.28)   0.94 (0.18)
 0.10   0.91 (1.05)   0.69 (0.49)   0.40 (0.35)   0.05 (0.22)
 0.20   0.80 (1.44)   0.08 (0.67)   0.00 (0.47)   0.00 (0.30)

1000 random samples

outliers follow a $N(10, 0.2^2)$ distribution


[Figure: confidence intervals (CI) by sample size, n = 50, 100, 500, 1000, 5000, 10000, 100000; horizontal axis from −1.0 to 1.0]


Bootstrap

Efron (1979)

Brief description (more / better comes later)

To approximate the sampling distribution of $T(X_1, \dots, X_n)$

- For j in 1:B
  - Take a random sample from $X_1, \dots, X_n$ with replacement: $X_1^*, \dots, X_n^*$
  - Compute $T_j^* = T(X_1^*, \dots, X_n^*)$

Use the “sample” $T_1^*, \dots, T_B^*$ to approximate the sampling distribution of $T(X_1, \dots, X_n)$


\[ d\left( F^*_{T,n}, F_{T,n} \right) \to 0 \]

In particular, (depending on $d$) $V(T)$ could be approximated by

\[ V^*(T) = \frac{1}{B} \sum_{j=1}^B \left( T_j^* - \bar T^* \right)^2 \]

where

\[ \bar T^* = \frac{1}{B} \sum_{j=1}^B T_j^* \]

A 95% confidence interval can be constructed as follows

\[ \hat\mu_n \pm 1.96 \sqrt{V^*(T)} \]

or using estimated quantiles
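The resampling scheme and the variance formula above can be sketched in base R as follows. The function name `bootstrap_var` and its arguments are introduced here for illustration; `stat` stands for the statistic $T$.

```r
# Nonparametric bootstrap estimate of the variance of a statistic.
bootstrap_var <- function(x, stat, B = 1000) {
  t_star <- numeric(B)
  for (j in seq_len(B)) {
    x_star <- sample(x, size = length(x), replace = TRUE)  # X*_1, ..., X*_n
    t_star[j] <- stat(x_star)                              # T*_j
  }
  # V*(T) with divisor B, as in the formula above
  list(t_star = t_star, var = mean((t_star - mean(t_star))^2))
}

set.seed(7)
x <- rnorm(50)
bv <- bootstrap_var(x, median, B = 500)

# Normal-approximation interval and percentile interval for the median
ci_normal   <- median(x) + c(-1, 1) * 1.96 * sqrt(bv$var)
ci_quantile <- quantile(bv$t_star, c(0.025, 0.975))
print(ci_normal)
print(ci_quantile)
```

The two intervals correspond to the two constructions mentioned above: one uses the bootstrap variance, the other the estimated quantiles of the bootstrap distribution.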


Empirical coverage of 95% bootstrap CI for µ0

Based on a 95%-efficient M-estimator

  ε      n = 20        n = 100       n = 200       n = 500
 0.00   0.92 (0.88)   0.93 (0.37)   0.95 (0.28)   0.94 (0.18)
 0.10   0.95 (1.24)   0.63 (0.51)   0.45 (0.36)   0.05 (0.23)
 0.20   0.99 (2.71)   0.27 (0.84)   0.00 (0.57)   0.00 (0.36)

100 random samples

outliers follow a $N(10, 0.2^2)$ distribution


[Figure: confidence intervals (CI) by sample size, n = 20, 100, 200, 500; horizontal axis from −1.0 to 1.0]


\[ \hat\mu_n \pm 1.96 \sqrt{V(F_n)/n} \]

We need to study both bias and variance

For large samples, bias becomes more important


Maximum Asymptotic Bias

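\[ X = \mu_0 + \epsilon \]

\[ \epsilon \sim F \in \mathcal{H}_\varepsilon(F_0) \]

\[ \mathcal{H}_\varepsilon(F_0) = \left\{ F : F = (1-\varepsilon)\, F_0 + \varepsilon\, H \right\} \]

\[ \hat\mu_n = \mu(F_n) \to \mu(F) \ne \mu(F_0) = \mu_0 \]

Maximum asymptotic biases

\[ B_{F_0}(\varepsilon) = \sup_{F \in \mathcal{H}_\varepsilon(F_0)} |\mu(F) - \mu(F_0)| / \sigma_0 \]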


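We can assume (wlog) that $\mu(F_0) = 0$

Let $\Psi(u)$ be non-decreasing

\[ \sup_u \Psi(u) = k < +\infty \]

\[ g(b) = E_{F_0}(\Psi(X + b)) \]

$g(b)$ is increasing (if either $\Psi$ is, or $F_0'(u) = f_0(u) > 0$ for all $u \in \mathbb{R}$)

Let $0 \le \varepsilon < 1/2$ and $F(x) = (1-\varepsilon) F_0(x) + \varepsilon H(x)$

Then $\mu(F)$ solves

\[ E_F \Psi(X - \mu(F)) = 0 = (1-\varepsilon)\, g(-\mu(F)) + \varepsilon\, E_H \Psi(X - \mu(F)) \]

Since $-k \le \Psi(u) \le k$ we have

\[ (1-\varepsilon)\, g(-\mu(F)) - \varepsilon k \le 0 \le (1-\varepsilon)\, g(-\mu(F)) + \varepsilon k \]

\[ -k\varepsilon/(1-\varepsilon) \le g(-\mu(F)) \le k\varepsilon/(1-\varepsilon) \]

\[ |\mu(F)| \le g^{-1}(k\varepsilon/(1-\varepsilon)) \]

Taking $H = \delta_{x_0}$ with $x_0 \to \infty$ shows that in that case

\[ |\mu(F)| = g^{-1}(k\varepsilon/(1-\varepsilon)) \]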


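For the median, when $F_0 = N(0, 1)$

\[ \Psi(u) = \mathrm{sign}(u) \ \Rightarrow\ k = 1 \]

\[ g(b) = E_\Phi\, \mathrm{sign}(u + b) = P_\Phi(Z > -b) - P_\Phi(Z < -b) = 1 - 2\,\Phi(-b) = 2\,\Phi(b) - 1 \]

\[ g(b) = k\varepsilon/(1-\varepsilon) = \varepsilon/(1-\varepsilon) \]

\[ 2\,\Phi(b) - 1 = \varepsilon/(1-\varepsilon) \ \Rightarrow\ \Phi(b) = 1/[2(1-\varepsilon)] \]

\[ b = \Phi^{-1}\left( 1/[2(1-\varepsilon)] \right) \]

Same calculation for any symmetric $F_0$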
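The maximum bias $g^{-1}(k\varepsilon/(1-\varepsilon))$ can be evaluated numerically at $F_0 = N(0,1)$. A minimal sketch, assuming unit scale and using base R's `integrate` and `uniroot`; the function names `max_bias_median` and `max_bias_huber` are hypothetical helpers written for this illustration.

```r
# Median: closed form b = qnorm(1 / (2 (1 - eps)))
max_bias_median <- function(eps) {
  qnorm(1 / (2 * (1 - eps)))
}

# Huber Psi_k: solve g(b) = k * eps / (1 - eps), with g(b) = E[Psi_k(Z + b)]
max_bias_huber <- function(eps, k = 1.345) {
  psi <- function(u) pmin(pmax(u, -k), k)
  g <- function(b) {
    integrate(function(z) psi(z + b) * dnorm(z), -Inf, Inf)$value
  }
  target <- k * eps / (1 - eps)
  if (target == 0) return(0)
  uniroot(function(b) g(b) - target, lower = 0, upper = 20, tol = 1e-10)$root
}

eps <- c(0.05, 0.10, 0.20)
bias_table <- data.frame(
  eps    = eps,
  median = sapply(eps, max_bias_median),
  huber  = sapply(eps, max_bias_huber)
)
print(round(bias_table, 2))
```

The root search relies on $g$ being increasing with $g(0) = 0$ and $g(b) \to k$ as $b \to \infty$, which holds for $\varepsilon < 1/2$.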


  ε      Median   Ψ_1.345
 0.00     0.00     0.00
 0.05     0.07     0.09
 0.10     0.14     0.18
 0.20     0.32     0.42

$F_0 = N(0, 1)$


Median minimizes the maximum bias (Huber, 1981), but

\[ \sqrt{n}\, (m_n - \mu(F_0)) \xrightarrow[n\to\infty]{D} N\left( 0, \frac{1}{4\, f(\mu(F_0))^2} \right) \]

\[ \mu(F_0) = F_0^{-1}(1/2) \]

When $F_0 = \Phi$

efficiency of the Median: $2/\pi \approx 0.64$

efficiency of the M-estimator with $\Psi_{1.345}$: 0.95

difficulty of estimating $f(\mu(F_0))$ for inference


Linear regression

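\[ Y = X'\beta_0 + \epsilon \]

errors are independent from the covariates

\[ \hat\beta_n = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (Y_i - X_i'\beta)^2 \]

\[ \sum_{i=1}^n (Y_i - X_i'\hat\beta_n)\, X_i = 0 \]

Huber (1973)

\[ \hat\beta_n = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \rho_c\left( \frac{Y_i - X_i'\beta}{\hat\sigma} \right) \]

\[ \sum_{i=1}^n \Psi_c\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = 0 \]

\[ \sum_{i=1}^n \chi\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) = b \]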


Least squares

[Figure: scatter plot of mean time against # of shocks, with the least squares fit]

Least squares + Huber

[Figure: scatter plot of mean time against # of shocks, with three fits: LS, LS minus outliers, Huber]


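If $\Psi$ is monotone and $Y_i$ is a large outlier with high leverage ($\|X_i\|$ large) then

\[ \left\| \Psi\left( \frac{Y_i - X_i'\beta}{\sigma} \right) X_i \right\| \approx \Psi(+\infty)\, \|X_i\| \]

which can then dominate the equation

\[ \sum_{i=1}^n \Psi_c\left( \frac{Y_i - X_i'\beta}{\sigma} \right) X_i = 0 \]

Breakdown of a monotone-Ψ M-estimator with high-leverage outliers

Let $(Y_1, X_1)$ be such that $Y_1/\|X_1\| \to \infty$, while $\hat\beta_n$ remains bounded

\[ Y_1 - X_1'\hat\beta_n \ge Y_1 - \|X_1\|\, \|\hat\beta_n\| = \|X_1\| \left( Y_1/\|X_1\| - \|\hat\beta_n\| \right) \to \infty \]

thus

\[ 0 = \Psi_c\left( \frac{Y_1 - X_1'\hat\beta_n}{\hat\sigma} \right) X_1 + \sum_{i \ge 2} \Psi_c\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i \]

cannot hold (first term diverging while the second term remains bounded)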


We need a redescending function Ψ (bounded loss function ρ)

(Or we could downweight high-leverage points)

Then loss and score equations are not equivalent

Multiple solutions to the score equations

Need a criterion to select a robust solution

Global minimum of loss function


Bi-square loss (Beaton and Tukey, 1974)

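\[ \rho_d(r) = \begin{cases} 1 - \left[ 1 - (r/d)^2 \right]^3 & \text{if } |r| \le d \\ 1 & \text{if } |r| > d \end{cases} \]

\[ \Psi_d(r) = \begin{cases} 6\, r \left[ 1 - (r/d)^2 \right]^2 / d^2 & \text{if } |r| \le d \\ 0 & \text{if } |r| > d \end{cases} \]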
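The two bi-square formulas above translate directly into vectorized R functions. This is a small sketch; the names `rho_bisquare` and `psi_bisquare` are chosen here, and the last lines compare $\Psi_d$ with a numerical derivative of $\rho_d$.

```r
# Bi-square (Tukey) loss: bounded by 1, constant for |r| > d
rho_bisquare <- function(r, d) {
  ifelse(abs(r) <= d, 1 - (1 - (r / d)^2)^3, 1)
}

# Its derivative: redescends to 0 for |r| > d
psi_bisquare <- function(r, d) {
  ifelse(abs(r) <= d, 6 * r * (1 - (r / d)^2)^2 / d^2, 0)
}

# Compare Psi with a central-difference derivative of rho for d = 3
r <- seq(-4, 4, by = 0.37)
h <- 1e-6
num_deriv <- (rho_bisquare(r + h, 3) - rho_bisquare(r - h, 3)) / (2 * h)
max_diff <- max(abs(num_deriv - psi_bisquare(r, 3)))
print(max_diff)
```

Because $\Psi_d$ vanishes outside $[-d, d]$, observations with large residuals get zero weight in the estimating equations.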


[Figure: bi-square loss $\rho_3(r)$ and score function $\Psi_3(r)$, plotted for $r$ between −4 and 4]


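\[ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n \rho_d\left( \frac{Y_i - X_i'\beta}{\hat\sigma} \right) \quad \Rightarrow \ (\text{but } \not\Leftarrow) \quad \sum_{i=1}^n \Psi_d\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = 0 \]

Non-convex problem – existence of a unique global minimum

Need a good initial point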


[Figure: scatter plot of y against x, illustrating the objective $f(\beta) = \sum_i \rho_d\left( (Y_i - X_i'\beta)/\sigma \right)$]


[Figure: scatter plot of y against x, illustrating the objective $f(\beta) = \mathrm{median}_i\, |Y_i - X_i'\beta|$]


Algorithms

Data-driven random search (Rousseeuw, 1984; Ruppert, 1992)

- Generate random lines using random pairs from the sample
- Find local minima near these random starts $\beta_j$
- Pick the best

Heuristic – Simulated Annealing – Tabu search

Recent refinements of the random subsampling algorithm

- fast-LTS, fast-MCD: Rousseeuw and van Driessen, 1999
- fast-S: S-B and Yohai, 2006
- fast-tau: S-B, Willems and Zamar, 2006
- no-name-yet: Harrington and S-B, 2007
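The basic random-pairs search can be sketched for simple regression as follows. This is only an illustration of the first and third steps: the local improvement step is omitted, the criterion is the median absolute residual, and the function name `lms_random_search` and the simulated data are assumptions made here.

```r
# Random search over lines through pairs of observations, keeping the line
# with the smallest median absolute residual.
lms_random_search <- function(x, y, n_starts = 500) {
  n <- length(x)
  best <- list(crit = Inf, intercept = NA_real_, slope = NA_real_)
  for (s in seq_len(n_starts)) {
    idx <- sample.int(n, 2)                    # random pair of observations
    if (x[idx[1]] == x[idx[2]]) next           # skip vertical lines
    slope <- (y[idx[2]] - y[idx[1]]) / (x[idx[2]] - x[idx[1]])
    intercept <- y[idx[1]] - slope * x[idx[1]]
    crit <- median(abs(y - intercept - slope * x))
    if (crit < best$crit) {
      best <- list(crit = crit, intercept = intercept, slope = slope)
    }
  }
  best
}

# Simulated example: 80 points on y = 1 + 2 x plus noise, 20 outliers near (5, -10)
set.seed(42)
x <- c(runif(80, -3, 3), rnorm(20, mean = 5, sd = 0.2))
y <- c(1 + 2 * x[1:80] + rnorm(80, sd = 0.3), rnorm(20, mean = -10, sd = 0.2))
fit <- lms_random_search(x, y)
print(c(fit$intercept, fit$slope))
print(coef(lm(y ~ x)))   # least squares, for comparison
```

The refinements listed above improve on this scheme mainly by adding local improvement steps to each random start and by organizing the computation so that fewer candidates need to be fully evaluated.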


The scale estimator σ

- Measures scale of the residuals
- itself needs a regression / location estimator
- A bit of a conundrum (spelling?)...


S-estimators

Rousseeuw and Yohai (1984)

Estimators based on minimizing a residual scale

Let $\hat\sigma(r)$ be a scale estimator, and define

\[ \hat\beta_n = \arg\min_\beta\, \hat\sigma(Y_1 - X_1'\beta, \dots, Y_n - X_n'\beta) \]


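- LS: $\hat\sigma(r)^2 = \sum_{i=1}^n r_i^2 / n$

\[ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i'\beta)^2 \]

- L1: $\hat\sigma(r) = \sum_{i=1}^n |r_i| / n$

\[ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n |Y_i - X_i'\beta| \]

- LMS (Hampel; Rousseeuw): $\hat\sigma(r)^2 = \mathrm{median}(r_1^2, \dots, r_n^2)$

\[ \hat\beta_n = \arg\min_\beta\, \mathrm{median}_i\, (Y_i - X_i'\beta)^2 \]

- LTS (Rousseeuw, 1984): $\hat\sigma(r)^2 = \sum_{i=1}^{[\alpha n]} r_{(i)}^2$

\[ \hat\beta_n = \arg\min_\beta \sum_{i=1}^{[\alpha n]} (Y - X'\beta)^2_{(i)} \]

- S-estimators (Rousseeuw and Yohai, 1984): $\hat\sigma(r)$ solves $\sum_{i=1}^n \rho(r_i/\hat\sigma(r))/n = b$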


LMS are not $\sqrt{n}$ consistent (Rousseeuw, 1984; Kim and Pollard, 1990)

LTS are less efficient than S-estimators

High-breakdown S-estimators are not very efficient (Hössjer, 1992).


S-estimators are M-estimators

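\[ \hat\beta_n = \arg\min_\beta\, \hat\sigma(\beta) \]

\[ \frac{1}{n} \sum_{i=1}^n \rho\left( \frac{Y_i - X_i'\beta}{\hat\sigma(\beta)} \right) = b \]

\[ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n \rho\left( \frac{Y_i - X_i'\beta}{\hat\sigma} \right) \quad \text{where } \hat\sigma = \hat\sigma(\hat\beta_n) \]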


[Figure: scatter plot of mean time against # of shocks, with the Huber, LMS, LTS and S fits]


Breakdown point of S-estimators

Tuning of ρ (b) to obtain LMS

Maximum asymptotic bias


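Breakdown point

\[ \hat\beta_n = \arg\min_\beta\, \hat\sigma(\beta) \]

\[ \frac{1}{n} \sum_{i=1}^n \rho\left( \frac{Y_i - X_i'\beta}{\hat\sigma(\beta)} \right) = b \]

For consistency at the model, we need

\[ b = E_{F_0}\, \rho(r/\sigma_0) \]

\[ \varepsilon^* = \min\left( b/\rho(+\infty),\ 1 - b/\rho(+\infty) \right) \]

\[ \rho(+\infty) = \lim_{r\to+\infty} \rho(r) \]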


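Consider

\[ \rho_d(r) = \begin{cases} 0 & \text{if } |r| \le d \\ 1 & \text{if } |r| > d \end{cases} \]

Then, for normal errors,

\[ E_\Phi\, \rho_d(r) = P_\Phi(|Z| > d) = 2\, [1 - \Phi(d)] \]

To obtain maximum BP we set

\[ E_\Phi\, \rho_d(r) = 1/2 \ \Rightarrow\ d = \Phi^{-1}(3/4) \]

Thus,

\[ \frac{1}{n} \sum_{i=1}^n \rho_d(r_i/\hat\sigma) = 1/2 \]

\[ \#\{ i : |r_i| \ge d\, \hat\sigma \} = n/2 \]

\[ \#\{ i : |r_i/d| \ge \hat\sigma \} = n/2 \]

\[ \hat\sigma = \mathrm{median}(|r_1|, \dots, |r_n|) / \Phi^{-1}(3/4) \]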
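The resulting scale is the normalized median of the absolute residuals. A short sketch showing that it matches base R's `mad()` when the centre is fixed at zero (the default constant 1.4826 in `mad()` is $1/\Phi^{-1}(3/4)$ rounded); the function name `median_scale` is chosen here for illustration.

```r
# Scale estimate obtained from the 0-1 loss with d = qnorm(3/4)
median_scale <- function(r) {
  median(abs(r)) / qnorm(0.75)
}

set.seed(3)
r <- rnorm(100, sd = 2)
s_hat <- median_scale(r)

# Same quantity via mad(), with residuals centred at 0 rather than at their median
s_mad <- mad(r, center = 0)

# Fraction of residuals with |r_i| > d * sigma: one half by construction
frac_above <- mean(abs(r) > qnorm(0.75) * s_hat)
print(c(s_hat = s_hat, s_mad = s_mad, frac_above = frac_above))
```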


Maximum bias

         ε = 0.05   0.10   0.15   0.20
 LTS       0.63     1.02   1.46   2.02
 LMS       0.53     0.83   1.13   1.52
 S         0.56     0.88   1.23   1.65

Maximum bias – 50% breakdown point


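\[ \sum_{i=1}^n \Psi_d\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = 0 \]

\[ 0 = \sum_{i=1}^n \Psi_c\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = \sum_{i=1}^n \Psi_c\left( \frac{Y_i - X_i'\beta_0}{\sigma_0} \right) X_i - \sum_{i=1}^n \Psi_c'\left( \frac{Y_i - X_i'\beta_0}{\sigma_0} \right) X_i X_i'/\sigma_0\, (\hat\beta_n - \beta_0) \]
\[ \qquad - \sum_{i=1}^n \Psi_c'\left( \frac{Y_i - X_i'\beta_0}{\sigma_0} \right) \left( \frac{Y_i - X_i'\beta_0}{\sigma_0^2} \right) (\hat\sigma - \sigma_0)\, X_i + R_n \]

\[ \sqrt{n}\, (\hat\beta_n - \beta_0) \xrightarrow[n\to\infty]{D} N_p(0, \Sigma) \]

\[ \Sigma = \sigma_0^2\, \frac{E_{F_0}(\Psi_c^2(r))}{\left( E_{F_0}(\Psi_c'(r)) \right)^2}\, E_{G_0}(X X')^{-1} \]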


Efficiencies and Maximum bias

         ε = 0.05   0.10   0.15   0.20    Eff
 LTS       0.63     1.02   1.46   2.02    0.07
 LMS       0.53     0.83   1.13   1.52    0.00
 S         0.56     0.88   1.23   1.65    0.29

Maximum bias & Efficiencies – 50% breakdown point


Need to find

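\[ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n \rho_d\left( \frac{Y_i - X_i'\beta}{\hat\sigma} \right) \]

Or, at least, a robust solution to

\[ \sum_{i=1}^n \Psi_d\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = 0 \]

(and need $\hat\sigma$)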


MM-estimators

(Yohai, 1987)

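Let $\hat\beta_{n0}$ be a consistent, high-BP estimator

Let $\hat\sigma$ be a high-BP M-scale estimator using $\hat\beta_{n0}$

\[ \frac{1}{n} \sum_{i=1}^n \rho_0\left( \frac{Y_i - X_i'\hat\beta_{n0}}{\hat\sigma} \right) = 1/2 \]

Find a local minimum $\hat\beta_n$ of $f(\beta) = \sum_{i=1}^n \rho_1\left( \frac{Y_i - X_i'\beta}{\hat\sigma} \right)$ such that

\[ f(\hat\beta_n) \le f(\hat\beta_{n0}) \]

Need $\rho_1(r) \le \rho_0(r)\ \ \forall\, r$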


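Retains the BP of $\hat\beta_{n0}$

Has efficiency given by

\[ \sum_{i=1}^n \Psi_1\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = 0 \]

where $\Psi_1(r) = \rho_1'(r)$

(efficiency can be set by the choice of $\rho_1(r)$)

\[ \sqrt{n}\, (\hat\beta_n - \beta_0) \xrightarrow[n\to\infty]{D} N_p(0, \Sigma) \]

\[ \Sigma = \sigma_0^2\, \frac{E_{F_0}(\Psi_1^2(r))}{\left( E_{F_0}(\Psi_1'(r)) \right)^2}\, E_{G_0}[X X']^{-1} \]
\[ \quad = \sigma_0^2 \left[ E_{H_0}(\Psi_1'(r)\, X X') \right]^{-1} \left[ E_{H_0}(\Psi_1^2(r)\, X X') \right] \left[ E_{H_0}(\Psi_1'(r)\, X X') \right]^{-1} \]

where $H_0(r, x) = G_0(x)\, F_0(r)$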


Example with robustbase


> library(robustbase)
> toxi <- read.table('toxicity.txt', header=FALSE)
> names(toxi)[1] <- 'y'
> dim(toxi)
[1] 38 10
> a.lm <- lm(y~., data=toxi)
> plot(a.lm)


[Figure: the four diagnostic plots for lm(y ~ .): Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage (with Cook's distance contours). Observations 28, 34 and 38 are labelled in the first three panels, and 38, 28 and 32 in the last.]


Efficiencies for bi-square score functions

 Efficiency:            0.80   0.85   0.90   0.95
 Tuning constant (d):   3.14   3.44   3.88   4.68


> a.lmrob.85 <- lmrob(y~., data=toxi,
+    control=lmrob.control(nResample=5000, tuning.psi=3.44, compute.rd=TRUE))
>
> a.lmrob.90 <- lmrob(y~., data=toxi,
+    control=lmrob.control(nResample=5000, tuning.psi=3.88, compute.rd=TRUE))
>
> a.lmrob.95 <- lmrob(y~., data=toxi,
+    control=lmrob.control(nResample=5000, compute.rd=TRUE))
>
> plot(a.lmrob.85)


[Figure: diagnostic plots for lmrob(formula = y ~ ., data = toxi, control = lmrob.control(nResample = 5000, tuning.psi = 3.44, compute.rd = TRUE)): Standardized residuals vs. Robust Distances; Normal Q-Q vs. Residuals; Response vs. Fitted Values; Residuals vs. Fitted Values]



> summary(a.lm)
Call:
lm(formula = y ~ ., data = toxi)

Residuals:
     Min       1Q   Median       3Q      Max
-0.36704 -0.09072 -0.01605  0.05775  0.50947

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.973446   6.538420  -1.067  0.29529
V2           0.317054   0.136360   2.325  0.02754 *
V3           0.059883   0.184185   0.325  0.74751
V4          -0.201126   0.057242  -3.514  0.00152 **
V5          -0.027091   0.173513  -0.156  0.87705
V6           0.012661   0.036188   0.350  0.72906
V7          -0.014451   0.017489  -0.826  0.41562
V8           5.896792   5.156774   1.144  0.26251
V9          -0.014075   0.011667  -1.206  0.23777
V10          0.008387   0.013845   0.606  0.54957
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.184 on 28 degrees of freedom
Multiple R-Squared: 0.8463, Adjusted R-squared: 0.7969
F-statistic: 17.14 on 9 and 28 DF, p-value: 3.520e-09


> summary(a.lmrob.85)

Call:
lmrob(formula = y ~ ., data = toxi, control = lmrob.control(nResample = 5000,
    tuning.psi = 3.44, compute.rd = TRUE))

Weighted Residuals:
     Min       1Q   Median       3Q      Max
-0.13540 -0.01594  0.01612  0.25659  2.33151

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.763606   5.022955  -0.948  0.35106
V2           0.500946   0.032760  15.291 4.03e-15 ***
V3           0.140541   0.060796   2.312  0.02837 *
V4           0.495203   0.081339   6.088 1.44e-06 ***
V5           0.245450   0.195695   1.254  0.22012
V6          -0.028718   0.009201  -3.121  0.00415 **
V7          -0.027577   0.005072  -5.437 8.41e-06 ***
V8          -1.790614   5.920822  -0.302  0.76456
V9           0.023948   0.010537   2.273  0.03091 *
V10         -0.036026   0.022852  -1.576  0.12615
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust residual standard error: 0.09632
Convergence in 22 IRWLS iterations


(...)
Robustness weights:
 9 observations c(12,13,23,28,32,34,35,36,37)
 are outliers with |weight| < 2.632e-06;
 one weight is ~= 1; the remaining 28 ones are summarized as
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01005 0.95930 0.98950 0.93200 0.99550 0.99990
Algorithmic parameters:
tuning.chi         bb tuning.psi refine.tol    rel.tol
 1.5476400  0.5000000  3.4400000  0.0000001  0.0000001
nResample    max.it    groups   n.group  best.r.s  k.fast.s     k.max
     5000        50         5       400         2         1       200
trace.lev compute.rd
        0          1
seed : int(0)


[Figure: diagnostic plots for lmrob(formula = y ~ ., data = toxi, control = lmrob.control(nResample = 5000, compute.rd = TRUE)): Standardized residuals vs. Robust Distances; Normal Q-Q vs. Residuals; Response vs. Fitted Values; Residuals vs. Fitted Values]



> library(MASS)
> a.lms <- lmsreg(y~., data=toxi)
> a.lms
Call:
lqs.formula(formula = y ~ ., data = toxi, method = "lms")

Coefficients:
(Intercept)           V2           V3           V4           V5           V6
   -4.44985      0.50840      0.15560      0.83908      0.41175     -0.02570
         V7           V8           V9          V10
   -0.03311     -5.02900      0.03002     -0.06489

Scale estimates 0.03314 0.02720

> summary(a.lms)
             Length Class      Mode
crit          1     -none-     numeric
sing          1     -none-     character
coefficients 10     -none-     numeric
[...]
xlevels       0     -none-     list
model        10     data.frame list

> plot(a.lms)
Error in plot.window(xlim, ylim, log, asp, ...) :
  need finite 'xlim' values
In addition: Warning messages:
1: no non-missing arguments to min; returning Inf
2: no non-missing arguments to max; returning -Inf
3: no non-missing arguments to min; returning Inf
4: no non-missing arguments to max; returning -Inf


MM-regression estimators combine

- high-breakdown point
- $\sqrt{n}$ consistent and asymptotically normal
- high-efficiency at the central model


[Figure sequence: scatter plots of y against x (x from −2 to 6, y from −4 to 6), each frame comparing the MM and LS fits]


         ε = 0.05   0.10   0.15   0.20    Eff
 LTS       0.63     1.02   1.46   2.02    0.07
 LMS       0.53     0.83   1.13   1.52    0.00
 S         0.56     0.88   1.23   1.65    0.29
 MM        0.78     1.24   1.77   2.42    0.95
 MM+S      0.56     0.88   1.23   1.65    0.95

Maximum bias & Efficiencies – 50% breakdown point


[Figure sequence: scatter plots of y against x (x from −2 to 6, y from −6 to 4), each frame comparing the MM and LS fits]


Asymptotics revisited

the problem of the scale outside the model – (Croux, Dhaene, Hoorelbeke, 2003; S-B, 2000)


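\[ 0 = \sum_{i=1}^n \Psi_c\left( \frac{Y_i - X_i'\hat\beta_n}{\hat\sigma} \right) X_i = \sum_{i=1}^n \Psi_c\left( \frac{Y_i - X_i'\beta_0}{\sigma_0} \right) X_i - \sum_{i=1}^n \Psi_c'\left( \frac{Y_i - X_i'\beta_0}{\sigma_0} \right) X_i X_i'/\sigma_0\, (\hat\beta_n - \beta_0) \]
\[ \qquad - \sum_{i=1}^n \Psi_c'\left( \frac{Y_i - X_i'\beta_0}{\sigma_0} \right) \left( \frac{Y_i - X_i'\beta_0}{\sigma_0^2} \right) (\hat\sigma - \sigma_0)\, X_i + R_n \]

\[ \sqrt{n}\, (\hat\beta_n - \beta_0) \xrightarrow[n\to\infty]{D} N_p(0, \Sigma) \]

\[ \Sigma = \sigma_0^2\, \frac{E_{F_0}(\Psi_c^2(r))}{\left( E_{F_0}(\Psi_c'(r)) \right)^2}\, E_{G_0}[X X']^{-1} \]
\[ \quad = \sigma_0^2 \left[ E_{H_0}(\Psi_c'(r)\, X X') \right]^{-1} \left[ E_{H_0}(\Psi_c^2(r)\, X X') \right] \left[ E_{H_0}(\Psi_c'(r)\, X X') \right]^{-1} \]

where $H_0(r, x) = G_0(x)\, F_0(r)$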


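\[ \sqrt{n}\, (\hat\beta_n - \beta_0) \xrightarrow[n\to\infty]{D} N_p(0, \Sigma) \]

\[ \Sigma = \sigma^2 \Big( \left[ E_H(\Psi_c'(r)\, X X') \right]^{-1} \left[ E_H(\Psi_c^2(r)\, X X') \right] \left[ E_H(\Psi_c'(r)\, X X') \right]^{-1} \]
\[ \qquad - a\, E_H(\rho(r)\, \Psi_c(r)\, X') \left[ E_H(\Psi_c'(r)\, X X') \right]^{-1} - \left[ E_H(\Psi_c'(r)\, X X') \right]^{-1} E_H(\rho(r)\, \Psi_c(r)\, X)\, a' \]
\[ \qquad + E_H(\rho(r) - b)^2\, a\, a' \Big) \]

where

\[ a = \left[ E_H(\Psi_c'(r)\, X X') \right]^{-1} E_H(\Psi_c'(r)\, r\, X) \,/\, E_H(\rho'(r)\, r) \]

and $r = (Y - X'\beta_0)/\sigma_0$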


Uniform asymptotic results over contamination neighbourhoods

- location (S-B and Zamar, 2004)
- linear regression (first attempt: Omelka and S-B, 2006)


Under “certain regularity assumptions”

\[ \lim_{n\to\infty}\ \sup_{F \in \mathcal{H}_\varepsilon}\ \sup_{x \in \mathbb{R}} \left| P_F\left\{ \sqrt{n}\, (\hat\mu_n - \mu(F))/V(F) \le x \right\} - \Phi(x) \right| = 0 \]

Assumptions

Stringent conditions for uniform consistency of S-location estimator

- Uniform unique minimum – uniform “minimal” convexity

Extension to linear regression


Trade-off between BP and the size of $\mathcal{H}_\varepsilon$ where uniform asymptotics hold

  BP      ε
 0.50    0.11
 0.45    0.14
 0.40    0.17
 0.35    0.20
 0.30    0.24
 0.25    0.25


Back to confidence intervals

\[ \hat\beta_{n,j} \pm 1.96 \sqrt{\hat\Sigma_{jj}} \]


Empirical coverage of 95% CI for µ0

Based on a 95%-efficient MM-estimator with 50% BP

  ε      p = 1   p = 2   p = 5   p = 10
 0.00    0.93    0.95    0.95    0.93
 0.10    0.69    0.67    0.69    0.65
 0.20    0.04    0.05    0.03    0.04

500 samples of size n = 100 – outliers concentrated at (x, y) = (4, 3)


[Figure: scatter plot of y against x (both axes from −2 to 5), comparing the MM and LS fits]


Bootstrap

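\[ \hat\mu_n = \hat\mu(X_1, \dots, X_n), \qquad X_i \sim F \]

Plug-in principle

\[ F_n \approx F \ \Rightarrow\ \mathcal{L}(\hat\mu_n, F) \approx \mathcal{L}(\hat\mu_n, F_n) \]

\[ \hat\mu_n^* = \hat\mu(V_1, \dots, V_n), \qquad V_i \sim F_n \]

\[ \hat\mu_n = \sum_{i=1}^n X_i / n \]

\[ \hat\mu_n^* = \sum_{i=1}^n V_i / n, \qquad V_i \sim F_n \]

\[ P(V_i \le t) = \sum_{j=1}^n I(X_j \le t)/n \]

\[ P(V_i = t) = \begin{cases} 1/n & \text{if } t = X_j \text{ for some } j = 1, \dots, n \\ 0 & \text{otherwise} \end{cases} \]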


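\[ E(\hat\mu_n^*) = \sum_{i=1}^n E(V_i)/n = \sum_{i=1}^n \bar X_n / n = \bar X_n \]

\[ E(V_i^2) = \sum_{i=1}^n X_i^2 / n \]

\[ V(\hat\mu_n^*) = V(V_i)/n = \left( \sum_{i=1}^n X_i^2/n - \bar X_n^2 \right) / n = \left[ \left( \sum_{i=1}^n (X_i - \bar X_n)^2 \right) / n \right] / n \]

\[ = s^2/n \approx V(\bar X_n) = \sigma^2/n \]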


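Problem: $\mathcal{L}(\hat\mu_n, F_n)$ is generally unknown

It can be estimated (simulated) by re-computing $\hat\mu_n$ on a large number of pseudo-random samples from $F_n$

mu <- numeric(B)
for (j in 1:B) {
  v <- sample(x, replace = TRUE)   # V_1, ..., V_n ~ F_n (x holds the sample)
  mu[j] <- mu.hat(v)               # mu.hat is a placeholder for the estimator
}
var(mu)                            # estimate of V(mu.hat)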


Without outliers, $X \sim N(0, 1.5^2)$

   n     V(µ*_n)   V(F_n)    MC
  50      2.45      2.40     2.27
 100      2.43      2.42     2.48
 200      2.38      2.38     2.23
 500      2.37      2.37     2.45

500 samples – 200 bootstrap samples

With 10% outliers distributed as $X \sim \Phi((x - 5)/0.5)$

   n     V(µ*_n)   V(F_n)    MC
  50      4.57      4.56     3.08
 100      4.73      4.72     3.22
 200      4.66      4.67     3.17
 500      4.70      4.69     3.43

   n     V(µ*_n)   V(F_n)    MC
  50      9.16      9.46     3.17
 100      9.34      9.42     3.32
 200      9.22      9.25     3.05
 500      9.16      9.24     3.56

   n     V(µ*_n)   V(F_n)    MC
  50      11.6      11.1     2.25
 100      11.1      10.8     2.25
 200      10.7      10.6     2.12
 500      10.5      10.4     2.29



Timing for n = 2000, p = 30

Average computing time: 35 CPU seconds

2000 bootstrap samples: 20 hours

Bootstrap samples can be highly affected by outliers


Fast and Robust Bootstrap (S-B and Zamar, 2002)

I Faster than bootstrapping the estimator

I Able to downweight potential outliers in the bootstrap samples

I These outliers may appear in larger proportions in a bootstrap sample than in the original sample
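How often a bootstrap sample is dominated by outliers can be quantified with a binomial calculation. The sketch below is illustrative only: it assumes a fraction eps of the original observations are outliers, and the "at least half" threshold and the function name `prob_majority_outliers` are choices made here, not taken from the slides.

```r
# Number of outliers in a bootstrap sample of size n is Binomial(n, m / n),
# where m is the number of outliers among the n original observations.
prob_majority_outliers <- function(n, eps) {
  m <- round(n * eps)                        # outliers in the original sample
  # P(at least half of the bootstrap sample are outliers)
  pbinom(ceiling(n / 2) - 1, size = n, prob = m / n, lower.tail = FALSE)
}

n_values <- c(10, 20, 50, 100)
probs <- sapply(n_values, prob_majority_outliers, eps = 0.2)
data.frame(n = n_values, prob = probs)
```

With 20% contamination the probability is non-negligible for small n and decays quickly as n grows, which is why extreme bootstrap quantiles of a recomputed robust estimator are most fragile in small samples.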


Fast and Robust Bootstrap

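$$\sum_{i=1}^n \rho_1'(r_i/\hat\sigma_n)\, X_i = 0$$

$$\frac{1}{n}\sum_{i=1}^n \rho_0(\tilde r_i/\hat\sigma_n) = b$$

$$r_i = Y_i - X_i'\hat\beta_n , \qquad \tilde r_i = Y_i - X_i'\tilde\beta_n$$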


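$$\hat\beta_n = \Big[\sum_{i=1}^n \omega_i x_i x_i'\Big]^{-1} \sum_{i=1}^n \omega_i x_i y_i ,$$

$$\hat\sigma_n = \sum_{i=1}^n v_i (y_i - \tilde\beta_n' x_i) .$$

$$\omega_i = \rho_1'(r_i/\hat\sigma_n)/r_i ,$$

$$v_i = \frac{\hat\sigma_n}{n\,b}\, \rho_0(\tilde r_i/\hat\sigma_n)/\tilde r_i .$$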


[Figure: weight function ω = ψ(r)/r plotted against r, for r from −4 to 4; vertical axis from 0.0 to 1.0]


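$$\hat\beta_n^* = \Big[\sum_{i=1}^n \omega_i^* x_i^* x_i^{*\prime}\Big]^{-1} \sum_{i=1}^n \omega_i^* x_i^* y_i^* ,$$

$$\hat\sigma_n^* = \sum_{i=1}^n v_i^* (y_i^* - \tilde\beta_n' x_i^*)$$

The Robust Bootstrap $\hat\beta_n^{R*} - \hat\beta_n$ is given by

$$\hat\beta_n^{R*} - \hat\beta_n = K_n (\hat\beta_n^* - \hat\beta_n) + d_n (\hat\sigma_n^* - \hat\sigma_n) ,$$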


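$$K_n = \hat\sigma_n \Big[\sum_{i=1}^n \rho_1''(r_i/\hat\sigma_n, x_i)\, x_i x_i'\Big]^{-1} \sum_{i=1}^n \omega_i x_i x_i' ,$$

$$d_n = a_n^{-1} \Big[\sum_{i=1}^n \rho_1''(r_i/\hat\sigma_n, x_i)\, x_i x_i'\Big]^{-1} \sum_{i=1}^n \rho_1''(r_i/\hat\sigma_n, x_i)\, r_i x_i ,$$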

$$a_n = \hat\sigma_n^2\, \frac{1}{n}\, \frac{1}{b} \sum_{i=1}^n \big[\rho_0'(\tilde r_i/\hat\sigma_n)\, \tilde r_i/\hat\sigma_n\big]$$


Without outliers X ∼ N(0, 1.5²)

  n    V(µ*n)   V(Fn)    FRB     MC
 50     2.45     2.40    2.27    2.27
100     2.43     2.42    2.34    2.48
200     2.38     2.38    2.35    2.23
500     2.37     2.37    2.36    2.45




  n    V(µ*n)   V(Fn)    FRB     MC
 50     4.57     4.56    3.88    3.08
100     4.73     4.72    3.88    3.22
200     4.66     4.67    3.83    3.17
500     4.70     4.69    3.81    3.43



Regression (slope) With 10% outliers distributed as X ∼ Φ((x − 5)/0.5)

  n    V(βn)    Σ(Fn)    FRB     MC
 50     1.74     0.53    0.56    0.77
100     0.68     0.52    0.54    0.52
200     0.52     0.51    0.52    0.48
500     0.52     0.51    0.52    0.54

500 samples – 200 bootstrap samples – Outliers at (10, 16)


Regression (slope) With 20% outliers distributed as X ∼ Φ((x − 5)/0.5)

  n    V(βn)    Σ(Fn)    FRB     MC
 50     12.5     0.56    0.60    2.34
100     8.69     0.57    0.59    0.55
200     3.32     0.57    0.58    0.52
500     0.60     0.57    0.57    0.57

500 samples – 200 bootstrap samples – Outliers at (10, 16)

Bootstrap provides an estimator of the distribution


Theorem – consistency

Theorem (Salibian-Barrera and Zamar 2002) - Let ρ0 and ρ1 satisfy

(R1) ρ is symmetric, twice continuously differentiable and ρ(0) = 0,

(R2) ρ is strictly increasing on [0, c] and constant on [c, ∞) for some finite constant c,

with continuous third derivatives. Let $\hat\beta_n$ be the MM-regression estimator, $\hat\sigma_n$ the S-scale and $\tilde\beta_n$ the associated S-regression estimator, and assume that $\hat\beta_n \xrightarrow{P} \beta$, $\hat\sigma_n \xrightarrow{P} \sigma$ and $\tilde\beta_n \xrightarrow{P} \tilde\beta$. Then, under certain regularity conditions, $\sqrt{n}\,(\hat\beta_n^{R*} - \hat\beta_n)$ converges weakly, as n goes to infinity, to the same limit distribution as $\sqrt{n}\,(\hat\beta_n - \beta)$.


Theorem – Breakdown point

                  FR Bootstrap               Classical Bootstrap
 p     n     q0.005   q0.025   q0.05     q0.005   q0.025   q0.05
 2    10     0.456    0.500    0.500     0.128    0.187    0.222
 2    20     0.500    0.500    0.500     0.217    0.272    0.302
 2    30     0.500    0.500    0.500     0.265    0.313    0.339
 5    10     0.191    0.262    0.304     0.011    0.025    0.036
 5    20     0.500    0.500    0.500     0.114    0.154    0.177
 5    30     0.500    0.500    0.500     0.185    0.226    0.249
 5   100     0.500    0.500    0.500     0.368    0.398    0.414
10    20     0.257    0.315    0.347     0.005    0.012    0.018
10    50     0.500    0.500    0.500     0.180    0.212    0.230
10   100     0.500    0.500    0.500     0.294    0.322    0.336


Example

> attach(toxi)

> summary(a.lmrob.85)$coef[,2]
(Intercept)          V2          V3          V4          V5          V6
5.022954793 0.032760301 0.060795669 0.081339429 0.195694746 0.009200821
         V7          V8          V9         V10
0.005072355 5.920822057 0.010536611 0.022852485

> sqrt(diag(frb(a.lmrob.85)))
 [1] 6.74805639 0.11360617 0.20158115 0.26692828 0.18190995 0.02291613
 [7] 0.01246029 6.21339888 0.01485913 0.02480457

> dim(toxi)
[1] 38 10


Example

> summary(a.lmrob.95)$coef[,2]
(Intercept)         V2         V3         V4         V5         V6
 4.89848024 0.16891448 0.10310144 0.03644448 0.10455605 0.01855623
         V7         V8         V9        V10
 0.01190021 3.87072176 0.01061300 0.01624102

> sqrt(diag(frb(a.lmrob.95)))
 [1] 8.25505934 0.21430835 0.21686602 0.16039174 0.17656676 0.02965149
 [7] 0.01866741 6.78063852 0.01904172 0.02571575


General approach

Fixed point equations

$$\hat\theta_n = g_n(\hat\theta_n)$$

Bootstrap the equations at the full-data estimator

$$\hat\theta_n^* = g_n^*(\hat\theta_n)$$

Fast to compute (e.g. a weighted mean or weighted least squares)

Underestimates variability (the weights are not recomputed)


General approach

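$$\hat\theta_n = g_n(\hat\theta_n) = g_n(\theta) + \nabla g_n(\theta)\,(\hat\theta_n - \theta) + R_n$$

$$\sqrt{n}\,(\hat\theta_n - \theta) = [I - \nabla g_n(\theta)]^{-1} \sqrt{n}\,(g_n(\theta) - \theta) + o_p(1)$$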
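The second display follows from the first by collecting the terms in the estimation error. A short sketch of the intermediate step, assuming that $I - \nabla g_n(\theta)$ is invertible and that $\sqrt{n}\,R_n = o_p(1)$ (assumptions made explicit here, not stated on the slide):

$$
\hat\theta_n - \theta = \big(g_n(\theta) - \theta\big) + \nabla g_n(\theta)\,(\hat\theta_n - \theta) + R_n
\;\;\Longrightarrow\;\;
[I - \nabla g_n(\theta)]\,(\hat\theta_n - \theta) = \big(g_n(\theta) - \theta\big) + R_n .
$$

Multiplying both sides by $\sqrt{n}\,[I - \nabla g_n(\theta)]^{-1}$ gives the stated expansion, with remainder $[I - \nabla g_n(\theta)]^{-1}\sqrt{n}\,R_n = o_p(1)$.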


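$$\sqrt{n}\,\big(g_n^*(\hat\theta_n) - \hat\theta_n\big) \approx \sqrt{n}\,(g_n^*(\theta) - \theta) \approx \sqrt{n}\,(g_n(\theta) - \theta)$$

$$\sqrt{n}\,(\hat\theta_n - \theta) \approx [I - \nabla g_n(\theta)]^{-1} \sqrt{n}\,\big(g_n^*(\hat\theta_n) - \hat\theta_n\big)$$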


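$$\sqrt{n}\,(\hat\theta_n^* - \hat\theta_n) \approx \sqrt{n}\,(\hat\theta_n - \theta) \approx [I - \nabla g_n(\theta)]^{-1} \sqrt{n}\,\big(g_n^*(\hat\theta_n) - \hat\theta_n\big)$$

$$\hat\theta_n^{R*} - \hat\theta_n = \big[I - \nabla g_n(\hat\theta_n)\big]^{-1} \big(g_n^*(\hat\theta_n) - \hat\theta_n\big)$$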
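The general recipe can be illustrated in the simplest setting. The sketch below applies it to a location M-estimator written as a weighted mean, under assumptions that are not in the slides: Huber's ψ with tuning constant k = 1.345, a scale fixed at the MAD and not bootstrapped (so there is no scale correction term), and simulated data. In this case the correction factor [1 − g'n(µ̂)]⁻¹ reduces to Σωi / Σψ'(ri). All function and variable names are hypothetical.

```r
# Huber weights omega(r) = psi(r) / r and derivative psi'(r); k is an assumed default
k <- 1.345
w_fun    <- function(r) pmin(1, k / abs(r))
dpsi_fun <- function(r) as.numeric(abs(r) <= k)

# Location M-estimate: iterate the fixed point mu = g_n(mu) with the scale s held fixed
m_location <- function(x, s, tol = 1e-10, max_iter = 200) {
  mu <- median(x)
  for (it in seq_len(max_iter)) {
    w <- w_fun((x - mu) / s)
    mu_new <- sum(w * x) / sum(w)          # weighted mean g_n(mu)
    if (abs(mu_new - mu) < tol) break
    mu <- mu_new
  }
  mu_new
}

set.seed(123)
x <- c(rnorm(90), rnorm(10, mean = 5, sd = 0.5))   # simulated data with 10% outliers
n <- length(x)
s <- mad(x)
mu_hat <- m_location(x, s)

r <- (x - mu_hat) / s
w <- w_fun(r)                               # weights at the full-data estimate
correction <- sum(w) / sum(dpsi_fun(r))     # [1 - g_n'(mu_hat)]^{-1}

B <- 2000
frb <- numeric(B)
for (j in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)
  # one weighted mean per bootstrap sample; the weights are resampled, not recomputed
  g_star <- sum(w[idx] * x[idx]) / sum(w[idx])
  frb[j] <- mu_hat + correction * (g_star - mu_hat)
}
frb_se <- sd(frb)                           # fast and robust bootstrap standard error
```

Each replicate costs one weighted mean rather than a full iterative fit, and the outliers enter every bootstrap sample with the small weights they received in the full-data fit.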


Applications

Linear regression

I Standard errors (S-B and Zamar, 2002)

I Tests of hypotheses (S-B, 2005)

I Model selection (S-B and van Aelst, 2007)

Multivariate location / scatter – PCA (S-B, van Aelst, and Willems, 2006)

Discriminant analysis (S-B, van Aelst, and Willems, 2007)


Model selection

Linear regression

(y1, x1), . . . , (yn, xn)

Let α denote a subset of pα indices from {1, 2, . . . , p}

$$y_i = x_{\alpha i}'\beta_\alpha + \sigma_\alpha\, \varepsilon_{\alpha i} , \qquad i = 1, \ldots, n ,$$


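All models α ∈ A are submodels of a “full” model; $\hat\sigma_n$ is the S-scale estimate of the “full” model.

For each model α ∈ A, the regression estimator $\hat\beta_{\alpha,n}$ solves

$$\frac{1}{n} \sum_{i=1}^n \rho_1'\Big(\frac{y_i - x_{\alpha i}'\hat\beta_{\alpha,n}}{\hat\sigma_n}\Big)\, x_{\alpha i} = 0 .$$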

Expected prediction error (conditional on the observed data)

$$M^{pe}(\alpha) = \frac{\sigma^2}{n}\, E\Big[\sum_{i=1}^n \rho\Big(\frac{z_i - x_{\alpha i}'\hat\beta_\alpha}{\sigma}\Big) \,\Big|\, y, X\Big] ,$$

where $z = (z_1, \ldots, z_n)'$ are future responses at X, independent of y.


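Goodness of fit

$$\frac{\sigma^2}{n}\, E\Big[\sum_{i=1}^n \rho\Big(\frac{y_i - x_{\alpha i}'\hat\beta_\alpha}{\sigma}\Big)\Big] .$$

Parsimonious models are preferred (Müller and Welsh, 2005)

$$M^{ppe}(\alpha) = \frac{\sigma^2}{n} \Big\{ E\Big[\sum_{i=1}^n \rho\Big(\frac{y_i - x_{\alpha i}'\hat\beta_\alpha}{\sigma}\Big)\Big] + \delta(n)\, p_\alpha \Big\} + M^{pe}(\alpha) ,$$

where $\delta(n) \to \infty$ and $\delta(n)/n \to 0$ (e.g. $\delta(n) = \log(n)$)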


Criteria

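$$M^{pe}_{m,n}(\alpha) = \frac{\hat\sigma_n^2}{n}\, E^*\Big[\sum_{i=1}^n \rho\Big(\frac{y_i - x_{\alpha i}'\hat\beta_{\alpha,n}}{\hat\sigma_n}\Big) \,\Big|\, y, X\Big] ,$$

$$M^{ppe}_{m,n}(\alpha) = \frac{\hat\sigma_n^2}{n} \Big\{ \sum_{i=1}^n \rho\Big(\frac{y_i - x_{\alpha i}'\hat\beta_{\alpha,n}}{\hat\sigma_n}\Big) + \delta(n)\, p_\alpha \Big\} + M^{pe}_{m,n}(\alpha) ,$$

$E^*$ is the bootstrap mean

Select α ∈ A such that

$$\hat\alpha^{pe}_{m,n} = \arg\min_{\alpha \in A} M^{pe}_{m,n}(\alpha) ,$$

$$\hat\alpha^{ppe}_{m,n} = \arg\min_{\alpha \in A} M^{ppe}_{m,n}(\alpha) .$$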


Ac ⊂ A: the models α for which βα contains all non-zero components of β

In what follows we will assume that Ac is not empty.

The smallest model in Ac is the “true” model α0


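Theorem. Assume that

(A1) $n^{-1} \sum x_{\alpha i} x_{\alpha i}' \to \Gamma_\alpha > 0$, $n^{-1} \sum \omega_{\alpha i} x_{\alpha i} x_{\alpha i}' \to \Gamma_{\omega\alpha} > 0$, and $n^{-1} \sum \|x_{\alpha i}\|^4 < \infty$,

(A2) $\delta(n) = o(n/m)$ and $m = o(n)$;

(A3) $\sum_{i=1}^n \rho_1'(r_i(\hat\beta_{\alpha,n})/\hat\sigma_n)\, x_{\alpha i} = 0$,

(A4) $\hat\sigma_n - \sigma = O_p(1/\sqrt{n})$, $\hat\beta_{\alpha,n} - \beta_\alpha = O_p(1/\sqrt{n})$;

(A5) $\rho_1'$ and $\rho_1''$ are uniformly continuous, $\mathrm{var}(\rho_1'(\varepsilon_{\alpha_0})) < \infty$, $\mathrm{var}(\rho_1''(\varepsilon_{\alpha_0})) < \infty$ and $E(\rho_1''(\varepsilon_{\alpha_0})) > 0$; and

(A6) for any $\alpha \notin A_c$, $\mathrm{var}(\rho_1'(\varepsilon_\alpha)) < \infty$ and with probability one

$$\liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \rho_1(r_i(\beta_\alpha)/\hat\sigma_n) > \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \rho_1(r_i(\hat\beta_{\alpha_0,n})/\hat\sigma_n) .$$

Then

$$\lim_{n\to\infty} P(\hat\alpha^{ppe}_{m,n} = \alpha_0) = \lim_{n\to\infty} P(\hat\alpha^{pe}_{m,n} = \alpha_0) = 1 .$$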


Example

Los Angeles Ozone Pollution Data

366 daily observations on 9 variables

Full model includes all second order interactions p = 45

Computational complexity


Example

Backward elimination

Starting from the full model

Select the size-(k − 1) model with the best value of the selection criterion

Iterate

Reduces search from 2^p to p(p + 1)/2 models
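The model count for backward elimination can be verified with a small skeleton. This is a sketch only: `backward_path` and `toy_criterion` are hypothetical names, and the toy criterion stands in for the bootstrap criteria of the previous slides (it simply penalises dropping variables 1 to 3 and rewards smaller models).

```r
# Backward elimination: from the current model of size k, evaluate the k
# submodels of size k - 1 and keep the one with the smallest criterion value.
backward_path <- function(p, criterion) {
  current <- seq_len(p)
  evaluated <- 1                           # count the full model as one evaluation
  path <- list(current)
  while (length(current) > 1) {
    candidates <- lapply(seq_along(current), function(k) current[-k])
    scores <- vapply(candidates, criterion, numeric(1))
    evaluated <- evaluated + length(candidates)
    current <- candidates[[which.min(scores)]]
    path[[length(path) + 1]] <- current
  }
  list(path = path, evaluated = evaluated)
}

# Toy criterion (hypothetical): large penalty for dropping any of variables 1-3,
# plus a small penalty for model size
toy_criterion <- function(idx) 10 * sum(!(1:3 %in% idx)) + length(idx)

p <- 45
res <- backward_path(p, toy_criterion)
path_scores <- vapply(res$path, toy_criterion, numeric(1))
best <- res$path[[which.min(path_scores)]]

c(all_subsets = 2^p, backward = res$evaluated)
```

For p = 45 the path visits p(p + 1)/2 = 1035 models instead of the 2^45 models of an exhaustive search, and in this toy setting the best model on the path is the one containing variables 1 to 3.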


Using $\min_{\alpha \in A} M^{pe}_{m,n}(\alpha)$ ⇒ p = 6

Using $\min_{\alpha \in A} M^{ppe}_{m,n}(\alpha)$ ⇒ p = 7

Full model ⇒ p = 45


Prediction error

5-fold CV trimmed (γ) prediction error estimators

          α^pe_{m,n}       α^ppe_{m,n}      Full model
          p = 10           p = 7            p = 45
γ         TMSE     ρ       TMSE     ρ       TMSE     ρ
0.05     11.67   5.36     10.45   5.03     10.78   5.03
0.10      9.18             8.35             8.33


Diagnostic plots

[Figure: three panels of standardized residuals (vertical axis from −6 to 4) against fitted values (horizontal axis from 0 to 30)]


Average time (CPU seconds) to bootstrap an MM-regression estimator 1000 times on samples of size 200

 p     FRB       CB
25       8     1955
35      28     4300
45      35    10700

Full model selection analysis on the Ozone dataset (p = 45) is reduced from 15 days (360 hours) to 4 hours.
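The speed-up implied by these timings is simple arithmetic; a small sketch using only the numbers reported above (the data frame and column names are chosen here for illustration):

```r
# Reported average CPU seconds for 1000 bootstrap replicates, n = 200
timing <- data.frame(p   = c(25, 35, 45),
                     FRB = c(8, 28, 35),
                     CB  = c(1955, 4300, 10700))
timing$speedup <- timing$CB / timing$FRB   # classical bootstrap time / FRB time

ozone_speedup <- 360 / 4                   # hours: full model selection analysis
timing
```

The ratio is in the hundreds for a single bootstrap run, and the complete Ozone model selection analysis runs 90 times faster.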

