
Influence function of the error rate of generalized k-means

Ch. Ruwet

Introduction

Error rate

Influence function of the error rate

Bias of the error rate

Simulation study

Future research

Influence function of the error rate of generalized k-means

Joint work with G. Haesbroeck

Ch. Ruwet

UNIVERSITY OF LIÈGE

30 March 2009



Introduction



The k-means clustering method

Aim of clustering: group similar observations into k clusters $C_1, \dots, C_k$.

The k-means algorithm constructs clusters so as to minimize the within-cluster sum of squared distances.

Let us focus on k = 2 groups.



Classification based on clustering

When the data contain natural groups, clustering may be used to find these groups via a classification rule:

\[ x \in C_j \iff \|x - x_j\| = \min_{1 \le i \le 2} \|x - x_i\| \]

where $x_1$ and $x_2$ denote the cluster centers.



Example in 1D

[Figure: height of students in 1BM. Axis: height (in cm), from 160 to 190.]



Example in 2D

[Figure: students of 1BM. Axes: weight (in kg), from 40 to 90; height (in cm), from 160 to 190.]



Contaminated example in 1D


[Figure. Axis: height (in cm), from 120 to 180.]



The generalized 2-means clustering method

The clusters' centers $(T_1, T_2)$ are solutions of

\[ \min_{\{t_1, t_2\} \subset \mathbb{R}^2} \sum_{i=1}^{n} \Omega\Big( \inf_{1 \le j \le 2} \|x_i - t_j\| \Big) \]

for a suitable nondecreasing penalty function $\Omega$.

Classical penalty functions:

$\Omega(x) = x^2$ → 2-means method

$\Omega(x) = x$ → 2-medoids method
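The minimization above can be sketched in a few lines for the two classical penalties, assuming one-dimensional data. This is an illustration, not the implementation used in the talk: the function name `generalized_two_means_1d` and its `penalty` argument are made up here. In one dimension it is enough to try every split of the sorted sample, because each cluster is an interval and the best center of an interval is its mean (for $\Omega(x) = x^2$) or its median (for $\Omega(x) = x$).

```python
from statistics import mean, median


def generalized_two_means_1d(xs, penalty="means"):
    """Search all splits of the sorted 1D sample (hypothetical helper).

    penalty="means"   -> Omega(d) = d**2, centers are cluster means
    penalty="medoids" -> Omega(d) = d,    centers are cluster medians
    """
    if penalty == "means":
        center, omega = mean, (lambda d: d * d)
    elif penalty == "medoids":
        center, omega = median, (lambda d: d)
    else:
        raise ValueError("penalty must be 'means' or 'medoids'")
    pts = sorted(xs)
    if len(pts) < 2:
        raise ValueError("need at least two observations")
    best = None
    for k in range(1, len(pts)):  # left cluster is pts[:k], right is pts[k:]
        t1, t2 = center(pts[:k]), center(pts[k:])
        # Objective from the slide: each point pays Omega of its distance
        # to the nearest of the two centers.
        cost = sum(omega(min(abs(x - t1), abs(x - t2))) for x in pts)
        if best is None or cost < best[0]:
            best = (cost, t1, t2)
    return best[1], best[2]
```

For instance, `generalized_two_means_1d([1, 2, 3, 10, 11, 12])` returns the two centers 2 and 11 under either penalty. The exhaustive search costs quadratic time in the sample size, which is acceptable for a sketch but not for large samples.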



Contaminated example with the 2-medoids method

[Figure. Axis: height (in cm), from 120 to 190.]



The generalized 2-means clustering method

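The classification rule:

\[ x \in C_j \iff \Omega(\|x - T_j\|) = \min_{1 \le i \le 2} \Omega(\|x - T_i\|) \]

In one dimension, the estimated clusters are simply

\[ C_1 = \,]-\infty, C[ \qquad \text{and} \qquad C_2 = \,]C, +\infty[ \]

where $C = \dfrac{T_1 + T_2}{2}$ is the cut-off point.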



Error rate



Example in 1D


[Figure. Axis: height (in cm), from 160 to 190. Legend: Girls, Boys.]



Example in 1D with the 2-means


[Figure. Axis: height (in cm), from 160 to 190. Legend: Girls, Boys.]

ER = 0.326



Contaminated example in 1D


[Figure. Axis: height (in cm), from 120 to 180. Legend: Girls, Boys.]



Contaminated example in 1D with the 2-means


[Figure. Axis: height (in cm), from 120 to 180. Legend: Girls, Boys.]

ER = 0.609



Contaminated example in 1D with the 2-medoids

[Figure. Axis: height (in cm), from 120 to 180. Legend: Girls, Boys.]

ER = 0.370



Classification setting

Suppose X arises from 2 groups $G_1$ and $G_2$ with $\pi_i(F) = \mathbb{P}_F[X \in G_i]$. Then F is a mixture of two distributions,

\[ F = \pi_1(F)\, F_1 + \pi_2(F)\, F_2 \]

with densities $f_1$ and $f_2$.

Additional assumption: one dimension!



The generalized 2-means as statistical functionals

The clusters' centers $(T_1(F), T_2(F))$ are solutions of

\[ \min_{\{t_1, t_2\} \subset \mathbb{R}^2} \int \Omega\Big( \inf_{1 \le j \le 2} \|x - t_j\| \Big)\, dF(x) \]

for a suitable nondecreasing penalty function $\Omega$.

The classification rule is

\[ R_F(x) = C_j(F) \iff \Omega(\|x - T_j(F)\|) = \min_{1 \le i \le 2} \Omega(\|x - T_i(F)\|) \]



Optimality in classification

A classification rule is optimal if the corresponding error rate is minimal.

The optimal classification rule is the Bayes rule (Anderson, 1958):

\[ x \in C_1 \iff \pi_1(F) f_1(x) > \pi_2(F) f_2(x) \]

The 2-means procedure is optimal under the model (Qiu and Tamhane, 2007)

\[ F_N = 0.5\, N(\mu_1, \sigma^2) + 0.5\, N(\mu_2, \sigma^2) \quad \text{with } \mu_1 < \mu_2 \]
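To see what the Bayes rule looks like under this balanced model, one can work it out directly. The short derivation below is an illustration added here, not a slide from the talk. With equal weights and a common variance, the normalizing constants cancel:

\[
\pi_1 f_1(x) > \pi_2 f_2(x)
\iff \exp\Big(-\frac{(x-\mu_1)^2}{2\sigma^2}\Big) > \exp\Big(-\frac{(x-\mu_2)^2}{2\sigma^2}\Big)
\iff (x-\mu_1)^2 < (x-\mu_2)^2
\iff x < \frac{\mu_1+\mu_2}{2},
\]

where the last step uses $\mu_1 < \mu_2$. The Bayes rule is therefore a single cut at the midpoint of the two component means, which is the form of cut-off a 2-means rule produces in one dimension.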



Theoretical vs empirical error rate

Theoretical error rate:
• Training sample according to F: estimation of the rule
• Test sample according to $F_m$: evaluation of the rule
• In ideal circumstances: $F = F_m$

\[ \mathrm{ER}(F, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathbb{P}_{F_m}\big[ R_F(X) \ne C_j(F) \,\big|\, G_j \big] \]

Empirical error rate:
• Training sample according to F: estimation and evaluation of the rule

\[ \mathrm{ER}(F, F) = \sum_{j=1}^{2} \pi_j(F)\, \mathbb{P}_{F}\big[ R_F(X) \ne C_j(F) \,\big|\, G_j \big] \]



Contaminated distribution

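A contaminated distribution $F_\varepsilon$ is defined as a mixture: an observation comes from F with probability $1 - \varepsilon$ and from G with probability $\varepsilon$, where G is an arbitrary distribution function.

To see the influence of one single point x, take $G = \Delta_x$, the point mass at x, leading to

\[ F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x \]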



Contaminated mixture

[Figure, two panels, density against x for x from −4 to 4. Left panel, "Mixture": component weights 0.4 and 0.6. Right panel, "Contaminated mixture": component weights (1−ε)0.4 and (1−ε)0.6.]



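Theoretical vs empirical error rate under contamination

Now, the training sample is distributed as $F_\varepsilon$, which is a contaminated mixture.

Theoretical error rate:

\[ \mathrm{ER}(F_\varepsilon, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathbb{P}_{F_m}\big[ R_{F_\varepsilon}(X) \ne C_j(F_\varepsilon) \,\big|\, G_j \big] \]

Empirical error rate:

\[ \mathrm{ER}(F_\varepsilon, F_\varepsilon) = \sum_{j=1}^{2} \pi_j(F_\varepsilon)\, \mathbb{P}_{F_\varepsilon}\big[ R_{F_\varepsilon}(X) \ne C_j(F_\varepsilon) \,\big|\, G_j \big] \]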




Graphically:

$F_m = F_N \equiv 0.5\, N(-1, 1) + 0.5\, N(1, 1)$, an optimal model

$C(F_N) = \dfrac{-1 + 1}{2} = 0$ (Qiu and Tamhane, 2007)

$F_\varepsilon = (1 - \varepsilon) F_m + \varepsilon \Delta_x$

• ε varying and x = −0.5
• x ∈ $G_1$ varying and ε = 0.1
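The value $C(F_N) = 0$ can be checked numerically. The sketch below is an illustration under stated assumptions, not the computation used in the talk: it takes the 2-means penalty, a one-dimensional normal mixture with common standard deviation, and a plain fixed-point iteration in which each center is the conditional mean of its side of the cut-off. The names `population_two_means_cutoff`, `norm_cdf` and `norm_pdf` are chosen here.

```python
import math


def norm_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def population_two_means_cutoff(weights, mus, sigma=1.0, c0=0.3,
                                tol=1e-12, max_iter=1000):
    """Iterate C -> (T1(C) + T2(C)) / 2 for a 1D normal mixture."""
    c = c0
    for _ in range(max_iter):
        mass_left = part_left = part_right = 0.0
        for w, mu in zip(weights, mus):
            a = (c - mu) / sigma
            mass_left += w * norm_cdf(a)
            # E[X 1{X < c}] and E[X 1{X > c}] for one N(mu, sigma^2) component
            part_left += w * (mu * norm_cdf(a) - sigma * norm_pdf(a))
            part_right += w * (mu * (1.0 - norm_cdf(a)) + sigma * norm_pdf(a))
        t1 = part_left / mass_left            # mean of the left cluster
        t2 = part_right / (1.0 - mass_left)   # mean of the right cluster
        new_c = 0.5 * (t1 + t2)
        if abs(new_c - c) < tol:
            return new_c, t1, t2
        c = new_c
    return c, t1, t2
```

Calling `population_two_means_cutoff([0.5, 0.5], [-1.0, 1.0])` from the deliberately off-center start `c0=0.3` brings the cut-off back to 0, with two centers that are symmetric about it.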




Theoretical error rate (with the 2-means):

[Figure, two panels. Left: error rate against ε (ε from 0.0 to 0.8; error rate between 0.160 and 0.180). Right: error rate against x (x from −2 to 2; error rate between 0.159 and 0.163).]




Empirical error rate (with the 2-means):

[Figure, two panels. Left: error rate against ε (ε from 0.0 to 0.8; error rate from 0.0 to 1.0), with two curves labelled "If x is well placed" and "If not". Right: error rate against x (x from −2 to 2; error rate between 0.14 and 0.24).]



Influence function of the error rate



Influence functions

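Hampel et al. (1986): for any statistical functional T and any distribution F,

\[ \mathrm{IF}(x; T, F) = \lim_{\varepsilon \to 0} \frac{T(F_\varepsilon) - T(F)}{\varepsilon} = \frac{\partial}{\partial \varepsilon} T(F_\varepsilon) \Big|_{\varepsilon = 0} \]

where

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$;

$\mathrm{E}_F[\mathrm{IF}(X; T, F)] = 0$;

$T(F_\varepsilon) \approx T(F) + \varepsilon\, \mathrm{IF}(x; T, F)$ for ε small enough (first-order von Mises expansion of T at F).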
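As a quick illustration of the definition (an example added here, not taken from the talk), take the mean functional $T(F) = \int u \, dF(u)$ and write $\mu = T(F)$. Contaminating F at the point x gives

\[
T(F_\varepsilon) = (1 - \varepsilon)\mu + \varepsilon x,
\qquad
\mathrm{IF}(x; T, F) = \lim_{\varepsilon \to 0} \frac{(1 - \varepsilon)\mu + \varepsilon x - \mu}{\varepsilon} = x - \mu .
\]

Here $\mathrm{E}_F[\mathrm{IF}(X; T, F)] = \mathrm{E}_F[X] - \mu = 0$, and the von Mises approximation is exact because $T(F_\varepsilon)$ is linear in ε. The influence function is unbounded in x, which is the usual way of expressing that the mean is not robust.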



Theoretical vs empirical error rate

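\[ \mathrm{ER}(F_\varepsilon, F) \approx \mathrm{ER}(F, F) + \varepsilon\, \mathrm{IF}(x; \mathrm{ER}, F) \]

\[ \mathrm{ER}(F_\varepsilon, F_\varepsilon) \approx \mathrm{ER}(F, F) + \varepsilon\, \mathrm{IF}(x; \mathrm{ER}, F) \]

Theoretical error rate:

\[ \mathrm{ER}(F_\varepsilon, F_N) \ge \mathrm{ER}(F_N, F_N) \;\Rightarrow\; \mathrm{IF}(x; \mathrm{ER}, F_N) \equiv 0 \]

(the inequality forces the influence function to be nonnegative for every x, while its mean under $F_N$ is zero).

Empirical error rate: the IF does not vanish!

From now on, $\mathrm{ER}(F) = \mathrm{ER}(F, F)$.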



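$\mathrm{ER}(F_\varepsilon) = ?$

\[ \mathrm{ER}(F) = \sum_{j=1}^{2} \pi_j(F)\, \mathbb{P}_F\big[ R_F(X) \ne C_j(F) \,\big|\, G_j \big] = \pi_1(F)\,\{1 - F_1(C(F))\} + \pi_2(F)\, F_2(C(F)) \]

Under $F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$, one has

\[ \mathrm{ER}(F_\varepsilon) = \pi_1(F_\varepsilon)\,\{1 - F_{1,\varepsilon}(C(F_\varepsilon))\} + \pi_2(F_\varepsilon)\, F_{2,\varepsilon}(C(F_\varepsilon)) \]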
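The closed form of the error rate in terms of the cut-off is easy to evaluate once the two group distributions are specified. The sketch below assumes two normal groups with a common standard deviation; the function names `error_rate` and `norm_cdf` are chosen here for illustration and are not from the talk.

```python
import math


def norm_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_rate(c, pi1, mu1, mu2, sigma=1.0):
    """ER = pi1 * (1 - F1(c)) + pi2 * F2(c) for two normal groups."""
    f1_c = norm_cdf((c - mu1) / sigma)  # mass of group 1 left of the cut-off
    f2_c = norm_cdf((c - mu2) / sigma)  # mass of group 2 left of the cut-off
    return pi1 * (1.0 - f1_c) + (1.0 - pi1) * f2_c
```

With the balanced model $0.5\,N(-1,1) + 0.5\,N(1,1)$ and cut-off 0, `error_rate(0.0, 0.5, -1.0, 1.0)` is about 0.1587, and moving the cut-off away from 0 in either direction gives a larger value.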



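$\pi_i(F_\varepsilon) = ?$ and $F_{i,\varepsilon} = ?$

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$

\[ \pi_i(F_\varepsilon) = (1 - \varepsilon)\, \pi_i(F) + \varepsilon\, I\{x \in G_i\} \]

\[ F_{i,\varepsilon} = \Big( 1 - \frac{\varepsilon\, I\{x \in G_i\}}{\pi_i(F_\varepsilon)} \Big) F_i + \frac{\varepsilon\, I\{x \in G_i\}}{\pi_i(F_\varepsilon)}\, \Delta_x \]

\[ \Rightarrow\; F_\varepsilon = \pi_1(F_\varepsilon)\, F_{1,\varepsilon} + \pi_2(F_\varepsilon)\, F_{2,\varepsilon} \]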
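The decomposition can be verified numerically at any point t. The following sketch assumes two normal groups with unit variance; `contaminated_pieces`, `step` and the argument `x_group` (1 or 2, the group the contaminating point is assigned to) are names introduced here for illustration.

```python
import math


def norm_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def step(t, x):
    """cdf of the point mass Delta_x evaluated at t."""
    return 1.0 if t >= x else 0.0


def contaminated_pieces(t, eps, x, x_group, pi1, mu1, mu2):
    """Return F_eps(t) computed directly and rebuilt from its two groups."""
    pis = [pi1, 1.0 - pi1]
    cdfs = [norm_cdf(t - mu1), norm_cdf(t - mu2)]
    # Direct definition: (1 - eps) F + eps Delta_x
    direct = (1 - eps) * (pis[0] * cdfs[0] + pis[1] * cdfs[1]) + eps * step(t, x)
    rebuilt = 0.0
    for i in (0, 1):
        ind = 1.0 if x_group == i + 1 else 0.0   # indicator I{x in G_i}
        pi_eps = (1 - eps) * pis[i] + eps * ind  # pi_i(F_eps)
        share = eps * ind / pi_eps               # weight of Delta_x in F_{i,eps}
        f_i_eps = (1 - share) * cdfs[i] + share * step(t, x)
        rebuilt += pi_eps * f_i_eps
    return direct, rebuilt
```

For example, `contaminated_pieces(0.2, 0.1, -0.5, 1, 0.5, -1.0, 1.0)` returns two values that agree up to rounding error.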



Influence function of the empirical error rate

Proposition

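For all $x \ne C(F)$,

\[
\mathrm{IF}(x; \mathrm{ER}, F) = -\mathrm{ER}(F) + I\{x \le C(F)\}\,(1 - 2\, I\{x \in G_1\}) + I\{x \in G_1\}
+ \frac{1}{2}\big( \mathrm{IF}(x; T_1, F) + \mathrm{IF}(x; T_2, F) \big)\, \big\{ \pi_2(F) f_2(C(F)) - \pi_1(F) f_1(C(F)) \big\}.
\]

Expressions of $\mathrm{IF}(x; T_1, F)$ and $\mathrm{IF}(x; T_2, F)$ have been computed by García-Escudero and Gordaliza (1999).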



Graphics of IF(x; ER, F)

[Figure, two panels: IF of ER against x (x from −4 to 4; IF between 0.0 and 1.0), with separate curves for x in G1 and x in G2.]




[Figure, two panels: IF of ER against x (x from −4 to 4; IF between −2 and 2), with curves for π1 = 0.2, 0.4, 0.6 and 0.8.]




∆ = µ2 − µ1

[Figure, two panels: IF of ER against x (x from −4 to 4; IF between −1.0 and 1.5), with curves for ∆ = 1, 3 and 5.]




[Figure, two panels: IF of ER against x (x from −4 to 4; IF between −0.5 and 1.0), with curves for σ = 0.5, 1, 1.5 and 2.]



Bias of the error rate



Definition

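In practice, F is replaced by $F_n$, the empirical cdf.

The bias is defined by $B_n(\mathrm{ER}) = \mathrm{E}_F[\mathrm{ER}(F_n) - \mathrm{ER}(F)]$.

Fernholz (2001):

\[ B_n(\mathrm{ER}) = \frac{1}{2n}\, \mathrm{E}_F[\mathrm{IF2}(X; \mathrm{ER}, F)] + o(n^{-1}) \]

where

\[ \mathrm{IF2}(x; \mathrm{ER}, F) = \frac{\partial^2}{\partial \varepsilon^2}\, \mathrm{ER}\big( (1 - \varepsilon) F + \varepsilon \Delta_x \big) \Big|_{\varepsilon = 0}. \]

This result is true if ER(·) is Hadamard or Fréchet differentiable.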
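The expansion can be sanity-checked on a simpler functional. The example below is added here as an illustration and is not part of the talk: take the variance functional $T(F) = \int u^2\, dF(u) - \big(\int u\, dF(u)\big)^2$, with mean $\mu$ and variance $\sigma^2$ under F. Under contamination the second moment is linear in ε, and the mean becomes $\mu + \varepsilon (x - \mu)$, so

\[
\mathrm{IF2}(x; T, F) = \frac{\partial^2}{\partial \varepsilon^2} \Big[ -\big( \mu + \varepsilon (x - \mu) \big)^2 \Big] \Big|_{\varepsilon = 0} = -2 (x - \mu)^2,
\qquad
\frac{1}{2n}\, \mathrm{E}_F[\mathrm{IF2}(X; T, F)] = -\frac{\sigma^2}{n}.
\]

This matches the known bias of the plug-in variance estimator, $\mathrm{E}_F[T(F_n)] - \sigma^2 = -\sigma^2 / n$, so in this case the leading term of the expansion is exact.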



Computation of the bias

Proposition

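Under asymptotic normality and consistency of $T_1$ and $T_2$ (Pollard, 1981 and 1982):

\[
B_n(\mathrm{ER}) \approx \frac{1}{4n} \big\{ \pi_2(F) f_2(C(F)) - \pi_1(F) f_1(C(F)) \big\} \big( \mathrm{E}_F[\mathrm{IF2}(X; T_1, F)] + \mathrm{E}_F[\mathrm{IF2}(X; T_2, F)] \big)
+ \frac{1}{8n} \big\{ \pi_2(F) f_2'(C(F)) - \pi_1(F) f_1'(C(F)) \big\} \big( \mathrm{ASV}(T_1) + \mathrm{ASV}(T_2) + 2\, \mathrm{ASC}(T_1, T_2) \big),
\]

where ASV and ASC denote the asymptotic variance and covariance.

Expressions of $\mathrm{IF2}(X; T_1, F)$ and $\mathrm{IF2}(X; T_2, F)$ have been computed.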



Asymptotic difference

How much difference in error rate is to be expected when estimating a clustering rule from a finite sample?

\[ \text{A-Diff}(\mathrm{ER}) = \lim_{n \to \infty} n\, B_n(\mathrm{ER}) \]

⇒ Graphical comparisons of the 2-means and 2-medoids methods:

$F = \pi_1\, N(-\Delta/2, 1) + (1 - \pi_1)\, N(\Delta/2, 1)$
• π1 varying and ∆ = 3
• ∆ varying and π1 = 0.4

$F_N = 0.5\, N(-\Delta/2, 1) + 0.5\, N(\Delta/2, 1)$



Graphics of A-Diff(ER) under F

[Figure, two panels comparing 2-means and 2-medoids. Left: A-Diff against π1 (π1 from 0.2 to 0.8; A-Diff between −50 and 50). Right: A-Diff against ∆ (∆ from 0 to 10; A-Diff between −0.1 and 0.3).]



Graphic of A-Diff(ER) under FN

[Figure comparing 2-means and 2-medoids: A-Diff against ∆ (∆ from 0 to 10; A-Diff between −0.06 and 0.00).]



Simulation study



Simulation settings

$F = 0.4\, N(-1.5, 1) + 0.6\, N(1.5, 1)$

n = 100

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$ with
• ε = 0.01 and |x| = 5
• ε = 0.05 and |x| = 5
• ε = 0.01 and |x| = 50
• ε = 0.05 and |x| = 50

N = 1000
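One way to draw a labelled sample from $F_\varepsilon$ is sketched below. It is only an assumption about how such data could be generated: the talk does not say whether contaminated samples were drawn at random like this or built by replacing a fixed number of observations. The function name `draw_contaminated_sample` and the argument `x_group` (the group label given to the contaminating point) are introduced here.

```python
import random


def draw_contaminated_sample(n, eps, x, x_group, rng,
                             pi1=0.4, mu1=-1.5, mu2=1.5, sigma=1.0):
    """Draw n labelled points from F_eps = (1 - eps) F + eps Delta_x."""
    values, labels = [], []
    for _ in range(n):
        if rng.random() < eps:
            # Contaminating point: fixed location x, labelled x_group
            values.append(float(x))
            labels.append(x_group)
        elif rng.random() < pi1:
            values.append(rng.gauss(mu1, sigma))  # observation from G1
            labels.append(1)
        else:
            values.append(rng.gauss(mu2, sigma))  # observation from G2
            labels.append(2)
    return values, labels
```

A call such as `draw_contaminated_sample(100, 0.05, 50, 1, random.Random(0))` gives one sample of size n = 100 for the setting ε = 0.05 and x = 50 with the contaminating point attributed to $G_1$. Repeating the call with the same generator object yields further independent samples.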



Simulation results

The values in parentheses under each method are reproduced from the slide. Rows presumably give the group of the contaminating point (from G1 or from G2) and columns the cluster in which it falls (in C1 or in C2); the four lines within each group presumably follow the order of the four contamination settings listed under "Simulation settings".

                        2-means (0.071)     2-medoids (0.072)
                        in C1     in C2     in C1     in C2
from G1   setting 1     0.068     0.083     0.071     0.083
          setting 2     0.064     0.139     0.066     0.127
          setting 3     0.39      0.61      0.072     0.083
          setting 4     0.35      0.65      0.35      0.65
from G2   setting 1     0.078     0.072     0.081     0.073
          setting 2     0.119     0.083     0.116     0.072
          setting 3     0.41      0.59      0.081     0.073
          setting 4     0.45      0.55      0.45      0.55



Contaminated example with the 2-medoids


[Figure. Axis: height (in cm), from 120 to 180. Legend: Girls, Boys.]



Conclusion of these simulations

A well-placed contamination in the smaller group makes the error rate decrease.

Too much contamination makes the error rate of the 2-means break down.

Unfortunately, the error rate of the 2-medoids also breaks down when the outliers are too numerous and too far away!



Future research



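Generalized trimmed 2-means: for $\alpha \in [0, 1]$, $(T_1(F), T_2(F))$ are solutions of

\[ \min_{A :\, F(A) = 1 - \alpha} \; \min_{\{t_1, t_2\} \subset \mathbb{R}} \int_A \Omega\Big( \inf_{1 \le j \le 2} \|x - t_j\| \Big)\, dF(x). \]

Theoretical error rate

\[ \mathrm{ER}(F_\varepsilon, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathbb{P}_{F_m}\big[ R_{F_\varepsilon}(X) \ne C_j(F_\varepsilon) \,\big|\, G_j \big] \]

More than 1 dimension or more than 2 groups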



Thank you for your attention!



Bibliography

CROUX C., FILZMOSER P. and JOOSSENS K. (2008), Classification efficiencies for robust linear discriminant analysis, Statistica Sinica, 18, pp. 581-599.

CROUX C., HAESBROECK G. and JOOSSENS K. (2008), Logistic discrimination using robust estimators: an influence function approach, The Canadian Journal of Statistics, 36, pp. 157-174.

FERNHOLZ L. T. (2001), On multivariate higher order von Mises expansions, Metrika, 53, pp. 123-140.



GARCÍA-ESCUDERO L. A. and GORDALIZA A. (1999), Robustness properties of k means and trimmed k means, Journal of the American Statistical Association, 94 (447), pp. 956-969.

HAMPEL F. R., RONCHETTI E. M., ROUSSEEUW P. J. and STAHEL W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, New York.

ANDERSON T. W. (1958), An Introduction to Multivariate Statistical Analysis, Wiley, New York, pp. 126-133.



POLLARD D. (1981), Strong consistency of k-means clustering, The Annals of Statistics, 9 (1), pp. 135-140.

POLLARD D. (1982), A central limit theorem for k-means clustering, The Annals of Probability, 10 (4), pp. 919-926.

QIU D. and TAMHANE A. C. (2007), A comparative study of the k-means algorithm and the normal mixture model for clustering: univariate case, Journal of Statistical Planning and Inference, 137, pp. 3722-3740.
