influence function of the error rate of generalized k-means - joint ...€¦ · influence function...
TRANSCRIPT
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Influence function of the error rate ofgeneralized k-meansJoint work with G. Haesbroeck
Ch. Ruwet
UNIVERSITY OF LIÈGE
30 March 2009
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Introduction
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
The k -means clustering method
Aim of clustering : Group similar observations in kclusters C1, . . . , Ck .
The k -means algorithm constructs clusters in order tominimize the within cluster sum of squared distances.
Let us focus on k = 2 groups.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Classification based on clustering
In case of natural groups in the data, clustering may beused to find these groups via a classification rule :
x ∈ Cj ⇔ ‖x − xj‖ = min1≤i≤2
‖x − xi‖
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 1D
Height of students in 1BM
Height (in cm)
160 170 180 190
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 1D
Height of students in 1BM
Height (in cm)
160 170 180 190
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 2D
40 50 60 70 80 90
160
170
180
190
Students of 1BM
Weight (in kg)
Hei
ght (
in c
m)
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 2D
40 50 60 70 80 90
160
170
180
190
Students of 1BM
Weight (in kg)
Hei
ght (
in c
m)
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D
Height of students in 1BM
Height (in cm)
120 140 160 180
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D
Height of students in 1BM
Height (in cm)
120 140 160 180
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
The generalized 2-means clustering method
The clusters’ centers (T1, T2) are solution of
mint1,t2⊂R2
n∑
i=1
Ω
(
inf1≤j≤2
‖xi − tj‖)
for a suitable nondecreasing penalty function Ω.
Classical penalty functions :
Ω(x) = x2 → 2-means method
Ω(x) = x → 2-medoids method
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example with the 2-medoidsmethod
Height of students in 1BM
Height (in cm)
120 130 140 150 160 170 180 190
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
The generalized 2-means clustering method
The classification rule :
x ∈ Cj ⇔ Ω(‖x − Tj‖) = min1≤i≤2
Ω(‖x − Ti‖)
In one dimension, the estimated clusters are simply:
C1 =] −∞, C[
C2 =]C, +∞[
where C =T1 + T2
2is the cut-off point.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Error rate
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 1D
Height of students in 1BM
Height (in cm)
160 170 180 190
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 1D with the 2-means
Height of students in 1BM
Height (in cm)
160 170 180 190
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 1D with the 2-means
Height of students in 1BM
Height (in cm)
160 170 180 190
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Example in 1D with the 2-means
Height of students in 1BM
Height (in cm)
160 170 180 190
GirlsBoys
ER=0.326
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D
Height of students in 1BM
Height (in cm)
120 140 160 180
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D with the 2-means
Height of students in 1BM
Height (in cm)
120 140 160 180
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D with the 2-means
Height of students in 1BM
Height (in cm)
120 140 160 180
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D with the 2-means
Height of students in 1BM
Height (in cm)
120 140 160 180
GirlsBoys
ER=0.609
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example in 1D with the2-medoids
Height of students in 1BM
Height (in cm)
120 140 160 180
GirlsBoys
ER=0.370
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Classification setting
Suppose
X arises from 2 groups G1 and G2 with πi(F ) = IPF [X ∈ Gi ]
then
F is a mixture of two distributions
F = π1(F )F1 + π2(F )F2
with densities f1 and f2.
Additional assumption : one dimension !
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
The generalized 2-means as statisticalfunctionals
The clusters’ centers (T1(F ), T2(F )) are solution of
mint1,t2⊂R2
∫
Ω
(
inf1≤j≤2
‖x − tj‖)
dF (x)
for a suitable nondecreasing penalty function Ω.
The classification rule is
RF (x) = Cj(F ) ⇔ Ω(‖x−Tj(F )‖) = min1≤i≤2
Ω(‖x−Ti(F )‖)
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Optimality in classification
A classification rule is optimal if the corresponding errorrate is minimal
The optimal classification rule is the Bayes rule :
x ∈ C1 ⇔ π1(F )f1(x) > π2(F )f2(x)
(Anderson, 1958)
The 2-means procedure is optimal under the model
FN = 0.5 N(µ1, σ2) + 0.5 N(µ2, σ
2) with µ1 < µ2
(Qiu and Tamhane, 2007)
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate
Theoretical error rate :• Training sample according to F : estimation of the rule• Test sample according to Fm : evaluation of the rule• In ideal circumstances : F = Fm
ER(F , Fm) =2
∑
j=1
πj(Fm)IPFm
[
RF (X ) 6= Cj(F )∣
∣ Gj]
Empirical error rate :• Training sample according to F : estimation and
evaluation of the rule
ER(F , F ) =2
∑
j=1
πj(F )IPF[
RF (X ) 6= Cj(F )∣
∣ Gj]
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate
Theoretical error rate :• Training sample according to F : estimation of the rule• Test sample according to Fm : evaluation of the rule• In ideal circumstances : F = Fm
ER(F , Fm) =2
∑
j=1
πj(Fm)IPFm
[
RF (X ) 6= Cj(F )∣
∣ Gj]
Empirical error rate :• Training sample according to F : estimation and
evaluation of the rule
ER(F , F ) =2
∑
j=1
πj(F )IPF[
RF (X ) 6= Cj(F )∣
∣ Gj]
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated distribution
A contaminated distribution is defined by
Fε
""D
DD
DD
DD
D
zzuuuu
uuuu
uu
1 − ε : F ε : G
where G is a arbitrary distribution function.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated distribution
A contaminated distribution is defined by
Fε
""D
DD
DD
DD
D
zzuuuu
uuuu
uu
1 − ε : F ε : G
where G is a arbitrary distribution function.
To see the influence of one singular point x , G = ∆x leadingto
Fε = (1 − ε)F + ε∆x
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated mixture
−4 −2 0 2 4
0.00
0.05
0.10
0.15
0.20
Mixture
x
0.4
0.6
−4 −2 0 2 4
0.00
0.05
0.10
0.15
0.20
Contaminated mixture
x
(1−ε)0.4
(1−ε)0.6
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate undercontamination
Now, the training sample is distributed as Fε which is acontaminated mixture.
Theoretical error rate :
ER(Fε, Fm) =2
∑
j=1
πj(Fm)IPFm
[
RFε(X ) 6= Cj(Fε)
∣
∣ Gj]
Empirical error rate :
ER(Fε, Fε) =2
∑
j=1
πj(Fε)IPFε
[
RFε(X ) 6= Cj(Fε)
∣
∣ Gj]
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate undercontamination
Now, the training sample is distributed as Fε which is acontaminated mixture.
Theoretical error rate :
ER(Fε, Fm) =2
∑
j=1
πj(Fm)IPFm
[
RFε(X ) 6= Cj(Fε)
∣
∣ Gj]
Empirical error rate :
ER(Fε, Fε) =2
∑
j=1
πj(Fε)IPFε
[
RFε(X ) 6= Cj(Fε)
∣
∣ Gj]
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate undercontamination
Graphically :
Fm = FN ≡ 0.5 N(−1, 1) + 0.5 N(1, 1) an optimal model
C(FN) = −1+12 = 0 (Qiu and Tamhane, 2007)
Fε = (1 − ε)Fm + ε∆x
ε varying and x = −0.5
x ∈ G1 varying and ε = 0.1
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate undercontamination
Theoretical error rate (with the 2-means) :
0.0 0.2 0.4 0.6 0.8
0.16
00.
165
0.17
00.
175
0.18
0
Err
or R
ate
ε−2 −1 0 1 2
0.15
90.
160
0.16
10.
162
0.16
3
x
Err
or R
ate
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate undercontamination
Empirical error rate (with the 2-means) :
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Err
or R
ate
If x is well placedIf not
ε−2 −1 0 1 2
0.14
0.16
0.18
0.20
0.22
0.24
x
Err
or R
ate
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Influence function of theerror rate
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Influence functions
Hampel et al (1986) : For any statistical functional T andany distribution F ,
IF(x ; T, F ) = limε→0
T(Fε) − T(F )
ε=
∂
∂εT(Fε)
∣
∣
∣
∣
ε=0where
Fε = (1 − ε)F + ε∆x ;
EF [IF(X ; T, F )] = 0;
T(Fε) ≈ T(F ) + εIF(x ; T, F ) for ε small enough(First-order von Mises expansion of T at F ).
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Influence functions
Hampel et al (1986) : For any statistical functional T andany distribution F ,
IF(x ; T, F ) = limε→0
T(Fε) − T(F )
ε=
∂
∂εT(Fε)
∣
∣
∣
∣
ε=0where
Fε = (1 − ε)F + ε∆x ;
EF [IF(X ; T, F )] = 0;
T(Fε) ≈ T(F ) + εIF(x ; T, F ) for ε small enough(First-order von Mises expansion of T at F ).
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate
ER(Fε, F ) ≈ ER(F , F ) + εIF(x ; ER, F )
ER(Fε, Fε) ≈ ER(F , F ) + εIF(x ; ER, F )
Theoretical error rate :
ER(Fε, FN) ≥ ER(FN , FN) ⇒ IF(x ; ER, FN) ≡ 0
Empirical error rate : The IF does not vanish!From now on, ER(F ) = ER(F , F ).
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate
ER(Fε, F ) ≈ ER(F , F ) + εIF(x ; ER, F )
ER(Fε, Fε) ≈ ER(F , F ) + εIF(x ; ER, F )
Theoretical error rate :
ER(Fε, FN) ≥ ER(FN , FN) ⇒ IF(x ; ER, FN) ≡ 0
Empirical error rate : The IF does not vanish!From now on, ER(F ) = ER(F , F ).
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Theoretical vs empirical error rate
ER(Fε, F ) ≈ ER(F , F ) + εIF(x ; ER, F )
ER(Fε, Fε) ≈ ER(F , F ) + εIF(x ; ER, F )
Theoretical error rate :
ER(Fε, FN) ≥ ER(FN , FN) ⇒ IF(x ; ER, FN) ≡ 0
Empirical error rate : The IF does not vanish!From now on, ER(F ) = ER(F , F ).
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
ER(Fε) =?
ER(F ) =2
∑
j=1
πj(F )IPF[
RF (X ) 6= Cj(F )∣
∣ Gj]
= π1(F )1 − F1(C(F )) + π2(F )F2(C(F ))
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
ER(Fε) =?
ER(F ) =2
∑
j=1
πj(F )IPF[
RF (X ) 6= Cj(F )∣
∣ Gj]
= π1(F )1 − F1(C(F )) + π2(F )F2(C(F ))
Under Fε = (1 − ε)F + ε∆x , one has
ER(Fε) = π1(Fε)
1 − F1,ε (C(Fε))
+ π2(Fε)F2,ε (C(Fε))
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
ER(Fε) =?
ER(F ) =2
∑
j=1
πj(F )IPF[
RF (X ) 6= Cj(F )∣
∣ Gj]
= π1(F )1 − F1(C(F )) + π2(F )F2(C(F ))
Under Fε = (1 − ε)F + ε∆x , one has
ER(Fε) = π1(Fε)
1 − F1,ε (C(Fε))
+ π2(Fε)F2,ε (C(Fε))
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
πi(Fε) =? and Fi ,ε =?
Fε = (1 − ε)F + ε∆x
πi(Fε) = (1 − ε)πi(F ) + εIx ∈ Gi
Fi,ε =
(
1 −εIx ∈ Gi
πi(Fε)
)
Fi +εIx ∈ Gi
πi(Fε)∆x
⇒ Fε = π1(Fε)F1,ε + π2(Fε)F2,ε
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
πi(Fε) =? and Fi ,ε =?
Fε = (1 − ε)F + ε∆x
πi(Fε) = (1 − ε)πi(F ) + εIx ∈ Gi
Fi,ε =
(
1 −εIx ∈ Gi
πi(Fε)
)
Fi +εIx ∈ Gi
πi(Fε)∆x
⇒ Fε = π1(Fε)F1,ε + π2(Fε)F2,ε
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
πi(Fε) =? and Fi ,ε =?
Fε = (1 − ε)F + ε∆x
πi(Fε) = (1 − ε)πi(F ) + εIx ∈ Gi
Fi,ε =
(
1 −εIx ∈ Gi
πi(Fε)
)
Fi +εIx ∈ Gi
πi(Fε)∆x
⇒ Fε = π1(Fε)F1,ε + π2(Fε)F2,ε
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Influence function of the empirical error rate
Proposition
For all x 6= C(F ),
IF(x ; ER, F ) = −ER(F )
+ Ix ≤ C(F )(1 − 2 Ix ∈ G1) + Ix ∈ G1
+12(IF(x ; T1, F ) + IF(x ; T2, F ))
π2(F )f2(C(F )) − π1(F )f1(C(F )).
Expressions of IF(x ; T1, F ) and IF(x ; T2, F ) have beencomputed by García-Escudero and Gordaliza (1999).
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Graphics of IF(x ; ER, F )
−4 −2 0 2 4
0.0
0.5
1.0
x
IF E
R
x in G1x in G2
−4 −2 0 2 4
0.0
0.5
1.0
xIF
ER
x in G1x in G2
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Graphics of IF(x ; ER, F )
−4 −2 0 2 4
−2
−1
01
2
x
IF E
R
pi1=0.2pi1=0.4pi1=0.6pi1=0.8
−4 −2 0 2 4
−2
−1
01
2x
IF E
R
pi1=0.2pi1=0.4pi1=0.6pi1=0.8
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Graphics of IF(x ; ER, F )
∆ = µ2 − µ1
−4 −2 0 2 4
−1.
0−
0.5
0.0
0.5
1.0
1.5
x
IF E
R
Delta=1Delta=3Delta=5
−4 −2 0 2 4
−1.
0−
0.5
0.0
0.5
1.0
1.5
x
IF E
R
Delta=1Delta=3Delta=5
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Graphics of IF(x ; ER, F )
−4 −2 0 2 4
−0.
50.
00.
51.
0
x
IF E
R
sigma=0.5sigma=1sigma=1.5sigma=2
−4 −2 0 2 4
−0.
50.
00.
51.
0x
IF E
R
sigma=0.5sigma=1sigma=1.5sigma=2
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Bias of the error rate
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Definition
In practice, F is replace by Fn, the empirical cdf
The bias is defined by Bn(ER) = EF [ER(Fn) − ER(F )]
Fernholz (2001) :
Bn(ER) =1
2nEF [IF2(X ; ER, F )] + o(n−1)
where
IF2(x ; ER, F ) =∂2
∂ε2 ER((1 − ε)F + ε∆x)
∣
∣
∣
∣
ε=0.
This result is true if ER(.) is Hadamard or Fréchetdifferentiable.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Computation of the bias
Proposition
Under asymptotic normality and consistency of T1 and T2
(Pollard, 1981 and 1982) :
Bn(ER) ≈1
4nπ2(F )f2(C(F )) − π1(F )f1(C(F ))
(EF [IF2(X ; T1, F )] + EF [IF2(X ; T2, F )])
+1
8nπ2(F )f ′2(C(F )) − π1(F )f ′1(C(F ))
(ASV(T1) + ASV(T2) + 2 ASC(T1, T2)).
Expressions of IF2(X ; T1, F ) and IF2(X ; T2, F ) have beencomputed.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Asymptotic difference
How much difference in error rate is to be expected byestimating a clustering rule from a finite sample ?
A-Diff(ER) = limn→∞
n Bn(ER)
⇒ Graphical comparisons of the 2-means and 2-medoidsmethods :
F = π1 N(−∆/2, 1) + (1 − π1) N(∆/2, 1)
• π1 varying and ∆ = 3• ∆ varying and π1 = 0.4
FN = 0.5 N(−∆/2, 1) + 0.5 N(∆/2, 1)
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Graphics of A-Diff(ER) under F
0.2 0.4 0.6 0.8
−50
050 2−means
2−medoids
π1
A-D
iff
0 2 4 6 8 10
−0.
10.
00.
10.
20.
3
2−means2−medoids
∆A
-Diff
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Graphic of A-Diff(ER) under FN
0 2 4 6 8 10
−0.
06−
0.04
−0.
020.
00
2−means2−medoids
∆
A-D
iff
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation study
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation settings
F = 0.4 N(−1.5, 1) + 0.6 N(1.5, 1)
n = 100Fε = (1 − ε)F + ε∆x with
• ε = 0.01 and |x | = 5• ε = 0.05 and |x | = 5• ε = 0.01 and |x | = 50• ε = 0.05 and |x | = 50
N = 1000
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
from G1
from G2
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083from G1
0.078 0.072from G2
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1
0.078 0.072 0.081 0.073from G2
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1 0.064 0.139
0.078 0.072 0.081 0.073from G2 0.119 0.083
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1 0.064 0.139 0.066 0.127
0.078 0.072 0.081 0.073from G2 0.119 0.083 0.116 0.072
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1 0.064 0.139 0.066 0.127
0.39 0.61
0.078 0.072 0.081 0.073from G2 0.119 0.083 0.116 0.072
0.41 0.59
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1 0.064 0.139 0.066 0.127
0.39 0.61 0.072 0.083
0.078 0.072 0.081 0.073from G2 0.119 0.083 0.116 0.072
0.41 0.59 0.081 0.073
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1 0.064 0.139 0.066 0.127
0.39 0.61 0.072 0.0830.35 0.650.078 0.072 0.081 0.073
from G2 0.119 0.083 0.116 0.0720.41 0.59 0.081 0.0730.45 0.55
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Simulation results
2-means 2-medoids
(0.071) (0.072)
in C1 in C2 in C1 in C2
0.068 0.083 0.071 0.083from G1 0.064 0.139 0.066 0.127
0.39 0.61 0.072 0.0830.35 0.65 0.35 0.650.078 0.072 0.081 0.073
from G2 0.119 0.083 0.116 0.0720.41 0.59 0.081 0.0730.45 0.55 0.45 0.55
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Contaminated example with the 2-medoids
Height of students in 1BM
Height (in cm)
120 130 140 150 160 170 180
GirlsBoys
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Conclusion of these simulations
A well-placed contamination in the smallest groupmakes the error rate decrease
Too much contamination makes the error rate of the2-means break down
Unfortunately, the error rate of the 2-medoids alsobreaks down when there are too much and too faroutliers !
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Future research
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Future research
Generalized trimmed 2-means : for α ∈ [0, 1],(T1(F ), T2(F )) are solution of
minA:F (A)=1−α
mint1,t2⊂R
∫
AΩ
(
inf1≤j≤2
‖x − tj‖)
dF (x).
Theoretical error rate
ER(Fε, Fm) =2
∑
j=1
πj(Fm)IPFm
[
RFε(X ) 6= Cj(Fε)
∣
∣ Gj]
More than 1 dimension or more than 2 groups
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Thank you for yourattention!
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Bibliography
CROUX C., FILZMOSER P. and JOOSSENS K. (2008),Classification efficiences for robust linear discriminantanalysis, Statistica Sinica 18, pp. 581-599.
CROUX C., HAESBROECK G. and JOOSSENS K. (2008),Logistic discrimination using robust estimators : aninfluence function approach, The Canadian Journal ofStatistics, 36, pp. 157-174.
FERNHOLZ L. T., On multivariate higher order von Misesexpansions in Metrika 2001, vol. 53, pp. 123-140.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Bibliography
GARCÍA-ESCUDERO L. A. and GORDALIZA A.,Robustness Properties of k Means and Trimmed kMeans, Journal of the American Statistical Association,September 1999, Vol. 94, nr 447, pp. 956-969.
HAMPEL F.R., RONCHETTI E.M., ROUSSEEUW P.J.,STAHEL W.A., Robust Statistics : The Approach Basedon Influence Functions, John Wiley and Sons,New-York, 1986.
ANDERSON T.W., An Introduction to MultivariateStatistical Analysis, Wiley, New-York, 1958, pp.126-133.
Influencefunction of
the error rateof
generalizedk-means
Ch. Ruwet
Introduction
Error rate
Influencefunction of theerror rate
Bias of theerror rate
Simulationstudy
Futureresearch
Bibliography
POLLARD D., Strong Consistency of k-MeansClustering, The Annals of Probability, 1981, Vol.9, nr4,pp.919-926.
POLLARD D., A Central Limit Theorem for k-MeansClustering, The Annals of Probability, 1982, Vol.10, nr1,pp.135-140.
QIU D. and TAMHANE A. C. (2007), A comparative studyof the k -means algorithm and the normal mixturemodel for clustering : Univariate case, Journal ofStatistical Planning and Inference, 137, pp. 3722-3740.