cutting the dendrogram through permutation tests

Post on 11-Feb-2022

4 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Dario Bruzzese Domenico Vistoccodbruzzes@unina.it vistocco@unicas.it

Dario Bruzzese, Domenico Vistocco () Compstat 2010 1 / 19

Cutting the dendrogram throughpermutation tests

Department ofPreventive Medical Sciences

UNIVERSITY OF NAPLES ITALY

Department ofEconomics

UNIVERSITY OF CASSINO ITALY

La Carte

1 Motivation

2 The stairstep-like permutation procedureNotationThe outline

3 Some resultsReal datasetsSynthetic dataset

4 ToDo List

Dario Bruzzese, Domenico Vistocco () Compstat 2010 2 / 19

La Carte

1 Motivation

2 The stairstep-like permutation procedureNotationThe outline

3 Some resultsReal datasetsSynthetic dataset

4 ToDo List

Dario Bruzzese, Domenico Vistocco () Compstat 2010 3 / 19

Motivation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19

Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut

Motivation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19

Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut

The rep1HighNoise datasetYeung KY, Medvedovic M, Bumgarner KY:Clustering gene-expression data withrepeated measurements.

Genome Biology, 2003, 4:R34

n = 200p = 20

Motivation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19

Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut

Horizontal cutk = 3

Motivation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19

Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut

An alternative cutk = 3

La Carte

1 Motivation

2 The stairstep-like permutation procedureNotationThe outline

3 Some resultsReal datasetsSynthetic dataset

4 ToDo List

Dario Bruzzese, Domenico Vistocco () Compstat 2010 5 / 19

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:

n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

︸ ︷︷ ︸︸ ︷︷ ︸

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C1L C1

R

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C2L C2

R

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C3L C3

R

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C1L C1

R

h(

C1L ∪ C1

R

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C2L C2

R

h(

C2L ∪ C2

R

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C3L C3

R

h(

C3L ∪ C3

R

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C1L

h(

C1L

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C1R

h(

C1R

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C2L

h(

C2L

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C2R

h(

C2R

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C3L

h(

C3L

)

Notation

Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19

Let:n the number of objects to classify;

CkL and Ck

R the two classes merged at level k(k=1,...,n-1)

h(

CkL ∪ Ck

R

)the height necessary to merge

CkL and Ck

R

h(

Ckj

)the height at which Ck

j has been obtained(j ∈ { L, R })

C3R

h(

C3R

)

The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset

initialization:aggregationLevelsToVisit← h(C1

L ∪ C1R)

permClusters← [ ]i← 1repeat

if C iL ≡ C i

R thenadd C i

L ∪ C iR to permClusters

elseadd h(C i

L) and h(C iR) to aggregationLevelsToVisit

sort aggregationLevelsToVisit in descending orderendremove the first element from aggregationLevelsToVisiti← i+1

until aggregationLevelsToVisit is empty

Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19

The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset

initialization:aggregationLevelsToVisit← h(C1

L ∪ C1R)

permClusters← [ ]i← 1

repeatif C i

L ≡ C iR then

add C iL ∪ C i

R to permClusterselse

add h(C iL) and h(C i

R) to aggregationLevelsToVisitsort aggregationLevelsToVisit in descending order

endremove the first element from aggregationLevelsToVisiti← i+1

until aggregationLevelsToVisit is empty

Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19

The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset

initialization:aggregationLevelsToVisit← h(C1

L ∪ C1R)

permClusters← [ ]i← 1repeat

if C iL ≡ C i

R thenadd C i

L ∪ C iR to permClusters

elseadd h(C i

L) and h(C iR) to aggregationLevelsToVisit

sort aggregationLevelsToVisit in descending orderend

remove the first element from aggregationLevelsToVisiti← i+1

until aggregationLevelsToVisit is empty

Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19

The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset

initialization:aggregationLevelsToVisit← h(C1

L ∪ C1R)

permClusters← [ ]i← 1repeat

if C iL ≡ C i

R thenadd C i

L ∪ C iR to permClusters

elseadd h(C i

L) and h(C iR) to aggregationLevelsToVisit

sort aggregationLevelsToVisit in descending orderendremove the first element from aggregationLevelsToVisiti← i+1

until aggregationLevelsToVisit is empty

Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19

The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset

initialization:aggregationLevelsToVisit← h(C1

L ∪ C1R)

permClusters← [ ]i← 1repeat

if C iL ≡ C i

R thenadd C i

L ∪ C iR to permClusters

elseadd h(C i

L) and h(C iR) to aggregationLevelsToVisit

sort aggregationLevelsToVisit in descending orderendremove the first element from aggregationLevelsToVisiti← i+1

until aggregationLevelsToVisit is empty

Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Initializationi ← 0

aggregationLevelsToVisit

h(C1L ∪ C1

R)

permClusters

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 1

aggregationLevelsToVisit

h(C1L ∪ C1

R)

permClusters

h(

C1L ∪ C1

R

)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 1

aggregationLevelsToVisit

h(C1L ∪ C1

R)

permClusters

clusters to compare

H0 : C1L ≡ C1

R 7→ reject

C1L C1

R

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 1

permClusters

aggregationLevelsToVisit

h(C1L ∪ C1

R),h(C1R),h(C

1L)

h(

C1L

)h(

C1R

)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 1

permClusters

aggregationLevelsToVisit

h(C1R),h(C

1L)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 2

aggregationLevelsToVisit

h(C1R),h(C

1L)

h(

C1R

)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 2

aggregationLevelsToVisit

h(C1R),h(C

1L)

clusters to compare

H0 : C2L ≡ C2

R 7→ reject

C2L C2

R

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 2

aggregationLevelsToVisit

h(C1R),h(C

1L),h(C

2R),h(C

2L)

h(

C2L

)h(

C2R

)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 2

aggregationLevelsToVisit

h(C1L),h(C

2R),h(C

2L)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 3

aggregationLevelsToVisit

h(C1L),h(C

2R),h(C

2L)

h(

C1L

)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 3

aggregationLevelsToVisit

h(C1L),h(C

2R),h(C

2L)

clusters to compare

H0 : C3L ≡ C3

R 7→ reject

C3L C3

R

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 3

h(

C3L

)h(

C3R

) aggregationLevelsToVisit

h(C3R),h(C

2R),h(C

2L),h(C

3L)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 4

aggregationLevelsToVisit

h(C3R),h(C

2R),h(C

2L),h(C

3L)

h(

C3R

)

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

permClusters

Iterationi ← 4

aggregationLevelsToVisit

h(C3R),h(C

2R),h(C

2L),h(C

3L)

C4L C4

R

clusters to compare

H0 : C4L ≡ C4

R 7→ accept

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 4

aggregationLevelsToVisit

h(C3R),h(C

2R),h(C

2L),h(C

3L)

clusters to compare

H0 : C4L ≡ C4

R 7→ accept

permClusters

C4L ∪ C4

R ⇔ C3R

C3R

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 9

aggregationLevelsToVisit

permClusters

C3L ,C

3R,C

2L ,C

4L ,C

4R

The algorithm - The outline

Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19

Iterationi ← 9

aggregationLevelsToVisit

permClusters

C3L ,C

3R,C

2L ,C

4L ,C

4R

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19

For each aggregation level k a permutation test isdesigned to test the Null Hypothesis that the twogroups Ck

L and CkR really belong to the same

cluster, i.e. :

H0 : CkL ≡ Ck

R

Under this null, mixing up (permuting) the statisticalunits of Ck

L and CkR should not alter the aggregation

process resulting in their merging in.

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19

For each k , the difference between maxj∈{L,R}

h(

Ckj

)and min

j∈{L,R}h(

Ckj

)can be considered as the

minimum cost necessary to merge the two classes..

min h(C3j )

max h(C3j )

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19

For each k , the difference between maxj∈{L,R}

h(

Ckj

)and min

j∈{L,R}h(

Ckj

)can be considered as the

minimum cost necessary to merge the two classes.

The difference between h(

CkL ∪ Ck

R

)and

maxj∈{L,R}

h(

Ckj

)can be, instead, considered as the

cost actually incurred for merging CkL and Ck

R .

h(C3L ∪ C3

R )

max h(C3j )

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19

For each k , the difference between maxj∈{L,R}

h(

Ckj

)and min

j∈{L,R}h(

Ckj

)can be considered as the

minimum cost necessary to merge the two classes.

The difference between h(

CkL ∪ Ck

R

)and

maxj∈{L,R}

h(

Ckj

)can be, instead, considered as the

cost actually incurred for merging CkL and Ck

R .

The ratio between these two differences:

cost(

CkL ∪ Ck

R

)=

maxj∈{L,R}

h(

Ckj

)− min

j∈{L,R}h(

Ckj

)h(Ck

L ∪ CkR

)− max

j∈{L,R}h(

Ckj

)is thus a measure that characterizes the aggregation process resulting in thenew class Ck

L ∪ CkR

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19

C3L C3

R

mC3LmC3

R

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19

C3L C3

R

mC3LmC3

R

mC3L

mC3R

h(mC3L)

h(mC3R)

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19

The ratio:

cost(

mCkL ∪ mCk

R

)=

maxj∈{L,R}

h(

mCkj

)− min

j∈{L,R}h(

mCkj

)h(Ck

L ∪ CkR

)− max

j∈{L,R}h(

mCkj

)is thus a measure that characterizes the aggregation process resulting in thenew (potential) class mCk

L ∪ mCkR

C3L C3

R

mC3LmC3

R

mC3L

mC3R

h(mC3L)

h(mC3R)

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19

Under H0 the aggregation process resulting in the new cluster CkL ∪ Ck

R should be very similar

to the one that potentially produces mCkL ∪ mCk

R ; thus the two values cost(

mCkL ∪ mCk

R

)and

cost(

CkL ∪ Ck

R

)should be close enough.

C3L C3

R

mC3LmC3

R

mC3L

mC3R

h(mC3L)

h(mC3R)

The algorithm - The permutation Test

Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19

The permutation procedure is repeated M times and each time a new couple mCkL , mCk

R isobtained. The pvalue Montecarlo is thus computed as:

p =#

{cost

(mCk

L ∪ mCkR

)≤ cost

(Ck

L ∪ CkR

)}+ 1

M + 1

C3L C3

R

mC3LmC3

R

mC3L

mC3R

h(mC3L)

h(mC3R)

La Carte

1 Motivation

2 The stairstep-like permutation procedureNotationThe outline

3 Some resultsReal datasetsSynthetic dataset

4 ToDo List

Dario Bruzzese, Domenico Vistocco () Compstat 2010 11 / 19

Some results - Real datasets

The yeast galactosedatasetIdeker T, Thorsson V, Ranish JA,Christmas R, Buhler J, Eng JK,Bumgarner RE, Goodlett DR,Aebersold R, Hood LIntegrated genomic andproteomic analyses of asystemically perturbed metabolicnetwork.

Science 2001, 292:929-934.

n = 205p = 80

Dario Bruzzese, Domenico Vistocco () Compstat 2010 12 / 19

Some results - Real datasets

Dario Bruzzese, Domenico Vistocco () Compstat 2010 12 / 19

% of misclassification = 1.5

Some results - Real datasets

The diabetes datasetBanfield JD, Raftery AEModel–based Gaussian andNon–Gaussian Clustering.

Biometrics, 1993, 49, 803-821.

n = 145p = 3

Dario Bruzzese, Domenico Vistocco () Compstat 2010 13 / 19

Some results - Real datasets

Dario Bruzzese, Domenico Vistocco () Compstat 2010 13 / 19

% of misclassification = 15.2

Some results - Synthetic dataset

QIU W.-L, JOE H. (2009). clusterGeneration: random clustergeneration (with specified degree of separation). R package version1.2.7.

different number of clusters (k = 2; 3; 4; 5; 6; 7)separation index = 0.01different number of variables (p = 5; 10; 15)100 replications for each combination of k and p

Dario Bruzzese, Domenico Vistocco () Compstat 2010 14 / 19

Some results - Synthetic dataset (p=5)

Dario Bruzzese, Domenico Vistocco () Compstat 2010 15 / 19

Some results - Synthetic dataset (p=10)

Dario Bruzzese, Domenico Vistocco () Compstat 2010 16 / 19

Some results - Synthetic dataset (p=15)

Dario Bruzzese, Domenico Vistocco () Compstat 2010 17 / 19

La Carte

1 Motivation

2 The stairstep-like permutation procedureNotationThe outline

3 Some resultsReal datasetsSynthetic dataset

4 ToDo List

Dario Bruzzese, Domenico Vistocco () Compstat 2010 18 / 19

ToDo List

Statistical issues

Introducing a penalty term in the permutation test stepQuality measures of the obtained partitionMultiple Testing Problem (???)

Computational issues

profiling and optimizing the R codeI use of compiled codeI deploying a package

Dario Bruzzese, Domenico Vistocco () Compstat 2010 19 / 19

top related