
The CLICK clustering algorithm

Roded Sharan, Naama Arbili, Adi Maron, Rani Elkon

ABDBM © Ron Shamir 1

http://www.cs.tau.ac.il/~rshamir/Group/Photos/roded.jpg

http://www.cs.tau.ac.il/~rshamir/Group/Photos/Rani.jpg

CLICK: CLuster Identification via Connectivity Kernels

• Identify highly homogeneous sets of elements - connectivity kernels.
• Add elements to kernels via similarity to average kernel fingerprints.
• Uses tools from graph theory and probabilistic considerations for similarity evaluation and kernel identification.
• Efficient implementation.


Data

• Input: Real-valued matrix.
• Row vectors: gene fingerprints
• Compute gene similarity matrix

[Figure: raw data matrix (genes × experiments) and the derived similarity matrix S (genes × genes)]
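The slides do not say which similarity measure is used. As a minimal sketch, assuming the Pearson correlation between fingerprints (an assumption; the function names and sample data are hypothetical), the similarity matrix could be computed like this:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined for constant vectors)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def similarity_matrix(fingerprints):
    """Symmetric matrix S with S[i][j] = similarity of gene i and gene j."""
    n = len(fingerprints)
    S = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = pearson(fingerprints[i], fingerprints[j])
    return S

# Hypothetical data: three genes measured in four experiments.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
S = similarity_matrix(genes)
```

Here genes 0 and 1 have the same expression pattern up to scale, while gene 2 follows the opposite pattern.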

Probabilistic Model

• Mates: genes that belong to the same true cluster
• Probabilistic assumptions:
  – Similarity between mates ~ N(µT,σT)
  – Similarity between non-mates ~ N(µF,σF)
  – Independence of pairwise similarities
  – Mates probability: p
• Observed for real data
• Justified in some cases by the Central Limit Theorem, simulations
• Parameter values needed:
  – Computed from partially known solution, or
  – Estimated using the EM algorithm

Similarity Graph

• Input ⇒ weighted graph G with a vertex per element and an edge between similar elements.
• The weight of edge (i,j):

$$
w_{ij} = \ln \frac{p \cdot f_M(S_{ij})}{(1-p) \cdot f_N(S_{ij})}
$$

f_M / f_N = p.d.f. at S_ij for i,j mates / non-mates

[Figure: example weighted similarity graph]


Edge weights (contd)

$$
f(S_{ij} \mid i,j \text{ are mates}) = \frac{1}{\sqrt{2\pi}\,\sigma_T}\, e^{-\frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}}
$$

$$
w_{ij} = \ln \frac{p \cdot f_M(S_{ij})}{(1-p) \cdot f_N(S_{ij})}
$$
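The edge weight formula can be evaluated directly from the two normal densities. The following is a minimal sketch, not the authors' implementation; the function names and the parameter values in the example are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma the standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def edge_weight(s, p, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln( p * f_M(S_ij) / ((1 - p) * f_N(S_ij)) )."""
    f_m = normal_pdf(s, mu_t, sigma_t)  # density of S_ij if i, j are mates
    f_n = normal_pdf(s, mu_f, sigma_f)  # density of S_ij if i, j are non-mates
    return math.log(p * f_m / ((1 - p) * f_n))

# Hypothetical parameters: p = 0.5, mates ~ N(1, 1), non-mates ~ N(0, 1).
# With equal variances and p = 0.5 the weight reduces to s - 0.5.
w_low = edge_weight(0.0, 0.5, 1.0, 1.0, 0.0, 1.0)
w_mid = edge_weight(0.5, 0.5, 1.0, 1.0, 0.0, 1.0)
w_high = edge_weight(1.0, 0.5, 1.0, 1.0, 0.0, 1.0)
```

A positive weight means the observed similarity is better explained by the mates hypothesis, a negative weight by the non-mates hypothesis. For similarities far in the tails the densities can underflow to zero, so a production version would work with log-densities instead.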


Kernel Identification

Cut - Partition of vertices into two groups.
Weight of cut - Sum of weights of edges crossing the cut.
• For each cut C in G we test two hypotheses:
  H0: C contains only edges between non-mates.
  H1: C contains only edges between mates.
G is declared a kernel if H1 is more probable for all cuts.
Thm: G is a kernel iff weight of min. cut > 0.

[Figure: example weighted graph with a cut]


Main Theorem

Thm: G is a kernel iff weight of min. cut > 0.

Pf: pick any cut C in G:

$$
\begin{aligned}
\ln\frac{\Pr(H_1 \mid C)}{\Pr(H_0 \mid C)}
&= \ln\frac{\Pr(H_1)\, f(C \mid H_1)}{\Pr(H_0)\, f(C \mid H_0)}
&& \text{(Bayes Thm.)} \\
&= \ln\frac{p^{|C|} \prod_{ij \in C} f_M(S_{ij})}{(1-p)^{|C|} \prod_{ij \in C} f_N(S_{ij})}
&& \text{($S_{ij}$-s and mate relations indep.)} \\
&= \sum_{ij \in C} \ln\frac{p \cdot f_M(S_{ij})}{(1-p) \cdot f_N(S_{ij})}
 = \sum_{ij \in C} w_{ij} = W(C)
\end{aligned}
$$

Take C* min cut of G. H1 accepted for C* iff it is accepted for all cuts => accept H1 iff W(C*) > 0.


Kernel Identification Algorithm

Basic-CLICK(G):
• If G={v} then v is a singleton.
• o/w:
  – Compute a min. weight cut C.
  – If Weight(C)>0 then G is a kernel.
  – o/w cut G into the two resulting pieces and continue recursively with each one.
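The recursion above can be sketched in a few lines. This is an illustration only: the minimum cut is found by brute-force enumeration of all bipartitions, which is exact even with negative weights but exponential, so it is usable only on tiny graphs. It is not the min-cut procedure used by CLICK; all names and the example graph are hypothetical.

```python
from itertools import combinations

def cut_weight(side, nodes, w):
    """Sum of weights of edges inside `nodes` with exactly one endpoint in `side`."""
    return sum(wt for (u, v), wt in w.items()
               if u in nodes and v in nodes and ((u in side) != (v in side)))

def min_cut_bruteforce(nodes, w):
    """Exact minimum-weight cut by enumerating all bipartitions (exponential time)."""
    ordered = sorted(nodes)
    anchor, rest = ordered[0], ordered[1:]
    best = None
    # The anchor stays on one side; k < len(rest) keeps the other side non-empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side = {anchor, *extra}
            wt = cut_weight(side, nodes, w)
            if best is None or wt < best[0]:
                best = (wt, side)
    return best

def basic_click(nodes, w):
    """Return (kernels, singletons) following the Basic-CLICK recursion."""
    nodes = set(nodes)
    if len(nodes) == 1:
        return [], [next(iter(nodes))]
    weight, side = min_cut_bruteforce(nodes, w)
    if weight > 0:
        return [nodes], []          # every cut favours H1: G is a kernel
    k1, s1 = basic_click(side, w)   # otherwise split along the min cut
    k2, s2 = basic_click(nodes - side, w)
    return k1 + k2, s1 + s2

# Hypothetical graph: a positive triangle {1,2,3}, a positive pair {4,5},
# and vertex 6 attached only by a negative edge.
weights = {(1, 2): 2.0, (1, 3): 2.0, (2, 3): 2.0,
           (3, 4): -1.0, (4, 5): 3.0, (5, 6): -2.0}
kernels, singletons = basic_click({1, 2, 3, 4, 5, 6}, weights)
```

Pairs with no entry in `weights` contribute nothing to a cut, so a disconnected graph has a cut of weight 0 and is split further.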


Refinements

• Adoption Step: Adopt a singleton into a kernel if it is sufficiently similar to the kernel's average fingerprint
• Iterative application of Basic-CLICK and the adoption step.
• Merging Step: Merge clusters with sufficiently similar average fingerprints

These steps use both fingerprints and similarity values.

Mincut: NP-hard when there are negative weights
=> heuristic: remove all negative wt edges, compute mincut, correct cut weight for kernel test criterion.
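The adoption step can be illustrated as follows. The slides only say "sufficiently similar", so the Pearson correlation, the fixed `threshold`, and the rule of adopting into the single best-matching kernel are all assumptions made for this sketch; the names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined for constant vectors)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_fingerprint(rows):
    """Coordinate-wise average of a list of fingerprints."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def adopt_singletons(kernels, singletons, threshold):
    """Move each singleton into its most similar kernel if similarity >= threshold.

    kernels: list of kernels, each a list of fingerprints.
    singletons: list of fingerprints.
    Kernel averages are computed once, before any adoption (an assumption).
    """
    if not kernels:
        return [], list(singletons)
    centroids = [mean_fingerprint(k) for k in kernels]
    adopted = [list(k) for k in kernels]
    remaining = []
    for fp in singletons:
        sims = [pearson(fp, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            adopted[best].append(fp)
        else:
            remaining.append(fp)
    return adopted, remaining

# Hypothetical data: one rising kernel, one falling kernel, two singletons.
kernels = [[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
           [[3.0, 2.0, 1.0], [4.0, 3.0, 2.0]]]
singletons = [[1.0, 2.0, 4.0], [1.0, 3.0, 1.0]]
new_kernels, left_over = adopt_singletons(kernels, singletons, threshold=0.9)
```

In this example the rising singleton joins the first kernel, while the peaked one matches neither average fingerprint and stays a singleton.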


Figures of Merit

Evaluating a clustering solution when no correct clustering is known; using fingerprints.

Homogeneity: Average similarity between the fingerprint of an element and the average fingerprint of its cluster. (Also reported: minimum.)

Separation: Weighted average similarity between the average fingerprints of clusters. (Also reported: maximum.)
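Both figures of merit can be computed from the fingerprints alone. The sketch below assumes Pearson correlation as the similarity and weights each pair of clusters by the product of their sizes; both choices are assumptions, since the slides do not specify them, and the names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined for constant vectors)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_fingerprint(rows):
    """Coordinate-wise average of a list of fingerprints."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def figures_of_merit(clusters):
    """Return (H_ave, H_min, S_ave, S_max) for a list of at least two clusters."""
    centroids = [mean_fingerprint(c) for c in clusters]
    # Homogeneity: similarity of each element to its own cluster average.
    sims = [pearson(fp, centroids[k]) for k, c in enumerate(clusters) for fp in c]
    h_ave, h_min = sum(sims) / len(sims), min(sims)
    # Separation: similarity between cluster averages, weighted by |Xi| * |Xj|.
    num = den = 0.0
    s_max = -math.inf
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            weight = len(clusters[i]) * len(clusters[j])
            s = pearson(centroids[i], centroids[j])
            num += weight * s
            den += weight
            s_max = max(s_max, s)
    return h_ave, h_min, num / den, s_max

# Hypothetical data: one rising and one falling cluster.
clusters = [[[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
            [[3.0, 2.0, 1.0], [6.0, 4.0, 2.0]]]
h_ave, h_min, s_ave, s_max = figures_of_merit(clusters)
```

A good solution has high homogeneity and low separation, which is how the comparison tables on the following slides should be read.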


Gene Expression: Yeast Cell Cycle

Expression levels of 6,218 S. cerevisiae genes, measured at 17 time points over two cell cycles. (Data from Cho et al., Mol. Cell 1998)

Algorithm       Clusters   Homogeneity        Separation
                           Ave      Min       Ave      Max
CLICK           30         0.8      -0.19     -0.07    0.65
GeneCluster*    30         0.74     -0.88     -0.02    0.97

*Self-Organizing Maps; Tamayo et al., PNAS 1999.


CLICK clusters: Yeast Cell Cycle


Yeast Cell Cycle: late G1 Cluster

• Contains 91% of late G1-peaking genes.
• In contrast, in GeneCluster 87% are contained in 3 clusters.
• M peaking genes: CLICK: 95% in a single cluster, GeneCluster: 92.5% in 3 clusters.
• Similar specificities

N=164


Gene Expression: Serum Response

Human fibroblast cells starved for 48 hours, then stimulated by serum. Expression levels of 8,613 genes measured at 13 time points. (Data from Iyer et al., Science 1999)

Algorithm    Clusters   Homogeneity        Separation
                        Ave      Min       Ave      Max
CLICK        10         0.88     0.13      -0.34    0.65
CLUSTER*     10         0.87     -0.75     -0.13    0.9

*Average linkage agglomerative hierarchical clustering; Eisen et al., PNAS 1998.


Performance of CLICK vs other clustering algorithms

Elements   Problem                              Original Algorithm             CLICK Improvement   Time (min)*
517        Gene Expression, Fibroblasts         Cluster (Eisen et al.)         Yes                 0.5
826        Gene Expression, Yeast cell cycle    GeneCluster (Tamayo et al.)    Yes                 0.2
2,329      cDNA OFP, Blood Monocytes            HCS (Hartuv et al.)            Yes                 0.8
20,275     cDNA OFP, Sea urchin eggs            K-Means (Herwig et al.)        Yes                 32.5
72,623     Protein similarity                   ProtoMap (Yona et al.)         Minor               53
117,835    Protein similarity                   SYSTERS (Krause et al.)        Yes                 126.3

* Executed on an SGI ORIGIN200 machine utilizing one IP27 processor. Does not include preprocessing time.

Performance on Yeast Cell Cycle data

[Figure: homogeneity vs. separation for the “True” solution, CAST*, GeneCluster (SOM), K-means and CLICK]

*Ben-Dor, Shamir, Yakhini ‘99

698 genes, 71 conditions (data from Spellman et al ‘98)

Each alg was run by its authors in a “blind” test


EXPression ANalyzer and DisplayER

http://acgt.cs.tau.ac.il/expander

• Clustering: Identify clusters of co-expressed genes (CLICK, KMeans, SOM, hierarchical)
• Biclustering: Identify homogeneous submatrices (SAMBA)
• Functional enrichment
• Visualization
• Promoter analysis: Analyze TF binding sites of co-regulated genes (PRIMA)
• microRNA enrichment (FAME)

A. Maron, R. Sharan Bioinformatics 03

A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon BMC Bioinformatics 05


FIN

