
The CLICK clustering algorithm

Roded Sharan, Naama Arbili, Adi Maron, Rani Elkon

ABDBM © Ron Shamir 1

CLICK: CLuster Identification via Connectivity Kernels

• Identify highly homogeneous sets of elements: connectivity kernels.
• Add elements to kernels via similarity to average kernel fingerprints.
• Uses tools from graph theory and probabilistic considerations for similarity evaluation and kernel identification.
• Efficient implementation.

Raw Data

• Input: real-valued matrix (rows: genes; columns: experiments).
• Row vectors: gene fingerprints.
• Compute the gene similarity matrix S.
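A minimal sketch of the similarity computation, assuming Pearson correlation between fingerprints as the similarity measure (the slides do not say which measure is used). The names `pearson`, `similarity_matrix`, and the toy matrix `expr` are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length fingerprints."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant fingerprint: correlation undefined, treated as 0 here
    return cov / math.sqrt(vx * vy)

def similarity_matrix(fingerprints):
    """Symmetric matrix S with S[i][j] = similarity of rows i and j."""
    n = len(fingerprints)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        S[i][i] = 1.0
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = pearson(fingerprints[i], fingerprints[j])
    return S

# Toy input: 3 genes (rows) measured in 4 experiments (columns).
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
S = similarity_matrix(expr)
```

In this toy matrix the first two genes rise together and the third falls, so S[0][1] is close to 1 and S[0][2] is close to -1.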

Probabilistic Model

• Mates: genes that belong to the same true cluster.
• Probabilistic assumptions:
  – Similarity between mates ~ N(µT,σT)
  – Similarity between non-mates ~ N(µF,σF)
  – Independence of pairwise similarities
  – Mates probability: p
• Observed for real data.
• Justified in some cases by the Central Limit Theorem, simulations.
• Parameter values needed:
  – Computed from a partially known solution, or
  – Estimated using the EM algorithm
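The first option, computing the parameters from a partially known solution, might look like the sketch below. It assumes the known part of the solution is given as a mapping from element index to cluster label, and that σT and σF are standard deviations. The function name `estimate_parameters` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def estimate_parameters(S, known_labels):
    """Estimate (mu_T, sigma_T, mu_F, sigma_F, p) from labelled elements.

    S            -- similarity matrix (list of lists)
    known_labels -- dict: element index -> cluster label (assumed input format)
    """
    mate_sims, non_mate_sims = [], []
    for i, j in combinations(sorted(known_labels), 2):
        if known_labels[i] == known_labels[j]:
            mate_sims.append(S[i][j])
        else:
            non_mate_sims.append(S[i][j])
    if not mate_sims or not non_mate_sims:
        raise ValueError("need at least one mate pair and one non-mate pair")

    def mean_std(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / len(values)
        return m, math.sqrt(var)

    mu_t, sigma_t = mean_std(mate_sims)
    mu_f, sigma_f = mean_std(non_mate_sims)
    # Fraction of labelled pairs that are mates, used as the estimate of p.
    p = len(mate_sims) / (len(mate_sims) + len(non_mate_sims))
    return {"mu_T": mu_t, "sigma_T": sigma_t,
            "mu_F": mu_f, "sigma_F": sigma_f, "p": p}

# Toy example: elements 0,1 are known mates, and so are 2,3.
S = [
    [1.0, 0.8, 0.1, -0.1],
    [0.8, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.6],
    [-0.1, 0.2, 0.6, 1.0],
]
params = estimate_parameters(S, {0: "a", 1: "a", 2: "b", 3: "b"})
```

When no labelled subset is available, the slides point to EM instead; that variant is not shown here.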

Similarity Graph

• Input ⇒ weighted graph G with a vertex per element and an edge between similar elements.
• The weight of edge (i,j):

$$w_{ij} = \ln \frac{p \cdot f_M(S_{ij})}{(1-p) \cdot f_N(S_{ij})}$$

fM / fN = p.d.f. at Sij for ij mates / non-mates

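Edge weights (contd)

$$f(S_{ij} \mid i,j \text{ are mates}) = \frac{1}{\sqrt{2\pi}\,\sigma_T}\, e^{-\frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}}$$

$$w_{ij} = \ln \frac{p \cdot f_M(S_{ij})}{(1-p) \cdot f_N(S_{ij})}$$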
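As a sketch of how an edge weight could be evaluated under the normal model, the function below computes the log-likelihood ratio in log space. It assumes σT and σF are standard deviations; the function names and the parameter values in the example are made up for illustration.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Natural log of the N(mu, sigma) density at x (sigma = standard deviation)."""
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def edge_weight(s_ij, p, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln( p * f_M(S_ij) / ((1 - p) * f_N(S_ij)) ).

    Expanding the two densities gives the equivalent closed form
    ln(p * sigma_F / ((1 - p) * sigma_T))
      + (S_ij - mu_F)^2 / (2 sigma_F^2) - (S_ij - mu_T)^2 / (2 sigma_T^2).
    """
    return (math.log(p) + log_normal_pdf(s_ij, mu_t, sigma_t)
            - math.log(1.0 - p) - log_normal_pdf(s_ij, mu_f, sigma_f))

# Illustrative parameters (not from the slides): equal priors and equal spreads.
w_mid = edge_weight(0.35, p=0.5, mu_t=0.7, sigma_t=0.2, mu_f=0.0, sigma_f=0.2)
w_high = edge_weight(0.70, p=0.5, mu_t=0.7, sigma_t=0.2, mu_f=0.0, sigma_f=0.2)
```

With equal priors and equal spreads, a similarity halfway between the two means gives a weight of about 0, and similarities closer to µT give positive weights. Working with log densities avoids underflow when a similarity lies far in the tail of one distribution.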

Kernel Identification

Cut: partition of the vertices into two groups.
Weight of cut: sum of the weights of the edges crossing the cut.

• For each cut C in G we test two hypotheses:
  H0: C contains only edges between non-mates.
  H1: C contains only edges between mates.
• G is declared a kernel if H1 is more probable for all cuts.

Thm: G is a kernel iff weight of min. cut > 0.

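Main Theorem

Thm: G is a kernel iff weight of min. cut > 0.

Pf: pick any cut C in G:

$$
\begin{aligned}
\ln \frac{\Pr(H_1 \mid C)}{\Pr(H_0 \mid C)}
  &= \ln \frac{\Pr(H_1)\, f(C \mid H_1)}{\Pr(H_0)\, f(C \mid H_0)}
  && \text{(Bayes Thm.)} \\
  &= \ln \frac{p^{|C|} \prod_{ij \in C} f_M(S_{ij})}{(1-p)^{|C|} \prod_{ij \in C} f_N(S_{ij})}
  && \text{($S_{ij}$-s and mate relations indep.)} \\
  &= \sum_{ij \in C} \ln \frac{p \cdot f_M(S_{ij})}{(1-p) \cdot f_N(S_{ij})}
   = \sum_{ij \in C} w_{ij} = W(C)
\end{aligned}
$$

So H1 is more probable than H0 for a cut C iff W(C) > 0.

Take C*, a min cut of G. H1 is accepted for C* iff it is accepted for all cuts ⇒ accept H1 iff W(C*) > 0.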

Kernel Identification Algorithm

Basic-CLICK(G):
• If G={v} then v is a singleton.
• Otherwise:
  – Compute a min. weight cut C.
  – If Weight(C)>0 then G is a kernel.
  – Otherwise, cut G into the two resulting pieces and continue recursively with each one.
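The recursion above can be sketched as follows. To keep the example exact in the presence of negative weights, the minimum cut is found by exhaustive enumeration, which is exponential and only suitable for toy graphs; it is a stand-in for illustration, not the efficient procedure the slides refer to. The edge-weight representation (a dict keyed by `frozenset` pairs) and all names are assumptions.

```python
from itertools import combinations

def min_weight_cut(vertices, w):
    """Exhaustive minimum-weight cut (toy graphs only).

    w maps frozenset({u, v}) -> weight; absent pairs contribute 0.
    Returns (weight, side_a, side_b).
    """
    vertices = sorted(vertices)
    anchor, rest = vertices[0], vertices[1:]
    best = None
    # Keep `anchor` on side A so that every cut is enumerated exactly once;
    # k < len(rest) guarantees side B is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side_a = {anchor, *extra}
            side_b = set(vertices) - side_a
            weight = sum(w.get(frozenset((u, v)), 0.0)
                         for u in side_a for v in side_b)
            if best is None or weight < best[0]:
                best = (weight, side_a, side_b)
    return best

def basic_click(vertices, w):
    """Return (kernels, singletons) following the Basic-CLICK recursion."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return [], [next(iter(vertices))]
    weight, side_a, side_b = min_weight_cut(vertices, w)
    if weight > 0:
        return [vertices], []          # min cut weight > 0: G is a kernel
    kernels, singletons = [], []
    for side in (side_a, side_b):      # otherwise split and recurse
        k, s = basic_click(side, w)
        kernels += k
        singletons += s
    return kernels, singletons

# Toy graph: two tight groups {a,b,c} and {d,e}, plus a loosely attached f.
edges = {("a", "b"): 3.0, ("a", "c"): 2.0, ("b", "c"): 2.5, ("d", "e"): 1.5,
         ("a", "d"): -1.0, ("b", "e"): -2.0, ("c", "d"): -0.5,
         ("f", "a"): -1.0, ("f", "d"): -0.5}
w = {frozenset(pair): weight for pair, weight in edges.items()}
kernels, singletons = basic_click("abcdef", w)
```

On this toy graph the first minimum cut (weight -4.5) separates {a, b, c} from {d, e, f}; the recursion then declares {a, b, c} and {d, e} kernels and leaves f as a singleton.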

Refinements

• Adoption Step: Adopt a singleton into a kernel if it is sufficiently similar to the kernel's average fingerprint.
• Iterative application of Basic-CLICK and the adoption step.
• Merging Step: Merge clusters with sufficiently similar average fingerprints.

These steps use both fingerprints and similarity values.

Mincut: NP-hard when there are negative weights ⇒ heuristic: remove all negative-weight edges, compute the mincut, and correct the cut weight for the kernel test criterion.
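A rough sketch of the adoption step, under several assumptions not stated in the slides: similarity is Pearson correlation, "sufficiently similar" means exceeding a fixed `threshold`, each singleton is offered only to its best-matching kernel, and kernel averages are recomputed after each adoption. All names here are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation (assumed similarity measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def mean_fingerprint(rows):
    """Coordinate-wise average of a list of fingerprints."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def adopt_singletons(kernels, singletons, fingerprints, threshold):
    """Move each singleton into its most similar kernel if similarity >= threshold.

    kernels      -- list of sets of element ids
    fingerprints -- dict: element id -> list of floats
    threshold    -- hypothetical cutoff for "sufficiently similar"
    """
    kernels = [set(k) for k in kernels]
    remaining = []
    for s in singletons:
        scores = [pearson(fingerprints[s],
                          mean_fingerprint([fingerprints[e] for e in k]))
                  for k in kernels]
        best = max(range(len(kernels)), key=scores.__getitem__) if kernels else None
        if best is not None and scores[best] >= threshold:
            kernels[best].add(s)
        else:
            remaining.append(s)
    return kernels, remaining

fingerprints = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.5],
    "g3": [3.0, 6.0, 9.1],   # rises like the kernel members
    "g4": [3.0, 1.0, 2.0],   # different pattern
}
kernels, remaining = adopt_singletons([{"g1", "g2"}], ["g3", "g4"],
                                      fingerprints, threshold=0.9)
```

Here g3 joins the kernel and g4 stays a singleton. In the iterative scheme described above, the remaining singletons would be passed back to Basic-CLICK.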

Figures of Merit

Evaluating a clustering solution when no correct clustering is known, using fingerprints:

Homogeneity: Average similarity between the fingerprint of an element and the average fingerprint of its cluster. (Also reported: minimum.)

Separation: Weighted average similarity between the average fingerprints of clusters. (Also reported: maximum.)
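The two figures of merit could be computed along these lines. The sketch assumes Pearson correlation as the similarity measure and weights each pair of clusters by the product of the cluster sizes; the slides only say "weighted", so that weighting is an assumption, as are the function names.

```python
import math

def pearson(x, y):
    """Pearson correlation (assumed similarity measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def figures_of_merit(clusters):
    """clusters: list of clusters, each a list of fingerprints (at least 2 clusters).

    Returns homogeneity (average, minimum) and separation (average, maximum).
    """
    means = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]

    # Homogeneity: similarity of each element to its own cluster average.
    elem_sims = [pearson(fp, means[i])
                 for i, c in enumerate(clusters) for fp in c]
    h_ave, h_min = sum(elem_sims) / len(elem_sims), min(elem_sims)

    # Separation: similarity between cluster averages, weighted by |Xi| * |Xj|.
    pair_sims, pair_weights = [], []
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            pair_sims.append(pearson(means[i], means[j]))
            pair_weights.append(len(clusters[i]) * len(clusters[j]))
    s_ave = sum(s * wt for s, wt in zip(pair_sims, pair_weights)) / sum(pair_weights)
    s_max = max(pair_sims)
    return {"H_ave": h_ave, "H_min": h_min, "S_ave": s_ave, "S_max": s_max}

# Two toy clusters with opposite trends.
merit = figures_of_merit([
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    [[3.0, 2.0, 1.0], [6.0, 4.0, 2.0]],
])
```

A good solution has high homogeneity and low separation. In the toy example every element matches its cluster average exactly (homogeneity 1) and the two cluster averages are perfectly anti-correlated (separation -1).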

Gene Expression: Yeast Cell Cycle

Expression levels of 6,218 S. cerevisiae genes, measured at 17 time points over two cell cycles. (Data from Cho et al., Mol. Cell 1998)

Program        Homogeneity (Min)   Separation (Max)
CLICK          -0.19               0.65
GeneCluster*   -0.88               0.97

Ave: Homogeneity 0.8, Separation -0.07

*Self-Organizing Maps; Tamayo et al., PNAS 1999.

CLICK clusters: Yeast Cell Cycle

Yeast Cell Cycle: late G1 Cluster

• Contains 91% of late G1-peaking genes.
• In contrast, in GeneCluster 87% are contained in 3 clusters.
• M-peaking genes: CLICK: 95% in a single cluster; GeneCluster: 92.5% in 3 clusters.
• Similar specificities.

Gene Expression: Serum Response

Human fibroblast cells starved for 48 hours, then stimulated by serum. Expression levels of 8,613 genes measured at 13 time points. (Data from Iyer et al., Science 1999)

Program    Homogeneity (Min)   Separation (Max)
CLICK      0.13                0.65
CLUSTER*   -0.75               0.9

*Average linkage agglomerative hierarchical clustering; Eisen et al., PNAS 1998.

Performance of CLICK vs other clustering algorithms

Elements   Problem                             Original Algorithm           CLICK Improvement   Time (min)*
517        Gene Expression, Fibroblasts        Cluster, Eisen et al.        Yes                 0.5
826        Gene Expression, Yeast cell cycle   GeneCluster, Tamayo et al.   Yes                 0.2
2,329      cDNA OFP, Blood Monocytes           HCS, Hartuv et al.           Yes                 0.8
20,275     cDNA OFP, Sea urchin eggs           K-Means, Herwig et al.       Yes                 32.5
72,623     Protein similarity                  ProtoMap, Yona et al.        Minor               53
117,835    Protein similarity                  SYSTERS, Krause et al.       Yes                 126.3

* Executed on an SGI ORIGIN200 machine utilizing one IP27 processor. Does not include preprocessing time.

Performance on Yeast Cell Cycle data

698 genes, 71 conditions (data from Spellman et al. ‘98)

Each algorithm was run by its authors in a “blind” test.

[Figure: Homogeneity of the “True” solution, CAST*, GeneCluster (SOM), and K-means]

*Ben-Dor, Shamir, Yakhini ‘99

EXPression ANalyzer and DisplayER

http://acgt.cs.tau.ac.il/expander

• Clustering: Identify clusters of co-expressed genes (CLICK, KMeans, SOM, hierarchical)
• Biclustering: Identify homogeneous submatrices
• Functional enrichment
• Promoter analysis: Analyze TF binding sites of co-regulated genes
• microRNA enrichment
• Visualization

A. Maron, R. Sharan, Bioinformatics 03
A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon, BMC Bioinformatics 05
