
The CLICK clustering algorithm

Roded Sharan, Naama Arbili, Adi Maron, Rani Elkon

ABDBM © Ron Shamir 1

CLICK: CLuster Identification via Connectivity Kernels

• Identify highly homogeneous sets of elements: connectivity kernels.
• Add elements to kernels via similarity to average kernel fingerprints.
• Uses tools from graph theory and probabilistic considerations for similarity evaluation and kernel identification.
• Efficient implementation.

Data

[Figure: raw data matrix; rows are genes, columns are experiments]

• Input: real-valued matrix.
• Row vectors: gene fingerprints.
• Compute the gene similarity matrix.
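As a minimal sketch of the "compute gene similarity matrix" step: the slides do not say which similarity measure is used, so the Pearson correlation between fingerprints below is an assumption, and the function name `similarity_matrix` and the toy data are illustrative only.

```python
import numpy as np

def similarity_matrix(fingerprints):
    """Pairwise Pearson correlation between row vectors (gene fingerprints).

    Assumption: Pearson correlation stands in for the similarity measure,
    which the slides leave unspecified. Constant rows are not handled.
    """
    x = np.asarray(fingerprints, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)   # remove each gene's mean
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    return unit @ unit.T                           # S[i, j] = corr(gene i, gene j)

# Toy input: 3 genes (rows) measured in 4 experiments (columns).
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
])
S = similarity_matrix(data)
```

Here genes 0 and 1 rise together and gene 2 falls, so `S[0, 1]` is close to 1 and `S[0, 2]` is -1.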

[Figure: similarity matrix S; rows and columns are genes]

Probabilistic Model

• Mates: genes that belong to the same true cluster.
• Probabilistic assumptions:
– Similarity between mates ~ N(µT,σT)
– Similarity between non-mates ~ N(µF,σF)
– Independence of pairwise similarities
– Mates probability: p
• Observed for real data.
• Justified in some cases by the Central Limit Theorem and by simulations.
• Parameter values needed:
– Computed from a partially known solution, or
– Estimated using the EM algorithm
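The EM option can be illustrated with a textbook EM fit of a two-component Gaussian mixture to the observed similarity values. This is a sketch, not the authors' procedure: the function name `fit_mates_mixture`, the midpoint initialisation, the fixed iteration count, and the synthetic data are all assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def fit_mates_mixture(s, n_iter=100):
    """Estimate p, (mu_T, sigma_T), (mu_F, sigma_F) from similarity values s."""
    s = np.asarray(s, dtype=float)
    # Crude initialisation (assumption): split at the midpoint of the observed range.
    mid = 0.5 * (s.min() + s.max())
    hi, lo = s[s > mid], s[s <= mid]
    p = len(hi) / len(s)
    mu_t, sigma_t = hi.mean(), hi.std() + 1e-3
    mu_f, sigma_f = lo.mean(), lo.std() + 1e-3
    for _ in range(n_iter):
        # E-step: posterior probability that each similarity comes from a mate pair.
        a = p * normal_pdf(s, mu_t, sigma_t)
        b = (1 - p) * normal_pdf(s, mu_f, sigma_f)
        r = a / (a + b)
        # M-step: re-estimate the parameters from the weighted data.
        p = r.mean()
        mu_t = (r * s).sum() / r.sum()
        mu_f = ((1 - r) * s).sum() / (1 - r).sum()
        sigma_t = np.sqrt((r * (s - mu_t) ** 2).sum() / r.sum())
        sigma_f = np.sqrt(((1 - r) * (s - mu_f) ** 2).sum() / (1 - r).sum())
    return {"p": p, "mu_t": mu_t, "sigma_t": sigma_t, "mu_f": mu_f, "sigma_f": sigma_f}

# Synthetic similarities: 30% mates ~ N(0.8, 0.1), 70% non-mates ~ N(0.0, 0.15).
rng = np.random.default_rng(0)
sims = np.concatenate([rng.normal(0.8, 0.1, 300), rng.normal(0.0, 0.15, 700)])
params = fit_mates_mixture(sims)
```

On well-separated synthetic data like this, the estimates should land near the generating values; real similarity distributions may overlap far more and need a more careful initialisation.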

Similarity Graph

• Input ⇒ weighted graph G with a vertex per element and an edge between similar elements.
• The weight of edge (i,j):

$$w_{ij} = \ln \frac{p \cdot f^M(S_{ij})}{(1-p) \cdot f^N(S_{ij})}$$

where f^M / f^N = p.d.f. at S_ij for i,j mates / non-mates.

[Figure: example graph with edge weights 3, 1, 4]

Edge weights (contd)

$$f(S_{ij} \mid i,j \text{ are mates}) = \frac{1}{\sqrt{2\pi}\,\sigma_T} \exp\left(-\frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}\right)$$

$$w_{ij} = \ln \frac{p \cdot f^M(S_{ij})}{(1-p) \cdot f^N(S_{ij})}$$
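Substituting the two normal densities into the weight formula gives a closed form that can be evaluated in log space. The sketch below does that; the function name `edge_weight` and the parameter values in the example are illustrative assumptions, not values from the slides.

```python
import math

def edge_weight(s_ij, p, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln( p * f^M(S_ij) / ((1 - p) * f^N(S_ij)) ) for normal f^M, f^N.

    Expanded in log space, the sqrt(2*pi) factors cancel:
      ln(p / (1 - p)) + ln(sigma_F / sigma_T)
      - (S_ij - mu_T)^2 / (2 sigma_T^2) + (S_ij - mu_F)^2 / (2 sigma_F^2)
    """
    return (
        math.log(p / (1 - p))
        + math.log(sigma_f / sigma_t)
        - (s_ij - mu_t) ** 2 / (2 * sigma_t ** 2)
        + (s_ij - mu_f) ** 2 / (2 * sigma_f ** 2)
    )

# Hypothetical parameters: mates ~ N(0.7, 0.1), non-mates ~ N(0.0, 0.2), p = 0.2.
w_similar = edge_weight(0.7, p=0.2, mu_t=0.7, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
w_dissimilar = edge_weight(0.0, p=0.2, mu_t=0.7, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
```

A pair whose similarity looks like a typical mate similarity gets a positive weight, and a pair that looks like a typical non-mate gets a negative one, which is what makes the sign of a cut weight meaningful in the kernel test.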

Kernel Identification

• Cut: partition of the vertices into two groups.
• Weight of a cut: sum of the weights of the edges crossing the cut.
• For each cut C in G we test two hypotheses:
– H0: C contains only edges between non-mates.
– H1: C contains only edges between mates.
• G is declared a kernel if H1 is more probable for all cuts.

Thm: G is a kernel iff weight of min. cut > 0.

[Figure: example graph with weighted edges and a cut]

Main Theorem

Thm: G is a kernel iff weight of min. cut > 0.

Pf: pick any cut C in G:

$$
\ln\frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)}
= \ln\frac{\Pr(H_1^C)\, f(C \mid H_1^C)}{\Pr(H_0^C)\, f(C \mid H_0^C)}
= \ln\frac{p^{|C|} \prod_{ij \in C} f^M(S_{ij})}{(1-p)^{|C|} \prod_{ij \in C} f^N(S_{ij})}
= \sum_{ij \in C} \ln\frac{p \cdot f^M(S_{ij})}{(1-p) \cdot f^N(S_{ij})}
= \sum_{ij \in C} w_{ij} = W(C)
$$

The first equality is Bayes' theorem; the second uses the independence of the S_ij's and of the mate relations.

Take C*, a min cut of G. H1 is accepted for C* iff it is accepted for all cuts ⇒ accept H1 iff W(C*) > 0.

Kernel Identification Algorithm

Basic-CLICK(G):
• If G={v} then v is a singleton.
• Otherwise:
– Compute a min. weight cut C.
– If Weight(C) > 0 then G is a kernel.
– Otherwise, cut G into the two resulting pieces and continue recursively with each one.
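The recursion above can be sketched on a toy graph. Two assumptions apply: the minimum-weight cut is found by exhaustive enumeration, which is exponential and only viable for a handful of vertices (it is not the efficient implementation the slides refer to), and a missing edge is treated as weight 0. The names `cut_weight`, `min_weight_cut`, and `basic_click`, and the example weights, are illustrative.

```python
from itertools import combinations

def cut_weight(weights, side, rest):
    """Sum of the weights of edges crossing the cut (missing edges count as 0)."""
    return sum(weights.get(frozenset((u, v)), 0.0) for u in side for v in rest)

def min_weight_cut(vertices, weights):
    """Exhaustive minimum-weight cut; for tiny illustrative graphs only."""
    vs = sorted(vertices)
    anchor, others = vs[0], vs[1:]
    best = None
    # Keep the anchor on one side and try every proper subset of the others with it.
    for k in range(len(others)):
        for extra in combinations(others, k):
            side = {anchor, *extra}
            rest = set(vs) - side
            w = cut_weight(weights, side, rest)
            if best is None or w < best[0]:
                best = (w, side, rest)
    return best

def basic_click(vertices, weights):
    """Return (kernels, singletons) following the Basic-CLICK recursion."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return [], [next(iter(vertices))]
    w, side, rest = min_weight_cut(vertices, weights)
    if w > 0:                       # kernel test: weight of the min cut > 0
        return [vertices], []
    k1, s1 = basic_click(side, weights)
    k2, s2 = basic_click(rest, weights)
    return k1 + k2, s1 + s2

# Toy graph: a tight triangle {a, b, c}, a tight pair {d, e}, and a loner f.
edges = {("a", "b"): 3.0, ("a", "c"): 2.0, ("b", "c"): 4.0, ("d", "e"): 5.0,
         ("c", "d"): -1.0, ("a", "e"): -2.0, ("f", "a"): -3.0, ("f", "d"): -1.0}
weights = {frozenset(pair): w for pair, w in edges.items()}
kernels, singletons = basic_click("abcdef", weights)
```

On this graph the first minimum cut separates {a, b, c} from {d, e, f} with weight -6, the triangle passes the kernel test, and the second piece is split again into the kernel {d, e} and the singleton f.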

Refinements

• Adoption Step: adopt a singleton into a kernel if it is sufficiently similar to the kernel's fingerprint.
• Iterative application of Basic-CLICK and the adoption step.
• Merging Step: merge clusters with sufficiently similar average fingerprints.

These steps use both fingerprints and similarity values.

Mincut: NP-hard when there are negative weights ⇒ heuristic: remove all negative-weight edges, compute a mincut, and correct the cut weight for the kernel test criterion.
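A rough sketch of the adoption step, under stated assumptions: similarity is taken to be the Pearson correlation between a singleton's fingerprint and each kernel's average fingerprint, and "sufficiently similar" is modelled by a hypothetical `threshold` parameter. The slides note that the actual step also uses the pairwise similarity values, which this sketch ignores.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation between two fingerprint vectors."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def adopt_singletons(fingerprints, kernels, singletons, threshold=0.8):
    """Move each singleton into its most similar kernel if similarity >= threshold."""
    fingerprints = np.asarray(fingerprints, dtype=float)
    kernels = [list(k) for k in kernels]
    # Average kernel fingerprints, computed once before any adoption.
    centroids = [fingerprints[k].mean(axis=0) for k in kernels]
    remaining = []
    for s in singletons:
        sims = [pearson(fingerprints[s], c) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            kernels[best].append(s)
        else:
            remaining.append(s)
    return kernels, remaining

fp = np.array([
    [1.0, 2.0, 3.0, 4.0],   # 0: rising
    [1.0, 2.0, 3.0, 5.0],   # 1: rising
    [4.0, 3.0, 2.0, 1.0],   # 2: falling
    [5.0, 3.0, 2.0, 1.0],   # 3: falling
    [0.0, 1.0, 2.0, 3.5],   # 4: singleton with a rising profile
    [1.0, 3.0, 1.0, 3.0],   # 5: singleton that matches neither kernel well
])
new_kernels, still_single = adopt_singletons(fp, [[0, 1], [2, 3]], [4, 5])
```

Element 4 has the same shape as the first kernel's average fingerprint and is adopted; element 5 correlates only weakly with either kernel and stays a singleton.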

Figures of Merit

Evaluating a clustering solution when no correct clustering is known, using fingerprints:

• Homogeneity: average similarity between the fingerprint of an element and the average fingerprint of its cluster. The minimum is also reported.
• Separation: weighted average similarity between the average fingerprints of clusters. The maximum is also reported.
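The two figures of merit can be sketched as follows. Several details are assumptions, since the slides do not spell them out: Pearson correlation as the similarity, the product of the two cluster sizes as the weight in the separation average, the minimum taken over elements, and the maximum taken over cluster pairs. The function name `figures_of_merit` is illustrative.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation between two fingerprint vectors."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def figures_of_merit(fingerprints, clusters):
    """Average/minimum homogeneity and average/maximum separation of a clustering."""
    fingerprints = np.asarray(fingerprints, dtype=float)
    centroids = [fingerprints[list(c)].mean(axis=0) for c in clusters]
    # Homogeneity: similarity of each element to its own cluster's average fingerprint.
    per_element = [pearson(fingerprints[i], centroids[k])
                   for k, c in enumerate(clusters) for i in c]
    # Separation: similarity between average fingerprints of distinct clusters,
    # weighted by the product of the cluster sizes (assumption).
    pair_sims, pair_weights = [], []
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            pair_sims.append(pearson(centroids[a], centroids[b]))
            pair_weights.append(len(clusters[a]) * len(clusters[b]))
    return {
        "h_ave": float(np.mean(per_element)),
        "h_min": float(np.min(per_element)),
        "s_ave": float(np.average(pair_sims, weights=pair_weights)),
        "s_max": float(np.max(pair_sims)),
    }

fp = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 5.0],
    [4.0, 3.0, 2.0, 1.0],
    [5.0, 3.0, 2.0, 1.0],
])
merit = figures_of_merit(fp, [[0, 1], [2, 3]])
```

A good solution has high homogeneity and low separation; in this toy case the two clusters have nearly opposite profiles, so homogeneity is close to 1 and separation is close to -1.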

Gene Expression: Yeast Cell Cycle

Expression levels of 6,218 S. cerevisiae genes, measured at 17 time points over two cell cycles. (Data from Cho et al., Mol. Cell 1998)

| Program | Clusters | Homogeneity Ave | Homogeneity Min | Separation Ave | Separation Max |
|---|---|---|---|---|---|
| CLICK | 30 | 0.8 | -0.19 | -0.07 | 0.65 |
| GeneCluster* | 30 | 0.74 | -0.88 | -0.02 | 0.97 |

*Self-Organizing Maps; Tamayo et al., PNAS 1999.

CLICK clusters: Yeast Cell Cycle


Yeast Cell Cycle: late G1 Cluster

• Contains 91% of late G1-peaking genes.
• In contrast, in GeneCluster 87% are contained in 3 clusters.
• M-peaking genes: CLICK: 95% in a single cluster; GeneCluster: 92.5% in 3 clusters.
• Similar specificities.

[Figure: late G1 cluster, N=164]

Gene Expression: Serum Response

Human fibroblast cells starved for 48 hours, then stimulated by serum. Expression levels of 8,613 genes measured at 13 time points. (Data from Iyer et al., Science 1999)

| Program | Clusters | Homogeneity Ave | Homogeneity Min | Separation Ave | Separation Max |
|---|---|---|---|---|---|
| CLICK | 10 | 0.88 | 0.13 | -0.34 | 0.65 |
| CLUSTER* | 10 | 0.87 | -0.75 | -0.13 | 0.9 |

*Average linkage agglomerative hierarchical clustering; Eisen et al., PNAS 1998.

Performance of CLICK vs other clustering algorithms

| Elements | Problem | Original Algorithm | CLICK Improvement | Time (min)* |
|---|---|---|---|---|
| 517 | Gene expression, fibroblasts | Cluster (Eisen et al.) | Yes | 0.5 |
| 826 | Gene expression, yeast cell cycle | GeneCluster (Tamayo et al.) | Yes | 0.2 |
| 2,329 | cDNA OFP, blood monocytes | HCS (Hartuv et al.) | Yes | 0.8 |
| 20,275 | cDNA OFP, sea urchin eggs | K-Means (Herwig et al.) | Yes | 32.5 |
| 72,623 | Protein similarity | ProtoMap (Yona et al.) | Minor | 53 |
| 117,835 | Protein similarity | SYSTERS (Krause et al.) | Yes | 126.3 |

* Executed on an SGI ORIGIN200 machine utilizing one IP27 processor. Does not include preprocessing time.

Performance on Yeast Cell Cycle data

[Figure: plot with axes Homogeneity and Separation, comparing “True”, CAST*, GeneCluster (SOM), K-means, and CLICK]

• 698 genes, 71 conditions (data from Spellman et al. ‘98)
• Each algorithm was run by its authors in a “blind” test.

*Ben-Dor, Shamir, Yakhini ‘99

EXPression ANalyzer and DisplayER (EXPANDER)

http://acgt.cs.tau.ac.il/expander

• Clustering: identify clusters of co-expressed genes (CLICK, KMeans, SOM, hierarchical)
• Biclustering: identify homogeneous submatrices (SAMBA)
• Functional enrichment
• Promoter analysis: analyze TF binding sites of co-regulated genes (PRIMA)
• microRNA enrichment (FAME)
• Visualization

A. Maron, R. Sharan, Bioinformatics 03
A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon, BMC Bioinformatics 05

FIN
