The CLICK clustering algorithm
Roded Sharan, Naama Arbili, Adi Maron, Rani Elkon
ABDBM © Ron Shamir 1
CLICK: CLuster Identification via Connectivity Kernels
• Identify highly homogeneous sets of elements - connectivity kernels.
• Add elements to kernels via similarity to average kernel fingerprints.
• Uses tools from graph theory and probabilistic considerations for similarity evaluation and kernel identification.
• Efficient implementation.
Data
[Figure: raw data matrix, genes (rows) × experiments (columns)]
• Input: Real-valued matrix.
• Row vectors: gene fingerprints.
• Compute gene similarity matrix.
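The step from fingerprints to a similarity matrix can be sketched as follows. The slides do not say which similarity measure is used, so Pearson correlation is an assumption made here purely for illustration; the function names `pearson` and `similarity_matrix` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length fingerprints."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant fingerprint: correlation undefined, 0 by convention
    return cov / (sx * sy)

def similarity_matrix(fingerprints):
    """S[i][j] = similarity between the fingerprints of genes i and j."""
    n = len(fingerprints)
    return [[pearson(fingerprints[i], fingerprints[j]) for j in range(n)]
            for i in range(n)]

# Three toy genes measured in three experiments (made-up numbers).
S = similarity_matrix([[1, 2, 3], [2, 4, 6], [3, 2, 1]])
```

In this toy input the first two genes rise together and the third falls, so S[0][1] is close to 1 and S[0][2] is close to -1.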
[Figure: similarity matrix S, genes × genes]
Probabilistic Model
• Mates: genes that belong to the same true cluster.
• Probabilistic assumptions:
  – Similarity between mates ~ N(µT,σT)
  – Similarity between non-mates ~ N(µF,σF)
  – Independence of pairwise similarities
  – Mates probability: p
• Observed for real data.
• Justified in some cases by the Central Limit Theorem and by simulations.
• Parameter values needed:
  – Computed from a partially known solution, or
  – Estimated using the EM algorithm
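One plain way to estimate p, µT, σT, µF, σF with EM is to treat the observed pairwise similarities as a two-component Gaussian mixture. The sketch below is a textbook mixture fit under that assumption, not necessarily the procedure used in CLICK; the function name `em_two_gaussians`, the initialisation and the iteration count are all hypothetical choices.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(values, iterations=100):
    """Fit p*N(mu_t, sigma_t) + (1-p)*N(mu_f, sigma_f); needs at least 2 values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Crude initialisation (an assumption): lower half ~ non-mates, upper half ~ mates.
    mu_f = sum(ordered[:half]) / half
    mu_t = sum(ordered[half:]) / (len(ordered) - half)
    sigma_t = sigma_f = (ordered[-1] - ordered[0]) / 4 or 1.0
    p = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each similarity comes from a mate pair.
        resp = []
        for s in values:
            a = p * normal_pdf(s, mu_t, sigma_t)
            b = (1 - p) * normal_pdf(s, mu_f, sigma_f)
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        total_t = sum(resp)
        total_f = len(values) - total_t
        if total_t < 1e-9 or total_f < 1e-9:
            break  # one component vanished; keep the last estimates
        # M-step: responsibility-weighted means, standard deviations and mixing weight.
        mu_t = sum(r * s for r, s in zip(resp, values)) / total_t
        mu_f = sum((1 - r) * s for r, s in zip(resp, values)) / total_f
        var_t = sum(r * (s - mu_t) ** 2 for r, s in zip(resp, values)) / total_t
        var_f = sum((1 - r) * (s - mu_f) ** 2 for r, s in zip(resp, values)) / total_f
        sigma_t = max(math.sqrt(var_t), 1e-6)
        sigma_f = max(math.sqrt(var_f), 1e-6)
        p = total_t / len(values)
    return p, mu_t, sigma_t, mu_f, sigma_f

# Synthetic similarities: 30% mates around 0.8, 70% non-mates around 0.0.
rng = random.Random(0)
sims = [rng.gauss(0.8, 0.1) for _ in range(300)] + [rng.gauss(0.0, 0.1) for _ in range(700)]
p, mu_t, sigma_t, mu_f, sigma_f = em_two_gaussians(sims)
```

On well-separated synthetic data like this the estimates should land near the generating values; real similarity distributions overlap more, and the result then depends on the initialisation.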
Similarity Graph
• Input ⇒ weighted graph G with a vertex per element and an edge between similar elements.
• The weight of edge (i,j):

$$ w_{ij} = \ln \frac{p \cdot f^{M}(S_{ij})}{(1-p) \cdot f^{N}(S_{ij})} $$

where f^M / f^N = p.d.f. at S_ij for i,j mates / non-mates.
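Edge weights (contd)

$$ f(S_{ij} \mid i,j \text{ are mates}) = \frac{1}{\sqrt{2\pi}\,\sigma_T} \exp\!\left( -\frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2} \right) $$

$$ w_{ij} = \ln \frac{p \cdot f^{M}(S_{ij})}{(1-p) \cdot f^{N}(S_{ij})} $$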
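The edge weight can be computed directly from the two normal densities. The sketch below works in the log domain to avoid taking the logarithm of an underflowed density; the function names and the parameter values in the example are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Natural log of the N(mu, sigma) density at x."""
    return -((x - mu) ** 2) / (2 * sigma ** 2) - math.log(math.sqrt(2 * math.pi) * sigma)

def edge_weight(s_ij, p, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln( p * f_M(S_ij) / ((1 - p) * f_N(S_ij)) )."""
    log_f_m = log_normal_pdf(s_ij, mu_t, sigma_t)  # density under "mates"
    log_f_n = log_normal_pdf(s_ij, mu_f, sigma_f)  # density under "non-mates"
    return math.log(p) - math.log(1 - p) + log_f_m - log_f_n

# Made-up parameters: mates ~ N(1, 0.5), non-mates ~ N(0, 0.5), p = 0.5.
print(edge_weight(1.0, 0.5, 1.0, 0.5, 0.0, 0.5))  # 2.0
print(edge_weight(0.5, 0.5, 1.0, 0.5, 0.0, 0.5))  # 0.0
```

A positive weight means the observed similarity is more likely under the mates hypothesis, a negative weight that it is more likely under non-mates.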
Kernel Identification
Cut - Partition of vertices into two groups.
Weight of cut - Sum of weights of edges crossing the cut.
• For each cut C in G we test two hypotheses:
  H0: C contains only edges between non-mates.
  H1: C contains only edges between mates.
G is declared a kernel if H1 is more probable for all cuts.
Thm: G is a kernel iff weight of min. cut > 0.
[Figure: example weighted graph with a cut]
Main Theorem
Thm: G is a kernel iff weight of min. cut > 0.
Pf: pick any cut C in G:

$$
\ln\frac{\Pr(H_1 \mid C)}{\Pr(H_0 \mid C)}
= \ln\frac{\Pr(H_1)\, f(C \mid H_1)}{\Pr(H_0)\, f(C \mid H_0)}
= \ln\frac{p^{|C|} \prod_{ij \in C} f^{M}(S_{ij})}{(1-p)^{|C|} \prod_{ij \in C} f^{N}(S_{ij})}
= \sum_{ij \in C} \ln\frac{p \cdot f^{M}(S_{ij})}{(1-p) \cdot f^{N}(S_{ij})}
= \sum_{ij \in C} w_{ij} = W(C)
$$

The first equality is Bayes' theorem; the second uses the independence of the S_ij-s and of the mate relations.
Take C*, a min cut of G. H1 is accepted for C* iff it is accepted for all cuts => accept H1 iff W(C*) > 0.
Kernel Identification Algorithm
Basic-CLICK(G):
• If G={v} then v is a singleton.
• o/w:
  – Compute a min. weight cut C.
  – If Weight(C)>0 then G is a kernel.
  – o/w cut G into the two resulting pieces and continue recursively with each one.
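The Basic-CLICK recursion can be sketched in a few lines. To keep the example exact in the presence of negative weights, the minimum cut here is found by exhaustive enumeration, which is exponential and only suitable for tiny graphs; it stands in for whatever min-cut routine a real implementation uses. Treating a missing edge as weight 0, the dictionary-of-frozensets graph encoding and all function names are assumptions of this sketch.

```python
from itertools import combinations

def cut_weight(weights, side_a, side_b):
    """Sum of weights of edges crossing the cut; a missing edge counts as 0."""
    return sum(weights.get(frozenset((u, v)), 0.0) for u in side_a for v in side_b)

def min_cut(vertices, weights):
    """Exhaustive minimum-weight cut. Exponential: for tiny illustrative graphs only."""
    vertices = sorted(vertices)
    first, rest = vertices[0], vertices[1:]
    best = None
    for k in range(len(rest)):  # side A always holds `first`; side B is never empty
        for extra in combinations(rest, k):
            side_a = {first, *extra}
            side_b = set(vertices) - side_a
            w = cut_weight(weights, side_a, side_b)
            if best is None or w < best[0]:
                best = (w, side_a, side_b)
    return best

def basic_click(vertices, weights):
    """Return (kernels, singletons) following the Basic-CLICK recursion."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return [], sorted(vertices)
    w, side_a, side_b = min_cut(vertices, weights)
    if w > 0:
        return [sorted(vertices)], []  # min cut positive: G is a kernel
    kernels_a, singles_a = basic_click(side_a, weights)
    kernels_b, singles_b = basic_click(side_b, weights)
    return kernels_a + kernels_b, singles_a + singles_b

# Toy graph with made-up weights.
edges = {("a", "b"): 3.0, ("a", "c"): 2.0, ("b", "c"): 4.0,
         ("c", "d"): -1.0, ("a", "d"): -1.5, ("d", "e"): 2.0, ("e", "f"): -1.0}
weights = {frozenset(e): w for e, w in edges.items()}
print(basic_click("abcdef", weights))
# ([['a', 'b', 'c'], ['d', 'e']], ['f'])
```

In the toy graph the first cut separates {d, e} from the rest with weight -3.5, the isolated vertex f then splits off with cut weight 0, and the two remaining pieces have positive minimum cuts and are reported as kernels.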
Refinements
• Adoption Step: Adopt a singleton into a kernel if it is sufficiently similar to the kernel's average fingerprint.
• Iterative application of Basic-CLICK and the adoption step.
• Merging Step: Merge clusters with sufficiently similar average fingerprints.
These steps use both fingerprints and similarity values.
Mincut: NP-hard when there are negative weights => heuristic: remove all negative-weight edges, compute a mincut, then correct the cut weight for the kernel test criterion.
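A minimal sketch of the adoption step follows. The slides do not give the similarity measure or the threshold, so Pearson correlation against the kernel's average fingerprint and an explicit `threshold` argument are assumptions; `adopt_singletons` and its helpers are hypothetical names.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def mean_vector(vectors):
    """Average fingerprint of a group of fingerprints."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def adopt_singletons(kernels, singletons, threshold):
    """Move each singleton into its most similar kernel if similarity >= threshold.

    kernels: list of lists of fingerprints; singletons: list of fingerprints.
    Returns (updated kernels, singletons that were not adopted).
    """
    kernels = [list(k) for k in kernels]  # copy so the input is not modified
    remaining = []
    for fp in singletons:
        # Average fingerprints are recomputed so earlier adoptions are reflected.
        scores = [pearson(fp, mean_vector(k)) for k in kernels]
        best = max(range(len(kernels)), key=lambda i: scores[i]) if kernels else None
        if best is not None and scores[best] >= threshold:
            kernels[best].append(fp)
        else:
            remaining.append(fp)
    return kernels, remaining

# Made-up fingerprints: one rising kernel, one falling kernel, two singletons.
kernels = [[[1, 2, 3], [2, 4, 6]], [[3, 2, 1], [6, 4, 2]]]
new_kernels, left = adopt_singletons(kernels, [[1, 3, 5], [1, 5, 1]], threshold=0.9)
```

Here the rising singleton [1, 3, 5] joins the first kernel, while [1, 5, 1] is uncorrelated with both average fingerprints and stays a singleton.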
Figures of Merit
Evaluating a clustering solution when no correct clustering is known, using fingerprints.
Homogeneity: Average similarity between the fingerprint of an element and the average fingerprint of its cluster. The minimum is also reported.
Separation: Weighted average similarity between the average fingerprints of clusters. The maximum is also reported.
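The two averages can be sketched as follows. The slides say only that separation is a weighted average; weighting each pair of clusters by the product of the cluster sizes is an assumption here, as is the use of Pearson correlation as the similarity. All function names are hypothetical.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def homogeneity(clusters, sim=pearson):
    """Average similarity of each fingerprint to its cluster's average fingerprint."""
    scores = [sim(fp, mean_vector(cluster)) for cluster in clusters for fp in cluster]
    return sum(scores) / len(scores)

def separation(clusters, sim=pearson):
    """Size-weighted average similarity between average fingerprints of cluster pairs."""
    centres = [mean_vector(c) for c in clusters]
    num = den = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            w = len(clusters[i]) * len(clusters[j])  # assumed weighting
            num += w * sim(centres[i], centres[j])
            den += w
    return num / den

# Two made-up clusters: one rising pattern, one falling pattern.
clusters = [[[1, 2, 3], [2, 4, 6]], [[3, 2, 1], [6, 4, 2]]]
print(homogeneity(clusters), separation(clusters))
```

A good solution has high homogeneity and low separation; in this toy case the clusters are perfectly homogeneous and perfectly anti-correlated with each other.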
Gene Expression: Yeast Cell Cycle
Expression levels of 6,218 S. cerevisiae genes, measured at 17 time points over two cell cycles. (Data from Cho et al., Mol. Cell 1998)

Algorithm      Clusters   Homogeneity      Separation
                          Ave     Min      Ave     Max
CLICK          30         0.8     -0.19    -0.07   0.65
GeneCluster*   30         0.74    -0.88    -0.02   0.97

*Self-Organizing Maps; Tamayo et al., PNAS 1999.
CLICK clusters: Yeast Cell Cycle
Yeast Cell Cycle: late G1 Cluster
• Contains 91% of late G1-peaking genes.
• In contrast, in GeneCluster 87% are contained in 3 clusters.
• M-peaking genes: CLICK: 95% in a single cluster; GeneCluster: 92.5% in 3 clusters.
• Similar specificities.
[Figure: late G1 cluster, N=164]
Gene Expression: Serum Response
Human fibroblast cells starved for 48 hours, then stimulated by serum. Expression levels of 8,613 genes measured at 13 time points. (Data from Iyer et al., Science 1999)

Algorithm   Clusters   Homogeneity      Separation
                       Ave     Min      Ave     Max
CLICK       10         0.88    0.13     -0.34   0.65
CLUSTER*    10         0.87    -0.75    -0.13   0.9

*Average linkage agglomerative hierarchical clustering; Eisen et al., PNAS 1998.
Performance of CLICK vs other clustering algorithms

Elements   Problem                             Original algorithm            CLICK improvement   Time (min)*
517        Gene expression, fibroblasts        Cluster (Eisen et al.)        Yes                 0.5
826        Gene expression, yeast cell cycle   GeneCluster (Tamayo et al.)   Yes                 0.2
2,329      cDNA OFP, blood monocytes           HCS (Hartuv et al.)           Yes                 0.8
20,275     cDNA OFP, sea urchin eggs           K-Means (Herwig et al.)       Yes                 32.5
72,623     Protein similarity                  ProtoMap (Yona et al.)        Minor               53
117,835    Protein similarity                  SYSTERS (Krause et al.)       Yes                 126.3

* Executed on an SGI ORIGIN200 machine utilizing one IP27 processor. Does not include preprocessing time.
Performance on Yeast Cell Cycle data
[Figure: homogeneity vs. separation for the “True” solution, CAST*, GeneCluster (SOM), K-means and CLICK]
698 genes, 71 conditions (data from Spellman et al ‘98)
Each algorithm was run by its authors in a “blind” test.
*Ben-Dor, Shamir, Yakhini ‘99
EXPression ANalyzer and DisplayER
http://acgt.cs.tau.ac.il/expander
• Clustering: Identify clusters of co-expressed genes (CLICK, KMeans, SOM, hierarchical)
• Biclustering: Identify homogeneous submatrices (SAMBA)
• Functional enrichment
• Visualization
• Promoter analysis: Analyze TF binding sites of co-regulated genes (PRIMA)
• microRNA enrichment (FAME)
A. Maron, R. Sharan, Bioinformatics 03
A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon, BMC Bioinformatics 05
FIN