presentation 2009 journal club azhar ali shah
Azhar Ali Shah
@ Interdisciplinary Optimization and Decision Making Journal Club (IODMJC)
IODMJC, March 20, 2009
Azhar A Shah Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
Overview
Introduction
  About the authors
  About the topic
  Hierarchical Clustering
  UPGMA
Research Problem
Methodology
  Suite of algorithms
Results
Observations
Introduction: authors
Introduction: Hierarchical Clustering
Introduction: Hierarchical Clustering
Two fundamental problems in hierarchical clustering:
1. How to determine the similarity between two objects (e.g. proteins, genes)? Calculate the distance between the two objects (e.g. RMSD).
2. How to determine the similarity between two clusters? Use single, complete, or average linkage.
Introduction: about the topic
There is no guideline for selecting the best linkage method. In practice, people almost always use average linkage.
UPGMA (Unweighted Pair Group Method using arithmetic Averages)
Scalable to large datasets as it requires only O(1) edges in memory.
BUT highly susceptible to outliers!
Introduction: UPGMA
Input format: three fields per line
cluster_id1 cluster_id2 distance
Assumptions on input:
Cluster IDs are integers > 0
No self edges, i.e. cluster_id1 == cluster_id2 is illegal
No repeated edges, i.e. if i<->j exists then there is no j<->i
Output format: four fields per line
cluster_id1 cluster_id2 distance cluster_id3
cluster_id1 and cluster_id2 identify the pair of merged clusters, while cluster_id3 is an identifier for the new cluster, their union.
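The input assumptions above can be checked mechanically. The following is a minimal sketch, not the authors' reader: the function name `read_edges` and the error messages are hypothetical, and it assumes whitespace-separated fields as in the example input.

```python
def read_edges(lines):
    """Parse 'id1 id2 distance' lines, enforcing the stated input assumptions."""
    seen = set()   # unordered pairs already read, to detect i<->j / j<->i repeats
    edges = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        a, b, d = int(fields[0]), int(fields[1]), float(fields[2])
        if a <= 0 or b <= 0:
            raise ValueError(f"line {lineno}: cluster IDs must be integers > 0")
        if a == b:
            raise ValueError(f"line {lineno}: self edge {a}<->{b} is illegal")
        key = frozenset((a, b))
        if key in seen:
            raise ValueError(f"line {lineno}: repeated edge {a}<->{b}")
        seen.add(key)
        edges.append((a, b, d))
    return edges
```

For instance, `read_edges(["1 2 1e-100", "11 12 1e+01"])` yields two edges, while a second line `2 1 5` after `1 2 3` is rejected as a repeated edge.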
Introduction: UPGMA - Sparse input
UPGMA-input
1 2 1e-100
1 3 1e-40
1 4 2e-40
2 3 1e-80
2 4 1e-50
3 4 4e-10
11 12 1e+01
11 13 11
12 13 12
12 14 20
13 14 30
21 22 50
22 23 70
1 23 90
N=11 input singletons (vertices): {1,2,3,4,11,12,13,14,21,22,23} and 14 edges in the sparse input.
The input is considered sparse since not all pairs are given, e.g. there is no edge between 1 and 22.
Clusters 1,2,3,4 form a clique A.
Clusters 11,12,13,14 are missing edge <11,14> to form clique B.
Clusters 21,22,23 are loosely connected to each other and to the clusters of clique A.
In total there are two connected components in the input graph: {1,2,3,4,21,22,23} (producing 6 merges for 7 vertices) and {11,12,13,14} (producing 3 merges for 4 vertices). The output is therefore a forest of two disjoint trees with 9 merges in total, rather than the full tree of N-1=10 merges.
UPGMA-tree
1 2 1e-100 24
3 24 5e-41 25
4 25 1.33e-10 26
11 12 10 27
13 27 11.5 28
21 22 50 29
14 28 50 30
23 29 85 31
26 31 99.167 32
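The merge distances in this tree are consistent with treating every missing pair as distance 100, e.g. (20 + 30 + 100) / 3 = 50 for merging 14 with {11,12,13}. The sketch below is a naive average-linkage loop under that assumption. The default of 100 is inferred from the numbers above rather than stated on the slides, and `sparse_upgma` and `EXAMPLE_EDGES` are hypothetical names. It is not the authors' implementation and makes no attempt at efficiency.

```python
from itertools import count

EXAMPLE_EDGES = [
    (1, 2, 1e-100), (1, 3, 1e-40), (1, 4, 2e-40), (2, 3, 1e-80), (2, 4, 1e-50),
    (3, 4, 4e-10), (11, 12, 10.0), (11, 13, 11.0), (12, 13, 12.0), (12, 14, 20.0),
    (13, 14, 30.0), (21, 22, 50.0), (22, 23, 70.0), (1, 23, 90.0),
]

def sparse_upgma(edges, missing=100.0):
    """Average linkage over a sparse edge list; absent pairs count as `missing`.

    Returns merges as (id1, id2, average_distance, new_id) tuples.
    """
    size = {}
    known = {}  # frozenset({a, b}) -> [sum of known distances, number of known pairs]
    for a, b, d in edges:
        size[a] = size[b] = 1
        known[frozenset((a, b))] = [d, 1]

    def avg(pair):
        a, b = tuple(pair)
        total, n = known[pair]
        pairs = size[a] * size[b]          # all member pairs, known or not
        return (total + missing * (pairs - n)) / pairs

    new_ids = count(max(size) + 1)         # new cluster IDs continue after the largest input ID
    merges = []
    while known:
        best = min(known, key=avg)         # closest pair of clusters that share a known edge
        a, b = sorted(best)
        d = avg(best)
        c = next(new_ids)
        del known[best]
        # Fold every remaining edge touching a or b into the new cluster c.
        for pair in [p for p in known if a in p or b in p]:
            total, n = known.pop(pair)
            (other,) = pair - {a, b}
            acc = known.setdefault(frozenset((c, other)), [0.0, 0])
            acc[0] += total
            acc[1] += n
        size[c] = size.pop(a) + size.pop(b)
        merges.append((a, b, d, c))
    return merges

for a, b, d, c in sparse_upgma(EXAMPLE_EDGES):
    print(a, b, f"{d:.5g}", c)
```

The loop stops when no known edges remain, so disconnected components stay as separate trees, matching the forest described above.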
Research Problem: UPGMA
UPGMA requires the entire dissimilarity matrix to be in memory:
This data renders UPGMA impractical
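A back-of-the-envelope calculation shows the scale of the problem. This sketch assumes one 4-byte value per unordered pair, a storage choice that is an assumption and not taken from the slides. The sequence count is the UniRef90 figure reported in the results.

```python
def dense_pairs(n):
    """Number of unordered pairs in a full dissimilarity matrix over n items."""
    return n * (n - 1) // 2

def dense_matrix_bytes(n, bytes_per_value=4):
    """Memory for the full matrix, assuming bytes_per_value bytes per pair (assumption)."""
    return dense_pairs(n) * bytes_per_value

n = 1_801_506  # UniRef90 proteins clustered in the results
print(f"{dense_pairs(n):.3e} pairs, {dense_matrix_bytes(n) / 1e12:.2f} TB")
```

Under these assumptions the full matrix has about 1.6 x 10^12 entries, roughly 6.5 TB, far beyond a 4 GB workstation.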
Methodology: 1) Sparse-UPGMA
Can’t cope with huge datasets, where an O(E) memory requirement is intolerable (e.g. Table 1).
UPGMA (mean):
New eq:
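For reference, the standard UPGMA mean between clusters A and B can be written as below, followed by one sparse variant that is consistent with the worked UPGMA-tree example earlier in these slides. The symbols E_AB (the member pairs with a known edge) and ψ (a default distance for missing pairs, equal to 100 in that example) are assumptions introduced here and may differ from the notation in the paper.

```latex
% Standard UPGMA (mean) distance between clusters A and B
D(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d_{ab}

% Sparse variant (assumed form): missing pairs contribute the default \psi
D(A,B) = \frac{1}{|A|\,|B|}
         \Bigl( \sum_{(a,b) \in E_{AB}} d_{ab} + \psi \,\bigl( |A|\,|B| - |E_{AB}| \bigr) \Bigr)
```

With A = {11,12,13}, B = {14}, known edges 20 and 30, and ψ = 100, the second form gives (20 + 30 + 100) / 3 = 50.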
Time and memory improvement:
Methodology: 2) Multi-Round MC-UPGMA
Requirements:
A correct clusterer should be mindful of unseen edges (≥ λ) affecting clustering before λ (the maximum of the loaded edges).
Such examples are rather prevalent in non-metric datasets, e.g. when clustering sequence similarities.
Illustration of non-metric constraints imposed by BLAST sequence similarities (edges). False transitivity is possible due to CSKP_HUMAN.
Methodology: 2) Multi-Round MC-UPGMA
Solution: To prevent false clustering of a non-minimal edge, suitable bounds per edge are maintained.
The value of dij is lower (lij) and upper (uij) bounded as:
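One way such bounds can be computed is sketched below. It assumes that every unseen edge lies between λ (the maximum loaded edge, as above) and a maximum possible dissimilarity ψ. The symbol ψ and the function name `edge_bounds` are assumptions for illustration, and the exact form on the slide may differ.

```python
def edge_bounds(known_sum, known_count, size_i, size_j, lam, psi):
    """Lower and upper bounds on the average distance between clusters i and j.

    known_sum / known_count describe the member pairs whose edges are loaded;
    each unseen pair is assumed to lie in [lam, psi].
    """
    pairs = size_i * size_j
    unseen = pairs - known_count
    lower = (known_sum + lam * unseen) / pairs  # every unseen edge as small as possible
    upper = (known_sum + psi * unseen) / pairs  # every unseen edge as large as possible
    return lower, upper

# Cluster {11,12,13} versus 14 from the earlier example: edges 20 and 30 loaded,
# one pair unseen, with lam=40 and psi=100 chosen for illustration.
low, high = edge_bounds(50.0, 2, 3, 1, lam=40.0, psi=100.0)
```

When all pairs are loaded the two bounds coincide and the distance is exact. Otherwise a merge is only safe if no competing edge's lower bound can fall below it.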
Methodology: 2) Multi-Round MC-UPGMA
1. When Multi-Round MC-UPGMA halts, it is not using its entire memory budget M, since each merge reduces the number of edges in memory.
2. Most of the computation time is spent on preprocessing for the next round of clustering.
Methodology: 2) Single-Round MC-UPGMA
Requires O(n) memory to hold the forming tree!
Methodology: 2) Single-Round MC-UPGMA
Methods
Clustered data:
UniRef90 (release 8.5), non-redundant, 1.80M sequences
BLAST similarities:
blastp with E=100, run on a MOSIX grid
Reciprocal-BLAST-like setting: each sequence is used both as a query and as a database entry
The directed multigraph (1) is transformed to an undirected graph (2) with symmetric dissimilarities
(1) 2.5 x 10^9 edges (50 GB)
(2) 1.5 x 10^9 edges (30 GB)
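The slides do not say how the two directions of a hit are combined into one symmetric dissimilarity. The sketch below assumes the smaller value is kept, which is a guess made loudly. The function name `symmetrize` and the `combine` parameter are hypothetical, and any other rule (e.g. `max`) can be passed instead.

```python
def symmetrize(directed_hits, combine=min):
    """Collapse directed (query, target, score) hits into undirected edges.

    Assumption: repeated or reciprocal hits are merged with `combine` (min by default).
    Self hits are dropped, so the result has no self edges and no repeated edges.
    """
    undirected = {}
    for query, target, score in directed_hits:
        if query == target:
            continue  # a sequence hitting itself carries no clustering information
        key = (min(query, target), max(query, target))
        if key in undirected:
            undirected[key] = combine(undirected[key], score)
        else:
            undirected[key] = score
    return [(a, b, s) for (a, b), s in sorted(undirected.items())]
```

The result satisfies the UPGMA input assumptions listed earlier: positive integer IDs are preserved, and each unordered pair appears once.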
Methods
Protein family keywords:
The InterPro classification is used as a mapping of keywords to protein sequences
Metrics:
Jaccard score
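The Jaccard score between a cluster and the set of proteins carrying a keyword is the size of their intersection divided by the size of their union. A minimal sketch follows. The protein identifiers are made up, and how the authors aggregate scores over clusters and keywords is not shown on the slides.

```python
def jaccard(cluster, annotated):
    """Jaccard score |A ∩ B| / |A ∪ B| between a cluster and a keyword's protein set."""
    cluster, annotated = set(cluster), set(annotated)
    union = cluster | annotated
    return len(cluster & annotated) / len(union) if union else 0.0

# Hypothetical example: 2 shared proteins out of 4 distinct ones.
score = jaccard({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
```

A score of 1 means the cluster contains exactly the proteins annotated with the keyword, and 0 means they share none.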
Results from 1 801 506 UniRef90 proteins.
1107 (0.06%) proteins are singletons having no BLAST similarities.
From the clustered set, 1 791 206 proteins (99.5%) are clustered into a single tree.
1 497 733 of the tree clusters (83.6%) are fully linked, including 426 360 large clusters with at least 10 members.
Results
[Figure labels: Smith–Waterman; BLAST; Sparse UPGMA; with reduced dataset; 220K; 1.80M]
Results
200 clustering rounds on a single 4GB memory 4-CPU workstation took about 1-2 days.
Results
Observations
No detailed discussion on parallelization
No results for Single-Round MC-UPGMA
Cluster Card Page
View Proteins of Cluster
Keywords Appearances
Cluster Similarity Distribution
Similarity matrix for the proteins in this cluster