presentation 2009 journal club azhar ali shah

Azhar Ali Shah

@ Interdisciplinary Optimization and Decision Making Journal Club (IODMJC)

IODMJC, March 20, 2009

Azhar A Shah Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space


Overview

Introduction
  About the authors
  About the topic
  Hierarchical Clustering
  UPGMA
Research Problem
Methodology
  Suite of algorithms
Results
Observations



Introduction: authors



Introduction: Hierarchical Clustering



Introduction: Hierarchical Clustering

Two fundamental problems in hierarchical clustering:

1. How to determine the similarity between two objects (e.g. proteins, genes)? Calculate the distance between the two objects (e.g. RMSD).

2. How to determine the similarity between two clusters? Choose a linkage criterion: single, complete, or average linkage.
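As a minimal sketch of the second problem, the three linkage criteria can be written as reductions over all pairwise distances between two clusters. The function name `linkage_distance` and the toy points on a line are illustrative assumptions, not part of the presentation.

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under a given linkage criterion."""
    pairwise = [dist(x, y) for x, y in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all pairs (the UPGMA criterion)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def abs_diff(x, y):
    """Toy object-to-object distance: absolute difference of two numbers."""
    return abs(x - y)


# Toy example: two clusters of points on a line.
A, B = [0.0, 1.0], [4.0, 6.0]
for m in ("single", "complete", "average"):
    print(m, linkage_distance(A, B, abs_diff, m))
```

The object-level distance (`abs_diff` here) is passed in as a function, so the same code would work with RMSD or any other dissimilarity.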



Introduction: about the topic

There is no guideline for selecting the best linkage method. In practice, people almost always use average linkage.

UPGMA (Unweighted Pair Group Method using arithmetic Averages)

Scalable to large datasets as it requires only O(1) edges in memory.

BUT highly susceptible to outliers!

Introduction: UPGMA

Input format: three fields per line

cluster_id1 cluster_id2 distance

Assumptions on input:

Cluster IDs are integers > 0
No self edges, i.e. cluster_id1 == cluster_id2 is illegal
No repeated edges, i.e. if i<->j exists then there is no j<->i

Output format: four fields per line

cluster_id1 cluster_id2 distance cluster_id3

cluster_id1 and cluster_id2 identify the pair of merged clusters, while cluster_id3 is an identifier for the new cluster, i.e. their union.
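The input assumptions can be checked while reading the file. The following is a minimal sketch; the function name `read_edges` and the error messages are hypothetical and not taken from the original UPGMA tool.

```python
def read_edges(lines):
    """Parse 'id1 id2 distance' lines and enforce the stated input assumptions."""
    seen = set()
    edges = []
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        a, b, d = line.split()
        a, b, d = int(a), int(b), float(d)
        if a <= 0 or b <= 0:
            raise ValueError(f"cluster IDs must be positive integers: {line!r}")
        if a == b:
            raise ValueError(f"self edge is illegal: {line!r}")
        key = (min(a, b), max(a, b))      # i<->j and j<->i are the same edge
        if key in seen:
            raise ValueError(f"repeated edge: {line!r}")
        seen.add(key)
        edges.append((a, b, d))
    return edges


print(read_edges(["1 2 1e-100", "2 3 1e-80"]))
```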

Introduction: UPGMA - Sparse input

UPGMA-input

1 2 1e-100

1 3 1e-40

1 4 2e-40

2 3 1e-80

2 4 1e-50

3 4 4e-10

11 12 1e+01

11 13 11

12 13 12

12 14 20

13 14 30

21 22 50

22 23 70

1 23 90

N=11 input singletons (vertices): {1,2,3,4,11,12,13,14,21,22,23}

and 14 edges in the sparse input.

The input is considered sparse since not all pairs are given, e.g. there is no edge between 1 and 22.

Clusters 1,2,3,4 form a clique A.

Clusters 11,12,13,14 are missing edge <11,14> to form clique B.

Clusters 21,22,23 are loosely connected to each other and to the cluster of clique A.

In total there are two connected components in the input graph: {1,2,3,4,21,22,23} (producing 6 merges for 7 vertices) and {11,12,13,14} (producing 3 merges for 4 vertices). The output is therefore a forest of two disjoint trees with 9 merges in total, rather than a full tree of N-1=10 merges.
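The component structure of this example can be checked with a small union-find pass over the 14 input edges. This is an illustrative sketch; the helper name `components` is an assumption.

```python
# The 14 edges of the sparse example input (distances omitted).
EDGE_PAIRS = [
    (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
    (11, 12), (11, 13), (12, 13), (12, 14), (13, 14),
    (21, 22), (22, 23), (1, 23),
]


def components(edges):
    """Connected components of an undirected graph, via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two roots

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=min)


COMPS = components(EDGE_PAIRS)
for comp in COMPS:
    print(sorted(comp), "->", len(comp) - 1, "merges")
```

Each component of k vertices contributes k-1 merges, so the total number of merges is N minus the number of components.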

UPGMA-tree

1 2 1e-100 24

3 24 5e-41 25

4 25 1.33e-10 26

11 12 10 27

13 27 11.5 28

21 22 50 29

14 28 50 30

23 29 85 31

26 31 99.167 32
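The merge distances above are consistent with average linkage in which every missing edge counts as a default distance of 100, for example (20 + 30 + 100) / 3 = 50 for the merge of 14 with 28. The value 100 is not stated on this slide; it is inferred from the numbers and matches the E=100 BLAST threshold mentioned later in the presentation. The sketch below is a naive illustration under that assumption (the names `sparse_upgma` and `psi` are hypothetical), not the authors' implementation.

```python
EDGES = [
    (1, 2, 1e-100), (1, 3, 1e-40), (1, 4, 2e-40), (2, 3, 1e-80),
    (2, 4, 1e-50), (3, 4, 4e-10), (11, 12, 10.0), (11, 13, 11.0),
    (12, 13, 12.0), (12, 14, 20.0), (13, 14, 30.0), (21, 22, 50.0),
    (22, 23, 70.0), (1, 23, 90.0),
]


def sparse_upgma(edges, psi):
    """Naive average-linkage clustering on a sparse edge list.

    Assumption: every missing singleton-to-singleton edge has distance psi.
    Returns a list of (id1, id2, distance, new_id) merges.
    """
    size = {}   # cluster id -> number of singletons inside it
    link = {}   # frozenset({a, b}) -> [sum of known distances, count of known edges]
    for a, b, d in edges:
        size[a] = size[b] = 1
        link[frozenset((a, b))] = [d, 1]
    next_id = max(size) + 1
    merges = []

    def dist(pair):
        a, b = pair
        total, known = link[pair]
        n_pairs = size[a] * size[b]
        return (total + psi * (n_pairs - known)) / n_pairs

    while link:
        # min() keeps the first minimum, so ties follow dict insertion order.
        best = min(link, key=dist)
        a, b = sorted(best)
        merges.append((a, b, dist(best), next_id))
        del link[best]
        # Fold every edge touching a or b into one "thick" edge per neighbour.
        merged = {}
        for pair in list(link):
            if a in pair or b in pair:
                (other,) = pair - {a, b}
                total, known = link.pop(pair)
                acc = merged.setdefault(other, [0.0, 0])
                acc[0] += total
                acc[1] += known
        size[next_id] = size.pop(a) + size.pop(b)
        for other, acc in merged.items():
            link[frozenset((other, next_id))] = acc
        next_id += 1
    return merges


for a, b, d, new in sparse_upgma(EDGES, psi=100.0):
    print(a, b, f"{d:.4g}", new)
```

Only cluster pairs that share at least one known edge are merge candidates, which is why the two connected components are never joined. Storing a running sum and count per thick edge is one simple way to update an average without revisiting the original edges.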



Research Problem: UPGMA

UPGMA requires the entire dissimilarity matrix to be in memory:

This data renders UPGMA impractical
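A back-of-the-envelope sketch of the memory problem, using the 1 801 506 UniRef90 proteins reported in the Results slides and assuming 4 bytes per stored dissimilarity (the byte size is an assumption, not a figure from the presentation):

```python
n = 1_801_506                 # UniRef90 proteins reported in the Results slides
pairs = n * (n - 1) // 2      # entries in the upper triangle of the matrix
bytes_per_entry = 4           # assumption: single-precision floats
total_tb = pairs * bytes_per_entry / 1e12

print(f"{pairs:,} pairs, about {total_tb:.1f} TB")
```

Under these assumptions the full matrix needs several terabytes, far beyond the 4 GB workstation mentioned in the Results.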



Methodology: 1) Sparse-UPGMA

Can’t cope with huge datasets, where an O(E) memory requirement is intolerable (e.g. Table 1).

UPGMA (mean):

New eq:

Time and memory improvement:

Thick edges (neighbours) are calculated in O(1) time.

Methodology: 2) Multi-Round MC-UPGMA

Requirements:

A correct clusterer should be mindful of unseen edges (≥λ), which can affect clustering before λ (the maximum of the loaded edges).

Such examples are rather prevalent in non-metric datasets, e.g. when clustering sequence similarities.

Illustration of non-metric constraints imposed by BLAST sequence similarities (edges). False transitivity is possible due to CSKP_HUMAN.



Methodology: 2) Multi-Round MC-UPGMA

Solution: To prevent false clustering of a non-minimal edge, suitable bounds per edge are maintained.

The value of dij is lower (lij) and upper (uij) bounded as:
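A minimal sketch of how such bounds can be obtained for an average-linkage distance, assuming that every unseen constituent edge lies between λ (the largest loaded edge) and an upper limit ψ. The symbol ψ and the function `edge_bounds` are assumptions introduced for illustration; this is not necessarily the exact formula used by the authors.

```python
def edge_bounds(known_sum, known_count, size_i, size_j, lam, psi):
    """Interval (l_ij, u_ij) for the average distance d_ij between clusters i and j.

    Assumption: each of the unseen singleton-to-singleton edges lies in [lam, psi].
    """
    total_pairs = size_i * size_j
    unseen = total_pairs - known_count
    lower = (known_sum + unseen * lam) / total_pairs   # all unseen edges equal lam
    upper = (known_sum + unseen * psi) / total_pairs   # all unseen edges equal psi
    return lower, upper


# Cluster {11,12,13} versus singleton 14: edges 20 and 30 are known, one is unseen.
print(edge_bounds(known_sum=50.0, known_count=2, size_i=3, size_j=1,
                  lam=40.0, psi=100.0))
```

When all constituent edges are known, the two bounds coincide and the interval collapses to the exact average.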



Methodology: 2) Multi-Round MC-UPGMA

1. When Multi-Round MC-UPGMA halts, it is not using its entire memory budget M, since each merge reduces the number of edges in memory.

2. Most of the computation time is spent on preprocessing for the next round of clustering.

merger: Processes the partial clustering and the set of current edges. Edges grow thicker as clusters grow larger. The current set of thick edges is the input to the next round. The algorithm is the same as Sparse-UPGMA; however, it loads only the M minimal edges into Et. Edges Dij are replaced with intervals [Lij, Uij] to accommodate uncertain edge values (due to the partial set of edges at hand). Clustering proceeds while a distinctly minimal edge is at hand, that is, an edge whose upper bound is lower than the lower bound of any other edge in Et. Clustering halts when it is impossible to identify a minimal edge in the entire Et.
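The halting rule can be sketched as a search for a distinctly minimal edge among interval-valued edges. The dictionary layout and the name `distinctly_minimal` are illustrative assumptions.

```python
def distinctly_minimal(intervals):
    """Return the edge whose upper bound is below every other edge's lower bound.

    intervals: dict mapping an edge (i, j) to its (lower, upper) bounds.
    Returns None when no such edge exists, i.e. when clustering must halt.
    """
    if not intervals:
        return None
    # A distinctly minimal edge necessarily has the smallest upper bound.
    best = min(intervals, key=lambda e: intervals[e][1])
    upper = intervals[best][1]
    for edge, (lower, _) in intervals.items():
        if edge != best and lower <= upper:
            return None        # intervals overlap: the minimum is ambiguous
    return best


CLEAR = {(1, 2): (1.0, 2.0), (3, 4): (2.5, 4.0), (5, 6): (3.0, 3.5)}
AMBIGUOUS = {(1, 2): (1.0, 3.0), (3, 4): (2.5, 4.0)}
print(distinctly_minimal(CLEAR))
print(distinctly_minimal(AMBIGUOUS))
```

In the second example the interval of edge (1, 2) overlaps that of (3, 4), so the true minimum cannot be decided from the loaded edges alone and the round would stop.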



Methodology: 2) Single-Round MC-UPGMA

Requires O(n) memory for holding the forming tree!

Uses freed-up memory to load fresh edges. To accommodate the reloading of old invalid edges, a new edge representation is introduced.

Methodology: 2) Single-Round MC-UPGMA

Methods

Clustered data:

UniRef90 (release 8.5), non-redundant, 1.80M sequences

BLAST similarities:

blastp with E=100, run on a MOSIX grid
Reciprocal-BLAST-like setting: each sequence is used both as a query and as a database entry
The directed multigraph (1) is transformed into an undirected graph (2) with symmetric dissimilarities

1. 2.5×10^9 edges (50 GB)
2. 1.5×10^9 edges (30 GB)

Methods

Protein Family Keywords

InterPro classification is used as a mapping of keywords to protein sequences

Metrics

Jaccard Score
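A minimal sketch of the Jaccard score, assuming it compares the set of proteins in a cluster with the set of proteins annotated with a given keyword. The protein identifiers below are made up for illustration.

```python
def jaccard(cluster, annotated):
    """Jaccard score: size of the intersection divided by size of the union."""
    cluster, annotated = set(cluster), set(annotated)
    union = cluster | annotated
    return len(cluster & annotated) / len(union) if union else 0.0


CLUSTER = {"P1", "P2", "P3", "P4"}          # hypothetical cluster members
KEYWORD = {"P2", "P3", "P4", "P5"}          # hypothetical proteins with one keyword
print(jaccard(CLUSTER, KEYWORD))
```

A score of 1 means the cluster contains exactly the proteins carrying the keyword; a score of 0 means the two sets do not overlap.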

Results

From 1 801 506 UniRef90 proteins:

1107 (0.06%) proteins are singletons having no BLAST similarities.

From the clustered set, 1 791 206 proteins (99.5%) are clustered into a single tree.

1 497 733 of the tree clusters (83.6%) are fully linked, including 426 360 large clusters with at least 10 members.

Results

[Figure; labels: Smith–Waterman, BLAST, Sparse UPGMA, "With reduced dataset", 220K, 1.80M]

Results

200 clustering rounds on a single 4GB memory 4-CPU workstation took about 1-2 days.

Results

Observations

No detailed discussion on parallelization
No results of Single-Round MC-UPGMA



[Figure: Cluster Card Page, with labels "View Proteins of Cluster", "Keywords Appearances" and "Cluster Similarity Distribution" (similarity matrix for the proteins in this cluster)]
