Clustering Applications
Reminder
Applications
Spectral Clustering
Assignment Clustering
Clustering Methods
Clustering
Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.
Non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.
Non-hierarchical Methods
non-hierarchical methods
Partitioning methods: classes are mutually exclusive.
Clumping methods: overlap is allowed.
hierarchical methods
Agglomerative methods: the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the unclustered dataset.
Hierarchical methods
Divisive methods begin with all objects in a single cluster and, at each of N-1 steps, divide some cluster into two smaller clusters, until each object resides in its own cluster.
Partitioning Methods
Partitioning methods are divided according to the number of passes over the data:
Single pass: basic partitioning methods.
Multiple passes: K-means (very widely used).
K-means: Sample Application
Gene clustering. Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.
Normalization allows comparisons across microarrays.
Produce clusters of genes which vary in similar ways over time.
Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.
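To make the K-means step concrete, here is a minimal sketch on made-up normalized expression profiles (the data, the number of iterations, and the seed are illustrative choices, not the actual microarray pipeline):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means: assign each point to the nearest center, then
    recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Empty clusters keep their old center.
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Toy normalized profiles over three time points: genes whose
# expression rises over time versus genes whose expression falls.
up   = [(-1.0, 0.1, 0.9), (-0.9, 0.0, 1.0), (-1.1, 0.2, 0.8)]
down = [(1.0, -0.1, -0.9), (0.9, 0.0, -1.0), (1.1, -0.2, -0.8)]
centers, clusters = kmeans(up + down, k=2)
```

On this toy input the two recovered clusters coincide with the rising and falling profile groups.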
Sample array: rows are genes and columns are time points.
A cluster of co-regulated genes.
Clustering gene expression data
Figure: genes-by-samples expression matrix; each row is the expression profile of one gene.
Clustering gene expression data
Figure: genes-by-samples expression matrix; each row is the expression profile of one gene.
Cluster genes with similar expression profiles
Clustering genes on expression profiles
• The expression profile of each gene is a point in 'sample space'.
Figure: gene g plotted at coordinates (eg1, eg2, eg3) on axes Sample 1, Sample 2, Sample 3.
• All genes together form a scatter in this space.
Normalized Expression Data from microarrays
Figure: matrix with rows Gene 1, ..., Gene N and columns T1, T2, T3.
Representation of expression data
Figure: Gene 1 and Gene 2 plotted in the space of Time-point 1, Time-point 2, Time-point 3; dij is the distance between them.
Identifying prevalent expression patterns
Figure: three panels of normalized expression versus time-point (1-3), each showing one prevalent expression pattern.
Evaluate Cluster contents
Figure: genes of one cluster (gpm1, HTB1, RPL11A, RPL12B, RPL13A, RPL14A, RPL15A, RPL17A, RPL23A, TEF2, YDL228c, YDR133C, YDR134C, YDR327W, YDR417C, YKL153W, YPL142C) listed against their MIPS functional categories: Glycolysis, Nuclear Organization, Ribosome, Translation, Unknown.
Hierarchical Agglomerative methods
The hierarchical agglomerative clustering methods are the most commonly used. The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm.
1. Find the 2 closest objects and merge them into a cluster
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
Clustering genes on expression profiles
• Define a distance/similarity measure between points.
Euclidean: d(g, g') = sqrt( Σ_s (e_g,s − e_g',s)² )
Manhattan: d(g, g') = Σ_s |e_g,s − e_g',s|
• Define a distance between clusters of points:
1) Distance between the closest pair of points from the two clusters (single linkage).
2) Distance between the furthest pair of points (complete linkage).
3) Average distance between points from both clusters (average linkage).
4) Distance between the clusters' centroids.
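The four cluster-distance definitions can be sketched as small Python functions; the toy 1-D clusters and the Manhattan point metric below are arbitrary choices made so the arithmetic is easy to follow:

```python
def pair_dists(A, B, dist):
    """All point-to-point distances between clusters A and B."""
    return [dist(a, b) for a in A for b in B]

def single_linkage(A, B, dist):    # distance between the closest pair
    return min(pair_dists(A, B, dist))

def complete_linkage(A, B, dist):  # distance between the furthest pair
    return max(pair_dists(A, B, dist))

def average_linkage(A, B, dist):   # average distance over all pairs
    d = pair_dists(A, B, dist)
    return sum(d) / len(d)

def centroid_distance(A, B, dist): # distance between cluster centroids
    cA = tuple(sum(x) / len(A) for x in zip(*A))
    cB = tuple(sum(x) / len(B) for x in zip(*B))
    return dist(cA, cB)

manhattan = lambda p, q: sum(abs(x - y) for x, y in zip(p, q))

A = [(0.0,), (1.0,)]
B = [(4.0,), (6.0,)]
```

For these clusters the pair distances are 4, 6, 3, 5, so the four linkages give 3, 6, 4.5, and 4.5 respectively.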
Clustering genes on expression profiles
Hierarchical clustering:
• Start with each point in its own cluster.
• At each iteration, merge the two clusters with the smallest distance.
Clustering genes on expression profiles
Hierarchical clustering:
• Start with each point in its own cluster.
• At each iteration, merge the two clusters with the smallest distance.
Eventually all points will be linked into a single cluster.
Clustering genes on expression profiles
The sequence of mergers can be represented in a hierarchical tree.
Figure: the points a-g in sample space (axes Sample 1, 2, 3) and the corresponding dendrogram with leaves a b c d e f g.
Clustering genes on expression profiles
Green = expression level low with respect to the reference sample.
Red = expression level high with respect to the reference sample.
Black = expression level comparable to the reference sample.
The columns are ordered such that similar expression profiles neighbor each other.
Eisen et al. PNAS 1998.
Clustering gene expression data
Figure: genes-by-samples expression matrix; each column is the expression profile of one sample.
Instead of genes one may cluster samples with similar expression profiles.
Clustering samples on expression profiles
Identifying different tumor types through sample clustering.
Alizadeh et al., Nature 403:503-11, 2000
Combinations of samples/genes
Figure: genes-by-samples matrix clustered along both axes.
Cluster genes with similar sample expression-profiles.
Cluster samples with similar gene expression-profiles.
Combination model
Figure: genes-by-samples matrix with colored blocks; each color corresponds to some "cause".
The cause affects a subset of genes in a subset of the samples. e.g. Ihmels et al. Nature Genetics 2002.
Combinations of samples/genes
Ihmels et al. Nature genetics 2002
Clustering genes: Clusters of homologous genes
• Input: a set of protein or DNA sequences.
• Use an alignment algorithm (e.g. BLAST) to score the similarity of each pair of sequences.
• Task: detect clusters in the graph.
Graph of similarities of proteins in Methanococcus jannaschii. The length of the links reflects similarity (short link = high similarity).
Enright and Ouzounis, Bioinformatics 2001.
Clustering genes: Clusters of homologous genes
Example solution: put 'random walkers' on the graph and let them follow links at random. Look at the density of walkers, strengthen 'high-flow' links, and weaken 'low-flow' links.
Stijn van Dongen, Graph Clustering by Flow Simulation (PhD thesis, University of Amsterdam).
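A minimal sketch of this flow-simulation idea, in the style of van Dongen's Markov clustering: alternate expansion (walkers spreading along links) with inflation (strengthening high-flow links). The toy graph, the parameter values, and the cluster read-out are illustrative assumptions, not the MCL reference implementation:

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=30):
    """Flow-simulation clustering: expansion spreads random walkers,
    inflation boosts high-flow links, renormalization keeps the
    matrix column-stochastic."""
    M = adj + np.eye(len(adj))        # self-loops stabilize the flow
    M = M / M.sum(axis=0)             # columns = walk probabilities
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)  # walkers take several steps
        M = M ** inflation                        # strengthen high-flow links
        M = M / M.sum(axis=0)                     # renormalize columns
    # Attractor rows that retain mass define the clusters.
    clusters = set()
    for row in M:
        members = frozenset(np.nonzero(row > 1e-6)[0])
        if members:
            clusters.add(members)
    return clusters

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
clusters = mcl(A)
```

On this graph the flow concentrates inside each triangle and the weak bridge is cut, recovering the two triangles as clusters.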
Clustering DNA sequences: Transcription factor binding sites
Transcription factors recognize 'fuzzy motifs'. Alignment of known fruR binding sites:
AAGCTGAATCGATTTTATGATTTGGT
AGGCTGAATCGTTTCAATTCAGCAAG
CTGCTGAATTGATTCAGGTCAGGCCA
GTGCTGAAACCATTCAAGAGTCAATT
GTGGTGAATCGATACTTTACCGGTTG
CGACTGAAACGCTTCAGCTAGGATAA
TGACTGAAACGTTTTTGCCCTATGAG
TTCTTGAAACGTTTCAGCGCGATCTT
ACGGTGAATCGTTCAAGCAAATATAT
GCACTGAATCGGTTAACTGTCCAGTC
ATCGTTAAGCGATTCAGCACCTTACC
**gcTGAAtCG*TTcAg**c******
Task: given thousands of such binding sites for hundreds of different TFs, infer which binding sites bind the same TF.
Clustering DNA sequences: Transcription factor binding sites
AAGCACTATATTGGTGCAACATTCACATCGTGGTGATGAACTGTTTTTTTATCCAGTATAATTTACTCATCTGGTACGACCAGATCACCTTGCGGAAAGCACCATGTTGGTGCAATGACCTTTGGATAAAGCTGAATCGATTTTATGATTTGGTTCAATTAGGCTGAATCGTTTCAATTCAGCAAGAGAGGACATTAACTCATCGGATCAGTTCAGTAACTATTCCTCTTTACTGTATATAAAACCAGTTTATACTTCCGAACTGATCGGACTTGTTCAGCGTACACGACTCACAACTGTATATAAATACAGTTACAGATGTGCTGAAACCATTCAAGAGTCAATTGGCGCGATCAAGCTGGTATGATGAGTTAATATTATGTTTTCCAATACTGTATATTCATTCAGGTCAATTTGTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGAGCCTTTTGCTGTATATACTCACAGCATAACTGCAGCGGCTGGTCCGCTGTTTCTGCATTCTTACACGGTGAATCGTTCAAGCAAATATATTTTTTTAGTAATGACTGTATAAAACCACAGCCAATCAAATCGTTAAGCGATTCAGCACCTTACCTCAGGC
TGGATGTACTGTACATCCATACAGTAACTCACATGCACTAAAATGGTGCAACCTGTTCAGGAGATATTTTACCTGTATAAATAACCAGTATATTCACAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCATACAGACTACTGTATATAAAAACAGTATAACTTTCGCCACTGGTCTGATTTCTAAGATGTACCTCAGTTTATACTGTACACAATAACAGTAATGGTTCTGCTGAATTGATTCAGGTCAGGCCAAATGGCACTTGATACTGTATGAGCATACAGTATAATTGTTCCAGCTGGTCCGACCTATACTCTCGCCACTTCGTTTTCCTGTATGAAAAACCATTACTGTTATTACACTCCTGTTAATCCATACAGCAACAGTACGACTGAAACGCTTCAGCTAGGATAAGCGAAATGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAGTTCTTGAAACGTTTCAGCGCGATCTTGTCTTTCTGTTACACTGGATAGATAACCAGCATTCGGAATCCTTCGCTGGATATCTATCCAGCATTTTTTGCACTGAATCGGTTAACTGTCCAGTCGACGGCCCACAATATTGGCTGTTTATACAGTATTTCAG
Each line contains a binding site for a transcription factor.
Clustering DNA sequences: Transcription factor binding sites
AAGCACTATATTGGTGCAACATTCACATCGTGGTGATGAACTGTTTTTTTATCCAGTATAATTTACTCATCTGGTACGACCAGATCACCTTGCGGAAAGCACCATGTTGGTGCAATGACCTTTGGATAAAGCTGAATCGATTTTATGATTTGGTTCAATTAGGCTGAATCGTTTCAATTCAGCAAGAGAGGACATTAACTCATCGGATCAGTTCAGTAACTATTCCTCTTTACTGTATATAAAACCAGTTTATACTTCCGAACTGATCGGACTTGTTCAGCGTACACGACTCACAACTGTATATAAATACAGTTACAGATGTGCTGAAACCATTCAAGAGTCAATTGGCGCGATCAAGCTGGTATGATGAGTTAATATTATGTTTTCCAATACTGTATATTCATTCAGGTCAATTTGTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGAGCCTTTTGCTGTATATACTCACAGCATAACTGCAGCGGCTGGTCCGCTGTTTCTGCATTCTTACACGGTGAATCGTTCAAGCAAATATATTTTTTTAGTAATGACTGTATAAAACCACAGCCAATCAAATCGTTAAGCGATTCAGCACCTTACCTCAGGC
TGGATGTACTGTACATCCATACAGTAACTCACATGCACTAAAATGGTGCAACCTGTTCAGGAGATATTTTACCTGTATAAATAACCAGTATATTCACAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCATACAGACTACTGTATATAAAAACAGTATAACTTTCGCCACTGGTCTGATTTCTAAGATGTACCTCAGTTTATACTGTACACAATAACAGTAATGGTTCTGCTGAATTGATTCAGGTCAGGCCAAATGGCACTTGATACTGTATGAGCATACAGTATAATTGTTCCAGCTGGTCCGACCTATACTCTCGCCACTTCGTTTTCCTGTATGAAAAACCATTACTGTTATTACACTCCTGTTAATCCATACAGCAACAGTACGACTGAAACGCTTCAGCTAGGATAAGCGAAATGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAGTTCTTGAAACGTTTCAGCGCGATCTTGTCTTTCTGTTACACTGGATAGATAACCAGCATTCGGAATCCTTCGCTGGATATCTATCCAGCATTTTTTGCACTGAATCGGTTAACTGTCCAGTCGACGGCCCACAATATTGGCTGTTTATACAGTATTTCAG
van Nimwegen et al. PNAS 2002
Clustering DNA sequences: Transcription factor binding sites
Figure: the binding sites grouped into clusters, each with its own consensus:
a*GCAC*A*atTGGTGCaac****t***g**
****a*CTGgTc*Gat**GT******t*****
**gcTGAAtCG*TTcAg**c************
****t*tACTGTATATa*A*ACAG********
Similarity/distance matrices
Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort experiments according to that factor.
Figure: similarity matrix with experiments sorted into Array batch 1 and Array batch 2.
Clustering DNA sequences: Transcription factor binding sites
Alignment of known fruR binding sites:
AAGCTGAATCGATTTTATGATTTGGT
AGGCTGAATCGTTTCAATTCAGCAAG
CTGCTGAATTGATTCAGGTCAGGCCA
GTGCTGAAACCATTCAAGAGTCAATT
GTGGTGAATCGATACTTTACCGGTTG
CGACTGAAACGCTTCAGCTAGGATAA
TGACTGAAACGTTTTTGCCCTATGAG
TTCTTGAAACGTTTCAGCGCGATCTT
ACGGTGAATCGTTCAAGCAAATATAT
GCACTGAATCGGTTAACTGTCCAGTC
ATCGTTAAGCGATTCAGCACCTTACC
**gcTGAAtCG*TTcAg**c******
Probability Evaluation
w_i^α: the probability of finding base α at position i.
For instance: w_3^A = 0.267, w_3^C = 0.2, w_3^G = 0.467, w_3^T = 0.067.
Probability that a sequence s is a binding site for the factor represented by w:
P(s | w) = Π_{i=1}^{l} w_i^{s_i}
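The weight-matrix probability can be sketched directly. In the matrix below only the middle row reuses the slide's example probabilities; the other rows are made-up values for illustration:

```python
# Position weight matrix w: w[i][base] = probability of base at position i.
# Row 1 echoes the slide's example values; rows 0 and 2 are invented.
w = [
    {'A': 0.1,   'C': 0.1,  'G': 0.7,   'T': 0.1},
    {'A': 0.267, 'C': 0.2,  'G': 0.467, 'T': 0.067},
    {'A': 0.8,   'C': 0.05, 'G': 0.05,  'T': 0.1},
]

def site_probability(s, w):
    """P(s | w): the product over positions i of w_i^{s_i}."""
    p = 1.0
    for i, base in enumerate(s):
        p *= w[i][base]
    return p
```

For example, `site_probability('GGA', w)` multiplies 0.7, 0.467, and 0.8, the per-position probabilities of the observed bases.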
Kohonen Self-organizing maps
K = r*s clusters are arranged as nodes of a two dimensional grid. Nodes represent cluster centers/prototype vectors.
This makes it possible to represent similarity between clusters.
Algorithm: initialize nodes at random positions.
Iterate:
- Randomly pick one data point (gene) x.
- Move nodes towards x: the closest node most, remote nodes (in terms of the grid) less. Decrease the amount of movement with the number of iterations.
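The update loop above can be sketched as follows. This is a bare-bones SOM, not Tamayo's GENECLUSTER; the Gaussian neighbourhood, the decay schedules, and the toy data are my own illustrative choices:

```python
import math
import random

def som(data, grid_w, grid_h, iters=3000, seed=1):
    """Minimal self-organizing map: nodes sit on a grid_w x grid_h grid;
    each update pulls the winning node (and, more weakly, its grid
    neighbours) towards a randomly chosen data point."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = {(r, c): [rng.uniform(-1, 1) for _ in range(dim)]
             for r in range(grid_h) for c in range(grid_w)}
    for t in range(iters):
        x = rng.choice(data)
        # Winning node: the prototype closest to x.
        win = min(nodes, key=lambda n: sum((a - b) ** 2
                                           for a, b in zip(nodes[n], x)))
        rate = 0.5 * (1 - t / iters)             # movement decays over time
        sigma = 1.0 * (1 - t / iters) + 0.05     # neighbourhood shrinks
        for n, proto in nodes.items():
            g = abs(n[0] - win[0]) + abs(n[1] - win[1])   # grid distance
            pull = rate * math.exp(-(g * g) / (2 * sigma * sigma))
            for d in range(dim):
                proto[d] += pull * (x[d] - proto[d])      # move towards x
    return nodes

# Two small 2-D clouds; a 2x1 grid gives one prototype per cloud.
data = [(-1.0, -1.0), (-0.9, -1.1), (1.0, 1.0), (1.1, 0.9)]
nodes = som(data, grid_w=2, grid_h=1)
```

Because the grid distance enters the pull, neighbouring grid nodes end up representing similar clusters, which is exactly what the slide means by representing similarity between clusters.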
from Tamayo et al. 1999
Self-organizing maps
from Tamayo et al. 1999 (yeast cell cycle data)
MST-method : Graph Representation of data
Representation of a set of k n-dimensional points as a graph:
• Each data point is represented as a node V (a vertex).
• The edge between the i-th and j-th points is a connection weighted by the "distance" between the two points V(i) and V(j).
• di,j: the matrix of distances.
d1,1 d1,2 .... d1,k-1 d1,k
d2,1 d2,2 .... d2,k-1 d2,k
...
dk,1 dk,2 .... dk,k-1 dk,k

d1,1 d1,2 .... d1,7 d1,8
d2,1 d2,2 .... d2,7 d2,8
...
d8,1 d8,2 .... d8,7 d8,8
Figure: a graph with vertices V(1)-V(8); each edge between V(i) and V(j) carries the weight di,j, the distance between them.
Graph Representation
Intuitive Requirement for a Cluster
Intuitive requirement for a cluster (IR): for any partition of C into C1 and C2, the closest point to C1 among all non-C1 points lies in C2, and the closest point to C2 among all non-C2 points lies in C1.
If a subset C has the IR, the points of C form a subtree of the MST.
In other words, by deleting a few edges of the MST one obtains a tree consisting only of the points of C.
Not every set of points has the IR!
Cluster Versus MST
Figure: data points with indices 0-10 and the minimum spanning tree over them, rooted at point 0.
Sequential Presentation
Figure: the step index (1-10) at which each point is attached to the growing tree.
PRIM Algorithm for Cluster Identification
Figure: the sequential representation of MST edge lengths; a cluster satisfying the intuitive requirement shows up as a "valley" of short edges between longer ones.
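Prim's construction and its sequential edge-length representation can be sketched as follows; the 1-D toy points are an illustrative assumption, chosen so the long between-cluster edge is obvious:

```python
def prim_mst(dist):
    """Prim's algorithm on a full distance matrix: grow the tree from
    vertex 0, always attaching the outside vertex closest to the tree.
    Returns the edges in the order they are added."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        d, u, v = min((dist[i][j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((u, v, d))
        in_tree.add(v)
    return edges

# Two well-separated groups {0,1,2} and {3,4} on a line.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pts] for a in pts]
edges = prim_mst(dist)
```

The sequence of added edge lengths is 1, 1, 8, 1: the single long edge is the "valley" boundary, and cutting it splits the MST into the two clusters.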
Cluster analysis & graph theory
Graph Formulation
View the data set as a set of vertices V = {1, 2, ..., n}.
The similarity between objects i and j is viewed as the weight Aij of the edge connecting these vertices. A is called the affinity matrix.
We get a weighted undirected graph G = (V, A).
Clustering (segmentation) is equivalent to partitioning G into disjoint subsets. The latter could be achieved by simply removing connecting edges.
Nature of the Affinity Matrix
A_ij = exp(−(s_i − s_j)² / 2σ²) for i ≠ j, A_ii = 0.
The weight is a function of the distance between s_i and s_j: "closer" vertices get larger weight.
Spectral Clustering
Algorithms that cluster points using eigenvectors of matrices derived from the data
Obtain data representation in the low-dimensional space that can be easily clustered
Variety of methods that use the eigenvectors differently
Spectral Clustering Algorithm (Ng, Jordan, and Weiss)
Given a set of points S = {s1, ..., sn}:
1. Form the affinity matrix A_ij = exp(−||s_i − s_j||² / 2σ²) for i ≠ j, A_ii = 0.
2. Define the diagonal matrix D_ii = Σ_k A_ik.
3. Form the matrix L = D^(−1/2) A D^(−1/2).
4. Stack the k largest eigenvectors x1, ..., xk of L as the columns of the new matrix X.
5. Renormalize each of X's rows to unit length, giving Y.
6. Cluster the rows of Y as points in R^k.
Spectral Clustering Algorithm (Ng, Jordan, and Weiss)
Motivation: given a set of points S = {s1, s2, ..., sn} ⊂ R^l, we would like to cluster them into k subsets.
Form the affinity matrix A ∈ R^(n×n): A_ij = exp(−||s_i − s_j||² / 2σ²) for i ≠ j, A_ii = 0.
σ is a scaling parameter chosen by the user.
Define D, a diagonal matrix whose (i, i) element is the sum of A's row i.
Algorithm
Form the matrix L = D^(−1/2) A D^(−1/2). Find x1, x2, ..., xk, the k largest eigenvectors of L. These form the columns of the new matrix X.
We have reduced the dimension from n×n to n×k.
Algorithm
Form the matrix Y by renormalizing each of X's rows to unit length: Y_ij = X_ij / (Σ_j X_ij²)^(1/2), so Y ∈ R^(n×k).
Treat each row of Y as a point in R^k and cluster the rows into k clusters via K-means.
Final cluster assignment: assign point s_i to cluster j if row i of Y was assigned to cluster j.
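The whole NJW pipeline fits in a short sketch. The toy 2-D points are illustrative, and the deterministic farthest-first seeding of the final K-means step is my own simplification (the paper leaves the K-means initialization open):

```python
import numpy as np

def njw_spectral(S, k, sigma=1.0):
    """Sketch of Ng-Jordan-Weiss: affinity matrix, normalized
    L = D^-1/2 A D^-1/2, top-k eigenvectors, row renormalization,
    then K-means on the rows of Y."""
    S = np.asarray(S, dtype=float)
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # squared distances
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                              # A_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]     # D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(L)                           # ascending eigenvalues
    X = vecs[:, -k:]                                      # k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-length rows
    # Tiny K-means on the rows of Y (farthest-first seeding, for determinism).
    idx = [0]
    for _ in range(1, k):
        d = ((Y[:, None] - Y[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    centers = Y[idx]
    for _ in range(50):
        labels = ((Y[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([Y[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# Two separated clouds; the spectral embedding splits them cleanly.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
       (5.0, 5.0), (5.1, 5.2), (5.2, 5.0)]
labels = njw_spectral(pts, k=2, sigma=1.0)
```

Point s_i is assigned to the cluster that row i of Y was assigned to, as in the final step above.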
Why?
If we eventually use K-means, why not just apply K-means to the original data?
This method allows us to cluster non-convex regions
Basic intuition
Divide points in space. For basic intuition, consider the case of only two clusters.
The distance between points is defined by the affinity matrix C_i,j.
We want to pick the partition (the x_i's) that minimizes the cost within a cluster.
The partition can be 0/1 or it can be fuzzy. Only members of the cluster contribute to the distance (x_i = x_j = 1).
Basic Intuition
Minimize the squared length. Start with the connection matrix B = (b_i,j) and a "placement" vector X holding the placements x_i.
cost = Σ_{i,j} x_i b_i,j x_j
Constraint: X'X = 1 maintains normalization.
Basic Intuition
Minimize cost=X’BX w/ constraint minimize L=X’BX-(X’X-1) L/ X=2BX-2X=0 (B-I)X=0 X => Eigenvector of B cost is Eigenvalue
X (xi’s) continuous
form cut partition from ordering
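The Lagrange-multiplier argument can be checked numerically: over unit vectors, x'Bx is minimized by an eigenvector, and the minimum equals the smallest eigenvalue. B below is an arbitrary symmetric example matrix:

```python
import numpy as np

# Small symmetric connection matrix (illustrative values only).
B = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])

vals, vecs = np.linalg.eigh(B)   # eigenvalues in ascending order
x = vecs[:, 0]                   # unit eigenvector of the smallest eigenvalue
cost = x @ B @ x                 # x'Bx at the constrained minimum
```

Sampling random unit vectors never produces a cost below the smallest eigenvalue, which is the Rayleigh-quotient fact the slide's derivation rests on.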
Spectral Partitioning
Use the eigenvector to order the nodes. The real problem wants to place nodes at discrete locations; this is one case where one can solve the relaxed problem (continuous x's) and then move back to the closest discrete points.
Simple Example
Consider two 2-dimensional slightly overlapping Gaussian clouds each containing 100 points.
Simple Example cont-d I
Simple Example cont-d II
Example 2 (not so simple)
Example 2 cont-d I
Example 2 cont-d II
Assignment Clustering
Given M vectors over {0, 1, N} of length L each, find a resolution which results in the least number of distinct vectors.
A vector is called resolved if there are no Ns in it.
This problem is also called Binary Clustering with Missing Values, where p denotes the maximum number of Ns per vector.
Motivation
Oligonucleotide fingerprinting: an array-based method for characterizing cDNAs, tissues, etc.
Forensics: DNA profiling, etc.
Fingerprinting
The fingerprint vector is formed from the intensity values of each probe.
Quantize the values: V > h → 1, V < l → 0, otherwise → N.
Trivial solutions
Just make each of the vectors into its own cluster...
Better: find a clustering solution with the minimum number of clusters, such that each cluster can be resolved to the same vector. So minimize the number of unique resolved vectors.
Actual Problem
Identify clusters of mutually compatible vectors.
Compatibility: two vectors are compatible if they differ only at Ns.
Example: 110N0NN110 and N10100N1N0 are compatible.
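Compatibility and the minimal common resolution can be sketched in a few lines, using the slide's example vectors:

```python
def compatible(u, v):
    """Two {0,1,N} vectors are compatible if they differ only at Ns."""
    return all(a == b or a == 'N' or b == 'N' for a, b in zip(u, v))

def resolve(u, v):
    """Minimal common resolution of two compatible vectors: keep the
    known symbol at each position; N survives only where both have N."""
    return ''.join(b if a == 'N' else a for a, b in zip(u, v))

f1 = '110N0NN110'
f2 = 'N10100N1N0'
```

Here `resolve(f1, f2)` gives 110100N110: every N is filled in by the other vector except at the one position where both are N.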
Greedy Clique Partition
1. Find a unique maximal clique.
2. Remove it from the graph.
3. Repeat 1-2 until no more unique maximal cliques exist.
4. Find a maximum clique.
5. Remove it from the graph.
6. Repeat steps 1-5 until the graph is empty.
Implementation
Definitions:
R(f) – the set of resolved vectors of fingerprint f.
Hash table H with entries:
F(r) = {f | r ∈ R(f)}
v(r) – a vector of length L
N(r) – the positions of Ns in v(r)
Implementation
Hash the entries keyed by the resolved vectors, using double hashing: h(r) = h1(r) + k·h2(r).
Chaining is used to avoid over-writes due to collisions.
Fill the table
For each fingerprint vector, insert its resolved vectors into the table.
v(r) – the minimal resolution of the vectors hashed to r. Example:
f1 = 01N01N, with resolved vector r: v(r) = 01N01N
f2 = N1N011, with the same r: v(r) becomes 01N011
Finding cliques
Maximum clique: the entry r with the largest |F(r)| is the maximum clique.
Check for a unique vertex (a vertex belonging to only one maximal clique): if, for a fingerprint f, all the v(r)'s are mutually compatible, then f is a unique vertex.
Among all the cliques associated with f, choose the largest.
Data generation
Generate a cluster structure:
- Generate d random, mutually non-compatible vectors S = (s1, ..., sd).
- Make copies and randomly change 2 bits of each copy to N.
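The generation procedure can be sketched as follows; the values of d, L, and the number of copies per seed are arbitrary illustrative choices (note that for fully resolved 0/1 seeds, "mutually non-compatible" simply means distinct):

```python
import random

def compatible(u, v):
    """Two {0,1,N} vectors are compatible if they differ only at Ns."""
    return all(a == b or a == 'N' or b == 'N' for a, b in zip(u, v))

def generate(d, L, copies, seed=7):
    """Generate d random, mutually non-compatible 0/1 seed vectors,
    then derive noisy fingerprints from each by setting 2 randomly
    chosen bits to N."""
    rng = random.Random(seed)
    seeds = []
    while len(seeds) < d:
        v = ''.join(rng.choice('01') for _ in range(L))
        if all(not compatible(v, s) for s in seeds):
            seeds.append(v)
    fingerprints = []
    for s in seeds:
        for _ in range(copies):
            i, j = rng.sample(range(L), 2)   # two distinct positions
            f = list(s)
            f[i] = f[j] = 'N'
            fingerprints.append(''.join(f))
    return seeds, fingerprints

seeds, fps = generate(d=3, L=10, copies=4)
```

Each generated fingerprint is compatible with its originating seed by construction, which is what makes the planted cluster structure recoverable.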