Clustering Applications
Reminder
Applications
Spectral Clustering
Assignment Clustering
Clustering Methods
Clustering
Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.
Non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.
Non-hierarchical Methods
non-hierarchical methods
Partitioning methods: classes are mutually exclusive.
Clumping methods: overlap is allowed.
hierarchical methods
Agglomerative methods: the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the unclustered dataset.
Hierarchical methods
Divisive methods begin with all objects in a single cluster and, at each of N-1 steps, divide some cluster into two smaller clusters, until each object resides in its own cluster.
Partitioning Methods
Partitioning methods are divided according to the number of passes over the data:
Single pass: basic partitioning methods.
Multiple passes: K-means (very widely used).
K-means: Sample Application
Gene clustering. Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.
Normalization allows comparisons across microarrays.
Produce clusters of genes which vary in similar ways over time.
Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.
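To make the K-means step concrete, here is a minimal sketch on made-up normalized expression profiles (the data, the number of iterations, and the seed are illustrative choices, not the actual microarray pipeline):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means: assign each point to the nearest center, then
    recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Empty clusters keep their old center.
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Toy normalized profiles over three time points: genes whose
# expression rises over time versus genes whose expression falls.
up   = [(-1.0, 0.1, 0.9), (-0.9, 0.0, 1.0), (-1.1, 0.2, 0.8)]
down = [(1.0, -0.1, -0.9), (0.9, 0.0, -1.0), (1.1, -0.2, -0.8)]
centers, clusters = kmeans(up + down, k=2)
```

On this toy input the two recovered clusters coincide with the rising and falling profile groups.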
Sample array: rows are genes and columns are time points.
A cluster of co-regulated genes.
Clustering gene expression data
Figure: genes-by-samples expression matrix; each row is the expression profile of one gene.
Clustering gene expression data
Figure: genes-by-samples expression matrix; each row is the expression profile of one gene.
Cluster genes with similar expression profiles
Clustering genes on expression profiles
• The expression profile of each gene is a point in 'sample space'.
Figure: gene g plotted at coordinates (eg1, eg2, eg3) on axes Sample 1, Sample 2, Sample 3.
• All genes together form a scatter in this space.
Normalized Expression Data from microarrays
Figure: matrix with rows Gene 1, ..., Gene N and columns T1, T2, T3.
Representation of expression data
Figure: Gene 1 and Gene 2 plotted in the space of Time-point 1, Time-point 2, Time-point 3; dij is the distance between them.
Identifying prevalent expression patterns
Figure: three panels of normalized expression versus time-point (1-3), each showing one prevalent expression pattern.
Evaluate Cluster contents
Figure: genes of one cluster (gpm1, HTB1, RPL11A, RPL12B, RPL13A, RPL14A, RPL15A, RPL17A, RPL23A, TEF2, YDL228c, YDR133C, YDR134C, YDR327W, YDR417C, YKL153W, YPL142C) listed against their MIPS functional categories: Glycolysis, Nuclear Organization, Ribosome, Translation, Unknown.
Hierarchical Agglomerative methods
The hierarchical agglomerative clustering methods are the most commonly used. The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm.
1. Find the 2 closest objects and merge them into a cluster
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
Clustering genes on expression profiles
• Define a distance/similarity measure between points.
Euclidean: d(g, g') = sqrt( Σ_s (e_g,s − e_g',s)² )
Manhattan: d(g, g') = Σ_s |e_g,s − e_g',s|
• Define a distance between clusters of points:
1) Distance between the closest pair of points from the two clusters (single linkage).
2) Distance between the furthest pair of points (complete linkage).
3) Average distance between points from both clusters (average linkage).
4) Distance between the clusters' centroids.
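The four cluster-distance definitions can be sketched as small Python functions; the toy 1-D clusters and the Manhattan point metric below are arbitrary choices made so the arithmetic is easy to follow:

```python
def pair_dists(A, B, dist):
    """All point-to-point distances between clusters A and B."""
    return [dist(a, b) for a in A for b in B]

def single_linkage(A, B, dist):    # distance between the closest pair
    return min(pair_dists(A, B, dist))

def complete_linkage(A, B, dist):  # distance between the furthest pair
    return max(pair_dists(A, B, dist))

def average_linkage(A, B, dist):   # average distance over all pairs
    d = pair_dists(A, B, dist)
    return sum(d) / len(d)

def centroid_distance(A, B, dist): # distance between cluster centroids
    cA = tuple(sum(x) / len(A) for x in zip(*A))
    cB = tuple(sum(x) / len(B) for x in zip(*B))
    return dist(cA, cB)

manhattan = lambda p, q: sum(abs(x - y) for x, y in zip(p, q))

A = [(0.0,), (1.0,)]
B = [(4.0,), (6.0,)]
```

For these clusters the pair distances are 4, 6, 3, 5, so the four linkages give 3, 6, 4.5, and 4.5 respectively.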
Clustering genes on expression profiles
Hierarchical clustering:
• Start with each point in its own cluster.
• At each iteration, merge the two clusters with the smallest distance.
Clustering genes on expression profiles
Hierarchical clustering:
• Start with each point in its own cluster.
• At each iteration, merge the two clusters with the smallest distance.
Eventually all points will be linked into a single cluster.
Clustering genes on expression profiles
The sequence of mergers can be represented in a hierarchical tree.
Figure: the points a-g in sample space (axes Sample 1, 2, 3) and the corresponding dendrogram with leaves a b c d e f g.
Clustering genes on expression profiles
Green = expression level low with respect to the reference sample.
Red = expression level high with respect to the reference sample.
Black = expression level comparable to the reference sample.
The columns are ordered such that similar expression profiles neighbor each other.
Eisen et al. PNAS 1998.
Clustering gene expression data
Figure: genes-by-samples expression matrix; each column is the expression profile of one sample.
Instead of genes one may cluster samples with similar expression profiles.
Clustering samples on expression profiles
Identifying different tumor types through sample clustering.
Alizadeh et al., Nature 403:503-11, 2000
Combinations of samples/genes
Figure: genes-by-samples matrix clustered along both axes.
Cluster genes with similar sample expression-profiles.
Cluster samples with similar gene expression-profiles.
Combination model
Figure: genes-by-samples matrix with colored blocks; each color corresponds to some "cause".
The cause affects a subset of genes in a subset of the samples. e.g. Ihmels et al. Nature Genetics 2002.
Combinations of samples/genes
Ihmels et al. Nature genetics 2002
Clustering genes: Clusters of homologous genes
• Input: a set of protein or DNA sequences.
• Use an alignment algorithm (e.g. BLAST) to score the similarity of each pair of sequences.
• Task: detect clusters in the graph.
Graph of similarities of proteins in Methanococcus jannaschii. The length of the links reflects similarity (short link = high similarity).
Enright and Ouzounis, Bioinformatics 2001.
Clustering genes: Clusters of homologous genes
Example solution: put 'random walkers' on the graph and let them follow links at random. Look at the density of walkers, strengthen 'high-flow' links, and weaken 'low-flow' links.
Stijn van Dongen, Graph Clustering by Flow Simulation (PhD thesis, University of Amsterdam).
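A minimal sketch of this flow-simulation idea, in the style of van Dongen's Markov clustering: alternate expansion (walkers spreading along links) with inflation (strengthening high-flow links). The toy graph, the parameter values, and the cluster read-out are illustrative assumptions, not the MCL reference implementation:

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=30):
    """Flow-simulation clustering: expansion spreads random walkers,
    inflation boosts high-flow links, renormalization keeps the
    matrix column-stochastic."""
    M = adj + np.eye(len(adj))        # self-loops stabilize the flow
    M = M / M.sum(axis=0)             # columns = walk probabilities
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)  # walkers take several steps
        M = M ** inflation                        # strengthen high-flow links
        M = M / M.sum(axis=0)                     # renormalize columns
    # Attractor rows that retain mass define the clusters.
    clusters = set()
    for row in M:
        members = frozenset(np.nonzero(row > 1e-6)[0])
        if members:
            clusters.add(members)
    return clusters

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
clusters = mcl(A)
```

On this graph the flow concentrates inside each triangle and the weak bridge is cut, recovering the two triangles as clusters.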
Clustering DNA sequences: Transcription factor binding sites
Transcription factors recognize 'fuzzy motifs'. Alignment of known fruR binding sites:
AAGCTGAATCGATTTTATGATTTGGT
AGGCTGAATCGTTTCAATTCAGCAAG
CTGCTGAATTGATTCAGGTCAGGCCA
GTGCTGAAACCATTCAAGAGTCAATT
GTGGTGAATCGATACTTTACCGGTTG
CGACTGAAACGCTTCAGCTAGGATAA
TGACTGAAACGTTTTTGCCCTATGAG
TTCTTGAAACGTTTCAGCGCGATCTT
ACGGTGAATCGTTCAAGCAAATATAT
GCACTGAATCGGTTAACTGTCCAGTC
ATCGTTAAGCGATTCAGCACCTTACC
**gcTGAAtCG*TTcAg**c******
Task: given thousands of such binding sites for hundreds of different TFs, infer which binding sites bind the same TF.
Clustering DNA sequences: Transcription factor binding sites
AAGCACTATATTGGTGCAACATTCACATCGTGGTGATGAACTGTTTTTTTATCCAGTATAATTTACTCATCTGGTACGACCAGATCACCTTGCGGAAAGCACCATGTTGGTGCAATGACCTTTGGATAAAGCTGAATCGATTTTATGATTTGGTTCAATTAGGCTGAATCGTTTCAATTCAGCAAGAGAGGACATTAACTCATCGGATCAGTTCAGTAACTATTCCTCTTTACTGTATATAAAACCAGTTTATACTTCCGAACTGATCGGACTTGTTCAGCGTACACGACTCACAACTGTATATAAATACAGTTACAGATGTGCTGAAACCATTCAAGAGTCAATTGGCGCGATCAAGCTGGTATGATGAGTTAATATTATGTTTTCCAATACTGTATATTCATTCAGGTCAATTTGTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGAGCCTTTTGCTGTATATACTCACAGCATAACTGCAGCGGCTGGTCCGCTGTTTCTGCATTCTTACACGGTGAATCGTTCAAGCAAATATATTTTTTTAGTAATGACTGTATAAAACCACAGCCAATCAAATCGTTAAGCGATTCAGCACCTTACCTCAGGC
TGGATGTACTGTACATCCATACAGTAACTCACATGCACTAAAATGGTGCAACCTGTTCAGGAGATATTTTACCTGTATAAATAACCAGTATATTCACAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCATACAGACTACTGTATATAAAAACAGTATAACTTTCGCCACTGGTCTGATTTCTAAGATGTACCTCAGTTTATACTGTACACAATAACAGTAATGGTTCTGCTGAATTGATTCAGGTCAGGCCAAATGGCACTTGATACTGTATGAGCATACAGTATAATTGTTCCAGCTGGTCCGACCTATACTCTCGCCACTTCGTTTTCCTGTATGAAAAACCATTACTGTTATTACACTCCTGTTAATCCATACAGCAACAGTACGACTGAAACGCTTCAGCTAGGATAAGCGAAATGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAGTTCTTGAAACGTTTCAGCGCGATCTTGTCTTTCTGTTACACTGGATAGATAACCAGCATTCGGAATCCTTCGCTGGATATCTATCCAGCATTTTTTGCACTGAATCGGTTAACTGTCCAGTCGACGGCCCACAATATTGGCTGTTTATACAGTATTTCAG
Each line contains a binding site for a transcription factor.
Clustering DNA sequences: Transcription factor binding sites
AAGCACTATATTGGTGCAACATTCACATCGTGGTGATGAACTGTTTTTTTATCCAGTATAATTTACTCATCTGGTACGACCAGATCACCTTGCGGAAAGCACCATGTTGGTGCAATGACCTTTGGATAAAGCTGAATCGATTTTATGATTTGGTTCAATTAGGCTGAATCGTTTCAATTCAGCAAGAGAGGACATTAACTCATCGGATCAGTTCAGTAACTATTCCTCTTTACTGTATATAAAACCAGTTTATACTTCCGAACTGATCGGACTTGTTCAGCGTACACGACTCACAACTGTATATAAATACAGTTACAGATGTGCTGAAACCATTCAAGAGTCAATTGGCGCGATCAAGCTGGTATGATGAGTTAATATTATGTTTTCCAATACTGTATATTCATTCAGGTCAATTTGTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGAGCCTTTTGCTGTATATACTCACAGCATAACTGCAGCGGCTGGTCCGCTGTTTCTGCATTCTTACACGGTGAATCGTTCAAGCAAATATATTTTTTTAGTAATGACTGTATAAAACCACAGCCAATCAAATCGTTAAGCGATTCAGCACCTTACCTCAGGC
TGGATGTACTGTACATCCATACAGTAACTCACATGCACTAAAATGGTGCAACCTGTTCAGGAGATATTTTACCTGTATAAATAACCAGTATATTCACAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCATACAGACTACTGTATATAAAAACAGTATAACTTTCGCCACTGGTCTGATTTCTAAGATGTACCTCAGTTTATACTGTACACAATAACAGTAATGGTTCTGCTGAATTGATTCAGGTCAGGCCAAATGGCACTTGATACTGTATGAGCATACAGTATAATTGTTCCAGCTGGTCCGACCTATACTCTCGCCACTTCGTTTTCCTGTATGAAAAACCATTACTGTTATTACACTCCTGTTAATCCATACAGCAACAGTACGACTGAAACGCTTCAGCTAGGATAAGCGAAATGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAGTTCTTGAAACGTTTCAGCGCGATCTTGTCTTTCTGTTACACTGGATAGATAACCAGCATTCGGAATCCTTCGCTGGATATCTATCCAGCATTTTTTGCACTGAATCGGTTAACTGTCCAGTCGACGGCCCACAATATTGGCTGTTTATACAGTATTTCAG
van Nimwegen et al. PNAS 2002
Clustering DNA sequences: Transcription factor binding sites
Figure: the binding sites grouped into clusters, each with its own consensus:
a*GCAC*A*atTGGTGCaac****t***g**
****a*CTGgTc*Gat**GT******t*****
**gcTGAAtCG*TTcAg**c************
****t*tACTGTATATa*A*ACAG********
Similarity/distance matrices
Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort experiments according to that factor.
Figure: similarity matrix with experiments sorted into Array batch 1 and Array batch 2.
Clustering DNA sequences: Transcription factor binding sites
Alignment of known fruR binding sites:
AAGCTGAATCGATTTTATGATTTGGT
AGGCTGAATCGTTTCAATTCAGCAAG
CTGCTGAATTGATTCAGGTCAGGCCA
GTGCTGAAACCATTCAAGAGTCAATT
GTGGTGAATCGATACTTTACCGGTTG
CGACTGAAACGCTTCAGCTAGGATAA
TGACTGAAACGTTTTTGCCCTATGAG
TTCTTGAAACGTTTCAGCGCGATCTT
ACGGTGAATCGTTCAAGCAAATATAT
GCACTGAATCGGTTAACTGTCCAGTC
ATCGTTAAGCGATTCAGCACCTTACC
**gcTGAAtCG*TTcAg**c******
Probability Evaluation
w_i^α: the probability of finding base α at position i.
For instance: w_3^A = 0.267, w_3^C = 0.2, w_3^G = 0.467, w_3^T = 0.067.
Probability that a sequence s is a binding site for the factor represented by w:
P(s | w) = Π_{i=1}^{l} w_i^{s_i}
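The weight-matrix probability can be sketched directly. In the matrix below only the middle row reuses the slide's example probabilities; the other rows are made-up values for illustration:

```python
# Position weight matrix w: w[i][base] = probability of base at position i.
# Row 1 echoes the slide's example values; rows 0 and 2 are invented.
w = [
    {'A': 0.1,   'C': 0.1,  'G': 0.7,   'T': 0.1},
    {'A': 0.267, 'C': 0.2,  'G': 0.467, 'T': 0.067},
    {'A': 0.8,   'C': 0.05, 'G': 0.05,  'T': 0.1},
]

def site_probability(s, w):
    """P(s | w): the product over positions i of w_i^{s_i}."""
    p = 1.0
    for i, base in enumerate(s):
        p *= w[i][base]
    return p
```

For example, `site_probability('GGA', w)` multiplies 0.7, 0.467, and 0.8, the per-position probabilities of the observed bases.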
Kohonen Self-organizing maps
K = r*s clusters are arranged as nodes of a two dimensional grid. Nodes represent cluster centers/prototype vectors.
This makes it possible to represent similarity between clusters.
Algorithm: initialize nodes at random positions.
Iterate:
- Randomly pick one data point (gene) x.
- Move nodes towards x: the closest node most, remote nodes (in terms of the grid) less. Decrease the amount of movement with the number of iterations.
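The update loop above can be sketched as follows. This is a bare-bones SOM, not Tamayo's GENECLUSTER; the Gaussian neighbourhood, the decay schedules, and the toy data are my own illustrative choices:

```python
import math
import random

def som(data, grid_w, grid_h, iters=3000, seed=1):
    """Minimal self-organizing map: nodes sit on a grid_w x grid_h grid;
    each update pulls the winning node (and, more weakly, its grid
    neighbours) towards a randomly chosen data point."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = {(r, c): [rng.uniform(-1, 1) for _ in range(dim)]
             for r in range(grid_h) for c in range(grid_w)}
    for t in range(iters):
        x = rng.choice(data)
        # Winning node: the prototype closest to x.
        win = min(nodes, key=lambda n: sum((a - b) ** 2
                                           for a, b in zip(nodes[n], x)))
        rate = 0.5 * (1 - t / iters)             # movement decays over time
        sigma = 1.0 * (1 - t / iters) + 0.05     # neighbourhood shrinks
        for n, proto in nodes.items():
            g = abs(n[0] - win[0]) + abs(n[1] - win[1])   # grid distance
            pull = rate * math.exp(-(g * g) / (2 * sigma * sigma))
            for d in range(dim):
                proto[d] += pull * (x[d] - proto[d])      # move towards x
    return nodes

# Two small 2-D clouds; a 2x1 grid gives one prototype per cloud.
data = [(-1.0, -1.0), (-0.9, -1.1), (1.0, 1.0), (1.1, 0.9)]
nodes = som(data, grid_w=2, grid_h=1)
```

Because the grid distance enters the pull, neighbouring grid nodes end up representing similar clusters, which is exactly what the slide means by representing similarity between clusters.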
from Tamayo et al. 1999
Self-organizing maps
from Tamayo et al. 1999 (yeast cell cycle data)
MST-method : Graph Representation of data
Representation of a set of k n-dimensional points as a graph:
• Each data point is represented as a node V (a vertex).
• The edge between the i-th and j-th points is a connection weighted by the "distance" between the two points V(i) and V(j).
• di,j: the matrix of distances.
d1,1 d1,2 .... d1,k-1 d1,k
d2,1 d2,2 .... d2,k-1 d2,k
...
dk,1 dk,2 .... dk,k-1 dk,k

d1,1 d1,2 .... d1,7 d1,8
d2,1 d2,2 .... d2,7 d2,8
...
d8,1 d8,2 .... d8,7 d8,8
Figure: a graph with vertices V(1)-V(8); each edge between V(i) and V(j) carries the weight di,j, the distance between them.
Graph Representation
Intuitive Requirement for a Cluster
Intuitive requirement for a cluster (IR): for any partition of C into C1 and C2, the closest point to C1 among all non-C1 points lies in C2, and the closest point to C2 among all non-C2 points lies in C1.
If a subset C has the IR, the points of C form a subtree of the MST.
In other words, by deleting a few edges of the MST one obtains a tree consisting only of the points of C.
Not every set of points has the IR!
Cluster Versus MST
Figure: data points with indices 0-10 and the minimum spanning tree over them, rooted at point 0.
Sequential Presentation
Figure: the step index (1-10) at which each point is attached to the growing tree.
PRIM Algorithm for Cluster Identification
Figure: the sequential representation of MST edge lengths; a cluster satisfying the intuitive requirement shows up as a "valley" of short edges between longer ones.
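Prim's construction and its sequential edge-length representation can be sketched as follows; the 1-D toy points are an illustrative assumption, chosen so the long between-cluster edge is obvious:

```python
def prim_mst(dist):
    """Prim's algorithm on a full distance matrix: grow the tree from
    vertex 0, always attaching the outside vertex closest to the tree.
    Returns the edges in the order they are added."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        d, u, v = min((dist[i][j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((u, v, d))
        in_tree.add(v)
    return edges

# Two well-separated groups {0,1,2} and {3,4} on a line.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pts] for a in pts]
edges = prim_mst(dist)
```

The sequence of added edge lengths is 1, 1, 8, 1: the single long edge is the "valley" boundary, and cutting it splits the MST into the two clusters.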
Cluster analysis & graph theory
Graph Formulation
View the data set as a set of vertices V = {1, 2, ..., n}.
The similarity between objects i and j is viewed as the weight Aij of the edge connecting these vertices. A is called the affinity matrix.
We get a weighted undirected graph G = (V, A).
Clustering (segmentation) is equivalent to partitioning G into disjoint subsets. The latter could be achieved by simply removing connecting edges.
Nature of the Affinity Matrix
A_ij = exp(−(s_i − s_j)² / 2σ²) for i ≠ j, A_ii = 0.
The weight is a function of the distance between s_i and s_j: "closer" vertices get larger weight.
Spectral Clustering
Algorithms that cluster points using eigenvectors of matrices derived from the data
Obtain data representation in the low-dimensional space that can be easily clustered
Variety of methods that use the eigenvectors differently
Spectral Clustering Algorithm (Ng, Jordan, and Weiss)
Given a set of points S = {s1, ..., sn}:
1. Form the affinity matrix A_ij = exp(−||s_i − s_j||² / 2σ²) for i ≠ j, A_ii = 0.
2. Define the diagonal matrix D_ii = Σ_k A_ik.
3. Form the matrix L = D^(−1/2) A D^(−1/2).
4. Stack the k largest eigenvectors x1, ..., xk of L as the columns of the new matrix X.
5. Renormalize each of X's rows to unit length, giving Y.
6. Cluster the rows of Y as points in R^k.
Spectral Clustering Algorithm (Ng, Jordan, and Weiss)
Motivation: given a set of points S = {s1, s2, ..., sn} ⊂ R^l, we would like to cluster them into k subsets.
Form the affinity matrix A ∈ R^(n×n): A_ij = exp(−||s_i − s_j||² / 2σ²) for i ≠ j, A_ii = 0.
σ is a scaling parameter chosen by the user.
Define D, a diagonal matrix whose (i, i) element is the sum of A's row i.
Algorithm
Form the matrix L = D^(−1/2) A D^(−1/2). Find x1, x2, ..., xk, the k largest eigenvectors of L. These form the columns of the new matrix X.
We have reduced the dimension from n×n to n×k.
Algorithm
Form the matrix Y by renormalizing each of X's rows to unit length: Y_ij = X_ij / (Σ_j X_ij²)^(1/2), so Y ∈ R^(n×k).
Treat each row of Y as a point in R^k and cluster the rows into k clusters via K-means.
Final cluster assignment: assign point s_i to cluster j if row i of Y was assigned to cluster j.
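The whole NJW pipeline fits in a short sketch. The toy 2-D points are illustrative, and the deterministic farthest-first seeding of the final K-means step is my own simplification (the paper leaves the K-means initialization open):

```python
import numpy as np

def njw_spectral(S, k, sigma=1.0):
    """Sketch of Ng-Jordan-Weiss: affinity matrix, normalized
    L = D^-1/2 A D^-1/2, top-k eigenvectors, row renormalization,
    then K-means on the rows of Y."""
    S = np.asarray(S, dtype=float)
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # squared distances
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                              # A_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]     # D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(L)                           # ascending eigenvalues
    X = vecs[:, -k:]                                      # k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-length rows
    # Tiny K-means on the rows of Y (farthest-first seeding, for determinism).
    idx = [0]
    for _ in range(1, k):
        d = ((Y[:, None] - Y[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    centers = Y[idx]
    for _ in range(50):
        labels = ((Y[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([Y[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# Two separated clouds; the spectral embedding splits them cleanly.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
       (5.0, 5.0), (5.1, 5.2), (5.2, 5.0)]
labels = njw_spectral(pts, k=2, sigma=1.0)
```

Point s_i is assigned to the cluster that row i of Y was assigned to, as in the final step above.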
Why?
If we eventually use K-means, why not just apply K-means to the original data?
This method allows us to cluster non-convex regions
Basic intuition
Divide points in space. For basic intuition, consider the case of only two clusters.
The distance between points is defined by the affinity matrix C_i,j.
We want to pick the partition (the x_i's) that minimizes the cost within a cluster.
The partition can be 0/1 or it can be fuzzy. Only members of the cluster contribute to the distance (x_i = x_j = 1).
Basic Intuition
Minimize the squared length. Start with the connection matrix B = (b_i,j) and a "placement" vector X holding the placements x_i.
cost = Σ_{i,j} x_i b_i,j x_j
Constraint: X'X = 1 maintains normalization.
Basic Intuition
Minimize cost=X’BX w/ constraint minimize L=X’BX-(X’X-1) L/ X=2BX-2X=0 (B-I)X=0 X => Eigenvector of B cost is Eigenvalue
X (xi’s) continuous
form cut partition from ordering
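The Lagrange-multiplier argument can be checked numerically: over unit vectors, x'Bx is minimized by an eigenvector, and the minimum equals the smallest eigenvalue. B below is an arbitrary symmetric example matrix:

```python
import numpy as np

# Small symmetric connection matrix (illustrative values only).
B = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])

vals, vecs = np.linalg.eigh(B)   # eigenvalues in ascending order
x = vecs[:, 0]                   # unit eigenvector of the smallest eigenvalue
cost = x @ B @ x                 # x'Bx at the constrained minimum
```

Sampling random unit vectors never produces a cost below the smallest eigenvalue, which is the Rayleigh-quotient fact the slide's derivation rests on.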
Spectral Partitioning
Use the eigenvector to order the nodes. The real problem wants to place nodes at discrete locations; this is one case where one can solve the relaxed problem (continuous x's) and then move back to the closest discrete points.
Simple Example
Consider two 2-dimensional slightly overlapping Gaussian clouds each containing 100 points.
Simple Example cont-d I
Simple Example cont-d II
Example 2 (not so simple)
Example 2 cont-d I
Example 2 cont-d II
Assignment Clustering
Given M vectors over {0, 1, N} of length L each, find a resolution which results in the least number of distinct vectors.
A vector is called resolved if there are no Ns in it.
This problem is also called Binary Clustering with Missing Values, where p denotes the maximum number of Ns per vector.
Motivation
Oligonucleotide fingerprinting: an array-based method for characterizing cDNAs, tissues, etc.
Forensics: DNA profiling, etc.
Fingerprinting
The fingerprint vector is formed from the intensity values of each probe.
Quantize the values: V > h → 1, V < l → 0, otherwise → N.
Trivial solutions
Just make each of the vectors into its own cluster...
Better: find a clustering solution with the minimum number of clusters, such that each cluster can be resolved to the same vector. So minimize the number of unique resolved vectors.
Actual Problem
Identify clusters of mutually compatible vectors.
Compatibility: two vectors are compatible if they differ only at Ns.
Example: 110N0NN110 and N10100N1N0 are compatible.
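Compatibility and the minimal common resolution can be sketched in a few lines, using the slide's example vectors:

```python
def compatible(u, v):
    """Two {0,1,N} vectors are compatible if they differ only at Ns."""
    return all(a == b or a == 'N' or b == 'N' for a, b in zip(u, v))

def resolve(u, v):
    """Minimal common resolution of two compatible vectors: keep the
    known symbol at each position; N survives only where both have N."""
    return ''.join(b if a == 'N' else a for a, b in zip(u, v))

f1 = '110N0NN110'
f2 = 'N10100N1N0'
```

Here `resolve(f1, f2)` gives 110100N110: every N is filled in by the other vector except at the one position where both are N.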
Greedy Clique Partition
1. Find a unique maximal clique.
2. Remove it from the graph.
3. Repeat 1-2 until no more unique maximal cliques exist.
4. Find a maximum clique.
5. Remove it from the graph.
6. Repeat steps 1-5 until the graph is empty.
Implementation
Definitions:
R(f) – the set of resolved vectors of fingerprint f.
Hash table H with entries:
F(r) = {f | r ∈ R(f)}
v(r) – a vector of length L
N(r) – the positions of Ns in v(r)
Implementation
Hash the entries keyed by the resolved vectors, using double hashing: h(r) = h1(r) + k·h2(r).
Chaining is used to avoid over-writes due to collisions.
Fill the table
For each fingerprint vector, insert its resolved vectors into the table.
v(r) – the minimal resolution of the vectors hashed to r. Example:
f1 = 01N01N, with resolved vector r: v(r) = 01N01N
f2 = N1N011, with the same r: v(r) becomes 01N011
Finding cliques
Maximum clique: the entry r with the largest |F(r)| is the maximum clique.
Check for a unique vertex (a vertex belonging to only one maximal clique): if, for a fingerprint f, all the v(r)'s are mutually compatible, then f is a unique vertex.
Among all the cliques associated with f, choose the largest.
Data generation
Generate a cluster structure:
- Generate d random, mutually non-compatible vectors S = (s1, ..., sd).
- Make copies and randomly change 2 bits of each copy to N.
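The generation procedure can be sketched as follows; the values of d, L, and the number of copies per seed are arbitrary illustrative choices (note that for fully resolved 0/1 seeds, "mutually non-compatible" simply means distinct):

```python
import random

def compatible(u, v):
    """Two {0,1,N} vectors are compatible if they differ only at Ns."""
    return all(a == b or a == 'N' or b == 'N' for a, b in zip(u, v))

def generate(d, L, copies, seed=7):
    """Generate d random, mutually non-compatible 0/1 seed vectors,
    then derive noisy fingerprints from each by setting 2 randomly
    chosen bits to N."""
    rng = random.Random(seed)
    seeds = []
    while len(seeds) < d:
        v = ''.join(rng.choice('01') for _ in range(L))
        if all(not compatible(v, s) for s in seeds):
            seeds.append(v)
    fingerprints = []
    for s in seeds:
        for _ in range(copies):
            i, j = rng.sample(range(L), 2)   # two distinct positions
            f = list(s)
            f[i] = f[j] = 'N'
            fingerprints.append(''.join(f))
    return seeds, fingerprints

seeds, fps = generate(d=3, L=10, copies=4)
```

Each generated fingerprint is compatible with its originating seed by construction, which is what makes the planted cluster structure recoverable.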