TRANSCRIPT

CZ5225: Modeling and Simulation in Biology
Lecture 3: Clustering Analysis for Microarray Data I
Prof. Chen Yu Zong
Tel: 6874-6877  Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, Level 7, SOC1, NUS
Clustering Algorithms
• Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it.
• Anything will cluster! Garbage in means garbage out.
Supervised vs. Unsupervised Learning
• Supervised: there is a teacher; class labels are known
  – Support vector machines
  – Backpropagation neural networks
• Unsupervised: no teacher; class labels are unknown
  – Clustering
  – Self-organizing maps
Gene Expression Data

Gene expression data on p genes for n samples (rows = genes, columns = mRNA samples).

Gene expression level of gene i in mRNA sample j
  = log(Red intensity / Green intensity)
or
  = log(Avg. PM - Avg. MM)

  Gene   sample1  sample2  sample3  sample4  sample5  ...
  1       0.46     0.30     0.80     1.51     0.90    ...
  2      -0.10     0.49     0.24     0.06     0.46    ...
  3       0.15     0.74     0.04     0.10     0.20    ...
  4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
  5      -0.06     1.06     1.35     1.09    -1.09    ...
Expression Vectors

Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

[Figure: the numeric vector (-0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5) displayed three ways: as a numeric vector, as a line graph (y-axis from -2 to 2 over 8 conditions), and as a heatmap.]
Expression Vectors As Points in 'Expression Space'

[Figure: genes G1-G5, each with expression values at t1-t3, plotted as points in a 3-D "expression space" whose axes are Experiments 1-3; genes with similar expression lie close together.]
Cluster Analysis
• Group a collection of objects into subsets or "clusters" such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.
How can we do this?
• What is "closely related"? A distance or similarity metric.
• What is "close"? A clustering algorithm.
• How do we minimize distance between objects in a group while maximizing distances between groups?
Distance Metrics
• Euclidean Distance measures average distance
• Manhattan (City Block) measures average in each dimension
• Correlation measures difference with respect to linear trends

[Figure: two points, (3.5, 4) and (5.5, 6), plotted against axes Gene Expression 1 and Gene Expression 2.]
Clustering Gene Expression Data
• Cluster across the rows: group genes together that behave similarly across different conditions.
• Cluster across the columns: group different conditions together that behave similarly across most genes.

[Figure: expression-measurement matrix with genes as rows (gene i) and expression measurements as columns (sample j).]
Clustering Time Series Data
• Measure gene expression on consecutive days
• Gene measurement matrix:
  – G1 = [1.2 4.0 5.0 1.0]
  – G2 = [2.0 2.5 5.5 6.0]
  – G3 = [4.5 3.0 2.5 1.0]
  – G4 = [3.5 1.5 1.2 1.5]
Euclidean Distance
• Distance is the square root of the sum of squared differences between coordinates:

  d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 )

• Example (G1 vs. G2): d_12 = sqrt( (1.2-2)^2 + (4-2.5)^2 + (5-5.5)^2 + (1-6)^2 ) ≈ 5.3

Resulting distance matrix:

  Dist   G1    G2    G3    G4
  G1     0     5.3   4.3   5.1
  G2     5.3   0     6.4   6.5
  G3     4.3   6.4   0     2.3
  G4     5.1   6.5   2.3   0
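The slide's numbers can be checked with a short script; this is a minimal pure-Python sketch (variable and function names are mine) that recomputes the 4x4 Euclidean distance matrix for G1-G4:

```python
import math

# Gene expression time series from the slides (one vector per gene)
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Recompute the 4x4 distance matrix from the slide (rounded to 1 decimal)
names = list(genes)
matrix = [[round(euclidean(genes[a], genes[b]), 1) for b in names] for a in names]
```

The rounded matrix reproduces the slide's values row for row.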
City Block or Manhattan Distance
• G1 = [1.2 4.0 5.0 1.0]
• G2 = [2.0 2.5 5.5 6.0]
• G3 = [4.5 3.0 2.5 1.0]
• G4 = [3.5 1.5 1.2 1.5]
• Distance is the sum of the absolute differences between coordinates:

  d_ij = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

• Example (G1 vs. G2): d_12 = |1.2 - 2| + |4 - 2.5| + |5 - 5.5| + |1 - 6| = 7.8

Resulting distance matrix:

  Dist   G1    G2    G3    G4
  G1     0     7.8   6.8   9.1
  G2     7.8   0     11    11.3
  G3     6.8   11    0     4.3
  G4     9.1   11.3  4.3   0
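The same check works for the city-block matrix; a minimal pure-Python sketch over the four gene vectors from the slide:

```python
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

names = list(genes)
matrix = [[round(manhattan(genes[a], genes[b]), 1) for b in names] for a in names]
```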
Correlation Distance
• Pearson correlation measures the degree of linear relationship between two variables, range [-1, 1]
• Distance is 1 - (Pearson correlation), range [0, 2]:

  d_ij = 1 - [ (1/N) Σ_n (x_in - x̄_i)(x_jn - x̄_j) ] / sqrt( [ (1/N) Σ_n (x_in - x̄_i)^2 ] [ (1/N) Σ_n (x_jn - x̄_j)^2 ] )

Resulting distance matrix:

  Dist   G1    G2    G3    G4
  G1     0     0.91  0.98  1.6
  G2     0.91  0     1.9   1.7
  G3     0.98  1.9   0     0.22
  G4     1.6   1.7   0.22  0
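The correlation-distance matrix can be recomputed the same way; a minimal pure-Python sketch whose values match the slide to about two decimal places:

```python
import math

genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def pearson(x, y):
    """Pearson correlation coefficient, in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def correlation_distance(x, y):
    """1 - Pearson correlation, in [0, 2]."""
    return 1.0 - pearson(x, y)
```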
Similarity Measurements
• Pearson Correlation

Two profiles (vectors) x = (x_1, ..., x_N) and y = (y_1, ..., y_N):

  C_pearson(x, y) = Σ_{i=1..N} (x_i - m_x)(y_i - m_y) / sqrt( [ Σ_{i=1..N} (x_i - m_x)^2 ] [ Σ_{i=1..N} (y_i - m_y)^2 ] )

  where m_x = (1/N) Σ_{n=1..N} x_n and m_y = (1/N) Σ_{n=1..N} y_n

  -1 ≤ Pearson Correlation ≤ +1
Similarity Measurements
• Cosine Correlation

  C_cosine(x, y) = Σ_{i=1..N} x_i y_i / ( |x| |y| )

  where |x| and |y| are the lengths (norms) of the two vectors

  -1 ≤ Cosine Correlation ≤ +1
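The two similarity measures are closely related: Pearson correlation is exactly the cosine correlation of mean-centered profiles. A minimal sketch (function names are mine; G1 and G2 are the vectors from the earlier slides):

```python
import math

def cosine(x, y):
    """Cosine correlation: dot product divided by the product of vector norms."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

def mean_center(x):
    """Subtract the profile's own mean from each coordinate."""
    m = sum(x) / len(x)
    return [a - m for a in x]

# G1 and G2 from the earlier slides
x, y = [1.2, 4.0, 5.0, 1.0], [2.0, 2.5, 5.5, 6.0]
```

Cosine correlation of the mean-centered profiles equals their Pearson correlation (about 0.09 for G1 vs. G2).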
Hierarchical Clustering (HCL-1)
• IDEA: iteratively combine genes into groups based on similar patterns of observed expression
• By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships
• Display the data as a heatmap and dendrogram
• Cluster genes, samples or both
Hierarchical Clustering

[Figure: a dendrogram and a Venn diagram of the same clustered data.]
Hierarchical clustering
• Merging (agglomerative): start with every measurement as a separate cluster then combine
• Splitting: make one large cluster, then split up into smaller pieces
• What is the distance between two clusters?
Distance between clusters
• Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster
• Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster
• Average: distance is the average of the distances between all pairs of members (one from each cluster); average group linkage instead uses the distance between the cluster means
• Ward: merge the pair of clusters that gives the smallest increase in the total within-cluster sum of squares
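The cluster-to-cluster distances can be made concrete. A minimal pure-Python sketch with made-up 2-D clusters (Ward is omitted; function names are mine):

```python
import math

def single_link(c1, c2):
    """Shortest distance from any member of one cluster to any member of the other."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Longest distance from any member of one cluster to any member of the other."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    """Average of the distances over all pairs (one point from each cluster)."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def average_group_link(c1, c2):
    """Distance between the two cluster means (centroids)."""
    m1 = [sum(v) / len(c1) for v in zip(*c1)]
    m2 = [sum(v) / len(c2) for v in zip(*c2)]
    return math.dist(m1, m2)

# Made-up example clusters
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
```

For these clusters the four definitions give 2.0, 5.0, 3.5 and 3.5 respectively, illustrating how single link is the most permissive and complete link the most conservative.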
Hierarchical Clustering: Merging
• Euclidean distance
• Average linking

[Figure: gene expression time series clustered with these settings; dendrogram heights show the distance between clusters when combined.]
Manhattan Distance
• Average linking

[Figure: the same gene expression time series clustered with Manhattan distance; dendrogram heights show the distance between clusters when combined.]
Correlation Distance

[Figure: the same gene expression time series clustered with correlation distance.]
Data Standardization
• Data points are normalized with respect to mean and variance, "sphering" the data:

  x̂ = (x - μ̂) / σ̂

• After sphering, Euclidean and correlation distance are equivalent
• Standardization makes sense if you are not interested in the size of the effects, but in the effect itself
• Results are misleading for noisy data
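The equivalence claim can be verified numerically. In this pure-Python sketch (G1 and G2 from the earlier slides), the squared Euclidean distance between sphered profiles equals 2N(1 - r), a fixed decreasing function of the Pearson correlation r, so both metrics rank gene pairs identically after sphering:

```python
import math

def standardize(x):
    """Sphere a profile: subtract its mean, divide by its (population) std dev."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# G1 and G2 from the earlier slides
x, y = [1.2, 4.0, 5.0, 1.0], [2.0, 2.5, 5.5, 6.0]
d2 = euclidean(standardize(x), standardize(y)) ** 2  # squared distance after sphering
```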
Distance Comments
• Every clustering method is based SOLELY on the measure of distance or similarity
• E.g. correlation measures linear association between two genes:
  – What if data are not properly transformed?
  – What about outliers?
  – What about saturation effects?
• Even good data can be ruined with the wrong choice of distance metric
Hierarchical Clustering

Initial data items: A, B, C, D

Distance matrix:

  Dist   A    B    C    D
  A      -   20    7    2
  B           -   10   25
  C                -    3
  D                     -
Hierarchical Clustering: Single Linkage

Current clusters: A, B, C, D
The smallest entry in the distance matrix is d(A, D) = 2, so A and D are merged first.

  Dist   A    B    C    D
  A      -   20    7    2
  B           -   10   25
  C                -    3
  D                     -
Hierarchical Clustering: Single Linkage

Current clusters: {A, D}, B, C

  Dist   AD   B    C
  AD     -   20    3
  B           -   10
  C                -
Hierarchical Clustering: Single Linkage

Current clusters: {A, D}, B, C
The smallest entry is now d(AD, C) = 3, so C joins {A, D}.

  Dist   AD   B    C
  AD     -   20    3
  B           -   10
  C                -
Hierarchical Clustering: Single Linkage

Current clusters: {A, C, D}, B

  Dist   ADC  B
  ADC    -   10
  B           -
Hierarchical Clustering: Single Linkage

Current clusters: {A, C, D}, B
The final merge joins {A, C, D} and B at distance 10.

  Dist   ADC  B
  ADC    -   10
  B           -
Hierarchical Clustering: Single Linkage

Final result: one cluster {A, B, C, D}. The dendrogram joins A and D at height 2, adds C at height 3, and adds B at height 10.
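The A-D walkthrough can be reproduced programmatically. A minimal pure-Python single-linkage sketch over the slides' distance matrix (names are mine):

```python
# Pairwise distances between items A-D, as on the slides
pair_dist = {frozenset(p): v for p, v in [
    (("A", "B"), 20), (("A", "C"), 7), (("A", "D"), 2),
    (("B", "C"), 10), (("B", "D"), 25), (("C", "D"), 3),
]}

def d(x, y):
    return pair_dist[frozenset((x, y))]

clusters = [{"A"}, {"B"}, {"C"}, {"D"}]
merges = []  # records (members of the new cluster, merge distance)
while len(clusters) > 1:
    best = None  # (single-link distance, i, j)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            link = min(d(x, y) for x in clusters[i] for y in clusters[j])
            if best is None or link < best[0]:
                best = (link, i, j)
    link, i, j = best
    merged = clusters[i] | clusters[j]
    merges.append((sorted(merged), link))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

The recorded merges match the walkthrough: {A, D} at 2, then {A, C, D} at 3, then {A, B, C, D} at 10.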
Hierarchical Clustering

[Figure: animated construction of a dendrogram over Genes 1-8, merging the closest genes or clusters step by step.]
Hierarchical Clustering

[Figure: clustered heatmap with color scale from H (high) to L (low) expression.]
Hierarchical Clustering

The Leaf Ordering Problem:
• Find an 'optimal' layout of branches for a given dendrogram architecture
• There are 2^(N-1) possible orderings of the branches
• For a small microarray dataset of 500 genes, there are about 1.6 x 10^150 branch configurations

[Figure: genes-by-samples heatmap whose appearance depends on the chosen leaf order.]
Hierarchical Clustering

The Leaf Ordering Problem:

[Figure: the same dendrogram architecture drawn with different leaf orderings.]
Hierarchical Clustering
• Pros:
  – Commonly used algorithm
  – Simple and quick to calculate
• Cons:
  – Real genes probably do not have a hierarchical organization
Using Hierarchical Clustering
1. Choose what samples and genes to use in your analysis
2. Choose similarity/distance metric
3. Choose clustering direction
4. Choose linkage method
5. Calculate the dendrogram
6. Choose height/number of clusters for interpretation
7. Assess results
8. Interpret cluster structure
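The steps above can be sketched end to end. This assumes SciPy is available and uses a hypothetical toy matrix (30 "genes" x 6 "samples" built from three made-up patterns), not real microarray data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Step 1: hypothetical toy data - three made-up expression patterns plus noise
rng = np.random.default_rng(0)
patterns = np.array([
    [ 1.0,  1.0,  1.0, -1.0, -1.0, -1.0],
    [-1.0, -1.0, -1.0,  1.0,  1.0,  1.0],
    [ 1.0, -1.0,  1.0, -1.0,  1.0, -1.0],
])
X = np.repeat(patterns, 10, axis=0) + rng.normal(scale=0.2, size=(30, 6))

# Steps 2-5: distance metric, direction (genes = rows), linkage, dendrogram
Z = linkage(X, method="average", metric="euclidean")

# Step 6: cut the tree into a chosen number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")

# Steps 7-8: assess and interpret - here, genes built from the same
# pattern should end up with the same cluster label
```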
Choose what samples/genes to include
• Very important step
• Do you want to include housekeeping genes, or genes that didn't change, in your results?
• How do you handle replicates from the same sample?
• Noisy samples?
• The dendrogram is a mess if everything is included in large datasets
• Gene screening
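Gene screening is often just a variance filter: flat (e.g. housekeeping) genes are dropped before clustering. A minimal sketch with made-up expression values, keeping the k most variable genes:

```python
# Hypothetical gene screen: keep only the k most variable genes before clustering
def variance(x):
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / len(x)

def top_k_by_variance(expression, k):
    """expression: dict of gene name -> expression vector.
    Returns the k gene names with the highest variance across samples."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:k]

expression = {
    "housekeeping": [5.0, 5.0, 5.1, 5.0],  # nearly flat -> screened out
    "geneA":        [1.0, 4.0, 1.0, 4.0],
    "geneB":        [0.0, 0.0, 6.0, 6.0],
}
kept = top_k_by_variance(expression, k=2)
```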
No Filtering
Filtering 100 relevant genes
2. Choose distance metric
• Metric should be a valid measure of the distance/similarity of genes
• Examples:
  – Applying Euclidean distance to categorical data is invalid
  – A correlation metric applied to highly skewed data will give misleading results
3. Choose clustering direction
• Merging / agglomerative (bottom-up)
• Divisive (top-down):
  – split so that genes within each cluster are the most similar, maximizing the distance between clusters
Nearest Neighbor Algorithm
• Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
• Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
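The algorithm can be written directly; a minimal pure-Python sketch using single-link merging and made-up 2-D points (the data and k are illustrative, not from the slides):

```python
import math

def nearest_neighbor_clustering(points, k):
    """Start with one cluster per point; repeatedly merge the two clusters whose
    closest members are nearest (single link); stop when k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dd = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dd < best[0]:
                    best = (dd, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up 2-D points forming three obvious groups
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
clusters = nearest_neighbor_clustering(points, k=3)
```

Stopping at k = 3 recovers the three obvious pairs.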
Nearest Neighbor, Level 3, k = 6 clusters.

Nearest Neighbor, Level 4, k = 5 clusters.

Nearest Neighbor, Level 5, k = 4 clusters.

Nearest Neighbor, Level 6, k = 3 clusters.

Nearest Neighbor, Level 7, k = 2 clusters.

Nearest Neighbor, Level 8, k = 1 cluster.
Hierarchical Clustering

Keys: similarity measure and clustering procedure
1. Calculate the similarity between all possible combinations of two profiles.
2. Group the two most similar clusters together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters; repeat from step 2.
Hierarchical Clustering

[Figure: three clusters C1, C2, C3. Merge which pair of clusters?]
Hierarchical Clustering: Single Linkage

Dissimilarity between two clusters C1 and C2 = minimum dissimilarity between the members of the two clusters.
Tends to generate "long chains".
Hierarchical Clustering: Complete Linkage

Dissimilarity between two clusters C1 and C2 = maximum dissimilarity between the members of the two clusters.
Tends to generate compact "clumps".
Hierarchical Clustering: Average Linkage

Dissimilarity between two clusters C1 and C2 = average of the distances over all pairs of objects (one from each cluster).
Hierarchical Clustering: Average Group Linkage

Dissimilarity between two clusters C1 and C2 = distance between the two cluster means.
Which one?
• Both methods are "step-wise" optimal: at each step the optimal split or merge is performed
• This does not mean that the final result is optimal
• Merging (agglomerative):
  – Computationally simple
  – Precise at the bottom of the tree
  – Good for many small clusters
• Divisive:
  – More complex, but more precise at the top of the tree
  – Good for looking at large and/or few clusters
• For gene expression applications, divisive makes more sense