CZ5225: Modeling and Simulation in Biology
Lecture 3: Clustering Analysis for Microarray Data I

Prof. Chen Yu Zong
Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

Upload: georgia-tyler

Post on 17-Jan-2016


TRANSCRIPT

Page 1: CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel: 6874-6877 Email: yzchen@cz3.nus.edu.sg

CZ5225: Modeling and Simulation in Biology

Lecture 3: Clustering Analysis for Microarray Data I

Prof. Chen Yu Zong

Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg

http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

Page 2: Clustering Algorithms

• Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it.

• Anything will cluster! Garbage in means garbage out.

Page 3: Supervised vs. Unsupervised Learning

• Supervised: there is a teacher, class labels are known
  • Support vector machines
  • Backpropagation neural networks

• Unsupervised: no teacher, class labels are unknown
  • Clustering
  • Self-organizing maps

Page 4: Gene Expression Data

Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples)

Entry (i, j) = gene expression level of gene i in mRNA sample j
  = Log(Red intensity / Green intensity), or
    Log(Avg. PM - Avg. MM)

Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...
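A minimal sketch of how one entry of such a matrix could be computed from two-color intensities. The slide writes only "Log", so base 2 is an assumption here (it makes a two-fold change equal to +1 or -1), and the intensity values are made up for illustration.

```python
import math

def log_ratio(red, green, base=2.0):
    """Log expression ratio for one spot on a two-color array.

    The base is an assumption: the slide does not state which log is used.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return math.log(red / green, base)

# Hypothetical raw (red, green) intensities, not taken from the lecture.
red_green_pairs = [(2000.0, 1000.0), (500.0, 1000.0), (800.0, 800.0)]
ratios = [log_ratio(r, g) for r, g in red_green_pairs]
print(ratios)  # two-fold up, two-fold down, unchanged
```

With this convention, induced and repressed genes are symmetric around zero, which is what the signed values in the table suggest.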

Page 5: Expression Vectors

Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

Numeric vector: -0.8 0.8 1.5 1.8 0.5 -1.3 -0.4 1.5

[Figure: the same vector shown as a line graph (conditions 1 to 8, y-axis from -2 to 2) and as a heatmap (color scale from -2 to 2)]

Page 6: Expression Vectors As Points in ‘Expression Space’

[Figure: genes G1 to G5, each measured at t 1, t 2, t 3, plotted as points in a space with axes Experiment 1, Experiment 2, Experiment 3; genes with similar expression lie close together]

Page 7: Cluster Analysis

• Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

Page 8: How can we do this?

• What is closely related?
  • Distance or similarity metric
  • What is close?

• Clustering algorithm
  • How do we minimize distance between objects in a group while maximizing distances between groups?

Page 9: Distance Metrics

• Euclidean distance measures the straight-line distance between two points

• Manhattan (city block) distance sums the absolute differences in each dimension

• Correlation measures difference with respect to linear trends

[Figure: two points, (5.5,6) and (3.5,4), plotted with Gene Expression 1 on the x-axis and Gene Expression 2 on the y-axis]
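As a small worked example, the two points on this slide can be compared under the first two metrics. This is only a sketch; the function names are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p = (5.5, 6.0)
q = (3.5, 4.0)
d_euclid = euclidean(p, q)      # sqrt(2^2 + 2^2) = sqrt(8)
d_manhattan = manhattan(p, q)   # 2 + 2 = 4
print(d_euclid, d_manhattan)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, so the two metrics can rank neighbors differently.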

Page 10: Clustering Gene Expression Data

• Cluster across the rows: group together genes that behave similarly across different conditions.

• Cluster across the columns: group together conditions that behave similarly across most genes.

[Figure: expression matrix with genes as rows (index i) and expression measurements as columns (index j)]

Page 11: Clustering Time Series Data

• Measure gene expression on consecutive days

• Gene measurement matrix
  • G1 = [1.2 4.0 5.0 1.0]
  • G2 = [2.0 2.5 5.5 6.0]
  • G3 = [4.5 3.0 2.5 1.0]
  • G4 = [3.5 1.5 1.2 1.5]

Page 12: Euclidean Distance

• Distance is the square root of the sum of the squared differences between coordinates

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$$

Example for G1 and G2:

$$d_{12} = \sqrt{(1.2 - 2)^2 + (4 - 2.5)^2 + (5 - 5.5)^2 + (1 - 6)^2} \approx 5.3$$

Distance matrix:

       G1    G2    G3    G4
G1     0     5.3   4.3   5.1
G2     5.3   0     6.4   6.5
G3     4.3   6.4   0     2.3
G4     5.1   6.5   2.3   0
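The Euclidean distance matrix for the four time-series profiles G1 to G4 can be recomputed with a few lines of Python. This is a sketch using only the standard library; variable names are illustrative.

```python
import math

# Time-series profiles from the lecture's gene measurement matrix.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

names = sorted(genes)
euclid_matrix = [[euclidean(genes[a], genes[b]) for b in names] for a in names]

# Print one row per gene, rounded to one decimal place.
for name, row in zip(names, euclid_matrix):
    print(name, " ".join(f"{d:4.1f}" for d in row))
```

G3 and G4 come out as the closest pair, so an agglomerative method using this metric would merge them first.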

Page 13: City Block or Manhattan Distance

• G1 = [1.2 4.0 5.0 1.0]
• G2 = [2.0 2.5 5.5 6.0]
• G3 = [4.5 3.0 2.5 1.0]
• G4 = [3.5 1.5 1.2 1.5]

• Distance is the sum of the absolute differences between coordinates

$$d_{ij} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$

Example for G1 and G2:

$$d_{12} = |1.2 - 2| + |4 - 2.5| + |5 - 5.5| + |1 - 6| = 7.8$$

Distance matrix:

       G1    G2    G3    G4
G1     0     7.8   6.8   9.1
G2     7.8   0     11    11.3
G3     6.8   11    0     4.3
G4     9.1   11.3  4.3   0
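The same four profiles can be compared with the city block metric. The sketch below recomputes the Manhattan distance matrix; names are illustrative.

```python
# Time-series profiles from the lecture's gene measurement matrix.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

names = sorted(genes)
manhattan_matrix = [[manhattan(genes[a], genes[b]) for b in names] for a in names]

# Print one row per gene, rounded to one decimal place.
for name, row in zip(names, manhattan_matrix):
    print(name, " ".join(f"{d:5.1f}" for d in row))
```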

Page 14: Correlation Distance

• Pearson correlation measures the degree of linear relationship between variables, range [-1,1]

• Distance is 1-(Pearson correlation), range [0,2]

$$d_{ij} = 1 - \frac{\sum_{n=1}^{N} x_{in} x_{jn} - \frac{1}{N} \sum_{n=1}^{N} x_{in} \sum_{n=1}^{N} x_{jn}}{\sqrt{\left[\sum_{n=1}^{N} x_{in}^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_{in}\right)^2\right] \left[\sum_{n=1}^{N} x_{jn}^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_{jn}\right)^2\right]}}$$

Distance matrix:

       G1    G2    G3    G4
G1     0     .91   .98   1.6
G2     .91   0     1.9   1.7
G3     .98   1.9   0     .22
G4     1.6   1.7   .22   0
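A sketch of the correlation distance, written with the same running sums as the formula on this slide and applied to the profiles G1 to G4 from the time-series example. A constant profile has zero variance and would cause a division by zero; that case is not handled here.

```python
import math

genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}

def pearson(x, y):
    """Pearson correlation computed from sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return numerator / denominator

def correlation_distance(x, y):
    """1 - Pearson correlation, so the result lies in [0, 2]."""
    return 1.0 - pearson(x, y)

names = sorted(genes)
corr_matrix = [[correlation_distance(genes[a], genes[b]) for b in names] for a in names]

for name, row in zip(names, corr_matrix):
    print(name, " ".join(f"{d:5.2f}" for d in row))
```

Under this metric G2 and G3 are almost maximally distant (close to 2) because their trends are nearly opposite, even though their Euclidean distance is not the largest in the set.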

Page 15: Similarity Measurements

• Pearson Correlation

Two profiles (vectors) $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$

$$C_{pearson}(x, y) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right] \left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$$

$$m_x = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad m_y = \frac{1}{N} \sum_{n=1}^{N} y_n$$

$$+1 \geq \text{Pearson Correlation} \geq -1$$

Page 16: Similarity Measurements

• Cosine Correlation

Two profiles (vectors) $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$

$$C_{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}$$

$$+1 \geq \text{Cosine Correlation} \geq -1$$
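A short sketch contrasting the two similarity measures. It assumes the usual Euclidean norm in the denominator of the cosine correlation, and it uses the fact that the Pearson correlation equals the cosine correlation of the mean-centered profiles. The example vectors are made up.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation as the cosine of the mean-centered vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]       # same direction as x
shifted = [11.0, 12.0, 13.0]   # same trend as x, but offset by a constant

print(cosine_similarity(x, scaled), pearson(x, scaled))
print(cosine_similarity(x, shifted), pearson(x, shifted))
```

Scaling a profile leaves both measures at 1, while adding a constant offset lowers the cosine correlation but not the Pearson correlation.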

Page 17: Hierarchical Clustering (HCL-1)

• IDEA: Iteratively combines genes into groups based on similar patterns of observed expression

• By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.

• Display the data as a heatmap and dendrogram

• Cluster genes, samples or both

Page 18: Hierarchical Clustering

[Figure: Dendrogram; Venn Diagram of Clustered Data]

Page 19: Hierarchical clustering

• Merging (agglomerative): start with every measurement as a separate cluster then combine

• Splitting: make one large cluster, then split up into smaller pieces

• What is the distance between two clusters?

Page 20: Distance between clusters

• Single-link: distance is the shortest distance from any member of one cluster to any member of the other cluster

• Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster

• Average: distance between the averages (means) of the points in each cluster

• Ward: merges the pair of clusters that gives the smallest increase in the within-cluster sum of squares
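The first three definitions above can be sketched as small functions that take two clusters (lists of points) and return a between-cluster distance. Euclidean distance between points is an assumption, the sample clusters are made up, and Ward's criterion is left out to keep the sketch short.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(euclidean(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    """Longest distance between any member of c1 and any member of c2."""
    return max(euclidean(p, q) for p, q in product(c1, c2))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def mean_link(c1, c2):
    """Distance between the averages of the two clusters."""
    return euclidean(centroid(c1), centroid(c2))

# Hypothetical clusters in a 2-D expression space.
cluster_a = [(1.0, 1.0), (2.0, 1.0)]
cluster_b = [(4.0, 1.0), (6.0, 5.0)]

print(single_link(cluster_a, cluster_b))    # 2.0
print(complete_link(cluster_a, cluster_b))  # sqrt(41)
print(mean_link(cluster_a, cluster_b))      # sqrt(16.25)
```

The three values differ for the same pair of clusters, which is why the choice of linkage can change which clusters are merged at each step.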

Page 21: Hierarchical Clustering-Merging

• Euclidean distance

• Average linking

[Figure: gene expression time series and the resulting dendrogram; the dendrogram axis shows the distance between clusters when combined]

Page 22: Manhattan Distance

• Average linking

[Figure: gene expression time series and the resulting dendrogram; the dendrogram axis shows the distance between clusters when combined]

Page 23: Correlation Distance

Page 24: Data Standardization

• Data points are normalized with respect to mean and variance, “sphering” the data

• After sphering, Euclidean and correlation distance are equivalent

• Standardization makes sense if you are not interested in the size of the effects, but in the effect itself

• Results are misleading for noisy data

$$\hat{x} = \frac{x - \hat{\mu}}{\hat{\sigma}}$$
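A sketch of the equivalence claimed on this slide, under the assumption that each gene profile is standardized to mean 0 and (population) standard deviation 1. In that case the squared Euclidean distance between two standardized profiles of length N equals 2N(1 - r), where r is their Pearson correlation, so both distances rank pairs of genes identically. A flat profile has zero standard deviation and is not handled.

```python
import math

def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / std for a in x]

def pearson(x, y):
    """Pearson correlation as the mean product of standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# G1 and G2 from the time-series example in this lecture.
g1 = [1.2, 4.0, 5.0, 1.0]
g2 = [2.0, 2.5, 5.5, 6.0]

lhs = squared_euclidean(standardize(g1), standardize(g2))
rhs = 2 * len(g1) * (1 - pearson(g1, g2))
print(lhs, rhs)  # the two values agree up to rounding error
```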

Page 25: Distance Comments

• Every clustering method is based SOLELY on the measure of distance or similarity

• E.g. correlation: measures linear association between two genes
  • What if data are not properly transformed?
  • What about outliers?
  • What about saturation effects?

• Even good data can be ruined with the wrong choice of distance metric

Page 26: Hierarchical Clustering

Initial data items: A, B, C, D

Distance matrix:

Dist   A    B    C    D
A          20    7    2
B               10   25
C                     3
D


Page 28: Hierarchical Clustering (Single Linkage)

Distance matrix:

Dist   A    B    C    D
A          20    7    2
B               10   25
C                     3
D

Current clusters: A and D are the closest pair and are merged at distance 2.

Page 29: Hierarchical Clustering (Single Linkage)

Distance matrix after merging A and D:

Dist   AD   B    C
AD         20    3
B               10
C

Current clusters: AD, B, C


Page 31: Hierarchical Clustering (Single Linkage)

Distance matrix:

Dist   AD   B    C
AD         20    3
B               10
C

Current clusters: AD and C are the closest pair and are merged at distance 3.

Page 32: Hierarchical Clustering (Single Linkage)

Distance matrix after merging AD and C:

Dist   ADC  B
ADC        10
B

Current clusters: ADC, B


Page 34: Hierarchical Clustering (Single Linkage)

Distance matrix:

Dist   ADC  B
ADC        10
B

Current clusters: ADC and B are merged at distance 10.

Page 35: Hierarchical Clustering (Single Linkage)

Final result: a single cluster ADCB. The dendrogram joins A and D at distance 2, then C at 3, then B at 10.
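The single-linkage walkthrough on the distance matrix for A, B, C and D can be reproduced with a short sketch. The data structure and function names are illustrative choices, not part of the lecture.

```python
from itertools import combinations

# Pairwise distances from the distance matrix of the worked example.
dist = {
    frozenset("AB"): 20, frozenset("AC"): 7, frozenset("AD"): 2,
    frozenset("BC"): 10, frozenset("BD"): 25, frozenset("CD"): 3,
}

def single_link(c1, c2):
    """Smallest distance between any member of c1 and any member of c2."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)

def agglomerate(items):
    """Merge the two closest clusters until one remains; record each merge."""
    clusters = [frozenset(i) for i in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        height = single_link(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append(("".join(sorted(c1 | c2)), height))
    return merges

merges = agglomerate("ABCD")
print(merges)  # [('AD', 2), ('ACD', 3), ('ABCD', 10)]
```

The recorded heights (2, 3, 10) are the levels at which the branches join in the dendrogram.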

Pages 36 to 43: Hierarchical Clustering

[Figure sequence: a dendrogram over Gene 1 to Gene 8 built up step by step across eight slides]

Page 44: Hierarchical Clustering

[Figure: clustered expression data with a color scale labeled from H to L]

Page 45: Hierarchical Clustering

The Leaf Ordering Problem:

• Find ‘optimal’ layout of branches for a given dendrogram architecture

• 2^(N-1) possible orderings of the branches
• For a small microarray dataset of 500 genes, there are 1.6*E150 branch configurations

[Figure: clustered heatmap with samples along the horizontal axis and genes along the vertical axis]
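A quick check of the count quoted on this slide, reading the extracted "2N-1" as 2^(N-1): a binary dendrogram with N leaves has N - 1 internal nodes, and each one can be flipped independently without changing the tree. That reading is an assumption, but it matches the figure given for 500 genes.

```python
def leaf_orderings(n_leaves):
    """Number of leaf orderings consistent with one binary dendrogram.

    Each of the n_leaves - 1 internal nodes can swap its two branches.
    """
    return 2 ** (n_leaves - 1)

count_500 = leaf_orderings(500)
print(f"{count_500:.1e}")     # about 1.6e+150
print(len(str(count_500)))    # number of decimal digits
```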

Page 46: Hierarchical Clustering

[Figure: The Leaf Ordering Problem]

Page 47: Hierarchical Clustering

• Pros:
  – Commonly used algorithm
  – Simple and quick to calculate

• Cons:
  – Real genes probably do not have a hierarchical organization

Page 48: Using Hierarchical Clustering

1. Choose what samples and genes to use in your analysis

2. Choose similarity/distance metric

3. Choose clustering direction

4. Choose linkage method

5. Calculate the dendrogram

6. Choose height/number of clusters for interpretation

7. Assess results

8. Interpret cluster structure

Page 49: Choose what samples/genes to include

• Very important step
• Do you want to include housekeeping genes or genes that didn’t change in your results?
• How do you handle replicates from the same sample?
• Noisy samples?
• Dendrogram is a mess if everything is included in large datasets
• Gene screening
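One possible form of gene screening is to keep only the genes that vary most across samples. The slide does not say which criterion to use, so the variance filter below is an assumption, and the function names are illustrative. The input rows are the five genes and first five samples from the gene expression table earlier in this lecture.

```python
def variance(values):
    """Sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def top_variable_genes(expression, k):
    """Return the names of the k genes with the largest variance."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:k]

expression = {
    "gene1": [0.46, 0.30, 0.80, 1.51, 0.90],
    "gene2": [-0.10, 0.49, 0.24, 0.06, 0.46],
    "gene3": [0.15, 0.74, 0.04, 0.10, 0.20],
    "gene4": [-0.45, -1.03, -0.79, -0.56, -0.32],
    "gene5": [-0.06, 1.06, 1.35, 1.09, -1.09],
}

selected = top_variable_genes(expression, 2)
print(selected)  # ['gene5', 'gene1']
```

Genes that barely change, such as housekeeping genes, fall to the bottom of this ranking and are dropped before clustering.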

Page 50: No Filtering

[Figure: clustering result with no filtering]

Page 51: Filtering 100 relevant genes

[Figure: clustering result after filtering to 100 relevant genes]

Page 52: 2. Choose distance metric

• Metric should be a valid measure of the distance/similarity of genes

• Examples
  – Applying Euclidean distance to categorical data is invalid
  – Correlation metric applied to highly skewed data will give misleading results

Page 53: 3. Choose clustering direction

• Merging clustering (bottom up)

• Divisive (top down)
  – split so that genes within each of the two clusters are as similar as possible, while the distance between the clusters is maximized

Page 54: Nearest Neighbor Algorithm

• Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).

• Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
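The procedure described above can be sketched as follows. The sketch assumes that "most similar" means the smallest single-link (nearest member) distance, uses one-dimensional points for brevity, and stops once k clusters remain. The data values are made up.

```python
def nearest_neighbor_clusters(points, k):
    """Agglomerative merging that stops when k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]

points = [1.0, 1.2, 5.0, 5.1, 9.0]
three = nearest_neighbor_clusters(points, 3)
print(three)  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Each pass through the loop removes exactly one cluster, so going from n nodes down to k clusters takes n - k merge steps.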

Pages 55 to 60: Nearest Neighbor

[Figure sequence: successive merge levels of the Nearest Neighbor algorithm]

• Level 3, k = 6 clusters
• Level 4, k = 5 clusters
• Level 5, k = 4 clusters
• Level 6, k = 3 clusters
• Level 7, k = 2 clusters
• Level 8, k = 1 cluster

Page 61: Hierarchical Clustering

1. Calculate the similarity between all possible combinations of two profiles

2. The two most similar clusters are grouped together to form a new cluster

3. Calculate the similarity between the new cluster and all remaining clusters

Keys
• Similarity
• Clustering

Page 62: Hierarchical Clustering

Merge which pair of clusters?

[Figure: three clusters C1, C2, C3]

Page 63: Hierarchical Clustering

Single Linkage

Dissimilarity between two clusters = Minimum dissimilarity between the members of two clusters

Tends to generate “long chains”

[Figure: clusters C1 and C2]

Page 64: Hierarchical Clustering

Complete Linkage

Dissimilarity between two clusters = Maximum dissimilarity between the members of two clusters

Tends to generate “clumps”

[Figure: clusters C1 and C2]

Page 65: Hierarchical Clustering

Average Linkage

Dissimilarity between two clusters = Averaged distances of all pairs of objects (one from each cluster).

[Figure: clusters C1 and C2]

Page 66: Hierarchical Clustering

Average Group Linkage

Dissimilarity between two clusters = Distance between two cluster means.

[Figure: clusters C1 and C2]
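Average linkage and average group linkage generally give different values for the same pair of clusters. The sketch below compares them on two small, made-up clusters, assuming Euclidean distance between points.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(c1, c2):
    """Mean of the distances over all pairs, one object from each cluster."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def cluster_mean(cluster):
    """Coordinate-wise mean of the objects in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def average_group_linkage(c1, c2):
    """Distance between the two cluster means."""
    return euclidean(cluster_mean(c1), cluster_mean(c2))

# Hypothetical clusters C1 and C2.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (5.0, 2.0)]

avg = average_linkage(c1, c2)          # (3 + sqrt(29) + sqrt(13) + 5) / 4
group = average_group_linkage(c1, c2)  # distance from (0, 1) to (4, 1) = 4
print(avg, group)
```

With Euclidean distance, the distance between the means can never exceed the average pairwise distance, so average group linkage tends to report clusters as closer together.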

Page 67: Which one?

• Both methods are “step-wise” optimal: at each step the optimal split or merge is performed
  • This doesn’t mean that the final result is optimal

• Merging:
  • Computationally simple
  • Precise at bottom of tree
  • Good for many small clusters

• Divisive:
  • More complex, but more precise at the top of the tree
  • Good for looking at large and/or few clusters

• For gene expression applications, divisive makes more sense