Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University
Chapter 5: Microarray Techniques
5.2 Analysis of Microarray Data
Overview
• Normalization
• Clustering
Processing Microarray Data
• Problem 1: extract data from microarrays
• Problem 2: analyze the meaning of data (multiple arrays)
[Figure: heat map with genes g1, g2, …, gi, …, gm as rows and tests Tj as columns; each cell is the expression level of gi under test Tj]
Normalization
Differentiating Gene Expression
• Ideal data
  o R=G for all genes that are not differentiated
  o R>G for up-regulated genes (R<G for down-regulated genes)
• Microarray data can be noisy
• Noise due to technology factors:
  o Measurements of R and G may be noisy; two arrays can vary greatly
  o Even a single array can have variations in dye, mRNA, scanning…
• Noise due to biological factors:
  o Sample variability
[Figure: two plots with axes G and R, labelled "Ideal" and "More likely", with up-regulated and down-regulated regions marked]
Normalizing Expression Levels
• Consider logR, logG to evaluate order-of-magnitude differences
• Normalization: calibrate R, G fluorescence measurements
• Regression: consider $\log(R/G) - c = \log(aR/G)$, with $c = -\log(a)$
  o c is selected to shift the mean log ratio to 0
  o Under ideal circumstances this gives the distribution shown in the figure
• Rotate 45°:
  o $M = \log R - \log G = \log(R/G)$
  o $A = \tfrac{1}{2}[\log R + \log G] = \log (RG)^{1/2}$
[Figure: plot of logR against logG with the regression line log(aR/G), rotated 45° into a plot of M against A]
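The M/A transform and the mean-centering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides write "log" without a base, so base 2 is assumed here, and the function names and intensity values are hypothetical.

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists for paired R/G intensities (log base 2 is an assumption)."""
    m_values = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a_values = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m_values, a_values

def center_log_ratios(m_values):
    """Subtract c = mean(M) so that the normalized log ratios average to 0."""
    c = sum(m_values) / len(m_values)
    return [m - c for m in m_values], c

# Hypothetical intensities: the red channel reads twice as bright for every spot.
red = [200.0, 400.0, 800.0, 1600.0]
green = [100.0, 200.0, 400.0, 800.0]

m_values, a_values = ma_transform(red, green)
m_normalized, c = center_log_ratios(m_values)
print("shift c:", c)
print("normalized M:", m_normalized)
```

With these values the whole red/green imbalance is absorbed into the shift c, which corresponds to a scale factor a = 2^(-c) in the slide's notation.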
Lowess Normalization
• The relationship between M and A may not be linear
• Lowess (LOcally WEighted polynomial regreSSion)
• Normalized M values are the heights of spots from the “trend” line
[Figure: two plots of M against A illustrating the Lowess trend line]
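A simplified version of this adjustment is sketched below in plain Python. It fits a weighted straight line around each spot (tricube weights over the nearest fraction `frac` of spots) and subtracts the fitted trend from M. The robustness iterations of a full Lowess implementation are omitted, and the function names, the `frac` default, and the sample values are assumptions made for illustration.

```python
import math

def lowess_trend(a_values, m_values, frac=0.5):
    """Estimate the M-versus-A trend with local linear fits and tricube weights."""
    n = len(a_values)
    q = min(n, max(3, math.ceil(frac * n)))   # neighbours used in each local fit
    trend = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = max(dists[q - 1], 1e-12)          # local bandwidth
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_values, m_values):
            u = min(abs(a - a0) / h, 1.0)
            w = (1.0 - u ** 3) ** 3           # tricube weight, 0 outside the window
            sw += w
            swx += w * a
            swy += w * m
            swxx += w * a * a
            swxy += w * a * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            trend.append(swy / sw)            # degenerate window: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            trend.append((swy - slope * swx) / sw + slope * a0)
    return trend

def lowess_normalize(a_values, m_values, frac=0.5):
    """Normalized M = height of each spot above the local trend line."""
    trend = lowess_trend(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]

# Hypothetical spots whose log ratio drifts with intensity.
a_example = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m_example = [0.9, 0.6, 0.4, 0.2, 0.1, 0.0, -0.1, -0.1]
print(lowess_normalize(a_example, m_example))
```

A local linear fit reproduces a straight-line trend exactly, so purely linear M/A data normalizes to zero; curved trends are followed only approximately, depending on `frac`.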
Normalizing Data From Two Arrays
Normalization:
• Transform to A, M axes
• Apply Lowess adjustment
• Use resulting values for gene expression matrix
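Differentiated Expression Analysis
• Use the normalized regression
• “Fold” lines determine the up-regulated and down-regulated regions
[Figure: plot of M against A with fold lines; the up-regulated region lies above the upper fold line and the down-regulated region below the lower one]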
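One way to apply fold lines is to threshold the normalized M values. The sketch below assumes horizontal fold lines at M = ±log2(fold), base-2 logs, and a default two-fold cutoff; none of these choices are stated on the slide, and the function name is hypothetical.

```python
import math

def classify_by_fold(m_values, fold=2.0):
    """Label each normalized log2 ratio as 'up', 'down' or 'unchanged'."""
    threshold = math.log2(fold)   # fold lines at M = +threshold and M = -threshold
    labels = []
    for m in m_values:
        if m >= threshold:
            labels.append("up")          # R exceeds G by at least the fold factor
        elif m <= -threshold:
            labels.append("down")        # G exceeds R by at least the fold factor
        else:
            labels.append("unchanged")
    return labels

print(classify_by_fold([1.5, -0.2, -1.1, 0.9]))
# ['up', 'unchanged', 'down', 'unchanged']
```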
Hierarchical Clustering
Heat Map Matrix
• Rows: genes g1, g2, …, gi, …, gm; row gi is a gene expression profile
• Columns: tests/experiments/samples/conditions T1, T2, …, Tj, …, Tn; column Tj is a test expression profile
• Each cell: expression level of gi under test Tj
[Figure: heat map]
Clustering Analysis
• Gene profile: co-expression
• Test/sample profile: sample similarity
[Figure: gene expression profile gi and test expression profile Tj]
Clustering Expression Profiles
Profile: vector of expression values
• Gene (rows): gi = (ei1, ei2, …, ein)
• Test/sample (columns): Tj = (e1j, e2j, …, emj)
[Figure: expression matrix with rows g1, g2, …, gi, …, gm and columns T1, T2, …, Tj, …, Tn]
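As a small illustration, the two kinds of profile are just the rows and columns of the expression matrix. The matrix values and helper names below are hypothetical, and indices are zero-based in the code while the slides count from 1.

```python
# Hypothetical expression matrix: m = 3 genes (rows) by n = 4 tests (columns).
expression = [
    [0.1, 1.2, -0.3, 0.8],    # g1
    [1.0, 0.9, 1.1, 1.3],     # g2
    [-0.5, 0.0, 0.4, -1.2],   # g3
]

def gene_profile(matrix, i):
    """Row i: expression of one gene across all n tests."""
    return list(matrix[i])

def sample_profile(matrix, j):
    """Column j: expression of all m genes under one test."""
    return [row[j] for row in matrix]

print(gene_profile(expression, 1))    # profile of g2
print(sample_profile(expression, 2))  # profile of T3
```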
Hierarchical Clustering
• Key idea: recursively cluster the “closest” pair
  o E.g., we used this for phylogeny and MSA
• Agglomerative (bottom-up) vs. divisive (top-down)
• Distance metrics*: d(A,B)
  1. Euclidean: $d(A,B) = \sqrt{\sum_{i=1}^{m} (x_{iA} - x_{iB})^2}$
  2. Manhattan: $d(A,B) = \sum_{i=1}^{m} |x_{iA} - x_{iB}|$
  3. Pearson correlation: $r(A,B) = \frac{1}{n} \sum_{T=1}^{n} \left(\frac{x_{AT} - \bar{x}_A}{\sigma_A}\right) \left(\frac{x_{BT} - \bar{x}_B}{\sigma_B}\right)$
(* Triangle inequality is not required; semi-metric is sufficient)

Distance Matrix
       2     3     4     5
1     0.3   0.2   0.8   0.1
2           0.9   0.1   0.8
3                 0.2   0.7
4                       0.1
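The three measures can be written directly from their definitions. In this sketch the Pearson correlation uses the population standard deviation (dividing by n), matching the 1/n factor in the formula. The `pearson_distance` helper, which turns the correlation into a dissimilarity as 1 - r, is a common convention assumed here rather than something the slide states.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation r(A,B); undefined (division by zero) for a constant profile."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (sd_a * sd_b)

def pearson_distance(a, b):
    """Assumed convention: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(a, b)

# Two hypothetical profiles with the same shape but different scale.
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]
print(euclidean(g1, g2), manhattan(g1, g2), pearson_distance(g1, g2))
```

The example shows why the choice matters: the two profiles are far apart in Euclidean and Manhattan terms, yet their Pearson distance is essentially zero because they rise and fall together.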
Hierarchical Clustering
1) Connect nearest neighbors into a cluster
2) Compute the distance matrix to the new cluster
3) Repeat until all are clustered
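The three steps can be sketched as a short agglomerative loop. The slide does not say how the distance to a newly formed cluster is computed, so single linkage (distance between the closest members) is assumed here; the function name and the dictionary encoding of the distance matrix are also illustrative choices. The example uses the 5-item distance matrix shown on the distance-metrics slide.

```python
def agglomerate(dist, n_items):
    """Agglomerative clustering with single linkage (an assumption).

    dist maps frozenset({i, j}) to the distance between items i and j.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [frozenset([i]) for i in range(1, n_items + 1)]
    merges = []

    def linkage(c1, c2):
        # Distance between two clusters = distance of their closest members.
        return min(dist[frozenset((i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        # 1) Find the nearest pair of clusters.
        d, x, y = min((linkage(a, b), x, y)
                      for x, a in enumerate(clusters)
                      for y, b in enumerate(clusters) if x < y)
        a, b = clusters[x], clusters[y]
        merges.append((sorted(a), sorted(b), d))
        # 2) Replace the pair by the merged cluster; linkage() supplies
        #    its distances to the remaining clusters on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [a | b]
        # 3) Repeat until everything is in one cluster.
    return merges

# Upper-triangular distance matrix from the slide.
slide_distances = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.2, frozenset((1, 4)): 0.8,
    frozenset((1, 5)): 0.1, frozenset((2, 3)): 0.9, frozenset((2, 4)): 0.1,
    frozenset((2, 5)): 0.8, frozenset((3, 4)): 0.2, frozenset((3, 5)): 0.7,
    frozenset((4, 5)): 0.1,
}

for members_a, members_b, d in agglomerate(slide_distances, 5):
    print(members_a, "+", members_b, "at distance", d)
```

Ties (several pairs at distance 0.1) are broken by list position here; a different tie-break or linkage rule can give a different tree, which is one source of the instability of hierarchical clustering.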
[Figure sequence: hierarchical clustering of Gene 1 through Gene 8, shown step by step over eight slides]
Pros & Cons of Hierarchical Clustering
• Pros
  o It provides a useful partitioning of the data
  o Organizes co-expressed genes and similar tests
  o Visual 2D organization of data
• Cons
  o Can be very sensitive to noise
  o Dimensionality may exacerbate sensitivity
  o May not be related to nature (genes are not hierarchical)
An example: Hierarchical Clustering
[Figures: example shown over three slides]
K-Means Clustering
• Key idea: iterative improvement of clusters
• Start with random partitioning and improve it
Initialization
1) Select # of clusters: k=4
2) Select k random centroids {mj}
3) Assign genes to the cluster of the closest centroid: classify gene i to cluster $c = \arg\min_j \lVert m_j - g_i \rVert$
4) Compute new centroids: $m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i$ (sum over the $N_c$ genes in cluster c)
5) Repeat until convergence!
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
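The five steps can be sketched in plain Python as follows. Two details are assumptions rather than part of the slide: the initial centroids are drawn from the data points themselves, and a cluster that ends up empty keeps its previous centroid. The function names and example points are hypothetical.

```python
import math
import random

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) after iterating to convergence."""
    rng = random.Random(seed)
    # Step 2: pick k random centroids (here: k distinct data points).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 3: c = argmin_j ||m_j - g_i|| for every gene i.
        new_assignment = [min(range(k), key=lambda j: distance(centroids[j], p))
                          for p in points]
        if new_assignment == assignment:
            break                          # Step 5: no gene changed cluster
        assignment = new_assignment
        # Step 4: each centroid becomes the mean of its cluster's members.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Hypothetical 2-test expression profiles forming two well-separated groups.
profiles = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(profiles, k=2)
print(assignment)
print(centroids)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping is meaningful, not the label values.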
Self Organizing Maps
Self Organizing Maps (SOM) Clustering
• Kohonen 87
• Iterative clustering similar to k-means
• Select # of clusters k and a grid of k centroids
• Move grid closer to points
Initialize:
A. Select k=6 clusters
B. Select random locations for a grid of k=6 centroids
Iteration:
1) Select a random point P
2) Identify the nearest centroid NP
3) Move centroid NP towards point P
4) Repeat for a new point Q, with nearest centroid NQ
5) Repeat until convergence
[Figures: grid of six centroids over the data points, with P, NP, Q and NQ marked at each step]
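A minimal sketch of this loop is given below. The slides show only the nearest centroid moving; this version also pulls that centroid's direct grid neighbours at half strength, which is how SOMs usually "move the grid", but the neighbourhood rule, the 2×3 grid layout for k=6, the linearly decaying learning rate, and a fixed step count in place of a convergence test are all assumptions of the sketch.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance (enough for finding the nearest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def train_som(points, rows=2, cols=3, steps=2000, rate=0.5, seed=0):
    """Return {grid node (row, col): centroid} after `steps` updates."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    # Initialize: one centroid per grid node, at a random location in the data range.
    grid = {(r, c): [rng.uniform(lo[d], hi[d]) for d in range(dim)]
            for r in range(rows) for c in range(cols)}
    for step in range(steps):
        alpha = rate * (1.0 - step / steps)     # learning rate decays towards 0
        p = rng.choice(points)                  # select a random point P
        winner = min(grid, key=lambda node: sq_dist(grid[node], p))  # nearest centroid NP
        for node, centroid in grid.items():
            gap = abs(node[0] - winner[0]) + abs(node[1] - winner[1])
            if gap > 1:
                continue                        # distant grid nodes stay put
            pull = alpha if gap == 0 else alpha / 2
            for d in range(dim):
                centroid[d] += pull * (p[d] - centroid[d])   # move towards P
    return grid

# Hypothetical 2-D points in two groups.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
for node, centroid in sorted(train_som(data).items()):
    print(node, [round(v, 2) for v in centroid])
```

Because neighbouring grid nodes are dragged along together, nearby nodes tend to end up with similar centroids, which is the partial structure a SOM imposes and k-means does not.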
Comparison (based on W. Noble slides)
Comparison of clustering algorithms
Hierarchical clustering
+ Widely used.
+ Easy to understand.
+ Does not require the number of clusters a priori.
- Difficult to implement well.
- Requires post-processing.
- Unstable.
- Greediness can lock in early mistakes.
- Expression data may not be organized hierarchically.
k-means
- Less widely used.
- Requires the number of clusters a priori.
- Creates unorganized clusters that are hard to interpret.
+ Easy to understand.
+ Easy to implement.
+ Scales well.
+ Stable.
Self-organizing maps
- Less widely used.
- Difficult to understand.
- Requires the number of clusters a priori.
+ Easy to implement.
+ Scales well.
+ Allows imposition of partial structure.
+ Stable.
What clustering can’t do
• Identify differentially regulated genes.
• Account for complex experimental design.
• Provide semantics for discovered clusters.
• Determine whether a pathway is differentially expressed.
• Incorporate prior knowledge about relevant gene groups.