
Post on 14-Mar-2020


Prof. Yechiam Yemini (YY)

Computer Science Department, Columbia University

Chapter 5: Microarray Techniques

5.2 Analysis of Microarray Data

Overview

• Normalization
• Clustering

Processing Microarray Data

• Problem 1: extract data from microarrays
• Problem 2: analyze the meaning of the data (multiple arrays)

[Figure: heat map of the gene expression matrix. Rows are genes g1, g2, …, gm; columns are tests Tj; each cell shows the expression level of gi under test Tj.]

Normalization

Differentiating Gene Expression

Ideal data (R and G are the red- and green-channel intensities of a spot):
• R = G for all genes that are not differentially expressed
• R > G for up-regulated genes (R < G for down-regulated genes)

Microarray data can be noisy.

Noise due to technology factors:
o Measurements of R and G may be noisy; two arrays can vary greatly
o Even a single array can have variations in dye, mRNA, scanning, …

Noise due to biological factors:
o Sample variability

[Figure: two scatter plots of R versus G, labeled "Ideal" and "More likely", with the up-regulated and down-regulated regions marked.]

Normalizing Expression Levels

• Consider $\log R$ and $\log G$ to evaluate order-of-magnitude differences
• Normalization: calibrate the R, G fluorescence measurements
• Regression: consider $\log(R/G) - c = \log(aR/G)$, where $c = -\log a$
o c is selected to shift the mean log ratio to 0
o Under ideal circumstances this gives the distribution below
• Rotate by 45°:
  $M = \log R - \log G = \log(R/G)$
  $A = \tfrac{1}{2}(\log R + \log G) = \log\sqrt{RG}$

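[Figure: scatter plot of log R versus log G with its regression line, the normalized ratio log(aR/G), and the same data rotated 45° onto A and M axes.]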
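The M/A transformation and the mean-centering constant c can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the base-2 logarithm, the function names `ma_transform` and `center_log_ratios`, and the sample intensities are all assumptions made for the example.

```python
import math


def ma_transform(red, green):
    """Return (M, A) lists for paired red/green intensities (all > 0).

    Assumption: base-2 logs, so M = 1 means a two-fold R/G ratio.
    """
    m_vals = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a_vals = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m_vals, a_vals


def center_log_ratios(m_vals):
    """Subtract c = mean(M) so the normalized log ratios average to zero."""
    c = sum(m_vals) / len(m_vals)
    return [m - c for m in m_vals], c


# Hypothetical intensities: the red channel reads twice as bright everywhere,
# which is a pure calibration offset rather than differential expression.
red = [200.0, 400.0, 1600.0, 100.0]
green = [100.0, 200.0, 800.0, 50.0]

m_vals, a_vals = ma_transform(red, green)
m_norm, c = center_log_ratios(m_vals)
print("c =", c)
print("normalized M =", m_norm)
```

Subtracting c from every M value is the same as multiplying every ratio R/G by the calibration factor a, which is why a single constant is enough when the dye bias does not depend on intensity.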

Lowess Normalization

• The relationship between M and A may not be linear
• Lowess (Locally WEighted polynomial regreSSion) fits a smooth “trend” line through the M versus A scatter
• Normalized M values are the heights of the spots above the “trend” line

[Figure: two plots of M versus A, before and after the Lowess adjustment.]
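A simplified version of the Lowess adjustment is sketched below. It assumes a single pass of local linear fits with tricube weights and no robustness iterations, which full Lowess implementations usually add. The names `lowess_fit` and `lowess_normalize` and the `frac` parameter are illustrative, not taken from the slides.

```python
import math


def lowess_fit(a_vals, m_vals, frac=0.5):
    """Evaluate a locally weighted linear fit of M on A at every A value."""
    n = len(a_vals)
    k = min(n, max(2, math.ceil(frac * n)))  # points in each local window
    fitted = []
    for a0 in a_vals:
        # Bandwidth: distance from a0 to its k-th nearest A value.
        h = sorted(abs(a - a0) for a in a_vals)[k - 1] or 1e-12
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_vals, m_vals):
            u = abs(a - a0) / h
            if u >= 1.0:
                continue  # outside the local window
            w = (1.0 - u ** 3) ** 3  # tricube weight
            x = a - a0  # centered, so the intercept is the fit at a0
            sw += w
            swx += w * x
            swy += w * m
            swxx += w * x * x
            swxy += w * x * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw)
    return fitted


def lowess_normalize(a_vals, m_vals, frac=0.5):
    """Normalized M = height of each spot above the fitted trend line."""
    trend = lowess_fit(a_vals, m_vals, frac)
    return [m - t for m, t in zip(m_vals, trend)]
```

For example, `lowess_normalize(a_vals, m_vals)` applied to data with an intensity-dependent dye bias returns M values scattered around zero across the whole range of A.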

Normalizing Data From Two Arrays

Normalization:
• Transform to A, M axes
• Apply Lowess adjustment
• Use the resulting values for the gene expression matrix

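Differentiated Expression Analysis

• Use the normalized regression
• “Fold” lines determine the up-regulated and down-regulated regions

[Figure: plot of M versus A with fold lines separating the up-regulated and down-regulated regions.]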
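Assuming base-2 logarithms, a k-fold change corresponds to |M| = log2(k), so the fold lines sit at M = +log2(k) and M = -log2(k) on the normalized plot. The sketch below labels genes accordingly; the function name `classify_by_fold` and the default two-fold cutoff are illustrative choices, not values given in the slides.

```python
import math


def classify_by_fold(m_norm, fold=2.0):
    """Label each normalized M value relative to the +/- log2(fold) lines."""
    threshold = math.log2(fold)
    labels = []
    for m in m_norm:
        if m >= threshold:
            labels.append("up")
        elif m <= -threshold:
            labels.append("down")
        else:
            labels.append("unchanged")
    return labels


# Hypothetical normalized log ratios for five genes.
print(classify_by_fold([1.5, -1.2, 0.3, 1.0, -1.0]))
```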


Hierarchical Clustering

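Heat Map Matrix

[Figure: heat map of the expression matrix. Rows are genes g1, g2, …, gm; columns are tests/experiments/samples/conditions T1, T2, …, Tn. Row gi is a gene expression profile, column Tj is a test expression profile, and each cell shows the expression level of gi under test Tj.]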

Clustering Analysis

• Clustering gene profiles identifies co-expression
• Clustering test/sample profiles identifies sample similarity

Clustering Expression Profiles

Profile: a vector of expression values
• Gene (row): $g_i = (e_{i1}, e_{i2}, \ldots, e_{in})$
• Test/sample (column): $T_j = (e_{1j}, e_{2j}, \ldots, e_{mj})$

[Figure: expression matrix with row gi and column Tj highlighted.]

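Hierarchical Clustering

• Key idea: recursively cluster the “closest” pair
• E.g., we used this for phylogeny and MSA
• Agglomerative (bottom-up) vs. divisive (top-down)

Distance Matrix

       2     3     4     5
1    0.3   0.2   0.8   0.1
2          0.9   0.1   0.8
3                0.2   0.7
4                      0.1

Distance metrics*: d(A,B)

1. Euclidean: $d(A,B) = \sqrt{\sum_{i=1}^{m} (x_{iA} - x_{iB})^2}$

2. Manhattan: $d(A,B) = \sum_{i=1}^{m} |x_{iA} - x_{iB}|$

3. Pearson correlation: $r(A,B) = \frac{1}{n} \sum_{T=1}^{n} \left(\frac{x_{AT} - \bar{x}_A}{\sigma_A}\right) \left(\frac{x_{BT} - \bar{x}_B}{\sigma_B}\right)$, where $\bar{x}_A$ and $\sigma_A$ are the mean and standard deviation of profile A. Since r is a similarity, it is commonly turned into a distance as $1 - r$.

(* Triangle inequality is not required; a semi-metric is sufficient)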
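The three measures can be written directly from their definitions. The sketch below is a plain-Python illustration; it assumes the population standard deviation (dividing by n) in the Pearson formula, and the `pearson_distance` helper using 1 - r is one common convention rather than something the slides specify.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sd_a == 0.0 or sd_b == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (sd_a * sd_b)


def pearson_distance(a, b):
    """Assumed convention: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(a, b)
```

Note the practical difference: Euclidean and Manhattan distances compare absolute expression levels, while the Pearson-based distance compares only the shape of the profiles, so two genes that rise and fall together at different magnitudes come out as close.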

Hierarchical Clustering

1) Connect nearest neighbors into a cluster
2) Compute the distance matrix to the new cluster
3) Repeat until all are clustered
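The three steps can be sketched as a short agglomerative loop. The slides do not say how the distance to a merged cluster is defined, so this example assumes single linkage (the minimum pairwise distance), for which recomputing from the original pairwise distances is equivalent to updating the distance matrix. The function name `agglomerate` is illustrative; the distances are the ones in the 5-item distance matrix shown earlier.

```python
def agglomerate(dist, items):
    """Single-linkage agglomerative clustering.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the list of merges as (cluster_1, cluster_2, distance).
    """
    clusters = [frozenset([x]) for x in items]
    merges = []

    def linkage(c1, c2):
        # Assumed linkage rule: closest pair across the two clusters.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Pairwise distances from the distance matrix slide.
slide_distances = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.2,
    frozenset((1, 4)): 0.8, frozenset((1, 5)): 0.1,
    frozenset((2, 3)): 0.9, frozenset((2, 4)): 0.1,
    frozenset((2, 5)): 0.8, frozenset((3, 4)): 0.2,
    frozenset((3, 5)): 0.7, frozenset((4, 5)): 0.1,
}

for left, right, height in agglomerate(slide_distances, [1, 2, 3, 4, 5]):
    print(sorted(left), "+", sorted(right), "at distance", height)
```

The recorded merge distances are the heights at which branches join in the resulting tree. Other linkage rules (complete, average) change only the `linkage` function.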

[Figure sequence: eight slides showing successive steps of hierarchical clustering over Gene 1 through Gene 8.]

Pros & Cons of Hierarchical Clustering

Pros
• It provides a useful partitioning of the data
• Organizes co-expressed genes and similar tests
• Visual 2D organization of data

Cons
• Can be very sensitive to noise
• Dimensionality may exacerbate sensitivity
• May not be related to nature (genes are not hierarchical)

An example: Hierarchical Clustering

[Figure: hierarchical clustering example, shown over three slides.]

K-Means Clustering

• Key idea: iterative improvement of clusters
• Start with a random partitioning and improve it

Initialization

1) Select the number of clusters: k = 4
2) Select k random centroids $\{m_j\}$
3) Assign genes to the cluster of the closest centroid: classify gene i to cluster $c = \arg\min_j \| m_j - g_i \|$
4) Compute new centroids: $m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i$, where the sum runs over the $N_c$ genes assigned to cluster c
5) Repeat until convergence!

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
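The five steps translate almost line for line into code. The sketch below is a minimal plain-Python version under a few assumptions not stated in the slides: initial centroids are k randomly chosen data points, distance is Euclidean, convergence means the assignments stop changing, and a cluster that ends up empty keeps its previous centroid. The function name `kmeans` and its parameters are illustrative.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 2: pick k data points as the initial centroids (an assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 3: c = argmin_j ||m_j - g_i|| for every gene profile g_i.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(centroids[j], p))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 5: assignments are stable, so we have converged
        assignment = new_assignment
        # Step 4: each centroid becomes the mean of its assigned profiles.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical 2-D profiles forming two well-separated groups.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(profiles, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can give different clusterings on less clearly separated data, so running several seeds and comparing results is a common precaution.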

Self Organizing Maps (SOM) Clustering

• Kohonen 87
• Iterative clustering similar to k-means
• Select the number of clusters k and a grid of k centroids
• Move the grid closer to the points

Initialize

A. Select k = 6 clusters
B. Select random locations for a grid of k = 6 centroids

Iteration

1) Select a random point P
2) Identify the centroid NP nearest to P
3) Move centroid NP towards point P
4) Repeat for a new point Q and its nearest centroid NQ
5) Repeat until convergence
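The iteration can be sketched as follows. This is a simplified illustration under several assumptions that go beyond the slides: the grid is a one-dimensional chain of k centroids, the immediate grid neighbors of the nearest centroid are also pulled (at half strength), the learning rate decays linearly to zero, and training runs for a fixed number of steps instead of testing convergence. The function name `train_som` and all parameter values are hypothetical.

```python
import math
import random


def train_som(points, k=6, steps=2000, rate=0.5, seed=0):
    """Train a 1-D chain of k centroids on a list of equal-length tuples."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    # Initialize: random locations inside the bounding box of the data.
    centroids = [[rng.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(k)]
    for t in range(steps):
        decay = 1.0 - t / steps  # shrinks from 1 towards 0
        p = rng.choice(points)  # select a random point P
        # Identify the centroid nearest to P.
        nearest = min(range(k), key=lambda j: math.dist(centroids[j], p))
        for j in range(k):
            gap = abs(j - nearest)  # distance along the grid, not in data space
            if gap > 1:
                continue
            pull = rate * decay * (1.0 if gap == 0 else 0.5)
            # Move the centroid a fraction of the way towards P.
            centroids[j] = [c + pull * (x - c) for c, x in zip(centroids[j], p)]
    return centroids


# Hypothetical 2-D data: two groups of points.
data = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (5.0, 5.0), (5.4, 4.8), (4.9, 5.5)]
for centroid in train_som(data):
    print([round(v, 2) for v in centroid])
```

Pulling grid neighbors along with the winning centroid is what distinguishes this from an online k-means update: centroids that are adjacent on the grid tend to end up near each other in data space.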

Comparison (based on W. Noble slides)

Comparison of clustering algorithms: Hierarchical clustering

+ Widely used.
+ Easy to understand.
+ Does not require the number of clusters a priori.
- Difficult to implement well.
- Requires post-processing.
- Unstable.
- Greediness can lock in early mistakes.
- Expression data may not be organized hierarchically.

Comparison of clustering algorithms: k-means

- Less widely used.
- Requires the number of clusters a priori.
- Creates unorganized clusters that are hard to interpret.
+ Easy to understand.
+ Easy to implement.
+ Scales well.
+ Stable.

Comparison of clustering algorithms: Self-organizing maps

- Less widely used.
- Difficult to understand.
- Requires the number of clusters a priori.
+ Easy to implement.
+ Scales well.
+ Allows imposition of partial structure.
+ Stable.

What clustering can’t do

• Identify differentially regulated genes.
• Account for complex experimental design.
• Provide semantics for discovered clusters.
• Determine whether a pathway is differentially expressed.
• Incorporate prior knowledge about relevant gene groups.
