Chapter 5: Microarray Techniques

Prof. Yechiam Yemini (YY)

Computer Science Department, Columbia University


5.2 Analysis of Microarray Data

Overview

• Normalization
• Clustering

Processing Microarray Data
• Problem 1: extract data from microarrays
• Problem 2: analyze the meaning of the data (multiple arrays)

[Figure: heat map in which each cell shows the expression level of gene gi under test Tj]

Normalization

Differentiating Gene Expression
Ideal data:
• R = G for all genes that are not differentially expressed
• R > G for up-regulated genes (R < G for down-regulated genes)

Microarray data can be noisy
Noise due to technology factors:
o Measurements of R and G may be noisy; two arrays can vary greatly
o Even a single array can have variations in dye, mRNA, scanning…
Noise due to biological factors:
o Sample variability

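[Figure: scatter plots of R against G, showing the ideal case and the more likely (noisy) case, with the up-regulated and down-regulated regions marked]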

Normalizing Expression Levels
• Consider log R and log G to evaluate order-of-magnitude differences
• Normalization: calibrate the R, G fluorescence measurements
• Regression: consider log(R/G) - c = log(aR/G), where c = -log(a)
• c is selected to shift the mean log ratio to 0
• Under ideal circumstances this gives the distribution below
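The shift by c can be sketched in a few lines. This is only an illustration under stated assumptions: the logarithm is taken base 2, the intensity values are made up, and the function name `center_log_ratios` is hypothetical.

```python
import math

def center_log_ratios(red, green):
    """Return the log2(R/G) values shifted so that their mean is 0, plus the shift c."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    c = sum(log_ratios) / len(log_ratios)   # c is the mean log ratio
    return [m - c for m in log_ratios], c

# Made-up intensities: the red channel reads twice as bright for every spot.
red = [200.0, 400.0, 800.0, 1600.0]
green = [100.0, 200.0, 400.0, 800.0]
normalized, c = center_log_ratios(red, green)
a = 2 ** (-c)        # the equivalent scale factor, since c = -log2(a)
print(c, a)          # 1.0 0.5
print(normalized)    # [0.0, 0.0, 0.0, 0.0]
```

Subtracting c from every log ratio is the same as rescaling R by the factor a before taking the ratio.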

Rotate 45°:
$M = \log R - \log G = \log(R/G)$
$A = \tfrac{1}{2}[\log R + \log G] = \log (RG)^{1/2}$
[Figure: log-log scatter plot (log R axis) with the regression line log(aR/G)]
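As a small sketch of the M/A transform, assuming base-2 logarithms and made-up intensities (the function name `to_ma` is hypothetical):

```python
import math

def to_ma(red, green):
    """Map (R, G) intensity pairs to (M, A) coordinates using log base 2."""
    m_values = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a_values = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m_values, a_values

# Three made-up spots: R four times G, G four times R, and R equal to G.
m_values, a_values = to_ma([400.0, 100.0, 64.0], [100.0, 400.0, 64.0])
for m, a in zip(m_values, a_values):
    print(round(m, 3), round(a, 3))
```

M carries the log ratio (the sign gives the direction of regulation), while A carries the average log intensity of the spot.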

Lowess Normalization
• The relationship between M and A may not be linear
• Lowess (LOcally WEighted polynomial regreSSion) fits a smooth trend line to the M/A plot
• Normalized M values are the heights of spots relative to the “trend” line
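The idea of subtracting a locally fitted trend can be sketched as follows. This is a simplified stand-in for Lowess, not a full implementation: it uses a local linear fit with tricube weights and no robustness iterations, and the function name `lowess_fit`, the `frac` parameter, and the data are all assumptions made for illustration.

```python
def lowess_fit(x, y, frac=0.5):
    """Local linear fit with tricube weights; returns the trend value at each x[i]."""
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))   # neighbours used for each local fit
    fitted = []
    for xi in x:
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[k - 1] or 1e-12              # bandwidth: distance to the k-th nearest point
        sw = swx = swy = swxx = swxy = 0.0
        for xj, yj in zip(x, y):
            u = abs(xj - xi) / h
            w = (1 - u ** 3) ** 3 if u < 1 else 0.0   # tricube weight
            sw += w
            swx += w * xj
            swy += w * yj
            swxx += w * xj * xj
            swxy += w * xj * yj
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:                 # degenerate window: use the weighted mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * xi)
    return fitted

# Made-up M/A data with a curved, intensity-dependent bias and no real signal.
a_values = [6 + 0.4 * i for i in range(21)]
m_values = [0.05 * (a - 10) ** 2 - 0.3 for a in a_values]
trend = lowess_fit(a_values, m_values, frac=0.4)
m_normalized = [m - t for m, t in zip(m_values, trend)]
print(max(abs(m) for m in m_values), max(abs(m) for m in m_normalized))
```

After the subtraction, the normalized M values sit close to 0 across the whole intensity range, which a single constant shift could not achieve for a curved trend.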

Normalizing Data From Two Arrays
Normalization:
• Transform to A, M axes
• Apply Lowess adjustment
• Use the resulting values for the gene expression matrix

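Differentiated Expression Analysis
• Use the normalized regression
• “Fold” lines determine the up-regulated and down-regulated regions
[Figure: normalized plot with fold lines separating the up-regulated and down-regulated regions]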
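A fold line is simply a threshold on the normalized log ratio. The sketch below assumes base-2 log ratios and a 2-fold cutoff; both the cutoff and the function name `classify_by_fold` are illustrative choices, not values given in the slides.

```python
import math

def classify_by_fold(m_normalized, fold=2.0):
    """Label each normalized log2 ratio as 'up', 'down', or 'unchanged'."""
    threshold = math.log2(fold)   # a 2-fold line sits at M = +1 and M = -1
    labels = []
    for m in m_normalized:
        if m >= threshold:
            labels.append("up")
        elif m <= -threshold:
            labels.append("down")
        else:
            labels.append("unchanged")
    return labels

labels = classify_by_fold([1.8, -0.2, -1.3, 0.9])
print(labels)  # ['up', 'unchanged', 'down', 'unchanged']
```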

Hierarchical Clustering

Heat Map Matrix
• Rows are gene expression profiles
• Columns T1, T2, …, Tj, …, Tn are tests/experiments/samples/conditions (test expression profiles)
• Each cell shows the expression level of gene gi under test Tj
[Figure: heat map]

Clustering Analysis
• Clustering gene profiles reveals co-expression
• Clustering test/sample profiles reveals sample similarity

Clustering Expression Profiles
Profile: a vector of expression values
• Gene (rows): gi = (ei1, ei2, …, ein)
• Test/sample (columns): Tj = (e1j, e2j, …, emj)
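In code, the two kinds of profile are just the rows and columns of the expression matrix. The values and the helper names `gene_profile` and `sample_profile` below are made up for illustration.

```python
# Made-up expression matrix E with m = 3 genes (rows) and n = 4 tests (columns);
# E[i][j] holds the expression level of gene i under test j (0-based indices).
E = [
    [0.1, 1.2, -0.4, 0.8],
    [0.0, 1.1, -0.5, 0.9],
    [-1.0, 0.2, 0.3, -0.7],
]

def gene_profile(matrix, i):
    """Row i: the expression of gene i across all tests."""
    return list(matrix[i])

def sample_profile(matrix, j):
    """Column j: the expression of all genes under test j."""
    return [row[j] for row in matrix]

print(gene_profile(E, 0))    # [0.1, 1.2, -0.4, 0.8]
print(sample_profile(E, 1))  # [1.2, 1.1, 0.2]
```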

Key idea: recursively cluster the “closest” pair
• E.g., we used this for phylogeny and MSA
• Agglomerative (bottom-up) vs. divisive (top-down)

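Distance metrics*: d(A,B)

Distance Matrix
      2     3     4     5
1    0.3   0.2   0.8   0.1
2          0.9   0.1   0.8
3                0.2   0.7
4                      0.1

For profiles A and B with n components each:
1. Euclidean: $d(A,B) = \sqrt{\sum_{i=1}^{n} (x_{iA} - x_{iB})^2}$
2. Manhattan: $d(A,B) = \sum_{i=1}^{n} |x_{iA} - x_{iB}|$
3. Pearson correlation: $r(A,B) = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_{iA} - \bar{x}_A)(x_{iB} - \bar{x}_B)}{\sigma_A \sigma_B}$

(* Triangle inequality is not required; semi-metric is sufficient)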
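The three measures can be sketched in plain Python as follows. Turning the Pearson correlation into a dissimilarity as 1 - r is a common convention assumed here, not something the slides specify, and the function names are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation using population standard deviations (divide by n)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)   # undefined (division by zero) for a constant profile

def pearson_distance(a, b):
    """Assumed convention: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(a, b)

g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]   # same shape as g1 at twice the amplitude
print(euclidean(g1, g2), manhattan(g1, g2), pearson_distance(g1, g2))
```

The example shows why the choice matters: g1 and g2 are far apart in Euclidean and Manhattan terms, yet their correlation distance is essentially 0 because the profiles rise and fall together.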

Hierarchical Clustering
1) Connect nearest neighbors into a cluster
2) Compute the distance matrix to the new cluster
3) Repeat until all are clustered
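The three steps can be sketched as a short agglomerative loop. The slides do not say how the distance to a new cluster is computed, so single linkage (distance between the closest members) is assumed here; the function name `agglomerative`, the gene names, and the positions are made up.

```python
def agglomerative(dist, labels):
    """Single-linkage agglomerative clustering.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    clusters = [frozenset([x]) for x in labels]
    merges = []

    def linkage(c1, c2):
        # Assumption: single linkage, the distance between the closest members.
        return min(dist[frozenset({a, b})] for a in c1 for b in c2)

    while len(clusters) > 1:
        # 1) find the nearest pair of clusters
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((set(clusters[i]), set(clusters[j]), d))
        # 2) replace the pair by the merged cluster; linkage() gives its new distances
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        # 3) repeat until a single cluster remains
    return merges

# Made-up one-dimensional "profiles" so the expected merges are easy to see.
positions = {"g1": 0.0, "g2": 1.0, "g3": 5.0, "g4": 5.5}
names = list(positions)
dist = {frozenset({a, b}): abs(positions[a] - positions[b])
        for a in names for b in names if a != b}
merges = agglomerative(dist, names)
for left, right, d in merges:
    print(sorted(left), sorted(right), d)
```

The recorded merge distances are what a dendrogram would draw as branch heights.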

Hierarchical Clustering
[Figure sequence: successive clustering steps applied to Gene 1 through Gene 8]

Pros & Cons of Hierarchical Clustering
Pros
• It provides useful partitioning of the data
• Organizes co-expressed genes and similar tests
• Visual 2D organization of data
Cons
• Can be very sensitive to noise
• Dimensionality may exacerbate sensitivity
• May not be related to nature (genes are not hierarchical)

An Example: Hierarchical Clustering

K-Means Clustering

Key idea: iterative improvement of clusters
Start with random partitioning and improve it

Initialization
1) Select # of clusters: k=4
2) Select k random centroids {mj}
3) Assign genes to the cluster of the closest centroid
4) Compute new centroids
5) Repeat until convergence!

Assignment rule: classify gene i to cluster c, where $c = \arg\min_j \lVert m_j - g_i \rVert$
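A minimal k-means sketch following these steps is shown below. It makes a few assumptions that the slides leave open: the initial centroids are k randomly chosen genes, distance is Euclidean, and convergence means that no gene changes cluster. The function name `kmeans` and the data are made up.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: use k randomly chosen genes as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign each gene to the cluster of the closest centroid (the argmin rule).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:   # converged: no gene changed cluster
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:                    # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up two-test expression profiles forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

The cluster numbers themselves are arbitrary and depend on the random start; only the grouping is meaningful.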

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

Self Organizing Maps

Self Organizing Maps (SOM) Clustering

• Kohonen 87
• Iterative clustering similar to k-means
• Select # of clusters k and a grid of k centroids
• Move the grid closer to the points

Initialize: A. Select k=6 clusters
Initialize: B. Select random locations for a grid of k=6 centroids
Iteration: Select a random point
Iteration: Identify the nearest centroid
Iteration: Move the centroid towards the point
Iteration: Repeat for a new point
Iteration: Repeat until convergence
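The iteration can be sketched as below for a 2 x 3 grid (k=6). Several details are assumptions rather than anything stated in the slides: the grid neighbours of the winning centroid also move, weighted by a Gaussian of their grid distance; the learning rate and neighbourhood radius shrink linearly; and a fixed number of steps stands in for a convergence test. The function name `train_som` and the data are made up.

```python
import math
import random

def train_som(points, rows=2, cols=3, steps=2000, lr0=0.5, radius0=1.0, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialize: random locations for the k = rows * cols centroids
    # (here, randomly chosen data points).
    centroids = [list(rng.choice(points)) for _ in grid]
    for t in range(steps):
        frac = 1.0 - t / steps
        lr = lr0 * frac                      # assumed: learning rate decays linearly
        radius = max(radius0 * frac, 0.01)   # assumed: neighbourhood shrinks over time
        p = rng.choice(points)               # select a random point
        # Identify the nearest centroid (the winner).
        win = min(range(len(grid)), key=lambda i: math.dist(p, centroids[i]))
        for i, node in enumerate(grid):
            # Move the winner towards the point; grid neighbours follow more weakly.
            g = math.dist(node, grid[win])
            h = math.exp(-(g * g) / (2 * radius * radius))
            centroids[i] = [m + lr * h * (x - m) for m, x in zip(centroids[i], p)]
    return grid, centroids

data_rng = random.Random(1)
points = [(data_rng.uniform(0, 1), data_rng.uniform(0, 1)) for _ in range(60)]
grid, centroids = train_som(points)
for node, centroid in zip(grid, centroids):
    print(node, [round(v, 2) for v in centroid])
```

The neighbourhood update is what distinguishes this from k-means: centroids that are adjacent on the grid end up near each other in expression space.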

Comparison (Based on W. Noble slides)

Comparison of clustering algorithms: Hierarchical clustering
+ Widely used.
+ Easy to understand.
+ Does not require the number of clusters a priori.
- Difficult to implement well.
- Requires post-processing.
- Unstable.
- Greediness can lock in early mistakes.
- Expression data may not be organized hierarchically.

Comparison of clustering algorithms: k-means
- Less widely used.
- Requires the number of clusters a priori.
- Creates unorganized clusters that are hard to interpret.
+ Easy to understand.
+ Easy to implement.
+ Scales well.
+ Stable.

Comparison of clustering algorithms: Self-organizing maps
- Less widely used.
- Difficult to understand.
- Requires the number of clusters a priori.
+ Easy to implement.
+ Scales well.
+ Allows imposition of partial structure.
+ Stable.

What clustering can’t do
• Identify differentially regulated genes.
• Account for complex experimental design.
• Provide semantics for discovered clusters.
• Determine whether a pathway is differentially expressed.
• Incorporate prior knowledge about relevant gene groups.
