Microarray Data Analysis

Data preprocessing and visualization
Supervised learning
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Unsupervised learning

Supervised methods
Can only validate or reject hypotheses
Cannot lead to discovery of unexpected partitions

Unsupervised learning
No prior knowledge is used
Explore structure of data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram; axis label: T (resolution)]

BUT WHAT ABOUT THE OKAPI?

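Centroid methods – K-means

Data points at $X_i$, $i = 1, \dots, N$

Centroids at $Y_\nu$, $\nu = 1, \dots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost $E$:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

where $\delta(S_i, \nu) = 1$ if $S_i = \nu$ and $0$ otherwise.

Minimize $E$ over $S_i$, $Y_\nu$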
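The cost E sums, over all data points, the squared distance from each point to the centroid it is assigned to. A minimal Python sketch of that sum for a fixed assignment follows. The function name `kmeans_cost` and the toy data are illustrative assumptions, not part of the original slides.

```python
def kmeans_cost(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for x, s in zip(points, assignment):
        y = centroids[s]  # centroid that point x is assigned to
        total += sum((xj - yj) ** 2 for xj, yj in zip(x, y))
    return total

# Hypothetical toy data: four 2-D points and two centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

good = kmeans_cost(points, centroids, [0, 0, 1, 1])  # each point to the near centroid
bad = kmeans_cost(points, centroids, [1, 1, 0, 0])   # each point to the far centroid
print(good, bad)
```

Swapping the assignment to the far centroid raises the cost sharply, which is why minimizing E over the assignments sends each point to its closest centroid.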

K-means

“Guess” K = 3

Iteration = 0: Start with random positions of centroids.

Iteration = 1: Assign each data point to closest centroid.

Iteration = 2: Move centroids to center of assigned points.

Iteration = 3: Move centroids to center of assigned points.

Iterate till minimal cost.
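The iterations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the starting centroids are drawn from the data points themselves (one common way to pick random positions), the loop stops when assignments no longer change, and the names `kmeans` and `squared_distance` and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Iteration 0: start from k data points chosen at random (an assumption).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the center (mean) of its assigned points.
        for c in range(k):
            members = [p for p, s in zip(points, assignment) if s == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```

On this toy data the two groups are far enough apart that the sketch recovers them regardless of which two points are drawn as starting centroids; that is a property of the example, not of K-means in general.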

K-means - Summary

Fast algorithm: compute distances from data points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions
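The dependence on the initial centroids can be seen on a very small example. The sketch below runs the assign-and-move loop from two explicit starting positions on four hypothetical points at the corners of a 4 by 1 rectangle; the function name `lloyd` and the data are assumptions made for illustration.

```python
def d2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centroids, max_iter=100):
    """Run K-means from the given starting centroids; return (cost, centroids)."""
    centroids = list(centroids)
    k = len(centroids)
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        assign = [min(range(k), key=lambda c: d2(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its points (keep it if it has none).
        updated = []
        for c in range(k):
            members = [p for p, s in zip(points, assign) if s == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    cost = sum(min(d2(p, c) for c in centroids) for p in points)
    return cost, centroids

# Hypothetical data: corners of a rectangle that is 4 wide and 1 tall.
points = [(0, 0), (4, 0), (0, 1), (4, 1)]

cost_a, centroids_a = lloyd(points, [(0, 0), (4, 0)])  # one start on each side
cost_b, centroids_b = lloyd(points, [(0, 0), (0, 1)])  # both starts on the left
print(cost_a, centroids_a)
print(cost_b, centroids_b)
```

The first start ends with a left and a right cluster (cost 1.0); the second gets stuck with a bottom and a top cluster (cost 16.0). Running from several starting positions and keeping the lowest-cost result is a common way to reduce this sensitivity.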

Agglomerative Hierarchical Clustering

Initially, each point is its own cluster.
At each step, merge the pair of nearest clusters.

[Figure: five data points, labeled 1 to 5, and the resulting dendrogram; axis label: distance between joined clusters]

The dendrogram induces a linear ordering of the data points.

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.
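The linkage rules listed above differ only in how they summarize the pairwise distances between two clusters. The following Python sketch makes that explicit. It assumes Euclidean distance between points; the function name `linkage_distance`, the label "centroid" for the cluster-center variant, and the toy clusters are illustrative choices rather than anything defined in the slides.

```python
import math

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters, each a list of points (tuples)."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # distance between the closest pair
        return min(pairwise)
    if method == "complete":    # distance between the farthest pair
        return max(pairwise)
    if method == "average":     # average distance over all pairs
        return sum(pairwise) / len(pairwise)
    if method == "centroid":    # distance between the cluster centers
        center_a = [sum(col) / len(cluster_a) for col in zip(*cluster_a)]
        center_b = [sum(col) / len(cluster_b) for col in zip(*cluster_b)]
        return math.dist(center_a, center_b)
    raise ValueError(f"unknown linkage method: {method}")

# Hypothetical clusters: two vertical pairs of points, 3 units apart.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]
for method in ("single", "complete", "average", "centroid"):
    print(method, round(linkage_distance(cluster_a, cluster_b, method), 3))
```

Even on this small example the four rules give different values, with the all-pairs average falling between the single and complete linkage distances.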

Hierarchical Clustering - Summary

Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters

Average Linkage – the most widely used clustering method in gene expression analysis

Nature 2002, breast cancer

Heat map

Cluster both genes and samples
Samples should cluster together based on experimental design
Often a way to catch labelling errors or heterogeneity in samples

Epinephrine Treated Rat Fibroblast Cell

ID   Probe         1h       5h       10h      18h      24h
1    D21869_s_at   25.7     55.0     170.7    305.5    807.9
2    D25233_at     705.2    578.2    629.2    641.7    795.3
3    D25543_at     2148.7   1303.0   915.5    149.2    96.3
4    L03294_g_at   241.8    421.5    577.2    866.1    2107.3
5    J03960_at     774.5    439.8    314.3    256.1    44.4
6    M81855_at     1487.6   1283.7   1372.1   1469.1   1611.7
7    L14936_at     1212.6   1848.5   2436.2   3260.5   4650.9
8    L19998_at     767.9    290.8    300.2    129.4    51.5
9    AB017912_at   1813.7   3520.6   4404.3   6853.1   9039.4
10   M32855_at     234.1    23.1     789.4    312.7    67.8

Heat map

Correlation coefficient

Normalized across each gene
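One common reading of "normalized across each gene" is a per-gene z-score: subtract the gene's mean over the time points and divide by its standard deviation, so that genes with very different absolute levels become comparable in a heat map. The sketch below applies that assumption to two probes from the epinephrine time course; the function name `normalize_gene` is hypothetical, and a gene with constant expression would need special handling because its standard deviation is zero.

```python
import statistics

def normalize_gene(values):
    """Z-score one gene's expression profile: mean 0, standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Two probes from the time course (1h, 5h, 10h, 18h, 24h).
profiles = {
    "L14936_at": [1212.6, 1848.5, 2436.2, 3260.5, 4650.9],
    "D21869_s_at": [25.7, 55.0, 170.7, 305.5, 807.9],
}
normalized = {probe: normalize_gene(v) for probe, v in profiles.items()}
for probe, values in normalized.items():
    print(probe, [round(v, 2) for v in values])
```

After normalization both rising profiles live on the same scale, even though their raw intensities differ by more than an order of magnitude at early time points.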

Distance Issues

■ Euclidean distance

■ Pearson distance

[Figure: chart of gene1 to gene4 (g1 to g4) across time0 to time3; value axis from 0 to 400]
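The two distances can disagree about which genes are similar. The sketch below compares them on three hypothetical profiles. It assumes Pearson distance is defined as 1 − r, where r is the Pearson correlation coefficient (a common convention; the slides do not give the formula), and the function names and data are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - r, where r is the Pearson correlation coefficient of a and b."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical profiles over four time points.
g1 = [10, 20, 30, 40]
g2 = [110, 120, 130, 140]   # same shape as g1, shifted up by 100
g3 = [40, 30, 20, 10]       # mirror image of g1

print(euclidean_distance(g1, g2), pearson_distance(g1, g2))
print(euclidean_distance(g1, g3), pearson_distance(g1, g3))
```

Euclidean distance places g1 closer to g3, its mirror image, than to g2, while Pearson distance treats g1 and g2 as identical in shape and g1 and g3 as maximally distant. Which behavior is preferable depends on whether absolute level or pattern over time matters for the question at hand.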

Exercise

Use Average Linkage Algorithm and Manhattan distance.

Gene ID   Exp1   Exp2
1         45     55
2         55     78
3         148    1303
4         241    765
5         774    439
6         607    383
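One way to check a hand-worked answer to this exercise is a short Python sketch. It assumes that average linkage means the mean Manhattan distance over all pairs of genes drawn from the two clusters (the first of the two variants named earlier), and the function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def average_linkage(cluster_a, cluster_b, data):
    """Mean Manhattan distance over all pairs with one gene from each cluster."""
    total = sum(manhattan(data[i], data[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def cluster(data):
    """Agglomerative clustering; returns a list of (cluster, cluster, height) merges."""
    clusters = [(gene,) for gene in data]  # initially each gene is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        height, a, b = min(
            (average_linkage(a, b, data), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        merges.append((a, b, height))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Gene ID -> (Exp1, Exp2), taken from the exercise table.
data = {1: (45, 55), 2: (55, 78), 3: (148, 1303),
        4: (241, 765), 5: (774, 439), 6: (607, 383)}
merges = cluster(data)
for a, b, height in merges:
    print(a, b, height)
```

Under these assumptions genes 1 and 2 merge first at distance 33, then genes 5 and 6 at 223, then genes 3 and 4 at 631. The {1, 2} and {5, 6} clusters join at 985, and the final merge with {3, 4} happens at 1115.5.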

Issues in Cluster Analysis

A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?

Which Clustering Method Should I Use?

What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?

The End

Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.

We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have.

We wish you all a wonderful summer break!
