microarray data analysis
Microarray Data Analysis
Data preprocessing and visualization
Supervised learning
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Gene regulatory region prediction based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…
Unsupervised learning
Supervised methods
  Can only validate or reject hypotheses
  Cannot lead to discovery of unexpected partitions
Unsupervised learning
  No prior knowledge is used
  Explore structure of data on the basis of correlations and similarities
DEFINITION OF THE CLUSTERING PROBLEM
Eytan Domany
CLUSTER ANALYSIS YIELDS DENDROGRAM
T (RESOLUTION)
BUT WHAT ABOUT THE OKAPI ?
Centroid methods – K-means
Data points at $X_i$, $i = 1, \dots, N$
Centroids at $Y_\nu$, $\nu = 1, \dots, K$
Assign data point $i$ to centroid $\nu$: $S_i = \nu$
Cost E:
$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu) \, (X_i - Y_\nu)^2$
Minimize E over $S_i$, $Y_\nu$
K-means
“Guess” K = 3
[Figures: centroid positions and assignments at iterations 0 to 3]
Start with random positions of centroids.
Assign each data point to the closest centroid.
Move centroids to the center of their assigned points.
Iterate until the cost is minimal.
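The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind the slides: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to draw the initial centroids from the data points are all assumptions made for the example. The returned cost is the sum of squared distances from each point to its assigned centroid.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: start with random centroid positions (assumption: k data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 4: assignments are stable, the cost cannot drop further
        assignment = new_assignment
        # Step 3: move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Cost: squared distance from every point to its assigned centroid.
    cost = sum(
        sum((x - y) ** 2 for x, y in zip(p, centroids[a]))
        for p, a in zip(points, assignment)
    )
    return assignment, centroids, cost
```

Calling `kmeans(points, 3)` with different `seed` values is a simple way to see how the result can change with the initial centroid positions.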
K-means - Summary
Fast algorithm: compute distances from data points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions
Agglomerative Hierarchical Clustering
[Figure: data points 1 to 5 and their dendrogram; vertical axis: distance between joined clusters]
The dendrogram induces a linear ordering of the data points
Initially, each point is its own cluster.
At each step, merge the pair of nearest clusters.
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs, or distance between cluster centers.
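The merge procedure and the three linkage rules can be sketched as follows. This is a deliberately naive illustration under stated assumptions: the names `linkage_distance` and `agglomerate` are hypothetical, Euclidean distance is used as the default point-to-point distance, and the "distance between cluster centers" variant of average linkage is not covered.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def linkage_distance(A, B, points, method="average", dist=euclidean):
    """Distance between clusters A and B, given as lists of point indices."""
    pairs = [dist(points[i], points[j]) for i in A for j in B]
    if method == "single":      # closest pair
        return min(pairs)
    if method == "complete":    # farthest pair
        return max(pairs)
    if method == "average":     # average over all pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, method="average", dist=euclidean):
    """Return the merge history as [(cluster_a, cluster_b, distance), ...]."""
    clusters = [[i] for i in range(len(points))]  # initially each point = cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: find the pair of nearest clusters under the chosen linkage.
        d, a, b = min(
            (linkage_distance(clusters[a], clusters[b], points, method, dist), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges
```

The distances recorded in the merge history are the heights at which branches join in the dendrogram, so running the same points through `"single"`, `"complete"`, and `"average"` shows how the linkage rule changes the tree.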
Hierarchical Clustering - Summary
Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters
Average Linkage – the most widely used clustering method in gene expression analysis
Nature 2002, breast cancer
Heat map
Cluster both genes and samples
Samples should cluster together based on experimental design
Often a way to catch labelling errors or heterogeneity in samples
Epinephrine Treated Rat Fibroblast Cell
ID   Probe          1h       5h       10h      18h      24h
1    D21869_s_at    25.7     55.0     170.7    305.5    807.9
2    D25233_at      705.2    578.2    629.2    641.7    795.3
3    D25543_at      2148.7   1303.0   915.5    149.2    96.3
4    L03294_g_at    241.8    421.5    577.2    866.1    2107.3
5    J03960_at      774.5    439.8    314.3    256.1    44.4
6    M81855_at      1487.6   1283.7   1372.1   1469.1   1611.7
7    L14936_at      1212.6   1848.5   2436.2   3260.5   4650.9
8    L19998_at      767.9    290.8    300.2    129.4    51.5
9    AB017912_at    1813.7   3520.6   4404.3   6853.1   9039.4
10   M32855_at      234.1    23.1     789.4    312.7    67.8
Heat map
Correlation coefficient
Normalized across each gene
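The slide does not say which per-gene normalization is meant. One common choice, assumed here, is to center each gene's profile to mean 0 and scale it to unit standard deviation, so that genes with very different absolute levels become comparable in the heat map. The function name `normalize_gene` is hypothetical; the two profiles are rows 1 and 7 of the epinephrine table.

```python
from statistics import mean, pstdev

def normalize_gene(profile):
    """Center one gene's profile to mean 0 and scale it to unit standard deviation."""
    m, s = mean(profile), pstdev(profile)
    if s == 0:
        return [0.0 for _ in profile]  # flat profile: nothing to scale
    return [(x - m) / s for x in profile]

# Expression at 1h, 5h, 10h, 18h, 24h, copied from the table.
d21869_s_at = [25.7, 55.0, 170.7, 305.5, 807.9]
l14936_at = [1212.6, 1848.5, 2436.2, 3260.5, 4650.9]

z1 = normalize_gene(d21869_s_at)
z2 = normalize_gene(l14936_at)
```

After normalization both rising profiles lie on the same scale, even though their raw values differ by roughly an order of magnitude.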
Distance Issues
Euclidean distance
Pearson distance
[Figure: expression of gene1 to gene4 (g1 to g4) at time0 to time3; vertical axis from 0 to 400]
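The difference between the two distances can be made concrete with a small sketch. The slide does not define "Pearson distance"; the code assumes the common form 1 − r, where r is the Pearson correlation coefficient. The three profiles are hypothetical and are not read off the figure.

```python
from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite shapes.

    Assumes neither profile is constant (otherwise the variance is zero).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / sqrt(var_a * var_b)

# Hypothetical profiles over four time points.
low = [10, 20, 30, 40]
shifted = [110, 120, 130, 140]   # same shape as `low`, higher level
reversed_low = [40, 30, 20, 10]  # same level as `low`, opposite trend

print(euclidean_distance(low, shifted), pearson_distance(low, shifted))
print(euclidean_distance(low, reversed_low), pearson_distance(low, reversed_low))
```

Euclidean distance places `low` closer to `reversed_low`, because it compares absolute levels, while Pearson distance places `low` closer to `shifted`, because it compares the shape of the profile.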
Exercise
Use Average Linkage Algorithm and Manhattan distance.
Gene ID   Exp1   Exp2
1         45     55
2         55     78
3         148    1303
4         241    765
5         774    439
6         607    383
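As a starting point for the exercise, the sketch below computes the pairwise Manhattan distances for the six genes in the table and finds the first pair to merge. The names `manhattan`, `genes`, `dist`, and `average_linkage` are illustrative choices; the remaining merge steps are left to the reader.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Gene ID -> (Exp1, Exp2), copied from the exercise table.
genes = {1: (45, 55), 2: (55, 78), 3: (148, 1303),
         4: (241, 765), 5: (774, 439), 6: (607, 383)}

# Pairwise Manhattan distances between all genes.
dist = {(i, j): manhattan(genes[i], genes[j])
        for i in genes for j in genes if i < j}

# The first merge joins the closest pair of single-gene clusters.
first_merge = min(dist, key=dist.get)

def average_linkage(A, B):
    """Average Manhattan distance over all pairs drawn from clusters A and B."""
    return sum(manhattan(genes[i], genes[j]) for i in A for j in B) / (len(A) * len(B))
```

After each merge, `average_linkage` gives the distance from the new cluster to every remaining cluster, for example `average_linkage([1, 2], [4])`.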
Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
Which Clustering Method Should I Use?
What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?
The End
Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.
We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have.
We wish you all a wonderful summer break!