Posted on 18-Dec-2015

Canadian Bioinformatics Workshops

www.bioinformatics.ca


Lecture 7: ML & Data Visualization & Microarrays

MBP1010

Dr. Paul C. Boutros, Winter 2015

Department of Medical Biophysics

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others


Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Lecture 6: Machine Learning & Data Visualization bioinformatics.ca

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)


House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions


Topics For This Week
• Machine-learning 101 (briefly)
• Data visualization 101
• Attendance
• Microarrays 101


# Load the cell-cycle expression data: 50 genes, log expression across the time course
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])

# Pairwise Euclidean distances between genes
D.cho <- dist(cho.data, method = "euclidean")

# Agglomerative hierarchical clustering with single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)

Example: cell cycle data


plot(hc.single)

Single linkage

Example: cell cycle data


Be careful with the interpretation of dendrograms: they imply a proximity between elements that does not necessarily correlate with the actual distance between those elements! Compare, e.g., genes #1 and #47.

Example: cell cycle data
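The dendrogram caveat can be made concrete with a toy sketch (pure Python, my own 1-D example rather than the lecture's R code; R users can compare `cophenetic()` output against the original `dist` object). Under single linkage a chain of equally spaced points merges entirely at height 1, so the dendrogram-implied (cophenetic) distance between the two endpoints is 1 even though their actual distance is 10:

```python
def cophenetic_distance(points, a, b):
    """Height at which points a and b first share a cluster under single linkage."""
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # if this merge joins a's cluster with b's, d is their cophenetic distance
        if (a in clusters[i] and b in clusters[j]) or \
           (a in clusters[j] and b in clusters[i]):
            return d
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + \
                   [clusters[i] | clusters[j]]

points = list(range(11))                     # the chain 0, 1, ..., 10
coph = cophenetic_distance(points, 0, 10)    # dendrogram-implied distance: 1
actual = abs(points[0] - points[10])         # actual distance: 10
```

This is exactly the "chaining" behaviour of single linkage: adjacency in the tree says nothing about how far apart two leaves really are.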


Single linkage, k=2

rect.hclust(hc.single,k=2)

Example: cell cycle data


Single linkage, k=3

rect.hclust(hc.single,k=3)

Example: cell cycle data


Single linkage, k=4

rect.hclust(hc.single,k=4)

Example: cell cycle data


Single linkage, k=5

rect.hclust(hc.single,k=5)

Example: cell cycle data


Single linkage, k=25

rect.hclust(hc.single,k=25)

Example: cell cycle data



class.single <- cutree(hc.single, k = 4)

par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
# as.matrix() rather than t(): subsetting a single-member cluster drops to a vector
matplot(as.matrix(cho.data[class.single == 3, ]),
        type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")

Properties of cluster members, single linkage, k=4

Example: cell cycle data


Single linkage, k=4

Example: cell cycle data


Complete linkage, k=4

Single linkage, k=4

Example: cell cycle data


Hierarchical clustering analyzed

Advantages:
• There may be small clusters nested inside large ones
• No need to specify the number of groups ahead of time
• Flexible linkage methods

Disadvantages:
• Clusters might not be naturally represented by a hierarchical structure
• It is necessary to 'cut' the dendrogram in order to produce clusters
• Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'


Partitioning methods

• Anatomy of a partitioning-based method:
  • data matrix
  • distance function
  • number of groups

• Output:
  • group assignment of every object


Partitioning based methods

• Choose K groups
• Initialise group centers (aka centroids or medoids)
• Assign each object to the nearest centroid according to the distance metric
• Reassign (or recompute) the centroids
• Repeat the last two steps until the assignment stabilizes


K-means vs. K-medoids

K-means:
• Centroids are the 'mean' of the clusters
• Centroids need to be recomputed at every iteration
• Initialisation is difficult, as the notion of a centroid may be unclear before beginning
• R function: kmeans

K-medoids:
• Centroids are an actual object that minimizes the total within-cluster distance
• The centroid can be determined by a quick look-up into the distance matrix
• Initialisation is simply K randomly selected objects
• R function: pam

Lecture 6: Machine Learning & Data Visualization bioinformatics.ca
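The "quick look-up" property of K-medoids is easy to see in a minimal sketch (pure Python with toy numbers of my own; the R route would be `pam()` from the cluster package). Given a precomputed distance matrix, the new medoid of a cluster is simply the member with the smallest summed distance to the other members:

```python
def update_medoid(dist, members):
    # new medoid = the member minimizing its total distance to the cluster,
    # found by look-up in the distance matrix, with no recomputation
    return min(members, key=lambda m: sum(dist[m][o] for o in members))

pts = [0.0, 1.0, 2.0, 10.0]                      # toy 1-D "objects"
dist = [[abs(a - b) for b in pts] for a in pts]  # precomputed distance matrix
medoid = update_medoid(dist, [0, 1, 2])          # the middle point wins
```

Because only distance-matrix entries are consulted, K-medoids also works for arbitrary dissimilarities where a "mean" object is undefined.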

Partitioning based methods

Advantages:
• The number of groups is well defined
• A clear, deterministic assignment of each object to a group
• Simple algorithms for inference

Disadvantages:
• You have to choose the number of groups
• Sometimes objects do not fit well into any cluster
• Can converge on locally optimal solutions; often requires multiple restarts with random initializations


Given N items and K clusters, the goal is to minimize

$$W(c, \mu) = \sum_{k=1}^{K} \sum_{i :\, c_i = k} \lVert x_i - \mu_k \rVert^2$$

over the possible assignments $c = (c_1, \dots, c_N)$ and centroids $\mu = (\mu_1, \dots, \mu_K)$, where $\mu_k$ represents the location of the $k$-th cluster.

K-means
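The objective is easy to evaluate for any fixed assignment; a pure-Python sketch with toy numbers of my own (R's kmeans() reports this quantity as tot.withinss):

```python
def wcss(points, assign, centroids):
    # within-cluster sum of squares: squared distance from each
    # point to the centroid of its assigned cluster
    return sum((p - centroids[k]) ** 2 for p, k in zip(points, assign))

points = [0.0, 1.0, 9.0, 10.0]
assign = [0, 0, 1, 1]
total = wcss(points, assign, [0.5, 9.5])  # centroids at the cluster means
worse = wcss(points, assign, [0.0, 9.0])  # any other centroids score worse
```

Placing each centroid at the mean of its cluster minimizes this sum for a fixed assignment, which is why step 3 of the algorithm recomputes means.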


1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (means)
4. Repeat steps 2-3 until the centroids no longer move

K-means
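The four steps above are Lloyd's algorithm. A minimal pure-Python sketch on toy 1-D data of my own (the lecture itself uses R's kmeans(); this is illustrative, not production code):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    centroids = random.Random(seed).sample(points, k)  # step 1: initialize
    for _ in range(iters):
        # step 2: assign each item to the closest centroid
        assign = [min(range(k), key=lambda j: (p - centroids[j]) ** 2)
                  for p in points]
        # step 3: recompute each centroid as the mean of its members
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:          # step 4: stop when centroids no longer move
            return assign, centroids
        centroids = new
    return assign, centroids

pts = [0.0, 0.2, 0.4, 9.0, 9.2, 9.4]   # two well-separated toy groups
assign, centroids = kmeans(pts, 2, seed=1)
```

In practice the loop is run from several random initializations and the solution with the lowest within-cluster sum of squares is kept; R's kmeans() does this via its nstart argument.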


# Simulate two Gaussian clusters
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Run k-means for 1, 2 and 3 iterations to watch the centroids converge
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

K-means



K-means, k=4


set.seed(100)
km.cho <- kmeans(cho.data, 4)

par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")

K-means


K-means, k=4

Single linkage, k=4

K-means


K-means and hierarchical clustering are simple, fast and useful techniques.

Beware of the memory requirements of hierarchical clustering: it needs the full pairwise distance matrix.

Both are a bit "ad hoc":
• Number of clusters?
• Distance metric?
• What makes a good clustering?

Summary
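One common answer to the "good clustering?" question is the silhouette width: for each object, a(i) is its mean distance to its own cluster, b(i) is its lowest mean distance to another cluster, and s(i) = (b - a) / max(a, b). A pure-Python sketch on toy 1-D data of my own (R users would reach for silhouette() in the cluster package):

```python
def silhouette(points, assign, i):
    # a(i): mean distance to the other members of i's own cluster
    same = [j for j in range(len(points)) if assign[j] == assign[i] and j != i]
    a = sum(abs(points[i] - points[j]) for j in same) / len(same)
    # b(i): smallest mean distance from i to any other cluster
    b = min(
        sum(abs(points[i] - points[j]) for j in range(len(points)) if assign[j] == c)
        / sum(1 for j in range(len(points)) if assign[j] == c)
        for c in set(assign) - {assign[i]}
    )
    return (b - a) / max(a, b)

pts = [0.0, 0.2, 0.4, 9.0, 9.2, 9.4]
good = [0, 0, 0, 1, 1, 1]    # matches the two obvious groups
bad = [0, 1, 0, 1, 0, 1]     # splits each group across clusters
mean_good = sum(silhouette(pts, good, i) for i in range(6)) / 6
mean_bad = sum(silhouette(pts, bad, i) for i in range(6)) / 6
```

Average silhouette near 1 indicates a tight, well-separated clustering; values near 0 or below suggest the assignment (or the choice of k) is poor.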


Meta-Analysis
• Combining the results of multiple studies that address related hypotheses
• Often used to merge data from different microarray platforms
• Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data


Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies by:
  • sample size
  • experimental reliability
  • similarity of the study-specific hypotheses to the overall one
• Increases statistical power
• Reduces information overload (a single meta-analysis vs. five large studies)
• Provides clearer guidance

Lecture 6: Machine Learning & Data Visualization bioinformatics.ca
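The "appropriately weights diverse studies" point is concrete in the standard fixed-effect inverse-variance approach: each study's effect estimate is weighted by 1/SE², so large, precise studies dominate, and the pooled estimate is more precise than any single study. A sketch with made-up numbers (hypothetical studies, not data from the lecture):

```python
def fixed_effect_meta(effects, ses):
    # weight each study by its inverse variance
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

effects = [0.8, 1.2, 1.0]   # hypothetical per-study effect estimates
ses = [0.4, 0.2, 0.1]       # their standard errors
pooled, pooled_se = fixed_effect_meta(effects, ses)
```

The pooled estimate lands closest to the most precise study, and the pooled standard error is smaller than any individual one, which is the statistical-power argument in miniature.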

Challenges of Meta-Analysis
• No control for bias: what happens if most studies are poorly designed?
• The file-drawer problem: publication bias can be detected, but not explicitly controlled for
• How homogeneous is the data? Can it be fairly grouped?
• Simpson's Paradox


Simpson’s Paradox

Group-wise correlations can be inverted when the groups are merged: a cautionary note for all meta-analyses!

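Simpson's paradox is easy to reproduce in a toy simulation (numbers are mine, purely illustrative): two groups in which y increases perfectly with x, yet the merged data show a negative correlation.

```python
def pearson(x, y):
    # Pearson correlation computed from first principles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

g1x, g1y = [0, 1, 2, 3, 4], [20, 21, 22, 23, 24]  # group 1: perfect upward trend
g2x, g2y = [10, 11, 12, 13, 14], [0, 1, 2, 3, 4]  # group 2: perfect upward trend
r1 = pearson(g1x, g1y)                  # positive within group 1
r2 = pearson(g2x, g2y)                  # positive within group 2
r_all = pearson(g1x + g2x, g1y + g2y)   # negative once the groups are merged
```

The inversion happens because the group with larger x values sits at systematically lower y, so pooling without accounting for group membership reverses the trend.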

Topics For This Week
• Machine-learning 101 (Focus: Unsupervised)
• Data visualization 101

