Clustering and Visualization Using R
Nixon Mendez, Department of Bioinformatics
Posted 13-Feb-2017

TRANSCRIPT

Page 1: CLUSTERING AND VISUALIZATION USING R

Nixon Mendez
Department of Bioinformatics

Page 2: OUTLINE

Microarray Data of Yeast Cell Cycle

Clustering Analysis:
- Principal Component Analysis (PCA)
- Multidimensional Scaling (MDS)
- K-Means
- Self-Organizing Maps (SOM)
- Hierarchical Clustering

Page 3: CLUSTERING

Page 4: Microarray Data of Yeast Cell Cycle

Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9, 3273-3297.

- The study found 800 yeast genes whose transcripts oscillate through one peak per cell cycle.
- These 800 genes were identified using an objective, empirical model of cell cycle regulation, whose threshold was somewhat arbitrary.
- It also examined the effects of inducing either the cyclin Cln3p or the B-type cyclin Clb2p on more than half of these 800 genes.
- A full description and complete data sets are available at http://cellcycle-www.stanford.edu

Page 5: Loading the data

> mic <- read.delim("C:/Users/Nixon/Desktop/R prog/mic.txt")
> View(mic)
> cell.matrix <- mic
> n <- dim(cell.matrix)[1]       # number of genes (rows)
> p <- dim(cell.matrix)[2]-2     # number of expression columns
> cell.data <- cell.matrix[,3:(p+2)]  # parentheses matter: 3:p+2 would mean (3:p)+2
> gene.name <- cell.matrix[,1]
> gene.phase <- cell.matrix[,2]
> phase <- unique(gene.phase)
> phase.name <- c("G1", "S", "S/G2", "G2/M", "M/G1")

## standardize each gene (row) to mean 0 and variance 1
> cell.sdata <- (cell.data-apply(cell.data, 1, mean))/sqrt(apply(cell.data, 1, var))
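A quick sanity check on the standardization step: every row of cell.sdata should now have mean 0 and variance 1. A minimal self-contained sketch, using a small synthetic matrix in place of the yeast data (which is assumed unavailable here):

```r
# Synthetic stand-in for cell.data: 6 "genes" (rows) x 4 "experiments"
set.seed(1)
cell.data <- matrix(rnorm(24, mean = 5, sd = 2), nrow = 6, ncol = 4)

# Same row-wise standardization as above: subtract each row's mean,
# divide by each row's standard deviation
cell.sdata <- (cell.data - apply(cell.data, 1, mean)) /
  sqrt(apply(cell.data, 1, var))

# Each row should now have mean ~0 and variance ~1
row.means <- apply(cell.sdata, 1, mean)
row.vars  <- apply(cell.sdata, 1, var)
print(round(row.means, 10))
print(round(row.vars, 10))
```

The arithmetic works because R recycles the length-n vector of row means element-wise down the columns of an n-row matrix.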

Page 6: View Microarray Data

Before visualization we must set the colors: a green-to-red palette is created with maPalette() (from the Bioconductor marray package).

Page 7: CODE

> library(marray)  # provides maPalette()
> cell.image <- as.matrix(t(cell.sdata[n:1,]))
> RGcol <- maPalette(low = "green", high = "red", k = 50)
> image(cell.image, xlab="Exp.", ylab="Genes", col = RGcol)

OUTPUT: heat map of the standardized expression matrix

Page 8: Principal Component Analysis (PCA)

PCA summarizes the dispersion of the data points (the data cloud) in a small number of major axes of variation among the variables, the principal components.

Page 9: Principal Component Analysis (PCA)

Syntax:

# enter raw data and extract PCs from the correlation matrix
fit <- princomp(mydata, cor=TRUE)

# scree plot
plot(fit, type="lines")

Page 10: Principal Component Analysis (PCA)

> cell.pca <- princomp(cell.sdata, cor=TRUE, scores=TRUE)

# 2D plot of the first two components;
# each gene is colored and labeled by its phase code (0-4)
> pca.dim1 <- cell.pca$scores[,1]
> pca.dim2 <- cell.pca$scores[,2]
> plot(pca.dim1, pca.dim2,
       main="PCA for Cell Cycle Data on Genes",
       xlab="1st PCA Component", ylab="2nd PCA Component",
       col=gene.phase+1, pch=as.character(gene.phase))
> legend(0.8, 1, phase.name, pch=c("0","1","2","3","4"), col=c(1,2,3,4,5))
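Before plotting only the first two components, it is worth checking how much of the total variance they actually explain. A sketch using princomp() on a small synthetic matrix (the yeast matrix cell.sdata is assumed unavailable here):

```r
# Synthetic genes-in-rows matrix: 20 rows x 10 columns
set.seed(2)
sdata <- matrix(rnorm(200), nrow = 20, ncol = 10)

pca <- princomp(sdata, cor = TRUE, scores = TRUE)

# Proportion of total variance explained by each component
var.prop <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.prop[1:3], 3))

# Scores that would feed the 2-D plot on the slide
dim1 <- pca$scores[, 1]
dim2 <- pca$scores[, 2]
```

If the first two proportions are small, a 2-D score plot hides most of the structure; summary(pca) reports the same figures as cumulative proportions.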

Page 11: PCA OUTPUT

Page 12: Multidimensional Scaling (MDS)

Multidimensional scaling takes a set of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities.

Page 13: Multidimensional Scaling (MDS)

# Classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata)                  # Euclidean distances between the rows
fit <- cmdscale(d, eig=TRUE, k=2)  # k is the number of dimensions
fit  # view results

# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
     main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)

Page 14: Multidimensional Scaling (MDS)

# correlation matrix
> cell.cor <- cor(t(cell.sdata))
# distance matrix
> cell.dist <- sqrt(2*(1-cell.cor))
> cell.mds <- cmdscale(cell.dist)
> mds.dim1 <- cell.mds[,1]
> mds.dim2 <- cell.mds[,2]
> plot(mds.dim1, mds.dim2, type="n", xlab="MDS-1", ylab="MDS-2",
       main="MDS for Cell Cycle Data")
> text(mds.dim1, mds.dim2, gene.phase, cex=0.8, col=gene.phase+1)  # color by phase code
> legend(0.7, 0.8, phase.name, pch=c("0","1","2","3","4"), col=c(1,2,3,4,5))
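cmdscale() promises that distances between the returned points approximate the input dissimilarities; a quick way to see this is to correlate the original distances with the fitted ones. A sketch on synthetic data (cell.sdata is assumed unavailable here):

```r
# 30 synthetic "genes" in 4 "experiments"
set.seed(3)
sdata <- matrix(rnorm(120), nrow = 30, ncol = 4)

d.orig <- dist(sdata)              # original Euclidean dissimilarities
mds    <- cmdscale(d.orig, k = 2)  # 2-D MDS configuration
d.fit  <- dist(mds)                # distances in the MDS plane

# A correlation close to 1 means the 2-D map preserves the
# dissimilarities well
fit.cor <- cor(as.vector(d.orig), as.vector(d.fit))
print(round(fit.cor, 2))
```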

Page 15: MDS OUTPUT

Page 16: K-means Clustering

K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (k), which are represented by their centroids.

Page 17: K-means Clustering

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace=FALSE)

Arguments:

x          numeric matrix of data, or an object that can be coerced to
           such a matrix (such as a numeric vector or a data frame with
           all numeric columns).
centers    either the number of clusters, say k, or a set of initial
           (distinct) cluster centres. If a number, a random set of
           (distinct) rows in x is chosen as the initial centres.
iter.max   the maximum number of iterations allowed.
nstart     if centers is a number, how many random sets should be chosen?
algorithm  character: may be abbreviated. Note that "Lloyd" and "Forgy"
           are alternative names for one algorithm.

Page 18: K-means Clustering

> no.group <- 5
> no.iter <- 20
> cell.kmeans <- kmeans(cell.sdata, no.group, no.iter)
> plot(cell.sdata[,1:4], col = cell.kmeans$cluster)
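The object returned by kmeans() carries more than the cluster labels; its size, centers, and tot.withinss components are worth inspecting. A minimal sketch on synthetic two-cluster data (the yeast matrix is assumed unavailable here):

```r
# Two well-separated synthetic clusters of 25 points each
set.seed(4)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))

# nstart = 5 restarts guard against a poor random initialization
km <- kmeans(x, centers = 2, iter.max = 20, nstart = 5)

print(km$size)          # number of points in each cluster
print(km$centers)       # the two centroids
print(km$tot.withinss)  # total within-cluster sum of squares
```

Plotting tot.withinss against a range of k values (an "elbow" plot) is a common way to sanity-check the choice of no.group above.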

Page 19: K-means Output

Page 20: Self-Organizing Maps (SOM)

SOM is unique in that it combines clustering and projection: it can be used at the same time both to reduce the amount of data by clustering and to construct a nonlinear projection of the data onto a low-dimensional display.

Page 21: Self-Organizing Maps (SOM)

som(data, xdim, ydim, init="linear", neigh="gaussian", topol="rect",
    radius=NULL, rlen=NULL)

Arguments:

neigh   a character string specifying the neighborhood function type.
        The following are permitted: "bubble", "gaussian".
topol   a character string specifying the topology type when measuring
        distance in the map. The following are permitted: "hexa", "rect".
radius  a vector of initial radius of the training area in the SOM
        algorithm for the two training phases. Decreases linearly to one
        during training.
rlen    a vector of running length (number of steps) in the two training
        phases.

Page 22: Self-Organizing Maps (SOM)

> library(som)
> cell.som <- som(cell.sdata, xdim=5, ydim=4, topol="rect",
                  neigh="gaussian")
> plot(cell.som)

Page 23: SOM OUTPUT

Page 24: Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Page 25: Hierarchical Clustering

dist(as.matrix(mtcars))  # find the distance matrix
hclust(d)                # apply hierarchical clustering
plot(hc)                 # plot the dendrogram

hang    the fraction of the plot height by which labels should hang
        below the rest of the plot.
method  the agglomeration method to be used.

Page 26: Hierarchical Clustering

## Hierarchical Clustering on Experiments
> cell.exp.hc.ave <- hclust(dist(t(cell.sdata)), method = "ave")
> plot(cell.exp.hc.ave, cex=0.8)

## Hierarchical Clustering on Genes
> cell.gene.hc.ave <- hclust(dist(cell.sdata), method = "ave")
> plot(cell.gene.hc.ave, hang = -1, cex=0.5, labels=gene.name)
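After plotting a dendrogram, cluster memberships are usually extracted by cutting the tree with cutree(). A sketch on synthetic data (the yeast matrix is assumed unavailable here):

```r
# 15 synthetic "genes" in 4 "experiments"
set.seed(5)
sdata <- matrix(rnorm(60), nrow = 15, ncol = 4)

hc <- hclust(dist(sdata), method = "average")  # average linkage, as above
groups <- cutree(hc, k = 3)                    # one cluster label per row

print(table(groups))  # how many rows fell into each of the 3 clusters
```

The resulting labels can be fed straight into plotting calls, e.g. col=groups, to compare the tree cut with the k-means partition.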

Page 27: Hierarchical Clustering Output

Page 28: THANK YOU!!