TRANSCRIPT
CLUSTERING AND
VISUALIZATION USING R
Nixon Mendez
Department of Bioinformatics
OUTLINE
Microarray Data of Yeast Cell Cycle
Clustering Analysis:
Principal Component Analysis (PCA)
Multidimensional Scaling (MDS)
K-Means
Self-Organizing Maps (SOM)
Hierarchical Clustering
CLUSTERING
Microarray Data of Yeast Cell Cycle
Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9, 3273-3297.
We found 800 yeast genes whose transcripts oscillate through one peak per cell cycle.
These 800 genes were identified using an objective, empirical model of cell-cycle regulation, whose threshold was somewhat arbitrary.
The study also examined the effects of inducing either the cyclin Cln3p or the B-type cyclin Clb2p on more than half of these 800 genes.
A full description and complete data sets are available at http://cellcycle-www.stanford.edu
Loading the data
> mic <- read.delim("C:/Users/Nixon/Desktop/R prog/mic.txt")
> View(mic)
> cell.matrix <- mic
> n <- dim(cell.matrix)[1]    # number of genes (rows)
> p <- dim(cell.matrix)[2]-2  # number of experiments (columns minus gene name and phase)
> cell.data <- cell.matrix[,3:(p+2)]  # expression values in columns 3 to p+2
> gene.name <- cell.matrix[,1]
> gene.phase <- cell.matrix[,2]
> phase <- unique(gene.phase)
> phase.name <- c("G1", "S", "S/G2", "G2/M", "M/G1")
## standardize each gene (row) to mean 0 and variance 1; the length-n vectors
## returned by apply() recycle down the columns, so this centers and scales row-wise
> cell.sdata <- (cell.data-apply(cell.data, 1, mean))/sqrt(apply(cell.data, 1, var))
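A quick sanity check (a minimal sketch using only the objects above): after standardization every gene should have mean close to 0 and variance close to 1.
## verify row-wise standardization
> range(apply(cell.sdata, 1, mean))   # all values should be near 0
> range(apply(cell.sdata, 1, var))    # all values should be near 1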
View Microarray Data
Before visualization we must set the colors: maPalette (from the Bioconductor marray package) builds the green-to-red palette used below.
##CODE
> library(marray)  # provides maPalette
> cell.image <- as.matrix(t(cell.sdata[n:1,]))  # transpose (experiments x genes), gene order reversed for plotting
> RGcol <- maPalette(low = "green", high = "red", k = 50)
> image(cell.image, xlab="Exp.", ylab="Genes", col = RGcol)
OUTPUT
Principal Component Analysis (PCA)
PCA summarizes the dispersion of the data points (the data cloud) in a small number of major axes of variation among the variables, the principal components.
Principal Component Analysis (PCA)
Syntax:
# entering raw data and extracting PCs from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
# scree plot
plot(fit, type="lines")
Principal Component Analysis (PCA)
> cell.pca <- princomp(cell.sdata, cor=TRUE, scores=TRUE)
# 2D plot for the first two components
> pca.dim1 <- cell.pca$scores[,1]
> pca.dim2 <- cell.pca$scores[,2]
> plot(pca.dim1, pca.dim2, main="PCA for Cell Cycle Data on Genes",
    xlab="1st PCA Component", ylab="2nd PCA Component",
    col=gene.phase+1, pch=as.character(gene.phase))  # phases assumed coded 0-4
> legend(0.8, 1, phase.name, pch=as.character(0:4), col=c(1,2,3,4,5))
PCA OUTPUT
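A useful follow-up (a small sketch; summary() and screeplot() are the standard helpers for princomp objects) is to check how much of the total variance the first two components actually explain.
## proportion of variance explained per component
> summary(cell.pca)
## scree plot of the leading components
> screeplot(cell.pca, type="lines")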
Multidimensional Scaling (MDS)
Multidimensional scaling takes a set of dissimilarities and returns a set of
points such that the distances between the points are approximately equal to
the dissimilarities.
Multidimensional Scaling (MDS)
# Classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Multidimensional Scaling (MDS)
# correlation matrix
> cell.cor <- cor(t(cell.sdata))
# distance matrix derived from correlations: d = sqrt(2*(1 - r))
> cell.dist <- sqrt(2*(1-cell.cor))
> cell.mds <- cmdscale(cell.dist)
> mds.dim1 <- cell.mds[,1]
> mds.dim2 <- cell.mds[,2]
> plot(mds.dim1, mds.dim2, type="n", xlab="MDS-1", ylab="MDS-2",
    main="MDS for Cell Cycle Data")
> text(mds.dim1, mds.dim2, gene.phase, cex=0.8, col=gene.phase+1)  # phases assumed coded 0-4
> legend(0.7, 0.8, phase.name, pch=as.character(0:4), col=c(1,2,3,4,5))
MDS OUTPUT
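MDS is defined by the property that the map distances approximate the input dissimilarities, so a minimal check (a sketch using only the objects above) is to correlate the two.
## compare 2-D map distances with the original dissimilarities;
## a correlation close to 1 indicates a faithful embedding
> cor(as.vector(dist(cell.mds)), as.vector(as.dist(cell.dist)))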
K-means Clustering
It is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (k), which are represented by their centroids.
K-means Clustering
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace=FALSE)
Arguments
x - numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
iter.max - the maximum number of iterations allowed.
nstart - if centers is a number, how many random sets should be chosen?
algorithm - character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
K-means Clustering
> no.group <- 5   # number of clusters (one per phase)
> no.iter <- 20   # maximum number of iterations
> cell.kmeans <- kmeans(cell.sdata, no.group, no.iter)
> plot(cell.sdata[,1:4], col = cell.kmeans$cluster)  # pairs plot of the first 4 experiments, colored by cluster
K-means Output
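Because the phase annotation is known here, a natural check (a minimal sketch, assuming gene.phase holds the five phase codes) is to cross-tabulate the k-means clusters against the phases.
## cluster sizes
> cell.kmeans$size
## clusters vs. annotated cell-cycle phases
> table(cell.kmeans$cluster, gene.phase)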
Self-Organizing Maps (SOM)
SOM is unique in that it combines two aspects: it can be used at the same time both to reduce the amount of data by clustering and to construct a nonlinear projection of the data onto a low-dimensional display.
Self-Organizing Maps (SOM)
som(data, xdim, ydim, init="linear", neigh="gaussian", topol="rect",
    radius=NULL, rlen=NULL)
ARGUMENTS:
neigh - a character string specifying the neighborhood function type.
The following are permitted: "bubble" "gaussian"
topol - a character string specifying the topology type when measuring
distance in the map. The following are permitted: "hexa" "rect"
radius - a vector of initial radii of the training area in the SOM algorithm
for the two training phases. Decreases linearly to one during training.
rlen - a vector of running lengths (number of steps) for the two training
phases.
Self-Organizing Maps (SOM)
> library(som)
> cell.som <- som(cell.sdata, xdim=5, ydim=4, topol="rect",
neigh="gaussian")
> plot(cell.som)
SOM OUTPUT
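To see where individual genes landed on the map (a sketch assuming the som package's usual return structure, where $visual holds the grid coordinates per observation):
## grid coordinates (x, y) assigned to each gene
> head(cell.som$visual)
## number of genes mapped to each unit of the 5 x 4 grid
> table(cell.som$visual$x, cell.som$visual$y)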
Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis which seeks to
build a hierarchy of clusters. Strategies for hierarchical clustering
generally fall into two types:
Agglomerative: This is a "bottom up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one
moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one
cluster, and splits are performed recursively as one moves down the
hierarchy.
Hierarchical Clustering
dist(as.matrix(mtcars)) - find the distance matrix
hclust(d) - apply hierarchical clustering
plot(hc) - plot the dendrogram
hang - the fraction of the plot height by which labels should hang below the rest of the plot.
method - the agglomeration method to be used
Hierarchical Clustering
## Hierarchical Clustering on Experiments (rows of t(cell.sdata) are experiments)
> cell.exp.hc.ave <- hclust(dist(t(cell.sdata)), method = "ave")
> plot(cell.exp.hc.ave, cex=0.8)
## Hierarchical Clustering on Genes (rows of cell.sdata are genes)
> cell.gene.hc.ave <- hclust(dist(cell.sdata), method = "ave")
> plot(cell.gene.hc.ave, hang = -1, cex=0.5, labels=gene.name)
Hierarchical Clustering Output
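To obtain discrete groups from the gene dendrogram, cutree() can cut it into k clusters; comparing five groups with the annotated phases is a natural check (a minimal sketch using only the objects above).
## cut the gene dendrogram into 5 groups and compare with phases
> hc.groups <- cutree(cell.gene.hc.ave, k=5)
> table(hc.groups, gene.phase)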
THANK YOU!!