TRANSCRIPT
CLUSTERING AND
VISUALIZATION USING R
Nixon Mendez
Department of Bioinformatics
OUTLINE
Microarray Data of Yeast Cell Cycle
Clustering Analysis:
Principal Component Analysis (PCA)
Multidimensional Scaling (MDS)
K-Means
Self-Organizing Maps (SOM)
Hierarchical Clustering
CLUSTERING
Microarray Data of Yeast Cell Cycle
Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9, 3273-3297.
We found 800 yeast genes whose transcripts oscillate through one peak per cell cycle.
These 800 genes were identified using an objective, empirical model of cell-cycle regulation, whose threshold was somewhat arbitrary.
The study also examined the effects of inducing either the cyclin Cln3p or the B-type cyclin Clb2p on more than half of these 800 genes.
A full description and complete data sets are available at http://cellcycle-www.stanford.edu
Loading the data
> mic <- read.delim("C:/Users/Nixon/Desktop/R prog/mic.txt")
> View(mic)
> cell.matrix <- mic
> n <- dim(cell.matrix)[1]    # number of genes (rows)
> p <- dim(cell.matrix)[2]-2  # number of experiments (columns minus gene name and phase)
> cell.data <- cell.matrix[,3:(p+2)]  # expression values in columns 3 to p+2
> gene.name <- cell.matrix[,1]
> gene.phase <- cell.matrix[,2]
> phase <- unique(gene.phase)
> phase.name <- c("G1", "S", "S/G2", "G2/M", "M/G1")
## standardize each gene (row) to mean 0 and variance 1; the length-n vectors
## returned by apply() recycle down the columns, so this centers and scales row-wise
> cell.sdata <- (cell.data-apply(cell.data, 1, mean))/sqrt(apply(cell.data, 1, var))
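A quick sanity check (a minimal sketch using only the objects above): after standardization every gene should have mean close to 0 and variance close to 1.
## verify row-wise standardization
> range(apply(cell.sdata, 1, mean))   # all values should be near 0
> range(apply(cell.sdata, 1, var))    # all values should be near 1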
View Microarray Data
Before visualization we must set the colors: maPalette (from the Bioconductor marray package) builds the green-to-red palette used below.
##CODE
> library(marray)  # provides maPalette
> cell.image <- as.matrix(t(cell.sdata[n:1,]))  # transpose (experiments x genes), gene order reversed for plotting
> RGcol <- maPalette(low = "green", high = "red", k = 50)
> image(cell.image, xlab="Exp.", ylab="Genes", col = RGcol)
OUTPUT
Principal Component Analysis (PCA)
PCA summarizes the dispersion of the data points (the data cloud) in a small number of major axes of variation among the variables, the principal components.
Principal Component Analysis (PCA)
Syntax:
# entering raw data and extracting PCs from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
# scree plot
plot(fit, type="lines")
Principal Component Analysis (PCA)
> cell.pca <- princomp(cell.sdata, cor=TRUE, scores=TRUE)
# 2D plot for the first two components
> pca.dim1 <- cell.pca$scores[,1]
> pca.dim2 <- cell.pca$scores[,2]
> plot(pca.dim1, pca.dim2, main="PCA for Cell Cycle Data on Genes",
    xlab="1st PCA Component", ylab="2nd PCA Component",
    col=gene.phase+1, pch=as.character(gene.phase))  # phases assumed coded 0-4
> legend(0.8, 1, phase.name, pch=as.character(0:4), col=c(1,2,3,4,5))
PCA OUTPUT
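A useful follow-up (a small sketch; summary() and screeplot() are the standard helpers for princomp objects) is to check how much of the total variance the first two components actually explain.
## proportion of variance explained per component
> summary(cell.pca)
## scree plot of the leading components
> screeplot(cell.pca, type="lines")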
Multidimensional Scaling (MDS)
Multidimensional scaling takes a set of dissimilarities and returns a set of
points such that the distances between the points are approximately equal to
the dissimilarities.
Multidimensional Scaling (MDS)
# Classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Multidimensional Scaling (MDS)
# correlation matrix
> cell.cor <- cor(t(cell.sdata))
# distance matrix derived from correlations: d = sqrt(2*(1 - r))
> cell.dist <- sqrt(2*(1-cell.cor))
> cell.mds <- cmdscale(cell.dist)
> mds.dim1 <- cell.mds[,1]
> mds.dim2 <- cell.mds[,2]
> plot(mds.dim1, mds.dim2, type="n", xlab="MDS-1", ylab="MDS-2",
    main="MDS for Cell Cycle Data")
> text(mds.dim1, mds.dim2, gene.phase, cex=0.8, col=gene.phase+1)  # phases assumed coded 0-4
> legend(0.7, 0.8, phase.name, pch=as.character(0:4), col=c(1,2,3,4,5))
MDS OUTPUT
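MDS is defined by the property that the map distances approximate the input dissimilarities, so a minimal check (a sketch using only the objects above) is to correlate the two.
## compare 2-D map distances with the original dissimilarities;
## a correlation close to 1 indicates a faithful embedding
> cor(as.vector(dist(cell.mds)), as.vector(as.dist(cell.dist)))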
K-means Clustering
It is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (k), which are represented by their centroids.
K-means Clustering
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace=FALSE)
Arguments
x - numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
iter.max - the maximum number of iterations allowed.
nstart - if centers is a number, how many random sets should be chosen?
algorithm - character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
K-means Clustering
> no.group <- 5   # number of clusters (one per phase)
> no.iter <- 20   # maximum number of iterations
> cell.kmeans <- kmeans(cell.sdata, no.group, no.iter)
> plot(cell.sdata[,1:4], col = cell.kmeans$cluster)  # pairs plot of the first 4 experiments, colored by cluster
K-means Output
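Because the phase annotation is known here, a natural check (a minimal sketch, assuming gene.phase holds the five phase codes) is to cross-tabulate the k-means clusters against the phases.
## cluster sizes
> cell.kmeans$size
## clusters vs. annotated cell-cycle phases
> table(cell.kmeans$cluster, gene.phase)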
Self-Organizing Maps (SOM)
SOM is unique in that it combines two aspects: it can be used at the same time both to reduce the amount of data by clustering and to construct a nonlinear projection of the data onto a low-dimensional display.
Self-Organizing Maps (SOM)
som(data, xdim, ydim, init="linear", neigh="gaussian", topol="rect",
    radius=NULL, rlen=NULL)
ARGUMENTS:
neigh - a character string specifying the neighborhood function type.
The following are permitted: "bubble" "gaussian"
topol - a character string specifying the topology type when measuring
distance in the map. The following are permitted: "hexa" "rect"
radius - a vector of initial radii of the training area in the SOM algorithm
for the two training phases. Decreases linearly to one during training.
rlen - a vector of running lengths (number of steps) for the two training
phases.
Self-Organizing Maps (SOM)
> library(som)
> cell.som <- som(cell.sdata, xdim=5, ydim=4, topol="rect",
neigh="gaussian")
> plot(cell.som)
SOM OUTPUT
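To see where individual genes landed on the map (a sketch assuming the som package's usual return structure, where $visual holds the grid coordinates per observation):
## grid coordinates (x, y) assigned to each gene
> head(cell.som$visual)
## number of genes mapped to each unit of the 5 x 4 grid
> table(cell.som$visual$x, cell.som$visual$y)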
Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis which seeks to
build a hierarchy of clusters. Strategies for hierarchical clustering
generally fall into two types:
Agglomerative: This is a "bottom up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one
moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one
cluster, and splits are performed recursively as one moves down the
hierarchy.
Hierarchical Clustering
dist(as.matrix(mtcars)) - find the distance matrix
hclust(d) - apply hierarchical clustering
plot(hc) - plot the dendrogram
hang - the fraction of the plot height by which labels should hang below the rest of the plot.
method - the agglomeration method to be used
Hierarchical Clustering
## Hierarchical Clustering on Experiments (rows of t(cell.sdata) are experiments)
> cell.exp.hc.ave <- hclust(dist(t(cell.sdata)), method = "ave")
> plot(cell.exp.hc.ave, cex=0.8)
## Hierarchical Clustering on Genes (rows of cell.sdata are genes)
> cell.gene.hc.ave <- hclust(dist(cell.sdata), method = "ave")
> plot(cell.gene.hc.ave, hang = -1, cex=0.5, labels=gene.name)
Hierarchical Clustering Output
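To obtain discrete groups from the gene dendrogram, cutree() can cut it into k clusters; comparing five groups with the annotated phases is a natural check (a minimal sketch using only the objects above).
## cut the gene dendrogram into 5 groups and compare with phases
> hc.groups <- cutree(cell.gene.hc.ave, k=5)
> table(hc.groups, gene.phase)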
THANK YOU!!