tutorial: analysing microarray data using …guiyuan lei tutorial: analysing microarray data using...

Tutorial: analysing Microarray data usingBioConductor

Guiyuan Lei

Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN)School of Mathematics & Statistics

Newcastle Universityhttp://www.mas.ncl.ac.uk/∼ngl9/

4 Feb, 2008

Guiyuan Lei Tutorial: analysing Microarray data using BioConductor

http://www.mas.ncl.ac.uk/~ngl9/

Outline of Tutorial

Tutorial on CISBAN Internal wiki, under ‘Tools & Resources’page http://bioinf.ncl.ac.uk/cisban/doku.php?id=resources:resourceshome

Introduction to BioconductorFibroblast data set (Chapter 2) and Yeast time course data(Chapter 3)

Pre-process of dataModel Fitting for Identifying Differential ExpressionNetwork Inference


http://bioinf.ncl.ac.uk/cisban/doku.php?id=resources:resourceshome

http://bioinf.ncl.ac.uk/cisban/doku.php?id=resources:resourceshome

Why Bioconductor

Bioconductor:open source software for bioinformaticsprovide innovative methodology for analyzing genomic datausing R statistical computing environment

R: Powerful grapphic feature and cut-edge statisticaltechniques, around 800 packages available, around 60basic packages (like affy, limma) in BioconductorPublished Papers using Bioconductorhttp://www.bioconductor.org/pub

Google Scholar Beta, PubMed, BEPress (BerkeleyElectronic Press), Biostatistics, BioMed CentralBioinformatics and IngentaFor example, in Bioinformatics, 161 papers found with‘Bioconductor’ in title


http://www.bioconductor.org/pub

Getting Bioconductor and associated packages

Install base packages, such as affy and limmasource("http://bioconductor.org/biocLite.R")biocLite()

Install specific packages, such as yeast2probesource("http://bioconductor.org/biocLite.R")biocLite("yeast2probe")

Set search directories in the .Renviron file, e.g.R_LIBS=/data/Rpackages/

Update Bioconductorsource("http://bioconductor.org/biocLite.R")update.packages(repos=biocinstallRepos(), ask=FALSE)


Pre-process of data

Entering data into BioconductorExtraction of Cerevisiae probesetsExploratory data analysisNormalising Microarray dataProbeset level expression to gene level expressionPrincipal Component Analysis


Entering data into Bioconductor

library(affy)fns2 = list.celfiles(path="data2", full.names=TRUE)rawdata = ReadAffy(filenames=fns2)print(rawdata)

Strain 0 hours 1 hour 2 hours 3 hours 4 hoursMutant 1 yeast01.cel yeast02.cel yeast03.cel yeast04.cel yeast05.celWild type 1 yeast06.cel yeast07.cel yeast08.cel yeast09.cel yeast10.celMutant 2 yeast11.cel yeast12.cel yeast13.cel yeast14.cel yeast15.celWild type 2 yeast16.cel yeast17.cel yeast18.cel yeast19.cel yeast20.celMutant 3 yeast21.cel yeast22.cel yeast23.cel yeast24.cel yeast25.celWild type 3 yeast26.cel yeast27.cel yeast28.cel yeast29.cel yeast30.cel


Mask file for Cerevisiae probesets

Mask file to filter out pombe probesetshttp://www.affymetrix.com/Auth/support/downloads/mask files/s cerevisiae.zips_cerevisiae<-scan("s_cerevisiae.msk", skip=2, list("", ""))pombe_filter_out<-s_cerevisiae[[1]]

RemoveProbe 1

source("RemoveProbes.r")library(yeast2probe)cleancdf = cleancdfname("yeast2")RemoveProbes(listOutProbes=NULL, pombe_filter_out,

"yeast2cdf","yeast2probe")

1W.G. Alvord et al., “A microarray analysis for differential gene expressionin the soybean genome using Bioconductor and R.”, Briefings inBioinformatics, September 2007


http://www.affymetrix.com/Auth/support/downloads/mask_files/s_cerevisiae.zip

http://www.affymetrix.com/Auth/support/downloads/mask_files/s_cerevisiae.zip

Cerevisiae probesets IDs

Yeast probeset IDslibrary(yeast2)genenames = as.list(yeast2GENENAME)YeastProbeID <- names(genenames)

Cerevisiae probeset IDsCerevisiaeProbeID <-YeastProbeID[-match(pombe_filter_out,YeastProbeID)]


Cerevisiae gene names

Yeast Transcript IDs from annotation fileYeast 2.na24.annot.csv.zipyeast2annotation=read.csv(file="yeast2annotation.csv",

header=TRUE,stringsAsFactors=FALSE)YeastTranscriptID<-yeast2annotation[,3]yeast2annotationProbesetID<-yeast2annotation[,1]YeastTranscriptID<-YeastTranscriptID[match(YeastProbeID,

yeast2annotationProbesetID)]

Yeast gene namesYeastGeneName<-character()for(i in 1:length(YeastProbeID)){YeastGeneName[i]=genenames[i][[1]]if(is.na(YeastGeneName[i])){YeastGeneName[i]=YeastTranscriptID[i]}

}

Cerevisiae gene namesCerevisiaeGeneName<-YeastGeneName[-match(pombe_filter_out,YeastProbeID)]


Exploratory data analysis: examining raw images

png(filename="cerevisiaeimage.png",width=960, height=480)par(mfrow=c(1,2))image(rawdata[,1])image(rawdata[,27])dev.off()


Exploratory data analysis: probe intensities

png(filename="yeastintensities.png",width=960, height=960)hist(rawdata, lty=1:30, lwd=2)legend(14, 0.60, legend=sampleNames(rawdata), lty=1:30, lwd=2)dev.off()


Exploratory data analysis: MA plots

png(filename="cerevisiaemaplot.png",width=960, height=480)par(mfrow=c(1,2))MAplot(rawdata,which=1)MAplot(rawdata,which=27)dev.off()


Exploratory data analysis: RNA degradation

RNAdeg <- AffyRNAdeg(rawdata)plotAffyRNAdeg(RNAdeg)

RNA degradation plot

5' <−−−−−> 3' Probe Number

Mea

n In

tens

ity :

shift

ed a

nd s

cale

d

0 2 4 6 8 10

010

2030

40


Pre-process of data: Normalisation

eset.rma=rma(rawdata)library(affyPLM)par(mfrow=c(1,2))boxplot(rawdata, col="red",main="Cerevisiae Probe intensities")boxplot(eset.rma, col="blue",main="Cerevisiae RMA expression values")

yeast01.cel yeast08.cel yeast15.cel yeast22.cel yeast29.cel

68

1012

1416

Cerevisiae Probe intensities

yeast01.cel yeast08.cel yeast15.cel yeast22.cel yeast29.cel

46

810

1214

Cerevisiae RMA expression values


Probeset level expression to gene level expression

There are usually several probesets map to one gene inAffymetrix.

CerevisiaeGeneNameLevels<-factor(CerevisiaeGeneName)#Function to average the expression of probesets#which map to same geneprobeset2genelevel<-function(onesample){

return(tapply(onesample,CerevisiaeGeneNameLevels,mean))}#Do the average for each column/arrayCerevisiaeGeneData<-apply(CerevisiaeProbeData,2,probeset2genelevel)


Principal Component Analysis

library(smida)cluster.samples(t(CerevisiaeProbeData),method="pca")

−20 −10 0 10 20 30 40

−10

−5

05

10

1st PC

2nd

PC

Clustering Samples

Principal Component Projection

yeast01.cel

yeast02.cel

yeast03.cel

yeast04.cel

yeast05.cel

yeast06.cel

yeast07.cel

yeast08.cel

yeast09.celyeast10.cel

yeast11.cel


yeast14.cel

yeast15.cel

yeast16.cel



yeast21.cel

yeast22.cel

yeast23.cel

yeast24.cel

yeast25.cel


yeast28.cel

yeast29.cel

yeast30.cel


Model Fitting for Identifying Differential Expression

Limma modelConstruct design matrixConstruct constrasts

Plot time course for top differential expressionHeatmap


Limma: design matrix

library(limma)levels = c("m0","m1", "m2", "m3", "m4","w0","w1", "w2", "w3", "w4")X= rep(levels, 3)TS <- factor(X, levels= levels)design <- model.matrix(˜0+TS)colnames(design) <- levels(TS)

E

2666666666666666664

0BBBBBBBBBBBBBBBBB@

yg1yg2...

yg10...

yg26yg27yg28yg29yg30

1CCCCCCCCCCCCCCCCCA

3777777777777777775

=

0BBBBBBBBBBBBBBBBB@

1 0 0 0 0 0 0 0 0 00 1 0 0 0 0 0 0 0 0...

......

......

......

......

...0 0 0 0 0 0 0 0 0 1...

......

......

......

......

...0 0 0 0 0 1 0 0 0 00 0 0 0 0 0 1 0 0 00 0 0 0 0 0 0 1 0 00 0 0 0 0 0 0 0 1 00 0 0 0 0 0 0 0 0 1

1CCCCCCCCCCCCCCCCCA

0BBBBBBBBBBBBB@

m0m1m2m3m4w0w1w2w3w4

1CCCCCCCCCCCCCA(1)


Limma: construct constrsts and model fitting

For identifying differential expression, combine the contrasts bycomparing mutant type and wild type at time point 1,2,3 and 4.

#Model Fittingfit<-lmFit(CerevisiaeProbeData, design)mc<-makeContrasts(’m1-w1’,’m2-w2’,’m3-w3’,’m4-w4’,levels=design)fit2<-contrasts.fit(fit, mc)eb<-eBayes(fit2)

Different ways to rank the differentially expressed probesets:

topTable(eb,sort.by=’logFC’) #log-fold change

topTableF(eb) #F-statistics


Up and down regulated list

modFpvalue<-eb$F.p.value #$F-test p valueselectedgenesindx<-p.adjust(eb$F.p.value,method="bonferroni")<0.05 #$Sig<-modFpvalue[selectedgenesindx]nsiggenes<-length(Sig) #number of differential expressionresults1<-decideTests(eb, method="global")modF<-eb$F #$F-test valuemodFordered<-order(modF, decreasing = TRUE)CerevisiaeRankProbe<-CerevisiaeProbeID[modFordered[1:nsiggenes]]CerevisiaeRankGeneName<-CerevisiaeGeneName[modFordered[1:nsiggenes]]updown<-results1[modFordered[1:nsiggenes],]

Probeset ID Gene Symbol T1 T2 T3 T4ProbesetID 1 Gene 1 -1 -1 -1 -1ProbesetID 2 Gene 2 1 1 1 1ProbesetID 3 Gene 3 -1 -1 -1 -1

Table: Up and down regulated list2

2Not use real gene names hereGuiyuan Lei Tutorial: analysing Microarray data using BioConductor

Plot time course for top differential expression

#Rank the i+1’th differential expressionindx <- rank(modF) == nrow(CerevisiaeProbeData)-i

M

M

M

M

M

0 1 2 3 4

89

1011

1213

ProbesetID 1 Gene 1 Rank= 1

time

Exp

ress

ion

M

M

MM

M

M

M

M

M

M

W W W WW

W W W W WW

W WW

W

M

M

M

M M

0 1 2 3 4

89

1011

1213

14


time

Exp

ress

ion

M

M

M

M M

M

M

M

MM

W W

W W W

W

W

W W W

W W

W W W

M

M

M

M

M

0 1 2 3 4

78

910

11


time

Exp

ress

ion

M

M

M

M

M

M

M

M

M

M

W

WW W W

W

W W W

W

W W

W W

W

Figure: Time course expression for top 3 differentially expressedYeast genes3

3Not use real gene names hereGuiyuan Lei Tutorial: analysing Microarray data using BioConductor

Code for plotting time course data

modF <- eb$F #$F is to get F-statisticwhich.M <- c(seq(1,5), seq(11,15), seq(21,25))which.W <- c(seq(6,10), seq(16,20), seq(26,30))par(mfrow=c(1,3),ask=T,cex=0.5)#cex:font sizefor(i in 0:2){indx <- rank(modF) == nrow(CerevisiaeProbeData)-irow1 = CerevisiaeProbeData[indx, which.M]row2 = CerevisiaeProbeData[indx, which.W]id=CerevisiaeProbeID[indx]name = CerevisiaeGeneName[indx]genetitle<-paste(sprintf("%.30s",id)," ",sprintf("%.30s",name)," Rank=", i+1)time=c(0,1,2,3,4)plot(time,row1[1:5],ylim=range(min(row1,row2), max(row1,row2)),

ylab="Expression", main=genetitle,pch=’M’,type=’b’,col=2)lines(time, row1[6:10] , pch=’M’, type=’b’, col=2)lines(time, row1[11:15], pch=’M’, type=’b’, col=2)lines(time, row2[1:5], pch=’W’, type=’b’, col=1)lines(time, row2[6:10], pch=’W’, type=’b’, col=1)lines(time, row2[11:15], pch=’W’, type=’b’, col=1)

}


Heatmap

ngenes = 100m = matrix(nrow=ngenes,ncol=30)rnames = vector("list", length(1))for(i in 0:(ngenes-1)){indx <- rank(modF) == nrow(CerevisiaeProbeData) - im[i+1,] = CerevisiaeProbeData[indx,]rnames[i+1]=CerevisiaeGeneName[indx]}

}heatmap(m)

r1t0

r2t0

r3t0

r1t1

r2t1

r3t1

r1t2

r2t2

r3t2

r1t3

r2t3

r3t3

r1t4

r2t4

r3t4

Gene 2Gene 6Gene 38Gene 93Gene 80Gene 94Gene 88Gene 63Gene 85Gene 12Gene 91Gene 60Gene 46Gene 50Gene 19Gene 74Gene 92Gene 73Gene 100Gene 71Gene 69Gene 9Gene 43Gene 72Gene 32Gene 55Gene 64Gene 61Gene 75Gene 59Gene 86Gene 89Gene 45Gene 44Gene 83Gene 67Gene 68Gene 24Gene 53Gene 84Gene 90Gene 37Gene 87Gene 96Gene 15Gene 13Gene 21Gene 8Gene 39Gene 7Gene 3Gene 4Gene 5Gene 22Gene 42Gene 16Gene 95Gene 48Gene 77Gene 65Gene 76Gene 36Gene 20Gene 29Gene 33Gene 54Gene 52Gene 79Gene 51Gene 70Gene 97Gene 58Gene 62Gene 66Gene 14Gene 47Gene 23Gene 41Gene 1Gene 35Gene 57Gene 10Gene 99Gene 81Gene 98Gene 27Gene 40Gene 34Gene 31Gene 56Gene 11Gene 49Gene 82Gene 78Gene 18Gene 26Gene 25Gene 17Gene 30Gene 28


Network Inference

GeneNetStrimmer’s VAR model


GeneNet: partial correlation network

Yeast Network

Gene 41

Gene 87

Gene 64

Gene 99Gene 57

Gene 81

Gene 94 Gene 91

Gene 98

Gene 96

Gene 58

Gene 68

Gene 88

Gene 92

Gene 80

Gene 95

Gene 69

Gene 97

Gene 65

Gene 63

Gene 66Gene 74

Gene 49

Gene 67 Gene 89

Gene 70

Gene 82Gene 75

Gene 100

Gene 90

Gene 22

Gene 53

Gene 44

Gene 93

Gene 60

Gene 56

Gene 59Gene 55Gene 76

Gene 54

Gene 34

Gene 10

Gene 35

Gene 71

Gene 28

Gene 50

Gene 84

Gene 79

Gene 73

Gene 16

Gene 86

Gene 42

Gene 20

Gene 31

Gene 27

Gene 40Gene 39

Gene 77

Gene 18

Gene 38

Gene 24

Gene 48

Gene 33

Gene 15

Gene 32

Gene 14

Gene 83

Gene 11

Gene 23

Gene 25

Gene 36 Gene 4

Gene 52

Figure: Inferred network by GeneNet package for top 100differentially expressed Yeast genes


GeneNet step 1: build longitudial object

#Need to transpose data matrix, rows to be arraysm = t(m)#Need to rearrange the rows so that the rows are ordered by time points#using the property of design matrix#the entry design[i,j] with value 1 means array i is for time point j!!!mnew = t(matrix(nrow=ngenes,ncol=30))#Get the index of array ordered by time pointarrayindx<-numeric(0)ntime=5for(j in 1:ntime){arrayindx<-c(arrayindx,grep(1,design[,j]),grep(1,design[,j+ntime]))}mnew<-m[arrayindx,]library("GeneNet")# step 1: create longitudinal object #mlong = as.longitudinal(mnew,repeats=c(6,6,6,6,6),time=c(0,1,2,3,4))


GeneNet: step 2 to step 5

# step 2: compute partial correlations #pcor.dyn <- ggm.estimate.pcor(mlong, method = "dynamic")# step 3: assign (local) fdr values to all possible edges #m.edges <- network.test.edges(pcor.dyn,direct=TRUE)dim(m.edges)# step 4: construct graph containing the 150 top edges #m.net <- extract.network(m.edges, method.ggm="number", cutoff.ggm=150)# step 5: plot graph using graphviz ##If rnames has no "", Need for Graphvizfor(i in 1:ngenes){

rnames[i] = paste(’"’,rnames[i],’"’,sep="")}

colnames(m) = rnamesnode.labels <- colnames(m)network.make.dot(filename="net.dot", m.net, node.labels, main="Yeast Network")


tutorial: analysing microarray data using …guiyuan lei tutorial: analysing microarray data using...

Documents