analysis of genomic and proteomic data …°c technology and preprocessing methods benjamin...

35
Analysis of Genomic and Proteomic Data A FFYMETRIX c Technology and Preprocessing Methods Benjamin Haibe-Kains [email protected] Universit ´ e Libre de Bruxelles Institut Jules Bordet

Upload: lamxuyen

Post on 11-May-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Analysis of Genomic and Proteomic Data

AFFYMETRIX c© Technology andPreprocessing Methods

Benjamin Haibe-Kains

[email protected]

Universite Libre de Bruxelles Institut Jules Bordet

Page 2: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Table of Contents

AFFYMETRIX c© Technology

Preprocessing MethodsR and BioconductorImage AnalysisExpression QuantificationNormalization

Conclusion

Benjamin Haibe-Kains Bioinformatics Courses – p.1/33

Page 3: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFYMETRIX c© Technology

Benjamin Haibe-Kains Bioinformatics Courses – p.2/33

Page 4: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Microarray Technology

A microarray is composed of

DNA fragments fixed on a solid support

ordered position of probes

principle of hybridization to a specific probe ofcomplementary sequence

radioactive labeling

å simultaneous detection of thousands of sequences inparallel

Benjamin Haibe-Kains Bioinformatics Courses – p.3/33

Page 5: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Microarray Technology(2)

It exists several high-throughput methods tosimultaneously measure the expression of a largenumber of genes :

cDNA microarray

oligonucleotide microarrayshort oligonucleotide (AFFYMETRIX c©)long oligonucleotide (AGILENT c©, CODELINK c©)

multiplex quantitative RT-PCR

Benjamin Haibe-Kains Bioinformatics Courses – p.4/33

Page 6: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFYMETRIX c© GeneChip

Benjamin Haibe-Kains Bioinformatics Courses – p.5/33

Page 7: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFYMETRIX c© Design

Benjamin Haibe-Kains Bioinformatics Courses – p.6/33

Page 8: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFY c© GeneChip Structure

1 gene is represented by 1 or more probe sets

1 probe set includes 11 to 20 probe pairs

1 probe pair includes a Perfect Match (PM) value anda Mis-Match value (MM)

To facilitate the explanation, we assume that 1 gene isrepresented by 1 probe set including 20 probe pairs (PMand MM)

Benjamin Haibe-Kains Bioinformatics Courses – p.7/33

Page 9: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFY c© GeneChip Structure(2)

Benjamin Haibe-Kains Bioinformatics Courses – p.8/33

Page 10: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFYMETRIX c© Hybridization

Benjamin Haibe-Kains Bioinformatics Courses – p.9/33

Page 11: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

AFFYMETRIX c© Detection

Benjamin Haibe-Kains Bioinformatics Courses – p.10/33

Page 12: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Microarray Comparison

cDNAshort

oligonucleotidelong

oligonucleotide

Benjamin Haibe-Kains Bioinformatics Courses – p.11/33

Page 13: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Microarray Comparison(2)

AFFYMETRIX c© advantages :

commercially available for several years (strongmanufacturing)

large number of published studies (generallyaccepted method)

no reference sample → possible comparisonbetween studies

Benjamin Haibe-Kains Bioinformatics Courses – p.12/33

Page 14: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Microarray Comparison(3)

AFFYMETRIX c© disadvantages :

cost of the devices and the chips (but easy use)

changes in probe design is hard (but new programpermits to create his own design)

short oligos → several oligos per gene,specificity/sensitivity trade-off (complex methods toget gene expression)

Benjamin Haibe-Kains Bioinformatics Courses – p.13/33

Page 15: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Preprocessing Methods

Benjamin Haibe-Kains Bioinformatics Courses – p.14/33

Page 16: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

R and Bioconductor

R is a widely used open source language andenvironment for statistical computing and graphics

Software and documentation are available fromhttp://www.r-project.org

Bioconductor is an open source and opendevelopment software project for the analysis andcomprehension of genomic data

Software and documentation are available fromhttp://www.bioconductor.org

Benjamin Haibe-Kains Bioinformatics Courses – p.15/33

Page 17: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Microarray Analysis Design

Benjamin Haibe-Kains Bioinformatics Courses – p.16/33

Page 18: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Preprocessing Methods

We will focus on preprocessing methods forAFFYMETRIX c© data

image analysis : get raw probe intensities from chipimage

expression quantification : get gene expressionsfrom raw probe intensities

normalization : remove systematic bias to comparegene expressions

It exists several methods but we will focus on RobustMulti-array Analysis (RMA) methods [Irizarry et al., 2003]

Benjamin Haibe-Kains Bioinformatics Courses – p.17/33

Page 19: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Image AnalysisMain software from AFFYMETRIX c© was MicroArray Suite5 (MAS5), now called GeneChip Operating Software 1.2(GCOS)

DAT file is the image file

Benjamin Haibe-Kains Bioinformatics Courses – p.18/33

Page 20: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Image Analysis(2)

Each probe cell is composed by 10x10 pixels

remove outer 36 pixels → 8x8 pixels

probe cell signal, PM or MM, is the 75th percentile ofthe 8x8 pixel values.

Benjamin Haibe-Kains Bioinformatics Courses – p.19/33

Page 21: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Image Analysis(3)

CEL file is the CELl intensity file including probe levelPM and MM values

The last file provided by AFFYMETRIX c© concerns theannotations of the chip

CDF file is the Chip Description File describing whichprobes go in which probe sets (genes, genefragments, ESTs)

The bioconductor functions (especially from affypackage) use the CEL files

Benjamin Haibe-Kains Bioinformatics Courses – p.20/33

Page 22: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Expression QuantificationFor each probe set, summarization of the probe leveldata (11-20 PM and MM pairs) into a single expressionmeasure

RMA procedure

use only PM and ignore MM

adjust for background on the raw intensity scale

carry out quantile normalization [Bolstad et al., 2003]of PM − BG and call the result n(PM − BG)

Take log2 of normalized background adjusted PM

Carry out a medianpolish of the quantitieslog2n(PM − BG)

Benjamin Haibe-Kains Bioinformatics Courses – p.21/33

Page 23: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Quantile Normalization

Benjamin Haibe-Kains Bioinformatics Courses – p.22/33

Page 24: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Quantile Normalization(2)

Benjamin Haibe-Kains Bioinformatics Courses – p.23/33

Page 25: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Quantile Normalization(3)

Benjamin Haibe-Kains Bioinformatics Courses – p.24/33

Page 26: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Quantile Normalization(4)

Benjamin Haibe-Kains Bioinformatics Courses – p.25/33

Page 27: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Bioconductor Functions

Most of the functions are in the affy package

> library(affy) #load the library> library(help=affy) #help about the librarycontents

All the CEL files have to be read in an AffyBatch object

Benjamin Haibe-Kains Bioinformatics Courses – p.26/33

Page 28: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Bioconductor Functions(2)AffyBatch structure (see ?AffyBatch)

cdfName : object of class character representing thename of CDF file associated with the arrays in theAffyBatch (e.g. hgu133plus2)

exprs : object of class matrix inherited from exprSet.The matrix contains one probe per row and one chipper column

phenoData : object of class phenoData inheritedfrom exprSetannotation : object of class character identifying theannotation that may be used for the chips

description : object of class MIAME (MinimalInformation About Microarry Experiment)

Benjamin Haibe-Kains Bioinformatics Courses – p.27/33

Page 29: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Bioconductor Functions(3)

AffyBatch creation (see ?read.affybatch)

> abatch <- read.affybatch(filenames, phenoData,description, verbose=TRUE)

Remarks

the AffyBatch class is an extension of the exprSetclass

filenames is an object of class character containingthe whole paths to CEL files

all the CEL files have to come from the same chip(e.g. hgu133plus2)

Benjamin Haibe-Kains Bioinformatics Courses – p.28/33

Page 30: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Bioconductor Functions(4)

RMA function (see ?rma)

> eset <- rma(abatch, verbose=TRUE,normalize=TRUE, background=TRUE)

Remarks

rma function for background correction

quantile function for normalization

pmonly function for probe specific correction

medianpolish function for summarization

Benjamin Haibe-Kains Bioinformatics Courses – p.29/33

Page 31: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Bioconductor Functions(5)

Due to memory limitations, just.rma function can beused without create an AffyBatch object (see ?just.rma)

> eset <- just.rma(filenames, phenoData,description, verbose=TRUE, background=TRUE,normalize=TRUE)

Benjamin Haibe-Kains Bioinformatics Courses – p.30/33

Page 32: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Conclusion

AFFYMETRIX c© is a widespread technology and usedin a large number of studies

R and Bioconductor are a gold mine to analyzeAFFYMETRIX c© data

Preprocessing methods have an important impact onthe high-level analyses

RMA methods seem very efficient (see bias/variancestudies with spike-in and dilution experiments[Bolstad et al., 2003])

Benjamin Haibe-Kains Bioinformatics Courses – p.31/33

Page 33: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Conclusion(2)

Take care about memory consumption ... Need torewrite some methods (C language, parallelprogramming) ?

I propose a project about the parallelization of theRMA methods in order to manage large datasets(current limit is about 250 chips)the project description is available from the courseweb page or my homepage

The slides of this presentation are available from thecourse web page or my homepage

A practical course will be given february 4, 2005 at10 to 12h

Benjamin Haibe-Kains Bioinformatics Courses – p.32/33

Page 34: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

Links

Course web page :http://www.ulb.ac.be/di/map/gbonte/DEA_appli.html

Personal homepage :http://www.ulb.ac.be/di/map/bhaibeka/

Benjamin Haibe-Kains Bioinformatics Courses – p.33/33

Page 35: Analysis of Genomic and Proteomic Data …°c Technology and Preprocessing Methods Benjamin Haibe-Kains ... cost of the devices and the chips ... use the CEL les

References

[Bolstad et al., 2003] Bolstad, B. M., Irizarry, R. A., Astrand,M., and TP, T. S. (2003). A comparison of normalizationmethods for high density oligonucleotide array data basedon variance and bias. Bioinformatics, 19(2):185–193.

[Irizarry et al., 2003] Irizarry, R. A., Bolstad, B. M., Collin, F.,Cope, L. M., Hobbs, B., and Speed, T. P. (2003). Summariesof affymetrix genechip probe level data. Nucleic Acids Re-search, 31(4):e15.

33-1