introductiontogeneexpression microarray* · pdf...

Introduction to gene expression microarray data analysis

Outline

• Brief introduction:– Technology and data.– Statistical challenges in data analysis.

• Preprocessing – data normalization and transformation.

• Useful Bioconductor packages.

A short history

• Evolved from Southern blotting, which is a procedure to detect and quantify a specific DNA sequence.

• Gene expression microarray can be thought as parallelized Southern blotting experiments.

• First influential paper: Schenaet al. (1995) Science. – study the expression of 45 Arabidopsisgenes.

• Very popular for the past 20 years. Searching “gene expressionmicroarray” on PubMed returns 50,000+ hits.

Still microarray?

• Microarray is still widely used because of lower costs, easier experimental procedure and more established analysis methods.

• Similar problems are presented in newer technologies such as RNA-‐seq, and similar statistical techniques can be borrowed.

Introduction to GE microarray technology and design

Goal: measure mRNA abundance

gene

The amount of these matters! But they are difficult to measure.

The amount of these is easy to measure. And it is positively correlated with the protein amount!

DNA(2 copies)

mRNA(multiple copies)

Protein(multiple copies)

Gene expression microarray design

TTAAGTCGTACCCGTGTACGGGCGCAATTCAGCATGGGCACATGCCCGCG

• A collection of DNA spot on a solid surface.

• Each spot contains many copies of the same DNA sequence (called “probes”).– Probe sequences are designed to target

specific genes.

• Genes with part of its sequence complementary to a probe will hybridized on (stick to) that probe.

• The amount of hybridization on each probe measures the amount of mRNA for its target gene.

Experimental procedure

wet lab: perform experiment

dry lab: data analysis

Available platforms

• Affymetrix• Agilent• Nimblegene• Illumina• ABI• Spotted cDNA

Affymetrix Gene expression arrays

The Affymetrix platform is one of the most widely used.

http://www.affymetrix.com/

Affymetrix GeneChip array design

Use U133 system for illustration:• Around 20 probes per gene.• Not necessarily evenly spaced: sequence property matters.• The probes are located at random locations on the chip to average out the

effects of the array surface.

TTAAGTCGTACCCGTGTACGGGCGC Target sequence

AATTCAGCATGGGCACATGCCCGCG Perfect match (PM) probe: measure signalsAATTCAGCATGGACACATGCCCGCG Mis-‐match (MM) probe: measure background

One-‐color vs. two-‐color arrays

• Two-‐color (two-‐channel) arrays hybridize two samples on the same array with different colors (red and green). – Each spot produce two numbers. – Agilent, Nimblegen

• One-‐color (single-‐channel) arrays hybridize one sample per array.– Easier when comparing multiple groups.– Have to use twice as many arrays.– Affymetrix, Illumina.

Data from microarray• Data are fluorescent intensities:

– extracted from the images with artifacts (e.g., cross-‐talk) removed, which Involves many statistical methods.

– Final data are stored in a matrix: row for probes, column for samples.

– For each sample, each probe has one number from one-‐color arrays and two numbers for two-‐color arrays.

sample1 sample2 sample3 sample41007_s_at 8.575758 8.915618 9.150667 8.9678701053_at 6.959002 7.039825 6.898245 7.136316117_at 7.738714 7.618013 7.499127 7.610726121_at 10.114529 10.018231 10.003332 9.8090681255_g_at 5.056204 4.759066 4.629297 4.6734581294_at 8.009337 7.980694 8.343183 8.0253351316_at 6.899290 7.045843 6.976185 7.0630501320_at 7.218898 7.600437 7.433031 7.2019841405_i_at 6.861933 6.042179 6.165090 6.2006711431_at 5.073265 5.114023 5.159933 5.063821...

Microarray data measure the “relative” levels of mRNA abundance

• Expression levels for different genes on the same array are not directly comparable.

• Expression levels for the same genes from different arrays can be compared, after proper normalization.

• All statistical inferences are for relative expressions, e.g., “the expression of gene X is higher in caner compared to normal”.

Statistical challenges

• Data normalization: remove systematic technical artifacts.– Within array: variations of probe intensities are caused by:

• cross-‐hybridization: probes capture the “wrong” target.• probe sequence: some probes are “sticker”.• others: spot sizes, smoothness of array surface, etc.

– Between array: intensity-‐concentration response curve can be different from different arrays, caused by variations in sample processing, image reader, etc.

• Summarization of gene expressions:– summarize values for multiple probes belonging to the same gene into

one number.• Differential expression detection:

– Find genes that are expressed differently between different experimental conditions, e.g., cases and controls.

Gene expression microarray data normalization

Normalization

• Artifacts are introduced at each step of the experiment:– Sample preparation: PCR effects.– Array itself: array surface effects, printing-‐tip effects. – Hybridization: non-‐specific binding, GC effects.– Scanning: scanner effects.

• Normalization is necessary before any analysis to ensure differences in intensities are due to differential expressions, not artifacts.

Within-‐ and between-‐array normalization

• Within-‐array: normalization at each array individually to remove array-‐specific artifacts.

• Between-‐array: to adjust the values from different arrays and put them at the same baseline, so that numbers are comparable.

• The methods are different for one-‐ and two-‐color arrays.

Within array normalization, two-‐color

• Most common problem is intensity dependent effect: log ratios of intensities from two channels depends on the total intensity.

• Most popular: loess normalization.

MA plot

• Widely used diagnostic plot for microarray data (Yang et al. 2002, Nucleic Acids Research).

• Also used for sequencing data now. • For spot i, let Ri and Gi be the intensities, define:

– Mi=log2Ri-‐log2Gi, A=(log2Ri+log2Gi)/2.– M measures relative expression, A measures total expression.

• Visualize relative vs. total expression dependence.

2

MostMost Common ProblemCommon Problem

Intensity dependent effect: Different

background level most likely culprit

Scatter PlotScatter Plot

Demonstrates importance of MA plot

Two-color platformsTwo-color platforms

• Platforms that use printing robots areprone to many systematic effects:

– Dye

– Print-tip

– Plates

– Print order

– Spatial

• Some examples follow

Loess normalization

• Based on the assumptions that: (1) most genes are not DE (with M=0) and (2) M and A are independent, MA plot should be flat and centered at 0.

• Normalization procedure:– Fit a smooth curve of M vs. A using loess, e.g., M=f(A)+ε, f(.) is smooth.

– Mnorm=M-‐f(A)

– loess (lowess): locally weighted scatterplot smoothing. – method to fit a smooth curve between two variables .

Loess normalization: before and after

Within array normalization: one-‐color

• RMA (Robust Multi-‐array Average) background model (Irizarry et al. 2003, Biostatistics).

• Idea: observed intensity Y is composed of the true intensity S(exponentially distributed) and a random background noise B (normally distribute).

• For each array, assume:Y = S +BSignal: S ~ Exp(λ)Background: B ~ N(µ,σ 2 ) left-truncated at zero

Simple derivation

• Observed: Y; of interest: S. • The idea is to predict S from Y using :

• The joint: • Marginal distribution of Y can be derived.

E[S |Y ]

E[S |Y ]= s f (∫ s |Y = y)ds = s f (s,Y = y)fY (y)

∫ ds = 1fY (y)

s f (∫ s,Y = y)ds

f (s,Y = y) = f (s,B = y− s) = fS (s) fB (y− s)

fY (y)

An extension to consider probe sequence effects: GCRMA

for a given gene become correlated across experimental units. This makes obtaining reliable esti-

mates of uncertainty difficult. However, a multi-array versions of our model motivates a method

that performs background adjustment, normalization, and summarization as part of the estimation

procedure. Using this approach one can compute standard error estimates that account for the three

steps.

For example, if we were comparing gene expression across different conditions, each contain-

ing various arrays, we could write the following model based on (2):

Ygi j = Ogi j +Ngi j +Sgi j (5)

= Ogi j + exp(µgi j + εgi j)+ exp(sg+δgXi+agi j +bi+ξgi j). (6)

Here Ygi j is the PM intensity for the probe j in probeset g on array i, εgi j is a normally distributed

error that account for NSB for the same probe behaving differently in different arrays, sg repre-

sents the baseline log expression level for probeset g, agi j represents the signal detecting ability

of probe j in gene g on array i, bi is a term used to describe the need for normalization, ξgi j is a

normally distributed term that accounts for the multiplicative error, and δg is the expected differ-

ential expression for every unit difference in covariate X . Notice δg is the parameter of interest. As

described by Naef and Magnasco (2003) ag j is a function of α.

With this model in place one may obtain point estimates and standard errors for δg using, say,

theMLE.With appropriate priors in place one could also obtain Bayesian estimates. However, both

these approaches are computationally difficult. In Figure 5 we present some preliminary results

obtained using generalized estimating equations to estimate δ. A difficulty with this approach is

that when Sgi j = 0 in (5) then (6) is not defined. However, preliminary results look promising and

making this approach useful in practice is the subject of current research.

18

http://www.bepress.com/jhubiostat/paper1

for a given gene become correlated across experimental units. This makes obtaining reliable esti-

mates of uncertainty difficult. However, a multi-array versions of our model motivates a method

that performs background adjustment, normalization, and summarization as part of the estimation

procedure. Using this approach one can compute standard error estimates that account for the three

steps.

For example, if we were comparing gene expression across different conditions, each contain-

ing various arrays, we could write the following model based on (2):

Ygi j = Ogi j +Ngi j +Sgi j (5)

= Ogi j + exp(µgi j + εgi j)+ exp(sg+δgXi+agi j +bi+ξgi j). (6)

Here Ygi j is the PM intensity for the probe j in probeset g on array i, εgi j is a normally distributed

error that account for NSB for the same probe behaving differently in different arrays, sg repre-

sents the baseline log expression level for probeset g, agi j represents the signal detecting ability

of probe j in gene g on array i, bi is a term used to describe the need for normalization, ξgi j is a

normally distributed term that accounts for the multiplicative error, and δg is the expected differ-

ential expression for every unit difference in covariate X . Notice δg is the parameter of interest. As

described by Naef and Magnasco (2003) ag j is a function of α.

With this model in place one may obtain point estimates and standard errors for δg using, say,

theMLE.With appropriate priors in place one could also obtain Bayesian estimates. However, both

these approaches are computationally difficult. In Figure 5 we present some preliminary results

obtained using generalized estimating equations to estimate δ. A difficulty with this approach is

that when Sgi j = 0 in (5) then (6) is not defined. However, preliminary results look promising and

making this approach useful in practice is the subject of current research.

18


Wu et al. (2005) JASA

Probe sequence effects

• Probe affinity is modeled as:

• This kind of modeling is widely used in other microarrays and sequencing data!

base effects:

α=25

∑k=1

∑j⇤{A,T,G,C}

µj,k1bk= j with µj,k =3

∑l=0

β j,lkl, (1)

where k= 1, . . . ,25 indicates the position along the probe, j indicates the base letter, bk represents

the base at position k, 1bk= j is an indicator function that is 1 when the k-th base is of type j and 0

otherwise, and µj,k represents the contribution to affinity of base j in position k. For fixed j, the

effect µj,k is assumed to be a polynomial of degree 3. The model is fitted to log intensities from

many arrays using least squares (Naef and Magnasco, 2003).

We adapt this idea to help describe the NSB component. We fit (1) to our NSB experiment log

intensity data using a spline with 5 degrees of freedom instead of a polynomial of degree 3. The

least squares estimates µ̂j,k are shown in Figure 3. This Figure is similar to Figure 3 in Naef and

Magnasco (2003). In Section 3 we use these affinity estimates to describe NSB noise.

Zhang et al (Zhang et al., 2003) propose using a positional-dependent-nearest-neighbor (PDNN)

model which is based on hybridization theory. This model takes into account interactions between

bases that are physically close. However, Naef and Magnasco (2003) demonstrate that these in-

teractions do not add much predictive power for specific signal probe effects. Similar results for

prediction of NSB are found in Wu and Irizarry (2004). Figure 4 shows background adjusted

log2(PM) against α for the NSB data. Notice the affinities predict NSB quite well. Almost as

well as theMM intensities. The advantage of the affinities over the MM is that they will not detect

signal since they are pre-computed numbers.

12


A

A

A

AA

AA A A A A A A A A

AA

AA

AA

AA

AA

5 10 15 20 25

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

C

C

C

CC

CC C C C C C C C C C

CC

CC

CC

CC

CG G G G G G G G G G G G G G G G G G G G G G G G GT T T T T T T T T T T T T T T T T T T T T T T T T

Figure 3: The effect of base A in position k, µA,k, is plotted against k. Similarly for the other threebases.

3 Background Model

In this section we propose a statistical model that motivates useful background adjustments. For

any particular probe-pair we assume that:

PM = OPM +NPM +S (2)

MM = OMM +NMM +φS.

Here O represents optical noise, N represents NSB noise and S is a quantity proportional to RNA

expression (the quantity of interest). The parameter 0 < φ < 1 accounts for the fact that for

some probe-pairs the MM detects signal. We assume O follows a log-normal distribution and

that log(NPM) and log(NMM) follow a bivariate-normal distribution with means of µPM and µMM

and the variance var[log(NPM] = var[log(NMM)]� σ2 and correlation ρ constant across probes. We

assume µPM � h(αPM) and µMM � h(αMM), with h a smooth (almost linear) function and the αs

defined by (1). Because we do not expect NSB to be affected by optics we assume O and N are

13

Hosted by The Berkeley Electronic Press

Summary: within array normalization

• To remove the unwanted artifacts and obtain true signals.

• Performed at each array individually.• Both MA-‐plot based normalization and background error models (eg, RMA) are popular in many other data (other microarrays, ChIP-‐seq, RNA-‐seq) – Use loess with caution because it assumes most genes are not DE.

– The error model (additive background, multiplicative error) is very useful.

Between array normalization

• Remember data from arrays (intensity values) estimate mRNA quantities, but the intensity-‐mRNA quantities response can be different from different arrays. So a number, say, 5, on arrays 1 doesn’t mean the same on array 2.

• This could be caused by:– Total amount of mRNA used– Properties of the agents used.– Array properties– Settings of laser scanners– etc.

• These artifacts cannot be removed by within array normalization.

• Goal: normalize so that data from different arrays are comparable!

Linear scaling method

• Used in Affymetrix software MAS:– Use a number of “housekeeping” genes and assume their expressions are identical across all arrays.

– Shift and rescale all data so the average expression of these genes are the same across all arrays.

Non-‐linear smoothing based

• Implemented in dChip (Li and Wong 2001, Genome Bio.)– Find a set of genes invariant across arrays. – Find a “baseline” array– For every other arrays fit a smooth curve on expressions of invariant genes

– Normalize based on the fitted curve.

dChip normalization

Acknowledgements!"#$%&'(#)*"'#+"#,-./#0&'#1&'2/#34(#56-7'/#)$&'#3"8.-'/#9&"

:;#<""/#=&6-'#>&(&(/#9-%'#!&8("6#&'+#?64'+&@#5%&$$&A%&6B""

C-6#D6-*4+4'2#+&$&/#&'+#$%"#"+4$-6#&'+#6"C"6"".#7%-#D6-*4+"+

*&8E&F8"# .E22".$4-'.;#1%4.#7-6(# 4.# .EDD-6$"+# 4'#D&6$#FG#3H>

26&'$#I#JKI#>LMNOPIQMI#&'+#3)R#26&'$#05HQSSMPTMI;

References1. Li C, Wong WH: Model-based analysis of oligonucleotide

arrays: expression index computation and outlier detection.Proc Natl Acad Sci USA 2001, 98:31-36.

2. Hakak Y, Walker JR, Li C, Wong WH, Davis KL, Buxbaum JD,Haroutunian V, Fienberg AA: Genome-wide expression analy-sis reveals dysregulation of myelination-related genes inchronic schizophrenia. Proc Natl Acad Sci USA 2001, 98:4746-4751.

3. Wodicka L, Dong H, Mittmann M, Ho M, Lockhart D: Genome-wide expression monitoring in Saccharomyces cerevisiae. NatBiotechnol 1997, 15:1359-1367.

4. Wallace D: The Behrens-Fisher and Fieller-Creasy problems.In Lecture Notes in Statistics 1, R.A.Fisher: An Appreciation. Edited byFienberg SE, Hinkley DV. Springer-Verlag 1988, 119-147.

5. Cox, DR, Hinkley DV: Theoretical Statistics. London: Chapman andHall, 1974.

6. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis anddisplay of genome-wide expression patterns. Proc Natl Acad SciUSA 1998, 95:14863-14868

7. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E,Lander E, Golub T: Interpreting patterns of gene expressionwith self-organizing maps: methods and application tohematopoietic differentiation. Proc Natl Acad Sci USA 1999,96:2907-2912.

8. Efron B, Tibshirani R: An Introduction to the Bootstrap. New York:Chapman & Hall/CRC, 1993.

9. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Rad-macher M, Simon R, Yakhinik Z, Ben-Dork A, et al.: Molecular clas-sification of cutaneous malignant melanoma by geneexpression profiling. Nature 2000, 406:536-540.

10. DNA-Chip Analyzer [http://www.dchip.org] 11. Schadt EE, Li C, Su C, Wong WH: Analyzing high-density

oligonucleotide gene expression array data. J Cell Biochem2001, 80:192-202.

comm

entreview

sreports

deposited researchinteractions

information

refereed research

http://genomebiology.com/2001/2/8/research/0032.11

Figure 10Similar plots as in Figure 9 for arrays hybridized to two different samples (array 24 and 36 of array set 5). (a) CEL intensities;(b) same plot as in (a) with superimposed circles representing the invariant set; (c) after renormalization; (d) Q-Q plot ofnormalized probe intensities. Note that the smoothing spline in (a) is affected by several points at the lower-right corner,which might belong to differentially expressed genes. The invariant set, on the other hand, does not include these pointswhen determining the normalization curve, leading to a different normalization relationship at the high end.

Quantile normalization

Proposed in Bolstad et al. 2003, Bioinformatics: • Force the distribution of all data from all arrays to be the

same, but keep the ranks of the genes. • Procedures:

1. Create a target distribution, usually use the average from all arrays. 2. For each array, match its quantiles to that of the target. To be

specific: xnorm = F2-‐1(F1(x)):• x: value in the chip to be normalized• F1: distribution function in the array to be normalized• F2: target distribution function

33

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 2 7 153 3 6 5 84 1 5 2 95 9 13 6 11

A simple example for quantile normalization

34

1. Find the Smallest Value for each sample

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 2 7 153 3 6 5 84 1 5 2 95 9 13 6 11

2. Average them

(1+2+2+8)/4=3.25

35

3. Replace Each Value by the Average

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 3.25 7 153 3 6 5 3.254 3.25 5 3.25 95 9 13 6 11

36

4. Find the Next Smallest Values, then average

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 3.25 7 153 3 6 5 3.254 3.25 5 3.25 95 9 13 6 11

(3+5+5+9)/4=5.5

37

5. Replace Each Value by the Average

Gene sample1 sample2 sample3 sample41 8 15 9 13 2 7 3.25 7 153 5.50 6 5.50 3.254 3.25 5.50 3.25 5.505 9 13 6 11

38

6. Continue the process, we get the following matrix after finishing:

Gene sample1 sample2 sample3 sample41 10.25 12.00 12.00 10.25 2 7.50 3.25 10.25 12.003 5.50 7.50 5.50 3.254 3.25 5.50 3.25 5.505 12.00 10.25 7.50 7.50

The result matrix has following properties:• The values taken in each column are exactly the same.• The ranks of genes in each column are the same as

before normalization.

Before/after QN boxplot

Summary: between-‐array normalization

• Must do before comparing different arrays.• Same problems exist in sequencing data.• Quantile normalization is too strong and often remove the true signals, use with caution.

Microarray data summarization

• There are usually multiple probes corresponding to a gene. The task is to summarize the readings from these probes into one number to represent the gene expression.

• Naïve methods: mean, median.

• From MAS 5.0: use one-‐step Tukey Biweight (TBW) to obtain a robust weighted mean that is resistant to outliers.– Probes with intensities far away from median will have smaller

weights in the average.

• dChip (Li & Wong, 2001): model based on PM-‐MM.

• Borrow information from multiple samples to estimate probe effects.

• Model-‐fitting: Median Polish (robust against outliers)• Iteratively removing the row and column medians until convergence

• The remainder is the residual;

• After subtracting the residual, the rowmedians are the estimates of the expression, and column medians are probe effects.

RMA summarization

Irizarry et al. (2003) Biostatistics.

Comparison of different methods

• Some works are done for Affymetrix: – Cope et al. (2004) Bioinformatics.– Irizarry et al. (2006) Bioinformatics.

• Gold standard: “spike-‐in” data in with the true expressions are known.

• Criteria:– Accuracy: how close are the estimated values to the truth.– Precision: how variable are the estimates. – Overall detection ability.

Results: accuracy vs. precisionBenchmark for Affymetrix GeneChip expression measures

expression across replicate arrays, (2) response of expressionmeasure to changes in abundance of RNA, (3) sensitivity offold-change measures to the amount of actual RNA sample,(4) accuracy of fold-change as ameasure of relative expressionand (5) usefulness of raw fold-change score for the detectionof differential expression.When multiple expression measures are to be compared

directly, a comparative Image Report can be generated. Smalldifferences between the basic and comparative plots arementioned in the descriptions below. The webtool and theR package alike can readily generate either report. Examplesof these reports can be found on the webtool webpage. In thenext section, we describe two applications where the ImageReport is useful. All examples of figures included here arefrom those applications.All data is plotted on the log2 scale. The log transformation

is made at the beginning and all intermediate transformationsare made to the logged data. Fold changes are calculatedfor single chip comparisons, even where replicates are avail-able. This standardizes the procedure for those cases in whichreplicates are not available, and gives results that depend aslittle as possible on the the specific design of the benchmarkdatasets.

NotationThe spike-in data set encompasses more than one set ofexperimental conditions. Within each experiment, only thespike-in concentrations are varied; background is the same forall arrays. Fold-change calculations are always made withinexperiment to ensure that only spiked-in genes will be dif-ferentially expressed. We refer to spike-in concentrations asnominal concentration and their ratios as nominal fold-changeas a reminder that small errors in the actual quantity of RNA inthe sample prevent us from knowing the true concentrations.We will use M to denote log observed fold-change and FCas an abbreviation of fold change. Notation for formulas is asfollows:Dilution study: Let ytdrg represent log2 expression for tissue

t = 1, 2, dilution d = 1, . . . , 6, replicate r = 1, . . . , 5 andgene g = 1, . . . , n.

Spike-in study: A different variable is used to avoid con-fusion. Let xecg represent log2 expression for experimente = 1, 2, 3, array type c = 1, 2, . . . , 20 and gene g = 1, . . . , n.For the spike-in genes only, χcg represents the log2 of thenominal concentration.

The plots and statistics(1)MA plot: TheMAplot shows log fold-change as a functionof mean log expression level. This has come to be a funda-mental graphical tool for the analysis of expression array data.A great deal of information about the distribution of observedfold-changes can be read from such a plot. A set of 14 arraysrepresenting a single experiment from theAffymetrix spike-in

0 5 10 15

–10

–5

0

5

10

MAS 5.0

A

M 11 1 1 11 11 1 11112222

22 22 22

12

22333 3

33 3333

1111

344 4

4 44

44

4

1010

10

455 5 5 55

55

999

9

566 6 6 6

6

6

8 888

8

77 7 77

7

7

77777

7

88 8 88

66 66666

6

999 9

5

55 55555

5

1010

10

4

44

44

44444

1111

333

33 33 33

33

12

2 2 22 22 22222

2

1 1 1 1111 1

11111

∞

∞∞ ∞

∞

∞

∞

∞

∞

∞

∞

∞

∞

∞

∞

∞

∞∞

∞

∞

∞

∞

∞∞

∞∞

− ∞

− ∞

− ∞

− ∞− ∞− ∞

− ∞

− ∞− ∞

− ∞

− ∞− ∞− ∞

− ∞

0 5 10 15

–10

–5

0

5

10

RMA

A

M

1 11 1 11 11 1 11112 22

222 22 2

2

12

223 33

333 33 3

3

11

11

34 4

44

44 444

10

10

10

45 5

5 5 55 55

9

9

9

9

56 6

6 6 66 6

88

888

7 77 7 77

7 7 77777

8 88 88

6

66 6

6666

9 99 9

55

5 5 5555

5

10 1010

44

44

4 44444

1111

333 3

3 3 333

33

12

2 222 22 2 22222

11 1 11 11 1 11111∞∞

∞∞ ∞∞ ∞∞∞∞

∞∞∞∞ ∞∞

∞∞∞∞

∞∞∞

∞∞∞

− ∞

− ∞− ∞

− ∞

− ∞− ∞− ∞− ∞

− ∞− ∞

− ∞− ∞− ∞− ∞

Fig. 1. TheMAplot shows log fold-change as a function ofmean logexpression level. A set of 14 arrays representing a single experimentfrom the Affymetrix spike-in data are used for this plot. A total of13 sets of fold changes are generated by comparing the first arrayin the set to each of the others. Genes are symbolized by numbersrepresenting the nominal log2 fold-change for the gene. Genes withobserved fold-changes larger than 2 are outside the horizontal lines.All other probesets are represented with black dots.

data are used for this plot. We choose 14 so that all possiblecombinations of pairs of concentrations appear once. A totalof 13 sets of fold changes are generated by comparing the firstarray in the set to each of the others, so thatM1cg = x11g−x1cgfor c = 2, . . . , 14. These are plotted together against themean values A1cg = (x11g + x1cg)/2. To make the plot moreinformative, spiked-in genes are symbolized by numbers rep-resenting the nominal log2 fold-change for the gene. In thepackage, non-differentially expressed genes with observedfold-changes larger than 2 are plotted in red (in the black andwhite version in this paper we use horizontal lines to denotea fold-change of 2). All other probesets are represented withblack dots.MA plots for competing methods are presented side by side

on separate axes. Figure 1 shows a comparative plot for twoexpression measures.

325

Results: DE detection ability

between 4 and 32) and high expressed (nominal concentration >32).For each of these subgroups we followed the same procedure usedto compute the signal detect slope. Specifically, a regression linewas fitted to the observed log expression values for the spiked-ingenes using nominal concentration as the predictor, and the slopeestimate recorded as the assessment measure. The new assessmentmeasures are referred to as low, med and high slopes and areshown in Table 2.To better assess the concentration dependent bias, we added the

plot shown in Figure 3b to the benchmark. In this figure, local slopesare calculated by taking the difference between the averageobserved log expression values between consecutive nominal con-centration levels. The difference between 1 and these local slopesare plotted against the larger of the two concentration levels. We

subtract from 1 because we are in the log-scale, thus all theseslopes should be 1 (when nominal concentration doubles soshould the observed concentrations). Notice that these curves followan upside-down U shape. This shape illustrates the fact that bias isworst for low expressed genes and high expressed genes. The lowexpressed genes are affected by background noise as describedabove. The high expressed genes are affected by scanner attenuation(not discussed in this paper).

3.2 Precision

To provide a more practical context for the new accuracy assess-ment measures, we defined the null log-fc 99.9% statistic shown inFigure 2. Row 6 in Table 1 of Cope et al. (2004) presented the inter-quartile range (IQR) of the observed log-fold-changes among thegenes that are known not to be differentially expressed. The newstatistics gives the 99.9% instead of the IQR. We have also added ameasure related to the Median SD represented by row 1 in Table 1ofCope et al. (2004). The previous measure used the dilution studydata. Similarly, a spike-in experiment version of Figure 2 in theoriginal benchmark was added.

0 20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

False Positives

Tru

e P

ositi

ves

VSNRMARMA_NBGGCRMAPLIER+16RSVD

(a) Nominal fold-chage = 2

0 20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

False Positives

Tru

e P

ositi

ves

(b) Low nominal concentration

Fig. 4. A typical identification rule for differential expression filters genes

with fold change exceeding a given threshold. This figure shows averageROC curves which offer a graphical representation of both specificity and

sensitivity for such a detection rule. (a) Average ROC curves based on

comparisons with nominal fold changes equal to 2. (b) As (a) but consideronly low concentration spiked-in genes.

Table 2. Table showing the new assessment summary statistics described in

the text

SlopeMethod SD 99.9% Low Med High AUC

GCRMA 0.08 0.74 0.66 1.06 0.56 0.70

GS_GCRMA 0.10 0.79 0.62 1.03 0.55 0.66

MMEI 0.04 0.23 0.16 0.54 0.46 0.62

GL 0.05 0.25 0.16 0.55 0.46 0.62RMA_NBG 0.04 0.24 0.16 0.56 0.46 0.61

RSVD 0.00 0.58 0.42 0.85 0.40 0.61

ZL 0.22 0.52 0.35 0.71 0.45 0.61VSN_scale 0.09 0.43 0.28 0.91 0.70 0.59

VSN 0.06 0.28 0.18 0.6 0.46 0.59

RMA_VSN 0.09 0.48 0.31 0.74 0.46 0.57

GLTRAN 0.07 0.42 0.23 0.61 0.45 0.55ZAM 0.09 0.50 0.30 0.70 0.47 0.54

RMA_GNV 0.11 0.58 0.35 0.76 0.47 0.52

RMA 0.11 0.57 0.35 0.76 0.47 0.52

GSrma 0.11 0.57 0.35 0.76 0.47 0.52GSVDmod 0.07 0.44 0.22 0.64 0.42 0.51

PerfectMatch 0.05 0.40 0.18 0.56 0.43 0.50

PLIER+16 0.13 0.83 0.49 0.80 0.46 0.48GSVDmin 0.08 0.60 0.22 0.62 0.41 0.41

MAS 5.0+32 0.14 1.07 0.35 0.71 0.44 0.12

ChipMan 0.27 2.26 0.44 1.11 0.68 0.12

qn.p5 0.12 1.09 0.13 0.50 0.52 0.11dChip PM-only 0.13 1.44 0.31 0.67 0.39 0.09

mmgMOSgs 0.40 3.27 1.34 1.13 0.45 0.07

gMOSv.1 0.29 3.35 0.98 1.12 0.42 0.06

ProbeProfiler 0.31 18.75 1.61 1.57 0.39 0.03dChip 0.23 14.83 1.40 0.86 0.35 0.02

mgMOS_gs 0.36 2.86 0.83 0.86 0.43 0.01

MAS 5.0 0.63 4.48 0.69 0.81 0.45 0.00

PLIER 0.19 123.27 0.75 0.85 0.46 0.00UMTrMn 0.32 2.92 0.58 0.83 0.42 0.00

The methods are ordered by their performance in the weighted average AUC value.

R.A.Irizarry et al.

792

Resources

• The affycomp R package in Bioconductor.• Online tools: http://rafalab.rc.fas.harvard.edu/affycomp.

Bioconductor for microarray data

• There is a rich collection of bioc packages for microarrays. In fact, Bioconductor started for microarray analysis.

• There are currently 200+ packages for microarray. • Important ones include:

– affy: one of the earliest bioc packages. Designed for analyzing data from Affymetrix arrays.

– oligo: preprocessing tools for many types of oligonecleotide arrays. This is designed to replace affy package.

– limma and siggenes: DE detection using limma and SAM-‐t model. – Many annotation data package to link probe names to genes.

• Data normalization and summarization can be done using oligo package (details next lecture).

Review

• We have covered microarray analysis, including:– Data preprocessing: within and between array normalization.

– Summarization.

• Next lecture: – DE detection for microarray.

To do list

• Review the slides.• Read the review article on Nature Reviews Genetics (link on the class webpage).

introduction*to*gene*expression* microarray* · pdf...

Documents

introductiontogeneexpression microarray* · pdf...