introduction*to*gene*expression* microarray* · pdf...
TRANSCRIPT
Introduction to gene expression microarray data analysis
Outline
• Brief introduction:– Technology and data.– Statistical challenges in data analysis.
• Preprocessing – data normalization and transformation.
• Useful Bioconductor packages.
A short history
• Evolved from Southern blotting, which is a procedure to detect and quantify a specific DNA sequence.
• Gene expression microarray can be thought as parallelized Southern blotting experiments.
• First influential paper: Schenaet al. (1995) Science. – study the expression of 45 Arabidopsisgenes.
• Very popular for the past 20 years. Searching “gene expressionmicroarray” on PubMed returns 50,000+ hits.
Still microarray?
• Microarray is still widely used because of lower costs, easier experimental procedure and more established analysis methods.
• Similar problems are presented in newer technologies such as RNA-‐seq, and similar statistical techniques can be borrowed.
Introduction to GE microarray technology and design
Goal: measure mRNA abundance
gene
The amount of these matters! But they are difficult to measure.
The amount of these is easy to measure. And it is positively correlated with the protein amount!
DNA(2 copies)
mRNA(multiple copies)
Protein(multiple copies)
Gene expression microarray design
TTAAGTCGTACCCGTGTACGGGCGCAATTCAGCATGGGCACATGCCCGCG
• A collection of DNA spot on a solid surface.
• Each spot contains many copies of the same DNA sequence (called “probes”).– Probe sequences are designed to target
specific genes.
• Genes with part of its sequence complementary to a probe will hybridized on (stick to) that probe.
• The amount of hybridization on each probe measures the amount of mRNA for its target gene.
Experimental procedure
wet lab: perform experiment
dry lab: data analysis
Available platforms
• Affymetrix• Agilent• Nimblegene• Illumina• ABI• Spotted cDNA
Affymetrix Gene expression arrays
The Affymetrix platform is one of the most widely used.
http://www.affymetrix.com/
Affymetrix GeneChip array design
Use U133 system for illustration:• Around 20 probes per gene.• Not necessarily evenly spaced: sequence property matters.• The probes are located at random locations on the chip to average out the
effects of the array surface.
TTAAGTCGTACCCGTGTACGGGCGC Target sequence
AATTCAGCATGGGCACATGCCCGCG Perfect match (PM) probe: measure signalsAATTCAGCATGGACACATGCCCGCG Mis-‐match (MM) probe: measure background
One-‐color vs. two-‐color arrays
• Two-‐color (two-‐channel) arrays hybridize two samples on the same array with different colors (red and green). – Each spot produce two numbers. – Agilent, Nimblegen
• One-‐color (single-‐channel) arrays hybridize one sample per array.– Easier when comparing multiple groups.– Have to use twice as many arrays.– Affymetrix, Illumina.
Data from microarray• Data are fluorescent intensities:
– extracted from the images with artifacts (e.g., cross-‐talk) removed, which Involves many statistical methods.
– Final data are stored in a matrix: row for probes, column for samples.
– For each sample, each probe has one number from one-‐color arrays and two numbers for two-‐color arrays.
sample1 sample2 sample3 sample41007_s_at 8.575758 8.915618 9.150667 8.9678701053_at 6.959002 7.039825 6.898245 7.136316117_at 7.738714 7.618013 7.499127 7.610726121_at 10.114529 10.018231 10.003332 9.8090681255_g_at 5.056204 4.759066 4.629297 4.6734581294_at 8.009337 7.980694 8.343183 8.0253351316_at 6.899290 7.045843 6.976185 7.0630501320_at 7.218898 7.600437 7.433031 7.2019841405_i_at 6.861933 6.042179 6.165090 6.2006711431_at 5.073265 5.114023 5.159933 5.063821...
Microarray data measure the “relative” levels of mRNA abundance
• Expression levels for different genes on the same array are not directly comparable.
• Expression levels for the same genes from different arrays can be compared, after proper normalization.
• All statistical inferences are for relative expressions, e.g., “the expression of gene X is higher in caner compared to normal”.
Statistical challenges
• Data normalization: remove systematic technical artifacts.– Within array: variations of probe intensities are caused by:
• cross-‐hybridization: probes capture the “wrong” target.• probe sequence: some probes are “sticker”.• others: spot sizes, smoothness of array surface, etc.
– Between array: intensity-‐concentration response curve can be different from different arrays, caused by variations in sample processing, image reader, etc.
• Summarization of gene expressions:– summarize values for multiple probes belonging to the same gene into
one number.• Differential expression detection:
– Find genes that are expressed differently between different experimental conditions, e.g., cases and controls.
Gene expression microarray data normalization
Normalization
• Artifacts are introduced at each step of the experiment:– Sample preparation: PCR effects.– Array itself: array surface effects, printing-‐tip effects. – Hybridization: non-‐specific binding, GC effects.– Scanning: scanner effects.
• Normalization is necessary before any analysis to ensure differences in intensities are due to differential expressions, not artifacts.
Within-‐ and between-‐array normalization
• Within-‐array: normalization at each array individually to remove array-‐specific artifacts.
• Between-‐array: to adjust the values from different arrays and put them at the same baseline, so that numbers are comparable.
• The methods are different for one-‐ and two-‐color arrays.
Within array normalization, two-‐color
• Most common problem is intensity dependent effect: log ratios of intensities from two channels depends on the total intensity.
• Most popular: loess normalization.
MA plot
• Widely used diagnostic plot for microarray data (Yang et al. 2002, Nucleic Acids Research).
• Also used for sequencing data now. • For spot i, let Ri and Gi be the intensities, define:
– Mi=log2Ri-‐log2Gi, A=(log2Ri+log2Gi)/2.– M measures relative expression, A measures total expression.
• Visualize relative vs. total expression dependence.
2
MostMost Common ProblemCommon Problem
Intensity dependent effect: Different
background level most likely culprit
Scatter PlotScatter Plot
Demonstrates importance of MA plot
Two-color platformsTwo-color platforms
• Platforms that use printing robots areprone to many systematic effects:
– Dye
– Print-tip
– Plates
– Print order
– Spatial
• Some examples follow
Loess normalization
• Based on the assumptions that: (1) most genes are not DE (with M=0) and (2) M and A are independent, MA plot should be flat and centered at 0.
• Normalization procedure:– Fit a smooth curve of M vs. A using loess, e.g., M=f(A)+ε, f(.) is smooth.
– Mnorm=M-‐f(A)
– loess (lowess): locally weighted scatterplot smoothing. – method to fit a smooth curve between two variables .
Loess normalization: before and after
Within array normalization: one-‐color
• RMA (Robust Multi-‐array Average) background model (Irizarry et al. 2003, Biostatistics).
• Idea: observed intensity Y is composed of the true intensity S(exponentially distributed) and a random background noise B (normally distribute).
• For each array, assume:Y = S +BSignal: S ~ Exp(λ)Background: B ~ N(µ,σ 2 ) left-truncated at zero
Simple derivation
• Observed: Y; of interest: S. • The idea is to predict S from Y using :
• The joint: • Marginal distribution of Y can be derived.
E[S |Y ]
E[S |Y ]= s f (∫ s |Y = y)ds = s f (s,Y = y)fY (y)
∫ ds = 1fY (y)
s f (∫ s,Y = y)ds
f (s,Y = y) = f (s,B = y− s) = fS (s) fB (y− s)
fY (y)
An extension to consider probe sequence effects: GCRMA
for a given gene become correlated across experimental units. This makes obtaining reliable esti-
mates of uncertainty difficult. However, a multi-array versions of our model motivates a method
that performs background adjustment, normalization, and summarization as part of the estimation
procedure. Using this approach one can compute standard error estimates that account for the three
steps.
For example, if we were comparing gene expression across different conditions, each contain-
ing various arrays, we could write the following model based on (2):
Ygi j = Ogi j +Ngi j +Sgi j (5)
= Ogi j + exp(µgi j + εgi j)+ exp(sg+δgXi+agi j +bi+ξgi j). (6)
Here Ygi j is the PM intensity for the probe j in probeset g on array i, εgi j is a normally distributed
error that account for NSB for the same probe behaving differently in different arrays, sg repre-
sents the baseline log expression level for probeset g, agi j represents the signal detecting ability
of probe j in gene g on array i, bi is a term used to describe the need for normalization, ξgi j is a
normally distributed term that accounts for the multiplicative error, and δg is the expected differ-
ential expression for every unit difference in covariate X . Notice δg is the parameter of interest. As
described by Naef and Magnasco (2003) ag j is a function of α.
With this model in place one may obtain point estimates and standard errors for δg using, say,
theMLE.With appropriate priors in place one could also obtain Bayesian estimates. However, both
these approaches are computationally difficult. In Figure 5 we present some preliminary results
obtained using generalized estimating equations to estimate δ. A difficulty with this approach is
that when Sgi j = 0 in (5) then (6) is not defined. However, preliminary results look promising and
making this approach useful in practice is the subject of current research.
18
http://www.bepress.com/jhubiostat/paper1
for a given gene become correlated across experimental units. This makes obtaining reliable esti-
mates of uncertainty difficult. However, a multi-array versions of our model motivates a method
that performs background adjustment, normalization, and summarization as part of the estimation
procedure. Using this approach one can compute standard error estimates that account for the three
steps.
For example, if we were comparing gene expression across different conditions, each contain-
ing various arrays, we could write the following model based on (2):
Ygi j = Ogi j +Ngi j +Sgi j (5)
= Ogi j + exp(µgi j + εgi j)+ exp(sg+δgXi+agi j +bi+ξgi j). (6)
Here Ygi j is the PM intensity for the probe j in probeset g on array i, εgi j is a normally distributed
error that account for NSB for the same probe behaving differently in different arrays, sg repre-
sents the baseline log expression level for probeset g, agi j represents the signal detecting ability
of probe j in gene g on array i, bi is a term used to describe the need for normalization, ξgi j is a
normally distributed term that accounts for the multiplicative error, and δg is the expected differ-
ential expression for every unit difference in covariate X . Notice δg is the parameter of interest. As
described by Naef and Magnasco (2003) ag j is a function of α.
With this model in place one may obtain point estimates and standard errors for δg using, say,
theMLE.With appropriate priors in place one could also obtain Bayesian estimates. However, both
these approaches are computationally difficult. In Figure 5 we present some preliminary results
obtained using generalized estimating equations to estimate δ. A difficulty with this approach is
that when Sgi j = 0 in (5) then (6) is not defined. However, preliminary results look promising and
making this approach useful in practice is the subject of current research.
18
http://www.bepress.com/jhubiostat/paper1
Wu et al. (2005) JASA
Probe sequence effects
• Probe affinity is modeled as:
• This kind of modeling is widely used in other microarrays and sequencing data!
base effects:
α=25
∑k=1
∑j⇤{A,T,G,C}
µj,k1bk= j with µj,k =3
∑l=0
β j,lkl, (1)
where k= 1, . . . ,25 indicates the position along the probe, j indicates the base letter, bk represents
the base at position k, 1bk= j is an indicator function that is 1 when the k-th base is of type j and 0
otherwise, and µj,k represents the contribution to affinity of base j in position k. For fixed j, the
effect µj,k is assumed to be a polynomial of degree 3. The model is fitted to log intensities from
many arrays using least squares (Naef and Magnasco, 2003).
We adapt this idea to help describe the NSB component. We fit (1) to our NSB experiment log
intensity data using a spline with 5 degrees of freedom instead of a polynomial of degree 3. The
least squares estimates µ̂j,k are shown in Figure 3. This Figure is similar to Figure 3 in Naef and
Magnasco (2003). In Section 3 we use these affinity estimates to describe NSB noise.
Zhang et al (Zhang et al., 2003) propose using a positional-dependent-nearest-neighbor (PDNN)
model which is based on hybridization theory. This model takes into account interactions between
bases that are physically close. However, Naef and Magnasco (2003) demonstrate that these in-
teractions do not add much predictive power for specific signal probe effects. Similar results for
prediction of NSB are found in Wu and Irizarry (2004). Figure 4 shows background adjusted
log2(PM) against α for the NSB data. Notice the affinities predict NSB quite well. Almost as
well as theMM intensities. The advantage of the affinities over the MM is that they will not detect
signal since they are pre-computed numbers.
12
http://www.bepress.com/jhubiostat/paper1
A
A
A
AA
AA A A A A A A A A
AA
AA
AA
AA
AA
5 10 15 20 25
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
C
C
C
CC
CC C C C C C C C C C
CC
CC
CC
CC
CG G G G G G G G G G G G G G G G G G G G G G G G GT T T T T T T T T T T T T T T T T T T T T T T T T
Figure 3: The effect of base A in position k, µA,k, is plotted against k. Similarly for the other threebases.
3 Background Model
In this section we propose a statistical model that motivates useful background adjustments. For
any particular probe-pair we assume that:
PM = OPM +NPM +S (2)
MM = OMM +NMM +φS.
Here O represents optical noise, N represents NSB noise and S is a quantity proportional to RNA
expression (the quantity of interest). The parameter 0 < φ < 1 accounts for the fact that for
some probe-pairs the MM detects signal. We assume O follows a log-normal distribution and
that log(NPM) and log(NMM) follow a bivariate-normal distribution with means of µPM and µMM
and the variance var[log(NPM] = var[log(NMM)]� σ2 and correlation ρ constant across probes. We
assume µPM � h(αPM) and µMM � h(αMM), with h a smooth (almost linear) function and the αs
defined by (1). Because we do not expect NSB to be affected by optics we assume O and N are
13
Hosted by The Berkeley Electronic Press
Summary: within array normalization
• To remove the unwanted artifacts and obtain true signals.
• Performed at each array individually.• Both MA-‐plot based normalization and background error models (eg, RMA) are popular in many other data (other microarrays, ChIP-‐seq, RNA-‐seq) – Use loess with caution because it assumes most genes are not DE.
– The error model (additive background, multiplicative error) is very useful.
Between array normalization
• Remember data from arrays (intensity values) estimate mRNA quantities, but the intensity-‐mRNA quantities response can be different from different arrays. So a number, say, 5, on arrays 1 doesn’t mean the same on array 2.
• This could be caused by:– Total amount of mRNA used– Properties of the agents used.– Array properties– Settings of laser scanners– etc.
• These artifacts cannot be removed by within array normalization.
• Goal: normalize so that data from different arrays are comparable!
Linear scaling method
• Used in Affymetrix software MAS:– Use a number of “housekeeping” genes and assume their expressions are identical across all arrays.
– Shift and rescale all data so the average expression of these genes are the same across all arrays.
Non-‐linear smoothing based
• Implemented in dChip (Li and Wong 2001, Genome Bio.)– Find a set of genes invariant across arrays. – Find a “baseline” array– For every other arrays fit a smooth curve on expressions of invariant genes
– Normalize based on the fitted curve.
dChip normalization
Acknowledgements!"#$%&'(#)*"'#+"#,-./#0&'#1&'2/#34(#56-7'/#)$&'#3"8.-'/#9&"
:;#<""/#=&6-'#>&(&(/#9-%'#!&8("6#&'+#?64'+&@#5%&$$&A%&6B""
C-6#D6-*4+4'2#+&$&/#&'+#$%"#"+4$-6#&'+#6"C"6"".#7%-#D6-*4+"+
*&8E&F8"# .E22".$4-'.;#1%4.#7-6(# 4.# .EDD-6$"+# 4'#D&6$#FG#3H>
26&'$#I#JKI#>LMNOPIQMI#&'+#3)R#26&'$#05HQSSMPTMI;
References1. Li C, Wong WH: Model-based analysis of oligonucleotide
arrays: expression index computation and outlier detection.Proc Natl Acad Sci USA 2001, 98:31-36.
2. Hakak Y, Walker JR, Li C, Wong WH, Davis KL, Buxbaum JD,Haroutunian V, Fienberg AA: Genome-wide expression analy-sis reveals dysregulation of myelination-related genes inchronic schizophrenia. Proc Natl Acad Sci USA 2001, 98:4746-4751.
3. Wodicka L, Dong H, Mittmann M, Ho M, Lockhart D: Genome-wide expression monitoring in Saccharomyces cerevisiae. NatBiotechnol 1997, 15:1359-1367.
4. Wallace D: The Behrens-Fisher and Fieller-Creasy problems.In Lecture Notes in Statistics 1, R.A.Fisher: An Appreciation. Edited byFienberg SE, Hinkley DV. Springer-Verlag 1988, 119-147.
5. Cox, DR, Hinkley DV: Theoretical Statistics. London: Chapman andHall, 1974.
6. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis anddisplay of genome-wide expression patterns. Proc Natl Acad SciUSA 1998, 95:14863-14868
7. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E,Lander E, Golub T: Interpreting patterns of gene expressionwith self-organizing maps: methods and application tohematopoietic differentiation. Proc Natl Acad Sci USA 1999,96:2907-2912.
8. Efron B, Tibshirani R: An Introduction to the Bootstrap. New York:Chapman & Hall/CRC, 1993.
9. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Rad-macher M, Simon R, Yakhinik Z, Ben-Dork A, et al.: Molecular clas-sification of cutaneous malignant melanoma by geneexpression profiling. Nature 2000, 406:536-540.
10. DNA-Chip Analyzer [http://www.dchip.org] 11. Schadt EE, Li C, Su C, Wong WH: Analyzing high-density
oligonucleotide gene expression array data. J Cell Biochem2001, 80:192-202.
comm
entreview
sreports
deposited researchinteractions
information
refereed research
http://genomebiology.com/2001/2/8/research/0032.11
Figure 10Similar plots as in Figure 9 for arrays hybridized to two different samples (array 24 and 36 of array set 5). (a) CEL intensities;(b) same plot as in (a) with superimposed circles representing the invariant set; (c) after renormalization; (d) Q-Q plot ofnormalized probe intensities. Note that the smoothing spline in (a) is affected by several points at the lower-right corner,which might belong to differentially expressed genes. The invariant set, on the other hand, does not include these pointswhen determining the normalization curve, leading to a different normalization relationship at the high end.
Quantile normalization
Proposed in Bolstad et al. 2003, Bioinformatics: • Force the distribution of all data from all arrays to be the
same, but keep the ranks of the genes. • Procedures:
1. Create a target distribution, usually use the average from all arrays. 2. For each array, match its quantiles to that of the target. To be
specific: xnorm = F2-‐1(F1(x)):• x: value in the chip to be normalized• F1: distribution function in the array to be normalized• F2: target distribution function
33
Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 2 7 153 3 6 5 84 1 5 2 95 9 13 6 11
A simple example for quantile normalization
34
1. Find the Smallest Value for each sample
Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 2 7 153 3 6 5 84 1 5 2 95 9 13 6 11
2. Average them
(1+2+2+8)/4=3.25
35
3. Replace Each Value by the Average
Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 3.25 7 153 3 6 5 3.254 3.25 5 3.25 95 9 13 6 11
36
4. Find the Next Smallest Values, then average
Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 3.25 7 153 3 6 5 3.254 3.25 5 3.25 95 9 13 6 11
(3+5+5+9)/4=5.5
37
5. Replace Each Value by the Average
Gene sample1 sample2 sample3 sample41 8 15 9 13 2 7 3.25 7 153 5.50 6 5.50 3.254 3.25 5.50 3.25 5.505 9 13 6 11
38
6. Continue the process, we get the following matrix after finishing:
Gene sample1 sample2 sample3 sample41 10.25 12.00 12.00 10.25 2 7.50 3.25 10.25 12.003 5.50 7.50 5.50 3.254 3.25 5.50 3.25 5.505 12.00 10.25 7.50 7.50
The result matrix has following properties:• The values taken in each column are exactly the same.• The ranks of genes in each column are the same as
before normalization.
Before/after QN boxplot
Summary: between-‐array normalization
• Must do before comparing different arrays.• Same problems exist in sequencing data.• Quantile normalization is too strong and often remove the true signals, use with caution.
Microarray data summarization
• There are usually multiple probes corresponding to a gene. The task is to summarize the readings from these probes into one number to represent the gene expression.
• Naïve methods: mean, median.
• From MAS 5.0: use one-‐step Tukey Biweight (TBW) to obtain a robust weighted mean that is resistant to outliers.– Probes with intensities far away from median will have smaller
weights in the average.
• dChip (Li & Wong, 2001): model based on PM-‐MM.
• Borrow information from multiple samples to estimate probe effects.
• Model-‐fitting: Median Polish (robust against outliers)• Iteratively removing the row and column medians until convergence
• The remainder is the residual;
• After subtracting the residual, the rowmedians are the estimates of the expression, and column medians are probe effects.
RMA summarization
Irizarry et al. (2003) Biostatistics.
Comparison of different methods
• Some works are done for Affymetrix: – Cope et al. (2004) Bioinformatics.– Irizarry et al. (2006) Bioinformatics.
• Gold standard: “spike-‐in” data in with the true expressions are known.
• Criteria:– Accuracy: how close are the estimated values to the truth.– Precision: how variable are the estimates. – Overall detection ability.
Results: accuracy vs. precisionBenchmark for Affymetrix GeneChip expression measures
expression across replicate arrays, (2) response of expressionmeasure to changes in abundance of RNA, (3) sensitivity offold-change measures to the amount of actual RNA sample,(4) accuracy of fold-change as ameasure of relative expressionand (5) usefulness of raw fold-change score for the detectionof differential expression.When multiple expression measures are to be compared
directly, a comparative Image Report can be generated. Smalldifferences between the basic and comparative plots arementioned in the descriptions below. The webtool and theR package alike can readily generate either report. Examplesof these reports can be found on the webtool webpage. In thenext section, we describe two applications where the ImageReport is useful. All examples of figures included here arefrom those applications.All data is plotted on the log2 scale. The log transformation
is made at the beginning and all intermediate transformationsare made to the logged data. Fold changes are calculatedfor single chip comparisons, even where replicates are avail-able. This standardizes the procedure for those cases in whichreplicates are not available, and gives results that depend aslittle as possible on the the specific design of the benchmarkdatasets.
NotationThe spike-in data set encompasses more than one set ofexperimental conditions. Within each experiment, only thespike-in concentrations are varied; background is the same forall arrays. Fold-change calculations are always made withinexperiment to ensure that only spiked-in genes will be dif-ferentially expressed. We refer to spike-in concentrations asnominal concentration and their ratios as nominal fold-changeas a reminder that small errors in the actual quantity of RNA inthe sample prevent us from knowing the true concentrations.We will use M to denote log observed fold-change and FCas an abbreviation of fold change. Notation for formulas is asfollows:Dilution study: Let ytdrg represent log2 expression for tissue
t = 1, 2, dilution d = 1, . . . , 6, replicate r = 1, . . . , 5 andgene g = 1, . . . , n.
Spike-in study: A different variable is used to avoid con-fusion. Let xecg represent log2 expression for experimente = 1, 2, 3, array type c = 1, 2, . . . , 20 and gene g = 1, . . . , n.For the spike-in genes only, χcg represents the log2 of thenominal concentration.
The plots and statistics(1)MA plot: TheMAplot shows log fold-change as a functionof mean log expression level. This has come to be a funda-mental graphical tool for the analysis of expression array data.A great deal of information about the distribution of observedfold-changes can be read from such a plot. A set of 14 arraysrepresenting a single experiment from theAffymetrix spike-in
0 5 10 15
–10
–5
0
5
10
MAS 5.0
A
M 11 1 1 11 11 1 11112222
22 22 22
12
22333 3
33 3333
1111
344 4
4 44
44
4
1010
10
455 5 5 55
55
999
9
566 6 6 6
6
6
8 888
8
77 7 77
7
7
77777
7
88 8 88
66 66666
6
999 9
5
55 55555
5
1010
10
4
44
44
44444
1111
333
33 33 33
33
12
2 2 22 22 22222
2
1 1 1 1111 1
11111
∞
∞∞ ∞
∞
∞
∞
∞
∞
∞
∞
∞
∞
∞
∞
∞
∞∞
∞
∞
∞
∞
∞∞
∞∞
− ∞
− ∞
− ∞
− ∞− ∞− ∞
− ∞
− ∞− ∞
− ∞
− ∞− ∞− ∞
− ∞
0 5 10 15
–10
–5
0
5
10
RMA
A
M
1 11 1 11 11 1 11112 22
222 22 2
2
12
223 33
333 33 3
3
11
11
34 4
44
44 444
10
10
10
45 5
5 5 55 55
9
9
9
9
56 6
6 6 66 6
88
888
7 77 7 77
7 7 77777
8 88 88
6
66 6
6666
9 99 9
55
5 5 5555
5
10 1010
44
44
4 44444
1111
333 3
3 3 333
33
12
2 222 22 2 22222
11 1 11 11 1 11111∞∞
∞∞ ∞∞ ∞∞∞∞
∞∞∞∞ ∞∞
∞∞∞∞
∞∞∞
∞∞∞
− ∞
− ∞− ∞
− ∞
− ∞− ∞− ∞− ∞
− ∞− ∞
− ∞− ∞− ∞− ∞
Fig. 1. TheMAplot shows log fold-change as a function ofmean logexpression level. A set of 14 arrays representing a single experimentfrom the Affymetrix spike-in data are used for this plot. A total of13 sets of fold changes are generated by comparing the first arrayin the set to each of the others. Genes are symbolized by numbersrepresenting the nominal log2 fold-change for the gene. Genes withobserved fold-changes larger than 2 are outside the horizontal lines.All other probesets are represented with black dots.
data are used for this plot. We choose 14 so that all possiblecombinations of pairs of concentrations appear once. A totalof 13 sets of fold changes are generated by comparing the firstarray in the set to each of the others, so thatM1cg = x11g−x1cgfor c = 2, . . . , 14. These are plotted together against themean values A1cg = (x11g + x1cg)/2. To make the plot moreinformative, spiked-in genes are symbolized by numbers rep-resenting the nominal log2 fold-change for the gene. In thepackage, non-differentially expressed genes with observedfold-changes larger than 2 are plotted in red (in the black andwhite version in this paper we use horizontal lines to denotea fold-change of 2). All other probesets are represented withblack dots.MA plots for competing methods are presented side by side
on separate axes. Figure 1 shows a comparative plot for twoexpression measures.
325
Results: DE detection ability
between 4 and 32) and high expressed (nominal concentration >32).For each of these subgroups we followed the same procedure usedto compute the signal detect slope. Specifically, a regression linewas fitted to the observed log expression values for the spiked-ingenes using nominal concentration as the predictor, and the slopeestimate recorded as the assessment measure. The new assessmentmeasures are referred to as low, med and high slopes and areshown in Table 2.To better assess the concentration dependent bias, we added the
plot shown in Figure 3b to the benchmark. In this figure, local slopesare calculated by taking the difference between the averageobserved log expression values between consecutive nominal con-centration levels. The difference between 1 and these local slopesare plotted against the larger of the two concentration levels. We
subtract from 1 because we are in the log-scale, thus all theseslopes should be 1 (when nominal concentration doubles soshould the observed concentrations). Notice that these curves followan upside-down U shape. This shape illustrates the fact that bias isworst for low expressed genes and high expressed genes. The lowexpressed genes are affected by background noise as describedabove. The high expressed genes are affected by scanner attenuation(not discussed in this paper).
3.2 Precision
To provide a more practical context for the new accuracy assess-ment measures, we defined the null log-fc 99.9% statistic shown inFigure 2. Row 6 in Table 1 of Cope et al. (2004) presented the inter-quartile range (IQR) of the observed log-fold-changes among thegenes that are known not to be differentially expressed. The newstatistics gives the 99.9% instead of the IQR. We have also added ameasure related to the Median SD represented by row 1 in Table 1ofCope et al. (2004). The previous measure used the dilution studydata. Similarly, a spike-in experiment version of Figure 2 in theoriginal benchmark was added.
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
False Positives
Tru
e P
ositi
ves
VSNRMARMA_NBGGCRMAPLIER+16RSVD
(a) Nominal fold-chage = 2
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
False Positives
Tru
e P
ositi
ves
(b) Low nominal concentration
Fig. 4. A typical identification rule for differential expression filters genes
with fold change exceeding a given threshold. This figure shows averageROC curves which offer a graphical representation of both specificity and
sensitivity for such a detection rule. (a) Average ROC curves based on
comparisons with nominal fold changes equal to 2. (b) As (a) but consideronly low concentration spiked-in genes.
Table 2. Table showing the new assessment summary statistics described in
the text
SlopeMethod SD 99.9% Low Med High AUC
GCRMA 0.08 0.74 0.66 1.06 0.56 0.70
GS_GCRMA 0.10 0.79 0.62 1.03 0.55 0.66
MMEI 0.04 0.23 0.16 0.54 0.46 0.62
GL 0.05 0.25 0.16 0.55 0.46 0.62RMA_NBG 0.04 0.24 0.16 0.56 0.46 0.61
RSVD 0.00 0.58 0.42 0.85 0.40 0.61
ZL 0.22 0.52 0.35 0.71 0.45 0.61VSN_scale 0.09 0.43 0.28 0.91 0.70 0.59
VSN 0.06 0.28 0.18 0.6 0.46 0.59
RMA_VSN 0.09 0.48 0.31 0.74 0.46 0.57
GLTRAN 0.07 0.42 0.23 0.61 0.45 0.55ZAM 0.09 0.50 0.30 0.70 0.47 0.54
RMA_GNV 0.11 0.58 0.35 0.76 0.47 0.52
RMA 0.11 0.57 0.35 0.76 0.47 0.52
GSrma 0.11 0.57 0.35 0.76 0.47 0.52GSVDmod 0.07 0.44 0.22 0.64 0.42 0.51
PerfectMatch 0.05 0.40 0.18 0.56 0.43 0.50
PLIER+16 0.13 0.83 0.49 0.80 0.46 0.48GSVDmin 0.08 0.60 0.22 0.62 0.41 0.41
MAS 5.0+32 0.14 1.07 0.35 0.71 0.44 0.12
ChipMan 0.27 2.26 0.44 1.11 0.68 0.12
qn.p5 0.12 1.09 0.13 0.50 0.52 0.11dChip PM-only 0.13 1.44 0.31 0.67 0.39 0.09
mmgMOSgs 0.40 3.27 1.34 1.13 0.45 0.07
gMOSv.1 0.29 3.35 0.98 1.12 0.42 0.06
ProbeProfiler 0.31 18.75 1.61 1.57 0.39 0.03dChip 0.23 14.83 1.40 0.86 0.35 0.02
mgMOS_gs 0.36 2.86 0.83 0.86 0.43 0.01
MAS 5.0 0.63 4.48 0.69 0.81 0.45 0.00
PLIER 0.19 123.27 0.75 0.85 0.46 0.00UMTrMn 0.32 2.92 0.58 0.83 0.42 0.00
The methods are ordered by their performance in the weighted average AUC value.
R.A.Irizarry et al.
792
Resources
• The affycomp R package in Bioconductor.• Online tools: http://rafalab.rc.fas.harvard.edu/affycomp.
Bioconductor for microarray data
• There is a rich collection of bioc packages for microarrays. In fact, Bioconductor started for microarray analysis.
• There are currently 200+ packages for microarray. • Important ones include:
– affy: one of the earliest bioc packages. Designed for analyzing data from Affymetrix arrays.
– oligo: preprocessing tools for many types of oligonecleotide arrays. This is designed to replace affy package.
– limma and siggenes: DE detection using limma and SAM-‐t model. – Many annotation data package to link probe names to genes.
• Data normalization and summarization can be done using oligo package (details next lecture).
Review
• We have covered microarray analysis, including:– Data preprocessing: within and between array normalization.
– Summarization.
• Next lecture: – DE detection for microarray.
To do list
• Review the slides.• Read the review article on Nature Reviews Genetics (link on the class webpage).