introduction*to*gene*expression* microarray* · pdf...

49
Introduction to gene expression microarray data analysis

Upload: trandien

Post on 03-Feb-2018

227 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Introduction  to  gene  expression  microarray  data  analysis

Page 2: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Outline

• Brief  introduction:– Technology  and  data.– Statistical  challenges  in  data  analysis.

• Preprocessing  – data  normalization  and  transformation.

• Useful  Bioconductor  packages.

Page 3: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

A  short  history

• Evolved  from  Southern  blotting,  which  is  a  procedure  to  detect  and  quantify  a  specific  DNA  sequence.  

• Gene  expression  microarray  can  be  thought  as  parallelized  Southern  blotting  experiments.  

• First  influential  paper:  Schenaet  al.  (1995)  Science.  – study  the  expression  of  45  Arabidopsisgenes.  

• Very popular  for the past 20  years.  Searching “gene  expressionmicroarray” on  PubMed returns 50,000+  hits.  

Page 4: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Still  microarray?

• Microarray  is  still  widely  used  because  of  lower  costs,  easier  experimental  procedure  and  more  established  analysis  methods.

• Similar  problems  are  presented  in  newer  technologies  such  as  RNA-­‐seq,  and  similar  statistical  techniques  can  be  borrowed.  

Page 5: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Introduction  to  GE  microarray  technology  and  design

Page 6: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Goal:  measure  mRNA  abundance

gene

The  amount  of  these  matters!    But  they  are  difficult  to  measure.

The  amount  of  these   is  easy  to  measure.  And  it  is  positively   correlated  with  the  protein  amount!

DNA(2  copies)

mRNA(multiple  copies)

Protein(multiple  copies)

Page 7: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Gene  expression  microarray  design

TTAAGTCGTACCCGTGTACGGGCGCAATTCAGCATGGGCACATGCCCGCG

• A  collection  of  DNA  spot  on  a  solid  surface.  

• Each  spot  contains  many  copies  of  the  same  DNA  sequence  (called  “probes”).– Probe  sequences  are  designed  to  target  

specific  genes.  

• Genes  with  part  of  its  sequence  complementary  to  a  probe  will  hybridized  on  (stick  to)  that  probe.

• The  amount  of  hybridization  on  each  probe  measures  the  amount  of  mRNA  for  its  target  gene.

Page 8: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Experimental  procedure

wet lab: perform experiment

dry lab: data analysis

Page 9: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Available  platforms

• Affymetrix• Agilent• Nimblegene• Illumina• ABI• Spotted  cDNA

Page 10: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Affymetrix Gene  expression  arrays

The Affymetrix platform is one of the most widely used.

http://www.affymetrix.com/

Page 11: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Affymetrix GeneChip array  design  

Use U133 system for illustration:• Around 20 probes per gene.• Not necessarily evenly spaced: sequence property matters.• The probes are located at random locations on the chip to average out the

effects of the array surface.

TTAAGTCGTACCCGTGTACGGGCGC        Target  sequence

AATTCAGCATGGGCACATGCCCGCG        Perfect  match  (PM)  probe:  measure  signalsAATTCAGCATGGACACATGCCCGCG        Mis-­‐match  (MM)  probe:  measure   background

Page 12: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

One-­‐color  vs.  two-­‐color  arrays

• Two-­‐color  (two-­‐channel)  arrays  hybridize  two  samples  on  the  same  array  with  different  colors  (red  and  green).  – Each  spot  produce  two  numbers.  – Agilent,  Nimblegen

• One-­‐color  (single-­‐channel)  arrays  hybridize  one  sample  per  array.– Easier  when  comparing  multiple  groups.– Have  to  use  twice  as  many  arrays.– Affymetrix,  Illumina.

Page 13: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Data  from  microarray• Data  are  fluorescent  intensities:

– extracted  from  the   images  with  artifacts  (e.g.,  cross-­‐talk)  removed,  which  Involves  many  statistical  methods.    

– Final  data  are  stored  in  a  matrix:  row  for  probes,  column  for  samples.

– For  each  sample,   each  probe  has  one  number  from  one-­‐color  arrays  and  two  numbers  for  two-­‐color  arrays.

sample1 sample2 sample3 sample41007_s_at 8.575758 8.915618 9.150667 8.9678701053_at 6.959002 7.039825 6.898245 7.136316117_at 7.738714 7.618013 7.499127 7.610726121_at 10.114529 10.018231 10.003332 9.8090681255_g_at 5.056204 4.759066 4.629297 4.6734581294_at 8.009337 7.980694 8.343183 8.0253351316_at 6.899290 7.045843 6.976185 7.0630501320_at 7.218898 7.600437 7.433031 7.2019841405_i_at 6.861933 6.042179 6.165090 6.2006711431_at 5.073265 5.114023 5.159933 5.063821...

Page 14: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Microarray  data  measure  the  “relative”  levels  of  mRNA  abundance  

• Expression  levels  for  different  genes  on  the  same  array  are  not  directly  comparable.  

• Expression  levels  for  the  same  genes  from  different  arrays  can  be  compared,  after  proper  normalization.

• All  statistical  inferences  are  for  relative  expressions,  e.g.,  “the  expression  of  gene  X  is  higher  in  caner  compared  to  normal”.  

Page 15: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Statistical  challenges  

• Data  normalization:  remove  systematic  technical  artifacts.– Within  array:  variations  of  probe  intensities  are  caused  by:

• cross-­‐hybridization:  probes  capture  the  “wrong”  target.• probe  sequence:  some  probes  are  “sticker”.• others:  spot  sizes,  smoothness  of  array  surface,  etc.

– Between  array:  intensity-­‐concentration  response  curve  can  be  different  from  different  arrays,  caused  by  variations  in  sample  processing,  image  reader,  etc.

• Summarization  of  gene  expressions:– summarize  values  for  multiple  probes  belonging  to  the  same  gene  into  

one  number.• Differential  expression  detection:

– Find  genes  that  are  expressed  differently  between  different  experimental  conditions,  e.g.,  cases  and  controls.  

Page 16: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Gene  expression  microarray  data  normalization

Page 17: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Normalization

• Artifacts  are  introduced  at  each  step  of  the  experiment:– Sample  preparation:  PCR  effects.– Array  itself:  array  surface  effects,  printing-­‐tip  effects.  – Hybridization:  non-­‐specific  binding,  GC  effects.– Scanning:  scanner  effects.  

• Normalization  is  necessary  before  any  analysis  to  ensure  differences  in  intensities  are  due  to  differential  expressions,  not  artifacts.    

Page 18: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Within-­‐ and  between-­‐array  normalization

• Within-­‐array:  normalization  at  each  array  individually  to  remove  array-­‐specific  artifacts.  

• Between-­‐array:  to  adjust  the  values  from  different  arrays  and  put  them  at  the  same  baseline,  so  that  numbers  are  comparable.

• The  methods  are  different  for  one-­‐ and  two-­‐color  arrays.

Page 19: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Within  array  normalization,  two-­‐color

• Most  common  problem  is  intensity  dependent  effect:  log  ratios  of  intensities  from  two  channels  depends  on  the  total  intensity.  

• Most  popular:  loess  normalization.

Page 20: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

MA  plot

• Widely  used  diagnostic  plot  for  microarray  data  (Yang  et  al.  2002,  Nucleic  Acids  Research).  

• Also  used  for  sequencing  data  now.  • For  spot  i,  let  Ri and  Gi be  the  intensities,  define:

– Mi=log2Ri-­‐log2Gi,  A=(log2Ri+log2Gi)/2.– M  measures   relative  expression,  A  measures   total  expression.  

• Visualize  relative  vs.  total  expression  dependence.

2

MostMost Common ProblemCommon Problem

Intensity dependent effect: Different

background level most likely culprit

Scatter PlotScatter Plot

Demonstrates importance of MA plot

Two-color platformsTwo-color platforms

• Platforms that use printing robots areprone to many systematic effects:

– Dye

– Print-tip

– Plates

– Print order

– Spatial

• Some examples follow

Page 21: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Loess  normalization

• Based  on  the  assumptions  that:  (1)  most  genes  are  not  DE  (with  M=0)  and  (2)  M  and  A  are  independent,  MA  plot  should  be  flat  and  centered  at  0.

• Normalization  procedure:– Fit  a  smooth  curve  of  M  vs.  A  using  loess,  e.g.,  M=f(A)+ε,  f(.)  is  smooth.

– Mnorm=M-­‐f(A)

– loess  (lowess):  locally  weighted  scatterplot  smoothing.  – method  to  fit  a  smooth  curve  between  two  variables  .  

Page 22: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Loess  normalization:  before  and  after

Page 23: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Within  array  normalization:  one-­‐color

• RMA  (Robust  Multi-­‐array  Average)  background  model  (Irizarry  et  al.  2003,  Biostatistics).  

• Idea:  observed  intensity  Y is  composed  of  the  true  intensity  S(exponentially  distributed)  and  a  random  background  noise  B  (normally  distribute).  

• For  each  array,  assume:Y = S +BSignal: S ~ Exp(λ)Background: B ~ N(µ,σ 2 ) left-truncated at zero

Page 24: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Simple  derivation

• Observed:  Y;  of  interest:  S.  • The  idea  is  to  predict  S  from  Y  using                          :

• The  joint:    • Marginal  distribution  of  Y can  be  derived.  

E[S |Y ]

E[S |Y ]= s f (∫ s |Y = y)ds = s f (s,Y = y)fY (y)

∫ ds = 1fY (y)

s f (∫ s,Y = y)ds

f (s,Y = y) = f (s,B = y− s) = fS (s) fB (y− s)

fY (y)

Page 25: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

An  extension  to  consider  probe  sequence  effects:  GCRMA

for a given gene become correlated across experimental units. This makes obtaining reliable esti-

mates of uncertainty difficult. However, a multi-array versions of our model motivates a method

that performs background adjustment, normalization, and summarization as part of the estimation

procedure. Using this approach one can compute standard error estimates that account for the three

steps.

For example, if we were comparing gene expression across different conditions, each contain-

ing various arrays, we could write the following model based on (2):

Ygi j = Ogi j +Ngi j +Sgi j (5)

= Ogi j + exp(µgi j + εgi j)+ exp(sg+δgXi+agi j +bi+ξgi j). (6)

Here Ygi j is the PM intensity for the probe j in probeset g on array i, εgi j is a normally distributed

error that account for NSB for the same probe behaving differently in different arrays, sg repre-

sents the baseline log expression level for probeset g, agi j represents the signal detecting ability

of probe j in gene g on array i, bi is a term used to describe the need for normalization, ξgi j is a

normally distributed term that accounts for the multiplicative error, and δg is the expected differ-

ential expression for every unit difference in covariate X . Notice δg is the parameter of interest. As

described by Naef and Magnasco (2003) ag j is a function of α.

With this model in place one may obtain point estimates and standard errors for δg using, say,

theMLE.With appropriate priors in place one could also obtain Bayesian estimates. However, both

these approaches are computationally difficult. In Figure 5 we present some preliminary results

obtained using generalized estimating equations to estimate δ. A difficulty with this approach is

that when Sgi j = 0 in (5) then (6) is not defined. However, preliminary results look promising and

making this approach useful in practice is the subject of current research.

18

http://www.bepress.com/jhubiostat/paper1

for a given gene become correlated across experimental units. This makes obtaining reliable esti-

mates of uncertainty difficult. However, a multi-array versions of our model motivates a method

that performs background adjustment, normalization, and summarization as part of the estimation

procedure. Using this approach one can compute standard error estimates that account for the three

steps.

For example, if we were comparing gene expression across different conditions, each contain-

ing various arrays, we could write the following model based on (2):

Ygi j = Ogi j +Ngi j +Sgi j (5)

= Ogi j + exp(µgi j + εgi j)+ exp(sg+δgXi+agi j +bi+ξgi j). (6)

Here Ygi j is the PM intensity for the probe j in probeset g on array i, εgi j is a normally distributed

error that account for NSB for the same probe behaving differently in different arrays, sg repre-

sents the baseline log expression level for probeset g, agi j represents the signal detecting ability

of probe j in gene g on array i, bi is a term used to describe the need for normalization, ξgi j is a

normally distributed term that accounts for the multiplicative error, and δg is the expected differ-

ential expression for every unit difference in covariate X . Notice δg is the parameter of interest. As

described by Naef and Magnasco (2003) ag j is a function of α.

With this model in place one may obtain point estimates and standard errors for δg using, say,

theMLE.With appropriate priors in place one could also obtain Bayesian estimates. However, both

these approaches are computationally difficult. In Figure 5 we present some preliminary results

obtained using generalized estimating equations to estimate δ. A difficulty with this approach is

that when Sgi j = 0 in (5) then (6) is not defined. However, preliminary results look promising and

making this approach useful in practice is the subject of current research.

18

http://www.bepress.com/jhubiostat/paper1

Wu  et  al.  (2005)  JASA

Page 26: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Probe  sequence  effects

• Probe  affinity  is  modeled  as:

• This  kind  of  modeling  is  widely  used  in  other  microarrays  and  sequencing  data!

base effects:

α=25

∑k=1

∑j⇤{A,T,G,C}

µj,k1bk= j with µj,k =3

∑l=0

β j,lkl, (1)

where k= 1, . . . ,25 indicates the position along the probe, j indicates the base letter, bk represents

the base at position k, 1bk= j is an indicator function that is 1 when the k-th base is of type j and 0

otherwise, and µj,k represents the contribution to affinity of base j in position k. For fixed j, the

effect µj,k is assumed to be a polynomial of degree 3. The model is fitted to log intensities from

many arrays using least squares (Naef and Magnasco, 2003).

We adapt this idea to help describe the NSB component. We fit (1) to our NSB experiment log

intensity data using a spline with 5 degrees of freedom instead of a polynomial of degree 3. The

least squares estimates µ̂j,k are shown in Figure 3. This Figure is similar to Figure 3 in Naef and

Magnasco (2003). In Section 3 we use these affinity estimates to describe NSB noise.

Zhang et al (Zhang et al., 2003) propose using a positional-dependent-nearest-neighbor (PDNN)

model which is based on hybridization theory. This model takes into account interactions between

bases that are physically close. However, Naef and Magnasco (2003) demonstrate that these in-

teractions do not add much predictive power for specific signal probe effects. Similar results for

prediction of NSB are found in Wu and Irizarry (2004). Figure 4 shows background adjusted

log2(PM) against α for the NSB data. Notice the affinities predict NSB quite well. Almost as

well as theMM intensities. The advantage of the affinities over the MM is that they will not detect

signal since they are pre-computed numbers.

12

http://www.bepress.com/jhubiostat/paper1

A

A

A

AA

AA A A A A A A A A

AA

AA

AA

AA

AA

5 10 15 20 25

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

C

C

C

CC

CC C C C C C C C C C

CC

CC

CC

CC

CG G G G G G G G G G G G G G G G G G G G G G G G GT T T T T T T T T T T T T T T T T T T T T T T T T

Figure 3: The effect of base A in position k, µA,k, is plotted against k. Similarly for the other threebases.

3 Background Model

In this section we propose a statistical model that motivates useful background adjustments. For

any particular probe-pair we assume that:

PM = OPM +NPM +S (2)

MM = OMM +NMM +φS.

Here O represents optical noise, N represents NSB noise and S is a quantity proportional to RNA

expression (the quantity of interest). The parameter 0 < φ < 1 accounts for the fact that for

some probe-pairs the MM detects signal. We assume O follows a log-normal distribution and

that log(NPM) and log(NMM) follow a bivariate-normal distribution with means of µPM and µMM

and the variance var[log(NPM] = var[log(NMM)]� σ2 and correlation ρ constant across probes. We

assume µPM � h(αPM) and µMM � h(αMM), with h a smooth (almost linear) function and the αs

defined by (1). Because we do not expect NSB to be affected by optics we assume O and N are

13

Hosted by The Berkeley Electronic Press

Page 27: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Summary:  within  array  normalization

• To  remove  the  unwanted  artifacts  and  obtain  true  signals.  

• Performed  at  each  array  individually.• Both  MA-­‐plot  based  normalization  and  background  error  models  (eg,  RMA)  are  popular  in  many  other  data  (other  microarrays,  ChIP-­‐seq,  RNA-­‐seq)  – Use  loess  with  caution  because  it  assumes  most  genes  are  not  DE.

– The  error  model  (additive  background,  multiplicative  error)  is  very  useful.

Page 28: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Between  array  normalization

• Remember  data  from  arrays  (intensity  values)  estimate  mRNA  quantities,  but  the    intensity-­‐mRNA  quantities  response  can  be  different  from  different  arrays.  So  a  number,  say,  5,  on  arrays  1  doesn’t  mean  the  same  on  array  2.

• This  could  be  caused  by:– Total  amount  of  mRNA  used– Properties  of  the  agents  used.– Array  properties– Settings  of  laser  scanners– etc.

• These  artifacts  cannot  be  removed  by  within  array  normalization.

• Goal:  normalize  so  that  data  from  different  arrays  are  comparable!  

Page 29: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Linear  scaling  method

• Used  in  Affymetrix  software  MAS:– Use  a  number  of  “housekeeping”  genes  and  assume  their  expressions  are  identical  across  all  arrays.

– Shift  and  rescale  all  data  so  the  average  expression  of  these  genes  are  the  same  across  all  arrays.

Page 30: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Non-­‐linear  smoothing  based

• Implemented  in  dChip  (Li  and  Wong  2001,  Genome  Bio.)– Find  a  set  of  genes  invariant  across  arrays.  – Find  a  “baseline”  array– For  every  other  arrays  fit  a  smooth  curve  on  expressions  of  invariant  genes  

– Normalize  based  on  the  fitted  curve.  

Page 31: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

dChip  normalization

Acknowledgements!"#$%&'(#)*"'#+"#,-./#0&'#1&'2/#34(#56-7'/#)$&'#3"8.-'/#9&"

:;#<""/#=&6-'#>&(&(/#9-%'#!&8("6#&'+#?64'+&@#5%&$$&A%&6B""

C-6#D6-*4+4'2#+&$&/#&'+#$%"#"+4$-6#&'+#6"C"6"".#7%-#D6-*4+"+

*&8E&F8"# .E22".$4-'.;#1%4.#7-6(# 4.# .EDD-6$"+# 4'#D&6$#FG#3H>

26&'$#I#JKI#>LMNOPIQMI#&'+#3)R#26&'$#05HQSSMPTMI;

References1. Li C, Wong WH: Model-based analysis of oligonucleotide

arrays: expression index computation and outlier detection.Proc Natl Acad Sci USA 2001, 98:31-36.

2. Hakak Y, Walker JR, Li C, Wong WH, Davis KL, Buxbaum JD,Haroutunian V, Fienberg AA: Genome-wide expression analy-sis reveals dysregulation of myelination-related genes inchronic schizophrenia. Proc Natl Acad Sci USA 2001, 98:4746-4751.

3. Wodicka L, Dong H, Mittmann M, Ho M, Lockhart D: Genome-wide expression monitoring in Saccharomyces cerevisiae. NatBiotechnol 1997, 15:1359-1367.

4. Wallace D: The Behrens-Fisher and Fieller-Creasy problems.In Lecture Notes in Statistics 1, R.A.Fisher: An Appreciation. Edited byFienberg SE, Hinkley DV. Springer-Verlag 1988, 119-147.

5. Cox, DR, Hinkley DV: Theoretical Statistics. London: Chapman andHall, 1974.

6. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis anddisplay of genome-wide expression patterns. Proc Natl Acad SciUSA 1998, 95:14863-14868

7. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E,Lander E, Golub T: Interpreting patterns of gene expressionwith self-organizing maps: methods and application tohematopoietic differentiation. Proc Natl Acad Sci USA 1999,96:2907-2912.

8. Efron B, Tibshirani R: An Introduction to the Bootstrap. New York:Chapman & Hall/CRC, 1993.

9. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Rad-macher M, Simon R, Yakhinik Z, Ben-Dork A, et al.: Molecular clas-sification of cutaneous malignant melanoma by geneexpression profiling. Nature 2000, 406:536-540.

10. DNA-Chip Analyzer [http://www.dchip.org] 11. Schadt EE, Li C, Su C, Wong WH: Analyzing high-density

oligonucleotide gene expression array data. J Cell Biochem2001, 80:192-202.

comm

entreview

sreports

deposited researchinteractions

information

refereed research

http://genomebiology.com/2001/2/8/research/0032.11

Figure 10Similar plots as in Figure 9 for arrays hybridized to two different samples (array 24 and 36 of array set 5). (a) CEL intensities;(b) same plot as in (a) with superimposed circles representing the invariant set; (c) after renormalization; (d) Q-Q plot ofnormalized probe intensities. Note that the smoothing spline in (a) is affected by several points at the lower-right corner,which might belong to differentially expressed genes. The invariant set, on the other hand, does not include these pointswhen determining the normalization curve, leading to a different normalization relationship at the high end.

Page 32: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Quantile  normalization

Proposed  in  Bolstad et  al.  2003,  Bioinformatics:  • Force  the  distribution  of  all  data  from  all  arrays  to  be  the  

same,  but  keep  the  ranks  of  the  genes.  • Procedures:

1. Create  a  target  distribution,  usually  use  the  average  from  all  arrays.  2. For  each  array,  match  its  quantiles  to  that  of  the  target.  To  be  

specific:  xnorm =  F2-­‐1(F1(x)):• x:  value  in  the  chip  to  be  normalized• F1:  distribution   function   in  the  array  to  be  normalized• F2:  target  distribution   function

Page 33: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

33

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 2 7 153 3 6 5 84 1 5 2 95 9 13 6 11

A  simple  example  for  quantile  normalization

Page 34: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

34

1.  Find  the  Smallest  Value  for  each  sample

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 2 7 153 3 6 5 84 1 5 2 95 9 13 6 11

2.  Average  them

(1+2+2+8)/4=3.25

Page 35: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

35

3.  Replace  Each  Value  by  the  Average

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 3.25 7 153 3 6 5 3.254 3.25 5 3.25 95 9 13 6 11

Page 36: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

36

4.  Find  the  Next  Smallest  Values,  then  average

Gene sample1 Sample2 Sample3 Sample41 8 15 9 13 2 7 3.25 7 153 3 6 5 3.254 3.25 5 3.25 95 9 13 6 11

(3+5+5+9)/4=5.5

Page 37: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

37

5.  Replace  Each  Value  by  the  Average

Gene sample1 sample2 sample3 sample41 8 15 9 13 2 7 3.25 7 153 5.50 6 5.50 3.254 3.25 5.50 3.25 5.505 9 13 6 11

Page 38: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

38

6.  Continue  the  process,  we  get  the  following  matrix  after  finishing:

Gene sample1 sample2 sample3 sample41 10.25 12.00 12.00 10.25 2 7.50 3.25 10.25 12.003 5.50 7.50 5.50 3.254 3.25 5.50 3.25 5.505 12.00 10.25 7.50 7.50

The  result  matrix  has  following  properties:• The  values  taken  in  each  column  are  exactly  the  same.• The  ranks  of  genes  in  each  column  are  the  same  as  

before  normalization.  

Page 39: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Before/after  QN  boxplot

Page 40: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Summary:  between-­‐array  normalization

• Must  do  before  comparing  different  arrays.• Same  problems  exist  in  sequencing  data.• Quantile  normalization  is  too  strong  and  often  remove  the  true  signals,  use  with  caution.  

Page 41: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Microarray  data  summarization

• There  are  usually  multiple  probes  corresponding  to  a  gene.  The  task  is  to  summarize  the  readings  from  these  probes  into  one  number  to  represent  the  gene  expression.

• Naïve  methods:  mean,  median.

• From  MAS  5.0:  use  one-­‐step  Tukey  Biweight  (TBW)  to  obtain  a  robust  weighted  mean  that  is  resistant  to  outliers.– Probes  with  intensities  far  away  from  median  will  have  smaller  

weights  in  the  average.

• dChip  (Li  &  Wong,  2001):  model  based  on  PM-­‐MM.    

Page 42: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

• Borrow  information  from  multiple  samples  to  estimate  probe  effects.

• Model-­‐fitting:  Median  Polish  (robust  against  outliers)• Iteratively  removing  the  row  and  column  medians   until  convergence

• The  remainder   is  the  residual;  

• After  subtracting  the  residual,   the  rowmedians   are  the  estimates   of  the  expression,   and  column  medians   are  probe  effects.

RMA  summarization

Irizarry  et  al.  (2003)  Biostatistics.

Page 43: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Comparison  of  different  methods

• Some  works  are  done  for  Affymetrix:  – Cope  et  al.  (2004)  Bioinformatics.– Irizarry  et  al.  (2006)  Bioinformatics.

• Gold  standard:  “spike-­‐in”  data  in  with  the  true  expressions  are  known.  

• Criteria:– Accuracy:  how  close  are  the  estimated  values  to  the  truth.– Precision:  how  variable  are  the  estimates.  – Overall  detection  ability.

Page 44: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Results:  accuracy  vs.  precisionBenchmark for Affymetrix GeneChip expression measures

expression across replicate arrays, (2) response of expressionmeasure to changes in abundance of RNA, (3) sensitivity offold-change measures to the amount of actual RNA sample,(4) accuracy of fold-change as ameasure of relative expressionand (5) usefulness of raw fold-change score for the detectionof differential expression.When multiple expression measures are to be compared

directly, a comparative Image Report can be generated. Smalldifferences between the basic and comparative plots arementioned in the descriptions below. The webtool and theR package alike can readily generate either report. Examplesof these reports can be found on the webtool webpage. In thenext section, we describe two applications where the ImageReport is useful. All examples of figures included here arefrom those applications.All data is plotted on the log2 scale. The log transformation

is made at the beginning and all intermediate transformationsare made to the logged data. Fold changes are calculatedfor single chip comparisons, even where replicates are avail-able. This standardizes the procedure for those cases in whichreplicates are not available, and gives results that depend aslittle as possible on the the specific design of the benchmarkdatasets.

NotationThe spike-in data set encompasses more than one set ofexperimental conditions. Within each experiment, only thespike-in concentrations are varied; background is the same forall arrays. Fold-change calculations are always made withinexperiment to ensure that only spiked-in genes will be dif-ferentially expressed. We refer to spike-in concentrations asnominal concentration and their ratios as nominal fold-changeas a reminder that small errors in the actual quantity of RNA inthe sample prevent us from knowing the true concentrations.We will use M to denote log observed fold-change and FCas an abbreviation of fold change. Notation for formulas is asfollows:Dilution study: Let ytdrg represent log2 expression for tissue

t = 1, 2, dilution d = 1, . . . , 6, replicate r = 1, . . . , 5 andgene g = 1, . . . , n.

Spike-in study: A different variable is used to avoid con-fusion. Let xecg represent log2 expression for experimente = 1, 2, 3, array type c = 1, 2, . . . , 20 and gene g = 1, . . . , n.For the spike-in genes only, χcg represents the log2 of thenominal concentration.

The plots and statistics(1)MA plot: TheMAplot shows log fold-change as a functionof mean log expression level. This has come to be a funda-mental graphical tool for the analysis of expression array data.A great deal of information about the distribution of observedfold-changes can be read from such a plot. A set of 14 arraysrepresenting a single experiment from theAffymetrix spike-in

0 5 10 15

–10

–5

0

5

10

MAS 5.0

A

M 11 1 1 11 11 1 11112222

22 22 22

12

22333 3

33 3333

1111

344 4

4 44

44

4

1010

10

455 5 5 55

55

999

9

566 6 6 6

6

6

8 888

8

77 7 77

7

7

77777

7

88 8 88

66 66666

6

999 9

5

55 55555

5

1010

10

4

44

44

44444

1111

333

33 33 33

33

12

2 2 22 22 22222

2

1 1 1 1111 1

11111

∞∞ ∞

∞∞

∞∞

∞∞

− ∞

− ∞

− ∞

− ∞− ∞− ∞

− ∞

− ∞− ∞

− ∞

− ∞− ∞− ∞

− ∞

0 5 10 15

–10

–5

0

5

10

RMA

A

M

1 11 1 11 11 1 11112 22

222 22 2

2

12

223 33

333 33 3

3

11

11

34 4

44

44 444

10

10

10

45 5

5 5 55 55

9

9

9

9

56 6

6 6 66 6

88

888

7 77 7 77

7 7 77777

8 88 88

6

66 6

6666

9 99 9

55

5 5 5555

5

10 1010

44

44

4 44444

1111

333 3

3 3 333

33

12

2 222 22 2 22222

11 1 11 11 1 11111∞∞

∞∞ ∞∞ ∞∞∞∞

∞∞∞∞ ∞∞

∞∞∞∞

∞∞∞

∞∞∞

− ∞

− ∞− ∞

− ∞

− ∞− ∞− ∞− ∞

− ∞− ∞

− ∞− ∞− ∞− ∞

Fig. 1. TheMAplot shows log fold-change as a function ofmean logexpression level. A set of 14 arrays representing a single experimentfrom the Affymetrix spike-in data are used for this plot. A total of13 sets of fold changes are generated by comparing the first arrayin the set to each of the others. Genes are symbolized by numbersrepresenting the nominal log2 fold-change for the gene. Genes withobserved fold-changes larger than 2 are outside the horizontal lines.All other probesets are represented with black dots.

data are used for this plot. We choose 14 so that all possiblecombinations of pairs of concentrations appear once. A totalof 13 sets of fold changes are generated by comparing the firstarray in the set to each of the others, so thatM1cg = x11g−x1cgfor c = 2, . . . , 14. These are plotted together against themean values A1cg = (x11g + x1cg)/2. To make the plot moreinformative, spiked-in genes are symbolized by numbers rep-resenting the nominal log2 fold-change for the gene. In thepackage, non-differentially expressed genes with observedfold-changes larger than 2 are plotted in red (in the black andwhite version in this paper we use horizontal lines to denotea fold-change of 2). All other probesets are represented withblack dots.MA plots for competing methods are presented side by side

on separate axes. Figure 1 shows a comparative plot for twoexpression measures.

325

Page 45: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Results:  DE  detection  ability

between 4 and 32) and high expressed (nominal concentration >32).For each of these subgroups we followed the same procedure usedto compute the signal detect slope. Specifically, a regression linewas fitted to the observed log expression values for the spiked-ingenes using nominal concentration as the predictor, and the slopeestimate recorded as the assessment measure. The new assessmentmeasures are referred to as low, med and high slopes and areshown in Table 2.To better assess the concentration dependent bias, we added the

plot shown in Figure 3b to the benchmark. In this figure, local slopesare calculated by taking the difference between the averageobserved log expression values between consecutive nominal con-centration levels. The difference between 1 and these local slopesare plotted against the larger of the two concentration levels. We

subtract from 1 because we are in the log-scale, thus all theseslopes should be 1 (when nominal concentration doubles soshould the observed concentrations). Notice that these curves followan upside-down U shape. This shape illustrates the fact that bias isworst for low expressed genes and high expressed genes. The lowexpressed genes are affected by background noise as describedabove. The high expressed genes are affected by scanner attenuation(not discussed in this paper).

3.2 Precision

To provide a more practical context for the new accuracy assess-ment measures, we defined the null log-fc 99.9% statistic shown inFigure 2. Row 6 in Table 1 of Cope et al. (2004) presented the inter-quartile range (IQR) of the observed log-fold-changes among thegenes that are known not to be differentially expressed. The newstatistics gives the 99.9% instead of the IQR. We have also added ameasure related to the Median SD represented by row 1 in Table 1ofCope et al. (2004). The previous measure used the dilution studydata. Similarly, a spike-in experiment version of Figure 2 in theoriginal benchmark was added.

0 20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

False Positives

Tru

e P

ositi

ves

VSNRMARMA_NBGGCRMAPLIER+16RSVD

(a) Nominal fold-chage = 2

0 20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

False Positives

Tru

e P

ositi

ves

(b) Low nominal concentration

Fig. 4. A typical identification rule for differential expression filters genes

with fold change exceeding a given threshold. This figure shows averageROC curves which offer a graphical representation of both specificity and

sensitivity for such a detection rule. (a) Average ROC curves based on

comparisons with nominal fold changes equal to 2. (b) As (a) but consideronly low concentration spiked-in genes.

Table 2. Table showing the new assessment summary statistics described in

the text

SlopeMethod SD 99.9% Low Med High AUC

GCRMA 0.08 0.74 0.66 1.06 0.56 0.70

GS_GCRMA 0.10 0.79 0.62 1.03 0.55 0.66

MMEI 0.04 0.23 0.16 0.54 0.46 0.62

GL 0.05 0.25 0.16 0.55 0.46 0.62RMA_NBG 0.04 0.24 0.16 0.56 0.46 0.61

RSVD 0.00 0.58 0.42 0.85 0.40 0.61

ZL 0.22 0.52 0.35 0.71 0.45 0.61VSN_scale 0.09 0.43 0.28 0.91 0.70 0.59

VSN 0.06 0.28 0.18 0.6 0.46 0.59

RMA_VSN 0.09 0.48 0.31 0.74 0.46 0.57

GLTRAN 0.07 0.42 0.23 0.61 0.45 0.55ZAM 0.09 0.50 0.30 0.70 0.47 0.54

RMA_GNV 0.11 0.58 0.35 0.76 0.47 0.52

RMA 0.11 0.57 0.35 0.76 0.47 0.52

GSrma 0.11 0.57 0.35 0.76 0.47 0.52GSVDmod 0.07 0.44 0.22 0.64 0.42 0.51

PerfectMatch 0.05 0.40 0.18 0.56 0.43 0.50

PLIER+16 0.13 0.83 0.49 0.80 0.46 0.48GSVDmin 0.08 0.60 0.22 0.62 0.41 0.41

MAS 5.0+32 0.14 1.07 0.35 0.71 0.44 0.12

ChipMan 0.27 2.26 0.44 1.11 0.68 0.12

qn.p5 0.12 1.09 0.13 0.50 0.52 0.11dChip PM-only 0.13 1.44 0.31 0.67 0.39 0.09

mmgMOSgs 0.40 3.27 1.34 1.13 0.45 0.07

gMOSv.1 0.29 3.35 0.98 1.12 0.42 0.06

ProbeProfiler 0.31 18.75 1.61 1.57 0.39 0.03dChip 0.23 14.83 1.40 0.86 0.35 0.02

mgMOS_gs 0.36 2.86 0.83 0.86 0.43 0.01

MAS 5.0 0.63 4.48 0.69 0.81 0.45 0.00

PLIER 0.19 123.27 0.75 0.85 0.46 0.00UMTrMn 0.32 2.92 0.58 0.83 0.42 0.00

The methods are ordered by their performance in the weighted average AUC value.

R.A.Irizarry et al.

792

Page 46: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Resources

• The  affycomp R  package  in  Bioconductor.• Online  tools:  http://rafalab.rc.fas.harvard.edu/affycomp.  

Page 47: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Bioconductor  for  microarray  data

• There  is  a  rich  collection  of  bioc  packages  for  microarrays.  In  fact,  Bioconductor  started  for  microarray  analysis.  

• There  are  currently  200+  packages  for  microarray.  • Important  ones  include:

– affy:  one  of  the  earliest  bioc  packages.  Designed  for  analyzing  data  from  Affymetrix  arrays.

– oligo:  preprocessing  tools  for  many  types  of  oligonecleotide  arrays.  This  is  designed  to  replace  affy  package.  

– limma and  siggenes:  DE  detection  using  limma  and  SAM-­‐t  model.  – Many  annotation  data  package  to  link  probe  names  to  genes.

• Data  normalization  and  summarization  can  be  done  using  oligo  package  (details  next  lecture).  

Page 48: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

Review

• We  have  covered  microarray  analysis,  including:– Data  preprocessing:  within  and  between  array  normalization.

– Summarization.

• Next  lecture:  – DE  detection  for  microarray.

Page 49: Introduction*to*gene*expression* microarray* · PDF filee.g.,&“theexpression&of&geneX&is&higher&in&caner& compared&to&normal”.& ... ,25 indicates the position along the probe,

To  do  list

• Review  the  slides.• Read  the  review  article  on  Nature  Reviews  Genetics   (link  on  the  class  webpage).