Statistical Analyses of Microarray Data
Rafael A. Irizarry
Department of Biostatistics
[email protected]
http://biosun01.biostat.jhsph.edu/~ririzarr
Outline
• Scientific questions
• Review of technology
• Role of statistics
• Two case studies
Scientific Questions
• Expression
• Differential expression
• Expression patterns
“To understand gene function, it is helpful to know when and where it is expressed and…”
“…under what circumstances the expression level is affected.”
“… questions concerning functional pathways and how cellular components work together to regulate and carry out cellular processes.”
Lipshutz et al. (1999) Nature Genetics, 21, pp. 20-21
What do Microarrays do?
Interrogate labeled nucleic acid samples
model systems, microdissections, cell lines, human tissue bank
[Figure labels: kanR, UPTAG, DOWNTAG]
• RNA samples
• Oligonucleotide barcodes
How do they do it?
Probes
Labeled targets
[Figure: cDNA array workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); microarray; mRNA target; hybridize target to microarray; excitation by laser 1 and laser 2; emission; scanning; overlay images and normalize; analysis]
cDNA Arrays
High Density Oligonucleotide Arrays
[Figure: GeneChip probe array, 1.28 cm across. Each hybridized probe cell (24 µm) holds millions of copies of a specific oligonucleotide probe; the array carries >200,000 different complementary probes. Single-stranded, labeled RNA target binds the oligonucleotide probes. Inset: image of hybridized probe array.]
Compliments of D. Gerhold
Role of Statistics
Biological question: differentially expressed genes, sample class prediction, etc.
Testing
Biological verification and interpretation
Microarray experiment
Estimation
Experimental design
Image analysis
Normalization
Clustering Discrimination
Quantify Expression
Part of the image of one channel, false-coloured on a scale from white (very high) and red (high), through yellow and green (medium), to blue (low) and black
Does one size fit all?
Segmentation: limitation of the fixed circle method
SRG Fixed Circle
Inside the boundary is spot (fg), outside is not.
Some local backgrounds
We use something different again: a smaller, less variable value.
Single channel grey scale
Quantification of Expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
(fg = foreground, bg = background)
and combine them in the log (base 2) ratio
log2(Red intensity / Green intensity)
We now have one differential expression value for each gene on each array
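As a minimal sketch of the spot quantification described above, the following Python computes the background-corrected red and green intensities and their log (base 2) ratio. The function names, the example numbers, and the `floor` guard (which keeps the logarithm defined when background exceeds foreground) are illustrative assumptions and are not taken from the slides.

```python
import math

def background_corrected(fg, bg, floor=1.0):
    """Foreground minus background; `floor` is an assumed guard so the log is defined."""
    return max(fg - bg, floor)

def log2_ratio(r_fg, r_bg, g_fg, g_bg, floor=1.0):
    """log2(Red intensity / Green intensity) for a single spot."""
    red = background_corrected(r_fg, r_bg, floor)
    green = background_corrected(g_fg, g_bg, floor)
    return math.log2(red / green)

# Hypothetical spot: red = 900 - 100 = 800, green = 300 - 100 = 200
m = log2_ratio(r_fg=900.0, r_bg=100.0, g_fg=300.0, g_bg=100.0)
print(m)
```

A positive value means the red channel is brighter than the green channel for that spot; a value of 0 means equal intensity.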
Top 2.5% of ratios red, bottom 2.5% of ratios green
The red-green ratios can be spatially biased
Another example
Oligo Array Image Analysis
• About 100 pixels per probe cell
• These intensities are combined to form one number representing expression for the probe cell oligo
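The slides do not say how the roughly 100 pixel intensities are combined into one number. As a hedged illustration only, the sketch below uses a high percentile of the pixels in a cell; the choice of the 75th percentile, the nearest-rank rule, and the name `cell_summary` are assumptions.

```python
import math

def cell_summary(pixels, pct=75.0):
    """Nearest-rank percentile of the pixel intensities in one probe cell.

    The percentile level is an assumed choice, not taken from the slides.
    """
    if not pixels:
        raise ValueError("empty probe cell")
    ordered = sorted(pixels)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical probe cell with 100 pixels (intensities 1..100)
cell = [float(v) for v in range(1, 101)]
value = cell_summary(cell)
print(value)
```

Any robust summary (for example a median or trimmed mean) could be substituted by changing the body of `cell_summary`.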
Normalization at Probe Level
Dilution Experiment Data
PM MM
Default until 2002
• GeneChip® software uses Avg.diff
$$\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$
with A a set of “suitable” pairs chosen by software.
• Log ratio version is also used.
• For differential expression Avg.diffs are compared between chips.
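The Avg.diff summary described above averages the PM minus MM differences over a set A of probe pairs. A minimal sketch follows; how the software chooses the “suitable” pairs is not given in the slides, so the set is passed in by the caller, and the function name and example intensities are hypothetical.

```python
def avg_diff(pm, mm, pairs=None):
    """Average of PM[j] - MM[j] over the probe pairs in `pairs` (the set A).

    If `pairs` is None, all probe pairs are used (an assumption for this sketch).
    """
    if len(pm) != len(mm):
        raise ValueError("PM and MM must have the same length")
    idx = list(range(len(pm))) if pairs is None else list(pairs)
    if not idx:
        raise ValueError("the set A of probe pairs is empty")
    return sum(pm[j] - mm[j] for j in idx) / len(idx)

# Hypothetical probe set with four PM/MM pairs
pm = [1200.0, 800.0, 950.0, 5000.0]
mm = [400.0, 300.0, 450.0, 4900.0]
all_pairs = avg_diff(pm, mm)                 # uses every pair
subset = avg_diff(pm, mm, pairs=[0, 1, 2])   # drops the last pair
print(all_pairs, subset)
```

The example shows how much the result depends on which pairs are included in A.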
What is the evidence? Lockhart et al. Nature Biotechnology 14 (1996)
Two case studies
Spike-In Experiments
• Add concentrations (0.5 pM – 100 pM) of 11 foreign-species cRNAs to the hybridization mixture
• Set A: 11 control cRNAs were spiked in, all at the same concentration, which varied across chips.
• Set B: 11 control cRNAs were spiked in, all at different concentrations, which varied across chips. The concentrations were arranged in a 12x12 cyclic Latin square (with 3 replicates)
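To illustrate what a cyclic Latin square arrangement looks like, the sketch below builds one by shifting a list of concentration levels one position per row. The twelve levels are the distinct values that appear in the Spike-In B table; their ordering here, and the reading of rows as chips and columns as transcript positions, are assumptions made for illustration.

```python
def cyclic_latin_square(levels):
    """Row i is `levels` rotated left by i positions."""
    n = len(levels)
    return [[levels[(row + col) % n] for col in range(n)] for row in range(n)]

# Concentrations in pM; the ordering is assumed
concs = [0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 12.5, 25.0, 35.7, 50.0, 75.0, 100.0]
design = cyclic_latin_square(concs)

for row in design[:3]:
    print(row)
```

Each level occurs exactly once in every row and every column, so each position is observed at every concentration across the twelve rows.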
Set A: Probe Level Data (12 chips)
Spike-In B
Probe Set   Conc 1 (pM)   Conc 2 (pM)   Rank
BioB-5      100           0.5           1
BioB-3      0.5           25.0          2
BioC-5      2.0           75.0          3
BioB-M      1.0           35.7          4
BioDn-3     1.5           50.0          5
DapX-3      35.7          3.0           6
CreX-3      50.0          5.0           7
CreX-5      12.5          2.0           8
BioC-3      25.0          100           9
DapX-5      5.0           1.5           10
DapX-M      3.0           1.0           11
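The Rank column in the Spike-In B table appears consistent with ordering the probe sets by the size of the fold change between the two concentrations, largest first. The slides do not state this rule, so treat it as an observation; the sketch below checks it using the concentrations from the table.

```python
import math

# (Conc 1, Conc 2) in pM, copied from the Spike-In B table
spike_in_b = {
    "BioB-5": (100.0, 0.5), "BioB-3": (0.5, 25.0), "BioC-5": (2.0, 75.0),
    "BioB-M": (1.0, 35.7), "BioDn-3": (1.5, 50.0), "DapX-3": (35.7, 3.0),
    "CreX-3": (50.0, 5.0), "CreX-5": (12.5, 2.0), "BioC-3": (25.0, 100.0),
    "DapX-5": (5.0, 1.5), "DapX-M": (3.0, 1.0),
}

def rank_by_abs_log_fc(concs):
    """Map probe set -> rank, where rank 1 has the largest |log2(conc2 / conc1)|."""
    score = {name: abs(math.log2(c2 / c1)) for name, (c1, c2) in concs.items()}
    ordered = sorted(score, key=score.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

ranks = rank_by_abs_log_fc(spike_in_b)
print(ranks)
```

An expression measure that tracks concentration well should place these eleven probe sets near the top when all genes on the array are ranked the same way.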
Later we consider 23 different combinations of concentrations
Observed Ranks
Gene      AvDiff   MAS 5.0   Li&Wong   AvLog(PM-BG)
BioB-5    6        2         77        1
BioB-3    16       1         33        2
BioC-5    74       6         22        5
BioB-M    30       3         6         3
BioDn-3   44       5         27        4
DapX-3    239      24        796       7
CreX-3    333      73        386       11
CreX-5    3276     33        43        9
BioC-3    2709     8572      12        10300
DapX-5    2709     102       59        17
DapX-M    165      19        30        6
[Figure A: kanR; transformation into deletion pool; select for Ura+ transformants; genomic DNA preparation; circular pRS416; PCR; Cy5-labeled PCR products; Cy3-labeled PCR products; oligonucleotide array hybridization]
[Figure B: EcoRI-linearized pRS416; NHEJ defective; MCS; CEN/ARS; URA3; ttaaaatt; UPTAG; DOWNTAG]
[Array image labels: YKU70, NEJ1, YKU80]
Average Red and Green Scatter Plot
Average Red and Green MVA plot
Histograms
QQ-Plot
Z-Scores
Average Red and Green MVA Plot
Average Red and Green Scatter Plot
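The MVA plots listed above are commonly drawn with M, the log ratio, on the vertical axis and A, the average log intensity, on the horizontal axis. The slides do not spell out the definition, so the sketch below uses the usual convention, M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2, as an assumption; the intensities are hypothetical.

```python
import math

def mva(red, green):
    """Return a list of (M, A) pairs, one per spot."""
    points = []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        points.append((log_r - log_g, (log_r + log_g) / 2.0))
    return points

# Two hypothetical spots: one four-fold brighter in red, one unchanged
points = mva([1024.0, 256.0], [256.0, 256.0])
print(points)
```

Compared with a plain red-versus-green scatter plot, this rotation makes intensity-dependent bias easier to see, because unchanged spots should scatter around M = 0 at every value of A.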
Summary
• Simple data exploration is a useful tool for quality assessment
• Statistical thinking is helpful for interpretation
• Statistical models may help find signals in noise
Acknowledgements
UC Berkeley Stat: Ben Bolstad, Sandrine Dudoit, Terry Speed, Jean Yang
MBG (SOM): Jef Boeke, Siew-Loon Ooi, Marina Lee, Forrest Spencer
Biostatistics: Karl Broman, Leslie Cope, Carlo Coulantoni, Giovanni Parmigiani, Scott Zeger
Gene Logic: Francois Colin, Uwe Scherf’s Group
PGA: Tom Cappola, Skip Garcia, Joshua Hare
WEHI: Bridget Hobbs, Natalie Thorne