experimental design, normalization, and exploratory data analysis class web site: statistics for...
Post on 15-Jan-2016
235 views
TRANSCRIPT
Experimental Design, Normalization, and Exploratory Data Analysis
Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/
Statistics for Microarrays
A B
C
Biological questionDifferentially expressed genesSample class prediction etc.
Testing
Biological verification and interpretation
Microarray experiment
Estimation
Experimental design
Image analysis
Normalization
Clustering Discrimination
R, G
16-bit TIFF files
(Rfg, Rbg), (Gfg, Gbg)
Some Considerations for cDNA Microarray Experiments (I)
Scientific (Aims of the experiment)• Specific questions and priorities• How will the experiments answer the questions
Practical (Logistic)• Types of mRNA samples: reference, control, treatment, mutant, etc• Source and Amount of material (tissues, cell lines)• Number of slides available
Some Considerations for cDNA Microarray Experiments (II)
Other Information• Experimental process prior to hybridization: sample isolation, mRNA extraction, amplification, labelling,… • Controls planned: positive, negative, ratio, etc.• Verification method: Northern, RT-PCR, in situ hybridization, etc.
Aspects of Experimental Design Applied to Microarrays (I)
Array Layout•Which cDNA sequences are printed•Spatial position
Allocation of samples to slides •Design layouts
•A vs B: Treatment vs control•Multiple treatments•Factorial •Time series
Aspects of Experimental Design Applied to Microarrays (II)
Other considerations•Replication•Physical limitations: the number of slides and the amount of material
•Sample Size•Extensibility - linking
Layout options
The main issue is the use of reference samples, typically labelled green. Standard statistical design principles can lead to more efficient layouts; use of dye-swaps can also help.
Sample size determination is more than usually difficult, as there are 1,000s of possible changes, each with its own SD.
Natural design choice
Case 1: Meaningful biological control (C)Samples: Liver tissue from four mice treated by cholesterol modifying
drugs.Question 1: Genes that respond differently between the T and the C.Question 2: Genes that responded similarly across two or more
treatments relative to control.Case 2: Use of universal referenceSamples: Different tumor samples.Question: To discover tumor subtypes.
C
T1 T2 T3 T4 T1
Ref
T2 Tn-1 Tn
Treatment vs Control
Two samplese.g. KO vs. WT or mutant vs. WT
T CT Ref
C Ref
Direct Indirect
2 /2 22
average (log (T/C)) log (T / Ref) – log (C / Ref )
I) Common Reference
II) Common reference
III) Direct comparison
Number of Slides
Ave. variance
Units of material A = B = C = 1 A = B = C = 2 A = B = C = 2
Ave. variance
One-way layout: one factor, k levels
C B
A
ref
CBA
ref
CBA
I) Common Reference
II) Common reference
III) Direct comparison
Number of Slides N = 3 N=6 N=3
Ave. variance 2 0.67
Units of material A = B = C = 1 A = B = C = 2 A = B = C = 2
Ave. variance 1 0.67
One-way layout: one factor, k levels
C B
A
ref
CBA
ref
CBA
For k = 3, efficiency ratio (Design I / Design III) = 3. In general, efficiency ratio = 2k / (k-1). (But may not be achievable due to lack of independence.)
Design I
Design III
A B
C
A
Ref
B C
Illustration from one experiment
Box plots of log ratios: direct still ahead
CTL OSM
EGF OSM & EGF
Factorial experiments
•Treated cell lines
•Possible experiments
Here interest is not in genes for which there is an O or an E (main) effect, but in which there is an OE interaction, i.e. in genes for which log(O&E/O)-log(E/C) is large or small.
Indirect A balance of direct and indirect
I) II) III) IV)
# Slides N = 6
Main effect A
0.5 0.67 0.5 NA
Main effect B
0.5 0.43 0.5 0.3
Int A.B 1.5 0.67 1 0.67
2 x 2 factorial: some design options
C
A.BBA
B
C
A.B
A
B
C
A.B
A
B
C
A.B
A
Table entry: variance (assuming all log ratios uncorrelated)
Some Design Possibilities for Detecting Interaction
Samples: treated tumor cell lines at 4 time points (30 minutes, 1 hour, 4 hours, 24 hours)
Question: Which genes contribute to the enhanced inhibitory effect of OSM when it is combined with EGF? Role of time?
ctl OSM
EGF OSM & EGF
ctl
OSM EGF OSM & EGF
2
Design A: Design B:
Combining Estimates
Different ways of estimating the same contrast:e.g. A compared to P
Direct = A-P
Indirect = A-M + (M-P) or
A-D + (D-P) or
-(L-A) - (P-L)
How do we combine these?
LL
PPVV
DD
MM
AA
Time Course Experiments
• Number of time points• Which differences are of highest
interest (e.g. between initial time and later times, between adjacent times)
• Number of slides available
Design choices in time series. Entry: variance
t vs t+1 t vs t+2 t vs t+3
Ave
T1T2 T2T3 T3T4 T1T3 T2T4 T1T4
N=3 A) T1 as common reference 1 2 2 1 2 1 1.5
B) Direct Hybridization 1 1 1 2 2 3 1.67
N=4 C) Common reference 2 2 2 2 2 2 2
D) T1 as common ref + more .67 .67 1.67 .67 1.67 1 1.06
E) Direct hybridization choice 1 .75 .75 .75 1 1 .75 .83
F) Direct Hybridization choice 2 1 .75 1 .75 .75 .75 .83
T2 T3 T4T1
T2 T3 T4T1
Ref
T2 T3 T4T1
T2 T3 T4T1
T2 T3 T4T1
T2 T3 T4T1
Replication
• Why?• To reduce variability• To increase generalizability
• What is it?• Duplicate spots• Duplicate slides
• Technical replicates• Biological replicates
Technical Replicates: Labeling
• 3 sets of self – self hybridizations
• Data 1 and Data 2 were labeled together and hybridized on two slides separately
• Data 3 were labeled separately
Data 1 Data 1
Dat
a 2
Dat
a 3
Sample Size
• Variance of individual measurements (X)
• Effect size(s) to be detected (X)• Acceptable false positive rate• Desired power (probability of
detecting an effect of at least the specfied size)
Extensibility
• “Universal” common reference for arbitrary undetermined number of (future) experiments
• Provides extensibility of the series of experiments (within and between labs)
• Linking experiments necessary if common reference source diminished/depleted
Summary
• Balance of direct and indirect comparisons
• Optimize precision of the estimates among comparisons of interest
• Must satisfy scientific and physical constraints of the experiment
(BREAK)
Mini-Review: How to make a cDNA microarray
384 well plate --
Contains cDNA probes
Glass Slide
Array of bound cDNA probes
4x4 blocks = 16 print-tip groups Print-tip group 6
cDNA clones
Spotted in duplicate
Print-tip group 1
Pins collect cDNA from wells
Ngai Lab arrayer , UC Berkeley
Building the chip
Print-tip head
Microarray Experiment
HybridizationBinding cDNA samples (targets) to cDNA probes on
slide
cover
slip
Hybridise for
5-12 hours
Quantification of expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
fg = foreground, bg = background, and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2( Red intensity / Green intensity)
Background matters
From Spot From GenePix
Quality Measurements
• Array– Correlation between spot intensities– Percentage of spots with no signals– Distribution of spot signal area
• Spot– Signal / Noise ratio– Variation in pixel intensities– Identification of “bad spot” (spots with
no signal)• Ratio (2 spots combined)
– Circularity
Affymetrix Oligo Chips
• Only one “color”
• Different technology, different normalization issues
• Affy chip normalization is an active research area – see http://www.stat.berkeley.edu/users/terry/zarray/Affy/affy_index.html
• Was the experiment a success?
• Are there any specific problems?
• What analysis tools should be used?
Preprocessing: Data VisualizationPreprocessing: Data Visualization
Tools for Microarray Normalization and Analysis
• Both commercial and free software
• The labs for this course use the R package sma
• Upcoming release (29 April 2002) of Bioconductor (http://www.bioconductor.org/)
Red/Green overlay images
Good: low bg, lots of d.e. Bad: high bg, ghost spots, little d.e.
Co-registration and overlay offers a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artefactssuch as dust or scratches
Scatterplots: always log, always rotate
log2R vs log2G M=log2R/G vs A=log2√RG
Signal/Noise = log2(spot intensity/background intensity)
Histograms
Boxplots of log2R/G
Liver samples from 16 mice: 8 WT, 8 ApoAI KO
Spatial plots: background from the two slides
Highlighting extreme log ratios
Top (black) and bottom (green) 5% of log ratios
Pin group (sub-array) effects
Boxplots of log ratios by pin groupLowess lines through points from pin groups
Boxplots and highlighting pin group effects
Clear example of spatial bias
Print-tip groups
Lo
g-r
ati o
s
Plate effects
KO #8
Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays.
Clearly visible plate effects
Time of printing effects
Green channel intensities (log2G). Printing over 4.5 days.The previous slide depicts a slide from this print run.
spot number
Preprocessing: Normalization• Why?
To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.
• How do we know it is necessary? By examining self-self hybridizations,
where no true differential expression is occurring.
We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters,….
Self-self hybridizations
False color overlay Boxplots within pin-groups Scatter (MA-)plots
From the NCI60 data set (Stanford web site)
Similar patterns apparent in non self-self hybridizations
From Lawrence Berkeley National Laboratory
Normalization Methods (I)• Normalization based on a global adjustment
log2 R/G -> log2 R/G - c = log2 R/(kG)
Choices for k or c = log2k are c = median or mean of log ratios for a particular gene set (e.g. housekeeping genes). Or, total intensity normalization, where
k = ∑Ri/ ∑Gi. • Intensity-dependent normalization Here, run a line through the middle of the MA plot,
shifting the M value of the pair (A,M) by c=c(A), i.e. log2 R/G -> log2 R/G - c (A) = log2 R/(k(A)G).
One estimate of c(A) is made using the LOWESS function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing.
Normalization Methods (II)• Within print-tip group normalization In addition to intensity-dependent variation in log
ratios, spatial bias can also be a significant source of systematic error.
Most normalization methods do not correct for spatial effects produced by hybridization artefacts or print-tip or plate effects during the construction of the microarrays.
It is possible to correct for both print-tip and intensity-dependent bias by performing LOWESS fits to the data within print-tip groups, i.e.
log2 R/G -> log2 R/G - ci(A) = log2 R/(ki(A)G),
where ci(A) is the LOWESS fit to the MA-plot for the ith grid only.
Normalization: Which Spots to use?
The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability.
For example, the use of a global LOWESS approach
can be justified by supposing that, when stratified by mRNA abundance, a) only a minority of genes are expected to be differentially expressed, or b) any differential expression is as likely to be up-regulation as down-regulation.
Pin-group LOWESS requires stronger assumptions:
that one of the above applies within each pin-group. The use of other sets of genes, e.g. control or
housekeeping genes, involve similar assumptions.
Use of Control Slides: M vs A PlotM = log R/G = logR - logG
A = ( logR + logG ) /2
Positive controlsNegative controls
blanks
Lowess curve
Global scale, global lowess, pin-group lowess; spatial plot after, smooth histograms of M after
Normalization makes a difference
Normalization by controls:Microarray Sample Pool titration
series
Control set to aid intensity- dependent normalization
Different concentrations in titration series
Spotted evenly spread across the slide in each pin-group
Pool the whole library
Comparison of Normalization Schemes
(courtesy of Jason Goncalves)
No consensus on best normalization method Experiment done to assess the common
normalization methods Based on reciprocal labeling experimental
data for a series of 140 replicate experiments on two different arrays each with 19,200 spots
DESIGN OF RECIPROCALDESIGN OF RECIPROCALLABELING EXPERIMENTLABELING EXPERIMENT
Replicate experiment in which we assess the same mRNA pools but invert the fluors used.
The replicates are independent experiments and are scanned, quantified and normalized as usual
Comparison of Normalization Methods - Using 140 19K Microarrays
0.3
0.32
0.34
0.36
0.38
0.4
0.42
0.44
0.46
Pre Normalized Global Intensity Subarray Intensity Global Ratio Sub-Array Ratio Global LOWESS Subarray LOWESS
Normalization Method
Ave
rage
Mea
n D
evia
tion
Val
ue
***
Scale normalization: between slides
Boxplots of log ratios from 3 replicate self-self hybridizations.Left panel: before normalizationMiddle panel: after within print-tip group normalizationRight panel: after a further between-slide scale normalization.
The “NCI 60” experiments (no bg)
Some scale normalization seems desirable
Scale normalization: another data set
Lo
g-r
ati o
s
Only small differences in spread apparent. No action required.
`
Assumption: All slides have the same spread in M
True log ratio is mij where i represents different slides and j represents different spots.
Observed is Mij, where
Mij = ai mij
Robust estimate of ai is
MADi = medianj { |yij - median(yij) | }
One way of taking scale into account
A slightly harder normalization problem
Global lowess doesn’t do the trick here
Print-tip-group normalization helps
But not completely
Still a lot of scatter in the middle in a WT vs KO comparison
Effects of previous normalization
Before normalization After print-tip-groupnormalization
Within print-tip-group box plots of M after print-tip-group
normalization
Assumption: All print-tip-groups have the same spread in M True log ratio is mij where i represents
different print-tip-groups and j represents different spots.
Observed is Mij, where
Mij = ai mij
Robust estimate of ai is
MADi = medianj { |yij - median(yij) | }
Taking scale into account, cont.
Effect of location & scale normalization
Clearly care is needed in making decisions like this
A comparison of three M v A plots
Unnormalized Print-tip normalization Print tip & scale n
The same normalization on another data set
.
Before
After
Normalization: Summary
• Reduces systematic (not random) effects• Makes it possible to compare several arrays
• Use logratios (M vs A-plots)• Lowess normalization (dye bias)• MSP titration series – composite normalization• Pin-group location normalization• Pin-group scale normalization• Between slide scale normalization
• Control Spots• Normalization introduces more variability• Outliers (bad spots) are handled with replication
Pre-processed cDNA Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing,
Genes
Slides
Gene expression level of gene 5 in slide 4
= Log2( Red intensity / Green intensity)
slide 1 slide 2 slide 3 slide 4 slide 5 …
1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...
These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
First Steps: QQ-Plots
Used to assess whether a sample follows aparticular (e.g. normal) distribution (or to compare two samples)
A method for lookingfor outliers when dataare mostly normal
Sam
ple
Theoretical
Sample quantile is 0.125
Value from Normal distribution which yields a quantile of 0.125
Acknowledgments
Terry Speed (UCB and WEHI)
Jean Yee Hwa Yang (UCB)Sandrine Dudoit (UCB)Ben Bolstad (UCB)Natalie Thorne (WEHI)Ingrid Lönnstedt
(Uppsala)Henrik Bengtsson (Lund)
Jason Goncalves (Iobion)
Matt Callow (LLNL)Percy Luu (UCB)John Ngai (UCB)Vivian Peng (UCB)
Dave Lin (Cornell)