Linear models and limma Copenhagen, 19 August 2009
Mark D. Robinson, WEHI/Garvan 1
1
Linear models and Limma
Københavns Universitet, 19 August 2009
Mark D. Robinson
Bioinformatics, Walter+Eliza Hall Institute
Epigenetics Laboratory, Garvan Institute
(with many slides taken from Gordon Smyth)
2
Limma = linear models for microarray data
o Morning Theory
  - Introduction
  - Background correction
  - Moderated t-tests
  - Simple linear models
o Morning Practical
  - Demonstration of smoothing
  - Limma objects (beta7)
  - Background correction and normalization (beta7)
  - Simple experimental designs: 2-colour example (beta7), Affymetrix example (cancer)
o Afternoon Theory
  - More advanced designs / linear modeling
  - Moderated F-tests
  - Gene set tests
  - Other analyses limma can do
o Afternoon Practical
  - Factorial design (estrogen)
  - Gene set testing (cancer)
  - Time course experiment (SAHA/depsipeptide)
3
Expression measures
$y_{ga}$: expression measure for probe or gene $g$ on array $a$
Two-colour: $y_{ga} = \log_2(R/G)$
Affymetrix: $y_{ga}$ = log-intensity (summarized over probes)
Illumina: $y_{ga}$ = log-intensity (summarized over beads)
4
Questions of Interest
o What genes have changed in expression? (e.g. between disease/normal, affected by treatment)
  Gene discovery, differential expression
o Is a specified group of genes all up-regulated in a particular condition?
  Gene set differential expression
o Can the expression profile predict outcome?
  Class prediction, classification
o Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways?
  Class discovery, clustering
Today will cover the first two questions - differential expression
5
Microarray Differential Expression Studies
o 10^3 – 10^6 genes/probes/exons on a chip or glass slide
o Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (for 2-colour data)
o Several steps to go from raw data to table of “expression”: background correction, normalization
o Idea: Fit a linear model to the expression data for each gene
6
Two colour microarrays
http://en.wikipedia.org/wiki/DNA_microarray
7
Two-colour data: Log-Intensities
For each probe:
$R = \log_2(R_f - R_b)$
$G = \log_2(G_f - G_b)$
Various ways to calculate background. Will often modify to ensure $R_f - R_b > 0$ and $G_f - G_b > 0$.
8
Two-colour data: Means and Differences
For each probe:
$M = R - G$ (“minus”)
$A = (R + G)/2$ (“add”)
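As an illustration, the log-intensities and the M and A values for one spot can be computed as in the sketch below. The function name `ma_values` and the example intensities are hypothetical; only the formulas come from the slides.

```python
import math

def ma_values(r_fg, r_bg, g_fg, g_bg):
    """Return (M, A) for one spot from foreground/background intensities."""
    r = math.log2(r_fg - r_bg)   # red log-intensity; requires Rf - Rb > 0
    g = math.log2(g_fg - g_bg)   # green log-intensity; requires Gf - Gb > 0
    m = r - g                    # log-ratio ("minus")
    a = (r + g) / 2              # average log-intensity ("add")
    return m, a

# Hypothetical spot: Rf = 900, Rb = 100, Gf = 300, Gb = 100
m, a = ma_values(900, 100, 300, 100)
print(m, a)   # M is log2(800/200) = 2, a 4-fold change
```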
9
Data Summaries
For each gene:
G1 R1 G2 R2 G3 R3 G4 R4
A1 M1 A2 M2 A3 M3 A4 M4
10
MA Plot
11
Log-Ratios or Single Channel Intensities?
o Traditional analysis treats log-ratios M = log(R/G) as the primary data, i.e., gene expression measurements are relative
o Alternative approach treats individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr)
o Single channel approach makes new analyses possible but
  - makes stronger assumptions
  - requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on same spot
12
BG correction affects DE results
o Importance of careful pre-processing and quality control cannot be over-emphasized for microarray data
o Can have dramatic effect on differential expression results
o Consider here the normexp method of adaptive background correction
  - background correction step of the RMA algorithm
  - can also be applied to two colour data
13
Additive + multiplicative error model
Observe intensity for one probe on one array:
Intensity = background + signal, I = B + S
Background carries additive errors; signal carries multiplicative errors
This idea underlies the variance stabilizing transformations vsn (two colour data) and vst (for Illumina data)
14
normexp convolution model
Intensity = Background + Signal
Background ~ N(μ, σ²), Signal ~ Exponential(α)
15
Conditional expectation under normexp model
Then
with
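A minimal sketch of the conditional expectation E(S | I) under the normexp model follows. It assumes that α is the mean of the exponential signal (the parametrization is an assumption here). Under that assumption, S given I is a normal distribution with mean I − μ − σ²/α and variance σ², truncated to S > 0. The function name `normexp_signal` is hypothetical.

```python
import math

def normexp_signal(intensity, mu, sigma, alpha):
    """E(S | I) when B ~ N(mu, sigma^2) and S ~ Exponential with mean alpha (assumed)."""
    m = intensity - mu - sigma ** 2 / alpha                 # mean of the untruncated normal for S | I
    z = m / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # standard normal density at z
    cdf = 0.5 * math.erfc(-z / math.sqrt(2))                # standard normal CDF at z
    return m + sigma * pdf / cdf                            # mean of N(m, sigma^2) truncated to S > 0

# Hypothetical parameters: an intensity below the background mean still gives a positive signal
low = normexp_signal(50, mu=100, sigma=20, alpha=200)
high = normexp_signal(10000, mu=100, sigma=20, alpha=200)
print(low, high)
```

For very negative values of z the ratio pdf/cdf loses precision in floating point, so a production implementation would need a more careful evaluation in that tail.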
16
normexp background correction
o Estimate the three parameters
o Replace I with E(S|I)
o For Affymetrix data, I is the “Perfect Match” data intensity
o For two-colour data, I = Rf - Rb or I = Gf - Gb
o In the RMA algorithm, parameter estimation uses an ad hoc density kernel method
o In limma (two colour), parameter estimation maximises the saddlepoint approximation to the likelihood
17
PM data on log2 scale: raw and fitted model
18
Background corrected intensity = E(Signal | Observed Intensity)
[Figure: E(Signal | Intensity) plotted against Observed Intensity]
Adaptive background correction produces positive signal
19
Offsets to stabilise the variance
Offset reduces variability at low intensities
Log-ratios
Background correction
20
Why do offsets stabilize the variance?
21
Why do offsets stabilize the variance?
o log2(800/100) = log2(8/1) = 3 (8-fold)
o Additive noise affects small numbers more
o Offsets introduce bias:
  log2[(80+10)/(10+10)] = 2.17
o But the tradeoff (drop in variance for increase in bias) is usually worth it
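The trade-off can be checked numerically with a small sketch. The function `log_ratio` and the noise of ±5 in the green channel are illustrative assumptions; the offset of 10 and the intensities 80 and 10 are taken from the slide.

```python
import math

def log_ratio(r, g, offset=0.0):
    """log2 ratio after adding the same offset to both channels."""
    return math.log2((r + offset) / (g + offset))

print(log_ratio(80, 10))                  # 3.0, the true 8-fold change
print(round(log_ratio(80, 10, 10), 2))    # 2.17, biased towards zero by the offset

# Additive noise of -5 and +5 in the green channel at low intensity
no_offset = [log_ratio(80, 10 + e) for e in (-5, 5)]
with_offset = [log_ratio(80, 10 + e, 10) for e in (-5, 5)]
spread_no = max(no_offset) - min(no_offset)
spread_with = max(with_offset) - min(with_offset)
print(spread_no, spread_with)             # the offset roughly halves the spread
```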
22
A self-self experiment: two background methods
553 spots not plotted
23
Comparison of 2-colour BG correction methods
[Figure: False discoveries (limma) plotted against Genes selected]
Ritchie et al. 2007
24
References
o Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddlepoint approximation]
o Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows “normexp” performs best for 2-colour data]
o Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [describes RMA BG correction, but doesn't give much detail of the normexp convolution model]
25
Normalization
26
Normalization: Two-colour
BG correction and normalization are closely connected
Even after BG correction, some effects remain.
27
Normalization: One-colour
Similarly for single channel data, adjustments need to be made for all samples to be comparable.
28
Moderated t-tests
29
Borrowing information across genes
o Small data sets: few arrays, inference for individual genes is uncertain
o Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power
o Benefit of parallelism: same model is fitted for every gene. Can borrow information from one gene to another
30
Hard and soft shrinkage
o Hard: simplest way to borrow information is to assume that one or more parameters are constant across genes
o Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage …
31
A very common experiment
Wild-type mouse x 2 Mutant mouse x 2
Which genes are differentially expressed?
n1 = n2 = 2 Affymetrix arrays
25,000 probe-sets
Gene X
32
Ordinary t-tests
give very high false discovery rates
Residual df = 2
33
Another very common experiment
Which genes are differentially expressed?
n = 2 two-colour arrays
30,000 probes
Wild-type mouse 1 Mutant mouse 1
Wild-type mouse 2 Mutant mouse 2
34
Ordinary t-tests
give very high false discovery rates
Residual df = 1
35
Small sample size, many tests
The problem:
o These experiments would be under-powered even with just one gene
o Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power
The solution:
The same statistical model is being fitted for every gene in parallel. Can borrow strength from other genes.
36
t-tests with common variance
with residual standard deviation pooled across genes
More stable, but ignores gene-specific variability
37
A better compromise
Moderated t-statistics
Shrink standard deviations towards common value
d = degrees of freedom
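A minimal sketch of the shrinkage follows, assuming the standard form in which the moderated variance is a degrees-of-freedom weighted average of the prior value s0² and the gene-wise s² (Smyth 2004). The function names and the argument `u` (the unscaled standard error of the coefficient) are hypothetical, as are the example numbers.

```python
import math

def ordinary_t(beta_hat, s2, u):
    """Ordinary t-statistic using the gene-wise variance s2."""
    return beta_hat / (u * math.sqrt(s2))

def moderated_t(beta_hat, s2, d, s0_2, d0, u):
    """Moderated t: the gene-wise variance s2 (d df) is shrunk towards s0_2 (d0 df)."""
    s2_tilde = (d0 * s0_2 + d * s2) / (d0 + d)   # weighted average of prior and sample variance
    return beta_hat / (u * math.sqrt(s2_tilde))

# Hypothetical gene whose variance is tiny by chance (d = 2 residual df)
t_ord = ordinary_t(1.0, s2=0.01, u=1.0)
t_mod = moderated_t(1.0, s2=0.01, d=2, s0_2=0.25, d0=4, u=1.0)
print(t_ord, t_mod)   # the moderated statistic is far less extreme
```

With d0 = 0 the moderated statistic reduces to the ordinary t; as d0 grows it approaches the pooled-variance t.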
38
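Shrinkage of standard deviations
[Figure: gene-wise standard deviations $s_1, s_2, \ldots, s_G$ are shrunk towards $s_0$ to give $\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_G$; labels $t_{g,\mathrm{pooled}}$, $\tilde{t}_g$, $t_g$, $d$, $d_0$]
The data decides whether $\tilde{t}_g$ should be closer to $t_{g,\mathrm{pooled}}$ or to $t_g$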
39
Why does it work?
o We learn what is the typical variability level by looking at all genes, but allow some flexibility from this for individual genes
o Adaptive
40
Hierarchical model for variances
Data
Prior
Posterior
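For reference, a sketch of the three components in the form given by Smyth (2004), which this slide cites on the following page. The notation ($d_g$ residual degrees of freedom, prior parameters $s_0^2$ and $d_0$) is assumed to match the slides.

```latex
% Data: sampling distribution of the gene-wise residual variance
s_g^2 \mid \sigma_g^2 \;\sim\; \frac{\sigma_g^2}{d_g}\,\chi^2_{d_g}

% Prior: scaled inverse chi-square prior on the true variance
\frac{1}{\sigma_g^2} \;\sim\; \frac{1}{d_0 s_0^2}\,\chi^2_{d_0}

% Posterior: the posterior mean of 1/sigma_g^2 is 1/\tilde{s}_g^2, with
\tilde{s}_g^2 \;=\; \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}
```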
41
Posterior Statistics
Moderated t-statistics
Posterior variance estimators
Baldi & Long 2001, Wright & Simon 2003, Smyth 2004
42
Exact distribution for moderated t
An unexpected piece of mathematics shows that, under the null hypothesis,
$\tilde{t}_g \sim t_{d_0 + d_g}$
The degrees of freedom add!
The Bayes prior in effect adds d0 extra arrays forestimating the variance.
Wright and Simon 2003, Smyth 2004
43
More on empirical Bayes statistics
44
Hierarchical model for means
Data
Prior
Lönnstedt and Speed 2002, Smyth 2004
45
Posterior Odds
Posterior odds of differential expression
Lönnstedt and Speed 2002, Smyth 2004
Monotonic function of $\tilde{t}$
Hence $\tilde{t}$ gives the best possible ranking of genes
46
Estimating Hyper-Parameters
Closed form estimators with good properties are available:
for $c_0$ in terms of quantiles of the $|\tilde{t}_g|$
for $s_0$ and $d_0$ in terms of the first two moments of $\log s^2$
47
Marginal Distributions
Under usual likelihood model, $s_g$ is independent of the estimated coefficients.
Under the hierarchical model, $s_g$ is independent of the moderated t-statistics instead:
$$\tilde{t}_g \sim \begin{cases} t_{d_0+d} & \text{with prob } 1-p \\ \sqrt{1 + c_0/c}\; t_{d_0+d} & \text{with prob } p \end{cases}$$
48
Moment estimators for s0 and d0
Marginal moments of $\log s^2$ lead to estimators of $s_0$ and $d_0$:
Estimate d0 by solving
where
Finally
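A sketch of the moment estimators, following the development in Smyth (2004) as best understood here. The symbols $\psi$ and $\psi'$ denote the digamma and trigamma functions, $n$ is the number of genes, and $e_g$ and $\bar{e}$ are notation assumed for this sketch.

```latex
% Centred log-variances (where)
e_g = \log s_g^2 - \psi(d_g/2) + \log(d_g/2), \qquad
\bar{e} = \frac{1}{n}\sum_{g=1}^{n} e_g

% Estimate d_0 by solving
\psi'(d_0/2) = \frac{1}{n}\sum_{g=1}^{n}
  \left\{ (e_g - \bar{e})^2 \, \frac{n}{n-1} - \psi'(d_g/2) \right\}

% Finally
s_0^2 = \exp\!\left\{ \bar{e} + \psi(d_0/2) - \log(d_0/2) \right\}
```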
49
Quantile Estimation of c0
Let $r$ be the rank of $|\tilde{t}_g|$ in descending order, and let F(;) be the distribution function of the t-distribution. Can estimate $c_0$ by equating empirical to theoretical quantiles:
Get overall estimator of $c_0$ by averaging the individual estimators from the top $p/2$ proportion of the $|\tilde{t}_g|$
50
Short note on multiple testing
51
Multiple testing and adjusted p-values
o Traditional method in statistics is to control family-wise error rate, e.g., by Bonferroni.
o Holm’s method is an improved (step-down) modification of Bonferroni.
o Controlling the false discovery rate (FDR) is more appropriate in microarray studies
o Benjamini and Hochberg method controls expected FDR for independent or weakly dependent test statistics. Simulation studies support use for microarray data.
o All methods can be implemented in terms of adjusted p-values.
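The three adjustments named above can be sketched in a few lines. This is an illustrative pure-Python version with hypothetical function names and made-up p-values, not the implementation used in the practicals.

```python
def bonferroni(p):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p)
    return [min(1.0, m * x) for x in p]

def holm(p):
    """Step-down Bonferroni: the k-th smallest p-value is multiplied by (m - k + 1)."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adj = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p[i])   # enforce monotone adjusted values
        adj[i] = min(1.0, running)
    return adj

def benjamini_hochberg(p):
    """FDR adjustment: the k-th smallest p-value is multiplied by m / k."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)
    adj = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k                                # rank of p[i] in ascending order
        running = min(running, p[i] * m / rank)     # enforce monotone adjusted values
        adj[i] = running
    return adj

pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(holm(pvals))
print(benjamini_hochberg(pvals))
```

A gene is then called significant when its adjusted p-value falls below the chosen error rate.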
52
End of morning theory - Summary
o Background correction, normalization are important considerations -- normexp, offsets
o Moderation almost always helps -- moderation of variances is very effective
o Convenient model gives known null distribution
o Multiple testing
53
Linear models
54
More complex experiments
o More complex microarray experiments can be represented by linear models
o For one-channel platforms, the linear model can be set up using the usual univariate linear model formulae
o For two-colour platforms, the linear models have some special properties
55
Linear Models
o In general, need to specify:
  - Dependent variable
  - Explanatory variables (experimental design, covariates, etc.)
o More generally: E[vector of observed data] = design matrix × vector of parameters to estimate
56
Linear Models for microarrays
o Analyse all arrays together combining information in optimal way
o Combined estimation of precision
o Extensible to arbitrarily complicated experiments
o Design matrix: specifies RNA targets used on arrays
o Contrast matrix: specifies which comparisons are of interest
57
Design → Linear models
Wild-type mouse x 2 Mutant mouse x 2
β1 = wt log-expression
β2 = mutant − wt
Design matrices for 1-colour arrays are easier to specify!
E[y1]=E[y2]=β1 E[y3]=E[y4]= β1+ β2
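As a sketch of what this design implies, the least-squares fit for a single gene can be written out directly. The function name `fit_two_group` and the log-expression values are hypothetical; the design matrix encodes E[y1] = E[y2] = β1 and E[y3] = E[y4] = β1 + β2.

```python
def fit_two_group(y):
    """Least-squares fit of E[y] = X beta for the 2 wild-type + 2 mutant design."""
    X = [[1, 0], [1, 0], [1, 1], [1, 1]]   # columns: wt level, mutant - wt
    # Normal equations (X'X) beta = X'y, solved explicitly for two columns
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    c = sum(r[1] * r[1] for r in X)
    p = sum(r[0] * yi for r, yi in zip(X, y))
    q = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * c - b * b
    beta1 = (c * p - b * q) / det
    beta2 = (a * q - b * p) / det
    return beta1, beta2

# Hypothetical log-expression values: two wild-type arrays, then two mutant arrays
beta1, beta2 = fit_two_group([7.0, 7.2, 8.1, 8.3])
print(beta1, beta2)   # wild-type mean, and mutant mean minus wild-type mean
```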
58
Designs → Linear Models
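$y = \log_2(R/G)$

A and B compared directly on two arrays:
$$E\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta, \qquad \beta = B - A$$

A and B with a reference (Ref), three arrays:
$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \beta_1 = A - \mathrm{Ref}, \quad \beta_2 = B - A$$

A, B and C, three arrays:
$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \beta_1 = B - A, \quad \beta_2 = C - A$$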
59
Matrix Multiplication
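A and B compared directly on two arrays ($\beta = B - A$):
$$E\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta = \begin{pmatrix} \beta \\ -\beta \end{pmatrix}$$

A and B with a reference ($\beta_1 = A - \mathrm{Ref}$, $\beta_2 = B - A$):
$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 \\ \beta_1 + \beta_2 \end{pmatrix}$$

A, B and C ($\beta_1 = B - A$, $\beta_2 = C - A$):
$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 + \beta_2 \\ -\beta_2 \end{pmatrix}$$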
Contrast:
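As a hedged illustration of a contrast, consider a three-array design on A, B and C in which the arrays measure B vs A, C vs B and A vs C, with coefficients β1 = B − A and β2 = C − A (this array layout is an assumption made for the sketch). The comparison C − B is not itself a coefficient, but it is the linear combination β2 − β1. The function name and the coefficient values below are hypothetical.

```python
def expected_log_ratios(design, beta):
    """Multiply a design matrix (list of rows) by a coefficient vector."""
    return [sum(x * b for x, b in zip(row, beta)) for row in design]

# Assumed loop design: rows are arrays B vs A, C vs B, A vs C
loop_design = [[1, 0], [-1, 1], [0, -1]]
beta = [1.5, 0.5]                 # hypothetical values of (B - A, C - A)
print(expected_log_ratios(loop_design, beta))   # [1.5, -1.0, -0.5]

# Contrast C - B = beta2 - beta1, written as a vector of weights on the coefficients
contrast = [-1, 1]
c_minus_b = sum(c * b for c, b in zip(contrast, beta))
print(c_minus_b)                  # -1.0, matching the second array
```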
60
Linear Model Estimates
Obtain a linear model for each gene g
Estimate models to get
coefficients
standard deviations
standard errors
61
Contrasts
A contrast is any linear combination of the coefficients αj which we want to test equal to zero.
Define contrasts
Want to test
vs
where C is the contrast matrix.
62
Pax5: example of saturated design
[Figure: design with four RNA sources (Wt, Pax5-/-, Rag1-/-, IL-7 removed) compared on arrays numbered 1 to 12]
63
Regression Analysis
Choose 3 comparisons between the 4 RNA sources to be the coefficients of the linear model, e.g.,
- PW: Pax5-/- vs Wt
- RW: Rag1-/- vs Wt
- IW: IL-7 withdrawn vs Wt
For each gene, fit a linear model with a coefficient for each contrast
Any other comparisons of interest can be extracted from the linear model as contrasts
64
$$E\begin{pmatrix} m_1 \\ \vdots \\ m_{12} \end{pmatrix} = X \begin{pmatrix} PW \\ RW \\ IW \end{pmatrix}$$
where $X$ is the 12 × 3 design matrix with one row per array.
Exercise: Fill in the design matrix.
65
Example of factorial design
Expected log-expression for each genotype (WT, MT) and time point (P1, P11, P21):
WT.P1: µ
WT.P11: µ + a1
WT.P21: µ + a1 + a2
MT.P1: µ + b
MT.P11: µ + a1 + b + a1.b
MT.P21: µ + (a1+a2) + b + (a1+a2)b
[Matrix equation: seven arrays y1, …, y7 expressed through a 7 × 5 design matrix in the parameters a1, a2, b and the two interaction terms]
66
Moderated F-statistics
67
Moderated F-Statistic
Moderated F-statistic
MST=Mean Sum of squares between Treatments
The idea of shrinking the variance extends immediately to multiple contrasts
Wright & Simon 2003, Smyth 2004
68
Doubly shrunk F statistic
The moderated F is not a monotonic function of the posterior odds
A doubly shrunk F statistic can be shown to have the desired relationship to the posterior odds
Tai and Speed 2006, 2007
Improves further the gene ranking
69
Single or double shrinkage
o Shrinking the variances only is enough when comparing two groups
o When comparing 3 or more groups, further gain can be had by shrinking the β also (recall Stein estimator needs at least 3 means)
70
Functional category analysis
71
Functional category analysis
o Used on a set of genes deemed to be differentially expressed
o Asks the question: is my set of genes enriched for a particular molecular function?
o Useful for establishing what pathways / types of genes are affected
o Nowadays largely superseded by gene set tests
72
Overlap statistics
o Question: Say you have a set of 85 genes (of a total 20000 genes) known to be associated with function X. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.
o Answer: ?
73
Overlap statistics
o Question: Say you have a set of 85 genes (of a total 20000 genes) known to be associated with some function. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.
o Answer: Hypergeometric (i.e. the “urn” problem).
N = 19915 “white”, m = 85 black, n = 100, k = 40
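The urn calculation can be sketched with exact integer arithmetic. The function name `overlap_tail` is hypothetical; the numbers are those on the slide.

```python
from math import comb

def overlap_tail(total, in_set, drawn, k):
    """P(X >= k), where X counts set members among `drawn` genes picked without replacement."""
    upper = min(in_set, drawn)
    favourable = sum(comb(in_set, i) * comb(total - in_set, drawn - i)
                     for i in range(k, upper + 1))
    return favourable / comb(total, drawn)

# 85 set genes among 20000, 100 DE genes, 40 or more in the overlap
p = overlap_tail(20000, 85, 100, 40)
print(p)   # vanishingly small: only 100 * 85 / 20000 = 0.425 set genes are expected by chance
```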
74
Gene set tests
75
Gene sets
o Test significance of an (a priori specified) group of genes
o The genes might belong to a known pathway or might be the top genes from a related experiment
o The set might be significant even if individual genes are not
o Gene set enrichment analysis (GSEA) originated by Mootha et al PNAS 2003 and Subramanian et al PNAS 2005
76
Available gene set methods
o GSEA: gene set enrichment analysis. Complex method using Kolmogorov-Smirnov type tests and sample permutation. Needs two-groups, many arrays, many genes and many sets.
o GSA: gene set analysis. Uses combination of permutation of samples and standardization across genes. More powerful. Still needs two-groups, many genes and many sets.
o GST: gene set tests using Wilcoxon test. Use randomization over genes. Applicable to linear models and small samples, but can be over-optimistic if the genes in the set are highly correlated.
o Now, rotation-based gene set tests.
77
Gene set tests
All microarray probes, ranked by a test statistic of interest: t1, t2, t3, t4, …
A priori subset of genes: X1, X2, X3 … Xn
Look for ranks for set genes amongst test statistics
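One simple way to look at the ranks of the set genes is a Wilcoxon rank-sum statistic, sketched below with a normal approximation and no tie handling. The function name `rank_sum_z` and the test statistics are hypothetical, and this is a sketch of the idea rather than the procedure used in the practicals.

```python
import math

def rank_sum_z(stats, in_set):
    """Normal-approximation z-score of the Wilcoxon rank-sum for genes in the set."""
    n = len(stats)
    order = sorted(range(n), key=lambda i: stats[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):   # ties are ignored in this sketch
        ranks[i] = r
    n1 = sum(in_set)                          # number of genes in the set
    w = sum(r for r, s in zip(ranks, in_set) if s)
    mean = n1 * (n + 1) / 2                   # expected rank sum for a random set
    var = n1 * (n - n1) * (n + 1) / 12
    return (w - mean) / math.sqrt(var)

# Hypothetical statistics for 8 probes; three of them belong to the set
stats = [0.1, 2.5, -0.3, 3.1, 0.4, 2.9, -1.0, 0.2]
in_set = [False, True, False, True, False, True, False, False]
z = rank_sum_z(stats, in_set)
print(z)   # positive: set genes sit at the top of the ranking
```

Because the reference distribution comes from randomizing over genes, this kind of test treats genes as independent, which is the source of the over-optimism noted for correlated sets.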
78
Viewing gene sets
Cell adhesion genes
Genes regulated by MYB
79
What’s the hypothesis?
o Two major types of gene set tests: competitive or self-contained
o Competitive: Genes in the set tend to be more strongly DE than randomly chosen genes
o Self-contained: At least some genes in the set are truly DE
Goeman & Bühlmann, Bioinformatics 2007
80
Permutation
o Competitive gene set tests are usually tested by permuting genes, but this ignores inter-gene correlations
o Self-contained gene set tests are usually tested by permuting arrays, but this is limited to two-group comparisons with large numbers
81
Rotation gene set tests (ROAST)
o Self-contained hypothesis
o The first test suitable for small samples which correctly accounts for inter-gene correlation
o Can handle complex linear model designs, including array weights, random effects etc
82
Two steps: projection and rotation
o Project data onto space orthogonal to nuisance parameters in the linear model
o Random rotation of the orthogonal residuals provides fractional permutation, avoids granularity of p-values
o Assumes multivariate normality, but proves to be highly robust against deviations from normality
83
Set summary statistics
o Compute empirical Bayes t-statistics for each gene in set
o Convert to z-statistics
o Mean of z²:
  – good power, even when only a subset of genes respond
  – not robust against non-normality
o Mean50: mean of top half of |z|:
  – good power
  – robust
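The two set summaries can be sketched as follows, starting from z-statistics that are already available. The function names and the example values are hypothetical, and rounding the "top half" up for odd-sized sets is an assumption. Significance would then come from recomputing the summary on rotated data, which is not shown here.

```python
def mean_z2(z):
    """Mean of squared z-statistics over the genes in the set."""
    return sum(v * v for v in z) / len(z)

def mean50(z):
    """Mean of the top half of |z| (the half is rounded up, an assumption)."""
    top = sorted((abs(v) for v in z), reverse=True)
    half = (len(top) + 1) // 2
    return sum(top[:half]) / half

z_set = [2.0, -1.0, 0.5, -3.0]   # hypothetical z-statistics for one gene set
print(mean_z2(z_set), mean50(z_set))
```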
84
References
o Subramanian et al (2005). A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550.
o Efron and Tibshirani (2007). On testing the significance of sets of genes. Ann Appl Stat 1, 107-129. http://www-stat.stanford.edu/~tibs/GSA/
85
Multi-level models I: duplicate spots
86
Hard shrinking: examples
o Common correlation model for within-array duplicate spots
  Smyth et al (2005). The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21, 2067-2075.
o Common correlation models for single channel analysis of two-color microarray data
  Smyth, G. K. (2005). Individual channel analysis of two-colour microarray data. 55th Session of the International Statistical Institute, 5-12 April 2005, Sydney, Australia.
87
Common correlation model
Given a blocking factor with variance component $\sigma_b^2$, focus on within-block correlations
Common correlation model assumes the within-block correlation is the same for every gene
Has proved effective for technical blocking factors for which correlations are high
88
Duplicate Spots
o If the clone library is not too large, it is often possible to print each gene more than once on an array
o Duplicates are always side-by-side or a fixed distance apart
89
Genes printed in duplicate pairs
One pair
90
Duplicate spots are correlated
o Duplicate spots are a form of technical replication, share lots of common causes
o Cannot be treated as replicates on separate arrays, log-ratios from duplicate spots are correlated
o How best to use duplicate spots? Usual approach is simply to average them
91
Common Correlation Model
o Assume the between-duplicate correlation is the same for every gene
o Justified by the belief that the correlation springs mainly from spatial proximity
o Improves estimation of variances
92
Consequences for individual genes
o If the number of genes is large, then the estimator of ρ is very accurate, so ρ may be treated as known as far as inference for each individual gene is concerned
o This doesn’t change estimation of μg but greatly changes estimation of σg
93
Validation with Spike-In Data
o Does the idea of using common correlations work in practice?
o Check the ability of the common-correlation t-statistic to distinguish calibration from ratio spike-in spots
o Scorecard system includes calibration controls, 3-fold up and down ratio controls, and 10-fold up and down ratio controls
94
95 96
Multilevel models II: Separate channel analysis of two-colour data
97
Separate channel analysis of two-colour microarray data
G1 R1 G2 R2 G3 R3 G4 R4
Each spot is a block of size 2
The two channels give a correlated pair of values
Common correlation model sets intra-spot correlation equal across genes
98
Why Use Means and Differences?
If $\mathrm{var}(R) = \mathrm{var}(G)$ then $\mathrm{cor}(M, A) = 0$ regardless of the correlation between R and G.
An old idea dating back to Tukey, Altman & Bland
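The claim follows from a one-line covariance calculation, sketched here with $M = R - G$ and $A = (R + G)/2$ as defined earlier in the slides:

```latex
\mathrm{cov}(M, A)
  = \mathrm{cov}\!\left(R - G,\ \tfrac{1}{2}(R + G)\right)
  = \tfrac{1}{2}\left\{\mathrm{var}(R) + \mathrm{cov}(R, G) - \mathrm{cov}(G, R) - \mathrm{var}(G)\right\}
  = \tfrac{1}{2}\left\{\mathrm{var}(R) - \mathrm{var}(G)\right\}
```

The cross terms cancel whatever the correlation between R and G, so the covariance is zero exactly when the two channel variances are equal.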
99
Common Reference Experiment
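[Figure: four arrays, two hybridizing B against Ref and two hybridizing C against Ref]
Arrays with B: $\mu_M = B - \mathrm{Ref}$, $\mu_A = (B + \mathrm{Ref})/2$
Arrays with C: $\mu_M = C - \mathrm{Ref}$, $\mu_A = (C + \mathrm{Ref})/2$
Why not use the A-values as well as M-values?
100
A simple normal model
Gene g, array i
$\rho_g$ is the intra-spot correlation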
101
Models in terms of M and A
102
M and A parameters
$\mu_{Mgi} = \mu_{Rgi} - \mu_{Ggi}$
$\mu_{Agi} = (\mu_{Rgi} + \mu_{Ggi})/2$
$\sigma^2_{Mgi} = 2\sigma_g^2(1 - \rho_g)$
$\sigma^2_{Agi} = \sigma_g^2(1 + \rho_g)/2$
103
Correlation
M and A are independent but have different variances.
Have converted a correlated problem into a heteroscedastic problem
Can estimate correlation by estimating variances:
$\rho_g = \tanh\left(\tfrac{1}{2}\log\dfrac{4\sigma^2_{Ag}}{\sigma^2_{Mg}}\right)$
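A quick numerical check of this relationship: with var(M) = 2σ²(1 − ρ) and var(A) = σ²(1 + ρ)/2, the ratio 4·var(A)/var(M) equals (1 + ρ)/(1 − ρ), and half its logarithm is the inverse hyperbolic tangent of ρ. The function name and the values of σ² and ρ below are illustrative assumptions.

```python
import math

def intraspot_correlation(var_a, var_m):
    """rho = tanh(0.5 * log(4 * var_A / var_M))."""
    return math.tanh(0.5 * math.log(4 * var_a / var_m))

# Round trip with hypothetical sigma^2 = 1 and rho = 0.85
sigma2, rho = 1.0, 0.85
var_m = 2 * sigma2 * (1 - rho)      # variance of M
var_a = sigma2 * (1 + rho) / 2      # variance of A
rho_back = intraspot_correlation(var_a, var_m)
print(rho_back)                      # recovers rho up to rounding error
```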
104
Common correlation model
o Assume the intra-spot correlation is constant across genes
o Justified by (i) variance components are observed to be positively correlated and (ii) standard errors for coefficients are not sensitive to correlation value
o Common correlation can then be assumed to be known at individual gene level
o Converts a mixed model into a weighted regression
lmscFit() in limma
105
ApoAI Experiment
8 wild type arrays and 8 ApoAI-/- arrays, all relative to a common reference
Median intraspot correlation is estimated as $\hat{\rho} = 0.85$
The efficiency gain from using A-values is $(1 - \hat{\rho})/(1 + \hat{\rho}) = 0.08$
106
Individual Channel Normalization
Using A-values in the analysis requires that they be normalized to have comparable values between arrays: “single channel normalization”
For the ApoAI data:
Within-array loess normalization only: $\hat{\rho} = 0.85$
A-quantile normalization: $\hat{\rho} = 0.89$
Quantile normalization: $\hat{\rho} = 0.84$
107
Ignoring the Reference
Why not ignore the common reference channel? If we use only the red channel in the common reference experiment,
$\mathrm{var}(R_C - R_B) = \sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$
so
$\dfrac{\mathrm{var}(R_C - R_B)}{\mathrm{var}(M_C - M_B)} = \dfrac{1}{2(1 - \rho)}$
so adjusting for a common reference is worthwhile whenever ρ > 0.5
108
Disconnected Design
B C
D E
A-values make no contribution to estimating B vs C comparison (direct)
M-values make no contribution to estimating B vs D comparison (indirect)
Relative efficiency of indirect comparison compared to direct is $1 - \rho$
109
Observed false discovery rates
[Figure: Known False Positives plotted against Tests Rejected for three analyses: Mixed model, Ignore Cor, SpotCor]
Data from Holloway et al (2006) BMC Bioinformatics 7, Article 511.
110
Why common correlation is effective
o Bias introduced into the variance estimation seems to be offset by increase in precision
o Great simplification of mathematical model
o Penalizes genes with large within-block variances
111
A Shrinkage Hierarchy
o Fold changes – shrinkage may not be required (unless more than 2 groups)
o Genewise variances – soft smoothing gives spectacular improvement
o Technical replicate correlations – hard smoothing has proved successful
As we move from parameters of interest to higher order nuisance parameters, bias decreases in importance relative to noise
112
Other bits and pieces
113
Some other common variations
o Technical replicates
o Paired samples
o Array weights
114
Summary
o Borrowing strength is essential in small-scale microarray experiments
o Information can be shared across genes or across arrays
o Parameters may be set common between genes (correlations) or shrunk in a graduated way (standard errors)
o Power can be increased by testing hypotheses for sets of genes
115
Acknowledgements
WEHI Bioinformatics
o Gordon Smyth
o Matt Ritchie
o Alicia Oshlack
o Terry Speed
Garvan Epigenetics
o Susan Clark
o Aaron Statham
o Marcel Coolen
116
Computer Laboratories
117
Proposed schedule
a.m.
o ModerationDemo.R
o LimmaObjects.R
o BGNormalization.R
o SimpleExperiments.R
p.m.
o FactorialDesign.R
o GeneSetAnalysis.R
o TimeCourse.R
118
Getting started
o Grab the files from the ‘KU-August2009-LIMMA’ directory (or archive) and copy/move to a convenient location on your computer
o Set the variable ‘rootDir’ to that directory. For example:
rootDir <- "~/Desktop/KU-August2009-LIMMA/data"
o Make rootDir the working directory of your R session
119
limma package documentation
o Function help pages
  ?lmFit, ?eBayes
o Class help pages
  ?"RGList-class"
  ?"MArrayLM-class"
o Group help pages
  help("06.LinearModels")
o User’s Guide
  limmaUsersGuide()
o The R html help system is a good top view
120
Moderation Demo
o Illustration of sampling from the model
o Reduction in false discoveries
o Empirical Bayes differential expression
121
Limma Objects
o RGList
o MAList
o MArrayLM
122
BG correction / normalization demo
o Various procedures for BG correction, normalization
o Jurkat data: same vs. same
123
Simple Experiment 1: Integrin beta7+ vs beta7–
beta7- beta7+
o Reading two-color data
o Control spots
o Background correction
o Dye-swaps
o Empirical Bayes differential expression
124
Simple Experiment 2: Cancer cellsversus normal cells
Normal cell line
Cancer cell line
o Reading Affymetrix data
o Simple design versus contrasts
125
Factorial Experiment: Estrogen
[Design: Estrogen (Absent, Present) × time (10 hours, 48 hours)]
o Reading Affymetrix data
o Factorial designs, contrasts
126
Gene set analysis
o Convenient “functional category analysis”
o Newer, flexible set-based testing
127
Targets of SAHA and depsipeptide: A time course experiment
o Study effects of SAHA and depsipeptide on the acute T-cell leukemia cell line CEM
o SAHA and depsipeptide are structurally different but have similar biological effects (induce death through intrinsic apoptotic pathway)
o Prising out subtle differences is of great interest
128
SAHA/depsipeptide: Experimental design
SAHA Vehicle only depsipeptide
0hr 0hr 0hr
1hr 1hr 1hr
2hr 2hr 2hr
4hr 4hr 4hr
8hr 8hr 8hr
16hr 16hr 16hr
129
Aims of experiment
o Identify common responders: genes which respond similarly to SAHA and depsipeptide
o Identify specific responders: genes which respond to one of SAHA or depsipeptide, but not to the other
o Different responders, genes which respond to both SAHA and depsipeptide but differently, are of lesser interest
130
Linear model analysis
o Fit genewise linear models to all the arrays simultaneously
o Include effects for drug x time
o Allow for probe-specific dye-effects
o Treat each time series of 6 arrays as a randomized block, i.e., allow arrays hybridized together to be correlated