1
Linear models and Limma
Københavns Universitet, 19 August 2009
Mark D. Robinson
Bioinformatics, Walter+Eliza Hall Institute; Epigenetics Laboratory, Garvan Institute
(with many slides taken from Gordon Smyth)
2
Limma = linear models for microarray data
o Morning Theory
  - Introduction
  - Background correction
  - Moderated t-tests
  - Simple linear models
o Morning Practical
  - Demonstration of smoothing
  - Limma objects (beta7)
  - Background correction and normalization (beta7)
  - Simple experimental designs
  - 2-colour example (beta7)
  - Affymetrix example (cancer)
o Afternoon Theory
  - More advanced designs / linear modeling
  - Moderated F-tests
  - Gene set tests
  - Other analyses limma can do
o Afternoon Practical
  - Factorial design (estrogen)
  - Gene set testing (cancer)
  - Time course experiment (SAHA/depsipeptide)
3
Expression measures
(g = probe or gene, a = array)
Two-colour:  yga = log2(R/G)
Affymetrix:  yga = log-intensity (summarized over probes)
Illumina:    yga = log-intensity (summarized over beads)
4
Questions of Interest
o What genes have changed in expression? (e.g. between disease/normal, affected by treatment) Gene discovery, differential expression
o Is a specified group of genes all up-regulated in a particular condition? Gene set differential expression
o Can the expression profile predict outcome? Class prediction, classification
o Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways? Class discovery, clustering
Today will cover the first two questions: differential expression
5
Microarray Differential Expression Studies
o 10^3 – 10^6 genes/probes/exons on a chip or glass slide
o Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (for 2-colour data)
o Several steps to go from raw data to a table of "expression": background correction, normalization
o Idea: fit a linear model to the expression data for each gene
6
Two-colour microarrays
http://en.wikipedia.org/wiki/DNA_microarray
7
Two-colour data: Log-Intensities
For each probe:
R* = log2(Rf − Rb)
G* = log2(Gf − Gb)
Various ways to calculate background.
Will often modify to ensure Rf − Rb > 0 and Gf − Gb > 0.
8
Two-colour data: Means and Differences
For each probe:
M = R − G   ("minus")
A = (R + G)/2   ("add")
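The M and A summaries above are simple arithmetic on the log-intensities. A minimal Python sketch (limma itself is an R package; `ma_values` is a hypothetical helper, not a limma function):

```python
import math

def ma_values(R, G):
    """Convert red/green intensities for one probe into (M, A).

    M = log2(R) - log2(G) is the log-ratio ("minus"),
    A = (log2(R) + log2(G)) / 2 is the average log-intensity ("add").
    """
    r, g = math.log2(R), math.log2(G)
    return r - g, (r + g) / 2

# A probe four times brighter in the red channel:
M, A = ma_values(800.0, 200.0)  # M = 2.0 (a 4-fold change)
```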
9
Data Summaries
For each gene, the four arrays (G1, R1), (G2, R2), (G3, R3), (G4, R4) are summarized as the pairs (M1, A1), (M2, A2), (M3, A3), (M4, A4)
10
MA Plot
11
Log-Ratios or Single Channel Intensities?
o Traditional analysis treats the log-ratios M = log(R/G) as the primary data, i.e., gene expression measurements are relative
o An alternative approach treats the individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr)
o The single channel approach makes new analyses possible, but makes stronger assumptions and requires more complex models (mixed models in place of ordinary linear models) to accommodate the correlation between R and G on the same spot
12
BG correction affects DE results
o The importance of careful pre-processing and quality control cannot be over-emphasized for microarray data
o Can have a dramatic effect on differential expression results
o Consider here the normexp method of adaptive background correction
  - background correction step of the RMA algorithm
  - can also be applied to two-colour data
13
Additive + multiplicative error model
Observed intensity for one probe on one array:
Intensity = background + signal,  I = B + S
with additive errors in the background and multiplicative errors in the signal.
This idea underlies the variance stabilizing transformations vsn (two-colour data) and vst (for Illumina data)
14
normexp convolution model
Intensity = Background + Signal
I = B + S,  with B ~ N(μ, σ²) and S ~ Exponential(α)
15
Conditional expectation under normexp model
[Closed-form expression for E(Signal | Intensity) shown on slide]
16
normexp background correction
o Estimate the three parameters (μ, σ, α)
o Replace I with E(S | I)
o For Affymetrix data, I is the "Perfect Match" intensity
o For two-colour data, I = Rf − Rb or I = Gf − Gb
o In the RMA algorithm, parameter estimation uses an ad hoc density kernel method
o In limma (two colour), parameter estimation maximises the saddlepoint approximation to the likelihood
17
PM data on log2 scale: raw and fitted model
18
Background corrected intensity = E(Signal | Observed Intensity)
[Plot: E(Signal | Intensity) against observed intensity]
Adaptive background correction produces positive signal
19
Offsets to stabilise the variance
An offset reduces variability at low intensities
[Plots: log-ratios vs intensity after background correction, with and without an offset]
20
Why do offsets stabilize the variance?
21
Why do offsets stabilize the variance?
o log2(80/10) = log2(8/1) = 3 (8-fold)
o Additive noise affects small numbers more
o Offsets introduce bias: log2[(80+10)/(10+10)] = log2(4.5) = 2.17
o But the tradeoff (drop in variance for increase in bias) is usually worth it
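A small simulation illustrates the point: with purely additive noise at low intensity, adding an offset before taking log-ratios shrinks the spread of the log-ratios. All numbers here are illustrative:

```python
import math
import random
import statistics

random.seed(1)

# True 1:1 ratio; additive noise dominates at this low intensity.
noisy = [(50.0 + random.gauss(0, 10), 50.0 + random.gauss(0, 10))
         for _ in range(2000)]
# Keep pairs where both channels stay positive so the logs are defined.
pairs = [(r, g) for r, g in noisy if r > 0 and g > 0]

plain = [math.log2(r / g) for r, g in pairs]
offset = [math.log2((r + 50) / (g + 50)) for r, g in pairs]  # offset of 50

# The offset pulls the log-ratios towards 0, trading a little bias
# for a large reduction in variance at low intensities.
```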
22
A self-self experiment: two background methods
[MA plots; 553 spots not plotted]
First statistical point: choose background correction to stabilise the variance as a function of intensity
23
Comparison of 2-colour BG correction methods
[Plot: false discoveries (limma) vs genes selected; Ritchie et al. 2007]
24
References
o Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddlepoint approximation]
o Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows "normexp" performs best for 2-colour data]
o Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [describes RMA BG correction, but doesn't give much detail of the normexp convolution model]
25
Normalization
26
Normalization
Two-colour
BG correction and normalization are closely connected.
Even after BG correction, some effects remain.
27
Normalization
One-colour
Similarly for single channel data, adjustments need to be made for all samples to be comparable.
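One common way to make single-channel samples comparable is quantile normalization, which forces all samples onto an identical distribution. A minimal Python sketch (ties are handled naively here; real implementations average tied ranks):

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns.

    Each sample's sorted values are replaced by the mean of the sorted
    values across samples, so every sample ends up with the same
    distribution; only the within-sample ranking is preserved.
    """
    n = len(columns[0])
    # Reference distribution: mean of each rank across samples.
    ref = [sum(sorted(col)[i] for col in columns) / len(columns)
           for i in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        new = [0.0] * n
        for rank, i in enumerate(order):
            new[i] = ref[rank]
        out.append(new)
    return out
```

After normalization the samples share the same set of values, so any between-sample distributional differences are removed.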
28
Moderated t-tests
29
Borrowing information across genes
o Small data sets: few arrays, so inference for individual genes is uncertain
o Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power
o Benefit of parallelism: the same model is fitted for every gene, so information can be borrowed from one gene to another
30
Hard and soft shrinkage
o Hard: the simplest way to borrow information is to assume that one or more parameters are constant across genes
o Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage …
31
A very common experiment
Wild-type mouse × 2, Mutant mouse × 2
Which genes are differentially expressed?
n1 = n2 = 2 Affymetrix arrays, 25,000 probe-sets
[Plot: expression of Gene X on the four arrays]
32
Ordinary t-tests give very high false discovery rates
Residual df = 2
33
Another very common experiment
Wild-type mouse 1 vs Mutant mouse 1; Wild-type mouse 2 vs Mutant mouse 2
Which genes are differentially expressed?
n = 2 two-colour arrays, 30,000 probes
34
Ordinary t-tests give very high false discovery rates
Residual df = 1
35
Small sample size, many tests
The problem:
o These experiments would be under-powered even with just one gene
o Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power
The solution:
The same statistical model is being fitted for every gene in parallel. Can borrow strength from other genes.
36
t-tests with common variance across genes
t-statistic with the residual standard deviation pooled across all genes
More stable, but ignores gene-specific variability
37
A better compromise: moderated t-statistics
Shrink standard deviations towards a common value:
s̃g² = (d0·s0² + dg·sg²) / (d0 + dg)
t̃g = β̂g / (s̃g·√vg)
where dg = residual degrees of freedom and d0 = prior degrees of freedom
38
38
Gs%
0s
1s%
1s
2s%
2s
L
Gs
,pooledgt
gt%
gt
d
0d
Shrinkage of standard deviations
L
The data decides whethergt% should be closer to
,pooledgt or to
gt
39
Why does it work?
o We learn the typical variability level by looking at all genes, but allow some flexibility from this for individual genes
o Adaptive
40
Hierarchical model for variances
Data:      sg² | σg² ~ (σg²/dg)·χ²(dg)
Prior:     1/σg² ~ 1/(d0·s0²)·χ²(d0)
Posterior: s̃g² = E(σg² | sg²) = (d0·s0² + dg·sg²)/(d0 + dg)
41
Posterior Statistics
Posterior variance estimators: s̃g² = (d0·s0² + dg·sg²)/(d0 + dg)
Moderated t-statistics: t̃g = β̂g / (s̃g·√vg)
Baldi & Long 2001, Wright & Simon 2003, Smyth 2004
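Once d0 and s0² are in hand, the posterior variance and moderated t are a one-line computation per gene. A Python sketch with hand-picked hyper-parameters (limma estimates d0 and s0² empirically from all genes):

```python
import math

def moderated_t(beta, s2, df, v, d0, s02):
    """Moderated t-statistic via the posterior variance formula.

    beta: estimated log-fold-change; s2: genewise residual variance on
    df degrees of freedom; v: unscaled variance of beta; (d0, s02):
    prior degrees of freedom and prior variance (supplied by hand here).
    """
    s2_tilde = (d0 * s02 + df * s2) / (d0 + df)  # shrunken variance
    return beta / math.sqrt(s2_tilde * v)

# A gene with a suspiciously tiny observed variance no longer gets an
# inflated t-statistic:
t_raw = 1.0 / math.sqrt(0.01 * 0.5)
t_mod = moderated_t(1.0, s2=0.01, df=2, v=0.5, d0=4, s02=0.05)
```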
42
Exact distribution for moderated t
An unexpected piece of mathematics shows that, under the null hypothesis,
t̃g ~ t(d0 + dg)
The degrees of freedom add!
The Bayes prior in effect adds d0 extra arrays for estimating the variance.
Wright and Simon 2003, Smyth 2004
43
More on empirical Bayes statistics
44
Hierarchical model for means
Data:  β̂g | βg, σg² ~ N(βg, vg·σg²)
Prior: βg = 0 with probability 1 − p;  βg ~ N(0, c0·vg·σg²) with probability p
Lönnstedt and Speed 2002, Smyth 2004
45
Posterior Odds
Posterior odds of differential expression
The posterior odds are a monotonic function of |t̃g|
Hence t̃g gives the best possible ranking of genes
Lönnstedt and Speed 2002, Smyth 2004
46
Estimating Hyper-Parameters
Closed form estimators with good properties are available:
- for c0 in terms of quantiles of the |t̃g|
- for s0 and d0 in terms of the first two moments of log sg²
47
Marginal Distributions
t̃g ~ t(d0 + dg)              with probability 1 − p
t̃g ~ √(1 + c0) · t(d0 + dg)  with probability p
Under the usual likelihood model, sg is independent of the estimated coefficients. Under the hierarchical model, sg is independent of the moderated t-statistics instead.
48
Moment estimators for s0 and d0
Marginal moments of log sg² lead to estimators of s0 and d0:
estimate d0 by solving a moment equation based on the variance of the log sg², then obtain s0 from their mean.
49
Quantile Estimation of c0
Let r be the rank of |t̃g| in descending order, and let F(·; k) be the distribution function of the t-distribution on k df. Can estimate c0 by equating empirical to theoretical quantiles.
Get the overall estimator of c0 by averaging the individual estimators from the top p/2 proportion of the |t̃g|
50
Short note on multiple testing
51
Multiple testing and adjusted p-values
o The traditional method in statistics is to control the family-wise error rate, e.g., by Bonferroni.
o Holm's method is an improved (step-down) modification of Bonferroni.
o Controlling the false discovery rate (FDR) is more appropriate in microarray studies.
o The Benjamini and Hochberg method controls the expected FDR for independent or weakly dependent test statistics. Simulation studies support its use for microarray data.
o All methods can be implemented in terms of adjusted p-values.
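The Benjamini-Hochberg adjusted p-values can be computed with the usual step-up recursion: scale each ordered p-value by n/rank, then enforce monotonicity from the largest p downwards. A Python sketch:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n - 1, -1, -1):  # walk from the largest p downwards
        i = order[rank]
        running_min = min(running_min, pvalues[i] * n / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# All four adjusted values come out ~0.04, so all four genes pass an
# FDR threshold of 0.05 even though only one passes Bonferroni:
adj = bh_adjust([0.01, 0.02, 0.03, 0.04])
```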
52
End of morning theory - Summary
o Background correction and normalization are important considerations -- normexp, offsets
o Moderation generally helps -- moderation of variances is very effective
o A convenient model gives a known null distribution
o Multiple testing
53
Linear models
54
More complex experiments
o More complex microarray experiments can be represented by linear models
o For one-channel platforms, the linear model can be set up using the usual univariate linear model formulae
o For two-colour platforms, the linear models have some special properties
55
Linear Models
o In general, need to specify:
  - the dependent variable
  - the explanatory variables (experimental design, covariates, etc.)
o More generally: E[y] = Xβ, where y is the vector of observed data, X is the design matrix, and β is the vector of parameters to estimate
56
Linear Models for microarrays
o Analyse all arrays together, combining information in an optimal way
o Combined estimation of precision
o Extensible to arbitrarily complicated experiments
o Design matrix: specifies the RNA targets used on the arrays
o Contrast matrix: specifies which comparisons are of interest
57
Design → Linear models
Wild-type mouse × 2, Mutant mouse × 2
β1 = wt log-expression
β2 = mutant − wt
E[y1] = E[y2] = β1;  E[y3] = E[y4] = β1 + β2
Design matrices for 1-colour arrays are easier to specify!
58
Designs → Linear Models
Each two-colour array measures y = log2(R/G), a contrast between its two RNA samples:
o Direct comparison A ↔ B with a dye swap: one parameter β = B − A; design matrix rows (1), (−1)
o Common reference design (A vs Ref twice, B vs Ref once): parameters β1 = A − Ref and β2 = B − A; design matrix rows (1 0), (1 0), (1 1)
o Loop design over three samples A, B, C: parameters β1 = B − A and β2 = C − A; design matrix rows (1 0), (−1 1), (0 −1)
59
Matrix Multiplication
Multiplying the design matrix by the parameter vector gives the expected log-ratio for each array:
o Direct comparison with dye swap: E[y1] = β, E[y2] = −β
o Common reference design: E[y1] = β1, E[y2] = β1, E[y3] = β1 + β2
o Loop design: E[y1] = β1, E[y2] = −β1 + β2, E[y3] = −β2
Contrast: comparisons such as B − A can be extracted from each fit as a linear combination of the coefficients.
60
Linear Model Estimates
Obtain a linear model for each gene g
Estimate the models to get:
- coefficients β̂g
- standard deviations sg
- standard errors of the coefficients
61
Contrasts
A contrast is any linear combination of the coefficients βj which we want to test equal to zero.
Define contrasts γ = Cᵀβ
Want to test γ = 0 vs γ ≠ 0, where C is the contrast matrix.
62
Pax5: example of saturated design
Four RNA sources: Wt, Pax5−/−, IL-7 removed, Rag1−/−
Twelve two-colour arrays (numbered 1–12) connect every pair of sources
Robust design – can tolerate failure of some of the arrays
63
Regression Analysis
Choose 3 comparisons between the 4 RNA sources to be the coefficients of the linear model, e.g.:
- PW: Pax5−/− vs Wt
- RW: Rag1−/− vs Wt
- IW: IL-7 withdrawn vs Wt
For each gene, fit a linear model with a coefficient for each contrast.
Any other comparisons of interest can be extracted from the linear model as contrasts.
64
E[y1 … y12] = X · (βPW, βRW, βIW)ᵀ, where each row of the 12×3 design matrix X has entries 0 and ±1 according to which two RNA sources the array compares. Exercise: fill in the design matrix.
Can be fitted using robust regression, but problems with se's, as Patty Solomon has observed
65
Example of factorial design
Seven two-colour arrays compare samples whose expected expressions follow a factorial model:
WT.P1:  μ
MT.P1:  μ + b
WT.P11: μ + a1
MT.P11: μ + a1 + b + a1·b
WT.P21: μ + a1 + a2
MT.P21: μ + (a1 + a2) + b + (a1 + a2)·b
Each array's log-ratio y1, …, y7 is the difference between the expected expressions of its two samples.
66
Moderated F-statistics
67
Moderated F-Statistic
Moderated F-statistic: F̃g = MST / s̃g², where MST = mean sum of squares between treatments
The idea of shrinking the variance extends immediately to multiple contrasts
Wright & Simon 2003, Smyth 2004
68
Doubly shrunk F statistic
The moderated F is not a monotonic function of the posterior odds
A doubly shrunk F statistic can be shown to have the desired relationship to the posterior odds
Improves the gene ranking further
Tai and Speed 2006, 2007
69
Single or double shrinkage
o Shrinking the variances only is enough when comparing two groups
o When comparing 3 or more groups, further gain can be had by shrinking the β as well (recall the Stein estimator needs at least 3 means)
70
Functional category analysis
71
Functional category analysis
o Used on a set of genes deemed to be differentially expressed
o Asks the question: is my set of genes enriched for a particular molecular function?
o Useful for establishing what pathways / types of genes are affected
o Nowadays largely superseded by gene set tests
72
Overlap statistics
o Question: Say you have a set of 85 genes (of a total of 20000 genes) known to be associated with function X. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.
o Answer: ?
73
Overlap statistics
o Question: Say you have a set of 85 genes (of a total of 20000 genes) known to be associated with some function. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.
o Answer: Hypergeometric (i.e. the "urn" problem), with N = 19915 "white", m = 85 "black", n = 100 drawn, k = 40.
74
Gene set tests
75
Gene sets
o Test the significance of an (a priori specified) group of genes
o The genes might belong to a known pathway or might be the top genes from a related experiment
o The set might be significant even if the individual genes are not
o Gene set enrichment analysis (GSEA) originated with Mootha et al PNAS 2003 and Subramanian et al PNAS 2005
76
Available gene set methods
o GSEA: gene set enrichment analysis. Complex method using Kolmogorov-Smirnov type tests and sample permutation. Needs two groups, many arrays, many genes and many sets.
o GSA: gene set analysis. Uses a combination of permutation of samples and standardization across genes. More powerful. Still needs two groups, many genes and many sets.
o GST: gene set tests using the Wilcoxon test. Uses randomization over genes. Applicable to linear models and small samples, but can be over-optimistic if the genes in the set are highly correlated.
o Now, rotation-based gene set tests.
77
Gene set tests
All microarray probes are ranked by a test statistic of interest: t1, t2, t3, t4, …
An a priori subset of genes X1, X2, X3, …, Xn is specified
Look at the ranks of the set's genes amongst all the test statistics
78
Viewing gene sets
Cell adhesion genes
Genes regulated by MYB
79
What's the hypothesis?
o Two major types of gene set tests: competitive or self-contained
o Competitive: genes in the set tend to be more strongly DE than randomly chosen genes
o Self-contained: at least some genes in the set are truly DE
Goeman & Bühlmann, Bioinformatics 2007
80
Permutation
o Competitive gene set tests are usually tested by permuting genes, but this ignores inter-gene correlations
o Self-contained gene set tests are usually tested by permuting arrays, but this is limited to two-group comparisons with large numbers of arrays
81
Rotation gene set tests (ROAST)
o Self-contained hypothesis
o The first test suitable for small samples which correctly accounts for inter-gene correlation
o Can handle complex linear model designs, including array weights, random effects, etc.
82
Two steps: projection and rotation
o Project the data onto the space orthogonal to the nuisance parameters in the linear model
o Random rotation of the orthogonal residuals provides fractional permutation, avoiding the granularity of permutation p-values
o Assumes multivariate normality, but proves to be highly robust against deviations from normality
83
Set summary statistics
o Compute empirical Bayes t-statistics for each gene in the set
o Convert to z-statistics
o Mean of z²: good power, even when only a subset of genes respond; not robust against non-normality
o Mean50 (mean of the top half of |z|): good power; robust
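Both set summaries are one-liners once the z-statistics are in hand. A Python sketch with made-up z-values, where only half the set responds:

```python
def set_statistics(z):
    """Two set-level summaries of a gene set's z-statistics.

    meansq: mean of z^2, powerful even if only a subset of genes responds.
    mean50: mean of the top half of |z|, a more robust alternative.
    """
    meansq = sum(v * v for v in z) / len(z)
    top = sorted((abs(v) for v in z), reverse=True)[: max(1, len(z) // 2)]
    mean50 = sum(top) / len(top)
    return meansq, mean50

# Only 2 of the 4 genes respond, but both summaries still register a signal:
meansq, mean50 = set_statistics([3.0, -2.5, 0.1, -0.2])
```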
84
References
o Subramanian et al (2005). A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550.
o Efron and Tibshirani (2007). On testing the significance of sets of genes. Ann Appl Stat 1, 107-129. http://www-stat.stanford.edu/~tibs/GSA/
85
Multi-level models I: duplicate spots
86
Hard shrinking: examples
o Common correlation model for within-array duplicate spots
  Smyth et al (2005). The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21, 2067-2075.
o Common correlation models for single channel analysis of two-colour microarray data
  Smyth, G. K. (2005). Individual channel analysis of two-colour microarray data. 55th Session of the International Statistics Institute, 5-12 April 2005, Sydney, Australia.
87
Common correlation model
Given a blocking factor with variance component σb², focus on the within-block correlations
The common correlation model assumes the correlation is the same for every gene
Has proved effective for technical blocking factors for which correlations are high
88
Duplicate Spots
o If the clone library is not too large, it is often possible to print each gene more than once on an array
o Duplicates are always side-by-side or a fixed distance apart
89
Genes printed in duplicate pairs
[Array image highlighting one pair]
90
Duplicate spots are correlated
o Duplicate spots are a form of technical replication, and share many common causes
o They cannot be treated as replicates on separate arrays; log-ratios from duplicate spots are correlated
o How best to use duplicate spots? The usual approach is simply to average them
91
Common Correlation Model
o Assume the between-duplicate correlation is the same for every gene
o Justified by the belief that the correlation springs mainly from spatial proximity
o Improves estimation of the variances
92
Consequences for individual genes
o If the number of genes is large, then the estimator of the correlation ρ is very accurate, so ρ may be treated as known as far as inference for each individual gene is concerned
o This doesn't change the estimation of the coefficients βg, but greatly changes the estimation of their standard errors
93
Validation with Spike-In Data
o Does the idea of using common correlations work in practice?
o Check the ability of the common-correlation t-statistic to distinguish calibration from ratio spike-in spots
o The scorecard system includes calibration controls, 3-fold up and down ratio controls, and 10-fold up and down ratio controls
94
95
96
Multilevel models II: Separate channel analysis of two-colour data
97
Separate channel analysis of two-colour microarray data
G1 R1  G2 R2  G3 R3  G4 R4
Each spot is a block of size 2
The two channels give a correlated pair of values
The common correlation model sets the intra-spot correlation equal across genes
98
Why Use Means and Differences?
An old idea dating back to Tukey, Altman & Bland:
If var(R) = var(G), then cor(M, A) = 0, regardless of the correlation between R and G.
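A quick simulation illustrates the Tukey/Altman-Bland point: when the two channels have equal variances, M and A are uncorrelated even though R and G are strongly correlated. All values here are illustrative:

```python
import random

random.seed(7)

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Correlated (R, G) pairs with equal variances: a shared spot effect
# induces the correlation between the two channels.
pairs = [(s + random.gauss(0, 0.5), s + random.gauss(0, 0.5))
         for s in (random.gauss(0, 1) for _ in range(5000))]

M = [r - g for r, g in pairs]
A = [(r + g) / 2 for r, g in pairs]

cor_RG = pearson([r for r, _ in pairs], [g for _, g in pairs])  # large
cor_MA = pearson(M, A)                                          # near zero
```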
99
Common Reference Experiment
Arrays: B vs Ref (×2) and C vs Ref (×2)
M-values: μM = B − Ref and μM = C − Ref
A-values: μA = (B + Ref)/2 and μA = (C + Ref)/2
Why not use the A-values as well as the M-values?
100
A simple normal model
For gene g on array i, model (Rgi, Ggi) as a correlated normal pair;
ρg is the intra-spot correlation
101
Models in terms of M and A
[Model equations shown on slide]
102
M and A parameters
μM,gi = μR,gi − μG,gi
μA,gi = (μR,gi + μG,gi)/2
σ²M,gi = 2σg²(1 − ρg)
σ²A,gi = σg²(1 + ρg)/2
103
Correlation
M and A are independent but have different variances.
Have converted a correlated problem into a heteroscedastic problem.
Can estimate the correlation by estimating the variances:
ρg = tanh( ½ · log( 4σ²A,g / σ²M,g ) )
104
Common correlation model
o Assume the intra-spot correlation is constant across genes
o Justified because (i) the variance components are observed to be positively correlated and (ii) the standard errors for the coefficients are not sensitive to the correlation value
o The common correlation can then be assumed known at the individual gene level
o Converts a mixed model into a weighted regression
lmscFit() in limma
105
ApoAI Experiment
8 wild-type arrays and 8 ApoAI−/− arrays, all relative to a common reference
Median intra-spot correlation is estimated as ρ̂ = 0.85
The efficiency gain from using A-values is (1 − ρ̂)/(1 + ρ̂) = 0.08
106
Individual Channel Normalization
Using A-values in the analysis requires that they be normalized to have comparable values between arrays: "single channel normalization"
For the ApoAI data, the estimated intra-spot correlation is:
- within-array loess normalization only: ρ̂ = 0.85
- A-quantile normalization: ρ̂ = 0.89
- quantile normalization: ρ̂ = 0.84
107
Ignoring the Reference
Why not ignore the common reference channel? If we use only the red channel in the common reference experiment,
var(R̄C − R̄B) = σ²(1/n1 + 1/n2)
so
var(R̄C − R̄B) / var(M̄C − M̄B) = 1 / (2(1 − ρ))
so adjusting for a common reference is worthwhile whenever ρ > 0.5
108
Disconnected Design
Arrays: B vs C and D vs E
A-values make no contribution to estimating the B vs C comparison (direct)
M-values make no contribution to estimating the B vs D comparison (indirect)
The relative efficiency of the indirect comparison compared to the direct one is 1 − ρ
109
Observed false discovery rates
[Plot: known false positives vs tests rejected, comparing mixed model, ignoring correlation, and spot correlation]
Data from Holloway et al (2006) BMC Bioinformatics 7, Article 511.
110
Why common correlation is effective
o The bias introduced into the variance estimation seems to be offset by the increase in precision
o Great simplification of the mathematical model
o Penalizes genes with large within-block variances
111
A Shrinkage Hierarchy
o Fold changes – shrinkage may not be required (unless there are more than 2 groups)
o Genewise variances – soft smoothing gives spectacular improvement
o Technical replicate correlations – hard smoothing has proved successful
As we move from parameters of interest to higher order nuisance parameters, bias decreases in importance relative to noise
112
Other bits and pieces
113
Some other common variations
o Technical replicates
o Paired samples
o Array weights
114
Summary
o Borrowing strength is essential in small-scale microarray experiments
o Information can be shared across genes or across arrays
o Parameters may be set common between genes (correlations) or shrunk in a graduated way (standard errors)
o Power can be increased by testing hypotheses for sets of genes
115
Acknowledgements
WEHI Bioinformatics
o Gordon Smyth
o Matt Ritchie
o Alicia Oshlack
o Terry Speed
Garvan Epigenetics
o Susan Clark
o Aaron Statham
o Marcel Coolen
116
Computer Laboratories
117
Proposed schedule
a.m.
o ModerationDemo.R
o LimmaObjects.R
o BGNormalization.R
o SimpleExperiments.R
p.m.
o FactorialDesign.R
o GeneSetAnalysis.R
o TimeCourse.R
118
Getting started
o Grab the files from the 'KU-August2009-LIMMA' directory (or archive) and copy/move them to a convenient location on your computer
o Set the variable 'rootDir' to that directory. For example:
  rootDir <- "~/Desktop/KU-August2009-LIMMA/data"
o Make rootDir the working directory of your R session
119
limma package documentation
o Function help pages: ?lmFit, ?eBayes
o Class help pages: ?"RGList-class", ?"MArrayLM-class"
o Group help pages: help("06.LinearModels")
o User's Guide: limmaUsersGuide()
o The R html help system is a good top view
120
Moderation Demo
o Illustration of sampling from the model
o Reduction in false discoveries
o Empirical Bayes differential expression
121
Limma Objects
o RGList
o MAList
o MArrayLM
122
BG correction / normalization demo
o Various procedures for BG correction, normalization
o Jurkat data: same vs. same
123
Simple Experiment 1: Integrin beta7+ vs beta7–
beta7- beta7+
o Reading two-color data
o Control spots
o Background correction
o Dye-swaps
o Empirical Bayes differential expression
124
Simple Experiment 2: Cancer cells versus normal cells
Normal cell line vs cancer cell line
o Reading Affymetrix data
o Simple design versus contrasts
125
Factorial Experiment: Estrogen
2×2 factorial: estrogen absent/present × 10 hours/48 hours
o Reading Affymetrix data
o Factorial designs, contrasts
126
Gene set analysis
o Convenient “functional category analysis”
o Newer, flexible set-based testing
127
Targets of SAHA and depsipeptide: a time course experiment
o Study the effects of SAHA and depsipeptide on the acute T-cell leukemia cell line CEM
o SAHA and depsipeptide are structurally different but have similar biological effects (both induce death through the intrinsic apoptotic pathway)
o Prising out the subtle differences is of great interest
128
SAHA/depsipeptide: Experimental design
SAHA Vehicle only depsipeptide
0hr 0hr 0hr
1hr 1hr 1hr
2hr 2hr 2hr
4hr 4hr 4hr
8hr 8hr 8hr
16hr 16hr 16hr
A time course of 6 arrays was done for each treatment.
129
Aims of experiment
o Identify common responders: genes which respond similarly to SAHA and depsipeptide
o Identify specific responders: genes which respond to one of SAHA or depsipeptide, but not to the other
o Different responders, genes which respond to both SAHA and depsipeptide but differently, are of lesser interest
130
Linear model analysis
o Fit genewise linear models to all the arrays simultaneously
o Include effects for drug × time
o Allow for probe-specific dye-effects
o Treat each time series of 6 arrays as a randomized block, i.e., allow arrays hybridized together to be correlated
2nd statistical point – analyse all arrays together