1-11-2005 1
• We calculated a t-test for 30,000 genes at once
• How do we handle results, present data and results
• Normalization of the data as a means of removing biases and reducing experimental variability
• Two basic questions in the normalization process
• Are we attenuating the signal?
• Are we compromising the independence of our measurements?
• Outliers – part of the quality control.
• An observation can be excluded if we can identify a physical reason for doing so (e.g. a scratch on the slide)
• Such physical problems are usually "flagged" in the process of quantifying fluorescence intensities
• The question of excluding a whole array from the analysis is particularly tricky – we will discuss it further later
Genome-wide analysis
1-11-2005 2
The Problem: Identify genes whose expression in a target organ (Lung) of a model organism (Rat) is affected by an environmental toxicant (W)
Population: All model organisms of this type (Rats)
Sample: 12 randomly selected rats from the population of all rats. (Randomly means that all rats in the population have an equal chance of being selected)
Randomization: Randomly select 6 rats to be treated by the toxicant. Randomly is the key word here: it is what allows us to ascribe observed changes to the treatment alone.
Prepare samples and extract RNA from all 12 rats
Randomly assign labeled RNA to different microarrays
Process microarrays in a random order
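The randomization steps above can be sketched in a few lines of Python. This is only an illustration: the function name `randomize_experiment`, the rat numbering 1 to 12 and the fixed seed are assumptions made here for reproducibility, not part of the original protocol.

```python
import random

def randomize_experiment(n_rats=12, n_treated=6, seed=2005):
    """Return treated rats, control rats, sample-to-array assignment, processing order."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    rats = list(range(1, n_rats + 1))

    # Step 1: randomly select the rats that receive the toxicant (W).
    treated = sorted(rng.sample(rats, n_treated))
    control = [r for r in rats if r not in treated]

    # Step 2: label the RNA samples and assign them to arrays at random.
    samples = [f"W{i}" for i in range(1, len(treated) + 1)]
    samples += [f"C{i}" for i in range(1, len(control) + 1)]
    arrays = samples[:]
    rng.shuffle(arrays)  # arrays[k] is the sample hybridized on array k + 1

    # Step 3: process the arrays in a random order.
    order = list(range(1, n_rats + 1))
    rng.shuffle(order)
    return treated, control, arrays, order

treated, control, arrays, order = randomize_experiment()
print("treated rats:    ", treated)
print("control rats:    ", control)
print("array assignment:", arrays)
print("processing order:", order)
```

Every step uses its own random draw, so neither the choice of treated animals nor the array or processing order can line up systematically with the treatment.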
Randomization Issue
1-11-2005 3
•12 microarrays, 12 samples (C1,...,C6,W1,...,W6)
•Randomly assign samples to different microarrays
•In terms of a single gene, 12 different "spots"
Single Channel Microarrays – Each Sample Assigned to a Different Microarray
[Figure: images of the microarrays. Image labels: scanning the “green channel” ($X_G$); scanning the “red channel” ($X_R$); $\log(X_R / X_G) = \log(X_R) - \log(X_G)$]
Sample assignment shown: W3 W5 W6 W1 W2 W4 C5 C1 C2 C4 C6 C3
•Proceed with a two-sample t-test as we did so far
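For one gene measured on 12 single-channel arrays, the two-sample t-statistic can be computed as in the sketch below. It assumes the pooled-variance form of the test; the function name `two_sample_t` and the log-intensity values are made up for illustration.

```python
import math
from statistics import mean, variance

def two_sample_t(x_w, x_c):
    """Pooled-variance two-sample t-statistic and its degrees of freedom."""
    n_w, n_c = len(x_w), len(x_c)
    # Pooled variance: weighted average of the two sample variances.
    s2_p = ((n_w - 1) * variance(x_w) + (n_c - 1) * variance(x_c)) / (n_w + n_c - 2)
    # Standard error of the difference in means; equals s_p * sqrt(2/n) when n_w == n_c == n.
    denominator = math.sqrt(s2_p * (1 / n_w + 1 / n_c))
    return (mean(x_w) - mean(x_c)) / denominator, n_w + n_c - 2

# Hypothetical log-intensities for one gene: six treated (W) and six control (C) arrays.
x_w = [9.0, 9.3, 8.8, 9.1, 9.4, 8.9]
x_c = [8.0, 8.2, 7.9, 8.1, 8.3, 7.8]
t_star, df = two_sample_t(x_w, x_c)
print(f"t* = {t_star:.2f} on {df} degrees of freedom")
```

For a genome-wide analysis the same function would simply be applied once per gene.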
1-11-2005 4
• 6 microarrays, 12 samples (C1,...,C6,W1,...,W6)
• Randomly select pairs and assign them to different microarrays
• In terms of a single gene, 6 different "spots"
Two-Channel Microarrays – One C and One W Sample Assigned to Each Microarray
[Figure: images of the microarrays. Image labels: scanning the “green channel” ($X_G$); scanning the “red channel” ($X_R$); $\log(X_R / X_G) = \log(X_R) - \log(X_G)$]
Sample assignment shown: W3 C5 | W6 C1 | W2 C2 | W5 C6 | W4 C4 | W1 C3
•Individual samples are no longer "free" to be assigned to any microarray – restriction on the randomization process
•Measurements are "blocked" within a microarray (terminology)
• We could still assign samples completely at random, without requiring a treated and a control sample on each microarray, but this would be unreasonable (arguments to come)
•Need to use a paired t-test
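The restricted randomization described above, where each array is a block holding one treated and one control sample, might be sketched as follows. The function name `pair_samples` and the seed are illustrative assumptions; assigning the two dyes within an array is not covered here.

```python
import random

def pair_samples(controls, treated, seed=2005):
    """Randomly pair each treated sample with a control sample, one pair per array."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    c, w = controls[:], treated[:]
    rng.shuffle(c)
    rng.shuffle(w)
    # The restriction: every array (block) receives exactly one W and one C sample.
    return list(zip(w, c))

controls = [f"C{i}" for i in range(1, 7)]
treated = [f"W{i}" for i in range(1, 7)]
pairs = pair_samples(controls, treated)
for array, (w, c) in enumerate(pairs, start=1):
    print(f"array {array}: {w} with {c}")
```

Samples are still assigned at random, but only among the arrangements that put one sample of each kind on every array.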
1-11-2005 5
Paired t-test
• For a specific gene, $r_i = x_{iw} - x_{ic}$ is the $i$th difference, $i = 1, \dots, 6$
• Statistical model of observed data: $r_i \sim N(\mu, \sigma^2)$
• Differential expression: $\mu \neq 0$
• Estimating parameters: $\hat{\mu} = \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i$, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (r_i - \bar{r})^2$
• Calculating t-statistic: $t^* = \dfrac{\bar{r}}{s\sqrt{1/n}}$
[Figure: two plots with x-axis "t-statistics" (-4 to 4) and y-axis from 0.0 to 0.4]
• "Null Distribution" is t-distribution with n-1 degrees of freedom
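The paired t-statistic for a single gene follows directly from the within-array differences. A minimal sketch, assuming six two-channel arrays with one treated (W) and one control (C) log-intensity each; the function name `paired_t` and the data values are made up for illustration.

```python
import math
from statistics import mean, stdev

def paired_t(x_w, x_c):
    """Paired t-statistic and degrees of freedom from within-array differences."""
    r = [w - c for w, c in zip(x_w, x_c)]  # r_i = x_iw - x_ic
    n = len(r)
    r_bar = mean(r)  # estimate of mu
    s = stdev(r)     # sample standard deviation of the differences (divisor n - 1)
    t_star = r_bar / (s * math.sqrt(1 / n))
    return t_star, n - 1

# Hypothetical data: element i of each list comes from array i.
x_c = [7.2, 8.5, 9.1, 6.8, 10.0, 8.0]
x_w = [7.9, 9.0, 9.8, 7.3, 10.6, 8.7]
t_star, df = paired_t(x_w, x_c)
print(f"t* = {t_star:.2f}, to be compared with a t-distribution on {df} degrees of freedom")
```

Note that the array-to-array variation in overall intensity cancels in each difference, so only the variability of the differences enters the denominator.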
1-11-2005 6
Two-sample t-test vs paired t-test
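• Two-sample t-statistic: $t^* = \dfrac{\bar{x}_2 - \bar{x}_1}{s_p\sqrt{2/n}}$; paired t-statistic: $t^* = \dfrac{\bar{r}}{s\sqrt{1/n}}$
                            Two-sample t   Paired t
• Denominator               1.51           0.04
• p-value                   0.870          0.002
• Reference Distribution    $t_{2n-2}$     $t_{n-1}$
[Figure: scatter plot with axes C and W, both from 6 to 11]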
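The contrast between the two denominators can be reproduced on any data set where measurements from the same array move together. The sketch below uses made-up values (not the gene shown on the slide) and a hypothetical helper `compare_tests`; it assumes equal group sizes.

```python
import math
from statistics import mean, stdev, variance

def compare_tests(x_w, x_c):
    """Denominator, t-statistic and degrees of freedom for both tests on the same n pairs."""
    n = len(x_w)
    diff = mean(x_w) - mean(x_c)  # common numerator: mean difference
    s_p = math.sqrt((variance(x_w) + variance(x_c)) / 2)  # pooled SD, equal group sizes
    s = stdev([w - c for w, c in zip(x_w, x_c)])          # SD of within-array differences
    two = {"denominator": s_p * math.sqrt(2 / n), "df": 2 * n - 2}
    paired = {"denominator": s * math.sqrt(1 / n), "df": n - 1}
    two["t"] = diff / two["denominator"]
    paired["t"] = diff / paired["denominator"]
    return two, paired

# Hypothetical gene: large array-to-array variation, small and consistent W - C difference.
x_c = [6.1, 7.4, 8.2, 9.0, 9.9, 10.8]
x_w = [6.6, 7.8, 8.8, 9.5, 10.3, 11.4]
two, paired = compare_tests(x_w, x_c)
print("two-sample:", two)
print("paired:    ", paired)
```

The two-sample denominator absorbs the array-to-array variation, while the paired denominator only reflects how much the differences vary.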
1-11-2005 7
Two-sample t-test vs paired t-test
[Figure: two panels titled "Standard Deviations" comparing "Raw" and "Paired TTest"; y-axis: Standard Deviation (0.0 to 3.0 in the first panel, 0.0 to 2.0 in the second)]
1-11-2005 8
Two-sample t-test vs paired t-test
[Figure: two panels titled "P-values" comparing "Raw" and "Paired TTest"; y-axis: -log10(p-value) (0 to 8 in the first panel, 0.0 to 2.0 in the second)]
1-11-2005 9
Two-sample t-test vs paired t-test
[Figure: scatter plot "Two-sample vs Paired t-test"; x-axis: T statistic (0 to 14); y-axis: Paired t-statistic (0 to 40)]
$t^*_{paired} = \dfrac{s_p\sqrt{2/n}}{s\sqrt{1/n}}\; t^*$
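The relation between the two statistics can be derived in two steps. The sketch below assumes equal group sizes $n$ and introduces notation that is not on the slide: $\bar{x}_w$ and $\bar{x}_c$ for the treated and control group means, $s_w^2$ and $s_c^2$ for the group variances, and $s_{wc}$ for the sample covariance between the two measurements on the same array.

```latex
\begin{align*}
\bar{r} &= \frac{1}{n}\sum_{i=1}^{n} (x_{iw} - x_{ic}) = \bar{x}_w - \bar{x}_c
  && \text{(both statistics share this numerator)} \\
t^*_{paired} &= \frac{\bar{r}}{s\sqrt{1/n}}
  = \frac{s_p\sqrt{2/n}}{s\sqrt{1/n}}\; t^*
  = \sqrt{2}\,\frac{s_p}{s}\; t^* \\
s^2 &= s_w^2 + s_c^2 - 2 s_{wc} = 2 s_p^2 - 2 s_{wc}
  \quad\Longrightarrow\quad
  \frac{t^*_{paired}}{t^*} = \left(1 - \frac{s_{wc}}{s_p^2}\right)^{-1/2}
\end{align*}
```

So the paired statistic is larger in magnitude than the two-sample statistic when measurements on the same array are positively correlated, and smaller when their covariance is negative.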
1-11-2005 10
Two-sample t-test vs paired t-test
[Figure: scatter plot "Two-sample vs Paired t-test"; x-axis: Two-sample t-test p-value (0.0 to 1.0); y-axis: Paired t-test p-value (0.0 to 1.0)]
• Small advantage for two-sample t-test purely due to degrees of freedom
• Bigger possible advantage for the paired t-test due to the smaller denominator (standard error)
1-11-2005 11
When is t-test "better" than paired t-test
• Q: Can we use the two-sample t-test in this case since it gives us a smaller p-value?
• A: NO! Randomization and non-independence issues remain
[Figure: scatter plot with axes C and W; x-axis from 8.8 to 9.6, y-axis from 6.0 to 8.5]
              Two-sample t   Paired t
Denominator   0.56           0.64
p-value       0.0008         0.0097
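Whether the paired denominator ends up larger or smaller depends on the sign of the covariance between the two measurements on the same array, because the variance of the differences is the sum of the two group variances minus twice that covariance. A small sketch with made-up data (not the gene on the slide) and hypothetical helper names:

```python
import math
from statistics import mean, variance

def covariance(x, y):
    """Sample covariance (divisor n - 1) of paired measurements."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def standard_errors(x_w, x_c):
    """Denominators of the two-sample and paired t-statistics (equal group sizes)."""
    n = len(x_w)
    two_sample_var = variance(x_w) + variance(x_c)          # equals 2 * s_p**2
    paired_var = two_sample_var - 2 * covariance(x_w, x_c)  # variance of the differences
    return math.sqrt(two_sample_var / n), math.sqrt(paired_var / n)

# Hypothetical gene where W tends to be low on arrays where C is high.
x_c = [9.0, 9.2, 9.4, 9.1, 9.3, 9.5]
x_w = [7.5, 7.0, 6.5, 7.8, 7.2, 6.8]
se_two, se_paired = standard_errors(x_w, x_c)
print(f"covariance = {covariance(x_w, x_c):.3f}")
print(f"two-sample denominator = {se_two:.3f}, paired denominator = {se_paired:.3f}")
```

A smaller two-sample denominator for such a gene does not make the two-sample test valid: the samples were still randomized in pairs, and the two measurements on one array are not independent.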
1-11-2005 12
Multiple Factor Experiments - Incomplete Block Design
[Figure: experimental design diagrams. First design: Control, Treatment. Second design: Control, Treatment 1, Treatment 2. Legend: Array, Cy 3, Cy 5]
1-11-2005 13
Multiple Factor Experiments - Incomplete Block Design
[Design diagrams not reproduced; captions, one design per line]
• No color effect; homogeneous variance; optimal
• No color effect; homogeneous variance; sub-optimal
• Homogeneous color effect; homogeneous variance
• Homogeneous variance
1-11-2005 14
Multiple Factor Experiments - Incomplete Block Design
[Figure: two design diagrams, each with four groups: C, T1, T2, and T1 & T2]
•Homogeneous Variance
1-11-2005 15
limma
... is a package for the analysis of microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression.
• Specially constructed data objects to represent various aspects of microarray data
• Specially constructed "object methods" for importing, normalizing, displaying and analyzing microarray data
• Unique in the implementation of the empirical Bayes procedure for identifying differentially expressed genes by "borrowing" information from different genes (everything so far has been gene by gene)
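The idea of "borrowing" information across genes can be illustrated with a variance-shrinkage sketch: each gene's variance is pulled toward a common prior value, $\tilde{s}_g^2 = (d_0 s_0^2 + d_g s_g^2)/(d_0 + d_g)$, before the t-statistic is formed. The code below is plain Python, not limma (which is an R package) and not its API. As simplifying assumptions, the prior degrees of freedom `d0` is a fixed made-up value and the prior variance is taken as the average gene variance, whereas limma estimates both from the data. Gene names and values are hypothetical.

```python
import math
from statistics import mean, variance

def moderated_paired_t(diffs_by_gene, d0=4.0):
    """Paired t-statistics with gene variances shrunk toward a common value.

    diffs_by_gene maps a gene name to its within-array differences r_i.
    d0 (prior degrees of freedom) is an assumed fixed value in this sketch.
    """
    s2 = {g: variance(r) for g, r in diffs_by_gene.items()}
    s0_2 = mean(list(s2.values()))  # crude prior variance: average over all genes
    result = {}
    for g, r in diffs_by_gene.items():
        d = len(r) - 1
        s2_tilde = (d0 * s0_2 + d * s2[g]) / (d0 + d)  # shrunken variance
        result[g] = mean(r) / math.sqrt(s2_tilde / len(r))
    return result

diffs = {
    "gene_a": [0.50, 0.51, 0.49, 0.50, 0.51, 0.49],  # suspiciously small variance
    "gene_b": [0.5, -0.2, 0.9, 0.1, 0.7, 0.4],
    "gene_c": [0.0, 0.3, -0.3, 0.1, -0.1, 0.0],
}
moderated = moderated_paired_t(diffs)
ordinary_a = mean(diffs["gene_a"]) / math.sqrt(variance(diffs["gene_a"]) / 6)
print(f"gene_a ordinary t = {ordinary_a:.1f}, moderated t = {moderated['gene_a']:.1f}")
```

A gene whose variance happens to be tiny by chance no longer produces an extreme statistic on its own, which is the practical benefit of not analyzing every gene in isolation.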