
Upload: colin-webb

Post on 02-Jan-2016


1-11-2005 1

• We calculated a t-test for 30,000 genes at once

• How do we handle the results, and how do we present the data and results?

• Normalization of the data as a means of removing biases and reducing experimental variability

• Two basic questions in the normalization process

• Are we attenuating the signal?

• Are we compromising the independence of our measurements?

• Outliers: part of the quality control

• An observation can be excluded if we can identify a physical reason for doing so (e.g. a scratch on the slide)

• Such physical problems are usually "flagged" in the process of quantifying fluorescence intensities

• The question of excluding a whole array from the analysis is particularly tricky; we will discuss it further later

Genome-wide analysis

1-11-2005 2

The Problem: Identify genes whose expression in a target organ (Lung) of a model organism (Rat) is affected by an environmental toxicant (W)

Population: All model organisms of this type (Rats)

Sample: 12 randomly selected rats from the population of all rats. (Randomly means that all rats in the population have an equal chance of being selected)

Randomization: Randomly select 6 rats to be treated with the toxicant. "Randomly" is the key word here: it is what allows us to ascribe observed changes to the treatment alone.

Prepare samples and extract RNA from all 12 rats

Randomly assign labeled RNA to different microarrays

Process microarrays in a random order

Randomization Issue
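The randomization steps above can be sketched in a few lines. This is only an illustration under assumed names: the function `randomize_experiment`, the seed, and the rat numbering 1 to 12 are hypothetical and not part of the original slides.

```python
import random


def randomize_experiment(n_rats=12, n_treated=6, seed=2005):
    """Sketch of the three randomization steps for a 12-rat experiment.

    All names and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    rats = list(range(1, n_rats + 1))

    # Step 1: randomly choose which rats receive the toxicant (W).
    treated = sorted(rng.sample(rats, n_treated))
    control = [rat for rat in rats if rat not in treated]

    # Label the samples W1..W6 and C1..C6.
    samples = [f"W{i}" for i in range(1, len(treated) + 1)]
    samples += [f"C{i}" for i in range(1, len(control) + 1)]

    # Step 2: randomly assign labeled RNA samples to microarrays.
    # arrays[k] is the sample hybridized to microarray k + 1.
    arrays = samples[:]
    rng.shuffle(arrays)

    # Step 3: process the microarrays in a random order.
    order = list(range(1, n_rats + 1))
    rng.shuffle(order)

    return treated, control, arrays, order


treated, control, arrays, order = randomize_experiment()
print("Treated rats:    ", treated)
print("Control rats:    ", control)
print("Array assignment:", arrays)
print("Processing order:", order)
```

Fixing the seed makes the allocation reproducible, which is convenient for record keeping; any source of genuine randomness would serve the same purpose.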

1-11-2005 3

•12 microarrays, 12 samples (C1,...,C6,W1,...,W6)

•Randomly assign samples to different microarrays

•In terms of a single gene, 12 different "spots"

Single Channel Microarrays – Each Sample Assigned to a Different Microarray

[Figure: microarray images, one per sample, each annotated with its green-channel scan ($X_G$), its red-channel scan ($X_R$), and the log-ratio $\log(X_R / X_G) = \log(X_R) - \log(X_G)$. Sample labels shown: W3 W5 W6 W1 W2 W4 C5 C1 C2 C4 C6 C3]

•Proceed with a two-sample t-test as we did so far
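As a rough illustration of the two-sample t-test for a single gene, the sketch below computes the pooled-variance t-statistic and its degrees of freedom. The function name `two_sample_t` and the expression values are made up for the example; computing a p-value would additionally require the t-distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, variance


def two_sample_t(x1, x2):
    """Pooled-variance two-sample t-statistic and its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    # Standard error of the difference in means; equals s_p * sqrt(2/n) when n1 == n2 == n.
    se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / se, n1 + n2 - 2


# Hypothetical log-expression values for one gene on 12 single-channel arrays.
w = [9.1, 8.7, 9.4, 9.0, 8.9, 9.3]  # treated (W1..W6)
c = [8.2, 8.5, 8.0, 8.4, 8.1, 8.3]  # control (C1..C6)

t_star, df = two_sample_t(w, c)
print(f"t* = {t_star:.3f} on {df} degrees of freedom")
```

For 30,000 genes the same calculation is simply repeated gene by gene.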

1-11-2005 4

•6 microarrays, 12 samples (C1,...,C6,W1,...,W6)

•Randomly select pairs and assign them to different microarrays

•In terms of a single gene, 6 different "spots"

Two-Channel Microarrays – One C and One W Sample Assigned to Each Microarray

[Figure: microarray images, each annotated with its green-channel scan ($X_G$), its red-channel scan ($X_R$), and the log-ratio $\log(X_R / X_G) = \log(X_R) - \log(X_G)$. Sample labels shown, two per array: W3 C5, W6 C1, W2 C2, W5 C6, W4 C4, W1 C3]

•Individual samples are no longer "free" to be assigned to any microarray – restriction on the randomization process

•Measurements are "blocked" within a microarray (terminology)

•We could still assign samples completely at random, without requiring one treatment and one control sample on each microarray, but this would be unreasonable (arguments to come)

•Need to use a paired t-test
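The restricted randomization described above (one C and one W sample per array) might be sketched as follows. The function name `assign_pairs` and the seed are illustrative assumptions; which sample goes on which channel is not addressed here.

```python
import random


def assign_pairs(n_pairs=6, seed=4):
    """Randomly pair one treated (W) and one control (C) sample per microarray."""
    rng = random.Random(seed)
    treated = [f"W{i}" for i in range(1, n_pairs + 1)]
    controls = [f"C{i}" for i in range(1, n_pairs + 1)]
    # Shuffle each group independently, then pair them position by position.
    rng.shuffle(treated)
    rng.shuffle(controls)
    # Element k of the result is the (W, C) pair hybridized to microarray k + 1.
    return list(zip(treated, controls))


pairs = assign_pairs()
for array_number, (w_sample, c_sample) in enumerate(pairs, start=1):
    print(f"Microarray {array_number}: {w_sample} with {c_sample}")
```

Each microarray acts as a block: samples are still assigned at random, but only within the constraint that every array carries exactly one sample from each group.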

1-11-2005 5

Paired t-test

• For a specific gene, $r_i = x_{iW} - x_{iC}$ is the $i$th difference, $i = 1, \dots, 6$

• Statistical model of observed data: $r_i \sim N(\mu, \sigma^2)$

• Differential expression: $\mu \neq 0$

• Estimating parameters: $\bar{r} = \dfrac{1}{n}\sum_{i=1}^{n} r_i$, $\quad s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (r_i - \bar{r})^2$

• Calculating t-statistic: $t^* = \dfrac{\bar{r}}{s\sqrt{1/n}}$

[Figure: two panels showing a density curve; horizontal axis "t-statistics" from -4 to 4, vertical axis from 0.0 to 0.4]

• "Null Distribution" is t-distribution with $n-1$ degrees of freedom
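The paired t-statistic defined above can be computed directly from the within-array differences. The sketch below is a minimal version using only the standard library; the function name `paired_t` and the expression values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(x_w, x_c):
    """Paired t-statistic for one gene and its degrees of freedom (n - 1)."""
    # r_i = x_iW - x_iC: difference between the two samples on microarray i.
    r = [w - c for w, c in zip(x_w, x_c)]
    n = len(r)
    r_bar = mean(r)
    s = stdev(r)  # sample standard deviation, n - 1 in the denominator
    # Note: s is zero if all differences are identical, in which case t* is undefined.
    t_star = r_bar / (s * math.sqrt(1 / n))
    return t_star, n - 1


# Hypothetical log-expression values for one gene on 6 two-channel arrays.
x_w = [9.6, 7.4, 10.3, 8.1, 9.0, 6.9]  # treated sample on each array
x_c = [9.3, 7.2, 10.0, 7.7, 8.8, 6.5]  # control sample on the same array

t_star, df = paired_t(x_w, x_c)
print(f"t* = {t_star:.3f} on {df} degrees of freedom")
```

In this made-up example the expression level varies a lot from array to array, but the differences are stable, which is the situation where pairing helps most.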

1-11-2005 6

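Two-sample t-test vs paired t-test

• t-statistic: $t^* = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{2/n}}$ (two-sample), $\quad t^* = \dfrac{\bar{r}}{s\sqrt{1/n}}$ (paired)

• Denominator: 1.51 (two-sample), 0.04 (paired)

• p-value: 0.870 (two-sample), 0.002 (paired)

• Reference Distribution: $t_{2n-2}$ (two-sample), $t_{n-1}$ (paired)

[Figure: scatter plot with axes C and W for one gene; both axes run from 6 to 11]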
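A short derivation may help explain why the two denominators can differ so much. It assumes equal group sizes $n$ and writes $s_W^2$, $s_C^2$ for the sample variances of the two groups and $s_{WC}$ for their sample covariance across arrays; these symbols are introduced here for illustration and do not appear on the slide.

```latex
\begin{align*}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\bigl[(x_{iW}-\bar{x}_W) - (x_{iC}-\bar{x}_C)\bigr]^2
     = s_W^2 + s_C^2 - 2\,s_{WC}, \\
s_p^2 &= \frac{(n-1)\,s_W^2 + (n-1)\,s_C^2}{2n-2} = \frac{s_W^2 + s_C^2}{2}, \\
\left(s\sqrt{1/n}\right)^2 &= \left(s_p\sqrt{2/n}\right)^2 - \frac{2\,s_{WC}}{n}.
\end{align*}
```

When the two samples on the same array are strongly positively correlated ($s_{WC} > 0$), the paired denominator is smaller than the two-sample one; when $s_{WC}$ is zero or negative, pairing gains nothing in the denominator and still costs degrees of freedom.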

1-11-2005 7

Two-sample t-test vs paired t-test

[Figure: two panels titled "Standard Deviations"; horizontal-axis labels "Raw Paired TTest"; vertical axis "Standard Deviation", from 0.0 to 3.0 in the first panel and from 0.0 to 2.0 in the second]

1-11-2005 8

Two-sample t-test vs paired t-test

[Figure: two panels titled "P-values"; horizontal-axis labels "Raw Paired TTest"; vertical axis "-log10(p-value)", from 0 to 8 in the first panel and from 0.0 to 2.0 in the second]

1-11-2005 9

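Two-sample t-test vs paired t-test

[Figure: scatter plot titled "Two-sample vs Paired t-test"; horizontal axis "T statistic" from 0 to 14, vertical axis "Paired t-statistic" from 0 to 40]

$t^*_{\text{paired}} = t^* \cdot \dfrac{s_p\sqrt{2/n}}{s\sqrt{1/n}}$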

1-11-2005 10

Two-sample t-test vs paired t-test

[Figure: scatter plot titled "Two-sample vs Paired t-test"; horizontal axis "Two-sample t-test p-value", vertical axis "Paired t-test p-value", both from 0.0 to 1.0]

• Small advantage for two-sample t-test purely due to degrees of freedom

• Bigger possible advantage due to the smaller denominator (standard error)

1-11-2005 11

When is t-test "better" than paired t-test

• Q: Can we use the two-sample t-test in this case since it gives us a smaller p-value?

• A: NO! Randomization and non-independence issues remain

[Figure: scatter plot with axes C and W for one gene; one axis runs from 8.8 to 9.6, the other from 6.0 to 8.5]

• Denominator: 0.56 (two-sample t), 0.64 (paired t)

• p-value: 0.0008 (two-sample t), 0.0097 (paired t)

1-11-2005 12

Multiple Factor Experiments - Incomplete Block Design

[Figure: design diagrams, one with Control and Treatment and one with Control, Treatment 1 and Treatment 2; legend entries: Array, Cy 3, Cy 5]

1-11-2005 13

Multiple Factor Experiments - Incomplete Block Design

• No color effect • Homogeneous variance • Optimal

• No color effect • Homogeneous variance • Sub-Optimal

• Homogeneous color effect • Homogeneous variance

• Homogeneous variance

1-11-2005 14

Multiple Factor Experiments - Incomplete Block Design

[Figure: two design diagrams, each with the nodes C, T1, T2, and T1 & T2]

•Homogeneous Variance

1-11-2005 15

limma

... is a package for the analysis of microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression.

• Specially constructed data objects to represent various aspects of microarray data

• Specially constructed "object methods" for importing, normalizing, displaying and analyzing microarray data

• Unique in the implementation of the empirical Bayes procedure for identifying differentially expressed genes by "borrowing" information from different genes (everything so far has been gene by gene)
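To give a feel for what "borrowing" information across genes means, the sketch below shrinks each gene's variance toward a common prior value before forming a paired t-statistic. It is written in Python for illustration only and is not limma's API (limma is an R package). The shrinkage formula is the standard moderated-variance form, and the prior values `S0_SQ` and `D0` are simply assumed here, whereas an empirical Bayes procedure would estimate them from all genes.

```python
import math
from statistics import mean, variance


def moderated_variance(s_sq, d, s0_sq, d0):
    """Weighted average of a gene's own variance and a prior variance."""
    return (d0 * s0_sq + d * s_sq) / (d0 + d)


def moderated_paired_t(r, s0_sq, d0):
    """Paired t-statistic using the shrunken variance; df becomes d0 + (n - 1)."""
    n = len(r)
    d = n - 1
    s_tilde_sq = moderated_variance(variance(r), d, s0_sq, d0)
    return mean(r) / math.sqrt(s_tilde_sq / n), d0 + d


# Hypothetical prior: variance 0.05 worth 4 degrees of freedom (assumed, not estimated).
S0_SQ, D0 = 0.05, 4.0

# Made-up paired differences for two genes on 6 arrays.
differences = {
    "gene_a": [0.30, 0.31, 0.29, 0.30, 0.32, 0.28],  # tiny variance, modest effect
    "gene_b": [1.10, 0.60, 1.40, 0.90, 1.20, 0.80],  # larger variance, larger effect
}

for gene, r in differences.items():
    ordinary_t = mean(r) / math.sqrt(variance(r) / len(r))
    mod_t, df = moderated_paired_t(r, S0_SQ, D0)
    print(f"{gene}: ordinary t = {ordinary_t:.1f}, moderated t = {mod_t:.1f} (df = {df:.0f})")
```

A gene whose variance happens to be very small by chance no longer produces an extreme statistic on its own, because its variance estimate is pulled toward the value typical of the other genes.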