multiple testing under dependency, with applications to genomic data analysis

27
SRCOS Summer Research Conf, 2011 Multiple Testing Under Dependency, with Applications to Genomic Data Analysis Zhi Wei Department of Computer Science New Jersey Institute of Technology Joint work with Wenguang Sun

Upload: varana

Post on 19-Jan-2016

36 views

Category:

Documents


0 download

DESCRIPTION

Multiple Testing Under Dependency, with Applications to Genomic Data Analysis. Zhi Wei Department of Computer Science New Jersey Institute of Technology Joint work with Wenguang Sun. X. X. X. Expression. X. X. X. Phenotype 2. Phenotype 1. Intro Microarray. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

SRCOS Summer Research Conf, 2011

Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

Zhi Wei

Department of Computer Science

New Jersey Institute of Technology

Joint work with Wenguang Sun

Page 2: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

2New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Intro Microarray

• Measure the activity of thousands of genes simultaneously

• DE v.s. EE genes

• Notation:– Xi ∈ {DE, EE}

– Yi = (yi1, …, yim; yi(m+1), …, yi(m+n)), m, n replicates for 2 cond, resp.

Phenotype 1 Phenotype 2

Exp

ress

ion

XXX

XXX

Differential Expression (DE) Analysis

Data for a single gene

Page 3: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

3New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Microarray Time Course Data: A Motivating Example

• Microarray Time Course (MTC) data commonly collected for studying biological processes

• Tian, Nowak, and Brasier (2005) “A TNF-Induced Gene Expression Program Under Oscillatory NF-Kappab Control,” – study the biological process of how the cytokine tumor necrosis factor (TNF)

initiates tissue inflammation

• Dataset: time-course microarray experiment to profile gene activities at 0hr, 1hr, 3hr, 6hr after inhibition of the NF-kB transcription factor

• To find:– (1) “Early response genes” that were differentially expressed (DE) less than

1hr after the NF-kB inhibition

– (2) “Middle response genes” that were DE at 3hr but no response prior to that

– (3) “Late response genes” that were DE at 6hr but no response prior to that

– (4) “Biphasic genes” that were DE at both 1hr and 6hr hours, but not at 3hr.

Page 4: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

4New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Existing solutions

[Yi1,Yi2, …,YiT]

Xi Xi1 Xi2 … XiT

Yi1 Yi2 … YiT

• Solution 1: Find TDE genes– Moderated likelihood ratio statistic and

Hotelling T2 (Tai and Speed 06),– functional data analysis : Luan and Li

(04), Storey et al. (05), Hong and Li (06)

• Solution 2: Find DE gene at each time point– T-test at each time point– HMM, Yuan and

Kendziorski (06).

Too detailedToo general ?

Page 5: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

5New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Set-wise multiple testing

• Conceptualize the gene selection problem as a set-wise multiple testing problem– Test DE vs EE at each time point for each gene

– Combine the tests across all time points as a set for DE patterns of interest, namely, one gene corresponding to a set of hypotheses

• Several issues in set-wise testing– (i) multiplicity: how to control testing errors (such as the false discovery rate)

at the set level when thousands of gene are considered simultaneously;

– (ii) optimality: how to optimally combine the testing results within a set so that sensitivity is maximized while multiplicity is controlled.

– (iii) dependency: how to account for and exploit the dependency of the tests within a set, e.g. the high temporal correlation in MTC data.

Page 6: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

6New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Related work

• Multiplicity issue for simultaneous set-wise tests – Formally addressed by (Benjamini & Heller, Biometrics, 2008)

• Test DE vs EE at each time point individually

• Combine the p-values at each time point into a pooled p-value by Simes

• BH procedure is applied to the Simes-combined P values

– Partial conjunction tests • Are all T hypotheses in the set true? (Conjunction null)

• Are at least u out of T hypotheses in the set false? (Partial Conjunction alternative)

– Limitations• Incapable of capturing sophisticated DE patterns, e.g., early response genes

• Not optimal because dependency is ignored, although FDR at set level (FSR) is controlled.

• Testing hypotheses under dependency (Sun & Cai, JRSSB, 2009)– One HMM to account for dependency of hypotheses

– Local Index of Significance (LIS) statistics

– Valid and optimal

Page 7: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

7New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Characterizing Gene Temporal Expr. Patterns in a Set-wise Multiple Testing Framework

• Setup for MTC data:– Each gene has two possible states at each time point: equally expressed (EE)

and differentially expressed (DE), denoted by 0 and 1, respectively.

– Consider P sets of hypotheses: {(Hi1, . . . ,HiT) : i = 1, . . . ,P} for testing EE versus DE, where P is #genes and T, #time points.

• Pattern representation– 2T atomic patterns η {0,1}∈ T , e.g. 00101, 11100, etc. for T=5

– Partition 2T ηs into null Θ0 and non-null Θ1, the patterns of interest

– 2^(2T) -1 partitions powerful temporal pattern characterization• Conjunction: Θ0 ={(η1, . . . , ηT) {0, 1}∈ T|∑ ηt =0}

• Partial Conjunction: Θ0 ={(η1, . . . , ηT) {0, 1}∈ T|∑ ηt < u}

• “Late response genes”, DE after time t, Θ1 ={(η1, . . . , ηT) {0, 1}∈ T| η1=…=ηt=0 & ηt+1+…+ηT>0}

Page 8: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

8New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

HMM

• The state sequence of gene i over time is a binary vector Xi = (xi1, . . . , xiT), where xit = 1 indicates that gene i at time t is DE and xit = 0 otherwise.

• Assume Xi {0,1}∈ T is distributed as a Markov chain

– Transition probability can be either homogeneous or inhomogeneous

• Assume Xi ┴ Xj for i <>j

• Emission probability f(Yit|Xit) ~ Gamma-Gamma Model (Newton et al. 2001, Kendziorski et al 2003)

Xi1 Xi2 … XiT

Yi1 Yi2 … YiT

Page 9: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

9New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

HMM-based set-wise testing

• Use EM algorithm for estimating all HMM parameters: initial prob., transition prob., emission prob. (Gamma-Gamma model)

• Define a binary vector ϑ = (ϑ1, . . . , ϑp) {0, 1}∈ p, where ϑi =1 if Xi ∈Patterns of Interest and ϑi =0 otherwise.

• Define the generalized local index of significance (GLIS):

Xi1 Xi2 … XiT

Yi1 Yi2 … YiT

interest of patterns

)|(Pr)|0(Prs

YsY iiiiiGLIS

Page 10: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

10New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

HMM-based set-wise testing (cont’d)

• GLIS testing procedure– Denote by GLIS(1), ..., GLIS(p) the ordered GLIS values and H(1), ..., H(p) the

corresponding hypotheses

– Let

– Then reject all H(i), for i=1, 2, ..., l

– Valid and asymptotically optimal for set-wise multiple testing when the HMM parameter estimate is consistent (Sun and Wei, JASA, 2011)

• The GLIS statistic can be interpreted as the probability of a gene being null (i.e., not exhibiting a pattern of interest) given the observed expression data

}1

:max{1 )(

i

j jGLISi

il

Page 11: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

11New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Simulation

• Setup– #subjects under both conditions are n1 = n2 = 10

– #genes P = 2000

– #time points =6

– #replications =200.

– Nominal FSR level is 0.1.

– HMM: π = (0.95, 0.05) , At = (a00, 1−a00; 1−a11, a11), t = (αt,α0t, νt).

• Dependency: fix t = (10, 1, 0.5) and a00 =0.95 , change a11 from 0.2 to .98

• Gene expr. Signal: fix At = (0.95, 0.05; 0.2, 0.8) and t = (αt,1, 0.5), change αt

• Conjunction and Partial Conjunction

• Comparison– BH-Simes

– Viterbi-based decision rule: the most probable state sequence η ∈ Θ1 then select it.

• Evaluation: FSR (False Set Rate), MSR (Missed Set Rate) = 1 – sensitivity

Page 12: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

12New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Conjunction Test Results

Page 13: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

13New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Partial Conjunction Test Results

Page 14: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

14New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Ranking efficiency

• Ak = (0.95, 0.05; 0.2, 0.8) and k = (αk,32, 4)

Solid: Oracle GLISDashed: data-driven GLISDotted: BH-Simes

Page 15: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

15New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Application

Calvano et al (Nature 2005): A network-based analysis of systemic inflammation in humans.

Goal: Identify genes in response to endotoxin

Dataset: 4 healthy persons’ gene expression in whole blood leukocytes immediately before and at 2,4,6,9 and 24 h after the intravenous administration of bacterial endotoxin; 4 healthy persons w/o treatment as controls.

Page 16: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

16New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Pattern Design

• Genes perturbed in response to treatments

• Sequentially perturbed genes

• Early and late response genes

OR

Page 17: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

17New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

HMM Fitting

Page 18: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

18New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Early Response Genes (DE from 2nd time point)

Page 19: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

19New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Late Response Genes (DE after 2nd time point)

Page 20: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

20New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011Published Genome-Wide Associationsthrough 3/2009, 398 published GWA at p < 5 x 10-8

NHGRI GWA Catalogwww.genome.gov/

GWAStudies

Page 21: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

21New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

HMM-dependent (set-wise) hypothesis test in Genome-wide Association Studies (GWAS)

• Individual Tests (Wei, Sun et al Bioinformatics, 2009)– N chromosomes N HMMs, emission prob: Normal mixtures

– How to pool different HMMs together for having a global multiplicity control

– Improvement: More reproducible , Higher sensitivity

• Set-wise Tests (Wang, Wei, and Sun, Statistics and Its Interface, 2010)– Testing a genomic region

Page 22: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

22New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

From HMM to Markov Random Field

• An undirected graph, where each node represents a random variable Xi and each edge represents a dependency

• Joint Distribution Function

• Markov Blanket and Conditional independence

• Inference– Exact inference is NP-complete– Approximation techniques: MCMC, loopy belief propagation, ICM

))(exp()Pr( }{1

kkkkZ xwxX

}),,,{|APr(})A{\|APr( nmjiX

Page 23: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

23New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Microarray DE analysis

• Discrete Markov Random Field Model for Latent Differential Expression States (Wei and Li, Bioinformatics, 2007)

p genes o

n a

network

Gene expression Yi=(Yi1,…,Yim; Yi(m+1),…,Yi(m+n))

Phenotype 1 Phenotype 2

Networks built from KEGG: 1668 nodes (genes), 8011 edges

Framework: f(X|Y) ~ f(Y|X)*Pr(X)Pr(X) ~ Markov random field (MRF) ModelEmission Prob. f(Y|X) ~ Gamma-Gamma Model

Page 24: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

24New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

MRF in Genome-wide association studies (GWAS)

• GWAS data with network structure constructed based on LD dependency. (Li, Wei, and Maris, Biostatistics, 2010)

Framework: f(X|Y) ~ f(Y|X)*Pr(X)X, disease assc. state: Pr(X) ~ Markov random field (MRF) ModelY, genotype counts, Emission Prob. f(Y|X) ~ Dirichlet-Trinomial Model

Page 25: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

25New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

MTC Data

• Hidden Spatial-Temporal MRF Model (Wei and Li, Annals of Applied Statistics 2008)

• Model both regulatory (gene network) and temporal dependency (time).

g

g -1

g+1

tt -1 t +1 Time

Gene

Ygt Gene Expression

Xgt = 0 or 1, Gene State

Temporal Dependency

Regulatory Dependency

f(Ygt|Xgt) ~ Emission Prob.

Spatial Neighbors

Temporal Neighbors

Page 26: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

26New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

Challenges in MRF

• MRF-structured hypothesis test is valid and optimal?

• MRF-structured set-wise hypothesis test is valid optimal?

• Unlike HMM, the model fitting and parameter estimating for MRF is much more challenging– The EM algorithm for MRF is NP-hard

– (Generalized) belief propagation algorithm

Page 27: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis

27New Jersey Inst. of Tech.

SRCOS Summer Research Conf, 2011

References

• Wei Z and Li H (2007): A Markov random field model for network-based analysis of genomic data. Bioinformatics, 23: 1537-1544.

• Wei Z and Li H (2008): A spatial-temporal MRF model for network-based analysis of microarray time course gene expression data. Annals of Applied Statistics, 2: 408-429 .

• Sun, W., and Cai, T. (2009). “Large-scale multiple testing under dependence”. Journal of the Royal Statistical Society, Series B, 71, 393-424.

• Wei Z, Sun W, Wang K and Hakonarson H (2009), Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models, Bioinformatics, 25:2802-2808

• Li H, Wei Z and Maris J (2010), A Hidden Markov Random Field Model for Genome-wide Association Studies, Biostatistics, 11:139-150.

• Wang W, Wei Z, and Sun W (2010), Simultaneous Set-Wise Testing Under Dependence, with Applications to Genome-Wide Association Studies, Statistics and Its Interface, 3(4):501-512

• Sun W and Wei Z (2011), Multiple Testing for Pattern Identification, with Applications to Microarray Time Course Experiments, Journal of the American Statistical Association, 106(493):73–88