

Multiple Testing for Pattern Identification, With Applications to Microarray Time-Course Experiments

Wenguang SUN and Zhi WEI

In time-course experiments, it is often desirable to identify genes that exhibit a specific pattern of differential expression over time and thus gain insights into the mechanisms of the underlying biological processes. Two challenging issues in the pattern identification problem are: (i) how to combine the simultaneous inferences across multiple time points and (ii) how to control the multiplicity while accounting for the strong dependence. We formulate a compound decision-theoretic framework for set-wise multiple testing and propose a data-driven procedure that aims to minimize the missed set rate subject to a constraint on the false set rate. The hidden Markov model proposed in Yuan and Kendziorski (2006) is generalized to capture the temporal correlation in the gene expression data. Both theoretical and numerical results are presented to show that our data-driven procedure controls the multiplicity, provides an optimal way of combining simultaneous inferences across multiple time points, and greatly improves the conventional combined p-value methods. In particular, we demonstrate our method in an application to a study of systemic inflammation in humans for detecting early and late response genes.

KEY WORDS: Compound decision problem; Conjunction and partial conjunction tests; False discovery rate; Hidden Markov models; Microarray time-course data; Simultaneous set-wise testing.

1. INTRODUCTION

Microarray time-course (MTC) experiments are capable of capturing the dynamic changes of genes over time and have been widely applied for studying many biological processes, such as the regulation of development (Arbeitman et al. 2002), immune response (Calvano et al. 2005), and tissue inflammation program (Tian, Nowak, and Brasier 2005). The MTC experiments are usually conducted under one or multiple biological conditions. In a one-condition experiment, the interest is in identifying genes whose expression levels change over time in some specific way. A well-known example is the MTC experiment conducted by Spellman et al. (1998) for studying cell cycles. Recently two-condition MTC experiments have become popular. As in common case-control studies, genes exhibiting different temporal patterns across two biological conditions are good candidates for further study because the biological process motivating the experiments may be driven by their differential expression. Multiple-condition MTC experiments are rare due to the complications in design and analysis. In this article we focus on the problems under two biological conditions.

In two-condition MTC experiments, each gene at each time point has two possible states: equally expressed (EE) or differentially expressed (DE). A temporal pattern is a prespecified class of sequences of DE and EE states over time. The temporal patterns of dynamic changes often provide insights into the underlying biological mechanisms: genes that exhibit specific temporal patterns of DE can be informative in deciphering the underlying regulatory programs that govern the dynamic biological process of interest.

Wenguang Sun is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27606-8203 (E-mail: [email protected]). Zhi Wei is Assistant Professor, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102. This work was supported in part by National Science Foundation grant DMS-10-07675. We thank the Associate Editor and two referees for detailed and constructive comments which led to a much improved article.

We first consider a motivating example. In studying the biological process of how the cytokine tumor necrosis factor (TNF) initiates tissue inflammation, Tian, Nowak, and Brasier (2005) conducted a time-course microarray experiment to profile gene activities before the inhibition of the NF-kB transcription factor, a mediator of the process, and at one, three, and six hours after the inhibition. The investigators were interested in identifying genes with the following distinct temporal patterns:

(1) “Early response genes” that were differentially expressed (DE) less than one hour after the NF-kB inhibition

(2) “Middle response genes” that were DE at three hours but no response prior to three hours

(3) “Late response genes” that were DE at six hours but no response prior to six hours

(4) “Biphasic genes” that were DE at both one hour and six hours, but not at three hours.

Such delicate expression patterns can be used to distinguish upstream regulatory genes from downstream regulated genes and help to form biological hypotheses for further experimental validations. The goal of this article is to develop powerful multiple testing procedures to identify genes that exhibit temporal patterns of interest.

Conventional gene selection procedures for analysis of time-course data were only developed for identifying genes that are DE at one single time point, and cannot be used for pattern identification. Recently a few approaches were proposed to select genes that show overall difference in their expression profiles. Some approaches view the time-course gene expression data as vectors of correlated observations. Under this formulation, statistical methods in multivariate analysis and longitudinal data analysis were proposed to test the equivalence of two vector values; such methods include the F-statistic derived from an ANOVA analysis (Park et al. 2003; Ma, Zhong, and Liu 2009), the robust Wald statistic derived from a generalized estimating equation analysis (Guo et al. 2003), the moderated likelihood ratio statistic and Hotelling $T^2$-statistic derived from a multivariate empirical Bayes model (Tai and Speed 2006), and the maximum ratio statistic derived from a hierarchical Bayes model (Chi et al. 2007). Alternatively, the time-course data can be viewed as samples from an underlying continuous gene expression trajectory. Under this formulation, statistical methods in functional data analysis were developed to test the equivalence of two expression curves; such methods include the hierarchical modeling and basis expansion approaches considered by Luan and Li (2004), Storey et al. (2005), Hong and Li (2006), and Telesca et al. (2009). However, it is not clear how to combine the simultaneous inferences from single time point analyses for a joint set analysis. In addition, the overall profile analysis (global null test) is too coarse to handle the subtle patterns like (1)–(4).

© 2011 American Statistical Association
Journal of the American Statistical Association
March 2011, Vol. 106, No. 493, Applications and Case Studies
DOI: 10.1198/jasa.2011.ap09587

A useful framework is to conceptualize the gene selection problem as a set-wise multiple testing problem, where each time point of a particular gene gives rise to a hypothesis for testing DE versus EE and therefore each DE pattern corresponds to a set of hypotheses formed by combining the tests across all time points. Then a gene is claimed to be significant if a specific combination of the null hypotheses is rejected at the set level. However, several complicated issues arise in this testing framework: (i) optimality, that is, how to optimally combine testing results of multiple time points; (ii) multiplicity, that is, how to control testing errors (such as the false discovery rate) at the set level when thousands of genes are considered simultaneously; and (iii) dependency, that is, how to account for and exploit the high temporal correlation in the time-course data. Next we shall compare and review related works and then discuss a unified approach that addresses all three issues simultaneously.

Combining simultaneous tests in sets not only improves statistical power but also provides new scientific insights; successful applications include large epidemiological surveys (Zaykin et al. 2002), meta-analysis of microarray experiments (Pyne, Futcher, and Skiena 2006) and brain imaging studies (Benjamini and Hochberg 1995; Heller et al. 2006). The multiplicity issue in simultaneous set-wise tests was formally addressed in Benjamini and Heller (2008), where a threshold for Simes combined p-value is suggested based on the well-known BH procedure (Benjamini and Hochberg 1995). Although Benjamini–Heller’s procedure was suitable for many large-scale studies, the applicability of their approach is quite limited in time-course experiments. First, the combined p-value method can only test how many DE time points are in a set but cannot distinguish subtler differences among DE patterns (e.g., early response versus late response). Second, even in situations where the Benjamini–Heller procedure is applicable, it can be further improved by exploiting the temporal correlation.

One important feature of the time-course experiments is that if a gene is differentially expressed at one time point, it is very likely to remain differentially expressed at the next time point. This local dependency structure can be approximated by a hidden Markov model (HMM). Specifically, an HMM assumes that the temporal sequence of the underlying states (DE or EE) of a particular gene forms a Markov chain and the observed gene expression data are independent conditional on the hidden states. The HMM has been shown to be an effective tool for analyzing biological sequences and processes; see Churchill (1992), Krogh et al. (1994), MacDonald and Zucchini (1997), and Durbin et al. (1999), among others. Successful applications of HMM to MTC data include the clustering analysis of gene expression profiles (Schliep, Schonhuth, and Steinhoff 2003) and the significance analysis of differential expression under multiple biological conditions (Yuan and Kendziorski 2006). By utilizing the temporal dependency, the HMM approach leads to results with both increased statistical power and better scientific interpretations.

Sun and Cai (2009) proposed a data-driven procedure for testing HMM-dependent hypotheses and showed that the procedure is asymptotically optimal. However, their approach cannot be applied in time-course experiments. First, the optimality of multiple testing was only addressed for single parameter analysis, but a temporal pattern cannot be identified by simply combining the results from single time points. It is important to note that the set-wise error rate can be extremely high even if the pointwise error rate is low. Second, Sun and Cai (2009) considered a homogeneous HMM with a normal mixture as the observation distribution, whereas the time-course experiment requires different models and assumptions. Specifically, we need to consider thousands of short sequences instead of one long sequence, and it is desirable to use an inhomogeneous HMM to account for the changing transition probabilities over time. Moreover, a Gamma–Gamma hierarchical model is more appropriate than a normal mixture model for MTC data based on the works of Newton et al. (2001), Kendziorski et al. (2003), and Yuan and Kendziorski (2006). We shall develop a new testing procedure that overcomes these limitations.

A key issue in studying set-wise testing problems is the choice of the optimality criterion. Sun and Cai (2009) considered a criterion that maximizes the expected number of correct individual states given a constraint on the FDR, whereas in the pattern identification problem it is more desirable to define the optimality criterion at the sequence level. The problem of uncovering an “optimal” state sequence has been studied for an HMM; the most well-known procedure is the Viterbi algorithm (Viterbi 1967; Rabiner 1989). Yuan and Kendziorski (2006) considered this method in time-course experiments for estimating the most likely DE sequence configuration and then used the results for grouping similar genes. However, the estimated DE states are not suitable for gene selection problems because the Viterbi algorithm does not address the multiplicity issue in simultaneous inferences. Essentially, a desirable procedure should rank different genes according to a score measuring how likely each gene is to obey a specific DE pattern, so that a thresholding rule can be applied along the rankings to yield a gene subset for future investigation. However, across-gene comparison is not provided by the Viterbi algorithm since it only selects the most likely DE pattern for a single gene. Consequently, the resultant Type I error rate is independent of the prespecified test level and can be overly conservative or severely inflated. The numerical results in Section 3 will illustrate this point.

The goal of this article is to develop a unified framework to address the optimality, multiplicity, and dependency issues for set-wise tests simultaneously. Our methodological developments involve two steps. First, we derive an oracle procedure in a compound decision theoretic framework to test multiple dependent sets of hypotheses under a new optimality criterion. Our oracle procedure aims to minimize the missed set rate (MSR) subject to a constraint on the false set rate (FSR). The second step is to mimic the oracle and develop a data-driven procedure (GLIS) that is suitable for analysis of MTC experiments. Specifically, we consider the more appropriate inhomogeneous HMMs and the hierarchical Gamma–Gamma model. Our data-driven procedure, which simultaneously addresses all three issues, controls the multiplicity and is asymptotically optimal. The GLIS procedure is capable of testing delicate DE patterns that cannot be handled by conventional FDR procedures. Both theoretical and numerical results are presented to show that the GLIS procedure leads to (i) asymptotically valid error rate control, (ii) improved statistical power, and (iii) more informative scientific findings.

Figure 1. A comparison of the Benjamini–Heller (dotted), oracle (solid), and GLIS (dashed) procedures. Sensitivity = 1 − MSR (a measure of testing power). We can see that the performance of the oracle procedure is achieved by the GLIS procedure asymptotically. At the same FSR level, both the oracle and GLIS procedures identify more true nonnulls than the Benjamini–Heller procedure.

A comparison of the Benjamini–Heller procedure, the oracle procedure and the GLIS procedure for testing against the global null (i.e., no DE at all time points) is shown in Figure 1. We can see that both the oracle and GLIS procedures have much higher sensitivity than the Benjamini–Heller procedure at the same FSR level. In addition, the performance of the oracle procedure is achieved by the GLIS procedure asymptotically. The real data analysis in Section 4 also shows substantial advantages of our method over conventional methods.

The article is organized as follows. The data structure, model assumption and method are discussed in Section 2. Simulation studies are carried out in Section 3 to compare our procedure versus conventional approaches. The methods are illustrated in Section 4 in a study of the systemic inflammation in humans. Technical details on methodological developments and proofs of theorems are given in Section 5 and the Appendix, respectively.

2. SET-WISE TESTING IN TIME-COURSE EXPERIMENTS

In time-course experiments, the gene expression levels are measured longitudinally with two possible states at each time point: equally expressed (EE) and differentially expressed (DE). Consider $m$ sets of hypotheses: $\{(H_{i1}, \ldots, H_{iK}) : i = 1, \ldots, m\}$ for testing EE versus DE, where $m$ is the number of genes and $K$ is the number of time points. The problem of identifying genes with specific patterns involves the simultaneous testing of hypotheses at the set level. Typical questions considered by previous research include: (i) Are all $K$ hypotheses in the set true? (ii) Are all $K$ hypotheses in the set false? (iii) Are at least $u$ out of $K$ hypotheses in the set false? In Benjamini and Heller (2008), these questions are referred to as conjunction test, disjunction test, and partial conjunction test, respectively. The conjunction and disjunction tests can be viewed as special cases of the partial conjunction test. A major limitation of the above framework is that the partial conjunction test can only handle patterns defined in terms of the total number of nonnulls in a set. However, more complicated patterns that involve temporal ordering, such as “early response” and “late response,” cannot be distinguished from each other. Next we introduce a more general testing framework that effectively describes various temporal patterns by defining appropriate null parameter spaces.

2.1 Characterizing Patterns in a Multiple Testing Framework

Consider the unknown state sequence $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})$. The null space for testing against the conjunction of null hypotheses (i.e., a gene is EE at all time points) is

$$\Theta_0^{\mathrm{conj}} = \Big\{\eta = (\eta_1, \ldots, \eta_K) \in \{0,1\}^K : \sum_{k=1}^{K} \eta_k = 0\Big\}.$$

For a partial conjunction test that at least $u$ hypotheses in a set are false, the null parameter space is

$$\Theta_0^{\mathrm{pconj}} = \Big\{\eta \in \{0,1\}^K : \sum_{k=1}^{K} \eta_k < u\Big\}.$$

When temporal ordering is involved in a pattern, we need to go beyond the framework of partial conjunction test. Suppose the pattern of interest is “late response,” which means that there are responses but no response prior to time $k$. It is more convenient to state the testing problem in terms of the nonnull parameter space $\Theta_1$. Specifically, we define

$$\Theta_1^{\mathrm{late},k} = \Big\{\eta \in \{0,1\}^K : \sum_{i=1}^{k-1} \eta_i = 0 \text{ and } \sum_{i=k}^{K} \eta_i \ge 1\Big\}$$

as the nonnull parameter space for identifying late response genes. It is easy to see that the partial conjunction test is incapable of describing this pattern since it only has the information on the total number of nonnulls in a set; therefore the combined p-value method (Benjamini and Heller 2008) cannot be used for detecting the “late response” pattern.
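To make the parameter spaces concrete, the following is a minimal Python sketch that enumerates the conjunction null space, the partial conjunction null space, and the late-response nonnull space for a small number of time points. The function names are hypothetical and chosen only for illustration; they are not part of the article.

```python
from itertools import product


def all_sequences(K):
    """All 2^K binary state sequences (1 = DE, 0 = EE)."""
    return list(product((0, 1), repeat=K))


def conjunction_null(K):
    """Null space of the conjunction test: EE at every time point."""
    return {s for s in all_sequences(K) if sum(s) == 0}


def partial_conjunction_null(K, u):
    """Null space of the partial conjunction test: fewer than u DE time points."""
    return {s for s in all_sequences(K) if sum(s) < u}


def late_response_nonnull(K, k):
    """Nonnull space for 'late response' at time k (k is 1-indexed):
    no DE before time k and at least one DE from time k onward."""
    return {s for s in all_sequences(K)
            if sum(s[:k - 1]) == 0 and sum(s[k - 1:]) >= 1}


K = 3
late = late_response_nonnull(K, k=2)
print(sorted(late))  # [(0, 0, 1), (0, 1, 0), (0, 1, 1)]
```

Note that (1, 0, 0) and (0, 0, 1) contain the same number of DE time points, yet only the second belongs to the late-response space for k = 2. A rule based only on the count of nonnulls cannot separate them.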

The gene selection problem involves the simultaneous testing of $m$ sets of hypotheses; the inflation of Type I errors is a big issue. Similar to the ideas of false discovery rate (FDR, Benjamini and Hochberg 1995) and false nondiscovery rate (FNR, Genovese and Wasserman 2002), we define the false set rate (FSR) and missed set rate (MSR) to combine the errors in set-wise testing. The FSR and MSR will serve as the target for control and measure of power, respectively. Define a binary vector $\vartheta = (\vartheta_1, \ldots, \vartheta_m) \in \{0,1\}^m$, where

$$\vartheta_i = 0 \text{ if } \theta_i \in \Theta_0 \quad\text{and}\quad \vartheta_i = 1 \text{ otherwise.} \tag{2.1}$$

We are interested in testing $m$ new hypotheses: $H_{i0} : \theta_i \in \Theta_0$ versus $H_{i1} : \theta_i \in \Theta_1$, $i = 1, \ldots, m$, where gene $i$ is selected ($\delta_i = 1$) if the null hypothesis $H_{i0}$ is rejected at the set level, and $\delta_i = 0$ otherwise. The FSR and MSR are then defined as

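$$\mathrm{FSR} = E\left\{\frac{\sum_{i=1}^{m} (1 - \vartheta_i)\delta_i}{\left(\sum_{i=1}^{m} \delta_i\right) \vee 1}\right\} \quad\text{and}\quad \mathrm{MSR} = E\left\{\frac{\sum_{i=1}^{m} \vartheta_i (1 - \delta_i)}{\left(\sum_{i=1}^{m} \vartheta_i\right) \vee 1}\right\}, \tag{2.2}$$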

respectively. The FSR is the expected proportion of falsely rejected sets among all rejections and the MSR is the expected proportion of nonnull sets that are missed. Our ultimate goal is to find the “optimal” subset of genes for future experimental investigations. Under the multiple testing framework, optimality means that our procedure for subset construction controls the FSR at level α with the smallest MSR.
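The FSR and MSR are expectations, but the ratios inside the expectations can be computed for any single realization, for example in a simulation where the true states are known. The sketch below does this under the assumption that the indicator equals 1 for a nonnull set and the decision equals 1 for a rejected set; the function names and the toy vectors are hypothetical.

```python
def false_set_proportion(vartheta, delta):
    """Falsely rejected sets divided by max(number of rejections, 1)."""
    false_rejections = sum((1 - v) * d for v, d in zip(vartheta, delta))
    return false_rejections / max(sum(delta), 1)


def missed_set_proportion(vartheta, delta):
    """Missed nonnull sets divided by max(number of nonnull sets, 1)."""
    missed = sum(v * (1 - d) for v, d in zip(vartheta, delta))
    return missed / max(sum(vartheta), 1)


# Toy realization: 1 = nonnull set (vartheta) or rejected set (delta).
vartheta = [1, 1, 0, 0, 1, 0]
delta = [1, 0, 1, 0, 1, 0]

fsp = false_set_proportion(vartheta, delta)   # one false rejection out of three
msp = missed_set_proportion(vartheta, delta)  # one missed set out of three nonnulls
print(fsp, msp)
```

Averaging these two proportions over many simulated data sets gives Monte Carlo estimates of the FSR and MSR.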

An efficient procedure should be developed based on a statistical model that well describes the data. The issues regarding the data structure, model and assumptions in time-course experiments will be discussed in the next section. An asymptotically optimal data-driven procedure for pattern identification is then introduced in Section 2.3.

2.2 HMM and Gamma–Gamma Model for Time-Course Data

Hidden Markov models have been successfully applied for analysis of time-course data (Schliep, Schonhuth, and Steinhoff 2003; Yuan and Kendziorski 2006). Let $e_{ik} = (x_{ik}, y_{ik})$ denote the observed expression levels of gene $i$ at time $k$, $i = 1, \ldots, m$; $k = 1, \ldots, K$. Here $x_{ik} = (x_{ik1}, \ldots, x_{ikn_1})$ and $y_{ik} = (y_{ik1}, \ldots, y_{ikn_2})$ are expression data for $n_1$ and $n_2$ replicated measurements under two biological conditions X and Y, respectively. The state sequence of gene $i$ over time is a binary vector $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})$, where $\theta_{ik} = 1$ indicates that gene $i$ at time $k$ is DE and $\theta_{ik} = 0$ otherwise. Assume that

$$\theta_i \text{ is distributed as a Markov chain and} \tag{2.3}$$

$$\theta_i \text{ and } \theta_j,\ i \neq j, \text{ are independent.} \tag{2.4}$$

Let $\{\pi^i_j = P(\theta_{i1} = j) : i = 1, \ldots, m;\ j = 0, 1\}$ be the initial probabilities. Our real data analysis suggests that the HMMs are inhomogeneous. Therefore we allow the transition probabilities $a^{ik}_{jl} = P(\theta_{i,k} = l \mid \theta_{i,k-1} = j)$, $j, l \in \{0, 1\}$, to depend on time $k$, but assume that $\pi^i_j$ and $a^{ik}_{jl}$ are the same for all genes (hence $i$ will be suppressed for transition probabilities).
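As an illustration of the state model only, the following sketch simulates one hidden DE/EE sequence from an inhomogeneous two-state Markov chain. The function name, the argument layout (a list of K − 1 transition matrices indexed as `A[k][j][l]`), and all numeric values are assumptions made for this example; they are not estimates from the article.

```python
import random


def simulate_states(pi, A, rng):
    """Simulate one state sequence of length len(A) + 1.

    pi : (P(state = 0), P(state = 1)) at the first time point.
    A  : list of 2x2 matrices; A[k][j][l] = P(next = l | current = j)
         for the k-th transition, so the chain may change over time.
    rng: a random.Random instance.
    """
    state = 1 if rng.random() < pi[1] else 0
    states = [state]
    for A_k in A:
        state = 1 if rng.random() < A_k[state][1] else 0
        states.append(state)
    return tuple(states)


# Hypothetical parameters for K = 4 time points (three transitions).
pi = (0.9, 0.1)
A = [
    [[0.95, 0.05], [0.20, 0.80]],
    [[0.90, 0.10], [0.10, 0.90]],
    [[0.85, 0.15], [0.05, 0.95]],
]
rng = random.Random(2011)
# Genes are simulated independently, in line with assumption (2.4).
sequences = [simulate_states(pi, A, rng) for _ in range(5)]
print(sequences)
```

Because the diagonal entries of each matrix are large, a gene that is DE at one time point tends to stay DE at the next, which is the local dependency the HMM is meant to capture.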

A gene under one specific condition has a latent mean expression level $\mu^x$ or $\mu^y$. To take advantage of information sharing, it is assumed that the latent mean expression levels follow a common genome-wide distribution. Following Yuan and Kendziorski (2006), we assume that at time $k$, the observed expression levels of gene $i$ under condition X are $n_1$ independent samples from the conditional distribution $h_k(\cdot \mid \mu^x_{ik})$, where $\mu^x_{ik}$ is the mean expression level of gene $i$, and $\{\mu^x_{ik} : i = 1, \ldots, m\}$ follow a common distribution $G_k(\cdot)$. Similarly, the expression data under condition Y are $n_2$ independent samples from $h_k(\cdot \mid \mu^y_{ik})$, and $\mu^y_{ik}$ follows the same genome-wide distribution $G_k(\cdot)$. When it is EE ($\theta_{ik} = 0$), we have $\mu^x_{ik} = \mu^y_{ik} = \mu_{ik}$; hence the conditional density of $e_{ik} = (x_{ik}, y_{ik})$ is

$$f(e_{ik} \mid \mu_{ik}, \theta_{ik} = 0) = \prod_{j=1}^{n_1+n_2} h_k(e_{ikj} \mid \mu_{ik}). \tag{2.5}$$

Alternatively, when it is DE ($\theta_{ik} = 1$) we have $\mu^x_{ik} \neq \mu^y_{ik}$. We sample $\mu^x_{ik}$ and $\mu^y_{ik}$ independently from $G_k(\cdot)$, then we have

$$f(e_{ik} \mid \mu^x_{ik}, \mu^y_{ik}, \theta_{ik} = 1) = \prod_{j=1}^{n_1} h_k(x_{ikj} \mid \mu^x_{ik}) \prod_{l=1}^{n_2} h_k(y_{ikl} \mid \mu^y_{ik}). \tag{2.6}$$

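Let $p_k$ be the proportion of DE genes at time $k$, then the marginal density of $e_{ik}$ is $f_k(e_{ik}) = (1 - p_k) f_{k0}(e_{ik}) + p_k f_{k1}(e_{ik})$, where

$$f_{k0}(e_{ik}) = \int \prod_{j=1}^{n_1+n_2} h_k(e_{ikj} \mid \mu_{ik})\, dG_k(\mu_{ik})$$

and

$$f_{k1}(e_{ik}) = \int \prod_{j=1}^{n_1} h_k(x_{ikj} \mid \mu^x_{ik})\, dG_k(\mu^x_{ik}) \cdot \int \prod_{l=1}^{n_2} h_k(y_{ikl} \mid \mu^y_{ik})\, dG_k(\mu^y_{ik})$$

are the marginal densities under EE and DE, respectively. We further assume that the $e_{ik}$’s are conditionally independent:

$$f(e_{i1}, \ldots, e_{iK} \mid \theta_i) = \prod_{k=1}^{K} f(e_{ik} \mid \theta_{ik}). \tag{2.7}$$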

We call (2.3)–(2.7) a hidden Markov model (HMM) for time-course data. Denote by $A_k = \{a^k_{jl} : j, l = 0, 1\}$ the transition matrix at time $k$, $k = 1, \ldots, K - 1$, $\pi = \{\pi_0, \pi_1\}$ the initial probability distribution, $F_k = \{h_k, G_k\}$ the observation distributions, and $(\pi, A_1, \ldots, A_{K-1}, F_1, \ldots, F_K)$ the collection of all HMM parameters.

Choosing appropriate observation distribution $h_k(\cdot)$ and genome-wide distribution $G_k(\cdot)$ is another important issue. The Gamma–Gamma (GG) model has been shown to be a useful model that maintains most inherent characteristics of the microarray data (Newton et al. 2001; Kendziorski et al. 2003). Specifically in time-course experiments, a GG model assumes that $h_k(\cdot \mid \mu_{ik})$ ($\mu_{ik}$ may be taken as $\mu^x_{ik}$ or $\mu^y_{ik}$) is a Gamma density function with shape parameter $\alpha_k > 0$ and rate parameter $\lambda_{ik} = \alpha_k / \mu_{ik}$. Fixing $\alpha_k$, $\lambda_{ik}$ is assumed to follow another Gamma distribution $G_k(\alpha_{0k}, \nu_k)$, where $\alpha_{0k}$ and $\nu_k$ are the shape parameter and rate parameter, respectively. Then after integrating out $\mu_{ik}$, we can obtain explicit forms for $f_k(\cdot)$ with unknown parameters $(\alpha_k, \alpha_{0k}, \nu_k)$, which can be estimated using the EM algorithm. The GG model will be used in both our simulation studies and data analysis. See Newton et al. (2001) and Kendziorski et al. (2003) for more details about this model.
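The explicit marginal densities mentioned above can be sketched as follows. The closed form used here is an assumption of this example: it follows from the standard Gamma-Gamma conjugacy (n observations with Gamma shape α and rate λ, and λ itself Gamma with shape α0 and rate ν) and was derived for this illustration rather than copied from the article. The function names and parameter values are hypothetical.

```python
from math import exp, lgamma, log


def log_marginal_gg(obs, alpha, alpha0, nu):
    """Log marginal density of obs after integrating out the latent rate.

    Assumes obs_j | lam ~ Gamma(shape=alpha, rate=lam) independently and
    lam ~ Gamma(shape=alpha0, rate=nu).
    """
    n = len(obs)
    return ((alpha - 1) * sum(log(x) for x in obs)
            - n * lgamma(alpha)
            + alpha0 * log(nu) - lgamma(alpha0)
            + lgamma(n * alpha + alpha0)
            - (n * alpha + alpha0) * log(nu + sum(obs)))


def log_f0(x, y, alpha, alpha0, nu):
    """EE: all n1 + n2 observations share one latent rate."""
    return log_marginal_gg(list(x) + list(y), alpha, alpha0, nu)


def log_f1(x, y, alpha, alpha0, nu):
    """DE: the two conditions have independent latent rates."""
    return (log_marginal_gg(list(x), alpha, alpha0, nu)
            + log_marginal_gg(list(y), alpha, alpha0, nu))


# Sanity check against a case that can be integrated by hand:
# alpha = alpha0 = nu = 1 and one observation x gives 1 / (1 + x)^2.
print(exp(log_marginal_gg([1.0], 1.0, 1.0, 1.0)))  # 0.25 up to rounding

# Hypothetical replicate data for one gene at one time point.
x = [102.0, 95.0, 110.0]
y = [310.0, 290.0, 305.0]
print(log_f0(x, y, 10.0, 1.0, 0.5), log_f1(x, y, 10.0, 1.0, 0.5))
```

Working on the log scale avoids overflow in the Gamma functions when the number of replicates or the shape parameter is large.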


2.3 The Proposed Procedure for Pattern Identification

In pattern identification problems, the goal is to find a subset of genes which exhibit the temporal pattern of interest. After an HMM-GG model is fitted, the key issue is how to draw inferences based on the fitted model. We solve the problem in two steps: first rank all genes based on the likelihood of obeying a specific temporal pattern, then choose a threshold along the rankings. In a multiple testing framework, the goal in the ranking step is to derive a test statistic $T$ to reflect the relative importance of each gene, and the goal in the thresholding step is to decide a critical value for $T$ to control the false set rate.

All HMM parameters are replaced by their estimates. In Section 5, we show in a compound decision-theoretic framework that a desirable statistic for ranking is the generalized local index of significance (GLIS):

$$\mathrm{GLIS}_i = P(\vartheta_i = 0 \mid e_i) = \sum_{s = (s_1, \ldots, s_K) \in \Theta_0} P(\theta_i = s \mid e_i), \tag{2.8}$$

where the probabilities are evaluated under the estimated HMM parameters. The GLIS statistic is asymptotically optimal for set-wise multiple testing when the parameter estimate is consistent. Here $\vartheta_i$ is a binary random variable indicating whether a gene is interesting, $s$ is a $K$-dimensional binary vector and $\Theta_0$ is the null parameter space. The GLIS statistic can be interpreted as the probability of a gene being null (i.e., not exhibiting a pattern of interest) given the observed expression data. Compared to the combined p-value which can only be used for testing the total number of nonnulls in a set, the GLIS statistic can be used to screen for various complicated patterns. In addition, the GLIS procedure, combined with the HMM-GG model, can exploit the temporal correlation structure and borrow information across genes; hence it produces more accurate and interpretable results than conventional methods.
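For a short time course, the sum in (2.8) can be evaluated by brute force. The sketch below enumerates all binary sequences for one gene, weights each by the initial probability, the time-varying transition probabilities, and the per-time-point marginal densities, and then sums the posterior mass over the null space. This is an illustration under stated assumptions, not the article's own computing formula: the enumeration costs O(2^K) and is practical only for small K (a forward-backward recursion would scale better), and all names and numbers are hypothetical.

```python
from itertools import product


def sequence_posterior(pi, A, f0, f1):
    """Posterior P(theta = s | e) for every binary sequence s of one gene.

    pi     : (P(state = 0), P(state = 1)) at the first time point.
    A      : K - 1 matrices with A[k][j][l] = P(next = l | current = j).
    f0, f1 : length-K lists; marginal density of the gene's data at each
             time point under EE (f0) and under DE (f1).
    """
    K = len(f0)
    joint = {}
    for s in product((0, 1), repeat=K):
        p = pi[s[0]]
        for k in range(K - 1):
            p *= A[k][s[k]][s[k + 1]]
        for k in range(K):
            p *= f1[k] if s[k] == 1 else f0[k]
        joint[s] = p
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}


def glis(posterior, null_space):
    """Posterior probability that the state sequence lies in the null space."""
    return sum(p for s, p in posterior.items() if s in null_space)


# Hypothetical inputs for K = 3: the data favor DE at times 2 and 3.
pi = (0.9, 0.1)
A = [[[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.1], [0.2, 0.8]]]
f0 = [1.0, 0.2, 0.1]
f1 = [0.5, 1.0, 1.0]
post = sequence_posterior(pi, A, f0, f1)

# "Late response at time 2": EE at time 1 and at least one DE afterwards.
late_nonnull = {s for s in post if s[0] == 0 and sum(s[1:]) >= 1}
late_null = set(post) - late_nonnull
print(glis(post, late_null))
```

A small GLIS value indicates strong evidence that the gene follows the pattern of interest, so genes are ranked from the smallest value upward.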

In Sun and Cai (2009), it was shown that the local index of significance (LIS) is the optimal statistic for testing hypotheses arising from an HMM. However, the LIS statistic can only be used for testing a single sequence of correlated hypotheses. Our problem is different since we have thousands of sequences, and we are interested in set-wise analysis instead of point-wise analysis. The optimality criteria of the two problems are also different in the sense that LIS maximizes the number of true individual point discoveries whereas GLIS maximizes the number of true set discoveries. In addition, the LIS procedure, which is designed for controlling the FDR, in general leads to an inflated FSR level.

To control the FSR, we determine a data-driven threshold for GLIS using the following step-up procedure. First we rank the GLIS values in ascending order as $\mathrm{GLIS}_{(1)}, \ldots, \mathrm{GLIS}_{(m)}$, and denote by $H_{(1)}, \ldots, H_{(m)}$ the corresponding hypotheses; then we choose the largest $i$ such that $(1/i)\sum_{j=1}^{i} \mathrm{GLIS}_{(j)} \le \alpha$ and reject $H_{(1)}, \ldots, H_{(i)}$. In Theorem 4 we show that the data-driven procedure controls the FSR at the nominal level and is asymptotically optimal. The key difference between the GLIS procedure and conventional methods is in the gene rankings. By optimally combining testing results from individual time points and exploiting the HMM dependency, the GLIS rankings are more efficient than the rankings by combined p-values; this is illustrated in the simulation studies conducted in Section 3.
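The step-up rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `glis_step_up` and the list-based interface are assumptions, and the GLIS values are taken as already computed.

```python
def glis_step_up(glis, alpha):
    """Return the indices of the rejected hypotheses (hypothetical helper).

    glis  : list of GLIS values, one per gene (posterior null probabilities)
    alpha : nominal FSR level
    """
    # Rank genes by GLIS in ascending order (most promising genes first).
    order = sorted(range(len(glis)), key=lambda i: glis[i])
    running_sum = 0.0
    cutoff = 0  # number of rejections; 0 means no hypothesis is rejected
    for rank, idx in enumerate(order, start=1):
        running_sum += glis[idx]
        # Keep the largest rank whose running average is at most alpha.
        if running_sum / rank <= alpha:
            cutoff = rank
    return order[:cutoff]


# Toy values chosen for illustration only.
rejected = glis_step_up([0.01, 0.9, 0.05, 0.2, 0.02], alpha=0.1)
```

Because the values are averaged in ascending order, the running average never decreases, so the loop could also stop at the first rank that exceeds `alpha`.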

The implementation of the GLIS procedure is simple and fast. The maximum likelihood estimate (MLE) can be used to estimate $\Psi$. Under certain regularity conditions, the MLE is strongly consistent and asymptotically normal (Leroux 1992; Bickel, Ritov, and Rydén 1998). The MLE can be obtained using the EM algorithm or other numerical optimization schemes. The EM algorithms for the Gamma–Gamma model and the normal mixture model were provided, for example, in Yuan and Kendziorski (2006) and Sun and Cai (2009), respectively. Given the HMM parameters $\Psi$, the GLIS statistic can be calculated as

$$\mathrm{GLIS}_i = \frac{\sum_{s\in\Omega_0} \pi_{s_1} f(e_{i1}\mid s_1) \prod_{j=2}^{K} a_{s_{j-1}s_j} f(e_{ij}\mid s_j)}{\alpha_{i,K}(0) + \alpha_{i,K}(1)},$$

where $\alpha_{i,k}(j) = P_{\Psi}[(e_{it})_{t=1}^{k}, \theta_{ik} = j]$ is the forward variable that can be calculated using the forward–backward procedure (Baum et al. 1970; Rabiner 1989). Specifically, let $\alpha_{i,1}(j) = \pi_j f_{j1}(e_{i1})$; then by induction we have $\alpha_{i,k+1}(j) = \{\sum_{l=0}^{1} \alpha_{i,k}(l)\, a^{k}_{lj}\} f_{j,k+1}(e_{i,k+1})$.
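The ratio above can be sketched as follows for a single gene. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the emission likelihoods `lik[k][s]` (the value of the density of the observation at time point k under state s) are taken as given, since they depend on the fitted sampling model defined elsewhere in the paper.

```python
from itertools import product


def forward_marginal(pi, trans, lik):
    """Sum of the final forward variables, i.e., the likelihood of the data.

    pi    : initial state probabilities
    trans : list of K-1 transition matrices, trans[k][l][j]
    lik   : lik[k][s], emission likelihood at time point k under state s
    """
    n_states = len(pi)
    alpha = [pi[j] * lik[0][j] for j in range(n_states)]
    for k in range(1, len(lik)):
        # Forward recursion: sum over the previous state, then emit.
        alpha = [
            sum(alpha[l] * trans[k - 1][l][j] for l in range(n_states)) * lik[k][j]
            for j in range(n_states)
        ]
    return sum(alpha)


def path_joint(path, pi, trans, lik):
    """Joint probability of one state sequence and the observed data."""
    p = pi[path[0]] * lik[0][path[0]]
    for k in range(1, len(path)):
        p *= trans[k - 1][path[k - 1]][path[k]] * lik[k][path[k]]
    return p


def glis(null_set, pi, trans, lik):
    """Posterior probability that the state sequence lies in the null set."""
    numerator = sum(path_joint(s, pi, trans, lik) for s in null_set)
    return numerator / forward_marginal(pi, trans, lik)


# Toy inputs for K = 3 time points (all numbers are made up).
pi = [0.95, 0.05]
trans = [[[0.95, 0.05], [0.2, 0.8]]] * 2
lik = [[1.0, 0.5], [0.2, 2.0], [0.3, 1.5]]
glis_conjunction = glis([(0, 0, 0)], pi, trans, lik)
```

Enumerating the null set costs time proportional to its size, which is manageable for small K such as the six time points used later in the paper.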

2.4 Mixed Directional Errors

In this section, we discuss an extension of our GLIS procedure to deal with mixed directional errors. The issue was previously studied by Finner (1999) and Benjamini and Yekutieli (2005), and recently raised by Guo, Sarkar, and Peddada (2010) in the context of MTC experiments. Guo, Sarkar, and Peddada (2010) considered the identification of specific expression patterns over ordered categories under one biological condition. The concern is that, in addition to the usual Type I errors, directional or sign errors can occur for rejected hypotheses. To resolve the issue, Guo, Sarkar, and Peddada (2010) extended the BH step-up procedure (Benjamini and Hochberg 1995) to control the so-called mixed directional FDR (mdFDR) of ordered patterns.

Similar issues remain for MTC experiments under two biological conditions. Define $\Delta = \mu_x - \mu_y$ to be the difference of gene expression levels. It is generally accepted that DE genes with time-condition interactions (i.e., both $\mu_x < \mu_y$ and $\mu_x > \mu_y$ occur for the same gene over the experimental period) are more biologically meaningful. In contrast, the patterns where the expression levels of one condition are dominant at all time points are of less interest because they are likely to be caused by the baseline differences between the case and control groups (Ma, Zhong, and Liu 2009). The identification of these patterns involves inference about the sign of $\Delta$; hence additional directional errors can occur.

The pattern identification problem is essentially a set-wise testing problem. From a biological point of view, the FSR is a more appealing concept than the mdFDR in Guo et al. (2010) for inference of patterns since scientific discoveries are usually made at the gene (or set) level, and specific patterns are more informative than the individual states to reveal the underlying biological mechanisms. It is natural to consider multiple testing procedures that aim to control the number of misspecified patterns. The concept of mdFDR is only useful for point-wise analysis. In particular, the FSR can be very high even if the mdFDR is controlled at the nominal level. Consider a situation where we make 20 errors out of 40 selected genes over 5 time points. In a point-wise analysis, the proportion of false positives


78 Journal of the American Statistical Association, March 2011

is 0.10; however, in a set-wise analysis, the proportion of false sets can be as high as 0.5 if the 20 errors occur at exactly 20 different genes.
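The arithmetic behind this example can be checked directly. The helper below is a hypothetical illustration; it assumes the worst case in which every error falls on a different gene.

```python
def error_proportions(n_errors, n_genes, n_time_points):
    """Point-wise error proportion and worst-case set-wise error proportion."""
    # Point-wise: errors are counted against all gene-by-time decisions.
    pointwise = n_errors / (n_genes * n_time_points)
    # Set-wise worst case: each error spoils a different selected gene.
    setwise_worst = min(n_errors, n_genes) / n_genes
    return pointwise, setwise_worst


print(error_proportions(20, 40, 5))  # prints (0.1, 0.5)
```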

Dealing with mixed directional errors in a conventional testing framework is complicated. When a set-wise analysis is of primary interest, we may circumvent the complication by considering a three-state HMM. The key observation is that the mixed directional errors are the consequence of using a binary decision rule to deal with a trichotomous situation. Therefore it is natural to define a multinomial variable

$$\theta_{ik} = \begin{cases} -1, & \text{if } \mu^{x}_{ik} < \mu^{y}_{ik} \text{ or } \Delta_{ik} < 0, \\ 0, & \text{if } \mu^{x}_{ik} = \mu^{y}_{ik} \text{ or } \Delta_{ik} = 0, \\ 1, & \text{if } \mu^{x}_{ik} > \mu^{y}_{ik} \text{ or } \Delta_{ik} > 0 \end{cases} \qquad (2.9)$$

for our analysis. Let $\theta_i = (\theta_{i1}, \ldots, \theta_{iK}) \in \{-1,0,1\}^K$ be the sequence of unknown states over time. Then for a pattern of interest, we may use a new binary variable to combine the individual states in a set as before:

$$\vartheta_i = 0 \text{ if } \theta_i \in \Omega_0 \text{ and } \vartheta_i = 1 \text{ otherwise}. \qquad (2.10)$$

To identify genes that interact with time, we may define the null parameter space $\Omega_0^{\mathrm{interact}} = \{\eta \in \{-1,0,1\}^K : |\sum_i \eta_i| = \sum_i |\eta_i|\}$. The expression of GLIS shall remain the same as in (2.8) for appropriately defined $\theta_{ik}$'s and $\Omega_0$, and the same procedure can be applied for determining an appropriate cutoff for GLIS to control the FSR. By extending our two-state HMM to a three-state HMM, the issue of mixed directional errors no longer exists.
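The condition defining the interaction null space says that the nonzero states of a sequence never change sign. A small enumeration makes this concrete; the function name below is an illustrative assumption.

```python
from itertools import product


def interaction_null_space(K):
    """Enumerate sequences in {-1, 0, 1}^K whose nonzero states share one sign."""
    return [
        eta
        for eta in product((-1, 0, 1), repeat=K)
        # |sum| equals the sum of absolute values only if no cancellation occurs.
        if abs(sum(eta)) == sum(abs(x) for x in eta)
    ]


null_space = interaction_null_space(3)
# A sequence with both signs, such as (1, 0, -1), is nonnull.
is_interacting = (1, 0, -1) not in null_space
```

For a general K this null space has $2^{K+1} - 1$ elements: $2^K$ sequences without a negative state, $2^K$ without a positive state, and the all-zero sequence counted once.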

The forward–backward procedure and EM algorithm are easy to implement for a three-state HMM. However, a challenging issue is to specify an appropriate hierarchical model for the three-state HMM to replace (2.5)–(2.6). Specifically, the extension of the GG model to reflect both the sign and magnitude of the differences in expression levels is non-trivial. Finally, our testing framework is designed for set-wise inference, and it is interesting to extend it to consider mixed directional errors for pointwise testing as formulated by Guo, Sarkar, and Peddada (2010). Much research is needed under the formulation that involves mixed directional errors; in particular the dependency and optimality issues seem to be complicated. We leave these important topics for future research.

3. SIMULATION STUDIES

In this section, we first introduce some alternative approaches for pattern identification and then compare their performances with that of our method.

3.1 Alternative Methods

The first alternative method is the well-known Viterbi algorithm, which aims to estimate the most probable state sequence of a gene given the observed data. It seems natural to select the genes whose most probable configuration suggested by the Viterbi algorithm obeys the pattern of interest. Specifically, we calculate the most probable state sequence $\hat{\theta}_i$ for gene $i$, $i = 1, \ldots, m$; then set $\delta_i = 1$ if $\hat{\theta}_i \in \Omega_1$, and $\delta_i = 0$ otherwise. However, the Viterbi algorithm does not address the multiplicity issue in simultaneous inferences because it was only developed for selecting a pattern for a single gene, whereas in gene selection problems, across-gene comparison is needed. For example, it is not clear how to select the "top 10" genes based on the results obtained from the Viterbi algorithm. We shall see that the Type I error rate of the Viterbi algorithm is independent of any prespecified test level and hence can be either too conservative or too liberal.
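The Viterbi-based decision rule can be sketched as follows. The function names are hypothetical, and the emission likelihoods `lik[k][s]` are assumed to be supplied by a fitted model; note that no test level enters the rule, which is the limitation discussed above.

```python
def viterbi_path(pi, trans, lik):
    """Most probable state sequence given emission likelihoods lik[k][s]."""
    n = len(pi)
    score = [pi[j] * lik[0][j] for j in range(n)]
    back = []  # back[k-1][j]: best previous state when time point k is in state j
    for k in range(1, len(lik)):
        step_back, new_score = [], []
        for j in range(n):
            best_l = max(range(n), key=lambda l: score[l] * trans[k - 1][l][j])
            step_back.append(best_l)
            new_score.append(score[best_l] * trans[k - 1][best_l][j] * lik[k][j])
        back.append(step_back)
        score = new_score
    # Trace the best final state back to the first time point.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for step_back in reversed(back):
        state = step_back[state]
        path.append(state)
    return tuple(reversed(path))


def viterbi_decision(pi, trans, lik, nonnull_set):
    """Return 1 if the Viterbi path shows the pattern of interest, else 0."""
    return 1 if viterbi_path(pi, trans, lik) in nonnull_set else 0


# Toy inputs for K = 3 time points (all numbers are made up).
pi = [0.95, 0.05]
trans = [[[0.95, 0.05], [0.2, 0.8]]] * 2
lik = [[1.0, 0.5], [0.2, 2.0], [0.3, 1.5]]
best_path = viterbi_path(pi, trans, lik)
```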

The second approach is based on thresholding the combined p-values (Benjamini and Heller 2008). This procedure addresses the multiplicity issue in simultaneous set-wise inferences but is limited in its applicability and power. Let $p^{i}_{(1)}, \ldots, p^{i}_{(K)}$ be the ordered p-values from the $i$th set. Denote by $H^{i}_{u/K}$ the partial conjunction test that at least $u$ out of $K$ hypotheses in set $i$ are false. The Simes' p-value $p^{i}_{u/K} = \min\{\frac{K-u+1}{j}\, p^{i}_{(u-1+j)} : j = 1, \ldots, K-u+1\}$ can be used to summarize the $K$ p-values from set $i$ into a single index. The FDR procedure in Benjamini and Hochberg (1995) was then applied to the ordered Simes' p-values:

$$\text{let } l = \max\{i : p^{(i)}_{u/K} \le i\alpha/m\}, \text{ then reject } H^{(i)}_{u/K},\ i = 1, \ldots, l. \qquad (3.1)$$

Benjamini and Heller (2008) showed that the procedure (3.1), referred to as the BH–Simes procedure, controls the FSR at the nominal level $\alpha$. In addition, it was shown that the BH–Simes procedure is still valid under different dependency assumptions in the sense that it controls the FSR at level $\alpha$. When a specific temporal pattern can be described as conjunction or partial conjunction tests, the BH–Simes procedure may be applied. Specifically, let $t_{ik}$ be the two-sample t-statistic of gene $i$ at time $k$ for comparing two biological conditions; then $t_{ik}$ can be converted to a p-value using the transformation $p_{ik} = 2F(-|t_{ik}|)$, where $F$ is the cdf of the t-variable. We can first obtain the Simes' p-value $p^{i}_{u/K}$ for gene $i$, then apply the BH procedure. However, the BH–Simes procedure is only applicable to partial conjunction tests and is incapable of dealing with patterns that involve temporal ordering such as "early or late response." Moreover, the rankings by combined p-values can be much improved by the rankings of GLIS statistics, which exploit the dependency structure.
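The two stages of BH–Simes can be sketched as below, assuming the per-time-point p-values are already available. The function names are illustrative and do not refer to any published software.

```python
def simes_partial_conjunction(pvals, u):
    """Simes p-value for testing that at least u of the K hypotheses are false."""
    K = len(pvals)
    ordered = sorted(pvals)
    n = K - u + 1
    # ordered[u - 2 + j] is the (u - 1 + j)-th smallest p-value (1-based).
    return min(n / j * ordered[u - 2 + j] for j in range(1, n + 1))


def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: indices of the rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below the BH line.
        if pvals[idx] <= rank * alpha / m:
            cutoff = rank
    return sorted(order[:cutoff])


# One toy set of K = 4 p-values, combined for u = 2, then BH across toy sets.
combined = simes_partial_conjunction([0.01, 0.04, 0.03, 0.5], u=2)
selected = bh_reject([0.001, 0.5, 0.02, 0.04], alpha=0.1)
```

With `u = 1` the first function reduces to the usual Simes combination for the conjunction test, and with `u = K` it returns the largest p-value in the set.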

In the next section, we design and conduct three simulation studies to compare the numerical performances of Viterbi, BH–Simes and GLIS. In Section 5, we also derive an "oracle" procedure that assumes all parameters are known. The oracle procedure, which is included in the comparison as well, provides a benchmark for defining optimality and comparing different procedures. We will see that the performance of the oracle procedure is asymptotically achieved by the GLIS procedure. We comment here that the simulation results show that GLIS is more powerful than BH–Simes and Viterbi. Another advantage is that GLIS is more flexible and can be used to test a variety of complicated patterns.

3.2 Simulation Results

Simulation Studies 1 and 2 consider conjunction and partial conjunction tests, respectively. To provide insights into the superiority of our procedure, we investigate the ranking efficiencies of GLIS versus BH–Simes in Simulation Study 3. To the best of our knowledge, no other multiple testing procedures can be used to screen for the temporal patterns (1)–(4) discussed in Section 1. Therefore we only illustrate how to implement the GLIS procedure for identifying such delicate patterns in Section 4 without doing comparisons.

In all simulations, the numbers of subjects under both conditions are $n_1 = n_2 = 10$, the number of genes is $m = 2000$, the number of time points measured for each gene is $K = 6$ and the number of replications is 200. The nominal FSR level is 0.1. For a given gene $i$, a Markov chain $\theta_i = (\theta_{ik})_{k=1}^{K}$ is first generated with initial state distribution $\pi = (0.95, 0.05)$ and transition matrix $A_k = (a_{00}, 1-a_{00}; 1-a_{11}, a_{11})$, $k = 1, \ldots, 5$. The observed data $e_{ik}$ are then generated conditionally on $(\theta_{ik})_{k=1}^{K}$ with GG parameters $(\alpha_k, \alpha_{0k}, \nu_k)$. We shall specify $A_k$ and the GG parameters later in each simulation study.
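The first stage of this data-generating scheme, drawing the hidden state sequence of one gene, can be sketched as follows. Only the Markov chain is simulated here; generating the expression data would additionally require the Gamma–Gamma sampling model defined elsewhere in the paper. The function name and the use of Python's `random.Random` are assumptions made for illustration.

```python
import random


def simulate_states(pi, trans, K, rng):
    """Draw one two-state Markov chain of length K.

    pi    : initial distribution (P(state 0), P(state 1))
    trans : list of K-1 transition matrices, trans[k][l][j]
    rng   : a random.Random instance, passed in for reproducibility
    """
    state = 0 if rng.random() < pi[0] else 1
    states = [state]
    for k in range(K - 1):
        row = trans[k][state]
        # Move to state 0 with probability row[0], otherwise to state 1.
        state = 0 if rng.random() < row[0] else 1
        states.append(state)
    return states


rng = random.Random(2011)  # arbitrary seed
A = [[0.95, 0.05], [0.2, 0.8]]  # one of the transition matrices used in the text
theta = [simulate_states((0.95, 0.05), [A] * 5, 6, rng) for _ in range(2000)]
```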

Simulation Study 1. This simulation study compares the efficiency of Viterbi, BH–Simes, and GLIS for conjunction tests, where a rejection is made at the set level if any hypothesis in a set is false. The simulation results are summarized in Figure 2. In the top row, we choose $A_k = (0.95, 0.05; 1-a_{11}, a_{11})$, $(\alpha_k, \alpha_{0k}, \nu_k) = (10, 1, 0.5)$, $k = 1, \ldots, 5$, and then apply the Viterbi algorithm, the BH–Simes procedure, the oracle procedure, and the data-driven GLIS procedure to the simulated data. The FSR and MSR levels are plotted as functions of $a_{11}$. In the bottom row, we choose $A_k = (0.95, 0.05; 0.2, 0.8)$ and $(\alpha_k, \alpha_{0k}, \nu_k) = (\alpha, 1, 0.5)$, $k = 1, \ldots, 6$. The FSR and MSR levels are plotted as functions of $\alpha$.

Simulation Study 2. This simulation study compares the efficiencies of different FSR procedures for partial conjunction tests, where a rejection is made at the set level if the number of false hypotheses in a set is greater than or equal to 3 (the total number of time points is 6). The simulation results are summarized in Figure 3. In the top row, we choose $A_k = (0.95, 0.05; 1-a_{11}, a_{11})$ and $(\alpha_k, \alpha_{0k}, \nu_k) = (10, 1, 0.5)$, $k = 1, \ldots, 6$. In the bottom row, we choose $A_k = (0.95, 0.05; 0.2, 0.8)$ and $(\alpha_k, \alpha_{0k}, \nu_k) = (\alpha, 1, 0.5)$, $k = 1, \ldots, 6$. The FSR and MSR levels are plotted as functions of $\alpha$.

Performance of Viterbi-based decision rule. In Simulation Study 1 for the conjunction test (Figure 2), the Viterbi algorithm results in the most conservative FSR levels. The MSR of the Viterbi algorithm is much higher than that of GLIS but comparable with that of BH–Simes. In Simulation Study 2 for the partial conjunction test (Figure 3), both the FSR and the MSR of the Viterbi algorithm are between those of GLIS and BH–Simes. The Viterbi-based decision rule does not take the prespecified test level into account; hence the decisions would be fixed regardless of the choice of the test level. Here we choose the FSR level to be 0.1 and observe that the Viterbi algorithm is conservative. If the FSR level were 0.01 instead, the Viterbi algorithm would be too liberal. In contrast, the GLIS procedure is adaptive to our choice of test level and controls the FSR precisely.

GLIS (oracle and data-driven) versus BH–Simes. From Figures 2 and 3, we can see that (i) the oracle and data-driven GLIS procedures achieve the nominal FSR levels accurately, whereas the BH–Simes procedure is very conservative; (ii) the two lines of the oracle procedure and the data-driven GLIS procedure almost overlap, indicating that the performance of the oracle procedure is attained by the data-driven procedure asymptotically; (iii) the MSR levels of the data-driven procedure are lower than those of the BH–Simes procedure in all settings, even when the dependence is weak; (iv) the difference in MSR levels is larger as $\alpha$ decreases; (v) the gain in efficiency is larger for partial conjunction tests. Since $\alpha$ controls the signal strength, the results imply that our procedure is especially useful when the signal is weak.

Figure 2. Comparison for conjunction tests. Oracle GLIS: the oracle procedure; data-driven GLIS: the data-driven procedure; Viterbi: Viterbi decision rule; BH–Simes: the combined p-value procedure. The FSR of Viterbi is independent of the nominal level. BH–Simes is conservative and misses a large proportion of true signals. GLIS controls the FSR precisely and has greater power. It also achieves the performance of the oracle procedure asymptotically.

Figure 3. Comparison for partial conjunction tests. Oracle GLIS: the oracle procedure; data-driven GLIS: the data-driven procedure; Viterbi: Viterbi decision rule; BH–Simes: the combined p-value procedure. The oracle and data-driven GLIS procedures achieve the nominal FSR levels approximately. In contrast, BH–Simes is very conservative and misses a large proportion of nonnulls. The MSR of BH–Simes is significantly higher than that of GLIS. The performance of Viterbi is in between.

It is important to note that the larger power of GLIS is not gained at the price of a higher FSR level. To illustrate this point, we conduct the following simulation study to compare the ranking efficiencies of GLIS statistics versus combined p-values.

Simulation Study 3 (Ranking efficiency). In this simulation study, we compare the MSR levels of the GLIS procedure versus the BH–Simes procedure at the same FSR level. Because the FSR level of the Viterbi algorithm is fixed, we do not include Viterbi in the comparison. The simulation results are summarized in Figure 4. The top row considers conjunction tests and the bottom row considers partial conjunction tests, where a rejection is made at the set level if the number of false hypotheses is greater than or equal to 3. We choose $A_k = (0.95, 0.05; 0.2, 0.8)$ and $(\alpha_k, \alpha_{0k}, \nu_k) = (\alpha, 32, 4)$, $k = 1, \ldots, 6$. In Panels (a)–(f), $\alpha$ is 10, 16, 30, 10, 16, and 30, respectively. We can see that (i) at the same FSR level, our proposed procedures have much lower MSR levels than the BH–Simes procedure, indicating that the rankings produced by GLIS are more efficient than those produced by the combined p-values; (ii) the efficiency gain of GLIS is large when the signals are weak; (iii) compared to conjunction tests, the difference in MSR levels is larger for partial conjunction tests.

4. APPLICATION

In this section we apply our method to a well-known MTC dataset collected by Calvano et al. (2005) for studying systemic inflammation in humans. The dataset contains eight study subjects who were randomly assigned to case and control groups and then administered endotoxin and placebo, respectively. Affymetrix U133A chips were used to profile the expression levels of 22,283 genes in human leukocytes measured before infusion (0 hour) and at 2, 4, 6, 9, and 24 hours afterwards.

The goal of this MTC experiment is to extract information from genome-wide expression data to help identify functional networks responsible for the systemic activation. Inflammatory responses exhibit a quick, transient, and self-limiting nature. In the early activation of innate immunity, many secreted proinflammatory factors, including cytokines and chemokines, are quickly activated in response to exterior intrusion. The activated proinflammatory factors subsequently trigger the expression of several transcription factors to initiate the innate immune response. In the late period, the expression levels of a number of transcription factors limiting the innate immune response are increased. Finally the whole system concludes with full recovery and a normal phenotype. Although the overall activated-then-self-limiting pattern is known for the innate immune response process, the underlying regulatory programs are yet to be deciphered. Next we illustrate how to build an HMM and apply the GLIS procedure to analyze this MTC experiment.

Figure 4. Ranking efficiency: BH–Simes (dotted, combined p-value), oracle GLIS (solid), data-driven GLIS (dashed). Both GLIS procedures have lower MSR/higher sensitivity than BH–Simes at the same FSR level. Sensitivity = 1 − MSR.

Step 1 (Data preprocessing). Data preprocessing is a necessary and critical procedure to separate biologically meaningful signals from background hybridization noise and other confounding signals in microarray studies. For the Affymetrix GeneChip one-channel array, data preprocessing involves three steps: (a) background adjustment, (b) normalization, and (c) summarization (Gautier et al. 2004). Some popular software provides one-stop solutions for all three steps; representative ones include RMA (Irizarry et al. 2003), dChip (Li and Wong 2001), and MAS5 (Hubbell, Liu, and Mei 2002). For Illumina microarrays (BeadArray), users can use lumi (Du, Kibbe, and Lin 2008), and for two-color spotted cDNA arrays, Limma (Smyth and Speed 2003). We used RMA here to preprocess the raw Affymetrix array data collected by Calvano et al. (2005). Note that the normalized gene expression intensities may be returned in log-transformed format by default by many software packages, including RMA. We need to convert the transformed data back to their original scale, which is required by the GG model. After the data preprocessing step, we obtain the normalized expression intensities in original scale of 22,283 genes with four replicates for each of the two biological conditions (endotoxin and placebo) over 6 time points.

Step 2 (HMM model fitting). After the preprocessing step, we apply the EM algorithm (Yuan and Kendziorski 2006) to fit an HMM for the time-course data, with the Gamma–Gamma model as the underlying sampling distribution. The choice of a suitable model is crucial in the model building process. In particular, we expect that very few genes are perturbed at the first time point because there is no infusion of endotoxin or placebo. However, at later time points the immune system begins to respond to the two different treatments and the transition probabilities and gene expression intensities will vary over time. Hence inhomogeneous Markov chains are recommended in modeling the dynamic process, and different GG distributions are used at different time points. The fitted HMM parameters are summarized in Table 1. We observe considerably different $(\alpha, \alpha_0, \nu)$ and $(a_{00}, a_{11})$ across the 6 time points, which confirms our previous model assumptions. In situations where the estimated model parameters are very similar for all time points, it is more appropriate to use a homogeneous HMM to refit the data to increase the estimation precision.


Table 1. HMM parameter estimation: a00 and a11 denote the initial state distribution at 0 hour and the transition probabilities afterwards

HMM parameters   0 h      2 h      4 h      6 h      9 h      24 h
α                34.26    20.91    38.76    39.07    41.22    31.47
α0               0.95     0.97     0.94     0.93     0.93     0.96
ν                4.13     7.66     4.06     3.94     3.74     4.67
a00 (π0)         0.9595   0.7996   0.9196   0.9976   0.9982   0.9998
a11 (π1)         0.0405   0.9258   0.9813   0.9349   0.9735   0.2034

Step 3 (Pattern design). Specifying a temporal pattern is an important step in the analysis of MTC experiments; this involves forming an appropriate hypothesis to be tested, and as in most randomized clinical trials, we recommend that this step be completed prior to the experiment. In particular, the pattern of interest should be application specific, independent of the observed data, and should incorporate prior information and related domain knowledge. Here we discuss the design of several commonly used patterns that meet the needs of most biological applications.

1. Genes perturbed in response to treatments. In many MTC experiments, the first time point usually serves as a control point and we expect not much difference in gene expression levels across the two treatments. Therefore any difference at 0 hour can be viewed as a nonimmune responsive perturbation, which may be attributed to baseline difference. Hence it would be of interest to select perturbed genes in response to treatment which are EE before the infusion of endotoxin and DE afterwards. If we further hypothesize that the immune responsive genes should conclude with full recovery and a normal phenotype after 24 hours due to the self-limiting nature of the immune response, then the genes perturbed by treatment should also be EE at the end. In summary, we aim to select genes which are (i) EE at 0 and 24 hours, and (ii) DE at one or more time points from 2, 4, 6, and 9 hours. Therefore the nonnull parameter space for identifying genes perturbed in response to treatment is

$$\Omega_1^{\mathrm{Resp}} = \Big\{\eta \in \{0,1\}^6 : \eta_1 = \eta_6 = 0,\ \sum_{k=2}^{5} \eta_k \ge 1\Big\}.$$

More generally, we may want to exclude genes that are DE at all time points since this type of perturbation is often caused by the baseline difference. Hence the nonnull parameter space for identifying genes that have both EE and DE states during the experimental period is

$$\Omega_1^{\mathrm{Mix}} = \Big\{\eta \in \{0,1\}^6 : 1 \le \sum_{k=1}^{6} \eta_k \le 5\Big\}.$$

2. Sequentially perturbed genes. Sequentially activated genes, which are ordered temporally, may reveal a meaningful activation sequence that governs the immune responses. For example, Calvano et al. (2005) identified point-wise DE genes and then clustered them into different groups based on the first time point when they became DE. Define the nonnull parameter space for identifying genes that are not DE prior to time point $t$ but are DE afterwards:

$$\Omega_1^{t\text{-}\mathrm{Resp}} = \Big\{\eta \in \{0,1\}^6 : \sum_{k=1}^{t-1} \eta_k = 0,\ \sum_{k=t}^{6} \eta_k \ge 1\Big\}.$$

Such patterns would be particularly helpful for studying biological processes such as cell cycle and development, during which definite and specialized genes are expected to be activated sequentially at each step.

3. Early and late response genes. The magnitude of the difference in expression levels often varies for different genes. As a result, among genes which are perturbed at the same time point, some may be detected early while some cannot be detected until the change reaches its peak. If the studied system does not show definite change at every time point, and there is no time point of particular interest in a cell cycle and development, it would be sufficient to roughly separate the genes that responded to treatment early from those that responded late. For example, we define the following patterns, "DE within 4 h" and "not DE until 4 h or later," for early and late immune response genes, respectively. Therefore the nonnull parameter spaces for identifying early and late response genes are

$$\Omega_1^{\mathrm{Early}} = \Big\{\eta \in \{0,1\}^6 : \sum_{k=1}^{4} \eta_k \ge 1\Big\} \quad \text{and} \quad \Omega_1^{\mathrm{Late}} = \Big\{\eta \in \{0,1\}^6 : \sum_{k=1}^{4} \eta_k = 0,\ \sum_{k=5}^{6} \eta_k \ge 1\Big\},$$

respectively. The potential regulatory relationships between early and late response genes can still provide insights on how signals are transferred during the cell cycle; and the identified genes can serve as good candidates for further experimental verifications.

These are only a few examples of patterns that may be commonly used. In total there are $2^K$ atomic patterns, each representing a state sequence over the K time points. The atomic patterns can be flexibly partitioned into any two complementary sets $\Omega_0$ and $\Omega_1$, where $\Omega_1$ represents the collection of patterns of interest. The number of possible partitions allowed by our testing framework is as large as $2^{2^K} - 1$, which is flexible enough to meet various biological needs.
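The nonnull parameter spaces defined in this step can be enumerated directly from the 64 atomic patterns for K = 6. The sketch below mirrors the displayed definitions; the function names are illustrative, and time points are indexed from zero in the code (so `eta[0]` is the 0 hour state).

```python
from itertools import product

K = 6
ATOMIC = list(product((0, 1), repeat=K))  # all 2^K atomic patterns


def resp(eta):
    # EE at the first and last time points, DE somewhere in between.
    return eta[0] == 0 and eta[5] == 0 and sum(eta[1:5]) >= 1


def mix(eta):
    # Both EE and DE states occur during the experiment.
    return 1 <= sum(eta) <= 5


def t_resp(eta, t):
    # Not DE before time point t (1-based), DE at t or later.
    return sum(eta[: t - 1]) == 0 and sum(eta[t - 1:]) >= 1


def early(eta):
    # DE at one or more of the first four time points.
    return sum(eta[:4]) >= 1


def late(eta):
    # EE at the first four time points, DE at one of the last two.
    return sum(eta[:4]) == 0 and sum(eta[4:]) >= 1


def nonnull_space(pattern):
    """List the atomic patterns that satisfy the given indicator function."""
    return [eta for eta in ATOMIC if pattern(eta)]


sizes = {f.__name__: len(nonnull_space(f)) for f in (resp, mix, early, late)}
```

Because each pattern is just an indicator on the atomic patterns, any other pattern of interest can be added by writing one more function of `eta`.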

Step 4 (Application of GLIS). With the pattern of interest in mind, the application of our GLIS procedure is relatively straightforward. As an example, we applied the GLIS procedure to identify "early" and "late" immune response genes, which are respectively defined as the genes that are DE within four hours after endotoxin injection, and the genes that begin to be DE at four hours or later. The formal definitions of the $\Omega_1$'s were given in the example in Step 3. Since $\Omega_1$ and $\Omega_0$ are complementary, we can calculate $\mathrm{GLIS}_i$ as either $\sum_{s\in\Omega_0} P_{\hat{\Psi}}(\theta_i = s \mid e_i)$ or $1 - \sum_{s\in\Omega_1} P_{\hat{\Psi}}(\theta_i = s \mid e_i)$, depending on which set has a smaller cardinality, to save computational time.

The HMM parameters, which were estimated at Step 2, are used to calculate the values of GLIS statistics. The genes are then ranked based on GLIS and a cutoff is chosen along the rankings for a given FSR level using the step-up procedure. At the FSR level of 0.05, 4385 genes are identified to be "early" response genes and 56 are identified to be "late" response genes. The expression profiles of the top 20 "early" and "late" response genes are shown in Figures 5 and 6, respectively.

Figure 5. Expression profiles of top 20 "early" genes with 95% confidence bands. Vertical and horizontal axes are the log-transformed gene expression level and time (0, 2, 4, 6, 9, 24), respectively. Solid line: cases; dashed line: controls. The online version of this figure is in color.

Interpretation and discussion of results. A number of genes, many of which are known to take part in initiating the innate immune response and to implement many functions of leukocytes, such as cellular movement, migration and proliferation, are identified as "early" response genes by GLIS (Table 2). In addition, it is interesting to find that Calcineurin (PPP3CA) is among the "late" immune response genes. Specifically, its expression profile shows a significant down-regulation at six hours followed by a gradual recovery. Calcineurin induces transcription factors for the transcription of IL-2 genes, the amount of which is believed to have significant influence on the extent of the immune response. Similar down-regulation of LTB, an inducer of the inflammatory response system, was also found among the "late" immune response genes. Meanwhile, immune response repressors were observed to be up-regulated, including BCL6 (a corepressor of the transcription of STAT-dependent IL-4 responses of B cells), and LILRA6 and LILRB3 (leukocyte immunoglobulin-like receptors that bind to MHC class I molecules on antigen-presenting cells to inhibit the stimulation of an immune response). These up/down regulations together indicate a concluding signal, which is consistent with the self-limiting nature of the innate immune response. The application of our GLIS procedure to time-course data provides more insights in deciphering the cell regulatory mechanism. The possible regulatory relationships between these "early" and "late" response genes are worthy of further biological investigations.

5. TECHNICAL DETAILS AND DERIVATIONS OF THE PROPOSED PROCEDURE

In this section, we develop optimal procedures in a compound decision-theoretic framework for testing sets of hypotheses arising from the HMM defined in (2.3)–(2.7). We first show that the multiple testing and weighted classification problems are "equivalent" under mild conditions, then derive an oracle procedure, under an ideal setting, that minimizes the MSR subject to a constraint on the FSR. Finally we derive a data-driven procedure that mimics the oracle procedure. The Appendix shows that the data-driven procedure is valid and optimal asymptotically.

5.1 Compound Decision Theory

Consider $\vartheta_i$ defined in (2.1). We are interested in inference of the unknown $\vartheta_i$'s based on the observed data and need to solve $m$ component problems simultaneously. This is referred to as a compound decision problem (Robbins 1951). A solution to this problem can be represented by a decision rule $\delta = (\delta_1, \ldots, \delta_m) \in \{0,1\}^m$, where $\delta_i = 1$ if we claim that $H_{i0}$ is false and $\delta_i = 0$ otherwise.

Figure 6. Expression profiles of top 20 late response genes with 95% confidence bands. Vertical and horizontal axes are the log-transformed gene expression level and time (0, 2, 4, 6, 9, 24), respectively. Solid line: cases; dashed line: controls. The online version of this figure is in color.

Table 2. The known “early” innate immune response genes

| Category | Members |
| --- | --- |
| Cytokines, chemokines | TNFSF10, TNFSF14, TNF, IL1A, IL1B, IL8, IL15, IL32, ILF2, ILF3, CXCL1, CXCL2, CCL3, CCL20, CCL3L1, CCL3L3, CCL4, CCL5, XCL1, and XCL2 |
| Membrane receptors of cytokines, chemokines | CCR1, CCR2, CCR3, CCR5, CCR6, CCRL2, CXCR3, CXCR4, CX3CR1, IL11RA, IL13RA1, IL18R1, IL2RB, IL2RG, IL1RAP, IL1R1, IL1R2, IL27RA |
| Toll-like receptors | TLR1, TLR4, TLR5, TLR7, and TLR8 |
| Fc receptors | FCGR1A, FCGR1B, FCGR1C, FCGR2A, FCGR3A, FCGR3B |
| IFN receptors | IFNAR2, IFNGR2 |
| kappa/relA family | NFKB2, RELA, RELB |
| Tyrosine kinase | JAK2, JAK3 |
| STAT family | STAT1, STAT2, STAT4, STAT5B, STAT6 |

In Yuan and Kendziorski (2006), the problem of separating DE time points from EE time points was formulated as a classification problem. Specifically, let θ̂ik be an estimate of the unknown state θik, with θ̂ik = 1 indicating that we classify the kth time point of gene i as DE and θ̂ik = 0 otherwise. Yuan and Kendziorski (2006) proposed a classification rule based on the maximum a posteriori (MAP) estimates θ̂ik = argmax_{s∈{0,1}} P(θik = s|ei), for i = 1, . . . , m and k = 1, . . . , K. Let a = (aik : i = 1, . . . , m; k = 1, . . . , K) ∈ {0,1}^{mK} denote a general decision rule. It can be shown that the MAP estimate is the optimal solution to a classification problem with the following loss function:

$$L(\theta, a) = (mK)^{-1} \sum_{i=1}^{m} \sum_{k=1}^{K} \{(1 - \theta_{ik}) a_{ik} + \theta_{ik} (1 - a_{ik})\}. \tag{5.1}$$
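As a minimal illustration of the MAP rule and of loss (5.1), the Python sketch below thresholds posterior DE probabilities at 1/2 and computes the unweighted misclassification loss. The function names `map_estimates` and `loss_5_1`, the nested-list layout of the inputs, the toy numbers, and the convention of sending exact ties at 1/2 to EE are all assumptions made for this illustration; they are not taken from Yuan and Kendziorski (2006).

```python
def map_estimates(post_de):
    """post_de[i][k] is P(theta_ik = 1 | e_i); return the 0/1 MAP calls."""
    # With two states, the argmax selects state 1 exactly when its posterior
    # probability exceeds 1/2. Ties at 1/2 go to 0 (EE): an arbitrary convention.
    return [[1 if p > 0.5 else 0 for p in row] for row in post_de]


def loss_5_1(theta, a):
    """Loss (5.1): fraction of the m*K gene-by-time cells that are misclassified."""
    m, K = len(theta), len(theta[0])
    errors = sum(
        (1 - t) * d + t * (1 - d)  # false positive term + false negative term
        for theta_row, a_row in zip(theta, a)
        for t, d in zip(theta_row, a_row)
    )
    return errors / (m * K)


# Toy input (hypothetical): m = 2 genes, K = 2 time points.
post_de = [[0.9, 0.2], [0.4, 0.6]]
theta = [[1, 0], [1, 1]]          # true states, known only in a simulation
calls = map_estimates(post_de)
print(calls, loss_5_1(theta, calls))
```

In this toy case one of the four cells (gene 2, time point 1) is a false negative, so the loss is 1/4. Both kinds of error carry the same weight in this loss.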

In loss function (5.1), it is assumed that a false positive and a false negative have the same cost. However, in a gene selection problem, a false positive is often considered to be more serious than a false negative. Therefore it is desirable to treat the two types of errors differently. Moreover, the analysis needs to be conducted at the set level. Hence we construct the loss function using the set indicator ϑi, where ϑi = 1 if gene i has the temporal pattern of interest and ϑi = 0 otherwise. Also, let λ denote the relative cost of a false positive to a false negative, and δ = (δi : i = 1, . . . , m) ∈ {0,1}^m denote a general decision rule, where δi = 1 indicates that we claim that gene i has the specific temporal pattern of DE and δi = 0 otherwise. Consider a weighted classification problem with loss function

$$L_\lambda(\vartheta, \delta) = m^{-1} \sum_{i=1}^{m} \{\lambda (1 - \vartheta_i) \delta_i + \vartheta_i (1 - \delta_i)\}. \tag{5.2}$$

The goal of this weighted classification problem is to find a decision rule δλ that minimizes the classification risk Rλ = E{Lλ(ϑ, δ)}.
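The set-level loss (5.2) can be evaluated directly once the true set indicators are available, as they are in a simulation. The sketch below is only an illustration: the function name `weighted_loss` and the toy vectors are hypothetical.

```python
def weighted_loss(vartheta, delta, lam):
    """Loss (5.2): a false positive costs lam, a false negative costs 1."""
    m = len(vartheta)
    total = sum(lam * (1 - v) * d + v * (1 - d) for v, d in zip(vartheta, delta))
    return total / m


# Toy example (hypothetical): four genes, one false positive, one false negative.
vartheta = [1, 0, 0, 1]   # true set indicators
delta = [1, 1, 0, 0]      # decisions
print(weighted_loss(vartheta, delta, lam=2.0))
```

With λ = 2 the single false positive contributes 2 and the single false negative contributes 1, so the loss is 3/4. With λ = 1 the same decisions give 1/2, which is the set-level analogue of the unweighted loss (5.1).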

In MTC studies, we expect that only a small number of genes are differentially expressed at time k. The formulation of a weighted classification problem is not tractable in practical applications since the relative cost λ is usually unknown. Alternatively, a common strategy in practice is to identify a list of genes that minimizes the “missed findings,” while incurring a relatively low proportion of false positive results. This naturally gives rise to a multiple testing problem, where the optimal testing procedure δα minimizes the MSR subject to the constraint FSR ≤ α. The multiple testing and weighted classification problems are closely connected, but the former is more difficult to deal with theoretically. Next we show that under mild conditions the two problems are “equivalent” and then derive the optimal multiple testing procedure by solving an equivalent weighted classification problem.

5.2 Multiple Testing and Weighted Classification

The goal of both multiple testing and weighted classification problems is to separate the nonnull cases from the null cases, and the solution to both problems can be represented by a decision rule of the form

$$\delta(T, c\mathbf{1}) = \{I(T_i < c) : i = 1, \ldots, m\}, \tag{5.3}$$

where T is a classifier or a test statistic and c is a cutoff. The most popular choice of T is the vector of p-values. In the multiple testing literature, the following assumption has been used (e.g., Storey 2003; Genovese and Wasserman 2004):

the FDR (FNR) level yielded by δ(T, c1) is increasing (decreasing) in c. (5.4)

Assumption (5.4) is desirable for developing FDR procedures since it implies that in order to minimize the FNR, we should choose the largest cutoff c that satisfies FDR(c) ≤ α. For set-wise FDR analysis, we would similarly require that

the FSR (MSR) level yielded by δ(T, c1) is increasing (decreasing) in c. (5.5)

Next we develop a general condition that guarantees (5.5). Denote by Gj(t) = P(Ti < t|ϑi = j), j = 0, 1, the conditional cumulative distribution function (cdf) of Ti and by gj(t) = (d/dt)Gj(t) the conditional probability density function (pdf). Assume that

g1(t)/g0(t) is monotonically decreasing in t. (5.6)

The above condition is referred to as the monotone ratio condition (MRC). The next lemma shows that the MRC is a desirable condition for multiple testing.

Lemma 1. The MRC (5.6) implies that (5.5) holds asymptotically.

The class of MRC statistics, denoted by $\mathcal{T}$, includes most commonly used test statistics. For example, let p be a vector of independent p-values; then p ∈ $\mathcal{T}$ if G1(p), the p-value distribution under the alternative, is concave. The concavity of the p-value distribution is a desirable condition that guarantees (5.4) and has been assumed in Genovese and Wasserman (2004) and Storey (2003). In addition, both the local false discovery rate (Lfdr, Efron et al. 2001) statistic and the LIS (Sun and Cai 2009) statistic belong to $\mathcal{T}$. See Sun and Cai (2007) for more discussions of the MRC.
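To make the p-value example concrete, the sketch below checks the MRC (5.6) on a grid for a hypothetical alternative with cdf G1(p) = p^a, so that g1(p) = a p^(a−1), against a uniform null with g0(p) = 1. This Beta-type alternative, the grid, and the helper names are assumptions chosen for illustration. A grid check is a numerical sanity check rather than a proof.

```python
def is_decreasing_ratio(g1, g0, grid):
    """Return True if g1(t) / g0(t) is strictly decreasing along the grid."""
    ratios = [g1(t) / g0(t) for t in grid]
    return all(later < earlier for earlier, later in zip(ratios, ratios[1:]))


def beta_alt_density(a):
    """Density of a hypothetical alternative p-value distribution G1(p) = p**a."""
    return lambda p: a * p ** (a - 1.0)


def uniform_null(p):
    """Density of a uniform(0, 1) null p-value."""
    return 1.0


grid = [i / 100 for i in range(1, 100)]   # 0.01, 0.02, ..., 0.99
print(is_decreasing_ratio(beta_alt_density(0.5), uniform_null, grid))  # concave G1
print(is_decreasing_ratio(beta_alt_density(2.0), uniform_null, grid))  # convex G1
```

For a = 0.5 the alternative cdf is concave and the density ratio decreases, so the check passes. For a = 2 the cdf is convex and the ratio increases, so the check fails, which matches the role of concavity described above.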

The next theorem, which underlies our theoretical development, states that the multiple testing and weighted classification problems are “equivalent” when the MRC holds (Sun and Cai 2007). Therefore the more complicated multiple testing problem can be solved by studying an equivalent weighted classification problem. The following theorem can be proved similarly as in Sun and Cai (2007).

Theorem 1. Suppose that the classification risk with the loss function defined in (5.2) is minimized by δλ{Λ, c(λ)}, so that Λ is optimal in the weighted classification problem. If Λ ∈ $\mathcal{T}$, then Λ is also optimal in the multiple testing problem, in the sense that for each FSR level α, there exists a unique λ(α), and hence c{λ(α)} = c(α), such that δλ(α){Λ, c(α)} controls the FSR at level α with the smallest MSR level among all testing rules in Dα, where Dα is the collection of all α-level FSR rules of the form δ = I(Λ < c1).

5.3 The Oracle Procedure

Now we derive the optimal test statistic for FSR control. The next theorem considers, in an ideal situation, the optimal classification rule in the HMM defined by (2.3)–(2.7).

Theorem 2. Consider the HMM defined by (2.3)–(2.7). Suppose that the HMM parameters are known. Then the classification risk with loss function (5.2) is minimized by δ{Λ, (1/λ)1} = (δi : i = 1, . . . , m), where

$$\Lambda_i = \frac{P(\vartheta_i = 0 \mid e_i)}{P(\vartheta_i = 1 \mid e_i)} \quad \text{and} \quad \delta_i = I(\Lambda_i < 1/\lambda).$$
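The rule in Theorem 2 needs only the posterior null probability of each set. A minimal sketch, assuming those probabilities have already been computed from the HMM (the function name `oracle_rule` and the toy inputs are hypothetical):

```python
def oracle_rule(post_null, lam):
    """post_null[i] = P(vartheta_i = 0 | e_i); return the decisions delta_i."""
    # lam * p0 < 1 - p0 is the same comparison as p0 / (1 - p0) < 1 / lam,
    # written so that no division by zero occurs when p0 equals 1.
    return [1 if lam * p0 < 1.0 - p0 else 0 for p0 in post_null]


# Toy posterior null probabilities (hypothetical), false positives twice as costly.
print(oracle_rule([0.1, 0.5, 0.3, 1.0], lam=2.0))
```

With λ = 2 the rule selects a set exactly when its posterior null probability is below 1/(1 + λ) = 1/3, so the first and third sets are selected in this toy input.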

Theorem 2 and the equivalence between multiple testing and weighted classification imply that Λ is also the optimal test statistic for FSR control. Define GLISi = P(ϑi = 0|ei). Because GLISi = Λi/(1 + Λi) is strictly increasing in Λi, the (oracle) optimal multiple testing procedure must be of the form

$$\delta(\mathrm{GLIS}, c_{OR}\mathbf{1}) = [I(\mathrm{GLIS}_i < c_{OR}) : i = 1, \ldots, m], \tag{5.7}$$

where the oracle cutoff is c_OR = sup{c ∈ (0,1) : FSR(GLIS, c1) ≤ α}, due to (5.5).

5.4 A Data-Driven Procedure

The optimal testing procedure (5.7) is difficult to implement because it is difficult to determine the optimal cutoff c_OR directly. We propose the following procedure that is asymptotically equivalent to (5.7). The derivation of this procedure is given in the Appendix. Let GLIS(1), . . . , GLIS(m) be the ranked GLIS statistics (in increasing order) and H(1), . . . , H(m) be the corresponding hypotheses.

$$\text{Let } k = \max\Big\{ i : \frac{1}{i} \sum_{j=1}^{i} \mathrm{GLIS}_{(j)} \le \alpha \Big\}; \text{ then reject all } H_{(i)},\ i = 1, \ldots, k. \tag{5.8}$$
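Procedure (5.8) is a step-up rule on the running averages of the ordered statistics. The sketch below is one straightforward way to code it, assuming the GLIS values (or their estimates) are already available as a list. The function name `glis_procedure`, the returned tuple, and the toy values are choices made for this illustration.

```python
def glis_procedure(glis, alpha):
    """Step-up rule (5.8): return (k, indices of the rejected hypotheses)."""
    # Rank the hypotheses by increasing GLIS value.
    order = sorted(range(len(glis)), key=lambda i: glis[i])
    running_sum = 0.0
    k = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += glis[idx]
        # Keep the largest rank whose running average stays at or below alpha.
        if running_sum / rank <= alpha:
            k = rank
    return k, sorted(order[:k])


# Toy GLIS values (hypothetical) for five sets of hypotheses.
glis = [0.01, 0.9, 0.04, 0.2, 0.6]
k, rejected = glis_procedure(glis, alpha=0.1)
print(k, rejected)
```

In this toy input the running averages of the ordered values are 0.01, 0.025, about 0.083, and about 0.213, so the rule stops at k = 3 and rejects the hypotheses at positions 0, 2, and 3. The third set is rejected even though its own GLIS value (0.2) exceeds α, because the rule constrains the average over the rejected list rather than each value individually.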

The next theorem shows that the GLIS procedure (5.8) is asymptotically equivalent to the oracle procedure (5.7); in particular, it is valid for FSR control.

Theorem 3. Consider the HMM defined by (2.3)–(2.7). Let GLISi = P(ϑi = 0|ei). Denote by GLIS(1), . . . , GLIS(m) the ranked GLIS values, and H(1), . . . , H(m) the corresponding hypotheses. Then the GLIS procedure (5.8) controls the FSR at level α. In addition, let MSR_OR and MSR_GLIS be the MSR levels of the oracle procedure (5.7) and the GLIS procedure (5.8), respectively; then MSR_GLIS = MSR_OR + o(1).

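In practice the HMM parameters are unknown, so the GLIS statistics are replaced by their estimates $\widehat{\mathrm{GLIS}}_i$, computed with the HMM parameters replaced by their estimates. Denote by $\widehat{\mathrm{GLIS}}_{(1)}, \ldots, \widehat{\mathrm{GLIS}}_{(m)}$ the ranked estimates and H(1), . . . , H(m) the corresponding hypotheses. In light of the GLIS procedure (5.8), we propose the following data-driven procedure:

$$\text{Let } k = \max\Big\{ i : \frac{1}{i} \sum_{j=1}^{i} \widehat{\mathrm{GLIS}}_{(j)} \le \alpha \Big\}; \text{ then reject all } H_{(i)},\ i = 1, \ldots, k. \tag{5.9}$$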

The next theorem shows that the data-driven procedure (5.9) attains the performance of the oracle procedure (5.7) asymptotically.

Theorem 4. Consider the HMM defined by (2.3)–(2.7). Suppose that the HMM parameters are estimated consistently, that is, the estimate converges in probability to the true parameter values. Let FSR_OR and FSR_DD be the FSR levels of the oracle procedure (5.7) and the data-driven procedure (5.9), respectively, and MSR_OR and MSR_DD the corresponding MSR levels. Then FSR_DD = α + o(1) and MSR_DD = MSR_OR + o(1).

APPENDIX: PROOFS AND DERIVATIONS OF TECHNICAL RESULTS

Proof of Lemma 1

Let ρ = P(ϑi = 1) be the proportion of nonnull sets. The marginal cdf of Ti is G = (1 − ρ)G0 + ρG1, where G0 and G1 are the conditional cdf’s defined in Section 5.2. Define the marginal FSR (mFSR) and marginal MSR (mMSR) as

$$\mathrm{mFSR} = \frac{E\{\sum_{i=1}^{m} (1 - \vartheta_i)\delta_i\}}{E(\sum_{i=1}^{m} \delta_i)} \quad \text{and} \quad \mathrm{mMSR} = \frac{E\{\sum_{i=1}^{m} \vartheta_i (1 - \delta_i)\}}{E(\sum_{i=1}^{m} \vartheta_i)}.$$

Then mFSR and mMSR are asymptotically equivalent measures to the FSR and MSR in the sense that, under mild conditions, mFSR = FSR + O(m^{−1/2}) and mMSR = MSR + O(m^{−1/2}). For the rule δ(T, t1), it is easy to show that mFSR = (1 − ρ)G0(t)/G(t) and mMSR = 1 − G1(t). Obviously the mMSR is decreasing in t. Next, the MRC implies that g0G1 > g1G0: for s < t the MRC gives g1(s)g0(t) > g1(t)g0(s), and integrating over s up to t yields G1(t)g0(t) > g1(t)G0(t). Hence

$$(d/dt)\,\mathrm{mFSR}(t) = \rho(1 - \rho)\{g_0(t)G_1(t) - g_1(t)G_0(t)\}/G(t)^2 > 0,$$

so the mFSR is increasing in t. The results follow by noting that mFSR = FSR + O(m^{−1/2}) and mMSR = MSR + O(m^{−1/2}).
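The monotonicity claimed in Lemma 1 can be checked numerically for a specific pair of conditional cdf’s. The sketch below uses a hypothetical example that satisfies the MRC, G0(t) = t and G1(t) = t^{1/2}, with ρ = 0.2. These choices, the grid, and the helper names are assumptions made purely for illustration.

```python
def mfsr(t, rho, G0, G1):
    """mFSR(t) = (1 - rho) G0(t) / G(t), with G = (1 - rho) G0 + rho G1."""
    null_mass = (1.0 - rho) * G0(t)
    return null_mass / (null_mass + rho * G1(t))


def mmsr(t, G1):
    """mMSR(t) = 1 - G1(t)."""
    return 1.0 - G1(t)


def G0(t):
    return t            # hypothetical null cdf (uniform)


def G1(t):
    return t ** 0.5     # hypothetical nonnull cdf; g1/g0 is decreasing


rho = 0.2
grid = [i / 100 for i in range(1, 100)]
fsr_values = [mfsr(t, rho, G0, G1) for t in grid]
msr_values = [mmsr(t, G1) for t in grid]

fsr_increasing = all(b > a for a, b in zip(fsr_values, fsr_values[1:]))
msr_decreasing = all(b < a for a, b in zip(msr_values, msr_values[1:]))
print(fsr_increasing, msr_decreasing)
```

Both checks pass for this example. This illustrates why the largest cutoff with mFSR at most α also gives the smallest mMSR.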

Proof of Theorem 2

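The posterior distribution of ϑ given e = (e1, . . . , em) is

$$P_{\vartheta|e}(\vartheta \mid e) = \prod_{i=1}^{m} \{(1 - \vartheta_i) P(\vartheta_i = 0 \mid e_i) + \vartheta_i P(\vartheta_i = 1 \mid e_i)\}.$$

Hence the posterior risk is

$$\begin{aligned}
E_{\vartheta|e}\{L_\lambda(\vartheta, \delta)\} &= m^{-1} \sum_i E_{\vartheta_i|e_i}\{\lambda(1 - \vartheta_i)\delta_i + \vartheta_i(1 - \delta_i)\} \\
&= m^{-1} \sum_i \{\lambda \delta_i P(\vartheta_i = 0 \mid e_i) + (1 - \delta_i) P(\vartheta_i = 1 \mid e_i)\} \\
&= m^{-1} \sum_i P(\vartheta_i = 1 \mid e_i) + m^{-1} \sum_i \{\lambda P(\vartheta_i = 0 \mid e_i) - P(\vartheta_i = 1 \mid e_i)\}\delta_i.
\end{aligned}$$

Therefore the classification risk is minimized by δi = I{λP(ϑi = 0|ei) < P(ϑi = 1|ei)}, that is, by selecting set i exactly when the ratio P(ϑi = 0|ei)/P(ϑi = 1|ei) is below 1/λ.

Proof of Theorem 3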

(i) Validity. First, according to the definition of GLIS, we have

$$E\{(1 - \vartheta_i)\delta_i \mid e_i\} = \delta_i P(\vartheta_i = 0 \mid e_i) = \delta_i \mathrm{GLIS}_i.$$

Suppose the total number of rejections at FSR level α is Rα; then Rα = ∑i δi. The actual FSR level of the GLIS procedure (5.8) is

$$\begin{aligned}
\mathrm{FSR}_{GLIS} &= E\left\{\frac{\sum_i (1 - \vartheta_i)\delta_i}{(\sum_i \delta_i) \vee 1}\right\} \\
&= E\left[\frac{1}{(\sum_i \delta_i) \vee 1} \sum_i E\{(1 - \vartheta_i)\delta_i \mid e_i\}\right] \\
&\le E\left(\frac{1}{R_\alpha} \sum_{i=1}^{R_\alpha} \mathrm{GLIS}_{(i)}\right).
\end{aligned}$$

The validity of the GLIS procedure follows by noting that (5.8) guarantees that, for all realizations of e, $(1/R_\alpha) \sum_{i=1}^{R_\alpha} \mathrm{GLIS}_{(i)} \le \alpha$.

(ii) Asymptotic optimality. Since mMSR = MSR + o(1), it is sufficient to show that mMSR_GLIS = mMSR_OR + o(1). Let c_OR and ĉ_OR be the thresholds of the oracle and GLIS procedures, respectively. Then mMSR_OR = P(GLISi > c_OR|ϑi = 1) and mMSR_GLIS = P(GLISi > ĉ_OR|ϑi = 1). The continuous mapping theorem implies that we only need to show that ĉ_OR →p c_OR. Denote by G0 and G1 the null and nonnull cdf’s of GLISi. The mFSR level of the oracle procedure for a given cutoff t is mFSR_OR(t) = (1 − ρ)G0(t)/G(t), where G = (1 − ρ)G0 + ρG1. Next define

$$Q(t) = \Big\{\sum_i I(\mathrm{GLIS}_i < t)\,\mathrm{GLIS}_i\Big\} \Big/ \Big\{\sum_i I(\mathrm{GLIS}_i < t)\Big\}.$$
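The quantity Q(t) is simply the average of the GLIS values that fall below the cutoff t, which is what makes it a sample estimate of the mFSR. A minimal sketch follows. The function name `q_statistic`, the toy values, and the convention of returning 0 when no value falls below t are assumptions for this illustration (the source does not define Q(t) in that case).

```python
def q_statistic(glis, t):
    """Q(t): average of the GLIS values strictly below the cutoff t."""
    selected = [g for g in glis if g < t]
    if not selected:
        # Nothing is rejected at this cutoff; returning 0.0 is an assumed convention.
        return 0.0
    return sum(selected) / len(selected)


# Toy GLIS values (hypothetical).
glis = [0.01, 0.9, 0.04, 0.2, 0.6]
print(q_statistic(glis, 0.25))   # average of 0.01, 0.04 and 0.2
print(q_statistic(glis, 0.65))   # average of 0.01, 0.04, 0.2 and 0.6
```

For this toy input Q(0.25) is about 0.083 and Q(0.65) is about 0.213, so at α = 0.1 the first cutoff is admissible and the second is not.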

According to the law of large numbers,

$$m^{-1} \sum_i I(\mathrm{GLIS}_i < t)\,\mathrm{GLIS}_i \xrightarrow{p} E\{I(\mathrm{GLIS}_i < t)\,\mathrm{GLIS}_i\} = (1 - \rho) G_0(t).$$

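Similarly we can show that $m^{-1} \sum_i I(\mathrm{GLIS}_i < t) \xrightarrow{p} G(t)$. Therefore $Q(t) \xrightarrow{p} \mathrm{mFSR}(t)$. Because Q(t) is constant in the interval GLIS(i) ≤ t < GLIS(i+1), we have

$$\begin{aligned}
\hat{c}_{OR} &= \max_{i=1,\ldots,m} \Big\{\mathrm{GLIS}_{(i)} : \frac{1}{i} \sum_{j=1}^{i} \mathrm{GLIS}_{(j)} \le \alpha\Big\} \\
&= \max_{i=1,\ldots,m} \big\{\mathrm{GLIS}_{(i)} : Q(\mathrm{GLIS}_{(i)}) \le \alpha\big\} \\
&= \sup\{c \in (0,1) : Q(c) \le \alpha\} \equiv Q^{-1}(\alpha).
\end{aligned}$$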

We have shown that $Q(t) \xrightarrow{p} \mathrm{mFSR}(t)$. It follows from the functional delta method that $\hat{c}_{OR} \xrightarrow{p} c_{OR}$. Therefore the GLIS procedure attains the MSR level of the oracle procedure asymptotically.

Proof of Theorem 4

Define

$$Q_{DD}(t) = \Big\{\sum_i I(\widehat{\mathrm{GLIS}}_i < t)\,\widehat{\mathrm{GLIS}}_i\Big\} \Big/ \Big\{\sum_i I(\widehat{\mathrm{GLIS}}_i < t)\Big\}.$$

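The α-level FSR cutoff of the data-driven procedure is ĉ_DD = sup{t ∈ (0,1) : Q_DD(t) ≤ α}. Because the estimate of the HMM parameters converges in probability to the true parameter values, the continuous mapping theorem gives $\widehat{\mathrm{GLIS}}_i \xrightarrow{p} \mathrm{GLIS}_i$. Applying the weak law of large numbers for triangular arrays, we can show that $Q_{DD}(t) \xrightarrow{p} \mathrm{mFSR}_{OR}(t)$. By using arguments similar to those in the proof of Theorem 3, we can show that $\hat{c}_{DD} \xrightarrow{p} c_{OR}$. Therefore we have

$$\begin{aligned}
\mathrm{FSR}_{DD} &= \mathrm{mFSR}_{DD} + o(1) \\
&= \frac{(1 - \rho) P(\widehat{\mathrm{GLIS}}_i < \hat{c}_{DD} \mid \vartheta_i = 0)}{P(\widehat{\mathrm{GLIS}}_i < \hat{c}_{DD})} + o(1) \\
&\xrightarrow{p} \frac{(1 - \rho) P(\mathrm{GLIS}_i < c_{OR} \mid \vartheta_i = 0)}{P(\mathrm{GLIS}_i < c_{OR})} \\
&= \mathrm{mFSR}_{OR}(c_{OR}) = \alpha.
\end{aligned}$$

Similarly we can show that MSR_DD = MSR_OR + o(1).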

Derivation of the GLIS procedure (5.8)

Because FSR and mFSR are asymptotically equivalent measures, we shall derive the GLIS procedure using the mFSR. Let GLIS(1), . . . , GLIS(m) be the ranked test statistics and H(1), . . . , H(m) be the corresponding hypotheses. Let ρ be the proportion of the nonnull sets, and let G0 and G1 be the conditional cdf’s of GLIS under the null and nonnull hypotheses, respectively. The mFSR level of δ(GLIS, t1) is

$$\begin{aligned}
\mathrm{mFSR} &= \frac{(1 - \rho) G_0(t)}{G(t)} = \frac{E\{(1 - \vartheta_i) I(\mathrm{GLIS}_i < t)\}}{E\{I(\mathrm{GLIS}_i < t)\}} \\
&= \frac{E\{\mathrm{GLIS}_i I(\mathrm{GLIS}_i < t)\}}{E\{I(\mathrm{GLIS}_i < t)\}} = \frac{\sum_{i=1}^{m} \mathrm{GLIS}_i I(\mathrm{GLIS}_i < t)}{\sum_{i=1}^{m} I(\mathrm{GLIS}_i < t)} + o(1).
\end{aligned}$$

Let GLIS(k) be the largest test statistic that is less than t; then the mFSR level for a given cutoff t can be approximated by $Q(k) = (1/k) \sum_{i=1}^{k} \mathrm{GLIS}_{(i)}$. Because Q(k) is increasing in k, we choose the largest k such that the mFSR is controlled at level α. Then, by assumption (5.5), the procedure (5.8) follows.

[Received September 2009. Revised July 2010.]

REFERENCES

Arbeitman, M. N., Furlong, E. E. M., Imam, F., Johnson, E., Null, B. H., Baker, B. S., Krasnow, M. A., Scott, M. P., Davis, R. W., and White, K. P. (2002), “Gene Expression During the Life Cycle of Drosophila Melanogaster,” Science, 297 (5590), 2270–2275. [73]

Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970), “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains,” The Annals of Mathematical Statistics, 41, 164–171. [77]

Benjamini, Y., and Heller, R. (2008), “Screening for Partial Conjunction Hypotheses,” Biometrics, 64 (4), 1215–1222. [74-76,78]

Benjamini, Y., and Hochberg, Y. (1995), “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Ser. B, 57 (1), 289–300. [74,76-78]

Benjamini, Y., and Yekutieli, D. (2005), “False Discovery Rate-Adjusted Multiple Confidence Intervals for Selected Parameters,” Journal of the American Statistical Association, 100, 71–93. [77]

Bickel, P. J., Ritov, Y., and Rydén, T. (1998), “Asymptotic Normality of the Maximum-Likelihood Estimator for General Hidden Markov Models,” The Annals of Statistics, 26 (4), 1614–1635. [77]

Calvano, S. E., Xiao, W., Richards, D. R., Felciano, R. M., Baker, H. V., Cho, R. J., Chen, R. O., Brownstein, B. H., Cobb, J. P., Tschoeke, S. K., Miller-Graziano, C., Moldawer, L. L., Mindrinos, M. N., Davis, R. W., Tompkins, R. G., Lowry, S. F., and Inflamm and Host Response to Injury Large Scale Collab. Res. Program (2005), “A Network-Based Analysis of Systemic Inflammation in Humans,” Nature, 437 (7061), 1032–1037. [73,80-82]

Chi, Y.-Y., Ibrahim, J. G., Bissahoyo, A., and Threadgill, D. W. (2007), “Bayesian Hierarchical Modeling for Time Course Microarray Experiments,” Biometrics, 63 (2), 496–504. [74]

Churchill, G. A. (1992), “Hidden Markov Chains and the Analysis of Genome Structure,” Computers & Chemistry, 16, 107–115. [74]

Du, P., Kibbe, W. A., and Lin, S. M. (2008), “lumi: A Pipeline for Processing Illumina Microarray,” Bioinformatics, 24 (13), 1547–1548. [81]

Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. (1999), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge, U.K.: Cambridge University Press. [74]

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001), “Empirical Bayes Analysis of a Microarray Experiment,” Journal of the American Statistical Association, 96 (456), 1151–1160. [85]

Finner, H. (1999), “Stepwise Multiple Test Procedures and Control of Directional Errors,” The Annals of Statistics, 27, 274–289. [77]

Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A. (2004), “affy—Analysis of Affymetrix Genechip Data at the Probe Level,” Bioinformatics,20 (3), 307–315. [81]

Genovese, C., and Wasserman, L. (2002), “Operating Characteristics and Extensions of the False Discovery Rate Procedure,” Journal of the Royal Statistical Society, Ser. B, 64 (3), 499–517. [76]

Genovese, C., and Wasserman, L. (2004), “A Stochastic Process Approach to False Discovery Control,” The Annals of Statistics, 32 (3), 1035–1061. [85]

Guo, W., Sarkar, S., and Peddada, S. (2010), “Controlling False Discoveries in Multidimensional Directional Decisions, With Applications to Gene Expression Data on Ordered Categories,” Biometrics, 66 (2), 485–492. [77,78]

Guo, X., Qi, H., Verfaillie, C. M., and Pan, W. (2003), “Statistical Significance Analysis of Longitudinal Gene Expression Data,” Bioinformatics, 19 (13), 1628–1635. [73]

Heller, R., Stanley, D., Yekutieli, D., Rubin, N., and Benjamini, Y. (2006), “Cluster-Based Analysis of fMRI Data,” NeuroImage, 33 (2), 599–608. [74]

Hong, F., and Li, H. (2006), “Functional Hierarchical Models for Identifying Genes With Different Time-Course Expression Profiles,” Biometrics, 62 (2), 534–544. [74]

Hubbell, E., Liu, W.-M., and Mei, R. (2002), “Robust Estimators for Expression Analysis,” Bioinformatics, 18 (12), 1585–1592. [81]

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003), “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data,” Biostatistics, 4 (2), 249–264. [81]

Kendziorski, C. M., Newton, M. A., Lan, H., and Gould, M. N. (2003), “On Parametric Empirical Bayes Methods for Comparing Multiple Groups Using Replicated Gene Expression Profiles,” Statistics in Medicine, 22 (24), 3899–3914. [74,76]

Krogh, A., Brown, M., Mian, I. S., Sjölander, K., and Haussler, D. (1994), “Hidden Markov Models in Computational Biology. Applications to Protein Modeling,” Journal of Molecular Biology, 235 (5), 1501–1531. [74]

Leroux, B. G. (1992), “Maximum-Likelihood Estimation for Hidden Markov Models,” Stochastic Processes and Their Applications, 40 (1), 127–143. [77]

Li, C., and Wong, W. H. (2001), “Model-Based Analysis of Oligonucleotide Arrays: Expression Index Computation and Outlier Detection,” Proceedings of the National Academy of Sciences of the USA, 98 (1), 31–36. [81]

Luan, Y., and Li, H. (2004), “Model-Based Methods for Identifying Periodically Expressed Genes Based on Time Course Microarray Gene Expression Data,” Bioinformatics, 20 (3), 332–339. [74]

Ma, P., Zhong, W., and Liu, J. S. (2009), “Identifying Differentially Expressed Genes in Time Course Microarray Data,” Statistics in Biosciences, 1, 144–159. [73,77]

MacDonald, I. L., and Zucchini, W. (1997), Hidden Markov and Other Models for Discrete-Valued Time Series, New York: Chapman & Hall. [74]

Newton, M., Kendziorski, C., Richmond, C., Blattner, F., and Tsui, K. (2001), “On Differential Variability of Expression Ratios: Improving Statistical Inference About Gene Expression Changes From Microarray Data,” Journal of Computational Biology, 8 (6), 37–52. [74,76]

Park, T., Yi, S.-G., Lee, S., Lee, S. Y., Yoo, D.-H., Ahn, J.-I., and Lee, Y.-S. (2003), “Statistical Tests for Identifying Differentially Expressed Genes in Time-Course Microarray Experiments,” Bioinformatics, 19 (6), 694–703. [73]

Pyne, S., Futcher, B., and Skiena, S. (2006), “Meta-Analysis Based on Control of False Discovery Rate: Combining Yeast Chip-Chip Datasets,” Bioinformatics, 22 (20), 2516–2522. [74]

Rabiner, L. R. (1989), “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, 77, 257–286. [74,77]

Robbins, H. (1951), “Asymptotically Subminimax Solutions of Compound Statistical Decision Problems,” in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, Berkeley and Los Angeles: University of California Press, pp. 131–148. [83]

Schliep, A., Schonhuth, A., and Steinhoff, C. (2003), “Using Hidden Markov Models to Analyze Gene Expression Time Course Data,” Bioinformatics, 19 (Suppl 1), i255–i263. [74,76]

Smyth, G. K., and Speed, T. (2003), “Normalization of cDNA Microarray Data,” Methods, 31 (4), 265–273. [81]

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization,” Molecular Biology of the Cell, 9 (12), 3273–3297. [73]

Storey, J. D. (2003), “The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value,” The Annals of Statistics, 31 (6), 2013–2035. [85]

Storey, J. D., Xiao, W., Leek, J. T., Tompkins, R. G., and Davis, R. W. (2005), “Significance Analysis of Time Course Microarray Experiments,” Proceedings of the National Academy of Sciences of the USA, 102 (36), 12837–12842. [74]

Sun, W., and Cai, T. T. (2007), “Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control,” Journal of the American Statistical Association, 102 (479), 901–912. [85]

Sun, W., and Cai, T. T. (2009), “Large-Scale Multiple Testing Under Dependence,” Journal of the Royal Statistical Society, Ser. B, 71 (2), 393–424. [74,77,85]

Tai, Y. C., and Speed, T. P. (2006), “A Multivariate Empirical Bayes Statistic for Replicated Microarray Time Course Data,” The Annals of Statistics, 34 (5), 2387–2412. [73]

Telesca, D., Inoue, L. Y. T., Neira, M., Etzioni, R., Gleave, M., and Nelson, C. (2009), “Differential Expression and Network Inferences Through Functional Data Modeling,” Biometrics, 65 (3), 793–804. [74]

Tian, B., Nowak, D. E., and Brasier, A. R. (2005), “A TNF-Induced Gene Expression Program Under Oscillatory NF-KappaB Control,” BMC Genomics, 6, 137. [73]

Viterbi, A. J. (1967), “Error Bounds for Convolutional Codes and an Asymptotically Optimal Decoding Algorithm,” IEEE Transactions on Information Theory, 13, 268–278. [74]

Yuan, M., and Kendziorski, C. (2006), “Hidden Markov Models for Microarray Time Course Data in Multiple Biological Conditions,” Journal of the American Statistical Association, 101 (476), 1323–1332. [73,74,76,77,81,84]

Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H., and Weir, B. S. (2002), “Truncated Product Method for Combining p-Values,” Genetic Epidemiology, 22 (2), 170–185. [74]