
    Syst. Biol. 49(4):652-670, 2000

    Likelihood-Based Tests of Topologies in Phylogenetics

    NICK GOLDMAN,1 JON P. ANDERSON,2 AND ALLEN G. RODRIGO3

    1 University Museum of Zoology, Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK; E-mail: [email protected]

    2 Department of Molecular Biotechnology, University of Washington, Seattle, Washington USA

    3 School of Biological Sciences, University of Auckland, Auckland, New Zealand

    Abstract. Likelihood-based statistical tests of competing evolutionary hypotheses (tree topologies) have been available for approximately a decade. By far the most commonly used is the Kishino-Hasegawa test. However, the assumptions that have to be made to ensure the validity of the Kishino-Hasegawa test place important restrictions on its applicability. In particular, it is only valid when the topologies being compared are specified a priori. Unfortunately, this means that the Kishino-Hasegawa test may be severely biased in many cases in which it is now commonly used: for example, in any case in which one of the competing topologies has been selected for testing because it is the maximum likelihood topology for the data set at hand. We review the theory of the Kishino-Hasegawa test and contend that for the majority of popular applications this test should not be used. Previously published results from invalid applications of the Kishino-Hasegawa test should be treated extremely cautiously, and future applications should use appropriate alternative tests instead. We review such alternative tests, both nonparametric and parametric, and give two examples which illustrate the importance of our contentions. [Kishino-Hasegawa test; maximum likelihood; phylogeny; Shimodaira-Hasegawa test; statistical tests; tree topology.]

    Hasegawa and Kishino (1989) and Kishino and Hasegawa (1989) developed methods for estimating the standard error and confidence intervals for the difference in log-likelihoods between two topologically distinct phylogenetic trees representing hypotheses that might explain particular aligned sequence data sets. The method initially was introduced to compute confidence intervals on posterior probabilities for topologies in a Bayesian analysis (Hasegawa and Kishino, 1989; Kishino and Hasegawa, 1989). Attention quickly turned to using the same ideas to perform nonparametric likelihood ratio tests (LRTs) of the statistical significance of topologies (Kishino and Hasegawa, 1989), as currently implemented in the popular PHYLIP (Felsenstein, 1995), MOLPHY (Adachi and Hasegawa, 1996), PUZZLE (Strimmer and von Haeseler, 1996; Strimmer et al., 1997), and PAUP* (Swofford, 1998) software packages. Tests based on the ideas of Kishino and Hasegawa (hereafter referred to as KH-tests) could overcome some of the peculiar properties of phylogenetic estimation (see, e.g., Yang et al., 1995), for example, that different topologies do not share the same sets of parameters and do not generally represent nested hypotheses (one hypothesis being a special case of another).

    Kishino and Hasegawa originally devised and applied their methods for trees that were specified a priori, that is, trees that corresponded to phylogenetic hypotheses derived independently of the data at hand. However, since then, the KH-test has in the majority of cases been used to compare the maximum likelihood (ML) tree derived from the data to one or more a priori-specified trees or to one or more a posteriori-specified trees (e.g., the trees with second-, third-, and so forth highest likelihoods). Such applications are facilitated, even encouraged, by various software packages. We contend that these latter applications of the KH-test are incorrect. We are not the first to note this: Swofford et al. (1996) observed that the KH-test should be applied only when the trees in question are specified a priori. Shimodaira and Hasegawa (1999) recently made the same point and described a correct nonparametric test for the case in which the ML tree is one of the tested topologies.

    In this paper, we review the published literature on likelihood-based tests of topologies in phylogenetics. We wish to draw attention to the inapplicability of the KH-test for many common situations, despite the fact that it has become the most widely used and generally accepted way of testing alternative


    2000 GOLDMAN ET AL.: TESTS OF TOPOLOGIES IN PHYLOGENETICS 653

    hypotheses of evolutionary relationship. We first look in detail at the KH-test, in particular reviewing the conditions for its correct application and considering various methods that can be used to generate the null hypothesis distribution of the KH-test statistic. We then consider alternative applications of the KH-test, which represent the majority of published applications, and discuss why these are incorrect. We contend that the bias inherent in the KH-test when applied incorrectly warrants that the results of such applications be treated extremely cautiously. Next, we describe both the nonparametric test introduced by Shimodaira and Hasegawa (1999) and a parametric LRT of topologies, described briefly by Swofford et al. (1996) and based on the parametric bootstrap or Monte Carlo simulation approach to hypothesis testing using likelihood ratio statistics advocated by Goldman and others (e.g., Goldman, 1993; Hillis et al., 1996; Huelsenbeck and Crandall, 1997; Huelsenbeck and Rannala, 1997).

    All of the methods described in this paper are equally relevant to ML analyses of DNA and amino acid (aa) data. We provide two analyses comparing the results of different tests, one using DNA sequences from HIV-1 isolates and one using aa sequences from mammalian mitochondrial (mt) proteins. These examples illustrate the importance of understanding precisely what hypotheses are being tested, the importance of performing statistically valid tests, and the differing power of parametric and nonparametric tests. We conclude with a discussion of the importance of our and others' recent work on the theory of tests of topologies.

    The likelihood-based test developed from the work of Hasegawa and Kishino (1989) and Kishino and Hasegawa (1989) is quite closely related to a parsimony-based test described by Templeton (1983). Kishino and Hasegawa (1989:177) also discussed parsimony-based analogs of their likelihood methods. Such parsimony-based tests are subject to precisely the same misuses described in this paper for likelihood-based tests. In their unmodified forms, they too should not be used with phylogenies that are not selected a priori. Derivation of their corrected forms could follow the same general approaches described below (Buckley et al., unpubl.).

    METHODS

    Terminology and Notation

    For the purposes of this paper, we are interested in the topologies of phylogenetic trees, not in the lengths of the branches on those trees. We use $T_j$ to denote the topology of the $j$th prespecified tree, and $T_{ML}$ for the topology of the ML tree for a given data set. Implicitly, because we focus on likelihood-based models, we assume that some model of evolution that allows us to define the probability of character-state changes can be specified before tree reconstruction. The vector of parameters for branch lengths (plus any other free parameters, e.g., in the model of nucleotide substitution being used) is written $\theta_x$ for topology $T_x$. $H_0$ and $H_A$ are, respectively, null and alternative hypotheses for statistical testing. $L_x$ is the log-likelihood of $T_x$ or $H_x$ for a given data set, generally maximized over all possible values of $\theta_x$, and $L_x(k)$ is the sitewise log-likelihood for site $k$ out of a total of $S$ sites, so $L_x = \sum_{k=1}^{S} L_x(k)$. For parametric or nonparametric bootstrapped data, $L_x^{(i)}$ is the log-likelihood of $T_x$ for the $i$th replicate data set, and $L_x^{(i)}(k)$ is the corresponding sitewise log-likelihood. We use $\delta$ to denote the difference in log-likelihoods between topologies ($\delta^{(i)}$ for the value from the $i$th replicate data set), $E[X]$ to denote the expectation of a statistic $X$, and $N(\mu, \sigma^2)$ to indicate a normal (gaussian) distribution with mean $\mu$ and variance $\sigma^2$.

    Fundamentals of the KH-Test

    Suppose we have two hypotheses (tree topologies) $T_1$ and $T_2$, selected a priori, and we want to test whether they are equally well supported by a data set. Intuitively, for any one data set we would expect that stochasticity and sampling would ensure $L_1 \neq L_2$, even when the null hypothesis is true. However, if we were somehow able to obtain several data sets, we would expect that on average $L_1 = L_2$ when the null hypothesis is true. Writing $\delta \equiv L_1 - L_2$, this intuition corresponds to $E[\delta] = 0$. In terms of a statistical test, our hypotheses are:

    $H_0$: $E[\delta] = 0$

    $H_A$: $E[\delta] \neq 0$.


    654 SYSTEMATIC BIOLOGY VOL. 49

    In this section we describe a nonparametric approach to performing this comparison. This method was essentially given by Hasegawa and Kishino (1989), although they did not use the procedure for significance testing.

    To perform a test of $H_0$ versus $H_A$, we need to know the distribution of $\delta$ under the null hypothesis. Working nonparametrically, we cannot derive the exact distribution of $\delta$ because $H_0$ does not specify a distribution for the data from which it is calculated. We are able, however, to implement the nonparametric bootstrap (Efron, 1982; Felsenstein, 1985; Efron and Tibshirani, 1986), the use of which requires the assumption that the data are a representative and independent sample from the true distribution of data.

    Table 1 summarizes all of the statistical tests discussed in this paper and explains the mnemonic naming system we have adopted. Table 2 relates this naming system to various published tests and software implementations (all described below). The procedure for the fundamental KH-test is now as follows:

    Test priNPfcd:

    • Calculate the test statistic $\delta \equiv L_1 - L_2$.

    • Resample data (repeated nonparametric bootstrap data sets $i$).

    • Reestimate any free parameters $\theta_1$, $\theta_2$ (branch lengths, and so forth) to get maximized log-likelihoods $L_1^{(i)}$ and $L_2^{(i)}$ under $T_1$ and $T_2$, respectively.

    • Hence calculate bootstrap values of $\delta^{(i)} \equiv L_1^{(i)} - L_2^{(i)}$.

    • For the values of $\delta^{(i)}$ to conform to $H_0$, we require $E[\delta^{(i)}] = 0$; to ensure this, replace the $\delta^{(i)}$ by $\tilde{\delta}^{(i)} \equiv \delta^{(i)} - \bar{\delta}^{(i)}$, where $\bar{\delta}^{(i)}$ is the mean over bootstrap replicates $i$ of $\delta^{(i)}$. This procedure is known as centering (see below), and the resulting set of values $\tilde{\delta}^{(i)}$ gives an estimate of the distribution of $\delta$ under $H_0$.

    • Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the distribution of the $\tilde{\delta}^{(i)}$ by seeing if it falls within the confidence interval for $E[\delta]$, given, for example, by the 2.5% and 97.5% points of the ranked list of the $\tilde{\delta}^{(i)}$. A two-sided test is appropriate (because we have no a priori expectation as to whether $T_1$ or $T_2$ should be preferred); in this example, a 5% significance level is being used.
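    The steps of test priNPfcd can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation of any published package: the callables `max_loglik_1` and `max_loglik_2` are hypothetical stand-ins for a routine that maximizes the log-likelihood of a data set on the fixed topologies $T_1$ and $T_2$, and the toy demonstration replaces phylogenetic likelihoods with two simple non-nested models (normal and Laplace) fitted to made-up numbers, purely so that the example runs.

```python
import numpy as np


def kh_test_full(sites, max_loglik_1, max_loglik_2, n_boot=100, alpha=0.05, seed=0):
    """Sketch of test priNPfcd (full re-optimization, centered, direct comparison).

    sites          : array whose first axis indexes the S sites
    max_loglik_1/2 : hypothetical callables returning the maximized
                     log-likelihood of a data set under T1 / T2
    """
    rng = np.random.default_rng(seed)
    sites = np.asarray(sites)
    n_sites = sites.shape[0]
    delta = max_loglik_1(sites) - max_loglik_2(sites)        # attained statistic
    boot = np.empty(n_boot)
    for i in range(n_boot):
        rep = sites[rng.integers(0, n_sites, size=n_sites)]  # resample sites with replacement
        boot[i] = max_loglik_1(rep) - max_loglik_2(rep)      # parameters re-estimated per replicate
    centered = boot - boot.mean()                            # centering: enforce E[delta] = 0
    lo, hi = np.quantile(centered, [alpha / 2, 1 - alpha / 2])
    reject = not (lo <= delta <= hi)                         # two-sided test
    return float(delta), (float(lo), float(hi)), reject


# Toy stand-ins (assumptions for illustration only, not phylogenetic models).
def normal_max_loglik(x):
    var = np.mean((x - x.mean()) ** 2)           # ML variance estimate
    return -0.5 * x.size * (np.log(2 * np.pi * var) + 1)


def laplace_max_loglik(x):
    scale = np.mean(np.abs(x - np.median(x)))    # ML scale estimate
    return -x.size * (np.log(2 * scale) + 1)


toy_data = np.random.default_rng(1).normal(size=200)
delta, interval, reject = kh_test_full(toy_data, normal_max_loglik, laplace_max_loglik)
print(delta, interval, reject)
```

    The loop makes the cost of this form of the test visible: both likelihood maximizations are repeated for every replicate.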

    Hope (1968) and Marriott (1979) consider the amount of resampling that is needed for reliable statistical properties. It seems reasonable to hope that 100 data sets will give a sufficiently accurate estimate of the distribution of $\delta$ for testing at the 5% level, for both nonparametric (resampled data) and parametric (simulated data; see below) tests. More stringent tests will require more replicate data sets.

    Hall and Wilson (1991) and Westfall and Young (1993:35) explain more fully the need for the centering procedure ($\tilde{\delta}^{(i)} \equiv \delta^{(i)} - \bar{\delta}^{(i)}$; Table 1, level 4, option c) to ensure conformity to the null hypothesis in this and other nonparametric tests. Comparing the observed value $\delta$ with the distribution of the $\delta^{(i)}$ (as is appropriate in parametric tests; see below and Table 1, level 4, option u) would give an invalid test. A test comparing the expected value of $\delta$ (i.e., 0) with the (uncentered) distribution of the $\delta^{(i)}$ is less inappropriate but is equivalent to comparing $\bar{\delta}^{(i)}$ with the distribution of the $\tilde{\delta}^{(i)}$. This in turn is equivalent to using $\bar{\delta}^{(i)}$ as an estimate of the test statistic $\delta$, which is inefficient and without any redeeming advantage (see also Efron et al., 1996). We note that under the more powerful normal approximations discussed below (Table 1, level 5, options a and s) a test comparing 0 with the distribution of the $\delta^{(i)}$ becomes equivalent to the test that compares $\delta$ with the distribution of the $\tilde{\delta}^{(i)}$.

    The first stages of this test (up to the calculation of the $\delta^{(i)}$) form the procedure at the heart of the work of Hasegawa and Kishino (1989). In that paper, however, significance testing of phylogenies is not contemplated (there is no mention of the hypothesis $E[\delta] = 0$) and instead the estimated distribution of $\delta$ is used to compute confidence intervals on posterior probabilities of different (a priori) topologies in a Bayesian analysis. The idea of using the distribution of the $\delta^{(i)}$ to perform a significance test of phylogenies based on $E[\delta] = 0$ was introduced by Kishino and Hasegawa (1989), with some methods being partially described in Hasegawa et al. (1988). To our knowledge, the above form (priNPfcd) of the KH-test has never been implemented. At the time of this test's introduction, one main reason for this was probably the computation time required for the step in which free parameters $\theta_1$, $\theta_2$ are reestimated for each bootstrap data set. (Nowadays this computational demand would not be problematic but this form of the test is still not used, probably because interest is typically in phylogenies, at least one of which has been chosen a posteriori; see below.) Accordingly, methods were devised to reduce the computational burden.

    TABLE 1. Components of statistical tests described in this paper. Each test is composed of one option chosen at each of the five levels. Not all combinations are valid, as indicated. Mnemonic codes are derived by concatenating the italicized letters labeling each option; the derivation of these mnemonics is indicated by underlining. See text for further details.

    Level 1: choice of trees to test
        pri: trees chosen a priori
        pos: includes tree(s) selected a posteriori, from analysis of data to be used for testing

    Level 2: statistical approach
        NP: nonparametric
        P: parametric

    Level 3: optimization method for bootstrapped data
        f: full: all parameters estimated from data (and optimization over multiple topologies with pos option at level 1)
        p: partial: some parameters fixed at values estimated from data
        n: no optimization for bootstrapped data (gives rise to RELL methods with d or n options at level 5)

    Level 4: test statistic and distribution against which it is compared
        c: centered: attained $\delta$ vs. distribution of centered nonparametric estimates $\tilde{\delta}^{(i)} \equiv \delta^{(i)} - \bar{\delta}^{(i)}$ (only with NP option at level 2)
        u: uncentered: attained $\delta$ vs. distribution of parametric estimates $\delta^{(i)}$ (only with P option at level 2)

    Level 5: how test is performed or confidence intervals are generated
        d: comparison of test statistic directly with its estimated distribution
        n: assumption of normal distribution for test statistic (applicable only with pri option at level 1)
        a: additional normal assumption (variance of $\delta$ estimated from variance of sitewise $\delta(k)$; applicable only with both pri and n options at levels 1 and 3, respectively)
        s: stronger assumption of normal distribution for sitewise $\delta(k)$ (applicable only in combination with both pri and n options at levels 1 and 3, respectively)

    TABLE 2. Relationships of statistical tests previously described and popular software implementations with tests described in this paper, with additional notes.

    Kishino-Hasegawa test, fundamental concept
        In notation of this paper: priNPfcd
        Notes: Nonparametric test; never before published/implemented in this form (a)

    Hasegawa and Kishino (1989)
        In notation of this paper: priNPf·· (b)
        Notes: Distribution of $\delta$ derived in a different context; no statistical tests specified

    Kishino and Hasegawa (1989)
        In notation of this paper: priNPn·· (b)
        Notes: RELL and normal approximations introduced; statistical tests only briefly discussed

    PHYLIP (Felsenstein, 1995) and PUZZLE (Strimmer and von Haeseler, 1996; Strimmer et al., 1997) implementations
        In notation of this paper: priNPnca
        Notes: Use additional normal assumption and perform (two-sided) z-test

    MOLPHY implementation (Adachi and Hasegawa, 1996)
        In notation of this paper: priNPnca
        Notes: Uses additional normal assumption; estimates variance of $\delta$ but performs no test

    PAUP* implementation (Swofford, 1998)
        In notation of this paper: priNPncs
        Notes: Uses stronger normal assumption and performs (two-sided) paired t-test

    Shimodaira and Hasegawa (1999) implementation of Kishino-Hasegawa test
        In notation of this paper: priNPncd
        Notes: Used in an example (with RELL and a one-sided test), for comparison with posNPncd

    Shimodaira-Hasegawa test, fundamental concept
        In notation of this paper: posNPfcd
        Notes: Nonparametric test; never before implemented in this form (a)

    Shimodaira and Hasegawa (1999)
        In notation of this paper: posNPncd
        Notes: Uses RELL; test described and used in an example

    SOWH-test, fundamental concept
        In notation of this paper: posPfud
        Notes: Parametric test; originally described by Swofford et al. (1996)

    SOWH-test, alternative implementation in this paper
        In notation of this paper: posPpud
        Notes: Used in an example (some approximations under $H_A$); see text for details

    (a) These tests, amongst others, are implemented in this paper for the HIV-1 data set.
    (b) Dots represent components of testing procedures that were not specified in the corresponding publications.

    The KH-Test: Time-Saving Approximations

    The fundamental KH-test (priNPfcd above) has the disadvantage that likelihood maximization is performed for each bootstrap replicate data set $i$. Although no maximization over topologies is required, because in this case only two a priori-specified topologies are being considered, Kishino and Hasegawa (1989) were concerned about the computation time needed. To reduce the computational burden, they developed a resampling estimated log-likelihood (RELL) technique (Table 1, level 3, option n; see also Kishino et al., 1990). In brief, they showed that instead of performing time-consuming likelihood optimizations for each bootstrap data set, one could use values of $\delta^{(i)}$ calculated with the optimized parameter values ($\theta_1$, $\theta_2$) from the original data set. Certain asymptotic conditions (correctly specified evolutionary models and sufficiently large amounts of data) are required for this approximation to be valid; the RELL method has been shown to perform well in the estimation of bootstrap probabilities of phylogenies (Hasegawa and Kishino, 1994; see also below for discussion of the possible effects of approximations to log-likelihood scores in LRTs). The necessary likelihood calculations now require no optimization after the initial analysis of the original data, which saves a large amount of computational effort. By using a prime symbol ($'$) to denote this form of approximation where parameters are not re-optimized for replicate data sets, the resulting test can be described as follows:

    Test priNPncd:

    • Calculate the test statistic $\delta \equiv L_1 - L_2$.

    • Resample data (repeated bootstrap data sets $i$).

    • Using the ML estimates of any free parameters $\theta_1$, $\theta_2$ (branch lengths and so forth) derived from the original data set, compute log-likelihoods $L'^{(i)}_1$ and $L'^{(i)}_2$ under $T_1$ and $T_2$, respectively (the resampling is effectively being made from the sitewise log-likelihoods $L_1(k)$ and $L_2(k)$ estimated under $H_0$; hence the RELL mnemonic).

    • Calculate bootstrap values of $\delta'^{(i)} \equiv L'^{(i)}_1 - L'^{(i)}_2$.

    • As before, perform the centering procedure $\tilde{\delta}'^{(i)} \equiv \delta'^{(i)} - \bar{\delta}'^{(i)}$.

    • Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the distribution of the $\tilde{\delta}'^{(i)}$ by seeing if it falls within the confidence interval for $E[\delta]$ given by appropriate points of the ranked list of the $\tilde{\delta}'^{(i)}$ (two-sided test).
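    Because RELL only resamples sitewise log-likelihoods, test priNPncd reduces to a few array operations. The sketch below assumes the sitewise log-likelihoods $L_1(k)$ and $L_2(k)$ have already been computed by some phylogenetics program; the function name `kh_test_rell` and the input values are illustrative assumptions, not output from a real analysis.

```python
import numpy as np


def kh_test_rell(site_ll_1, site_ll_2, n_boot=1000, alpha=0.05, seed=0):
    """Sketch of test priNPncd: RELL bootstrap, centered, direct comparison."""
    rng = np.random.default_rng(seed)
    d_k = np.asarray(site_ll_1, dtype=float) - np.asarray(site_ll_2, dtype=float)
    n_sites = d_k.size
    delta = d_k.sum()                                       # attained delta = L1 - L2
    idx = rng.integers(0, n_sites, size=(n_boot, n_sites))  # site indices for each replicate
    boot = d_k[idx].sum(axis=1)                             # delta'(i): no re-optimization
    centered = boot - boot.mean()                           # centering step
    lo, hi = np.quantile(centered, [alpha / 2, 1 - alpha / 2])
    return float(delta), (float(lo), float(hi)), not (lo <= delta <= hi)


# Made-up sitewise log-likelihoods for 500 sites (illustrative values only).
rng = np.random.default_rng(42)
ll_1 = rng.normal(-5.0, 1.0, size=500)
ll_2 = ll_1 + rng.normal(0.0, 0.3, size=500)
print(kh_test_rell(ll_1, ll_2))
```

    No likelihood function appears anywhere in the resampling loop, which is the source of the time saving.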

    Kishino and Hasegawa (1989) further showed that the difference ($\delta$) in log-likelihoods between two topologies specified a priori would follow a normal distribution, the mean and variance of which could be specified in terms of the differences in log-likelihoods ($\delta^{(i)}$) calculated for nonparametric bootstrap data sets $i$. This approximation, based on the Central Limit Theorem, requires the same asymptotic conditions as the RELL method for its validity. An alternative to using the direct comparison of the attained $\delta$ with the distribution of the $\tilde{\delta}'^{(i)}$, as in the last step of test priNPncd, is then to utilize this normal approximation for $\delta$ (Table 1, level 5, option n):

    Test priNPncn:

    • Proceed as in test priNPncd above, but replace the final step with the following:

    • Compute the variance of the $\tilde{\delta}'^{(i)}$ (denote this by $\nu^2$) and test whether the attained value of $\delta$ (from the original data) is a plausible sample from a $N(0, \nu^2)$ distribution, by seeing if it falls within the confidence interval for $E[\delta]$ given, for example, by $0 \pm 1.96\nu$ (two-sided test; 5% significance level in this example).
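    A sketch of test priNPncn follows, again assuming precomputed sitewise log-likelihoods. Only the final step differs from the direct comparison: the spread of the RELL replicates is summarized by its standard deviation $\nu$ and the attained $\delta$ is compared with $0 \pm 1.96\nu$. The function name is a hypothetical label.

```python
import numpy as np


def kh_test_rell_normal(site_ll_1, site_ll_2, n_boot=1000, z_crit=1.96, seed=0):
    """Sketch of test priNPncn: RELL replicates plus a normal approximation."""
    rng = np.random.default_rng(seed)
    d_k = np.asarray(site_ll_1, dtype=float) - np.asarray(site_ll_2, dtype=float)
    n_sites = d_k.size
    delta = d_k.sum()
    boot = d_k[rng.integers(0, n_sites, size=(n_boot, n_sites))].sum(axis=1)
    nu = boot.std(ddof=1)         # centering does not change the variance
    half_width = z_crit * nu      # interval 0 +/- 1.96 nu at the 5% level
    return float(delta), float(nu), bool(abs(delta) > half_width)
```

    Because subtracting the replicate mean leaves the variance unchanged, the centering step does not need to be carried out explicitly here.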

    In practice, these tests have rarely been implemented, to our knowledge. More usually, an additional assumption is also made: that the variance of $\delta$ can be estimated from the variance (over sites $k = 1, 2, \ldots, S$) of the sitewise log-likelihood differences $\delta(k) \equiv L_1(k) - L_2(k)$ (Table 1, level 5, option a; Kishino and Hasegawa, 1989). In this case a test can be made without any resampling, thus giving an even greater saving in time:

    Test priNPnca:

    • Calculate the test statistic $\delta \equiv L_1 - L_2$.

    • Using the ML estimates of any free parameters $\theta_1$, $\theta_2$ (branch lengths, and so forth) derived from the original data set, compute sitewise log-likelihoods $L_1(k)$ and $L_2(k)$, under $T_1$ and $T_2$, respectively, for the sites $k$ of the original data set.

    • Calculate the values $\delta(k) \equiv L_1(k) - L_2(k)$ and hence the centered values $\tilde{\delta}(k) \equiv \delta(k) - \bar{\delta}(k)$ and an estimate of their variance $\nu^2 = \sum_k (\tilde{\delta}(k))^2/(S - 1)$ (clearly, the variances of the $\delta(k)$ and the $\tilde{\delta}(k)$ are identical). Because $\delta \equiv \sum_k \tilde{\delta}(k) + S\bar{\delta}(k)$, and $\bar{\delta} = 0$ under $H_0$, $S\nu^2$ is an estimate of the variance of $\delta$.

    • Test whether the attained value of $\delta$ calculated from the original data is a plausible sample from a $N(0, S\nu^2)$ distribution, for example, by comparing it with the confidence interval $0 \pm 1.96\sqrt{S}\nu$ (two-sided test; 5% significance level in this example).
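    Test priNPnca needs no random numbers at all. A minimal sketch, with the function name and the four-site input invented for illustration:

```python
import math
import numpy as np


def kh_test_sitewise(site_ll_1, site_ll_2, z_crit=1.96):
    """Sketch of test priNPnca: variance of delta from sitewise differences."""
    d_k = np.asarray(site_ll_1, dtype=float) - np.asarray(site_ll_2, dtype=float)
    n_sites = d_k.size
    delta = d_k.sum()
    centered = d_k - d_k.mean()
    nu2 = (centered ** 2).sum() / (n_sites - 1)   # sitewise variance nu^2
    se = math.sqrt(n_sites * nu2)                 # estimated s.d. of delta
    return float(delta), se, bool(abs(delta) > z_crit * se)


print(kh_test_sitewise([0.0, 0.0, 0.0, 0.0], [-1.0, -2.0, -1.0, -2.0]))
```

    In the toy call the sitewise differences are 1, 2, 1, 2, so $\delta = 6$, $\nu^2 = 1/3$, and $\sqrt{S}\nu = \sqrt{4/3}$; real alignments would of course have far more than four sites, and the normal approximation relies on that.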

    This is the method implemented in various programs of the PHYLIP (Felsenstein, 1995), MOLPHY (Adachi and Hasegawa, 1996), and PUZZLE (Strimmer and von Haeseler, 1996; Strimmer et al., 1997) packages. (Programs in MOLPHY compute $\sqrt{S}\nu$ but leave statistical testing to the user.)

    In Swofford's (1998) PAUP* program, a stronger assumption is made (D. Swofford, pers. comm.): that the sitewise log-likelihood differences $\delta(k)$ are themselves normally distributed (Table 1, level 5, option s). This assumption, if accurate, guarantees the accuracy of the normal approximations described above (Table 1, level 5, options n and a) and permits a different test to be performed directly on the sitewise log-likelihoods:

    Test priNPncs:

    • Proceed as in test priNPnca above, but replace the final two steps with the following:

    • Perform a paired t-test of the $L_1(k)$ and $L_2(k)$ (pairs $\{L_1(1), L_2(1)\}$, $\{L_1(2), L_2(2)\}$, ..., $\{L_1(S), L_2(S)\}$) to determine if the means of the $\{L_1(k)\}$ and $\{L_2(k)\}$ are equal (two-sided test).

    We know of no theoretical justification for this additional assumption, and it does not give rise to any significant saving in computation time. However, we would expect that it will only make the smallest of differences in real applications (with a large number of sites $S$), and the significance levels reported by PAUP* and by the DNAML program of PHYLIP are invariably very similar (J. Felsenstein, pers. comm.; D. Swofford, pers. comm.; J.P. Anderson, unpubl.). Note that both the priNPnca and priNPncs tests require no resampling and so are even faster than the priNPncd and priNPncn tests, which use RELL methods.
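    A short worked step, assuming the standard definition of the paired t statistic, suggests why the two sets of significance levels should agree so closely. A paired t-test on the pairs $\{L_1(k), L_2(k)\}$ is a one-sample t-test on the differences $\delta(k)$, so with $\nu^2$ defined as in test priNPnca the statistic is

    \[
    t = \frac{\bar{\delta}(k)}{\nu/\sqrt{S}} = \frac{\delta/S}{\nu/\sqrt{S}} = \frac{\delta}{\sqrt{S\nu^{2}}}.
    \]

    This is the same quantity that test priNPnca refers to a standard normal distribution; test priNPncs refers it instead to a Student t distribution with $S - 1$ degrees of freedom, and the two reference distributions are nearly indistinguishable when $S$ is large.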

    Incorrect Usage of the KH-Test

    Many of the arguments in the derivations of the statistical tests above are strongly dependent on the topologies $T_1$ and $T_2$ having been selected independently of any analysis of the data used for the testing. In particular, this assumption is necessary to justify the fundamental hypothesis $H_0$: $E[\delta] = 0$. Unfortunately, when the selection of topologies has been made with reference to the data, especially if they have been selected by a criterion linked to their likelihood scores, this expectation is no longer justified. Two particular cases in which it is not reasonable to expect $E[\delta] = 0$ are (1) the comparison of the tree found to have the maximal likelihood, $T_{ML}$, with an a priori tree, $T_1$, and (2) the comparison of $T_{ML}$ with a tree selected for having the second- (or third-, and so forth) highest likelihood. In fact, in both of these cases $E[\delta] > 0$. $T_{ML}$ is selected exactly because its likelihood for the data at hand is greater than that of any other tree: in other words, it is guaranteed that $L_{ML}$ will be at least as large as $L_1$ or the log-likelihood of any other topology.

    This is not a minor discrepancy: no result suggests that $E[\delta]$ will be even near to 0 in these cases. Further, given that necessarily $E[\delta] > 0$, two-sided tests are no longer appropriate: we are interested in assessing deviations only in one direction from expectation, and one-sided tests are required. In our experience, however, situations such as those just described in which trees are selected a posteriori are precisely those for which the KH-test has most often been used. Indeed, even Hasegawa and coworkers appear to omit consideration of whether $E[\delta] = 0$ is a valid assumption (Hasegawa et al., 1988:8; Kishino and Hasegawa, 1989:175), even though Kishino and Hasegawa (1989:177, equation 20 and following) recognize that $E[\delta]$ depends on how the topologies being assessed were chosen.

    We believe that the results of all such analyses using the KH-test are invalid and require recomputation by methods such as those described later in this paper. We stress that this is not a minor, obscure, or purely hypothetical point, of interest only to theoretical statisticians. Although the KH-test is suitable for the questions it was designed to answer, it is entirely inappropriate for use in testing the significance of trees selected by ML from the data which are to be used for the testing. We can find only one possible adjustment applicable to the results of incorrect applications of KH-tests that may render them useful; this is discussed below.

    To help illustrate why we can no longer base tests on the hypothesis $E[\delta] = 0$, we use an analogy of running races. The coach of a running squad is interested to know whether two runners, Matthew and Mark, differ significantly in their average running times for the 100 m sprint. To determine this, he times the two runners when they participate in several races. For each race, the coach calculates the difference in running times $t$ between Matthew and Mark: $\delta$(Matthew, Mark) $\equiv$ $t$(Matthew) $-$ $t$(Mark). Note that $\delta$(Matthew, Mark) can sometimes be positive and sometimes be negative, depending on who runs faster in any given race; in fact, if Matthew and Mark are equally good at the 100 m sprint, then the average value of $\delta$(Matthew, Mark) over many races will tend to 0. In fact, as the team statistician approvingly explains, the data the coach has collected can be used to estimate the variance of $\delta$(Matthew, Mark) and, consequently, to test the following hypothesis:

    $H_0$: $E[\delta$(Matthew, Mark)$] = 0$

    $H_A$: $E[\delta$(Matthew, Mark)$] \neq 0$.

    This is analogous to the KH-test for phylogenies, with Matthew and Mark corresponding to two a priori topologies $T_1$ and $T_2$, each race equivalent to a sample of data, $t$(Matthew) and $t$(Mark) equivalent to the log-likelihoods $L_1$ and $L_2$, respectively, and $\delta$(Matthew, Mark) corresponding to the difference in log-likelihoods $\delta$.

    The coach is also interested in another runner, Luke, whom he believes to be the fastest runner on his squad. Given his success at collecting data for the earlier hypothesis test, the coach decides to do something similar with Luke. He obtains running times for Luke, $t$(Luke), over several races amongst the squad members, as well as the fastest time for each race, $t$(fastest). For each race he computes the difference between these times, $\delta$(Luke, fastest) $\equiv$ $t$(Luke) $-$ $t$(fastest), arguing that if Luke truly is the fastest then, over many races, the average of $\delta$(Luke, fastest) will be zero. However, as the team statistician points out, this assumption is necessarily false. The reason for this is simple. If Luke truly is the fastest, then we may expect that in the majority of the races he participates in, his time is the fastest time, i.e., $t$(Luke) $=$ $t$(fastest) and $\delta$(Luke, fastest) $= 0$. However, we also expect that some other squad members will manage to win some races, if only infrequently, so that $\delta$(Luke, fastest) $> 0$. Note that it is never possible that $\delta$(Luke, fastest) $< 0$, and consequently the average of $\delta$(Luke, fastest) over several races (the majority with $\delta$(Luke, fastest) $= 0$ and some with $\delta$(Luke, fastest) $> 0$) must necessarily be $> 0$. Variations in the squad members' performances in different races ensure that even if none is systematically faster than Luke, there is a chance that someone of equal or lower ability will appear to outperform Luke in any one race. In fact, the bigger the squad, the more likely this is, and the greater the level of outperformance one can expect. The statistical test used should reflect this fact and cannot be based on $E[\delta$(Luke, fastest)$] \equiv E[t$(Luke) $-$ $t$(fastest)$] = 0$.
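    The statistician's point can be checked with a small simulation. The sketch below is an illustration under assumed numbers (race times drawn from a normal distribution with mean 11.0 s and standard deviation 0.2 s, chosen arbitrarily) and considers the boundary case in which every squad member, Luke included, has exactly the same ability.

```python
import numpy as np


def luke_gap(squad_size, n_races=20000, seed=0):
    """Simulate delta(Luke, fastest) = t(Luke) - t(fastest) over many races.

    Every runner, Luke included, draws race times from the same
    distribution, so no one is systematically faster than Luke.
    """
    rng = np.random.default_rng(seed)
    times = rng.normal(loc=11.0, scale=0.2, size=(n_races, squad_size))  # seconds
    gap = times[:, 0] - times.min(axis=1)      # Luke is column 0
    return float(gap.mean()), float(gap.min())


for size in (2, 5, 20):
    mean_gap, smallest_gap = luke_gap(size)
    print(size, round(mean_gap, 3), smallest_gap >= 0.0)
```

    The gap is never negative, its average is strictly positive, and the average grows with the size of the squad, which is the behavior a test based on $E[\delta] = 0$ fails to allow for.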

    This example is analogous to the common but incorrect application of the KH-test, when $T_{ML}$ and $T_1$ are identified with the fastest runner and Luke, respectively, $L_{ML}$ and $L_1$ with $t$(fastest) and $t$(Luke), and $\delta$ with $\delta$(Luke, fastest). It would be possible, but of little interest here, to devise a correct test based on runners selected a posteriori. Instead, we will revert to discussing phylogenetic examples and will describe two tests that can be used in place of the KH-test when it is not applicable.


    The Shimodaira-Hasegawa Test: A Corrected Nonparametric Test of Topologies

    Although it has been noted in the past that the KH-test is suitable only for cases where $E[\delta] = 0$ (Swofford et al., 1996), it appears that Shimodaira and Hasegawa (1999) are the first to publish a full explanation of why the test is not appropriate when one or more topologies under test were selected with reference to the same data being used for testing. Our arguments in the preceding section are essentially an extended version of those of Shimodaira and Hasegawa (1999). Based on earlier work by Shimodaira (1993, 1998), Shimodaira and Hasegawa (1999) have proposed a nonparametric test similar to the KH-test but making the appropriate allowance for the method by which topologies are usually selected for statistical comparison. The Shimodaira-Hasegawa test (SH-test) simultaneously compares all topologies in some set $M$ and makes appropriate allowance for these multiple comparisons. It is necessary that $M$ contains every topology that can possibly be entertained as the true topology, to ensure that the true topology is always available to be the ML topology for any bootstrap data set; if this condition is not met, the significance levels computed will be inaccurate (Westfall and Young, 1993:48). In addition, selection of topologies for the set $M$ should be made a priori and not with reference to the observed data; otherwise, significance levels will again be inaccurate. Choosing $M$ to be the set of all possible topologies is always safe, if conservative.

    We cannot write the null hypothesis as $E[d] = 0$ for this test; instead, the hypotheses tested are as follows:

    H0: all $T_x \in M$ (including $T_{ML}$, the ML tree) are equally good explanations of the data

    HA: some or all $T_x \in M$ are not equally good explanations of the data

    and the test may proceed as follows:

    Test posNPfcd:

    • Calculate a test statistic $d_x$ for each topology $T_x \in M$: $d_x$ is the attained value of $L_{ML} - L_x$.

    • Generate nonparametric bootstrap replicate data sets i and for each one maximize likelihoods over parameters $\theta_x$ for each permitted topology $T_x$, giving optimal log-likelihood values $L_x^{(i)}$.

    • For each topology $T_x$, form the adjusted log-likelihood $\tilde{L}_x^{(i)} \equiv L_x^{(i)} - \bar{L}_x^{(i)}$ by subtracting $\bar{L}_x^{(i)}$, the mean over replicates i of $L_x^{(i)}$, from each value of $L_x^{(i)}$; this is the centering method devised by Shimodaira (1998), which is appropriate for enforcing that the resampled data conform to H0 for this a posteriori test.

    • For each replicate i, find $\tilde{L}_{ML}^{(i)}$, the maximum over topologies $T_x$ of the adjusted log-likelihoods $\tilde{L}_x^{(i)}$, and form bootstrap replicate statistics $d_x^{(i)} \equiv \tilde{L}_{ML}^{(i)} - \tilde{L}_x^{(i)}$; this allows for the a posteriori selection of $T_{ML}$.

    • For each topology $T_x$, test whether the attained $d_x$ is a plausible sample from the distribution (over replicates i) of the $d_x^{(i)}$ by seeing if it falls within the confidence interval for $E[d_x]$ given, for example, by the interval between 0 and the 95% point of the ranked list of the $d_x^{(i)}$. Such a one-sided test is appropriate, because we know that only $\tilde{L}_{ML}^{(i)} \geq \tilde{L}_x^{(i)}$ is possible; in this example, a 5% significance level is being used.

    (We have used some notation different from that of Shimodaira and Hasegawa [1999], to maintain a consistent style within this paper. Shimodaira and Hasegawa [1999] use $T_a$, $a = 1, 2, \ldots, M$; $L_{ai}$; $R_{ai}$; and $S_{ai}$, where we have used $T_x \in M$; $L_x^{(i)}$; $\tilde{L}_x^{(i)}$; and $d_x^{(i)}$, respectively.)
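    The centering and comparison steps of the SH-test can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the list-of-lists input layout are assumptions, and the replicate log-likelihoods are taken as already computed (by full optimization or otherwise). The P-value for each topology is estimated as the proportion of replicate statistics at least as large as the attained one.

```python
def sh_test_pvalues(L_obs, L_boot):
    """One-sided SH-test P-values, one per topology.

    L_obs[x]     : log-likelihood of topology x on the original data.
    L_boot[i][x] : log-likelihood of topology x on bootstrap replicate i.
    """
    n_rep, n_top = len(L_boot), len(L_obs)
    # Attained statistics d_x = L_ML - L_x.
    d_obs = [max(L_obs) - lx for lx in L_obs]
    # Centering: the mean over replicates is computed separately for each topology.
    means = [sum(rep[x] for rep in L_boot) / n_rep for x in range(n_top)]
    exceed = [0] * n_top
    for rep in L_boot:
        centered = [rep[x] - means[x] for x in range(n_top)]
        best = max(centered)  # adjusted log-likelihood of this replicate's best topology
        for x in range(n_top):
            if best - centered[x] >= d_obs[x]:
                exceed[x] += 1
    return [count / n_rep for count in exceed]
```

    Under this sketch the ML topology always receives a P-value of 1, because its attained statistic is 0 and no replicate statistic can be negative.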

    Time-saving approximations are also possible with this test. Shimodaira and Hasegawa (1999) propose the use of the RELL method for finding approximate values of $L_x^{(i)}$ without having to re-optimize $\theta_x$ for each replicate data set. This test, implemented by Shimodaira and Hasegawa (1999) and Buckley et al. (unpubl.), can be described as follows:

    Test posNPncd:

    • Calculate a test statistic $d_x$ for each topology $T_x \in M$: $d_x$ is the attained value of $L_{ML} - L_x$.

    • Generate nonparametric bootstrap replicate data sets i; for each one, and for each tree $T_x$, use the ML estimates $\theta_x$ of any free parameters derived for each tree $T_x$ from the original data set to compute log-likelihoods $L_x^{\prime(i)}$, which approximate the optimized values $L_x^{(i)}$ in test posNPfcd above.

    • For each topology $T_x$, form the adjusted log-likelihood $\tilde{L}_x^{\prime(i)} \equiv L_x^{\prime(i)} - \bar{L}_x^{\prime(i)}$ (centering).

    • For each replicate i, find $\tilde{L}_{ML}^{\prime(i)}$, the maximum over topologies $T_x$ of the adjusted log-likelihoods $\tilde{L}_x^{\prime(i)}$, and form bootstrap replicate statistics $d_x^{\prime(i)} \equiv \tilde{L}_{ML}^{\prime(i)} - \tilde{L}_x^{\prime(i)}$.

    • For each topology $T_x$, test whether the attained $d_x$ is a plausible sample from the distribution (over replicates i) of the $d_x^{\prime(i)}$ by seeing if it falls within the confidence interval for $E[d_x]$ given, for example, by 0 and the 95% point of the ranked list of the $d_x^{\prime(i)}$ (one-sided test; 5% significance level used in this example).
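    The RELL step amounts to resampling per-site log-likelihoods that were computed once on the original data. The sketch below assumes those per-site values are available as a list of rows, one row per topology; the function name and input layout are illustrative assumptions rather than the interface of any existing program.

```python
import random

def rell_replicates(site_loglik, n_rep, seed=0):
    """RELL approximation to bootstrap replicate log-likelihoods.

    site_loglik[x][k] : log-likelihood of site k under topology x, computed once
                        with parameters fixed at their ML estimates from the
                        original data.
    Returns L_boot[i][x] for replicates i and topologies x.
    """
    rng = random.Random(seed)
    n_sites = len(site_loglik[0])
    L_boot = []
    for _ in range(n_rep):
        # Resample site indices with replacement; the same sample is used for
        # every topology so that replicate values are comparable across topologies.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        L_boot.append([sum(row[k] for k in sample) for row in site_loglik])
    return L_boot
```

    No likelihood is re-optimized here, which is where the time saving comes from; the returned values stand in for the optimized replicate log-likelihoods in the centering and comparison steps.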

    Note that the SH-test simultaneously assesses the significance level for each of the topologies $T_x \in M$. It immediately reduces to a version of the KH-test, modified for the comparison of a priori-selected topology $T_1$ and a posteriori-selected topology $T_{ML}$, when attention is restricted to the significance level computed for $d_1$ from the distribution of the $d_1^{(i)}$ or $d_1^{\prime(i)}$. Note, however, that the set M of all plausible topologies still has to be considered to compute this distribution.

    The effect of the new centering procedure introduced in this method is to decrease the significance accorded to the difference $d_x$ in log-likelihoods between each topology $T_x$ and the ML topology $T_{ML}$ (Shimodaira and Hasegawa, 1999), in comparison with the significance indicated by the corresponding but inappropriate KH-test. Intuitively, this is because the attained value of $d_x$ should be attributed to two components: one (necessarily positive) being a consequence of the selection of $T_{ML}$ precisely because it has the highest likelihood, and another (of unknown sign) attributable to the difference in the abilities of $T_x$ and $T_{ML}$ to explain the observed data. Whereas the SH-test correctly compares $T_x$ and $T_{ML}$ on the basis of the second component alone, making an appropriate allowance for the first component, the incorrectly applied KH-test assesses both components combined as though they were only the second component. The fact that the first component is necessarily > 0 acts to make the new test more conservative (i.e., less likely to reject the null hypothesis). However, the SH-test correctly uses a one-sided test, and this acts to increase the significance of results.

    Is It Possible to Salvage the Results of Incorrect Applications of the KH-Test?

    We can find only one possible adjustment that might render some previously published results from incorrect applications of the KH-test useful in the light of the theoretical advances described in this paper. It is straightforward to convert the significance level of a two-sided test to that of a one-sided test: the P-value should simply be halved. If the P-value obtained from an incorrectly applied KH-test is p, then the P-value that would be obtained in the SH-test is necessarily at least p/2. Therefore, if the adjusted value p/2 is large enough to indicate no rejection of the null hypothesis (a priori tree $T_1$), e.g., p/2 > 0.05 for a 5% significance level, we can be certain that using the SH-test would give the same conclusion.
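    This halving rule can be written as a small decision helper. The function name and return strings below are hypothetical; the logic is only the rule stated in the text, namely that p/2 is a lower bound on the SH-test P-value, so it can confirm non-rejection but can never confirm rejection.

```python
def salvage_kh_result(p_two_sided, alpha=0.05):
    """Interpret a two-sided P-value from an incorrectly applied KH-test."""
    lower_bound = p_two_sided / 2.0  # the SH-test P-value is at least this large
    if lower_bound > alpha:
        return "SH-test would not reject T1"
    # A small p/2 says nothing certain: the SH-test P-value may still exceed alpha.
    return "inconclusive: reanalysis needed"

print(salvage_kh_result(0.38))  # a two-sided value of the size reported in Table 3
print(salvage_kh_result(0.04))
```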

    However, in all other cases (where the adjusted value p/2 is sufficiently small to indicate rejection of the null hypothesis in favor of the ML tree, e.g., p/2 < 0.05 for a 5% significance level) we cannot assume that this result would hold under a SH-test, which must give a P-value that would exceed p/2 by an unknown amount. Note that this will necessarily be the case whenever p indicated rejection of the null hypothesis in the incorrectly applied KH-test (e.g., p < 0.05 implies p/2 [...] 0; in this example, a 5% significance level is being used.

    Notice that the test statistic $d$ is the same as in the KH- and SH-tests. The use of $T_{ML}$ in the test means that the assumption $E[d] = 0$ cannot be made, but the use of parametric bootstrapping to generate data conforming to the null hypothesis means that this presents no problem: the repeated analysis of parametric bootstrap data sets guarantees the appropriate statistical properties. (Indeed, $E[d]$ may be estimated by $\bar{d}^{(i)}$, the mean of the $d^{(i)}$.) The fact that the data necessarily conform to the null hypothesis is also the reason that no centering procedure is necessary for this test (Table 1, level 4, option u).

    This test has a substantial time penalty, however, caused by the need to repeatedly maximize likelihoods over topologies under the hypothesis HA. The same penalty exists in the basic SH-test (posNPfcd) above but is avoided by using the RELL method (test posNPncd above). Although we do not have theoretical results to justify applying all the most useful approximations described above for nonparametric tests, we have several suggestions for possible ways to reduce the computational burden of the above parametric test.

    The first suggestion is to use RELL-like methods applied only to the a priori-specified null hypothesis topology $T_1$ (Table 1, level 3, option p):

    Test posPpud (approximation under H0):

    • Calculate the test statistic $d \equiv L_{ML} - L_1$.

    • Simulate data sets i by parametric bootstrapping, based on the null hypothesis topology $T_1$ and the ML estimates of any free parameters, $\theta_1$, derived for $T_1$ from the original data set.

    • Use $T_1$ and the ML estimates of parameters $\theta_1$ to get log-likelihoods $L_1^{\prime(i)}$ under H0.

    • Maximize likelihood over all topologies and their respective $\theta_x$ to get maximized log-likelihoods $L_{ML}^{(i)}$ under HA.

    • Calculate values of $d^{\prime(i)} \equiv L_{ML}^{(i)} - L_1^{\prime(i)}$.

    • Test whether the attained value of $d$ (from the original data) is a plausible sample from the estimated distribution of $d$ given by the set of the $d^{\prime(i)}$ by seeing if it falls below the 95% point of the ranked list of the $d^{\prime(i)}$ (one-sided test; 5% significance level used in this example).

    This saves on the maximization of parameters under the fixed topology $T_1$ and results in a small saving in computation time. It does not address the more difficult problem of repeated maximizations over topologies in the alternative hypothesis. As with other tests described above, the approximation under H0 can be trusted and the significance levels taken at face value. Alternatively, note that necessarily $L_1^{\prime(i)} \leq L_1^{(i)}$ and so $d^{\prime(i)} \geq d^{(i)}$. Therefore, if this test rejects H0 (the attained $d$ is too big), then this result is certain (the approximation cannot have changed the result). But if the test does not reject H0 (the attained $d$ is sufficiently small), we do not know whether it would have been rejected if the exact $d^{(i)}$ had been used in place of the $d^{\prime(i)}$, and the test does not give us a definitive result.
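    The direction of this bias can be illustrated numerically. In the sketch below the replicate values are invented purely for illustration, and the function name is an assumption; the point is only that when every approximate replicate statistic is at least as large as its exact counterpart, the estimated P-value cannot decrease, so a rejection obtained with the approximation would also be obtained without it.

```python
def one_sided_p(d_obs, d_reps):
    """Proportion of replicate statistics at least as large as the attained d."""
    return sum(1 for d in d_reps if d >= d_obs) / len(d_reps)

# Hypothetical replicate values, for illustration only.
d_exact = [0.0, 0.1, 0.3, 0.4, 1.5]      # full optimization under H0
d_approx = [d + 0.2 for d in d_exact]    # approximation under H0 inflates each value

p_exact = one_sided_p(1.0, d_exact)
p_approx = one_sided_p(1.0, d_approx)

# Elementwise larger replicate values can only raise the estimated P-value.
print(p_exact, p_approx, p_approx >= p_exact)
```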

    A similar approach applied to the alternative hypothesis will be less effective (e.g., for a test denoted posPnud). Although it provides a much greater scope for saving time under HA, because searches are performed over topologies, using fixed values of $T_{ML}$ and $\theta_{ML}$ from the ML analysis (over all trees) of the original data to assess the replicate data set is not sensible. The original $T_{ML}$ and corresponding $\theta_{ML}$ will probably be far from the optimal values for replicate data sets (which were simulated using the original $T_1$ and $\theta_1$), so the difference between $L_{ML}^{(i)}$ and its possible approximation $L_{ML}^{\prime(i)}$ may be large.

    However, ML estimates of some parameters of nucleotide substitution models are known to be quite stable over different topologies (e.g., Yang et al., 1994, 1998; Sullivan et al., 1996; Yang, 1997). Examples of such parameters are base frequencies ($\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$), the transition/transversion rate ratio $\kappa$, and the shape parameter $\alpha$ of the gamma distribution widely used to model among-sites rate variation. Therefore, we think it reasonable to use fixed values of these parameters estimated under H0 from each bootstrap data set i (i.e., the components of $\theta_1^{(i)}$ that are not the lengths of branches of $T_1$ and are thus common to all topologies $T_x$) when assessing that data set under HA. This gives the following test:

    Test posPpud (approximation under HA):

    • Calculate the test statistic $d \equiv L_{ML} - L_1$.

    • Simulate data sets i by parametric bootstrapping, based on the null hypothesis topology $T_1$ and the ML estimates of any free parameters, $\theta_1$, derived for $T_1$ from the original data set.

    • Use $T_1$ and reestimate free parameters $\theta_1$ to get maximized log-likelihoods $L_1^{(i)}$ under H0 (and respective optimal values of $\theta_1^{(i)}$).

    • Maximize likelihood over all topologies $T_x$ to get log-likelihoods $L_{ML}^{\prime(i)}$ under HA: these maximizations all fix the values of substitution process parameters to be equal to $\theta_1^{(i)}$, but the maximization is performed over topologies $T_x$ and their respective branch length parameters.

    • Calculate values of $d^{\prime(i)} \equiv L_{ML}^{\prime(i)} - L_1^{(i)}$.

    • Test whether the attained value of $d$ (from the original data) is a plausible sample from the estimated distribution of $d$ given by the set of the $d^{\prime(i)}$ by seeing if it falls below the 95% point of the ranked list of the $d^{\prime(i)}$ (one-sided test; 5% significance level used in this example).

    (Note that the two preceding tests both receive the mnemonic posPpud because they vary only in the form of the approximation used in their likelihood maximizations [Table 1, level 3]. A more complex naming system that would assign different mnemonics to these tests seems unwarranted in this paper.) Now, some substantial saving of time is made as the substitution process parameter values are fixed during the likelihood optimizations under HA. The greater problem of optimizing over topologies is still not addressed. For this test, we know that necessarily $L_{ML}^{\prime(i)} \leq L_{ML}^{(i)}$ and so $d^{\prime(i)} \leq d^{(i)}$. Therefore, if this test fails to reject H0 (attained $d$ not excessively large relative to the null hypothesis distribution of the $d^{\prime(i)}$), then this result is certain. If this test does reject H0, this could in principle be a consequence solely of the approximation. However, we expect the effect of this approximation to be small; in the example given below, it is insignificant and has no bearing on the conclusions reached.

    If approximations are made under both hypotheses, for example, by some combination of the two posPpud tests above (and as in tests priNPncd, priNPncn, priNPnca, priNPncs, and posNPncd above), it is no longer possible to make general statements about the direction of the bias that they produce in the $d^{(i)}$. The precise effects of the combination of such approximations in a posteriori parametric tests require further investigation. Approximations based on assumptions of a normal distribution for $d$ (Table 1, level 5, options n, a, and s) seem unlikely to be useful in tests designed for hypotheses chosen a posteriori, given that the necessary condition of $d \geq 0$ indicates a truncation of the distribution of $d$, which precludes normality.

    Other SOWH-Like Tests

    It is also straightforward to devise a parametric bootstrap test of the following hypotheses, which are akin to those of the fundamental KH-test for a priori-specified trees $T_1$ and $T_2$:

    H0: $T_1$ is the true topology

    HA: $T_2$ is the true topology

    This fundamental version of such a test would be denoted priPfud (see Table 1). Denoting the topology with second highest likelihood by $T_{ML2}$, we could also devise a parametric bootstrap test of the hypotheses:

    H0: $T_{ML2}$ is the true topology

    HA: $T_{ML}$ is the true topology

    This test would be based on modifications of posPfud by using the test statistic $d \equiv L_{ML} - L_{ML2}$; data simulated by using $T_{ML2}$ and $\theta_{ML2}$; ML analysis of simulated data sets to find the distinct topologies $T_{ML}^{(i)}$ and $T_{ML2}^{(i)}$ which give the greatest and second-greatest likelihoods, respectively; and $d^{(i)} \equiv L_{ML}^{(i)} - L_{ML2}^{(i)}$. Having introduced the general principles of such tests, we will not go into further details here. We also draw readers' attention to the related parametric bootstrap test of monophyly described by Huelsenbeck et al. (1996), which compares partially constrained topologies (chosen a priori) with the ML topology (chosen a posteriori).

    EXAMPLES

    HIV-1 Subtypes A, B, D, and E gag and pol Nucleotide Sequences

    Six homologous sequences, each consisting of 2,000 base pairs (bp) from the gag and pol genes, were selected from isolates of HIV-1 subtypes A (two sequences, A1 and A2), B (one sequence), D (one sequence), and E (two sequences, E1 and E2). The sequences were easily aligned by eye. The conventional phylogeny for these subtypes would group the two subtype A sequences and also the two subtype E sequences, that is, $T_1$ = ((A1, A2), (B, D), (E1, E2)), for which the optimal log-likelihood is $L_1 = -5{,}073.75$. For our sequences, however, the ML phylogeny indicated that the subtype A sequences did not cluster together; that is, $T_{ML}$ = (A1, (B, D), (A2, (E1, E2))) with $L_{ML} = -5{,}069.85$. In this example, all ML calculations were performed with the general time reversible model of nucleotide substitution, using a gamma distribution to model rate heterogeneity among sites (REV+Γ; Yang, 1994, 1996, 1997); the parameters $\theta_x$ for topology $T_x$ are branch lengths, base frequencies, parameters describing the relative rates of substitution between each nucleotide pair, and the shape parameter $\alpha$ of the gamma distribution.

    TABLE 3. Results of statistical tests of topologies for HIV-1 gag and pol gene nucleotide data set.

    | Test code | Notes | P-value (a) |
    |---|---|---|
    | priNPfcd | KH-test (incorrect application); full optimization; direct estimation of P-value | 0.38 (0.19) |
    | priNPfcn | KH-test (incorrect application); full optimization; normal approximation for distribution of d | 0.41 (0.20) |
    | priNPncs | KH-test (incorrect application); RELL approximation; stronger normal approximation for distribution of d(k) | 0.38 (0.19) |
    | posNPfcd | SH-test; full optimization; direct estimation of P-value | 0.26 |
    | posPfud | SOWH-test; full optimization; direct estimation of P-value | 0.002 |
    | posPpud | SOWH-test; partial optimization under HA; direct estimation of P-value | 0.002 |

    (a) First value is from a two-sided test, as widely used to date; second value, when present, is for the more appropriate one-sided test.

    We illustrate some of the statistical tests described above by investigating whether or not the data provide significant evidence that $T_{ML}$ is to be preferred over $T_1$. For all the tests we performed, the test statistic is $d \equiv L_{ML} - L_1 = -5{,}069.85 - (-5{,}073.75) = 3.90$. Because $T_{ML}$ has been selected for testing a posteriori, that is, as a consequence of having the highest likelihood, the KH-test is inappropriate (but was performed for comparative purposes), and the SH- or SOWH-tests should be used. These tests were performed as described above, with 1,000 replicates used whenever parametric or nonparametric bootstraps were performed. The results are summarized in Table 3.

    We performed three versions of the KH-test, two using full likelihood optimizations and computing the tests' P-values either directly (test priNPfcd) or by assuming a normal distribution for $d$ (test priNPfcn), and one using the strongest assumption of normality of the sitewise d(k) (test priNPncs). In all cases, both a two-sided test and a one-sided test were performed. The two-sided test is inappropriate for this a posteriori test (as indeed is the entire KH-test) but represents the computation performed in the most widely available implementations of the KH-test (PHYLIP, Felsenstein, 1995; PUZZLE, Strimmer and von Haeseler, 1996; Strimmer et al., 1997; and PAUP*, Swofford, 1998). The one-sided test is more suitable and, as described above, at least has the possibility of giving a statistically interpretable result. Indeed, that is the case in this example: the one-sided P-values of about 0.2 indicate no rejection of the null hypothesis, and as explained above, this conclusion must necessarily be maintained by the correction inherent in the SH-test. We also note the good agreement between the P-values calculated by the three variants of the KH-test.

    Our application of the SH-test used full likelihood optimizations (test posNPfcd) and permitted the consideration of three topologies as possibly true: $T_1$, $T_{ML}$, and the topology (A2, (B, D), (A1, (E1, E2))). We report only the significance level for the test of $T_1$ against $T_{ML}$. This test, with its allowance for the a posteriori selection of one topology, must give a higher P-value (i.e., a less significant result), and this is confirmed in Table 3. There seems no way to draw any general conclusions about the size of the difference in the P-values (0.26 vs. 0.19-0.20 for the KH-tests). Figure 1 shows the distribution of the 1,000 replicate values $d_1^{(i)}$ against which the attained value $d_1 = 3.90$ is compared. We conclude that the SH-test indicates no significant difference between $T_1$ and $T_{ML}$; therefore, we do not reject $T_1$ in favor of $T_{ML}$ for these data. We also note from Figure 1 that the minimum value of $d_1$ that would indicate rejection of the null hypothesis ($T_1$) in this example is 8.8 (the value of $d$ for which the SH-test distribution reaches a cumulative frequency of 0.95).

    The results of the SOWH-test are very different. We performed two versions of this test, one using full likelihood optimizations (test posPfud) and one using the approximation described above as test posPpud (approximation under HA). The differences between these two tests were negligible in this example, and both indicated a P-value of 0.002 (Fig. 1; Table 3). For these data, this test strongly rejects topology $T_1$ in favor of $T_{ML}$. As Figure 1 shows, any observed value of $d$ exceeding 1.2 would have resulted in rejection of $T_1$ at the 95% level. The attained value is 7.7 standard deviations above the mean of the SOWH-test statistic distribution.

    FIGURE 1. Test distributions for SH-test (nonparametric bootstrap) and SOWH-tests (parametric bootstrap) of topologies for HIV-1 gag and pol gene nucleotide data set. The histogram (right-hand y-axis; note the break used on this scale) shows the distribution over 1,000 replicates i of the $d_1^{(i)}$ (SH-test, code posNPfcd; wide dark-gray bars), $d^{(i)}$ (SOWH-test, code posPfud; narrow white bars), and $d^{\prime(i)}$ (SOWH-test, code posPpud; light-gray bars). The curves (left-hand y-axis) show the cumulative frequency distributions of the $d_1^{(i)}$ (SH-test; solid line) and $d^{(i)}$ (SOWH-test posPfud; dashed line). The cumulative frequency distribution of the $d^{\prime(i)}$ (SOWH-test posPpud) is indistinguishable from that of the $d^{(i)}$. The points at which the horizontal line (dashed gray) at a cumulative frequency of 0.95 crosses these curves indicate the values of $d$ that must be exceeded for a significant result at the 5% level. Given the attained value of $d = 3.90$, the SH-test is not significant and does not reject $T_1$, but the SOWH-tests are highly significant and reject $T_1$ in favor of $T_{ML}$ (see text for further details).

    One explanation of the difference between the significance levels for the SH-test and the SOWH-test is the different forms of their null hypotheses. In this example the SH-test considers whether the three competing topologies are equally good explanations of the data, whereas the SOWH-test considers whether other topologies are better than the single topology $T_1$. As a consequence, the SH-test is permitting more a priori possible topologies in its null hypothesis, which will generally lead to more conservative results, an effect of the allowances made for the multiple statistical comparisons being made between the ML and all other permitted hypotheses.

    Another factor affecting the significance levels of the different tests may simply be the increase in power expected for a parametric test over a nonparametric test. We also recall the reliance of parametric tests on the models that they assume. Although we may hope for robustness of tests to inaccuracy of models, this has generally been left untested in phylogenetics. To examine whether the REV+Γ model fits the data well in the present example, we performed ML analyses under a variety of nucleotide substitution models to compare those models (Goldman, 1993; Yang et al., 1994). The results of these analyses are shown in Table 4. It is immediately evident that the REV+Γ model fits these data significantly better than any of the other models considered, in agreement with the good performance of both the REV and Γ components reported over a variety of data sets (see Yang, 1994, 1996; Arvestad and Bruno, 1997; and references therein). We conclude that in this example all reasonable steps have been taken to exclude any effects on the SOWH-test attributable to model inaccuracy.


    TABLE 4. Maximum likelihood scores for HIV-1 gag and pol gene data set under various models of nucleotide substitution. Models JC69 (Jukes and Cantor, 1969), K80 (Kimura, 1980), F81 (Felsenstein, 1981), HKY85 (Hasegawa et al., 1985), and REV (Yang, 1994) were implemented, each without (no Γ) and with (+Γ) a gamma distribution to model rate heterogeneity amongst sites (Yang, 1996, 1997). All calculations were performed with the topology referred to in the text as $T_{ML}$. Numbers given represent the log-likelihood value by which each model is worse than the best value, attained under the REV+Γ model, -5,069.85. Also shown, in parentheses, are the numbers of free parameters in each substitution model. Pairs of nested models can be compared by using a test statistic that is twice the log-likelihood difference between those models, assessed with either a $\chi^2$ distribution (models compared are both of the no Γ form or both of the +Γ form) or a $\bar{\chi}^2$ distribution (exactly one of the models compared is of the +Γ form); degrees of freedom are given by the difference in the numbers of free parameters. For full details of these tests see, for example, Yang et al. (1994) and Goldman and Whelan (2000).

    | Substitution model | No Γ | +Γ |
    |---|---|---|
    | JC69 | 395.08 (0) | 349.93 (1) |
    | K80 | 190.28 (1) | 131.32 (2) |
    | F81 | 280.19 (3) | 231.43 (4) |
    | HKY85 | 81.29 (4) | 12.79 (5) |
    | REV | 65.09 (8) | 0 (9) |
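    The nested-model comparisons described in the caption of Table 4 can be sketched as below. This is a rough illustration under stated assumptions: the function names are hypothetical, the chi-square tail is implemented only for 1 or an even number of degrees of freedom (enough for the two comparisons shown), and the distribution used when exactly one model has the gamma component is assumed here to be a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate; df = 1 or even df only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df > 0 and df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        half = x / 2.0
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch handles only df = 1 or even df")

def lrt_p(delta_lnL, df, boundary=False):
    """P-value for a pair of nested models from their log-likelihood difference.

    boundary=True applies the assumed 50:50 mixture (intended for df = 1).
    """
    p = chi2_sf(2.0 * delta_lnL, df)
    return 0.5 * p if boundary else p

# HKY85 vs. REV, both without the gamma component: 8 - 4 = 4 degrees of freedom.
p_rev = lrt_p(81.29 - 65.09, 4)
# REV vs. REV with the gamma component: one extra parameter.
p_gamma = lrt_p(65.09, 1, boundary=True)
print(p_rev, p_gamma)
```

    Both P-values are far below 0.05, in line with the statement in the text that the most general model fits these data significantly better than the simpler ones.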

    Mammalian Mitochondrial Protein Amino Acid Sequences

    Shimodaira and Hasegawa (1999) illustrated the SH-test with a data set consisting of aligned mt protein sequences, each comprising 3,414 aa, from six mammals: human, harbor seal, cow, rabbit, mouse, and opossum. The grouping (harbor seal, cow) was assumed to be true, which left 15 candidate topologies to be evaluated. Shimodaira and Hasegawa (1999) applied the SH-test to this data set, comparing all 15 candidate topologies simultaneously, and concluded that seven topologies could not be rejected. To illustrate the SOWH-test, we selected (a priori) the topology $T_1$ = ((human, ((harbor seal, cow), rabbit)), mouse, opossum), called topology a = 2 by Shimodaira and Hasegawa (1999), to test against the ML topology, which for these data is $T_{ML}$ = (((human, (harbor seal, cow)), rabbit), mouse, opossum), topology a = 1 of Shimodaira and Hasegawa (1999). We compare our results from the SOWH-test with Shimodaira and Hasegawa's (1999) results for analogous comparisons from KH- and SH-tests. In this example, as in Shimodaira and Hasegawa (1999), all ML calculations were performed using a model of mammalian mt aa replacement described by Yang et al. (1998), with aa frequencies estimated from the data set being analyzed and using a gamma distribution to model rate heterogeneity amongst sites (mtmam+F+Γ; see also Yang, 1997). For this model, the optimal log-likelihoods for these topologies were $L_1 = -21{,}727.26$ and $L_{ML} = -21{,}724.60$; therefore, the test statistic for all the tests of topologies considered below was $d \equiv L_{ML} - L_1 = -21{,}724.60 - (-21{,}727.26) = 2.66$. Table 5 summarizes the results of the KH-, SH-, and SOWH-tests for these data. The KH- and SH-test results are taken from Shimodaira and Hasegawa (1999) and were calculated by using RELL approximations and estimating the tests' P-values directly (without a normal approximation). The SOWH-test was performed by using full likelihood optimizations.

    TABLE 5. Results of statistical tests of topologies for mammalian mitochondrial protein amino acid data set.

    | Test code | Notes | P-value |
    |---|---|---|
    | priNPncd | KH-test (incorrect application); RELL approximation; direct estimation of P-value | 0.36 (a) |
    | posNPncd | SH-test; RELL approximation; direct estimation of P-value | 0.81 (a) |
    | posPfud | SOWH-test; full optimization; direct estimation of P-value | < 0.001 |

    (a) From Shimodaira and Hasegawa (1999).

    The P-value obtained from a one-sided comparison in the KH-test, as given by Shimodaira and Hasegawa (1999), was 0.36, which suggests that $T_1$ cannot be rejected in favor of $T_{ML}$. As explained above, this conclusion must be maintained by the SH-test and we see from Table 5 that this is so (P = 0.81). Notice that the difference between the P-values for the (one-sided) KH- and SH-tests (0.36 and 0.81, respectively) is considerably greater than in the HIV-1 example (about 0.20 and 0.26, respectively).

    The SOWH-test again gives very different results. The P-value from this test, for 1,000 replicate parametric bootstrap data sets, is estimated to be < 0.001; in other words, in none of the 1,000 replicates i did the value of $d^{(i)}$ equal or exceed the value $d = 2.66$ observed for the real data. (The attained value $d = 2.66$ lies 27.8 standard deviations above the mean of the $d^{(i)}$.) Thus, topology $T_1$ is very strongly rejected in favor of $T_{ML}$.

    As with the HIV-1 example above, there is no obvious single reason for the contrasting results of the SH- and SOWH-tests for these mt protein sequences. The SOWH-test considers only one a priori topology, $T_1$, and of the 1,000 replicate data sets generated by using $T_1$ only 7 resulted in topologies different from $T_1$ when analyzed by ML. Evidently, if this topology, its corresponding parameter values $\theta_1$ (as estimated by ML from the original mt protein data set), and the mtmam+F+Γ model of aa replacement were all adequate, then we would expect to retrieve the correct topology from a data set of 3,414 aa with high probability $(1{,}000 - 7)/1{,}000 \approx 0.99$; consequently, our finding that for the original data $T_{ML}$ and not $T_1$ is optimal seems unreconcilable with the hypothesis that $T_1$ is true. In contrast, the SH-test has 15 topologies considered equally plausible a priori in its null hypothesis and therefore the significance level assigned to a particular one of these, e.g., $T_1$, is reduced. The effects of differences between parametric and nonparametric tests and the possibility that the mtmam+F+Γ model is inadequate have not been assessed.

    DISCUSSION

    We want to emphasize once more that the problems described above with typical applications of the KH-test are very real and will have practical consequences in many applications. We contend that all future applications must use new methods such as the SH-test and the SOWH-test (above). Assessment of the results of published analyses based on incorrect applications of the KH-test must be made with extreme caution. The sole correction to these results that we have been able to derive (see above) will often generate inconclusive results, demanding reanalysis of data.

    Evidently, it is vital that researchers think carefully about what phylogenetic hypotheses they wish to test. A priori hypotheses and a posteriori hypotheses can be quite different, as can the statistical distributions required to test them. It serves no scientific purpose to cheat and represent an a posteriori hypothesis as an a priori one simply for the expediency of a more readily available or faster test. The SOWH-test, as described above, tests a single a priori hypothesis of topology. If such tests are used repeatedly, to assess the significance of multiple trees, the issue of correcting significance values for multiple tests arises. This might occur with data sets for which large numbers of tree topologies are considered plausible a priori. Bar-Hen and Kishino (in press) describe a novel parametric likelihood-based test for computing simultaneous significance values for multiple topologies.

    The SH-test simultaneously compares all members of a set M of topologies. The inclusion in M of all a priori possible topologies is important. Even topologies with low bootstrap replicate likelihoods ($L_x^{(i)}$) can readily affect the significance levels of other topologies, because these are based only on variations in likelihoods over bootstrap replicates ($\tilde{L}_x^{(i)} = L_x^{(i)} - \bar{L}_x$, the replicate likelihoods centered by their mean over replicates). A posteriori selection of topologies for inclusion in or exclusion from M based on their likelihoods may thus bias all significance levels recorded, analogous to performing multiple comparisons tests on only a subset of a larger number of comparisons, selected (for example) to be the most (or least) significant. Decreasing the number of comparisons performed this way will unjustifiably increase the apparent significance levels of the results. In the HIV-1 example above, if the SH-test (posNPfcd) is applied by considering that all 105 topologies for six sequences are possibly true, the P-value for T1 is increased from 0.26 to 0.90. Considering all 105 possible topologies in the SH-test applied in the mt protein sequence example above increases the P-value for T1 from 0.81 to 0.93 (RELL approximation, posNPncd); for the topologies called a = 9–15 by Shimodaira and Hasegawa (1999), P-values are increased from significant values (< 0.05) to nonsignificant values (> 0.05). Clearly, the honest choice of a priori hypothesis topologies may be crucial to the conclusions ultimately drawn.
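
    The dependence of SH-test P-values on the membership of M can be sketched in a few lines of Python. The sketch follows the centering and comparison steps described for the SH-test, but the function name, the strict inequality used when counting replicates, and all log-likelihood values are assumptions made for illustration. It is not the shtests program, and a real analysis would use many more bootstrap replicates.

```python
def sh_test_p_values(observed, bootstrap):
    """Toy SH-test P-values (hypothetical helper).

    observed:  log-likelihood of each topology in M on the original data.
    bootstrap: one list per replicate, holding the log-likelihood of each
               topology in M on that replicate.
    """
    n_top = len(observed)
    n_rep = len(bootstrap)
    # Centre each topology's replicate log-likelihoods on their mean.
    means = [sum(rep[x] for rep in bootstrap) / n_rep for x in range(n_top)]
    centered = [[rep[x] - means[x] for x in range(n_top)] for rep in bootstrap]
    best = max(observed)
    p_values = []
    for x in range(n_top):
        delta = best - observed[x]  # observed shortfall from the ML topology
        # Count replicates whose centred shortfall exceeds the observed one
        # (strict inequality is an assumption of this sketch).
        exceed = sum(1 for rep in centered if max(rep) - rep[x] > delta)
        p_values.append(exceed / n_rep)
    return p_values


# Made-up log-likelihoods for three topologies and four replicates.
observed = [-100.0, -102.5, -110.0]
bootstrap = [
    [-98.0, -104.0, -106.0],
    [-102.0, -102.0, -114.0],
    [-100.0, -105.0, -104.0],
    [-100.0, -101.0, -116.0],
]
full = sh_test_p_values(observed, bootstrap)
subset = sh_test_p_values(observed[:2], [rep[:2] for rep in bootstrap])
print("M = all three topologies:", full)
print("M = first two topologies:", subset)
```

    In this toy case the P-value for the second topology is 0.25 when M holds only two topologies and 0.5 when the third, poorly supported topology is included, which mirrors the direction of the changes reported above for the 105-topology analyses.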

    The claim is sometimes made (when phylogenies of different genes are compared, for instance) that no a priori topologies can be constructed. In such cases, however, one can usually recast the hypothesis and its statistical test differently. To use the example above, when comparing the evolutionary histories of different genes, we may restate this as a test of whether the two (or more) trees are sample estimates of the same phylogeny (Rodrigo et al., 1993).

    As illustrated in this paper, the results of parametric tests (e.g., the SOWH-test) and nonparametric tests (e.g., the SH-test) can appear to be very different. The SH-test may often appear to be more conservative than the SOWH-test. As we have explained, this may be due to some or all of the following phenomena: different forms of null hypotheses; increased power of parametric tests; and greater reliance of parametric tests on models of sequence evolution. The relative consequences of these and other effects require further investigation in the future.

    PROGRAM AND DATA AVAILABILITY

    Notes on the use of PAUP* (Swofford, 1998) to perform SOWH-tests, and details of the HIV-1 nucleotide sequences (6 sequences, each 2,000 bp) and mammalian mt protein aa sequences (6 sequences, each 3,414 aa) used in the examples above, can be obtained from the authors at http://www.zoo.cam.ac.uk/zoostaff/goldman/tests and downstream Web pages. A computer program, shtests, to perform SH-tests by using the RELL approximation (posNPncd above) can be obtained from Andrew Rambaut at http://evolve.zoo.ox.ac.uk/software/shtests. Versions of PHYLIP and PAUP* package programs (Felsenstein, 1995; Swofford, 1998) implementing the SH-test are currently under development (J. Felsenstein, pers. comm.; D. Swofford, pers. comm.).

    ACKNOWLEDGMENTS

    Work by N.G. and A.G.R. on this topic was partially supported by the Isaac Newton Institute for the Mathematical Sciences programme on Biomolecular Function and Evolution in the Context of the Genome Project (July–December 1998). N.G. is supported by a Wellcome Trust Fellowship in Biodiversity Research. J.P.A. is supported by an NIH Institutional NRSA Interdisciplinary Training in Genomic Sciences Fellowship and by the University of Washington Center for AIDS Research (CFAR). We are very grateful for the assistance given to us by Hidetoshi Shimodaira, Joe Felsenstein, David Swofford, Hirohisa Kishino, and Andrew Rambaut throughout the preparation of this paper; for prepublication versions of papers provided by Hidetoshi Shimodaira, Thomas Buckley, and Hirohisa Kishino; and for critical readings of draft versions of the paper by Edward Holmes, Ann Oakenfull, Tim Massingham, Martin Embley, and Andrew Rambaut. Andrew Rambaut and Korbinian Strimmer provided all of the 105-topology SH-test P-value calculations given in the Discussion.

    REFERENCES

    ADACHI, J., AND M. HASEGAWA. 1996. MOLPHY: Programs for molecular phylogenetics based on maximum likelihood, vers. 2.3. Institute of Statistical Mathematics, Tokyo.

    ARVESTAD, L., AND W. J. BRUNO. 1997. Estimation of reversible substitution matrices from multiple pairs of sequences. J. Mol. Evol. 45:696–703.

    BAR-HEN, A., AND H. KISHINO. In press. Comparing the likelihood functions of phylogenetic trees. Ann. Inst. Stat. Math.

    CUNNINGHAM, C. W., H. ZHU, AND D. M. HILLIS. 1998. Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies. Evolution 52:978–987.

    EFRON, B. 1982. The jackknife, the bootstrap and other resampling plans. CBMS-NSF regional conference series in applied mathematics, volume 38. Society for Industrial and Applied Mathematics, Philadelphia.

    EFRON, B., AND R. TIBSHIRANI. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1:54–77.

    EFRON, B., E. HALLORAN, AND S. HOLMES. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 93:13429–13434.

    FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376.

    FELSENSTEIN, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.

    FELSENSTEIN, J. 1995. PHYLIP (Phylogenetic inference package), version 3.57. Univ. of Washington, Seattle.

    GOLDMAN, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.

    GOLDMAN, N., AND S. WHELAN. 2000. Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17:975–978.

    HALL, P., AND S. R. WILSON. 1991. Two guidelines for bootstrap hypothesis testing. Biometrics 47:757–762.

    HASEGAWA, M., AND H. KISHINO. 1989. Confidence limits on the maximum-likelihood estimate of the hominoid tree from mitochondrial-DNA sequences. Evolution 43:672–677.

    HASEGAWA, M., AND H. KISHINO. 1994. Accuracies of the simple methods for estimating the bootstrap probability of a maximum-likelihood tree. Mol. Biol. Evol. 11:142–145.

    HASEGAWA, M., H. KISHINO, AND T. YANO. 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

    HASEGAWA, M., H. KISHINO, AND T. YANO. 1988. Phylogenetic inference from DNA sequence data. Pages 1–13 in Statistical theory and data analysis II (K. Mutusita, ed.). Elsevier, Amsterdam.

    HILLIS, D. M., B. K. MABLE, AND C. MORITZ. 1996. Applications of molecular systematics: the state of the field and a look to the future. Pages 515–543 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer, Sunderland, Massachusetts.

    HOPE, A. C. A. 1968. A simplified Monte Carlo significance test procedure. J. R. Statist. Soc. B 30:582–598.

    HUELSENBECK, J. P., AND K. A. CRANDALL. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466.

    HUELSENBECK, J. P., AND B. RANNALA. 1997. Phylogenetic methods come of age: Testing hypotheses in an evolutionary context. Science 276:227–232.


    HUELSENBECK, J. P., D. M. HILLIS, AND R. NIELSEN. 1996. A likelihood-ratio test of monophyly. Syst. Biol. 45:546–558.

    JUKES, T. H., AND C. R. CANTOR. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. N. Munro, ed.). Academic Press, New York.

    KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.

    KISHINO, H., AND M. HASEGAWA. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.

    KISHINO, H., T. MIYATA, AND M. HASEGAWA. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151–160.

    MARRIOTT, F. H. C. 1979. Barnard's Monte Carlo tests: how many simulations? Appl. Statist. 28:75–77.

    RODRIGO, A. G., M. KELLY-BORGES, P. R. BERGQUIST, AND P. L. BERGQUIST. 1993. A randomization test of the null hypothesis that two cladograms are sample estimates of a parametric phylogenetic tree. N.Z. J. Bot. 31:257–268.

    SHIMODAIRA, H. 1993. A model search technique based on confidence set and map of models. Proc. Inst. Stat. Math. 41:131–147 (in Japanese).

    SHIMODAIRA, H. 1998. An application of multiple comparison techniques to model selection. Ann. Inst. Stat. Math. 50:1–13.

    SHIMODAIRA, H., AND M. HASEGAWA. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.

    STRIMMER, K., AND A. VON HAESELER. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.

    STRIMMER, K., N. GOLDMAN, AND A. VON HAESELER. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210–211.

    SULLIVAN, J., K. E. HOLSINGER, AND C. SIMON. 1996. The effect of topology on estimates of among-site rate variation. J. Mol. Evol. 42:308–312.

    SWOFFORD, D. L. 1998. PAUP* 4.00: *Phylogenetic analysis using parsimony (and other methods). Sinauer, Sunderland, Massachusetts.

    SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer, Sunderland, Massachusetts.

    TEMPLETON, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221–244.

    WESTFALL, P. H., AND S. S. YOUNG. 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons, New York.

    YANG, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111.

    YANG, Z. 1996. Among-site variation and its impact on phylogenetic analysis. TREE 11:367–372.

    YANG, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556.

    YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316–324.

    YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384–399.

    YANG, Z., R. NIELSEN, AND M. HASEGAWA. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600–1611.

    ZHANG, J. 1999. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol. Biol. Evol. 16:868–875.

    Received 9 November 1999; accepted 17 December 1999

    Associate Editor: R. Olmstead
