
Syst. Biol. 50(3):351–366, 2001

Empirical and Hierarchical Bayesian Estimation of Ancestral States

JOHN P. HUELSENBECK AND JONATHAN P. BOLLBACK

Department of Biology, University of Rochester, Rochester, New York 14627, USA;

E-mail: [email protected]

Abstract. Several methods have been proposed to infer the states at the ancestral nodes on a phylogeny. These methods assume a specific tree and set of branch lengths when estimating the ancestral character state. Inferences of the ancestral states, then, are conditioned on the tree and branch lengths being true. We develop a hierarchical Bayes method for inferring the ancestral states on a tree. The method integrates over uncertainty in the tree, branch lengths, and substitution model parameters by using Markov chain Monte Carlo. We compare the hierarchical Bayes inferences of ancestral states with inferences of ancestral states made under the assumption that a specific tree is correct. We find that the methods are correlated, but that accommodating uncertainty in parameters of the phylogenetic model can make inferences of ancestral states even more uncertain than they would be in an empirical Bayes analysis. [Ancestral state reconstruction; Bayesian estimation; empirical Bayes; hierarchical Bayes.]

The reconstruction of ancestral states on a phylogeny remains an important endeavor in evolutionary biology. The comparative method, for example, looks for evidence of correlated change in two or more characters; in the course of such analyses, the ancestral states are often reconstructed on a tree (Harvey and Pagel, 1991). Other studies examine the properties of ancient molecules (Malcolm et al., 1990; Stackhouse et al., 1990; Adey et al., 1994; Jermann et al., 1995). The amino acid sequence of the ancient protein is estimated and then synthesized in the laboratory. The properties of the ancient protein can then be measured in vivo or in vitro with the goal of demonstrating a change in the function of the protein. Similar studies have been performed for morphological or behavioral traits. For example, Ryan and Rand (1995) reconstructed the mating calls of the hypothetical ancestors of a group of frogs and then examined the response of the extant females to the ancient calls. The inferences made in such studies depend on the reliability of the ancestral state reconstructions.

Several methods have been proposed to reconstruct the states present in the hypothetical ancestors on a phylogenetic tree. The parsimony method, probably the most frequently used method to infer ancestral states, finds the combination of ancestral states at an interior node of a phylogeny that minimizes the number of changes over the whole tree. The result of a parsimony analysis of ancestral states is either to choose one state as the best reconstruction or, less frequently, to present multiple reconstructions when several reconstructions give the same parsimony tree length (in which case the reconstruction is said to be ambiguous).

More recently, ancestral states have been reconstructed for DNA, amino acid, and morphological (two-state) data by the use of stochastic models (Schluter, 1995; Yang et al., 1995; Schluter et al., 1997; Pagel, 1999). Two different methods have been used to infer the ancestral states by using stochastic models. The maximum likelihood method finds the character state at an internal node on the tree that maximizes the probability of observing the data. For Bayesian inference, on the other hand, the goal is to calculate what is called the posterior probability that an ancestral node on a tree has a particular state, given the observations at the tips of the tree. The probability that a character takes a particular state at some interior node on a phylogenetic tree depends on the topology of the phylogenetic tree, the lengths of the branches on the phylogeny, and the parameters of the substitution model (such as the transition/transversion rate bias). The typical approach is to use the maximum likelihood estimates of these parameters when calculating the posterior probability of a site. Such an analysis is referred to as an empirical Bayes analysis. Schultz and Churchill (1999) have reviewed Bayesian approaches to reconstructing ancestral states for morphological characters.

In this study, we examine how sensitive the empirical Bayesian estimates of ancestral states are to uncertainty in the tree, branch lengths, and substitution parameters. We propose a method that integrates over uncertainty in these parameters; such an


Downloaded from http://sysbio.oxfordjournals.org/ at Mihai Eminescu Central University Library of Iasi on March 17, 2015

352 SYSTEMATIC BIOLOGY VOL. 50

analysis is referred to as a hierarchical Bayes analysis. The advantage of a hierarchical Bayes analysis is that inferences about the state of an ancestral character are not conditioned on any single tree or set of parameter values. We approximate the posterior probability of an ancestral state assignment using Markov chain Monte Carlo (MCMC). We find that it is important to consider uncertainty in model parameters and phylogeny when reconstructing ancestral states.

METHODS

We develop a hierarchical Bayes estimate of ancestral states on phylogenetic trees. In this section, we review several different methods for estimating ancestral states and introduce the notation that will be used throughout the paper. The general goal of this study is to approximate the posterior probability of a nucleotide assignment to a specific internal node of a phylogenetic tree while accommodating uncertainty in the tree, branch lengths, and the substitution parameters. We do this using MCMC and then compare the hierarchical Bayes estimates of ancestral states with the empirical Bayes estimates.

Data

We assume that aligned DNA sequences are available. However, the method developed in this paper applies equally well to stochastic models for the stems of rRNA genes (Schöniger and von Haeseler, 1994), codon models (Goldman and Yang, 1994; Muse and Gaut, 1994), models of amino acid change (Adachi and Hasegawa, 1992), or simple two-state models, such as the Markov–Bernoulli process, that might be applied to morphological features (see Schluter, 1995; Schluter et al., 1997; Pagel, 1999). Although the method developed here can be applied to different types of data, it differs from earlier work in that Schluter (1995) and Pagel (1999) considered characters that would not offer information on the tree and branch lengths. However, the methods developed here could usefully be applied in the same situations considered by Schluter (1995) and Pagel (1999) by accommodating uncertainty in the trees. The aligned DNA sequences are contained in the matrix $X = \{x_{ij}\}$, where $i = 1, 2, \ldots, s$ and $j = 1, 2, \ldots, c$ ($s$ is the number of species and $c$ is the length of the sequences). The $j$th site in the sequence is contained in the vector $x_j = \{x_{1j}, x_{2j}, \ldots, x_{sj}\}$. Each element in the matrix, $x_{ij}$, can take one of four states (A, C, G, or T). The observation that the $i$th species has nucleotide A at the $j$th site is denoted $x_{ij} = \mathrm{A}$.

We examine five aligned DNA sequence data sets. These data include (1) IRBP sequences from $s = 13$ mammals (van den Bussche et al., 1998); (2) three tRNA ($\mathrm{tRNA}^{\mathrm{His}}$, $\mathrm{tRNA}^{\mathrm{Ser}}$, $\mathrm{tRNA}^{\mathrm{Leu}}$), partial (3' region) NADH-dehydrogenase subunit 4 (ND4), and partial (5' region) ND5 sequences from $s = 12$ primates (Hayasaka et al., 1988); (3) ATPase8 sequences from $s = 10$ vertebrates (Cummings et al., 1995); (4) cytochrome oxidase I (COI) sequences from $s = 10$ vertebrates (Cummings et al., 1995); and (5) ND3 sequences from $s = 10$ vertebrates (Cummings et al., 1995).

Phylogenetic Trees

We assume that the $s$ sampled species are related through a phylogenetic tree, $\tau_i$, where $i = 1, \ldots, B(s)$. The number of possible trees is $B(s) = \frac{(2s-5)!}{2^{s-3}(s-3)!}$ for unrooted trees and $B(s) = \frac{(2s-3)!}{2^{s-2}(s-2)!}$ for rooted trees. Each tree, $\tau_i$, defines a set of $b$ branches ($b = 2s - 3$ or $b = 2s - 2$ for unrooted and rooted trees, respectively). The lengths of the $b$ branches on the tree are expressed in terms of expected number of substitutions per site, $\nu$. The branch lengths for the $i$th tree are contained in the vector $\mathbf{v}_i$ ($\mathbf{v}_i = \{\nu_1, \nu_2, \ldots, \nu_b\}$). The tips of the tree are labeled $n_1, \ldots, n_s$, and the internal nodes of the tree are labeled $n_{s+1}, \ldots, n_{2s-2}$ for unrooted trees and $n_{s+1}, \ldots, n_{2s-1}$ for rooted trees. The internal nodes are labeled consecutively according to a postorder traversal of the tree (i.e., from the tips of the tree to the root). The trees are rooted either along a branch of the tree (for rooted trees) or at taxon $n_s$ (for unrooted trees). The ancestor of node $n_k$ is denoted $\sigma(n_k)$. For unrooted trees, $\sigma(n_{2s-2}) = n_s$. That is, the ancestor of the last internal node on the tree is the tip taxon $n_s$.

We are interested in estimating the (unobserved) nucleotide states at one or more of the internal nodes of the tree. In particular, we are interested in the ancestral states for bats (Tonatia silvicola, Tonatia bidens, and Pteropus; van den Bussche et al., 1998); apes (Homo sapiens, Pan, Gorilla, Pongo,


2001 HUELSENBECK AND BOLLBACK: BAYESIAN ANCESTRAL STATE RECONSTRUCTION 353

and Hylobates; Hayasaka et al., 1988); and Amniota (a clade containing chicken, rat, mouse, human, bovine, seal, and whale; Cummings et al., 1995). We assume that these clades are monophyletic. Figure 1 shows the maximum likelihood estimates of phylogeny

FIGURE 1. The maximum likelihood trees under the constraints of monophyly of (a) bats, (b) apes, and (c, d, and e) amniotes. The thickened branch indicates the constraint of monophyly. The ancestral states are reconstructed for the node with the dot.

under the constraint that bats, apes, or amniotes are monophyletic. The phylogenies were estimated under the HKY85 model of DNA substitution (Hasegawa et al., 1984, 1985), a model that allows for differing base frequencies and for a bias in the rate of



TABLE 1. The maximum likelihood estimates under the HKY85+Γ model of DNA substitution for the five data sets examined in this study. The maximum likelihood estimates were obtained under the constraints of monophyly indicated in Figure 1. The genes are (A) IRBP (van den Bussche et al., 1998), (B) mtDNA (Hayasaka et al., 1988), (C) ATPase8 (Cummings et al., 1995), (D) COI (Cummings et al., 1995), and (E) ND3 (Cummings et al., 1995). See text for more information regarding the parameters.

Gene   $\max[\log_e f(X \mid \tau, \mathbf{v}, \theta)]$   $\kappa$   $\alpha$   $\pi_A$   $\pi_C$   $\pi_G$   $\pi_T$
A      -7569.55                                             5.22       0.45       0.22      0.30      0.30      0.18
B      -5711.94                                             12.76      0.36       0.36      0.32      0.08      0.23
C      -1590.49                                             5.73       0.61       0.40      0.31      0.07      0.22
D      -9489.63                                             15.62      0.16       0.35      0.30      0.10      0.26
E      -2665.87                                             7.33       0.30       0.34      0.33      0.08      0.25

transitions and transversions. Among-site rate variation was accommodated by assuming that the rate at a site is a random variable drawn from a mean-one gamma distribution with shape parameter $\alpha$ ($\alpha > 0$; Yang, 1993, 1994). The constrained clade is indicated by a dot in Figure 1. The maximum likelihood estimates of the parameters are shown in Table 1. The maximum likelihood tree for the ND3 gene did not have amniotes monophyletic. The log likelihood of the best tree was -2665.77, whereas the log likelihood of the best tree under the constraint of monophyly was -2665.87.

The ancestral states are estimated for the nodes indicated by the large dot in Figure 1. We estimated the ancestral states for only one of the nodes on the tree. However, if additional constraints are placed on the tree topology, the ancestral states for other nodes can also be estimated. We constrain a particular node because our method considers all trees that are consistent with the constraint; inferences of ancestral characters are a weighted average over all possible trees. If we did not maintain the constraint, the node of interest would not be present in a large number of the possible trees. The effect of constraining the tree is to reduce the number of possible trees. If for a tree of $s$ species there is a single constraint with $s_1$ species on one side of the constraint and $s_2$ species on the other side, the total number of possible unrooted trees is:

$$B_c(s_1, s_2) = B(s_1 + 1) \times B(s_2 + 1)$$

Hence, a total of 103,378,275 unrooted trees are consistent with the bat constraint, 1,091,475 trees are consistent with the ape constraint, and 31,185 trees are consistent with the amniote constraint. In the hierarchical Bayes analysis, inferences of ancestral state reconstructions will be a sum over all possible trees consistent with the constraint, weighted by the probability that the tree is correct. Because the number of

FIGURE 2. Examples of parsimony reconstruction on two trees. The nodes (internal and external) of the tree are labelled $n_1, \ldots, n_8$. The observations are the nucleotide states assigned to the tips (in this case, AACCC or ACACC for Tree a or Tree b, respectively). The parsimony reconstruction of ancestral states is indicated by the character sets at the internal nodes. The character reconstruction is ambiguous for Tree b.



possible trees is so large, we will use a numerical technique to approximate the sum.
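As a quick check on these counts, the following Python sketch evaluates $B(s)$ and $B_c(s_1, s_2)$ directly from the formulas given in this section. The function names are ours and purely illustrative; the partition sizes in the usage lines assume 3 bats among 13 mammals, 5 apes among 12 primates, and 7 amniotes among 10 vertebrates, as listed in the text.

```python
from math import factorial


def num_unrooted(s):
    """B(s) = (2s-5)! / (2^(s-3) (s-3)!) unrooted trees for s >= 3 tips."""
    return factorial(2 * s - 5) // (2 ** (s - 3) * factorial(s - 3))


def num_rooted(s):
    """B(s) = (2s-3)! / (2^(s-2) (s-2)!) rooted trees for s >= 2 tips."""
    return factorial(2 * s - 3) // (2 ** (s - 2) * factorial(s - 2))


def num_constrained(s1, s2):
    """B_c(s1, s2): unrooted trees containing the split s1 | s2."""
    return num_unrooted(s1 + 1) * num_unrooted(s2 + 1)


bats = num_constrained(3, 10)      # 3 bats, 10 other mammals
apes = num_constrained(5, 7)       # 5 apes, 7 other primates
amniotes = num_constrained(7, 3)   # 7 amniotes, 3 other vertebrates
```

Under those assumed partition sizes, the three values agree with the counts of 103,378,275, 1,091,475, and 31,185 trees quoted in the text.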

Parsimony Reconstruction

The parsimony reconstruction of an ancestral character state is the nucleotide assigned to an interior node of a tree that minimizes the number of changes. Take, for example, the tree shown in Figure 2a. The tree is drawn as an unrooted tree of five sequences, with one of the sequences drawn at the root. The observations for the $j$th site are $x_j = \{\mathrm{A}, \mathrm{A}, \mathrm{C}, \mathrm{C}, \mathrm{C}\}$. What are the states at the interior nodes of the tree?

For notation purposes, we contain the hidden (or unobserved) states in a matrix $Y = \{y_{ij}\}$, where $i = s + 1, s + 2, \ldots, 2s - 2$ and $j = 1, 2, \ldots, c$. The reconstruction that minimizes the number of changes for the observations at the $j$th site $x_j = \{\mathrm{A}, \mathrm{A}, \mathrm{C}, \mathrm{C}, \mathrm{C}\}$ is $y_j = \{\mathrm{A}, \mathrm{C}, \mathrm{C}\}$. That is, there is an A at node 6 and C at nodes 7 and 8. This reconstruction implies that there was a single change along the branch between nodes 6 and 7.

Swofford et al. (1996) and Maddison and Maddison (1992) describe algorithms for reconstructing ancestral states in a parsimony analysis. For our purposes, we simply note the dependence of the reconstruction of the hidden states ($Y$) on the topology of the tree ($\tau$) and on the states observed at the tips of the tree ($X$). Also, it is possible for the parsimony method to ambiguously reconstruct the ancestral states. For example, Figure 2b shows a different set of observations at the tips of the phylogenetic tree. For this tree, the reconstruction of states at two of the internal nodes of the tree is ambiguous.
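The reconstruction described for Figure 2a can be illustrated with a small Fitch-style two-pass sketch in Python. This is not the algorithm of the cited references in full: it returns only one most-parsimonious reconstruction and the number of changes, and the topology is an assumption on our part (tips 1 and 2 joined at node 6, node 6 and tip 3 joined at node 7, node 7 and tip 4 joined at node 8, and tip 5 drawn at the root), chosen because it reproduces the reconstruction stated in the text.

```python
def fitch(children, top, root_tip, tips):
    """Return (states, changes) for one most-parsimonious reconstruction.

    children: internal node -> (child, child); hypothetical topology
    top:      internal node attached to the tip drawn at the root
    root_tip: the tip drawn at the root
    tips:     tip -> observed nucleotide
    """
    sets, changes = {}, 0

    def down(node):
        # First pass (tips to root): intersect child sets, else take the union.
        nonlocal changes
        if node in tips:
            return {tips[node]}
        left, right = (down(child) for child in children[node])
        common = left & right
        sets[node] = common or (left | right)
        if not common:
            changes += 1
        return sets[node]

    if tips[root_tip] not in down(top):
        changes += 1

    states = {}

    def up(node, parent_state):
        # Second pass (root to tips): keep the parent's state when allowed.
        if node in tips:
            return
        allowed = sets[node]
        states[node] = parent_state if parent_state in allowed else min(allowed)
        for child in children[node]:
            up(child, states[node])

    up(top, tips[root_tip])
    return states, changes


CHILDREN = {6: (1, 2), 7: (6, 3), 8: (7, 4)}  # assumed shape of the Figure 2 tree
tree_a = fitch(CHILDREN, 8, 5, {1: "A", 2: "A", 3: "C", 4: "C", 5: "C"})
tree_b = fitch(CHILDREN, 8, 5, {1: "A", 2: "C", 3: "A", 4: "C", 5: "C"})
```

For the observations AACCC the sketch assigns A to node 6 and C to nodes 7 and 8 with a single change, matching the text. For ACACC it needs two changes; because it reports only one of the equally short reconstructions, it does not by itself display the ambiguity noted for Tree b.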

Maximum Likelihood

The other two methods for estimating ancestral states, maximum likelihood and Bayesian inference, assume that the characters evolve according to a stochastic process. For example, the Markov–Bernoulli process is a simple two-state model of evolution that has been applied to morphological characters. The Markov–Bernoulli model has two parameters, the bias parameter $p$ and a rate parameter $\lambda$. In an instant of time, $dt$, the probability of a change from state 0 to state 1 is $\lambda(1 - p)\,dt$ and the probability of a change from state 1 to state 0 is $\lambda p\,dt$. In this study, we are interested in modeling the evolution of DNA sequences and assume that the DNA sequences evolve according to a time-homogeneous Poisson process with four states. In particular, we assume that substitutions follow the HKY85 model of DNA substitution (Hasegawa et al., 1984, 1985). The instantaneous rate matrix, $Q$, for the HKY85 model is

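$$
Q = \{q_{ij}\} =
\begin{pmatrix}
\cdot & \pi_C & \kappa\pi_G & \pi_T \\
\pi_A & \cdot & \pi_G & \kappa\pi_T \\
\kappa\pi_A & \pi_C & \cdot & \pi_T \\
\pi_A & \kappa\pi_C & \pi_G & \cdot
\end{pmatrix}
\tag{1}
$$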

where $\kappa$ is the transition/transversion rate bias, and $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$ are the equilibrium base frequencies. When $\kappa > 1$, transitions occur more frequently than transversions. The rows of the instantaneous rate matrix sum to 0. Moreover, the constraint that $-\sum_i \pi_i q_{ii} = 1$ is added, ensuring that the branch lengths of the phylogenetic tree are given in terms of expected number of substitutions per site, $\nu$. The probability that nucleotide $i$ changes into $j$ over a branch of length $\nu$ is contained in the matrix $P = \{p_{ij}(\nu)\}$. $P$ can be obtained from the rate matrix $Q$ through the operation $P = e^{Q\nu}$. We accommodate rate variation across sites by assuming that the rate at a site is a random variable drawn from a mean-one gamma distribution with shape parameter $\alpha$ (Yang, 1993, 1994). Substitution models that assume gamma-distributed rate variation are denoted +Γ. We use a total of four rate categories to approximate the continuous gamma distribution. Parameters of the model of DNA substitution are contained in the vector $\theta = (\kappa, \alpha, \pi)$.

The probability of observing the data at the tips of an unrooted tree for a particular site ($x_i$) and an assignment of nucleotides to the internal nodes of the tree ($y_i$) is

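$$
f(x_i \mid y_i, \tau_j, \mathbf{v}_j, \theta) = \pi_{x_{is}}\, p_{y_{i(2s-2)} x_{is}}(\nu_{2s-2}) \left[ \prod_{k=1}^{s-1} p_{y_{i\sigma(k)} x_{ik}}(\nu_k) \right] \left[ \prod_{k=s+1}^{2s-3} p_{y_{i\sigma(k)} y_{ik}}(\nu_k) \right]
\tag{2}
$$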

The probability of observing the data at the tips is conditional on the states assigned to



the interior nodes of the tree ($y_i$), the topology of the tree ($\tau_j$), the lengths of the branches on the tree ($\mathbf{v}_j$), and the parameters of the substitution model ($\theta$). Maximum likelihood estimation of phylogeny is not typically conditioned on a particular assignment of nucleotides to the interior nodes of the tree (Felsenstein, 1981). Instead, the probability of observing the data at the tips of the tree is a sum over all $4^{s-2}$ possible assignments of nucleotides to the internal nodes of the tree.

$$
f(x_i \mid \tau_j, \mathbf{v}_j, \theta) = \sum_{y_i} f(x_i \mid y_i, \tau_j, \mathbf{v}_j, \theta)
\tag{3}
$$

Assuming independence of substitutions at different sites, the probability of observing the aligned DNA sequence data is the product of the $c$ site probabilities

$$
f(X \mid \tau_j, \mathbf{v}_j, \theta) = \prod_{i=1}^{c} f(x_i \mid \tau_j, \mathbf{v}_j, \theta)
\tag{4}
$$

This function is maximized to obtain maximum likelihood estimates of the parameters $\tau$, $\mathbf{v}$, and $\theta$.

The ancestral states can be estimated by maximum likelihood (Schluter, 1995; Schluter et al., 1997; Pagel, 1999). Several different maximum likelihood methods can be used to estimate ancestral states. Suppose one is interested in the ancestral state at only one of the internal nodes of the tree. One method for estimating the ancestral state at that node is to find the combination of states for all nodes on the tree that maximizes the likelihood; the likelihood is maximized not only with respect to the node of interest but also with respect to the other nodes on the tree that are not of direct interest. One potential problem with this method for inferring the ancestral condition at a node, however, is that the number of parameters being estimated is large (in fact, larger than the number of observations at a site if branch lengths are estimated for each site independently). Another method for estimating the ancestral condition is to find the maximum likelihood nucleotide assignment for the node of interest while summing over all possible nucleotide assignments to the nodes that are not of direct interest. This method is preferable because it focuses the power of the method only on the node of interest.
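To make the substitution model of this section concrete, the following Python sketch builds the HKY85 rate matrix with the normalization $-\sum_i \pi_i q_{ii} = 1$ and computes $P = e^{Q\nu}$. It is a minimal illustration under our own assumptions: the function names are hypothetical, the matrix exponential uses a truncated Taylor series with scaling and squaring (the paper does not say how $P$ was computed), and gamma rate variation is omitted. The parameter values in the usage lines are the gene A estimates from Table 1.

```python
NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(kappa, pi):
    """HKY85 rate matrix (order A, C, G, T) scaled so -sum_i pi_i q_ii = 1."""
    Q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCLEOTIDES):
        for j, b in enumerate(NUCLEOTIDES):
            if i != j:
                Q[i][j] = pi[j] * (kappa if (a, b) in TRANSITIONS else 1.0)
        Q[i][i] = -sum(Q[i])  # diagonal makes the row sum to zero
    scale = -sum(pi[i] * Q[i][i] for i in range(4))
    return [[q / scale for q in row] for row in Q]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_probabilities(Q, v, squarings=10, terms=12):
    """P = exp(Q v): Taylor series on Q v / 2^squarings, then repeated squaring."""
    step = v / 2 ** squarings
    M = [[q * step for q in row] for row in Q]
    term = [[float(i == j) for j in range(4)] for i in range(4)]
    P = [row[:] for row in term]
    for n in range(1, terms + 1):
        term = [[x / n for x in row] for row in mat_mul(term, M)]
        P = [[P[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P


pi = [0.22, 0.30, 0.30, 0.18]            # pi_A, pi_C, pi_G, pi_T for gene A
Q = hky85_rate_matrix(5.22, pi)          # kappa for gene A
P = transition_probabilities(Q, 0.1)     # branch of 0.1 substitutions per site
```

Each row of `P` sums to 1, and with $\kappa > 1$ the entry for the transition A to G exceeds the entry for the transversion A to C, as the text describes.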

Empirical Bayes

Bayesian inference is based on the posterior probability of a parameter. The posterior probability that the character for the $j$th site at the $i$th internal node of tree $k$ takes state $y_{ij}$, conditional on the data at the tips, tree, branch lengths, and substitution parameters, is

$$
f(y_{ij} \mid x_j, \tau_k, \mathbf{v}_k, \theta) = \frac{f(x_j \mid y_{ij}, \tau_k, \mathbf{v}_k, \theta)\, \pi_{y_{ij}}}{\sum_{y_{ij} \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}} f(x_j \mid y_{ij}, \tau_k, \mathbf{v}_k, \theta)\, \pi_{y_{ij}}}
\tag{5}
$$

The probabilities are a sum over all possible assignments of nucleotides that can be assigned to the nodes in the tree that are not of interest (i.e., all of the internal nodes except $n_i$). Note that the probability of the ancestral state at node $n_i$ is conditioned on the observed data at the tips of the tree, the topology of the tree, the lengths of the branches, and the values of the parameters of the substitution process. What values should these unknown parameters take? An empirical Bayes analysis uses estimates for these parameters. Yang et al. (1995) substituted the maximum likelihood estimates for these unknown parameters. Hence, the posterior probability for the ancestral state at node $i$ is

$$
f(y_{ij} \mid x_j, \hat{\tau}, \hat{\mathbf{v}}, \hat{\theta}) = \frac{f(x_j \mid y_{ij}, \hat{\tau}, \hat{\mathbf{v}}, \hat{\theta})\, \pi_{y_{ij}}}{\sum_{y_{ij} \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}} f(x_j \mid y_{ij}, \hat{\tau}, \hat{\mathbf{v}}, \hat{\theta})\, \pi_{y_{ij}}}
\tag{6}
$$
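The posterior probability at a single node can be computed by brute force on a small tree, which makes the sum over the nodes "not of interest" explicit. The Python sketch below does this for a fixed five-taxon tree under the Jukes–Cantor model with every branch set to 0.1 expected substitutions per site, the setting the paper uses for its Figure 8 illustration. The node labelling and topology are our assumptions (the figure is not reproduced here), and real analyses would use the pruning algorithm of Felsenstein (1981) rather than enumeration.

```python
import itertools
import math

NUCLEOTIDES = "ACGT"

# Assumed labelling: tips 1-5, internal nodes 6-8; ANCESTOR[k] plays the
# role of sigma(n_k), and the tree is rooted at tip 5.
ANCESTOR = {1: 6, 2: 6, 6: 7, 3: 7, 7: 8, 4: 8, 8: 5}
INTERNAL = (6, 7, 8)


def jc_probability(a, b, v):
    """Jukes-Cantor probability of going from a to b along a branch of length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def joint_probability(tips, internal, v):
    """Probability of the tip data together with one assignment to nodes 6-8."""
    state = {**tips, **internal}
    prob = 0.25  # equilibrium frequency of the state at the root tip
    for node, ancestor in ANCESTOR.items():
        prob *= jc_probability(state[ancestor], state[node], v)
    return prob


def node_posterior(tips, node, v=0.1):
    """Posterior of each nucleotide at `node`, summing over the other nodes."""
    weight = dict.fromkeys(NUCLEOTIDES, 0.0)
    for combo in itertools.product(NUCLEOTIDES, repeat=len(INTERNAL)):
        internal = dict(zip(INTERNAL, combo))
        weight[internal[node]] += joint_probability(tips, internal, v)
    total = sum(weight.values())
    return {nuc: w / total for nuc, w in weight.items()}


posterior = node_posterior({1: "A", 2: "A", 3: "C", 4: "C", 5: "C"}, node=7)
```

With tip states A, A, C, C, C, the posterior probability of C at node 7 comes out at about 0.964, in line with the value of 0.9644 reported in the paper for Tree a of Figure 8. That agreement is what suggests the assumed topology is the intended one.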

Hierarchical Bayes

One of the disadvantages of an empirical Bayes analysis is that inferences are conditioned on assigning specific values to unknown parameters (such as the maximum likelihood estimates for the parameters). An alternative method, called hierarchical Bayes analysis, specifies a prior probability distribution for the unknown parameters. The posterior probability is then integrated over uncertainty in the parameters. The posterior probability that the state for the $j$th site at



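the $i$th internal node of the tree is $y_{ij}$ is

$$
f(y_{ij} \mid X) = \frac{f(X \mid y_{ij})\, \pi_{y_{ij}}}{\sum_{y_{ij} \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}} f(X \mid y_{ij})\, \pi_{y_{ij}}}
\tag{7}
$$

where

$$
f(X \mid y_{ij}) = \sum_{k=1}^{B_c(s)} \int_{\mathbf{v}_k} \int_{\theta} f(X \mid y_{ij}, \tau_k, \mathbf{v}_k, \theta)\, f(\tau_k)\, f(\mathbf{v}_k)\, f(\theta)\, d\mathbf{v}_k\, d\theta
\tag{8}
$$

The prior probabilities for the parameters are $f(\tau_k)$, $f(\mathbf{v}_k)$, and $f(\theta)$, and the summation is over all possible trees that are consistent with the constraint. In this study, we assume that all trees are a priori equally probable, with a uniform(0, 10) prior for branch lengths, a uniform(0, 100) prior for $\kappa$, a uniform(0, 10) prior for $\alpha$, and a Dirichlet distributed prior for $\pi$ (Appendix).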

Markov Chain Monte Carlo

The summations and integrations required in equation 8 are impossible to perform analytically for even small phylogenetic problems. We use MCMC to approximate the posterior probability of nucleotide assignments to interior nodes on the tree. Specifically, we use the Metropolis–Hastings–Green algorithm (Metropolis et al., 1953; Hastings, 1970; Green, 1995). The MCMC method takes valid, albeit dependent, samples from the probability distribution of interest by constructing a Markov chain that has as its state space the parameter(s) of interest. Here, we are interested in integrating over uncertainty in the phylogenetic tree, branch lengths, and substitution parameters; for the problem of estimating ancestral states, then, the chain runs over topology ($\tau$), branch lengths ($\mathbf{v}$), the transition/transversion bias ($\kappa$), the gamma shape parameter for among-site rate variation ($\alpha$), and base frequencies ($\pi$).

The Markov chain was constructed as follows: (1) The current state of the chain is designated $\Psi$ ($\Psi = \{\tau, \mathbf{v}, \kappa, \alpha, \pi\}$); the current state of the chain consists of a specific tree with branch lengths and specific values for the substitution parameters. If this is the first generation of the chain, then an initial state for the chain is chosen. (2) A new state for the chain is proposed and designated $\Psi'$. The probability of proposing the new state given the old state is $f(\Psi' \mid \Psi)$. The probability of making the reverse move is $f(\Psi \mid \Psi')$. Our proposal mechanism modifies only one or a few of the elements in $\Psi$ at a time. The specific proposal mechanisms and their acceptance probabilities are discussed in the Appendix. (3) The probability that the proposed state is accepted is calculated. The probability of accepting the new state is

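$$
\begin{aligned}
R &= \min\left[1,\ \frac{f(X \mid \Psi')\, f(\Psi') / f(X)}{f(X \mid \Psi)\, f(\Psi) / f(X)} \times \frac{f(\Psi \mid \Psi')}{f(\Psi' \mid \Psi)}\right] \\
  &= \min\left[1,\ \frac{f(X \mid \Psi')\, f(\Psi')}{f(X \mid \Psi)\, f(\Psi)} \times \frac{f(\Psi \mid \Psi')}{f(\Psi' \mid \Psi)}\right] \\
  &= \min\left[1,\ \underbrace{\frac{f(X \mid \Psi')}{f(X \mid \Psi)}}_{\text{Likelihood ratio}} \times \underbrace{\frac{f(\Psi')}{f(\Psi)}}_{\text{Prior ratio}} \times \underbrace{\frac{f(\Psi \mid \Psi')}{f(\Psi' \mid \Psi)}}_{\text{Proposal ratio}}\right]
\end{aligned}
\tag{9}
$$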

The probability of accepting the new (proposed) state is the product of the likelihood, prior, and proposal ratios, or 1 if that product exceeds 1. The proposal ratio is often referred to as the Hastings ratio. (4) A uniformly distributed (pseudo)random number on the interval [0, 1] is generated. If this number is less than $R$, then the proposed state is accepted and $\Psi = \Psi'$. Otherwise, the chain remains at $\Psi$.

Steps 1 to 4 are repeated a large number of times (in this study, $10^6$ times). The sequence of states visited constitutes a Markov chain. In this study, we save the states of the Markov chain every 100 generations (taking a total of $10^4$ samples). These sampled points represent valid draws from the posterior probability of interest. Although the $10^4$ sampled states are not independent draws from the posterior distribution, the Markov chain law of large numbers guarantees that posterior probabilities can be validly estimated from long-run samples from the chain (Tierney, 1994). For each sampled state of the chain, we calculate the posterior probabilities of the nucleotide state assignments at the constrained node for all $c$ sites.
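Steps 1 to 4 can be sketched on a deliberately tiny problem: sampling a single branch length between two sequences under the Jukes–Cantor model. Everything specific in this Python example is an assumption made for illustration and is not taken from the paper's Appendix: the two-sequence data summary (1,000 sites, 100 differences), the multiplier proposal with its tuning value, the starting point, and the shorter run length. Only the uniform(0, 10) branch-length prior and the structure of the acceptance probability follow the text.

```python
import math
import random


def log_likelihood(v, n_sites, n_diff):
    """Jukes-Cantor log likelihood for two sequences separated by length v."""
    decay = math.exp(-4.0 * v / 3.0)
    same = 0.25 * (0.25 + 0.75 * decay)   # pi_x * p_xx(v)
    diff = 0.25 * (0.25 - 0.25 * decay)   # pi_x * p_xy(v), x != y
    return (n_sites - n_diff) * math.log(same) + n_diff * math.log(diff)


def run_chain(n_sites, n_diff, generations=50_000, sample_every=10,
              tuning=0.5, v_max=10.0, seed=1):
    rng = random.Random(seed)
    v = 1.0                                          # step 1: initial state
    current = log_likelihood(v, n_sites, n_diff)
    samples = []
    for generation in range(1, generations + 1):
        multiplier = math.exp(tuning * (rng.random() - 0.5))
        proposed = v * multiplier                    # step 2: propose a new state
        if proposed < v_max:                         # uniform prior: ratio 1 inside, 0 outside
            candidate = log_likelihood(proposed, n_sites, n_diff)
            # step 3: likelihood ratio x prior ratio (1) x proposal ratio,
            # which for this multiplier move equals the multiplier itself
            log_r = min(0.0, candidate - current + math.log(multiplier))
            if rng.random() < math.exp(log_r):       # step 4: accept, else stay
                v, current = proposed, candidate
        if generation % sample_every == 0:
            samples.append(v)
    return samples


samples = run_chain(n_sites=1000, n_diff=100)
kept = samples[len(samples) // 10:]                  # drop the first 10% of samples
posterior_mean = sum(kept) / len(kept)
```

For this toy data set the retained samples concentrate near the maximum likelihood branch length, $-\tfrac{3}{4}\ln(1 - \tfrac{4}{3}\cdot 0.1) \approx 0.107$. In the full method the state also includes the tree and the substitution parameters, and each element needs its own proposal mechanism.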



RESULTS

Inferences of the posterior probabilities of the parameters should be based on samples drawn from the chain when at stationarity. Figure 3 shows the log likelihood of the current state of the chain through time for the MCMC analysis. For each data set, the chain started at a low likelihood (a poor combination of parameters) and quickly reached a plateau. The first 1,000 sampled points (or the first $10^5$ generations of the chain) were discarded as the burn-in of the chain. All posterior probabilities in this paper are based on the 9,000 points that were sampled from the chain when at apparent stationarity. The posterior probability of a phylogeny is the proportion of the time that it was sampled (out of 9,000 samples).

The phylogenetic trees for the five data sets were not known with certainty. Figure 4 shows the 50% majority rule consensus trees for the five data sets. These trees represent the Bayesian estimates of phylogeny under the HKY85+Γ model of DNA substitution (Li, 1996; Mau, 1996; Rannala and Yang, 1996; Mau and Newton, 1997; Yang and Rannala, 1997; Larget and Simon, 1999; Mau et al., 1999; Newton et al., 1999). The numbers at the interior nodes of the trees represent the posterior probability that the clade is correct; they do not represent nonparametric bootstrap values. The posterior probability for the constrained clade is not shown because it must have been present in 100% of the samples. Note that the chain considered many different trees. For four of the data sets, no single tree could have been safely treated as known without error when estimating the ancestral states at the constrained node. For the COI data set, on the other hand, the posterior probabilities of all clades are high (>0.97); we may thus assume the topology (Fig. 4, Tree d) is known without error, even though the other parameters of the model are uncertain.

Just as the analysis considers uncertainty in the topology of the tree relating the species, uncertainty in the parameters of the substitution model is also accommodated. Figure 5 shows the posterior probabilities for the transition/transversion rate bias ($\kappa$) and the gamma shape parameter for among-site rate variation ($\alpha$). The posterior probability for both parameters is distributed over a range of values, most of the weight being

FIGURE 3. The log likelihood through time for the five data sets analyzed in this study. (a) van den Bussche et al. (1998); (b) Hayasaka et al. (1988); (c) ATPase8 (Cummings et al., 1995); (d) COI (Cummings et al., 1995); (e) ND3 (Cummings et al., 1995).



FIGURE 4. The 50% majority rule consensus trees of the trees visited during the MCMC analysis. The numbers at the interior branches represent the posterior probability that the clade is correct. Data sets (a)–(e) as in Figure 3.



FIGURE 5. The posterior probability distributions for the transition/transversion rate ($\kappa$) and for the gamma shape parameter ($\alpha$). (a)–(e) refer to the same data sets as in Figure 3.

placed near the maximum likelihood estimates. A 95% credible interval for each parameter can be constructed by taking the 2.5% and 97.5% quantiles of the distribution. Table 2 shows the mean and credible interval for the parameters of the substitution model.

We calculated the empirical and hierarchical Bayes estimates of ancestral states for all site patterns at the constrained nodes of Figure 1. The empirical Bayes estimates used the maximum likelihood estimates for the parameters $\tau$, $\mathbf{v}$, and $\theta$. In the hierarchical Bayes analysis, the uncertainty in these parameters was integrated over by using MCMC. Figure 6 shows the relationship between the empirical and hierarchical Bayes estimates of the ancestral states. The graphs show the posterior probabilities across all sites, ranked from sites with lowest to greatest probability under the empirical Bayes approach. As expected, the posterior probabilities of ancestral state assignments for the empirical and hierarchical Bayes analyses show a close relationship. The relationship between the hierarchical and empirical Bayes estimates is closest for probabilities near 0 or 1. These are site patterns for which there is little uncertainty about the state at the constrained node; both methods place most of the probability on a single reconstruction. However, for some site patterns, the nucleotide assignment at the ancestral node is less certain and there is more disagreement between the empirical and hierarchical Bayes analyses. For these sites, the topology and branch lengths of the tree and the uncertainty in the substitution parameters make the ancestral condition at the constrained node less certain. Importantly, a site can have either a lower or a higher posterior probability under the hierarchical Bayes approach because of the uncertainty in the trees and branch lengths.

Figure 7 more explicitly demonstrates the uncertainty in the ancestral state assignments introduced by the uncertainty in the phylogeny and substitution model. The error bars represent the 95% credible region for the probability of a particular nucleotide assignment to the constrained node on the tree. For the nuclear gene (IRBP), the uncertainty in the ancestral states is relatively small. However, for the other genes, the uncertainty in the ancestral state assignment can be quite large. For one of the sites in the ATPase8 gene, for example, the probability that an A was assigned to the constrained node varied by as much as 0.83 (a credible interval from 0.09 to 0.93). The nucleotide state assignments for other sites for the ATPase8 gene were nearly as uncertain.
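The summaries reported here (a posterior mean plus 2.5% and 97.5% quantiles) are simple to compute once the chain has been sampled. The Python sketch below assumes, as one natural Monte Carlo reading of equations 7 and 8, that the hierarchical Bayes estimate for a site is the average over retained samples of the node probability computed for each sampled state, and that the error bars are quantiles of those same per-sample values. The function names, the interpolation rule, and the input values are ours and are not taken from the paper.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarize(node_probabilities, burn_in):
    """Mean and 95% credible interval of per-sample node probabilities.

    node_probabilities: one value per sampled chain state, e.g. the
    probability of an A at the constrained node for a given site.
    burn_in: number of leading samples to discard.
    """
    kept = sorted(node_probabilities[burn_in:])
    mean = sum(kept) / len(kept)
    return mean, (quantile(kept, 0.025), quantile(kept, 0.975))


# Made-up input purely to exercise the functions: 101 evenly spaced values.
example = [i / 100 for i in range(101)]
mean, (low, high) = summarize(example, burn_in=0)
```

In the paper's setting the input would hold 10,000 sampled values with `burn_in=1000`, leaving the 9,000 retained samples described in the text.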

DISCUSSION

Phylogenetic uncertainty is usually ignored when reconstructing ancestral states.



TABLE 2. The Bayesian estimates under the HKY85+Γ model of DNA substitution for the five data sets examined in this study. The point estimate is the mean of the posterior distribution, and the interval represents the 95% credible region for the parameters. Genes as in Table 1.

Gene   $\kappa$        $\alpha$       $\pi_A$        $\pi_C$        $\pi_G$        $\pi_T$
A      5.32            0.45           0.22           0.30           0.30           0.19
       (4.63, 6.08)    (0.39, 0.52)   (0.20, 0.23)   (0.28, 0.32)   (0.28, 0.32)   (0.17, 0.20)
B      11.92           0.38           0.36           0.31           0.08           0.24
       (9.39, 15.22)   (0.31, 0.45)   (0.34, 0.39)   (0.29, 0.34)   (0.07, 0.09)   (0.22, 0.26)
C      6.35            0.59           0.40           0.31           0.07           0.22
       (4.16, 9.51)    (0.41, 0.83)   (0.36, 0.44)   (0.28, 0.35)   (0.05, 0.09)   (0.19, 0.25)
D      12.43           0.17           0.34           0.30           0.11           0.25
       (9.82, 16.36)   (0.15, 0.18)   (0.32, 0.36)   (0.29, 0.32)   (0.10, 0.12)   (0.24, 0.27)
E      11.89           0.23           0.34           0.33           0.08           0.25
       (5.67, 18.57)   (0.17, 0.35)   (0.31, 0.37)   (0.30, 0.36)   (0.07, 0.10)   (0.23, 0.28)

The phylogeny of some groups is so well established that it is perhaps safe to treat the phylogeny as known. However, even for cases where the phylogeny is well supported, the uncertainty in other parameters of the phylogenetic model, such as the branch lengths on the tree and the substitution parameters, can be large. Uncertainty in the phylogenetic model (tree,

FIGURE 6. Relationship between the posterior probabilities of states A, C, G, or T calculated by using the empirical and the hierarchical Bayes methods. (a)–(e) as in Figure 3.

branch lengths, and substitution model) can contribute to making ancestral state reconstruction ambiguous.

Figures 8 and 9 demonstrate how uncertainty in the phylogeny and the lengths of the branches on the phylogeny can lead to different interpretations of the ancestral state at a node. The observations at the tips of the trees are A, A, C, C, and C for species 1, 2,



FIGURE7. Relationshipbetween theposteriorprobabilities of statesA,C,G, orTcalculatedbyusing the empiricaland the hierarchical Bayes methods. The 95% credible regions for the nucleotide probabilities are indicated by thevertical error bars. (a)(e) as in Figures 36.

FIGURE 8. The posterior probabilities of A, C, G, or T at the node indicated by the dot. The three trees representall of the trees of ve species that contain the taxon bipartition fn1 , n2 , n3g, fn4, n5g. Posterior probabilities werecalculated under the JukesCantor (1969) model of DNA substitution, assuming that all of the branches were 0.1expected substitutions per site. Numbers in parentheses below each tree indicate (from left to right) the probabilityof having A, C, G, or T, respectively, at the constrained node.

at Mihai Em



Dow

nloaded from


FIGURE 9. The posterior probabilities of A, C, G, or T at the node indicated by the dot. Posterior probabilitieswere calculated under the JukesCantor (1969) model of DNA substitution and assuming that the lengths of all ofthe branches were 0.1 expected substitutions per site, except the two short branches of Trees b and c, which were0.01 expected substitutions per site. Numbers in parentheses as in Figure 8.

3, 4, and 5, respectively. The lengths of thebranches for the three trees of Figure 8 areall 0.1 expected substitutions per site. Thethree trees shown in Figure 8 represent allof the trees that contain the taxon bipartitionfn1, n2, n3g, fn4, n5g. What is the probabilityof having an A, C, G, or T at the node indi-cated by the dot? If the JukesCantor (1969)model of DNA substitution is assumed, thenthe probability of having an A, C, G, or Tat the constrained node is indicated by thenumbers in parentheses in Figure 8. For Treea of Figure 8, the posterior probability of hav-ingaCat the interior node isgreatest (0.9644),makingC themost probable state at thenode.However, for Trees b and c of Figure 8, thereconstruction of the ancestral state is moreambiguous; the probability of having anA orC are about the same. These empirical Bayesreconstructions make intuitive sense and arein accordwith the parsimony reconstruction.The parsimonymethod reconstructs the stateat the constrained node as C for Tree a ofFigure 8 and either A or C for Trees b and c.If the tree is certain, then the reconstruction ofthe ancestral state is not problematic. For ex-ample, if Tree a is correct, then the best recon-struction has nucleotide C at the constrainednode. Similarly, if Tree b unambiguouslyrepresents the relationships of the vespecies, then the best reconstruction has ei-ther an A or C at the constrained node (witha slight preference for the reconstruction thathas A at the constrained node). However,rarely is thephylogenyknownwith certainty.For example, what if the posterior probabili-

ties of Trees a , b, and c were 0.5, 0.3, and 0.2,respectively? If one were to simply assumethat the tree with the greatest posterior prob-ability is correct (the MAP estimate of phy-logeny), then the best reconstruction wouldhave C at the constrained node (with proba-bility 0.96). Ideally, however, the uncertaintyin the topology should be accommodated,as we do by using MCMC in this study. Ifthe uncertainty of the trees is accounted for,the probabilities are 0.2702, 0.7269, 0.0015,and 0.0015 for nucleotides A, C, G, and T,respectively. This calculation assumes thatthe lengths of all branches on the trees areequal (0.1 substitutions per site). The hier-archical Bayes estimate differs substantiallyfrom the empirical Bayes estimate (e.g., theprobability of C is 0.9644 and 0.7269 for theempirical and hierarchical Bayes analyses,respectively).Uncertainty in the lengths of the branches

on the phylogeny is another source of am-biguity in ancestral state reconstructions.Figure 9 shows how posterior probabili-ties of ancestral state reconstructions can beaffected by uncertainty in branch lengths.Here, the topology of the phylogeny is thesame for all three examples. The branchesfor Tree a are all equal in length (0.1 ex-pected substitutions per site). The lengthsof the branches for Trees b and c are also0.1 expected substitutions per site exceptfor the branch forming the taxon bipartitionfn1, n2g, fn3, n4, n5g on Tree b, which is 0.01expected substitutions per site long, and thebranch leading to tip n3 on Tree c , which is

at Mihai Em



Dow

nloaded from


also 0.01 expected substitutions per site long.The posterior probabilities of having an A,C, G, or T are shown for each tree. For alltrees, the best reconstruction has C at theconstrained node. However, the poste-rior probability changes depending on thelengths of the branches. For Tree b the proba-bility of having C is much less than for Treesa and c . The empirical Bayes analysis picksone set of branch lengths to use when recon-structing ancestral states, whereas the hierar-chical Bayes analysis attempts to accommo-date any uncertainty in the branch lengthswhen reconstructing ancestral states.Although inferences for our hierarchical

Bayes method are not conditioned on anyparticular tree or set of model parametersbeing correct, the ancestral state recon-structions are conditioned on the model ofDNA substitution being correct and the con-strained node being real. Therefore, usingas realistic a model of DNA substitution aspossible is important when reconstructingancestral states. For DNA sequences, modelsthat accommodate more rate parametersor allow limited dependence among sitesmight provide better estimates of ancestralstates. At least one of the clades on the treemust be assumed to be known. Becausemany of the possible phylogenetic trees willnot contain the ancestral node of interest,we must assume that the constraint, at least,is correct. For the problems considered inthis paper, the constraint is not problematic;virtually every study has supported themonophyly of bats (see van den Bussche etal., 1998, for a review of the bat monophylydebate), apes, and amniotes.Many studies in evolutionary biology

assume that the phylogeny and branchlengths of a species group are known with-out error (e.g., Harvey and Pagel, 1991).However, phylogenetic estimates are poten-tially subject to large errors. In fact, methodsfor evaluating uncertainty in phylogenies,such as the nonparametric bootstrap andBayesian inference, suggest that many treeshave a large amount of uncertainty. Ideally,this uncertainty should be accommodatedin evolutionary studies. MCMC has beenapplied with success to accommodate un-certainty in trees. For example, Kuhner et al.(1994) and Beerli and Felsenstein (1999) haveused MCMC to estimate parameters of thecoalescence process while integrating overuncertainty in the gene tree. The hierarchical

Bayesmethoddeveloped in this paper allowsestimation of ancestral states, conditionedonly on the observations of the states at thetip of the tree. The method might usefully beextended to other problems in evolutionarybiology that depend on a phylogeny butfor which the phylogeny is not of primeinterest.
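The five-species examples discussed for Figures 8 and 9 can be sketched numerically. The code below is a minimal illustration, not the authors' implementation. It assumes that each tree is the unrooted tree joining a two-tip clade, a single tip, and the {n4, n5} clade at one central node, that this central node is the constrained node marked by the dot, and that the prior state frequencies are equal. All function names are hypothetical.

```python
import math

STATES = "ACGT"

def jc_prob(same, t):
    """Jukes-Cantor probability of ending in the same (or one given other) state."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def tip_message(obs, t):
    """Likelihood of an observed tip for each state x at its parent node."""
    return [jc_prob(x == obs, t) for x in STATES]

def cherry_message(obs1, obs2, t1, t2, t_stem):
    """Likelihood of a two-tip clade for each state x at the node above its stem."""
    inner = [jc_prob(s == obs1, t1) * jc_prob(s == obs2, t2) for s in STATES]
    return [sum(jc_prob(x == s, t_stem) * inner[i] for i, s in enumerate(STATES))
            for x in STATES]

def node_posterior(pair, single, other_pair, t=0.1, t_stem=None, t_single=None):
    """Posterior of A, C, G, T at the central node of ((pair), single, (other_pair))."""
    t_stem = t if t_stem is None else t_stem
    t_single = t if t_single is None else t_single
    m1 = cherry_message(pair[0], pair[1], t, t, t_stem)
    m2 = tip_message(single, t_single)
    m3 = cherry_message(other_pair[0], other_pair[1], t, t, t)
    joint = [0.25 * a * b * c for a, b, c in zip(m1, m2, m3)]
    total = sum(joint)
    return [j / total for j in joint]

# Figure 8 setting: tips n1..n5 observed as A, A, C, C, C; all branches 0.1.
tree_a = node_posterior("AA", "C", "CC")  # ((n1, n2), n3, (n4, n5))
tree_b = node_posterior("AC", "A", "CC")  # ((n1, n3), n2, (n4, n5))
tree_c = node_posterior("AC", "A", "CC")  # ((n2, n3), n1, (n4, n5)), same tip pattern

# Weight each tree's reconstruction by the tree probabilities used in the text.
weights = [0.5, 0.3, 0.2]
averaged = [sum(w * p[i] for w, p in zip(weights, (tree_a, tree_b, tree_c)))
            for i in range(4)]

# Figure 9 setting: same topology as tree_a, with one branch shortened to 0.01.
short_stem = node_posterior("AA", "C", "CC", t_stem=0.01)    # stem of {n1, n2}
short_tip = node_posterior("AA", "C", "CC", t_single=0.01)   # branch to n3

for name, p in [("tree a", tree_a), ("tree b", tree_b), ("averaged", averaged),
                ("short stem", short_stem), ("short tip", short_tip)]:
    print(name, [round(x, 4) for x in p])
```

Under these assumptions the value for C on Tree a and the tree-averaged probabilities should agree closely with the numbers quoted in the text, and shortening the stem of the {n1, n2} clade lowers the probability of C while leaving it the most probable state.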

ACKNOWLEDGMENTS

This work was supported by NSF grants DEB-0075406 and MCB-0075404 awarded to J.P.H.

REFERENCES

ADACHI, J., AND M. HASEGAWA. 1992. Amino acid substitution of proteins coded for in mitochondrial DNA during mammalian evolution. Jpn. J. Genet. 67:187–197.

ADEY, N. B., T. O. TOLLEFSBOL, A. B. SPARKS, M. H. EDGELL, AND C. A. HUTCHISON. 1994. Molecular resurrection of an extinct ancestral promoter for mouse L1. Proc. Nat. Acad. Sci. USA 91:1569–1573.

BEERLI, P., AND J. FELSENSTEIN. 1999. Maximum likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763–773.

CUMMINGS, M. P., S. P. OTTO, AND J. WAKELEY. 1995. Sampling properties of DNA sequence data in phylogenetic analyses. Mol. Biol. Evol. 12:814–822.

FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376.

GOLDMAN, N., AND Z. YANG. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.

GREEN, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.

HARVEY, P. H., AND M. D. PAGEL. 1991. The comparative method in evolutionary biology. Oxford Univ. Press, Oxford.

HASEGAWA, M., H. KISHINO, AND T. YANO. 1985. Dating the human–ape split by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

HASEGAWA, M., T. YANO, AND H. KISHINO. 1984. A new molecular clock of mitochondrial DNA and the evolution of Hominoids. Proc. Jpn. Acad. Ser. B 60:95–98.

HASTINGS, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109.

HAYASAKA, K., T. GOJOBORI, AND S. HORAI. 1988. Molecular phylogeny and evolution of primate mitochondrial DNA. Mol. Biol. Evol. 5:626–644.

JERMANN, T. M., J. G. OPITZ, J. STACKHOUSE, AND S. A. BENNER. 1995. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57–59.

JUKES, T., AND C. CANTOR. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. Munro, ed.). Academic Press, New York.

KUHNER, M., J. YAMATO, AND J. FELSENSTEIN. 1994. Estimating effective population size and mutation rate from sequence data using Metropolis–Hastings sampling. Genetics 149:429–434.



LARGET, B., AND D. SIMON. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750–759.

LI, S. 1996. Phylogenetic tree construction using Markov chain Monte Carlo. Ph.D. dissertation, Ohio State Univ., Columbus.

MADDISON, W. P., AND D. R. MADDISON. 1992. MacClade, version 3.00. Academic Press, New York.

MALCOLM, B. A., K. P. WILSON, B. W. MATTHEWS, J. F. KIRSCH, AND A. C. WILSON. 1990. Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing. Nature 345:86–89.

MAU, B. 1996. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Ph.D. dissertation, Univ. of Wisconsin, Madison.

MAU, B., AND M. NEWTON. 1997. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. 6:122–131.

MAU, B., M. NEWTON, AND B. LARGET. 1999. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55:1–12.

METROPOLIS, N., A. W. ROSENBLUTH, M. N. ROSENBLUTH, A. H. TELLER, AND E. TELLER. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21:1087–1091.

MUSE, S. V., AND B. S. GAUT. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.

NEWTON, M., B. MAU, AND B. LARGET. 1999. Markov chain Monte Carlo for the Bayesian analysis of evolutionary trees from aligned molecular sequences. In Statistics in molecular biology (F. Seillier-Moseiwitch, T. P. Speed, and M. Waterman, eds.). Monograph Series of the Institute of Mathematical Statistics.

PAGEL, M. 1999. The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies. Syst. Biol. 48:612–622.

RANNALA, B., AND Z. YANG. 1996. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. 43:304–311.

RYAN, M. J., AND A. S. RAND. 1995. Female responses to ancestral advertisement calls in Túngara frogs. Science 269:390–392.

SCHLUTER, D. 1995. Uncertainty in ancient phylogenies. Nature 377:108–109.

SCHLUTER, D., T. PRICE, A. Ø. MOOERS, AND D. LUDWIG. 1997. Likelihood of ancestor states in adaptive radiation. Evolution 51:1699–1711.

SCHONIGER, M., AND A. VON HAESELER. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240–247.

SCHULTZ, T. R., AND G. A. CHURCHILL. 1999. The role of subjectivity in reconstructing ancestral character states: A Bayesian approach to unknown rates, states, and transformation asymmetries. Syst. Biol. 48:651–664.

STACKHOUSE, J., S. R. PRESNELL, G. M. MCGEEHAN, K. P. NAMBIAR, AND S. A. BENNER. 1990. The ribonuclease from an ancient bovid ruminant. FEBS Lett. 262:104–106.

SWOFFORD, D., G. OLSEN, P. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–511 in Molecular Systematics, 2nd edition (D. Hillis, C. Moritz, and B. Mable, eds.). Sinauer, Sunderland, Massachusetts.

TIERNEY, L. 1994. Markov chains for exploring posterior distributions (with discussion). Ann. Stat. 22:1701–1762.

VAN DEN BUSSCHE, R. A., R. J. BAKER, J. P. HUELSENBECK, AND D. M. HILLIS. 1998. Base compositional bias and phylogenetic analyses: A test of the flying DNA hypothesis. Mol. Phylogenet. Evol. 13:408–416.

YANG, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.

YANG, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306–314.

YANG, Z., S. KUMAR, AND M. NEI. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650.

YANG, Z., AND B. RANNALA. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Mol. Biol. Evol. 14:717–724.

Received 8 March 2000; accepted 3 May 2000
Associate Editor: R. Olmstead

APPENDIX

We use the Metropolis–Hastings–Green algorithm (Metropolis et al., 1953; Hastings, 1970; Green, 1995) to approximate the posterior probabilities of ancestral states (and other parameters). We changed one or a few parameters of the model at a time. The proposal mechanisms are described here.

Changing the topology and branch lengths. We used the LOCAL algorithm of BAMBE to change the topology and branch lengths of the tree simultaneously (Larget and Simon, 1999). One of the $s - 3$ internal branches of the tree was chosen at random. The two nodes of this branch are labeled $u$ and $w$. This branch partitions the tree, with two clades on one end of the branch and two clades on the other. One of the clades on the left side of the branch (from node $u$) is randomly labeled $a$, and the other is randomly labeled $b$. Similarly, one of the clades to the right (from node $w$) is randomly labeled $c$, the other randomly labeled $d$. The path length from $a$ to $c$ is $m$. Path length $m$ is modified by multiplying the branches on that path by a random number to obtain the new path length, $m^{*}$. Specifically, $m^{*} = m \, e^{\lambda (U - 1/2)}$, where $U$ is a uniformly distributed random number on the interval (0, 1) and $\lambda$ is a tuning parameter (Larget and Simon, 1999). After the path between $a$ and $c$ is modified, one of the two branches, $u$–$b$ or $w$–$d$, is detached from the tree and then reattached at random along the path between $a$ and $c$. Specifically, its reattachment point is uniformly distributed on the interval $(0, m^{*})$. The acceptance probability for this move (see Larget and Simon, 1999:753) is

$$R = \min\left[1,\ (\text{likelihood ratio}) \times \left(\frac{m^{*}}{m}\right)^{2}\right]$$
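The numerical part of this move can be sketched as follows. This is only an illustration of the path-length multiplier, the uniform reattachment point, and the acceptance probability as stated above; it does not rearrange a tree. The function names, the variable `lam` (the tuning parameter), `m_star` (the new path length), and the example values are all hypothetical.

```python
import math
import random

def propose_path_length(m, lam, rng=random):
    """Rescale path length m by exp(lam * (U - 1/2)) with U ~ Uniform(0, 1)."""
    u = rng.random()
    return m * math.exp(lam * (u - 0.5))

def reattachment_point(m_star, rng=random):
    """Position along the modified path at which the detached branch is reattached."""
    return rng.uniform(0.0, m_star)

def local_acceptance(log_likelihood_ratio, m, m_star):
    """min[1, (likelihood ratio) * (m_star / m)^2], evaluated on the log scale."""
    log_r = log_likelihood_ratio + 2.0 * (math.log(m_star) - math.log(m))
    return 1.0 if log_r >= 0.0 else math.exp(log_r)

# Example with placeholder values: current path length 0.3, tuning parameter 0.2,
# and a log likelihood ratio of -0.1 that would come from the phylogenetic model.
rng = random.Random(1)
m = 0.3
m_star = propose_path_length(m, lam=0.2, rng=rng)
attach_at = reattachment_point(m_star, rng=rng)
accept_prob = local_acceptance(-0.1, m, m_star)
accepted = rng.random() < accept_prob
print(m_star, attach_at, accept_prob, accepted)
```

Working with the log of the likelihood ratio is a design choice made here to avoid numerical underflow; larger values of the tuning parameter give bolder changes in path length.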

Changing $\kappa$ and $\alpha$. The transition/transversion rate ratio ($\kappa$) and the gamma shape parameter ($\alpha$) were modified by adding to the current value a uniformly distributed random number on the symmetric interval $[-\epsilon, +\epsilon]$. Both $\kappa$ and $\alpha$ are restricted to positive values. When the proposed state was negative, the excess was reflected back. The acceptance probability for these moves is

$$R = \min\left[1,\ (\text{likelihood ratio})\right]$$

Changing $\pi$. New values for the base frequencies were proposed from a Dirichlet distribution with expected values at the current values. The Dirichlet distribution has probability density

$$f(\pi \mid \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_A)\,\Gamma(\alpha_C)\,\Gamma(\alpha_G)\,\Gamma(\alpha_T)} \; \pi_A^{\alpha_A - 1} \, \pi_C^{\alpha_C - 1} \, \pi_G^{\alpha_G - 1} \, \pi_T^{\alpha_T - 1}$$

where $\alpha_i$ is the Dirichlet parameter for the $i$th nucleotide, $\alpha_0 = \alpha_A + \alpha_C + \alpha_G + \alpha_T$, and $\pi_i$ is the frequency of the $i$th nucleotide. New base frequencies are drawn from the Dirichlet distribution with parameter $\alpha_i = \pi_i \alpha_0$. We set $\alpha_0 = 100$ in this study.
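The two proposal mechanisms for the substitution parameters can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the window half-width, and the starting values are hypothetical, and the Dirichlet draw uses the standard construction of normalizing independent gamma variates.

```python
import random

def sliding_window(value, half_width, rng=random):
    """Add a Uniform(-half_width, +half_width) increment and reflect at zero."""
    proposal = value + rng.uniform(-half_width, half_width)
    # A negative proposal has its excess reflected back to the positive side.
    return -proposal if proposal < 0.0 else proposal

def propose_frequencies(pi, alpha0=100.0, rng=random):
    """Draw frequencies from a Dirichlet with parameters alpha0 * pi_i."""
    draws = [rng.gammavariate(alpha0 * p, 1.0) for p in pi]
    total = sum(draws)
    return [d / total for d in draws]

# Example with placeholder starting values (not estimates from the paper).
rng = random.Random(7)
new_kappa = sliding_window(5.0, half_width=0.5, rng=rng)
new_shape = sliding_window(0.05, half_width=0.5, rng=rng)
new_pi = propose_frequencies([0.3, 0.2, 0.2, 0.3], alpha0=100.0, rng=rng)
print(new_kappa, new_shape, [round(p, 3) for p in new_pi])
```

Because the Dirichlet parameters are the current frequencies scaled by 100, the expected value of each proposed frequency equals its current value, and a larger scaling constant would concentrate proposals more tightly around the current state.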
