advances in phylogeny reconstruction from gene order and …treport/tr/04-07/advances.pdf ·...

27
Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque NM 87131 [email protected] Tandy Warnow Department of Computer Sciences University of Texas Austin TX 78712 [email protected] Abstract Genomes can be viewed in terms of their gene content and the order in which the genes appear along each chromosome. Evolutionary events that affect the gene order or content are “rare genomic events” (rarer than events that affect the composition of the nucleotide sequences) and have been advo- cated by systematists for inferring deep evolutionary histories. This chapter surveys recent developments in the reconstruction of phylogenies from gene order and content, focusing on their performance under various stochastic models of evolution. Because such methods are currently quite restricted in the type of data they can analyze, we also present current research aimed at handling the full range of whole-genome data. 1 Introduction: Molecular Sequence Phylogenetics A phylogeny represents the evolutionary history of a collection of organisms, usu- ally in the form of a tree. Sequence data are by far the most common form of molecular data used in phylogenetic analyses. We begin by briefly reviewing tech- niques for estimating phylogenies from molecular sequences, with emphasis on the computational and statistical issues involved. 1.1 Model trees and stochastic models of evolution Most algorithms for phylogenetic reconstruction attempt to reverse a model of evo- lution. Such a model embodies certain knowledge and assumptions about the pro- cess of evolution, such as characteristics of speciation and details about evolution- ary changes that affect the content of molecular sequences. Models of evolution vary in their complexity; in particular, they require different numbers of param- eters. For instance, the Jukes-Cantor model, which assumes that all sites evolve identically and independently and that all substitutions are equally likely, requires just one parameter per edge of the tree, viz., the expected number of changes of a 1

Upload: others

Post on 03-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

Advances in Phylogeny Reconstructionfrom Gene Order and Content Data

Bernard M.E. MoretDepartment of Computer ScienceUniversity of New MexicoAlbuquerque NM [email protected]

Tandy WarnowDepartment of Computer SciencesUniversity of TexasAustin TX [email protected]

Abstract

Genomes can be viewed in terms of their gene content and the order inwhich the genes appear along each chromosome. Evolutionary events thataffect the gene order or content are “rare genomic events” (rarer than eventsthat affect the composition of the nucleotide sequences) and have been advo-cated by systematists for inferring deep evolutionary histories. This chaptersurveys recent developments in the reconstruction of phylogenies from geneorder and content, focusing on their performance under various stochasticmodels of evolution. Because such methods are currently quite restricted inthe type of data they can analyze, we also present current research aimed athandling the full range of whole-genome data.

1 Introduction: Molecular Sequence Phylogenetics

A phylogeny represents the evolutionary history of a collection of organisms, usu-ally in the form of a tree. Sequence data are by far the most common form ofmolecular data used in phylogenetic analyses. We begin by briefly reviewing tech-niques for estimating phylogenies from molecular sequences, with emphasis on thecomputational and statistical issues involved.

1.1 Model trees and stochastic models of evolution

Most algorithms for phylogenetic reconstruction attempt to reverse a model of evo-lution. Such a model embodies certain knowledge and assumptions about the pro-cess of evolution, such as characteristics of speciation and details about evolution-ary changes that affect the content of molecular sequences. Models of evolutionvary in their complexity; in particular, they require different numbers of param-eters. For instance, the Jukes-Cantor model, which assumes that all sites evolveidentically and independently and that all substitutions are equally likely, requiresjust one parameter per edge of the tree, viz., the expected number of changes of a

1

Page 2: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

random site on that edge. Overall, then, a rooted Jukes-Cantor tree with n leavesrequires 2n− 2 parameters. Under more complex models of evolution, the pro-cess operating on a single edge can require up to 12 parameters (for the GeneralMarkov model), although these models still requires Θ(n) parameters overall. Ifedge “lengths” are drawn from a distribution, however, the complexity can be re-duced, since the evolutionary process operating on the model tree can then be de-scribed just by the parameters of the distribution.

These parameters describe how a single site evolves down the tree and so re-quire additional assumptions in order to describe how different sites evolve. Usu-ally the sites are assumed to evolve independently; sometimes they are also as-sumed to evolve identically. Moreover, the different sites are assume either toevolve under the same process or to have rates of evolution that vary dependingupon the site. In the latter case (in which each site has its own rate), an additionalk parameters are needed, where k is the number of sites. However, if the ratesare presumed to be drawn from a distribution (typically, the gamma distribution),then a single additional parameter suffices to describe the evolutionary process op-erating on the tree; furthermore, in this case, the sites still evolve under the i.i.d.assumption.

Tree generation models typically have parameters regulating speciation rates,but also inheritance characteristics, etc. For more on stochastic models of (se-quence) evolution, see [19, 34, 39, 82]; for an interesting discussion of models oftree generation, see [29, 48].

By studying the performance of methods under explicit stochastic models ofevolution, it becomes possible to assess the relative strengths of different methods,as well as to understand how methods can fail. Such studies can be theoretical, forinstance proving statistical consistency: given long enough sequences, the methodwill return the true tree with arbitrarily high probability. Others can use simulationsto study the performance of the methods under conditions closely approximatingpractice. In a simulation, sequences are evolved down different model trees andthen given to different methods for reconstruction; the reconstructions can then becompared against the model trees that generated the data. Such studies provide im-portant quantifications of the relative merits of phylogeny reconstruction methods.

1.2 Phylogeny reconstruction methods from molecular sequences

Three main types of methods are used to reconstruct phylogenies from molecularsequences: distance-based methods, maximum parsimony heuristics, and maxi-mum likelihood heuristics.

2

Page 3: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

1.2.1 Distance-based methods

Of the three types of methods, only distance-based methods include algorithms thatrun in polynomial-time. Distance-based methods operate in two phases:

1. Pairwise distances between every pair of taxa are estimated.

2. An algorithm is applied to the matrix of pairwise distances to compute anedge-weighted tree T .

The statistical consistency (if any) of such two-phase procedures rests on two as-sumptions: first, that a statistically consistent distance estimator is used in the firstphase and, second, that an appropriate distance-based algorithm is used in the sec-ond phase. The requirements that the first phase be statistically consistent meansthat the distance estimator should return a value that approaches the expected num-ber of times a random site changes on the path between the two taxa. Thus, the esti-mation of pairwise distances must be done with respect to some assumed stochasticmodel of evolution. As an example, in the Jukes-Cantor model of evolution, theestimated distance between sequences si and s j is given by the formula

di j = −34

ln(

1−43

Hi j

k

)

,

where k is the sequence length and Hi j denotes the Hamming distance (the numberof positions in which the two sequences differ, which is the edit distance undermutation operations).

Algorithms that attempt to reconstruct trees from distance matrices are guaran-teed to produce accurate reconstructions of the trees only when the distance matrixentries approach very closely the actual number of changes between the pair of se-quences. (In the context of estimating model trees, this requirement means that theestimated distances need to be extremely close to the model distances, defined tobe the expected number of times a random site changes on a leaf-to-leaf path. See[1, 34] for more on this issue.) Naı̈vely defined distances, such as the Hammingdistance, typically underestimate the number of changes that took place in the evo-lutionary history; thus the first step of a distance-based method is to modify thenaı̈vely defined distance into one that accurately accounts for the expected numberof unseen back-and-forth changes in a site. The only way to produce such distancecorrections is to assume some specific model of evolution; hence, each model forwhich distance-based methods have been successfully used is also equipped witha statistically consistent distance estimator.

The most commonly used, and simplest, distance-based method is the neighbor-joining (NJ) algorithm of Saitou and Nei [72]; improved versions of this basic

3

Page 4: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

method include BioNJ [22] and a version known as Weighbor, which requires anestimate of the variance of the distance estimator [9]. NJ is known to be statisticallyconsistent under most models of evolution.

1.2.2 Maximum Parsimony

Parsimony-based methods seek the tree, along with sequences labelling its internalnodes, that together minimize the total number of evolutionary changes (viewedas distances summed along all edges of the tree). Put formally, the problem isas follows: Given a set S of sequences in a multiple alignment, each of length k,find a tree T and a set of additional sequences S0, all also of length k, so that,with the leaves of T are labelled by S and its internal nodes by S0, the value∑e∈E(T ) Hamming(e) is minimized, where Hamming(e) denotes the Hamming dis-tance between the sequences labelling the endpoints of e. (Weighted or distance-corrected versions can also be defined.)

The maximum parsimony problem (MP) is thus an optimization problem—anda hard one: finding the best tree is provably NP-hard [15]. This property effectivelyrules out exact solutions for all but the smallest instances; indeed, in practice, exactsolvers run within reasonable time on at most 30 taxa. Thus heuristics are the nor-mal approach to the problem; most are based on iterative improvement techniquesand appear to return very good solutions for up to a few hundred taxa. Many soft-ware packages implement such heuristics, among them MEGA [35], PAUP* [81],Phylip [20], and TNT [23].

1.2.3 Maximum Likelihood

Like maximum parsimony, maximum likelihood (ML) is an optimization problem.ML seeks the tree and associated model parameter values that maximizes the prob-ability of producing the given set of sequences. ML thus depends explicitly on theassumed model of evolution. For example, the ML problem under the Jukes-Cantormodel needs to estimate one parameter (the substitution probability) for each edgeof the tree, while under the General Markov model 12 parameters must be esti-mated on each edge. ML is much more computationally expensive than MP: eventhe point estimation (how to score a fixed tree) cannot currently be done in poly-nomial time for ML (see [78] for a discussion), whereas it is easily accomplishedin linear time for MP using Fitch’s algorithm [21]. In consequence, exact solutionsfor ML are limited to fewer than 10 taxa and even heuristic solutions are typicallylimited to fewer than 100 taxa. Various software packages provide heuristics forML, including PAUP* [81], Phylip [20], FastDNAml [61], and PhyML [24].

4

Page 5: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

1.3 Performance issues

Methods can be compared in terms of their performance guarantees, in terms oftheir resource requirements, and in terms of the quality of the trees they produce.Very few methods offer any performance guarantees, except in purely theoreti-cal terms. For instance, while ML is known to be statistically consistent undermost models, the same cannot be said of its heuristic implementation; and evenneighbor-joining, which is statistically consistent and is implemented exactly, mayreturn very poor trees—the guarantee of statistical consistency only implies goodperformance in the limit, as sequences lengths become sufficiently large. In termsof computational requirements, the comparison is easy: distance-based methodsare efficient (running in polynomial-time with low coefficients); parsimony is muchharder to solve (systematists are accustomed to running MP for weeks on a datasetof modest size); and maximum likelihood is much harder again than MP. Thesecomparisons, however, all have limited value: as we saw, statistical consistencyis a very weak guarantee, while a guarantee of fast running times is worthless ifthe returned solution is poor. Thus experimental studies are our best tool in thestudy of the relative performance of methods. Simulation studies, in particular, canestablish the absolute accuracy of methods (whereas studies conducted with bio-logical datasets can only assess relative performance in terms of the optimizationcriterion). Such studies have shown that MP methods can produce reasonably goodtrees under conditions where neighbor-joining can have high topological error (sig-nificantly worse than MP); this possibly surprising performance holds under manyrealistic model conditions—in particular, when the model tree has a high evolu-tionary diameter [50, 58, 59, 60, 71].

1.4 Limitations for molecular sequence phylogenetics

Although existing methods often yield good estimates of phylogenies on datasetsof small to medium size, all methods based on molecular sequences suffer fromsimilar limitations. Perhaps most seriously, deep evolutionary histories can be hardto reconstruct from molecular sequence data: the further back one goes in time, theharder the alignment of sequences becomes and the greater the impact of homo-plasy (multiple point mutations at the same position). Under these conditions, wehave established, through extensive simulation studies, that most major methods(heuristics for maximum parsimony as well as neighbor-joining) have poor topo-logical accuracy [50, 58, 60]. The problem accrues from a combination of thesmall state space (only four nucleotides for DNA or RNA sequences and only 20amino acids for protein sequences), the relatively high frequency of point muta-tions, and the limited amount of data. Concatenating—also called combining—

5

Page 6: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

gene sequences to obtain longer sequences may provide more data, but brings itsown problems: different genes may follow different evolutionary paths (each po-tentially different from that of the organism as a whole)—a problem known as thegene tree/species tree problem [42, 44, 65, 67]—, while reticulation events (suchas hybridization, lateral gene transfer, gene conversion. etc., see [40]) may createconvergent paths, with the result that tree-based analyses may run into contradic-tions and yield poor results. Current research on resolving the gene tree/speciestree problem (a process known as reconciliation [62, 63, 64, 65]) and on identify-ing and properly handling reticulation events has not yet produced reliably accu-rate and scalable methods. Thus phylogeny reconstruction based on site-evolutionmodels will continue to suffer problems, for at least the near future, when attempt-ing to infer deep evolutionary histories.

2 Whole-Genome Evolution

Systematists are interested in whole genomes because they have the potential toovercome two of the main problems afflicting sequence data. Since the entiregenome is used, the data reflect organismal evolution, not the evolution of singlegenes, thereby avoiding the gene tree/species tree problem. Moreover, the eventsthat affect the whole genome by altering its gene content or rearranging its genesare so-called “rare genomic events” [70]: they occur rarely and come from a verylarge set of choices (for instance, there can be a quadratic number of distinct in-version events), so that they are unlikely to give rise to homoplasy, even in deepbranches of the tree. Thus genome rearrangements, in particular, have been in-creasingly used in phylogenetic analysis [7, 13, 14, 16, 32, 79].

To date, the main approach to whole-genome phylogenetic analysis has usedthe ordering of the genes along the chromosomes as its primary data. That is,each chromosome is considered as a linear (or circular) ordering of genes, witheach gene represented by an identifier that it shares with its homologs on otherchromosomes (or, for that matter, on the same chromosome, in the case of geneduplications). The genome is thus simplified in the sense that point mutations areignored and evolutionary history is inferred on the basis of the gene content andgene order within each chromosome. A typical single circular chromosome for thechloroplast organelle of a Guillardia species (taken from the NCBI database) isshown in Figure 1. Plant chloroplast genomes such as this have around 120 DNAgenes; in contrast, nuclear chromosomes in eukaryotes and free-living bacteriatypically have several thousand genes.

6

Page 7: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

Figure 1: The chloroplast chromosome of Guillardia (from NCBI)

2.1 Evolution of gene order and content

Events that change gene order (but not content) along a single chromosome includeinversions, which are well documented [32, 66], transpositions (strongly suspectedin mitochondria [6, 7]), and inverted transpositions; these three operations are il-lustrated in Figure 2. In a multichromosomal genome, additional operations that donot affect gene content include translocations, which moves a piece of one chromo-some into another chromosome (in effect, a transposition between chromosomes),and fissions and fusions, which split and merge chromosomes without affectinggenes. Finally, a number of events can affect the gene content of genomes: inser-tions (of genes without existing homologs), duplications (of genes with existinghomologs), and deletions. In multichromosomal organisms, colocation of geneson the same chromosome, or synteny, is an important evolutionary character andhas been used in phylogenetic reconstruction [57, 75, 76].

In this model, one whole chromosome forms a single character, whose state isaffected by all of the operations just described. This one character can assume anyof an enormous number of states—for a chromosome with n distinct, single-copy

7

Page 8: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

12

3 7

4 65

8

7

85

6

1

43

2

7

85

6

1

−4−3

−2

1

7

65

8−4

−3

−2

InversionInverted Transposition

Transposition

Figure 2: The three rearrangements operating on a single chromosome

genes, the number of states is 2n−1(n− 1)!, in sharp contrast to the 4 or 20 statespossible for a sequence character. Even for a simple chloroplast genome, with asingle small circular chromosome of 120 genes, the resulting number of states isvery large—on the order of 10235.

The use of gene-order and gene-content data in phylogenetic reconstruction isrelatively recent and the subject of much current research. As mentioned earlier,such data present many advantages: (i) the identification of homologies can reston a lot of information and thus tends to be quite accurate; (ii) because the datacapture the entire genome, there is no gene tree vs. species tree problem; (iii) thereis no need for multiple sequence alignment; and (iv) gene rearrangements and du-plications are much rarer events than nucleotide mutations and thus enable us totrace evolution farther back in time—by two or more orders of magnitude.

However, there remain significant challenges. First is the lack of data: map-ping a full genome, while easier than sequencing it, remains far more demandingthan sequencing a few genes. Table 1 gives a rough idea of the state of affairs

Table 1: Existing whole-genome data ca. 2003 (approximate values)

Type Attributes NumbersAnimal mitochondria 1 chromosome, 40 genes 200Plant chloroplast 1 chromosome, 140 genes 100Bacteria 1–2 chromosomes, 500–5,000 genes 50Eukaryotes 3–30 chromosomes, 2,000–40,000 genes 10

8

Page 9: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

Table 2: Main attributes of sequence and gene-order data

Sequence Gene-Orderevolution fast slow

errors small negligibledata type a few genes whole genome

data quantity abundant sparse# char. states tiny huge

models good primitivecomputation easy hard

around 2003; for obvious reasons, most of the bacteria sequenced to date are hu-man pathogens, while the few eukaryotes are the model species chosen in genomeprojects: human, mouse, fruit fly, worm, mustard plant, yeast, etc. Although thenumber of sequenced eukaryotic genomes is growing quickly, coverage at this levelof detail will not, for the foreseeable future, exceed a small fraction of the totalnumber of described organisms.

This lack of data has so far prevented us from understanding very much aboutthe relative probabilities of different events that modify gene order and content;consequently, the stochastic models proposed to date (see Section 2.2) remain fairlyprimitive.

Finally, the extreme (at least in comparison with sequence data) mathemat-ical complexity of gene orders means that all reconstruction methods (even thedistance-based ones) face major computational challenges even on small datasets(containing only ten or so genomes).

Table 2 summarizes the salient characteristics of sequence data and gene-orderdata.

2.2 Stochastic models of evolution

Models of genome evolution have been largely limited to simple combinations ofthe main rearrangement operations: inversion, transposition, and inverted trans-position. The original model was proposed by Nadeau and Taylor; it uses onlyinversions and assumes that all inversions are equally likely [57]. We extendedthis model to produce the Generalized Nadeau-Taylor (GNT) model [89], whichincludes transpositions and inverted transpositions. Within each of the three typesof events, any two events are equiprobable, but the relative probabilities of eachtype of event are specified by the two parameters of the model: α, the probabil-

9

Page 10: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

ity that a random event is a transposition and β, the probability that a randomevent is an inverted transposition. (The probability of an inversion is thus givenby 1−α− β.) The GNT model contains the Nadeau-Taylor model as a specialcase: just set α = β = 0. Extensions of this simple model in which the probabilityof an event depends on the segment affected by the event have been proposed [5],but no solid data exist to support one model over another; all that appears certain(see, e.g., [37]) is that short inversions are more likely than long ones. Similarly,multichromosomal rearrangements such as translocations and events that affect thegene content of chromosomes such as duplications, insertions, and deletions, haveall been considered, but once again we have insufficient biological data to defineany model with confidence.

3 Genomic Distances

The distance between two genomes (each represented by the order of its genes) canbe defined in several ways. First, we have the true evolutionary distance, that is, theactual number of evolutionary events (mutations, deletions, etc.) that separate onegenome from the other. This is the distance measure we would really want to have,but of course it cannot be inferred—as our earlier discussion of homoplasy madeclear, we cannot infer such a distance even when we know the correct phylogenyand have correctly inferred ancestral data (at internal nodes of the tree). What wecan define precisely and, in some cases, compute, is the edit distance, the minimumnumber of permitted evolutionary events that can transform one genome into theother. Since the edit distance invariably underestimates the true evolutionary dis-tance, we can attempt to correct the edit distance according to an assumed modelof evolution in order to produce the expected true evolutionary distance. Finally,we can attempt to estimate the true evolutionary distance directly through variousheuristic techniques.

We begin by reviewing distance measures between two chromosomes withequal gene contents and no duplications—the simplest possible case—, then dis-cuss current research on distance measures between multichromosomal genomesand between genomes with unequal gene content.

3.1 Distances between two chromosomes with equal gene content andno duplications

Here we consider chromosomes that have identical gene content and exactly onecopy of each gene, so that each chromosome can be viewed as a (singed) permu-tation of the underlying set of genes. Two distance metrics have been used in this

10

Page 11: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

context, one based on observed differences and one based on allowable evolution-ary operations. Assume our two chromosomes each have seven genes, numbered1 through 7, and are circular. Genome G1 is given by (1,2,−4,−3,5,6,7) andgenome G2 by (1,2,3,4,5,6,7).

• The breakpoint distance simply counts the number of gene adjacencies (readon either strand) present in one chromosome, but not in the other. In ourexample, we have two breakpoints: the adjacencies 2,3 and 4.5 are presentin G2, but not in G1. (Note that the adjacency 3,4 is present on the forwardstrand on G2 and on the reverse complement strand, as −4,−3, on G1. Notealso that the definition is symmetric: the adjacencies 2,−4 and −3,5 arepresent in G1, but not in G2.) The breakpoint distance is thus not based onevolutionary events, only on the end result of such events.

• The inversion distance is the edit distance under the single allowed event ofinversion. We need only one inversion to transform G1 into G2: we invertthe two-gene segment −4,−3.

Every inversion clearly creates (or, equivalently, removes) at most two breakpoints,so that the inversion distance is at least half the breakpoint distance. The number ofbreakpoints is also clearly at most n in a circular chromosome of n genes (and n+1in a linear chromosome); less obviously, the same bound holds for the inversiondistance [47].

Computing the breakpoint distance is trivially achievable in linear time. Com-puting the inversion distance, however, is a very complex problem. Indeed, it iscomputationally intractable (technically, it is NP-hard) for unsigned permutations(when we cannot tell on which strand each gene lies). For signed permutations, weshowed that it can be computed in linear time [3], but this result is the culmina-tion of many years of research and rests on the very elaborate and elegant theoryof Hannenhalli and Pevzner [25, 26]. Obtaining an actual sequence of inversions(as opposed to just the number of required inversions) is computationally more de-manding: the classic algorithm of Kaplan et al. [33] takes O(dn) time, where d isthe distance and n the number of genes; thus, for large distances, this algorithmtakes quadratic time, but was recently improved to O(n

√n log n) time [86].

Transpositions are also of significant interest in biology. However, while someof the same theoretical framework can be used [4], results here are disappointing:no efficient algorithm has yet been developed to compute the transposition dis-tance. The best result to date remains an approximation algorithm that could sufferfrom up to a 50% error [4, 27]; this result was recently extended, with the sameerror bound, to the computation of edit distances under a combination of inversionsand inverted transpositions (with equal weights assigned to each) [28].

11

Page 12: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

3.2 Distance corrections

None of these distances produces the true evolutionary distance; indeed, since allare bounded by n, all can produce arbitrary underestimates of the true evolutionarydistance. In order to estimate the latter, we must use correction methods. Suchmethods are widely used in distance estimation between DNA sequences [31, 82].We designed two techniques for “correcting” gene-order distances have been de-veloped under the GNT model, one (IEBP) based on the breakpoint distance [89]and the other (EDE) [52, 55] based on the inversion distance.

• The IEBP estimator takes as input the breakpoint distance and the values ofthe two GNT parameters, α and β, and returns the expected number of inver-sions, transpositions, and inverted transpositions under the specified relativeprobabilities of the different events. The method is analytical and mathemat-ically exact and can be implemented to run in cubic time.

• The EDE estimator (“EDE” stands for “empirically derived estimator”) takesas input the inversion distance and produces an estimate of the expected num-ber of inversions, under a model in which all inversions are equally likely.The formula used was derived with numerical techniques from a large seriesof simulations and can be computed in quadratic time.

Figure 3 [52] illustrates the EDE correction and compares it with the breakpointand inversion distances, under an inversion-only scenario. The EDE estimator of-fers no theoretical guarantee and only takes inversions into account, but is easier tocompute than the IEBP estimator, which also requires an accurate guess of the val-ues of the two GNT parameters. Both estimators suffer from the simple fact that thevariance of the true distance (as a function of the breakpoint or inversion distance)grows rapidly as the distance grows—as is clearly visible in Figure 3. The most im-portant aspect of EDE’s performance is that trees reconstructed by distance-based

0 50 100 150 2000

50

100

150

200

Breakpoint Distance

Act

ual n

umbe

r of e

vent

s

0 50 100 150 2000

50

100

150

200

Inversion Distance

Act

ual n

umbe

r of e

vent

s

0 50 100 150 2000

50

100

150

200

EDE Distance

Act

ual n

umbe

r of e

vent

s

breakpoint distance inversion distance EDE

Figure 3: Edit distances vs. true evolutionary distances and the EDE correction

12

Page 13: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

methods using EDE are generally more accurate than those reconstructed by thesame methods using any of the other distances (including IEBP), even when theevolutionary model involves many (or even only) transpositions [52, 55].

3.3 Distances between two chromosomes with unequal gene content

Unequal gene content introduces new evolutionary events: insertions (includingduplications) and deletions. If we first assume that each gene exists in at mostone copy (no duplication), then the problem is to handle insertions of new genesand deletions of existing ones. (Note that the same event can be viewed as a dele-tion or as a non-duplicating insertion, depending on the direction of time flow.)In a seminal paper [18], El-Mabrouk showed how to extend the theory of Han-nenhalli and Pevzner for a single chromosome without duplicated genes to handleboth inversions and deletions exactly; we showed that the corresponding distancemeasure can also be computed in linear time [41]. In the same paper, El-Mabroukalso showed how to approximate the distance between two such genomes in thepresence of both deletions and nonduplicating insertions along the same time flow.

Duplications are considerably harder from a computational standpoint, as theyintroduce a matching problem: which homolog in one genome corresponds towhich in the other genome? Sankoff proposed to sidestep the entire issue by re-ducing the problem to one with no duplications using the exemplar approach [73].In that approach a single copy is chosen from each gene family, and all other ho-mologs in each family are discarded, in such a way as to minimize the breakpointor inversion distance between the two genomes. Unfortunately, choosing the exem-plars themselves is an NP-hard problem [10]; moreover, the loss of information innuclear genomes (which can have tens or hundreds of genes in each gene family) isvery large—large enough to give rise to problems in reconstruction (see [83, 85]).

We recently proposed an approximation with provable guarantees (in terms ofthe edit distance) for the distance between two genomes under duplications, inser-tions, and deletions [45]. We later refined the approach to estimate true evolution-ary distances directly [80] and used the new measure in a pilot study of a groupof 13 γ-proteobacteria with widely differing gene contents (varying from 800 toover 5,000 genes in their single nuclear chromosome) [17]. The latter study indi-cates that simple distance-based reconstructions can be very accurate even in thepresence of enormous evolutionary distances: many of the edges in the final treehave several hundred evolutionary events along them. Our simulations show thatthe distance computation tracks the true evolutionary distance remarkably well upuntil saturation, which only occurs at extremely high levels of evolution (over 250events on a genome of 800 genes, for instance). Figure 4 [80] shows typical resultsfrom these simulations: genomes of 800 genes were generated in a simulation on

13

Page 14: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

0

50

100

150

200

250

300

0 50 100 150 200 250 300

Cal

cula

ted

Edi

t Len

gths

Generated Edit Lengths

y=x

-100

-50

0

50

100

0 50 100 150 200 250 300

Err

or V

aria

nce

Generated Edit Lengths

Low Error

High Error

y=0

Figure 4: Experimental results for 800 genes with expected edge length 40. Left:generated edit length vs. reconstructed length; right: the variance of computeddistances per generated distance.

a balanced tree of 16 taxa and all 120 pairwise distances computed and comparedto the true evolutionary distances (the sum of the lengths of the edges in the truetree on the path connecting each pair). The figure shows the calculated edit lengthsas a function of the generated edit lengths (the true evolutionary distances) on theleft and an error plot on the right. The data used for this figure were generatedusing an expected edge length of 40, so that the pairwise distance between leaveswhose common ancestor is the root has an expected value of 320 events. Eventswere a mix of 70% inversions, 16% deletions, 7% insertions, and 7% duplications;the inversions had a mean length of 20 and a standard deviation of 10, while thedeletions, insertions, and duplications all had a mean length of 10 with a standarddeviation of 5. The figure shows excellent tracking of the true evolutionary dis-tance for up to about 250 events, after which the computation consistently returnsvalues less than the true distance; of course, distance corrections could be devisedto remedy the latter situation.

3.4 Distances between multichromosomal genomes

With multiple chromosomes, we can still make the same distinction as for sin-gle chromosomes, by first addressing genomes with equal gene content and noduplicate genes. The second of the two papers by Hannenhalli and Pevzner [26]showed how to handle a combination of inversions and translocations for suchgenomes, a result later improved by Tesler [87] to handle any combination of in-versions, translocations, fissions, and fusions. Using the same approach we pio-neered [3], one can devise linear-time algorithms to compute other distances cov-

14

Page 15: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

ered by the Hannenhalli-Pevzner theory, such as the translocation distance [38].However, multichromosomal genomes typically have large gene families, so thatthe handling of duplications is in fact crucial; moreover, few such genomes willhave identical gene content—even such closely related genomes as two species ofthe roundworm genus Caenorhabditis, C. elegans and C. briggsae, have differentgene content. Thus no distance method currently exists that could be used in thereconstruction of phylogenies for multichromosomal organisms.

4 Phylogenetic Reconstruction From Whole Genomes

The same three general types of approaches to phylogenetic reconstruction that wejust reviewed for sequence data can be used for whole-genome data.

4.1 Distance-based methods

We have run extensive simulations under a variety of GNT model conditions [52,54, 55, 88], using breakpoint distance, inversion distance, and both EDE and IEBPcorrections, all using the standard neighbor-joining [72] and Weighbor [9] distance-based reconstruction algorithms, as well as within our own “DCM-boosted” exten-sions of the same [30]. Figure 5 [52] shows typical results from these simulationstudies: NJ is run with four distances (BP stands for breakpoint and INV for in-version) on datasets of various sizes (here, 10, 20, 40, 80, and 160 taxa), eachtaxon being given by a signed permutation of the same set of 120 genes, then thereconstructed trees are compared with the model trees. The figure shows the false

0 0.2 0.4 0.6 0.8 10

10

20

30

40

50

60

70

Normalized Maximum Pairwise Inversion Distance

Fals

e N

egat

ive

Rat

e (%

)

NJ(BP) NJ(INV) NJ(IEBP)NJ(EDE)

Figure 5: False negative rates for NJ run with four genomic distance measures. Thedatasets have 10, 20, 40, 80, and 160 taxa; each taxon is a signed permutation of thesame set of 120 genes, generated from the identity permutation at the root throughan equal-weight mix of inversions, transpositions, and inverted transpositions.

15

Page 16: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

negative rate of the trees reconstructed with each distance measure, as a functionof the diameter of the dataset. At low rates of evolution, the results are much thesame for all four distances measures, but the corrected measures perform muchbetter than the uncorrected ones when the diameter is large.

Our simulations have established the following:

• IEBP distances are generally the most accurate, even when given incorrectestimates of the parameters α and β. Although EDE is based on an inversion-only scenario, it produces highly accurate estimates of the actual number ofevents even under model conditions in which inversions play a minor role.Both EDE and IEBP are thus very robust to model violations.

• Phylogenies estimated using EDE are more accurate than phylogenies esti-mated using any other distance estimator (including IEBP), under all condi-tions, but especially when the model tree has a high evolutionary diameter.Phylogenies based upon either EDE or IEBP are better than phylogeniesbased upon inversion distances, which in turn are far better than phylogeniesbased upon breakpoint distances.

4.2 Parsimony-based methods

Given a way of defining a distance between two genomes, we can define the“length” of a tree in which every node is labelled by a genome to be the sumof the lengths of the edges. This measure of length is thus similar to that used insequence-based phylogenetic analysis, enabling us to define parsimony problemsfor gene-order data. Using the breakpoint distance, for instance, we would seek atree and a collection of new genomes (for internal nodes) such that this tree, leaf-labelled by the given set of genomes and with internal nodes labelled by the newgenomes, minimizes the breakpoint length of the tree—this problem is called thebreakpoint phylogeny [74]. Similarly we can define the inversion phylogeny byreplacing breakpoint distances with inversion distances. Both problems are NP-hard—indeed, they are harder than maximum parsimony for DNA sequences, asthey remain NP-hard even for just three genomes [11, 68]. This last version isknown as the median problem: given three genomes, find a fourth genome (to la-bel the internal node connecting the three leaves) that minimizes the sum of itsdistances to the three given genomes. Exact solutions to the problem of findinga median of three genomes can be obtained for both the inversion and breakpointdistances [12, 52, 77] and are implemented in our GRAPPA software suite [2].However, they take time exponential in the overall length and are thus applicableonly to instances with modest pairwise distances. It should also be noted that all of

16

Page 17: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

these approaches are limited to unichromosomal genomes with equal gene contentand no duplication.

Solving the parsimony problem for gene-order data requires the inference of“ancestral” genomes at the internal nodes of the candidate trees. Sankoff proposedto use the median problem in an iterative manner to refine rough initial guessesfor these genomes [74], an approach that we implemented in GRAPPA with goodresults to date on small genomes of a few hundred genes (see [49, 56, 83, 84, 85])and that was also implemented, with less accurate heuristics, to handle a few largergenomes by Pevzner’s group [8]. Identifying good candidate trees, however, is amuch more expensive proposition; in that same paper, Sankoff proposed gener-ating and scoring each possible tree in turn; the resulting BPAnalysis softwareis limited to breakpoint medians and to trees of 8 or fewer leaves. We reimple-mented Sankoff’s algorithm and optimized its components to improve its speed,gradually, by 7 orders of magnitude [49, 52, 56, 77], enabling the analysis of up to16 taxa; more recently, we successfully coupled it with the Disk-Covering Method(the “DCM1” technique [30]) to make it applicable to large datasets of severalthousand taxa [84].

Evidence to date from simulation studies as well as from the analysis of biolog-ical datasets indicates that even when the mechanism of evolution is based entirelyon transpositions, solving the inversion phylogeny yields more accurate reconstruc-tions than solving the breakpoint phylogeny—see, e.g., [51, 85]. Solving the in-version phylogeny was also found to yield better results than using distance-basedmethods. Figure 6 [85] shows one example of such findings, on a small datasetof 7 chloroplast genomes from green plants. Note that the phylogeny producedby NJ (using inversion distances after equalizing the gene contents) has false posi-tives (edges in the inferred tree that are not present in the reference tree), while thebreakpoint phylogeny (computed on the same input) resolves only one single edge.In contrast, the inversion phylogeny matches the reference phylogeny, modulo theplacement of Mesostigma (which remains in doubt).

A recent study [37] indicates that, at least in prokaryotic genomes, short in-versions are much more likely than long ones; work on developing new models ofevolution that take such results into account is in progress. Applying these meth-ods to large eukaryotic genomes may yield some surprises, however, since thereis evidence that certain breakpoints in chromosomes are “hot spots” for rearrange-ments [69]; if an inversion is thus “anchored” at a fixed position in the chromo-some, its net effect over several events is to produce one breakpoint per event,which might be better modelled with breakpoints than with inversions.

To date, our GRAPPA code remains the only parsimony-based reconstructiontool that can handle datasets with unequal gene content, albeit with only modestchanges in gene content. In its current version [85], it has been used to recon-

17

Page 18: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

(a) reference phylogeny (b) inversion phylogeny

(c) neighbor-joining tree (d) breakpoint phylogeny

Figure 6: Phylogenies on a 7-taxa cpDNA dataset

struct chloroplast phylogenies of green plants in which gene content differs by afew genes at most. The algorithm proceeds in two phases: first it computes genecontents for all internal nodes, then it proceeds (much as in the equal gene-contentcase) to establish an ordering for these contents based on median computations.Gene content is determined from the leaves inward, under standard biological as-sumptions such as independence of branches; under such assumptions, the likeli-hood of simultaneous identical changes on two sibling edges is vanishingly smallcompared to the reverse change on the third edge (see, e.g., [43, 46]).

4.3 Likelihood-based methods

Likelihood methods are based on a specific model of evolution. For a given treeT , they estimate the parameter values that maximize the probability that T wouldproduce the observed data; over all trees, they return that tree (and its associatedparameter values) with the largest probability of producing the observed data. Be-cause current statistical models for whole-genome evolution are so primitive (but

18

Page 19: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

also because any such models will involve enormous complications due to theglobal nature of changes caused by a single event), no method yet exists to computemaximum-likelihood trees for gene-order data. The simpler Bayesian approachturns the tables: instead of seeking the distribution for the probability of producingthe data given the tree, it seeks the probability of each tree given the data. Thetwo approaches are similar in that they both rely on a model of evolution, but theBayesian approach need not estimate all parameter values in order to produce high-scoring trees. In its standard implementation using Markov chain Monte-Carlo(MCMC) methods, the Bayesian approach uses a biased random walk through thespace of trees and so depends mostly on a good choice of possible moves withintree space (and on the efficient computation of the effect of these moves on thecurrent state and on the probability estimates). A preliminary implementation ofsuch a method (for equal gene content) has yielded some promising results [36].

5 Open Problems and Future Research

Nearly everything remains to be done! The work to date has convincingly demon-strated that gene-content and gene-order data can form the basis for highly accuratephylogenetic analyses; the demonstration is all the more impressive given the prim-itive state of knowledge in the area. We cannot give a detailed list of interestingopen problems, as it would be far too long, but content ourselves with a short listof what, from today’s perspective, seem to be the most promising or importantavenues of exploration, from both computational and modelling perspectives.

• Solving the transposition distance problem.

• Handling transpositions along with inversions, preferably in a weighted frame-work.

• Adding length and location dependencies to the rearrangement framework.

• Formulating and providing reasonable approaches to solving the medianproblem in the above contexts.

• Developing a formal statistical model of evolution that includes all rear-rangements discussed here, takes into account location within the chromo-somes and length of affected segments, and obeys basic biological con-straints (such as the need for telomeres and the presence of a single cen-tromere).

• Designing a Bayesian approach to reconstruction within the framework justsketched.

19

Page 20: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

• Combining DNA sequence data and rearrangement data—the sequence datamay be used to rule out or favor certain rearrangements.

• Using rearrangement data in the context of network reconstruction (i.e., inthe presence of past hybridizations, gene conversions, or lateral transfers).

The reader interested in more detail in the topics presented here and in some of theresearch problems just suggested should consult the survey articles of Wang andWarnow [90] on distance corrections and distance-based methods and of Moret,Tang, and Warnow [53] on parsimony-based methods.

6 Acknowledgments

Research on this topic at the University of New Mexico is supported by the Na-tional Science Foundation under grants ANI 02-03584, EF 03-31654, IIS 01-13095,IIS 01-21377, and DEB 01-20709 (through a subcontract to the U. of Texas) andby the NIH under grant 2R01GM056120-05A1 (through a subcontract to the U. ofArizona); research on this topic at the University of Texas is supported by the Na-tional Science Foundation under grants EF 03-31453, IIS 01-13654, IIS 01-21680,and DEB 01-20709, by the Program for Evolutionary Dynamics at Harvard, by theInstitute for Cellular and Molecular Biology at the University of Texas at Austin,and by the David and Lucile Packard Foundation.

References[1] K. Atteson. The performance of the neighbor-joining methods of phylogenetic re-

construction. Algorithmica, 25(2/3):251–278, 1999.

[2] D.A. Bader, B.M.E. Moret, T. Warnow, S.K. Wyman, and M. Yan. GRAPPA(Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algo-rithms). www.cs.unm.edu/∼moret/GRAPPA/.

[3] D.A. Bader, B.M.E. Moret, and M. Yan. A linear-time algorithm for computinginversion distance between signed permutations with an experimental study. J. Com-put. Biol., 8(5):483–491, 2001. A preliminary version appeared in WADS’01, pp.365–376.

[4] V. Bafna and P.A. Pevzner. Sorting permutations by transpositions. In Proc. 6th Ann.ACM/SIAM Symp. Discrete Algs. (SODA’95), pages 614–623. SIAM Press, Philadel-phia, 1995.

[5] M.A. Bender, D. Ge, S. He, H. Hu, R.Y. Pinter, S. Skiena, and F. Swidan. Improvedbounds on sorting with length-weighted reversals. In Proc. 15th Ann. ACM/SIAMSymp. Discrete Algs. (SODA’04), pages 912–921. SIAM Press, Philadelphia, 2004.

20

Page 21: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[6] J.L. Boore and W.M. Brown. Big trees from little genomes: Mitochondrial geneorder as a phylogenetic tool. Curr. Opinion Genet. Dev., 8(6):668–674, 1998.

[7] J.L. Boore, T. Collins, D. Stanton, L. Daehler, and W.M. Brown. Deducing the patternof arthropod phylogeny from mitochondrial DNA rearrangements. Nature, 376:163–165, 1995.

[8] G. Bourque and P. Pevzner. Genome-scale evolution: reconstructing gene orders inthe ancestral species. Genome Research, 12:26–36, 2002.

[9] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: Alikelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol.Evol., 17(1):189–197, 2000.

[10] D. Bryant. The complexity of calculating exemplar distances. In D. Sankoff andJ. Nadeau, editors, Comparative Genomics: Empirical and Analytical Approaches toGene Order Dynamics, Map Alignment, and the Evolution of Gene Families, pages207–212. Kluwer Academic Pubs., Dordrecht, Netherlands, 2000.

[11] A. Caprara. Formulations and hardness of multiple sorting by reversals. In Proc. 3rdAnn. Int’l Conf. Comput. Mol. Biol. (RECOMB’99), pages 84–93. ACM Press, NewYork, 1999.

[12] A. Caprara. On the practical solution of the reversal median problem. In Proc. 1stInt’l Workshop Algs. in Bioinformatics (WABI’01), volume 2149 of Lecture Notes inComputer Science, pages 238–251. Springer Verlag, 2001.

[13] M.E. Cosner, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, L. Wang, T. Warnow, andS.K. Wyman. An empirical comparison of phylogenetic methods on chloroplast geneorder data in Campanulaceae. In D. Sankoff and J.H. Nadeau, editors, ComparativeGenomics, pages 99–122. Kluwer Academic Publishers, 2000.

[14] M.E. Cosner, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, L. Wang, T. Warnow,and S.K. Wyman. A new fast heuristic for computing the breakpoint phylogeny andexperimental phylogenetic analyses of real and synthetic data. In Proc. 8th Int’l Conf.on Intelligent Systems for Mol. Biol. (ISMB’00), pages 104–115, 2000.

[15] W.H.E. Day. Computationally difficult parsimony problems in phylogenetic system-atics. J. Theoretical Biology, 103:429–438, 1983.

[16] S.R. Downie and J.D. Palmer. Use of chloroplast DNA rearrangements in reconstruct-ing plant phylogeny. In P. Soltis, D. Soltis, and J.J. Doyle, editors, Plant MolecularSystematics, pages 14–35. Chapman and Hall, 1992.

[17] J. Earnest-DeYoung, E. Lerat, and B.M.E. Moret. Reversing gene erosion: recon-structing ancestral bacterial genomes from gene-content and gene-order data. In Proc.4th Int’l Workshop Algs. in Bioinformatics (WABI’04), Lecture Notes in ComputerScience. Springer Verlag, 2004.

21

Page 22: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[18] N. El-Mabrouk. Genome rearrangement by reversals and insertions/deletions of con-tiguous segments. In Proc. 11th Ann. Symp. Combin. Pattern Matching (CPM’00),volume 1848 of Lecture Notes in Computer Science, pages 222–234. Springer Verlag,2000.

[19] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihoodapproach. J. Mol. Evol., 17:368–376, 1981.

[20] J. Felsenstein. Phylogenetic Inference Package (PHYLIP), Version 3.5. University ofWashington, Seattle, 1993.

[21] W. M. Fitch. On the problem of discovering the most parsimonious tree. AmericanNaturalist, 111:223–257, 1977.

[22] O. Gascuel. BIONJ: An improved version of the NJ algorithm based on a simplemodel of sequence data. Mol. Biol. Evol., 14(7):685–695, 1997.

[23] P. Goloboff. Analyzing large datasets in reasonable times: solutions for compositeoptima. Cladistics, 15:415–428, 1999.

[24] S. Guindon and O. Gascuel. PHYML—a simple, fast, and accurate algorithm toestimate large phylogenies by maximum likelihood. Syst. Biol., 52(5):696–704, 2003.

[25] S. Hannenhalli and P.A. Pevzner. Transforming cabbage into turnip (polynomialalgorithm for sorting signed permutations by reversals). In Proc. 27th Ann. ACMSymp. Theory of Comput. (STOC’95), pages 178–189. ACM Press, New York, 1995.

[26] S. Hannenhalli and P.A. Pevzner. Transforming mice into men (polynomial algorithmfor genomic distance problems). In Proc. 36th Ann. IEEE Symp. Foundations ofComput. Sci. (FOCS’95), pages 581–592. IEEE Press, Piscataway, NJ, 1995.

[27] T. Hartman. A simpler 1.5-approximation algorithm for sorting by transpositions. InProc. 14th Ann. Symp. Combin. Pattern Matching (CPM’03), volume 2676 of LectureNotes in Computer Science, pages 156–169. Springer Verlag, 2003.

[28] T. Hartman and R. Sharan. A 1.5-approximation algorithm for sorting by trans-positions and transreversals. In Proc. 4th Int’l Workshop Algs. in Bioinformatics(WABI’04), Lecture Notes in Computer Science. Springer Verlag, 2004.

[29] S.B. Heard. Patterns in phylogenetic tree balance with variable and evolving specia-tion rates. Evol., 50:2141–2148, 1996.

[30] D. Huson, S. Nettles, and T. Warnow. Disk-covering, a fast converging method forphylogenetic tree reconstruction. J. Comput. Biol., 6(3):369–386, 1999.

[31] D. Huson, K.A. Smith, and T. Warnow. Correcting large distances for phylogeneticreconstruction. In Proc. 3rd Int’l Workshop Alg. Engineering (WAE’99), volume 1668of Lecture Notes in Computer Science, pages 273–286. Springer Verlag, 1999.

[32] R.K. Jansen and J.D. Palmer. A chloroplast DNA inversion marks an ancient evo-lutionary split in the sunflower family (Asteraceae). Proc. Nat’l Acad. Sci., USA,84:5818–5822, 1987.

22

Page 23: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[33] H. Kaplan, R. Shamir, and R.E. Tarjan. Faster and simpler algorithm for sortingsigned permutations by reversals. SIAM J. Computing, 29(3):880–892, 1999.

[34] J. Kim and T. Warnow. Phylogenetic tree estimation. In Proc. 7th Int’l Conf. onIntelligent Systems for Mol. Biol. (ISMB’99), 1999. Tutorial.

[35] S. Kumar, K. Tamura, I. B. Jakobsen, and M. Nei. MEGA2: Molecular evolutionarygenetics analysis software. Bioinformatics, 17(12):1244–1245, 2001.

[36] B. Larget, D.L. Simon, and J.B. Kadane. Bayesian phylogenetic inference from ani-mal mitochondrial genome arrangements. J. Royal Stat. Soc. B, 64(4):681–694, 2002.

[37] J.-F. Lefebvre, N. El-Mabrouk, E.R.M. Tillier, and D. Sankoff. Detection and val-idation of single gene inversions. In Proc. 11th Int’l Conf. on Intelligent Systemsfor Mol. Biol. (ISMB’03), volume 19 of Bioinformatics, pages i190–i196. Oxford U.Press, 2003.

[38] G. Li, X. Qi, X. Wang, and B. Zhu. A linear-time algorithm for computing translo-cation distance between signed genomes. In Proc. 15th Ann. Symp. Combin. PatternMatching (CPM’04), volume 3109 of Lecture Notes in Computer Science. SpringerVerlag, 2004.

[39] W.-H. Li. Molecular Evolution. Sinauer Assoc., 1997.

[40] C.R. Linder, B.M.E. Moret, L. Nakhleh, and T. Warnow. Network (reticulated) evo-lution: Biology, models, and algorithms. In Proc. 9th Pacific Symp. on Biocomputing(PSB’04), 2004. Tutorial.

[41] T. Liu, B.M.E. Moret, and D.A. Bader. An exact, linear-time algorithm for computinggenomic distances under inversions and deletions. Technical Report TR-CS-2003-31,Univ. of New Mexico, 2003.

[42] B. Ma, M. Li, and L. Zhang. On reconstructing species trees from gene trees interms of duplications and losses. In Proc. 2nd Ann. Int’l Conf. Comput. Mol. Biol.(RECOMB’98), 1998.

[43] W.P. Maddison. A method for testing the correlated evolution of two binary char-acters: are gains or losses concentrated on certain branches of a phylogenetic tree?Evol., 44:539–557, 1990.

[44] W.P. Maddison. Gene trees in species trees. Syst. Biol., 46(3):523–536, 1997.

[45] M. Marron, K.M. Swenson, and B.M.E. Moret. Genomic distances under dele-tions and insertions. In Proc. 9th Int’l Conf. Computing and Combinatorics (CO-COON’03), volume 2697 of Lecture Notes in Computer Science, pages 537–547.Springer Verlag, 2003.

[46] A. McLysaght, P.F. Baldi, and B.S. Gaut. Extensive gene gain associated with adap-tive evolution of poxviruses. Proc. Nat’l Acad. Sci., USA, 100:15655–15660, 2003.

[47] J. Meidanis, M.E.M.T. Walter, and Z. Dias. Reversal distance of signed circularchromosomes. unpublished, 2000.

23

Page 24: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[48] A.O. Mooers and S.B. Heard. Inferring evolutionary process from phylogenetic treeshape. Quarterly Rev. Biol., 72:31–54, 1997.

[49] B.M.E. Moret, D.A. Bader, and T. Warnow. High-performance algorithm engineeringfor computational phylogenetics. J. Supercomputing, 22:99–111, 2002.

[50] B.M.E. Moret, U. Roshan, and T. Warnow. Sequence length requirements for phy-logenetic methods. In R. Guigo and D. Gusfield, editors, Proc. 2nd Int’l WorkshopAlgs. in Bioinformatics (WABI’02), volume 2452 of Lecture Notes in Computer Sci-ence, pages 343–356. Springer Verlag, 2002.

[51] B.M.E. Moret, A.C. Siepel, J. Tang, and T. Liu. Inversion medians outperform break-point medians in phylogeny reconstruction from gene-order data. In R. Guigo andD. Gusfield, editors, Proc. 2nd Int’l Workshop Algs. in Bioinformatics (WABI’02),volume 2452 of Lecture Notes in Computer Science, pages 521–536. Springer Ver-lag, 2002.

[52] B.M.E. Moret, J. Tang, L.-S. Wang, and T. Warnow. Steps toward accurate recon-structions of phylogenies from gene-order data. J. Comput. Syst. Sci., 65(3):508–525,2002.

[53] B.M.E. Moret, J. Tang, and T. Warnow. Reconstructing phylogenies from gene-content and gene-order data. In O. Gascuel, editor, Mathematics of Evolution andPhylogeny. Oxford University Press, 2004.

[54] B.M.E. Moret, L.-S. Wang, and T. Warnow. New software for computational phylo-genetics. IEEE Computer, 35(7):55–64, 2002.

[55] B.M.E. Moret, L.-S. Wang, T. Warnow, and S.K. Wyman. New approaches for re-constructing phylogenies from gene-order data. In Proc. 9th Int’l Conf. on IntelligentSystems for Mol. Biol. (ISMB’01), volume 17 of Bioinformatics, pages S165–S173,2001.

[56] B.M.E. Moret, S.K. Wyman, D.A. Bader, T. Warnow, and M. Yan. A new imple-mentation and detailed study of breakpoint analysis. In Proc. 6th Pacific Symp. onBiocomputing (PSB’01), pages 583–594. World Scientific Pub., 2001.

[57] J.H. Nadeau and B.A. Taylor. Lengths of chromosome segments conserved sincedivergence of man and mouse. Proc. Nat’l Acad. Sci., USA, 81:814–818, 1984.

[58] L. Nakhleh, B.M.E. Moret, U. Roshan, K. St. John, and T. Warnow. The accuracy offast phylogenetic methods for large datasets. In Proc. 7th Pacific Symp. on Biocom-puting (PSB’02), pages 211–222. World Scientific Pub., 2002.

[59] L. Nakhleh, U. Roshan, K. St. John, J. Sun, and T. Warnow. Designing fast converg-ing phylogenetic methods. In Proc. 9th Int’l Conf. on Intelligent Systems for Mol.Biol. (ISMB’01), volume 17 of Bioinformatics, pages S190–S198. Oxford U. Press,2001.

24

Page 25: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[60] L. Nakhleh, U. Roshan, K. St. John, J. Sun, and T. Warnow. The performance ofphylogenetic methods on trees of bounded diameter. In Proc. 1st Int’l Workshop Algs.in Bioinformatics (WABI’01), volume 2149 of Lecture Notes in Computer Science,pages 214–226. Springer Verlag, 2001.

[61] G.J. Olsen, H. Matsuda, R. Hagstrom, and R. Overbeek. FastDNAml: A tool forconstruction of phylogenetic trees of DNA sequences using maximum likelihood.Computations in Applied Biosciences, 10(1):41–48, 1994.

[62] R. Page. GeneTree: comparing gene and species phylogenies using reconciled trees.Bioinf., 14(9):819–820, 1998.

[63] R. Page and M. Charleston. From gene to organismal phylogeny: reconciled treesand the gene tree/species tree problem. Mol. Phyl. Evol., 7:231–240, 1997.

[64] R. Page and M. Charleston. Reconciled trees and incongruent gene and species trees.In B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzehtsky, editors, Mathematicalhierarchies in biology, volume 37. American Math. Soc., 1997.

[65] R.D.M. Page and M.A. Charleston. From gene to organismal phylogeny: Reconciledtrees and the gene tree/species tree problem. Mol. Phyl. Evol., 7:231–240, 1997.

[66] J.D. Palmer. Chloroplast and mitochondrial genome evolution in land plants. InR. Herrmann, editor, Cell Organelles, pages 99–133. Springer Verlag, 1992.

[67] P. Pamilo and M. Nei. Relationship between gene trees and species trees. Mol. Biol.Evol., 5:568–583, 1998.

[68] I. Pe’er and R. Shamir. The median problems for breakpoints are NP-complete. Elec.Colloq. on Comput. Complexity, 71, 1998.

[69] P. Pevzner and G. Tesler. Human and mouse genomic sequences reveal exten-sive breakpoint reuse in mammalian evolution. Proc. Nat’l Acad. Sci., USA,100(13):7672–7677, 2003.

[70] A. Rokas and P. W. H. Holland. Rare genomic changes as a tool for phylogenetics.Trends in Ecol. and Evol., 15:454–459, 2000.

[71] U. Roshan, B.M.E. Moret, T.L. Williams, and T. Warnow. Rec-I-DCM3: A fastalgorithmic technique for reconstructing large phylogenetic trees. In Proc. 3rd IEEEComputational Systems Bioinformatics Conf. CSB’04. IEEE Press, Piscataway, NJ,2004. to appear.

[72] N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstruct-ing phylogenetic trees. Mol. Biol. Evol., 4:406–425, 1987.

[73] D. Sankoff. Genome rearrangement with gene families. Bioinformatics, 15(11):990–917, 1999.

[74] D. Sankoff and M. Blanchette. Multiple genome rearrangement and breakpoint phy-logeny. J. Comput. Biol., 5:555–570, 1998.

25

Page 26: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[75] D. Sankoff, V. Ferretti, and J.H. Nadeau. Conserved segment identification. J. Com-put. Biol., 4(4):559–565, 1997.

[76] D. Sankoff and J.H. Nadeau. Conserved synteny as a measure of genomic distance.Disc. Appl. Math., 71(1–3):247–257, 1996.

[77] A.C. Siepel and B.M.E. Moret. Finding an optimal inversion median: experimentalresults. In Proc. 1st Int’l Workshop Algs. in Bioinformatics (WABI’01), volume 2149of Lecture Notes in Computer Science, pages 189–203. Springer Verlag, 2001.

[78] M. A. Steel. The maximum likelihood point for a phylogenetic tree is not unique.Syst. Biol., 43(4):560–564, 1994.

[79] D.B. Stein, D.S. Conant, M.E. Ahearn, E.T. Jordan, S.A. Kirch, M. Hasebe, K. Iwat-suki, M.K. Tan, and J.A. Thomson. Structural rearrangements of the chloroplastgenome provide an important phylogenetic link in ferns. Proc. Nat’l Acad. Sci.,USA, 89:1856–1860, 1992.

[80] K.M. Swenson, M. Marron, J.V. Earnest-DeYoung, and B.M.E. Moret. Approximat-ing the true evolutionary distance between two genomes. Technical Report TR-CS-2004-15, Univ. of New Mexico, 2004.

[81] D.L. Swofford. PAUP*: Phylogenetic analysis using parsimony (*and other meth-ods), version 4.0b8. Sunderland, MA, 2001.

[82] D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis. Phylogenetic inference.In D. M. Hillis, B. K. Mable, and C. Moritz, editors, Molecular Systematics, pages407–514. Sinauer Assoc., 1996.

[83] J. Tang and B.M.E. Moret. Phylogenetic reconstruction from gene rearrangementdata with unequal gene contents. In Proc. 8th Int’l. Workshop on Algs. and DataStructures (WADS’03), volume 2748 of Lecture Notes in Computer Science, pages37–46. Springer Verlag, 2003.

[84] J. Tang and B.M.E. Moret. Scaling up accurate phylogenetic reconstruction fromgene-order data. In Proc. 11th Int’l Conf. on Intelligent Systems for Mol. Biol.(ISMB’03), volume 19 of Bioinformatics, pages i305–i312. Oxford U. Press, 2003.

[85] J. Tang, B.M.E. Moret, L. Cui, and C.W. dePamphilis. Phylogenetic reconstructionfrom arbitrary gene-order data. In Proc. 4th IEEE Symp. on Bioinformatics and Bio-engineering BIBE’04, pages 592–599. IEEE Press, Piscataway, NJ, 2004.

[86] E. Tannier and M. Sagot. Sorting by reversals in subquadratic time. In Proc. 15thAnn. Symp. Combin. Pattern Matching (CPM’04), volume 3109 of Lecture Notes inComputer Science. Springer Verlag, 2004.

[87] G. Tesler. Efficient algorithms for multichromosomal genome rearrangements. J.Comput. Syst. Sci., 65(3):587–609, 2002.

[88] L.-S. Wang, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, and T. Warnow. Fast phylo-genetic methods for genome rearrangement evolution: An empirical study. In Proc.7th Pacific Symp. on Biocomputing (PSB’02), pages 524–535. World Scientific Pub.,2002.

26

Page 27: Advances in Phylogeny Reconstruction from Gene Order and …treport/tr/04-07/advances.pdf · 2005-08-17 · Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard

[89] L.-S. Wang and T. Warnow. Estimating true evolutionary distances between genomes.In Proc. 33rd Ann. ACM Symp. Theory of Comput. (STOC’01), pages 637–646. ACMPress, New York, 2001.

[90] L.-S. Wang and T. Warnow. Distance-based genome rearrangement phylogeny. InO. Gascuel, editor, Mathematics of Evolution and Phylogeny. Oxford UniversityPress, 2004.

27