GIGA: a simple, efficient algorithm for gene tree inference in the genomic age

Thomas BMC Bioinformatics 2010, 11:312
http://www.biomedcentral.com/1471-2105/11/312

METHODOLOGY ARTICLE
Open Access

Paul D Thomas 1,2

Abstract

Background: Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost.

Results: We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process.

Conclusions: GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors. We compared trees produced by GIGA to those in the TreeFam database, and they were very similar in general, with most differences likely due to poor alignment quality. However, some remaining differences are algorithmic, and can be explained by the fact that GIGA tends to put a larger emphasis on minimizing gene duplication and deletion events.

* Correspondence: [email protected]
1 Evolutionary Systems Biology Group, SRI International, Menlo Park, CA USA
† Contributed equally
Full list of author information is available at the end of the article

© 2010 Thomas; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Phylogenetic inference algorithms have a very long history [1]. The earliest algorithms used information about macroscopic phenotypic "characters" to determine the evolutionary relationships between species. So it was natural that as soon as genetic (DNA) or genetically encoded (protein) sequences became available, these were treated as "molecular characters" that could be used, essentially in an identical manner to phenotypic characters, to elucidate species relationships. Out of this character evolution paradigm were developed techniques such as the maximum parsimony algorithm [2], and various approximate methods that aim toward parsimony such as neighbor-joining (NJ) [3], as well as methods that assume constant "molecular clock"-like behavior such as the unweighted pair group method with arithmetic mean (UPGMA) [4]. More recently, a different paradigm has developed specifically for molecular sequences. This sequence evolution paradigm is exemplified by maximum likelihood (ML) [5] and Bayesian [6] methods, which use an explicit model of how molecular sequences change over time. Different possible evolutionary histories ("alternative hypotheses") are distinguished by their relative likelihood under a particular model of sequence evolution. This paradigm has also led to the use of "corrected" distances calculated using a sequence evolution model, as an input into distance-based methods such as NJ and UPGMA. Many of the important recent developments in phylogenetic inference have involved constructing ever more realistic models of sequence evolution. The increased accuracy has a price, though, both in the computational power required and in the complexity of downstream analysis, such as interpreting the resulting inferences and comparing alternative hypotheses from different models or parameter sets.

In the genomic age, knowledge of a "representative genome" for many different species provides the opportunity to consider yet another paradigm, which we dub the genome evolution paradigm. Recently, several approaches have been developed that make use of genomic information in the construction of gene trees. One common method is species tree reconciliation [7,8], which takes a gene tree (typically estimated using NJ, a character evolution paradigm method) and then prunes and rearranges branches (typically those with weaker statistical support) to reduce the number of implied gene duplications and losses given a known species tree. The soft parsimony algorithm [9] extends tree reconciliation to minimize duplications and losses given an uncertain species tree (containing "soft" polytomies or multifurcating nodes). The SPIDIR algorithm [10] extends the sequence evolution paradigm by learning lineage-specific rate parameters for phylogenetic reconstruction over a large number of orthologous gene trees simultaneously. The SYNERGY algorithm [11] constructs a gene tree by using a known species tree to specify the sequence of iterative steps--bottom-up from leaves to root--of building and rooting NJ trees. In addition to sequence dissimilarity, the distance used in the NJ step includes an empirical term to capture synteny, or shared genomic context, which provides long-range (extending over multiple genes) genomic sequence evidence of common descent. Synteny has been used in a number of gene tree inference algorithms [12], and results from the inheritance of contiguous sequence regions that include more than one product-encoding gene. Existing algorithms within the genome evolution paradigm have shown that including this additional information generally improves gene tree inference, but they are algorithmically quite complex and computationally expensive. We set out to ask the question: given the constraints that can be derived from knowledge of whole genomes, how simple can we make a gene tree inference algorithm? What is a minimal set of principles underlying the evolution of gene families needed to reliably reconstruct gene histories?

Note that the inference of gene trees in the genome evolution paradigm builds upon either the character or sequence evolution paradigms--as described above, sequence data retain a primary role in all such algorithms proposed to date. The difference is that, in the genome evolution paradigm, we can make use of information from whole genomes in addition to that which can be derived from information inherent in each gene itself. In GIGA, we make use of two additional sources of information, which are applicable for even very distant relationships (unlike synteny, which is not observed, for example, between the most distant animal lineages). First, in the genomic era, we have more accurate knowledge of the "true" species tree (insofar as the tree model holds, see discussion below). Whole genome sequences have provided important information for resolving many species relationships that were difficult to determine from physical characters, or from sequences of individual genes. Second, the genome sequence provides nearly complete knowledge of the genes in the genome (for protein-coding genes at least, given the current state of gene prediction). This is critically important for distinguishing between alternative hypotheses for gene trees, and for locating gene duplication events relative to speciation events. However, it is also important to acknowledge the limitations of the genome evolution paradigm for gene trees. Despite much rhetoric to the contrary, these are still early days in the genomic age. Gene predictions are not of uniformly high quality [13,14], and any inference algorithm must take steps to minimize errors arising from low-quality predictions.

In summary, our approach is to infer phylogenetic trees of gene families using 1) a "known" species tree, 2) knowledge of all recognizable members of a given family in each genome, and 3) identification of potentially problematic gene predictions, together with 4) some knowledge derived from the molecular sequences. Our hypothesis was that the genomic constraints and detection of potentially problematic sequences might allow the use of an extremely rudimentary representation of the sequences themselves. If so, we could develop a simple algorithm that would be straightforward to interpret and to improve in the future. We call our algorithm GIGA (Gene tree Inference in the Genomic Age), and it differs from existing genome evolution paradigm methods in that it does not, at any point, construct a tree using existing character or sequence evolution methods such as NJ or ML. Like the very simple, efficient UPGMA method, it builds up a gene tree iteratively using a pairwise sequence distance matrix. However, in stark contrast to UPGMA, the final tree topology from GIGA does not simply reflect the order in which sequence pairs are joined during the algorithm. Rather, the algorithm uses simple rules based on knowledge of evolutionary processes, for inferring the tree topology from the order of pairwise operations. In the following section, we describe these rules and the theoretical and empirical considerations underlying them, in the context of a novel conception of gene trees as being composed of orthologous subtrees (OS's) joined together by founding copying events (FCEs) such as gene duplication and horizontal transfer. We then describe the GIGA algorithm in detail. Finally, we assess the algorithm's performance by comparing to a comprehensive set of more than 14,000 phylogenetic trees from the TreeFam database.

Results and Discussion

Rationale behind the GIGA algorithm

In the spirit of making our initial algorithm as simple as possible, we designed a "greedy" algorithm that constructs a tree guided by the sequence distance matrix but additionally applying rules that aim toward a parsimony-like criterion minimizing the number of gene duplication and deletion events [15]. The algorithm iteratively joins together subtrees of sequences, beginning with the two sequences that are closest according to the distance matrix. The topology of the joined subtree after each iteration is not simply an agglomeration of the constituent subtrees; rather, rules are used to "rearrange" the joined subtree at each iteration, in accordance with additional (genomic) knowledge. In essence, at each stage of the agglomeration process, GIGA interprets the tree in terms of the evolutionary events (speciation and gene duplication) that most likely generated that tree.

We found that, somewhat surprisingly, we needed only a very rudimentary description of sequence distances to build accurate tree topologies. Our initial implementation uses simply the relative pairwise sequence difference: (number of different amino acids at homologous sites)/(total number of homologous sites). Furthermore, unlike other distance-matrix-based methods, our algorithm does not update distances after each step, but uses the "raw" sequence distances throughout (in effect, the distance between two groups is the minimum distance over all inter-group sequence pairs). These additional simplifications are possible because the rules described below strongly constrain the inferred evolutionary history; the sequence distances are required only to represent very approximately any actual sequence divergence from a common ancestor.
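To make the distance definition concrete, here is a minimal Python sketch, not the author's implementation. It assumes that "homologous sites" means alignment columns in which neither sequence has a gap, that the sequences come from one multiple alignment, and that the names `p_distance` and `join_order` and the toy alignment are purely illustrative. Sorting the raw pairwise distances once and joining groups in that order is equivalent to using the minimum inter-group distance at every step.

```python
from itertools import combinations

GAP = "-"

def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    sites = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not sites:
        return 1.0  # assumption: no shared sites is treated as maximally distant
    return sum(x != y for x, y in sites) / len(sites)

def join_order(aln):
    """Return the sequence of group joins implied by the raw (never updated) distances."""
    names = sorted(aln)
    dist = {(a, b): p_distance(aln[a], aln[b]) for a, b in combinations(names, 2)}
    group = {n: frozenset([n]) for n in names}
    steps = []
    for (a, b), d in sorted(dist.items(), key=lambda kv: kv[1]):
        ga, gb = group[a], group[b]
        if ga == gb:
            continue  # both sequences already sit in the same group
        merged = ga | gb
        for n in merged:
            group[n] = merged
        steps.append((ga, gb, d))
    return steps

# Toy alignment (hypothetical sequences, for illustration only).
aln = {"A": "MKTAYIAK", "B": "MKTAYLAK", "C": "MR-AYLGK"}
steps = join_order(aln)
```

In this toy case A and B are joined first, and C then joins the {A, B} group at the smaller of its two distances to A and B. What GIGA does with each join (merge, or record a duplication) is governed by the rules described in the following sections, which this sketch does not attempt to cover.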

Orthologous subtrees and gene trees

We first describe a novel conception of gene trees, which will simplify the explanation of our rules for determining the tree topology from the pairwise sequence distance-determined order of operations. In this conception, a gene tree is composed of "orthologous subtrees", i.e., containing sequences related by speciation events. Each OS contains at most one gene from each organism, and every sequence in the subtree is orthologous to the others. Distinct orthologous subtrees are joined together to produce a gene tree, via events that involve the copying (or transfer) of genetic material to create a new locus within an ancestral genome. When the copying is from the same genome (e.g., tandem gene duplication or whole genome duplication), the joining event is a gene duplication event; when the copying is from another genome, the joining event is a horizontal transfer event. In our representation, when a copying event occurs, one copy of the gene "remains" in the same OS as its ancestors, while the other copy "founds" a second OS. Each OS, then, has a "founding copying event", though at the root of the tree this event is unresolved. Thus, each OS is defined by 1) a relationship to the OS that contains the other duplicated copy ("sibling"), and 2) a date of the FCE, relative to speciation events in the sibling OS's.

If there has been at least one copying event, there is more than one way to decompose a phylogenetic tree into OS's, depending on which copy is chosen to remain in its ancestral subtree, and which is chosen to found a new subtree. Each copying event can be decomposed in n possible ways, where n is the number of descendants of the copying event, so n = 2 for a bifurcating event. Thus, for a bifurcating tree with N copying events, there are 2^N possible decompositions into OS's. Figure 1 gives an example of a tree with one duplication event (orange circle) and the two possible ways in which it can be decomposed into OS's. Note that this example is designed merely to illustrate our conception that gene trees can be described in terms of OS's and the relationships between sibling OS's. The fact that the decomposition is not unique does not bear on our algorithm, as it constructs gene trees from component OS's, rather than decomposing a gene tree into constituent OS's.
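As an illustrative aside, the number of decompositions can be checked by enumeration. The sketch below is not part of GIGA (which builds trees from OS's rather than decomposing them). It assumes each copying event is represented simply as a tuple of its descendant lineages; the function name `decompositions` and the event label `dup1` are hypothetical.

```python
from itertools import product

def decompositions(copy_events):
    """Yield one dict per decomposition: event -> lineage that remains in the ancestral OS."""
    names = list(copy_events)
    for choice in product(*(copy_events[n] for n in names)):
        yield dict(zip(names, choice))

# The single duplication of Figure 1: either fungal group may "remain".
events = {"dup1": ("MET13/met9", "MET12/met11")}
for d in decompositions(events):
    print(d)
```

With one bifurcating event the loop prints two decompositions; every additional bifurcating event doubles the count.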

Rules for phylogenetic inference

With this representation we can describe rules for phylogenetic inference that can be applied during the distance-based iterative process. The first three rules describe how the species tree and genome content can be used to determine the topology of each OS (Rule 1), distinguish likely speciation from duplication events (Rule 2), and date FCEs relative to speciation events (Rule 3). The fourth rule enables initial OS's and FCEs to be revised at later steps in the process. The fifth rule attempts to minimize errors in tree reconstruction due to sequence fragments (usually due to partial gene predictions).

These rules treat only speciation and gene duplication events, i.e., vertical inheritance of genetic material (from parent to child). Less common, but still important particularly in prokaryotes, is "horizontal" gene transfer, in which DNA from a source other than a parent is integrated into the genome. In this case, the DNA being copied originates in another genome. However, it should be noted that vertical inheritance is generally treated as the null hypothesis even for bacterial genes, and horizontal inheritance is usually established by evidence that rules out vertical inheritance. We therefore focus in this paper on vertical inheritance, noting that there are already a number of methods for locating horizontal transfer events, such as incongruence with a vertical-only inheritance model [16] including comparison with ancestral sequence reconstructions [17], and extension of maximum parsimony to phylogenetic network representations [18].

Figure 1 Decomposing a tree with a duplication event into orthologous subtrees (OS's). The example shows part of the methylene tetrahydrofolate reductase (MTHFR in human) gene family. This tree can be decomposed into OS's in two different ways: 1) the fungal MET13/met9 group remains in the same OS as its ancestors, while the MET12/met11 group founds a new OS, and 2) the MET12/met11 group remains in the same OS as its ancestors, while the MET13/met9 group founds a new OS. In both cases, the two OS's are sibling groups, because they contain genes descending from the duplication event, and in both cases the FCE of the more recent OS (the one with only genes from fungi) can be dated relative to speciation events, between the opisthokont common ancestor and the fungal common ancestor in this example. Species are abbreviated with the 5-letter UniProt code: CAEEL (C. elegans, nematode worm), CHICK (G. gallus, chicken), DANRE (D. rerio, zebrafish), DICDI (D. discoideum, cellular slime mold), HUMAN (H. sapiens, human), MOUSE (M. musculus, mouse), SCHPO (S. pombe, fission yeast), YEAST (S. cerevisiae, Baker's yeast).

Rule 1: if a subtree contains only speciation events, the topology is determined by the known species tree

When an ancestral species undergoes a speciation event, it is first separated into two reproductively isolated populations. Within the tree model of gene evolution, this event produces two copies of an ancestral gene, one in each species' genome, and these two copies then proceed to diverge from each other by well-known processes of population genetics, including mutation, random drift, and natural selection. If only speciation events have occurred, and multiple speciation events do not occur within a relatively short period of time, the gene tree is expected to be congruent with the species tree. Indeed, a major application of gene tree inference has been to infer species relationships. For genes that approximately obey "molecular clock-like" behavior such as ribosomal RNA genes [19] this remains a powerful tool. However, on a genome-wide scale, the inference of species trees based on single gene trees is notorious for giving different answers for different genes. While there are evolutionary scenarios, such as incomplete lineage sorting [20-22], by which a gene tree will be genuinely incongruent with the known species tree, recent studies have concluded that observed incongruence is much more often due to problems with sequence alignment algorithms, tree inference algorithms, and paucity of data when considering only relatively short regions of contiguous sequence, rather than to actual historical causes [10,23].
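Rule 1 can be pictured as restricting the known species tree to the species present in a speciation-only subtree. The following is a minimal Python sketch under stated assumptions: the species tree is held as nested tuples of UniProt species codes, the particular tree shown is an illustrative arrangement of the species named in Figure 1, and `restrict` is a hypothetical helper rather than GIGA's own code.

```python
# Assumed species tree for the species in Figure 1 (illustrative only).
SPECIES_TREE = (
    "DICDI",
    (("SCHPO", "YEAST"),
     ("CAEEL", ("DANRE", ("CHICK", ("HUMAN", "MOUSE"))))),
)

def restrict(tree, keep):
    """Prune a nested-tuple tree to the leaves in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)

# Topology imposed on an OS with one gene each from human, zebrafish and yeast.
topology = restrict(SPECIES_TREE, {"HUMAN", "DANRE", "YEAST"})
print(topology)  # ('YEAST', ('DANRE', 'HUMAN'))
```

Because the function keeps any node with two or more surviving children, a multifurcating species tree passes through unchanged in shape, which is one simple way to accommodate unresolved speciation order.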

Particularly relevant to our approach, Rasmussen and Kellis [10] demonstrated that accounting for lineage-specific rate differences in a Bayesian evolutionary model dramatically increased the number of orthologous gene families among Drosophila species whose gene trees matched the species tree. Two of their important conclusions are that a single gene does not typically contain enough information to adequately resolve gene family relationships, and that lineage-specific differences in evolutionary rate (due to population dynamics) are a primary cause of incongruence between the species tree and gene tree. This is not to say that incomplete lineage sorting does not occur, or that a tree is always a good model for gene evolution. Incomplete lineage sorting may lead to cases where a gene tree genuinely disagrees with the known species tree, or agrees over some regions of a gene and not others (e.g., due to recombination); the reason for this disagreement is a breakdown of the gene tree model itself, which treats speciation and duplication as point events occurring to an ancestral genome, rather than as actual population-based events. Rather, these results suggest that for large-scale phylogenetic reconstruction, rate differences and inadequate information within a single gene may pose larger problems than incomplete lineage sorting. Therefore, in GIGA, we use the known species tree during the tree inference process to define a species tree constraint on the tree topology. Within the gene tree model, the problem of short speciation times can be addressed by simply allowing multifurcations in the underlying species tree. Finally, we note that even incorrect trees that assume the species tree topology is correct are useful as a null hypothesis for establishing that a more complicated evolutionary history has occurred.

Rule 2: a duplication event should be inferred only when there is genomic proof that a duplication occurred, viz. when, during the iterative process, a given subtree contains more than one gene from the same species (within-species paralogs)

As discussed under Rule 1, we do not expect phylogenetic inference to be accurate when using information from typical gene-length sequences, and we cannot then expect the agglomerative process, in general, to construct an orthologous tree in the order of the known species relationships. Therefore, when two OS's are joined at a given stage of the algorithm, if together they contain only a single gene from each genome, we merge the OS's into a single OS (Figure 2). Our simple rule assumes that the genes are in fact orthologous, but the sequence data was not adequate for recognizing this relationship. However, if the OS's together contain more than one sequence from any genome, our algorithm retains the two separate OS's, which are then joined by a gene duplication event (Figure 2B).
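The merge-or-duplicate decision of Rule 2 reduces to a species-overlap test. A minimal sketch in Python, assuming each OS is represented as a mapping from gene identifier to species code; the function name `join` and the gene labels are hypothetical, and the sketch ignores the later rules that may revise the outcome.

```python
def join(os_a, os_b):
    """Apply the Rule 2 test to two orthologous subtrees (dicts: gene id -> species).

    Returns ("duplication", (os_a, os_b)) if any species occurs in both,
    otherwise ("speciation", merged_os).
    """
    shared = set(os_a.values()) & set(os_b.values())
    if shared:
        # Within-species paralogs: keep two OS's, joined by a duplication event.
        return "duplication", (os_a, os_b)
    # At most one gene per genome overall: treat as a single orthologous subtree.
    return "speciation", {**os_a, **os_b}

# Illustrative gene labels only.
os_a = {"MTHFR_HUMAN": "HUMAN", "MET13_YEAST": "YEAST"}
os_b = {"MET12_YEAST": "YEAST", "met11_SCHPO": "SCHPO"}
os_c = {"mthfr_DANRE": "DANRE"}

print(join(os_a, os_b)[0])  # duplication
print(join(os_a, os_c)[0])  # speciation
```

The test relies on knowing which genome each sequence comes from and on each genome's gene list being nonredundant; a duplicated database entry for one gene would be misread as a paralog.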

This procedure is similar to the "witness of non-orthology" criterion used by Dessimoz et al. [24] to identify paralogous relationships in the Clusters of Orthologous Groups (COGs) database [25]: within-species paralogs (z, z') can establish paralogy between pairs of genes (x, y) in other species if the distance d(x, y) > d(x, z) and d(x, y) > d(y, z'). This criterion is justified by the improbability of overall convergent evolution in molecular sequences. Most molecular sequence evolution is selectively neutral [26], and therefore similarity between molecular sequences is due more to common ancestry than to common selective pressures driving sequence convergence. If x is more similar to z, and y is more similar to z', than x is to y, this is almost certainly due to the fact that x and z have a more recent common ancestor than x and y; and y and z' have a more recent common ancestor than x and y. In other words, the two paralogous genes in genome Z, together with pairwise sequence distances, allow us to recognize that x and y are also paralogous. Thus, this is a genomic age criterion, and can be used only if we know which genome each of the sequences came from, and if we can assume that our list of genes from each genome is nonredundant. Of course, relatively rare events, such as gene conversion or complementary deletions of paralogs in different genomes, can invalidate this assumption, and further rules could be developed to identify such cases.

Rule 3: if two subtrees of orthologous genes are related by a founding copy event, tentatively date the FCE using the gene content of the two groups and the known species tree

Given a known species tree, if there have been gene duplication events, our task is to determine where each duplication event occurred relative to the speciation events, i.e., which ancestral gene was copied, and when it was copied. In a character or sequence evolution paradigm, we must infer the location of duplications from sequence divergence. However, if evolutionary rates differ significantly for different lineages, parsimony and related approaches suffer from artifacts such as "long branch attraction," while likelihood methods typically make assumptions about the evolutionary model such as a constant relative substitution rate at each site. Yet evolutionary rate change is one of the prominent features of gene families, particularly after gene duplication [27]. After a duplication event, at least one copy is free to diverge under relaxed selective constraints and/or positive selection for a new or modified function; in a gene tree model this commonly manifests as branch- or lineage-specific accelerated evolutionary rate that differs among sites. We propose below that gene duplication events can be located using, in addition to sequence data, gene presence or absence over a particular set of genomes (genome content, used to construct "phylogenetic profiles" [28]) and knowledge of the species tree.

At the point at which OS's are inferred to be related by gene duplication (Rule 2), we have inferred the two descendant sibling lineages of the duplication. This specifies which sequences arose from a duplication. Because of two constraints--namely, the species tree, and the improbability of convergent sequence evolution--we can also make an initial hypothesis as to when the gene duplication may have occurred. Because, as described above, overall convergent sequence evolution is extremely rare, at the point in the iterative process where two OS's (inferred from shorter sequence distances) are joined by a duplication event, this duplication event very likely occurred prior to the most recent MRCA speciation event in either OS. The most recent common ancestor (MRCA) speciation event can be determined for each OS, from the species tree and the list of species with a gene in the OS. Thus, the OS must be older than its MRCA speciation event. We can therefore tentatively assign an FCE to the OS with the more recent MRCA speciation event (Figure 3, top). If both OS's have the same MRCA speciation event (as would be expected, assuming approximately molecular clock-like behavior and no gene loss), both OS's can be tentatively assigned an FCE. Note that this method of locating the FCE is reliable only if we know the full complement of genes in that family, for all the genomes under consideration; otherwise, the inferred founding ancestor for the subtree could depend on missing data rather than established absence of a gene.
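The tentative dating step can be sketched as an MRCA lookup in the species tree followed by a comparison. This is a hedged Python illustration, not the published implementation: the species tree is assumed to be stored as a child-to-parent map, the internal node names are invented labels, each OS is reduced to its set of species, and treating "more recent" as "descends from the other MRCA" is an assumption of this sketch.

```python
# Assumed child -> parent map; internal node names are illustrative labels.
PARENT = {
    "HUMAN": "mammals", "MOUSE": "mammals",
    "mammals": "amniotes", "CHICK": "amniotes",
    "amniotes": "vertebrates", "DANRE": "vertebrates",
    "vertebrates": "animals", "CAEEL": "animals",
    "YEAST": "fungi", "SCHPO": "fungi",
    "animals": "opisthokonts", "fungi": "opisthokonts",
    "opisthokonts": "root", "DICDI": "root",
}

def lineage(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def mrca(species):
    """Most recent common ancestor of a non-empty collection of species."""
    first, *rest = list(species)
    rest_lineages = [set(lineage(s)) for s in rest]
    for ancestor in lineage(first):
        if all(ancestor in lin for lin in rest_lineages):
            return ancestor
    raise ValueError("species are not all in the same tree")

def assign_fce(os_a, os_b):
    """Return the OS(s) that are tentatively assigned a founding copying event."""
    a, b = mrca(os_a), mrca(os_b)
    if a == b:
        return [os_a, os_b]      # same MRCA: both are tentatively assigned an FCE
    if b in lineage(a):
        return [os_a]            # a descends from b, so os_a has the more recent MRCA
    if a in lineage(b):
        return [os_b]
    return []                    # assumption: incomparable MRCAs are left undated here

fungal = {"YEAST", "SCHPO"}
broad = {"HUMAN", "MOUSE", "DANRE", "CAEEL", "YEAST", "SCHPO", "DICDI"}
dated = assign_fce(fungal, broad)
```

For the two sets above, the fungal-only OS has the more recent MRCA and is the one assigned an FCE, in line with the Figure 1 example. As the text notes, the result is only as good as the gene inventory: a missing gene prediction would shift an OS's MRCA toward the present.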

Note also that gene loss may have occurred within an OS, which can cause the MRCA speciation event of all the extant sequences in the OS to be an underestimate of the age of the founding event (Figure 3, middle and bottom). However, we note that these alternative evolutionary histories become increasingly less likely as we go further back in time, as they invoke an increasing number of independent gene loss events. Thus, in the absence of additional information, the most parsimonious explanation of the data with respect to implied gene deletion events [15] is to connect each pair of related OS's according to the most recent FCE. Of course, additional information (such as synteny) could be used to revise this estimate, but in our simple algorithm we report only the most parsimonious reconstruction. This criterion implicitly considers the genomic presence/absence of genes to be a more reliable data source than the molecular sequence evolution rate, as estimated from character or sequence evolution methods. Finally, we note that multiple OS's may have the same MRCA speciation event and relate to the same sibling OS. In this case, multiple duplications have occurred between the same speciation events, and in this first implementation of GIGA we allow these to remain unresolved multifurcations.

Figure 2 GIGA rules for speciation events. Note that GIGA Rules 1 and 2 result in a different tree topology than standard agglomerative methods such as UPGMA. Because GIGA uses knowledge of the species tree, it postulates that the yeast MET13/met9 group is actually orthologous to the MTHFR genes from other organisms, but was not merged prior to the sequence from D. discoideum (DICDI) due to accelerated evolutionary rate in the fungal lineage.

Rule 4: if the founding copy event of an orthologous subtree has already been dated, allow this date to be revised based on additional evidence

Gene loss near the FCE is not the only reason that the initial MRCA speciation event may be an underestimate of the actual age of the FCE. Accelerated evolutionary rates near the FCE of an OS will also result in an underestimate. However, unlike gene loss (where we would need additional information such as synteny to recognize these events), we should be able to recognize most cases of accelerated rates in later iterations of the algorithm, and revise the FCE accordingly. This revision is necessary because we initially date an FCE when paralogous sequences are joined (Rules 2 and 3, Figure 3). If a lineage near the true FCE is accelerated in evolutionary rate, sequences in this lineage may have diverged more from their orthologs in other species than those orthologs have diverged from genuine paralogs. As a result, the sequence distances between paralogs will be smaller than those between some ortholog pairs, and the orthologs will be joined at a later iteration than the paralogs.

We can recognize possible cases of accelerated rate near the FCE in the following way. Even if there has been an accelerated evolutionary rate, there is likely to be some signal of common ancestry that can identify the diverged sequences as members of the correct orthologous subtree. We therefore ask whether these diverged sequences are significantly closer to this potentially orthologous subtree than to any other subtree. Consider a stage in the algorithm where the closest remaining pairwise distance asserts that OS1 (with a previously established FCE) should be merged with OS2 (containing potentially diverged orthologs) into a new OS. If OS2 contains genuine orthologs, these sequences will most likely retain sequence similarity evidence of this orthology. Recall that the FCE of OS1 was established due to a closer distance (earlier iteration) to a sibling OS containing at least one paralog of a sequence in OS1. Thus, if OS2 is significantly more similar to OS1 than to the sibling of OS1 (the closest paralogous group), this would be good evidence that it is actually orthologous to sequences in OS1. Because we calculate distance as the number of differences per site, we can simply use the Jukes-Cantor formula to estimate the standard deviation in this distance [29]. Depending on the alternative hypothesis to the proposed merge of OS1 and OS2, we take either one or three standard deviations to be significant, and if this criterion is met, the merge proceeds and the FCE is revised accordingly. If the alternative hypothesis is that OS2 is instead orthologous to the sibling of OS1 (which, like a merge with OS1 itself, also implies no gene duplications), we require the distance to be closer by at least three standard deviations. If the alternative hypothesis would require a gene duplication, either of OS1 or its sibling (i.e., OS2 is paralogous to the sibling of OS1), then we require less stringent evidence, namely, that the distance be at least one standard deviation closer.

Figure 3 GIGA rules for duplication events. GIGA infers that a duplication must have occurred using Rule 2, as the two OS's being joined contain two genes from both Baker's yeast (YEAST) and fission yeast (SCHPO). It then places the duplication just prior to the most recent MRCA speciation event (fungi, in this case), which is the most parsimonious solution with respect to gene deletion events (Rule 3). Note that many other solutions are possible (two examples are shown below the most parsimonious case), but they require an increasing number of independent gene deletion events.

Rule 5: if a sequence appears to be a fragment, leave it aside until the tree topology of all non-fragments has been determined

Obviously, it is of value to determine the evolutionary histories of as many genes as possible. However, it is well known that a nontrivial fraction of predicted genes in current genomes are partial predictions, which can cause problems for phylogenetic inference. Sequence fragments cannot be treated the same way as full-length sequences--e.g., for calculating pairwise distances or within a sequence evolution model--because different regions of a gene may be under dramatically different selective pressures, and will therefore evolve at very different rates; consequently distances estimated from part of the sequence may not accurately reflect those of the whole gene. One way to solve this problem is by constructing a multiple sequence alignment, and then restricting analysis to only those sites that are common to all sequences. However, this reduces the amount of data available for evolutionary inference, which as discussed above is already inadequate even if the entire gene sequence can be used.

It is not trivial, a priori, to distinguish a sequence fragment from a genuine evolutionary event in which a region of sequence has been gained or lost. For an evolutionary event, of course, we expect congruence with the phylogenetic tree: once a region of sequence is lost or gained in a particular ancestral gene, this gain or loss will be inherited by its descendants. Because, at a given stage in our tree reconstruction process, we have a hypothesis for the evolutionary history, we can make use of this expected congruence to identify potential sequence fragments on-the-fly. In our simple algorithm, each OS is a hypothesis about a group of sequences that descends from a common ancestor by speciation events, and we can expect to a good approximation that these sequences should have inherited most of the sequence sites present in the ancestor.

Thus, we implemented the following simple on-the-fly test for potential fragments. At a particular step in the iterative process, we are considering a possible merge between two OS's to form a new OS, based on a distance between two sequences. We want to avoid a merge if one (or both) of the sequences driving it is a sequence fragment, since in that case the merge would be based on unreliable data. We first approximate the sites likely to be present in the ancestral sequence as those columns of the multiple sequence alignment for which more than 50% of the sequences in the merged OS align an amino acid. We then test each of the two sequences driving the merge to identify each as a potential fragment. First, if the sequence is already part of an OS with at least three other sequences, it is not considered a fragment, since it passed our fragment test during previous steps, demonstrating that there are at least three other (presumably independently predicted, to some degree) orthologous sequences with similar structure. (The choice of three other independent observations is somewhat arbitrary, and of course depends on the number of species under consideration in a tree; in our tests below we considered as many as 50 species total, and varying this empirical parameter somewhat did not, in general, have an effect on the resulting trees.) If there are fewer than three other sequences in its OS, we gather all sequences from the potentially new, merged OS. If a sequence does not align more than 50% of the expected sites, it is identified as a potential fragment; the merge is not made; and the sequence is removed from the list of sequences to be used during the remainder of the iterative process. This prevents the fragment from determining the tree topology at any stage of the algorithm. However, we found that we could often correctly place sequence fragments during a second iterative process after a tree has been initially reconstructed for all non-fragment sequences. In this second process, each previously removed fragment is joined into the existing tree according to its shortest distance to a non-fragment sequence.
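The fragment test lends itself to a short illustration. The following Python sketch is a hedged reading of the description above, not the GIGA code: the function names, the `min_support` parameter name, and the representation of an alignment as a list of equal-length strings with `-` for gaps are assumptions.

```python
GAP = "-"

def expected_sites(alignment):
    """Columns where more than 50% of the sequences in the merged OS align a residue."""
    n = len(alignment)
    length = len(alignment[0])
    return [c for c in range(length)
            if sum(seq[c] != GAP for seq in alignment) > 0.5 * n]

def is_potential_fragment(seq, merged_os_alignment, n_others_in_own_os, min_support=3):
    """Apply the on-the-fly fragment test to one sequence driving a merge.

    seq                  aligned sequence being tested
    merged_os_alignment  all aligned sequences of the would-be merged OS
    n_others_in_own_os   number of other sequences already in the sequence's OS
    """
    if n_others_in_own_os >= min_support:
        return False  # already corroborated by enough orthologs in earlier steps
    sites = expected_sites(merged_os_alignment)
    if not sites:
        return False  # nothing to test against (edge case, assumed non-fragment)
    covered = sum(seq[c] != GAP for c in sites)
    # A fragment does not align more than 50% of the expected sites.
    return covered <= 0.5 * len(sites)
```

In this sketch a sequence flagged as a potential fragment would simply be set aside by the caller and revisited after the non-fragment topology is complete.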

The GIGA algorithm
The steps of the algorithm are as follows:

1. Preprocessing and setup
1.1 Decide on the genomes to be included; construct the "known" species tree for these genomes.
1.2 For each protein family:

1.2.1 Assemble a "complete" set of genes for the given family.
1.2.2 Create a multiple sequence alignment of the genes in the family.
1.2.3 Select homologous sites in the alignment for sequence comparisons. We "trim" the alignment by removing a site if more than 15% of the weighted sequences are gapped at that site. Sequences are weighted using the procedure of Karplus et al. [30].
1.2.4 Represent sequence divergence at homologous sites. In the spirit of first trying the simplest model, we calculate the distance between each pair of sequences as simply the fraction of sequence differences at selected homologous sites.
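Steps 1.2.3 and 1.2.4 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the published code: sequence weights default to uniform (the weighting procedure of Karplus et al. is not reproduced here), sites where either sequence of a pair is gapped are skipped when counting differences, and the function names are hypothetical.

```python
def trim_columns(alignment, weights=None, max_gap_fraction=0.15):
    """Indices of alignment columns kept after trimming gappy sites.

    A column is dropped if more than `max_gap_fraction` of the (weighted)
    sequences are gapped there. Uniform weights are an assumption.
    """
    if weights is None:
        weights = [1.0] * len(alignment)
    total = sum(weights)
    keep = []
    for c in range(len(alignment[0])):
        gapped = sum(w for seq, w in zip(alignment, weights) if seq[c] == "-")
        if gapped / total <= max_gap_fraction:
            keep.append(c)
    return keep

def p_distance(a, b, columns):
    """Fraction of differing residues between two aligned sequences over `columns`."""
    compared = 0
    diffs = 0
    for c in columns:
        if a[c] == "-" or b[c] == "-":
            continue  # assumption: ignore sites gapped in either sequence
        compared += 1
        diffs += a[c] != b[c]
    return diffs / compared if compared else float("nan")
```

The resulting pairwise values would populate the distance matrix from which step 2 repeatedly selects the closest untreated pair.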

2. For each protein family, infer the gene tree topology by iteratively defining orthologous groups, and how those groups are related via gene duplication events. Initialization: each sequence begins in its own OS.

2.1: Consider the closest pair of sequences that has not been treated in previous steps, and do one of the following operations with the two OS's containing them (Figure 4):

2.1.1 Join the two OS's by a duplication event, and locate the event relative to the speciation events in each OS (Rule 3). If the two OS's, taken together, have two genes from a single organism, then they will be joined by a duplication event (Rule 2) if either of the following conditions also holds:

2.1.1.1: The founding duplication event has not been previously located for either OS.
2.1.1.2: The founding duplication event has been previously located for only OS1 and not OS2, and joining the two OS's will not conflict with this location. In other words, the phylogenetic span of OS2 must be less than or equal to that of OS1. This constraint means that joining the two OS's would not require us to revise an earlier hypothesis about when a duplication event occurred.

The founding duplication event is initially estimated so as to minimize the number of implied deletions, i.e. the FDE of the OS with the more recent MRCA is set to be immediately prior to that MRCA (Rule 3, Figure 3).

2.1.2: Merge the two separate OS's into a single OS (Rule 1).

2.1.2.1 Allow the merge only if the sequences are not likely fragments (Rule 5). If neither sequence is a fragment then the two OS's are merged if one of the following conditions holds:
2.1.2.2 The founding duplication event has not been located for either OS.
2.1.2.3 The founding duplication event has been located for only OS1 and not OS2, and merging the two OS's will not conflict with this location. In other words, the MRCA speciation event of the merged OS is the same as for OS1. This constraint means that merging the two OS's would not require us to revise an earlier hypothesis about when a gene duplication event occurred.
2.1.2.4 The founding duplication event has been located for only OS1 and not OS2, and merging the two OS's will conflict with this location, but there is adequate sequence evidence to support the revised location of the duplication event (Rule 4). We first calculate the standard deviation of the distance between OS2 and OS1 (dist1 and std_dev1) and that of the distance between OS2 and the sibling of OS1 (dist2 and std_dev2). If OS2 and the sibling of OS1 have no species overlap and might be orthologous, we require that

dist1 - dist2 > 1.5 (std_dev1 + std_dev2)

Otherwise, we require a less stringent criterion that

dist1 - dist2 > 0.5 (std_dev1 + std_dev2)

2.2 Attempt to add fragments back into the tree. Allow each fragment one attempted merge or join event, based on the shortest distance between the fragment and any non-fragment.
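The significance criterion of step 2.1.2.4 (Rule 4) can be sketched as follows. Two points in this Python sketch are assumptions and should be read as such. First, the standard deviation is taken as the binomial sampling error of a per-site difference fraction, sqrt(p(1-p)/n); the exact Jukes-Cantor-based expression used in GIGA may differ. Second, the margin is evaluated as dist2 - dist1, following the prose description of Rule 4 (OS2 must be closer to OS1 than to the sibling of OS1); the printed inequality writes the difference in the other order, so the sign convention here is an interpretation. All names are hypothetical.

```python
import math

def std_dev(p, n_sites):
    """Assumed sampling error of a per-site difference fraction p over n_sites."""
    return math.sqrt(p * (1.0 - p) / n_sites)

def revision_supported(dist1, dist2, n_sites, sibling_may_be_orthologous):
    """Decide whether sequence evidence supports revising the FCE of OS1.

    dist1  distance between OS2 and OS1
    dist2  distance between OS2 and the sibling of OS1
    sibling_may_be_orthologous
           True if OS2 and the sibling share no species (stringent test,
           roughly three standard deviations); False otherwise (roughly one).
    """
    factor = 1.5 if sibling_may_be_orthologous else 0.5
    margin = factor * (std_dev(dist1, n_sites) + std_dev(dist2, n_sites))
    return (dist2 - dist1) > margin  # assumption: OS2 must be closer to OS1
```

With roughly equal standard deviations, the factors 1.5 and 0.5 applied to the sum correspond to about three and one standard deviations respectively, which matches the thresholds stated for Rule 4.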

3. Infer tree branch lengths
We recommend taking the tree topology generated by GIGA and estimating branch lengths and ancestral sequences using an ML-based procedure, e.g. PAML [31]. However, in the spirit of the simple algorithm, we compute by default an approximate reconstruction of each ancestral sequence (a local, parsimony-like algorithm that reconstructs each node using only its descendants and closest outgroup), and then compute branch lengths as the sequence difference between adjacent nodes in the tree, including the Jukes-Cantor correction [29].

3.1 Infer ancestral sequences at each node. We do this in a simple manner, by recursion beginning at the leaf nodes (only extant sequences, the leaves, are known). For each non-leaf node, we consider the descendant nodes and its closest outgroup node. If the sequence of the closest outgroup node has not yet been determined, use its descendants to define the outgroup. If over half of the descendant nodes align the same amino acid at a given site, it is inferred to be the most likely ancestral amino acid. If the descendants disagree, and the outgroup agrees with one of them, the outgroup amino acid is inferred to be the most likely ancestral amino acid. Otherwise, the ancestral amino acid is considered to be unknown ('X').

3.2 Calculate branch lengths from node sequences. We use a simple measure, the fraction of sequence differences between a parent node and a child node. The Jukes-Cantor correction is applied to this value. However, in one respect we want to be very careful, and calculate distances only over a selected subset of sites. Following a duplication event, it is often the case that one of the duplicates continues to conserve the ancestral function more closely, while the other diverges more rapidly. We can identify the "least diverged" ortholog by tracing the shorter branch. Because of rate heterogeneity among sites, the relative branch lengths are reliable only if they are calculated over the same sites. Therefore in our algorithm, for branches following a duplication event, lengths are calculated using only those sites that are aligned in all the descendant nodes.
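Steps 3.1 and 3.2 can be illustrated per site and per branch with the Python sketch below. It is a simplified reading, not the GIGA code: the recursion over the tree and the selection of the closest outgroup are left to the caller, the restriction to sites aligned in all descendants after a duplication is not shown, and the correction uses the 20-state form of the Jukes-Cantor formula, d = -(19/20) ln(1 - (20/19) p), which is an assumption about how the correction is applied to protein sequences.

```python
import math
from collections import Counter

def ancestral_residue(descendants, outgroup=None):
    """Infer the ancestral residue at one site.

    descendants  residues aligned at this site in the descendant nodes
    outgroup     residue in the closest outgroup node, or None if unavailable
    """
    counts = Counter(descendants)
    residue, n = counts.most_common(1)[0]
    if n > len(descendants) / 2:       # over half of the descendants agree
        return residue
    if outgroup is not None and outgroup in counts:
        return outgroup                # outgroup agrees with one descendant
    return "X"                         # unknown

def branch_length(parent_seq, child_seq, n_states=20):
    """Corrected fraction of differences between a parent and a child node."""
    sites = [(a, b) for a, b in zip(parent_seq, child_seq)
             if a not in "X-" and b not in "X-"]   # skip unknown or gapped sites
    if not sites:
        return float("nan")
    p = sum(a != b for a, b in sites) / len(sites)
    scale = (n_states - 1) / n_states
    if p >= scale:
        return float("inf")            # saturated: correction is undefined
    return -scale * math.log(1.0 - p / scale)
```

Because the correction grows faster than the raw difference fraction, longer branches are stretched more than short ones, which matters when comparing the two branches that follow a duplication.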

Figure 4 Core of the simple GIGA algorithm. OS = orthologous subtree, a portion of the gene tree containing only speciation events; FCE = founding copy event, the event (located relative to speciation events) that founded a given OS; in this first implementation of GIGA all FCEs are duplication events. The algorithm begins with each sequence in its own, separate OS. Each iteration operates on the currently closest pair of OS's. At each iteration, either 1) the two OS's are merged into a single OS (right side), or 2) one (or both) OS's are assigned FCEs. The tree is complete when all OS's have been assigned FCEs.
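The choice made at each iteration of step 2.1 can be sketched as a small decision function. This Python sketch is an interpretation under several assumptions: the species tree is given as a list of clades (`CLADES`, a hypothetical toy example), an OS is a dict with a `species` list (one entry per gene) and an `fce_located` flag, the fragment test (Rule 5) is assumed to have been passed already, and the return labels (`"duplication"`, `"merge"`, `"rule4_test"`, `"skip"`) are illustrative names, not terms from the GIGA source code.

```python
# Hypothetical species tree, listed as its clades (sets of leaf species).
CLADES = [frozenset(c) for c in (
    {"YEAST"}, {"SCHPO"}, {"HUMAN"}, {"DROME"},
    {"YEAST", "SCHPO"}, {"HUMAN", "DROME"},
    {"YEAST", "SCHPO", "HUMAN", "DROME"},
)]

def mrca_clade(species):
    """Smallest species-tree clade containing every given species."""
    return min((c for c in CLADES if set(species) <= c), key=len)

def choose_operation(os1, os2):
    """Pick the operation for the two OS's holding the closest untreated pair."""
    if os2["fce_located"] and not os1["fce_located"]:
        os1, os2 = os2, os1                 # let OS1 be the dated one, if any
    if os1["fce_located"] and os2["fce_located"]:
        return "skip"                       # both founding events already located
    span1 = mrca_clade(os1["species"])
    span2 = mrca_clade(os2["species"])
    shared = set(os1["species"]) & set(os2["species"])
    if shared:                              # Rule 2: two genes from one organism
        if not os1["fce_located"] or span2 <= span1:
            return "duplication"            # then locate the event by Rule 3
        return "skip"                       # joining would contradict the dated FCE
    if not os1["fce_located"]:
        return "merge"                      # Rule 1, nothing dated yet
    if mrca_clade(os1["species"] + os2["species"]) == span1:
        return "merge"                      # MRCA of OS1 is unchanged by the merge
    return "rule4_test"                     # merge only with adequate evidence
```

A caller would iterate over sequence pairs in order of increasing distance, look up the OS containing each sequence, and apply the returned operation; the handling of pairs labelled `"skip"` is a guess at behavior the text does not spell out.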


Implementation
The GIGA algorithm has been implemented in the C programming language, and the code is available at ftp://ftp.pantherdb.org/.

Testing the GIGA algorithm
Three properties of a phylogenetic inference algorithm are important to assess: speed, accuracy and robustness. Speed (compute time required to build each tree) should be assessed over a range of conditions to also determine how well a method scales with increasing number of sequences or alignment length. Robustness describes the sensitivity of the tree topology to perturbations such as adding/removing sequences, "resampled" character states (for calculating "bootstrap" values), or different parameter settings. Accuracy describes how well the inferred tree matches the actual history of evolutionary events. Of course, for actual gene families, we cannot go back in time and follow the "true" sequence of events, to know it for certain. In practice, accuracy can potentially be assessed in two ways: comparison against sequence data generated by "forward" evolutionary simulation for a known tree topology, or comparison with "gold standard" phylogenetic reconstructions. Simulated data sets are widely used, but their relevance for assessing gene tree inference algorithms is not established; the ability of an algorithm to correctly infer the underlying tree may be more dependent on how well it matches the particular simulation algorithm than how well it will work on actual gene data. On the other hand, there are as yet no "gold-standard" sets of diverse gene phylogeny reconstructions based on actual data. Several groups have used congruence with the "known" species tree as a gold-standard measure [10,32], but because GIGA uses such congruence as a constraint, this is not an appropriate test (though it does support the use of species tree congruence as a constraint).

We suggest that the TreeFam resource [33] can be used to provide benchmarks for speed, accuracy and at least one type of robustness, namely the effect of adding more sequences (taxa) of potentially lower quality. TreeFam contains a large, diverse set of protein families. Moreover, most families display substantial sequence divergence, indicating that they do not represent trivial cases for evolutionary reconstruction. Figure 5A shows the distribution of average and minimum pairwise sequence identity across TreeFam protein alignments (considering only the positions that are aligned for most sequences, as described in step 1.2.3 of the GIGA algorithm description above). Average pairwise identity is approximately normal, with a mean and standard deviation of 52% ± 15%, while the minimum pairwise identity mode is less than 20%, and nearly all families (>90%) have a minimum pairwise identity less than 50%. Figure 5B shows the distributions of the number of sequences in the TreeFam families (>4 sequences), and the lengths of the protein alignments.

Accuracy cannot be assessed directly (because the true evolutionary history is unknown), but consistency with TreeFam trees can be easily assessed. The quality of TreeFam trees on average has been established using a number of different metrics [34], so we can reasonably expect that an accurate method should produce trees similar to TreeFam trees in general. Because there is only one possible topology for a tree of two sequences, and only three possible topologies for a bifurcating tree of three sequences, we confined our analyses to 14,331 TreeFam families (release 6.1) comprising four or more sequences. Finally, robustness of an algorithm to perturbations such as adding sequences, and variations in sequence quality, can be assessed by comparing trees constructed from the two different TreeFam alignments, the "clean" and "full" alignments, which differ only in that the full alignments contain additional sequences from partially sequenced genomes. A perfectly robust method will infer identical trees for the sequences in the "clean" alignment, regardless of whether or not the additional "full" sequences are also included during the tree building process.

Algorithm speed
In order to assess algorithm speed and scaling, we selected two sets of families: one with families of the same alignment length (to test the algorithms' dependence on the number of sequences), and one with families of the same number of sequences (to test the algorithms' dependence on the alignment length). In the current implementation, GIGA scales similarly to other, commonly used methods in terms of dependence on the number of sequences and alignment length (Figure 6), but is over 100 times as fast as neighbor joining (as implemented in PhyML), and over 1000 times as fast as the ML methods PhyML and TreeBeST.

Accuracy of GIGA trees: consistency with TreeBeST
As a proxy for accuracy, we compared GIGA trees directly to TreeFam "clean" trees (the trees considered to be of highest quality in TreeFam). Figure 7A shows the Robinson-Foulds (RF) [35] distances (the most commonly used measurement of tree similarity) between TreeFam clean trees, and GIGA trees inferred using the same alignment. Overall, the trees produced with the two different algorithms are quite similar, with about 13% of the trees being identical and 64% very similar (RF distance < 0.2).
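For readers unfamiliar with the RF distance, the following Python sketch computes a normalized version for two trees over the same set of leaves. It is an illustration only: trees are represented as nested tuples of leaf names, the comparison is made on unrooted bipartitions, and the normalization (number of splits found in only one tree, divided by the total number of non-trivial splits in both trees) is one common convention that is assumed here rather than taken from the text.

```python
def _leaf_sets(tree, out):
    """Return the leaf set below `tree`, appending each internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves

def splits(tree):
    """Non-trivial bipartitions of the leaves induced by the internal edges of `tree`."""
    clades = []
    all_leaves = _leaf_sets(tree, clades)
    reference = min(all_leaves)       # fixed leaf used to pick a canonical side
    result = set()
    for clade in clades:
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            # store the side that does not contain the reference leaf
            result.add(other if reference in clade else clade)
    return result

def normalized_rf(tree_a, tree_b):
    """Normalized Robinson-Foulds distance between two trees on the same leaves."""
    sa, sb = splits(tree_a), splits(tree_b)
    total = len(sa) + len(sb)
    return len(sa ^ sb) / total if total else 0.0
```

Under this normalization a value of 0 means identical unrooted topologies and a value of 1 means that the two trees share no internal split.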

To further characterize the magnitude of these RF distances, we constructed both NJ and ML trees from the clean alignments using the PhyML program [36]. We then compared the trees from all four methods. While the RF distance distributions for all comparisons are affected by both the number of sequences and the alignment length, GIGA and TreeBeST trees are the most similar to each other in almost all cases, except for a few of the smaller trees (Figure 8). On average, GIGA and TreeBeST produce the most similar trees, along with NJ-ML (average RF = 0.21). The overall NJ-ML similarity is expected, since in PhyML the ML tree construction process begins with the NJ tree, so the comparable GIGA-TreeBeST similarity is striking.

For all comparisons, the RF distance correlates with the number of sequences when alignment length is constant (Figure 8A, R = 0.25 to 0.41 except GIGA-TreeBeST), though the GIGA-TreeBeST distance depends much less upon the family size (R = 0.16), presumably due to the species tree guidance. Thus, despite a slight absolute decrease in GIGA-TreeBeST similarity for larger families, this decrease is small relative to all other comparisons.

Also for all comparisons, the RF score is negatively correlated with alignment length when the number of sequences is held fixed (Figure 8B, R = -0.25 to -0.37 except for NJ-ML), though this effect is much less pronounced for the NJ-ML comparison (R = -0.14) presumably because the NJ tree is used in the ML process. Importantly, all of the different methods tend to converge towards each other as the amount of substitution data increases. This is consistent with the conclusion of Rasmussen and Kellis [10] that shorter genes often lack sufficient information for accurate evolutionary reconstruction. Finally, because GIGA joins the closest sequence pair at each step, it is useful to compare it to the UPGMA method. On average, the UPGMA trees (estimated by PHYLIP [37], using corrected distances from PROTDIST using the JTT model [38]) are much less similar to TreeBeST trees (average RF = 0.30), and perhaps surprisingly, are even less similar to GIGA trees (average RF = 0.41) than to TreeBeST trees. Thus, on real protein family alignments, the tree construction rules in GIGA tend to predominate over the algorithmic ordering of joining operations, and these rules dramatically improve the match with TreeBeST trees.

Figure 5 Characteristics of the TreeFam families used in this study (14,331 families with at least 4 sequences). (A) Distribution of average and minimum pairwise identity of families, (B) Distributions of number of sequences and protein alignment length.

Figure 6 CPU time required for tree reconstruction (note the log scale). GIGA is over 100 times faster than NJ and 1000 times faster than ML methods. (A) Dependence on number of sequences (alignment length is constant at 200-204). (B) Dependence on alignment length (number of sequences is constant at 20). The same alignments are used for each method.

Figure 7 Accuracy of GIGA trees: comparison with TreeFam clean trees for more than 14,000 TreeFam families. A) normalized RF distance comparing tree topology; B) ortholog pair difference (see text) is substantially smaller than the RF distance, indicating that many of the topological differences between TreeBeST and GIGA trees are due to disagreements in speciation event order.

Although the differences between GIGA and TreeBeST trees were not very substantial (no larger than those between PhyML and its NJ starting point, as discussed above), we explored these differences further. Because TreeFam trees use the species tree only as a "soft" constraint, we reasoned that some of the disagreement between the two algorithms was simply due to local rearrangements of speciation events, as opposed to more substantive differences in the location of gene duplication events. We can quantify this disagreement by comparing the sets of ortholog pairs that are inferred from the two trees. Because two genes are inferred to be orthologous if their most recent common ancestor in a gene tree is a speciation event, a local rearrangement involving only speciation events will have no effect on the inferred ortholog pairs. Differences involving duplication events, on the other hand, will affect the inferred ortholog pairs. We therefore calculated ortholog pairs for all the trees. We defined the ortholog pair difference as

1 - (number of pairs inferred in common from both trees) / (total number of pairs inferred from either tree)
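The ortholog pair difference defined above can be computed directly from two event-labeled gene trees, as in the following Python sketch. The tree encoding is an assumption made for illustration: a leaf is a gene name, and an internal node is a tuple whose first element is `"S"` (speciation) or `"D"` (duplication) followed by its children. Function names are hypothetical.

```python
from itertools import product

def _collect(node, pairs):
    """Return the leaves below `node`, adding ortholog pairs found at speciation nodes."""
    if isinstance(node, str):
        return [node]
    event, *children = node
    child_leaves = [_collect(child, pairs) for child in children]
    if event == "S":
        # Genes in different child lineages of a speciation node are orthologs.
        for i in range(len(child_leaves)):
            for j in range(i + 1, len(child_leaves)):
                for a, b in product(child_leaves[i], child_leaves[j]):
                    pairs.add(frozenset((a, b)))
    return [leaf for leaves in child_leaves for leaf in leaves]

def ortholog_pairs(tree):
    """Set of gene pairs whose most recent common ancestor is a speciation event."""
    pairs = set()
    _collect(tree, pairs)
    return pairs

def ortholog_pair_difference(tree_a, tree_b):
    """1 - (pairs inferred from both trees) / (pairs inferred from either tree)."""
    pa, pb = ortholog_pairs(tree_a), ortholog_pairs(tree_b)
    union = pa | pb
    return 1.0 - len(pa & pb) / len(union) if union else 0.0
```

Swapping the order of two adjacent speciation nodes leaves the pair sets unchanged, whereas moving a duplication node above or below a speciation changes them, which is why this measure isolates disagreements about duplication events.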

Figure 7B shows that for 53% of the TreeFam families, the inferred orthologs are in perfect agreement; in these cases the GIGA and TreeFam trees are either identical or display only minor rearrangements of speciation nodes relative to each other. For the vast majority of trees (84%), the ortholog difference is less than 0.2, suggesting generally good agreement on the inference of gene duplication events.

Figure 8 Comparison of multiple different tree reconstruction methods. The RF distance of each pair of trees is plotted versus (A) the number of sequences in the family and (B) the length of the alignment to show the dependence on these parameters. GIGA and TreeBeST (blue diamonds) generally yield more similar trees than any other pair of methods, except for NJ-ML, which is of comparable similarity. The RF distance mean and standard deviation for each pair of methods is in the figure legend in parentheses. The same subsets of TreeFam families were used as for Figure 6.

Because multiple sequence alignment quality is well known to influence phylogenetic reconstruction, we reasoned that some of the remaining discrepancy between TreeFam and GIGA trees might be due to poor alignment quality. We ran PredictedSP [39] on all TreeFam alignments to generate an alignment score (1 indicating perfect quality, with smaller numbers indicating lower quality). We found that families for which GIGA and TreeFam differed more substantially (ortholog pair difference > 0.2) had a strong tendency to have poor alignments. Over half (54%) of families with substantially different trees inferred by the two different algorithms (ortholog pair difference > 0.2) had a PredictedSP score of less than 0.85, while this was true of only 19% of the families with similar trees (ortholog pair difference < 0.2). This strongly suggests that poor alignment quality accounts for a substantial fraction of the discrepancies between TreeBeST and GIGA trees.

Poor alignment quality does not account for all the discrepancies. Despite the overall good agreement between predicted orthologs, there are systematic differences between the two tree inference methods. Over the entire set of more than 14,000 families we compared, 67% of all ortholog pairs inferred by either algorithm are in exact agreement. However, the disagreements are not randomly distributed between the two algorithms. GIGA infers a substantially larger number of ortholog pairs. Of all ortholog pairs inferred by TreeBeST, 96% are also inferred by GIGA, but of all ortholog pairs inferred by GIGA, only 69% are also inferred by TreeBeST (Figure 9). This difference can be largely explained by the fact that the GIGA algorithm locates duplication events using a genomic parsimony criterion, while TreeBeST also uses an ML sequence evolution model. GIGA will tend to locate duplication events as far as possible toward the leaves of the species tree, to minimize the number of implied deletion events. This, in turn, will enable a larger number of gene pairs to be traced to a common speciation event ancestor, i.e., a larger number of inferred orthologs. An example is shown in Figure 10.

Robustness of GIGA trees
As a baseline, we first assessed the robustness of TreeBeST trees, by comparing the TreeFam "clean" and "full" trees. If the TreeBeST algorithm were perfectly robust to the addition of sequences, the topology of the TreeFam full tree would be identical (for the subset of sequences also in the clean tree) to the clean tree. To identify deviations from perfect robustness, we calculated the RF distance between the TreeBeST clean and full trees. Figure 11 (red bars) shows that TreeBeST is reasonably robust to the additional sequences. In relatively few cases are the TreeBeST trees for the clean and full alignments identical (5.7%) but most are very similar (57% have a distance less than 0.2; 88% have a distance less than 0.4). We note that the TreeBeST algorithm itself is somewhat different for full and clean alignments, as full trees are estimated using only protein sequences, while clean trees can use nucleotide as well as protein sequences.

To test the robustness of GIGA, we then constructed two separate GIGA trees for each TreeFam family, one from the "clean" protein alignment, and one from the "full" alignment. We then calculated the RF distance between the two GIGA trees to measure how much the additional "full" sequences changed the topology inferred for the "clean" sequences. We found that GIGA is considerably more robust than TreeBeST to the perturbation of adding sequences (Figure 11, blue bars), with over 85% of the trees being completely unchanged (RF distance of 0) and 98% changing in RF distance by less than 0.2. The robustness of GIGA is remarkable, and is due largely to the strong constraints provided by the rules described above.
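Comparing a "full" tree with a "clean" tree requires first restricting the full tree to the sequences present in the clean alignment. The Python sketch below shows one way this could be done; it is an illustration with assumed names and an assumed nested-tuple tree encoding, and it checks only whether the rooted topology is unchanged rather than computing a graded distance.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`.

    Internal nodes left with a single child are suppressed; returns None if
    no leaf below `tree` is kept.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)

def clade_set(tree):
    """Set of leaf sets found below the internal nodes of a rooted tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def unchanged_by_added_sequences(clean_tree, full_tree):
    """True if the full tree, restricted to the clean leaves, has the clean topology."""
    clean_clades = clade_set(clean_tree)
    clean_leaves = max(clean_clades, key=len)   # the root clade holds every leaf
    return clade_set(prune(full_tree, clean_leaves)) == clean_clades
```

A family counted as "completely unchanged" in the sense used above would be one for which this check returns True.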

Conclusions
We described a simple algorithm, GIGA, for inferring the evolutionary events that have given rise to a particular gene family. We then demonstrated that this simple algorithm creates trees that are similar overall to those produced by the much more complex and computationally intensive TreeBeST algorithm [33]. We consider this to be evidence of the accuracy of GIGA, based on a published analysis of TreeBeST trees [34], though of course evolutionary reconstruction is an ongoing research activity. GIGA is over 1000 times faster than "fast" ML methods such as TreeBeST and PhyML, and over 100 times faster than neighbor joining. The GIGA algorithm can be simple precisely because it makes use of constraints on the evolutionary history that have only recently become available, with the advent of whole genome sequencing. The overall philosophy of the algorithm is that because of potentially dramatic departures from clocklike behavior in many gene families, even a fairly sophisticated treatment of molecular sequences is likely to be less trustworthy overall as a guide for constructing gene trees than is genome-derived information such as a known species tree, and gene content (presence or absence of particular genes). The algorithm does rely on sequences to reveal common ancestry (due to the improbability of convergent sequence evolution), even if the sequences alone may not reliably date that common ancestor. Of course, GIGA is not the ultimate implementation of this genome evolution paradigm--rather, it is only a simple first step.

Figure 9 Overlap between orthologs computed from GIGA and TreeBeST trees. GIGA infers 96% of orthologs inferred by TreeBeST, but also finds many additional orthologs, due mainly to minimization of implied gene duplication and deletion events.

The GIGA algorithm makes use of ideas that have been employed for some time. GIGA is similar to tree reconciliation [7,8] and soft parsimony [9], but rather than first estimating the entire tree and then reconciling it with the species tree, GIGA reconciles the tree at each step in the algorithm. Unlike soft parsimony, polytomies in the species tree or from rapidly repeated duplication remain unresolved, "simultaneous" events, in the current implementation of GIGA. Other algorithms have made use of a species tree to guide gene tree reconstruction, notably SYNERGY [11] and TreeBeST [33]. While SYNERGY uses the species tree to determine the order of iterative NJ tree building and rooting, GIGA builds up different "orthologous subtrees" (OS's) simultaneously and determines their relationships based on pairwise distances and genome content. While TreeBeST uses the species tree to count duplications and deletions, which are then treated in an ML framework with weighted probabilities relative to substitution events, GIGA minimizes the number of duplications and deletions consistent with the OS's, which is tantamount to giving these events a very large weight compared to substitutions.
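To make the idea of counting duplications against a known species tree concrete, the following sketch applies the standard LCA-mapping form of tree reconciliation to a finished gene tree. This is the conventional post hoc reconciliation that the text contrasts with GIGA, not GIGA's step-by-step procedure, and it counts duplications only (not deletions). The nested-tuple encoding, the function names, and the gene and species names are assumptions made for illustration.

```python
def species_clades(tree):
    """List the leaf set of every node in a nested-tuple species tree."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    clades, here = [], frozenset()
    for child in tree:
        sub = species_clades(child)
        clades.extend(sub)
        here |= sub[-1]  # the last entry is the child's own leaf set
    clades.append(here)
    return clades


def lca(clades, species):
    """Smallest species-tree clade containing every species in `species`."""
    return min((c for c in clades if species <= c), key=len)


def count_duplications(gene_tree, gene_to_species, clades):
    """Return (mapped species clade, number of implied duplications)."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]]), 0
    child_maps, dups = [], 0
    for child in gene_tree:
        mapped, n = count_duplications(child, gene_to_species, clades)
        child_maps.append(mapped)
        dups += n
    here = lca(clades, frozenset().union(*child_maps))
    if here in child_maps:
        # A node mapping to the same clade as one of its children
        # cannot be a speciation, so it is counted as a duplication.
        dups += 1
    return here, dups


# Hypothetical three-species example with one gene per genome.
species_tree = (("human", "mouse"), "fish")
gene_to_species = {"h1": "human", "m1": "mouse", "f1": "fish"}
clades = species_clades(species_tree)

concordant = (("h1", "m1"), "f1")
discordant = (("h1", "f1"), "m1")
print(count_duplications(concordant, gene_to_species, clades)[1])  # 0
print(count_duplications(discordant, gene_to_species, clades)[1])  # 1
```

Giving duplications and deletions a very large weight relative to substitutions, as described above, amounts to preferring the first topology whenever the genome content allows it.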

One of the main advantages of the GIGA algorithm over other methods is its simplicity. This simplicity makes it particularly amenable to systematic improvement, as it is easy to identify the algorithmic reasons for the tree topology inferred by GIGA and to propose additional rules if necessary. In the future, one could develop rules to handle specific evolutionary events in addition to those addressed in this initial implementation, such as whole genome duplication, horizontal transfer, polytomies and even incomplete lineage sorting. Nevertheless, even with only the few rules presented here, the algorithm performs remarkably well for many applications.

Figure 10 Example of a tree with substantial disagreement in inferred duplication events, and corresponding orthologs, between TreeBeST (A) and GIGA (B), TreeFam family TF105095. The sequence alignment is of high quality according to PredictedSP, so this disagreement is due to algorithm differences rather than a problematic alignment. The main differences are in the inference of gene duplication events (orange nodes) in the CYP17A1 lineage (other than the recent duplications in the bovine lineage). (A) TreeBeST infers two duplication events (dup 1 and dup 2), both prior to the ray-finned fish-tetrapod divergence, followed by at least five separate deletion events: one prior to the frog-amniote divergence (del 1), one prior to the chicken-mammal divergence (del 2), one prior to the fish radiation (del 3), one following the divergence of the frog lineage (del 4), and one following the divergence of the chicken lineage (del 5). Note that according to this tree, there are no orthologs of human CYP17A1 in chicken, frog, or fish. (B) GIGA infers one duplication event, before the fish radiation (dup 1') and no deletion events. Note that according to this tree, there is one ortholog of human CYP17A1 in frog, one in chicken, and two in each fish species. Note also that tree (B) infers two periods of accelerated (potentially adaptive) molecular evolutionary rates, which may account for why a molecular evolution model would favor a topology with longer divergence times such as in (A).

As with nearly all computational methods, GIGA was designed to address certain applications of phylogenetic inference and is not appropriate for all applications. GIGA assumes that the "true" species tree is known (insofar as the tree model holds), and that we have a whole genome and "complete" knowledge of the genes in that genome. It is therefore applicable only to genomes that have been fully sequenced and annotated with respect to the genes in the families whose histories we wish to infer. It is obviously not amenable to analysis of gene sequences obtained by environmental sequencing (where the species of origin is not known), nor to inference of species phylogenies from gene sequences, nor to inference of incomplete lineage sorting (except as a null hypothesis). Nevertheless, the algorithm has many advantages for problems involving large-scale phylogenetic reconstruction, inference of orthologs, and inference of gene function by homology. It is very fast, enabling reconstruction of the phylogenies of large gene families. GIGA may even be appropriate as a starting point for refinement by "fast" maximum-likelihood methods. Finally, GIGA is remarkably robust to adding sequences. This property is particularly useful for phylogenomic databases, as it enables ancestral sequences to be referred to by a stable identifier over successive releases as new genomes are sequenced and new genes are annotated in existing genomes. In turn, the stable reference to ancestral sequences would enable the large-scale annotation of gene function by homology, explicitly tracing the evolution of gene function within a gene family. The Gene Ontology Reference Genomes Project is currently undertaking just such an effort, using the trees produced by the GIGA algorithm [40]. Trees produced by GIGA for 48 completed genomes are now available in the PANTHER version 7 database [41], which complements other existing phylogenomics resources that employ other tree reconstruction algorithms, such as TreeFam [33] (using a combined sequence/genomic event ML algorithm), PhylomeDB [42] (using NJ, ML and Bayesian algorithms) and GeneTrees [43] (using a Bayesian algorithm).

List of abbreviations
FCE: founding copy event; GIGA: gene tree inference in the genomic age; ML: maximum likelihood; MRCA: most recent common ancestor; NJ: neighbor joining; OS: orthologous subtree; RF distance: Robinson-Foulds distance; UPGMA: unweighted pair group method with arithmetic mean.

Acknowledgements
The author is indebted to Stan Dong for implementing the comparisons between the different tree methods. This work was supported in part by NIGMS grant R01GM081084.

Figure 11 Robustness of tree inference algorithms: histograms for GIGA and TreeBeST, for "clean" vs. "full" alignments for more than 14,000 TreeFam families. Full alignments include additional sequences, but the alignment is the same as for the clean set. An RF distance of 0 indicates that the tree topology is unchanged by adding more sequences. Overall, GIGA is more robust than TreeBeST to the perturbation of adding sequences.


Author Details
1 Evolutionary Systems Biology Group, SRI International, Menlo Park, CA, USA and 2 Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA

References

1. Felsenstein J: Inferring Phylogenies. New York: Sinauer, Inc.; 2004.

2. Barnabas J, Goodman M, Moore GW: Descent of mammalian alpha globin chain sequences investigated by the maximum parsimony method. J Mol Biol 1972, 69(2):249-278.

3. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4(4):406-425.

4. Prager EM, Wilson AC: Construction of phylogenetic trees for proteins and nucleic acids: empirical evaluation of alternative matrix methods. J Mol Evol 1978, 11(2):129-142.

5. Whelan S: Inferring trees. Methods Mol Biol 2008, 452:287-309.

6. Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 2001, 294(5550):2310-2314.

7. Chen K, Durand D, Farach-Colton M: NOTUNG: a program for dating gene duplications and optimizing gene family trees. J Comput Biol 2000, 7(3-4):429-447.

8. Durand D, Halldorsson BV, Vernot B: A hybrid micro-macroevolutionary approach to gene tree reconstruction. J Comput Biol 2006, 13(2):320-335.

9. Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA: Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol 2006, 63(2):240-250.

10. Rasmussen MD, Kellis M: Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res 2007, 17(12):1932-1942.

11. Wapinski I, Pfeffer A, Friedman N, Regev A: Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics 2007, 23(13):i549-558.

12. Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428(6983):617-624.

13. Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, Stein LD: nGASP--the nematode genome annotation assessment project. BMC Bioinformatics 2008, 9(549):549.

14. Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al.: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006, 7(Suppl 1(1)):S21-31.

15. Czelusniak J, Goodman M, Hewett-Emmett D, Weiss ML, Venta PJ, Tashian RE: Phylogenetic origins and adaptive evolution of avian and mammalian haemoglobin genes. Nature 1982, 298(5871):297-300.

16. Beiko RG, Hamilton N: Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 2006, 6(15):15.

17. Kunin V, Goldovsky L, Darzentas N, Ouzounis CA: The net of life: reconstructing the microbial phylogenetic network. Genome Res 2005, 15(7):954-959.

18. Jin G, Nakhleh L, Snir S, Tuller T: Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Mol Biol Evol 2007, 24(1):324-337.

19. Olsen GJ, Woese CR: Ribosomal RNA: a key to phylogeny. Faseb J 1993, 7(1):113-123.

20. Maddison WP, Knowles LL: Inferring phylogeny despite incomplete lineage sorting. Syst Biol 2006, 55(1):21-30.

21. Pollard DA, Iyer VN, Moses AM, Eisen MB: Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet 2006, 2(10):e173.

22. Rannala B, Yang Z: Phylogenetic inference using whole genomes. Annu Rev Genomics Hum Genet 2008, 9:217-231.

23. Marcet-Houben M, Gabaldon T: The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS One 2009, 4(2):e4357.

24. Dessimoz C, Boeckmann B, Roth AC, Gonnet GH: Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res 2006, 34(11):3309-3316.

25. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al.: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4(41):41.

26. Kimura M: The neutral theory of molecular evolution. Cambridge: Cambridge University Press; 1983.

27. Lynch M, Katju V: The altered evolutionary trajectories of gene duplicates. Trends Genet 2004, 20(11):544-549.

28. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96(8):4285-4288.

29. Jukes TH, Cantor CR: Evolution of Protein Molecules. In Mammalian Protein Metabolism Edited by: Munro HN. New York: Academic Press; 1969.

30. Karplus K, Sjolander K, Barrett C, Cline M, Haussler D, Hughey R, Holm L, Sander C: Predicting protein structure using hidden Markov models. Proteins 1997:134-139.

31. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 2007, 24(8):1586-1591.

32. Altenhoff AM, Dessimoz C: Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol 2009, 5(1):e1000262.

33. Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Heriche JK, Hu Y, Kristiansen K, Li R, et al.: TreeFam: 2008 Update. Nucleic Acids Res 2008:D735-740.

34. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 2009, 19(2):327-335.

35. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci 1981:131-147.

36. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52(5):696-704.

37. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5:164-166.

38. Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 1992, 8(3):275-282.

39. Ahola V, Aittokallio T, Vihinen M, Uusipaikka E: Model-based prediction of sequence alignment quality. Bioinformatics 2008, 24(19):2165-2171.

40. Gaudet P, Chisholm R, Berardini T, Dimmer E, Engel S, Fey P, Hill D, Howe D, Hu J, Huntley R, et al.: The Gene Ontology's Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput Biol 2009, 5(7):e1000431.

41. Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD: PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res 2009:D204-210.

42. Huerta-Cepas J, Bueno A, Dopazo J, Gabaldon T: PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 2008:D491-496.

43. Tian Y, Dickerman AW: GeneTrees: a phylogenomics resource for prokaryotes. Nucleic Acids Res 2007:D328-331.

doi: 10.1186/1471-2105-11-312
Cite this article as: Thomas, GIGA: a simple, efficient algorithm for gene tree inference in the genomic age. BMC Bioinformatics 2010, 11:312

Received: 19 August 2009 Accepted: 9 June 2010 Published: 9 June 2010
This article is available from: http://www.biomedcentral.com/1471-2105/11/312