

Effect of Incomplete Lineage Sorting On Tree-Reconciliation-Based Inference of Gene Duplication

Yu Zheng and Louxin Zhang

Abstract—In the tree reconciliation approach to infer the duplication history of a gene family, the gene (family) tree is compared to the corresponding species tree. Incomplete lineage sorting (ILS) gives rise to stochastic variation in the topology of a gene tree and hence likely introduces false duplication events when a tree reconciliation method is used. We quantify the effect of ILS on gene duplication inference in a species tree in terms of the expected number of false duplication events inferred from reconciling a random gene tree, which occurs with a probability predicted in coalescent theory, and the species tree. We computationally examine the relationship between the effect of ILS on duplication inference in a species tree and its topological parameters. Our findings suggest that ILS may cause non-negligible bias on duplication inference, particularly on an asymmetric species tree. Hence, when gene duplication is inferred via tree reconciliation or any other approach that takes gene tree topology into account, the ILS-induced bias should be examined cautiously.

Index Terms—Gene tree and species tree reconciliation, incomplete lineage sorting, gene duplication


1 INTRODUCTION

A gene tree is the phylogenetic tree of a family of homologous genes. A species tree is the phylogenetic tree of a collection of species. In population genomics and phylogenetics, it is important to distinguish gene trees and species trees, as a gene tree reconstructed from the DNA sequences of the given gene family is sometimes discordant with the tree of the species that host the genes [1], [2]. Their discordance can be caused by gene duplication and loss, horizontal gene transfer, hybridization, or incomplete lineage sorting (ILS) [3], [4], [5]. Accordingly, the relationships of gene trees and species trees have been the focus of many studies over the past two decades [6], [7]. Gene trees have been used to estimate the species trees [8], [9], [10], [11], [12], to estimate species divergence time [13] and ancestral population size [14], [15], [16], and to infer the history of gene duplication [17], [18], [19], [20], [21], [22], [23].

One popular approach for gene duplication inference is gene tree and species tree reconciliation. It is formalized from the following fact: a node in the gene tree corresponds to a gene duplication event if there exist two gene family members from one species such that one is found below one child of the node and the other is found below the other child of the node [9], [22]. Clearly, this approach takes gene tree topology into account. If incorrect gene trees are used, duplication events are often mis-inferred [24].

In a species tree, each inter-nodal branch represents an ancestral population; each internal node represents a time point at which the ancestral population split into two subpopulations. Here it is assumed that there was no gene flow between the subpopulations after the split. When the population of each species is large, the DNA sequences sampled from two species are unlikely to have their common sequence ancestor living at the moment that the most recent common ancestor (MRCA) of the two species split; instead, the time back to the common sequence ancestor is uncertain and typically longer than the time back to the MRCA of the species. This evolutionary phenomenon is called ILS or deep coalescence. Clearly, ILS gives rise to considerable stochastic variation in gene tree topology [1], [25], implying that different unlinked loci might have different genealogical histories, and different samplings might also lead to different gene tree topologies for the same gene. Consider the two different gene tree topologies in Fig. 1. The top gene tree in the right panel is concordant with the species tree; reconciling this gene tree and the species tree does not infer any gene duplication event, whereas reconciliation of the bottom gene tree gives one (false) duplication event. Hence, ILS affects gene duplication inference. To the best of our knowledge, the effects of ILS on gene duplication inference have not been examined quantitatively, although they have been noticed for a long time (see [4] for example).

This paper quantitatively studies the effect of ILS on the inference of gene duplication. To simplify our study, we assume that no genetic exchange has occurred between unrelated species and that there is no sequence error when a gene tree is reconstructed. The rest of the paper is divided into three sections. Section 2 introduces basic concepts and notions. In Section 3, we first propose to measure the effect of ILS on gene duplication inference on a given species tree using the expected number of false duplication (or gene loss) events inferred from the reconciliation of a random gene tree and the species tree. As a warm-up, we then conduct analytic analyses of the effect for the tree over four species. We further examine the effect on a Drosophila species tree. We also computationally investigate relationships between the effect and the topological parameters of species trees, and investigate the degree of the ILS influence on gene duplication inference. Section 4 concludes with discussion.

The authors are with the Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge, Singapore 119076. Email: {yu_zheng, matzlx}@nus.edu.sg.

Manuscript received 16 July 2013; revised 29 Nov. 2013; accepted 27 Dec. 2013. Date of publication 8 Jan. 2014; date of current version 5 June 2014. Digital Object Identifier no. 10.1109/TCBB.2013.2297913

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 11, NO. 3, MAY/JUNE 2014

1545-5963 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Finally, we remark that the interplays of horizontal gene transfer and hybridization events on gene duplication inference have been studied by proposing general evolutionary models to coordinate these events or by computational simulation [26], [27], [28].

2 METHODS

2.1 Gene Trees and Species Trees

Consider a collection $X$ of extant species. The species tree over $X$ is a rooted tree in which each leaf represents a unique extant species and hence is labeled by it. Here, we further assume species trees are fully binary and branch-weighted. Therefore, in a species tree, an internal (i.e., non-leaf) node has exactly two children; each inter-nodal branch represents an ancestral species and has the evolutionary time of the ancestral species as its length.

For a gene family sampled from $X$, its gene tree is a rooted, binary tree that represents the evolutionary relationship of the gene family members, in which each leaf represents a gene. In the present paper, to simplify our study, we assume that only one family member is sampled from a species in $X$. In the tree reconciliation approach to gene duplication inference, a gene tree for a gene family needs to be mapped onto the corresponding species tree over $X$. Hence, a gene tree leaf is labeled by the species that hosts the corresponding gene instead of the gene itself. Thus, each gene tree leaf has a unique species label in the present paper.

Let $S$ be the species tree of $X$ and let $G$ be a gene tree for a gene family $F$ over $X$. Let $T$ be $S$ or $G$. We use $V(T)$ to denote the set of all nodes of $T$. For any pair of nodes $u, v \in V(T)$, we use $\mathrm{lca}_T(u, v)$ to denote their MRCA node in $T$. For any leaf $v$ in $T$, we use $l_T(v)$ to denote its species label. For $v \in V(T)$, $C_T(v) = \{ l_T(u) \mid u \text{ is a leaf below } v \}$ is called a cluster in $T$. $C(T)$ denotes the set of all clusters in $T$. In addition, $[C_T(v'), C_T(v'')]$ is called a partial partition of $X$ in $T$, where $v'$ and $v''$ are the two children of $v$. $P(T)$ denotes the set of all partial partitions of $X$ in $T$.

For any species $x$ in $X$, we use $l_S^{-1}(x)$ to denote the unique leaf that has label $x$ in $S$. The lca reconciliation $R$ between $S$ and $G$ is a node-to-node mapping from $V(G)$ to $V(S)$ defined as follows: for any gene tree node $g$, if $g$ is a leaf, then $R(g)$ is the unique leaf in $S$ that has the same label as $g$; otherwise, $R(g) = \mathrm{lca}_S(R(g_1), R(g_2))$, where $g_1$ and $g_2$ are the children of $g$. That is,

$$R(g) = \begin{cases} l_S^{-1}(l_G(g)), & \text{if } g \text{ is a leaf}, \\ \mathrm{lca}_S\big(R(g_1), R(g_2)\big), & \text{otherwise}. \end{cases} \tag{1}$$

2.2 Gene Duplication Inference

The duplication history of the gene family $F$ can be inferred using the lca reconciliation $R$ between $G$ and $S$ [9], [22]. For an internal node $g$ of $G$, if $R(c(g)) = R(g)$ for some child $c(g)$ of $g$, then a duplication event is inferred in the branch entering $R(g)$ in $S$. Since we may infer a duplication event that occurred in the common ancestor of the species in $S$, we draw an “open” branch entering the root of $S$ and indicate such a duplication event in this open branch, as shown in panel C of Fig. 2.
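
The duplication rule above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' program [29]: trees are written as nested 2-tuples with species names as leaves, species tree nodes are identified by their clusters, and the helper names (`leaf_set`, `species_clusters`, `lca_map`, `dup_cost`) are hypothetical. Because every leaf label is unique, the lca of a set of species is the smallest species tree cluster that contains it.

```python
def leaf_set(tree):
    """Set of species labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def species_clusters(species_tree):
    """One cluster per node of the species tree, leaves included."""
    out = [leaf_set(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            out.extend(species_clusters(child))
    return out


def lca_map(gene_node, s_clusters):
    """R(g): the smallest species cluster containing all labels below g."""
    labels = leaf_set(gene_node)
    return min((c for c in s_clusters if labels <= c), key=len)


def dup_cost(gene_tree, species_tree):
    """Number of gene tree nodes g with R(child) == R(g) for some child."""
    s_clusters = species_clusters(species_tree)

    def visit(g):
        if isinstance(g, str):
            return 0
        here = lca_map(g, s_clusters)
        is_dup = any(lca_map(child, s_clusters) == here for child in g)
        return int(is_dup) + sum(visit(child) for child in g)

    return visit(gene_tree)


S1 = (("A", "B"), ("C", "D"))
print(dup_cost(S1, S1))                         # 0
print(dup_cost(((("A", "C"), "B"), "D"), S1))   # 2
```

Identifying nodes by clusters keeps the sketch short; an implementation meant for large trees would typically use a dedicated lca data structure instead of scanning all clusters.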

The total number of the inferred gene duplication events is denoted by $c_{\mathrm{dup}}(G, S)$. All the inferred gene duplication events form a putative duplication history of $F$ that has the minimum number of gene duplication events, in which gene loss events might occur. The number of gene loss events assumed in the obtained gene duplication history for the gene family is computed as follows.

Let $s'$ and $s''$ be two nodes in $S$ such that $s'$ is below $s''$, i.e., the unique path from the root to $s'$ passes through $s''$. For a node $h$, we write $s' < h < s''$ to denote that $h$ is a node in the path from $s''$ to $s'$ such that $s' \neq h \neq s''$. We further define

$$n_{s',s''} = \big|\{ h \in S \mid s' < h < s'' \}\big|.$$

Fig. 1. A schematic view of two different coalescent histories in a species tree. The species tree (in light gray) of three species is given in the left panel. DNA sequences sampled from different individuals within a species may give two different collapsed gene trees (right panel) for a gene family. If the bottom gene tree is reconciled with the species tree, a gene duplication event is inferred in the branch entering the species tree root and a gene loss event is inferred in the lineage leading to C.

Fig. 2. An illustration of lca reconciliation and gene duplication inference. (A) A species tree $S$ over species $h, c, m, f$ and a gene tree $G$ for a gene family sampled from the species. (B) The reconciliation $R$ between $G$ and $S$. (C) The gene duplication history inferred from $R$ is drawn inside $S$.


Note that $n_{s',s''}$ is equal to the number of lineages diverging off the evolutionary path from $s''$ to $s'$. For an internal node $g$ of $G$ with children $g_1$ and $g_2$, define (with $<$ meaning "strictly below" in $S$)

$$\mathrm{loss}(g) = \begin{cases} 0, & \text{if } R(g_1) = R(g) = R(g_2), \\ n_{R(g_1),R(g)} + 1, & \text{if } R(g_1) < R(g) = R(g_2), \\ \sum_{i=1,2} n_{R(g_i),R(g)}, & \text{if } R(g_1) < R(g) \text{ and } R(g_2) < R(g). \end{cases}$$

The number of gene losses that have to be assumed in the inferred duplication history is equal to $\sum_{g \in G} \mathrm{loss}(g)$, denoted by $c_{\mathrm{loss}}(G, S)$ and called the gene loss cost of the lca reconciliation between $G$ and $S$. An example is given in Fig. 2 to illustrate how to compute both the gene loss and duplication costs.

In this work, we used our computer program to compute the gene duplication and loss costs for a gene tree and a species tree [29].
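
The loss cost defined above can be sketched in Python as follows. This is a minimal sketch, not the program of [29]: trees are assumed to be nested 2-tuples with species names as leaves, species tree nodes are identified by their clusters, and all function names are hypothetical. The count of nodes strictly between two species tree nodes is obtained by counting clusters that lie strictly between the two clusters.

```python
def leaf_set(tree):
    """Set of species labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def species_clusters(species_tree):
    """One cluster per node of the species tree, leaves included."""
    out = [leaf_set(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            out.extend(species_clusters(child))
    return out


def loss_cost(gene_tree, species_tree):
    """Sum of loss(g) over the internal nodes g of the gene tree."""
    s_clusters = species_clusters(species_tree)

    def lca_map(node):
        labels = leaf_set(node)
        return min((c for c in s_clusters if labels <= c), key=len)

    def n_between(low, high):
        # Species tree nodes strictly between low and high on the path.
        return sum(1 for c in s_clusters if low < c < high)

    def visit(g):
        if isinstance(g, str):
            return 0
        r = lca_map(g)
        r1, r2 = (lca_map(child) for child in g)
        if r1 == r and r2 == r:
            here = 0
        elif r1 == r or r2 == r:
            below = r2 if r1 == r else r1
            here = n_between(below, r) + 1
        else:
            here = n_between(r1, r) + n_between(r2, r)
        return here + sum(visit(child) for child in g)

    return visit(gene_tree)


S1 = (("A", "B"), ("C", "D"))
print(loss_cost(((("A", "C"), "B"), "D"), S1))  # 6
```

For the gene tree (((A,C),B),D) reconciled with ((A,B),(C,D)), the node (A,C) contributes 2 losses and each of the two nodes above it contributes 2 more.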

2.3 Computing the Gene Tree Distribution in a Species Tree

In the present paper, we assume a simple coalescent model in which each species has a constant diploid effective population size $N$ during its entire existence and an evolutionary time of $t$ generations equals $T = t/(2N)$ coalescent time units [1]. In such a model, the probability that $m$ lineages coalesce into $n$ ($\le m$) lineages within $t$ generations is given by

$$p_{m \to n}(t) = \sum_{k=n}^{m} e^{-k(k-1)t/(4N)} \, \frac{(2k-1)(-1)^{k-n}}{n!\,(k-n)!\,(n+k-1)} \prod_{j=0}^{k-1} \frac{(n+j)(m-j)}{m+j}$$

(see for example [25]). Given a coalescent history $H$ of a gene family in a species tree $S$ (see Fig. 1 for an illustration of a coalescent history and its collapsed gene tree), we can calculate its probability $\Pr[H \mid S]$ by multiplying the $p_{m \to n}(t)$ over all species tree branches [30], [31].
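
As a numerical sanity check on the formula above, the following Python sketch evaluates the coalescent probability directly. It is a minimal sketch rather than the authors' C program: the function name `coalescent_prob` is hypothetical, and time is passed in coalescent units $T = t/(2N)$, so the exponential factor is written as $e^{-k(k-1)T/2}$, the usual form in which two lineages fail to coalesce with probability $e^{-T}$.

```python
from math import exp, factorial


def coalescent_prob(m, n, T):
    """Probability that m lineages have exactly n ancestors after T coalescent units."""
    total = 0.0
    for k in range(n, m + 1):
        # Product over j = 0 .. k-1 of (n + j)(m - j) / (m + j).
        prod = 1.0
        for j in range(k):
            prod *= (n + j) * (m - j) / (m + j)
        coef = (2 * k - 1) * (-1) ** (k - n) / (
            factorial(n) * factorial(k - n) * (n + k - 1)
        )
        total += exp(-k * (k - 1) * T / 2) * coef * prod
    return total


T = 0.7
# Two lineages stay distinct with probability exp(-T).
print(abs(coalescent_prob(2, 2, T) - exp(-T)) < 1e-12)
# The probabilities over all possible numbers of ancestors sum to one.
print(abs(sum(coalescent_prob(5, n, T) for n in range(1, 6)) - 1.0) < 1e-9)
```

The direct alternating sum is adequate for the small lineage counts used here; for many lineages and very short branches it can lose precision, so a production implementation may need a more careful evaluation.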

For a gene tree $G$, we set $\mathcal{H}(G)$ to be the set of all possible coalescent histories that collapse into $G$. By definition,

$$\Pr[G \mid S] = \sum_{H \in \mathcal{H}(G)} \Pr[H \mid S]. \tag{2}$$

For our analysis, we need to compute the gene tree distribution in a species tree. In the case of 12 species, we must consider over 13.7 billion gene trees. Since the space of gene trees is huge, we employed a home-made computer program implemented in C. It speeds up the tree-probability computation via the dynamic programming technique, which was also used by Wu in STELLS [31]. Presently, it allows us to examine the effect of ILS on gene duplication inference for species trees of up to 12 species.

3 RESULTS AND DISCUSSION

In our study, we assume only one gene is sampled from each species. This assumption greatly simplifies our study because any inferred gene duplication event is a false one and, most importantly, the gene tree distribution is easy to compute using coalescent theory [25], [30].

3.1 Measuring the Effect of ILS on Gene Duplication Inference

Consider a single-gene family $F$ sampled from a set $X$ of species. Let $S$ be the species tree of $X$. If no gene duplication has occurred to the gene family during the evolution of the species, the tree of the gene family has the same topology as $S$. If ILS events have occurred, however, the gene tree reconstructed from the gene sequences might be different from $S$. To quantify the effect of ILS on the inference of gene duplication on $S$, we use the expected number $D(S)$ (or $L(S)$) of false gene duplication (or loss) events output from the lca reconciliation of a random gene tree $G$ and $S$. Recall that $c_{\mathrm{dup}}(G, S)$ (or $c_{\mathrm{loss}}(G, S)$) denotes the gene duplication (or loss) cost of the lca reconciliation between $G$ and $S$. Since $c_{\mathrm{dup}}(G, S) = c_{\mathrm{loss}}(G, S) = 0$ if $G = S$, $D(S)$ and $L(S)$ are simply

$$D(S) = \sum_{G \in \mathcal{G}} c_{\mathrm{dup}}(G, S) \Pr[G \mid S], \tag{3}$$

$$L(S) = \sum_{G \in \mathcal{G}} c_{\mathrm{loss}}(G, S) \Pr[G \mid S], \tag{4}$$

where $\mathcal{G}$ is the set of all possible gene tree topologies, and $\Pr[G \mid S]$ is the probability that $G$ is the collapsed gene tree of a coalescent history of the sampled genes in $S$.
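
To make the weighted sum in Eq. (3) concrete, here is a small Python sketch for the three-species tree ((A,B):T, C) of Fig. 1. It is only an illustration under stated assumptions: the function names are hypothetical, and the gene tree probabilities are the standard three-taxon coalescent result (each discordant topology has probability $e^{-T}/3$), which is not derived in this paper. Each discordant gene tree yields one false duplication, as described for Fig. 1.

```python
from math import exp


def expected_cost(distribution):
    """Eq. (3)/(4): sum of cost * probability over gene tree topologies."""
    return sum(cost * prob for cost, prob in distribution)


def three_species_distribution(T):
    """(duplication cost, probability) pairs for the species tree ((A,B):T, C)."""
    discordant = exp(-T) / 3.0
    return [
        (0, 1.0 - 2.0 * discordant),  # ((A,B),C): concordant, no duplication
        (1, discordant),              # ((A,C),B): one false duplication at the root
        (1, discordant),              # ((B,C),A): one false duplication at the root
    ]


for T in (0.1, 1.0, 5.0):
    print(T, expected_cost(three_species_distribution(T)))
```

Under these assumptions the expected number of false duplications is $\tfrac{2}{3}e^{-T}$, which decays quickly as the internal branch grows in coalescent units.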

3.2 The Case of Four Species

By definition, $D(S)$ and $L(S)$ depend only on the unlabeled species tree topology and are independent of leaf labeling. Therefore, we only need to consider unlabeled species tree topologies. In the case of four species, there are only two different unlabeled topologies,

$$S_1 = ((A,B){:}t_1,\ (C,D){:}t_2), \qquad S_2 = (((A,B){:}\bar{t}_1,\ C){:}\bar{t}_2,\ D),$$

in Newick phylogeny format (Fig. 3). For the sake of brevity, we use $p(x, y)$ to denote the parental node of two siblings $x$ and $y$ in each of these two species trees. In $S_1$, the evolutionary time of $p(A, B)$ is $t_1$ generations, whereas that of $p(C, D)$ is $t_2$ generations. Let $G$ be a random gene tree. Consider the lca reconciliation between $G$ and $S_1$. Any gene tree node is mapped to $p(A, B)$ if and only if its two children are mapped to $A$ and $B$, respectively. This fact also holds for $p(C, D)$. Therefore, false duplication events can only be inferred on the branch entering the root in $S_1$.

Fig. 3. Two topologies $S_1$ (left) and $S_2$ (right) of the species trees of four species.

There are 15 possible gene trees in the case of four species, which are listed together with their reconciliation costs when reconciled with $S_1$ and $S_2$ (Fig. 3) below.

Given a species tree, we can compute the probability that a gene tree occurs in the species tree using a closed formula (Eq. (12), [30]). Set $T_i = t_i/(2N)$ for $i = 1, 2$. The probabilities that all 15 gene trees occur in the species tree $S_1$ are listed on the top in the right column.

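By summing the products of the probabilities that the gene trees occur in $S_1$ and the corresponding gene duplication (or gene loss) costs, we obtain

$$D(S_1) = \frac{2}{3}\left(e^{-T_1} + e^{-T_2}\right), \tag{5}$$

$$L(S_1) = 2\left(e^{-T_1} + e^{-T_2}\right) + \frac{2}{9}\, e^{-(T_1+T_2)}, \tag{6}$$

from Eqs. (3) and (4).

Now, we switch to $S_2$. Setting $\bar{T}_i = \bar{t}_i/(2N)$ for $i = 1, 2$, we obtain

$$D(S_2) = \frac{2}{3}\left(e^{-\bar{T}_1} + e^{-\bar{T}_2}\right) - \frac{1}{3}\, e^{-(\bar{T}_1+\bar{T}_2)} + \frac{5}{18}\, e^{-(\bar{T}_1+3\bar{T}_2)}, \tag{7}$$

and

$$L(S_2) = 2\left(e^{-\bar{T}_1} + e^{-\bar{T}_2}\right) - \frac{1}{3}\, e^{-(\bar{T}_1+\bar{T}_2)} + \frac{5}{6}\, e^{-(\bar{T}_1+3\bar{T}_2)}, \tag{8}$$

from Eqs. (3) and (4), using the probabilities that the 15 gene trees occur in the species tree $S_2$, which are listed below.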

Since $e^{-x} < 1$ for any $x > 0$, we have

$$D(S_1) < \frac{4}{3} \quad \text{and} \quad L(S_1) < \frac{38}{9};$$

$$D(S_2) < \frac{23}{18} \quad \text{and} \quad L(S_2) < \frac{9}{2}.$$


If all branches have equal length (i.e., $T_1 = T_2 = \bar{T}_1 = \bar{T}_2 = c$), then $D(S_1) \ge D(S_2)$ for any $c$. However, $L(S_1) \ge L(S_2)$ only if $c \ge \frac{1}{2}\ln(3/2)$.

Assume $S_1$ and $S_2$ are ultrametric trees with the same height, i.e., every path from the root to a leaf is equally long. We further assume that $T_1 = T_2 = 2d$ and $\bar{T}_1 = \bar{T}_2 = d$ for some $d$, implying that $\bar{t}_1 + \bar{t}_2 = t_1 = t_2$ and that $A$ and $B$ diverged at the same time in both trees. Then,

$$D(S_1) = \frac{4}{3}\, e^{-2d},$$

$$L(S_1) = 4 e^{-2d} + \frac{2}{9}\, e^{-4d},$$

$$D(S_2) = \frac{4}{3}\, e^{-d} - \frac{1}{3}\, e^{-2d} + \frac{5}{18}\, e^{-4d},$$

$$L(S_2) = 4 e^{-d} - \frac{1}{3}\, e^{-2d} + \frac{5}{6}\, e^{-4d}.$$

Using numerical computation, we obtained $D(S_1) < D(S_2)$ only if $d > 0.0649$, but $L(S_1) < L(S_2)$ for any $d$. This analysis suggests that the effect of ILS is closely related to species tree structure.
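
The comparison above can be reproduced numerically from the closed forms in Eqs. (5) to (8). The Python sketch below is an illustration only: the function names are hypothetical, and a plain bisection is used as one possible way to locate the crossing point of $D(S_1)$ and $D(S_2)$ in the ultrametric setting.

```python
from math import exp


def d_s1(T1, T2):
    return 2 / 3 * (exp(-T1) + exp(-T2))


def l_s1(T1, T2):
    return 2 * (exp(-T1) + exp(-T2)) + 2 / 9 * exp(-(T1 + T2))


def d_s2(T1, T2):
    return (2 / 3 * (exp(-T1) + exp(-T2))
            - exp(-(T1 + T2)) / 3 + 5 / 18 * exp(-(T1 + 3 * T2)))


def l_s2(T1, T2):
    return (2 * (exp(-T1) + exp(-T2))
            - exp(-(T1 + T2)) / 3 + 5 / 6 * exp(-(T1 + 3 * T2)))


def gap(d):
    """D(S1) - D(S2) for ultrametric trees: branches 2d in S1 and d in S2."""
    return d_s1(2 * d, 2 * d) - d_s2(d, d)


def find_root(f, lo, hi, tol=1e-10):
    """Bisection, assuming f(lo) and f(hi) have opposite signs."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


threshold = find_root(gap, 0.001, 1.0)
print(round(threshold, 4))  # about 0.0649
```

The same functions can be used to check the equal-branch-length case, where $L(S_1)$ exceeds $L(S_2)$ only for $c$ above $\frac{1}{2}\ln(3/2) \approx 0.203$.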

3.3 Effect Analysis on a Drosophila Species Tree

Genome-wide analysis provides strong evidence for the prevalence of ILS events in Drosophila evolution [33]. Here, we examined the expected number of false duplication events caused by ILS in the phylogeny of 12 Drosophila species [32], in which evolutionary time is dated for all branches. Since the effective population size $N$ for the Drosophila species is unknown, we considered four different effective population sizes ($2 \times 10^6$, $6 \times 10^6$, $10 \times 10^6$, and $14 \times 10^6$) and set the generation time to be 1/10 of a year [33]. The expected numbers of false duplication events caused by ILS for different effective population sizes are plotted (Fig. 4). Here, we would like to point out that our conclusion does not depend on the specific effective population sizes that were used.

Since only one gene is sampled from the population of each species, no gene duplication is inferred on branches connecting to the leaves. In other words, false gene duplication events can only be inferred on the seven branches that are denoted by their end nodes $S_i$ ($0 \le i \le 6$) (Fig. 4A). The expected total number of false gene duplication events in the tree ranges from 0.0534 to 1.7663 over the chosen effective population sizes. Let $D_i$ be the expected number of false gene duplication events on $S_i$ for each $i$. Although the exact values of these $D_i$ are different for the different effective population sizes, their relative ranks remain almost the same, correlating well with the branches’ evolutionary time. For instance, $D_0$ has the largest value for each effective population size, as the branch entering $S_0$ is assumed to be long enough that all the lineages coexisting at the moment that the MRCA of all the extant species split will coalesce on it. For the longest branch entering $S_4$, $D_4$ is the second largest for effective population sizes of 10 and 14 million, and the third largest for the other sizes.

Another finding is that on the branches close to the root, the expected number of false gene duplication events is relatively large. For example, for the shortest branch $S_1$, $D_1$ is not the smallest; instead, it is larger than $D_3$. Similarly, branch $S_6$ is longer than $S_2$, but $D_6$ is smaller than $D_2$ for each effective population size.

We now switch to 6,698 gene trees in the Drosophila species tree [24]. We inferred gene duplication events for the corresponding gene families by reconciling the gene trees and the species tree (Fig. 4A). In total, we inferred 10,264 gene duplication events, which are distributed over the seven branches as follows:

1.8% ($S_3$), 6.5% ($S_0$), 7.4% ($S_2$), 8.0% ($S_5$), 15.1% ($S_1$), 20.5% ($S_6$), 40.6% ($S_4$).

Such a distribution is not quite consistent with the computational analysis presented above. The proportion of inferred duplication events on the branches entering $S_0$ and $S_2$ is significantly lower than what the analysis suggests, whereas those on branches entering $S_1$ and $S_6$ are much higher. Possible reasons are that sequence sampling and alignment errors influenced gene tree reconstruction, leading to incorrect topologies for some gene trees, or that the effective population size varies among ancestral species. At this stage, we are unable to assess the effect of these factors, as the estimation of ancestral population sizes remains a challenging problem.

Fig. 4. Effect analysis for a Drosophila species tree. (A) A tree of 12 Drosophila species given in [32]. All the branches are drawn in proportion to evolutionary time. False duplication events caused by ILS can only be inferred on seven branches that enter $S_0$ to $S_6$, respectively, for single-gene families. (B) The expected number $D_i$ of false gene duplication events on the branch entering $S_i$ is plotted against the effective population size $N$, with the generation time being set to 1/10 of a year. Four different effective population sizes ($2 \times 10^6$, $6 \times 10^6$, $10 \times 10^6$, and $14 \times 10^6$) were examined. It shows that the number of false gene duplication events on a branch correlates largely with its evolutionary time.

3.4 Impact of Species Tree Topology

To interrogate the impact of species tree topology on the effect of ILS for gene duplication inference, we considered 10 ultrametric tree topologies over 10 species (Fig. 5), in which all the paths from the root to a leaf have equal evolutionary time. In each of the five asymmetric species trees, the left and right subtrees are just linear trees. In each of the five symmetric trees, the subtrees are balanced binary trees in which, for each internal node, the size difference of the two subtrees rooted at its children is at most 1.

We define the height of an ultrametric species tree to be the coalescent time of a path from the root to a leaf, measured in coalescent time units. $D(S)$ and $L(S)$ for these 10 topologies with heights of 2 and 10 units are respectively presented in the two bottom plots in Fig. 5. Although each path from the root to a leaf has the same evolutionary time, the number of branches contained in such a path varies. For each leaf, we define its depth to be the number of branches in the unique path from it to the root. Note that different leaves may have different depths. The Sackin index of a species tree is defined as the average depth of a leaf in the tree [34]. The Sackin indexes of the 10 tree topologies listed in Fig. 5 are [values not preserved in this transcript].

Hence, our experiments suggest that:

• $D(S)$ and $L(S)$ correlate well with the Sackin index of a species tree $S$;
• $D(S)$ changes in a small range for different tree topologies;
• Asymmetric trees frequently have a larger $D(S)$ and $L(S)$ than symmetric ones of the same height.

We remark that since $D(S)$ (or $L(S)$) is a function of the branch lengths of $S$, a symmetric species tree may have a larger $D(S)$ (or $L(S)$) than an asymmetric species tree of the same height, as indicated in Section 3.2.
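
The Sackin index used in this section, defined as the average leaf depth, is straightforward to compute. The following Python sketch assumes trees written as nested 2-tuples with species names as leaves; the function names are hypothetical and the two four-species trees are only toy inputs, not the ten-species topologies of Fig. 5.

```python
def leaf_depths(tree, depth=0):
    """Depth (number of branches to the root) of every leaf."""
    if isinstance(tree, str):
        return [depth]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


def sackin_index(tree):
    """Average leaf depth, following the definition used in the text."""
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)


asymmetric = ((("A", "B"), "C"), "D")
symmetric = (("A", "B"), ("C", "D"))
print(sackin_index(asymmetric))  # 2.25
print(sackin_index(symmetric))   # 2.0
```

As expected, the asymmetric (line) tree has the larger index, which matches the direction of the correlation reported above.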

Fig. 5. $D(S)$ and $L(S)$ for asymmetric topologies (in the first row) and symmetric topologies (in the second row). Branches in each of the 10 ultrametric topologies are drawn in proportion to their length. In the bottom row, the left and right plots are drawn to different scales for the topologies of heights 2 and 10, respectively. In each plot, the white and black bars represent $L(S)$ and $D(S)$, respectively.

3.5 A Bound of D(S) and L(S)

It is natural to ask to what extent ILS influences gene duplication inference. To answer this question, we compute the limit of $D(S)$ and $L(S)$ for an arbitrary ultrametric species tree $S$ by allowing the branches of $S$ to be extremely short. Fix the effective population size for each branch of $S$. When all branches of $S$ become very short, two lineages are unlikely to coalesce in any branch below the tree root; in other words, there is a high probability that any pair of lineages will coalesce in the branch entering the root. Therefore, in the limit case, for each gene tree $G$,

$$\Pr[G \mid S] \approx \sum_{H \in \mathcal{H}_0(G)} \Pr[H \mid S],$$

where H0ðGÞ is the set of the coalescent histories of n line-ages whose collapsed gene tree is G in the root branch. LetS have n leaves. In this limit case, the probability of eachgene tree is equal to its probability under the Yule model[35]. Hence, we are able to compute DðSÞ using (seeAppendix, which can be found on the Computer SocietyDigital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2013.2297913):

DðSÞ ¼ 1

2

X½U;V �2PðSÞ

X2�iþj�jU jþjV j

Prði; jÞF ði; j; jU j; jV jÞ; (9)

where PðSÞ denotes the set of all partial partitions each cor-responding to a node in S, Prði; jÞ the probability that a par-tition partition ½U 0; V 0� occurs in a random tree whenjU 0j ¼ i 1 and jV 0j ¼ j 1, and F ði; j; jUj; jV jÞ the numberof possible gene tree nodes, each corresponding to a partialpartition ½U 0; V 0� such that jU 0j ¼ i and jV 0j ¼ j, that aremapped to the species tree node ½U; V � as a duplicationnode. Here,

$$\Pr(i,j) = \begin{cases} \dfrac{2\binom{i+j}{i}^{-1}}{i+j-1} \times \dfrac{2n\binom{n}{i+j}^{-1}}{(i+j)(i+j+1)} & \text{if } i+j \le n-1, \\[2ex] \dfrac{2}{(i+j-1)\binom{i+j}{i}} & \text{if } i+j = n, \end{cases}$$

and

$$F(i,j,a,b) = \left[\binom{a+b}{i+j} - \binom{a}{i+j} - \binom{b}{i+j}\right]\binom{i+j}{i} - \left[\binom{a}{i}\binom{b}{j} + \binom{a}{j}\binom{b}{i}\right].$$
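As a rough illustration of how (9) could be evaluated, the following Python sketch computes D(S) in the limit case with exact rational arithmetic. The nested-tuple encoding of S and the function names (`split_sizes`, `pr_split`, `f_count`, `limit_dup_cost`) are assumptions made for this example and are not taken from the original study. The inner sum of (9) is read here as ranging over ordered pairs (i, j) with i, j ≥ 1.

```python
from fractions import Fraction
from math import comb


def split_sizes(tree):
    """Return (number of leaves, [(|U|, |V|) for each internal node])."""
    if not isinstance(tree, tuple):      # anything that is not a pair is a leaf
        return 1, []
    a, left_splits = split_sizes(tree[0])
    b, right_splits = split_sizes(tree[1])
    return a + b, left_splits + right_splits + [(a, b)]


def pr_split(i, j, n):
    """Pr(i, j): probability that a fixed partial partition with part
    sizes i and j occurs in a random n-leaf tree (two cases of the formula)."""
    k = i + j
    top = Fraction(2, (k - 1) * comb(k, i))
    if k == n:                           # the partition sits at the root
        return top
    return top * Fraction(2 * n, k * (k + 1) * comb(n, k))


def f_count(i, j, a, b):
    """F(i, j, a, b): number of partial partitions of sizes (i, j) that map
    to a species tree node with child sizes (a, b) as a duplication."""
    k = i + j
    # math.comb returns 0 when the lower index exceeds the upper one.
    spanning = comb(a + b, k) - comb(a, k) - comb(b, k)
    consistent = comb(a, i) * comb(b, j) + comb(a, j) * comb(b, i)
    return spanning * comb(k, i) - consistent


def limit_dup_cost(tree):
    """D(S) from (9), summing over ordered pairs (i, j) with i, j >= 1."""
    n, splits = split_sizes(tree)
    total = Fraction(0)
    for a, b in splits:
        for k in range(2, a + b + 1):
            for i in range(1, k):
                total += pr_split(i, k - i, n) * f_count(i, k - i, a, b)
    return total / 2


if __name__ == "__main__":
    print(limit_dup_cost((("a", "b"), "c")))           # three-leaf tree
    print(limit_dup_cost((("a", "b"), ("c", "d"))))    # balanced four-leaf tree
```

For the three-leaf species tree ((a, b), c), the sketch returns 2/3, which agrees with direct enumeration: two of the three equally likely gene tree topologies each require one duplication when reconciled with S.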

Furthermore, by a formula in [36], L(S) can be computed using (see Appendix, available online):

$$L(S) = 2 \cdot D(S) + \mathit{DC}(S), \tag{10}$$

where DC(S) is equal to the expected value of the deep coalescence cost when a random gene tree is reconciled with S and is computed as [37]:

$$\mathit{DC}(S) = 2\sum_{v} |C_S(v)| - 2(n-1) - \sum_{v} \sum_{i=1}^{|C_S(v)|} \frac{2n}{i(i+1)} \binom{n}{i}^{-1} \binom{|C_S(v)|}{i},$$

where the sums are over all non-root nodes in S.

Using the above formulas, we computed D(S) and L(S) for the most balanced species tree, the line species tree, and 10,000 random species trees for each size (that is, the number of species) from 4 to 30 (Fig. 6). Here, the most balanced tree is a tree in which the sizes of the left and right subtrees are equal or differ by 1 for every internal node, whereas the line tree is the unique tree in which each internal node has at least one leaf child. As shown in Fig. 6, D(S) varies in a narrow range for each tree size and increases linearly with the species tree size. In contrast, L(S) changes in a wide range for each species tree size. Line trees S have large L(S), whereas balanced trees S have small L(S).
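The two extreme shapes and the deep coalescence term can be sketched in a few lines of Python. This is a minimal sketch, assuming nested tuples for species trees and integer leaf labels; the helper names (`line_tree`, `balanced_tree`, `cluster_sizes`, `mean_deep_coalescence`, `mean_loss_cost`) are hypothetical and not from the original study. It evaluates the expression for DC(S) given above and combines it with a supplied D(S) through (10).

```python
from fractions import Fraction
from math import comb


def line_tree(n):
    """Line tree on leaves 0..n-1: every internal node has a leaf child."""
    tree = 0
    for leaf in range(1, n):
        tree = (tree, leaf)
    return tree


def balanced_tree(leaves):
    """Most balanced tree: subtree sizes differ by at most 1 at every node."""
    if len(leaves) == 1:
        return leaves[0]
    mid = (len(leaves) + 1) // 2
    return (balanced_tree(leaves[:mid]), balanced_tree(leaves[mid:]))


def cluster_sizes(tree):
    """Cluster sizes |C_S(v)| of all nodes, with the root listed last."""
    if not isinstance(tree, tuple):
        return [1]
    left = cluster_sizes(tree[0])
    right = cluster_sizes(tree[1])
    return left + right + [left[-1] + right[-1]]


def mean_deep_coalescence(tree):
    """DC(S) in the limit case; both sums run over the non-root nodes of S."""
    sizes = cluster_sizes(tree)
    n = sizes[-1]
    non_root = sizes[:-1]
    first = 2 * sum(non_root) - 2 * (n - 1)
    second = sum(
        Fraction(2 * n * comb(c, i), i * (i + 1) * comb(n, i))
        for c in non_root
        for i in range(1, c + 1)
    )
    return first - second


def mean_loss_cost(d, dc):
    """L(S) from (10), given D(S) and DC(S)."""
    return 2 * d + dc


if __name__ == "__main__":
    print(mean_deep_coalescence(balanced_tree(list(range(4)))))
    print(mean_deep_coalescence(line_tree(4)))
```

For four species, the sketch gives DC(S) = 14/9 for the most balanced tree and 35/18 for the line tree, in line with the observation that line trees have the larger cost.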

4 CONCLUSION

In [30], Degnan and Salter studied the probability distribution of all the gene trees in a species tree over five species. Since our study focuses on the mean duplication and gene loss costs of a gene tree defined in (3) and (4), our results are not direct consequences of those reported in [30].

ILS introduces stochastic variation into the topology of the gene tree of a gene family. For the first time, we have quantified the effect of ILS on inference of gene duplication by examining the expected number of false gene duplication events inferred from reconciling a random gene tree and the corresponding species tree. In this preliminary study, we have also analyzed the connection between topological parameters and the effect of ILS on gene duplication inference for a species tree S.

One of our findings is that inference of gene duplication based on tree reconciliation was affected by ILS to a greater extent on an asymmetric species tree than on a symmetric one. Considering gene duplication events arising from ILS on different species tree branches separately, we also found that the longer an internodal branch is, the more likely gene duplication events are to be mis-inferred on it. Additionally, gene duplication events are more likely to be mis-inferred on a branch close to the species tree root.

Fig. 6. D(S) and L(S) in the limit case when all the tree branches are extremely short so that all the coalescence events occur beyond the root of S. For each species tree size from 4 to 30, we computed D(S) and L(S) for 10,000 random species trees generated in the Yule model, the most balanced species tree, and the line tree. D(S) varies with the topology of S in a narrow range, whereas L(S) varies in a wide range over all the species trees of the same size.

In analyzing the upper bounds of D(S) and L(S) for a species tree S when its branches are extremely short, we found that D(S) increases linearly with the species tree size. This fact indicates that D(S) increases with |S|, the size of S, in general, and hence it is not bounded above. It also raises a theoretical problem: Is D(S) ≤ 0.6|S|? Since L(S) ≥ 3D(S) for a species tree [36], L(S) is not bounded above by a constant if D(S) is not. Our findings imply that the bias caused by ILS in reconciliation-based gene duplication inference is not negligible. Therefore, when gene duplication is inferred via tree reconciliation or any other method that takes gene tree topology into account, the ILS-induced bias should be examined cautiously. Alternatively, one may use a unified reconciliation approach that considers gene duplication, loss and ILS simultaneously [38], [39].
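The step from D(S) to L(S) in the preceding paragraph can be spelled out as a short worked derivation. It assumes only (10) and the inequality L(S) ≥ 3D(S) cited from [36]; the constant c is a hypothetical bound introduced for the argument.

```latex
% Assumes (10): L(S) = 2 D(S) + DC(S), and L(S) >= 3 D(S) from [36].
\begin{align*}
L(S) \ge 3D(S)
  &\iff 2D(S) + \mathit{DC}(S) \ge 3D(S)
   \iff \mathit{DC}(S) \ge D(S). \\
\intertext{Suppose that $L(S) \le c$ for every species tree $S$. Then}
D(S) &\le \tfrac{1}{3} L(S) \le \tfrac{c}{3},
\end{align*}
% so a constant bound on L(S) would force a constant bound on D(S).
% Contrapositive: if D(S) is unbounded, then L(S) is unbounded as well.
```

As a sanity check worked by hand from (9) and the expression for DC(S), the three-leaf species tree ((a, b), c) gives D(S) = DC(S) = 2/3 in the limit case, so L(S) = 2 = 3D(S) and the inequality holds with equality.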

Finally, we remark that ILS also affects the gene trees for genes which are from different genera. How much ILS is expected to affect gene duplication inference for gene families across different genera is definitely a research topic for future study.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments on the first version of this work. The work was supported by Singapore AcRF Tier-2 Grant R-146-000-134-112.

REFERENCES

[1] N. Takahata, “Gene Genealogy in Three Related Populations: Consistency Probability between Gene and Population Trees,” Genetics, vol. 122, no. 4, pp. 957-966, 1989.

[2] K.M. Wong, M.A. Suchard, and J.P. Huelsenbeck, “Alignment Uncertainty and Genomic Analysis,” Science, vol. 319, no. 5862, pp. 473-476, 2008.

[3] P.J. Keeling and J.D. Palmer, “Horizontal Gene Transfer in Eukaryotic Evolution,” Nature Rev. Genetics, vol. 9, no. 8, pp. 605-618, 2008.

[4] W.P. Maddison, “Gene Trees in Species Trees,” Systematic Biology, vol. 46, no. 3, pp. 523-536, 1997.

[5] T. Sang and Y. Zhong, “Testing Hybridization Hypotheses Based on Incongruent Gene Trees,” Systematic Biology, vol. 49, no. 3, pp. 422-434, 2000.

[6] S.V. Edwards, “Is a New and General Theory of Molecular Systematics Emerging?” Evolution, vol. 63, no. 1, pp. 1-19, 2008.

[7] L.L. Knowles and L.S. Kubatko, Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, 2010.

[8] J.J. Doyle, “Gene Trees and Species Trees: Molecular Systematics as One-Character Taxonomy,” Systematic Botany, vol. 17, pp. 144-163, 1992.

[9] M. Goodman, J. Czelusniak, G.W. Moore, A.E. Romero-Herrera, and G. Matsuda, “Fitting the Gene Lineage into Its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences,” Systematic Biology, vol. 28, no. 2, pp. 132-163, 1979.

[10] L. Liu, L. Yu, L. Kubatko, D.K. Pearl, and S.V. Edwards, “Coalescent Methods for Estimating Phylogenetic Trees,” Molecular Phylogenetics and Evolution, vol. 53, no. 1, pp. 320-328, 2009.

[11] B. Ma, M. Li, and L.X. Zhang, “From Gene Trees to Species Trees,” SIAM J. Computing, vol. 30, no. 3, pp. 729-752, 2000.

[12] W.P. Maddison and L.L. Knowles, “Inferring Phylogeny Despite Incomplete Lineage Sorting,” Systematic Biology, vol. 55, no. 1, pp. 21-30, 2006.

[13] R.L. Cann, M. Stoneking, and A.C. Wilson, “Mitochondrial DNA and Human Evolution,” Nature, vol. 325, no. 6099, pp. 31-36, 1987.

[14] S.V. Edwards and P. Beerli, “Perspective: Gene Divergence, Population Divergence, and the Variance in Coalescence Time in Phylogeographic Studies,” Evolution, vol. 54, no. 6, pp. 1839-1854, 2000.

[15] J. Hey and R. Nielsen, “Multilocus Methods for Estimating Population Sizes, Migration Rates and Divergence Time, with Applications to the Divergence of Drosophila Pseudoobscura and D. Persimilis,” Genetics, vol. 167, no. 2, pp. 747-760, 2004.

[16] C.I. Wu, “Inferences of Species Phylogeny in Relation to Segregation of Ancient Polymorphisms,” Genetics, vol. 127, no. 2, pp. 429-435, 1991.

[17] Ö. Åkerborg, B. Sennblad, L. Arvestad, and J. Lagergren, “Simultaneous Bayesian Gene Tree Reconstruction and Reconciliation Analysis,” Proc. Nat’l Academy of Science USA, vol. 106, no. 14, pp. 5714-5719, 2009.

[18] A.C. Berglund-Sonnhammer, P. Steffansson, M.J. Betts, and D.A. Liberles, “Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony,” J. Molecular Evolution, vol. 63, no. 2, pp. 240-250, 2006.

[19] C. Chauve and N. El Mabrouk, “New Perspectives on Gene Family Evolution: Losses in Reconciliation and a Link with Supertrees,” Proc. 13th Ann. Int’l Conf. Research in Computational Molecular Biology (RECOMB ’09), pp. 46-58, 2009.

[20] K. Chen, D. Durand, and M. Farach-Colton, “Notung: A Program for Dating Gene Duplications and Optimizing Gene Family Trees,” J. Computational Biology, vol. 7, nos. 3-4, pp. 429-447, 2000.

[21] P. Górecki and J. Tiuryn, “DLS-Trees: A Model of Evolutionary Scenarios,” Theoretical Computer Science, vol. 359, no. 1, pp. 378-399, 2006.

[22] R.D.M. Page, “Maps between Trees and Cladistic Analysis of Historical Associations among Genes, Organisms, and Areas,” Systematic Biology, vol. 43, no. 1, pp. 58-77, 1994.

[23] A. Wehe, M.S. Bansal, J.G. Burleigh, and O. Eulenstein, “Duptree: A Program for Large-Scale Phylogenetic Analyses Using Gene Tree Parsimony,” Bioinformatics, vol. 24, no. 13, pp. 1540-1541, 2008.

[24] M.W. Hahn, “Bias in Phylogenetic Tree Reconciliation Methods: Implications for Vertebrate Genome Evolution,” Genome Biology, vol. 8, no. 7, 2007.

[25] N.A. Rosenberg, “The Probability of Topological Concordance of Gene Trees and Species Trees,” Theoretical Population Biology, vol. 61, no. 2, pp. 225-247, 2002.

[26] M.S. Bansal, E.J. Alm, and M. Kellis, “Efficient Algorithms for the Reconciliation Problem with Gene Duplication, Horizontal Transfer and Loss,” Bioinformatics, vol. 28, no. 12, pp. i283-i291, 2012.

[27] J.P. Doyon, C. Scornavacca, K. Gorbunov, G. Szöllősi, V. Ranwez, and V. Berry, “An Efficient Algorithm for Gene/Species Trees Parsimonious Reconciliation with Losses, Duplications and Transfers,” Proc. Int’l Conf. Comparative Genomics (RECOMB-CG ’11), pp. 93-108, 2010.

[28] Y. Yu, C. Than, J.H. Degnan, and L. Nakhleh, “Coalescent Histories on Phylogenetic Networks and Detection of Hybridization Despite Incomplete Lineage Sorting,” Systematic Biology, vol. 60, no. 2, pp. 138-149, 2011.

[29] Y. Zheng, T. Wu, and L.X. Zhang, “Reconciliation of Gene and Species Trees with Polytomies,” arXiv preprint, arXiv:1201.3995, 2012.

[30] J.H. Degnan and L.A. Salter, “Gene Tree Distributions Under the Coalescent Process,” Evolution, vol. 59, no. 1, pp. 24-37, 2005.

[31] Y. Wu, “Coalescent-Based Species Tree Inference from Gene Tree Topologies under Incomplete Lineage Sorting by Maximum Likelihood,” Evolution, vol. 66, pp. 763-775, 2012.

[32] M.W. Hahn, M.V. Han, and S.G. Han, “Gene Family Evolution across 12 Drosophila Genomes,” PLoS Genetics, vol. 3, no. 11, p. e197, 2007.

[33] D.A. Pollard, V.N. Iyer, A.M. Moses, and M.B. Eisen, “Widespread Discordance of Gene Trees with Species Tree in Drosophila: Evidence for Incomplete Lineage Sorting,” PLoS Genetics, vol. 2, no. 10, p. e173, 2006.

[34] M.J. Sackin, “Good and Bad Phenograms,” Systematic Zoology, vol. 21, pp. 225-226, 1972.

[35] N.A. Rosenberg, “The Shapes of Neutral Gene Genealogies in Two Species: Probabilities of Monophyly, Paraphyly, and Polyphyly in a Coalescent Model,” Evolution, vol. 57, no. 7, pp. 1465-1477, 2003.

[36] L.X. Zhang, “From Gene Trees to Species Trees II: Species Tree Inference by Minimizing Deep Coalescence Events,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 8, no. 6, pp. 1685-1691, Nov. 2011.


[37] C. Than and N.A. Rosenberg, “Mean Deep Coalescence Cost under Exchangeable Probability Distributions,” Discrete Applied Math., in press.

[38] M.D. Rasmussen and M. Kellis, “Unified Modeling of Gene Duplication, Loss, and Coalescence Using a Locus Tree,” Genome Research, vol. 22, no. 4, pp. 755-765, 2012.

[39] M. Stolzer, H. Lai, M. Xu, D. Sathaye, B. Vernot, and D. Durand, “Inferring Duplications, Losses, Transfers and Incomplete Lineage Sorting with Nonbinary Species Trees,” Bioinformatics, vol. 28, no. 18, pp. i409-i415, 2012.

Yu Zheng received the BS degree in mathematics from Sun Yat-Sen University, China, in 2009. He is currently working toward the PhD degree in the Department of Mathematics at the National University of Singapore. His research interests include algorithms and computational biology.

Louxin Zhang received the BS and MS degrees in mathematics from Lanzhou University, China, and the PhD degree in computer science from the University of Waterloo, Canada. He is an associate professor of applied mathematics at the National University of Singapore. His research areas have been in computational biology, bioinformatics, applied combinatorics, and theory of computing. In particular, the focus of his current research has been on sequence comparison and gene duplication in comparative genomics. He has coauthored more than 70 scholarly research articles and one monograph on sequence comparison. He has served as a program committee member of many international bioinformatics conferences and workshops including RECOMB and ECCB.
