
Angiosperm genome comparisons reveal early polyploidy in the monocot lineage

Haibao Tang(a,b,1), John E. Bowers(a,b,1), Xiyin Wang(a,c), and Andrew H. Paterson(a,b,2)

(a) Plant Genome Mapping Laboratory and (b) Department of Plant Biology, University of Georgia, Athens, GA 30602; and (c) College of Science, Hebei Polytechnic University, Tangshan, Hebei 063000, China

Edited by Douglas E. Soltis, University of Florida, Gainesville, FL, and accepted by the Editorial Board November 19, 2009 (received for review July 20, 2009)

Although the timing and extent of a whole-genome duplication occurring in the common lineage of most modern cereals are clear, the existence and extent of more ancient genome duplications in cereals and perhaps other monocots have been hinted at but remain unclear. We present evidence of additional duplication blocks of deeper hierarchy than the pancereal rho (ρ) duplication, covering at least 20% of the cereal transcriptome. These more ancient duplicated regions, herein called σ, are evident in both intragenomic and intergenomic analyses of rice and sorghum. Resolution of such ancient duplication events improves the understanding of the early evolutionary history of monocots and the origins and expansions of gene families. Comparisons of syntenic blocks reveal clear structural similarities in putatively homologous regions of monocots (rice) and eudicots (grapevine). Although the exact timing of the σ-duplication(s) is unclear because of uncertainties of the molecular clock assumption, our data suggest that it occurred early in the monocot lineage after its divergence from the eudicot clade.

collinearity | paleopolyploidy | synteny

Whole-genome duplications (WGDs) have occurred in the lineages of plants (1–3), animals (4, 5), and fungi (6), with consequences ranging from the origin of evolutionary novelty (7) to the provision of genetic “buffer capacity” that increases genetic robustness (8–10). Reciprocal gene loss following a WGD can contribute to reproductive isolation through divergent resolution of duplicate copies (11), foreshadowing the diversification of species (12–14).

WGDs are particularly widespread in the phylogeny of flowering plants, giving rise to large gene families and complicating comparative studies (1, 15–17). Relatively recent WGDs often are readily identified through intragenomic comparisons; however, more ancient WGDs are less tractable and often have been studied through “bottom-up” reconstruction of intermediate orders (1, 5), iteratively inferring the state of the ancestral genome before successively more ancient duplications.

It is well established that one WGD (hereinafter denoted as ρ) occurred in the cereal lineage an estimated 70 million years ago, preceding the radiation of the major cereal clades by 20 million years or more (2, 18). “Quartet” comparisons of the two resulting paralogous (homeologous) chromosomal regions in rice and sorghum show that 97%–98% of postduplication gene losses are orthologous (19), consistent with the ρ event predating the diversification of major grass lineages (2, 20). This suggests that rice–sorghum gene arrangements likely are representative of those of most grass genomes, albeit further modified in some lineages by additional cycles of duplication and gene loss. The ρ duplication is extensive, involving all modern chromosomes of rice and sorghum and covering much of the euchromatin (2, 21). Even a duplication previously thought to be recent and segmental apparently results from ρ with subsequent concerted evolution (19, 22).

While several studies (3, 20, 23) have hinted that additional monocot duplications may have predated ρ, the extent of such earlier duplications has not yet been elucidated. Inferences of more ancient polyploidy based on inspection of amino acid differences between duplicate genes (dA) (23) are affected by varying substitution rates among different gene families (1). A recent study identified 29 duplications in the rice genome, including 19 minor blocks that overlap with 10 major blocks (20), but did not systematically study these segments in a hierarchical context to elucidate their evolutionary history.

In the present work, we combined a visually intuitive approach with a gene-based multicollinearity search algorithm, MCscan (24), to improve understanding of the paleoevolution of the cereal lineage before ρ and explore its implications for comparative genomics and gene family evolution. In particular, consideration of these additional monocot duplications in multiple alignments clarifies monocot–eudicot sequence comparisons and reveals clear associations between sets of segments in representative genomes from each clade.

Results

Quartet Alignments Among Rice and Sorghum Gene Orders (ρ-Blocks). We compiled a list of syntenic gene quartets from rice and sorghum, showing both orthologous and ρ-paralogous matches. We analyzed a total of nine large segmental duplications attributed to the ρ-genome duplication using previously described block identifiers (2). The extent of ρ synteny between ρ-duplicate segments is summarized in Table S1, and boundaries of ρ blocks are highlighted in a rice intragenomic dot plot in Fig. 1A. These 9 ρ-blocks correspond to 9 of 10 major blocks described by Salse et al. (20). We consider one block involving chromosomes 4–10 of Salse et al. (20) to overlap with both ρ2 and ρ5, indicating an origin more ancient than ρ.

Each ρ-block merges two regions of rice and two regions of sorghum into a single gene order that approximates the genome composition before the ρ duplication. Specifically, the ρ-order collapses 15,640 rice genes and 15,636 sorghum genes into 13,308 ρ-nodes (∼50% of the rice and sorghum transcriptomes), excluding tandemly duplicated genes. The incorporation of sorghum gene orders validates the ρ-blocks previously identified in rice while better resolving a few duplicated regions that were reciprocally silenced in rice and sorghum. This reconstruction of pre-ρ gene order is intended to computationally reverse post-ρ gene loss, increasing the sensitivity of subsequent analysis. We emphasize that this order is only an approximation, because the ancestral positions of the intervening singleton genes between consecutive pairs of ρ-paralogs cannot be determined precisely.

Author contributions: H.T., J.E.B., and A.H.P. designed research; H.T. and J.E.B. performed research; H.T., J.E.B., and X.W. analyzed data; and H.T. and A.H.P. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. D.E.S. is a guest editor invited by the Editorial Board.

1 H.T. and J.E.B. contributed equally to this work.

2 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0908007107/DCSupplemental.

472–477 | PNAS | January 5, 2010 | vol. 107 | no. 1 www.pnas.org/cgi/doi/10.1073/pnas.0908007107


Nonetheless, we show below that this intermediate order is useful to mask post-ρ events and infer the structure of more ancient blocks.

Pre-ρ Duplications in the Cereal Lineage (σ-Blocks). The σ-blocks (involved in duplication events before ρ) were identified through further bottom-up reconstruction (1). Reconstructed ρ-orders of 13,308 ρ-nodes from the previous step were compared, revealing collinear patterns of correspondence involving all nine major ρ-blocks (Fig. 1B). Some collinear patterns between pairs of ρ-blocks are one to one, whereas others (i.e., σ2, σ4, and σ5) involve more than two ρ-blocks, suggesting the identification of additional duplications.

To facilitate further analysis, we curated a second list of 8 large σ-blocks (Table S2) that have retained collinearity following σ. These blocks contain a total of 4,168 σ-nodes, covering 5,747 rice genes and 5,738 sorghum genes (∼20% of the rice and sorghum transcriptomes). Enumerating all patterns of σ collinearity is difficult, because some duplicated regions become highly degenerate during post-WGD diploidization, creating gene orders that are largely reciprocal or sometimes complementary (25, 26). Relationships between some degenerate segments can still be identified through transitive comparisons of grapevine and rice genomes (see below), but there is little remaining intragenomic correspondence between rice segments.

The bottom-up approach, starting from the modern gene order to deduce ρ- and σ-orders, offers inherent hierarchical structures that reflect the relationships among chromosome segments. Figure 2 shows a section of σ6; collinearity is well retained, and anchored gene pairs, including rice–sorghum orthologs, ρ-paralogs and σ-paralogs, often retain consistent transcriptional orientations. Nonetheless, gene losses (due to fractionation) are extensive, particularly across the σ duplication (between the 2 ρ-blocks) where there are the fewest corresponding genes (Fig. 2).

Genetic Distances of the Gene Pairs. Paralogous gene pairs fall into separate age groups, in accordance with the hierarchical relationships among the segments on which they reside. Synonymous nucleotide substitutions per synonymous site (Ks) for the groups of orthologs and paralogs from different events (ρ and σ) were well separated (Fig. 3). However, variations in the GC content of cereal genes can affect Ks calculations, with different algorithms generating differing Ks values for pairs involving genes with high third codon position GC content (GC3) (27). Accordingly, we focused on gene pairs with an average GC3 < 75%; we provide justification for this cutoff in Methods.

Rice–sorghum orthologs show a sharp Ks peak (median, 0.62), consistent with previous estimates (19). The populations of ρ-paralogs from both rice and sorghum show a major peak at Ks 0.94, along with a small peak at Ks ∼0.15 resulting from concerted evolution of the terminal part of ρ9 (22, 28) (Table S1).

Paralogs derived from σ duplications show a well-defined peak around much older Ks (median, 1.72) and with a larger variance than that of other groups. Based on a molecular clock of 6.5 × 10^−9 synonymous substitutions per synonymous site per year (29), the σ duplications are estimated to have occurred ≈130 million years ago. Because the Ks values for many σ-paralogous pairs are almost saturated and there are uncertainties in the calibration of the molecular clock (30), this date can be considered only a rough estimate.

Fig. 1. Illustration of bottom-up reconstruction of ρ-blocks and σ-blocks. (A) Classifications of ρ duplicated blocks are visualized in the lower left triangle. (B) During the second iteration, each of the paired hits is converted into ρ-nodes and then plotted in the upper-right triangle. Gene positions are in their rank orders along the chromosome (A) or the reconstructed ρ-order (B).

Fig. 2. Example alignment showing syntenic relationships among four rice (Os) and four sorghum (Sb) regions. Three well-retained gene clusters along these syntenic groups are plotted to show the relatively stable gene phylogeny that is consistent with the duplication scenario. The sample trees were constructed by neighbor-joining clustering of the protein sequences of the selected loci with 1,000 bootstrap replicates.

Decomposition of Ks distributions explains why previous studies were not able to identify the σ event (and to some extent the ρ event as well) based solely on the Ks distribution of ESTs (31). Several analyses have relied on curve-fitting methods to find multiple duplication events based on Ks distributions (16, 24). The combined set of ρ and σ paralogs shows a distribution with the mixed peak extending from 1.0 to 2.0, which can be readily separated into distinct components using our synteny-based classifications. Synteny-based classifications of gene pairs also remove the L-shaped component resulting from recent single-gene duplication events in the Ks plot (24).

Judging from the Ks distribution, the distances between both ρ and σ duplicates appear to be bounded between rice–sorghum orthologs and grape–cereal orthologs (Fig. 3), suggesting that each of these WGDs occurred between cereal diversification and monocot–eudicot divergence. Indeed, the distances between grape–cereal orthologs (median Ks, 1.95) are higher than those between the cereal paralogs from the σ duplications (P = 4.8 × 10^−24; Student t test). However, differences in lineage-specific mutation rates between grass and grape confound interpretation of Ks values, and we reemphasize that our divergence time estimates must be considered only rough approximations. Initial interpretation of the Arabidopsis β WGD provides a cautionary example; Ks analyses of duplicated genes suggested that the β-duplication predated the divergence of Arabidopsis and Carica (papaya), but analyses of blocks of genomic sequence showed that the β-duplication occurred after the divergence of these lineages (24).
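The dating arithmetic in this section can be reproduced with a short sketch. It assumes the conventional relation T = Ks / (2r), where r is the substitution rate per site per year and the factor of 2 accounts for divergence accumulating along both lineages; the text reports the rate and the resulting date but does not spell out the formula, so that relation is an assumption here. The function name `divergence_time_mya` is illustrative only.

```python
# Rate quoted in the text (ref. 29): synonymous substitutions per
# synonymous site per year.
RATE = 6.5e-9


def divergence_time_mya(ks, rate=RATE):
    """Convert a median Ks into an age in millions of years.

    Assumes the standard molecular-clock relation T = Ks / (2 * rate);
    this formula is an assumption, not quoted from the paper.
    """
    return ks / (2.0 * rate) / 1e6


# Median Ks values reported in the text for each class of gene pair.
medians = {
    "rice-sorghum orthologs": 0.62,
    "rho paralogs": 0.94,
    "sigma paralogs": 1.72,
}

ages = {name: divergence_time_mya(ks) for name, ks in medians.items()}

for name, age in ages.items():
    print(f"{name}: Ks = {medians[name]:.2f} -> about {age:.0f} million years")
```

Under these assumptions the σ median corresponds to roughly 132 million years and the ρ median to roughly 72 million years, in line with the ≈130 and ∼70 million year figures quoted in the text. Saturation of Ks and rate differences between lineages, both noted above, mean these numbers are rough guides only.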

Episodic Expansions for Some Cereal Gene Families. Different ancestral loci often show varying levels of retention following polyploidy; particular gene functional groups are preferentially retained or lost following WGD events (9, 24, 32, 33). To further investigate this, we calculated the enrichment of molecular function terms of Gene Ontology (GO) for the rice WGD paralogs within different age groups (see Methods).

Interestingly, ρ-duplicates and the more ancient σ-duplicates have the same four most-enriched gene functional groups: transcriptional factor activity (GO:0003700), ligand binding (GO:0005488), DNA binding (GO:0003677), and transcriptional regulator activity (GO:0030528) (Table S3). This trend of retention in rice WGD duplicates is consistent with previous findings for WGD paralogs in Arabidopsis (33, 34). Many rice transcriptional regulators and kinases were preferentially retained following the recurring WGDs (ρ and σ), leading to episodic expansions of those gene families. Enrichment for paleo-duplicates suggests that the diversification of signal transduction pathways in both the upstream elements (protein kinases) and the downstream elements (transcriptional factors) may increase the regulatory complexity for flowering plants following WGDs (33).

Effective Comparisons Between Cereal and Eudicot Genomes. Similarities between monocot and eudicot genomes resulting from common ancestry have been obscured by many rounds of paleopolyploidy and numerous genome rearrangements (3, 35). Recurring polyploidy events pose significant challenges when comparing monocot and eudicot genomes because of the degeneration caused by independent gene fractionation (or “diploidization”) following several rounds of paleopolyploidy in each lineage.

To compare monocot and eudicot genomes, we applied a hierarchical clustering approach (see Methods) that partially circumvents such difficulties in identifying synteny across grape and rice (36). In brief, the chromosomes were cut into small segments, and each pair of rice and grape segments was compared. For example, assume that we had rice segments O1 and O2 and grape segment V1, and that comparisons O1–V1 and O2–V1 showed a significant number of homologs. Based on this information, O1 and O2 could be clustered together, because both matched the same grape region(s). In this approach, only the “dense” (syntenic) portions of the whole-genome dot plot were clustered, assembled, and interpreted; the “sparse” (nonsyntenic) portions were eliminated from further analysis (Fig. S1).

Based on our clustering approach, duplicated segments retained in grape following the eudicot γ hexaploidy event (3), as well as homologous segments retained in rice following at least two rounds of duplication (ρ and σ), were found to contain 38 “putative ancestral regions” (PARs). Each PAR consists of regions showing a high density of homologs (P < 1 × 10^−10; see Methods). The PARs collectively explain 19.1% of all observed homolog pairs and 31.0% of the reciprocal best hits between grape and rice genes, although by chance they should explain only 2.1% for both categories (the 38 PARs, as highlighted in Fig. S1, occupy only 2.1% of the total area on the dot plot), representing a ∼10-fold enrichment. The PARs interleave multiple grape and rice genomic regions collectively covering ∼70% of each genome. By consolidating much of the redundancy in each genome, the PARs create syntenic blocks with less ambiguity and in most cases show an association between one γ block and one σ block. We found no PAR that mapped simultaneously to two different γ or σ blocks (Table S4).

When scrutinizing a particular PAR, analyzing syntenic relationships among the clustered regions is more informative than analyzing any individual pair of syntenic segments that contribute to the PAR. For example, in PAR17 (Fig. 4A), three grape regions resulting from the γ triplication (γ6) (3, 24) correspond to several regions in rice matching one another, which can be partially explained by σ1, as well as additional duplications unobserved in intragenomic comparisons.
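The O1/O2/V1 example above, in which rice segments are grouped because they match the same grape segment, can be illustrated with a small sketch. This is a deliberate simplification: it treats clustering as finding connected components over the set of significant rice–grape segment pairs, using a union-find structure, whereas the paper describes a hierarchical clustering procedure whose details are given in its Methods. The function name `cluster_segments` and the input format are hypothetical.

```python
from collections import defaultdict


def cluster_segments(significant_pairs):
    """Group segments that are linked through shared significant matches.

    significant_pairs: iterable of (rice_segment, grape_segment) tuples,
    one per comparison judged to contain a significant number of homologs.
    Returns a list of clusters, each a dict with "rice" and "grape" sets.
    Simplified stand-in (connected components), not the authors' method.
    """
    parent = {}

    def find(node):
        # Register unseen nodes, then follow parents with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for rice, grape in significant_pairs:
        root_rice = find(("rice", rice))
        root_grape = find(("grape", grape))
        if root_rice != root_grape:
            parent[root_grape] = root_rice

    clusters = defaultdict(lambda: {"rice": set(), "grape": set()})
    for node in list(parent):
        genome, name = node
        clusters[find(node)][genome].add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c["rice"]))


# O1 and O2 both match V1, so they end up in the same cluster;
# O3 matches only V2 and forms its own cluster.
pairs = [("O1", "V1"), ("O2", "V1"), ("O3", "V2")]
result = cluster_segments(pairs)
for cluster in result:
    print(sorted(cluster["rice"]), sorted(cluster["grape"]))
```

A connected-components view will merge regions more aggressively than a thresholded hierarchical method, so it is best read as an illustration of the transitive logic rather than a way to reproduce the 38 PARs.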

Discussion

Integration of Intragenomic and Intergenomic Analyses. The individual PARs derived from grapevine–rice comparisons using our clustering approach offer an independent and important validation of the σ blocks that we identified through cereal intragenomic comparisons. All σ blocks that we identified above are evident in the grapevine–rice PARs (Table S4). Some “ghost duplications” (25) in rice that we failed to identify through intragenomic comparisons (due to reciprocal gene losses in largely complementary fashion) are much clearer in cross-species comparisons (i.e., PARs).

Fig. 3. Ks value distributions for rice–sorghum orthologs, cereal WGD paralogs (ρ and σ paralogs), and grape–cereal orthologs. The recent ρ paralog pairs (rice–rice and sorghum–sorghum pairs) are readily derived from the ρ-nodes. The σ-nodes contain several possible paralog pairs, however. To calculate Ks values for σ paralogs, we include all paralog pairs within rice and sorghum, but exclude the ρ pairs. Cereal–grape orthologs are inferred from reciprocal best hits in rice–grape BLAST or sorghum–grape BLAST.

Compared with the WGD events in grape, where 94.5% of the genome appears to be duplicated (3), the cereal WGDs are more complicated and degenerate. The dual approach of intragenomic and intergenomic comparisons provides a far more complete picture of the duplication landscape than afforded by either approach alone. The intergenomic approach is slightly more exhaustive, but has some limitations. First, interpreting the segments in a phylogenetic context without a hierarchical structure (which is inherent to the intragenomic approach) is more difficult; second, this approach shows duplications only in those regions where grape–rice synteny is well conserved, missing some duplicated blocks. In summary, neither the intragenomic nor the intergenomic approach provides an exhaustive list of duplicated blocks; rather, each method provides a complementary view of genome duplication and fractionation.

Consideration of duplications in both lineages improves inferences of correspondence between divergent genomes (or segments thereof). The 38 grapevine–rice PARs represent a qualitative advance toward a global view of monocot–dicot synteny (Fig. 5). Collinearity (represented by the white lines in the figure) appears to be disrupted around the pericentromeric regions in 10 of the 12 rice chromosomes, suggesting dynamic reorganization possibly resulting from massive transpositions or gene losses (21).

Number of WGD Events in the Monocot Lineages. In many lineages, the existence of WGD events, and the numbers of these events, remain unclear. Whether the vertebrate lineage had experienced two or three rounds of WGD was the subject of a long debate that was resolved only recently through a careful analysis of the synteny patterns of WGD paralogs (37). Similarly, various studies offered conflicting estimates of the number of WGDs in Arabidopsis (1, 38). Different sources of evidence might favor different models; in particular, estimates based on the distribution of genetic distances of paralogs or topologies of gene trees alone are now known to be complicated by unequal evolutionary rates between gene families and lineages (15, 39). Currently, analyses based on synteny patterns provide the most accurate inferences of WGD events (15).

Our unique approach to synteny analysis provides new insight into the number of WGD events experienced by modern cereal genomes. The pattern exemplified by the one PAR that we had the space to show with fine resolution (Fig. 4A), usually with 3-fold redundancies on the grape axis and at least 4-fold redundancies on the rice axis, is representative of all 38 PAR patterns (shown in Fig. S2). In 22 of the 38 PARs, grapevine–rice collinearity was clear, allowing us to evaluate the level of redundancy in both genomes (Table S4). These redundancies reflect the number of genome duplication events observable in both lineages. Among the 22 PARs, 12 were 3-fold redundant in grapevine, consistent with hexaploidy (3). The level of redundancy in rice was less clear, ranging from as little as 2-fold (one PAR) to 7-fold (three PARs) and 8-fold (five PARs). In line with the intragenomic evidence from our bottom-up analysis, these high redundancies suggest that the rice lineage experienced more than two, perhaps three, rounds of WGD.

Fig. 4. Synteny comparisons with PARs. (A) Zoom-in view of an exemplary PAR17 consisting of corresponding regions from grape and rice. The segment labels on the right and below the graph have the format species (Vv for grape, Os for rice) followed by “chromosome:start-stop,” where start and stop are the reindexed gene ranks after removal of tandem genes and singletons (see Methods). (B) Synteny between one sorghum genomic region and two contiguous Musa BACs (with +/− indicating a flipping of order) to the rice duplicated regions identified in PAR17.

Fig. 5. Schematic view of syntenic grape and rice regions. The color scheme is consistent with the original description (3), with different colors assigned to the different sets of triplicate chromosomes. The centromere positions for rice were retrieved from the Rice Annotation Project website (http://rapdb.dna.affrc.go.jp), while the centromere positions for grapevine are undetermined and thus are not labeled. The syntenic regions in rice were circumscribed from 38 significant PARs with the corresponding colors of the grape regions. White lines in the syntenic blocks indicate discernible “collinearity” (i.e., gene order conservation in addition to synteny) between grape and rice.


Implications for Comparisons Between Cereal and Basal Genomes. The high level of synteny and collinearity among cereal genomes has long been clear, but parallels to other monocots, such as banana (40), onion, and asparagus (41), have been more difficult to discern. The generally low synteny found in these previous studies may improve after redundancies within cereal (and other) genomes are accounted for.

The duplicated regions that we identified in rice also are evident in comparisons to banana, a nongrass monocot (40). Our synteny search in limited outgroup sequences revealed that two banana BACs (198 kb and 135 kb) match the set of rice regions in PAR17, which was used as an example in the rice–grape PARs (Fig. 4A). A sorghum genomic region (c3:67–68 Mb) was selected as a cereal reference (Fig. 4B). Sorghum shows very strong synteny corresponding to the orthologous rice region (Os01:1720–1819), then lesser but still easily discernible synteny to one matching ρ-block (Os05:775–832), whereas σ-blocks (the remaining six regions) show only a few homologs. In contrast, banana–rice homolog concentrations in each duplicated region are comparable, suggesting that the banana–rice divergence may have predated both ρ and σ duplications. The limited banana sequence data available prevent us from falsifying the alternative hypothesis that this lesser stratification of synteny patterns simply reflects greater divergence between banana and rice.

Synthesis and Ongoing Needs. Progress in understanding and quantifying the ancient events that shaped eudicot and monocot genomes facilitates comparisons across plant lineages whose genomes have been dynamically restructured over the last 150 million years. Such comparisons promise to improve the understanding of the evolution of functional and regulatory complexity (33) that may have contributed to radiation of angiosperms, as well as the evolution of novel features that may have motivated aboriginal peoples to use and domesticate a subset of these plants. Clarification of angiosperm evolutionary history also provides a firm foundation on which to base translational genomics, the leveraging of structural and functional genomic information from botanical models to dissect specific traits in poorly characterized organisms, such as “orphan crops” that are staples for large and impoverished human populations but for which little genomic data exist (42).

Methods

Sequence Sources and Similarity Search. We retrieved the rice gene set from the Rice Annotation Project (RAP2) (43), the sorghum gene set (Sbi1.4) from the Joint Genomics Institute (19), and the grapevine gene set from Genoscope (3). Two Musa balbisiana BACs (AC226052.1 and AC226053.1) were downloaded from GenBank, with putative gene models predicted using FGENESH (http://www.softberry.com). Similarity between two proteins was considered significant if the E-value of BLASTP (44) was < 1 × 10⁻¹⁰.

Bottom-Up Method for Identifying Ancestral Duplications. With the gene order alignment algorithm implemented in the software MCscan (24) and some manual curation, an approximate order of the ancestral cereal genome before the ρ-duplication was constructed from quartet alignments between sorghum and rice (19). Manual curation was done to remove minor segments with fragmentary synteny that overlap with major segments, so that the ρ-order represents the most recent event. Based on the comparisons within the ρ-order, more ancient blocks (σ blocks) were circumscribed in which gene pairs within 40 Manhattan distance units were clustered using a single-linkage algorithm. The Manhattan distance is calculated by combining the number of intervening genes in both regions. We focused our analysis on segments with >10 anchor points.
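The single-linkage step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation or the MCscan code. The function name `cluster_anchors`, the representation of each gene pair as a tuple of rank positions, the use of rank differences as a stand-in for the count of intervening genes, and the inclusive threshold (distance ≤ 40) are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_anchors(pairs, max_dist=40, min_size=11):
    """Single-linkage clustering of gene pairs (anchor points).

    pairs    : list of (i, j) tuples, where i and j are the rank positions
               of two homologous genes in the two regions being compared
               (hypothetical representation).
    max_dist : two anchors are linked when |i1 - i2| + |j1 - j2| <= max_dist
               (assumed reading of "within 40 Manhattan distance units").
    min_size : smallest cluster kept; 11 corresponds to ">10 anchor points".
    """
    parent = list(range(len(pairs)))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Quadratic pairwise scan, kept simple for clarity rather than speed.
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i1, j1), (i2, j2) = pairs[a], pairs[b]
            if abs(i1 - i2) + abs(j1 - j2) <= max_dist:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for idx, pair in enumerate(pairs):
        clusters[find(idx)].append(pair)
    return [sorted(c) for c in clusters.values() if len(c) >= min_size]
```

Because linkage is single, a chain of anchors each within the threshold of its neighbor forms one block even when the two ends of the chain are far apart, which is what allows long collinear runs to be recovered from sparse anchors.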

Clustering and Reconstruction of PARs. Putative ancestral regions between grapevine and rice genomes were derived through clustering of syntenic segments, inspired by the methodology used in previous analyses of sea anemone and amphioxus genomes (36, 45). The whole analysis, streamlined in a set of computer programs, involves 3 major steps, as detailed here. Fig. S3 provides a graphical representation of the methodology.

Filtering of the Matching Set. We first scanned for tandem gene families, defined as clusters of genes within 10 intervening genes from one another, and kept the longest peptide. Next, we used c-value filtering to exclude weak peptide matches. The c-value is defined as c(x, y) = b(x, y) / max{b(x, z) for z in Y, or b(w, y) for w in X}, for each BLAST hit between peptide x in genome X and peptide y in genome Y. The c-value generalizes the concept of mutual best hit, because the mutual best hit would have a value of 1 (36). We used a c-value cutoff of 0.7, which implies that we excluded matches that were < 70% similar to the best match in either genome. The filtered BLASTP results contained 35,386 matches between 14,982 grape genes and 15,395 rice genes. The genes were reindexed according to the rank order on each chromosome.

Segmentation of Chromosomes and Scaffolds. BLASTP matches within 40 Manhattan distance units were clustered for a first-pass evaluation of syntenic blocks, and as before the blocks with more than 10 gene pairs were retained. The start and stop boundaries of the first-pass syntenic blocks were used as indications of the breakpoints that disrupt otherwise even distributions of homologs. The chromosomes or scaffolds in both genomes were cut into “atomic” segments according to these breakpoints. (Note that some breakpoints can be shared by several synteny blocks.) A total of 180 and 266 “atomic” segments were identified in grape and rice, respectively, including the breaks created by chromosomal ends. Such segments are less affected by genome rearrangements and are suitable for defining simple synteny patterns.

Clustering of Segments Free of Rearrangements into PARs. The segments from grape and rice identified above were compared in a pairwise manner, and homolog concentration scores (36) were calculated using -log(p), where p is the probability of observed number of homolog pairs as modeled by a Poisson distribution. For each segment, the array of scores against all segments in the other genome forms a unique profile. The segments were then clustered based on the similarity of these profiles (determined by Pearson correlation coefficient r) using the average linkage method. The clusters were defined at a cutoff of r = 0.3, as determined by visual inspection of the resulting clusters (Fig. S1). This resulted in 56 and 56 reconstructed regions in the grape and rice genomes, respectively. Significant synteny between the reconstructed regions was evaluated statistically by summing the likelihoods of observing as many or more gene pairs under the null hypothesis of these pairs occurring randomly. For all pairwise comparisons (56 × 56) in grape and rice, we kept 38 grape–rice comparisons that were significantly enriched for homologs (P < 1 × 10⁻¹⁰), using this stringent cutoff to limit consideration to particularly strong synteny. These 38 comparisons, each containing an ensemble of syntenic patterns, were referred to as PARs, and a unique PAR identifier was assigned to each.
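Two of the quantities defined in these steps, the c-value and the homolog concentration score, can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the authors' pipeline. It assumes that b(x, y) is a positive BLAST score (for example a bit score), that p is the upper-tail Poisson probability of seeing at least the observed number of homolog pairs, that the logarithm is base 10, and that the expected count is supplied by the caller. The function names are hypothetical.

```python
import math
from collections import defaultdict


def c_value_filter(hits, cutoff=0.7):
    """Keep hits whose c-value is at least `cutoff`.

    hits : dict mapping (x, y) -> b(x, y), a positive BLAST score between
           peptide x in genome X and peptide y in genome Y (assumed form).
    Returns a dict mapping each retained (x, y) to its c-value.
    """
    best_x = defaultdict(float)  # best score of x against any z in Y
    best_y = defaultdict(float)  # best score of y against any w in X
    for (x, y), b in hits.items():
        best_x[x] = max(best_x[x], b)
        best_y[y] = max(best_y[y], b)

    kept = {}
    for (x, y), b in hits.items():
        c = b / max(best_x[x], best_y[y])
        if c >= cutoff:
            kept[(x, y)] = c
    return kept


def homolog_concentration_score(observed, expected):
    """Return -log10 of P(N >= observed) for N ~ Poisson(expected).

    The upper tail and the base-10 logarithm are assumptions; the text
    only specifies -log(p) under a Poisson model. `expected` must be > 0.
    """
    if observed <= 0:
        return 0.0  # P(N >= 0) = 1, so the score is 0
    # First term of the tail, computed in log space to avoid overflow.
    log_term = (-expected + observed * math.log(expected)
                - math.lgamma(observed + 1))
    term = math.exp(log_term)
    total, i = 0.0, observed
    # Sum the tail until further terms no longer change the total.
    while i < observed + 10 or term > total * 1e-16:
        total += term
        i += 1
        term *= expected / i
    if total == 0.0:
        return math.inf  # probability underflowed to zero
    return -math.log10(total)
```

A mutual best hit receives c = 1 under this definition, and a hit that is the best for x but much weaker than another hit to y is penalized by the larger of the two maxima, which matches the "best match in either genome" wording.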

Availability of Reconstructed Orders and Composition of PARs. The compiled rice–sorghum ρ-order and σ-orders, and 38 grapevine–rice PARs, are available as downloadable EXCEL spreadsheets at http://chibba.agtec.uga.edu/duplication/par.

Calculation of Synonymous Substitutions (Ks). For homologs inferred from syntenic alignments, the protein sequences were aligned using CLUSTALW (46), and the resulting protein alignments were used to guide coding sequence alignments by PAL2NAL (47). Ks values were calculated using the Nei–Gojobori method implemented in the yn00 program in the PAML package (48). In-house Python scripts were used to pipeline all of the calculations. Extra caution was needed when calculating the Ks values for cereal genes, because there are 2 distinct groups of genes with significantly different third codon position GC content (GC3) (Fig. S4A). Ks values calculated using the Nei–Gojobori and Yang–Nielsen methods were quite consistent for low-GC3 gene pairs, but differed significantly for high-GC3 gene pairs (Fig. S4B) (18, 27). Consequently, we chose to not use Ks values for gene pairs with average GC3 > 75%. [We chose 75% because this is the saddle point in the bimodal distribution in Fig. S4A and also was used in previous analyses (18, 27).] In addition, we considered Ks values >3.0 to indicate saturated substitutions at synonymous positions and also excluded these pairs from the later analysis.
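The two exclusion rules above (average GC3 > 75% and Ks > 3.0) can be expressed as a small filter. The following Python sketch is illustrative only and does not reproduce the in-house scripts. It assumes that each coding sequence is in frame and ungapped, and that "average GC3" means the mean of the GC3 values of the two genes in a pair. The function names `gc3` and `keep_ks` are hypothetical.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame CDS.

    Assumes the sequence starts at codon position 1 and contains no gaps.
    """
    thirds = cds.upper()[2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def keep_ks(ks, cds_a, cds_b, gc3_max=0.75, ks_max=3.0):
    """Return True if a gene pair passes both filters described in the text.

    ks           : synonymous substitution rate for the pair (e.g. from yn00)
    cds_a, cds_b : coding sequences of the two homologs
    gc3_max      : pairs with mean GC3 above this value are excluded
    ks_max       : pairs with Ks above this value are treated as saturated
    """
    mean_gc3 = (gc3(cds_a) + gc3(cds_b)) / 2
    return mean_gc3 <= gc3_max and ks <= ks_max
```

Applying the GC3 filter before interpreting Ks distributions keeps the high-GC3 class, for which the two estimation methods disagree, from distorting the age estimates of duplicated blocks.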

GO Enrichment for Rice WGD Paralogs. We tested the enrichment of GO broad terms in the two duplicate sets (4,831 ρ-duplicates and 1,098 σ-duplicates in rice, with some genes retained in both sets) using Fisher’s exact test, calculating the P value for the null hypothesis that there is no association between the duplicate status and a particular functional category. The P values were corrected with the total number of terms to account for multiple testing. Mappings from the rice genes to the molecular function terms were based on GO-SLIM assignments from the MSU rice annotation database (http://rice.plantbiology.msu.edu/).
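As a worked illustration of this test, the Python sketch below computes a one-sided Fisher's exact P value from the hypergeometric distribution and applies a Bonferroni-style correction. The one-sided (enrichment) alternative and the interpretation of "corrected with the total number of terms" as multiplication by the number of terms are assumptions, as are the function names; the text does not specify either detail.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for enrichment, P(X >= k).

    N : total number of annotated genes considered
    K : genes annotated with the GO term
    n : genes in the duplicate set
    k : genes in the duplicate set that carry the GO term
    """
    upper = min(n, K)
    # comb() returns 0 when the lower argument exceeds the upper one,
    # so infeasible table configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bonferroni(pvalues):
    """Multiply each P value by the number of terms tested, capped at 1."""
    m = len(pvalues)
    return {term: min(1.0, p * m) for term, p in pvalues.items()}
```

For example, if 3 of 6 genes carry a term and all 3 genes in a duplicate set carry it, `fisher_enrichment_p(3, 3, 3, 6)` evaluates the single most extreme table, whose probability is 1/20.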

476 | www.pnas.org/cgi/doi/10.1073/pnas.0908007107 Tang et al.


ACKNOWLEDGMENTS. We thank Jim Leebens-Mack for his helpful comments on the manuscript. Financial support was provided by the National Science Foundation (Grant MCB-0450260, to A.H.P. and J.E.B.; Grant MCB-0821096, to A.H.P.).

1. Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438.

2. Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903–9908.

3. Jaillon O, et al. French-Italian Public Consortium for Grapevine Genome Characterization (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467.

4. Jaillon O, et al. (2004) Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431:946–957.

5. Aury JM, et al. (2006) Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

6. Kellis M, Birren BW, Lander ES (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617–624.

7. Zhang G, Cohn MJ (2008) Genome duplication and the origin of the vertebrate skeleton. Curr Opin Genet Dev 18:387–393.

8. Ha M, Kim ED, Chen ZJ (2009) Duplicate genes increase expression diversity in closely related species and allopolyploids. Proc Natl Acad Sci USA 106:2295–2300.

9. Paterson AH, et al. (2006) Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces, and Tetraodon. Trends Genet 22:597–602.

10. Gu Z, et al. (2003) Role of duplicate genes in genetic robustness against null mutations. Nature 421:63–66.

11. Lynch M, Force AG (2000) The origin of interspecific genomic incompatibility via gene duplication. Am Nat 156:590–605.

12. Bikard D, et al. (2009) Divergent evolution of duplicate genes leads to genetic incompatibilities within A. thaliana. Science 323:623–626.

13. Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH (2006) Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440:341–345.

14. Soltis DE, et al. (2009) Polyploidy and angiosperm diversification. Am J Bot 96:336–348.

15. Tang H, et al. (2008) Synteny and collinearity in plant genomes. Science 320:486–488.

16. Cui L, et al. (2006) Widespread genome duplications throughout the history of flowering plants. Genome Res 16:738–749.

17. Paterson AH, et al. (1996) Toward a unified genetic map of higher plants, transcending the monocot–dicot divergence. Nat Genet 14:380–382.

18. Wang X, Shi X, Hao B, Ge S, Luo J (2005) Duplication and DNA segmental loss in the rice genome: Implications for diploidization. New Phytol 165:937–946.

19. Paterson AH, et al. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457:551–556.

20. Salse J, et al. (2008) Identification and characterization of shared duplications between rice and wheat provide new insight into grass genome evolution. Plant Cell 20:11–24.

21. Bowers JE, et al. (2005) Comparative physical mapping links conservation of microsynteny to chromosome structure and recombination in grasses. Proc Natl Acad Sci USA 102:13206–13211.

22. Wang X, Tang H, Bowers JE, Feltus FA, Paterson AH (2007) Extensive concerted evolution of rice paralogs and the road to regaining independence. Genetics 177:1753–1763.

23. Zhang Y, Xu GH, Guo XY, Fan LJ (2005) Two ancient rounds of polyploidy in rice genome. J Zhejiang Univ Sci B 6:87–90.

24. Tang H, et al. (2008) Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res 18:1944–1954.

25. Van de Peer Y (2004) Computational approaches to unveiling ancient genome duplications. Nat Rev Genet 5:752–763.

26. Freeling M, et al. (2008) Many or most genes in Arabidopsis transposed after the origin of the order Brassicales. Genome Res 18:1924–1937.

27. Shi X, et al. (2006) Nucleotide substitution pattern in rice paralogues: Implication for negative correlation between the synonymous substitution rate and codon usage bias. Gene 376:199–206.

28. Wang X, Tang H, Bowers JE, Paterson AH (2009) Comparative inference of illegitimate recombination between rice and sorghum duplicated genes produced by polyploidization. Genome Res 19:1026–1032.

29. Gaut BS, Morton BR, McCaig BC, Clegg MT (1996) Substitution rate comparisons between grasses and palms: Synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc Natl Acad Sci USA 93:10274–10279.

30. Hedges SB, Kumar S (2004) Precision of molecular time estimates. Trends Genet 20:242–247.

31. Blanc G, Wolfe KH (2004) Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667–1678.

32. Gout JF, Duret L, Kahn D (2009) Differential retention of metabolic genes following whole-genome duplication. Mol Biol Evol 26:1067–1072.

33. Freeling M, Thomas BC (2006) Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16:805–814.

34. Seoighe C, Gehring C (2004) Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet 20:461–464.

35. Liu H, Sachidanandam R, Stein L (2001) Comparative genomics between rice and Arabidopsis shows scant collinearity in gene order. Genome Res 11:2020–2026.

36. Putnam NH, et al. (2008) The amphioxus genome and the evolution of the chordate karyotype. Nature 453:1064–1071.

37. Dehal P, Boore JL (2005) Two rounds of whole-genome duplication in the ancestral vertebrate. PLoS Biol 3:1700–1708.

38. Vision TJ, Brown DG, Tanksley SD (2000) The origins of genomic duplications in Arabidopsis. Science 290:2114–2117.

39. Fares MA, Byrne KP, Wolfe KH (2006) Rate asymmetry after genome duplication causes substantial long-branch attraction artifacts in the phylogeny of Saccharomyces species. Mol Biol Evol 23:245–253.

40. Lescot M, et al. (2008) Insights into the Musa genome: Syntenic relationships to rice and between Musa species. BMC Genomics 9:58.

41. Jakse J, et al. (2006) Comparative sequence and genetic analyses of asparagus BACs reveal no microsynteny with onion or rice. Theor Appl Genet 114:31–39.

42. Naylor RL, et al. (2004) Biotechnology in the developing world: A case for increased investments in orphan crops. Food Policy 29:15–44.

43. Itoh T, et al. (2007) Rice Annotation Project (2007) Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana. Genome Res 17:175–183.

44. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.

45. Putnam NH, et al. (2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317:86–94.

46. Larkin MA, et al. (2007) Clustal W and Clustal X, version 2.0. Bioinformatics 23:2947–2948.

47. Suyama M, Torrents D, Bork P (2006) PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34:W609–612.

48. Yang Z (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591.

Tang et al. PNAS | January 5, 2010 | vol. 107 | no. 1 | 477
