

Copyright © 2008 by the Genetics Society of America
DOI: 10.1534/genetics.108.087981

Sequencing and Comparative Analysis of a Conserved Syntenic Segment in the Solanaceae

Ying Wang,*,†,1 Adam Diehl,‡,1 Feinan Wu,* Julia Vrebalov,§,** James Giovannoni,§,** Adam Siepel††,2 and Steven D. Tanksley*

*Department of Plant Breeding and Genetics, ‡Graduate Field of Genetics and Development, §USDA–ARS Plant, Soil and Nutrition Laboratory, **Boyce Thompson Institute for Plant Research and ††Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853 and †Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, Hubei 430074, People's Republic of China

Manuscript received February 11, 2008
Accepted for publication June 23, 2008

ABSTRACT

Comparative genomics is a powerful tool for gaining insight into genomic function and evolution. However, in plants, sequence data that would enable detailed comparisons of both coding and noncoding regions have been limited in availability. Here we report the generation and analysis of sequences for an unduplicated conserved syntenic segment (CSS) in the genomes of five members of the agriculturally important plant family Solanaceae. This CSS includes a 105-kb region of tomato chromosome 2 and orthologous regions of the potato, eggplant, pepper, and petunia genomes. With a total neutral divergence of 0.73–0.78 substitutions/site, these sequences are similar enough that most noncoding regions can be aligned, yet divergent enough to be informative about evolutionary dynamics and selective pressures. The CSS contains 17 distinct genes with generally conserved order and orientation, but with numerous small-scale differences between species. Our analysis indicates that the last common ancestor of these species lived ~27–36 million years ago, that more than one-third of short genomic segments (5–15 bp) are under selection, and that more than two-thirds of selected bases fall in noncoding regions. In addition, we identify genes under positive selection and analyze hundreds of conserved noncoding elements. This analysis provides a window into 30 million years of plant evolution in the absence of polyploidization.

GENOME sequences are now rarely studied in isolation, but instead are examined alongside their neighbors on the tree of life. Most animal species of primary research importance in genetics, including human, mouse, Drosophila melanogaster, and Caenorhabditis elegans, now belong to whole "sequenced clades," consisting of at least half a dozen and in some cases more than two dozen sequenced species (e.g., Lindblad-Toh et al. 2005; Rhesus Macaque Genome Sequencing and Analysis Consortium 2007; Clark et al. 2007; Miller et al. 2007; Stark et al. 2007) (http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/CaenorhabditisSEQ.pdf). The same is true of the model yeast Saccharomyces cerevisiae (Cliften et al. 2003; Kellis et al. 2003). The species within each of these clades are evolutionarily close enough that noncoding as well as coding sequences can be aligned, yet distant enough that genomic comparisons reveal clear signatures of natural selection. In addition, the generally similar physiology, behavior, and genetics of the organisms within each clade help to facilitate comparative analyses. Comparative genomic analyses of sequenced clades have, among other things, allowed for the identification of new genes, regulatory elements, noncoding RNAs, and conserved sequences of unknown function (e.g., Guigó et al. 2003; Kellis et al. 2003; Bejerano et al. 2004; Siepel et al. 2007; Stark et al. 2007); shed light on duplication and rearrangement histories (Murphy et al. 2005; Jiang et al. 2007); produced refined phylogenies (Thomas et al. 2003; Murphy et al. 2007); and enabled the detection of rapidly evolving coding and noncoding sequences (Clark et al. 2003; Pollard et al. 2006).

In plants, however, comparable sequenced clades have not yet emerged. The main embryophytic (land-plant) species that have been fully sequenced, Arabidopsis thaliana (Arabidopsis Genome Initiative 2000), Oryza sativa (Goff et al. 2002; Yu et al. 2002), Medicago truncatula (Cannon et al. 2006), and Populus trichocarpa (Tuskan et al. 2006), have been selected primarily for their individual importance as model species or agricultural crops, rather than for their value in comparative genomics. These genomes are sufficiently distant from one another that they generally do not align outside of coding regions. In addition, each genome has been considerably scrambled with respect to the others by millions of years of rearrangement, duplication, insertion, and deletion, further complicating comparative analyses. Consequently, with a few exceptions (Inada et al. 2003; Ma and Bennetzen 2004; Haberer et al. 2006; Freeling et al. 2007; Thomas et al. 2007), comparative genomic studies of plants have largely focused on content of protein-coding genes and repetitive elements (Ku et al. 2000; Quiros et al. 2001; Song et al. 2002; Ilic et al. 2003), rather than on the kind of detailed analysis of orthologous functional elements that has been possible in animals.

Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AF273333 and EF517791–EF5177914.

1 These authors contributed equally to this study.

2 Corresponding author: 101 Biotechnology Bldg., Cornell University, Ithaca, NY 14853. E-mail: [email protected]

Genetics 180: 391–408 (September 2008)

Moreover, comparative studies of plant genomes so far have largely dealt with species that have experienced recent whole-genome duplications (WGDs) (Ku et al. 2000; Quiros et al. 2001; Song et al. 2002; Ilic et al. 2003; Zhu et al. 2003). These studies have revealed striking differences between species in genome organization, perhaps induced by the massive genetic redundancy created by WGD (Lynch and Conery 2003; Sémon and Wolfe 2007). However, they leave open the question of how plant genomes evolve in the absence of WGD, and they complicate comparisons with animal genomes, in which WGD is much less common (Otto and Whitton 2000). Furthermore, WGD creates additional challenges in comparative genomics, by producing dramatic differences in genome size and number of genes, many-to-many relationships among orthologs, and frequent disruptions in synteny.

The Solanaceae are highly important among flowering plants that have diversified in the absence of WGD. The Solanaceae family comprises >3000 species, including aquatic plants, desert dwellers, trees, ornamentals, and familiar crops such as tomato, potato, and pepper. It ranks third among plant families in economic importance, it is the most valuable in terms of vegetable crops, and it includes important model systems for fruit development (tomato and pepper), tuber development (potato), plant defense (tomato and tobacco), and anthocyanin pigments (petunia). Despite their great phenotypic diversity, all Solanaceae derived ~40 million years ago (MYA) from an ancestral diploid with x = 12 chromosomes, and nearly all family members have maintained this chromosome number (Wikström et al. 2001; Wu et al. 2006). Moreover, members of the related family Rubiaceae (coffee family) are also diploid with x = 11 or x = 12, implying that any WGD in the history of the Solanaceae and the Rubiaceae occurred before their divergence ~85 MYA (F. Wu and S. D. Tanksley, unpublished data). Comparative genomics of the Solanaceae has been an active area of research for two decades (Tanksley et al. 1988). In addition, an international project is underway to sequence the full euchromatic portion of the tomato genome (http://www.sgn.cornell.edu/about/tomato_sequencing.pl), and a well-developed bioinformatics infrastructure with strong support for comparative analyses is available (Mueller et al. 2005). While several recent studies have included comparative analyses of solanaceous ESTs (Van Der Hoeven et al. 2002; Ronning et al. 2003; Blanc and Wolfe 2004a; Rensink et al. 2005), so far a large-scale comparative analysis of genomic sequences in the Solanaceae has not been possible.

Here we report a comparative analysis of a conserved syntenic segment (CSS) in the genomes of five Solanaceae species. We compare the previously sequenced 105-kb ovate-containing region from chromosome 2 of tomato (Ku et al. 2000) with newly sequenced orthologous regions of the potato, eggplant, pepper, and petunia genomes. This CSS is present as a single copy in all five species, and it contains 17 distinct genes with mostly conserved order and orientation. However, its general conservation is punctuated by numerous small-scale differences, due to nucleotide substitutions, insertions and deletions, tandem duplications of individual genes, inversions, and transpositions. Our detailed comparison of these sequences provides new insights into the evolutionary history of an important group of plants, the evolutionary dynamics of plant genomes in the absence of WGD, and the selective pressures experienced by both coding and noncoding functional elements.

MATERIALS AND METHODS

Identification and sequencing of BACs: The starting point for the study was a gene-rich BAC from the long arm of tomato chromosome 2 known to contain the ovate locus (BAC19, LE_HBa0106H06) (Ku et al. 2000). The 16 predicted genes from the BAC were first tested for copy number in the tomato genome via genomic Southern hybridization. Hybridization was carried out at 60° overnight using probes labeled with ³²P, followed by washes in 2× SSC for 20 min and in 1× SSC for 10 min. A hybridization probe for each gene was based on a single exon or two nearby exons (≥350 bp; see supplemental Table S1). The majority of the genes in the BAC were shown to be single copy by virtue of hybridization to a single restriction fragment in digests of at least one restriction enzyme. Several of these single-copy probes were genetically mapped on the high-density tomato genetic map (Frary et al. 2005) (http://www.sgn.cornell.edu/). All mapped to the expected position on chromosome 2.

Five of the single-copy probes (for genes 1, 5, 8, 15, and 17; see supplemental Table S2) were then used to screen BAC libraries for potato (Solanum bulbocastanum) (Song et al. 2000), eggplant (S. melongena cv. "black eggplant") (J. Vrebalov and J. Giovannoni, unpublished data), petunia (Petunia inflata) (McCubbin et al. 2000), and pepper (Capsicum annuum) (J. Vrebalov and J. Giovannoni, unpublished data). Positive BACs were confirmed by Southern hybridization on HindIII-digested BAC DNA and further selected using three additional probes (for genes 3, 12, and 14; supplemental Table S2) for maximum gene overlap with the tomato BAC. All probes were confirmed to be present in single copy in potato, eggplant, pepper, and petunia by Southern hybridization with genomic sequences digested by more than two restriction enzymes. On the basis of these results, a single BAC from each species, hybridizing to the maximum number of tomato gene probes, was selected for further analysis: tomato (106H06), potato (027I18), eggplant (077N19), pepper (215H17), and petunia (126I14) (Figure 1). These BACs were, respectively, 135, 105, 122, 106, and 139 kb in size.


Each BAC clone was shotgun sequenced to 10× coverage and assembled using Phred and Phrap with default parameters. Gaps and low-quality regions were finished with sequences from PCR products to obtain final assemblies with minimum quality scores of 25. These were checked by BAC end sequences and by comparing virtual (electronic) and empirical (lab) restriction digests using HindIII, EcoRI, and BamHI.

Alignment and annotation of BAC sequences: A multiple alignment for the entire CSS was constructed from the assembled sequences using BLASTZ (Schwartz et al. 2003) and the threaded blockset aligner (TBA) (Blanchette et al. 2004). Before processing by TBA, pairwise BLASTZ alignments were filtered by the University of California, Santa Cruz (UCSC), alignment "chaining" and "netting" pipeline (Kent et al. 2003), which uses conserved synteny to help ensure that orthologous sequences are aligned. The chains, nets, and multiple alignment were displayed and manually inspected in a "Solanaceae Genome Browser" based on the UCSC platform (http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?db=sol1).

Ab initio gene predictions were obtained for each BAC, using four different computational gene-finding programs: FGENESH (Solovyev et al. 1994), GenemarkHMM (Borodovsky and McIninch 1993), Genscan+ (Burge and Karlin 1997), and GlimmerM (Salzberg et al. 1998) (Arabidopsis training data set; Wortman et al. 2003). Independently, each BAC was screened against a large Solanaceae EST database (239,593 tomato ESTs, 134,365 potato ESTs, 3181 eggplant ESTs, and 20,738 pepper ESTs; http://www.sgn.cornell.edu/) and against the Arabidopsis proteome. An initial set of gene annotations was defined with the requirement that each gene be supported by at least two computational gene predictions, at least one solanaceous EST (with >95% identity over 80% sequence length if from the same species or BLASTN E-value <10⁻¹⁰ if from another species), or at least one Arabidopsis protein (tBLASTX E-value <10⁻¹⁰). These candidate gene structures were then evaluated for cross-species support, using the clean_genes program [part of the PHAST package (Siepel et al. 2005)], and were inspected manually in the Solanaceae Genome Browser alongside the multiple alignments, EST alignments, alignments of full-length mRNA sequences from GenBank, and protein alignments. This inspection turned up two apparent pseudogenes in tomato (both derived from gene 10; see results) and allowed for some minor refinements in the positions of splice sites, but otherwise supported the candidate predictions. Putative functions for genes were assigned on the basis of Arabidopsis homologs and predicted domains, where available (Table 1).
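The support rule for the initial annotation set can be sketched as a small predicate. This is only an illustration, not the authors' pipeline: the `Evidence` record and its field names are hypothetical, and reading "over 80% sequence length" as a coverage fraction of at least 0.80 is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical summary of the evidence collected for one candidate gene."""
    n_ab_initio: int = 0                                   # gene finders predicting it
    est_same_species: list = field(default_factory=list)   # (identity, coverage) fractions
    est_other_evalues: list = field(default_factory=list)  # BLASTN E-values, other species
    protein_evalues: list = field(default_factory=list)    # tBLASTX E-values vs. Arabidopsis


def is_supported(ev: Evidence) -> bool:
    """Apply the 'two predictions, or one EST, or one protein' rule from the text."""
    by_prediction = ev.n_ab_initio >= 2
    # Assumption: the 80% length criterion is treated as coverage >= 0.80.
    by_est = (
        any(ident > 0.95 and cov >= 0.80 for ident, cov in ev.est_same_species)
        or any(e < 1e-10 for e in ev.est_other_evalues)
    )
    by_protein = any(e < 1e-10 for e in ev.protein_evalues)
    return by_prediction or by_est or by_protein
```

Any one of the three kinds of evidence is sufficient, which is why the three tests are combined with a logical OR.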

Repeat elements and low-complexity sequences within all sequences were soft masked using TRF (Benson 1999) and RepeatMasker (http://www.repeatmasker.org). A custom RepeatMasker library was produced by concatenating repeat elements from the Solanaceae Genomics Network Repeat Database (http://www.sgn.cornell.edu), TIGR plant repeats (http://www.tigr.org/tdb/e2k1/plant.repeats), Munich Information Center for Protein Sequence (MIPS) plant repeats (http://mips.gsf.de/proj/plant/webapp/recat), and plant repeats within the RepeatMasker library. The annotated repeats are displayed in the RepeatMasker and Simple Repeats tracks in the Solanaceae Genome Browser.

Alignments of orthologous coding regions: For the 12 multispecies genes (Table 2), multiple alignments of orthologous protein-coding DNA sequences were extracted from the CSS-wide multiple alignment by concatenating the segments corresponding to the exons of each gene, as defined by the tomato gene models. Manual inspection suggested that no realignment was needed. For use in the analyses based on codon models, a hand-curated version of these alignments was created without frameshifts or stop codons. These alignments were truncated at frame-shifting insertions and deletions (indels) or premature stops near the 3′ ends of genes (Table 2), and any out-of-frame sequences between compensatory frame-shifting indels were masked by replacing them with N's. For the estimation of dates of divergence, the closest Arabidopsis ortholog of each gene was incorporated into these alignments (excluding gene 17), and for the analysis of gene trees, all paralogs and putative homologs in Arabidopsis and rice (based on TBLASTX matches and data reported by Ku et al. 2000) were added. These expanded alignments were created by aligning predicted peptide sequences with T_Coffee (Notredame et al. 2000) and then reverse translating to DNA sequences. They were also truncated at premature stop codons as necessary.
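Two of the curation steps described above, truncation at a premature stop and masking of an out-of-frame stretch with N's, can be illustrated on a single ungapped coding sequence. This is a minimal sketch under stated assumptions: the function names are hypothetical, the standard stop codons are assumed, and the original curation was done by hand on gapped multiple alignments rather than with code like this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear stop codons (assumed)


def truncate_at_stop(cds: str) -> str:
    """Return the sequence cut just before the first in-frame stop codon.

    If no stop codon is found, any trailing partial codon is dropped so the
    result is always a whole number of codons.
    """
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return cds[:i]
    return cds[: len(cds) - len(cds) % 3]


def mask_region(cds: str, start: int, end: int) -> str:
    """Replace cds[start:end] (0-based, end-exclusive) with N's."""
    return cds[:start] + "N" * (end - start) + cds[end:]
```

For example, `truncate_at_stop("ATGAAATAGGGG")` keeps only the two codons before the in-frame `TAG`, and `mask_region("ATGCCCGGG", 3, 6)` hides the middle codon while preserving the reading frame of what follows.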

Phylogenetic analysis: Gene trees were estimated by maximum likelihood using PhyML (Guindon and Gascuel 2003). In all cases the inferred tree topology was consistent with the species phylogeny shown in Figures 3 and 4, which is in agreement with previous phylogenetic studies of the Solanaceae (Olmstead and Palmer 1997; Olmstead et al. 1999). In an initial analysis, the two petunia copies of gene 12 grouped together, as expected, but the two pepper copies of gene 7 did not. However, a follow-up analysis using codon models supported a topology for gene 7 in which the two pepper genes grouped together (Figure 7A), as expected from the copy numbers of this gene in the different species.

Maximum-likelihood estimates of dN, dS, and ω = dN/dS were obtained using the codeml program (Yang 1997), with F3×4 codon frequencies, equal amino acid distances (aaDist = 0), a single ω across sites and across branches (model = 0, NSsites = 0), and the tree topology of Figure 3. Estimates were obtained separately for each of the multispecies genes, using the hand-curated alignments of coding regions, and for a pooled data set in which all alignments were concatenated. Fourfold degenerate (4D) sites were extracted using msa_view (from PHAST) and substitution rates for these sites were estimated using phyloFit (also from PHAST) with the general reversible (REV) model (Tavaré 1986). For each type of site, only sequences with three or more aligned orthologous sequences, including tomato, were included in the analysis.
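The codeml settings named in this paragraph correspond to entries in a codeml control file. The helper below is a hedged sketch that writes such a file as text; the function name and the file names passed to it are hypothetical, `seqtype = 1` (codon sequences) and `CodonFreq = 2` (the F3×4 model) follow standard PAML conventions, and any options the authors set beyond those stated in the text are unknown and left out.

```python
def codeml_ctl(seqfile: str, treefile: str, outfile: str, clock: int = 0) -> str:
    """Build the text of a minimal codeml control file for a one-ratio model."""
    options = {
        "seqfile": seqfile,    # codon alignment (hypothetical path)
        "treefile": treefile,  # fixed tree topology (hypothetical path)
        "outfile": outfile,
        "seqtype": 1,          # 1 = codon sequences
        "CodonFreq": 2,        # 2 = F3x4 codon frequencies
        "aaDist": 0,           # equal amino acid distances
        "model": 0,            # one omega across branches
        "NSsites": 0,          # one omega across sites
        "clock": clock,        # 0 = no clock, 1 = global molecular clock
    }
    return "\n".join(f"{key:>10} = {value}" for key, value in options.items()) + "\n"
```

Calling `codeml_ctl("gene1.phy", "species.nwk", "gene1.out")` returns text that could be saved as a control file and passed to codeml; setting `clock=1` switches the same template to a global clock.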

Dates of divergence were estimated by applying codeml as above, but assuming a global molecular clock (clock = 1). The "fossil calibration" feature (Yang and Yoder 2003; see the "Global and Local Clocks" section of the PAML manual) was used to fix the Arabidopsis/Solanaceae (AS) divergence at the estimated dates of 110, 120, and 130 MYA, and the other divergence times were then estimated by maximum likelihood. The standard errors of the estimates were small compared to the uncertainty in the AS divergence date and were therefore ignored. The data do not strictly support the hypothesis of a global clock (likelihood-ratio test, LRT P = 7 × 10⁻¹⁵), but the branch length estimates were not dramatically altered by the assumption of a clock, and violations of this assumption are not expected to have a dramatic effect on the estimated dates.

The codeml program was also used to perform LRTs for positive selection, based on the branch-site model of Yang and Nielsen (2002) (model = 2, NSsites = 2). A separate LRT was performed for each gene and each branch of the tree by running codeml twice: once with fix_omega = 0 (alternative model) and once with fix_omega = 1, omega = 1 (null model). Nominal P-values were computed by assuming that twice the difference of the log likelihoods of these two models should have a null distribution that is a 50:50 mixture of a χ²-distribution and a point mass at zero (Zhang et al. 2005). These P-values were then corrected for multiple comparisons, using the method of Benjamini and Hochberg (1995).
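The P-value calculation and multiple-testing correction can be sketched as follows. Two points are assumptions rather than statements from the text: the χ² component is taken to have one degree of freedom (the usual choice for this branch-site test), and the function names are illustrative. The survival function of a χ² variable with one degree of freedom is computed as erfc(√(x/2)), which avoids any dependency beyond the standard library.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> float:
    """P-value under a 50:50 mixture of chi-square(1 d.f.) and a point mass at 0."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0:
        return 1.0  # the point mass at zero absorbs non-positive statistics
    # Chi-square survival function with 1 d.f. is erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one `lrt_pvalue` would be computed per gene and branch from the two codeml log likelihoods, and the full list would then be passed once to `benjamini_hochberg`.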


Proportion of nucleotide sites under selection: Conservation scores were produced using the program phastOdds (from PHAST) in sliding windows of 5, 10, and 15 bp. PhastOdds computes log-odds scores for each base, comparing a phylogenetic model of conserved evolution with a model of nonconserved evolution. The scores were averaged within windows. Specifically, the score S_i for a window of size d and radius r = ⌊d/2⌋, centered at position i, was computed as

\[ S_i = \frac{1}{d} \sum_{j=i-r}^{i-r+d-1} \left[ \log P(X_j \mid \psi_c) - \log P(X_j \mid \psi_n) \right], \qquad (1) \]

where X_j is the jth column of the multiple alignment, ψ_c is a phylogenetic model for conserved sites, ψ_n is a phylogenetic model for nonconserved sites, and P(X_j | ψ_x) is computed by Felsenstein's pruning algorithm (Felsenstein 1981). (A similar scoring procedure is described in more detail by Siepel et al. 2005.) The models ψ_c and ψ_n were estimated using the phastCons program with the REV model (see below). An alternative analysis in which ψ_c was estimated from coding exons and ψ_n was estimated from fourfold degenerate sites produced nearly identical results. Alignment gaps were treated as missing data. Neutral scores were obtained by concatenating alignment columns from 4D sites into a pseudoalignment, randomly permuting them, and then applying phastOdds to this alignment.
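The window averaging in Equation 1 can be illustrated with a short function. It assumes that the per-column log-odds values, log P(X_j | conserved model) minus log P(X_j | nonconserved model), have already been computed (in the paper this is done by phastOdds); the function name and the dictionary return type are illustrative choices.

```python
def window_scores(log_odds, d):
    """Average per-column log-odds over windows of size d, as in Equation 1.

    A window centered at i covers columns i - r through i - r + d - 1, with
    r = d // 2. Only centers whose window fits entirely inside the alignment
    are scored. Returns a dict mapping the center index i to S_i.
    """
    r = d // 2
    n = len(log_odds)
    scores = {}
    for i in range(r, n - d + r + 1):
        window = log_odds[i - r : i - r + d]
        scores[i] = sum(window) / d
    return scores
```

With `log_odds = [1, 2, 3, 4, 5]` and `d = 3`, the centers 1, 2, and 3 receive scores 2.0, 3.0, and 4.0. For an even window size the window is not symmetric about i, which follows directly from the summation limits in the equation.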

The complete distribution of conservation scores, f_all, was modeled as a mixture of neutral and selected components, f_all(S) = p_n f_n(S) + p_s f_s(S), with mixture coefficients p_n and p_s (0 ≤ p_n, p_s ≤ 1; p_n + p_s = 1). The distributions f_all and f_n were obtained by Gaussian kernel density estimation from the sets of all scores and of neutral scores, respectively, excluding sites with bases from fewer than three species. The density function in R was used with kernel = "gaussian," bandwidth (bw) of 0.15, 0.20, or 0.25, and n (the number of points) of 1024. The lower bound for p_s was then estimated as p_s = 1 − min_S[f_all(S)/f_n(S)], as described by Chiaromonte et al. (2003). In this minimization, only scores S with f_all(S) > δ and f_n(S) > δ (for small positive δ) were considered, to avoid distortion from regions of sparse data. Values of δ between 0.0001 and 0.01 produced essentially identical results. To estimate confidence intervals, bootstrap resampling both of all sites and of neutral sites was performed. Kernel density estimation and estimation of p_s were performed for each of 1000 samples, and the 0.025 and 0.975 quantiles of the estimates of p_s were taken as 95% confidence intervals. Estimates of p_s were converted to estimates of γ by multiplying them by the fraction of bases that were aligned in three or more species (here 0.726), under the conservative assumption that unaligned bases are not under selection.
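The lower-bound estimate of the selected fraction can be sketched in plain Python. This is an illustration of the idea rather than a reproduction of the analysis: the paper used R's `density` function, whereas the code below uses a hand-written Gaussian kernel estimator evaluated on an evenly spaced grid, and the function names and default arguments are assumptions.

```python
import math


def gaussian_kde(samples, bw):
    """Return a Gaussian kernel density estimate with kernel s.d. bw."""
    norm = 1.0 / (len(samples) * bw * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in samples)

    return density


def selected_fraction_lower_bound(all_scores, neutral_scores, bw=0.20,
                                  n_points=1024, delta=0.001):
    """Estimate 1 - min_S f_all(S) / f_n(S) over a grid of scores.

    Grid points where either density is at or below delta are skipped, to
    avoid ratios dominated by sparse data. Raises ValueError if no grid
    point passes that filter.
    """
    f_all = gaussian_kde(all_scores, bw)
    f_n = gaussian_kde(neutral_scores, bw)
    lo = min(min(all_scores), min(neutral_scores))
    hi = max(max(all_scores), max(neutral_scores))
    step = (hi - lo) / (n_points - 1)
    ratios = []
    for k in range(n_points):
        s = lo + k * step
        fa, fn = f_all(s), f_n(s)
        if fa > delta and fn > delta:
            ratios.append(fa / fn)
    return 1.0 - min(ratios)
```

The bootstrap confidence interval described in the text would wrap this function: resample both score sets with replacement, recompute the estimate each time, and take the 0.025 and 0.975 quantiles of the results.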

The posterior probability that each window W_i with score S_i is under selection was computed as

\[ \begin{aligned} P(Z_i = 1 \mid S_i) &= 1 - P(Z_i = 0 \mid S_i) \\ &= 1 - \frac{P(Z_i = 0)\, P(S_i \mid Z_i = 0)}{P(S_i)} \\ &= 1 - (1 - p_s)\, \frac{f_n(S_i)}{f_{\mathrm{all}}(S_i)}, \end{aligned} \qquad (2) \]

where Z_i is a random variable equal to 1 if W_i is under selection and equal to 0 otherwise (Chiaromonte et al. 2003). On the basis of the gene models and EST/mRNA data, each base was assigned to one of four annotation classes (see Figure 6), and each window was assigned to the class containing the largest number of its bases. These posterior probabilities were then used to compute expected fractions of windows that are under selection within each class and expected fractions of all selected windows that come from each class.
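Equation 2 and the per-class summaries can be sketched as two small functions. The names are hypothetical, the density values are assumed to come from the kernel estimates described earlier, and clamping the result to the interval [0, 1] is an added safeguard against noisy density estimates rather than something stated in the text.

```python
def posterior_selected(p_s, f_n_at_s, f_all_at_s):
    """Posterior probability that a window is under selection (Equation 2)."""
    post = 1.0 - (1.0 - p_s) * f_n_at_s / f_all_at_s
    # Assumption: clamp to [0, 1] in case the density estimates are noisy.
    return min(1.0, max(0.0, post))


def expected_fraction_selected(posteriors_by_class):
    """Mean posterior per annotation class.

    posteriors_by_class maps a class label to the list of posterior
    probabilities of the windows assigned to that class.
    """
    return {
        label: sum(values) / len(values)
        for label, values in posteriors_by_class.items()
        if values
    }
```

Because the posterior of each window is the expected value of its indicator Z_i, the mean posterior within a class is the expected fraction of that class's windows under selection. Dividing each class's summed posteriors by the grand total would give the complementary quantity, the expected share of selected windows contributed by each class.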

To test the sensitivity of the analysis to the alignment methods, all steps were repeated using an alternative alignment constructed by the Pecan program (B. Paten, K. Beal and E. Birney, unpublished data; http://www.ebi.ac.uk/~bjp/pecan/). This analysis produced very similar results, but slightly higher estimates of γ (see supplemental material).

Coding indels: A history of indels was reconstructed by parsimony from the alignments of orthologous coding regions, using the program indelHistory (from PHAST). The inferred events were classified as insertions on particular branches of the phylogeny, deletions on particular branches, or ambiguous indels (with no outgroup data). Normalized indel rates for each gene were computed by dividing the estimated number of indels in that gene by the product of its length and the total neutral (dS) branch length of the phylogeny of the available species (with dS values as shown in Figure 3A). This normalization corrects for gene length (more indels are expected in longer genes), branch length (more indels are expected on longer branches), and differences in the sets of species represented at each gene (more indels can be observed where there are more data). The normalized indel rates were multiplied by 100 and expressed in units of indels per 100 neutral substitutions. A similar normalization was used to compare indel rates on different branches of the tree.
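The normalization described above reduces to a single expression, sketched here with a hypothetical function name. The numbers in the usage note are invented for illustration and are not values from the study.

```python
def normalized_indel_rate(n_indels, gene_length_bp, total_ds_branch_length):
    """Indels per 100 neutral substitutions.

    gene_length_bp * total_ds_branch_length is the expected number of neutral
    substitutions in the gene summed over the tree, so the ratio is indels
    per neutral substitution; the factor of 100 rescales it.
    """
    return 100.0 * n_indels / (gene_length_bp * total_ds_branch_length)
```

For instance, 5 indels in a 1000-bp gene on a tree with a total dS branch length of 0.5 substitutions/site gives `normalized_indel_rate(5, 1000, 0.5) == 1.0`, that is, one indel per 100 neutral substitutions.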

Identification and characterization of conserved elements: Conserved elements were identified with phastCons, after tuning the parameters γ and ω to obtain 60% coverage of the annotated coding regions by conserved elements (see Siepel et al. 2005). All parameters were estimated from the data (including substitution model parameters, the branch lengths of the tree, and the scaling parameter ρ). The expected minimum length of a detectable conserved element was estimated as described by Siepel et al. (2005). Elements under lineage-specific selection were identified with DLESS and significance was assessed with phyloP (Siepel et al. 2006). Predictions with P ≥ 0.05 were discarded. The model estimated from 4D sites was used as the neutral model, and the tuning parameters were set to their default values. All phastCons and DLESS elements were analyzed with RNAz (Washietl et al. 2005), searched against the RFAM database (Griffiths-Jones et al. 2005) with INFERNAL (Eddy 2002), and examined for pre-miRNA and snoRNA structures with RNAmicro (Hertel and Stadler 2006) and snoReport (Hertel et al. 2008), respectively.

Known binding-site sequences from solanaceous plants were collected from the TRANSFAC (Wingender et al. 1996) and PLACE (Higo et al. 1999) databases, as well as from various sources in the primary literature. These included 17 transcription factors (TFs) with three or more independent sites. Position-specific score matrices (PSSMs) were derived for these 17 TFs by standard methods (supplemental Figure S4). The noncoding portion of the tomato genome was scanned for significant matches to each of these PSSMs, by computing log-odds scores with respect to a third-order Markov background model (estimated from all noncoding regions) and retaining all predictions with empirical P < 1.5 × 10⁻⁴ (as assessed by simulation from the background model). These predictions are displayed in the "Motif Predictions" track of the Solanaceae Browser.
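A simplified version of PSSM construction and scanning is sketched below. It departs from the described method in two labeled ways: the background is a zero-order (single-base) model rather than the third-order Markov model used in the paper, and hits are kept by a fixed score cutoff rather than by an empirical P-value from simulation. The function names, the pseudocount, and the example sites are all illustrative.

```python
import math

BASES = "ACGT"


def build_pssm(sites, background, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length binding sites.

    background maps each base to its frequency (zero-order model; this is a
    simplification of the third-order background described in the text).
    """
    width = len(sites[0])
    pssm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pssm


def scan(sequence, pssm, threshold):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(column[b] for column, b in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits
```

As a usage example, building a matrix from the invented sites `["TATA", "TATA", "TAAA"]` with a uniform background and scanning `"GGTATAGG"` with a cutoff of 4.0 reports a single hit at position 2. Replacing the fixed cutoff with a score quantile obtained from sequences simulated under the background model would bring the sketch closer to the empirical P-value procedure described in the text.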

RESULTS

Sequences, annotations, alignments, and genome browser: BACs corresponding to a CSS from tomato chromosome 2 were isolated from tomato, potato, eggplant, pepper, and petunia and were sequenced (Figure 1). Gene annotations for each sequenced BAC were prepared by a combination of computational and manual methods, and a multiple alignment of all sequences was constructed, using methods that exploited the conserved synteny of the region to ensure that orthologous sequences were aligned (see materials and methods). Because the tomato sequence is the most complete and best annotated (due to reasonably extensive mRNA and EST data for tomato), it was selected as the reference sequence for the multiple alignment and was used as the main source of gene annotations. The BAC sequences were also annotated with the positions of transposable elements, simple sequence repeats, conserved elements, known regulatory motifs, and other features, as discussed below. Nearly all coding bases, and most noncoding bases, are aligned in the region (supplemental material; supplemental Table S3). The sequences, alignments, and annotations are displayed in a publicly available Solanaceae Genome Browser based on the UCSC platform (Kent et al. 2002) (Figure 2; http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?db=sol1).

Conservation of gene content, order, and structure: The CSS contains 17 distinct genes, which are generally well conserved across species (Figure 1 and Table 1). However, small-scale duplications and losses have resulted in some differences in gene content. For example, gene 9 is found in the same position and orientation in potato, eggplant, and petunia but is absent in tomato and pepper (Figure 1). An examination of the phylogenetic tree of these species (Figure 3) indicates that this gene must have been lost independently in both the tomato and the pepper lineages.

Gene 10 has the same position and orientation in all species but tomato. In its place in the tomato genome are two apparent pseudogenes, both aligning to portions of gene 10 from the other species. One pseudogene (tomato.10p) has what appears to be the ancestral position and orientation, while the other (tomato.10p′) is found in the same orientation but ~3 kb upstream. Moreover, these two copies have a large (~500-bp) region of similarity, suggesting that tomato.10p′ arose from the ancestral gene by a partially duplicative transposition. On the basis of the degree of divergence of these two sequences (~12%), this event appears to have occurred soon after the separation of tomato and potato, ~6 MYA. It is possible that this rearrangement is an example of transposon-mediated exon shuffling, as observed in grass genomes (Bennetzen 2007), but no known transposon was identified in the immediate vicinity.

Two other genes have apparently undergone gene expansion via tandem duplication: gene 7, which is present in two adjacent copies in pepper, and gene 12, which is present in two adjacent copies in petunia (Figure 1). In both cases, the duplicate copies are present in the same orientation, consistent with the mechanisms of tandem duplication. These duplications are lineage specific, and phylogenetic trees estimated by maximum likelihood suggest that they occurred relatively recently (see below). Both copies of both genes have intact open reading frames.

Gene order, like gene content, is largely conserved within the CSS. The only major exception is that the order and the orientation of genes 15 and 16 are reversed in petunia relative to tomato, potato, and eggplant, apparently the result of an inversion of ~20 kb. The petunia inversion is likely to be a derived condition as the tomato/potato/eggplant configuration for these genes is shared with Arabidopsis (Figures 1 and 2). There is also a much smaller inversion of ~800 bp in potato (Figure 2). Despite these differences, the CSS shows considerably more conservation than previously observed in comparisons of plant genomes (e.g., Ku et al. 2000; Song et al. 2002; Ilic et al. 2003), perhaps in part due to the absence of WGD in the evolution of the Solanaceae (see discussion).

Figure 1.—Conserved syntenic segment (CSS) in five species of Solanaceae. The sequenced segments of the potato, tomato, pepper, eggplant, and petunia genomes are shown alongside corresponding regions of the Arabidopsis (At) genome. All annotated genes and several pseudogenes are shown, with arrows indicating the direction of transcription and red dashed lines connecting putative orthologs. For At1, At3, and At5-A in Arabidopsis, zigzag lines indicate intervening genes that are not shown.

On the basis of the multiple alignment for the region, we analyzed the open reading frames (ORFs) and exon–intron structures of orthologous genes within the CSS, focusing on the 12 genes that were present in tomato and sequenced in at least two other species (henceforth, the multispecies genes; Table 2). Seven of the 12 multispecies genes have well-conserved ORFs and exon–intron structures, with aligned start/stop codons and splice sites, no premature stop codons, and no frameshift indels (Table 2). Only one gene, gene 14, displays extensive disruptions to its ORF, primarily owing to a long (748 bp) CT-rich low-complexity region near its 3′ end, which contains several frame-shifting indels. The remaining four genes show minor changes in ORFs or structure, including compensatory frameshift indels, shifts in splice-site position, and frameshift indels near their 3′ ends that cause changes in stop codon position. Thus, the ORFs and exon–intron structures in this region have generally been fairly well conserved through the evolution of the Solanaceae, but more than a third of genes do show differences among species.

Rates of neutral substitution: We used synonymous substitutions within the coding regions of the multispecies genes to estimate rates of neutral substitution in the Solanaceae. For this analysis, we pooled data from all 12 multispecies genes (Table 2), but excluded regions in which gene structure or reading frame was not conserved across species. The tree topology of Figure 3 was assumed, on the basis of previous studies and our own analysis (materials and methods). Using the codon model of Yang et al. (1998), we obtained estimates of dS for each branch that summed to 0.78 synonymous substitutions per synonymous site (Figure 3A). By extracting just the 4D sites (which have no effect on the encoded amino acid) and estimating branch lengths under the REV model of nucleotide evolution (Tavare 1986), we arrived at a similar estimate of 0.73 substitutions/site (Figure 3C). The total neutral divergence of these Solanaceae genomes, therefore, is comparable to the neutral divergence within each of the groups of mammalian, Drosophila, Caenorhabditis, and Saccharomyces genomes that have recently been widely analyzed (e.g., Siepel et al. 2005). Indeed, it is nearly identical to the neutral divergence of the six eutherian mammalian genomes (human, chimpanzee, macaque, mouse, rat, and dog) that have been completely sequenced at present (0.83 substitutions/site, as estimated from 4D sites in the ENCODE regions; Margulies et al. 2007). Thus, many of the analytical methods and computational tools that have been developed for comparative genomics of mammals should be well suited for these plant genomes.

Dates of divergence: While there have been several detailed studies of the phylogeny of the Solanaceae (Olmstead and Palmer 1992; Olmstead et al. 1999), there are, to our knowledge, no published estimates of absolute dates of divergence throughout the family. However, good estimates are available for the date of the Arabidopsis/Solanaceae (rosid/asterid) divergence. By conditioning on these dates and assuming a molecular clock within the Solanaceae, we were able to obtain an approximate timescale for our five-species phylogeny.

In recent studies of divergence dates throughout the angiosperms, the AS divergence was estimated at 125 ± 5 MYA (Wikstrom et al. 2001) and ~115 MYA (Magallon and Sanderson 2005). These studies were based on nonparametric rate smoothing and/or penalized likelihood methods, with multiple dates from the fossil record as calibration points. Earlier estimates of the AS divergence ranged from 112 to 156 MYA (Yang et al. 1999), but the fossil record for tricolpate pollen (present only in eudicots, which include rosids and asterids) strongly suggests that the AS divergence occurred ≤125 MYA (Bell et al. 2005). Here we assume a fairly inclusive range of 110–130 MYA for the AS divergence and use an intermediate value of 120 MYA as a point estimate.

We incorporated the closest Arabidopsis ortholog for each gene (Figure 1) into the multiple alignments for 11 of the multispecies genes (excluding gene 17, which has two equally distant orthologs in Arabidopsis). Phylogenies estimated separately for each gene strongly suggested that the identified Arabidopsis genes were truly orthologs and not more distant ‘‘outparalogs’’ (Sonnhammer and Koonin 2002), because they showed fairly consistent levels of divergence from the Solanaceae in all cases. On the basis of a pooled data set including all genes, we estimated dates of divergence on the basis of the method of Yang and Yoder (2003), assuming a

Figure 2.—The Solanaceae Genome Browser. (A) The 105-kb region from the tomato genome (the reference for the browser) contains 15 genes and two pseudogenes (labeled tomato.10pa and tomato.10pb). Tan bars represent alignment nets, showing large blocks of conserved synteny. An inversion in the Petunia genome (blue arrowhead) can be seen in the area outlined in blue. A smaller inversion in potato is marked by a gold asterisk. (B) Inset of the tomato.12 gene alongside supporting EST evidence. Peaks in the conservation track (red arrowhead) denote areas of cross-species conservation. Black and gray bars below the conservation plot represent alignment blocks from each species; darker bars indicate greater sequence identity. Black bars in the bottom row represent conserved elements identified by phastCons. Gold arrows indicate conserved noncoding sequence. (C) Putative tomato.15 enhancer, located 2 kb upstream of the gene, with six highly conserved predicted regulatory motifs (green bars). (D) Tomato.11 exon 12 is identified as under lineage-specific selection in petunia and is also significantly diverged in tomato. Petunia-specific substitutions are highlighted in orange. Tomato-specific substitutions are highlighted in green and the blue box denotes the location of a tomato-specific 3-bp deletion.

TABLE 1

Summary of genes in the CSS

| Gene | Exons | Length^a | Function | Annotations | Domains | Expression | Arabidopsis homologs |
|---|---|---|---|---|---|---|---|
| 1 | 21 | 1071 | Signaling | — | Cyclin-like F-box | Flowers, floral buds and shoots (S. lycopersicon) | |
| 2 | 1 | 358 | — | — | F-box | — | — |
| 3 | 3 | 425 | Transcription factor | TFIIB protein | TFIIB | — | AT4G36650.1 |
| 4 | 2 | 163 | — | — | Peptidase-M52 | Ovary (S. lycopersicon), floral buds and roots (S. tuberosum) | At4g36660.1, At5g65650.1 |
| 5 | 3 | 373 | — | — | — | Floral buds and flowers (S. lycopersicon) | At1g75160.1, At5g66740.1, At5g05840.1 |
| 6 | 2 | 352 | Regulatory signaling | Tomato Ovate protein | Bipartite nuclear-localization signal, Von Willebrand factor type C, conserved C-terminal domain | Floral buds and fruits (S. lycopersicon) | At2g18500.1 |
| 7 | 4 | 500 | Catalytic | Homology to adenylosuccinate synthetase | — | Leaves and roots (S. lycopersicon) | At3g57610.1 |
| 8 | 1 | 167 | — | Homology to membrane-associated salt-inducible protein | Pentatricopeptide domain | Floral buds, flowers, fruits (S. lycopersicon), and leaf trichomes (S. pennellii) | At3g57785.1, At2g42310.1 |
| 9 | 3 | 202 | — | Similar to vacuolating cytotoxin (vacA) | — | Roots, fruits, mixed tissues | At1g45170.1, At5g42960.1 |
| 10 | 2 | 431 | — | — | — | Leaf trichomes (S. pennellii) | At4g36680.1, At2g18520.1 |
| 11 | 12 | 655 | Transcription | Nucleus, transcription, DNA-directed RNA polymerase, catalytic activity | Sin-like protein conserved domain, IMP dehydrogenase/GMP reductase | Fruits (S. lycopersicon), leaf trichomes (L. hirsutum) | At5g49530.1 |
| 12 | 13 | 558 | RNA metabolism | Homology to tobacco U2 snRNP large subunit | U2-snRNP auxiliary factor, large subunit, splicing factor, RNA-binding region RNP-1, IMP dehydrogenase/GMP reductase | — | At4g36690.1, At1g60900.1 |
| 13 | 4 | 527 | Transcription factor | — | Zinc finger C2H2-like | Leaf trichomes (S. pennellii) | At5g66730.1 |
| 14 | 4 | 679 | Stress response | Dehydrin protein | — | Fruits (S. lycopersicon), roots (S. tuberosum) | At2g18540.1, At4g36700.1 |
| 15 | 1 | 536 | Transcription factor | Scarecrow-like transcription factor | GRAS transcription factor | Leaves and fruit pericarps (S. lycopersicon) | At4g36710.1 |
| 16 | 11 | 352 | Transcription factor | G-box binding protein | Basic leucine zipper, G-box binding | — | At4g36730.1 |
| 17 | 31 | 1951 | Signaling | DSK2 protein | Serine–threonine protein kinase, tyrosine protein kinase | Fruits (S. lycopersicon) | At5g66710.1, At3g27560.1, At3g50730.1, At5g50180.1 |

^a Sizes in amino acid residues.


codon model and a molecular clock. The AS divergence was fixed at the point estimate of 120 MYA, the lower limit of 110 MYA, or the upper limit of 130 MYA. This analysis yielded estimates of 6.2 MYA (range 5.1–7.3 MYA) for the divergence of tomato and potato, 13.7 MYA (11.6–16.0 MYA) for tomato and eggplant, 19.1 MYA (16.2–22.2 MYA) for tomato and pepper, and 31.2 MYA (26.9–35.9 MYA) for tomato and petunia (Figure 4; Table 3). Thus, these species appear to have diversified over the course of ~30 MY, with major splits occurring in intervals of ~5–10 MY. For several reasons (including the dependency on clock assumptions, the relatively small number of genes considered, and the possibility of misidentification of Arabidopsis orthologs), the estimated dates should be considered rough approximations, but they do provide a general idea of the timeline for the diversification of the Solanaceae (see discussion).
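The way the three calibration dates propagate to the Solanaceae node ages can be illustrated with a short sketch. It assumes, as a simplification, that under a strict molecular clock with a single calibration point every node age scales in proportion to the assumed AS divergence. This is not a description of the Yang and Yoder (2003) implementation, and the function name `rescale_ages` is hypothetical. The point estimates are the 120-MYA means reported in Table 3.

```python
def rescale_ages(ages, reference_calibration, new_calibration):
    """Scale node ages proportionally when the calibration date changes."""
    factor = new_calibration / reference_calibration
    return {split: age * factor for split, age in ages.items()}

# Mean ages (MYA) of the MRCA with tomato at the 120-MYA calibration (Table 3).
point_estimates = {"potato": 6.2, "eggplant": 13.7, "pepper": 19.1, "petunia": 31.2}

lower_limit_ages = rescale_ages(point_estimates, 120.0, 110.0)
upper_limit_ages = rescale_ages(point_estimates, 120.0, 130.0)

# Means reported in Table 3 for the 110- and 130-MYA calibrations.
table3_lower = {"potato": 5.7, "eggplant": 12.6, "pepper": 17.5, "petunia": 28.6}
table3_upper = {"potato": 6.7, "eggplant": 14.9, "pepper": 20.7, "petunia": 33.8}
```

Under this assumption the rescaled values agree with the Table 3 means to within rounding of the published point estimates, which suggests that the spread between the lower-limit and upper-limit columns mainly reflects the choice of calibration date.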

Proportion of nucleotide sites under selection: We estimated the proportion of nucleotides in the CSS that have been subject to natural selection during the evolution of the Solanaceae (here denoted g) by the mixture-decomposition method that was used in the comparative analysis of the human and mouse genomes (Mouse Genome Sequencing Consortium 2002). This method requires only a conservation score for each nucleotide and the identification of a subset of nucleotides that are believed to be neutrally evolving. We used phylogenetic conservation scores computed by the phastOdds program in sliding windows of between 5 and 15 bp (see materials and methods) and treated the 1827 4D sites from the 12 multispecies genes as neutrally evolving (Table 2). The method decomposes the distribution of all conservation scores into a neutral component and a nonneutral, or ‘‘selected,’’ component, using the designated neutral sites as a guide. An estimate of g can be obtained from the estimated mixture coefficient for the selected component of the distribution, by conservatively assuming any unaligned bases are not under selection (materials and methods). The method is essentially nonparametric and does not require any assumptions about the selected distribution (which is likely to be complex). However, it yields only an estimate of a lower bound for g.

With 10-bp windows, this analysis yielded a lower-bound estimate of g = 0.337 (bootstrapping 95% C.I., 0.303–0.373; Figure 5 and supplemental Table S4). Estimates were slightly lower with 5-bp windows (0.322; 95% C.I., 0.283–0.364) and slightly higher with 15-bp windows (0.375; 95% C.I., 0.355–0.395), probably because the power to detect selection increases with window size. They also depended slightly on a smoothing parameter for the score distributions. (The values given here are based on an intermediate smoothing bandwidth of 0.20; see supplemental Table S4 for full results.) Nevertheless, all estimates were between ~0.3 and ~0.4, suggesting that at least about a third of the CSS has evolved under natural selection. Only ~18% of bases in the region fall in protein-coding regions, so at least roughly half of selected bases must have noncoding functions. Interestingly, ~5% of windows under selection (or ~1.5% of all windows) have lower, rather than higher, conservation scores than would be expected under neutrality (left peak of solid line in Figure 5). These windows may have been subject to positive selection during the evolution of the Solanaceae. Our estimates of g are expected to be conservative as lower bounds, although elevated rates of substitution in 4D sites could conceivably produce an upward bias (see discussion).
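The logic of the lower bound can be sketched as follows. The sketch assumes that both score distributions have been binned on a common grid, and that the neutral mixture coefficient is taken as the largest value that keeps the selected component nonnegative in every bin; this is one reading of the mixture-decomposition idea rather than the authors' code. The function name `selected_lower_bound` and the toy densities are hypothetical, and kernel smoothing of the score distributions is omitted.

```python
def selected_lower_bound(f_all, f_neutral, min_density=1e-9):
    """Return (p_s, selected) where p_s is a lower bound on the selected fraction.

    f_all and f_neutral are densities over the same score bins. Because the
    selected component p_s*f_s = f_all - p_n*f_n cannot be negative, p_n can be
    at most min(f_all / f_neutral) over bins where f_neutral is positive.
    """
    ratios = [a / n for a, n in zip(f_all, f_neutral) if n > min_density]
    p_n = min(1.0, min(ratios))
    p_s = 1.0 - p_n
    selected = [a - p_n * n for a, n in zip(f_all, f_neutral)]
    return p_s, selected

# Toy binned densities (hypothetical), each summing to 1.
toy_neutral = [0.5, 0.3, 0.2, 0.0]
toy_all = [0.25, 0.15, 0.30, 0.30]
toy_p_s, toy_selected = selected_lower_bound(toy_all, toy_neutral)

# With unaligned sites treated as not under selection, g is p_s scaled by the
# fraction of sites entering the score distribution. The reported values
# (p_s = 0.464 in the Figure 5 legend, g = 0.337) imply a fraction near 0.73.
implied_scored_fraction = 0.337 / 0.464
```

In the toy example half of the mass can be explained by the neutral distribution, so the lower bound on the selected fraction is 0.5. Any neutral sites that are in fact constrained would make the true selected fraction larger, which is why the estimate is a lower bound.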

To gain insight into the functional roles of the selected bases, we partitioned the sites in the CSS into four classes, namely coding (CDS), other transcribed (predominantly UTR), intronic, and intergenic sites, and examined the score distribution within each class. These classes of sites show clear differences in their scores, with coding sites being most conserved, intergenic sites being least conserved, and the other classes displaying intermediate levels of conservation (supplemental Figure S2), consistent with observations in other species. Assuming 10-bp windows, we estimate that at least 60% of coding windows, 39% of other transcribed and intronic windows, and 23% of intergenic windows are under selection (Figure 6A). By this analysis, only about one-third (32.9%) of the sites under selection come from coding regions, while 38.5% come from intergenic regions and 25.5% come from intronic regions. The remaining 3.1% come from other transcribed regions (Figure 6B).

Figure 3.—Rates of evolution in neutrally evolving and conserved sequences. Branch lengths were estimated from (A) synonymous and (B) nonsynonymous substitutions in 14,961 coding sites, on the basis of the codon model of Yang et al. (1998) (branch lengths are given by dS and dN, respectively). (C) A total of 1827 fourfold-degenerate sites in coding regions, based on the general reversible (REV) nucleotide model (Tavare 1986). (D) A total of 9620 conserved noncoding sites, based on the REV model. Units are expected synonymous substitutions per synonymous site for A, expected nonsynonymous substitutions per nonsynonymous site for B, and expected substitutions per site for C and D. Horizontal branch lengths are drawn in proportion to estimated distances, with proportions maintained both within and between trees.

Evolution of protein-coding sequences: Synonymous and nonsynonymous substitutions: From the pooled set of coding regions used to estimate dS (see above) we estimated dN = 0.16 nonsynonymous substitutions per nonsynonymous site (Figure 3B) and a ratio of ω = dN/dS = 0.21, indicating moderate purifying selection (e.g., Nielsen 2005). This estimate of ω is somewhat larger than genomewide estimates for Drosophila, Caenorhabditis, and Saccharomyces genomes (~0.06–0.11; Kellis et al. 2003; Stein et al. 2003; Clark et al. 2007). It is also larger than estimates for nonprimate mammals (0.10–0.14), but it is similar to estimates for hominids (0.17–0.25) (Rhesus Macaque Genome Sequencing and Analysis Consortium 2007; Kosiol et al. 2008). Relatively few genomewide estimates are available for plants, but Tuskan et al. (2006) have reported ω = 0.40 on the basis of single-nucleotide polymorphisms in Populus, and Yu et al. (2005) have reported average KA/KS estimates (which are roughly comparable to ω-estimates) of 0.35 for the indica and japonica subspecies of O. sativa, with higher values in duplicated regions (0.720 in segmental duplications). Earlier estimates, based on smaller data sets, included KA/KS = 0.14 for A. thaliana and Brassica rapa ssp. pekinensis orthologs (Tiffin and Hahn 2002) (although this estimate may reflect a downward bias from the use of EST data; see, e.g., Wright et al. 2004), and KA/KS = 0.20 for duplicate genes in Arabidopsis (Zhang et al. 2002). Overall, these Solanaceae genes appear to be under somewhat weaker purifying selection than the genes of most animals, but under stronger purifying selection than the genes of at least some other plants. It is possible that the absence of a recent whole-genome duplication in the Solanaceae has contributed to the reduced estimates of ω in comparison with Populus and O. sativa.
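As a small worked check of the ratio discussed above, the pooled and per-gene values can be recomputed from the dN and dS columns of Table 2. The helper `omega` is a hypothetical name, and the per-gene inputs are the rounded table entries, so small differences in the third decimal place are expected.

```python
def omega(dn, ds):
    """Ratio of nonsynonymous to synonymous substitution rates (dN/dS)."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    return dn / ds

# (dN, dS) pairs taken from Table 2: pooled data, gene 11, and gene 15.
pooled_omega = omega(0.162, 0.782)   # Table 2 reports 0.207
gene11_omega = omega(0.311, 0.586)   # Table 2 reports 0.530
gene15_omega = omega(0.090, 1.147)   # Table 2 reports 0.078
```

Values well below 1, as for the pooled data and gene 15, are conventionally read as purifying selection, while values approaching 1, as for gene 11, are more consistent with relaxed constraint.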

TABLE 2

Summary of comparative analysis of multispecies genes

| Gene | Species sequenced^a | ORF and exon–intron conservation | dN^b | dS^c | ω^d | Ins^e | Del^f | Other^g | Rate^h |
|---|---|---|---|---|---|---|---|---|---|
| 4 | Tom, Pot, Pep | Frameshift indel causing −22-aa shift of stop in Tom | 0.051 (0.74) | 0.335 (0.98) | 0.151 | 0 | 1 | 4 | 3.0 |
| 5 | Tom, Pot, Pep | Conserved | 0.041 (0.60) | 0.374 (1.09) | 0.109 | 0 | 0 | 0 | 0.0 |
| 6 | Tom, Pot, Pep | Compensatory frameshift indels in either Pep or (Tom, Pot); frameshift indels causing −7-aa shift of stop in Pot, 16-aa shift of stop in Pep (or −6 aa in Tom, Pot) | 0.070 (1.03) | 0.564 (1.65) | 0.124 | 7 | 3 | 11 | 5.8 |
| 7 | Tom, Pot, Pep | Conserved | 0.033 (0.48) | 0.236 (0.69) | 0.139 | 0 | 0 | 3 | 0.6 |
| 8 | Tom, Pot, Egg, Pep | Conserved | 0.065 (0.70) | 0.515 (1.12) | 0.126 | 0 | 0 | 0 | 0.0 |
| 11 | Tom, Pot, Egg, Pep, Pet | Compensatory frameshift indels in Pep; shifts of splice sites in Egg, Pep, Pet; frameshift indel causing −9-aa shift of stop in Pet (or 19 aa in ingroup) | 0.311 (1.99) | 0.586 (0.75) | 0.530 | 4 | 14 | 10 | 1.9 |
| 12 | Tom, Pot, Egg, Pep, Pet | Conserved | 0.065 (0.41) | 0.601 (0.77) | 0.107 | 3 | 8 | 0 | 0.8 |
| 13 | Tom, Pot, Egg, Pet | Conserved | 0.109 (0.80) | 0.723 (1.06) | 0.151 | 0 | 4 | 8 | 1.1 |
| 14 | Tom, Pot, Egg, Pet | Numerous disruptions, mostly from frameshift indels in low-complexity region | 0.202 (1.48) | 1.022 (1.50) | 0.198 | 23 | 23 | 13 | 4.3 |
| 15 | Tom, Pot, Egg, Pet | Conserved | 0.090 (0.66) | 1.147 (1.68) | 0.078 | 3 | 4 | 6 | 1.2 |
| 16 | Tom, Pot, Egg, Pet | Frameshift indel causing 18-aa shift of stop in Egg | 0.138 (1.01) | 0.472 (0.69) | 0.292 | 0 | 2 | 3 | 0.7 |
| 17 | Tom, Pot, Egg, Pet | Conserved | 0.165 (1.21) | 0.497 (0.73) | 0.331 | 1 | 0 | 0 | 0.3 |
| All | — | — | 0.162 (1.00) | 0.782 (1.00) | 0.207 | 41 | 59 | 58 | 1.7 |

^a Tom, tomato; Pot, potato; Egg, eggplant; Pep, pepper; Pet, petunia.
^b Estimated number of nonsynonymous substitutions per nonsynonymous site. In parentheses is the normalized number based on the subset of species present (expected value 1; see materials and methods).
^c Estimated number of synonymous substitutions per synonymous site. In parentheses is the normalized number based on the subset of species present (normalized for expected value 1).
^d Ratio of dN to dS.
^e Number of inferred insertions in coding region.
^f Number of inferred deletions in coding region.
^g Number of indels that may be either insertions or deletions (no outgroup data).
^h Normalized rate of all indels in indels per hundred neutral substitutions (see materials and methods).


Separate estimates of ω per gene indicate that most genes have ω < 0.2, suggesting moderate to strong purifying selection, but a few have values closer to 1, suggesting relaxation of constraint (Table 2). Gene 11, with the largest estimate of ω (0.53), also displays several differences between species in its ORF and exon–intron structure. We also tested all genes for positive selection, both across all branches and on individual branches of the phylogeny, using LRT methods (Yang and Nielsen 2002). Most genes (including gene 11) showed no significant evidence of positive selection, but evidence was detected for genes 7 and 12, both of which have undergone tandem duplications (Figure 7). This is consistent with Yu et al.’s (2005) finding of significantly elevated ω-estimates for tandemly duplicated genes and with observed enrichments for positive selection among duplicated genes (e.g., Rhesus Macaque Genome Sequencing and Analysis Consortium 2007). It should be noted that interlocus gene conversion between duplicate copies of genes can complicate analyses of positive selection.
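For readers unfamiliar with likelihood-ratio tests of this kind, the generic calculation is sketched below. It is not the authors' procedure: the log-likelihood values are invented for illustration, the function names are hypothetical, and the sketch assumes that the null and alternative models differ by two free parameters, so that the statistic is compared with a chi-square distribution with 2 degrees of freedom (for which the upper-tail probability has the closed form exp(-x/2)). Some tests of positive selection use different null distributions.

```python
import math

def lrt_statistic(lnl_null, lnl_alt):
    """Twice the gain in maximized log-likelihood, floored at zero."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Hypothetical maximized log-likelihoods for a null model (no sites with
# omega > 1) and an alternative model that allows such sites.
example_stat = lrt_statistic(lnl_null=-5230.4, lnl_alt=-5224.1)
example_p = chi2_sf_df2(example_stat)
```

With these illustrative numbers the statistic is 12.6 and the P-value is below 0.01, so the null model would be rejected at conventional thresholds.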

Insertions and deletions: Using a maximum parsimony-based method (materials and methods), we reconstructed an indel history (history of insertion and deletion events along the branches of the phylogeny) for each of the 12 multispecies genes. A total of 158 coding indels were identified, or an average of ~13 per gene. However, there was considerable variation in indel number, with 6 genes contributing ≤5 indels each, and 2 genes (genes 11 and 14) responsible for more than half of all observed indels (Table 2). Because different numbers of species are represented at each gene and the genes have different lengths, indel rates were expressed in normalized units of indel events per 100 neutral substitutions (indels/hns) (materials and methods). The average normalized rate was estimated to be 1.7 indels/hns, with a range of 0–5.8 indels/hns per gene. The excess of indels in gene 14 is most likely explained by replication slippage in the 748-bp CT-rich low-complexity region near its 3′ end. Gene 11, despite its large number of indels, does not show a substantial elevation in its normalized indel rate (1.9 indels/hns).

No correlation was observed between ω and normalized indel rates (R² = 0.001). This finding, together with the excess of indels in the low-complexity region of gene 14 and in several simple sequence repeats, suggests that mutation (e.g., replication slippage), rather than selection, may be responsible for most of the variance in indel rates. As expected, the vast majority (78%) of the observed indels have lengths that are exact multiples of three and therefore leave the open reading frames of their genes intact. The remainder generally occur near the 3′ ends of genes or co-occur with compensatory indels that maintain the reading frame. No significant differences were observed in the indel rates on individual branches of the tree.
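The frame-preservation criterion mentioned above amounts to a simple modular check, illustrated here. The indel lengths in the example are hypothetical and are not taken from the study's data; `frame_preserving_fraction` is an illustrative name.

```python
def frame_preserving_fraction(indel_lengths):
    """Fraction of indels whose length (bp) is an exact multiple of three."""
    if not indel_lengths:
        raise ValueError("at least one indel length is required")
    in_frame = sum(1 for length in indel_lengths if length % 3 == 0)
    return in_frame / len(indel_lengths)

# Hypothetical indel lengths in base pairs.
example_lengths = [3, 6, 1, 9, 12, 2, 3, 15, 3]
example_fraction = frame_preserving_fraction(example_lengths)  # 7 of 9 in frame
```

An indel whose length is not a multiple of three shifts the reading frame downstream unless a nearby compensatory indel restores it, which is why such events are expected to be rare in intact coding regions.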

Almost two-thirds of the observed indels (100/158) could be identified, on the basis of outgroup data, as

TABLE 3

Estimated ages of most recent common ancestors with tomato (MYA)

| Split^b | Lower limit^a: Mean^c | Lower limit^a: 95% C.I.^d | Intermediate value^a: Mean^c | Intermediate value^a: 95% C.I.^d | Upper limit^a: Mean^c | Upper limit^a: 95% C.I.^d |
|---|---|---|---|---|---|---|
| Arabidopsis | 110.0 | — | 120.0 | — | 130.0 | — |
| Petunia | 28.6 | (26.9, 30.4) | 31.2 | (29.3, 33.2) | 33.8 | (31.7, 35.9) |
| Pepper | 17.5 | (16.2, 18.8) | 19.1 | (17.6, 20.5) | 20.7 | (19.1, 22.2) |
| Eggplant | 12.6 | (11.6, 13.5) | 13.7 | (12.7, 14.7) | 14.9 | (13.8, 16.0) |
| Potato | 5.7 | (5.1, 6.2) | 6.2 | (5.6, 6.8) | 6.7 | (6.1, 7.3) |

^a Estimates of 110, 120, and 130 MYA are assumed for the Arabidopsis/Solanaceae divergence (see text).
^b Indicates most recent common ancestor (MRCA) of tomato and listed species. Ages for Arabidopsis are based on previous estimates (see text).
^c Age of MRCA estimated by the method of Yang and Yoder (2003), assuming the given Arabidopsis/Solanaceae divergence and a molecular clock within the Solanaceae. Estimates are based on the codon model of Yang et al. (1998) and the coding regions of the 12 multispecies genes (Table 2).
^d Approximate 95% confidence interval, estimated by the curvature method.

Figure 4.—Estimated dates of divergence for the five Solanaceae species. This tree was estimated from the coding regions of 12 genes in the conserved syntenic segment for which cross-species data are available (see Table 2), assuming a codon model, a molecular clock, and a date of 120 MYA for the Arabidopsis/Solanaceae (rosid/asterid) divergence. The shaded lines indicate the estimates obtained by assuming 110 and 130 MYA for the Arabidopsis/Solanaceae divergence.


either insertions or deletions, with deletions outnumbering insertions by ~3:2 (Table 2). This preference for deletions over insertions is consistent with genomewide observations in mammals (Cooper et al. 2004; Rat Genome Sequencing Project Consortium 2004) and plants (Ma and Bennetzen 2004), but is not as pronounced as what has been observed in Drosophila (Petrov et al. 1996). The average sizes of insertions and deletions were similar (6.7-bp insertions, 7.8-bp deletions). Five of seven phylogenetically informative indels (i.e., shared by between 2 and n − 2 species in a region with n ≥ 4 aligned species) in nonrepetitive regions supported the inferred tree topology (Figure 3). The remaining two indels occurred in exon 3 of gene 12 and supported a phylogeny in which eggplant and pepper are grouped as sister taxa (supplemental Figure S3). These two indels might reflect homoplasy events or evolutionary scenarios (such as incomplete lineage sorting or duplication followed by loss) that would cause this exon’s phylogeny to differ from the species tree.
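The definition of a phylogenetically informative indel, and the notion of an indel supporting a topology, can be made concrete with a short sketch. The function names are hypothetical, and the two internal splits are an assumption read off the divergence order in Table 3 (tomato and potato closest, followed by eggplant, pepper, and petunia); the sketch is not the authors' procedure.

```python
ALL_SPECIES = frozenset({"tomato", "potato", "eggplant", "pepper", "petunia"})

# Internal splits of the assumed five-species tree, each given by one side.
TREE_SPLITS = [
    frozenset({"tomato", "potato"}),
    frozenset({"tomato", "potato", "eggplant"}),
]

def is_informative(sharing, aligned):
    """True if the indel is shared by between 2 and n - 2 of n >= 4 aligned species."""
    n, k = len(aligned), len(sharing)
    return n >= 4 and 2 <= k <= n - 2

def supports_tree(sharing, aligned, splits=TREE_SPLITS):
    """True if the species sharing the indel form one side of a split in the tree."""
    sharing, aligned = frozenset(sharing), frozenset(aligned)
    others = aligned - sharing
    for split in splits:
        side = split & aligned  # restrict the split to the species present
        if sharing == side or others == side:
            return True
    return False
```

For example, an indel shared only by tomato and potato is informative and supports the assumed tree, whereas one shared only by eggplant and pepper is informative but conflicts with it, as described for the two indels in exon 3 of gene 12.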

Repeat elements: The BAC sequences from all five species were screened for transposable elements (TEs), simple sequence repeats, and low-complexity sequences using TRF (Benson 1999) and RepeatMasker (http://www.repeatmasker.org), with a custom RepeatMasker library assembled from several databases (materials and methods). The estimated repeat content ranged from 9.6% of bases in eggplant to 23.0% in potato (supplemental Table S6). In all cases, TEs accounted for the majority of repeats, covering between 5.8% (eggplant) and 16.4% (potato) of bases. By applying the same methods to four BACs identified as being in euchromatin by Wang et al. (2006), we obtained estimates of repeat content ranging from 11.5 to 23.0% (median 12.8%), suggesting that the CSS is fairly typical of the euchromatic portions of Solanaceae genomes in

Figure 6.—Selection on sites of various annotation classes. (A) Fractions of all 10-bp windows that are under selection in sites annotated as belonging to coding (CDS), other transcribed, intronic, and intergenic regions. (B) Breakdown of all windows under selection by class. These are posterior expected fractions, based on the posterior probability that each window W_i is under selection, given its score S_i. They assume an estimate of p_s = 0.464.

Figure 7.—Maximum-likelihood phylogenies for genes that show tandem duplications. (A) Phylogeny for gene 7, which is duplicated in pepper. (B) Phylogeny for gene 12, which is duplicated in petunia. Labels above branches indicate branchwise estimates of ω. Gene 7 shows significant evidence of positive selection (P = 4.7 × 10⁻¹⁰) by a likelihood-ratio test (LRT) that allows for variation in ω across sites but not across branches. Gene 12 shows significant evidence of positive selection on the branches to the two petunia genes by a branch-site LRT (see P-values below shaded branches).

Figure 5.—Decomposition of distribution of conservation scores into neutral and selected components. The shaded line represents the full distribution (f_all) in the CSS, the dashed line represents the contribution of neutrally evolving sites (p_n f_n), and the solid line represents the contribution of sites under selection (p_s f_s). The full distribution was estimated from all sites with at least three aligned species, the neutral distribution was estimated from 1827 fourfold degenerate sites in coding regions, and the selected component was computed as p_s f_s(S) = f_all(S) − p_n f_n(S). Here 10-bp windows, a smoothing bandwidth of 0.20, and a lower-bound estimate of p_s = 0.464 were used. This estimate of p_s corresponds to an estimate of g = 0.337, assuming unaligned sites are not under selection (see materials and methods).

402 Y. Wang et al.

Page 13: Sequencing and Comparative Analysis of a Conserved Syntenic … · 2008. 10. 8. · Sequencing and Comparative Analysis of a Conserved Syntenic Segment in the Solanaceae Ying Wang,*,†,1

this respect. Notably, the overall fractions of repetitivebases in these genomes are considerably higher, becausethey consist predominantly of repeat-rich heterochro-matin (�75% of the tomato genome is estimated to beheterochromatin; Wang et al. 2006). On the basis of4300 randomly sheared sequences, we estimate the over-all repeat content of the tomato genome to be�56% (L.Mueller, unpublished data), which is comparable toreported values for grass genomes—e.g., 58–66% inmaize (Messing et al. 2004; Haberer et al. 2005) and35% in rice (International Rice Genome Sequencing

Project 2005).We observed a strong depletion of TEs in introns

relative to intergenic regions (P , 2.6 3 10�22, Fisher’sexact test), probably because intronic TEs tend todisrupt transcription or splicing and hence are elimi-nated by natural selection. This depletion was observedin all species except eggplant, where an unusually lowrepeat content suggested the possibility of poor de-tection sensitivity. As in maize and rice (Haberer et al.2005), LTR transposons were the most abundant amongthe annotated classes of repeats, except in tomato,where DNA transposons were more abundant (supple-mental Table S6).
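For readers unfamiliar with the test used for the intron versus intergenic comparison, the following Python sketch computes a one-sided Fisher's exact test from a 2 × 2 table using only the standard library. The article does not report the underlying table, so the counts in the example are invented purely for illustration, and the one-sided formulation is an assumption.

```python
from math import comb


def fisher_exact_less(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X <= a) under the hypergeometric null, i.e., the probability
    of seeing a count this small or smaller in the top-left cell when the
    row and column totals are held fixed.
    """
    row1 = a + b          # e.g., all intronic units
    col1 = a + c          # e.g., all TE-covered units
    n = a + b + c + d
    lowest = max(0, row1 + col1 - n)
    total = sum(comb(col1, x) * comb(n - col1, row1 - x)
                for x in range(lowest, a + 1))
    return total / comb(n, row1)


# Hypothetical counts (NOT from the article):
# rows = intronic / intergenic, columns = TE-covered / TE-free.
p_value = fisher_exact_less(5, 95, 40, 60)
print(f"one-sided P = {p_value:.3g}")
```

A small P-value here would indicate fewer TE-covered units in introns than expected if TEs were distributed without regard to annotation class.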

Conserved elements: The mixture decomposition analysis above provided bulk statistical estimates of fractions of the genome under selection, but did not identify specific bases of interest. Therefore, the program phastCons (Siepel et al. 2005) was used to predict specific genomic elements that are significantly more conserved across species than would be expected under neutral evolution and hence are likely to be under purifying selection (Figure 2). The program was calibrated by requiring 60% coverage of coding regions by conserved elements, as estimated above. The predicted conserved elements covered 20.2% of all bases, including 27.9% of transcribed noncoding, 16.7% of intronic, and 8.3% of intergenic bases (supplemental Table S7). These fractions are somewhat lower than those predicted by the mixture decomposition method, most likely because many selected bases are not detectable by phastCons, due to short element length, weak selective pressure, or positive rather than negative selection. (For this data set, the expected minimum length of a detectable element is ~10 bp, assuming all bases within the element obey phastCons's model of conserved evolution.)

The predicted elements that fell outside of annotated protein-coding genes were identified as conserved noncoding sequences (CNSs; Figure 2B). These CNSs covered 11.2% of all noncoding bases, substantially more than the ~2% found in an analysis of rice and maize (using somewhat different methods) (Inada et al. 2003). As in animal genomes, they were shorter and less variable in length than conserved coding sequences (median lengths 26 and 60 bp, respectively), indicating possible cis-regulatory function. They were similar in length to CNSs recently identified in Arabidopsis (median length 24 bp; Thomas et al. 2007). The CNSs were considerably more conserved than nonsynonymous sites in protein-coding genes, with substitution rates ~15% lower (Figure 3D), but this difference may simply reflect the fact that stronger conservation is required to detect shorter elements.

In addition, elements showing evidence of lineage-specific differences in selective pressure were identified using the program DLESS (Siepel et al. 2006). The vast majority of the 97 identified elements were predicted to be under purifying selection on all branches of the phylogeny and corresponded closely with phastCons predictions. However, 3 elements were significantly diverged in one species yet conserved in the others (supplemental Table S8). These elements were predicted by DLESS as lineage-specific "losses," that is, elements under purifying selection that had experienced relaxation of constraint on a particular branch of the phylogenetic tree. However, they could instead be subject to lineage-specific positive selection, which the program does not explicitly model. One element overlapped an exon of gene 11 (Figure 2D), another overlapped an exon of gene 12, and the third fell in an intron of gene 12. Notably, 2 of the 12 multispecies genes showed evidence of lineage-specific selection. No significant evidence of positive selection could be found in the protein-coding regions of these elements, using the branch-site LRT of Yang and Nielsen (2002), and no motif or other functional evidence could be associated with the noncoding elements. Regardless, these elements show striking cross-species differences at the sequence level and, as potential determinants of phenotypic differences between species, they may be worthy of future study.

Using RNAz (Washietl et al. 2005), the elements identified by phastCons and DLESS were examined for evidence of RNA secondary structure. There were 36 distinct elements (71 including overlapping elements from the two sets) with at least one moderately high-scoring predicted structure (probability >50%). Based on an estimated false-positive rate of 37%, an expected 23 of these 36 elements are true positives. Just under one-fourth of the predictions were in introns or intergenic regions, while the remainder overlapped coding regions, on either the sense or the antisense strand. Only 11% (8/71) of predicted structures showed significant similarity to known structures in the RFAM database (Griffiths-Jones et al. 2005) (supplemental Table S9), but this could in part reflect a dearth of plant RNAs in the database. An example of a high-scoring structure with a significant match in the database is shown in Figure 8. In addition to the RNAz predictions, we found five predicted C/D box snoRNAs using snoReport (Hertel et al. 2008), two of which matched known snoRNA structures in RFAM. We also searched for miRNA precursors using RNAmicro (Hertel and Stadler 2006) but found none. Interestingly, we observed a significant enrichment for 3′ exons among the exons of multi-exon genes that had overlapping RNAz-predicted structures (P = 0.045, Fisher's exact test; predictions on sense strand only). Overall, 7 of 23 exons with overlapping structures were last exons, and most of these structures did not extend substantially into the 3′-UTR. This is consistent with data indicating that a large proportion of human and mouse stop codons fall in RNA loop structures, which perhaps contribute to efficient termination of translation (Shabalina et al. 2006).
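The counts quoted for the RNAz predictions follow from simple arithmetic, reproduced here as a Python sketch. It assumes that the 37% false-positive rate is applied as the expected fraction of predictions that are spurious, which is how the text uses it; the function name is illustrative.

```python
def expected_true_positives(n_predictions: int, false_positive_rate: float) -> float:
    """Expected number of correct predictions, treating the false-positive
    rate as the expected fraction of predictions that are spurious."""
    return n_predictions * (1.0 - false_positive_rate)


# 36 distinct elements with a predicted structure, 37% expected to be false.
print(round(expected_true_positives(36, 0.37)))   # about 23 elements

# 8 of 71 predicted structures matched known RFAM structures.
print(round(100 * 8 / 71))                        # about 11 percent
```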

We scanned the noncoding regions of the CSS for matches to 16 known Solanaceae regulatory motifs (supplemental Figure S4), using position-specific score matrices (PSSMs) and stringent significance thresholds (materials and methods). In some cases, predicted binding sites corresponded closely to CNSs identified by phastCons (Figure 2C). However, of 480 predicted binding sites, only 26 overlapped CNSs, and none of the motifs were significantly enriched or depleted for matches in CNSs (supplemental Table S10). Thus, these regulatory elements appear not to be maintained (in position or sequence) by strong purifying selection. However, weak detection power for binding sites and for conserved sequences that are short in length or subject to weak selection may contribute to these observations.
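A PSSM scan of the kind described above can be sketched in a few lines of Python. This is a generic log-odds scanner, not the pipeline used in the study: the count matrix (a G-box-like CACGTG consensus chosen only for illustration, and not claimed to be one of the 16 motifs), the uniform background, the pseudocount, and the score threshold are all assumptions.

```python
import math

BASES = "ACGT"


def pssm_from_counts(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into a log2-odds score matrix."""
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def scan(sequence, pssm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pssm[j][b] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits


# Hypothetical counts: ten aligned sites that all match CACGTG.
counts = [{base: 10} for base in "CACGTG"]
pssm = pssm_from_counts(counts)
hits = scan("TTCACGTGAA", pssm, threshold=9.0)
print(hits)
```

In practice the threshold would be set from a significance calculation against a background model, and both strands would be scanned; the sketch scans the forward strand only.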

DISCUSSION

Comparative genomics has a venerable history in plants, going back at least 20 years (Bonierbale et al. 1988; Tanksley et al. 1988; Gale and Devos 1998). However, sequence data that would allow detailed comparisons of both coding and noncoding regions in plant species at moderate evolutionary distances have been slow to emerge. In this article, we take a step toward addressing this deficiency by presenting an analysis of orthologous genomic sequences in plants similar to the ones that have been so valuable in recent functional and evolutionary studies of animal genomes. Our analysis of an unduplicated CSS from five species in the family Solanaceae reveals, among other things, that the CSS is generally well conserved but has numerous small-scale differences between species, that at least about a third of the region is under selection, that it contains more selected bases in noncoding than in coding regions, and that most of its genes have experienced moderate to strong purifying selection, but two recently duplicated genes show evidence of positive selection.

With evidence for only two gene-gain and two gene-loss events since the divergence of the Solanaceae, the CSS shows greater conservation of gene content than has been observed in other comparative analyses of plant genomes. For example, CSSs in the dicot species P. trichocarpa, A. thaliana, and M. truncatula display frequencies of gene loss in at least one species of ≥40% (Kevei et al. 2005). Similarly high rates of gene loss have been observed in comparisons of tomato and Arabidopsis (Ku et al. 2000) and of the grass species maize, sorghum, and rice (Ilic et al. 2003). Notably, these other comparisons have all included species that have undergone relatively recent WGDs (e.g., maize or Arabidopsis), and the species with the most recent WGD has typically experienced the most gene loss (Ilic et al. 2003). Thus, it seems likely that the increased conservation of gene content observed in the Solanaceae is at least in part due to the absence of a recent WGD, as would be predicted by theory (Walsh 1995; Lynch and Conery 2003). Further support for this conjecture is provided by the high rate of gene loss in the six regions of the Arabidopsis genome homologous to the CSS (Figure 1), which evidently arose from duplications subsequent to the Arabidopsis/Solanaceae divergence. [Three of these segments (At2, At4, and At5-B) exhibit some conserved synteny and can be traced back to the α-WGD in Arabidopsis (Bowers et al. 2003).] Fifteen of the Solanaceae genes have at least one identifiable Arabidopsis homolog in these regions, and 8 have two or more homologs. Remarkably, in 7 of these 8 cases, all but one homolog has been pseudogenized. Altogether, more than half of the Arabidopsis genes in these regions have been lost, and these gene-loss events have been accompanied by a large number of rearrangement and insertion events. The degree of conservation in the Solanaceae is strikingly high by contrast.

Figure 8. Predicted RNA structure for a conserved element overlapping gene 15. (A) Diagram of secondary structure. (B) Multiple sequence alignment, with parentheses indicating paired bases and dots indicating unpaired bases. In B, red indicates columns with complete conservation of paired bases, while yellow indicates substitutions in one or more species. Four petunia-specific compensatory pairs of substitutions are highlighted in contrasting colors. This element shows significant similarity to the bacterial SgrS RNA, which is associated with transcript localization (Vanderpool and Gottesman 2004; Kawamoto et al. 2005). The structure is predicted on the antisense strand of the coding region of gene 15.

We estimate that the five Solanaceae species studied here all derive from a most recent common ancestor that lived ~30 MYA. This would imply that these species diversified well after the breakup of the Gondwana supercontinent, because most major continental separations are estimated to have occurred between 180 and 105 MYA (McLoughlin 2001). Therefore, our estimates support a hypothesis of long-distance, rather than intra-Gondwanan, seed dispersal to explain the worldwide distribution of the Solanaceae (D'Arcy 1991; Olmstead and Palmer 1992, 1997). They are consistent with Olmstead and Palmer's (1997) proposal of a New World origin and initial diversification of Solanum, followed by dispersal to and further diversification in Africa, Australia, and other parts of the world. However, certain genera of Solanaceae, such as Schizanthus and Schwenkia, are believed to have branched off somewhat earlier than petunia (Olmstead and Palmer 1992; Olmstead et al. 1999), and our data do not help in dating these events.

Our estimate of 5.1–7.3 MYA for the tomato/potato divergence is substantially older than a recent estimate of 1.6–3.3 MYA, based on a large collection of ESTs (Blanc and Wolfe 2004b). This difference appears to stem primarily from the assumption in that study of an absolute rate of 1.5 × 10⁻⁸ synonymous substitutions per site per year, as estimated for the Adh and Chs loci in Arabidopsis (Koch et al. 2000). If this substitution rate were assumed for our data, the Arabidopsis/Solanaceae (AS) divergence would be estimated at only 76.5 MYA, in conflict with most other evidence. Thus, it appears that Koch et al.'s estimated substitution rate is too high, at least for the Solanaceae. By instead conditioning on the best available estimated dates for the AS divergence, we estimate a substitution rate in the Solanaceae of only 9.6 × 10⁻⁹ (8.3 × 10⁻⁹–1.1 × 10⁻⁸) synonymous substitutions per site per year, ~1.6 times lower than Koch et al.'s estimate. Interestingly, Tuskan et al. (2006) observed a similar, but more dramatic, rate slowdown in Populus compared with Koch et al.'s estimate for Arabidopsis. It is worth stressing that our methods assume a molecular clock not only within the Solanaceae but also between Arabidopsis and the Solanaceae. If substitution rates in the Solanaceae are indeed lower than those in Arabidopsis, then we will have underestimated the divergence times within the Solanaceae. Such a difference in rates would imply that the root of the phylogeny in Figure 4 should slide toward the Solanaceae, which would have the effect of pulling the Solanaceae divergence events back in time. Any underestimation of the neutral distance between Arabidopsis and the Solanaceae due to saturation of synonymous sites would have a similar effect on the estimated divergence times. For these reasons, we believe the tomato/potato divergence occurred at least ~5–7 MYA.
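The relationship between an assumed substitution rate and the resulting divergence date can be made explicit with a short Python sketch. It assumes a strict molecular clock, under which the synonymous distance between two lineages is fixed by the data and the inferred time scales inversely with the assumed rate. The rescaled date it prints is an implied value derived from the numbers in the text, not a figure reported by the authors.

```python
def rescale_divergence_time(time_mya: float, assumed_rate: float, new_rate: float) -> float:
    """Rescale a divergence time when the per-year substitution rate changes.

    Under a strict clock the distance is proportional to rate * time, so for
    a fixed distance the time is inversely proportional to the rate.
    """
    return time_mya * assumed_rate / new_rate


KOCH_RATE = 1.5e-8        # synonymous substitutions/site/year (Koch et al. 2000)
SOLANACEAE_RATE = 9.6e-9  # point estimate reported in this study

# Ratio of the two rates, quoted in the text as roughly 1.6.
print(round(KOCH_RATE / SOLANACEAE_RATE, 1))

# AS divergence implied by the slower rate, given 76.5 MYA under Koch's rate.
print(round(rescale_divergence_time(76.5, KOCH_RATE, SOLANACEAE_RATE), 1))
```

The same scaling explains why a faster assumed rate yields the younger tomato/potato dates reported elsewhere, although those estimates also rest on different data.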

The proportion of nucleotide sites in a genome that are subject to natural selection (γ) is a key quantity in understanding how genomes evolve and in evaluating the completeness of functional annotations. There has been considerable interest in estimating this quantity from comparative sequence data, especially for mammalian (e.g., Shabalina et al. 2001; Mouse Genome Sequencing Consortium 2002; Cooper et al. 2004; Smith et al. 2004; Lindblad-Toh et al. 2005), Drosophila (Bergman and Kreitman 2001; Halligan et al. 2004; Andolfatto 2005), and Caenorhabditis (Shabalina and Kondrashov 1999) genomes. However, we know of no published estimate of this quantity for plant species. In this study, we find evidence that at least about a third of bases within the CSS are under selection and that more than two-thirds of selected bases fall in noncoding regions. Therefore, like mammalian and Drosophila genomes (Mouse Genome Sequencing Consortium 2002; Chiaromonte et al. 2003; Andolfatto 2005), this region appears to contain a large amount of genomic "dark matter," that is, functional DNA that is invisible to current methods for genome annotation. It is not known how representative the CSS is of the entire Solanaceae genomes, but it is worth noting that the gene density of the region is fairly typical for euchromatin. The CSS has one gene per 7 kb compared with one gene per ~6.7 kb in the euchromatic portion of the tomato genome (Wang et al. 2006). Values of γ in the gene-poor heterochromatin are likely to be considerably lower. Notably, because they are based completely on between-species divergence, our estimates should primarily reflect ancient selection (operating on the scale of millions of years) rather than more recent, possibly human-induced selection.

Several sources of possible error (such as lineage-specific selection, negative selection in so-called neutral sites, and failure to align divergent sequences) would tend to produce underestimates, rather than overestimates, of γ. However, spuriously high measurements of the neutral rate of substitution (i.e., spuriously low neutral conservation scores) could lead to an upward bias in γ. Such a bias could result from choosing neutral sites that were actually evolving faster than average or from alignments of nonorthologous neutral bases. Our use of 4D sites (which fall in easy-to-align coding regions) and our synteny-based alignment methods make it unlikely that alignment error in neutral sites has a major effect on γ. We have also repeated the analysis with an alternative alignment and arrived at very similar results (supplemental material), providing further evidence that alignment error is not a dominant factor in the analysis. In addition, 4D sites have generally been found to evolve somewhat more slowly, rather than faster, than other neutral sites (e.g., Hardison et al. 2003). Indeed, we find that a phylogeny estimated from 1915 sites in ancestral repeats in the CSS has >10% greater total branch length than the one estimated from 4D sites (data not shown). This suggests that forces that may produce elevated rates of substitution in 4D sites, such as CpG methylation (Hobolth et al. 2006) or genetic draft, are probably more than offset by weak negative selection on synonymous sites. Another possible source of bias is that the phastOdds scores, which depend on estimated models of neutral and conserved evolution, are influenced by differences between 4D and other sites in base composition, branch length proportions, and substitution patterns. However, two quite different estimation schemes for these models, based on different sets of sites, led to very similar estimates of γ (materials and methods), suggesting that the mixture decomposition is being driven by conservation and not by ancillary properties. Taken together, these lines of evidence suggest that our estimate of γ ≈ 1/3 is conservative as a lower bound for the fraction of bases under selection in the CSS.
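The mixture decomposition that underlies this estimate (Figure 5) treats the full distribution of conservation scores as a weighted sum of a neutral and a selected component. The Python sketch below illustrates the subtraction step on toy histograms. The histogram values are invented, the mixture weights are assumed to sum to 1, and the fraction of scored sites is back-calculated from the reported p_s = 0.464 and γ = 0.337 rather than taken from the article.

```python
def selected_component(f_all, f_neutral, p_s):
    """Return p_s * f_s(S) = f_all(S) - p_n * f_n(S) for each score bin."""
    p_n = 1.0 - p_s  # assumption: two-component mixture, weights sum to 1
    return [fa - p_n * fn for fa, fn in zip(f_all, f_neutral)]


def implied_scored_fraction(gamma, p_s):
    """Fraction of sites that receive a score, if unscored (unaligned)
    sites are assumed not to be under selection: gamma = p_s * fraction."""
    return gamma / p_s


# Toy score histograms over three bins (each sums to 1); NOT real data.
f_all = [0.2, 0.3, 0.5]
f_neutral = [0.3, 0.4, 0.3]
component = selected_component(f_all, f_neutral, p_s=0.5)

# The selected component carries total mass p_s.
print(component, sum(component))

# Values reported in the Figure 5 legend.
print(round(implied_scored_fraction(0.337, 0.464), 3))
```

A lower-bound p_s is one for which the subtracted component stays nonnegative in every bin; how the bound was chosen in the study is described in materials and methods and is not reproduced here.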

Our ability to address major questions about the evolution of the Solanaceae and of other plant species in this study has clearly been limited by having sequence data for only one ~100-kb region. The completion of the tomato genome sequence, and new sequence data for other Solanaceae species and for coffee, will allow for more definitive answers to these and many other questions. Nevertheless, this study demonstrates the potential of comparative sequence analysis to provide insight into the evolutionary forces that have shaped modern-day plant genomes.

We thank Rod Wing for assistance in BAC sequencing and T. Vinar, L. Mueller, J. S. Pedersen, and Y. Xu for assistance in the analysis. This work was partially supported by National Science Foundation grants DBI-0116076 (A.S.) and 0421634 (S.T.) and National Institutes of Health training grant T32 GM007617-27 (A.D.).

LITERATURE CITED

Andolfatto, P., 2005 Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149–1152.

Arabidopsis Genome Initiative, 2000 Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.

Bejerano, G., M. Pheasant, I. Makunin, S. Stephen, W. Kent et al., 2004 Ultraconserved elements in the human genome. Science 304: 1321–1325.

Bell, C. D., D. E. Soltis and P. S. Soltis, 2005 The age of the angiosperms: a molecular timescale without a clock. Evol. Int. J. Org. Evol. 59: 1245–1258.

Benjamini, Y., and Y. Hochberg, 1995 Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57: 289–300.

Bennetzen, J. L., 2007 Patterns in grass genome evolution. Curr. Opin. Plant Biol. 10: 176–181.

Benson, G., 1999 Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27: 573–580.

Bergman, C. M., and M. Kreitman, 2001 Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res. 11: 1335–1345.

Blanc, G., and K. H. Wolfe, 2004a Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16: 1679–1691.

Blanc, G., and K. H. Wolfe, 2004b Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16: 1667–1678.

Blanchette, M., W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit et al., 2004 Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708–715.

Bonierbale, M., R. Plaisted and S. Tanksley, 1988 RFLP maps based on a common set of clones reveal modes of chromosomal evolution in potato and tomato. Genetics 120: 1095–1103.

Borodovsky, M., and J. McIninch, 1993 GeneMark: parallel gene recognition for both DNA strands. Comput. Chem. 17: 123–133.

Bowers, J. E., B. A. Chapman, J. Rong and A. H. Paterson, 2003 Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433–438.

Burge, C., and S. Karlin, 1997 Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78–94.

Cannon, S., L. Sterck, S. Rombauts, S. Sato, F. Cheung et al., 2006 Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc. Natl. Acad. Sci. USA 103: 14959–14964.

Chiaromonte, F., R. J. Weber, K. M. Roskin, M. Diekhans, W. J. Kent et al., 2003 The share of human genomic DNA under selection estimated from human-mouse genomic alignments. Cold Spring Harbor Symp. Quant. Biol. 68: 245–254.

Clark, A., M. Eisen, D. Smith, C. Bergman, B. Oliver et al., 2007 Evolution of genes and genomes on the Drosophila phylogeny. Nature 450: 203–218.

Clark, A. G., S. Glanowski, R. Nielsen, P. D. Thomas, A. Kejariwal et al., 2003 Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302: 1960–1963.

Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton et al., 2003 Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71–76.

Cooper, G. M., M. Brudno, E. A. Stone, I. Dubchak, S. Batzoglou et al., 2004 Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. 14: 539–548.

D'Arcy, W. G., 1991 The Solanaceae since 1976, with a review of its biogeography, pp. 75–138 in Solanaceae III: Taxonomy, Chemistry, Evolution, edited by J. G. Hawkes, R. N. Lester, M. Nee and N. Estrada. Royal Botanic Gardens, Kew, UK.

Eddy, S. R., 2002 A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinform. 3: 18.

Felsenstein, J., 1981 Evolutionary trees from DNA sequences. J. Mol. Evol. 17: 368–376.

Frary, A., Y. Xu, J. Liu, S. Mitchell, E. Tedeschi et al., 2005 Development of a set of PCR-based anchor markers encompassing the tomato genome and evaluation of their usefulness for genetics and breeding experiments. Theor. Appl. Genet. 111: 291–312.

Freeling, M., L. Rapaka, E. Lyons, B. Pedersen and B. Thomas, 2007 G-boxes, bigfoot genes, and environmental response: characterization of intragenomic conserved noncoding sequences in Arabidopsis. Plant Cell 19: 1441–1457.

Gale, M., and K. Devos, 1998 Plant comparative genetics after 10 years. Science 282: 656–659.

Goff, S. A., D. Ricke, T.-H. Lan, G. Presting, R. Wang et al., 2002 A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92–100.

Griffiths-Jones, S., S. Moxon, M. Marshall, A. Khanna, S. R. Eddy et al., 2005 Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33: 121–124.

Guigo, R., E. T. Dermitzakis, P. Agarwal, C. P. Ponting, G. Parra et al., 2003 Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. USA 100: 1140–1145.

Guindon, S., and O. Gascuel, 2003 A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52: 696–704.

Haberer, G., S. Young, A. K. Bharti, H. Gundlach, C. Raymond et al., 2005 Structure and architecture of the maize genome. Plant Physiol. 139: 1612–1624.

Haberer, G., M. T. Mader, P. Kosarev, M. Spannagl, L. Yang et al., 2006 Large-scale cis-element detection by analysis of correlated expression and sequence conservation between Arabidopsis and Brassica oleracea. Plant Physiol. 142: 1589–1602.

Halligan, D., A. Eyre-Walker, P. Andolfatto and P. Keightley, 2004 Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res. 14: 273–279.

Hardison, R. C., K. M. Roskin, S. Yang, M. Diekhans, W. J. Kent et al., 2003 Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13: 13–26.


Hertel, J., and P. F. Stadler, 2006 Hairpins in a haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics 22: 197–202.

Hertel, J., I. L. Hofacker and P. F. Stadler, 2008 SnoReport: computational identification of snoRNAs with unknown targets. Bioinformatics 24(2): 158–164.

Higo, K., Y. Ugawa, M. Iwamoto and T. Korenaga, 1999 Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res. 27: 297–300.

Hobolth, A., R. Nielsen, Y. Wang, F. Wu and S. Tanksley, 2006 CpG + CpNpG analysis of protein-coding sequences from tomato. Mol. Biol. Evol. 23: 1318.

Ilic, K., P. J. SanMiguel and J. L. Bennetzen, 2003 A complex history of rearrangement in an orthologous region of the maize, sorghum, and rice genomes. Proc. Natl. Acad. Sci. USA 100: 12265–12270.

Inada, D., A. Bashir, C. Lee, B. Thomas, C. Ko et al., 2003 Conserved noncoding sequences in the grasses. Genome Res. 13: 2030–2041.

International Rice Genome Sequencing Project, 2005 The map-based sequence of the rice genome. Nature 436: 793–800.

Jiang, Z., H. Tang, M. Ventura, M. Cardone, T. Marques-Bonet et al., 2007 Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39: 1361–1368.

Kawamoto, H., T. Morita, A. Shimizu, T. Inada and H. Aiba, 2005 Implication of membrane localization of target mRNA in the action of a small RNA: mechanism of post-transcriptional regulation of glucose transporter in Escherichia coli. Genes Dev. 19: 328–338.

Kellis, M., N. Patterson, M. Endrizzi, B. Birren and E. S. Lander, 2003 Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241–254.

Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle et al., 2002 The human genome browser at UCSC. Genome Res. 12: 996–1006.

Kent, W. J., R. Baertsch, A. Hinrichs, W. Miller and D. Haussler, 2003 Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100: 11484–11489.

Kevei, Z., A. Seres, A. Kereszt, P. Kalo, P. Kiss et al., 2005 Significant microsynteny with new evolutionary highlights is detected between Arabidopsis and legume model plants despite the lack of macrosynteny. Mol. Genet. Genomics 274: 644–657.

Koch, M. A., B. Haubold and T. Mitchell-Olds, 2000 Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Mol. Biol. Evol. 17: 1483–1498.

Kosiol, C., T. Vinar, R. R. da Fonseca, M. J. Hubisz, C. D. Bustamante et al., 2008 Patterns of positive selection in six mammalian genomes. PLoS Genet. 4(8): e1000144.

Ku, H. M., T. Vision, J. Liu and S. D. Tanksley, 2000 Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl. Acad. Sci. USA 97: 9121–9126.

Lindblad-Toh, K., C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B. Jaffe et al., 2005 Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803–819.

Lynch, M., and J. S. Conery, 2003 The evolutionary demography of duplicate genes. J. Struct. Funct. Genomics 3: 35–44.

Ma, J., and J. L. Bennetzen, 2004 Rapid recent growth and divergence of rice nuclear genomes. Proc. Natl. Acad. Sci. USA 101: 12404–12410.

Magallon, S. A., and M. J. Sanderson, 2005 Angiosperm divergence times: the effect of genes, codon positions, and time constraints. Evol. Int. J. Org. Evol. 59: 1653–1670.

Margulies, E. H., G. M. Cooper, G. Asimenos, D. J. Thomas, C. N. Dewey et al., 2007 Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17: 760–774.

McCubbin, A. G., C. Zuniga and T. Kao, 2000 Construction of a binary bacterial artificial chromosome library of Petunia inflata and the isolation of large genomic fragments linked to the self-incompatibility (S-) locus. Genome 43: 820–826.

McLoughlin, S., 2001 The breakup history of Gondwana and its impact on pre-Cenozoic floristic provincialism. Aust. J. Bot. 49: 271–300.

Messing, J., A. Bharti, W. Karlowski, H. Gundlach, H. Kim et al., 2004 Sequence composition and genome organization of maize. Proc. Natl. Acad. Sci. USA 101: 14349–14354.

Miller, W., K. Rosenbloom, R. Hardison, M. Hou, J. Taylor et al.,2007 28-Way vertebrate alignment and conservation track inthe UCSC Genome Browser. Genome Res. 17: 1797–1808.

Mouse Genome Sequencing Consortium, 2002 Initial sequencingand comparative analysis of the mouse genome. Nature 420:520–562.

Mueller, L., T. Solow, N. Taylor, B. Skwarecki, R. Buels et al.,2005 The SOL Genomics Network: a comparative resourcefor Solanaceae biology and beyond. Plant Physiol. 138: 1310–1317.

Murphy, W., D. Larkin, A. Everts-van der Wind, G. Bourque, G.Tesler et al., 2005 Dynamics of mammalian chromosome evo-lution inferred from multispecies comparative maps. Science309: 613–617.

Murphy, W. J., T. H. Pringle, T. A. Crider, M. S. Springer and W.Miller, 2007 Using genomic data to unravel the root of theplacental mammal phylogeny. Genome Res. 17: 413–421.

Nielsen, R., 2005 Molecular signatures of natural selection. Annu.Rev. Genet. 39: 197–218.

Notredame, C., D. Higgins and J. Heringa, 2000 T-Coffee: a novelmethod for fast and accurate multiple sequence alignment. J.Mol. Biol. 302: 205–217.

Olmstead, R. G., and J. D. Palmer, 1992 A chloroplast DNA phylog-eny of the Solanaceae: subfamilial relationships and characterevolution. Ann. MO Bot. Gard. 79: 346–360.

Olmstead, R. G., and J. D. Palmer, 1997 Implications for the phy-logeny, classification, and biogeography of Solanum fromcpDNA restriction site variation. Syst. Bot. 22: 19–29.

Olmstead, R. G., J. A. Sweere, R. E. Spangler, L. Bohs and J. D. Palmer, 1999 Phylogeny and provisional classification of the Solanaceae based on chloroplast DNA, pp. 111–137 in Solanaceae IV, Advances in Biology and Utilization, edited by M. Nee, D. E. Symon, J. P. Jessup and J. G. Hawkes. Royal Botanic Gardens, Kew, UK.

Otto, S., and J. Whitton, 2000 Polyploid incidence and evolution. Annu. Rev. Genet. 34: 401–437.

Petrov, D. A., E. R. Lozovskaya and D. L. Hartl, 1996 High intrinsic rate of DNA loss in Drosophila. Nature 384: 346–349.

Pollard, K. S., S. R. Salama, N. Lambert, M.-A. Lambot, S. Coppens et al., 2006 An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167–172.

Quiros, C. F., F. Grellet, J. Sadowski, T. Suzuki, G. Li et al., 2001 Arabidopsis and Brassica comparative genomics: sequence, structure and gene content in the ABI-Rps2-Ck1 chromosomal segment and related regions. Genetics 157: 1321–1330.

Rat Genome Sequencing Project Consortium, 2004 Genome sequence of the brown Norway rat yields insights into mammalian evolution. Nature 428: 493–521.

Rensink, W., Y. Lee, J. Liu, S. Iobst, S. Ouyang et al., 2005 Comparative analyses of six solanaceous transcriptomes reveal a high degree of sequence conservation and species-specific transcripts. BMC Genomics 6: 124.

Rhesus Macaque Genome Sequencing and Analysis Consortium, 2007 Evolutionary and biomedical insights from the rhesus macaque genome. Science 316: 222–234.

Ronning, C., S. Stegalkina, R. Ascenzi, O. Bougri, A. Hart et al., 2003 Comparative analyses of potato expressed sequence tag libraries. Plant Physiol. 131: 419–429.

Salzberg, S. L., A. L. Delcher, S. Kasif and O. White, 1998 Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26: 544–548.

Schwartz, S., W. J. Kent, A. Smit, Z. Zhang, R. Baertsch et al., 2003 Human-mouse alignments with BLASTZ. Genome Res. 13: 103–107.

Semon, M., and K. Wolfe, 2007 Consequences of genome duplication. Curr. Opin. Genet. Dev. 17: 505–512.

Shabalina, S. A., and A. S. Kondrashov, 1999 Pattern of selective constraint in C. elegans and C. briggsae genomes. Genet. Res. 74: 23–30.

Shabalina, S. A., A. Y. Ogurtsov, V. A. Kondrashov and A. S. Kondrashov, 2001 Selective constraint in intergenic regions of human and mouse genomes. Trends Genet. 17: 373–376.

Shabalina, S. A., A. Y. Ogurtsov and N. A. Spiridonov, 2006 A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res. 34: 2428–2437.

Siepel, A., G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou et al., 2005 Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15: 1034–1050.

Siepel, A., K. Pollard and D. Haussler, 2006 New methods for detecting lineage-specific selection, pp. 190–205 in Proceedings of the 10th International Conference on Research in Computational Molecular Biology (Lecture Notes in Computer Science, Vol. 3909), edited by A. Apostolico, C. Guerra, S. Istrail, P. Pevzner and M. Waterman. Springer, New York.

Siepel, A., M. Diekhans, B. Brejová, L. Langton, M. Stevens et al., 2007 Targeted discovery of novel human exons by comparative genomics. Genome Res. 17: 1763–1773.

Smith, N. G. C., M. Brandstrom and H. Ellegren, 2004 Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics 84: 806–813.

Solovyev, V. V., A. A. Salamov and C. B. Lawrence, 1994 Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22: 5156–5163.

Song, J., F. Dong and J. Jiang, 2000 Construction of a bacterial artificial chromosome (BAC) library for potato molecular cytogenetics research. Genome 43: 199–204.

Song, R., V. Llaca and J. Messing, 2002 Mosaic organization of orthologous sequences in grass genomes. Genome Res. 12: 1549–1555.

Sonnhammer, E., and E. Koonin, 2002 Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18: 619–620.

Stark, A., M. Lin, P. Kheradpour, J. Pedersen, L. Parts et al., 2007 Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450: 219–232.

Stein, L. D., Z. Bao, D. Blasiar, T. Blumenthal, M. R. Brent et al., 2003 The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 1: E45.

Tanksley, S., R. Bernatzky, N. Lapitan and J. Prince, 1988 Conservation of gene repertoire but not gene order in pepper and tomato. Proc. Natl. Acad. Sci. USA 85: 6419–6423.

Tavare, S., 1986 Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17: 57–86.

Thomas, B. C., L. Rapaka, E. Lyons, B. Pedersen and M. Freeling, 2007 Arabidopsis intragenomic conserved noncoding sequence. Proc. Natl. Acad. Sci. USA 104: 3348–3353.

Thomas, J. W., J. W. Touchman, R. W. Blakesley, G. G. Bouffard, S. M. Beckstrom-Sternberg et al., 2003 Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788–793.

Tiffin, P., and M. Hahn, 2002 Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. pekinensis. J. Mol. Evol. 54: 746–753.

Tuskan, G. A., S. Difazio, S. Jansson, J. Bohlmann, I. Grigoriev et al., 2006 The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604.

Van der Hoeven, R., C. Ronning, J. Giovannoni, G. Martin and S. Tanksley, 2002 Deductions about the number, organization, and evolution of genes in the tomato genome based on analysis of a large expressed sequence tag collection and selective genomic sequencing. Plant Cell 14: 1441–1456.

Vanderpool, C. K., and S. Gottesman, 2004 Involvement of a novel transcriptional activator and small RNA in post-transcriptional regulation of the glucose phosphoenolpyruvate phosphotransferase system. Mol. Microbiol. 54: 1076–1089.

Walsh, J. B., 1995 How often do duplicated genes evolve new functions? Genetics 139: 421–428.

Wang, Y., X. Tang, Z. Cheng, L. Mueller, J. Giovannoni et al., 2006 Euchromatin and pericentromeric heterochromatin: comparative composition in the tomato genome. Genetics 172: 2529–2540.

Washietl, S., I. L. Hofacker, M. Lukasser, A. Huttenhofer and P. F. Stadler, 2005 Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat. Biotechnol. 23: 1383–1390.

Wikstrom, N., V. Savolainen and M. W. Chase, 2001 Evolution of the angiosperms: calibrating the family tree. Proc. Biol. Sci. 268: 2211–2220.

Wingender, E., P. Dietze, H. Karas and R. Knuppel, 1996 TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24: 238–241.

Wortman, J. R., B. J. Haas, L. I. Hannick, R. K. J. Smith, R. Maiti et al., 2003 Annotation of the Arabidopsis genome. Plant Physiol. 132: 461–468.

Wright, S., C. Yau, M. Looseley and B. Meyers, 2004 Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Mol. Biol. Evol. 21: 1719–1726.

Wu, F., L. A. Mueller, D. Crouzillat, V. Petiard and S. D. Tanksley, 2006 Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: a test case in the euasterid plant clade. Genetics 174: 1407–1420.

Yang, Y. W., K. N. Lai, P. Y. Tai and W. H. Li, 1999 Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J. Mol. Evol. 48: 597–604.

Yang, Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.

Yang, Z., and R. Nielsen, 2002 Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19: 908–917.

Yang, Z., and A. D. Yoder, 2003 Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Syst. Biol. 52: 705–716.

Yang, Z., R. Nielsen and M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15: 1600–1611.

Yu, J., S. Hu, J. Wang, G. K.-S. Wong, S. Li et al., 2002 A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92.

Yu, J., J. Wang, W. Lin, S. Li, H. Li et al., 2005 The genomes of Oryza sativa: a history of duplications. PLoS Biol. 3: e38.

Zhang, J., R. Nielsen and Z. Yang, 2005 Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22: 2472–2479.

Zhang, L., T. Vision and B. Gaut, 2002 Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol. Biol. Evol. 19: 1464–1473.

Zhu, H., D.-J. Kim, J.-M. Baek, H.-K. Choi, L. C. Ellis et al., 2003 Syntenic relationships between Medicago truncatula and Arabidopsis reveal extensive divergence of genome organization. Plant Physiol. 131: 1018–1026.

Communicating editor: D. M. Rand
