Use of nuclear genes for phylogeny reconstruction in plants



CSIRO PUBLISHING

© CSIRO 29 April 2004 10.1071/SB03015 1030-1887/04/020145

www.publish.csiro.au/journals/asb Australian Systematic Botany 17, 145–170

L. A. S. JOHNSON REVIEW No. 2

Use of nuclear genes for phylogeny reconstruction in plants

Randall L. Small A,D, Richard C. Cronn B and Jonathan F. Wendel C

A Department of Botany, The University of Tennessee, Knoxville, TN 37996, USA.
B US Forest Service, Pacific Northwest Research Station, 3200 SW Jefferson Way, Corvallis, OR 97331, USA.
C Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA.
D Corresponding author; email: [email protected]

Abstract. Molecular data have had a profound impact on the field of plant systematics, and the application of DNA-sequence data to phylogenetic problems is now routine. The majority of data used in plant molecular phylogenetic studies derives from chloroplast DNA and nuclear rDNA, while the use of low-copy nuclear genes has not been widely adopted. This is due, at least in part, to the greater difficulty of isolating and characterising low-copy nuclear genes relative to chloroplast and rDNA sequences that are readily amplified with universal primers. The higher level of sequence variation characteristic of low-copy nuclear genes, however, often compensates for the experimental effort required to obtain them. In this review, we briefly discuss the strengths and limitations of chloroplast and rDNA sequences, and then focus our attention on the use of low-copy nuclear sequences. Advantages of low-copy nuclear sequences include a higher rate of evolution than for organellar sequences, the potential to accumulate datasets from multiple unlinked loci, and bi-parental inheritance. Challenges intrinsic to the use of low-copy nuclear sequences include distinguishing orthologous loci from divergent paralogous loci in the same gene family, being mindful of the complications arising from concerted evolution or recombination among paralogous sequences, and the presence of intraspecific, intrapopulational and intraindividual polymorphism. Finally, we provide a detailed protocol for the isolation, characterisation and use of low-copy nuclear sequences for phylogenetic studies.

Introduction

The impact of molecular data on the field of plant systematics can hardly be overstated. In combination with explicit methods for phylogenetic analysis, molecular data have reshaped concepts of relationships and circumscriptions at all levels of the taxonomic hierarchy (Qiu et al. 1999; Soltis et al. 1999; Crawford 2000). As molecular phylogenetic studies have accumulated, it has become apparent that different molecular tools are required for different questions because of varying rates of sequence evolution among genomes, genes and gene regions. The choice of molecular tool is of paramount importance to ensure that an appropriate level of variation is recovered to answer the phylogenetic question at hand. Nonetheless, the plant systematics community is using only a small fraction of the available molecular tools. The preponderance of molecular data applied to plant systematics problems come from two sources: chloroplast DNA (cpDNA) or nuclear ribosomal DNA (rDNA). While the contributions of cpDNA and rDNA to plant systematics are undeniable, reliance on these tools to the exclusion of other, perhaps more appropriate, tools is pervasive (Alvarez and Wendel 2003).

Alternatives to cpDNA and rDNA include both mitochondrial (mtDNA) and nuclear (nDNA) sequences other than rDNA. Because of its generally slow rate of sequence evolution and fast rate of structural evolution (Palmer 1992; Palmer et al. 2000), mtDNA generally has been ignored by plant systematists as a potential source of data (but see e.g. Qiu et al. 1998, 1999; Freudenstein and Chase 2001; Anderberg et al. 2002; Sanjur et al. 2002). For this reason, mtDNA will not be considered further in this review. Nuclear sequences other than rDNA represent most of the DNA contained in any given cell, comprising both high-copy repetitive DNA (e.g. transposons, centromeric and telomeric repeats), and low- to moderate-copy DNA elements, including the majority of genes. The evolutionary dynamics of these two classes of DNA can be dramatically different. For example, repetitive DNA may experience non-Mendelian transmission, be subject to concerted evolution, and/or be mobile within a genome (e.g. transposable elements). Accordingly, distinguishing sequences related by descent (orthologs) from a sea of related (but non-orthologous) sequences may be a challenge. Low-copy nuclear sequences, on the other hand, typically evolve independently of paralogous sequences and tend to be stable in position and copy number (but see e.g. Fu and Dooner 2002), thereby facilitating identification and isolation of orthologous sequences. It is this low-copy fraction of nDNA that we wish to focus on here, particularly with reference to applications at the level at which most systematists work, namely, among closely related species within one to several genera. We briefly review the pros and cons of using the now-traditional cpDNA and rDNA sequences. We then focus our attention on low-copy nuclear sequences, describing methodological, biological and conceptual issues surrounding their use in phylogenetic analysis. Finally, we describe a generally applicable strategy for the isolation, characterisation and phylogenetic use of low-copy nuclear genes.

Chloroplast DNA

The most widely used source of data in plant molecular phylogenetic analyses has been cpDNA, either in the form of restriction site or DNA sequence analysis. The use of cpDNA has been reviewed extensively (Olmstead and Palmer 1994; Soltis and Soltis 1998a) and will not be belabored here, but several points merit restating, with respect to the nature of cpDNA evolution and the general utility of cpDNA relative to nDNA.

Evolutionary dynamics of cpDNA

The primary advantages of cpDNA as a molecular tool lie in its relatively simple genetics. The chloroplast genome is a circular molecule found in multiple copies per chloroplast, and contains both coding (gene) and non-coding (intron and intergenic spacer) sequences. The presence of multiple copies of the chloroplast genome per chloroplast, coupled with the presence of multiple chloroplasts per leaf cell, means that cpDNA is relatively high copy within a typical genomic DNA prep. This property is a significant advantage as the high copy number facilitates restriction site analysis as well as PCR amplification of specific cpDNA regions because higher copy-number sequences are more readily accessible.

Chloroplast DNA is generally characterised as structurally stable, haploid, non-recombinant and generally uniparentally inherited (primarily maternally in angiosperms, although examples of paternal and biparental inheritance are also known), all features that facilitate its use in systematic studies. Structural stability across large evolutionary scales has been demonstrated by comparative cpDNA mapping and sequencing, and numerous studies have shown that cpDNA molecules are highly conserved with respect to gene content and arrangement, especially among closely related species (Olmstead and Palmer 1994). As evolutionary divergence increases, structural mutations (inversions, indels, and expansion/shrinkage of the inverted repeat) become more pronounced, but overall gene content and order remain remarkably consistent. This structural stability has facilitated the design of ‘universal’ PCR primers and heterologous cpDNA probes. In addition, and most importantly, it allows the expectation that specific DNA sequences isolated from different species are orthologous.

Haploidy, uniparental inheritance and the absence of recombination among cpDNA molecules are also important features. A central assumption of phylogenetic analysis is that terminal taxa are the product of bifurcating lineage-splitting events, rather than products of reticulation. For haploid (and thus non-recombinant), uniparentally inherited cpDNA molecules, relationships are by definition bifurcating rather than reticulate, and hence these are appropriate terminals for phylogenetic analysis (Doyle 1992). Additionally, to the extent that cpDNA is haploid, intra-individual (allelic) variation is absent, thus simplifying analyses. The haploid nature of cpDNA also serves to reduce the amount of intraspecific and intrapopulation variation. Since the effective population size of a haploid genome is smaller than that of a diploid genome (1/4 in dioecious plants; 1/2 in monoecious plants), coalescence times and time to fixation of cpDNA haplotypes within a population are short relative to diploid genomes.
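The fractions quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes an idealised population of N breeding individuals, strictly maternal cpDNA inheritance and, for dioecious plants, an equal sex ratio. None of these assumptions, nor the function names, come from the text.

```python
from fractions import Fraction


def transmitted_gene_copies(n_individuals: int, genome: str, breeding: str) -> int:
    """Number of gene copies transmitted per generation (illustrative model).

    Assumes strictly maternal cpDNA inheritance and, for dioecious plants,
    an equal sex ratio, so that N/2 individuals are seed parents.
    """
    if genome == "nuclear":
        # Diploid and biparental: every individual carries two copies.
        return 2 * n_individuals
    if genome == "chloroplast":
        # Haploid: one effective copy per individual that transmits cpDNA.
        if breeding == "monoecious":
            return n_individuals        # every plant can act as a seed parent
        if breeding == "dioecious":
            return n_individuals // 2   # only female plants transmit cpDNA
    raise ValueError("unknown genome or breeding system")


def cp_to_nuclear_ratio(n_individuals: int, breeding: str) -> Fraction:
    """Effective size of cpDNA relative to a diploid nuclear locus."""
    cp = transmitted_gene_copies(n_individuals, "chloroplast", breeding)
    nuc = transmitted_gene_copies(n_individuals, "nuclear", breeding)
    return Fraction(cp, nuc)


ratio_dioecious = cp_to_nuclear_ratio(1000, "dioecious")
ratio_monoecious = cp_to_nuclear_ratio(1000, "monoecious")
print(ratio_dioecious, ratio_monoecious)
```

Under these assumptions the ratios do not depend on N. Because expected coalescence time in such a model scales with the number of transmitted gene copies, the same fractions would apply to coalescence times.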

These generalisations are, however, not without exceptions. For example, while organellar genomes are assumed to be non-recombinant, evidence from Pinus contorta (Marshall et al. 2001) suggests cpDNA recombination may occur in plants, similar to reports of mtDNA recombination in hominids (Awadalla et al. 1999; Eyre-Walker 2000). Uniparental inheritance is usually assumed in cpDNA studies, and while most studies have confirmed uniparental inheritance of organellar genomes (maternal in angiosperms, paternal in gymnosperms), exceptions, including both paternal and biparental inheritance, have been reported in angiosperms (Birky 1995; Corriveau and Coleman 1988; Reboud and Zeyl 1994). The primary effect of these processes on molecular systematics is an increase in homoplasy over what might be expected from a uniparentally inherited, and thus non-recombining molecule.

There is some irony in the realisation that the properties of cpDNA that make it an attractive tool for molecular systematics also hinder its overall utility in phylogenetic analyses. These stem from the propensity of plants to hybridise and undergo polyploidisation. Because cpDNA is uniparentally inherited and haploid, it reveals only half of the parentage in plants of hybrid or polyploid origin (generally the seed parent in angiosperms because of maternal inheritance). Thus, cpDNA analysis of hybrid or polyploid plants may incorrectly identify them as belonging to a clade of one of the two parents, without revealing the hybrid history. If hybridisation is followed by introgression and subsequent fixation of the alien cpDNA, then the phylogeny may be accurately resolved with regard to the maternal history; however, cpDNA fails to identify the phylogenetic conflict arising from a hybrid ancestry.

Evolutionary rates of cpDNA

Generally, in plants the mitochondrial genome evolves at the slowest rate, the chloroplast genome at a slightly faster rate and the nuclear genome at the fastest rate (Wolfe et al. 1987; Gaut 1998). There is, of course, substantial rate variation within genomes, and coding regions evolve more slowly than non-coding regions (introns and intergenic spacers), presumably because of selective constraints. For this reason, cpDNA gene sequences (e.g. rbcL, atpB, matK and ndhF) have been used extensively at the family level and above (Chase et al. 1993; Qiu et al. 1999; Soltis et al. 1999), while non-coding sequences such as introns (e.g. rpL16, rpoC1, rpS16, trnL, trnK) and intergenic spacers (e.g. trnT–trnL, trnL–trnF, atpB–rbcL, psbA–trnH) are used more frequently at lower taxonomic levels (Taberlet et al. 1991; Sang et al. 1997a; Small et al. 1998). Given the relatively slow rate of cpDNA evolution, however, even non-coding cpDNA regions often fail to provide significant phylogenetic information at low taxonomic levels (Small et al. 1998). This latter problem represents a significant limitation of cpDNA sequences for phylogenetic reconstruction, as applicability is restricted to relatively deep divergence histories.

Nuclear ribosomal DNA

To compensate for the limitations of cpDNA, as well as to obtain additional and independent estimates of phylogeny, nuclear rDNA has been widely adopted as a tool in plant systematics, and is now as commonly used as cpDNA. At higher taxonomic levels the slowly evolving rRNA genes are used (Hamby and Zimmer 1988, 1992; Steele et al. 1991; Soltis et al. 1997; Kuzoff et al. 1998; Soltis and Soltis 1998b), while at lower taxonomic levels internal and external intergenic spacers are more commonly employed (Baldwin et al. 1995; Alvarez and Wendel 2003; Bailey et al. 2003).

Evolutionary dynamics of rDNA

In land plants rRNA genes are organised into two distinct sets of tandem arrays. The first set is composed of 5S rRNA genes and intergenic spacers in tandem arrays at one or more chromosomal loci. The second set includes the 18S–5.8S–26S rDNA cistron in tandem arrays at one or more chromosomal loci. Both sets of rDNA arrays have been used in systematic studies, although the 18S–5.8S–26S arrays have been used far more frequently than 5S. As for cpDNA, the potential phylogenetic utility of rDNA is facilitated by its structure and molecular evolution. Ribosomal genes exist in tandem arrays of genes composed of hundreds to thousands of copies per array (Baldwin et al. 1995; Cronn et al. 1996; Hanson et al. 1996b; Kuzoff et al. 1998; Soltis et al. 1997; Soltis and Soltis 1998b; Wendel et al. 1995a). The higher copy number facilitates evaluation of rDNA by both restriction site- and PCR-based strategies. In addition, the repetitive structure of these arrays promotes a process of homogenisation (‘concerted evolution’: Zimmer et al. 1980; Arnheim 1983) that may result in a single predominant sequence across all copies and arrays. This homogenisation allows PCR products to be directly sequenced, generally yielding a single dominant sequence that is assumed to be representative of the underlying genomic sequences.

While rDNA has contributed substantially to our present understanding of plant systematics, the highly repetitive nature of rDNA gives it properties that affect the potential utility and reliability of these regions in phylogenetic studies (Alvarez and Wendel 2003; Bailey et al. 2003). These potential difficulties stem from the structure and evolutionary dynamics of rDNA: specifically, the presence of both multiple copies per array and multiple arrays per genome, and the presence, absence, or variable strength of concerted evolution. As noted above, concerted evolution tends to homogenise sequences within and sometimes between rDNA arrays. The important phrase here is ‘tends to.’ The strength of concerted evolution is far from uniform across repeat units, arrays and taxa. In the absence of complete concerted evolution, sequence variants can arise and be maintained within and between arrays, yielding multiple distantly related rDNA types within individuals (Suh et al. 1993; Cronn et al. 1996; Buckler et al. 1997; Hartmann et al. 2001; Mayol and Rossello 2001; Muir et al. 2001; Bailey et al. 2003). Indeed, rDNA pseudogenes may be regular constituents of many genomes, representing phylogenetically distant sequences that may preferentially be amplified over functional rDNA loci (Buckler and Holtsford 1996; Buckler et al. 1997; Hartmann et al. 2001; Mayol and Rossello 2001; Chase et al. 2003).

More problematic, however, is the possibility that such erratic variation is the norm for rDNA rather than the exception, even though such findings may be generally ignored, underreported or undetected (Alvarez and Wendel 2003). For example, polymorphic positions are found frequently in published rDNA datasets. Such polymorphisms generally are simply coded as polymorphic characters for phylogenetic analyses, ignoring the fact that these positions are evidence of unhomogenised sequence variants, perhaps of distantly related sequences. In addition, sequence polymorphism may go undetected in automated or manual sequencing, or else the strongest peak (or band) at a position may be scored as the ‘correct’ base. Finally, PCR bias may selectively amplify one of the genomic sequences because of differences in genomic copy number or primer affinity (Wagner et al. 1994). The sum of these observations is that rDNA sequences obtained by direct sequencing of PCR products may fail to reveal the complexity of nuclear rDNA content, and may in fact preferentially reveal paralogous rDNA sequences in different taxa or accessions.

While the prevalence of this phenomenon can be difficult to ascertain in diploid species, the problem becomes exacerbated in allopolyploid and hybrid taxa in which multiple divergent rDNA loci are expected to exist on different chromosomes donated by the different parents. For example, allotetraploid (AD-genome) species of Gossypium contain genomes donated from an African A-genome diploid and a New World D-genome diploid. Direct sequencing of PCR-amplified ITS fragments (Wendel et al. 1995a) from the five allotetraploid species revealed only a single rDNA sequence type characteristic of either the A- or D-genome (but not both), despite the fact that Southern hybridisation (Wendel et al. 1995a) and FISH (Hanson et al. 1996b) evidence indicated that both genome types were maintained in the allotetraploid. Similar results have been reported in allotetraploid Glycine (Rauscher et al. 2002) and elsewhere (reviewed in Alvarez and Wendel 2003; Bailey et al. 2003) where biased amplification of ITS sequences from specific parental diploid genomes was observed, perhaps because of differential underlying copy number.

Evolutionary rates of rDNA

One of the reasons for adding rDNA sequences to the arsenal of tools available to plant molecular systematists is to obtain data from a source independent of cpDNA. The other primary reason, especially at lower taxonomic levels, is to obtain sequences that evolve at a faster rate, so that more phylogenetically informative characters are obtained. In phylogenetic studies at low taxonomic levels in which both non-coding cpDNA and ITS data are collected for a set of taxa, ITS sequences generally provide greater levels of divergence and thus greater resolution and stronger support than an equivalent sample of cpDNA sequence (e.g. Hodges and Arnold 1994; Gielly et al. 1996; Sang et al. 1997a; Whitten et al. 2000). The more rapid rate of evolution of ITS, however, is tempered by its relatively short length (generally 500–600 bp in angiosperms) and the relatively high levels of homoplasy (Alvarez and Wendel 2003). Recently, attempts to add the flanking external transcribed spacer (ETS) of nuclear rDNA to supplement ITS sequences have met with some success (Baldwin and Markos 1998; Linder et al. 2000). The internally repetitive structure of the ETS region, however, can make both PCR amplification and sequence alignment difficult, thus presenting an additional obstacle to the widespread adoption of ETS in systematic studies. Finally, it is important to note that because of the linked nature of the 45S rDNA cistron, the ETS region will fall prey to the same pitfalls and problematic evolutionary dynamics as ITS. For this reason, congruence between ITS- and ETS-derived topologies provides not so much an independent confirmation of phylogenetic signal as confirmation of genetic and physical linkage.

Low-copy nuclear genes

While the use of low-copy nuclear genes for phylogeny reconstruction is still in its relative infancy, several conclusions may be drawn regarding both utility and limitations. Among the advantages of nuclear genes are an overall faster rate of sequence evolution than in organellar genomes, the presence of multiple independent (unlinked) loci and biparental inheritance. Disadvantages of nuclear loci stem primarily from the more complex genetic architecture and evolutionary dynamics of the nuclear genome and possible difficulties in isolating and identifying orthologous genes. Other relevant issues include the possibilities of concerted evolution and/or recombination among paralogous sequences and the presence of intraspecific, intrapopulational and intraindividual variation (heterozygosity).

Advantages of nuclear genes

Rate variation in nuclear genes

One of the primary advantages of nuclear genes for phylogenetic analysis is the elevated rate of sequence evolution relative to organellar genes. Broad surveys have revealed that synonymous substitution rates of nuclear genes are up to five times greater than those of chloroplast genes and 20 times greater than those of mitochondrial genes (Wolfe et al. 1987; Gaut 1998). This elevated evolutionary rate yields a greater efficiency of sequencing effort, since more variation is detected per unit of sequence than in organellar genes. This advantage becomes particularly acute at low taxonomic levels (Small et al. 1998) or among lineages created by rapid divergence events (Cronn et al. 2002b; Malcomber 2002). For example, in an analysis of the relationships among the allotetraploid species of Gossypium L. (Small et al. 1998), direct comparison of non-coding cpDNA and nuclear-encoded alcohol dehydrogenase (AdhC) sequences showed that relationships were incompletely resolved and poorly supported by >7 kb of non-coding cpDNA, while data from a 1.65-kb section of AdhC provided complete and robustly supported resolution of relationships. Extrapolation of these results suggested that to obtain equivalent phylogenetic resolution from the cpDNA as the AdhC sequences, >40 kb of non-coding cpDNA would have to be sampled.
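The extrapolation above implies a simple ratio of sequencing effort. The short calculation below uses only the two lengths quoted in the text (1.65 kb of AdhC and the lower bound of 40 kb of non-coding cpDNA); the variable names are illustrative. Because the 40-kb figure is a lower bound, the result is a lower bound as well.

```python
# Lengths quoted in the text, in kilobases.
adhc_kb = 1.65          # AdhC sequence that fully resolved the relationships
cpdna_needed_kb = 40.0  # lower bound on non-coding cpDNA needed for equal resolution

# Minimum fold-increase in sequencing effort when relying on cpDNA alone.
effort_ratio = cpdna_needed_kb / adhc_kb
print(f"cpDNA requires at least {effort_ratio:.1f}x as much sequence as AdhC")
```

This ratio describes one gene in one genus. Nothing in the text suggests it transfers to other loci or taxa.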

While this example highlights a potential advantage of nuclear genes in terms of a faster evolutionary rate, it ignores the huge range of evolutionary rates observed among nuclear genes, regions within genes, and even sites within genes. For example, the selection of AdhC for the study of Small et al. (1998) was fortuitous, as this gene is the most rapidly evolving of the Gossypium Adh gene family (Small and Wendel 2000a) and the fifth most rapidly evolving gene among a sample of 49 nuclear genes in Gossypium (Senchina et al. 2003). Variation in overall evolutionary rate is observed whenever multiple genes are sampled in a given phylogenetic context. For example, in comparisons of A- and D-genome diploid cottons for the Adh gene family of Gossypium, a greater than 3-fold range of variation was observed for synonymous sites (Small and Wendel 2000a). Broader sampling of 16 mostly anonymous nuclear loci from Gossypium showed a 5-fold range of variation in overall evolutionary divergence (Cronn et al. 1999), and a recent study of 36 ‘fibre-expressed’ genes from the same species (Senchina et al. 2003) revealed a 7-fold range of variation in overall divergence and a 2.9-fold range in synonymous substitution rates. These ranges are in accordance with those derived from other well studied groups (Gaut 1998; Mathews et al. 2002). Such wide ranging variation will clearly influence the lineage-specific utility of a given nuclear gene, and points to the importance of screening multiple loci and choosing markers on the basis of preliminary evidence from the taxa of interest.

In addition to rate variation among genes, there is rate variation among regions within genes. Nuclear-encoded genes can generally be divided into different functional regions, such as the 5′ untranslated region (UTR), exons, introns and 3′ UTR. Given the variable functions of these regions, and thus variable evolutionary constraints on them, considerable variation in evolutionary rates exists within a single locus. Important promoter elements responsible for gene regulation are generally found in the 5′ UTR which may include relatively conserved domains (Wray et al. 2003). In some cases, however, 5′ UTRs may contain introns that are highly variable (Liu et al. 2001). Exons are likely to be more conserved at non-synonymous sites (primarily 1st and 2nd codon positions) while synonymous 3rd codon positions typically diverge at rates similar to those of non-coding regions. Nuclear spliceosomal introns generally are under fewer functional constraints at the sequence level, although there often are length limits on introns that are important for proper mRNA processing, and occasionally regulatory elements lie within introns (e.g. Ahlandsberg et al. 2002; Hong et al. 2003) that also are likely to be highly conserved. Elements in the 3′ UTR sequences control mRNA processing and poly-A addition signals, but otherwise are often highly variable. Perhaps counterintuitively, accumulating evidence indicates that synonymous sites within exons (i.e. 3rd codon positions) may exhibit evolutionary rates that are equal to or greater than those found at ‘silent’ non-coding sites (introns, UTRs) (Charlesworth and Charlesworth 1998; Small and Wendel 2002; Senchina et al. 2003), presumably because of conserved structural features and/or existence of regulatory domains in the latter.
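One way to see how divergence partitions among codon positions is to tally mismatches between two aligned coding sequences. The sketch below is a simplified illustration, not a method from this review: the function name and the toy sequences are invented. It treats 3rd positions as a rough proxy for synonymous sites, which is only approximately true, as the word "primarily" in the text above indicates.

```python
def differences_by_codon_position(seq_a: str, seq_b: str) -> dict[int, int]:
    """Count mismatches between two aligned, in-frame coding sequences
    at codon positions 1, 2 and 3."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal in length and in frame")
    counts = {1: 0, 2: 0, 3: 0}
    for index, (base_a, base_b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if base_a != base_b:
            # index 0, 1, 2 within each codon maps to position 1, 2, 3
            counts[index % 3 + 1] += 1
    return counts


# Toy sequences invented for illustration; both encode Met-Ala-Lys.
toy_a = "ATGGCTAAA"
toy_b = "ATGGCCAAG"
toy_counts = differences_by_codon_position(toy_a, toy_b)
print(toy_counts)
```

In this toy pair both differences fall at 3rd positions and leave the encoded amino acids unchanged. A real analysis would need an alignment that handles gaps, and an explicit model of synonymous and non-synonymous substitution.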

Multiple independent loci

In addition to faster evolutionary rates, nuclear genes offer another vitally important feature: multiple unlinked loci may be used for independent phylogenetic inference. Corroboration of phylogenetic hypotheses by independent datasets increases confidence in a given phylogenetic tree. Likewise, incongruence between datasets has been of use to infer evolutionary phenomena such as hybridisation, introgression or lineage sorting (Wendel and Doyle 1998). In plant systematics, such comparisons are typically between cpDNA and nrDNA datasets. Given the complete linkage among cpDNA sites owing to its structure (a single circular chromosome) and its non-recombining nature, multiple cpDNA datasets are expected to converge on a single topology (irrespective of whether or not this reflects the organismal phylogeny). Ribosomal DNA offers a single alternative from the nuclear genome, but if incongruence between rDNA and cpDNA is identified, independent data must be obtained to sort out the source of the incongruence. Moreover, and as alluded to above and discussed at length elsewhere (Alvarez and Wendel 2003), rDNA sequences may prove phylogenetically unreliable under certain conditions. In Gossypium, unexpected and anomalous phylogenetic results obtained with rDNA (Wendel et al. 1995a, 1995b) have been reconciled only through the use of multiple, unlinked low-copy loci (Small et al. 1998; Liu et al. 2001; Cronn et al. 2002b, 2003), each of which provides an independent evaluation of the unexpected results from rDNA.

Low-copy nuclear genes are capable of providing a virtually limitless source of additional, independent phylogenetic information. Because eukaryotic nuclear genomes are composed of multiple chromosomes that undergo recombination, genes found on different chromosomes, or even on the same chromosome if sufficiently far apart, are effectively evolutionarily independent of each other. Factors that may give rise to phylogenetic incongruence between datasets (e.g. hybridisation/introgression; non-homologous gene conversion) are likely to affect a limited number of genes in a localised region. Acquisition of data from multiple independent regions allows an opportunity to determine phylogenetic relationships supported by a majority of the data and can highlight particular genomic regions that may be problematic (Cronn et al. 2002b, 2003; Rokas et al. 2003).

Biparental inheritance

A final desirable property of low-copy nuclear genes is their explicitly biparental Mendelian inheritance. Because the chloroplast genome is usually uniparentally inherited, cpDNA datasets can tell only half the story in cases of hybridisation or allopolyploidisation. Likewise, while nrDNA is biparentally inherited, the process of concerted evolution, array expansion/contraction and the presence of paralogous sequences can make isolation of both parental copies difficult, if not impossible (Rauscher et al. 2002; Alvarez and Wendel 2003).

In contrast, low-copy nuclear genes are less frequently subject to concerted evolution (although exceptions have been reported; see below), thus making them ideal candidates for identifying parental donors of suspected hybrids or polyploids. While phylogenetic studies of allotetraploid Gossypium based on cpDNA identified the maternal lineage (Wendel 1989; Wendel and Albert 1992), and ITS analyses were complicated by bi-directional interlocus concerted evolution (Wendel et al. 1995a), analyses of low-copy nuclear genes from tetraploid Gossypium have unambiguously identified the lineages representing the parental donor species (Small et al. 1998; Cronn et al. 1999, 2002b; Small and Wendel 2000a; Liu et al. 2001). Low-copy nuclear genes have been used to identify the origins of hybrid or allopolyploid taxa in a number of other groups as well (Sang et al. 1997b; Doyle et al. 1999a, 1999b; Ford and Gottlieb 1999; Ge et al. 1999; Sang and Zhang 1999; Mason-Gamer 2001).

Fig. 1. Hypothetical scenario of gene duplication, followed by speciation events to depict relationships among orthologous, paralogous and homoeologous gene copies. Orthologous genes are related solely by speciation (e.g. 1A and 1B). Paralogous genes are related by gene duplication and are found either within species (e.g. 1A and 2A) or between species (e.g. 1A and 2B). Homoeologous genes are orthologous genes that are related by polyploidisation (e.g. 1A′ and 1C′). A gene duplication occurs near the base of the tree, resulting in two genetic loci (Locus 1 and Locus 2). Two speciation events give rise to three extant species (Species A–C), each of which is diploid and thus contains two haploid genomes (AA, BB and CC, respectively). Hybridisation between species A and C, followed by polyploidisation, results in the tetraploid AC species (with four haploid genomes, AACC). Gene copies in the tetraploid are marked with a prime (′) to distinguish them from copies found in the diploid progenitors. Orthologous genes: 1A, 1B, 1C; 2A, 2B, 2C. Paralogous gene pairs: 1A, 2A; 1B, 2B; 1C, 2C; 1A′, 2A′; 1C′, 2C′; 1A, 2B; 1A, 2C; 1A, 2A′; 1A, 2C′; 1B, 2A; 1B, 2C; 1B, 2A′; 1B, 2C′; 1C, 2A; 1C, 2B; 1C, 2A′; 1C, 2C′; 1A′, 2A; 1A′, 2B; 1A′, 2C; 1A′, 2C′; 1C′, 2A; 1C′, 2B; 1C′, 2C; 1C′, 2A′. Homoeologous gene pairs: 1A′, 1C′; 2A′, 2C′.
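The relationships defined in Fig. 1 can be enumerated mechanically, which may help when checking a larger gene family by hand. The following Python sketch is illustrative only: the `COPIES` table and the `classify` function are hypothetical names introduced here, and the rules encode one reading of the caption (different loci imply paralogy; two tetraploid copies of one locus from different genomes are homoeologous; diploid copies of one locus are orthologous). Pairs that the caption does not list, such as a diploid copy and its tetraploid counterpart, are left unclassified rather than guessed.

```python
from itertools import combinations

# Gene copies from Fig. 1, stored as (locus, genome, carried_by_tetraploid).
# An ASCII apostrophe stands in for the prime used in the figure (1A' = 1A prime).
COPIES = {
    "1A": (1, "A", False), "1B": (1, "B", False), "1C": (1, "C", False),
    "2A": (2, "A", False), "2B": (2, "B", False), "2C": (2, "C", False),
    "1A'": (1, "A", True), "1C'": (1, "C", True),
    "2A'": (2, "A", True), "2C'": (2, "C", True),
}

def classify(x, y):
    """Classify a pair of gene copies following the definitions in the Fig. 1 caption."""
    locus_x, genome_x, tet_x = COPIES[x]
    locus_y, genome_y, tet_y = COPIES[y]
    if locus_x != locus_y:
        return "paralogous"        # separated by the basal gene duplication
    if tet_x and tet_y and genome_x != genome_y:
        return "homoeologous"      # same locus, reunited by polyploidisation
    if not tet_x and not tet_y:
        return "orthologous"       # same locus, separated by speciation only
    return "unclassified"          # diploid vs. tetraploid copy; not listed in the caption

# Group every unordered pair of copies by its relationship.
pairs = {}
for x, y in combinations(COPIES, 2):
    pairs.setdefault(classify(x, y), []).append((x, y))

for relationship, members in pairs.items():
    print(relationship, len(members))
```

Under these assumptions the enumeration recovers the 25 paralogous pairs and the two homoeologous pairs given in the caption.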

Conceptual, methodological and biological issues in using nuclear genes

Gene families and the isolation and identification of orthologous genes

The primary disadvantage of using nuclear genes stems from the complicated genetic architecture of eukaryotic nuclear genomes and the tendency for nuclear genes to exist in gene families, i.e. multiple copies of homologous genes related by gene (or genome) duplication (Fig. 1) (Henikoff et al. 1997; Thornton and DeSalle 2000). In plants, nuclear-gene families vary widely in size, ranging from genes that appear to be truly single copy in some species (e.g. GBSSI in diploid Poaceae; Mason-Gamer et al. 1998) to gene families with dozens to hundreds of copies (e.g. actins: Moniz de Sá and Drouin 1996; or small heat-shock proteins: Waters 1995). Further complicating the situation is the fact that gene and genome duplication (and subsequent gene loss) appear to be ongoing and dynamic processes (Gottlieb and Ford 1996; Clegg et al. 1997; Wagner 1998, 2001; Grant et al. 2000; Lynch and Conery 2000; Small and Wendel 2000a; Bancroft 2001; Gaut 2001; Ford and Gottlieb 2002). Because of this, characterisation of a gene family necessarily becomes a lineage-specific problem, and inferences of gene family structure from one lineage may not be applicable to other lineages. For example, plant Adh gene families are often considered to be small, with one to three loci present in most diploids (e.g. Chang and Meyerowitz 1986; Sang et al. 1997b; Gaut et al. 1999). In contrast, the Adh gene families in Gossypium (Small and Wendel 2000a) and Pinus (Perry and Furnier 1996) have been shown to have a minimum of seven loci in diploid species. Phylogenetic analyses of the Adh gene family in angiosperms suggest multiple rounds of both ancient and recent gene duplication and deletion (Clegg et al. 1997; Small and Wendel 2000a).

Given the complexity of nuclear genomes, the need to characterise gene family composition prior to phylogenetic analysis is of paramount importance. Phylogenetic analyses implemented to estimate organismal phylogeny assume that comparisons are between strictly orthologous gene copies, and the inclusion of a mixture of orthologous and paralogous sequences can produce robust, yet erroneous, hypotheses of relationships (Wendel and Doyle 1998). Rigorous elucidation of the size of a gene family, as well as criteria for establishing orthology, should be considered a first step in preliminary studies of the potential utility of nuclear genes (see below).

As gene family structure varies widely both among gene families and among plant lineages, initial studies of nuclear genes should focus on characterising the target gene family in the taxa of interest. The usual route taken by many investigators is to design or obtain ‘universal’ primers for a particular gene family, usually by comparison of gene sequences from a wide phylogenetic array of taxa, with primers located in regions of strong sequence conservation (Sang et al. 1997b; Strand et al. 1997; Mason-Gamer et al. 1998; Small et al. 1998; Evans et al. 2000; Small and Wendel 2000a).

PCR using ‘universal’ primers frequently results in amplification of more than one product, as evidenced by either multiple amplified bands or obvious sequence heterogeneity in the PCR pool. Resolution of this sequence complexity and subsequent development of locus-specific primers generally requires isolation of individual PCR products from the heterogeneous initial amplification reaction, often accomplished by cloning PCR products and screening multiple clones. The most efficient approach is to select a few representative taxa for this initial screening. Once all apparent sequence types from the representative taxa have been obtained, phylogenetic analysis of those sequences is performed. Ideally, these preliminary analyses should take place in the context of related sequences from other taxa in GenBank (http://www.nlm.nih.gov). Appropriate sequences for comparison can often be obtained from GenBank by reference to the literature, searching by gene name or by BLAST searches [Basic Local Alignment Search Tool; http://www.ncbi.nlm.nih.gov/BLAST/; (Altschul et al. 1990)] for similar sequences. This exercise is an important component of a preliminary study (see below), as it can differentiate between relatively recent and ancient gene-duplication events. Confidence in assumptions of orthology is generally stronger when the putatively orthologous sequences in question are monophyletic and divergent from paralogs. Preliminary phylogenetic analyses also provide important information on relative rates of sequence evolution within and among genes (see below).

Once an initial characterisation of a gene family has been conducted and appropriate loci have been selected for further study, the next logical step is to develop locus-specific PCR primers. This step can improve efficiency by reducing the necessity of cloning heterogeneous PCR products (except for those cases where heterozygosity is encountered). Locus-specific primers have the added benefit of eliminating the possibility of recovering cloned PCR-generated chimeras (PCR-mediated recombinants) that can form whenever two highly similar templates are co-amplified in a single PCR reaction (Bradley and Hillis 1997; Cronn et al. 2002a). If homogeneous sequences are obtained from direct sequencing of PCR products from all taxa of interest, there is also greater confidence that a single orthologous locus is being amplified and sequenced from all taxa.

Once a group of candidate loci has been identified, orthology must be evaluated empirically. A number of criteria that vary in their methodology and assumptions have been suggested as evidence of orthology. These criteria include overall sequence similarity, tissue specificity and expression patterns (Doyle 1991; Doyle and Doyle 1999), Southern hybridisation analysis (Cronn and Wendel 1998; Evans et al. 2000; Small and Wendel 2000a) and comparative genetic mapping (Cronn and Wendel 1998; Small and Wendel 2000a; Cronn et al. 2002b).

The simplest approach to identifying orthologs is through sequence similarity and phylogenetic analysis. Clearly, orthologs are expected to be more similar and phylogenetically more closely related to each other than to any paralogous sequence, assuming complete sampling of genes within all taxa. This expectation can be violated, however, for at least two reasons. First, complete sampling of all genes of a gene family in all taxa may not be accomplished, owing to the challenges of generating and isolating all relevant gene copies from each taxon in a study. PCR amplification with ‘universal’ primers may result in differential amplification of loci in different taxa because of imperfect pairing between PCR primers and templates or relative qualities of template DNA (Wagner et al. 1994). Subsequent cloning of heterogeneous PCR products samples only a subset of the PCR products present, and screening of numerous clones (by sequencing or restriction digestion) is required to differentiate orthologs from paralogous sequences. Subsequent phylogenetic analysis of the sequences may reveal two sequences from different taxa that are more closely related to each other than either is to suspected paralogs; yet these sequences may indeed be paralogous rather than orthologous if they are related by a recent gene-duplication event and both paralogs are not sampled in both taxa. The second issue that bears on this approach is the problem of either in vivo or in vitro (PCR-mediated) recombination. In this scenario, closely related paralogous loci may undergo non-homologous recombination, and phylogenetic analysis of the resulting chimeric sequences will confound rather than illuminate relationships and inferences of orthology/paralogy. It should be noted that this sequence-based method of determining orthology represents minimal evidence for orthology and is typically included as a precursor to the following three methods of homology assessment.

A second approach to identifying orthologs is to use shared expression patterns as evidence of orthology (Doyle 1991; Doyle and Doyle 1999). One of the features of nuclear-gene families may be differential expression patterns (either timing or tissue specificity) of paralogous genes. In fact, gene families may exist primarily because paralogous copies are adapted to function differentially, and if orthologous genes serve the same function in different species they can be expected to maintain similar expression patterns. However, expression data may be difficult to obtain, often requiring either detailed RNA (e.g. Northern blot or RT–PCR) or protein (Western blot or isozyme) analysis. In some cases, expression patterns can be inferred from sequence data, as is the case with gene families that have both cytosolic- and organellar-expressed genes. Such cases usually represent ancient gene duplications that are readily identifiable from the sequence data alone and the evidence of orthology is clear. For example, both the phosphoglucose isomerase (PGI) (Gottlieb and Ford 1996; Ford and Gottlieb 1999; Ford and Gottlieb 2002) and glutamine synthetase (GS) (Doyle 1991; Emshwiller and Doyle 1999; Emshwiller and Doyle 2002) gene families contain both cytosolic- and chloroplast-expressed paralogous gene copies. Sequence conservation among these classes allows ready identification of the expression pattern of each paralog. Difficulties may arise, however, if a relatively recent gene duplication creates paralogous gene copies with similar or identical expression patterns, highlighting a second difficulty of using expression patterns as a criterion of orthology. Specifically, recent gene duplications within a particular expression class may result in paralogous gene copies that share both strong sequence similarity and expression patterns. A recent study of gene pairs duplicated by polyploidy in allotetraploid cotton (Adams et al. 2003) shows how subsequent expression patterns of duplicated genes may be not only tissue specific, but locus- or genome-specific. This could create a situation whereby one copy (e.g. the A′ genome homoeolog in Fig. 1) would be expressed in one species, while the other copy (e.g. the C′ genome homoeolog in Fig. 1) could be expressed in a closely related species. For this reason, estimating orthologous relationships among sequences on the basis of tissue-specific mRNA pools could prove particularly challenging, especially if polyploids were included in the analysis. An additional example is provided by the chloroplast-expressed PGI genes of Clarkia (Onagraceae) (Gottlieb and Ford 1996; Ford and Gottlieb 1999; Ford and Gottlieb 2002).

A third and more practical criterion that can strengthen inferences of orthology is Southern blot hybridisation. Once a suite of sequences has been obtained from a given set of taxa, comparisons of sequence identity between classes of loci can be used to identify regions that are ‘locus-specific’, i.e. sequence motifs that are found in only one of the sequence types or are sufficiently divergent between loci to minimise cross-hybridisation under conditions of high stringency. Generally, introns and 3′ UTRs make the best candidates for such motifs. Locus-specific hybridisation probes are then designed from these regions, amplified by PCR, and used to probe restriction-digested genomic DNAs of representative taxa. Results from high-stringency hybridisations using these ideal probes provide an opportunity to ‘count genes’, since each band on the resulting autoradiograph represents a single locus. If a single band is detected using this approach in the taxa examined, this constitutes strong evidence that the locus is unique, and by inference, orthologous among taxa. If multiple bands are detected under stringent hybridisation conditions, then either (1) a restriction site was present within the probe region (which can be avoided by choosing appropriate enzymes and/or evaluating multiple enzymes), or (2) there are multiple, closely related (but paralogous) loci that share high sequence identity with the probe.

An example of the power of this approach is provided by the Adh gene family of Gossypium that contains a minimum of seven loci. Southern hybridisation experiments were informative in identifying both single-copy orthologous loci shared among species as well as documenting a small gene ‘subfamily’, a group of closely related genes resulting from recent gene duplications (Small and Wendel 2000a). The AdhA gene in Gossypium is a truly single-copy gene, and this is reflected in the Southern hybridisation results: using a small intron probe, a single band was detected in diploid species screened, and two bands were detected in the allotetraploid species, reflecting the additivity of the loci contributed from their diploid progenitors.

‘AdhB’ in Gossypium, on the other hand, was found to include several closely related sequence types. Again, by using a small intron probe, Southern hybridisation showed a complex banding pattern with several hybridising fragments, despite the fact that PCR-based experiments had isolated only a single AdhB-like sequence type. This conundrum was resolved when independent sequence data from genomic and cDNA clones of Adh sequences from Gossypium were described (Millar and Dennis 1996). The Adh sequences described by Millar and Dennis (1996) were subsequently included in a phylogenetic analysis with Adh sequences isolated via PCR in our laboratory (Small and Wendel 2000a). They were found to be closely related but not orthologous to AdhB, and thus to constitute a small gene subfamily that presumably resulted from recent gene-duplication events. Given these data and evidence of potential interlocus recombination (Millar and Dennis 1996), it was determined that AdhB would be a poor choice for use in phylogenetic studies.

Screening exemplar taxa by Southern hybridisation, while clearly the most efficient approach for counting genes, may not always prove completely effective. While preliminary screening of AdhA in Gossypium indicated that AdhA was a single-copy gene in the three diploid species surveyed (spanning the diversity of Gossypium), a subsequent phylogenetic study using AdhA of the 13 D-genome diploid species showed a previously undetected AdhA gene-duplication event (Small and Wendel 2000b). During the course of data collection for this study, all species surveyed showed little to no sequence heterogeneity, with the exception of the group of four species that constitute Gossypium section Erioxylum subsection Erioxylum. Direct sequencing of PCR products of AdhA (amplified using AdhA locus-specific primers) detected extensive sequence heterogeneity in these four species. Given this information, we conducted Southern hybridisation experiments on these species and found that while all other D-genome species displayed a single AdhA-hybridising fragment, all species of section Erioxylum displayed two hybridising fragments. Apparently, a gene-duplication event confined to these species had occurred, and both loci were being amplified with our PCR reaction conditions.

Finally, the most rigorous approach for demonstrating orthology can be obtained by comparative genetic-mapping studies. In practice, such analyses are restricted to taxa for which genetic-mapping experiments are already underway for other reasons, simply because of the extensive time and effort required for comparative genetic mapping. Nevertheless, it is important to note that comparative mapping studies have been completed for a wide variety of plant lineages (especially crop plants), including, for example, Asteraceae, Brassicaceae, Chenopodiaceae, Fabaceae, Malvaceae, Pinaceae, Poaceae, Rosaceae and Solanaceae. Information from these ‘data rich’ species can be used to bolster inferences of orthology to more distant relatives. While evidence from sequence identity, shared expression patterns and Southern hybridisation analysis can all assist the process of inferring orthology, the retention of shared genomic position among species is arguably the most rigorous evidence of orthology.

Gene families and rate variation among genes

In addition to the requirement of orthology, rate variation is an important consideration when choosing an appropriate gene. Preliminary data from representative species for all potential loci provide the raw data required to inform this decision. Given the observation of extensive rate variation among nuclear genes at all levels (among gene families, among genes within gene families, among regions within genes, among plant lineages: see e.g. Gaut 1998; Gaut et al. 1996; Senchina et al. 2003; Small and Wendel 2000a), the gene selected should provide an appropriate level of sequence variation to answer the question being asked. The appropriate level of variation depends on the level of resolution being sought: inter- or intrafamilial studies may utilise more slowly evolving sequences than studies at the inter- or intraspecific level. Further, the regions of a particular gene that are to be used can vary from question to question: exons are generally easily alignable across wide phylogenetic distances (in many cases even among extant land plants), while introns are often unalignable outside of individual genera. Accordingly, questions focused at higher taxonomic levels may choose to exploit regions with high exon content, while lower-level studies may emphasise intron sequence.
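One simple way to compare candidate regions in a preliminary dataset is the uncorrected pairwise distance (p-distance), the proportion of aligned sites at which two sequences differ. The Python sketch below is a minimal illustration rather than a recommended workflow: the function name `p_distance` and the toy exon and intron sequences are hypothetical, columns containing gaps or ambiguity codes are simply skipped, and no correction for multiple substitutions is applied.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap or an ambiguity code are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared

# Hypothetical aligned fragments from two taxa (invented for illustration).
exon_taxon1, exon_taxon2 = "ATGGCCATTG", "ATGGCTATTG"
intron_taxon1, intron_taxon2 = "GTAAGT-TCA", "GTTAGTCTGA"

exon_d = p_distance(exon_taxon1, exon_taxon2)        # 1 difference in 10 sites
intron_d = p_distance(intron_taxon1, intron_taxon2)  # 2 differences in 9 ungapped sites
print(f"exon p-distance:   {exon_d:.3f}")
print(f"intron p-distance: {intron_d:.3f}")
```

Computed across several candidate loci and representative taxa, such distances give a rough first indication of whether a region is likely to be too conserved or too divergent for the taxonomic level in question.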

While examples of particular plant lineages that have been broadly sampled for nuclear genes are relatively few, the available examples highlight the range in variation (and thus phylogenetic utility) found thus far. Our work in Gossypium has resulted in characterisation of a large number of nuclear genes across the well established phylogeny of the primary genome groups. More than 50 nuclear loci have been sequenced from multiple species, and the range of phylogenetic utility spans from an almost complete lack of variation to highly variable loci that provide robustly resolved and supported depictions of relationships (Small et al. 1998; Cronn et al. 1999, 2002b; Small and Wendel 2000a, 2000b; Senchina et al. 2003).

Other well studied examples include Zea (Gaut and Clegg 1993; Hanson et al. 1996a; Gaut and Doebley 1997; Eyre-Walker et al. 1998; Gaut 1998, 2001; Hilton and Gaut 1998; Zhang et al. 2001), Poaceae (Mathews and Sharrock 1996; Mason-Gamer et al. 1998; Mathews et al. 2000, 2002; Mason-Gamer 2001), Brassicaceae (Galloway et al. 1998; Bailey and Doyle 1999; Bailey et al. 2002), Paeonia (Sang et al. 1997b; Sang and Zhang 1999; Ferguson and Sang 2001; Tank and Sang 2001; Sang 2002), and Glycine (Doyle et al. 1996, 1999a, 1999b, 2000, 2002; Doyle and Doyle 1999). In all of these examples, variation in the relative phylogenetic utility of nuclear genes is evident, again highlighting the need for preliminary studies to determine the most appropriate locus (or loci) for a given question.

Intraspecific variation in nuclear genes

Two features of nuclear genes that require attention in phylogenetic studies are the high probability of allelic variation within and among populations of a species, and alleles that are shared between species. Because of the smaller effective population size of organellar genes, and concerted evolution in nrDNA sequences, allelic variation is often low for these markers. Accordingly, a single or just a few individuals of a species are often sampled as placeholders, with the implicit assumption that all alleles from individuals of that species will be more closely related to each other than they are to any other species. When sampling of individuals is increased, this assumption is often borne out, although exceptions do exist, especially with nrDNA (Suh et al. 1993; Mason-Gamer et al. 1995; Levy et al. 1996; Mayer and Soltis 1999; Hartmann et al. 2001; Mayol and Rossello 2001; Muir et al. 2001). Allelic variation within individuals, populations and species in low-copy nuclear genes, however, can be extensive. This is due to the larger effective population size of nuclear genes than that of organellar genes, the process of allelic recombination and the faster rate of molecular evolution of nuclear genes than that of organellar genes. Evidence of such variation was first recognised in isozyme analyses (Crawford 1985; Gottlieb 1977), where extensive allelic variation was detected in many species, and allelic variation shared among species was also evident. As results from sequencing studies of plant nuclear genes have accumulated, the presence of substantial allelic variation has been confirmed, although the majority of this work has been on model systems such as maize, Arabidopsis and cotton (Miyashita et al. 1996; Gaut and Doebley 1997; Innan and Tajima 1997; Kawabe et al. 1997; Hilton and Gaut 1998; Purugganan and Suddith 1998; Kawabe and Miyashita 1999; Small et al. 1999; Kuittinen and Aguade 2000; Small and Wendel 2002).

Allelic variation within species may take two forms, with different implications for phylogenetic studies. First, irrespective of the extent of allelic variation, if all alleles of a species coalesce within that species (i.e. all alleles of a species are more closely related to each other than they are to any allele of a different species), then allelic variation is irrelevant to the ultimate goal of recovering a species phylogeny. This type of allelic variation may be useful, however, for intraspecific studies, e.g. ascertaining population-level relationships, phylogeography, and studies of rates and patterns of sequence evolution.

A second possibility is that allelic variation spans species boundaries (deep coalescence); i.e. some alleles of a species are more closely related to alleles of other species than they are to those of the same species. A number of population genetic phenomena can give rise to this commonly observed pattern. One primary cause lies in the population genetics of nuclear genes. Because of the greater effective population size and faster mutation rate of nuclear genes relative to organellar genes, coupled with the process of recombination, extensive intraspecific allelic variation is both expected and observed in species with sufficient population sizes. When speciation occurs, it is likely that both descendants of an ancestral species will contain some, if not all, of the allelic variation present in the ancestral species. If the alleles contained in each of the descendant species are reciprocally monophyletic, then alleles will coalesce within species and phylogenetic analysis using any alleles should reflect this history. If, on the other hand, allelic variation in one or both of the descendant species is not monophyletic, then trans-species polymorphism will be observed.
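Whether the alleles of each species coalesce within that species can be checked directly on a rooted gene tree. The Python sketch below is a minimal illustration under stated assumptions: the tree is written as nested tuples, allele labels take the hypothetical form `species_allele`, and the function names `clades` and `species_monophyly` are introduced here for illustration. It tests only whether the sampled alleles of each species form a clade and does not distinguish among the possible causes of non-monophyly.

```python
def clades(tree):
    """Return (leaf set of this subtree, list of leaf sets of all nested subtrees)."""
    if isinstance(tree, str):                 # a leaf is a single allele label
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves = leaves | child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def species_monophyly(tree, species_of):
    """Map each species to True if its sampled alleles form a clade in the rooted tree."""
    all_leaves, found = clades(tree)
    clade_set = set(found)
    by_species = {}
    for leaf in all_leaves:
        by_species.setdefault(species_of(leaf), set()).add(leaf)
    return {sp: frozenset(alleles) in clade_set for sp, alleles in by_species.items()}

def species_from_label(label):
    """Hypothetical labelling convention: 'X_1' is allele 1 of species X."""
    return label.split("_")[0]

# Alleles coalesce within species: both species are monophyletic.
coalesced = (("X_1", "X_2"), ("Y_1", "Y_2"))
# One allele of species X groups with the alleles of species Y.
shared = (("X_1", "X_2"), (("Y_1", "Y_2"), "X_3"))

print(species_monophyly(coalesced, species_from_label))
print(species_monophyly(shared, species_from_label))
```

In the second toy tree, species Y is recovered as monophyletic while species X is not, which is the pattern expected when allelic variation spans a species boundary.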

In exceptional cases, natural selection acts to promote this trans-species polymorphism. This is the case for genes that undergo balancing selection where natural selection acts to promote and maintain allelic variation. Striking examples of this phenomenon are found for genes involved in self-recognition (reviewed in Klein et al. 1998), for example major histocompatibility (Mhc) loci in mammals (Hughes and Yeager 1998) and self-incompatibility (S-genes) in plants (Charlesworth and Awadalla 1998). In the case of S-allele variation, analyses in Solanaceae have revealed allelic variation that exceeds not only species boundaries, but even generic boundaries (Richman et al. 1996; Richman and Kohn 1999). While clearly of interest for molecular evolutionary studies, loci that tend to undergo balancing selection would be poor choices for phylogenetic analyses attempting to infer species histories.


Less dramatic examples of trans-specific polymorphism are also evident for genes in which segregation of ancestral polymorphism may be the sole cause. Several examples from maize show such a pattern. For example, alleles found in Zea mays ssp. mays are found phylogenetically intermingled with alleles from other subspecies of Zea mays or even other species of Zea (Hanson et al. 1996a; Eyre-Walker et al. 1998; Hilton and Gaut 1998; Wang et al. 1999; White and Doebley 1999; Gaut 2001; Zhang et al. 2001). Studies of several nuclear genes in Leavenworthia (Brassicaceae) have shown similar sharing of alleles across species boundaries (Charlesworth et al. 1998; Liu et al. 1998; Filatov and Charlesworth 1999). A study of nucleotide diversity in two pairs of homoeologous Adh loci in allotetraploid Gossypium hirsutum and G. barbadense revealed an unusual pattern of non-coalescence (Small and Wendel 2002). In this particular case, four loci were sampled (two AdhA and two AdhC loci, one from each of the diploid progenitors). For three of four loci, alleles coalesced within species; however, for the fourth locus (AdhC in the D-genome of the tetraploids) alleles found in G. hirsutum were placed phylogenetically into two separate clades. One of these clades included only G. hirsutum sequences, but the second clade included all G. barbadense alleles as well as several G. hirsutum alleles. This observation is significant because, regardless of the underlying cause (non-coalescence or introgression), it was found for only one of four loci, indicating that these population-genetic phenomena can act differentially across loci.

A second cause of trans-species polymorphism is hybridisation and introgression. Hybridisation is a prominent feature of plant population biology, and the subsequent possibility of introgression may be responsible for instances of alleles shared among species. Many well known examples have been described of introgression of chloroplast DNA (‘chloroplast capture’) and many isozyme studies have documented instances of putative introgression of nuclear genes (Rieseberg and Soltis 1991; Rieseberg and Wendel 1993; Rieseberg 1997). Few studies have addressed the prevalence of nuclear-gene introgression, however, despite the power of phylogenetic analysis of variable nuclear genes to illuminate the phenomenon (Sang and Zhang 1999). In the case of trans-specific allelic variation in maize, the variation may reflect a combination of non-coalescence and introgression between subspecies of Zea mays (Hanson et al. 1996a; White and Doebley 1999). The paucity of empirical data may stem from the theoretical difficulty of distinguishing non-coalescence or lineage sorting from introgression, as both processes are expected to result in similar patterns of allele sharing. Arguing for one or the other generally requires independent evidence, e.g. biosystematic evidence that hybridisation is occurring, or geographic evidence that allele sharing occurs only in regions of sympatry between putatively introgressing species (e.g. Doyle et al. 1999a).

Recombination and concerted evolution

An additional feature of nuclear genes that has implications for phylogenetic studies is the potential for recombination both at individual loci (thus generating additional allelic diversity), and more importantly, between paralogous genes. Such recombination can take place either in vivo or in vitro (i.e. PCR-mediated recombination).

Allelic recombination (including both crossing-over and gene conversion) results in alleles that are chimeric between parental alleles. This phenomenon violates the assumption of phylogenetic analysis that relationships among terminals are strictly bifurcating, rather than reticulate. If, however, alleles within species are monophyletic, this phenomenon will not affect reconstruction of species relationships from a gene tree, although it may introduce homoplasy into the phylogenetic analysis (Doyle 1995, 1997). Analytical tools are available for identifying putatively recombinant sequences and their potential parental alleles (Stephens 1985; Sawyer 1989; Jakobsen and Easteal 1996; Grassly and Holmes 1997; Drouin et al. 1999). Thus, while allelic recombination has implications for phylogenetic analysis, it does not necessarily prevent accurate reconstruction of supra-specific phylogenies from gene sequence data.
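The idea of a chimeric sequence can be made concrete with a toy breakpoint scan. The Python sketch below is not an implementation of any of the published methods cited above and provides no test of statistical significance. It assumes three aligned, gap-free sequences (a candidate recombinant and two putative parents, all invented for illustration) and asks which single crossover point lets a mosaic of the two parents match the candidate with the fewest mismatches. The names `mismatches` and `best_mosaic` are hypothetical.

```python
def mismatches(a, b):
    """Count positions at which two equal-length aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def best_mosaic(query, parent1, parent2):
    """Find the single interior breakpoint whose two-parent mosaic best matches the query."""
    n = len(query)
    if n < 2 or not (n == len(parent1) == len(parent2)):
        raise ValueError("need three aligned sequences of equal length (at least 2 sites)")
    best = None
    # Try both orientations: parent1 on the left of the breakpoint, then parent2.
    for left_name, left, right in (("parent1", parent1, parent2),
                                   ("parent2", parent2, parent1)):
        for k in range(1, n):  # sites [0, k) come from `left`, sites [k, n) from `right`
            cost = mismatches(query[:k], left[:k]) + mismatches(query[k:], right[k:])
            if best is None or cost < best["mosaic_mismatches"]:
                best = {"breakpoint": k, "left_parent": left_name,
                        "mosaic_mismatches": cost}
    # Mismatches to each unrecombined parent, for comparison with the mosaic.
    best["parent1_mismatches"] = mismatches(query, parent1)
    best["parent2_mismatches"] = mismatches(query, parent2)
    return best

# Hypothetical alleles; the candidate is built as a chimera of the two parents.
parent1 = "ACGTACGTACGT"
parent2 = "ACCTAGGTTCGA"
candidate = parent1[:6] + parent2[6:]

result = best_mosaic(candidate, parent1, parent2)
print(result)
```

A mosaic that matches the candidate much better than either parent alone is only suggestive of recombination; with real data, the number of informative sites and the possibility of parallel substitutions would need to be taken into account.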

Recombination among paralogous sequences (non-homologous recombination), however, is of greater consequence. This phenomenon may take place only sporadically, or be a general feature of a particular gene or gene family. One noted example is concerted evolution, which tends to homogenise rDNA sequences both within and among rDNA repeats (Zimmer et al. 1980; Arnheim 1983; Baldwin et al. 1995; Elder and Turner 1995). In the particular case of rDNA, which generally consists of thousands of repeats per locus, concerted evolution tends to homogenise repeats such that a single predominant allelic form exists. Concerted evolution appears to be a common feature of highly repetitive nuclear sequences. Low-copy nuclear genes are not free from concerted evolution, however, and examples exist of concerted evolution even among fairly small gene families (e.g. rbcS: Clegg et al. 1997).

The effect of concerted evolution (or any recombination among paralogous sequences) on phylogenetic analysis depends on its extent. As noted by Sanderson and Doyle (1992), if concerted evolution among members of a gene family is absent, then complete sampling of all genes from all species will result in an orthology–paralogy tree (OP tree) in which sequences of orthologous loci are all more closely related to each other than to any paralogous sequence, and phylogenetic relationships within each set of orthologs may be expected to reflect organismal relationships. If, at the other extreme, concerted evolution results in complete homogenisation of members of a gene family (as is often assumed in studies of rDNA), then sampling of any gene of a gene family will result in its correct phylogenetic placement (assuming concerted evolution occurs at a rate faster than speciation). If, however, concerted evolution occurs but is incomplete, then sampled genes may represent a mixture of orthologous and non-homogenised paralogous sequences. Accurate reconstruction of organismal phylogenies from such data is practically impossible (Sanderson and Doyle 1992).

The probability of non-homologous recombination appears to be influenced by several factors. Principal among these may be the degree of sequence similarity among paralogs, and the genomic proximity of paralogs. As recombination is a similarity-driven process, paralogous sequences that are more closely related (recently diverged) are more likely to experience non-homologous recombination. This can create a circular pathway of recombinational events: after duplication, paralogous sequences diverge in sequence, but if they retain sufficient similarity, non-homologous recombination may result in homogenisation of the paralogs, which leads to greater sequence similarity, which can then promote further inter-locus recombination. Genomic proximity may also play an important role in the probability of non-homologous recombination. Because recombination occurs primarily between homologous chromosomes, paralogous loci that are closely linked on a chromosome may be more likely to undergo non-homologous recombination than paralogous loci that are located on separate chromosomes.

Finally, the methodological concern of PCR-mediated recombination must be addressed, as most gene sequences used in molecular phylogenetic studies are generated via PCR. PCR-mediated recombination is a well characterised phenomenon (Myerhans et al. 1990; Bradley and Hillis 1997; Cronn et al. 2002a) that results from either template-switching during PCR or from incompletely extended copies from one locus serving as a primer for subsequent extension from a paralogous locus. This is a notable problem for analyses of low-copy nuclear genes because studies often utilise general or universal PCR primers that amplify multiple loci, especially during preliminary stages of a study, prior to the development of locus-specific primers (Sang et al. 1997b; Evans, Alice et al. 2000; Small and Wendel 2000a; Cronn et al. 2002a). Similar to in vivo non-homologous recombination, the propensity for PCR-mediated recombination may depend on the degree of sequence similarity among paralogs as well as PCR reaction conditions. Specifically, as noted by Cronn et al. (2002a), annealing temperature, amplicon length and extension time are factors that may be optimised to avoid PCR-mediated recombination.

Procedure to determine appropriate nuclear genes for phylogenetic analysis

When planning a phylogenetic study that includes nuclear genes, a generalised protocol would entail several sequential steps (Fig. 2). These include selecting candidate genes and representative taxa for a preliminary study, isolating candidate genes from representative taxa, assessing orthology among sequences isolated from representative taxa, assessing relative rates of sequence evolution in order to choose among potential loci and finally, generating sequences from the taxa of interest from the chosen loci. In an effort to facilitate a general application of these sequential experimental necessities, we will use as an example our previous investigations of the alcohol dehydrogenase (Adh) gene family in Gossypium (Small and Wendel 2000a).

Selection of candidate genes

One of the great advantages of using nuclear genes for phylogenetic analysis is the practically unlimited number of genes from which to choose (e.g. the Arabidopsis thaliana nuclear genome contains about 26000 genes: The Arabidopsis Genome Initiative 2000). While there clearly are differences among genes and gene families in their potential phylogenetic utility, the huge number of possible genes from which to choose indicates that a multitude of genes exist that will be useful in any given plant lineage at any given phylogenetic level. It is important to note that there is no a priori reason to expect that any particular gene or gene family will be universally useful at any given phylogenetic depth because of the vagaries of the evolutionary dynamics of gene families. However, it is even more important to note that with a reasonable investment in characterisation, almost any gene will prove to be useful at some level. Thus, there is no reason to limit explorations to those genes that have previously been shown to be useful in other plant groups. To paraphrase and emphasise this point, there is nothing ‘special’ in the attributes of commonly used genes such as Adh, particularly inasmuch as relatively unexplored genes (Liu et al. 2001; Malcomber 2002; Wendel et al. 2002) and even anonymous nuclear loci (Blake et al. 1999; Cronn et al. 2002b) have proven phylogenetically useful.

Given the diversity of genes from which to choose, where should one start? One logical place is with an assessment of genes that have been used by previous workers in taxa related to the group of interest. While the list continues to grow, at present a relatively small number of gene families have been widely investigated for their phylogenetic utility in plants (Sang 2002). These include Adh (alcohol dehydrogenase: e.g. Gaut and Clegg 1991, 1993; Morton et al. 1996; Sang et al. 1997b; Small et al. 1998; Ge et al. 1999; Small and Wendel 2000a, 2000b), G3PDH (glyceraldehyde 3-phosphate dehydrogenase: e.g. Olsen and Schaal 1999; Olsen 2002), GBSSI (granule-bound starch synthase: e.g. Mason-Gamer et al. 1998; Evans et al. 2000; Mason-Gamer 2001; Evans and Campbell 2002; Small 2003), MADS-box genes (e.g. pistillata, apetala1, apetala3, leafy: e.g. Bailey and Doyle 1999; Barrier et al. 1999; Bailey, Price et al. 2002), PHY (phytochrome: e.g. Mathews and Sharrock 1996; Lavin et al. 1998; Simmons et al. 2001; Mathews et al. 2002) and PGI (phospho-glucose isomerase: e.g. Gottlieb and Ford 1996; Filatov and Charlesworth 1999; Ford and Gottlieb 1999, 2002). Many of these genes have been investigated in a wide array of taxa, including most major angiosperm groups. In addition, however, many model plants, including most major crops, have associated with them extensive libraries of gene sequences for hundreds if not thousands of genes. This wealth of systematically useful information has been generated as a consequence of the many ‘genome projects’ being undertaken worldwide.

A first test of potential utility of a given gene or gene family, then, is whether or not it proved useful in other systems at a phylogenetic depth similar to that being attempted (e.g. interspecific, intergeneric, interfamilial). A second advantage of a literature survey is that papers will generally describe the methodology used to isolate specific genes and include PCR-amplification primers. These may include general primers that were used for preliminary investigations, as well as locus-specific primers that were used to amplify single members of a gene family. At the time we began our investigations of Gossypium nuclear genes, the Adh gene family was the most widely investigated nuclear-gene family in plants, making it a good candidate for further study. Additionally, previous isozyme studies in Gossypium had shown that ADH protein products (isozymes) were variable among Gossypium species.

Fig. 2. Generalised protocol describing the steps necessary in foundational studies of the potential phylogenetic utility of nuclear genes.

Primary sources of candidate genes are the publicly available DNA-sequence databases (GenBank, United States National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/; EMBL, European Molecular Biology Laboratory, European Bioinformatics Institute: http://www.ebi.ac.uk/embl/index.html; DDBJ, DNA Data Bank of Japan: http://www.ddbj.nig.ac.jp/). DNA sequences deposited in any of these databases are ultimately shared among all three, thus searching through one database generally is sufficient, and because of our familiarity with GenBank, it will be used as an example in the following discussion. Searching for candidate genes by using the DNA sequence databases can be conducted in several ways. First, taxonomic queries can be made at any level of the taxonomic hierarchy through the Taxonomy section of GenBank (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). Note here that GenBank uses a specific taxonomic hierarchical nomenclature for plants that follows recent phylogenetic studies (APGII 2003), and which must be followed for successful searches. Within angiosperms, groups are classified not only by traditional ranks (e.g. subclasses, orders, families), but also by rankless but well accepted groups (e.g. eudicotyledons, eurosid I). If searches for a particular genus or species do not result in any sequences of interest, the next higher level of the taxonomic hierarchy can be investigated.

Queries can be made by gene name in the search window of the GenBank homepage, if a particular gene family is of interest. Again, nomenclature is an issue that must be dealt with, given the variable names used for some genes. For example, searches for ‘alcohol dehydrogenase AND Magnoliophyta’ result in matches not only for medium-chain alcohol dehydrogenase sequences that have been the primary focus of molecular systematic and molecular evolutionary studies, but also matches for short-chain alcohol dehydrogenases, cinnamyl alcohol dehydrogenases, unknown mRNA and genomic clones with some sequence similarity to alcohol dehydrogenase genes. Further, some genes may be deposited in GenBank as ‘alcohol dehydrogenase’, while others may be deposited as ‘Adh.’ Multiple searches using different combinations of gene and taxon names may be necessary to identify candidate genes.
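Because a gene may be deposited under several names, it can help to assemble the alternative names into a single Boolean query before searching. The following is a minimal sketch of that idea, not a prescribed search strategy. The function name `build_entrez_query` is hypothetical, and the choice of the `[Title]` and `[Organism]` field tags (which follow NCBI Entrez query syntax) is an assumption about which fields are most informative for a given gene.

```python
def build_entrez_query(gene_names, taxon):
    """Combine alternative gene names with a taxon restriction.

    gene_names: synonyms under which the gene may have been deposited.
    taxon: any level of the taxonomic hierarchy used by GenBank.
    The [Title] and [Organism] field tags are an assumed choice of fields.
    """
    if not gene_names:
        raise ValueError("at least one gene name is required")
    # Quote each synonym so that multi-word names are searched as phrases.
    name_terms = ['"{}"[Title]'.format(name) for name in gene_names]
    return "({}) AND {}[Organism]".format(" OR ".join(name_terms), taxon)


query = build_entrez_query(["alcohol dehydrogenase", "Adh"], "Magnoliophyta")
print(query)
```

The resulting string can be pasted into the GenBank search window. If a search at one taxonomic level returns nothing of interest, the same names can be reused with the next higher taxon.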

A final method of searching GenBank that can be especially useful is BLAST searching (Basic Local Alignment Search Tool: http://www.ncbi.nlm.nih.gov/BLAST/). This search tool compares an input sequence to all of the sequences in GenBank and identifies those sequences that have high sequence similarity to the input sequence. One of the advantages of BLAST searching is that results are retrieved without prior knowledge of gene name or taxonomic rank. If preliminary sequence data are available for some taxon of interest (e.g. from a genome project), then BLAST searches can identify sequences in GenBank that are closely related to the sequence of interest, which can then provide useful information for primer design, identification of gene structure (exons and introns) and preliminary phylogenetic analyses. In addition to searches in the general GenBank database, searches can be made through specific sequence sets such as ESTs (expressed sequence tags) and STSs (sequence-tagged sites) that are not included in the general nucleotide database searches, as well as with specific whole genomes (e.g. Arabidopsis thaliana and Oryza sativa, the only plants with complete genome sequences). Given the large number of genome sequencing, EST and STS projects currently underway for various organisms (see NCBI’s Plant Genomes Central: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html), these searches may be especially useful for identifying candidate genes.

Selection of representative taxa

When screening candidate genes for potential phylogenetic utility, one efficient approach is to select several taxa that represent the range of diversity that is ultimately to be included. This approach allows the preliminary study to be relatively contained in terms of the number of sequences that must be generated, while still providing enough data to make an informed decision with respect to phylogenetic utility. Preliminary studies should include a minimum of two taxa, as this is required to understand relative levels of divergence. Preferably, up to five taxa should be included to provide a more comprehensive overview of potential phylogenetic utility. The specific choice of taxa will depend on the amount of preliminary data from other sources (taxonomy, biosystematic studies, other phylogenetic hypotheses) that is available. For example, if a study is to investigate relationships among species within a genus, then species representing the major groups within that genus should be selected on the basis of, for example, taxonomic classification of species into subgenera or sections, cytogenetic data grouping species into genome groups, or previous phylogenetic hypotheses.

A second criterion that should be used when considering choice of taxa to use in pilot studies is ploidy level. Given the added layer of sequence complexity that accompanies polyploidy, it is prudent to select taxa at the lowest possible ploidy level for initial investigation. As ploidy level increases, gene family size for any given gene also increases, making it more challenging to confidently identify orthologous sequences. Multiple homeologous loci (orthologous loci donated by the progenitors of the allopolyploid) are expected to be found in allopolyploids, which must be distinguished from paralogous sequences.

Third, taxa for which significant background information is available are better choices for preliminary studies than relatively unknown taxa. Such background information may include ploidy level (see above), cytogenetic characterisation, previous knowledge of phylogenetic relationships, and/or previous molecular biological studies. All of this information may help place preliminary molecular phylogenetic data into a useful organismal and molecular framework, which may then guide further experimental choices.

Finally, representative taxa for which sufficient quantities of high-quality template DNA are available should be chosen. Preliminary studies may involve numerous PCR experiments to find optimal PCR conditions, and successful PCR amplification is often dependent on DNA quality. Additionally, if Southern hybridisation experiments are to be performed (see below), large quantities (5–10 µg per digestion) of restriction enzyme-digestible DNA must be available.

For our studies of Adh in Gossypium we chose three representative taxa for our initial studies. There are ~50 species of Gossypium distributed circumtropically with a well developed infrageneric classification system (Fryxell 1968, 1979, 1992). There are three primary centers of diversity in Gossypium: Australia, Africa and the New World. Additionally, previous cytogenetic studies in Gossypium (reviewed in Endrizzi et al. 1985) had identified eight different genome groups, and phylogenetic studies based on cpDNA restriction-site data (Wendel and Albert 1992) had identified major clades. On the basis of this wealth of preliminary data we chose three species representative of the major groups: G. robinsonii (Australian C-genome), G. herbaceum (African A-genome), and G. raimondii (New World D-genome). Although Gossypium includes both diploid and allotetraploid species, the representative taxa chosen were all diploids to simplify estimates of gene copy number. It should be noted that while the amount of preliminary data available in Gossypium facilitated our selection of representative taxa, any one of the above sources of information (taxonomy, distribution, cytogenetics, previous phylogenetic hypothesis) alone would have provided sufficient information to make an informed decision on which taxa to include.

Isolation of candidate genes from representative taxa

Once candidate genes and exemplar taxa have been chosen, the next step is to generate preliminary sequence data. This step generally is accomplished via PCR amplification using either general gene family primers or locus-specific primers. The choice of approach is dictated by the amount of preliminary data available to the investigator.

If locus-specific primers are available (i.e. have been designed by other research groups), then PCR amplification and sequencing can be relatively straightforward. It is necessary, however, to remain mindful of complications that may arise from heterozygosity, particularly if alleles differ in length. If, on the other hand, little or nothing is known about a given candidate gene from the taxa of interest, an approach using general primers is necessary. This approach involves either development of general gene-family primers or use of primers developed by others. Such primers usually are designed by comparing available sequences of members of the gene family (e.g. downloaded from GenBank) and placing primers in regions of high sequence conservation (e.g. highly conserved exon sequences). General primers can be designed to incorporate some polymorphism within the primer if necessary, and it is useful to be aware of codon structure of the exons where primers are to be placed. As the 3′ end of the primer, and especially the last 3′ nucleotide, is the most important in terms of homology with the target, the 3′ end of the primer should be placed at a second codon-position nucleotide, as these are likely to be the most highly conserved. General primers have been developed for a number of gene families, including Adh, alcohol dehydrogenase (Sang et al. 1997b; Gaut et al. 1999; Small and Wendel 2000a), GBSSI, granule-bound starch synthase I (Mason-Gamer et al. 1998; Evans et al. 2000), GS, glutamine synthetase (Emshwiller and Doyle 1999) and PHY, phytochrome (Mathews and Sharrock 1996). In addition, the publication of Strand et al. (1997) gives primers for a number of different nuclear-gene families.
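The two design considerations above (incorporating polymorphism into a general primer, and ending the primer on a second codon position) can be illustrated with a short sketch. It assumes a gap-free, in-frame alignment of a conserved exon region from several gene-family members. The three input sequences are invented for illustration, and the function names are hypothetical. Polymorphic columns are encoded with the standard IUPAC ambiguity codes.

```python
# IUPAC ambiguity codes, keyed by the set of bases observed in an alignment column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def degenerate_primer(aligned_exons):
    """Collapse aligned, equal-length, gap-free sequences into one degenerate primer."""
    columns = zip(*(seq.upper() for seq in aligned_exons))
    # A gap or non-ACGT character raises KeyError, which flags an unsuitable region.
    return "".join(IUPAC[frozenset(column)] for column in columns)


def degeneracy(primer):
    """Number of distinct oligonucleotides represented by a degenerate primer."""
    bases_per_code = {code: len(bases) for bases, code in IUPAC.items()}
    fold = 1
    for code in primer:
        fold *= bases_per_code[code]
    return fold


def three_prime_codon_position(cds_offset, primer_length):
    """Codon position (1, 2 or 3) of the last base of a forward primer.

    cds_offset is the 0-based position of the primer's first base in the
    coding sequence, counted from the first base of a codon.
    """
    return (cds_offset + primer_length - 1) % 3 + 1


# Invented exon fragments, starting at the first base of a codon.
exons = ["ATGGCTACTGG", "ATGGCCACTGG", "ATGGCAACAGG"]
primer = degenerate_primer(exons)
print(primer, degeneracy(primer), three_prime_codon_position(0, len(primer)))
```

Here the two variable third-position columns are absorbed into ambiguity codes, and the primer ends on a second codon position. In practice the degeneracy would be kept low, since highly degenerate primers dilute the concentration of any one matching oligonucleotide.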

Once primers have been designed or obtained, initial PCR optimisation experiments should be performed. Because general primers have not been designed to be locus-specific, the number and size of PCR amplicons from a given taxon cannot be predicted a priori. The efficiency of amplification of various loci may be affected by a number of factors, including template DNA quality, efficiency of the PCR polymerase (Taq, or other available polymerases), primer annealing temperature, MgCl2 concentration and the use of other PCR additives. Relative to other loci amplified for molecular phylogenetic studies (e.g. cpDNA, rDNA), nuclear genes exist in much lower copy number and hence are more difficult to amplify. In many cases, significant experimental effort is required to optimise conditions.

Template DNA quality is important, especially when using general primers that may not match the template exactly. Optimal tissue for DNA extraction may vary across lineages. In our experience in Malvaceae, DNA extracted from fresh, young leaf material generally provides the best-quality DNA. Other researchers (personal communication from a reviewer) have found that DNA extracted from properly processed silica gel-dried material also provides high-quality DNA. In addition, in some groups very young leaves may contain PCR-inhibiting compounds, indicating that mature leaves may be more suitable tissue sources for DNA extraction (personal communication from a reviewer). While typical CTAB DNA isolation procedures (e.g. Doyle and Doyle 1987, 1990) often result in amplifiable DNA for high-copy templates, these DNAs may retain sufficient impurities to inhibit the more selective amplification of low-copy nuclear genes. In these cases, the use of DNA extraction kits (e.g. Plant DNeasy, Qiagen) that produce cleaner DNAs may be indicated. Template concentration in a PCR reaction may also be important, given the lower copy number of nuclear genes; low-concentration (<20 ng/µL) DNAs that have been adequate for amplification of cpDNA or rDNA loci may be insufficient for amplification of nuclear genes.

The choice of amplification polymerase may also affect efficiency or even ability to PCR amplify some nuclear genes. In our laboratories we have used a wide variety of polymerases from various suppliers and found significant variation in the ability of these enzymes to amplify low-copy nuclear sequences. This is, unfortunately, an example of ‘you get what you pay for.’ While some lower-cost polymerases may be sufficient for amplification of high-copy templates, they may be insufficient for amplification of low-copy sequences.

In addition to the specific polymerase used, other PCR-reaction conditions will affect the ability to amplify low-copy templates. Specifically, annealing temperature, MgCl2 concentration and addition of PCR additives may need to be adjusted to optimise amplification conditions. The interplay between primer-annealing temperature and MgCl2 concentration is especially important, since increasing annealing temperature and decreasing MgCl2 concentration both result in more specific amplification. Because multiple paralogous copies of most genes are expected to be amplified with a general primer set, it may be desirable in some cases to amplify multiple members of a gene family for initial characterisation. At low annealing temperatures and high MgCl2 concentrations a large number of bands often are observed, many of which may turn out to be non-specific amplification products (i.e. not the target sequence). Alternatively, at high annealing temperatures and low MgCl2 concentrations, few or no bands may be observed because of the high stringency of the reaction conditions. A combined optimisation approach that evaluates a range of annealing temperatures (e.g. ± 5°C of the theoretical melting temperature of the primer pair) and MgCl2 concentrations (e.g. 1.5–3.0 mM final concentration) will provide an estimate of the combination of conditions under which a reasonable number of bands is amplified (Fig. 3). If a gradient thermal cycler is available, these optimisation experiments can be conducted in a single run by using a single DNA in one set of reactions, varying both the MgCl2 concentration and the temperature gradient. Finally, a number of PCR additives have been suggested to improve amplification. While we have not exhaustively screened these additives, our experience of amplifying low-copy loci from a variety of seed plants indicates that the addition of bovine serum albumin (BSA; final concentration of ~0.2 µg/µL) significantly improves amplification from problematic DNA.
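The combined optimisation described above amounts to laying out a grid of annealing temperatures and MgCl2 concentrations around a theoretical melting temperature. The sketch below is one way to enumerate such a grid. It makes several assumptions that are not part of the protocol: the melting temperature is estimated with the simple Wallace rule (2°C per A or T, 4°C per G or C, a rough approximation intended for short oligonucleotides), the grid is centred on the lower of the two primer estimates, and both primer sequences are invented.

```python
def wallace_tm(primer):
    """Approximate melting temperature (°C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def optimisation_grid(forward, reverse, mg_mM=(1.5, 2.0, 2.5, 3.0), span=5, step=1):
    """List (annealing temperature, MgCl2 mM) pairs to test.

    Temperatures run from span below to span above the lower primer Tm;
    centring on the lower Tm is an assumption, not a rule from the text.
    """
    tm = min(wallace_tm(forward), wallace_tm(reverse))
    temperatures = range(tm - span, tm + span + 1, step)
    return [(t, mg) for mg in mg_mM for t in temperatures]


# Invented 18-mer primers, used only to exercise the functions.
forward = "ATGGCTACTGGTAAGGTC"
reverse = "CCAGCATCACCAACATTG"
grid = optimisation_grid(forward, reverse)
print(wallace_tm(forward), wallace_tm(reverse), len(grid))
```

With a gradient thermal cycler, each MgCl2 concentration corresponds to one row of tubes spanning the temperature gradient, so the whole grid fits in a single run.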

If locus-specific PCR primers are used to amplify the genes of interest and a single homogeneous PCR product is amplified, these products may be directly sequenced. More often, however, especially when universal primers are used, multiple PCR products that may vary in size are obtained. To isolate individual PCR products for sequencing, cloning of these PCR products into an appropriate plasmid is generally required (e.g. Promega’s pGEM-T). Alternatively, if multiple bands are widely spaced in size and can be excised individually from a gel, these products can then be directly sequenced. An exception to these generalisations is the study of highly heterozygous outcrossing plants. For example, analysis of four low-copy loci in conifers (R. Cronn, unpubl.) has revealed that heterozygosity is nearly universal across loci and species and that length polymorphism is common within intron regions. In these more challenging cases, cloning may be necessary. One final alternative to cloning is the design of allele-specific PCR-amplification and sequencing primers (e.g. Rauscher et al. 2002) if sufficient preliminary sequence data are available.

Fig. 3. Gel photo demonstrating the effect of MgCl2 concentration and annealing temperature on the amplification of nuclear genes from genomic DNA. A single genomic DNA (Gossypium raimondii) was amplified with universal Adh primers across a range of annealing temperatures (46–64°C), with either 3.0 or 2.0 mM MgCl2 final concentrations. As annealing temperature increases, the number of amplified products decreases. As the MgCl2 concentration increases, the number of amplified products increases.

Once PCR products have been ligated into a plasmid and transformed into competent E. coli cells, the next decision that must be made is to choose how many transformants to screen. At this point in a preliminary study thoroughness is more important than speed, and although these screening steps are often laborious and time-consuming, they increase the probability of completely sampling the pool of PCR products. The number of colonies to pick depends on the initial starting PCR product. If only one or two unique PCR products are expected in a pool of transformants, then a smaller number of colonies can be screened, whereas if several unique PCR products are expected, a greater number of colonies must be screened to ensure identifying all products. As a starting point, we have found that picking 20 colonies is reasonable if a small number of PCR products is expected, and 40 or more if a larger number is expected. Apparent transformants (i.e. white colonies in systems using blue–white selection criteria with X-gal and IPTG) can quickly and easily be screened for inserts via PCR by using the following procedure.

Individual colonies are picked from plates with a pipet tip and suspended in 10 µL of H2O in a numbered microcentrifuge tube. Once the colony is resuspended by pumping the pipettor several times, the pipet tip is applied to a fresh LB-agar–ampicillin plate that has a numbered grid which corresponds to the numbers on the microcentrifuge tubes. This results in (1) a suspension of bacterial colonies in the microfuge tube and (2) a grid plate with corresponding bacterial colonies that (after overnight culture) provide cells for plasmid minipreps. The suspended cells are then boiled for 10 min to lyse cells and release plasmid into the supernatant. After centrifugation (1 min, maximum speed) to pellet the cell debris, 1 µL of this suspension can then be used as a template in a small-volume (e.g. 10 µL) PCR reaction using either the gene-specific or vector-specific primers to screen the colonies for presence of an appropriate insert. Test gels may be run using high-percentage agarose gels (e.g. 2% or higher) to help distinguish among PCR products of slightly different lengths.
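The question of how many colonies to carry through this screen can be framed as a sampling problem. As a rough guide only, the sketch below computes the probability that every distinct PCR product is represented at least once among the colonies picked, using inclusion-exclusion. It assumes that all products are equally frequent among transformants, which is optimistic: amplification and cloning biases will usually make some products rarer, so the values should be read as upper bounds on the true probability. The function names are hypothetical.

```python
from math import comb


def prob_all_sampled(n_colonies, n_products):
    """Probability that n_colonies picks include every one of n_products.

    Assumes each colony carries one product drawn with equal probability,
    an idealisation that ignores amplification and cloning bias.
    """
    k = n_products
    return sum(
        (-1) ** j * comb(k, j) * ((k - j) / k) ** n_colonies
        for j in range(k + 1)
    )


def colonies_needed(n_products, target=0.95):
    """Smallest number of colonies reaching the target probability."""
    n = n_products
    while prob_all_sampled(n, n_products) < target:
        n += 1
    return n


for products, colonies in [(2, 20), (5, 40)]:
    print(products, colonies, round(prob_all_sampled(colonies, products), 6))
```

Under the equal-frequency assumption, 20 colonies for two products and 40 colonies for five products both give a probability above 0.999, which leaves a margin for the unequal representation expected in real pools.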

A test gel of the colony PCR results will indicate (1) how many of the apparent transformants contain the PCR products of interest and (2) the relative sizes of the inserts. If different-sized PCR products were ligated and transformed in the same reaction, this screen will provide quick identification of which colonies contain different-length PCR products. Because a single-sized PCR product may contain amplicons from more than one gene or multiple alleles of a single locus, a secondary screening using restriction-enzyme digestion (typically frequently cutting enzymes with 4-bp recognition sequences) can be conducted to differentiate among plasmids with identically sized inserts. Ideally, this two-step screening procedure will result in the identification of unique sets of plasmids: a set for each size class, and subsets of plasmids with the same size, but different restriction digestion profiles. Once these sets have been identified, one to several plasmids from each set should be screened by sequencing. Although the foregoing pre-screening procedures may be skipped, the only way to be sure that all variants present are sampled is to sequence all of the colonies.
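The bookkeeping behind the two-step screen can be sketched as a simple grouping exercise: clones are first classed by insert size and then subdivided by restriction profile, and a few clones per class are set aside for sequencing. The clone identifiers, insert sizes and fragment sizes below are invented, and the function names are hypothetical. Exact matching of sizes is a simplification, because sizes read from a gel are estimates and would in practice be binned with some tolerance.

```python
from collections import defaultdict


def screen_classes(clones):
    """Group clones by insert size, then by restriction-fragment profile.

    clones maps a clone id to (insert size in bp, fragment sizes in bp).
    """
    classes = defaultdict(list)
    for clone_id, (insert_bp, fragments) in sorted(clones.items()):
        # Sort the fragments so that the order in which bands were recorded is irrelevant.
        key = (insert_bp, tuple(sorted(fragments)))
        classes[key].append(clone_id)
    return dict(classes)


def pick_for_sequencing(classes, per_class=2):
    """Choose up to per_class clones from every size/profile class."""
    return {key: ids[:per_class] for key, ids in classes.items()}


# Invented screening results for five colonies.
clones = {
    "c01": (950, (400, 550)),
    "c02": (950, (550, 400)),
    "c03": (950, (300, 650)),
    "c04": (1200, (1200,)),
    "c05": (950, (400, 550)),
}
classes = screen_classes(clones)
chosen = pick_for_sequencing(classes)
print(chosen)
```

In this invented example the 950-bp size class splits into two restriction profiles, so three classes in total are carried forward to sequencing.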

Template DNA for sequencing from individual colonies may be obtained either by PCR amplification from the boiled colonies or by miniprep isolation of the plasmid DNA either by using published protocols (Ausubel et al. 1992; Sambrook et al. 1989) or commercially available kits. We suggest that initial sequencing efforts utilise the plasmid-specific primers that generally flank the cloning site of the plasmid (e.g. in Promega’s pGEM-T plasmid primer sites for the T7 and SP6 promoters are available about 50 bp on either side of the cloning site) as this allows sequencing reads through the PCR-primer sites and into the PCR product itself. Depending on the size of the PCR product, internal sequencing primers may be necessary to fully sequence a given plasmid. Such primers usually are easily designed to work across a gene family by placing them in conserved regions of exons (see below), and if necessary, incorporating some ambiguity into the primer.

Preliminary characterisation of genes isolated from representative taxa

Gene structure tends to be fairly conserved across genes of a given gene family. For example, plant Adh genes generally have a 10 exon/9 intron structure. There are examples of introns that have been lost from some lineages, for example the Arabidopsis thaliana Adh gene (Chang and Meyerowitz 1986) and Gossypium AdhA (Small and Wendel 2000a), but most genes retain the 10 exon/9 intron structure, and the intron losses are easily discerned through sequence alignment. Identification of exon and intron regions can be accomplished through alignment of sequences obtained from the taxa of interest with previously published sequences downloaded from GenBank. Gene sequences deposited in GenBank usually have the exon/intron structure designated, and individual exon sequences can be isolated and aligned individually with the preliminary data. In addition, the presence of the highly conserved 5′ ‘GT’ and 3′ ‘AG’ intron-boundary dinucleotides can be used to confirm the start and end points of an intron. Thus, through sequential alignment of individual exons and confirmation of intron-boundary sequences, the exon/intron structure of preliminary sequence data can be defined (Fig. 4). Once these steps are accomplished for the exemplar taxa, the next logical step is to assess copy number and orthology of the isolated sequences.
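Once provisional exon and intron coordinates have been read off an alignment, the boundary check described above is easy to automate. The following minimal sketch tests each proposed intron for the 5′ GT and 3′ AG dinucleotides and then joins the remaining exons. The sequence and coordinates are invented, the function names are hypothetical, and coordinates are assumed to be 0-based with an exclusive end. A small minority of introns use other boundary dinucleotides, so a failed check is a prompt to re-examine the alignment rather than proof of an error.

```python
def check_intron_boundaries(genomic, introns):
    """Report whether each proposed intron begins with GT and ends with AG.

    introns is a list of (start, end) pairs, 0-based, end exclusive.
    """
    genomic = genomic.upper()
    report = []
    for start, end in introns:
        intron = genomic[start:end]
        canonical = intron.startswith("GT") and intron.endswith("AG")
        report.append((start, end, canonical))
    return report


def splice_exons(genomic, introns):
    """Remove the proposed introns and return the joined exon sequence."""
    pieces, position = [], 0
    for start, end in sorted(introns):
        pieces.append(genomic[position:start])
        position = end
    pieces.append(genomic[position:])
    return "".join(pieces)


# Invented example: exon "ATGGCT", intron "GTAAGTTTCAG", exon "ACTGGTAAG".
genomic = "ATGGCTGTAAGTTTCAGACTGGTAAG"
print(check_intron_boundaries(genomic, [(6, 17)]))
print(splice_exons(genomic, [(6, 17)]))
```

The spliced sequence can then be translated or aligned against a cDNA to confirm that the reading frame is maintained across the proposed junctions.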

Copy number and orthology assessment

As noted earlier in this review, several criteria can be used as evidence of orthology. In practice, the two most useful and efficient approaches are phylogenetic analysis and Southern hybridisation analysis, the former to infer orthologous relationships among sequences, and the latter to confirm that the inference of orthology is not confounded by the presence of multiple, closely related paralogs.

Once sequence data have been collected for representative taxa, sequences should be trimmed to include only exons and aligned with genes from related plants (exons only because introns are generally unalignable outside of closely related species). Phylogenetic analyses of these data produce a hypothesis of relationships among genes, and inferred clades may represent orthologous sequence groups. For example, in our studies of Gossypium Adh we initially isolated several Adh sequence types from three representative species. Putative exons were inferred by comparison with exons of a Solanum Adh sequence from GenBank. A large dataset of angiosperm Adh exon sequences from GenBank was then compiled and subjected to phylogenetic analysis. The resulting phylogenetic tree (Fig. 5) grouped the Gossypium Adh sequences into five primary clades that included sequences from the representative species. These clades were inferred to represent orthologous genes, which consequently were named AdhA–AdhE (Fig. 5). The inference of orthology was based on the fact that for each of the named genes, sequences of each type were isolated from each of the representative species, and in each case these sequences formed monophyletic groups. Note that the Gossypium Adh genes are found in two disparate parts of the angiosperm Adh tree: AdhA, AdhB and AdhC in one clade, and AdhD and AdhE in a separate clade. These data indicate an ancient gene duplication giving rise to these two primary clades, with more recent gene duplications giving rise to the genes within each clade.

Once these preliminary phylogenetic analyses had been performed, examination of the entire sequences (exons + introns) further supported the inference of orthology of the genes. Intron sequences were easily alignable within putative orthologs, but generally unalignable between different putative ortholog groups. Furthermore, two introns that are present in most other angiosperm Adh sequences were absent from the AdhA sequences isolated from all representative species, which supported the inference of orthology among these sequences. Thus, all available phylogenetic and gene structure data corroborated the orthology of the named Adh genes.

Fig. 4. Representation of the relationship among gene sequences derived from cDNA clones, individual exon sequences and genomic DNA. (A) Diagrammatic representation of an alignment of cDNA, exon and genomic DNA sequences from a hypothetical gene with five exons (shaded and numbered boxes) and four introns (lower-case letters). (B) Alignment of a portion of Adh cDNA (Solanum tuberosum Adh1, GenBank M25154), individual exons (Zea mays Adh1, GenBank AF123535) and genomic DNA (Gossypium raimondii AdhA, GenBank AF182116). Note the conservation of the intron boundary dinucleotides (5′ ‘GT’ and 3′ ‘AG’).


To ensure that the named genes were unique, and that closely related paralogous genes did not exist, we then conducted Southern hybridisation analysis. Because introns are unalignable between genes and exons are fairly divergent (generally 20–30%, except for AdhD/AdhE which are only about 10% divergent: Small and Wendel 2000a) we designed hybridisation probes that included both intron and exon sequence. Probes were small (about 500 bp) and included the majority of intron 3 + exon 4 of the Adh genes. Individual probes were obtained by PCR for each of the Adh genes and used in high-stringency Southern hybridisations (see Small and Wendel 2000a for detailed hybridisation conditions).

Fig. 5. Strict consensus of seven equally parsimonious trees resulting from maximum parsimony analysis of Adh exon sequences from a wide range of angiosperms (rooted with the gymnosperm Pinus), including representative Gossypium sequences. Bootstrap values (1000 replicates) greater than 50% are shown above each branch. For each Gossypium locus (AdhA–AdhE), sequences were obtained from each of three representative diploid taxa denoted by their genome affiliations (A = G. herbaceum or G. arboreum, C = G. robinsonii, D = G. raimondii). Each inferred Gossypium Adh locus includes all three representative taxa, is monophyletic, and is supported by a 100% bootstrap value.


The results of these experiments in some cases confirmed the uniqueness of putative orthologs, and in other cases refuted those inferences. For example, for most Gossypium Adh genes, a single hybridising band was seen for the diploid species and two hybridising bands in the tetraploid, as expected. For AdhB, however, multiple hybridising bands were observed in all species, indicating additional unsampled AdhB-like genes in the Gossypium genome. Subsequent publication of Adh sequences from Gossypium genomic and cDNA libraries (Millar and Dennis 1996) corroborated this inference. Phylogenetic analysis of the sequences published by Millar and Dennis (1996) revealed that they were closely related to the AdhB sequences isolated via PCR in our study. Orthology–paralogy relationships among these AdhB and AdhB-like genes were not clear on the basis of either phylogenetic analysis or sequence alignment; thus, it was concluded that these particular genes would be poor choices for further phylogenetic studies.

In sum, the combination of phylogenetic and Southern hybridisation analyses allowed rigorous inference of orthology (or lack thereof) among specific sequence types isolated from the representative Gossypium species surveyed in our preliminary studies. This in turn allowed us to choose from among the isolated sequences those that were potentially phylogenetically useful.

Choosing among potential sequences

Once orthologous and potentially useful genes have been identified, evaluation of rates of sequence evolution can provide insight into the relative phylogenetic utility of various candidate genes. For example, in a recent study of Gossypium phylogeny (Cronn et al. 2002b), 11 low-copy nuclear encoded sequences were evaluated, and the percentages of variable and phylogenetically informative sites varied over 3.5-fold and 7.9-fold ranges, respectively. Percentage of variable sites ranged from 3.43 to 12.1%, while percentage of phylogenetically informative sites ranged from 0.21 to 1.66%.
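Percentages of this kind can be tallied directly from an alignment. The following sketch is not the procedure used by Cronn et al. (2002b); it is a minimal illustration that assumes an ungapped toy alignment (invented here), ignores any character other than A, C, G or T, and uses the usual parsimony definition of an informative site (at least two states, each present in at least two sequences).

```python
from collections import Counter


def site_summary(alignment):
    """Count variable and parsimony-informative columns in an alignment.

    alignment: list of equal-length sequence strings (one per taxon).
    """
    seqs = [s.upper() for s in alignment]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    variable = informative = 0
    for column in zip(*seqs):
        # Gaps and ambiguity codes are skipped (a simplifying assumption).
        counts = Counter(base for base in column if base in "ACGT")
        if len(counts) > 1:
            variable += 1
            shared_states = sum(1 for n in counts.values() if n >= 2)
            if shared_states >= 2:
                informative += 1
    return {
        "length": length,
        "variable": variable,
        "informative": informative,
        "pct_variable": 100.0 * variable / length,
        "pct_informative": 100.0 * informative / length,
    }


# Toy alignment (invented): 6 columns, 4 variable, 2 parsimony-informative.
toy_alignment = ["ACGTAC", "ACGTAT", "ATGTGC", "ATATGC"]
toy_summary = site_summary(toy_alignment)
print(toy_summary)
```

Running the same function over each candidate gene gives per-gene percentages that can be compared in the manner described above.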

Evaluation of relative rates, and thus potential phylogenetic utility, can be conducted in several ways. First and perhaps simplest, phylogenetic analysis of sequences obtained from representative taxa can be performed and relative branch lengths can be assessed. Those genes with the longest branch lengths are the most variable and thus may provide the most information for a larger set of taxa.

On a finer scale, it is useful to evaluate variation on a per site basis, thus allowing inferences of the amount of information obtained per unit of sequence. Clearly, longer sequences are, in general, more likely to contain more variable sites. However, if efficiency in sequencing effort is a desirable goal, then those sequences that have the greatest variation per site may be more useful even if they are shorter than some other sequences that provide greater overall numbers of variable sites. This value can be estimated by using a variety of genetic-distance algorithms available in most phylogenetic inference packages (e.g. PAUP*: Swofford 2002). Calculation of p-distances (the proportion of sites that differ between a pair of sequences) provides a straightforward estimate of relative variation. More complex models (e.g. Jukes–Cantor, Kimura 2-parameter) can be invoked if deviations from a simple model of evolution are observed (e.g. transition:transversion bias, GC-content bias).
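For readers who wish to see what these distance measures compute, the sketch below implements the uncorrected p-distance together with the standard Jukes–Cantor and Kimura 2-parameter corrections for a single pair of aligned sequences. It is an illustration only, not a substitute for the implementations in PAUP* or similar packages; the function name and toy sequences are hypothetical, and sites with gaps or ambiguity codes are simply skipped.

```python
import math

PURINES = {"A", "G"}


def pairwise_distances(seq_a, seq_b):
    """Return p-distance, Jukes-Cantor and Kimura 2-parameter distances."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")

    P = transitions / compared      # proportion of transitional differences
    Q = transversions / compared    # proportion of transversional differences
    p = P + Q                       # uncorrected p-distance
    # math.log raises ValueError for saturated (very divergent) pairs.
    jc = -0.75 * math.log(1 - 4 * p / 3)
    k2p = -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
    return {"p": p, "jc": jc, "k2p": k2p}


# Toy pair (invented): 20 sites, one transition and one transversion.
toy_a = "ACGTACGTACGTACGTACGT"
toy_b = "GAGTACGTACGTACGTACGT"
toy_distances = pairwise_distances(toy_a, toy_b)
print(toy_distances)
```

At low divergence the corrected distances differ little from the p-distance, which is one reason the simple measure is often adequate for ranking candidate genes among closely related taxa.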

Data collection from all taxa of interest

Once a gene or set of genes has been identified as potentially useful in representative taxa, the next step is to generate sequence data for all taxa of interest. In our experience, the most efficient approach to accomplish this is to develop locus-specific PCR-amplification primers for each particular gene of interest. There are several advantages to this approach. First, locus-specific amplification may minimise the costly, tedious and time-consuming step of having to clone PCR products from a large number of taxa, although this may be necessary in any event if heterozygosity is high. Second, the more specific the PCR primers are, the smaller will be the nagging problem of chimeric, recombinant PCR products that occasionally (to often) are produced during PCR amplification of genes (as discussed earlier in this review). Third, the ability to reliably and reproducibly amplify a single homogeneous PCR product from multiple species bolsters the inference of orthology of those sequences.

Conclusions

The paths chosen by modern plant systematists increasingly lead to questions that are difficult to resolve through the application of one source, or even sometimes two sources, of molecular inference. There has been enormous growth over the past decade of publicly available gene-sequence databases including whole-genome sequences for model organisms such as Arabidopsis thaliana (The Arabidopsis Genome Initiative 2000) and Oryza sativa (Goff et al. 2002; Yu et al. 2002) and EST databases from many other species (see Plants Genome Central at NCBI: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html). These accumulating data have set the stage for the development of new, low-copy gene markers that are up to the challenge of resolving difficult phylogenetic questions. Low-copy nuclear markers come at a higher cost with regard to development than either cpDNA or rDNA markers, but they are certain to eclipse cpDNA and rDNA with regard to reliability and resolving power. Since single-copy nuclear genes are biparentally inherited, they have the potential to reveal the ancestry of all lineages contributing to an extant species, whether diploid (Cronn et al. 2003) or polyploid (Small et al. 1998; Liu et al. 2001; Cronn et al. 2002b). Low-copy loci are rarely subject to concerted evolution (Cronn et al. 1999; Senchina et al. 2003), and they do not display the high intragenic polymorphism (and corresponding increase in homoplasy) characteristic of rDNA, which requires concerted evolution to maintain intra- and inter-array homogeneity. Finally, nuclear genes contain exonic regions that limit alignment ambiguity and facilitate homologous comparisons (Bailey and Doyle 1999; Doyle et al. 1999b; Bortiri et al. 2002; Sang 2002), as well as introns and other non-coding regions that diverge at a substantially higher rate than either cpDNA or rDNA.

To the extent that plant molecular systematic studies would benefit from the routine application of one nuclear gene, the development and application of multiple independent nuclear loci holds even greater promise for addressing more complex questions. For example, the application of multiple, low-copy nuclear markers may present the only viable approach for teasing apart temporally compressed divergence events, since the variance of divergence rates across sites from exonic and intronic regions almost ensures that a fraction of sites will yield sufficient phylogenetic signal to mark lineages (Small et al. 1998; Cronn et al. 2002b; Malcomber 2002). Similarly, the topological ‘stalemates’ frequently presented by cpDNA and rDNA can only be confidently resolved through the use of multiple independent markers, such as low-copy nuclear genes (Small et al. 1998; Cronn et al. 2002b, 2003; Malcomber 2002). In many cases, the resolution provided by nuclear genes will help to shed light on the true topology, as well as the underlying methodological or biological basis of phylogenetic incongruence. Less frequently, attempts to resolve cpDNA–rDNA incongruence using multiple nuclear loci may turn up exciting instances where multiple nuclear markers return a consistent topology that stands in stark contrast to the topologies returned by both cpDNA and rDNA (Cronn et al. 2003). In these challenging cases, the single accomplishment of resolving a phylogeny pales in significance beside the greater challenge of understanding the evolutionary dynamics that have given rise to the genomic complexity embodied in living plant species.

Acknowledgments

We thank the many colleagues who have contributed to our thoughts on these issues, Jeff Doyle and two anonymous reviewers whose comments improved this manuscript, and acknowledge funding support from the United States National Science Foundation.

References

Adams KL, Cronn R, Percifield R, Wendel JF (2003) Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proceedings of the National Academy of Sciences of the United States of America 100, 4649–4654. doi:10.1073/PNAS.0630618100

Ahlandsberg S, Sun C, Jansson C (2002) An intronic element directs endosperm-specific expression of the sbeIIb gene during barley seed development. Genetics and Genomics 20, 864–868.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215, 403–410. doi:10.1006/JMBI.1990.9999

Alvarez I, Wendel JF (2003) Ribosomal ITS sequences and plant phylogenetic inference. Molecular Phylogenetics and Evolution 29, 417–434. doi:10.1016/S1055-7903(03)00208-2

Anderberg AA, Rydin C, Kallersjo M (2002) Phylogenetic relationships in the order Ericales s.l.: analyses of molecular data from five genes from the plastid and mitochondrial genomes. American Journal of Botany 89, 677–687.

APGII (2003) An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG II. Botanical Journal of the Linnean Society 141, 399–436.

Arnheim N (1983) Concerted evolution of multigene families. In ‘Evolution of genes and proteins’. (Eds M Nei, RK Koehn) pp. 38–61. (Sinauer: Sunderland, MA)

Ausubel FM, Brent R, Kingston RE, Moore DD, Seidman JG, Smith JA, Struhl K (Eds) (1992) ‘Short protocols in molecular biology.’ 2nd edn. (John Wiley & Sons: New York)

Awadalla P, Eyre-Walker A, Maynard Smith J (1999) Linkage disequilibrium and recombination in hominid mitochondrial DNA. Science 286, 2524–2525. doi:10.1126/SCIENCE.286.5449.2524

Bailey CD, Carr TG, Harris SA, Hughes CE (2003) Characterization of angiosperm nrDNA polymorphism, paralogy, and pseudogenes. Molecular Phylogenetics and Evolution 29, 435–455. doi:10.1016/J.YMPEV.2003.08.021

Bailey CD, Doyle JJ (1999) Potential phylogenetic utility of the low-copy nuclear gene pistillata in dicotyledonous plants: comparison to nrDNA ITS and trnL intron in Sphaerocardamum and other Brassicaceae. Molecular Phylogenetics and Evolution 13, 20–30. doi:10.1006/MPEV.1999.0627

Bailey CD, Price RA, Doyle JJ (2002) Systematics of the Halimolobine Brassicaceae: evidence from three loci and morphology. Systematic Botany 27, 318–332.

Baldwin BG, Markos S (1998) Phylogenetic utility of the external transcribed spacer (ETS) of 18S–26S nrDNA: congruence of ETS and ITS trees of Calycadenia (Compositae). Molecular Phylogenetics and Evolution 10, 449–463. doi:10.1006/MPEV.1998.0545

Baldwin BG, Sanderson MJ, Porter JM, Wojciechowski MF, Campbell CS, Donoghue MJ (1995) The ITS region of nuclear ribosomal DNA: a valuable source of evidence on angiosperm phylogeny. Annals of the Missouri Botanical Garden 82, 247–277.

Bancroft I (2001) Duplicate and diverge: the evolution of plant genome microstructure. Trends in Genetics 17, 89–93. doi:10.1016/S0168-9525(00)02179-X

Barrier M, Baldwin BG, Robichaux RH, Purugganan MD (1999) Interspecific hybrid ancestry of a plant adaptive radiation: allopolyploidy of the Hawaiian silversword alliance inferred from duplicated floral homeotic genes. Molecular Biology and Evolution 16, 1105–1113.

Birky CW (1995) Uniparental inheritance of mitochondrial and chloroplast genes: mechanisms and evolution. Proceedings of the National Academy of Sciences of the United States of America 92, 11331–11338.

Blake NK, Lehfeldt BR, Lavin M, Talbert LE (1999) Phylogenetic reconstruction based on low copy DNA sequence data in an allopolyploid: the B genome of wheat. Genome 42, 351–360. doi:10.1139/GEN-42-2-351

Bortiri E, Oh S-H, Gao F-Y, Potter D (2002) The phylogenetic utility of nucleotide sequences of sorbitol 6-phosphate dehydrogenase in Prunus (Rosaceae). American Journal of Botany 89, 1697–1708.

Bradley RD, Hillis DM (1997) Recombinant DNA sequences generated by PCR amplification. Molecular Biology and Evolution 14, 592–593.


Buckler ES, Holtsford TP (1996) Zea ribosomal repeat evolution and substitution patterns. Molecular Biology and Evolution 13, 623–632.

Buckler ES, Ippolito A, Holtsford TP (1997) The evolution of ribosomal DNA: divergent paralogues and phylogenetic implications. Genetics 145, 821–832.

Chang C, Meyerowitz EM (1986) Molecular cloning and DNA sequence of the Arabidopsis thaliana alcohol dehydrogenase gene. Proceedings of the National Academy of Sciences of the United States of America 83, 1408–1412.

Charlesworth D, Awadalla P (1998) Flowering plant self-incompatibility: the molecular population genetics of Brassica S-loci. Heredity 81, 1–9. doi:10.1038/SJ.HDY.6884000

Charlesworth D, Charlesworth B (1998) Sequence variation: looking for effects of genetic linkage. Current Biology 8, R658–R661.

Charlesworth D, Liu F, Zhang L (1998) The evolution of the alcohol dehydrogenase gene family in plants of the genus Leavenworthia (Brassicaceae): loss of introns, and an intronless gene. Molecular Biology and Evolution 15, 552–559.

Chase MW, Knapp S, Cox AV, Clarkson JJ, Butsko Y, Joseph J, Savolainen V, Parokonny AS (2003) Molecular systematics, GISH and the origin of hybrid taxa in Nicotiana (Solanaceae). Annals of Botany 92, 107–127. doi:10.1093/AOB/MCG087

Chase MW, Soltis DE, Olmstead RG, Morgan D, Les DH (1993) Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden 80, 528–580.

Clegg MT, Cummings MP, Durbin ML (1997) The evolution of plant nuclear genes. Proceedings of the National Academy of Sciences of the United States of America 94, 7791–7798. doi:10.1073/PNAS.94.15.7791

Corriveau JL, Coleman AW (1988) Rapid screening method to detect potential biparental inheritance of plastid DNA and results for over 200 angiosperm species. American Journal of Botany 75, 1443–1458.

Crawford DJ (1985) Electrophoretic data and plant speciation. Systematic Botany 10, 405–416.

Crawford DJ (2000) Plant macromolecular systematics in the past 50 years: one view. Taxon 49, 479–501.

Cronn R, Wendel JF (1998) Simple methods for isolating homoeologous loci from allopolyploid genomes. Genome 41, 756–762. doi:10.1139/GEN-41-6-756

Cronn RC, Zhao X, Paterson AH, Wendel JF (1996) Polymorphism and concerted evolution in a tandemly repeated gene family: 5S ribosomal DNA in diploid and allopolyploid cottons. Journal of Molecular Evolution 42, 685–705.

Cronn RC, Small RL, Wendel JF (1999) Duplicated genes evolve independently after polyploid formation in cotton. Proceedings of the National Academy of Sciences of the United States of America 96, 14406–14411.

Cronn R, Cedroni M, Haselkorn T, Grover C, Wendel JF (2002a) PCR-mediated recombination in amplification products derived from polyploid cotton. Theoretical and Applied Genetics 104, 482–489. doi:10.1007/S001220100741

Cronn RC, Small RL, Haselkorn T, Wendel JF (2002b) Rapid diversification of the cotton genus (Gossypium: Malvaceae) revealed by analysis of sixteen nuclear and chloroplast genes. American Journal of Botany 89, 707–725. doi:10.1093/AOB/MCF125

Cronn RC, Small RL, Haselkorn T, Wendel JF (2003) Cryptic repeated genomic recombination during speciation in Gossypium gossypioides. Evolution 57, 2475–2489.

Doyle JJ (1991) Evolution of higher-plant glutamine synthetase genes: tissue specificity as a criterion for predicting orthology. Molecular Biology and Evolution 8, 366–377.

Doyle JJ (1992) Gene trees and species trees: molecular systematics as one-character taxonomy. Systematic Botany 17, 144–163.

Doyle JJ (1995) The irrelevance of allele tree topologies for species delimitation, and a nontopological alternative. Systematic Botany 20, 574–588.

Doyle JJ (1997) Trees within trees: genes and species, molecules and morphology. Systematic Biology 46, 537–553.

Doyle JJ, Doyle JL (1987) A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochemical Bulletin 19, 11–15.

Doyle JJ, Doyle JL (1990) Isolation of plant DNA from fresh tissue. Focus 12, 13–15.

Doyle JJ, Doyle JL (1999) Nuclear protein-coding genes in phylogeny reconstruction and homology assessment: some examples from Leguminosae. In ‘Molecular systematics and plant evolution’. (Ed. RJ Gornall) pp. 229–254. (Taylor and Francis: London)

Doyle JJ, Kanazin V, Shoemaker RC (1996) Phylogenetic utility of histone H3 intron sequences in the perennial relatives of soybean (Glycine: Leguminosae). Molecular Phylogenetics and Evolution 6, 438–447. doi:10.1006/MPEV.1996.0092

Doyle JJ, Doyle JL, Brown AHD (1999a) Incongruence in the diploid B-genome species complex of Glycine (Leguminosae) revisited: histone H3-D alleles versus chloroplast haplotypes. Molecular Biology and Evolution 16, 354–362.

Doyle JJ, Doyle JL, Brown AHD (1999b) Origins, colonization, and lineage recombination in a widespread perennial soybean polyploid complex. Proceedings of the National Academy of Sciences of the United States of America 96, 10741–10745. doi:10.1073/PNAS.96.19.10741

Doyle JJ, Doyle JL, Brown AHD, Pfeil BE (2000) Confirmation of shared and divergent genomes in the Glycine tabacina polyploid complex (Leguminosae) using histone H3-D sequences. Systematic Botany 25, 437–448.

Doyle JJ, Doyle JL, Brown AHD, Palmer RG (2002) Genomes, multiple origins, and lineage recombination in the Glycine tomentella (Leguminosae) polyploid complex: histone H3-D gene sequences. Evolution 56, 1388–1402.

Drouin G, Prat F, Ell M, Clarke GDP (1999) Detecting and characterizing gene conversions between multigene family members. Molecular Biology and Evolution 16, 1369–1390.

Elder JF, Turner BJ (1995) Concerted evolution of repetitive DNA sequences in eukaryotes. The Quarterly Review of Biology 70, 297–320. doi:10.1086/419073

Emshwiller E, Doyle JJ (1999) Chloroplast-expressed glutamine synthetase (ncpGS): potential utility for phylogenetic studies with an example from Oxalis (Oxalidaceae). Molecular Phylogenetics and Evolution 12, 310–319. doi:10.1006/MPEV.1999.0613

Emshwiller E, Doyle JJ (2002) Origins of domestication and polyploidy in Oca (Oxalis tuberosa: Oxalidaceae). 2. Chloroplast-expressed glutamine synthetase data. American Journal of Botany 89, 1042–1056.

Endrizzi JD, Turcotte EL, Kohel RJ (1985) Genetics, cytology, and evolution of Gossypium. Advances in Genetics 23, 271–375.

Evans RC, Campbell CS (2002) The origin of the apple subfamily (Maloideae; Rosaceae) is clarified by DNA sequence data from duplicated GBSSI genes. American Journal of Botany 89, 1478–1484.

Evans RC, Alice LA, Campbell CS, Kellogg EA, Dickinson TA (2000) The granule-bound starch synthase (GBSSI) gene in the Rosaceae: Multiple loci and phylogenetic utility. Molecular Phylogenetics and Evolution 17, 388–400. doi:10.1006/MPEV.2000.0828

Eyre-Walker A (2000) Do mitochondria recombine in humans? Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 355, 1573–1580. doi:10.1098/RSTB.2000.0718


Eyre-Walker A, Gaut RL, Hilton H, Feldman DL, Gaut BS (1998) Investigation of the bottleneck leading to domestication of maize. Proceedings of the National Academy of Sciences of the United States of America 95, 4441–4446. doi:10.1073/PNAS.95.8.4441

Ferguson D, Sang T (2001) Speciation through homoploid hybridization between allotetraploids in peonies (Paeonia). Proceedings of the National Academy of Sciences of the United States of America 98, 3915–3919. doi:10.1073/PNAS.061288698

Filatov DA, Charlesworth D (1999) DNA polymorphism, haplotype structure and balancing selection in the Leavenworthia PgiC locus. Genetics 153, 1423–1434.

Ford VS, Gottlieb LD (1999) Molecular characterization of PgiC in a tetraploid plant and its diploid relatives. Evolution 53, 1060–1067.

Ford VS, Gottlieb LD (2002) Single mutations silence PGIC2 genes in two very recent allotetraploid species of Clarkia. Evolution 56, 699–707.

Freudenstein JV, Chase MW (2001) Analysis of mitochondrial nad1b-c intron sequences in Orchidaceae: utility and coding of length-change characters. Systematic Botany 26, 643–657.

Fryxell PA (1968) A redefinition of the tribe Gossypieae. Botanical Gazette 129, 296–308. doi:10.1086/336448

Fryxell PA (1979) ‘The natural history of the cotton tribe.’ (Texas A&M University Press: College Station)

Fryxell PA (1992) A revised taxonomic interpretation of Gossypium L. (Malvaceae). Rheedea 2, 108–165.

Fu H, Dooner HK (2002) Intraspecific violation of genetic colinearity and its implications in maize. Proceedings of the National Academy of Sciences of the United States of America 99, 9573–9578.

Galloway GL, Malmberg RL, Price RA (1998) Phylogenetic utility of the nuclear gene arginine decarboxylase: an example from Brassicaceae. Molecular Biology and Evolution 15, 1312–1320.

Gaut BS (1998) Molecular clocks and nucleotide substitution rates in higher plants. Evolutionary Biology 30, 93–120.

Gaut BS (2001) Patterns of chromosomal duplication in maize and their implications for comparative maps of the grasses. Genome Research 11, 55–66. doi:10.1101/GR.160601

Gaut BS, Clegg MT (1991) Molecular evolution of alcohol dehydrogenase 1 in members of the grass family. Proceedings of the National Academy of Sciences of the United States of America 88, 2060–2064.

Gaut BS, Clegg MT (1993) Molecular evolution of the Adh1 locus in the genus Zea. Proceedings of the National Academy of Sciences of the United States of America 90, 5095–5099.

Gaut BS, Doebley JF (1997) DNA sequence evidence for the segmental allotetraploid origin of maize. Proceedings of the National Academy of Sciences of the United States of America 94, 6809–6814. doi:10.1073/PNAS.94.13.6809

Gaut BS, Morton BR, McCaig BC, Clegg MT (1996) Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proceedings of the National Academy of Sciences of the United States of America 93, 10274–10279. doi:10.1073/PNAS.93.19.10274

Gaut BS, Peek AS, Morton BR, Clegg MT (1999) Patterns of genetic diversification within the Adh gene family in the grasses (Poaceae). Molecular Biology and Evolution 16, 1086–1097.

Ge S, Sang T, Lu B-R, Hong D-Y (1999) Phylogeny of rice genomes with emphasis on origins of allotetraploid species. Proceedings of the National Academy of Sciences of the United States of America 96, 14400–14405. doi:10.1073/PNAS.96.25.14400

Gielly L, Yuan Y-M, Küpfer P, Taberlet P (1996) Phylogenetic use of noncoding regions in the genus Gentiana L.: chloroplast trnL (UAA) intron versus nuclear ribosomal internal transcribed spacer sequences. Molecular Phylogenetics and Evolution 5, 460–466. doi:10.1006/MPEV.1996.0042

Goff SA, Ricke D, Lan T-H, Presting G, Wang R, et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100. doi:10.1126/SCIENCE.1068275

Gottlieb LD (1977) Electrophoretic evidence and plant systematics. Annals of the Missouri Botanical Garden 64, 161–180.

Gottlieb LD, Ford VS (1996) Phylogenetic relationships among the sections of Clarkia (Onagraceae) inferred from the nucleotide sequences of PgiC. Systematic Botany 21, 45–62.

Grant D, Cregan P, Shoemaker RC (2000) Genome organization in dicots: genome duplication in Arabidopsis and synteny between soybean and Arabidopsis. Proceedings of the National Academy of Sciences of the United States of America 97, 4168–4173. doi:10.1073/PNAS.070430597

Grassly NC, Holmes EC (1997) A likelihood method for the detection of selection and recombination using sequence data. Molecular Biology and Evolution 14, 239–247.

Hamby RK, Zimmer EA (1988) Ribosomal RNA sequences for inferring phylogeny within the grass family (Poaceae). Plant Systematics and Evolution 160, 29–37.

Hamby RK, Zimmer EA (1992) Ribosomal RNA as a phylogenetic tool in plant systematics. In ‘Molecular systematics of plants’. (Eds PS Soltis, DE Soltis, JJ Doyle) pp. 50–91. (Chapman & Hall: New York)

Hanson MA, Gaut BS, Stec AO, Fuerstenberg SI, Goodman MM, Coe EH, Doebley JF (1996a) Evolution of anthocyanin biosynthesis in maize kernels: the role of regulatory and enzymatic loci. Genetics 143, 1395–1407.

Hanson RE, Islam-Faridi M, Percival EA, Crane CF, Ji Y, McKnight TD, Stelly DM, Price HJ (1996b) Distribution of 5S and 18S–28S rDNA loci in a tetraploid cotton (Gossypium hirsutum L.) and its putative diploid ancestors. Chromosoma 105, 55–61. doi:10.1007/S004120050159

Hartmann S, Nason JD, Bhattacharya D (2001) Extensive ribosomal DNA genic variation in the columnar cactus Lophocereus. Journal of Molecular Evolution 53, 124–134.

Henikoff S, Greene EA, Pietrokovski S, Bork P, Attwood TK, Hood L (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science 278, 609–614. doi:10.1126/SCIENCE.278.5338.609

Hilton H, Gaut BS (1998) Speciation and domestication in maize and its wild relatives: evidence from the Globulin-1 gene. Genetics 150, 863–872.

Hodges SA, Arnold ML (1994) Columbines: a geographically widespread species flock. Proceedings of the National Academy of Sciences of the United States of America 91, 5129–5132.

Hong RL, Hamaguchi L, Busch MA, Weigel D (2003) Regulatory elements of the floral homeotic gene AGAMOUS identified by phylogenetic footprinting and shadowing. The Plant Cell 15, 1296–1309. doi:10.1105/TPC.009548

Hughes AL, Yeager M (1998) Natural selection at major histocompatibility complex loci of vertebrates. Annual Review of Genetics 32, 415–435. doi:10.1146/ANNUREV.GENET.32.1.415

Innan H, Tajima F (1997) The amounts of nucleotide variation within and between allelic classes and the reconstruction of the common ancestral sequence in a population. Genetics 147, 1431–1444.

Jakobsen IB, Easteal S (1996) A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. CABIOS 12, 291–295.

Kawabe A, Miyashita NT (1999) DNA variation in the basic chitinase locus (ChiB) region of the wild plant Arabidopsis thaliana. Genetics 153, 1445–1453.

Kawabe A, Innan H, Terauchi R, Miyashita NT (1997) Nucleotide polymorphism in the acidic chitinase locus (ChiA) region of the wild plant Arabidopsis thaliana. Molecular Biology and Evolution 14, 1303–1315.


Klein J, Sato A, Nagl S, O’huigin C (1998) Molecular trans-speciespolymorphism. Annual Review of Ecology and Systematics 29,1–21. doi:10.1146/ANNUREV.ECOLSYS.29.1.1

Kuittinen H, Aguade M (2000) Nucleotide variation at the CHALCONEISOMERASE locus in Arabidopsis thaliana. Genetics 155,863–872.

Kuzoff RK, Sweere JA, Soltis DE, Soltis PS, Zimmer EA (1998) Thephylogenetic potential of entire 26S rDNA sequences in plants.Molecular Biology and Evolution 15, 251–263.

Lavin M, Eshbaugh E, Hu J-M, Matthews S, Sharrock RA (1998)Monophyletic subgroups of the tribe Millettieae (Leguminosae) asrevealed by phytochrome nucleotide sequence data. AmericanJournal of Botany 85, 412–433.

Levy F, Antonovics J, Boynton JE, Gillham NW (1996) A populationgenetic analysis of chloroplast DNA in Phacelia. Heredity 76,143–155.

Linder CR, Goertzen LR, Heuvel BV, Francisco-Ortega J, Jansen RK(2000) The complete external transcribed spacer of 18S–26S rDNA:amplification and phylogenetic utility at low taxonomic levels inAsteraceae and closely allied families. Molecular Phylogeneticsand Evolution 14, 285–303. doi:10.1006/MPEV.1999.0706

Liu F, Zhang L, Charlesworth D (1998) Genetic diversity inLeavenworthia populations with different inbreeding levels.Proceedings of the Royal Society of London. Series B. BiologicalSciences 265, 293–301. doi:10.1098/RSPB.1998.0295

Liu Q, Brubaker CL, Green AG, Marshall DR, Sharp PJ, Singh SP(2001) Evolution of the FAD2-1 fatty acid desaturase 5′ UTR intronand the molecular systematics of Gossypium (Malvaceae).American Journal of Botany 88, 92–102.

Lynch M, Conery JS (2000) The evolutionary fate and consequences ofduplicate genes. Science 290, 1151–1155. doi:10.1126/SCIENCE.290.5494.1151

Malcomber ST (2002) Phylogeny of Gaertnera Lam. (Rubiaceae)based on multiple DNA markers: evidence of a rapid radiation in awidespread, morphologically diverse genus. Evolution 56, 42–57.

Marshall HD, Newton C, Ritland K (2001) Sequence-repeatpolymorphisms exhibit the signature of recombination inLodgepole Pine chloroplast DNA. Molecular Biology andEvolution 18, 2136–2138.

Mason-Gamer RJ (2001) Origin of North American Elymus (Poaceae:Triticeae) allotetraploids based on granule-bound starch synthasegene sequences. Systematic Botany 26, 757–768.

Mason-Gamer RJ, Holsinger KE, Jansen RK (1995) Chloroplast DNAhaplotype variation within and among populations of Coreopsisgrandiflora (Asteraceae). Molecular Biology and Evolution 12,371–381.

Mason-Gamer RJ, Weil CF, Kellogg EA (1998) Granule-bound starch synthase: structure, function, and phylogenetic utility. Molecular Biology and Evolution 15, 1658–1673.

Mathews S, Sharrock RA (1996) The phytochrome gene family in grasses (Poaceae): a phylogeny and evidence that grasses have a subset of the loci found in dicot angiosperms. Molecular Biology and Evolution 13, 1141–1150.

Mathews S, Tsai RC, Kellogg EA (2000) Phylogenetic structure in the grass family (Poaceae): evidence from the nuclear gene Phytochrome B. American Journal of Botany 87, 96–107.

Mathews S, Spangler RE, Mason-Gamer RJ, Kellogg EA (2002) Phylogeny of Andropogoneae inferred from Phytochrome B, GBSSI, and NDHF. International Journal of Plant Sciences 163, 441–450. doi:10.1086/339155

Mayer MS, Soltis PS (1999) Intraspecific phylogeny analysis using ITS sequences: insights from studies of the Streptanthus glandulosus complex (Cruciferae). Systematic Botany 24, 47–61.

Mayol M, Rossello JA (2001) Why nuclear ribosomal DNA spacers (ITS) tell different stories in Quercus. Molecular Phylogenetics and Evolution 19, 167–176. doi:10.1006/MPEV.2001.0934

Millar AA, Dennis ES (1996) The alcohol dehydrogenase genes of cotton. Plant Molecular Biology 31, 897–904.

Miyashita NT, Innan H, Terauchi R (1996) Intra- and interspecific variation of the alcohol dehydrogenase locus region in wild plants Arabis gemmifera and Arabidopsis thaliana. Molecular Biology and Evolution 13, 433–436.

Moniz de Sá M, Drouin G (1996) Phylogeny and substitution rates of angiosperm actin genes. Molecular Biology and Evolution 13, 1198–1212.

Morton BR, Gaut BS, Clegg MT (1996) Evolution of alcohol dehydrogenase genes in the Palm and Grass families. Proceedings of the National Academy of Sciences of the United States of America 93, 11735–11739.

Muir G, Fleming CC, Schlotterer C (2001) Three divergent rDNA clusters predate the species divergence in Quercus petraea (Matt.) Liebl. and Quercus robur L. Molecular Biology and Evolution 18, 112–119.

Meyerhans A, Vartanian J-P, Wain-Hobson S (1990) DNA recombination during PCR. Nucleic Acids Research 18, 1687–1691.

Olmstead RG, Palmer JD (1994) Chloroplast DNA systematics: a review of methods and data analysis. American Journal of Botany 81, 1205–1224.

Olsen KM (2002) Population history of Manihot esculenta (Euphorbiaceae) inferred from nuclear DNA sequences. Molecular Ecology 11, 901–911. doi:10.1046/J.1365-294X.2002.01493.X

Olsen KM, Schaal BA (1999) Evidence on the origin of cassava: phylogeography of Manihot esculenta. Proceedings of the National Academy of Sciences of the United States of America 96, 5586–5591. doi:10.1073/PNAS.96.10.5586

Palmer JD (1992) Mitochondrial DNA in plant systematics: applications and limitations. In ‘Molecular systematics of plants’. (Eds PS Soltis, DE Soltis, JJ Doyle) pp. 36–49. (Chapman and Hall: New York)

Palmer JD, Adams KL, Cho Y, Parkinson CL, Qiu Y-L, Song K (2000) Dynamic evolution of plant mitochondrial genomes: mobile genes and introns and highly variable mutation rates. Proceedings of the National Academy of Sciences of the United States of America 97, 6960–6966. doi:10.1073/PNAS.97.13.6960

Perry DJ, Furnier GR (1996) Pinus banksiana has at least seven expressed alcohol dehydrogenase genes in two linked groups. Proceedings of the National Academy of Sciences of the United States of America 93, 13020–13023.

Purugganan MD, Suddith JI (1998) Molecular population genetics of the Arabidopsis CAULIFLOWER regulatory gene: nonneutral evolution and naturally occurring variation in floral homeotic function. Proceedings of the National Academy of Sciences of the United States of America 95, 8130–8134. doi:10.1073/PNAS.95.14.8130

Qiu Y-L, Cho Y, Cox JC, Palmer JD (1998) The gain of three mitochondrial introns identifies liverworts as the earliest land plants. Nature 394, 671–674. doi:10.1038/29286

Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zonis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (1999) The earliest angiosperms: evidence from mitochondrial, plastid, and nuclear genomes. Nature 402, 404–407. doi:10.1038/46536

Rauscher JT, Doyle JJ, Brown AHD (2002) Internal transcribed spacer repeat-specific primers and the analysis of hybridization in the Glycine tomentella (Leguminosae) polyploid complex. Molecular Ecology 11, 2691–2702. doi:10.1046/J.1365-294X.2002.01640.X

Reboud X, Zeyl C (1994) Organelle inheritance in plants. Heredity 72, 132–140.

Richman AD, Kohn JR (1999) Self-incompatibility alleles from Physalis: implications for historical inference from balanced genetic polymorphisms. Proceedings of the National Academy of Sciences of the United States of America 96, 168–172. doi:10.1073/PNAS.96.1.168

Richman AD, Uyenoyama MK, Kohn JR (1996) Allelic diversity and gene genealogy at the self-incompatibility locus in the Solanaceae. Science 273, 1212–1216.

Rieseberg LH (1997) Hybrid origins of plant species. Annual Review of Ecology and Systematics 28, 359–389. doi:10.1146/ANNUREV.ECOLSYS.28.1.359

Rieseberg LH, Soltis DE (1991) Phylogenetic consequences of cytoplasmic gene flow in plants. Evolutionary Trends in Plants 5, 65–84.

Rieseberg LH, Wendel JF (1993) Introgression and its consequences in plants. In ‘Hybrid zones and the evolutionary process’. (Ed. RG Harrison) pp. 70–109. (Oxford University Press: New York)

Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425, 798–804. doi:10.1038/NATURE02053

Sambrook J, Fritsch EF, Maniatis T (1989) ‘Molecular cloning: a laboratory manual.’ 2nd edn. (Cold Spring Harbor Laboratory Press)

Sanderson MJ, Doyle JJ (1992) Reconstruction of organismal and gene phylogenies from data on multigene families: concerted evolution, homoplasy, and confidence. Systematic Biology 41, 4–17.

Sang T (2002) Utility of low-copy nuclear gene sequences in plant phylogenetics. Critical Reviews in Biochemistry and Molecular Biology 37, 121–147.

Sang T, Zhang D (1999) Reconstructing hybrid speciation using sequences of low copy nuclear genes: hybrid origins of five Paeonia species based on Adh gene phylogenies. Systematic Botany 24, 148–163.

Sang T, Crawford DJ, Stuessy TF (1997a) Chloroplast phylogeny, reticulate evolution, and biogeography of Paeonia (Paeoniaceae). American Journal of Botany 84, 1120–1136.

Sang T, Donoghue MJ, Zhang D (1997b) Evolution of alcohol dehydrogenase genes in peonies (Paeonia): phylogenetic relationships of putative nonhybrid species. Molecular Biology and Evolution 14, 994–1007.

Sanjur OI, Piperno DR, Andres TC, Wessel-Beaver L (2002) Phylogenetic relationships among domesticated and wild species of Cucurbita (Cucurbitaceae) inferred from a mitochondrial gene: implications for crop plant evolution and areas of origin. Proceedings of the National Academy of Sciences of the United States of America 99, 535–540. doi:10.1073/PNAS.012577299

Sawyer S (1989) Statistical tests for detecting gene conversion. Molecular Biology and Evolution 6, 526–536.

Senchina DS, Alvarez I, Cronn RC, Liu B, Rong J, Noyes RD, Paterson AH, Wing RA, Wilkins TA, Wendel JF (2003) Rate variation among nuclear genes and the age of polyploidy in Gossypium. Molecular Biology and Evolution 20, 633–643. doi:10.1093/MOLBEV/MSG065

Simmons MP, Savolainen V, Clevinger CC, Archer RH, Davis JI (2001) Phylogeny of the Celastraceae inferred from 26S nuclear ribosomal DNA, phytochrome B, rbcL, atpB, and morphology. Molecular Phylogenetics and Evolution 19, 353–366. doi:10.1006/MPEV.2001.0937

Small RL (2004) Phylogeny of Hibiscus sect. Muenchhusia (Malvaceae) based on chloroplast rpL16 and ndhF, and nuclear ITS and GBSSI sequences. Systematic Botany, in press.

Small RL, Wendel JF (2000a) Copy number lability and evolutionary dynamics of the Adh gene family in diploid and tetraploid cotton (Gossypium). Genetics 155, 1913–1926.

Small RL, Wendel JF (2000b) Phylogeny, duplication, and intraspecific variation of Adh sequences in New World diploid cottons (Gossypium, Malvaceae). Molecular Phylogenetics and Evolution 16, 73–84. doi:10.1006/MPEV.1999.0750

Small RL, Wendel JF (2002) Differential evolutionary dynamics of duplicated paralogous Adh loci in allotetraploid cotton (Gossypium). Molecular Biology and Evolution 19, 597–607.

Small RL, Ryburn JA, Cronn RC, Seelanan T, Wendel JF (1998) The tortoise and the hare: choosing between noncoding plastome and nuclear Adh sequences for phylogenetic reconstruction in a recently diverged plant group. American Journal of Botany 85, 1301–1315.

Small RL, Ryburn JA, Wendel JF (1999) Low levels of nucleotide diversity at homoeologous Adh loci in allotetraploid cotton (Gossypium L.). Molecular Biology and Evolution 16, 491–501.

Soltis DE et al. (1997) Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Annals of the Missouri Botanical Garden 84, 1–49.

Soltis DE, Soltis PS (1998a) Choosing an approach and an appropriate gene for phylogenetic analysis. In ‘Molecular systematics of plants II: DNA sequencing’. (Eds DE Soltis, PS Soltis, JJ Doyle) pp. 1–42. (Kluwer Academic Publishers: Boston)

Soltis PS, Soltis DE (1998b) Molecular evolution of 18S rDNA in angiosperms: implications for character weighting in phylogenetic analysis. In ‘Molecular systematics of plants II: DNA sequencing’. (Eds DE Soltis, PS Soltis, JJ Doyle) (Kluwer Academic Publishers: Boston, MA)

Soltis PS, Soltis DE, Chase MW (1999) Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402, 402–404. doi:10.1038/46528

Steele KP, Holsinger KE, Jansen RK, Taylor DW (1991) Assessing the reliability of 5S ribosomal-RNA sequence data for phylogenetic analysis in green plants. Molecular Biology and Evolution 8, 240–248.

Stephens JC (1985) Statistical methods of DNA sequence analysis: detection of intragenic recombination or gene conversion. Molecular Biology and Evolution 2, 539–556.

Strand AE, Leebens-Mack J, Milligan BG (1997) Nuclear DNA-based markers for plant evolutionary biology. Molecular Ecology 6, 113–118. doi:10.1046/J.1365-294X.1997.00153.X

Suh Y, Thien LB, Reeve HE, Zimmer EA (1993) Molecular evolution and phylogenetic implications of internal transcribed spacer sequences of ribosomal DNA in Winteraceae. American Journal of Botany 80, 1042–1055.

Swofford DL (2002) ‘PAUP*. Phylogenetic analysis using parsimony (*and other methods).’ (Sinauer Associates: Sunderland, MA)

Taberlet P, Gielly L, Pautou G, Bouvet J (1991) Universal primers for amplification of three non-coding regions of chloroplast DNA. Plant Molecular Biology 17, 1105–1109.

Tank DC, Sang T (2001) Phylogenetic utility of the glycerol-3-phosphate acyltransferase gene: evolution and implications in Paeonia (Paeoniaceae). Molecular Phylogenetics and Evolution 19, 421–429. doi:10.1006/MPEV.2001.0931

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. doi:10.1038/35048692

Thornton JW, DeSalle R (2000) Gene family evolution and homology: genomics meets phylogenetics. Annual Review of Genomics and Human Genetics 1, 41–73. doi:10.1146/ANNUREV.GENOM.1.1.41

Wagner A (1998) The fate of duplicated genes: loss or new function? BioEssays 20, 785–788. doi:10.1002/(SICI)1521-1878(199810)20:103.3.CO;2-D

Wagner A (2001) Birth and death of duplicated genes in completely sequenced eukaryotes. Trends in Genetics 17, 237–239. doi:10.1016/S0168-9525(01)02243-0

Wagner A, Blackstone N, Cartwright P, Dick M, Misof B, Snow P, Wagner GP, Bartels J, Murtha M, Pendleton J (1994) Surveys of gene families using polymerase chain reaction: PCR selection and PCR drift. Systematic Biology 43, 250–261.

Wang R-L, Stec A, Hey J, Lukens L, Doebley J (1999) The limits of selection during maize domestication. Nature 398, 236–239. doi:10.1038/18435

Waters ER (1995) The molecular evolution of the small heat-shock proteins in plants. Genetics 141, 785–795.

Wendel JF (1989) New World tetraploid cottons contain Old World cytoplasm. Proceedings of the National Academy of Sciences of the United States of America 86, 4132–4136.

Wendel JF, Albert VA (1992) Phylogenetics of the cotton genus (Gossypium L.): character-state weighted parsimony analysis of chloroplast DNA restriction site data and its systematic and biogeographic implications. Systematic Botany 17, 115–143.

Wendel JF, Doyle JJ (1998) Phylogenetic incongruence: window into genome history and molecular evolution. In ‘Molecular systematics of plants II. DNA sequencing’. (Eds D Soltis, P Soltis, J Doyle) (Kluwer Academic Publishing)

Wendel JF, Schnabel A, Seelanan T (1995a) Bidirectional interlocus concerted evolution following allopolyploid speciation in cotton (Gossypium). Proceedings of the National Academy of Sciences of the United States of America 92, 280–284.

Wendel JF, Schnabel A, Seelanan T (1995b) An unusual ribosomal DNA sequence from Gossypium gossypioides reveals ancient, cryptic, intergenomic introgression. Molecular Phylogenetics and Evolution 4, 298–313. doi:10.1006/MPEV.1995.1027

Wendel JF, Cronn RC, Johnston JS, Price HJ (2002) Feast and famine in plant genomes. Genetica 115, 37–42. doi:10.1023/A:1016020030189

White SE, Doebley JF (1999) The molecular evolution of terminal ear1, a regulatory gene in the genus Zea. Genetics 153, 1455–1462.

Whitten WM, Williams NH, Chase MW (2000) Subtribal and generic relationships of Maxillarieae (Orchidaceae) with emphasis on Stanhopeinae: combined molecular evidence. American Journal of Botany 87, 1842–1856.

Wolfe KH, Li W-H, Sharp PM (1987) Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proceedings of the National Academy of Sciences of the United States of America 84, 9054–9058.

Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano L (2003) The evolution of transcriptional regulation in eukaryotes. Molecular Biology and Evolution 20, 1377–1419. doi:10.1093/MOLBEV/MSG140

Yu J, Hu S, Wang J, Wong GK-S, Li S et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92. doi:10.1126/SCIENCE.1068037

Zhang L, Pond SK, Gaut BS (2001) A survey of the molecular evolutionary dynamics of twenty-five multigene families from four grass taxa. Journal of Molecular Evolution 52, 144–156.

Zimmer EA, Martin SL, Beverly SM, Kan W, Wilson AC (1980) Rapid duplication and loss of genes coding for the α chains of hemoglobin. Proceedings of the National Academy of Sciences of the United States of America 77, 2158–2162.

Manuscript received 12 June 2003, accepted 28 January 2004