


Available online at www.sciencedirect.com

www.elsevier.com/locate/ympev

Molecular Phylogenetics and Evolution 47 (2008) 289–301

A protocol for large-scale rRNA sequence analysis: Towards a detailed phylogeny of Coleoptera

Toby Hunt, Alfried P. Vogler *

Department of Entomology, Natural History Museum, London SW7 5BD, UK

Division of Biology, Imperial College London, Silwood Park Campus, Ascot, UK

Received 9 August 2007; revised 2 November 2007; accepted 17 November 2007

Available online 10 January 2008

Abstract

Large-scale phylogenetic analyses involving thousands of rRNA sequences are complicated by length variability, which compounds the already complex problem of large tree searches. Here, we generated a large data matrix and tested phylogenetic procedures for large-scale analysis in the Coleoptera (beetles), as a resource for evolutionary biology and identification of this hugely diverse group. The analysis included nearly 1200 species, including representatives of 126 (75%) families, all 18 superfamilies of Polyphaga, and the four suborders. Alignments were obtained by a fragment-extension method derived from the BLAST algorithm using the BlastAlign script [Belshaw, R., Katzourakis, A., 2005. BlastAlign: a program that uses blast to align problematic nucleotide sequences. Bioinformatics 21, 122–123], followed by fast parsimony and maximum likelihood searches. Trees were assessed against the existing classification, using a formal procedure for coding the hierarchical position of taxa and establishing taxonomic congruence. We found that the BlastAlign procedure greatly exceeded the performance of standard progressive alignment methods such as Clustal. The resulting trees, when used as guide trees, also greatly improved the Clustal-based alignments. Long-branch attraction potentially affecting the quality of the tree was reduced by the systematic removal of all branches longer than a 95% interval of the distribution of branch lengths. We applied this protocol to test for monophyly of major proposed lineages of Coleoptera, including Crowson’s 18 superfamilies in the hyperdiverse suborder Polyphaga. While searches for very large trees remained challenging and details of the tree topology were not always satisfactory, the strategy for alignment and tree searches used here makes large-scale phylogenetics of super-diverse groups such as Coleoptera amenable to desktop computing.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Large data sets; Alignment; Most representative sequence; BlastAlign; Validation; Taxonomic consistency; Classification

1. Introduction

The increasing taxonomic content of DNA databases and rapid sequencing technology now permit tree construction at ever larger scales (Hibbett et al., 2005; Kallersjo et al., 1998; McMahon and Sanderson, 2006; Soltis et al., 1999). However, traditional phylogenetic methodologies struggle to accommodate these huge data sets, whilst newly developed techniques, more capable of coping with large-scale analyses, have not become generally established.

1055-7903/$ - see front matter © 2007 Elsevier Inc. All rights reserved.

doi:10.1016/j.ympev.2007.11.029

* Corresponding author. Address: Department of Entomology, Natural History Museum, London SW7 5BD, UK. Fax: +44 207 942 5229.

E-mail address: [email protected] (A.P. Vogler).

Ribosomal RNA genes remain among the most widely used phylogenetic markers and therefore techniques for their analysis at this scale are particularly important. In insects, the small subunit (SSU) rRNA gene has been the dominant marker (Chalwatzis et al., 1996; Kjer, 2004; Pashley et al., 1993; Wheeler et al., 2001; Whiting et al., 1997), but this gene is affected by great length variability and high variation in molecular rates, exacerbating the difficulty of finding optimal trees when numbers of taxa increase.

Procedures for simultaneous alignment and tree building (Wheeler, 1996) cannot currently handle more than a few hundred full-length SSU sequences. Similarly, secondary structure alignments, either machine-based (e.g. Mathews and Turner, 2002; Sankoff, 1985) or manual (Gillespie, 2004; Kjer, 1995), are time-consuming to perform on such a large scale. The widely used Clustal algorithm (Thompson et al., 1994) and the newer MUSCLE (Edgar, 2004) and MAFFT (Katoh et al., 2005) are less constrained because they are based on pairwise similarity between sequences (Gotoh, 1982), whose time requirements increase polynomially with the number of taxa. However, their implementation in ‘progressive’ alignment algorithms (Feng and Doolittle, 1987), whereby the order of sequence additions to the alignment is given by a guide tree, generally suffers from the problem that the gaps introduced in the original pairwise alignment step are retained in the later multiple alignment. When this is followed by iterative procedures for refinement of initial alignments or searches for internal consistency between multiple sequence pairs, implemented in T-Coffee (Notredame et al., 2000) and others, this is again at the expense of greater computational effort.

In contrast, homology-extension alignment methods which rely on the recovery of small segments of sequence recognized among terminals from which scores of relatedness can be derived (Morgenstern, 1999) are generally applicable to complex problems of sequence alignment, but they are not widely used in phylogenetic analyses. This approach is also the basis for the widely used BLAST algorithm (Altschul et al., 1990) that establishes segments of locally maximal ungapped sequence alignments which meet a threshold score for length and level of similarity. These High-scoring Segment Pairs (HSPs) act as seeds for initiating searches to find longer segments in both directions and can be displayed as ‘flat query-anchored alignments’ which link all alignable sites in a set of sequences to a specific query ‘anchor’ sequence.

Here we investigate the utility of a BLAST-based alignment strategy, using the BlastAlign script (Belshaw and Katzourakis, 2005). This software prints the ‘query-anchored alignment’ output from BLAST and turns it into an input file for standard phylogenetic software packages. Rather than using a specific query sequence in standard BLAST searches, alignments for phylogenetic data sets can be produced if each sequence is aligned against a universal reference sequence. Fragment-extension of ungapped pairwise aligned sequence segments may contain a larger or smaller proportion of sites of a given full-length sequence, depending on the similarity to the reference sequence. Portions of the sequence lacking similarity to the reference therefore are not included in the analysis. This has the advantage that large indels or highly divergent regions are not retained, providing an objective procedure to remove portions of unrecognizable or ambiguous similarity and therefore improving homology assignments. A difficulty, however, is the selection of the reference sequence, whose choice will have an effect on which bases are retained, calling for a so-called ‘most representative sequence’ (MRS) that best reflects the diversity of sequences in a given data set.
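To illustrate what a ‘flat query-anchored alignment’ means in practice, the following minimal Python sketch places the bases of one sequence at the coordinates of a reference wherever an ungapped segment pair links the two, and leaves all other reference positions empty. It is not the BlastAlign code: the function name `anchor_to_query`, the `(query_start, subject_start, length)` tuple layout and the toy sequences are assumptions made for this example, and real BLAST output would first have to be parsed into such tuples.

```python
from typing import List, Tuple

# Hypothetical HSP record: (query_start, subject_start, length),
# 0-based coordinates, ungapped segment pair.
HSP = Tuple[int, int, int]


def anchor_to_query(query_len: int, subject: str, hsps: List[HSP],
                    gap: str = "-") -> str:
    """Return one alignment row in query (reference) coordinates.

    Subject bases covered by an HSP are copied to the matching query
    positions; every other position is filled with the gap symbol, so
    subject regions without similarity to the reference are dropped.
    """
    row = [gap] * query_len
    for q_start, s_start, length in hsps:
        for offset in range(length):
            row[q_start + offset] = subject[s_start + offset]
    return "".join(row)


# Toy data (assumed): the subject shares ACGT and TAC with the reference.
reference = "ACGTACGTAC"
subject = "TTACGTGGGGTAC"
toy_hsps = [(0, 2, 4), (7, 10, 3)]
aligned_row = anchor_to_query(len(reference), subject, toy_hsps)
print(aligned_row)  # ACGT---TAC
```

Because every row is expressed in the coordinates of the same reference, rows for different sequences can be stacked into a matrix without any further alignment step, which is why the choice of reference matters.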

Coupled with fast parsimony and likelihood searches performed on the resulting BlastAlign alignments, we investigated this method to infer the phylogeny of the huge order Coleoptera, which accounts for one quarter of all described animal species, as a deserving model for large-scale approaches in molecular systematics. Recent work provided numerous sequences for the SSU rRNA gene for several family level groups (Caterino et al., 2005; Caterino et al., 2002; Farrell, 1998; Galian et al., 2002; Gomez-Zurita et al., 2005; Maddison et al., 1999; Ribera et al., 2002; Robertson et al., 2004; Shull et al., 2001). Supplemented with unreleased data for poorly sampled groups, there is now a possibility for a detailed phylogenetic analysis of the Coleoptera.

The current classification of Coleoptera recognizes four suborders, including the species-poor Myxophaga and Archostemata, the largely predatory Adephaga, and the extremely diverse Polyphaga (90% of all species of Coleoptera, >150 families). Crowson (1970) grouped the latter in 18 superfamilies, most of which were assigned to the three series (Staphyliniformia, Elateriformia, Cucujiformia) while five of the superfamilies (Scarabaeoidea, Dascilloidea, Eucinetoidea, Bostrichoidea, Dermestoidea) remained difficult to place. In most recent classification schemes the former is regarded as a separate series Scarabaeiformia and the latter two superfamilies grouped as Bostrychiformia, whereas Dascilloidea and Eucinetoidea have been placed within Elateriformia (Lawrence and Newton, 1995). No detailed phylogenetic analysis across the beetles has been conducted that would test the monophyly of these deep-level groups using molecular data. We compiled an SSU database of Coleoptera and applied the BLAST-based alignment protocols to assess how this procedure should best be applied to capture a tree implied in this classification, while in turn the analysis also provides a test of the validity of this classification. The entire methodology was placed within a bioinformatics pipeline to allow the regular automatic creation of this tree of Coleoptera, for the inclusion of new SSU sequences as they become available.

2. Materials and methods

2.1. Database generation and taxon sampling

All existing sequences of Coleoptera (131,043 sequences for 4125 species; August 2006) on GenBank were placed into a flatfile database. To this were added unreleased sequences, mostly for various Cucujiformia but including several sequences from other groups, for a total of 262 full-length SSU rRNA sequences released only recently (Hunt et al., submitted; Supplementary Table 1). To extract all SSU sequences from this database, a set of SSU sequences was identified based on the gene annotation, and aligned using ClustalW. The alignment was then used to generate an MRS using BlastAlign (see below) which was the query for a secondary BLAST search against the local Coleoptera database, extracting all sequences above a threshold of 7e-36 without relying on the annotation itself for data compilation, as annotations may differ between submissions. This cut-off value discriminated efficiently against potential paralogs or contaminants. Sequences were filtered for a length range of 1650–2300 bp. Where multiple entries were available for a species, we selected the longest (most complete) sequence, but retained only one sequence per named species. The length range was set to provide an optimal set of (nearly) full-length SSU rRNA sequences. To reduce missing data columns, 5′ and 3′ ends were cut to a fragment corresponding to positions 42–1938 of the D. melanogaster sequence (Tautz et al., 1988). The final data set consisted of 1161 taxa containing 922, 224, 7 and 1 sequences for the suborder Polyphaga, Adephaga, Myxophaga and Archostemata, respectively. Outgroups were two taxa each from the neuropteroid orders Raphidioptera, Neuroptera and Megaloptera (the presumed sister lineage of Coleoptera), Hymenoptera, and the more distantly related paraneopteran orders Phthiraptera, Hemiptera, Thysanoptera and Psocoptera.
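The selection rules in this section (E-value cut-off, length window, one sequence per species) can be sketched in a few lines of Python. This is only an illustration of the logic, not the PERL pipeline used in the study. The `Hit` record, its field names and the example accessions are hypothetical, and "above a threshold of 7e-36" is read here as an E-value no larger than 7e-36, which is an assumption about the intended direction of the cut-off.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Hit:
    species: str      # named species the sequence belongs to
    accession: str    # hypothetical identifier, for illustration only
    evalue: float     # BLAST E-value against the MRS query
    length: int       # sequence length in bp


def select_ssu(hits: Iterable[Hit], max_evalue: float = 7e-36,
               min_len: int = 1650, max_len: int = 2300) -> List[Hit]:
    """Keep the longest acceptable sequence for each named species."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # too weak a match: possible paralog or contaminant
        if not (min_len <= hit.length <= max_len):
            continue  # outside the (nearly) full-length window
        current = best.get(hit.species)
        if current is None or hit.length > current.length:
            best[hit.species] = hit
    return sorted(best.values(), key=lambda h: h.species)


example_hits = [
    Hit("sp. A", "x1", 1e-50, 1800),
    Hit("sp. A", "x2", 1e-60, 1900),   # longer entry for the same species
    Hit("sp. B", "x3", 1e-10, 1900),   # fails the E-value cut-off
    Hit("sp. C", "x4", 1e-80, 1200),   # fails the length window
]
for kept in select_ssu(example_hits):
    print(kept.species, kept.accession, kept.length)
```

Trimming to the fragment corresponding to positions 42–1938 of the D. melanogaster sequence is not covered by this sketch, since it depends on an alignment to that reference.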

2.2. Sequence alignment using BlastAlign

BlastAlign (Belshaw and Katzourakis, 2005) makes use of the BLAST algorithm to produce a pairwise alignment between a query sequence and each of the database hits based on HSPs. The result of a BLAST search can be displayed as a pseudo-multiple alignment produced by parsing the HSPs using the query as a template, usually visualized as colored lines in the familiar BLAST output. When used for phylogenetics, the methodology is problematic because sequences are aligned to the progressively more distant query, and it therefore does not include any tests of alignments of sequences with each other. BlastAlign addresses this problem by producing an anchor sequence (the MRS) that is presumably more representative of the entire data set. The MRS calculation is based on an all-against-all measure of BLAST similarity. BlastAlign scores these pairwise similarities for intervals of 60 bp each which are either considered a match or no match, and selects the sequence with the highest cumulative number of matches as the MRS. The latter is then used as a query for pairwise similarity searches with each sequence in the data set and base homologies of each of these to the MRS are parsed for the production of the final alignment (Fig. 1).

When datasets increase in size, these calculations of an MRS may become too complex for an exhaustive analysis. BlastAlign therefore uses a randomly selected subset of sequences (less than half of the sequences in the case of this dataset) for identifying the MRS to limit the output file from the all-against-all assessment to 1 GB. The MRS selected may therefore change due to the random subset chosen. To reduce any bias at this stage of the process, we created 100 BlastAlign alignments (using the default settings) retaining the MRS from each. A ClustalW alignment was then produced from these sequences, and the consensus sequence used as the MRS in all subsequent BlastAlign procedures (i.e., in the multiple-alignment step of BlastAlign each of the sequences in the data set is aligned to this consensus sequence). The final BLAST-based alignment was then used directly for parsimony and likelihood tree searches.

The tree from a parsimony search on the BLAST-based alignment can be used as guide tree for a Clustal alignment on the full-length sequences, with the aim of avoiding the loss of data from the matrix inherent in the BlastAlign approach while circumventing some of the problems with the use of the initial distance matrix in a ‘global’ (across the entire molecule) pairwise sequence similarity. However, Clustal remains sensitive to the input order of the sequences which can affect the overall alignment. Therefore 20 different alignments were produced from randomized input orders, which substantially differed in the length of the resulting trees. Among these, the final alignment was chosen based upon their consistency with the current taxonomy which was used as arbiter to decide on the preferred alignments (see below).

Fig. 1. The alignment procedure employed in this study using BlastAlign. Starting with a database of unaligned sequences (A), an all-versus-all blastn search is conducted to establish segments of nucleotide identity (B). These are scored for each sequence and plotted as 60 bp segments, some of which are without entry where no nucleotide identity is found, shown as empty fields (C). The sequence with the highest number of scores (positive 60 bp fragments) among all sequences is used as the MRS; in the standard analysis the flat query-anchored multiple alignment for this sequence is retrieved from the original blastn search (D). Here this procedure was repeated 100 times on a subset of sequences randomly chosen from the database (E). The resulting MRS sequences are aligned to each other using Clustal (F) and a consensus sequence from this alignment is used in a secondary blastn search against all sequences in the database (G), from which a final query-anchored multiple alignment is built (H).
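The MRS selection and consensus steps of Section 2.2 can be sketched as follows. This is a simplified Python illustration, not the BlastAlign script. Several points are assumptions made for the example: a 60 bp window counts as a match only when it lies entirely inside one HSP, ties are broken alphabetically, and the consensus is a simple per-column majority vote over non-gap characters (the source does not state the consensus rule used). The names `window_matches`, `most_representative` and `majority_consensus` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in query coordinates


def window_matches(length: int, hsps: List[Interval], window: int = 60) -> int:
    """Count consecutive windows of the query lying fully inside one HSP."""
    count = 0
    for start in range(0, length - window + 1, window):
        end = start + window
        if any(s <= start and end <= e for s, e in hsps):
            count += 1
    return count


def most_representative(lengths: Dict[str, int],
                        hits: Dict[Tuple[str, str], List[Interval]],
                        window: int = 60) -> str:
    """Pick the sequence with the highest cumulative number of matched windows.

    `hits` maps (query, subject) pairs from an all-against-all search to
    the HSP intervals found on the query.
    """
    scores = {name: 0 for name in lengths}
    for (query, subject), hsps in hits.items():
        if query != subject:
            scores[query] += window_matches(lengths[query], hsps, window)
    # sorted() makes tie-breaking deterministic (alphabetical)
    return max(sorted(scores), key=lambda name: scores[name])


def majority_consensus(rows: List[str], gap: str = "-") -> str:
    """Majority-rule consensus of aligned rows; all-gap columns are dropped."""
    consensus = []
    for column in zip(*rows):
        counts = Counter(ch for ch in column if ch != gap)
        if counts:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


toy_lengths = {"a": 180, "b": 180, "c": 120}
toy_hits = {
    ("a", "b"): [(0, 180)], ("a", "c"): [(0, 120)],
    ("b", "a"): [(0, 180)], ("b", "c"): [(30, 100)],
    ("c", "a"): [(0, 120)], ("c", "b"): [(30, 100)],
}
print(most_representative(toy_lengths, toy_hits))    # a
print(majority_consensus(["ACG-", "ACGT", "A-GT"]))  # ACGT
```

Repeating the selection on random subsets, as done 100 times in this study, would amount to calling `most_representative` on each subset and passing the aligned winners to the consensus step.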

This alignment protocol is time-consuming but necessary in order to produce accurate and defensible homology statements. Whilst each of the initial 100 BLAST alignments with the 1161 taxa dataset took an average of 42 min to complete, subsequent alignments with the pre-selected most representative sequence took under 5 min. Similarly a default ClustalW (version 1.83) alignment took around 41 h, and this dropped with a predefined input tree to an average of 30 min. (All analyses were benchmarked on a 2.60 GHz Pentium 4 PC with 768 MB of RAM under default parameters.) Because of the necessity to run multiple BLAST and Clustal alignments for this procedure, simple automation scripts were written in PERL to run the procedures without further intervention.

2.3. Tree searches and support

Parsimony analyses were conducted using TNT (Goloboff et al., 2004) under the ‘New Technology’ search settings and run on a desktop PC (as above). Driven searches were used with xmult (the level of ‘aggressiveness’ of the search strategy; Goloboff and Pol, 2007; Hovenkamp, 2004) set to 100. Under this method sectorial searches, tree ratcheting, tree drifting and tree fusing are used in combination as determined by the driver program. The search time was limited to a maximum of 24 h, but in selected cases extended to 168 h (1 week).

Long branches were detected and removed following a method modified from Korte et al. (2004). A table of branch lengths was obtained from PAUP* from which the mean and standard deviation of all branch lengths within the tree were produced and the normal distribution of branch lengths calculated. Any taxon attached to a branch longer than those within a 95% interval of the distribution was removed, including terminal taxa and entire clades downstream from long internal branches. A second approach to the problem of divergent sequences was the removal of hypervariable expansion segments from the alignment, following common practice in molecular systematics of insects (Farrell, 1998; Whiting et al., 1997). Finally, likelihood analyses under the GTR + I + Γ model were performed on the aligned data matrices using PHYML (Guindon and Gascuel, 2003). This is the most complex model available and was chosen because of the great complexity and large size of the data set which permits precise estimation of model parameters.
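The long-branch criterion can be illustrated with a short Python sketch. The text does not give the exact cut-off, so the sketch assumes a normal approximation with an upper limit of mean + 1.96 standard deviations (the two-sided 95% bound); the function names and the toy branch lengths are hypothetical. Only the flagging of individual branches is shown, not the removal of whole clades below long internal branches.

```python
from statistics import mean, stdev
from typing import Dict, List


def long_branch_cutoff(lengths: List[float], z: float = 1.96) -> float:
    """Upper limit of the assumed 95% interval of a normal distribution."""
    return mean(lengths) + z * stdev(lengths)


def flag_long_branches(branches: Dict[str, float], z: float = 1.96) -> List[str]:
    """Return the names of branches longer than the cut-off."""
    cutoff = long_branch_cutoff(list(branches.values()), z)
    return sorted(name for name, length in branches.items() if length > cutoff)


# Toy table of branch lengths (assumed values): 20 short branches, one long.
toy_branches = {f"t{i}": 0.01 for i in range(20)}
toy_branches["long_taxon"] = 0.5
print(flag_long_branches(toy_branches))  # ['long_taxon']
```

In a real analysis the dictionary would be filled from the branch-length table exported by the tree-building program, and the taxa below each flagged branch would then be removed before realignment.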

Nodal support levels were assessed using fast parsimony jackknife searches in TNT with 100 replicates, xmult set to 10 and p (the removal probability) set to 36%. These settings allowed for a rapid resampling search, which was designed to find support levels only for the most strongly supported nodes.
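The resampling step behind a jackknife replicate can be sketched as below. This Python fragment only illustrates the idea of deleting each character independently with probability p = 0.36; it does not reproduce TNT's implementation, and the function names are hypothetical.

```python
import random
from typing import Dict, List, Optional


def jackknife_columns(n_chars: int, p_remove: float = 0.36,
                      rng: Optional[random.Random] = None) -> List[int]:
    """Indices of the characters kept in one jackknife replicate."""
    rng = rng or random.Random()
    return [i for i in range(n_chars) if rng.random() >= p_remove]


def resample(matrix: Dict[str, str], kept: List[int]) -> Dict[str, str]:
    """Reduce every aligned sequence to the kept columns."""
    return {name: "".join(seq[i] for i in kept) for name, seq in matrix.items()}


# One replicate for an alignment of 2220 positions (seed fixed for repeatability).
kept_columns = jackknife_columns(2220, rng=random.Random(1))
print(len(kept_columns) / 2220)  # close to 0.64
```

A tree search is run on each of the 100 resampled matrices, and the support for a node is the proportion of replicates in which that node is recovered.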

2.4. Assessing taxonomic congruence

Trees were assessed based on the criterion of congruence, assuming that the preferred alignment is the one that provides highest levels of homology (synapomorphy) (Brower and Schawaroch, 1996; De Pinna, 1991). Congruence can be assessed with reference to a known tree or known critical nodes (Ribera et al., 2002; Wheeler, 1995). We here used the current taxonomy defining six hierarchical levels within the Coleoptera as a proxy of well-established nodes, and devised an explicit test for the fit of the tree with the taxonomic classification. For simplicity we used a system currently maintained at NCBI’s taxonomy database which associates species names to a string of nested taxonomic groupings that is established in collaboration with the submitters. This string was converted into a single code name using a custom PERL script, based on the starting letter of each name within the hierarchy and combined together so that: Coleoptera; Polyphaga; Cucujiformia; Phytophaga; Scolytidae; Sinophloeus; porteri becomes: CPCPhCuScSipor. This naming scheme facilitates the recognition of species names and their taxonomic affinities when reading large trees. Ambiguities in codes were resolved by choosing a subsequent letter of the name of one taxon whose code would be identical to another. The naming scheme also provided a convenient input in the tree annotation and graphics facility Treedyn (Chevenet et al., 2006), as the name codes representing a particular grouping can be easily color coded with this software for visual inspection of trees.
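A possible way of deriving such code names is sketched below in Python. It is not the PERL script used in the study and is not expected to reproduce the published codes exactly, because the precise abbreviation rule is not given in the text. The sketch assumes that each name is shortened to its first letter and lengthened one letter at a time only when a sibling taxon (same parent) already uses that abbreviation, so uniqueness is guaranteed among siblings only. The function name `build_codes` and the example lineages are illustrative.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def build_codes(lineages: List[List[str]]) -> Dict[Path, str]:
    """Map every lineage prefix to a code built from per-name abbreviations."""
    codes: Dict[Path, str] = {(): ""}
    used: Dict[Path, Dict[str, str]] = {}  # parent path -> {abbreviation: name}
    for lineage in lineages:
        for depth in range(1, len(lineage) + 1):
            path = tuple(lineage[:depth])
            if path in codes:
                continue  # this taxon already has a code
            parent, name = path[:-1], path[-1]
            taken = used.setdefault(parent, {})
            size = 1
            # lengthen the abbreviation while a sibling already uses it
            while size < len(name) and name[:size] in taken:
                size += 1
            taken[name[:size]] = name
            codes[path] = codes[parent] + name[:size]
    return codes


toy_lineages = [
    ["Coleoptera", "Polyphaga", "Cucujiformia", "Curculionoidea",
     "Scolytidae", "Sinophloeus", "porteri"],
    ["Coleoptera", "Polyphaga", "Cucujiformia", "Chrysomeloidea",
     "Chrysomelidae", "Donacia", "vulgaris"],
]
toy_codes = build_codes(toy_lineages)
for lineage in toy_lineages:
    print(toy_codes[tuple(lineage)])
```

Because the code of a taxon is always a prefix of the codes of its members, all terminals of a group can be selected in a tree viewer by matching on that prefix.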

These recoded names were also used to devise a simple procedure for testing the fit of the trees to the existing classification. The taxonomy codes described above were converted into a binary matrix, whereby each unique taxonomic level contained in the data set was represented by a set of binary ‘characters’, one each corresponding to each name at the given taxonomic level. This resulted in 4, 5, 137, 142, 136 and 969 characters at the suborder, infraorder, family, subfamily, tribe and genus level, respectively, each of them with two states for presence (if a species is a member of the taxon) or absence (not a member). The number of informative characters (i.e., a minimum of two terminals in a particular group) was 3, 5, 82, 86, 63 and 131 at these taxonomic levels. Characters were then assessed for their consistency with the trees, and an ensemble consistency index (CI) of the tree with the resulting matrix of binary characters for each of the six taxonomic levels was calculated, as well as for the overall tree itself. The latter value was calculated for the informative characters only and will be referred to as ‘taxonomic CI’. It is used here to detect discrepancies between the classification and the tree, to identify groupings requiring further investigation. As the interpretation of the CI value at a particular taxonomic level is dependent on the proportion of taxa that share a particular name (character), i.e., the maximum homoplasy for that character, we also calculated the retention index (RI) which normalizes the CI for the maximum number of changes on a tree (the proportion of homoplasy in a data set in relation to the total possible homoplasy). This ‘taxonomic RI’ can be thought of as the percentage of taxa assigned to a given group in the existing classification that are recovered as monophyletic.
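As a worked illustration of the taxonomic CI and RI, the Python sketch below codes group membership as binary characters, counts the minimum number of changes of each character on a rooted binary tree with Fitch parsimony, and combines the counts using the standard ensemble formulas CI = Σm / Σs and RI = (Σg − Σs) / (Σg − Σm), where m is the minimum possible number of steps (1 for a binary character), s the observed number of steps and g the maximum possible number of steps (the size of the smaller state class). It is a sketch under stated assumptions, not the scripts used in the study: the tree is a nested tuple, a character is treated as informative only if both states occur in at least two terminals, and all names and the toy tree are hypothetical. The recursion is suitable for small trees only.

```python
from typing import Dict, List, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]  # leaf name or (left, right)


def fitch_steps(tree: Tree, states: Dict[str, int]) -> int:
    """Minimum number of state changes of one character (Fitch parsimony)."""
    def visit(node: Tree) -> Tuple[Set[int], int]:
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_steps = visit(node[0])
        right_set, right_steps = visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_steps + right_steps
        return left_set | right_set, left_steps + right_steps + 1
    return visit(tree)[1]


def taxonomic_ci_ri(tree: Tree, groups: Dict[str, Set[str]],
                    taxa: List[str]) -> Tuple[float, float]:
    """Ensemble CI and RI of membership characters on the tree."""
    m_sum = s_sum = g_sum = 0
    for members in groups.values():
        n_in = len(members & set(taxa))
        n_out = len(taxa) - n_in
        if n_in < 2 or n_out < 2:
            continue  # uninformative character (assumed definition)
        states = {taxon: int(taxon in members) for taxon in taxa}
        s_sum += fitch_steps(tree, states)
        m_sum += 1                  # a binary character needs at least 1 step
        g_sum += min(n_in, n_out)   # worst case on an unresolved tree
    if m_sum == 0:
        raise ValueError("no informative characters")
    return m_sum / s_sum, (g_sum - s_sum) / (g_sum - m_sum)


# Toy example: group A is split by the misplaced terminal a3.
toy_tree = ((("a1", "a2"), ("b1", "a3")), ("b2", "b3"))
toy_groups = {"A": {"a1", "a2", "a3"}, "B": {"b1", "b2", "b3"}}
toy_taxa = ["a1", "a2", "a3", "b1", "b2", "b3"]
print(taxonomic_ci_ri(toy_tree, toy_groups, toy_taxa))  # (0.5, 0.5)
```

In this toy case each of the two membership characters needs two steps instead of one, so both indices equal 0.5; a tree on which every named group is monophyletic gives CI = RI = 1.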

2.5. Software used in this study

A collection of PERL scripts generated for this study, including a detailed manual for their use, is available at the Holometabola Insect Phylogenetics (HIP-db) website (http://hip-db.myspecies.info).

3. Results

3.1. Selection of most representative sequence

The full data set contained sequences for 1161 taxa from 126 of 168 recognized families (Lawrence and Newton, 1995) and all 18 of Crowson’s (1970) superfamilies of the species-rich Polyphaga (Table 1). These sequences were subjected to an initial analysis using BlastAlign to select the MRS in each of 100 alignment runs from a random set of 541 sequences (limited by the maximum size of the BlastAlign output) each. These searches resulted in the selection of 50 different MRSs. All of these were from the most highly represented suborder Polyphaga, including 32 MRSs from Cucujiformia (16, 8, 4 and 4, respectively, from Chrysomeloidea, Curculionoidea, Cucujoidea and Tenebrionoidea), 12 from Elateriformia, 5 from Scarabaeiformia, and 1 Staphyliniformia. These sequences were then aligned using ClustalW to produce a consensus sequence of 1842 bp in length. This sequence now constitutes a scaffold containing the most conserved elements of the data set against which all others were aligned in the subsequent BlastAlign alignments.

Table 1
Number of species and families represented in the data sets for each superfamily

Taxon                     Code in tree  Species (1161)  Species (932)  Families (1161/932)  Total families (Genbank)  Total species
Suborders
Archostemata              CArh          1               1              1/1                  4                         40
Myxophaga                 CM            7               6              3/2                  4                         58
Geadephaga                              134             72             3/2                  3                         32100
Hydradephaga                            90              82             7/7                  7                         5500
Hydrophiloidea            CPSHy         13              12             4/3                  6                         2800
Staphylinoidea            CPSSt         54              30             7/3                  7                         47000
Histeroidea               CPSHi         35              28             3/2                  3                         4000
Scarabaeoidea             CPScrSc       49              46             11/10                13                        33000
Bostrichoidea             CPBBo         13              9              4/4                  6                         4400
Derodontoidea             CPBDe         1               1              1/1                  1                         17
Eucinetoidea              CPESc         6               3              4/2                  4                         1200
Buprestoidea              CPEBu         11              11             1/1                  1                         15000
Byrrhoidea                CPEBy         28              25             10/9                 12                        4400
Dascilloidea              CPEDa         2               0              2/0                  2                         150
Elateroidea/Cantharoidea  CPEElEl       61              23             12/2                 16                        20000
Lymexyloidea              CPCLy         4               1              1/1                  1                         50
Cleroidea                 CPCCl         19              15             4/4                  6                         10000
Cucujoidea                CPCCu         94              82             19/16                34                        18000
Tenebrionoidea            CPCTe         44              35             18/14                27                        39000
Chrysomeloidea            CPCCh         272             262            4/4                  4                         53500
Curculionoidea            CPCCuc        213             180            7/6                  7                         59800
Outgroups                               8               8              n/a                  n/a                       n/a
Total:                                  1161            932            126/94               168                       350015

Species and family counts in the data set are given separately for the full 1161-taxon and the reduced 932-taxon data set. Counts of family follows numbers on Genbank, with the representation of the total given separately for the full and reduced data set. Also given is an estimate of the total number of species in each taxon.

BlastAlign was first applied using this MRS on a data set that only included the more distant paraneopteran outgroups, to avoid the problem of non-monophyletic Coleoptera reported in the literature (Caterino et al., 2002; Whiting et al., 1997). The alignment obtained under the default parameters comprised 2220 positions. BlastAlign may remove entire taxa from the matrix if they are too divergent, but this was not the case here under the parameters used. On average, 6.8% of nucleotides were removed from each sequence (Table 2), but adephagan sequences, being less prevalent in the data set and therefore receiving less weight in calculating the MRS, lost up to 30% of their nucleotides (average 16.6%), whilst many polyphagan sequences retained almost their full length (average loss 4.3%). Adephagan sequences were also consistently longer than Polyphaga (average 1993.2 vs. 1831.3 bp) and more divergent, and therefore more prone to data removal in fragment-extension methods due to lack of similarity. In fact, the procedure reduced the alignment to a smaller size in the long adephagan sequences than in the shorter polyphagan data set (1695 vs. 1776 bp), and this effect was exacerbated when the long-branch taxa were present (1658 vs. 1752 bp; Table 2), indicating that retention of potential homologs was better when starting from sequences which are more closely similar.

Table 2
Average number of base pairs per taxon in SSU sequences, before and after removal of hypervariable regions and BlastAlign treatment

              Full-length       Conserved regions               BlastAlign
              1161     932      1161            932             1161            932
Coleoptera    1863.6   1850.3   1704.3 (8.4%)   1733.4 (6.2%)   1731.7 (6.8%)   1760.2 (4.7%)
Archostemata  1963     1963     1682 (14.3%)    1708 (13.0%)    1701 (13.3%)    1701 (13.3%)
Myxophaga     2035.9   1999.5   1715.3 (15.6%)  1740.2 (13.0%)  1669.6 (17.8%)  1687.2 (15.6%)
Adephaga      1993.2   1958.4   1719.9 (13.5%)  1746.2 (10.7%)  1658 (16.6%)    1695.9 (13.3%)
Polyphaga     1831.3   1827.6   1700.6 (7.1%)   1731.1 (5.3%)   1752 (4.3%)     1776.6 (2.8%)

Numbers are given for the full 1161-taxon and reduced 932-taxon data set, separately for each suborder, and the percentage reduction compared to the full-length sequence in parentheses.

3.2. Tree searches and congruence with taxonomic classification

The aligned matrix was used directly for a parsimony tree search (Tree A1), and compared to a tree obtained from an alignment produced under default parameters in ClustalW (Tree A0) (Fig. 2). The resulting trees were 15704 vs. 39065 steps in length (CI = 0.158 and CI = 0.117) (Trees A1 and A0 in Table 3; Fig. 3, Supplementary Fig. S1), corresponding to data matrices of 2220 and 2580 aligned nucleotide positions (Table 3). We also used the topology obtained with BlastAlign as a guide tree for alignment of the full data set using Clustal, resulting in trees of 46806 steps (Tree A2; CI = 0.096).

Trees were assessed against the classification of the Coleoptera, as one possible way of a global comparison of tree quality. The trees based on the default ClustalW alignment performed worst (taxonomic CI = 0.412, RI = 0.857), whilst using the guide tree from the BlastAlign produced trees more consistent with the classification (taxonomic CI = 0.435, RI = 0.870) (Table 3). However, the original BlastAlign tree (Tree A1) had the highest overall taxonomic CI = 0.456 (RI = 0.881). Likelihood trees derived from the same alignments were topologically similar to those obtained with parsimony, and taxonomic CIs were very similar to those in the parsimony trees A1 and A2 (Table 3).

Fig. 2. Flow chart of analyses conducted resulting in the trees used for analysis of taxonomic congruence.

To assess the impact of long branches, all taxa that were terminal to branches outside a 95% confidence interval of all branch lengths on Tree A1 were removed from the dataset (Section 2), retaining 932 taxa and representatives of 102 families (Table 1), after which another cycle of alignment and parsimony tree search was conducted on the remaining taxa (Fig. 1). These sequences were again aligned with BlastAlign using the previously created consensus MRS and searched for the shortest tree (Tree


Table 3
Tree statistics for trees obtained in this study, and taxonomic consistency index at various levels of the classification

Tree name                 A0         A1         ML A1      A2         ML A2      A3         B1         ML B1      B2         ML B2      B3
Dataset (taxa)            1161       1161       1161       1161       1161       1161       932        932        932        932        932
Clustal (default)         ✓
Clustal (BA guide)                                         ✓          ✓                                           ✓          ✓
BlastAlign                           ✓          ✓                                           ✓          ✓
Variable regions deleted                                                         ✓                                                      ✓
Alignment length          2580       2220       2220       2510       2510       1904       2160       2160       2395       2395       1875
Tree length/ML score      39065      15704      -91543.8   46806      -197067.0  15690      10277      -63487.9   26039      -124368.1  9804
CI                        0.117      0.158      -          0.096      -          0.174      0.191      -          0.139      -          0.226
RI                        0.705      0.691      -          0.572      -          0.699      0.726      -          0.619      -          0.74

Taxonomic consistency
All levels                0.412      0.456      0.447      0.435      0.432      0.438      0.484      0.467      0.477      0.451      0.46
Genus                     0.557      0.557      0.555      0.554      0.571      0.538      0.571      0.554      0.580      0.554      0.543
Tribe                     0.48       0.452      0.449      0.442      0.469      0.442      0.458      0.446      0.478      0.475      0.439
Family                    0.366      0.451      0.447      0.433      0.416      0.440      0.475      0.475      0.468      0.453      0.479
Superfamily               0.2        0.258      0.235      0.21       0.208      0.254      0.364      0.333      0.291      0.320      0.308
Infraorder                0.122      0.25       0.238      0.167      0.143      0.217      0.455      0.294      0.33       0.227      0.333
Suborder                  0.2        0.75       0.75       0.429      0.125      0.750      0.750      1          1          0.214      0.75
Coleoptera                1          1          1          1          1          1          1          1          1          1          1

Fig. 3. Tree representation of Tree A1. Major taxonomic groups are shown in different colors. See Supplementary Fig. 1 for a detailed topology.


B1; 10277 steps, CI = 0.191) (Supplementary Fig. S2), or using the BlastAlign tree as guide in a ClustalW alignment for tree searches (Tree B2; 26039 steps, CI = 0.139) (Table 3). Generally, the trees obtained after removal of long-branch taxa were greatly improved in their taxonomic fit, with a taxonomic CI = 0.484 for the BlastAlign-based tree and CI = 0.477 for the guided Clustal alignment. In the case of the reduced data set the ML analyses displayed a lower taxonomic CI compared to the parsimony analysis (Table 3). Finally, we removed the hypervariable regions from the ClustalW alignment based on the sharp boundaries between variable and conserved regions in the aligned matrix. This resulted in a tree of 15,690 steps with all taxa retained, compared to 9804 steps for the reduced data set. In both cases, this did not improve the match of the tree to the classification (taxonomic CI = 0.438 and CI = 0.460 for the full and reduced data set, respectively; Table 3).


Table 5
Tree statistics for trees obtained using closely related outgroups

Dataset (taxa)            1161       1161       1161       1161
Clustal (default)         ✓
Clustal (BA guide)                              ✓
BlastAlign                           ✓
Variable regions deleted                                   ✓
Alignment length          2544       2256       2494       1670
Tree length               38228      16123      45282      11049
CI                        0.111      0.154      0.092      0.183
RI                        0.703      0.695      0.579      0.698

Taxonomic consistency
All levels                0.41       0.462      0.438      0.394
Genus                     0.559      0.544      0.56       0.52
Tribe                     0.466      0.449      0.462      0.449
Family                    0.357      0.46       0.446      0.389
Superfamily               0.208      0.286      0.216      0.165
Infraorder                0.135      0.278      0.161      0.089
Suborder                  0.188      1          0.333      0.75
Coleoptera                0.5        1          0.5        1


Taxonomic congruence was also assessed at various hierarchical levels and for various subgroups, to test for the fit of the classification with the tree throughout the data set. Taxonomic CIs were high near the tips (genus and family level), and declined nearer the base, in particular indicating polyphyly of the superfamilies and series (infraorders) in Polyphaga (not shown). However, these direct comparisons are problematic due to the greater number of taxa exhibiting a given character state in these deep-level groups, compared to the genus level and higher levels. When the corresponding RI values were calculated, they were generally very high (Table 4), showing that the large groups conformed very well with the tree and that the comparatively low CI for deep-level groups was affected by the distant placement of a few ‘outliers’ or a very limited number of splits of major groups. The improvement of taxonomic consistency in the reduced data set was evident at all hierarchical levels and in all major groups, whether calculated as the CI or RI. There were clear differences in taxonomic CI/RI between major subdivisions of Coleoptera (Table 4). The family taxonomic CI ranked highest in Bostrychiformia, followed by Staphyliniformia, and then by the similarly ranked Adephaga, but with the Elateriformia, Scarabaeiformia and Cucujiformia generally lower. As in the all-taxa analysis, the genus level and other tip level groups showed highest fit, and this declined for the deeper nodes. Taxonomic RI (Table 4) was found to be high at all levels, again indicating that low taxonomic CIs were due to a few distantly placed outliers. Notably, although the series had poor fit with the tree, the Cucujiformia were recovered as monophyletic in most analyses, demonstrating the good support for this group that includes roughly half of all species of beetles, but also showing the difficulties of establishing relationships at the base of the four remaining suborders.
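The contrast between a low taxonomic CI and a high taxonomic RI can be illustrated with a minimal sketch. It assumes, for illustration only, that membership in the groups of one rank is coded as a single unordered multistate character and optimized on a rooted binary tree by Fitch parsimony; the function names, the nested-tuple tree format and the toy taxa are hypothetical and are not taken from the coding procedure used in this study. With m the minimum possible number of steps, s the observed number of steps and g the maximum possible number of steps, CI = m/s and RI = (g - s)/(g - m).

```python
from collections import Counter
from functools import reduce


def fitch_steps(tree, states):
    """Count Fitch parsimony steps for one character on a rooted binary tree.

    `tree` is a nested 2-tuple of terminal names; `states` maps each
    terminal name to its group label (hypothetical input format).
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):          # terminal
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1                         # no shared state: one change
        return left | right

    visit(tree)
    return steps


def taxonomic_ci_ri(tree, states):
    """Return (CI, RI) for a single taxonomic character."""
    counts = Counter(states.values())
    m = len(counts) - 1                                  # minimum steps
    g = sum(counts.values()) - max(counts.values())      # maximum steps
    s = fitch_steps(tree, states)                        # observed steps
    ci = m / s if s else 1.0
    ri = (g - s) / (g - m) if g > m else 1.0
    return ci, ri


def ladder(names):
    """Build a pectinate (ladder-like) binary tree from a list of names."""
    return reduce(lambda acc, name: (acc, name), names)


# Two groups of ten terminals; one member of group "a" is nested inside "b".
a = [f"a{i}" for i in range(10)]
b = [f"b{i}" for i in range(10)]
tree = (ladder(a[:9]), ladder(["b0", "b1", "a9"] + b[2:]))
states = {name: name[0] for name in a + b}
ci, ri = taxonomic_ci_ri(tree, states)
print(f"CI = {ci:.3f}, RI = {ri:.3f}")
```

In this toy case a single misplaced terminal halves the CI (0.5), whereas the RI stays close to 1 (8/9), which is the pattern described above for the deep-level groups.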

Finally, BlastAlign was applied to a data set that included the Neuropteroidea plus Hymenoptera as more closely related outgroups. The parsimony tree obtained from this alignment had 16,123 steps, compared to trees of 38,228 and 45,282 steps under the default Clustal and the BlastAlign-guided Clustal procedure, respectively (Table 5). Applying the same procedures to this data set, the searches produced generally very similar trees to those obtained with paraneopteran outgroups (Table 5). However, only the BlastAlign alignment and the guided Clustal alignment after removal of variable regions recovered the Coleoptera as monophyletic, while the other treatments resulted in a polyphyletic ingroup, mirroring the results of earlier studies based on SSU (Caterino et al., 2002; Whiting et al., 1997). Because the taxon sampling was not designed to investigate deeper holometabolan relationships, this obvious artefact was avoided by using more divergent outgroups for all tests of the performance of different alignment procedures. However, this analysis again supports the power of the BlastAlign approach, as it was not affected by the spurious polyphyly of Coleoptera.

Table 4
Taxonomic RI for major groups at various levels of the classificatory hierarchy and major subgroups of Coleoptera

Group        All groups        Adephaga          Staphyliniformia  Scarabaeiformia   Elateriformia     Bostrychiformia   Cucujiformia
Tree         A1       B1       A1       B1       A1       B1       A1       B1       A1       B1       A1       B1       A1       B1
Genus        0.545    0.534    0.520    0.596    0.400    0.400    0.600    0.667    0.976    0.875    n/a      n/a      0.461    0.494
Tribe        0.646    0.638    0.702    0.711    0.714    0.600    n/a      n/a      n/a      n/a      n/a      n/a      0.593    0.583
Family       0.913    0.922    0.977    0.979    0.944    0.968    0.816    0.806    0.833    0.816    0.889    1        0.905    0.92
Superfamily  0.949    0.963    n/a      n/a      0.97     0.985    0.938    0.978    0.942    0.949    0.833    0.875    0.95     0.961
Infraorder   0.981    0.989    n/a      n/a      0.951    0.971    0.938    0.978    0.963    0.984    0.769    0.778    1        1
Suborder     0.998    0.997    n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a

The results are listed separately for tree A1 (1161 taxa) and tree B1 (932 taxa).

3.3. Tree topology and nodal support

The assessment of trees was based primarily on the congruence with the traditional classification, but the trees also provided a scaffold of relationships among monophyletic groups and the constitution of some critical clades (Fig. 3, Supplementary Fig. 1). For example, the analysis established relationships that match the ((Archostemata+Myxophaga) (Polyphaga+Adephaga)) topology of the unrooted trees from previous SSU-based studies (Caterino et al., 2002). The basic split of Adephaga in terrestrial Geadephaga and aquatic Hydradephaga was also recovered, in agreement with Shull et al. (2001), and family-level relationships largely coincided with studies of Geadephaga (Maddison et al., 1999) and Hydradephaga (Ribera et al., 2002). Within Polyphaga, the relationships among the five series were complex, with only the Cucujiformia monophyletic. The monophyly of each of Crowson’s (1970) 18 superfamilies and relationships among them could also be assessed. The Eucinetoidea and Derodontoidea, of uncertain position in Crowson (1970), were placed at the base of Polyphaga. The four superfamilies grouped as Elateriformia (Elateroidea, Cantharoidea, Byrrhoidea and Buprestoidea) were closely related, but only the latter was monophyletic. The Dascilloidea, a group moved between Scarabaeiformia and Elateriformia by Crowson and other workers, was clearly grouped in the latter. The close relationship of Staphyliniformia (Histeroidea, Hydrophiloidea, Staphylinoidea) and Scarabaeiformia (Scarabaeoidea) was confirmed, although both groups appeared broadly paraphyletic. Among the larger groups, lowest RI values were obtained for Tenebrionoidea and Cucujoidea, but also for Bostrychoidea, Elateroidea and Byrrhoidea. Hence, the constitution of these groupings was less clear and their interrelationships with other groups remain particularly problematic.

Nodal support was difficult to establish. Jackknife support on the 932-taxon data set was low or non-existent for the vast majority of nodes, but this was unsurprising given the superficial search strategy. However, the most basal nodes which define the overall structure of the tree were generally supported, including support for the monophyly of Coleoptera and its four suborders, the relationships of the suborders to each other, and the monophyly of Geadephaga, Hydradephaga and Cucujiformia. In addition, support was also obtained for many smaller tip-level groups, and for the monophyly of several genera, subfamilies and families.

4. Discussion

4.1. Alignment procedures and the performance of BlastAlign

Our study provides a straightforward methodology for the phylogenetic analysis of large numbers of length-variable sequences. The initial choice of alignment software was guided by reports from the literature on performance and suitability for large data sets. Among procedures for ‘global’ (entire sequence length) sequence alignment, ClustalW remains the most widely used program despite the introduction of several new methods. MAFFT (Katoh et al., 2005) and MUSCLE (Edgar, 2004) are also designed for high-throughput applications (at a trade-off of lower accuracy), while T-Coffee is not recommended for large alignment problems of >100 sequences (Edgar and Batzoglou, 2006). However, the widely perceived performance advantage of these recent programs over the Clustal procedure is mainly based on protein alignments, while for RNA alignments Clustal continues to score highly across a wide range of data types (Gardner et al., 2005). This was confirmed in our preliminary tests using alignments from MAFFT and MUSCLE which showed no discernible improvement in the accuracy of the resulting trees over those from Clustal. Their inability to accept user-defined guide trees also reduced their appeal for our study.

‘Global’ alignment procedures, including Clustal, derive an optimal score from the similarity across the entire sequence length. However, fragment-based ‘local’ alignment methods such as BLAST (Altschul et al., 1990) and DIALIGN (Morgenstern, 1999) might be more appropriate to the analysis of variation in rRNA genes which consist of alternating conserved and highly variable regions, while among-sequence divergence also varies greatly. Although the placement and general properties of the hypervariable regions along the length of the gene are preserved, the low similarity levels, differences in length of the molecule and great variation in AT content mean that similarity is lacking across distantly related lineages of Coleoptera. Therefore, the similarity criterion as a prerequisite for establishing homology (see De Pinna, 1991) is not fulfilled, leading to ‘over-alignment’. In these cases, local alignments based on fragments of sequence similarity in BLAST would retain those regions with homology between close relatives only, while avoiding the inappropriate alignment of more distant sequences that do not exhibit apparent similarity.

Hence, data removal in BlastAlign is not uniform and depends on the particular composition of a data set, while the retained nucleotides are a good representation of homology, as was evident from the higher taxonomic CI (i.e., greater congruence with other evidence). The procedure compared favorably with the alternative treatment of completely removing the hypervariable regions which resulted in a slightly higher total loss of data (Table 2) and a lower taxonomic CI (Table 3). BlastAlign was also the only method to address successfully the challenge of recovering a monophyletic Coleoptera (Caterino et al., 2002) when analyzed together with other holometabolan insect orders (Table 5). However, the advantage of the BlastAlign method was less clear when the long-branch taxa had not been removed, i.e., sequences with extended hypervariable regions remain in the data set, indicating that the procedure becomes increasingly unspecific when sequences are very divergent and recognition of similarity across the data set is more difficult. In this case the effect of BlastAlign was similar to that of removal of these questionable regions. Yet, for the majority of sequences BlastAlign provided high quality alignments in a shorter time frame than standard approaches, which then result in taxonomically congruent trees, indicating that homologies of nucleotides were captured well.
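The principle of retaining only fragments with detectable similarity can be sketched as follows. This is not the BlastAlign implementation: it assumes, for illustration, that the local hits of one sequence against a reference (standing in here for the most representative sequence) are already available as ungapped fragments, given as hypothetical (query start, reference start, length) triples with 0-based coordinates. The fragments are placed in the reference coordinate frame and everything else is left as gaps.

```python
def place_fragments(query, ref_len, fragments, gap="-"):
    """Build one aligned row of length `ref_len` from ungapped local hits.

    Each fragment is (q_start, r_start, length), 0-based. Query bases that
    fall outside every fragment are discarded (hypothetical input format).
    """
    row = [gap] * ref_len
    for q_start, r_start, length in fragments:
        piece = query[q_start:q_start + length]
        if len(piece) != length or r_start < 0 or r_start + length > ref_len:
            raise ValueError("fragment outside sequence bounds")
        row[r_start:r_start + length] = piece
    return "".join(row)


def retained_fraction(query, aligned_row, gap="-"):
    """Fraction of the original query bases that survive in the aligned row."""
    return sum(char != gap for char in aligned_row) / len(query)


# A query with an 8-base insertion that has no counterpart in a 12-base reference.
query = "ACGT" + "GGGGGGGG" + "CGT"
row = place_fragments(query, 12, [(0, 0, 4), (12, 9, 3)])
print(row)
print(round(retained_fraction(query, row), 3))
```

Here the two conserved flanks are kept and the unmatched insertion is dropped, so the amount of data removed depends entirely on which fragments find a match, in line with the non-uniform data removal noted above.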

An obvious drawback is the loss of character variation when trees were built directly from these BLAST-based alignments. In the Coleoptera data set, this led to a reduction in number of steps by about two thirds while maintaining a similar CI, and presumably resulted in generally lower clade support and the failure to recover certain clades due to lack of phylogenetic signal. On the other hand, standard Clustal alignments were clearly worse for their taxonomic congruence scores, indicating poor homology assignments. The utility of the fragment-based approach of BlastAlign apparently lies in its ability to retain the more conserved regions of the gene, whilst removing the highly divergent regions and indels (whether from the hypervariable or conserved regions) and thus provides an alignment that contains generally more defensible homology assignments. This is brought to bear on the Clustal procedure via the guide tree, forcing an alignment of the full-length sequences that is less confounded by incorrect homologies established in the standard application of the program.

4.2. Long branches, rogue taxa and tree searches

A related problem was the great rate variation among taxa and the resulting long branch attraction. Removing the longest branches from the tree clearly increased the taxonomic consistency values, indicating that error in the tree topology was correlated disproportionally to long branch phenomena. Our procedure for long branch removal using the 95% window of a distribution of all branches on the tree resulted in the removal of nearly 20% of taxa, including several major lineages whose position within Coleoptera therefore remains questionable. However, the proportion of taxa removed depends on how exactly the procedure is applied. First, we used parsimony branch length, but more realistic measures could perhaps be achieved with ML. Second, we eliminated all taxa terminal to the long branches, including entire clades. This was only successful when conducted at the BlastAlign stage (tree A1), once some length differences had already been reduced due to the excision of unalignable bases. When applied to the full-length sequences (tree A2), long branches deep in the tree would have resulted in the loss of the majority of species (not shown). Alternative schemes could be applied that may be less sensitive to the removal of entire clades, such as the use of root-to-tip path length for determining the 95% interval for each terminal, as originally proposed (Korte et al., 2004). Further iterations of the procedure could be applied after recalculating trees and branch lengths for the reduced data sets. However, preliminary analyses on smaller data sets seem to indicate that few, if any, additional branches are affected once the most deviating taxa have been removed in the first round, and it is clear from the current analysis that a single round resulted in great improvements of the tree. In addition, as the number of taxa increases with the expansion of the database, denser taxon sampling would be expected to reduce the occurrence of exceptionally long branches.
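The removal of taxa subtended by long branches can be sketched as below. The sketch rests on assumptions that are not confirmed by the text: the cut-off is taken as the 95th percentile of all branch lengths on the tree, the tree is represented as nested (content, branch length) pairs, and all function names are hypothetical. As in the procedure described above, every terminal below a long branch is flagged, including entire clades.

```python
import statistics


def leaf(name, length=1.0):
    return (name, length)


def clade(children, length=1.0):
    return (list(children), length)


def branch_lengths(node):
    """Yield the length of the branch above `node` and of all branches below it."""
    content, length = node
    yield length
    if not isinstance(content, str):
        for child in content:
            yield from branch_lengths(child)


def terminals(node):
    """List the names of all terminals below (or at) `node`."""
    content, _ = node
    if isinstance(content, str):
        return [content]
    return [name for child in content for name in terminals(child)]


def taxa_on_long_branches(root):
    """Return (cutoff, flagged terminals) for branches above the cut-off."""
    children, _ = root  # the root itself has no branch above it
    lengths = [x for child in children for x in branch_lengths(child)]
    # Assumed definition of the cut-off: 95th percentile of all branch lengths.
    cutoff = statistics.quantiles(lengths, n=20, method="inclusive")[-1]
    flagged = set()

    def visit(node):
        content, length = node
        if length > cutoff:
            flagged.update(terminals(node))  # drop the whole clade
        elif not isinstance(content, str):
            for child in content:
                visit(child)

    for child in children:
        visit(child)
    return cutoff, flagged


# Toy tree: a ladder of ten terminals plus one two-taxon clade on a long branch.
backbone = leaf("t0")
for i in range(1, 10):
    backbone = clade([backbone, leaf(f"t{i}")])
long_clade = clade([leaf("x1"), leaf("x2")], length=9.0)
root = clade([backbone, long_clade], length=0.0)
cutoff, flagged = taxa_on_long_branches(root)
print(cutoff, sorted(flagged))
```

In the toy tree both terminals of the clade on the long internal branch are flagged, although their own terminal branches are short. This illustrates why a long branch deep in the tree can remove many species at once, and why a per-terminal criterion such as root-to-tip path length may behave differently.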

A further problem is that the trees were presumably suboptimal because of the size of the data set and the consequently astronomical tree space. While the topologies used here are unlikely to represent the globally shortest or most likely trees, by evaluating the trees based on congruence with the taxonomy rather than their length or likelihood score, we in part circumvent the problems of abbreviated searches. Repeated searches on some alignments indicated that tree length did not strictly relate to taxonomic congruence. For example, the tree search on the ClustalW alignment of the reduced dataset produced a tree four steps longer than the best tree, but with a better taxonomic CI of 0.481 compared to 0.477 for the shortest Tree B2. Similarly, the BlastAlign alignment also produced a suboptimal tree one step longer than the shortest tree A1 but with better taxonomic CI of 0.460 (versus CI = 0.456). This indicates that the link of tree topology and fit with the classification (taxonomic CI) is not straightforward, and extended searches would not necessarily increase the taxonomic congruence of the final trees. However, if the taxonomy is used as criterion for discriminating between alignment strategies, the searches performed here are probably sufficient.

Similarly, the size of the data set and uncertainty of tree searches cause problems for the measurement of nodal support. This limits the reliability of traditional support values, which require accurate trees to obtain values for each constrained search (Bremer Support) or pseudoreplicate (to calculate Bootstrap proportions). In addition, we regularly observed shifts in the placement of particular terminals in trees and sometimes massive rearrangements of certain subgroups with only very minor differences in tree length. Because of the large number of taxa that may be affected by these, nodal support in particular for the monophyly of groups deep in the tree can be expected to be low (see also McMahon and Sanderson, 2006). Nonetheless, several deep splits in Coleoptera were supported in the fast jack-knife procedure employed here despite the conservative nature of the test, indicating strong phylogenetic signal for these nodes.
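The logic of jackknife support can be sketched in general terms. This is not the TNT procedure used here: the sketch assumes that each alignment column is deleted independently with probability e^-1 (about 0.37, the value commonly used in parsimony jackknifing), and that the tree search on each pseudoreplicate, which is omitted, returns a set of clades. All names are hypothetical.

```python
import math
import random


def jackknife_columns(alignment, rng, p_delete=math.exp(-1)):
    """Return a pseudoreplicate with each column deleted with probability p_delete.

    `alignment` maps terminal names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    keep = [i for i in range(n_cols) if rng.random() >= p_delete]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


def clade_support(replicate_clades, clade):
    """Fraction of pseudoreplicate trees (each a set of clades) containing `clade`."""
    clade = frozenset(clade)
    return sum(clade in clades for clades in replicate_clades) / len(replicate_clades)


alignment = {"A": "ACGTACGT", "B": "ACGAACGA", "C": "TCGAACGA"}
replicate = jackknife_columns(alignment, random.Random(0))

# Clades recovered in three hypothetical pseudoreplicate searches.
replicate_clades = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"B", "C"})},
]
print(clade_support(replicate_clades, {"A", "B"}))
```

Because each pseudoreplicate requires its own tree search, any instability of abbreviated searches feeds directly into the support frequencies, which is the concern raised above for very large data sets.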

4.3. Assessing trees based on congruence with the taxonomic classification

Because of the difficulty of estimating the best topology and associated support values, the comparison with the existing knowledge of phylogeny provides useful information to assess the quality of trees and alignment parameters. The rationale is derived from the analysis of alignment variable regions under the assumption that topological congruence of nucleotides can be assessed in the framework of synapomorphy (Wheeler, 1995), and the best supported alignment and derived tree are those exhibiting maximal congruence (synapomorphy) with a known phylogeny. Taxonomic classification can provide a subset of nodes (the monophyly of genera, families, etc.) useful as a proxy for the assessment of all nodes, comparable to the use of critical nodes in sensitivity analysis (Wheeler, 1995). Based on the taxonomic CI/RI values, the match of trees with the re-coded classification was far from perfect under all types of analysis, but clear differences in fit were evident and permitted tests of the most appropriate data treatments.

These tests for consistency of the tree with the classification are affected by the data structure and the efficiency of tree searches, as well as the shortcomings of the classification itself. The use of the latter might seem problematic, but it is used here only in a statistical sense, and does not rely on monophyly of any specific group. In addition, the classification of Coleoptera (Crowson, 1960; Lawrence and Newton, 1995) is based on a long tradition of studies on numerous groups and all hierarchical levels, and hence the sum of these is a reasonably good arbiter of a tree. The monophyly of groupings in particular at the level of families and subfamilies is not controversial in many cases, even if these groups have not been defined by formal cladistic analysis. Only these well-established units were used here for the assessment of the tree, but not the much less well known relationships among these. The specific scheme employed by Genbank (www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi) follows this taxonomic system closely, e.g. listing all 166 families of the Lawrence and Newton (1995) classification plus two additional families described more recently.

Yet, it is remarkable that the taxonomic CI did not increase beyond about 0.5, i.e., despite the substantial differences in tree topology and level of recovery of established groups between data treatments, the effect on the CI was comparatively small while even the best values remained fairly low. The degree to which these discrepancies are due to poor quality of the trees or the insufficiencies of the higher classification of Coleoptera remains to be addressed. Taxon sampling, as one possible factor determining the recovery of monophyletic groups, was 75% complete at the family level and included multiple representatives of all major families, but sampling remained uneven, with high coverage in particular in the Phytophaga (making up more than one third of all sequences) but insufficient coverage in other parts. Sampling density probably varies even more when considering the ‘true’ phylogenetic tree which would lack equal coverage even if all taxonomic groups had been sampled to the same degree.

A persistent problem was the occurrence of ‘rogue taxa’ (Wilkinson, 1994), i.e., individual terminals or small subgroups found in unforeseen positions. Many of these unexpected placements were encountered only once in a particular tree search and were confined to a small set of taxa, suggesting insufficiencies of individual tree searches. Yet, this had great impact on the recovery of clade monophyly and lowered the taxonomic CI, in particular at the level of suborders and superfamilies which are composed of a greater number of terminals than the genus, subtribe and tribe level which rarely included more than two or three terminals (Table 3). When considering the RI which normalizes the measures of consistency for the maximum level of homoplasy possible, the values were generally very high (Table 4), demonstrating that monophyly of the superfamilies is compromised by a small proportion of terminals only, while at the tip levels low taxonomic CI does indeed indicate the polyphyly of taxa. This may be particularly problematic for the subtribes whose taxonomic CI values across all trees were unlike those at the taxonomic levels below and above (genus and tribe) and hence may be indicative of generally poor diagnosis of subtribes. To a lesser degree, perhaps the same can be said for some superfamilies whose constitution may be compromised by shoehorning all species into one of these. Hence, cross-illumination of tree and taxonomy will reveal groups most in need of further tests and direct efforts to improve the taxonomy.

5. Conclusions

We have shown that the BLAST-based alignment provides a useful approach for the analysis of >1000 full length SSU sequences on standard desktop computers. The method is clearly superior to standard progressive alignment procedures or the wholesale removal of unalignable sections in hypervariable regions. Plausible phylogenetic trees at this scale can be obtained in reasonable time using parsimony searches in TNT, and the validity of such trees can easily be assessed based on the fit of the tree to established higher level taxonomy. The method was particularly advantageous at deep hierarchical levels, where comparable phylogenetic analyses using the same gene and standard alignment procedures encountered problems of non-monophyly of established groups (Hibbett et al., 2005; Kallersjo et al., 1998) which were therefore routinely constrained in tree searches (Hibbett et al., 2005). It will be interesting to analyze these and other large data sets with the proposed procedure to test for improvements of unconstrained analyses.

A fully satisfactory phylogenetic analysis of the Coleoptera (or any other group) is unlikely to be produced from just one gene. However, as high-throughput sequencing accelerates, data mining and compilations of sequence data will lead to a more open phyloinformatics approach in molecular systematics using all available gene and taxon data (McMahon and Sanderson, 2006). This will not only increase taxon numbers but, due to the various provenance of sequences, unequal depth of taxon sampling, and a large proportion of missing data, these non-traditional data sets will aim at a summary of all available sequence information rather than a final depiction of the phylogeny. Broad trends of data composition and tree structure, rather than specific sister relationships, can be inferred in this way. For example, the tree of Coleoptera presented here shows good consistency with the current classification, strongly corroborating the visionary work of Crowson (1955, 1960) and others, and greatly narrows down the remaining questions about the phylogeny of this largest insect order. The lack of support values across most nodes of these large trees will be a concern, as might perhaps be the search strategy used, and the criterion employed to judge the quality of the resultant trees. However, the procedure was highly successful on a pragmatic level, to interpret an ever increasing body of sequence information in the light of the existing framework of relationships of Coleoptera. Accumulating evidence from growing sets of taxa and multiple gene markers can easily be integrated into this framework and will produce an ever more complete image of the tree-of-life.

Acknowledgments

We are grateful to R. Booth (NHM London) for specimens used in this study. James Abott (Imperial College Computer Services) provided support with bioinformatics and installing software. Unpublished sequence data were supplied by M. Barclay (weevils), J. Mate, D. Inward and C. Scholtz (Scarabaeoidea), F. Ciampor (Dryopoidea) and A. Papadopoulou (various). We thank R. Belshaw, I. Ribera, J. Bergsten, J. Gomez-Zurita and K. Kjer for invaluable discussions and comments. Funding was provided by The Leverhulme Trust (Grant F/696/H), NERC (Grant NE/E010962/1 and a NERC studentship to T.H.), and the Museum Research Fund of the NHM.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ympev.2007.11.029.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990.Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

Belshaw, R., Katzourakis, A., 2005. BlastAlign: a program that uses blastto align problematic nucleotide sequences. Bioinformatics 21, 122–123.

Brower, A.V.Z., Schawaroch, V., 1996. Three steps of homologyassessment. Cladistics 12, 265–272.

Caterino, M.S., Hunt, T., Vogler, A.P., 2005. On the constitution andphylogeny of Staphyliniformia. Mol. Phylogenet. Evol. 34, 655–672.

Caterino, M.S., Shull, V.L., Hammond, P.M., Vogler, A.P., 2002. Thebasal phylogeny of the Coleoptera inferred from 18S rDNA sequences.Zool. Scr. 31, 41–49.

Chalwatzis, N., Hauf, J., Peer, Y.V.D., Kinzelbach, R., Zimmermann,F.K., 1996. 18S ribosomal RNA genes in insects: primary structure ofthe genes and molecular phylogeny of the Holometabola. Ann.Entomol. Soc. Amer. 89, 788–803.

Chevenet, F., Brun, C., Banuls, A.L., Jacq, B., Christen, R., 2006.TreeDyn: towards dynamic graphics and annotations for analyses oftrees. BMC Bioinformatics 7, Art. No. 439.

Crowson, R.A., 1955. The natural classification of the families ofColeoptera. Nathaniel Lloyd & Co., London.

Crowson, R.A., 1960. The phylogeny of Coleoptera. Ann. Rev. Entomol.5, 111–134.

Crowson, R.A., 1970. Classification and Biology. Heinemann EducationalBooks Ltd, London.

De Pinna, M.C.C., 1991. Concepts and tests of homology in the cladisticparadigm. Cladistics 7, 367–394.

Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with highaccuracy and high throughput. Nucl. Acids Res., 32.

Edgar, R.C., Batzoglou, S., 2006. Multiple sequence alignment. Curr.Opin. Struct. Biol. 16, 368–373.

Farrell, B.D., 1998. ‘‘Inordinate fondness” explained: why are there somany beetles? Science 281, 555–559.

Feng, D.F., Doolittle, R.F., 1987. Progressive sequence alignment as aprerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351–360.

Galian, J., Hogan, J.E., Vogler, A.P., 2002. The origin of multiple sexchromosomes in tiger beetles. Mol. Biol. Evol. 19, 1792–1796.

Gardner, P.P., Wilm, A., Washietl, S., 2005. A benchmark of multiplesequence alignment programs upon structural RNAs. Nucl. Acids Res.33, 2433–2439.

Gillespie, J.J., 2004. Characterizing regions of ambiguous alignmentcaused by the expansion and contraction of hairpin-stem loops inribosomal RNA molecules. Mol. Phylogenet. Evol. 33, 936–943.

Goloboff, P., Farris, S., Nixon, K., 2004. TNT (tree analysis using newtechnology). Cladistics 20, 84.

Goloboff, P., Pol, D., 2007. On divide-and-conquer strategies forparsimony analysis of large data sets: Rec-I-DCM3 versus TNT. Syst.Biol. 56, 485–495.

Gotoh, O., 1982. An improved algorithm for matching biologicalsequences. J. Mol. Biol. 162, 705–708.

Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm toestimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704.

Gomez-Zurita, J., Jolivet, P., Vogler, A.P., 2005. Molecular systematics of Eumolpinae and the relationships with Spilopyrinae (Coleoptera, Chrysomelidae). Mol. Phylogenet. Evol. 34, 584–600.

Hibbett, D.S., Nilsson, R.H., Snyder, M., Fonseca, M., Costanzo, J., Shonfeld, M., 2005. Automated phylogenetic taxonomy: an example in the homobasidiomycetes (mushroom-forming fungi). Syst. Biol. 54, 660–668.

Hovenkamp, P., 2004. Review of: TNT, tree analysis using new technology. Version 1.0, by P. Goloboff, J.S. Farris, K. Nixon. Cladistics 20, 378–383.

Kallersjo, M., Farris, J.S., Chase, M.W., Bremer, B., Fay, M.F., Humphries, C.J., Petersen, G., Seberg, O., Bremer, K., 1998. Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants and flowering plants. Plant Syst. Evol. 213, 259–287.

Katoh, K., Kuma, K., Toh, H., Miyata, T., 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucl. Acids Res. 33, 511–518.

Kjer, K.M., 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs. Mol. Phylogenet. Evol. 4, 314–330.

Kjer, K.M., 2004. Aligned 18S and insect phylogeny. Syst. Biol. 53, 506–514.

Korte, A., Ribera, I., Beutel, R.G., Bernhard, D., 2004. Interrelationships of Staphyliniform groups inferred from 18S and 28S rDNA sequences, with special emphasis on Hydrophiloidea (Coleoptera, Staphyliniformia). J. Zool. Syst. Evol. Res. 42, 281–288.

Lawrence, J.F., Newton, A.F., 1995. Families and subfamilies of Coleoptera (with selected genera, notes, references and data on family-group names). In: Pakaluk, J., Slipinski, S.A. (Eds.), Biology, phylogeny, and classification of Coleoptera. Muzeum i Instytut Zoologii PAN, Warszawa, pp. 779–1092.

Maddison, D.R., Baker, M.D., Ober, K.A., 1999. Phylogeny of carabid beetles as inferred from 18S ribosomal DNA (Coleoptera: Carabidae). Syst. Entomol. 24, 103–138.

Mathews, D.H., Turner, D.H., 2002. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol. 317, 191–203.

McMahon, M.M., Sanderson, M.J., 2006. Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Syst. Biol. 55, 818–836.

Morgenstern, B., 1999. DIALIGN 2: improvement of the segment-to-segment approach to multiple alignment. Bioinformatics 15, 211–218.

Notredame, C., Higgins, D.G., Heringa, J., 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.

Pashley, D.P., McPheron, B.A., Zimmer, E.A., 1993. Systematics of holometabolous insect orders based on 18S ribosomal RNA. Mol. Phylogenet. Evol. 2, 132–142.

Ribera, I., Hogan, J.E., Vogler, A.P., 2002. Phylogeny of hydradephagan water beetles inferred from 18S rRNA sequences. Mol. Phylogenet. Evol. 23, 43–62.

Robertson, J.A., McHugh, J.V., Whiting, M.F., 2004. Colour patterning and the phylogeny of the pleasing fungus beetles (Coleoptera; Erotylidae): molecular evidence. Syst. Entomol. 29, 173–187.

Sankoff, D., 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics 45, 810–825.

Shull, V.L., Vogler, A.P., Baker, M.D., Maddison, D.R., Hammond, P.M., 2001. Sequence alignment of 18S ribosomal RNA and the basal relationships of adephagan beetles: evidence for monophyly of aquatic families and the placement of Trachypachidae. Syst. Biol. 50, 945–969.

Soltis, P.S., Soltis, D.E., Chase, M.W., 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402, 402–404.

Tautz, D., Hancock, J.M., Webb, D.A., Tautz, C., Dover, G.A., 1988. Complete sequence of the rRNA genes of Drosophila melanogaster. Mol. Biol. Evol. 5, 366–376.

Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.

Wheeler, W.C., 1995. Sequence alignment, parameter sensitivity, and the phylogenetic analysis of molecular data. Syst. Biol. 44, 321–331.

Wheeler, W.C., 1996. Optimization alignment: the end of multiple sequence alignment in phylogenetics? Cladistics 12, 1–9.

Wheeler, W.C., Whiting, M., Wheeler, Q.D., Carpenter, J.M., 2001. The phylogeny of the extant hexapod orders. Cladistics 17, 113–169.

Whiting, M.F., Carpenter, J.C., Wheeler, Q.D., Wheeler, W.C., 1997. The Strepsiptera problem: phylogeny of the holometabolous insect orders inferred from 18S and 28S ribosomal DNA sequences and morphology. Syst. Biol. 46, 1–68.

Wilkinson, M., 1994. Common cladistic information and its consensus representation: reduced Adams and reduced cladistic consensus trees and profiles. Syst. Biol. 43, 343–368.