measuring community similarity with phylogenetic networks

12
Research article Measuring Community Similarity with Phylogenetic Networks Donovan H. Parks* and Robert G. Beiko Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada *Corresponding author: E-mail: [email protected]. Associate editor: Barbara Holland Abstract Environmental drivers of biodiversity can be identified by relating patterns of community similarity to ecological factors. Community variation has traditionally been assessed by considering changes in species composition and more recently by incorporating phylogenetic information to account for the relative similarity of taxa. Here, we describe how an important class of measures including Bray–Curtis, Canberra, and UniFrac can be extended to allow community variation to be computed on a phylogenetic network. We focus on phylogenetic split systems, networks that are produced by the widely used median network and neighbor-net methods, which can represent incongruence in the evolutionary history of a set of taxa. Calculating b diversity over a split system provides a measure of community similarity averaged over uncertainty or conflict in the available phylogenetic signal. Our freely available software, Network Diversity, provides 11 qualitative (presence–absence, unweighted) and 14 quan- titative (weighted) network-based measures of community similarity that model different aspects of community richness and evenness. We demonstrate the broad applicability of network-based diversity approaches by applying them to three distinct data sets: pneumococcal isolates from distinct geographic regions, human mitochondrial DNA data from the Indonesian island of Nias, and proteorhodopsin sequences from the Sargasso and Mediterranean Seas. Our results show that major expected patterns of variation for these data sets are recovered using network-based measures, which indicates that these patterns are robust to phylogenetic uncertainty and conflict. Nonetheless, network-based measures of community similarity can differ substantially from measures ignoring phylogenetic relationships or from tree-based measures when incongruent signals are present in the underlying data. Network-based measures provide a methodology for assessing the robustness of b-diversity results in light of incongruent phylogenetic signal and allow b diversity to be calculated over widely used network structures such as median networks. Key words: beta diversity, split system, phylogenetic network, MLST, mtDNA, proteorhodopsin. Introduction Measures of b diversity are used to assess variation in com- munity composition between sample sites. Examining patterns of b diversity across environmental gradients, be- tween treatment conditions, or over time provides insights into the spatiotemporal dynamics of communities and the influence of ecological factors on biodiversity (Anderson et al. 2011). b-diversity measures have been used in studies such as the investigation of human migration patterns (Kayser et al. 2008; HUGO Pan-Asian SNP Consortium 2009), the explor- ation of the structure of human-associated bacterial commu- nities (Costello et al. 2009; Caporaso et al. 2011), and the comparison of diversity patterns exhibited by micro- and macro-organisms across environmental gradients (Bryant et al. 2008; Rousk et al. 2010; Martiny et al. 2011). Although community variation has traditionally been assessed by con- sidering changes in species composition with “taxon-based” measures such as the Bray–Curtis or Canberra indices, more recent methods account for the relative similarity of taxa with “sequence-based” measures that consider the genetic distance between aligned sequences (e.g., F ST : Hudson et al. 1992; Martin 2002; Holsinger and Weir 2009) or with “phylogenetic-based” measures that compute distances between taxa using a phylogenetic tree (e.g., UniFrac; Lozupone and Knight 2005). Phylogenetic-based measures are becoming increasingly popular as they are complemen- tary to taxon-based measures (Graham and Fine 2008; Hamady et al. 2010) and eliminate the need to separate units of diversity into predefined groups. Here, we describe how an important class of b-diversity measures, which in- cludes the Bray–Curtis, Canberra, and UniFrac measures, can be extended to account for phylogenetic uncertainty and conflict. Although phylogenetic-based measures are relatively robust to the methods used for phylogenetic inference (Lozupone et al. 2007), they calculate similarity over a single phylogenetic tree that is presumed to be correct. In practice, there may be multiple trees supported by the available phylo- genetic signal. Consideration of a set of bootstrapped trees or the Bayesian posterior distribution of trees has been used to account for this phylogenetic uncertainty to provide more robust statistical tests (Wang et al. 2001; Jones and Martin 2006; Lozupone et al. 2008; Parker et al. 2008). The use of jackknifed trees has also been considered to evaluate the ro- bustness of phylogenetic-based measures of b diversity to sampling depth and evenness (Lozupone et al. 2007), and this approach could be adopted to account for phylogenetic ß The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected] Mol. Biol. Evol. 29(12):3947–3958 doi:10.1093/molbev/mss200 Advance Access publication August 21, 2012 3947 Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860 by guest on 06 April 2018

Upload: hoanglien

Post on 07-Feb-2017

220 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Measuring community similarity with phylogenetic networks

Research

articleMeasuring Community Similarity with Phylogenetic Networks

Donovan H. Parks* and Robert G. BeikoFaculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada

*Corresponding author: E-mail: [email protected].

Associate editor: Barbara Holland

Abstract

Environmental drivers of biodiversity can be identified by relating patterns of community similarity to ecological factors.Community variation has traditionally been assessed by considering changes in species composition and more recently byincorporating phylogenetic information to account for the relative similarity of taxa. Here, we describe how an important class ofmeasures including Bray–Curtis, Canberra, and UniFrac can be extended to allow community variation to be computed on aphylogenetic network. We focus on phylogenetic split systems, networks that are produced by the widely used median networkand neighbor-net methods, which can represent incongruence in the evolutionary history of a set of taxa. Calculating b diversityover a split system provides a measure of community similarity averaged over uncertainty or conflict in the available phylogeneticsignal. Our freely available software, Network Diversity, provides 11 qualitative (presence–absence, unweighted) and 14 quan-titative (weighted) network-based measures of community similarity that model different aspects of community richness andevenness. We demonstrate the broad applicability of network-based diversity approaches by applying them to three distinct datasets: pneumococcal isolates from distinct geographic regions, human mitochondrial DNA data from the Indonesian island ofNias, and proteorhodopsin sequences from the Sargasso and Mediterranean Seas. Our results show that major expected patternsof variation for these data sets are recovered using network-based measures, which indicates that these patterns are robust tophylogenetic uncertainty and conflict. Nonetheless, network-based measures of community similarity can differ substantiallyfrom measures ignoring phylogenetic relationships or from tree-based measures when incongruent signals are present in theunderlying data. Network-based measures provide a methodology for assessing the robustness of b-diversity results in light ofincongruent phylogenetic signal and allow b diversity to be calculated over widely used network structures such as mediannetworks.

Key words: beta diversity, split system, phylogenetic network, MLST, mtDNA, proteorhodopsin.

IntroductionMeasures of b diversity are used to assess variation in com-munity composition between sample sites. Examiningpatterns of b diversity across environmental gradients, be-tween treatment conditions, or over time provides insightsinto the spatiotemporal dynamics of communities and theinfluence of ecological factors on biodiversity (Anderson et al.2011). b-diversity measures have been used in studies such asthe investigation of human migration patterns (Kayser et al.2008; HUGO Pan-Asian SNP Consortium 2009), the explor-ation of the structure of human-associated bacterial commu-nities (Costello et al. 2009; Caporaso et al. 2011), and thecomparison of diversity patterns exhibited by micro- andmacro-organisms across environmental gradients (Bryantet al. 2008; Rousk et al. 2010; Martiny et al. 2011). Althoughcommunity variation has traditionally been assessed by con-sidering changes in species composition with “taxon-based”measures such as the Bray–Curtis or Canberra indices, morerecent methods account for the relative similarity of taxa with“sequence-based” measures that consider the geneticdistance between aligned sequences (e.g., FST: Hudson et al.1992; Martin 2002; Holsinger and Weir 2009) or with“phylogenetic-based” measures that compute distances

between taxa using a phylogenetic tree (e.g., UniFrac;Lozupone and Knight 2005). Phylogenetic-based measuresare becoming increasingly popular as they are complemen-tary to taxon-based measures (Graham and Fine 2008;Hamady et al. 2010) and eliminate the need to separateunits of diversity into predefined groups. Here, we describehow an important class of b-diversity measures, which in-cludes the Bray–Curtis, Canberra, and UniFrac measures,can be extended to account for phylogenetic uncertaintyand conflict.

Although phylogenetic-based measures are relativelyrobust to the methods used for phylogenetic inference(Lozupone et al. 2007), they calculate similarity over a singlephylogenetic tree that is presumed to be correct. In practice,there may be multiple trees supported by the available phylo-genetic signal. Consideration of a set of bootstrapped trees orthe Bayesian posterior distribution of trees has been used toaccount for this phylogenetic uncertainty to provide morerobust statistical tests (Wang et al. 2001; Jones and Martin2006; Lozupone et al. 2008; Parker et al. 2008). The use ofjackknifed trees has also been considered to evaluate the ro-bustness of phylogenetic-based measures of b diversity tosampling depth and evenness (Lozupone et al. 2007), andthis approach could be adopted to account for phylogenetic

� The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, pleasee-mail: [email protected]

Mol. Biol. Evol. 29(12):3947–3958 doi:10.1093/molbev/mss200 Advance Access publication August 21, 2012 3947Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 2: Measuring community similarity with phylogenetic networks

uncertainty. Our approach differs from these efforts byincorporating phylogenetic uncertainty into the inferred evo-lutionary history of a set of sequences and calculating com-munity similarity over this more general model of sharedancestry known as a phylogenetic network.

Phylogenetic networks are a generalization of phylogenetictrees and can be divided into two distinct classes (Huson andBryant 2006): 1) implicit networks, which represent phylogen-etic uncertainty and conflict in the available phylogeneticsignal, and 2) explicit networks, which represent evolutionaryhistories that may contain reticulate events such as hybrid-ization or lateral gene transfer. Phylogenetic uncertainty andconflict can arise when there is insufficient signal to ad-equately resolve all branches of a bifurcating evolutionaryprocess (Ho and Jermiin 2004; Kennedy et al. 2005) orwhen the underlying evolutionary history of a set of taxacontains reticulate events. Implicit networks aim to representall, or at least many, plausible evolutionary scenarios for aset of taxa (Huson and Scornavacca 2011; Morrison 2011).In contrast, explicit networks aim to describe a single biologic-ally plausible phylogeny, which may include reticulateconnections.

This article considers split systems, a widely used class ofimplicit networks (Huson and Scornavacca 2011). A split XjYis a bipartition of the taxa into two nonempty, disjoint subsetsX and Y. Phylogenetic trees can be described, without loss ofinformation, by a compatible split system. As such, b-diversitymeasures defined on a phylogenetic tree can be expressed interms of a compatible set of splits. More generally, a splitsystem may contain incompatible splits, in which case itcan be represented graphically as a split network or as atable enumerating all splits (fig. 1). Graphically, a split is rep-resented by one or more parallel lines. Each split has a weightrepresenting, among other possibilities, the amount of evolu-tionary change that has occurred between the taxa on eitherside of the split. Here, we describe how measures of b diversitycan be extended to incompatible sets of splits.

We have focused on split systems as they are widely usedwithin the biological sciences and include phylogenetic trees

as a special case. Character-based (Bandelt and Dress 1993;Bandelt et al. 1995), distance-based (Bandelt and Dress 1992;Bryant and Moulton 2004), and tree-based (Huson et al. 2004;Holland et al. 2005, 2007) methods for inferring incompatiblesplit systems have all been proposed. Implementations ofmany split system algorithms are available in Spectronet(Huber et al. 2002) and SplitsTree4 (Huson and Bryant2006). Two popular inference methods are the character-based median network method (Bandelt et al. 1995), whichis commonly used to study the relationships between humanhaplogroups (Bandelt et al. 1995; Herrnstadt et al. 2002), andthe distance-based neighbor-net method (Bryant andMoulton 2004), which is often applied to provide furtherinsights into poorly understood or complex phylogenies(Morrison 2005; Huson and Bryant 2006). Neighbor-netshave also been used within conservation planning to accountfor conflicting phylogenetic signal (Spillner et al. 2008; Minhet al. 2009) and as a method for visualizing the relationshipsbetween communities as determined by a b-diversity analysis(Mitra et al. 2010).

Our proposed extension allows phylogenetic b diversity tobe measured over unrooted and rooted (i.e., as defined by aset of outgroup taxa) split systems. For rooted systems, bothqualitative (i.e., presence–absence, unweighted) and quanti-tative (i.e., weighted) measures can be computed, whichallows changes in both community richness and evennessto be investigated (Legendre and Legendre 1998; Lozuponeet al. 2007). Many of the quantitative measures can also beapplied to unrooted split systems. We demonstrate the val-idity and utility of calculating network-based diversity byanalyzing three distinct data sets: pneumococcal isolatesfrom distinct geographic regions, human mitochondrialDNA (mtDNA) data from the Indonesian island of Nias,and proteorhodopsin sequences from the Sargasso andMediterranean Seas. These data sets illustrate the calculationof b diversity over varying network structures (i.e., neighbor-nets and median networks), the impact of incompatible splitson measured dissimilarity, and the contrasting resultsobtained when applying qualitative and quantitative

FIG. 1. Example of a rooted split system depicted as a split network and a table of splits. Each split has been assigned a number and is represented byone or more parallel edges in the split network (e.g., the two lines labeled 10 represent the split A1A2jB1B2C1O1O2). The split system contains taxa fromthree communities labeled A, B, and C and has been rooted using the outgroup taxa labeled O. Splits are classified with respect to communities A and B:unique to community A (UA: purple), unique to community B (UB: purple), shared (S: green), root (R: orange), external (E: gray), or outgroup (O: blue).

3948

Parks and Beiko . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 3: Measuring community similarity with phylogenetic networks

network-based measures. In all cases, we found that previ-ously determined major patterns of variation can be re-covered on an incompatible split system indicating therobustness of these results to phylogenetic uncertainty andconflict. Nonetheless, finer-scale analyses revealed that thedissimilarity of communities can differ substantially whenthe evolutionary history of taxa is allowed to include incom-patible splits (i.e., when a network is inferred as opposed to atree) or with the measure of b diversity considered (e.g.,weighted UniFrac in contrast to FST).

Materials and Methods

Measuring Qualitative Phylogenetic b Diversity over aRooted Split System

Here, we propose a classification scheme for splits within arooted split system, which allows qualitative phylogeneticb-diversity measures to be applied to such systems. Foreach pair of communities, i and j, all splits within a rootedsplit system can be classified as "unique," "shared," "root,""external," or "outgroup" using the following definitions(fig. 1 and supplementary table S1, Supplementary Materialonline):

Unique split: A split (i.e., XjY) is unique to community ionly when a subset induced by the split (i.e., either X or Y)contains a taxon or taxa from community i and does notcontain any taxa from community j or the outgroup.

Shared split: A split is shared by communities i and j onlywhen 1) there is a subset induced by the split that con-tains taxa from communities i and j and no taxa from theoutgroup and 2) the other subset contains at least onetaxon from community i or j.

Root split: A split is a root of communities i and j onlywhen 1) there is a subset induced by the split that con-tains all the taxa from these communities and no taxafrom the outgroup and 2) the other subset contains atleast one ingroup taxon.

External split: A split is external to communities i and jonly when there is a subset induced by the split thatcontains only taxa from other communities within thestudy and no taxa from the outgroup.

Outgroup split: A split belongs to the outgroup whenthere is a subset induced by the split that contains onlyoutgroup taxa. Additionally, any split where both inducedsubsets contain outgroup taxa is defined as an outgroupsplit even if both subsets also include ingroup taxa. Inpractice, such splits often occur even for a set of credibleoutgroup taxa.

The terms a, b, c, and d are used to define many qualitativetaxon-based measures (Koleff 2003; Kuczynski et al. 2010).Using the above classifications, these terms can be appliedto a split system by defining a as the total split weight sharedby communities i and j, b as the total split weight unique tocommunity i, c as the total split weight unique to commu-nity j, and d as the total split weight external to communitiesi and j. Ferrier et al. (2007) and Nipperess et al. (2010) have

proposed similar definitions for extending these taxon-basedterms to phylogenetic trees. Although Ferrier et al. do notspecify how root or external branches were to be treated,Nipperess et al. implicitly treat root branches as shared(i.e., contribute to a).

Our definitions are more general as they can be applied tosplit systems and explicitly differentiate between shared androot splits. This latter distinction is critical as it allows calcu-lations to be restricted to the most recent common ancestortree of a pair of communities, which can substantiallyinfluence the dissimilarity measured between a pair of com-munities (Parks and Beiko 2012). For example, unweightedUniFrac is defined as (b + c)/(a + b + c) but produces twodistinct measures depending on whether root splits contrib-ute to a or d. Ambiguity already exists with both the FastUniFrac web interface (Hamady et al. 2010) and QIIME(Caporaso et al. 2010) implicitly treating root branches ascontributing to a, whereas mothur (Schloss et al. 2009)treats these branches as contributing to d by default(though it does give users the choice to treat these branchesas contributing to a).

Measuring Quantitative Phylogenetic b Diversity overa Rooted Split System

Many quantitative phylogenetic b-diversity measures con-sider the proportion of taxa from communities i and j des-cendant from branch n (pin and pjn, respectively) along withthe length of the branch (Wn). These terms can be adapted toa rooted split system by defining pin as the proportion oftaxa from community i within the subset induced by split nthat contains no outgroup taxa. Splits where both inducedsubsets contain outgroup taxa are ignored. Within a splitsystem, Wn is simply the weight of split n. For example, ona split system, the commonly used Manhattan or weightedUniFrac (Lozupone et al. 2007) measure is given byP

n pin � pjn

��

��Wn, where the summation is over all splits

within the split system. We have recently evaluated a largenumber of measures that are a strict function of pin, pjn, andWn (Parks and Beiko 2012).

Measuring Qualitative and Quantitative Phylogeneticb Diversity over an Unrooted Split System

In practice, a credible rooting for a split system cannot alwaysbe established and phylogenetic b diversity must be deter-mined with a root invariant measure, that is, a measureproducing the same result regardless of root placement.Although root invariant qualitative measures such as themean nearest neighbor distance and mean phylogenetic dis-tance do exist (Webb et al. 2008; Parks and Beiko 2012), anyqualitative measures that must distinguish between sharedand unique splits require a rooted split system. To our know-ledge, all measures that are a function of the terms a, b, c, andd require distinguishing between these two types of splits.This is unfortunate as these measures are among the mostcommonly used and best understood qualitative measures. Incontrast, many of the most commonly used quantitative

3949

b Diversity over Phylogenetic Networks . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 4: Measuring community similarity with phylogenetic networks

measures such as Manhattan, Euclidean, and Gower are rootinvariant (Parks and Beiko 2012).

Interpretation of b Diversity Measured over a SplitSystem

Calculating b diversity over a split system allows incompatiblesplits representing phylogenetic uncertainty or conflict to beconsidered (fig. 2). When splits are assigned a weight propor-tional to the amount of phylogenetic signal supporting thesplit and accurately reflect the available signal, the dissimilar-ity measured between a pair of communities will be an aver-age over phylogenetic uncertainty and conflict. In practice, itis often unclear how to best represent the available phylo-genetic signal, and many inference methods have been pro-posed. In this article, we consider split systems inferred withthe unweighted pair group method with arithmetic mean(UPGMA; Legendre and Legendre 1998), neighbor-joining(Saitou and Nei 1987), neighbor-net (Bryant and Moulton2004), and median network (Bandelt et al. 1995) methods.UPGMA assumes a constant rate of evolution and producesan ultrametric tree where branch lengths must be adjusted toaccount for any discordant phylogenetic signal. Neighbor-joining relaxes the molecular clock assumption and aims toinfer a tree of minimal length under the "balanced minimumevolution" criterion (Gascuel and Steel 2006). Neighbor-netinfers a more general split system consisting of a collection ofcircular splits, which can contain mutually incompatiblesplits. A set of circular splits can always be represented as aplanar split network (Dress and Huson 2004), andneighbor-net tends to produce well-resolved networks(Bryant and Moulton 2004). Notably, neighbor-net will infer

a tree if the provided distance matrix is additive, that is, can beperfectly represented by a tree (Bryant et al. 2007). A mediannetwork allows all splits within a set of "binary" sequences tobe inferred. Although the resulting split systems are often toocomplicated to visualize (Huber et al. 2001), b diversity canstill be calculated over the inferred splits. The split systemsinferred with alternative methods can differ substantially,which will directly influence the phylogenetic dissimilaritymeasured between a pair of communities. These differencesin community similarity must be interpreted with respect tohow phylogenetic uncertainty and conflict are treated, andthe varying assumptions made by each method.

Pneumococcus Data Set

A multilocus sequence typing (MLST) data set ofStreptococcus pneumoniae serotype 1 isolates from 29 geo-graphic regions was compiled from previous studies (supple-mentary table S2, Supplementary Material online). Thesestudies characterized isolates by sequencing fragments ofseven genes according to the protocol of Enright and Spratt(1998). Each allele was assigned a distinct number, and eachset of seven unique alleles specifies a sequence type (ST). Thedistance between STs is the number of alleles in which theydiffer. Using these distances, a UPGMA tree, neighbor-joiningtree, and neighbor-net were inferred with SplitsTree v4.12.3(Huson and Bryant 2006). The relationship between sampleswas visualized using a UPGMA tree. Jackknife values werecalculated over 100 independent trials to assess the robust-ness of results to sequence subsampling (Lozupone andKnight 2005). For each trial, sequences were randomly

FIG. 2. Measuring b diversity over a split system provides an average over phylogenetic uncertainty and conflict. A set of trees representing alternativeevolutionary histories for a set of taxa is given on the left along with the percentage of data supporting each tree. This could represent the percentageof trees within a collection of trees or the percentage of sites within a multiple sequence alignment supporting each tree. The splits contained within thisset of trees can be represented by the split network on the right. Splits within the network have a weight equal to their mean length within the set oftrees. The total weight of shared (S), unique (U), root (R), and external (E) splits is given below each phylogeny. Splits are colored in the same manneras figure 1. Dissimilarity values obtained with the unweighted UniFrac (uUF) and quantitative Manhattan measures for communities A and B arealso given. The uUF distance measured on the split network is not the mean of the uUF distances measured over the set of trees. When calculating b di-versity over a split system, each split within the set of trees is considered and weighted by the amount of phylogenetic signal supporting the split.

3950

Parks and Beiko . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 5: Measuring community similarity with phylogenetic networks

drawn to reduce each sample to the number of sequencescontained in the smallest sample.

mtDNA Data Set

mtDNA sequences covering the first hypervariable segmentwere collected by van Oven et al. (2011) from nine groups onNias, Indonesia. We obtained these sequences from the NCBIPopSet database and analyzed all groups with a sample sizegreater than 10. Sequences were aligned with MUSCLE v3.8.31(Edgar 2004) using default parameters. A median networkwas inferred from a binary character representation of thealigned sequences using SplitsTree v4.12.3. Of the 59 variablesites, only 6 contained more than 2 character states. Thesesites were converted to binary characters using an R/Y encod-ing. This favors transversional changes by converting bases Aand G to Y (pyrimidine), and bases C and T to R (purine). Thisencoding scheme is used by Spectronet (Huber et al. 2002)and is similar to the encoding proposed by Bandelt et al.(1995). Pairwise FST values were calculated using Arlequin3.5.1.3 (Excoffier et al. 2005). Principal coordinate analysis(PCoA) plots were generated using custom scripts.Geographic visualizations showing the similarity of Hia toother groups on Nias were generated with GenGIS (Parkset al. 2009).

Proteorhodopsin Data Set

Proteorhodopsin sequences from 12 samples collected bySabehi et al. (2007) and three additional Mediterranean Seasamples submitted by Sabehi and Beja (ABD84734–ABD85012) were retrieved from GenBank. Protein sequenceswere aligned with MUSCLE v3.8.31 using default parameters.Initial and trailing columns were trimmed if they con-tained missing data for >90% of the sequences. Maximum-likelihood distances between sequences were calculated withthe WAG model of protein substitution (Whelan andGoldman 2001), and a neighbor-net was inferred withSplitsTree v4.12.3. The phylogeny was rooted with a eukary-otic rhodopsin from Pyrocystis lunula (AAO14677; de la Torreet al. 2003; Sabehi et al. 2005, 2007). The relationship betweensamples was visualized using a UPGMA tree and jackknifevalues calculated as described for the pneumococcus dataset. Genotype-specific samples were also considered by as-signing each protein sequence as either preferentially blue-absorbing (BPR) or green-absorbing (GPR) based on whetherthe amino acid at position 105 was a glutamine or leucine,respectively (supplementary table S3, SupplementaryMaterial online). Eight green-absorbing sequences containinga methionine at position 105 were ignored to replicate thedata set considered by Sabehi et al. (2007). When calculatingunweighted UniFrac values, we treated root splits as contri-buting to a (i.e., shared between the two communities).

Software Availability

Our software, Network Diversity, for calculating 11 qualitativeand 14 quantitative network-based measures of phylogeneticb diversity is freely available at http://kiwi.cs.dal.ca/Software/NetworkDiversity, last accessed August 23, 2012. The software

is open source and released under the GNU General PublicLicense. It is compatible with split systems generated by thewidely used SplitsTree4 software. The software is designed forlarge data sets, and the limiting factor for most analyses will bethe computational resources required to infer the underlyingsplit system. On a split system containing 10,000 taxa from500 environmental samples, calculating the b diversity be-tween all pairs of samples requires approximately 10 minon a standard desktop computer. A comprehensive compara-tive analysis of the measures available in our software hasbeen conducted, which provides practical advice on selectinga subset of measures likely to produce complementary in-sights into a particular data set (Parks and Beiko 2012).

Results

Pneumococcal Biogeography: Alternative PhylogeniesInfluence b Diversity

Pneumococcal infections are a major cause of morbidity andmortality, causing diseases ranging in severity from otitismedia to meningitis and pneumonia. Serotype 1 pneumo-cocci are increasingly responsible for invasive pneumococcaldiseases in several countries (Henriques Normark et al. 2001;McChlery et al. 2005; Obando et al. 2008) and a major causeof pneumonia and pulmonary empyema in children (Estevaet al. 2011). Here, we consider the biogeography of serotype 1isolates from 29 distinct geographic regions spanning 4 con-tinents (supplementary table S2, Supplementary Materialonline). We investigated the geographic distribution of sero-type 1 by applying the Manhattan phylogenetic b-diversitymeasure to a neighbor-net, neighbor-joining tree, andUPGMA tree inferred from serotype 1 MLST data. Differencesbetween these results reveal that the underlying phylogenycan substantially influence the similarity measured betweencommunities.

Hierarchical clustering of the Manhattan distances ob-tained over a neighbor-net shows clear geographic structuring(fig. 3A). A well-supported clustering separates North Ameri-can and European serotype 1 populations from those inAfrica, Asia, and South America. The North American andEuropean cluster is itself separated into its respective contin-ents with the exception of England, which falls within theNorth American cluster. Although the isolates collectedfrom South America appear highly distinct from those ob-tained in other regions, isolates collected from countries inAfrica and Asia are intermixed indicating they contain similarserotype 1 clones. These results are in agreement with anearlier study of serotype 1 isolates collected from 14 countries(Brueggemann and Spratt 2003). This is highly encouraging asBrueggemann and Spratt drew their conclusions by consider-ing a UPGMA phylogeny of serotype 1 STs along with a tableindicating the number of times each ST is observed in acountry. Phylogenetic b-diversity measures require the samedata as input but provide a less subjective and more scalablemethodology for exploring the relative similarity betweenpopulations.

Calculating b diversity over a UPGMA tree gives similarresults to those obtained on a neighbor-net (fig. 3B) despite

3951

b Diversity over Phylogenetic Networks . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 6: Measuring community similarity with phylogenetic networks

the neighbor-net containing an additional 67 (47% increase)splits (fig. 4). Although the majority of pairwise distancesbetween geographic regions are robust to the underlyingphylogeny, differences do exist. For example, the distancemeasured between isolates from Scotland and the Spanishcity of Gipuzkoa is substantially larger when considering aUPGMA tree instead of a neighbor-net (fig. 3C; 1.12 vs.0.74). A simple scaling cannot be used to resolve such incon-gruencies as many pairwise distances are already in agreement(e.g., Gipuzkoa and Czech Republic). To further investigatehow b-diversity results differ under alternative phylogenies,we compared the distances measured between every pair ofgeographic regions when all STs are considered or when a denovo analysis is performed using only the STs from the pair ofregions under consideration. Distances changed moredrastically when they were calculated over a UPGMA tree(fig. 3D). For example, although the distances measured be-tween Mozambique, Spain, and India are nearly identicalwhen measured over a neighbor-net inferred on the fulldata set or de novo, they change substantially when aUPGMA tree is used as the underlying phylogeny (fig. 3E).This suggests that considering the distance between specificsubsets of samples will generally be more robust when bdiversity is measured over a neighbor-net. Repeating theabove analysis with a neighbor-joining tree also indicatesthat measuring b diversity over a neighbor-net will readilyrecover strong patterns of variation but that considerationof phylogenetic uncertainty and conflict can substantially

change the dissimilarity measured between certain samples,for example, Quebec (supplementary fig. S1, SupplementaryMaterial online).

mtDNA Diversity in Nias: Contrasting Sequence- andPhylogenetic-Based Measures

Nias is an Indonesian island located approximately 120 km offthe western coast of Sumatra. Its inhabitants are known fortheir distinctive architecture (Bontaz 2009) and uniqueAustronesian dialect (Gray et al. 2009). Within Nias, there isa cultural and linguistic division between the northern andsouthern portions of the island (Beatty 1993; Bontaz 2009).A recent genetic study of Niasans found Y chromosomehaplogroups to be strongly differentiated between southand north populations, whereas mtDNA haplogroups weremore evenly distributed throughout the island (van Ovenet al. 2011). The more uniform distribution of mtDNA hap-logroups is likely the result of Niasan culture whereby awoman must marry a man from a different clan and relocateto her husband’s village. Here, we determine whether consid-eration of the phylogenetic relationships between mtDNAHSV1 sequences changes the conclusions of van Oven et al.(2011).

We assessed the similarity of Niasan groups by applying thephylogenetic-based Manhattan measure to a median net-work and the sequence-based FST measure. These two dis-similarity measures produce distinct patterns of relationshipsbetween Niasan groups (figs. 5A and B). Most notably, under

FIG. 3. Similarity of pneumococcus isolates from 29 countries measured over a neighbor-net or UPGMA tree. (A) Hierarchical clustering of theneighbor-net distance matrix with jackknife values greater than 60 indicated. Colors indicate the continent of each sample. (B) Scatter plot relating thedissimilarity values for every pair of countries over a neighbor-net (x axis) or UPGMA tree (y axis). The y = x line is shown as a dashed line, the linearregression line is shown in red, and Pearson’s correlation coefficient is given in the top-left corner. (C) Distances measured over a neighbor-net andUPGMA tree, which differ substantially for select pairs of geographic regions. (D) Histogram of the percentage difference between distances measuredfor every pair of geographic regions when the full data set is considered or when a de novo analysis is performed for each pair. (E) Comparison ofdistances measured between Mozambique (MOZ), Spain (ESP), and India (IND) when considering the full data set or performing a de novo analysisusing only STs from these three countries. To show that a linear scaling does not make the distances congruent, each set of measurements wasnormalized by the longest distance between any pair of countries.

3952

Parks and Beiko . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 7: Measuring community similarity with phylogenetic networks

FST, the Si’ulu population from southern Nias appears rela-tively distinct from all other populations, whereas theManhattan measure identifies this population as being closelyrelated to the southern Sarumaha population. This strikingdifference occurs despite the two measures being reasonablycorrelated (fig. 5C). Differences are not confined to the Si’ulupopulation nor are they an artifact of the PCoA plots. Forexample, the relative similarity of the centrally located Hiagroup to the other Niasan groups depends strongly on thedissimilarity measure used (figs. 5D and E), and subsets ofgroups exist whose dissimilarity is only poorly correlated be-tween the two measures (fig. 5F; Pearson’s r = 0.37). Althoughconsideration of the phylogenetic structure between mtDNAsequences substantially changes the relative similarity be-tween Niasan groups, our results largely corroborate the con-clusions of van Oven et al. (2011). Neither the FST norManhattan measures support a north–south division inmtDNA diversity.

Distribution of Proteorhodopsins: Qualitative andQuantitative b Diversity

Proteorhodopsin proteins provide a diverse range of aquaticbacteria with a light-driven proton pump suggesting thatphotosynthesis may play a significant role in the metabolismof aquatic ecosystems (Beja et al. 2000; Sharma et al. 2008).To investigate the distribution of proteorhodopsin proteins,Sabehi et al. (2007) collected sequences from 12 environmen-tal samples. Three samples were taken at the BATS station inthe Sargasso Sea in March 1998 and March 2003 when deepwater mixing occurs, and another three samples were col-lected in July 1998 when the water is highly stratified.Samples were taken at depths of 0, 40, and 80 m. Analogoussamples were taken from the H01 station in the Mediterra-nean Sea in January 2006 (mixed) and May 2003 (stratified) atdepths of 0, 20, and 50–55 m. Here, we analyze these samplesalong with three additional Mediterranean samples taken at

FIG. 4. UPGMA tree and neighbor-net of pneumococcus serotype 1 isolates from 29 geographic regions. For clarity, only the most common STs areshown along with STs useful as landmarks. The UPGMA tree consists of 143 splits and has a fit statistic of 0.81, whereas the neighbor-net consists of 210splits and has a fit statistic of 0.93. Fit is deEned as the sum of all pairwise distances between taxa in the tree divided by the sum of all pairwise distancesin the input distance matrix.

3953

b Diversity over Phylogenetic Networks . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 8: Measuring community similarity with phylogenetic networks

the same depths and collected 90 km away at the TB04 sta-tion in February 2006 (supplementary table S3, Supplemen-tary Material online). We consider a rooted neighbor-net toestablish that the conclusions of Sabehi et al. (2007) regardingthe geographic, seasonal, and depth structuring of thesesamples can be recovered using both a qualitative and quan-titative network-based measure of b diversity. Furthermore,we investigate the structuring of specific proteorhodopsingenotypes.

Hierarchical clustering of the unweighted UniFrac dissim-ilarities among proteorhodopsin communities revealed per-fect clustering by geographic location and a strong seasonaldependence (fig. 6A). Samples taken at different depthsduring periods of water mixing were more similar to eachother than the stratified samples, but both sets of samplesshowed geographic structuring (fig. 6B). Under the quantita-tive Manhattan measure, stratified communities were nolonger separated perfectly by geography (supplementary fig.S2A, Supplementary Material online) and showed high vari-ation in the dissimilarity measures between samples (supple-mentary fig. S2B, Supplementary Material online). In contrast,mixed communities were strongly geographically structuredand even showed small-scale structuring between the H01and TB04 samples taken only 90 km apart. The averageManhattan distance between samples is 0.14 for the mixedH01 samples and 0.12 for TB04 samples, in comparison to0.23 across these two sets of samples. Together, the qualitativeand quantitative results suggest that during periods of water

mixing the proteorhodopsin proteins at these two stationsare similar but differ in their relative abundance.

Proteorhodopsins are preferentially blue-absorbing (BPR)or green-absorbing (GPR) based largely on the amino acid atposition 105 (Beja et al. 2001; Man et al. 2003). The structur-ing determined above may be reliant on both spectral geno-types, primarily due to a single genotype, or observedindividually for each genotype. To test this, we define geno-type communities by dividing each sample into two sets con-sisting exclusively of BPR or GPR proteins. This results in 24genotype communities as only a single GPR sequence wasfound in the Sargasso Sea. Applying the unweighted UniFracmeasure to these communities indicated that these twogenotypes are phylogenetically distinct (figs. 6C and D).Congruent with the results for the undivided samples,strong seasonal dependencies were observed for both geno-types. However, perfect geographic clustering was notobserved between the BPR samples indicating that this struc-turing is largely due to the absence of GPR proteins in theSargasso Sea. Similar results were obtained under theManhattan measure. Samples clustered perfectly by genotypeand showed strong seasonal structuring within each genotype(supplementary figs. S2C and S2D, Supplementary Materialonline). Although there was evidence of small-scale geo-graphic structuring between the mixed BPR samples fromH01 and TB04, the mixed GPR samples showed little structur-ing. This may be a result of higher dispersal of GPR proteinsrelative to BPR proteins. However, the small number of

FIG. 5. Similarity of nine groups residing on Nias determined using FST or by applying the Manhattan phylogenetic b-diversity measure to a mediannetwork. Ethnic groups are abbreviated as follows: Daeli (Da), Fau (Fa), Gozo (Go), Hia (Hi), Ho (Ho), Sarumaha (Sa), Si’ulu (Si), Zalukhu (Za), and Zebua(Ze). PCoA plot of FST values (A) and Manhattan distances (B). The percentage of total variance explained by each axis is shown in parentheses.(C) Scatter plot relating the dissimilarity values for every pair of groups determined with the Manhattan (x axis) or FST (y axis) measures. The linearregression line is shown in red and Pearson’s correlation coefficient is given in the top-left corner. Dissimilarity measured between Hi and each of theother eight groups as determined using the FST (D) or Manhattan (E) measures. (F) Comparison of FST values and Manhattan distances between the Ho,Sa, and Ze ethnic groups. To aid in comparing the relative magnitude of values, each set of measurements was normalized by the longest distancebetween any pair of groups.

3954

Parks and Beiko . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 9: Measuring community similarity with phylogenetic networks

GPR proteins found within these samples may be undulyinfluencing the results, and deeper sequencing is requiredto further explore the structuring of specific proteorhodopsingenotypes.

DiscussionTree inference algorithms produce a single tree model even ifthere is substantial incongruent phylogenetic signal, whereassplit systems allow uncertainty and conflict to be modeled.Phylogenetic b-diversity measures have been shown to pro-duce similar results under varying tree inference methods(Lozupone et al. 2007), and we have repeatedly observed, atleast when strong patterns of community variation exist, thatdissimilarity values will remain highly correlated even whenmeasuring b diversity over an incompatible split system (datanot shown). Identifying similar patterns of community vari-ation on an inferred tree and on a split system provides strongevidence that results are not due to systematic errors (e.g., theforcing of data into a tree model). For this reason, we recom-mend measuring dissimilarity on a network as it demon-strates that a given pattern of variation is robust touncertainty and conflict in the available phylogenetic signal.When community relationships are sensitive to the under-lying method used for phylogenetic inference, results must beinterpreted carefully. For example, although the primaryconclusions regarding the biogeographic distribution ofpneumococcus serotype 1 were recovered using UPGMA,neighbor-joining, and neighbor-net, the placement of theQuebec sample depended on whether phylogenetic inferencewas performed with neighbor-joining or neighbor-net

(contrast fig. 3A and supplementary fig. S1A,Supplementary Material online). This may be the result ofthe neighbor-joining algorithm forcing incongruent signalinto a tree model or the neighbor-net algorithm inferringincompatible splits that would best be viewed as phylogeneticnoise.

Our analysis of mtDNA diversity in Nias demonstrates thatdissimilarity values obtained using the sequence-based FST

measure can vary considerably from those obtained by apply-ing the Manhattan measure to a median network (Pearson’sr = 0.82). This disagreement between b-diversity measures,while not necessarily surprising, is of particular interest as itcontrasts the population- and lineage-based approachescommonly used in studies of human biogeography (Paken-dorf and Stoneking 2005). In a population-based analysis, FST

or a related measure is used to explore the relationship be-tween populations, whereas in a lineage-based analysis, a(often simplified) median network is used to investigate theevolutionary history of mtDNA or Y-chromosome hap-logroups (Bandelt et al. 1995; Huber et al. 2001). By extendingb-diversity measures to median networks, population-basedanalyses can be performed directly over a median networkinstead of on the underlying sequence data. In practice, thedissimilarity values obtained by applying FST to sequence dataor by applying the Manhattan measure to a median networkcan be highly negatively correlated indicating that thesemeasures provide complementary insights into the variationbetween populations (supplementary fig. S3, SupplementaryMaterial online). Further investigation is warranted to deter-mine which network-based b-diversity measures are best

FIG. 6. Relationship between proteorhodopsin communities and genotype-specific samples determined by applying unweighted UniFrac to a rootedneighbor-net. (A) Hierarchical clustering of 15 proteorhodopsin samples collected from the Sargasso (Sar) or Mediterranean (Med) Seas with jackknifevalues greater than 50 indicated. (B) Mean (±standard error of the mean [SEM]) unweighted UniFrac dissimilarity values for all samples either within thesame or between (BW) the two sampling locations. The same analysis was also repeated using only samples taken during periods of deep water mixingor when the water is highly stratified. (C) Hierarchical clustering of blue (BPR) and green (GPR) proteorhodopsin genotype samples with jackknife valuesgreater than 40 indicated. (D) Mean (±SEM) unweighted UniFrac dissimilarity values for samples from the same genotype or between samples fromdifferent genotypes. Results are also shown for BPR samples collected from the Sargasso or Mediterranean Seas.

3955

b Diversity over Phylogenetic Networks . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 10: Measuring community similarity with phylogenetic networks

suited for studying human populations, the influence bio-logically motivated simplifications of median networks haveon measured b diversity (i.e., reduced and prunednetworks, Bandelt et al. 1995; Huber et al. 2001) andhow network-based measures complement traditionalsequence-based measures.

Qualitative and quantitative measures of b diversity pro-vide complementary information on community variation(Legendre and Legendre 1998; Lozupone et al. 2007).On the proteorhodopsin data set, communities sampledduring periods of water mixing were found to be stronglygeographically structured under both the unweighted (quali-tative) UniFrac and quantitative Manhattan measures. Thissuggests that unique proteorhodopsin lineages are present inthe Sargasso and Mediterranean Seas (qualitative results) andthat such lineages do not consist only of rare proteins (quan-titative results). In contrast, the small-scale geographic struc-turing between the H01 and TB04 samples is only recoveredunder the quantitative Manhattan measure suggesting thatthese sites contain similar proteins but in different abun-dances. Although the additional insights provided by applyingboth a qualitative and quantitative measure are clear, in prac-tice applying qualitative measures is complicated by requiringa rooted split system (Parks and Beiko 2012). Rooting anincompatible split system can be problematic as there willoften be several low-weight splits where both induced subsetscontain ingroup and outgroup taxa. As such, our proposeddefinitions for calculating qualitative and quantitative b di-versity on a split system ignore any splits where both inducedsubsets contain outgroup taxa. Nonetheless, care should betaken to ensure that outgroup taxa are tightly clusteredwithin the split system. For closely related taxa, an interestingalternative is to root a split system based on neutral coales-cent theory (Castelloe and Templeton 1994).

We have described how an important class of phylogen-etic b-diversity measures can be applied to split systemscontaining incompatible splits. This allows phylogenetic un-certainty and conflict to be considered in the assessmentof community variation and provides a methodology fortesting whether the assumption of a bifurcating evolution-ary history is causing erroneous patterns of relationshipsbetween communities. Although split systems are themostly widely used class of phylogenetic networks, theysuffer from providing only an abstract representation ofthe relationship between taxa. Extending b-diversity meas-ures to networks that explicitly model recombination orhybridization would allow the influence of these evolution-ary events on community variation to be explored. Webelieve the branch classification scheme proposed here isa strong starting ground for extending b-diversity measuresin this direction.

Supplementary MaterialSupplementary tables S1–S3 and figures S1–S3 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

This work was supported by the Killam Trusts and the TulaFoundation to D.H.P. and by Genome Atlantic, the CanadaFoundation for Innovation, and the Canada Research Chairsprogram to R.G.B.

ReferencesAnderson MJ, Crist TO, Chase JM, et al. (14 co-authors). 2011.

Navigating the multiple meanings of b diversity: a roadmap for

the practicing ecologist. Ecol Lett. 14:19–28.

Bandelt HJ, Dress AWM. 1992. A canonical decomposition theory for

metrics on a finite set. Adv Math. 92:47–105.

Bandelt HJ, Dress AWM. 1993. A relational approach to split

decomposition. Technical report. Bielefeld, Germany: Universitat

Bielefeld.

Bandelt HJ, Forster P, Sykes BC, Richards MB. 1995. Mitochondrial por-

traits of human populations using median networks. Genetics 141:

743–753.

Beatty AW. 1993. Nias. In: Levinson D, Hockings P, editors. Encyclopedia

of world cultures. Vol. V, East and Southeast Asia. New York: G.K.

Hall & Co. p. 194–197.

Beja O, Aravind L, Koonin EV, et al. (12 co-authors). 2000. Bacterial

rhodopsin: evidence for a new type of phototrophy in the sea.

Science 289:1902–1904.

Beja O, Spudich EN, Spudich JL, Leclerc M, DeLong EF. 2001. Proteorho-

dopsin phototrophy in the ocean. Nature 411:786–789.

Bontaz D. 2009. The megaliths in Nias. In: Gruber P, Herbig U, editors.

Traditional architecture and art on Nias, Indonesia. Vienna (Austria)

IVA-ICR [Internet]. [cited 2012 AUg 23]. Available from: http://

www.nirn.org/sites/papers.htm.

Brueggemann AB, Spratt BG. 2003. Geographic distribution and clonal

diversity of Streptococcus pneumoniae serotype 1 isolates. J Clin

Microbiol. 41:4966–4970.

Bryant D, Moulton V. 2004. Neighbor-net: an agglomerative method for

the construction of phylogenetic networks. Mol Biol Evol. 21:

255–265.

Bryant D, Moulton V, Spillner A. 2007. Consistency of the neighbor-net

algorithm. Algorithms Mol Biol. 2:8.

Bryant JA, Lamanna C, Morlon H, Kerkhoff AJ, Enquist BJ, Green JL. 2008.

Microbes on mountainsides: contrasting elevational patterns of bac-

terial and plant diversity. Proc Natl Acad Sci U S A. 105(Suppl 1):

11505–11511.

Caporaso JG, Kuczynski J, Sombaught J, et al. (27 co-authors). 2010.

QIIME allows analysis of high-throughput community sequencing

data. Nat Methods. 7:335–336.

Caporaso JG, Lauber CL, Costello EK, et al. (12 co-authors). 2011. Moving

pictures of the human microbiome. Genome Biol. 12:R50.

Castelloe J, Templeton AR. 1994. Root probabilities for intraspecific gene

trees under neutral coalescent theory. Mol Phylogenet Evol. 3:

102–113.

Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R. 2009.

Bacterial community variation in human body habitats across space

and time. Science 326:1694–1697.

de la Torre JR, Christianson LM, Beja O, Suzuki MT, Karl DM, Heidelberg

J, DeLong EF. 2003. Proteorhodopsin genes are distributed among

divergent marine bacterial taxa. Proc Natl Acad Sci U S A. 100:

12830–12835.

3956

Parks and Beiko . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 11: Measuring community similarity with phylogenetic networks

Dress AWM, Huson DH. 2004. Constructing splits graphs. IEEE/ACM

Trans Comput Biol Bioinf. 1:109–115.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high ac-

curacy and high throughput. Nucleic Acids Res. 32:1792–1797.

Enright MC, Spratt BG. 1998. A multilocus sequence typing

scheme for Streptococcus pneumoniae: identification of clones

associated with serious invasive disease. Microbiology 144:3049–3060.

Esteva C, Selva L, de Sevilla MF, Garcia-Garcia JJ, Pallares R, Munoz-

Almagro C. 2011. Streptococcus pneumoniae serotype 1 causing in-

vasive disease among children in Barcelona over a 20-year period

(1989–2008). Clin Microbiol Infect. 17:1441–1444.

Excoffier L, Laval G, Schneider S. 2005. Arlequin (version 3.0): an inte-

grated software package for population genetics data analysis. Evol

Bioinform Online. 1:47–50.

Ferrier S, Manion G, Elith J, Richardson K. 2007. Using generalized

dissimilarity modeling to analyse and predict patterns of beta

diversity in regional biodiversity assessment. Diversity Distrib. 13:

252–264.

Gascuel O, Steel M. 2006. Neighbor-joining revealed. Mol Biol Evol. 23:

1997–2000.

Graham CH, Fine PVA. 2008. Phylogenetic beta diversity: linking

ecological and evolutionary processes across space in time. Ecol

Lett. 11:1265–1277.

Gray RD, Drummond AJ, Greenhill SJ. 2009. Language phylogenies reveal

expansion pulses and pauses in Pacific settlement. Science 323:

479–483.

Hamady M, Lozupone C, Knight R. 2010. Fast UniFrac: facilitating

high-throughput phylogenetic analyses of microbial communities

including analysis of pyrosequencing and PhyloChip data. ISME J.

4:17–27.

Henriques Normark B, Kalin M, Ortqvist A, et al. (11 co-authors). 2001.

Dynamics of penicillin-susceptible clones in invasive pneumococcal

disease. J Infect Dis. 184:861–869.

Herrnstadt C, Elson JL, Fahy E, et al. (11 co-authors). 2002. Reduced-

median-network analysis of complete mitochondrial DNA

coding-region sequences for the Major African, Asian, and

European Haplogroups. Am J Hum Genet. 70:1152–1171.

Ho SYW, Jermiin LS. 2004. Tracing the decay of the historical signal in

biological sequence data. Syst Biol. 53:623–637.

Holland B, Conner G, Huber K, Moulton V. 2007. Imputing supertrees

and supernetworks from quartets. Syst Biol. 56:57–67.

Holland B, Delsuc F, Moulton V. 2005. Visualizing conflicting evolution-

ary hypothesis in large collections of trees using consensus networks.

Syst Biol. 54:66–76.

Holsinger KE, Weir BS. 2009. Genetics in geographically structured popu-

lations: defining, estimating and interpreting FST. Nat Rev Genet. 10:

639–650.

Huber KT, Langton M, Penny D, Moulton V, Hendy M. 2002. Spectronet:

a package for computing spectra and median networks. Appl

Bioinform. 1:159–161.

Huber KT, Moulton V, Lockhart P, Dress A. 2001. Pruned median net-

works: a technique for reducing the complexity of median networks.

Mol Phylogenet Evol. 12:302–310.

Hudson RR, Slatkin M, Maddison WP. 1992. Estimation of levels of gene

flow from DNA sequence data. Genetics 132:583–589.

HUGO Pan-Asian SNP Consortium. 2009. Mapping human genetic di-

versity in Asia. Science 326:1541:1545.

Huson DH, Bryant D. 2006. Application of phylogenetic networks in

evolutionary studies. Mol Biol Evol. 23:254–267.

Huson DH, Dezulian T, Klopper T, Steel MA. 2004. Phylogenetic

super-networks from partial trees. IEEE/ACM Trans Comput Biol

Bioinf. 1:151–158.

Huson DH, Scornavacca C. 2011. A survey of combinatorial methods for

phylogenetic networks. Genome Biol Evol. 3:23–35.

Jones RT, Martin AP. 2006. Testing for differentiation of microbial com-

munities using phylogenetic methods: accounting for uncertainty of

phylogenetic inference and character state mapping. Microb Ecol. 52:

408–417.

Kayser M, Choi Y, van Oven M, Mona S, Brauer S, Trent RJ, Saurkia

D, Schiefenhovel W, Stoneking M. 2008. The impact of the

Austronesian expansion: evidence form mtDNA and Y chromo-

some diversity in the Admiralty Islands of Melanesia. Mol Biol

Evol. 25:1362–1374.

Kennedy M, Holland BR, Gray RD, Spencer HG. 2005. Untangling long

branches: identifying conflicting phylogenetic signals using spectral

analysis, neighbor-net, and consensus networks. Syst Biol. 54:

620–633.

Koleff P, Gaston KJ, Lennon JJ. 2003. Measuring beta diversity for

presence-absence data. J Anim Ecol. 72:367–382.

Kuczynski J, Liu Z, Lozupone C, McDonald D, Fierer N, Knight R. 2010.

Microbial community resemblance methods differ in their ability to

detect biologically relevant patterns. Nat Methods. 7:813–819.

Legendre P, Legendre L. 1998. Numerical ecology. 2nd English ed.

Amsterdam (The Netherlands): Elsevier Science.

Lozupone C, Knight R. 2005. UniFrac: a new phylogenetic method for

comparing microbial communities. Appl Environ Microbiol. 71:

8228–8235.

Lozupone CA, Hamady M, Cantarel BL, Coutinho PM, Henrissat B,

Gordon JI, Knight R. 2008. The convergence of carbohydrate

active gene repertoires in human gut microbes. Proc Natl Acad Sci

U S A. 105:15076–15081.

Lozupone CA, Hamady M, Kelley ST, Knight R. 2007. Quantitative and

qualitative b diversity measures lead to different insights into factors

that structure microbial communities. Appl Environ Microbiol. 73:

1576–1585.

Man D, Wang W, Sabehi G, Aravind L, Post AF, Massana R, Spudich EN,

Spudich JL, Beja O. 2003. Diversification and spectral tuning in

marine proteorhodopsins. EMBO J. 22:1725–1731.

Martin AP. 2002. Phylogenetic approaches for describing and comparing

the diversity of microbial communities. Appl Environ Microbiol. 68:

3673–3682.

Martiny JBH, Eisen JA, Penn K, Allison SD, Horner-Devine MC. 2011.

Drivers of bacterial b-diversity depend on spatial scale. Proc Natl

Acad Sci U S A. 108:7850–7854.

McChlery SM, Scott KJ, Clarke SC. 2005. Clonal analysis of invasive

pneumococcal isolates in Scotland and coverage of serotypes by

the licensed conjugate polysaccharide pneumococcal vaccine: pos-

sible implications for UK vaccine policy. Eur J Clin Microbiol Infect

Dis. 24:262–267.

Minh BQ, Klaere S, von Haeseler A. 2009. Taxon selection under split

diversity. Syst Biol. 58:586–594.

Mitra S, Gilbert JA, Field D, Huson DH. 2010. Comparison of multiple

metagenomes using phylogenetic networks based on ecological in-

dices. ISME J. 4:1236–1242.

Morrison DA. 2005. Networks in phylogenetic analysis: new tools for

population biology. Int J Parasitol. 35:567–582.

Morrison DA. 2011. Introduction to phylogenetic networks. Uppsala

(Sweden): RJR Productions.

3957

b Diversity over Phylogenetic Networks . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018

Page 12: Measuring community similarity with phylogenetic networks

Nipperess DA, Faith DP, Barton K. 2010. Resemblance in phylogeneticdiversity among ecological assemblages. J Veg Sci. 21:809–820.

Obando I, Munoz-Almargo C, Arroyo LA, et al. (12 co-authors). 2008.Pediatric parapneumonic empyema. Emerg Infect Dis. 14:1390–1397.

Pakendorf B, Stoneking M. 2005. Mitochondrial DNA and human evo-lution. Annu Rev Genomics Hum Genet. 6:165–183.

Parker J, Rambaut A, Pybus OG. 2008. Correlating viral phenotypes withphylogeny: accounting for phylogenetic uncertainty. Infect GenetEvol. 8:239–246.

Parks DH, Beiko RG. 2012. Measures of phylogenetic differentiationprovide robust and complementary insights into microbial commu-nities. ISME J. Advance Access published August 2, 2012,doi:10.1038/ismej.2012.88.

Parks DH, Porter M, Churcher S, Wang S, Blouin C, Whalley J, Brooks S,Beiko RG. 2009. GenGIS: a geospatial information system for gen-omic data. Genome Res. 19:1896–1904.

Rousk J, Baath E, Brookes PC, Lauber CL, Lozupone C, Caporaso JG,Knight R, Fierer N. 2010. Soil bacterial and fungal communitiesacross a pH gradient in an arable soil. ISME J. 4:1340–1351.

Sabehi G, Kirkup BC, Rozenberg M, Stambler N, Polz MF, Beja O. 2007.Adaptation and spectral tuning in divergent marineproteorhodopsins from the eastern Mediterranean and theSargasso Seas. ISME J. 1:48–55.

Sabehi G, Loy A, Jung KH, Partha R, Spudich JL, Issacson R,Hirschberg J, Wagner M, Beja O. 2005. New insights into meta-bolic properties of marine bacteria encoding proteorhodopsins.PLoS Biol. 3:e273.

Saitou N, Nei M. 1987. The neighbor-joining method: a new method forreconstructing phylogenetic trees. Mol Biol Evol. 4:406–425.

Schloss PD, Westcott SL, Ryabin T, et al. (15 co-authors). 2009.Introducing mothur: open-source, platform-independent, commu-nity-supported software for describing and comparing microbial

communities. Appl Environ Microbiol. 75:7537–7541.

Sharma AK, Zhaxybayeva O, Papke RT, Doolittle WF. 2008.Actinorhodopsins: proteorhodopsin-like gene sequences found pre-dominantly in non-marine environments. Environ Microbiol. 10:1039–1056.

Spillner A, Nguyen BT, Moulton V. 2008. Computingphylogenetic diversity for split systems. IEEE-ACM T Comput Bi. 5:235–244.

van Oven M, Hammerle JM, van Schoor M, Kushnick G, Pennekamp P,Zega I, Lao O, Brown L, Kennerknecht I, Kayer M. 2011.Unexpected island effects at an extreme: reduced Y chromosomeand mitochondrial DNA diversity in Nias. Mol Biol Evol. 28:1349–1361.

Wang TH, Donaldson YK, Brettle RP, Bell JE, Simmonds P. 2001.Identification of shared populations of human immunodeficiencyvirus type 1 infecting microglia and tissue macrophages outside thecentral nervous system. J Virol. 75:11686–11699.

Webb CO, Ackerly DD, Kembel SW. 2008. Phylocom: software for theanalysis of phylogenetic community structure and trait evolution.Bioinformatics 24:2098–2100.

Whelan S, Goldman N. 2001. A general empirical model of proteinevolution derived from multiple protein families using a maxi-mum-likelihood approach. Mol Biol Evol. 18:691–699.

3958

Parks and Beiko . doi:10.1093/molbev/mss200 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/29/12/3947/1011860by gueston 06 April 2018