phylogeny driven approaches to genomic and metagenomic studies

Searching for novelty using phylogeny-driven approaches to genomics and metagenomics. iPAM, November 15, 2011. Jonathan A. Eisen, University of California, Davis.

Upload: jonathan-eisen

Post on 10-May-2015


TRANSCRIPT

Page 1: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Searching for novelty using phylogeny-driven approaches to genomics and metagenomics

iPAM, November 15, 2011

Jonathan A. Eisen, University of California, Davis

Wednesday, November 16, 11

Page 4: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Phylogeny

• Phylogeny is a description of the evolutionary history of relationships among organisms (or their parts).
• This is frequently portrayed in a diagram called a phylogenetic tree.
• Phylogenies can be more complex than a bifurcating tree (e.g., lateral gene transfer, recombination, hybridization)

Page 5: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Four Models for Rooting TOL

from Lake et al. doi: 10.1098/rstb.2009.0035

Whatever the History: Trying to Incorporate it is Critical

Page 6: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Evolutionary Rate Variation

Page 7: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Uses of Phylogeny

• Applies to
  – Species
  – Genes
  – Genomes

Page 8: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Uses of Phylogeny in Genomics and Metagenomics

Example 1:

Phylotyping

Page 9: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Phylotyping

Page 10: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Phylotyping

• Collect DNA from environment
• PCR amplify rRNA genes using broad (so-called universal) primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified” by placement on tree
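The last step, identifying an unknown by where it falls on the tree, can be illustrated with a small sketch. This is not the method of any particular tool: the nested-tuple tree, the sequence names, and the function `classify_by_placement` are all hypothetical, and the tree is assumed to have been inferred already.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def classify_by_placement(tree, query, known):
    """Return the labels of the references in the smallest clade that
    contains the query and at least one labeled reference sequence."""
    node = tree
    while not isinstance(node, str):
        # Step into the child clade that holds the query ...
        child = next(c for c in node if query in leaves(c))
        # ... but stop once that clade no longer contains any reference.
        if not any(name in known for name in leaves(child)):
            break
        node = child
    return {known[name] for name in leaves(node) if name in known}


# Hypothetical reference labels and two toy trees.
known = {
    "E_coli": "Gammaproteobacteria",
    "Vibrio": "Gammaproteobacteria",
    "Bacillus": "Firmicutes",
    "Clostridium": "Firmicutes",
}
tree_a = ((("E_coli", "unknown_1"), "Vibrio"), ("Bacillus", "Clostridium"))
tree_b = ("unknown_2", (("E_coli", "Vibrio"), ("Bacillus", "Clostridium")))

print(classify_by_placement(tree_a, "unknown_1", known))
print(classify_by_placement(tree_b, "unknown_2", known))
```

In the first toy tree the unknown sits next to a single labeled group, so the placement yields one label. In the second it branches outside all the references, so the result contains more than one group, which is the kind of deep-branching sequence a top-BLAST-hit approach would silently assign to whichever reference scored best.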

Page 12: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Data Overload #1, circa 2003

• 1000s of rRNA sequences per sample being generated via Sanger Sequencing
• most being classified by BLAST searches and ID of top hit
• seemed like a bad idea ...

Page 13: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Metagenomics

shotgun sequence

Page 14: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

STAP

STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.

Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which “positional homology” cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.

First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus “nucleotide” for each column (X) also has R dimensions, with the score for each dimension (Xr) calculated as the average of the scores for that column in that dimension (average of Sr). Thus the score of the consensus nucleotide is a mathematical expression describing the average “nucleotide” in that column for that alignment.
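The consensus step described above can be sketched in a few lines. This is a minimal illustration, not the STAP code: it assumes R = 4 (A, C, G, T only), skips gaps and ambiguity codes, and uses a placeholder match/mismatch table (`iub_like_score`) in place of the real IUB matrix values, which the excerpt does not list.

```python
ALPHABET = "ACGT"  # assumption: R = 4, ambiguity codes and gaps ignored


def iub_like_score(a, b):
    """Placeholder for the IUB matrix: 1.0 for a match, 0.0 otherwise.
    The real matrix values are not given in the excerpt."""
    return 1.0 if a == b else 0.0


def residue_vector(nucleotide):
    """Score one nucleotide in each of the R dimensions (the Sr values)."""
    return [iub_like_score(nucleotide, r) for r in ALPHABET]


def column_consensus(column):
    """Consensus 'nucleotide' X for one column: each Xr is the
    average of Sr over the sequences populating that column."""
    vectors = [residue_vector(nt) for nt in column if nt in ALPHABET]
    if not vectors:
        return [0.0] * len(ALPHABET)
    n = len(vectors)
    return [sum(v[r] for v in vectors) / n for r in range(len(ALPHABET))]


# One alignment column read top to bottom across four sequences.
print(column_consensus("AAAC"))  # [0.75, 0.25, 0.0, 0.0]
```

A well-conserved column gives a consensus vector concentrated in one dimension, while a variable column spreads it out. The excerpt stops before describing how the consensus is turned into a column score and a masking decision, so that part is left out here.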

Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002

Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001

ss-rRNA Taxonomy Pipeline

PLoS ONE | www.plosone.org 5 July 2008 | Volume 3 | Issue 7 | e2566

Wu et al. 2008 PLoS One

Page 15: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

STAP

Each sequence analyzed separately

Page 16: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Combine all into one alignment

Page 17: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Metagenomic Phylogenetic challenge

A single tree with everything

Page 21: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes”. Bar chart of weighted % of clones (y-axis, 0 to 0.500) by major phylogenetic group (x-axis; labels recovered: Alphaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Firmicutes, Chlorobi, Chloroflexi, Fusobacteria, Euryarchaeota), with one series per marker gene: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

Venter et al., Science 304: 66-74. 2004

Page 22: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Really Weird Stuff Out There

Page 23: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.

Based on tree from Pace 1997 Science 276:734-740

Archaea

Eukaryotes

Bacteria

Page 27: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

??????

Wu et al. (2011) PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011

Page 28: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

Scanned through GOS data for rRNAs that fit this pattern

Page 29: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

Found many, but closer examination revealed all to have issues

Page 30: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

RecA????

Page 31: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

RecA

[Figure: RecA; groups labeled GOS 1, GOS 2, GOS 3, GOS 4, GOS 5]

Page 32: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

RpoB Too

Page 33: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

+++++

Page 34: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Automation for Proteins

Page 35: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

AMPHORA

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Page 36: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

WGT

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Page 37: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

AMPHORA

Guide tree

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Page 38: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Page 39: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

MEGAN analysis of metagenomic data

Daniel H. Huson,1,3 Alexander F. Auch,1 Ji Qi,2 and Stephan C. Schuster2,3

1Center for Bioinformatics, Tübingen University, Sand 14, 72076 Tübingen, Germany; 2Center for Comparative Genomics and Bioinformatics, Center for Infectious Disease Dynamics, Penn State University, University Park, Pennsylvania 16802, USA

Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity. The taxonomical content of such a sample is usually estimated by comparison against sequence databases of known sequences. Most published studies use the analysis of paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging sequencing-by-synthesis technologies with very high throughput are paving the way to low-cost random “shotgun” approaches. This paper introduces MEGAN, a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results. A simple lowest common ancestor algorithm assigns reads to taxa such that the taxonomical level of the assigned taxon reflects the level of conservation of the sequence. The software allows large data sets to be dissected without the need for assembly or the targeting of specific phylogenetic markers. It provides graphical and statistical output for comparing different data sets. The approach is applied to several data sets, including the Sargasso Sea data set, a recently published metagenomic data set sampled from a mammoth bone, and several complete microbial genomes. Also, simulations that evaluate the performance of the approach for different read lengths are presented.

[MEGAN is freely available at http://www-ab.informatik.uni-tuebingen.de/software/megan.]

The genomic revolution of the early 1990s targeted the study of individual genomes of microorganisms, plants, and animals. While this type of analysis has almost become routine, the genomic analysis of complex mixtures of organisms remains challenging. Metagenomics has been defined as “the genomic analysis of microorganisms by direct extraction and cloning of DNA from an assemblage of microorganisms” (Handelsman 2004), and its importance stems from the fact that 99% or more of all microbes are deemed to be unculturable. Goals of metagenomic studies include assessing the coding potential of environmental organisms, quantifying the relative abundances of (known) species, and estimating the amount of unknown sequence information (environmental sequences) for which no species, or only distant relatives, have yet been described. It is useful to extend Handelsman’s definition to also include sequences from higher organisms as well as just microorganisms, thus opening the door to “environmental forensics.” By vastly extending the currently available sequences in databases, metagenomics promises to lead to the discovery of new genes that have useful applications in biotechnology and medicine (Steele and Streit 2005).

Early metagenomics projects (Béja et al. 2000, 2001) were plagued by potential biases that are due to DNA extraction and cloning methods (Martiny et al. 2006). Clone libraries were constructed from environmental DNA using fosmid and BAC vectors as vehicles for DNA propagation and amplification. The libraries were subsequently screened for specific phylogenetic markers, and paired-end sequencing was undertaken on clones of interest. Overlapping clones, sequenced in their entirety, were scaffolded into super-contigs, giving a snapshot of an organism’s genomic features, such as GC content, codon usage, or coding density. This strategy was soon complemented by whole (meta)-genome sequencing using a “shotgun” approach (Venter et al. 2004) that employs cloning and paired-end sequencing of plasmid libraries. Recent projects based on these methodologies include data sets from an acid mine biofilm (Tyson et al. 2004), seawater samples (Venter et al. 2004; DeLong et al. 2006), deep-sea sediment (Hallam et al. 2004), or soil and whale falls (Tringe et al. 2005).

These projects all use “Sanger sequencing,” based on cloning, fluorescent dideoxynucleotides, and capillary electrophoresis (Meldrum 2000a,b). Recently, a new “sequencing-by-synthesis” strategy was published (Margulies et al. 2005; Zhang et al. 2006). This approach uses emulsion-based PCR amplification of a large number of DNA fragments and parallel pyrosequencing with high throughput. In a single sequencing run, >20 million base pairs of sequence can be generated, at a lower price per base than Sanger-based methods. The current drawbacks of the method are short read lengths of ~100 bp, in contrast to ~800 bp using Sanger sequencing, a slightly higher sequencing error rate due to difficulties determining base pair counts in homopolymer stretches, and a substantial reduction of read length when sequencing pair-ended reads. The most important advantage of the new sequencing approach for metagenomics is that it does not require cloning of the target DNA fragments and therefore avoids cloning biases resulting from toxic sequences killing their cloning hosts.

In this study, we present a new approach to the initial analysis of a metagenomic data set that avoids the problems associated with environmental assemblies or the use of a limited number of phylogenetic markers. Our strategy can be applied to DNA reads collected within the framework of any metagenomics project, regardless of the sequencing technology used, and thus provides an easily deployable alternative to other types of analysis. We provide a new computer program called MEGAN (Metagenome Analyser) that allows analysis of large data sets by a single scientist. In a pre-processing step, the set of DNA reads (or contigs) is

3Corresponding authors. E-mail [email protected]; fax 49-7071-295148. E-mail [email protected]; fax (814) 863-6699. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5969107. Freely available online through the Genome Research Open Access option.

Genome Research 17:000–000 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07; www.genome.org

the BLASTX search into a preliminary version of MEGAN and applied the LCA algorithm to compute an assignment of reads to taxa, thus obtaining an estimation of the taxonomical content of the sample.

Here we provide details of the MEGAN analysis, using a bit-score threshold of 30 and discarding any isolated assignments, that is, any taxon that has only a single read assigned to it. The LCA algorithm assigned 50,093 reads to taxa, and 2086 remained unassigned either because the bit-score of their matches fell below the threshold or because they gave rise to an isolated hit.

A total of 19,841 reads were assigned to Eukaryota, of which 7969 were assigned to Gnathostomata (jawed vertebrates) and thus presumably derive from mammoth sequences. Furthermore, a total of 16,972 reads were assigned to Bacteria, 761 to Archaea, and 152 to Viruses, respectively. These numbers are marginally lower than those reported in Poinar et al. (2006) because of our new filters, thus underlining the intrinsic robustness of the LCA approach.

Figures 5 and 6 demonstrate the ability of MEGAN to summarize results at different levels of the NCBI taxonomy. A distinctive feature of the program is that such summaries are computed dynamically on-the-fly, as the user changes parameters of the LCA algorithm or expands or collapses parts of the taxonomy. The relative abundance of reads at a certain node or leaf is indicated visually by the size of the circle representing the node, or by numerical labels. The cladograms produced by MEGAN can be considered “species profiles” and can be produced as tables, for example, for side-by-side comparisons of series of samples (see Fig. 4).

Species identification from short reads

Several companies are developing new sequencing technologies that promise to produce high-throughput sequencing at substantially reduced cost, albeit with reads as short as 35 bp. The average length of reads produced using current Roche GS20 sequencing technology, introduced last year (Margulies et al. 2005), is ~100 bp, and reads obtainable by current Sanger sequencing are ~800 bp in length (Franca et al. 2002). The question therefore arises what read length is required to identify species in a metagenomic sample reliably.

A simple approach to addressing this is to collect a set of reads from a known genome, to process the data as a metagenomic data set (as described above), and then to evaluate the accuracy of the assignments. For this purpose, the genome sequences of the two organisms E. coli K12 and B. bacteriovorus HD100 were used. We chose E. coli as it is used as a cloning host in most clone-based sequencing projects and is thus likely to occur in several different database sequences by mistake. The second test organism, B. bacteriovorus, is very distinctive in its sequence from other Proteobacteria and has no close relatives that are currently represented in the sequence databases. Its metagenomic analysis should therefore result in a much better signal/noise ratio than for E. coli.
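Drawing test reads from a known genome can be sketched as follows. This is an illustrative sketch only: the function `simulate_reads`, its parameters, and the toy genome string are assumptions, and it models error-free, single-strand reads, which real sequencing data are not.

```python
import random


def simulate_reads(genome, read_length, n_reads, seed=0):
    """Draw n_reads intervals of fixed length uniformly at random
    from a genome string. Returns (start, sequence) pairs so that
    each read's true origin stays known for later evaluation."""
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]


# Toy stand-in for a genome sequence (hypothetical, 2000 bp).
toy_genome = "ACGTTGCAAG" * 200
reads = simulate_reads(toy_genome, read_length=35, n_reads=5)
for start, seq in reads:
    print(start, seq)
```

Because every simulated read carries its true origin, the assignments produced by a classifier can be scored directly as correct, too general, or wrong.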

We show the results of simulation studies for the two genomes in Tables 1 (E. coli) (Blattner et al. 1997) and 2 (B. bacteriovorus) (Rendulic et al. 2004). For each genome, we use sequence intervals of length 35 bp, 100 bp, 200 bp, and 800 bp, as these lengths correspond to upcoming or existing sequencing technology. We simulated 5000 random shotgun reads for each datapoint, compared them to the NCBI-NR database using BLASTX, and then processed the reads with MEGAN, using a bit-score threshold of 35, retaining only those hits that are within 20% of the best hit for a read, and discarding all isolated assignments. The percentage of reads classified as Enterobacteriaceae ranged from 22% to 85%, Gammaproteobacteria from 24% to 94%, and Proteobacteria from 25% to 96% in the case of E. coli. The number of false-positive assignments of reads was 0%. In the case of B. bacteriovorus, the percentage of reads classified as B. bacteriovorus ranges from 25% to 98%, Deltaproteobacteria from 26% to 99%, and Proteobacteria from 26% to ~100%. No false-positive hits were detected. The result demonstrates that short reads in general can be used for metagenomic analysis, albeit at the cost of a high rate of under-prediction.
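The three filters named above (bit-score threshold, hits within 20% of the best hit, no isolated assignments) combine with the lowest common ancestor step roughly as in the sketch below. It is not MEGAN's implementation: the toy `PARENT` taxonomy, the function names, and the reading of "within 20% of the best hit" as "score at least 80% of the best score" are all assumptions.

```python
from collections import Counter

# Toy child -> parent taxonomy (illustrative, not the NCBI taxonomy itself).
PARENT = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Gammaproteobacteria",
    "Vibrio cholerae": "Gammaproteobacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Proteobacteria": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the most specific taxon shared by all given lineages."""
    taxa = list(taxa)
    common = lineage(taxa[0])
    for taxon in taxa[1:]:
        other = set(lineage(taxon))
        common = [node for node in common if node in other]
    return common[0] if common else "root"


def assign_reads(hits_by_read, min_score=35.0, top_percent=20.0):
    """Assign each read to the LCA of its retained hits.
    hits_by_read maps read id -> list of (taxon, bit_score)."""
    assignments = {}
    for read, hits in hits_by_read.items():
        kept = [(t, s) for t, s in hits if s >= min_score]
        if not kept:
            continue  # every hit fell below the bit-score threshold
        best = max(s for _, s in kept)
        cutoff = best * (1 - top_percent / 100)
        top_taxa = {t for t, s in kept if s >= cutoff}
        assignments[read] = lowest_common_ancestor(top_taxa)
    # Discard isolated assignments: taxa that received only one read.
    counts = Counter(assignments.values())
    return {r: t for r, t in assignments.items() if counts[t] > 1}


hits = {
    "r1": [("Escherichia coli", 100), ("Salmonella enterica", 95)],
    "r2": [("Escherichia coli", 80), ("Salmonella enterica", 78),
           ("Vibrio cholerae", 40)],
    "r3": [("Vibrio cholerae", 60)],   # lone read on its taxon
    "r4": [("Escherichia coli", 20)],  # below the threshold
}
print(assign_reads(hits))
```

Reads r1 and r2 match two species about equally well, so they are placed at the family that contains both rather than at either species. This is how a conserved sequence ends up assigned at a higher taxonomic level, and why short reads trade false positives for under-prediction.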

Figure 4. The distribution of reads from Sample 1, pooled Samples 2–4, and the weighted average of these two data sets, over 16 major phylogenetic groups, as computed by MEGAN. For the sake of comparison, the diagram also shows the relative contribution of organisms to these groups, as estimated from Venter et al. (2004) by averaging over the values for all six genes that are reported there.

Figure 3. Phylogenetic diversity of the Sargasso Sea sequences computed by MEGAN. The microheterogeneity of Sample 1 was investigated by comparing it to pooled Samples 2, 3, and 4 (Venter et al. 2004). (A) Analysis of 10,000 reads randomly chosen from Sample 1. (B) Analysis of 10,000 reads randomly chosen from Sample 2. (C,D) A more detailed view of Sample 1 and Samples 2–4, respectively, illustrating a significant difference of relative frequencies of Shewanella and Burkholderia species in the two data sets. In all such figures, each circle represents a taxon in the NCBI taxonomy and is labeled by its name and the number of reads that are assigned either directly to the taxon, or indirectly via one of its subtaxa. The size of the circle is scaled logarithmically to represent the number of reads assigned directly to the taxon.

Page 40: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Uses of Phylogeny in Genomics and Metagenomics

Example 2:

Phylogenetic Ecology


Page 41: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA survey

• Sequence rRNAs

• Cluster


Page 42: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA survey

• Sequence rRNAs

• Cluster

• Identify “OTUs”

OTU1, OTU2, OTU3, OTU4, OTU5, OTU6, OTU7, OTU8, OTU9, OTU10

Page 43: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

OTUs on Tree

[Figure: phylogenetic tree with OTU1 through OTU10 at the tips]

Page 48: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Metagenomic Phylogenetic challenge

A single tree with everything

Page 50: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a quality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain are included in the final alignment, and 2) masks highly gapped alignment columns (see Text S1).

We use this high quality alignment of metagenomic reads and reference sequences to construct a fully-resolved, phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g., RAxML [27], pplacer [28], etc.), PhylOTU currently employs FastTree as a default method due to its relatively high speed-to-performance ratio and its ability to construct accurate trees in the presence of highly-gapped data [29]. After construction of the phylogeny, lineages representing reference sequences are pruned from the tree. The resulting phylogeny of metagenomic reads is then used to compute a PD distance matrix in which the distance between a pair of reads is defined as the total tree path distance (i.e., branch length) separating the two reads [30]. This tree-based distance matrix is subsequently used to hierarchically cluster metagenomic reads via MOTHUR into OTUs in a fashion similar to traditional PID-based analysis [31]. As with PID clustering, the hierarchical algorithm can be tuned to produce finer or coarser clusters, corresponding to different taxonomic levels, by adjusting the clustering threshold and linkage method.

To evaluate the performance of PhylOTU, we employed statistical comparisons of distance matrices and clustering results for a variety of data sets. These investigations aimed 1) to compare PD versus PID clustering, 2) to explore overlap between PhylOTU clusters and recognized taxonomic designations, and 3) to quantify the accuracy of PhylOTU clusters from shotgun reads relative to those obtained from full-length sequences.
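The two steps described above, a tree-path (PD) distance between reads followed by threshold-based hierarchical clustering, can be sketched in plain Python. This is an illustration under stated assumptions, not PhylOTU or MOTHUR code: the toy tree, the read names, and the functions `patristic_distance` and `cluster` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length); "root" has no entry.
TREE = {
    "n1": ("root", 0.01), "n2": ("root", 0.20),
    "readA": ("n1", 0.01), "readB": ("n1", 0.02),
    "readC": ("n2", 0.01), "readD": ("n2", 0.03),
}

def path_to_root(node, tree):
    """Map each ancestor of node (and node itself) to its distance from node."""
    dist, out = 0.0, {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        dist += length
        out[parent] = dist
        node = parent
    return out

def patristic_distance(a, b, tree):
    """Total branch length on the tree path between leaves a and b."""
    pa, pb = path_to_root(a, tree), path_to_root(b, tree)
    # The most recent common ancestor gives the smallest sum of the two legs.
    return min(pa[n] + pb[n] for n in pa if n in pb)

def cluster(leaves, tree, threshold, linkage=max):
    """Agglomerative clustering; linkage=max is furthest-neighbor, min is nearest."""
    d = {frozenset(p): patristic_distance(*p, tree) for p in combinations(leaves, 2)}

    def link(pair):
        x, y = pair
        return linkage(d[frozenset((i, j))] for i in x for j in y)

    clusters = [{leaf} for leaf in leaves]
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=link)
        if link((x, y)) > threshold:
            break  # no remaining pair of clusters is close enough to merge
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(x | y)
    return sorted(sorted(c) for c in clusters)

reads = ["readA", "readB", "readC", "readD"]
print(cluster(reads, TREE, threshold=0.05))
```

Lowering `threshold` or switching `linkage` from `max` to `min` changes how fine or coarse the resulting OTUs are, which mirrors the tuning described in the text.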

PhylOTU Clusters Recapitulate PID Clusters

We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.

To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods

Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001

PhylOTU - Sharpton et al. PLoS Comp. Bio 2011 (PLoS Computational Biology, www.ploscompbiol.org, January 2011, Volume 7, Issue 1, e1001061)
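The true conjunction and true disjunction rates defined in the excerpt above can be computed directly from two cluster assignments. The sketch below follows those verbal definitions; the function name and the toy labels are hypothetical, and returning NaN for an empty denominator is an assumption of this example.

```python
from itertools import combinations

def conjunction_disjunction_rates(first, second):
    """Compare two clusterings given as dicts of sequence id -> cluster label.

    Returns (true conjunction rate, true disjunction rate), both measured
    relative to the pairs defined by the first clustering.
    """
    together = together_both = apart = apart_both = 0
    for a, b in combinations(sorted(first), 2):
        same_first = first[a] == first[b]
        same_second = second[a] == second[b]
        if same_first:
            together += 1
            together_both += same_second      # also co-clustered in analysis 2
        else:
            apart += 1
            apart_both += not same_second     # also separated in analysis 2
    tcr = together_both / together if together else float("nan")
    tdr = apart_both / apart if apart else float("nan")
    return tcr, tdr

# Hypothetical example: PID-based labels versus PD-based labels.
pid = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
pd_ = {"s1": "X", "s2": "X", "s3": "Y", "s4": "Y"}
print(conjunction_disjunction_rates(pid, pd_))
```

Both rates equal 1.0 only when the two clusterings group the sequences identically, so reporting them together guards against a method that scores well on one by over-merging or over-splitting.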

Page 51: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

OTUs = Richness

containing at least 10% gammaproteobacteria. Sequences falling within the pseudomonad tree (see Fig. S2 in the supplemental material) appear most closely related to the oligotrophic marine gammaproteobacteria (OMG) (2). The lack of a close phylogenetic relationship between representatives of the described major OMG clades and our coral sequences suggests that the latter represent new OMG clades.

Rarefaction analysis of our data and of sequence data from shallow-water scleractinian coral communities (12) suggested that our accumulated deepwater octocoral samples showed less diversity than their shallow-water counterparts (Fig. 2); with 350 sequences sampled, the shallow-water data set contained approximately twice as many observed operational taxonomic units (97% threshold for operational taxonomic unit definition) as the deepwater set.
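A rarefaction curve of the kind referred to above can be approximated by repeated random subsampling. The following sketch is only an illustration (the excerpt's own resampling was done with EstimateS); the function name, the number of trials, and the toy OTU labels are assumptions.

```python
import random

def rarefaction_curve(otu_labels, depths, trials=100, seed=0):
    """Mean number of distinct OTUs seen when subsampling without replacement.

    otu_labels holds one OTU label per sequence; depths lists the
    subsample sizes to evaluate (each must not exceed len(otu_labels)).
    """
    rng = random.Random(seed)
    curve = {}
    for depth in depths:
        observed = [len(set(rng.sample(otu_labels, depth))) for _ in range(trials)]
        curve[depth] = sum(observed) / trials
    return curve

# Hypothetical community: three OTUs with uneven abundance.
community = ["otu1"] * 50 + ["otu2"] * 30 + ["otu3"] * 20
print(rarefaction_curve(community, depths=[1, 10, 100]))
```

Comparing two communities at the same depth (350 sequences in the text above) avoids confusing sampling effort with real differences in richness.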

This study provides a first glimpse of the deep-sea octocoral microflora. The results suggest that these populations are dominated by several major groups but that the relative proportions of these groups vary (bearing in mind that known methodological biases [5] limit the extent to which clone library compositions reflect community compositions). Phylotypes clustered according to sample origin, and we did not observe much overlap between coral-associated phylotypes and those recovered from the water column and rock surfaces (see Fig. S1 and S2 in the supplemental material), suggesting characteristic coral-associated assemblages with minimal influence of transient water-column microbes. Future sampling of multiple individuals and their immediate environment is clearly needed to perform a more comprehensive survey and to address questions regarding the nutritional relationships, evolution, and biogeography of these populations.

We thank R/V Atlantis and DSV Alvin personnel, NOAA’s Ocean Exploration Program, and Brad Stevens, Randy Keller, Tom Shirley, and Tom Guilderson for help with data acquisition.

Phylogenetic analysis was supported in part by NSF Assembling the Tree of Life grant 0228651 to J.A.E. and N.W.

REFERENCES

1. Altschul, S., T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

2. Cho, J. C., and S. J. Giovannoni. 2004. Cultivation and growth characteristics of a diverse group of oligotrophic marine Gammaproteobacteria. Appl. Environ. Microbiol. 70:432–440.

3. Cole, J., B. Chai, T. Marsh, R. Farris, Q. Wang, S. Kulam, S. Chandra, D. McGarrell, T. Schmidt, G. Garrity, and J. Tiedje. 2003. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 31:442–443.

4. Etnoyer, P., and L. Morgan. December 2003, posting date. Occurrences of habitat-forming deep sea corals in the northeast Pacific Ocean: a report to NOAA’s Office of Habitat Conservation. [Online.] http://www.mcbi.org/destructive/DSC_occurrences.pdf.

5. Farrelly, V., F. A. Rainey, and E. Stackebrandt. 1995. Effect of genome size and rrn gene copy number on PCR amplification of 16S rRNA genes from a mixture of bacterial species. Appl. Environ. Microbiol. 61:2798–2801.

6. Felsenstein, J. 1989. PHYLIP - phylogeny inference package (version 3.2). Cladistics 5:164–166.

7. Giovannoni, S., and M. Rappe. 2000. Evolution, diversity, and molecular ecology of marine prokaryotes, p. 47–84. In D. L. Kirchman (ed.), Microbial ecology of the oceans. John Wiley & Sons, New York, N.Y.

8. Gonzalez, J. M., R. Simo, R. Massana, J. S. Covert, E. O. Casamayor, C. Pedros-Alio, and M. A. Moran. 2000. Bacterial community structure associated with a dimethylsulfoniopropionate-producing North Atlantic algal bloom. Appl. Environ. Microbiol. 66:4237–4246.

9. Heifetz, J. 2002. Coral in Alaska: distribution, abundance, and species associations. Hydrobiologia 471:19–27.

10. Kellogg, C., and R. Stone. 2004. A pilot study of deep-water coral microbial ecology. Presented at the ASLO/TOS Ocean Research Conference, Honolulu, Hawaii.

11. Rainey, F. A., N. Ward-Rainey, R. M. Kroppenstedt, and E. Stackebrandt. 1996. The genus Nocardiopsis represents a phylogenetically coherent taxon

FIG. 2. Rarefaction curves for the accumulated coral-associated 16S rRNA gene sequences generated for this study (CGOA, -C, -D, -F, and -G) and the sequences of Rohwer et al. (12, 13). Bars indicate 95% confidence intervals. Statistical resampling was performed using EstimateS.

Penn et al., Appl. Environ. Microbiol., p. 1682 (http://aem.asm.org/)

Page 52: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

OTUs on Tree

[Figure: phylogenetic tree with OTU1 through OTU10 at the tips]

• Clades
• Rates of change
• LGT
• Convergence
• Character history

Page 53: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Unifrac

cally defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13, 18, 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).

Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are applicable. However, weighted UniFrac has a major advantage over FST because it can be used to combine data in which different parts of the 16S rRNA were sequenced (e.g., when nonoverlapping sequences can be combined into a single tree using full-length sequences as guides). We use two different data sets to illustrate how analyses with quantitative and qualitative β diversity measures can lead to dramatically different conclusions about the main factors that structure microbial diversity. Specifically, qualitative measures that disregard relative abundance can better detect effects of different founding populations, such as the source of bacteria that first colonize the gut of newborn mice and the effects of factors that are restrictive for microbial growth such as temperature. In contrast, quantitative measures that account for the relative abundance of microbial lineages can reveal the effects of more transient factors such as nutrient availability.
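The verbal definition of unweighted UniFrac given above (the fraction of branch length leading to descendants in one community but not both) can be written as a short Python sketch. The tree encoding, the sequence names, and the function name are hypothetical, and branches with no descendants in either community are simply ignored here.

```python
# Hypothetical toy tree: each branch is (length, set of leaf sequences below it).
BRANCHES = [
    (1.0, {"s1"}),
    (1.0, {"s2"}),
    (0.5, {"s1", "s2"}),  # internal branch above s1 and s2
    (2.0, {"s3"}),
]

def unweighted_unifrac(branches, a_members, b_members):
    """Fraction of observed branch length unique to one of two communities."""
    unique = shared = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & a_members)
        in_b = bool(leaves & b_members)
        if in_a != in_b:
            unique += length      # descendants in exactly one community
        elif in_a and in_b:
            shared += length      # descendants in both communities
    return unique / (unique + shared)

print(unweighted_unifrac(BRANCHES, {"s1", "s3"}, {"s2", "s3"}))
```

Because only presence or absence enters the calculation, adding duplicate copies of a sequence to either community leaves the value unchanged, which is what makes the measure qualitative.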

MATERIALS AND METHODS

Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$

Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.

If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:

$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$

Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.
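The raw value u and the scaling factor D defined above translate directly into code. The sketch below is a toy illustration on a hypothetical four-branch tree, not the UniFrac software; the branch encoding, the count dictionaries, and the function name are assumptions.

```python
# Hypothetical toy tree: each branch is (length, set of leaf sequences below it).
BRANCHES = [
    (1.0, {"s1"}),
    (1.0, {"s2"}),
    (0.5, {"s1", "s2"}),  # internal branch above s1 and s2
    (2.0, {"s3"}),
]

def weighted_unifrac(branches, a_counts, b_counts, normalize=False):
    """Raw weighted UniFrac u, or u / D when normalize is True."""
    a_total = sum(a_counts.values())
    b_total = sum(b_counts.values())
    u = 0.0
    for length, leaves in branches:
        a_i = sum(a_counts.get(seq, 0) for seq in leaves)
        b_i = sum(b_counts.get(seq, 0) for seq in leaves)
        u += length * abs(a_i / a_total - b_i / b_total)
    if not normalize:
        return u
    d = 0.0
    for seq in set(a_counts) | set(b_counts):
        # Distance from the root: sum of all branches that lead to this sequence.
        root_dist = sum(length for length, leaves in branches if seq in leaves)
        d += root_dist * (a_counts.get(seq, 0) / a_total + b_counts.get(seq, 0) / b_total)
    return u / d

a = {"s1": 2}              # community A: two copies of s1
b = {"s2": 1, "s3": 1}     # community B: one copy each of s2 and s3
print(weighted_unifrac(BRANCHES, a, b), weighted_unifrac(BRANCHES, a, b, normalize=True))
```

With these toy counts the raw value is 2.75, while D is 3.25, so the normalized distance stays between 0 and 1 as the text describes.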

Clustering with normalized u values treats each sample equally instead of

TABLE 1. Measurements of diversity

Measure | Measurement of α diversity | Measurement of β diversity
Only presence/absence of taxa considered | Qualitative (species richness) | Qualitative
Additionally accounts for the no. of times that each taxon was observed | Quantitative (species richness and evenness) | Quantitative

FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.

VOL. 73, 2007 PHYLOGENETICALLY COMPARING MICROBIAL COMMUNITIES 1577

Figure 1. Estimates of Phylogenetic Diversity (PD) and PD Gain (G) for the grey community. The boxes represent taxa from the black, white, and grey communities. (A) PD is the sum of the branches leading to the grey taxa. (B) G is the sum of the branches leading only to the grey taxa. (C) PD rarefaction curves showing the increase in branch length with sampling effort for the intestinal and stool bacteria from three healthy individuals. Aligned 16S rRNA sequences from the three individuals were available with the Supplementary Materials in (Eckburg, et al., 2005). The Arb parsimony insertion tool was used to add the sequences to a tree containing over 9,000 sequences (Hugenholtz, 2002) that is available for download at the rRNA Database Project II website (Maidak, et al., 2001). The curves represent the average values for 50 replicate trials.
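PD and PD Gain as described in this caption can be sketched as two sums over branches. The code below is a minimal illustration on a hypothetical toy tree; the branch encoding and function names are assumptions, and every branch between a taxon and the root is counted as "leading to" it.

```python
# Hypothetical toy tree: each branch is (length, set of taxa below it).
BRANCHES = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (0.5, {"t1", "t2"}),  # internal branch above t1 and t2
    (2.0, {"t3"}),
]

def phylogenetic_diversity(branches, community):
    """PD: total length of branches leading to any taxon in the community."""
    return sum(length for length, taxa in branches if taxa & community)

def pd_gain(branches, community, others):
    """G: total length of branches leading only to taxa in the community."""
    return sum(length for length, taxa in branches
               if taxa & community and not taxa & others)

grey = {"t1"}
rest = {"t2", "t3"}   # taxa from the other communities
print(phylogenetic_diversity(BRANCHES, grey), pd_gain(BRANCHES, grey, rest))
```

In this toy case PD is 1.5 and G is 1.0, because the internal branch is shared with a taxon from another community and so adds to PD but not to G.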

Lozupone and Knight. FEMS Microbiol Rev. Author manuscript; available in PMC 2009 July 1.

Page 54: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Challenge

• Each gene poorly sampled in metagenomes
• Can we combine all into a single tree?

Page 55: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Kembel et al. PLoS One 2011

Page 56: Phylogeny Driven Approaches to Genomic and Metagenomic Studies


Page 57: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Figure 3. Taxonomic diversity and standardized phylogenetic diversity versus depth in environmental samples along an oceanic depth gradient at the HOT ALOHA site.

Page 58: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Uses of Phylogeny in Genomics and Metagenomics

Example 3:

Binning


Page 59: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Metagenomics


Page 60: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Binning challenge


Page 61: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Binning challenge

Best binning method: reference genomes


Page 63: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Binning challenge

No reference genome? What do you do?


Page 64: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Binning challenge

No reference genome? What do you do?

Composition, Assembly, others

Page 65: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Binning challenge

No reference genome? What do you do?

Phylogeny

Page 66: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

CFB Phyla


Page 67: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Wu et al. 2006 PLoS Biology 4: e188.

Baumannia makes vitamins and cofactors

Sulcia makes amino acids


Page 68: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Uses of Phylogeny in Genomics and Metagenomics

Example 4:

Functional Diversity and Functional Predictions


Page 69: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Predicting Function

• Identification of motifs
  – Short regions of sequence similarity that are indicative of general activity
  – e.g., ATP binding
• Homology/similarity based methods
  – Gene sequence is searched against a database of other sequences
  – If significantly similar genes are found, their functional information is used
• Problem
  – Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function

Page 70: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flowchart of the method, worked through for EXAMPLE A (genes 1–6) and EXAMPLE B (genes 1A–3A and 1B–3B from Species 1, Species 2, and Species 3), shown next to the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Inferred duplication events are marked “Duplication?” and one prediction is marked “Ambiguous”.]

METHOD (steps in order)
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST

Based on Eisen, 1998 Genome Res 8: 163-167.
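The last two steps of the method (overlay known functions onto the tree, then infer the likely function) can be illustrated with a small Python sketch. This is one simple reading of the idea, not the procedure from Eisen 1998: the nested-tuple tree, the gene names, the annotation labels, and the rule "use the smallest clade containing the query that has any annotated gene" are all assumptions made for illustration.

```python
def leaves(node):
    """List the leaf names under a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def predict_function(tree, query, known):
    """Predict a function for query from annotated relatives in the gene tree.

    Walks from the query toward the root and stops at the first clade that
    contains annotated genes. Returns that function if the annotations agree,
    "ambiguous" if they conflict, and "unknown" if nothing is annotated.
    The query is assumed to be a leaf of the tree.
    """
    path, node = [], tree
    while not isinstance(node, str):
        path.append(node)
        node = next(child for child in node if query in leaves(child))
    for clade in reversed(path):  # smallest enclosing clade first
        functions = {known[g] for g in leaves(clade) if g in known}
        if functions:
            return functions.pop() if len(functions) == 1 else "ambiguous"
    return "unknown"

# Hypothetical gene tree with two paralog groups, A and B.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known = {"1A": "function A", "3A": "function A",
         "1B": "function B", "2B": "function B", "3B": "function B"}
print(predict_function(gene_tree, "2A", known))
```

Unlike a top-hit similarity search, the prediction here depends on where the gene sits in the tree, so a query nested among conflicting annotations is reported as ambiguous instead of inheriting whichever hit scores highest.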


Page 71: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Phylotyping

• Collect DNA from environment
• PCR amplify rRNA genes using broad (so-called universal) primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified” by placement on tree

Page 72: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Massive Diversity of Proteorhodopsins

Venter et al., 2004

Page 73: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Uses of Phylogeny in Genomics and Metagenomics

Example 5:

Selecting Organisms for Study


Page 74: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

http://www.jgi.doe.gov/programs/GEBA/pilot.html

Page 75: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.

Based on tree from Pace 1997 Science 276:734-740

Archaea

Eukaryotes

Bacteria


Page 76: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

GEBA Lesson 1: The rRNA Tree of Life is a Useful Tool for Identifying Phylogenetically Novel

From Wu et al. 2009 Nature 462, 1056-1060

Page 77: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

GEBA Lesson 2: The rRNA Tree of Life is not perfect ...

Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.

16s WGT, 23S

Page 78: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

GEBA Lesson 3: Phylogeny driven genome selection (and phylogenetics) improves genome annotation

• Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary” based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction

Page 79: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

GEBA Lesson 4

Phylogeny-driven genome selection helps discover new genetic diversity


Page 80: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Protein Family Rarefaction Curves

• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families

Page 86: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Synapomorphies exist

Wu et al. 2009 Nature 462, 1056-1060


Page 87: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Families/PD not uniform

Page 88: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

GEBA Lesson 5

Improves analysis of genome data from uncultured organisms


Page 89: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes” bar chart. x-axis: Major Phylogenetic Group (Alphaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Firmicutes, Chlorobi, Chloroflexi, Fusobacteria, Euryarchaeota); y-axis: Weighted % of Clones (0 to 0.500); marker genes: EFG, EFTu, HSP70, RecA, RpoB, rRNA]

Shotgun Sequencing Allows Use of Other Markers

Venter et al., Science 304: 66-74. 2004

Page 90: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes” bar chart of Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA]

Shotgun Sequencing Allows Use of Other Markers

Cannot be done without good sampling of genomes

Venter et al., Science 304: 66-74. 2004

Page 91: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes” bar chart of Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA]

Phylogenetic Binning

Venter et al., Science 304: 66-74. 2004

Page 92: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes” bar chart of Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA]

Shotgun Sequencing Allows Use of Other Markers

Cannot be done without good sampling of genomes

Venter et al., Science 304: 66-74. 2004

Page 93: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes” bar chart of Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA]

Shotgun Sequencing Allows Use of Other Markers

GEBA Project improves metagenomic analysis

Venter et al., Science 304: 66-74. 2004

Page 94: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

[Figure: “Sargasso Phylotypes” bar chart of Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA]

Shotgun Sequencing Allows Use of Other Markers

But not a lot

Venter et al., Science 304: 66-74. 2004

Page 95: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Phylogeny and Metagenomics Future 1

Need to adapt genomic and metagenomic methods to make better use of data

Page 96: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

iSEEM Project


Page 97: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

AMPHORA 2 Coming w/ More Markers

Phylogenetic group       Genome Number   Gene Number   Marker Candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroides              25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684

See posters by Dongying Wu and Guillaume Jospin

Page 98: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

AMPHORA ALL

• Build reference tree with concatenated alignment
• Align reads that match any of the HMMs to concatenated alignment
• Place reads into reference tree one at a time

Page 99: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Phylogeny and Metagenomics Future 2

We have still only scratched the surface of microbial diversity


Page 100: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.

Based on tree from Pace 1997 Science 276:734-740

Archaea

Eukaryotes

Bacteria


Page 103: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Phylogenetic Diversity: Isolates

From Wu et al. 2009 Nature 462, 1056-1060

Page 104: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Phylogenetic Diversity: All

From Wu et al. 2009 Nature 462, 1056-1060


Page 105: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Number of SAGs from Candidate Phyla

Site                                    OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent               4     1      -     -
Site B: Gold Mine                       6     13     2     -
Site C: Tropical gyres (Mesopelagic)    -     -      -     2
Site D: Tropical gyres (Photic zone)    1     -      -     -

Sample collections at 4 additional sites are underway.

Phil Hugenholtz

GEBA uncultured

Page 106: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Earth Microbiome Project
www.earthmicrobiome.org

• Goal – to systematically approach the problem of characterizing microbial life on earth
• Strategy:
  – Explore microbes in environmental parameter space
  – Design ‘ideal’ strategy to interrogate these biomes
  – Acquire samples and sequence broad and deep both DNA, mRNA and rRNA
  – Define microbial community structure and the protein universe
• Gilbert et al., 2010a,b SIGS

Page 107: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Phylogenomics Future 3

Need Experiments from Across the Tree of Life too


Page 108: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

A Happy Tree of Life


Page 109: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

Acknowledgements

• GEBA: DOE-JGI, DSMZ
• iSEEM: Katie Pollard, Jessica Green, Martin Wu, Steven Kembel, Tom Sharpton
• RecA: Dongying Wu, Craig Venter, Aaron Halpern, Doug Rusch, et al.
• Eisen Lab: Aaron Darling, Jenna Morgan, Dongying Wu
• $$$ - Moore Foundation, NSF, DOE, DARPA, Sloan Foundation

Page 110: Phylogeny Driven Approaches to Genomic and Metagenomic Studies


Page 111: Phylogeny Driven Approaches to Genomic and Metagenomic Studies

MICROBES
