Sourav Chatterji, UC Davis Genome Center, schatterji@ucdavis.edu

Post on 30-Dec-2015


Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority

Sourav Chatterji
UC Davis Genome Center
schatterji@ucdavis.edu

Background

The Microbial World

Exploring the Microbial World

• Culturing
  – Majority of microbes currently unculturable.
  – No ecological context.

• Molecular Surveys (e.g. 16S rRNA)
  – “who is out there?”
  – “what are they doing?”

Environmental Shotgun Sequencing

Interpreting Metagenomic Data

• Nature of Metagenomic Data
  – Mosaic
  – Intraspecies polymorphism
  – Fragmentary

• New Sequencing Technologies
  – Enormous amount of data
  – Short Reads

Overview of Talk

• Metagenomic Binning
• PhyloMetagenomics
• The Big Picture/ Future Work

Overview of Talk

• Metagenomic Binning
  – Background
  – CompostBin
• PhyloMetagenomics
• The Big Picture/ Future Work

Metagenomic Binning

Classification of sequences by taxa

Current Binning Methods

• Assembly
• Align with Reference Genome
• Database Search [MEGAN, BLAST]
• Phylogenetic Analysis
• DNA Composition [TETRA, Phylopythia]

Current Binning Methods

• Need closely related reference genomes.
• Poor performance on short fragments.
  – Sanger sequence reads 500-1000 bp long.
  – Current assembly methods unreliable
• Complex Communities Hard to Bin.

Genome Signatures

• Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms?
  – Yes [Karlin et al. 1990s]

• What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

DNA-composition metrics

The K-mer Frequency Metric

CompostBin uses hexamers

• Working with K-mers for Binning.
  – Curse of Dimensionality: O(4^K) independent dimensions.
  – Statistical noise increases with decreasing fragment lengths.
• Project data into a lower dimensional space to decrease noise.
  – Principal Component Analysis.
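As a minimal sketch of the K-mer frequency metric described above, the function below turns one read into a 4^K-dimensional frequency vector (K = 6 for hexamers). The function name `kmer_frequencies` and the choices to skip K-mers containing ambiguous bases and to leave reverse complements unmerged are assumptions made for illustration, not details taken from the slides.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector (length 4**k) of a DNA read."""
    # Map every possible k-mer over ACGT to a fixed vector position.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # assumption: ignore k-mers with N or other symbols
            counts[pos] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


vec = kmer_frequencies("ACGTACGTAC", k=2)  # small k only to keep the demo readable
print(vec.shape)  # (16,)
```

With K = 6 each read becomes a 4096-dimensional vector, which is why the slides project the data to a lower-dimensional space before binning.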

DNA-composition metrics

PCA separates species

Gluconobacter oxydans[65% GC] and Rhodospirillum rubrum[61% GC]

Effect of Skewed Relative Abundance

B. anthracis and L. monocytogenes

Abundance 1:1 Abundance 20:1

A Weighting Scheme

For each read, find overlap with other sequences

A Weighting Scheme

Calculate the redundancy of each position.

4 5 5 3

Weight is inverse of average redundancy.
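A small sketch of the weighting rule, assuming the numbers 4 5 5 3 on the slide are the redundancy (number of overlapping sequences) at each position of one read. The function name `read_weight` is hypothetical.

```python
def read_weight(redundancy):
    """Weight of a read = 1 / (average redundancy over its positions)."""
    if not redundancy:
        raise ValueError("redundancy profile must be non-empty")
    average = sum(redundancy) / len(redundancy)
    return 1.0 / average


# Redundancy profile from the slide: average is 4.25, so the weight is 1 / 4.25.
weight = read_weight([4, 5, 5, 3])
```

Reads from an abundant species overlap many other reads, so they get small weights; reads from a rare species get weights close to 1.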

Weighted PCA

• Calculate the weighted mean µw:

$$\mu_w = \frac{\sum_{i=1}^{N} w_i X_i}{N}$$

• Calculate the weighted covariance matrix Mw:

$$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$

• Principal Components are eigenvectors of Mw.
  – Use first three PCs for further analysis.
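The weighted PCA steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the CompostBin implementation: the function name `weighted_pca` is hypothetical, and the weights are rescaled to sum to N so that dividing the weighted sum by N yields a proper weighted average.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project the rows of X onto the top eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    w = w * n / w.sum()  # assumption: normalize weights so they sum to N
    mu_w = (w[:, None] * X).sum(axis=0) / n  # weighted mean
    centered = X - mu_w
    M_w = (w[:, None] * centered).T @ centered  # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M_w)  # M_w is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    components = eigvecs[:, order]
    return centered @ components, components
```

With all weights equal this reduces to ordinary PCA (up to a constant scale on the covariance matrix), so the weighting only changes the result when abundances are skewed.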

Weighted PCA separates species

B. anthracis and L. monocytogenes : 20:1

PCA Weighted PCA

Un-supervised Classification?

Semi-Supervised Classification

• 31 Marker Genes [courtesy Martin Wu]– Omni-present– Relatively Immune to Lateral Gene Transfer

• Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification

Use a semi-supervised version of the normalized cut algorithm

The Semi-supervised Normalized Cut Algorithm

1. Calculate the K-nearest neighbor graph from the point set.

2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.

3. Bisect the graph using the normalized-cut algorithm.
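The three steps above can be sketched with NumPy as follows. Several details are assumptions for illustration only: unweighted edges, label -1 for reads without a marker gene, and a bisection that splits on the sign of the second eigenvector of the normalized Laplacian (a common simplification of the normalized-cut threshold search). All function names are hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric, unweighted K-nearest-neighbor adjacency matrix."""
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros_like(dist)
    rows = np.repeat(np.arange(len(pts)), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)  # keep an edge if either endpoint selected it


def apply_markers(W, labels):
    """Step 2: add edges within a species, remove edges between species."""
    W = W.copy()
    labels = np.asarray(labels)
    known = np.flatnonzero(labels >= 0)  # -1 marks reads without a marker gene
    for a in known:
        for b in known:
            if a != b:
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return W


def normalized_cut_bisect(W):
    """Step 3: split on the sign of the second generalized eigenvector."""
    d = np.maximum(W.sum(axis=1), 1e-12)  # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler >= 0  # boolean bin assignment per node
```

If the marker constraints disconnect the graph, the second eigenvector is no longer unique; this sketch does not handle that case, and connected components would need to be treated separately.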

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Apply algorithm recursively

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Testing

• Simulate Metagenomic Sequencing
  – Variables
    • Number of species
    • Relative abundance
    • GC content
    • Phylogenetic Diversity

• Test on a “real” dataset where answer is well-established.

Results

Conclusions

• Satisfactory performance
• No Training on Existing Genomes
• Sanger Reads
• Low number of Species

Overview of Talk

• Metagenomic Binning
• Phylo-Metagenomics
  – Background
  – Incorporating Alignment Accuracy
• The Big Picture/ Future Work

Phylogenetic Trees

Charles Darwin, First Notebook on Transmutation of Species (1837)

Garcia Martin et al., Nat. Biotechnology (2006)

Population Structure of Communities

Yooseph et al., PLoS Biology (2007)

Gene Family Characterization

Wong et al., Science, 2008

Manual Masking

• Requires skilled and tedious manual intervention
• Subjective and non-reproducible
• Impractical for high-throughput data
  – Frequently ignored: “garbage in, garbage out”

Gblocks

Probabilistic Masking using pair-HMMs

• Probabilistic formulation of alignment problem.

• Can answer additional questions
  – Alignment Reliability
  – Sub-optimal Alignments

Durbin et al., Cambridge University Press (1998)

Probabilistic Masking

• What is the probability residues xi and yj are homologous?

• Posterior Probability the residues xi and yj are homologous

$$\Pr[x_i \diamond y_j] = \frac{\Pr[x_i \diamond y_j,\, x,\, y]}{\Pr[x, y]}$$

• Can be calculated efficiently for all pairs (and gaps) in quadratic time.
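The quadratic-time claim follows from the standard forward-backward computation on a pair-HMM, as described in Durbin et al. (1998). Writing $x_i \diamond y_j$ for "residues $x_i$ and $y_j$ are aligned", $f^M(i,j)$ for the forward probability of emitting $x_{1..i}$ and $y_{1..j}$ and ending in the match state, and $b^M(i,j)$ for the corresponding backward probability (these symbols follow the textbook, not the slides), the posterior is:

```latex
\Pr[x_i \diamond y_j \mid x, y]
  = \frac{\Pr[x_i \diamond y_j,\, x,\, y]}{\Pr[x, y]}
  = \frac{f^M(i,j)\, b^M(i,j)}{\Pr[x, y]}
```

Both tables have one entry per pair $(i, j)$ and each entry is filled in constant time, so all posteriors are available after $O(|x| \cdot |y|)$ work.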

Scoring Multiple Alignments

• Calculate the “posterior probability matrix” and distances dij between every pair of sequences.

• Weighted “sum of pairs” score for column r:

$$\frac{\sum_{i,j} d_{ij} \Pr[r_i \diamond r_j]}{\sum_{i,j} d_{ij}}$$
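A minimal sketch of the column score, assuming the pairwise posteriors for the residues in column r and the pairwise distances d_ij are already available as symmetric matrices. The function name `column_score` and the choice to sum over unordered pairs i < j are assumptions; how gaps enter the score is not specified in the slides and is not modeled here.

```python
import numpy as np


def column_score(post, dist):
    """Distance-weighted average of pairwise posteriors for one alignment column.

    post[i, j]: posterior that the residues of sequences i and j in this
                column are homologous.
    dist[i, j]: distance d_ij between sequences i and j.
    """
    post = np.asarray(post, dtype=float)
    dist = np.asarray(dist, dtype=float)
    upper = np.triu_indices(len(dist), k=1)  # each unordered pair once
    total = dist[upper].sum()
    if total == 0:
        raise ValueError("all pairwise distances are zero")
    return float((dist[upper] * post[upper]).sum() / total)


dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
post = [[0, 0.9, 0.5], [0.9, 0, 0.8], [0.5, 0.8, 0]]
score = column_score(post, dist)  # (1*0.9 + 3*0.5 + 2*0.8) / (1 + 3 + 2)
```

Weighting by distance means that agreement between distant sequences counts more than agreement between near-identical ones, which would otherwise dominate the average.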

Testing

The Balibase 3.0 Benchmark Database

Testing

• Realign sequences using MSA programs like ClustalW.

• Sensitivity: for all correctly aligned columns, the fraction that has been masked as good

• Specificity: for all incorrectly aligned columns, the fraction that has been masked as bad
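The two definitions above translate directly into code. In this sketch, `correct[i]` says whether column i is correctly aligned according to the benchmark and `marked_good[i]` says whether the masking method kept it; both names, and the function name `masking_metrics`, are hypothetical.

```python
def masking_metrics(correct, marked_good):
    """Return (sensitivity, specificity) of a column-masking method."""
    pairs = list(zip(correct, marked_good))
    n_correct = sum(1 for c, _ in pairs if c)
    n_incorrect = len(pairs) - n_correct
    # Sensitivity: correctly aligned columns that were marked good.
    kept_correct = sum(1 for c, g in pairs if c and g)
    # Specificity: incorrectly aligned columns that were marked bad.
    masked_incorrect = sum(1 for c, g in pairs if not c and not g)
    sensitivity = kept_correct / n_correct if n_correct else float("nan")
    specificity = masked_incorrect / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity


sens, spec = masking_metrics(
    correct=[True, True, True, False, False],
    marked_good=[True, True, False, False, True],
)
# sens is 2/3 (two of three correct columns kept); spec is 1/2.
```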

Performance

Gblocks

Prob Mask

Sensitivity Specificity

97% 93%

53% 94%

Effect on Phylogenetic Inference

Protocol        Symmetric Tree Inference Accuracy   Asymmetric Tree Inference Accuracy
No Masking      84.08 %                             80.51 %
Gblocks         76.92 %                             79.99 %
Prob. Masking   85.11 %                             84.60 %

Gblocks simulated data-set, PhyML likelihood tree

Consistency between Alignment Programs

• Yeast Genome Data Set
  – 7 yeast species, 1502 “orthologs” in each.
• Wong et al., Science (2008).
  – Aligned using 7 programs
  – Different programs often give inconsistent answers.
• Garbage in, Garbage Out?
  – Partial Data, confusing global alignment programs.
  – No Masking

Consistency between Alignment Programs

Protocol        Inconsistent   Consistent
No Masking      4.05 %         95.95 %
Prob. Masking   2.74 %         97.26 %

Masking removes ~33% of inconsistencies

Consistency between Alignment Programs

Protocol        Inconsistent,          Inconsistent,       Consistent,            Consistent,
                No Bootstrap Support   Bootstrap Support   No Bootstrap Support   Bootstrap Support
No Masking      3.73 %                 0.32 %              23.41 %                72.54 %
Prob. Masking   2.67 %                 0.07 %              23.77 %                73.48 %

Masking removes ~75% of inconsistencies with high support
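The two summary percentages are relative reductions in the inconsistent fraction. Recomputing them from the rounded table entries (so the results are approximate) gives:

```latex
\text{all inconsistencies:}\quad
\frac{4.05 - 2.74}{4.05} \approx 0.32
\qquad
\text{inconsistencies with bootstrap support:}\quad
\frac{0.32 - 0.07}{0.32} \approx 0.78
```

Both values are in line with the rounded figures quoted on the slides (~33% and ~75%).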

The Final Result

A Phylogenetic Database/Pipeline (with Martin Wu)

Overview of Talk

• Metagenomic Binning
• Phylo-Metagenomics
• The Big Picture/ Future Work

Population Structure

Venter et al. , Science (2004)

Future Directions/Challenges

• What defines a species (OTU)?
  – Clustering Problem
• Handling Partial Data
• Improved Phylogenetic Inference
• How to integrate information from multiple markers?

Species Interactions

Interactions in Microbial Communities

Time Series Data

Ruan et al., Bioinformatics (2006)

Interaction Networks in Microbial Communities

Ruan et al., Bioinformatics (2006)

Functional Profiling

Prediction of Gene Function Prediction of Metabolic Pathway

Functional Profiling (with Binning)

McCutcheon and Moran, PNAS (2007)

Future Directions/Challenges

• Inferring Species Interactions
  – Time Series Analysis
  – Network Dynamics
• Generalizing Binning to Multiple Classes
  – Semi-supervised Approach
    • Semi Supervised Projection?
  – More Phylogenetic Markers
• Iterative Binning/Assembly
  – Problem : Modeling variations within a species

Single Cell Genomics

Reads From Single Cell “Simulated” Contamination

With Ramunas Stepanauskas at Bigelow Institute

Detecting Genetic Engineering

Caveat : Also detects host anomalous DNA (e.g. LGT), Comparative Genomics helps

The Big Picture

Microbial Community

Metagenomic Sampling Single Cell Genomics

Population Structure Functional Profiling

Species Interaction Network

Time Series Data

Acknowledgements

UC Davis
• Jonathan Eisen
• Martin Wu
• Dongying Wu
• Ichitaro Yamazaki
• Amber Hartman
• Marcel Huntemann

UC Berkeley
• Lior Pachter
• Richard Karp
• Ambuj Tewari
• Narayanan Manikandan

Princeton University
• Simon Levin
• Josh Weitz
• Jonathan Dushoff
