Genovo: De Novo Assembly for Metagenomes
Genovo: De Novo Assembly for Metagenomes
Gao Song
2010/07/14
Outline
- Overview of Metagenomics
- Current Assemblers
- Genovo Assembly
Overview of Metagenomics
Metagenomics is:
Why Do We Need Metagenomics?
- Snapshot of bacterial community
- Cannot be cultivated
Motivation
<1%
Monitoring the impact of pollutants on ecosystems
Discovery of new genes, enzymes…
- Global Ocean Sampling Expedition
- Human Microbiome Project
- JGI sequenced Acid Mine Drainage sample
Applications
Marker Gene Sequencing
- 16S rRNA:
Two ways
- Other marker genes: RuBisCo, NifH
- Only composition
Whole Genome Sequencing (WGS)
- Detailed picture of community
Two Paradigms
Complex Communities (figure labels: >1000, X, 5000, 200, L, 1 million)
Current Assembler
Why not assemble reads?
ORFome assembler*
Three steps:
- The putative ORFs are annotated for each read
- ORFs are assembled using EULER
- ORF homologs are searched for in the Integrated Microbial Genomes (IMG) database
Existing WGS assemblers
- Sanger reads: Phrap, Celera, Arachne, JAZZ…
- Short reads: Velvet, Newbler…
Current Status
* Y. Ye and H. Tang, "An ORFome assembly approach to metagenomics sequences analysis," Journal of Bioinformatics and Computational Biology, vol. 7, no. 3, pp. 455-471, June 2009.
Genovo: De Novo Assembly for Metagenomes
Jonathan Laserson, Vladimir Jojic and Daphne Koller. RECOMB 2010, LNBI 6044, pp. 341-356, 2010
Main Idea
- Propose a generative model for metagenome data
- Using iterated conditional modes (ICM)
- Using hill-climbing steps iteratively
- Design a score for evaluation
Model
- Initialize contigs: infinitely many contigs, each of infinite length
- Partition the reads using the Chinese Restaurant Process
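The partition step can be illustrated with a minimal sketch of a Chinese Restaurant Process, in which each read joins an existing contig with probability proportional to the number of reads already there, or opens a new contig with probability proportional to a concentration parameter. The function name `crp_partition` and the parameter `alpha` are hypothetical and do not come from the slides.

```python
import random

def crp_partition(num_reads, alpha, seed=0):
    """Assign each read to a contig index by sampling from a Chinese Restaurant Process.

    alpha (> 0) is an assumed concentration parameter: larger values open more contigs.
    """
    rng = random.Random(seed)
    counts = []       # counts[s] = number of reads already assigned to contig s
    assignment = []   # assignment[i] = contig index chosen for read i
    for _ in range(num_reads):
        # Existing contig s has weight counts[s]; a brand-new contig has weight alpha.
        weights = counts + [alpha]
        s = rng.choices(range(len(weights)), weights=weights)[0]
        if s == len(counts):
            counts.append(0)  # the read opened a new contig
        counts[s] += 1
        assignment.append(s)
    return assignment, counts
```

Because new contigs are always available, the number of contigs does not have to be fixed in advance; it grows with the data.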
Model
- Generate the starting point o_i
- Generate the length of read
- Quality of assembly of each read
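A rough sketch of how a single read could be generated under such a model is given here. The slides do not specify the distributions, so the following are assumptions made only for illustration: the start offset has a geometric magnitude with a random sign, the contig is a finite string, and the only noise is substitution with a fixed probability `p_mismatch` (no indels). All names are hypothetical.

```python
import random

BASES = "ACGT"

def sample_offset(rho, rng):
    """Draw a signed offset whose magnitude is geometric with parameter rho (assumed form)."""
    magnitude = 0
    while rng.random() > rho:  # each failure moves the start one base further from the centre
        magnitude += 1
    return rng.choice([-1, 1]) * magnitude

def generate_read(contig, start, length, p_mismatch, rng):
    """Copy `length` bases from `contig` beginning at `start`, with substitution noise."""
    read = []
    for base in contig[start:start + length]:
        if rng.random() < p_mismatch:
            # Replace the base with one of the three other letters.
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
    return "".join(read)
```

The probability of a read given its contig position, which is what "quality of assembly" measures, is then the product of the per-base match and mismatch probabilities.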
Algorithm
- Using ICM
- Starting from an initial condition, hill-climbing moves are performed iteratively
- Move 1: Consensus Sequence: select the most frequent base
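Move 1 amounts to a per-position majority vote. A minimal sketch follows, assuming reads are already placed on the contig without gaps; the function name `consensus` and the use of 'N' for uncovered positions are illustrative choices, not taken from the slides.

```python
from collections import Counter

def consensus(contig_length, placed_reads):
    """Return the most frequent base at every contig position.

    placed_reads: list of (start, sequence) pairs, assumed gap-free.
    """
    columns = [Counter() for _ in range(contig_length)]
    for start, seq in placed_reads:
        for k, base in enumerate(seq):
            pos = start + k
            if 0 <= pos < contig_length:
                columns[pos][base] += 1
    # Positions covered by no read are reported as 'N' (placeholder choice).
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)
```

For example, `consensus(6, [(0, "ACGT"), (1, "CGTA"), (2, "GAAC")])` outvotes the single mismatching base at position 3 and returns `"ACGTAC"`.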
Algorithm
Move 2: Read Mapping
- For read i, first remove it, then recalculate its contig and alignment
- First, for each potential location, compute the alignment
- Then, select the location according to its probability
- Filtering: using common 10-mers
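The 10-mer filter in Move 2 can be sketched as a simple index lookup: only locations that share at least one 10-mer with the read are kept as candidates, and the expensive alignment is computed for those alone. The function names and the exact-match index are assumptions made for illustration.

```python
from collections import defaultdict

K = 10  # k-mer length used for filtering

def build_kmer_index(contigs, k=K):
    """Map every k-mer to the set of (contig id, position) pairs where it occurs."""
    index = defaultdict(set)
    for cid, seq in enumerate(contigs):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add((cid, pos))
    return index

def candidate_locations(read, index, k=K):
    """Return (contig id, implied read start) pairs sharing at least one k-mer with the read."""
    candidates = set()
    for r in range(len(read) - k + 1):
        for cid, pos in index.get(read[r:r + k], ()):
            # A k-mer at read offset r and contig position pos implies the read starts at pos - r.
            candidates.add((cid, pos - r))
    return candidates
```

A read with no shared 10-mer yields an empty candidate set, in which case the only remaining option is to place it in a new contig.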
Algorithm
Move 3: update geometric variable
-> Global moves:
- Propose indels
- Center
- Merge contigs
- Chimeric reads
- Disassemble the dangling contigs
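For Move 3, one way to update a geometric parameter is its closed-form maximum-likelihood estimate. The sketch below assumes the read start offsets follow a distribution proportional to rho * (1 - rho)^|o|; that form and the function name are assumptions, since the slides do not state the distribution.

```python
def update_geometric_parameter(offsets):
    """Maximum-likelihood rho for p(o) proportional to rho * (1 - rho) ** abs(o).

    Setting the derivative of n*log(rho) + sum(|o|)*log(1 - rho) to zero
    gives rho = n / (n + sum(|o|)).
    """
    n = len(offsets)
    if n == 0:
        raise ValueError("need at least one offset")
    total = sum(abs(o) for o in offsets)
    return n / (n + total)
```

Offsets that cluster tightly around the contig centre push rho towards 1, while widely spread offsets push it towards 0.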
Evaluation
- BLAST
- PFAM
- Designed score
  - 1st term: quality of assembly
  - 2nd term: penalty for total length
  - 3rd term: prefer to merge when V > V0
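The interplay of the three terms can be made concrete with a small sketch. This is one scoring form consistent with the terms listed above, not necessarily the exact formula from the paper: the weighting by the log of the alphabet size and the lack of normalization are assumptions, and `denovo_score` is a hypothetical name. The default V0 = 20 is the value mentioned in the discussion slide.

```python
import math

def denovo_score(read_log_likelihoods, contig_lengths, v0=20, alphabet_size=4):
    """Illustrative score: read quality minus a length penalty plus a per-contig bonus."""
    log_b = math.log(alphabet_size)
    quality = sum(read_log_likelihoods)               # 1st term: quality of assembly
    length_penalty = log_b * sum(contig_lengths)      # 2nd term: penalty for total length
    contig_bonus = log_b * v0 * len(contig_lengths)   # 3rd term: sets the merge threshold V0
    return quality - length_penalty + contig_bonus
```

Merging two contigs that overlap by V bases shortens the total length by V but removes one contig bonus worth V0, so under this form the score rises only when V > V0.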
Results
- Using 454 reads
- Compare with Newbler, Velvet and EULER-SR
Single Genome
Result
Metagenome data
Score
PFAM
Discussion
- New idea
- Apply a mature algorithm to the assembly domain
- Systematically describe and analyze the problem and algorithm
- Results are better
Discussion
- Slow: minutes vs. hours for 300k 454 reads
- Main idea: try to extend contigs as long as possible, so they will have more hits for BLAST
- Why choose 20 for V0?
- How to deal with branching? Repeats?
- Model:
  - Why can it capture the properties of metagenomic data?
  - How to argue the correctness of that model?
  - The distribution of starting points
Thank you