seminar: haider et al. 2014, bioinformatics:btu395
TRANSCRIPT
Omega: an overlap-graph de novo assembler for metagenomics
Presentation by Rosemary McCloskey
Bahlul Haider1, Tae-Hyuk Ahn1, Brian Bushnell2, Juanjuan Chai1, Alex Copeland2, Chongle Pan1
1Oak Ridge National Laboratory
2US Department of Energy Joint Genome Institute
October 2, 2014
Haider et al. Omega October 2, 2014 1 / 11
Metagenomics
Analysis of genetic material (typically bacteria) directly from environmental samples.
The front bumper of a 2006 Dodge Caravan (“The Wanderer”) was . . . taped with a double-sided carpet tape. The tapes were applied . . . in State College, Pennsylvania, and removed . . . in Manchester, Connecticut. New tapes were again applied in Portland, Maine . . . and removed in Moncton, New Brunswick . . . the following day.
The list included unexpected entries such as the genus Homo even though the two trips were uneventful.
Challenges
Metagenomic samples can contain hundreds of distinct genomes.
Most do not have a reference genome.
Representation of genomes is not equal.
Goal of Omega: assemble individual genomes from metagenomic data.
Trindade-Silva, Amaro E., et al. “Polyketide synthase gene diversity within the microbiome of the sponge Arenosclera brasiliensis, endemic to the Southern Atlantic Ocean.” Applied and Environmental Microbiology 79.5 (2013): 1598-1605.
Graph theory
Definition
A graph G is a pair (V, E), where V is the vertex set (any set) and E is a (multi)set of edges connecting elements of V.
[Figure: example graph on the vertices a, b, c, d]
Definition
A bi-directed graph associates two directions to each edge.
Definition
A path is a sequence of contiguous edges with no loops.
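The definitions above can be sketched in a few lines of Python. This is only an illustration: the edges chosen for the four vertices are invented (the figure's actual edges are not recoverable from the slide text), and "no loops" is read here as "no vertex is visited twice".

```python
from collections import Counter

# Vertex set V and edge multiset E for a small example graph on a, b, c, d.
# The particular edges are invented for illustration; ("a", "b") occurs
# twice to show that E is a multiset.
V = {"a", "b", "c", "d"}
E = Counter([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "b")])

def is_path(vertices, edges=E):
    """True if consecutive vertices are joined by an edge and none repeats."""
    if len(set(vertices)) != len(vertices):
        return False  # a repeated vertex would close a loop
    # Edges are treated as undirected here, so check both orientations.
    return all(edges[(u, v)] > 0 or edges[(v, u)] > 0
               for u, v in zip(vertices, vertices[1:]))
```

With these edges, `is_path(["a", "b", "c"])` holds, while `["a", "c"]` (no such edge) and `["a", "b", "a"]` (a repeated vertex) are rejected.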
Overlap graph
node = read
edge = overlap
direction = prefix/suffix, forward/reverse
path = contig
disallow paths entering and exiting a node by the same type of arrow
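A rough Python sketch of these ideas, under stated assumptions: reads are plain DNA strings, overlaps are exact suffix/prefix matches, and each bi-directed edge is written as a hypothetical tuple (u, arrow_at_u, v, arrow_at_v) with arrows labelled "in" or "out". None of these names come from Omega itself.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (gives the 'reverse' orientation)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def overlap(a, b, min_len=3):
    """Length of the longest proper suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def valid_walk(edges):
    """Check the traversal rule on a list of edges taken in order.

    Each edge is (u, arrow_at_u, v, arrow_at_v) and is traversed u -> v.
    The walk is rejected if it enters and leaves a read by the same
    type of arrow, or if consecutive edges do not share a read.
    """
    for (_, _, v, arrow_in), (u, arrow_out, _, _) in zip(edges, edges[1:]):
        if v != u or arrow_in == arrow_out:
            return False
    return True
```

For example, `overlap("ACGTAC", "TACGGA")` finds the 3-bp overlap `TAC`, and a walk r1 -> r2 -> r3 is accepted only if the arrow by which it enters r2 differs from the arrow by which it leaves.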
Simplifications
remove triangles
contract simple vertices
trim small branches
remove bubbles
[Figures: the graph before ⇒ after each simplification]
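Two of these simplifications, triangle removal and contraction of simple vertices, can be sketched as follows. This is a simplified illustration, not Omega's implementation: it assumes an ordinary directed, acyclic graph stored as a dict of successor sets (every vertex present as a key) instead of a bi-directed graph, and the function names are made up.

```python
def remove_triangles(adj):
    """Drop u->w whenever u->v and v->w also exist (a transitive edge)."""
    reduced = {u: set(vs) for u, vs in adj.items()}
    for u, vs in adj.items():
        for v in vs:
            for w in adj.get(v, ()):
                if w != v and w in vs:
                    reduced[u].discard(w)
    return reduced

def simple_chains(adj):
    """List maximal chains through vertices with one predecessor and one successor."""
    indeg = {u: 0 for u in adj}
    for vs in adj.values():
        for v in vs:
            indeg[v] += 1

    def simple(v):
        return indeg[v] == 1 and len(adj[v]) == 1

    chains = []
    for u in adj:
        if simple(u):
            continue  # chains start only at non-simple vertices
        for v in adj[u]:
            chain = [u]
            while simple(v):
                chain.append(v)
                (v,) = adj[v]  # the single successor
            chain.append(v)
            chains.append(chain)
    return chains

# Toy example: r1->r3 is implied by r1->r2->r3.
adj = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
reduced = remove_triangles(adj)
chains = simple_chains(reduced)
```

On the toy graph, the transitive edge r1->r3 is dropped, after which r2 is a simple vertex and the whole graph contracts into the single chain r1, r2, r3.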
Finding contigs
push flow along long edges (>1000 bp)
minimize total flow in network
merge edges with mate-pair support
scaffold edges with mate-pair support
resolve ambiguity by coverage depth
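A small sketch of two ingredients from this list, with assumptions stated up front: "push flow along long edges" is read as giving every edge longer than 1000 bp a flow lower bound of 1, and mate-pair support is counted as the number of pairs whose two reads map to two different edges. The minimum-cost-flow solve itself is not shown, the support threshold of 3 is the one the talk later quotes for scaffolding, and all names are hypothetical.

```python
from collections import Counter

LONG_EDGE_BP = 1000  # threshold quoted on the slide
MIN_SUPPORT = 3      # mate-pair support the talk quotes for scaffolding

def flow_lower_bound(edge_length_bp):
    """Assumed reading of the slide: long edges must carry at least one unit of flow."""
    return 1 if edge_length_bp > LONG_EDGE_BP else 0

def supported_links(mate_pairs, min_support=MIN_SUPPORT):
    """Return edge pairs linked by at least min_support mate pairs.

    mate_pairs: iterable of (edge_of_read1, edge_of_read2) identifiers.
    Pairs whose reads fall on the same edge say nothing about links
    between edges and are ignored.
    """
    counts = Counter(tuple(sorted(p)) for p in mate_pairs if p[0] != p[1])
    return {pair for pair, n in counts.items() if n >= min_support}
```

Given three pairs linking e1 and e2 but only two linking e2 and e3, only the (e1, e2) link would be kept for merging or scaffolding.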
Benchmarking: real data
HiSeq 100-bp dataset, 64 micro-organisms.
N80 = k ⇔ k is the largest length such that contigs of length ≥ k contain at least 80% of the assembled bases (larger is better).
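As a worked example, here is a minimal sketch of the usual base-weighted N-statistic: sort contigs from longest to shortest and report the length at which the running total first reaches the requested share of all assembled bases. The function name and the integer `percent` parameter are choices made for this illustration.

```python
def n_stat(contig_lengths, percent=80):
    """Largest k such that contigs of length >= k hold >= percent% of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length
    return 0  # no contigs

lengths = [100, 80, 60, 40, 20]  # 300 bases in total
n80 = n_stat(lengths)            # 100 + 80 + 60 = 240 = 80% of 300
n50 = n_stat(lengths, percent=50)
```

Here N80 is 60 and N50 is 80: a few long contigs dominate the statistic, which is why a larger value indicates a more contiguous assembly.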
Benchmarking: simulated data
Simulated MiSeq 300-bp dataset, 9 genomes.
Good and bad
What was good:
Clear and detailed description of algorithm.
Upfront about limitations of their software and difficulties of benchmarking.
Room for improvement:
Many unjustified parameter choices (long edges := > 1000 bp, scaffolding requires support of 3 mate pairs, . . . ).
Why no Celera on real data?