day3 lecture binning - medicalintelligence.org

58
www.elixir-europe.org ELIXIR Workshop in metagenomics Nador February 2018

Upload: others

Post on 10-Dec-2021

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: day3 Lecture binning - medicalintelligence.org

www.elixir-europe.org

ELIXIRWorkshop in metagenomics

Nador February 2018

Page 2: day3 Lecture binning - medicalintelligence.org

www.elixir-europe.org

Genome assembly from metagenomic reads

Page 3: day3 Lecture binning - medicalintelligence.org

Overview• Obtaining a genome sequence

• Binning

• Taxonomic classification of bins

Page 4: day3 Lecture binning - medicalintelligence.org

Why do we want to obtain a genome sequence?o Genome

o Chromosomal structure and organisation

o Identify all genes and predict their functions

o Starting point for other studies

o Transcriptomics, proteomics, evolutionary studies

Page 5: day3 Lecture binning - medicalintelligence.org

What is in the databases - for example RefSeq?oBiased with sequences having pathogenic or industrial implications

oLarge fraction of Proteobacteria

oHost-associated are overrepresented

Ecosystem Total

Host-associated 11,816

Humans 4973

Animal 1804

Plants 1410

Mammals 867

Other 2762

Environmental 6774

Aquatic 4559

Terrestrial 2057

Other 158

Engineered systems 1658

Food production 440

Wastewater 410

Lab synthesis 387

Other 418

Total 20,248GOLD database

RefSeq

Page 6: day3 Lecture binning - medicalintelligence.org

•Microbial dark matter• Uncultivable organisms

• Many lineages known from 16 rRNA sequencing lacks a genomic representative

Christian Rinke / Tanja Woyke, DOE JGI

Page 7: day3 Lecture binning - medicalintelligence.org

Obtaining a genome sequence

Genome sequencing => Requires culturing

Single cell sequencing => Incomplete genomes

Metagenomic sequencing => The solution?

Page 8: day3 Lecture binning - medicalintelligence.org

Obtaining a genome sequence

Genome sequencing => Requires culturing

Single cell sequencing => Incomplete genomes

Metagenomic sequencing => The solution?

Page 9: day3 Lecture binning - medicalintelligence.org

Recap – Challenges in assembly of genome sequences

homes.cs.washington.edu

• Identify overlapping sequence reads or K-mers and create a graph

• Challenges when there are variations or repeats

• Creates bubbles in the graph

• Split into contigs

TCGCGGACGCGGATGCGGATT

* * *

Page 10: day3 Lecture binning - medicalintelligence.org

Obtaining a genome sequence from a metagenomic sample

www.slideshare.net Mads Albertsen, University of Vienna

• Metagenomic Assembled Genomes (MAGs)

• Similar as to genome sequencing

• Trying to reconstruct the individual genomes of a mixture of DNA from an entire population

Page 11: day3 Lecture binning - medicalintelligence.org

Challenges obtaining a MAGs• Very fragmented and rarely complete genomes in the sample

• Highly diverse DNA (extremely many K-mers?)

• Diverse level of abundance

• Different relatedness to each other (same specie but different strains?)

• Computational challenging

www.slideshare.net Mads Albertsen, University of Vienna

Page 12: day3 Lecture binning - medicalintelligence.org

Challenges obtaining a MAGs• Most metagenomics assemblers use de Bruijn graphs

• Algorithms for single genome assemblies cannot be used directly

• Digital normalization aims to eliminate redundant reads

• Partition the de Bruijun graph prior to assembly – lower memory costs

• ‘Bubble popping' procedure. Parallel paths in the graph that differ by only a small amount, these paths are collapsed into one

• Metagenomic assemblies will still be highly fragmented - Binning

Page 13: day3 Lecture binning - medicalintelligence.org

Binning• Method to sort data values into a smaller groups or “bins”

• For example to group animals into more taxon-specific bins

Jungle Animals Real Life Wallpapers Hd

Page 14: day3 Lecture binning - medicalintelligence.org

Binning• Method to sort data values into a smaller groups or “bins”

• For example to group animals into more taxon-specific bins

• Various taxonomic levels: All belong to Kindom = Animalia, but Class = Aves AND Mammalia

Animalia

Mammalia Aves

Page 15: day3 Lecture binning - medicalintelligence.org

Binning - metagenomics• Group contigs or reads belonging to the same specie

• Group nucleotide sequences based on composition

• Group nucleotide sequences based on abundance

SlideShare / CoMet: Coverage and Composition based binning of Metagenomes

Page 16: day3 Lecture binning - medicalintelligence.org

Two types of binning strategies• Taxonomy dependent and taxonomy independent strategies

• Taxonomy dependent methods classifies the DNA fragments by performing a standard homology inference against a reference database

• Taxonomy independent methods performs the reference-free binning by applying clustering techniques

BINNING

Taxonomy independent

Alignment based

Taxonomy dependent

Composition based

Hybrid

Abundance based

Composition based

Hybrid

ReferenceNo reference

Page 17: day3 Lecture binning - medicalintelligence.org

Two types of binning strategies• Taxonomy dependent and taxonomy independent strategies

Modified from: Siegwald et al. / PLoS ONE 12(1): e0169563. https://doi.org/10.1371/journal.pone.0169563

Page 18: day3 Lecture binning - medicalintelligence.org

How are nucleotide sequences binned?• Abundance based binning

• Also called coverage based binning

• Sequences originating from the same specie will have similar abundance in the sample

SlideShare / A. Murat Eren, Intro to metagenomic binning

Page 19: day3 Lecture binning - medicalintelligence.org

How are nucleotide sequences binned?• Composition based binning

• Genomic signatures have been shown to display a species-specific pattern

• GC content is simple and commonly used genomic signature

• More widely used genomic signature is tetranucleotide frequencies

SlideShare / A. Murat Eren, Intro to metagenomic binning

Page 20: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

Page 21: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

Page 22: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG

Seq1

Page 23: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG

Seq1 1 1 1 1 1

Page 24: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Page 25: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq3 2 1 1 1

Seq4 2 1 1 1

Seq5 1 1 2 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Page 26: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq3 2 1 1 1

Seq4 2 1 1 1

Seq5 1 1 2 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Page 27: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq4 2 1 1 1

Seq3 2 1 1 1

Seq5 1 1 2 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Page 28: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

AATT ATTC TTCC TCCG CCGG ATTA TTAA TAAG AAGG TCCC AGGA GGAA GAAG TAAT

Seq1 1 1 1 1 1

Seq2 1 1 1 1 1

Seq4 2 1 1 1

Seq3 2 1 1 1

Seq5 1 1 2 1

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Page 29: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

Seq1

Seq2

Seq4

Seq3

Seq5

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Page 30: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

Seq1

Seq2

Seq4

Seq3

Seq5

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Seq1

Seq2 Seq4Seq3

Seq5

PCAAnalysis

PCA1

PCA2

Page 31: day3 Lecture binning - medicalintelligence.org

Composition based binning• Computation of tetranucleotide frequencies (k-mer =4)Seq1: AATTCCGG

AATTATTC

TTCCTCCG

CCGG

Seq1

Seq2

Seq4

Seq3

Seq5

Seq2: AATTAAGGAATT

ATTATTAA

TAAGAAGG

Seq3: AAGGAAGGAAGG

AGGAGGAA

GAAGAAGG

Seq4: AATTAATTAATT

ATTATTAA

TAATAATT

Seq5: GGAAGGAAGGAA

GAAGAAGG

AGGAGGAA

Seq1

Seq2 Seq4Seq3

Seq5Bin 1

Bin 2PCAAnalysis

PCA1

PCA2

Page 32: day3 Lecture binning - medicalintelligence.org

K. Sedlar et al. / Computational and Structural Biotechnology Journal 15 (2017)

Binning – Taxonomy independent methods• Hybrid binning

• Combine abundance and composition

• Give more accurate binning results

• Can be performed on either sequence reads or assembled contigs

• Potentially separate subspecies into individual bins

Page 33: day3 Lecture binning - medicalintelligence.org

Binning – Taxonomy dependent methods• Referred to as supervised.

• Methods rely on comparison of contigs or reads against reference databases (known phylogenetic origin)

• Reads classified under similar taxonomic categories are finally grouped into bins

• Comparison on sequence level using:

•BLAST

•BLAT

•Bowtie

•BWA

• Comparison on model level using:

•HMMs + databases

Page 34: day3 Lecture binning - medicalintelligence.org

Modified from: Mande et al. / Briefings in Bioinformatics, Volume 13, Issue 6, 1 November 2012, Pages 669–681

Binning – Taxonomy dependent methods

Taxonomy dependent Taxonomy independent

Page 35: day3 Lecture binning - medicalintelligence.org

Binning – Taxonomy independent methods• Referred to as un-supervised

• Enables the discovery of new microbial of new organisms

• Two types of features used for classification

• Sequence composition based

•Assumption that the genome composition is unique for each taxon

•DNA fragments from the same genome are more similar than those from different genomes

•Cluster formation being defined by k-mer composition

• Abundance based

•Coverage reflecting abundance of given tax

•Cluster formation being defined by k-mer abundance

Page 36: day3 Lecture binning - medicalintelligence.org

• Binning of assembled contigs using an expectation-maximization algorithm

• Bins are predicted from initial identification of marker genes in assembled sequences

• Tetranucleotide frequencies and scaffold coverages are combined to organize metagenomicsequences into individual bins

• Estimation of genome completeness – 107 marker genes

Wu et al., Microbiome 2014 2:26

Taxonomy independent binning of contigs - MaxBin

Page 37: day3 Lecture binning - medicalintelligence.org

• Sequencing depths highly affect the results

• 10-genome simulated datasets - 20X versus 80X coverage

MaxBin - preformance

Wu et al., Microbiome 2014 2:26

Page 38: day3 Lecture binning - medicalintelligence.org

S. Girotto et al. / Bioinformatics, Volume 32, Issue 17, 1 September 2016, Pages i567–i575

Taxonomy independent binning of sequence reads- MetaProb• Assembly-assisted tool for binning of reads

• Phase 1 groups overlapping reads into groups

• Phase 2 builds the probabilistic sequence signatures of independent reads and merges the groups into clusters

Page 39: day3 Lecture binning - medicalintelligence.org

K. Sedlar et al. / Computational and Structural Biotechnology Journal 15 (2017)

Binning – Taxonomy independent methods

Page 40: day3 Lecture binning - medicalintelligence.org

• Investigated performance when recovering individual genome bins

• Large variation:

• Average genome completeness (34% to 80%)

• Average purity (70% to 97%)

Reference

Binning results for the CAMI data sets

Page 41: day3 Lecture binning - medicalintelligence.org

Selecting a binning method• Highly dependent on the sample and the aim of the project and available resources

• The length of the metagenomic sequences – key factor

• Ultra-short sequences (75 bp) - assembly step becomes a necessity

• Short length sequences (200-400 bp)- alignment-based or hybrid binning methods

• Long length sequences - alignment-based as well as composition-based binning methods

• Are you aiming to identify novel un-culturable species?

• Human microbiota

•Most species are known

•Presence or absence of one or several species?

• Environmental samples

•Most species are unknown

Page 42: day3 Lecture binning - medicalintelligence.org

Visualization of bins - GroopM• Assume that sequences in bins originate from the same specie

• Sequences are positioned according to the first two principal components of their tetranucleotide frequencies

• The diameter of each circle is proportional to the length its respective contig

Ilmerford et al. / PeerJ. 2014; 2: e603. Published online 2014 Sep 30

Page 43: day3 Lecture binning - medicalintelligence.org

Visualization of bins - GroopM• Assume that sequences in bins originate from the same specie - MAG

• Sequences are visualized in coverage space

• Contigs belonging to a single bin are assigned the same color

Ilmerford et al. / PeerJ. 2014; 2: e603. Published online 2014 Sep 30

Page 44: day3 Lecture binning - medicalintelligence.org

Refinement of bins - ICoVeR• The final bins will contain errors (sequences that belong to another bin)

• Curation of bin assignments based on multiple binning algorithms

• Higher completeness and lower contamination

Page 45: day3 Lecture binning - medicalintelligence.org

Refinement of bins - RefineM• Methods for outlier filtering reduces the total number of contigs being binned

• Deviating GC

• Deviating tetranucleotide composition

• Deviating coverage depth

• Identify contigs with a coding density suggestive of a Eukaryotic origin

Refinement of bins

Page 46: day3 Lecture binning - medicalintelligence.org

Estimate completeness and contamination of MAGs• Assembly statistics

• Total size of MAG (sum of contigs in bin)

• Contig size (N50 value)

• Presence and absence of lineage-specific genes

• Presence of 20 standard tRNAs

Page 47: day3 Lecture binning - medicalintelligence.org

Estimate completeness and contamination of MAGs• CheckM assess the quality of genomes recovered from metagenomes

• Estimate genome completeness and contamination

• Using collocated single-copy marker genes within a phylogenetic lineage

•Bacteria: 104 markers organized into 58 sets

•Archaea: 150 markers organized into 108 sets

Parks et al. / Nat Microbiol. 2017 Nov;2(11):1533-1542.

Page 48: day3 Lecture binning - medicalintelligence.org

Taxonomic classification of MAGs• Identification of 16S rRNA genes in MAGs

• However many MAGs often lack 16s rRNA genes

• Identification of lineage-specific genes

• Kraken – you have seen before

• Kaiju – you have seen before

• BBtools - sketch

Page 49: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences

• Alignment-based = high computational cost

• Composition-based = fast, much less compute intensive

Page 50: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences

• Alignment-based = high computational cost

• Composition-based = fast, much less compute intensive

Erik Hassan Nils Espen Kamal

DATABASE OF ALL CRIMINALS IN MOROCCO

?

Page 51: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences

• Alignment-based = high computational cost

• Composition-based = fast, much less compute intensive

Erik Hassan Nils Espen Kamal

DATABASE OF ALL CRIMINALS IN MOROCCO

?Hash function Hashes

EH1

EH2

EH3

EH4

EH5

Page 52: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences

• Alignment-based = high computational cost

• Composition-based = fast, much less compute intensive

Erik Hassan Nils Espen Kamal

DATABASE OF ALL CRIMINALS IN MOROCCOHash function Hashes

ER1

ER2

ER3

ER4

ER5

EH1

EH2

EH3

EH4

EH5

HG1

HG2

HG3

HG4

HG5

NW1

NW2

NW3

NW4

NW5

ER1

ER2

ER3

ER4

ER5

KM1

KM2

KM3

KM4

KM5

?

Page 53: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Comparing hash tables is much faster than comparing sequences

• Alignment-based = high computational cost

• Composition-based = fast, much less compute intensive

Erik Hassan Nils Espen Kamal

DATABASE OF ALL CRIMINALS IN MOROCCOHash function Hashes

ER1

ER2

ER3

ER4

ER5

EH1

EH2

EH3

EH4

EH5

HG1

HG2

HG3

HG4

HG5

NW1

NW2

NW3

NW4

NW5

ER1

ER2

ER3

ER4

ER5

KM1

KM2

KM3

KM4

KM5

?761-167

Page 54: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Sketch creation works like this:

• Gather all length kmers in a sequence – 31 nucleotides is typically used

• Apply a hash function to them

• A genome tend to contain millions of kmers

• Keep the N smallest hash codes and call this set a "sketch” – normally 10 000

Modified from: http://slideplayer.com/slide/10898138/

DATABASE WITH SKETCHES

?

Sketch

Page 55: day3 Lecture binning - medicalintelligence.org

Bbtools sketch - Hash functions• Sketch creation works like this:

• Gather all length kmers in a sequence – 31 nucleotides is typically used

• Apply a hash function to them

• A genome tend to contain millions of kmers

• Keep the N smallest hash codes and call this set a "sketch” – normally 10 000

Modified from: http://slideplayer.com/slide/10898138/

DATABASE WITH SKETCHES

?

Sketch

Page 56: day3 Lecture binning - medicalintelligence.org

Binning pipeline - Desman• De novo Extraction of Strains from MetAgeNomes

Page 57: day3 Lecture binning - medicalintelligence.org

Example of important outcome from MAGs• Analysed 1500 metagenomics datasets

• First genomes from 17 bacterial phyla and 3 archaeal phyla

Page 58: day3 Lecture binning - medicalintelligence.org

Major outcomes from MAGs• All genomes are estimated to be ≥50% complete and nearly half are ≥90%

complete with ≤5% contamination.

• These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30%

Parks et al. / Nat Microbiol. 2017 Nov;2(11):1533-1542.