bio390 - introduction to bioinformatics: metagenomics · overview – part 1 metagenomics shinichi...
TRANSCRIPT
![Page 1: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/1.jpg)
| | | | |
October 1st, 2019 Shinichi Sunagawa Microbiome Research Group Institute of Microbiology, D-BIOL, ETH Zürich
BIO390 - Introduction to Bioinformatics: Metagenomics
1-Oct-19 Metagenomics Shinichi Sunagawa 1
![Page 2: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/2.jpg)
| | | | |
Part I - Microbial community structure • definition of genotype, taxonomic resolution and operational taxonomic units (OTUs) • analysis of microbial diversity, richness, evenness • comparison of microbial community compositions
Part II – Reconstruction and annotation of microbial community genomes • assembly of individual genomes and metagenomes • binning of metagenomic assemblies • annotation of metagenomes and metagenome assembled genomes
Overview of the Metagenomics lecture
Metagenomics Shinichi Sunagawa 2 1-Oct-19
![Page 3: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/3.jpg)
| | | | |
Background: why metagenomics?
3 1-Oct-19
McFall-Ngai,etal.,PNAS2013
Genome
Environment
Phenotype
Eco
logy
Evolution
Host genome (static)
Metagenome (dynamic)
Environment
Phenotype
Eco
logy
Evolution
McFall-Ngai et al., PNAS, 2013; Venn et al., Science, 2014
Origin of life
Single organism-centric view
Holobiont view
13 mio years
• originated some 3.8 billion years ago • drive biogeochemical cycles of the biosphere • harbor large portion of genetic diversity on Earth Impact (examples): • biogeochemistry: e.g., photosynthesis by microbes, carbon
fixation/export, nitrogen fixation • health: human microbiome
• 2 billion years for eukaryotes and microbes to evolve the most diverse forms of symbioses
Metagenomics Shinichi Sunagawa
![Page 4: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/4.jpg)
| | | | |
§ First microbial genome (1995): Haemophilus influenzae Fleischmann et al. 1995 § Followed by many isolated pathogens of diseases (plague, anthrax,
tuberculosis, Lyme disease, malaria, and sleeping sickness) § Many isolates of important non-pathogenic species: e.g., Prochlorococcus,
Lactobacillus, Bradyrhizobium
§ Most bacteria and archaea cannot be isolated, as they live in communities and depend on other organisms
§ Metagenomics provides access, in principle, to all genomic resources of a microbial community. This allows us to ask “who is there?”, and “what can they do?”
Background: why metagenomics?
Metagenomics Shinichi Sunagawa 4 1-Oct-19
![Page 5: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/5.jpg)
| | | | |
§ Bacteria and archaea have ca. 500–10,000 genes, usually arrayed on circular DNA molecules (e.g., chromosomes and plasmids)
§ Protein coding genes are ca. 1,000 base pairs long § Their genomes are ca. 600,000–12 million bp in size (human 2 x 3 billion bp) § Sequencing costs human genome: $1,000 § Microbial genomes: ~$1 à Costs for data analysis are now much higher than for data generation
Background: why metagenomics?
Metagenomics Shinichi Sunagawa 5 1-Oct-19
0
20000
40000
60000
80000
100000
120000
140000
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2017 130,000
num
ber o
f gen
omes
year
![Page 6: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/6.jpg)
| | | | |
Overview – Part 1
Metagenomics Shinichi Sunagawa 6 1-Oct-19
Samples
Taxonomic profiles
Taxonomic marker gene database
*WGS: whole genome shotgun sequencing
DNA Metagenome
Contigs/ Scaffolds
Reconstructed genomes
WGS* Assembly Binning
Annotated genes/genomes
Amplicons
PCR
deno
vo c
lust
erin
g
Metadata
Data analysis / interpretation: diversity, community dissimilarity, sample classification, novel genes/genomes
Gene prediction Functional annotation
Functional profiles
Gene functional database
![Page 7: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/7.jpg)
| | | | |
16S rRNA-based Operational Taxonomic Units (OTUs)
Metagenomics Shinichi Sunagawa 7 1-Oct-19 Hanson et al., Nat Rev Microbiol, 2012
30S small subunit of ribosomes in prokaryotes § 16S rRNA • present in all prokaryotes • conserved function as integral part of the protein
synthesis machinery • gene is rarely horizontally transferred between
different microorganisms • similar mutation rate: à molecular clock
§ Proxy for phylogenetic relatedness of organisms
§ Owing to lack of prokaryotic species definition, 97% sequence similarity is often used to define ‘species’-like:
Operational Taxonomic Units - OTUs
![Page 8: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/8.jpg)
| | | | |
§ Goal: determine ‘who’ is there at what abundance in one or more samples
Microbial community compositions
Metagenomics Shinichi Sunagawa 8 1-Oct-19
OTU S1 S2 S3 S4 S5 S6 S7 S8 …OTU1 234 87 166 240 131 249 0 244
OTU2 23 0 93 0 146 122 92 5
OTU3 2 137 191 299 285 0 0 0
OTU4 455 0 112 0 114 0 289 0
OTU5 23 229 66 247 0 216 127 116
OTU6 34 8 206 54 276 214 158 81
OTU7 0 249 0 76 132 200 0 0
OTU8 0 6 0 127 0 254 125 272
OTU9 34 174 207 91 184 0 3 44
OTU10 356 186 25 134 98 162 0 0
SUM
OTU count table Columns: S1 to Sn = samples Rows: OTUs
![Page 9: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/9.jpg)
| | | | |
Step 1: PCR-amplification of 16S rRNA amplicons (“tags”)
Primers bind to conserved regions of constant regions. Variable regions are amplified by PCR
agtctcgctatgacgtcgtcgtcagactac
gtcgtacgtcgatatttctcgcgccggagc gtcgtacgtcgatattctcgcggccggagc
agcctacgtcgtcgatagtgcgctagtgtc
PCR
Community DNA extract
Example - “V4 primers” yield ca. 250 bp long sequences
![Page 10: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/10.jpg)
| | | | |
§ All reads are aligned to each other to identify identical sequences § Unique sequences are kept and the number of identical sequences is counted § Output are unique sequences with records of identical sequences
Step 2: de-replication
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G count=4
count=5A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C G G A G G G G T A A G C A C T A A G T C A G A C T G A C G C T C G G A G G G G T A A G C A C T A A G T C A G A C T G
A C G C T C G G A G G G G T A A G C A C T A A G T C A G A C T G count=2
High quality amplicon reads Unique high quality amplicon reads
Metagenomics Shinichi Sunagawa 10 1-Oct-19
![Page 11: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/11.jpg)
| | | | |
Step 3: heuristic clustering of sequences into OTUs Deterministic approach: calculate all pairwise similarities à too “expensive” (resource and time consuming) Heuristic approach: 1) Unique high quality reads are sorted by counts (high to low) 2) Read with highest count is centroid of a new OTU (N=1) 3) Next read is compared to all OTU centroids 2 different possibilities:
a) Centroid sequence and new read are >= 97% identical • read becomes new member of the OTU (N=1)
b) Centroid sequence and new read are < 97% identical • read becomes centroid of a new OTU (N=2)
4) Next read is compared to all OTU centroids 2 different possibilities:
a) Any centroid sequence and new read are >= 97% identical • read becomes new member of the OTU (N=N)
b) Any centroid sequence and new read are < 97% identical • read becomes centroid of a new OTU (N=N+1)
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G count=4
count=5A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C G G A G G G G T A A G C A C T A A G T C A G A C T G count=2
First centroid sequence
3a) 3b) <97%
Metagenomics Shinichi Sunagawa 11 1-Oct-19
or or
+ 4b)
4a)
![Page 12: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/12.jpg)
| | | | |
§ Identification of taxon to which an OTU belongs • The centroid sequence of each OTU is compared to a database of annotated 16S rRNA gene
sequences à sequences are assigned to taxonomic ranks: phylum, class, family etc.
Step 4: taxonomic annotation of OTUs
A C G C T C A G A G C G G TA A G C A C TA A
Phylum Class Order Family Genus Species
Database of taxonomically annotated 16S sequences
Sequence of OTU 1
Taxonomy
Metagenomics Shinichi Sunagawa 12 1-Oct-19
![Page 13: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/13.jpg)
| | | | |
All reads are aligned to best matching OTU centroid sequence (and counted)
The result is an OTU (feature) count table, summarizing read counts for each OTU for each sample:
Step 5: Quantification of OTU abundances
Metagenomics Shinichi Sunagawa 13 1-Oct-19
A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C T C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G
A C G C T C T G A G C G G T A A G C A C T A A G T C A C A C T G
A C G C T C G G A G C G G T T T G C A C T A A G T C A G A C T G
A C G C T C G G A G C G G T T T G C A C T A A G T C A G A C T G
A C G C T C G G A G C G G T T T G C A C T A A G T C A G A C T G
OTU2
OTU1
OTU3
Data analysis / interpretation: diversity, community dissimilarity, sample classification
![Page 14: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/14.jpg)
| | | | |
§ How can we use an OTU count table to formally describe the diversity of a microbial community?
Concept of diversity
15 1-Oct-19
N s
ampl
es
OTUs
Alpha - diversity
![Page 15: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/15.jpg)
| | |
Alpha diversity: a function of richness and evenness richness = 3 high evenness
richness = 5 high evenness
richness = 3 low evenness
richness = 3 high evenness
1-Oct-19 Metagenomics Shinichi Sunagawa 16
Note that in this figure Site = sample S1-5 = species 1-5
![Page 16: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/16.jpg)
| | |
where pi is the relative abundance of species i, and R the richness
Shannon’s diversity index:
Site A: H’ = -(1/3*ln(1/3) + 1/3*ln(1/3) + 1/3*ln(1/3)) = 1.0986 Site B: H’ = -(1/5*ln(1/5) + 1/5*ln(1/5) + 1/5*ln(1/5) + 1/5*ln(1/5) + 1/5*ln(1/5)) = 1.6094 Site C: H’ = -(4/6*ln(4/6) + 1/6*ln(1/6) + 1/6*ln(1/6) = 0.8676
Alpha - diversity = f(richness, evenness)
1-Oct-19 Metagenomics Shinichi Sunagawa 17
Pielou’s evenness:
where S is the observed richness in sample X
Note that in this figure Site = sample S1-5 = species 1-5
![Page 17: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/17.jpg)
| | | | |
§ Now that we learned how to describe the structure of a microbial community, how do we compare different communities to each other?
Beta diversity: between sample dissimilarity
Metagenomics Shinichi Sunagawa 18 1-Oct-19
N s
ampl
es
OTUs
?
![Page 18: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/18.jpg)
| | |
Beta diversity: between sample dissimilarity
where a = # of species shared b= # of species unique to sample 1 c= # of species unique to sample 2 Site A-B: J = 3/(3+0+2) = 0.6 Site A-C: J = 3/(3+0+0) = 1
D = 1-J = 0.4 D = 1-J = 0
Jaccard index: J = a / (a + b + c)
Distance / Dissimilarity
1-Oct-19 Metagenomics Shinichi Sunagawa 19
![Page 19: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/19.jpg)
| | |
Other distance (dissimilarity) measures
1-Oct-19 Metagenomics Shinichi Sunagawa 20
ai = abundance of taxon i in sample a, and ci = abundance of taxon i in sample c
UniFrac distance = phylogenetically weighted distance Lozupone et al., 2005
![Page 20: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/20.jpg)
| | |
N s
ampl
es
OTUs N
samples
N samples
Alpha - diversity
Beta - diversity
Within sample descriptions à between sample comparisons
1-Oct-19 Metagenomics Shinichi Sunagawa 21
Beta - diversity
![Page 21: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/21.jpg)
| | | | |
Visualize dissimilarities between microbial communities
Metagenomics Shinichi Sunagawa 22 1-Oct-19
§ For 2 (xy) or 3 (xyz) variables, data can be easily visualized § For multi (n>3) dimensional data, calculate distances and ‘project’ into lower dimensional space
Hierarchical clustering Non-metric dimensional Principal component or coordinate
scaling (NMDS) analysis (PCA or PCoA)
![Page 22: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/22.jpg)
| | |
Applied examples I
Top: Yatsunenko et al. Nature (2012); Bottom: C Huttenhower et al. Nature (2012)
• Microbial diversity in human gut increases with age • US citizens have harbor less diverse gut microbiota
relative to other populations
• Microbial communities cluster by human body site rather than by individual
![Page 23: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/23.jpg)
| | |
Applied examples II
Temperature has highest correlation with surface microbial community composition in the open ocean
Sunagawa et al. Science (2015)
![Page 24: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/24.jpg)
| | | | |
§ Metagenomics provides information about microorganisms that are often difficult or impossible to be cultivated in their natural environment
§ Due to the lack of species concept for prokaryotes, researchers use sequence identity cutoffs of marker genes to define operational taxonomic units
§ Microbial community structure describes the richness (number of species) of taxa and their evenness (the distribution of their abundances) • Alpha diversity (within sample diversity) is a function of richness and evenness • Alpha diversity can be quantified by diversity indices (e.g., Shannon, Simpson)
§ Beta diversity describes differences in microbial community structures § Differences are quantified by dissimilarity indices (e.g., Jaccard, Bray-Curtis)
Summary – Part I
Metagenomics Shinichi Sunagawa 25 1-Oct-19
![Page 25: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/25.jpg)
| | | | |
Part II – Reconstruction and annotation of microbial community genomes • assembly of individual genomes and metagenomes • binning of metagenomic assemblies into metagenome-assembled genomes • taxonomic and functional annotation of metagenomes
Overview of the Metagenomics lecture
Metagenomics Shinichi Sunagawa 26 1-Oct-19
![Page 26: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/26.jpg)
| | | | |
Overview – Part 2
Metagenomics Shinichi Sunagawa 27 1-Oct-19
Samples
Taxonomic profiles
Taxonomic marker gene database
*WGS: whole genome shotgun sequencing
DNA Metagenome
Contigs/ Scaffolds
Reconstructed genomes
WGS* Assembly Binning
Annotated genes/genomes
Amplicons
PCR
deno
vo c
lust
erin
g
Metadata
Data analysis / interpretation: diversity, community dissimilarity, sample classification, novel genes/genomes
Gene prediction Functional annotation
Functional profiles
Gene functional database
![Page 27: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/27.jpg)
| | | | |
Metagenome:manygenomeswithdifferentabundances
Highthroughputsequencing
Binning
Reconstruction of (community) genomes: overview
28
Binsormetagenomeassembledgenomes(MAGs):Contigs/scaffoldsthatoriginatefromthesameorganism
Rawreads
scaffold1
scaffold2
contig 1
contig 2
contig 3 Dataqualitycontrol
HQreads
Assembly
Genepredictionandannotation
Analysis
Scaffolding
(many) unbinned contigs
![Page 28: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/28.jpg)
| | | | |
§ DNA extracted from a metagenomic sample is randomly sheared into inserts of known size distribution (i.e., min, max, mean)
§ Adapters are added to facilitate the sequencing of these inserts
Background: DNA sequencing libraries
Metagenomics Shinichi Sunagawa 29 1-Oct-19
Note: Paired end reads may be overlapping providing the possibility to “merge” reads into one or not. In the latter case, the sequence between the paired reads remains unknown, while the length can be estimated due to the known insert size distribution
![Page 29: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/29.jpg)
| | | | |
A C T G A A C T A A G T A
A C T G A A C T C A A N A T T T A G C T G C A
1. Same base 2. Different base 3. Additional base 4. Undefined base
40 40 40 40 30 20 40 40 10 30 20 5 30
Quality score
% Correct Base
40 99.99
30 99.9
20 99
10 90
Probability of error:
sequenced read
Step 1: Data quality control - sources of errors
adapter
original sequence
Other sources of error 2. Residual adapter sequences 3. Residual control DNA sequences (e.g., “PhiX spike-ins”) 4. Contamination from non-target organisms
1. Low base calling quality scores Base calling quality (phred) scores
Q = -10 log10 P
Probability of truth: 1 - P
P = 10 – Q/10
Metagenomics Shinichi Sunagawa 30 1-Oct-19
![Page 30: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/30.jpg)
| | | | |
Step 2: Assembly and scaffolding: overview
El-Metwally et al., PLoS Comp Biol, 2013
See previous slides
Metagenomics Shinichi Sunagawa 31 1-Oct-19
![Page 31: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/31.jpg)
| | | | |
Graph construction: k-mer based assembly
A) k-mer-based graph Nodes = k-mers Edges = k-1 overlaps
B) Layout shortest Eulerian path Visit each edge once
C) Combine into consensus
El-Metwally et al., PLoS Comp Biol, 2013
Metagenomics Shinichi Sunagawa 32 1-Oct-19
![Page 32: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/32.jpg)
| | | | |
Post processing: scaffolding
A) Align paired-end reads
B) Orientation of contigs à according to read orientation
C) Scaffolding à use information on insert size distribution and fill ‘gaps’ with ‘Ns’
El-Metwally et al., PLoS Comp Biol, 2013
NNNNNNN
Metagenomics Shinichi Sunagawa 33 1-Oct-19
![Page 33: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/33.jpg)
| | | | |
Assembly quality: N50 and L50
A) Set of contigs
B) Sort contigs by size in descending order
C) Calculate total length of all contigs
D) Add length in descending order until sum equals or exceeds 50% of the total length.
à N50 = size of last contig added (15kb) à L50 is the number of contigs required to equal or exceed N50 34 1-Oct-19
![Page 34: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/34.jpg)
| | | | |
Repetitive sequences in genomes prevent full assembly
Contig A
Contig C
Repeat
Contig B
Repeat (copy 1) Repeat (copy 2) A B C
Genome
Assembly
Contig A Repeat Contig B
Ambiguous
Contig A Repeat Contig C
Metagenomics Shinichi Sunagawa 35 1-Oct-19
![Page 35: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/35.jpg)
| | | | |
Repetitive sequences in genomes prevent full assembly
Contig A
Contig C
Repeat
Contig B
Repeat (copy 1) Repeat (copy 2) A B C
Genome
Assembly
Contig A Repeat Contig B
Ambiguous
Long read 1
Long read 2
à Long sequencing reads can be used to resolve repeats
Contig A Repeat Contig C
Metagenomics Shinichi Sunagawa 36 1-Oct-19
![Page 36: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/36.jpg)
| | | | |
Step 3: Binning contigs/scaffolds – composition guided Composition-guidedbinning(independentfromexternalinformation):
Tetranucleotide(orotherk-mer)frequencies(e.g.,GGAGvs.GGAC)+contigabundance
Kang et al. PeerJ (2015)
• For each contig: a) calculate k-mer frequencies b) calculate read abundance
• Combine a) and b) into a distance matrix
• Resolve distance matrix into clusters of highly correlated contigs
Metagenomics Shinichi Sunagawa 37 1-Oct-19
![Page 37: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/37.jpg)
| | | | |
Example for tetranucleotide frequency (TNF) distances [ATGC]4=256possiblecombinations
f
1 256
Contig1
f
1 256
f
1 256
Contig2
Contig3
Metagenomics Shinichi Sunagawa 38 1-Oct-19
small distance – high likelihood for same bin
large distance
![Page 38: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/38.jpg)
| | | | |
Beispiel200Proben
Abundance
1 50Contigs
Example for abundance-based distances
Bin1 Bin2 etc.
Metagenomics Shinichi Sunagawa 39 1-Oct-19
à TNF and abundance-based distances can be combined and used for iterative binning
![Page 39: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/39.jpg)
| | | | |
Step 3: Binning contigs/scaffolds – taxonomy guided Taxonomy-guided (based on reference genomes) use of clade specific marker genes for binning metagenomic assemblies into draft genomes
WGS, QC, assembly
Binning + gene prediction
Reconstructed genomes
N. Segata lab, University of Trento
X is a unique marker gene for clade Y A is a unique marker gene for clade B
Clade B
Gene A
à Note that clade specific marker genes can also be used to identify contaminated bins
Metagenomics Shinichi Sunagawa 40 1-Oct-19
![Page 40: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/40.jpg)
| | | | | Metagenomics Shinichi Sunagawa 42 1-Oct-19 Hug et al., Nat. Microbiol., 2016; Zaremba-Niedzwiedzka et al., Nature, 2017; Martijn et al., Nature, 2018
Tree of life
Candidate phyla radiation discovered by metagenomics
All eukaryotes
Only two domains of life?
Eukaryotes
Archaea
Mitochondrial origin not within Rickettsiales?
Mitochondria Rickettsiales
α-Proteobacteria
Applied examples III
https://www.nature.com/articles/s41586-018-0059-5
https://www.nature.com/articles/nature21031
https://www.nature.com/articles/nmicrobiol201648
![Page 41: BIO390 - Introduction to Bioinformatics: Metagenomics · Overview – Part 1 Metagenomics Shinichi Sunagawa 1-Oct-19 6 Samples Taxonomic profiles Taxonomic marker gene database *WGS:](https://reader031.vdocuments.net/reader031/viewer/2022040610/5ecfbdaf951509080e10ee11/html5/thumbnails/41.jpg)
| | | | |
§ Reconstruction of microbial genomes from metagenomes is challenging as natural microbial communities are complex (many co-existing strains, uneven distribution of abundance)
§ Assembled contigs/scaffolds can be binned by composition and/or taxonomy-based approaches
§ Assembly metrics provide means of quality control and lineage-specific genes can reveal contaminations in genomic bins
§ Functional annotation of genes/genomes can involve several search strategies against many different databases
§ Strain level differences between genomes of the same species result in core and pan-genomes
§ MAGs (metagenome assembled genomes) have revealed unexpected diversity and new implications on evolutionary origin of eukaryotes and mitochondria
Summary – Part 2
Metagenomics Shinichi Sunagawa 43 1-Oct-19