genome assembly: then and now — v1.1
Post on 10-May-2015
Genome assembly: then and now
Keith Bradnam
Image from Wellcome Trust
Image from flickr.com/photos/dougitdesign/5613967601/
Contents
Sequencing 101
Genome assembly: then
Genome assembly: now
Assemblathon 1 & 2
Advice & Angst
The future
More info
✤ http://assemblathon.org
✤ http://gigasciencejournal.com
✤ http://twitter.com/assemblathon
Sequencing 101
A, C, G, T...
Image from nlm.nih.gov
Read
Read pair
Mate pair
Contigs
Scaffold
NNNNNNNNNNNNNNNNNNN
Assembly size
[Figure: a set of scaffolds, each labeled with its length in Mbp; the assembly size is the sum of all lengths, shown as 200 Mbp]
N50 length
[Figure: animated build. Scaffold lengths are added up from the longest downwards (running totals of 70, 95 and 115 Mbp are shown) until half of the assembly size is reached; a second example repeats the calculation for a 190 Mbp assembly]
N50 for two assemblies
Assembly 1: size = 208 Mbp, N50 = 15 Mbp
Assembly 2: size = 190 Mbp, N50 = 25 Mbp
NG50 for two assemblies
Expected genome size = 250 Mbp
Assembly 1: NG50 = 15 Mbp
Assembly 2: NG50 = 15 Mbp
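The N50 and NG50 calculations on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the Assemblathon scripts: the function name `n50` and its `genome_size` parameter are made up for this example, and the scaffold lengths are the ones that appear in the slide residue for the 190 Mbp assembly.

```python
def n50(lengths, genome_size=None):
    """Return N50, or NG50 when an expected genome size is supplied.

    Lengths are summed from longest to shortest; the answer is the length
    at which the running total reaches half of the assembly size (N50)
    or half of the expected genome size (NG50).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None  # the assembly covers less than half of the expected genome


# Scaffold lengths in Mbp for the 190 Mbp example (taken from the slide residue)
example = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]
n50_value = n50(example)                     # uses the assembly size (190 Mbp)
ng50_value = n50(example, genome_size=250)   # uses the expected genome size
print(n50_value, ng50_value)
```

Under these assumptions the example gives N50 = 25 Mbp and NG50 = 15 Mbp, the values shown for the 190 Mbp assembly. Because NG50 uses the same denominator for every assembly of a genome, it is the fairer number to compare.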
You should check that high N50 values are not simply due to lots of Ns in the scaffolds!
Assembly 'x'
Size: 859 Mbp
Number of scaffolds: 28
N50 = 70.3 Mbp
Ns = 90.6% !!!
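Checking the N content takes only a few lines. The sketch below assumes the scaffold sequences are already in memory as strings; `n_fraction` and the toy sequences are hypothetical names and data used only for illustration.

```python
def n_fraction(sequences):
    """Return the fraction of all bases that are N (unknown) across sequences."""
    total_bases = sum(len(seq) for seq in sequences)
    if total_bases == 0:
        return 0.0
    n_bases = sum(seq.upper().count("N") for seq in sequences)
    return n_bases / total_bases


# Toy scaffolds: long runs of Ns inflate scaffold length (and therefore N50)
toy_scaffolds = ["ACGT" + "N" * 90 + "ACGTAC", "NNNNNNNNNN"]
print(f"Ns = {n_fraction(toy_scaffolds):.1%}")
```

Reporting this number next to N50 makes it obvious when long scaffolds are mostly gaps.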
Basic assembly metrics
Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes
And many, many more...
Genome assembly
Back in the day...
1998
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps???
Arabidopsis thaliana
✤ 2000: published genome size = 125 Mbp
✤ 2007: genome size = 157 Mbp
✤ 2012: genome size = 135 Mbp
✤ Amount sequenced = 119 Mbp
✤ Ns = 0.2% of genome
Two views of the same gene
Top: from genome sequence view on TAIR web site
Bottom: from gene sequence file on TAIR FTP site
Drosophila melanogaster
✤ Genome published 1998
✤ Heterochromatin finished 2007
✤ Ns = 4% of genome
Caenorhabditis elegans
✤ Genome published 1998
✤ 2004: last N removed
✤ 1998–2014: genome sequence changes
✤ 558 insertions
✤ 230 deletions
✤ 614 substitutions
} Nov 2012
Saccharomyces cerevisiae
✤ Genome published 1997
✤ 12 Mbp genome
✤ 1,653 changes to genome since 1997
✤ Last changes made in 2011
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
Genome assembly: now
Genetic maps ✗
Physical maps ✗
Understanding of target genome ✗
Haploid / low heterozygosity genome ✗
Accurate & long reads ✗
Resources (time, money, people) ✗
Assembling & finishing a genome is not easy!
Assemblathons
A new idea is born
Image from flickr.com/photos/dullhunk/4422952630
If you sequence 10,000 genomes...
...you need to assemble 10,000 genomes
How many assembly tools are out there?
bambus2, Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA, MaSuRCA, GARM, Cerulean, TIGRA, ngsShoRT, PERGA, SOAPdenovo, REAPR, FRCBam, EULER-SR, SSPACE, Opera, mip, gapfiller, image, PBJelly, HGAP, FALCON, Dazzler, GGAKE, A5, CABOG, SHRAP, SR-ASM, SuccinctAssembly, SUTTA, Ragout, Tedna, Trinity, SWAP-Assembler, SILP3, AutoAssemblyD, KGBAssembler, MetAMOS, iMetAMOS, MetaVelvet-SL, KmerGenie, Nesoni, Pilon, Platanus, CGAL, GAGM, Enly, BESST, Khmer, GRIT, IDBA-MTP, dipSPAdes, WhatsHap, SHEAR, ELOPER, OMACC
Which is the best?
Comparing assemblers
✤ Can't fairly compare two assemblers if they:
✤ produced assemblies from different species
✤ assembled same species, but used sequence data from different sequencing technologies
✤ used same sequencing technologies but have different sequence libraries
✤ Even using different options for the same assembler may produce very different assemblies!
The PRICE genome assembler has 52 command-line options!!!
how many of them are you going to learn?
A genome assembly competition
An attempt to standardize some aspects of the genome assembly process
Genome assembly contests
Assemblathon 1
✤ 2010–2011
✤ Used synthetic data
✤ Small genome (~100 Mbp)
✤ We knew the answer
Here we go again
                 Type of data   Number of genomes   Size of genomes   Do we know the answer?
Assemblathon 1   Synthetic      1                   Small             ✓
Assemblathon 2   Real           3                   Large             ✗
Melopsittacus undulatus (bird)
Boa constrictor constrictor (snake)
Maylandia zebra (fish)
Why these three species?
Because they were there
Assemble this!
Species   Estimated genome size   Illumina              Roche 454           PacBio
Bird      1.2 Gbp                 285x (14 libraries)   16x (3 libraries)   10x (2 libraries)
Fish      1.0 Gbp                 192x (8 libraries)    -                   -
Snake     1.6 Gbp                 125x (4 libraries)    -                   -
Who took part?
21 teams
43 assemblies
52,013,623,777 bp of sequence
Entries
Species   Competitive entries   Evaluation entries
Bird      12                    3
Fish      10                    6
Snake     12                    0
Goals
✤ Assess 'quality' of assemblies
✤ Define quality!
✤ Produce ranking of assemblies for each species
✤ Produce ranking of assemblers across species?
Who did what?
Person/group                    Jobs
Me, Ian Korf, and Joseph Fass   Perform various analyses of all assemblies
David Schwartz et al.           Produce & evaluate optical maps
Jay Shendure et al.             Produce fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto       Perform REAPR analysis
Dent Earl & Benedict Paten      Help with meta-analysis of final rankings
91 co-authors!
flickr.com/photos/jamescridland/613445810
Results!
Lots of results!
102 different metrics!
10 key metrics
Key Metric Description
1 NG50 scaffold length
2 NG50 contig length
3 Amount of assembly in 'gene-sized' scaffolds
4 Number of 'core genes' present
5 Fosmid coverage
6 Fosmid validity
7 Short-range scaffold accuracy
8 Optical map: level 1
9 Optical map: levels 1–3
10 REAPR summary score
1) Scaffold NG50 lengths
✤ Can calculate NG50 length for each assembly
✤ But also calculate NG60, NG70 etc.
✤ Plot all results as a graph
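The idea of calculating NG50, NG60, NG70 and so on can be sketched as a single function that returns one value per threshold, ready to plot. This is an illustrative sketch only: `ngx_series`, the 10% step and the toy lengths are assumptions, not the code used for the Assemblathon 2 graphs.

```python
def ngx_series(lengths, genome_size, percents=range(10, 100, 10)):
    """Return {percent: NG(percent)} for one assembly.

    NG(x) is the scaffold length at which the running total of lengths,
    summed from longest to shortest, reaches x% of the expected genome size.
    A value of 0 means the assembly never reaches that fraction.
    """
    ordered = sorted(lengths, reverse=True)
    series = {}
    for pct in percents:
        target = genome_size * pct / 100
        running_total = 0
        series[pct] = 0
        for length in ordered:
            running_total += length
            if running_total >= target:
                series[pct] = length
                break
    return series


# Toy scaffold lengths (Mbp) and a hypothetical expected genome size of 250 Mbp
toy_lengths = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]
curve = ngx_series(toy_lengths, genome_size=250)
for pct, value in curve.items():
    print(f"NG{pct}: {value} Mbp")
```

Plotting `curve` for every assembly on the same axes shows more than a single NG50 value: two assemblies with the same NG50 can diverge sharply at NG20 or NG80.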
2) Contig vs scaffold NG50
3) Gene-sized scaffolds
✤ Some assembly folks get a little obsessed by length
✤ How long is 'long enough' for a scaffold?
✤ What if you just wanted to find genes?
✤ Average vertebrate gene = ~25 Kbp
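A gene-sized-scaffold measure along these lines could be computed as below. The 25 Kbp threshold comes from the slide. Dividing by the total assembly size is an assumption made for this sketch (the estimated genome size would be an equally reasonable denominator), and `fraction_in_long_scaffolds` is a hypothetical name.

```python
def fraction_in_long_scaffolds(lengths, min_length=25_000):
    """Return the fraction of assembled bases in scaffolds >= min_length bp."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    long_total = sum(length for length in lengths if length >= min_length)
    return long_total / total


# Toy scaffold lengths in bp
toy_scaffold_lengths = [100_000, 30_000, 20_000, 5_000]
share = fraction_in_long_scaffolds(toy_scaffold_lengths)
print(f"{share:.1%} of the assembly is in gene-sized scaffolds")
```

An assembly can score well here with a modest N50, which is the point of the slide: if the goal is finding genes, scaffolds only need to be long enough to contain one.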
4) Core genes
✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)
✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)
✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens
✤ How many full-length CEGs are in each assembly?
4) Core genes
Core genes (out of 458)
Species   Best individual assembly   Across all assemblies
Bird      420                        442
Fish      436                        455
Snake     438                        454
ABYSS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
BCM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CRACS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CURT  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
GAM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
MERAC MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
PHUS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
RAY   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SGA   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SYMB  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
SOAP  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      ************************************************       *****

ABYSS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
BCM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CRACS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CURT  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
GAM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
MERAC FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
PHUS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
RAY   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SGA   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SYMB  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SOAP  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      ******************************************************

ABYSS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
BCM   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CRACS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CURT  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
GAM   YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
MERAC ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
PHUS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
RAY   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SGA   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SYMB  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SOAP  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
         ***************************************
8 & 9) Optical maps
✤ Stretch out DNA
✤ Cut with restriction enzymes
✤ Note lengths of fragments
✤ Compare to in silico digest of scaffolds
✤ Not all scaffolds suitable for analysis
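The in silico side of the comparison can be sketched as follows: find every occurrence of a recognition site in a scaffold and report the fragment lengths, which can then be compared with the fragment lengths measured on the optical map. The slides do not say which enzyme was used, so the BamHI site (GGATCC, cut after the first G) is only an example, and `in_silico_digest` is a hypothetical helper.

```python
def in_silico_digest(sequence, site, cut_offset=0):
    """Return fragment lengths after cutting `sequence` at each `site`.

    cut_offset is the cut position relative to the start of the site.
    Only the given strand is searched, which is sufficient for palindromic
    recognition sites such as GGATCC.
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Toy scaffold containing two example sites
toy_scaffold = "AAAGGATCCTTTTGGATCCA"
fragments = in_silico_digest(toy_scaffold, "GGATCC", cut_offset=1)
print(fragments)
```

A real comparison also has to tolerate sizing error and missed or extra cuts in the optical data, which this sketch does not attempt. Short scaffolds also yield too few fragments to place on a map.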
8 & 9) Optical maps
Image from University of Wisconsin-Madison
What does this all mean?
102 metrics per assembly
10 key metrics
1 final ranking
Assembly   Number of core genes   Rank   Z-score
CRACS      438                    1      +0.68
SYMB       436                    2      +0.59
PHUS       435                    3      +0.54
BCM        434                    4      +0.49
SGA        433                    5      +0.44
MERAC      430                    6      +0.30
ABYSS      429                    7      +0.25
SOAP       428                    8      +0.21
RAY        422                    9      -0.08
GAM        415                    10     -0.41
CURT       360                    11     -3.02
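The Z-scores in this table are consistent with the usual definition: each value minus the mean, divided by the population standard deviation of all 11 values. That is inferred from the numbers on the slide, not stated in the talk, so treat the sketch below as one plausible reconstruction.

```python
from statistics import mean, pstdev

# Number of core genes per assembly, from the table above
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}

values = list(core_genes.values())
average = mean(values)
spread = pstdev(values)  # population standard deviation (an assumption)

z_scores = {name: (count - average) / spread for name, count in core_genes.items()}
for name, z in z_scores.items():
    print(f"{name:6s} {z:+.2f}")
```

Unlike a rank, a Z-score keeps the size of the gaps: the top ten assemblies sit within about one standard deviation of each other, while CURT is three standard deviations below the mean.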
What does this all mean?
No really, what does this all mean?
Some conclusions
✤ Very hard to find assemblers that performed well across all 10 key metrics
✤ Assemblers that perform well in one species do not always perform as well in another
✤ Bird & snake assemblies appear better than fish
✤ No real 'winner' for bird and fish
SGA: best assembler for snake?
Description                                     Rank of snake SGA assembly
NG50 scaffold length                            2
NG50 contig length                              5
Amount of assembly in 'gene-sized' scaffolds    7
Number of 'core genes' present                  5
Fosmid coverage                                 2
Fosmid validity                                 2
Short-range scaffold accuracy                   3
Optical map: level 1                            2
Optical map: levels 1–3                         1
REAPR summary score                             2
Best assembler across species?
Assembler    Number of 1st places (out of 27)
BCM          5
Meraculous   4
Symbiose     4
Ray          3
Excluding evaluation entries
Ray performance
Species Final ranking
Bird 7th
Fish 7th
Snake 9th
BCM bird assemblies
Assembler           Final rank   NGS data used in assembly   Coverage Z-score   Validity Z-score   NG50 contig Z-score
BCM - evaluation    1            Illumina + 454              +2.0               +1.4               +1.5
BCM - competitive   2            Illumina + 454 + PacBio     -0.3               -0.8               +2.7
BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN
BCM competition scaffold (with PacBio sequence)
CGTCGNNATCNNGGTTACG
Mismatches from PacBio sequence penalized alignment score more than matching unknown bases
The choice of one command-line option, used by one tool in the calculation of one key metric...
...probably made enough difference to drop the PacBio-containing assembly to 2nd place.
Other conclusions
✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality
Inter-specific differences matter
✤ The three species have genomes with different properties
  ✤ repeats
  ✤ heterozygosity
✤ The three genomes had very different NGS data sets
  ✤ Only bird had PacBio & 454 data
  ✤ Different insert sizes in short-insert libraries
The Big Conclusion
"You can't always get what you want"
Sir Michael Jagger, 1969
What comes next?
Assemblathon 3?
A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to 'buy' resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
✤ Get someone else to write the paper!
Intermission
NGS must die!
‘NGS’ is used to refer to everything post-Sanger
Pyrosequencing was developed ~1996
NGS madness
Next generation sequencing
aka second generation sequencing
but there’s also: third generation sequencing
fourth generation sequencing
next-next generation sequencing
next-next-next generation sequencing
NGS madness
Technology          According to some papers…   According to other papers…
Complete Genomics   2nd generation              3rd generation
Ion Torrent         2nd generation              3rd generation
PacBio              2nd generation              3rd generation
Oxford Nanopore     3rd generation              4th generation
NGS madness
“PacBio is a 2.5th generation”
“Helicos lies between the transition of next-generation to third generation”
NGS madness
There are different sequencing methodologies, and there are different sequencing platforms.
Use one or the other.
Or just say ‘current sequencing technologies’.
Intermission
I looked at the shortest 10 sequences in 34 different genome assemblies…
From a vertebrate genome assembly with 72,214 sequences…
Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp
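Looking at the short end of an assembly is easy to do yourself. The sketch below reads FASTA records from any file-like object and returns the shortest lengths. The function names and the toy data are assumptions for illustration, and a real assembly would be opened with `open("assembly.fa")` (a hypothetical file name) in place of the in-memory example.

```python
import heapq
import io


def sequence_lengths(handle):
    """Yield the length of each record in a FASTA stream."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def shortest_lengths(handle, count=10):
    """Return the `count` smallest sequence lengths, shortest first."""
    return heapq.nsmallest(count, sequence_lengths(handle))


# Toy FASTA data standing in for an assembly file
toy_fasta = io.StringIO(">s1\nACGTACGT\nACGT\n>s2\nACG\n>s3\nACGTA\n")
shortest = shortest_lengths(toy_fasta, count=2)
print(shortest)
```

A 3 bp "scaffold" cannot carry any useful information, so a check like this is a cheap way to decide on a minimum length filter before computing other metrics.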
Data from Lex Nederbragt’s blog, June 2014
Long-read technology
Moleculo read data from Illumina BaseSpace, July 2013
Long-read technology
From https://flxlexblog.wordpress.com (Lex Nederbragt's blog)
PacBio data
Long-read technology
MinION from Oxford Nanopore
Where is the data?
Nick Loman published the first real-world data on June 10th
Single chromosome assembly?
Tackling heterozygosity
1000 Genomes Project plans to sequence 15 'trios' at high depth
Hi-C
✤ Nature Biotechnology, 31, 2013
✤ Burton et al.
✤ Selvaraj et al.
✤ Kaplan & Dekker
The future of genome assembly
Kwik-E-Assembler
acgtaacacaancac gggaacnnnacatta acnactagcataata nnnnnnnnnnaacac actttaaattatatc
The future of genome assembly
✤ At some point we will look back with embarrassment at this era.
✤ Assembly must, and will, get better, but...
✤ ...'perfect' genomes may remain elusive.
✤ Data management will remain an issue:
✤ the human genome -> human genomes -> tissue-specific genomes
Summary
✤ There is no real consensus on how to make a good genome assembly
✤ Try different assemblers, try different command-line options
✤ Decide what it is you want to get out of a genome assembly
✤ Look at your input and output data
✤ Wait 5 years and come back, we’ll (probably) have solved everything!
Resources
✤ Lex Nederbragt’s blog - https://flxlexblog.wordpress.com
✤ Nick Loman’s blog - http://pathogenomics.bham.ac.uk/blog/
✤ Assemblathon twitter feed - https://twitter.com/assemblathon