genome assembly: then and now — v1.1

Post on 10-May-2015

Category:

Education

DESCRIPTION


This was a talk given on 2014-06-19 for the Genome Center’s Bioinformatics Core as part of a one-week workshop on using Galaxy. It concerns the Assemblathon projects as well as other aspects of genome assembly. A version of this talk with embedded notes is also available on SlideShare. Note that this is an evolving talk: older and newer versions are also available on SlideShare.

TRANSCRIPT

Genome assembly: then and now

Keith Bradnam

Image from Wellcome Trust

Image from flickr.com/photos/dougitdesign/5613967601/

Contents

✤ Sequencing 101
✤ Genome assembly: then
✤ Genome assembly: now
✤ Assemblathon 1 & 2
✤ Advice & Angst
✤ The future

More info

✤ http://assemblathon.org
✤ http://gigasciencejournal.com
✤ http://twitter.com/assemblathon

Sequencing 101

A, C, G, T...

Image from nlm.nih.gov

[Figure sequence: a read, a read pair, a mate pair, contigs, and a scaffold in which contigs are separated by runs of Ns]

Assembly size

[Figure sequence: an example assembly drawn as scaffolds of decreasing length (70, 25, 20, 15, 10, 5 Mbp and so on), some containing runs of Ns, with a total assembly size of 200 Mbp.]

N50 length

[Figure sequence: the scaffold lengths are added up from longest to shortest (running totals of 70, 95, and 115 Mbp appear on the slides) until half of the assembly size is reached. A second example assembly totals 190 Mbp.]

N50 for two assemblies

208 Mbp assembly: N50 = 15 Mbp
190 Mbp assembly: N50 = 25 Mbp

NG50 for two assemblies

Expected genome size = 250 Mbp

208 Mbp assembly: NG50 = 15 Mbp
190 Mbp assembly: NG50 = 15 Mbp
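The difference between N50 and NG50 is easy to see in a few lines of code. The sketch below is only an illustration: the function name `nx_length` and the list of scaffold lengths are assumptions made for this example (the lengths were chosen to total 190 Mbp), not something taken from the Assemblathon scripts.

```python
def nx_length(lengths, fraction=0.5, total=None):
    """Return the length L such that sequences of length >= L together
    cover at least `fraction` of `total`.

    With total=None the assembly's own size is used, which gives N50 when
    fraction is 0.5. Passing an expected genome size as `total` gives NG50.
    Returns None if the assembly is too small to reach the target.
    """
    if total is None:
        total = sum(lengths)
    target = fraction * total
    running = 0
    # Walk through the sequences from longest to shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


# Hypothetical scaffold lengths in Mbp (they sum to 190).
scaffolds = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]

print(nx_length(scaffolds))             # N50, relative to the 190 Mbp assembly
print(nx_length(scaffolds, total=250))  # NG50, relative to an expected 250 Mbp genome
```

With these illustrative lengths the N50 comes out at 25 and the NG50 at 15, which shows how the same assembly looks less impressive once it is measured against the expected genome size rather than its own size.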

You should check that high N50 values are not simply due to lots of Ns in the scaffolds!

Assembly 'x'

Size: 859 Mbp
Number of scaffolds: 28
N50 = 70.3 Mbp
Ns = 90.6%!
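Checking the proportion of Ns takes very little code. This is a minimal sketch: `n_fraction` and the toy scaffolds are invented for illustration, and a real assembly would be read from a FASTA file rather than typed in as strings.

```python
def n_fraction(sequences):
    """Fraction of all bases that are N (upper or lower case)."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    unknown = sum(seq.upper().count("N") for seq in sequences)
    return unknown / total


# Toy scaffolds, invented for this example: 24 of the 36 bases are Ns.
toy_scaffolds = [
    "ACGT" + "N" * 16 + "ACGT",
    "NNNN" + "ACGT" + "NNNN",
]

print(f"Ns = {n_fraction(toy_scaffolds):.1%}")
```

Reporting this number next to N50 makes it obvious when long scaffolds are mostly gaps.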

Basic assembly metrics

Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes

And many, many more...

Genome assembly

Back in the day...

1998

Genome assembly: then

Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓

So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps?

Arabidopsis thaliana

✤ 2000: published genome size = 125 Mbp
✤ 2007: genome size = 157 Mbp
✤ 2012: genome size = 135 Mbp
✤ Amount sequenced = 119 Mbp
✤ Ns = 0.2% of genome

Two views of the same gene

Top: from genome sequence view on TAIR web site
Bottom: from gene sequence file on TAIR FTP site

Drosophila melanogaster

✤ Genome published 1998
✤ Heterochromatin finished 2007
✤ Ns = 4% of genome

Caenorhabditis elegans

✤ Genome published 1998
✤ 2004: last N removed
✤ 1998–2014: genome sequence changes
  ✤ 558 insertions
  ✤ 230 deletions
  ✤ 614 substitutions
  (these three counts are bracketed together on the slide and labelled Nov 2012)

Saccharomyces cerevisiae

✤ Genome published 1997
✤ 12 Mbp genome
✤ 1,653 changes to genome since 1997
✤ Last changes made in 2011

Genome assembly: then

Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓

Genome assembly: now

Genetic maps ✗
Physical maps ✗
Understanding of target genome ✗
Haploid / low heterozygosity genome ✗
Accurate & long reads ✗
Resources (time, money, people) ✗

Assembling & finishing a genome is not easy!

Assemblathons

A new idea is born

Image from flickr.com/photos/dullhunk/4422952630

If you sequence 10,000 genomes...
...you need to assemble 10,000 genomes

How many assembly tools are out there?

bambus2, Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA, MaSuRCA, GARM, Cerulean, TIGRA, ngsShoRT, PERGA, SOAPdenovo, REAPR, FRCBam, EULER-SR, SSPACE, Opera, mip, gapfiller, image, PBJelly, HGAP, FALCON, Dazzler, GGAKE, A5, CABOG, SHRAP, SR-ASM, SuccinctAssembly, SUTTA, Ragout, Tedna, Trinity, SWAP-Assembler, SILP3, AutoAssemblyD, KGBAssembler, MetAMOS, iMetAMOS, MetaVelvet-SL, KmerGenie, Nesoni, Pilon, Platanus, CGAL, GAGM, Enly, BESST, Khmer, GRIT, IDBA-MTP, dipSPAdes, WhatsHap, SHEAR, ELOPER, OMACC


Which is the best?

Comparing assemblers

✤ Can't fairly compare two assemblers if they:
  ✤ produced assemblies from different species
  ✤ assembled the same species, but used sequence data from different sequencing technologies
  ✤ used the same sequencing technologies but have different sequence libraries
✤ Even using different options for the same assembler may produce very different assemblies!

The PRICE genome assembler has 52 command-line options!

How many of them are you going to learn?

A genome assembly competition

An attempt to standardize some aspects of the genome assembly process

Genome assembly contests

Assemblathon 1

✤ 2010–2011
✤ Used synthetic data
✤ Small genome (~100 Mbp)
✤ We knew the answer!

Here we go again

                 Type of data   Number of genomes   Size of genomes   Do we know the answer?
Assemblathon 1   Synthetic      1                   Small             ✓
Assemblathon 2   Real           3                   Large             ✗

Bird: Melopsittacus undulatus
Snake: Boa constrictor constrictor
Fish: Maylandia zebra

Why these three species?

Because they were there

Assemble this!

Species   Estimated genome size   Illumina              Roche 454           PacBio
Bird      1.2 Gbp                 285x (14 libraries)   16x (3 libraries)   10x (2 libraries)
Fish      1.0 Gbp                 192x (8 libraries)    -                   -
Snake     1.6 Gbp                 125x (4 libraries)    -                   -
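To get a feel for how much data these coverage figures represent, multiply coverage by genome size. The sketch below assumes that the quoted fold-coverage is relative to the estimated genome size; the slides do not say how coverage was calculated, so treat the output as a rough guide only. The function name is made up for this example.

```python
# Estimated genome size (Gbp) and Illumina fold-coverage for each species.
datasets = {
    "Bird": (1.2, 285),
    "Fish": (1.0, 192),
    "Snake": (1.6, 125),
}


def approx_bases_gbp(genome_size_gbp, coverage):
    """Approximate amount of sequence (Gbp) implied by a fold-coverage value."""
    return genome_size_gbp * coverage


for species, (size_gbp, coverage) in datasets.items():
    print(f"{species}: ~{approx_bases_gbp(size_gbp, coverage):.0f} Gbp of Illumina reads")
```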

Who took part?

21 teams
43 assemblies
52,013,623,777 bp of sequence

Entries

Species   Competitive entries   Evaluation entries
Bird      12                    3
Fish      10                    6
Snake     12                    0

Goals

✤ Assess 'quality' of assemblies
✤ Define quality!
✤ Produce ranking of assemblies for each species
✤ Produce ranking of assemblers across species?

Who did what?

Person/group                    Jobs
Me, Ian Korf, and Joseph Fass   Perform various analyses of all assemblies
David Schwartz et al.           Produce & evaluate optical maps
Jay Shendure et al.             Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto       Perform REAPR analysis
Dent Earl & Benedict Paten      Help with meta-analysis of final rankings

91 co-authors!

flickr.com/photos/jamescridland/613445810

Results!

Lots of results!

102 different metrics!

10 key metrics

Key metric   Description
1            NG50 scaffold length
2            NG50 contig length
3            Amount of assembly in 'gene-sized' scaffolds
4            Number of 'core genes' present
5            Fosmid coverage
6            Fosmid validity
7            Short-range scaffold accuracy
8            Optical map: level 1
9            Optical map: levels 1–3
10           REAPR summary score

1) Scaffold NG50 lengths

✤ Can calculate NG50 length for each assembly
✤ But also calculate NG60, NG70 etc.
✤ Plot all results as a graph
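The values behind such a graph can be produced in a single pass over the sorted scaffold lengths. This is a sketch rather than the code used for the paper: `ng_series`, the example lengths, and the 250 Mbp genome size are all assumptions for illustration, and the plotting step is left out.

```python
def ng_series(lengths, genome_size, percents=range(1, 100)):
    """Return {x: NGx length} for each x in `percents`.

    NGx is the length of the scaffold at which the running total of
    scaffold lengths (longest first) first reaches x% of `genome_size`.
    Thresholds the assembly never reaches are reported as 0, so the
    curve drops to zero where the assembly runs out of sequence.
    """
    ordered = sorted(lengths, reverse=True)
    series = {}
    running = 0
    index = 0
    for x in sorted(percents):
        target = genome_size * x / 100
        # Keep adding scaffolds until this threshold is met (or none are left).
        while running < target and index < len(ordered):
            running += ordered[index]
            index += 1
        series[x] = ordered[index - 1] if running >= target else 0
    return series


# Hypothetical scaffold lengths (Mbp) and expected genome size (Mbp).
scaffolds = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]
curve = ng_series(scaffolds, genome_size=250)

for x in (10, 30, 50, 70, 90):
    print(f"NG{x} = {curve[x]}")
```

Plotting `curve` with x on the horizontal axis gives one line per assembly, which is more informative than comparing single NG50 values.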

2) Contig vs scaffold NG50


3) Gene-sized scaffolds

✤ Some assembly folks get a little obsessed by length!
✤ How long is 'long enough' for a scaffold?
✤ What if you just wanted to find genes?
✤ Average vertebrate gene = ~25 Kbp
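One way to turn this idea into a number is to add up the scaffolds that are at least 25 Kbp long. The sketch below is an assumption about how such a metric could be computed; the slides do not say whether the total was normalised by assembly size or by estimated genome size, so the function returns both the raw total and the fraction of the assembly. The function name and toy lengths are invented.

```python
def gene_sized_summary(lengths_bp, min_length_bp=25_000):
    """Return (bases in scaffolds >= min_length_bp, fraction of the assembly)."""
    total = sum(lengths_bp)
    long_enough = sum(n for n in lengths_bp if n >= min_length_bp)
    fraction = long_enough / total if total else 0.0
    return long_enough, fraction


# Toy scaffold lengths in bp, invented for this example.
toy_lengths = [120_000, 40_000, 25_000, 10_000, 5_000]

bases, fraction = gene_sized_summary(toy_lengths)
print(f"{bases} bp ({fraction:.1%}) of the assembly is in 'gene-sized' scaffolds")
```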

4) Core genes

✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)
✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)
✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens
✤ How many full-length CEGs are in each assembly?

Core genes (out of 458)

Species   Best individual assembly   Across all assemblies
Bird      420                        442
Fish      436                        455
Snake     438                        454

ABYSS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
BCM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CRACS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CURT  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
GAM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
MERAC MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
PHUS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
RAY   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SGA   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SYMB  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
SOAP  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      ************************************************       *****

ABYSS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
BCM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CRACS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CURT  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
GAM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
MERAC FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
PHUS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
RAY   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SGA   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SYMB  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SOAP  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      ******************************************************

ABYSS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
BCM   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CRACS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CURT  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
GAM   YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
MERAC ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
PHUS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
RAY   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SGA   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SYMB  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SOAP  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
         ***************************************

8 & 9) Optical maps

✤ Stretch out DNA
✤ Cut with restriction enzymes
✤ Note lengths of fragments
✤ Compare to in silico digest of scaffolds
✤ Not all scaffolds suitable for analysis

Image from University of Wisconsin-Madison
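The 'in silico digest' step can be sketched as follows. The recognition site and cut position used here (a G^AATTC-style cut) are placeholders: the slides do not say which enzyme was used for the optical maps, and the function name and toy scaffold are invented. Matching the resulting fragment lengths against the optical map is a separate, harder step that is not shown.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return the fragment lengths produced by cutting at every `site`.

    `cut_offset` is how many bases into the recognition site the cut falls
    (1 for a G^AATTC-style cut). Sites interrupted by Ns are not found,
    which is one reason gap-rich scaffolds are hard to compare to a map.
    """
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment lengths are the distances between consecutive cut positions.
    boundaries = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Toy scaffold with two recognition sites, invented for this example.
toy_scaffold = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "TT"

print(digest(toy_scaffold))
```

The ordered list of fragment lengths is what gets compared with the fragment lengths measured on the stretched-out DNA.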

What does this all mean?

102 metrics per assembly
10 key metrics
1 final ranking

Assembly   Number of core genes   Rank   Z-score
CRACS      438                    1      +0.68
SYMB       436                    2      +0.59
PHUS       435                    3      +0.54
BCM        434                    4      +0.49
SGA        433                    5      +0.44
MERAC      430                    6      +0.30
ABYSS      429                    7      +0.25
SOAP       428                    8      +0.21
RAY        422                    9      –0.08
GAM        415                    10     –0.41
CURT       360                    11     –3.02
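Z-scores like those in the core-gene table can be recomputed from the counts alone. The sketch below uses the population standard deviation from Python's `statistics` module; that choice is an assumption, made because it gives values that agree with the table to two decimal places, and it is not a statement about how the Assemblathon analysis was implemented.

```python
from statistics import mean, pstdev

# Number of core genes found in each assembly, from the table above.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Return (value - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]


scores = z_scores(list(core_genes.values()))
for name, z in zip(core_genes, scores):
    print(f"{name:6s} {z:+.2f}")
```

Note how a single poor assembly (360 core genes) pulls the mean down, so that most of the other assemblies end up with small positive Z-scores.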

What does this all mean?

No really, what does this all mean?

Some conclusions

✤ Very hard to find assemblers that performed well across all 10 key metrics
✤ Assemblers that perform well in one species do not always perform as well in another
✤ Bird & snake assemblies appear better than fish
✤ No real 'winner' for bird and fish

SGA — best assembler for snake?

Description                                    Rank of snake SGA assembly
NG50 scaffold length                           2
NG50 contig length                             5
Amount of assembly in 'gene-sized' scaffolds   7
Number of 'core genes' present                 5
Fosmid coverage                                2
Fosmid validity                                2
Short-range scaffold accuracy                  3
Optical map: level 1                           2
Optical map: levels 1–3                        1
REAPR summary score                            2

Best assembler across species?

Assembler    Number of 1st places (out of 27)
BCM          5
Meraculous   4
Symbiose     4
Ray          3

Excluding evaluation entries

Ray performance

Species   Final ranking
Bird      7th
Fish      7th
Snake     9th

BCM bird assemblies

Assembler           Final rank   NGS data used in assembly   Coverage Z-score   Validity Z-score   NG50 contig Z-score
BCM - evaluation    1            Illumina + 454              +2.0               +1.4               +1.5
BCM - competitive   2            Illumina + 454 + PacBio     –0.3               –0.8               +2.7

BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN

BCM competition scaffold (gap filled using PacBio sequence)
CGTCGNNATCNNGGTTACG

Mismatches from PacBio sequence penalized alignment score more than matching unknown bases

The choice of one command-line option, used by one tool in the calculation of one key metric...

...probably made enough difference to drop the PacBio-containing assembly to 2nd place.
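A toy scoring scheme shows how this can happen. Everything in the sketch below is hypothetical: the 'reference' sequence is invented, the scoring values are arbitrary, and this is not the tool or the option that was actually used in the Assemblathon 2 analysis. It only illustrates that a gap full of Ns can outscore a gap filled with slightly wrong bases once mismatches are penalized heavily enough.

```python
def alignment_score(scaffold, reference, match=1, mismatch=-2, unknown=0):
    """Score an ungapped, position-by-position comparison of two sequences.

    Ns in the scaffold score `unknown`, identical bases score `match`,
    and differing bases score `mismatch`.
    """
    if len(scaffold) != len(reference):
        raise ValueError("sequences must be the same length")
    total = 0
    for base, ref_base in zip(scaffold.upper(), reference.upper()):
        if base == "N":
            total += unknown
        elif base == ref_base:
            total += match
        else:
            total += mismatch
    return total


reference = "CGACGTTATCAAGGATACG"    # invented 'true' sequence
evaluation = "N" * 19                # gap left as unknown bases
competition = "CGTCGNNATCNNGGTTACG"  # gap partly filled, with two errors

for penalty in (-2, -8):
    filled = alignment_score(competition, reference, mismatch=penalty)
    unfilled = alignment_score(evaluation, reference, mismatch=penalty)
    print(f"mismatch={penalty}: filled gap scores {filled}, all-N gap scores {unfilled}")
```

With the mild penalty the filled gap scores higher than the run of Ns; with the harsh penalty the order flips, even though the filled scaffold contains more real sequence.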

Other conclusions

✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality

Inter-specific differences matter

✤ The three species have genomes with different properties
  ✤ repeats
  ✤ heterozygosity
✤ The three genomes had very different NGS data sets
  ✤ Only bird had PacBio & 454 data
  ✤ Different insert sizes in short-insert libraries

The Big Conclusion

"You can't always get what you want"
Sir Michael Jagger, 1969

What comes next?

3?

A wish list for Assemblathon 3

✤ Only have 1 species
✤ Teams have to 'buy' resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
✤ Get someone else to write the paper!

Intermission

NGS must die!

‘NGS’ is used to refer to everything post-Sanger

Pyrosequencing was developed ~1996

NGS madness

Next generation sequencing
aka second generation sequencing
but there’s also:
  third generation sequencing
  fourth generation sequencing
  next-next generation sequencing
  next-next-next generation sequencing

Technology          According to some papers…   According to other papers…
Complete Genomics   2nd generation              3rd generation
Ion Torrent         2nd generation              3rd generation
PacBio              2nd generation              3rd generation
Oxford Nanopore     3rd generation              4th generation

“PacBio is a 2.5th generation”

“Helicos lies between the transition of next-generation to third generation”

There are different sequencing methodologies, and there are different sequencing platforms.

Use one or the other.

Or just say ‘current sequencing technologies’.

Intermission

My #1 piece of advice

flickr.com/julia_manzerova

flickr.com/thomashawk

Look at your data!

I looked at the shortest 10 sequences in 34 different genome assemblies…

From a vertebrate genome assembly with 72,214 sequences…

Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp!
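Looking at the shortest sequences in an assembly is a quick check to run yourself. This is a minimal sketch using only the Python standard library; the function names and the toy FASTA records are invented, and the file name in the comment is a placeholder.

```python
import heapq


def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            name, length = line[1:].strip(), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def shortest_sequences(lines, n=10):
    """Return the n shortest (name, length) records, shortest first."""
    return heapq.nsmallest(n, fasta_lengths(lines), key=lambda record: record[1])


# Toy FASTA content, invented for this example. For a real assembly, pass an
# open file handle instead, e.g. open("assembly.fa") (hypothetical file name).
toy_fasta = [">a", "ACGT", "ACGT", ">b", "ACG", ">c", "ACGTACGTACGT"]

for name, length in shortest_sequences(toy_fasta, n=2):
    print(name, length)
```

If the shortest sequences turn out to be a handful of bases long, it is worth asking why they are in the assembly at all.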

Reasons to be cheerful

flickr.com/danielygo

Data from Lex Nederbragt’s blog, June 2014

Long-read technology

Moleculo read data from Illumina BaseSpace, July 2013

PacBio data, from https://flxlexblog.wordpress.com (Lex Nederbragt's blog)

MinION from Oxford Nanopore

Where is the data?

Nick Loman published the first real-world data on June 10th

Single chromosome assembly?

Tackling heterozygosity

1000 Genomes project plans to sequence 15 'trios' at high depth

Hi-C

✤ Nature Biotechnology, 31, 2013
  ✤ Burton et al.
  ✤ Selvaraj et al.
  ✤ Kaplan & Dekker

The future of genome assembly

Kwik-E-Assembler

acgtaacacaancac gggaacnnnacatta acnactagcataata nnnnnnnnnnaacac actttaaattatatc

✤ At some point we will look back with embarrassment at this era.
✤ Assembly must, and will, get better, but...
✤ ...'perfect' genomes may remain elusive.
✤ Data management will remain an issue:
  ✤ the human genome -> human genomes -> tissue-specific genomes

Summary

✤ There is no real consensus on how to make a good genome assembly
✤ Try different assemblers, try different command-line options
✤ Decide what it is you want to get out of a genome assembly
✤ Look at your input and output data
✤ Wait 5 years and come back, we’ll (probably) have solved everything!

Resources

✤ Lex Nederbragt’s blog - https://flxlexblog.wordpress.com
✤ Nick Loman’s blog - http://pathogenomics.bham.ac.uk/blog/
✤ Assemblathon twitter feed - https://twitter.com/assemblathon
