next-generation-sequencing-must-die surya saha · next generation sequencing 3/31/2015 bti plant...

Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY [email protected] // Twitter: @SahaSurya BTI Plant Bioinformatics Course 2015 http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die


Page 1: next-generation-sequencing-must-die Surya Saha · Next generation sequencing 3/31/2015 BTI Plant Bioinformatics Course 2015 30 Run Time Read Length Quality Total nucleotides sequenced

Surya Saha

Sol Genomics Network (SGN)

Boyce Thompson Institute, Ithaca, NY

[email protected] // Twitter: @SahaSurya

BTI Plant Bioinformatics Course 2015

http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die

Page 2:

Sequencing over the Ages

1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005-2014 (axis marks at 2005, 2007, 2011, 2012, 2013, 2014): 454, Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina HiSeq X, Pinus taeda (24 Gb), Nanopore MinION

Slide credit: Aureliano Bombarely

Page 3:

First generation sequencing

Sanger. Annu Rev Biochem. 1988;57:1-28.

Thanks to Nick Loman for the mention

Page 4:

Maxam-Gilbert method

Page 5:

Maxam-Gilbert method

http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg

https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing

Page 6:

Sanger method

Frederick Sanger, 13 Aug 1918 – 19 Nov 2013

Won the Nobel Prize for Chemistry in 1958 and 1980. Published the dideoxy chain termination method or “Sanger method” in 1977.

http://dailym.ai/1f1XeTB

Page 7:

Sanger method

http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg

http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg

Page 8:

First generation sequencing

• Very high quality sequences (99.999% or Q50)
• Very low throughput

| Platform | Run Time | Read Length | Reads / Run | Total nucleotides sequenced | Cost / MB |
| Capillary Sequencing (ABI3730xl) | 20m-3h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400 |

http://www.hindawi.com/journals/bmri/2012/251364/tab1/

Page 9:

Next generation sequencing

Page 10:

https://twitter.com/kbradnam/status/443153578429923328

• Second generation
• Third generation
• Fourth generation
• Next-next-generation
• Next-next-next generation

http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

Page 11:

Use the name of the specific technology that generated the data

– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS1/RSII
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore

http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

Page 12:

454 Pyrosequencing

One purified DNA fragment, to one bead, to one read.

http://www.genengnews.com/

GS FLX Titanium

https://mariamuir.com/wp-content/uploads/2013/04/rip.gif

Page 13:

Illumina

Values are listed per instrument, left to right as on the slide.

Output: 0.3-15 Gb | 20-120 Gb | 10-1500 Gb | 900-1800 Gb
Number of Reads / Flow cell: 25 Million | 130-400 Million | 300 Million - 2.5 Billion | 3 Billion
Read Length: 2x300 bp | 2x150 bp | 2x250 - 2x125 bp | 2x150 bp
Cost: $99K | $250K | $740K | $10M (10 units)

Instrument labels from the slide images: 2500, 3000, 4000; 500

Source: Illumina

Page 14:

Illumina

$1000 human genome??

Page 15:

Illumina

Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402

Page 16:

Illumina

Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402

Page 18:

Pacific Biosciences SMRT sequencing

Single Molecule Real Time sequencing

http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif

Page 19:

Pacific Biosciences SMRT sequencing

Error correction methods

Hierarchical genome-assembly process (HGAP)

PBJelly

English et al., PLOS One. 2012

Page 20:

Pacific Biosciences SMRT sequencing

Error correction methods

PBcR Pipeline

Page 21:

Pacific Biosciences SMRT sequencing

Read Lengths

Mean Read Length: 8391 bp
Maximum Subread Length: 24585 bp

http://www.igs.umaryland.edu/labs/grc/

Page 22:

Pacific Biosciences SMRT sequencing

Read Lengths

Page 23:

Oxford Nanopore

https://www.nanoporetech.com/

http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion

http://halegrafx.com/vector-art/free-vector-despicable-me-minions/

Page 24:

Page 25:

Sequencing Trends

https://www.google.com/trends/

Page 26:

[Chart: Number of Publications, 2008-2014 (Illumina, Pacific Biosciences, Roche 454, Ion Torrent)]

[Chart: Increase in Number of Publications, 2009-2014 (Illumina, Pacific Biosciences, Roche 454, Ion Torrent)]

[Chart: % Increase in Number of Publications, 2009-2014 (Pacific Biosciences, Roche 454, Ion Torrent)]

Page 27:

Hi-C Crosslinking

Page 28:

Others

• Ion Torrent Proton/PGM
• SOLiD
• Helicos
• Supporting technologies
– BioNano
– Nabsys
– OpGen
– 10X Genomics
– Fluidigm

Page 29:

Comparison

Page 30:

Next generation sequencing

| Platform | Run Time | Read Length | Quality | Total nucleotides sequenced | Cost / MB |
| 454 Pyrosequencing | 24h | 700 bp | Q20-Q30 | 1 GB | $10 |
| Illumina MiSeq | 27h | 2x300 bp | >Q30 | 15 GB | $0.15 |
| Illumina HiSeq 2500 | 1 - 10 days | 2x250 bp | >Q30 | 3000 GB | $0.05 |
| Ion Torrent | 2h | 400 bp | >Q20 | 50 MB - 1 GB | $1 |
| Pacific Biosciences | 30m - 4h | 10 kb - >40 kb | >Q50 consensus, >Q10 single | 500 - 1000 MB / SMRT cell | $0.13 - $0.60 |

http://www.hindawi.com/journals/bmri/2012/251364/

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227
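One rough way to read the Cost / MB column is to scale it to a whole genome. The sketch below assumes a 3.0 Gb genome (the Homo sapiens size quoted earlier in the deck) and a 30x coverage target; the coverage value and the function name `sequencing_cost` are illustrative assumptions, and the result reflects only the per-megabase figure from the table, not instrument, labor, or analysis costs.

```python
# Per-megabase costs in USD, copied from the comparison table.
COST_PER_MB = {
    "454 Pyrosequencing": 10.0,
    "Illumina MiSeq": 0.15,
    "Illumina HiSeq 2500": 0.05,
    "Ion Torrent": 1.0,
}


def sequencing_cost(genome_mb: float, coverage: float, cost_per_mb: float) -> float:
    """Cost of sequencing a genome of genome_mb megabases at a given fold coverage."""
    return genome_mb * coverage * cost_per_mb


GENOME_MB = 3000  # 3.0 Gb genome expressed in megabases
COVERAGE = 30     # assumed coverage target, not taken from the slides

costs = {
    platform: sequencing_cost(GENOME_MB, COVERAGE, per_mb)
    for platform, per_mb in COST_PER_MB.items()
}

for platform, cost in costs.items():
    print(f"{platform}: ${cost:,.0f}")
```

Even at the lowest listed rate, this arithmetic lands well above $1000 per genome, which is one way to see why the "$1000 human genome" depends on the newest high-output instruments.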

Page 31:

Next Generation Genomics: World Map of High-throughput Sequencers

http://omicsmaps.com/

Page 32:

https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/

Page 33:

https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/

Page 34:

Real cost of Sequencing!!

Sboner, Genome Biology, 2011

Page 35:

Library Types

Single end

Paired end (PE, 150-800 bp, Fwd:/1, Rev:/2)

Mate pair (MP, 2 Kb to 20 Kb)

[Figure: forward (F) and reverse (R) read orientation for each library type, with the mate-pair layouts labelled 454/Roche and Illumina]

Slide credit: Aureliano Bombarely
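The Fwd:/1, Rev:/2 convention means the two reads of a pair share an ID and differ only in the suffix. Here is a minimal sketch of matching mates by ID, assuming every ID ends in `/1` or `/2` exactly as on the slide. The read names and the function name `pair_reads` are made up for illustration, and newer Illumina headers mark the mate differently, which this sketch does not handle.

```python
from typing import Dict, Iterable, List, Tuple


def pair_reads(read_ids: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Match /1 and /2 reads by shared ID; return (pairs, orphan base IDs)."""
    forward: Dict[str, str] = {}
    reverse: Dict[str, str] = {}
    for read_id in read_ids:
        if read_id.endswith("/1"):
            forward[read_id[:-2]] = read_id
        elif read_id.endswith("/2"):
            reverse[read_id[:-2]] = read_id
    # A pair exists only when both mates are present.
    pairs = [(forward[base], reverse[base]) for base in forward if base in reverse]
    # Orphans are base IDs seen with only one of the two suffixes.
    orphans = sorted(set(forward) ^ set(reverse))
    return pairs, orphans


# Hypothetical read IDs: r1 is complete, r2 and r3 have lost a mate.
pairs, orphans = pair_reads(["r1/1", "r1/2", "r2/1", "r3/2"])
print(pairs)
print(orphans)
```

Orphaned reads commonly appear after trimming or filtering drops one mate, so tools that expect paired input usually need them separated out first.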

Page 36:

Implications of Choice of Library

• Reads are assembled into a consensus sequence (contig).
• Pair read information links contigs into a scaffold (or supercontig); the gaps between contigs are written as Ns (NNNNN).
• Genetic information (markers) or optical maps order scaffolds into a pseudomolecule (or ultracontig).

Slide credit: Aureliano Bombarely

Page 37:

Multiplexing Libraries

Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.

[Figure: fragments from two samples, tagged AGTCGT and TGAGCA, pooled for sequencing]

Slide credit: Aureliano Bombarely
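After sequencing, the pooled reads have to be assigned back to their samples by tag. The sketch below uses the two tags shown on the slide (AGTCGT and TGAGCA) and assumes the tag sits at the start of each read and must match exactly. The sample names, the example reads, and the function name `demultiplex` are hypothetical, and real demultiplexers usually tolerate sequencing errors in the tag, which this does not.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str], tag_to_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each read to a sample by its leading tag and strip the tag."""
    if not tag_to_sample:
        raise ValueError("at least one tag is required")
    tag_lengths = {len(tag) for tag in tag_to_sample}
    if len(tag_lengths) != 1:
        raise ValueError("this sketch assumes all tags have the same length")
    tag_len = tag_lengths.pop()

    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        # Reads whose tag is not recognised go to a separate bin.
        bins[tag_to_sample.get(tag, "unassigned")].append(insert)
    return dict(bins)


tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}  # sample names are made up
pooled = ["AGTCGTACGTTT", "TGAGCAGGGCCC", "AGTCGTTTTAAA", "CCCCCCACGT"]
by_sample = demultiplex(pooled, tags)
print(by_sample)
```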

Page 38:

File Formats

Fasta files:

It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.

-Wikipedia

Slide credit: Aureliano Bombarely
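To make the FASTA layout concrete, here is a minimal reader, assuming the usual convention that each record starts with a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the two example records are invented for illustration; for real work a maintained parser is a safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence may wrap over several lines; join them.
            records[current_id] += line
    return records


example = """>seq1 hypothetical nucleotide record
ACGTACGT
ACGT
>pep1 hypothetical peptide record
MKVL
"""
records = parse_fasta(example.splitlines())
print(records)
```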

Page 39:

File Formats

Fastq files:

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

-Wikipedia

• Single line ID with at symbol (“@”) in the first column.
• Sequences can be in multiple lines after the ID line.
• Single line with plus symbol (“+”) in the first column to mark the start of the quality values.
• The “+” line may repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length is identical to that of the sequence.

Slide credit: Aureliano Bombarely
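The rules above can be sketched as a small reader. It follows the multi-line form described on the slide and uses the sequence length to decide where the quality values end, since a quality line may itself begin with “@”. The function name `parse_fastq` and the example records are hypothetical; most current FASTQ files use exactly four lines per record, which this sketch also accepts.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) tuples from FASTQ lines."""
    it = iter(line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        fields = header[1:].split()
        read_id = fields[0] if fields else ""

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        sequence = "".join(seq_parts)

        # Read quality lines until their length matches the sequence length.
        quality = ""
        while len(quality) < len(sequence):
            chunk = next(it, None)
            if chunk is None:
                raise ValueError(f"record {read_id!r} is truncated")
            quality += chunk
        if len(quality) != len(sequence):
            raise ValueError(f"record {read_id!r}: quality and sequence lengths differ")
        yield read_id, sequence, quality


example = """@read1 hypothetical
ACGT
ACGT
+read1
IIII
@III
@read2
GG
+
!!
"""
reads = list(parse_fastq(example.splitlines()))
print(reads)
```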

Page 40:

Quality control: Encoding

Fastq files:

!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)

KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
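Each quality character stands for a number: its ASCII code minus the offset. The sketch below decodes the two character runs shown on the slide and adds a simple offset guess. `decode_quality` and `guess_offset` are illustrative names, and the guess is only a heuristic that returns `None` when the characters fit both encodings.

```python
from typing import List, Optional


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer scores for the given offset."""
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual: str) -> Optional[int]:
    """Heuristic guess of the encoding offset; None if it cannot be decided."""
    codes = [ord(ch) for ch in qual]
    if not codes:
        return None
    if min(codes) < 64:
        return 33  # characters below '@' would be negative scores under Phred+64
    if max(codes) > 74:
        return 64  # characters above 'J' exceed the usual Phred+33 range
    return None


# The two character runs from the slide.
PHRED33_CHARS = "!\"#$%&'()*+,-./0123456789"
PHRED64_CHARS = "KLMNOPQRSTUVWXYZ[\\]^_`abcdefgh"

scores_33 = decode_quality(PHRED33_CHARS, offset=33)
scores_64 = decode_quality(PHRED64_CHARS, offset=64)
print(scores_33[0], scores_33[-1])
print(scores_64[0], scores_64[-1])
```

Decoding with the wrong offset shifts every score by 31, so it is worth checking the encoding before any quality filtering.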

Page 41:

Quality control: Encoding

Page 42:

Quality control: Encoding

Phred score of a base is:

Qphred = -10 log10(e)

where e is the estimated probability of a base being wrong

http://en.wikipedia.org/wiki/Phred_quality_score
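The Phred formula and its inverse, e = 10^(-Q/10), are easy to check numerically. This small sketch uses illustrative function names and ties the formula back to the figures quoted elsewhere in the deck, such as 99.999% accuracy corresponding to Q50.

```python
import math


def phred_from_error(e: float) -> float:
    """Phred quality for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def error_from_phred(q: float) -> float:
    """Estimated error probability for a Phred quality q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 50):
    e = error_from_phred(q)
    print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):.3f}%")
```

Every 10 points of Phred score is a tenfold drop in the error probability, so Q20 means about 1 wrong base in 100 and Q30 about 1 in 1000.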

Page 43:

Pre-processing: Tools

Trimming

• FastQC

• FASTX toolkit

• Trimmomatic

• Scythe

Joining paired-end reads

• fastq-join

• FLASH

• PANDAseq
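To show what quality trimming means in practice, here is a deliberately simple sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by any of the tools listed above; the threshold of Q20, the Phred+33 default, the function name `trim_3prime`, and the example read are all assumptions made for illustration.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose quality score is below min_q."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    end = len(seq)
    # Walk back from the 3' end while the score is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: scores are 40, 40, 40, 40, 40, 20, 10, 0.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5+!")
print(trimmed_seq, trimmed_qual)
```

Dedicated trimmers typically add adapter removal, sliding-window averaging, and minimum-length filters on top of this basic idea.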

Page 44:

Thank you!!