Page 1: GPU DNA Sequencing Base Quality Recalibration · mauriciocarneiro.github.io/talks/20150317-gpu_tech_conf.pdf

GPU DNA Sequencing Base Quality Recalibration

Mauricio Carneiro, Nuno Subtil

Group Lead, Computational Technology Development Broad Institute of MIT and Harvard

Page 2

To fully understand one genome we need tens of thousands of genomes

Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)

Page 3

Improving human health in 5 easy steps

• Disease genetics: Many simple and complex human diseases are heritable. Pick one.

• Large scale sequencing: Affected and unaffected individuals differ systematically in their genetic composition.

• Association studies: These systematic differences can be identified by comparing affected and unaffected individuals.

• Functional studies: These associated variants give insight into the biological mechanisms of disease.

• Therapeutics and drugs: These insights can be used to intervene in the disease process itself.

Page 4

Personalized medicine for rare variants is already a reality

• Couples with rare conditions get referred by local hospitals to a local genetics center.

• DNA sequencing can reveal the deleterious mutation causing the condition.

• In vitro fertilization followed by embryo selection can guarantee a disease-free baby.

• Limitations of this process are in the size of the control cohort, and the rarity of the condition.

Page 5

The Importance of Scale… Early Success Stories (at 1,000s of exomes)

Coronary Heart Disease
• 3,700 exomes
• APOC3
• 2.5-fold protection from CHD
• 4 rare disruptive mutations (~1 in 200 carrier frequency)

Type 2 Diabetes
• 13,000 exomes
• SLC30A8 (Beta-cell-specific Zn++ transporter)
• 3-fold protection against T2D!
• 1 LoF per 1500 people

Schizophrenia
• 5,000 exomes
• Pathways
  o Activity-regulated cytoskeletal (ARC) of post-synaptic density complex (PSD)
  o Voltage-gated Ca++ Channel
• 13-21% risk in carriers
• Collection of rare disruptive mutations (~1/10,000 carrier frequency)

Early Heart Attack
• 5,000 exomes
• APOA5
• 22% risk in carriers
• 0.5% Rare disruptive / deleterious alleles

Page 6

Terabases of Data Produced by Year

Year   Terabases
2009   22.8
2010   153.8
2011   302.8
2012   362.4
2013   660
2014   1,600

Page 7

…and these numbers will continue to grow faster than Moore’s law
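
As a rough check on this claim, the sketch below estimates the doubling time implied by the chart on the previous slide, assuming its bars read 22.8 terabases for 2009 and 1,600 for 2014. The two-year doubling period used for Moore's law is a commonly quoted figure and an assumption here, not a number from the talk.

```python
import math

# Endpoints read off the "Terabases of Data Produced by Year" chart (assumed reading).
start_year, start_tb = 2009, 22.8
end_year, end_tb = 2014, 1600.0

years = end_year - start_year
annual_factor = (end_tb / start_tb) ** (1 / years)      # average yearly growth multiple
doubling_time = math.log(2) / math.log(annual_factor)   # years needed to double output

MOORE_DOUBLING_YEARS = 2.0  # assumption: commonly quoted Moore's-law doubling period

print(f"average growth per year: {annual_factor:.2f}x")
print(f"doubling time: {doubling_time:.2f} years "
      f"(Moore's law: ~{MOORE_DOUBLING_YEARS} years)")
```

Under these assumptions, data output roughly doubles in a little under ten months, well inside the assumed two-year Moore's-law period.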

Page 8

GATK is both a toolkit and a programming framework, enabling NGS analysis by scientists worldwide

Toolkit & framework packages
• Toolkit: Best practices for variant discovery
• Framework: Tools developed on top of the GATK framework by other groups (MuTect, XHMM, GenomeSTRiP, ...)

Extensive online documentation & user support forum serving >10K users worldwide

http://www.broadinstitute.org/gatk

Page 9

Workshop series educates local and worldwide audiences

Completed:
• Dec 4-5 2012, Boston
• July 9-10 2013, Boston
• July 22-23 2013, Israel
• Oct 21-22 2013, Boston

Planned:
• March 3-5 2014, Thailand
• Oct 18-29 2014, San Diego

Tutorial materials, slide decks and videos all available online through the GATK website, YouTube and iTunesU

• High levels of satisfaction reported by users in polls
• Detailed feedback helps improve further iterations

Format
• Lecture series (general audience)
• Hands-on sessions (for beginners)

Portfolio of workshop modules
• GATK Best Practices for Variant Calling
• Building Analysis Pipelines with Queue
• Third-party Tools:
  o GenomeSTRiP
  o XHMM

Page 10

Identifying mutations in a genome is a simple “find the differences” problem
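
In its idealized form, "find the differences" is a position-by-position comparison of an aligned read against the reference. The sketch below shows only that naive view. The function name `find_differences`, the sequences, and the alignment offset are made up for illustration, and the read is assumed to align without insertions or deletions.

```python
def find_differences(reference, read, offset=0):
    """Return (position, ref_base, read_base) for each mismatch of an aligned read."""
    diffs = []
    for i, read_base in enumerate(read):
        ref_base = reference[offset + i]
        if read_base != ref_base:
            diffs.append((offset + i, ref_base, read_base))
    return diffs


reference = "ACGTACGTAC"  # made-up reference sequence
read = "GTATGT"           # made-up read, assumed to align at position 2 with no indels

print(find_differences(reference, read, offset=2))  # [(5, 'C', 'T')]
```

This comparison cannot tell a true variant from a sequencing error, which is one reason real data is harder than the toy case.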

Page 11

Unfortunately, real data does not look that simple

Page 12

We have defined the best practices for sequencing data processing

Van der Auwera, GA et al. Current Protocols in Bioinformatics (2013)

Page 13

GPUs have sped up variant calling significantly

Technology  Hardware                    Runtime  Improvement
GPU         NVidia Tesla K40                 70  154x
GPU         NVidia GeForce GTX Titan         80  135x
GPU         NVidia GeForce GTX 480          190  56x
GPU         NVidia GeForce GTX 680          274  40x
GPU         NVidia GeForce GTX 670          288  38x
AVX         Intel Xeon 1-core               309  35x
FPGA        Convey Computers HC2            834  13x
-           C++ (baseline)                1,267  9x
-           Java (gatk 2.8)              10,800  -
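
The Improvement column appears to be the Java (gatk 2.8) runtime divided by each implementation's runtime. The sketch below recomputes that ratio from the table values. The interpretation is an assumption, and the results agree with the reported figures only to within rounding (each differs by less than 1x).

```python
# Runtime values copied from the table; units are as reported there.
BASELINE_RUNTIME = 10_800  # Java (gatk 2.8)

runtimes = {
    "NVidia Tesla K40": 70,
    "NVidia GeForce GTX Titan": 80,
    "NVidia GeForce GTX 480": 190,
    "NVidia GeForce GTX 680": 274,
    "NVidia GeForce GTX 670": 288,
    "Intel Xeon 1-core (AVX)": 309,
    "Convey Computers HC2 (FPGA)": 834,
    "C++ (baseline)": 1_267,
}

# Improvement factors as printed on the slide, for comparison.
reported = {
    "NVidia Tesla K40": 154,
    "NVidia GeForce GTX Titan": 135,
    "NVidia GeForce GTX 480": 56,
    "NVidia GeForce GTX 680": 40,
    "NVidia GeForce GTX 670": 38,
    "Intel Xeon 1-core (AVX)": 35,
    "Convey Computers HC2 (FPGA)": 13,
    "C++ (baseline)": 9,
}

for hardware, runtime in runtimes.items():
    speedup = BASELINE_RUNTIME / runtime
    print(f"{hardware:30s} {speedup:6.1f}x  (slide: {reported[hardware]}x)")
```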

Page 14

Variant calling depends heavily on accurate measurements of error

[Figure: HMM diagram; surviving labels: r, H, R, h]

The transition probabilities on this HMM are the

Page 15

Qualities are the probability measures of error in a read

Bases            C   G   G   T   A   C   A   A   T   G
Quals           33  37  29  39  30  32  23  12   2   2   (emitted by most instruments)
InsertionQuals  43  40  43  42  44  39  22  10  43  40
DeletionQuals   45  45  40  39  42  41  38  32  40  44
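
To make the link between a quality score and an error probability concrete, the sketch below converts the Quals row of the example read. It assumes the scores are Phred-scaled, Q = -10 log10(P), which is the usual convention for sequencing qualities but is not stated on the slide. The function name is illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


# Bases and base qualities from the example read on this slide.
bases = "CGGTACAATG"
quals = [33, 37, 29, 39, 30, 32, 23, 12, 2, 2]

for base, q in zip(bases, quals):
    print(f"{base}  Q{q:<3d} P(error) = {phred_to_error_prob(q):.5f}")
```

Under this convention Q30 means one expected error in 1,000 bases, while the two trailing Q2 bases are more likely wrong than right (P ≈ 0.63).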

Page 16

[Figure: base qualities before and after recalibration on five sequencing platforms (SLX GA, 454, SOLiD, HiSeq, Complete Genomics). Three rows of five panels: Empirical Quality vs. Reported Quality; Accuracy (Empirical − Reported Quality) vs. Machine Cycle, with some panels split into first of pair reads and second of pair reads; Accuracy (Empirical − Reported Quality) vs. Dinucleotide. Each panel reports the RMSE of the original and the recalibrated qualities.]

RMSE per panel, Original / Recalibrated, in the order the panels appear:

Reported Quality   5.242 / 0.196   2.556 / 0.213   1.215 / 0.756   4.479 / 0.235   5.634 / 0.135
Machine Cycle      2.207 / 0.186   1.784 / 0.136   1.688 / 0.213   2.679 / 0.182   2.609 / 0.089
Dinucleotide       2.598 / 0.052   2.169 / 0.135   1.656 / 0.088   3.503 / 0.06    2.469 / 0.083

Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!
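
The comparison behind these plots can be sketched in a few lines: group bases by a covariate (here only the reported quality), count mismatches against the reference, and convert the observed error rate back into a Phred-scaled empirical quality. This is a simplified illustration, not the GATK implementation. The function names, the add-one smoothing, the unweighted RMSE, and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict


def empirical_quality(errors, observations):
    """Phred-scaled observed error rate, with add-one smoothing (assumed choice)."""
    rate = (errors + 1) / (observations + 2)
    return -10.0 * math.log10(rate)


def calibration_table(observations):
    """Map each reported quality to the empirical quality of the bases carrying it.

    `observations` is an iterable of (reported_quality, is_mismatch) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [errors, total bases]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {q: empirical_quality(e, n) for q, (e, n) in counts.items()}


def rmse(table):
    """Unweighted root-mean-square gap between reported and empirical quality."""
    squared = [(empirical - reported) ** 2 for reported, empirical in table.items()]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: both bins really mismatch 1% of the time (about Q20),
# but one bin is reported as Q30 and the other as Q20.
toy = [(30, i < 10) for i in range(1000)] + [(20, i < 10) for i in range(1000)]

table = calibration_table(toy)
for reported_q in sorted(table):
    print(f"reported Q{reported_q} -> empirical Q{table[reported_q]:.1f}")
print(f"RMSE = {rmse(table):.2f}")
```

In this toy case the bin reported as Q30 is overconfident by about ten quality units. Recalibration would replace its reported value with the empirical one, which is what drives the RMSE toward zero in the plots. The plots on this slide apply the same comparison to two further covariates, machine cycle and dinucleotide context.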

Page 17

Base recalibration clarifies the unbiased error mode of PacBio

Carneiro, et al Genome Biology 2014

Page 18

Processing is a big part of the cost of whole genome sequencing
