160628 giab for festival of genomics

So you’ve sequenced my genome. How well did you do?

Justin Zook

NIST Genome-Scale Measurements Group

June 28, 2016

Sequencing technologies and bioinformatics pipelines disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Who’s right?

Is anyone right?

Genome in a Bottle Consortium

Whole Genome Variant Calling

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials to evaluate performance
– materials certified for their variants against a reference sequence, with confidence estimates

• Established consortium to develop reference materials, data, methods, performance metrics

• Characterized Pilot Genome NA12878

• Ashkenazim Trio, Asian Trio from PGP in process

generic measurement process

Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
• Enable regulated applications

Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods

Bringing Principles of Metrology to the Genome

• Reference material
– DNA in a tube you can buy from NIST
– $45/ug
• NA12878 as pilot sample
• Extensive state-of-the-art characterization
– as good as we can get for small variants
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops

• Analysis of PGP trios is ongoing
– open project
• PGP genomes suitable for commercial derived products
• Developing benchmarking tools and software
– with GA4GH
• Samples being used to develop and demonstrate new technology
– for instance, 10X Genomics

Paper describing data…

Integration Methods to Establish Reference Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.
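The first two steps of the integration flow (candidate variants, then concordant variants) can be sketched with plain set operations. This is a minimal illustration, not the published pipeline: the callset names, the toy variants, and the helper `split_candidates` are all hypothetical, and the bias-characterization and arbitration steps are not shown.

```python
from collections import Counter

# Hypothetical toy callsets; each variant is (chrom, pos, ref, alt).
callsets = {
    "platform_A": {("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("2", 500, "G", "A")},
    "platform_B": {("1", 1000, "A", "G"), ("1", 2000, "C", "T")},
    "platform_C": {("1", 1000, "A", "G"), ("2", 500, "G", "A"), ("3", 42, "T", "C")},
}

def split_candidates(callsets):
    """Split the union of all calls into concordant and discordant sets.

    Concordant: present in every callset.
    Discordant: present in at least one callset but not all; these are
    the sites that would need arbitration using evidence of bias.
    """
    support = Counter(v for calls in callsets.values() for v in calls)
    n_callsets = len(callsets)
    concordant = {v for v, k in support.items() if k == n_callsets}
    discordant = set(support) - concordant
    return concordant, discordant

concordant, discordant = split_candidates(callsets)
print("concordant:", sorted(concordant))
print("needs arbitration:", sorted(discordant))
```

Exact tuple matching assumes all callsets use the same variant representation, which real data often does not.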

NEW: Reproducible integration pipeline with new calls for NA12878 and AJ Son!

So, how does WGS make it into Regulated Clinical Applications?

• FDA developing strategy to regulate NGS, which is a novel medical device

“...this technology allows broad and indication-blind testing and is capable of generating vast amounts of data, both of which present issues that traditional regulatory approaches are not well-suited to address.”

• FDA Workshops Feb ’15, Nov ’15
– strategy to rely on standards-based approaches, including reference materials…

“need for reference materials for validation and proficiency testing… there is no substitute for having real samples.”

FDA Whitepaper, Dec ’14; GenomeWeb, Nov ’15

Clinical Genome Sequencing Process

Preanalytical

Sequencing

Sequence Bioinformatics

Functional Variant Annotation

Clinical Variant Knowledgebase

Query

Clinical Interpretation Reporting

EHR Archival

What is the standards architecture to demonstrate safety and efficacy?

Analytical/Technical Performance Assessment

Global Alliance for Genomics and Health Benchmarking Task Team

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Developing sophisticated benchmarking tools
– vcfeval (Len Trigg)
– hap.py (Peter Krusche)
– vgraph (Kevin Jacobs)

• Standardized bed files with difficult genome contexts for stratification

Credit: GA4GH, Abby Beeler, Ellie Wood
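Once TP, FP, and FN have been counted, the usual summary metrics follow from textbook formulas. The sketch below assumes the counts are already available; how the GA4GH definitions decide what counts as a TP, FP, or FN (for example, genotype versus allele matching) is not covered here, and the function name and example counts are hypothetical.

```python
import math

def performance_metrics(tp, fp, fn):
    """Compute recall (sensitivity), precision, and F1 from raw counts.

    Returns NaN for a metric whose denominator is zero.
    """
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else math.nan
    return {"recall": recall, "precision": precision, "f1": f1}

# Hypothetical counts from comparing a query callset to a benchmark.
metrics = performance_metrics(tp=990, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```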

Stratification of FP Rates

Higher FP rates at Tandem Repeats
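Stratifying false positives by genome context amounts to asking which FP positions fall inside a set of BED intervals. The following is a minimal single-chromosome sketch: the intervals and positions are made up, the helper names are hypothetical, and the intervals are assumed to be sorted, non-overlapping, and in BED-style 0-based half-open coordinates.

```python
from bisect import bisect_right

def in_regions(pos, starts, ends):
    """True if 0-based pos lies in any sorted, non-overlapping [start, end)."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]

def stratify(positions, regions):
    """Count positions inside and outside the given regions."""
    starts = [s for s, _ in regions]
    ends = [e for _, e in regions]
    inside = sum(in_regions(p, starts, ends) for p in positions)
    return inside, len(positions) - inside

# Hypothetical tandem-repeat intervals and FP positions on one chromosome.
tandem_repeats = [(100, 200), (500, 650)]
fp_positions = [120, 199, 200, 510, 900]

inside, outside = stratify(fp_positions, tandem_repeats)
print(f"FPs in tandem repeats: {inside}, elsewhere: {outside}")
```

Turning these counts into rates requires a denominator per stratum (for example, the number of query calls in each region), which is not shown.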

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials

• Many samples characterized in clinically relevant regions

• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time

Challenges in Benchmarking Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)

• Easiest to benchmark only within high-confidence bed file, but…

• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
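One common way to put a confidence interval on a proportion such as sensitivity or precision is the Wilson score interval. The slides do not say which interval to use, so this choice is an assumption. The example counts are hypothetical.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical example: 48 of 50 benchmark variants detected in a stratum.
low, high = wilson_interval(48, 50)
print(f"sensitivity = {48 / 50:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only 50 sites in a stratum the interval is wide, which is the practical reason for reporting it alongside the point estimate.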

How can we extend this approach to structural variants?

Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies

Differences from small variants
• Callsets generally are not sufficiently sensitive to assume that regions without calls are homozygous reference
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy
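Because breakpoints and sizes are imprecise, finding calls supported by multiple technologies needs fuzzy matching and cannot rely on exact comparison. The sketch below is one simple way to do it and is not the consortium's method: the `SV` record, the distance and size-ratio thresholds, and the example calls are all hypothetical, and sizes are assumed to be positive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    size: int      # assumed > 0
    svtype: str    # e.g. "DEL" or "INS"
    tech: str      # technology that produced the call

def same_event(a, b, max_dist=1000, min_size_ratio=0.5):
    """Treat two calls as the same event if type and chromosome match,
    start positions are close, and sizes are similar (thresholds are
    illustrative assumptions)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.start - b.start) > max_dist:
        return False
    return min(a.size, b.size) / max(a.size, b.size) >= min_size_ratio

def multi_tech_support(calls, min_techs=2):
    """Return calls matched by calls from at least min_techs technologies
    (the call's own technology counts as one)."""
    supported = []
    for call in calls:
        techs = {other.tech for other in calls if same_event(call, other)}
        if len(techs) >= min_techs:
            supported.append(call)
    return supported

calls = [
    SV("1", 10000, 300, "DEL", "Illumina"),
    SV("1", 10050, 320, "DEL", "PacBio"),
    SV("1", 50000, 2000, "INS", "PacBio"),
    SV("2", 700, 80, "DEL", "Illumina"),
]
supported = multi_tech_support(calls)
print(f"{len(supported)} of {len(calls)} calls supported by 2+ technologies")
```

The all-against-all loop is quadratic; a real comparison would index calls by position first.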

Callsets Contributed so far

Short reads
• Illumina
– Spiral Genetics
– cortex
– Commonlaw
– MetaSV
– Parliament/assembly
– Parliament/assembly-force
• Complete Genomics
– CG-SV
– CG-CNV
– CG-vcfBeta

Long reads and Linked reads
• PacBio
– CSHL-assembly
– Sniffles
– PBHoney-spots and -tails
– Parliament/pacbio
– Parliament/pacbio-force
– MultibreakSV
– smrt-sv.dip
– Assemblytics-Falcon and -MHAP
• Nanopore mapping
– Nabsys force calls
• Optical mapping
– BioNano with and without haplotype-aware assembly
• 10X Genomics

Number of Calls Supported by 2 Technologies by Size Range

              <50bp  50-100bp  100-1000bp  1kb-3kb  >3kb
pre-filtered   2404      1307        2288      481   600
filtered       2325      1188        1875      379   341

Sensitivity to Draft Benchmark Calls

Callset                  <50bp  50-100bp  100-1000bp  1kb-3kb  >3kb
AssemblyticsFalcon          0%       55%         68%      59%   45%
AssemblyticsMHAP            0%       51%         66%      56%   52%
CGvcf                      86%       20%          4%       0%    0%
CGCNV                       0%        0%          0%       0%   29%
CGSV                        0%        0%         39%      65%   56%
CSHLassembly                0%       47%         62%      49%   42%
sniffles                    7%       28%         58%      59%   64%
BioNano                     0%        0%          2%      26%   37%
Spiral                     85%       44%         57%      38%   40%
Cortex                     39%       15%          7%       2%    0%
CommonLaw                   0%        0%          8%      47%   40%
PBHoneySpots                0%       39%         63%       9%    0%
PBHoneyTails                0%        0%          0%      31%   57%
MetaSV                      0%        0%         75%      74%   71%
ParliamentPacBio            0%        0%         74%      75%   48%
ParliamentAssembly          0%        0%         65%      44%    2%
MultibreakSV               16%       66%         72%      59%   47%
CNVnator                    0%        0%         22%      71%   74%
ParliamentPacBioForce       1%       45%         72%      31%   18%
ParliamentAssemblyForce     0%       42%         63%      11%    2%
BionanoHaplo                0%        0%          0%      36%   49%
NabsysForce160405           0%        0%          5%      25%   28%
smrtsvdip                   0%       66%         77%      65%   55%
fermikit                   94%       86%         83%      59%   56%
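Per-size-range sensitivity like that tabulated above can be computed by binning each benchmark call by size and taking the fraction detected in each bin. This is a sketch under assumptions: the bin boundaries are treated as lower-inclusive (a 50 bp call falls in "50-100bp"), which the slides do not specify, and the sizes and detection flags are invented.

```python
from bisect import bisect_right

BIN_EDGES = [50, 100, 1000, 3000]
BIN_LABELS = ["<50bp", "50-100bp", "100-1000bp", "1kb-3kb", ">3kb"]

def size_bin(size):
    """Map an SV size in bp to its size-range label (lower edge inclusive)."""
    return BIN_LABELS[bisect_right(BIN_EDGES, size)]

def sensitivity_by_bin(benchmark_sizes, detected_flags):
    """Fraction of benchmark calls detected in each size bin.

    Bins with no benchmark calls map to None.
    """
    totals = {label: 0 for label in BIN_LABELS}
    hits = {label: 0 for label in BIN_LABELS}
    for size, found in zip(benchmark_sizes, detected_flags):
        label = size_bin(size)
        totals[label] += 1
        hits[label] += bool(found)
    return {
        label: (hits[label] / totals[label] if totals[label] else None)
        for label in BIN_LABELS
    }

# Hypothetical benchmark call sizes and whether one callset found each.
sizes = [30, 60, 75, 400, 2500, 5000, 8000]
detected = [False, True, False, True, True, True, False]

result = sensitivity_by_bin(sizes, detected)
for label, value in result.items():
    print(f"{label:>10}: {value:.0%}")
```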

Size distributions

Concordance between technologies

All Calls

High-confidence Calls

Acknowledgements

• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe

• Genome in a Bottle Consortium

• GA4GH Benchmarking Team

• FDA
– Liz Mansfield
– Zivana Tezak
– David Litwack

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser

Data: http://biorxiv.org/content/early/2015/09/15/026468

Global Alliance Benchmarking Team: https://github.com/ga4gh/benchmarking-tools

Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA

NRC postdoc opportunities available!
Justin Zook: [email protected]
Marc Salit: [email protected]
