Genome in a Bottle: So you’ve sequenced a genome – how well did you do?
February 2015
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
Whole genome sequencing technologies disagree about 100,000’s of variants
[Figure: Venn diagram of SNPs called by Platform #1, Platform #2, and Platform #3. Region labels, given as # SNPs (% of SNPs detected by any platform): 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)]
Bioinformatics programs also disagree
O’Rawe et al. Genome Medicine 2013, 5:28
NIST-hosted Genome in a Bottle Consortium
• Infrastructure for performance assessment of NGS
  – support science-based regulatory oversight
• No widely accepted set of metrics to characterize the fidelity of variant calls from NGS…
• Genome in a Bottle Consortium is developing standards to address this…
  – well-characterized human genomes as Reference Materials (RMs)
    • characterized and disseminated by NIST
  – tools and methods to use these RMs
    • Global Alliance for Genomics and Health Benchmarking Team
http://genomeinabottle.org
Genome in a Bottle Consortium Development
• NIST met with sequencing technology developers to assess standards needs
  – Stanford, June 2011
• Open, exploratory workshop
  – ASHG, Montreal, Canada
  – October 2011
• Small, invitational workshop at NIST to develop consortium for human genome reference materials
  – FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
  – developed draft work plan
  – April 2012
• Open, public meetings of GIAB
  – August 2012 at NIST
  – March 2013 at Xgen
  – August 2013 at NIST
  – January 2014 at Stanford
  – August 2014 at NIST
  – January 2015 at Stanford
• Website
  – www.genomeinabottle.org
Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid cell line
• Genome Reference Consortium
Performance Metrics
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer
November 20, 2013
Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials will be developed to characterize performance of a part of process
  – materials will be certified for their variants against a reference sequence, with confidence estimates
[Figure: generic measurement process, with stages labeled Pre-Analytical steps, Analytical steps, and Clinical Interpretation]
Putting “Genomes” in Bottles
• NIST worked with GIAB to select genomes
• Current genomes
  – NA12878 HapMap sample as Pilot sample
    • part of 17-member pedigree
  – 2 trios from PGP
    • Ashkenazim
    • Asian
[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. 11 children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893]
NIST Human Genome RMs in the pipeline
• All 10 µg samples of DNA isolated from multistage large growth cell cultures
  – all are intended to act as stable, homogeneous references suitable for use in regulated applications
  – all genomes also available from Coriell repository
• Pilot Genome
  – ~8400 tubes
• Ashkenazim Jewish Trio
  – ~10000 son; ~2500 each parent
• Asian Trio
  – ~10000 son; parents not yet planned as NIST RM
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform
  – take advantage of strengths of each platform
• Avoid bias towards any particular bioinformatics algorithms
Pilot Genome: Integrate 14 Datasets from 5 platforms
[Figure: histograms of Annotation #1 (e.g., coverage) and Annotation #2 (e.g., strand bias) for Dataset #1, Dataset #2, and Dataset #3, with Site A, Site B, and Site C marked and a region labeled “Potential Bias”]

Dataset       Site A   Site B   Site C
Dataset #1    0/0      0/0      1/1
Dataset #2    0/1      0/1      1/1
Dataset #3    0/0      0/1      1/1
Integration   0/0      0/1      Uncertain

Integration Methods to Establish Benchmark Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
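The arbitration idea in the Site A/B/C example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the consortium's pipeline: the function name `integrate_site`, the per-dataset boolean bias flags, and the rule "trust only datasets with no evidence of bias" are all hypothetical simplifications.

```python
from typing import List, Tuple

def integrate_site(calls: List[Tuple[str, bool]]) -> str:
    """Integrate one site's genotype calls across datasets.

    Each call is (genotype, biased). `biased` is a hypothetical flag meaning
    the dataset shows evidence of bias (e.g., unusual coverage or strand
    bias) at this site.
    """
    # Genotypes reported by datasets with no evidence of bias.
    unbiased = {genotype for genotype, biased in calls if not biased}
    if len(unbiased) == 1:
        # All unbiased datasets agree, so their genotype wins the arbitration.
        return next(iter(unbiased))
    # Either every dataset is suspect, or unbiased datasets still disagree.
    return "Uncertain"

# Bias flags below are invented so that the toy rule reproduces the slide's table.
site_a = [("0/0", False), ("0/1", True), ("0/0", False)]
site_b = [("0/0", True), ("0/1", False), ("0/1", False)]
site_c = [("1/1", True), ("1/1", True), ("1/1", True)]
for name, calls in [("Site A", site_a), ("Site B", site_b), ("Site C", site_c)]:
    print(name, integrate_site(calls))
```

Note that Site C stays uncertain even though all datasets agree: agreement alone is not treated as sufficient when every dataset shows possible bias.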
Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement
• At least some methods have no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be difficult for current technologies
• State reasons for lower confidence
• If a site is near a low confidence site, make it low confidence
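The last rule (“if a site is near a low confidence site, make it low confidence”) amounts to padding each low-confidence site and merging the padded intervals. A minimal sketch follows; the function names and the padding distance are assumptions chosen for illustration, not values taken from the slides.

```python
from typing import List, Tuple

def low_confidence_intervals(sites: List[int], pad: int) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) intervals covering each
    low-confidence site plus `pad` bases on either side."""
    merged: List[Tuple[int, int]] = []
    for pos in sorted(sites):
        start, end = max(0, pos - pad), pos + pad + 1
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def is_high_confidence(pos: int, low_conf: List[Tuple[int, int]]) -> bool:
    """A position is high confidence only if no padded interval contains it."""
    return not any(start <= pos < end for start, end in low_conf)

# Hypothetical low-confidence positions and a hypothetical 10-base pad.
intervals = low_confidence_intervals([100, 120, 500], pad=10)
print(intervals)
print(is_high_confidence(105, intervals), is_high_confidence(200, intervals))
```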
Challenges with assessing performance
• All variant types are not equal
• All regions of the genome are not equal
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
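One way to see why positive/negative measures are not well posed is to tabulate benchmark and query genotypes against each other. The sketch below is illustrative only: the function name, the toy call sets, and the choice to treat a missing position as homozygous reference (`0/0`) are assumptions, and that last choice is only reasonable inside high-confidence regions.

```python
from collections import Counter
from typing import Dict

def genotype_confusion(truth: Dict[int, str], query: Dict[int, str]) -> Counter:
    """Count (truth genotype, query genotype) pairs over every position
    present in either call set. Missing positions are assumed to be 0/0."""
    counts: Counter = Counter()
    for pos in set(truth) | set(query):
        counts[(truth.get(pos, "0/0"), query.get(pos, "0/0"))] += 1
    return counts

# Toy call sets keyed by position (made up for illustration).
truth = {10: "0/1", 20: "1/1", 30: "0/1"}
query = {10: "0/1", 20: "0/1", 40: "0/1"}
for (t, q), n in sorted(genotype_confusion(truth, query).items()):
    print(f"truth={t} query={q} count={n}")
```

The `truth=1/1, query=0/1` cell is the awkward case: the variant was detected but the genotype is wrong, so it is neither a clean true positive nor a clean false positive.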
Challenge in variant comparison: Complex variants have multiple correct representations
[Figure: alignments of the same region by BWA, ssaha2, CGTools, and Novoalign against the reference (Ref), with labels “T insertion” and “TCTCT insertion”]
Comparison                    FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
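The representation problem can be illustrated by applying each set of calls to the reference and comparing the resulting sequences instead of comparing records line by line. This is a substitute technique shown for illustration, not the realignment method behind the numbers above; the function names, the tuple layout, and the toy sequences are all hypothetical.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (0-based position, ref allele, alt allele)

def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Apply non-overlapping variants to a reference and return the haplotype."""
    seq = reference
    # Work right to left so edits never shift positions still to be applied.
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq

def same_haplotype(reference: str, a: List[Variant], b: List[Variant]) -> bool:
    """True if two call sets describe the same resulting sequence."""
    return apply_variants(reference, a) == apply_variants(reference, b)

# A deletion of one "CA" unit in a repeat, written at two different positions.
print(same_haplotype("GCACACAT", [(0, "GCA", "G")], [(4, "ACA", "A")]))
# One MNP versus the same change written as two adjacent SNPs.
print(same_haplotype("ACGT", [(1, "CG", "TA")], [(1, "C", "T"), (2, "G", "A")]))
```

A record-by-record comparison would count both pairs as mismatches, which is consistent with how an MNP called as separate SNPs can look like a false positive under a traditional comparison.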
Global Alliance for Genomics and Health Benchmarking Task Team
• Formed June 2014 to develop methods and tools for comparing variant calls to a benchmark
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
  – Comparison engine
  – Pluggable web interface with modules for:
    • Reporting/calculation of metrics
    • Visualization/user interface
• Working with Genome in a Bottle Consortium to host data and calls from their well-characterized genomes
www.bioplanet.com/gcat
Example User Interface
Stratifying Performance
• Measure performance for different types of variants in different sequence contexts
  – Types of variants
    • SNPs
    • indels of different sizes
    • complex variants
    • structural variants
  – Sequence contexts
    • Homopolymers
    • STRs
    • Duplications
  – Functional context
    • Exome vs genome, etc.
  – Data characteristics
    • Coverage
    • Mapping quality
• Challenge of smaller gene panels vs genome sequencing
  – one RM may not have a sufficient number of examples of different classes of variants or sequence contexts
  – likely need more samples with specific types of variants
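Stratification can be sketched as computing sensitivity, TP / (TP + FN), separately per stratum rather than once overall. The function name, stratum labels, and toy data below are made up for illustration and are not results from any GIAB comparison.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def sensitivity_by_stratum(benchmark: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Each item is (stratum label, detected) for one benchmark variant.
    Returns TP / (TP + FN) for every stratum seen."""
    found: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for stratum, detected in benchmark:
        total[stratum] += 1
        found[stratum] += int(detected)
    return {stratum: found[stratum] / total[stratum] for stratum in total}

# Toy benchmark variants: (stratum, was it detected by the pipeline under test?)
toy = [
    ("SNP", True), ("SNP", True), ("SNP", False), ("SNP", True),
    ("indel in homopolymer", True), ("indel in homopolymer", False),
]
print(sensitivity_by_stratum(toy))
```

With only two examples in a stratum the estimate is very coarse, which is the same concern the slide raises about small gene panels and a single RM.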
NCBI/CDC GeT-RM Browser
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of questionable calls
Initial uses of high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
  – prior to release of RMs
  – SNPs & indels
    • ~77% of the genome
Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine
“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”
Benchmarking somatic variant calling at Qiagen
Implications of Technical Accuracy in Medical Genome Sequencing
• Collaboration with Euan Ashley group at Stanford
• What is accuracy for functional variants?
• How much of the exome falls in high confidence regions?
• “Black list” in databases
• Sensitivity
  – WExS (95%) < WGS (98%)
    • especially splicing
  – genome < nonsyn < syn
  – Most exome FNs caused by low coverage
  – Most WGS FNs caused by filtering
• Only 81% of ClinVar pathogenic or likely pathogenic SNPs fall in high-confidence regions
  – Lots of work to do!
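A figure like “81% of sites fall in high-confidence regions” comes from checking each site against a set of genomic intervals. A minimal single-chromosome sketch is below; the function name, the half-open interval convention, and the toy coordinates are assumptions, and a real analysis would work per chromosome on the published region files.

```python
from bisect import bisect_right
from typing import List, Tuple

def fraction_in_regions(positions: List[int], regions: List[Tuple[int, int]]) -> float:
    """Fraction of positions inside any region.

    `regions` must be sorted, non-overlapping, half-open [start, end)
    intervals on a single chromosome. `positions` must be non-empty.
    """
    starts = [start for start, _ in regions]

    def covered(pos: int) -> bool:
        # Index of the last region starting at or before pos, if any.
        i = bisect_right(starts, pos) - 1
        return i >= 0 and pos < regions[i][1]

    return sum(covered(p) for p in positions) / len(positions)

# Toy high-confidence regions and toy site positions (made up).
regions = [(100, 200), (300, 400)]
sites = [150, 250, 399, 400, 50]
print(fraction_in_regions(sites, regions))
```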
Overview of NIST RM Development
Timeline columns: Q4 2014, Q1 2015, Q2 2015, Q3 2015, Q4 2015. Milestones for each genome are listed in slide order.
HG-001/NA12878 (“Pilot” Genome)
  – Release NIST RM 8398; Preliminary large deletions
  – Refined Structural Variants
HG-002 to HG-004 (Ashkenazim trio)
  – Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
  – Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR
  – Refined SNPs/indels; Preliminary SVs
  – Refined Structural Variants
  – NIST RMs 8391/8392 release
HG-005 (son in Asian trio)
  – Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
  – “moleculo”; mate-pair; CG-LFR
  – Preliminary SNPs/indels
  – Refined SNPs/indels; Refined Structural Variants
  – NIST RM 8393 release
Ashkenazim Jewish PGP RM Trio
Table columns: Dataset, Characteristics, Coverage, Availability, Good for…
• Illumina Paired-end: 150x150bp; coverage ~300x/individual; availability: Fastq on ftp; good for SNPs/indels/some SVs
• Illumina Long Mate pair: ~6000 bp insert; coverage ~40x/individual; availability: Feb-Mar 2015; good for SVs
• Illumina “moleculo”: custom library; coverage ~30x by long fragments; availability: Feb-Mar 2015; good for SVs/phasing/assembly
• Complete Genomics: coverage 100x/individual; availability: on ftp; good for SNPs/indels/some SVs
• Complete Genomics LFR: coverage/availability ??; good for SNPs/indels/phasing
• Ion Proton Exome: coverage 1000x/individual; availability: on SRA; good for SNPs/indels in exome
• BioNano Genomics: availability: Feb 2015; good for SVs/assembly
• PacBio: ~10kb reads; coverage ~120-150x on AJ trio; availability: finished ~Mar 2015; good for SVs/phasing/assembly/STRs
Asian PGP trio
• Similar sequencing to Ashkenazim trio except for PacBio
• Only son will be NIST RM
Future Directions
Germline mutations
• Difficult regions/variants
  – Long-read technologies
  – Forming an analysis group
• Tools for assessing performance
  – How to stratify performance and understand biases?
Somatic mutations
• Pilot interlaboratory study to assess comparability of spike-ins
• Commercial members developing FFPE cell lines
• Participants interested in mixing different RMs
How to get involved
• Use our integrated SNP/indel genotypes for NA12878 and give us feedback
  – Cells and DNA currently available from Coriell
  – NIST RM available April 2015
• Join our new Analysis group
  – Use Long-read technologies
  – Structural Variant calls
  – De novo assembly
  – Help create the best-ever characterized trio
• Attend our biannual workshops (January in CA, August in MD)
• Develop tools/metrics with Global Alliance for Genomics and Health Benchmarking Team
Acknowledgments
• FDA – Elizabeth Mansfield, HPC staff
• HSPH
• GCAT - David Mittelman, Jason Wang
• Francisco De La Vega
• Illumina - Mike Eberle
• Personalis - Deanna Church
• NCBI – Chunlin Xiao
• Celera - Andrew Grupe
• Genome in a Bottle
  – www.genomeinabottle.org
  – New members welcome!
  – Sign up for email newsletters