jan2015 using the pilot genome rm for clinical validation steve lincoln
TRANSCRIPT
2/9/2015 © 2013-2014 Invitae Corporation. All Rights Reserved | CONFIDENTIAL1
Using the Genome in a Bottle (GIAB) pilot reference material:
Its strengths and limitations for analytic validation of a diagnostic panel test
Stephen E. Lincoln
Invitae
• Diagnostic tests are ordered in response to a medical
question that needs an answer in order to make a specific
decision for a specific patient
− Can be time critical; Decisions may not be reversible
• Our Job: Provide a highly accurate answer to the question
asked in the time needed
− A complete answer is highly valued
o No matter how challenging (with some limits)
− Extra information is not valued (in most cases)
• Rigorous validation required by CLIA, CAP, the medical
community and payers
− Focus on Analytic (not Clinical) validation here…
Genetic Diagnostic Tests ≠ Research
29 Gene Hereditary Cancer Panel
Sub-Panel | Genes | Total | Gene names
BRCA1/2 | 2 | 2 | BRCA1, BRCA2
Other High-Risk Breast/Ovarian | 4 | 6 | CDH1, PTEN, STK11, TP53
Moderate-Risk Breast/Ovarian | 6 | 12 | ATM, BRIP1, CHEK2, NBN, PALB2, RAD51C
Lynch Syndrome | 5 | 17 | EPCAM, MLH1, MSH2, MSH6, PMS2
Other Hereditary Cancer Syndromes | 11 | 28 | APC, BMPR1A, SMAD4, CDK4, CDKN2A, PALLD, MET, MEN1, RET, PTCH1, VHL
MUTYH | 1 | 29 | MUTYH
1. Multiple Enrichment Methods
− No one technology delivers adequate coverage of all 29 genes
2. Copy number and other structural variants play a
significant role in addition to sequence variants
− CNVs as small as one exon
− Alu insertions
− Tandem duplications
3. Of these 29 genes, a number are “hard”
− PMS2 (last 4 exons) and CHEK2 have pseudogenes
− SMAD4 also does, in some people
− MSH2 has a large intronic homopolymer-A immediately next to
a canonical splice site (known to harbor pathogenic mutations)
− CDKN2A has a low complexity 80% GC tandem duplication at
the 5’ Met (also known to harbor pathogenic mutations)
Technical Requirements For These 29 Genes
Study Population
Group | N | Description | Previous Testing
Prospective Clinical | 735 | Prospectively accrued clinical cases | Clinical testing for BRCA1/2, occasionally other genes (depending on case)
High-Risk Clinical (Total 327) | 209 | Retrospective cases from a clinical biobank generally containing higher-risk individuals |
 | 118 | Cases referred due to known pathogenic variant in family | Clinical single-site testing
Clinical cases subtotal | 1062 | |
Reference Samples | 36 | Reference samples from public biobanks (Coriell, NIBSC) | Samples carry known pathogenic variants
Well-Characterized Genomes (WCGs) | 7 | Reference samples from public biobanks with high-quality whole genome sequencing (WGS) data | Variants in 29 cancer genes extracted from WGS data; most of these are benign
Total | 1105 | |
7 Well Characterized Genomes (WCGs) Used
[Figure: pedigree diagrams for CEPH/Utah Pedigree 1463 (NA12877 through NA12893) and Yoruba Family Y117 (NA19238, NA19239, NA19240); check marks (✔) flag the 7 genomes used as WCGs]
Geoff Nilsen
Integrated Complete Genomics, Illumina Platinum and other data sets
Mendelian scrub (leveraging data from family members not used in this study)
1. CLIA Validation and Performance Study (pre-GIAB):
• Integrated CG and Illumina Platinum data
• Compared scrubbed data against our Dx test data
2. Later reconciled NA12878 data against GIAB data set
• Substantially the same as our integrated data
Results presented here are a mix of the pre/post GIAB
WCGs in Cancer Panel Validation
Geoff Nilsen, Shan Yang
• 58,708 variants detected (avg. 53 per patient)
• >90% are common polymorphisms (MAF>1% in 1KG)
• >99% are single nucleotide variants (SNVs)
• <0.1% are of the most technically challenging types*
− CNVs (single gene to single exon)
− Larger indels (≥10bp)
− Closely-spaced variants (≤25bp)
− Complex variants
− Variants in/near low complexity sequence
Genetic Data for 1105 Individuals x 29 genes
*We believe this largely reflects prevalence, not sensitivity limitations.
Analytic
Validation
Variants Selected in Analytic Validation Study
Category | Type | Variants | Details
Sequence | Single Nucleotide Variants (SNVs) | 549 |
Sequence | Sequence deletions <10 base-pairs | 125 |
Sequence | Sequence insertions <5 base-pairs | 31 |
Sequence | Sequence insertions ≥5 base-pairs | 4 | 24, 5 bp
Sequence | Sequence deletions ≥10 base-pairs | 9 | 126, 40, 19, 15, 11 bp
Sequence | Complex variants | 6 | Delins, haplotypes, homopolymer-associated [1]
Copy Number | Single exon deletions | 9 | BRCA1, BRCA2, MSH2, PMS2
Copy Number | Single exon duplications | 4 | BRCA1, MLH1
Copy Number | Deletions of multiple exons or whole gene | 10 | BRCA1, MSH2, RAD51C
Copy Number | Duplications of multiple exons or whole gene | 6 | BRCA1, BRCA2, NBN, SMAD4
Total | | 750 |
Some published validation studies have few, if any, examples of these relatively
challenging classes of variation [2,3]
1. MSH2:c.942+3A>T
2. Bosdet et al, J Mol Dx, 2013
3. Chong et al, PLOS One, 2014
“Hard Stuff”
All could be directly compared between NGS panel and reference/orthogonal data.
• 7 samples contributed 310 of 750 selected variants
− All variants in assay targets in the WCG data sets were used
− 41% of the total set of variants came from 0.6% of the samples
• In 15 of 29 genes the 7 WCGs doubled (or more) the
selected variant count
• WCGs added variants in one gene (PTCH1) which
otherwise had none selected
• Saved us 310 Sanger confirmations
− Unlike confirmation, WCGs contribute both to sensitivity and
specificity measurements in a strong way
• As a replenishable resource, it’s easy to rerun WCGs
WCGs Contribution to Analytic Validation Study
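The two percentages quoted on this slide follow directly from the counts given in the deck. A minimal sketch of that arithmetic (variable names are illustrative only):

```python
# Counts quoted on the slides.
wcg_variants = 310        # selected variants contributed by the 7 WCGs
selected_variants = 750   # all variants selected for the analytic validation study
wcg_samples = 7
total_samples = 1105

variant_share = wcg_variants / selected_variants
sample_share = wcg_samples / total_samples

print(f"WCG share of selected variants: {variant_share:.1%}")
print(f"WCG share of samples:           {sample_share:.1%}")
```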
• No coding variants in 5 of 29 genes
− CDKN2A, PALB2, RAD51C, SMAD4
− CHEK2 (a special case)
• Only 1 coding variant in 2 other genes
− PTEN, TP53
• The only errors in any reference data
we saw were in WCGs (but not GIAB)
− 2 in NA19240, 1 in NA12892
− All errors in low-complexity sequence
• Many of the variants are repeated
− Partly due to using related individuals
− Partly because most are common polymorphisms
Limitations of the 7 WCGs
Gene | WCGs | All Others
APC | 31 | 9
ATM | 26 | 10
BMPR1A | 7 | 1
BRCA1 | 21 | 162
BRCA2 | 39 | 156
BRIP1 | 23 | 5
CDH1 | 12 | 4
CDKN2A | 0 | 3
CHEK2 | 0 | 4
EPCAM | 8 | 1
MEN1 | 18 | 1
MET | 18 | 2
MLH1 | 4 | 6
MSH2 | 4 | 8
MSH6 | 11 | 7
MUTYH | 4 | 23
NBN | 16 | 3
PALB2 | 0 | 8
PALLD | 6 | 1
PMS2 | 16 | 9
PTCH1 | 10 | 0
PTEN | 1 | 1
RAD51C | 0 | 4
RET | 27 | 2
SMAD4 | 0 | 3
TP53 | 1 | 3
VHL | 7 | 1
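The per-gene counts above can be cross-checked against the claims made in the bullets. The sketch below is a reader's sanity check, not part of the original analysis. It assumes that a lone number in a row belongs to the "All Others" column for the five genes listed as having no WCG coding variants, and to the "WCGs" column for PTCH1. It also assumes that "doubled" means the WCG count is at least as large as the count from all other samples, with PTCH1 treated separately because it had no other variants.

```python
# (WCG count, all-other count) per gene, transcribed from the slide.
# Blank cells are read as 0; the column assignment of lone numbers is an assumption.
counts = {
    "APC": (31, 9), "ATM": (26, 10), "BMPR1A": (7, 1), "BRCA1": (21, 162),
    "BRCA2": (39, 156), "BRIP1": (23, 5), "CDH1": (12, 4), "CDKN2A": (0, 3),
    "CHEK2": (0, 4), "EPCAM": (8, 1), "MEN1": (18, 1), "MET": (18, 2),
    "MLH1": (4, 6), "MSH2": (4, 8), "MSH6": (11, 7), "MUTYH": (4, 23),
    "NBN": (16, 3), "PALB2": (0, 8), "PALLD": (6, 1), "PMS2": (16, 9),
    "PTCH1": (10, 0), "PTEN": (1, 1), "RAD51C": (0, 4), "RET": (27, 2),
    "SMAD4": (0, 3), "TP53": (1, 3), "VHL": (7, 1),
}

# Total number of selected variants contributed by the WCGs.
wcg_total = sum(w for w, _ in counts.values())

# Genes where adding the WCGs at least doubled an existing (non-zero) count.
doubled = sorted(g for g, (w, o) in counts.items() if o > 0 and w >= o)

# Genes whose only selected variants came from the WCGs, and genes with none from the WCGs.
wcg_only = sorted(g for g, (w, o) in counts.items() if o == 0 and w > 0)
no_wcg = sorted(g for g, (w, o) in counts.items() if w == 0)

print(wcg_total, len(doubled), wcg_only, no_wcg)
```

Under this reading the WCG column sums to 310 and 15 genes are doubled, which agrees with the figures quoted in the bullets.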
PALB2 in NA12878 (Get-RM browser)
Lots of GIAB variants but none are exonic
CDKN2A in NA12878 (Get-RM browser)
Just one GIAB variant in 3’ UTR
(Similar situations in RAD51C, SMAD4)
• 304 of 310 sequence variants are SNVs
• 6 small deletions (max 4bp)
• 0 insertions
• 0 other variant types
• 0 variants in the most tricky regions for a Dx test
− Segdups, low-complexity, etc.
• No GIAB CNV data yet, but we’d expect 0 positives
• None of the WCG variants are clinically relevant
− None pathogenic or likely pathogenic under ACMG ISV criteria
− Unsurprisingly
• But Unfortunately….
Other Limitations of the WCGs in this study
A Significant Fraction of Pathogenic Variants in
The Clinical Cases are Technically Challenging
Pathogenic and likely pathogenic variants (n=260) among the clinical cases
(n=1062) by variant type:
• Small Indel: 52.3%
• SNV: 34.2%
• CNV multi-exon: 4.6%
• CNV single-exon: 3.8%
• Large Indel: 3.5%
• Complex: 1.5%
Examples
BRCA1: c.1175_1214del40
[Figure callouts: deletion mapped correctly in a fraction of reads; split-read signature in additional reads]
BRCA2: c.9203del126
[Figure callouts: exon target; split-read signal at 5’ end of deletion; split-read signal at 3’ end of deletion]
Deletion Affecting 2 Neighboring Exons
[Figure callouts: exon, intron, exon; split-read signal at 5’ end of deletion; split-read signal at 3’ end of deletion]
CDKN2A:c.9_32dup24
Lincoln et al., December 2014, Sup. Figures Page 21
[Figure callouts: translation start at 5’ Met; repeat copy 1; repeat copy 2; insertion of 3rd repeat in correctly mapped NGS reads; split-read signal from 3rd copy (soft-clipped reads), labeled twice]
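The split-read signatures called out in these examples show up in alignments as soft-clipped reads, that is, reads whose CIGAR string starts or ends with an `S` operation. The sketch below shows one simple way to tally where such clips pile up. It is an illustration only, not the pipeline used in this study. It works on plain `(pos, cigar)` pairs following SAM conventions; the function names, the `min_clip` threshold and the toy reads are all hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs, ignoring hard clips."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]

def soft_clips(cigar):
    """Return (leading, trailing) soft-clipped base counts."""
    ops = parse_cigar(cigar)
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail

def breakpoint_candidates(reads, min_clip=10):
    """Count reference positions where sufficiently long soft clips begin or end.

    reads: iterable of (pos, cigar), pos being the 1-based leftmost aligned base.
    """
    hits = Counter()
    for pos, cigar in reads:
        lead, trail = soft_clips(cigar)
        # Reference bases consumed by the aligned part of the read.
        ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
        if lead >= min_clip:
            hits[pos] += 1            # clip abuts the first aligned base
        if trail >= min_clip:
            hits[pos + ref_len] += 1  # first reference base after the aligned part
    return hits

# Toy reads (hypothetical): three clipped reads agree on one position, one read is unclipped.
toy_reads = [(100, "60M40S"), (105, "55M45S"), (160, "30S70M"), (120, "100M")]
print(breakpoint_candidates(toy_reads))
```

Positions supported by several independent clipped reads are candidates for an indel, duplication or insertion junction; a real caller would also inspect the clipped sequence itself.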
BRCA2 c.156_insAlu
[Figure callout: split-read signal of Alu sequence]
• Get IGV
MSH2:c.942+3A>T
[Figure callouts: homopolymer-A; alignment and biochemical artifacts]
SMAD4 Whole-Gene Duplication
[Figure callouts: split-read signal of neighboring exon sequence, repeated at four positions]
Rare Pseudogene Insertion
Lies, Damned Lies and Statistics*
• Imagine this validation study:
− Test genes/exons of medical relevance in NA12878 (etc)
− Compare test results to GIAB reference data
− Count concordance, calculate sensitivity, specificity, and PPV
• Imagine an assay which silently fails to detect all “hard”
variants, but which works highly accurately on the “easy”
variants
• For the total spectrum of variants, sensitivity and specificity
will be over 99.9% for a large enough panel/study
• But among the truly positive patients there is a
>10% chance of a clinical false negative
− In targeted and validated assay regions!
*Mark Twain
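The gap between the two numbers on this slide can be reproduced from figures given earlier in the deck. The sketch below models the hypothetical assay described above, one that detects every "easy" variant and none of the "hard" ones. Two things are assumptions: that the pie-chart categories CNV (multi-exon and single-exon), large indel and complex correspond to the "hard" types, and that each positive patient carries one pathogenic variant.

```python
# Figures quoted in the slides.
HARD_SHARE_OF_ALL_VARIANTS = 0.001  # "<0.1%" of all detected variants, taken at its upper bound
PATHOGENIC_SHARE_BY_TYPE = {
    "Small indel": 0.523,
    "SNV": 0.342,
    "CNV multi-exon": 0.046,
    "CNV single-exon": 0.038,
    "Large indel": 0.035,
    "Complex": 0.015,
}
# Assumption: these pie-chart categories are the technically challenging ones.
HARD_TYPES = ("CNV multi-exon", "CNV single-exon", "Large indel", "Complex")

def sensitivity_when_hard_missed(hard_share):
    """Sensitivity of a hypothetical assay that finds all easy variants and no hard ones."""
    return 1.0 - hard_share

overall_sensitivity = sensitivity_when_hard_missed(HARD_SHARE_OF_ALL_VARIANTS)
hard_share_pathogenic = sum(PATHOGENIC_SHARE_BY_TYPE[t] for t in HARD_TYPES)
positive_patient_sensitivity = sensitivity_when_hard_missed(hard_share_pathogenic)

print(f"Sensitivity over all variants:       {overall_sensitivity:.1%}")
print(f"Hard share of pathogenic variants:   {hard_share_pathogenic:.1%}")
print(f"Sensitivity among positive patients: {positive_patient_sensitivity:.1%}")
```

Because hard variants are rare among all variants but make up roughly 13% of the pathogenic ones, the same assay looks near-perfect in aggregate while missing more than one in ten positive patients.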
• Well characterized genomes, in particular NA12878 with
the GIAB data set, contributed significantly to the analytic
validation of a hereditary cancer panel test
• But there were important limitations:
− Few if any coding variants in some genes
o These are the majority of regions targeted by most Dx assays
− Few deletions (in these regions)
o No insertions in these regions
− Very few complex or “hard” variants, including
o Large indels
o Small CNVs
o Variants in medically relevant low complexity regions
o Other tricky stuff
Conclusion
• More samples with greater genetic diversity
− This is in process!
• CNV/SV maps
− This is in process!
• Fill in some regions currently missing data
− Suggestion: Prioritize coding regions of known disease genes
o There are ~3,000 in total, ~700 generally used in Dx, ~100 commonly used
• Engineered control with lots of “hard” variants
− In subsets of those known disease genes (commonly used ones)
− Genetically engineered cell lines or spike-ins?
• Data in transcript coordinates, using HGVS
Wish List for GIAB Reference Samples
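As a small illustration of why transcript-coordinate HGVS names are convenient, the event size can be read straight off the range in names such as those used in the examples earlier in this deck. The sketch below handles only the simple `c.START_ENDdel` / `c.START_ENDdup` form with an optional trailing length. It is a deliberately narrow regular expression written for this illustration, not a full HGVS parser, and the function name is hypothetical.

```python
import re

# Matches only simple coding-range deletions/duplications, e.g. "c.1175_1214del40".
RANGE_EVENT = re.compile(r"c\.(\d+)_(\d+)(del|dup)(\d*)$")

def coding_span(hgvs):
    """Return (kind, length_from_range, stated_length or None), or None if the form is unsupported."""
    name = hgvs.split(":")[-1].strip()   # drop an optional "GENE:" prefix
    match = RANGE_EVENT.match(name)
    if match is None:
        return None
    start, end, kind, stated = match.groups()
    length = int(end) - int(start) + 1   # coding positions are inclusive
    return kind, length, int(stated) if stated else None

for name in ("BRCA1: c.1175_1214del40", "CDKN2A:c.9_32dup24", "BRCA2: c.9203del126"):
    print(name, "->", coding_span(name))
```

For the two range-style names the length implied by the coordinates agrees with the stated length (40 and 24 bp); the third name has no explicit range, so this narrow pattern returns `None` for it.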
• Steve Lincoln
• Yuya Kobayashi
• Michael Anderson
• Shan Yang
• Kevin Jacobs
• Josh Paul
• Geoff Nilsen
• Jon Sorenson
• Federico Monzon
• Swaroop Aradhya
• Scott Topper
• Martin Powers
Acknowledgements
• Jim Ford
• Allison Kurian
• Meredith Mills
• Leif Ellisen
• Andrea Desmond
• Michelle Gabree
• Kristen Shannon