Clinical Applications of NGS, the Human Variome Project and Data Sharing
Graham Taylor
RGH Cotton
“Dick Cotton was a visionary leader in thinking about human genetic variation and its role in global health. His legacy will live on in the many people he inspired, and the work of the Human Variome Project and other allied activities such as the Global Alliance for Genomics and Health.”
David Altshuler
Context and Topics
1. Preamble: state of play with “next” generation sequencing
2. Clinical utility and limitations of NGS
– How can we assess genome‐scale sequence quality?
3. Data Curation and Sharing
1: State of play with “next” generation sequencing
Cost and performance
[Figure: cost per base; Illumina share price]
“Now is the winter of our discount tests” –Richard III
An Ultimate Goal for Sequence analysis?
For sequencing
– Chromosome‐length reads
– Perfect base calling accuracy
– Each molecule is read
– Highly parallel
– Rapid (minutes, not hours or days)
For analysis
– De novo assembly
– Well curated reference resources
– Data integrated with other biological and medical resources
– Rapid (real time output)
How good can sequencing get?
Attribute                   Illumina NextSeq            PacBio                     E. coli
Length                      300 bases                   15 kilobases               4,600 kilobases
Speed                       0.2 bases/minute/cluster    >1.3 bases/sec/molecule    >50 bases/sec per fork
Parallelism                 450,000,000 /run            50,000,000 /run            Forks /cell
Error rate (errors/base)    1/10,000                    15/100                     1/100,000,000
Cost per megabase*          Approx $2.00                Approx $5.00               Approx $0
* These costs are highly contestable, consult your finance team
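The error rates in the table can be put on the Phred scale used for base quality, Q = -10 log10(error rate). The sketch below is only an illustration of that standard conversion; the function name `phred` is a hypothetical helper, and the three rates are copied from the table.

```python
import math


def phred(error_rate: float) -> float:
    """Convert an error probability per base into a Phred quality score."""
    return -10.0 * math.log10(error_rate)


# Error rates (errors per base) as listed in the table above.
ERROR_RATES = {
    "Illumina NextSeq": 1 / 10_000,
    "PacBio": 15 / 100,
    "E. coli": 1 / 100_000_000,
}

if __name__ == "__main__":
    for platform, rate in ERROR_RATES.items():
        print(f"{platform}: Q{phred(rate):.1f}")
```

On this scale the "> 99.99%" base calling accuracy target mentioned later corresponds to roughly Q40 or better.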
One‐stop lifetime test?
Attribute                                       Status
De novo assembly                                Not yet
Very long reads                                 ? Moleculo
Base calling accuracy > 99.99%                  Getting there
No instrument or system bias                    ?
High coverage (>500‐fold) for mixed samples     Too costly without targeting
Low sample input (single molecule)              Not yet
Will one sequencing technology ever be able to deliver complete genetic analysis?
De novo Assembly (the unfinished genome)
• Huddleston et al. Genome Res. 2014. 24: 688‐696
– Within the human genome, there are >900 annotated genes mapping to large segmental duplications. Such genes are typically missing or misassembled in working draft assemblies of genomes
– The widespread adoption of next‐generation sequencing methods for de novo genome assemblies has complicated the assembly of repetitive sequences and their organization
– Resolved regions that are complex in a genome‐wide context but simple in isolation for a fraction of the time and cost of traditional methods using long‐read single molecule, real‐time (SMRT) sequencing and assembly technology
– SMRT sequencing of large‐insert clones can significantly improve sequence assembly within complex repetitive regions of genomes
PacBio
• English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long‐Read Sequencing Technology. PLoS ONE 7(11): e47768
• Loomis et al. (2012) Sequencing the unsequenceable: Expanded CGG‐repeat alleles of the fragile X gene. Genome Research
Reducing assembly complexity of microbial genomes with single‐molecule sequencing
Long, single‐molecule reads are sufficient for the complete assembly of most known microbial genomes. The assemblies presented here have good likelihood and finished‐grade consensus accuracy exceeding 99.9999%.
Koren et al. Genome Biology 2013, 14:R101
2: Clinical utility and limitations of NGS
Clinical Service != Research
Research
• Original
• Surprising
• >80% accurate
• Numerator‐driven: get publications
• Bespoke
Clinical
• Proven
• Predictable
• >99.99% accurate
• Denominator‐driven (cost sensitive)
• Standardized
• Not necessarily boring
Validation and Utility
Analytical validation
– The range of conditions under which the assay will give reproducible and accurate data (1)
Clinical validation
– The ability of a test to accurately and reliably support or refute the diagnosis of a clinically defined disorder
Clinical utility
– The ability of a test to lead to an improved health outcome with respect to the defined purpose of the test (2)
1. Teutsch, S. M., Bradley, L. A., Palomaki, G. E., Haddow, J. E., et al. The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Initiative: methods of the EGAPP Working Group. Genet Med 11, 3‐14 (2009).
2. Burke, W., Zimmern, R. L. & Kroese, M. Defining purpose: a key step in genetic test evaluation. Genet Med 9, 675‐681 (2007).
From sample to report
General Laboratory Process
LIMS managed
• NGS
• Informatics
• Curation
The entire process needs to be subject to quality control and audit. Most diagnostic labs already have that in place for current testing and reporting. Reporting already includes curation, so the extensions needed are primarily around library construction, sequencing, mapping and variant calling. Since the scale of variant curation will increase, more automation will be required.
National Quality Standards
• National Pathology Accreditation Advisory Council (NPAAC) Tier 3B Document “Requirements for the Development and Use of In‐House In Vitro Diagnostic Medical Devices (IVDs)” (Third Edition 2014) (Print ISBN: 978‐1‐74186‐006‐1 Online ISBN: 978‐1‐74186‐007‐8)
• NPAAC Tier 2 Document “Requirements for Medical Pathology Services” (Print ISBN: 978‐1‐74241‐913‐8 Online ISBN: 978‐1‐74241‐914‐5)
• Human Genetics Advisory Committee (HGAC) National Health and Medical Research Council (NHMRC) Council version document of 12 June 2014
ACMG (www.acmg.net) > Publications > Laboratory Standards and Guidelines > NGS
ACMG NGS Guidelines
Clinical Drivers
• Referring to the clinical question
• Sensitivity and specificity
• Depth vs. breadth of coverage
– Manageable workflow
– Cost efficiency
The case for disease‐centric analysis
• $1,000 genomes or 1,000 x $1 interesting regions?
• How to validate 3.5 x 10^9 tests
• Sequencing costs are not limiting
• Quality and accuracy are incomplete
• Perform tests for a (clinical) reason
Analysis
• Base‐calling
– analytical validation
• Alignment and variant calling
– analytical validation
– clinical performance
• Variant annotation, classification and prioritization
– clinical performance
– clinical utility
Base calling
[Figure: six panels, one per position (KRAS c.28T, c.29G, c.30G; BRAF c.1793A, c.1794G, c.1795C), each plotting values for the three alternative bases and N]
Basic Read & Mapping Stats
• <2% duplicate pairs
• <2% N bases at any cycle
• most frequent k‐mer <2%
• mean insert size between 340 and 440 bp, with a median absolute deviation of <30 bp
• approximately uniform genomic coverage by GC content
• >99% mapped
• <2.5% read pairs mapping to different chromosomes
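The numeric thresholds above lend themselves to an automated pass/fail check. The following is a minimal sketch, not the pipeline used in the talk: the metric names are hypothetical, the insert size range is assumed to be inclusive, and the qualitative GC‐uniformity criterion is left out because the slide gives no number for it.

```python
import operator

# (metric name, comparison, limit). Limits come from the slide;
# the metric names themselves are hypothetical.
CHECKS = [
    ("duplicate_pairs_pct", operator.lt, 2.0),
    ("max_n_bases_per_cycle_pct", operator.lt, 2.0),
    ("top_kmer_pct", operator.lt, 2.0),
    ("mean_insert_size_bp", operator.ge, 340.0),  # range assumed inclusive
    ("mean_insert_size_bp", operator.le, 440.0),
    ("insert_size_mad_bp", operator.lt, 30.0),
    ("mapped_pct", operator.gt, 99.0),
    ("interchromosomal_pairs_pct", operator.lt, 2.5),
]


def failed_checks(metrics):
    """Return the names of the metrics that miss their threshold."""
    return [name for name, compare, limit in CHECKS
            if not compare(metrics[name], limit)]


if __name__ == "__main__":
    run = {
        "duplicate_pairs_pct": 1.2,
        "max_n_bases_per_cycle_pct": 0.1,
        "top_kmer_pct": 0.4,
        "mean_insert_size_bp": 455.0,
        "insert_size_mad_bp": 22.0,
        "mapped_pct": 99.6,
        "interchromosomal_pairs_pct": 1.0,
    }
    print(failed_checks(run))
```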
Alignment Quality Markers
[Figure panels: Coverage; Coefficient of variation; On target]
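Two of the markers named above are simple ratios. As a sketch under stated assumptions (per‐target depths are already available as a list, and "on target" means the fraction of mapped reads that overlap the target regions), they could be computed as follows; both function names are hypothetical.

```python
import statistics


def coverage_cv(depths):
    """Coefficient of variation of depth: population SD divided by the mean."""
    mean_depth = statistics.fmean(depths)
    return statistics.pstdev(depths) / mean_depth


def on_target_fraction(on_target_reads, mapped_reads):
    """Fraction of mapped reads that fall on the targeted regions."""
    return on_target_reads / mapped_reads


if __name__ == "__main__":
    print(coverage_cv([50, 100, 150]))
    print(on_target_fraction(800_000, 1_000_000))
```

A lower coefficient of variation means more even coverage across targets.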
Variant Calling
Low concordance of multiple variant‐calling pipelines
O’Rawe et al. Genome Medicine 2013, 5:28
SNV concordance: 57.4%; indel concordance: 26.8%
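One common way to express concordance between pipelines is the share of all distinct calls that every pipeline agrees on. The sketch below assumes that definition (intersection over union) and represents a variant as a (chromosome, position, ref, alt) tuple; the exact metric used by O’Rawe et al. may differ, and the example calls are made up.

```python
def concordance(call_sets):
    """Fraction of all distinct variants that were called by every pipeline."""
    sets = [set(calls) for calls in call_sets]
    union = set().union(*sets)
    shared = set.intersection(*sets)
    return len(shared) / len(union) if union else 1.0


if __name__ == "__main__":
    # Made-up calls from three hypothetical pipelines.
    pipeline_a = [("17", 100, "C", "G"), ("7", 200, "A", "T"), ("12", 300, "G", "A")]
    pipeline_b = [("17", 100, "C", "G"), ("7", 200, "A", "T")]
    pipeline_c = [("17", 100, "C", "G"), ("12", 300, "G", "A"), ("1", 400, "T", "C")]
    print(concordance([pipeline_a, pipeline_b, pipeline_c]))
```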
How many variants per exome?
SNP count             Study
20,000                Choi et al. PNAS 2009
142,000               Mullikin NIH, unpublished 2010
50,000                Clark et al. Nature Biotechnology 2011
125,000               Smith et al. Genome Biology 2011
100,000               Johnston & Biesecker Human Molecular Genetics 2013
200,000 to 400,000    Yang et al. N Engl J Med 2013
• 20‐fold range
• Exome designs vary
• Variant counts are likely to be higher in African populations, as the reference sequence is non‐African
The Genome in a Bottle Standard
NIST GIAB – Pedigree analysis
[Pedigree figure: top row 12889, 12890, 12891, 12892; middle row 12877, 12878; bottom row 12879 to 12888 and 12893]
• All 17 members sequenced to at least 50x depth (PCR‐Free protocol)
• Variants are called across the pedigree using different software & technology
• Inheritance information provides high‐confidence, direct validation of variant calls
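The idea of using inheritance to validate calls can be shown with a single‐site check: a child's genotype is consistent if one allele can come from each parent. This is a minimal sketch for autosomal, diploid genotypes only; it ignores de novo mutation, sex chromosomes and missing data, and the function name is hypothetical.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one allele per parent.

    Genotypes are pairs of alleles, e.g. ("A", "G"); allele order is ignored.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


if __name__ == "__main__":
    print(mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
    print(mendelian_consistent(("G", "G"), ("A", "A"), ("A", "G")))
```

Calls that fail this check in several family members are candidates for genotyping error.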
Analysis of SNPs in the parents and 11 children
NIST Human Genome RMs in the pipeline
• All 10 µg samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes also available from Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet planned as NIST RM
NA12878 Reads & Bases
sample   read pairs     bases             GB
r1       231,353,122    41,970,793,143    41.97    1x HiSeq/NextSeq run
r2       115,676,561    20,985,304,713    20.99
r4       57,838,280     10,492,374,225    10.49
r8       28,919,140     5,246,032,165     5.25     1x MiSeq run
r16      14,459,570     2,623,084,647     2.62
r32      7,229,785      1,311,465,395     1.31
r64      3,614,892      655,665,469       0.66
r128     1,807,446      327,792,763       0.33
r256     903,723        163,865,510       0.16
r512     451,861        81,930,508        0.08
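The read‐pair counts in this table are consistent with repeated integer halving of the full data set (r1, r2, r4, ... r512). How the subsets were actually drawn is not stated, so the sketch below only reproduces the target sizes; `halving_series` is a hypothetical helper.

```python
def halving_series(start, steps):
    """Return the starting count followed by `steps` successive integer halvings."""
    sizes = [start]
    for _ in range(steps):
        sizes.append(sizes[-1] // 2)
    return sizes


if __name__ == "__main__":
    # Label each subset by its downsampling factor: r1, r2, r4, ...
    for i, pairs in enumerate(halving_series(231_353_122, 9)):
        print(f"r{2 ** i}\t{pairs:,}")
```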
Depth vs. Sensitivity & Specificity
[Figure: SNV precision, SNV recall, indel precision and indel recall (y‐axis 0% to 120%) against x‐axis ticks doubling from 450,000 to 230,400,000]
From this
• Low coverage gives low recall (sensitivity) and an increased number of artefacts
• Indels are harder than SNVs
• Can low coverage be rescued using trios?
• Do all panel designs behave the same way?
• Do all aligners behave the same way?
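Precision and recall against a truth set such as the Genome in a Bottle calls follow the usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch, assuming variants are compared as exact (chromosome, position, ref, alt) tuples and that both call sets are restricted to the same confident regions:

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a call set compared with a truth set."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = [("1", 10, "A", "G"), ("1", 20, "C", "T"), ("2", 5, "G", "A"), ("2", 9, "T", "C")]
    called = [("1", 10, "A", "G"), ("1", 20, "C", "T"), ("3", 7, "A", "C")]
    print(precision_recall(called, truth))
```

Exact tuple matching understates agreement for indels unless both sets use the same representation, which is one reason indels look harder than SNVs.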
Sensitivity vs. Specificity Using Qual score
Sensitivity vs. Specificity Using Depth score
[Figure: GiaB NRC Exome, varied depth. True positive rate against false positive rate (sorted by DP), in panels for SNP, indel 1bp and indel 2+bp; series by instrument (HiSeq, NextSeq) and depth (~400x, ~200x, ~100x, ~50x)]
Between run reproducibility
Variant Annotation
Basic workflow for whole-exome and whole-genome sequencing projects.
Stephan Pabinger et al. Brief Bioinform 2013; bib.bbs086. © The Author 2013. Published by Oxford University Press.
Annotating and Reporting
• Database is populated after automated annotation (Annovar & the VEP)
• Data curation can take place within the database: need to agree fields, nomenclature and values
• Lab reports can be populated automatically from the database
• The data can also be shared at a number of levels
Choice of transcripts and software has a large effect on variant annotation
DJ McCarthy et al. Genome Medicine 2014, 6:26
Conclusions
Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole‐genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
Only 44% agreement in annotations for putative loss‐of‐function variants when using the REFSEQ and ENSEMBL transcript sets as the basis for annotation with ANNOVAR. The rate of matching annotations for loss‐of‐function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. For ANNOVAR and VEP using ENSEMBL transcripts, matching annotations were seen for only 65% of loss‐of‐function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy.
Factors influencing success of clinical genome sequencing ..
Jenny C Taylor et al. Nature Genetics, published online 18 May 2015
• Evaluated the potential value of whole‐genome sequencing in mainstream genetic diagnosis.
• Identified multiple strategies in analysis (joint variant calling, filtering of variants against local databases and the use of multiple annotation algorithms) that improve the reliability of the variants called and improve sensitivity and specificity in detecting candidate disease‐causing variants.
• Demonstrated the value of genome sequencing for routine clinical diagnosis.
• Highlighted many outstanding challenges.
Summary of findings
• In 33 of 156 cases (21%), at least one variant with a high level of evidence of pathogenicity was identified
• 2.5% of cases resulted from variants in genes with false negative test results in standard clinical genetics testing.
Factors influencing success of clinical genome sequencing (continued)
• The burden of ‘variants of unknown significance’
– Tier 1: HGMD known genes for the disorder
– Tier 2: HGMD known genes for related disorders or direct interactions as per Mammalian Protein‐Protein Interaction Database, MIPS
– Tier 3: plus genes in relevant biological pathways from HGMD and the Gene Ontology (GO) databases
Risks in curation
• Combined use of gene candidacy, predicted functional consequence, variant frequency and evolutionary conservation, although these are widely used filters within pipelines for identifying pathogenic candidate variants, will not by themselves differentiate between pathogenic and non‐pathogenic variants. Naive application of such rules will lead to a high rate of false positive diagnosis, even in rare disorders with mutations occurring in limited numbers of known genes.
• Moreover, focusing only on candidate genes will lead to a high false negative rate.
• Additional evidence, such as functional data, familial transmission, de novo status and screening of other patients, was needed to establish pathogenicity.
Confirmation of NGS Results
• Orthogonal method
– Sanger
– Genotyping
– Additional NGS method
– Re‐analyse same data in a different way (e.g. non‐alignment)
fastq, fasta & grouped
FASTQ
@MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
+
AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
FASTA
>MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
Grouped
527 AGTGTATCCATTTTCTTCTCTCTGACCTTTGGCCCCCTACATCGACCATTCTGCAAGGTTA
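The grouped format shown above pairs a read count with a unique sequence. A minimal sketch of how identical reads might be collapsed from FASTQ into that form is given below; it assumes strict 4‐line FASTQ records with no quality filtering, and it is an illustration rather than the actual Amplivar implementation. Both function names are hypothetical.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def group_reads(sequences):
    """Collapse identical reads into (count, sequence) rows, most frequent first."""
    counts = Counter(sequences)
    return [(count, seq) for seq, count in counts.most_common()]


if __name__ == "__main__":
    fastq = [
        "@read1", "ACGTACGT", "+", "IIIIIIII",
        "@read2", "ACGTACGT", "+", "IIIIIIII",
        "@read3", "ACGTTCGT", "+", "IIIIIIII",
    ]
    for count, seq in group_reads(fastq_sequences(fastq)):
        print(count, seq)
```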
Grouped read testing
• Targeted
• Sensitive
• Quantitative
• Low computing overhead
• Genotypes
• Estimates error rate
• BLAST/BLAT mutation scanning option
Amplivar vs. Alignment
Amplivar                                            Alignment (e.g. BWA)
Groups reads                                        Uses individual reads
Designed for unknown sequence with known flanks     Designed for randomly sheared fragments
Works with FASTA after filtering                    Works with FASTQ
Matches against target list                         Aligns against whole genome
WG Alignment is an optional late stage              Alignment is a required early stage
Usual suspects file
4 column tab separated text file with Unix line endings
• Column 1: RefSeq identifier
• Column 2: cDNA HGVS nomenclature
• Column 3: codon change HGVS nomenclature
• Column 4: sequence to match
• Usual suspects files available for TruSeq cancer panel and for PCRbrary
RefSeq              cDNA description       codon change    sequence
BRAF_NM_004333.4    c.1798G                V600            CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1798_1799delinsAA    V600K           CTCCATCGAGATTTCTTTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1798_1799delinsAG    V600R           CTCCATCGAGATTTCCTTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1798G>A              V600K           CTCCATCGAGATTTCATTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1799T                V600            CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1799_1800delinsAA    V600E           CTCCATCGAGATTTTTCTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1799_1800delinsAT    V600D           CTCCATCGAGATTTTACTGTAGCTAGACCAAA
BRAF_NM_004333.4    c.1799T>A              V600E           CTCCATCGAGATTTCTCTGTAGCTAGACCAAA
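Matching grouped reads against a usual suspects file can be sketched as a simple substring search. The code below is an illustration under assumptions, not Amplivar itself: it assumes exact matching on the read strand as given, skips a header row if the first field is "RefSeq", and uses hypothetical function names.

```python
import csv
import io


def load_usual_suspects(handle):
    """Read the 4-column tab-separated file, skipping a header row if present."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) != 4 or record[0] == "RefSeq":
            continue
        rows.append(tuple(record))
    return rows


def count_matches(suspects, grouped_reads):
    """Sum the counts of grouped reads that contain each target sequence."""
    totals = {}
    for refseq, cdna, codon, target in suspects:
        hits = sum(count for count, seq in grouped_reads if target in seq)
        totals[(refseq, cdna, codon)] = hits
    return totals


if __name__ == "__main__":
    text = (
        "BRAF_NM_004333.4\tc.1799T\tV600\tCTCCATCGAGATTTCACTGTAGCTAGACCAAA\n"
        "BRAF_NM_004333.4\tc.1799T>A\tV600E\tCTCCATCGAGATTTCTCTGTAGCTAGACCAAA\n"
    )
    suspects = load_usual_suspects(io.StringIO(text))
    # Made-up grouped reads: (count, sequence).
    reads = [
        (90, "GG" + "CTCCATCGAGATTTCACTGTAGCTAGACCAAA" + "TT"),
        (10, "GG" + "CTCCATCGAGATTTCTCTGTAGCTAGACCAAA" + "TT"),
    ]
    for key, hits in count_matches(suspects, reads).items():
        print(key, hits)
```

Because the search only touches a short target list, the computing overhead stays low compared with whole‐genome alignment.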
EGFR p.L858R
EGFR p.T790M
Confirmation by PCR
Melbourne Genomics CRC Case 010310401
MGHA Pipeline reported “mosaic” TP53 variant
chr17:g.7577534C>G TP53 NM_000546.5 c.351G>C pR117S
Alignment‐free matching of test and control from q‐filtered FASTQ files
allele                              (unlabelled)  control (total)  test (total)  control F  test F  control R  test R
TP53 wt                             60            24               11            12         7       12         4
TP53 wt                             57            28               14            16         9       12         5
TP53 wt                             54            30               14            18         9       12         5
TP53 wt                             51            30               14            18         9       12         5
TP53 wt                             48            30               14            18         9       12         5
TP53 wt                             45            32               16            19         10      13         6
TP53 wt                             42            34               16            21         10      13         6
TP53 wt                             39            37               19            22         10      15         9
TP53 wt                             36            37               20            22         10      15         10
TP53 wt                             33            39               21            24         10      15         11
TP53 wt                             30            39               22            24         11      15         11
TP53 wt                             27            39               22            24         11      15         11
TP53 wt                             24            39               22            24         11      15         11
TP53 wt                             21            42               23            26         12      16         11
TP53 wt                             18            43               24            27         13      16         11
TP53 NM_000546.5 c.351G>C pR117S    60            0                3             0          1       0          2
TP53 NM_000546.5 c.351G>C pR117S    57            0                4             0          1       0          3
TP53 NM_000546.5 c.351G>C pR117S    54            0                4             0          1       0          3
TP53 NM_000546.5 c.351G>C pR117S    51            0                4             0          1       0          3
TP53 NM_000546.5 c.351G>C pR117S    48            0                4             0          1       0          3
TP53 NM_000546.5 c.351G>C pR117S    45            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    42            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    39            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    36            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    33            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    30            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    27            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    24            0                5             0          2       0          3
TP53 NM_000546.5 c.351G>C pR117S    21            0                6             0          2       0          4
TP53 NM_000546.5 c.351G>C pR117S    18            0                6             0          2       0          4
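The counts in this table can be turned into a variant allele fraction, variant reads divided by variant plus wild‐type reads, overall and per strand. The sketch below uses the last row of each allele (first numeric column 18) as input; the helper name is hypothetical and the calculation is an illustration, not part of the reported pipeline.

```python
def allele_fraction(variant_reads, wildtype_reads):
    """Variant reads as a fraction of variant plus wild-type reads."""
    total = variant_reads + wildtype_reads
    return variant_reads / total if total else 0.0


if __name__ == "__main__":
    # Counts taken from the rows whose first numeric column is 18.
    print("test, total:   ", allele_fraction(6, 24))
    print("test, forward: ", allele_fraction(2, 13))
    print("test, reverse: ", allele_fraction(4, 11))
    print("control, total:", allele_fraction(0, 43))
```

Seeing the variant on both strands in the test sample and not at all in the control is what supports a genuine low‐level ("mosaic") variant rather than a strand‐specific artefact.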
[Figure: WT and NM_000546.5 c.351G>C alleles shown for Test and for Control]
Confirmation by amplicon resequencing
Establishes that the variant is present in the DNA sample. Patient has been re‐bled for further confirmation
From a NATA accredited Ovarian Cancer NGS pipeline developed by Olga Kondrashova
3: Data Curation and Sharing
The need to share data to improve annotation and prediction
To improve our knowledge of the phenotypic effects of DNA variation will require a massive effort in data sharing
Heidi L. Rehm, PhD, FACMG Director, Laboratory for Molecular Medicine, Partners Personalized Medicine Associate Professor of Pathology, Brigham & Women’s Hospital and Harvard Medical School
HCM Gene Mutations – 3000 cases tested
>500 clinically significant mutations identified
66% of clinically significant mutations are seen in only one family
[Figure: number of variants (y‐axis, 0 to 350) against number of probands (x‐axis, 1 to 32); labels: MYBPC3 E258K; MYBPC3 and MYH7 variants R502W, W792fs, R663H]
Curation protocol
Table 1: Classification of PTV and Null variants
Columns: Cat.; Yes; No; Not available; Not applicable
Categories: PVS, PS, BA, BS
Evidence (NOTE: Each evidence block as signified by continuous color can only have ONE yes checked)
• Affects canonical splice site (within 2bp of exon‐intron boundary) WITHOUT functional data supporting splice effect on protein or RNA level
• Protein truncating (PTV) or Null Variant in a gene where LOF is known mechanism of pathogenicity
‐ E.g. frameshift, initiator codon disruption, premature stop codon (nonsense), canonical splice site mutation
‐ IF Canonical splice site (intronic 1‐2bp upstream/downstream of exon‐intron boundary), functional data required on protein or RNA level
‐ IF initiator codon (ATG) mutation, no known alternative transcription start sites allowing functional protein production AND next available initiator codon (if any) results in frameshift
‐ Affects critical functional domains
‐ Gene not known to have restricted spectrum of non‐PTV mutations as only known mechanism of pathogenicity (e.g. activating or protein aggregation mutations, PTPN1)
‐ Novel Stop codon is not in the last exon or the last 50bp of the second to last exon
‐ Exon harboring the variant is not known to be subject to alternative splicing/missing in alternative RefSeq transcripts
‐ Loss of single or multiple exons affecting functional domains where exon(s) are not known to be subject to alternative splicing/missing in alternative transcripts
• MAF >=0.05 in at least one sufficiently large subpopulation
• MAF >=0.03 and <0.05 in at least one sufficiently large subpopulation OR MAF is out of keeping with known disease frequency. CAUTION: if similar truncating mutations are common in the same gene (~>0.005, e.g. TITAN, APC), to be used as supporting evidence (BP) only.
Curation score
Count Evidence Levels (Categories):
Category    Count
PVS         0
PS          0
PM          0
PP          0
BA          0
BS          0
BP          0
Final curator classification:
Curator comments (i.e. reasons for changing evidence levels):
Curator Signature and Date: _______________________
NOTE: The classification formulas below represent a MINIMUM requirement for a given class. The curator has to decide if
5 ‐ Pathogenic:
– 2 PVS
– 1 PVS and >= 1 PS
– 1 PVS and >= 2 PM
– 1 PVS and 1 PM and >= 2 PP
– >= 2 PS
– 1 PS and >= 3 PM
– 1 PS and 2 PM and >= 2 PP
– 1 PS and 1 PM and >= 4 PP
4 ‐ Likely Pathogenic:
– 1 PVS and >= 1 PM
– 1 PS and 1‐2 PM
– 1 PS and >= 2 PP
– >= 3 PM
– 2 PM and >= 2 PP
– 1 PM and >= 4 PP
3a ‐ VUS (potentially pathogenic):
– Not unambiguously classifiable with predominantly pathogenic evidence
3b ‐ VUS:
– Not unambiguously classifiable
3c ‐ VUS (potentially benign):
– Not unambiguously classifiable with predominantly benign evidence
2 ‐ Likely Benign:
– 1 BS + >= 1 BP
– >= 2 BP
1 ‐ Benign:
– >= 1 BA
– >= 2 BS
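The minimum rules on the score sheet can be written as a small function over the seven evidence counts. This is a sketch only: the rules are transcribed from the sheet, but how to split unclassifiable variants into 3a, 3b and 3c is an assumption made here (whichever side has more evidence items is treated as predominant), and the sheet itself leaves the final call to the curator.

```python
def classify(pvs=0, ps=0, pm=0, pp=0, ba=0, bs=0, bp=0):
    """Apply the minimum classification rules to counts of evidence categories."""
    pathogenic = (
        pvs >= 2
        or (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp >= 2)))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps == 1 and (1 <= pm <= 2 or pp >= 2))
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp >= 1) or bp >= 2

    meets_pathogenic = pathogenic or likely_pathogenic
    meets_benign = benign or likely_benign
    if meets_pathogenic and not meets_benign:
        return "5 - Pathogenic" if pathogenic else "4 - Likely Pathogenic"
    if meets_benign and not meets_pathogenic:
        return "1 - Benign" if benign else "2 - Likely Benign"

    # ASSUMPTION: "predominantly" is decided by the number of evidence items.
    pathogenic_items = pvs + ps + pm + pp
    benign_items = ba + bs + bp
    if pathogenic_items > benign_items:
        return "3a - VUS (potentially pathogenic)"
    if benign_items > pathogenic_items:
        return "3c - VUS (potentially benign)"
    return "3b - VUS"


if __name__ == "__main__":
    print(classify(pvs=1, ps=1))
    print(classify(pm=3))
    print(classify(bs=1, bp=1))
```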
Curation concordance between two labs
Classification by curator
Sample  Variant  Curator 1  Curator 2  Curator 3  Concordance        Comments
Ov03    BRCA1    5          4          4          Minor discordance  VCGS did not use: Same variant type in affected codon/region has previously been shown to be pathogenic
        ERBB2    3b         ‐          ‐          ‐
Ov09    BRCA1    5          5          5          Concordance        VCGS: use expert pathogenicity & literature pathogenicity together
        RAD50    2          2          3b         Minor discordance  UniMelb: used Grantham with no homologue in cat, UniMelb used additional BP, not defined in the original table
Ne01    PINK1    4          5          5          Minor discordance  VCGS: ‐ used functional data, significant segregation & previous description of pathogenicity
        RASA1    3          3b         3b         Concordance
        SPG11    3          3c         3c         Concordance
        SPG11    3          3c         3c         Concordance        VCGS: Used Grantham and conservation, when no homologue in cat & score 98
        NPC1     3          3b         2          Minor discordance  VCGS ‐3 : ignored PM in classifying 2
Ca19    COL3A1   ‐          3b         3b         Concordance        VCGS did not use Grantham score
        NEBL     3b         3a         3a         Concordance
        KCNE3    3b         2          2          Minor discordance  VCGS: Population database observed in recessive form mutations in KCNE3 causes autosomal dominant disease
        DCP1B    ‐          ‐          ‐          ‐                  Common ins
        ABCC9    3b         3b         3b         Concordance
        KCNE2    3          2          2          Minor discordance  Population database observed in recessive form mutations in KCNE2 causes autosomal dominant disease
Ne03    A2M      3          3a         3a         Concordance        UniMelb ‐ used Grantham, when no homologues
        SPG11    3          3          3a         Concordance        UniMelb ‐ used Grantham, when no homologues
        NPC1     3          3          3b         Concordance        UniMelb ‐ Grantham and conservation did not match up
Ne22    PSEN1    4          4          3a         Minor discordance  VCGS did not use ExAC ‐ not present, but covered <0.0002
Ov27    BRCA1    5          5          5          Concordance        VCGS: use PVS, when one refseq transcript does not have exon with mutation
        FANCA    3b         3c         2          Minor discordance  VCGS did not use previous description of pathogenicity
Models for Data sharing
• Retain within service database and patient records (lost benefits of data sharing)
• Send direct to ClinVar (cuts out any national capacity/resource)
• Publish (picked up by HGMD)
• Share aggregated data within a network, contributing to a national resource
Standards
• FASTQ Q scores (Sanger or Illumina)
• Uncalled bases (*, ‐ or N)
• Genes (Ensembl or RefSeq)
• Reference Sequence (Build, patch)
• Chromosome numbering (chr, MT, X, Y)
• Mutation nomenclature (CGAR, VCF, HGVS)
• Left or right gap alignment
• Strand alignment
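Two of these standards issues are easy to show concretely. FASTQ quality characters decode to Phred scores by subtracting an ASCII offset, 33 in the Sanger convention and 64 in Illumina pipelines 1.3 to 1.7, and chromosome names differ between "chr"-prefixed and bare styles. The sketch below is an illustration with hypothetical helper names; the choice of the bare style with "MT" as the target convention is an assumption.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset=33 is the Sanger convention; offset=64 was used by
    Illumina pipelines 1.3 to 1.7.
    """
    return [ord(char) - offset for char in quality]


def normalise_chromosome(name):
    """Map 'chr1', 'chrX', 'chrM' style names to '1', 'X', 'MT'."""
    if name.lower().startswith("chr"):
        name = name[3:]
    return "MT" if name in ("M", "MT") else name


if __name__ == "__main__":
    print(decode_quality("AAAAADAFFFFFGGG"))
    print([normalise_chromosome(c) for c in ("chr17", "chrM", "X")])
```

Decoding a file with the wrong offset shifts every score by 31, which is why the encoding has to be recorded alongside shared data.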
Human Variome Project
Global Alliance for Genomics & Health
GA4GH Clinical Products 1 of 2
GA4GH Clinical Products 2 of 2
Interpretative Gap
Options for data sharing
• Retain locally
– Secure
– QC role
– No added value to the healthcare system
– Overseas curation of national data
• Upload aggregates to International Resource e.g. ClinVar
– Simple
– Lack of clinical detail or confidentiality
• Professional Data Sharing Network
– Secure access
– Share phenotypic data and medical records
– Compliant with national law
– National resource
Constraints on data sharing
• Patient confidentiality
– Consent & secure clinical network
• Standardized nomenclature not always available
– HGVS
• Genome builds & reference sequences change
– Build standard tools e.g. LRGs
Challenges for the future
• Migration to build 38
– Alignment
– Annotation
– Quality standards
• Nomenclature standardisation
– HGVS: mutalyzer
– VCFlib: vcfallelicprimitives
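One reason nomenclature needs standardising is that the same indel can be written at several positions in a repeat. The sketch below shows a generic left‐align and trim procedure for a single REF/ALT pair against a reference string. It is an illustration of the general idea only and is not the behaviour of mutalyzer or vcfallelicprimitives; the function name and the toy reference are hypothetical.

```python
def normalise_variant(pos, ref, alt, reference):
    """Left-align and trim a variant; pos is 1-based on `reference`."""
    if ref == alt:
        raise ValueError("ref and alt are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    toy_reference = "GGGCACACACAGGG"
    # The same CA deletion written at the right-hand end of the repeat.
    print(normalise_variant(8, "CAC", "C", toy_reference))
```

After this step, two laboratories describing the same deletion inside the CA repeat arrive at the same position and alleles, so their records can be matched.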
www.ncbi.nlm.nih.gov/clinvar
NIH NCBI ClinVar
Data sharing using open source solution
Capturing the Variants in a Database
Data quality classes based on degree of curation
Differentiate between three classes of data:
The Clinically Reported data label would denote the class of data that the HVP Australian Node was originally designed to collect and share: data that has been generated in a NATA accredited Australian diagnostic laboratory and is able to be included in a clinical report.
Unreported Clinical quality data would denote data that has been generated in a NATA accredited diagnostic laboratory, but is not capable of being included in a clinical report. This class would consist primarily of next‐generation sequencing (NGS) type data.
Unaccredited data would be used to denote data that has been generated by an Australian laboratory that has not been NATA accredited.
A new filtering option would be made available to allow users to view only data of a certain class
The FinDis Model of a Distributed Network of Databases
Standards for Accreditation of DNA Sequence Variation Databases
Quality Use of Pathology Program (QUPP), a national project for the Development of Standards for Accreditation of DNA Sequence Variation Databases, has been jointly initiated by the Royal College of Pathologists of Australasia (RCPA) and the Human Variome Project (HVP).
Background
• There is a rapidly increasing volume, spectrum, and complexity of genetic tests emerging within diagnostic pathology laboratories. In particular, high‐throughput sequencing methods such as targeted panel, exome (WES), and whole genome sequencing (WGS) are producing an increasing quantity of genetic data requiring analysis and interpretation, forming a substantial proportion of the workload.
• Currently, there is a plethora of online mutation databases to refer to; however, there is a distinct lack of databases that meet the stringent accuracy and reproducibility requirements of the clinical diagnostic environment. Additionally, the current databases are “fractured”, with varied access to and sharing of the data within them, and variable quality due to errors and inaccurate data posting, all of which is a clear risk to the quality of patient care. With more widespread, secure sharing of variants and associated phenotypes, the value of cumulative variant information will accelerate the delivery of accurate, actionable, and efficient clinical reports.
• There are currently no standards or equivalent mechanisms for accreditation of databases to ensure the accuracy and quality of uploaded data into any central repository to meet the needs of the clinical diagnostics environment.
Acknowledgments
• Genomic Medicine & CTP, University of Melbourne: Arthur Hsu, Olga Kondrashova, Sebastian Lunke, Clare Love, Renate Marquis‐Nicholson, Kym Pham, Paul Waring
• MCRI & VCGS: Simon Sadedin, Alicia Oshlack, Damien Bruno, Andrew Sinclair, Kathy North
• Human Variome Project: Tim Smith, Alan Lo, Chris Arnold, Dick Cotton
• Melbourne Genomics Health Alliance