

Page 1: Applications of NGS, the Human Variome Project and Data ...bioinformatics.org.au/ws15/wp-content/uploads/ws14/sites/9/2012/1… · • most frequent k‐mer

Clinical Applications of NGS, the Human Variome Project and Data Sharing

Graham Taylor


RGH Cotton

“Dick Cotton was a visionary leader in thinking about human genetic variation and its role in global health. His legacy will live on in the many people he inspired, and the work of the Human Variome Project and other allied activities such as the Global Alliance for Genomics and Health.”

David Altshuler

Context and Topics

1. Preamble: state of play with “next” generation sequencing
2. Clinical utility and limitations of NGS
   – How can we assess genome‐scale sequence quality?
3. Data Curation and Sharing

1: State of play with “next” generation sequencing 


Cost and performance

[Figures: cost per base; Illumina share price]

“Now is the winter of our discount tests” –Richard III


An Ultimate Goal for Sequence analysis?

For sequencing
– Chromosome‐length reads
– Perfect base calling accuracy
– Each molecule is read
– Highly parallel
– Rapid (minutes, not hours or days)

For analysis
– De novo assembly
– Well curated reference resources
– Data integrated with other biological and medical resources
– Rapid (real time output)

How good can sequencing get?

Attribute | Illumina NextSeq | PacBio | E. coli
Length | 300 bases | 15 kilobases | 4,600 kilobases
Speed | 0.2 bases/minute/cluster | >1.3 bases/sec/molecule | >50 bases/sec per fork
Parallelism | 450,000,000 /run | 50,000,000 /run | Forks /cell
Error rate (errors/base) | 1/10,000 | 15/100 | 1/100,000,000
Cost per megabase* | Approx $2.00 | Approx $5.00 | Approx $0

* These costs are highly contestable; consult your finance team
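The error rates in the table can be put on a common scale by converting them to Phred quality scores, Q = −10·log10(p). The sketch below is illustrative only: the function name is an assumption, and the rates are simply those quoted on the slide.

```python
import math


def phred_from_error_rate(p: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Error rates as quoted in the table (errors per base).
error_rates = {
    "Illumina NextSeq": 1 / 10_000,
    "PacBio": 15 / 100,
    "E. coli": 1 / 100_000_000,
}

for name, p in error_rates.items():
    print(f"{name}: Q{phred_from_error_rate(p):.1f}")
```

On this scale the 1/10,000 figure corresponds to Q40, which is the same threshold as the ">99.99% accurate" requirement quoted later for clinical work.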


One‐stop lifetime test?

Attribute | Status
De novo assembly | Not yet
Very long reads | ? Moleculo
Base calling accuracy > 99.99% | Getting there
No instrument or system bias | ?
High coverage (>500‐fold) for mixed samples | Too costly without targeting
Low sample input (single molecule) | Not yet

Will one sequencing technology ever be able to deliver complete genetic analysis?


De novo Assembly (the unfinished genome)

• Huddleston et al. Genome Res. 2014. 24: 688‐696
– Within the human genome, there are >900 annotated genes mapping to large segmental duplications. Such genes are typically missing or misassembled in working draft assemblies of genomes
– The widespread adoption of next‐generation sequencing methods for de novo genome assemblies has complicated the assembly of repetitive sequences and their organization
– The authors resolved regions that are complex in a genome‐wide context but simple in isolation for a fraction of the time and cost of traditional methods using long‐read single molecule, real‐time (SMRT) sequencing and assembly technology
– SMRT sequencing of large‐insert clones can significantly improve sequence assembly within complex repetitive regions of genomes

PacBio

• English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long‐Read Sequencing Technology. PLoS ONE 7(11): e47768

• Loomis et al. (2012) Sequencing the unsequenceable: Expanded CGG‐repeat alleles of the fragile X gene. Genome Research

Reducing assembly complexity of microbial genomes with single‐molecule sequencing

Long, single‐molecule reads are sufficient for the complete assembly of most known microbial genomes. The assemblies presented here have good likelihood and finished‐grade consensus accuracy exceeding 99.9999%.

Koren et al. Genome Biology 2013, 14:R101


2: Clinical utility and limitations of NGS


Clinical Service != Research

Research
• Original
• Surprising
• >80% accurate
• Numerator‐driven: get publications
• Bespoke

Clinical
• Proven
• Predictable
• >99.99% accurate
• Denominator‐driven (cost sensitive)
• Standardized
• Not necessarily boring

Validation and Utility

Analytical validation
– The range of conditions under which the assay will give reproducible and accurate data (1)

Clinical validation
– The ability of a test to accurately and reliably support or refute the diagnosis of a clinically defined disorder

Clinical utility
– The ability of a test to lead to an improved health outcome with respect to the defined purpose of the test (2)

1. Teutsch, S. M., Bradley, L. A., Palomaki, G. E., Haddow, J. E., et al. The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Initiative: methods of the EGAPP Working Group. Genet Med 11, 3‐14 (2009).
2. Burke, W., Zimmern, R. L. & Kroese, M. Defining purpose: a key step in genetic test evaluation. Genet Med 9, 675‐681 (2007).

From sample to report

General Laboratory Process

LIMS managed

NGS

Informatics

Curation

The entire process needs to be subject to quality control and audit. Most diagnostic labs already have that in place for current testing and reporting. Reporting already includes curation, so the extensions needed are primarily around library construction, sequencing, mapping and variant calling. Since the scale of variant curation will increase, more automation will be required.

National Quality Standards

• National Pathology Accreditation Advisory Council (NPAAC)
– “Requirements for the Development and Use of In‐House In Vitro Diagnostic Medical Devices (IVDs)” (Third Edition 2014), NPAAC Tier 3B Document (Print ISBN: 978‐1‐74186‐006‐1; Online ISBN: 978‐1‐74186‐007‐8)
– “Requirements for Medical Pathology Services”, NPAAC Tier 2 Document (Print ISBN: 978‐1‐74241‐913‐8; Online ISBN: 978‐1‐74241‐914‐5)

• Human Genetics Advisory Committee (HGAC), National Health and Medical Research Council (NHMRC): Council version document of 12 June 2014

ACMG (www.acmg.net) > Publications > Laboratory Standards and Guidelines > NGS

ACMG NGS Guidelines

Clinical Drivers

• Referring to the clinical question
• Sensitivity and specificity
• Depth vs. breadth of coverage
– Manageable workflow
– Cost efficiency

The case for disease‐centric analysis

• $1,000 genomes or 1,000 x $1 interesting regions?
• How to validate 3.5 x 10^9 tests?
• Sequencing costs are not limiting
• Quality and accuracy are incomplete
• Perform tests for a (clinical) reason

Analysis

• base‐calling
– analytical validation
• alignment and variant calling
– analytical validation
– clinical performance
• variant annotation, classification and prioritization
– clinical performance
– clinical utility

Base calling

[Figure: six panels, one per position (KRAS c.28T, c.29G, c.30G; BRAF c.1793A, c.1794G, c.1795C), each plotting the substitutions to the three alternative bases and to N]

Basic Read & Mapping Stats

• <2% duplicate pairs
• <2% N bases at any cycle
• most frequent k‐mer <2%
• mean insert size between 340 and 440 bp, with a median absolute deviation of <30 bp
• approximately uniform genomic coverage by GC content
• >99% mapped
• <2.5% read pairs mapping to different chromosomes
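The numeric thresholds above lend themselves to an automated pass/fail check. The following is a minimal sketch, assuming the metrics have already been computed and collected in a dictionary; the key names are hypothetical, and the GC‐uniformity criterion is left out because the slide gives no numeric cut‐off for it.

```python
def failed_read_stat_checks(metrics: dict) -> list:
    """Return the names of the QC checks (thresholds from the slide) that fail.

    The metric key names are illustrative assumptions, not a real tool's output.
    """
    checks = {
        "duplicate_pairs": metrics["pct_duplicate_pairs"] < 2.0,
        "n_bases": metrics["max_pct_n_per_cycle"] < 2.0,
        "top_kmer": metrics["pct_most_frequent_kmer"] < 2.0,
        "insert_size_mean": 340 <= metrics["mean_insert_size"] <= 440,
        "insert_size_mad": metrics["insert_size_mad"] < 30,
        "mapped": metrics["pct_mapped"] > 99.0,
        "interchromosomal_pairs": metrics["pct_pairs_diff_chrom"] < 2.5,
    }
    return [name for name, ok in checks.items() if not ok]


example = {
    "pct_duplicate_pairs": 1.2,
    "max_pct_n_per_cycle": 0.1,
    "pct_most_frequent_kmer": 0.4,
    "mean_insert_size": 390,
    "insert_size_mad": 22,
    "pct_mapped": 98.5,  # below the >99% threshold
    "pct_pairs_diff_chrom": 1.0,
}
print(failed_read_stat_checks(example))
```

Returning the list of failed checks, rather than a single boolean, keeps the result auditable, which fits the earlier point that the whole process needs quality control and audit.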


Alignment Quality Markers

[Figure panels: Coverage; Coefficient of variation; On target]
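Two of these markers are simple ratios. As a sketch, assuming per‐target mean depths and read counts are already available (the function names are illustrative, and the population standard deviation is one of several reasonable choices):

```python
from statistics import mean, pstdev


def coverage_cv(depths):
    """Coefficient of variation of per-target depth: standard deviation / mean."""
    mu = mean(depths)
    if mu == 0:
        raise ValueError("mean depth is zero")
    return pstdev(depths) / mu


def on_target_fraction(on_target_reads, mapped_reads):
    """Fraction of mapped reads that fall within the targeted regions."""
    if mapped_reads == 0:
        raise ValueError("no mapped reads")
    return on_target_reads / mapped_reads


print(coverage_cv([50, 150]))        # uneven coverage gives a higher CV
print(on_target_fraction(80, 100))
```

A lower coefficient of variation indicates more even coverage across targets.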


Variant Calling


Low concordance of multiple variant‐calling pipelines

O’Rawe et al. Genome Medicine 2013, 5:28

SNV concordance: 57.4%
Indel concordance: 26.8%
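One way to compute a concordance figure across several pipelines is the number of variants called by every pipeline divided by the number called by any pipeline. The sketch below assumes that definition, which may differ in detail from the one used in the paper; variants are represented as hypothetical (chromosome, position, ref, alt) tuples.

```python
def concordance(call_sets):
    """Variants shared by all call sets, as a fraction of variants in any call set."""
    sets = [set(calls) for calls in call_sets]
    if not sets:
        return 0.0
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Toy call sets from three hypothetical pipelines.
v1 = ("17", 7577534, "C", "G")
v2 = ("7", 140453136, "A", "T")
v3 = ("12", 25398284, "C", "T")
v4 = ("1", 115258747, "C", "T")

pipeline_a = [v1, v2, v3]
pipeline_b = [v1, v2]
pipeline_c = [v1, v4]

print(concordance([pipeline_a, pipeline_b, pipeline_c]))
```

This treats variants as identical only when their representation matches exactly, so the result is sensitive to normalization choices such as left or right alignment of indels.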


How many variants per exome?

SNP count | Study
20,000 | Choi et al. PNAS 2009
142,000 | Mullikin NIH, unpublished 2010
50,000 | Clark et al. Nature Biotechnology 2011
125,000 | Smith et al. Genome Biology 2011
100,000 | Johnston & Biesecker Human Molecular Genetics 2013
200,000 to 400,000 | Yang et al. N Engl J Med 2013

• 20‐fold range
• Exome designs vary
• Likely to be higher variant count in African populations as the reference sequence is non‐African

The Genome in a Bottle Standard


NIST GIAB – Pedigree analysis

[Pedigree figure]
Grandparents: 12889, 12890, 12891, 12892
Parents: 12877, 12878
Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893

• All 17 members sequenced to at least 50x depth (PCR‐free protocol)
• Variants are called across the pedigree using different software and technology
• Inheritance information provides high‐confidence, direct validation of variant calls

Analysis of SNPs in the parents and 11 children
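The idea of using inheritance to validate calls can be illustrated with a Mendelian consistency check. This is a simplified sketch, assuming autosomal, diploid, unphased genotypes given as pairs of alleles and ignoring de novo mutation; the function names and sample genotypes are illustrative.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one allele of each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def inconsistent_children(children, mother, father):
    """Return the IDs of children whose genotype violates Mendelian inheritance."""
    return [
        child_id
        for child_id, genotype in children.items()
        if not is_mendelian_consistent(genotype, mother, father)
    ]


# Hypothetical genotypes at a single site.
mother = ("A", "A")
father = ("G", "G")
children = {
    "child_1": ("A", "G"),  # consistent
    "child_2": ("G", "G"),  # inconsistent: the mother carries no G allele
}
print(inconsistent_children(children, mother, father))
```

With many children in one sibship, a genuine variant should segregate consistently, so a site that fails in several children is more likely to be a calling artefact.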


NIST Human Genome RMs in the pipeline

• All 10 ug samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes also available from Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet planned as NIST RM

NA12878 Reads & Bases

sample | read pairs | bases | GB | note
r1 | 231,353,122 | 41,970,793,143 | 41.97 | 1xHiSeq/
r2 | 115,676,561 | 20,985,304,713 | 20.99 | NextSeq Run
r4 | 57,838,280 | 10,492,374,225 | 10.49 |
r8 | 28,919,140 | 5,246,032,165 | 5.25 | 1xMiSeq Run
r16 | 14,459,570 | 2,623,084,647 | 2.62 |
r32 | 7,229,785 | 1,311,465,395 | 1.31 |
r64 | 3,614,892 | 655,665,469 | 0.66 |
r128 | 1,807,446 | 327,792,763 | 0.33 |
r256 | 903,723 | 163,865,510 | 0.16 |
r512 | 451,861 | 81,930,508 | 0.08 |
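The read‐pair counts in this table are consistent with dividing the full data set by successive powers of two and rounding down. How the subsets were actually drawn is not stated, so the sketch below only reproduces the arithmetic; the function name is illustrative.

```python
FULL_READ_PAIRS = 231_353_122  # sample r1 in the table


def downsampled_pairs(full_pairs: int, factor: int) -> int:
    """Number of read pairs kept when retaining 1/factor of the data (rounded down)."""
    return full_pairs // factor


for k in range(10):
    factor = 2 ** k
    pairs = downsampled_pairs(FULL_READ_PAIRS, factor)
    print(f"r{factor}: {pairs:,} read pairs")
```

The sample names (r1, r2, ... r512) therefore appear to encode the downsampling factor.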


Depth vs. Sensitivity & Specificity

[Figure: SNV precision, SNV recall, indel precision and indel recall (y‐axis, 0% to 120%) plotted against number of read pairs (x‐axis, 450,000 to 230,400,000)]

From this

• Low coverage gives low recall (sensitivity) and increased number of artefacts
• Indels are harder than SNVs
• Can low coverage be rescued using trios?
• Do all panel designs behave the same way?
• Do all aligners behave the same way?
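Precision and recall against a truth set such as the Genome in a Bottle calls can be computed as follows. This is a minimal sketch that assumes calls and truth variants are directly comparable hashable records (for example, tuples of chromosome, position, ref and alt); real benchmarking also has to reconcile different representations of the same variant.

```python
def precision_recall(called, truth):
    """Return (precision, recall) of a call set against a truth set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and true
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # true but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


truth_set = {"v1", "v2", "v3", "v4"}
call_set = {"v1", "v2", "artefact"}
print(precision_recall(call_set, truth_set))
```

In this toy case two of three calls are true (precision 2/3) and two of four true variants are found (recall 0.5), the pattern the slide describes for low coverage: missed variants plus artefacts.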


Sensitivity vs. Specificity Using Qual score


Sensitivity vs. Specificity Using Depth score

[Figure: GiaB NRC Exome, varied depth. True positive rate (y‐axis, 0.00 to 1.00) against false positive rate, sorted by DP (x‐axis, 0.0 to 0.6), in three panels (SNP, indel 1bp, indel 2+bp); legend: instrument (HiSeq, NextSeq) and depth (~400x, ~200x, ~100x, ~50x)]

Between run reproducibility


Between run reproducibility


Variant Annotation


Basic workflow for whole-exome and whole-genome sequencing projects.

Stephan Pabinger et al. Brief Bioinform 2013; bib.bbs086

© The Author 2013. Published by Oxford University Press.

Annotating and Reporting

• Database is populated after automated annotation (Annovar & the VEP)

• Data curation can take place within the database: need to agree fields, nomenclature and values

• Lab reports can be populated automatically from the database

• The data can also be shared at a number of levels


DJ McCarthy et al. Genome Medicine 2014, 6:26

Choice of transcripts and software has a large effect on variant annotation

Conclusions

Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole‐genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.

Only 44% agreement in annotations for putative loss‐of‐function variants when using the REFSEQ and ENSEMBL transcript sets as the basis for annotation with ANNOVAR. The rate of matching annotations for loss‐of‐function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. For ANNOVAR and VEP using ENSEMBL transcripts, matching annotations were seen for only 65% of loss‐of‐function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy.

Factors influencing success of clinical genome sequencing

Jenny C Taylor et al. Nature Genetics, published online 18 May 2015

• Evaluated the potential value of whole‐genome sequencing in mainstream genetic diagnosis.

• Identified multiple strategies in analysis (joint variant calling, filtering of variants against local databases and the use of multiple annotation algorithms) that improve the reliability of the variants called and improve sensitivity and specificity in detecting candidate disease‐causing variants.

• Demonstrated the value of genome sequencing for routine clinical diagnosis.

• Highlighted many outstanding challenges.

Summary of findings

• In 33 of 156 cases (21%), at least one variant with a high level of evidence of pathogenicity was identified

• 2.5% of cases resulted from variants in genes with false negative test results in standard clinical genetics testing.


Factors influencing success of clinical genome sequencing (continued)

• The burden of ‘variants of unknown significance’
– Tier 1: HGMD known genes for the disorder
– Tier 2: HGMD known genes for related disorders or direct interactions as per Mammalian Protein‐Protein Interaction Database, MIPS
– Tier 3: plus genes in relevant biological pathways from HGMD and the Gene Ontology (GO) databases

Risks in curation

• Gene candidacy, predicted functional consequence, variant frequency and evolutionary conservation are widely used filters within pipelines for identifying pathogenic candidate variants, but their combined use will not by itself differentiate between pathogenic and non‐pathogenic variants. Naive application of such rules will lead to a high rate of false positive diagnosis, even in rare disorders with mutations occurring in limited numbers of known genes.

• Moreover, focusing only on candidate genes will lead to a high false negative rate.

• Additional evidence, such as functional data, familial transmission, de novo status and screening of other patients, was needed to establish pathogenicity.

Confirmation of NGS Results

• Orthogonal method
– Sanger
– Genotyping
– Additional NGS method
– Re‐analyse same data in a different way (e.g. non‐alignment)

FASTQ, FASTA & grouped

FASTA

>MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC

FASTQ

@MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
+
AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Grouped

527 AGTGTATCCATTTTCTTCTCTCTGACCTTTGGCCCCCTACATCGACCATTCTGCAAGGTTA
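The step from FASTQ to the grouped form can be sketched as collapsing identical read sequences into a count followed by the sequence. This is an illustration only: it assumes simple four‐line FASTQ records, infers the grouped layout from the example on the slide, and is not the implementation of any particular tool.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line of each four-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def group_reads(sequences):
    """Collapse identical reads into (count, sequence) pairs, most frequent first."""
    counts = Counter(sequences)
    return [(count, seq) for seq, count in counts.most_common()]


fastq_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTTT", "+", "IIII",
    "@read3", "ACGT", "+", "IIII",
]
for count, seq in group_reads(fastq_sequences(fastq_lines)):
    print(count, seq)
```

Because quality strings are discarded at this point, any quality filtering has to happen before grouping.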


Grouped read testing

• Targeted
• Sensitive
• Quantitative
• Low computing overhead
• Genotypes
• Estimates error rate
• BLAST/BLAT mutation scanning option

Amplivar vs. Alignment

Amplivar | Alignment (e.g. BWA)
Groups reads | Uses individual reads
Designed for unknown sequence with known flanks | Designed for randomly sheared fragments
Works with FASTA after filtering | Works with FASTQ
Matches against target list | Aligns against whole genome
WG alignment is an optional late stage | Alignment is a required early stage

Usual suspects file

4‐column tab‐separated text file with Unix line endings
• Column 1: RefSeq identifier
• Column 2: cDNA HGVS nomenclature
• Column 3: codon change HGVS nomenclature
• Column 4: sequence to match
• Usual suspects files available for TruSeq cancer panel and for PCRbrary

RefSeq | cDNA description | codon change | sequence
BRAF_NM_004333.4 | c.1798G | V600 | CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1798_1799delinsAA | V600K | CTCCATCGAGATTTCTTTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1798_1799delinsAG | V600R | CTCCATCGAGATTTCCTTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1798G>A | V600K | CTCCATCGAGATTTCATTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1799T | V600 | CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1799_1800delinsAA | V600E | CTCCATCGAGATTTTTCTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1799_1800delinsAT | V600D | CTCCATCGAGATTTTACTGTAGCTAGACCAAA
BRAF_NM_004333.4 | c.1799T>A | V600E | CTCCATCGAGATTTCTCTGTAGCTAGACCAAA
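Matching grouped reads against such a file can be sketched as a substring search. The code below is an assumption‐laden illustration: it assumes the file has a header row, searches only the strand as written (no reverse complement), and uses hypothetical function names; it is not taken from Amplivar itself.

```python
import csv
import io


def load_usual_suspects(text):
    """Parse the four-column, tab-separated file, skipping an assumed header row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # header: RefSeq, cDNA description, codon change, sequence
    return [tuple(row) for row in reader if len(row) == 4]


def count_matches(suspects, grouped_reads):
    """For each entry, sum the counts of grouped reads containing its sequence."""
    totals = {}
    for refseq, cdna, _codon, target in suspects:
        totals[(refseq, cdna)] = sum(
            count for count, seq in grouped_reads if target in seq
        )
    return totals


WT = "CTCCATCGAGATTTCACTGTAGCTAGACCAAA"
V600E = "CTCCATCGAGATTTCTCTGTAGCTAGACCAAA"
suspects_text = "\n".join([
    "\t".join(["RefSeq", "cDNA description", "codon change", "sequence"]),
    "\t".join(["BRAF_NM_004333.4", "c.1799T", "V600", WT]),
    "\t".join(["BRAF_NM_004333.4", "c.1799T>A", "V600E", V600E]),
])
grouped = [(90, "GG" + WT + "TT"), (10, "GG" + V600E + "TT")]  # toy grouped reads

totals = count_matches(load_usual_suspects(suspects_text), grouped)
print(totals)
```

Because each read group is searched once per target, the cost scales with the number of distinct sequences rather than the number of reads.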


EGFR p.L858R


EGFR p.T790M


Confirmation by PCR


Melbourne Genomics CRC Case 010310401

MGHA Pipeline reported “mosaic” TP53 variant 

chr17:g.7577534C>G  TP53 NM_000546.5 c.351G>C pR117S


Alignment‐free matching of test and control from q‐filtered FASTQ files

allele | (no header) | control (total) | test (total) | control F | test F | control R | test R
TP53 wt | 60 | 24 | 11 | 12 | 7 | 12 | 4
TP53 wt | 57 | 28 | 14 | 16 | 9 | 12 | 5
TP53 wt | 54 | 30 | 14 | 18 | 9 | 12 | 5
TP53 wt | 51 | 30 | 14 | 18 | 9 | 12 | 5
TP53 wt | 48 | 30 | 14 | 18 | 9 | 12 | 5
TP53 wt | 45 | 32 | 16 | 19 | 10 | 13 | 6
TP53 wt | 42 | 34 | 16 | 21 | 10 | 13 | 6
TP53 wt | 39 | 37 | 19 | 22 | 10 | 15 | 9
TP53 wt | 36 | 37 | 20 | 22 | 10 | 15 | 10
TP53 wt | 33 | 39 | 21 | 24 | 10 | 15 | 11
TP53 wt | 30 | 39 | 22 | 24 | 11 | 15 | 11
TP53 wt | 27 | 39 | 22 | 24 | 11 | 15 | 11
TP53 wt | 24 | 39 | 22 | 24 | 11 | 15 | 11
TP53 wt | 21 | 42 | 23 | 26 | 12 | 16 | 11
TP53 wt | 18 | 43 | 24 | 27 | 13 | 16 | 11
TP53 NM_000546.5 c.351G>C pR117S | 60 | 0 | 3 | 0 | 1 | 0 | 2
TP53 NM_000546.5 c.351G>C pR117S | 57 | 0 | 4 | 0 | 1 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 54 | 0 | 4 | 0 | 1 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 51 | 0 | 4 | 0 | 1 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 48 | 0 | 4 | 0 | 1 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 45 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 42 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 39 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 36 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 33 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 30 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 27 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 24 | 0 | 5 | 0 | 2 | 0 | 3
TP53 NM_000546.5 c.351G>C pR117S | 21 | 0 | 6 | 0 | 2 | 0 | 4
TP53 NM_000546.5 c.351G>C pR117S | 18 | 0 | 6 | 0 | 2 | 0 | 4
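From counts like these, a variant fraction can be estimated as variant reads divided by variant plus wild‐type reads. The sketch below uses the first row for each allele in the table (the rows whose first numeric value is 60); the function name is illustrative, and with counts this small the estimate is very imprecise.

```python
def variant_fraction(variant_count: int, wildtype_count: int) -> float:
    """Fraction of matching reads that carry the variant allele."""
    total = variant_count + wildtype_count
    return variant_count / total if total else 0.0


# First row for each allele in the table above.
test_fraction = variant_fraction(3, 11)      # test sample: 3 variant, 11 wild-type
control_fraction = variant_fraction(0, 24)   # control: 0 variant, 24 wild-type

# Strand breakdown for the test sample (forward, reverse).
test_forward = variant_fraction(1, 7)
test_reverse = variant_fraction(2, 4)

print(f"test {test_fraction:.2f}, control {control_fraction:.2f}")
print(f"test forward {test_forward:.2f}, test reverse {test_reverse:.2f}")
```

Seeing the variant on both strands in the test sample and on neither strand in the control is the kind of evidence that makes a strand‐specific artefact less likely.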

[Figure: two panels, Test and Control, each showing WT and NM_000546.5 c.351G>C]

Confirmation by amplicon resequencing

Establishes that the variant is present in the DNA sample. Patient has been re‐bled for further confirmation.

From a NATA accredited Ovarian Cancer NGS pipeline developed by Olga Kondrashova

3: Data Curation and Sharing


The need to share data to improve annotation and prediction

Improving our knowledge of the phenotypic effects of DNA variation will require a massive effort in data sharing.

Heidi L. Rehm, PhD, FACMG
Director, Laboratory for Molecular Medicine, Partners Personalized Medicine
Associate Professor of Pathology, Brigham & Women’s Hospital and Harvard Medical School

HCM Gene Mutations – 3000 cases tested

>500 clinically significant mutations identified

66% of clinically significant mutations are seen in only one family

[Figure: number of variants (y‐axis, 0 to 350) against number of probands (x‐axis, 1 to 32); labelled variants include MYBPC3 E258K, and R502W, W792fs and R663H in MYBPC3 and MYH7]

Curation protocol

Table 1: Classification of PTV and Null variants

Columns: Cat. | Yes | No | Not available | Not applicable

Categories: PVS, PS, BA, BS

Evidence (NOTE: Each evidence block as signified by continuous color can only have ONE yes checked)

• Affects canonical splice site (within 2bp of exon‐intron boundary) WITHOUT functional data supporting splice effect on protein or RNA level

• Protein truncating (PTV) or Null Variant in a gene where LOF is known mechanism of pathogenicity
– E.g. frameshift, initiator codon disruption, premature stop codon (nonsense), canonical splice site mutation
– IF canonical splice site (intronic 1‐2bp upstream/downstream of exon‐intron boundary), functional data required on protein or RNA level
– IF initiator codon (ATG) mutation, no known alternative transcription start sites allowing functional protein production AND next available initiator codon (if any) results in frameshift
– Affects critical functional domains
– Gene not known to have restricted spectrum of non‐PTV mutations as only known mechanism of pathogenicity (e.g. activating or protein aggregation mutations, PTPN1)
– Novel stop codon is not in the last exon or the last 50bp of the second to last exon
– Exon harboring the variant is not known to be subject to alternative splicing/missing in alternative RefSeq transcripts
– Loss of single or multiple exons affecting functional domains where exon(s) are not known to be subject to alternative splicing/missing in alternative transcripts

• MAF >=0.05 in at least one sufficiently large subpopulation

• MAF >=0.03 and <0.05 in at least one sufficiently large subpopulation OR MAF is out of keeping with known disease frequency. CAUTION: if similar truncating mutations are common in the same gene (~>0.005, e.g. TITAN, APC), to be used as supporting evidence (BP) only.

Curation score

Count Evidence Levels (Categories):

Category | Count
PVS | 0
PS | 0
PM | 0
PP | 0
BA | 0
BS | 0
BP | 0

Final curator classification:

Curator comments (i.e. reasons for changing evidence levels):

Curator Signature and Date: _______________________

NOTE: The classification formulas below represent a MINIMUM requirement for a given class. The curator has to decide if

5 ‐ Pathogenic:
– 2 PVS
– 1 PVS and >= 1 PS
– 1 PVS and >= 2 PM
– 1 PVS and 1 PM and >= 2 PP
– >= 2 PS
– 1 PS and >= 3 PM
– 1 PS and 2 PM and >= 2 PP
– 1 PS and 1 PM and >= 4 PP

4 ‐ Likely Pathogenic:
– 1 PVS and >= 1 PM
– 1 PS and 1‐2 PM
– 1 PS and >= 2 PP
– >= 3 PM
– 2 PM and >= 2 PP
– 1 PM and >= 4 PP

3a ‐ VUS (potentially pathogenic):
– Not unambiguously classifiable with predominantly pathogenic evidence

3b ‐ VUS:
– Not unambiguously classifiable

3c ‐ VUS (potentially benign):
– Not unambiguously classifiable with predominantly benign evidence

2 ‐ Likely Benign:
– 1 BS + >= 1 BP
– >= 2 BP

1 ‐ Benign:
– >= 1 BA
– >= 2 BS
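The minimum‐requirement formulas on this form can be expressed as a small rule function. The sketch below encodes the combinations as read from the slide, with two assumptions of its own: conflicting pathogenic and benign evidence is returned as VUS, and the 3a/3b/3c sub‐levels are left to the curator. It is an illustration, not the curation software used by the labs.

```python
def classify(counts: dict) -> str:
    """Map counts of evidence categories to a minimum classification."""
    pvs, ps, pm, pp = (counts.get(k, 0) for k in ("PVS", "PS", "PM", "PP"))
    ba, bs, bp = (counts.get(k, 0) for k in ("BA", "BS", "BP"))

    pathogenic = (
        pvs >= 2
        or (pvs == 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp >= 2)))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs == 1 and pm >= 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp >= 1) or bp >= 2

    # Assumption: conflicting evidence is not resolved automatically.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "3 - VUS"
    if pathogenic:
        return "5 - Pathogenic"
    if likely_pathogenic:
        return "4 - Likely Pathogenic"
    if benign:
        return "1 - Benign"
    if likely_benign:
        return "2 - Likely Benign"
    return "3 - VUS"


print(classify({"PVS": 1, "PS": 1}))
print(classify({"PVS": 1, "PM": 1}))
```

Since the form states that the formulas are only a minimum requirement, the output of such a function is a starting point for the curator, not a final classification.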


Curation concordance between two labs

Variant | Classification: Curator 1 | Curator 2 | Curator 3 | Concordance | Comments
Ov03 BRCA1 | 5 | 4 | 4 | Minor discordance | VCGS did not use: Same variant type in affected codon/region has previously been shown to be pathogenic
ERBB2 | 3b | ‐ | ‐ | ‐ |
Ov09 BRCA1 | 5 | 5 | 5 | Concordance | VCGS: use expert pathogenicity & literature pathogenicity together
RAD50 | 2 | 2 | 3b | Minor discordance | UniMelb: used Grantham with no homologue in cat; UniMelb used additional BP, not defined in the original table
Ne01 PINK1 | 4 | 5 | 5 | Minor discordance | VCGS: used functional data, significant segregation & previous description of pathogenicity
RASA1 | 3 | 3b | 3b | Concordance |
SPG11 | 3 | 3c | 3c | Concordance |
SPG11 | 3 | 3c | 3c | Concordance | VCGS: Used Grantham and conservation, when no homologue in cat & score 98
NPC1 | 3 | 3b | 2 | Minor discordance | VCGS ‐ 3: ignored PM in classifying 2
Ca19 COL3A1 | ‐ | 3b | 3b | Concordance | VCGS did not use Grantham score
NEBL | 3b | 3a | 3a | Concordance |
KCNE3 | 3b | 2 | 2 | Minor discordance | VCGS: Population database observed in recessive form; mutations in KCNE3 cause autosomal dominant disease
DCP1B | ‐ | ‐ | ‐ | ‐ | Common ins
ABCC9 | 3b | 3b | 3b | Concordance |
KCNE2 | 3 | 2 | 2 | Minor discordance | Population database observed in recessive form; mutations in KCNE2 cause autosomal dominant disease
Ne03 A2M | 3 | 3a | 3a | Concordance | UniMelb ‐ used Grantham, when no homologues
SPG11 | 3 | 3 | 3a | Concordance | UniMelb ‐ used Grantham, when no homologues
NPC1 | 3 | 3 | 3b | Concordance | UniMelb ‐ Grantham and conservation did not match up
Ne22 PSEN1 | 4 | 4 | 3a | Minor discordance | VCGS did not use ExAC ‐ not present, but covered <0.0002
Ov27 BRCA1 | 5 | 5 | 5 | Concordance | VCGS: use PVS, when one refseq transcript does not have exon with mutation
FANCA | 3b | 3c | 2 | Minor discordance | VCGS did not use previous description of pathogenicity
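The concordance column appears to follow a simple pattern: sub-class letters (3a, 3b, 3c) are ignored, identical numeric classes count as concordance, and a difference of one class counts as minor discordance. The sketch below encodes that reading. The rule is inferred from the rows of the table, not stated on the slide, and the "Major discordance" label for larger differences is hypothetical.

```python
import re


def base_class(label):
    """Return the numeric class of a label such as '3b', or None if missing."""
    match = re.match(r"\d", label.strip())
    return int(match.group()) if match else None


def concordance(labels):
    """Summarise agreement between curators' classifications.

    Assumed rule (inferred from the table): compare numeric classes only,
    skipping curators who gave no classification.
    """
    classes = [c for c in map(base_class, labels) if c is not None]
    if len(classes) < 2:
        return "-"  # nothing to compare
    spread = max(classes) - min(classes)
    if spread == 0:
        return "Concordance"
    if spread == 1:
        return "Minor discordance"
    return "Major discordance"  # hypothetical label, not used on the slide


if __name__ == "__main__":
    print(concordance(["5", "4", "4"]))    # Ov03 BRCA1 row
    print(concordance(["3", "3b", "3b"]))  # RASA1 row
    print(concordance(["-", "3b", "3b"]))  # Ca19 COL3A1 row
```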


Models for Data sharing

• Retain within service database and patient records (loses the benefits of data sharing)

• Send direct to ClinVar (cuts out any national capacity/resource)

• Publish (picked up by HGMD)

• Share aggregated data within a network, contributing to a national resource


Standards

• FASTQ Q scores (Sanger or Illumina)
• Uncalled bases (*, ‐ or N)
• Genes (Ensembl or RefSeq)
• Reference Sequence (Build, patch)
• Chromosome numbering (chr, MT, X, Y)
• Mutation nomenclature (CGAR, VCF, HGVS)
• Left or right gap alignment
• Strand alignment
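Two of these items can be illustrated briefly. FASTQ quality strings encode Phred scores as ASCII characters, with an offset of 33 in the Sanger convention and 64 in older Illumina (1.3 to 1.7) output, and chromosome names differ between resources ("chr1" versus "1", "chrM" versus "MT"). The sketch below decodes a quality string under either offset and maps chromosome names to one convention. The function names and the choice of the prefix-free "MT" style as the target are assumptions made for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    offset=33 is the Sanger convention; offset=64 is older Illumina output.
    """
    return [ord(ch) - offset for ch in quality]


def normalise_chrom(name):
    """Map names such as 'chr1', '1', 'chrM', 'MT' to a prefix-free form."""
    stripped = name[3:] if name.lower().startswith("chr") else name
    stripped = stripped.upper()
    return "MT" if stripped in ("M", "MT") else stripped


if __name__ == "__main__":
    print(phred_scores("I5!"))           # Sanger-encoded qualities
    print(phred_scores("h", offset=64))  # the same top score, Illumina 1.3+
    print([normalise_chrom(c) for c in ("chr1", "X", "chrM", "MT")])
```

Decoding with the wrong offset shifts every score by 31, which is one reason the encoding has to be recorded alongside shared data.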


Human Variome Project


Global Alliance for Genomics & Health


GA4GH Clinical Products 1 of 2


GA4GH Clinical Products 2 of 2


Interpretative Gap


Options for data sharing

• Retain locally
– Secure
– QC role
– No added value to the healthcare system
– Overseas curation of national data

• Upload aggregates to International Resource e.g. ClinVar
– Simple
– Lack of clinical detail or confidentiality

• Professional Data Sharing Network
– Secure access
– Share phenotypic data and medical records
– Compliant with national law
– National resource


Constraints on data sharing

• Patient confidentiality
– Consent & secure clinical network

• Standardized nomenclature not always available
– HGVS

• Genome builds & reference sequences change
– Build standard tools e.g. LRGs


Challenges for the future

• Migration to build 38
– Alignment
– Annotation
– Quality standards

• Nomenclature standardisation
– HGVS: mutalyzer
– VCFlib: vcfallelicprimitives
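One reason nomenclature needs standardising is that the same insertion or deletion in a repeat can be written at several positions. The sketch below left-aligns a VCF-style variant against a reference string so that equivalent representations collapse to one. It is a generic illustration of left-alignment, not the algorithm used by mutalyzer or vcflib. The function name `left_align` and the 0-based position convention are assumptions, and it expects `ref` and `alt` to differ.

```python
def left_align(reference, pos, ref, alt):
    """Shift an indel as far left as possible.

    reference: the reference sequence as a string
    pos:       0-based index of the first base of `ref` in `reference`
    ref, alt:  reference and alternate alleles (must differ)
    Returns the normalised (pos, ref, alt).
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele is now empty, extend both one base to the left.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            base = reference[pos]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    # A "CA" deletion written at the right-hand end of a CACACA repeat.
    print(left_align("TTCACACAG", 5, "ACA", "A"))
```

HGVS, by contrast, places such variants at the most 3' position, so a shared database has to state which convention its coordinates follow.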


www.ncbi.nlm.nih.gov/clinvar

NIH NCBI ClinVar


Data sharing using open source solution


Capturing the Variants in a Database


Data quality classes based on degree of curation

Differentiate between three classes of data:

The Clinically Reported data label would denote the class of data that the HVP Australian Node was originally designed to collect and share: data that has been generated in a NATA accredited Australian diagnostic laboratory and is able to be included in a clinical report.

Unreported Clinical quality data would denote data that has been generated in a NATA accredited diagnostic laboratory, but is not capable of being included in a clinical report. This class would consist primarily of next‐generation sequencing (NGS) type data.

Unaccredited data would be used to denote data that has been generated by an Australian laboratory that has not been NATA accredited.

A new filtering option would be made available to allow users to view only data of a certain class
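The three classes and the proposed filter can be sketched as follows. This is an illustration only and does not describe the HVP Australian Node software. The record fields `nata_accredited` and `reportable`, and all class and function names, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    CLINICALLY_REPORTED = "Clinically Reported"
    UNREPORTED_CLINICAL = "Unreported Clinical quality"
    UNACCREDITED = "Unaccredited"


@dataclass(frozen=True)
class VariantRecord:
    variant: str           # free-text variant description
    nata_accredited: bool  # generated in a NATA accredited laboratory?
    reportable: bool       # able to be included in a clinical report?


def data_class(record):
    """Assign one of the three data quality classes to a record."""
    if not record.nata_accredited:
        return DataClass.UNACCREDITED
    if record.reportable:
        return DataClass.CLINICALLY_REPORTED
    return DataClass.UNREPORTED_CLINICAL


def filter_by_class(records, wanted):
    """Return only the records that fall in the requested class."""
    return [r for r in records if data_class(r) is wanted]


if __name__ == "__main__":
    records = [
        VariantRecord("variant A", nata_accredited=True, reportable=True),
        VariantRecord("variant B", nata_accredited=True, reportable=False),
        VariantRecord("variant C", nata_accredited=False, reportable=False),
    ]
    for rec in filter_by_class(records, DataClass.UNREPORTED_CLINICAL):
        print(rec.variant)
```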


The FinDis Model of a Distributed Network of Databases


Standards for Accreditation of DNA Sequence Variation Databases

Under the Quality Use of Pathology Program (QUPP), a national project for the Development of Standards for Accreditation of DNA Sequence Variation Databases has been jointly initiated by the Royal College of Pathologists of Australasia (RCPA) and the Human Variome Project (HVP).

Background

• There is a rapidly increasing volume, spectrum, and complexity of genetic tests emerging within diagnostic pathology laboratories. In particular, high throughput sequencing methods such as targeted panel, exome (WES), and whole genome sequencing (WGS) are producing an increasing quantity of genetic data requiring analysis and interpretation, forming a substantial proportion of the workload.

• Currently, there is a plethora of online mutation databases to refer to; however, there is a distinct lack of databases that meet the stringent accuracy and reproducibility standards that the clinical diagnostic environment demands. Additionally, the current databases are “fractured”, with varied access to and sharing of the data within them, and variable quality due to errors or inaccurate data posting, all of which is a clear risk to the quality of patient care. With more widespread, secure sharing of variants and associated phenotypes, the value of cumulative variant information will accelerate the delivery of accurate, actionable, and efficient clinical reports.

• There are currently no standards or equivalent mechanisms for accreditation of databases to ensure the accuracy and quality of uploaded data into any central repository to meet the needs of the clinical diagnostics environment.


Acknowledgments

• Genomic Medicine & CTP, University of Melbourne: Arthur Hsu, Olga Kondrashova, Sebastian Lunke, Clare Love, Renate Marquis‐Nicholson, Kym Pham, Paul Waring

• MCRI & VCGS: Simon Sadedin, Alicia Oshlack, Damien Bruno, Andrew Sinclair, Kathy North

• Human Variome Project: Tim Smith, Alan Lo, Chris Arnold, Dick Cotton

• Melbourne Genomics Health Alliance
