Genome Sequencing of Pathogens with Epidemic Potential: Implications for Control of Communicable Diseases
Vitali Sintchenko
Centre for Infectious Diseases and Microbiology – Public Health, ICPMR
Sydney Emerging Infections and Biosecurity Institute, The University of Sydney
Outline
• Transformational power of Whole Genome Sequencing (WGS) technologies
• Added value of WGS of pathogens with epidemic potential to public health
• International initiatives for WGS data sharing
• Challenges of assuring that this value is realised
Magnitude of microbial diversity
• Number of microbes on Earth 5 x 10^30
• Number of stars in the Universe 7 x 10^21
• Number of humans 6 x 10^9
• Number of human cells in one human 10^13
• Number of microbial cells in one human 10^14
• Number of microbial genes in one human gut 3 x 10^6
Accelerating technology
[Figure: specialist technology (Harvard/MIT 2005); portable technology (MinION sequencer 2012, Oxford Nanopore Technologies); bench top technology (Ion Torrent PGM)]
• Human Genome Project: 15 years and $3 billion
• Celera genome (J. Craig Venter): 9 months and $100 million
• Currently: 3 hours and $1,000
• One human genome being sequenced every 3 minutes
• Sequencing of H. influenzae in 1995 took 13 months and cost >$1 million
• >3K complete bacterial genomes published in NCBI GenBank
WGS bench-top instruments
Instrument                           Chemistry              Read length (bases)  Run time (hours)  Data output per run
454 GS Junior (Roche)                Pyrosequencing         500                  8                 35 Mb
MiSeq (Illumina)                     Reversible terminator  150                  27                1.5 Gb
Ion Torrent PGM (Life Technologies)  Proton detection       200                  3                 500 Mb (316 chip) or up to 1 Gb (318 chip)
Rapid WGS of bacteria in clinical settings can be cost-saving (clinically relevant time to result ~ 50h)
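To make the instrument figures listed above easier to compare, here is a minimal sketch that turns run time and output per run into throughput and an approximate depth of coverage. It assumes that Mb and Gb mean megabases and gigabases; the 5 Mb genome size and 30x target depth are hypothetical values chosen for illustration and are not from the slides.

```python
# Run time (hours) and output per run (megabases) as listed for each instrument.
# Assumption: "Mb"/"Gb" denote megabases/gigabases of sequence.
INSTRUMENTS = {
    "454 GS Junior": (8, 35),
    "MiSeq": (27, 1500),
    "Ion Torrent PGM (318 chip)": (3, 1000),
}


def mb_per_hour(run_hours, output_mb):
    """Sequence output per hour of run time."""
    return output_mb / run_hours


def mean_depth(output_mb, genome_size_mb):
    """Approximate mean depth of coverage if one genome uses the whole run."""
    return output_mb / genome_size_mb


def genomes_per_run(output_mb, genome_size_mb, target_depth):
    """How many genomes fit in one run at the requested depth (rounded down)."""
    return int(output_mb // (genome_size_mb * target_depth))


if __name__ == "__main__":
    GENOME_MB = 5.0      # hypothetical bacterial genome size
    TARGET_DEPTH = 30    # hypothetical target depth of coverage
    for name, (hours, output) in INSTRUMENTS.items():
        print(name,
              round(mb_per_hour(hours, output), 1), "Mb/h,",
              genomes_per_run(output, GENOME_MB, TARGET_DEPTH), "genomes per run")
```

Real runs lose some output to quality filtering, so these figures are upper bounds rather than planning numbers.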
Advantages of WGS data
• Pathogen independent solution with high throughput, speed and quality
• Sequences represent smallest biologically meaningful units (ATCG….)
• DNA sequences represent agnostic and ‘future-proof’ data amenable to exchange and comparison (highly portable ‘molecular Esperanto’)
• Rapid growth of public DBs with reference sequences
From Pasteur to Watson
Conventional microbiology
• Clinical specimen
• Organism growth detected from culture
• Identification
• Characterisation/typing by reference laboratory (phage typing, PFGE, MLST, VNTR etc)
WGS based examination
• WGS
• WGS identification & characterisation
• Identification of specific subtypes from WGS data
• Upload to reference DB with early warning of emergence or spread of virulent/resistant strains
Core and accessory/variable genomes
• Core (essential functions; conserved in all strains)
• Accessory/dispensable genome
• Pathogenicity islands, prophages, transposons and integrated plasmids
• Strain-specific genes
[Figure: bar chart of the number of genes (axis 0 to 8000) in the core genome and variable genome of Escherichia coli, Pseudomonas aeruginosa, Streptococcus pyogenes and Streptococcus pneumoniae; percentage labels on the chart: 70%, 39%, 57%, 57%]
Transformational power of WGS
• Diagnostic microbiology
• Monitoring emerging clones and new pathogens
• Discovery of virulence/drug resistance mechanisms
• Laboratory surveillance (local, national, global)
• Outbreak detection (at point of first secondary case)
• Detection of covert clusters (proof-of-concept studies demonstrated WGS superiority to current typing methods*)
• Tracing of transmission events within outbreaks
• Source attribution and ‘molecular compass’ – geographical structure among related isolates
* Proof-of-concept studies:
• Mycobacterium tuberculosis (Gardy et al, 2012; Walker et al, 2013)
• Enterohaemorrhagic Escherichia coli (Underwood et al, 2013)
• Listeria monocytogenes (Gilmour et al, 2010)
• Acinetobacter baumannii (Lewis et al, 2010)
• Legionella pneumophila (Reuter et al, 2013)
• MRSA (Köser et al, 2012)
Approaches to genome wide comparison
• Variable number tandem repeat (VNTR) analysis
• Problematic for NGS assembled genomes
• Single nucleotide polymorphism approach
• Works well for monomorphic organisms
• ‘Subjective’ SNP selection
• May be difficult to reproduce
• Gene by gene approach
• Hierarchical locus-by-locus analysis
• wgMLST, MLST+, cgMLST
• Intragenic variation is counted as a single event
• Can place the isolate in context with existing typing methods
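The difference between the SNP approach and the gene-by-gene approach listed above can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length per locus; the locus names and sequences are hypothetical toy data, and the functions are not taken from any typing tool.

```python
def snp_distance(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def allele_distance(genes_a, genes_b):
    """Count shared loci whose sequences differ at all.

    Any amount of variation inside one locus counts as a single event,
    which is the gene-by-gene (MLST-style) view.
    """
    shared = genes_a.keys() & genes_b.keys()
    return sum(1 for locus in shared if genes_a[locus] != genes_b[locus])


# Hypothetical isolates: locusB differs at two positions, the others are identical.
isolate_1 = {"locusA": "ATGGCTAAG", "locusB": "ATGCCGTTA", "locusC": "ATGAAATGA"}
isolate_2 = {"locusA": "ATGGCTAAG", "locusB": "ATGACGTCA", "locusC": "ATGAAATGA"}

total_snps = sum(snp_distance(isolate_1[l], isolate_2[l]) for l in isolate_1)
loci_diff = allele_distance(isolate_1, isolate_2)

if __name__ == "__main__":
    print("SNP differences:", total_snps)
    print("Loci with a different allele:", loci_diff)
```

In this toy case the SNP view reports two differences while the gene-by-gene view reports one, because both substitutions fall within the same locus.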
Core genome MLST+
Jolley et al. JCM 2012; 50(9): 3046
Ribosomal MLST (rMLST)
• 53 conserved genes
• Classification according to the Bacterial Isolate Genome Sequence database (BIGSdb)
Gene-by-gene genomic similarity
Designation of sequence types (ST) and clonal complexes (CC)
Chambers & DeLeo. Nature Rev Micro 2009;7:629
Size of the node is proportional to the number of isolates with this sequence type in the database
S. aureus strains with one locus difference
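As a rough illustration of how sequence types that differ at one locus can be linked into groups, here is a minimal sketch. The allelic profiles and ST labels are hypothetical, and the single-linkage rule used here is a simplification: established clonal complex assignment applies additional rules, such as identifying a founder genotype, that are not modelled.

```python
from itertools import combinations

# Hypothetical allelic profiles over seven loci (allele numbers per locus).
PROFILES = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 1, 1, 1, 1, 1, 2),  # one locus different from ST1
    "ST3": (1, 1, 1, 1, 1, 3, 2),  # one locus different from ST2
    "ST4": (5, 6, 7, 8, 9, 4, 3),  # unrelated profile
}


def loci_differing(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def link_groups(profiles, max_diff=1):
    """Group STs connected by chains of profiles differing at <= max_diff loci."""
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for a, b in combinations(profiles, 2):
        if loci_differing(profiles[a], profiles[b]) <= max_diff:
            parent[find(a)] = find(b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), []).append(st)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    for group in link_groups(PROFILES):
        print(group)
```

ST1 and ST3 differ at two loci but still land in the same group because each is one locus away from ST2.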
SNPs genetic diversity of related isolates
Serial isolates from patients with long-term cavitating pulmonary disease, non-compliant with therapy
Walker et al. Lancet Infect Dis 2013
Zooming in to mutations in genomes
• Mutations (e.g., single nucleotide variants or polymorphisms [SNPs]) often accumulate randomly
• Different rates of mutations
• MRSA – 1 SNP/3 months (Croucher et al. Science 2011) or 1 SNP/6 weeks (Harris et al. Science 2010)
• Vibrio cholerae – 3.3 SNPs per annum (Mutreja et al. Nature 2011)
• M. tuberculosis – 1 SNP/2 years (Walker et al. Lancet Infect Dis 2012)
• Accumulation during the course of outbreak or natural variation?
[Figure: evolutionary time axis with labels "No difference", "1 mutation difference" and "2 mutations difference"]
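The mutation rates quoted above can be turned into a rough expectation of how many SNPs accumulate over a given interval. The sketch below assumes a constant rate and Poisson-distributed mutation counts; both are simplifying assumptions made for illustration and are not stated on the slides.

```python
import math

# Rates from the slide, converted to SNPs per genome per year.
SNPS_PER_YEAR = {
    "MRSA (1 SNP/3 months)": 4.0,
    "MRSA (1 SNP/6 weeks)": 52 / 6,
    "Vibrio cholerae": 3.3,
    "M. tuberculosis (1 SNP/2 years)": 0.5,
}


def expected_snps(rate_per_year, years):
    """Expected number of SNPs accumulated along one lineage over `years`."""
    return rate_per_year * years


def prob_at_most(k, rate_per_year, years):
    """P(X <= k) for a Poisson count with mean rate_per_year * years (assumed model)."""
    mean = rate_per_year * years
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


if __name__ == "__main__":
    years = 2.0  # hypothetical interval between two isolates
    for organism, rate in SNPS_PER_YEAR.items():
        print(organism,
              "expected SNPs:", round(expected_snps(rate, years), 2),
              "P(no SNP):", round(prob_at_most(0, rate, years), 3))
```

Under these assumptions a slowly mutating organism can plausibly show no difference over an interval in which a faster one would almost certainly show several, which is one reason the same SNP count cannot be interpreted identically across pathogens.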
Inferring direction of transmission
(a) No direction can be inferred
(b) and (c) The root suggests transmission from left to right
(d) A central source case infects three secondary cases
(e) Likely undiagnosed common source case
Walker et al. Clin Microbiol Infect 2013
Deciphering outbreaks
Step 1: Binary interpretation of subtyping results (match vs. mismatch)
[Figure: timeline; axis: Time]
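The binary interpretation in Step 1 can be sketched as a simple threshold on pairwise differences. The threshold of 5 SNPs and the case labels below are hypothetical; an appropriate cut-off depends on the organism and is not given on the slides.

```python
SNP_THRESHOLD = 5  # hypothetical cut-off, organism dependent


def binary_calls(pairwise_snps, threshold=SNP_THRESHOLD):
    """Label each pair of isolates as 'match' or 'mismatch' by SNP distance."""
    return {
        pair: ("match" if snps <= threshold else "mismatch")
        for pair, snps in pairwise_snps.items()
    }


# Hypothetical pairwise SNP distances between isolates from four cases.
pairwise_snps = {
    ("case_A", "case_B"): 0,
    ("case_A", "case_C"): 3,
    ("case_B", "case_C"): 3,
    ("case_A", "case_D"): 120,
}

calls = binary_calls(pairwise_snps)

if __name__ == "__main__":
    for pair, call in calls.items():
        print(pair, call)
```

A match or mismatch call says nothing about who infected whom; that question is what Step 2 addresses.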
Deciphering outbreaks
Step 2: Inferring directionality of transmission from WGS data
[Figure: timeline; axis: Time; one element labelled "?"]
New Lab Infrastructure
• Bioinformatics pipelines (QC, genome assembly, variant calling, sequence typing etc)
• Standard, stable and scalable to amount of data
• Reproducible results
• Lab ethernet capacity (?1 GB/sec)
• Data processing and storage
• Data analysis – Cloud computing/HPC disk (e.g. Lustre)
• Pipelines - Linux/NFS server
• Warehouse – Compressed data storage and backup
International initiatives
• Europe - Patho-NGen-Trace: EU funded FP7 strategic research
• North America:
• 100,000 Microbial Genomes/Genome Tracker (FDA/UCD)
• Advanced Molecular Detection (CDC)
• Integrated Rapid Infectious Disease Analysis [IRIDA] (Canada)
• Global Microbial Identifier: 25 countries
• Mission – link and share WGS and epidemiological data in near real-time for public health surveillance
• Targets:
• Pipeline with 4 h TAT for outbreak detection
• Proficiency testing schemes for WGS (sequencing, genome assembly and genome analysis steps)
Challenges
• Global health diplomacy: WHO IHR should include “sharing of sequencing data”
• Minimal data sets and open access
• Requires collaboration between different sectors (human and animal health, food and environment) and stakeholders (government, commercial and not-for-profit)
• Ethics and confidentiality issues
• Sharing of benefits and IP rights: DNA sequence as a potential commodity
• IT infrastructure
Exchange of genomic data
• 26 TB/day of download
• 4 TB of data exchange
[Figure labels: NCBI GenBank, EBI, DDBJ; Prediction of storm tracks (Hurricane Sandy)]
GMI Minimal Pathogen Metadata
WHAT
• Sample name
• Organism
• Strain/isolate
• Category/Attribute: 1a) Clinical/Host associated (Specific_host, Isolation_source, Host_disease) OR 1b) Environmental/Food/Other (Isolation_source)
WHEN
• Collection_date
WHERE
• Geographic location: 6a) Geo_loc_name OR 6b) Lat_lon
WHO
• Collected by
Courtesy of James Ostell, NCBI
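The minimal metadata set above maps naturally onto a small record type with a completeness check. The sketch below is illustrative only: the class name, the `category` values and the `problems` method are hypothetical, and treating all three host attributes as required for clinical samples is one reading of the slide, not a published GMI or NCBI schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MinimalPathogenMetadata:
    # WHAT
    sample_name: str
    organism: str
    strain: str
    category: str                      # "clinical" (1a) or "environmental" (1b); hypothetical values
    # WHEN
    collection_date: str
    # WHO
    collected_by: str
    # Category attributes
    isolation_source: Optional[str] = None
    specific_host: Optional[str] = None
    host_disease: Optional[str] = None
    # WHERE: either a place name (6a) or coordinates (6b)
    geo_loc_name: Optional[str] = None
    lat_lon: Optional[str] = None

    def problems(self) -> List[str]:
        """Return a list of missing or inconsistent fields (empty if complete)."""
        issues = []
        for name in ("sample_name", "organism", "strain",
                     "collection_date", "collected_by"):
            if not getattr(self, name):
                issues.append(f"missing {name}")
        if self.category == "clinical":
            required = ("specific_host", "isolation_source", "host_disease")
        elif self.category == "environmental":
            required = ("isolation_source",)
        else:
            issues.append("category must be 'clinical' or 'environmental'")
            required = ()
        for name in required:
            if not getattr(self, name):
                issues.append(f"missing {name}")
        if not (self.geo_loc_name or self.lat_lon):
            issues.append("missing geo_loc_name or lat_lon")
        return issues


if __name__ == "__main__":
    record = MinimalPathogenMetadata(
        sample_name="sample-001", organism="Listeria monocytogenes",
        strain="isolate-1", category="environmental",
        collection_date="2013-05-01", collected_by="Example laboratory",
        isolation_source="food",
    )
    print(record.problems())
```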
Concluding remarks
• Evidence-based recommendations for implementation of WGS in public health practice and the assessment of outcomes are required
• Technical framework (your data is worth more if you share it)
• Proficiency testing and standardisation of WGS processes and data analysis to guide WGS evaluation and implementation
• Education and professional training: establish WGS training and competencies for public health professionals, clinicians and scientists