barker immemxi final march 2016
Mitigating the effects of sequence data quality on strain typeability: towards the development of robust Core Genome MLST (cgMLST) schemes
Dillon Barker1,2; Peter Kruczkiewicz1; James Thomas2; Chad Laing1; Vic Gannon1; Eduardo Taboada1
1 National Microbiology Laboratory at Lethbridge, Public Health Agency of Canada, Lethbridge, Alberta, Canada
2 Department of Biological Sciences, University of Lethbridge, Lethbridge, Alberta, Canada
IMMEM XI
Navigating Microbial Genomes: Insights from the Next Generation
9 – 12 March 2016, Estoril, Portugal
2
Whole Genome Sequencing
• Suddenly cheap and easy
• Huge amounts of data generated in Canada & globally
• Can solve many problems
  • Resolution
  • Breadth of strains typed
• Scale of data brings its own problems
  • Pangenome definitions
  • Variable assembly completeness and quality
  • Existing typing systems don't scale well
3
Classical MLST
• Looks at allelic diversity of ~7 “housekeeping” loci
• All loci must be fully present
• Each new allele is a type
• Recombination and mutation are equivalent
• Each unique combination of types is a Sequence Type
• Type definitions are universal
• Centralized and curated
• e.g. ST-21 in Canada = ST-21 in UK = ST-21 in Denmark
Dingle, et al. 2001. J. Clin. Micro. 39(1) 14-23
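The typing rules on this slide can be illustrated with a minimal sketch. The two dictionaries below are hypothetical in-memory stand-ins for a centralized, curated database, and the three-locus scheme in the usage note is a toy (the slide describes ~7 loci); none of these names come from the talk.

```python
from typing import Dict, Tuple

# Hypothetical stand-ins for a curated central MLST database.
allele_numbers: Dict[str, Dict[str, int]] = {}   # locus -> {sequence: allele number}
sequence_types: Dict[Tuple[int, ...], int] = {}  # allele profile -> ST number


def call_allele(locus: str, sequence: str) -> int:
    """Return the allele number for a complete locus sequence.

    Any sequence not seen before gets a new number, regardless of how many
    bases differ, so recombination and point mutation are treated alike.
    """
    known = allele_numbers.setdefault(locus, {})
    if sequence not in known:
        known[sequence] = len(known) + 1
    return known[sequence]


def call_sequence_type(loci: Dict[str, str], scheme: Tuple[str, ...]) -> int:
    """Map the ordered allele profile to a sequence type (ST).

    Every locus in the scheme must be present, otherwise the isolate
    cannot be typed.
    """
    missing = [locus for locus in scheme if locus not in loci]
    if missing:
        raise ValueError(f"untypeable: missing loci {missing}")
    profile = tuple(call_allele(locus, loci[locus]) for locus in scheme)
    if profile not in sequence_types:
        sequence_types[profile] = len(sequence_types) + 1
    return sequence_types[profile]
```

With a toy scheme such as `("locusA", "locusB", "locusC")`, two isolates with identical sequences at all three loci receive the same ST, a single-base change at one locus yields a new ST, and an isolate lacking a locus raises an error. That last behaviour is the one that becomes a problem once the scheme grows to hundreds of loci.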
4
What is a “Core gene”? What about a “Core genome”?
• The core genome is shared by all members of the species; mostly SNP-level genetic variation
• Accessory genes are not shared by all members of the species and drive a lot of the phenotypic variability between strains
5
Core Genome MLST
• Logical extension of Classical MLST concepts
• 7 genes → 100s or 1000s of genes
• Potential successor “Gold Standard” typing method for surveillance
Big Advantages
• High Resolution
• Viable way for WGS → Surveillance
Lots of interest in cgMLST
cgMLST analysis of 200 isolates “identical” by MLST
7
Walkerton outbreak 2000
cgMLST analysis of 200 isolates “identical” by MLST
8
A prototype cgMLST scheme for C. jejuni
• 2690 Campylobacter jejuni whole genome sequence assemblies
• Set of 1,658 ORFs from reference strain NCTC11168 used as queries
• 85% sequence identity & 50% length coverage
• 732 ORFs conserved across all genomes: core genome loci
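The selection rule above (a query ORF counts as core only if it is found in every genome at 85% identity and 50% length coverage) can be sketched as a simple filter. The `Hit` record and the nested `hits[locus][genome]` table are hypothetical data structures chosen for illustration; the slides do not say which search tool or output format was used.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    identity: float  # percent identity of the best match to the query ORF
    coverage: float  # percent of the query ORF length covered by the match


def is_present(hit: Hit, min_identity: float = 85.0, min_coverage: float = 50.0) -> bool:
    """Apply the identity and length-coverage thresholds from the slide."""
    return hit.identity >= min_identity and hit.coverage >= min_coverage


def core_loci(hits: Dict[str, Dict[str, Hit]], genomes: List[str]) -> List[str]:
    """Return the query loci that pass the thresholds in every genome.

    hits[locus][genome] holds the best hit for that locus in that genome;
    a genome with no entry is treated as lacking the locus.
    """
    core = []
    for locus, per_genome in hits.items():
        if all(g in per_genome and is_present(per_genome[g]) for g in genomes):
            core.append(locus)
    return core
```

Applied to the data set described on the slide, a filter of this kind would reduce the 1,658 query ORFs to the 732 loci conserved across all 2690 assemblies.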
9
cgMLST Trials and Tribulations
• 2690 Campylobacter jejuni whole genome sequence assemblies
• Allele definitions gathered from all genomes
Not so simple!
• WGS projects don't usually finish their genomes: “Genome Assemblies”
• Target loci are often truncated by chance
• Only 1464 genomes (54%) had complete sequences at all 732 loci
10
Contig Truncations are a function of genome count
• As the number of genomes analyzed is increased, the probability that any locus will have at least one truncation approaches 100%
• Average rate of missing/truncated loci ≈ 3.5% (26 per assembly!)
11
Contig Truncations are a function of locus count
• Average rate of missing/truncated loci ≈ 3.5% (26 per assembly!)
• As the number of loci analyzed is increased, the probability that at least one genome will have a truncation increases to 100%
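A rough way to see why both truncation curves climb towards 100% is to treat each locus in each assembly as an independent trial with the quoted 3.5% failure rate. This independence is an assumption made only for illustration: the fact that 54% of genomes were complete at all 732 loci suggests truncations are in reality concentrated in some assemblies rather than spread evenly.

```python
TRUNCATION_RATE = 0.035  # average missing/truncated rate quoted on the slides
N_LOCI = 732             # size of the prototype scheme


def p_at_least_one(rate: float, n: int) -> float:
    """P(at least one truncation among n independent trials) = 1 - (1 - rate)^n."""
    return 1.0 - (1.0 - rate) ** n


# Expected number of missing/truncated loci in one assembly (about 26).
expected_per_assembly = TRUNCATION_RATE * N_LOCI

if __name__ == "__main__":
    # n can be read as "genomes examined for one locus"
    # or as "loci examined in one genome".
    for n in (1, 10, 50, 100, 200, 732):
        p = p_at_least_one(TRUNCATION_RATE, n)
        print(f"n={n:>4}: P(at least one truncation) = {p:.4f}")
```

Under this simplified model the probability passes 99% well before 200 trials, whichever of the two quantities (genomes or loci) is being increased.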
12
The Story So Far...
Advantages of cgMLST
1. Analysis is cheap and speedy
2. Hugely improved resolution
3. Consistent, portable nomenclature
Difficulties Introduced by cgMLST
• Missing / Truncated Loci will affect your scheme
• As-is, forces you to sacrifice either #1 or #3:
  • Re-sequence and re-assemble and hope it works
  – or –
  • Abandon all hope for portability
13
Some options for damage control!
1. Use only highly conserved core genes
2. Use optimized gene fragments
3. Reduce the number of target loci
4. Attempt to impute data
18
Using Optimized Gene Fragments
• The longer the target sequence, the more opportunities for truncations
• Avoid regions with empirically high contig truncation rates
• Retain the most informative regions (measured by Shannon entropy)
• Optimized sub-regions that are informative and truncation-free
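The idea of keeping sub-regions that are both informative and rarely truncated can be sketched as follows. Per-column Shannon entropy is computed from aligned allele sequences, and a fixed-width window is chosen that maximizes total entropy while avoiding positions with a high observed truncation rate. The fixed-width window, the `max_truncation` cut-off and the function names are assumptions for illustration; the slides do not describe how the two criteria were actually combined.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def column_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (bits) of the characters observed at one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_window(
    alleles: List[str],
    truncation_rate: List[float],
    width: int,
    max_truncation: float,
) -> Tuple[int, float]:
    """Return (start, summed entropy) of the most informative acceptable window.

    alleles: aligned allele sequences of equal length for one locus.
    truncation_rate: per-position fraction of assemblies truncated there
    (hypothetical input, assumed to be measured empirically).
    Windows containing any position above max_truncation are skipped.
    Returns (-1, -1.0) if no window qualifies.
    """
    entropies = [column_entropy(col) for col in zip(*alleles)]
    best = (-1, -1.0)
    for start in range(len(entropies) - width + 1):
        window = range(start, start + width)
        if any(truncation_rate[i] > max_truncation for i in window):
            continue
        score = sum(entropies[i] for i in window)
        if score > best[1]:
            best = (start, score)
    return best
```

A column where every allele carries the same base has zero entropy and contributes nothing to discrimination, so a window made only of such columns is never preferred over one containing variable positions.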
20
How many loci do we need for accurate clustering?
Pristine Genome Set
• 732 cgMLST loci
• 1,464 aforementioned genomes
• A controlled development environment for cgMLST testing
Clustering
• Reference set clustered at various similarity thresholds
• 100% to 20% similarity, in 0.5% steps
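Clustering a reference set at a series of similarity thresholds might look like the sketch below. Similarity is taken here as the percentage of loci with identical alleles, and clusters are formed by single linkage; both choices are assumptions, since the slides do not state the similarity measure or the linkage method used.

```python
from typing import Dict, Sequence


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Percentage of loci at which two allele profiles carry the same allele."""
    shared = sum(x == y for x, y in zip(a, b))
    return 100.0 * shared / len(a)


def single_linkage(profiles: Dict[str, Sequence[int]], threshold: float) -> Dict[str, int]:
    """Assign cluster ids so that any pair at or above the threshold is linked."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    cluster_ids: Dict[str, int] = {}
    return {n: cluster_ids.setdefault(find(n), len(cluster_ids)) for n in names}


# Thresholds from 100% down to 20% similarity in 0.5% steps (161 values).
thresholds = [100.0 - 0.5 * step for step in range(161)]
```

Running `single_linkage` once per value in `thresholds` gives one partition of the genomes per similarity level, which is the form of output needed when partitions from reduced locus sets are later compared against the full 732-locus reference.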
21
How many loci do we need for accurate clustering?
Random Gene Selection
• N genes randomly selected from the 732
• 1000 replicates each
• Clusters compared vs the full 732
Comparison to “reference tree”
• Adjusted Wallace Coefficient
• Compares clusters produced by two methods
• “How often do two strains clustered together by Method A cluster together by Method B?”
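The comparison described above can be sketched with a small implementation of the Wallace coefficient and its chance-adjusted form. The adjustment follows the commonly published definition, in which the Wallace value expected by chance equals one minus Simpson's index of diversity of the second partition. The function names and the `random_loci` helper are illustrative assumptions, not the authors' code.

```python
import random
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def pairs_within(labels: Sequence[Hashable]) -> int:
    """Number of isolate pairs that share a label."""
    return sum(n * (n - 1) // 2 for n in Counter(labels).values())


def wallace(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """W(A->B): fraction of pairs co-clustered by A that are also co-clustered by B.

    Undefined (division by zero) if A places every isolate in its own cluster.
    """
    together_both = pairs_within(list(zip(a, b)))
    return together_both / pairs_within(a)


def adjusted_wallace(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Wallace coefficient corrected for agreement expected by chance.

    Undefined if B places all isolates in a single cluster.
    """
    n = len(a)
    expected = pairs_within(b) / (n * (n - 1) // 2)  # 1 - Simpson's diversity of B
    return (wallace(a, b) - expected) / (1.0 - expected)


def random_loci(n_loci: int, total: int = 732, seed: Optional[int] = None) -> List[int]:
    """Draw n_loci distinct locus indices, as in one random-subset replicate."""
    return random.Random(seed).sample(range(total), n_loci)
```

For each subset size, one replicate would draw loci with `random_loci`, cluster the genomes on those loci only, and score the result against the full-scheme clusters with `adjusted_wallace`; repeating this 1000 times gives the distribution from which a low percentile can be read.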
22
Random Subset Clusters – 5th Percentile (i.e. “worst case scenario”)
150-250 genes are nearly as good as 732 genes
23
Allele Imputation: Another Approach
[Figure: allele diagram showing 5, 21 and an unknown allele (???)]
• Inferring the allele of a missing/partial locus
• Educated guess from the allele proportions of 'centres' known to be associated with particular 'flanks'
• Mean accuracy of 90.5%
• Further refinement with partial sequence data
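One way to read the imputation idea is as a lookup of how often each 'centre' allele has been seen with a given pair of 'flank' alleles in complete genomes, taking the most frequent one as the guess. The sketch below assumes that 'flanks' are the alleles at two other loci in the same profile and that a missing call is stored as `None`; both are assumptions, as the slides do not define these terms in detail.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Profile = Tuple[Optional[int], ...]  # allele numbers per locus; None = missing/truncated


def build_table(
    profiles: List[Profile], centre: int, left: int, right: int
) -> Dict[Tuple[int, int], Counter]:
    """Count centre alleles observed with each (left flank, right flank) pair."""
    table: Dict[Tuple[int, int], Counter] = defaultdict(Counter)
    for p in profiles:
        # Only profiles with all three loci called contribute to the counts.
        if None not in (p[left], p[centre], p[right]):
            table[(p[left], p[right])][p[centre]] += 1
    return table


def impute(
    table: Dict[Tuple[int, int], Counter], left_allele: int, right_allele: int
) -> Optional[Tuple[int, float]]:
    """Return (most likely centre allele, its observed proportion), or None.

    None is returned when this flank combination has never been observed.
    """
    counts = table.get((left_allele, right_allele))
    if not counts:
        return None
    allele, n = counts.most_common(1)[0]
    return allele, n / sum(counts.values())
```

The returned proportion gives a simple confidence value for each guess, and a partial sequence of the truncated locus, where available, could be used to rule out candidate alleles before the most frequent remaining one is chosen.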
Conclusions
• cgMLST is poised to be the Gold Standard for global surveillance of bacterial pathogens
• Contig truncations and missing data become a blocking problem if the same portability of typing definitions as MLST is desired
• A compromise between typability and robustness is required
• The effect of contig truncations can be mitigated by:
  • Removing the worst fragments of genes (truncation & information content)
  • Removing the genes that contribute the least to discriminatory power
  • “Filling the gaps” with advance knowledge about linkage
Acknowledgements
• Supervisors:
  • Drs. Ed Taboada & Jim Thomas
• Labmates:
  • Steven Mutschall (PHAC)
  • Peter Kruczkiewicz (PHAC)
  • Ben Hetman (PHAC/ULeth)
  • Cody Buchanan (CFIA/ULeth)
• Funding:
  • ESCMID Attendance Grant
  • University of Lethbridge
  • Public Health Agency of Canada
  • Government of Canada Genomics Research and Development Initiative