
Mitigating the effects of sequence data quality on strain typeability: towards the development of robust Core Genome MLST (cgMLST) schemes

Dillon Barker1,2; Peter Kruczkiewicz1; James Thomas2; Chad Laing1; Vic Gannon1; Eduardo Taboada1

1 National Microbiology Laboratory at Lethbridge, Public Health Agency of Canada, Lethbridge, Alberta, Canada
2 Department of Biological Sciences, University of Lethbridge, Lethbridge, Alberta, Canada

IMMEM XI
Navigating Microbial Genomes: Insights from the Next Generation
9 – 12 March 2016, Estoril, Portugal


Whole Genome Sequencing

• Suddenly cheap and easy
• Huge amounts of data generated in Canada & globally
• Can solve many problems
  • Resolution
  • Breadth of strains typed
• Scale of data brings its own problems
  • Pangenome definitions
  • Variable assembly completeness and quality
  • Existing typing systems don't scale well


Classical MLST

• Looks at allelic diversity of ~7 “housekeeping” loci
  • All loci must be fully present
• Each new allele is a type
  • Recombination and mutation are equivalent
• Each unique combination of types is a Sequence Type
• Type definitions are universal
  • Centralized and curated
  • e.g. ST-21 in Canada = ST-21 in UK = ST-21 in Denmark

Dingle, et al. 2001. J. Clin. Micro. 39(1) 14-23


What is a “Core gene”? What about a “Core genome”?

• The core genome is shared by all members of the species; mostly SNP-level genetic variation
• Accessory genes are not shared by all members of the species and drive a lot of the phenotypic variability between strains


Core Genome MLST

• Logical extension of Classical MLST concepts
  • 7 genes → 100s or 1000s of genes
• Potential successor “Gold Standard” typing method for surveillance
• Big Advantages
  • High Resolution
  • Viable way for WGS → Surveillance
• Lots of interest in cgMLST

cgMLST analysis of 200 isolates “identical” by MLST


Walkerton outbreak 2000



A prototype cgMLST scheme for C. jejuni

• 2,690 Campylobacter jejuni whole genome sequence assemblies
• Set of 1,658 ORFs from reference strain NCTC11168 used as queries
  • Match thresholds: 85% sequence identity & 50% length coverage
• 732 ORFs conserved across all genomes: these are the core genome loci
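
The core-locus selection can be sketched as a simple filter over search hits. This is a minimal illustration, not the authors' pipeline: the hit tuple layout `(genome, locus, identity_pct, coverage_pct)`, the function name `core_loci`, and the toy data are all hypothetical. Only the 85% identity and 50% coverage thresholds come from the slide.

```python
from collections import defaultdict

MIN_IDENTITY = 85.0  # percent sequence identity (threshold from the slide)
MIN_COVERAGE = 50.0  # percent of query length covered (threshold from the slide)

def core_loci(hits, genomes):
    """Return the loci that have a qualifying hit in every genome.

    hits: iterable of (genome, locus, identity_pct, coverage_pct) tuples.
    This layout is an assumption made for the sketch.
    """
    genomes = set(genomes)
    found = defaultdict(set)  # locus -> genomes with a qualifying hit
    for genome, locus, identity, coverage in hits:
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            found[locus].add(genome)
    return sorted(locus for locus, seen in found.items() if genomes <= seen)

# Toy data: only locA passes both thresholds in both genomes.
genomes = ["g1", "g2"]
hits = [
    ("g1", "locA", 99.0, 100.0), ("g2", "locA", 90.0, 80.0),
    ("g1", "locB", 97.0, 100.0), ("g2", "locB", 84.0, 100.0),  # identity too low
    ("g1", "locC", 99.0, 100.0), ("g2", "locC", 99.0, 40.0),   # coverage too low
]
core = core_loci(hits, genomes)
```

A locus that fails in even one genome is dropped, which is why the core set shrinks as more assemblies are added.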


cgMLST Trials and Tribulations

• 2,690 Campylobacter jejuni whole genome sequence assemblies
• Allele definitions gathered from all genomes
• Not so simple!
  • WGS projects don't usually finish their genomes (“Genome Assemblies”)
  • Target loci are often truncated by chance
  • Only 1,464 genomes (54%) had complete sequences at all 732 loci


Contig Truncations are a function of genome count

• As the number of genomes analyzed is increased, the probability that any locus will have at least one truncation approaches 100%
• Average rate of missing/truncated loci ≈ 3.5% (26 per assembly!)


Contig Truncations are a function of locus count

• Average rate of missing/truncated loci ≈ 3.5% (26 per assembly!)
• As the number of loci analyzed is increased, the probability that at least one genome will have a truncation approaches 100%
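
Both trends follow the same arithmetic: if each locus in each genome is truncated with probability p, the chance of at least one truncation in n trials is 1 - (1 - p)^n. The sketch below assumes truncations are independent and uniformly likely, which the slides do not claim. The reported 54% of fully complete genomes suggests real truncations are concentrated in some assemblies and loci, so only the trend is illustrated, not the observed values. The function name is hypothetical.

```python
def p_at_least_one(rate, n):
    """P(at least one truncation in n independent trials, each with probability rate)."""
    return 1.0 - (1.0 - rate) ** n

RATE = 0.035   # average missing/truncated rate quoted on the slides
N_LOCI = 732   # size of the prototype scheme

# Expected truncated loci per assembly; rounds to the 26 quoted on the slides.
expected_per_assembly = RATE * N_LOCI

# One fixed locus, growing genome count (the "genome count" slide).
for n_genomes in (10, 100, 1000):
    print("genomes:", n_genomes, "P(locus hit):", p_at_least_one(RATE, n_genomes))

# One fixed genome, growing locus count (the "locus count" slide).
for n_loci in (7, 100, 732):
    print("loci:", n_loci, "P(genome hit):", p_at_least_one(RATE, n_loci))
```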


The Story So Far...

Advantages of cgMLST

1. Analysis is cheap and speedy
2. Hugely improved resolution
3. Consistent, portable nomenclature

Difficulties Introduced by cgMLST

• Missing / Truncated Loci will affect your scheme
• As-is, forces you to sacrifice either #1 or #3:
  • Re-sequence and re-assemble and hope it works
  – or –
  • Abandon all hope for portability


Some options for damage control!

1. Use only highly conserved core genes

2. Use optimized gene fragments

3. Reduce the number of target loci

4. Attempt to impute data







Using Optimized Gene Fragments

• The longer the target sequence, the more opportunities for truncation




• Avoid regions with empirically high contig truncation rates





• Retain the most informative regions, as measured by Shannon entropy






• Optimized sub-regions that are informative and truncation-free
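
One way to picture the fragment optimization is a sliding window over an alignment of known alleles, scoring each window by summed per-column Shannon entropy and rejecting windows that touch positions with observed truncations. This is a sketch under stated assumptions: the slides do not say how entropy was computed or how sub-regions were chosen, so the per-column nucleotide entropy, the fixed window width, and the names `column_entropy` and `best_window` are all hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def best_window(alleles, truncation_rate, width, max_truncation=0.0):
    """Return (start, summed_entropy) of the most informative acceptable window.

    alleles: equal-length aligned allele sequences (assumed input).
    truncation_rate: per-position empirical truncation rates (assumed input).
    A window is acceptable if no position in it exceeds max_truncation.
    Returns None when no window is acceptable.
    """
    length = len(alleles[0])
    entropy = [column_entropy([seq[i] for seq in alleles]) for i in range(length)]
    best = None
    for start in range(length - width + 1):
        window = range(start, start + width)
        if any(truncation_rate[i] > max_truncation for i in window):
            continue
        score = sum(entropy[i] for i in window)
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Toy alignment: variation sits at the 3' end, and the last position is
# truncation-prone, so the chosen window stops just short of it.
alleles = ["AAAAACGT", "AAAAAGGT", "AAAAATCT", "AAAAAACA"]
truncation_rate = [0, 0, 0, 0, 0, 0, 0, 0.3]
choice = best_window(alleles, truncation_rate, width=3)
```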







How many loci do we need for accurate clustering?

Pristine Genome Set
• 732 cgMLST loci
• 1,464 aforementioned genomes
• A controlled development environment for cgMLST testing

Clustering
• Reference set clustered at various similarity thresholds
  • 100% - 20% similarity
  • 0.5% steps
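
The threshold sweep can be sketched as follows. The slides give the threshold range and step but not the clustering algorithm, so single linkage on the fraction of shared alleles is an assumption here, and `similarity` and `cluster` are hypothetical names.

```python
def similarity(p, q):
    """Fraction of loci at which two allele profiles carry the same allele."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def cluster(profiles, threshold):
    """Single-linkage clustering (an assumed choice): isolates are linked when
    their similarity is >= threshold. Returns one cluster label per profile."""
    parent = list(range(len(profiles)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if similarity(profiles[i], profiles[j]) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(profiles))]

# 100% down to 20% similarity in 0.5% steps, as on the slide.
thresholds = [(200 - k) / 200 for k in range(161)]

# Toy profiles: isolates 0 and 1 differ at one of four loci.
profiles = [(1, 1, 1, 1), (1, 1, 1, 2), (3, 3, 3, 3)]
labels_by_threshold = {t: cluster(profiles, t) for t in thresholds}
```

The pairwise loop is quadratic in the number of isolates, which is acceptable for a sketch but would want a precomputed distance matrix at the scale of 1,464 genomes.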


Random Gene Selection
• N genes randomly selected from the 732
• 1000 replicates each
• Clusters compared vs the full 732

Comparison to “reference tree”
• Adjusted Wallace Coefficient
  • Compares clusters produced by two methods
  • “How often do two strains clustered together by Method A cluster together by Method B?”
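
The comparison can be sketched with the Adjusted Wallace coefficient as it is usually defined: the Wallace coefficient (pairs together in both partitions divided by pairs together in A), corrected by the value expected if the two partitions were independent. Several parts are assumptions made for illustration: clusters are taken to be groups of isolates with identical allele profiles, whereas the slides cluster at a range of similarity thresholds; the function names, the toy profiles, and the crude index-based percentile are hypothetical.

```python
import random
from collections import Counter
from math import comb

def pairs_together(labels):
    """Number of isolate pairs that share a cluster label."""
    return sum(comb(n, 2) for n in Counter(labels).values())

def adjusted_wallace(a, b):
    """Adjusted Wallace coefficient A -> B for two labelings of the same isolates."""
    a, b = list(a), list(b)
    together_a = pairs_together(a)
    expected = pairs_together(b) / comb(len(a), 2)  # Wallace expected by chance
    if together_a == 0 or expected == 1:
        raise ValueError("coefficient undefined for these partitions")
    wallace = pairs_together(zip(a, b)) / together_a
    return (wallace - expected) / (1 - expected)

def subset_scores(profiles, n_loci, replicates=1000, seed=0):
    """Sorted Adjusted Wallace values of random n_loci subsets vs the full profile.
    Clusters are exact-identical profiles, a simplification of the slides."""
    rng = random.Random(seed)
    full = [tuple(p) for p in profiles]
    scores = []
    for _ in range(replicates):
        loci = rng.sample(range(len(full[0])), n_loci)
        subset = [tuple(p[i] for i in loci) for p in full]
        scores.append(adjusted_wallace(subset, full))
    return sorted(scores)

# Toy profiles with two pairs of identical isolates.
profiles = [(1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 2, 1),
            (2, 2, 2, 2), (2, 2, 2, 2), (2, 2, 2, 3)]
scores = subset_scores(profiles, n_loci=2, replicates=200)
fifth_percentile = scores[int(0.05 * len(scores))]  # crude "worst case" estimate
```

Taking a low percentile over replicates, rather than the mean, asks how badly an unlucky choice of loci could perform.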



Random Subset Clusters – 5th Percentile (i.e. “worst case scenario”)

150-250 genes are nearly as good as 732 genes







Allele Imputation: Another Approach

• Inferring the allele of a missing/partial locus
• Educated guess from the allele proportions of 'centres' known to be associated with particular 'flanks'
• Mean accuracy of 90.5%
• Further refinement with partial sequence data
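
The "educated guess" can be sketched as a frequency lookup: tabulate which centre alleles co-occur with each combination of flank alleles in complete genomes, then pick the most frequent one for a genome where the centre is missing. The slides do not define how flanks are chosen or how ties are handled, so the dictionary-based profile layout, the locus names, and the functions `train` and `impute` are hypothetical.

```python
from collections import Counter, defaultdict

def train(profiles, centre, flanks):
    """Count centre alleles seen with each combination of flank alleles.

    profiles: list of dicts mapping locus name -> allele number, with None for
    a missing or truncated locus (an assumed representation).
    """
    table = defaultdict(Counter)
    for p in profiles:
        key = tuple(p[f] for f in flanks)
        if p[centre] is not None and None not in key:
            table[key][p[centre]] += 1
    return table

def impute(profile, centre, flanks, table):
    """Return (allele, proportion) of the most frequent centre allele for this
    profile's flank alleles, or None if that flank combination was never seen."""
    key = tuple(profile[f] for f in flanks)
    counts = table.get(key)
    if not counts:
        return None
    allele, n = counts.most_common(1)[0]
    return allele, n / sum(counts.values())

# Toy training set: flanks (5, 21) go with centre allele 7 in 9 of 10 genomes.
training = [{"left": 5, "mid": 7, "right": 21}] * 9 + [{"left": 5, "mid": 8, "right": 21}]
table = train(training, centre="mid", flanks=("left", "right"))
guess = impute({"left": 5, "mid": None, "right": 21}, "mid", ("left", "right"), table)
```

The returned proportion gives a rough confidence for the guess; a partial sequence at the centre locus could further narrow the candidate alleles, as the last bullet suggests.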

Conclusions

• cgMLST is poised to be the Gold Standard for global surveillance of bacterial pathogens
• Contig truncations and missing data become a blocking problem if the same portability of typing definitions as MLST is desired
• A compromise between typeability and robustness is required
• The effect of contig truncations can be mitigated by:
  • Excluding the worst fragments of genes (by truncation rate & information content)
  • Excluding the genes that contribute the least to discriminatory power
  • “Filling the gaps” with advance knowledge about linkage

Acknowledgements

• Supervisors:
  • Drs. Ed Taboada & Jim Thomas
• Labmates:
  • Steven Mutschall (PHAC)
  • Peter Kruczkiewicz (PHAC)
  • Ben Hetman (PHAC/ULeth)
  • Cody Buchanan (CFIA/ULeth)
• Funding
  • ESCMID Attendance Grant
  • University of Lethbridge
  • Public Health Agency of Canada
  • Government of Canada Genomics Research and Development Initiative