barker immemxi final march 2016

Mitigating the effects of sequence data quality on strain typeability: towards the development of robust Core Genome MLST (cgMLST) schemes.

Dillon Barker 1,2; Peter Kruczkiewicz 1; James Thomas 2; Chad Laing 1; Vic Gannon 1; Eduardo Taboada 1

1 National Microbiology Laboratory at Lethbridge, Public Health Agency of Canada, Lethbridge, Alberta, Canada
2 Department of Biological Sciences, University of Lethbridge, Lethbridge, Alberta, Canada

IMMEM XI: Navigating Microbial Genomes: Insights from the Next Generation
9 – 12 March 2016, Estoril, Portugal

Upload: iridacommunity

Post on 23-Jan-2017



Page 2: Barker immemxi final March 2016

Whole Genome Sequencing

• Suddenly cheap and easy
• Huge amounts of data generated in Canada & globally
• Can solve many problems
  • Resolution
  • Breadth of strains typed
• Scale of data brings its own problems
  • Pangenome definitions
  • Variable assembly completeness and quality
  • Existing typing systems don't scale well

Page 3: Barker immemxi final March 2016

Classical MLST

• Looks at allelic diversity of ~7 “housekeeping” loci
  • All loci must be fully present
• Each new allele is a type
  • Recombination and mutation are equivalent
• Each unique combination of types is a Sequence Type
• Type definitions are universal
  • Centralized and curated
  • e.g. ST-21 in Canada = ST-21 in UK = ST-21 in Denmark

Dingle, et al. 2001. J. Clin. Micro. 39(1) 14-23

Page 4: Barker immemxi final March 2016

What is a “Core gene”? What about a “Core genome”?

• The core genome is shared by all members of the species; mostly SNP-level genetic variation
• Accessory genes are not shared by all members of the species and drive a lot of the phenotypic variability between strains

Page 5: Barker immemxi final March 2016

Core Genome MLST

• Logical extension of Classical MLST concepts
  • 7 genes → 100s or 1000s of genes
• Potential successor “Gold Standard” typing method for surveillance
• Big Advantages
  • High Resolution
  • Viable way for WGS → Surveillance
• Lots of interest in cgMLST

Page 6: Barker immemxi final March 2016

cgMLST analysis of 200 isolates “identical” by MLST

Page 7: Barker immemxi final March 2016


Walkerton outbreak 2000

cgMLST analysis of 200 isolates “identical” by MLST

Page 8: Barker immemxi final March 2016

A prototype cgMLST scheme for C. jejuni

• 2,690 Campylobacter jejuni whole genome sequence assemblies
• Set of 1,658 ORFs from reference strain NCTC11168 used as queries
  • 85% sequence identity & 50% length coverage
• 732 ORFs conserved across all genomes: the core genome loci
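The selection rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Hit` record, its field names, and the `core_loci` helper are hypothetical, and the search tool that produces the hits is not named in the slides. Only the two thresholds (85% identity, 50% length coverage) and the "present in every genome" requirement come from the slide.

```python
from dataclasses import dataclass

MIN_IDENTITY = 85.0  # percent sequence identity (from the slide)
MIN_COVERAGE = 50.0  # percent of the query ORF length covered (from the slide)


@dataclass(frozen=True)
class Hit:
    """Best match of one reference ORF in one assembly (hypothetical record)."""
    genome: str
    locus: str
    identity: float   # percent identity of the alignment
    aligned_len: int  # aligned length in bases
    query_len: int    # length of the reference ORF in bases


def passes(hit):
    """True if the hit meets both thresholds."""
    coverage = 100.0 * hit.aligned_len / hit.query_len
    return hit.identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE


def core_loci(hits, genomes):
    """Return loci with a passing hit in every genome, sorted by name."""
    found = {}
    for hit in hits:
        if passes(hit):
            found.setdefault(hit.locus, set()).add(hit.genome)
    required = set(genomes)
    return sorted(locus for locus, seen in found.items() if seen >= required)
```

A locus that fails either threshold in even a single assembly drops out of the core set, which is why the core shrinks as more assemblies are added.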

Page 9: Barker immemxi final March 2016

cgMLST Trials and Tribulations

• 2,690 Campylobacter jejuni whole genome sequence assemblies
• Allele definitions gathered from all genomes
• Not so simple!
  • WGS projects don't usually finish their genomes: “Genome Assemblies”
  • Target loci are often truncated by chance
  • Only 1,464 genomes (54%) had complete sequences at all 732 loci

Page 10: Barker immemxi final March 2016

Contig Truncations are a function of genome count

• As the number of genomes analyzed is increased, the probability that any locus will have at least one truncation approaches 100%
• Average rate of missing/truncated loci ≈ 3.5% (26 per assembly!)

Page 11: Barker immemxi final March 2016

Contig Truncations are a function of locus count

• Average rate of missing/truncated loci ≈ 3.5% (26 per assembly!)
• As the number of loci analyzed is increased, the probability that at least one genome will have a truncation approaches 100%
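The arithmetic behind these two slides can be checked with a short sketch. It assumes, purely for illustration, that every locus in every assembly is truncated independently with the same probability p; the slides do not state this model, and the function names are hypothetical. Under that assumption the chance of at least one truncation in n trials is 1 − (1 − p)^n, which climbs towards 100% as either the genome count or the locus count grows.

```python
def p_at_least_one(p, n):
    """Probability of at least one truncation in n independent trials.

    Use n = number of genomes for "this locus is truncated somewhere",
    or n = number of loci for "this genome has a truncated locus".
    """
    return 1.0 - (1.0 - p) ** n


def expected_truncated(p, n_loci):
    """Expected number of missing/truncated loci in a single assembly."""
    return p * n_loci


RATE = 0.035   # average missing/truncated rate from the slide
N_LOCI = 732   # core loci in the prototype scheme

per_assembly = expected_truncated(RATE, N_LOCI)  # 0.035 * 732, about 26
```

The independence assumption is a simplification. With 732 loci it would leave almost no assembly fully complete, whereas the deck reports that 54% of genomes were complete at all 732 loci, so in the real data truncations are evidently concentrated in a subset of assemblies.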

Page 12: Barker immemxi final March 2016

The Story So Far...

Advantages of cgMLST
1. Analysis is cheap and speedy
2. Hugely improved resolution
3. Consistent, portable nomenclature

Difficulties introduced by cgMLST
• Missing / truncated loci will affect your scheme
• As-is, this forces you to sacrifice either #1 or #3:
  • Re-sequence and re-assemble and hope it works
  – or –
  • Abandon all hope for portability

Page 13: Barker immemxi final March 2016

Some options for damage control!

1. Use only highly conserved core genes
2. Use optimized gene fragments
3. Reduce the number of target loci
4. Attempt to impute data



Page 18: Barker immemxi final March 2016

Using Optimized Gene Fragments

• The longer the target sequence, the more opportunities for truncations
• Avoid regions with empirically high contig truncation rates
• Retain the most informative regions (measured by Shannon entropy)
• Optimized sub-regions that are informative and truncation-free
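One way to picture the fragment selection is a sliding window over an alignment of known alleles. The sketch below is an assumption-laden illustration: the slides name Shannon entropy and empirical truncation rates but do not say how they were combined, so the fixed window width, the `max_truncation` cut-off, and the function names are all hypothetical. Per-column entropy is H = −Σ p·log2(p) over the bases observed at that position.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the characters observed in one column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_window(aligned_alleles, truncation_rate, width, max_truncation):
    """Pick the most informative window that avoids truncation-prone positions.

    aligned_alleles: equal-length allele sequences for one locus.
    truncation_rate: observed truncation frequency at each position.
    Returns (start, summed_entropy), or None if no window qualifies.
    """
    length = len(aligned_alleles[0])
    entropies = [
        column_entropy([seq[i] for seq in aligned_alleles]) for i in range(length)
    ]
    best = None
    for start in range(length - width + 1):
        end = start + width
        if max(truncation_rate[start:end]) > max_truncation:
            continue  # window overlaps a region that is often truncated
        score = sum(entropies[start:end])
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

Summing entropy rewards windows that separate many alleles, while the truncation filter removes windows that would often be incomplete in draft assemblies.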


Page 20: Barker immemxi final March 2016

How many loci do we need for accurate clustering?

Pristine Genome Set
• 732 cgMLST loci
• 1,464 aforementioned genomes
• A controlled development environment for cgMLST testing

Clustering
• Reference set clustered at various similarity thresholds
• 100% - 20% similarity, in 0.5% steps
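The threshold sweep can be sketched as follows. The slides do not say which clustering algorithm or similarity measure was used, so this example assumes single-linkage clustering on the percentage of loci with identical alleles; `similarity`, `single_linkage`, and `THRESHOLDS` are hypothetical names.

```python
def similarity(a, b):
    """Percentage of loci at which two allele profiles carry the same allele."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def single_linkage(profiles, threshold):
    """Group genomes whose profiles are linked by similarity >= threshold.

    profiles: dict mapping genome name -> tuple of allele numbers.
    Returns a dict mapping genome name -> cluster label.
    """
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)
    return {name: find(name) for name in names}


# 100% down to 20% similarity in 0.5% steps, as on the slide (161 values).
THRESHOLDS = [100.0 - 0.5 * step for step in range(161)]
```

Running `single_linkage` once per entry in `THRESHOLDS` gives one partition per similarity level, which is the form of reference clustering the later comparisons need. The pairwise loop is quadratic in the number of genomes, which is fine for a sketch but would need a faster method at scale.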

Page 21: Barker immemxi final March 2016

How many loci do we need for accurate clustering?

Random Gene Selection
• N genes randomly selected from the 732
• 1000 replicates each
• Clusters compared vs the full 732

Comparison to “reference tree”
• Adjusted Wallace Coefficient
• Compares clusters produced by two methods
• “How often do two strains clustered together by Method A cluster together by Method B?”
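For readers unfamiliar with the metric, here is a small sketch of the Wallace coefficient and its adjusted form. The Wallace coefficient W(A→B) is the fraction of strain pairs co-clustered by A that are also co-clustered by B. The adjusted version subtracts the agreement expected by chance, W_i, which equals one minus Simpson's index of diversity of partition B: AW = (W − W_i) / (1 − W_i). The function names are hypothetical, and which partition the authors treated as A or B is not stated in the slides.

```python
from collections import Counter
from itertools import combinations


def wallace(a, b):
    """W(A->B): share of pairs together in A that are also together in B."""
    together_a = 0
    together_both = 0
    for i, j in combinations(range(len(a)), 2):
        if a[i] == a[j]:
            together_a += 1
            if b[i] == b[j]:
                together_both += 1
    if together_a == 0:
        raise ValueError("partition A has no co-clustered pairs")
    return together_both / together_a


def expected_wallace(b):
    """Chance-level W: probability that a random pair is together in B."""
    n = len(b)
    return sum(c * (c - 1) for c in Counter(b).values()) / (n * (n - 1))


def adjusted_wallace(a, b):
    """Adjusted Wallace coefficient AW(A->B); undefined if B is one cluster."""
    wi = expected_wallace(b)
    if wi == 1.0:
        raise ValueError("partition B puts every strain in one cluster")
    return (wallace(a, b) - wi) / (1.0 - wi)
```

Here `a` and `b` are equal-length lists of cluster labels, one entry per strain in the same order. A value near 1 means partition A almost fully predicts B, and a value near 0 means A does no better than chance.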

Page 22: Barker immemxi final March 2016

Random Subset Clusters – 5th Percentile (i.e. “worst case scenario”)

150-250 genes are nearly as good as 732 genes


Page 24: Barker immemxi final March 2016

Allele Imputation: Another Approach

• Inferring the allele of a missing/partial locus
• Educated guess from the allele proportions of 'centres' known to be associated with particular 'flanks'
• Mean accuracy of 90.5%
• Further refinement with partial sequence data
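The imputation idea can be illustrated with a frequency table. This sketch rests on an interpretation: it assumes that a 'centre' is the locus being imputed and that its 'flanks' are the alleles at two neighbouring loci, then guesses the centre allele most often seen with that flank pair among complete profiles. The data layout and function names are hypothetical, and the method behind the reported 90.5% accuracy may differ.

```python
from collections import Counter, defaultdict


def build_flank_table(profiles, left, centre, right):
    """Count centre alleles seen with each (left, right) flank allele pair.

    profiles: iterable of dicts mapping locus name -> allele number,
    with None marking a missing or truncated locus.
    """
    table = defaultdict(Counter)
    for profile in profiles:
        flank = (profile.get(left), profile.get(right))
        allele = profile.get(centre)
        if allele is not None and None not in flank:
            table[flank][allele] += 1
    return table


def impute(table, left_allele, right_allele):
    """Return (most common centre allele, its proportion), or None if unseen."""
    counts = table.get((left_allele, right_allele))
    if not counts:
        return None
    allele, n = counts.most_common(1)[0]
    return allele, n / sum(counts.values())
```

The returned proportion doubles as a rough confidence value, so low-support guesses can be left as missing rather than imputed.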

Page 25: Barker immemxi final March 2016

Conclusions

• cgMLST is poised to be the Gold Standard for global surveillance of bacterial pathogens
• Contig truncations and missing data become a blocking problem if the same portability of typing definitions as MLST is desired
• A compromise between typeability and robustness is required
• The effect of contig truncations can be mitigated by:
  • Removing the worst fragments of genes (by truncation rate and information content)
  • Removing the genes that contribute the least to discriminatory power
  • “Filling the gaps” with advance knowledge about linkage

Page 26: Barker immemxi final March 2016

Acknowledgements

• Supervisors:
  • Drs. Ed Taboada & Jim Thomas
• Labmates:
  • Steven Mutschall (PHAC)
  • Peter Kruczkiewicz (PHAC)
  • Ben Hetman (PHAC/ULeth)
  • Cody Buchanan (CFIA/ULeth)
• Funding:
  • ESCMID Attendance Grant
  • University of Lethbridge
  • Public Health Agency of Canada
  • Government of Canada Genomics Research and Development Initiative