Office of Advanced Molecular Detection National Center for Emerging and Zoonotic Infectious Diseases Joel Sevinsky PhD & Duncan MacCannell PhD Introduction to Bioinformatics APHL 2017 Bioinformatics Workshop 2017/06/11


Page 1: Introduction to Bioinformatics - APHL · Introduction to Bioinformatics APHL 2017 ... From Sequence Data Comparative Genomics ... Skills/proficiency Standards Reporting

Office of Advanced Molecular Detection

National Center for Emerging and Zoonotic Infectious Diseases

Joel Sevinsky PhD & Duncan MacCannell PhD

Introduction to Bioinformatics

APHL 2017 Bioinformatics Workshop
2017/06/11


AMD – Innovate * Transform * Protect

Introductions

https://xkcd.com/1605/


Input: DNA/RNA
Source: Genomic, Amplicon, Whole sample
Host/vector/pathogen/environment
Library

Output: Information From Sequence Data

Comparative Genomics
• Identification
• High res strain typing/subtyping
• Cluster identification
• Molecular evolution
• Genotypic characterization
• Virulence, AR, signatures
• Functional annotation
• Diagnostic dev/validation
• Minor populations, quasispecies
• Host/pathogen expression

Metagenomics
• Pathogen identification/discovery
• Culture-independent diagnostics
• Microbial ecology/diversity

Data → Info.

ACAATTTGTGCATAACATGTGGACAGTTTTAATCACATGTGGGTAAATAGTTGTCCACATTTGCTTTTTT TGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACAT TTTATATTTATTAGGTTGTACATTTGTTGCGCAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACAC CTTTGGAAAATATCTCTGATTTATGGAATAGTGCCTTAAAAGAATTAGAAAAAAAGGTAAGCAAGCCTAG TTATGAAACATGGTTAAAATCAACAACGGCTCATAACTTGAAGAAAGACGTATTAACGATTACAGCTCCA AATGAATTTGCTCGTGACTGGCTAGAATCTCATTACTCAGAACTTATTTCGGAAACACTATACGATTTAA CAGGGGCAAAATTAGCAATTCGCTTTATTATTCCCCAAAGTCAATCGGAAGAGGACATTGATCTTCCTCC AGTTAAGCGGAATCCAGCACAAGATGATTCAGCTCATTTACCACAGAGCATGTTAAATCCAAAATATACA TTTGATACATTTGTTATCGGCTCTGGTAACCGTTTTGCCCATGCAGCTTCATTAGCTGTAGCCGAGGCGC CAGCTAAAGCGTATAATCCACTCTTTATTTATGGGGGAGTTGGGCTTGGAAAGACGCATTTAATGCACGC AATTGGTCATTATGTAATTGAACATAATCCAAATGCAAAAGTTGTATATTTATCATCAGAAAAATTCACG AATGAATTTATTAACTCTATTCGTGATAATAAAGCTGTTGATTTTCGTAATAAATATCGCAACGTAGATG

NGS Workflow: Platforms, Chemistry, Perf. char., Labor/TaT, Expertise, Cost

Bioinformatics Workflow: Hardware/software, Specialized skillsets, Algorithms/pipelines, Pathogen databases, Data analysis/interpret/Integration/visualization

Increasingly Universal Workflows
Working to establish standardized sequencing workflows for a wide range of pathogens. Many results from a single dataset. Faster and cheaper than serial tests.

A Moving Target
Rapidly evolving technology space. Changing hardware and COTS/OSS capabilities. Lots of choice, but lack of consistent standards. BIG DATA. New workforce and skillset is required.

Pathogen- and application-specific, standard and/or compliant assays

File hashes/versioning, Validated methods/databases, Process logging/audit, QA/QC, Skills/proficiency, Standards, Reporting, Security

Sample intake, Prep/staging, Extraction, Conversion, Library prep, Sequencing

NGS Applications in Microbiology


NGS Sequencing Technologies

SYNTHETIC LONG READ
Platforms: 10X Genomics GemCode; Illumina TruSeq SLR (Moleculo)

SHORT READ SEQUENCING
Platforms: IonTorrent PGM/Proton/S5; Illumina MiSeq/NextSeq/HiSeq
• 75 to 400bp readlengths*
• Millions to billions of reads
• Various error models
• Relatively inexpensive (~$60/isolate)
• Issues: Resolving complex structures, phasing

LONG READ SEQUENCING
Platforms: Oxford Nanopore MinION; Pacific BioSciences RS II; Pacific BioSciences Sequel*
• >3-20kbp readlengths*
• Hundreds of thousands of reads*
• High error rates have presented challenges*
• Roughly $600/isolate (RSII; Nanopore TBD)
• Other adv: SMRT; methyl; phasing; closing


Run QC & Metrics

Read QC: FastQC, CLC, FastXToolkit, etc.

Trimming/Filtering: Trimmomatic, CLC, kraken, bowtie2, ...

Reference Mapping

De Novo

Demultiplexing

PAN-GENOME, wgMLST, AMR PREDICTION, FUNC ANNOTATION

Alignment: bwa

Variant Calling: samtools, gatk, varscan, freebayes, …

Variant Filter/Annot

Tree Building: RAxML, PHYML, Fasttree, kSNP

MLST/AMR: srst2

WG Alignment: mauve, parsnp …

Annotation/FP: prokka, pgaap, etc…

Tree Building: Harvest, kSNP, …

Comparative Genomics

MLST/AMR: abricate, mlst

H/T: Nick Loman


What does the data actually look like?

3 SAMPLES
Illumina Paired-End
~47GB compressed
ID, flowcell, barcode, lane, pair
265M 100bp reads in CDD5

1. Identifier for each read. (Syntax varies – Illumina)
INSTRUMENT: HISEQ
RUN ID: 165
FLOWCELL ID: C1CKRACXX
FLOWCELL LANE: 2
TILE: 2201
X,Y: 1257,1980
PAIR: 1
FILTERING: N
BARCODE: GAGTGG

2. SEQUENCE DATA

3. Per-base quality score: ASCII – 64 or 94 levels
LOWEST → ! (HEX 21)
HIGHEST → ~ (HEX 7E)

Further reading: https://en.wikipedia.org/wiki/FASTQ_format
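As a rough illustration of the three parts of a FASTQ record described on this slide, the following Python sketch splits an Illumina-style identifier into its fields and converts a quality string to Phred scores. The header layout (Casava 1.8+ style), the control-number field, the Phred+33 offset, the example sequence and quality string, and the function names are assumptions for illustration; they are not taken from the workshop dataset.

```python
def parse_illumina_header(header):
    """Split an Illumina (Casava 1.8+ style) FASTQ identifier into named fields."""
    left, right = header.lstrip("@").split(" ")
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    pair, filtered, control, barcode = right.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair": int(pair),
        "filtered": filtered,  # "Y" = read was filtered out, "N" = it passed
        "control": control,    # control number; assumed, not shown on the slide
        "barcode": barcode,
    }


def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical record assembled from the field values listed on the slide.
record = [
    "@HISEQ:165:C1CKRACXX:2:2201:1257:1980 1:N:0:GAGTGG",
    "ACGTN",  # made-up sequence
    "+",
    "II5!~",  # made-up quality string, one character per base
]
fields = parse_illumina_header(record[0])
scores = phred_scores(record[3])
print(fields["flowcell"], fields["lane"], scores)
```

With a +33 offset, `!` decodes to 0 and `~` to 93, which corresponds to the 94 levels mentioned on the slide.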


NGS Quality Assessment: FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/


Genome Assembly

INCLUDES: Sequence(s) of interest, Contaminants
MAY NOT INCLUDE: Poorly sequenceable regions


Mapped assembly
For demonstration purposes only. All copyrights belong to their respective owners.


Reference-Guided (Mapped) Assembly

Reference Sequence/Genome

Low sequence coverage

UNMAPPED READS
1. Sequences not present in the reference.
2. Plasmids or other extrachromosomal elements.
3. DNA Structural Variation/Rearrangement

ADVANTAGES: Relatively fast, well-suited to highly-conserved genomes.
DISADVANTAGES: Issues with high diversity, mobile elements, linear reference

[Figure: read coverage along the reference, with axis labels 18X and 1X; mapped reads form Contig 1 and Contig 2.]

Example software: BWA (https://github.com/lh3/bwa), breseq (https://github.com/barricklab/breseq)
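To make the coverage picture concrete, here is a minimal sketch of how per-position depth could be tallied from mapped read intervals, and how the mapped sequence breaks into contigs wherever depth falls below a threshold. The read positions, the threshold and the function names are hypothetical; real pipelines derive depth from alignment files rather than from hand-written intervals.

```python
def depth_per_position(ref_length, reads):
    """Count how many reads cover each reference position.

    reads is a list of (start, length) tuples with 0-based starts.
    """
    depth = [0] * ref_length
    for start, length in reads:
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += 1
    return depth


def contigs_from_depth(depth, min_depth=1):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    contigs = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos
        elif d < min_depth and start is not None:
            contigs.append((start, pos))
            start = None
    if start is not None:
        contigs.append((start, len(depth)))
    return contigs


# Hypothetical 20 bp reference with 5 bp reads; nothing maps to positions 9-11.
reads = [(0, 5), (2, 5), (4, 5), (12, 5), (14, 5), (15, 5)]
depth = depth_per_position(20, reads)
contigs = contigs_from_depth(depth, min_depth=1)
print(depth)
print(contigs)
```

Raising `min_depth` shortens or splits the contigs further, which is one way low-coverage regions end up missing from a mapped assembly.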


De Novo Assembly
For demonstration purposes only. All copyrights belong to their respective owners.


PLASMID

De-Novo Assembly

[Figure: reads assembled into Contig 1 through Contig 7.]

ADVANTAGES: Reference agnostic: assembles all the reads it can. Various algorithms.
DISADVANTAGES: Doesn’t always get things right, particularly with repeat sequences.

Example software: SPAdes (http://bioinf.spbau.ru/spades)
List: https://en.wikipedia.org/wiki/Sequence_assembly#Available_assemblers


Functional Annotation


EXERCISE


hqSNP and wgMLST




High Quality SNP Typing (hqSNP)

Reference Sequence/Core Genome

Isolate   Bases at SNP positions
1         ACTAGA
2         ACTAGT
3         TCTACT

Advantages: adaptable, highly discriminatory, good for cluster investigations where a suitable reference is available and timeframe known
Disadvantages: not well suited for surveillance or studies where reference or allele set may shift over time. Provides limited additional data about genomic features. Issues with highly plastic genomes.

Example software: SNIPPY (https://github.com/tseemann/snippy), LYVE-SET (https://github.com/lskatz/lyve-SET), SNP Pipeline (https://github.com/CFSAN-Biostatistics/snp-pipeline), others.
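As an illustration of the SNP comparison, the short sketch below finds the positions at which aligned isolates differ and counts pairwise SNP differences. The three strings are one reading of the slide's figure (ACTAGA, ACTAGT and TCTACT for isolates 1 to 3), and the function names are hypothetical. Real hqSNP pipelines such as those listed above also apply quality, coverage and masking filters that are not shown here.

```python
from itertools import combinations


def variable_sites(seqs):
    """Return 0-based positions where the aligned sequences are not all identical."""
    length = len(next(iter(seqs.values())))
    return [i for i in range(length) if len({s[i] for s in seqs.values()}) > 1]


def snp_distance(a, b):
    """Count positions that differ between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Isolate strings as read from the slide's figure (an interpretation).
isolates = {"1": "ACTAGA", "2": "ACTAGT", "3": "TCTACT"}

print(variable_sites(isolates))
for (name_a, seq_a), (name_b, seq_b) in combinations(isolates.items(), 2):
    print(name_a, name_b, snp_distance(seq_a, seq_b))
```

Only the variable positions carry information for distinguishing isolates; positions where all isolates agree do not separate them.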


What factors influence SNP calls?

❑ Not all SNP pipelines are equal – where you call SNPs will affect the total SNP count and population distribution

❑ SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontal genetic elements like phages can be masked

Mask mobile elements: do not consider SNPs in this location

Mobile elements

genes

Only call SNPs in genes

Raw reads

Low coverage/Poor quality

Heather Carleton (CDC/DFWED)
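The effect of masking and quality filtering can be sketched in a few lines: candidate SNP positions are dropped if they fall inside a masked mobile-element interval or if read depth is too low. The positions, the masked interval, the depth values and the threshold of 10 are made-up illustration values, and the function name is hypothetical.

```python
def filter_snps(snps, masked_regions, min_depth=10):
    """Keep SNPs outside masked regions that have at least min_depth reads.

    snps: list of (position, depth) tuples.
    masked_regions: list of (start, end) intervals, inclusive on both ends.
    """
    kept = []
    for position, depth in snps:
        in_mask = any(start <= position <= end for start, end in masked_regions)
        if not in_mask and depth >= min_depth:
            kept.append(position)
    return kept


# Hypothetical candidate SNPs as (position, depth).
candidates = [(120, 35), (4510, 40), (4780, 38), (9002, 4), (15230, 27)]
# Hypothetical prophage region to mask.
phage_regions = [(4000, 5000)]

print(filter_snps(candidates, phage_regions))
print(filter_snps(candidates, []))
```

Comparing the two calls shows how the same raw data can yield different SNP counts depending on where SNPs are allowed to be called.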


Selection of an Appropriate Reference

❑ Choice of reference genome affects analysis – a more closely related reference is more likely to identify true SNP differences

❑ For some organisms, e.g. MTB, the choice is obvious. For others, reference varies or must be selected based on a preliminary genomic analysis.

Heather Carleton (CDC/DFWED)


Caveats of hqSNP analyses

Advantages
• Phylogenetically informative (build a tree consistent with evolution of the strains)
• SNP position can be identified on genome (gene affected can be identified)

Disadvantages
• Requires a closely related reference genome – hqSNP analysis does not work if reference genome is not closely related → Longitudinal surv. difficult
• Takes a while and requires a lot of computer power
• Interpretation of data depends on genomes added – is not stable, does not lead to strain nomenclature.
• Mobile genetic elements can interfere with phylogenetic estimation unless masked. These also may be critically important.

When to Use
• Outbreaks
• Need highest amount of resolution for strain comparison

Heather Carleton (CDC/DFWED)




Whole Genome MLST (wgMLST)

❑ Includes relevant open reading frames from across the pangenome.
❑ Some key definitions:
• Locus: open reading frame.
• Allele: specific gene sequence at each locus.
• Scheme: the set of loci and alleles included in straintyping definition.
• Nomenclature: standardized way of referring to alleles, loci and sequence types
• Curation: Updating/maintaining the allele database for a given genus/species based on established criteria.

Advantages: discriminatory; reproducible; results in consistent, hierarchical nomenclature; corollary information; data are reasonably portable.
Disadvantages: may have decreased discriminatory power, particularly with clustered genomic variations; requires initial dev/ongoing curation of allele db; computationally intensive.

Example software: SRST2 (https://github.com/katholt/srst2)

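[Figure: example sequences at four loci across three isolates (atgcct, atgcct, atgcgt; gcagga, gcacga, gcagct; gttatt, attatt, cttaat; aacttt, aactat, aacttt), each assigned an allele number (for the first locus: 1, 1, 2), giving the allele profiles 1111, 1221 and 2331.]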
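The idea of replacing sequences with allele numbers can be sketched as follows. Each distinct sequence at a locus gets the next free number, and isolates are then compared by counting the loci at which their allele numbers differ. The locus and isolate names are hypothetical, the toy sequences are short strings chosen for illustration, and numbering alleles in order of first appearance is an assumption; curated schemes assign numbers from a shared, maintained allele database.

```python
def call_alleles(isolates, loci):
    """Assign allele numbers per locus in order of first appearance.

    isolates: dict of isolate name -> dict of locus -> sequence.
    Returns dict of isolate name -> tuple of allele numbers (one per locus).
    """
    known = {locus: {} for locus in loci}  # locus -> {sequence: allele number}
    profiles = {}
    for name, genes in isolates.items():
        profile = []
        for locus in loci:
            alleles = known[locus]
            seq = genes[locus]
            if seq not in alleles:
                alleles[seq] = len(alleles) + 1
            profile.append(alleles[seq])
        profiles[name] = tuple(profile)
    return profiles


def allele_differences(p, q):
    """Count loci at which two allele profiles differ."""
    return sum(a != b for a, b in zip(p, q))


# Hypothetical loci and isolates with toy six-base sequences.
loci = ["locusA", "locusB", "locusC"]
isolates = {
    "iso1": {"locusA": "atgcct", "locusB": "gcagga", "locusC": "aacttt"},
    "iso2": {"locusA": "atgcct", "locusB": "gcacga", "locusC": "aactat"},
    "iso3": {"locusA": "atgcgt", "locusB": "gcagct", "locusC": "aacttt"},
}
profiles = call_alleles(isolates, loci)
print(profiles)
print(allele_differences(profiles["iso1"], profiles["iso3"]))
```

Note that a locus counts as one difference whether the two sequences differ by one base or by many, which is what makes the comparison categorical.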


Flavors of multilocus sequence type analysis

• Subsets of genes can be used to identify genus/species and lineage (rMLST/ MLST)

• Core genome MLST (cgMLST) uses the genes shared by the vast majority of genomes belonging to a genus/species (for Listeria, 1748 genes belong to the core and are present in ~98% of isolates tested)

• wgMLST and hqSNP provide the most information per isolate genome

hq-SNP

cgMLST

Heather Carleton (CDC/DFWED)
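How a core genome might be derived from presence/absence data can be sketched briefly: a locus is kept if it is found in at least a chosen fraction of isolates. The locus names, isolate sets and function name are hypothetical. The 0.98 default mirrors the ~98% figure quoted above for Listeria, and the toy calls use other thresholds because the example only has four isolates.

```python
def core_loci(presence, threshold=0.98):
    """Return loci present in at least `threshold` of the isolates.

    presence: dict of isolate name -> set of loci found in that isolate.
    """
    n_isolates = len(presence)
    all_loci = set().union(*presence.values())
    core = []
    for locus in sorted(all_loci):
        count = sum(locus in loci for loci in presence.values())
        if count / n_isolates >= threshold:
            core.append(locus)
    return core


# Hypothetical presence/absence calls for four isolates.
presence = {
    "iso1": {"gyrB", "recA", "rpoB", "blaX"},
    "iso2": {"gyrB", "recA", "rpoB"},
    "iso3": {"gyrB", "recA", "rpoB", "phage1"},
    "iso4": {"gyrB", "recA", "blaX"},
}
print(core_loci(presence, threshold=1.0))
print(core_loci(presence, threshold=0.75))
```

Loci that fall below the threshold (here the hypothetical `blaX` and `phage1`) belong to the accessory genome and would only be captured by a whole-genome scheme.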


Caveats of wgMLST analysis

Advantages
• Phylogenetically informative
• All virulence, serotyping, and antibiotic resistance genes are pulled out as part of analysis
• Neutralizes the effects of horizontal gene transfer (event is only counted once rather than many times for hqSNPs)
• Allele calling is stable – can lead to nomenclature based on allele calls

Disadvantages
• Initial assignment of alleles is computationally costly (doing assemblies before calling alleles)
• Comparing character data (allele numbers) rather than genetic data
• SNPs and indels treated equally – allele assignments categorical
• Requires ongoing and active curation for allele calls

When to Use
• Surveillance
• Need high resolution
• Need to know serotype, virulence, AR determinants
• Need to communicate with partners using stable nomenclature

Heather Carleton (CDC/DFWED)


hqSNP vs wgMLST: Generalized Bioinformatic Process

hqSNP
1. Selection of appropriate reference genome
2. Review and QC of query genomes
3. Pairwise mapping of query genomes to reference
4. SNP Calling
5. Filtering based on coverage, content, complexity, masking.
6. Compile SNP calls from all query sequences
7. Filtering based on allelic freq, informativeness, etc.
8. Tree building and visualization

wgMLST
1. Review and QC of query genomes
2. Individual de novo assembly of each query genome
3. ORF calling/extraction
4. Locus assignment/annotation
5. Selection of subset of ORFs based on loci in defined scheme
6. Sequence alignment of each ORF for allele determination
7. Assignment of sequence type based on allele profile and established nomenclature.
8. Tree building and visualization


REFERENCE ATGTCGAATTCTTATGACTCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y D S S S I K V L K D I *

Insertion ATGTCGAATTCTTATGACAAATCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y D K S S S I K V L K D I *

Deletion ATGTCGAATTCTTATATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y I K V L K D I *

SNP ATGTCGTATTCTTATGACTCCTCTAGTATCAAAGTCCTGAAA-//-GATATTTAA M S Y S Y D S S S I K V L K D I *

Inversion ATGTCGAATTATTCTGACTCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N Y S D S S S I K V L K D I *

Duplication ATGTCGAATTCTTATAATTCTTATGACTCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y N S Y D S S S I K V L K D I *

Don’t forget: Gene duplication. Frameshift. Extrachromosomal seq (eg: plasmids). Differential selection.

hqSNP vs. wgMLST: Impact of Genetic Change
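To see how the two approaches score these changes differently, the sketch below compares three of the variants with the reference in two ways: base by base (only meaningful for the equal-length SNP example) and as a categorical allele that is either the same or different. The sequences are the portions shown on the slide before the `-//-` break, the codon table is the standard genetic code, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string codon by codon (length must be a multiple of 3)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def snp_count(a, b):
    """Count mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


reference = "ATGTCGAATTCTTATGACTCCTCCAGTATCAAAGTCCTGAAA"
variants = {
    "SNP": "ATGTCGTATTCTTATGACTCCTCTAGTATCAAAGTCCTGAAA",
    "Insertion": "ATGTCGAATTCTTATGACAAATCCTCCAGTATCAAAGTCCTGAAA",
    "Deletion": "ATGTCGAATTCTTATATCAAAGTCCTGAAA",
}

for name, seq in variants.items():
    allele_change = int(seq != reference)  # allele view: same (0) or different (1)
    if len(seq) == len(reference):
        detail = f"{snp_count(seq, reference)} SNPs"
    else:
        detail = f"length change of {len(seq) - len(reference)} bp"
    print(name, allele_change, detail, translate(seq))
```

Each variant is a single allele change in the categorical view, while the SNP example contributes two separate differences to a base-by-base count, one of which does not change the protein.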


Comparison of hqSNP and wgMLST analyses

[Figure: two trees of the same isolates, with leaves labeled State 1 to State 5, Environmental Isolate and Sprout.
Panel "Whole-genome Multilocus Sequence Typing (wgMLST)": axis ticks 100 to 94; allele differences as median [min-max]: 87 [87-87], 5 [2-6], 5 [2-54], 55 [2-69], 64 [2-191].
Panel "hqSNP Analysis": scale bar 0.02; 68 hqSNPs, 1 ± 1 hqSNPs [0-3], 58 hqSNPs [0-72], 65.5 hqSNPs [54-72], 275 hqSNPs.]

Isolate from MI also highly-related (not shown)

WGS analysis by Enteric Diseases Laboratory Branch, CDC; Heather Carleton (NCEZID/DFWED)


EXERCISE


Links and Further Reading

❑ Further reading

• Loman NJ, Pallen M. Twenty years of bacterial genome sequencing. Nat Rev Microbiol. 2015 Dec;13(12):787-94. doi: 10.1038/nrmicro3565

• Sintchenko V, Holmes EC. The role of pathogen genomics in assessing disease transmission. BMJ. 2015 May 11;350:h1314. doi: 10.1136/bmj.h1314.

• Kwong JC, McCallum N, Sintchenko V, Howden BP. Whole genome sequencing in clinical and public health microbiology. Pathology. 2015 Apr;47(3):199-210. doi: 10.1097/PAT.0000000000000235

❑ Free and open source software to get you started:

• NCBI Genome Workbench (http://www.ncbi.nlm.nih.gov/tools/gbench/); SRST2 (https://github.com/katholt/srst2); Prokka/snippy/abricate (https://github.com/tseemann); BWA (https://github.com/lh3); GATK (https://www.broadinstitute.org/gatk/); SPAdes (http://bioinf.spbau.ru/spades); HARVEST (https://harvest.readthedocs.org/en/latest/); kSNP (http://sourceforge.net/projects/ksnp/); PhyloViz (http://www.phyloviz.net/); Samtools (http://www.htslib.org); FreeBayes (https://github.com/ekg/freebayes); BIGSdb (http://pubmlst.org/software/database/bigsdb/); RAxML (https://github.com/stamatak/standard-RAxML); Mauve (http://darlinglab.org/mauve/mauve.html); GEPHI (https://gephi.org/); BioPerl (http://bioperl.org); BioPython (http://biopython.org); R (http://www.r-project.org)


Office of Advanced Molecular Detection

National Center for Emerging and Zoonotic Infectious Diseases

Questions?

For more information please contact Centers for Disease Control and Prevention

1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
E-mail: [email protected]
Web: http://www.cdc.gov/amd
Twitter: @dmaccannell @CDC_AMD

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.


Multiple Gene Copies

Gene: rrs (16S) 1471846 – 1473382
Gene: rrl (23S) 1473658 – 1476795
Gene: rrf (5S) 1476899 – 1477013

May impact analyses.

Need to be addressed in the wgMLST schema or masked in the hqSNP analysis workflow.


Example: Multi-species CRE Outbreak

❑ In this outbreak report, an IncFII plasmid bearing the NDM-1 carbapenemase was identified in three different genera of Enterobacteriaceae: Klebsiella, Escherichia and Enterobacter.

❑ Outbreak plasmid: 101kb conjugative plasmid, carrying blaNDM-1. Other beta-lactamases (blaCTX-M-15) found on additional plasmids.

❑ Standard molecular epidemiologic tools, looking primarily at the core genome, would be of limited value in understanding transmission dynamics.

❑ TAKEHOME MESSAGE: MGEs play an important role in both short-term and long-term bacterial evolution, and they may complicate and confound an investigation.

Torres-Gonzales AAC 2015 doi:10.1128/AAC.00055-15


[Figure: two network diagrams of the same nodes (labeled 1 to 5 and A to C), one showing WGS connections and one showing epidemiologic connections.]

Epson et al. (ICHE 2014)


Building the tree

▪ Use the differences you identified by hqSNP or wgMLST to infer the relatedness or phylogeny

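[Figure: tree with leaves Isolate A, Isolate B, Isolate C and Isolate D; branches labeled with the amount of genetic change (11, 1, 6, 3, 5); sequences shown on the tree: actgaatta, actgccggt, ggagaatta, ggagagtta, ggattatta, ggatccccc, ggataatta.]

Isolate    Sequence
A          ggagagtta
B          ggatccccc
C          ggattatta
D          actgccggt
ancestor   actgaatta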
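The differences that feed a tree can be tabulated with a few lines of Python. This sketch counts pairwise differences between the four isolate sequences from this slide; it stops at the distance matrix and does not build the tree itself, and the function name is hypothetical.

```python
from itertools import combinations

# Isolate sequences as listed on the slide.
sequences = {
    "A": "ggagagtta",
    "B": "ggatccccc",
    "C": "ggattatta",
    "D": "actgccggt",
}


def differences(a, b):
    """Count positions that differ between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


distances = {
    (p, q): differences(sequences[p], sequences[q])
    for p, q in combinations(sorted(sequences), 2)
}
for pair, d in distances.items():
    print(pair, d)

closest = min(distances, key=distances.get)
print("closest pair:", closest)
```

A raw count of differences measures similarity between isolates, which the deck later distinguishes from relatedness on the tree.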


Reading the trees

[Figure: the tree of Isolates A to D (branch labels 11, 1, 6, 3, 5, in genetic change), annotated with the terms below.]

Leaf: taxa (terminal node)
Node: most recent common ancestor (for isolates B and C) (ancestral node)
Outgroup/Root: related isolate (same PFGE pattern or 7-gene MLST) but not part of outbreak
Clade


Rooted versus unrooted trees

[Figure: a rooted tree of Isolates A to D (branch labels 11, 1, 6, 3, 5, with the Root marked) beside an unrooted tree of Isolates A, B and C.]

- Rooted trees have a unique node (created using D) that can be used to infer the most recent common ancestor of all the isolates in the tree
- Unrooted trees show relatedness without inferring the ancestry of isolates


Trees, branches, and leaves – more than one way to draw a tree

▪ Many different ways to display trees

▪ Branches that connect to the terminal node are the important branch lengths to indicate relatedness


Trees, branches and leaves – reading the trees

▪ Difference between similarity and relatedness on the tree
▪ Isolates A and C are more similar to each other than C and B are
▪ Isolates C and B are more related to each other than C and A are

[Figure: the tree of Isolates A to D with branch labels (11, 1, 6, 3, 5, in genetic change) and the sequences actgaatta, actgccggt, ggagaatta, ggagagtta, ggattatta, ggatccccc, ggataatta.]


Trees, branches and leaves – what does it mean for my outbreak investigation

▪ Epidemiologic data provides context to the tree – cannot rely on phylogenetic tree to identify outbreak source

[Figure: the same tree with its leaves labeled kale, spinach, Stool and stool; branch labels 11, 1, 6, 3, 5 in genetic change; sequences actgaatta, actgccggt, ggagaatta, ggagagtta, ggattatta, ggatccccc, ggataatta.]


What do cowboys have to do with my WGS tree? Bootstrap values

▪ Bootstrap values: confidence values for how the phylogenetic relationships are drawn (subsample and redraw the tree)

Columns      Isolate A   Isolate B   Isolate C
123456789    ggagagtta   ggatccccc   ggattatta   (original alignment)
153456782    gaagagttg   gcatccccg   gtattattg
213456879    ggagagtta   ggatccccc   ggattatta
331245678    aagggagtt   aaggtcccc   aaggttatt

replicates with replacement

bootstrap replicates (100x-1000x)

[Figure: trees of A, B and C drawn from each replicate, combined into a consensus tree with 67% support.]
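The resampling step can be sketched as follows: alignment columns are drawn with replacement, and for each replicate we record which pair of isolates is closest. Using the closest pair as a stand-in for redrawing the tree is a simplifying assumption (real tools rebuild a full tree for every replicate), and the replicate count, the seed and the function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

# The three-isolate alignment shown on the slide.
alignment = {"A": "ggagagtta", "B": "ggatccccc", "C": "ggattatta"}


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement; return the resampled alignment."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Return the pair of names with the fewest differences (ties: first pair wins)."""
    def diff(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=diff)


def bootstrap_support(alignment, replicates=1000, seed=0):
    """Fraction of replicates in which each pair comes out as the closest pair."""
    rng = random.Random(seed)
    counts = Counter(
        closest_pair(resample_columns(alignment, rng)) for _ in range(replicates)
    )
    return {pair: n / replicates for pair, n in counts.items()}


support = bootstrap_support(alignment)
for pair, fraction in sorted(support.items()):
    print(pair, round(fraction, 2))
```

A grouping that appears in most replicates is well supported by the data; one that appears in only a small share of replicates depends on a few columns and should be interpreted with caution.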