Introduction to machine learning ‐ potential of sequence data

Page 1:

COllaborative Management Platform for detection and Analyses of (Re-) emerging and foodborne outbreaks in Europe

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 643476. 

Introduction to machine learning ‐ potential of sequence data

Nanna Munck, PhD student
Patrick Murigu Kamau Njage, postdoc

September 28, 2017. 

Page 2:

Content

• Machine learning method
• Potential of sequence data
• Example from my PhD project

Page 7:

Matrix ‐ example of machine learning input

COMPARE is funded by the European Union’s Horizon 2020 research and innovation programme under Grant agreement No 643476

Page 8:

Machine learning

• Analyse large and complex datasets by recognizing patterns 

• “Algorithm improves with experience” (data) (Libbrecht M. W. and Noble W. S., 2015)

• Identifies relevant “features” in a complex dataset, which makes strong predictions possible

• Example applications: spam filtering, face recognition, smart cars, finance, advertising in web browsers, etc.

Page 9:

Machine learning

Page 10:

Machine learning algorithms

Page 11:

Machine learning

Broiler, Pig, Pig, Layers, Broiler, Layers, Pig, Broiler

1. Broiler
2. Pig
3. Layers
4. Broiler
5. Pig
6. Layers
7. Pig
8. Pig

Figure 1. Conceptual model, modified from Libbrecht M. W. and Noble W. S., 2015

Labels

Page 12:

Machine learning ‐ Conceptual model of workflow

Figure 2. Conceptual model of the machine learning workflow

Page 13:

Potential of sequence data

CGE Tools:

• Resistance genes
• Plasmid replicons
• Virulence genes
• MLST
• SNP
• Other

Label        Gene1   Gene2   Gene3   Gene4   Gene5   …
Sequence 1   1       1       0       0       0
Sequence 2   0       1       1       1       1
Sequence 3   1       0       1       0       0
Sequence 4   1       0       0       0       1
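A presence/absence matrix of this kind can be assembled from per-sequence gene calls. The following is a minimal Python sketch, not actual CGE tool output: the `gene_calls` dictionary and the function name `build_feature_matrix` are hypothetical and only mirror the four example sequences on the slide.

```python
def build_feature_matrix(gene_calls, features):
    """Return one row of 0/1 values per sequence, ordered like `features`."""
    return {
        seq: [1 if feature in genes else 0 for feature in features]
        for seq, genes in gene_calls.items()
    }


# Hypothetical gene calls, chosen to reproduce the example rows on the slide.
features = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5"]
gene_calls = {
    "Sequence 1": {"Gene1", "Gene2"},
    "Sequence 2": {"Gene2", "Gene3", "Gene4", "Gene5"},
    "Sequence 3": {"Gene1", "Gene3"},
    "Sequence 4": {"Gene1", "Gene5"},
}

matrix = build_feature_matrix(gene_calls, features)
for seq, row in matrix.items():
    print(seq, row)
```

Each row is one sequence and each column one feature, which is the input shape most classifiers expect.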

Page 14:

Example from my PhD project ‐ Introduction

• Source attribution models link the number of cases of a specific foodborne illness to specific food and animal reservoirs 

• Existing models are based mainly on phenotypic information about a given bacterium and a comparison of its distribution in potential sources and in humans

• With next-generation sequencing technology, new methods for source attribution are available

• We investigate the potential of machine learning to predict the source (animal reservoir) from which a given Salmonella bacterium originates, based on genotypic features
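To make the idea of predicting a source from genotypic features concrete, here is a minimal Python sketch. It is not the method used in the project: it swaps in a deliberately simple nearest-neighbour vote on Hamming distance, and the training data, the feature vectors and the function names `hamming` and `predict_source` are all hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two presence/absence vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_source(train, query, k=3):
    """Majority source among the k training isolates closest to `query`.

    `train` is a list of (feature_vector, source) pairs.
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(source for _, source in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: four presence/absence features per isolate.
train = [
    ([1, 1, 0, 0], "Broiler"),
    ([1, 1, 1, 0], "Broiler"),
    ([1, 0, 0, 0], "Broiler"),
    ([0, 0, 1, 1], "Pig"),
    ([0, 1, 1, 1], "Pig"),
    ([0, 0, 0, 1], "Pig"),
]

print(predict_source(train, [1, 0, 1, 0]))
print(predict_source(train, [0, 1, 0, 1]))
```

The same interface (labelled feature vectors in, a predicted source out) applies to any other classifier that takes the place of the nearest-neighbour vote.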

Page 15:

Example from my PhD project ‐ Case: Salmonella Typhimurium

• Danish Salmonella Typhimurium isolates, 2012‐14, extracted from the Vivaldi database
– 22 broilers
– 20 pigs
– 4 layers

• Features, in total 43: resistance genes (n = 20), plasmid replicons (n = 21), MLST type (n = 2) 
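The isolate counts above are unevenly spread across the three sources, which matters when reading accuracy figures. The short Python sketch below computes the class shares and the accuracy of a trivial "always predict the largest class" baseline over all isolates. The variable names are illustrative, and a held-out test set may have a different class mix than the full collection.

```python
# Isolate counts per source, taken from the slide.
counts = {"broilers": 22, "pigs": 20, "layers": 4}

total = sum(counts.values())
shares = {source: n / total for source, n in counts.items()}

# A classifier that always answers with the largest class is a useful
# reference point for judging reported accuracies.
majority_source = max(counts, key=counts.get)
baseline_accuracy = counts[majority_source] / total

print("isolates:", total)
for source, share in shares.items():
    print(f"{source}: {share:.2f}")
print(f"majority baseline ({majority_source}): {baseline_accuracy:.2f}")
```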

Page 16:

Example from my PhD project ‐ Data exploration, SNP analysis

Legend: Pigs, Broilers, Layers

Page 17:

Example from my PhD project ‐ Results and discussion

Table 1. Accuracy

Measure                  Value
Accuracy, rf             0.67
Accuracy, svmr           0.67
Accuracy, svml           0.5
Agreement, svmr vs rf    0.86
Agreement, svml vs rf    0.79

Table 2. Confusion matrix

Predicted/Data   Broilers   Layers   Pigs
Broilers         4          2        2
Layers           0          0        0
Pigs             3          0        3

Table 3. Importance measures

Feature       Type               Broilers   Layers   Pigs
IncFIB(S)     plasmid replicon   6.56       2.01     5.58
IncQ1         plasmid replicon   5.24       1.00     5.93
IncFII(S)     plasmid replicon   5.62       2.01     4.94
strA          resistance gene    4.06       1.00     5.15
sul2          resistance gene    3.33       1.74     3.94
ST-19         MLST type          3.57       1.00     -1.17
ColpVC        plasmid replicon   3.23       -1.74    2.53
blaTEM-1B     resistance gene    2.87       1.42     -1.32
Col(VCM04)    plasmid replicon   2.86       -1.42    -2.56
tet(B)        resistance gene    2.59       1.00     1.07
ST-34         MLST type          2.52       0.00     1.44
Col156        plasmid replicon   1.98       0.00     2.10
IncI1         plasmid replicon   -1.50      0.00     2.08
strB          resistance gene    2.01       1.00     1.59
ColRNAI       plasmid replicon   -2.05      1.00     -0.66
strA.like     resistance gene    0.09       0.00     0.10
tet(A)like    resistance gene    0.09       0.00     -0.30
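The counts in Table 2 can be turned into summary measures with a few lines of Python. This is a minimal sketch that assumes, following the "Predicted/Data" header, that rows are predicted sources and columns are the sources recorded in the data. The function names are illustrative, and the slide does not say which of the models produced this matrix.

```python
sources = ["Broilers", "Layers", "Pigs"]

# Rows: predicted source. Columns: source in the data (counts from Table 2).
confusion = [
    [4, 2, 2],
    [0, 0, 0],
    [3, 0, 3],
]


def accuracy(m):
    """Share of all isolates on the diagonal (predicted == data)."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total


def recall_per_class(m):
    """For each true source (column), the share predicted correctly."""
    result = []
    for j in range(len(m)):
        column_total = sum(m[i][j] for i in range(len(m)))
        result.append(m[j][j] / column_total if column_total else None)
    return result


def precision_per_class(m):
    """For each predicted source (row), the share that was correct."""
    result = []
    for i, row in enumerate(m):
        row_total = sum(row)
        result.append(row[i] / row_total if row_total else None)
    return result


print("accuracy:", accuracy(confusion))
for name, rec, prec in zip(sources, recall_per_class(confusion),
                           precision_per_class(confusion)):
    print(name, "recall:", rec, "precision:", prec)
```

Under that reading of the table, 7 of the 14 isolates lie on the diagonal, and neither of the two isolates from layers is assigned to layers, so precision for that class is undefined (returned as `None`).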

Page 18:

Example from my PhD project ‐ Conclusion and future work

• Host-specific patterns
– None of the features included so far showed clear predictability.

• Features
– As this is work in progress, more features, such as virulence genes, still need to be explored.

• Data
– More data is to be used, and more knowledge is to be gained about interpretation of the model outputs, in order to improve the model.

Page 19:

Thank you

Questions?