Post on 21-Aug-2020

COllaborative Management Platform for detection and Analyses of (Re-) emerging and foodborne outbreaks in Europe

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 643476. 

Introduction to machine learning ‐ potential of sequence data

Nanna Munck, PhD student
Patrick Murigu Kamau Njage, postdoc

September 28, 2017. 

Content

• Machine learning method

• Potential of sequence data

• Example from my PhD project

Matrix ‐ example of machine learning input

Machine learning

• Analyse large and complex datasets by recognizing patterns 

• “Algorithm improves with experience” (data) (Libbrecht, M. W. and Noble, W. S., 2015)

• Identifies the relevant “features” in a complex data set, which makes strong predictions possible

• Example applications: spam filtering, face recognition, smart cars, finance, advertising in web browsers, etc.

Machine learning

Machine learning Algorithms

Machine learning

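Figure 1. Conceptual model, modified from Libbrecht, M. W. and Noble, W. S., 2015

[Figure not reproduced. Labels shown in the figure: Broiler, Pig, Pig, Layers, Broiler, Layers, Pig, Broiler; and a numbered list: 1. Broiler, 2. Pig, 3. Layers, 4. Broiler, 5. Pig, 6. Layers, 7. Pig, 8. Pig]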
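The idea of learning from labelled examples and then predicting the label of a new one can be illustrated with a very small sketch. It uses a nearest-neighbour rule on binary feature vectors, which is a stand-in chosen for brevity and not the method used in the project; the data and the function names `hamming` and `predict_nearest` are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_nearest(train_features, train_labels, query):
    """Return the label of the training example closest to `query`."""
    distances = [hamming(row, query) for row in train_features]
    nearest = distances.index(min(distances))  # first closest example wins ties
    return train_labels[nearest]


# Made-up training data: one binary feature vector per labelled isolate.
train_features = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
train_labels = ["Broiler", "Pig", "Layers"]

# An unlabelled isolate is assigned the label of its nearest neighbour.
prediction = predict_nearest(train_features, train_labels, [1, 1, 0, 1])
print(prediction)
```

The query differs from the first training vector in one position and from the other two in three positions, so it receives the first vector's label.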

Machine learning ‐ Conceptual model of the workflow

Figure 2. Conceptual model of the machine learning workflow

Potential of sequence data

CGE Tools

• Resistance genes

• Plasmid replicons

• Virulence genes

• MLST

• SNP

• Other

Label        Gene1  Gene2  Gene3  Gene4  Gene5  …
Sequence 1   1      1      0      0      0
Sequence 2   0      1      1      1      1
Sequence 3   1      0      1      0      0
Sequence 4   1      0      0      0      1
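A presence/absence matrix like the one in this example can be built from per-sequence lists of detected genes. The following is a minimal sketch in plain Python; the gene hits mirror the example table, while the function name `build_feature_matrix` and the dictionary layout are illustrative assumptions and not the presenters' code.

```python
def build_feature_matrix(gene_hits):
    """Turn {sequence: set of genes found} into a binary matrix.

    Returns (row_names, column_names, rows); rows[i][j] is 1 if
    column_names[j] was found in row_names[i], otherwise 0.
    """
    row_names = list(gene_hits)
    # Every gene seen in at least one sequence becomes a column.
    column_names = sorted(set().union(*gene_hits.values()))
    rows = [
        [1 if gene in gene_hits[name] else 0 for gene in column_names]
        for name in row_names
    ]
    return row_names, column_names, rows


# Gene hits matching the example table (hypothetical tool output).
gene_hits = {
    "Sequence 1": {"Gene1", "Gene2"},
    "Sequence 2": {"Gene2", "Gene3", "Gene4", "Gene5"},
    "Sequence 3": {"Gene1", "Gene3"},
    "Sequence 4": {"Gene1", "Gene5"},
}

row_names, column_names, matrix = build_feature_matrix(gene_hits)
for name, row in zip(row_names, matrix):
    print(name, row)
```

Each row is then one input example for a classifier, and the source label of each sequence is kept in a separate list of the same length.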

Example from my PhD project ‐ Introduction

• Source attribution models link the number of cases of a specific foodborne illness to specific food and animal reservoirs 

• Existing models are based mainly on phenotypic information about a given bacterium and on a comparison of its distribution in potential sources and in humans

• With next-generation sequencing technology, new methods for source attribution are available

• We investigate the potential of machine learning to predict the source (animal reservoir) from which a given Salmonella isolate originates, based on genotypic features

Example from my PhD project ‐ Case: Salmonella Typhimurium

• Danish Salmonella Typhimurium isolates, 2012‐14, extracted from the Vivaldi database
  – 22 broilers
  – 20 pigs
  – 4 layers

• Features, in total 43: resistance genes (n = 20), plasmid replicons (n = 21), MLST type (n = 2) 
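The isolate counts are unevenly spread over the three sources, which matters when reading any accuracy figure. As a rough sketch, the snippet below computes the class proportions, the accuracy of always guessing the most common source, and checks that the feature groups add up to the stated total. All numbers come from the slide; the variable names are illustrative.

```python
# Isolate counts per source, as given on the slide.
counts = {"broilers": 22, "pigs": 20, "layers": 4}

total = sum(counts.values())
proportions = {source: n / total for source, n in counts.items()}

# A classifier that always predicts the most common source would be
# right this often on data with the same class distribution.
majority_source = max(counts, key=counts.get)
baseline_accuracy = counts[majority_source] / total

# Feature groups from the slide; their sum should equal the stated 43.
feature_groups = {"resistance genes": 20, "plasmid replicons": 21, "MLST type": 2}
n_features = sum(feature_groups.values())

print(total, majority_source, round(baseline_accuracy, 2), n_features)
```

With 46 isolates in total, always predicting broilers would be correct for 22 of them, or roughly 48 %, assuming the evaluation data follow the same distribution.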

Example from my PhD project ‐ Data exploration, SNP analysis

[Figure not reproduced: SNP analysis. Legend: Pigs, Broilers, Layers]

Example from my PhD project ‐ Results and discussion

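Table 1. Accuracy

Measure             Model(s)     Value
Model accuracy      rf           0.67
Model accuracy      svmr         0.67
Model accuracy      svml         0.5
Agreement accuracy  svmr vs rf   0.86
Agreement accuracy  svml vs rf   0.79

Table 2. Confusion matrix

Predicted/Data  Broilers  Layers  Pigs
Broilers        4         2       2
Layers          0         0       0
Pigs            3         0       3

Table 3. Importance measures

Feature      Type              Broilers  Layers  Pigs
IncFIB(S)    plasmid replicon   6.56      2.01    5.58
IncQ1        plasmid replicon   5.24      1.00    5.93
IncFII(S)    plasmid replicon   5.62      2.01    4.94
strA         resistance gene    4.06      1.00    5.15
sul2         resistance gene    3.33      1.74    3.94
ST-19        MLST type          3.57      1.00   -1.17
ColpVC       plasmid replicon   3.23     -1.74    2.53
blaTEM-1B    resistance gene    2.87      1.42   -1.32
Col(VCM04)   plasmid replicon   2.86     -1.42   -2.56
tet(B)       resistance gene    2.59      1.00    1.07
ST-34        MLST type          2.52      0.00    1.44
Col156       plasmid replicon   1.98      0.00    2.10
IncI1        plasmid replicon  -1.50      0.00    2.08
strB         resistance gene    2.01      1.00    1.59
ColRNAI      plasmid replicon  -2.05      1.00   -0.66
strA.like    resistance gene    0.09      0.00    0.10
tet(A)like   resistance gene    0.09      0.00   -0.30

On the slide, feature types are marked by typeface: bold for resistance genes, italic for plasmid replicons, normal for MLST type.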
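To show how the confusion matrix relates to an accuracy value, the sketch below recomputes overall accuracy and per-source recall from the counts in Table 2. It assumes that rows are predicted sources and columns are the sources recorded in the data, following the "Predicted/Data" header; the variable names are illustrative.

```python
sources = ["Broilers", "Layers", "Pigs"]

# Rows: predicted source, columns: source in the data (assumed orientation).
confusion = [
    [4, 2, 2],  # predicted Broilers
    [0, 0, 0],  # predicted Layers
    [3, 0, 3],  # predicted Pigs
]

n_samples = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(len(sources)))
accuracy = n_correct / n_samples

# Recall per source: correct predictions divided by the column total.
recall = {}
for j, source in enumerate(sources):
    column_total = sum(row[j] for row in confusion)
    recall[source] = confusion[j][j] / column_total if column_total else None

print(accuracy, recall)
```

Under this reading, 7 of 14 isolates are classified correctly and no layer isolate is recognised. The resulting accuracy of 0.5 equals the value listed for svml in Table 1, although the slides do not state which model produced the matrix.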

Example from my PhD project ‐ Conclusion and future work

• Host-specific patterns
  – None of the features included so far showed clear predictability.

• Features
  – As this is work in progress, more features, such as virulence genes, still need to be explored.

• Data
  – More data is to be used, and more knowledge is to be gained about the interpretation of the model outputs, in order to improve the model.

Thank you

Questions?
