Machine Learning Techniques for Bacteria Classification
Massimo La Rosa, Riccardo Rizzo, Alfonso M. Urso, S. Gaglio
ICAR-CNR
University of Palermo
Workshop on Hardware Architectures Beyond 2020:
Challenges and Opportunities for Computational Biology and Bioinformatics
Napoli – December 19, 2007
Outline
● Motivation: A new approach to microbial identification
● Goals: Genotypic feature based taxonomy, Visualization
● Methodologies: Self organizing Topographic map, Deterministic annealing
● Results
● Conclusions: A methodology to create a visualization and classification tool
Motivation
● Microbial identification is crucial for the study of infectious diseases.
● Bacterial taxonomy is usually based on phenotypic characters.
● A new approach based on the bacterial genotype is under development.
● The 16S rRNA “housekeeping” gene is used for taxonomic purposes.
Goal
● Genotypic features based taxonomy
● Topographic representation of the bacteria clusters
– Finding misclassification = discovery of new pathogens
– Classifying organisms with an unusual phenotype
Methodologies
● General framework
● Building Dataset
● Sequence Alignment
● Evolutionary Distance
● Soft Topographic Map Algorithm
General framework
1 Downloading and filtering gene sequences from NCBI databases
2 Sequence alignment (Needleman-Wunsch)
3 Computing the dissimilarity matrix (evolutionary distance)
4 Clustering (SOM on pairwise distances) and visualization (U-Matrix style map)
[Figure: pipeline diagram with stages numbered 1 to 4; box labels: Sequence DB, Retrieval, Filtering, Sequence File, Taxonomy, Labeling, Sequence Alignment, Computing Distance, Clustering and mapping, Results]
Building Dataset
Phylum BXII (Proteobacteria)
Class III (Gammaproteobacteria)
14 Orders
147 16S rRNA gene sequences downloaded from the GenBank database
Sequence Alignment
● Sequence alignment makes it possible to compare homologous sites of the same gene between two different species
● Two well-known alignment algorithms were used:
– ClustalW: multiple-alignment
– Needleman-Wunsch: pairwise alignment
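As an illustration of the pairwise case, here is a minimal Python sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions made for the example; the slides do not state which scoring scheme or implementation was actually used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative
    assumptions, not the parameters used in the original work.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two symbols
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score, top, bottom)
```

The two returned strings have equal length, so homologous sites can be compared column by column in the distance step.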
Evolutionary Distance
● The simplest type of distance is the number of nucleotide substitutions per site.
– Warning: it underestimates real distances
● The Jukes and Cantor method was used: it provides a better estimate of evolutionary distances
● Evolutionary distances are the elements of the symmetric dissimilarity matrix:

Type strain  1  2        3        4        5        6        7
1            0  0.06286  0.11215  0.06482  0.05128  0.09451  0.06785
2               0        0.10608  0.0579   0.065    0.07196  0.04682
3                        0        0.1224   0.11418  0.10279  0.11538
4                                 0        0.06082  0.10224  0.06764
5                                          0        0.10595  0.07362
6                                                   0        0.08232
7                                                            0
...
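The Jukes and Cantor correction can be sketched in a few lines of Python. The formula d = -(3/4) ln(1 - (4/3) p) is the standard Jukes-Cantor estimate, where p is the observed proportion of differing sites; the function names and the choice to skip gap columns are assumptions made for this example.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') are skipped (an assumption for this sketch).
    """
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p), defined for p < 0.75."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # one difference over ten sites
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, which reflects the warning above that the raw count underestimates real distances.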
Soft Topographic Map Algorithm (1)
● Extension of Kohonen's SOM for pairwise data
● The position of bacteria clusters in the topographic map is based on the optimization, through the deterministic annealing technique, of a cost function that takes its minimum when each data point is mapped to the best matching neuron
[Figure: the Soft Topographic Map Algorithm takes a dissimilarity matrix as input and produces as output a topographic map showing relationships among bacteria clusters]
Soft Topographic Map Algorithm (2)
Partial assignment cost:
$$e_{tr} = \sum_s h_{rs} \sum_{t'} a_{t's} \left( d_{tt'} - \frac{1}{2} \sum_{t''} a_{t''s} \, d_{t't''} \right), \quad \forall t, r$$
Weighting factor:
$$a_{tr} = \frac{\sum_s h_{rs} \, P(x_t \in C_s)}{\sum_{t'} \sum_s h_{rs} \, P(x_{t'} \in C_s)}, \quad \forall t, r$$
Assignment probability:
$$P(x_t \in C_r) = \frac{\exp(-\beta e_{tr})}{\sum_u \exp(-\beta e_{tu})}, \quad \forall t, r$$
Neighborhood function:
$$h_{rs} = \exp\left( -\frac{|r - s|^2}{2\sigma^2} \right), \quad \forall r, s$$
1) Initialization Step:
a) put $e_{tr} \leftarrow n_{tr}, \ \forall t, r$, with $n_{tr} \in [0,1]$
b) compute lookup table for $h_{rs}$
c) choose initial value of $\beta$, final $\beta_{final}$, increasing temperature factor $\eta$, threshold $\epsilon$
2) Training Step:
a) while $\beta < \beta_{final}$ (Annealing cycle)
i. repeat (EM cycle)
A) E step: compute $P(x_t \in C_r), \ \forall t, r$
B) M step: compute $a_{tr}^{new}, \ \forall t, r$
C) M step: compute $e_{tr}^{new}, \ \forall t, r$
ii. until $\| e_{tr}^{new} - e_{tr}^{old} \| < \epsilon$
iii. put $\beta \leftarrow \eta \beta$
b) end while
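The initialization and training steps above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation (which the slides say was written in Java and Python): the function names, the default values of `sigma`, `beta0`, `beta_final`, `eta`, `eps`, the `max_em` safety cap and the use of the Frobenius norm for the stopping test are all assumptions made for the example.

```python
import numpy as np


def assignment_probabilities(E, beta):
    """E step: softmax of -beta * e_tr over the neurons r (numerically stabilised)."""
    logits = -beta * E
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)


def soft_topographic_map(D, grid=(4, 4), sigma=1.0, beta0=0.1,
                         beta_final=50.0, eta=1.5, eps=1e-4,
                         max_em=100, seed=0):
    """Sketch of the annealed EM loop on a dissimilarity matrix D (T x T).

    All default parameter values are illustrative assumptions.
    Returns (winning neuron index per item, assignment probabilities).
    """
    rng = np.random.default_rng(seed)
    T = D.shape[0]
    # Neuron coordinates on a two-dimensional lattice.
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])],
                      dtype=float)
    R = len(coords)
    # Lookup table h_rs = exp(-|r - s|^2 / (2 sigma^2)).
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    H = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Initialization: e_tr <- n_tr, random values in [0, 1).
    E = rng.random((T, R))
    beta = beta0
    while beta < beta_final:                      # annealing cycle
        for _ in range(max_em):                   # EM cycle
            P = assignment_probabilities(E, beta)             # E step
            A = P @ H.T                                       # sum_s h_rs P(x_t in C_s)
            A /= A.sum(axis=0, keepdims=True)                 # M step: a_tr
            DA = D @ A                                        # sum_t' d_tt' a_t's
            self_term = 0.5 * (A * DA).sum(axis=0)            # 1/2 sum a_t's a_t''s d_t't''
            E_new = (DA - self_term) @ H.T                    # M step: e_tr
            converged = np.linalg.norm(E_new - E) < eps
            E = E_new
            if converged:
                break
        beta *= eta                               # raise beta (lower the temperature)
    P = assignment_probabilities(E, beta_final)
    return P.argmax(axis=1), P


# Toy dissimilarity matrix: the first four type strains of the matrix shown earlier.
D = np.array([[0.0,     0.06286, 0.11215, 0.06482],
              [0.06286, 0.0,     0.10608, 0.0579],
              [0.11215, 0.10608, 0.0,     0.1224],
              [0.06482, 0.0579,  0.1224,  0.0]])
winners, P = soft_topographic_map(D, grid=(3, 3))
print(winners)
```

Each item is finally placed on the neuron with the highest assignment probability; a practical implementation would likely need more care with symmetry breaking and parameter tuning than this sketch provides.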
Soft Topographic Map
[Figure label: Input vector]
The topographic map is a lattice (two-dimensional in our case) that self-organizes in the pattern space
Soft Topographic Map Algorithm (3)
● The algorithm that “moves” this lattice is called deterministic annealing
– The advantage of deterministic annealing is that it aims at a global minimum of the approximation error
[Figure: the equations and training steps of the Soft Topographic Map Algorithm shown again, with the annealing cycle labeled "Deterministic annealing"]
Experimental Results
● Map dimensions from 8x8 up to 45x45
● We trained 20 maps of each geometry in order to reduce the dependence on the initial conditions
● The results obtained using the two alignment methods do not show any significant difference
[Figure: example maps: 12x12 map, 16x16 map, 20x20 map]
Experimental tests
● Hardware resources
– 16-node cluster, dual-processor Xeon 3.4 GHz, 4 GB RAM, 6 TB storage, Myrinet-Fiber communication
● Software
– Languages: Java, Python
– Libraries: BioJava, Jama ....

Map size  Average processing time (min.)
8x8       0.6
9x9       1
10x10     2
11x11     2
12x12     4
13x13     4
14x14     6
15x15     9
16x16     12
17x17     16
18x18     17
19x19     23
20x20     30
25x25     90
30x30     240
35x35     360
40x40     660
45x45     1380
Map Evaluation
[Figure: Mixed Clusters]
Map Evaluation
We only have distances between patterns, and no metrics!
Usually topology measures are considered, but in our case there is no space that contains the patterns (sequences)
Probable topology distortion
[Figure: Ideal Map]
Map evaluation
● We take rows and columns of the maps and compare the order of the elements in the map with the order obtained from the dissimilarity matrix
Map evaluation
This sequence (a row or column of the map) is compared with the sequence of the same objects obtained from the dissimilarity matrix.

Dissimilarity matrix
Type strain  1  2        3
1            0  0.06286  0.11215
2               0        0.10608
3                        0

This comparison is made using the Spearman coefficient in order to obtain a similarity value between the two sequences.
Of course, the two sequences should be the same in a good map.
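One way to make this comparison concrete is sketched below in Python. The slides do not say exactly how the reference sequence is derived from the dissimilarity matrix, so this example assumes the first item of a map row is taken as reference and the remaining items are ordered by their dissimilarity to it; `row_agreement` and the helper names are hypothetical.

```python
import math


def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def row_agreement(row_items, D):
    """Compare the order along a map row with the order given by D.

    Assumption for this sketch: row_items[0] is the reference item, and the
    other items are compared by position in the row versus dissimilarity
    to the reference.
    """
    ref, others = row_items[0], row_items[1:]
    positions = list(range(1, len(row_items)))
    dissimilarities = [D[ref][t] for t in others]
    return spearman(positions, dissimilarities)


# First four type strains of the dissimilarity matrix (0-based indices).
D = [[0.0,     0.06286, 0.11215, 0.06482],
     [0.06286, 0.0,     0.10608, 0.0579],
     [0.11215, 0.10608, 0.0,     0.1224],
     [0.06482, 0.0579,  0.1224,  0.0]]
print(row_agreement([0, 1, 3, 2], D))  # row order follows increasing dissimilarity
print(row_agreement([0, 2, 3, 1], D))  # row order is reversed
```

Under this assumption a row whose order matches the dissimilarity order yields a coefficient of 1, and a fully reversed row yields -1.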
Map Evaluation
Map evaluation
Low number of mixed clusters
Low value of Spearman Coefficient
The “Best” Map
● We have an index for each map and we can see that some geometries are better than others
The “Best” Map
Comparison with the phylogenetic tree
Conclusions
● Soft Topographic Map for clustering and classification of bacteria
● Genotype based taxonomy
● Detecting singular situations
● Further analysis with other housekeeping genes or using other distance algorithms, e.g. Normalized Compression Distance
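For reference, the Normalized Compression Distance mentioned in the last point can be sketched as follows. The formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) is the usual definition, with C the compressed length; the choice of zlib as compressor and the toy sequences are assumptions made for this example, not something the slides specify.

```python
import random
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance using zlib as the compressor (an assumption)."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data: a highly repetitive sequence and a pseudo-random one.
rng = random.Random(0)
repetitive = b"ACGT" * 200
scrambled = bytes(rng.choice(b"ACGT") for _ in range(800))

print(ncd(repetitive, repetitive))  # small: the second copy adds little information
print(ncd(repetitive, scrambled))   # larger: the sequences share little structure
```

Because it needs no alignment, such a distance could feed the same dissimilarity matrix used by the Soft Topographic Map.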