steven essinger, robi polikar, gail rosen k electrical ...sessinger.com/posters/nn_isme.pdfsteven...

!

Metagenomic Fragments

Naïve Bayes Classifier

Training Database

Unsupervised Clustering Algorithm (e.g. ART)

Fragments Clustered by

Class (e.g. Phyla)

Goal: Predict the taxonomic classifica2on of organisms based on the fragments obtained from an environmental sample that may include many previously uniden2fied organisms.

Conclusion: •  Compared to other unsupervised and semi-‐supervised approaches, we

cluster shorter reads (500bp) and more strains (200 to 400) than any other method, to show the clustering method’s feasibili2es on real metagenomics datasets.

•  We demonstrate that adap2ve resonance theory is able to cluster novel phyla beGer than K-‐means when there are a large number of fragments to cluster. This is due to the incremental learning capability of ART and its ability to learn non-‐spherical clusters.

•  On an extremely challenging dataset of grouping 500bp reads from 204 strains spanning 17 phyla, ART is able to accomplish this with 43% accuracy (5.9% by chance)

Neural Network-based Taxonomic Clustering for Metagenomics

Steven Essinger, Robi Polikar, Gail Rosen Electrical & Computer Engineering, Drexel University, 3141 Chestnut Street

Philadelphia, PA 19104, US

This work was supported in part by National Science Foundation award #0845827

Challenge

•  The challenge we face is that we cannot simply cluster fragments together that are similar in composi2on as many clustering methods tend to do.

•  While two strains may be similar inter-‐genomically, each generally will vary greatly intra-‐genomically. Since the fragments we are clustering represent short samples of each strain’s genome, we expect that the fragments in each cluster will vary greatly.

•  Current methods do not address next-‐genera2on sequencing technology •  LikelyBin: successful only for low complexity samples (2-‐10 species) •  GSOM: successful when read lengths are greater than 8kbp •  CompostBin: successfully tested only for low-‐complexity samples

Experiment 1: Training on 2 large phyla to cluster 17 smaller phyla

Experiment 2: Training on 17 smaller phyla to cluster 2 large phyla

Experiment 3: Training on examples of each phyla to cluster the rest

Experiment! 1! 2! 3!Training Phyla! 2! 17! 19!Test Phyla! 17! 2! 19!Training Strains! 431! 204! 320!Test Strains! 204! 431! 315! Table 1

•  635 microbe genomes obtained from Na2onal Center for Informa2on Biotechnology

•  Dataset spans 19 different phyla: We selected this level since it is comprised of microbes that are much more diverse than those belonging to the levels of genus or species

•  Whole-‐genomes used in training database

•  Test fragments obtained from test strains by random sample 500 bp in length, 100x

Fig. 3

Fig. 1

Summary

Test Data

Algorithm

Results

Fig. 2

1 2 3

1 23

known cl

usters!

unknown cl

usters!

!!

!

"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!

+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!

L,! MINOPHN!

!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!

!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,! H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!

;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!

! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!

<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!

!!

!


+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!

L,! MINOPHN!


!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,! H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!




!!

!


+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!

L,! MINOPHN!


!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,!H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!




Proposed Algorithm Input: • Metagenomic reads (fragments) from next-gen

sequencing technology • Training database (TDB) – consists of G labeled

genomes, previously acquired • Unsupervised clustering algorithm (e.g. ART, K-means) • Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of

G genome probability profiles Do: i = 1, ..., G

Do: j = 1, …4N (# of diff. motif perm.)

End

End B. Score fragments, evaluate fragment, f using NBC

Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs

each of length N in fragment, f: [M1, M2, M3, …, MJ]T

2. Calculate probability of fragment belonging to genomei in TDB:

End

C. Build feature matrix for unsupervised classifier

D. Call unsupervised clustering algorithm • Cluster each fragment using corresponding

feature vector of dimension G Output: • Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit

• Accuracy to group similar classes together

• Accuracy of algorithm to isolate dissimilar classes

C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla

Features NBC Scores genome1 genome2 … genomeG

Frag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . .

. . . . . . Obj

ects

FragF SF,1 . . SF,G

Proposed Algorithm Input: • Metagenomic reads (fragments) from next-gen

sequencing technology • Training database (TDB) – consists of G labeled

genomes, previously acquired • Unsupervised clustering algorithm (e.g. ART, K-means) • Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of

G genome probability profiles Do: i = 1, ..., G

Do: j = 1, …4N (# of diff. motif perm.)

End

End B. Score fragments, evaluate fragment, f using NBC

Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs

each of length N in fragment, f: [M1, M2, M3, …, MJ]T

2. Calculate probability of fragment belonging to genomei in TDB:

End

C. Build feature matrix for unsupervised classifier

D. Call unsupervised clustering algorithm • Cluster each fragment using corresponding

feature vector of dimension G Output: • Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit

• Accuracy to group similar classes together

• Accuracy of algorithm to isolate dissimilar classes

C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla

Features NBC Scores genome1 genome2 … genomeG

Frag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . .

. . . . . . Obj

ects

FragF SF,1 . . SF,G

steven essinger, robi polikar, gail rosen k electrical ...sessinger.com/posters/nn_isme.pdfsteven...

Documents