steven essinger, robi polikar, gail rosen k electrical ...sessinger.com/posters/nn_isme.pdfsteven...

1
Metagenomic Fragments Naïve Bayes Classifier Training Database Unsupervised Clustering Algorithm (e.g. ART) Fragments Clustered by Class (e.g. Phyla) Goal: Predict the taxonomic classifica2on of organisms based on the fragments obtained from an environmental sample that may include many previously uniden2fied organisms. Conclusion: Compared to other unsupervised and semisupervised approaches, we cluster shorter reads (500bp) and more strains (200 to 400) than any other method, to show the clustering method’s feasibili2es on real metagenomics datasets. We demonstrate that adap2ve resonance theory is able to cluster novel phyla beGer than Kmeans when there are a large number of fragments to cluster. This is due to the incremental learning capability of ART and its ability to learn nonspherical clusters. On an extremely challenging dataset of grouping 500bp reads from 204 strains spanning 17 phyla, ART is able to accomplish this with 43% accuracy (5.9% by chance) Neural Network-based Taxonomic Clustering for Metagenomics Steven Essinger, Robi Polikar, Gail Rosen Electrical & Computer Engineering, Drexel University, 3141 Chestnut Street Philadelphia, PA 19104, US This work was supported in part by National Science Foundation award #0845827 Challenge The challenge we face is that we cannot simply cluster fragments together that are similar in composi2on as many clustering methods tend to do. While two strains may be similar intergenomically, each generally will vary greatly intragenomically. Since the fragments we are clustering represent short samples of each strain’s genome, we expect that the fragments in each cluster will vary greatly. Current methods do not address nextgenera2on sequencing technology LikelyBin: successful only for low complexity samples (210 species) GSOM: successful when read lengths are greater than 8kbp CompostBin: successfully tested only for lowcomplexity samples Experiment 1: Training on 2 large phyla to cluster 17 smaller phyla Experiment 2: Training on 17 smaller phyla to cluster 2 large phyla Experiment 3: Training on examples of each phyla to cluster the rest Experiment 1 2 3 Training Phyla 2 17 19 Test Phyla 17 2 19 Training Strains 431 204 320 Test Strains 204 431 315 Table 1 635 microbe genomes obtained from Na2onal Center for Informa2on Biotechnology Dataset spans 19 different phyla: We selected this level since it is comprised of microbes that are much more diverse than those belonging to the levels of genus or species Wholegenomes used in training database Test fragments obtained from test strains by random sample 500 bp in length, 100x Fig. 3 Fig. 1 Summary Test Data Algorithm Results Fig. 2 1 2 3 1 2 3 known clusters unknown clusters The results of clustering 20400 fragments spanning 17 Proposed Algorithm Input: Metagenomic reads (fragments) from next-gen sequencing technology Training database (TDB) – consists of G labeled genomes, previously acquired Unsupervised clustering algorithm (e.g. ART, K-means) Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of G genome probability profiles Do: i = 1, ..., G Do: j = 1, …4 N (# of diff. motif perm.) End End B. Score fragments, evaluate fragment, f using NBC Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs each of length N in fragment, f: [M1, M2, M3, …, MJ] T 2. Calculate probability of fragment belonging to genomei in TDB: End C. Build feature matrix for unsupervised classifier C. Build feature matrix for unsupervised classifier D. Call unsupervised clustering algorithm Cluster each fragment using corresponding feature vector of dimension G Output: Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit Accuracy to group similar classes together Accuracy of algorithm to isolate dissimilar classes C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla Features NBC Scores genomegenomegenomeFrag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . . . . . . . . Objects FragF SF,1 . . SF,G

Upload: others

Post on 24-Jan-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Steven Essinger, Robi Polikar, Gail Rosen K Electrical ...sessinger.com/Posters/NN_ISME.pdfSteven Essinger, Robi Polikar, Gail Rosen! Electrical & Computer Engineering, Drexel University,

!

Metagenomic Fragments

Naïve Bayes Classifier

Training Database

Unsupervised Clustering Algorithm (e.g. ART)

Fragments Clustered by

Class (e.g. Phyla)

Goal:  Predict  the  taxonomic  classifica2on  of  organisms  based  on  the  fragments  obtained  from  an  environmental  sample  that  may  include  many  previously  uniden2fied  organisms.  

Conclusion:  •  Compared  to  other  unsupervised  and  semi-­‐supervised  approaches,  we  

cluster  shorter  reads  (500bp)  and  more  strains  (200  to  400)  than  any  other  method,  to  show  the  clustering  method’s  feasibili2es  on  real  metagenomics  datasets.  

•  We  demonstrate  that  adap2ve  resonance  theory  is  able  to  cluster  novel  phyla  beGer  than  K-­‐means  when  there  are  a  large  number  of  fragments  to  cluster.  This  is  due  to  the  incremental  learning  capability  of  ART  and  its  ability  to  learn  non-­‐spherical  clusters.    

•  On  an  extremely  challenging  dataset  of  grouping  500bp  reads  from  204  strains  spanning  17  phyla,  ART  is  able  to  accomplish  this  with  43%  accuracy  (5.9%  by  chance)  

Neural Network-based Taxonomic Clustering for Metagenomics

Steven Essinger, Robi Polikar, Gail Rosen Electrical & Computer Engineering, Drexel University, 3141 Chestnut Street

Philadelphia, PA 19104, US

This work was supported in part by National Science Foundation award #0845827

Challenge  

•  The  challenge  we  face  is  that  we  cannot  simply  cluster  fragments  together  that  are  similar  in  composi2on  as  many  clustering  methods  tend  to  do.    

•  While  two  strains  may  be  similar  inter-­‐genomically,  each  generally  will  vary  greatly  intra-­‐genomically.  Since  the  fragments  we  are  clustering  represent  short  samples  of  each  strain’s  genome,  we  expect  that  the  fragments  in  each  cluster  will  vary  greatly.  

•  Current  methods  do  not  address  next-­‐genera2on  sequencing  technology  •  LikelyBin:  successful  only  for  low  complexity  samples  (2-­‐10  species)  •  GSOM:  successful  when  read  lengths  are  greater  than  8kbp  •  CompostBin:  successfully  tested  only  for  low-­‐complexity  samples  

Experiment  1:  Training  on  2  large  phyla  to  cluster  17  smaller  phyla  

Experiment  2:  Training  on  17  smaller  phyla  to  cluster  2  large  phyla  

Experiment  3:  Training  on  examples  of  each  phyla  to  cluster  the  rest  

Experiment! 1! 2! 3!Training Phyla! 2! 17! 19!Test Phyla! 17! 2! 19!Training Strains! 431! 204! 320!Test Strains! 204! 431! 315! Table  1  

•  635  microbe  genomes  obtained  from  Na2onal  Center  for  Informa2on  Biotechnology  

•  Dataset  spans  19  different  phyla:  We  selected  this  level  since  it  is  comprised  of  microbes  that  are  much    more  diverse  than  those  belonging  to  the  levels  of  genus  or  species  

•  Whole-­‐genomes  used  in  training  database  

•  Test  fragments  obtained  from  test  strains  by  random  sample  500  bp  in  length,  100x  

Fig.  3  

Fig.  1  

Summary  

Test  Data  

Algorithm  

Results  

Fig.  2  

1   2  3

1   23

known cl

usters!

unknown cl

usters!

!!

!

"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!

+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!

L,! MINOPHN!

!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!

!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,! H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!

;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!

! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!

<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!

!!

!

"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!

+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!

L,! MINOPHN!

!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!

!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,! H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!

;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!

! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!

<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!

!!

!

"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!

+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!

L,! MINOPHN!

!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!

!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,!H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!

;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!

! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!

<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!

Proposed Algorithm Input: • Metagenomic reads (fragments) from next-gen

sequencing technology • Training database (TDB) – consists of G labeled

genomes, previously acquired • Unsupervised clustering algorithm (e.g. ART, K-means) • Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of

G genome probability profiles Do: i = 1, ..., G

Do: j = 1, …4N (# of diff. motif perm.)

End

End B. Score fragments, evaluate fragment, f using NBC

Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs

each of length N in fragment, f: [M1, M2, M3, …, MJ]T

2. Calculate probability of fragment belonging to genomei in TDB:

End

C. Build feature matrix for unsupervised classifier

D. Call unsupervised clustering algorithm • Cluster each fragment using corresponding

feature vector of dimension G Output: • Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit

• Accuracy to group similar classes together

• Accuracy of algorithm to isolate dissimilar classes

C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla

Features NBC Scores genome1 genome2 … genomeG

Frag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . .

. . . . . . Obj

ects

FragF SF,1 . . SF,G

Proposed Algorithm Input: • Metagenomic reads (fragments) from next-gen

sequencing technology • Training database (TDB) – consists of G labeled

genomes, previously acquired • Unsupervised clustering algorithm (e.g. ART, K-means) • Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of

G genome probability profiles Do: i = 1, ..., G

Do: j = 1, …4N (# of diff. motif perm.)

End

End B. Score fragments, evaluate fragment, f using NBC

Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs

each of length N in fragment, f: [M1, M2, M3, …, MJ]T

2. Calculate probability of fragment belonging to genomei in TDB:

End

C. Build feature matrix for unsupervised classifier

D. Call unsupervised clustering algorithm • Cluster each fragment using corresponding

feature vector of dimension G Output: • Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit

• Accuracy to group similar classes together

• Accuracy of algorithm to isolate dissimilar classes

C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla

Features NBC Scores genome1 genome2 … genomeG

Frag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . .

. . . . . . Obj

ects

FragF SF,1 . . SF,G