Transcript
Page 1: Steven Essinger, Robi Polikar, Gail Rosen, Electrical & Computer Engineering, Drexel University (...sessinger.com/Posters/NN_ISME.pdf)


[Flow diagram labels: Metagenomic Fragments; Naïve Bayes Classifier; Training Database; Unsupervised Clustering Algorithm (e.g. ART); Fragments Clustered by Class (e.g. Phyla)]

Goal: Predict the taxonomic classification of organisms based on the fragments obtained from an environmental sample that may include many previously unidentified organisms.

Conclusion:

•  Compared to other unsupervised and semi-supervised approaches, we cluster shorter reads (500bp) and more strains (200 to 400) than any other method, to show the clustering method's feasibility on real metagenomics datasets.

•  We demonstrate that adaptive resonance theory (ART) is able to cluster novel phyla better than K-means when there are a large number of fragments to cluster. This is due to the incremental learning capability of ART and its ability to learn non-spherical clusters.

•  On an extremely challenging dataset of grouping 500bp reads from 204 strains spanning 17 phyla, ART is able to accomplish this with 43% accuracy (5.9% by chance).

Neural Network-based Taxonomic Clustering for Metagenomics

Steven Essinger, Robi Polikar, Gail Rosen Electrical & Computer Engineering, Drexel University, 3141 Chestnut Street

Philadelphia, PA 19104, US

This work was supported in part by National Science Foundation award #0845827

Challenge  

•  The challenge we face is that we cannot simply cluster fragments together that are similar in composition, as many clustering methods tend to do.

•  While two strains may be similar inter-genomically, each generally will vary greatly intra-genomically. Since the fragments we are clustering represent short samples of each strain's genome, we expect that the fragments in each cluster will vary greatly.

•  Current methods do not address next-generation sequencing technology
   •  LikelyBin: successful only for low-complexity samples (2-10 species)
   •  GSOM: successful when read lengths are greater than 8kbp
   •  CompostBin: successfully tested only for low-complexity samples

Experiment 1: Training on 2 large phyla to cluster 17 smaller phyla

Experiment 2: Training on 17 smaller phyla to cluster 2 large phyla

Experiment 3: Training on examples of each phylum to cluster the rest

Experiment          1     2     3
Training Phyla      2    17    19
Test Phyla         17     2    19
Training Strains  431   204   320
Test Strains      204   431   315

Table 1
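The fragment counts quoted in the table captions (20400 and 43100) and the chance levels follow directly from the strain and phylum counts in Table 1. The short check below is only an arithmetic sketch: the dictionary names are made up for illustration, and the 1/19 chance level for experiment 3 is inferred by analogy with the other two experiments rather than stated in the poster.

```python
# Counts copied from Table 1; names are illustrative only.
TEST_STRAINS = {1: 204, 2: 431, 3: 315}
TEST_PHYLA = {1: 17, 2: 2, 3: 19}
READS_PER_STRAIN = 100  # each test strain is sampled 100 times


def fragment_count(experiment: int) -> int:
    """Number of 500 bp test fragments in the given experiment."""
    return TEST_STRAINS[experiment] * READS_PER_STRAIN


def chance_accuracy(experiment: int) -> float:
    """Accuracy expected from uniformly random assignment to one test phylum."""
    return 1.0 / TEST_PHYLA[experiment]


for exp in (1, 2, 3):
    print(exp, fragment_count(exp), f"{100 * chance_accuracy(exp):.1f}%")
```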

•  635 microbial genomes obtained from the National Center for Biotechnology Information (NCBI)

•  Dataset spans 19 different phyla: we selected this level because it comprises microbes that are much more diverse than those belonging to a single genus or species

•  Whole genomes used in training database

•  Test fragments obtained from each test strain by random sampling: 100 reads per strain, each 500 bp in length
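The test-fragment construction described above can be sketched as follows. This is a minimal illustration, assuming uniformly random start positions, sampling with replacement, a linear genome string and no simulated sequencing errors; none of these details is specified in the poster, and the function name `sample_reads` is hypothetical.

```python
import random


def sample_reads(genome, read_length=500, n_reads=100, rng=None):
    """Draw n_reads substrings of read_length from uniformly random start positions."""
    rng = rng or random.Random()
    if len(genome) < read_length:
        raise ValueError("genome is shorter than the requested read length")
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_length] for s in starts]


# Toy usage: a random 5000-base "genome" stands in for a real test strain.
toy_rng = random.Random(0)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(5000))
toy_reads = sample_reads(toy_genome, rng=toy_rng)
print(len(toy_reads), len(toy_reads[0]))
```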

Fig.  3  

Fig.  1  

Summary  

Test  Data  

Algorithm  

Results  

Fig.  2  

[Figure labels: 1, 2, 3; "known clusters"; "unknown clusters"]

... the prior experiment. We then attempted to cluster the 2 largest phyla into 2 groups. Again, the test fragments were completely novel to the NBC, resulting in a similarly challenging scenario for classification. In experiment 3, we followed a different approach from the prior two experiments by partitioning the strains of each of the 19 classes into two groups. One group of 320 strains was used for training the NBC while the remaining 315 strains were used in the test set. As in all of the experiments, the test set consisted of 100 fragments, 500bp in length, for each strain while the training set consisted of whole genomes for each training strain. The results of all three experiments are provided in the next section. Each experiment was bootstrapped 25 times to provide a level of confidence in the results [22, 23].

V. RESULTS

A. Experiment 1: Training on 2 large phyla to cluster 17 smaller phyla

A training dataset of 431 strains was constructed spanning 2 phyla. The remaining 204 strains spanning 17 different phyla were used as test strains as described above. The training dataset consisted of whole genomes while the test fragments were obtained from the test strains by randomly sampling each of them 100 times, extracting a 500bp nucleotide read each time. Both K-means and ART were implemented to cluster the fragments using the NBC scores as feature vectors. The results are summarized in Table I.

Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter for ART was set to 0.1. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/17 or 5.9%.

Both algorithms grouped all of the fragments into 17 different clusters as intended. The ART algorithm performed better than the K-means algorithm on both figures of merit. Interestingly, K-means was able to isolate different phyla better than it was able to group similar phyla together, while for ART the converse is true. The results imply that ART is grouping similar phyla together, but the clusters contain more than one phylum, thereby driving down the score for isolating different phyla. The opposite is true for K-means, suggesting that similar phyla are distributed among several clusters rather than one, but not across all clusters; otherwise we would expect a similar score for isolating different phyla.
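The vigilance parameter quoted in the table captions sets how closely a fragment must match an existing cluster before ART accepts it; if no cluster passes, a new one is opened, which is what makes the method incremental. The poster does not say which ART variant was used, so the sketch below assumes fuzzy ART with complement coding. The parameters `alpha` and `beta`, the function name, and the requirement that inputs be scaled to [0, 1] are assumptions of this sketch, not details from the source.

```python
def fuzzy_art(samples, vigilance, alpha=0.001, beta=1.0):
    """Cluster vectors with components in [0, 1]; returns one label per sample."""
    weights = []  # one prototype per cluster, in complement-coded space
    labels = []
    for x in samples:
        # Complement coding keeps the input norm constant (equal to len(x)).
        inp = list(x) + [1.0 - v for v in x]
        norm_inp = sum(inp)

        def choice(j):
            # Choice function |I ^ w_j| / (alpha + |w_j|), with ^ the element-wise min.
            w = weights[j]
            return sum(min(a, b) for a, b in zip(inp, w)) / (alpha + sum(w))

        chosen = None
        for j in sorted(range(len(weights)), key=choice, reverse=True):
            match = sum(min(a, b) for a, b in zip(inp, weights[j])) / norm_inp
            if match >= vigilance:  # vigilance test passed: resonance
                chosen = j
                break
        if chosen is None:
            weights.append(inp)  # nothing is close enough: open a new cluster
            chosen = len(weights) - 1
        else:
            w = weights[chosen]
            weights[chosen] = [beta * min(a, b) + (1 - beta) * b
                               for a, b in zip(inp, w)]
        labels.append(chosen)
    return labels


toy = [[0.10, 0.10], [0.12, 0.11], [0.90, 0.90], [0.88, 0.92]]
print(fuzzy_art(toy, vigilance=0.8))
print(fuzzy_art(toy, vigilance=0.0))
```

In this sketch a higher vigilance tends to yield more clusters, which is at least consistent with the captions reporting 17 clusters at 0.1 and 4 clusters at 0.025, although those two runs used different data.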

B. Experiment 2: Training on 17 smaller phyla to cluster 2 large phyla

A training dataset of 204 strains was constructed spanning 17 phyla, the opposite of experiment 1. The remaining 431 strains spanning 2 different phyla were used as test strains. The training dataset consisted of whole genomes while the test fragments were obtained from the test strains by randomly sampling each of them 100 times, extracting a 500bp nucleotide read each time. Both K-means and ART were implemented to cluster the fragments using the NBC scores as feature vectors. The results are summarized in Table II.

Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2 or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter set at 0.025.

While K-means was programmed to group all of the fragments into 2 clusters, the ART algorithm exhibited the best performance when grouping all fragments into 4 clusters. The ART algorithm performed slightly better than the K-means algorithm on both figures of merit. K-means was able to isolate different phyla marginally better than it was able to group similar phyla together, and the same is true for ART, albeit by a larger margin. The performance of the algorithms appears to be much better for experiment 2 than for the previous one, but it is important to note that chance was 5.9% in experiment 1, while chance is 50% in this experiment.

C. Experiment 3: Training on examples of each phylum to cluster the rest

A training dataset of 320 strains was constructed spanning 19 phyla. The remaining 315 strains also spanned the same 19 phyla and were used as test strains. The training dataset consisted of whole genomes while the test fragments were obtained from the test strains by randomly sampling each of them 100 times, extracting a 500bp nucleotide read each time. Both K-means and ART were implemented to cluster the fragments using the NBC scores as feature vectors. The results are summarized in Table III.


Proposed Algorithm

Input:
•  Metagenomic reads (fragments) from next-gen sequencing technology
•  Training database (TDB), consisting of G labeled genomes, previously acquired
•  Unsupervised clustering algorithm (e.g. ART, K-means)
•  Set free parameters (e.g. K in K-means and v in ART)

Algorithm:
A. Train Naïve Bayes Classifier (NBC): motifs, M, of G genome probability profiles
   Do: i = 1, ..., G
      Do: j = 1, …, 4^N (# of diff. motif perm.)
      End
   End
B. Score fragments: evaluate fragment, f, using NBC
   Do: f = 1, …, F (# of fragments)
      1. Identify J (N-1) overlapping motifs each of length N in fragment, f: [M1, M2, M3, …, MJ]^T
      2. Calculate probability of fragment belonging to genome i in TDB:
   End
C. Build feature matrix for unsupervised classifier
D. Call unsupervised clustering algorithm
   •  Cluster each fragment using corresponding feature vector of dimension G

Output:
•  Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.)

Test: Figures of Merit
•  Accuracy to group similar classes together
•  Accuracy of algorithm to isolate dissimilar classes

C: # of clusters
P: # of taxonomic classes (e.g. phyla)
fc: # of fragments in cluster c
fcp: # of fragments in cluster c belonging to taxonomic class p
ft’: # of fragments from taxonomic class p
F: total number of fragments in all phyla
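The formulas for the two figures of merit are not included in this transcript, only the counts they are built from. The sketch below is therefore one plausible reading, not the authors' definition: it assumes the grouping score is the fraction of fragments that fall in the dominant cluster of their class, and the isolation score is the fraction that belong to the dominant class of their cluster (a purity-style measure). The function name and both formulas are assumptions.

```python
from collections import Counter


def figures_of_merit(cluster_ids, class_ids):
    """Return (grouping, isolation) scores in [0, 1] from paired labels.

    Assumed definitions (not taken from the poster):
      grouping  = (1/F) * sum over classes p  of max over clusters c of f_cp
      isolation = (1/F) * sum over clusters c of max over classes p  of f_cp
    """
    f_cp = Counter(zip(cluster_ids, class_ids))  # fragments in cluster c from class p
    total = len(cluster_ids)                     # F, the total number of fragments
    best_per_class = {}
    best_per_cluster = {}
    for (c, p), n in f_cp.items():
        best_per_class[p] = max(best_per_class.get(p, 0), n)
        best_per_cluster[c] = max(best_per_cluster.get(c, 0), n)
    grouping = sum(best_per_class.values()) / total
    isolation = sum(best_per_cluster.values()) / total
    return grouping, isolation


# One big mixed cluster groups well but isolates poorly; singletons do the reverse.
print(figures_of_merit([0, 0, 0, 0], ["a", "a", "b", "b"]))
print(figures_of_merit([0, 1, 2, 3], ["a", "a", "b", "b"]))
```

Under these assumed definitions, clusters that mix several phyla lower the isolation score while leaving the grouping score high, which matches the qualitative behaviour the results text attributes to ART.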

Feature matrix (objects: fragments; features: NBC scores)

          genome1   genome2   …   genomeG
Frag1     S1,1      S1,2      .   S1,G
Frag2     S2,1      S2,2      .   .
.         .         .         .   .
FragF     SF,1      .         .   SF,G
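Steps A to D of the proposed algorithm can be sketched end to end on toy data. This is a minimal illustration under stated assumptions: the poster's scoring equations are not reproduced in this transcript, so the sketch uses log-probabilities with add-one smoothing and equal genome priors, and K-means is a plain Lloyd's iteration. All function names, the motif length N = 3 and the toy sequences are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product


def train_profile(genome, n):
    """Step A: log-probability of each of the 4**n motifs in one genome (add-one smoothed)."""
    counts = Counter(genome[i:i + n] for i in range(len(genome) - n + 1))
    motifs = ["".join(p) for p in product("ACGT", repeat=n)]
    total = sum(counts[m] for m in motifs) + len(motifs)
    return {m: math.log((counts[m] + 1) / total) for m in motifs}


def score_fragment(fragment, profiles, n):
    """Step B: one naive Bayes log-score per genome, summed over the fragment's motifs."""
    motifs = [fragment[i:i + n] for i in range(len(fragment) - n + 1)]
    # Motifs containing characters other than A, C, G, T are skipped.
    return [sum(p[m] for m in motifs if m in p) for p in profiles]


def kmeans(rows, k, n_iter=50, seed=0):
    """Step D: Lloyd's algorithm on a list of equal-length numeric rows."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(r, centers[j])))
            for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy training database of G = 2 "genomes" and F = 4 fragments.
N = 3
genomes = ["ACGT" * 50, "AAAC" * 50]
fragments = ["ACGTACGTACGT", "GTACGTACGTAC", "AAACAAACAAAC", "ACAAACAAACAA"]
profiles = [train_profile(g, N) for g in genomes]        # step A
S = [score_fragment(f, profiles, N) for f in fragments]  # steps B and C: F x G matrix
labels = kmeans(S, k=2)                                  # step D
print(labels)
```

Each row of `S` corresponds to one row of the feature matrix above, so any clustering routine that accepts an F x G matrix could be substituted for `kmeans` in step D.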
