Cloudy with a Touch of Cheminformatics. Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen. NIH Center for Advancing Translational Science. Chemaxon UGM, September 26th, 2012, Wellesley, MA


Page 1: Cloudy with a Touch of Cheminformatics

Cloudy with a Touch of Cheminformatics

Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science

Chemaxon UGM
September 26th, 2012, Wellesley, MA

Page 2: Cloudy with a Touch of Cheminformatics

Parallel computing in the cloud

• Modern cloud vendors make provisioning compute resources easy
  – Allows one to handle unpredictable loads easily
  – Pay only for what you need

• Chemistry applications don't usually have very dynamic loads

• But large-scale resources are an opportunity for large-scale (parallel) computations

Page 3: Cloudy with a Touch of Cheminformatics

All HPC is not equal

Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do

Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud

Page 4: Cloudy with a Touch of Cheminformatics

Big data & cheminformatics

• Computation over large chemical databases
  – PubChem, ChEMBL, GDB-13, …

• What types of computations?
  – Searches (substructure, pharmacophore, …)
  – QSAR models & predictions over large data

• Fundamentally, "big chemical data" lets us explore larger chemical spaces

Page 5: Cloudy with a Touch of Cheminformatics

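Map-Reduce

[Figure: MapReduce data flow. Input splits 0, 1 and 2 each feed a map task; map output is sorted, copied to the reducers and merged; each reduce task writes one output part (part 0, part 1). From Tom White, Hadoop: The Definitive Guide, 3rd Ed., O'Reilly]

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)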

Page 6: Cloudy with a Touch of Cheminformatics

Counting atoms

• The chemical version of the word counting task

Input: arbitrary line numbers (K1), SMILES (V1)
1, Nc1ccc2ncccc2c1N
2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
...
152366, Nc1ccc2ncccc2c1N

MAP output: atom symbol (K2), occurrence (V2)
N 1
N 1
N 1
N 1
...

Grouped for the reducer: atom symbol (K2), list (V2)
N, list(1,1,1,1,...)
C, list(1,1,1,1,...)

REDUCE output: atom symbol (K3), count (V3)
N, 100
C, 5684
...
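The three stages on this slide can be imitated in plain Java, without Hadoop, to make the key/value flow concrete. This is a minimal sketch, not the code from the talk: the class `AtomCountSketch` and its methods are hypothetical names, and the regular expression is a crude stand-in for a real SMILES parser (it only recognises unbracketed organic-subset atoms and folds aromatic lowercase symbols into the element symbol).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy in-memory imitation of the map / group / reduce phases for atom counting. */
final class AtomCountSketch {

    // Assumption: crude tokenizer for unbracketed organic-subset atoms only.
    // It is NOT a SMILES parser (no bracket atoms, charges or isotopes).
    private static final Pattern ATOM = Pattern.compile("Cl|Br|[BCNOPSFI]|[bcnops]");

    /** MAP: (line number, SMILES) -> list of (atom symbol, 1). The key is ignored. */
    static List<Map.Entry<String, Integer>> map(long lineNumber, String smiles) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Matcher m = ATOM.matcher(smiles);
        while (m.find()) {
            String symbol = m.group();
            // Fold aromatic lowercase symbols (c, n, ...) into the element symbol.
            if (symbol.length() == 1) symbol = symbol.toUpperCase();
            out.add(new AbstractMap.SimpleEntry<>(symbol, 1));
        }
        return out;
    }

    /** GROUP: collect emitted values by key, giving (atom symbol, list(1, 1, ...)). */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** REDUCE: (atom symbol, list of occurrences) -> count for that symbol. */
    static int reduce(String symbol, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    /** Runs all three phases over the input lines. */
    static Map<String, Integer> countAtoms(List<String> smilesLines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        long lineNumber = 0;
        for (String smiles : smilesLines) {
            emitted.addAll(map(++lineNumber, smiles));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Nc1ccc2ncccc2c1N",
                "Cl.CC1CCc2nc3ccccc3c(C)c2C1");
        System.out.println(countAtoms(input));
    }
}
```

In a real Hadoop job the grouping step is done by the framework between the map and reduce phases; it is written out here only so that each stage of the slide has a visible counterpart.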

Page 7: Cloudy with a Touch of Cheminformatics

The Hadoop ecosystem

[Figure: the Hadoop ecosystem. Core: Hadoop Common, Hadoop Distributed Filesystem, MapReduce engine. Related projects: Hive, Hama, Whirr, HBase, Pig, Avro, Mahout, Flume, ZooKeeper, Chukwa]

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem

Page 8: Cloudy with a Touch of Cheminformatics

Cheminformatics on Hadoop

• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics

But are cheminformatics problems really big enough to justify all of this?

Page 9: Cloudy with a Touch of Cheminformatics

package gov.nih.ncgc.hadoop;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

/**
 * SMARTS searching over a set of files using Hadoop.
 *
 * @author Rajarshi Guha
 */
public class SmartsSearch extends Configured implements Tool {
    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    /** Emits (molecule name, 1) if the molecule matches the query, else (name, 0). */
    public static class MoleculeMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern = null;
        private MolSearch search;
        private final Text matches = new Text();

        public void configure(JobConf job) {
            // Load the ChemAxon license shipped to each node via the distributed cache.
            try {
                Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                StringBuilder license = new StringBuilder();
                try (BufferedReader reader =
                             new BufferedReader(new FileReader(licFiles[0].toString()))) {
                    String line;
                    while ((line = reader.readLine()) != null) license.append(line);
                }
                LicenseManager.setLicense(license.toString());
            } catch (IOException e) {
                throw new RuntimeException("Could not read the license file", e);
            } catch (LicenseProcessingException e) {
                throw new RuntimeException("Could not process the license", e);
            }

            // Parse the SMARTS query once per task rather than once per record.
            pattern = job.get("pattern");
            search = new MolSearch();
            try {
                Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                search.setQuery(queryMol);
            } catch (MolFormatException e) {
                throw new RuntimeException("Invalid SMARTS pattern: " + pattern, e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // Each input line holds one molecule; the byte-offset key is ignored.
            Molecule mol = MolImporter.importMol(value.toString());
            matches.set(mol.getName());
            search.setTarget(mol);
            try {
                output.collect(matches, search.isMatching() ? one : zero);
            } catch (SearchException e) {
                throw new IOException("Search failed for " + mol.getName(), e);
            }
        }
    }

    /** Keeps only the molecules that matched, emitting (name, 1) once per name. */
    public static class SmartsMatchReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                if (values.next().get() == 1) {
                    result.set(1);
                    output.collect(key, result);
                    break; // one output record per matching molecule is enough
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        // Validate the arguments before building the job.
        if (args.length != 4) {
            System.err.println("Usage: ss <in> <out> <pattern> <license file>");
            return 2;
        }

        JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
        jobConf.setJobName("smartsSearch");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setMapperClass(MoleculeMapper.class);
        jobConf.setCombinerClass(SmartsMatchReducer.class);
        jobConf.setReducerClass(SmartsMatchReducer.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        jobConf.setNumMapTasks(5);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        // Stored as a single string: setStrings/getStrings would split a SMARTS
        // such as [N,O] at the comma.
        jobConf.set("pattern", args[2]);

        // Make the license file available via the distributed cache.
        DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
        System.exit(res);
    }
}

Simplifying Hadoop applications

• Raw Hadoop programs can be tedious to write

SMARTS-based substructure search

Page 10: Cloudy with a Touch of Cheminformatics

Pig & Pig Latin

• Pig Latin programs are much simpler to write and get translated to Hadoop code

• SQL-like; requires a UDF to be implemented to perform non-standard tasks

SMARTS search in Pig Latin

A = load 'medium.smi' as (smiles:chararray);
B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

UDF for SMARTS search

package gov.nih.ncgc.hadoop.pig;

import chemaxon.formats.MolImporter;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;

public class SMATCH extends FilterFunc {
    // Created once per UDF instance and reused across tuples.
    private final MolSearch search = new MolSearch();

    public Boolean exec(Tuple tuple) throws IOException {
        // Expects (target SMILES, query SMARTS); anything else is filtered out.
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            Molecule queryMol = MolImporter.importMol(query, "smarts");
            search.setQuery(queryMol);
            search.setTarget(MolImporter.importMol(target, "smiles"));
            return search.isMatching();
        } catch (SearchException e) {
            throw new IOException("SMARTS search failed for " + target, e);
        }
    }
}

Page 11: Cloudy with a Touch of Cheminformatics

Going beyond chunking?

• All the preceding use cases are embarrassingly parallel
  – Chunking the input data and applying the same operation to each chunk
  – Very nice when you have a big cluster

Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?

Page 12: Cloudy with a Touch of Cheminformatics

Going beyond chunking?

• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
  – Doesn't necessarily avoid the O(N^2) barrier
  – Bioisostere identification is one case that could be rephrased as a map-reduce problem

• Map-Reduce Design Patterns

Page 13: Cloudy with a Touch of Cheminformatics

Identifying MMPs

• First step in identifying bioisosteres is to identify candidate matched molecular pairs
  – Naïve all pairs comparison
  – Predefined list of transformations
    • Birch et al, BMCL, 2009
  – Fragment intersection
    • Hussain et al, JCIM, 2010
  – MCS based approaches (e.g., WizePairZ)
    • Warner et al, JCIM, 2010

Page 14: Cloudy with a Touch of Cheminformatics

Naïve bioisostere evaluation

[Figure: all-pairs comparison] N molecules → N(N-1)/2 comparisons

Page 15: Cloudy with a Touch of Cheminformatics

Scaffold seeding

Seed fragment: [structure image]

Members: [structure images]

Page 16: Cloudy with a Touch of Cheminformatics

Scaffold-seeded bioisosteres

[Figure: two scaffold series, each labeled "M(M-1)/2 comparisons"]

Page 17: Cloudy with a Touch of Cheminformatics

Seeded bioisosteres – MR style

MAP
• Do pairwise MCS analysis on scaffold series
• For each pair output SMIRKS transform and the pair of SMILES

REDUCE
• Collect pairs of SMILES for a given SMIRKS
• Store in DB, or
• Filter by activity, or
• …
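The MAP and REDUCE steps of this slide can be sketched in plain Java to show how the transformation acts as the reduce key. This is a toy under stated assumptions, not the authors' implementation: the class `SeededPairsSketch`, the `Member` type and the `substituent` label are all hypothetical, and `transformKey` merely concatenates two precomputed labels where the real workflow would derive a SMIRKS from an MCS analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory version of the seeded MMP map/reduce steps (no Hadoop, no MCS). */
final class SeededPairsSketch {

    /** A series member. 'substituent' is a hypothetical precomputed label. */
    static final class Member {
        final String smiles;
        final String substituent;

        Member(String smiles, String substituent) {
            this.smiles = smiles;
            this.substituent = substituent;
        }
    }

    /** Hypothetical stand-in for the MCS step: an order-independent label such as "Cl>>F". */
    static String transformKey(Member a, Member b) {
        boolean inOrder = a.substituent.compareTo(b.substituent) <= 0;
        return inOrder ? a.substituent + ">>" + b.substituent
                       : b.substituent + ">>" + a.substituent;
    }

    /** MAP: one scaffold series in; one {transform, smilesA, smilesB} record per pair out. */
    static List<String[]> map(List<Member> series) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                Member a = series.get(i);
                Member b = series.get(j);
                if (a.substituent.equals(b.substituent)) continue; // nothing changed
                out.add(new String[] {transformKey(a, b), a.smiles, b.smiles});
            }
        }
        return out;
    }

    /** REDUCE (grouping folded in): collect every SMILES pair seen for each transform. */
    static Map<String, List<String[]>> reduce(List<String[]> records) {
        Map<String, List<String[]>> byTransform = new TreeMap<>();
        for (String[] r : records) {
            byTransform.computeIfAbsent(r[0], k -> new ArrayList<>())
                       .add(new String[] {r[1], r[2]});
        }
        return byTransform;
    }

    /** Runs map over every series, then reduce over all emitted records. */
    static Map<String, List<String[]>> run(List<List<Member>> allSeries) {
        List<String[]> emitted = new ArrayList<>();
        for (List<Member> series : allSeries) emitted.addAll(map(series));
        return reduce(emitted);
    }

    /** Two made-up scaffold series (halobenzenes and 4-halopyridines). */
    static List<List<Member>> exampleSeries() {
        return Arrays.asList(
                Arrays.asList(new Member("Fc1ccccc1", "F"),
                              new Member("Clc1ccccc1", "Cl"),
                              new Member("Brc1ccccc1", "Br")),
                Arrays.asList(new Member("Fc1ccncc1", "F"),
                              new Member("Clc1ccncc1", "Cl")));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<String[]>> e : run(exampleSeries()).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue().size() + " pair(s)");
        }
    }
}
```

Because each series is mapped independently, the expensive pairwise step parallelises across scaffolds, while the reducer sees all pairs that share a transformation regardless of which scaffold produced them.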

Page 18: Cloudy with a Touch of Cheminformatics

[Figure: log number of pairwise comparisons vs. log number of molecules, by method: all, seeded.7, seeded.21, seeded.100]

Does seeding help?

• Doesn't bypass the O(N^2) barrier, but does reduce the constant

• Depends on how many scaffolds there are and the number of members for each scaffold

• Certainly useful when there are a few members per scaffold

• Highly populated scaffolds can throw things off
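The effect of seeding on the number of comparisons follows directly from the N(N-1)/2 and M(M-1)/2 counts on the earlier slides. The sketch below works the arithmetic for a made-up set of series sizes; the class name and the sizes are illustrative assumptions, and the all-pairs figure treats the series as disjoint, which real scaffold series need not be.

```java
/** Compares all-pairs and scaffold-seeded comparison counts for given series sizes. */
final class ComparisonCountSketch {

    /** Number of unordered pairs among n items: n(n-1)/2. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Seeded cost: sum of M(M-1)/2 over the member count M of each scaffold series. */
    static long seededComparisons(long[] seriesSizes) {
        long total = 0;
        for (long m : seriesSizes) total += pairs(m);
        return total;
    }

    /** All-pairs cost, assuming (for illustration) that the series share no molecules. */
    static long allPairsComparisons(long[] seriesSizes) {
        long molecules = 0;
        for (long m : seriesSizes) molecules += m;
        return pairs(molecules);
    }

    public static void main(String[] args) {
        // Hypothetical sizes: three small series and one highly populated scaffold.
        long[] sizes = {7, 7, 7, 100};
        System.out.println("all pairs: " + allPairsComparisons(sizes));
        System.out.println("seeded:    " + seededComparisons(sizes));
    }
}
```

With sizes like these, almost all of the seeded cost comes from the single large series, which is the sense in which highly populated scaffolds can throw things off.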

Page 19: Cloudy with a Touch of Cheminformatics

Data

• Exhaustively fragmented ChEMBL 13
• Identified scaffolds with Nmembers / Nscaffold !1.8

• Ended up with 231,875 scaffolds
  – Covers 235,693 unique molecules
  – Average of 7 members per scaffold
  – 95% of scaffolds had < 21 members
  – 99.5% had < 74 members

• The remaining 0.5% are a bit problematic

[Figure: log number of comparisons by method (All vs. Seeded)]

Page 20: Cloudy with a Touch of Cheminformatics

[Figure: time (s) by job number, jobs 1 to 5]

Timing experiments

• Selected 50 scaffolds with 10 or fewer members
• Configured so as to have ~5 maps
• Effective running time for the entire job is 3.8 min on Hadoop
  – Only needed 5 of 8 map slots on our "cluster"

• Takes ~6 min without Hadoop

Page 21: Cloudy with a Touch of Cheminformatics

Timing experiments

• Selected 1000 scaffolds with 20 or fewer members
  – Ran with 10 scaffolds / map

• Hadoop run time was ~2 hr
  – Most maps were fast (< 20 sec)

• Serial evaluation would be > 7 hr

[Figure: histogram of number of jobs vs. log time (s)]
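Running 1000 scaffolds at 10 scaffolds per map implies 100 map inputs. A minimal sketch of that batching step is shown below; the class and method names are hypothetical, and how the batches were actually handed to Hadoop is not stated in the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a list of work items (for example scaffold IDs) into fixed-size batches. */
final class BatchSketch {

    /** Returns consecutive batches of at most batchSize items; the last may be smaller. */
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            out.add(new ArrayList<>(items.subList(start, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> scaffoldIds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) scaffoldIds.add(i);
        System.out.println(batches(scaffoldIds, 10).size() + " map inputs");
    }
}
```

Fixed-size batches are the simplest choice, but since the cost of a series grows quadratically with its member count, batches balanced by estimated comparisons rather than by scaffold count would likely give more even map times.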

Page 22: Cloudy with a Touch of Cheminformatics

An M-R workflow

• We're currently focused on just the MMP step as an MR example

• Could also include the fragmentation step as part of the workflow
  – But a pre-calculated set of scaffolds is more sensible

• Store transformations and members in HBase
• Link with activity data and apply structure & activity filters on candidate pairs

Page 23: Cloudy with a Touch of Cheminformatics

What Hadoop is not for

• Doesn't replace an actual database
• It's not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Generally not effective unless dealing with massive datasets

• Not all algorithms are amenable to the map-reduce method

Page 24: Cloudy with a Touch of Cheminformatics

Conclusions

• Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
  – Remotely hosted
  – Embarrassingly parallel / chunked
  – Map/reduce

• Ability to process larger structure collections lets us explore more chemical space

• "Big data" isn't really that big in chemistry

Page 25: Cloudy with a Touch of Cheminformatics

Conclusions

• Q: But are cheminformatics problems really big enough to justify all of this?

• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data

• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?

• A: Yes – especially when we consider problems with a combinatorial flavor

Page 26: Cloudy with a Touch of Cheminformatics

https://github.com/rajarshi/chem.hadoop