Cloudy with a Touch of Cheminformatics

Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science

Chemaxon UGM

September 26th, 2012 Wellesley, MA

Parallel computing in the cloud

•  Modern cloud vendors make provisioning compute resources easy
   – Allows one to handle unpredictable loads easily
   – Pay only for what you need

•  Chemistry applications don’t usually have very dynamic loads

•  But large scale resources are an opportunity for large scale (parallel) computations

All HPC is not equal

Legacy HPC

•  Use cloud resources in the same way as a local cluster

•  MIT StarCluster makes this easy to do

Cloudy HPC

•  Make use of cloud capabilities

•  Old algorithms, new infrastructure

•  Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC

•  Huge datasets

•  Candidates for map-reduce

•  Involves algorithm (re)design

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud

Big data & cheminformatics

•  Computation over large chemical databases
   – PubChem, ChEMBL, GDB-13, …

•  What types of computations?
   – Searches (substructure, pharmacophore, …)
   – QSAR models & predictions over large data

•  Fundamentally, “big chemical data” lets us explore larger chemical spaces

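Map-Reduce

[Figure: map-reduce data flow. Input splits 0, 1 and 2 each feed a map task; the map output is sorted, copied and merged into the reduce tasks, which write output parts 0 and 1. From Tom White, Hadoop: The Definitive Guide, 3rd Ed., O’Reilly.]

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)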

Counting atoms

•  The chemical version of the word counting task

Input to MAP: arbitrary line numbers (K1), SMILES (V1)

1, Nc1ccc2ncccc2c1N
2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
...
152366, Nc1ccc2ncccc2c1N

Output of MAP: atom symbol (K2), occurrence (V2)

N 1
N 1
N 1
N 1
...

Input to REDUCE: atom symbol (K2), list (V2)

N, list(1,1,1,1,...)
C, list(1,1,1,1,...)

Output of REDUCE: atom symbol (K3), count (V3)

N, 100
C, 5684
...
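The atom counting flow can be sketched in plain Java, without Hadoop, to make the three key/value stages concrete. This is only an illustration under stated assumptions: the class and method names are made up for this sketch, the grouping step that Hadoop performs between map and reduce is done in memory, and the regular expression is a crude tokenizer for organic-subset SMILES (no bracket atoms), not a real SMILES parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AtomCountSketch {
    // Crude tokenizer (assumption): two-letter halogens first, then one-letter
    // organic-subset atoms, aliphatic or aromatic. Bracket atoms are not handled.
    private static final Pattern ATOM = Pattern.compile("Cl|Br|[BCNOPSFI]|[bcnops]");

    /** MAP: (line number, SMILES) -> list of (atom symbol, 1). */
    static List<Map.Entry<String, Integer>> map(long lineNumber, String smiles) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Matcher m = ATOM.matcher(smiles);
        while (m.find()) {
            String symbol = m.group();
            // Fold aromatic lower-case symbols (c, n, ...) into the element symbol
            if (symbol.length() == 1) symbol = symbol.toUpperCase();
            out.add(Map.entry(symbol, 1));
        }
        return out;
    }

    /** Group map output by key, as the framework does between the two phases. */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** REDUCE: (atom symbol, list of occurrences) -> total count. */
    static int reduce(String symbol, List<Integer> occurrences) {
        int sum = 0;
        for (int v : occurrences) sum += v;
        return sum;
    }

    /** Run map, group and reduce in memory over a list of SMILES lines. */
    static Map<String, Integer> count(List<String> smilesLines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (int i = 0; i < smilesLines.size(); i++) {
            emitted.addAll(map(i + 1, smilesLines.get(i)));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Nc1ccc2ncccc2c1N", "Cl.CC1CCc2nc3ccccc3c(C)c2C1")));
        // prints {C=24, Cl=1, N=4}
    }
}
```

In a real job the map and reduce methods would run on different nodes and the grouping would be handled by the framework; only the per-record logic carries over.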

The Hadoop ecosystem

[Figure: components of the Hadoop ecosystem]

•  Hadoop Common
•  Hadoop Distributed Filesystem
•  Map Reduce Engine
•  Hive, Hama, Whirr, HBase, Pig, Avro, Mahout, Flume, ZooKeeper, Chukwa

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem

Cheminformatics on Hadoop

•  Hadoop and Atom Counting
•  Hadoop and SD Files
•  Cheminformatics, Hadoop and EC2
•  Pig and Cheminformatics

But are cheminformatics problems really big enough to justify all of this?

Simplifying Hadoop applications

•  Raw Hadoop programs can be tedious to write

SMARTS based substructure search

package gov.nih.ncgc.hadoop;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

/**
 * SMARTS searching over a set of files using Hadoop.
 *
 * @author Rajarshi Guha
 */
public class SmartsSearch extends Configured implements Tool {
    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    public static class MoleculeMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern = null;
        private MolSearch search;
        private final Text matches = new Text();

        public void configure(JobConf job) {
            // Read the license file shipped to this node via the distributed cache
            try {
                Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                StringBuilder license = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) license.append(line);
                reader.close();
                LicenseManager.setLicense(license.toString());
            } catch (IOException e) {
                throw new RuntimeException("Could not read the license file", e);
            } catch (LicenseProcessingException e) {
                throw new RuntimeException("Could not process the license", e);
            }

            // Parse the SMARTS query once per task rather than once per molecule
            pattern = job.getStrings("pattern")[0];
            search = new MolSearch();
            try {
                Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                search.setQuery(queryMol);
            } catch (MolFormatException e) {
                throw new RuntimeException("Invalid SMARTS pattern: " + pattern, e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            Molecule mol = MolImporter.importMol(value.toString());
            matches.set(mol.getName());
            search.setTarget(mol);
            try {
                // Emit (molecule name, 1) for a match and (molecule name, 0) otherwise
                if (search.isMatching()) {
                    output.collect(matches, one);
                } else {
                    output.collect(matches, zero);
                }
            } catch (SearchException e) {
                // Skip molecules for which the search itself fails
            }
        }
    }

    public static class SmartsMatchReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Keep only the molecules that matched
            while (values.hasNext()) {
                if (values.next().compareTo(one) == 0) {
                    result.set(1);
                    output.collect(key, result);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("Usage: ss <in> <out> <pattern> <license file>");
            System.exit(2);
        }

        // The job jar is located via this class (the slide referenced HeavyAtomCount.class here)
        JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
        jobConf.setJobName("smartsSearch");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setMapperClass(MoleculeMapper.class);
        jobConf.setCombinerClass(SmartsMatchReducer.class);
        jobConf.setReducerClass(SmartsMatchReducer.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        jobConf.setNumMapTasks(5);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        jobConf.setStrings("pattern", args[2]);

        // make the license file available via the distributed cache
        DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
        System.exit(res);
    }
}

Pig & Pig Latin

•  Pig Latin programs are much simpler to write and get translated to Hadoop code

•  SQL-like, requires UDF to be implemented to perform non-standard tasks

SMARTS search in Pig Latin

A = load 'medium.smi' as (smiles:chararray);
B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

UDF for SMARTS search

package gov.nih.ncgc.hadoop.pig;

import chemaxon.formats.MolImporter;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;

public class SMATCH extends FilterFunc {
    // Instantiated up front; the slide initialised this to null, which would
    // fail with a NullPointerException on the first call to exec()
    static MolSearch search = new MolSearch();

    public Boolean exec(Tuple tuple) throws IOException {
        // Expect two fields: the target SMILES and the SMARTS query
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            Molecule queryMol = MolImporter.importMol(query, "smarts");
            search.setQuery(queryMol);
            search.setTarget(MolImporter.importMol(target, "smiles"));
            return search.isMatching();
        } catch (SearchException e) {
            e.printStackTrace();
        }
        return false;
    }
}

Going beyond chunking?

•  All the preceding use cases are embarrassingly parallel
   – Chunking the input data and applying the same operation to each chunk
   – Very nice when you have a big cluster

Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
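The chunking pattern described above can be sketched on a single machine with a thread pool standing in for the cluster. The class and method names are hypothetical, and the predicate in the usage example is a plain string test used as a placeholder for a substructure search; the point is only that every chunk receives the same operation and the results are concatenated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

class ChunkedFilterSketch {
    /** Split the input into consecutive chunks of at most chunkSize items. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunks.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }
        return chunks;
    }

    /** Apply the same filter to every chunk on a thread pool, keeping input order. */
    static <T> List<T> filterInChunks(List<T> items, int chunkSize, int threads, Predicate<T> keep) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (List<T> c : chunk(items, chunkSize)) {
                // One task per chunk; each task is independent of the others
                futures.add(pool.submit(() -> {
                    List<T> hits = new ArrayList<>();
                    for (T item : c) {
                        if (keep.test(item)) hits.add(item);
                    }
                    return hits;
                }));
            }
            // Concatenate per-chunk results in submission order
            List<T> all = new ArrayList<>();
            for (Future<List<T>> f : futures) all.addAll(f.get());
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> smiles = List.of("CCl", "CC", "ClC", "CCO", "CBr");
        // Placeholder predicate (assumption): a text match, not a chemical search
        System.out.println(filterInChunks(smiles, 2, 2, s -> s.contains("Cl")));
    }
}
```

Nothing in this pattern requires the chunks to communicate, which is why it scales with cluster size but does not exercise map-reduce at the algorithmic level.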

Going beyond chunking?

•  Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
   – Doesn’t necessarily avoid the O(N²) barrier
   – Bioisostere identification is one case that could be rephrased as a map-reduce problem

•  Map-Reduce Design Patterns

Identifying MMPs

•  First step in identifying bioisosteres is to identify candidate matched molecular pairs
   – Naïve all pairs comparison
   – Predefined list of transformations
      •  Birch et al, BMCL, 2009
   – Fragment intersection
      •  Hussain et al, JCIM, 2010
   – MCS based approaches (e.g., WizePairZ)
      •  Warner et al, JCIM, 2010

Naïve bioisostere evaluation

N molecules: N(N-1)/2 comparisons

Scaffold seeding

[Figure: a seed fragment and the member molecules that contain it.]

Scaffold seeded bioisosteres

M(M-1)/2 comparisons per scaffold with M members

Seeded bioisosteres – MR style

MAP

•  Do pairwise MCS analysis on scaffold series

•  For each pair output SMIRKS transform and the pair of SMILES

REDUCE

•  Collect pairs of SMILES for a given SMIRKS

•  Store in DB, or
•  Filter by activity, or
•  …
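The key/value shape of this job can be sketched in plain Java. The sketch below is not the authors' implementation: the names are hypothetical, the grouping that Hadoop would do is folded into the reduce method, and the transform function is a toy that strips a common string prefix, standing in for the MCS analysis and SMIRKS generation that a cheminformatics toolkit would provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

class SeededMmpSketch {
    /** MAP: for one scaffold series, emit (transform, "smilesA smilesB") for every pair. */
    static List<Map.Entry<String, String>> map(List<String> series, BinaryOperator<String> transformOf) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                String a = series.get(i);
                String b = series.get(j);
                out.add(Map.entry(transformOf.apply(a, b), a + " " + b));
            }
        }
        return out;
    }

    /** REDUCE (including the grouping by key): collect the pairs seen for each transform. */
    static Map<String, List<String>> reduce(List<Map.Entry<String, String>> emitted) {
        Map<String, List<String>> byTransform = new TreeMap<>();
        for (Map.Entry<String, String> e : emitted) {
            byTransform.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return byTransform;
    }

    /** Toy stand-in, NOT an MCS: drop the longest common prefix and report what differs. */
    static String toyTransform(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return a.substring(i) + ">>" + b.substring(i);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        // Two made-up scaffold series; each map call sees one series
        emitted.addAll(map(List.of("c1ccccc1F", "c1ccccc1Cl", "c1ccccc1Br"), SeededMmpSketch::toyTransform));
        emitted.addAll(map(List.of("C1CCCCC1F", "C1CCCCC1Cl"), SeededMmpSketch::toyTransform));
        reduce(emitted).forEach((transform, pairs) -> System.out.println(transform + "\t" + pairs));
    }
}
```

Keying the map output on the transform is what lets the reduce step see every pair that shares a transformation, regardless of which scaffold series produced it.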

Does seeding help?

•  Doesn’t bypass the O(N²) barrier, but does reduce the constant

•  Depends on how many scaffolds there are and the number of members for each scaffold

•  Certainly useful when there are few members per scaffold

•  Highly populated scaffolds can throw things off

[Figure: log number of pairwise comparisons (1e+05 to 1e+14) versus log number of molecules (1e+03 to 1e+07), for the methods all, seeded.7, seeded.21 and seeded.100.]
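The effect of seeding on the number of comparisons follows directly from the two formulas, N(N-1)/2 for all pairs and the sum of M(M-1)/2 over scaffolds. A minimal sketch, with hypothetical class and method names and made-up scaffold sizes, shows both the saving and how a single crowded scaffold erodes it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ComparisonCountSketch {
    /** Number of unordered pairs among n items: n(n-1)/2. */
    static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Seeded count: pairs are only formed within each scaffold. */
    static long seeded(List<Integer> membersPerScaffold) {
        long total = 0;
        for (int m : membersPerScaffold) total += allPairs(m);
        return total;
    }

    public static void main(String[] args) {
        // Made-up example: 100 scaffolds with 7 members each (700 memberships)
        List<Integer> uniform = Collections.nCopies(100, 7);
        System.out.println(allPairs(700));   // prints 244650
        System.out.println(seeded(uniform)); // prints 2100

        // Same number of scaffolds, but one of them has 100 members
        List<Integer> skewed = new ArrayList<>(Collections.nCopies(99, 7));
        skewed.add(100);
        System.out.println(seeded(skewed));  // prints 7029
    }
}
```

In the skewed case the one large scaffold accounts for 4,950 of the 7,029 comparisons, which is the sense in which highly populated scaffolds dominate the cost.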

Data

•  Exhaustively fragmented ChEMBL 13

•  Identified scaffolds with Nmembers / Nscaffold [operator illegible] 1.8

•  Ended up with 231,875 scaffolds
   – Covers 235,693 unique molecules
   – Average of 7 members per scaffold
   – 95% of scaffolds had < 21 members
   – 99.5% had < 74 members

•  The remaining 0.5% are a bit problematic

[Figure: log number of comparisons (1e+02 to 1e+08) for the All and Seeded methods.]

Timing experiments

•  Selected 50 scaffolds with 10 or fewer members

•  Configured so as to have ~ 5 maps

•  Effective running time for the entire job is 3.8 min on Hadoop
   – Only needed 5 of 8 map slots on our “cluster”

•  Takes ~ 6 min without Hadoop

[Figure: time (s, 0 to 200) for each of job numbers 1 to 5.]

Timing experiments

•  Selected 1000 scaffolds with 20 or fewer members
   – Ran with 10 scaffolds / map

•  Hadoop run time was ~ 2 hr
   – Most maps were fast (< 20 sec)

•  Serial evaluation would be > 7 hr

[Figure: histogram of number of jobs (0 to 15) versus log time (s), 1.0 to 4.0.]

An M-R workflow

•  We’re currently focused on just the MMP step as an MR example

•  Could also include the fragmentation step as part of the workflow
   – But a pre-calculated set of scaffolds is more sensible

•  Store transformations and members in HBase

•  Link with activity data and apply structure & activity filters on candidate pairs

What Hadoop is not for

•  Doesn’t replace an actual database

•  It’s not uniformly fast or efficient

•  Not good for ad hoc or real-time analysis

•  Generally not effective unless dealing with massive datasets

•  Not all algorithms are amenable to the map-reduce method

Conclusions

•  Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
   – Remotely hosted
   – Embarrassingly parallel / chunked
   – Map/reduce

•  Ability to process larger structure collections lets us explore more chemical space

•  “Big data” isn’t really that big in chemistry

Conclusions

•  Q: But are cheminformatics problems really big enough to justify all of this?

•  A: Yes – virtual libraries, integrating chemical structure with other types and scales of data

•  Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?

•  A: Yes – especially when we consider problems with a combinatorial flavor

https://github.com/rajarshi/chem.hadoop
