Cloudy with a Touch of Cheminformatics
Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science
Chemaxon UGM
September 26th, 2012 Wellesley, MA
Parallel computing in the cloud
• Modern cloud vendors make provisioning compute resources easy
– Allows one to handle unpredictable loads easily
– Pay only for what you need
• Chemistry applications don’t usually have very dynamic loads
• But large scale resources are an opportunity for large scale (parallel) computations
All HPC is not equal
Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do
Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc.
Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
Big data & cheminformatics
• Computation over large chemical databases
– PubChem, ChEMBL, GDB-13, …
• What types of computations?
– Searches (substructure, pharmacophore, …)
– QSAR models & predictions over large data
• Fundamentally, “big chemical data” lets us explore larger chemical spaces
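Map-Reduce
[Figure: MapReduce data flow. Input splits 0, 1 and 2 each feed a map task; map output is sorted, copied and merged into two reduce tasks, which write output parts 0 and 1. From Tom White, Hadoop: The Definitive Guide, 3rd Ed., O’Reilly]
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)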
Counting atoms
• The chemical version of the word counting task
Input: arbitrary line numbers (K1), SMILES (V1)
  1, Nc1ccc2ncccc2c1N
  2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
  ...
  152366, Nc1ccc2ncccc2c1N
MAP output: atom symbol (K2), occurrence (V2)
  N, 1
  N, 1
  N, 1
  N, 1
  ...
REDUCE input: atom symbol (K2), list (V2)
  N, list(1,1,1,1,...)
  C, list(1,1,1,1,...)
REDUCE output: atom symbol (K3), count (V3)
  N, 100
  C, 5684
  ...
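The atom counting flow can be tried without a cluster. The following is a minimal in-memory sketch, not the Hadoop implementation from the talk: the class name AtomCountSketch and its methods are hypothetical, and the SMILES handling is a deliberately crude assumption (it treats Cl and Br as two-letter symbols, upper-cases every other letter, and ignores implicit hydrogens and other multi-letter elements). A real job would use a cheminformatics toolkit to parse each molecule.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AtomCountSketch {

    /** MAP: (line number, SMILES) -> one (atom symbol, 1) pair per atom found. */
    static List<Map.Entry<String, Integer>> map(long lineNumber, String smiles) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // ring-closure digits, bonds, branches, dots
            }
            String symbol;
            if (smiles.startsWith("Cl", i) || smiles.startsWith("Br", i)) {
                symbol = smiles.substring(i, i + 2);
                i++; // skip the second letter of the symbol
            } else {
                // Crude assumption: every other letter is a one-letter element;
                // aromatic c, n, o become C, N, O.
                symbol = String.valueOf(Character.toUpperCase(c));
            }
            pairs.add(new AbstractMap.SimpleEntry<>(symbol, 1));
        }
        return pairs;
    }

    /** Shuffle: group emitted values by key, giving (K2, list(V2)). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** REDUCE: (atom symbol, list of occurrences) -> total count. */
    static int reduce(String symbol, List<Integer> occurrences) {
        int sum = 0;
        for (int v : occurrences) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, shuffle and reduce in sequence over a list of SMILES lines. */
    static Map<String, Integer> run(List<String> smilesLines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        long lineNumber = 1;
        for (String smiles : smilesLines) {
            emitted.addAll(map(lineNumber++, smiles));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : shuffle(emitted).entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "Nc1ccc2ncccc2c1N", "Cl.CC1CCc2nc3ccccc3c(C)c2C1")));
    }
}
```

As in word counting, the input key (the line number) is ignored by the map step; only the SMILES value matters.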
The Hadoop ecosystem
• Hadoop Common
• Hadoop Distributed Filesystem
• Map Reduce Engine
• Hive, Hama, Whirr, HBase, Pig, Avro, Mahout, Flume, Zookeeper, Chukwa
Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
Cheminformatics on Hadoop
• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics
But are cheminformatics problems really big enough to justify all of this?
Simplifying Hadoop applications
• Raw Hadoop programs can be tedious to write

SMARTS based substructure search

package gov.nih.ncgc.hadoop;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

/**
 * SMARTS searching over a set of files using Hadoop.
 *
 * @author Rajarshi Guha
 */
public class SmartsSearch extends Configured implements Tool {
    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    public static class MoleculeMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern = null;
        private MolSearch search;
        private final Text matches = new Text();

        public void configure(JobConf job) {
            // Load the license file that run() placed in the distributed cache.
            try {
                Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                StringBuilder license = new StringBuilder();
                try (BufferedReader reader =
                             new BufferedReader(new FileReader(licFiles[0].toString()))) {
                    String line;
                    while ((line = reader.readLine()) != null) license.append(line);
                }
                LicenseManager.setLicense(license.toString());
            } catch (IOException e) {
                // Fail fast instead of silently continuing without a license.
                throw new RuntimeException("Could not read license file", e);
            } catch (LicenseProcessingException e) {
                throw new RuntimeException("Could not process license", e);
            }

            // get() rather than getStrings(): the latter splits on commas,
            // which are legal inside SMARTS such as [N,O].
            pattern = job.get("pattern");
            search = new MolSearch();
            try {
                Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                search.setQuery(queryMol);
            } catch (MolFormatException e) {
                throw new RuntimeException("Invalid SMARTS pattern: " + pattern, e);
            }
        }

        // Emits (molecule name, 1) for a match and (molecule name, 0) otherwise.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            Molecule mol = MolImporter.importMol(value.toString());
            matches.set(mol.getName());
            search.setTarget(mol);
            try {
                if (search.isMatching()) {
                    output.collect(matches, one);
                } else {
                    output.collect(matches, zero);
                }
            } catch (SearchException e) {
                throw new IOException(e);
            }
        }
    }

    public static class SmartsMatchReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Keeps only the matching molecules; non-matches (0) are dropped.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                if (values.next().compareTo(one) == 0) {
                    result.set(1);
                    output.collect(key, result);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("Usage: ss <in> <out> <pattern> <license file>");
            return 2;
        }

        JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
        jobConf.setJobName("smartsSearch");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setMapperClass(MoleculeMapper.class);
        jobConf.setCombinerClass(SmartsMatchReducer.class);
        jobConf.setReducerClass(SmartsMatchReducer.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        jobConf.setNumMapTasks(5);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        jobConf.set("pattern", args[2]);

        // make the license file available via dist cache
        DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
        System.exit(res);
    }
}
Pig & Pig Latin
• Pig Latin programs are much simpler to write and get translated to Hadoop code
• SQL-like, requires a user-defined function (UDF) to be implemented to perform non-standard tasks

SMARTS search in Pig Latin

A = load 'medium.smi' as (smiles:chararray);
B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

UDF for SMARTS search

package gov.nih.ncgc.hadoop.pig;

import chemaxon.formats.MolImporter;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;

public class SMATCH extends FilterFunc {
    // One searcher per UDF instance, reused across tuples.
    private final MolSearch search = new MolSearch();

    // tuple = (target SMILES, query SMARTS); returns true if the query matches.
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            Molecule queryMol = MolImporter.importMol(query, "smarts");
            search.setQuery(queryMol);
            search.setTarget(MolImporter.importMol(target, "smiles"));
            return search.isMatching();
        } catch (SearchException e) {
            // A failed search is logged and treated as a non-match.
            e.printStackTrace();
        }
        return false;
    }
}
Going beyond chunking?
• All the preceding use cases are embarrassingly parallel
– Chunking the input data and applying the same operation to each chunk
– Very nice when you have a big cluster
Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
Going beyond chunking?
• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
– Doesn’t necessarily avoid the O(N^2) barrier
– Bioisostere identification is one case that could be rephrased as a map-reduce problem
• Map-Reduce Design Patterns
Identifying MMPs
• First step in identifying bioisosteres is to identify candidate matched molecular pairs (MMPs)
– Naïve all pairs comparison
– Predefined list of transformations
  • Birch et al., BMCL, 2009
– Fragment intersection
  • Hussain et al., JCIM, 2010
– MCS based approaches (e.g., WizePairZ)
  • Warner et al., JCIM, 2010
Naïve bioisostere evaluation
• N molecules: N(N-1)/2 comparisons
Scaffold seeding
[Figure: a seed fragment and its member molecules]
Scaffold seeded bioisosteres
• Each scaffold with M members: M(M-1)/2 comparisons
Seeded bioisosteres – MR style
MAP
• Do pairwise MCS analysis on scaffold series
• For each pair, output the SMIRKS transform and the pair of SMILES
REDUCE
• Collect pairs of SMILES for a given SMIRKS
• Store in DB, or
• Filter by activity, or
• …
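The shape of this map/reduce split can be sketched in memory. The code below is an illustration under stated assumptions, not the talk's implementation: the class SeededPairsSketch is hypothetical, and toyTransform is a placeholder that only trims the common prefix and suffix of two SMILES strings. It performs no MCS analysis and does not produce real SMIRKS; in an actual job that step would call a cheminformatics toolkit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

final class SeededPairsSketch {

    /** Placeholder for MCS + SMIRKS generation: string trimming only, NOT chemistry. */
    static String toyTransform(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int p = 0; // length of the common prefix
        while (p < max && a.charAt(p) == b.charAt(p)) {
            p++;
        }
        int s = 0; // length of the common suffix, not overlapping the prefix
        while (s < max - p
                && a.charAt(a.length() - 1 - s) == b.charAt(b.length() - 1 - s)) {
            s++;
        }
        return a.substring(p, a.length() - s) + ">>" + b.substring(p, b.length() - s);
    }

    /** MAP: one scaffold series in, one {transform, smilesA, smilesB} record per pair out. */
    static List<String[]> map(List<String> series, BinaryOperator<String> transform) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                String a = series.get(i);
                String b = series.get(j);
                records.add(new String[] {transform.apply(a, b), a, b});
            }
        }
        return records;
    }

    /** REDUCE (shuffle folded in): collect every pair that shares a transform key. */
    static Map<String, List<String>> reduce(List<String[]> records) {
        Map<String, List<String>> byTransform = new TreeMap<>();
        for (String[] r : records) {
            byTransform.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1] + " " + r[2]);
        }
        return byTransform;
    }

    /** Maps each scaffold series independently, then reduces over all records. */
    static Map<String, List<String>> run(List<List<String>> scaffolds) {
        List<String[]> records = new ArrayList<>();
        for (List<String> series : scaffolds) {
            records.addAll(map(series, SeededPairsSketch::toyTransform));
        }
        return reduce(records);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                Arrays.asList("c1ccccc1F", "c1ccccc1Cl"),
                Arrays.asList("CCF", "CCCl", "CCBr"))));
    }
}
```

Because each scaffold series is mapped independently, the map calls can run on separate nodes; only the grouping by transform needs data from all of them.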
[Figure: log number of pairwise comparisons vs. log number of molecules, by method: all, seeded.7, seeded.21, seeded.100]
Does seeding help?
• Doesn’t bypass the O(N^2) barrier, but does reduce the constant
• Depends on how many scaffolds there are and the number of members for each scaffold
• Certainly useful when there are a few members per scaffold
• Highly populated scaffolds can throw things off
Data
• Exhaustively fragmented ChEMBL 13
• Identified scaffolds with
• Ended up with 231,875 scaffolds
– Covers 235,693 unique molecules
– Average of 7 members per scaffold
– 95% of scaffolds had < 21 members
– 99.5% had < 74 members
• The remaining 0.5% are a bit problematic
[Figure: Nscaffold vs. Nmembers (annotation: 1.8)]
[Figure: log comparisons by method: All vs. Seeded]
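The effect of seeding on the number of comparisons can be estimated from the formulas N(N-1)/2 and M(M-1)/2. The sketch below uses the molecule and scaffold counts quoted on the Data slide; the class ComparisonCountSketch is hypothetical, and treating every scaffold as having exactly the average 7 members is a simplifying assumption. Because M(M-1)/2 grows faster than linearly, the true seeded total is larger than this uniform estimate, with highly populated scaffolds contributing most of the difference.

```java
final class ComparisonCountSketch {

    /** Number of unordered pairs among n items: n(n-1)/2. */
    static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Seeded total: sum of M(M-1)/2 over scaffolds, M = members of each scaffold. */
    static long seededPairs(long[] membersPerScaffold) {
        long total = 0;
        for (long m : membersPerScaffold) {
            total += allPairs(m);
        }
        return total;
    }

    public static void main(String[] args) {
        long molecules = 235_693L;   // unique molecules (Data slide)
        long scaffolds = 231_875L;   // scaffolds (Data slide)
        long averageMembers = 7L;    // average members per scaffold (Data slide)

        System.out.println("all pairs: " + allPairs(molecules));
        // Assumption: every scaffold has exactly the average number of members.
        System.out.println("seeded, uniform-size estimate: "
                + scaffolds * allPairs(averageMembers));
        // A single large scaffold dominates a small example: 21 + 21 + 4950.
        System.out.println("two small scaffolds and one large: "
                + seededPairs(new long[] {7, 7, 100}));
    }
}
```

Under these assumptions the all-pairs count is on the order of 10^10 comparisons while the uniform seeded estimate is on the order of 10^6, which is the reduction in the constant referred to earlier.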
Timing experiments
• Selected 50 scaffolds with 10 or fewer members
• Configured so as to have ~ 5 maps
• Effective running time for the entire job is 3.8 min on Hadoop
– Only needed 5 of 8 map slots on our “cluster”
• Takes ~ 6 min without Hadoop
[Figure: time (s) for each of jobs 1 to 5]
Timing experiments
• Selected 1000 scaffolds with 20 or fewer members
– Ran with 10 scaffolds / map
• Hadoop run time was ~ 2 hr
– Most maps were fast (< 20 sec)
• Serial evaluation would be > 7 hr
[Figure: histogram of number of jobs vs. log time (s)]
An M-R workflow
• We’re currently focused on just the MMP step as an MR example
• Could also include the fragmentation step as part of the workflow
– But a pre-calculated set of scaffolds is more sensible
• Store transformations and members in HBase
• Link with activity data and apply structure & activity filters on candidate pairs
What Hadoop is not for
• Doesn’t replace an actual database
• It’s not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Generally not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method
Conclusions
• Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
– Remotely hosted
– Embarrassingly parallel / chunked
– Map/reduce
• Ability to process larger structure collections lets us explore more chemical space
• “Big data” isn’t really that big in chemistry
Conclusions
• Q: But are cheminformatics problems really big enough to justify all of this?
• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
• A: Yes – especially when we consider problems with a combinatorial flavor
https://github.com/rajarshi/chem.hadoop