cloud computing for chemical property prediction
DESCRIPTION
Microsoft Cloud Futures Conference, Redmond, 8 th April 2010. Cloud Computing for Chemical Property Prediction. Paul Watson School of Computing Science Newcastle University, UK [email protected]. The team: David Leahy, Jacek Cala, Hugo Hiden, - PowerPoint PPT PresentationTRANSCRIPT
Cloud Computing for Chemical Property Prediction
Paul Watson
School of Computing Science
Newcastle University, UK
Microsoft Cloud Futures Conference, Redmond, 8th April 2010
The team: David Leahy, Jacek Cala, Hugo Hiden,Dominic Searson, Vladimir Sykora, Martyn Taylor, Simon Woodman
With thanks to:- Microsoft External Research for their financial support for Project Junior- Christophe Poulain, Savas Parastatidis
Chemists want to know:
Q1. What are the properties of this molecule?
Toxicity
Solubility
Biological Activity
Q2. What molecule would have aqueous solubility of 0.1 μg/mL?
Answering the Question by performing experiments
..... time consuming, expensive, ethical Issues
Quantitative Structure Activity Relationship - predict properties based on similar molecules
An alternative to experimentation: QSAR
Activity ≈ f( )
quantifiable structural attributes, e.g. #atoms
logp shape
.....
New/ImprovedModels
New DataorModel-Builders
DataModel-Builders
Model Generation
Models
Generating the models -Discovery Bus (Leahy et al)
www.openqsar.com
Selected Descriptors + Responses
Separate Training & Test Data
Chemical Structures & their Activities
Calculate Descriptors from Structures
Training Data
Combine Descriptors
Descriptors + Responses
Filter Descriptors
Combined Descriptors + Responses
MultipleLinear
Regression
Neural Network
Partial Least Squares
Classification Trees .....Build &
Test ModelsIndependently
Select Best Models
Add to Model Database
Test Data
Increasing amounts of data for model building...
CHEMBL : data on 622,824 compounds,collected from 33,956 publications
WOMBAT-PK: data on 1230 compounds,for over 13,000 clinical measurements
WOMBAT : data on 251,560 structures,for over 1,966 targets
All contain structure information & numerical activity data
More models Better models
Computationally expensive:5 years for new datasets on existing server
JUNIOR Project Aim
Use Azure to generate models in weeks not years.... using as much of the available data as possible.... make models available on www.openqsar.com
... so that researchers can generate predictions for their own molecules
Selected Descriptors + Responses
Separate Training & Test Data
Chemical Structures & their Activities
Calculate Descriptors from Structures
Training Data
Combine Descriptors
Descriptors + Responses
Filter Descriptors
Combined Descriptors + Responses
MultipleLinear
Regression
Neural Network
Partial Least Squares
Classification Trees .....Build &
Test ModelsIndependently
Select Best Models
Add to Model Database
Test Data
Potential for concurrency...
Approach
• avoid rewriting all existing Discovery Bus software
• move existing Discovery Bus to Amazon Cloud– without parallelisation
• move critical tasks to run concurrently on Azure• base solution around e-Science Central ....
Clouds to the rescue?
• Building scalable, dependable, science applications is still hard .....
• e-Science Central– Science Cloud Platform
Science Cloud Options
Cloud Infrastructure: Storage & Compute
Science Cloud Platform
ScienceApp 1 .... Science
App n
Users
Cloud Infrastructure:Storage & Compute
Scie
nce
Ap
p 1
....
Scie
nce
Ap
p n
Users
Cloud Infrastructure: Storage & Compute
Science Cloud Platform
ScienceApp 1 .... Science
App n
Users
e-Science Central
Science Cloud Platformfor developers
Science as a Servicefor users
What should the Science Cloud Platform Include?
Store
Analyse
Automate
Share
Data (instruments, experimental data, sensors...)
Identify the common needs of our e-Science Users
Science CloudPlatform
App ....
Workflow Enactment
App API
Social Networking
Security
Processing
e-Science Central
Storage
App
Analysis Services
CloudInfrastructure
Provenance
Metadata
Workflow Enactment
App API
Social Networking
Security
Processing
e-Science Central
Storage
Analysis Services
Azure
Provenance
Metadata
Discovery BusPlanner
Amazon
Discovery Bus invokes e-Science Central Workflow via API
Workflow decomposed to Message Plan1
2
3
Temporary workflow storage assigned, Message Plan queuedfor execution.
Workflow temporary storage
4
Mes
sage
s se
nt in
seq
uenc
e
Call Message
Response Message
Internal Service
Azure Service
NFS
HTTPCall Message
Response Message
RMI / JMS
HTTP Post
5 Results data stored in e-Science Central folder
Workflow ExecutionCompletes
5 Discovery Bus notified with results
Message Plan
Task Task TaskWeb Node
Worker Node
Worker Node
Worker Node
Queue
e-Science Central
Azure
Blob Storage
Results
• Running across up to 100 Azure nodes• Azure utilisation increasing (average ~60% over runs)• Moving more admin tasks to Azure / e-Science Central
– model validation– co-ordination
Current Status – Work in Progress
CPU Utilization
Summary
• Discovery Bus exemplifies a good Cloud pattern– large, variable, bursty requirements– proposal to apply to software verification
• clouds do NOT make it easier to build complex, scalable, dependable distributed systems– we need higher-level “Science Cloud Platforms”– e-Science Central is our attempt at this
• using the Azure Cloud we have a scalable system that can handle large new datasets
• the models are being made freely available– www.openqsar.com