Online chemical modeling environment: models
AACIMP 2009 Summer School lecture by Yuriy Sushko and Sergii Novotarskyi. "Environmental Chemoinformatics" course.
Iurii Sushko, Sergey Novotarskiy
Thursday, August 13, 2009
Existing alternatives
Classical approach: Weka, R, Mathematica
Advantages:
1. Most flexible
2. Suitable for research and deep analysis
Disadvantages:
1. Complex: suitable for mathematicians, computer scientists and statisticians, but not for chemists and biologists
2. Very tedious data preparation
Community driven source Authority driven source
Collaboration in QSAR
Possibilities for collaboration in QSAR:
1. Use others' data
   a. build models based on others' data
   b. validate your models against others' data
2. Use others' models
   a. validate your data against published models
   b. use the output of published models as input for new ones
   c. compare the performance of published models with your own
All existing modeling tools lack means of collaboration
OCHEM advantages
Collaboration-targeted features:
1. Tight connection between the database and the modeling tools
2. Wiki, discussion, comments, tags
Simplified modeling workflow:
1. Sensible defaults for most parameters
2. Only necessary parameters requested
3. Data representation targeted at chemists
4. Possibility of fine-tuning for experts
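The "sensible defaults plus expert fine-tuning" idea can be sketched as a settings object in which only the training set is mandatory. This is an illustrative sketch, not OCHEM's actual interface: the class `ModelSettings`, the function `settings_for`, and all default values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelSettings:
    # Hypothetical parameters; the default values are illustrative assumptions.
    method: str = "ANN"          # one of the methods named in the talk
    descriptors: str = "E-State"  # one of the descriptor sets named in the talk
    validation_folds: int = 5    # assumed default, not taken from the source
    ensemble_size: int = 100     # assumed default, not taken from the source


def settings_for(training_set, **expert_overrides):
    """Only the training set is required; experts may override any default."""
    return training_set, replace(ModelSettings(), **expert_overrides)


# A chemist supplies just the data; an expert fine-tunes selected parameters.
basic = settings_for("my-logP-set")
tuned = settings_for("my-logP-set", method="KNN", validation_folds=10)
```

Unknown parameter names raise a `TypeError`, so a mistyped override is not silently ignored.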
Modeling workflow
1. Data preparation
2. Building a model
3. Analysing the model
4. Application of the model
AD
Stage 1 – Data preparation
[Diagram: structure of a data point]
Property: logP = 0.5, Melting Point = 100 C
Condition: temperature, pH, species, tissue, method
Article: Garberg, P "In vitro models for …"
Structure: Benzene, Urea, ...
Tags: Toxicology, Biology, Partition coefficient
Introducer: Bill G., Sergey B.
Date of modification
Information system
Filtering: Toxicology, Biology, Partition coefficient
Manipulation: editing
Organization: working sets
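The data point shown on this slide can be sketched as a small record type with tag-based filtering. The field names follow the slide labels, but the types, the `filter_by_tag` helper, and the pairing of structures with values are assumptions made for illustration. The numbers reuse the slide's placeholders and are not real measurements.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    # Field names mirror the labels on the slide; the types are assumptions.
    structure: str       # e.g. "Benzene"
    property_name: str   # e.g. "logP"
    value: float
    article: str         # source publication
    conditions: dict = field(default_factory=dict)  # temperature, pH, ...
    tags: set = field(default_factory=set)
    introducer: str = ""


def filter_by_tag(points, tag):
    """Return the data points that carry the given tag."""
    return [p for p in points if tag in p.tags]


# Placeholder records (not real measurements).
points = [
    DataPoint("Benzene", "logP", 0.5, "(example article)",
              tags={"Partition coefficient"}),
    DataPoint("Urea", "Melting Point", 100.0, "(example article)",
              conditions={"method": "unspecified"}, tags={"Toxicology"}),
]

# A "working set" here is simply the filtered list.
working_set = filter_by_tag(points, "Partition coefficient")
```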
Stage 1: Data preparation
Stage 2: Model building - input data
Stage 2: Model building - descriptors (I)
Stage 2: Model building - descriptors (II)
Stage 2: Model building – descriptors (manual)
Stage 3: Analysing the model (I)
Basic model statistics
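The slide does not say which statistics are reported. As a minimal sketch, assuming a regression model, the commonly used RMSE, MAE and R² can be computed as follows. The function name and the sample numbers are illustrative only.

```python
import math


def regression_stats(observed, predicted):
    """Compute RMSE, MAE and R^2 for paired observed/predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)       # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up values, for illustration only.
stats = regression_stats([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```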
Stage 3: Analysing the model (II)
Applicability domain assessment
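The slide does not name the applicability domain method. One simple option, assumed here purely for illustration, is a distance-to-model check: a query compound is treated as inside the domain when its mean distance to the k nearest training compounds in descriptor space stays below a threshold. The descriptor vectors and the threshold are toy values.

```python
import math


def distance_to_model(query, training, k=3):
    """Mean Euclidean distance from the query to its k nearest training points."""
    dists = sorted(math.dist(query, t) for t in training)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def in_domain(query, training, threshold, k=3):
    """True when the query lies close enough to the training set."""
    return distance_to_model(query, training, k) <= threshold


# Toy 2D descriptor vectors for four training compounds.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = in_domain((0.5, 0.5), training, threshold=1.0)
far = in_domain((5.0, 5.0), training, threshold=1.0)
```

Predictions for compounds like the second query would be flagged as less reliable, since they lie far from anything the model was trained on.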
Stage 4: Application of the model
Selection of the model of interest
Model published by another user
Newly created model
Stage 4: Application of the model
Provide target compounds
Stage 4: Application of the model
Prediction results
Target compound Prediction Accuracy assessment
Stage 4: Application of the model
Assessment of accuracy of predictions
Target compound
Need for distribution of calculations
Fact: QSAR modeling is calculation-intensive
Examples of calculations:
• Training of neural network ensembles
• Computing 3D conformations
• Computing complex molecular descriptors
Solution:
• Distributed calculation network
• User can postpone or cancel a task, or fetch its results later
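The task life cycle listed above (submit, cancel, fetch results later) can be illustrated with Python's standard `concurrent.futures` module. This is only a sketch: a local single-worker thread pool stands in for the distributed calculation servers, the task functions are hypothetical placeholders, and postponing a task is not modeled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()


def train_ensemble(size):
    # Placeholder for a long-running calculation: waits until released.
    release.wait(timeout=5)
    return f"ensemble of {size} networks trained"


def compute_conformations(n):
    # Placeholder for a second calculation task.
    return f"{n} conformations computed"


with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(train_ensemble, 10)
    queued = pool.submit(compute_conformations, 100)

    # The second task is still waiting in the queue, so it can be cancelled.
    cancelled = queued.cancel()

    # The user can poll the first task and come back later.
    finished_early = running.done()

    release.set()                        # let the calculation finish
    result = running.result(timeout=5)   # fetch the result when ready
```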
Automatic updates and testing
• Calculation servers are automatically updated when a new release becomes available
• Servers are automatically tested after each update
• Tasks that did not pass the tests are disabled, keeping the server functional
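The "disable failing tasks, keep the server functional" policy can be sketched as a self-test loop run after each update. The task names, the test functions and `run_self_tests` are hypothetical and serve only to illustrate the idea.

```python
def run_self_tests(tasks, tests):
    """Return the names of tasks that pass their test; all others are disabled."""
    enabled = set()
    for name, task in tasks.items():
        try:
            if tests[name](task):
                enabled.add(name)
        except Exception:
            # A crashing task is disabled; the server itself keeps running.
            pass
    return enabled


def broken_descriptors(_structure):
    # Simulates a task type that stopped working after an update.
    raise RuntimeError("missing native library")


tasks = {
    "mean_value": lambda xs: sum(xs) / len(xs),
    "descriptors": broken_descriptors,
}
tests = {
    "mean_value": lambda task: task([1.0, 3.0]) == 2.0,
    "descriptors": lambda task: task("C1=CC=CC=C1") is not None,
}

# Only task types in this set would be offered to the network.
enabled = run_self_tests(tasks, tests)
```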
Backend - distributed calculation
Central metaserver, distributed calculation servers
Automatic server updates, on-the-fly server testing
Basic facts
About 50000 experimental measurements on 285 physicochemical properties published in about 2000 articles
Implemented modeling methods: ANN, KNN, MLR, Kernel ridge regression
Integrated descriptors: Dragon, E-State, Fragments
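Of the listed methods, KNN is the simplest to show in a few lines. The sketch below is a generic k-nearest-neighbours regression on toy one-dimensional descriptors. It is not OCHEM's implementation, and the training values are made up.

```python
import math


def knn_predict(query, training, k=3):
    """Predict a property as the mean over the k nearest training compounds.

    training: list of (descriptor_vector, property_value) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Toy data: one descriptor per compound, made-up property values.
training = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((10.0,), 11.0)]
prediction = knn_predict((1.0,), training, k=3)
```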
Backend - basic facts
Platform: Java EE
Database: MySQL
Server: Tomcat
ORM: Hibernate
MVC: Spring framework
Client side: AJAX, HTML + JavaScript