1
Modelling in Chemistry: High and Low-Throughput Regimes
Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.
7
We look at data, analyse data, use data to find correlations ...
... to develop models ...
... and to make (hopefully) useful predictions.
Let’s look at some data ...
8
New York Times, 4th October 2005.
9
Happiness ≈ (GNP/$5000) − 1
Poor fit to linear model
10
(GNP/$5000) -2
Outliers?
Happiness
11
Fitting with a curve: reduce RMSE
12
Outliers?
Different linear models for different regimes
13
Only one obvious (to me) conclusion
This area is empty: no country is both rich and unhappy. All other combinations are observed.
Happiness (GNP/$5000) -2
14
... but what is the connection with chemistry?
15
Modelling in Chemistry
[Diagram: a map of modelling methods, with regions labelled PHYSICS-BASED, EMPIRICAL, ATOMISTIC and NON-ATOMISTIC.]
Methods shown: Density Functional Theory, ab initio, Car-Parrinello, Molecular Dynamics, Monte Carlo, Docking, AM1, PM3 etc., CoMFA, 2-D QSAR/QSPR, Machine Learning, DPD, Fluid Dynamics.
16
[Diagram: the same map of methods, with regions labelled HIGH THROUGHPUT and LOW THROUGHPUT.]
Methods shown: Density Functional Theory, ab initio, Car-Parrinello, Molecular Dynamics, Monte Carlo, Docking, AM1, PM3 etc., CoMFA, 2-D QSAR/QSPR, Machine Learning, DPD, Fluid Dynamics.
17
[Diagram: the same map of methods, with regions labelled INFORMATICS and THEORETICAL CHEMISTRY.]
Methods shown: Density Functional Theory, ab initio, Car-Parrinello, Molecular Dynamics, Monte Carlo, Docking, AM1, PM3 etc., CoMFA, 2-D QSAR/QSPR, Machine Learning, DPD, Fluid Dynamics.
NO FIRM BOUNDARIES!
19
Theoretical Chemistry
• Calculations and simulations based on real physics.
• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.
• Attempt to model or simulate reality.
• Usually Low Throughput.
20
Informatics and Empirical Models
• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.
• Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model.
• Do not attempt to simulate reality.
• Usually High Throughput.
21
QSPR
• Quantitative Structure Property Relationship
• Physical property related to more than one other variable
• Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s).
• Property-property relationships date from the 1860s.
• General form (for non-linear relationships): y = f (descriptors)
22
QSPR
             Y            X1   X2   X3   X4   X5   X6
Molecule 1   Property 1   –    –    –    –    –    –
Molecule 2   Property 2   –    –    –    –    –    –
Molecule 3   Property 3   –    –    –    –    –    –
Molecule 4   Property 4   –    –    –    –    –    –
Molecule 5   Property 5   –    –    –    –    –    –
Molecule 6   Property 6   –    –    –    –    –    –
Molecule 7   Property 7   –    –    –    –    –    –
Y = f (X1, X2, ... , XN )
• Optimisation of Y = f(X1, X2, ... , XN) is called regression.
• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
23
QSPR
• Quality of the model is judged by three parameters:
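$\mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{obs} - y_i^{pred} \right)$
$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{obs} - y_i^{pred} \right)^2 }$
$r^2 = 1 - \sum_{i=1}^{n} \left( y_i^{obs} - y_i^{pred} \right)^2 \Big/ \sum_{i=1}^{n} \left( y_i^{obs} - y^{average} \right)^2$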
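The three quality measures on this slide (bias, RMSE and r²) can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual definitions in terms of observed minus predicted values; the function names and the sample numbers are hypothetical and do not come from the talk.

```python
import math

def bias(y_obs, y_pred):
    """Mean signed error: (1/n) * sum(obs - pred)."""
    return sum(o - p for o, p in zip(y_obs, y_pred)) / len(y_obs)

def rmse(y_obs, y_pred):
    """Root mean squared error: sqrt((1/n) * sum((obs - pred)^2))."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))

def r_squared(y_obs, y_pred):
    """1 - (residual sum of squares) / (total sum of squares about the observed mean)."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted property values, for illustration only.
y_obs = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(bias(y_obs, y_pred), rmse(y_obs, y_pred), r_squared(y_obs, y_pred))
```

A bias near zero with a large RMSE indicates errors that cancel on average but are individually large, which is why the three numbers are usually reported together.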
24
QSPR
• Different methods for carrying out regression:
• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.
• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
25
QSPR
• However, this does not guarantee a good predictive model….
26
QSPR
• Problems with experimental error.
• QSPR only as accurate as data it is trained upon.
• Therefore, we need accurate experimental data.
27
QSPR
• Problems with “chemical space”.
• “Sample” molecules must be representative of “Population”.
• Prediction results will be most accurate for molecules similar to training set.
• Global or Local models?
28
Relationship of Chemical Structure With Lattice Energy
Can we predict lattice energy from 2D molecular structure?
Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge
C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)
29
Why Do We Need a Predictive Model?
Existing techniques from Theoretical Chemistry can give us accurate sublimation and lattice energies ...
... but only in very low throughput.
30
Why Do We Need a Predictive Model?
A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials
From 2-D molecular structure only
Without knowing the crystal packing
Without expensive theoretical calculations
Should help predict solubility.
31
Why Do We Think it Will Work?
Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.
Many molecules have a plurality of different experimentally observable polymorphs.
We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.
32
[Figure: lattice energy (kJ/mol, −98.0 to −92.0) against density (g/cc, 1.40 to 1.60) for alternative crystal packings; symbols distinguish the space groups P-1, P21/c, P212121 and P21; the calculated lowest energy structure and the experimental crystal structure are marked.]
33
Expression for the Lattice Energy
U_crystal = U_molecule + U_lattice
Theoretical lattice energy
– Crystal binding = Cohesive energy
Experimental lattice energy is related to −ΔH_sublimation
ΔH_sublimation = −U_lattice − 2RT (Gavezzotti & Filippini)
Partitioning of the Lattice Energy
U_crystal = U_molecule + U_lattice
ΔH_sublimation = −U_lattice − 2RT
Partitioning the lattice energy in terms of structural contributions
Choice of the significant parameters
– Number of atoms of each type?
– Number of rings, aromatics?
– Number of bonds of each type?
– Symmetry?
– Hydrogen bond donors and acceptors? Intramolecular?
We choose counts of atom type occurrences.
35
Analysis of the Sublimation Energy Data
Experimental data: ΔHsublimation
– NIST (National Institute of Standards and Technology, USA)
– Scientific literature
Atom Types
– SATIS codes: 10-digit connectivity code + bond types
– Each 2-digit code = atomic number
    HN     01 07 99 99 99
    HO     01 08 99 99 99
    O=C    08 06 99 99 99
    -O-    08 06 06 99 99
– Typically, several similar SATIS codes are grouped to define an atom type.
Statistical analysis
– Multi-Linear Regression Analysis: ΔHsub against number of atoms of each type
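The SATIS examples on this slide can be read mechanically. The sketch below splits a 10-digit code into five 2-digit atomic numbers, treating the first as the atom itself and the rest as its neighbours. Interpreting 99 as an empty neighbour slot is an assumption inferred from the examples shown, and `parse_satis` is a hypothetical helper, not part of any published tool.

```python
def parse_satis(code):
    """Split a 10-digit SATIS code into (central atomic number, neighbour atomic numbers)."""
    digits = code.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit SATIS code")
    fields = [int(digits[i:i + 2]) for i in range(0, 10, 2)]
    centre = fields[0]
    # Assumption: 99 pads unused neighbour positions.
    neighbours = [z for z in fields[1:] if z != 99]
    return centre, neighbours

print(parse_satis("01 07 99 99 99"))  # hydrogen attached to nitrogen (the HN type)
print(parse_satis("08 06 06 99 99"))  # oxygen attached to two carbons (the -O- type)
```

Counting how often each grouped atom type occurs in a molecule then gives the descriptor vector used in the regression.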
36
Training Dataset of Model Molecules
226 organic compounds (cumulative totals in brackets)
– 19 linear alkanes (19)
– 14 branched alkanes (33)
– 17 aromatics (50)
– 106 other non-H-bonders (156)
– 70 H-bond formers (226)
Non-specific interacting
– Hydrocarbons
– Nitrogen compounds
– Nitro-, CN, halogens,
– S, Se substituents
– Pyridine
Potential hydrogen bonding interactions
– Amides
– Carboxylic acids
– Amino acids…
[Figure: experimental ΔHsublimation / kJ mol-1 (0 to 200) against number of C, N and O atoms (0 to 25) for amides, diamides, acids, diacids, amino acids and alkanes; the structure of valine is drawn.]
37
Study of Non-specific Interactions: Linear Alkanes
19 compounds: CH4 to C20H42
Limit for van der Waals interactions
ΔHsub = 7.955 C − 2.714
r2 = 0.977
s = 7.096 kJ/mol
[Figure: boiling point / °C (0 to 750) and ΔHsub / kJ mol-1 (0 to 180) against number of carbon atoms (0 to 20); series BPt and Hsub.]
Note odd-even variation in ΔHsub for this series.
Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.
38
Include Branched Alkanes
Add 14 branched alkanes to dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents.
[Figure: ΔHsub / kJ mol-1 (0 to 200) against number of carbon atoms (0 to 25); labelled points C(CH3)4 and (C(CH3)3)3CH.]
33 compounds: CH4 to C20H42
ΔHsub = 7.724 Cnonbranched + 3.703
r2 = 0.959, s = 8.117 kJ/mol
If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.
39
All Hydrocarbons: Include Aromatics
Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).
50 compounds
ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162
r2 = 0.958, s = 7.478 kJ/mol
As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.
[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both 0 to 200; slide annotation: "aliphatic".]
40
All Non-Hydrogen-Bonded Molecules
Add 106 non-hydrocarbons to the dataset.
Include elements H, C, N, O, F, S, Cl, Br & I.
156 compounds
ΔHsub predicted by 16 parameter model
r2 = 0.896, s = 9.976 kJ/mol
Parameters in model are counts of atom type occurrences.
[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both 0 to 250.]
41
General Predictive Model
Add 70 hydrogen bond forming molecules to the dataset.
226 compounds
ΔHsub predicted by 19 parameter model
r2 = 0.925, s = 9.579 kJ/mol
Parameters in model are counts of atom type occurrences.
[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both 0 to 250.]
42
Predictive Model Determined by MLRA
ΔHsublimation (kJ mol-1) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 − 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether
(slide annotation: "aliphatic")
All these parameters are significantly larger than their standard errors
Distribution of Residuals
The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.
[Figure: histogram of number of observations (0 to 60) against residuals (−30 to 30).]
44
Validation on an Independent Test Set
35 diverse compounds
r2 = 0.928
s = 7.420 kJ/mol
[Figure: ΔHsub (predicted) / kJ mol-1 against ΔHsub (experimental) / kJ mol-1, both 0 to 200; a structure bearing CH3 and NO2 substituents is drawn beside the plot.]
Nitro-compounds are often outliers.
Very encouraging result: accurate prediction possible.
45
Major Conclusion
Lattice energy can be predicted from 2D
structure, without knowing the details of the
crystal packing!
46
Conclusions
We have determined a general equation allowing us to estimate
the sublimation enthalpy for a large range of organic compounds
with an estimated error of 9 kJ/mol.
A very simple model (counts of atom types) gives a good
prediction of lattice & sublimation energies.
Lattice energy can be predicted from 2D structure, without
knowing the details of the crystal packing.
Avoids need for expensive calculations.
May help predict solubility.
Model gives good chemical insight.
47
Solubility is an important issue in drug discovery and a major source of attrition
This is expensive for the industry
A good model for predicting the solubility of druglike molecules would be very valuable.
48
Drug Disc.Today, 10 (4), 289 (2005)
Cohesive interactions in the lattice reduce solubility
Predicting lattice (or almost equivalently sublimation) energy should help predict solubility
49
Classifying the WADA 2005 Prohibited List Using CDK & Unity Fingerprints
www-mitchell.ch.cam.ac.uk/
jbom1@cam.ac.uk
Ed Cannon, Andreas Bender, David Palmer & John Mitchell,
J. Chem. Inf. and Model., 46, 2369-2380 (2006)
50
Classifying the WADA Prohibited List
• Aims & Background.
• Methods.
• Data.
• Results.
• Conclusions.
51
Aims & Background
52
Aims & Background
• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.
tetrahydrogestrinone (THG)
53
Aims & Background
• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.
• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.
54
WADA Prohibited Classes
• Anabolic Agents (S1)
• Hormones and Related Substances (S2)
• Beta-2-agonists (S3)
• Anti-estrogenic Agents (S4)
• Diuretics and Masking Agents (S5)
• Stimulants (S6)
• Narcotics (S7)
• Cannabinoids (S8)
• Glucocorticoids (S9)
• Alcohol (P1)
• Beta Blockers (P2)
55
Predicting Bioactivities
• We seek to predict whether a molecule exhibits one of these bioactivities.
• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.
56
Methods
57
Chemical Space
• Use descriptor-based fingerprints to locate molecules in chemical space.
• Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.
58
Machine Learning
• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.
• Random Forest.
• k-Nearest Neighbours.
59
Fingerprints
• CDK (Chemistry Development Kit) fingerprint.
• Unity 2D.
• MACCS key.
• MOE 2D (2004).
• Typed Atom Distance.
• Typed Graph Distance.
60
CDK Fingerprint
• CDK fingerprint resembles Daylight.
• All bond paths up to a length of 6 are generated.
• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
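The idea of a hashed path fingerprint can be sketched as follows. This is a toy illustration, not the CDK implementation: it ignores bond orders and hydrogens, reads "length of 6" as six bonds, and uses CRC32 as a stand-in hash. The function names and the ethanol example are assumptions.

```python
import zlib

def enumerate_paths(atoms, bonds, max_bonds=6):
    """Return canonical label strings for all simple paths of up to max_bonds bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    paths = set()

    def extend(path):
        labels = [atoms[i] for i in path]
        # A path and its reverse are the same feature; keep one canonical string.
        paths.add(min("-".join(labels), "-".join(reversed(labels))))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths

def path_fingerprint(atoms, bonds, n_bits=1024, max_bonds=6):
    """Hash every path onto one position of an n_bits-long bit vector."""
    bits = [0] * n_bits
    for path in enumerate_paths(atoms, bonds, max_bonds):
        bits[zlib.crc32(path.encode("utf-8")) % n_bits] = 1
    return bits

# Ethanol heavy atoms: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
fp = path_fingerprint(ethanol_atoms, ethanol_bonds)
print(sorted(enumerate_paths(ethanol_atoms, ethanol_bonds)), sum(fp))
```

Because different paths can hash to the same bit, the number of bits set is at most the number of distinct paths.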
61
Unity 2D Fingerprint
• Unity is similar to CDK, but based on sub-structures rather than just paths.
• Substructures present in the molecule are enumerated.
• A hashing function is used to map these substructures onto a fingerprint of 992 bits.
62
Classification Algorithms
• Random Forest (RF).
• k-Nearest Neighbours (k-NN).
63
Random Forest
• Decision tree based learner.
• Each tree is based on a bootstrap sample of the data.
• Number of trees in forest (ntree).
• Number of descriptors tried at each node (mtry).
• Each tree predicts label of molecule.
• Majority vote = class label of molecule.
64
Random Forest
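[Figure: a decision tree. The root node splits on A > x1 or A < x1; the A > x1 branch then splits on B > x2 or B < x2, and the A < x1 branch on C > x3 or C < x3. Decisions at the four leaves, left to right: Yes, No, No, Yes.]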
A Random Forest contains many such trees.
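The tree in the figure and the majority vote over a forest can be sketched as below. The thresholds and the sample are hand-picked, hypothetical values; a real Random Forest grows each tree from a bootstrap sample of the training data and tries mtry descriptors at each node, which this toy example does not do.

```python
from collections import Counter

def make_tree(x1, x2, x3):
    """Build a two-level tree with the same shape as the one on the slide."""
    def tree(sample):
        if sample["A"] > x1:
            return "Yes" if sample["B"] > x2 else "No"
        return "No" if sample["C"] > x3 else "Yes"
    return tree

def forest_predict(trees, sample):
    """Each tree votes for a label; the most common label wins."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical trees (an odd number avoids tied votes for two classes).
forest = [make_tree(0.5, 0.5, 0.5), make_tree(0.4, 0.6, 0.5), make_tree(0.6, 0.3, 0.7)]
sample = {"A": 0.9, "B": 0.55, "C": 0.1}
print([tree(sample) for tree in forest], forest_predict(forest, sample))
```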
66
k-Nearest Neighbours
• Instance based learner.
• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.
• k is a variable describing the number of neighbours to be considered.
• Class of x determined by majority vote of class labels of k neighbours.
• Ties broken randomly (only occurs for even k).
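The procedure in these bullets can be sketched in plain Python. The training points, labels and function name are hypothetical; real descriptors would be fingerprint vectors rather than 2-D points.

```python
import math
import random
from collections import Counter

def knn_predict(query, training, k=1, rng=random):
    """training is a list of (descriptor_vector, class_label) pairs."""
    # Rank training points by Euclidean distance to the query.
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    top = max(votes.values())
    winners = sorted(label for label, n in votes.items() if n == top)
    # Tied votes are broken at random, as on the slide.
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Hypothetical 2-D descriptor space, for illustration only.
training = [
    ((0.0, 0.0), "inactive"), ((0.1, 0.2), "inactive"),
    ((1.0, 1.0), "active"), ((0.9, 1.1), "active"), ((1.2, 0.8), "active"),
]
print(knn_predict((0.8, 0.9), training, k=1))
print(knn_predict((0.05, 0.1), training, k=3))
```

With two classes, a tie can only arise when k is even, which is why odd k sidesteps the random choice.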
67
k-Nearest Neighbours
[Figure: a query molecule (?) among neighbours labelled Active and Inactive.]
69
k-Nearest Neighbours
• Local method.
• Uses only a very small number of near neighbours to make its prediction.
• Suitable for predicting activity classes with multiple clusters in chemical space.
• Therefore good for WADA classes with multiple receptors.
70
Performance Measure
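• Matthews Correlation Coefficient:
$\mathrm{MCC} = \dfrac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$
where t_p, t_n, f_p and f_n are the numbers of true positives, true negatives, false positives and false negatives.
• Range: −1 ≤ MCC ≤ 1.
• Balance between predicting positives & negatives.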
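A minimal sketch of the Matthews Correlation Coefficient computed from the four confusion-matrix counts. Returning 0.0 when the denominator vanishes is a common convention assumed here, not something stated in the talk.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(10, 10, 0, 0))   # perfect prediction
print(mcc(0, 0, 10, 10))   # every prediction wrong
print(mcc(5, 5, 5, 5))     # no better than chance
```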
71
Data
72
The Dataset
• 5245 molecules (5235 for CDK).
• Molecules taken from WADA banned list and from corresponding activity classes in MDDR. 367 explicitly allowed substances.
73
Data by Class
WADA Class Number of Molecules
S1 47
S2 272
S3 367
S4 928
S5 1000
S6 804
S7 195
S8 1000
S9 26
P2 239
Allowed 367
74
Fivefold Cross-validation
• We test for membership of each prohibited class separately.
• All calculations use 5-fold cv. This uses {80% molecules training set; 20% test set} repeated 5 times so that each molecule is in exactly 1 test set.
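The fivefold scheme can be sketched as an index split in which every item lands in exactly one test set. The function name, the shuffling seed and the interleaved fold assignment are illustrative assumptions; the talk does not say how molecules were allocated to folds.

```python
import random

def k_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each index is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With the 5245 molecules of the dataset, each fold holds 1049 test molecules.
splits = list(k_fold_splits(5245))
print([len(test) for _, test in splits])
```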
75
False Positives
• False Positives arise in two ways:
• (1) A molecule predicted positive on an incorrect activity class;
• (2) An explicitly allowed molecule predicted positive.
76
Results
77
Results: Random Forest
Aggregated over 10 classes
78
MCC for RF for Six Fingerprints
[Figure: MCC (0.5 to 0.9) against rank out of 6 fingerprints.]
Unity   0.8214
CDK     0.8136
MACCS   0.7823
TGD     0.7283
MOE     0.7172
TAD     0.5902
Unity ≈ CDK > MACCS > others.
79
100 trees sufficient; little improvement with more.
MCC as a Function of ntree in RF models for Unity
[Figure: MCC (0.6 to 0.9) against ntree (0 to 1000) for Unity.]
80
Results: k-Nearest Neighbours
Aggregated over 10 classes
81
MCC as a Function of k in k-NN Models for Six Fingerprints
[Figure: MCC (0.4 to 0.9) against k (0 to 20) for Unity, MACCS, MOE, TAD, TGD and CDK.]
82
MCC for k = 1 for Six Fingerprints
[Figure: MCC (0.6 to 0.9) against rank out of 6 fingerprints.]
Unity   0.8363
CDK     0.8297
MACCS   0.8045
TGD     0.7404
MOE     0.6814
TAD     0.6152
Unity ≈ CDK > MACCS > others.
83
k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.
MCC as a Function of k in k-NN Models for Unity
[Figure: MCC (0.6 to 0.9) against k (0 to 20) for Unity.]
84
k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.
MCC as a Function of k in k-NN Models for CDK
[Figure: MCC (0.6 to 0.9) against k (0 to 20) for CDK.]
85
Results: Comparison
Recall v Precision
Aggregated over 10 classes
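The definitions of recall and precision did not survive in this transcript, so the sketch below assumes the standard ones: recall is the fraction of true actives that are found, and precision is the fraction of predicted actives that are correct. Returning 0.0 for an empty denominator is an assumed convention.

```python
def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical counts: 8 true positives, 2 false negatives, 8 false positives.
print(recall(8, 2), precision(8, 8))
```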
86
Recall v Precision for Positives
[Figure: precision (60 to 100) against recall (30 to 90) for the Unity, MACCS and CDK fingerprints, with RF and k-NN models.]
RF gives higher precision, k-NN higher recall.
87
Results: Comparison
Analysed by class
88
Classes vary in difficulty of prediction; independent of classification algorithm.
MCC by Class for Random Forest Default and k-NN (k = 1) Models
[Figure: MCC (0 to 1) by class (S1, S2, S3, S4, S5, S6, S7, S8, S9, P2) for RF and k-NN.]
89
Conclusions
90
Major Conclusion
• Can use Informatics to predict whether or not a molecule exhibits a prohibited bioactivity.
91
Conclusions
• Can successfully predict active molecules (MCC ≈ 0.83).
• Unity ≈ CDK > MACCS > others.
• RF & k-NN give similar MCC.
• k-NN higher recall.
• RF higher precision; RF less likely to find false positives.
92
Conclusions
• RF results vary little with ntree.
• k-NN results best for k = 1.
• Performance decreases at higher k.
• Odd k avoids problems with ties (k = 2 is worse than k = 3).
• Activity classes show consistent prediction difficulty pattern.
93
www-mitchell.ch.cam.ac.uk/
jbom1@cam.ac.uk
94
Acknowledgements: People
Carole Ouvrard, Ed Cannon, David Palmer,
Florian Nigsch, Chrysi Kirtay, Laura Hughes,
Jo Bailey, Noel O’Boyle, Daniel Almonacid,
Gemma Holliday, Jen Ryder,
Dushy Puvanendrampillai, Andreas Bender.
95
A¢know£€dg€m€nt$: Funding
Unilever
96
MCC as a Function of Class Size
[Figure: MCC (0 to 1) against class size (0 to 1200) for Unity, MACCS, MOE, TAD and TGD.]
No significant correlation overall; though smallest class S9 is hardest to predict.
97
MCC as a Function of Intra Class Mean Tanimoto Score
[Figure: MCC (0 to 1) against intra class mean Tanimoto score (0.00 to 0.60) for Unity, MACCS, MOE, TAD and TGD.]
98
tetrahydrogestrinone (THG)
gestrinone
trenbolone