
Modelling in Chemistry: High and Low-Throughput Regimes

Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.

We look at data, analyse data, use data to find correlations ...

... to develop models ...

... and to make (hopefully) useful predictions.

Let’s look at some data ...

New York Times, 4th October 2005.

Happiness ≈ (GNP/$5000) − 1

Poor fit to linear model.

[Figure: Happiness plotted with the lines (GNP/$5000) − 1 and (GNP/$5000) − 2 marked. Outliers?]

Fitting with a curve: reduce RMSE

Outliers?

Different linear models for different regimes

Only one obvious (to me) conclusion

This area is empty: no country is both rich and unhappy. All other combinations are observed.

Happiness (GNP/$5000) -2

... but what is the connection with chemistry?

Modelling in Chemistry

[Diagram: modelling methods (Density Functional Theory, ab initio, Car-Parrinello, Molecular Dynamics, Monte Carlo, Docking, AM1, PM3 etc., Fluid Dynamics, 2-D QSAR/QSPR, Machine Learning) classified as PHYSICS-BASED or EMPIRICAL and as ATOMISTIC or NON-ATOMISTIC, then divided into HIGH THROUGHPUT and LOW THROUGHPUT, and into INFORMATICS and THEORETICAL CHEMISTRY.]

NO FIRM BOUNDARIES!

Theoretical Chemistry

• Calculations and simulations based on real physics.

• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.

• Attempt to model or simulate reality.

• Usually Low Throughput.

Informatics and Empirical Models

• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.

• The relationship between inputs and outputs is based on an empirically parameterised equation or a more elaborate mathematical model.

• Do not attempt to simulate reality.

• Usually High Throughput.

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable

• Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s).

• Property-property relationships date from the 1860s.

• General form (for non-linear relationships): y = f (descriptors)

QSPR

             Y            X1   X2   X3   X4   X5   X6
Molecule 1   Property 1   –    –    –    –    –    –
Molecule 2   Property 2   –    –    –    –    –    –
Molecule 3   Property 3   –    –    –    –    –    –
Molecule 4   Property 4   –    –    –    –    –    –
Molecule 5   Property 5   –    –    –    –    –    –
Molecule 6   Property 6   –    –    –    –    –    –
Molecule 7   Property 7   –    –    –    –    –    –

Y = f (X1, X2, ... , XN )

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.

• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.

• Quality of the model is judged by three parameters:

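$$
r^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - y_i^{\mathrm{obs}}\right)^2}{\sum_{i=1}^{n} \left(y_i^{\mathrm{obs}} - y^{\mathrm{average}}\right)^2}
$$

[The other formulas on this slide, built from the differences $y_i - y_i^{\mathrm{obs}}$, are not recoverable.]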
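The quality measures for a regression model can be sketched in a few lines of Python. Only r² is taken from the slide; RMSE (mentioned earlier in the talk) and bias are included as assumptions about what the other two parameters may be, and the numbers are illustrative rather than real data.

```python
import math

def r_squared(y_pred, y_obs):
    """r^2 = 1 - SS_residual / SS_total, as in the formula above."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((p - o) ** 2 for p, o in zip(y_pred, y_obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def rmse(y_pred, y_obs):
    """Root mean square error (assumed to be one of the three parameters)."""
    n = len(y_obs)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(y_pred, y_obs)) / n)

def bias(y_pred, y_obs):
    """Mean signed error (also an assumption, not stated on the slide)."""
    return sum(p - o for p, o in zip(y_pred, y_obs)) / len(y_obs)

# Illustrative values only.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_pred, y_obs), rmse(y_pred, y_obs), bias(y_pred, y_obs))
```

In practice these would be evaluated on the M test molecules, not on the training molecules used to optimise the model.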

• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

• However, this does not guarantee a good predictive model….

• Problems with experimental error.

– QSPR is only as accurate as the data it is trained upon.

– Therefore, we need accurate experimental data.

• Problems with “chemical space”.

– “Sample” molecules must be representative of the “Population”.

– Prediction results will be most accurate for molecules similar to the training set.

– Global or Local models?

Relationship of Chemical Structure With Lattice Energy

Can we predict lattice energy from 2D molecular structure?

Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge

C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)

Why Do We Need a Predictive Model?

Existing techniques from Theoretical Chemistry can give us accurate sublimation and lattice energies ...

... but only in very low throughput.

Why Do We Need a Predictive Model?

A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials

From 2-D molecular structure only

Without knowing the crystal packing

Without expensive theoretical calculations

Should help predict solubility.

Why Do We Think it Will Work?

Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.

Many molecules have a plurality of different experimentally observable polymorphs.

We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.

[Figure: Lattice Energy (kJ/mol) against Density (g/cc, 1.40 to 1.60) for possible crystal packings in space groups P-1, P21/c, P212121 and P21, with the Calculated Lowest Energy Structure and the Experimental Crystal Structure marked.]

Expression for the Lattice Energy

$U_{\mathrm{crystal}} = U_{\mathrm{molecule}} + U_{\mathrm{lattice}}$

Theoretical lattice energy

– Crystal binding = Cohesive energy

Experimental lattice energy is related to $-\Delta H_{\mathrm{sublimation}}$

$\Delta H_{\mathrm{sublimation}} = -U_{\mathrm{lattice}} - 2RT$ (Gavezzotti & Filippini)
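As a worked example of the relation ΔHsublimation = −Ulattice − 2RT, the sketch below converts between the two quantities. The temperature of 298.15 K and the lattice energy of −100 kJ/mol are illustrative assumptions, not values from the talk.

```python
R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1

def sublimation_enthalpy(u_lattice, temperature=298.15):
    """Delta H_sub = -U_lattice - 2RT (energies in kJ/mol, T in K)."""
    return -u_lattice - 2.0 * R_KJ_PER_MOL_K * temperature

def lattice_energy(dh_sub, temperature=298.15):
    """Inverse relation: U_lattice = -Delta H_sub - 2RT."""
    return -dh_sub - 2.0 * R_KJ_PER_MOL_K * temperature

# Hypothetical lattice energy of -100 kJ/mol at room temperature.
dh = sublimation_enthalpy(-100.0)
print(round(dh, 3))                  # 2RT is about 4.958 kJ/mol at 298.15 K
print(round(lattice_energy(dh), 3))  # round trip back to -100.0
```

At room temperature the 2RT term is only about 5 kJ/mol, which is why lattice and sublimation energies are treated as almost equivalent.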

Partitioning of the Lattice Energy

$U_{\mathrm{crystal}} = U_{\mathrm{molecule}} + U_{\mathrm{lattice}}$

$\Delta H_{\mathrm{sublimation}} = -U_{\mathrm{lattice}} - 2RT$

Partitioning the lattice energy in terms of structural contributions

Choice of the significant parameters

– number of atoms of each type?

– Number of rings, aromatics?

– Number of bonds of each type?

– Symmetry?

– Hydrogen bond donors and acceptors? Intramolecular?

We choose counts of atom type occurrences.

Analysis of the Sublimation Energy Data

Experimental data: ΔHsublimation

Atom Types

– SATIS codes: 10-digit connectivity code + bond types

– Each 2-digit code = atomic number

Atom type   SATIS code
HN          01 07 99 99 99
HO          01 08 99 99 99
O=C         08 06 99 99 99
-O-         08 06 06 99 99

Statistical analysis

Multi-Linear Regression Analysis

ΔHsub as a function of # atoms of each type

Typically, several similar SATIS codes are grouped to define an atom type.
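The connectivity part of the codes listed above can be sketched as follows. This is a minimal illustration inferred from the four examples on the slide: the function name, the ascending ordering of neighbours and the use of 99 as a filler for empty neighbour slots are assumptions, and bond types are ignored.

```python
def satis_connectivity_code(central_z, neighbour_zs):
    """Build a 10-digit connectivity code from atomic numbers.

    The first 2-digit field is the central atom; up to four neighbour
    fields follow, with unused slots padded with 99 (as the slide's
    examples suggest). Neighbour ordering here is an assumption.
    """
    if len(neighbour_zs) > 4:
        raise ValueError("at most four neighbours fit in a 10-digit code")
    fields = [central_z] + sorted(neighbour_zs)
    fields += [99] * (5 - len(fields))
    return " ".join(f"{z:02d}" for z in fields)

print(satis_connectivity_code(1, [7]))     # H bonded to N
print(satis_connectivity_code(8, [6, 6]))  # ether-like O bonded to two C
```

Grouping several such codes into one atom type, as the slide describes, would then be a lookup from code to type name.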

NIST (National Institute of Standards and Technology, USA); scientific literature

Training Dataset of Model Molecules: 226 organic compounds

19 linear alkanes (19)

14 branched alkanes (33)

17 aromatics (50)

106 other non-H-bonders (156)

70 H-bond formers (226)

Non-specific interacting

– Hydrocarbons

– Nitrogen compounds

– Nitro-, CN, halogens,

– S, Se substituents

– Pyridine

Potential hydrogen bonding interactions

– Amides

– Carboxylic acids

– Amino acids…

[Figure: plot against no. of C, N, O atoms (0 to 25), with series labelled alkanes, amides, diamides, diacid and aminoacids; valine is labelled.]

Study of Non-specific Interactions: Linear Alkanes

19 compounds: CH4 to C20H42

Limit for van der Waals interactions

ΔHsub = 7.955 C − …

r² = 0.977, s = 7.096 kJ/mol

[Figure: ΔHsub / kJ mol⁻¹ against No. of carbon atoms (0 to 20).]

Note odd-even variation in Hsub for this series.

Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.
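A one-descriptor fit of this kind is ordinary least squares. The sketch below fits a straight line to synthetic data: an assumed 8 kJ/mol per carbon with an alternating ±2 kJ/mol offset to mimic an odd-even variation. None of these numbers are the experimental values behind the slide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic "linear alkane" series, for illustration only.
carbons = list(range(1, 21))
dh_sub = [8.0 * n + (2.0 if n % 2 == 0 else -2.0) for n in carbons]

slope, intercept = fit_line(carbons, dh_sub)
print(f"dH_sub = {slope:.3f} * C + {intercept:.3f}")
```

The odd-even scatter barely moves the fitted slope; it shows up in the residuals and hence in s.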

Include Branched Alkanes

Add 14 branched alkanes to the dataset. The graph below highlights the reduction of sublimation enthalpy due to bulky substituents.

[Figure: ΔHsub against No. of carbon atoms (0 to 25); C(CH3)4 and (C(CH3)3)3CH are labelled.]

33 compounds: CH4 to C20H42

$\Delta H_{\mathrm{sub}} = 7.724\,C_{\mathrm{nonbranched}} + 3.703$

r² = 0.959, s = 8.117 kJ/mol

If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

All Hydrocarbons: Include Aromatics

Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).

50 compounds

$\Delta H_{\mathrm{sub}} = 7.680\,C_{\mathrm{nonbranched}} + 6.185\,C_{\mathrm{aromatic}} + 4.162$

r² = 0.958, s = 7.478 kJ/mol

As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

[Figure: plot with axis "Experimental value / kJ mol⁻¹" (0 to 200) and label "aliphatic".]

All Non-Hydrogen-Bonded Molecules:

Add 106 non-hydrocarbons to the dataset.

Include elements H, C, N, O, F, S, Cl, Br & I.

156 compounds

ΔHsub predicted by 16-parameter model

r² = 0.896, s = 9.976 kJ/mol

[Figure: ΔHsub predicted by the model against experimental value / kJ mol⁻¹ (0 to 250).]

Parameters in model are counts of atom type occurrences.

General Predictive Model

Add 70 hydrogen bond forming molecules to the dataset.

226 compounds

ΔHsub predicted by 19-parameter model

r² = 0.925, s = 9.579 kJ/mol

Parameters in model are counts of atom type occurrences.

[Figure: ΔHsub predicted by the model against experimental value / kJ mol⁻¹ (0 to 250).]

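$$
\begin{aligned}
\Delta H_{\mathrm{sublimation}}\ (\mathrm{kJ\ mol^{-1}}) ={}& 6.942 + 20.141\,\mathrm{HN} + 30.172\,\mathrm{HO} + 3.127\,\mathrm{F} + 10.456\,\mathrm{Cl} \\
&+ 12.926\,\mathrm{Br} + 19.763\,\mathrm{I} + 3.297\,\mathrm{C3} - 3.305\,\mathrm{C4} + 5.970\,C_{\mathrm{aromatic}} \\
&+ 7.631\,C_{\mathrm{nonbranched}} + 7.341\,\mathrm{CO} + 19.676\,\mathrm{CS} + 11.415\,N_{\mathrm{nitrile}} \\
&+ 8.953\,N_{\mathrm{nonnitrile}} + 8.466\,\mathrm{NO} + 18.249\,O_{\mathrm{ether}} + 20.585\,\mathrm{SO} + 12.840\,S_{\mathrm{thioether}}
\end{aligned}
$$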

Predictive Model Determined by

aliphatic

All these parameters are significantly larger than their standard errors
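Evaluating the 19-parameter model is a weighted sum of atom-type counts. The sketch below uses the coefficients from the slide; the dictionary keys, the function name and the two count vectors are hypothetical, and assigning real atoms to atom types (for example, which carbons count as Cnonbranched) is assumed to have been done already.

```python
INTERCEPT = 6.942  # kJ/mol

# Coefficients in kJ/mol per atom-type occurrence, copied from the slide.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172, "F": 3.127, "Cl": 10.456, "Br": 12.926,
    "I": 19.763, "C3": 3.297, "C4": -3.305, "Caromatic": 5.970,
    "Cnonbranched": 7.631, "CO": 7.341, "CS": 19.676, "Nnitrile": 11.415,
    "Nnonnitrile": 8.953, "NO": 8.466, "Oether": 18.249, "SO": 20.585,
    "Sthioether": 12.840,
}

def predict_dh_sub(counts):
    """Predicted sublimation enthalpy (kJ/mol) from atom-type counts."""
    unknown = set(counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[t] * n for t, n in counts.items())

# Hypothetical count vectors: ten non-branched carbons, ten aromatic carbons.
print(round(predict_dh_sub({"Cnonbranched": 10}), 3))
print(round(predict_dh_sub({"Caromatic": 10}), 3))
```

Because the model is linear in the counts, each coefficient can be read directly as the contribution of one atom of that type to the sublimation enthalpy.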

Distribution of Residuals

The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.

[Figure: histogram of residuals, axis from −30 to 30.]

Validation on an Independent Test Set

35 diverse compounds

r² = 0.928, s = 7.420 kJ/mol

[Figure: plot against ΔHsub (experimental) / kJ mol⁻¹ (0 to 200).]

Nitro-compounds are often outliers.

Very encouraging result: accurate prediction possible.

Major Conclusion

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing!

Conclusions

We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol.

A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies.

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing.

Avoids need for expensive calculations.

May help predict solubility.

Model gives good chemical insight.

Solubility is an important issue in drug discovery and a major source of attrition.

This is expensive for the industry.

A good model for predicting the solubility of druglike molecules would be very valuable.

Drug Disc. Today, 10 (4), 289 (2005)

Cohesive interactions in the lattice reduce solubility

Predicting lattice (or almost equivalently sublimation) energy should help predict solubility

Classifying the WADA 2005 Prohibited List Using CDK & Unity Fingerprints

www-mitchell.ch.cam.ac.uk/

jbom1@cam.ac.uk

Ed Cannon, Andreas Bender, David Palmer & John Mitchell,

J. Chem. Inf. and Model., 46, 2369-2380 (2006)

Classifying the WADA Prohibited List

• Aims & Background.

• Methods.

• Data.

• Results.

• Conclusions.

Aims & Background

• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.

tetrahydrogestrinone (THG)

Aims & Background

• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.

• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.

WADA Prohibited Classes

• Anabolic Agents (S1)

• Hormones and Related Substances (S2)

• Beta-2-agonists (S3)

• Anti-estrogenic Agents (S4)

• Diuretics and Masking Agents (S5)

• Stimulants (S6)

• Narcotics (S7)

• Cannabinoids (S8)

• Glucocorticoids (S9)

• Alcohol (P1)

• Beta Blockers (P2)

Predicting Bioactivities

• We seek to predict whether a molecule exhibits one of these bioactivities.

• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.

Methods

Chemical Space

• Use descriptor-based fingerprints to locate molecules in chemical space.

• Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.

Machine Learning

• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.

• Random Forest.

• k-Nearest Neighbours.

Fingerprints

• CDK (Chemistry Development Kit) fingerprint.

• Unity 2D.

• MACCS key.

• MOE 2D (2004).

• Typed Atom Distance.

• Typed Graph Distance.

CDK Fingerprint

• CDK fingerprint resembles Daylight.

• All bond paths up to a length of 6 are generated.

• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
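The path-and-hash idea can be sketched without any cheminformatics toolkit. This is a toy illustration, not the CDK implementation: the molecule representation (a list of element symbols plus a bond list), the path string format and the use of `zlib.crc32` as the hashing function are all assumptions.

```python
import zlib

def path_fingerprint(atoms, bonds, max_bonds=6, n_bits=1024):
    """Return the set of bit positions set by all bond paths up to max_bonds.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, bond_symbol) tuples, e.g. (0, 1, "-")
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, symbol in bonds:
        neighbours[i].append((j, symbol))
        neighbours[j].append((i, symbol))
    bits = set()

    def extend(path, labels):
        # Use one canonical string per path so direction does not matter.
        canonical = min("".join(labels), "".join(reversed(labels)))
        bits.add(zlib.crc32(canonical.encode()) % n_bits)
        if len(path) - 1 == max_bonds:
            return
        for nxt, symbol in neighbours[path[-1]]:
            if nxt not in path:  # no atom is visited twice in a path
                extend(path + [nxt], labels + [symbol, atoms[nxt]])

    for start in range(len(atoms)):
        extend([start], [atoms[start]])  # zero-bond paths are single atoms
    return bits

ethanol = path_fingerprint(["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")])
print(sorted(ethanol))
```

Different paths can hash to the same bit, so a set bit indicates that a path may be present, not that it is.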

Unity 2D Fingerprint

• Unity is similar to CDK, but based on sub-structures rather than just paths.

• Substructures present in the molecule are enumerated.

• A hashing function is used to map these substructures onto a fingerprint of 992 bits.

Classification Algorithms

• Random Forest (RF).

• k-Nearest Neighbours (k-NN).

Random Forest

• Decision based learner.

• Based on bootstrap sample of data.

• Number of trees in forest (ntree).

• Number of descriptors tried at each node (mtry).

• Each tree predicts label of molecule.

• Majority vote = class label of molecule.

Random Forest

[Diagram: example decision tree]

A > x1
    B > x2    Decision: Yes
    B < x2    Decision: No
A < x1
    C > x3    Decision: No
    C < x3    Decision: Yes

A Random Forest contains many such trees.
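The ingredients listed for Random Forest (bootstrap samples, ntree, mtry, majority vote) fit into a short pure-Python sketch for binary fingerprints. This is a teaching illustration under simplifying assumptions (Gini impurity, a fixed depth limit, string class labels, toy data), not the implementation used in the study.

```python
import itertools
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def grow_tree(rows, labels, mtry, rng, depth=0, max_depth=5):
    """Return a leaf label (str) or a node (bit, subtree_if_set, subtree_if_unset)."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    for f in rng.sample(range(len(rows[0])), mtry):  # mtry descriptors per node
        on = [i for i, r in enumerate(rows) if r[f]]
        off = [i for i, r in enumerate(rows) if not r[f]]
        if not on or not off:
            continue
        score = (len(on) * gini([labels[i] for i in on])
                 + len(off) * gini([labels[i] for i in off])) / len(rows)
        if best is None or score < best[0]:
            best = (score, f, on, off)
    if best is None:
        return majority(labels)
    _, f, on, off = best
    grow = lambda idx: grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 mtry, rng, depth + 1, max_depth)
    return (f, grow(on), grow(off))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, if_set, if_unset = tree
        tree = if_set if row[f] else if_unset
    return tree

def grow_forest(rows, labels, ntree=100, mtry=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    return majority([predict_tree(t, row) for t in forest])  # majority vote

# Toy data: every 4-bit fingerprint, "active" when bit 0 is set.
rows = [list(bits) for bits in itertools.product([0, 1], repeat=4)]
labels = ["active" if r[0] else "inactive" for r in rows]
forest = grow_forest(rows, labels, ntree=100, mtry=3, seed=0)
print(predict_forest(forest, [1, 0, 0, 0]), predict_forest(forest, [0, 1, 1, 1]))
```

Individual trees can be wrong because each sees only a bootstrap sample and a random subset of descriptors at each node; the vote over many trees is what gives the stable prediction.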


k-Nearest Neighbours

• Instance based learner.

• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.

• k is a variable describing the number of neighbours to be considered.

• Class of x determined by majority vote of class labels of k neighbours.

• Ties broken randomly (only occurs for even k).
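The k-NN procedure in these bullets can be sketched directly. The function name and the toy two-dimensional descriptors below are illustrative assumptions; real use would pass fingerprint-derived descriptor vectors.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k=1, rng=None):
    """Classify query by majority vote among its k nearest training points."""
    rng = rng or random.Random(0)
    # Rank training points by Euclidean distance in descriptor space.
    ranked = sorted(range(len(train_x)),
                    key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    top = max(votes.values())
    tied = sorted(label for label, n in votes.items() if n == top)
    # Ties (possible for even k with two classes) are broken randomly.
    return tied[0] if len(tied) == 1 else rng.choice(tied)

# Toy descriptor space with two clusters.
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["inactive", "inactive", "active", "active"]
print(knn_predict(train_x, train_y, (5.0, 6.0), k=1))
print(knn_predict(train_x, train_y, (5.0, 6.0), k=3))
```

There is no training step: the training set itself is the model, which is what makes this an instance based learner.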

[Figure: a query point (?) among Active and Inactive molecules.]


• Local method.

• Uses only a very small number of near neighbours to make its prediction.

• Suitable for predicting activity classes with multiple clusters in chemical space.

• Therefore good for WADA classes with multiple receptors.

Performance Measure

• Matthews Correlation Coefficient:

• Range: −1 ≤ MCC ≤ 1.

• Balance between predicting positives & negatives.

$$
\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}
$$
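The Matthews Correlation Coefficient is computed from the four confusion-matrix counts (true and false positives and negatives). In the sketch below, returning 0 when the denominator vanishes is a common convention adopted here as an assumption; the counts are illustrative.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(50, 50, 0, 0))    # perfect prediction
print(mcc(25, 25, 25, 25))  # no better than chance
print(mcc(0, 0, 50, 50))    # every prediction wrong
```

Unlike simple accuracy, MCC stays informative when positives are rare, as they are when each prohibited class is tested against the rest of the dataset.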

The Dataset

• 5245 molecules (5235 for CDK).

• Molecules taken from WADA banned list and from corresponding activity classes in MDDR. 367 explicitly allowed substances.

Data by Class

WADA Class   Number of Molecules
S2           272
S3           367
S4           928
S5           1000
S6           804
S7           195
S8           1000
P2           239
Allowed      367

Fivefold Cross-validation

• We test for membership of each prohibited class separately.

• All calculations use 5-fold cv. This uses {80% molecules training set; 20% test set} repeated 5 times so that each molecule is in exactly 1 test set.
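The 80%/20% partitioning can be sketched as below. The shuffling seed and the function name are assumptions; the dataset size of 5245 is taken from the talk.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) so each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]

splits = list(five_fold_splits(5245))
for train, test in splits:
    print(len(train), len(test))
```

Predictions from the five test sets can then be pooled, so that performance measures such as MCC are computed over every molecule exactly once.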

False Positives

• False Positives arise in two ways:

• (1) A molecule predicted positive on an incorrect activity class;

• (2) An explicitly allowed molecule predicted positive.

Results

Results: Random Forest

Aggregated over 10 classes

Unity ≈ CDK > MACCS > others.

[Figure: MCC for RF for Six Fingerprints; MCC against rank out of 6 fingerprints.]

Fingerprint   MCC
Unity         0.8214
CDK           0.8136
MACCS         0.7823
TGD           0.7283
MOE           0.7172
TAD           0.5902

100 trees sufficient; little improvement with more.

[Figure: MCC as a Function of ntree in RF models for Unity; ntree from 0 to 1000.]

Results: k-Nearest Neighbours

Aggregated over 10 classes

[Figure: MCC as a Function of k in k-NN Models for Six Fingerprints; k from 0 to 20.]

Unity ≈ CDK > MACCS > others.

[Figure: MCC for k = 1 for Six Fingerprints; MCC against rank out of 6 fingerprints.]

Fingerprint   MCC
Unity         0.8363
CDK           0.8297
MACCS         0.8045
TGD           0.7404
MOE           0.6814
TAD           0.6152

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.

[Figure: MCC as a Function of k in k-NN Models for Unity; k from 0 to 20.]

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.

[Figure: MCC as a Function of k in k-NN Models for CDK; k from 0 to 20.]

Results: Comparison

Recall v Precision

Aggregated over 10 classes

[Figure: Recall v Precision for Positives.]

RF gives higher precision, k-NN higher recall.
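Precision and recall for the positive class follow from the same confusion-matrix counts used for MCC. The counts below are made up to show the trade-off and are not results from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A cautious classifier: few false positives, more missed actives.
print(precision_recall(tp=90, fp=10, fn=30))
# A permissive classifier: finds more actives at the cost of false positives.
print(precision_recall(tp=110, fp=40, fn=10))
```

A pre-filter that must not flag allowed substances would favour precision, while one that must not miss prohibited ones would favour recall.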

Results: Comparison

Analysed by class

Classes vary in difficulty of prediction; independent of classification algorithm.

[Figure: MCC by Class for Random Forest Default and k-NN (k = 1) Models; MCC from 0 to 1 for classes S1, S2, S3, S4, S5, S6, S7, S8, S9 and P2.]

Conclusions

Major Conclusion

• Can use Informatics to predict whether or not a molecule exhibits a prohibited bioactivity.

Conclusions

• Can successfully predict active molecules (MCC ≈ 0.83).

• Unity ≈ CDK > MACCS > others.

• RF & k-NN give similar MCC.

• k-NN higher recall.

• RF higher precision; RF less likely to find false positives.

Conclusions

• RF results vary little with ntree.

• k-NN results best for k = 1.

• Performance decreases at higher k.

• Odd k avoids problems with ties (k = 2 is worse than k = 3).

• Activity classes show consistent prediction difficulty pattern.

www-mitchell.ch.cam.ac.uk/

jbom1@cam.ac.uk

Acknowledgements: People

Carole Ouvrard, Ed Cannon, David Palmer,

Florian Nigsch, Chrysi Kirtay, Laura Hughes,

Jo Bailey, Noel O’Boyle, Daniel Almonacid,

Gemma Holliday, Jen Ryder,

Dushy Puvanendrampillai, Andreas Bender.

A¢know£€dg€m€nt$: Funding

Unilever

[Figure: MCC as a Function of Class Size (0 to 1200) for Unity, MACCS, MOE, TAD and TGD.]

No significant correlation overall; though smallest class S9 is hardest to predict.

[Figure: MCC as a Function of Intra Class Mean Tanimoto Score (0.00 to 0.60) for Unity, MACCS, MOE, TAD and TGD.]

tetrahydrogestrinone (THG)

gestrinone

trenbolone
