1 modelling in chemistry: high and low-throughput regimes dr john mitchell unilever centre for...

Post on 03-Jan-2016


1

Modelling in Chemistry: High and Low-Throughput Regimes

Dr John Mitchell

Unilever Centre for Molecular Science Informatics

Department of Chemistry

University of Cambridge, U.K.


7

We look at data, analyse data, use data to find correlations ...

... to develop models ...

... and to make (hopefully) useful predictions.

Let’s look at some data ...

8

New York Times, 4th October 2005.

9

Happiness ≈ (GNP/$5000) − 1

Poor fit to linear model

10

(GNP/$5000) -2

Outliers?

Happiness

11

Fitting with a curve: reduce RMSE

12

Outliers?

Different linear models for different regimes

13

Only one obvious (to me) conclusion

This area is empty: no country is both rich and unhappy. All other combinations are observed.

Happiness (GNP/$5000) -2

14

... but what is the connection with chemistry?

15

Modelling in Chemistry

Density Functional Theory

ab initio

Molecular Dynamics

Monte Carlo

Docking

PHYSICS-BASED

EMPIRICAL

ATOMISTIC

Car-Parrinello

NON-ATOMISTIC

DPD

CoMFA

2-D QSAR/QSPR

Machine Learning

AM1, PM3 etc.

Fluid Dynamics

16

Density Functional Theory

ab initio

Molecular Dynamics

Monte Carlo

Docking

Car-Parrinello

DPD

CoMFA

2-D QSAR/QSPR

Machine Learning

AM1, PM3 etc.

HIGH THROUGHPUT

LOW THROUGHPUT

Fluid Dynamics

17

Density Functional Theory

ab initio

Molecular Dynamics

Monte Carlo

Docking

Car-Parrinello

DPD

CoMFA

2-D QSAR/QSPR

Machine Learning

AM1, PM3 etc.

INFORMATICS

THEORETICAL CHEMISTRY

NO FIRM BOUNDARIES!

Fluid Dynamics


19

Theoretical Chemistry

• Calculations and simulations based on real physics.

• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.

• Attempt to model or simulate reality.

• Usually Low Throughput.

20

Informatics and Empirical Models

• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.

• Inputs and output are related by an empirically parameterised equation or a more elaborate mathematical model.

• Do not attempt to simulate reality.

• Usually High Throughput.

21

QSPR

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable

• Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s).

• Property-property relationships date from the 1860s.

• General form (for non-linear relationships): y = f (descriptors)

22

QSPR

              Y            X1   X2   X3   X4   X5   X6
Molecule 1    Property 1   –    –    –    –    –    –
Molecule 2    Property 2   –    –    –    –    –    –
Molecule 3    Property 3   –    –    –    –    –    –
Molecule 4    Property 4   –    –    –    –    –    –
Molecule 5    Property 5   –    –    –    –    –    –
Molecule 6    Property 6   –    –    –    –    –    –
Molecule 7    Property 7   –    –    –    –    –    –

Y = f (X1, X2, ... , XN )

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.

• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
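As a minimal sketch of "optimise on training molecules, then test on unseen molecules", the simplest case of Y = f(X) is a straight line in a single descriptor fitted by ordinary least squares. The function name `fit_line` and all numbers below are made up for illustration; they are not data from the talk.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x, using the closed-form solution."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    return a, b

# Synthetic descriptor/property values (hypothetical, for illustration only)
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]
test_x = [6.0, 7.0]
test_y = [13.1, 14.8]

# Regression uses the training molecules only
a, b = fit_line(train_x, train_y)

# The held-out test molecules are used only to judge predictive quality
test_pred = [a + b * xi for xi in test_x]
test_errors = [yo - yp for yo, yp in zip(test_y, test_pred)]
```

With several descriptors the same idea becomes multi-linear regression; the split into training and test molecules is unchanged.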

23

QSPR

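• Quality of the model is judged by three parameters:

$$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}$$

$$r^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y^{\mathrm{average}}\right)^2}$$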
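The three quality measures on this slide (bias, RMSE and r²) can be computed in a few lines. This is a sketch that assumes the sign convention "observed minus predicted" for the bias; the function name `model_quality` is illustrative.

```python
import math

def model_quality(y_obs, y_pred):
    """Return (bias, rmse, r2) for paired observed and predicted values."""
    n = len(y_obs)
    residuals = [o - p for o, p in zip(y_obs, y_pred)]
    bias = sum(residuals) / n
    sum_sq_res = sum(r ** 2 for r in residuals)
    rmse = math.sqrt(sum_sq_res / n)
    y_average = sum(y_obs) / n
    sum_sq_tot = sum((o - y_average) ** 2 for o in y_obs)
    r2 = 1.0 - sum_sq_res / sum_sq_tot
    return bias, rmse, r2
```

Note that r² defined this way can be negative for a test set, which happens when the model predicts worse than simply quoting the average observed value.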

24

QSPR

• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

25

QSPR

• However, this does not guarantee a good predictive model ...

26

QSPR

• Problems with experimental error.

• QSPR is only as accurate as the data it is trained upon.

• Therefore, we need accurate experimental data.

27

QSPR

• Problems with “chemical space”.

• “Sample” molecules must be representative of “Population”.

• Prediction results will be most accurate for molecules similar to training set.

• Global or Local models?

28

Relationship of Chemical Structure

With Lattice Energy

Can we predict lattice energy from 2D molecular

structure?

Dr Carole Ouvrard & Dr John Mitchell

Unilever Centre for Molecular Informatics

University of Cambridge

C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)

29

Why Do We Need a Predictive Model?

Existing techniques from Theoretical Chemistry can give us accurate sublimation and lattice energies ...

... but only in very low throughput.

30

Why Do We Need a Predictive Model?

A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials

From 2-D molecular structure only

Without knowing the crystal packing

Without expensive theoretical calculations

Should help predict solubility.

31

Why Do We Think it Will Work?

Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.

Many molecules have several different experimentally observable polymorphs.

We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.

32

[Figure: calculated lattice energy (kJ/mol, axis from -92.0 to -98.0) against density (g/cc, axis from 1.40 to 1.60) for candidate crystal packings in space groups P-1, P21/c, P212121 and P21. The calculated lowest energy structure and the experimental crystal structure are marked.]

33

Expression for the Lattice Energy

U crystal = U molecule + U lattice

Theoretical lattice energy

– Crystal binding = Cohesive energy

Experimental lattice energy is related to −ΔH sublimation

ΔH sublimation = −U lattice − 2RT (Gavezzotti & Filippini)
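The relation between sublimation enthalpy and lattice energy on this slide can be rearranged to estimate a lattice energy from a measured sublimation enthalpy. The sketch below assumes energies in kJ/mol and temperature in kelvin; the function name and the default temperature of 298.15 K are illustrative choices, not taken from the talk.

```python
R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ mol^-1 K^-1

def lattice_energy_from_sublimation(h_sub_kj_mol, temperature_k=298.15):
    """Rearrange H_sub = -U_lattice - 2RT to give U_lattice = -H_sub - 2RT."""
    return -h_sub_kj_mol - 2.0 * R_KJ_PER_MOL_K * temperature_k

# Hypothetical input: a sublimation enthalpy of 100 kJ/mol at 298.15 K
u_lattice = lattice_energy_from_sublimation(100.0)
```

At room temperature the 2RT term is about 5 kJ/mol, so lattice and sublimation energies are almost interchangeable in magnitude.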

34

Partitioning of the Lattice Energy

U crystal = U molecule + U lattice

ΔH sublimation = −U lattice − 2RT

Partitioning the lattice energy in terms of structural contributions

Choice of the significant parameters

– number of atoms of each type?

– Number of rings, aromatics?

– Number of bonds of each type?

– Symmetry?

– Hydrogen bond donors and acceptors? Intramolecular?

We choose counts of atom type occurrences.

35

Analysis of the Sublimation Energy Data

Experimental data: ΔHsublimation

– NIST (National Institute of Standards and Technology, USA)

– Scientific literature

Atom Types

– SATIS codes: 10-digit connectivity code + bond types

– Each 2-digit code = atomic number

HN     01 07 99 99 99
HO     01 08 99 99 99
O=C    08 06 99 99 99
-O-    08 06 06 99 99

Typically, several similar SATIS codes are grouped to define an atom type.

Statistical analysis

– Multi-Linear Regression Analysis

– ΔHsub as a function of the number of atoms of each type
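To make the SATIS connectivity code concrete, the sketch below splits a 10-digit code into its five 2-digit atomic numbers: the central atom followed by up to four neighbours. Treating 99 as an empty neighbour slot is an assumption inferred from the examples on the slide, and the function name `parse_satis` is hypothetical. Bond types are not handled.

```python
def parse_satis(code):
    """Split a 10-digit SATIS connectivity code into (central Z, neighbour Zs)."""
    digits = code.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit SATIS connectivity code")
    numbers = [int(digits[i:i + 2]) for i in range(0, 10, 2)]
    central = numbers[0]
    # Assumption: 99 marks an unused neighbour slot
    neighbours = [z for z in numbers[1:] if z != 99]
    return central, neighbours

# Examples from the slide: H bonded to N, and an ether-type O bonded to two C
hn = parse_satis("01 07 99 99 99")
ether_o = parse_satis("08 06 06 99 99")
```

Counting how often each grouped atom type occurs in a molecule then gives the descriptor values used in the regression.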

36

Training Dataset of Model Molecules

226 organic compounds

19 linear alkanes (19)

14 branched alkanes (33)

17 aromatics (50)

106 other non-H-bonders (156)

70 H-bond formers (226)

Non-specific interacting

– Hydrocarbons

– Nitrogen compounds

– Nitro-, CN, halogens,

– S, Se substituents

– Pyridine

Potential hydrogen bonding interactions

– Amides

– Carboxylic acids

– Amino acids…

[Figure: experimental ΔHsublimation / kJ mol-1 (0 to 200) against no. of C, N, O atoms (0 to 25). Series: amides, diamides, acids, diacid, amino acids, alkanes. Valine is highlighted with its structure drawn.]

37

Study of Non-specific Interactions: Linear Alkanes

19 compounds: CH4 to C20H42

Limit for van der Waals interactions

ΔHsub = 7.955 C − 2.714

r2 = 0.977

s = 7.096 kJ/mol

[Figure: boiling point / °C (BPt) and ΔHsub / kJ mol-1 plotted against no. of carbon atoms.]

Note odd-even variation in ΔHsub for this series.

Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.

38

Include Branched Alkanes

Add 14 branched alkanes to dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents.

[Figure: ΔHsub / kJ mol-1 (0 to 200) against no. of carbon atoms (0 to 25). Labelled points: C(CH3)4 and (C(CH3)3)3CH.]

33 compounds: CH4 to C20H42

ΔHsub = 7.724 Cnonbranched + 3.703

r2 = 0.959

s = 8.117 kJ/mol

If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

39

All Hydrocarbons: Include Aromatics

Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).

50 compounds

ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162

r2= 0.958 s = 7.478 kJ/mol

As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

aliphatic

[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both axes 0 to 200.]

40

All Non-Hydrogen-Bonded Molecules:

Add 106 non-hydrocarbons to the dataset.

Include elements H, C, N, O, F, S, Cl, Br & I.

156 compounds

ΔHsub predicted by 16 parameter model

r2 = 0.896

s = 9.976 kJ/mol

[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both axes 0 to 250.]

Parameters in model are counts of atom type occurrences.

41

General Predictive Model

Add 70 hydrogen bond forming molecules to the dataset.

226 compounds

ΔHsub predicted by 19 parameter model

r2= 0.925 s = 9.579 kJ/mol

Parameters in model are counts of atom type occurrences.

[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both axes 0 to 250.]

42

Predictive Model Determined by MLRA

ΔHsublimation (kJ mol-1) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 – 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether

aliphatic

All these parameters are significantly larger than their standard errors
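The fitted equation is simple enough to evaluate directly once the atom type counts are known. The sketch below copies the coefficients from the slide; the function name `predict_hsub` is illustrative, and assigning atoms to types (which the talk does by grouping SATIS codes) is not implemented here, so the counts must be supplied by the caller.

```python
INTERCEPT = 6.942  # kJ/mol

# Coefficients in kJ/mol per atom type occurrence, as given on the slide
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172, "F": 3.127, "Cl": 10.456, "Br": 12.926,
    "I": 19.763, "C3": 3.297, "C4": -3.305, "Caromatic": 5.970,
    "Cnonbranched": 7.631, "CO": 7.341, "CS": 19.676, "Nnitrile": 11.415,
    "Nnonnitrile": 8.953, "NO": 8.466, "Oether": 18.249, "SO": 20.585,
    "Sthioether": 12.840,
}

def predict_hsub(atom_type_counts):
    """Estimate the sublimation enthalpy (kJ/mol) from counts of atom types."""
    total = INTERCEPT
    for atom_type, count in atom_type_counts.items():
        if atom_type not in COEFFICIENTS:
            raise KeyError(f"unknown atom type: {atom_type}")
        total += COEFFICIENTS[atom_type] * count
    return total

# Hypothetical input: six aromatic carbons and no other counted atom types
example = predict_hsub({"Caromatic": 6})
```

Because the model is a plain sum over atom type counts, each coefficient can be read as the approximate contribution of one atom of that type to the sublimation enthalpy.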

43

Distribution of Residuals

The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.

[Figure: histogram of residuals (-30 to 30) against no. of observations (0 to 60).]

44

Validation on an Independent Test Set

35 diverse compounds

r2 = 0.928

s = 7.420 kJ/mol

[Figure: ΔHsub (predicted) / kJ mol-1 against ΔHsub (experimental) / kJ mol-1, both axes 0 to 200. An outlier is drawn as a structure with one CH3 and three NO2 groups.]

Nitro-compounds are often outliers.

Very encouraging result: accurate prediction possible.

45

Major Conclusion

Lattice energy can be predicted from 2D

structure, without knowing the details of the

crystal packing!

46

Conclusions

We have determined a general equation allowing us to estimate

the sublimation enthalpy for a large range of organic compounds

with an estimated error of 9 kJ/mol.

A very simple model (counts of atom types) gives a good

prediction of lattice & sublimation energies.

Lattice energy can be predicted from 2D structure, without

knowing the details of the crystal packing.

Avoids need for expensive calculations.

May help predict solubility.

Model gives good chemical insight.

47

Solubility is an important issue in drug discovery and a major source of attrition.

This is expensive for the industry.

A good model for predicting the solubility of druglike molecules would be very valuable.

48

Drug Disc. Today, 10 (4), 289 (2005)

Cohesive interactions in the lattice reduce solubility.

Predicting lattice (or, almost equivalently, sublimation) energy should help predict solubility.

49

Classifying the WADA 2005 Prohibited List Using CDK & Unity Fingerprints

www-mitchell.ch.cam.ac.uk/

jbom1@cam.ac.uk

Ed Cannon, Andreas Bender, David Palmer & John Mitchell,

J. Chem. Inf. and Model., 46, 2369-2380 (2006)

50

Classifying the WADA Prohibited List

• Aims & Background.
• Methods.
• Data.
• Results.
• Conclusions.

51

Aims & Background

52

Aims & Background

• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.

tetrahydrogestrinone (THG)

53

Aims & Background

• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.

• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.

54

WADA Prohibited Classes

• Anabolic Agents (S1)
• Hormones and Related Substances (S2)
• Beta-2-agonists (S3)
• Anti-estrogenic Agents (S4)
• Diuretics and Masking Agents (S5)
• Stimulants (S6)
• Narcotics (S7)
• Cannabinoids (S8)
• Glucocorticoids (S9)
• Alcohol (P1)
• Beta Blockers (P2)

55

Predicting Bioactivities

• We seek to predict whether a molecule exhibits one of these bioactivities.

• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.

56

Methods

57

Chemical Space

• Use descriptor-based fingerprints to locate molecules in chemical space.

• Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.

58

Machine Learning

• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.

• Random Forest.

• k-Nearest Neighbours.

59

Fingerprints

• CDK (Chemistry Development Kit) fingerprint.

• Unity 2D.
• MACCS key.
• MOE 2D (2004).
• Typed Atom Distance.
• Typed Graph Distance.

60

CDK Fingerprint

• CDK fingerprint resembles Daylight.

• All bond paths up to a length of 6 are generated.

• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
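The idea of a hashed path fingerprint can be sketched on a toy molecular graph. This is not the CDK implementation: the function name, the use of `zlib.crc32` as the hashing function and the path encoding are all assumptions made for illustration, and bond orders, aromaticity and hydrogens are ignored.

```python
import zlib

def path_fingerprint(atoms, bonds, max_len=6, n_bits=1024):
    """Toy hashed fingerprint: set one bit per linear atom path of up to max_len bonds.

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    Returns the set of bit positions that are switched on.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    bits = set()

    def walk(path):
        symbols = [atoms[k] for k in path]
        # A path and its reverse describe the same feature, so use one canonical key
        key = min("-".join(symbols), "-".join(reversed(symbols)))
        bits.add(zlib.crc32(key.encode()) % n_bits)
        if len(path) - 1 < max_len:          # path length is counted in bonds
            for nxt in neighbours[path[-1]]:
                if nxt not in path:          # no atom is visited twice
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return bits

# Heavy-atom skeleton C-C-O, written in two different atom orders
fp_a = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = path_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
```

Hashing means that different paths can collide on the same bit, so a set bit indicates that a path is probably present, not that it definitely is.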

61

Unity 2D Fingerprint

• Unity is similar to CDK, but based on sub-structures rather than just paths.

• Substructures present in the molecule are enumerated.

• A hashing function is used to map these substructures onto a fingerprint of 992 bits.

62

Classification Algorithms

• Random Forest (RF).

• k-Nearest Neighbours (k-NN).

63

Random Forest

• Decision based learner.
• Based on bootstrap sample of data.
• Number of trees in forest (ntree).
• Number of descriptors tried at each node (mtry).
• Each tree predicts label of molecule.
• Majority vote = class label of molecule.
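The ingredients listed above (bootstrap sampling, ntree, mtry and a majority vote) can be sketched with a deliberately simplified forest in which every tree is a single-split stump. This is a stand-in for illustration, not a full Random Forest: real implementations grow deep trees and draw a fresh set of mtry descriptors at every node. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, features):
    """Pick the single threshold split (over the candidate features) with the best training accuracy."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            correct = sum(lab == l_lab for lab in left) + sum(lab == r_lab for lab in right)
            if best is None or correct > best[0]:
                best = (correct, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def train_forest(X, y, ntree=100, mtry=1, seed=0):
    rng = random.Random(seed)
    n, n_feat = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample, drawn with replacement
        feats = rng.sample(range(n_feat), mtry)      # descriptors tried at the node
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Each tree votes for a label; the majority vote is the predicted class."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy two-descriptor data, made up for illustration
X = [[0.1, 1.0], [0.2, 1.5], [0.8, 3.0], [0.9, 3.5]]
y = ["inactive", "inactive", "active", "active"]
forest = train_forest(X, y, ntree=25, mtry=1, seed=0)
```

Individual trees trained on an unlucky bootstrap sample can be wrong; the vote across many trees is what makes the ensemble robust.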

64

Random Forest

Node

A > x1 A < x1

B > x2 B < x2 C > x3 C < x3

Decision: Yes No No Yes

A Random Forest contains many such trees.


66

k-Nearest Neighbours

• Instance based learner.

• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.

• k is a variable describing the number of neighbours to be considered.

• Class of x determined by majority vote of class labels of k neighbours.

• Ties broken randomly (only occurs for even k).
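The k-NN procedure described in these bullets fits in a short function. This is a sketch with descriptors as plain numeric tuples; the function name, the seeded random generator used for tie-breaking and the toy data are illustrative assumptions.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k=1, rng=None):
    """Classify query by majority vote among its k nearest training points."""
    rng = rng or random.Random(0)
    # Euclidean distance in descriptor space from the query to every training point
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    labels = [train_y[i] for _, i in dists[:k]]
    counts = Counter(labels)
    top = max(counts.values())
    winners = sorted(lab for lab, c in counts.items() if c == top)
    # More than one winner means a tie, which is broken randomly
    return rng.choice(winners)

# Toy data, made up for illustration
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["inactive", "inactive", "active", "active"]
label_k1 = knn_predict(train_X, train_y, (5.0, 6.0), k=1)
label_k3 = knn_predict(train_X, train_y, (5.0, 6.0), k=3)
```

For binary fingerprints a similarity measure such as Tanimoto is a common alternative to Euclidean distance, but the voting logic is the same.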

67

k-Nearest Neighbours

[Figure: a query point (?) among Active and Inactive training points.]


69

k-Nearest Neighbours

• Local method.

• Uses only a very small number of near neighbours to make its prediction.

• Suitable for predicting activity classes with multiple clusters in chemical space.

• Therefore good for WADA classes with multiple receptors.

70

Performance Measure

• Matthews Correlation Coefficient:

$$\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$$

• Range: −1 ≤ MCC ≤ 1.
• Balance between predicting positives & negatives.
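The Matthews Correlation Coefficient is computed from the four cells of the confusion matrix (true and false positives and negatives). In this sketch, returning 0.0 when the denominator is zero is a common convention adopted here as an assumption; the function name is illustrative.

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator
```

A value of 1 means every molecule is classified correctly, 0 is no better than chance and −1 means every prediction is wrong, which is why MCC is preferred over plain accuracy when the active and inactive classes differ greatly in size.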

71

Data

72

The Dataset

• 5245 molecules (5235 for CDK).

• Molecules taken from WADA banned list and from corresponding activity classes in MDDR. 367 explicitly allowed substances.

73

Data by Class

WADA Class Number of Molecules

S1 47

S2 272

S3 367

S4 928

S5 1000

S6 804

S7 195

S8 1000

S9 26

P2 239

Allowed 367

74

Fivefold Cross-validation

• We test for membership of each prohibited class separately.

• All calculations use 5-fold cross-validation (cv). Each round uses 80% of the molecules as the training set and the remaining 20% as the test set; this is repeated 5 times so that each molecule is in exactly one test set.
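A minimal sketch of the fivefold split: shuffle the molecule indices once, deal them into five folds and use each fold in turn as the test set. The function name and the fixed seed are illustrative choices.

```python
import random

def five_fold_splits(n_molecules, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) so each molecule is tested exactly once."""
    indices = list(range(n_molecules))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

splits = list(five_fold_splits(23))
```

When the number of molecules is not divisible by five, the folds differ in size by at most one molecule.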

75

False Positives

• False Positives arise in two ways:

• (1) A molecule predicted positive for an incorrect activity class;

• (2) An explicitly allowed molecule predicted positive.

76

Results

77

Results: Random Forest

Aggregated over 10 classes

78

MCC for RF for Six Fingerprints

[Figure: MCC (0.5 to 0.9) against rank out of 6 fingerprints.]

Fingerprint    MCC
Unity          0.8214
CDK            0.8136
MACCS          0.7823
TGD            0.7283
MOE            0.7172
TAD            0.5902

Unity ≈ CDK > MACCS > others.

79

MCC as a Function of ntree in RF models for Unity

[Figure: MCC (0.6 to 0.9) against ntree (0 to 1000) for Unity.]

100 trees sufficient; little improvement with more.

80

Results: k-Nearest Neighbours

Aggregated over 10 classes

81

MCC as a Function of k in k-NN Models for Six Fingerprints

[Figure: MCC (0.4 to 0.9) against k (1 to 20) for Unity, MACCS, MOE, TAD, TGD and CDK.]

82

MCC for k = 1 for Six Fingerprints

[Figure: MCC (0.6 to 0.9) against rank out of 6 fingerprints.]

Fingerprint    MCC
Unity          0.8363
CDK            0.8297
MACCS          0.8045
TGD            0.7404
MOE            0.6814
TAD            0.6152

Unity ≈ CDK > MACCS > others.

83

MCC as a Function of k in k-NN Models for Unity

[Figure: MCC (0.6 to 0.9) against k (1 to 20) for Unity.]

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.

84

MCC as a Function of k in k-NN Models for CDK

[Figure: MCC (0.6 to 0.9) against k (1 to 20) for CDK.]

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.

85

Results: Comparison

Recall v Precision

Aggregated over 10 classes

Recall Precision

86

Recall v Precision for Positives

[Figure: precision (60 to 100) against recall (30 to 90) for the Unity, MACCS and CDK fingerprints, with RF and k-NN models marked separately.]

RF gives higher precision, k-NN higher recall.
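For reference, precision and recall for the positive class follow from the same confusion-matrix counts used for MCC. This sketch uses the standard definitions; returning 0.0 for an empty denominator is an assumed convention, and the function name is illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

High precision means few of the molecules flagged as prohibited are false alarms; high recall means few truly active molecules are missed.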

87

Results: Comparison

Analysed by class

88

MCC by Class for Random Forest Default and k-NN (k = 1) Models

[Figure: MCC (0 to 1) for each class, S1 to S9 and P2, for RF and k-NN.]

Classes vary in difficulty of prediction; independent of classification algorithm.

89

Conclusions

90

Major Conclusion

• Can use Informatics to predict whether or not a molecule exhibits a prohibited bioactivity.

91

Conclusions

• Can successfully predict active molecules (MCC ≈ 0.83).

• Unity ≈ CDK > MACCS > others.

• RF & k-NN give similar MCC.

• k-NN higher recall.

• RF higher precision; RF less likely to find false positives.

92

Conclusions

• RF results vary little with ntree.

• k-NN results best for k = 1.

• Performance decreases at higher k.

• Odd k avoids problems with ties (k = 2 is worse than k = 3).

• Activity classes show consistent prediction difficulty pattern.

93

www-mitchell.ch.cam.ac.uk/

jbom1@cam.ac.uk

94

Acknowledgements: People

Carole Ouvrard, Ed Cannon, David Palmer,

Florian Nigsch, Chrysi Kirtay, Laura Hughes,

Jo Bailey, Noel O’Boyle, Daniel Almonacid,

Gemma Holliday, Jen Ryder,

Dushy Puvanendrampillai, Andreas Bender.

95

A¢know£€dg€m€nt$: Funding

Unilever

96

MCC as a Function of Class Size

[Figure: MCC (0 to 1) against class size (0 to 1200) for Unity, MACCS, MOE, TAD and TGD.]

No significant correlation overall; though smallest class S9 is hardest to predict.

97

MCC as a Function of Intra Class Mean Tanimoto Score

[Figure: MCC (0 to 1) against intra class mean Tanimoto score (0.00 to 0.60) for Unity, MACCS, MOE, TAD and TGD.]

98

tetrahydrogestrinone (THG)

gestrinone

trenbolone
