Prediction Models Based on Classifying Compounds by Structural Features Chihae Yang, Kevin Cross, Paul Blower, Glenn Myatt Leadscope, Inc. 3rd Joint Conference Sheffield Conference on Chemoinformatics

Post on 20-Aug-2020

Prediction Models Based on Classifying Compounds by Structural Features

Chihae Yang, Kevin Cross, Paul Blower, Glenn Myatt
Leadscope, Inc.

3rd Joint Conference Sheffield Conference on Chemoinformatics

Objectives

1. Illustrate a strategy for building transparent models
2. Build models predicting pIC50 for a published test set and compare results with a published CoMFA study
3. Confirm the importance of macrostructures in a molecular descriptor set for predicting activity

Issues to Consider When Building Predictive Models

• Data
  – Size and distribution
  – Quality
  – Availability
• Structure
  – Diversity
  – Mechanistic complexity
• Interpretation
  – Chemical transparency

Structure-Data Surface

[Figure: structure-data surface. Axes, each running from − to +: structure diversity, mechanism complexity, data size. Factors shown: data distribution (active/inactive), data size, mechanistic complexity, structural diversity.]

Assessment of 4 Factors in the Structure-Data Surface

• Local models
  – Structurally similar
  – Mechanistically homogenous
  – Reasonable size
  – Data distribution
• Global models
  – Structurally dissimilar
  – Mechanistically complex
  – Large number of data points
  – Data distribution - balance in actives and inactives

Example of Local Dataset: PTP-1B Inhibitors

• Protein Tyrosine Phosphatase (PTP-1B) is a therapeutic target for the treatment of diabetes, obesity and cancer
• Dataset of 118 compounds from a literature study [1]
• SAR analysis identified two active classes
• Comparison with a literature CoMFA study [2]

[1] Malamas, M.S., et al.; J. Med. Chem. 2000, 43, 1293-1310 (Wyeth-Ayerst)
[2] Murthy, V.S., et al.; Bioorganic & Medicinal Chemistry 2002, 10, 2267-2282.

Modeling Strategy

1. Diagnose the data set
2. Assemble discriminating macrostructures
3. Select descriptors – features and properties
4. Build predictive models
5. Evaluate the model with chemical inference
6. Rebuild the model with a refined feature set

1. Diagnoses of the PTP1B Dataset

• The dataset is the 118 structures published by Murthy et al.
• The training and test sets were partitioned as published
  – 92-compound training set
  – 26-compound test set
• The 26-compound test set contains a higher proportion of actives

1. Diagnoses of the PTP1B Dataset

• Assessing similarity of training and test sets
  – chemical space of the test set must lie within the training set
  1. Grouping by chemical class
  2. Diversity analysis
  3. Similarity by Sammon map (feature based)
  4. Feature similarity within the test set and between the test and training set
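One way to check that the test set lies within the chemical space of the training set is to compute, for each test compound, its highest feature similarity to any training compound. The sketch below assumes each compound is represented as a set of structural feature names and uses the Tanimoto coefficient. The feature names, the 0.3 cutoff, and the choice of Tanimoto are illustrative assumptions, not details taken from the slides.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(test, training):
    """For each test compound, the highest similarity to any training compound."""
    return [max(tanimoto(t, tr) for tr in training) for t in test]


# Hypothetical feature sets (names are placeholders, not real descriptors).
training = [
    {"benzofuran", "carboxylic_acid", "sulfone"},
    {"benzothiophene", "carboxylic_acid"},
    {"benzofuran", "phenol"},
]
test = [
    {"benzofuran", "carboxylic_acid"},
    {"tetrazole", "pyridine"},
]

nn_sims = nearest_training_similarity(test, training)

# Flag test compounds whose best match falls below an assumed cutoff.
outside = [i for i, s in enumerate(nn_sims) if s < 0.3]
print(nn_sims, outside)
```

A test compound flagged here shares few features with every training compound, so its prediction would be an extrapolation.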

Structural Diversity – 92 chemicals

[Figure: cluster dendrogram over 20 numbered groups (vertical axis: distance, 0.5 to 0.8), with example structure drawings.]

Chemical class groupings in the data

Major classes                  % frequency
                               92-training set   26-test set   19-unknown set
benzofuran                     53                61            63
benzothiophene                 32                35            5.3
oxazole                        7.6               3.8           21
pyridine                       2.2               0             5.3
tetrazole                      3.3               3.8           0
(structure drawing)            10.9              19.2          0
(structure drawing)            10.9              7.7           10.5
(structure drawing)            63                62            47
(structure drawing)            15                7.7           37

Diversity Analysis of PTP1B through Multiple Subset Extraction

[Figure: Structural Diversity of PTP1B. Horizontal axis: subset size (percent), 0.00 to 0.20; vertical axis: coverage (percent), 0.00 to 1.00.]

Similarity of Training and Test Sets

118 benzothiophenes and furans (92 training set, 26 test set); 19 test set (benzimidazoles, oxazoles, etc.)

[Figure: three-axis map of the 92C, 26C, and 19C sets.]

26-test set correlation: 97%
19-test set correlation: 68%

Characterization of PTP-1B Inhibitors

• Activity distribution is not Gaussian
• Good balance between actives and inactives
• Data size is small
• Mechanistically homogenous
• Structurally similar

[Figure: position of the dataset on the structure-data surface (axes: structure diversity, data size, mechanism complexity).]

[Figure: pIC50 probability distributions for the 92-train and 26-test sets, with active and inactive regions marked. Horizontal axis: pIC50, 0.0 to 1.5; vertical axis: probability, 0.00 to 0.25.]

2. Assembly of Macrostructures

• 509 structural features describing the 92 training compounds were automatically extracted from over 27,000 features

• 71 macrostructures with discriminating activity were assembled

• 580 total features plus 8 properties were available for modeling

Advantages of Macro-Structure Assembly

• MSAs are actual substructures that are easily interpreted
• Connectivity of individual features is explicitly represented
• The assembly process is supervised by selected biological response
• MSAs are chemically relevant
• Macrostructure assembly is computationally feasible for “larger” structure sets

Macrostructure Assemblies

MSA      t      PLS wt   Mean (92)   Mean (26)
MSA 1    14.7   0.076    1.57        1.32
MSA 2    13.9   0.12     1.52        1.57
MSA 3    9.7    0.063    1.30        1.34
MSA 4    9.3    0.050    1.34        1.29
MSA 6    7.9    0.052    1.32        1.28
MSA 10   5.4    0.10     0.75        0.97
MSA 13   4.9    0.090    0.70        0.97
MSA 19   3.7    0.056    0.82        0.86
MSA 26   -5.9   0.048    0.0023      0.45
MSA 28   -6.4   0.096    0.16        0.13

[Figure: structure drawings of the macrostructure assemblies.]

3. Pre-selection of Structural Features

• Test significance - T2 test

• Of 150 influential features, 41 were MSAs
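The slide does not give the details of the significance test, so the following is only a sketch of the general idea: score each structural feature by how strongly its presence separates activity values, then keep the top G features. It uses Welch's two-sample t statistic as a stand-in. The feature names, activity values, and function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in mean activity of two groups."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    diff = mean(group_a) - mean(group_b)
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def preselect_features(features, activity, top_g):
    """Rank features by |t| and return the top_g (name, t) pairs.

    features maps a feature name to a list of 0/1 flags, one per compound.
    """
    scores = {}
    for name, flags in features.items():
        present = [y for f, y in zip(flags, activity) if f]
        absent = [y for f, y in zip(flags, activity) if not f]
        if len(present) < 2 or len(absent) < 2:
            continue  # a sample variance needs at least two values per group
        scores[name] = welch_t(present, absent)
    ranked = sorted(scores, key=lambda k: abs(scores[k]), reverse=True)
    return [(name, scores[name]) for name in ranked[:top_g]]


# Hypothetical pIC50 values and feature flags for six compounds.
activity = [1.5, 1.4, 1.6, 0.2, 0.3, 0.1]
features = {
    "msa_a": [1, 1, 1, 0, 0, 0],   # present only in the more active compounds
    "msa_b": [0, 0, 0, 1, 1, 1],   # present only in the less active compounds
    "ring_c": [1, 0, 1, 0, 1, 0],  # spread across both groups
}
selected = preselect_features(features, activity, top_g=2)
print(selected)
```

A positive t marks a feature associated with higher activity and a negative t one associated with lower activity, which matches the signed t values reported for the MSAs.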

Property Descriptors Available

• aLogP
• Polar surface area
• Hydrogen bond donors and acceptors
• Molecular weight
• Rotatable bonds
• Lipinski scores
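The slides do not define the Lipinski score. One common reading is the number of rule-of-five violations, which can be computed directly from four of the properties listed above. The sketch below assumes that reading; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Per-compound property values (field names are illustrative)."""
    mol_weight: float
    alogp: float
    h_donors: int
    h_acceptors: int


def lipinski_violations(p):
    """Count rule-of-five violations: MW > 500, logP > 5, donors > 5, acceptors > 10."""
    return sum([
        p.mol_weight > 500,
        p.alogp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


# Hypothetical compound: only the logP criterion is violated.
example = Properties(mol_weight=432.5, alogp=5.6, h_donors=1, h_acceptors=5)
score = lipinski_violations(example)
print(score)
```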

4. Model Building

• Multivariate Least Squares
• Principal Component Regression
• k-nearest neighbors
• Partial Least Squares
• Neural Networks

Cross-validation of the training set

[Figure: three panels showing Q2, mean R2, and Scv, each plotted against the number of PLS factors (horizontal axis) and the number of pre-selected features (vertical axis).]

$$S_{cv} = \sqrt{\frac{\mathrm{PRESS}}{92 - \#\,\text{PLS factors} - 1}}$$
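As a rough illustration of leave-one-out cross-validation, the sketch below computes PRESS, Q2 = 1 − PRESS / SS_total, and Scv = sqrt(PRESS / (n − F − 1)), where n is the number of training compounds and F the number of factors. To stay self-contained it uses a one-descriptor least-squares line as a stand-in for the PLS model, so it shows the bookkeeping only, not the authors' model. The data values are made up.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def loo_cv(xs, ys, n_factors=1):
    """Leave-one-out CV: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    y_mean = mean(ys)
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    q2 = 1.0 - press / ss_total
    s_cv = math.sqrt(press / (len(ys) - n_factors - 1))
    return press, q2, s_cv


# Made-up descriptor values and activities for six compounds.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
press, q2, s_cv = loo_cv(xs, ys, n_factors=1)
print(press, q2, s_cv)
```

Repeating this over a grid of feature counts and factor counts gives the kind of surface that the cross-validation panels summarize.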

Parameter Optimization for Cross Validation

G = # pre-selected structural features; F = PLS factors

Predictor types                              G                F    Training set    Leave-one-out CV
                                                                   R2     RMSE     Q2     RMSE
All: Base features + MSA + 8 properties      580 (all used)   3    0.77   0.27     0.57   0.37
                                             200              5    0.83   0.23     0.68   0.31
                                             150              4    0.80   0.24     0.70   0.32
                                             150              12   0.89   0.18     0.62   0.34
                                             150              20   0.93   0.15     0.57   0.39
                                             100              6    0.78   0.26     0.66   0.32
                                             100              19   0.91   0.17     0.71   0.31
                                             50               12   0.83   0.23     0.71   0.30
                                             50               5    0.76   0.28     0.68   0.31
Base features only                           150              4    0.71   0.30     0.56   0.37
Base features + 8 properties                 150              4    0.76   0.27     0.62   0.34
MSA only                                     71 (all MSA)     5    0.80   0.25     0.61   0.35
MSA + 8 properties                           71 (all MSA)     5    0.84   0.22     0.68   0.32
8 properties alone                           8                1    0.48   0.40     0.47   0.40

Predictive Power of Molecular Descriptors

Descriptors                                             Parameters   Training set           26 test set
                                                        G     F      R2     RMSE   Q2       Q2     RMSE
All descriptors: basic features + MSAs + 8 properties   150   4      0.80   0.24   0.70     0.72   0.32
Basic features only                                     150   4      0.71   0.30   0.56     0.59   0.38
Basic features + MSAs                                   150   4      0.78   0.26   0.67     0.68   0.33
Basic features + 8 properties                           150   4      0.76   0.27   0.62     0.69   0.33
MSAs only                                               71    5      0.79   0.25   0.61     0.64   0.36
MSAs + 8 properties                                     71    5      0.83   0.22   0.68     0.65   0.37
8 properties only                                       8     1      0.48   0.40   0.47     0.66   0.37
CoMFA model A*                                          reported values: 0.72, 0.51
CoMFA model B (w/ aLogP)*                               reported values: 0.84, 0.75

G = number of pre-selected features; F= number of PLS factors used

Comparison of actual and predicted pIC50 values for 92-training and 26-test sets.

[Figure: two scatter plots of predicted versus actual pIC50, one for the 26-test set and one for the 92-training set; both axes run from -0.5 to 2.0. Labelled points in the 26-test set plot: 61, 62, 63, 66, 68, 74, 84, 85, 103, 106, 113, 119, 125, 130, 136, 141, 145, 148, 159, 160, 171, 177, 179, 180, 182, 183.]

5. Chemical Evaluation of Model

Test compound: Test-177 (pIC50 = 1.56)

Similar compounds in training set: 175 (pIC50 = 0.98), 174 (pIC50 = 1.12), 176 (pIC50 = 1.41)

[Figure: structure drawings of the test compound and the three similar training compounds.]

6. Refining the Model by Chemical Inference

Additional features for test-177:

New Feature 1: Mean (92) = 1.07, Mean (26) = 1.53
New Feature 2: Mean (92) = 1.22, Mean (26) = 1.53
New Feature 3: Mean (92) = 1.28, Mean (26) = 1.59

[Figure: structure drawings of the three new features.]

- Increased the pIC50 of 117 from 0.886 to 0.947 (1.54 exp) without reducing goodness of fit.

Building a Classification Model

• Actives and inactives defined as above and below the mean pIC50
• 50 new MSAs from binary response data
• 8 QSAR properties still used
• MSAs + properties provided the best balance
• Partial Logistic Regression model - PLS of binary response data followed by logistic regression
• More accurate results
  – two false negatives (with probabilities near 0.5)
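The sketch below illustrates the two bookkeeping steps around such a classifier: labelling compounds as active or inactive relative to the mean pIC50, and scoring predicted probabilities by concordance, sensitivity, and specificity. The pIC50 values, the probabilities, and the 0.5 decision threshold are invented for illustration; the PLS and logistic regression steps themselves are not reproduced here.

```python
from statistics import mean


def label_by_mean(pic50):
    """1 = active (above the mean pIC50), 0 = inactive (at or below it)."""
    cutoff = mean(pic50)
    return [1 if v > cutoff else 0 for v in pic50]


def classification_metrics(actual, prob, threshold=0.5):
    """Percent concordance, sensitivity, and specificity at a given threshold."""
    pred = [1 if p >= threshold else 0 for p in prob]
    tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 0)
    return {
        "concordance": 100.0 * (tp + tn) / len(actual),
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
    }


# Invented activities and model probabilities for eight compounds.
pic50 = [1.6, 1.4, 1.2, 1.1, 0.4, 0.3, 0.2, 0.1]
prob = [0.9, 0.8, 0.48, 0.7, 0.2, 0.1, 0.3, 0.6]

actual = label_by_mean(pic50)
metrics = classification_metrics(actual, prob)
print(actual, metrics)
```

In this toy data the third compound is a false negative with a probability of 0.48, the kind of borderline miss the slide describes as having a probability near 0.5.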

Classification model results

Descriptors     Factors   Training                                         Cross validation
                          % concordance   % sensitivity   % specificity   % concordance   % sensitivity   % specificity
All 150 feats   2         84.8            81.4            87.8            82.6            81.4            83.7
                8         94.6            93.0            95.9            75.0            65.1            83.7
All 50 MSAs     2         83.7            81.4            85.7            82.6            79.1            85.7
                8         91.9            88.4            93.9            79.3            74.4            83.7
MSA + props     2         89.1            86              91.8            87.0            81.4            91.8
                9         95.7            95.3            95.9            83.7            81.4            85.7

Conclusion

• 2D PLS models performed as well as 3-D QSAR models
• The 2D models are intuitive due to chemical structure descriptors and explain model strengths and weaknesses
• These transparent models provide insight for refinement in using additional features
• Macrostructure assemblies provide an intuitive means to reduce high dimensionality and improve the ability to perform chemical inference
• Chemical inference enables efficient evaluation of hypotheses in the design of structures.