Prediction Models Based on Classifying Compounds by Structural Features

Chihae Yang, Kevin Cross, Paul Blower, Glenn Myatt
Leadscope, Inc.

3rd Joint Sheffield Conference on Chemoinformatics

Objectives

1. Illustrate a strategy for building transparent models
2. Build models predicting pIC50 for a published test set and compare results with a published CoMFA study
3. Confirm the importance of macrostructures in a molecular descriptor set for predicting activity

Issues to Consider When Building Predictive Models

• Data
  – Size and distribution
  – Quality
  – Availability
• Structure
  – Diversity
  – Mechanistic complexity
• Interpretation
  – Chemical transparency

Structure-Data Surface

[Figure: structure-data surface. Axes, each running from - to +: structure diversity, mechanism complexity, data size. Data distribution (active/inactive) is listed as the fourth factor.]

Assessment of 4 Factors in the Structure-Data Surface

• Local models
  – Structurally similar
  – Mechanistically homogeneous
  – Reasonable size
  – Data distribution
• Global models
  – Structurally dissimilar
  – Mechanistically complex
  – Large number of data points
  – Data distribution: balance in actives and inactives

Example of Local Dataset: PTP-1B Inhibitors

• Protein tyrosine phosphatase 1B (PTP-1B) is a therapeutic target for treatment of diabetes, obesity and cancer
• Dataset of 118 compounds from a literature study [1]
• SAR analysis identified two active classes
• Comparison with a literature CoMFA study [2]

[1] Malamas, M.S., et al.; J. Med. Chem. 2000, 43, 1293-1310 (Wyeth-Ayerst)
[2] Murthy, V.S., et al.; Bioorganic & Medicinal Chemistry, 2002, 10, 2267-2282.

Modeling Strategy

1. Diagnose the data set
2. Assemble discriminating macrostructures
3. Select descriptors: features and properties
4. Build predictive models
5. Evaluate the model with chemical inference
6. Rebuild the model with a refined feature set

1. Diagnosis of the PTP1B Dataset

• The dataset was the 118 structures published by Murthy et al.
• The training and test sets were partitioned as published
  – 92-compound training set
  – 26-compound test set
• The 26-compound test set contains a higher proportion of actives

1. Diagnosis of the PTP1B Dataset

• Assessing similarity of training and test sets
  – The chemical space of the test set must lie within that of the training set
  1. Grouping by chemical class
  2. Diversity analysis
  3. Similarity by Sammon map (feature based)
  4. Feature similarity within the test set and between the test and training sets
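One way to check item 4 is to score each test compound by its closest training compound. The sketch below is a minimal illustration, not the method used in the study: it assumes each compound is described by a set of structural-feature IDs and uses the Tanimoto coefficient as the similarity measure. The function names and the toy feature sets are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of structural-feature IDs."""
    if not a and not b:
        return 1.0  # two empty feature sets are treated as identical
    return len(a & b) / len(a | b)


def nearest_training_similarity(test_set, training_set):
    """For each test compound, return its highest similarity to any training compound."""
    return [max(tanimoto(t, tr) for tr in training_set) for t in test_set]


# Toy feature sets with hypothetical feature IDs (not the PTP-1B data).
training = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
test = [{1, 2, 3, 4}, {8, 9}]
scores = nearest_training_similarity(test, training)
print(scores)
```

A test compound whose best score is low lies outside the chemical space covered by the training set, so its prediction deserves less trust.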

Structural Diversity – 92 chemicals

[Figure: dendrogram of the 92 training chemicals; leaves labeled 1 to 20; vertical axis: distance, 0.5 to 0.8.]

Chemical class groupings in the data

Major classes          % frequency
                       92-training set   26-test set   19-unknown set
benzofuran             53                61            63
benzothiophene         32                35            5.3
oxazole                7.6               3.8           21
pyridine               2.2               0             5.3
tetrazole              3.3               3.8           0
[structure drawing]    10.9              19.2          0
[structure drawing]    10.9              7.7           10.5
[structure drawing]    63                62            47
[structure drawing]    15                7.7           37

Diversity Analysis of PTP1B through Multiple Subset Extraction

[Figure: Structural Diversity of PTP1B; x-axis: subset size (percent), 0.00 to 0.20; y-axis: coverage (percent), 0.00 to 1.00. Series: 118 benzothiophenes and furans; 92 training set; 26 test set; 19 test set (benzimidazoles, oxazoles, etc.).]

Similarity of Training and Test Sets

• 26-test set correlation: 97%
• 19-test set correlation: 68%

[Figure: three-axis map of the 26-, 92-, and 19-compound sets (legend: 26C, 92C, 19C).]

Characterization of PTP-1B Inhibitors

• Activity distribution is not Gaussian

• Good balance between actives and inactives

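• Data size is small
• Mechanistically homogeneous
• Structurally similar

[Figure: position of the PTP-1B dataset on the structure-data surface; axes, each running from - to +: structure diversity, data size, mechanism complexity.]

[Figure: pIC50 distribution for the 92-training and 26-test sets; x-axis: pIC50, 0.0 to 1.5; y-axis: probability, 0.00 to 0.25; active and inactive regions marked.]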

2. Assembly of Macrostructures

• 509 structural features describing the 92 training compounds were automatically extracted from over 27,000 features

• 71 macrostructures that discriminate activity were assembled

• 580 total features plus 8 properties were available for modeling

Advantages of Macro-Structure Assembly

• MSAs are actual substructures that are easily interpreted
• Connectivity of individual features is explicitly represented
• The assembly process is supervised by the selected biological response
• MSAs are chemically relevant
• Macrostructure assembly is computationally feasible for “larger” structure sets

Macrostructure Assemblies

[Structure drawings of the macrostructure assemblies omitted.]

MSA    t       PLS wt   Mean (92)   Mean (26)
1      14.7    0.076    1.57        1.32
2      13.9    0.12     1.52        1.57
3      9.7     0.063    1.30        1.34
4      9.3     0.050    1.34        1.29
6      7.9     0.052    1.32        1.28
10     5.4     0.10     0.75        0.97
13     4.9     0.090    0.70        0.97
19     3.7     0.056    0.82        0.86
26     -5.9    0.048    0.0023      0.45
28     -6.4    0.096    0.16        0.13

3. Pre-selection of Structural Features

• Test significance - T2 test

• Of 150 influential features, 41 were MSAs
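The pre-selection step can be pictured as ranking features by how strongly their presence separates activity values. The slide names a T2 test; the sketch below swaps in a Welch two-sample t statistic as an assumption, since the exact statistic is not spelled out here. The function names, feature names, and toy activities are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(with_feature, without_feature):
    """Welch two-sample t statistic for the difference in mean activity."""
    n1, n2 = len(with_feature), len(without_feature)
    se = sqrt(variance(with_feature) / n1 + variance(without_feature) / n2)
    if se == 0.0:
        return 0.0  # no spread in either group: statistic undefined, treated as 0 in this sketch
    return (mean(with_feature) - mean(without_feature)) / se


def preselect(features, activity, top_g):
    """Rank features by |t| and keep the top_g most discriminating ones.

    features: dict mapping feature name -> list of 0/1 flags, one per compound.
    activity: list of activity values (e.g. pIC50), one per compound.
    """
    scored = []
    for name, flags in features.items():
        present = [a for f, a in zip(flags, activity) if f]
        absent = [a for f, a in zip(flags, activity) if not f]
        if len(present) < 2 or len(absent) < 2:
            continue  # sample variance needs at least two compounds per group
        scored.append((name, welch_t(present, absent)))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_g]


# Toy data (hypothetical): feature A marks actives, B marks inactives, C is uninformative.
activity = [1.5, 1.4, 1.6, 0.2, 0.3, 0.1]
features = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [0, 0, 0, 1, 1, 1],
    "C": [1, 0, 1, 0, 1, 0],
}
selected = preselect(features, activity, top_g=2)
print(selected)
```

Here `top_g` plays the role of G, the number of pre-selected features reported in the tables that follow.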

Property Descriptors Available

• aLogP
• Polar surface area
• Hydrogen bond donors and acceptors
• Molecular weight
• Rotatable bonds
• Lipinski scores

4. Model Building

• Multivariate Least Squares
• Principal Component Regression
• k-nearest neighbors
• Partial Least Squares
• Neural Networks

Cross-validation of the training set

[Figure: three panels showing Q2, mean R2, and Scv; x-axis: number of PLS factors; y-axis: number of pre-selected features.]

$S_{cv} = \sqrt{\dfrac{\mathrm{PRESS}}{92 - 1 - \#\,\text{PLS factors}}}$
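The cross-validation statistics on this slide (Q2, PRESS, Scv) can be sketched with a small leave-one-out loop. This is a minimal illustration under stated assumptions, not the software used in the study: it implements single-response PLS with the NIPALS algorithm in NumPy, takes Scv as the square root of PRESS over n - 1 - (number of PLS factors), and runs on synthetic data. All function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_factors):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            break  # nothing left in y that X can explain
        w = w / norm
        t = Xk @ w                      # scores for this factor
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)  # regression vector in original X space
    return coef, x_mean, y_mean


def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean


def loo_cv(X, y, n_factors):
    """Leave-one-out cross-validation: returns Q2, PRESS, and Scv."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = pls1_fit(X[keep], y[keep], n_factors)
        pred[i] = pls1_predict(model, X[i:i + 1])[0]
    press = float(np.sum((y - pred) ** 2))
    q2 = 1.0 - press / float(np.sum((y - y.mean()) ** 2))
    s_cv = float(np.sqrt(press / (n - 1 - n_factors)))  # assumed form of Scv
    return q2, press, s_cv


# Synthetic data (not the PTP-1B set): 30 compounds, 5 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=30)
q2, press, s_cv = loo_cv(X, y, n_factors=5)
print(q2, press, s_cv)
```

Repeating `loo_cv` over a grid of factor counts and pre-selected feature counts gives the kind of surface the three panels summarize.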

Parameter Optimization for Cross Validation

Predictor types                            G                F     Training set     Leave-one-out CV
                                                                  R2     RMSE      Q2     RMSE
All: base features + MSA + 8 properties    580 (all used)   3     0.77   0.27      0.57   0.37
                                           200              5     0.83   0.23      0.68   0.31
                                           150              4     0.80   0.24      0.70   0.32
                                           150              12    0.89   0.18      0.62   0.34
                                           150              20    0.93   0.15      0.57   0.39
                                           100              6     0.78   0.26      0.66   0.32
                                           100              19    0.91   0.17      0.71   0.31
                                           50               12    0.83   0.23      0.71   0.30
                                           50               5     0.76   0.28      0.68   0.31
Base features only                         150              4     0.71   0.30      0.56   0.37
Base features + 8 properties               150              4     0.76   0.27      0.62   0.34
MSA only                                   71 (all MSA)     5     0.80   0.25      0.61   0.35
MSA + 8 properties                         71 (all MSA)     5     0.84   0.22      0.68   0.32
Properties alone                           8                1     0.48   0.40      0.47   0.40

G = number of pre-selected structural features; F = number of PLS factors

Predictive Power of Molecular Descriptors

Descriptors                                             G     F    Training set           26-test set
                                                                   R2     RMSE   Q2       Q2     RMSE
All descriptors: basic features + MSAs + 8 properties   150   4    0.80   0.24   0.70     0.72   0.32
Basic features only                                     150   4    0.71   0.30   0.56     0.59   0.38
Basic features + MSAs                                   150   4    0.78   0.26   0.67     0.68   0.33
Basic features + 8 properties                           150   4    0.76   0.27   0.62     0.69   0.33
MSAs only                                               71    5    0.79   0.25   0.61     0.64   0.36
MSAs + 8 properties                                     71    5    0.83   0.22   0.68     0.65   0.37
8 properties only                                       8     1    0.48   0.40   0.47     0.66   0.37
CoMFA model A*                                          0.72, 0.51 (columns not recoverable)
CoMFA model B (w/ aLogP)*                               0.84, 0.75 (columns not recoverable)

G = number of pre-selected features; F = number of PLS factors used
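For reference, the test-set columns can be computed from actual and predicted pIC50 values as sketched below. The slides do not state which Q2 definition was used for the 26-compound test set; this sketch assumes the common form 1 - PRESS/SS, with SS taken around a reference mean such as the training-set mean. The function names and toy values are hypothetical.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def predictive_q2(actual, predicted, reference_mean):
    """1 - PRESS/SS, with SS measured around reference_mean (assumed definition)."""
    press = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - reference_mean) ** 2 for a in actual)
    return 1.0 - press / total


# Toy values (not the PTP-1B data).
actual = [0.0, 1.0, 2.0]
predicted = [0.1, 0.9, 2.1]
test_rmse = rmse(actual, predicted)
test_q2 = predictive_q2(actual, predicted, reference_mean=1.0)
print(test_rmse, test_q2)
```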

Comparison of actual and predicted pIC50 values for 92-training and 26-test sets.

[Figure: two scatter plots, 26-Test Set and 92-Training Set; x-axis: pIC50 (actual), y-axis: pIC50 (predicted), both from -0.5 to 2.0. Test-set points are labeled with compound numbers 61, 62, 63, 66, 68, 74, 84, 85, 103, 106, 113, 119, 125, 130, 136, 141, 145, 148, 159, 160, 171, 177, 179, 180, 182, 183.]

5. Chemical Evaluation of Model

Test compound             Similar compounds in training set
Test-177: pIC50 = 1.56    175: pIC50 = 0.98    174: pIC50 = 1.12    176: pIC50 = 1.41

[Structure drawings of the four compounds omitted.]

6. Refining the Model by Chemical Inference

New features (additional features for test-177):

Feature          Mean (92)   Mean (26)
New Feature 1    1.07        1.53
New Feature 2    1.22        1.53
New Feature 3    1.28        1.59

[Structure drawings of the three new features omitted.]

- Increased the pIC50 of 117 from 0.886 to 0.947 (1.54 experimental) without reducing goodness of fit.

Building a Classification Model

• Actives and inactives defined as above and below the mean pIC50

• 50 new MSAs from binary response data
• 8 QSAR properties still used
• MSAs + properties provided the best balance
• Partial Logistic Regression model: PLS of binary response data followed by logistic regression
• More accurate results
  – two false negatives (with probabilities near 0.5)

Classification model results

Descriptors      Factors   Training                                        Cross validation
                           % concordance  % sensitivity  % specificity    % concordance  % sensitivity  % specificity
All 150 feats    2         84.8           81.4           87.8             82.6           81.4           83.7
                 8         94.6           95.9           93.0             75.0           65.1           83.7
All 50 MSAs      2         83.7           81.4           85.7             82.6           79.1           85.7
                 8         91.9           88.4           93.9             79.3           74.4           83.7
MSA + props      2         89.1           91.8           86               87.0           81.4           91.8
                 9         95.7           95.3           95.9             83.7           81.4           85.7
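The three rates in this table follow from a confusion matrix of actual versus predicted classes. The sketch below is illustrative only: it labels compounds as active when their pIC50 is above the set mean, as the classification slide describes, and uses toy values and hypothetical function names rather than the study's data.

```python
def label_actives(pic50_values):
    """Label a compound active (1) if its pIC50 is above the set mean, else inactive (0)."""
    cutoff = sum(pic50_values) / len(pic50_values)
    return [1 if v > cutoff else 0 for v in pic50_values]


def classification_rates(actual, predicted):
    """Percent concordance, sensitivity, and specificity.

    Assumes both classes occur in `actual`; otherwise a rate is undefined.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "concordance": 100.0 * (tp + tn) / len(pairs),  # all correct calls
        "sensitivity": 100.0 * tp / (tp + fn),          # actives found
        "specificity": 100.0 * tn / (tn + fp),          # inactives found
    }


# Toy values (not the PTP-1B data): one active is misclassified as inactive.
pic50 = [1.6, 1.4, 1.2, 0.4, 0.2, 0.0]
actual = label_actives(pic50)
predicted = [1, 1, 0, 0, 0, 0]
rates = classification_rates(actual, predicted)
print(rates)
```

A false negative, such as the misclassified active in the toy data, lowers sensitivity and concordance but leaves specificity unchanged.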

Conclusion

• 2D PLS models performed as well as 3D QSAR models
• The 2D models are intuitive due to chemical structure descriptors and explain model strengths and weaknesses
• These transparent models provide insight for refinement using additional features
• Macrostructure assemblies provide an intuitive means to reduce high dimensionality and improve the ability to perform chemical inference
• Chemical inference enables efficient evaluation of hypotheses in the design of structures.
