Prediction Models Based on Classifying Compounds by
Structural Features
Chihae Yang, Kevin Cross, Paul Blower, Glenn Myatt
Leadscope, Inc.
3rd Joint Sheffield Conference on Chemoinformatics
Objectives
1. Illustrate a strategy for building transparent models
2. Build models predicting pIC50 for a published test set and compare results with a published CoMFA study
3. Confirm the importance of macrostructures in a molecular descriptor set for predicting activity
Issues to Consider When Building Predictive Models
• Data
  – Size and distribution
  – Quality
  – Availability
• Structure
  – Diversity
  – Mechanistic complexity
• Interpretation
  – Chemical transparency
Structure-Data Surface
[Figure: structure-data surface. Axes: structural diversity (- to +), mechanistic complexity (- to +), data size (- to +). Fourth factor: data distribution (active/inactive).]
Assessment of 4 Factors in the Structure-Data Surface
• Local models
  – Structurally similar
  – Mechanistically homogeneous
  – Reasonable size
  – Data distribution
• Global models
  – Structurally dissimilar
  – Mechanistically complex
  – Large number of data points
  – Data distribution: balance between actives and inactives
Example of Local Dataset: PTP-1B Inhibitors
• Protein tyrosine phosphatase 1B (PTP-1B) is a therapeutic target for the treatment of diabetes, obesity, and cancer
• Dataset of 118 compounds from a literature study [1]
• SAR analysis identified two active classes
• Comparison with a literature CoMFA study [2]
[1] Malamas, M.S., et al.; J. Med. Chem. 2000, 43, 1293-1310 (Wyeth-Ayerst)
[2] Murthy, V.S., et al.; Bioorganic & Medicinal Chemistry, 2002, 10, 2267-2282
Modeling Strategy
1. Diagnose the data set
2. Assemble discriminating macrostructures
3. Select descriptors: features and properties
4. Build predictive models
5. Evaluate the model with chemical inference
6. Rebuild the model with a refined feature set
1. Diagnoses of the PTP1B Dataset
• The dataset was the 118 structures published by Murthy
• The training and test sets were partitioned as published
  – 92-compound training set
  – 26-compound test set
• The 26-compound test set contains a higher proportion of actives
1. Diagnoses of the PTP1B Dataset
• Assessing similarity of training and test sets
  – The chemical space of the test set must lie within that of the training set
  1. Grouping by chemical class
  2. Diversity analysis
  3. Similarity by Sammon map (feature based)
  4. Feature similarity within the test set and between the test and training sets
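One simple way to check item 4 is to ask, for each test compound, how similar its nearest training compound is. The sketch below assumes each compound is represented as a set of structural feature IDs and uses the Tanimoto coefficient as the similarity measure. Both choices are assumptions made for illustration; the slides do not state which similarity measure was used, and the feature sets shown are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of feature IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(test, train):
    """For each test compound, return its highest similarity to any training compound."""
    return [max(tanimoto(t, r) for r in train) for t in test]


# Hypothetical feature sets (integers stand for structural feature IDs).
train = [{1, 2, 3, 4}, {1, 2, 5}, {6, 7, 8}]
test = [{1, 2, 3}, {6, 7, 9, 10}]

scores = nearest_training_similarity(test, train)
for i, s in enumerate(scores):
    print(f"test compound {i}: nearest-training similarity = {s:.2f}")
```

A test compound whose best score is low has no close analogue in the training set, so a prediction for it would be an extrapolation.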
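Structural Diversity – 92 chemicals
[Figure: cluster dendrogram; 20 clusters in leaf order 7, 19, 6, 17, 18, 1, 2, 3, 4, 9, 5, 20, 11, 12, 13, 8, 14, 10, 15, 16; y-axis: distance, 0.5 to 0.8.]
Chemical class groupings in the data (% frequency)
Major class | 92-training set | 26-test set | 19-unknown set
benzofuran | 53 | 61 | 63
benzothiophene | 32 | 35 | 5.3
oxazole | 7.6 | 3.8 | 21
pyridine | 2.2 | 0 | 5.3
tetrazole | 3.3 | 3.8 | 0
[structure drawing not recovered] | 10.9 | 19.2 | 0
[structure drawing not recovered] | 10.9 | 7.7 | 10.5
[structure drawing not recovered] | 63 | 62 | 47
[structure drawing not recovered] | 15 | 7.7 | 37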
Diversity Analysis of PTP1B through Multiple Subset Extraction
[Figure: Structural Diversity of PTP1B. x-axis: subset size (percent), 0.00 to 0.20; y-axis: coverage (percent), 0.00 to 1.00.]
• 118 benzothiophenes and furans
  – 92 training set
  – 26 test set
• 19 test set (benzimidazoles, oxazoles, etc.)
Similarity of Training and Test Sets
[Figure: three-dimensional similarity map; legend: 26C, 92C, 19C.]
• 26-test set correlation: 97%
• 19-test set correlation: 68%
Characterization of PTP-1B Inhibitors
• Activity distribution is not Gaussian
• Good balance between actives and inactives
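• Data size is small
• Mechanistically homogeneous
• Structurally similar
[Figure: structure-data surface with axes structural diversity (- to +), mechanistic complexity (- to +), and data size (- to +).]
[Figure: probability distribution of pIC50. x-axis: pIC50, 0.0 to 1.5; y-axis: probability, 0.00 to 0.25; legend: 92-train, 26-test; active, inactive.]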
2. Assembly of Macrostructures
• 509 structural features describing the 92 training compounds were automatically extracted from over 27,000 features
• 71 macrostructures with discriminating activity were assembled
• 580 total features plus 8 properties were available for modeling
Advantages of Macro-Structure Assembly
• MSAs are actual substructures that are easily interpreted
• Connectivity of individual features is explicitly represented
• The assembly process is supervised by the selected biological response
• MSAs are chemically relevant
• Macrostructure assembly is computationally feasible for "larger" structure sets
Macrostructure Assemblies
[Structure drawings not recovered.]
MSA | t | PLS wt | Mean (92) | Mean (26)
MSA 1 | 14.7 | 0.076 | 1.57 | 1.32
MSA 2 | 13.9 | 0.12 | 1.52 | 1.57
MSA 3 | 9.7 | 0.063 | 1.30 | 1.34
MSA 4 | 9.3 | 0.050 | 1.34 | 1.29
MSA 6 | 7.9 | 0.052 | 1.32 | 1.28
MSA 10 | 5.4 | 0.10 | 0.75 | 0.97
MSA 13 | 4.9 | 0.090 | 0.70 | 0.97
MSA 19 | 3.7 | 0.056 | 0.82 | 0.86
MSA 26 | -5.9 | 0.048 | 0.0023 | 0.45
MSA 28 | -6.4 | 0.096 | 0.16 | 0.13
3. Pre-selection of Structural Features
• Test significance - T2 test
• Of 150 influential features, 41 were MSAs
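As an illustration of significance-based pre-selection, the sketch below ranks binary structural features by a two-sample t-statistic that compares the mean pIC50 of compounds containing a feature with the mean of those lacking it. The Welch form of the statistic, the function names, and the toy data are all assumptions; the slides name only a "T2 test" and do not give the exact procedure.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(with_feature, without_feature):
    """Welch two-sample t-statistic (assumes at least one group has nonzero variance)."""
    n1, n2 = len(with_feature), len(without_feature)
    se = sqrt(variance(with_feature) / n1 + variance(without_feature) / n2)
    return (mean(with_feature) - mean(without_feature)) / se


def rank_features(features, activity):
    """Return (feature name, t) pairs sorted by |t|, largest first.

    features: dict mapping feature name -> list of 0/1 flags, one per compound.
    activity: list of pIC50 values, one per compound.
    """
    ranked = []
    for name, flags in features.items():
        present = [y for f, y in zip(flags, activity) if f]
        absent = [y for f, y in zip(flags, activity) if not f]
        if len(present) < 2 or len(absent) < 2:
            continue  # sample variance is undefined for fewer than two values
        ranked.append((name, welch_t(present, absent)))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: six compounds, three binary features.
activity = [1.5, 1.4, 1.6, 0.2, 0.3, 0.1]
features = {
    "feature_A": [1, 1, 1, 0, 0, 0],  # present only in the more active compounds
    "feature_B": [0, 0, 0, 1, 1, 1],  # present only in the less active compounds
    "feature_C": [1, 0, 1, 0, 1, 0],  # roughly unrelated to activity
}
ranked = rank_features(features, activity)
for name, t in ranked:
    print(f"{name}: t = {t:.1f}")
```

Keeping only the top-ranked features by |t| retains both activity-enhancing (positive t) and activity-reducing (negative t) features.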
Property Descriptors Available
• aLogP
• Polar surface area
• Hydrogen bond donors and acceptors
• Molecular weight
• Rotatable bonds
• Lipinski scores
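The slides do not define how the Lipinski score is computed. A common convention, assumed here, is to count violations of the rule of five (molecular weight above 500, logP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors). The `Properties` container and the property values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Hypothetical container for four of the property descriptors listed above."""
    mol_weight: float       # molecular weight, g/mol
    alogp: float            # calculated logP
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(p: Properties) -> int:
    """Count rule-of-five violations (0 = none, 4 = all four limits exceeded)."""
    return sum([
        p.mol_weight > 500,
        p.alogp > 5,
        p.h_bond_donors > 5,
        p.h_bond_acceptors > 10,
    ])


# Hypothetical property values, not taken from the PTP-1B dataset.
small = Properties(mol_weight=320.4, alogp=2.1, h_bond_donors=2, h_bond_acceptors=5)
large = Properties(mol_weight=612.7, alogp=6.3, h_bond_donors=1, h_bond_acceptors=8)
print(lipinski_violations(small), lipinski_violations(large))
```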
4. Model Building
• Multivariate Least Squares
• Principal Component Regression
• k-nearest neighbors
• Partial Least Squares
• Neural Networks
Cross-validation of the training set
[Figure: three panels (Q2, mean R2, Scv), each plotting number of pre-selected features against number of PLS factors.]
$S_{cv} = \sqrt{\dfrac{\mathrm{PRESS}}{92 - 1 - \#\,\text{PLS factors}}}$
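The cross-validation statistics can be sketched in a few lines. The example below runs leave-one-out cross-validation and computes PRESS, Q2 = 1 - PRESS / SS_total, and S_cv = sqrt(PRESS / (n - 1 - number of factors)), where n is the training-set size (92 in the slides). To stay self-contained it uses a one-descriptor least-squares line as a stand-in for the PLS model, which is an assumption made for illustration only; the data are hypothetical.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def leave_one_out(x, y):
    """Predict each y[i] from a model fitted to all the other compounds."""
    preds = []
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(a + b * x[i])
    return preds


def cv_statistics(y, y_cv, n_factors):
    """Return (PRESS, Q2, S_cv) from observed values and cross-validated predictions."""
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, y_cv))
    ss_total = sum((yi - mean(y)) ** 2 for yi in y)
    q2 = 1.0 - press / ss_total
    s_cv = sqrt(press / (len(y) - 1 - n_factors))
    return press, q2, s_cv


# Hypothetical descriptor values and pIC50 values for eight compounds.
x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
y = [0.2, 0.3, 0.5, 0.7, 0.9, 1.2, 1.1, 1.5]
press, q2, s_cv = cv_statistics(y, leave_one_out(x, y), n_factors=1)
print(f"PRESS={press:.3f}  Q2={q2:.3f}  S_cv={s_cv:.3f}")
```

Because Q2 is computed from predictions for compounds left out of the fit, it can fall as factors are added even while the training R2 keeps rising, which is the trade-off the parameter optimization explores.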
Parameter Optimization for Cross Validation
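G = # pre-selected structural features; F = PLS factors
Predictor types | G | F | Training set R2 | Training set RMSE | Leave-one-out CV Q2 | Leave-one-out CV RMSE
All: base features + MSA + 8 properties | 580 (all used) | 3 | 0.77 | 0.27 | 0.57 | 0.37
All: base features + MSA + 8 properties | 200 | 5 | 0.83 | 0.23 | 0.68 | 0.31
All: base features + MSA + 8 properties | 150 | 4 | 0.80 | 0.24 | 0.70 | 0.32
All: base features + MSA + 8 properties | 150 | 12 | 0.89 | 0.18 | 0.62 | 0.34
All: base features + MSA + 8 properties | 150 | 20 | 0.93 | 0.15 | 0.57 | 0.39
All: base features + MSA + 8 properties | 100 | 6 | 0.78 | 0.26 | 0.66 | 0.32
All: base features + MSA + 8 properties | 100 | 19 | 0.91 | 0.17 | 0.71 | 0.31
All: base features + MSA + 8 properties | 50 | 12 | 0.83 | 0.23 | 0.71 | 0.30
All: base features + MSA + 8 properties | 50 | 5 | 0.76 | 0.28 | 0.68 | 0.31
Base features only | 150 | 4 | 0.71 | 0.30 | 0.56 | 0.37
Base features + 8 properties | 150 | 4 | 0.76 | 0.27 | 0.62 | 0.34
MSA only | 71 (all MSA) | 5 | 0.80 | 0.25 | 0.61 | 0.35
MSA + 8 properties | 71 (all MSA) | 5 | 0.84 | 0.22 | 0.68 | 0.32
8 properties alone | 8 | 1 | 0.48 | 0.40 | 0.47 | 0.40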
Predictive Power of Molecular Descriptors
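Descriptors | G | F | Training set R2 | Training set RMSE | Training set Q2 | 26 test set Q2 | 26 test set RMSE
All descriptors: basic features + MSAs + 8 properties | 150 | 4 | 0.80 | 0.24 | 0.70 | 0.72 | 0.32
Basic features only | 150 | 4 | 0.71 | 0.30 | 0.56 | 0.59 | 0.38
Basic features + MSAs | 150 | 4 | 0.78 | 0.26 | 0.67 | 0.68 | 0.33
Basic features + 8 properties | 150 | 4 | 0.76 | 0.27 | 0.62 | 0.69 | 0.33
MSAs only | 71 | 5 | 0.79 | 0.25 | 0.61 | 0.64 | 0.36
MSAs + 8 properties | 71 | 5 | 0.83 | 0.22 | 0.68 | 0.65 | 0.37
8 properties only | 8 | 1 | 0.48 | 0.40 | 0.47 | 0.66 | 0.37
CoMFA reference values (column assignment not recoverable from the transcript):
CoMFA model A*: 0.72, 0.51
CoMFA model B (w/ aLogP)*: 0.84, 0.75
G = number of pre-selected features; F = number of PLS factors used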
Comparison of actual and predicted pIC50 values for 92-training and 26-test sets.
[Figure: two scatter plots, "26-Test Set" and "92-Training Set". x-axis: pIC50 (actual); y-axis: pIC50 (predicted); both axes run from -0.5 to 2.0.]
Compounds labeled in the 26-test set plot: 61, 62, 63, 66, 68, 74, 84, 85, 103, 106, 113, 119, 125, 130, 136, 141, 145, 148, 159, 160, 171, 177, 179, 180, 182, 183
5. Chemical Evaluation of Model
[Structure drawings not recovered.]
Group | Compound | pIC50
Test compound | Test-177 | 1.56
Similar compound in training set | 175 | 0.98
Similar compound in training set | 174 | 1.12
Similar compound in training set | 176 | 1.41
6. Refining the Model by Chemical Inference
[Structure drawings not recovered.]
Feature | Mean (92) | Mean (26)
New Feature 1 | 1.07 | 1.53
New Feature 2 | 1.22 | 1.53
New Feature 3 | 1.28 | 1.59
• Adding the new features (additional features for test-177) increased the predicted pIC50 of 117 from 0.886 to 0.947 (experimental: 1.54) without reducing goodness of fit.
Building a Classification Model
• Actives and inactives defined as above and below the mean pIC50
• 50 new MSAs from binary response data
• 8 QSAR properties still used
• MSAs + properties provided the best balance
• Partial Logistic Regression model: PLS of binary response data followed by logistic regression
• More accurate results
  – two false negatives (with probabilities near 0.5)
Classification model results
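Descriptors | Factors | Training % concordance | Training % sensitivity | Training % specificity | Cross validation % concordance | Cross validation % sensitivity | Cross validation % specificity
All 150 feats | 2 | 84.8 | 81.4 | 87.8 | 82.6 | 81.4 | 83.7
All 150 feats | 8 | 94.6 | 95.9 | 93.0 | 75.0 | 65.1 | 83.7
All 50 MSAs | 2 | 83.7 | 81.4 | 85.7 | 82.6 | 79.1 | 85.7
All 50 MSAs | 8 | 91.9 | 88.4 | 93.9 | 79.3 | 74.4 | 83.7
MSA + props | 2 | 89.1 | 91.8 | 86 | 87.0 | 81.4 | 91.8
MSA + props | 9 | 95.7 | 95.3 | 95.9 | 83.7 | 81.4 | 85.7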
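For reference, the three percentages reported for the classification model can be computed from actual and predicted labels as sketched below. Compounds are labeled active when their pIC50 is above the set mean, as the slides describe; the pIC50 values and predictions here are hypothetical.

```python
from statistics import mean


def label_actives(pic50):
    """Label a compound active (True) when its pIC50 is above the set mean."""
    cutoff = mean(pic50)
    return [value > cutoff for value in pic50]


def classification_metrics(actual, predicted):
    """Return (% concordance, % sensitivity, % specificity) for boolean labels."""
    tp = sum(a and p for a, p in zip(actual, predicted))              # true positives
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))  # true negatives
    n_active = sum(actual)
    n_inactive = len(actual) - n_active
    concordance = 100.0 * (tp + tn) / len(actual)
    sensitivity = 100.0 * tp / n_active
    specificity = 100.0 * tn / n_inactive
    return concordance, sensitivity, specificity


# Hypothetical pIC50 values and hypothetical model predictions.
pic50 = [1.6, 1.4, 1.2, 1.1, 0.6, 0.4, 0.3, 0.2]
actual = label_actives(pic50)  # the mean is 0.85, so the first four are active
predicted = [True, True, True, False, False, False, True, False]
conc, sens, spec = classification_metrics(actual, predicted)
print(f"concordance={conc:.1f}%  sensitivity={sens:.1f}%  specificity={spec:.1f}%")
```

In this toy case one active is missed (a false negative) and one inactive is flagged (a false positive), so all three values come out at 75%.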
Conclusion
• 2D PLS models performed as well as 3D QSAR models
• The 2D models are intuitive because they use chemical structure descriptors, and they explain model strengths and weaknesses
• These transparent models provide insight for refinement using additional features
• Macrostructure assemblies provide an intuitive means to reduce high dimensionality and improve the ability to perform chemical inference
• Chemical inference enables efficient evaluation of hypotheses in the design of structures