TRANSCRIPT
•QSAR modeling in MAPS
Principles and applications
Csaba Ferenc Kiss
June 2011, Athens
•Some terms
● QSAR – Quantitative structure–activity relationships
● Dependent (𝐲), independent (𝐱) variables
● Continuous, discrete, categorical, dichotomous
● $y = X\beta + \epsilon$, with $y \in \mathbb{R}^{r}$, $X \in \mathbb{R}^{r \times D}$
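As a concrete reading of $y = X\beta + \epsilon$, a minimal ordinary-least-squares fit in numpy on synthetic data (shapes $r$ and $D$ as above; the values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
r, D = 50, 5                                   # r samples, D descriptors
X = rng.normal(size=(r, D))                    # independent variables
beta = rng.normal(size=D)                      # "true" coefficients
y = X @ beta + rng.normal(scale=0.1, size=r)   # dependent variable with noise

# Ordinary least-squares estimate of beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```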
•Experimental dataset
● Predicting melting point for drug-like molecules
□ in J. Chem. Inf. Comput. Sci., 2003, 43 (4), pp. 1177–1185
● SMILES from the 12th Merck Index
□ 185 training structures
□ 92 validation structures
□ Melting temperatures 140-160 °C
● 566 descriptors computed with
□ Molconn-Z, Selma, BatchMin, MacroModel, Marea …
[Histogram: melting point distribution of the dataset, bins from 45 °C to above 311 °C]
•Dragon descriptors
● Talete s.r.l, www.talete.mi.it
● About 4885 descriptors
● In 29 logical groups:
□ Constitutional
□ Ring
□ Topological indices
□ Walk and path counts, Connectivity indices
□ 2D/3D matrix and autocorrelations
□ RDF, 3D-MoRSE
□ Molecular properties
□ Drug-like indices
□ & much more
•QSAR Study Table
x1       x2       ⋯    xn       y
1.72     32.08    ⋯    9.01     1.72
-5.79    11.09    ⋯    12.01    -5.79
⋮        ⋮             ⋮        ⋮
43.13    10.00    ⋯    7.21     43.13

4885 Dragon6 descriptors
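A study table of this shape maps naturally onto a data frame. A minimal pandas sketch with placeholder column names (the real table would hold 4885 Dragon descriptor columns, not the three shown here):

```python
import pandas as pd

# Hypothetical study table: descriptor columns x1..xn plus the response y.
study = pd.DataFrame({
    "x1": [1.72, -5.79, 43.13],
    "x2": [32.08, 11.09, 10.00],
    "xn": [9.01, 12.01, 7.21],
    "y":  [1.72, -5.79, 43.13],
})
X = study.drop(columns="y").to_numpy()   # descriptor matrix
y = study["y"].to_numpy()                # response vector
```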
•Linear regression assumptions
● Ideally
□ More samples than predictors
□ i.i.d. Gaussian errors, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$
□ Independent descriptors
● But in reality
□ Few structures
□ Multicollinearity
□ Skewed, peaked distributions
● See
□ D. C. Montgomery, E. A. Peck, G. G. Vining, 2006
□ M. H. Kutner, J. Neter et al., 2004
□ B. G. Tabachnick, L. S. Fidell, 2006
[Histogram: example HOMO descriptor distribution; skewness = -1.96, kurtosis = 4.04]
•Learning about the predictors
● Modeling can only succeed for structurally similar molecules
● Molecular classifiers
□ Rules of assigning group labels in the Study Table
□ Model all or just a labeled subset
● Unsupervised learning
□ Hierarchical clustering
□ Distance: Euclidean, Mahalanobis, Minkowski, Pearson, etc.
□ Clustering: single/complete linkage, Ward, etc. (a minimal sketch follows below)
● See more in
□ A. Webb, 2002
□ R. O. Duda, Hart, Stork, 2001
□ T. Hastie, R. Tibshirani, J. Friedman, 2001
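A minimal scipy sketch of the clustering just described (Euclidean distances, complete linkage), run on random stand-in data with the same shape as the example tree on the next slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(85, 76))        # 85 molecules in a 76-dimensional descriptor space

d = pdist(X, metric="euclidean")     # pairwise Euclidean distances (condensed form)
Z = linkage(d, method="complete")    # complete-linkage hierarchical clustering

labels = fcluster(Z, t=5, criterion="maxclust")   # cut the tree into 5 clusters
print(np.bincount(labels)[1:])                    # cluster sizes (labels start at 1)
```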
•Implication of grouping by similarity
•Example Cluster tree (1)
● 85 molecules
● 76-dimensional chemical space
● Distance metric: Euclidean
● Clustering method: complete linkage
● Hierarchical graph (layout "dot")
•Example Cluster tree (2)
● Layout "neato"
□ Kamada-Kawai algorithm
See more on www.graphviz.org
•Screening data
● Missing values
□ Expert knowledge, mean, pairwise regression
● Descriptive statistics
□ Checking normality constraints: skewness, kurtosis
● Multicollinearity
□ Covariance/correlation matrix
□ Pairwise correlation is not always enough
● Transformations to enforce constraints
□ Linearize (log, polynomial, etc.)
□ Replace highly correlated groups with a surrogate variable
● Outlier detection
□ Mahalanobis distance, $x' C^{-1} x$
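For the Mahalanobis criterion $x' C^{-1} x$, a minimal numpy sketch on synthetic stand-in data (the percentile cut-off is an arbitrary illustration, not a rule from the tool):

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance x' C^-1 x of each row, after centering."""
    Xc = X - X.mean(axis=0)
    C_inv = np.linalg.pinv(np.cov(Xc, rowvar=False))   # pseudo-inverse: C may be singular
    return np.einsum("ij,jk,ik->i", Xc, C_inv, Xc)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))                         # synthetic stand-in data
d2 = mahalanobis_sq(X)
outliers = np.where(d2 > np.percentile(d2, 97.5))[0]   # arbitrary cut-off for illustration
print(outliers)
```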
•Descriptor selection
● Zero variance or missing values
□ From 4885 down to 3600
● Non-normal descriptors eliminated
□ From 3600 down to 1600
● Significant pairwise correlation
□ From 1600 down to 600
● Further iterative elimination
□ Finally 76 (a filtering sketch follows below)
Round numbers are used only for simplicity.
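A sketch of such a filtering cascade in numpy/scipy; the thresholds, and the exact rules applied at each step, are placeholders chosen only to illustrate the idea, not the settings used to get from 4885 to 76:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def filter_descriptors(X, skew_max=2.0, kurt_max=5.0, corr_max=0.95):
    """Illustrative descriptor-filtering cascade; thresholds are placeholders."""
    # 1. Drop zero-variance descriptors and descriptors with missing values
    keep = (np.nanstd(X, axis=0) > 0) & ~np.isnan(X).any(axis=0)
    X = X[:, keep]
    # 2. Drop strongly non-normal descriptors (skewness / excess kurtosis screen)
    ok = (np.abs(skew(X, axis=0)) < skew_max) & (np.abs(kurtosis(X, axis=0)) < kurt_max)
    X = X[:, ok]
    # 3. Drop one member of each highly correlated descriptor pair
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    for i in range(corr.shape[0]):
        if i in drop:
            continue
        for j in range(i + 1, corr.shape[1]):
            if corr[i, j] > corr_max:
                drop.add(j)
    return np.delete(X, sorted(drop), axis=1)
```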
•Modeling with PLS
● Exploits the covariance between the independent and dependent variables at the same time, in a common latent space
□ PCR cares only about the predictors
● Copes well with small sample sets, high multicollinearity
● Basic idea
Input: independent variables $X \in \mathbb{R}^{r \times N}$, dependent variables $Y \in \mathbb{R}^{r \times m}$, latent space dimension $k$.
Process:
  $X_1 = X$
  for $j = 1 \ldots k$
    let $u_j, v_j, \sigma_j$ be the first singular vectors and value of $X_j' Y$
    $X_{j+1} = \left( I - \dfrac{X_j u_j u_j' X_j'}{u_j' X_j' X_j u_j} \right) X_j$
  end
Output: feature directions $u_j$, $j = 1 \ldots k$
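A minimal numpy sketch of the loop above, taking the "first singular vectors" step from a full SVD of $X_j' Y$; shapes are assumed as in the slide ($X$ is $r \times N$, $Y$ is $r \times m$):

```python
import numpy as np

def pls_feature_directions(X, Y, k):
    """Feature directions u_1..u_k from SVDs of X_j' Y, deflating X at each step."""
    Y = np.asarray(Y, dtype=float).reshape(X.shape[0], -1)   # ensure Y is r x m
    Xj = np.asarray(X, dtype=float).copy()
    U = []
    for _ in range(k):
        u, s, vt = np.linalg.svd(Xj.T @ Y)       # first left singular vector of X_j' Y
        uj = u[:, 0]
        t = Xj @ uj                              # score vector X_j u_j
        Xj = Xj - np.outer(t, t) @ Xj / (t @ t)  # X_{j+1} = (I - t t'/(t' t)) X_j
        U.append(uj)
    return np.column_stack(U)                    # N x k matrix of feature directions
```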
•Modeling – Primal kernel PLS
J. Shawe-Taylor, N. Cristianini, 2004
Input: independent variables $X \in \mathbb{R}^{r \times N}$, dependent variables $Y \in \mathbb{R}^{r \times m}$, latent space dimension $k$.
Process:
  $\mu = \frac{1}{r} X' \mathbf{1}$
  $X_1 = X - \mathbf{1}\mu'$
  for $j = 1 \ldots k$
    $u_j$ = first column of $X_j' Y$
    $u_j = u_j / \lVert u_j \rVert$
    repeat
      $u_j = X_j' Y Y' X_j u_j$
      $u_j = u_j / \lVert u_j \rVert$
    until convergence
    $p_j = \dfrac{X_j' X_j u_j}{u_j' X_j' X_j u_j}$
    $c_j = \dfrac{Y' X_j u_j}{u_j' X_j' X_j u_j}$
    $\hat{Y} = \hat{Y} + X_j u_j c_j'$
    $X_{j+1} = X_j (I - u_j p_j')$
  end
  $W = U (P'U)^{-1} C'$
Output: mean vector $\mu$, training outputs $\hat{Y}$, regression coefficients $W$.
Note: the iterative step is replaced with an SVD + eigenvalue depletion check.
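For reference, a minimal numpy transcription of the primal loop above; it keeps the plain power iteration, purely for illustration, rather than the SVD/eigenvalue-depletion variant, and assumes $Y$ has shape $(r, m)$:

```python
import numpy as np

def primal_pls(X, Y, k, n_iter=100, tol=1e-10):
    """Primal PLS regression after Shawe-Taylor & Cristianini (2004);
    an illustrative re-implementation, not the MAPS code."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(X.shape[0], -1)   # ensure Y is r x m
    mu = X.mean(axis=0)
    Xj = X - mu                                  # centre the descriptors
    Yhat = np.zeros_like(Y)
    U, P, C = [], [], []
    for _ in range(k):
        u = (Xj.T @ Y)[:, 0]
        u /= np.linalg.norm(u)
        for _ in range(n_iter):                  # power iteration on X_j' Y Y' X_j
            u_new = Xj.T @ (Y @ (Y.T @ (Xj @ u)))
            u_new /= np.linalg.norm(u_new)
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        t = Xj @ u                               # score vector
        denom = t @ t
        p = Xj.T @ t / denom                     # loading p_j
        c = Y.T @ t / denom                      # output loading c_j
        Yhat = Yhat + np.outer(t, c)             # accumulate fitted training outputs
        Xj = Xj - np.outer(t, p)                 # X_{j+1} = X_j (I - u_j p_j')
        U.append(u); P.append(p); C.append(c)
    U, P, C = (np.column_stack(m) for m in (U, P, C))
    W = U @ np.linalg.inv(P.T @ U) @ C.T         # regression coefficients
    return mu, Yhat, W                           # predict new data as (X_new - mu) @ W
```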
•Modeling – Evolutionary model search (1)
● Formulate descriptor selection problem as a stochastic evolutionary optimization
● Genes represent descriptors, with “on”/”off” alleles
● Cycle over a gene pool (a toy sketch follows after this slide):
□ Selection, "survival of the fittest"
□ Mating
□ Mutation
□ Reproduction
● Phenotype = regression model
□ "on" genes represent active descriptors
● Fitness measured by
□ Leave-one-out PLS prediction over the training set
□ Model length
See more on eodev.sourceforge.net
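The slide points to the EO (Evolving Objects) C++ library; the following is only a toy Python stand-in showing the cycle described above, with genes as on/off descriptor flags. The function name and the `fitness(X, y, mask)` signature are made up for the illustration; in practice the fitness would combine a leave-one-out PLS error with the model length, e.g. via the $F$ function on the next slide.

```python
import numpy as np

def evolve(X, y, fitness, pop_size=20, n_gen=200, p_mut=0.01, seed=0):
    """Toy binary GA over descriptor subsets; 'fitness(X, y, mask)' scores a
    0/1 genome (higher is better)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))          # random on/off genomes
    for _ in range(n_gen):
        scores = np.array([fitness(X, y, m) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # survival of the fittest
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                      # one-point crossover (mating)
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut                  # mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents] + children)             # reproduction: next generation
    return pop[np.argmax([fitness(X, y, m) for m in pop])]
```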
•Modeling – Evolutionary model search (2)
$$F(\varepsilon, d, t) = \exp\left( -\alpha\,\varepsilon^{\rho} - \left[ (d - \delta)\,\frac{1 + \operatorname{sign}(d - \delta)}{2} \right]^{\beta} \left( \frac{t}{T} \right)^{\lambda} \right)$$

Phenotype:
  $\varepsilon$: prediction error
  $d$: number of descriptors
  $t$: evolutionary time
Evolutionary parameters:
  $\alpha$: error coefficient
  $\rho$: error exponent
  $\beta$: descriptor constraint coefficient
  $\delta$: constraint threshold
  $T$: maximum generations
  $\lambda$: annealing exponent
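Read literally, this fitness rewards low prediction error and, increasingly as the run progresses via the $(t/T)^{\lambda}$ annealing factor, penalizes models that use more than $\delta$ descriptors. A direct transcription of the formula as reconstructed above (the default parameter values are arbitrary placeholders, not the values used in the study):

```python
import numpy as np

def ga_fitness(eps, d, t, alpha=1.0, rho=1.0, beta=1.0, delta=8, T=200, lam=1.0):
    """F(eps, d, t) as reconstructed above; defaults are placeholders."""
    # The length penalty is active only once d exceeds the threshold delta,
    # and it is annealed in with the (t/T)**lam factor.
    excess = (d - delta) * (1 + np.sign(d - delta)) / 2
    return np.exp(-alpha * eps**rho - excess**beta * (t / T) ** lam)
```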
•Fitness (1) - prediction error 𝜺, 𝜹
•Fitness (2) - model size 𝜷, 𝜹
•Fitness (3) - simulated annealing 𝑻, 𝝀
● Brings fitness and model length together
□ Initial surviving phenotypes have higher RMSE
□ Later ones also have a shorter length
•Modeling – Evolutionary search (2)
[Charts: one objective (minimum error) vs. two objectives (minimum error and number of terms)]
● Example from modeling acrylate glass transition temperature
□ 18 acrylates, 403 descriptors
□ Population of 20 models, 200 generations
□ 12 PLS latent vectors
•Some models (1)
AVS_Dt, ITH, P_VSA_LogP_5, P_VSA_i_3, P_VSA_p_2, SAtot, SpAD_EA(bo), SpDiam_Dt
(229.088), 1.50844, 2.10674, -0.908854, 0.228069, 0.551813, -1.04576, 1.28524, -0.951941
Nr. factors 6
Nr. desc. 8
RMSE 34.9458
Std. error 37.2780
R2 0.6500
aR2 0.6009
PRESS 80599.8
LOO RMSE 34.9060
LOO R2 0.6505
● Results for a random 3× subsampled training/test split
● Sample model with 8 descriptors
□ Descriptors, with the intercept term in parentheses
□ Statistics on the training set
□ Left chart: prediction vs. training set
□ Right chart: prediction vs. test set
•Some models (2)
AVS_Dt, F04[C-O], H_Dz(p), P_VSA_LogP_5, P_VSA_p_2, SAtot, SpAD_D/Dt, SpAD_EA(bo), SpDiam_Dt
(102.3965), 2.7674, -1.4713, -2.4209, -0.9823, 0.5917, -0.4680, 2.3060, 6.4407, -1.7489
Nr. factors 6
Nr. desc. 9
RMSE 32.7377
Std. error 35.2276
R2 0.6928
aR2 0.6435
PRESS 70736
LOO RMSE 32.6982
LOO R2 0.6933
● Bergström et al., 2003, for the full dataset
□ Averaged consensus model
□ R² = 0.63, RMSE (training) = 35.1 °C, RMSE (test) = 44.6 °C