TRANSCRIPT
•QSAR modeling in MAPS
Principles and applications
Csaba Ferenc Kiss
June 2011, Athens
•Some terms
● QSAR – Quantitative structure–activity relationships
● Dependent (𝐲), independent (𝐱) variables
● Continuous, discrete, categorical, dichotomous
● $y = X\beta + \epsilon$, with $y \in \mathbb{R}^{r}$, $X \in \mathbb{R}^{r \times D}$
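As a concrete reading of $y = X\beta + \epsilon$, a minimal ordinary-least-squares fit in numpy on synthetic data (shapes $r$ and $D$ as above; the values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
r, D = 50, 5                                   # r samples, D descriptors
X = rng.normal(size=(r, D))                    # independent variables
beta = rng.normal(size=D)                      # "true" coefficients
y = X @ beta + rng.normal(scale=0.1, size=r)   # dependent variable with noise

# Ordinary least-squares estimate of beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```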
•Experimental dataset
● Predicting melting point for drug-like molecules
□ in J. Chem. Inf. Comput. Sci., 2003, 43 (4), pp. 1177–1185
● SMILES from the 12th Merck Index
□ 185 training structures
□ 92 validation structures
□ Melting temperatures 140-160 °C
● 566 descriptors computed with
□ Molconn-Z, Selma, BatchMin, MacroModel, Marea …
[Histogram: melting point distribution of the dataset, bins from 45 °C to above 311 °C]
•Dragon descriptors
● Talete s.r.l, www.talete.mi.it
● About 4885 descriptors
● In 29 logical groups:
□ Constitutional
□ Ring
□ Topological indices
□ Walk and path counts, Connectivity indices
□ 2D/3D matrix and autocorrelations
□ RDF, 3D-MoRSE
□ Molecular properties
□ Drug-like indices
□ & much more
•QSAR Study Table
x1       x2       ⋯    xn       y
1.72     32.08    ⋯    9.01     1.72
-5.79    11.09    ⋯    12.01    -5.79
⋮        ⋮             ⋮        ⋮
43.13    10.00    ⋯    7.21     43.13

4885 Dragon6 descriptors
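A study table of this shape maps naturally onto a data frame. A minimal pandas sketch with placeholder column names (the real table would hold 4885 Dragon descriptor columns, not the three shown here):

```python
import pandas as pd

# Hypothetical study table: descriptor columns x1..xn plus the response y.
study = pd.DataFrame({
    "x1": [1.72, -5.79, 43.13],
    "x2": [32.08, 11.09, 10.00],
    "xn": [9.01, 12.01, 7.21],
    "y":  [1.72, -5.79, 43.13],
})
X = study.drop(columns="y").to_numpy()   # descriptor matrix
y = study["y"].to_numpy()                # response vector
```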
•Linear regression assumptions
● Ideally
□ More samples than predictors
□ i.i.d. Gaussian errors, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$
□ Independent descriptors
● But in reality
□ Few structures
□ Multicollinearity
□ Skewed, peaked distributions
● See
□ D. C. Montgomery, E. A. Peck, G. G. Vining, 2006
□ M. H. Kutner, J. Neter et al., 2004
□ B. G. Tabachnick, L. S. Fidell, 2006
[Histogram: example HOMO descriptor distribution; skewness = -1.96, kurtosis = 4.04]
•Learning about the predictors
● Modeling can only succeed for structurally similar molecules
● Molecular classifiers
□ Rules of assigning group labels in the Study Table
□ Model all or just a labeled subset
● Unsupervised learning
□ Hierarchical clustering
□ Distance: Euclidean, Mahalanobis, Minkowski, Pearson, etc.
□ Clustering: single/complete linkage, Ward, etc. (a minimal sketch follows below)
● See more in
□ A. Webb, 2002
□ R. O. Duda, Hart, Stork, 2001
□ T. Hastie, R. Tibshirani, J. Friedman, 2001
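A minimal scipy sketch of the clustering just described (Euclidean distances, complete linkage), run on random stand-in data with the same shape as the example tree on the next slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(85, 76))        # 85 molecules in a 76-dimensional descriptor space

d = pdist(X, metric="euclidean")     # pairwise Euclidean distances (condensed form)
Z = linkage(d, method="complete")    # complete-linkage hierarchical clustering

labels = fcluster(Z, t=5, criterion="maxclust")   # cut the tree into 5 clusters
print(np.bincount(labels)[1:])                    # cluster sizes (labels start at 1)
```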
•Implication of grouping by similarity
•Example Cluster tree (1)
● 85 molecules
● 76-dimensional chemical space
● Distance metric: Euclidean
● Clustering method: complete linkage
● Hierarchical graph (layout "dot")
•Example Cluster tree (2)
● Layout "neato"
□ Kamada-Kawai algorithm
See more on www.graphviz.org
•Screening data
● Missing values
□ Expert knowledge, mean, pairwise regression
● Descriptive statistics
□ Checking normality constraints: skewness, kurtosis
● Multicollinearity
□ Covariance/correlation matrix
□ Pairwise correlation is not always enough
● Transformations to enforce constraints
□ Linearize (log, polynomial, etc.)
□ Replace highly correlated groups with a surrogate variable
● Outlier detection
□ Mahalanobis distance, $x' C^{-1} x$
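For the Mahalanobis criterion $x' C^{-1} x$, a minimal numpy sketch on synthetic stand-in data (the percentile cut-off is an arbitrary illustration, not a rule from the tool):

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance x' C^-1 x of each row, after centering."""
    Xc = X - X.mean(axis=0)
    C_inv = np.linalg.pinv(np.cov(Xc, rowvar=False))   # pseudo-inverse: C may be singular
    return np.einsum("ij,jk,ik->i", Xc, C_inv, Xc)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))                         # synthetic stand-in data
d2 = mahalanobis_sq(X)
outliers = np.where(d2 > np.percentile(d2, 97.5))[0]   # arbitrary cut-off for illustration
print(outliers)
```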
•Descriptor selection
● Zero variance or missing values
□ From 4885 down to 3600
● Non-normal descriptors eliminated
□ From 3600 down to 1600
● Significant pairwise correlation
□ From 1600 down to 600
● Further iterative elimination
□ Finally 76 (a filtering sketch follows below)
Round numbers are used only for simplicity.
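A sketch of such a filtering cascade in numpy/scipy; the thresholds, and the exact rules applied at each step, are placeholders chosen only to illustrate the idea, not the settings used to get from 4885 to 76:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def filter_descriptors(X, skew_max=2.0, kurt_max=5.0, corr_max=0.95):
    """Illustrative descriptor-filtering cascade; thresholds are placeholders."""
    # 1. Drop zero-variance descriptors and descriptors with missing values
    keep = (np.nanstd(X, axis=0) > 0) & ~np.isnan(X).any(axis=0)
    X = X[:, keep]
    # 2. Drop strongly non-normal descriptors (skewness / excess kurtosis screen)
    ok = (np.abs(skew(X, axis=0)) < skew_max) & (np.abs(kurtosis(X, axis=0)) < kurt_max)
    X = X[:, ok]
    # 3. Drop one member of each highly correlated descriptor pair
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    for i in range(corr.shape[0]):
        if i in drop:
            continue
        for j in range(i + 1, corr.shape[1]):
            if corr[i, j] > corr_max:
                drop.add(j)
    return np.delete(X, sorted(drop), axis=1)
```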
•Modeling with PLS
● Exploits the covariance between the independent and dependent variables at the same time, in a common latent space
□ PCR cares only about the predictors
● Copes well with small sample sets, high multicollinearity
● Basic idea
Input: independent variables $X \in \mathbb{R}^{r \times N}$, dependent variables $Y \in \mathbb{R}^{r \times m}$, latent space dimension $k$.
Process:
  $X_1 = X$
  for $j = 1 \ldots k$
    let $u_j, v_j, \sigma_j$ be the first singular vectors and value of $X_j' Y$
    $X_{j+1} = \left( I - \dfrac{X_j u_j u_j' X_j'}{u_j' X_j' X_j u_j} \right) X_j$
  end
Output: feature directions $u_j$, $j = 1 \ldots k$
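A minimal numpy sketch of the loop above, taking the "first singular vectors" step from a full SVD of $X_j' Y$; shapes are assumed as in the slide ($X$ is $r \times N$, $Y$ is $r \times m$):

```python
import numpy as np

def pls_feature_directions(X, Y, k):
    """Feature directions u_1..u_k from SVDs of X_j' Y, deflating X at each step."""
    Y = np.asarray(Y, dtype=float).reshape(X.shape[0], -1)   # ensure Y is r x m
    Xj = np.asarray(X, dtype=float).copy()
    U = []
    for _ in range(k):
        u, s, vt = np.linalg.svd(Xj.T @ Y)       # first left singular vector of X_j' Y
        uj = u[:, 0]
        t = Xj @ uj                              # score vector X_j u_j
        Xj = Xj - np.outer(t, t) @ Xj / (t @ t)  # X_{j+1} = (I - t t'/(t' t)) X_j
        U.append(uj)
    return np.column_stack(U)                    # N x k matrix of feature directions
```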
•Modeling – Primal kernel PLS
J. Shawe-Taylor, N. Cristianini, 2004
Input: independent variables $X \in \mathbb{R}^{r \times N}$, dependent variables $Y \in \mathbb{R}^{r \times m}$, latent space dimension $k$.
Process:
  $\mu = \frac{1}{r} X' \mathbf{1}$
  $X_1 = X - \mathbf{1}\mu'$
  for $j = 1 \ldots k$
    $u_j$ = first column of $X_j' Y$
    $u_j = u_j / \lVert u_j \rVert$
    repeat
      $u_j = X_j' Y Y' X_j u_j$
      $u_j = u_j / \lVert u_j \rVert$
    until convergence
    $p_j = \dfrac{X_j' X_j u_j}{u_j' X_j' X_j u_j}$
    $c_j = \dfrac{Y' X_j u_j}{u_j' X_j' X_j u_j}$
    $\hat{Y} = \hat{Y} + X_j u_j c_j'$
    $X_{j+1} = X_j (I - u_j p_j')$
  end
  $W = U (P'U)^{-1} C'$
Output: mean vector $\mu$, training outputs $\hat{Y}$, regression coefficients $W$.
Note: the iterative step is replaced with an SVD + eigenvalue depletion check.
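For reference, a minimal numpy transcription of the primal loop above; it keeps the plain power iteration, purely for illustration, rather than the SVD/eigenvalue-depletion variant, and assumes $Y$ has shape $(r, m)$:

```python
import numpy as np

def primal_pls(X, Y, k, n_iter=100, tol=1e-10):
    """Primal PLS regression after Shawe-Taylor & Cristianini (2004);
    an illustrative re-implementation, not the MAPS code."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(X.shape[0], -1)   # ensure Y is r x m
    mu = X.mean(axis=0)
    Xj = X - mu                                  # centre the descriptors
    Yhat = np.zeros_like(Y)
    U, P, C = [], [], []
    for _ in range(k):
        u = (Xj.T @ Y)[:, 0]
        u /= np.linalg.norm(u)
        for _ in range(n_iter):                  # power iteration on X_j' Y Y' X_j
            u_new = Xj.T @ (Y @ (Y.T @ (Xj @ u)))
            u_new /= np.linalg.norm(u_new)
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        t = Xj @ u                               # score vector
        denom = t @ t
        p = Xj.T @ t / denom                     # loading p_j
        c = Y.T @ t / denom                      # output loading c_j
        Yhat = Yhat + np.outer(t, c)             # accumulate fitted training outputs
        Xj = Xj - np.outer(t, p)                 # X_{j+1} = X_j (I - u_j p_j')
        U.append(u); P.append(p); C.append(c)
    U, P, C = (np.column_stack(m) for m in (U, P, C))
    W = U @ np.linalg.inv(P.T @ U) @ C.T         # regression coefficients
    return mu, Yhat, W                           # predict new data as (X_new - mu) @ W
```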
•Modeling – Evolutionary model search (1)
● Formulate descriptor selection problem as a stochastic evolutionary optimization
● Genes represent descriptors, with “on”/”off” alleles
● Cycle over a gene pool (a toy sketch follows after this slide):
□ Selection, "survival of the fittest"
□ Mating
□ Mutation
□ Reproduction
● Phenotype = regression model
□ "on" genes represent active descriptors
● Fitness measured by
□ Leave-one-out PLS prediction over the training set
□ Model length
See more on eodev.sourceforge.net
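The slide points to the EO (Evolving Objects) C++ library; the following is only a toy Python stand-in showing the cycle described above, with genes as on/off descriptor flags. The function name and the `fitness(X, y, mask)` signature are made up for the illustration; in practice the fitness would combine a leave-one-out PLS error with the model length, e.g. via the $F$ function on the next slide.

```python
import numpy as np

def evolve(X, y, fitness, pop_size=20, n_gen=200, p_mut=0.01, seed=0):
    """Toy binary GA over descriptor subsets; 'fitness(X, y, mask)' scores a
    0/1 genome (higher is better)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))          # random on/off genomes
    for _ in range(n_gen):
        scores = np.array([fitness(X, y, m) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # survival of the fittest
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                      # one-point crossover (mating)
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut                  # mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents] + children)             # reproduction: next generation
    return pop[np.argmax([fitness(X, y, m) for m in pop])]
```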
•Modeling – Evolutionary model search (2)
$$F(\varepsilon, d, t) = \exp\left( -\alpha\,\varepsilon^{\rho} - \left[ (d - \delta)\,\frac{1 + \operatorname{sign}(d - \delta)}{2} \right]^{\beta} \left( \frac{t}{T} \right)^{\lambda} \right)$$

Phenotype:
  $\varepsilon$: prediction error
  $d$: number of descriptors
  $t$: evolutionary time
Evolutionary parameters:
  $\alpha$: error coefficient
  $\rho$: error exponent
  $\beta$: descriptor constraint coefficient
  $\delta$: constraint threshold
  $T$: maximum generations
  $\lambda$: annealing exponent
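Read literally, this fitness rewards low prediction error and, increasingly as the run progresses via the $(t/T)^{\lambda}$ annealing factor, penalizes models that use more than $\delta$ descriptors. A direct transcription of the formula as reconstructed above (the default parameter values are arbitrary placeholders, not the values used in the study):

```python
import numpy as np

def ga_fitness(eps, d, t, alpha=1.0, rho=1.0, beta=1.0, delta=8, T=200, lam=1.0):
    """F(eps, d, t) as reconstructed above; defaults are placeholders."""
    # The length penalty is active only once d exceeds the threshold delta,
    # and it is annealed in with the (t/T)**lam factor.
    excess = (d - delta) * (1 + np.sign(d - delta)) / 2
    return np.exp(-alpha * eps**rho - excess**beta * (t / T) ** lam)
```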
•Fitness (1) - prediction error 𝜺, 𝜹
•Fitness (2) - model size 𝜷, 𝜹
•Fitness (3) - simulated annealing 𝑻, 𝝀
● Brings fitness and model length together
□ Initial surviving phenotypes have higher RMSE
□ Later ones also have a shorter length
•Modeling – Evolutionary search (2)
[Charts: one objective (minimum error) vs. two objectives (minimum error and number of terms)]
● Example from modeling acrylate glass transition temperature
□ 18 acrylates, 403 descriptors
□ Population of 20 models, 200 generations
□ 12 PLS latent vectors
•Some models (1)
AVS_Dt, ITH, P_VSA_LogP_5, P_VSA_i_3, P_VSA_p_2, SAtot, SpAD_EA(bo), SpDiam_Dt
(229.088), 1.50844, 2.10674, -0.908854, 0.228069, 0.551813, -1.04576, 1.28524, -0.951941
Nr. factors 6
Nr. desc. 8
RMSE 34.9458
Std. error 37.2780
R2 0.6500
aR2 0.6009
PRESS 80599.8
LOO RMSE 34.9060
LOO R2 0.6505
● Results for a random 3× subsampled training/test split
● Sample model with 8 descriptors
□ Descriptors, with the intercept term in parentheses
□ Statistics on the training set
□ Left chart: prediction vs. training set
□ Right chart: prediction vs. test set
•Some models (2)
AVS_Dt, F04[C-O], H_Dz(p), P_VSA_LogP_5, P_VSA_p_2, SAtot, SpAD_D/Dt, SpAD_EA(bo), SpDiam_Dt
(102.3965), 2.7674, -1.4713, -2.4209, -0.9823, 0.5917, -0.4680, 2.3060, 6.4407, -1.7489
Nr. factors 6
Nr. desc. 9
RMSE 32.7377
Std. error 35.2276
R2 0.6928
aR2 0.6435
PRESS 70736
LOO RMSE 32.6982
LOO R2 0.6933
● Bergström et al., 2003, for the full dataset
□ Averaged consensus model
□ R² = 0.63, RMSE (training) = 35.1 °C, RMSE (test) = 44.6 °C