

ABSTRACT

The behavior and fate of chemicals in the environment are strongly influenced by the inherent properties of the compounds themselves, particularly by basic physico-chemical properties such as solubility in water, vapour pressure, melting point, boiling point, flash point and density. Knowledge of these properties is of fundamental interest in risk assessment studies, and is a specific requirement of the EU-Directive “White Paper on a strategy for a future Community Policy for Chemicals”, particularly for High Production Volume (HPV) compounds. In this paper a data set of 153 esters has been studied. Applying the Genetic Algorithm as Variable Subset Selection (GA-VSS) to a wide set of theoretical molecular descriptors covering different structural aspects (1D-, 2D- and 3D-descriptors, calculated with the Dragon software) produces highly predictive models of the studied physico-chemical properties. The best linear models, obtained by Ordinary Least Squares (OLS) regression, were validated for predictivity both internally, using leave-one-out (Q² = 78-94%), leave-many-out (30-50%) and Y-scrambling, and externally (Q²EXT = 88-94%, except for the melting point model). The data set was split into a training and an evaluation set by D-optimal Experimental Design. The reliability of the predictions was checked by the leverage approach in order to verify the chemical domain of the models. The proposed class-specific QSAR models give fast access to the physico-chemical properties of existing esters. The approach could also be usefully applied to new chemicals, even those not yet synthesised, as it is based simply on knowledge of the molecular structure.

INTRODUCTION

Esters are an important class of industrial chemicals, and some of them are High Production Volume (HPV) compounds, i.e. compounds with a production volume of at least 1,000 tonnes/year. The EU-Directive “White Paper on a strategy for a future Community Policy for Chemicals” is directed towards such compounds, requiring physico-chemical data by the end of 2005 at the latest [1]. Experimental testing is both costly and time-consuming, and the systematic determination of missing data by laboratory experiments would place an enormous economic burden on industry and regulatory authorities [2]. To overcome the lack of data on the physico-chemical properties and environmental fate of organic chemicals in environmental risk assessment, quantitative structure-property/activity relationships between descriptors of chemical compounds and their physical, chemical and biological properties have been extensively studied in recent years. The aim of this study was to develop QSAR models for the rapid prediction of some physico-chemical properties of esters.

MATERIALS & METHODS

EXPERIMENTAL DATA: The experimental physico-chemical data for 153 esters were taken from the literature [3, 4]. The data were measured at 20-25°C and 1 atm. The end-points studied were: solubility in water, vapour pressure, melting point, boiling point, flash point and density. Solubility and vapour pressure data are expressed on a logarithmic scale.

MOLECULAR DESCRIPTORS: The molecular structures of the studied compounds were described by several molecular descriptors calculated with the software DRAGON of Todeschini et al. [5]. A total of 1198 molecular descriptors of different kinds were calculated to describe the chemical diversity of the compounds. Constant values and pair-correlated descriptors (correlation threshold 0.98) were excluded, leaving 422 molecular descriptors for the variable selection by GA.

The descriptor typology is:

0D: Constitutional descriptors (atom and group counts).

1D: Functional groups, atom-centred fragments and empirical descriptors.

2D: BCUTs, Galvez indices from the adjacency matrix, walk counts, various autocorrelations from the molecular graph and topological descriptors.

3D: Randic molecular profiles from the geometry matrix, WHIMs [6-7], GETAWAY [8] and geometrical descriptors.
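The pre-filtering step described above (dropping constant and pair-correlated descriptors before variable selection) can be sketched as follows. This is a minimal illustration, not the DRAGON or MOBY-DIGS procedure: the function names, the toy descriptor table and the rule of keeping the first descriptor of each correlated pair are assumptions made for the example; only the 0.98 threshold comes from the text.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)

def filter_descriptors(table, r_max=0.98):
    """Return the names of the descriptors that survive the filter.

    table maps a descriptor name to its values over all compounds.
    Assumption: of two correlated descriptors, the one met first is kept.
    """
    kept = []
    for name, values in table.items():
        if max(values) == min(values):
            continue  # constant descriptor carries no information
        if any(abs(pearson(values, table[k])) >= r_max for k in kept):
            continue  # redundant with a descriptor already kept
        kept.append(name)
    return kept

# Hypothetical descriptor values for four compounds.
toy_table = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.1],   # almost proportional to A
    "C": [5.0, 5.0, 5.0, 5.0],   # constant
    "D": [1.0, -1.0, 1.0, -1.0],
}
print(filter_descriptors(toy_table))
```

In the toy table, B is removed because it is nearly collinear with A, and C because it is constant.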

CHEMOMETRIC METHODS: Multiple Linear Regression analysis and variable selection were performed with the software MOBY-DIGS of Todeschini et al. [9], using the Ordinary Least Squares (OLS) regression method and Genetic Algorithm-Variable Subset Selection [10]. The best models were validated in several ways:

• Leave-one-out: each chemical in turn is left out of the training set and predicted

• Leave-many-out: up to 50% randomly selected chemicals are left out of the training set

• Y-scrambling: by random permutation of the responses
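The internal validation checks listed above can be illustrated with a small, self-contained sketch of leave-one-out Q² and Y-scrambling for an OLS model. It assumes the usual definition Q² = 1 - PRESS/TSS; the helper names and the noise-free toy data are hypothetical, and MOBY-DIGS may differ in its details.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (A is assumed to be non-singular)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols_fit(X, y):
    """OLS coefficients [intercept, b1, ...] from the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(ZtZ, Zty)

def predict(coef, row):
    return coef[0] + sum(b * v for b, v in zip(coef[1:], row))

def r2(X, y):
    """Fitted R2 of the OLS model on its own training data."""
    coef = ols_fit(X, y)
    y_mean = sum(y) / len(y)
    rss = sum((yi - predict(coef, xi)) ** 2 for xi, yi in zip(X, y))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - rss / tss

def q2_loo(X, y):
    """Leave-one-out Q2: each compound is predicted by a model fitted without it."""
    y_mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / tss

def y_scrambling(X, y, n_runs=20, seed=0):
    """R2 values of models refitted on randomly permuted responses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        y_perm = y[:]
        rng.shuffle(y_perm)
        scores.append(r2(X, y_perm))
    return scores

# Hypothetical noise-free data: y = 1 + 2*x1 - 3*x2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
     [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in X]
scrambled = y_scrambling(X, y)
print(q2_loo(X, y), sum(scrambled) / len(scrambled))
```

A model that captures a real structure-property relationship should keep a high Q², while the scrambled models should score clearly lower on average; if they do not, the original fit is likely due to chance correlation.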

External validation was performed on a validation set obtained by splitting the original data set at 75% with an Experimental Design procedure (software DOLPHIN of Todeschini et al. [11]). Regression diagnostic tools such as residual plots and Williams plots were used to check the quality of the best models and to define their applicability with regard to the chemical domain, using the chemometric package SCAN [12]. RMS (residual mean squares) values are also reported for comparison of the models with EPIWIN [13].
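The leverage values that a Williams plot shows on its horizontal axis can be computed as in the sketch below, where the leverage of a compound with descriptor vector z (including the intercept term) is h = zᵀ(ZᵀZ)⁻¹z and Z is the training matrix. The warning value h* = 3p'/n, with p' the number of model parameters, is a commonly used convention and is an assumption here; the poster does not state the cut-off it used. Function names and toy data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (A is assumed to be non-singular)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def leverage(X_train, x_new):
    """Leverage of x_new with respect to the training descriptor matrix."""
    Z = [[1.0] + list(row) for row in X_train]
    z = [1.0] + list(x_new)
    p = len(z)
    ZtZ = [[sum(r[i] * r[j] for r in Z) for j in range(p)] for i in range(p)]
    w = solve(ZtZ, z)  # w = (Z'Z)^-1 z
    return sum(zi * wi for zi, wi in zip(z, w))

def warning_leverage(X_train):
    """Assumed cut-off h* = 3 p' / n (p' = descriptors + intercept)."""
    n = len(X_train)
    p_prime = len(X_train[0]) + 1
    return 3.0 * p_prime / n

# Hypothetical one-descriptor training set.
X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
h_star = warning_leverage(X_train)
for x_new in ([3.0], [20.0]):
    h = leverage(X_train, x_new)
    print(x_new, round(h, 3), "inside domain" if h <= h_star else "outside domain")
```

A compound whose leverage exceeds the cut-off lies far from the descriptor space of the training set, so its prediction is an extrapolation and should be treated with caution.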

CONCLUSIONS

New ad hoc predictive models are proposed for physico-chemical properties such as solubility in water, vapour pressure, melting point, boiling point, flash point and density.

These models are based on theoretical molecular descriptors selected by Genetic Algorithm.

All proposed models have good predictive power, verified by very strong internal validation (leave-many-out at 50%) and also by external validation.

Comparison of the residuals shows that our models generally perform better than EPIWIN.

Physico-chemical property values can be predicted for esters belonging to the chemical domain of the models (leverage approach for applicability), including new chemicals that have not yet been synthesized.

REFERENCES

[1] http://europa.eu.int/comm/environmental/chemicals/whitepaper.htm

[2] Gramatica, P. Fine Chemicals and Intermediates technologies (Chemistry Today), 1991, 18-24.

[3] Syracuse Corporation Americana, http://esc.syrres.com

[4] European Commission – Joint Research Centre, IUCLID CD-ROM, 2000.

[5] Todeschini, R.; Consonni, V.; Pavan, E. DRAGON - Software for the calculation of molecular descriptors, rel. 1.12 for Windows, 2002. Free download available at http://www.disat.unimib/chm.

[6] Todeschini, R.; Lasagni, M.; Marengo, E. J. Chemometrics, 1994, 8, 263-273.

[7] Todeschini, R.; Gramatica, P. Quant. Struct.-Act. Relat., 1997, 16, 113-119.

[8] Consonni, V.; Todeschini, R.; Pavan, M. J. Chem. Inf. Comput. Sci., 2002, 42, 693-705.

[9] Todeschini, R. Moby Digs - Software for multilinear regression analysis and variable subset selection by Genetic Algorithm, rel. 2.3 for Windows, Talete srl, Milan (Italy), 2001.

[10] Leardi, R.; Boggia, R.; Terrile, M. J. Chemom., 1992, 6, 267-281.

[11] Todeschini, R.; Mauri, A. DOLPHIN - Software for Optimal Distance-based Experimental Design, rel. 1.1 for Windows, Talete srl, Milan (Italy), 2000.

[12] SCAN - Software for Chemometric Analysis, rel. 1.1 for Windows, Jerll. Inc., Standard, CA, 1992.

[13] EPI Suite 2001, Ver. 3.10, Environmental Protection Agency (http://www.epa.gov).

[14] Wold, S.; Eriksson, L. Chemometric Methods in Molecular Design, 1995, VCH, Germany, 309-318.

[15] Golbraikh, A.; Tropsha, A. J. Mol. Graph and Mod., 2002, 20, 269-276.

QSAR PREDICTION OF PHYSICO-CHEMICAL PROPERTIES OF ESTERS

Gramatica, P., Battaini, F., Papa, E.

QSAR and Environmental Chemistry Research Unit, University of Insubria, Varese (Italy).

Web: http://dipbsf.uninsubria.it/qsar/ e-mail: paola.gramatica@uninsubria.it

RESULTS AND DISCUSSION

The best set of descriptors relevant to the modeling of each response was selected by Genetic Algorithm from the 422 calculated descriptors. The models, always evaluated by optimising their predictive capabilities, were verified for stability and predictivity by internal validation (leave-one-out and leave-many-out) and by permutation of the response (Y-scrambling). The leverage of all the studied compounds was also calculated to check their distance from the experimental space of the model. In order to estimate the true predictive power of the models, the original data sets for solubility in water, vapour pressure, boiling point and density were split into training and test sets to calculate the external Q² [14,15]. The best splitting was obtained here by an Experimental Design procedure using the software DOLPHIN. Table 1 shows the best models for each end-point.
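The external Q² mentioned above can be sketched as below. The poster does not print the formula, so this example assumes one common definition: the prediction error sum of squares on the test set, divided by the test-set deviations from the training-set mean. The function name and the numbers are hypothetical.

```python
def q2_ext(y_test, y_pred, y_train_mean):
    """External Q2 = 1 - PRESS_test / sum((y_test - mean(y_train))^2).

    Assumption: deviations are referred to the training-set mean.
    """
    press = sum((obs - pred) ** 2 for obs, pred in zip(y_test, y_pred))
    ss_ref = sum((obs - y_train_mean) ** 2 for obs in y_test)
    return 1.0 - press / ss_ref

# Hypothetical test-set values and model predictions.
y_test = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(q2_ext(y_test, y_pred, y_train_mean=2.0))
```

Because the test compounds take no part in model fitting, this statistic is a stricter check of predictivity than the internal Q² values.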

Tab.1 – Model Performances (values in %)

End-point         Obj. training   Variables                  R²     Q²     Q²LMO 50%   Q²EXT
Boiling           101             VED2 MATS1e                93.9   93.4   93.2        -
Boiling           63              VED2 MATS1e                92.9   92.2   91.9        93.9
Vapour Pressure   87              IVDM Dv                    94     93.6   93.4        -
Vapour Pressure   57              IVDM Dv                    95.2   94.6   94.6        90.5
Solubility        100             Mv Me ATS1v                90.9   90     89.5        -
Solubility        66              Mv Me ATS1v                91.6   90.3   89.7        88.2
Density           97              AMW IVDM                   91     90.5   90.3        -
Density           64              AMW IVDM                   89.2   88.3   87.9        90.6
Melting           114             nN SIC3 GATS1v GATS1e      81.8   79.9   78.6        -
Flash Point       48              VEp1 MATS1m                92.5   91.5   91.14       -

Tab.2 – Comparison of models

End-point         Obj. training   Variables                  RMS from our model   RMS from EPIWIN
Boiling           101             VED2 MATS1e                21.32                22.91
Vapour Pressure   87              IVDM Dv                    0.70                 0.77
Solubility        100             Mv Me ATS1v                0.70                 1.16
Melting           114             nN SIC3 GATS1v GATS1e      25.30                62.10

The regression lines of the externally validated models are reported (outliers among the training and test set chemicals are highlighted). Comparison of the residuals of the different models (Tab.2) shows that the EPIWIN models perform similarly to our models in predicting boiling point and vapour pressure, but give a larger RMS for solubility and melting point.

This result appears satisfactory considering that the EPIWIN models were obtained on a training set much bigger than our data set. For the other end-points (flash point and density) no comparison is possible, as EPIWIN does not include them.

BOILING POINT

Boiling point = 614.1 - 1385.2 VED2 - 369.5 MATS1e

[Plot: predicted vs. experimental boiling point; training (63 obj.), test (38 obj.). Labelled compounds: n-butyl carbonate, benzyl benzoate, ethyl bromoacetate.]

DENSITY

Density = 0.07 + 0.11 AMW + 0.05 IVDM

[Plot: predicted vs. experimental density; training (64 obj.), test (33 obj.). Labelled compounds: dicyclohexyl phthalate, ethyl lactate.]

VAPOUR PRESSURE

Vapour Pressure = 12.6 - 2.9 IVDM - 9.9 Dv

[Plot: predicted vs. experimental vapour pressure; training (30 obj.), test (57 obj.). Labelled compounds: methyl formate, cyclohexyl acetate, oxamyl, bis(2-ethylhexyl) adipate.]

SOLUBILITY

Log Solubility = -23.4 - 20 Mv + 37.3 Me - 0.1 ATS1v

[Plot: predicted vs. experimental solubility; training (66 obj.), test (34 obj.). Labelled compounds: benzyl acetate, 2-hydroxypropyl acrylate, oxamyl.]
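The regression equations reported with the plots can be applied directly once the DRAGON descriptors of an ester are known. The sketch below simply encodes the four published equations; the dictionary layout, the function name and the descriptor values in the example are hypothetical, and a prediction is only meaningful for compounds inside the chemical domain of the corresponding model.

```python
# Intercept and descriptor coefficients as printed with the four plots.
MODELS = {
    "boiling_point": (614.1, {"VED2": -1385.2, "MATS1e": -369.5}),
    "density": (0.07, {"AMW": 0.11, "IVDM": 0.05}),
    "vapour_pressure": (12.6, {"IVDM": -2.9, "Dv": -9.9}),        # log scale
    "log_solubility": (-23.4, {"Mv": -20.0, "Me": 37.3, "ATS1v": -0.1}),
}

def predict_property(end_point, descriptors):
    """Evaluate the linear model of one end-point for a single compound.

    descriptors maps descriptor names to their values; a missing
    descriptor raises KeyError rather than being silently set to zero.
    """
    intercept, coefficients = MODELS[end_point]
    return intercept + sum(c * descriptors[name] for name, c in coefficients.items())

# Hypothetical descriptor values, for illustration only.
example = {"AMW": 8.0, "IVDM": 2.0}
print(round(predict_property("density", example), 2))
```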