Available online at www.sciencedirect.com

Journal of Chromatography A, 1185 (2008) 49–58

Simple method for the prediction of the separation of racemates with high-performance liquid chromatography on Whelk-O1 chiral stationary phase

Alberto Del Rio ∗, Johann Gasteiger

Computer-Chemie-Centrum, Universität Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052 Erlangen, Germany

Received 29 August 2007; received in revised form 13 December 2007; accepted 7 January 2008
Available online 19 January 2008

∗ Corresponding author. Present address: Molecular Modelling & Drug Design Lab, Dipartimento di Scienze Farmaceutiche, Università di Modena e Reggio Emilia, Via Campi 183, 41100 Modena, Italy. Tel.: +39 059 2055122; fax: +39 059 2055131. E-mail address: [email protected] (A. Del Rio).

0021-9673/$ – see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.chroma.2008.01.034

Abstract

A simple method for the prediction of whether or not a racemate can be separated on a Whelk-O1 chiral stationary phase has been developed. In this approach, molecules are represented by counting the number of atom types in the neighbor spheres of the chiral center. A decision tree is then used to decide, based on a few of these atom count descriptors, whether a given racemate can be separated. High rates of correct prediction were obtained, namely more than 94% for the training sets and about 90% in cross-validation. The same rate of correct prediction was also obtained on an external data set. The descriptors can be rapidly and easily retrieved by just counting the atom types around the chiral center by inspecting the chemical diagram of the molecule. Furthermore, the decision tree model can be applied through a small set of rules that predict whether or not a racemate is separated. Due to its computational simplicity, the procedure is of interest for experimentalists who need to make a rapid assessment of the separation without having to program or input complex formulas.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Chemoinformatic predictions; Topological descriptors; Whelk-O1; Chiral HPLC; Chiral separations; Separation factor; Chiral recognition; ChirBase; Decision trees

1. Introduction

In the last few decades chemoinformatics, molecular modeling and quantum chemistry techniques have been successfully used to address chirality-related problems [1,2]. These calculations constitute the basis of reliable and interesting results in the field of chiral recognition [3,4]. While the success of molecular modeling and first-principle calculations can be fully acknowledged only when fairly small molecular structures are involved, chemoinformatic procedures have gained great importance due to the availability of an increasing number of experimental data [5,6].

These issues are of great importance in the context of the increased interest in and availability of chiral chromatography techniques. Chiral HPLC has gained a great deal of valuable experience not only for the analytical and preparative separations of enantiomers but also for the investigation of chiral recognition processes. Indeed, chiral HPLC is today the most widely used technology to separate racemates, as shown by the thousands of articles published each year on the subject and collected in the ChirBase database [5,7–9].

With molecular modeling calculations in which host–guest interactions are directly taken into account it is possible to thoroughly tackle problems related to the separation of racemates, such as the prediction of the enantioselectivities and the assignment of the absolute configuration, as well as the elucidation of the mechanism of enantioselective recognition [3,10–16]. Unfortunately, these computational techniques can be applied only in a few cases because the stationary phases often have a complex chemical structure. This makes the calculation expensive in terms of computation time and in some cases not reliable.

To circumvent these problems it is possible to perform calculations in which only the ligand structures are considered. In this sense, the main efforts of the scientists in the past years went into building relationships between some molecular descriptors of




the compounds to be separated and the experimental data available. Different models suitable for predicting the experimental data are already present in the literature [17–26]. Other approaches based on probability rules [27] and factorial design [28] were also successfully developed. Most of these interesting studies calculate some molecular descriptors and, with the aid of more or less complex modeling techniques, build models that can be applied to the prediction of the results for new compounds. The most critical interest is in establishing whether it is possible to achieve a separation of given racemates under given experimental conditions by modeling, for instance, the HPLC separation factor. The separation factor (α) is defined as the ratio k2/k1, where k2 and k1 are the retention factors of the second and first eluted enantiomers, respectively [29].

As previously shown by Del Rio et al., the separation of racemates can be related to achiral molecular descriptors [3,17,30]. This is true because the power of separation is a result of the differential binding of the two enantiomers with the chiral stationary phase (CSP) that relates to the constitution of the racemates and not to the intrinsic chirality of the two enantiomers.

In this paper we describe a very fast and intuitive tool that can be easily applied by experimentalists to determine whether a given racemate can be separated on a Whelk-O1 chiral stationary phase. This CSP, conceived by Pirkle et al. in the 1990s [31–33], has been studied with several experimental techniques [34–38] as well as with X-rays [39] and NMR [40,41] to elucidate the mechanism of enantioselective recognition.

Although these mechanisms are better understood for the Whelk-O1 CSP than for all other available CSPs, little is known about how to predict whether a separation will take place for a given analyte, and no very simple computational model has yet been conceived.

2. Experimental

2.1. Molecular descriptors

Several experimental and theoretical studies have already shown that the ligands must have minimum structural requirements for the interaction points with the stationary phase in order to achieve a chiral separation [1,10]. Chemoinformatic studies were carried out to try to keep track of all these interactions and predict the experimental enantioselectivities [17,24–26]. However, in most cases the resulting models, while insightful and robust, are of limited practical interest for the experimenter who needs an easy and immediate tool to establish whether a compound is separable or not. Moreover, many existing ad hoc descriptors are not always easy to interpret and have limited applicability when used with diverse or many data sets.

In order to keep track of the minimum requirements to achieve a chiral separation in an extremely simple way we investigated new achiral descriptors based on counting atoms around the stereogenic center. The main idea behind these new descriptors is to calculate atomic indices at different bond distances from the stereogenic center.

Fig. 1 shows this idea with the example molecule 1.

Fig. 1. Example molecule (1) with the central atom shown (a) and the bond distance considered (b).

At first, the stereogenic center is identified (Fig. 1a). At present, the method deals only with molecules having central chirality with one stereocenter.

Starting from the chiral center, the neighboring atoms are labeled following the respective bond distances (Fig. 1b). That is, atoms directly bonded to the chiral center have bond distance 1, those connected to the atoms directly bonded to the chiral center have distance 2, and so forth.

In order to give a faithful account of the topology of all the molecules of the data set we considered atoms up to 10 bond distances away from the chiral center. This bond distance is necessary to unambiguously differentiate each molecule and obtain different sets of descriptors for molecules that are different. Thus, in the example of Fig. 1 all the atoms of the molecule 1 are considered since the farthest atoms around the chiral center have a bond distance of 8.

In a data mining study it has been shown that each CSP needs certain points of preference on the ligand to achieve a chiral separation [30]. For instance, a CSP like Whelk-O1 requires the ligand to have one acceptor and one donor of a hydrogen bond as well as aromatic and lipophilic sites.

In order to reproduce the effects of these interaction sites at the atomic level we considered different atom types that were used in combination with the bond distances of Fig. 1b. These atom types are summarized in Table 1.

Table 1
Atom types considered

Description of the atom type    Label used
Total atom count                ac
Hydrogen atom count             H
Carbon atom count               C
Nitrogen atom count             N
Oxygen atom count               O
Sulphur atom count              S
Halogens atom count             Hal


Table 2
Calculated descriptors for the molecule of Fig. 1

        Topological distances
        1    2    3    4    5    6    7    8    9    10
ac      4    7    5    3    1    2    6    2    0    0
H       1    3    3    1    0    0    4    2    0    0
C       3    2    2    1    0    2    2    0    0    0
N       0    0    0    0    1    0    0    0    0    0
O       0    2    0    0    0    0    0    0    0    0
S       0    0    0    0    0    0    0    0    0    0
Hal     0    0    0    1    0    0    0    0    0    0

For each bond distance from 1 to 10 all the atom types of Table 1 are calculated, giving a unique vector of 70 descriptors per molecule.

Table 2 shows these 70 descriptors calculated for the example molecule of Fig. 1.

As can be seen, the whole set of descriptors gives an account of the main connectivity properties at fixed-length bond distances referred to the chiral center of the molecule.
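The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation used by the authors: the function name `shell_descriptors`, the bond-list input format and the use of the shortest path as the bond distance are assumptions, and the example input (2-chlorobutane with explicit hydrogens) is a hypothetical molecule chosen for brevity, not molecule 1 of Fig. 1.

```python
from collections import deque

ATOM_TYPES = ["ac", "H", "C", "N", "O", "S", "Hal"]  # labels of Table 1
HALOGENS = {"F", "Cl", "Br", "I"}
MAX_DISTANCE = 10


def shell_descriptors(elements, bonds, center):
    """Count atom types at bond distances 1..10 from the chiral center.

    elements: list of element symbols, one per atom (hydrogens explicit)
    bonds:    list of (i, j) atom index pairs
    center:   index of the stereogenic atom
    """
    neighbors = [[] for _ in elements]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Breadth-first search: shortest bond distance from the center to each atom
    # (assumption: "bond distance" means the shortest path).
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in distance:
                distance[nxt] = distance[atom] + 1
                queue.append(nxt)

    counts = {label: [0] * MAX_DISTANCE for label in ATOM_TYPES}
    for atom, d in distance.items():
        if 1 <= d <= MAX_DISTANCE:
            counts["ac"][d - 1] += 1
            symbol = elements[atom]
            label = "Hal" if symbol in HALOGENS else symbol
            if label in counts and label != "ac":
                counts[label][d - 1] += 1
    return counts


# Hypothetical example: CH3-CH(Cl)-CH2-CH3, chiral center = atom 1.
elements = ["C", "C", "C", "C", "Cl"] + ["H"] * 9
bonds = [(0, 1), (1, 2), (2, 3), (1, 4), (1, 5),
         (0, 6), (0, 7), (0, 8), (2, 9), (2, 10),
         (3, 11), (3, 12), (3, 13)]
descriptors = shell_descriptors(elements, bonds, center=1)
print(descriptors["ac"][:3], descriptors["H"][:3], descriptors["Hal"][:3])
```

For this toy input the first three shells contain 4, 6 and 3 atoms, and the 7 × 10 entries of `descriptors` together form the 70-component vector.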

Despite the high total number of descriptors, we have to emphasize from the very beginning that the majority of them, as will be seen later in the results part, are not selected for building the models and are thus not used. Since the descriptors are achiral, it also has to be pointed out that no information about the absolute configuration of the enantiomers is present. Therefore, in the present study we model the separation factor irrespective of the order of elution of the enantiomers.

These descriptors were developed with the help of the MOSES software library [42]. However, as one can note, their simplicity allows the calculation by hand by only considering the structure diagram.

2.2. Data set

The data set used for this study corresponds to a large extent to the data set collected from the ChirBase database by Del Rio et al. in a recent paper [8,17]. Thus, here we will consider 175 molecules separated on a Whelk-O1 [31,32] chiral stationary phase under fixed experimental conditions. In particular, the following minimum standard conditions were respected in the original data set: availability of the separation factor α, mobile phase of 80:20 hexane/propan-2-ol, use of a standard chiral column 250 mm × 4.6 mm and separation at ambient temperature [29]. The details of the structures as well as the chromatographic data of the entire data set are available in the supplementary material and in the publication of Del Rio et al. [17].

Since we want to focus on distinguishing between separated and non-separated racemates we converted the experimental α values into categorical variables that represent the target variable for the predictions. We want to emphasize that the use of α values for prediction purposes is possible because of the good consistency of the experimental data of the data set. However, we anticipate that a hypothetical data set of compounds separated under strictly equal conditions, for instance at strictly equal temperature, would be optimal to improve the goodness of the computational models. It is also worth noting that the models can predict the outcome of the separation only under the above-mentioned well-defined experimental conditions. The results cannot be used for making predictions for separations under other experimental conditions.

Although an α value of 1.10 is good enough to achieve a chromatographic baseline separation on brush-type CSPs like Whelk-O1, we also made the conversion of the separation factor with an α value of 1.15. As we will see, the use of two different thresholds will allow us to assess the robustness of the models, to retrieve basic information about the mechanism of separation and to track the potential discrepancies that a literature-compiled data set might generate. The threshold values that we used as well as the number of separated and non-separated molecules are summarized in Table 3.

To evaluate in a straightforward fashion the predictivity of the models we split the two data sets obtained with the two α-threshold values (see Table 3) into training and test sets, each training set being the ensemble of the molecules used for building the models and the test set being the ensemble of molecules with which the respective model has been tested. The test sets were built by randomly selecting 18 molecules, representing 10% of the whole data set, for each of the two α limits. The only constraint imposed during the randomization was to preserve the ratio between the classes (separated and non-separated) in the training and test set. In Table 3, the different partitions are summarized, indicating also the number of molecules included in each class.

2.3. Data analysis and modeling techniques

Fixed-length descriptors like those used here are numerical representations of the molecules. These descriptors can feed statistical and data mining techniques that can be defined as the processes of exploration of large amounts of data in search of consistent patterns, correlations and other systematic relationships between the descriptors [6]. In this paper, we use the J48 classifier of the WEKA package which implements the decision tree algorithm C4.5 [43]. The WEKA program suite version 3.4.7 was used with the standard settings applied throughout [44].

Table 3
Molecule partition of the Whelk-O1 data set

α-Threshold   Whole data set (175 molecules)   Training set (157 molecules)   Test set (18 molecules)
value (a)     Separated   Non-separated        Separated   Non-separated      Separated   Non-separated
1.10          148         27                   133         24                 15          3
1.15          136         39                   122         35                 14          4

(a) Values from 1 to the α limit are considered as non-separated racemates. Values greater than the α limit are separated racemates.
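The categorical conversion and the class-preserving holdout split of Section 2.2 can be illustrated with a short Python sketch. It is not the procedure actually run by the authors: the function names, the simple rounding used to keep the class ratio and the random seed are assumptions, and the label list only reproduces the class sizes of the 1.10 data set in Table 3 (no real α values are used).

```python
import random


def classify(alpha, limit):
    """'NS' for values from 1 up to the alpha limit, 'S' above it."""
    return "S" if alpha > limit else "NS"


def proportional_counts(class_sizes, n_test):
    """Test-set size per class, keeping the class ratio.

    Assumption: plain rounding; it happens to add up to n_test for the
    class sizes of Table 3 but is not guaranteed to do so in general.
    """
    total = sum(class_sizes.values())
    return {cls: round(n_test * size / total) for cls, size in class_sizes.items()}


def stratified_holdout(labels, n_test, seed=0):
    """Randomly pick a test set that preserves the class ratio."""
    rng = random.Random(seed)
    sizes = {cls: labels.count(cls) for cls in sorted(set(labels))}
    test = []
    for cls, n in proportional_counts(sizes, n_test).items():
        members = [i for i, lab in enumerate(labels) if lab == cls]
        test.extend(rng.sample(members, n))
    chosen = set(test)
    train = [i for i in range(len(labels)) if i not in chosen]
    return train, sorted(test)


# Placeholder labels reproducing only the class sizes of the 1.10 data set.
labels = ["S"] * 148 + ["NS"] * 27
train, test = stratified_holdout(labels, n_test=18)
print(len(train), len(test))
```

With these class sizes the split yields 157 training and 18 test molecules, with 15 separated and 3 non-separated racemates in the test set, which is the partition listed for the 1.10 threshold.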

Decision tree modeling is based on a divide-and-conquer approach. It works from the top down, seeking at each stage an attribute that best separates the classes, and then recursively processing the subproblems that result from the split. This strategy is particularly useful in our case because the decision tree can easily be converted into a set of classification rules based on the above-mentioned molecular descriptors.
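A single divide-and-conquer step can be sketched as follows. This is a simplified illustration that scores candidate thresholds on one numeric descriptor by plain information gain; C4.5 and J48 use a refined criterion (gain ratio) together with pruning, neither of which is reproduced here. The function names and the toy descriptor values are invented for the example and are not taken from the data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the best 'value <= threshold' split."""
    base = entropy(labels)
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy, invented values of one descriptor with their classes.
toy_values = [2, 4, 6, 8, 9, 10, 12]
toy_classes = ["S", "S", "S", "S", "NS", "NS", "NS"]
threshold, gain = best_threshold(toy_values, toy_classes)
print(threshold, round(gain, 3))
```

In a full tree builder the same search would be run over every descriptor, the best split applied, and the procedure repeated on each resulting subset.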

Another advantage of decision trees is that simple and straightforward models are normally obtained. Thus, an experimentalist may find it interesting to retrieve molecular descriptors calculable by hand and use them with simple rules to predict whether a racemate is separated or not under given conditions without computational efforts.

As we have already mentioned, not all the 70 topological descriptors will build the models since the J48 classifier makes a selection of the variables. In practice all the descriptors not useful for the modeling of the dependent variable as well as those with zero variance are removed automatically.

We also used the ZeroR classifier as a baseline for evaluating the unbiased properties of the J48 models. The ZeroR classifier simply assigns the majority class to all instances of the data set. As we will see in the results part, this evaluation is important when unbalanced data sets are used [44].
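The ZeroR baseline is simple enough to be checked by hand or with a few lines of Python. The sketch below is not the WEKA implementation; the function name `zero_r` is an assumption, and the label lists only reproduce the class sizes of the two training sets given in Table 3.

```python
from collections import Counter


def zero_r(labels):
    """Return the majority class and the percentage of instances it covers."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, 100.0 * count / len(labels)


# Class sizes of the two training sets as listed in Table 3.
training_sets = {"1.10": (133, 24), "1.15": (122, 35)}
for limit, (n_sep, n_non) in training_sets.items():
    majority, percent = zero_r(["S"] * n_sep + ["NS"] * n_non)
    print(limit, majority, round(percent, 1))
```

Predicting "separated" for every training compound is thus already correct in 84.7% (α limit 1.10) and 77.7% (α limit 1.15) of the cases, which is the reference any useful model has to exceed.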

3. Results and discussion

3.1. Training set modeling

The results of modeling the two training sets are shown in Table 4.

For both training sets we obtained models with a high percentage of correct predictions.

Table 4
Modeling of the training sets (157 compounds) with J48 and ZeroR classifiers

Model        α Limit   J48 classifier                               ZeroR classifier
number (a)             %Correct   Number of variables retained      (%correct)
1            1.10      94.9       4                                 84.7
2            1.15      94.3       5                                 77.7

(a) The two models will be referred to in the text with italic numbers.

It is important to note that only a few descriptors were used to build the decision tree models. As expected, most of the 70 descriptors introduced in Section 2.1 were not used because they were found to be of no influence on modeling the classes of separated and non-separated racemates. The descriptors used will be fully explained in the next section.

Since all the training sets are unbalanced in favor of the “separated” class, especially the one obtained with α equal to 1.10, we implemented two different analyses to probe the good prediction capabilities of the proposed method. The first one, shown in Table 4, compares the model results to those obtained with the ZeroR classifier. The ZeroR classifier generates models in which a priori all the analytes are considered separated. In this context, the difference in the results between J48 and ZeroR represents the gain that the J48 classifier brings to the basic model of ZeroR. As shown in Table 4, in all cases the percentage of correct predictions of the J48 classifier is higher than the ZeroR result (by about 10 and 17 percentage points for models 1 and 2, respectively), meaning that all the J48 models are apt for prediction purposes and, above all, are verified not to be biased even if the classes are unbalanced. While the ZeroR classifier is used to check the unbiased properties of the models, their robustness will be evaluated through the cross-validations and holdout predictions presented in Sections 3.3 and 3.4.

Since a hypothetical data set in which the two classes are perfectly balanced would be the ideal condition for building the models, we also implemented a second analysis, presented in Section 3.3, that simulates this situation for the two α-threshold values considered.

3.2. Analysis of the tree models

Several examples in the literature and in the ChirBase database, as well as statistical and computational studies on the topic of chiral separations, suggest that molecules sharing similar structural properties also offer a common mode of enantioselective recognition [3,4,10,30,45].

Thus, the analysis of the tree models in terms of topological descriptors can be a valuable tool to uncover simple mechanisms. More importantly, a simple decision tree model based on atomic count molecular descriptors can be of great usefulness for experimenters who wish to have a rapid estimate of racemate separation without having to calculate complicated equations.

Model 1 is shown in Fig. 2.

This model is composed of only four molecular descriptors giving rise to 5 leaves of the tree.

The first ramification occurs with the atomic count descriptor at distance 6 (ac6). In other words, ac6 describes the number of atoms that we may count at a bond distance of 6 from the chiral center. An overview count of the instances allocated in these first two branches shows that molecules having more than eight atoms at a distance of six bonds from the chiral center give mostly non-separated racemates. On the other hand, molecules with eight or fewer atoms at a distance of six bonds lead mainly to separated racemates. It is interesting to note that six out of the eight incorrect predictions are allocated in the most populated leaf, which contains 132 instances. For these six misclassifications no further descriptors were located by the decision tree to place them in additional leaves.

Subsequent ramifications of the tree still involve total atomic counts at rather large distances (ac8 and ac7) and the hydrogen atom count at a rather short distance (H4). These ramifications account for minor partitions between the two classes.

It is worth noting that the topological distances involved in the model range from 4 to 8, indicating that the main chemical features important to determine whether a separation on a Whelk-O1 CSP is possible are mainly included in this range.
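To show how such a tree translates into rules, the root split of model 1 can be written as a one-line test. This sketch covers only the first ramification described in the text (ac6 greater than eight or not); the thresholds of the later splits on ac8, ac7 and H4 are not given in the text and are therefore deliberately left out, so the function is not the complete model. The function name and the returned strings are assumptions.

```python
def root_branch(ac6):
    """First split of model 1 as described in the text (ac6 > 8 or not)."""
    if ac6 > 8:
        return "mostly non-separated"
    return "mostly separated"


# ac6 of the example molecule in Table 2 is 2.
print(root_branch(2))
```

The example molecule of Table 2 follows the branch that leads mainly to separated racemates; a final prediction would additionally require the remaining splits shown in Fig. 2.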

Model 2 is shown in Fig. 3. For this model a tree with 6 leaves built from 5 descriptors was obtained. Compared to model 1 this tree is a little more complex. However, we may note that here, too, the ac6 descriptor is present at the root of the tree. Thus, this variable represents an important discriminant factor for distinguishing the two classes in both models 1 and 2. Further ramification occurs with the ac7 and H4 descriptors which, again, were used in model 1. In addition, the carbon atom count at short distance (C3) was found to be discriminating. In this model, the range in terms of bond distances is larger than in the previous case because small distances, for instance with the ac2 and C3 descriptors, are also present. In general we observe no drastic modifications of the model by changing the α limit from 1.10 to 1.15, particularly as the most distinguishing variables ac6, ac7 and H4 are used in both models.

Fig. 2. Model 1 obtained with J48 pruned tree on the training set with α limit 1.10. Molecular descriptors are labeled according to Table 1 with the end-number indicating the bond distance from the chiral center. S and NS indicate the respective class (separated; non-separated). Numbers in brackets under each leaf show the number of instances assigned to that node and the number of incorrectly classified instances, respectively.

Fig. 3. Model 2 obtained with J48 pruned tree on the training set with α limit 1.15. For explanation of the notations see caption of Fig. 2.

Unexpectedly, all the models are built with variables that do not take heteroatom counts into consideration. This might be counterintuitive since in brush-type CSPs like Whelk-O1 the hydrogen bond acceptor/donor interactions represent the most important energy contribution between the ligand and the stationary phase. However, in accordance with molecular modeling and data mining studies carried out by Del Rio et al., while hydrogen bonds make the principal docking point between host and guest molecules for both diastereomeric complexes, the whole association describing the enantioselective process is governed by a sum of very weak interactions [10]. In this light it is understandable that the only variables selected for describing the separation factor are hydrogen, carbon and total atomic counts, which appear to be sufficient for a faithful representation of the minimum topological requirement needed to achieve a separation. Nevertheless, it is worth noting that the limited size of the data set and the imbalance toward the number of resolved compounds may also have an impact on the choice of the variables. While the ideal condition would be to have a much more extended and balanced data set of compounds separated under strictly equal conditions, unfortunately no other data with comparable or improved consistency is publicly available. However, to address these problems we have implemented a comprehensive cross-validation scheme as well as a holdout validation method that will be presented in the next two sections.


3.3. Cross-validations

While the predictions shown in Section 3.1 give a general picture of the quality of the models, to assess their stability and their predictivity we implemented a comprehensive cross-validation (CV) scheme on the training sets. The results of the CVs are essential to determine whether the models generated are the result of a chance combination of independent variables and whether they will perform well on future as-yet-unseen data. We adopted a scheme in which the two original training sets of 157 instances were partitioned into different CV folds, namely leave-one-out (LOO), corresponding to 157 folds, 10-fold and 5-fold CVs. When performing a K-fold CV, the original sample of 157 instances is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data while the remaining K − 1 subsamples are used to build the model. The process is repeated K times so that each of the K subsamples is used exactly once as a validation set. As a result, each instance will be in a validation set exactly once and will be in the model-building sets exactly (K − 1) times. Finally, the average error across all the K trials is computed.
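The K-fold partitioning just described can be sketched in Python as follows. The sketch only generates the index partitions and does not train any model; the function name `k_fold_splits`, the shuffling strategy and the seed are assumptions and do not describe the WEKA internals.

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Return a list of (training_indices, validation_indices), one pair per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


# 157 training instances, as in the present study.
for k in (157, 10, 5):
    sizes = sorted({len(validation) for _, validation in k_fold_splits(157, k)})
    print(k, sizes)
```

For 157 instances the validation folds contain 1 instance each for LOO, 15 or 16 instances for the 10-fold CV and 31 or 32 instances for the 5-fold CV. Changing `seed` gives a different random partition, which is what repeating the experiment amounts to.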

Since the number of folds can be decided a priori, one can independently choose how large each validation set is and how many trials one averages over.

One step further to evaluate the robustness of the models and to verify that the instances do not generate models by chance association is to make several repetitions of the experiments. As an example, making 150 repetitions of the 10-fold CV experiment allows the exploration of 150 different K-partitionings and the calculation of a standard deviation which reflects how the random partitioning affects the average results of the cross-validations.

In Table 5 the results of the cross-validations are summarized.

In all cases, the CVs gave high predictivity since the percentage of correct predictions is close to 90% for each α limit (86.5–91.7%).

As expected, with the only exception of the LOO for the α 1.15 case, passing from LOO to 5-fold CVs leads to a decrease in the predictivity. In fact, when the number of folds is progressively reduced, the number of instances included in the (K − 1) folds that build the model is also reduced.

Table 5
Cross-validation scheme according to the two α limits. LOO, 10-fold and 5-fold CVs are performed.

α Limit                              LOO (a)   10-fold (b)   5-fold (b)
1.10    %Correct classifications     91.7      90.1          89.5
        Standard deviation           0         1.2           1.4
1.15    %Correct classifications     86.5      89.8          89.7
        Standard deviation           0         1.4           1.6

(a) Leave-one-out (LOO) corresponds to a 157-fold cross-validation.
(b) The calculations for the 10- and 5-fold CVs have been repeated 150 and 300 times, respectively. Thus, the percent of correct predictions refers to the average values and the standard deviation refers to all the repetitions.


Nevertheless, it is worth noting that the percentage of correct predictions for the 10- and 5-fold CVs, considering the ± standard deviation, is comparable to the LOO value. This indicates that the generated models are robust and not overfitted.

For the same reasons, we can also observe that the standard deviations increase when the number of folds decreases. In fact, when the number of instances used to build the model decreases, the chance of generating an overfitted model increases. Despite this, the models presented in Table 5 show no overfitting behavior, since the maximum standard deviation obtained represents a minimal fraction of the correct predictions.

As previously introduced, we also implemented a second analysis to probe whether the fact that the dataset was quite unbalanced in favor of the separated racemates had an effect on the good prediction capabilities. The main idea is to create perfectly balanced subsets of the original dataset and build decision tree models on these subsets. This is achieved by randomly removing the excess data on separated racemates, i.e. 103 compounds for the 1.10 threshold and 79 for the 1.15 threshold. The experiments are repeated 10 times and the average results are calculated. Table 6 shows the result of this procedure.
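
The balancing step can be sketched as random undersampling of the majority class. The code below is only an illustration under stated assumptions: the function name is hypothetical, and the class counts for the 1.10 threshold (130 separated and 27 non-separated) are inferred from the 157 training instances and the 103 removed compounds rather than quoted from the text.

```python
import random

def balanced_subset(labels, seed=0):
    """Indices of a class-balanced subset obtained by random undersampling."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        # The minority class is kept whole; larger classes are sampled down.
        kept.extend(rng.sample(members, minority_size))
    return sorted(kept)

# Inferred class counts for the 1.10 threshold (assumption, see lead-in).
labels = ["separated"] * 130 + ["non-separated"] * 27

# Ten independent balanced subsets, one per repetition of the experiment.
subsets = [balanced_subset(labels, seed=s) for s in range(10)]
print(len(subsets[0]), len(labels) - len(subsets[0]))  # 54 103
```

Each subset keeps all non-separated racemates and a different random selection of the separated ones, so the ten experiments sample different parts of the majority class.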

Since for each subset the procedure systematically removes most of the compounds of the original dataset, it is understandable that the obtained models perform slightly worse than those of Table 5. However, the high rates of correct prediction obtained with both α-threshold values and with all kinds of cross-validation provide strong validation of the methodology when perfectly balanced datasets are considered.

Table 6
Cross-validation scheme on reduced data sets according to the two α limits

α Limit   Number of molecules (a)   LOO (b)   10-fold   5-fold
1.10      54                        84.5      84.3      83.1
1.15      78                        79.0      78.8      78.3

The results represent the average values over the 10 experiments of the LOO, 10-fold and 5-fold CVs.
(a) The perfectly balanced data sets are built with all non-separated racemates and a random selection of the separated ones.
(b) Leave-one-out (LOO) corresponds to 54-fold and 78-fold cross-validations in the case of the 1.10 and 1.15 thresholds, respectively.

3.4. Validations on the external test sets

As already explained in Section 2.2, a holdout validation scheme was also implemented by splitting the original data set of 175 molecules into a training set (157 molecules) and an external test set (18 molecules) for each of the two α limits. Since the molecules were randomly chosen, the external validation provides a straightforward and intuitive way of applying the models of Figs. 2 and 3.

The results for the external validation sets are shown in Table 7. In both cases, the models could classify the separation of racemates with high accuracy. In particular, in each case only two of the 18 molecules in the test set could not be correctly classified. Thus, on average and independently of the fixed limit of α, we could build reliable and stable models for predicting the separation power of the Whelk-O1 column.

Table 7
Validation of the models with the external test sets

α Limit   Number of incorrect classifications (a)   Incorrectly predicted molecules (b)
1.10      2                                         9, 12 (Fig. 4)
1.15      2                                         25, 29 (Fig. 5)

(a) Out of 18 molecules.
(b) The numbering is relative to the molecules shown in the respective figures.

As shown in Figs. 4 and 5, the randomly selected molecules give a faithful account of the molecular diversity of the original data set.

Fig. 4. Test set molecules for an α limit of 1.10. Left bold numbers on the bottom of each molecule represent the numbering while italic numbers on the right the experimental separation factor α.

Fig. 5. Test set molecules for an α limit of 1.15. Left bold numbers on the bottom of each molecule represent the numbering while italic numbers on the right the experimental separation factor α.

It is worth noting that for both model 1 and model 2 the two misclassified instances belong to both the separated and the non-separated class. Thus, despite the relative unbalance of the original data set in favor of the separated racemates, the models show no bias towards the preponderant class.

It is interesting to see that structure 12 in Fig. 4 is a separated molecule that is predicted to be non-separated. However, the experimental enantioselectivity of 1.14 for this separation places this instance at the border between the two classes, which may explain the misclassification. Molecule 9 of Fig. 4 is a non-separated molecule predicted as separated, while similar structures such as 10 and 16 are correctly predicted. Apparently, in this case, the atomic descriptors were not capable of differentiating the molecules sufficiently. Model 2 also showed a borderline misclassification: structure 25 of Fig. 5 is predicted as separated while being non-separated with an α of 1.07. On the other hand, we have no explanation for the wrong prediction of structure 29, which shows a prominent separation but is classified as non-separated.

Fig. 6. Practical examples for the application of the procedure of model 1. Stereogenic centers are highlighted with a spotted mark while the atoms for the respective descriptors are encircled. Molecule 11 is experimentally non-separated while molecule 15 is separated.
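
The holdout scheme and the external-test figures reported in Table 7 can be reproduced numerically with a minimal Python sketch. The helper names are hypothetical and the random split is only an illustration of the procedure, not the actual split used in this work; the counts (175, 157, 18 and 2 misclassifications) are taken from the text.

```python
import random

def holdout_split(n, n_test, seed=0):
    """Randomly split range(n) into (training, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def percent_correct(n_total, n_incorrect):
    """Percentage of correctly classified instances."""
    return 100.0 * (n_total - n_incorrect) / n_total

training, test = holdout_split(175, 18)
assert len(training) == 157 and len(test) == 18
assert not set(training) & set(test)  # no molecule appears in both sets

# Table 7 reports 2 incorrect classifications out of 18 for both α limits.
print(f"{percent_correct(18, 2):.1f}% correct")  # 88.9% correct
```

The resulting 88.9% is in line with the cross-validation results of about 90% obtained on the training sets.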

3.5. Practical example

Since the calculation of the descriptors as well as the use of the decision rules can easily be done by hand, a worked example of the whole procedure may be useful for those who want to apply the method to make a rapid assessment of separations on the Whelk-O1 CSP.

While for this study we considered two different α thresholds, giving rise to two different models, we suggest that the reader use model 1 and a limit of 1.10 for further predictions. We prefer this model because an α value of 1.10 is in most cases good enough to achieve a chromatographic baseline separation on the Whelk-O1 CSP.

Thus, the following examples, shown in Fig. 6, refer to model 1 as schematized in Fig. 2.

Molecule 11 (non-separated) and molecule 15 (separated) of Fig. 4 are used to exemplify the procedure. Starting from the root of the tree of model 1 (Fig. 2), the first descriptor encountered is ac6. The value of this descriptor can be retrieved for both molecules by counting the total number of atoms at bond distance 6 from the chiral center. Fig. 6 shows the location of the corresponding atoms. It is important to note that bond distances are always taken along the shortest path between the chiral center and the respective atom. As one can see, the two example molecules lead to two different counts for the ac6 descriptor. Thus, different paths through the decision tree are taken at the first branching. The next discriminant descriptor for molecule 11 is ac8, whose value is 0. This leads directly to a leaf of the tree labeled with the non-separated class. For molecule 15, ac7 is the next descriptor; its value is 1, since only one atom is present at bond distance 7 from the chiral center. In this case, too, a leaf of the tree is reached, predicting molecule 15 as separated on the Whelk-O1 CSP.
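
Although the counting is meant to be done by hand, the shortest-path rule can be made explicit with a short Python sketch based on a breadth-first search. The function name, the bond-list input format and the toy molecular graph are all hypothetical; the graph is not one of the molecules of Fig. 6, and whether hydrogen atoms are counted depends simply on which atoms are listed in the input.

```python
from collections import deque

def atom_count_at_distance(bonds, center, distance):
    """Count atoms whose shortest bond path from `center` has exactly `distance` bonds."""
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    depth = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if depth[atom] == distance:
            continue  # atoms further out are not needed
        for nxt in neighbours.get(atom, ()):
            if nxt not in depth:  # first visit gives the shortest path in BFS
                depth[nxt] = depth[atom] + 1
                queue.append(nxt)
    return sum(1 for d in depth.values() if d == distance)

# Hypothetical toy graph: chiral center 0 carries a one-atom substituent (8)
# and a three-bond chain (0-1-2-3) ending in a five-membered ring (3-4-5-6-7).
toy_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 3), (0, 8)]

for d in range(1, 7):
    print(f"ac{d} =", atom_count_at_distance(toy_bonds, center=0, distance=d))
```

In this toy graph, atom 6 can be reached through the ring in six bonds (via atoms 4 and 5) or in five bonds (via atom 7). Only the shortest path counts, so it contributes to ac5 and not to ac6.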

4. Conclusion

We have presented a simple method that can be used to assess whether or not a given racemate can be separated by chiral HPLC on the Whelk-O1 CSP. Various atom counts around the chiral center were used in combination with decision trees and applied to a data set of molecules separated on the Whelk-O1 chiral stationary phase. Those molecules were collected from the ChirBase database, which compiles experimental results from the literature. This scheme allowed us to obtain straightforward models describing the atomic requirements for the classification of a racemate as separable or non-separable.

Training set modeling as well as cross-validation experiments exhibited high rates of correct prediction with models built on a small number of descriptors. The decision trees make it possible to convert the models into a small set of rules that can be easily tested by using the corresponding descriptor values. Since the chiral-centered atomic descriptors can be calculated by hand just by inspecting the structure diagram of the molecule, the main advantage of the method is the possibility of testing the separation of new molecules in a straightforward, fast and computer-free way. This might be of interest to experimentalists who need a rapid assessment of the separation without having to program or manipulate complicated equations.

Since the data set was collected from the ChirBase database, which provides a good diversity of chemical structures, the methodology seems to be applicable to data sets composed of diverse molecules. It must also be noted that, while most QSAR studies rely upon experimental data obtained under strict experimental protocols, here the data set was designed as a compilation of literature results. Thus, the method also showed good internal predictability with regard to experimental data that can vary somewhat with protocol, column and general techniques.

However, we must acknowledge that, while the method provides a tool of extreme computational simplicity, more accurate quantitative predictions may be obtained through more sophisticated models. Furthermore, the scheme is at present only available for molecules with central chirality and one stereocenter. Therefore, while at this stage these calculations may be of limited applicability in terms of the choice of experimental conditions, efforts to model other datasets aim to broaden the range of theoretical models available to experimentalists. The hypothetical availability of other consistent experimental data on chromatographic separations would allow the exploration and building of other models that refer to different experimental conditions, such as variations in the CSP or the mobile phase. This would make available different sets of rules that may be used as a practical and valuable tool for the prediction of the separation. This is particularly true in situations in which direct comparison with analytes of similar structure, or experience with a particular CSP, is not sufficient to guess the outcome of a separation.

Acknowledgements

A.D.R. gratefully acknowledges the financial support of the Alexander von Humboldt foundation. The authors thank Professor C. Roussel and Dr. P. Piras.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.chroma.2008.01.034.

References

[1] K.B. Lipkowitz, J. Chromatogr. A 906 (2001) 417.
[2] K.B. Lipkowitz, Acc. Chem. Res. 33 (2000) 555.
[3] A. Del Rio, Ph.D. Thesis, Laboratoire de Stereochimie Dynamique et Chiralite, University “Paul Cezanne” Aix-Marseille III, Marseille, 2005.
[4] C. Roussel, A. Del Rio, J. Pierrot-Sanders, P. Piras, N. Vanthuyne, J. Chromatogr. A 1037 (2004) 311.
[5] C. Roussel, J. Pierrot-Sanders, I. Heitmann, P. Piras, in: G. Subramanian (Ed.), Chiral Separation Techniques, Wiley-VCH, Weinheim, 2001, p. 95.
[6] J. Gasteiger, T. Engel, Chemoinformatics: A textbook, Wiley-VCH, Weinheim, 2003.
[7] N.M. Maier, P. Franco, W. Lindner, J. Chromatogr. A 906 (2001) 3.
[8] ChirBase, Universite “Paul Cezanne” Aix-Marseille III, Marseille, 2006, http://chirbase.u-3mrs.fr.
[9] E.R. Francotte, J. Chromatogr. A 906 (2001) 379.
[10] A. Del Rio, J.M. Hayes, M. Stein, P. Piras, C. Roussel, Chirality 16 (2004) S1.
[11] S. Alcaro, F. Gasparrini, O. Incani, S. Mecucci, D. Misiti, M. Pierini, C. Villani, J. Comput. Chem. 21 (2000) 515.
[12] J. Aerts, J. Comput. Chem. 16 (1995) 914.
[13] C. Zhao, N.M. Cann, J. Chromatogr. A 1149 (2007) 197.
[14] C. Zhao, N.M. Cann, J. Chromatogr. A 1131 (2006) 110.
[15] C. Wolf, L. Pranatharthiharan, E.C. Volpe, J. Org. Chem. 68 (2003) 3287.
[16] G.E. Job, A. Shvets, W.H. Pirkle, S. Kuwahara, M. Kosaka, Y. Kasai, H. Taji, K. Fujita, M. Watanabe, N. Harada, J. Chromatogr. A 1055 (2004) 41.
[17] A. Del Rio, P. Piras, C. Roussel, Chirality 18 (2006) 498.
[18] S. Funar-Timofei, T. Suzuki, J.A. Paier, A. Steinreiber, K. Faber, W.M.F. Fabian, J. Chem. Inf. Comp. Sci. 43 (2003) 934.
[19] T. Suzuki, S. Timofei, B.E. Iuoras, G. Uray, P. Verdino, W.M.F. Fabian, J. Chromatogr. A 922 (2001) 13.
[20] T.D. Booth, K. Azzaoui, I.W. Wainer, Anal. Chem. 69 (1997) 3879.
[21] W.M. Fabian, W. Stampfer, M. Mazur, G. Uray, Chirality 15 (2003) 271.
[22] C.A. Montanari, Q.B. Cass, M.E. Tiritan, A.L.S. de Souza, Anal. Chim. Acta 419 (2000) 93.
[23] V. Andrisano, C. Bertucci, V. Cavrini, M. Recanatini, A. Cavalli, L. Varoli, G. Felix, I.W. Wainer, J. Chromatogr. A 876 (2000) 75.
[24] J.V. de Julian-Ortiz, R. Garcia-Domenech, J.G. Alvarez, R.S. Roca, F.J. Garcia-March, G.M. Anton-Fos, J. Chromatogr. A 719 (1996) 37.
[25] J.V. de Julian-Ortiz, C.D. Alapont, I. Rios-Santamarina, R. Garcia-Domenech, J. Galvez, J. Mol. Graph. 16 (1998) 14.
[26] A. Golbraikh, D. Bonchev, A. Tropsha, J. Chem. Inf. Comp. Sci. 41 (2001) 147.
[27] R. Kafri, D. Lancet, Chirality 16 (2004) 369.
[28] B. Koppenhoefer, U. Epperlein, M. Schwierskott, Fresenius J. Anal. Chem. 359 (1997) 107.
[29] V. Schurig, Chirality 17 (Suppl.) (2005) S205.
[30] A. Del Rio, P. Piras, C. Roussel, Chirality 17 (2005) S74.
[31] W.H. Pirkle, C.J. Welch, B. Lamm, J. Org. Chem. 57 (1992) 3854.
[32] W.H. Pirkle, C.J. Welch, J. Liq. Chromatogr. 15 (1992) 1947.
[33] N. Bargmann-Leyder, J.C. Truffert, A. Tambute, M. Caude, J. Chromatogr. A 666 (1994) 27.
[34] W.H. Pirkle, K.Z. Gan, L.J. Brice, Tetrahedron 7 (1996) 2813.
[35] W.H. Pirkle, Y.L. Liu, J. Chromatogr. A 749 (1996) 19.
[36] W.H. Pirkle, C.J. Welch, Tetrahedron 5 (1994) 777.
[37] W.H. Pirkle, C.J. Welch, J. Chromatogr. 589 (1992) 45.
[38] W.H. Pirkle, C.J. Welch, M.H. Hyun, J. Chromatogr. 607 (1992) 126.
[39] M.E. Koscho, P.L. Spence, W.H. Pirkle, Tetrahedron: Asymmetry 16 (2005) 3147.
[40] M.E. Koscho, W.H. Pirkle, Tetrahedron: Asymmetry 16 (2005) 3345.
[41] W.H. Pirkle, C.J. Welch, J. Chromatogr. A 683 (1994) 347.
[42] MOSES C++ software library, Molecular Networks, Erlangen, 2004-2007, http://www.molecular-networks.com/software/moses.
[43] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, 1993.
[44] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, CA, 2005.
[45] P. Piras, C. Roussel, J. Pierrot-Sanders, J. Chromatogr. A 906 (2001) 443.