1994: Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships



854 J. Chem. Inf. Comput. Sci. 1994, 34, 854-866

Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships

David Rogers*
Molecular Simulations Incorporated, 16 New England Executive Park, Burlington, Massachusetts 01803-5297

A. J. Hopfinger
Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, Box 6998, Chicago, Illinois 60680

Received November 12, 1993

The genetic function approximation (GFA) algorithm offers a new approach to the problem of building quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models. Replacing regression analysis with the GFA algorithm allows the construction of models competitive with, or superior to, standard techniques and makes available additional information not provided by other techniques. Unlike most other analysis algorithms, GFA provides the user with multiple models; the populations of models are created by evolving random initial models using a genetic algorithm. GFA can build models using not only linear polynomials but also higher-order polynomials, splines, and Gaussians. By using spline-based terms, GFA can perform a form of automatic outlier removal and classification. The GFA algorithm has been applied to three published data sets to demonstrate that it is an effective tool for doing both QSAR and QSPR.

1. BACKGROUND

Quantitative structure-activity relationship (QSAR) analysis is an area of computational research which builds models of biological activity using physicochemical properties of a series of compounds. The underlying assumption is that the variations of biological activity within a series of similar structures can be correlated with changes in measured or computed molecular features of the molecules. These features could measure, for example, hydrophobic, steric, and electronic properties which may influence biological activity. In this analysis, a data table is formed, each row representing a candidate compound and each column an experimental or computational feature. Regression analysis can be applied to this data table to create a model of activity based on all or some of the features. Quantitative structure-property relationship (QSPR) analysis is a generalization of the QSAR concept. The QSPR philosophy assumes that the behavior of a compound, as expressed by any measured property, can be correlated to a set of molecular features of the compound.

Current QSAR and QSPR methods are limited by the structure of the data: the number of compounds with the requisite behavior measures (e.g., biological activity) is usually small compared with the number of features which can be measured or calculated. This can lead either to models which have a low error measure on the training set but which do not predict well (a phenomenon called overfitting), or to a complete failure to build a meaningful regression model. Recent findings suggest that features which characterize molecular shape-related properties may be especially useful,1 but a given compound may have thousands of these features, making the building of a regression model even more problematic. Still, standard regression analysis continues to be the predominant technique for QSAR and QSPR construction, though recent work using partial least-squares (PLS) regression2 or neural networks3 shows advantages in some situations over standard methods.

* Author to whom correspondence should be addressed.
† Abstract published in Advance ACS Abstracts, April 1, 1994.


The genetic function approximation (GFA) algorithm is derived from Rogers' G/SPLINES algorithm4,5 and offers a new approach to the construction of QSARs and QSPRs. We propose supplementing standard regression analysis with the GFA algorithm. Application of the GFA algorithm may allow the construction of higher-quality predictive models and make available additional information not provided by standard regression techniques, even for data sets with many features. In particular, the advantages of multiple models made available using GFA will be discussed, as well as the automatic partitioning behavior of GFA when used to build spline models.

2. METHODS

A. QSAR Methodology. QSAR began with the pioneering work of Hansch,6 who used linear regression to build predictive models of the biological activity of a series of compounds. The general form for the model F(X) is a linear combination of basis functions φ_k(X) of the features X = (x_1, ..., x_n) in the training data set, as given in eq 1:

    F(X) = a_0 + Σ_{k=1}^{M} a_k φ_k(X)    (1)

The basis functions are functions of one or more features, such as LOGP, (LOGP - 0.1)^2, or DIPV-X*(VDWVOL - 100.0). Linear regression takes the list of basis functions and finds a set of coefficients a_k to build a functional model of the activity. The accuracy with which the model describes the training data is termed the smoothness of fit of the model. (A smooth model represents only the general trends of the data, and not necessarily the details.)

Thirty years later, linear regression remains the major regression technique used to construct QSARs. Unfortunately, a large number of samples is needed, and even moderate numbers of features lead to poor-quality regression models due to overfitting. An overfit model can recall the activities of the training set samples but may not accurately predict the

© 1994 American Chemical Society
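To make eq 1 concrete, a model built from a fixed list of basis functions can be fit by ordinary least squares. This is a minimal sketch, not code from the paper; the feature values and the particular basis functions are hypothetical, echoing the examples above.

```python
import numpy as np

# Hypothetical training data: rows are compounds (feature values invented).
features = {"LOGP":   np.array([1.2, 2.5, 3.1, 0.7, 2.0]),
            "VDWVOL": np.array([310.0, 420.0, 390.0, 280.0, 350.0])}
activity = np.array([4.1, 5.6, 5.9, 3.2, 5.0])   # e.g. -log(IC50)

# Basis functions phi_k: each is a function of one or more features.
basis = [lambda f: f["LOGP"],                          # linear term
         lambda f: (f["LOGP"] - 0.1) ** 2,             # quadratic term
         lambda f: f["LOGP"] * (f["VDWVOL"] - 100.0)]  # two-feature term

# Design matrix: a constant column (for a_0) plus one column per basis function.
A = np.column_stack([np.ones(len(activity))] + [phi(features) for phi in basis])

# Least squares finds the coefficients a_k of eq 1.
coeffs, *_ = np.linalg.lstsq(A, activity, rcond=None)
predicted = A @ coeffs
print(coeffs)   # a_0, a_1, ..., a_M
```

Adding more basis functions always lowers the training error of such a fit, which is exactly the overfitting tendency discussed in the surrounding paragraphs.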


activity of previously-unseen samples. With linear regression, the amount of fit is determined by the number of basis functions in the model.

To make linear regression suitable when moderate numbers of features are used, it was combined with principal components analysis (PCA),7 which is a technique for selecting the most important set of features from a larger table. Once the most important features are selected, linear regression is used to construct a model. Unfortunately, PCA makes an assumption of independence of the features. If this assumption of independence is not true (as is often the case in real-world applications), the features selected may not be the most predictive set. Still, the reduction in the number of features is vital if overfitting is to be prevented.

More recently, large amounts of three-dimensional (3D) molecular shape data have become available for molecules. For example, in the CoMFA technique,8 the electrostatic field around a molecule can be calculated on a 3D grid, providing hundreds or thousands of features. However, it is virtually impossible to use so many features in building standard regression models. It was only with the use of partial least-squares (PLS) analysis that model building became possible. These models can show predictiveness, and the CoMFA approach has become a standard for the analysis of 3D field data.

Finally, recently published work suggests that neural networks and genetic algorithms may be useful in data analysis, specifically in the task of reducing the number of features for regression models. Wikel and Dow3 applied neural networks and Leardi et al. applied genetic algorithms to the feature reduction task, both with some success. Good et al.9 did a comparison of a neural-network style approach to the PLS approach used in CoMFA.2 The neural-network analysis was superior in some of the published results as measured by the cross-validated correlation coefficient (r^2) scores. However, the fit of most neural-network models is determined in part by the length of their training cycle, so there is risk of overfitting if trained too long.

B. Genetic Function Approximation. The genetic function approximation algorithm was initially conceived by taking inspiration from two seemingly disparate algorithms: Holland's genetic algorithm10 and Friedman's multivariate adaptive regression splines (MARS) algorithm.11

Genetic algorithms are derived from an analogy with the evolution of DNA. In this analogy, individuals are represented by a one-dimensional string of bits. An initial population of individuals is created, usually with random initial bits. A "fitness function" is used to estimate the quality of an individual, so that the "best" individuals receive the best fitness score. Individuals with the best fitness scores are more likely to be chosen for mating and to propagate their genetic material to offspring through the crossover operation, in which pieces of genetic material are taken from each parent and recombined to create the child. After many mating steps, the average fitness of the individuals in the population increases as "good" combinations of genes are discovered and spread through the population. Genetic algorithms are especially good at searching problem spaces with a large number of dimensions, as they conduct a very efficient directed sampling of the large space of possibilities.

Friedman proposed the MARS algorithm as the newest member of a class of well-used statistical modeling algorithms such as CART12 and k-d trees.13 It uses splines as basis functions to partition data space as it builds its regression models. It was specifically designed to allow the construction

F1: {LOGP; DIPV-X; (DIPV-Y - 2.0); VDWVOL; (LOGP - 5.1)^2}
F2: {M-PNT; MOL-WT; LOGP; LOGP^2}
...
FK: {(ATCH4 - ATCH6); DIPMOM; DIPV-X; VDWVOL; LOGP}

Figure 1. Examples of a population of models represented for the genetic function approximation algorithm. Each model is represented as a linear string of basis functions. The activity models can be reconstructed by using least-squares regression to regenerate the coefficients {a_k}. (The sample features are taken from the Selwood data set.)

of spline-based regression models of data sets with moderate numbers of features. MARS gives high levels of performance and competes well against many neural-network approaches but, unfortunately, is computationally intensive and too expensive to use with more than about 20 features and 1000 input samples. Also, since MARS builds its model incrementally, it may not discover models containing combinations of features that predict well as a group, but poorly individually.

One of us (D.R.) recognized that Friedman was doing a search over a very large "function space" and that a better search could be done using a genetic algorithm rather than his incremental approach. Replacing the binary strings of Holland with strings of basis functions gave a natural mapping from Holland's genetic approach to the functional models of regression-based approaches.
This led to the published work on G/SPLINES, which later evolved into the genetic function approximation (GFA) algorithm.4,5

The GFA algorithm approach has a number of important advantages over other techniques: it builds multiple models rather than a single model; it automatically selects which features are to be used in its basis functions and determines the appropriate number of basis functions to be used by testing full-size models rather than incrementally building them; it is better at discovering combinations of basis functions that take advantage of correlations between features; it incorporates the LOF (lack of fit) error measure developed by Friedman11 that resists overfitting and allows user control over the smoothness of fit; it can use a larger variety of basis functions in construction of its models, for example, splines, Gaussians, or higher-order polynomials; and study of the evolving models provides additional information, not available from standard regression analysis, such as the preferred model length and useful partitions of the data set.

C. Genetic Function Approximation Algorithm. Many techniques, including MARS, CART, and PCA, develop a single regression model by incremental addition or deletion of basis functions. In contrast, the GFA algorithm uses a population of many models and tests only the final, fully-constructed model. Improved models are constructed by performing the genetic crossover operation to recombine the terms of the better-performing models.

A genetic algorithm requires that an individual be represented as a linear string, which plays the role of the DNA for the individual. When using GFA, the string is the series of basis functions, as shown in Figure 1. Using the information in the string, it is possible to reconstruct the activity model by using least-squares regression to regenerate the coefficients {a_k}.
The initial models are generated by randomly selecting some number of features from the training data set, building basis functions from these features using the user-specified basis function types, and then constructing the genetic models from random sequences of these basis functions.

The models are scored using Friedman's "lack of fit" (LOF) measure,11 which is given by eq 2:

    LOF = LSE / (1 - (c + d*p)/M)^2    (2)

In this equation, c is the number of basis functions (other than the constant term) in


the model; d is the smoothing parameter (and is the only parameter adjustable by the user); p is the total number of features contained in all basis functions (some basis functions, such as (ATCH4 - ATCH6), contain more than one feature); and M is the number of samples in the training set. Unlike the more-commonly used least-squares error (LSE), the LOF measure cannot always be reduced by adding more terms to the regression model. While the new term may reduce the LSE, it also increases the values of c and p, which tends to increase the LOF score. Thus, adding a new term may reduce the LSE but actually increase the LOF score. By limiting the tendency to simply add more terms, the LOF measure resists overfitting better than the LSE measure.

Usually, the benefit of new terms for smaller models is enough to offset the penalty from the denominator in the LOF. Hence, the value of LOF decreases as initial terms are added. However, at some point the benefit is no longer enough to offset the penalty, and the value of LOF starts increasing. The location of this minimum is changed by altering the smoothing parameter d. The default value for d is 1.0, and larger values of d shift the minimum toward smoother models (that is, models with fewer terms). In effect, d is the user's estimate of how much detail in the training data set is worth modeling.

Once all models in the population have been rated using the LOF score, the genetic crossover operation is repeatedly performed. In this operation, two good models are probabilistically selected as "parents", with the likelihood of being chosen inversely proportional to a model's LOF score. Each parent is randomly "cut" into two pieces, and a new model is created using a piece from each parent, as shown in Figure 2. The coefficients of the new model are determined using least-squares regression.

Next, mutation operators may alter the newly-created model. Two mutation operators are possible by default: one appends a new random basis function, and shift moves the knot of a spline basis function. These two mutation operators have a default 50% probability of being applied to the newly-created model.

Finally, if a duplicate of the resulting model does not already exist in the population, the model with the worst LOF score is replaced by the new child.

The overall process is ended when the average LOF score of the models in the population stops significantly improving. For a population of 300 models, 3000-10 000 genetic operations are usually sufficient to achieve "convergence". For typical data sets, this process takes between 10 min and 1 h on a Macintosh-IIfx computer.

Upon completion, one can simply select the model from the population with the lowest score, though it is usually preferable to inspect the different models and select on the basis of the appropriateness of the features, the basis functions, and the feature combinations.

Selecting a single model is not always desirable; the population can be studied for information on feature use, and predictions can often be improved by averaging the results of multiple models rather than relying on an individual model. Different models may have different regions in which they predict well, and, by averaging, the effect of models which are extrapolating beyond their predictive region is reduced.
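The scoring, crossover, mutation, and replacement steps described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: models are strings of feature indices (linear basis functions only), the LOF formula follows eq 2, tournament selection stands in for the paper's LOF-weighted parent choice, and the population size and operation count are arbitrary.

```python
import random
import numpy as np

def lof(lse, c, p, M, d=1.0):
    """Friedman's lack-of-fit score (eq 2). c: basis functions excluding the
    constant term; p: total features over all basis functions; M: number of
    training samples; d: the user-adjustable smoothing parameter."""
    denom = 1.0 - (c + d * p) / M
    return float("inf") if denom <= 0 else lse / denom ** 2

def score(model, X, y, d=1.0):
    """Regenerate the coefficients {a_k} by least squares, then rate the model
    by LOF. Every basis function here is a single linear feature, so p == c."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in model])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    lse = float(np.mean((y - A @ coeffs) ** 2))
    return lof(lse, c=len(model), p=len(model), M=len(y), d=d)

def crossover(mom, dad):
    """Cut each parent string at a random point and splice the pieces."""
    return mom[:random.randint(1, len(mom))] + dad[random.randrange(len(dad)):]

def evolve(X, y, n_models=50, n_ops=1000, d=1.0):
    n_feat = X.shape[1]
    # Initial population: random strings of basis functions (feature indices).
    pop = [random.sample(range(n_feat), 3) for _ in range(n_models)]
    scores = [score(m, X, y, d) for m in pop]
    for _ in range(n_ops):
        # Tournament selection: of two random models, the lower LOF wins.
        pick = lambda: pop[min(random.sample(range(n_models), 2),
                               key=scores.__getitem__)]
        child = crossover(pick(), pick())
        if random.random() < 0.5:            # mutation: append a random term
            child = child + [random.randrange(n_feat)]
        if child not in pop:                 # replace the worst-scoring model
            worst = max(range(n_models), key=scores.__getitem__)
            pop[worst], scores[worst] = child, score(child, X, y, d)
    best = min(range(n_models), key=scores.__getitem__)
    return pop[best], scores[best]
```

On a toy data set whose activity depends on a pair of features jointly, the best surviving string typically contains both features, illustrating the paper's point that crossover discovers combinations of features rather than individual features.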

Figure 2. The crossover operation. Each parent is cut at a random point, and a piece from each parent is used to construct the new model, which now uses some basis functions from each parent.

Figure 3. Shared structure of the antimycin analogues in the Selwood data set.

- LOGP: partition coefficient
- M-PNT: melting point
- DIPMOM: dipole moment
- VDWVOL: van der Waals volume
- SURF-A: surface area
- MOL-WT: molecular weight
- DIPV-X, DIPV-Y, DIPV-Z: dipole vector components in X, Y, and Z
- MOFI-X, MOFI-Y, MOFI-Z: principal moments of inertia in X, Y, and Z
- PEAX-X, PEAX-Y, PEAX-Z: principal ellipsoid axes in X, Y, and Z
- S8-1DX, S8-1DY, S8-1DZ: substituent on atom 8 dimensions in X, Y, and Z
- S8-1CX, S8-1CY, S8-1CZ: substituent on atom 8 center in X, Y, and Z
- ATCH1 to ATCH10: partial atomic charges for atoms 1-10
- ESDL1 to ESDL10: electrophilic superdelocalizability for atoms 1-10
- NSDL1 to NSDL10: nucleophilic superdelocalizability for atoms 1-10
- SUM-F and SUM-R: sums of the F and R substituent constants

Figure 4. Features contained in the Selwood data set.

3. RESULTS

Three data sets were chosen to illustrate application of the GFA algorithm. The Selwood data set15 illustrates feature selection and the utility of exploring multiple models. The Cardozo/Hopfinger data set16 demonstrates the automatic partitioning behavior of spline models. The Koehler/Hopfinger data set17 shows the applicability of the GFA algorithm in polymer modeling. The only difference in the application of the GFA algorithm to these three problems was in the types of basis functions used. In the Selwood application only linear polynomials were considered, while, in the Cardozo/Hopfinger application, linear and quadratic polynomials and splines were considered. For the Koehler/Hopfinger application, linear polynomials and splines were considered.

A. Selwood Data Set. The Selwood data set15 contains 31 compounds, 53 features, and a set of corresponding antifilarial antimycin activities. In order to save space, the complete data set is not presented here. Reference 15 should be consulted for details on the structure-activity data. The series of analogs is of the general form shown in Figure 3.

This data set was of particular interest because it contains a large number of features relative to the number of compounds. The list of features is shown in Figure 4.

Selwood et al. used multivariate regression to develop a QSAR. Later, this same data set was studied by Wikel and Dow,3 who used a neural network to select features for their QSAR model. Other groups have studied this same data set.18-22 In this study, a comparison is made to one of the models of Selwood et al. and the model of Wikel and Dow. Both groups developed models using 3 of the 53 features to predict the activity as measured by -log(IC50), where IC50 refers to the concentration of an analog needed to reduce the


biological activity by 50%. Their judgement was that the relatively small number of compounds allows only a few features in the final regression model.

Selwood proposed a QSAR model of three features: LOGP, the partition coefficient; M-PNT, the melting point; and ESDL10, the electrophilic superdelocalizability at atom 10. The model derived using all 31 samples is given by eq 3. (This is eq 4 in the Selwood reference.15)

-log(IC50) = -3.93 + 0.44*LOGP + 0.008*M-PNT - 0.30*ESDL10    (3)
LOF: 0.487  r: 0.737  F: 13.29

Selwood used a technique which incrementally adds features to a model based on maximizing the correlation between the new feature (after decorrelation with previously-selected features) and the activity. This technique, known as forward-stepping regression analysis, is a common method of feature selection. Its success requires that the information of interest is contained in the correlation of individual features with the response. However, information requiring the combined effect of sets of features may not be discovered. Still, it is exactly such a combined effect that will likely be operative in a chemical system. This means that models which reflect such a combined effect may exist and may outperform models discovered with the forward-stepping technique. (Selwood was able to improve this model by performing outlier removal on the data set. We did not perform outlier removal. The next section demonstrates how splines can be used to perform automatic outlier removal.)

Wikel and Dow3 circumvented the problem of incremental feature selection by using a neural network to select the appropriate combination of variables in place of forward-stepping regression analysis. While the network does not directly select the features, after training it can be analyzed to reveal the features of interest. Regression can then be used to build the corresponding QSAR model.

A QSAR model of three features has been proposed by Wikel and Dow: LOGP, the partition coefficient; ATCH4, the partial atomic charge on atom 4; and MOFI-X, the principal moment of inertia in the X dimension. This QSAR is given by eq 4. It shows better correlation, though a higher LOF score, than the QSAR model of Selwood.

-log(IC50) = -1.63 + 0.231*LOGP + 4.415*ATCH4 + 0.000659*MOFI-X    (4)
LOF: 0.525  r: 0.774  F: 13.47

That two different techniques generate two significantly different QSAR models raises some questions: Is the discovered model the best model, that is, does it minimize the error (versus models of the same size) over the training set? Does a single best model even exist, or is there instead a collection of models of the same performance quality? Is the number of terms in the model appropriate, or should larger, or smaller, models be considered? Can the differences between models which score well due to predictiveness, and models which perform well due to chance correlations between features in the training set, be identified?

Figure 5. Number of crossover operations versus the best, worst, and average LOF scores in the population of models.

GFA analysis attempts to both generate models which are competitive with or superior to models generated by other techniques and also answer crucial questions such as those above. The answer to all these questions is based upon the construction and analysis of multiple models, rather than optimization of a single model.

B. GFA Applied to the Selwood Data Set. The GFA algorithm was applied to the Selwood data set with the intention of illustrating the advantages and uses of multiple models in QSAR analysis.

QSAR analysis with GFA begins by generating a population of random models. These models are generated by randomly selecting some number of features from the file and using regression to generate the coefficients of the models. For the Selwood data set, a population of 300 models was used, and the terms of the models were limited to linear polynomials. (The limitation to linear polynomials was done for easier comparison to literature models.)

The genetic operator was applied until the average LOF score showed little improvement over a period of 100 crossover operations. This convergence criterion was met after 3,000 operations. The evolution took approximately 15 min on a

Macintosh-IIfx. Figure 5 shows a graph of the evolution of the LOF scores. Some preliminary runs suggested that for an average model length of three features, the parameter d (the smoothing parameter in Friedman's LOF function) should be set to 2.0. Provided the value of d is set appropriately, there is little risk of overfitting even if the crossover operations are continued beyond convergence.

After convergence, the population was sorted in order of LOF score. The top 20 models in the population are shown in Table 1.

The model of Wikel and Dow was discovered and was rated

131 out of 300. The model of Selwood was not discovered, though similar models were generated. It is possible it may have been discovered if the algorithm was allowed to run further, as some similar runs with a different random seed did discover it. Had it been discovered in this run, it would have been rated about 200 out of 300.

All of the top 20 models, and many of the additional 280 models, approximately match or exceed the correlation scores of the models built by either Selwood or Wikel and Dow. The large number of high-scoring models, and the large amount of variation, appears to refute the supposition that there may be a single best model using this data. The mixture of models with 2-4 features suggests that both smaller and larger models may be appropriate for consideration. Thus, the population of models allows a deeper understanding of the range of possible models from a data set.


Table 1. Top 20 Models Generated Using the Full 31-Sample Selwood Data Set(a)

1: -log(IC50) = -2.501 + 0.584*LOGP + 1.513*SUM-F - 0.000075*MOFI-Y
   LOF: 0.366  r: 0.849  F: 23.27

2: -log(IC50) = 2.871 + 0.568*LOGP + 0.810*ESDL3 - 0.013*SURF-A
   LOF: 0.368  r: 0.848  F: 23.04

3: -log(IC50) = -0.805 + 0.589*LOGP + 0.736*ESDL3 - 0.000077*MOFI-Y
   LOF: 0.392  r: 0.838  F: 21.15

4: -log(IC50) = 1.791 + 0.500*LOGP + 0.842*ESDL3 - 0.200*PEAX-X + 2.807*ATCH4
   LOF: 0.397  r: 0.880  F: 22.31

5: -log(IC50) = -2.148 + 0.694*LOGP - 0.000084*MOFI-Y
   LOF: 0.405  r: 0.776  F: 21.15

6: -log(IC50) = -1.749 + 0.486*LOGP + 10.124*ATCH5 + 3.444*ATCH4 - 0.000055*MOFI-Y
   LOF: 0.407  r: 0.776  F: 21.57

7: -log(IC50) = -0.226 + 0.608*LOGP - 0.206*PEAX-X
   LOF: 0.396  r: 0.781  F: 20.81

8: -log(IC50) = -0.777 + 0.503*LOGP + 1.345*SUM-F - 0.177*PEAX-X
   LOF: 0.409  r: 0.830  F: 19.86

9: -log(IC50) = 1.643 + 0.666*LOGP - 0.0138*SURF-A
   LOF: 0.410  r: 0.772  F: 20.69

10: -log(IC50) = 0.823 + 0.553*LOGP + 1.347*SUM-F - 0.0118*SURF-A
    LOF: 0.410  r: 0.829  F: 19.77

11: -log(IC50) = 0.849 + 0.510*LOGP + 0.686*ESDL3 - 0.185*PEAX-X
    LOF: 0.416  r: 0.827  F: 19.41

12: -log(IC50) = -3.314 + 0.568*LOGP + 6.852*ATCH1 - 0.000071*MOFI-Y
    LOF: 0.430  r: 0.820  F: 18.44

13: -log(IC50) = -2.169 + 0.576*LOGP + 6.154*ATCH3 - 0.000070*MOFI-Y
    LOF: 0.432  r: 0.819  F: 18.31

14: -log(IC50) = -2.956 + 0.553*LOGP + 0.0030*M-PNT + 1.286*SUM-F - 0.000061*MOFI-Y
    LOF: 0.437  r: 0.867  F: 19.44

15: -log(IC50) = 0.147 + 0.561*LOGP + 0.864*ESDL3 + 2.030*ATCH4 - 0.000076*MOFI-Y
    LOF: 0.440  r: 0.866  F: 19.50

16: -log(IC50) = -2.322 + 0.590*LOGP + 5.507*ATCH5 - 0.000068*MOFI-Y
    LOF: 0.441  r: 0.815  F: 17.76

17: -log(IC50) = 3.280 + 0.536*LOGP + 0.911*ESDL3 + 1.555*ATCH4 - 0.0126*SURF-A
    LOF: 0.444  r: 0.864  F: 19.21

18: -log(IC50) = 0.960 + 0.552*LOGP + 6.011*ATCH3 - 0.0114*SURF-A
    LOF: 0.445  r: 0.813  F: 17.54

19: -log(IC50) = -0.091 + 0.544*LOGP + 6.565*ATCH1 - 0.0115*SURF-A
    LOF: 0.446  r: 0.812  F: 17.43

20: -log(IC50) = 5.571 + 0.560*LOGP - 13.758*ATCH6 - 0.000065*MOFI-Y
    LOF: 0.447  r: 0.812  F: 17.43

(a) In this study, 300 random models were created and then evolved with 3000 genetic crossover operations. All of these models, and many of the additional 280 models, approximately match or exceed the scores of the models built with the variables used by either Selwood et al. or Wikel and Dow. The large number of high-scoring models appears to refute the supposition that there may be a single "best" model using the data.

QSARs are developed to make predictions, not merely for the ability to reproduce the results in the training set. Predictiveness can be estimated using cross-validation. Each sample is systematically removed from the data set, and new regression coefficients are generated for a given model. This newly-regressed model is used to predict the removed sample. This procedure is performed on each sample in sequence. The series of predictions is used to calculate a new value for r, called the cross-validated r. Figure 6 shows the values of cross-validated r using the features of Selwood et al., Wikel and Dow, and the top four models discovered by GFA.

As measured by the cross-validated r, the combinations of features discovered by the GFA algorithm yield models which are more predictive than the combinations of features discovered by either the forward-stepping regression or neural-network techniques. This is because the genetic algorithm specifically searches for combinations of features which score well, rather than trying to identify individual features.
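The leave-one-out procedure just described can be sketched directly. This is a minimal illustration, not the original code: for a fixed design matrix, each sample is removed, the coefficients are regenerated, and the removed sample is predicted; the correlation between these predictions and the observations is the cross-validated r.

```python
import numpy as np

def cross_validated_r(A, y):
    """Leave-one-out cross-validation for a linear model y ~ A @ coeffs.
    A: design matrix (a constant column plus one column per basis function)."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i           # remove sample i
        coeffs, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ coeffs                # predict the removed sample
    # Correlation between the held-out predictions and the observations.
    return np.corrcoef(preds, y)[0, 1]
```

Because each prediction is made by a model that never saw the sample, this r is expected to be lower than the ordinary r, as seen for every model in Figure 6.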

Cross-validation can also be used to confirm or refute hypotheses about the appropriate number of terms in a model. We decided that models with three features were the most desirable from the reported QSARs. Preliminary runs of the program with different values for d indicated a value of 2.0 as being most favorable to generate three-feature models. However, without this prior knowledge, we could have decided to use the default value d = 1.0, which favors a population with an average model length of five features. Figure 7 shows the top four models (rated by LOF score), and the model with the highest cross-validated-r score (which was rated 6 out of 300 by LOF score), for a run using the default value d = 1.0. (Interestingly, the second model is identical to the model recently developed by McFarland and Gans using cluster significance analysis (CSA).22) Whether the improvement in cross-validated r is worth the extra terms may depend on whether the terms suggest a plausible underlying mechanism for their predictiveness. Otherwise, it may be best to stay


Selwood et al.:
-log(IC50) = -3.93 + 0.44*LOGP + 0.008*M-PNT - 0.30*ESDL10
r: 0.737  F: 13.29  cross-validated r: 0.667

Wikel and Dow:
-log(IC50) = -1.63 + 0.231*LOGP + 4.415*ATCH4 + 0.000659*MOFI-X
r: 0.774  F: 13.47  cross-validated r: 0.679

GFA model 1:
-log(IC50) = -2.501 + 0.584*LOGP + 1.513*SUM-F - 0.000075*MOFI-Y
r: 0.849  F: 23.27  cross-validated r: 0.804

GFA model 2:
-log(IC50) = 2.871 + 0.568*LOGP + 0.810*ESDL3 - 0.013*SURF-A
r: 0.848  F: 23.04  cross-validated r: 0.803

GFA model 3:
-log(IC50) = -0.805 + 0.589*LOGP + 0.736*ESDL3 - 0.000077*MOFI-Y
r: 0.838  F: 21.15  cross-validated r: 0.777

GFA model 4:
-log(IC50) = 1.791 + 0.500*LOGP + 0.842*ESDL3 - 0.200*PEAX-X + 2.807*ATCH4
r: 0.880  F: 22.31  cross-validated r: 0.798

Figure 6. Correlation coefficient r, cross-validated r, and F for models using the features of Selwood et al., Wikel and Dow, and the top four models discovered by the GFA algorithm using d = 2.0 for the smoothing parameter.

GFA-1: -log(IC50) = 1.277 - 0.114*DIPV-X + 0.406*LOGP + 4.712*ATCH4 + 12.406*ATCH5 - 0.000050*MOFI-Z
r: 0.909  F: 23.82  cross-validated r: 0.834

GFA-2: -log(IC50) = -1.268 - 0.118*DIPV-X + 0.402*LOGP + 4.824*ATCH4 + 12.017*ATCH5 - 0.000050*MOFI-Y
r: 0.909  F: 23.78  cross-validated r: 0.834

GFA-3: -log(IC50) = 2.618 + 0.490*LOGP + 2.609*ATCH4 + 1.972*SUM-F - 0.125*DIPV-X - 0.000073*MOFI-Z
r: 0.905  F: 22.71  cross-validated r: 0.822

GFA-4: -log(IC50) = -1.815 + 0.442*LOGP + 3.095*ATCH4 + 0.766*ESDL3 - 0.0137*VDWVOL + 0.000433*MOFI-X
r: 0.904  F: 22.48  cross-validated r: 0.836

GFA-6: -log(IC50) = 3.301 + 0.435*LOGP + 5.480*ATCH4 + 21.025*ATCH5 + 22.636*ATCH6 - 0.153*DIPV-X - 0.000056*MOFI-Z
r: 0.920  F: 22.02  cross-validated r: 0.849

    Figure 7. Top four models when the evolution is repeated using th e smaller value d =1.0 for the smoothing parameter; along with the 6t hmodel, which had the highest cross-validated r score in th e population.with sma ller models of ne arly equivalent predictiveness.Both Selwood and W ikel and Dow give short lists of about10 features which they felt were the m ost useful for buildingactivity models. GF A can provide similar information bycounting the number of times each feature is used in thepopulation and ranking the features by that value. (Wecounted feature use using the original value d =2.0 for thesmoothing parame ter). Table 2 shows the top ten featuresselected by Selwood, Wikel and DOW, nd GFA.Th e different te chniqu es show littl e overlap, agreeing onlyon LOCP the partition coe fficient). This is likely caused bydifferent selection pressures under each technique. Forforward -steppin g regression, the appeara nce of a featur ereflects high correlation with the response after decorrelationwith all previously-selectedvariables. For the neural network,the app earanc e reflects an ability to work in conce rt with allof the other features, though the final feature set used in theactivity model will be much smaller. For the GFA algo rithm,the appearance reflects a features utility in many differentcom bination s toward buildin g high-scoring models.The usage of features in the population using GF A ch angesover the evolution of model evaluation. Graph ing the feat ureusage as it changes is a dram atic way to w atch the evolutionof the po pulatio n of models, to estim ate when the populationhas converged and to quickly judge the relative utility ofdifferent features. Such a graph is shown in figure 8 for theSelwood data set.Only the feature LOCP tands out as highly significant.Th e remaind er of the features, those both shown in the graph

of Figure 8 and unshown, are not well-distinguished from each other by usage. This suggests that the information in the data set is duplicated over many of the features. The continuing change in the usage suggests that further evolution of the population may be warranted. However, it should be noted that the genetic algorithm searches for combinations of features which score well, rather than identifying individual features. Hence, a count of use may not necessarily be the best measure of whether a feature was useful in making the best-scoring models, though in this example the top ten features accounted for nearly all the features in the top twenty models.

The above results demonstrate that GFA discovers models that are, at the least, important candidates that must be considered in the search for the best predictive model. However, selection of a single best model and the discarding of the remaining models may not be the most advantageous course. It is proposed that the outputs of the multiple models can be averaged to gain additional predictivity.

Averaging the predictions of some number of the higher-scoring models often gave better predictions than any of the individual models. This behavior can be seen if the results predicted by some number of the top-rated 20 models are averaged. Figure 9 shows the effect of this on the cross-validation coefficient r. The top model predicted the training set with r = 0.849; by averaging its output with the second-rated model, the correlation coefficient climbs to 0.86, and by averaging the result with the second and third rated models, the correlation coefficient is greater than 0.89. Further additions do not increase the correlation coefficient, but the


Table 2. Preferred Variables for the Selwood Data Set from the Three Studies (a)

[Table body: asterisks mark, for each variable (ATCH1, ATCH2, ATCH4, ATCH5, ATCH6, DIPV-X, DIPV-Y, DIPV-Z, ESDL3, ESDL5, ESDL10, LOGP, M-PNT, MOFI-X, MOFI-Y, NSDL2, PEAX-X, PEAX-Y, SUM-F, SUM-R, SURF-A, VDWVOL, and others), whether it was selected by Selwood, by Wikel and Dow, or by GFA; the column markings were garbled in extraction.]

(a) Variables not selected by any of the three techniques are not shown. Selwood used a technique that selected the best correlated variable after decorrelation with the previous selections; Wikel and Dow used a neural network, selecting variables which had large hidden-unit weight in the trained network; this study used a population of initially-random models trained using a genetic algorithm and shows the variables used in 10% or more of the models. The genetic algorithm searches for combinations of features which score well, rather than identifying individual features; thus a count of use is not the best measure of whether a feature was useful in making the best-scoring models. The different techniques show little overlap, agreeing only on LOGP. This is likely caused by different selection pressures under each technique.
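The GFA column of Table 2 can be generated mechanically from any population: count, for each feature, how many models contain it. A minimal sketch — the representation of a model as a list of its feature names is an assumption for illustration:

```python
from collections import Counter

def feature_usage(population):
    """Count how many models in the population use each feature,
    counting a feature at most once per model."""
    counts = Counter()
    for model in population:
        counts.update(set(model))  # set() so repeated terms count once
    return counts

# Toy population of three models over Selwood feature names:
population = [["LOGP", "ATCH4", "MOFI-Y"],
              ["LOGP", "SURF-A"],
              ["LOGP", "ATCH4", "ESDL3"]]
usage = feature_usage(population)
print(usage["LOGP"], usage["ATCH4"])  # 3 2
```

Ranking `usage.most_common()` and re-running the count at intervals during the evolution gives exactly the kind of trajectory plotted in Figure 8.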


average is remarkably robust, remaining above 0.88 even if we average the output of all 300 models in the population.

Figure 8. Change in variable use as the evolution proceeds. (The graph only shows variables that are used in 15% or more of the 300 models.) The feature LOGP is the only one whose use is widespread; it is used in more than 80% of the models. The next most used feature, ATCH4, is used in about 35% of the models, followed closely by MOFI-Y. The remainder of the features, both shown in the graph and unshown, are not well-distinguished from each other by usage. This suggests that the information in the data set is duplicated over many of the features. The continuing change in the usage suggests that further evolution of the population may be warranted.

Figure 9. Number of models in the average versus the correlation coefficient. As the number of models in the average increases, the correlation coefficient increases to 0.88, which is higher than any of the individual models.
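The averaging procedure behind Figure 9 is straightforward. A minimal sketch, assuming (for illustration) that the population is sorted best-first and that each model is a callable returning a prediction:

```python
def ensemble_predict(models, x, k):
    """Average the predictions of the k top-rated models for input x."""
    top = models[:k]
    return sum(m(x) for m in top) / len(top)

# Three toy "models" whose individual biases partly cancel in the mean:
models = [lambda x: x + 0.2, lambda x: x - 0.3, lambda x: x + 0.1]
print(ensemble_predict(models, 5.0, k=1))  # 5.2
print(ensemble_predict(models, 5.0, k=3))  # ~5.0
```

The robustness reported above (the average staying above 0.88 out to all 300 models) is the typical behavior of such an ensemble when the individual models' errors are only partly correlated.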

Figure 10. Shared structure of the acetylcholinesterase inhibitor analogs in the Cardozo/Hopfinger data set.

Table 3. Cardozo/Hopfinger Data Set

no.   -[log(IC50)]   C4      Ut (D)   HOMO energy (eV)
1     8.88           0.356   3.918    -9.305
2     8.28           0.454   2.301    -9.533
3     8.20           0.505   2.855    -9.631
4     (data sample missing from published data set and not used by GFA)
5     8.15           0.234   2.966    -9.593
6     8.05           0.449   3.332    -9.451
7     7.92           0.249   3.121    -9.545
8     7.88           0.468   2.765    -9.615
9     7.70           0.452   2.629    -9.461
10    7.64           0.409   2.185    -9.314
11    7.60           0.364   2.845    -9.581
12    7.44           0.251   3.248    -9.443
13    7.16           0.168   3.004    -9.615
14    7.09           0.247   2.980    -9.511
15    7.06           0.021   3.192    -9.316
16    6.89           0.021   3.009    -9.116
17    6.70           0.226   3.461    -9.856
18    6.42           0.463   1.946    -9.508

C. Cardozo/Hopfinger Data Set. The Cardozo/Hopfinger data set16 contains 17 analogs, 3 features, and a set of corresponding acetylcholinesterase inhibitor activities. The series of analogs have the structure shown in Figure 10. (The activity models used in the original publication used 18 compounds, but the data for one compound was left out of the publication, and so a reduced set of 17 compounds was used in this study.) The data set is given in Table 3.

The compounds were sorted in order of decreasing activity, so that compound 1 is the most active and compound 18 the least active. This data set describes a series of acetylcholinesterase inhibitors with activity measured by -[log(IC50)], where IC50 is the concentration of the analog needed to inhibit the enzyme by 50%. Unlike the Selwood data set, this data set was already reduced using a technique called molecular decomposition-recomposition (MDR),16 so it contained a small


number of features relative to the number of compounds. Hence there was no need for a further reduction in the number of features. The three features selected for this QSAR were C4, the out-of-plane pi orbital coefficient of ring carbon 4; Ut, the total dipole moment; and HOMO, the energy of the highest occupied molecular orbital.

Cardozo et al.16 proposed a model of three features and six terms. This QSAR is given by eq 5:

-log(IC50)full = -740.93 + 2.73 * C4 - 0.14 * Ut^2 - 156.7 * HOMO - 8.25 * HOMO^2 + 1.86 * Ut   (5)

N: 18; r: 0.804; F: 3.66

Most of the error in the

QSAR is due to the contributions of two compounds in the data set, analogs 5 and 18. Removal of these two compounds yields another QSAR, given by eq 6:

-log(IC50)red = -757.52 + 2.21 * C4 - 6.65 * Ut - 1.18 * Ut^2 - 8.58 * HOMO^2 - 162.9 * HOMO   (6)

N: 16; r: 0.939; F: 3.95

The elimination of these

two compounds yields a QSAR with a much higher correlation score. However, explicit user intervention was required to identify and remove the outliers. Moreover, in most cases no classification process is used that would assist in identifying whether a given test compound should be treated as an outlier. In the next section, we will show how GFA uses spline-based terms to automatically partition the compounds in the data set, and give models over the full data set which are superior to the QSAR expressed by eq 5.

D. GFA Applied to the Cardozo/Hopfinger Data Set. The GFA algorithm was applied to the Cardozo/Hopfinger data set to illustrate the automatic partitioning behavior of spline-based models in QSAR analysis.

QSAR analysis with GFA began with a population of 300 random models. The terms of the models were linear polynomials, quadratic polynomials, linear splines, and quadratic splines. Because there were only three features in the data set, feature selection was not a critical issue. Instead, we included spline basis functions to explore partitions of the data set. The population was evolved for 5,000 crossover operations. By that point there was little continued improvement in the average score of the models in the population.

The splines used are truncated power splines and are denoted with angle brackets. For example, <f(x) - a> is equal to zero if the value of (f(x) - a) is negative; otherwise, it is equal to (f(x) - a). For example, <Ut - 2.765> is zero when Ut is less than 2.765.
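The truncated power spline is trivial to implement. A minimal sketch, using Ut values taken from Table 3 (the knot value 2.765 matches the Ut of compound 8 and is assumed from the partially legible example in the text):

```python
def spline(x: float, knot: float) -> float:
    """Linear truncated power spline <x - knot>: zero when x <= knot,
    otherwise x - knot."""
    return max(x - knot, 0.0)

def spline2(x: float, knot: float) -> float:
    """Quadratic truncated power spline <x - knot>^2."""
    return spline(x, knot) ** 2

# <Ut - 2.765> evaluated for two Table 3 compounds:
print(spline(1.946, 2.765))  # compound 18: 0.0 (below the knot)
print(spline(3.461, 2.765))  # compound 17: ~0.696
```

Because the term is identically zero on one side of the knot, a fitted coefficient on a spline term only "sees" the compounds on the other side — which is what lets GFA partition the data set automatically.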


Table 4. Top 10 Models Derived for the Cardozo/Hopfinger Data Set (a)

[Table body: ten spline-based models, with LOF scores ranging from 0.209 to 0.226, r from 0.916 to 0.923, and F from 15.65 to 17.16; the model equations themselves were garbled in extraction. Recurring terms include C4, <2.301 - Ut> (nonzero only for compound 18), and <Ut - 2.845> or <Ut - 2.966>.]

(a) Angle brackets are used to denote spline terms: zero if the contents are negative, otherwise the value of the contents. Curly brackets list the compound numbers for which that term is nonzero. Terms explored were linear and quadratic polynomials and linear and quadratic splines.

Figure 12. Ten most frequently used basis functions in the population of 300 models. The first column contains the basis function; the second, the number of models which use that basis function; the third, the set of samples for which the function is nonzero; and the last column gives comments on the role the spline term is playing in the model. [Among the legible rows: C4 (61 models, all samples); <Ut - 2.966>^2 (58 models, all samples but {5}); <2.301 - Ut> (44 models, {18}, outlier removal).]
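The third column of Figure 12 can be recomputed directly from the data: evaluate a spline term over every compound and record which ones it leaves nonzero. A sketch using the Ut values of four Table 3 compounds (the helper function and its name are illustrative, not from the paper):

```python
def nonzero_samples(values_by_compound, knot, reversed_=False):
    """Return the compound numbers for which the spline term
    <x - knot> (or <knot - x> when reversed_ is True) is nonzero."""
    nonzero = []
    for compound, x in values_by_compound.items():
        arg = (knot - x) if reversed_ else (x - knot)
        if arg > 0.0:
            nonzero.append(compound)
    return nonzero

# Ut for compounds 2, 8, 17, and 18 of Table 3:
ut = {2: 2.301, 8: 2.765, 17: 3.461, 18: 1.946}
print(nonzero_samples(ut, 2.301, reversed_=True))  # [18] -- outlier removal
print(nonzero_samples(ut, 2.765))                  # [17]
```

A term such as <2.301 - Ut>, nonzero for compound 18 alone, is the spline-based analog of the manual outlier removal performed for eq 6.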

E. QSPR Data Sets. The QSPR data sets from Koehler and Hopfinger17 contain seven features and either 35 or 30 compounds of structurally diverse polymers. The former data set was used to predict the property Tg, the glass transition temperature, and the latter data set was used to predict the property Tm, the melt transition temperature. The seven features were SB and SS, the backbone and side-chain contributions to the monomer conformational entropy; MB

and MS, the backbone and side-chain mass moments; and ED, E+, E-, the dispersion, positive electrostatic, and negative electrostatic intermolecular energies for the complete monomer unit. The data set for Tg is given in Table 5. The data set for Tm is given in Table 6.

Figure 13. Histogram of the number of times a given compound was made a special case in the top ten models. The histogram shows a skew toward the highest-numbered compounds, which are also the least active. In effect, the GFA algorithm is making special cases of the least-active compounds and modeling the patterns it found in the most active analogs.


Table 5. Koehler and Hopfinger Data Set for Tg, the Glass Transition Temperature (a)

[Table body: 35 polymers, with columns compd no., SB, MB, SS, MS, ED, E+, E-, Tg(1), Tg(2), and Tg(used); the column values were garbled in extraction.]

(a) SB and SS are the backbone and side-chain contributions to the monomer conformational entropy; MB and MS are the backbone and side-chain mass moments; ED, E+, E- are the dispersion, positive electrostatic, and negative electrostatic probe energies for the monomer unit. Some of the compounds have two experimental values for Tg. The final column is the observed value for Tg which Koehler and Hopfinger compared their predictions against, and the one used to train the GFA models. The polymers corresponding to each row in the table are reported in ref 17.

Koehler and Hopfinger proposed separate models of five features and six terms for Tg and Tm. These models are shown in Figure 14.

F. GFA Applied to the QSPR Data Sets. Genetic function approximation was applied to the QSPR data sets to illustrate the applicability of the analysis process to QSPR problems. Two separate analyses were conducted for the two variables Tg and Tm. Each analysis with GFA began with a population of 300 random models. The terms of the models were linear polynomials and linear splines. The population was evolved for 5000 crossover operations.

The top 10 models discovered by the GFA algorithm for Tg are shown in Table 7; the top 10 models for Tm are shown in Table 8.

The Tg models discovered and rated best have a small improvement in the correlation coefficient and fewer terms than the QSPR of Koehler and Hopfinger. The feature use appears different in the GFA models as compared to the original QSPR. For example, SS, MB, and MS are not used in any of the top 10 models for Tg, while ED (which was not used in the original QSPR) was used in 9 out of 10. In fact, the ability to build models which are competitive with the original QSPR, but which contain only features relating to energy, is an interesting result, and parallels a discussion in the original paper, which suggests (and rejects) the possibility of not considering mass moments or side-chain conformational energy.

Other patterns emerge that may be of note.
For example,

a pair of spline terms based on E-, but with opposite signs, appears in 8 out of 10 of the top models. These pairs appear to be isolating a central region of values for E- that has the most effect on the value of Tg. The relatively large coefficients of these terms suggest caution, as we may be seeing the amplification of a chance pattern in the data set, but it is certainly worth presenting to the researcher for consideration.

The Tm models discovered and rated best also show improvement in the correlation coefficient over the original QSPR, and have fewer terms. The feature use is different, both from the original QSPR and from the patterns of the Tg models. Seven of the 10 top models use only ED, E+, E-, and SS. There were two models of only four terms, and one model of six terms. No example of the pairing of dual E- spline terms was seen. None of the spline terms were found to isolate one or two outliers in the data set. Instead, the splines seem to be performing identification of ranges of the variables that may be of interest. For example, 8 out of the top 10 models use the spline term <ED + 1.660>, which separates out the 14 data compounds with ED > -1.660. Whether this is due to any underlying mechanism that would make only that range of the feature important needs to be determined.

The differences in feature use between the Tg models and the Tm models are best illustrated by graphing the use of features in the population of models as evolution proceeds. This is shown in Figure 15. Some features, such as E+ and E-, are similar in their use. Others, such as SB, are quite different, being greatly used to predict one but not both of the transition temperatures. Again, it can be seen that studying the populations of models, and comparing populations, can be a


Table 6. Koehler and Hopfinger Data Set for Tm, the Melt Transition Temperature (a)

[Table body: 30 polymers, with columns compd no., SB, MB, SS, MS, ED, E+, E-, Tm(1), Tm(2), and Tm(used); the column values were garbled in extraction.]

(a) SB and SS are the backbone and side-chain contributions to the monomer conformational entropy; MB and MS are the backbone and side-chain mass moments; ED, E+, E- are the dispersion, positive electrostatic, and negative electrostatic probe energies for the monomer unit. Some of the compounds have two experimental values for Tm. The final column is the observed value for Tm which Koehler and Hopfinger compared their predictions against, and the one used to train the GFA models. The polymers corresponding to each row in the table are reported in ref 17.

Figure 14. Proposed QSPR models for Tg and Tm. [Each model uses the five features SB, MB, SS, E+, and E- in six terms; the printed coefficients were garbled in extraction. Tg model: n: 35, r: 0.954, F: 60.51. Tm model: n: 30, r: 0.907, F: 23.91.]

powerful tool in analyzing a data set.

4. CONCLUSION

The genetic function approximation (GFA) algorithm offers a new approach to the problem of building activity models. Replacing standard regression analysis with the GFA algorithm allows the construction of models competitive with, or

superior to, standard techniques and makes available additional information not provided by other techniques.

A fundamental difference between GFA and other techniques is the creation and use of multiple models rather than a single model. While one can simply select the model from the population with the lowest LOF score, it is usually preferable to inspect the different models and select, with the aid of scientific intuition, using the appropriateness of the features, the basis functions, and the combinations. The population can be studied for information on feature use, and predictions can often be improved by averaging the results of multiple models rather than relying on an individual model.

The method of model construction also has important consequences. Most techniques choose features incrementally and may not find combinations whose components are not

Table 7. Top 10 Models for the QSPR Tg Data Set

[Table body: the model equations, built from linear polynomials and linear splines, were garbled in extraction (one legible statistic: F: 97.34).]


Table 8. Top 10 Models for the QSPR Tm Data Set

[Table body: the model equations were garbled in extraction.]


(15) Selwood, D. L.; Livingstone, D. J.; Comley, J. C.; O'Dowd, A. B.; Hudson, A. T.; Jackson, P.; Jandu, K. S.; Rose, V. S.; Stables, J. N. J. Med. Chem. 1990, 33, 136.
(16) Cardozo, M. G.; Iimura, Y.; Sugimoto, H.; Yamanishi, Y.; Hopfinger, A. J. QSAR Analysis of the Substituted Indanone and Benzylpiperidine Rings of a Series of Indanone-Benzylpiperidine Inhibitors of Acetylcholinesterase. J. Med. Chem. 1992, 35, 584-589.
(17) Koehler, M. G.; Hopfinger, A. J. Molecular modelling of polymers: 5. Inclusion of intermolecular energetics in estimating glass and crystal-melt transition temperatures. Polymer 1989, 30, 116-126.
(18) Livingstone, D. J.; Hesketh, G.; Clayworth, D. Novel Method for the Display of Multivariate Data Using Neural Networks. J. Mol. Graphics 1991, 9, 115-118.

(19) Rose, V. S.; Croall, I. F.; MacFie, H. J. H. An Application of Unsupervised Neural Network Methodology (Kohonen Topology-Preserving Mapping) to QSAR Analysis. Quant. Struct.-Act. Relat. 1991, 10, 6-15.
(20) Rose, V. S.; Wood, J.; MacFie, H. J. H. Single Class Discrimination Using Principal Component Analysis (SCD-PCA). Quant. Struct.-Act. Relat. 1991, 10, 359-368.
(21) Rose, V. S.; Wood, J.; MacFie, H. J. H. Generalized Single Class Discrimination (GSCD). A New Method for the Analysis of Embedded Structure-Activity Relationships. Quant. Struct.-Act. Relat. 1992, 11, 492-504.
(22) McFarland, J. W.; Gans, D. J. On Identifying Likely Determinants of Biological Activity in High-Dimensional QSAR Problems. Quant. Struct.-Act. Relat., in press.