Hindawi Publishing Corporation
BioMed Research International
Volume 2013, Article ID 485034, 8 pages
http://dx.doi.org/10.1155/2013/485034

Research Article

Predicting Drugs Side Effects Based on Chemical-Chemical Interactions and Protein-Chemical Interactions

Lei Chen,1 Tao Huang,2,3,4 Jian Zhang,5 Ming-Yue Zheng,6 Kai-Yan Feng,7 Yu-Dong Cai,8,9 and Kuo-Chen Chou9,10

1 College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
2 Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
3 Shanghai Center for Bioinformation Technology, Shanghai 200235, China
4 Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York City, NY 10029, USA
5 Department of Ophthalmology, Shanghai First People's Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200080, China
6 State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Shanghai 201203, China
7 Beijing Genomics Institute, Shenzhen Beishan Industrial Zone, Shenzhen 518083, China
8 Institute of Systems Biology, Shanghai University, Shanghai 200444, China
9 Gordon Life Science Institute, Belmont, Massachusetts 02478, USA
10 Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia

Correspondence should be addressed to Yu-Dong Cai; cai_yud@126.com and Kuo-Chen Chou; kcchou@gordonlifescience.org

Received 18 June 2013; Accepted 30 July 2013

Academic Editor: Bing Niu

Copyright © 2013 Lei Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A drug side effect is an undesirable effect which occurs in addition to the intended therapeutic effect of the drug. The unexpected side effects that many patients suffer from are the major causes of large-scale drug withdrawal. To address the problem, computational methods for predicting the side effects of drugs are in high demand in the pharmaceutical industry. In this study, a novel computational method was developed to predict the side effects of drug compounds by hybridizing the chemical-chemical and protein-chemical interactions. Compared to most of the previous works, our method can rank the potential side effects for any query drug according to their predicted level of risk. A training dataset and a test dataset were constructed from a benchmark dataset that contains 835 drug compounds to evaluate the method. By a jackknife test on the training dataset, the 1st order prediction accuracy was 86.30%, while it was 89.16% on the test dataset. It is expected that the new method may become a useful tool for drug design and that the findings obtained by hybridizing various interactions in a network system may provide useful insights for conducting in-depth pharmacological research as well, particularly at the level of systems biomedicine.

1. Introduction

Many drugs approved by the Food and Drug Administration (FDA) are recalled each year after unexpected side effects are discovered; for example, in 2010, Reductil/Meridia, Mylotarg, and Avandia were withdrawn. According to "Drug Recall" (http://www.drugrecalls.com/drugrecalls.html), about 20 million people had taken drugs in 1997 and 1998 that were later withdrawn. Drug side effects may have seriously harmful consequences for human beings [1]. For instance, the antiobesity drug fenfluramine/phentermine, also known as fen-phen, may cause heart disease and hypertension. Developing and producing drugs that are later found to have serious side effects would be a disaster for a pharmaceutical company. For instance, the withdrawal of the aforementioned antiobesity drug has cost Wyeth more than $21 billion in America alone [2]. Therefore, discovering the side effects of a drug compound in the early phase of drug discovery would not only avoid harm to patients but also avoid wasting a great deal of money.

Many efforts have been made in this regard, such as utilizing the drug perturbed gene expression profiles or biological pathways to predict the side effects of drugs [1, 3–7],
and using chemical structures for the prediction of drug side effects [8–10]. Although most of these methods can indicate whether the query drug has some side effects, they cannot determine which side effects are most likely to happen, or even the order information of the side effects. In this study, we proposed a novel computational method to predict the side effects of drugs based on chemical-chemical interaction and protein-chemical interaction. Compared to most of the previous studies, our method can provide the order information of the side effects, that is, it prioritizes the side effects from the most likely one to the least likely one.

During the past decade, many compound databases have been constructed, such as KEGG (Kyoto Encyclopedia of Genes and Genomes) [11] and STITCH (Search Tool for Interactions of Chemicals) [12]. KEGG provides the information of chemical substances and reactions, while STITCH provides the interaction information of chemicals and proteins. Thus, we can acquire the properties of many compounds and their other information from these databases. For those compounds not covered by these databases, their properties can be inferred from the property-known compounds stored in the databases [13–16]. Likewise, drug side effects can also be inferred, as elaborated below.

Recently, it was shown that interactive proteins are more likely to share common biological functions [17–20] and that interactive compounds are also more likely to share common biological functions [13, 16]. Since side effects are part of the biological functions of drugs, it would be feasible to use chemical-chemical interactions to identify drug side effects. Unfortunately, the side effects of some query drugs cannot be predicted in this way, because their interactive counterparts do not have any side effect information. To overcome this difficulty, we proposed to utilize the information of indirect interactions, including both the chemical-chemical interaction and the protein-chemical interaction, to identify the side effects of drugs for which direct chemical-chemical interaction data are not available. To evaluate the method, a benchmark dataset consisting of 835 drug compounds was constructed from SIDER [21], and it was divided into one training dataset and one test dataset. By a jackknife test on the training dataset, the 1st order prediction accuracy was 86.30%, while it was 89.16% on the test dataset. To confirm the effectiveness of the method, another method, based on chemical structure similarity obtained from SMILES strings [22], was also applied to the training and test datasets. Encouraged by the good performance of the method and its superiority to the method based on chemical structure similarity, we hope that the proposed method can become a useful tool to predict drug side effects and screen out drugs with undesired side effects.

2. Materials and Methods

2.1. Benchmark Dataset. The benchmark dataset used in the current study was downloaded from SIDER [21] at http://sideeffects.embl.de/, which integrated the side effects of 888 drugs from the US Food and Drug Administration (FDA) and other sources [21]. To obtain a high-quality, well-defined benchmark dataset, the data were collected strictly according to the following criteria: (i) only the 100 side effects associated with the most drugs in SIDER, and the corresponding drugs, were included and (ii) drugs without both chemical-chemical interactions and protein-chemical interactions were also excluded. Finally, we obtained a benchmark dataset $S$ that contained 835 drugs belonging to 100 categories of side effects. The codes of the 835 drugs in each of the 100 side effect categories are given in Supplementary Material I, available online at http://dx.doi.org/10.1155/2013/485034.

For the convenience of later formulation, let us use the symbols $C_1, C_2, C_3, \ldots, C_{100}$ to tag the 100 side effects, where $C_1$ represents "Nausea", $C_2$ "Headache", $C_3$ "Vomiting", and so forth, as described in the table in Supplementary Material II, in which the number of drugs with each of the 100 side effect tags is also given. Thus, the benchmark dataset $S$ can be formulated as

$$S = S_1 \cup S_2 \cup \cdots \cup S_{100}, \tag{1}$$

where $S_i$ represents the subset that contains the drugs with the side effect $C_i$ ($i = 1, 2, \ldots, 100$).

Since many drugs in $S$ have multiple side effects, that is, they may simultaneously occur in subsets with different side effect tags, it is instructive to introduce the concept of a "virtual drug" sample, illustrated as follows. A drug compound coexisting in two different side effect subsets will be counted as 2 virtual drugs, even though they have an identical chemical structure; if coexisting in three different subsets, 3 virtual drugs; and so forth. Accordingly, the total number of structure-different drug compounds and the total number of side-effect-different virtual drug compounds can be described by the following equation:

$$N(\mathrm{str}) = 835, \qquad N(\mathrm{vir}) = \sum_{i=1}^{100} N(C_i) = 30114, \tag{2}$$

where $N(\mathrm{str})$ is the total number of structure-different drug compounds, $N(\mathrm{vir})$ the total number of side-effect-different virtual drug compounds in $S$, and $N(C_i)$ the number of drugs with the side-effect tag $C_i$. Substituting the values of $N(C_i)$ ($i = 1, 2, \ldots, 100$) from the table in Supplementary Material II into (2), we obtained $N(\mathrm{vir}) = 30114$, fully consistent with the results in (2) and the table in Supplementary Material II.
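The bookkeeping behind (1) and (2) can be illustrated on toy data. The following is only a minimal sketch: the mapping from side-effect tags to sets of drug identifiers, the drug names, and the function names are all hypothetical and are not taken from the benchmark dataset, for which the same two counts would be 835 and 30114.

```python
def count_structure_different(subsets):
    """N(str): number of distinct drugs in the union of all side-effect subsets."""
    return len(set().union(*subsets.values()))


def count_virtual(subsets):
    """N(vir): sum of subset sizes, so a drug is counted once per side effect."""
    return sum(len(drugs) for drugs in subsets.values())


# Toy data with hypothetical drug identifiers (not the SIDER benchmark).
toy_subsets = {
    "Nausea": {"drugA", "drugB", "drugC"},
    "Headache": {"drugA", "drugC"},
    "Vomiting": {"drugA"},
}

n_str = count_structure_different(toy_subsets)
n_vir = count_virtual(toy_subsets)
print(n_str, n_vir)
```

In this toy case, drugA appears in three subsets and therefore contributes three virtual drugs, which is why the virtual count exceeds the structure-different count.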

It can be seen from (2) that the total number of side-effect-different virtual drug compounds is much greater than that of structure-different drug compounds. To provide an intuitive view of their distribution, a histogram of the number of drugs versus the number of side effects is given in Figure 1, from which we can see that, of the 835 drugs, only 6 have one side effect, while the majority have more than 10 side effects. Thus, the prediction of drug side effects is a multilabel classification problem. As in the case of compounds with multiple properties [13, 16], the proposed method provides the order information of side effects from the most likely to the least likely.

Figure 1: A histogram of the number of drugs versus the number of side effects (x-axis: number of side effects; y-axis: number of drugs).

To evaluate the methods described below sufficiently, we randomly selected 10% (83) of the samples from $S$ to compose the test dataset, denoted by $S_{te}$, while the remaining 752 samples in $S$ were used to construct the training dataset, denoted by $S_{tr}$.
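A random split of this kind could be reproduced along the following lines. This is a sketch under stated assumptions: the text only says the test samples were randomly selected, so the fixed seed, the rounding down of 10% of 835 to 83, and the function name `split_dataset` are our own illustrative choices.

```python
import random


def split_dataset(drug_ids, test_fraction=0.10, seed=0):
    """Randomly hold out a fraction of the drugs as a test set.

    The seed and the floor rounding are assumptions made for reproducibility.
    """
    ids = sorted(drug_ids)
    n_test = int(len(ids) * test_fraction)  # 835 drugs -> 83 test samples
    rng = random.Random(seed)
    test = set(rng.sample(ids, n_test))
    train = [d for d in ids if d not in test]
    return train, sorted(test)


# Integer placeholders stand in for the 835 drug codes.
train_ids, test_ids = split_dataset(range(835))
print(len(train_ids), len(test_ids))
```

Sorting the identifiers before sampling makes the split depend only on the seed and not on the iteration order of the input.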

2.2. Chemical-Chemical Interactions and Protein-Chemical Interactions. It has been shown that interactive proteins are more likely to share common biological functions than noninteractive ones [17–20]. Likewise, some pioneering studies [13, 16] have indicated that interactive compounds follow similar rules. Since side effects are part of the biological functions of drugs, using the properties of interactive compounds to identify drug side effects is a feasible scheme.

To obtain the information of interactive compounds, we downloaded the data of chemical-chemical interactions from STITCH (http://stitch.embl.de/, chemical_chemical.links.detailed.v3.0.tsv.gz) [12], a well-known database containing known and predicted chemical-chemical interaction and protein-chemical interaction data from experiments, literature, or other reliable sources. In the datafile obtained, each interaction unit contains two chemicals and five scores with titles "Similarity", "Experimental", "Database", "Textmining", and "Combined score", respectively. Since the last score combines the information of the other scores, we utilized it to indicate the interactivity of two chemicals in this study; that is, compounds in an interaction unit with "Combined score" greater than zero were deemed to be interactive compounds. The interactive compounds thus considered here satisfy one of the following three properties: (I) they participate in the same reactions; (II) they share similar structures or activities; (III) they have literature associations. These three properties indicate that the interactive compounds occupy the same biological pathways, suggesting that they may induce similar side effects. This supports the feasibility of using chemical-chemical interactions retrieved from STITCH to identify drug side effects. The "Combined score" is termed the confidence score because its value indicates the likelihood that two compounds interact; that is, two compounds with a high "Combined score" interact with high probability. For any two drug compounds $d_1$ and $d_2$, their interaction confidence score was denoted by $Q_c(d_1, d_2)$. In particular, if the interaction between $d_1$ and $d_2$ did not exist, their interaction confidence score was set to zero, that is, $Q_c(d_1, d_2) = 0$.
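One way to hold the confidence scores $Q_c$ in memory is a nested lookup that returns zero for unrecorded pairs. The sketch below is illustrative only: the column names `chemical1`, `chemical2`, and `combined_score` are hypothetical placeholders (the header of the real STITCH file should be checked), the rows are made-up toy data, and treating the score as symmetric in its two arguments is our assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; check the header of the real STITCH file.
COL_A, COL_B, COL_SCORE = "chemical1", "chemical2", "combined_score"


def load_confidence_scores(tsv_text):
    """Return a symmetric nested dict: scores[a][b] == scores[b][a]."""
    scores = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        score = float(row[COL_SCORE])
        if score > 0:  # only positive combined scores count as interactions
            scores[row[COL_A]][row[COL_B]] = score
            scores[row[COL_B]][row[COL_A]] = score
    return scores


def q_c(scores, d1, d2):
    """Q_c(d1, d2), defaulting to zero when no interaction is recorded."""
    return scores.get(d1, {}).get(d2, 0.0)


# Toy rows with made-up identifiers and scores.
toy_tsv = (
    "chemical1\tchemical2\tcombined_score\n"
    "CID1\tCID2\t900\n"
    "CID2\tCID3\t150\n"
)
toy_scores = load_confidence_scores(toy_tsv)
print(q_c(toy_scores, "CID2", "CID1"), q_c(toy_scores, "CID1", "CID3"))
```

The same structure could hold protein-chemical scores by keying the outer dictionary on the drug and the inner one on the protein.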

Since the data of chemical-chemical interactions in STITCH are not yet complete, that is, some potential chemical-chemical interactions may not be reported in STITCH, prediction methods based on chemical-chemical interactions have the limitation that samples without interactive counterparts in the training dataset cannot be processed. Thus, it is necessary to devise new schemes to measure the interactions that are not reported in STITCH. It is known that if two drug compounds can interact with a third compound or protein, these two drug compounds are likely to share some common functions. In view of this, we proposed a new scheme to measure the likelihood of interaction of two chemicals based on indirect chemical-chemical and protein-chemical interactions.

The data for the protein-chemical interactions were also downloaded from STITCH (http://stitch.embl.de/, protein_chemical.links.detailed.v3.0.tsv.gz). Each interaction unit in the datafile contains one compound, one protein, and four scores with titles "Experimental", "Database", "Textmining", and "Combined score", respectively. By a similar argument, we used the value of "Combined score", also termed the confidence score, to indicate the likelihood of the interaction's occurrence. For one protein $p$ and one drug compound $d$, their interaction confidence score was denoted as $Q_p(p, d)$. If there was no interaction at all between the protein $p$ and the drug $d$, it was also set to zero, that is, $Q_p(p, d) = 0$.

Now we are ready to introduce the new scheme to measure the likelihood of interaction of two chemicals. For two compounds $d_1$ and $d_2$, let $I_c(d_1)$ denote the set of compounds that directly interact with the drug $d_1$ and $I_c(d_2)$ the set of compounds that directly interact with the drug $d_2$, formulated as

$$I_c(d_1) = \{ d : Q_c(d, d_1) > 0 \}, \qquad I_c(d_2) = \{ d : Q_c(d, d_2) > 0 \}. \tag{3}$$

In a similar way, let $I_p(d_1)$ denote the set of proteins that directly interact with the drug $d_1$ and $I_p(d_2)$ the set of proteins that directly interact with the drug $d_2$, formulated as

$$I_p(d_1) = \{ p : Q_p(p, d_1) > 0 \}, \qquad I_p(d_2) = \{ p : Q_p(p, d_2) > 0 \}. \tag{4}$$

According to set theory, the drug compounds that interact with both the drug $d_1$ and the drug $d_2$ form the intersection of the set $I_c(d_1)$ and the set $I_c(d_2)$, that is, they form the set given by

$$I_c(d_1, d_2) = I_c(d_1) \cap I_c(d_2). \tag{5}$$

Likewise, the human proteins that interact with both the drug $d_1$ and the drug $d_2$ form the intersection of the set $I_p(d_1)$ and the set $I_p(d_2)$, that is, they form the set given by

$$I_p(d_1, d_2) = I_p(d_1) \cap I_p(d_2). \tag{6}$$

Thus, the likelihood of the interaction between $d_1$ and $d_2$ can be calculated via the following equation:

$$Q_h(d_1, d_2) = \left( \sum_{d' \in I_c(d_1, d_2)} \left( Q_c(d_1, d') + Q_c(d_2, d') \right) + \sum_{p' \in I_p(d_1, d_2)} \left( Q_p(p', d_1) + Q_p(p', d_2) \right) \right) \times \left( 2 \left| I_c(d_1, d_2) \cup I_p(d_1, d_2) \right| \right)^{-1}, \tag{7}$$

where $\in$ is a symbol in set theory meaning "member of".
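Equations (3) to (7) amount to averaging the confidence scores over the partners that two drugs have in common. The sketch below shows one possible reading; the nested-dictionary layout (`chem_scores[d][x]` for $Q_c(d, x)$ and `prot_scores[d][p]` for $Q_p(p, d)$), the toy identifiers and scores, and the choice to return zero when the two drugs share no partner (where (7) is undefined) are all assumptions on our part.

```python
def neighbors(scores, d):
    """Partners with a positive score against d, i.e. I_c(d) or I_p(d)."""
    return {x for x, s in scores.get(d, {}).items() if s > 0}


def q_h(chem_scores, prot_scores, d1, d2):
    """Hybrid interaction score of equation (7)."""
    shared_c = neighbors(chem_scores, d1) & neighbors(chem_scores, d2)  # eq. (5)
    shared_p = neighbors(prot_scores, d1) & neighbors(prot_scores, d2)  # eq. (6)
    # |I_c ∪ I_p|; assumes compound and protein identifiers never collide.
    n_shared = len(shared_c | shared_p)
    if n_shared == 0:
        return 0.0  # assumption: no shared partner means no indirect evidence
    total = sum(chem_scores[d1][x] + chem_scores[d2][x] for x in shared_c)
    total += sum(prot_scores[d1][p] + prot_scores[d2][p] for p in shared_p)
    return total / (2 * n_shared)


# Toy scores with made-up identifiers.
toy_chem = {"d1": {"x": 800.0, "y": 400.0}, "d2": {"x": 600.0}}
toy_prot = {"d1": {"P1": 500.0}, "d2": {"P1": 700.0, "P2": 300.0}}
print(q_h(toy_chem, toy_prot, "d1", "d2"))
```

In the toy data the two drugs share one compound and one protein, so the four scores involved are summed and divided by 2 × 2.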

2.3. Interaction-Based Method. It is instructive to recall that, by using the information of protein-protein interactions, some methods have been developed to successfully predict the properties of proteins [17–20, 23]. The underlying idea of these methods was the assumption that interactive proteins are more likely to share common biological functions than noninteractive ones. Similarly, based on the argument in Section 2.2 and some previous studies [13, 16], interactive drugs are more likely to share similar side effects than noninteractive ones. Based on this idea, the following prediction method based on chemical-chemical and protein-chemical interactions was developed.

For convenience, some notations are necessary. Suppose there are $n$ drugs in the training set $S'$, say $d_1, d_2, \ldots, d_n$; the side effects of the drug $d_i$ in the training dataset are described as

$$C(d_i) = [c_{i1}, c_{i2}, \ldots, c_{i100}]^{T} \quad (i = 1, 2, \ldots, n), \tag{8}$$

where $T$ is the transpose operator and

$$c_{ij} = \begin{cases} 1, & \text{if } d_i \text{ has the side effect } C_j, \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Prediction Based on Chemical-Chemical Interactions. As elaborated previously, interactive drugs tend to share similar side effects. The likelihood that the query drug $d$ has the side effect $C_j$ can be calculated by

$$\Pi_c(d \to C_j) = \sum_{d_i \in S'} Q_c(d, d_i) \cdot c_{ij}, \quad j = 1, 2, \ldots, 100. \tag{10}$$

According to (10), a greater score $\Pi_c(d \to C_j)$ means that there are many interactive compounds of $d$ that have the side effect $C_j$, or that some interactions between $d$ and its interactive compounds with the side effect $C_j$ are labeled with high confidence scores. Thus, the greater the score $\Pi_c(d \to C_j)$ is, the more likely the drug compound $d$ has the $j$th side effect, with $\Pi_c(d \to C_j) = 0$ indicating that the probability for the drug $d$ having the $j$th side effect is zero. Since a drug usually has multiple side effects (see Figure 1), the prediction should provide a series of candidate side effects, ranging from the most likely one to the least likely one, rather than only giving the most likely one. Thus, for a query drug $d$, suppose we have

$$\Pi_c(d \to C_2) \ge \Pi_c(d \to C_4) \ge \cdots \ge \Pi_c(d \to C_{90}) > 0, \tag{11}$$

meaning that the highest likelihood of side effect for the drug $d$ is $C_2$ or "Headache" (cf. the table in Supplementary Material II), the second highest is $C_4$ or "Rash", and so forth. In other words, $C_2$ is called the 1st order prediction, $C_4$ the 2nd order prediction, and so forth. Note that the outcome of (10) might be trivial, that is,

$$\Pi_c(d \to C_j) = 0 \quad \text{for } j = 1, 2, \ldots, 100, \tag{12}$$

implying that no meaningful or direct interactive drug compounds whatsoever can be found in the training dataset $S'$ for the drug $d$. Under such a circumstance, an alternative approach should be used for predicting its side effects, as elaborated below.

Prediction Based on Hybrid Interactions. When the query drug $d$ did not have any directly interactive drugs in the training dataset $S'$, or the information of its directly interactive drugs was trivial, the data for the indirect chemical-chemical and protein-chemical interactions would be used to predict its side effects. The prediction method was formulated in a similar way to the above method. But now, instead of (10), the likelihood that the query drug $d$ has the side effect $C_j$ should be calculated by

$$\Pi_h(d \to C_j) = \sum_{d_i \in S'} Q_h(d, d_i) \cdot c_{ij}. \tag{13}$$

By integrating the above two different approaches, the following steps were adopted to predict the side effects of the query drug $d$.

Step 1. The method based on the chemical-chemical interactions, that is, (10), was first utilized to identify its side effects.

Step 2. If the outcomes were trivial or no meaningful results were obtained, as in the case of (12), the method based on the hybrid interactions, that is, (13), would be utilized to continue the prediction.
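The two steps above can be sketched as a scoring function with a fallback. This is a minimal illustration and not the authors' implementation: the pair-score callables, the toy drugs with three side effects instead of 100, and the tie-breaking by side-effect index are all assumptions introduced here.

```python
def score_side_effects(query, training_labels, pair_score):
    """Equations (10) and (13): sum pair_score(query, d_i) * c_ij over training drugs."""
    n_effects = len(next(iter(training_labels.values())))
    scores = [0.0] * n_effects
    for drug, labels in training_labels.items():
        weight = pair_score(query, drug)
        if weight > 0:
            for j, c_ij in enumerate(labels):
                scores[j] += weight * c_ij
    return scores


def rank_side_effects(scores):
    """Side-effect indices from most to least likely (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def predict(query, training_labels, direct_score, hybrid_score):
    scores = score_side_effects(query, training_labels, direct_score)  # Step 1
    if not any(s > 0 for s in scores):  # trivial outcome, as in (12)
        scores = score_side_effects(query, training_labels, hybrid_score)  # Step 2
    return rank_side_effects(scores)


# Toy training set: label vectors c_i1..c_i3 for two hypothetical drugs.
toy_labels = {"t1": [1, 0, 1], "t2": [0, 1, 1]}
toy_direct = {("q1", "t1"): 900.0, ("q1", "t2"): 100.0}
toy_hybrid = {("q2", "t2"): 500.0}


def direct(a, b):
    return toy_direct.get((a, b), 0.0)


def hybrid(a, b):
    return toy_hybrid.get((a, b), 0.0)


print(predict("q1", toy_labels, direct, hybrid))
print(predict("q2", toy_labels, direct, hybrid))
```

Here q1 has direct interaction partners and is ranked by (10), while q2 has none and falls through to the hybrid score of (13).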

2.4. Similarity-Based Method. It is known that compounds with similar structural properties tend to be involved in similar biological activities [24]. The most well-known system for representing compounds so that their similarity can be assessed is SMILES (Simplified Molecular Input Line Entry System) [22], which is a line notation for representing molecules and reactions using ASCII strings. Here, we also used this system to obtain the representations of compounds, which were used to calculate the similarity score of two compounds and to set up a new computational method to identify drug side effects. The similarity score between two compounds with their SMILES representations can be obtained from Open Babel [25], an open chemical toolbox. For two drug compounds $d_1$ and $d_2$, their similarity score obtained from Open Babel was denoted by $Q_s(d_1, d_2)$. Since compounds with similar structural properties tend to share the same biological activities, the likelihood that the query drug $d$ has the side effect $C_j$ can be calculated by

$$\Pi_s(d \to C_j) = \max_{d_i \in S'} Q_s(d, d_i) \cdot c_{ij}, \tag{14}$$

meaning that the likelihood that the query drug $d$ has the side effect $C_j$ is formulated as the maximum similarity score between $d$ and those drugs with side effect $C_j$ in the training dataset $S'$. Obviously, the greater the score $\Pi_s(d \to C_j)$, the more likely the drug compound $d$ has the side effect $C_j$. Following a procedure similar to that of the method based on chemical-chemical interactions, we can also obtain the order information of the query drug $d$ in terms of $\Pi_s(d \to C_j)$ ($j = 1, 2, \ldots, 100$).
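Equation (14) replaces the sum of (10) with a maximum. The sketch below assumes the pairwise similarity scores have already been computed elsewhere (for example with Open Babel) and are supplied through a callable; it does not call Open Babel itself, and the toy similarity values and names are invented for illustration.

```python
def similarity_scores(query, training_labels, similarity):
    """Equation (14): for each side effect, the maximum similarity between
    the query and any training drug that has that side effect."""
    n_effects = len(next(iter(training_labels.values())))
    scores = [0.0] * n_effects
    for drug, labels in training_labels.items():
        s = similarity(query, drug)
        for j, c_ij in enumerate(labels):
            scores[j] = max(scores[j], s * c_ij)
    return scores


# Toy training labels and made-up, precomputed similarity values.
toy_sim_labels = {"t1": [1, 0, 1], "t2": [0, 1, 1]}
toy_similarity = {("q", "t1"): 0.8, ("q", "t2"): 0.3}


def sim(a, b):
    return toy_similarity.get((a, b), 0.0)


toy_sim_scores = similarity_scores("q", toy_sim_labels, sim)
toy_sim_ranking = sorted(range(len(toy_sim_scores)), key=lambda j: -toy_sim_scores[j])
print(toy_sim_scores, toy_sim_ranking)
```

Because only the single most similar training drug matters for each side effect, this score is less sensitive to how many training drugs carry a given side effect than the summed score of (10).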

2.5. Jackknife Test. In statistical prediction, the jackknife test [16] is often used to examine a predictor for its effectiveness in practical application. In the jackknife test, all the samples in the dataset are singled out one by one and tested by the predictor trained on the remaining samples. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and each sample is in turn moved between the two. The jackknife test can exclude the "memory" effect, and the arbitrariness problem can also be avoided. Thus, the outcome obtained by the jackknife test is always unique for a given benchmark dataset [26]. Accordingly, the jackknife test has been widely recognized and increasingly adopted to investigate the performance of various predictors [27–36]. Thus, the jackknife test was also adopted here to evaluate the anticipated accuracy of the current prediction methods.
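The leave-one-out loop described above can be written in a few lines. In this sketch the predictor is a deliberately simple stand-in that ranks side effects by their frequency in the training data; it is not the interaction-based or similarity-based method, and the toy dataset and function names are hypothetical.

```python
def jackknife(dataset, predict_fn):
    """Leave-one-out: predict each drug using all the other drugs as training data.

    dataset maps drug -> label vector; predict_fn(query, training) returns a
    ranked list of side-effect indices.
    """
    results = {}
    for held_out in dataset:
        training = {d: labels for d, labels in dataset.items() if d != held_out}
        results[held_out] = predict_fn(held_out, training)
    return results


def frequency_predictor(query, training):
    """Stand-in predictor: rank side effects by how often they occur in training."""
    assert query not in training  # the held-out drug must not leak into training
    counts = [sum(column) for column in zip(*training.values())]
    return sorted(range(len(counts)), key=lambda j: -counts[j])


toy_dataset = {"a": [1, 0], "b": [1, 0], "c": [0, 1]}
toy_rankings = jackknife(toy_dataset, frequency_predictor)
print(toy_rankings)
```

Since every sample is held out exactly once and no random partition is involved, repeating the procedure on the same dataset gives the same result, which is the uniqueness property mentioned in the text.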

2.6. Accuracy Measurement. For a query drug, we may identify a series of side effects with the current prediction method. For the $j$th order prediction, its prediction accuracy $\mathrm{AC}(j)$ can be calculated by

$$\mathrm{AC}(j) = \frac{P(j)}{N}, \quad j = 1, 2, \ldots, 100, \tag{15}$$

where $P(j)$ denotes the number of drugs whose $j$th order prediction is one of the true side effects and $N$ denotes the total number of structure-different drugs in the dataset. For a prediction method with 100 orders of prediction results, high $\mathrm{AC}(j)$ with small $j$ and low $\mathrm{AC}(j)$ with large $j$ would indicate a good prediction [13, 16, 20]. Generally speaking, it also implies a good performance by the predictor if its 1st order prediction has a high success rate.
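As a worked illustration of (15), the sketch below computes the accuracy at each prediction order from a set of rankings and the corresponding true label vectors. The data are toy values with three side effects and four hypothetical drugs; the assumption that every drug receives a complete ranking of all side effects is ours.

```python
def order_accuracies(rankings, true_labels):
    """AC(j) = P(j) / N for every prediction order j (returned 0-indexed)."""
    n_drugs = len(rankings)
    n_orders = len(next(iter(rankings.values())))
    hits = [0] * n_orders
    for drug, ranking in rankings.items():
        for order, effect in enumerate(ranking):
            if true_labels[drug][effect] == 1:
                hits[order] += 1  # the prediction at this order is a true side effect
    return [h / n_drugs for h in hits]


# Toy rankings (side-effect indices, most likely first) and true labels.
toy_ranked = {"a": [0, 1, 2], "b": [2, 0, 1], "c": [1, 2, 0], "d": [0, 2, 1]}
toy_truth = {"a": [1, 0, 0], "b": [0, 0, 1], "c": [1, 1, 0], "d": [0, 1, 0]}

toy_ac = order_accuracies(toy_ranked, toy_truth)
print(toy_ac)
```

In this toy case three of the four drugs have a true side effect as their 1st order prediction, so the first entry is 3/4.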

Figure 2: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the training dataset versus the order of prediction (x-axis: prediction order; y-axis: prediction accuracy).

3. Results and Discussion

Of the 835 drugs in the benchmark dataset $S$, 83 samples were randomly selected to compose the test dataset $S_{te}$, while the remaining 752 samples composed the training dataset $S_{tr}$. The predicted results of the interaction-based method and the similarity-based method on the training and test datasets are as follows.

3.1. Performance of the Methods on the Training Dataset. For the 752 drug compounds in the training dataset $S_{tr}$, the interaction-based method and the similarity-based method were used to make predictions, with their performance evaluated by the jackknife test. Listed in columns 2 and 4 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies of the interaction-based and similarity-based methods were 86.30% and 83.64%, respectively, while the 2nd order ones were 80.45% and 79.12%, respectively. All 100 prediction accuracies obtained by these two methods are given in Supplementary Material III, and two curves with prediction accuracy on the Y-axis and prediction order on the X-axis are shown in Figure 2. The prediction accuracies obtained by the interaction-based method generally descend as the order number increases, and the same holds for the similarity-based method. These results imply that the two methods sorted the side effects of the drug compounds in the training dataset quite well and that both are quite effective in identifying drug side effects.

3.2. Performance of the Methods on the Test Dataset. For the 83 drug compounds in the test dataset $S_{te}$, the side effects were predicted by the interaction-based and similarity-based methods based on the drug compounds in the training dataset $S_{tr}$. After processing by (15), the 100 prediction accuracies of each method were obtained and are also given in Supplementary Material III. Listed in columns 3 and 5 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based and similarity-based methods were 89.16% and 87.95%, respectively. We also plotted two curves with prediction accuracy on the Y-axis and prediction order on the X-axis, which are shown in Figure 3. It is observed from Figure 3 that the accuracies also exhibit a decreasing trend as the order number increases. However, the two curves in Figure 3 fluctuate more drastically and frequently than those in Figure 2, which may be caused by the small number of samples in the test dataset. In any case, the interaction-based and similarity-based methods still sorted the side effects of the samples in the test dataset reasonably well, implying again that these two methods are quite effective in identifying drug side effects.

Table 1: The first 20 prediction accuracies of the interaction-based and similarity-based methods in identifying the side effects of drugs in the training and test datasets.

| Prediction order | Interaction-based, training (%) | Interaction-based, test (%) | Similarity-based, training (%) | Similarity-based, test (%) | Difference, training (%) (a) | Difference, test (%) (b) |
|---|---|---|---|---|---|---|
| 1 | 86.30 | 89.16 | 83.64 | 87.95 | 2.66 | 1.20 |
| 2 | 80.45 | 83.13 | 79.12 | 83.13 | 1.33 | 0.00 |
| 3 | 77.13 | 84.34 | 75.00 | 79.52 | 2.13 | 4.82 |
| 4 | 72.61 | 81.93 | 71.41 | 75.90 | 1.20 | 6.02 |
| 5 | 73.40 | 77.11 | 68.22 | 74.70 | 5.19 | 2.41 |
| 6 | 68.75 | 75.90 | 66.89 | 71.08 | 1.86 | 4.82 |
| 7 | 67.69 | 67.47 | 64.76 | 57.83 | 2.93 | 9.64 |
| 8 | 64.23 | 65.06 | 59.97 | 65.06 | 4.26 | 0.00 |
| 9 | 63.70 | 68.67 | 58.78 | 57.83 | 4.92 | 10.84 |
| 10 | 57.71 | 57.83 | 57.31 | 60.24 | 0.40 | −2.41 |
| 11 | 59.18 | 60.24 | 56.38 | 67.47 | 2.79 | −7.23 |
| 12 | 59.18 | 69.88 | 56.65 | 51.81 | 2.53 | 18.07 |
| 13 | 57.31 | 61.45 | 54.79 | 53.01 | 2.53 | 8.43 |
| 14 | 55.85 | 59.04 | 53.86 | 62.65 | 1.99 | −3.61 |
| 15 | 54.12 | 54.22 | 50.66 | 57.83 | 3.46 | −3.61 |
| 16 | 53.86 | 59.04 | 52.66 | 55.42 | 1.20 | 3.61 |
| 17 | 51.33 | 39.76 | 50.00 | 60.24 | 1.33 | −20.48 |
| 18 | 54.52 | 62.65 | 51.73 | 53.01 | 2.79 | 9.64 |
| 19 | 50.00 | 56.63 | 50.00 | 38.55 | 0.00 | 18.07 |
| 20 | 47.21 | 44.58 | 47.74 | 51.81 | −0.53 | −7.23 |

(a) Percentages in this column were calculated by percentages in column 2 minus percentages in column 4.
(b) Percentages in this column were calculated by percentages in column 3 minus percentages in column 5.

Figure 3: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the test dataset versus the order of prediction (x-axis: prediction order; y-axis: prediction accuracy).

3.3. Comparison of the Interaction-Based and Similarity-Based Methods. For the 752 samples in the training dataset $S_{\mathrm{tr}}$ and the 83 samples in the test dataset $S_{\mathrm{te}}$, both the interaction-based and the similarity-based methods were used to identify side effects. Columns 6 and 7 of Table 1 list the differences between the first 20 prediction accuracies obtained by the two methods. The 1st order prediction accuracies obtained by the interaction-based method on the training and test datasets were 2.66% and 1.20% higher, respectively, than those of the similarity-based method. Furthermore, most prediction accuracies in Table 1 obtained by the interaction-based method are higher than the corresponding accuracies obtained by the similarity-based method, indicating that the interaction-based method is more effective in identifying drug side effects. Figures 2 and 3 also show that the curve obtained by the interaction-based method generally lies above the curve obtained by the similarity-based method when the prediction order is low. However, as the order number increases, the curve obtained by the similarity-based method catches up with and then exceeds the curve obtained by the interaction-based method. This may be caused by the following two reasons: (I) the high prediction accuracies obtained by the interaction-based method at low order numbers leave few samples to be correctly predicted at high prediction orders; (II) the chemical similarity system is at present more complete than the interaction data in STITCH, so that the similarity-based method can always identify more side effects than the interaction-based method. It is expected that the interaction-based method can be improved as more chemical-chemical and protein-chemical interactions become available in STITCH.
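The claim that most of the first 20 accuracies favor the interaction-based method can be checked directly from Table 1. The short sketch below transcribes the four accuracy columns of Table 1 and counts the prediction orders at which the interaction-based accuracy is strictly higher; the variable and function names are illustrative only.

```python
# Accuracies (%) for prediction orders 1..20, transcribed from Table 1.
inter_train = [86.30, 80.45, 77.13, 72.61, 73.40, 68.75, 67.69, 64.23, 63.70, 57.71,
               59.18, 59.18, 57.31, 55.85, 54.12, 53.86, 51.33, 54.52, 50.00, 47.21]
inter_test = [89.16, 83.13, 84.34, 81.93, 77.11, 75.90, 67.47, 65.06, 68.67, 57.83,
              60.24, 69.88, 61.45, 59.04, 54.22, 59.04, 39.76, 62.65, 56.63, 44.58]
sim_train = [83.64, 79.12, 75.00, 71.41, 68.22, 66.89, 64.76, 59.97, 58.78, 57.31,
             56.38, 56.65, 54.79, 53.86, 50.66, 52.66, 50.00, 51.73, 50.00, 47.74]
sim_test = [87.95, 83.13, 79.52, 75.90, 74.70, 71.08, 57.83, 65.06, 57.83, 60.24,
            67.47, 51.81, 53.01, 62.65, 57.83, 55.42, 60.24, 53.01, 38.55, 51.81]


def count_wins(first, second):
    """Number of prediction orders at which `first` is strictly more accurate."""
    return sum(1 for x, y in zip(first, second) if x > y)


train_wins = count_wins(inter_train, sim_train)  # interaction-based wins, training set
test_wins = count_wins(inter_test, sim_test)     # interaction-based wins, test set
first_order_gap_train = round(inter_train[0] - sim_train[0], 2)
```

On these values the interaction-based method is strictly better at 18 of the 20 orders on the training dataset and at 12 of the 20 orders on the test dataset, which matches the more erratic test curves noted for Figure 3.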

3.4. Discussion. Predicting the side effects of drugs is a multitarget learning problem, just like dealing with a protein system with multiple subcellular location sites [37]. For each of the drugs investigated, we need to consider how many different side effects it may have and with what probabilities these side effects may occur. To deal with such a complicated statistical system, we adopted the strategy of multiple prediction orders, ranging from the most likely side effect to the least likely one; that is, the user is told which side effect is most likely, which is the second most likely, and so forth. Compared to most of the previous studies on the prediction of drug side effects, our method can therefore provide more information. The multiple prediction orders method can also be used to deal with other multitarget learning problems, such as subcellular location prediction [37] and the prediction of protein functions [20].

In addition to the multitarget issue, we also faced the problem of coverage. Of the 835 drug compounds in the benchmark dataset, some have chemical-chemical interaction information, while for the rest such information is missing. To establish a predictor that can be used under both circumstances, the approach of direct chemical-chemical interaction and the approach of indirect chemical-chemical interaction were introduced. For the drug compounds belonging to the 1st circumstance, the predictions were conducted based on the direct chemical-chemical interactions (cf. (10)); for the remaining drug compounds, belonging to the 2nd circumstance, the predictions were conducted based on the hybrid interactions (cf. (13)). Thus, the side effects of all 835 drugs could be predicted.

Finally, the good performance of the interaction-based method on the training and test datasets suggests that predictions based on the indirect interactions were also quite good. This indicates that the entire interaction network (involving all the drug compounds, their direct or indirect interactions, and their interactions with human proteins) determines the side effects of drug compounds.

4. Conclusions

In this study, we proposed a novel prediction method to identify drug side effects. For any query drug $d$, its side effects were determined by the following strategy: (1) if there exist interactive compounds of $d$ in the training set, only chemical-chemical interactions were used to identify its side effects; (2) otherwise, both chemical-chemical interactions and protein-chemical interactions were employed to make the prediction. The good performance of the method on the training and test datasets indicates that it is effective in identifying drug side effects. We hope that the method will assist in predicting drug side effects during drug development and in screening out drug candidates with undesired side effects.

Authors' Contribution

Lei Chen and Tao Huang contributed equally to this work

Acknowledgments

This contribution is supported by the National Basic Research Program of China (2011CB510101, 2011CB510102), the National Natural Science Foundation of China (61202021, 61105097, 31371335), the Innovation Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), the grant of “The First-class Discipline of Universities in Shanghai,” the Shanghai Educational Development Foundation (12CG55), and the Science and Technology Program of Shanghai Maritime University (no. 20120105).

References

[1] T. Huang, W. Cui, L. Hu, K. Feng, Y. Li, and Y. Cai, “Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles,” PLoS ONE, vol. 4, no. 12, Article ID e8126, 2009.

[2] S. Saul, Fen-Phen Case Lawyers Say They'll Reject Wyeth Offer, New York Times, New York, NY, USA, 2005.

[3] P. E. Blower, C. Yang, M. A. Fligner et al., “Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data,” Pharmacogenomics Journal, vol. 2, no. 4, pp. 259–271, 2002.

[4] K. J. Bussey, K. Chin, S. Lababidi et al., “Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel,” Molecular Cancer Therapeutics, vol. 5, no. 4, pp. 853–867, 2006.

[5] R. H. Shoemaker, “The NCI60 human tumour cell line anticancer drug screen,” Nature Reviews Cancer, vol. 6, no. 10, pp. 813–823, 2006.

[6] M. Fukuzaki, M. Seki, H. Kashima, and J. Sese, “Side effect prediction using cooperative pathways,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '09), pp. 142–147, November 2009.

[7] L. Xie, J. Li, L. Xie, and P. E. Bourne, “Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000387, 2009.

[8] J. Scheiber, J. L. Jenkins, S. C. K. Sukuru et al., “Mapping adverse drug reactions in chemical space,” Journal of Medicinal Chemistry, vol. 52, no. 9, pp. 3103–3107, 2009.


[9] E. Pauwels, V. Stoven, and Y. Yamanishi, “Predicting drug side-effect profiles: a chemical fragment-based approach,” BMC Bioinformatics, vol. 12, article 169, 2011.

[10] N. Atias and R. Sharan, “An algorithmic framework for predicting side effects of drugs,” Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.

[11] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.

[12] M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, “STITCH: interaction networks of chemicals and proteins,” Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.

[13] L. Hu, C. Chen, T. Huang, Y. Cai, and K. C. Chou, “Predicting biological functions of compounds based on chemical-chemical interactions,” PLoS ONE, vol. 6, no. 12, Article ID e29491, 2011.

[14] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa, “Prediction of drug-target interaction networks from the integration of chemical and genomic spaces,” Bioinformatics, vol. 24, no. 13, pp. i232–i240, 2008.

[15] L. Chen, Z. He, T. Huang, and Y. Cai, “Using compound similarity and functional domain composition for prediction of drug-target interaction networks,” Medicinal Chemistry, vol. 6, no. 6, pp. 388–395, 2010.

[16] L. Chen, W. Zeng, Y. Cai, K. Feng, and K. C. Chou, “Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities,” PLoS ONE, vol. 7, no. 4, Article ID e35254, 2012.

[17] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular Systems Biology, vol. 3, article 88, 2007.

[18] P. Bogdanov and A. K. Singh, “Molecular function prediction using neighborhood features,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2010.

[19] Y. A. I. Kourmpetis, A. D. J. van Dijk, M. C. A. M. Bink, R. C. H. J. van Ham, and C. J. F. ter Braak, “Bayesian Markov random field analysis for protein function prediction based on network data,” PLoS ONE, vol. 5, no. 2, Article ID e9293, 2010.

[20] L. Hu, T. Huang, X. Shi, W. Lu, Y. Cai, and K. C. Chou, “Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties,” PLoS ONE, vol. 6, no. 1, Article ID e14556, 2011.

[21] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, “A side effect resource to capture phenotypic effects of drugs,” Molecular Systems Biology, vol. 6, article 343, 2010.

[22] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Computer Sciences, vol. 28, pp. 31–36, 1988.

[23] K. Ng, J. Ciou, and C. Huang, “Prediction of protein functions based on function-function correlation relations,” Computers in Biology and Medicine, vol. 40, no. 3, pp. 300–305, 2010.

[24] M. Dunkel, S. Gunther, J. Ahmed, B. Wittig, and R. Preissner, “SuperPred: drug classification and target prediction,” Nucleic Acids Research, vol. 36, pp. W55–W59, 2008.

[25] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, “Open Babel: an open chemical toolbox,” Journal of Cheminformatics, vol. 3, article 33, 2011.

[26] K. C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review),” Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.

[27] C. Chen, Z. Shen, and X. Zou, “Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou's pseudo amino acid composition,” Protein and Peptide Letters, vol. 19, no. 4, pp. 422–429, 2012.

[28] M. Hayat and A. Khan, “Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC,” Protein and Peptide Letters, vol. 19, no. 4, pp. 411–421, 2012.

[29] L. Nanni, A. Lumini, D. Gupta, and A. Garg, “Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 467–475, 2012.

[30] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, “Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine,” Journal of Theoretical Biology, vol. 281, no. 1, pp. 18–23, 2011.

[31] G. Fan and Q. Li, “Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition,” Journal of Theoretical Biology, vol. 304, pp. 88–95, 2012.

[32] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, “Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses,” Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.

[33] D. Zou, Z. He, J. He, and Y. Xia, “Supersecondary structure prediction using Chou's pseudo amino acid composition,” Journal of Computational Chemistry, vol. 32, no. 2, pp. 271–278, 2011.

[34] D. N. Georgiou, T. E. Karakasidis, J. J. Nieto, and A. Torres, “Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition,” Journal of Theoretical Biology, vol. 257, no. 1, pp. 17–26, 2009.

[35] X. Zhao, X. Li, Z. Ma, and M. Yin, “Identify DNA-binding proteins with optimal Chou's amino acid composition,” Protein and Peptide Letters, vol. 19, no. 4, pp. 398–405, 2012.

[36] L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, “Prediction of metabolic pathway using graph property, chemical functional group and chemical structural set,” Current Bioinformatics, vol. 8, pp. 200–207, 2013.

[37] P. Du, T. Li, and X. Wang, “Recent progress in predicting protein sub-subcellular locations,” Expert Review of Proteomics, vol. 8, no. 3, pp. 391–404, 2011.


using chemical structures for the prediction of drug side effects [8–10]. Although most of these methods can indicate whether the query drug has some side effects, they cannot determine which side effects are most likely to happen, nor can they rank the side effects. In this study, we proposed a novel computational method to predict the side effects of drugs based on chemical-chemical interactions and protein-chemical interactions. Compared to most of the previous studies, our method can provide the order information of the side effects, that is, it prioritizes the side effects from the most likely one to the least likely one.

During the past decade, many compound databases have been constructed, such as KEGG (Kyoto Encyclopedia of Genes and Genomes) [11] and STITCH (Search Tool for Interactions of Chemicals) [12]. KEGG provides information on chemical substances and reactions, while STITCH provides interaction information for chemicals and proteins. Thus, the properties of many compounds, along with other information, can be acquired from these databases. For compounds not covered by these databases, properties can be inferred from the compounds with known properties stored in the databases [13–16]. Likewise, drug side effects can also be inferred, as elaborated below.

Recently, evidence has shown that interactive proteins are more likely to share common biological functions [17–20] and that interactive compounds are also more likely to share common biological functions [13, 16]. Since side effects are part of the biological functions of drugs, it is feasible to use chemical-chemical interactions to identify drug side effects. Unfortunately, the side effects of some query drugs cannot be predicted in this way, because their interactive counterparts do not carry any side effect information. To overcome this difficulty, we proposed to use indirect interactions, including both chemical-chemical and protein-chemical interactions, to identify the side effects of drugs for which direct chemical-chemical interaction data are not available. To evaluate the method, a benchmark dataset consisting of 835 drug compounds was constructed from SIDER [21] and divided into one training dataset and one test dataset. By a jackknife test on the training dataset, the 1st order prediction accuracy was 86.30%, while it was 89.16% on the test dataset. To confirm the effectiveness of the method, another method, based on chemical structure similarity obtained from SMILES strings [22], was also applied to the training and test datasets. Encouraged by the good performance of the method and its superiority to the method based on chemical structure similarity, we hope that the proposed method can become a useful tool to predict drug side effects and to screen out drugs with undesired side effects.

2. Materials and Methods

2.1. Benchmark Dataset. The benchmark dataset used in the current study was downloaded from SIDER [21] at http://sideeffects.embl.de/, which integrated the side effects of 888 drugs from the US Food and Drug Administration (FDA) and other sources [21]. To obtain a high-quality, well-defined benchmark dataset, the data were collected strictly according to the following criteria: (i) only the 100 side effects with the most drugs listed in SIDER, and the corresponding drugs, were included; (ii) drugs with neither chemical-chemical interactions nor protein-chemical interactions were excluded. Finally, we obtained a benchmark dataset $S$ that contained 835 drugs belonging to 100 categories of side effects. The codes of the 835 drugs in each of the 100 side effect categories are given in Supplementary Material I, available online at http://dx.doi.org/10.1155/2013/485034.

For the convenience of later formulation, let us use the symbols $C_1, C_2, C_3, \ldots, C_{100}$ to tag the 100 side effects, where $C_1$ represents “Nausea,” $C_2$ “Headache,” $C_3$ “Vomiting,” and so forth, as described in the table in Supplementary Material II, in which the number of drugs with each of the 100 side effect tags is also given. Thus, the benchmark dataset $S$ can be formulated as

$$S = S_1 \cup S_2 \cup \cdots \cup S_{100}, \tag{1}$$

where $S_i$ represents the subset that contains the drugs with the side effect $C_i$ $(i = 1, 2, \ldots, 100)$.

Since many drugs in $S$ have multiple side effects, that is, they may occur simultaneously in subsets with different side effect tags, it is instructive to introduce the concept of a “virtual drug” sample. A drug compound that belongs to two different side effect subsets is counted as 2 virtual drugs, even though they have an identical chemical structure; if it belongs to three different subsets, it is counted as 3 virtual drugs, and so forth. Accordingly, the number of structurally distinct drug compounds and the number of side-effect-different virtual drug compounds can be described by

$$N(\mathrm{str}) = 835, \qquad N(\mathrm{vir}) = \sum_{i=1}^{100} N(C_i) = 30114, \tag{2}$$

where $N(\mathrm{str})$ is the total number of structurally distinct drug compounds, $N(\mathrm{vir})$ the total number of side-effect-different virtual drug compounds in $S$, and $N(C_i)$ the number of drugs with the side effect tag $C_i$. Substituting the values of $N(C_i)$ $(i = 1, 2, \ldots, 100)$ from the table in Supplementary Material II into (2) gives $N(\mathrm{vir}) = 30114$, consistent with that table.
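The two counts in (2) differ only in whether a drug is counted once or once per side effect tag. A minimal sketch of both counts on hypothetical toy data follows; the drug identifiers and the three-category dictionary are invented for illustration and are not taken from the benchmark dataset.

```python
from typing import Dict, Set, Tuple


def count_drugs(subsets: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (N(str), N(vir)) for side-effect subsets S_1, ..., S_k."""
    # N(str): each distinct compound counted once, i.e. the size of the union in (1).
    n_str = len(set().union(*subsets.values()))
    # N(vir): each compound counted once per side effect it carries, as in (2).
    n_vir = sum(len(drugs) for drugs in subsets.values())
    return n_str, n_vir


# Hypothetical toy subsets with three side effect tags.
toy_subsets = {
    "Nausea": {"d1", "d2", "d3"},
    "Headache": {"d1", "d3"},
    "Vomiting": {"d1"},
}
n_str, n_vir = count_drugs(toy_subsets)
```

Here `d1` appears in three subsets and therefore contributes three virtual drugs, so the toy data give 3 structurally distinct drugs but 6 virtual drugs.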

It can be seen from (2) that the total number of side-effect-different virtual drug compounds is much greater than the total number of structurally distinct drug compounds. To provide an intuitive view of their distribution, a histogram of the number of drugs versus the number of side effects is given in Figure 1, from which we can see that, of the 835 drugs, only 6 have a single side effect, while the majority have more than 10 side effects. Thus, the prediction of drug side effects is a multilabel classification problem. As in the case of compounds with multiple properties [13, 16], the proposed method provides the order information of side effects, from the most likely to the least likely.

To evaluate the methods described below, we randomly selected 10% (83) of the samples from $S$ to compose the test dataset, denoted by $S_{\mathrm{te}}$, while the remaining 752 samples in $S$ were used to construct the training dataset, denoted by $S_{\mathrm{tr}}$.

Figure 1: A histogram of the number of drugs versus the number of side effects. (X-axis: number of side effects; Y-axis: number of drugs.)
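The random 10% hold-out described above can be sketched as follows. The seed, the rounding down of 83.5 to 83, and the placeholder drug identifiers are assumptions made for this illustration; the paper states only that 83 of the 835 samples were selected at random.

```python
import random
from typing import Iterable, Set, Tuple


def split_benchmark(
    drug_ids: Iterable[str], test_fraction: float = 0.10, seed: int = 0
) -> Tuple[Set[str], Set[str]]:
    """Randomly split drug identifiers into (training set, test set)."""
    ids = sorted(set(drug_ids))             # sort for reproducibility
    rng = random.Random(seed)               # assumed seed; not given in the paper
    n_test = int(len(ids) * test_fraction)  # 835 * 0.10 -> 83 (rounded down)
    test_set = set(rng.sample(ids, n_test))
    train_set = set(ids) - test_set
    return train_set, test_set


# Placeholder identifiers standing in for the 835 benchmark drugs.
all_drugs = [f"drug_{i}" for i in range(835)]
train_set, test_set = split_benchmark(all_drugs)
```

With 835 identifiers this yields 83 test samples and 752 training samples, matching the sizes of the test and training datasets.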

2.2. Chemical-Chemical Interactions and Protein-Chemical Interactions. Evidence has shown that interactive proteins are more likely to share common biological functions than noninteractive ones [17–20]. Likewise, some pioneering studies [13, 16] have indicated that interactive compounds follow similar rules. Since side effects are one part of the biological functions of drugs, using the properties of interactive compounds to identify drug side effects is a feasible scheme.

To obtain the information on interactive compounds, we downloaded the chemical-chemical interaction data from STITCH (http://stitch.embl.de/, file chemical_chemical.links.detailed.v3.0.tsv.gz) [12], a well-known database containing known and predicted chemical-chemical and protein-chemical interaction data from experiments, literature, or other reliable sources. In the data file obtained, each interaction unit contains two chemicals and five scores, titled “Similarity,” “Experimental,” “Database,” “Textmining,” and “Combined score,” respectively. Since the last score combines the information of the other scores, we used it to indicate the interactivity of two chemicals in this study; that is, compounds in an interaction unit with a “Combined score” greater than zero were deemed to be interactive compounds. The interactive compounds considered here thus satisfy one of the following three properties: (I) they participate in the same reactions; (II) they share similar structures or activities; (III) they have literature associations. These three properties generally indicate that the interactive compounds occupy the same biological pathways, suggesting that they may induce similar side effects. This supports the feasibility of using chemical-chemical interactions retrieved from STITCH to identify drug side effects. The “Combined score” is termed the confidence score because its value indicates the likelihood that two compounds interact: two compounds with a high “Combined score” interact with high probability. For any two drug compounds $d_1$ and $d_2$, their interaction confidence score was denoted by $Q_c(d_1, d_2)$. In particular, if no interaction between $d_1$ and $d_2$ was recorded, their interaction confidence score was set to zero, that is, $Q_c(d_1, d_2) = 0$.
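The confidence score lookup described above, including the convention that missing pairs score zero, can be sketched with a dictionary keyed by unordered compound pairs. The row layout (compound, compound, combined score), the identifiers, and the treatment of the score as symmetric are assumptions for this illustration; the sketch does not parse the actual STITCH file.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Pair = FrozenSet[str]


def build_confidence(rows: Iterable[Tuple[str, str, float]]) -> Dict[Pair, float]:
    """Map each unordered compound pair to its "Combined score"."""
    table: Dict[Pair, float] = {}
    for a, b, combined in rows:
        if combined > 0:  # only pairs with a positive score count as interactive
            table[frozenset((a, b))] = combined
    return table


def q_c(table: Dict[Pair, float], d1: str, d2: str) -> float:
    """Confidence score of two compounds; zero if no interaction is recorded."""
    return table.get(frozenset((d1, d2)), 0.0)


# Hypothetical rows: (compound, compound, combined score).
toy_rows = [("CID1", "CID2", 900.0), ("CID2", "CID3", 350.0)]
confidence = build_confidence(toy_rows)
score_12 = q_c(confidence, "CID2", "CID1")  # recorded pair, order-independent
score_13 = q_c(confidence, "CID1", "CID3")  # unrecorded pair
```

Keying on a `frozenset` makes the lookup independent of argument order, so the recorded pair returns 900.0 either way and the unrecorded pair returns 0.0.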

Since the chemical-chemical interaction data in STITCH are not yet complete, that is, some potential chemical-chemical interactions may not be reported in STITCH, prediction methods based on chemical-chemical interactions have the limitation that samples without interactive counterparts in the training dataset cannot be processed. Thus, a new scheme is needed to measure the interactions that are not reported in STITCH. It is known that if two drug compounds can interact with a third compound or protein, these two drug compounds are likely to share some common functions. In view of this, we proposed a new scheme to measure the likelihood of interaction of two chemicals based on indirect chemical-chemical and protein-chemical interactions.

The data for the protein-chemical interactions were also downloaded from STITCH (http://stitch.embl.de/, file protein_chemical.links.detailed.v3.0.tsv.gz). Each interaction unit in the data file obtained contains one compound, one protein, and four scores, titled “Experimental,” “Database,” “Textmining,” and “Combined score,” respectively. By a similar argument, we used the value of “Combined score,” also termed the confidence score, to indicate the likelihood that the interaction occurs. For one protein $p$ and one drug compound $d$, their interaction confidence score was denoted by $Q_p(p, d)$. If there was no interaction at all between the protein $p$ and the drug $d$, it was also set to zero, that is, $Q_p(p, d) = 0$.

Now we are ready to introduce the new scheme to measure the likelihood of interaction of two chemicals. For two compounds $d_1$ and $d_2$, let $I_c(d_1)$ denote the set of compounds that directly interact with the drug $d_1$ and $I_c(d_2)$ the set of compounds that directly interact with the drug $d_2$, formulated as

$$I_c(d_1) = \{d : Q_c(d, d_1) > 0\}, \qquad I_c(d_2) = \{d : Q_c(d, d_2) > 0\}. \tag{3}$$

In a similar way, let $I_p(d_1)$ denote the set of proteins that directly interact with the drug $d_1$ and $I_p(d_2)$ the set of proteins that directly interact with the drug $d_2$, formulated as

$$I_p(d_1) = \{p : Q_p(p, d_1) > 0\}, \qquad I_p(d_2) = \{p : Q_p(p, d_2) > 0\}. \tag{4}$$

According to set theory, the drug compounds that interact with both the drug $d_1$ and the drug $d_2$ form the intersection of the set $I_c(d_1)$ and the set $I_c(d_2)$; that is, they form the set given by

$$I_c(d_1, d_2) = I_c(d_1) \cap I_c(d_2). \tag{5}$$

Likewise, the human proteins that interact with both the drug $d_1$ and the drug $d_2$ form the intersection of the set $I_p(d_1)$ and the set $I_p(d_2)$; that is, they form the set given by

$$I_p(d_1, d_2) = I_p(d_1) \cap I_p(d_2). \tag{6}$$

Thus, the likelihood of the interaction between $d_1$ and $d_2$ can be calculated via the following equation:

$$Q_h(d_1, d_2) = \left( \sum_{d' \in I_c(d_1, d_2)} \bigl( Q_c(d_1, d') + Q_c(d_2, d') \bigr) + \sum_{p' \in I_p(d_1, d_2)} \bigl( Q_p(p', d_1) + Q_p(p', d_2) \bigr) \right) \times \bigl( 2 \, \lvert I_c(d_1, d_2) \cup I_p(d_1, d_2) \rvert \bigr)^{-1}, \tag{7}$$

where $\in$ is the set-theoretic symbol meaning “member of.”
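The indirect score in (7) is the average confidence over all edges that link the two drugs to their shared neighbours. A minimal sketch follows, assuming nested dictionaries for the scores, compound and protein identifiers that never collide (so the size of the union is the sum of the two set sizes), and a return value of zero when the two drugs share no neighbour, a case that (7) leaves undefined. All identifiers and scores are hypothetical.

```python
from typing import Dict, Set

Scores = Dict[str, Dict[str, float]]


def neighbours(scores: Scores, d: str) -> Set[str]:
    """Partners of drug d with a positive confidence score, as in (3) and (4)."""
    return {x for x, s in scores.get(d, {}).items() if s > 0}


def q_h(d1: str, d2: str, chem: Scores, prot: Scores) -> float:
    """Indirect interaction score of (7).

    chem[d][x] holds Q_c(d, x) for compound x; prot[d][p] holds Q_p(p, d).
    """
    shared_c = neighbours(chem, d1) & neighbours(chem, d2)  # Eq. (5)
    shared_p = neighbours(prot, d1) & neighbours(prot, d2)  # Eq. (6)
    n_shared = len(shared_c) + len(shared_p)
    if n_shared == 0:
        return 0.0  # assumption: no shared neighbour means no indirect evidence
    total = sum(chem[d1][x] + chem[d2][x] for x in shared_c)
    total += sum(prot[d1][p] + prot[d2][p] for p in shared_p)
    return total / (2 * n_shared)


# Hypothetical toy scores.
toy_chem = {"d1": {"x": 800.0, "y": 400.0}, "d2": {"x": 600.0}}
toy_prot = {"d1": {"p": 700.0}, "d2": {"p": 500.0, "q": 900.0}}
indirect = q_h("d1", "d2", toy_chem, toy_prot)
```

In the toy data the drugs share compound `x` and protein `p`, so the score is (800 + 600 + 700 + 500) / (2 × 2) = 650. Dividing by twice the number of shared neighbours keeps the result on the same scale as a single confidence score.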

2.3. Interaction-Based Method. It is instructive to recall that several methods have successfully predicted the properties of proteins by using protein-protein interaction information [17–20, 23]. The underlying idea of these methods is the assumption that interactive proteins are more likely to share common biological functions than noninteractive ones. Similarly, based on the argument in Section 2.2 and some previous studies [13, 16], interactive drugs are more likely to share similar side effects than noninteractive ones. Based on this idea, the following prediction method, using chemical-chemical and protein-chemical interactions, was developed.

For convenience, some notation is necessary. Suppose there are $n$ drugs in the training set $S'$, say $d_1, d_2, \ldots, d_n$; the side effects of the drug $d_i$ in the training dataset are described as

$$C(d_i) = [c_{i,1}, c_{i,2}, \ldots, c_{i,100}]^{\mathrm{T}} \quad (i = 1, 2, \ldots, n), \tag{8}$$

where $\mathrm{T}$ is the transpose operator and

$$c_{i,j} = \begin{cases} 1, & \text{if } d_i \text{ has the side effect } C_j, \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Prediction Based on Chemical-Chemical Interactions. As elaborated previously, interactive drugs tend to share similar side effects. The likelihood that the query drug $d$ has the side effect $C_j$ can be calculated by

$$\Pi_c(d \to C_j) = \sum_{d_i \in S'} Q_c(d, d_i) \cdot c_{i,j}, \quad j = 1, 2, \ldots, 100. \tag{10}$$
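Equation (10) is a confidence-weighted vote over the training drugs. The sketch below implements it for a generic confidence function; the function names, the toy indicator vectors of length 3 (instead of 100), and the toy scores are all hypothetical.

```python
from typing import Callable, Dict, List


def score_direct(
    query: str,
    training_labels: Dict[str, List[int]],
    q_c: Callable[[str, str], float],
) -> List[float]:
    """Scores of (10): for each side effect, sum Q_c over training drugs that have it."""
    n_effects = len(next(iter(training_labels.values())))
    scores = [0.0] * n_effects
    for drug, indicator in training_labels.items():
        weight = q_c(query, drug)          # Q_c(d, d_i)
        for j, c_ij in enumerate(indicator):
            scores[j] += weight * c_ij     # contributes only when c_ij = 1
    return scores


# Hypothetical confidence scores between the query "q" and two training drugs.
toy_confidence = {("q", "a"): 900.0, ("q", "b"): 400.0}


def toy_q_c(d1: str, d2: str) -> float:
    return toy_confidence.get((d1, d2), 0.0)


# Indicator vectors as in (8) and (9), shortened to three side effects.
toy_training = {"a": [1, 0, 1], "b": [1, 1, 0]}
direct_scores = score_direct("q", toy_training, toy_q_c)
```

The first side effect is carried by both neighbours and scores 900 + 400 = 1300, the second only by `b` (400), and the third only by `a` (900).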

According to (10), a greater score $\Pi_c(d \to C_j)$ means either that many interactive compounds of $d$ have the side effect $C_j$ or that some interactions between $d$ and its interactive compounds with the side effect $C_j$ carry high confidence scores. Thus, the greater the score $\Pi_c(d \to C_j)$ is, the more likely the drug compound $d$ has the $j$th side effect, with $\Pi_c(d \to C_j) = 0$ indicating that the probability of the drug $d$ having the $j$th side effect is zero. Since a drug usually has multiple side effects (see Figure 1), the prediction should provide a series of candidate side effects, ranging from the most likely one to the least likely one, rather than only the most likely one. Thus, for a query drug $d$, suppose we have

$$\Pi_c(d \to C_2) \ge \Pi_c(d \to C_4) \ge \cdots \ge \Pi_c(d \to C_{90}) > 0, \tag{11}$$

meaning that the most likely side effect for the drug $d$ is $C_2$ or “Headache” (cf. the table in Supplementary Material II), the second most likely is $C_4$ or “Rash,” and so forth. In other words, $C_2$ is called the 1st order prediction, $C_4$ the 2nd order prediction, and so forth. Note that the outcome of (10) might be trivial, that is,

$$\Pi_c(d \to C_j) = 0 \quad \text{for } j = 1, 2, \ldots, 100, \tag{12}$$

implying that no meaningful or directly interactive drug compounds whatsoever can be found in the training dataset $S'$ for the drug $d$. Under such a circumstance, an alternative approach should be used for predicting its side effects, as elaborated below.

Prediction Based on Hybrid Interactions. When the query drug $d$ did not have any directly interactive drugs in the training dataset $S'$, or the information on its directly interactive drugs was trivial, the data for the indirect chemical-chemical and protein-chemical interactions were used to predict its side effects. The prediction method was formulated in a similar way to the method above, but instead of (10), the likelihood that the query drug $d$ has the side effect $C_j$ is calculated by

$$\Pi_h(d \to C_j) = \sum_{d_i \in S'} Q_h(d, d_i) \cdot c_{i,j}. \tag{13}$$

By integrating the above two approaches, the following steps were adopted to predict the side effects of the query drug $d$.

Step 1. The method based on the chemical-chemical interactions, that is, (10), was first utilized to identify its side effects.

Step 2. If the outcomes were trivial or no meaningful results were obtained, as in the case of (12), the method based on the hybrid interactions, that is, (13), was utilized to continue the prediction.
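The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `scores`, `rank`, and `predict` are hypothetical, `q_c` and `q_h` are assumed to be callables returning the confidence scores $Q_c$ and $Q_h$, the training set is assumed to be a dictionary mapping each drug to its 0/1 side-effect vector, and side effects are indexed from zero. Dropping zero-score side effects from the ranking is a choice made here for brevity.

```python
def scores(query, training, q, n_effects):
    """Eq. (10) or (13): sum of Q(query, d_i) * c_ij over the training drugs d_i."""
    out = [0.0] * n_effects
    for d_i, labels in training.items():
        w = q(query, d_i)
        if w > 0:
            for j, c in enumerate(labels):
                out[j] += w * c
    return out


def rank(score_list):
    """Indices of side effects with a positive score, highest score first (Eq. (11))."""
    order = sorted(range(len(score_list)), key=lambda j: -score_list[j])
    return [j for j in order if score_list[j] > 0]


def predict(query, training, q_c, q_h, n_effects):
    s = scores(query, training, q_c, n_effects)      # Step 1: direct interactions
    if not any(v > 0 for v in s):                    # trivial outcome, Eq. (12)
        s = scores(query, training, q_h, n_effects)  # Step 2: hybrid interactions
    return rank(s)


# Toy data (hypothetical): two training drugs, three side effects.
training = {"a": [1, 0, 1], "b": [0, 1, 1]}
direct = {("q1", "a"): 0.9, ("q1", "b"): 0.4}
hybrid = {("q2", "a"): 0.2, ("q2", "b"): 0.7}


def q_c(d, x):
    return direct.get((d, x), 0.0)


def q_h(d, x):
    return hybrid.get((d, x), 0.0)


ranking_q1 = predict("q1", training, q_c, q_h, 3)  # uses Eq. (10)
ranking_q2 = predict("q2", training, q_c, q_h, 3)  # falls back to Eq. (13)
```

In this toy example, `q1` has direct interactions and is ranked by Eq. (10), whereas `q2` has none and is ranked by the hybrid score.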

2.4. Similarity-Based Method. It is known that compounds with similar structural properties tend to be involved in similar biological activities [24]. The best-known representation system for obtaining the similarity information of two compounds is SMILES (Simplified Molecular Input Line Entry System) [22], a line notation for representing molecules and reactions using ASCII strings. Here we also used this system to obtain the representations of compounds, which were used to calculate the similarity score of two compounds and to set up a new computational method to identify drug side effects. The similarity score between two compounds with their SMILES representations can be obtained from Open Babel [25], an open chemical toolbox. For two drug compounds $d_1$ and $d_2$, their similarity score obtained from Open Babel was denoted by $Q_s(d_1, d_2)$. Based on the observation that compounds with similar structural properties tend to share the same biological activities, the likelihood that the query drug $d$ has the side effect $C_j$ can be calculated by

$$\Pi_s(d \to C_j) = \max_{d_i \in S'} Q_s(d, d_i) \cdot c_{i,j}, \tag{14}$$

meaning that the likelihood that the query drug $d$ has the side effect $C_j$ is formulated as the maximum similarity score between $d$ and those drugs in the training dataset $S'$ that have the side effect $C_j$. The greater the score $\Pi_s(d \to C_j)$, the more likely the drug compound $d$ is to have the side effect $C_j$. Following a procedure similar to that of the method based on chemical-chemical interactions, we can also obtain the order information of the query drug $d$ in terms of $\Pi_s(d \to C_j)$ $(j = 1, 2, \ldots, 100)$.
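As a small illustration of (14), the sketch below computes the similarity-based score for every side effect. The function name `similarity_scores` is hypothetical, and `q_s` is assumed to be any callable returning a pairwise similarity score; the paper obtains these scores from Open Babel, which is not called here.

```python
def similarity_scores(query, training, q_s, n_effects):
    """Eq. (14): for each side effect, the maximum similarity between the
    query and any training drug that has that side effect."""
    out = [0.0] * n_effects
    for d_i, labels in training.items():
        sim = q_s(query, d_i)
        for j, has_effect in enumerate(labels):
            if has_effect:
                out[j] = max(out[j], sim)
    return out


# Toy data (hypothetical): similarity of the query to two training drugs.
training = {"a": [1, 0, 1], "b": [0, 1, 1]}
toy_similarity = {"a": 0.8, "b": 0.3}


def q_s(query, d_i):
    return toy_similarity[d_i]


sim_scores = similarity_scores("q", training, q_s, 3)
```

Unlike (10), which accumulates evidence over all interactive drugs, (14) keeps only the single most similar drug for each side effect.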

2.5. Jackknife Test. In statistical prediction, the jackknife test [16] is often used to examine the effectiveness of a predictor in practical application. In the jackknife test, all the samples in the dataset are singled out one by one and tested by the predictor trained on the remaining samples. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and each sample is moved in turn between the two. The jackknife test can exclude the "memory" effect, and the arbitrariness problem can also be avoided. Thus, the outcome obtained by the jackknife test is always unique for a given benchmark dataset [26]. Accordingly, the jackknife test has been widely recognized and increasingly adopted to investigate the performance of various predictors [27–36], and it was also adopted here to evaluate the anticipated accuracy of the current prediction methods.
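The jackknife (leave-one-out) procedure described above can be sketched as follows. The helper names are hypothetical, and the toy predictor simply ranks side effects by their frequency in the remaining samples; it stands in for the interaction-based or similarity-based method and is not part of the paper.

```python
def jackknife(dataset, predict_fn):
    """Single out each sample in turn and predict it from all remaining samples."""
    results = {}
    for name in dataset:
        training = {k: v for k, v in dataset.items() if k != name}
        results[name] = predict_fn(name, training)
    return results


def most_common_first(query, training):
    """Toy predictor (assumption): rank side effects by frequency in the training part."""
    totals = [sum(column) for column in zip(*training.values())]
    return sorted(range(len(totals)), key=lambda j: -totals[j])


# Toy dataset (hypothetical): three drugs, two side effects.
dataset = {"a": [1, 0], "b": [1, 1], "c": [0, 1]}
jackknife_rankings = jackknife(dataset, most_common_first)
```

Because every sample is left out exactly once, the result does not depend on a random split, which is the uniqueness property noted above.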

2.6. Accuracy Measurement. For a query drug, we may identify a series of side effects with the current prediction method. For the $j$th order prediction, its prediction accuracy $\mathrm{AC}(j)$ can be calculated by

$$\mathrm{AC}(j) = \frac{P(j)}{N}, \quad j = 1, 2, \ldots, 100, \tag{15}$$

where $P(j)$ denotes the number of drugs whose $j$th order prediction is one of the true side effects, and $N$ denotes the total number of structure-different drugs in the dataset. For a prediction method with 100 orders of prediction results, high $\mathrm{AC}(j)$ with small $j$ and low $\mathrm{AC}(j)$ with large $j$ indicate a good prediction [13, 16, 20]. Generally speaking, a high success rate of the 1st order prediction also implies a good performance of the predictor.
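A minimal sketch of (15) is given below. The function name `order_accuracy` and the data layout are assumptions: each drug's ranking is a list of zero-based side-effect indices, most likely first, and `truth` maps each drug to the set of its true side effects. A drug whose ranking is shorter than $j$ is counted as a miss at order $j$.

```python
def order_accuracy(rankings, truth, n_orders):
    """AC(j) = P(j) / N for j = 1..n_orders (Eq. (15))."""
    n = len(rankings)
    accuracies = []
    for j in range(n_orders):
        hits = sum(
            1
            for drug, ranking in rankings.items()
            if j < len(ranking) and ranking[j] in truth[drug]
        )
        accuracies.append(hits / n)
    return accuracies


# Toy data (hypothetical): three drugs with ranked predictions and true labels.
rankings = {"a": [2, 0, 1], "b": [1, 2], "c": [0]}
truth = {"a": {0, 2}, "b": {2}, "c": {0}}
ac = order_accuracy(rankings, truth, 3)
```

Here `ac[0]` corresponds to the 1st order prediction accuracy, `ac[1]` to the 2nd order, and so on.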

Figure 2: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the training dataset versus the order of prediction. Horizontal axis: prediction order; vertical axis: prediction accuracy (0 to 1).

3. Results and Discussion

Of the 835 drugs in the benchmark dataset $S$, 83 samples were randomly selected to compose the test dataset $S_{te}$, while the remaining 752 samples composed the training dataset $S_{tr}$. The predicted results of the interaction-based method and the similarity-based method on the training and test datasets are as follows.

3.1. Performance of the Methods on the Training Dataset. For the 752 drug compounds in the training dataset $S_{tr}$, the interaction-based method and the similarity-based method were used to make predictions, with their performance evaluated by the jackknife test. Listed in columns 2 and 4 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies of the interaction-based and similarity-based methods were 86.30% and 83.64%, respectively, while the 2nd order ones were 80.45% and 79.12%, respectively. All 100 prediction accuracies obtained by these two methods are given in Supplementary Material III, and two curves with prediction accuracy on the $Y$-axis and prediction order on the $X$-axis are shown in Figure 2. The prediction accuracies obtained by the interaction-based method generally descend as the order number increases, and the same holds for the prediction accuracies obtained by the similarity-based method. This implies that the two methods sorted the side effects of drug compounds in the training dataset quite well and that both are effective in identifying drug side effects.

3.2. Performance of the Methods on the Test Dataset. For the 83 drug compounds in the test dataset $S_{te}$, the side effects were predicted by the interaction-based and similarity-based methods based on the drug compounds in the training dataset $S_{tr}$. After processing by (15), the 100 prediction accuracies of each method were obtained; they are also given in Supplementary Material III. Listed in columns 3 and 5 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based and similarity-based methods were 89.16% and 87.95%, respectively. We also plotted two curves with prediction accuracy on the $Y$-axis and prediction order on the $X$-axis, which are shown in Figure 3. It is observed from Figure 3 that the accuracies also exhibit a decreasing trend as the order number increases. However, the two curves in Figure 3 fluctuate more drastically and frequently than those in Figure 2, which may be caused by the small number of samples in the test dataset. In any case, the interaction-based and similarity-based methods still sorted the side effects of samples in the test dataset reasonably well, implying again that these two methods are effective in identifying drug side effects.

Table 1: The first 20 prediction accuracies (%) of the interaction-based and similarity-based methods in identifying the side effects of drugs in the training and test datasets.

| Prediction order | Interaction-based, training dataset | Interaction-based, test dataset | Similarity-based, training dataset | Similarity-based, test dataset | Difference, training dataset (a) | Difference, test dataset (b) |
|---|---|---|---|---|---|---|
| 1 | 86.30 | 89.16 | 83.64 | 87.95 | 2.66 | 1.20 |
| 2 | 80.45 | 83.13 | 79.12 | 83.13 | 1.33 | 0.00 |
| 3 | 77.13 | 84.34 | 75.00 | 79.52 | 2.13 | 4.82 |
| 4 | 72.61 | 81.93 | 71.41 | 75.90 | 1.20 | 6.02 |
| 5 | 73.40 | 77.11 | 68.22 | 74.70 | 5.19 | 2.41 |
| 6 | 68.75 | 75.90 | 66.89 | 71.08 | 1.86 | 4.82 |
| 7 | 67.69 | 67.47 | 64.76 | 57.83 | 2.93 | 9.64 |
| 8 | 64.23 | 65.06 | 59.97 | 65.06 | 4.26 | 0.00 |
| 9 | 63.70 | 68.67 | 58.78 | 57.83 | 4.92 | 10.84 |
| 10 | 57.71 | 57.83 | 57.31 | 60.24 | 0.40 | -2.41 |
| 11 | 59.18 | 60.24 | 56.38 | 67.47 | 2.79 | -7.23 |
| 12 | 59.18 | 69.88 | 56.65 | 51.81 | 2.53 | 18.07 |
| 13 | 57.31 | 61.45 | 54.79 | 53.01 | 2.53 | 8.43 |
| 14 | 55.85 | 59.04 | 53.86 | 62.65 | 1.99 | -3.61 |
| 15 | 54.12 | 54.22 | 50.66 | 57.83 | 3.46 | -3.61 |
| 16 | 53.86 | 59.04 | 52.66 | 55.42 | 1.20 | 3.61 |
| 17 | 51.33 | 39.76 | 50.00 | 60.24 | 1.33 | -20.48 |
| 18 | 54.52 | 62.65 | 51.73 | 53.01 | 2.79 | 9.64 |
| 19 | 50.00 | 56.63 | 50.00 | 38.55 | 0.00 | 18.07 |
| 20 | 47.21 | 44.58 | 47.74 | 51.81 | -0.53 | -7.23 |

(a) Percentages in this column were calculated as the percentages in column 2 minus the percentages in column 4.
(b) Percentages in this column were calculated as the percentages in column 3 minus the percentages in column 5.

Figure 3: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the test dataset versus the order of prediction. Horizontal axis: prediction order; vertical axis: prediction accuracy (0 to 1).

3.3. Comparison of the Interaction-Based and Similarity-Based Methods. For the 752 samples in the training dataset $S_{tr}$ and the 83 samples in the test dataset $S_{te}$, both the interaction-based and similarity-based methods were used to identify their side effects. Listed in columns 6 and 7 of Table 1 are the differences of the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based method on the training and test datasets were 2.66 and 1.20 percentage points higher, respectively, than those of the similarity-based method. Furthermore, most prediction accuracies in Table 1 obtained by the interaction-based method are higher than the corresponding accuracies obtained by the similarity-based method, indicating that the interaction-based method is more effective in identifying drug side effects. Figures 2 and 3 also show that the curve obtained by the interaction-based method generally lies above the curve obtained by the similarity-based method when the prediction order is low. However, as the order number increases, the curve obtained by the similarity-based method catches up with and exceeds the curve obtained by the interaction-based method, which may be caused by the following two reasons: (I) the high prediction accuracies obtained by the interaction-based method at low order numbers leave fewer correctly predicted samples at high prediction orders; (II) the chemical similarity information between two chemicals is at present more complete than the interaction information in STITCH, so that the similarity-based method can identify more side effects than the interaction-based method. It is expected that the interaction-based method can be improved as more and more chemical-chemical and protein-chemical interactions become available in STITCH.

3.4. Discussion. Predicting the side effects of drugs is a multitarget learning problem, just like the case of dealing with a protein system with multiple subcellular location sites [37]. For each of the drugs investigated, we need to consider how many different side effects it may have and with what probabilities these side effects may occur. To deal with such a complicated statistical system, we adopted the strategy of multiple prediction orders, ranging from the most likely side effect to the least likely one, that is, informing users which side effect is most likely, which one is the second most likely, and so forth. Compared to most of the previous studies on the prediction of drug side effects, our method can thus provide more information. The multiple prediction orders method can also be utilized to deal with other multitarget learning problems, such as subcellular location prediction [37] and prediction of protein functions [20].

In addition to the multitarget issue, we also faced the problem of coverage scope. Of the 835 drug compounds in the benchmark dataset, some have chemical-chemical interaction information, while for the rest such information is missing. To establish a predictor that can be used to predict the side effects of drugs under both circumstances, the approach of direct chemical-chemical interactions and the approach of indirect chemical-chemical interactions were introduced. For the drug compounds belonging to the 1st circumstance, the predictions were conducted based on the direct chemical-chemical interactions (cf. (10)); for the remaining drug compounds, belonging to the 2nd circumstance, the predictions were conducted based on the hybrid interactions (cf. (13)). Thus, the side effects of all the 835 drugs could be predicted.

Finally, the good performance of the interaction-based method on the training and test datasets suggests that the predictions based on the indirect interactions were also quite good, indicating that the entire interaction network (involving all the drug compounds and their direct or indirect interactions, as well as their interactions with human proteins) determines the side effects of drug compounds.

4. Conclusions

In this study, we proposed a novel prediction method to identify drug side effects. For any query drug $d$, its side effects were determined by the following strategy: (1) if there exist interactive compounds of $d$ in the training set, only chemical-chemical interactions were used to identify its side effects; (2) otherwise, both chemical-chemical interactions and protein-chemical interactions were employed to make the prediction. The good performance of the method on the training and test datasets indicates that our method is effective in identifying drug side effects. We hope that the method will assist in predicting drug side effects during drug development and in screening out drug candidates with undesired side effects.

Authors' Contribution

Lei Chen and Tao Huang contributed equally to this work.

Acknowledgments

This contribution is supported by the National Basic Research Program of China (2011CB510101, 2011CB510102), the National Natural Science Foundation of China (61202021, 61105097, 31371335), the Innovation Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), the grant of "The First-class Discipline of Universities in Shanghai", the Shanghai Educational Development Foundation (12CG55), and the Science and Technology Program of Shanghai Maritime University (no. 20120105).

References

[1] T. Huang, W. Cui, L. Hu, K. Feng, Y. Li, and Y. Cai, "Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles," PLoS ONE, vol. 4, no. 12, Article ID e8126, 2009.

[2] S. Saul, Fen-Phen Case Lawyers Say They'll Reject Wyeth Offer, New York Times, New York, NY, USA, 2005.

[3] P. E. Blower, C. Yang, M. A. Fligner et al., "Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data," Pharmacogenomics Journal, vol. 2, no. 4, pp. 259–271, 2002.

[4] K. J. Bussey, K. Chin, S. Lababidi et al., "Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel," Molecular Cancer Therapeutics, vol. 5, no. 4, pp. 853–867, 2006.

[5] R. H. Shoemaker, "The NCI60 human tumour cell line anticancer drug screen," Nature Reviews Cancer, vol. 6, no. 10, pp. 813–823, 2006.

[6] M. Fukuzaki, M. Seki, H. Kashima, and J. Sese, "Side effect prediction using cooperative pathways," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '09), pp. 142–147, November 2009.

[7] L. Xie, J. Li, L. Xie, and P. E. Bourne, "Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors," PLoS Computational Biology, vol. 5, no. 5, Article ID e1000387, 2009.

[8] J. Scheiber, J. L. Jenkins, S. C. K. Sukuru et al., "Mapping adverse drug reactions in chemical space," Journal of Medicinal Chemistry, vol. 52, no. 9, pp. 3103–3107, 2009.

[9] E. Pauwels, V. Stoven, and Y. Yamanishi, "Predicting drug side-effect profiles: a chemical fragment-based approach," BMC Bioinformatics, vol. 12, article 169, 2011.

[10] N. Atias and R. Sharan, "An algorithmic framework for predicting side effects of drugs," Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.

[11] M. Kanehisa and S. Goto, "KEGG: Kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.

[12] M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, "STITCH: interaction networks of chemicals and proteins," Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.

[13] L. Hu, C. Chen, T. Huang, Y. Cai, and K. C. Chou, "Predicting biological functions of compounds based on chemical-chemical interactions," PLoS ONE, vol. 6, no. 12, Article ID e29491, 2011.

[14] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa, "Prediction of drug-target interaction networks from the integration of chemical and genomic spaces," Bioinformatics, vol. 24, no. 13, pp. i232–i240, 2008.

[15] L. Chen, Z. He, T. Huang, and Y. Cai, "Using compound similarity and functional domain composition for prediction of drug-target interaction networks," Medicinal Chemistry, vol. 6, no. 6, pp. 388–395, 2010.

[16] L. Chen, W. Zeng, Y. Cai, K. Feng, and K. C. Chou, "Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities," PLoS ONE, vol. 7, no. 4, Article ID e35254, 2012.

[17] R. Sharan, I. Ulitsky, and R. Shamir, "Network-based prediction of protein function," Molecular Systems Biology, vol. 3, article 88, 2007.

[18] P. Bogdanov and A. K. Singh, "Molecular function prediction using neighborhood features," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2010.

[19] Y. A. I. Kourmpetis, A. D. J. van Dijk, M. C. A. M. Bink, R. C. H. J. van Ham, and C. J. F. ter Braak, "Bayesian Markov random field analysis for protein function prediction based on network data," PLoS ONE, vol. 5, no. 2, Article ID e9293, 2010.

[20] L. Hu, T. Huang, X. Shi, W. Lu, Y. Cai, and K. C. Chou, "Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties," PLoS ONE, vol. 6, no. 1, Article ID e14556, 2011.

[21] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, "A side effect resource to capture phenotypic effects of drugs," Molecular Systems Biology, vol. 6, article 343, 2010.

[22] D. Weininger, "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules," Journal of Chemical Information and Computer Sciences, vol. 28, pp. 31–36, 1988.

[23] K. Ng, J. Ciou, and C. Huang, "Prediction of protein functions based on function-function correlation relations," Computers in Biology and Medicine, vol. 40, no. 3, pp. 300–305, 2010.

[24] M. Dunkel, S. Gunther, J. Ahmed, B. Wittig, and R. Preissner, "SuperPred: drug classification and target prediction," Nucleic Acids Research, vol. 36, pp. W55–W59, 2008.

[25] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, "Open Babel: an open chemical toolbox," Journal of Cheminformatics, vol. 3, article 33, 2011.

[26] K. C. Chou, "Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review)," Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.

[27] C. Chen, Z. Shen, and X. Zou, "Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou's pseudo amino acid composition," Protein and Peptide Letters, vol. 19, no. 4, pp. 422–429, 2012.

[28] M. Hayat and A. Khan, "Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC," Protein and Peptide Letters, vol. 19, no. 4, pp. 411–421, 2012.

[29] L. Nanni, A. Lumini, D. Gupta, and A. Garg, "Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 467–475, 2012.

[30] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, "Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine," Journal of Theoretical Biology, vol. 281, no. 1, pp. 18–23, 2011.

[31] G. Fan and Q. Li, "Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition," Journal of Theoretical Biology, vol. 304, pp. 88–95, 2012.

[32] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, "Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses," Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.

[33] D. Zou, Z. He, J. He, and Y. Xia, "Supersecondary structure prediction using Chou's pseudo amino acid composition," Journal of Computational Chemistry, vol. 32, no. 2, pp. 271–278, 2011.

[34] D. N. Georgiou, T. E. Karakasidis, J. J. Nieto, and A. Torres, "Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition," Journal of Theoretical Biology, vol. 257, no. 1, pp. 17–26, 2009.

[35] X. Zhao, X. Li, Z. Ma, and M. Yin, "Identify DNA-binding proteins with optimal Chou's amino acid composition," Protein and Peptide Letters, vol. 19, no. 4, pp. 398–405, 2012.

[36] L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, "Prediction of metabolic pathway using graph property, chemical functional group and chemical structural set," Current Bioinformatics, vol. 8, pp. 200–207, 2013.

[37] P. Du, T. Li, and X. Wang, "Recent progress in predicting protein sub-subcellular locations," Expert Review of Proteomics, vol. 8, no. 3, pp. 391–404, 2011.


Figure 1: A histogram of the number of drugs versus the number of side effects. Horizontal axis: number of side effects; vertical axis: number of drugs (0 to 25).

the test dataset, denoted by $S_{te}$, while the remaining 752 samples in $S$ were used to construct the training dataset, denoted by $S_{tr}$.

2.2. Chemical-Chemical Interactions and Protein-Chemical Interactions. There is evidence that interactive proteins are more likely to share common biological functions than noninteractive ones [17–20]. Likewise, some pioneering studies [13, 16] have indicated that interactive compounds follow similar rules. Since side effects are one part of the biological functions of drugs, using the properties of interactive compounds to identify drug side effects is a feasible scheme.

To obtain the information on interactive compounds, we downloaded the chemical-chemical interaction data from STITCH (http://stitch.embl.de/, chemical_chemical.links.detailed.v3.0.tsv.gz) [12], a well-known database containing known and predicted chemical-chemical and protein-chemical interaction data from experiments, literature, or other reliable sources. In the data file obtained, each interaction unit contains two chemicals and five scores with the titles "Similarity", "Experimental", "Database", "Textmining", and "Combined score", respectively. Since the last score combines the information of the other scores, we used it to indicate the interactivity of two chemicals in this study; that is, compounds in an interaction unit with a "Combined score" greater than zero were deemed to be interactive compounds. The interactive compounds thus considered satisfy one of the following three properties: (I) they participate in the same reactions; (II) they share similar structures or activities; (III) they have literature associations. These three properties indicate that the interactive compounds tend to occupy the same biological pathways, suggesting that they may induce similar side effects, which supports the feasibility of using chemical-chemical interactions retrieved from STITCH to identify drug side effects. The "Combined score" is termed the confidence score because its value indicates the likelihood that two compounds interact: a high "Combined score" means that the two compounds interact with high probability. For any two drug compounds $d_1$ and $d_2$, their interaction confidence score was denoted by $Q_c(d_1, d_2)$. In particular, if no interaction between $d_1$ and $d_2$ existed, their interaction confidence score was set to zero, that is, $Q_c(d_1, d_2) = 0$.
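As an illustration of how the confidence score $Q_c$ might be held in memory, the sketch below builds a symmetric lookup from already-parsed records. The record layout `(chemical_a, chemical_b, combined_score)` and the function names are assumptions made for this example; they do not describe the actual column layout of the STITCH file.

```python
from collections import defaultdict


def build_confidence(rows):
    """Build a symmetric lookup from (chemical_a, chemical_b, combined_score) records."""
    q_c = defaultdict(dict)
    for a, b, score in rows:
        if score > 0:  # only pairs with a positive combined score count as interactive
            q_c[a][b] = score
            q_c[b][a] = score
    return q_c


def confidence(q_c, d1, d2):
    """Return Q_c(d1, d2), or 0 when no interaction is recorded."""
    return q_c.get(d1, {}).get(d2, 0)


# Toy records (hypothetical identifiers and scores).
rows = [("d1", "d2", 900), ("d1", "d3", 0), ("d2", "d4", 350)]
q_c = build_confidence(rows)
```

Storing each pair in both directions keeps $Q_c(d_1, d_2) = Q_c(d_2, d_1)$, and unrecorded pairs default to zero as stated above.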

Since the chemical-chemical interaction data in STITCH are not yet complete, that is, some potential chemical-chemical interactions may not be reported in STITCH, prediction methods based on chemical-chemical interactions have the limitation that samples without interactive counterparts in the training dataset cannot be processed. Thus, new schemes are needed to measure the interactions that are not reported in STITCH. It is known that if two drug compounds can interact with a third compound or protein, these two drug compounds are likely to share some common functions. In view of this, we proposed a new scheme to measure the likelihood of interaction of two chemicals based on indirect chemical-chemical and protein-chemical interactions.

The data for the protein-chemical interactions were also downloaded from STITCH (http://stitch.embl.de/, protein_chemical.links.detailed.v3.0.tsv.gz). Each interaction unit in the data file obtained contains one compound, one protein, and four scores with the titles "Experimental", "Database", "Textmining", and "Combined score", respectively. By a similar argument, we used the value of the "Combined score", also termed the confidence score, to indicate the likelihood of the interaction's occurrence. For a protein $p$ and a drug compound $d$, their interaction confidence score was denoted by $Q_p(p, d)$. If there was no interaction at all between the protein $p$ and the drug $d$, it was also set to zero, that is, $Q_p(p, d) = 0$.

Now we are ready to introduce the new scheme to measure the likelihood of interaction of two chemicals. For two compounds $d_1$ and $d_2$, let $I_c(d_1)$ denote the set of compounds that directly interact with the drug $d_1$ and $I_c(d_2)$ the set of compounds that directly interact with the drug $d_2$, formulated as

$$I_c(d_1) = \{d : Q_c(d, d_1) > 0\}, \qquad I_c(d_2) = \{d : Q_c(d, d_2) > 0\}. \tag{3}$$

In a similar way, let $I_p(d_1)$ denote the set of proteins that directly interact with the drug $d_1$ and $I_p(d_2)$ the set of proteins that directly interact with the drug $d_2$, formulated as

$$I_p(d_1) = \{p : Q_p(p, d_1) > 0\}, \qquad I_p(d_2) = \{p : Q_p(p, d_2) > 0\}. \tag{4}$$

According to set theory, the drug compounds that interact with both the drug $d_1$ and the drug $d_2$ form the intersection of the set $I_c(d_1)$ and the set $I_c(d_2)$, that is, the set given by

$$I_c(d_1, d_2) = I_c(d_1) \cap I_c(d_2). \tag{5}$$

Likewise, the human proteins that interact with both the drug $d_1$ and the drug $d_2$ form the intersection of the set $I_p(d_1)$ and the set $I_p(d_2)$, that is, the set given by

$$I_p(d_1, d_2) = I_p(d_1) \cap I_p(d_2). \tag{6}$$

Thus, the likelihood of the interaction between $d_1$ and $d_2$ can be calculated via the following equation:

$$Q_h(d_1, d_2) = \frac{\sum_{d' \in I_c(d_1, d_2)} \bigl(Q_c(d_1, d') + Q_c(d_2, d')\bigr) + \sum_{p' \in I_p(d_1, d_2)} \bigl(Q_p(p', d_1) + Q_p(p', d_2)\bigr)}{2\,\bigl|I_c(d_1, d_2) \cup I_p(d_1, d_2)\bigr|}, \tag{7}$$

where $\in$ is the set-theory symbol meaning "member of".
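Equations (3) to (7) can be illustrated with the following sketch. The names `neighbors` and `hybrid_score` are hypothetical. Both lookups are assumed to be dictionaries keyed by drug (`q_c[drug][compound]` and `q_p[drug][protein]`), compounds and proteins are assumed to be distinct entities so that the size of the union in (7) is the sum of the two set sizes, and returning 0 when the two drugs share no partner is a convention adopted here because (7) is undefined in that case.

```python
def neighbors(q, d):
    """I(d): partners of d with a positive confidence score (Eqs. (3) and (4))."""
    return {x for x, s in q.get(d, {}).items() if s > 0}


def hybrid_score(q_c, q_p, d1, d2):
    """Q_h(d1, d2) following Eq. (7)."""
    shared_c = neighbors(q_c, d1) & neighbors(q_c, d2)  # Eq. (5)
    shared_p = neighbors(q_p, d1) & neighbors(q_p, d2)  # Eq. (6)
    n_shared = len(shared_c) + len(shared_p)  # disjoint sets, so this equals |union|
    if n_shared == 0:
        return 0.0  # assumption: no shared partner means no indirect evidence
    total = sum(q_c[d1][x] + q_c[d2][x] for x in shared_c)
    total += sum(q_p[d1][p] + q_p[d2][p] for p in shared_p)
    return total / (2 * n_shared)


# Toy data (hypothetical): drugs "a" and "b" share compound "x" and protein "P1".
q_c = {"a": {"x": 800, "y": 400}, "b": {"x": 600}}
q_p = {"a": {"P1": 700}, "b": {"P1": 500, "P2": 900}}
q_h_ab = hybrid_score(q_c, q_p, "a", "b")
```

In the toy data the shared partners contribute $(800 + 600) + (700 + 500) = 2600$, which is divided by $2 \times 2$, so the hybrid score is the average confidence over all links to shared partners.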

2.3. Interaction-Based Method. It is instructive to recall that, by using the information of protein-protein interactions, several methods have been developed to successfully predict the properties of proteins [17–20, 23]. The underlying idea of these methods is the assumption that interactive proteins are more likely to share common biological functions than noninteractive ones. Similarly, based on the argument in Section 2.2 and some previous studies [13, 16], interactive drugs are more likely to share similar side effects than noninteractive ones. Based on this idea, the following prediction method based on chemical-chemical and protein-chemical interactions was developed.

For convenience, some notation is necessary. Suppose there are $n$ drugs in the training set $S'$, say $d_1, d_2, \ldots, d_n$; the side effects of the drug $d_i$ in the training dataset are described as

$$C(d_i) = [c_{i,1}, c_{i,2}, \ldots, c_{i,100}]^{\mathrm{T}} \quad (i = 1, 2, \ldots, n), \tag{8}$$

where $\mathrm{T}$ is the transpose operator and

$$c_{i,j} = \begin{cases} 1, & \text{if } d_i \text{ has the side effect } C_j, \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Prediction Based on Chemical-Chemical Interactions. As elaborated previously, interactive drugs tend to share similar side effects. The likelihood that the query drug $d$ has the side effect $C_j$ can be calculated by

$$\Pi_c(d \to C_j) = \sum_{d_i \in S'} Q_c(d, d_i) \cdot c_{i,j}, \quad j = 1, 2, \ldots, 100. \tag{10}$$

According to (10) the greater scoreprod119888(119889 rarr 119862119895)means that

there are lots of interactive compounds of 119889 that have the sideeffect 119862

119895or some interactions between 119889 and its interactive

compounds with the side effect 119862119895are labeled by high

confidence scoresThus the greater the scoreprod119888(119889 rarr 119862119895) is

the more likely the drug compound 119889 has the 119895th side effectwithprod119888(119889 rarr 119862

119895) = 0 indicating that the probability for the

drug 119889 having the 119895th side effect is zero Since a drug usuallyhas multiple side effects (see Figure 1) the prediction shouldprovide a series of candidate side effects ranging from themost likely one to the least likely one rather than only givingthemost likely oneThus for a query drug 119889 suppose we have

$$\Pi_c(d \to C_2) \ge \Pi_c(d \to C_4) \ge \cdots \ge \Pi_c(d \to C_{90}) > 0, \tag{11}$$

meaning that the most likely side effect for the drug $d$ is $C_2$, or “Headache” (cf. the table in Supplementary Material II), the second most likely is $C_4$, or “Rash,” and so forth. In other words, $C_2$ is called the 1st order prediction, $C_4$ the 2nd order prediction, and so forth. Note that the outcome of (10) might be trivial, that is,

$$\Pi_c(d \to C_j) = 0 \quad \text{for } j = 1, 2, \ldots, 100, \tag{12}$$

implying that no meaningful or direct interactive drug compounds whatsoever can be found in the training dataset $S'$ for the drug $d$. Under such a circumstance, an alternative approach should be used for predicting its side effects, as elaborated below.

Prediction Based on Hybrid Interactions. When the query drug $d$ did not have any directly interactive drugs in the training dataset $S'$, or the information on its directly interactive drugs was trivial, the data for the indirect chemical-chemical and protein-chemical interactions were used to predict its side effects. The prediction method was formulated in a similar way to the above method, but now, instead of (10), the likelihood that the query drug $d$ has the side effect $C_j$ is calculated by

$$\Pi_h(d \to C_j) = \sum_{d_i \in S'} Q_h(d, d_i) \cdot c_{ij}. \tag{13}$$

By integrating the above two approaches, the following steps were adopted to predict the side effects of the query drug $d$.

Step 1. The method based on the chemical-chemical interactions, that is, (10), was first utilized to identify its side effects.

Step 2. If the outcomes were trivial or no meaningful results were obtained, as in the case of (12), the method based on the hybrid interactions, that is, (13), was utilized to continue the prediction.
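The two steps can be sketched as a simple fallback rule. The Python fragment below is only an illustration: `score_direct` and `score_hybrid` are hypothetical callables standing in for (10) and (13), and treating "trivial" as "all scores equal to zero" follows (12) but is otherwise an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def predict_side_effects(
    query: str,
    score_direct: Callable[[str], Sequence[float]],
    score_hybrid: Callable[[str], Sequence[float]],
) -> List[int]:
    """Rank side effects for `query` with the two-step strategy.

    Step 1 uses the direct chemical-chemical scores (Eq. (10)).
    Step 2 falls back to the hybrid scores (Eq. (13)) only when every
    direct score is zero, i.e. the trivial outcome of Eq. (12).
    """
    scores = score_direct(query)
    if all(s == 0 for s in scores):
        scores = score_hybrid(query)
    # Highest score first; ties broken by index (an arbitrary choice here).
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return [j for j in order if scores[j] > 0]


if __name__ == "__main__":
    # Hypothetical scorers: the direct one finds nothing for this query.
    print(predict_side_effects("q",
                               lambda d: [0.0, 0.0, 0.0],
                               lambda d: [0.1, 0.0, 0.4]))
```

Keeping the two scorers as separate arguments makes it explicit that the hybrid score is consulted only when the direct one carries no information.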

2.4. Similarity-Based Method. It is known that compounds with similar structural properties tend to be involved in similar biological activities [24]. The most well-known system for representing compounds, from which the similarity of two compounds can be obtained, is SMILES (Simplified Molecular Input Line Entry System) [22], a line notation for representing molecules and reactions using ASCII strings. Here we also used this system to obtain the representations of compounds, which were used to calculate the similarity score of two compounds and to set up a new computational method for identifying drug side effects. The similarity score between two compounds with their SMILES representations can be obtained from Open Babel [25], an open chemical toolbox. For two drug compounds $d_1$ and $d_2$, their similarity score obtained from Open Babel was denoted by $Q_s(d_1, d_2)$. Based on the observation that compounds with similar structural properties tend to share the same biological activities, the likelihood that the query drug $d$ has the side effect $C_j$ can be calculated by

$$\Pi_s(d \to C_j) = \max_{d_i \in S'} Q_s(d, d_i) \cdot c_{ij}, \tag{14}$$

meaning that the likelihood that the query drug $d$ has the side effect $C_j$ is formulated as the maximum similarity score between $d$ and those drugs with side effect $C_j$ in the training dataset $S'$. The greater the score $\Pi_s(d \to C_j)$, the more likely the drug compound $d$ has the side effect $C_j$. Following a procedure similar to that of the method based on chemical-chemical interactions, we can also rank the candidate side effects of the query drug $d$ in terms of $\Pi_s(d \to C_j)$ $(j = 1, 2, \ldots, 100)$.
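To make the contrast with (10) explicit, (14) takes a maximum over training drugs rather than a sum. A minimal Python sketch follows; the `similarity` callable is a hypothetical stand-in for the Open Babel score $Q_s$ (no Open Babel call is made here), and the toy values are invented for illustration.

```python
from typing import Callable, Dict, List


def similarity_scores(
    query: str,
    similarity: Callable[[str, str], float],
    labels: Dict[str, List[int]],
    n_effects: int,
) -> List[float]:
    """Score every side effect for `query` as the maximum in Eq. (14).

    For side effect j, the score is the largest similarity between the
    query and any training drug that has side effect j (0 if none does).
    """
    scores = [0.0] * n_effects
    for drug, c in labels.items():
        if drug == query:
            continue
        s = similarity(query, drug)
        for j in range(n_effects):
            if c[j]:
                scores[j] = max(scores[j], s)
    return scores


if __name__ == "__main__":
    toy_labels = {"a": [1, 0, 1], "b": [0, 1, 1]}      # hypothetical drugs
    toy_sim = {("q", "a"): 0.8, ("q", "b"): 0.3}        # hypothetical Q_s
    print(similarity_scores("q", lambda x, y: toy_sim[(x, y)], toy_labels, 3))
```

Because of the maximum, a side effect shared by many weakly similar drugs scores no higher than one carried by a single closely similar drug, which is a design difference from the additive score in (10).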

2.5. Jackknife Test. In statistical prediction, the jackknife test [16] is often used to examine a predictor for its effectiveness in practical application. In the jackknife test, all the samples in the dataset are singled out one by one and tested by the predictor trained on the remaining samples. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and each sample is in turn moved between the two. The jackknife test can exclude the “memory” effect, and the arbitrariness problem can also be avoided. Thus, the outcome obtained by the jackknife test is always unique for a given benchmark dataset [26]. Accordingly, the jackknife test has been widely recognized and increasingly adopted to investigate the performance of various predictors [27–36]. Thus, the jackknife test was also adopted here to evaluate the anticipated accuracy of the current prediction methods.
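The leave-one-out procedure described above can be sketched as follows. This is a generic illustration rather than the authors' code: `predict` is a hypothetical callable that receives the held-out drug and the remaining labelled drugs and returns a ranked list of side-effect indices, and the frequency-based predictor is only a placeholder to make the example runnable.

```python
from typing import Callable, Dict, List


def jackknife_first_order_accuracy(
    labels: Dict[str, List[int]],
    predict: Callable[[str, Dict[str, List[int]]], List[int]],
) -> float:
    """Hold out each drug in turn, predict from the rest, and return the
    fraction of drugs whose top-ranked side effect is a true one."""
    hits = 0
    for drug, truth in labels.items():
        rest = {d: c for d, c in labels.items() if d != drug}
        ranking = predict(drug, rest)
        if ranking and truth[ranking[0]] == 1:
            hits += 1
    return hits / len(labels)


def most_frequent_first(query: str, training: Dict[str, List[int]]) -> List[int]:
    """Placeholder predictor: rank side effects by how often they occur in
    the training drugs, ignoring the query entirely."""
    n = len(next(iter(training.values())))
    counts = [sum(c[j] for c in training.values()) for j in range(n)]
    order = sorted(range(n), key=lambda j: (-counts[j], j))
    return [j for j in order if counts[j] > 0]


if __name__ == "__main__":
    toy = {"a": [1, 0], "b": [1, 0], "c": [0, 1]}   # hypothetical labels
    print(jackknife_first_order_accuracy(toy, most_frequent_first))
```

Since every sample is held out exactly once and no random split is involved, repeated runs on the same dataset give the same value, which is the uniqueness property noted in the text.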

2.6. Accuracy Measurement. For a query drug, we may identify a series of side effects with the current prediction method. For the $j$th order prediction, its prediction accuracy $\mathrm{AC}(j)$ can be calculated by

$$\mathrm{AC}(j) = \frac{P(j)}{N}, \quad j = 1, 2, \ldots, 100, \tag{15}$$

where $P(j)$ denotes the number of drugs whose $j$th order prediction is one of the true side effects and $N$ denotes the total number of structure-different drugs in the dataset. For a prediction method with 100 orders of prediction results, high $\mathrm{AC}(j)$ with small $j$ and low $\mathrm{AC}(j)$ with large $j$ would indicate a good prediction [13, 16, 20]. Generally speaking, it also implies a good performance by the predictor if its 1st order prediction has a high success rate.
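A short Python sketch of (15) is given below. The data layout is an assumption made for illustration: `rankings[d]` is the ordered list of predicted side-effect indices for drug `d` (possibly shorter than the number of orders), and `truths[d]` is the set of its true side effects.

```python
from typing import Dict, List, Set


def order_accuracy(
    rankings: Dict[str, List[int]],
    truths: Dict[str, Set[int]],
    n_orders: int,
) -> List[float]:
    """Return [AC(1), ..., AC(n_orders)] following Eq. (15).

    P(j) counts the drugs whose j-th ranked side effect is a true one;
    a drug with fewer than j ranked candidates contributes no hit.
    N is the total number of drugs.
    """
    n = len(rankings)
    accuracies = []
    for j in range(n_orders):
        hits = sum(
            1
            for drug, ranked in rankings.items()
            if j < len(ranked) and ranked[j] in truths[drug]
        )
        accuracies.append(hits / n)
    return accuracies


if __name__ == "__main__":
    toy_rankings = {"a": [0, 1], "b": [1]}        # hypothetical predictions
    toy_truths = {"a": {0}, "b": {0, 1}}          # hypothetical ground truth
    print(order_accuracy(toy_rankings, toy_truths, 2))
```

A well-behaved predictor should give a list that starts high and decays, matching the criterion stated after (15).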

Figure 2: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the training dataset versus the order of prediction. The Y-axis shows prediction accuracy and the X-axis shows prediction order.

3. Results and Discussion

Of the 835 drugs in the benchmark dataset $S$, 83 samples were randomly selected to compose the test dataset $S_{\mathrm{te}}$, while the remaining 752 samples composed the training dataset $S_{\mathrm{tr}}$. The prediction results of the interaction-based method and the similarity-based method on the training and test datasets are as follows.

3.1. Performance of the Methods on the Training Dataset. For the 752 drug compounds in the training dataset $S_{\mathrm{tr}}$, the interaction-based method and the similarity-based method were used to make predictions, with their performance evaluated by the jackknife test. Listed in columns 2 and 4 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies of the interaction-based and similarity-based methods were 86.30% and 83.64%, respectively, while the 2nd order ones were 80.45% and 79.12%, respectively. All 100 prediction accuracies obtained by these two methods are given in Supplementary Material III, and two curves with prediction accuracy as the Y-axis and prediction order as the X-axis are shown in Figure 2. It is observed that the prediction accuracies obtained by the interaction-based method generally descend with the increase of the order number, and the same holds for the prediction accuracies obtained by the similarity-based method. All of these imply that the two methods sorted the side effects of drug compounds in the training dataset quite well and that both are quite effective in identifying drug side effects.

3.2. Performance of the Methods on the Test Dataset. For the 83 drug compounds in the test dataset $S_{\mathrm{te}}$, the side effects of these samples were predicted by the interaction-based and similarity-based methods based on the drug compounds in the training dataset $S_{\mathrm{tr}}$. After processing by (15), 100 prediction accuracies were obtained for each method, which are also given in Supplementary Material III. Listed in columns 3 and 5 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based and similarity-based methods were 89.16% and 87.95%, respectively. We also plotted two curves with prediction accuracy as the Y-axis and prediction order as the X-axis, which are shown in Figure 3. It is observed from Figure 3 that the accuracies also exhibit a decreasing trend with the increase of the order number. However, the two curves in Figure 3 fluctuate more drastically and frequently than those of Figure 2, which may be caused by the small number of samples in the test dataset. In any case, the interaction-based and similarity-based methods still sorted the side effects of samples in the test dataset reasonably well, implying again that these two methods are quite effective in identifying drug side effects.

Table 1: The first 20 prediction accuracies (%) of the interaction-based and similarity-based methods in identifying the side effects of drugs in the training and test datasets.

| Prediction order | Interaction-based, training dataset | Interaction-based, test dataset | Similarity-based, training dataset | Similarity-based, test dataset | Difference, training dataset (a) | Difference, test dataset (b) |
|---|---|---|---|---|---|---|
| 1 | 86.30 | 89.16 | 83.64 | 87.95 | 2.66 | 1.20 |
| 2 | 80.45 | 83.13 | 79.12 | 83.13 | 1.33 | 0.00 |
| 3 | 77.13 | 84.34 | 75.00 | 79.52 | 2.13 | 4.82 |
| 4 | 72.61 | 81.93 | 71.41 | 75.90 | 1.20 | 6.02 |
| 5 | 73.40 | 77.11 | 68.22 | 74.70 | 5.19 | 2.41 |
| 6 | 68.75 | 75.90 | 66.89 | 71.08 | 1.86 | 4.82 |
| 7 | 67.69 | 67.47 | 64.76 | 57.83 | 2.93 | 9.64 |
| 8 | 64.23 | 65.06 | 59.97 | 65.06 | 4.26 | 0.00 |
| 9 | 63.70 | 68.67 | 58.78 | 57.83 | 4.92 | 10.84 |
| 10 | 57.71 | 57.83 | 57.31 | 60.24 | 0.40 | −2.41 |
| 11 | 59.18 | 60.24 | 56.38 | 67.47 | 2.79 | −7.23 |
| 12 | 59.18 | 69.88 | 56.65 | 51.81 | 2.53 | 18.07 |
| 13 | 57.31 | 61.45 | 54.79 | 53.01 | 2.53 | 8.43 |
| 14 | 55.85 | 59.04 | 53.86 | 62.65 | 1.99 | −3.61 |
| 15 | 54.12 | 54.22 | 50.66 | 57.83 | 3.46 | −3.61 |
| 16 | 53.86 | 59.04 | 52.66 | 55.42 | 1.20 | 3.61 |
| 17 | 51.33 | 39.76 | 50.00 | 60.24 | 1.33 | −20.48 |
| 18 | 54.52 | 62.65 | 51.73 | 53.01 | 2.79 | 9.64 |
| 19 | 50.00 | 56.63 | 50.00 | 38.55 | 0.00 | 18.07 |
| 20 | 47.21 | 44.58 | 47.74 | 51.81 | −0.53 | −7.23 |

(a) Percentages in this column were calculated as the percentages in column 2 minus the percentages in column 4.
(b) Percentages in this column were calculated as the percentages in column 3 minus the percentages in column 5.

Figure 3: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the test dataset versus the order of prediction. The Y-axis shows prediction accuracy and the X-axis shows prediction order.
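The sample-size explanation for the rougher test-set curves can be checked with quick arithmetic based on (15). The Python sketch below uses only the dataset sizes and the 1st order accuracies reported in the text; the implied hit counts are an inference from those figures, not numbers stated by the authors.

```python
N_TRAIN, N_TEST = 752, 83

# Smallest possible change in AC(j), in percentage points, when one more
# (or one fewer) drug is predicted correctly at a given order.
step_train = 100 / N_TRAIN
step_test = 100 / N_TEST


def implied_hits(accuracy_percent: float, n: int) -> int:
    """Number of correct drugs P(j) implied by a reported accuracy via Eq. (15)."""
    return round(accuracy_percent / 100 * n)


print(f"training step: {step_train:.2f} points, test step: {step_test:.2f} points")
print("implied 1st order hits on the test set:",
      implied_hits(89.16, N_TEST), "(interaction-based) and",
      implied_hits(87.95, N_TEST), "(similarity-based)")
```

Under this reading, a single drug moves a test-set accuracy by about 1.20 percentage points but a training-set accuracy by only about 0.13, so the 1st order gap between the two methods on the test dataset corresponds to one drug.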

3.3. Comparison of the Interaction-Based and Similarity-Based Methods. For the 752 samples in the training dataset $S_{\mathrm{tr}}$ and the 83 samples in the test dataset $S_{\mathrm{te}}$, the interaction-based and similarity-based methods were both used to identify their side effects. Listed in columns 6 and 7 of Table 1 are the differences of the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based method on the training and test datasets were 2.66% and 1.20% higher, respectively, than those of the similarity-based method. Furthermore, most prediction accuracies in Table 1 obtained by the interaction-based method are higher than the corresponding accuracies obtained by the similarity-based method, indicating that the interaction-based method is more effective in identifying drug side effects. It can also be seen from Figures 2 and 3 that the curve obtained by the interaction-based method is generally above the curve obtained by the similarity-based method when the prediction order is low. However, with the increase of the order number, the curve obtained by the similarity-based method catches up with and exceeds the curve obtained by the interaction-based method, which may be caused by the following two reasons: (I) the high prediction accuracies obtained by the interaction-based method at low order numbers leave few correctly predicted samples at high prediction orders; (II) the chemical similarity information between two chemicals is at present more complete than the interaction information in STITCH, so that the similarity-based method can identify more side effects than the interaction-based method. It is expected that the interaction-based method can be improved as more chemical-chemical and protein-chemical interactions become available in STITCH.
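The statement that most accuracies in Table 1 favour the interaction-based method can be tallied directly. In the Python sketch below the values are transcribed from Table 1 (in percent); the helper name `tally` is hypothetical.

```python
# (order, interaction train, interaction test, similarity train, similarity test)
TABLE_1 = [
    (1, 86.30, 89.16, 83.64, 87.95), (2, 80.45, 83.13, 79.12, 83.13),
    (3, 77.13, 84.34, 75.00, 79.52), (4, 72.61, 81.93, 71.41, 75.90),
    (5, 73.40, 77.11, 68.22, 74.70), (6, 68.75, 75.90, 66.89, 71.08),
    (7, 67.69, 67.47, 64.76, 57.83), (8, 64.23, 65.06, 59.97, 65.06),
    (9, 63.70, 68.67, 58.78, 57.83), (10, 57.71, 57.83, 57.31, 60.24),
    (11, 59.18, 60.24, 56.38, 67.47), (12, 59.18, 69.88, 56.65, 51.81),
    (13, 57.31, 61.45, 54.79, 53.01), (14, 55.85, 59.04, 53.86, 62.65),
    (15, 54.12, 54.22, 50.66, 57.83), (16, 53.86, 59.04, 52.66, 55.42),
    (17, 51.33, 39.76, 50.00, 60.24), (18, 54.52, 62.65, 51.73, 53.01),
    (19, 50.00, 56.63, 50.00, 38.55), (20, 47.21, 44.58, 47.74, 51.81),
]


def tally(rows, a, b):
    """Count rows where column `a` is higher than, equal to, or lower than column `b`."""
    higher = sum(1 for r in rows if r[a] > r[b])
    equal = sum(1 for r in rows if r[a] == r[b])
    return higher, equal, len(rows) - higher - equal


train_tally = tally(TABLE_1, 1, 3)   # interaction vs similarity, training
test_tally = tally(TABLE_1, 2, 4)    # interaction vs similarity, test
print("training (higher, equal, lower):", train_tally)
print("test     (higher, equal, lower):", test_tally)
```

Over the first 20 orders, the interaction-based method is higher at 18 orders on the training dataset and at 12 orders on the test dataset, with the similarity-based method ahead mainly at orders 10 and above on the test dataset.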

3.4. Discussion. Predicting the side effects of drugs is a multitarget learning problem, just like the case of dealing with a protein system with multiple subcellular location sites [37]. For each of the drugs investigated, we need to consider how many different side effects it may have and with what probabilities these side effects may occur. To deal with a complicated statistical system like this, we adopted the strategy of multiple prediction orders, ranging from the most likely side effect to the least likely one, that is, informing users which side effect is most likely, which one is the second most likely, and so forth. Compared to most of the previous studies on the prediction of drug side effects, our method can provide more information. The multiple prediction orders method can also be utilized to deal with other multitarget learning problems, such as subcellular location prediction [37] and the prediction of protein functions [20].

In addition to the multitarget issue, we also faced the problem of coverage scope. Of the 835 drug compounds in the benchmark dataset, some have chemical-chemical interaction information, while for the rest such information is missing. To establish a predictor that can be used to predict the side effects of drugs under both circumstances, the approach of the direct chemical-chemical interaction and the approach of the indirect chemical-chemical interaction were introduced. For the drug compounds belonging to the 1st circumstance, the predictions were conducted based on the direct chemical-chemical interactions (cf. (10)); for the remaining drug compounds, belonging to the 2nd circumstance, the predictions were conducted based on the hybrid interactions (cf. (13)). Thus, the side effects of all 835 drugs could be predicted.

Finally, the good performance of the interaction-based method on the training and test datasets suggests that the predictions based on the indirect interactions were also quite good, indicating that the entire interaction network (involving all the drug compounds and their direct or indirect interactions, as well as their interactions with human proteins) determines the side effects of drug compounds.

4. Conclusions

In this study, we proposed a novel prediction method to identify drug side effects. For any query drug $d$, its side effects were determined by the following strategy: (1) if there exist interactive compounds of $d$ in the training set, only chemical-chemical interactions were used to identify its side effects; (2) otherwise, both chemical-chemical interactions and protein-chemical interactions were employed to make the prediction. The good performance of the method on the training and test datasets indicates that our method is quite effective in identifying drug side effects. We hope that the method will assist in the prediction of drug side effects during drug development and in screening out drug candidates with undesired side effects.

Authors' Contribution

Lei Chen and Tao Huang contributed equally to this work

Acknowledgments

This contribution is supported by the National Basic Research Program of China (2011CB510101, 2011CB510102), the National Natural Science Foundation of China (61202021, 61105097, 31371335), the Innovation Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), the grant of “The First-class Discipline of Universities in Shanghai,” the Shanghai Educational Development Foundation (12CG55), and the Science and Technology Program of Shanghai Maritime University (no. 20120105).

References

[1] T. Huang, W. Cui, L. Hu, K. Feng, Y. Li, and Y. Cai, “Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles,” PLoS ONE, vol. 4, no. 12, Article ID e8126, 2009.

[2] S. Saul, Fen-Phen Case Lawyers Say They'll Reject Wyeth Offer, New York Times, New York, NY, USA, 2005.

[3] P. E. Blower, C. Yang, M. A. Fligner et al., “Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data,” Pharmacogenomics Journal, vol. 2, no. 4, pp. 259–271, 2002.

[4] K. J. Bussey, K. Chin, S. Lababidi et al., “Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel,” Molecular Cancer Therapeutics, vol. 5, no. 4, pp. 853–867, 2006.

[5] R. H. Shoemaker, “The NCI60 human tumour cell line anticancer drug screen,” Nature Reviews Cancer, vol. 6, no. 10, pp. 813–823, 2006.

[6] M. Fukuzaki, M. Seki, H. Kashima, and J. Sese, “Side effect prediction using cooperative pathways,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '09), pp. 142–147, November 2009.

[7] L. Xie, J. Li, L. Xie, and P. E. Bourne, “Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000387, 2009.

[8] J. Scheiber, J. L. Jenkins, S. C. K. Sukuru et al., “Mapping adverse drug reactions in chemical space,” Journal of Medicinal Chemistry, vol. 52, no. 9, pp. 3103–3107, 2009.

[9] E. Pauwels, V. Stoven, and Y. Yamanishi, “Predicting drug side-effect profiles: a chemical fragment-based approach,” BMC Bioinformatics, vol. 12, article 169, 2011.

[10] N. Atias and R. Sharan, “An algorithmic framework for predicting side effects of drugs,” Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.

[11] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.

[12] M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, “STITCH: interaction networks of chemicals and proteins,” Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.

[13] L. Hu, C. Chen, T. Huang, Y. Cai, and K. C. Chou, “Predicting biological functions of compounds based on chemical-chemical interactions,” PLoS ONE, vol. 6, no. 12, Article ID e29491, 2011.

[14] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa, “Prediction of drug-target interaction networks from the integration of chemical and genomic spaces,” Bioinformatics, vol. 24, no. 13, pp. i232–i240, 2008.

[15] L. Chen, Z. He, T. Huang, and Y. Cai, “Using compound similarity and functional domain composition for prediction of drug-target interaction networks,” Medicinal Chemistry, vol. 6, no. 6, pp. 388–395, 2010.

[16] L. Chen, W. Zeng, Y. Cai, K. Feng, and K. C. Chou, “Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities,” PLoS ONE, vol. 7, no. 4, Article ID e35254, 2012.

[17] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular Systems Biology, vol. 3, article 88, 2007.

[18] P. Bogdanov and A. K. Singh, “Molecular function prediction using neighborhood features,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2010.

[19] Y. A. I. Kourmpetis, A. D. J. van Dijk, M. C. A. M. Bink, R. C. H. J. van Ham, and C. J. F. ter Braak, “Bayesian Markov random field analysis for protein function prediction based on network data,” PLoS ONE, vol. 5, no. 2, Article ID e9293, 2010.

[20] L. Hu, T. Huang, X. Shi, W. Lu, Y. Cai, and K. C. Chou, “Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties,” PLoS ONE, vol. 6, no. 1, Article ID e14556, 2011.

[21] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, “A side effect resource to capture phenotypic effects of drugs,” Molecular Systems Biology, vol. 6, article 343, 2010.

[22] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Computer Sciences, vol. 28, pp. 31–36, 1988.

[23] K. Ng, J. Ciou, and C. Huang, “Prediction of protein functions based on function-function correlation relations,” Computers in Biology and Medicine, vol. 40, no. 3, pp. 300–305, 2010.

[24] M. Dunkel, S. Gunther, J. Ahmed, B. Wittig, and R. Preissner, “SuperPred: drug classification and target prediction,” Nucleic Acids Research, vol. 36, pp. W55–W59, 2008.

[25] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, “Open Babel: an open chemical toolbox,” Journal of Cheminformatics, vol. 3, article 33, 2011.

[26] K. C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review),” Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.

[27] C. Chen, Z. Shen, and X. Zou, “Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou's pseudo amino acid composition,” Protein and Peptide Letters, vol. 19, no. 4, pp. 422–429, 2012.

[28] M. Hayat and A. Khan, “Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC,” Protein and Peptide Letters, vol. 19, no. 4, pp. 411–421, 2012.

[29] L. Nanni, A. Lumini, D. Gupta, and A. Garg, “Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 467–475, 2012.

[30] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, “Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine,” Journal of Theoretical Biology, vol. 281, no. 1, pp. 18–23, 2011.

[31] G. Fan and Q. Li, “Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition,” Journal of Theoretical Biology, vol. 304, pp. 88–95, 2012.

[32] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, “Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses,” Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.

[33] D. Zou, Z. He, J. He, and Y. Xia, “Supersecondary structure prediction using Chou's pseudo amino acid composition,” Journal of Computational Chemistry, vol. 32, no. 2, pp. 271–278, 2011.

[34] D. N. Georgiou, T. E. Karakasidis, J. J. Nieto, and A. Torres, “Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition,” Journal of Theoretical Biology, vol. 257, no. 1, pp. 17–26, 2009.

[35] X. Zhao, X. Li, Z. Ma, and M. Yin, “Identify DNA-binding proteins with optimal Chou's amino acid composition,” Protein and Peptide Letters, vol. 19, no. 4, pp. 398–405, 2012.

[36] L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, “Prediction of metabolic pathway using graph property, chemical functional group and chemical structural set,” Current Bioinformatics, vol. 8, pp. 200–207, 2013.

[37] P. Du, T. Li, and X. Wang, “Recent progress in predicting protein sub-subcellular locations,” Expert Review of Proteomics, vol. 8, no. 3, pp. 391–404, 2011.


set $I_p(d_1)$ and the set $I_p(d_2)$; that is, they will form a set given by

$$I_p(d_1, d_2) = I_p(d_1) \cap I_p(d_2). \tag{6}$$

Thus, the likelihood of the interaction between $d_1$ and $d_2$ can be calculated via the following equation:

$$Q_h(d_1, d_2) = \left( \sum_{d' \in I_c(d_1, d_2)} \bigl( Q_c(d_1, d') + Q_c(d_2, d') \bigr) + \sum_{p' \in I_p(d_1, d_2)} \bigl( Q_p(p', d_1) + Q_p(p', d_2) \bigr) \right) \times \bigl( 2 \, \lvert I_c(d_1, d_2) \cup I_p(d_1, d_2) \rvert \bigr)^{-1}, \tag{7}$$

where $\in$ is a symbol in set theory meaning “member of.”
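Equation (7) is an average of the confidence scores along all two-step paths that connect the two drugs through a shared compound or a shared protein. A minimal Python sketch is given below under several assumptions that are not confirmed by the text shown here: $I_c(d_1, d_2)$ is taken to be the intersection of the two drugs' compound neighbours, by analogy with (6); compounds and proteins are assumed never to share identifiers, so the size of the union is the sum of the two set sizes; and a pair with no shared neighbour is given a score of 0. The data layout and the function name are hypothetical.

```python
from typing import Dict


def hybrid_confidence(
    d1: str,
    d2: str,
    chem_links: Dict[str, Dict[str, float]],
    prot_links: Dict[str, Dict[str, float]],
) -> float:
    """Indirect interaction score Q_h(d1, d2) in the spirit of Eq. (7).

    chem_links[d][x] is Q_c between drug d and compound x;
    prot_links[d][p] is Q_p between protein p and drug d.
    """
    shared_c = set(chem_links.get(d1, {})) & set(chem_links.get(d2, {}))
    shared_p = set(prot_links.get(d1, {})) & set(prot_links.get(d2, {}))
    # Compounds and proteins are assumed disjoint, so |union| = sum of sizes.
    n_shared = len(shared_c) + len(shared_p)
    if n_shared == 0:
        return 0.0  # assumption: no shared neighbour means no indirect evidence
    total = sum(chem_links[d1][x] + chem_links[d2][x] for x in shared_c)
    total += sum(prot_links[d1][p] + prot_links[d2][p] for p in shared_p)
    return total / (2 * n_shared)


if __name__ == "__main__":
    chem = {"d1": {"x": 0.5, "y": 0.25}, "d2": {"x": 0.25}}   # hypothetical Q_c
    prot = {"d1": {"p": 1.0}, "d2": {"p": 0.5}}               # hypothetical Q_p
    print(hybrid_confidence("d1", "d2", chem, prot))
```

Dividing by twice the number of shared neighbours keeps the result on the same scale as the individual confidence scores, since each shared neighbour contributes two of them.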

23 Interaction-Based Method It is instructive to recall thatby using the information of protein-protein interactionssome methods have been developed to successfully predictthe properties of proteins [17ndash20 23] Actually the underlyingidea of thesemethodswas based on the assumption that inter-active proteins are more likely to share common biologicalfunctions than noninteractive ones Similarly based on theargument in Section 22 and some previous studies [13 16]interactive drugs are more likely to share similar side effectsthan noninteractive ones Based on such an underlying ideathe following predicted method based on chemical-chemicaland protein-chemical interactions was developed

For convenience some notations are necessary Supposethere are 119899 drugs in the training set S1015840 say 119889

1 1198892 119889

119899 the

side effects of the drug 119889119894in the training dataset is described

as

119862 (119889119894) = [1198881198941 1198881198942 119888

119894100]T(119894 = 1 2 119899) (8)

where T is the transpose operator and

119888119894119895= 1 if 119889

119894has the side effect 119862

119895

0 otherwise(9)

Prediction Based on Chemical-Chemical InteractionsAs elab-orated previously interactive drugs always share similar sideeffectsThe likelihood that the query drug 119889 has the side effect119862119895can be calculated by

prod119888

(119889 997888rarr 119862119895) = sum

119889119894isinS1015840119876119888(119889 119889119894) sdot 119888119894119895119895 = 1 2 100

(10)

According to (10) the greater scoreprod119888(119889 rarr 119862119895)means that

there are lots of interactive compounds of 119889 that have the sideeffect 119862

119895or some interactions between 119889 and its interactive

compounds with the side effect 119862119895are labeled by high

confidence scoresThus the greater the scoreprod119888(119889 rarr 119862119895) is

the more likely the drug compound 119889 has the 119895th side effectwithprod119888(119889 rarr 119862

119895) = 0 indicating that the probability for the

drug 119889 having the 119895th side effect is zero Since a drug usuallyhas multiple side effects (see Figure 1) the prediction shouldprovide a series of candidate side effects ranging from themost likely one to the least likely one rather than only givingthemost likely oneThus for a query drug 119889 suppose we have

prod119888

(119889 997888rarr 1198622) ge prod

119888

(119889 997888rarr 1198624)

ge sdot sdot sdot ge prod119888

(119889 997888rarr 11986290) gt 0

(11)

meaning that the highest likelihood of side effect for the drug119889 is 119862

2or ldquoHeadacherdquo (cf table in Supplementary Material

II) and the second highest is 1198624or ldquoRashrdquo and so forth In

other words 1198622is called the 1st order prediction 119862

4the 2nd

order prediction and so forth Note that the outcome of (10)might be trivial that is

prod119888

(119889 997888rarr 119862119895) = 0 for 119895 = 1 2 100 (12)

implying that no meaningful or direct interactive drugcompounds whatsoever can be found in the training datasetS1015840 for the drug 119889 Under such a circumstance an alternativeapproach should be used for predicting its side effects aselaborated belowPredictionBased onHybrid InteractionsWhen the query drug119889 did not have any directly interactive drugs in the trainingdataset S1015840 or the information of its directly interactive drugswas trivial the data for the indirect chemical-chemical andprotein-chemical interactions would be used to predict itsside effects The prediction method was formulated in asimilar way as the above method But now instead of (10) thelikelihood that the query drug 119889 has the side effect 119862

119895should

be calculated by

prodℎ

(119889 997888rarr 119862119895) = sum

119889119894isinS1015840119876ℎ(119889 119889119894) sdot 119888119894119895 (13)

By integrating the above two different approaches thefollowing steps were adopted to predict the side effects of thequery drug 119889

Step 1 Themethod based on the chemical-chemical interac-tions that is (10) was first utilized to identify its side effects

Step 2 If the outcomes were trivial or no meaningful resultswere obtained as in the case of (12) the method based on thehybrid interactions that is (13) would be utilized to continuethe prediction

24 Similarity-Based Method It is known that the com-pounds with similar structural properties always involvein similar biological activities [24] The most well-knownrepresenting system to obtain the similarity information oftwo compounds is SMILES (Simplified Molecular Input LineEntry System) [22] which is a line notation for representingmolecules and reactions using ASCII strings Here we also

BioMed Research International 5

used this system to obtain the representations of compoundswhich were used to calculate the similarity score of twocompounds and set up a new computational method toidentify drugs side effect The similarity score betweentwo compounds with their SMILES representations can beobtained from Open Babel [25] that is an open chemicaltoolbox For two-drug compounds 119889

1and 119889

2 their similarity

score obtained from Open Babel was denoted by 119876119904(1198891 1198892)

Based on the fact that the compounds with similar structuralproperties always share the same biological activities thelikelihood that the query drug 119889 has the side effect 119862

119895can

be calculated by

prod119904

(119889 997888rarr 119862119895) = max119889119894isinS1015840119876119904(119889 119889119894) sdot 119888119894119895 (14)

meaning that the likelihood that the query drug 119889 has theside effect 119862

119895is formulated as the maximum similarity scores

between 119889 and those drugs with side effect 119862119895in the training

dataset S1015840 Obviously the greater the score prod119904(119889 rarr 119862119895)

the more likely the drug compound 119889 has the side effect119862119895 Following the similar procedure of the method based on

chemical-chemical interactions we can also obtain the orderinformation of the query drug 119889 in terms of prod119904(119889 rarr 119862

119895)

(119895 = 1 2 100)

25 Jackknife Test In statistical prediction Jackknife test [16]is often used to examine a predictor for its effectiveness inpractical application In the jackknife test all the samplesin the dataset will be singled out one-by-one and tested bythe predictor trained by the remaining samples During theprocess of jackknifing both the training dataset and testingdataset are actually open and each sample will be in turnmoved between the two The jackknife test can exclude theldquomemoryrdquo effect and the arbitrariness problem can alsobe avoided Thus the outcome obtained by the jackknifetest is always unique for a given benchmark dataset [26]Accordingly the jackknife test has been widely recognizedand increasingly adopted to investigate the performance ofvarious predictors [27ndash36] Thus the jackknife test was alsoadopted here to evaluate the anticipated accuracy of thecurrent predicted methods

2.6. Accuracy Measurement. For a query drug, we may identify a series of side effects with the current prediction method. For the $j$th order prediction, its prediction accuracy $\mathrm{AC}(j)$ can be calculated by

$$\mathrm{AC}(j) = \frac{P(j)}{N}, \quad j = 1, 2, \ldots, 100, \qquad (15)$$

where $P(j)$ denotes the number of drugs whose $j$th order prediction is one of the true side effects and $N$ denotes the total number of structure-different drugs in the dataset. For a prediction method with 100 orders of prediction results, high $\mathrm{AC}(j)$ for small $j$ and low $\mathrm{AC}(j)$ for large $j$ indicate a good prediction [13, 16, 20]. Generally speaking, a high success rate for the 1st order prediction also implies good performance by the predictor.
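As a small worked example of (15), the sketch below computes $\mathrm{AC}(j)$ from ranked predictions. The four drugs, their rankings, and their true side effects are invented for illustration, and only three prediction orders are used instead of 100.

```python
from typing import Dict, List, Set


def order_accuracies(
    rankings: Dict[str, List[str]],
    truth: Dict[str, Set[str]],
    n_orders: int,
) -> List[float]:
    """Return [AC(1), ..., AC(n_orders)] with AC(j) = P(j) / N as in (15)."""
    n_drugs = len(rankings)
    accuracies = []
    for j in range(n_orders):
        # P(j): number of drugs whose j-th ranked side effect is a true one.
        hits = sum(
            1
            for drug, ranked in rankings.items()
            if j < len(ranked) and ranked[j] in truth[drug]
        )
        accuracies.append(hits / n_drugs)
    return accuracies


# Hypothetical ranked predictions (most likely first) and true side effects.
rankings = {
    "d1": ["nausea", "rash", "headache"],
    "d2": ["rash", "nausea", "headache"],
    "d3": ["headache", "rash", "nausea"],
    "d4": ["nausea", "headache", "rash"],
}
truth = {
    "d1": {"nausea"},
    "d2": {"rash", "nausea"},
    "d3": {"rash"},
    "d4": {"nausea", "rash"},
}
acc = order_accuracies(rankings, truth, n_orders=3)
```

Here three of the four 1st order predictions are correct, two of the 2nd order, and one of the 3rd order, which is the decreasing pattern that the text describes as indicating a good prediction.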

Figure 2: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the training dataset versus the order of prediction. Y-axis: prediction accuracy (0 to 1); X-axis: prediction order.

3. Results and Discussion

Of the 835 drugs in the benchmark dataset $S$, 83 samples were randomly selected to compose the test dataset $S_{te}$, while the remaining 752 samples composed the training dataset $S_{tr}$. The predicted results of the interaction-based method and the similarity-based method on the training and test datasets are as follows.
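A random partition of this kind can be sketched as follows. The drug identifiers are placeholders and the seed is an assumption made only so that the sketch is repeatable; the text does not state how the random selection was performed.

```python
import random
from typing import List, Sequence, Tuple


def split_benchmark(
    drug_ids: Sequence[str], n_test: int, seed: int
) -> Tuple[List[str], List[str]]:
    """Randomly pick n_test drugs for testing; the rest form the training set."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    test = set(rng.sample(list(drug_ids), n_test))
    train = [d for d in drug_ids if d not in test]
    return train, sorted(test)


benchmark = [f"drug{i:03d}" for i in range(835)]  # placeholder identifiers
train_ids, test_ids = split_benchmark(benchmark, n_test=83, seed=0)
```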

3.1. Performance of the Methods on the Training Dataset. For the 752 drug compounds in the training dataset $S_{tr}$, the interaction-based method and the similarity-based method were used to make predictions, with their performance evaluated by the jackknife test. Listed in columns 2 and 4 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies of the interaction-based and similarity-based methods were 86.30% and 83.64%, respectively, while the 2nd order ones were 80.45% and 79.12%, respectively. All 100 prediction accuracies obtained by these two methods are given in Supplementary Material III, and two curves with prediction accuracy as the $Y$-axis and prediction order as the $X$-axis are shown in Figure 2. The prediction accuracies obtained by the interaction-based method generally descend as the order number increases, and the same holds for the similarity-based method. These results imply that the two methods sorted the side effects of drug compounds in the training dataset quite well and that both are quite effective in identifying drug side effects.

3.2. Performance of the Methods on the Test Dataset. For the 83 drug compounds in the test dataset $S_{te}$, the side effects were predicted by the interaction-based and similarity-based methods based on the drug compounds in the training dataset $S_{tr}$. After processing by (15), 100 prediction accuracies were obtained for each method; they are also given in Supplementary Material III. Listed in columns 3 and 5 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based and similarity-based methods were 89.16% and 87.95%, respectively. We also plotted two curves with prediction accuracy as the $Y$-axis and prediction order as the $X$-axis, which are shown in Figure 3. It is observed from Figure 3 that the accuracies also tend to decrease as the order number increases. However, the two curves in Figure 3 fluctuate more drastically and frequently than those in Figure 2, which may be caused by the small number of samples in the test dataset. In any case, the interaction-based and similarity-based methods still sorted the side effects of samples in the test dataset reasonably well, implying again that these two methods are quite effective in identifying drug side effects.

Table 1: The first 20 prediction accuracies (%) of the interaction-based and similarity-based methods in identifying the side effects of drugs in the training and test datasets.

Prediction  Interaction-based   Similarity-based    Difference
order       Training  Test      Training  Test      Training(a)  Test(b)
1           86.30     89.16     83.64     87.95     2.66         1.20
2           80.45     83.13     79.12     83.13     1.33         0.00
3           77.13     84.34     75.00     79.52     2.13         4.82
4           72.61     81.93     71.41     75.90     1.20         6.02
5           73.40     77.11     68.22     74.70     5.19         2.41
6           68.75     75.90     66.89     71.08     1.86         4.82
7           67.69     67.47     64.76     57.83     2.93         9.64
8           64.23     65.06     59.97     65.06     4.26         0.00
9           63.70     68.67     58.78     57.83     4.92         10.84
10          57.71     57.83     57.31     60.24     0.40         −2.41
11          59.18     60.24     56.38     67.47     2.79         −7.23
12          59.18     69.88     56.65     51.81     2.53         18.07
13          57.31     61.45     54.79     53.01     2.53         8.43
14          55.85     59.04     53.86     62.65     1.99         −3.61
15          54.12     54.22     50.66     57.83     3.46         −3.61
16          53.86     59.04     52.66     55.42     1.20         3.61
17          51.33     39.76     50.00     60.24     1.33         −20.48
18          54.52     62.65     51.73     53.01     2.79         9.64
19          50.00     56.63     50.00     38.55     0.00         18.07
20          47.21     44.58     47.74     51.81     −0.53        −7.23

(a) Percentages in this column were calculated as the percentages in column 2 minus those in column 4.
(b) Percentages in this column were calculated as the percentages in column 3 minus those in column 5.

Figure 3: A plot of the prediction accuracy of the two methods (interaction-based and similarity-based) on the test dataset versus the order of prediction. Y-axis: prediction accuracy (0 to 1); X-axis: prediction order.

3.3. Comparison of the Interaction-Based and Similarity-Based Methods. For the 752 samples in the training dataset $S_{tr}$ and the 83 samples in the test dataset $S_{te}$, both the interaction-based and similarity-based methods were used to identify their side effects. Listed in columns 6 and 7 of Table 1 are the differences of the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based method on the training and test datasets were 2.66 and 1.20 percentage points higher than those of the similarity-based method. Furthermore, most prediction accuracies in Table 1 obtained by the interaction-based method are higher than the corresponding accuracies obtained by the similarity-based method, indicating that the interaction-based method is more effective in identifying drug side effects. Figures 2 and 3 also confirm that the curve obtained by the interaction-based method is generally above the curve obtained by the similarity-based method when the prediction order is low. However, as the order number increases, the curve obtained by the similarity-based method catches up with and exceeds the curve obtained by the interaction-based method, which may be due to the following two reasons: (I) the high prediction accuracies obtained by the interaction-based method at low order numbers leave few correctly predicted samples at high prediction orders; (II) the system of chemical similarity between two chemicals is at present more complete than the interaction data in STITCH, so the similarity-based method can identify more side effects than the interaction-based method. It is expected that the interaction-based method can be improved as more and more chemical-chemical and protein-chemical interactions become available in STITCH.
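The statement that most of the first 20 accuracies favor the interaction-based method can be checked directly against Table 1. The short sketch below tallies, for each dataset, how many prediction orders are higher, equal, or lower for the interaction-based method; the values are transcribed from columns 2 to 5 of Table 1, and the helper `tally` is a hypothetical name introduced for this illustration.

```python
from typing import List, Tuple

# Accuracies (%) for prediction orders 1..20, transcribed from Table 1.
inter_train = [86.30, 80.45, 77.13, 72.61, 73.40, 68.75, 67.69, 64.23, 63.70, 57.71,
               59.18, 59.18, 57.31, 55.85, 54.12, 53.86, 51.33, 54.52, 50.00, 47.21]
sim_train = [83.64, 79.12, 75.00, 71.41, 68.22, 66.89, 64.76, 59.97, 58.78, 57.31,
             56.38, 56.65, 54.79, 53.86, 50.66, 52.66, 50.00, 51.73, 50.00, 47.74]
inter_test = [89.16, 83.13, 84.34, 81.93, 77.11, 75.90, 67.47, 65.06, 68.67, 57.83,
              60.24, 69.88, 61.45, 59.04, 54.22, 59.04, 39.76, 62.65, 56.63, 44.58]
sim_test = [87.95, 83.13, 79.52, 75.90, 74.70, 71.08, 57.83, 65.06, 57.83, 60.24,
            67.47, 51.81, 53.01, 62.65, 57.83, 55.42, 60.24, 53.01, 38.55, 51.81]


def tally(first: List[float], second: List[float]) -> Tuple[int, int, int]:
    """Count the orders at which `first` is higher than, equal to, or lower than `second`."""
    higher = sum(a > b for a, b in zip(first, second))
    equal = sum(a == b for a, b in zip(first, second))
    lower = sum(a < b for a, b in zip(first, second))
    return higher, equal, lower


train_counts = tally(inter_train, sim_train)
test_counts = tally(inter_test, sim_test)
```

On the training dataset the interaction-based method is higher at 18 of the 20 orders (equal at order 19, lower at order 20); on the test dataset it is higher at 12, equal at 2, and lower at 6, with the lower cases all at order 10 or beyond.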

3.4. Discussion. Predicting the side effects of drugs is a multitarget learning problem, just like dealing with a protein system with multiple subcellular location sites [37]. For each of the drugs investigated, we need to consider how many different side effects it may have and with what probabilities these side effects may occur. To deal with a complicated statistical system like this, we adopted the strategy of multiple prediction orders, ranging from the most likely side effect to the least likely one; that is, the method tells users which side effect is most likely, which one is the second most likely, and so forth. Compared to most of the previous studies on the prediction of drug side effects, our method can therefore provide more information. The multiple prediction orders method can also be utilized to deal with other multitarget learning problems, such as subcellular location prediction [37] and the prediction of protein functions [20].

In addition to the multitarget issue, we also faced the problem of coverage scope. Of the 835 drug compounds in the benchmark dataset, some have chemical-chemical interaction information, while for the rest such information is missing. To establish a predictor that can be used to predict the side effects of drugs under both circumstances, the approach of direct chemical-chemical interaction and the approach of indirect chemical-chemical interaction were introduced. For the drug compounds belonging to the 1st circumstance, the predictions were conducted based on the direct chemical-chemical interactions (cf. (10)); for the remaining drug compounds, belonging to the 2nd circumstance, the predictions were conducted based on the hybrid interactions (cf. (13)). Thus, the side effects of all the 835 drugs could be predicted.
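The two-circumstance strategy amounts to a simple dispatch, sketched below under stated assumptions. The scoring formulas (10) and (13) are not reproduced here: `direct_scorer` and `hybrid_scorer` are hypothetical callables standing in for them, and the interaction map and scores are invented toy data.

```python
from typing import Callable, Dict, Set

Scorer = Callable[[str], Dict[str, float]]  # drug -> {side effect: score}


def predict_scores(
    drug: str,
    training_neighbors: Dict[str, Set[str]],
    direct_scorer: Scorer,
    hybrid_scorer: Scorer,
) -> Dict[str, float]:
    """Choose the scoring route according to the available interaction data."""
    if training_neighbors.get(drug):
        # 1st circumstance: the drug has interactive compounds in the training set.
        return direct_scorer(drug)
    # 2nd circumstance: no direct interactions, so use the hybrid route.
    return hybrid_scorer(drug)


# Hypothetical interaction map and placeholder scorers.
neighbors = {"d1": {"t1", "t2"}, "d2": set()}


def direct(drug: str) -> Dict[str, float]:
    return {"nausea": 0.9}


def hybrid(drug: str) -> Dict[str, float]:
    return {"nausea": 0.4}


with_direct = predict_scores("d1", neighbors, direct, hybrid)
without_direct = predict_scores("d2", neighbors, direct, hybrid)
unknown_drug = predict_scores("d3", neighbors, direct, hybrid)
```

A drug that is absent from the interaction map is treated like one with no interactive compounds, which is a design choice of this sketch rather than something stated in the text.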

Finally, the good performance of the interaction-based method on the training and test datasets suggests that predictions based on the indirect interactions were also quite good, indicating that the entire interaction network, involving all the drug compounds and their direct or indirect interactions as well as their interactions with human proteins, determines the side effects of drug compounds.

4. Conclusions

In this study, we proposed a novel prediction method to identify drug side effects. For any query drug $d$, its side effects were determined by the following strategy: (1) if there exist compounds in the training set that interact with $d$, only chemical-chemical interactions were used to identify its side effects; (2) otherwise, both chemical-chemical interactions and protein-chemical interactions were employed to make the prediction. The good performance of the method on the training and test datasets indicates that it is quite effective in identifying drug side effects. We hope that the method will assist in predicting drug side effects during drug development and in screening out drug candidates with undesired side effects.

Authors’ Contribution

Lei Chen and Tao Huang contributed equally to this work.

Acknowledgments

This contribution is supported by the National Basic Research Program of China (2011CB510101, 2011CB510102), the National Natural Science Foundation of China (61202021, 61105097, 31371335), the Innovation Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), the grant of “The First-class Discipline of Universities in Shanghai,” the Shanghai Educational Development Foundation (12CG55), and the Science and Technology Program of Shanghai Maritime University (no. 20120105).

References

[1] T. Huang, W. Cui, L. Hu, K. Feng, Y. Li, and Y. Cai, “Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles,” PLoS ONE, vol. 4, no. 12, Article ID e8126, 2009.

[2] S. Saul, Fen-Phen Case Lawyers Say They’ll Reject Wyeth Offer, New York Times, New York, NY, USA, 2005.

[3] P. E. Blower, C. Yang, M. A. Fligner et al., “Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data,” Pharmacogenomics Journal, vol. 2, no. 4, pp. 259–271, 2002.

[4] K. J. Bussey, K. Chin, S. Lababidi et al., “Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel,” Molecular Cancer Therapeutics, vol. 5, no. 4, pp. 853–867, 2006.

[5] R. H. Shoemaker, “The NCI60 human tumour cell line anticancer drug screen,” Nature Reviews Cancer, vol. 6, no. 10, pp. 813–823, 2006.

[6] M. Fukuzaki, M. Seki, H. Kashima, and J. Sese, “Side effect prediction using cooperative pathways,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM ’09), pp. 142–147, November 2009.

[7] L. Xie, J. Li, L. Xie, and P. E. Bourne, “Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000387, 2009.

[8] J. Scheiber, J. L. Jenkins, S. C. K. Sukuru et al., “Mapping adverse drug reactions in chemical space,” Journal of Medicinal Chemistry, vol. 52, no. 9, pp. 3103–3107, 2009.

[9] E. Pauwels, V. Stoven, and Y. Yamanishi, “Predicting drug side-effect profiles: a chemical fragment-based approach,” BMC Bioinformatics, vol. 12, article 169, 2011.

[10] N. Atias and R. Sharan, “An algorithmic framework for predicting side effects of drugs,” Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.

[11] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.

[12] M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, “STITCH: interaction networks of chemicals and proteins,” Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.

[13] L. Hu, C. Chen, T. Huang, Y. Cai, and K. C. Chou, “Predicting biological functions of compounds based on chemical-chemical interactions,” PLoS ONE, vol. 6, no. 12, Article ID e29491, 2011.

[14] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa, “Prediction of drug-target interaction networks from the integration of chemical and genomic spaces,” Bioinformatics, vol. 24, no. 13, pp. i232–i240, 2008.

[15] L. Chen, Z. He, T. Huang, and Y. Cai, “Using compound similarity and functional domain composition for prediction of drug-target interaction networks,” Medicinal Chemistry, vol. 6, no. 6, pp. 388–395, 2010.

[16] L. Chen, W. Zeng, Y. Cai, K. Feng, and K. C. Chou, “Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities,” PLoS ONE, vol. 7, no. 4, Article ID e35254, 2012.

[17] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular Systems Biology, vol. 3, article 88, 2007.

[18] P. Bogdanov and A. K. Singh, “Molecular function prediction using neighborhood features,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2010.

[19] Y. A. I. Kourmpetis, A. D. J. van Dijk, M. C. A. M. Bink, R. C. H. J. van Ham, and C. J. F. ter Braak, “Bayesian markov random field analysis for protein function prediction based on network data,” PLoS ONE, vol. 5, no. 2, Article ID e9293, 2010.

[20] L. Hu, T. Huang, X. Shi, W. Lu, Y. Cai, and K. C. Chou, “Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties,” PLoS ONE, vol. 6, no. 1, Article ID e14556, 2011.

[21] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, “A side effect resource to capture phenotypic effects of drugs,” Molecular Systems Biology, vol. 6, article 343, 2010.

[22] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Computer Sciences, vol. 28, pp. 31–36, 1988.

[23] K. Ng, J. Ciou, and C. Huang, “Prediction of protein functions based on function-function correlation relations,” Computers in Biology and Medicine, vol. 40, no. 3, pp. 300–305, 2010.

[24] M. Dunkel, S. Günther, J. Ahmed, B. Wittig, and R. Preissner, “SuperPred: drug classification and target prediction,” Nucleic Acids Research, vol. 36, pp. W55–W59, 2008.

[25] N. M. O’Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, “Open Babel: an open chemical toolbox,” Journal of Cheminformatics, vol. 3, article 33, 2011.

[26] K. C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review),” Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.

[27] C. Chen, Z. Shen, and X. Zou, “Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou’s pseudo amino acid composition,” Protein and Peptide Letters, vol. 19, no. 4, pp. 422–429, 2012.

[28] M. Hayat and A. Khan, “Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou’s PseAAC,” Protein and Peptide Letters, vol. 19, no. 4, pp. 411–421, 2012.

[29] L. Nanni, A. Lumini, D. Gupta, and A. Garg, “Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou’s pseudo amino acid composition and on evolutionary information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 467–475, 2012.

[30] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, “Prediction of GABAA receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine,” Journal of Theoretical Biology, vol. 281, no. 1, pp. 18–23, 2011.

[31] G. Fan and Q. Li, “Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition,” Journal of Theoretical Biology, vol. 304, pp. 88–95, 2012.

[32] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, “Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses,” Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.

[33] D. Zou, Z. He, J. He, and Y. Xia, “Supersecondary structure prediction using Chou’s pseudo amino acid composition,” Journal of Computational Chemistry, vol. 32, no. 2, pp. 271–278, 2011.

[34] D. N. Georgiou, T. E. Karakasidis, J. J. Nieto, and A. Torres, “Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition,” Journal of Theoretical Biology, vol. 257, no. 1, pp. 17–26, 2009.

[35] X. Zhao, X. Li, Z. Ma, and M. Yin, “Identify DNA-binding proteins with optimal Chou’s amino acid composition,” Protein and Peptide Letters, vol. 19, no. 4, pp. 398–405, 2012.

[36] L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, “Prediction of metabolic pathway using graph property, chemical functional group and chemical structural set,” Current Bioinformatics, vol. 8, pp. 200–207, 2013.

[37] P. Du, T. Li, and X. Wang, “Recent progress in predicting protein sub-subcellular locations,” Expert Review of Proteomics, vol. 8, no. 3, pp. 391–404, 2011.

Page 5: Research Article Predicting Drugs Side Effects Based on

BioMed Research International 5

used this system to obtain the representations of compoundswhich were used to calculate the similarity score of twocompounds and set up a new computational method toidentify drugs side effect The similarity score betweentwo compounds with their SMILES representations can beobtained from Open Babel [25] that is an open chemicaltoolbox For two-drug compounds 119889

1and 119889

2 their similarity

score obtained from Open Babel was denoted by 119876119904(1198891 1198892)

Based on the fact that the compounds with similar structuralproperties always share the same biological activities thelikelihood that the query drug 119889 has the side effect 119862

119895can

be calculated by

prod119904

(119889 997888rarr 119862119895) = max119889119894isinS1015840119876119904(119889 119889119894) sdot 119888119894119895 (14)

meaning that the likelihood that the query drug 119889 has theside effect 119862

119895is formulated as the maximum similarity scores

between 119889 and those drugs with side effect 119862119895in the training

dataset S1015840 Obviously the greater the score prod119904(119889 rarr 119862119895)

the more likely the drug compound 119889 has the side effect119862119895 Following the similar procedure of the method based on

chemical-chemical interactions we can also obtain the orderinformation of the query drug 119889 in terms of prod119904(119889 rarr 119862

119895)

(119895 = 1 2 100)

25 Jackknife Test In statistical prediction Jackknife test [16]is often used to examine a predictor for its effectiveness inpractical application In the jackknife test all the samplesin the dataset will be singled out one-by-one and tested bythe predictor trained by the remaining samples During theprocess of jackknifing both the training dataset and testingdataset are actually open and each sample will be in turnmoved between the two The jackknife test can exclude theldquomemoryrdquo effect and the arbitrariness problem can alsobe avoided Thus the outcome obtained by the jackknifetest is always unique for a given benchmark dataset [26]Accordingly the jackknife test has been widely recognizedand increasingly adopted to investigate the performance ofvarious predictors [27ndash36] Thus the jackknife test was alsoadopted here to evaluate the anticipated accuracy of thecurrent predicted methods

26 Accuracy Measurement For a query drug we mayidentify a series of side effects with the current predictionmethod For the 119895th order prediction its prediction accuracyAC(119895) can be calculated by

AC (119895) =119875 (119895)

119873119895 = 1 2 100 (15)

where 119875(119895) denotes the number of drugs whose 119895th orderprediction is one of the true side effects and 119873 denotesthe total number of structure-different drugs in the datasetAccording to the prediction method with 100 orders ofprediction results high AC(119895) with small 119895 and low AC(119895)with large 119895 would indicate a good prediction [13 16 20]Generally speaking it also implies a good performance by thepredictor if its 1st order prediction has a high success rate

0010203040506070809

1

1 11 21 31 41 51 61 71 81 91

Pred

ictio

n ac

cura

cy

Prediction order

Interaction-basedSimilarity-based

Figure 2 A plot of the prediction accuracy of two methods on thetraining dataset versus the order of prediction

3 Results and Discussion

Of the 835 drugs in the benchmark dataset S 83 samples wererandomly selected to compose test dataset Ste while the rest752 samples composed the training dataset Str The predictedresults of the interaction-based method and similarity-basedmethod on the training and test datasets are as follows

31 Performance of the Methods on the Training DatasetFor 752 drug compounds in the training dataset Str theinteraction-based method and similarity-based method wereconducted to make prediction with their performance evalu-ated by jackknife test Listed in columns 2 and 4 of Table 1are the first 20 prediction accuracies obtained by thesetwo methods from which we can see that the 1st orderprediction accuracies of the interaction-based and similarity-based method were 8630 and 8364 respectively whilethe 2nd ones were 8045 and 7912 respectively The total100 prediction accuracies obtained by these two methodsare given in Supplementary Material III and two curveswith prediction accuracies as their 119884-axis and predictionorder as their 119883-axis are shown in Figure 2 It is observedthat the prediction accuracies obtained by the interaction-based method descend generally with the increase of theorder number and the same situation also occurred forthe prediction accuracies obtained by the similarity-basedmethod All of these imply that the two methods sorted theside effects of drug compounds in the training dataset quitewell and they are all quite effective in identifying drugs sideeffects

32 Performance of the Methods on the Test Dataset For the83 drug compounds in the test dataset Ste the side effects ofthese samples were predicted by the interaction-based andsimilarity-basedmethodbased on the drug compounds in thetraining dataset Str After processing by (15) 100 predictionaccuracies obtained by each method were obtained and werealso given in SupplementaryMaterial III Listed in columns 3

6 BioMed Research International

Table 1 The first 20 prediction accuracies of the interaction-based and similarity-based methods in identifying the side effects of drugs inthe training and test datasets

Prediction order Interaction-based Similarity-based DifferenceTraining dataset Test dataset Training dataset Test dataset Training dataseta Test datasetb

1 8630 8916 8364 8795 266 1202 8045 8313 7912 8313 133 0003 7713 8434 7500 7952 213 4824 7261 8193 7141 7590 120 6025 7340 7711 6822 7470 519 2416 6875 7590 6689 7108 186 4827 6769 6747 6476 5783 293 9648 6423 6506 5997 6506 426 0009 6370 6867 5878 5783 492 108410 5771 5783 5731 6024 040 minus24111 5918 6024 5638 6747 279 minus72312 5918 6988 5665 5181 253 180713 5731 6145 5479 5301 253 84314 5585 5904 5386 6265 199 minus36115 5412 5422 5066 5783 346 minus36116 5386 5904 5266 5542 120 36117 5133 3976 5000 6024 133 minus204818 5452 6265 5173 5301 279 96419 5000 5663 5000 3855 000 180720 4721 4458 4774 5181 minus053 minus723aPercentages in this column were calculated by percentages in column 2 minus percentages in column 4bPercentages in this column were calculated by percentages in column 3 minus percentages in column 5

0010203040506070809

1

1 11 21 31 41 51 61 71 81 91

Pred

ictio

n ac

cura

cy

Prediction order

Interaction-basedSimilarity-based

Figure 3 A plot of the prediction accuracy of two methods on thetest dataset versus the order of prediction

and 5 of Table 1 are the first 20 prediction accuracies obtainedby these two methods from which we can see that the1st order prediction accuracies obtained by the interaction-based and similarity-based method were 8916 and 8795respectively We also plotted two curves with predictionaccuracies as their 119884-axis and prediction order as their 119883-axis which are shown in Figure 3 It is observed fromFigure 3

that the accuracies also exhibit a trend of decrease withthe increase of the order number However two curves inFigure 3 fluctuate more drastically and frequently than thoseof Figure 2 which may be caused by the low number of thesamples in the test dataset In any case the interaction-basedand similarity-based methods still sorted the side effects ofsamples in the test dataset reasonably well implying againthat these twomethods are quite effective in identifying drugsside effects

33 Comparison of the Interaction-Based and Similarity-BasedMethod For 752 samples in the training dataset Str and 83samples in the test dataset Ste the interaction-based andsimilarity-based methods were all used to identify their sideeffects Listed in columns 6 and 7 of Table 1 are the differencesof the first 20 prediction accuracies obtained by these twomethods from which we can see that the 1st order predictionaccuracies obtained by the interaction-based method onthe training and test datasets were 266 and 120 higherthan those of similarity-based method Furthermore mostprediction accuracies in Table 1 obtained by the interaction-based method are higher than the corresponding accuraciesobtained by the similarity-based method indicating thatinteraction-based method is more effective in identifyingdrugs side effects It is also confirmed from Figures 2 and3 that the curve obtained by the interaction-based methodis always above the curve obtained by the similarity-basedmethod when the prediction order is low However with

BioMed Research International 7

the increase of order number the curve obtained by thesimilarity-basedmethod keeps upwith and exceeds the curveobtained by the interaction-based method which may becaused by the following two reasons (I) the high predictionaccuracies obtained by the interaction-based method withlow order number cause the low number of correctly pre-dicted samples with high prediction order (II) the systemof using chemical similarity between two chemicals is morecomplete than that in STITCH at present which leads tothe fact that the similarity-based method can always identifymore side effects than the interaction-based method It isexpected that the interaction-based method can be improvedas more and more chemical-chemical and protein-chemicalinteractions become available in STITCH

34 Discussion It is amultitarget learning problem to predictthe side effects of drugs just like the case in dealing with aprotein system with multiple subcellular location sites [37]For each of the drugs investigated we need to considerhow many different side effects it may have and what arethe probabilities these side effects may occur To deal withthis complicated statistical systems like that we adopted thestrategy of the multiple prediction orders ranging from themost likely side effect prediction order to the least one thatis giving the information to users which side effect is mostlikely which one is the second likely one and so forthCompared tomost of the previous studies on the prediction ofdrugs side effects ourmethod can providemore informationThe multiple prediction orders method can also be utilizedto deal with other multi-target learning problems such assubcellular location prediction [37] and functions of proteins[20]

In addition to the multi-target issue we also faced theproblem of coverage scope Of the 835 drug compounds inthe benchmark dataset some of themhave the information ofchemical-chemical interaction while for the rest such infor-mation is missing To establish a predictor that can be used topredict the side effects of drugs under both the circumstancesthe approach of the direct chemical-chemical interaction andthe approach of the indirect chemical-chemical interactionwere introduced For the drug compounds belonging to the1st circumstance the predictions were conducted based onthe direct chemical-chemical interactions (cf (10)) for therest drug compounds belonging to the 2nd circumstance thepredictions were conducted based on the hybrid interactions(cf (13)) Thus the side effects of all the 835 drugs could bepredicted

Finally the good performance of the interaction-basedmethod on the training and test datasets suggests that predic-tions based on the indirect interactions was also quite goodindicating that the entire interaction networkmdashinvolving allthe drug compounds and their direct or indirect interac-tions as well as their interactions with human proteinsmdashdetermines the side effects of drug compounds

4 Conclusions

In this study we proposed a novel prediction method toidentify drugs side effects For any query drug 119889 its side

effects were determined by the following strategy (1) if thereexist interactive compounds of 119889 in the training set onlychemical-chemical interactions were used to identify its sideeffects (2) otherwise both chemical-chemical interactionsand protein-chemical interactions were employed to makeprediction Good performance of the method on the trainingand test datasets indicates that our method is quite effectivein identifying drugs side effects We hope that the methodwould assist in the prediction of drugs side effects duringdrug development and screening out drug candidates withundesired side effects

Authors' Contribution

Lei Chen and Tao Huang contributed equally to this work.

Acknowledgments

This contribution is supported by National Basic Research Program of China (2011CB510101, 2011CB510102), National Natural Science Foundation of China (61202021, 61105097, 31371335), Innovation Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), the grant of "The First-class Discipline of Universities in Shanghai," Shanghai Educational Development Foundation (12CG55), and Science and Technology Program of Shanghai Maritime University (no. 20120105).

References

[1] T. Huang, W. Cui, L. Hu, K. Feng, Y. Li, and Y. Cai, "Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles," PLoS ONE, vol. 4, no. 12, Article ID e8126, 2009.

[2] S. Saul, Fen-Phen Case Lawyers Say They'll Reject Wyeth Offer, New York Times, New York, NY, USA, 2005.

[3] P. E. Blower, C. Yang, M. A. Fligner et al., "Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data," Pharmacogenomics Journal, vol. 2, no. 4, pp. 259–271, 2002.

[4] K. J. Bussey, K. Chin, S. Lababidi et al., "Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel," Molecular Cancer Therapeutics, vol. 5, no. 4, pp. 853–867, 2006.

[5] R. H. Shoemaker, "The NCI60 human tumour cell line anticancer drug screen," Nature Reviews Cancer, vol. 6, no. 10, pp. 813–823, 2006.

[6] M. Fukuzaki, M. Seki, H. Kashima, and J. Sese, "Side effect prediction using cooperative pathways," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '09), pp. 142–147, November 2009.

[7] L. Xie, J. Li, L. Xie, and P. E. Bourne, "Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors," PLoS Computational Biology, vol. 5, no. 5, Article ID e1000387, 2009.

[8] J. Scheiber, J. L. Jenkins, S. C. K. Sukuru et al., "Mapping adverse drug reactions in chemical space," Journal of Medicinal Chemistry, vol. 52, no. 9, pp. 3103–3107, 2009.


[9] E. Pauwels, V. Stoven, and Y. Yamanishi, "Predicting drug side-effect profiles: a chemical fragment-based approach," BMC Bioinformatics, vol. 12, article 169, 2011.

[10] N. Atias and R. Sharan, "An algorithmic framework for predicting side effects of drugs," Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.

[11] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.

[12] M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, "STITCH: interaction networks of chemicals and proteins," Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.

[13] L. Hu, C. Chen, T. Huang, Y. Cai, and K. C. Chou, "Predicting biological functions of compounds based on chemical-chemical interactions," PLoS ONE, vol. 6, no. 12, Article ID e29491, 2011.

[14] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa, "Prediction of drug-target interaction networks from the integration of chemical and genomic spaces," Bioinformatics, vol. 24, no. 13, pp. i232–i240, 2008.

[15] L. Chen, Z. He, T. Huang, and Y. Cai, "Using compound similarity and functional domain composition for prediction of drug-target interaction networks," Medicinal Chemistry, vol. 6, no. 6, pp. 388–395, 2010.

[16] L. Chen, W. Zeng, Y. Cai, K. Feng, and K. C. Chou, "Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities," PLoS ONE, vol. 7, no. 4, Article ID e35254, 2012.

[17] R. Sharan, I. Ulitsky, and R. Shamir, "Network-based prediction of protein function," Molecular Systems Biology, vol. 3, article 88, 2007.

[18] P. Bogdanov and A. K. Singh, "Molecular function prediction using neighborhood features," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2010.

[19] Y. A. I. Kourmpetis, A. D. J. van Dijk, M. C. A. M. Bink, R. C. H. J. van Ham, and C. J. F. ter Braak, "Bayesian markov random field analysis for protein function prediction based on network data," PLoS ONE, vol. 5, no. 2, Article ID e9293, 2010.

[20] L. Hu, T. Huang, X. Shi, W. Lu, Y. Cai, and K. C. Chou, "Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties," PLoS ONE, vol. 6, no. 1, Article ID e14556, 2011.

[21] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, "A side effect resource to capture phenotypic effects of drugs," Molecular Systems Biology, vol. 6, article 343, 2010.

[22] D. Weininger, "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules," Journal of Chemical Information and Computer Sciences, vol. 28, pp. 31–36, 1988.

[23] K. Ng, J. Ciou, and C. Huang, "Prediction of protein functions based on function-function correlation relations," Computers in Biology and Medicine, vol. 40, no. 3, pp. 300–305, 2010.

[24] M. Dunkel, S. Gunther, J. Ahmed, B. Wittig, and R. Preissner, "SuperPred: drug classification and target prediction," Nucleic Acids Research, vol. 36, pp. W55–W59, 2008.

[25] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, "Open Babel: an open chemical toolbox," Journal of Cheminformatics, vol. 3, article 33, 2011.

[26] K. C. Chou, "Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review)," Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.

[27] C. Chen, Z. Shen, and X. Zou, "Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou's pseudo amino acid composition," Protein and Peptide Letters, vol. 19, no. 4, pp. 422–429, 2012.

[28] M. Hayat and A. Khan, "Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC," Protein and Peptide Letters, vol. 19, no. 4, pp. 411–421, 2012.

[29] L. Nanni, A. Lumini, D. Gupta, and A. Garg, "Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 467–475, 2012.

[30] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, "Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine," Journal of Theoretical Biology, vol. 281, no. 1, pp. 18–23, 2011.

[31] G. Fan and Q. Li, "Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition," Journal of Theoretical Biology, vol. 304, pp. 88–95, 2012.

[32] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, "Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses," Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.

[33] D. Zou, Z. He, J. He, and Y. Xia, "Supersecondary structure prediction using Chou's pseudo amino acid composition," Journal of Computational Chemistry, vol. 32, no. 2, pp. 271–278, 2011.

[34] D. N. Georgiou, T. E. Karakasidis, J. J. Nieto, and A. Torres, "Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition," Journal of Theoretical Biology, vol. 257, no. 1, pp. 17–26, 2009.

[35] X. Zhao, X. Li, Z. Ma, and M. Yin, "Identify DNA-binding proteins with optimal Chou's amino acid composition," Protein and Peptide Letters, vol. 19, no. 4, pp. 398–405, 2012.

[36] L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, "Prediction of metabolic pathway using graph property, chemical functional group and chemical structural set," Current Bioinformatics, vol. 8, pp. 200–207, 2013.

[37] P. Du, T. Li, and X. Wang, "Recent progress in predicting protein sub-subcellular locations," Expert Review of Proteomics, vol. 8, no. 3, pp. 391–404, 2011.



Table 1: The first 20 prediction accuracies (%) of the interaction-based and similarity-based methods in identifying the side effects of drugs in the training and test datasets.

Prediction  Interaction-based     Similarity-based      Difference
order       Training   Test       Training   Test       Training (a)  Test (b)
1           86.30      89.16      83.64      87.95      2.66          1.20
2           80.45      83.13      79.12      83.13      1.33          0.00
3           77.13      84.34      75.00      79.52      2.13          4.82
4           72.61      81.93      71.41      75.90      1.20          6.02
5           73.40      77.11      68.22      74.70      5.19          2.41
6           68.75      75.90      66.89      71.08      1.86          4.82
7           67.69      67.47      64.76      57.83      2.93          9.64
8           64.23      65.06      59.97      65.06      4.26          0.00
9           63.70      68.67      58.78      57.83      4.92          10.84
10          57.71      57.83      57.31      60.24      0.40          -2.41
11          59.18      60.24      56.38      67.47      2.79          -7.23
12          59.18      69.88      56.65      51.81      2.53          18.07
13          57.31      61.45      54.79      53.01      2.53          8.43
14          55.85      59.04      53.86      62.65      1.99          -3.61
15          54.12      54.22      50.66      57.83      3.46          -3.61
16          53.86      59.04      52.66      55.42      1.20          3.61
17          51.33      39.76      50.00      60.24      1.33          -20.48
18          54.52      62.65      51.73      53.01      2.79          9.64
19          50.00      56.63      50.00      38.55      0.00          18.07
20          47.21      44.58      47.74      51.81      -0.53         -7.23

(a) Percentages in this column were calculated as the percentages in column 2 minus those in column 4.
(b) Percentages in this column were calculated as the percentages in column 3 minus those in column 5.
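
As a worked check of how the two difference columns are defined, the following Python sketch recomputes them for the first three prediction orders of Table 1 (accuracies in percent). The variable and function names are hypothetical. Because the recomputation starts from accuracies already rounded to two decimals, it can deviate from the reported differences by about 0.01, which is an expected rounding effect and not an error in the table.

```python
# (order, interaction_train, interaction_test, similarity_train,
#  similarity_test, reported_diff_train, reported_diff_test), in percent.
rows = [
    (1, 86.30, 89.16, 83.64, 87.95, 2.66, 1.20),
    (2, 80.45, 83.13, 79.12, 83.13, 1.33, 0.00),
    (3, 77.13, 84.34, 75.00, 79.52, 2.13, 4.82),
]


def recompute_differences(row):
    """Column 2 minus column 4, and column 3 minus column 5."""
    _, inter_train, inter_test, sim_train, sim_test, _, _ = row
    return round(inter_train - sim_train, 2), round(inter_test - sim_test, 2)


max_deviation = 0.0
for row in rows:
    diff_train, diff_test = recompute_differences(row)
    max_deviation = max(
        max_deviation, abs(diff_train - row[5]), abs(diff_test - row[6])
    )
    print(f"order {row[0]}: training {diff_train:+.2f}, test {diff_test:+.2f}")

# Deviations from the reported columns stay at the level of rounding.
print(f"largest deviation from the reported differences: {max_deviation:.2f}")
```

A positive difference means the interaction-based method was more accurate at that prediction order; a negative one favours the similarity-based method.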

Figure 3: A plot of the prediction accuracy of the two methods on the test dataset versus the order of prediction. [Plot not reproduced. Y-axis: prediction accuracy (0 to 1); X-axis: prediction order (1 to 91); curves: interaction-based and similarity-based.]

Listed in columns 3 and 5 of Table 1 are the first 20 prediction accuracies obtained by these two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based and similarity-based methods were 89.16% and 87.95%, respectively. We also plotted two curves, with prediction accuracy on the Y-axis and prediction order on the X-axis, which are shown in Figure 3. It can be seen from Figure 3 that the accuracies again tend to decrease as the order number increases. However, the two curves in Figure 3 fluctuate more drastically and frequently than those in Figure 2, which may be caused by the small number of samples in the test dataset. In any case, the interaction-based and similarity-based methods still sorted the side effects of the samples in the test dataset reasonably well, implying again that both methods are quite effective in identifying drug side effects.
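
The accuracy-versus-order curves can be illustrated with a short Python sketch. It assumes that the j-th order prediction accuracy is the fraction of samples whose j-th ranked prediction is one of their true side effects; the formal definition is not restated in this part of the paper, so that form should be read as an assumption. The function name and the toy data are hypothetical.

```python
from typing import List, Set


def order_accuracy(
    ranked_predictions: List[List[str]], true_labels: List[Set[str]], order: int
) -> float:
    """Fraction of samples whose prediction at the 1-based `order` is correct.

    ranked_predictions[i] lists the predicted side effects of sample i from
    most to least likely; true_labels[i] is the set of its known side effects.
    """
    hits = 0
    for ranking, truth in zip(ranked_predictions, true_labels):
        if order <= len(ranking) and ranking[order - 1] in truth:
            hits += 1
    return hits / len(ranked_predictions)


# Toy data for three drugs (illustrative only).
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "a", "b"]]
truth = [{"a", "c"}, {"a"}, {"c", "b"}]

# One accuracy per prediction order; plotting this list against the order
# number would give a curve of the same kind as those in Figures 2 and 3.
curve = [order_accuracy(ranked, truth, j) for j in (1, 2, 3)]
print(curve)
```

With only a few samples, a single hit or miss changes the accuracy at a given order substantially, which is consistent with the stronger fluctuations reported for the smaller test dataset.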

3.3. Comparison of the Interaction-Based and Similarity-Based Methods. For the 752 samples in the training dataset S_tr and the 83 samples in the test dataset S_te, both the interaction-based and the similarity-based methods were used to identify side effects. Listed in columns 6 and 7 of Table 1 are the differences between the first 20 prediction accuracies obtained by the two methods, from which we can see that the 1st order prediction accuracies obtained by the interaction-based method on the training and test datasets were 2.66 and 1.20 percentage points higher, respectively, than those of the similarity-based method. Furthermore, most prediction accuracies in Table 1 obtained by the interaction-based method are higher than the corresponding accuracies obtained by the similarity-based method, indicating that the interaction-based method is more effective in identifying drug side effects. Figures 2 and 3 also confirm that the curve obtained by the interaction-based method lies above or level with the curve obtained by the similarity-based method when the prediction order is low. However, as the order number increases, the curve obtained by the similarity-based method catches up with and exceeds the curve obtained by the interaction-based method, which may be caused by the following two reasons: (I) because the interaction-based method achieves high prediction accuracies at low order numbers, few correctly predicted samples remain at high prediction orders; (II) the system of chemical similarities between chemicals is at present more complete than the interaction data in STITCH, so the similarity-based method can identify more side effects overall than the interaction-based method. It is expected that the interaction-based method can be improved as more chemical-chemical and protein-chemical interactions become available in STITCH.



Page 7: Research Article Predicting Drugs Side Effects Based on

BioMed Research International 7

the increase of order number the curve obtained by thesimilarity-basedmethod keeps upwith and exceeds the curveobtained by the interaction-based method which may becaused by the following two reasons (I) the high predictionaccuracies obtained by the interaction-based method withlow order number cause the low number of correctly pre-dicted samples with high prediction order (II) the systemof using chemical similarity between two chemicals is morecomplete than that in STITCH at present which leads tothe fact that the similarity-based method can always identifymore side effects than the interaction-based method It isexpected that the interaction-based method can be improvedas more and more chemical-chemical and protein-chemicalinteractions become available in STITCH

34 Discussion It is amultitarget learning problem to predictthe side effects of drugs just like the case in dealing with aprotein system with multiple subcellular location sites [37]For each of the drugs investigated we need to considerhow many different side effects it may have and what arethe probabilities these side effects may occur To deal withthis complicated statistical systems like that we adopted thestrategy of the multiple prediction orders ranging from themost likely side effect prediction order to the least one thatis giving the information to users which side effect is mostlikely which one is the second likely one and so forthCompared tomost of the previous studies on the prediction ofdrugs side effects ourmethod can providemore informationThe multiple prediction orders method can also be utilizedto deal with other multi-target learning problems such assubcellular location prediction [37] and functions of proteins[20]

In addition to the multi-target issue we also faced theproblem of coverage scope Of the 835 drug compounds inthe benchmark dataset some of themhave the information ofchemical-chemical interaction while for the rest such infor-mation is missing To establish a predictor that can be used topredict the side effects of drugs under both the circumstancesthe approach of the direct chemical-chemical interaction andthe approach of the indirect chemical-chemical interactionwere introduced For the drug compounds belonging to the1st circumstance the predictions were conducted based onthe direct chemical-chemical interactions (cf (10)) for therest drug compounds belonging to the 2nd circumstance thepredictions were conducted based on the hybrid interactions(cf (13)) Thus the side effects of all the 835 drugs could bepredicted

Finally the good performance of the interaction-basedmethod on the training and test datasets suggests that predic-tions based on the indirect interactions was also quite goodindicating that the entire interaction networkmdashinvolving allthe drug compounds and their direct or indirect interac-tions as well as their interactions with human proteinsmdashdetermines the side effects of drug compounds

4 Conclusions

In this study we proposed a novel prediction method toidentify drugs side effects For any query drug 119889 its side

effects were determined by the following strategy (1) if thereexist interactive compounds of 119889 in the training set onlychemical-chemical interactions were used to identify its sideeffects (2) otherwise both chemical-chemical interactionsand protein-chemical interactions were employed to makeprediction Good performance of the method on the trainingand test datasets indicates that our method is quite effectivein identifying drugs side effects We hope that the methodwould assist in the prediction of drugs side effects duringdrug development and screening out drug candidates withundesired side effects

Authorsrsquo Contribution

Lei Chen and Tao Huang contributed equally to this work

Acknowledgments

This contribution is supported by National Basic ResearchProgram of China (2011CB510101 2011CB510102) NationalNatural Science Foundation of China (61202021 6110509731371335) Innovation Program of Shanghai Municipal Edu-cation Commission (12YZ120 12ZZ087) the grant of ldquoTheFirst-class Discipline of Universities in Shanghairdquo ShanghaiEducationalDevelopment Foundation (12CG55) and Scienceand Technology Program of Shanghai Maritime University(no 20120105)

