

Research Article

Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Complexity (Hindawi), Volume 2021, Article ID 6675279, 15 pages. https://doi.org/10.1155/2021/6675279

Sahar K. Hussin,1 Salah M. Abdelmageid,2 Adel Alkhalil,3 Yasser M. Omar,4 Mahmoud I. Marie,5 and Rabie A. Ramadan3,6

1Communication and Computers Engineering Department, Alshrouck Academy, Cairo, Egypt
2Computer Engineering Department, College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
3College of Computer Science and Engineering, University of Hai'l, Hai'l, Saudi Arabia
4Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt
5Computer and System Engineering Department, Al-Azhar University, Cairo, Egypt
6Computer Engineering Department, Cairo University, Cairo, Egypt

Correspondence should be addressed to Rabie A. Ramadan; rabie@rabieramadan.org

Received 28 November 2020; Revised 19 December 2020; Accepted 2 January 2021; Published 28 January 2021

Academic Editor: Abd E.I.-Baset Hassanien

Copyright © 2021 Sahar K. Hussin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous data and suffers from drawbacks such as high dimensions and imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for a minority dataset. For a dataset identified without considering the data's imbalanced nature, most classification methods tend to have high predictive accuracy for the majority category; however, the accuracy is significantly poor for the minority category. The paper proposes a K-mean algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified at high accuracy and detected at high precision. A large set of experiments was implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared to both the no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.

1. Introduction

The discovery of new medication to cure human illnesses is progressively hard, expensive, and tedious [1]. A wide variety of atoms and molecules must be chosen and prepared to generate a set of predetermined drugs. The drug discovery process can take between 12 and 15 years, with a possibility of failure, and expenses are worth more than one billion dollars [2]. Virtual screening is the most critical process in drug discovery. It is employed to search for small chemical compounds (molecules) in libraries to identify structures that have an affinity to bind to a drug target or protein receptor [3]. Up to 10^10 libraries of virtual screening exist, and since this record keeps increasing, traditional classification methods have become insufficient to manage such large amounts of datasets [4]. One of the well-known repositories is PubChem [5] for small molecules and their biological properties. It offers several resources that are unfortunately constrained by the unbalanced nature of high-throughput screening (HTS) data. These data usually contain a few hundred active compounds, excluding many inactive compounds.

Dataset imbalances occur when one of the classes is described by a minimal number of samples, typically of major importance, compared to the other classes [6]. This problem may distort prediction accuracy in the used classification models, which leads to poor classification performance. In HTS experiments, thousands of compounds are usually screened; nevertheless, a small fraction of the tested compounds are classified as active, while other classes are recognized as inactive [7]. Such data imbalance affects the accuracy and precision of activity predictions in individual virtual screening datasets. In the binary classification of an imbalanced dataset, instances of one class are fewer compared to the other class. The minority class is known as the class with fewer cases, and the other is called the majority class [8].
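To make the minority/majority terminology concrete, the following toy sketch (not from the paper; the labels and helper are illustrative only) counts binary labels and reports the imbalance ratio:

```python
from collections import Counter

def class_balance(labels):
    """Return (majority class, minority class, imbalance ratio) for binary labels."""
    counts = Counter(labels)
    # Sort the two class labels by frequency, most frequent first.
    majority, minority = sorted(counts, key=counts.get, reverse=True)
    return majority, minority, counts[majority] / counts[minority]

# Toy HTS-style labels: 1 = active (rare), 0 = inactive (common).
labels = [0] * 990 + [1] * 10
print(class_balance(labels))  # (0, 1, 99.0)
```

A ratio of 99:1 like this one is mild compared with the datasets used later in the paper, where the ratio reaches 1:134.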

Current classification models such as k-nearest neighbor (KNN), random forest (RF), multilayer perceptron (MLP), support vector machine (SVM), decision tree (DT), logistic regression (LG), and gradient boosting (GBT) depend on a sufficient, representative, and reasonably balanced collection of training data to draw an approximate boundary for decision-making between different groups. These learning algorithms are utilized in a variety of fields, including financial forecasting and text classification [9]. Despite current advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remains a daunting task. In ML, many approaches have been created to deal with imbalanced data. However, very few algorithms have been able to handle problems related to negatives and false positives. Positive or negative states usually dominate the unbalanced dataset. Therefore, specificity and recall (sensitivity) are very vital when processing an imbalanced dataset [10]. An increase in sensitivity increases the true-positive expectations of the model and reduces false negatives. Likewise, an upgrade in specificity increases true-negative expectations and thus reduces wrong responses. Therefore, for a good model, it is critical that the gap between the sensitivity and specificity metrics be as small as possible [11].

Although some researchers highlighted the problem of negatives and false positives when using PubChem data, to the best of our knowledge, no technique has been reported to address the problem effectively. It was also reported that the imbalance problem obstructed the classification accuracy of bioactivity [12]. In another study [13], the authors compared the performance of seven different descriptive classifiers based on their ability to deal with unbalanced datasets. Another research effort stated that multilevel SVM-based algorithms outperformed certain algorithms such as (1) traditional SVM, (2) weighted SVM, (3) neural networks, (4) linear regression, (5) Naïve Bayes (NB), and (6) the C4.5 tree with imbalanced groups, missing values, and real health-related data [14].

Besides, the main concept of resampling is to reduce variance between class samples by preprocessing the training data. In other words, the resampling approach is applied to training samples in order to achieve the same number of samples for each class, adjusting the previous majority and minority sample distributions. There are two basic methods in the traditional resampling methodology, namely, undersampling and oversampling [15]. Undersampling produces a smaller number of majority samples, while all minority samples are retained. The predominant class samples are eliminated randomly until a satisfactory ratio has been accomplished for all groups. Undersampling is ideal for applications where there is an enormous number of majority samples, and the reduction of training samples would minimize the training time for the model. However, a drawback of undersampling is that discarding samples contributes to the loss of majority class information [16].

Another approach to addressing the imbalanced data is oversampling. By replicating samples, it raises the number of samples in the minority groups [17]. The benefit of oversampling is that, because all samples are used, no knowledge is lost. Oversampling, however, has its own drawback: it contributes to higher processing costs by generating extra training samples. Therefore, to compensate for that limitation, more efficient algorithms are needed.

Although resampling methods are usually used to solve class imbalance problems, there is little defined strategy to identify the acceptable class distribution for a particular dataset [18]. As a result, the optimal class distribution differs from one dataset to another. Recent variants of resampling approaches derived from oversampling and undersampling overcome some of the shortcomings of current technologies, including SMOTE (Synthetic Minority Oversampling Technique). SMOTE is one of the most important oversampling approaches; it generates interpolated instances that are added to the training samples without duplicating the samples of the minority class. The SMOTE approach examines the k-nearest neighbors of a minority class sample, which are utilized as a base for the new synthetic sample [19]. If the number of created instances is smaller than the size of the initial dataset, the approach randomly selects the original instances utilized to create the artificial ones. If it is larger than the size of the original dataset, the approach iterates over the dataset, creating an artificial instance per original instance, until it reaches the previous scenario [20]. SMOTE is considered an oversampling technique that produces synthetic minority class samples. It theoretically performs better than simple oversampling, and it is commonly used; for example, SMOTE was utilized to detect network intrusions [21] or sentence boundaries in speech, and to predict species distribution [22]. SMOTE will be utilized in this research.
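The interpolation step described above can be sketched in plain Python. This is a minimal illustration of the technique, not the implementation used in the paper, and the sample points are hypothetical:

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself).
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the line segment between x and its neighbour.
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.2)]
print(len(smote(minority, n_new=4)))  # 4
```

Because each synthetic point is a convex combination of two existing minority points, the new samples always fall inside the region already occupied by the minority class.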

Data mining techniques can help to reduce the set of promising candidate chemicals for interaction with specific molecular targets before they are experimentally evaluated [23]. In theory, this can help to speed up the drug development process. However, the improvement of accurate prediction models for HTS is difficult. For datasets such as those taken from HTS experiments, the achievement of high predictability accuracy may be misleading, since it may be accompanied by an unacceptable false-positive rate [24], as high accuracy does not always imply a small proportion of false predictions.

In the event of a large class imbalance, this paper attempts to identify the most effective variant of data preprocessing to enhance data imbalance accuracy, which favors the collection of interactions that increase the overall accuracy of a learning model. We propose SMOTE coupled with a k-mean method to classify several imbalanced


PubChem datasets, with the goals of (1) validating whether k-mean with SMOTE affects the output of established models and (2) exploring whether KSMOTE is appropriate and useful in finding interesting samples from broad datasets. Our model is also applied to different ML algorithms (random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting) for comparison purposes with three different datasets. The paper also introduces a procedure for data sampling to increase the sensitivity and specificity of predicting several molecules' behavior. The proposed approach is implemented on standalone clusters for Apache Spark 2.4.3 in order to address the imbalance in a big dataset.

The remainder of this paper is structured as follows. Section 2 provides a general description of class imbalance learning concepts and reviews the related research conducted on the subject matter. Section 3 explains how the proposed approach for VS in drug discovery was developed. Section 4 presents performance evaluations and experimental results. Section 5 presents the discussion of our proposal. Finally, Section 6 highlights the conclusions and topics of study for future research.

2. Related Work

For the paper to be self-contained, this section reviews the work most related to VS research techniques, problems, and state-of-the-art solutions. It also examines some of the big data frameworks that could help solve the problem of imbalanced datasets.

Since we live in the technological era, where older storage and processing technologies are not enough, computing technologies must be scaled to handle the massive amount of data generated by different devices. The biggest challenge in handling such volumes of data is the speed at which they grow, much faster than computer resources. One of the research areas that generate huge data to be analyzed is searching for and discovering medicines. The proposed methods aim to find a molecule capable of binding to and activating or inhibiting a molecular target. The discovery of new drugs for human diseases is exceptionally complicated, expensive, and time-consuming. Drug discovery uses various methods [25] based on a statistical approach to scan small molecule libraries and determine the structures most likely to bind to a compound. The drug target is typically a protein receptor that is involved in a metabolic cycle or signaling pathway through which a particular disease disorder is established.

There are two VS approaches: ligand-based VS (LBVS) and structure-based virtual screening (SBVS) [26]. LBVS depends on the existing information about the ligands. It utilizes the knowledge about a set of ligands known to be active for the given drug target. This type of VS uses the mining of big data analytics: binary classifiers can be trained on a small part of the ligands, and very large sets of ligands can then be easily classified into two classes, active and nonactive. SBVS, on the other side, is utilized to dock experimentally; however, 3D protein structural information is required [27], as shown in Figure 1.

K-mean clustering is one of the simplest unsupervised learning algorithms; it was first proposed by MacQueen in 1967 and has been applied by many researchers to solve known grouping problems [28]. This technique classifies a particular dataset into a certain number of groups. The algorithm randomly initializes the centers of the groups. Then, it calculates the distance between each object and the midpoint of each group. Next, each data instance is linked to the nearest center, and the cluster centers are recalculated. The distance between the center and each sample is calculated by the following equation:

d = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \lVert X_i - Y_j \rVert,   (1)

where d is the Euclidean distance between the data point X_i and cluster center Y_j, c_i is the total number of data points in cluster i, and c is the total number of cluster centers. All of the training samples are first grouped into K groups (the experiment runs with diverse K values to observe the result). Suitable training samples from the derived clusters are selected. The key idea is that there are different groups in a dataset, and each group appears to have distinct characteristics. When a cluster includes many majority class samples and few minority class samples, it functions as a majority class cluster. If, on the other side, a cluster has extra minority samples and fewer majority samples, it acts more like a minority class cluster. Therefore, by considering the ratio of majority class samples to minority class samples in the various clusters, this method oversamples the required number of minority class samples from each cluster.
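The clustering loop just described (random initialization, nearest-center assignment, center recomputation) can be sketched as follows. This is a minimal illustration of plain k-means, not the implementation used in the paper; the sample points are hypothetical:

```python
import math
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means: random initial centers, then alternating assignment
    and center recomputation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the nearest center (Euclidean distance).
            j = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # recompute the center as the mean of its members
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters

# Two well-separated blobs -> k-means recovers two clusters of three points each.
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centers, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```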

Several approaches have been proposed in the literature to handle big data classification, including classification algorithms such as random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting.

Classification algorithms (CA) depend mainly on machine learning (ML) algorithms, where they play a vital role in VS for drug discovery; they can be considered an LBVS approach. Researchers have widely used the ML approach to create a binary classification model, a form of filter, to classify ligands as active or inactive in relation to a particular protein target. These strategies need fewer computational resources and, because of their ability to generalize, they find more complex hits than earlier approaches. Based on our experience, we believe that many classification algorithms can be utilized for dealing with unbalanced datasets in VS, such as SVM, RF, Naïve Bayes, MLP, LG, ANN, DT, and GBT. Five ML algorithms are applied in this paper: RF, DT, MLP, LG, and GBT [29].

Random forest (RF) is an ensemble learning approach in which multiple decision trees are constructed based on training data and a majority voting mechanism. Like KNN, it is utilized to predict classification or regression for new inputs. A key advantage of RF is that it can be utilized for problems that need either classification or regression. Besides, RF is able to manage many higher-dimensional datasets. It has a powerful strategy for handling missing information and preserving accuracy even when much of the information is missing [29].


Decision tree (DT) is represented as an actual tree with its root at the top and the leaves at the bottom. The root of the tree is divided into two or more branches, and each branch may in turn be broken down into two or more branches. This process continues until a leaf is reached, meaning no more splits remain.

Multilayer perceptron (MLP) belongs to artificial neural networks (ANNs), of which there are two main types: supervised and unsupervised networks. Every network consists of a series of linked neurons. A neuron takes multiple numerical inputs and outputs a value depending on the weighted sum of the inputs. Popular transformation functions include tanh and sigmoid. Neurons are organized into layers; an ANN can contain several hidden layers, and when the neurons are only linked to those in the next layer, the networks are known as feed-forward networks, such as multilayer perceptrons (MLPs) or radial basis function networks (RBNs) [29].
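A single forward pass through such a network can be sketched as follows; the architecture and weights are hypothetical, chosen only to illustrate the weighted-sum-plus-transformation idea described above:

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass of a tiny MLP: tanh hidden layer, sigmoid output."""
    # Each hidden neuron: tanh of the weighted sum of its inputs plus a bias.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    # Output neuron: weighted sum of hidden activations, squashed by sigmoid.
    z = sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))  # value in (0, 1), usable as P(active)

# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
p = forward([0.5, -1.0],
            w_hidden=[[1.0, -0.5], [0.3, 0.8]],
            b_hidden=[0.0, 0.1],
            w_out=[0.7, -1.2],
            b_out=0.2)
print(0.0 < p < 1.0)  # True
```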

Logistic regression (LR) is one of the simplest and most frequently utilized ML algorithms for two-class classification. It is simple to implement and can be utilized as the baseline for any binary classification problem; its basic principles are also constructive in deep learning. The relationship between a single dependent binary variable and the independent variables is defined and estimated by logistic regression. It is also a mathematical method for predicting binary classes. The effect, or target variable, is dichotomous in nature; dichotomous means that only two possible groups can be used, for instance, in cancer detection problems [30].

Gradient boosting is a type of ML boosting. It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. The name gradient boosting arises because target outcomes for each case are set based on the gradient of the error with respect to the prediction. Each new model takes a step in the direction that minimizes prediction error, in the space of possible predictions for each training case [30].

In [31], the authors compared four Weka classifiers, including SVM, the J48 tree, NB, and RF. From a completed cost-sensitive survey, SVM and the C4.5 tree (J48) performed well, taking minority group sizes into account. It shows that a hybrid of majority class undersampling and SMOTE can improve overall classification performance in an imbalanced dataset. In addition, the authors in [32] employed a repetitive SVM as a sampling method used for SVM processing of bioassay information on luciferase inhibition, which has a high active/inactive imbalance ratio (1:377). The models' highest performance was 86.60% for active compounds and 88.89% for inactive compounds, associated with a combined precision of 87.74%, when using validation and blind tests. These findings indicate the quality of the proposed approach for managing the intrinsic imbalance problem in HTS data used to cluster possible interference compounds for virtual screening utilizing luciferase-based HTS experiments.

Similarly, in [33], the authors analyzed several common strategies for modeling unbalanced data, including multiple undersampling, threshold-ratio-1:3 undersampling, one-sided undersampling, similarity undersampling, cluster undersampling, diversity undersampling, and threshold-only selection. In total, seven methods were compared using HTS datasets extracted from PubChem. Their analysis led to the proposal of a new hybrid method which includes both low-cost learning and less-sampling approaches. The authors claim that the multisample and hybrid methods provide more accurate predictions of results than other methods. Besides, some other research studies used toxicity datasets in imbalance algorithms, as in [34]. The model was based on a data ensemble, where each model sees an equal distribution of the two toxic and nontoxic classes. It considers various aspects of creating computational models to predict cell toxicity based on a cell proliferation screening dataset. Such predictive models can be helpful in evaluating cell-based screening outcomes in general by bringing feature-based data into such datasets. They could also be utilized as a method to recognize and remove potentially undesirable compounds. The authors concluded that the issue of data imbalance hindered the accuracy of the critical activity classification. They also developed an artificial random forest group model designed to mitigate dataset misalignment in predicting cell toxicity.

Figure 1: Taxonomy for the 3D structure of VS methods [25]. (Ligand-based methods, applicable when actives, or actives and inactives, are known: similarity searching, pharmacophore mapping, and machine learning methods. Structure-based methods, applicable when the 3D structure of the target is known: protein–ligand docking.)


The authors in [35] used a simple oversampling approach to build an SVM model classifying compounds based on expected cytotoxicity against the Jurkat cell line. Oversampling the minority has been shown to contribute to better predictive SVM models in training groups and external test groups. Consequently, the authors in [36] analyzed and validated the importance of different sampling methods over the nonsampling method in order to achieve a well-balanced sensitivity and specificity of the ML model created for unbalanced chemical data. Additionally, the study reported an accuracy of 93.00%, an area under the curve (AUC) of 0.94, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and random forest classification to predict drug-induced liver injury. Although it was presented in the literature that some of the proposed approaches have somehow succeeded in responding to the issues of unbalanced PubChem datasets, there is still a lack of time efficiency during calculations.

3. Proposed KSMOTE Framework

The K-Mean Synthetic Minority Oversampling Technique (KSMOTE) is proposed in this paper as a solution for virtual screening in drug discovery problems. KSMOTE combines the K-mean and SMOTE algorithms to counter the imbalance of the original datasets, ensuring that the number of minority samples is as close as possible to the number of majority samples. As shown in Figure 2, the data were first separated into two sets: one set contains the majority samples, and the other set contains the entire minority sample. Majority samples and minority samples were each clustered into K clusters, where K is greater than one in both cases. The number of clusters for each class is chosen according to the elbow method. The Euclidean distance was employed to calculate the distance between the center of each majority cluster and the center of each minority cluster. Each majority cluster sample was combined with a minority cluster sample subset to make K separate training datasets; this combination was done based on the largest distance between each majority and minority cluster. SMOTE was then applied to each combination of the clusters to generate synthetic minority oversampling instances of the minority class. For any minority example, its k (5 in SMOTE) nearest neighbors of the same class are determined, and then some instances are randomly selected according to the oversampling factor. After that, new synthetic examples are generated along the line between the minority example and its nearest chosen example.
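The largest-distance combination rule can be sketched as follows. This is a schematic illustration of that single step, under the assumption that k-means has already produced the cluster centroids; `pair_clusters_by_distance` is a hypothetical helper, not the authors' code:

```python
import math

def pair_clusters_by_distance(maj_centroids, min_centroids):
    """Pair each majority cluster with the minority cluster whose centroid is
    farthest from it (the 'largest distance' combination rule)."""
    pairs = []
    for i, mc in enumerate(maj_centroids):
        # Index of the minority centroid at the largest Euclidean distance.
        j = max(range(len(min_centroids)),
                key=lambda j: math.dist(mc, min_centroids[j]))
        pairs.append((i, j))
    return pairs

# Hypothetical centroids: majority cluster 0 pairs with the far minority cluster 1,
# and majority cluster 1 pairs with minority cluster 0.
maj = [(0.0, 0.0), (10.0, 10.0)]
mino = [(1.0, 1.0), (9.0, 9.0)]
print(pair_clusters_by_distance(maj, mino))  # [(0, 1), (1, 0)]
```

Each resulting pair would then form one of the K training datasets to which SMOTE is applied.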

3.1. Environment Selection and the Dataset. Since we are dealing with big data, a big data framework must be chosen. One of the most powerful frameworks used in many data analytics tasks is Spark, a well-known and very reliable cluster computing engine. It presents application programming interfaces in various programming languages such as Java, Python, Scala, and R. Spark supports in-memory computing, allowing it to handle records much faster than disk-based engines such as Hadoop; the Spark engine is advanced for in-memory processing as well as disk-based processing [37]. It has been installed on different operating systems such as Windows and Ubuntu.

This paper implements the proposed approach using PySpark version 2.4.3 [38] and Hadoop version 2.7 installed on Ubuntu 18.04, with Python used as the programming language in a Jupyter notebook (version 3.7). The computer configuration for the experiments was a local machine with an Intel Core i7 at 2.8 GHz and 8 GB of RAM. To illustrate the performance of the proposed framework, three datasets were chosen. They were carefully selected such that each differs in its imbalance ratio, and they are also large enough to illustrate the big data analysis challenge. The three datasets are AID 440 [39], AID 624202 [40], and AID 651820 [41], all retrieved from the PubChem database library [5]. The three datasets are summarized in Table 1 and briefly described in the following paragraphs. All of the data exist in an unstructured format as SDF files; therefore, they require some preprocessing to be accepted as input to the proposed platform.

(1) AID 440 concerns the formylpeptide receptor (FPR). The G-protein coupled formylpeptide receptor was one of the originating chemo-attracting receptor members [39]. The dataset consists of 185 active and 24815 nonactive molecules.

(2) AID 624202 is a qHTS test to identify small molecular stimulants for BRCA1 expression. BRCA1 has been involved in a wide range of cellular activities, including repairing DNA damage, cell cycle checkpoint control, growth inhibition, programmed cell death, transcription regulation, chromatin recombination, protein presence, and autogenous stem cell regeneration and differentiation. An increase in BRCA1 expression would enable cellular differentiation and restore tumor inhibitor function, leading to delayed tumor growth and less aggressive and more treatable breast cancer. Promising stimulants of BRCA1 expression could be new preventive or curative factors against breast cancer [40].

(3) AID 651820 is a qHTS examination for hepatitis C virus (HCV) inhibitors [41]. About 200 million people globally are infected with hepatitis C (HCV). Many infected individuals progress to chronic liver disease, including cirrhosis, with the risk of developing hepatic cancer. There is no effective hepatitis C vaccine available to date. Current interferon-based therapy is effective only in about half of patients and is associated with significant adverse effects. It is estimated that the fraction of people with HCV who can complete treatment is no more than 10 percent. The recent development of direct-acting antivirals against HCV, such as protease and polymerase inhibitors, is promising. However, it still requires a combination of peginterferon and ribavirin for maximum efficacy. Moreover, these agents are associated with a high resistance rate, and many have significant side effects (Figure 3).


3.2. Preprocessing. Chemical compounds (molecules) are stored in SDF files, so those files have to be converted to feature vectors [d1, d2, d3, ..., d1444] with 1444 dimensions; the PaDEL cheminformatics software [42] was suitable for this process. There are two types of PaDEL cheminformatics descriptors: numeric and fingerprint descriptors. A PaDEL numeric descriptor gives information about the quantity of a feature in each compound. Molecules are represented based on constitutional, topological, and geometrical descriptors as well as other molecular properties. This includes aliphatic ring count, aromatic ring count, logP, donor count, polar surface area, and the Balaban index. The Balaban index is a real number and can be either positive or negative. PaDEL is also used as a fingerprint descriptor, and it gives 881 attributes. The fingerprints were also calculated in order to compare the model performance derived from PaDEL descriptors. Molecules are referred to as instances and labeled as 1 or 0, where 1 means active and 0 means not active. The reader is directed to [43] for further information about the descriptors and fingerprints.

4. Evaluation Metrics

Instead of utilizing complicated metrics, four intuitive and functional metrics (specificity, sensitivity, G-mean, and accuracy) were introduced according to the following

Figure 2: The proposed model. (Flowchart: the big dataset is divided 80/20 into training and test data; the minority and majority classes of the training data are each divided into k clusters; the distance among each centroid in the majority and minority clusters is calculated; clusters having the largest distance are combined; SMOTE is applied to each combination (combination 1 ... combination k); all clusters are combined after SMOTE; classification algorithms are then applied to produce the results.)

Table 1: Benchmark imbalanced datasets.

Dataset       Total records   Inactive   Active   Balance ratio
AID 440       25000           24815      185      1:134
AID 624202    377550          364035     3980     1:91.5
AID 651820    283005          271341     11664    1:23
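The balance ratio in Table 1 is simply the number of inactive (majority) compounds per active (minority) compound; a short sketch reproducing the reported ratios from the class counts:

```python
# Balance ratio, as reported in Table 1: the number of inactive (majority)
# compounds per active (minority) compound, expressed as 1:r.
datasets = {
    "AID 440":    {"inactive": 24815,  "active": 185},
    "AID 624202": {"inactive": 364035, "active": 3980},
    "AID 651820": {"inactive": 271341, "active": 11664},
}

ratios = {name: d["inactive"] / d["active"] for name, d in datasets.items()}
for name, r in ratios.items():
    print(f"{name}: 1:{r:.1f}")
```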

Figure 3: Molecules (chemical compound) [42].


reasons. First, the predictive power of the classification method for each sample, particularly the predictive power of the minority group (i.e., active compounds), is demonstrated by measuring performance for both sensitivity and specificity. Second, G-mean is a combination of sensitivity and specificity, indicating a compromise between the majority and minority output of the classification. Poor quality in predicting positive samples reduces the G-mean value, whereas negative samples are classified with high accuracy. This is a typical state for imbalanced data collection. It is strongly recommended that external predictions be used to build a reliable prediction model [41, 44]. The four statistical assessment methods are described as follows:

(1) Sensitivity: the proportion of positive samples appropriately classified and labeled; it can be determined by the following equation:

    sensitivity = TP / (TP + FN),    (2)

where true positive (TP) corresponds to the right classification of positive samples (e.g., in this work, active compounds), true negative (TN) corresponds to the correct classification of negative samples (i.e., inactive compounds), false positive (FP) means that negative samples have been incorrectly identified as positive samples, and false negative (FN) is an indicator of incorrectly classified positive samples.

(2) Specificity: the proportion of negative samples that are correctly classified; its value indicates how many of the cases predicted to be negative are truly negative, as stated in the following equation:

    specificity = TN / (TN + FP).    (3)

(3) G-mean: it offers a simple way to measure the model's capability to correctly classify active and inactive compounds by combining sensitivity and specificity in a single metric. G-mean is a measure of balanced accuracy [45] and is defined as follows:

    G-mean = sqrt(specificity × sensitivity).    (4)

G-mean is a hybrid of sensitivity and specificity, indicating a balance between majority and minority rating results. Low performance in forecasting positive samples also contributes to reducing the G-mean value, even when negative samples are classified with high accuracy. This is a typical unbalanced dataset condition [45].

(4) Accuracy: it shows the capability of a model to correctly predict the class labels, as given in the following equation:

    accuracy = (TP + TN) / (TP + TN + FP + FN).    (5)
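The four metrics in equations (2)-(5) can be computed directly from the confusion-matrix counts; a minimal sketch with toy counts:

```python
import math

def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, G-mean, and accuracy from confusion-matrix
    counts, matching equations (2)-(5) above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(specificity * sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, g_mean, accuracy

# Toy imbalanced example: 10 active and 100 inactive compounds.
sens, spec, g, acc = metrics(tp=9, fn=1, tn=90, fp=10)
```

Note how a classifier that predicts everything as inactive would score a high accuracy here but a G-mean of zero, which is exactly why G-mean is preferred for imbalanced screening data.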

4.1. Experimental Results. This section presents the results based on extensive sets of experiments. The average of the experiments is concluded and presented with discussion. Tenfold cross-validation is used to evaluate the performance of the proposed model. Besides, the original sample retrieved from the datasets is randomly divided into ten equal-sized subsamples. Out of the ten subsamples, one subsample is maintained as validation data for model testing, while the remaining nine subsamples are used as training data. The validation process is then replicated ten times (folds), using each of the ten subsamples as validation data exactly once. It is then possible to compute the average of the ten outcomes from the folds. To validate the effectiveness of KSMOTE, the performance of the proposed model is compared with that of SMOTE-only and no-sampling models. Furthermore, two types of descriptors, numeric PaDEL and fingerprint, are used to validate their impact on the model's performance for all compilations. The results presented in this work are based on original test groups only, without oversampling. Two types of descriptors, PaDEL and fingerprint, are used with five algorithms (RF, DT, MLP, LG, and GBT) to validate their effect on model output. Therefore, the performance of applying these algorithms is examined.
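The tenfold cross-validation procedure described above amounts to partitioning sample indices into ten folds and rotating the validation fold; a minimal index-splitting sketch (not the paper's Spark implementation):

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k roughly equal folds; each fold serves
    once as validation data while the remaining k-1 folds form training data."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in range(k):
        val = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, val

splits = list(kfold_indices(100, k=10))
# 10 (train, validation) splits; every index is validated exactly once.
```

Averaging a metric such as G-mean over the ten validation folds gives the figures reported in the tables below.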

4.2. G-Mean-Based Performance. In this section, G-mean results are presented for the three selected datasets. Figure 4 describes G-mean for the AID 440 dataset based on both PaDEL and fingerprint descriptors. This figure demonstrates the performance of the various PaDEL descriptor and fingerprint sets. It also shows a comparison between three different approaches, which are no-sample, SMOTE, and KSMOTE. Besides, the performance of employing RF, DT, MLP, LG, and GBT classifiers with the three different approaches is examined. 20% of each dataset is used for testing, as mentioned before.

As shown in Figure 4, the best G-mean gained in the case of the PaDEL fingerprint descriptor was by KSMOTE, where it reaches on average 0.963. Moreover, utilizing KSMOTE with LG gives a G-mean of almost 0.97, which is the best value over the other classifiers. Based on our experiments, SMOTE- and no-sample-based PaDEL fingerprint descriptors are not recommended for virtual screening and classification, where their G-mean is too small.

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance is examined. Again, the results are almost similar, where KSMOTE shows higher performance than SMOTE and no-sample by nearly 55%. However, KSMOTE with the DT and GBT classifiers is not up to the other classifiers' level in this case. Therefore, it is recommended to utilize RF, MLP, and LG when a numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not that good, they show an enhancement over the PaDEL fingerprint with an average of 10%. Among the overall results, it has been noticed that the worst G-mean value is produced from applying the RF classifier with the no-sample approach.

The performance of KSMOTE produced from the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all


of the classifiers. The only drawback shown is that the G-mean of LG lags behind the other classifiers' results by almost 8% in the case where the PaDEL fingerprint descriptor is used. On the contrary, the LG classifier shows much more progress than before, where its average G-mean reached 0.8, which is a good achievement. For the rest of the classifiers in both descriptors, G-mean values are enhanced from the previous dataset. Still, among all the classifiers, the G-mean value for the RF classifier with the no-sample approach is the worst. The overall conclusion is that KSMOTE is recommended to be used with both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4–6, collecting all of the results in one place. It shows the complete set of experiments and average G-mean results in values to see the overall picture. Again, as can be extracted from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning active and nonactive compounds into K clusters and then combining pairs that have large distances led to an accurate rate of oversampling instances in the SMOTE algorithm. This explains why the proposed model produces the best results.
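The cluster-pairing idea, partitioning each class into clusters and combining pairs with large centroid distances, can be sketched as follows. The exact pairing rule here (each minority centroid matched to its farthest majority centroid) is an illustrative interpretation of Figure 2, not necessarily the authors' exact procedure:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two centroids."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_pairs(minority_centroids, majority_centroids):
    """Pair each minority cluster with the majority cluster whose centroid
    lies at the largest distance -- the pairing step sketched in Figure 2."""
    pairs = []
    for i, mc in enumerate(minority_centroids):
        dists = [euclidean(mc, Mc) for Mc in majority_centroids]
        pairs.append((i, dists.index(max(dists))))
    return pairs

# Toy centroids (2-D for readability; real vectors have 1444 dimensions).
minority = [(0.0, 0.0), (5.0, 5.0)]
majority = [(1.0, 0.0), (9.0, 9.0)]
pairs = farthest_pairs(minority, majority)  # [(0, 1), (1, 0)]
```

Oversampling within such dissimilar pairs keeps synthetic minority samples away from dense majority regions, which is the intuition the paragraph above gives for KSMOTE's accuracy.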

4.3. Sensitivity-Based Performance. Sensitivity is another metric to measure the performance of the proposed approach compared to others. Here, the sensitivity performance presentation is a little bit different, where the performance of the three approaches is displayed for the three datasets. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to the other approaches using both descriptors. In addition, SMOTE outperforms the no-sample approach in almost all of the cases. For the AID 440 dataset, a low sensitivity value of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of the PaDEL numeric descriptor and fingerprint for AID 440.

Figure 5: G-mean of the PaDEL numeric descriptor and fingerprint for AID 624202.


Table 2: Complete set of experiments and average G-mean results.

                 PaDEL numeric descriptor               PaDEL fingerprint
Algorithm   No-sample  SMOTE   KSMOTE  Time    No-sample  SMOTE   KSMOTE  Time

AID 440
RF          0.167      0.565   0.954   23      0.29       0.442   0.96    12
DT          0.5        0.59    0.937   93      0.51       0.459   0.958   49
MLP         0.6        0.5     0.963   20      0.477      0.498   0.964   96
LG          0.56       0.67    0.963   11      0.413      0.512   0.96    56
GBT         0.23       0.56    0.963   33      0.477      0.421   0.963   171

AID 624202
RF          0.445      0.625   0.952   297     0.5        0.628   0.96    153
DT          0.576      0.614   0.94    10      0.54       0.564   0.94    5
MLP         0.74       0.715   0.95    252     0.636      0.497   0.958   135
LG          0.628      0.83    0.94    268     0.791      0.78    0.837   1325
GBT         0.489      0.61    0.95    45      0.495      0.482   0.954   2236

AID 651820
RF          0.722      0.792   0.956   41      0.741      0.798   0.92    1925
DT          0.725      0.72    0.932   878     0.765      0.743   0.89    444
MLP         0.82       0.817   0.915   35      0.788      0.8     0.91    173
LG          0.779      0.8357  0.962   19      0.75       0.768   0.89    936
GBT         0.714      0.742   0.9     605     0.762      0.766   0.905   299

Figure 6: G-mean of the numeric PaDEL descriptor and fingerprint for AID 651820.

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor.


oversampling this minority class to significantly improve perceptibility, with LG jumping from 0.314 to 0.46. The KSMOTE algorithm has been used to sample this minority class to boost the predictability of the interesting class, which represents active compounds. Besides, sensitivity increases have been shown in LG, with the KSMOTE sensitivity value jumping from 0.46 to 0.94.
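SMOTE's oversampling step itself generates each synthetic minority instance by interpolating between a minority sample and one of its minority-class nearest neighbors [30]; a minimal sketch of that interpolation:

```python
import random

def smote_sample(x, neighbor, rng=random.Random(42)):
    """Generate one synthetic minority sample by interpolating between a
    minority instance and one of its minority-class nearest neighbors,
    as in the original SMOTE algorithm [30]."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

# Toy 2-D minority instance and one of its neighbors.
x = [1.0, 2.0]
nb = [3.0, 6.0]
synthetic = smote_sample(x, nb)
# Each coordinate of the synthetic point lies between x and its neighbor.
```

KSMOTE applies this same interpolation, but only within the cluster combinations selected beforehand, so the synthetic actives stay inside minority-dominated regions of the feature space.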

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds) with an exceptionally weak sensitivity of 0.2, RF is more significantly improved. However, the sensitivity value improves dramatically from 0.4 to 0.8 in LG with the incorporation of SMOTE. With KSMOTE, however, the sensitivity value in MLP rises considerably from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with a sensitivity of 0.728 in MLP for the majority class (inactive compounds).

Figure 8 presents the sensitivity of the three approaches using fingerprint descriptors. It is obvious that KSMOTE sensitivity values are superior to the other approaches. As can be seen, the performance differs based on the type of the used dataset; on the contrary, KSMOTE has a stable performance using different classifiers. In other words, the difference in KSMOTE is not that noticeable. However, it is clear from the figure that, on average, the SMOTE and no-sample approaches have the same performance as well as behavior when applied to all datasets. Besides, the sensitivity results became much better when applied to the AID 651820 dataset than when AID 440 and AID 624202 were used. Again, the results go along with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure, where it measures the percentage of correctly classified negative classes. Figures 9 and 10 show the specificity of all classifiers using PaDEL and fingerprint descriptors, respectively. Those figures illuminate two points as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) Fingerprint descriptor results are more stable thanPaDEL descriptor results

To summarize the sensitivity and specificity results, Table 3 shows the produced results using different classifiers. KSMOTE has a superior result in most of the experiments. Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors are shown, and the values marked in bold are the highest gained values among the results. Those values show the efficiency of the proposed method KSMOTE.

4.5. Computational Time Comparison. One of the issues that algorithms always face is the computational time, especially if those algorithms are designed to work on limited-resource devices. The models without SMOTE for samples with minority classes cannot achieve adequate performance. On the other hand, the five classifiers' computational time when KSMOTE is used has been proven to be accurate using sensitivity and G-mean values in almost all of the three PubChem datasets (see Figures 4–8). The computed computational time is reported in Figures 11 and 12. It is interesting to note that both numeric PaDEL descriptors and fingerprint PaDEL descriptors produce similar computational efficiency among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values of the fingerprint PaDEL descriptor are much smaller than the numeric PaDEL descriptor's computational time in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of 29 and 60 seconds for the numeric and fingerprint descriptors, respectively.
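Per-classifier training times such as those in Figures 11 and 12 can be collected with a simple wall-clock harness; a sketch using a trivial stand-in for a classifier's fit routine:

```python
import time

def timed_fit(train_fn, *args):
    """Measure wall-clock training time, the way per-classifier times such
    as those in Figures 11 and 12 can be collected. train_fn stands in for
    any classifier's fit function."""
    start = time.perf_counter()
    result = train_fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical stand-in for a real classifier's training routine:
def dummy_fit(X, y):
    return sum(y)  # trivial "model"

model, seconds = timed_fit(dummy_fit, [[0], [1]], [0, 1])
```

Running the same harness over each (classifier, descriptor, dataset) combination and recording `seconds` yields a comparison table analogous to the time columns of Table 2.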

Figure 8: Sensitivity of all datasets for the fingerprint descriptor.


5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio distribution of AID 440 is 1:134, that of AID 624202 is 1:91.5, and for AID 651820 it is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.

Moreover, HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, since training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms, such as Spark in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) dataset from the inactive (majority) data with maximum distance. But a model constructed from an imbalanced dataset appears to move the hyperplane away from the optimal location to the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity. Therefore, such a model's predictability can be significantly weak. We not only rely on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with the blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE in AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds at an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor.

Figure 10: Specificity of all datasets for the fingerprint descriptor.


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors.

PaDEL numeric descriptor (sensitivity / specificity):

Algorithm   No-sample        SMOTE            KSMOTE

AID 440
RF          0.028 / 0.998    0.32 / 0.997     0.91 / 0.997
DT          0.25 / 0.992     0.36 / 0.986     0.9 / 0.98
MLP         0.37 / 0.996     0.26 / 0.993     0.94 / 0.993
LG          0.314 / 0.998    0.46 / 0.98      0.94 / 0.98
GBT         0.05 / 0.997     0.32 / 0.993     0.94 / 0.991

AID 624202
RF          0.2 / 0.993      0.39 / 0.965     0.905 / 0.993
DT          0.351 / 0.958    0.4 / 0.957      0.92 / 0.956
MLP         0.57 / 0.967     0.53 / 0.993     0.928 / 0.96
LG          0.4 / 0.807      0.8 / 0.806      0.921 / 0.81
GBT         0.25 / 0.985     0.38 / 0.957     0.91 / 0.985

AID 651820
RF          0.529 / 0.983    0.648 / 0.944    0.94 / 0.94
DT          0.57 / 0.923     0.579 / 0.876    0.91 / 0.9
MLP         0.728 / 0.95     0.716 / 0.943    0.9 / 0.913
LG          0.62 / 0.964     0.79 / 0.856     0.94 / 0.9
GBT         0.52 / 0.97      0.56 / 0.93      0.89 / 0.914

PaDEL fingerprint (sensitivity / specificity):

Algorithm   No-sample        SMOTE            KSMOTE

AID 440
RF          0.09 / 0.985     0.2 / 0.998      0.948 / 0.99
DT          0.257 / 0.99     0.214 / 0.985    0.93 / 0.98
MLP         0.22 / 0.99      0.25 / 0.994     0.951 / 0.99
LG          0.17 / 0.998     0.267 / 0.976    0.94 / 0.98
GBT         0.228 / 0.99     0.17 / 0.995     0.95 / 0.98

AID 624202
RF          0.25 / 0.997     0.4 / 0.987      0.93 / 0.99
DT          0.305 / 0.946    0.33 / 0.934     0.924 / 0.958
MLP         0.418 / 0.961    0.24 / 0.964     0.956 / 0.971
LG          0.775 / 0.987    0.76 / 0.86      0.825 / 0.984
GBT         0.249 / 0.994    0.23 / 0.9848    0.924 / 0.995

AID 651820
RF          0.56 / 0.98      0.67 / 0.966     0.9 / 0.95
DT          0.65 / 0.9       0.63 / 0.896     0.88 / 0.95
MLP         0.58 / 0.93      0.673 / 0.933    0.9 / 0.93
LG          0.59 / 0.97      0.69 / 0.876     0.88 / 0.95
GBT         0.63 / 0.972     0.63 / 0.967     0.91 / 0.91


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, for AID 624202, the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds at an accuracy of 94%.

Also, for AID 651820, the model is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds at an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because it depends, at the beginning, on the dissimilarity in taking the samples of clusters. This dissimilarity between samples increases classifiers' accuracy; besides that, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produced new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.

Recent developments in technology allow for high-throughput scanning facilities, large-scale hubs, and individual laboratories that produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds by KSMOTE makes it a promising tool for data mining applications to investigate biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency. It also enabled the system to be more rapid in data processing compared to traditional models.
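Spark itself is beyond a short sketch, but the idea it contributes here, evaluating many classifier/dataset combinations in parallel, can be illustrated with Python's standard-library executor as a stand-in (an analogy, not the paper's Spark implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(task):
    """Stand-in for training and evaluating one classifier on one dataset;
    a real pipeline would fit the model and return its metrics here."""
    classifier, dataset = task
    return classifier, dataset, f"evaluated {classifier} on {dataset}"

# The 5 classifiers x 3 datasets grid used throughout the experiments.
tasks = [(c, d) for c in ("RF", "DT", "MLP", "LG", "GBT")
                for d in ("AID 440", "AID 624202", "AID 651820")]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, tasks))
# 15 classifier/dataset combinations evaluated concurrently.
```

On a Spark cluster the same grid would be distributed across executors rather than local threads, which is what makes the full-scale experiments tractable.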

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling methods depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the datasets used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized application of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and pathological prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semi-supervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.

[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.

[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39–63, 2002.

[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53–62, 2007.

[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.

[6] R. Barandela, J. S. Sanchez, V. Garcia, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.

[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.

[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052–1057, 2010.

[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184–192, 2001.

[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.

[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958–971, 2013.

[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.

[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.

[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162–164, 1984.

[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.

[16] S. del Rio, V. Lopez, J. M. Benitez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112–137, 2014.

[17] A. Fernandez, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805–9812, 2009.

[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.

[19] V. Lopez, A. Fernandez, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, 2012.

[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72–81, 2007.

[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583–590, 2018.

[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705–712, 2014.

[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1–20, 2010.

[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229–235, 2019.

[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862–865, 2004.

[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.

[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.

[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72–88, 2020.

[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.

[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310–3316, 2009.

[32] R. Guha and S. C. Schurer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367–384, 2008.

[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.

[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.

[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.

[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.

[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.

[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.

[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al.)/02%3A_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.

[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112–118, IEEE, Washington, DC, USA, December 2014.

[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481–487, 2019.

[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing (ICMLC), pp. 107–111, Singapore, February 2017.

[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229–236, 2011.

[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, p. 25, 2010.

[45] S. P. R. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107–111, ACM, Singapore, February 2017.



classification models which leads to poor classificationperformance In HTS experiments thousands of compoundsare usually screened nevertheless a small fraction of thetested compounds are classified to be active while otherclasses are recognized inactive [7] Such data imbalanceaffects the accuracy and precision of activity predictions inindividual virtual screening datasets Using the binaryclassification of an imbalanced dataset an instance of oneclass became fewer compared to another class +e minorityclass is known as the class with fewer cases and the other iscalled the majority class [8]

Current classification models, such as k-nearest neighbor (KNN), random forest (RF), multilayer perceptron (MLP), support vector machine (SVM), decision tree (DT), logistic regression (LG), and gradient boosting (GBT), depend on a sufficient, representative, and reasonably balanced collection of training data to draw an approximate decision boundary between the different groups. These learning algorithms are utilized in a variety of fields, including financial forecasting and text classification [9]. Despite current advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remains a daunting task. In ML, many approaches have been created to deal with imbalanced data; however, very few algorithms can handle problems related to false negatives and false positives. Positive or negative states usually dominate an unbalanced dataset. Therefore, specificity and recall (sensitivity) are vital when processing an imbalanced dataset [10]. An increase in sensitivity raises the model's true-positive expectations and reduces false negatives. Likewise, an improvement in specificity increases true-negative expectations and thus reduces wrong responses. Therefore, for a good model, it is critical that the gap between the sensitivity and specificity metrics be as small as possible [11].

Although some researchers have highlighted the problem of false negatives and false positives when using PubChem data, to the best of our knowledge, no technique has been reported that addresses the problem effectively. It was also reported that the imbalance problem obstructed the classification accuracy of bioactivity [12]. In another study [13], the authors compared the performance of seven different descriptive classifiers based on their ability to deal with unbalanced datasets. Other research stated that multilevel SVM-based algorithms outperformed algorithms such as (1) traditional SVM, (2) weighted SVM, (3) neural networks, (4) linear regression, (5) Naïve Bayes (NB), and (6) the C4.5 tree on imbalanced groups, missing values, and real health-related data [14].

Besides, the main concept of resampling is to reduce variance between class samples by preprocessing the training data. In other words, the resampling approach is applied to the training samples in order to achieve the same number of samples for each class, adjusting the prior majority and minority sample distributions. There are two basic methods in the traditional resampling methodology, namely, undersampling and oversampling [15]. Undersampling keeps a smaller number of majority samples while all minority samples are retained: majority-class samples are eliminated randomly until a satisfactory ratio has been accomplished for all groups. Undersampling is ideal for applications where there is an enormous number of majority samples and the reduction of training samples would minimize the training time of the model. However, a drawback of undersampling is that discarding samples causes a loss of majority-class information [16].

Another approach to addressing imbalanced data is oversampling. By replicating samples, it raises the number of samples in the minority groups [17]. The benefit of oversampling is that, because all samples are used, no knowledge is lost. Oversampling, however, has its own drawback: it incurs higher processing costs by generating extra training samples. Therefore, to compensate for this limitation, more efficient algorithms are needed.

Although resampling methods are usually used to solve class imbalance problems, there is little defined strategy for identifying the acceptable class distribution for a particular dataset [18]. As a result, the optimal class distribution differs from one dataset to another. Recent variants of resampling approaches derived from oversampling and undersampling overcome some of the shortcomings of current techniques, including SMOTE (Synthetic Minority Oversampling Technique). SMOTE is one of the most important oversampling approaches: it generates interpolated instances that are added to the training samples without duplicating the samples of the minority class. The SMOTE approach examines the k nearest neighbors of the minority-class sample that will be utilized as a base for a new synthetic sample [19]. If the created instances are fewer than the size of the initial dataset, the approach randomly selects the original instances utilized to create the artificial ones. If they are more than the size of the original dataset, the approach iterates over the dataset, creating one artificial instance per original instance, until it reaches the previous scenario [20]. SMOTE is considered an oversampling technique that produces synthetic minority-class samples; this theoretically performs better than simple oversampling, and it is commonly used. For example, SMOTE has been utilized to detect network intrusions [21], for sentence boundary detection in speech, and to predict species distribution [22]. SMOTE will be utilized in this research.
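As an illustration of the interpolation step described above, the following is a minimal, self-contained SMOTE sketch (our own simplified version, not a production implementation: plain squared Euclidean distance, no edge-case handling, and the function name `smote` is ours):

```python
import random

def smote(minority, n_new, k=5, seed=42):
    """Generate n_new synthetic minority samples (simplified SMOTE sketch).

    minority: list of numeric feature vectors from the minority class.
    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (excluding x itself)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority samples, no minority sample is merely duplicated, which is the property that distinguishes SMOTE from simple oversampling.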

Data mining techniques can help to narrow down the promising candidate chemicals for interaction with specific molecular targets before they are experimentally evaluated [23]. In theory, this can help to speed up the drug development process. However, building accurate prediction models for HTS is difficult. For datasets such as those taken from HTS experiments, achieving high prediction accuracy may be misleading, since it may be accompanied by an unacceptable false-positive rate [24]; high accuracy does not always imply a small proportion of false predictions.

In the event of a large class imbalance, this paper attempts to identify the most effective variant of data preprocessing for improving accuracy on imbalanced data, favoring the selection of interactions that increase the overall accuracy of a learning model. We propose SMOTE coupled with a k-means method to classify several imbalanced PubChem datasets, with the goals of (1) validating whether k-means with SMOTE affects the output of established models and (2) exploring whether KSMOTE is appropriate and useful for finding interesting samples in broad datasets. Our model is also applied with different ML algorithms (random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting) for comparison purposes on three different datasets. The paper also introduces a procedure for data sampling to increase the sensitivity and specificity of predicting the behavior of several molecules. The proposed approach is implemented on standalone clusters of Apache Spark 2.4.3 in order to address the imbalance in a big dataset.

The remainder of this paper is structured as follows. Section 2 provides a general description of class imbalance learning concepts and reviews the related research on the subject. Section 3 explains how the proposed approach for VS in drug discovery was developed. Section 4 presents performance evaluations and experimental results. Section 5 presents the discussion of our proposal. Finally, Section 6 highlights the conclusions and topics for future research.

2. Related Work

For the paper to be self-contained, this section reviews the work most related to VS research techniques, problems, and state-of-the-art solutions. It also examines some of the big data frameworks that could help solve the problem of imbalanced datasets.

Since we live in a technological era where older storage and processing technologies are no longer sufficient, computing technologies must be scaled to handle the massive amounts of data generated by different devices. The biggest challenge in handling such volumes of data is that they grow much faster than computing resources. One of the research areas that generates huge amounts of data to be analyzed is drug search and discovery. The methods used there aim to find a molecule capable of binding to and activating or inhibiting a molecular target. The discovery of new drugs for human diseases is exceptionally complicated, expensive, and time-consuming. Drug discovery uses various methods [25] based on statistical approaches to scan small-molecule libraries and determine the structures most likely to bind to a compound. The drug target is typically a protein receptor that is involved in a metabolic cycle or signaling pathway by which a particular disease disorder is established, or another anatomical structure.

There are two VS approaches: ligand-based VS (LBVS) and structure-based virtual screening (SBVS) [26]. LBVS depends on the existing information about the ligands; it utilizes the knowledge about a set of ligands known to be active for the given drug target. This type of VS uses the mining side of big data analytics: binary classifiers can be trained on a small part of the ligands, and very large sets of ligands can then be easily classified into two classes, active and nonactive. SBVS, on the other side, is utilized for experimental docking; however, 3D protein structural information is required [27], as shown in Figure 1.

K-means clustering is one of the simplest unsupervised learning algorithms; it was first proposed by MacQueen in 1967 and has been applied by many researchers to problems with known groups [28]. This technique partitions a particular dataset into a certain number of groups. The algorithm randomly initializes the centers of the groups and then calculates the distance between each object and the center of each group. Next, each data instance is assigned to the nearest center, and the cluster centers are recalculated. The distance between a center and each sample is calculated by the following equation:

d = ∑_{i=1}^{c} ∑_{j=1}^{c_i} ‖X_i − Y_j‖, (1)

where d is the Euclidean distance between the data point X_i and the cluster center Y_j, c_i is the total number of data points in cluster i, and c is the total number of cluster centers. All of the training samples are first grouped into K groups (the experiment runs with diverse K values to observe the result), and suitable training samples from the derived clusters are selected. The key idea is that a dataset contains different groups, and each group appears to have distinct characteristics. When a cluster includes many majority-class samples and few minority-class samples, it functions as a majority-class cluster. If, on the other side, a cluster has more minority samples and fewer majority samples, it acts more like a minority class. Therefore, by considering the ratio of majority-class samples to minority-class samples in the various clusters, this method oversamples the required number of minority-class samples from each cluster.
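The clustering procedure just described can be sketched in a few lines (a minimal k-means, assuming plain numeric feature vectors and squared Euclidean distance; this is an illustrative sketch, not the implementation used in the paper):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: returns the final centres and, for each
    point, the index of the cluster it was assigned to."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialize centres from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centre (squared Euclidean distance)
        assign = [min(range(k),
                      key=lambda c: sum((p[d] - centres[c][d]) ** 2
                                        for d in range(len(p))))
                  for p in points]
        # recompute each centre as the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assign
```

For well-separated data, a handful of assign-and-update iterations is enough for the assignments to stabilize, after which the per-cluster majority/minority counts described above can be read off directly.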

Several approaches have been proposed in the literature to handle big data classification, including the classification algorithms random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting.

Classification algorithms (CA) mainly depend on machine learning (ML) algorithms, which play a vital role in VS for drug discovery and can be considered an LBVS approach. Researchers have widely used the ML approach to create a binary classification model, a form of filter, to classify ligands as active or inactive in relation to a particular protein target. These strategies need fewer computational resources and, because of their ability to generalize, they find more complex hits than other, earlier approaches. Based on our experience, we believe that many classification algorithms can be utilized for dealing with unbalanced datasets in VS, such as SVM, RF, Naïve Bayes, MLP, LG, ANN, DT, and GBT. Five ML algorithms are applied in this paper: RF, DT, MLP, LG, and GBT [29].

Random forest (RF) is an ensemble learning approach in which multiple decision trees are constructed from the training data and combined through a majority-voting mechanism. Like KNN, it is utilized to predict classification or regression outputs for new inputs. A key advantage of RF is that it can be utilized for problems that need either classification or regression. Besides, RF is able to manage many higher-dimensional datasets, and it has a powerful strategy for handling missing information while preserving accuracy when much of the information is missing [29].


Decision tree (DT) is represented as an actual tree with its root at the top and the leaves at the bottom. The root of the tree is split into two or more branches, and each branch may again be broken down into two or more branches. This process continues until a leaf is reached, meaning no more splits remain.

Multilayer perceptron (MLP) belongs to artificial neural networks (ANNs), of which there are two main types: supervised and unsupervised networks. Every network consists of a series of linked neurons. A neuron takes multiple numerical inputs and outputs a value depending on the weighted inputs. Popular transformation functions include tanh and sigmoid. Neurons are organized into layers; an ANN can contain several hidden layers, and when neurons are only linked to those in the next layer, the networks are known as feed-forward networks, such as multilayer perceptrons (MLPs) or radial basis function networks (RBNs) [29].

Logistic regression (LR) is one of the simplest and most frequently utilized ML algorithms for two-class classification. It is simple to implement and can be utilized as the baseline for any binary classification problem; its basic principles are also constructive in deep learning. Logistic regression defines and estimates the relationship between a single dependent binary variable and the independent variables. It is also a mathematical method for predicting binary classes, where the effect or target variable is dichotomous in nature. Dichotomous means that only two possible classes exist, as, for instance, in cancer detection problems [30].

Gradient boosting is a type of ML boosting. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The name gradient boosting arises because the target outcomes for each case are set based on the gradient of the error with respect to the prediction. Each new model takes a step in the direction that minimizes prediction error, in the space of possible predictions for each training case [30].
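To make this intuition concrete, here is a tiny one-dimensional gradient-boosting sketch for squared-error regression, where the negative gradient of the loss is simply the residual; the depth-1 "stump" learner and the function name `boost_1d` are our own illustrative choices, not the GBT implementation used in the paper:

```python
def boost_1d(xs, ys, rounds=50, lr=0.5):
    """Tiny gradient-boosting sketch for 1-D regression with squared error:
    each round fits a depth-1 threshold stump to the current residuals (the
    negative gradient of the loss) and adds it to the running prediction."""
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]  # negative gradient
        best = None
        # pick the threshold split that best fits the residuals
        for t in xs:
            left = [r for x, r in zip(xs, resid) if x <= t]
            right = [r for x, r in zip(xs, resid) if x > t]
            lv = sum(left) / len(left) if left else 0.0
            rv = sum(right) / len(right) if right else 0.0
            err = sum((r - (lv if x <= t else rv)) ** 2
                      for x, r in zip(xs, resid))
            if best is None or err < best[0]:
                best = (err, t, lv, rv)
        _, t, lv, rv = best
        # take a small step (learning rate lr) toward the stump's prediction
        pred = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return pred
```

Each round shrinks the remaining residuals geometrically, which is exactly the "step in the direction that minimizes prediction error" described above.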

In [31], the authors compared four Weka classifiers, including SVM, the J48 tree, NB, and RF. From a complete cost-sensitive survey, SVM and the C4.5 tree (J48) performed well, taking minority group sizes into account, and the study shows that a hybrid of majority-class undersampling and SMOTE can improve overall classification performance on an imbalanced dataset. In addition, the authors in [32] employed a repetitive SVM sampling method for SVM processing of bioassay information on luciferase inhibition, which has a high active/inactive imbalance ratio (1:377). The models' highest performance was 86.60% and 88.89% for active and inactive compounds, respectively, associated with a combined precision of 87.74% when using validation and blind tests. These findings indicate the quality of the proposed approach for managing the intrinsic imbalance problem in HTS data used to cluster possible interference compounds for virtual screening utilizing luciferase-based HTS experiments.

Similarly, in [33], the authors analyzed several common strategies for modeling unbalanced data, including multiple undersampling, threshold-ratio-1:3 undersampling, one-sided undersampling, similarity undersampling, cluster undersampling, diversity undersampling, and threshold selection only. In total, seven methods were compared using HTS datasets extracted from PubChem. Their analysis led to the proposal of a new hybrid method which includes both low-cost learning and less-sampling approaches. The authors claim that the multisample and hybrid methods provide more accurate predictions than the other methods. Besides, some other research studies have used toxicity datasets with imbalance algorithms, as in [34]. That model was based on a data ensemble, where each model sees an equal distribution of the two classes, toxic and nontoxic. It considers various aspects of creating computational models to predict cell toxicity based on a cell proliferation screening dataset. Such predictive models can be helpful in evaluating cell-based screening outcomes in general by bringing feature-based data into such datasets; they could also be utilized as a method to recognize and remove potentially undesirable compounds. The authors concluded that the issue of data imbalance hindered the accuracy of the critical activity classification. They also developed an artificial random forest ensemble model designed to mitigate dataset imbalance in predicting cell toxicity.

Figure 1: Taxonomy for the 3D structure of VS methods [25]. When actives (or actives and inactives) are known, ligand-based methods are used: similarity searching, pharmacophore mapping, and machine learning methods. When the 3D structure of the target is known, structure-based methods are used: protein-ligand docking.


The authors in [35] used a simple oversampling approach to build an SVM model classifying compounds based on expected cytotoxicity against the Jurkat cell line. Oversampling the minority has been shown to contribute to better predictive SVM models on training groups and external test groups. Consequently, the authors in [36] analyzed and validated the importance of different sampling methods over the nonsampling method in order to achieve a well-balanced sensitivity and specificity of the ML model created for unbalanced chemical data. Additionally, the study achieved an accuracy of 93.00%, an area under the curve (AUC) of 0.94, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and random forest classification to predict drug-induced liver injury. Although the literature shows that some of the proposed approaches have somehow succeeded in responding to the issues of unbalanced PubChem datasets, there is still a lack of time efficiency during calculations.

3. Proposed KSMOTE Framework

K-Means Synthetic Minority Oversampling Technique (KSMOTE) is proposed in this paper as a solution for virtual screening in drug discovery. KSMOTE combines the k-means and SMOTE algorithms to counter the imbalance of the original datasets, ensuring that the number of minority samples is as close as possible to the number of majority samples. As shown in Figure 2, the data were first separated into two sets: one set contains the majority samples, and the other contains all of the minority samples. First, majority samples were clustered into K clusters, and minority samples likewise, where K is greater than one in both cases. The number of clusters for each class is chosen according to the elbow method. The Euclidean distance was employed to calculate the distance between the center of each majority cluster and the center of each minority cluster. Each majority-cluster sample set was combined with a minority-cluster sample subset to make K separate training datasets; this combination was done based on the largest distance between each majority and minority cluster. SMOTE was then applied to each combination of clusters to generate synthetic minority-oversampled instances of the minority class. For any minority example, its k (5 in SMOTE) nearest neighbors of the same class are determined, and then some instances are randomly selected according to the oversampling factor. After that, new synthetic examples are generated along the line between the minority example and its nearest chosen example.
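Under our own simplifying assumptions (a single machine rather than Spark, equal-size parity as the oversampling target, and the same fixed k for both classes), the flow described above can be sketched end to end as follows; all helper and function names are ours, and this is an illustrative sketch rather than the authors' implementation:

```python
import random

def _dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _kmeans(pts, k, rng, iters=15):
    """Minimal k-means: returns cluster centres and member lists."""
    centres = rng.sample(pts, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            clusters[min(range(k), key=lambda c: _dist(p, centres[c]))].append(p)
        for c, members in enumerate(clusters):
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, clusters

def _smote(minority, n_new, rng, k=5):
    """Simplified SMOTE: interpolate between minority samples and neighbours."""
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nbrs = sorted((m for m in minority if m is not x),
                      key=lambda m: _dist(x, m))[:k]
        nb = rng.choice(nbrs)
        g = rng.random()
        out.append([xi + g * (ni - xi) for xi, ni in zip(x, nb)])
    return out

def ksmote(majority, minority, k=2, seed=1):
    """KSMOTE sketch: cluster both classes, pair each majority cluster with
    the farthest minority cluster, oversample the minority side of each
    pairing to parity with SMOTE, and combine all pairings."""
    rng = random.Random(seed)
    maj_centres, maj_members = _kmeans(majority, k, rng)
    min_centres, min_members = _kmeans(minority, k, rng)
    training = []
    for i in range(k):
        if not maj_members[i]:
            continue
        # pair this majority cluster with the farthest usable minority cluster
        j = max((c for c in range(k) if len(min_members[c]) >= 2),
                key=lambda c: _dist(maj_centres[i], min_centres[c]))
        combo_min = min_members[j]
        need = len(maj_members[i]) - len(combo_min)  # synthetic samples needed
        synth = _smote(combo_min, need, rng) if need > 0 else []
        training.append(maj_members[i] + combo_min + synth)
    # combine all oversampled cluster pairings into one training set
    return [row for subset in training for row in subset]
```

The resulting combined set keeps every majority sample and balances each pairing with synthetic minority samples, matching the "combine all clusters after SMOTE" step of Figure 2.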

3.1. Environment Selection and the Dataset. Since we are dealing with big data, a big data framework must be chosen. One of the most powerful frameworks used in many data analytics applications is Spark. Spark is a well-known, very reliable cluster computing engine. It presents application programming interfaces in various programming languages such as Java, Python, Scala, and R. Spark supports in-memory computing, allowing it to handle records much faster than disk-based engines such as Hadoop; the Spark engine supports both in-memory and disk-based processing [37]. It can be installed on different operating systems such as Windows and Ubuntu.

This paper implements the proposed approach using PySpark version 2.4.3 [38] and Hadoop version 2.7, installed on Ubuntu 18.04, with Python as the programming language; a Jupyter notebook (version 3.7) was used. The computer configuration for the experiments was a local machine with an Intel Core i7 at 2.8 GHz and 8 GB of RAM. To illustrate the performance of the proposed framework, three datasets were chosen. They were carefully selected such that each differs in its imbalance ratio, and they are large enough to illustrate the big data analysis challenge. The three datasets are AID 440 [39], AID 624202 [40], and AID 651820 [41], all retrieved from the PubChem database library [5]. The three datasets are summarized in Table 1 and briefly described in the following paragraphs. All of the data exist in an unstructured format as SDF files; therefore, they require some preprocessing to be accepted as input to the proposed platform.

(1) AID 440 concerns the formylpeptide receptor (FPR). The G-protein coupled formylpeptide receptor was one of the original members of the chemoattractant receptor family [39]. The dataset consists of 185 active and 24,815 nonactive molecules.

(2) AID 624202 is a qHTS assay to identify small-molecule stimulants of BRCA1 expression. BRCA1 is involved in a wide range of cellular activities, including DNA damage repair, cell-cycle checkpoint control, growth inhibition, programmed cell death, transcription regulation, chromatin recombination, protein presence, and autogenous stem cell regeneration and differentiation. An increase in BRCA1 expression would enable cellular differentiation and restore tumor-inhibitor function, leading to delayed tumor growth and less aggressive, more treatable breast cancer. Promising stimulants of BRCA1 expression could therefore be new preventive or curative factors against breast cancer [40].

(3) AID 651820 is a qHTS assay for hepatitis C virus (HCV) inhibitors [41]. About 200 million people globally are infected with hepatitis C (HCV). Many infected individuals progress to chronic liver disease, including cirrhosis, with the risk of developing hepatic cancer. There is no effective hepatitis C vaccine available to date. Current interferon-based therapy is effective in only about half of patients and is associated with significant adverse effects; it is estimated that the fraction of people with HCV who can complete treatment is no more than 10 percent. The recent development of direct-acting antivirals against HCV, such as protease and polymerase inhibitors, is promising. However, it still requires a combination of peginterferon and ribavirin for maximum efficacy. Moreover, these agents are associated with a high resistance rate, and many have significant side effects (Figure 3).


3.2. Preprocessing. Chemical compounds (molecules) are stored in SDF files, so those files have to be converted to feature vectors [d1, d2, ..., d1444] with 1444 dimensions; the PaDEL cheminformatics software [42] was suitable for this process. PaDEL provides two types of descriptors: numeric and fingerprint descriptors. A PaDEL numeric descriptor gives information about the quantity of a feature in each compound. Molecules are represented based on constitutional, topological, and geometrical descriptors as well as other molecular properties. This includes aliphatic ring count, aromatic ring count, logP, donor count, polar surface area, and the Balaban index. The Balaban index is a real number and can be either positive or negative. PaDEL was also used as a fingerprint descriptor, giving 881 attributes. The fingerprint was also calculated in order to compare the model performance derived from the PaDEL descriptors. Molecules are referred to as instances and labeled as 1 or 0, where 1 means active and 0 means not active. The reader is directed to [43] for further information about the descriptors and fingerprints.

4. Evaluation Metrics

Instead of utilizing complicated metrics, four intuitive and functional metrics (specificity, sensitivity, G-mean, and accuracy) were introduced, according to the following

Figure 2: The proposed model. The big dataset is divided into training data (80%) and test data (20%); the training data are divided into the majority class and the minority class, each of which is clustered into K clusters; the distances among the centroids of the majority and minority clusters are calculated; clusters with the largest distance are combined; SMOTE is applied to each combination (combination 1 through combination K); all clusters are combined after SMOTE; and the classification algorithms produce the results.

Table 1: Benchmark imbalanced datasets.

Datasets     Total records   Inactive   Active   Balance ratio
AID 440      25,000          24,815     185      1:134
AID 624202   377,550         364,035    3,980    1:91.5
AID 651820   283,005         271,341    11,664   1:23

Figure 3: A molecule (chemical compound) [42].


reasons. First, the predictive power of the classification method for each sample, particularly the predictive power for the minority group (i.e., the active compounds), is demonstrated by measuring performance in terms of both sensitivity and specificity. Second, G-mean is a combination of sensitivity and specificity, indicating a compromise between the majority and minority performance of the classification. Poor quality in predicting positive samples reduces the G-mean value, even when negative samples are classified with high accuracy; this is a typical state for imbalanced data collections. It is strongly recommended that external predictions be used to build a reliable prediction model [41, 44]. The four statistical assessment methods are described as follows:

(1) Sensitivity: the proportion of positive samples appropriately classified and labeled; it can be determined by the following equation:

sensitivity = TP / (TP + FN), (2)

where true positive (TP) corresponds to the correct classification of positive samples (e.g., in this work, active compounds), true negative (TN) corresponds to the correct classification of negative samples (i.e., inactive compounds), false positive (FP) means that negative samples have been incorrectly classified as positive, and false negative (FN) indicates incorrectly classified positive samples.

(2) Specificity: the proportion of negative samples that are correctly classified; its value indicates how many of the cases predicted to be negative are truly negative, as stated in the following equation:

specificity = TN / (TN + FP), (3)

(3) G-mean: it offers a simple way to measure the model's capability to correctly classify active and inactive compounds by combining sensitivity and specificity in a single metric. G-mean is a measure of balanced accuracy [45] and is defined as follows:

G-mean = √(specificity × sensitivity), (4)

G-mean is a hybrid of sensitivity and specificity, indicating a balance between majority and minority classification results. Low performance in forecasting positive samples reduces the G-mean value, even when negative samples are classified with high accuracy; this is a typical unbalanced dataset condition [45].

(4) Accuracy: it shows the capability of a model to correctly predict the class labels, as given in the following equation:

accuracy = (TP + TN) / (TP + TN + FP + FN). (5)

4.1. Experimental Results. This section presents the results based on extensive sets of experiments; the average of the experiments is concluded and presented with a discussion. Tenfold cross-validation is used to evaluate the performance of the proposed model. The original sample retrieved from each dataset is randomly divided into ten equal-sized subsamples. Of the ten subsamples, one subsample is maintained as validation data for model testing, while the remaining nine subsamples are used as training data. The validation process is then replicated ten times (folds), using each of the ten subsamples as validation data exactly once, so that the average of the ten fold outcomes can be computed. To validate the effectiveness of KSMOTE, the performance of the proposed model is compared with that of SMOTE-only and no-sampling models. Furthermore, two types of descriptors, numeric PaDEL and fingerprint, are used to validate their impact on the model's performance for all combinations. The results presented in this work are based on the original test groups only, without oversampling. The two types of descriptors are used with the five algorithms (RF, DT, MLP, LG, and GBT) to validate their effect on model output, and the performance of applying these algorithms is examined.
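The tenfold split described above can be sketched as follows; we use a stratified variant (an assumption on our part, and the function name is ours) so that every fold preserves the active/inactive ratio, which matters for heavily imbalanced data:

```python
import random

def stratified_folds(labels, n_folds=10, seed=7):
    """Split sample indices into n_folds folds, preserving the class ratio
    in each fold (a sketch of tenfold cross-validation for imbalanced data).

    labels: list of class labels, one per sample.
    Returns a list of n_folds lists of sample indices.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # deal this class's indices round-robin so every fold gets its share
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds
```

Each fold then serves exactly once as the validation set while the remaining nine folds form the training set, and the ten fold results are averaged.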

4.2. G-Mean-Based Performance. In this section, the G-mean results are presented for the three selected datasets. Figure 4 describes the G-mean for the AID 440 dataset based on both PaDEL numeric and fingerprint descriptors. This figure demonstrates the performance of the various PaDEL descriptor and fingerprint sets. It also shows a comparison between the three different approaches: no-sample, SMOTE, and KSMOTE. Besides, the performance of employing the RF, DT, MLP, LG, and GBT classifiers with the three different approaches is examined; 20% of each dataset is used for testing, as mentioned before.

As shown in Figure 4, the best G-mean in the case of the PaDEL fingerprint descriptor was gained by KSMOTE, where it reaches 0.963 on average. Moreover, utilizing KSMOTE with LG gives a G-mean of almost 0.97, which is the best value over the other classifiers. Based on our experiments, SMOTE- and no-sample-based PaDEL fingerprint descriptors are not recommended for virtual screening and classification, as their G-mean is too small.

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance is examined. Again, the results are similar: KSMOTE shows nearly 55% higher performance than SMOTE and no-sample. However, KSMOTE with the DT and GBT classifiers is not up to the level of the other classifiers in this case. Therefore, it is recommended to utilize RF, MLP, and LG when a numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not that good, they show an enhancement over the PaDEL fingerprint of 10% on average. Among the overall results, it has been noticed that the worst G-mean value is produced by applying the RF classifier with the no-sample approach.

The performance of KSMOTE observed on the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all of the classifiers. The only drawback shown is that the G-mean of LG lags behind the other classifier results by almost 8% when the PaDEL fingerprint descriptor is used. On the contrary, the LG classifier shows much more progress than before, with its average G-mean reaching 0.8, which is a good achievement. For the rest of the classifiers, in both descriptors, the G-mean values are enhanced over the previous dataset. Still, among all the classifiers, the G-mean value of the RF classifier with the no-sample approach is worst. The overall conclusion is that KSMOTE is recommended to be used with both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4–6, collecting all of the results in one place. It shows the complete set of experiments and the average G-mean results as values, to give the overall picture. Again, as can be seen from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning the active and nonactive compounds into K clusters and then combining pairs that have large distances led to an accurate rate of oversampling instances in the SMOTE algorithm. This explains why the proposed model produces the best results.

4.3. Sensitivity-Based Performance. Sensitivity is another metric for measuring the performance of the proposed approach compared to the others. Here, the sensitivity presentation is slightly different: the performance of the three approaches is displayed for all three datasets. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches using both descriptors. In addition, SMOTE outperforms the no-sample approach in almost all of the cases. For the AID 440 dataset, a low sensitivity of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of PaDEL descriptor and fingerprint for AID 440 (RF, DT, MLP, LG, and GBT under the no-sample, SMOTE, and KSMOTE approaches).

Figure 5: G-mean of PaDEL descriptor and fingerprint for AID 624202 (RF, DT, MLP, LG, and GBT under the no-sample, SMOTE, and KSMOTE approaches).


Table 2: Complete set of experiments and average G-mean results.

| Dataset | Algorithm | Numeric no-sample | Numeric SMOTE | Numeric KSMOTE | Numeric time (s) | Fingerprint no-sample | Fingerprint SMOTE | Fingerprint KSMOTE | Fingerprint time (s) |
|---|---|---|---|---|---|---|---|---|---|
| AID 440 | RF | 0.167 | 0.565 | 0.954 | 23 | 0.29 | 0.442 | 0.96 | 12 |
| AID 440 | DT | 0.5 | 0.59 | 0.937 | 9.3 | 0.51 | 0.459 | 0.958 | 4.9 |
| AID 440 | MLP | 0.6 | 0.5 | 0.963 | 20 | 0.477 | 0.498 | 0.964 | 9.6 |
| AID 440 | LG | 0.56 | 0.67 | 0.963 | 11 | 0.413 | 0.512 | 0.96 | 5.6 |
| AID 440 | GBT | 0.23 | 0.56 | 0.963 | 33 | 0.477 | 0.421 | 0.963 | 17.1 |
| AID 624202 | RF | 0.445 | 0.625 | 0.952 | 29.7 | 0.5 | 0.628 | 0.96 | 15.3 |
| AID 624202 | DT | 0.576 | 0.614 | 0.94 | 10 | 0.54 | 0.564 | 0.94 | 5 |
| AID 624202 | MLP | 0.74 | 0.715 | 0.95 | 25.2 | 0.636 | 0.497 | 0.958 | 13.5 |
| AID 624202 | LG | 0.628 | 0.83 | 0.94 | 26.8 | 0.791 | 0.78 | 0.837 | 13.25 |
| AID 624202 | GBT | 0.489 | 0.61 | 0.95 | 45 | 0.495 | 0.482 | 0.954 | 22.36 |
| AID 651820 | RF | 0.722 | 0.792 | 0.956 | 41 | 0.741 | 0.798 | 0.92 | 19.25 |
| AID 651820 | DT | 0.725 | 0.72 | 0.932 | 8.78 | 0.765 | 0.743 | 0.89 | 4.44 |
| AID 651820 | MLP | 0.82 | 0.817 | 0.915 | 35 | 0.788 | 0.8 | 0.91 | 17.3 |
| AID 651820 | LG | 0.779 | 0.8357 | 0.962 | 19 | 0.75 | 0.768 | 0.89 | 9.36 |
| AID 651820 | GBT | 0.714 | 0.742 | 0.9 | 60.5 | 0.762 | 0.766 | 0.905 | 29.9 |

Figure 6: G-mean of numeric PaDEL descriptor and fingerprint for AID 651820 (RF, DT, MLP, LG, and GBT under the no-sample, SMOTE, and KSMOTE approaches).

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor (no-sample, SMOTE, and KSMOTE for each classifier and dataset).


oversample this minority class to significantly improve perceptibility, with LG jumping from 0.314 to 0.46. The KSMOTE algorithm has been used to sample this minority class to boost the predictability of the interesting class, which represents active compounds. Besides, sensitivity increases have been shown in LG, with the KSMOTE sensitivity value jumping from 0.46 to 0.94.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds) with exceptionally weak sensitivity (0.2 for RF), the improvement is more significant. The sensitivity value improves dramatically from 0.4 to 0.8 in LG with the incorporation of SMOTE. With KSMOTE, the sensitivity value in MLP rises considerably from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with a sensitivity in MLP of 0.728 for the majority class (inactive compounds).

Figure 8 presents the sensitivity of the three approaches using fingerprint descriptors. It is obvious that KSMOTE sensitivity values are superior to the other approaches. As can be seen, the performance differs based on the type of dataset used; on the contrary, KSMOTE has a stable performance using different classifiers. In other words, the difference in KSMOTE is not that noticeable. However, it is clear from the figure that, on average, the SMOTE and no-sample approaches have the same performance and behavior when applied to all datasets. Besides, the sensitivity results are much better when applied to the AID 651820 dataset than when AID 440 and AID 624202 are used. Again, the results go along with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure, which captures the percentage of negative classes that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using PaDEL and fingerprint descriptors, respectively. Those figures illuminate two points, as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) Fingerprint descriptor results are more stable than PaDEL descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the results produced using the different classifiers. KSMOTE has superior results in most of the experiments. Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors are shown, and the highest values gained among the results demonstrate the efficiency of the proposed KSMOTE method.

4.5. Computational Time Comparison. One of the issues that algorithms always face is computational time, especially if those algorithms are designed to work on limited-resource devices. The models without SMOTE cannot achieve adequate performance for samples with minority classes. On the other hand, the five classifiers' computational time when KSMOTE is used has been proven acceptable, given the sensitivity and G-mean values achieved in almost all of the three PubChem datasets (see Figures 4–8). The computed computational time is reported in Figures 11 and 12. It is interesting to note that both the numeric PaDEL descriptors and the fingerprint PaDEL descriptors produce similar computational efficiency rankings among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values of fingerprint PaDEL are much smaller than the numeric PaDEL descriptor's computational time in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of about 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
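This kind of measurement can be reproduced by wrapping each classifier's training call in a timer (an illustrative scikit-learn sketch, not the paper's Spark setup; the toy data, sizes, and names are ours):

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a descriptor matrix (sizes and names are ours).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
y = (rng.random(2000) < 0.1).astype(int)  # ~10% "active" labels

for name, clf in [("DT", DecisionTreeClassifier(random_state=0)),
                  ("LG", LogisticRegression(max_iter=1000))]:
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"{name} training time: {time.perf_counter() - start:.3f} s")
```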

Figure 8: Sensitivity of all datasets for the fingerprint descriptor (no-sample, SMOTE, and KSMOTE for each classifier and dataset).


5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio distribution of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.
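These three figures are mutually consistent, since the G-mean is the geometric mean of the two per-class accuracies:

```python
import math

# Per-class accuracies reported in the text (active vs. inactive)
sensitivity, specificity = 0.97, 0.98
g_mean = math.sqrt(sensitivity * specificity)
print(round(g_mean, 3))  # 0.975
```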

Moreover, the HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms, such as Spark in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) dataset from the inactive (majority) data with maximum distance. But a model constructed from an imbalanced dataset appears to move the hyperplane away from the optimal location towards the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity. Therefore, such a model's predictability can be significantly weak. We not only relied on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with the blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE in AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds at an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor (no-sample, SMOTE, and KSMOTE for each classifier and dataset).

Figure 10: Specificity of all datasets for the fingerprint descriptor (no-sample, SMOTE, and KSMOTE for each classifier and dataset).


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors (each cell is sensitivity/specificity).

| Dataset | Algorithm | Numeric no-sample | Numeric SMOTE | Numeric KSMOTE | Fingerprint no-sample | Fingerprint SMOTE | Fingerprint KSMOTE |
|---|---|---|---|---|---|---|---|
| AID 440 | RF | 0.028/0.998 | 0.32/0.997 | 0.91/0.997 | 0.09/0.985 | 0.2/0.998 | 0.948/0.99 |
| AID 440 | DT | 0.25/0.992 | 0.36/0.986 | 0.9/0.98 | 0.257/0.99 | 0.214/0.985 | 0.93/0.98 |
| AID 440 | MLP | 0.37/0.996 | 0.26/0.993 | 0.94/0.993 | 0.22/0.99 | 0.25/0.994 | 0.951/0.99 |
| AID 440 | LG | 0.314/0.998 | 0.46/0.98 | 0.94/0.98 | 0.17/0.998 | 0.267/0.976 | 0.94/0.98 |
| AID 440 | GBT | 0.05/0.997 | 0.32/0.993 | 0.94/0.991 | 0.228/0.99 | 0.17/0.995 | 0.95/0.98 |
| AID 624202 | RF | 0.2/0.993 | 0.39/0.965 | 0.905/0.993 | 0.25/0.997 | 0.4/0.987 | 0.93/0.99 |
| AID 624202 | DT | 0.351/0.958 | 0.4/0.957 | 0.92/0.956 | 0.305/0.946 | 0.33/0.934 | 0.924/0.958 |
| AID 624202 | MLP | 0.57/0.967 | 0.53/0.993 | 0.928/0.96 | 0.418/0.961 | 0.24/0.964 | 0.956/0.971 |
| AID 624202 | LG | 0.4/0.807 | 0.8/0.806 | 0.921/0.81 | 0.775/0.987 | 0.76/0.86 | 0.825/0.984 |
| AID 624202 | GBT | 0.25/0.985 | 0.38/0.957 | 0.91/0.985 | 0.249/0.994 | 0.23/0.9848 | 0.924/0.995 |
| AID 651820 | RF | 0.529/0.983 | 0.648/0.944 | 0.94/0.94 | 0.56/0.98 | 0.67/0.966 | 0.9/0.95 |
| AID 651820 | DT | 0.57/0.923 | 0.579/0.876 | 0.91/0.9 | 0.65/0.9 | 0.63/0.896 | 0.88/0.95 |
| AID 651820 | MLP | 0.728/0.95 | 0.716/0.943 | 0.9/0.913 | 0.58/0.93 | 0.673/0.933 | 0.9/0.93 |
| AID 651820 | LG | 0.62/0.964 | 0.79/0.856 | 0.94/0.9 | 0.59/0.97 | 0.69/0.876 | 0.88/0.95 |
| AID 651820 | GBT | 0.52/0.97 | 0.56/0.93 | 0.89/0.914 | 0.63/0.972 | 0.63/0.967 | 0.91/0.91 |


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, for AID 624202, the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds at an accuracy of 94%.

Also, for AID 651820, the model is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds at an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because it depends, at the beginning, on the dissimilarity among the samples taken from the clusters. This dissimilarity between samples increases the classifiers' accuracy; besides that, it uses SMOTE to increase the number of minority samples (active compounds) by generating new samples for the minority class.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produces new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.

Recent developments in technology allow high-throughput scanning facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds by KSMOTE makes it a promising tool for data mining applications that investigate biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency. It also enabled the system to be more rapid in data processing compared to traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling methods depends, to a large extent, on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the datasets used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized application of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and satisfactory prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semi-supervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The dataset used is publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.
[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.
[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39–63, 2002.
[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53–62, 2007.
[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.
[6] R. Barandela, J. S. Sanchez, V. Garcia, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.
[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.
[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052–1057, 2010.
[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184–192, 2001.
[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.
[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958–971, 2013.
[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.
[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162–164, 1984.
[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.
[16] S. del Rio, V. Lopez, J. M. Benitez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112–137, 2014.
[17] A. Fernandez, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805–9812, 2009.
[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.
[19] V. Lopez, A. Fernandez, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, 2012.
[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72–81, 2007.
[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583–590, 2018.
[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705–712, 2014.
[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1–20, 2010.
[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229–235, 2019.
[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862–865, 2004.
[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.
[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.
[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72–88, 2020.
[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.
[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310–3316, 2009.
[32] R. Guha and S. C. Schurer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367–384, 2008.
[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.
[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.
[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.
[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.
[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.
[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.
[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al.)/02%3A_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.
[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112–118, IEEE, Washington, DC, USA, December 2014.
[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481–487, 2019.
[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing (ICMLC), pp. 107–111, Singapore, February 2017.
[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229–236, 2011.
[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, p. 25, 2010.
[45] S. P. R. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107–111, ACM, Singapore, February 2017.



PubChem datasets, with the goal of (1) validating whether K-means with SMOTE affects the output of established models and (2) exploring whether KSMOTE is appropriate and useful in finding interesting samples from broad datasets. Our model is also applied to different ML algorithms (random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting) for comparison purposes, with three different datasets. The paper also introduces a procedure for data sampling to increase the sensitivity and specificity of predicting several molecules' behavior. The proposed approach is implemented on standalone clusters of Apache Spark 2.4.3 in order to address the imbalance in a big dataset.

The remainder of this paper is structured as follows. Section 2 provides a general description of class imbalance learning concepts and reviews the related research conducted on the subject matter. Section 3 explains how the proposed approach for VS in drug discovery was developed. Section 4 presents performance evaluations and experimental results. Section 5 presents the discussion of our proposal. Finally, Section 6 highlights the conclusions and topics of study for future research.

2. Related Work

For the paper to be self-contained, this section reviews the work most related to VS research techniques, problems, and state-of-the-art solutions. It also examines some of the big data frameworks that could help solve the problem of imbalanced datasets.

Since we live in the technological era, where older storage and processing technologies are not enough, computing technologies must be scaled to handle the massive amount of data generated by different devices. The biggest challenge in handling such volumes of data is that they grow much faster than computer resources do. One of the research areas that generates huge data to be analyzed is the search for and discovery of medicines. The proposed methods aim to find a molecule capable of binding to and activating or inhibiting a molecular target. The discovery of new drugs for human diseases is exceptionally complicated, expensive, and time-consuming. Drug discovery uses various methods [25] based on a statistical approach to scan small-molecule libraries and determine the structures most likely to bind to a compound. The drug target is typically a protein receptor involved in a metabolic cycle or signaling pathway by which a particular disease disorder is established.

There are two VS approaches: ligand-based VS (LBVS) and structure-based virtual screening (SBVS) [26]. LBVS depends on the existing information about the ligands; it utilizes the knowledge about a set of ligands known to be active for the given drug target. This type of VS uses the mining of big data analytics: binary classifiers can be trained on a small part of the ligands, and very large sets of ligands can then be easily classified into two classes, active and nonactive. SBVS, on the other side, is utilized for experimental docking; however, 3D protein structural information is required [27], as shown in Figure 1.

K-means clustering is one of the simplest unsupervised learning algorithms; it was first proposed by MacQueen in 1967 and has been applied by many researchers to solve some of the problems of known groups [28]. This technique classifies a particular dataset into a certain number of groups. The algorithm randomly initializes the centers of the groups. Then, it calculates the distance between each object and the midpoint of each group. Next, each data instance is linked to the nearest center, and the cluster centers are recalculated. The distance between the center and each sample is calculated by the following equation:

d = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left\| X_i - Y_j \right\| \qquad (1)

where d is the Euclidean distance between the data point X_i and the cluster center Y_j, c_i is the total number of data points in cluster i, and c is the total number of cluster centers. All of the training samples are first grouped into K groups (the experiment runs with diverse K values to observe the result), and suitable training samples from the derived clusters are selected. The key idea is that there are different groups in a dataset, and each group appears to have distinct characteristics. When a cluster includes samples from the large majority class and samples from the low minority class, it functions as a majority class sample. If, on the other side, a cluster has extra minority samples and fewer majority samples, it acts more like a minority class. Therefore, by considering the ratio of majority class samples to minority class samples in the various clusters, this method oversamples the required number of minority class samples from each cluster.
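The assign-then-recompute loop described above can be sketched as follows (a plain illustration of the distance computation in equation (1); the function name and defaults are ours):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign each point to its nearest center using the
    Euclidean distance, then recompute the centers; repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every center, shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two tight pairs of points: each pair should end up in its own cluster.
labels, centers = kmeans(np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]]), k=2)
```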

Several approaches have been proposed in the literature to handle big data classification, including classification algorithms such as random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting.

Classification algorithms (CA) mainly depend on machine learning (ML) algorithms, which play a vital role in VS for drug discovery; they can be considered an LBVS approach. Researchers have widely used the ML approach to create a binary classification model, a form of filter that classifies ligands as active or inactive in relation to a particular protein target. These strategies need fewer computational resources, and because of their ability to generalize, they find more complex hits than earlier approaches. Based on our experience, we believe that many classification algorithms can be utilized for dealing with unbalanced datasets in VS, such as SVM, RF, Naïve Bayes, MLP, LG, ANN, DT, and GBT. Five ML algorithms are applied in this paper: RF, DT, MLP, LG, and GBT [29].

Random forest (RF) is an ensemble learning approach in which multiple decision trees are constructed based on training data and a majority voting mechanism. Like KNN, it is utilized to predict classification or regression for new inputs. A key advantage of RF is that it can be utilized for problems that need either classification or regression. Besides, RF is able to manage many higher-dimensional datasets; it has a powerful strategy for dealing with missing information and preserving accuracy when much of the information is missing [29].


A decision tree (DT) is represented as an actual tree, with its root at the top and the leaves at the bottom. The root of the tree is divided into two or more branches, and each branch may in turn be broken down into two or more branches. This process continues until a leaf is reached, meaning no more splits remain.

The multilayer perceptron (MLP) belongs to artificial neural networks (ANNs), of which there are two main types: supervised and unsupervised networks. Every network consists of a series of linked neurons. A neuron takes multiple numerical inputs and outputs a value depending on the weighted inputs. Popular transformation functions include tanh and sigmoid. Neurons are organized into layers; an ANN can contain several hidden layers, and when neurons are linked only to those in the next layers, the networks are known as feedforward networks, such as multilayer perceptrons (MLPs) or radial basis function networks (RBNs) [29].

Logistic regression (LR) is one of the simplest and most frequently utilized ML algorithms for two-class classification. It is simple to implement and can be utilized as the baseline for any binary classification problem; its basic principles are also constructive in deep learning. The relationship between a single dependent binary variable and the independent variables is defined and estimated by logistic regression. It is also a mathematical method for predicting binary classes: the effect or target variable is dichotomous in nature, meaning only two possible groups can be used, for cancer detection issues, for instance [30].

Gradient boosting is a type of ML boosting. It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. The name gradient boosting arises because the target outcomes for each case are set based on the gradient of the error with respect to the prediction. Each new model takes a step in the direction that minimizes prediction error, in the space of possible predictions for each training case [30].
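The five classifiers can be compared on an imbalanced toy problem with a few lines of scikit-learn (an illustrative sketch, not the paper's Spark pipeline; the synthetic data stands in for a bioassay dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy problem (~5% positives) standing in for a bioassay dataset.
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {"RF": RandomForestClassifier(random_state=0),
          "DT": DecisionTreeClassifier(random_state=0),
          "MLP": MLPClassifier(max_iter=500, random_state=0),
          "LG": LogisticRegression(max_iter=1000),
          "GBT": GradientBoostingClassifier(random_state=0)}

for name, clf in models.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    sens = ((pred == 1) & (y_te == 1)).sum() / max((y_te == 1).sum(), 1)
    print(f"{name}: minority-class sensitivity = {sens:.2f}")
```

Without any resampling, the minority-class sensitivity is typically far below the overall accuracy, which is exactly the imbalance symptom the paper addresses.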

In [31], the authors compared four Weka classifiers, including SVM, J48 tree, NB, and RF. In a complete cost-sensitive survey, SVM and the C4.5 tree (J48) performed well when minority group sizes were taken into account. The study shows that a hybrid of majority-class undersampling and SMOTE can improve overall classification performance on an imbalanced dataset. In addition, the authors in [32] employed a repetitive SVM as a sampling method for SVM processing of bioassay information on luciferase inhibition, which has a high active/inactive imbalance ratio (1:377). The models' highest performance was 86.60% for active compounds and 88.89% for inactive compounds, with a combined precision of 87.74% when using validation and blind tests. These findings indicate the quality of the proposed approach for managing the intrinsic imbalance problem in HTS data used to cluster possible interference compounds for virtual screening utilizing luciferase-based HTS experiments.
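The hybrid just described (undersample the majority class, then oversample the minority with SMOTE) can be sketched in a few lines. Both helpers below are hypothetical illustrations written for this sketch, not code from [31]:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, target_size, rng):
    """Keep a random subset of the majority class (hypothetical helper)."""
    idx = rng.choice(len(X_maj), size=target_size, replace=False)
    return X_maj[idx]

def smote_like_oversample(X_min, n_new, rng, k=5):
    """Generate n_new synthetic minority points on segments between a sample
    and one of its k nearest minority neighbours (SMOTE's generation rule)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest, excluding the point itself
        j = rng.choice(nn)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

# toy data: 1000 majority vs 50 minority points
X_maj = rng.normal(0, 1, size=(1000, 4))
X_min = rng.normal(3, 1, size=(50, 4))
X_maj_small = random_undersample(X_maj, 200, rng)
X_min_big = np.vstack([X_min, smote_like_oversample(X_min, 150, rng)])
# classes are now balanced at 200 vs 200 instead of 1000 vs 50
```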

Similarly, in [33], the authors analyzed several common strategies for modeling unbalanced data, including multiple undersampling, thresholding at a ratio of 1:3, undersampling, one-sided undersampling, similarity undersampling, cluster undersampling, diversity undersampling, and threshold selection alone. In total, seven methods were compared using HTS datasets extracted from PubChem. Their analysis led to the proposal of a new hybrid method, which includes both cost-sensitive learning and undersampling approaches. The authors claim that the multisample and hybrid methods provide more accurate predictions than the other methods. Besides, other research studies have applied imbalance algorithms to toxicity datasets, as in [34]. That model was based on a data ensemble in which each model sees an equal distribution of the two classes, toxic and nontoxic. It considers various aspects of creating computational models to predict cell toxicity based on a cell proliferation screening dataset. Such predictive models can be helpful in evaluating cell-based screening outcomes in general by bringing feature-based data into such datasets, and they could also be utilized as a method to recognize and remove potentially undesirable compounds. The authors concluded that the issue of data imbalance hindered the accuracy of the critical activity classification. They also developed an artificial random forest ensemble model designed to mitigate dataset imbalance in predicting cell toxicity.

[Figure 1 shows a taxonomy of virtual screening methods: ligand-based methods (similarity searching, pharmacophore mapping, and machine learning methods, applicable when actives, or actives and inactives, are known) and structure-based methods (protein-ligand docking, applicable when the 3D structure of the target is known).]

Figure 1: Taxonomy for the 3D structure of VS methods [25].

4 Complexity

The authors in [35] used a simple oversampling approach to build an SVM model classifying compounds based on expected cytotoxicity against the Jurkat cell line. Oversampling the minority has been shown to contribute to better predictive SVM models in training groups and external test groups. Subsequently, the authors in [36] analyzed and validated the importance of different sampling methods over the nonsampling method in order to achieve well-balanced sensitivity and specificity of the ML model created for unbalanced chemical data. That study achieved an accuracy of 93.00%, an area under the curve (AUC) of 0.94, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and random forest classification to predict drug-induced liver injury. Although the literature shows that some of the proposed approaches have succeeded, to some extent, in addressing the issues of unbalanced PubChem datasets, they still lack time efficiency during calculations.

3. Proposed KSMOTE Framework

K-mean Synthetic Minority Oversampling Technique (KSMOTE) is proposed in this paper as a solution to the virtual screening imbalance problem in drug discovery. KSMOTE combines the K-mean and SMOTE algorithms to counter the imbalance of the original datasets, ensuring that the number of minority samples is as close as possible to the number of majority samples. As shown in Figure 2, the data were first separated into two sets: one set contains the majority samples, and the other contains the entire minority sample. Both the majority and minority samples were then clustered into K clusters, where K is greater than one in both cases; the number of clusters for each class is chosen according to the elbow method. The Euclidean distance was employed to calculate the distance between the center of each majority cluster and the center of each minority cluster. Each majority cluster was combined with a minority cluster subset to make K separate training datasets, and this combination was done based on the largest distance between each majority and minority cluster. SMOTE was then applied to each combination of clusters, generating synthetic oversampled instances of the minority class. For any minority example, its k (5 in SMOTE) nearest neighbors of the same class are determined, and some instances are then randomly selected according to the oversampling factor. After that, new synthetic examples are generated along the line between the minority example and its nearest chosen example.
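The clustering-and-pairing logic described above can be sketched as follows. This is a simplified illustration only, assuming scikit-learn's KMeans and a hand-rolled SMOTE-style generator; the paper's actual implementation runs on Apache Spark, and the toy arrays here stand in for compound feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def smote(X_min, n_new, k=5):
    """SMOTE's rule: a new point on the segment between a minority sample
    and one of its k nearest minority-class neighbours."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

# toy majority/minority data standing in for inactive/active compounds
X_maj = rng.normal(0, 1, size=(600, 5))
X_min = rng.normal(2, 1, size=(40, 5))

K = 3  # in the paper, K is chosen with the elbow method
maj_km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_maj)
min_km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_min)

# Euclidean distance between every majority and minority centroid
dist = np.linalg.norm(
    maj_km.cluster_centers_[:, None, :] - min_km.cluster_centers_[None, :, :],
    axis=2)

training_sets = []
for m in range(K):
    n = int(np.argmax(dist[m]))            # minority cluster farthest from m
    Xm = X_maj[maj_km.labels_ == m]
    Xn = X_min[min_km.labels_ == n]
    n_new = max(len(Xm) - len(Xn), 0)      # oversample minority up to majority size
    if len(Xn) > 1 and n_new > 0:          # SMOTE needs at least 2 samples
        Xn = np.vstack([Xn, smote(Xn, n_new)])
    training_sets.append((Xm, Xn))         # one balanced training set per pairing
```

Each of the K resulting training sets is then balanced, so the downstream classifiers no longer see the original skew.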

3.1. Environment Selection and the Dataset. Since we are dealing with big data, a big data framework must be chosen. One of the most powerful frameworks used in many data analytics tasks is Spark, a well-known and very reliable cluster computing engine. It presents application programming interfaces in various programming languages such as Java, Python, Scala, and R. Spark supports in-memory computing, allowing records to be handled much faster than in disk-based engines such as Hadoop; the Spark engine is optimized for in-memory as well as disk-based processing [37]. It has been installed on different operating systems such as Windows and Ubuntu.

This paper implements the proposed approach using PySpark version 2.4.3 [38] and Hadoop version 2.7 installed on Ubuntu 18.04, with Python as the programming language; a Jupyter notebook, version 3.7, was used. The computer configuration for the experiments was a local machine with an Intel Core i7 processor at 2.8 GHz and 8 GB of RAM. To illustrate the performance of the proposed framework, three datasets were chosen. They were carefully selected so that each differs in its imbalance ratio, and they are also large enough to illustrate the big data analysis challenge. The three datasets are AID 440 [39], AID 624202 [40], and AID 651820 [41], all retrieved from the PubChem database library [5]. The three datasets are summarized in Table 1 and briefly described in the following paragraphs. All of the data exist in an unstructured format as SDF files; therefore, they require some preprocessing to be accepted as input to the proposed platform.

(1) AID 440 concerns the formylpeptide receptor (FPR). The G-protein-coupled formylpeptide receptor was one of the originating chemoattractant receptor members [39]. The dataset consists of 185 active and 24,815 inactive molecules.

(2) AID 624202 is a qHTS test to identify small-molecule stimulants of BRCA1 expression. BRCA1 is involved in a wide range of cellular activities, including DNA damage repair, cell cycle checkpoint control, growth inhibition, programmed cell death, transcription regulation, chromatin recombination, protein presence, and autogenous stem cell regeneration and differentiation. An increase in BRCA1 expression would enable cellular differentiation and restore tumor inhibitor function, leading to delayed tumor growth and less aggressive, more treatable breast cancer. Promising stimulants of BRCA1 expression could thus be new preventive or curative factors against breast cancer [40].

(3) AID 651820 is a qHTS examination for hepatitis C virus (HCV) inhibitors [41]. About 200 million people globally are infected with hepatitis C. Many infected individuals progress to chronic liver disease, including cirrhosis, with the risk of developing hepatic cancer. There is no effective hepatitis C vaccine available to date. Current interferon-based therapy is effective only in about half of patients and is associated with significant adverse effects; it is estimated that the fraction of people with HCV who can complete treatment is no more than 10 percent. The recent development of direct-acting antivirals against HCV, such as protease and polymerase inhibitors, is promising. However, these still require combination with peginterferon and ribavirin for maximum efficacy. Moreover, these agents are associated with a high resistance rate, and many have significant side effects (Figure 3).


3.2. Preprocessing. Chemical compounds (molecules) are stored in SDF files, so those files have to be converted to feature vectors [d1, d2, d3, ..., d1444] with 1444 dimensions, and PaDEL cheminformatics software [42] was suitable for this process. PaDEL provides two types of descriptors: numeric and fingerprint descriptors. A PaDEL numeric descriptor gives information about the quantity of a feature in each compound. Molecules are represented based on constitutional, topological, and geometrical descriptors as well as other molecular properties. These include aliphatic ring count, aromatic ring count, logP, donor count, polar surface area, and the Balaban index; the Balaban index is a real number and can be either positive or negative. PaDEL is also used as a fingerprint descriptor, giving 881 attributes. The fingerprints were also calculated in order to compare the model performance with that derived from the PaDEL numeric descriptors. Molecules are referred to as instances and labeled as 1 or 0, where 1 means active and 0 means inactive. The reader is directed to [43] for further information about the descriptors and fingerprints.
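PaDEL itself is a Java tool; as a hypothetical sketch of the step that follows it, assume its numeric descriptors have been exported to CSV with an activity column. Converting that output to a labeled feature matrix is then straightforward (the column names below are illustrative, not actual PaDEL descriptor names):

```python
import csv
import io

import numpy as np

# Hypothetical CSV resembling exported PaDEL-style numeric output:
# one row per molecule, numeric descriptors, plus an activity label.
raw = io.StringIO(
    "name,nAromRing,ALogP,TopoPSA,activity\n"
    "mol1,2,1.3,45.2,Active\n"
    "mol2,0,-0.7,90.1,Inactive\n"
    "mol3,1,2.2,30.5,Inactive\n")

rows = list(csv.DictReader(raw))
feature_names = [c for c in rows[0] if c not in ("name", "activity")]
X = np.array([[float(r[c]) for c in feature_names] for r in rows])
y = np.array([1 if r["activity"] == "Active" else 0 for r in rows])  # 1 = active
```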

4. Evaluation Metrics

Instead of utilizing complicated metrics, four intuitive and functional metrics (specificity, sensitivity, G-mean, and accuracy) were introduced, according to the following

[Figure 2 depicts the proposed model: the big dataset is split into 80% training and 20% test data; the training data are divided into majority and minority classes; each class is clustered into k clusters; the distance between each centroid in the majority and minority clusters is calculated; the clusters with the largest distance are combined into combinations 1..k; SMOTE is applied to each combination; all clusters are combined after SMOTE; and the classification algorithms produce the results.]

Figure 2: The proposed model.

Table 1: Benchmark imbalanced datasets.

Dataset       Total records   Inactive   Active   Balance ratio
AID 440            25,000       24,815      185       1:134
AID 624202        377,550      364,035    3,980       1:91.5
AID 651820        283,005      271,341   11,664       1:23
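The balance ratios in Table 1 follow directly from the inactive-to-active counts:

```python
# Counts taken from Table 1: (inactive, active) per dataset.
datasets = {
    "AID 440":    (24815, 185),
    "AID 624202": (364035, 3980),
    "AID 651820": (271341, 11664),
}
ratios = {name: inactive / active
          for name, (inactive, active) in datasets.items()}
# AID 440 -> ~134, AID 624202 -> ~91.5, AID 651820 -> ~23
```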

Figure 3: Molecules (chemical compound) [42].


reasons. First, the predictive power of the classification method for each sample, particularly the predictive power for the minority group (i.e., active compounds), is demonstrated by measuring both sensitivity and specificity. Second, G-mean is a combination of sensitivity and specificity, indicating a compromise between the majority and minority performance of the classifier. Poor quality in predicting positive samples reduces the G-mean value, even when negative samples are classified with high accuracy; this is a typical state for imbalanced data collections. It is strongly recommended that external predictions be used to build a reliable prediction model [41, 44]. The four statistical assessment methods are described as follows.

(1) Sensitivity: the proportion of positive samples appropriately classified and labeled; it can be determined by the following equation:

sensitivity = TP / (TP + FN), (2)

where true positive (TP) corresponds to the correct classification of positive samples (in this work, active compounds), true negative (TN) corresponds to the correct classification of negative samples (i.e., inactive compounds), false positive (FP) means that negative samples have been incorrectly classified as positive, and false negative (FN) indicates incorrectly classified positive samples.

(2) Specificity: the proportion of negative samples that are correctly classified; its value indicates how many of the cases predicted to be negative are truly negative, as stated in the following equation:

specificity = TN / (TN + FP). (3)

(3) G-mean: it offers a simple way to measure the model's capability to correctly classify active and inactive compounds by combining sensitivity and specificity into a single metric. G-mean is a measure of balanced accuracy [45] and is defined as follows:

G-mean = sqrt(specificity × sensitivity). (4)

G-mean is a hybrid of sensitivity and specificity, indicating a balance between majority and minority classification results. Low performance in forecasting positive samples reduces the G-mean value even when negative samples are classified with high accuracy; this is a typical unbalanced dataset condition [45].

(4) Accuracy: it shows the capability of a model to correctly predict the class labels, as given in the following equation:

accuracy = (TP + TN) / (TP + TN + FP + FN). (5)
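Equations (2)-(5) translate directly into code; the confusion-matrix counts below are illustrative:

```python
import math

def metrics(tp, tn, fp, fn):
    """Equations (2)-(5): sensitivity, specificity, G-mean, and accuracy."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    gmean = math.sqrt(sens * spec)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sens, spec, gmean, acc

# e.g. 100 actives (90 caught), 1000 inactives (950 caught)
sens, spec, gmean, acc = metrics(tp=90, tn=950, fp=50, fn=10)
# sens = 0.9, spec = 0.95, gmean ≈ 0.925, acc ≈ 0.945
```

Note how a classifier that misses most actives can still score a high accuracy on imbalanced data, while its G-mean collapses, which is why G-mean is the headline metric here.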

4.1. Experimental Results. This section presents the results based on extensive sets of experiments; the averages of the experiments are reported and discussed. Tenfold cross-validation is used to evaluate the performance of the proposed model: the original sample retrieved from each dataset is randomly divided into ten equal-sized subsamples; one subsample is retained as validation data for model testing, while the remaining nine subsamples are used as training data. The validation process is then replicated ten times (folds), using each of the ten subsamples as validation data exactly once, and the average of the ten outcomes from the folds is computed. To validate the effectiveness of KSMOTE, the performance of the proposed model is compared with that of SMOTE-only and no-sampling models. Furthermore, two types of descriptors, numeric PaDEL and fingerprint, are used to validate their impact on the model's performance in all configurations. The results presented in this work are based on the original test groups only, without oversampling. The two descriptor types, PaDEL and fingerprint, are combined with five algorithms (RF, DT, MLP, LG, and GBT) to validate their effect on model output, and the performance of applying these algorithms is examined.
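The tenfold procedure just described amounts to the following index bookkeeping (a plain-Python sketch; in practice the splitting would be done inside the Spark pipeline):

```python
import numpy as np

def tenfold_indices(n, seed=0):
    """Split n sample indices into 10 roughly equal random folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 10)

folds = tenfold_indices(105)
for i, test_fold in enumerate(folds):
    # each fold serves as validation data exactly once
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(train_idx) + len(test_fold) == 105
```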

4.2. G-Mean-Based Performance. In this section, G-mean results are presented for the three selected datasets. Figure 4 shows the G-mean for the AID 440 dataset based on both the PaDEL numeric and fingerprint descriptors, demonstrating the performance of the various descriptor sets. It also compares three different approaches: no-sample, SMOTE, and KSMOTE. In addition, the performance of employing the RF, DT, MLP, LG, and GBT classifiers with the three different approaches is examined; 20% of each dataset is used for testing, as mentioned before.

As shown in Figure 4, the best G-mean in the case of the PaDEL fingerprint descriptor was gained by KSMOTE, where it reaches 0.963 on average. Utilizing KSMOTE with LG gives a G-mean of almost 0.97, which is the best value across the classifiers. Based on our experiments, SMOTE- and no-sample-based PaDEL fingerprint descriptors are not recommended for virtual screening and classification, as their G-mean values are too small.

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance was examined. The results are similar, with KSMOTE showing higher performance than SMOTE and no-sample by nearly 55%. However, KSMOTE with the DT and GBT classifiers does not reach the other classifiers' level in this case; it is therefore recommended to utilize RF, MLP, and LG when the numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not good, they show an enhancement over the PaDEL fingerprint results of 10% on average. Among the overall results, the worst G-mean value is produced by applying the RF classifier with the no-sample approach.

The performance of KSMOTE obtained on the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all of the classifiers. The only drawback is that the G-mean of LG drops by almost 8% relative to the other classifiers' results when the PaDEL fingerprint descriptor is used. On the contrary, the LG classifier shows much more progress than before, with an average G-mean reaching 0.8, which is a good achievement. For the rest of the classifiers, in both descriptors, the G-mean values are enhanced relative to the previous dataset. Still, among all the classifiers, the G-mean value for the RF classifier with the no-sample approach is the worst. The overall conclusion is that KSMOTE is recommended for use with both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4-6, collecting all of the results in one place. It shows the complete set of experiments and the average G-mean values, to give the overall picture. Again, as can be seen from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning the active and inactive compounds into K clusters and then combining pairs that have large distances leads to an accurate rate of oversampling instances in the SMOTE algorithm, which explains why the proposed model produces the best results.

4.3. Sensitivity-Based Performance. Sensitivity is another metric for measuring the performance of the proposed approach compared to the others. Here, the sensitivity performance presentation is slightly different: the performance of the three approaches is displayed across the three datasets. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches using both descriptors; in addition, SMOTE outperforms the no-sample approach in almost all cases. For the AID 440 dataset, a low sensitivity value of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of PaDEL descriptor and fingerprint for AID 440.

Figure 5: G-mean of PaDEL descriptor and fingerprint for AID 624202.


Table 2: Complete set of experiments and average G-mean results.

                  PaDEL numeric descriptor               PaDEL fingerprint
Dataset     Alg.  No-sample  SMOTE   KSMOTE  Time   No-sample  SMOTE   KSMOTE  Time

AID 440     RF    0.167      0.565   0.954   23     0.29       0.442   0.96    12
            DT    0.5        0.59    0.937   93     0.51       0.459   0.958   49
            MLP   0.6        0.5     0.963   20     0.477      0.498   0.964   96
            LG    0.56       0.67    0.963   11     0.413      0.512   0.96    56
            GBT   0.23       0.56    0.963   33     0.477      0.421   0.963   171

AID 624202  RF    0.445      0.625   0.952   297    0.5        0.628   0.96    153
            DT    0.576      0.614   0.94    10     0.54       0.564   0.94    5
            MLP   0.74       0.715   0.95    252    0.636      0.497   0.958   135
            LG    0.628      0.83    0.94    268    0.791      0.78    0.837   1325
            GBT   0.489      0.61    0.95    45     0.495      0.482   0.954   2236

AID 651820  RF    0.722      0.792   0.956   41     0.741      0.798   0.92    1925
            DT    0.725      0.72    0.932   878    0.765      0.743   0.89    444
            MLP   0.82       0.817   0.915   35     0.788      0.8     0.91    173
            LG    0.779      0.8357  0.962   19     0.75       0.768   0.89    936
            GBT   0.714      0.742   0.9     605    0.762      0.766   0.905   299

Figure 6: G-mean of numeric PaDEL descriptor and fingerprint for AID 651820.

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor.


oversample this minority class, significantly improving predictability, with the LG sensitivity jumping from 0.314 to 0.46. The KSMOTE algorithm was then used to resample this minority class to boost the predictability of the class of interest, which represents active compounds; with KSMOTE, the LG sensitivity value jumped from 0.46 to 0.94.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds), with an exceptionally weak sensitivity of 0.2, RF is the most significantly improved. The sensitivity value also improves dramatically, from 0.4 to 0.8, in LG with the incorporation of SMOTE. With KSMOTE, the sensitivity value in MLP rises considerably, from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with a sensitivity of 0.728 in MLP for the majority class (inactive compounds).

Figure 8 presents the sensitivity of the three approaches using the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches. As can be seen, the performance differs based on the type of dataset used; KSMOTE, on the contrary, has a stable performance across different classifiers. In other words, the differences across classifiers under KSMOTE are not that noticeable. It is also clear from the figure that, on average, the SMOTE and no-sample approaches have the same performance and behavior when applied to all datasets. Moreover, the sensitivity results are much better on the AID 651820 dataset than on AID 440 and AID 624202. Again, these results are in line with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure: it measures the percentage of negatively classified cases that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using the PaDEL numeric and fingerprint descriptors, respectively. These figures illuminate two points, as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) The fingerprint descriptor results are more stable than the PaDEL numeric descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the produced results using the different classifiers. KSMOTE has superior results in most of the experiments. The sensitivity and specificity results of the three datasets in the numeric and fingerprint descriptors are shown, and the values marked in bold are the highest gained among the results; those values show the efficiency of the proposed method, KSMOTE.

4.5. Computational Time Comparison. One of the issues that algorithms always face is computational time, especially if they are designed to work on limited-resource devices. The models without SMOTE cannot achieve adequate performance for samples of the minority classes. On the other hand, the five classifiers' performance when KSMOTE is used has been proven accurate, using the sensitivity and G-mean values, on almost all of the three PubChem datasets (see Figures 4-8). The computational time is reported in Figures 11 and 12. It is interesting to note that the numeric PaDEL and fingerprint PaDEL descriptors produce similar computational-efficiency rankings among the five classifiers. From the results, DT and LG give the best computation times among all classifiers, followed by MLP. The computational time values for the fingerprint PaDEL descriptor are much smaller than those for the numeric PaDEL descriptor in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of about 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
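The times in Figures 11 and 12 are wall-clock measurements of training. A minimal sketch of such a measurement follows; the timed workload here is a placeholder lambda, not one of the paper's actual classifiers:

```python
import time

def timed_fit(fit_fn):
    """Wall-clock a single training call (sketch of the timing harness)."""
    t0 = time.perf_counter()
    fit_fn()
    return time.perf_counter() - t0

# placeholder workload standing in for e.g. a GBT fit on one dataset
elapsed = timed_fit(lambda: sum(i * i for i in range(100000)))
```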

Figure 8: Sensitivity of all datasets for the fingerprint descriptor.


5. Discussion

It has been established that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Table 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.

Moreover, HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms, such as Spark in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. A model constructed from an imbalanced dataset, however, appears to move the decision hyperplane away from the optimal location toward the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity; such a model's predictive power can therefore be significantly weak. We not only relied on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with a blind dataset, which included 37 active compounds and 4,963 inactive compounds for AID 440. KSMOTE on AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds with an accuracy of 95%. In comparison, AID 624202 contains 796 active compounds and 72,807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor.

Figure 10: Specificity of all datasets for the fingerprint descriptor.


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors (values given as sensitivity/specificity).

                  PaDEL numeric descriptor                      PaDEL fingerprint
Dataset     Alg.  No-sample    SMOTE        KSMOTE       No-sample    SMOTE        KSMOTE

AID 440     RF    0.028/0.998  0.32/0.997   0.91/0.997   0.09/0.985   0.2/0.998    0.948/0.99
            DT    0.25/0.992   0.36/0.986   0.9/0.98     0.257/0.99   0.214/0.985  0.93/0.98
            MLP   0.37/0.996   0.26/0.993   0.94/0.993   0.22/0.99    0.25/0.994   0.951/0.99
            LG    0.314/0.998  0.46/0.98    0.94/0.98    0.17/0.998   0.267/0.976  0.94/0.98
            GBT   0.05/0.997   0.32/0.993   0.94/0.991   0.228/0.99   0.17/0.995   0.95/0.98

AID 624202  RF    0.2/0.993    0.39/0.965   0.905/0.993  0.25/0.997   0.4/0.987    0.93/0.99
            DT    0.351/0.958  0.4/0.957    0.92/0.956   0.305/0.946  0.33/0.934   0.924/0.958
            MLP   0.57/0.967   0.53/0.993   0.928/0.96   0.418/0.961  0.24/0.964   0.956/0.971
            LG    0.4/0.807    0.8/0.806    0.921/0.81   0.775/0.987  0.76/0.86    0.825/0.984
            GBT   0.25/0.985   0.38/0.957   0.91/0.985   0.249/0.994  0.23/0.9848  0.924/0.995

AID 651820  RF    0.529/0.983  0.648/0.944  0.94/0.94    0.56/0.98    0.67/0.966   0.9/0.95
            DT    0.57/0.923   0.579/0.876  0.91/0.9     0.65/0.9     0.63/0.896   0.88/0.95
            MLP   0.728/0.95   0.716/0.943  0.9/0.913    0.58/0.93    0.673/0.933  0.9/0.93
            LG    0.62/0.964   0.79/0.856   0.94/0.9     0.59/0.97    0.69/0.876   0.88/0.95
            GBT   0.52/0.97    0.56/0.93    0.89/0.914   0.63/0.972   0.63/0.967   0.91/0.91


compounds, and AID 651820 contains 2,332 active compounds and 54,269 inactive compounds. For AID 624202, the model is thus able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds with an accuracy of 94%.

Also, for AID 651820, the model is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds with an accuracy of 92%. KSMOTE is considered better than the other schemes (SMOTE only and no-sample) because, from the beginning, it relies on the dissimilarity between clusters when drawing the cluster samples. This dissimilarity between samples increases the classifiers' accuracy; in addition, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, the cluster-based oversampling (CBOS) produced new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.

Recent developments in technology allow high-throughput screening facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Its ability to detect active compounds makes KSMOTE a promising tool for data mining applications investigating biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly compared to traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task on unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling method depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the dataset used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-mean and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized applicability of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and high prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase the amount of labeled chemical data. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.

[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.

[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39-63, 2002.

[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53-62, 2007.

[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.

[6] R. Barandela, J. S. Sanchez, V. Garcia, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849-851, 2003.

[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.

[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052-1057, 2010.

[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184-192, 2001.

[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18-36, 2004.

[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958-971, 2013.

[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429-449, 2002.

[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315-354, 2003.

[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162-164, 1984.

[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.

[16] S. del Rio, V. Lopez, J. M. Benitez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112-137, 2014.

[17] A. Fernandez, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805-9812, 2009.

[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.

[19] V. Lopez, A. Fernandez, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585-6608, 2012.

[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72-81, 2007.

[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers - a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583-590, 2018.

[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705-712, 2014.

[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1-20, 2010.

[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229-235, 2019.

[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862-865, 2004.

[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.

[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.

[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72-88, 2020.

[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109-135, 1993.

[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310-3316, 2009.

[32] R. Guha and S. C. Schurer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367-384, 2008.

[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.

[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.

[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.

[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.

[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.

[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.

[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al)/02_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.

[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112-118, IEEE, Washington, DC, USA, December 2014.

[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481-487, 2019.

[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing - ICMLC, pp. 107-111, Singapore, February 2017.

[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229-236, 2011.

[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, p. 25, 2010.

[45] S. P. R. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107-111, ACM, Singapore, February 2017.



A decision tree (DT) is represented as an actual tree, with its root at the top and the leaves at the bottom. The root of the tree is divided into two or more branches, and each branch may in turn be broken down into two or more branches. This process continues until a leaf is reached, meaning no more splits remain.

The multilayer perceptron (MLP) belongs to the family of artificial neural networks (ANNs), which come in two main types: supervised and unsupervised networks. Every network consists of a series of linked neurons. A neuron takes multiple numerical inputs and outputs a value depending on the weighted sum of its inputs. Popular transformation functions include tanh and sigmoid. Neurons are organized into layers; an ANN can contain several hidden layers, and when neurons are linked only to those in the next layer, the networks are known as feedforward networks, such as multilayer perceptrons (MLPs) or radial basis function networks (RBNs) [29].

Logistic regression (LR) is one of the simplest and most frequently utilized ML algorithms for two-class classification. It is simple to implement and can be utilized as the baseline for any binary classification problem. Its basic principles are also constructive in deep learning. Logistic regression defines and estimates the relationship between a single dependent binary variable and the independent variables. It is also a mathematical method for predicting binary classes; the effect or target variable is dichotomous in nature, meaning that only two groups are possible, for instance, in cancer detection problems [30].

Gradient boosting is a type of ML boosting. It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. The name gradient boosting arises because target outcomes for each case are set based on the gradient of the error with respect to the prediction. Each new model takes a step in the direction that minimizes prediction error, in the space of possible predictions for each training case [30].
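The residual-fitting intuition described above can be sketched in a few lines. The following is a generic, minimal illustration with squared-error loss and depth-1 regression trees (not the paper's Spark-based GBT implementation); the data, learning rate, and number of rounds are arbitrary choices for the example.

```python
# Minimal gradient-boosting sketch: each new tree is fit to the residuals,
# i.e., the negative gradient of the squared-error loss at the current
# ensemble prediction, and the ensemble takes a small step in that direction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

prediction = np.full_like(y, y.mean())   # start from a constant model
learning_rate = 0.5
trees = []
for _ in range(50):
    residuals = y - prediction           # negative gradient of 1/2 * (y - f)^2
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)
    trees.append(stump)

mse = float(np.mean((y - prediction) ** 2))
```

After 50 rounds, the ensemble's training error is far below that of the initial constant model, which is exactly the stepwise error reduction the paragraph describes.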

In [31], the authors compared four Weka classifiers, including SVM, J48 tree, NB, and RF. From a complete cost-sensitive survey, SVM and the C4.5 tree (J48) performed well when taking minority class sizes into account. It was shown that a hybrid of majority-class undersampling and SMOTE can improve overall classification performance in an imbalanced dataset. In addition, the authors in [32] employed a repetitive SVM as a sampling method for SVM processing of bioassay information on luciferase inhibition, which has a high active/inactive imbalance ratio (1:377). The models' highest performance was 86.60% and 88.89% for active compounds and inactive compounds, respectively, associated with a combined precision of 87.74% when using validation and blind tests. These findings indicate the quality of the proposed approach for managing the intrinsic imbalance problem in HTS data used to cluster possible interference compounds for virtual screening utilizing luciferase-based HTS experiments.

Similarly, in [33], the authors analyzed several common strategies for modeling unbalanced data, including multiple undersampling, threshold ratio 1:3 undersampling, one-sided undersampling, similarity undersampling, cluster undersampling, diversity undersampling, and only threshold selection. In total, seven methods were compared using HTS datasets extracted from PubChem. Their analysis led to the proposal of a new hybrid method, which includes both low-cost learning and less-sampling approaches. The authors claim that the multisample and hybrid methods provide more accurate predictions than other methods. Besides, some other research studies used toxicity datasets in imbalance algorithms, as in [34]. The model was based on a data ensemble, where each model sees an equal distribution of the two toxic and nontoxic classes. It considers various aspects of creating computational models to predict cell toxicity based on a cell proliferation screening dataset. Such predictive models can be helpful in evaluating cell-based screening outcomes in general by bringing feature-based data into such datasets. They could also be utilized as a method to recognize and remove potentially undesirable compounds. The authors concluded that the issue of data imbalance hindered the accuracy of the critical activity classification. They also developed an artificial random forest group model designed to mitigate dataset misalignment in predicting cell toxicity.

Figure 1: Taxonomy for the 3D structure of VS methods [25] (when the 3D structure of the target is known, structure-based methods such as protein-ligand docking apply; when actives, or actives and inactives, are known, ligand-based methods apply, including similarity searching, pharmacophore mapping, and machine learning methods).


The authors in [35] used a simple oversampling approach to build an SVM model classifying compounds based on expected cytotoxicity versus the Jurkat cell line. Oversampling the minority has been shown to contribute to better predictive SVM models in training groups and external test groups. Consequently, the authors in [36] analyzed and validated the importance of different sampling methods over the nonsampling method in order to achieve well-balanced sensitivity and specificity of the ML model created for unbalanced chemical data. Additionally, the study achieved an accuracy of 93.00%, an area under the curve (AUC) of 0.94, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and random forest classification to predict drug-induced liver injury. Although it was presented in the literature that some of the proposed approaches have somehow succeeded in responding to the issues of unbalanced PubChem datasets, there is still a lack of time efficiency during calculations.

3. Proposed KSMOTE Framework

K-Mean Synthetic Minority Oversampling Technique (KSMOTE) is proposed in this paper as a solution to virtual screening for drug discovery problems. KSMOTE combines the K-mean and SMOTE algorithms to counter the imbalance of the original datasets, ensuring that the number of minority samples is as close as possible to the number of majority samples. As shown in Figure 2, the data were first separated into two sets: one set contains the majority samples, and the other contains the entire minority sample. First, the majority samples were clustered into K clusters, and likewise the minority samples, where K is greater than one in both cases. The number of clusters for each class is chosen according to the elbow method. The Euclidean distance was employed to calculate the distance between the center of each majority cluster and the center of each minority cluster. Each majority cluster sample was combined with a minority cluster sample subset to make K separate training datasets. This combination was done based on the largest distance between each majority and minority cluster. SMOTE was then applied to each combination of clusters to generate synthetic oversampled instances of the minority class. For any minority example, the k (5 in SMOTE) nearest neighbors of the same class are determined, and then some instances are randomly selected according to the oversampling factor. After that, new synthetic examples are generated along the line between the minority example and its nearest chosen example.
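The steps above can be sketched on a single machine as follows. This is an illustrative reading of the KSMOTE idea, not the paper's Spark implementation: the function names (`smote`, `ksmote`), the fixed number of clusters (rather than the elbow method), and the choice to oversample each minority subset up to its paired majority cluster's size are assumptions made for the example.

```python
# Single-machine sketch of KSMOTE: cluster each class with k-means, pair each
# majority cluster with the most distant minority cluster (largest Euclidean
# centroid distance), then apply SMOTE-style interpolation inside each pair.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic points along lines to k nearest neighbors."""
    rng = rng or np.random.default_rng(0)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                  # column 0 is the point itself
    base = rng.integers(0, len(X_min), n_new)
    neigh = X_min[idx[base, rng.integers(1, k + 1, n_new)]]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (neigh - X_min[base])

def ksmote(X_maj, X_min, n_clusters=3, rng=None):
    rng = rng or np.random.default_rng(0)
    km_maj = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_maj)
    km_min = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_min)
    # Euclidean distances between every majority and minority centroid.
    d = np.linalg.norm(km_maj.cluster_centers_[:, None, :]
                       - km_min.cluster_centers_[None, :, :], axis=2)
    subsets = []
    for i in range(n_clusters):
        j = int(d[i].argmax())                     # most distant minority cluster
        maj = X_maj[km_maj.labels_ == i]
        mino = X_min[km_min.labels_ == j]
        n_new = max(0, len(maj) - len(mino))
        if len(mino) > 1 and n_new:
            mino = np.vstack([mino, smote(mino, n_new, rng=rng)])
        subsets.append((maj, mino))
    return subsets
```

Each returned subset is one of the K training datasets described above, with its minority side oversampled to match the majority side.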

3.1. Environment Selection and the Dataset. Since we are dealing with big data, a big data framework must be chosen. One of the most powerful frameworks used in many data analytics tasks is Spark, a well-known and very reliable cluster computing engine. It presents application programming interfaces in various programming languages such as Java, Python, Scala, and R. Spark supports in-memory computing, allowing it to handle records much faster than disk-based engines such as Hadoop; the Spark engine is optimized for in-memory processing as well as disk-based processing [37]. It has been installed on different operating systems such as Windows and Ubuntu.

This paper implements the proposed approach using PySpark version 2.4.3 [38] and Hadoop version 2.7 installed on Ubuntu 18.04, with Python used as the programming language. A Jupyter notebook (version 3.7) was used. The computer configuration for the experiments was a local machine with an Intel Core i7 at 2.8 GHz and 8 GB of RAM. To illustrate the performance of the proposed framework, three datasets were chosen. They were carefully chosen such that each differs in its imbalance ratio, and they are also large enough to illustrate the big data analysis challenge. The three datasets are AID 440 [39], AID 624202 [40], and AID 651820 [41]. All of them were retrieved from the PubChem database library [5]. The three datasets are summarized in Table 1 and briefly described in the following paragraphs. All of the data exist in an unstructured format as SDF files; therefore, they require some preprocessing to be accepted as input to the proposed platform.

(1) AID 440 concerns the formylpeptide receptor (FPR). The G-protein coupled formylpeptide receptor was one of the original chemoattractant receptor members [39]. The dataset consists of 185 active and 24815 nonactive molecules.

(2) AID 624202 is a qHTS test to identify small-molecule stimulants of BRCA1 expression. BRCA1 is involved in a wide range of cellular activities, including repairing DNA damage, cell cycle checkpoint control, growth inhibition, programmed cell death, transcription regulation, chromatin recombination, protein presence, and autogenous stem cell regeneration and differentiation. An increase in BRCA1 expression would enable cellular differentiation and restore tumor inhibitor function, leading to delayed tumor growth and less aggressive, more treatable breast cancer. Promising stimulants of BRCA1 expression could be new preventive or curative factors against breast cancer [40].

(3) AID 651820 is a qHTS examination for hepatitis C virus (HCV) inhibitors [41]. About 200 million people globally are infected with hepatitis C (HCV). Many infected individuals progress to chronic liver disease, including cirrhosis, with the risk of developing hepatic cancer. There is no effective hepatitis C vaccine available to date. Current interferon-based therapy is effective only in about half of patients and is associated with significant adverse effects. It is estimated that the fraction of people with HCV who can complete treatment is no more than 10 percent. The recent development of direct-acting antivirals against HCV, such as protease and polymerase inhibitors, is promising. However, it still requires a combination of peginterferon and ribavirin for maximum efficacy. Moreover, these agents are associated with a high resistance rate, and many have significant side effects (Figure 3).


3.2. Preprocessing. A chemical compound (molecule) is stored in an SDF file, so those files have to be converted to feature vectors [d1, d2, ..., d1444] with 1444 dimensions, and it was suitable to use the PaDEL cheminformatics software [42] for this process. There are two types of PaDEL descriptors: numeric and fingerprint descriptors. A PaDEL numeric descriptor gives information about the quantity of a feature in each compound. Molecules are represented based on constitutional, topological, and geometrical descriptors as well as other molecular properties. This includes aliphatic ring count, aromatic ring count, logP, donor count, polar surface area, and Balaban index. The Balaban index is a real number and can be either positive or negative. PaDEL is also used as a fingerprint descriptor, giving 881 attributes. The fingerprint was also calculated in order to compare the model performance derived from PaDEL descriptors. Molecules are referred to as instances and labeled as 1 or 0, where 1 means active and 0 means not active. The reader is directed to [43] for further information about the descriptors and fingerprints.

4. Evaluation Metrics

Instead of utilizing complicated metrics, four intuitive and functional metrics (specificity, sensitivity, G-mean, and accuracy) were introduced, according to the following

Figure 2: The proposed model (the big dataset is split 80%/20% into training and test data; the training data are divided into majority and minority classes; each class is clustered into k clusters; the distances among the centroids of the majority and minority clusters are calculated; the clusters with the largest distance are combined; SMOTE is applied to each combination; all clusters are combined after SMOTE and passed to the classification algorithms).

Table 1: Benchmark imbalanced datasets.

Datasets      Total records   Inactive   Active   Balance ratio
AID 440           25000         24815       185       1:134
AID 624202       377550        364035      3980       1:91.5
AID 651820       283005        271341     11664       1:23
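The balance ratios in Table 1 follow directly from the inactive and active record counts; a quick check (all counts taken from the table):

```python
# Reproduce Table 1's balance ratios as inactive/active, one per dataset.
datasets = {
    "AID 440":    (24815, 185),
    "AID 624202": (364035, 3980),
    "AID 651820": (271341, 11664),
}
for name, (inactive, active) in datasets.items():
    print(f"{name}: 1:{inactive / active:.1f}")
```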

Figure 3: Molecules (chemical compound) [42].


reasons. First, the predictive power of the classification method for each sample, particularly the predictive power of the minority group (i.e., active compounds), is demonstrated by measuring performance for both sensitivity and specificity. Second, G-mean is a combination of sensitivity and specificity, indicating a compromise between the majority and minority output of the classification. Poor quality in predicting positive samples reduces the G-mean value, even when negative samples are classified with high accuracy. This is a typical state for imbalanced data collection. It is strongly recommended that external predictions be used to build a reliable prediction model [41, 44]. The four statistical assessment methods are described as follows:

(1) Sensitivity: the proportion of positive samples appropriately classified and labeled; it can be determined by the following equation:

sensitivity = TP / (TP + FN),  (2)

where true positive (TP) corresponds to the correct classification of positive samples (e.g., in this work, active compounds), true negative (TN) corresponds to the correct classification of negative samples (i.e., inactive compounds), false positive (FP) means that negative samples have been incorrectly identified as positive samples, and false negative (FN) indicates incorrectly classified positive samples.

(2) Specificity: the proportion of negative samples that are correctly classified; its value indicates how many of the cases predicted to be negative are truly negative, as stated in the following equation:

specificity = TN / (TN + FP),  (3)

(3) G-mean: it offers a simple way to measure the model's capability to correctly classify active and inactive compounds by combining sensitivity and specificity in a single metric. G-mean is a measure of balanced accuracy [45] and is defined as follows:

G-mean = sqrt(specificity × sensitivity),  (4)

G-mean is a hybrid of sensitivity and specificity, indicating a balance between majority and minority classification results. Low performance in forecasting positive samples reduces the G-mean value, even when negative samples are classified with high accuracy. This is a typical unbalanced dataset condition [45].

(4) Accuracy: it shows the capability of a model to correctly predict the class labels, as given in the following equation:

accuracy = (TP + TN) / (TP + TN + FP + FN).  (5)
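Equations (2)-(5) translate directly into code; the following helper functions (names are ours, not from the paper) compute all four metrics from the confusion-matrix counts:

```python
# The four evaluation metrics, written out from the confusion-matrix counts.
import math

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def g_mean(tp, tn, fp, fn):
    return math.sqrt(specificity(tn, fp) * sensitivity(tp, fn))

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```

These make the motivation above concrete: on AID 440 (185 active, 24815 inactive, from Table 1), a degenerate classifier that labels everything inactive (TP = 0, FN = 185, TN = 24815, FP = 0) scores an accuracy of about 0.993 but a G-mean of exactly 0, which is why G-mean and sensitivity, not accuracy, drive the evaluation on imbalanced data.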

4.1. Experimental Results. This section presents the results based on extensive sets of experiments. The average of the experiments is concluded and presented with discussion. Tenfold cross-validation is used to evaluate the performance of the proposed model. The original sample retrieved from the datasets is randomly divided into ten equal-sized subsamples. Out of the ten subsamples, one subsample is maintained as validation data for model testing, while the remaining nine subsamples are used as training data. The validation process is then replicated ten times (folds), using each of the ten subsamples as validation data exactly once. It is then possible to compute the average of the ten outcomes from the folds. To validate the effectiveness of KSMOTE, the performance of the proposed model is compared with that of SMOTE-only and no-sampling models. Furthermore, two types of descriptors, numeric PaDEL and fingerprint, are used to validate their impact on the model's performance for all compilations. The results presented in this work are based on the original test groups only, without oversampling. The two types of descriptors, PaDEL and fingerprint, are fed to the five algorithms (RF, DT, MLP, LG, and GBT) to validate their effect on model output; therefore, the performance of applying these algorithms is examined.
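The tenfold procedure just described can be sketched as below. This is an illustrative single-machine stand-in (with synthetic data and logistic regression) for the paper's Spark pipeline; the stratified variant is shown here so that each fold preserves the class ratio, a common choice for imbalanced labels.

```python
# Tenfold cross-validation sketch: ten equal folds, each used exactly once
# as validation data while the other nine train the model; the ten scores
# are then averaged.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.2).astype(int)  # imbalanced labels

scores = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

mean_accuracy = float(np.mean(scores))
```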

4.2. G-Mean-Based Performance. In this section, G-mean results are presented for the three selected datasets. Figure 4 describes the G-mean for the AID 440 dataset based on both PaDEL and fingerprint descriptors. This figure demonstrates the performance of the various PaDEL descriptor and fingerprint sets. It also shows a comparison between three different approaches: no-sample, SMOTE, and KSMOTE. Besides, the performance of employing the RF, DT, MLP, LG, and GBT classifiers with the three different approaches is examined; 20% of the datasets are used for testing, as mentioned before.

As shown in Figure 4, the best G-mean gained in the case of the PaDEL fingerprint descriptor was by KSMOTE, where it reaches on average 0.963. Moreover, utilizing KSMOTE with LG gives a G-mean of almost 0.97, which is the best value over the other classifiers. Based on our experiments, SMOTE- and no-sample-based PaDEL fingerprint descriptors are not recommended for virtual screening and classification, where their G-mean is too small.

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance is examined. Again, the results are similar, where KSMOTE shows higher performance than SMOTE and no-sample by nearly 55%. However, KSMOTE with the DT and GBT classifiers is not up to the other classifiers' level in this case; therefore, it is recommended to utilize RF, MLP, and LG when a numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not that good, they show an enhancement over the PaDEL fingerprint by an average of 10%. Among the overall results, it has been noticed that the worst G-mean value is produced from applying the RF classifier with the no-sample approach.

The performance of KSMOTE produced from the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all


of the classifiers. The only drawback shown is that the G-mean of the LG classifier lags behind the other classifiers' results by almost 8% in case the PaDEL fingerprint descriptor is used. On the contrary, the LG classifier shows much more progress than before, where its average G-mean reached 0.8, which is a good achievement. For the rest of the classifiers, in both descriptors, G-mean values are enhanced compared with the previous dataset. Still, among all the classifiers, the G-mean value for the RF classifier with the no-sample approach is the worst. The overall conclusion is that KSMOTE is recommended to be used with both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4-6, collecting all of the results in one place. It shows the complete set of experiments and the average G-mean results as values, to give the overall picture. Again, as can be seen from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning the active and nonactive compounds into K clusters and then combining pairs that have large distances led to an accurate rate of oversampling instances in the SMOTE algorithm. This explains why the proposed model produces the best results.

4.3. Sensitivity-Based Performance. Sensitivity is another metric to measure the performance of the proposed approach compared to others. Here, the sensitivity performance presentation is a little bit different, where the performance of the three approaches is displayed for the three datasets. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches using both descriptors. In addition, SMOTE outperforms the no-sample approach in almost all of the cases. For the AID 440 dataset, a low sensitivity value of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of the PaDEL descriptor and fingerprint for AID 440 (no-sample, SMOTE, and KSMOTE with the RF, DT, MLP, LG, and GBT classifiers).

Figure 5: G-mean of the PaDEL descriptor and fingerprint for AID 624202 (no-sample, SMOTE, and KSMOTE with the RF, DT, MLP, LG, and GBT classifiers).


Table 2: Complete set of experiments and average G-mean results.

                        PaDEL numeric descriptor            PaDEL fingerprint
Dataset      Algorithm  No-sample  SMOTE   KSMOTE  Time     No-sample  SMOTE   KSMOTE  Time
AID 440      RF         0.167      0.565   0.954     23     0.29       0.442   0.96      12
             DT         0.5        0.59    0.937     93     0.51       0.459   0.958     49
             MLP        0.6        0.5     0.963     20     0.477      0.498   0.964     96
             LG         0.56       0.67    0.963     11     0.413      0.512   0.96      56
             GBT        0.23       0.56    0.963     33     0.477      0.421   0.963    171
AID 624202   RF         0.445      0.625   0.952    297     0.5        0.628   0.96     153
             DT         0.576      0.614   0.94      10     0.54       0.564   0.94       5
             MLP        0.74       0.715   0.95     252     0.636      0.497   0.958    135
             LG         0.628      0.83    0.94     268     0.791      0.78    0.837   1325
             GBT        0.489      0.61    0.95      45     0.495      0.482   0.954   2236
AID 651820   RF         0.722      0.792   0.956     41     0.741      0.798   0.92    1925
             DT         0.725      0.72    0.932    878     0.765      0.743   0.89     444
             MLP        0.82       0.817   0.915     35     0.788      0.8     0.91     173
             LG         0.779      0.8357  0.962     19     0.75       0.768   0.89     936
             GBT        0.714      0.742   0.9      605     0.762      0.766   0.905    299

Figure 6: G-mean of the numeric PaDEL descriptor and fingerprint for AID 651820 (no-sample, SMOTE, and KSMOTE with the RF, DT, MLP, LG, and GBT classifiers).

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor (no-sample, SMOTE, and KSMOTE with the RF, DT, MLP, LG, and GBT classifiers on AID 440, AID 624202, and AID 651820).


oversample this minority class to significantly improve sensitivity, with LG jumping from 0.314 to 0.46. The KSMOTE algorithm has been used to sample this minority class to boost the predictability of the interesting class, which represents the active compounds. Besides, a sensitivity increase is shown in LG, with the KSMOTE sensitivity value jumping from 0.46 to 0.94.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds) with an exceptionally weak sensitivity of 0.2, RF is more significantly improved. Moreover, the sensitivity value improves dramatically from 0.4 to 0.8 in LG with the incorporation of SMOTE. With KSMOTE, however, the sensitivity value in MLP rises considerably from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with the sensitivity in the majority class (inactive compounds) in MLP reaching 0.728.

Figure 8 presents the sensitivity of the three approaches using fingerprint descriptors. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches. As can be seen, the performance differs based on the type of the used dataset; on the contrary, KSMOTE shows stable performance across the different classifiers. In other words, the differences within KSMOTE are not that noticeable. However, it is clear from the figure that, on average, the SMOTE and no-sample approaches have the same performance and behavior when applied to all datasets. Besides, the sensitivity results became much better when applied to the AID 651820 dataset than when AID 440 and AID 624202 were used. Again, the results agree with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure; it measures the percentage of negatively classified samples that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using PaDEL and fingerprint descriptors, respectively. Those figures illuminate two points as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) Fingerprint descriptor results are more stable than PaDEL numeric descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the results produced using the different classifiers. KSMOTE has superior results in most of the experiments. The sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors are shown, and the values marked in bold are the highest gained values among the results. Those values show the efficiency of the proposed method, KSMOTE.

4.5. Computational Time Comparison. One of the issues that algorithms always face is the computational time, especially if those algorithms are designed to work on limited-resource devices. The models without SMOTE for samples with minority classes cannot achieve adequate performance. On the other hand, the five classifiers' computational time when KSMOTE is used has been proven to be accurate using sensitivity and G-mean values in almost all of the three PubChem datasets (see Figures 4-8). The computed computational time is reported in Figures 11 and 12. It is interesting to note that both numeric PaDEL descriptors and fingerprint PaDEL descriptors produce similar computational efficiency among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values of fingerprint PaDEL are much smaller than the numeric PaDEL descriptor's computational time in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.

Figure 8: Sensitivity of all datasets for the fingerprint descriptor.


5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.
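As a quick sanity check, the quoted G-mean follows directly from the geometric mean of the two reported per-class accuracies (the 0.97/0.98 averages are taken from the paragraph above; the snippet is only an arithmetic verification, not part of the paper's pipeline):

```python
import math

# Reported KSMOTE averages: ~0.97 sensitivity (active), ~0.98 specificity (inactive).
sensitivity = 0.97
specificity = 0.98

# G-mean is the geometric mean of sensitivity and specificity.
g_mean = math.sqrt(sensitivity * specificity)
print(round(g_mean, 3))  # → 0.975, consistent with the 97.5% quoted above
```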

Moreover, HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge: training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms such as Spark, used in this study, are computationally effective; they dramatically decreased computing costs in the optimization phase and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) dataset from the inactive (majority) data with maximum distance. However, a model constructed from an imbalanced dataset appears to move the hyperplane away from the optimal location toward the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity; such a model's predictability can therefore be significantly weak. We not only relied on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with the blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE in AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds at an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor.

Figure 10: Specificity of all datasets for the fingerprint descriptor.


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors (each cell: sensitivity/specificity).

Algorithm   PaDEL numeric descriptor                         PaDEL fingerprint
            No-sample    SMOTE        KSMOTE                 No-sample    SMOTE        KSMOTE

AID 440
RF          0.028/0.998  0.32/0.997   0.91/0.997             0.09/0.985   0.2/0.998    0.948/0.99
DT          0.25/0.992   0.36/0.986   0.9/0.98               0.257/0.99   0.214/0.985  0.93/0.98
MLP         0.37/0.996   0.26/0.993   0.94/0.993             0.22/0.99    0.25/0.994   0.951/0.99
LG          0.314/0.998  0.46/0.98    0.94/0.98              0.17/0.998   0.267/0.976  0.94/0.98
GBT         0.05/0.997   0.32/0.993   0.94/0.991             0.228/0.99   0.17/0.995   0.95/0.98

AID 624202
RF          0.2/0.993    0.39/0.965   0.905/0.993            0.25/0.997   0.4/0.987    0.93/0.99
DT          0.351/0.958  0.4/0.957    0.92/0.956             0.305/0.946  0.33/0.934   0.924/0.958
MLP         0.57/0.967   0.53/0.993   0.928/0.96             0.418/0.961  0.24/0.964   0.956/0.971
LG          0.4/0.807    0.8/0.806    0.921/0.81             0.775/0.987  0.76/0.86    0.825/0.984
GBT         0.25/0.985   0.38/0.957   0.91/0.985             0.249/0.994  0.23/0.9848  0.924/0.995

AID 651820
RF          0.529/0.983  0.648/0.944  0.94/0.94              0.56/0.98    0.67/0.966   0.9/0.95
DT          0.57/0.923   0.579/0.876  0.91/0.9               0.65/0.9     0.63/0.896   0.88/0.95
MLP         0.728/0.95   0.716/0.943  0.9/0.913              0.58/0.93    0.673/0.933  0.9/0.93
LG          0.62/0.964   0.79/0.856   0.94/0.9               0.59/0.97    0.69/0.876   0.88/0.95
GBT         0.52/0.97    0.56/0.93    0.89/0.914             0.63/0.972   0.63/0.967   0.91/0.91


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, for AID 624202, the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds at an accuracy of 94%.

Also, for AID 651820, the model is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds at an accuracy of 92%. KSMOTE is considered better than the other approaches (SMOTE only and no-sample) because it relies from the beginning on the dissimilarity among the cluster samples it takes. This dissimilarity between samples increases the classifiers' accuracy; in addition, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produces new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.
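That restraining step can be sketched as follows (the paper does not give its exact implementation; `clamp_synthetic` is a hypothetical helper that clips each feature of a synthetic sample to the minimum/maximum observed for that feature in the real minority samples):

```python
def clamp_synthetic(samples, minority):
    """Restrain each feature of newly generated samples to the
    [min, max] range observed for that feature in the real minority class."""
    n_features = len(minority[0])
    lo = [min(row[j] for row in minority) for j in range(n_features)]
    hi = [max(row[j] for row in minority) for j in range(n_features)]
    return [[min(max(x, lo[j]), hi[j]) for j, x in enumerate(row)]
            for row in samples]

minority = [[0.2, 1.0], [0.4, 3.0], [0.3, 2.0]]
synthetic = [[0.1, 2.5], [0.5, 4.0]]          # one value below min, one above max
print(clamp_synthetic(synthetic, minority))    # → [[0.2, 2.5], [0.4, 3.0]]
```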

Recent developments in technology allow for high-throughput scanning facilities, large-scale hubs, and individual laboratories that produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds with KSMOTE makes it a promising tool for data mining applications investigating biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly compared to traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling method depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the dataset used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-mean and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized application of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and good prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.

[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.

[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39–63, 2002.

[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53–62, 2007.

[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.

[6] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.

[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.

[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052–1057, 2010.

[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184–192, 2001.

[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.

[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958–971, 2013.

[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.

[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.

[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162–164, 1984.

[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.

[16] S. del Río, V. López, J. M. Benítez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112–137, 2014.

[17] A. Fernández, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805–9812, 2009.

[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.

[19] V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, 2012.

[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72–81, 2007.

[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583–590, 2018.

[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705–712, 2014.

[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1–20, 2010.

[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229–235, 2019.

[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862–865, 2004.

[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.

[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.

[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72–88, 2020.

[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.

[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310–3316, 2009.

[32] R. Guha and S. C. Schürer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367–384, 2008.

[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.

[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.

[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.

[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.

[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.

[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.

[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al)/02_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.

[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112–118, IEEE, Washington, DC, USA, December 2014.

[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481–487, 2019.

[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing (ICMLC), pp. 107–111, Singapore, February 2017.

[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229–236, 2011.

[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, p. 25, 2010.

[45] S. P. R. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107–111, ACM, Singapore, February 2017.


The authors in [35] used a simple oversampling approach to build an SVM model classifying compounds based on expected cytotoxicity against the Jurkat cell line. Oversampling the minority has been shown to contribute to better predictive SVM models in training groups and external test groups. Subsequently, the authors in [36] analyzed and validated the importance of different sampling methods over the nonsampling method in order to achieve well-balanced sensitivity and specificity of the ML model created for unbalanced chemical data. That study reported an accuracy of 93.00%, an area under the curve (AUC) of 0.94, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and random forest classification to predict drug-induced liver injury. Although the literature shows that some of the proposed approaches have somehow succeeded in responding to the issues of unbalanced PubChem datasets, there is still a lack of time efficiency during calculations.

3. Proposed KSMOTE Framework

K-Mean Synthetic Minority Oversampling Technique (KSMOTE) is proposed in this paper as a solution to the imbalance problem of virtual screening in drug discovery. KSMOTE combines the K-mean and SMOTE algorithms to counter the imbalance of the original datasets, ensuring that the number of minority samples is as close as possible to the number of majority samples. As shown in Figure 2, the data were first separated into two sets: one set contains the majority samples, and the other set contains the entire minority sample. Both majority and minority samples were then clustered into K clusters, where K is greater than one in both cases. The number of clusters for each class is chosen according to the elbow method. The Euclidean distance was employed to calculate the distance between the center of each majority cluster and the center of each minority cluster. Each majority cluster sample was combined with a minority cluster sample subset to make K separate training datasets; this combination was done based on the largest distance between each majority and minority cluster. SMOTE was then applied to each combination of clusters, generating synthetic instances that oversample the minority class. For any minority example, its k (5 in SMOTE) nearest neighbors of the same class are determined, and then some instances are randomly selected according to the oversampling factor. After that, new synthetic examples are generated along the line between the minority example and its nearest chosen example.
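The steps above can be condensed into a small, single-machine sketch. The paper's actual implementation runs on Apache Spark; `kmeans`, `smote`, and `ksmote` below are illustrative stand-ins, and the cluster-pairing and oversampling details are deliberately simplified:

```python
import math
import random

def euclid(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster member lists)."""
    rng = random.Random(seed)
    cents = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: euclid(p, cents[i]))].append(p)
        cents = [[sum(col) / len(g) for col in zip(*g)] if g else cents[i]
                 for i, g in enumerate(groups)]
    return cents, groups

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points along lines to nearest same-class neighbors."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbors = sorted((q for q in minority if q is not p),
                           key=lambda q: euclid(p, q))[:k]
        q = rng.choice(neighbors)
        t = rng.random()
        out.append([a + t * (b - a) for a, b in zip(p, q)])
    return out

def ksmote(majority, minority, k=2, seed=0):
    """Cluster both classes, pair each majority cluster with the farthest
    minority cluster (by centroid distance), then SMOTE each pairing."""
    maj_cents, maj_groups = kmeans(majority, k, seed=seed)
    min_cents, min_groups = kmeans(minority, k, seed=seed)
    combinations = []
    for i, mg in enumerate(maj_groups):
        j = max(range(k), key=lambda j: euclid(maj_cents[i], min_cents[j]))
        mn = min_groups[j] if len(min_groups[j]) >= 2 else minority
        synthetic = smote(mn, max(0, len(mg) - len(mn)), seed=seed)
        combinations.append((mg, mn + synthetic))
    return combinations

random.seed(1)
majority = [[random.gauss(5, 1), random.gauss(5, 1)] for _ in range(40)]
minority = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(6)]
for mg, mn in ksmote(majority, minority):
    print(len(mg), len(mn))  # each combination is now roughly class-balanced
```

The combined, balanced clusters would then be merged and handed to the classifiers, as in Figure 2.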

3.1. Environment Selection and the Dataset. Since we are dealing with big data, a big data framework must be chosen. One of the most powerful frameworks used in many data analytics tasks is Spark, a well-known and very reliable cluster computing engine. It presents application programming interfaces in various programming languages such as Java, Python, Scala, and R. Spark supports in-memory computing, allowing records to be handled much faster than by disk-based engines such as Hadoop; the Spark engine is advanced for in-memory as well as disk-based processing [37]. It has been installed on different operating systems such as Windows and Ubuntu.

This paper implements the proposed approach using PySpark version 2.4.3 [38] and Hadoop version 2.7 installed on Ubuntu 18.04, with Python as the programming language; a Jupyter notebook with Python version 3.7 was used. The computer configuration for the experiments was a local machine with an Intel Core i7 at 2.8 GHz and 8 GB of RAM. To illustrate the performance of the proposed framework, three datasets were chosen. They were carefully chosen such that each differs in its imbalance ratio, and they are large enough to illustrate the big data analysis challenge. The three datasets are AID 440 [39], AID 624202 [40], and AID 651820 [41], all retrieved from the PubChem database library [5]. The three datasets are summarized in Table 1 and briefly described in the following paragraphs. All of the data exist in an unstructured format as SDF files; therefore, they require some preprocessing to be accepted as input to the proposed platform.

(1) AID 440 relates to the formylpeptide receptor (FPR). The G-protein coupled formylpeptide receptor was one of the originating members of the chemoattractant receptor family [39]. The dataset consists of 185 active and 24815 nonactive molecules.

(2) AID 624202 is a qHTS test to identify small-molecule stimulants of BRCA1 expression. BRCA1 is involved in a wide range of cellular activities, including repairing DNA damage, cell cycle checkpoint control, growth inhibition, programmed cell death, transcription regulation, chromatin recombination, protein presence, and autogenous stem cell regeneration and differentiation. An increase in BRCA1 expression would enable cellular differentiation and restore tumor inhibitor function, leading to delayed tumor growth and less aggressive, more treatable breast cancer. Promising stimulants of BRCA1 expression could be new preventive or curative factors against breast cancer [40].

(3) AID 651820 is a qHTS examination for hepatitis C virus (HCV) inhibitors [41]. About 200 million people globally are infected with hepatitis C (HCV). Many infected individuals progress to chronic liver disease, including cirrhosis, with the risk of developing hepatic cancer. There is no effective hepatitis C vaccine available to date. Current interferon-based therapy is effective only in about half of patients and is associated with significant adverse effects. It is estimated that the fraction of people with HCV who can complete treatment is no more than 10 percent. The recent development of direct-acting antivirals against HCV, such as protease and polymerase inhibitors, is promising. However, it still requires a combination of peginterferon and ribavirin for maximum efficacy. Moreover, these agents are associated with a high resistance rate, and many have significant side effects (Figure 3).


3.2. Preprocessing. A chemical compound (molecule) is stored in an SDF file, so those files have to be converted to feature vectors [d1, d2, d3, ..., d1444] with 1444 dimensions; PaDEL cheminformatics software [42] was suitable for this process. There are two types of PaDEL cheminformatics descriptors: numeric and fingerprint. A PaDEL numeric descriptor gives information about the quantity of a feature in each compound. Molecules are represented based on constitutional, topological, and geometrical descriptors as well as other molecular properties. This includes aliphatic ring count, aromatic ring count, logP, donor count, polar surface area, and the Balaban index. The Balaban index is a real number and can be either positive or negative. PaDEL is also used as a fingerprint descriptor, and it gives 881 attributes. The fingerprint was also calculated in order to compare the performance of models derived from the two descriptor types. Molecules are referred to as instances and labeled as 1 or 0, where 1 means active and 0 means not active. The reader is directed to [43] for further information about the descriptors and fingerprints.
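Assuming PaDEL has been run to produce a CSV of descriptor values per compound, the conversion to labeled feature vectors can be sketched as below (the column names and two-compound sample are illustrative; real PaDEL output has 1444 numeric descriptor columns, and the activity label normally comes from the bioassay record rather than the descriptor file):

```python
import csv
import io

# Hypothetical two-compound PaDEL-style output with an appended Activity label.
padel_csv = """Name,nAcid,ALogP,BalabanIndex,Activity
CID1,0,1.20,-0.35,1
CID2,2,0.75,2.10,0
"""

def load_descriptors(text):
    """Parse a PaDEL-style CSV into (feature vectors, 0/1 labels)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    features = [[float(r[c]) for c in r if c not in ("Name", "Activity")]
                for r in rows]
    labels = [int(r["Activity"]) for r in rows]
    return features, labels

X, y = load_descriptors(padel_csv)
print(X[0], y)  # → [0.0, 1.2, -0.35] [1, 0]
```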

4. Evaluation Metrics

Instead of utilizing complicated metrics, four intuitive and functional metrics (specificity, sensitivity, G-mean, and accuracy) were introduced, according to the following

Figure 2: The proposed model.

Table 1: Benchmark imbalanced datasets.

Dataset       Total records   Inactive   Active   Balance ratio
AID 440       25000           24815      185      1:134
AID 624202    377550          364035     3980     1:91.5
AID 651820    283005          271341     11664    1:23

Figure 3: Molecules (chemical compound) [42].


reasons. First, the predictive power of the classification method for each sample, particularly the predictive power for the minority group (i.e., active compounds), is demonstrated by measuring performance in terms of both sensitivity and specificity. Second, G-mean is a combination of sensitivity and specificity, indicating a compromise between the majority and minority output of the classification. Poor quality in predicting positive samples reduces the G-mean value, even when negative samples are classified with high accuracy; this is a typical state for imbalanced data collections. It is strongly recommended that external predictions be used to build a reliable prediction model [41, 44]. The four statistical assessment methods are described as follows:

(1) Sensitivity: the proportion of positive samples appropriately classified and labeled; it can be determined by the following equation:

sensitivity = TP / (TP + FN), (2)

where true positive (TP) corresponds to the correct classification of positive samples (e.g., in this work, active compounds), true negative (TN) corresponds to the correct classification of negative samples (i.e., inactive compounds), false positive (FP) means that negative samples have been incorrectly classified as positive, and false negative (FN) indicates incorrectly classified positive samples.

(2) Specificity: the proportion of negative samples that are correctly classified; its value indicates how many of the cases predicted to be negative are truly negative, as stated in the following equation:

specificity = TN / (TN + FP). (3)

(3) G-mean: it offers a simple way to measure the model's capability to correctly classify active and inactive compounds by combining sensitivity and specificity in a single metric. G-mean is a measure of balanced accuracy [45] and is defined as follows:

G-mean = √(specificity × sensitivity). (4)

G-mean is a hybrid of sensitivity and specificity, indicating a balance between majority and minority rating results. Low performance in forecasting positive samples also reduces the G-mean value, even when negative samples are classified with high accuracy; this is a typical unbalanced dataset condition [45].

(4) Accuracy: it shows the capability of a model to correctly predict the class labels, as given in the following equation:

accuracy = (TP + TN) / (TP + TN + FP + FN). (5)
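The four measures in equations (2)-(5) can be computed directly from confusion-matrix counts; the counts below are made up for illustration only:

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the four evaluation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                   # eq. (2)
    specificity = tn / (tn + fp)                   # eq. (3)
    g_mean = math.sqrt(specificity * sensitivity)  # eq. (4)
    accuracy = (tp + tn) / (tp + tn + fp + fn)     # eq. (5)
    return sensitivity, specificity, g_mean, accuracy

# Illustrative counts for an imbalanced test split.
sens, spec, g, acc = metrics(tp=90, tn=980, fp=20, fn=10)
print(round(sens, 2), round(spec, 2), round(g, 3), round(acc, 3))  # → 0.9 0.98 0.939 0.973
```

Note how the accuracy (0.973) is dominated by the large negative class, while the G-mean (0.939) is pulled down by the weaker sensitivity, which is exactly why G-mean is preferred for imbalanced data.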

4.1. Experimental Results. This section presents results based on extensive sets of experiments; the average of the experiments is presented and discussed. Tenfold cross-validation is used to evaluate the performance of the proposed model. The original sample retrieved from each dataset is randomly divided into ten equal-sized subsamples. Out of the ten subsamples, one subsample is kept as validation data for model testing, while the remaining nine subsamples are used as training data. The validation process is then replicated ten times (folds), using each of the ten subsamples as validation data exactly once, and the average of the ten fold outcomes is computed. To validate the effectiveness of KSMOTE, the performance of the proposed model is compared with that of SMOTE-only and no-sampling models. Furthermore, two types of descriptors, numeric PaDEL and fingerprint, are used to validate their impact on the model's performance in all compilations. The results presented in this work are based on the original test groups only, without oversampling. The two descriptor types are used with the five algorithms (RF, DT, MLP, LG, and GBT) to validate their effect on model output; therefore, the performance of applying these algorithms is examined.
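The tenfold split described above can be sketched as a generic stratified fold assignment (this is an illustrative helper, not the paper's PySpark code; stratifying per class keeps the minority represented in every fold):

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds, preserving the class ratio,
    so each fold serves exactly once as the validation set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

# 90 inactive (0) and 10 active (1) samples, echoing the datasets' imbalance.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels)
print([len(f) for f in folds])  # → [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
# Round k: folds[k] is the validation set; the other nine folds form the training set.
```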

4.2. G-Mean-Based Performance. In this section, G-mean results are presented for the three selected datasets. Figure 4 describes G-mean for the AID 440 dataset based on both the PaDEL numeric and fingerprint descriptors. The figure demonstrates the performance of the various PaDEL descriptor and fingerprint sets and compares three different approaches: no-sample, SMOTE, and KSMOTE. The performance of employing the RF, DT, MLP, LG, and GBT classifiers with the three approaches is also examined; 20% of each dataset is used for testing, as mentioned before.

As shown in Figure 4, the best G-mean in the case of the PaDEL fingerprint descriptor was gained by KSMOTE, reaching 0.963 on average. Utilizing KSMOTE with LG gives a G-mean of almost 0.97, the best value over the other classifiers. Based on our experiments, SMOTE- and no-sample-based PaDEL fingerprint descriptors are not recommended for virtual screening and classification, as their G-mean is too small.

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance is examined. Again, the results are similar: KSMOTE shows higher performance than SMOTE and no-sample by nearly 55%. However, KSMOTE with the DT and GBT classifiers is not up to the other classifiers' level in this case; it is therefore recommended to utilize RF, MLP, and LG when a numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not that good, they show an enhancement over the PaDEL fingerprint descriptor of 10% on average. Among the overall results, the worst G-mean value is produced by applying the RF classifier with the no-sample approach.

The performance of KSMOTE observed on the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all

Complexity 7

of the classifiers. The only drawback is the G-mean of the LG classifier, which falls below the other classifiers' results by almost 8% when the PaDEL fingerprint descriptor is used. On the other hand, the LG classifier shows much more progress than before: its average G-mean reached 0.8, which is a good achievement. For the rest of the classifiers, in both descriptors, the G-mean values are enhanced relative to the previous dataset. Still, among all the classifiers, the G-mean value for the RF classifier with the no-sample approach is the worst. The overall conclusion is that KSMOTE is recommended for both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4–6, collecting all of the results in one place. It shows the complete set of experiments and the average G-mean values so that the overall picture can be seen. Again, as can be seen from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning the active and nonactive compounds into K clusters and then combining pairs that have large distances led to an accurate rate of oversampling instances in the SMOTE algorithm. This explains why the proposed model produces the best results.
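The cluster-pairing step described above can be sketched as follows. This is our reading of KSMOTE, not the authors' code: the centroids are assumed to come from a prior k-means run on each class, the helper names (`pair_distant_clusters`, `smote_like`) are ours, and the SMOTE step is reduced to its core interpolation.

```python
import numpy as np

def pair_distant_clusters(min_centroids, maj_centroids):
    """For each minority cluster, find the majority cluster whose
    centroid lies farthest away (the 'largest distance' pairing)."""
    d = np.linalg.norm(
        min_centroids[:, None, :] - maj_centroids[None, :, :], axis=2)
    return d.argmax(axis=1)          # index of the farthest majority cluster

def smote_like(samples, n_new, rng):
    """SMOTE's core step: interpolate between random pairs of
    minority samples to synthesize n_new new ones."""
    i = rng.integers(0, len(samples), n_new)
    j = rng.integers(0, len(samples), n_new)
    t = rng.random((n_new, 1))
    return samples[i] + t * (samples[j] - samples[i])

rng = np.random.default_rng(0)
min_c = np.array([[0.0, 0.0], [5.0, 5.0]])   # minority (active) centroids
maj_c = np.array([[0.5, 0.5], [9.0, 9.0]])   # majority (inactive) centroids
print(pair_distant_clusters(min_c, maj_c))   # -> [1 0]
```

Oversampling within minority clusters that sit far from majority clusters keeps the synthetic actives away from regions dense with inactives, which is the intuition the paragraph above gives for KSMOTE's accuracy.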

4.3. Sensitivity-Based Performance. Sensitivity is another metric to measure the performance of the proposed approach compared to the others. Here, the sensitivity results are presented a little differently: the performance of the three approaches is displayed for the three datasets. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to the other approaches using both descriptors. In addition, SMOTE outperforms the no-sample approach in almost all of the cases. For the AID 440 dataset, a low sensitivity value of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of PaDEL descriptor and fingerprint for AID 440.

Figure 5: G-mean of PaDEL descriptor and fingerprint for AID 624202.


Table 2: Complete set of experiments and average G-mean results. For each descriptor, the columns give the G-mean of the no-sample, SMOTE, and KSMOTE approaches, followed by the computational time.

AID 440 (PaDEL numeric descriptor | PaDEL fingerprint):
RF:  0.167, 0.565, 0.954, 23  | 0.29,  0.442, 0.96,  12
DT:  0.5,   0.59,  0.937, 93  | 0.51,  0.459, 0.958, 49
MLP: 0.6,   0.5,   0.963, 20  | 0.477, 0.498, 0.964, 96
LG:  0.56,  0.67,  0.963, 11  | 0.413, 0.512, 0.96,  56
GBT: 0.23,  0.56,  0.963, 33  | 0.477, 0.421, 0.963, 171

AID 624202 (PaDEL numeric descriptor | PaDEL fingerprint):
RF:  0.445, 0.625, 0.952, 297 | 0.5,   0.628, 0.96,  153
DT:  0.576, 0.614, 0.94,  10  | 0.54,  0.564, 0.94,  5
MLP: 0.74,  0.715, 0.95,  252 | 0.636, 0.497, 0.958, 135
LG:  0.628, 0.83,  0.94,  268 | 0.791, 0.78,  0.837, 1325
GBT: 0.489, 0.61,  0.95,  45  | 0.495, 0.482, 0.954, 2236

AID 651820 (PaDEL numeric descriptor | PaDEL fingerprint):
RF:  0.722, 0.792, 0.956, 41  | 0.741, 0.798, 0.92,  1925
DT:  0.725, 0.72,  0.932, 878 | 0.765, 0.743, 0.89,  444
MLP: 0.82,  0.817, 0.915, 35  | 0.788, 0.8,   0.91,  173
LG:  0.779, 0.8357, 0.962, 19 | 0.75,  0.768, 0.89,  936
GBT: 0.714, 0.742, 0.9,   605 | 0.762, 0.766, 0.905, 299

Figure 6: G-mean of numeric PaDEL descriptor and fingerprint for AID 651820.

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor.


oversample this minority class, and it significantly improved perceptibility, with the LG sensitivity jumping from 0.314 to 0.46. The KSMOTE algorithm has been used to sample this minority class to boost the predictability of the interesting class, which represents the active compounds. With KSMOTE, the sensitivity increase in LG is even larger: the value jumped from 0.46 to 0.94.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds), RF, with an exceptionally weak sensitivity of 0.2, is improved more significantly. The sensitivity value also improves dramatically, from 0.4 to 0.8, in LG with the incorporation of SMOTE; with KSMOTE, the sensitivity value in MLP rises considerably, from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, improving on the MLP sensitivity of 0.728 obtained from the original dataset.

Figure 8 presents the sensitivity of the three approaches using the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to the other approaches. As can be seen, the performance of the other approaches differs based on the dataset used, whereas KSMOTE performs stably across the different classifiers; in other words, the differences within KSMOTE are not that noticeable. It is also clear from the figure that, on average, the SMOTE and no-sample approaches show the same performance and behavior when applied to all datasets. Moreover, the sensitivity results are much better on the AID 651820 dataset than on AID 440 and AID 624202. Again, these results agree with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure: it gives the percentage of negative classes that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using the PaDEL numeric and fingerprint descriptors, respectively. These figures illuminate two points as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the LG classifier with SMOTE in both the AID 624202 and AID 651820 datasets.

(B) The fingerprint descriptor results are more stable than the PaDEL numeric descriptor results.

To summarize, Table 3 shows the sensitivity and specificity results of the three datasets with the numeric and fingerprint descriptors, produced using the different classifiers. KSMOTE achieves the highest values in most of the experiments, which shows the efficiency of the proposed method.

4.5. Computational Time Comparison. One of the issues that algorithms always face is computational time, especially if they are designed to work on limited-resource devices. The models without SMOTE cannot achieve adequate performance on samples of the minority class; the five classifiers with KSMOTE, on the other hand, have been proven accurate in terms of sensitivity and G-mean in almost all of the three PubChem datasets (see Figures 4–8). Their computational time is reported in Figures 11 and 12. It is interesting to note that the numeric PaDEL and fingerprint PaDEL descriptors produce similar computational efficiency among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values with the fingerprint PaDEL descriptor are much smaller than with the numeric PaDEL descriptor in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to belong to the GBT classifier, with values of about 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
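A per-classifier timing comparison like the one above can be reproduced with a small wall-clock harness. This is a generic sketch; the workload below is a stand-in, not the paper's training code, and `timed` is a hypothetical helper name.

```python
import time

def timed(fn, *args, repeats=3):
    """Report the best-of-n wall-clock time for one run of fn,
    mirroring a per-classifier computational time comparison."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in for a classifier fit; any training callable can be passed.
elapsed = timed(lambda: sum(i * i for i in range(100_000)))
print(f"{elapsed:.4f} s")
```

Taking the best of several repeats reduces noise from other processes, which matters when comparing classifiers whose fit times differ by only a few seconds.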

Figure 8: Sensitivity of all datasets for the fingerprint descriptor.


5. Discussion

It has been established that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio of AID 440 is 1:134, that of AID 624202 is 1:915, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.
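As a quick consistency check, the reported overall G-mean follows directly from the two reported class accuracies:

```python
# G-mean of 97.5% follows from 97% sensitivity (active-class accuracy)
# and 98% specificity (inactive-class accuracy).
sensitivity, specificity = 0.97, 0.98
g_mean = (sensitivity * specificity) ** 0.5
print(round(g_mean, 4))   # -> 0.975
```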

Moreover, the HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms such as Spark, used in this study, are computationally effective: they dramatically decreased computing costs in the optimization phase and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. However, a model constructed from an imbalanced dataset appears to move the decision hyperplane away from the optimal location towards the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity; such a model's predictive power can be significantly weak. We did not rely on cluster sampling alone to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with a blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE on AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds with an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor.

Figure 10: Specificity of all datasets for the fingerprint descriptor.


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors. Each cell gives sensitivity/specificity; the three columns per descriptor are no-sample, SMOTE, and KSMOTE.

AID 440 (PaDEL numeric descriptor | PaDEL fingerprint):
RF:  0.028/0.998  0.32/0.997   0.91/0.997  | 0.009/0.985  0.2/0.998    0.948/0.99
DT:  0.25/0.992   0.36/0.986   0.9/0.98    | 0.257/0.99   0.214/0.985  0.93/0.98
MLP: 0.37/0.996   0.26/0.993   0.94/0.993  | 0.22/0.99    0.25/0.994   0.951/0.99
LG:  0.314/0.998  0.46/0.98    0.94/0.98   | 0.17/0.998   0.267/0.976  0.94/0.98
GBT: 0.05/0.997   0.32/0.993   0.94/0.991  | 0.228/0.99   0.17/0.995   0.95/0.98

AID 624202 (PaDEL numeric descriptor | PaDEL fingerprint):
RF:  0.2/0.993    0.39/0.965   0.905/0.993 | 0.25/0.997   0.4/0.987    0.93/0.99
DT:  0.351/0.958  0.4/0.957    0.92/0.956  | 0.305/0.946  0.33/0.934   0.924/0.958
MLP: 0.57/0.967   0.53/0.993   0.928/0.96  | 0.418/0.961  0.24/0.964   0.956/0.971
LG:  0.4/0.807    0.8/0.806    0.921/0.81  | 0.775/0.987  0.76/0.86    0.825/0.984
GBT: 0.25/0.985   0.38/0.957   0.91/0.985  | 0.249/0.994  0.23/0.9848  0.924/0.995

AID 651820 (PaDEL numeric descriptor | PaDEL fingerprint):
RF:  0.529/0.983  0.648/0.944  0.94/0.94   | 0.56/0.98    0.67/0.966   0.9/0.95
DT:  0.57/0.923   0.579/0.876  0.91/0.9    | 0.65/0.9     0.63/0.896   0.88/0.95
MLP: 0.728/0.95   0.716/0.943  0.9/0.913   | 0.58/0.93    0.673/0.933  0.9/0.93
LG:  0.62/0.964   0.79/0.856   0.94/0.9    | 0.59/0.97    0.69/0.876   0.88/0.95
GBT: 0.52/0.97    0.56/0.93    0.89/0.914  | 0.63/0.972   0.63/0.967   0.91/0.91


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. For AID 624202, the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds with an accuracy of 94%.

Also, for AID 651820, the model identifies inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds with an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because it starts from the dissimilarity between cluster samples when drawing them. This dissimilarity between samples increases the classifiers' accuracy; besides that, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority class.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produced new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.
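One way to read the bounding step above is as a per-feature clip of the synthetic samples to the range observed in the real minority class. This is our interpretation, not the authors' code, and `clip_to_minority_bounds` is a hypothetical helper name.

```python
import numpy as np

def clip_to_minority_bounds(synthetic, minority):
    """Constrain newly generated samples to the per-feature min/max of
    the observed minority class, so oversampling cannot drift into
    majority-class regions (our reading of the bounding step)."""
    lo = minority.min(axis=0)
    hi = minority.max(axis=0)
    return np.clip(synthetic, lo, hi)

minority = np.array([[0.0, 1.0], [2.0, 3.0]])    # observed active samples
synthetic = np.array([[-1.0, 2.0], [3.0, 4.0]])  # raw oversampled points
print(clip_to_minority_bounds(synthetic, minority))
```

Each out-of-range coordinate is pulled back to the nearest minority-class bound, so the randomness of oversampling stays confined to the minority region.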

Recent developments in technology allow high-throughput scanning facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Its ability to detect active compounds makes KSMOTE a promising tool for data mining applications that investigate biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly than traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task on unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling method depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the dataset used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized applicability of KSMOTE, both PaDEL and fingerprint descriptors were used to construct the classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority class samples (active compounds) were successfully identified, and high prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase the amount of labeled chemical data. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the classification accuracy, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.
[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.
[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39–63, 2002.
[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53–62, 2007.
[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.
[6] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.
[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.
[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052–1057, 2010.
[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184–192, 2001.
[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.
[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958–971, 2013.
[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.
[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162–164, 1984.
[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.
[16] S. del Río, V. López, J. M. Benítez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112–137, 2014.
[17] A. Fernández, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805–9812, 2009.
[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.
[19] V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, 2012.
[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72–81, 2007.
[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583–590, 2018.
[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705–712, 2014.
[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1–20, 2010.
[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229–235, 2019.
[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862–865, 2004.
[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.
[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.
[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72–88, 2020.
[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.
[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310–3316, 2009.
[32] R. Guha and S. C. Schürer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367–384, 2008.
[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.
[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.
[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.
[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.
[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.
[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.
[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al.)/02%3A_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.
[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112–118, IEEE, Washington, DC, USA, December 2014.
[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481–487, 2019.
[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing (ICMLC), pp. 107–111, Singapore, February 2017.
[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229–236, 2011.
[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, p. 25, 2010.
[45] S. P. R. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107–111, ACM, Singapore, February 2017.

Complexity 15

Page 6: HandlingImbalanceClassificationVirtualScreeningBigData ......rent advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remainsadauntingtask.InML,manyapproacheshavebeen

32 Preprocessing A chemical compound (molecules) isstored in SDF files so those files have to be converted tofeature vectors [d1 d2 and d3 d1444] with 1444 dimen-sions and it was suitable to use PaDEL cheminformaticssoftware [42] for this process +ere are two types of PaDEL

cheminformatics software numeric and fingerprint de-scriptors A PaDEL numeric descriptor gives informationabout the quantity of a feature in each compound Moleculesare represented based on constitutional topological andgeometrical descriptors as well as other molecular proper-ties +is includes aliphatic ring count aromatic ring countlogP donor count polar surface area and Balaban index+e Balaban index is a real number and can be either positiveor negative PaDEL is also used as a fingerprint descriptorand it gives 881 attributes +e fingerprint was also calcu-lated in order to compare the model performance derivedfrom PaDEL descriptors Molecules are referred to as in-stances and labeled as 1 or 0 where 1 means active and 0means not active +e reader is directed to [43] for furtherinformation about the descriptors and fingerprints

4 Evaluation Metrics

Instead of utilizing complicated metrics four intuitive andfunctional metrics (specificity sensitivity G-mean andaccuracy) were introduced according to the following

Results

Classification algorithms

Combine all clusters after smote

Smote Smote

Combination kCombination 1

Cluster minority classto k cluster

Cluster majority classto k clusters

Majority classMinority class

Divide

Divide

Big dataset

Test data

80 20

Training data

Combine clusters have the largest distance

Calculate distance among each centriod in majorityand minority clusters

Figure 2 +e proposed model

Table 1 Benchmark imbalanced dataset

Datasets Total records Inactive Active Balance ratioAID 440 25000 24815 185 1 134AID 624202 377550 364035 3980 1 915AID 651820 283005 271341 11664 1 23

O

O

CH3H3C

CH3

N

N

N

N

Figure 3 Molecules (chemical compound) [42]

6 Complexity

reasons First the predictive power of the classificationmethod for each sample particularly the predictive powerof the minority group (ie active power) is demonstratedby measuring performance for both sensitivity and spec-ificity Second G-mean is a combination of sensitivity andspecificity indicating a compromise between the majorityand minority output of the classification Poor quality inpredicting positive samples reduces the G-mean valuewhereas negative samples are classified with accuracy witha high percentage+is is a typical state for imbalanced datacollection It is strongly recommended that external pre-dictions be used to build a reliable model of prediction[41 44] +e four statistical assessment methods are de-scribed as follows

(1) Sensitivity the proportion of positive samples ap-propriately classified and labeled and it can be de-termined by the following equation

sensitivity TP

(TP + FN) (2)

where true positive (TP) corresponds to the rightclassification of positive samples (eg in this workactive compounds) true negative (TN) correspondsto the correct classification (ie inactive com-pounds) of negative samples false positive (FP)means that negative samples have been incorrectlyidentified in positive samples and false negative(FN) is an indicator to incorrectly classified positivesamples

(2) Specificity the proportion of negative samples thatare correctly classified its value indicates how manycases that are predicted to be negative and that aretruly negative as stated in the following equation

specificity TN

(TN + FP) (3)

(3) G-mean it offers a simple way to measure themodelrsquos capability to correctly classify active andinactive compounds by combining sensitivity andspecificity in a single metric G-mean is a measure ofbalance accuracy [45] and is defined as follows

G-mean

specificity times sensitivity1113969

(4)

G-mean is a hybrid of sensitivity and specificityindicating a balance between majority and minorityrating results Low performance in forecastingpositive samples also contributes to reducingG-mean value even though negative samples arehighly accurate +is is a typical unbalanced datasetcondition [45]

(4) Accuracy it shows the capability of a model tocorrectly predict the class labels as given in thefollowing equation

accuracy (TP + TN)

(TP + TN + FP + FN) (5)

41 Experimental Results +is section presents the resultsbased on extensive sets of experiments +e average of theexperiments is concluded and presented with discussionTenfold cross-validation is used to evaluate the performanceof the proposedmodel Besides the original sample retrievedfrom the datasets is randomly divided into ten equal-sizedsubsamples Out of the ten subsamples one subsample ismaintained as validation data for model testing while theremaining nine subsamples are used as training data +evalidation process is then replicated ten times (folds) usingeach of the ten subsamples as validation data exactly once Itis then possible to compute the average of the ten outcomesfrom the folds To validate the effectiveness of KSMOTE theperformance of the proposed model is compared with that ofSMOTE only and no-sampling models Furthermore twotypes of descriptors numeric PaDEL and fingerprint areused to validate their impact on the modelrsquos performance forall compilations+e results presented in this work are basedon original test groups only without oversampling Twotypes of descriptors PaDEL and fingerprint are used togenerate five algorithms (RF DT MLP LG and GBT) tovalidate their effect on model output +erefore the per-formance of applying these algorithms is examined

4.2. G-Mean-Based Performance. In this section, G-mean results are presented for the three selected datasets. Figure 4 describes G-mean for the AID 440 dataset based on both the PaDEL numeric and fingerprint descriptors. The figure compares three different approaches, namely, no-sample, SMOTE, and KSMOTE, and reports the performance of the RF, DT, MLP, LG, and GBT classifiers under each approach. As mentioned before, 20% of each dataset is used for testing.

As shown in Figure 4, the best G-mean obtained with the PaDEL fingerprint descriptor was by KSMOTE, where it reaches 0.963 on average. Utilizing KSMOTE with LG gives a G-mean of almost 0.97, which is the best value over the other classifiers. Based on our experiments, SMOTE- and no-sample-based PaDEL fingerprint descriptors are not recommended for virtual screening and classification, as their G-mean is too small.

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance is examined. Again, the results are similar: KSMOTE shows higher performance than SMOTE and no-sample by nearly 55%. However, KSMOTE with the DT and GBT classifiers is not up to the other classifiers' level in this case. Therefore, it is recommended to utilize RF, MLP, and LG when the numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not that good, they show an enhancement over the PaDEL fingerprint results of 10% on average. Among the overall results, the worst G-mean value is produced by the RF classifier with the no-sample approach.

The performance of KSMOTE on the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all

Complexity 7

of the classifiers. The only drawback is that the G-mean of LG trails the other classifiers' results by almost 8% when the PaDEL fingerprint descriptor is used. On the other hand, the LG classifier shows much more progress than before: its average G-mean reached 0.8, which is a good achievement. For the rest of the classifiers, with both descriptors, the G-mean values improve over the previous dataset. Still, among all the classifiers, the G-mean value for the RF classifier with the no-sample approach is the worst. The overall conclusion is that KSMOTE is recommended for both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4–6, collecting all of the results in one place. It shows the complete set of experiments and the average G-mean values so that the overall picture can be seen. Again, as can be extracted from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning the active and nonactive compounds into K clusters and then combining pairs that have large distances led to an accurate rate of oversampling instances in the SMOTE algorithm. This explains why the proposed model produces the best results.
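The oversampling step that follows the clustering is linear interpolation between minority samples. A minimal sketch of that interpolation step is shown below; the K-means partitioning and distant-pair selection that KSMOTE adds are omitted, and the points are purely illustrative:

```python
import random

def smote_like_samples(minority, n_new, seed=1):
    """Create synthetic minority points by interpolating between a randomly
    chosen minority point and its nearest minority neighbour (squared
    Euclidean distance), as in SMOTE's generation step."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nn = min((p for p in minority if p is not x),
                 key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))
        gap = rng.random()  # random position on the segment from x to nn
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

active = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # illustrative minority points
new_points = smote_like_samples(active, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority region, which is what keeps the majority class space unaffected.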

4.3. Sensitivity-Based Performance. Sensitivity is another metric used to measure the performance of the proposed approach compared to the others. Here, the presentation is slightly different: the performance of the three approaches is displayed for the three datasets together. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to the other approaches with both descriptors. In addition, SMOTE outperforms the no-sample approach in almost all of the cases. For the AID 440 dataset, a low sensitivity value of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of PaDEL descriptor and fingerprint for AID 440.

Figure 5: G-mean of PaDEL descriptor and fingerprint for AID 624202.


Table 2: Complete set of experiments and average G-mean results.

                      PaDEL numeric descriptor                PaDEL fingerprint
Algorithm    No-sample  SMOTE   KSMOTE  Time (s)    No-sample  SMOTE   KSMOTE  Time (s)
AID 440
  RF         0.167      0.565   0.954   23          0.29       0.442   0.96    12
  DT         0.5        0.59    0.937   9.3         0.51       0.459   0.958   4.9
  MLP        0.6        0.5     0.963   20          0.477      0.498   0.964   9.6
  LG         0.56       0.67    0.963   11          0.413      0.512   0.96    5.6
  GBT        0.23       0.56    0.963   33          0.477      0.421   0.963   17.1
AID 624202
  RF         0.445      0.625   0.952   29.7        0.5        0.628   0.96    15.3
  DT         0.576      0.614   0.94    10          0.54       0.564   0.94    5
  MLP        0.74       0.715   0.95    25.2        0.636      0.497   0.958   13.5
  LG         0.628      0.83    0.94    26.8        0.791      0.78    0.837   13.25
  GBT        0.489      0.61    0.95    45          0.495      0.482   0.954   22.36
AID 651820
  RF         0.722      0.792   0.956   41          0.741      0.798   0.92    19.25
  DT         0.725      0.72    0.932   8.78        0.765      0.743   0.89    4.44
  MLP        0.82       0.817   0.915   35          0.788      0.8     0.91    17.3
  LG         0.779      0.8357  0.962   19          0.75       0.768   0.89    9.36
  GBT        0.714      0.742   0.9     60.5        0.762      0.766   0.905   29.9

Figure 6: G-mean of numeric PaDEL descriptor and fingerprint for AID 651820.

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor.


oversample this minority class, significantly improving its predictability, with the LG sensitivity jumping from 0.314 to 0.46. The KSMOTE algorithm has then been used to sample this minority class further to boost the predictability of the class of interest, which represents the active compounds. The sensitivity increase is again shown for LG, whose sensitivity value jumped from 0.46 to 0.94 with KSMOTE.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds) with an exceptionally weak sensitivity of 0.2, RF is the most significantly improved. The sensitivity value also improves dramatically, from 0.4 to 0.8, in LG with the incorporation of SMOTE. With KSMOTE, the sensitivity value in MLP rises considerably, from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with an MLP sensitivity of 0.728 (inactive compounds form the majority class).

Figure 8 presents the sensitivity of the three approaches using the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to the other approaches. As can be seen, their performance differs based on the dataset used; KSMOTE, on the contrary, has a stable performance across the different classifiers. In other words, the differences within KSMOTE are not that noticeable. However, it is clear from the figure that, on average, the SMOTE and no-sample approaches show the same performance and behavior when applied to all datasets. Besides, the sensitivity results are much better on the AID 651820 dataset than on AID 440 and AID 624202. Again, these results agree with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure: it measures the percentage of negative samples that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using the PaDEL numeric and fingerprint descriptors, respectively. Those figures illuminate two points, as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) The fingerprint descriptor results are more stable than the numeric PaDEL descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the results produced using the different classifiers; KSMOTE has a superior result in most of the experiments. The sensitivity and specificity of the three datasets with the numeric and fingerprint descriptors are listed, and KSMOTE attains the highest values in most cases, which shows the efficiency of the proposed method.

4.5. Computational Time Comparison. One of the issues that algorithms always face is computational time, especially if they are designed to work on limited-resource devices. The models without SMOTE cannot achieve adequate performance on samples of the minority classes. On the other hand, the five classifiers' computational time when KSMOTE is used remains reasonable while the approach stays accurate in terms of sensitivity and G-mean in almost all of the three PubChem datasets (see Figures 4–8). The computational time is reported in Figures 11 and 12. It is interesting to note that the numeric and fingerprint PaDEL descriptors produce similar relative computational efficiency among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values for the fingerprint PaDEL descriptor are much smaller than those for the numeric PaDEL descriptor in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of about 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
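Per-classifier wall-clock times of the kind reported in Figures 11 and 12 can be captured with a simple timing wrapper. A sketch follows; the `dummy_fit` workload is a stand-in, not the paper's actual training pipeline:

```python
import time

def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed wall-clock seconds),
    using a monotonic high-resolution clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def dummy_fit(n):
    # stand-in workload for a classifier's training step
    return sum(i * i for i in range(n))

result, elapsed = timed(dummy_fit, 100_000)
```

`time.perf_counter()` is preferred over `time.time()` for this purpose because it is monotonic and unaffected by system clock adjustments during a long training run.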

Figure 8: Sensitivity of all datasets for the fingerprint descriptor.


5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio distribution of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.
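These imbalance ratios can be checked directly against the class counts reported for the blind sets later in this section, assuming those counts reflect the overall class distribution:

```python
def imbalance_ratio(n_active, n_inactive):
    """Majority-to-minority ratio r, read as an imbalance of 1:r."""
    return n_inactive / n_active

counts = {  # (active, inactive) counts reported in the text
    "AID 440": (37, 4963),
    "AID 624202": (796, 72807),
    "AID 651820": (2332, 54269),
}
ratios = {name: round(imbalance_ratio(a, i), 1)
          for name, (a, i) in counts.items()}
```

The computed values (roughly 134, 91.5, and 23 inactive compounds per active compound) line up with the quoted ratios.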

Moreover, the HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such data is highly time-intensive. Big data platforms such as Spark, as used in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. However, a model constructed from an imbalanced dataset appears to move the separating hyperplane away from the optimal location toward the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity, and such a model's predictive power can be significantly weak. We therefore not only rely on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance on the blind dataset, which included 37 active compounds and 4,963 inactive compounds for AID 440. KSMOTE on AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds with an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72,807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor.

Figure 10: Specificity of all datasets for the fingerprint descriptor.


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors.

PaDEL numeric descriptor:
                 No-sample         SMOTE             KSMOTE
Algorithm        Sens.   Spec.     Sens.   Spec.     Sens.   Spec.
AID 440
  RF             0.028   0.998     0.32    0.997     0.91    0.997
  DT             0.25    0.992     0.36    0.986     0.9     0.98
  MLP            0.37    0.996     0.26    0.993     0.94    0.993
  LG             0.314   0.998     0.46    0.98      0.94    0.98
  GBT            0.05    0.997     0.32    0.993     0.94    0.991
AID 624202
  RF             0.2     0.993     0.39    0.965     0.905   0.993
  DT             0.351   0.958     0.4     0.957     0.92    0.956
  MLP            0.57    0.967     0.53    0.993     0.928   0.96
  LG             0.4     0.807     0.8     0.806     0.921   0.81
  GBT            0.25    0.985     0.38    0.957     0.91    0.985
AID 651820
  RF             0.529   0.983     0.648   0.944     0.94    0.94
  DT             0.57    0.923     0.579   0.876     0.91    0.9
  MLP            0.728   0.95      0.716   0.943     0.9     0.913
  LG             0.62    0.964     0.79    0.856     0.94    0.9
  GBT            0.52    0.97      0.56    0.93      0.89    0.914

PaDEL fingerprint:
                 No-sample         SMOTE             KSMOTE
Algorithm        Sens.   Spec.     Sens.   Spec.     Sens.   Spec.
AID 440
  RF             0.009   0.985     0.2     0.998     0.948   0.99
  DT             0.257   0.99      0.214   0.985     0.93    0.98
  MLP            0.22    0.99      0.25    0.994     0.951   0.99
  LG             0.17    0.998     0.267   0.976     0.94    0.98
  GBT            0.228   0.99      0.17    0.995     0.95    0.98
AID 624202
  RF             0.25    0.997     0.4     0.987     0.93    0.99
  DT             0.305   0.946     0.33    0.934     0.924   0.958
  MLP            0.418   0.961     0.24    0.964     0.956   0.971
  LG             0.775   0.987     0.76    0.86      0.825   0.984
  GBT            0.249   0.994     0.23    0.9848    0.924   0.995
AID 651820
  RF             0.56    0.98      0.67    0.966     0.9     0.95
  DT             0.65    0.9       0.63    0.896     0.88    0.95
  MLP            0.58    0.93      0.673   0.933     0.9     0.93
  LG             0.59    0.97      0.69    0.876     0.88    0.95
  GBT            0.63    0.972     0.63    0.967     0.91    0.91


compounds, and AID 651820 contains 2,332 active compounds and 54,269 inactive compounds. For AID 624202, the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds with an accuracy of 94%.

Also, for AID 651820, the model identifies inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds with an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because, from the beginning, it relies on the dissimilarity among the samples taken from the clusters. This dissimilarity between samples increases the classifiers' accuracy; besides that, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority class.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produces new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.

Recent developments in technology allow high-throughput screening facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. A comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds with KSMOTE makes it a promising tool for data mining applications investigating biological problems, mostly when a large volume of imbalanced data is generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly than traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task on unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling method depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the dataset used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained without sampling. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized applicability of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority class samples (active compounds) were successfully identified, and good prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.

[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.

[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39–63, 2002.

[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53–62, 2007.

[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.

[6] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.

[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.

[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052–1057, 2010.

[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184–192, 2001.

[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.

[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958–971, 2013.

[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.

[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.

[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162–164, 1984.

[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.

[16] S. del Río, V. López, J. M. Benítez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112–137, 2014.

[17] A. Fernández, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805–9812, 2009.

[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.

[19] V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, 2012.

[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72–81, 2007.

[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583–590, 2018.

[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705–712, 2014.

[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1–20, 2010.

[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229–235, 2019.

[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862–865, 2004.

[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.

[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.

[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72–88, 2020.

[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.

[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310–3316, 2009.

[32] R. Guha and S. C. Schürer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367–384, 2008.

[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.

[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.

[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.

[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.

[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.

[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.

[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al)/02%3A_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.

[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112–118, IEEE, Washington, DC, USA, December 2014.

[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481–487, 2019.

[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing (ICMLC), pp. 107–111, Singapore, February 2017.

[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229–236, 2011.

[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, 2010.

[45] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107–111, ACM, Singapore, February 2017.


Page 7: HandlingImbalanceClassificationVirtualScreeningBigData ......rent advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remainsadauntingtask.InML,manyapproacheshavebeen

reasons First the predictive power of the classificationmethod for each sample particularly the predictive powerof the minority group (ie active power) is demonstratedby measuring performance for both sensitivity and spec-ificity Second G-mean is a combination of sensitivity andspecificity indicating a compromise between the majorityand minority output of the classification Poor quality inpredicting positive samples reduces the G-mean valuewhereas negative samples are classified with accuracy witha high percentage+is is a typical state for imbalanced datacollection It is strongly recommended that external pre-dictions be used to build a reliable model of prediction[41 44] +e four statistical assessment methods are de-scribed as follows

(1) Sensitivity the proportion of positive samples ap-propriately classified and labeled and it can be de-termined by the following equation

sensitivity TP

(TP + FN) (2)

where true positive (TP) corresponds to the rightclassification of positive samples (eg in this workactive compounds) true negative (TN) correspondsto the correct classification (ie inactive com-pounds) of negative samples false positive (FP)means that negative samples have been incorrectlyidentified in positive samples and false negative(FN) is an indicator to incorrectly classified positivesamples

(2) Specificity the proportion of negative samples thatare correctly classified its value indicates how manycases that are predicted to be negative and that aretruly negative as stated in the following equation

specificity TN

(TN + FP) (3)

(3) G-mean it offers a simple way to measure themodelrsquos capability to correctly classify active andinactive compounds by combining sensitivity andspecificity in a single metric G-mean is a measure ofbalance accuracy [45] and is defined as follows

G-mean

specificity times sensitivity1113969

(4)

G-mean is a hybrid of sensitivity and specificityindicating a balance between majority and minorityrating results Low performance in forecastingpositive samples also contributes to reducingG-mean value even though negative samples arehighly accurate +is is a typical unbalanced datasetcondition [45]

(4) Accuracy it shows the capability of a model tocorrectly predict the class labels as given in thefollowing equation

accuracy (TP + TN)

(TP + TN + FP + FN) (5)

41 Experimental Results +is section presents the resultsbased on extensive sets of experiments +e average of theexperiments is concluded and presented with discussionTenfold cross-validation is used to evaluate the performanceof the proposedmodel Besides the original sample retrievedfrom the datasets is randomly divided into ten equal-sizedsubsamples Out of the ten subsamples one subsample ismaintained as validation data for model testing while theremaining nine subsamples are used as training data +evalidation process is then replicated ten times (folds) usingeach of the ten subsamples as validation data exactly once Itis then possible to compute the average of the ten outcomesfrom the folds To validate the effectiveness of KSMOTE theperformance of the proposed model is compared with that ofSMOTE only and no-sampling models Furthermore twotypes of descriptors numeric PaDEL and fingerprint areused to validate their impact on the modelrsquos performance forall compilations+e results presented in this work are basedon original test groups only without oversampling Twotypes of descriptors PaDEL and fingerprint are used togenerate five algorithms (RF DT MLP LG and GBT) tovalidate their effect on model output +erefore the per-formance of applying these algorithms is examined

42 G-Mean-Based Performance In this section G-meanresults are presented for the three selected datasets Figure 4describesG-mean for theAID440 dataset based on both PaDELand fingerprint descriptors +is figure demonstrates the per-formance of the various PaDEL descriptor and fingerprint setsIt also shows a comparison between three different approacheswhich are no-sample SMOTE and KSMOTE Besides theperformance of employing RF DT MLP LG and GBT clas-sifiers with the three different approaches examined 20 of thedatasets are used for testing as mentioned before

As shown in Figure 4 the best G-mean gained in the caseof PaDEL fingerprint descriptor was by the KSMOTE whereit reaches on average 0963 However utilizing KSMOTEwith LG gives almost G-means of 097 which is the bestvalue over other classifiers Based on our experimentsSMOTE- and no-sample-based PaDEL fingerprint de-scriptors are not recommended for virtual screening andclassification where their G-mean is too small

With the same settings applied to the fingerprint descriptor, the PaDEL numeric descriptor performance is examined. Again, the results are similar, where KSMOTE shows higher performance than SMOTE and no-sample by nearly 5.5%. However, KSMOTE with the DT and GBT classifiers is not up to the other classifiers' level in this case. Therefore, it is recommended to utilize RF, MLP, and LG when a numeric descriptor is used. On the contrary, although the performance of SMOTE and no-sample is not that good, they show an enhancement over the PaDEL fingerprint with an average of 10%. Among the overall results, it has been noticed that the worst G-mean value is produced by applying the RF classifier with the no-sample approach.

The performance of KSMOTE produced from the AID 440 dataset is confirmed using the AID 624202 dataset. As shown in Figure 5, KSMOTE gives the best G-mean using all

Complexity 7

of the classifiers. The only drawback shown is that the G-mean of LG lags the other classifiers' results by almost 8% when the PaDEL fingerprint descriptor is used. On the contrary, the LG classifier shows much more progress than before, where its average G-mean reached 0.8, which is a good achievement. For the rest of the classifiers, in both descriptors, G-mean values are enhanced from the previous dataset. Still, among all the classifiers, the G-mean value for the RF classifier with the no-sample approach is the worst. The overall conclusion is that KSMOTE is recommended to be used with both the AID 440 and AID 624202 datasets.

Table 2 summarizes Figures 4–6, collecting all of the results in one place. It shows the complete set of experiments and the average G-mean results in values, to give the overall picture. Again, as can be extracted from the table, the proposed KSMOTE approach gives the best G-mean results on the three datasets. It is believed that partitioning the active and nonactive compounds into K clusters and then combining pairs that have large distances led to an accurate rate of oversampling instances in the SMOTE algorithm. This explains why the proposed model produces the best results.
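The cluster-then-oversample idea can be illustrated as follows. This is a simplified sketch assuming Euclidean feature vectors: it runs plain k-means on the minority class and interpolates SMOTE-style between randomly chosen same-cluster pairs. The authors' actual KSMOTE (including its pairing of distant clusters and its Spark execution) differs in detail.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of floats; returns a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def smote_within_clusters(minority, k=2, n_new=10, seed=0):
    """Generate synthetic minority samples by interpolating between
    randomly chosen pairs that belong to the same k-means cluster."""
    rng = random.Random(seed)
    clusters = {}
    for p, label in zip(minority, kmeans(minority, k)):
        clusters.setdefault(label, []).append(p)
    usable = [c for c in clusters.values() if len(c) >= 2]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rng.choice(usable), 2)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Clustering first keeps each synthetic point on a segment between two genuinely similar minority samples, instead of interpolating across unrelated regions of chemical space.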

4.3. Sensitivity-Based Performance

Sensitivity is another metric to measure the performance of the proposed approach compared to others. Here, the sensitivity performance presentation is a little different, where the performance of the three approaches is displayed for the three datasets. Figure 7 shows the sensitivity of all datasets based on the PaDEL numeric descriptor, while Figure 8 presents the sensitivity results for the fingerprint descriptor. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches using both descriptors. In addition, SMOTE outperforms the no-sample approach in almost all of the cases. For the AID 440 dataset, a low sensitivity value of 0.37 for the minority class is shown by the MLP model on the initial dataset (i.e., without SMOTE resampling). The SMOTE algorithm was introduced to

Figure 4: G-mean of PaDEL descriptor and fingerprint for AID 440.

Figure 5: G-mean of PaDEL descriptor and fingerprint for AID 624202.


Table 2: Complete set of experiments and average G-mean results.

| Dataset | Algorithm | Numeric No-sample | Numeric SMOTE | Numeric KSMOTE | Numeric Time | Fingerprint No-sample | Fingerprint SMOTE | Fingerprint KSMOTE | Fingerprint Time |
|---|---|---|---|---|---|---|---|---|---|
| AID 440 | RF | 0.167 | 0.565 | 0.954 | 23 | 0.29 | 0.442 | 0.96 | 12 |
| AID 440 | DT | 0.5 | 0.59 | 0.937 | 93 | 0.51 | 0.459 | 0.958 | 49 |
| AID 440 | MLP | 0.6 | 0.5 | 0.963 | 20 | 0.477 | 0.498 | 0.964 | 96 |
| AID 440 | LG | 0.56 | 0.67 | 0.963 | 11 | 0.413 | 0.512 | 0.96 | 56 |
| AID 440 | GBT | 0.23 | 0.56 | 0.963 | 33 | 0.477 | 0.421 | 0.963 | 171 |
| AID 624202 | RF | 0.445 | 0.625 | 0.952 | 297 | 0.5 | 0.628 | 0.96 | 153 |
| AID 624202 | DT | 0.576 | 0.614 | 0.94 | 10 | 0.54 | 0.564 | 0.94 | 5 |
| AID 624202 | MLP | 0.74 | 0.715 | 0.95 | 252 | 0.636 | 0.497 | 0.958 | 135 |
| AID 624202 | LG | 0.628 | 0.83 | 0.94 | 268 | 0.791 | 0.78 | 0.837 | 1325 |
| AID 624202 | GBT | 0.489 | 0.61 | 0.95 | 45 | 0.495 | 0.482 | 0.954 | 2236 |
| AID 651820 | RF | 0.722 | 0.792 | 0.956 | 41 | 0.741 | 0.798 | 0.92 | 1925 |
| AID 651820 | DT | 0.725 | 0.72 | 0.932 | 878 | 0.765 | 0.743 | 0.89 | 444 |
| AID 651820 | MLP | 0.82 | 0.817 | 0.915 | 35 | 0.788 | 0.8 | 0.91 | 173 |
| AID 651820 | LG | 0.779 | 0.8357 | 0.962 | 19 | 0.75 | 0.768 | 0.89 | 936 |
| AID 651820 | GBT | 0.714 | 0.742 | 0.9 | 605 | 0.762 | 0.766 | 0.905 | 299 |

Figure 6: G-mean of numeric PaDEL descriptor and fingerprint for AID 651820.

Figure 7: Sensitivity of all datasets for the PaDEL numeric descriptor.


oversample this minority class to significantly improve perceptibility, with LG jumping from 0.314 to 0.46. The KSMOTE algorithm has been used to sample this minority class to boost the predictability of the interesting class, which represents the active compounds. Besides, sensitivity increases have been shown in LG with KSMOTE, where the sensitivity value jumped from 0.46 to 0.94.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds) with an exceptionally weak sensitivity of 0.2 for RF, the improvement is more significant. The sensitivity value improves dramatically from 0.4 to 0.8 in LG with the incorporation of SMOTE. With KSMOTE, moreover, the sensitivity value in MLP rises considerably from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with a sensitivity of 0.728 in MLP for the majority class (inactive compounds).

Figure 8 presents the sensitivity of the three approaches using fingerprint descriptors. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches. As can be seen, the performance differs based on the type of the used dataset; on the contrary, KSMOTE has a stable performance using different classifiers. In other words, the difference within KSMOTE is not that noticeable. However, it is clear from the figure that, on average, the SMOTE and no-sample approaches have the same performance as well as behavior when applied to all datasets. Besides, the sensitivity results became much better when applied to the AID 651820 dataset than when AID 440 and AID 624202 were used. Again, the results go along with the previous measurement.

4.4. Specificity-Based Performance

Specificity is another important performance measure, where it measures the percentage of negative classes that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using PaDEL and fingerprint descriptors, respectively. Those figures illuminate two points, as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the LG classifier with SMOTE in both the AID 624202 and AID 651820 datasets.

(B) Fingerprint descriptor results are more stable than numeric PaDEL descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the produced results using the different classifiers. Sensitivity and specificity results of the three datasets with numeric and fingerprint descriptors are shown; KSMOTE attains the highest values in most of the experiments, which shows the efficiency of the proposed method.
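Sensitivity and specificity, as used in Table 3, are simply the per-class recall rates. A small illustrative helper computes both from label lists:

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Per-class recall: sensitivity on the active (positive, minority)
    class and specificity on the inactive (negative, majority) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)
```

Because the two rates are computed on disjoint classes, a model can score near-perfect specificity while missing almost every active compound, which is exactly the imbalance failure mode the paper targets.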

4.5. Computational Time Comparison

One of the issues that algorithms always face is the computational time, especially if those algorithms are designed to work on limited-resource devices. The models without SMOTE for samples with minority classes cannot achieve adequate performance. On the other hand, the five classifiers' computational time when KSMOTE is used has been proven to be acceptable, given the accurate sensitivity and G-mean values in almost all of the three PubChem datasets (see Figures 4–8). The computed computational time is reported in Figures 11 and 12. It is interesting to note that both numeric PaDEL descriptors and fingerprint PaDEL descriptors produce similar computational efficiency among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values of fingerprint PaDEL are much smaller than the numeric PaDEL descriptor's computational time in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
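Per-classifier training times of the kind plotted in Figures 11 and 12 can be captured with a simple wall-clock wrapper. This is a generic sketch; the study's timings came from its Spark pipeline.

```python
import time

def timed(train_fn, *args):
    """Run a training callable and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = train_fn(*args)
    return result, time.perf_counter() - start
```

`time.perf_counter()` is monotonic and high-resolution, which makes it the appropriate clock for measuring elapsed durations.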

Figure 8: Sensitivity of all datasets for the fingerprint descriptor.


5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.

Moreover, the HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms, such as Spark in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. But a model constructed from an imbalanced dataset appears to move the hyperplane away from the optimal location toward the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad gap between specificity and sensitivity. Therefore, such a model's predictability can be significantly weak. We not only relied on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with the blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE on AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds with an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL numeric descriptor.

Figure 9 Specificity of all datasets for the PaDEL descriptor

075

08

085

09

095

1

105

RF DT MLP LG GBT RF DT MLP LG GBT RF DT MLP LG GBT

AID 440 AID 624202 AID 651820

Nosample

SMOTE

KSMOTE

Figure 10 Specificity of all datasets for the fingerprint descriptor


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors (entries are sensitivity/specificity).

| Dataset | Algorithm | Numeric No-sample | Numeric SMOTE | Numeric KSMOTE | Fingerprint No-sample | Fingerprint SMOTE | Fingerprint KSMOTE |
|---|---|---|---|---|---|---|---|
| AID 440 | RF | 0.028/0.998 | 0.32/0.997 | 0.91/0.997 | 0.09/0.985 | 0.2/0.998 | 0.948/0.99 |
| AID 440 | DT | 0.25/0.992 | 0.36/0.986 | 0.9/0.98 | 0.257/0.99 | 0.214/0.985 | 0.93/0.98 |
| AID 440 | MLP | 0.37/0.996 | 0.26/0.993 | 0.94/0.993 | 0.22/0.99 | 0.25/0.994 | 0.951/0.99 |
| AID 440 | LG | 0.314/0.998 | 0.46/0.98 | 0.94/0.98 | 0.17/0.998 | 0.267/0.976 | 0.94/0.98 |
| AID 440 | GBT | 0.05/0.997 | 0.32/0.993 | 0.94/0.991 | 0.228/0.99 | 0.17/0.995 | 0.95/0.98 |
| AID 624202 | RF | 0.2/0.993 | 0.39/0.965 | 0.905/0.993 | 0.25/0.997 | 0.4/0.987 | 0.93/0.99 |
| AID 624202 | DT | 0.351/0.958 | 0.4/0.957 | 0.92/0.956 | 0.305/0.946 | 0.33/0.934 | 0.924/0.958 |
| AID 624202 | MLP | 0.57/0.967 | 0.53/0.993 | 0.928/0.96 | 0.418/0.961 | 0.24/0.964 | 0.956/0.971 |
| AID 624202 | LG | 0.4/0.807 | 0.8/0.806 | 0.921/0.81 | 0.775/0.987 | 0.76/0.86 | 0.825/0.984 |
| AID 624202 | GBT | 0.25/0.985 | 0.38/0.957 | 0.91/0.985 | 0.249/0.994 | 0.23/0.9848 | 0.924/0.995 |
| AID 651820 | RF | 0.529/0.983 | 0.648/0.944 | 0.94/0.94 | 0.56/0.98 | 0.67/0.966 | 0.9/0.95 |
| AID 651820 | DT | 0.57/0.923 | 0.579/0.876 | 0.91/0.9 | 0.65/0.9 | 0.63/0.896 | 0.88/0.95 |
| AID 651820 | MLP | 0.728/0.95 | 0.716/0.943 | 0.9/0.913 | 0.58/0.93 | 0.673/0.933 | 0.9/0.93 |
| AID 651820 | LG | 0.62/0.964 | 0.79/0.856 | 0.94/0.9 | 0.59/0.97 | 0.69/0.876 | 0.88/0.95 |
| AID 651820 | GBT | 0.52/0.97 | 0.56/0.93 | 0.89/0.914 | 0.63/0.972 | 0.63/0.967 | 0.91/0.91 |


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, for AID 624202 the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds with an accuracy of 94%.

Also, for AID 651820 the model is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds with an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because it relies from the start on the dissimilarity between the cluster samples it draws. This dissimilarity between samples increases the classifiers' accuracy; besides that, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority class.
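The majority-to-minority ratios implied by the compound counts quoted above can be checked directly; the counts below are taken from the text.

```python
def imbalance_ratio(n_active, n_inactive):
    """Majority-to-minority ratio of a bioassay dataset."""
    return n_inactive / n_active

# Compound counts as quoted in the discussion above.
ratios = {
    "AID 440": imbalance_ratio(37, 4963),       # ~134 inactives per active
    "AID 624202": imbalance_ratio(796, 72807),  # ~91 inactives per active
    "AID 651820": imbalance_ratio(2332, 54269), # ~23 inactives per active
}
```

These ratios quantify why overall accuracy is uninformative here: predicting "inactive" for everything already scores above 99% on AID 440.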

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produced new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.
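The idea of restraining newly generated samples can be sketched as interpolation followed by per-feature clamping. The bounds `lo`/`hi` (for instance, the minority cluster's feature minima and maxima) are an assumption for illustration, not the paper's exact constraint.

```python
import random

def bounded_synthetic(a, b, lo, hi, seed=0):
    """SMOTE-style interpolation between minority samples a and b,
    with each feature clamped to [lo, hi] so the synthetic point
    cannot drift into majority-class space."""
    rng = random.Random(seed)
    t = rng.random()  # random interpolation factor in [0, 1)
    raw = (x + t * (y - x) for x, y in zip(a, b))
    return [min(max(v, low), high) for v, low, high in zip(raw, lo, hi)]
```

Clamping trades a little diversity in the synthetic samples for the guarantee that oversampling never manufactures points outside the chosen minority region.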

Recent developments in technology allow for high-throughput screening facilities, large-scale hubs, and individual laboratories that produce massive amounts of data at an unprecedented speed. The need for extensive information

management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. A comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds with KSMOTE makes it a promising tool for data mining applications that investigate biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency. It also enabled the system to be more rapid in data processing compared to traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling methods depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the datasets used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized application of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority class samples (active compounds) were successfully identified, and good prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Page 8: HandlingImbalanceClassificationVirtualScreeningBigData ......rent advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remainsadauntingtask.InML,manyapproacheshavebeen

of the classifiers +e only drawback shown is the G-mean ofLG performance by almost 8 from other classifier results incase of PaDEL fingerprint classifier is used On the contraryLG classifier shows much more progress than before whereits average G-mean reached 08 which is a good achieve-ment For the rest of the classifiers in both descriptorsG-mean values are enhanced from the previous dataset Stillamong all the classifiers G-mean value for the RF classifierwith no-sample approach is worst +e overall conclusion isthat KSMOTE is recommended to be used with both AID440 and AID 624202 datasets

Table 2 summarizes Figures 4ndash6 collecting all of theresults in one place It shows the complete set of experimentsand average G-mean results in values to see the overallpicture Again as can be exerted from the table the proposedKSMOTE approach gives the best results of G-mean on thethree datasets It is believed that partitioning active andnonactive compounds to K clusters and then combiningpairs that have large distances led to an accurate rate of

oversampling instances in the SMOTE algorithm +is ex-plains why the proposed model produces the best results

43 Sensitivity-Based Performance Sensitivity is anothermetric to measure the performance of the proposed ap-proach compared to others Here the sensitivity perfor-mance presentation is a little bit different where theperformance of the three approaches is displayed for thethree datasets Figure 7 shows the sensitivity of all datasetsbased on PaDEL numeric descriptor while Figure 8 presentsthe sensitivity results for the fingerprint descriptor It isobvious that the KSMOTE sensitivity values are superior toother approaches using both descriptors In additionSMOTE is overperforming the no-sample approach in al-most all of the cases For the AID 440 dataset a low sen-sitivity value of 037 for the minority class is shown by theMLP model from the initial dataset (ie without SMOTEresample) +e SMOTE algorithm was introduced to

0010203040506070809

1

No sample SMOTE KSMOTE No sample SMOTE KSMOTE

PADEL numeric descriptor PADEL fingerprint

RFDTMLP

LGGBT

Figure 4 G-mean of PaDEL descriptor and fingerprint for AID 440

0010203040506070809

1

No sample SMOTE KSMOTE No sample SMOTE KSMOTE

PADEL numeric descriptor PADEL fingerprint

RFDTMLP

LGGBT

Figure 5 G-mean of PaDEL descriptor and fingerprint for AID624202

8 Complexity

Table 2 Complete set of experiments and average G-mean results

Algorithm PaDEL numeric descriptor PaDEL fingerprintNo-sample SMOTE KSMOTE Time No-sample SMOTE KSMOTE Time

AID 440

RF 0167 0565 0954 23 029 0442 096 12DT 05 059 0937 93 051 0459 0958 49MLP 06 05 0963 20 0477 0498 0964 96LG 056 067 0963 11 0413 0512 096 56GBT 023 056 0963 33 0477 0421 0963 171

AID624202

RF 0445 0625 0952 297 05 0628 096 153DT 0576 0614 094 10 054 0564 094 5MLP 074 0715 095 252 0636 0497 0958 135LG 0628 083 094 268 0791 078 0837 1325GBT 0489 061 095 45 0495 0482 0954 2236

AID 651820

RF 0722 0792 0956 41 0741 0798 092 1925DT 0725 072 0932 878 0765 0743 089 444MLP 082 0817 0915 35 0788 08 091 173LG 0779 08357 0962 19 075 0768 089 936GBT 0714 0742 09 605 0762 0766 0905 299

0010203040506070809

1

No sample SMOTE KSMOTE No sample SMOTE KSMOTE

PADEL numeric descriptor PADEL fingerprint

RFDTMLP

LGGBT

Figure 6 G-mean of numeric PaDEL descriptor and fingerprint for AID 65182

0010203040506070809

1

RF DT MLP LG GBT RF DT MLP LG GBT RF DT MLP LG GBT

AID 440 AID 624202 AID 651820

Nosample

SMOTE

KSMOTE

Figure 7 Sensitivity of all datasets for the PaDEL numeric descriptor

Complexity 9

oversampling this minority class to significantly improveperceptibility with LG jumping from 0314 to 046 +eKSMOTE algorithm has been used to sample this minorityclass to boost the predictability of the interesting class whichrepresents active compounds Besides the sensitivity in-creases have been shown in the LG with the KSMOTEsensitivity value jumped from 046 to 094 percent

For the AID 624202 dataset where the original modelhardly recognizes the rare class (active compounds) withexceptionally weak sensitivity a 02 RF is more significantlyimproved However the sensitivity value improves dra-matically from 04 to 08 in LG with the incorporation ofSMOTE In KSMOTE however the sensitivity value in MLPrises considerably from 08 to 0928+emodel classificationof AID 651820 is similarly enhanced with sensitivity in themajority class in MLP 0728 (inactive compounds)

Figure 8 presents the sensitivity of the three approachesusing fingerprint descriptors It is obvious that KSMOTEsensitivity values are superior to other approaches As can beseen the performance differs based on the type of the useddataset on the contrary KSMOTE has a stable performanceusing different classifiers In other words the difference inthe KSMOTE is not that noticeable However it is clear fromthe figure that on average SMOTE and no-sample ap-proaches have the same performance as well as behaviorwhen applied to all datasets Besides the sensitivity resultsbecame much better when they are applied on theAID651820 dataset than when AID 440 and AID624202were used Again the results go along with the previousmeasurement

44 Specificity-Based Performance Specificity is anotherimportant performance measure where it measures thepercentage of negative classified classes that are correctlyclassified Figures 9 and 10 show the specificity of all clas-sifiers using PaDEL and fingerprint descriptors respectively+ose figures illuminate two points as follows

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) Fingerprint descriptor results are more stable than PaDEL descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the produced results using different classifiers. KSMOTE has a superior result in most of the experiments. The sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors are shown, and the highest gained values among the results, achieved by KSMOTE in most cases, show the efficiency of the proposed method.
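The two measures discussed above follow directly from the confusion-matrix counts. The sketch below is illustrative only (the function name and the positive-label convention are assumptions, not the paper's code):

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = recall on the positive (active, minority) class;
    specificity = recall on the negative (inactive, majority) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

A model biased toward the majority class scores high specificity but very low sensitivity, which is exactly the no-sample pattern visible in Table 3.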

4.5. Computational Time Comparison. One of the issues that algorithms always face is computational time, especially if those algorithms are designed to work on limited-resource devices. The models without SMOTE for samples with minority classes cannot achieve adequate performance. On the other hand, the five classifiers' computational time when KSMOTE is used has been proven to be justified by the sensitivity and G-mean values in almost all of the three PubChem datasets (see Figures 4-8). The computed computational time is reported in Figures 11 and 12. It is interesting to note that both numeric PaDEL descriptors and fingerprint PaDEL descriptors produce similar computational efficiency rankings among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values of fingerprint PaDEL are much smaller than the numeric PaDEL descriptor's computational time in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with values of about 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
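Per-classifier wall-clock measurements of the kind compared in Figures 11 and 12 can be taken with a simple wrapper. This is an illustrative sketch, not the paper's benchmarking harness; the helper name is hypothetical:

```python
import time

def time_fit(fit_fn, *args, **kwargs):
    """Run a training call and return its result together with the
    elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    return result, time.perf_counter() - start

# e.g. model, seconds = time_fit(classifier.fit, X_train, y_train)
```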

Figure 8: Sensitivity of all datasets for the fingerprint descriptor (bars for RF, DT, MLP, LG, and GBT on AID 440, AID 624202, and AID 651820; series: no-sample, SMOTE, KSMOTE).


5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio of AID 440 is about 1:134, that of AID 624202 is about 1:91.5, and that of AID 651820 is about 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.
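The G-mean used throughout is the geometric mean of the two class-wise accuracies, so the three figures quoted above are mutually consistent (a one-line check; the values are taken from the text):

```python
import math

def g_mean(sensitivity, specificity):
    # High only when BOTH classes are recognized well, unlike overall
    # accuracy, which the majority class can dominate under imbalance.
    return math.sqrt(sensitivity * specificity)

print(round(g_mean(0.97, 0.98), 3))  # 0.975, the reported average G-mean
```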

Moreover, the HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms, such as Spark in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. But a model constructed from an imbalanced dataset appears to move the hyperplane away from the optimal location toward the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity. Therefore, such a model's predictability can be significantly weak. We not only rely on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with the blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE on AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds at an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor (bars for RF, DT, MLP, LG, and GBT on AID 440, AID 624202, and AID 651820; series: no-sample, SMOTE, KSMOTE).

Figure 10: Specificity of all datasets for the fingerprint descriptor (same classifiers, datasets, and series as Figure 9).


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors.

                 PaDEL numeric descriptor                    PaDEL fingerprint
                 No-sample     SMOTE         KSMOTE          No-sample     SMOTE         KSMOTE
Algorithm        Sens.  Spec.  Sens.  Spec.  Sens.  Spec.    Sens.  Spec.  Sens.  Spec.  Sens.  Spec.

AID 440
RF               0.028  0.998  0.32   0.997  0.91   0.997    0.09   0.985  0.2    0.998  0.948  0.99
DT               0.25   0.992  0.36   0.986  0.9    0.98     0.257  0.99   0.214  0.985  0.93   0.98
MLP              0.37   0.996  0.26   0.993  0.94   0.993    0.22   0.99   0.25   0.994  0.951  0.99
LG               0.314  0.998  0.46   0.98   0.94   0.98     0.17   0.998  0.267  0.976  0.94   0.98
GBT              0.05   0.997  0.32   0.993  0.94   0.991    0.228  0.99   0.17   0.995  0.95   0.98

AID 624202
RF               0.2    0.993  0.39   0.965  0.905  0.993    0.25   0.997  0.4    0.987  0.93   0.99
DT               0.351  0.958  0.4    0.957  0.92   0.956    0.305  0.946  0.33   0.934  0.924  0.958
MLP              0.57   0.967  0.53   0.993  0.928  0.96     0.418  0.961  0.24   0.964  0.956  0.971
LG               0.4    0.807  0.8    0.806  0.921  0.81     0.775  0.987  0.76   0.86   0.825  0.984
GBT              0.25   0.985  0.38   0.957  0.91   0.985    0.249  0.994  0.23   0.9848 0.924  0.995

AID 651820
RF               0.529  0.983  0.648  0.944  0.94   0.94     0.56   0.98   0.67   0.966  0.9    0.95
DT               0.57   0.923  0.579  0.876  0.91   0.9      0.65   0.9    0.63   0.896  0.88   0.95
MLP              0.728  0.95   0.716  0.943  0.9    0.913    0.58   0.93   0.673  0.933  0.9    0.93
LG               0.62   0.964  0.79   0.856  0.94   0.9      0.59   0.97   0.69   0.876  0.88   0.95
GBT              0.52   0.97   0.56   0.93   0.89   0.914    0.63   0.972  0.63   0.967  0.91   0.91


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, for AID 624202, the model is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds at an accuracy of 94%.

Also, for AID 651820, the model is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds at an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because it depends, at the beginning, on the dissimilarity used in taking the samples of clusters. This dissimilarity between samples increases the classifiers' accuracy; besides that, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produced new samples that do not affect the majority class space in any way. We use randomness in an effective way by restraining the maximum and minimum values of the newly generated samples.
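The cluster-then-oversample idea can be sketched as follows. This is a simplified pure-Python illustration, not the authors' Spark implementation; `smote_in_cluster` and its parameters are hypothetical names. Within each K-means cluster of minority points, new samples are interpolated SMOTE-style between nearest neighbours and clipped to the cluster's per-feature range, so synthetic points stay inside the minority region:

```python
import random

def smote_in_cluster(cluster, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples within one cluster of
    minority (active-compound) points, each a list of feature values."""
    rng = random.Random(seed)
    dims = range(len(cluster[0]))
    # Per-feature bounds of the cluster: synthetic points are clipped
    # to [lo, hi] so they never drift into majority-class space.
    lo = [min(p[d] for p in cluster) for d in dims]
    hi = [max(p[d] for p in cluster) for d in dims]
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(cluster)
        # k nearest neighbours of x inside the same cluster
        nbrs = sorted((p for p in cluster if p is not x),
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(nbrs)
        gap = rng.random()  # SMOTE interpolation factor in [0, 1)
        synthetic.append([min(max(a + gap * (b - a), lo[d]), hi[d])
                          for d, (a, b) in enumerate(zip(x, nb))])
    return synthetic
```

Running SMOTE inside each cluster rather than globally is what distinguishes this family of methods from plain SMOTE: interpolation partners are always drawn from the same dense minority region.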

Recent developments in technology allow for high-throughput scanning facilities, large-scale hubs, and individual laboratories that produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds by KSMOTE makes it a promising tool for data mining applications to investigate biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency. It also enabled the system to be more rapid in data processing compared to traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor (RF, DT, MLP, LG, and GBT on AID 440, AID 624202, and AID 651820).

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor (same classifiers and datasets as Figure 11).


method works best for all problems. The choice of data sampling methods depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the datasets used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balanced binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized application of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and pathological prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The dataset used is publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.



Table 2: Complete set of experiments and average G-mean results (Time in seconds).

                 PaDEL numeric descriptor               PaDEL fingerprint
Algorithm        No-sample  SMOTE   KSMOTE  Time        No-sample  SMOTE   KSMOTE  Time

AID 440
RF               0.167      0.565   0.954   23          0.29       0.442   0.96    12
DT               0.5        0.59    0.937   9.3         0.51       0.459   0.958   4.9
MLP              0.6        0.5     0.963   20          0.477      0.498   0.964   9.6
LG               0.56       0.67    0.963   11          0.413      0.512   0.96    5.6
GBT              0.23       0.56    0.963   33          0.477      0.421   0.963   17.1

AID 624202
RF               0.445      0.625   0.952   29.7        0.5        0.628   0.96    15.3
DT               0.576      0.614   0.94    10          0.54       0.564   0.94    5
MLP              0.74       0.715   0.95    25.2        0.636      0.497   0.958   13.5
LG               0.628      0.83    0.94    26.8        0.791      0.78    0.837   13.25
GBT              0.489      0.61    0.95    45          0.495      0.482   0.954   22.36

AID 651820
RF               0.722      0.792   0.956   41          0.741      0.798   0.92    19.25
DT               0.725      0.72    0.932   8.78        0.765      0.743   0.89    4.44
MLP              0.82       0.817   0.915   35          0.788      0.8     0.91    17.3
LG               0.779      0.8357  0.962   19          0.75       0.768   0.89    9.36
GBT              0.714      0.742   0.9     60.5        0.762      0.766   0.905   29.9

Figure 6: G-mean of numeric PaDEL descriptor and fingerprint for AID 651820 (series: RF, DT, MLP, LG, GBT; groups: no-sample, SMOTE, KSMOTE).



Conflicts of Interest

+e authors declare that they have no conflicts of interest

References

[1] J Kim and M Shin ldquoAn integrative model of multi-organdrug-induced toxicity prediction using gene-expression datardquoBMC Bioinformatics vol 15 no S16 p S2 2014

[2] N Nagamine T Shirakawa Y Minato et al ldquoIntegratingstatistical predictions and experimental verifications for en-hancing protein-chemical interaction predictions in virtual

screeningrdquo PLoS Computational Biology vol 5 no 6 ArticleID e1000397 2009

[3] P R Graves and T Haystead ldquoMolecular biologistrsquos guide toproteomicsrdquo Microbiology and Molecular Biology Reviewsvol 66 no 1 pp 39ndash63 2002

[4] B Chen R F Harrison G Papadatos et al ldquoEvaluation ofmachine-learning methods for ligand-based virtual screen-ingrdquo Journal of Computer-Aided Molecular Design vol 21no 1-3 pp 53ndash62 2007

[5] NIH ldquoPubChemrdquo 2020 httpspubchemncbinlmnihgov[6] R Barandela J S Sanchez V Garcıa and E Rangel

ldquoStrategies for learning in class imbalance problemsrdquo PatternRecognition vol 36 no 3 pp 849ndash851 2003

[7] L Han Y Wang and S H Bryant ldquoDeveloping and vali-dating predictive decision tree models from mining chemicalstructural fingerprints and high-throughput screening data inPubChemrdquo BMC Bioinformatics vol 9 no 1 p 401 2008

[8] Q Li T Cheng Y Wang and S H Bryant ldquoPubChem as apublic resource for drug discoveryrdquo Drug Discovery Todayvol 15 no 23-24 pp 1052ndash1057 2010

[9] L Cao and F E H Tay ldquoFinancial forecasting using supportvector machinesrdquo Neural Computing amp Applications vol 10no 2 pp 184ndash192 2001

[10] A Estabrooks T Jo and N Japkowicz ldquoA multiple resam-pling method for learning from imbalanced data setsrdquoComputational Intelligence vol 20 no 1 pp 18ndash36 2004

[11] C-Y Chang M-T Hsu E X Esposito and Y J TsengldquoOversampling to overcome overfitting exploring the rela-tionship between data set composition molecular descriptorsand predictive modeling methodsrdquo Journal of Chemical In-formation and Modeling vol 53 no 4 pp 958ndash971 2013

[12] N Japkowicz and S Stephen ldquo+e class imbalance problem asystematic study1rdquo Intelligent Data Analysis vol 6 no 5pp 429ndash449 2002

[13] G M Weiss and F Provost ldquoLearning when training data arecostly the effect of class distribution on tree inductionrdquoJournal of Artificial Intelligence Research vol 19 pp 315ndash3542003

[14] K Verma and S J Rahman ldquoDetermination of minimumlethal time of commonly used mosquito larvicidesrdquo 8eJournal of Communicable Diseases vol 16 no 2 pp 162ndash1641984

[15] S Q Ye Big Data Analysis for Bioinformatics and BiomedicalDiscoveries Chapman and HallCRC Boca Raton FL USA2015

[16] S del Rıo V Lopez J M Benıtez and F Herrera ldquoOn the useof mapreduce for imbalanced big data using random forestrdquoInformation Sciences vol 285 pp 112ndash137 2014

[17] A Fernandez M J del Jesus and F Herrera ldquoOn the in-fluence of an adaptive inference system in fuzzy rule basedclassification systems for imbalanced data-setsrdquo Expert Sys-tems with Applications vol 36 no 6 pp 9805ndash9812 2009

[18] K Sid and M Batouche Ensemble Learning for Large ScaleVirtual Screening on Apache Spark Springer Cham Swit-zerland 2018

[19] V Lopez A Fernandez J G Moreno-Torres and F HerreraldquoAnalysis of preprocessing vs cost-sensitive learning forimbalanced classification Open problems on intrinsic datacharacteristicsrdquo Expert Systems with Applications vol 39no 7 pp 6585ndash6608 2012

[20] B Hemmateenejad K Javadnia and M Elyasi ldquoQuantitativestructure-retention relationship for the kovats retention indicesof a large set of terpenes a combined data splitting-feature

14 Complexity

selection strategyrdquo Analytica Chimica Acta vol 592 no 1pp 72ndash81 2007

[21] S Jain E Kotsampasakou and G F Ecker ldquoComparing theperformance of meta-classifiers-a case study on selectedimbalanced data sets relevant for prediction of liver toxicityrdquoJournal of Computer-Aided Molecular Design vol 32 no 5pp 583ndash590 2018

[22] A V Zakharov M L Peach M Sitzmann andM C Nicklaus ldquoQSAR modeling of imbalanced high-throughput screening data in PubChemrdquo Journal of ChemicalInformation and Modeling vol 54 no 3 pp 705ndash712 2014

[23] B X Wang and N Japkowicz ldquoBoosting support vectormachines for imbalanced data setsrdquo Knowledge and Infor-mation Systems vol 25 no 1 pp 1ndash20 2010

[24] Y Xiong Y Qiao D Kihara H-Y Zhang X Zhu andD-Q Wei ldquoSurvey of machine learning techniques forprediction of the isoform specificity of cytochrome P450substratesrdquo Current Drug Metabolism vol 20 no 3pp 229ndash235 2019

[25] B K Shoichet ldquoVirtual screening of chemical librariesrdquoNature vol 432 no 7019 pp 862ndash865 2004

[26] L T Afolabi F Saeed H Hashim and O O PetinrinldquoEnsemble learning method for the prediction of new bio-active moleculesrdquo PLoS One vol 13 no 1 Article IDe0189538 2018

[27] A C Schierz ldquoVirtual screening of bioassay datardquo Journal ofCheminformatics vol 1 no 1 p 21 2009

[28] S K Hussin Y M Omar S M Abdelmageid andM I Marie ldquoTraditional machine learning and big dataanalytics in virtual screening a comparative studyrdquo Inter-national Journal of Advanced Research in Computer Sciencevol 10 no 47 pp 72ndash88 2020

[29] l E Frank and J H Friedman ldquoA statistical view of somechemometrics regression toolsrdquo Technometrics vol 35 no 2pp 109ndash135 1993

[30] N V Chawla K W Bowyer L O Hall andW P Kegelmeyer ldquoSMOTE synthetic minority over-sam-pling techniquerdquo Journal of Artificial Intelligence Researchvol 16 pp 321ndash357 2002

[31] Q Li Y Wang and S H Bryant ldquoA novel method for mininghighly imbalanced high-throughput screening data in Pub-Chemrdquo Bioinformatics vol 25 no 24 pp 3310ndash3316 2009

[32] R Guha and S C Schurer ldquoUtilizing high throughputscreening data for predictive toxicology models protocols andapplication to MLSCN assaysrdquo Journal of Computer-AidedMolecular Design vol 22 no 6-7 pp 367ndash384 2008

[33] P Banerjee F O Dehnbostel and R Preissner ldquoPrediction isa balancing act importance of sampling methods to balancesensitivity and specificity of predictive models based onimbalanced chemical data setsrdquo Frontiers in Chemistry vol 62018

[34] P W Novianti V L Jong K C B Roes andM J C Eijkemans ldquoFactors affecting the accuracy of a classprediction model in gene expression datardquo BMC Bio-informatics vol 16 no 1 p 199 2015

[35] Spark ldquoApache spark is a unified analytics engine for big dataprocessingrdquo 2020

[36] P AID440 ldquoPrimary HTS assay for formylpeptide receptor(FPR) ligands and primary HTS counter-screen assay forformylpeptide-like-1 (FPRL1) ligandsrdquo 2007 httpspubchemncbinlmnihgovbioassay440

[37] P 624202 AID ldquoqHTS assay to identify small molecule ac-tivators of BRCA1 expressionrdquo 2012 httpspubchemncbinlmnihgovbioassay624202

[38] P 651820 AID ldquoqHTS assay for inhibitors of hepatitis C virus(HCV)rdquo 2012 httpspubchemncbinlmnihgovbioassay651820

[39] Chemlibretextsorg ldquoMolecules and molecular com-poundsrdquo 2020 httpschemlibretextsorgBookshelvesGeneral_ChemistryMap3A_Chemistry_-_+e_Central_Science_(Brown_et_al)02_Atoms_Molecules_and_Ions263A_Molecules_and_Molecular_Compounds

[40] X Wang X Liu S Matwin and N Japkowicz ldquoApplyinginstance-weighted support vector machines to class imbal-anced datasetsrdquo in Proceedings of the 2014 IEEE InternationalConference on Big Data (Big Data) pp 112ndash118 IEEEWashington DC USA December 2014

[41] V H Masand N N E El-Sayed M U Bambole V R Patiland S D +akur ldquoMultiple quantitative structure-activityrelationships (QSARs) analysis for orally active trypanocidalN-myristoyltransferase inhibitorsrdquo Journal of MolecularStructure vol 1175 pp 481ndash487 2019

[42] S W Purnami and R K Trapsilasiwi ldquoSMOTE-least squaresupport vector machine for classification of multiclass im-balanced datardquo in Proceedings of the 9th International Con-ference on Machine Learning and Computing - ICMLCpp 107ndash111 Singapore Singapore Febuary 2017

[43] T Cheng Q Li Y Wang and S H Bryant ldquoBinary classi-fication of aqueous solubility using support vector machineswith reduction and recombination feature selectionrdquo Journalof Chemical Information and Modeling vol 51 no 2pp 229ndash236 2011

[44] B X Wang and N Japkowicz ldquoBoosting support vectormachines for imbalanced datasets Knowledge and Informa-tion Systemsrdquo Knowledge and Information Systems vol 25no 1 p 25 2010

[45] S P R Trapsilasiwi ldquoSMOTE-least square support vectormachine for classification of multiclass imbalanced datardquo inProceedings of the 9th International Conference on MachineLearning and Computing pp 107ndash111 ACM SingaporeFebruary 2017

Complexity 15

Page 10: HandlingImbalanceClassificationVirtualScreeningBigData ......rent advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remainsadauntingtask.InML,manyapproacheshavebeen

oversampling this minority class to significantly improve perceptibility, with LG jumping from 0.314 to 0.46. The KSMOTE algorithm has been used to sample this minority class to boost the predictability of the interesting class, which represents the active compounds. Moreover, the sensitivity increase is visible in LG: with KSMOTE, the sensitivity value jumps from 0.46 to 0.94.

For the AID 624202 dataset, where the original model hardly recognizes the rare class (active compounds) with an exceptionally weak sensitivity of 0.2, RF is more significantly improved. The sensitivity value improves dramatically from 0.4 to 0.8 in LG with the incorporation of SMOTE. With KSMOTE, however, the sensitivity value in MLP rises considerably from 0.8 to 0.928. The model classification of AID 651820 is similarly enhanced, with a sensitivity of 0.728 in MLP for the majority class (inactive compounds).

Figure 8 presents the sensitivity of the three approaches using fingerprint descriptors. It is obvious that the KSMOTE sensitivity values are superior to those of the other approaches. As can be seen, the performance differs based on the dataset used; by contrast, KSMOTE performs stably across different classifiers. In other words, the differences within KSMOTE are not that noticeable. However, it is clear from the figure that, on average, the SMOTE and no-sample approaches show the same performance and behavior when applied to all datasets. Besides, the sensitivity results are much better on the AID 651820 dataset than on AID 440 and AID 624202. Again, the results go along with the previous measurement.

4.4. Specificity-Based Performance. Specificity is another important performance measure: it measures the percentage of negative instances that are correctly classified. Figures 9 and 10 show the specificity of all classifiers using PaDEL and fingerprint descriptors, respectively. Those figures illuminate two points, as follows:

(A) All algorithms, on average, correctly identify the negative classes, except the SMOTE LG classifier in both the AID 624202 and AID 651820 datasets.

(B) Fingerprint descriptor results are more stable than PaDEL descriptor results.

To summarize the sensitivity and specificity results, Table 3 shows the results produced using the different classifiers. KSMOTE has superior results in most of the experiments. Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors are shown, and the highest values gained among the results show the efficiency of the proposed method, KSMOTE.
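Concretely, sensitivity is the recall on the active (positive) class, specificity the recall on the inactive (negative) class, and the G-mean their geometric mean. A minimal plain-Python sketch of these measures (toy labels for illustration, not the paper's data):

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the active (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn)   # recall on the active class
    specificity = tn / (tn + fp)   # recall on the inactive class
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)
```

With an imbalanced set, overall accuracy can stay high while sensitivity collapses, which is why these three measures are reported instead of accuracy alone.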

4.5. Computational Time Comparison. One of the issues that algorithms always face is computational time, especially if they are designed to work on limited-resource devices. The models without SMOTE cannot achieve adequate performance for samples of the minority classes. On the other hand, the five classifiers, when KSMOTE is used, have been proven accurate in terms of sensitivity and G-mean values in almost all of the three PubChem datasets (see Figures 4-8). The computed computational time is reported in Figures 11 and 12. It is interesting to note that both the numeric PaDEL descriptors and the fingerprint PaDEL descriptors produce similar computational efficiency among the five classifiers. From the results, DT and LG give the best computation time among all classifiers, followed by MLP. The computational time values of fingerprint PaDEL are much smaller than those of the numeric PaDEL descriptor in most cases. However, looking at the maximum computational time among the classifiers in both Figures 11 and 12, it turns out to be the GBT classifier, with about 60 and 29 seconds for the numeric and fingerprint descriptors, respectively.
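The timing comparison in Figures 11 and 12 boils down to wall-clock measurement of each classifier's training step. A hedged sketch of such a harness, where `dummy_fit` and the iteration counts are hypothetical placeholders standing in for the real Spark MLlib training calls:

```python
import time

def time_fit(name, fit_fn, *args):
    """Measure wall-clock training time of one classifier, as in Figures 11 and 12."""
    start = time.perf_counter()
    fit_fn(*args)
    return name, time.perf_counter() - start

# Placeholder "training" routine standing in for RF, DT, MLP, LG, and GBT fits.
def dummy_fit(n_iterations):
    total = 0
    for i in range(n_iterations):
        total += i * i
    return total

timings = dict(
    time_fit(name, dummy_fit, n)
    for name, n in [("RF", 50_000), ("DT", 10_000), ("MLP", 80_000),
                    ("LG", 10_000), ("GBT", 100_000)]
)
slowest = max(timings, key=timings.get)
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and has the highest available resolution for interval measurement.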

Figure 8: Sensitivity of all datasets for the fingerprint descriptor (no-sample vs. SMOTE vs. KSMOTE).

10 Complexity

5. Discussion

It has been considered that the key problem of HTS data is its extreme imbalance, with only a few hits often identified from the wide variety of compounds analyzed. The imbalance ratio of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, with a G-mean of 97.5%.
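Those ratios can be sanity-checked from the active/inactive compound counts quoted later in this section; a small sketch (the counts are taken from the text):

```python
# Active/inactive compound counts as reported in the text for each bioassay.
counts = {
    "AID 440":    (37,   4963),   # blind-set counts quoted in this section
    "AID 624202": (796,  72807),
    "AID 651820": (2332, 54269),
}

# Imbalance ratio 1:r, with r = inactive / active.
ratios = {aid: inactive / active for aid, (active, inactive) in counts.items()}
# ratios["AID 440"]    ~ 134   -> 1:134
# ratios["AID 624202"] ~ 91.5  -> 1:91.5
# ratios["AID 651820"] ~ 23    -> 1:23
```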

Moreover, the HTS data size, which typically comprises hundreds of thousands of compounds, poses another challenge, as training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms, such as Spark in this study, were computationally effective, dramatically decreased computing costs in the optimization phase, and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. But a model constructed from an imbalanced dataset appears to move the hyperplane away from the optimal location towards the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity. Therefore, such a model's predictability can be significantly weak. We not only relied on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.

We checked the KSMOTE model's performance with the blind dataset, which included 37 active compounds and 4963 inactive compounds for AID 440. KSMOTE in AID 440 was able to classify the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds at an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor (no-sample vs. SMOTE vs. KSMOTE).

Figure 10: Specificity of all datasets for the fingerprint descriptor (no-sample vs. SMOTE vs. KSMOTE).


Table 3: Sensitivity and specificity results of the three datasets in numeric and fingerprint descriptors.

(a) PaDEL numeric descriptor

Dataset      Algorithm   No-sample          SMOTE              KSMOTE
                         Sens.    Spec.     Sens.    Spec.     Sens.    Spec.
AID 440      RF          0.028    0.998     0.32     0.997     0.91     0.997
             DT          0.25     0.992     0.36     0.986     0.9      0.98
             MLP         0.37     0.996     0.26     0.993     0.94     0.993
             LG          0.314    0.998     0.46     0.98      0.94     0.98
             GBT         0.05     0.997     0.32     0.993     0.94     0.991
AID 624202   RF          0.2      0.993     0.39     0.965     0.905    0.993
             DT          0.351    0.958     0.4      0.957     0.92     0.956
             MLP         0.57     0.967     0.53     0.993     0.928    0.96
             LG          0.4      0.807     0.8      0.806     0.921    0.81
             GBT         0.25     0.985     0.38     0.957     0.91     0.985
AID 651820   RF          0.529    0.983     0.648    0.944     0.94     0.94
             DT          0.57     0.923     0.579    0.876     0.91     0.9
             MLP         0.728    0.95      0.716    0.943     0.9      0.913
             LG          0.62     0.964     0.79     0.856     0.94     0.9
             GBT         0.52     0.97      0.56     0.93      0.89     0.914

(b) PaDEL fingerprint descriptor

Dataset      Algorithm   No-sample          SMOTE              KSMOTE
                         Sens.    Spec.     Sens.    Spec.     Sens.    Spec.
AID 440      RF          0.09     0.985     0.2      0.998     0.948    0.99
             DT          0.257    0.99      0.214    0.985     0.93     0.98
             MLP         0.22     0.99      0.25     0.994     0.951    0.99
             LG          0.17     0.998     0.267    0.976     0.94     0.98
             GBT         0.228    0.99      0.17     0.995     0.95     0.98
AID 624202   RF          0.25     0.997     0.4      0.987     0.93     0.99
             DT          0.305    0.946     0.33     0.934     0.924    0.958
             MLP         0.418    0.961     0.24     0.964     0.956    0.971
             LG          0.775    0.987     0.76     0.86      0.825    0.984
             GBT         0.249    0.994     0.23     0.9848    0.924    0.995
AID 651820   RF          0.56     0.98      0.67     0.966     0.9      0.95
             DT          0.65     0.9       0.63     0.896     0.88     0.95
             MLP         0.58     0.93      0.673    0.933     0.9      0.93
             LG          0.59     0.97      0.69     0.876     0.88     0.95
             GBT         0.63     0.972     0.63     0.967     0.91     0.91

compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, AID 624202 is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds at an accuracy of 94%.

Also, AID 651820 is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds at an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because it depends, at the outset, on the dissimilarity of the samples taken from the clusters. This dissimilarity between samples increases the classifiers' accuracy; besides that, SMOTE is used to increase the number of minority samples (active compounds) by generating new samples for the minority.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produced new samples that do not affect the majority-class space in any way. We use randomness effectively by restraining the maximum and minimum values of the newly generated samples.
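A minimal sketch of that min/max restraint, assuming a plain SMOTE-style interpolation with a hypothetical extrapolation factor of 1.5 (so the clipping actually binds); this is an illustration, not the authors' Spark/CBOS implementation:

```python
import random

def bounded_smote(minority, n_new, seed=0):
    """Interpolate (and slightly extrapolate) random minority pairs, then
    restrain each feature of the synthetic sample to the observed
    minority min/max range."""
    rng = random.Random(seed)
    n_features = len(minority[0])
    lo = [min(row[j] for row in minority) for j in range(n_features)]
    hi = [max(row[j] for row in minority) for j in range(n_features)]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.uniform(0.0, 1.5)  # >1 steps past b (assumed factor)
        point = [a[j] + gap * (b[j] - a[j]) for j in range(n_features)]
        # Restrain the generated sample to the minority feature bounds.
        synthetic.append([min(max(v, lo[j]), hi[j]) for j, v in enumerate(point)])
    return synthetic
```

Clamping to the minority range keeps synthetic actives from drifting into the majority-class region of the feature space.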

Recent developments in technology allow high-throughput screening facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. Comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Its ability to detect active compounds makes KSMOTE a promising tool for data mining applications investigating biological problems, mostly when a large volume of imbalanced data is generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly than traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor.

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor.


method works best for all problems. The choice of data sampling method depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the dataset used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balanced binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized applicability of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and good prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the classification accuracy, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Page 11: HandlingImbalanceClassificationVirtualScreeningBigData ......rent advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remainsadauntingtask.InML,manyapproacheshavebeen

5 Discussion

The key problem of HTS data is its extreme imbalance: only a few hits are typically identified among the wide variety of compounds analyzed. The imbalance ratio of AID 440 is 1:134, that of AID 624202 is 1:91.5, and that of AID 651820 is 1:23, as shown in Figure 1. Based on the conducted experiments, our proposed model successfully distinguished the active compounds with an average accuracy of 97% and the inactive compounds with an accuracy of 98%, giving a G-mean of 97.5%.
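The G-mean reported here is the geometric mean of sensitivity (recall on the active class) and specificity (recall on the inactive class), so the two class-wise accuracies above reproduce it directly; a minimal check:

```python
import math

def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of the two class-wise recalls."""
    return math.sqrt(sensitivity * specificity)

# Average class-wise accuracies reported for the proposed model.
print(round(g_mean(0.97, 0.98), 3))  # -> 0.975
```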

Moreover, the size of HTS data, which typically comprise hundreds of thousands of compounds, poses another challenge: training and optimizing a statistical model on such a dataset is highly time-intensive. Big data platforms such as Spark, used in this study, were computationally effective; they dramatically decreased computing costs in the optimization phase and substantially improved the KSMOTE model's performance.

Ideally, the KSMOTE model separates the active (minority) data from the inactive (majority) data with maximum distance. A model constructed from an imbalanced dataset, however, tends to move the separating hyperplane away from the optimal location toward the minority side. Thus, most items are likely to be categorized into the majority class by both the no-sample and SMOTE models, leading to a broad difference between specificity and sensitivity; such a model's predictive power can be significantly weak. We therefore not only relied on cluster sampling to investigate the progress of the KSMOTE model but also built a SMOTE model for each sampling round.
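KSMOTE's combination of K-means cluster sampling with SMOTE oversampling can be illustrated roughly as follows. This is a simplified sketch, not the authors' implementation: a tiny Lloyd's-algorithm K-means groups the minority class, and SMOTE-style interpolation then generates synthetic samples inside each cluster, so new points stay near genuinely similar minority compounds.

```python
import random

def kmeans(points, k, iters=20, rng=random.Random(0)):
    """Tiny K-means (Lloyd's algorithm) standing in for the clustering
    step of a KSMOTE-style pipeline: group the minority class first."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((pi - ci) ** 2
                                      for pi, ci in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as its cluster mean; keep old center if empty.
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

def smote_cluster(cluster, n_new, rng=random.Random(1)):
    """SMOTE-style oversampling: interpolate between two randomly
    chosen members of the same cluster."""
    out = []
    for _ in range(n_new):
        a, b = rng.sample(cluster, 2)
        t = rng.random()
        out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return out

# Toy minority set with two obvious groups; oversample within each cluster.
minority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.2]]
clusters = kmeans(minority, k=2)
synthetic = [s for cl in clusters if len(cl) >= 2
             for s in smote_cluster(cl, n_new=4)]
print(len(synthetic))
```

Because every synthetic point is a convex combination of two real minority points, it never leaves the bounding box of the minority class.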

We checked the KSMOTE model's performance on the blind dataset, which, for AID 440, included 37 active compounds and 4963 inactive compounds. KSMOTE on AID 440 classified the inactive compounds very well, with an overall accuracy of >98%, while it correctly classified the active compounds with an accuracy of 95%. However, AID 624202 contains 796 active compounds and 72807 inactive

Figure 9: Specificity of all datasets for the PaDEL descriptor (no-sample vs. SMOTE vs. KSMOTE for the RF, DT, MLP, LG, and GBT classifiers on AID 440, AID 624202, and AID 651820).

Figure 10: Specificity of all datasets for the fingerprint descriptor (no-sample vs. SMOTE vs. KSMOTE for the same classifiers and datasets).

Complexity 11

Table 3: Sensitivity (Sen.) and specificity (Spe.) results of the three datasets in numeric and fingerprint descriptors.

PaDEL numeric descriptor:

Dataset      Alg.   No-sample        SMOTE            KSMOTE
                    Sen.    Spe.     Sen.    Spe.     Sen.    Spe.
AID 440      RF     0.028   0.998    0.32    0.997    0.91    0.997
             DT     0.25    0.992    0.36    0.986    0.9     0.98
             MLP    0.37    0.996    0.26    0.993    0.94    0.993
             LG     0.314   0.998    0.46    0.98     0.94    0.98
             GBT    0.05    0.997    0.32    0.993    0.94    0.991
AID 624202   RF     0.2     0.993    0.39    0.965    0.905   0.993
             DT     0.351   0.958    0.4     0.957    0.92    0.956
             MLP    0.57    0.967    0.53    0.993    0.928   0.96
             LG     0.4     0.807    0.8     0.806    0.921   0.81
             GBT    0.25    0.985    0.38    0.957    0.91    0.985
AID 651820   RF     0.529   0.983    0.648   0.944    0.94    0.94
             DT     0.57    0.923    0.579   0.876    0.91    0.9
             MLP    0.728   0.95     0.716   0.943    0.9     0.913
             LG     0.62    0.964    0.79    0.856    0.94    0.9
             GBT    0.52    0.97     0.56    0.93     0.89    0.914

PaDEL fingerprint descriptor:

Dataset      Alg.   No-sample        SMOTE            KSMOTE
                    Sen.    Spe.     Sen.    Spe.     Sen.    Spe.
AID 440      RF     0.09    0.985    0.2     0.998    0.948   0.99
             DT     0.257   0.99     0.214   0.985    0.93    0.98
             MLP    0.22    0.99     0.25    0.994    0.951   0.99
             LG     0.17    0.998    0.267   0.976    0.94    0.98
             GBT    0.228   0.99     0.17    0.995    0.95    0.98
AID 624202   RF     0.25    0.997    0.4     0.987    0.93    0.99
             DT     0.305   0.946    0.33    0.934    0.924   0.958
             MLP    0.418   0.961    0.24    0.964    0.956   0.971
             LG     0.775   0.987    0.76    0.86     0.825   0.984
             GBT    0.249   0.994    0.23    0.9848   0.924   0.995
AID 651820   RF     0.56    0.98     0.67    0.966    0.9     0.95
             DT     0.65    0.9      0.63    0.896    0.88    0.95
             MLP    0.58    0.93     0.673   0.933    0.9     0.93
             LG     0.59    0.97     0.69    0.876    0.88    0.95
             GBT    0.63    0.972    0.63    0.967    0.91    0.91


compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, for AID 624202, the model identifies inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds with an accuracy of 94%.

Also, for AID 651820, the model identifies inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds with an accuracy of 92%. KSMOTE outperforms the other systems (SMOTE only and no-sample) because it begins by selecting dissimilar samples from the clusters. This dissimilarity between samples increases the classifiers' accuracy; in addition, SMOTE is used to increase the number of minority samples (active compounds) by generating new synthetic samples for the minority class.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produces new samples that do not affect the majority class space in any way. We use randomness effectively by restraining the maximum and minimum values of the newly generated samples.
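One way to realize that restraint, assuming the bounds are the per-feature minimum and maximum observed in the real minority class (our reading of the idea, not code from the paper; the function name and data are illustrative), is a simple element-wise clip:

```python
def bound_to_minority_range(minority, candidates):
    """Clip each candidate synthetic sample, feature by feature, to the
    min/max observed in the real minority class, so oversampling never
    strays into regions occupied only by the majority class."""
    lo = [min(col) for col in zip(*minority)]
    hi = [max(col) for col in zip(*minority)]
    return [
        [min(max(x, l), h) for x, l, h in zip(sample, lo, hi)]
        for sample in candidates
    ]

minority = [[0.1, 2.0], [0.4, 3.0], [0.2, 2.5]]
candidates = [[0.0, 2.2], [0.3, 3.5]]  # one value below, one above the range
print(bound_to_minority_range(minority, candidates))  # [[0.1, 2.2], [0.3, 3.0]]
```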

Recent developments in technology allow high-throughput screening facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information

management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. A comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Its ability to detect active compounds makes KSMOTE a promising tool for data-mining applications that investigate biological problems, particularly when large volumes of imbalanced data are generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly than traditional models.
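The computational-time comparison in Figures 11 and 12 amounts to wall-clock timing of each training run. A minimal sketch of that measurement loop (the `train` stub is a hypothetical placeholder, not the paper's code):

```python
import time

def train(name: str) -> None:
    """Placeholder for fitting one classifier (RF, DT, MLP, LG, or GBT)."""
    time.sleep(0.01)  # stand-in for the real training work

timings = {}
for clf in ["RF", "DT", "MLP", "LG", "GBT"]:
    start = time.perf_counter()
    train(clf)
    timings[clf] = time.perf_counter() - start

# Lower is better; in the paper's experiments DT and LG were the fastest.
for clf, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{clf}: {seconds:.3f} s")
```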

6 Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a prediction task on unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor (RF, DT, MLP, LG, and GBT on AID 440, AID 624202, and AID 651820).

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor (same classifiers and datasets).


method works best for all problems. The choice of data sampling method depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the dataset used, sampling approaches substantially affect the gap between the sensitivity and the specificity of a model trained without sampling. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassays. To test the generalized applicability of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority class samples (active compounds) were successfully identified, and satisfactory prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved high sensitivity and G-mean, up to 99% and 98.3%, respectively.

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semisupervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.



[43] T Cheng Q Li Y Wang and S H Bryant ldquoBinary classi-fication of aqueous solubility using support vector machineswith reduction and recombination feature selectionrdquo Journalof Chemical Information and Modeling vol 51 no 2pp 229ndash236 2011

[44] B X Wang and N Japkowicz ldquoBoosting support vectormachines for imbalanced datasets Knowledge and Informa-tion Systemsrdquo Knowledge and Information Systems vol 25no 1 p 25 2010

[45] S P R Trapsilasiwi ldquoSMOTE-least square support vectormachine for classification of multiclass imbalanced datardquo inProceedings of the 9th International Conference on MachineLearning and Computing pp 107ndash111 ACM SingaporeFebruary 2017

Complexity 15

Page 13: HandlingImbalanceClassificationVirtualScreeningBigData ......rent advances in machine learning (ML), developing successful algorithms that learn from unbalanced datasets remainsadauntingtask.InML,manyapproacheshavebeen

compounds, and AID 651820 contains 2332 active compounds and 54269 inactive compounds. Thus, AID 624202 is able to identify inactive compounds very well, with an average accuracy of >96%, while it correctly classifies the active compounds at an accuracy of 94%.

Also, AID 651820 is able to identify inactive compounds quite well, with an average accuracy of >94%, while it correctly classifies the active compounds at an accuracy of 92%. KSMOTE is considered better than the other systems (SMOTE only and no-sample) because, from the start, it draws samples based on the dissimilarity within clusters. This dissimilarity between samples increases the classifiers' accuracy; in addition, KSMOTE uses SMOTE to increase the number of minority samples (active compounds) by generating new samples for the minority class.

The strength of KSMOTE lies in the fact that, in addition to oversampling the minority class accurately, CBOS produces new samples that do not affect the majority class space in any way. Randomness is used effectively by restraining the maximum and minimum values of the newly generated samples.
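As an illustration, the cluster-then-oversample idea described above can be sketched in a few lines. This is a minimal NumPy sketch under our own assumptions (a naive k-means, SMOTE-style linear interpolation, and per-cluster min/max clipping of the synthetics), not the authors' Spark implementation:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Naive k-means: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def ksmote_sketch(X_min, n_new, k=2, seed=0):
    """Cluster the minority samples, then SMOTE-interpolate within each
    cluster, clipping synthetics to the cluster's min/max bounds so new
    points never leave the minority region."""
    rng = np.random.default_rng(seed)
    labels = kmeans(X_min, k, seed=seed)
    new = []
    for _ in range(n_new):
        j = rng.integers(k)
        C = X_min[labels == j]
        if len(C) < 2:          # degenerate cluster: fall back to all minority samples
            C = X_min
        a, b = C[rng.choice(len(C), 2, replace=False)]
        s = a + rng.random() * (b - a)              # SMOTE interpolation
        new.append(np.clip(s, C.min(axis=0), C.max(axis=0)))
    return np.vstack(new)
```

Clipping to the per-cluster bounds is what keeps the randomness "restrained": every synthetic active compound stays inside the observed feature range of its own cluster.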

Recent developments in technology allow high-throughput scanning facilities, large-scale hubs, and individual laboratories to produce massive amounts of data at an unprecedented speed. The need for extensive information management and analysis attracts increasing attention from researchers and government funding agencies. Hence, computational approaches that aid in the efficient processing and extraction of large data are highly valuable and necessary. A comparison among the five classifiers (RF, DT, MLP, LG, and GBT) showed that DT and LG not only performed better but also had higher computational efficiency. Detecting active compounds with KSMOTE makes it a promising tool for data mining applications that investigate biological problems, mostly when a large volume of imbalanced datasets is generated. Apache Spark improved the proposed model and increased its efficiency; it also enabled the system to process data more rapidly than traditional models.

6. Conclusion

Building accurate classifiers from a large imbalanced dataset is a difficult task. Prior research in the literature focused on increasing overall prediction accuracy; however, this strategy leads to a bias towards the majority category. Given a certain prediction task for unbalanced data, one of the relevant questions to ask is what kind of sampling method should be used. Although various sampling methods are available to address the data imbalance problem, no single sampling

Figure 11: Comparison of computational time (seconds) in KSMOTE for the numeric PaDEL descriptor (classifiers RF, DT, MLP, LG, and GBT on AID 440, AID 624202, and AID 651820).

Figure 12: Comparison of computational time (seconds) in KSMOTE for the fingerprint PaDEL descriptor (classifiers RF, DT, MLP, LG, and GBT on AID 440, AID 624202, and AID 651820).


method works best for all problems. The choice of data sampling methods depends to a large extent on the nature of the dataset and the primary learning objective. The results indicate that, regardless of the datasets used, sampling approaches substantially affect the gap between the sensitivity and the specificity of the model trained with the nonsampling method. This study demonstrates the effectiveness of three different models for balancing binary chemical datasets. This work implements both K-means and SMOTE on Apache Spark to classify unbalanced datasets from PubChem bioassay. To test the generalized application of KSMOTE, both PaDEL and fingerprint descriptors were used to construct classification models. An analysis of the results indicated that both sensitivity and G-mean showed a significant improvement after KSMOTE was employed. Minority group samples (active compounds) were successfully identified, and high prediction accuracy was achieved. In addition, models created with PaDEL descriptors showed better performance. The proposed model achieved sensitivity and G-mean of up to 99% and 98.3%, respectively.
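The sensitivity, specificity, and G-mean figures quoted above are connected by a simple formula: G-mean is the geometric mean of the per-class recalls, so it collapses whenever either class is neglected. A small sketch (the function name and the confusion-matrix counts are illustrative, not taken from the paper):

```python
def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity for a binary
    confusion matrix; balanced metric for imbalanced screening data."""
    sensitivity = tp / (tp + fn)   # recall on the minority (active) class
    specificity = tn / (tn + fp)   # recall on the majority (inactive) class
    return (sensitivity * specificity) ** 0.5
```

For example, a classifier with 90% sensitivity and 80% specificity has a G-mean of about 0.85, whereas one that labels everything inactive scores 0 despite very high overall accuracy, which is why G-mean rather than accuracy is reported here.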

For future research, the following points are identified based on the work described in this paper:

(1) It is necessary to find solutions to other similar problems in chemical datasets, such as using semi-supervised methods to increase labeled chemical datasets. There is no doubt that the topic needs to be studied in depth because of its importance and its relationship with other areas of knowledge, such as biomedicine and big data.

(2) It is suggested to study deep learning algorithms for the treatment of class imbalance. Utilizing deep learning may increase the accuracy of the classification, overcoming the deficiencies of existing methods.

(3) One more open area is the development of an online tool that can be used to try different methods and decide on the best results, instead of working with only one method at a time.

Data Availability

The datasets used are publicly available at (1) AID 440 (https://pubchem.ncbi.nlm.nih.gov/bioassay/440), (2) AID 624202 (https://pubchem.ncbi.nlm.nih.gov/bioassay/624202), and (3) AID 651820 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651820).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] J. Kim and M. Shin, "An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data," BMC Bioinformatics, vol. 15, no. S16, p. S2, 2014.

[2] N. Nagamine, T. Shirakawa, Y. Minato et al., "Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening," PLoS Computational Biology, vol. 5, no. 6, Article ID e1000397, 2009.

[3] P. R. Graves and T. Haystead, "Molecular biologist's guide to proteomics," Microbiology and Molecular Biology Reviews, vol. 66, no. 1, pp. 39–63, 2002.

[4] B. Chen, R. F. Harrison, G. Papadatos et al., "Evaluation of machine-learning methods for ligand-based virtual screening," Journal of Computer-Aided Molecular Design, vol. 21, no. 1-3, pp. 53–62, 2007.

[5] NIH, "PubChem," 2020, https://pubchem.ncbi.nlm.nih.gov.

[6] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.

[7] L. Han, Y. Wang, and S. H. Bryant, "Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem," BMC Bioinformatics, vol. 9, no. 1, p. 401, 2008.

[8] Q. Li, T. Cheng, Y. Wang, and S. H. Bryant, "PubChem as a public resource for drug discovery," Drug Discovery Today, vol. 15, no. 23-24, pp. 1052–1057, 2010.

[9] L. Cao and F. E. H. Tay, "Financial forecasting using support vector machines," Neural Computing & Applications, vol. 10, no. 2, pp. 184–192, 2001.

[10] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.

[11] C.-Y. Chang, M.-T. Hsu, E. X. Esposito, and Y. J. Tseng, "Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods," Journal of Chemical Information and Modeling, vol. 53, no. 4, pp. 958–971, 2013.

[12] N. Japkowicz and S. Stephen, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.

[13] G. M. Weiss and F. Provost, "Learning when training data are costly: the effect of class distribution on tree induction," Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.

[14] K. Verma and S. J. Rahman, "Determination of minimum lethal time of commonly used mosquito larvicides," The Journal of Communicable Diseases, vol. 16, no. 2, pp. 162–164, 1984.

[15] S. Q. Ye, Big Data Analysis for Bioinformatics and Biomedical Discoveries, Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.

[16] S. del Río, V. López, J. M. Benítez, and F. Herrera, "On the use of MapReduce for imbalanced big data using random forest," Information Sciences, vol. 285, pp. 112–137, 2014.

[17] A. Fernández, M. J. del Jesus, and F. Herrera, "On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets," Expert Systems with Applications, vol. 36, no. 6, pp. 9805–9812, 2009.

[18] K. Sid and M. Batouche, Ensemble Learning for Large Scale Virtual Screening on Apache Spark, Springer, Cham, Switzerland, 2018.

[19] V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera, "Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics," Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, 2012.

[20] B. Hemmateenejad, K. Javadnia, and M. Elyasi, "Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy," Analytica Chimica Acta, vol. 592, no. 1, pp. 72–81, 2007.

[21] S. Jain, E. Kotsampasakou, and G. F. Ecker, "Comparing the performance of meta-classifiers – a case study on selected imbalanced data sets relevant for prediction of liver toxicity," Journal of Computer-Aided Molecular Design, vol. 32, no. 5, pp. 583–590, 2018.

[22] A. V. Zakharov, M. L. Peach, M. Sitzmann, and M. C. Nicklaus, "QSAR modeling of imbalanced high-throughput screening data in PubChem," Journal of Chemical Information and Modeling, vol. 54, no. 3, pp. 705–712, 2014.

[23] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced data sets," Knowledge and Information Systems, vol. 25, no. 1, pp. 1–20, 2010.

[24] Y. Xiong, Y. Qiao, D. Kihara, H.-Y. Zhang, X. Zhu, and D.-Q. Wei, "Survey of machine learning techniques for prediction of the isoform specificity of cytochrome P450 substrates," Current Drug Metabolism, vol. 20, no. 3, pp. 229–235, 2019.

[25] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862–865, 2004.

[26] L. T. Afolabi, F. Saeed, H. Hashim, and O. O. Petinrin, "Ensemble learning method for the prediction of new bioactive molecules," PLoS One, vol. 13, no. 1, Article ID e0189538, 2018.

[27] A. C. Schierz, "Virtual screening of bioassay data," Journal of Cheminformatics, vol. 1, no. 1, p. 21, 2009.

[28] S. K. Hussin, Y. M. Omar, S. M. Abdelmageid, and M. I. Marie, "Traditional machine learning and big data analytics in virtual screening: a comparative study," International Journal of Advanced Research in Computer Science, vol. 10, no. 47, pp. 72–88, 2020.

[29] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.

[30] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[31] Q. Li, Y. Wang, and S. H. Bryant, "A novel method for mining highly imbalanced high-throughput screening data in PubChem," Bioinformatics, vol. 25, no. 24, pp. 3310–3316, 2009.

[32] R. Guha and S. C. Schürer, "Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays," Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 367–384, 2008.

[33] P. Banerjee, F. O. Dehnbostel, and R. Preissner, "Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets," Frontiers in Chemistry, vol. 6, 2018.

[34] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, "Factors affecting the accuracy of a class prediction model in gene expression data," BMC Bioinformatics, vol. 16, no. 1, p. 199, 2015.

[35] Spark, "Apache Spark is a unified analytics engine for big data processing," 2020.

[36] PubChem AID 440, "Primary HTS assay for formylpeptide receptor (FPR) ligands and primary HTS counter-screen assay for formylpeptide-like-1 (FPRL1) ligands," 2007, https://pubchem.ncbi.nlm.nih.gov/bioassay/440.

[37] PubChem AID 624202, "qHTS assay to identify small molecule activators of BRCA1 expression," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/624202.

[38] PubChem AID 651820, "qHTS assay for inhibitors of hepatitis C virus (HCV)," 2012, https://pubchem.ncbi.nlm.nih.gov/bioassay/651820.

[39] Chem.libretexts.org, "Molecules and molecular compounds," 2020, https://chem.libretexts.org/Bookshelves/General_Chemistry/Map%3A_Chemistry_-_The_Central_Science_(Brown_et_al.)/02%3A_Atoms_Molecules_and_Ions/2.6%3A_Molecules_and_Molecular_Compounds.

[40] X. Wang, X. Liu, S. Matwin, and N. Japkowicz, "Applying instance-weighted support vector machines to class imbalanced datasets," in Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), pp. 112–118, IEEE, Washington, DC, USA, December 2014.

[41] V. H. Masand, N. N. E. El-Sayed, M. U. Bambole, V. R. Patil, and S. D. Thakur, "Multiple quantitative structure-activity relationships (QSARs) analysis for orally active trypanocidal N-myristoyltransferase inhibitors," Journal of Molecular Structure, vol. 1175, pp. 481–487, 2019.

[42] S. W. Purnami and R. K. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing (ICMLC), pp. 107–111, Singapore, February 2017.

[43] T. Cheng, Q. Li, Y. Wang, and S. H. Bryant, "Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection," Journal of Chemical Information and Modeling, vol. 51, no. 2, pp. 229–236, 2011.

[44] B. X. Wang and N. Japkowicz, "Boosting support vector machines for imbalanced datasets," Knowledge and Information Systems, vol. 25, no. 1, p. 25, 2010.

[45] S. P. R. Trapsilasiwi, "SMOTE-least square support vector machine for classification of multiclass imbalanced data," in Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 107–111, ACM, Singapore, February 2017.

Complexity 15
