

Research Article

A Buffer Overflow Prediction Approach Based on Software Metrics and Machine Learning

Jiadong Ren,1,2 Zhangqi Zheng,1,2 Qian Liu,1,2 Zhiyao Wei,1,2 and Huaizhi Yan3

1 School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, China
2 The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao City 066004, China
3 Beijing Key Laboratory of Software Security Engineering Technique, Beijing Institute of Technology, South Zhongguancun Street, Haidian District, Beijing 100081, China

Correspondence should be addressed to Zhangqi Zheng; 451499304@qq.com

Received 11 October 2018; Revised 26 December 2018; Accepted 10 February 2019; Published 3 March 2019

Guest Editor: Jiageng Chen

Copyright © 2019 Jiadong Ren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Buffer overflow vulnerability is the most common and serious type of vulnerability in software today, as network security issues have become increasingly critical. To alleviate the security threat, many vulnerability mining methods based on static and dynamic analysis have been developed. However, current analysis methods suffer from high computational time, low test efficiency, low accuracy, and low versatility. This paper proposes a software buffer overflow vulnerability prediction method that uses software metrics and a decision tree algorithm. First, software metrics were extracted from the software source code, and data from the dynamic data stream at the functional level were extracted by a data mining method. Second, a model based on a decision tree algorithm was constructed to measure multiple types of buffer overflow vulnerabilities at the functional level. Finally, the experimental results showed that our method ran in less time than the SVM, Bayes, AdaBoost, and random forest algorithms and achieved 82.53% and 87.51% accuracy on two different data sets. The method presented in this paper accurately predicts software buffer overflow vulnerabilities in C/C++ and Java programs.

1. Introduction

With the popularity and rapid development of computer network technology, software security is facing a growing number of threats. Since the first buffer overflow attack occurred in 1988, the buffer overflow vulnerability [1] has been the most common and serious software vulnerability, posing a great danger to the security of software systems. According to the distribution of vulnerability types in the June 2017 Security Vulnerability Report of the National Information Security Vulnerability Database (CNNVD), the buffer error category accounts for the largest proportion of vulnerabilities, at approximately 14.08%. Because buffer overflow vulnerabilities take various forms, it is challenging to predict them accurately, and because of the large amount of software source code and the complex logic of invocation, these vulnerabilities are not easily discovered in a timely manner.

Software vulnerability analysis usually requires source code analysis and binary analysis. Although source code vulnerability analysis can comprehensively consider the execution path according to the rich and complete semantic information in the program source code, it cannot fully reflect the dynamics of the software [2]. Although binary code vulnerability analysis has been studied comprehensively through static or dynamic and manual or automated methods, it has problems such as low coverage and path explosion. Recently, machine learning algorithms have been widely used in software vulnerability prediction. In vulnerability prediction, machine learning techniques such as logistic regression, naive Bayes methods, C4.5 decision tree algorithms, random forest classifiers, and SVMs [3, 4] are commonly used. Although it is common to use a machine learning algorithm to predict software vulnerabilities, research based on software metrics is still immature. For instance, in the area of predicting software vulnerability models based on software metrics, most approaches consist of a static analysis based on software source code. Such analysis does not consider the characteristics of the dynamic data flow during software operation, so it cannot accurately reflect the type of vulnerability, its severity, or its distribution across modules.

Hindawi, Security and Communication Networks, Volume 2019, Article ID 8391425, 13 pages. https://doi.org/10.1155/2019/8391425

This paper proposes a method based on software metrics for buffer overflow vulnerability prediction, called the software vulnerability location (BOVP) method. Compared with traditional prediction methods such as support vector machine (SVM), Bayesian, AdaBoost, and random forest (RF) algorithms, this new research method is applicable to software on different platforms (C/C++, Java) and can also accurately predict software buffer overflow vulnerabilities. The main contributions of this paper are as follows. (1) Software metrics were extracted according to the static analysis of software source code, and a data mining method was used to extract data from the dynamic data stream at the functional level. (2) A decision tree algorithm was established to measure multiple types of buffer overflow vulnerability models at the functional level; the algorithm not only improves the accuracy of the experiment but also reduces the measurement dimension and the experimental overhead. (3) The analyses led to the research idea of considering multiple vulnerability types and different functional classifications for the software buffer overflow vulnerability analysis.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 gives an overview of buffer overflow vulnerabilities and source code function classification. Section 4 presents an overview of the BOVP method. Section 5 presents the data sources, cross-validation, experimental procedures, and results. Section 6 describes the advantages of the BOVP approach. Finally, the conclusion and future research directions are presented in Section 7.

2. Related Work

At present, the academic community proposes dual solutions to the phenomenon of software buffer overflow vulnerability, based on both detection and protection. For example, to avoid buffer overflow, people use the polymorphic canary to detect an overflow of the stack buffer based on computer hardware research [5, 6]. J. Duraes et al. [7] proposed an automatic identification method for buffer overflow vulnerabilities of executable software without source code. S. Rawat et al. [8] proposed an evolutionary computation method for searching for buffer overflow vulnerabilities; this method combined genetic algorithms and evolutionary strategies to generate string-based input, and it is a lightweight intelligent fuzzing tool used to detect buffer overflow vulnerabilities in C code. I. A. Dudina et al. [9] used static symbolic execution to detect buffer overflow vulnerabilities, described a prototype of an in-process heap buffer overflow detector, and provided an example of a vulnerability discovered by the detector. In terms of protection from buffer overflow vulnerabilities, S. Nashimoto et al. [10] proposed injecting buffer overflow attacks and verified the countermeasures for multiple fault injections; the results showed that their attacks could cover the return address stored in the stack and call any malicious function. J. Rao et al. [11] proposed a protection technology called a BF Window for performance- and resource-sensitive embedded systems. It verified each memory write by speculatively checking the consistency of the data attributes within the extended buffer window. Through experiments, the proposed mechanism could effectively prevent sequential memory writes across the buffer boundary. Although these methods have effectively detected buffer overflow vulnerabilities to a certain extent, the traditional research methods for software buffer overflow vulnerabilities suffer from large time overhead, path explosion, and a lack of dynamic software analysis. Therefore, research on using machine learning methods to predict software buffer overflow vulnerabilities has become increasingly widespread.

At present, research on using machine learning algorithms to predict various types of software vulnerabilities [12] is becoming widespread, and machine learning methods can improve vulnerability coverage and reduce running time in software vulnerability analysis and discovery processes. However, these methods cannot accurately reflect the type of vulnerability, its severity, or the distribution of the module. To address these problems, a method based on "software metrics" to predict various types of software vulnerabilities was proposed [13]. Early software metrics focused on structured program design. By measuring the complexity of the software architecture, researchers described some basic performance metrics of software systems, such as functionality, reliability, portability, and maintainability. Software metrics can be divided into three typical representative categories based on information flow, attributes, and behaviour. In recent years, software metrics have been studied extensively and have become a vibrant research topic in software engineering.

James Walden et al. [14] compared models based on two different types of features: software metrics and term frequencies (text mining features). They experimented with the Drupal, PHPMyAdmin, and Moodle software and evaluated the performance of the models through hierarchical and cross-validation. The results showed that both the text mining and the software metrics models had high recall rates. Based on this research, in 2017, Jeffrey Stuckman et al. [15] explored the role of dimension reduction through a series of cross-validation and project prediction experiments. In the case of software metrics, the dimension reduction technique based on a confirmatory factor analysis produced the best F-measure value when performing project prediction. Choudhary G. R. et al. [16] studied change metrics in combination with code metrics in software fault prediction to improve the performance of the fault prediction model. Different versions of the Eclipse project and change metrics extracted from GitHub were used for experimental research. In addition to the existing change metrics, several new change metrics were defined and collected from the Eclipse project repository. The experimental results showed that the change metrics had a positive impact on the performance of the fault prediction model; thus, a high-performance model could be established through multiple change metrics. Sanaz Rahimi et al. [17] proposed a vulnerability discovery model (VDMS) that uses the compiler's code base for static analysis, extracts code characteristics such as complexity (cyclomatic complexity), and uses code attributes as its parameters to predict vulnerabilities. By performing a static analysis on the source code of real software, it was verified that the code attributes have a high level of influence on the accuracy of vulnerability discovery.

Table 1: C/C++ language experimental data set.

                              Good   Good Sink   Good Source   Good Auxiliary     Bad   Bad Sink   Bad Source    Total
Stack-based buffer overflow   5810        2748           312             7149    4716       2052          288    23075
Heap-based buffer overflow    7122        3076           968             9064    6733       2448          628    30039
Buffer overflow               2380        1180           208             3192    2472       1024          144    10600
Integer overflow              4608        4284           576            10872    4716       2052          288    27396
Integer underflow             3348        3234           408             8100    3414       1548          204    20256
Total                        23268       14522          2472            38377   22051       9124         1552   111366

In recent years, many studies have predicted software buffer overflow vulnerabilities by using machine learning algorithms. With the maturity of software measurement technology, most scholars have adopted software attribute measurement based on machine learning algorithms to predict vulnerabilities. Most of the software metrics for predicting software vulnerabilities are based on the static analysis of the software source code, but there are dynamic behaviours such as function calls, memory reads and writes, and memory allocation during the execution process. Therefore, if software metrics are to accurately predict software vulnerabilities that are persistent and concealed, the dynamic behaviour of the software needs to be considered.

3. Problem Description

Buffer overflow is the most dangerous type of software vulnerability; if buffer overflow vulnerabilities can be effectively eliminated, the security threat to the software will be reduced. This paper analysed the data flow at the source code function level and combined a machine learning algorithm to study software metrics for real software. Metrics need to be selected, such as the number of lines containing source code, the average base complexity of all nested functions or methods, and the average number of lines of source code in all nested functions or methods (the experimental data are shown in Table 1).

3.1. Buffer Overflow Vulnerability Overview. A buffer is a contiguous area of memory within a computer that can hold multiple instances of the same data type. A buffer overflow occurs when the computer exceeds the capacity of the buffer itself while filling it with data; the overflowing data are then overlaid on legal data (the data pointer of the next instruction, the return address of the function, etc.). Buffer overflows can be divided into stack-based buffer overflows, heap-based buffer overflows, and integer overflows. Integer overflows are classified into three categories: underflow and overflow of unsigned integers, symbolic problems, and truncation problems. We use the Intel processor (register EBP) as an example to introduce stack-based buffer overflows, heap-based buffer overflows, and integer overflows.

Stack-based buffer overflows: in computer science, a stack is an abstract data type that serves as a collection of elements with two principal operations: push, which adds an element to the collection, and pop, which removes the most recently added element that has not yet been removed. A stack is a container for objects that are inserted and removed according to the LIFO principle. The stack frame is the collection of all data in the stack associated with one subprogram call. The stack frame generally includes the return address, the argument variables passed on the stack, local variables, and the saved copies of any registers modified by the subprogram that need to be restored. The EBP is the base pointer for the current stack frame, and the ESP is the current stack pointer. Each time a function is called, a new stack frame is generated and stored in a certain region of the stack. The EBP always points to the top of the stack frame (high address), while the ESP points to the next available byte in the stack. The EBP does not change during function execution, while the ESP varies with function execution. A stack-based buffer overflow occurs when a program writes data whose length exceeds the space allocated by the stack to the buffer at the memory address in the stack.
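To make the mechanism concrete, the following minimal C sketch (illustrative only, not code from the paper; `vulnerable` and `safe_copy` are hypothetical helpers) contrasts the unchecked copy that causes a stack-based overflow with a bounds-checked alternative:

```c
#include <stddef.h>
#include <string.h>

/* Unsafe pattern: strcpy performs no bounds check.  If the input is
 * longer than 7 characters plus the terminator, the copy runs past
 * buf and overwrites adjacent stack memory -- eventually the saved
 * EBP and the return address described above. */
void vulnerable(const char *input) {
    char buf[8];
    strcpy(buf, input);      /* overflows when strlen(input) >= 8 */
}

/* Safe alternative: verify that the data fits before copying. */
int safe_copy(char *dst, size_t dst_size, const char *src) {
    if (strlen(src) >= dst_size) {
        return -1;           /* would overflow: refuse to copy */
    }
    memcpy(dst, src, strlen(src) + 1);   /* copy including '\0' */
    return 0;
}
```

Any write whose length is never compared against the buffer's allocated size is a candidate stack-based overflow; the fix is always to check the data length against the available space first.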

Heap-based buffer overflows: the heap is memory that is dynamically allocated while the program is running. The user generally requests memory through malloc, new, etc., and operates on the allocated memory through the returned starting address pointer. After use, the memory should be released by free, delete, etc.; otherwise, a memory leak results. Heap blocks in the same heap are usually contiguous in memory. If the data written to an allocated heap block exceed the size of the heap block, the overflowing data overwrite the neighbouring heap blocks behind it.
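A heap-block wrapper that tracks the requested size illustrates the discipline that prevents such overflows (an illustrative sketch; `heap_buf` and its functions are hypothetical, not from the paper):

```c
#include <stdlib.h>
#include <string.h>

/* Heap blocks returned by malloc are often adjacent in memory, so a
 * write past the requested size corrupts allocator metadata or the
 * next block.  Carrying the allocated size alongside the pointer and
 * checking it before every write prevents the overflow. */
typedef struct {
    unsigned char *data;
    size_t size;             /* capacity actually requested from malloc */
} heap_buf;

int heap_buf_init(heap_buf *b, size_t size) {
    b->data = malloc(size);
    b->size = b->data ? size : 0;
    return b->data ? 0 : -1;
}

/* Refuses writes that would run past the end of the allocation. */
int heap_buf_write(heap_buf *b, size_t off, const void *src, size_t len) {
    if (off > b->size || len > b->size - off) {
        return -1;           /* would overflow the heap block */
    }
    memcpy(b->data + off, src, len);
    return 0;
}

void heap_buf_free(heap_buf *b) {
    free(b->data);           /* release to avoid a memory leak */
    b->data = NULL;
    b->size = 0;
}
```

Because the allocator does not record how much of a block the program has logically used, the program itself must carry the size; forgetting `heap_buf_free` reproduces exactly the memory leak mentioned above.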

Integer overflows: similar to other types of overflows, integer overflows cause data to be placed in a storage space smaller than itself. Underflow of unsigned integers is caused by the fact that unsigned integers cannot represent negative numbers. The symbol problem arises from comparisons between signed integers, operations on signed integers, and comparisons between unsigned and signed integers. The truncation problem mainly occurs when a higher-order integer (for example, 32 bits) is copied to a lower-order integer (for example, 16 bits).
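Each of these problem classes can be demonstrated in a few lines of C (an illustrative sketch; the function names are ours, not the paper's). Note that unsigned wraparound is well defined in C, which is precisely why these bugs pass silently:

```c
#include <stdint.h>

/* Unsigned underflow: unsigned integers cannot represent negative
 * numbers, so subtracting past zero wraps to the maximum value. */
uint32_t unsigned_underflow(void) {
    uint32_t u = 0;
    return u - 1;            /* wraps to 4294967295 */
}

/* Truncation: copying a 32-bit value into a 16-bit integer keeps only
 * the low 16 bits, silently changing any value that does not fit. */
uint16_t truncate_to_16(uint32_t wide) {
    return (uint16_t)wide;   /* value is reduced modulo 65536 */
}

/* Signedness: when a signed value is converted to unsigned (as happens
 * implicitly in mixed signed/unsigned comparisons), -1 becomes a huge
 * positive value, defeating a naive "length < limit" check. */
int passes_length_check(int length, uint32_t limit) {
    /* the explicit cast mimics the implicit conversion */
    return (uint32_t)length < limit;
}
```

A negative `length` passed to `passes_length_check` is converted to a value near 2^32 and fails the check, which is the benign direction; the dangerous variant is when such a wrapped value is later used as an allocation or copy size.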


Buffer overflow attacks consist of three parts: code embedding, overflow attack, and system hijacking. There are four types of buffer overflow attacks: destroying activity records, destroying heap data, changing function pointers, and overflowing fixed buffers. Buffer overflow is a very common and very dangerous vulnerability that is widespread in various operating systems and applications. Exploiting buffer overflow vulnerabilities can enable hackers to gain control of a machine or even the highest privileges, and buffer overflow attacks can lead to program failure, system shutdown, restarting, and so on. The buffer overflows studied in this article include stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows.

3.2. Source Code Function Classification. Most software metrics that were previously used to predict software vulnerabilities use metrics only to predict whether a vulnerability is included; during the execution of the software, there is a lack of analysis of the calling relationships between functions. Among the characteristics of software are functionality, reliability, usability, efficiency, maintainability, and portability; thus, the software implementation process is dynamic. In the vulnerability prediction of software, the calling relationships of the intrinsic functions in the software execution process should be fully considered, especially the data flow changes during software calls. Therefore, this article applies data flow analysis to distinguish features that contain vulnerabilities.

In general, software source code data flow analysis is based on understanding the logic of the code, tracking the flow of data from the beginning of the analysed code to the end. There are three difficulties in the analysis process. (1) The changes in data accumulate as the path length increases. (2) The value of a branch condition must be calculated based on the latest data analysis results, since branch conditions accumulate as the path grows and each additional branch node on the path must lie on that path. (3) The number of paths usually grows geometrically as the code size increases. A more efficient approach is to decompose the global data stream analysis into partial data stream analyses.

This paper analyses the functions called in the test case data and uses the partial data flow analysis method to divide the functions into different "source" and "accept" (sink) functions. This research performs a function-level analysis on the program source code. According to the behaviour of the data flow of each function in the actual execution of the program, this paper uses a data mining method to distinguish the function categories from the data in the data set. First, there are two types (good and bad), depending on whether or not there are vulnerabilities as a whole. Second, this paper analyses the functions using data flow analysis. Functions that do not contain vulnerabilities are classified into four categories. Good: the only code in a good function is a call to each of the secondary good functions; this function does not contain any flawed constructs. Good Sink and Good Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from the primary good function; each source or sink function is specific to exactly one secondary good function. Good Auxiliary: where a Good Source is passing safe data to a potentially Bad Sink, or where a Bad Source is passing unsafe or potentially unsafe data to a Good Sink. The categories containing vulnerabilities are divided into three groups. Bad: a function that contains a flawed construct. Bad Sink and Bad Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from a primary bad function; each source or sink function is specific to either bad function for the test case.

According to whether or not a vulnerability is included, each function falls into one of 8 categories; this paper defines the set of categories as X = {x1, x2, ..., x8}, where x1, ..., x8 denote functions that do not contain vulnerabilities (Good), sink functions that do not contain vulnerabilities (Good Sink), source functions that do not contain vulnerabilities (Good Source), auxiliary functions that do not contain vulnerabilities (Good Auxiliary), main functions containing a bug (Bad), sink functions that contain the vulnerability (Bad Sink), source functions containing a vulnerability (Bad Source), and functions that are not yet explicitly defined (Other). In this study, only the first seven categories were studied.
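The category set X can be written down as a simple C enumeration (an illustrative sketch; the identifiers `func_category`, `category_name`, and `is_studied` are ours, not the paper's):

```c
/* The eight function categories x1..x8 defined above. */
typedef enum {
    GOOD,           /* x1: no vulnerability; calls only good functions */
    GOOD_SINK,      /* x2: vulnerability-free sink function */
    GOOD_SOURCE,    /* x3: vulnerability-free source function */
    GOOD_AUXILIARY, /* x4: good source -> bad sink, or bad source -> good sink */
    BAD,            /* x5: function containing a flawed construct */
    BAD_SINK,       /* x6: sink function containing the vulnerability */
    BAD_SOURCE,     /* x7: source function containing the vulnerability */
    OTHER           /* x8: not yet explicitly defined; excluded here */
} func_category;

const char *category_name(func_category c) {
    static const char *names[] = {
        "Good", "Good Sink", "Good Source", "Good Auxiliary",
        "Bad", "Bad Sink", "Bad Source", "Other"
    };
    return names[c];
}

/* Only the first seven categories are prediction targets in this study. */
int is_studied(func_category c) {
    return c != OTHER;
}
```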

4. Method

4.1. BOVP Method. This paper first extracts the source code of the executable program and then establishes a dynamic data flow analysis model at the software function level. Finally, a vulnerability prediction method based on software metrics is implemented. The process of establishing the BOVP ⟨M, A, S⟩ model in this method is shown in Figure 1.

(1) For the source code of the target software, this paper applied function-level static analysis, and the software attribute measurement method was used to extract all software metric data (M).

(2) According to the characteristics of software buffer overflow vulnerabilities, this paper established a metric set A of software-level buffer overflow vulnerability metrics based on the function level: A = {a1, a2, ..., an}.

(3) For the characteristics of the data flow where the functions call each other, the data mining method was used to extract the data regarding the function class, and items in the "Other" function category were removed.

(4) This paper applied the mutual information value for feature selection and incorporated the metric set A to measure the impurity of the data partition D for the split criterion. The cost complexity pruning algorithm was applied as the pruning method to determine the buffer overflow vulnerability classification algorithm S (strategy).

4.2. BOVP Prediction

4.2.1. Model Overview. The CART (classification and regression tree) model was proposed by Breiman et al. [18] in 1984 and has become a widely used decision tree learning method. The CART method assumes that the decision tree is a binary tree. The input feature space is divided into a finite number of cells, and the predicted values are determined by these cells, for instance, by mapping a given input condition to an output value.

Figure 1: The framework for the BOVP method.

The CART model in this article consists of the following three steps:

(1) CART generation: generating a decision tree based on the training data set.

(2) CART pruning: pruning the decision tree according to certain constraints, such as the maximum depth of the tree, the minimum number of samples in the leaf nodes, the purity of the nodes, etc.

(3) Optimization of the decision tree: generating different CARTs for multiple parameter combinations (maximum depth of the tree (max_depth), minimum sample number of a leaf node (min_samples_leaf), and node impurity (min_impurity_split)) and choosing the model with the best generalization ability.

We first chose to use a supervised learning algorithm. The next consideration was the data itself: for example, whether the eigenvalues are discrete or continuous variables, whether there are missing values in the eigenvalue vector, and whether there are abnormal values in the data. Therefore, after analysis, the CART tree classification algorithm was selected. CART has the following advantages over other classical classification and regression algorithms: (1) less data preparation, with no need to standardize the data; (2) the ability to process continuous and discrete data; (3) the ability to handle multiclassification problems; and (4) a prediction process that can be easily explained by Boolean logic, in contrast to other models such as naive Bayes and SVMs.

4.2.2. Training. In the classification prediction part of the BOVP method, first, the feature selection operation was performed on the attribute set A. Second, a specific classification algorithm was selected according to the data characteristics. This research adopts the decision tree algorithm [19] to establish the classification model, based on the characteristics of the integrated software and the characteristics of the classification algorithm.

First, the classification decision tree model is a tree structure that classifies instances based on features. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent the n attributes in A, including the number of lines of source code, the executable code paths, and the cyclomatic complexity. The leaf nodes represent the types in X = {x1, x2, ..., x7}. Second, when the decision tree performs classification, one of the features of an instance is tested from the root node, and the instance is assigned to a child node according to the test result. Each child node corresponds to a value of the feature, so the process recursively moves down until a leaf node is reached. Finally, the instance is assigned to the class of the leaf node. The decision tree can be converted into a set of if-then rules, or it can be considered a conditional probability distribution defined over the feature and class spaces. Compared with the naive Bayes classification method, the decision tree has the advantage that its construction requires neither domain-specific knowledge nor parameter settings; therefore, the decision tree is more suitable for exploratory knowledge discovery in practical applications. The construction of the decision tree algorithm [19] is divided into three parts, feature selection, decision tree generation, and decision tree pruning, as follows.

(1) Feature selection: the features in this model are the attributes a_n (n ≤ 22), which include the number of lines in the source code, the executable code paths, and the cyclomatic complexity. Feature selection is applied to select the features that have the ability to classify the training data, in a way that improves the efficiency of decision tree learning. The feature selection process uses the information gain, the information gain ratio, or the Gini index.

(2) Generation of the decision tree: the Gini index is used as the feature selection criterion to measure the impurity of a data partition D, such that the locally optimal feature is continuously selected and the training set is divided into feasible subsets.

(3) Decision tree pruning: the cost-complexity pruning algorithm is used to prune the tree; that is, the tree complexity is regarded as a function of the number of leaf nodes and the error rate of the tree, working from the bottom of the tree upwards. For each internal node N, the cost complexity of the subtree rooted at N is calculated before and after pruning that subtree; if pruning lowers the cost complexity, the subtree is clipped, otherwise the subtree is retained.

Combining the attribute set A = {a1, a2, ..., a22}, we can set the training data partition to D and construct the decision tree model through the three steps of feature selection, decision tree generation, and pruning. The basic strategy is to call the algorithm with three parameters: D, attribute_list, and Attribute_selection_method. D is a data partition, i.e., a collection of training tuples and their corresponding class labels; attribute_list is a list describing the tuple attributes; Attribute_selection_method specifies the heuristic process used to select the "best" attribute for partitioning the tuples by class.

In Algorithm 1, Lines (1) to (6) specify the Gini minimization criterion used to obtain the feature attributes and the partition points of the split selection. When the feature selection is started, a tree is constructed step by step. Lines (7) to (8) describe the growth of the tree according to three indicators (γ, α, β). Once the condition is satisfied, data set D is split into two subtrees, D1 and D2, as expressed in Lines (9) to (10). Lines (11) to (12) state that if the sample number of a node is less than β, the node will be dropped. Lines (13) to (17) show that if the node does not satisfy any of these conditions, it will be converted into a leaf node whose output class is the most frequent class on the node. Finally, the process of prediction is conducted by using the TreeModel in Line (19). This method is highly complex, especially when searching for the segmentation point, as it is necessary to traverse all the possible values of all the current features. If there are F features, each having N values, and the generated decision tree has S internal nodes, then the time complexity of the algorithm is O(F*N*S).
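As an illustration of the growth procedure just described (a minimal sketch, not the authors' implementation), a recursive CART-style builder can be written as follows; the names `build_tree` and `predict` are hypothetical, and the parameters `gamma` (impurity threshold), `alpha` (maximum depth), and `beta` (minimum sample count) mirror the paper's γ, α, β:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, gamma=0.0, alpha=5, beta=2):
    # Stop when the node is pure enough, too deep, or too small: emit a
    # leaf labelled with the most frequent class on the node
    if gini(labels) <= gamma or depth >= alpha or len(labels) < beta:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = None  # (weighted_gini, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            # Weighted Gini of the two partitions, as in Equation (1)
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or w < best[0]:
                best = (w, f, t)
    if best is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left": build_tree([rows[i] for i in li], [labels[i] for i in li],
                               depth + 1, gamma, alpha, beta),
            "right": build_tree([rows[i] for i in ri], [labels[i] for i in ri],
                                depth + 1, gamma, alpha, beta)}

def predict(node, row):
    # Walk from the root, following the test result at each internal node
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]
```

Each recursion scans every value of every feature for the Gini-minimizing split, which is where the O(F*N*S) cost discussed above comes from.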

The decision tree in the algorithm uses the Gini coefficient as the criterion for choosing the splitting feature. That is, the training data set D is divided into two parts, D1 and D2, according to whether feature A takes a certain value a. Then, under the condition of feature A, the Gini index of set D is defined as

Gini(D, A) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)    (1)

The Gini index Gini(D) represents the uncertainty of set D, and the Gini index Gini(D, A) represents the uncertainty of set D after partitioning by A = a. The larger the Gini index, the greater the uncertainty of the sample. Compared with other classical classification and regression algorithms, decision trees have the advantage of generating easy-to-understand rules while requiring a relatively small number of computations. Decision trees are capable of processing continuous and categorical fields and can clearly display the more important fields. Therefore, by training the decision tree model on the existing data set, the model can predict whether a program has any of the 7 buffer vulnerability types by using metrics such as the number of lines containing source code, the number of calls, the executable code paths, and the cyclomatic complexity.
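Equation (1) can be sketched directly (illustration only; `gini` and `gini_split` are hypothetical helper names, not the authors' code):

```python
from collections import Counter

def gini(labels):
    # Gini impurity of one partition: 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    # Gini(D, A) from Equation (1): |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2),
    # where d1 and d2 are the label lists of the two partitions D1 and D2
    n = len(d1) + len(d2)
    return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)
```

A pure partition gives an impurity of zero, and the split minimizing `gini_split` is the one the algorithm selects.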

The specific method of generating the subtree sequence T0, T1, ..., Tn in pruning is to prune, starting from T0, the branch in Ti whose removal yields the smallest increase in error on the training data set, obtaining Ti+1. For example, when a tree T is pruned at node t, the increase in error is R(t) − R(Tt), where R(t) is the error of node t after the subtree of node t is pruned, and R(Tt) is the error of the subtree Tt when the subtree of node t is not pruned. However, the number of leaves in T is

Security and Communication Networks 7

input: D = {(X1, y1), (X2, y2), ..., (Xn, yn)}; X = {X1, X2, X3, ..., Xn} represents the feature attribute set; Y = {y1, y2, y3, ..., yn} represents the predicted attribute set
output: CART model sets
(1) for xi in X
(2)   for Tj in xi
(3)     search min(Gini)
(4)   end for
(5) end for
(6) Build tree TreeModel from Tj and xi
(7) if H(D) <= γ && current depth < α && |D| >= β    // |D|: sample number of D
(8)   divide D into D1 and D2
(9)   DecisionTreeClassifier(D1, α, γ, β)
(10)  DecisionTreeClassifier(D2, α, γ, β)
(11) else if |D| < β
(12)   drop D
(13) else
(14)   D → leaf node    // convert to leaf node
(15)   predictive class = the class with the largest number of samples on the node
(16)   break
(17) end else
(18) return TreeModel
(19) Prediction on TMS

Algorithm 1 BOVP based on Gini indexes

reduced by L(Tt) − 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after the tree branches are pruned is determined by

α = (R(t) − R(Tt)) / (L(Tt) − 1)    (2)

where R(t) represents the error of node t after the subtree of node t is pruned, with R(t) = r(t) * p(t), where r(t) is the error rate of node t and p(t) is the ratio of the number of samples on node t to the number of samples in the training set. R(Tt) represents the error of the subtree Tt when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes of subtree Tt. Ti+1 is the pruned tree with the smallest α value in Ti.
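Equation (2) reduces to a one-line computation; a hedged sketch with hypothetical argument names (`r_node` for R(t), `r_subtree` for R(Tt), `n_leaves` for L(Tt)):

```python
def pruning_alpha(r_node, r_subtree, n_leaves):
    # Error-increase rate alpha from Equation (2); in cost-complexity
    # pruning the subtree with the smallest alpha is clipped first
    return (r_node - r_subtree) / (n_leaves - 1)
```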

4.2.3. Prediction. The input of the prediction part is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. The vulnerability data include stack buffer overflow, heap buffer overflow, buffer underwrite ("buffer underflow"), integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes representing attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of the test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing the decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into different classes.

The prediction output is the class predicted by the decision tree algorithm, belonging to the categories X = {x1, x2, ..., x7}. Further analysis can then be performed on the functions containing vulnerabilities based on the prediction results. The prediction part used the accuracy rate, recall rate, and F1 evaluation index to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier to examine the entire sample. The recall rate measures the proportion of the items that should be retrieved that are actually retrieved correctly, which reflects the proportion of positive data correctly identified by the decision tree algorithm relative to the total positive data. The F1 value, the harmonic mean of precision and recall, measures how close the decision tree classification is to the ideal.
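The evaluation indexes above can be sketched as follows (an illustration, not the paper's code; `evaluate` is a hypothetical helper, with F1 computed as the harmonic mean of precision and recall):

```python
def evaluate(y_true, y_pred, positive):
    # Per-class precision/recall/F1 plus overall accuracy for one class label
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```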

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, together with the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program. (1) This article extracted code that contained only one type of defect, based on the buffer overflow vulnerability code fragments, and that did not contain code of other, unrelated defect types. (2) This paper then statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 functional-level metrics were extracted. (3) Due to the diversity of the types of buffer overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data and delete duplicate information; twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attribute set A. (5) Using a 4:1 ratio between the training set and the test set, the Gini coefficient was selected as the splitting criterion for feature selection, and the CART decision tree was constructed with the K-fold crossover test.

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selects 163,737 buffer overflow vulnerability samples extracted from actual programs. The data set was produced by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows beyond the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for the experiments.

The C/C++ language experimental data set had 111,366 samples of stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. The summaries of the data sets are shown in Tables 1 and 2.

5.2. Metrics Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information is used to eliminate the irrelevant metrics. Finally, the precision, recall, and accuracy indicators are used to measure the predicted results.

Software metric indexes provide a quantitative standard for evaluating code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost performance estimation. Following the development of software from structured modularization to object-oriented design, metrics can be divided into traditional software metrics and object-oriented software metrics. According to traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it is necessary to perform data cleaning on the initially extracted data to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we need to filter out data that do not meet certain requirements; for example, the extracted data contain a large number of zero values and identical attribute values for the same numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective vocabulary space but also remove noise features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates whether two variables X and Y have a relationship [20, 21], defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3).

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (3)

The mutual information of feature items and categories reflects the degree of correlation between the feature items and the categories. When mutual information is taken as the evaluation value for feature extraction, the features with the largest mutual information values are selected.
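For discrete feature and class values, Equation (3) can be sketched directly from empirical counts (illustration only; `mutual_information` is a hypothetical helper using the natural logarithm):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X; Y) per Equation (3): sum over observed (x, y) pairs of
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), with probabilities
    # estimated from the sample counts
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Independent variables give a value of zero, while perfectly dependent binary variables give log 2, matching the interpretation above.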

5.3. Feature Selection. Due to the limited number of samples, the design of a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was adopted for the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there is no correlation, the mutual information value is zero; conversely, the larger the mutual information value, the greater the correlation. During feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets were synthesized to select n ≤ 22 metrics, and 16 metrics were finally obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}. Here, a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.
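The reduction from 22 to 16 indicators amounts to ranking the metrics by mutual information value and keeping the best ones; a minimal sketch (hypothetical helper name):

```python
def top_k_features(mi_scores, k=16):
    # Rank metrics by mutual information value (largest first) and keep the top k
    ranked = sorted(mi_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```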

Although the global results before and after feature selection seem similar, applying feature selection achieves the same effect with fewer features. The obtained data features and the 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also


Table 2: Java language experimental data set specific data.

                  Good   Good Sink   Good Source   Good Auxiliary   Bad    Bad Sink   Bad Source   Total
Buffer overflow   8073   9522        828           21114            7866   4554       414          52371

Table 3: Metrics.

1. CountInput: Number of calls; calls by the same method are counted only once, and calls by fields are not counted.
2. CountLine: The number of lines of code.
3. CountLineCode: The number of lines containing code.
4. CountLineCodeDecl: The number of lines of declaration code; the method name line is also recorded in this number.
5. CountLineCodeExe: The number of lines of purely executable code.
6. CountLineComment: The number of comment lines.
7. CountOutput: The number of calls to other methods; multiple calls to the same method are counted as one call, and a return statement counts as a call.
8. CountPath: The number of code paths that can be executed; related to cyclomatic complexity.
9. CountPathLog: log10, truncated to an integer value, of the metric CountPath.
10. CountSemicolon: The number of semicolons.
11. CountStmt: The number of statements; multiple statements written on one line are counted multiple times.
12. CountStmtDecl: The number of declaration statements, including the method declaration line.
13. CountStmtExe: The number of executable statements.
14. Cyclomatic: Cyclomatic complexity (standard calculation method).
15. CyclomaticModified: Cyclomatic complexity (second calculation method).
16. CyclomaticStrict: Cyclomatic complexity (third calculation method).
17. Essential: Essential complexity (standard calculation method).
18. Knots: Measures overlapping jumps.
19. MaxEssentialKnots: The maximum knots after the structured programming structures have been deleted.
20. MaxNesting: The maximum nesting level; related to cyclomatic complexity.
21. MinEssentialKnots: The minimum knots after the structured programming structures have been deleted.
22. RatioCommentToCode: The ratio of comment lines to code lines.

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but performs poorly on data outside the training data set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially its performance on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the Hold-Out method, K-fold Cross-Validation, and Leave-One-Out Cross-Validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts, k − 1 for training and the remaining part for testing, repeating the process k times and thus obtaining k models with which to evaluate the performance. In this experiment, we divided data set D into two parts: 80% was used as training data, and 20% was used as test data.

First, we grouped the data D into a training set to train the model and a test set to test its predictive performance, repeating this process until all data were used. Next, the K value was chosen to be 10, so that the data set was divided into 10 subsamples, 9 of which were used as training data while one was used as test data. Finally, the sensitivity of the model performance to the data partitioning was reduced by averaging the results of the 10 groups [22].
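The K-fold scheme described above can be sketched as a plain index partition (a simplified illustration; the paper does not name its tooling, and a real run would shuffle the data first):

```python
def kfold_indices(n, k=10):
    # Partition range(n) into k folds; each (train, test) pair uses one fold
    # as the test set and the remaining k-1 folds as the training set
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```

Averaging a model's score over the k test folds then gives the cross-validated estimate used to compare the algorithms below.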

5.4.1. Results in C/C++ Programs. In this experiment, the programs written in the C/C++ language were used to conduct experiments, and the decision tree algorithm was compared with other machine learning algorithms with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, with the accuracy in these experiments reaching 82.55%. Second, the data feature selection was verified by comparing the 22 metrics without feature selection and the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results.

Metrics               C/C++ dataset            Java dataset
                      MI value     Rank        MI value     Rank
CountInput            0.1892       11          0.6375       3
CountLine             0.5584       1           0.7996       1
CountLineCode         0.4902       2           0.6446       2
CountLineCodeDecl     0.4327       5           0.4373       11
CountLineCodeExe      0.3893       9           0.5000       7
CountLineComment      0.3980       8           0.4990       8
CountOutput           0.1625       12          0.3477       13
CountPath             NaN          13          0.3959       12
CountPathLog          NaN          13          0.2638       17
CountSemicolon        0.4307       7           0.5484       5
CountStmt             0.4465       4           0.5879       4
CountStmtDecl         0.4313       6           0.4390       10
CountStmtExe          0.3434       10          0.5123       6
Cyclomatic            NaN          13          0.3122       14
CyclomaticModified    NaN          13          0.3122       15
CyclomaticStrict      NaN          13          0.2968       16
Essential             NaN          13          0.0099       20
Knots                 NaN          13          0.2508       18
MaxEssentialKnots     NaN          13          0.0099       21
MaxNesting            NaN          13          0.2236       19
MinEssentialKnots     NaN          13          0.0099       22
RatioCommentToCode    0.4603       3           0.4959       9

Table 5: C/C++ language dataset machine algorithm results.

                     SVM      Bayes    Adaboost   Random forest   Decision tree
Accuracy (%)         82.06    37.69    35.47      82.54           82.55
Recall rate (%)      67.80    30.23    34.25      68.65           68.68
Precision rate (%)   71.06    31.46    33.75      73.59           73.62
Running time (s)     576.32   19.98    19.02      17.94           17.03

It was shown by experiments that the running time of the buffer overflow vulnerability prediction algorithm using the A = {a1, a2, ..., a16} metrics after feature selection is reduced from the previous 17.03 s to 15.94 s, while the accuracy is hardly changed; thus, feature selection not only helps to reduce the running time but also maintains comparable accuracy and improves the efficiency of the vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source have a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from the programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerabilities into eight categories. When there is a vulnerability in real software, it does not necessarily contain all types of buffer vulnerabilities; there may be only one or several of them, and when we collect data from real software, the data that may be vulnerable are much less common than the data without vulnerabilities and may even reach a ratio of 1:10,000. In the Java data set, because the experiment uses a data set extracted from real software, the data volume of Good Source and Bad Sink is very small, which results in extremely unbalanced data. It also leads to the precision and recall rates of both of these types being 0. The precision and recall rates of the other, basically balanced, vulnerability data show better results. To maintain the distribution of the original data, we need to improve the accuracy and recall rate of the other basically balanced vulnerability data. To preserve the validity of the results, data sampling was not used to balance the data, which then demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics after feature selection to predict buffer overflow vulnerabilities reduces the running time while maintaining high accuracy.


Table 6: Decision tree algorithm specific operation results (C/C++, %).

                  Before feature selection          After feature selection
                  Precision   Recall   F1           Precision   Recall   F1
Good              88.09       83.56    85.76        88.03       83.47    85.69
Good Sink         61.27       52.91    56.78        61.27       52.91    56.78
Good Source       53.13       25.37    34.34        53.13       25.37    34.34
Good Auxiliary    77.16       96.34    85.69        77.16       96.34    85.69
Bad               77.08       49.23    60.09        77.08       49.23    60.09
Bad Sink          67.94       77.47    72.40        67.94       77.47    72.39
Bad Source        90.71       96.00    93.23        90.66       95.87    93.12
Average           73.63       68.70    69.76        73.61       68.67    71.05
Accuracy          82.55                             82.53
Time (s)          17.03                             15.94

Table 7: Accuracy and time for the Java dataset.

                     SVM      Bayes    Adaboost   Random forest   Decision tree
Accuracy (%)         87.16    57.95    49.40      87.41           87.44
Recall rate (%)      59.56    48.25    34.98      59.64           59.64
Precision rate (%)   58.74    44.28    33.11      58.91           58.93
Running time (s)     104.32   9.98     8.99       12.94           11.99

Table 8: Decision tree algorithm specific operation results (Java, %).

                  Before feature selection          After feature selection
                  Precision   Recall   F1           Precision   Recall   F1
Good              95.85       93.04    94.42        96.11       93.04    94.55
Good Sink         47.15       53.10    49.95        47.15       53.10    49.95
Good Source       0           0        0            0           0        0
Good Auxiliary    98.56       98.13    98.35        98.56       98.13    98.34
Bad               77.27       73.99    75.60        77.27       73.99    75.60
Bad Sink          0           0        0            0           0        0
Bad Source        95.74       99.51    96.54        93.75       99.60    96.59
Average           59.22       59.69    59.27        58.98       59.69    59.33
Accuracy          87.44                             87.51
Time (s)          11.99                             7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested using real code and reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the function. The model can ensure the accuracy of vulnerability prediction and take into account the change in data flow between functions in running programs. This is a new vulnerability prediction method based on data flow analysis in a running program. (2) In this paper, the data set contained two different languages, namely, C/C++ and Java, which fully validated the validity of the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources for software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of the software vulnerability prediction study. This paper provides a new way for software metrics to study


data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs the method of dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, the decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model is verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken for the data set collected using C/C++ was less, and the accuracy rate was 82.53%. The accuracy rate for the data set using Java was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities in different types of languages.

The current buffer overflow vulnerability situation is very complicated, and such vulnerabilities are very common. Although they were divided into 8 categories here, this does not cover all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be made more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open-source software, without detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data to make up for the shortcomings of this model and to study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, in order to find increasingly complex types of buffer overflow vulnerabilities and also forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). The thirteenth page of the article has footnotes giving the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.
[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.
[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.
[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.
[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.
[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.
[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.
[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense (EC2ND '10), pp. 37–45, IEEE, October 2010.
[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.
[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.
[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002–2009, 2016.
[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses – Gradworks, 2009.
[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.
[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.
[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.
[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.
[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.
[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.
[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.
[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET '17), pp. 703–706, 2017.



operation. It is not possible to accurately reflect the type of vulnerability, its severity, or the distribution of the affected module.

This paper proposes a method based on software metrics for buffer overflow vulnerability prediction, called the buffer overflow vulnerability prediction (BOVP) method. Compared with traditional prediction methods such as support vector machines (SVMs), Bayesian methods, AdaBoost, and random forest (RF) algorithms, this new research method is applicable to software on different platforms (C/C++, Java) and can also accurately predict software buffer overflow vulnerabilities. The main contents of this paper are as follows. (1) Software metrics were extracted according to the static analysis of software source code, and a data mining method was used to extract data from the dynamic data stream at the functional level. (2) A decision tree algorithm was established to measure multiple types of buffer overflow vulnerability models based on the functional level. The algorithm not only improves the accuracy of the experiment but also reduces the measurement dimension and the experimental overhead. (3) The analyses led to the research idea of considering multiple vulnerability types and different functional classifications in software buffer overflow vulnerability analysis.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the buffer overflow vulnerability overview and the source code function classification. Section 4 presents an overview of the BOVP method. Section 5 presents the data sources, cross-validation, experimental procedures, and results. Section 6 describes the advantages of the BOVP approach. Finally, the conclusion and future research directions are presented in Section 7.

2. Related Work

At present, the academic community proposes dual solutions to the phenomenon of software buffer overflow vulnerability, based on both detection and protection. For example, to avoid buffer overflow, people use the polymorphic canary to detect an overflow of the stack buffer based on computer hardware research [5, 6]. J. Duraes et al. [7] proposed an automatic identification method for buffer overflow vulnerabilities in executable software without source code. S. Rawat et al. [8] proposed an evolutionary computation method for searching for buffer overflow vulnerabilities; this method combined genetic algorithms and evolutionary strategies to generate string-based input. It is a lightweight intelligent fuzzing tool used to detect buffer overflow vulnerabilities in C code. I. A. Dudina et al. [9] used static symbolic execution to detect buffer overflow vulnerabilities, described a prototype of an in-process heap buffer overflow detector, and provided an example of the vulnerabilities discovered by the detector. In terms of protection from buffer overflow vulnerabilities, S. Nashimoto et al. [10] proposed injecting buffer overflow attacks and verified the countermeasures for multiple fault injections. The results showed that their attacks could cover the return address stored in the stack and call any malicious function. J. Rao et al. [11] proposed a protection technology called a BF Window for performance- and resource-sensitive embedded systems. It verified each memory write by speculatively checking the consistency of the data attributes within the extended buffer window. Through experiments, the proposed mechanism could effectively prevent sequential memory writes across the buffer boundary. Although these methods have effectively detected buffer overflow vulnerabilities to a certain extent, the traditional research methods for software buffer overflow vulnerabilities suffer from large time overhead, path explosion, and a lack of dynamic software analysis. Therefore, research on using machine learning methods to predict software buffer overflow vulnerabilities has become increasingly widespread.

At present, research on using machine learning algorithms to predict various types of software vulnerabilities [12] is becoming widespread, and machine learning methods can improve vulnerability coverage and reduce running time in software vulnerability analysis and discovery. However, these methods cannot accurately reflect the type of vulnerability, the severity of the vulnerability, or the distribution of the module. To address these problems, a method based on "software metrics" to predict various types of software vulnerabilities was proposed [13]. Early software metrics focused on structured program design. By measuring the complexity of the software architecture, researchers described some basic performance metrics of software systems, such as functionality, reliability, portability, and maintainability. Software metrics can be divided into three typical representative categories, based on information flow, attributes, and behaviour. In recent years, software metrics have been studied extensively and have become a vibrant research topic in software engineering.

James Walden et al. [14] compared models based on two different types of features: software metrics and term frequencies (text mining features). They experimented with Drupal, PHPMyAdmin, and Moodle software. The experiments evaluated the performance of the models through hierarchical and cross-validation. The results showed that both the text mining and the software metric models had high recall rates. Building on this research, in 2017, Jeffrey Stuckman et al. [15] explored the role of dimension reduction through a series of cross-validation and cross-project prediction experiments. In the case of software metrics, a dimension reduction technique based on confirmatory factor analysis produced the best F-measure when performing cross-project prediction. Choudhary G. R. et al. [16] studied change metrics in combination with code metrics in software fault prediction to improve the performance of the fault prediction model. Different versions of the Eclipse project, with change metrics extracted from GitHub, were used for experimental research. In addition to the existing change metrics, several new change metrics were defined and collected from the Eclipse project repository. The experimental results showed that the change metrics had a positive impact on the performance of the fault prediction model; thus, a high-performance model could be established through multiple change metrics. Sanaz Rahimi et al. [17] proposed a vulnerability discovery model (VDMS) which uses the compiler's code base for static analysis, extracts code characteristics such as complexity (cyclomatic complexity), and uses code attributes


Table 1: C/C++ language experimental data set.

                               Good   Good Sink   Good Source   Good Auxiliary     Bad   Bad Sink   Bad Source    Total
Stack-based buffer overflow    5810        2748           312             7149    4716       2052          288    23075
Heap-based buffer overflow     7122        3076           968             9064    6733       2448          628    30039
Buffer overflow                2380        1180           208             3192    2472       1024          144    10600
Integer overflow               4608        4284           576            10872    4716       2052          288    27396
Integer underflow              3348        3234           408             8100    3414       1548          204    20256
Total                         23268       14522          2472            38377   22051       9124         1552   111366

as its parameters to predict vulnerabilities. By performing a static analysis on the source code of real software, it was verified that the code attributes have a high level of influence on the accuracy of vulnerability discovery.

In recent years, many studies have predicted software buffer overflow vulnerabilities by using machine learning algorithms. With the maturity of software measurement technology, most scholars have adopted software attribute measurement based on machine learning algorithms to predict vulnerabilities. Most software metrics for predicting software vulnerabilities are based on the static analysis of the software source code, but there are dynamic behaviours, such as function calls, memory reads and writes, and memory allocation, during the execution process. Therefore, if software metrics are to accurately predict software vulnerabilities that are persistent and concealed, the dynamic behaviour of the software needs to be considered.

3. Problem Description

Buffer overflow is the most dangerous type of software vulnerability. If buffer overflow vulnerabilities can be effectively eliminated, the security threat to the software will be reduced. This paper analysed the data flow based on the source code function level and combined a machine learning algorithm to study the software metrics of real software. Metrics need to be selected, such as the number of lines containing source code, the average base complexity of all nested functions or methods, and the average number of lines of source code in all nested functions or methods (as shown in Table 1).

3.1. Buffer Overflow Vulnerability Overview

A buffer is a contiguous area of memory within a computer that can hold multiple instances of the same data type. A buffer overflow occurs when the computer exceeds the capacity of the buffer itself while filling the buffer with data. The overflow data are overlaid on legal data (the data pointer of the next instruction, the return address of the function, etc.). Buffer overflows can be divided into stack-based buffer overflows, heap-based buffer overflows, and integer overflows. Integer overflows are classified into three categories: underflow and overflow of unsigned integers, signedness problems, and truncation problems. We use the Intel processor (register EBP) as an example to introduce stack-based buffer overflows, heap-based buffer overflows, and integer overflows.

Stack-based buffer overflows: in computer science, a stack is an abstract data type that serves as a collection of elements with two principal operations: push, which adds an element to the collection, and pop, which removes the most recently added element that has not yet been removed. A stack is a container for objects that are inserted and removed according to the LIFO (last in, first out) principle. The stack frame is the collection of all data on the stack associated with one subprogram call. The stack frame generally includes the return address, the argument variables passed on the stack, local variables, and the saved copies of any registers modified by the subprogram that need to be restored. The EBP is the base pointer for the current stack frame; the ESP is the current stack pointer. Each time a function is called, a new stack frame is generated and stored in a certain space of the stack. The EBP always points to the top of the stack frame (high address), while the ESP points to the next available byte in the stack. The EBP does not change during function execution, whereas the ESP varies with function execution. A stack-based buffer overflow occurs when a program writes data whose length exceeds the space allocated by the stack to the buffer at that memory address in the stack.

Heap-based buffer overflows: the heap is memory that is dynamically allocated while the program is running. The user generally requests memory through malloc, new, etc., and operates on the allocated memory through the returned starting address pointer. After use, the memory should be released by free, delete, etc.; otherwise, a memory leak results. The heap blocks in the same heap are usually contiguous in memory. If the data written to an allocated heap block exceed the size of the heap block, the overflowing data overwrite the neighbouring heap blocks behind it.

Integer overflows: similar to other types of overflows, integer overflows cause data to be placed in a storage space smaller than itself. The underflow of unsigned integers is caused by the fact that unsigned integers cannot represent negative numbers. The signedness problem is caused by comparisons between signed integers, operations on signed integers, and comparisons between unsigned and signed integers. The truncation problem mainly occurs when a wider integer (for example, 32 bits) is copied to a narrower integer (for example, 16 bits).
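The three integer failure modes above can be illustrated by emulating fixed-width arithmetic in Python (whose integers are arbitrary-precision, so word-size masks stand in for C's 32- and 16-bit types); this is an illustrative sketch, not code from the paper, and the values are arbitrary:

```python
# Python integers never overflow, so fixed-width C semantics are emulated
# here by masking to the relevant word size.
U32 = 0xFFFFFFFF  # mask for a 32-bit unsigned integer
U16 = 0xFFFF      # mask for a 16-bit unsigned integer

def u32(x):
    """Wrap x the way a C unsigned 32-bit integer would."""
    return x & U32

# Unsigned underflow: 0 - 1 wraps around to the maximum representable value.
assert u32(0 - 1) == 4294967295

# Signedness problem: the same bit pattern is a large unsigned value but a
# negative signed value.
bits = u32(-5)                                        # 0xFFFFFFFB
signed = bits - (1 << 32) if bits > 0x7FFFFFFF else bits
assert (bits, signed) == (4294967291, -5)

# Truncation: copying a 32-bit value into a 16-bit variable drops high bits.
length = 0x10010                                      # fits in 32 bits
truncated = length & U16                              # a 16-bit copy keeps 0x0010
assert truncated == 16
```

Each assertion encodes the wraparound, reinterpretation, or truncation behaviour that makes the corresponding overflow class exploitable in C.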


Buffer overflow attacks consist of three parts: code embedding, overflow attack, and system hijacking. There are four types of buffer overflow attacks: destroying activity records, destroying heap data, changing function pointers, and overflowing fixed buffers. Buffer overflow is a very common and very dangerous vulnerability that is widespread in various operating systems and applications. Exploiting buffer overflow vulnerabilities can enable hackers to gain control of a machine or even obtain the highest privileges. Buffer overflow attacks can lead to program failure, system shutdown, restarting, and so on. The buffer overflows we studied in this article include stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows.

3.2. Source Code Function Classification

Most software metrics previously used to predict software vulnerabilities use the metrics only to predict whether a vulnerability is present. During the execution of the software, there is a lack of analysis of the calling relationships between functions. Among the characteristics of software are functionality, reliability, usability, efficiency, maintainability, and portability; thus, the software implementation process is dynamic. In the vulnerability prediction of software, the calling relationships of the intrinsic functions in the software execution process should be fully considered, especially the data flow changes during the software calls. Therefore, this article applies data flow analysis to distinguish functions that contain vulnerabilities.

In general, software source code data flow analysis is based on understanding the logic of the code, tracking the flow of data from the beginning of the analysed code to the end. There are three difficulties in the analysis process. (1) The changes in the data accumulate with the increase in path length. (2) The value of a branch condition must be calculated based on the latest data analysis results, since the branch conditions accumulate as the path grows and each additional branch node on the path must lie on the path. (3) The number of paths usually grows in a geometric progression as the code size increases. A more efficient approach is to decompose the global data stream analysis into partial data stream analyses.

This paper analyses the functions called in the test case data and uses the partial data flow analysis method to distinguish the functions into different "source" and "accept" (sink) functions. This research performs a functional level analysis on the program source code. According to the behaviour of the data flow of each function in the actual execution of the program, this paper uses a data mining method to distinguish the function categories from the data in the data set. First, there are two types (good and bad), depending on whether or not there are vulnerabilities as a whole. Second, this paper analyses the functions using data flow analysis. Functions that do not contain vulnerabilities are classified into four categories. Good: the only code in a good function is a call to each of the secondary good functions; this function does not contain any flawed constructs. Good Sink and Good Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from the primary good function; each source or sink function is specific to exactly one secondary good function. Good Auxiliary: cases where a Good Source is passing safe data to a potentially Bad Sink, or where a Bad Source is passing unsafe or potentially unsafe data to a Good Sink. The categories containing vulnerabilities are divided into three groups. Bad: a function that contains a flawed construct. Bad Sink and Bad Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from a primary bad function; each source or sink function is specific to either bad function for the test case.

According to whether or not a vulnerability is included, each function falls into one of 8 categories. This paper defines the set of categories as X = {x1, x2, ..., x8}, where x1, ..., x8 are the numbers of functions that do not contain vulnerabilities (Good), sink functions that do not contain vulnerabilities (Good Sink), source functions that do not contain vulnerabilities (Good Source), auxiliary functions that do not contain vulnerabilities (Good Auxiliary), main functions containing a flaw (Bad), sink functions that contain a vulnerability (Bad Sink), source functions containing a vulnerability (Bad Source), and functions that are not yet explicitly defined (Other). In this study, only the first seven categories were studied.

4. Method

4.1. BOVP Method

This paper first extracts the source code of the executable program and then establishes a dynamic data flow analysis model at the software function level. Finally, a vulnerability prediction method based on software metrics is implemented. The process of establishing the BOVP (M, A, S) model in this method is shown in Figure 1.

(1) For the source code of the target software, this paper applied functional-level static analysis, and the software attribute measurement method was used to extract all software metric data (M).

(2) According to the characteristics of the software buffer overflow vulnerability, this paper established a metric set A of software-level buffer overflow vulnerability based on the function level, A = {a1, a2, ..., an}.

(3) For the characteristics of the data flow where the functions call each other, the data mining method was used to extract the data regarding the function classes, and items in the "Other" function category were removed.

(4) This paper applied the mutual information value for feature selection and incorporated the metric set A to measure the impurity of the data partition D as the split criterion. The cost-complexity pruning algorithm was applied as the pruning method to determine the buffer overflow vulnerability classification algorithm S (strategy).
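As a rough sketch of steps (1)–(4), the pipeline can be approximated with scikit-learn, assuming a metric matrix M has already been extracted from the source code; the data below is a synthetic placeholder, and the settings (k=16 features kept, ccp_alpha=0.01) are illustrative choices, not the paper's:

```python
# Hypothetical sketch: mutual-information feature selection followed by a
# Gini-criterion CART with cost-complexity pruning, on placeholder data.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
M = rng.random((200, 22))                   # steps (1)/(2): 22 metrics per function
y = (M[:, 0] + M[:, 3] > 1.0).astype(int)   # step (3): placeholder class labels

# Step (4a): mutual-information feature selection (here keeping 16 of 22).
selector = SelectKBest(mutual_info_classif, k=16).fit(M, y)
A = selector.transform(M)

# Step (4b): CART with the Gini split criterion and cost-complexity pruning.
clf = DecisionTreeClassifier(criterion="gini", max_depth=5,
                             min_samples_leaf=5, ccp_alpha=0.01,
                             random_state=0).fit(A, y)
print(clf.score(A, y))  # training accuracy of the pruned tree
```

The same structure applies to the paper's Juliet-derived metric data, with the labels replaced by the seven function categories.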

4.2. BOVP Prediction

4.2.1. Model Overview

The CART (classification and regression tree) model was proposed by Breiman et al. [18] in 1984 and has become a widely used decision tree learning method. The CART method assumes that the decision tree is a binary tree. The input feature space is divided into a finite number of


Figure 1: The framework for the BOVP method. (The flowchart proceeds from the target software's source code along two paths, static analysis extracting metric data in the form of software metrics and data mining extracting and classifying function data, then through relevance-based index selection and decision tree classification to the prediction of vulnerability functions, extracting Bad, Bad Sink, and Bad Source.)

cells, and the predicted values are determined by these cells; for instance, a given input condition is mapped to a predicted output value.

The CART model in this article consists of the following three steps.

(1) CART generation: generating a decision tree based on the training data set.

(2) CART pruning: pruning the decision tree according to certain constraints, such as the maximum depth of the tree, the minimum number of samples at the leaf nodes, the purity of the nodes, etc.

(3) Optimization of the decision tree: generating different CARTs for multiple parameter combinations (maximum depth of the tree (max_depth), minimum sample number of a leaf node (min_samples_leaf), node impurity (min_impurity_split)) and choosing the model with the best generalization ability.

We first chose to use a supervised learning algorithm. The next consideration was the data problem, for example, whether the eigenvalues are discrete or continuous variables, whether there are missing values in the eigenvalue vector, and whether there are abnormal values in the data. Therefore, after analysis, the CART tree classification algorithm was selected. CART has the following advantages over other classical classification and regression algorithms: (1) less data preparation, with no need to standardize data; (2) the ability to process continuous and discrete data; (3) the ability to handle multiclassification problems; and (4) a prediction process that can be easily explained by Boolean logic, compared to other models such as Naive Bayes and SVMs.

4.2.2. Training

In the classification prediction part of the BOVP method, first, the feature selection operation was performed on the attribute set A. Second, a specific classification algorithm was selected according to the data characteristics. This research adopts the decision tree algorithm [19] to establish the classification model, based on the characteristics of the integrated software and the characteristics of the classification algorithm.

First, the classification decision tree model is a tree structure that classifies instances based on features. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent the n attributes in A, including the number of lines of source code, the code paths that can be executed, and the cyclomatic complexity. The leaf nodes represent the types in X = {x1, x2, ..., x7}. Second, when the decision tree performs classification, one feature of an instance is tested starting from the root node, and the instance is assigned to a child node according to the test result. Each child node corresponds to a value of the feature, so the process recursively moves down until a leaf node is reached. Finally, the instance is assigned to the class of the leaf node. The decision tree can be converted to a set of if-then rules, or it can be considered a conditional probability distribution defined over the feature and class spaces. Compared with the naive Bayes classification method, the decision tree has the advantage that its construction process does not require any domain-specific knowledge or parameter settings. Therefore, the decision tree is more suitable for exploratory knowledge discovery in practical applications. The construction of the decision tree algorithm [19] is divided into three parts, feature selection, decision tree generation, and decision tree pruning, as follows.

(1) Feature selection: the features in this model are an (n ≤ 22), attributes that include the number of lines in the source code, the code paths that can be executed, and the cyclomatic complexity. Feature selection is applied to select the features that have the ability to classify the training data, in a way that can improve the efficiency of decision tree learning. The feature selection process uses information gain, the information gain ratio, or the Gini index.

(2) Generation of the decision tree: the Gini index is used as the feature selection criterion to measure the impurity of the data partition D, such that the locally optimal feature is continuously selected and the training set is divided into feasible subsets.

(3) Decision tree pruning: the cost-complexity pruning algorithm is used to prune the tree; that is, the tree complexity is regarded as a function of the number of leaf nodes and the error rate of the tree, working up from the bottom of the tree. For each internal node N, the cost complexity of the subtree of N is calculated both before and after pruning that subtree; if pruning lowers the cost complexity, the subtree is pruned, and otherwise the subtree is retained.
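scikit-learn exposes the same weakest-link cost-complexity pruning idea through `cost_complexity_pruning_path`, which can serve as a concrete illustration of step (3); the data here is a placeholder, not the paper's:

```python
# Sketch of cost-complexity pruning: compute the candidate alpha values for a
# fully grown tree, then observe how the subtree shrinks as alpha grows.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((150, 4))
y = (X[:, 0] > 0.5).astype(int)

# The pruning path yields the effective alpha of each weakest-link prune.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refitting with each candidate alpha reproduces the subtree sequence:
# node counts shrink monotonically, ending in the single root node.
sizes = [DecisionTreeClassifier(random_state=0, ccp_alpha=a)
         .fit(X, y).tree_.node_count for a in path.ccp_alphas]
print(sizes)
```

The last candidate alpha always collapses the tree to its root, which is why that endpoint is normally discarded when choosing the final model.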

Combining the attribute set A = {a1, a2, ..., a22}, we can set the training data partition to D and construct the decision tree model through the three steps of feature selection, decision tree generation, and pruning. The basic strategy of the algorithm is to call the algorithm with three parameters: D, attribute_list, and Attribute_selection_method. D is a data partition, which is a collection of training tuples and their corresponding category labels; attribute_list is a list describing the tuple attributes; Attribute_selection_method specifies the heuristic process used to select the "best" attribute to partition the tuples by class.

In Algorithm 1, Lines (1) to (6) specify the Gini minimization criterion used to obtain the feature attributes and the partition points of the split selection. When the feature selection is started, a tree is constructed step by step. Lines (7) to (8) describe the pruning process of the tree according to three indicators (γ, α, β). Once the stopping condition is satisfied, data set D is split into two subtrees, D1 and D2, which is expressed in Lines (9) to (10). Lines (11) to (12) state that if the sample number of a node is less than β, the node will be dropped. Lines (13) to (17) show that if the node does not satisfy any of the conditions, it will be converted into a leaf node whose output class is the most frequent class on the node. Finally, the process of prediction is conducted by using the TreeModel in Line (19). This method is highly complex, especially when searching for the segmentation point, as it is necessary to traverse all the possible values of all the current features. If there are F features, each having N values, and the generated decision tree has S internal nodes, then the time complexity of the algorithm is O(F × N × S).

The decision tree in the algorithm uses the Gini coefficient as the splitting criterion for feature selection. That is, the training data set D is divided into two parts, D1 and D2, according to whether feature A takes a certain value a. Then, under the condition of feature A, the Gini index of set D is defined as

Gini(D, A) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)    (1)

The Gini index Gini(D) represents the uncertainty of set D, and the Gini index Gini(D, A) represents the uncertainty of set D after partitioning by A = a. The larger the Gini index is, the greater the uncertainty of the sample is. Compared to other classical classification and regression algorithms, decision trees have the advantage of generating easy-to-understand rules while requiring a relatively small number of computations. Decision trees are capable of processing continuous and categorical fields and can clearly display the more important fields. Therefore, by training the decision tree model on the existing data set, the model can be trained to predict whether a program has any of the 7 buffer vulnerability classes by using metrics such as the number of rows containing source code, the number of times called, the code paths that can be executed, and the loop complexity.
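Equation (1) can be transcribed directly; the following is a minimal self-contained sketch, not the paper's implementation:

```python
# Gini impurity of a label set, and the split Gini index of (1).
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes k of p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2), as in (1)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

assert gini([0, 0, 1, 1]) == 0.5          # maximally uncertain binary node
assert gini_split([0, 0], [1, 1]) == 0.0  # a pure split removes all uncertainty
```

CART evaluates `gini_split` for every candidate value a of every feature A and takes the minimum, which is the search in Lines (1) to (5) of Algorithm 1.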

The specific method of generating the subtree sequence T0, T1, ..., Tn in pruning is to prune the branch in Ti that has the smallest increase in error on the training data set, starting from T0, to obtain Ti+1. For example, when a tree T is pruned at node t, the increase in error is R(t) − R(Tt), where R(t) is the error of node t after the subtree of node t is pruned, and R(Tt) is the error of the subtree Tt when the subtree of node t is not pruned. However, the number of leaves in T is


input: D = {(X1, y1), (X2, y2), ..., (Xn, yn)}, where X = {X1, X2, X3, ..., Xn} represents the feature attribute set and Y = {y1, y2, y3, ..., yn} represents the predicted attribute set
output: CART model set
(1) for xi in X
(2)   for Tj in xi
(3)     search min(Gini)
(4)   end for
(5) end for
(6) build tree TreeModel from Tj and xi
(7) if H(D) ≤ γ && the current depth < α && |D| ≥ β    (|D|: sample number of D)
(8)   divide D into D1 and D2
(9)   DecisionTreeClassifier(D1, α, γ, β)
(10)  DecisionTreeClassifier(D2, α, γ, β)
(11) else if |D| < β
(12)   drop D
(13) else
(14)   convert D to a leaf node
(15)   predicted class = the most frequent class among the node's samples
(16)   break
(17) end else
(18) return TreeModel
(19) prediction on the TreeModel

Algorithm 1: BOVP based on the Gini index.

reduced by L(Tt) − 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after the tree branches are pruned is determined by

α = (R(t) − R(Tt)) / (L(Tt) − 1)    (2)

where R(t) represents the error of node t after the subtree of node t is pruned; R(t) = r(t) × p(t), where r(t) is the error rate of node t and p(t) is the ratio of the number of samples on node t to the number of samples in the training set. R(Tt) represents the error of the subtree Tt when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes of subtree Tt. Ti+1 is the pruned tree with the smallest α value in Ti.
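A worked instance of (2), with illustrative numbers that are not taken from the paper:

```python
# Suppose node t misclassifies at rate r(t) = 0.2 and holds p(t) = 0.1 of the
# training samples, while its unpruned subtree Tt has a summed leaf error of
# 0.01 and 3 leaves. All four numbers are hypothetical.
r_t, p_t = 0.2, 0.1
R_t = r_t * p_t            # error of node t if its subtree is pruned -> 0.02
R_Tt = 0.01                # summed leaf error of the unpruned subtree Tt
L_Tt = 3                   # number of leaves of Tt

alpha = (R_t - R_Tt) / (L_Tt - 1)
assert abs(alpha - 0.005) < 1e-12
```

At each pruning round, the internal node with the smallest such α is the "weakest link" and is the one collapsed to obtain Ti+1.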

4.2.3. Prediction

The input of the prediction section is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. The vulnerability data include stack-based buffer overflow, heap-based buffer overflow, buffer overflow, integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes to represent attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of the test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing a decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into different classes.

The prediction output is the class predicted by the decision tree algorithm, belonging to the X = {x1, x2, ..., x7} categories. Further analysis can be performed on the functions containing vulnerabilities based on the prediction results. The prediction part used the accuracy rate, recall rate, and F1 evaluation index to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier to examine the entire sample. The recall rate measures the proportion of relevant items that are correctly retrieved, which reflects the proportion of positive data correctly identified by the decision tree algorithm out of the total positive data. The F1 value is the harmonic mean of the precision and the recall, and it measures the overall quality of the decision tree algorithm's classification.
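The three evaluation indexes can be computed from the confusion counts; the following is a minimal binary-case sketch (the paper's task is multiclass, where these quantities are typically averaged per class):

```python
# Accuracy, precision, recall, and F1 from raw true/predicted binary labels.
def evaluate(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))      # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)          # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
assert (acc, prec, rec) == (0.6, 2 / 3, 2 / 3)
```

Because F1 is the harmonic mean, it is high only when precision and recall are both high, which is why it is used alongside plain accuracy to judge the classifier.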

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, as well as the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program. (1) This article extracted code that contained only one type of defect, based on the buffer overflow vulnerability code fragments, and that did not contain code of other, unrelated defect types. (2) Then, this paper statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 functional-level metrics were extracted. (3) Due to the diversity of the types of buffer overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data and delete duplicate information; twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attributes A. (5) According to the 4:1 ratio between the training set and the test set, the Gini coefficient was selected as the splitting criterion, and the CART decision tree was constructed with K-fold cross-validation.

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selected 163,737 buffer overflow vulnerability test cases extracted from actual programs. The data set was produced by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows past the end of the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for the experiments.

The C/C++ language experimental data set had 111,366 samples covering stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. The summaries of the data sets are shown in Tables 1 and 2.

5.2. Metrics Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information is used to eliminate irrelevant metrics. Finally, the accuracy, recall, and precision indicators are used to measure the predicted results.

Software metric indexes provide a quantitative standard to evaluate code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost performance estimation. Following the development of software from structured and modular design to object-oriented design, metrics can be divided into traditional software metrics and object-oriented software metrics. According to traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it is necessary to perform data cleaning on the initially extracted data to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we need to filter data that do not meet certain requirements; for example, the extracted data contain a large number of zero values and identical attribute values for the same numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective feature space but also remove noise features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates whether two variables X and Y are related [20, 21]. It is defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3):

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}   (3)

The mutual information between feature items and categories reflects the degree of correlation between them. When mutual information is taken as the evaluation value for feature extraction, the features with the largest mutual information values are selected.
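Equation (3) can be evaluated directly on a small assumed joint distribution; the sketch below computes I(X; Y) in nats with NumPy:

```python
import numpy as np

# toy joint distribution p(x, y) over two binary variables
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# I(X; Y) = sum_x sum_y p(x, y) * log( p(x, y) / (p(x) p(y)) )
mi = sum(p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if p_xy[i, j] > 0)
```

If p(x, y) factorizes as p(x)p(y), every log term is zero and mi = 0, which is the independence property used for feature selection in Section 5.3.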

5.3. Feature Selection. Due to the limited number of samples, designing a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was applied to the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there is no correlation, the mutual information value is zero; conversely, the larger the mutual information value, the greater the correlation. During feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets were combined to select n <= 22, and finally 16 metrics were obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}, where a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.
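The rank-by-MI-and-keep-top-k step can be sketched as follows, using scikit-learn's kNN-based MI estimator on toy data (an assumed implementation; k = 3 here for brevity, whereas the paper keeps 16 of 22 metrics):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
X = rng.random((300, 6))                    # 300 functions x 6 toy metrics
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # label depends only on metrics 0 and 1

mi = mutual_info_classif(X, y, random_state=0)  # one MI estimate per metric
kept = np.argsort(mi)[::-1][:3]                 # indices of the top-3 metrics
X_sel = X[:, np.sort(kept)]                     # reduced feature matrix
```

The two genuinely informative metrics receive clearly higher MI estimates than the noise metrics and survive the cut.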

Although the global results before and after feature selection appear similar, the same effect is achieved with fewer features by applying feature selection. The obtained data features and the 16 indicators adopted after feature selection enable efficient cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also


Table 2: Java language experimental data set specific data.

                 GOOD                                           BAD
                 Good   Good Sink  Good Source  Good Auxiliary  Bad   Bad Sink  Bad Source  Total
Buffer overflow  8073   9522       828          21114           7866  4554      414         52371

Table 3: Metrics.

1. CountInput: The number of calls to the method. Calls by the same method are counted only once, and calls by fields are not counted.
2. CountLine: The number of lines of code.
3. CountLineCode: The number of lines containing code.
4. CountLineCodeDecl: The number of lines of declaration code; the method name line is also counted.
5. CountLineCodeExe: The number of lines of purely executable code.
6. CountLineComment: The number of comment lines.
7. CountOutput: The number of calls to other methods; multiple calls to the same method are counted as one call. A return statement counts as a call.
8. CountPath: The number of code paths that can be executed; related to cyclomatic complexity.
9. CountPathLog: log10, truncated to an integer value, of the metric CountPath.
10. CountSemicolon: The number of semicolons.
11. CountStmt: The number of statements; statements are counted individually even if several are written on one line.
12. CountStmtDecl: The number of declaration statements, including the method declaration line.
13. CountStmtExe: The number of executable statements.
14. Cyclomatic: Cyclomatic complexity (standard calculation method).
15. CyclomaticModified: Cyclomatic complexity (the second calculation method).
16. CyclomaticStrict: Cyclomatic complexity (the third calculation method).
17. Essential: Essential complexity (standard calculation method).
18. Knots: A measure of overlapping jumps.
19. MaxEssentialKnots: The maximum number of knots after structured programming constructs have been removed.
20. MaxNesting: The maximum nesting level; related to cyclomatic complexity.
21. MinEssentialKnots: The minimum number of knots after structured programming constructs have been removed.
22. RatioCommentToCode: The ratio of comment lines to code lines.

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but poorly on data outside the training set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially its performance on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the hold-out method, K-fold cross-validation, and leave-one-out cross-validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts, with k-1 parts used for training and the remaining part used for testing; the process is repeated k times, thus obtaining k models with which to evaluate performance. In this experiment, we split the data set D into two parts: 80% of D was used as training data and 20% was used as test data.

First, we divided the data D into a training set to train the model and a test set to test the predictive performance of the model; this process is repeated until all data have been used. Next, the K value was chosen to be 10, so that the data set was divided into 10 subsamples, 9 of which were used as training data and one as test data. Finally, the sensitivity of the model's performance to the data partitioning was reduced by averaging the results of the 10 groups [22].
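The 10-fold procedure above can be sketched as follows (scikit-learn assumed, toy data in place of the paper's metrics):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.random((200, 16))
y = (X[:, 0] > 0.5).astype(int)      # toy label for illustration

# 10 folds: each fold serves once as the test set, the other 9 train the model
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(criterion="gini", random_state=0),
                         X, y, cv=cv)
mean_accuracy = scores.mean()        # performance averaged over the 10 folds
```

Averaging the 10 fold scores reduces the sensitivity of the reported accuracy to any single train/test split.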

5.4.1. Results in C/C++ Programs. In this experiment, programs written in the C/C++ language were used, and the decision tree algorithm was compared with other machine learning algorithms with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, with accuracy reaching 82.55%. Second, the data feature selection was verified by comparing the 22 metrics without feature selection against the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results.

Metrics              C/C++ data set     Java data set
                     MI value   Rank    MI value   Rank
CountInput           0.1892     11      0.6375     3
CountLine            0.5584     1       0.7996     1
CountLineCode        0.4902     2       0.6446     2
CountLineCodeDecl    0.4327     5       0.4373     11
CountLineCodeExe     0.3893     9       0.5000     7
CountLineComment     0.3980     8       0.4990     8
CountOutput          0.1625     12      0.3477     13
CountPath            NaN        13      0.3959     12
CountPathLog         NaN        13      0.2638     17
CountSemicolon       0.4307     7       0.5484     5
CountStmt            0.4465     4       0.5879     4
CountStmtDecl        0.4313     6       0.4390     10
CountStmtExe         0.3434     10      0.5123     6
Cyclomatic           NaN        13      0.3122     14
CyclomaticModified   NaN        13      0.3122     15
CyclomaticStrict     NaN        13      0.2968     16
Essential            NaN        13      0.0099     20
Knots                NaN        13      0.2508     18
MaxEssentialKnots    NaN        13      0.0099     21
MaxNesting           NaN        13      0.2236     19
MinEssentialKnots    NaN        13      0.0099     22
RatioCommentToCode   0.4603     3       0.4959     9

Table 5: C/C++ language data set machine learning algorithm results.

                    SVM     Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)        82.06   37.69   35.47     82.54          82.55
Recall rate (%)     67.80   30.23   34.25     68.65          68.68
Precision rate (%)  71.06   31.46   33.75     73.59          73.62
Running time (s)    576.32  19.98   19.02     17.94          17.03

It was shown by experiments that the running time of the buffer overflow vulnerability prediction algorithm when using the A = {a1, a2, ..., a16} metrics after feature selection was reduced from the previous 17.03 s to 15.94 s, while the accuracy hardly changed. Thus, feature selection not only helps to reduce the running time but also yields comparable accuracy and improves the efficiency of the vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source have a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerability samples into eight categories. When a vulnerability exists in real software, the software does not necessarily contain all types of buffer vulnerabilities; there may be only one or several of them, and when we collect data from real software, the data that may be vulnerable are much scarcer than the data without vulnerability, with a ratio that may even reach 1:10,000. In the Java data set, because the experiment used a data set extracted from real software, the data volume of Good Source and Bad Sink is very small, which results in extremely unbalanced data. It also leads to the precision and recall of both of these types being 0. The precision and recall for the other, basically balanced vulnerability types are better. To maintain the distribution of the original data and to improve the credibility of the results, data sampling was not used to balance the data, which then demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment shows that using the A = {a1, a2, ..., a16} metrics to predict buffer overflow vulnerability after feature selection reduces the running time while maintaining high accuracy.


Table 6: Decision tree algorithm specific operation results.

                Before feature selection        After feature selection
                Precision  Recall  F1           Precision  Recall  F1
Good            88.09      83.56   85.76        88.03      83.47   85.69
Good Sink       61.27      52.91   56.78        61.27      52.91   56.78
Good Source     53.13      25.37   34.34        53.13      25.37   34.34
Good Auxiliary  77.16      96.34   85.69        77.16      96.34   85.69
Bad             77.08      49.23   60.09        77.08      49.23   60.09
Bad Sink        67.94      77.47   72.40        67.94      77.47   72.39
Bad Source      90.71      96.00   93.23        90.66      95.87   93.12
Average         73.63      68.70   69.76        73.61      68.67   71.05
Accuracy        82.55                           82.53
Time (s)        17.03                           15.94

Table 7: Accuracy and time of Java data set.

                    SVM     Bayes  Adaboost  Random forest  Decision tree
Accuracy (%)        87.16   57.95  49.40     87.41          87.44
Recall rate (%)     59.56   48.25  34.98     59.64          59.64
Precision rate (%)  58.74   44.28  33.11     58.91          58.93
Running time (s)    104.32  9.98   8.99      12.94          11.99

Table 8: Decision tree algorithm specific operation results (%).

                Before feature selection        After feature selection
                Precision  Recall  F1           Precision  Recall  F1
Good            95.85      93.04   94.42        96.11      93.04   94.55
Good Sink       47.15      53.10   49.95        47.15      53.10   49.95
Good Source     0          0       0            0          0       0
Good Auxiliary  98.56      98.13   98.35        98.56      98.13   98.34
Bad             77.27      73.99   75.60        77.27      73.99   75.60
Bad Sink        0          0       0            0          0       0
Bad Source      95.74      99.51   96.54        93.75      99.60   96.59
Average         59.22      59.69   59.27        58.98      59.69   59.33
Accuracy        87.44                           87.51
Time (s)        11.99                           7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability; it has been applied and tested using real code from reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the function level are analysed according to the data flow of the functions. The model can ensure the accuracy of vulnerability prediction while taking into account the changes in data flow between functions in running programs. This is a new vulnerability prediction method based on data flow analysis in a running program. (2) In this paper, the data set contained two different languages, namely, C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction research. This paper provides a new way for software metrics to inform data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs the method of dynamic data stream analysis at the software function level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, the decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model was verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken for the data set collected using C/C++ was lower, and the accuracy rate was 82.53%. The accuracy rate for the Java data set was also high, reaching 87.51%, which fully demonstrates that this model is robust for predicting buffer overflow vulnerabilities in different types of languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although they are divided into 8 categories here, these do not cover all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be made more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open source software, with no detailed analysis of non-open-source software. In the future, we plan to address the problem of unbalanced data to make up for the shortcomings of this model, and to study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, in order to find increasingly complex types of buffer overflow vulnerabilities and also to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). Footnotes on the thirteenth page of the article give the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense (EC2ND '10), pp. 37–45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002–2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses – Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET '17), pp. 703–706, 2017.



Table 1: C/C++ language experimental data set specific data.

                             GOOD                                            BAD
                             Good   Good Sink  Good Source  Good Auxiliary   Bad    Bad Sink  Bad Source  Total
Stack-based buffer overflow  5810   2748       312          7149             4716   2052      288         23075
Heap-based buffer overflow   7122   3076       968          9064             6733   2448      628         30039
Buffer overflow              2380   1180       208          3192             2472   1024      144         10600
Integer overflow             4608   4284       576          10872            4716   2052      288         27396
Integer underflow            3348   3234       408          8100             3414   1548      204         20256
Total                        23268  14522      2472         38377            22051  9124      1552        111366

as its parameters to predict vulnerabilities. By performing a static analysis on the source code of real software, it was verified that the code attributes have a high level of influence on the accuracy of vulnerability discovery.

In recent years, many studies have predicted software buffer overflow vulnerabilities using machine learning algorithms. With the maturity of software measurement technology, most scholars have adopted software attribute measurement based on machine learning algorithms to predict vulnerabilities. Most software metrics for predicting software vulnerability are based on static analysis of the software source code, but there are dynamic behaviours, such as function calls, memory reads and writes, and memory allocation, during the execution process. Therefore, if software metrics are to accurately predict software vulnerabilities that are persistent and concealed, the dynamic behaviour of the software needs to be considered.

3. Problem Description

Buffer overflow is the most dangerous type of software vulnerability. If buffer overflow vulnerabilities can be effectively eliminated, the security threat to the software will be reduced. This paper analysed the data flow at the source code function level and combined it with a machine learning algorithm to study software metrics for real software. Metrics need to be selected from software metrics such as the number of lines containing source code, the average base complexity of all nested functions or methods, and the average number of lines of source code in all nested functions or methods (as shown in Table 1).

3.1. Buffer Overflow Vulnerability Overview. A buffer is a contiguous area of memory within a computer that can hold multiple instances of the same data type. A buffer overflow occurs when the computer exceeds the capacity of the buffer itself while filling the buffer with data; the overflowing data then overwrite legitimate data (the data pointer of the next instruction, the return address of the function, etc.). Buffer overflows can be divided into stack-based buffer overflows, heap-based buffer overflows, and integer overflows. Integer overflows are classified into three categories: underflow and overflow of unsigned integers, sign problems, and truncation problems. We use the Intel processor (register EBP) as an example to introduce stack-based buffer overflows, heap-based buffer overflows, and integer overflows.

Stack-based buffer overflows: in computer science, a stack is an abstract data type that serves as a collection of elements with two principal operations: push, which adds an element to the collection, and pop, which removes the most recently added element that has not yet been removed. A stack is a container for objects that are inserted and removed according to the LIFO principle. The stack frame is the collection of all data in the stack associated with one subprogram call; it generally includes the return address, the argument variables passed on the stack, local variables, and saved copies of any registers modified by the subprogram that need to be restored. The EBP is the base pointer for the current stack frame, and the ESP is the current stack pointer. Each time a function is called, a new stack frame is generated and stored in a certain space of the stack. The EBP always points to the top of the stack frame (high address), while the ESP points to the next available byte in the stack. The EBP does not change during function execution, while the ESP varies with function execution. A stack-based buffer overflow occurs when a program writes to a memory address in the stack data whose length exceeds the space allocated by the stack to the buffer.

Heap-based buffer overflows: the heap is memory that is dynamically allocated while the program is running. The user generally requests memory through malloc, new, etc. and operates on the allocated memory through the returned starting address pointer; after use, the memory should be released by free, delete, etc., otherwise a memory leak results. Heap blocks in the same heap are usually contiguous in memory. If the data written to an allocated heap block exceed the size of the heap block, the overflowing data overwrite the neighbouring heap blocks behind it.

Integer overflows: similar to other types of overflow, integer overflows cause data to be placed in a storage space smaller than the data themselves. The underflow of unsigned integers is caused by the fact that unsigned integers cannot represent negative numbers. Sign problems are caused by comparisons between signed integers, operations on signed integers, and comparisons between unsigned and signed integers. The truncation problem mainly occurs when a wider integer (for example, 32 bits) is copied to a narrower integer (for example, 16 bits).
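The unsigned-underflow and truncation problems can be illustrated by emulating fixed-width machine integers (the helper names here are hypothetical, for illustration only):

```python
# Python integers are unbounded, so fixed-width machine integers are
# emulated by masking to the lower 32 or 16 bits.
def to_u32(v):
    return v & 0xFFFFFFFF   # wrap-around of a 32-bit unsigned integer

def to_u16(v):
    return v & 0xFFFF       # truncation to a 16-bit unsigned integer

underflow = to_u32(0 - 1)        # unsigned underflow: 0 - 1 wraps to 4294967295
truncated = to_u16(0x12345678)   # copying 32 bits into 16 keeps only 0x5678
```

In C, the same effects occur silently when the result of `0u - 1` is stored in a `uint32_t` or a 32-bit value is assigned to a `uint16_t`.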


Buffer overflow attacks consist of three parts: code embedding, overflow attack, and system hijacking. There are four types of buffer overflow attacks: destroying activity records, destroying heap data, changing function pointers, and overflowing fixed buffers. Buffer overflow is a very common and very dangerous vulnerability that is widespread in various operating systems and applications. Exploiting buffer overflow vulnerabilities can enable hackers to gain control of a machine or even the highest privileges, and buffer overflow attacks can lead to program failure, system shutdown, restarting, and so on. The buffer overflows we studied in this article include stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows.

3.2. Source Code Function Classification. Most software metrics previously used to predict software vulnerabilities use the metrics to predict whether a vulnerability is present; during the execution of the software, there is a lack of analysis of the calling relationships between functions. Among the characteristics of software are functionality, reliability, usability, efficiency, maintainability, and portability; thus, the software execution process is dynamic. In the vulnerability prediction of software, the calling relationships of the intrinsic functions during software execution should be fully considered, especially the data flow changes during calls. Therefore, this article applies data flow analysis to distinguish functions that contain vulnerabilities.

In general, software source code data flow analysis is based on understanding the logic of the code, tracking the flow of data from the beginning of the analysed code to the end. There are three difficulties in the analysis process. (1) The changes in the data accumulate as the path length increases. (2) The value of a branch condition must be calculated from the latest data analysis results, since branch conditions accumulate as the path grows and each additional branch node must lie on the path. (3) The number of paths usually grows as a geometric progression as the code size increases. A more efficient approach is to decompose the global data flow analysis into partial data flow analyses.

This paper analyses the functions called in the test case data and uses the partial data flow analysis method to separate the functions into different "source" and "sink" functions. This research performs a function-level analysis on the program source code. According to the behaviour of the data flow of each function in the actual execution of the program, this paper uses a data mining method to distinguish the function categories from the data in the data set. First, there are two types (good and bad), depending on whether or not vulnerabilities are present as a whole. Second, this paper analyses the functions using data flow analysis. Functions that do not contain vulnerabilities are classified into four categories. Good: the only code in a good function is a call to each of the secondary good functions; such a function does not contain any flawed constructs. Good Sink and Good Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from the primary good function; each source or sink function is specific to exactly one secondary good function. Good Auxiliary: cases where a Good Source passes safe data to a potentially Bad Sink, or where a Bad Source passes unsafe or potentially unsafe data to a Good Sink. The categories containing vulnerabilities are divided into three groups. Bad: a function that contains a flawed construct. Bad Sink and Bad Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from a primary bad function; each source or sink function is specific to either bad function for the test case.

According to whether or not a vulnerability is included, each function falls into one of 8 categories. This paper defines the set of categories as X = {x1, x2, ..., x8}, where x1, ..., x8 are, respectively: functions that do not contain vulnerabilities (Good), sink functions that do not contain vulnerabilities (Good Sink), source functions that do not contain vulnerabilities (Good Source), auxiliary functions that do not contain vulnerabilities (Good Auxiliary), a main function containing a bug (Bad), a sink function that contains a vulnerability (Bad Sink), a source function containing a vulnerability (Bad Source), and functions that are not yet explicitly defined (Other). In this study, only the first seven categories were studied.
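As a minimal illustration, the eight categories can be written as a lookup table; the numeric codes and the `studied` helper below are hypothetical conveniences, since the paper only names the categories x1 through x8:

```python
# Hypothetical encoding of the eight function categories (codes are
# illustrative; the paper defines only the names x1..x8).
CATEGORIES = {
    "Good": 1,            # x1: no vulnerability
    "Good Sink": 2,       # x2
    "Good Source": 3,     # x3
    "Good Auxiliary": 4,  # x4
    "Bad": 5,             # x5: contains a flawed construct
    "Bad Sink": 6,        # x6
    "Bad Source": 7,      # x7
    "Other": 8,           # x8: not explicitly defined, excluded here
}

def studied(label):
    """Only the first seven categories are used in this study."""
    return CATEGORIES[label] != CATEGORIES["Other"]
```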

4. Method

4.1. BOVP Method. This paper first extracts the source code of the executable program and then establishes a dynamic data flow analysis model at the software function level. Finally, a vulnerability prediction method based on software metrics is implemented. The process of establishing the BOVP (M, A, S) model in this method is shown in Figure 1.

(1) For the source code of the target software, this paper applied function-level static analysis, and the software attribute measurement method was used to extract all software metric data (M).

(2) According to the characteristics of software buffer overflow vulnerabilities, this paper established a metric set A of function-level buffer overflow vulnerability measures: A = {a1, a2, ..., an}.

(3) For the characteristics of the data flow in which the functions call each other, the data mining method was used to extract the data regarding the function classes, and items in the "Other" function category were removed.

(4) This paper applied the mutual information value for feature selection and incorporated the metric set A to measure the impurity of the data partition D for the split criterion. The cost complexity pruning algorithm was applied as the pruning method to determine the buffer overflow vulnerability classification algorithm S (strategy).

4.2. BOVP Prediction

4.2.1. Model Overview. The CART (classification and regression tree) model was proposed by Breiman et al. [18] in 1984 and has become a widely used decision tree learning method. The CART method assumes that the decision tree is a binary tree. The input feature space is divided into a finite number of


[Figure 1: The framework for the BOVP method. The diagram shows the pipeline from the target software's source code through static analysis (extracting metric data in the form of software metrics) and data mining (classifying function data), followed by relevance index selection, decision tree classification prediction, and finally extraction of the Bad, Bad Sink, and Bad Source vulnerability functions.]

cells, and the predicted values are determined by these cells; for instance, predicted output values are mapped to a given condition.

The CART model in this article consists of the following three steps.

(1) CART generation: generating a decision tree based on the training data set.

(2) CART pruning: pruning the decision tree according to certain constraints, such as the maximum depth of the tree, the minimum number of samples in the leaf nodes, the purity of the nodes, etc.

(3) Optimization of the decision tree: generating different CARTs for multiple parameter combinations (maximum depth of the tree (max_depth), minimum sample number of a leaf node (min_samples_leaf), and node impurity (min_impurity_split)) and choosing the model with the best generalization ability.

We first chose to use a supervised learning algorithm. The next consideration was the data: for example, whether the eigenvalues are discrete or continuous variables, whether there are missing values in the eigenvalue vectors, and whether there are abnormal values in the data. Therefore, after analysis, the CART tree classification algorithm was selected. CART has the following advantages over other classical classification and regression algorithms: (1) less data preparation, with no need to standardize data; (2) the ability to process continuous and discrete data; (3) the ability to handle multiclassification problems; (4) a prediction process that can be easily explained by Boolean logic, compared to other models such as Naive Bayes and SVMs.
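The parameter optimization mentioned in step (3) amounts to a small grid search. The sketch below is illustrative only; `train_and_score` is a hypothetical callback standing in for fitting one CART on a parameter combination and returning its validation score:

```python
# Illustrative grid search over the three CART parameters named in the
# text; `train_and_score` is a hypothetical stand-in, not the paper's code.
from itertools import product

def grid_search(train_and_score, depths, leaf_sizes, impurities):
    best_params, best_score = None, float("-inf")
    for params in product(depths, leaf_sizes, impurities):
        score = train_and_score(*params)   # fit one CART, score on held-out data
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The combination with the best held-out score is kept as the final model, matching the "best generalization ability" criterion in step (3).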

4.2.2. Training. In the classification prediction part of the BOVP method, first, the feature selection operation was performed on the attribute set A. Second, a specific classification algorithm was selected according to the data characteristics. This research adopts the decision tree algorithm [19] to establish the classification model based on the characteristics of the integrated software and the characteristics of the classification algorithm.

First, the classification decision tree model is a tree structure that classifies instances based on features. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent the n attributes in A, including the number of lines of source code, the executable code paths, and the cyclomatic complexity. The leaf nodes represent the types in X = {x1, x2, ..., x7}. Second, when the decision tree performs classification, one of the features of an instance is tested from the root node, and the instance is assigned to a child node according to the test result. Each child node corresponds to a value of the feature, so classification recursively moves down until a leaf node is reached. Finally, the instance is assigned to the class of the leaf node. The decision tree can be converted to a set of if-then rules, or it can be considered a conditional probability distribution defined over the feature and class spaces. Compared with the naive Bayes classification method, the decision tree has the advantage that its construction process does not require any domain-specific knowledge or parameter settings. Therefore, the decision tree is more suitable for exploratory knowledge discovery in practical applications. The construction of the decision tree algorithm [19] is divided into three parts, feature selection, decision tree generation, and decision tree pruning, as follows.

(1) Feature selection: the features in this model are n (n <= 22) attributes, including the number of lines in the source code, the executable code paths, and the cyclomatic complexity. Feature selection is applied to select the features that have the ability to classify the training data, which can improve the efficiency of decision tree learning. The feature selection process uses the information gain, the information gain ratio, or the Gini index.

(2) Decision tree generation: the Gini index is used as the feature selection criterion to measure the impurity of a data partition D, such that the locally optimal feature is continuously selected and the training set is divided into feasible subsets.

(3) Decision tree pruning: the cost complexity pruning algorithm is used to prune the tree; that is, the tree complexity is regarded as a function of the number of leaf nodes and the error rate of the tree, working from the bottom of the tree. For each internal node N, the cost complexity of the subtree of N and the cost complexity of N after pruning its subtree are calculated; if pruning lowers the cost complexity, the subtree is clipped, and otherwise the subtree is retained.

Combining the attribute set A = {a1, a2, ..., a22}, we can set the training data partition to D and construct the decision tree model through the three steps of feature selection, decision tree generation, and pruning. The basic strategy of the algorithm is to call the algorithm with three parameters: D, attribute_list, and Attribute_selection_method. D is a data partition, which is a collection of training tuples and their corresponding category labels; attribute_list is a list describing the tuple attributes; Attribute_selection_method specifies the heuristic process used to select the "best" attribute for partitioning the tuples by class.

In Algorithm 1, Lines (1) to (6) specify the Gini minimization criterion used to obtain the feature attributes and the partition points of the split selection. When feature selection is started, a tree is constructed step by step. Lines (7) to (8) describe the pruning process of the tree according to three indicators (γ, α, β). Once the splitting condition is satisfied, data set D is split into two subtrees, D1 and D2, as expressed in Lines (9) to (10). Lines (11) to (12) state that if the sample number of a node is less than β, the node will be dropped. Lines (13) to (17) show that if a node does not satisfy all of the conditions, it will be converted into a leaf node whose output class is the most frequent of the classes on the node. Finally, the process of prediction is conducted by using the TreeModel in Line (19). This method is highly complex, especially when searching for the segmentation point, as it is necessary to traverse all the possible values of all the current features. If there are F features, each having N values, and the generated decision tree has S internal nodes, then the time complexity of the algorithm is O(F*N*S).
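The split search of Lines (1) to (6) can be sketched in pure Python. This is a minimal illustrative CART, not the authors' implementation: it stops on a maximum depth and a minimum node size, and the exhaustive threshold scan is the O(F*N) work per node noted above.

```python
# Minimal CART sketch: binary splits on numeric features, Gini impurity,
# depth/size stopping rules. Toy data only; an assumption, not the paper's code.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Scan every feature and threshold for the lowest weighted Gini."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build(X, y, depth=0, max_depth=3, min_leaf=1):
    if depth >= max_depth or len(y) <= min_leaf or gini(y) == 0.0:
        return max(set(y), key=y.count)          # leaf: majority class
    found = best_split(X, y)
    if found is None:
        return max(set(y), key=y.count)
    _, f, t = found
    li = [i for i, r in enumerate(X) if r[f] <= t]
    ri = [i for i, r in enumerate(X) if r[f] > t]
    return (f, t,
            build([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth, min_leaf),
            build([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth, min_leaf))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Two made-up metric features and toy labels:
tree = build([[1, 10], [2, 12], [8, 3], [9, 4]],
             ["Good", "Good", "Bad", "Bad"])
```

On this toy data the first split already separates the two classes, so `predict(tree, [1, 11])` returns "Good" and `predict(tree, [9, 3])` returns "Bad".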

The decision tree in the algorithm uses the Gini coefficient as the splitting criterion for feature selection. That is, the training data set D is divided into two parts, D1 and D2, according to whether feature A takes a certain value a. Then, under the condition of feature A, the Gini index of set D is defined as

Gini(D, A) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)   (1)

The Gini index Gini(D) represents the uncertainty of set D, and the Gini index Gini(D, A) represents the uncertainty of set D after the A = a partitioning. The larger the Gini index is, the greater the uncertainty of the sample. Compared to other classical classification and regression algorithms, decision trees have the advantage of generating easy-to-understand rules while requiring a relatively small number of computations. Decision trees are capable of processing continuous and categorical fields and have the ability to clearly display the more important fields. Therefore, by training the decision tree model on the existing data set, the model can be used to predict whether a program has any of the 7 buffer vulnerability categories by using metrics such as the number of lines of source code, the number of times a function is called, the executable code paths, and the cyclomatic complexity.
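Equation (1) can be checked directly with a few lines of Python; this is a sketch with toy labels, not the experimental data:

```python
# Gini impurity and the size-weighted split criterion of Eq. (1).

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(d1, d2):
    """Gini(D, A): weighted impurity of the partition (D1, D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```

A split that cleanly separates the classes scores 0, while a maximally mixed binary node scores 0.5, matching the statement that a larger Gini index means greater uncertainty.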

The specific method of generating the subtree sequence T0, T1, ..., Tn in pruning is to obtain Ti+1 by pruning, in Ti, the branch that has the smallest increase in error on the training data set, starting from T0. For example, when a tree T is pruned at node t, the increase in error is R(t) − R(Tt), where R(t) is the error of node t after the subtree of node t is pruned and R(Tt) is the error of the subtree Tt when the subtree of node t is not pruned. However, the number of leaves in T is


Input: D = {(X1, y1), (X2, y2), ..., (Xn, yn)}, where X = {X1, X2, X3, ..., Xn} represents the feature attribute set and Y = {y1, y2, y3, ..., yn} represents the predicted attribute set.
Output: CART model sets.
(1) for xi in X
(2)   for Tj in xi
(3)     search min(Gini)
(4)   end for
(5) end for
(6) Build Tree TreeModel from Tj and xi
(7) if H(D) <= γ && the current depth < α && |D| >= β   // |D|: sample number of D
(8)   Divide D into D1 and D2
(9)   DecisionTreeClassifier(D1, α, γ, β)
(10)  DecisionTreeClassifier(D2, α, γ, β)
(11) else if |D| < β
(12)  drop D
(13) else
(14)  D → leaf node   // convert to leaf node
(15)  predicted class = the most numerous class among the node's samples
(16)  break
(17) end else
(18) return TreeModel
(19) Prediction on TMS

Algorithm 1: BOVP based on Gini indexes.

reduced by L(Tt) − 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after a tree branch is pruned is determined by

α = (R(t) − R(Tt)) / (L(Tt) − 1)   (2)

where R(t) represents the error of node t after the subtree of node t is pruned, with R(t) = r(t) * p(t), where r(t) is the error rate of node t and p(t) is the ratio of the number of samples on node t to the number of samples in the training set; R(Tt) represents the error of the subtree Tt when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes on subtree Tt; and L(Tt) is the number of leaves of Tt. Ti+1 is the pruned tree with the smallest α value in Ti.
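Equation (2) transcribes directly into code; the numbers used in the usage below are made up for illustration:

```python
# Error-increase rate of Eq. (2) for collapsing the subtree rooted at t.

def pruning_alpha(r_t, p_t, subtree_leaf_errors):
    """r_t: error rate of node t; p_t: fraction of training samples at t;
    subtree_leaf_errors: per-leaf errors of the unpruned subtree Tt."""
    R_t = r_t * p_t                      # R(t) = r(t) * p(t)
    R_Tt = sum(subtree_leaf_errors)      # R(Tt): sum of the subtree's leaf errors
    L_Tt = len(subtree_leaf_errors)      # L(Tt): number of leaves
    return (R_t - R_Tt) / (L_Tt - 1)
```

At each pruning round, the internal node with the smallest α is collapsed first, which is how the subtree sequence T0, T1, ..., Tn is produced.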

4.2.3. Prediction. The input of the prediction section is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. The vulnerability data include stack-based buffer overflow, heap-based buffer overflow, buffer underwrite ("buffer underflow"), integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes to represent attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of the test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing a decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into different classes.

The prediction output is the class predicted by the decision tree algorithm, belonging to one of the X = {x1, x2, ..., x7} categories. Further analysis can be performed on the functions containing vulnerabilities based on the prediction results. The prediction part used the precision rate, recall rate, and F1 evaluation index to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier to examine the entire sample. The recall rate measures the proportion of all positive instances that are correctly retrieved, which reflects the proportion of positive data correctly identified by the decision tree algorithm relative to the total positive data. The F1 value is the harmonic mean of precision and recall and measures the ideal degree of the decision tree classification.
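The per-class evaluation indexes can be sketched as follows (the confusion counts in the test are invented for illustration; accuracy is computed over all classes and is omitted here):

```python
def precision_recall_f1(tp, fp, fn):
    """precision = TP/(TP+FP); recall = TP/(TP+FN); F1 is their harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```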

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, as well as the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program. (1) This article extracted code that contained only one type of defect based on the buffer overflow vulnerability code fragment and that did not contain code of other unrelated defect types. (2) Then, this paper statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 function-level metrics were extracted. (3) Due to the diversity of the types of buffer overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data, to reexamine and verify the data, and to delete duplicate information. Twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attribute set A. (5) According to the 4:1 ratio between the training set and test set, the Gini coefficient was selected as the splitting criterion for feature selection, and the CART decision tree was constructed with K-fold cross-validation.

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selects 163,737 buffer overflow vulnerability test cases extracted from actual programs. The data set was created by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows past the end of the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for experiments.

The C/C++ language experimental data set had 111,366 samples covering stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. Summaries of the data sets are shown in Tables 1 and 2.

5.2. Metrics Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information is used to eliminate the irrelevant metrics. Finally, the precision, recall, and accuracy indicators are used to measure the predicted results.

Software metric indexes provide a quantitative standard to evaluate code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost performance estimation. Following the development of software from structured and modular programming to object-oriented design, metrics can be divided into traditional software metrics and object-oriented software metrics. According to traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it was necessary to perform data cleaning on the initially extracted data: to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we need to filter data that does not meet certain requirements; for example, the extracted data contain a large number of zero values and identical attribute values for some numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective feature space but also remove noise features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates whether two variables X and Y have a relationship [20, 21], defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3):

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (3)

The mutual information between feature items and categories reflects the degree of correlation between them. When taking mutual information as the evaluation value for feature extraction, the features with the largest mutual information values will be selected.
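For discrete feature values and class labels, Equation (3) can be computed from empirical frequencies as below (a sketch using a base-2 logarithm; the toy vectors in the test are illustrative only):

```python
# Mutual information I(X; Y) of Eq. (3) from empirical frequencies.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n                     # empirical joint probability
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi
```

A feature that determines the label exactly yields 1 bit on balanced binary data, while an independent feature yields 0, matching the statement that zero mutual information means no correlation.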

5.3. Feature Selection. Due to the limited number of samples, the design of a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was adopted for the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there is no correlation, the mutual information value is zero; conversely, the larger the mutual information value, the greater the correlation. The mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets were combined to select n <= 22 features, and finally 16 metrics were obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}. Here, a1, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.

Although the global results before and after feature selection seem similar, we achieve the same effect with fewer features by applying feature selection. The obtained data features and the 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also


Table 2: Java language experimental data set specific data.

                  Good   Good Sink  Good Source  Good Auxiliary  Bad    Bad Sink  Bad Source  Total
Buffer overflow   8073   9522       828          21114           7866   4554      414         52371

Table 3: Metrics.

1. CountInput: number of calls; calls by the same method are counted only once, and calls by fields are not counted.
2. CountLine: the number of lines of code.
3. CountLineCode: the number of lines containing code.
4. CountLineCodeDecl: the number of declaration lines; the method name line is also recorded in this number.
5. CountLineCodeExe: the number of lines of purely executable code.
6. CountLineComment: the number of comment lines.
7. CountOutput: the number of calls to other methods; multiple calls to the same method are counted as one call, and a return statement counts as a call.
8. CountPath: code paths that can be executed; related to cyclomatic complexity.
9. CountPathLog: log10, truncated to an integer value, of the metric CountPath.
10. CountSemicolon: the number of semicolons.
11. CountStmt: the number of statements; multiple statements written on one line are counted multiple times.
12. CountStmtDecl: the number of declaration statements, including the method declaration line.
13. CountStmtExe: the number of executable statements.
14. Cyclomatic: cyclomatic complexity (standard calculation method).
15. CyclomaticModified: cyclomatic complexity (the second calculation method).
16. CyclomaticStrict: cyclomatic complexity (the third calculation method).
17. Essential: essential complexity (standard calculation method).
18. Knots: a measure of overlapping jumps.
19. MaxEssentialKnots: the maximum knots after structured programming constructs have been removed.
20. MaxNesting: the maximum nesting level; related to cyclomatic complexity.
21. MinEssentialKnots: the minimum knots after structured programming constructs have been removed.
22. RatioCommentToCode: the ratio of comment lines to code lines.

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but performs poorly on data outside of the training data set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially the performance on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the Hold-Out Method, K-fold Cross-Validation, and Leave-One-Out Cross-Validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts: k−1 for training and the remaining part for testing, repeating the process k times and thus obtaining k models to evaluate the performance of the model. In this experiment, we divided the data into two parts: 80% of data set D was used as training data and 20% was used as test data.

First, we group the data D into a training set to train the model and a test set to test the predictive performance of the model; this process is repeated until all data have been used. Next, the K value was chosen to be 10, so the data set is divided into 10 subsamples: 9 are used as training data and one is used as test data. Finally, the sensitivity of the model performance to data partitioning is reduced by averaging the results of the 10 groups [22].
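The fold construction can be sketched as an index split (deterministic and unshuffled here, whereas the paper's partitioning is random; K = 5 over 10 toy samples in the test):

```python
# Minimal K-fold index split: each sample appears in exactly one test fold.

def k_fold_indices(n, k):
    folds = []
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds
```

Training and evaluating one model per fold and averaging the k scores gives the cross-validated estimate described above.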

5.4.1. Results in C/C++ Programs. In this experiment, programs written in the C/C++ language were used to conduct experiments, and the decision tree algorithm and other machine learning algorithms were compared with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, and the accuracy in the experiments reached 82.55%. Second, the data feature selection was verified by comparing the 22 metrics without feature selection and the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results.

Metric                 C/C++ data set           Java data set
                       MI value     Rank        MI value     Rank
CountInput             0.1892       11          0.6375       3
CountLine              0.5584       1           0.7996       1
CountLineCode          0.4902       2           0.6446       2
CountLineCodeDecl      0.4327       5           0.4373       11
CountLineCodeExe       0.3893       9           0.5000       7
CountLineComment       0.3980       8           0.4990       8
CountOutput            0.1625       12          0.3477       13
CountPath              NaN          13          0.3959       12
CountPathLog           NaN          13          0.2638       17
CountSemicolon         0.4307       7           0.5484       5
CountStmt              0.4465       4           0.5879       4
CountStmtDecl          0.4313       6           0.4390       10
CountStmtExe           0.3434       10          0.5123       6
Cyclomatic             NaN          13          0.3122       14
CyclomaticModified     NaN          13          0.3122       15
CyclomaticStrict       NaN          13          0.2968       16
Essential              NaN          13          0.0099       20
Knots                  NaN          13          0.2508       18
MaxEssentialKnots      NaN          13          0.0099       21
MaxNesting             NaN          13          0.2236       19
MinEssentialKnots      NaN          13          0.0099       22
RatioCommentToCode     0.4603       3           0.4959       9

Table 5: C/C++ language data set machine learning algorithm results.

                     SVM      Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)         82.06    37.69   35.47     82.54          82.55
Recall rate (%)      67.80    30.23   34.25     68.65          68.68
Precision rate (%)   71.06    31.46   33.75     73.59          73.62
Running time (s)     576.32   19.98   19.02     17.94          17.03

It was proved by experiments that the running time of the buffer overflow vulnerability prediction algorithm when using the A = {a1, a2, ..., a16} metrics after feature selection is reduced from the previous 17.03 s to 15.94 s, while the accuracy has hardly changed; thus, feature selection not only helps to reduce the running time but also yields high accuracy and improves the efficiency of the vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source have a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function names for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from the programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerability data into eight categories. When there is a vulnerability in real software, it does not necessarily contain all types of buffer vulnerabilities; there may be only one or several of them, and when we collect data from real software, the data that may be vulnerable are much scarcer than the data without vulnerabilities, possibly even reaching a ratio of 1:10,000. In the Java data set, because the experiment uses a data set extracted from real software, the data volumes of Good Source and Bad Sink are very small, which results in extremely unbalanced data. It also leads to the precision and recall rates of both of these types being 0. The precision and recall rates of the other, basically balanced vulnerability categories show better results. To maintain the distribution of the original data, data sampling is not used to balance the data, which then demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics to predict buffer overflow vulnerabilities after feature selection reduces the running time and ensures high accuracy.


Table 6: Decision tree algorithm specific operation results (%).

                 Before feature selection        After feature selection
                 Precision  Recall  F1           Precision  Recall  F1
Good             88.09      83.56   85.76        88.03      83.47   85.69
Good Sink        61.27      52.91   56.78        61.27      52.91   56.78
Good Source      53.13      25.37   34.34        53.13      25.37   34.34
Good Auxiliary   77.16      96.34   85.69        77.16      96.34   85.69
Bad              77.08      49.23   60.09        77.08      49.23   60.09
Bad Sink         67.94      77.47   72.40        67.94      77.47   72.39
Bad Source       90.71      96.00   93.23        90.66      95.87   93.12
Average          73.63      68.70   69.76        73.61      68.67   71.05
Accuracy         82.55                           82.53
Time (s)         17.03                           15.94

Table 7: Accuracy and time for the Java data set.

                     SVM      Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)         87.16    57.95   49.40     87.41          87.44
Recall rate (%)      59.56    48.25   34.98     59.64          59.64
Precision rate (%)   58.74    44.28   33.11     58.91          58.93
Running time (s)     104.32   9.98    8.99      12.94          11.99

Table 8: Decision tree algorithm specific operation results (%).

                 Before feature selection        After feature selection
                 Precision  Recall  F1           Precision  Recall  F1
Good             95.85      93.04   94.42        96.11      93.04   94.55
Good Sink        47.15      53.10   49.95        47.15      53.10   49.95
Good Source      0          0       0            0          0       0
Good Auxiliary   98.56      98.13   98.35        98.56      98.13   98.34
Bad              77.27      73.99   75.60        77.27      73.99   75.60
Bad Sink         0          0       0            0          0       0
Bad Source       95.74      99.51   96.54        93.75      99.60   96.59
Average          59.22      59.69   59.27        58.98      59.69   59.33
Accuracy         87.44                           87.51
Time (s)         11.99                           7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested using real code from reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the functions. The model can ensure the accuracy of vulnerability prediction while taking into account the change in data flow between functions in running programs; this is a new vulnerability prediction method based on data flow analysis in a running program. (2) The data set contained two different languages, namely, C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. Buffer overflow is the most common and most difficult software security vulnerability, and the data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction. This paper provides a new way for software metrics to be used in data collection for software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and of different types of buffer overflow vulnerabilities, the decision tree algorithm was developed through feature selection and the Gini index, and finally a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model was verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken on the C/C++ data set was lower, with an accuracy rate of 82.53%; the accuracy rate on the Java data set was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities across different languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although we divided them into 8 categories, these do not cover all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be made more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open-source software, with no detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data, make up for the shortcomings of this model, and study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, in order to find increasingly complex types of buffer overflow vulnerabilities and also to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). Footnotes on the thirteenth page of the article give the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense (EC2ND '10), pp. 37–45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99-D, no. 8, pp. 2002–2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET '17), pp. 703–706, 2017.




Buffer overflow attacks consist of three parts: code embedding, overflow attack, and system hijacking. There are four types of buffer overflow attacks: destroying activity records, destroying heap data, changing function pointers, and overflowing fixed buffers. Buffer overflow is a very common and very dangerous vulnerability that is widespread in various operating systems and applications. Exploiting buffer overflow vulnerabilities can enable hackers to gain control of a machine or even the highest privileges, and buffer overflow attacks can lead to program failure, system shutdown, restarting, and so on. The buffer overflows we studied in this article include stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows.

3.2. Source Code Function Classification. Most software metrics that were previously used to predict software vulnerabilities use metrics to predict whether a vulnerability is included; during the execution of the software, there is a lack of analysis of the calling relationships between functions. Among the characteristics of software are functionality, reliability, usability, efficiency, maintainability, and portability; thus, the software implementation process is dynamic. In the vulnerability prediction of software, the calling relationships of the intrinsic functions during software execution should be fully considered, especially the data flow changes during software calls. Therefore, this article applies data flow analysis to distinguish features that contain vulnerabilities.

In general, software source code data flow analysis is based on understanding the logic of the code, tracking the flow of data from the beginning of the analysed code to the end. There are three difficulties in the analysis process. (1) Changes in data accumulate as the path length increases. (2) The value of a branch condition must be calculated from the latest data analysis results, since branch conditions accumulate as the path grows and each additional branch node on the path must lie on that path. (3) The number of paths usually grows as a geometric progression as the code size increases. A more efficient approach is to decompose the global data stream analysis into partial data stream analyses.

This paper analyses the functions called in the test case data and uses the partial data flow analysis method to divide the functions into different "source" and "accept" (sink) functions. This research performs a function-level analysis of the program source code. According to the behaviour of the data flow of each function in the actual execution of the program, this paper uses a data mining method to distinguish the function categories in the data set. First, there are two overall types (good and bad), depending on whether or not vulnerabilities are present. Second, this paper analyses the functions using data flow analysis. Functions that do not contain vulnerabilities are classified into four categories. Good: the only code in a good function is a call to each of the secondary good functions; this function does not contain any flawed constructs. Good Sink and Good Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from the primary good function; each source or sink function is specific to exactly one secondary good function. Good Auxiliary: where a Good Source passes safe data to a potentially Bad Sink, or where a Bad Source passes unsafe or potentially unsafe data to a Good Sink. The categories containing vulnerabilities are divided into three groups. Bad: a function that contains a flawed construct. Bad Sink and Bad Source: cases that contain data flows using "source" and "sink" functions that are called from each other or from a primary bad function; each source or sink function is specific to either bad function for the test case.

According to whether or not a vulnerability is included, this paper defines the set of categories as X = {x1, x2, ..., x8}, where x1, x2, ..., x8 are the numbers of functions that do not contain vulnerabilities (Good), sink functions that do not contain vulnerabilities (Good Sink), source functions that do not contain vulnerabilities (Good Source), auxiliary functions that do not contain vulnerabilities (Good Auxiliary), main functions containing a flaw (Bad), sink functions containing a vulnerability (Bad Sink), source functions containing a vulnerability (Bad Source), and functions that are not yet explicitly defined (Other). In this study, only the first seven categories were studied.

4. Method

4.1. BOVP Method. This paper first extracts the source code of the executable program and then establishes a dynamic data flow analysis model at the software function level. Finally, a vulnerability prediction method based on software metrics is implemented. The process of establishing the BOVP (M, A, S) model in this method is shown in Figure 1.

(1) For the source code of the target software, this paper applied function-level static analysis, and the software attribute measurement method was used to extract all software metric data (M).

(2) According to the characteristics of software buffer overflow vulnerabilities, this paper established a function-level metric set A of buffer overflow vulnerability measures, A = {a1, a2, ..., an}.

(3) For the characteristics of the data flow where the functions call each other, the data mining method was used to extract the data regarding the function class, and items in the "Other" function category were removed.

(4) This paper applied the mutual information value for feature selection and incorporated the metric set A to measure the impurity of the data partition D for the split criterion. The cost complexity pruning algorithm was applied as the pruning method to determine the buffer overflow vulnerability classification algorithm S (strategy).

4.2. BOVP Prediction

4.2.1. Model Overview. The CART (classification and regression tree) model was proposed by Breiman et al. [18] in 1984 and has become a widely used decision tree learning method. The CART method assumes that the decision tree is a binary tree. The input feature space is divided into a finite number of cells, and the predicted values are determined by these cells, for instance, for predicting the output values that map to a given condition.

[Figure 1: The framework for the BOVP method. The flowchart runs from the target software's source code, through extraction of code modules and functions, static analysis to extract metric data, data mining to extract and classify function data, and relevance-index (metric) selection, to decision tree classification, prediction, and extraction of the vulnerability functions Bad, Bad Sink, and Bad Source.]

The CART model in this article consists of the following three steps.

(1) CART generation: generating a decision tree based on the training data set.

(2) CART pruning: pruning the decision tree according to certain constraints, such as the maximum depth of the tree, the minimum number of samples in the leaf nodes, the purity of the nodes, etc.

(3) Optimization of the decision tree: generating different CARTs for multiple parameter combinations (maximum depth of the tree (max_depth), minimum sample number of a leaf node (min_samples_leaf), node impurity (min_impurity_split)) and choosing the model with the best generalization ability.

We first chose to use a supervised learning algorithm. The next consideration was the data, for example, whether the eigenvalues are discrete or continuous variables, whether there are missing values in the eigenvalue vectors, and whether there are abnormal values in the data. Therefore, after analysis, the CART tree classification algorithm was selected. CART has the following advantages over other classical classification and regression algorithms: (1) less data preparation, with no need to standardize data; (2) the ability to process continuous and discrete data; (3) the ability to handle multiclassification problems; and (4) a prediction process that can be easily explained by Boolean logic, compared to other models such as Naive Bayes and SVMs.
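The three-step procedure above, generating one candidate tree per parameter combination and keeping the one that generalizes best, can be sketched in a few lines of Python. The grid values and the scoring function below are illustrative stand-ins, not the paper's actual settings; in the real model, `score_fn` would train a CART with those parameters and return its validation accuracy.

```python
from itertools import product

# Hypothetical grid mirroring the three constraints named in the text.
PARAM_GRID = {
    "max_depth": [4, 8, 16],
    "min_samples_leaf": [1, 5, 10],
    "min_impurity_split": [0.0, 0.01],
}

def select_best(grid, score_fn):
    """Try every parameter combination and keep the highest-scoring one.

    score_fn stands in for training one CART with the given parameters
    and measuring its generalization ability (e.g., validation accuracy).
    """
    keys = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The exhaustive grid is affordable here because only three parameters with a handful of values each are searched.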

4.2.2. Training. In the classification prediction part of the BOVP method, first, the feature selection operation was performed on the attribute set A. Second, a specific classification algorithm was selected according to the data characteristics. This research adopts the decision tree algorithm [19] to establish the classification model, based on the characteristics of the integrated software and of the classification algorithm.

First, the classification decision tree model is a tree structure that classifies instances based on features. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent the n attributes in A, including the number of lines of source code, the executable code paths, and the cyclomatic complexity. The leaf nodes represent the types in X = {x1, x2, ..., x7}. Second, when the decision tree classifies an instance, one of the features of the instance is tested at the root node, and the instance is assigned to a child node according to the test result. Each child node corresponds to a value of the feature, so the process moves down recursively until a leaf node is reached. Finally, the instance is assigned to the class of the leaf node. The decision tree can be converted into a set of if-then rules, or it can be considered a conditional probability distribution defined over the feature and class spaces. Compared with the naive Bayes classification method, the decision tree has the advantage that its construction does not require any domain-specific knowledge or parameter settings; therefore, the decision tree is more suitable for exploratory knowledge discovery in practical applications. The construction of the decision tree algorithm [19] is divided into three parts, feature selection, decision tree generation, and decision tree pruning, as follows.

(1) Feature selection: the features in this model are the n (n <= 22) attributes, including the number of lines of source code, the executable code paths, and the cyclomatic complexity. Feature selection is applied to choose the features with the ability to classify the training data, in a way that can improve the efficiency of decision tree learning. The feature selection process uses the information gain, the information gain ratio, or the Gini index.

(2) Generation of the decision tree: the Gini index is used as the criterion of feature selection to measure the impurity of data partition D, such that the locally optimal feature is continuously selected and the training set is divided into feasible subsets.

(3) Decision tree pruning: the cost complexity pruning algorithm is used to prune the tree; that is, the tree complexity is regarded as a function of the number of leaf nodes and the error rate of the tree, working upward from the bottom of the tree. For each internal node N, the cost complexity of the subtree rooted at N and the cost complexity of the tree after that subtree is pruned are calculated; if pruning lowers the cost complexity, the subtree is clipped, and otherwise the subtree is retained.

Combining the attribute set A = {a1, a2, ..., a22}, we can set the training data partition to D and construct the decision tree model through the three steps of feature selection, decision tree generation, and pruning. The basic strategy of the algorithm is to call it with three parameters: D, attribute_list, and Attribute_selection_method. D is a data partition, which is a collection of training tuples and their corresponding class labels; attribute_list is a list describing the tuple attributes; and Attribute_selection_method specifies the heuristic process used to select the "best" attribute for discriminating tuples by class.

In Algorithm 1, Lines (1) to (6) specify the Gini minimization criterion used to obtain the feature attributes and the partition points of the split selection. When the feature selection is started, a tree is constructed step by step. Lines (7) to (8) describe the pruning process of the tree according to three indicators (γ, α, β). Once the stopping condition is satisfied, data set D is split into two subtrees, D1 and D2, as expressed in Lines (9) to (10). Lines (11) to (12) state that if the sample number of a node is less than β, the node will be dropped. Lines (13) to (17) show that if a node does not satisfy all of the conditions, it will be converted into a leaf node whose output class is the most frequent of the classes on that node. Finally, the prediction is conducted using the TreeModel in Line (19). This method is highly complex, especially when searching for the segmentation point, as it is necessary to traverse all the possible values of all the current features. If there are F features, each having N values, and the generated decision tree has S internal nodes, then the time complexity of the algorithm is O(F*N*S).

The decision tree in the algorithm uses the Gini coefficient to choose the splitting point of the selected feature. That is, the training data set D is divided into two parts, D1 and D2, according to whether feature A takes a certain value a. Then, under the condition of feature A, the Gini index of set D is defined as

Gini(D, A) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)    (1)

The Gini index Gini(D) represents the uncertainty of set D, and the Gini index Gini(D, A) represents the uncertainty of set D after the A = a partitioning. The larger the Gini index is, the greater the uncertainty of the sample is. Compared with other classical classification and regression algorithms, decision trees have the advantages of generating easy-to-understand rules and requiring a relatively small number of computations. Decision trees are capable of processing continuous and categorical fields and can clearly display the more important fields. Therefore, using the existing data set to train the decision tree model, the model can be trained to predict whether a program has any of the 7 buffer vulnerability classes by using metrics such as the number of lines of source code, the number of times called, the executable code paths, and the loop complexity.
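Equation (1) can be checked with a few lines of standard-library Python; `gini` and `gini_split` are illustrative helper names, not part of the paper's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: Gini(D) = 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(d1, d2):
    """Weighted Gini of a binary split, as in Equation (1):
    Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```

A split that separates the classes perfectly yields a weighted Gini of 0, so the generation step prefers the feature and cut point that minimize this value.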

The specific method of generating the subtree sequence T0, T1, ..., Tn in pruning is to prune the branch in Ti that has the smallest increase in error on the training data set, starting from T0, to obtain Ti+1. For example, when a tree T is pruned at node t, the increase in error is R(t) - R(Tt), where R(t) is the error of node t after the subtree of node t is pruned and R(Tt) is the error of the subtree Tt when the subtree of node t is not pruned. However, the number of leaves in T is


input: D = {(X1, y1), (X2, y2), ..., (Xn, yn)}; X = {X1, X2, X3, ..., Xn} represents the feature attribute set; Y = {y1, y2, y3, ..., yn} represents the predicted attribute set
output: CART model sets
(1) for xi in X
(2)   for Tj in xi
(3)     search min(Gini)
(4)   end for
(5) end for
(6) Build tree TreeModel from Tj and xi
(7) if H(D) <= γ && current depth < α && |D| >= β    // |D|: sample number of D
(8)   Divide D into D1 and D2
(9)   DecisionTreeClassifier(D1, α, γ, β)
(10)  DecisionTreeClassifier(D2, α, γ, β)
(11) else if |D| < β
(12)   drop D
(13) else
(14)   D → leaf node    // convert to leaf node
(15)   predicted class = the most frequent class among the node's samples
(16)   break
(17) end else
(18) return TreeModel
(19) Prediction on the TreeModel

Algorithm 1: BOVP based on Gini indexes.

reduced by L(Tt) - 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after a tree branch is pruned is determined by

α = (R(t) - R(Tt)) / (L(Tt) - 1)    (2)

where R(t) represents the error of node t after the subtree of node t is pruned; R(t) = r(t) * p(t), where r(t) is the error rate of node t and p(t) is the ratio of the number of samples on node t to the number of samples in the training set. R(Tt) represents the error of the subtree Tt when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes of subtree Tt. Ti+1 is the pruned tree with the smallest α value in Ti.
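Equation (2), and the choice of the weakest link to prune next, translate directly into code. A minimal sketch follows; the node tuples and helper names are hypothetical, used only to illustrate the selection rule.

```python
def error_increase_rate(r_node, r_subtree, n_leaves):
    """Equation (2): alpha = (R(t) - R(T_t)) / (L(T_t) - 1), where R(t) is
    the error of node t with its subtree pruned, R(T_t) is the error of
    the intact subtree, and L(T_t) is the subtree's number of leaves."""
    return (r_node - r_subtree) / (n_leaves - 1)

def weakest_link(internal_nodes):
    """Pick the internal node with the smallest alpha, i.e., the branch
    whose removal raises the error least per leaf eliminated.

    `internal_nodes` is a list of (name, R(t), R(T_t), L(T_t)) tuples.
    """
    return min(internal_nodes,
               key=lambda n: error_increase_rate(n[1], n[2], n[3]))
```

Pruning the weakest link repeatedly yields the subtree sequence T0, T1, ..., Tn described above.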

4.2.3. Prediction. The input of the prediction section is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. The vulnerability data include stack-based buffer overflow, heap-based buffer overflow, buffer underwriting ("buffer overflow"), integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes to represent attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of a test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing a decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into different classes.

The prediction output is the class predicted by the decision tree algorithm among the X = {x1, x2, ..., x7} categories. Further analysis can then be performed on the functions predicted to contain vulnerabilities. The prediction part used the accuracy rate, recall rate, and F1 evaluation indexes to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier over the entire sample. The recall rate measures the proportion of relevant instances that are correctly retrieved, which reflects the proportion of positive data correctly identified by the decision tree algorithm out of all positive data. The F1 value is the harmonic mean of precision and recall and measures how close the classification comes to the ideal.
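The per-class precision, recall, and F1 reported later (Tables 6 and 8) can be computed one-vs-rest. A standard-library sketch (the function names are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """One-vs-rest precision, recall, and F1 for a single function class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that when a class has almost no samples and the classifier never predicts it, tp is 0 and both precision and recall collapse to 0, which is exactly the behaviour reported for Good Source and Bad Sink on the Java data set.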

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, as well as the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program. (1) This article extracted code that contained only one type of defect, based on the buffer overflow vulnerability code fragments, and that did not contain code with other, unrelated defect types. (2) Then, this paper statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 function-level metrics were extracted. (3) Due to the diversity of the types of buffer


overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data and delete duplicate information; twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attribute set A. (5) According to the 4:1 ratio between the training set and the test set, the Gini coefficient was selected as the splitting criterion, and the CART decision tree was constructed with a K-times crossover test.
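The 4:1 train/test split and the K-times crossover of step (5) can be sketched as follows; the seed and the choice of K are illustrative, not values stated by the paper.

```python
import random

def split_4_to_1(samples, seed=0):
    """Shuffle and split a data set into 80% training and 20% test (4:1)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]

def k_fold_indices(n_samples, k):
    """Partition sample indices into k disjoint folds for a K-times
    crossover test; each fold serves once as the held-out test set."""
    return [list(range(i, n_samples, k)) for i in range(k)]
```

Training K times, each time holding out a different fold, gives a more stable estimate of the tree's generalization than a single split.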

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selects 163,737 buffer overflow vulnerability samples extracted from actual programs. The information contained in the data set was produced by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows beyond the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for the experiments.

The C/C++ language experimental data set had 111,366 samples of stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. The summaries of the data sets are shown in Tables 1 and 2.

5.2. Metric Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information is used to eliminate irrelevant metrics. Finally, the precision, recall, and accuracy indicators are used to measure the predicted results.

Software metric indexes provide a quantitative standard to evaluate code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost-performance estimation. Following the development of software from structured and modular to object-oriented design, metrics can be divided into traditional software metrics and object-oriented software metrics. According to traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we filtered out data that did not meet the requirements; for example, the extracted data contained a large number of zero values and identical attribute values for the same numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective vocabulary space but also remove noise features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates the degree to which two variables X and Y are related [20, 21], defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3):

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (3)

The mutual information of feature items and categories reflects the degree of correlation between feature items and categories. When taking mutual information as the evaluation value for feature extraction, the features with the largest mutual information values are selected.
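Equation (3) can be estimated directly from empirical counts. The following minimal Python sketch (function and variable names are illustrative, not from the paper) computes I(X; Y) in nats for two discrete feature vectors:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) per Equation (3): sum over (x, y) of p(x,y) * log(p(x,y) / (p(x)p(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Independent variables carry no information; identical binary ones carry log 2 nats.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # → 0.0
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # → log 2 ≈ 0.693
```

A zero value thus indicates an irrelevant feature, matching the selection criterion described in Section 5.3.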

5.3. Feature Selection. Due to the limited number of samples, designing a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was adopted for the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there is no correlation, the mutual information value is zero; conversely, the larger the mutual information value, the greater the correlation. During feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets are synthesized to select n <= 22 metrics; finally, 16 metrics are obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}, where a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.
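The paper does not state the exact rule used to synthesize the two sets of mutual information values. As one hypothetical reading, taking the larger of the two values from Table 4 and applying a cut-off of 0.28 (both the combining rule and the threshold are assumptions for illustration) reproduces the 16 retained metrics:

```python
# Mutual information values from Table 4 as (C/C++, Java) pairs; None marks NaN.
mi = {
    "CountInput": (0.1892, 0.6375), "CountLine": (0.5584, 0.7996),
    "CountLineCode": (0.4902, 0.6446), "CountLineCodeDecl": (0.4327, 0.4373),
    "CountLineCodeExe": (0.3893, 0.5000), "CountLineComment": (0.3980, 0.4990),
    "CountOutput": (0.1625, 0.3477), "CountPath": (None, 0.3959),
    "CountPathLog": (None, 0.2638), "CountSemicolon": (0.4307, 0.5484),
    "CountStmt": (0.4465, 0.5879), "CountStmtDecl": (0.4313, 0.4390),
    "CountStmtExe": (0.3434, 0.5123), "Cyclomatic": (None, 0.3122),
    "CyclomaticModified": (None, 0.3122), "CyclomaticStrict": (None, 0.2968),
    "Essential": (None, 0.0099), "Knots": (None, 0.2508),
    "MaxEssentialKnots": (None, 0.0099), "MaxNesting": (None, 0.2236),
    "MinEssentialKnots": (None, 0.0099), "RatioCommentToCode": (0.4603, 0.4959),
}

THRESHOLD = 0.28  # hypothetical cut-off, chosen to reproduce the paper's outcome

def synthesize(values):
    """Combine the two data sets by taking the larger available MI value."""
    return max(v for v in values if v is not None)

selected = sorted(m for m, vals in mi.items() if synthesize(vals) >= THRESHOLD)
print(len(selected))  # → 16
```

Under this reading, the six dropped metrics (CountPathLog, Essential, Knots, MaxEssentialKnots, MaxNesting, MinEssentialKnots) are exactly those with low mutual information in both data sets.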

Although the global results before and after feature selection seem similar, we achieve the same effect with fewer features by applying feature selection. The obtained data features and the 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also

Security and Communication Networks 9

Table 2: Java language experimental data set specific data (buffer overflow)

GOOD    Good: 8073    Good Sink: 9522    Good Source: 828    Good Auxiliary: 21114
BAD     Bad: 7866     Bad Sink: 4554     Bad Source: 414
Total   52371

Table 3: Metrics

1. CountInput: Number of calls; calls by the same method are counted only once, and calls by fields are not counted.
2. CountLine: The number of lines of code.
3. CountLineCode: The number of lines containing code.
4. CountLineCodeDecl: The number of lines of declaration code; the method name line is also recorded in this number.
5. CountLineCodeExe: The number of lines of purely executable code.
6. CountLineComment: The number of comment lines.
7. CountOutput: The number of calls to other methods; multiple calls to the same method are counted as one call. A return statement counts as a call.
8. CountPath: The number of code paths that can be executed; related to cyclomatic complexity.
9. CountPathLog: log10, truncated to an integer value, of the metric CountPath.
10. CountSemicolon: The number of semicolons.
11. CountStmt: The number of statements; multiple statements written on one line are each counted.
12. CountStmtDecl: The number of declaration statements, including the method declaration line.
13. CountStmtExe: The number of executable statements.
14. Cyclomatic: Cyclomatic complexity (standard calculation method).
15. CyclomaticModified: Cyclomatic complexity (the second calculation method).
16. CyclomaticStrict: Cyclomatic complexity (the third calculation method).
17. Essential: Essential complexity (standard calculation method).
18. Knots: A measure of overlapping jumps.
19. MaxEssentialKnots: The maximum knots after structured programming constructs have been removed.
20. MaxNesting: The maximum nesting level; related to cyclomatic complexity.
21. MinEssentialKnots: The minimum knots after structured programming constructs have been removed.
22. RatioCommentToCode: The ratio of comment lines to code lines.

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but poorly on data outside of the training data set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially its performance on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the hold-out method, k-fold cross-validation, and leave-one-out cross-validation. In this experiment, the k-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts, using k − 1 for training and the remaining part for testing, repeating the process k times and thus obtaining k models to evaluate the performance of the model. In this experiment, we divided data set D into two parts: 80% was used as training data and 20% was used as test data.

First, we group the data D into a training set to train the model and a test set to test its predictive performance. This process is repeated until all data are used. Next, the value of k is chosen to be 10, so that the data set is divided into 10 subsamples, 9 of which are used as training data while one is used as test data. Finally, the sensitivity of the model performance to data partitioning is reduced by averaging the results of the 10 groups [22].
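The 10-fold procedure above can be sketched in plain Python. The fold splitter, the trivial majority-class scorer standing in for the decision tree, and all names below are illustrative assumptions, not the paper's pipeline:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, labels, train_and_score, k=10):
    """Average the test score over k rounds; each fold serves as the test set once."""
    folds = kfold_indices(len(samples), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds if f is not folds[i] for j in f]
        scores.append(train_and_score(
            [samples[j] for j in train_idx], [labels[j] for j in train_idx],
            [samples[j] for j in test_idx], [labels[j] for j in test_idx]))
    return sum(scores) / k

# Toy scorer: a majority-class baseline, standing in for the trained classifier.
def majority_baseline(Xtr, ytr, Xte, yte):
    majority = max(set(ytr), key=ytr.count)
    return sum(y == majority for y in yte) / len(yte)

acc = cross_validate(list(range(100)), [0] * 80 + [1] * 20, majority_baseline)
print(round(acc, 2))  # → 0.8
```

Averaging across the 10 folds smooths out any single lucky or unlucky data partition, which is exactly the sensitivity reduction the text describes.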

5.4.1. Results in C/C++ Programs. In this experiment, the programs written in the C/C++ language were used to conduct experiments, and the decision tree algorithm and other machine learning algorithms were compared with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, with an accuracy of 82.55% in the experiments. Second, the data feature selection was verified by comparing the 22 metrics without feature selection and the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results

Metrics               C/C++ data set          Java data set
                      MI value     Rank       MI value     Rank
CountInput            0.1892       11         0.6375       3
CountLine             0.5584       1          0.7996       1
CountLineCode         0.4902       2          0.6446       2
CountLineCodeDecl     0.4327       5          0.4373       11
CountLineCodeExe      0.3893       9          0.5000       7
CountLineComment      0.3980       8          0.4990       8
CountOutput           0.1625       12         0.3477       13
CountPath             NaN          13         0.3959       12
CountPathLog          NaN          13         0.2638       17
CountSemicolon        0.4307       7          0.5484       5
CountStmt             0.4465       4          0.5879       4
CountStmtDecl         0.4313       6          0.4390       10
CountStmtExe          0.3434       10         0.5123       6
Cyclomatic            NaN          13         0.3122       14
CyclomaticModified    NaN          13         0.3122       15
CyclomaticStrict      NaN          13         0.2968       16
Essential             NaN          13         0.0099       20
Knots                 NaN          13         0.2508       18
MaxEssentialKnots     NaN          13         0.0099       21
MaxNesting            NaN          13         0.2236       19
MinEssentialKnots     NaN          13         0.0099       22
RatioCommentToCode    0.4603       3          0.4959       9

Table 5: C/C++ language data set machine learning algorithm results

                      SVM       Bayes     Adaboost   Random forest   Decision tree
Accuracy (%)          82.06     37.69     35.47      82.54           82.55
Recall rate (%)       67.80     30.23     34.25      68.65           68.68
Precision rate (%)    71.06     31.46     33.75      73.59           73.62
Running time (s)      576.32    19.98     19.02      17.94           17.03

It was proved by experiments that the running time of the buffer overflow vulnerability prediction algorithm, when using the A = {a1, a2, ..., a16} metrics obtained after feature selection, is reduced from the previous 17.03 s to 15.94 s, while the accuracy hardly changed. Thus, feature selection not only helps to reduce the running time but also maintains accuracy and improves the efficiency of the vulnerability prediction. In the final experimental results, functions predicted as Bad, Bad Sink, or Bad Source have a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerability into eight categories. When there is a vulnerability in real software, it does not necessarily contain all types of buffer vulnerabilities; there may be only one or several of them. Moreover, when we collect data from real software, the data that may be vulnerable is much scarcer than the data without vulnerability, and the ratio may even reach 1:10,000. In the Java data set, because the experiment uses a data set extracted from real software, the data volume of Good Source and Bad Sink is very small, which results in extremely unbalanced data. It also leads to the precision and recall rate of both of these types being 0, while the precision and recall rate of the other, basically balanced, vulnerability types have better results. To maintain the distribution of the original data and the credibility of the results, data sampling is not used to balance the data, which then demonstrates the validity of the model in predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics to predict the buffer overflow vulnerability after feature selection reduces the running time and ensures high accuracy.


Table 6: Decision tree algorithm-specific operation results (C/C++)

                 Before feature selection          After feature selection
                 Precision  Recall   F1            Precision  Recall   F1
Good             88.09      83.56    85.76         88.03      83.47    85.69
Good Sink        61.27      52.91    56.78         61.27      52.91    56.78
Good Source      53.13      25.37    34.34         53.13      25.37    34.34
Good Auxiliary   77.16      96.34    85.69         77.16      96.34    85.69
Bad              77.08      49.23    60.09         77.08      49.23    60.09
Bad Sink         67.94      77.47    72.40         67.94      77.47    72.39
Bad Source       90.71      96.00    93.23         90.66      95.87    93.12
Average          73.63      68.70    69.76         73.61      68.67    71.05
Accuracy (%)     82.55                             82.53
Time (s)         17.03                             15.94

Table 7: Accuracy and time for the Java data set

                      SVM       Bayes    Adaboost   Random forest   Decision tree
Accuracy (%)          87.16     57.95    49.40      87.41           87.44
Recall rate (%)       59.56     48.25    34.98      59.64           59.64
Precision rate (%)    58.74     44.28    33.11      58.91           58.93
Running time (s)      104.32    9.98     8.99       12.94           11.99

Table 8: Decision tree algorithm-specific operation results (Java)

                 Before feature selection (%)      After feature selection (%)
                 Precision  Recall   F1            Precision  Recall   F1
Good             95.85      93.04    94.42         96.11      93.04    94.55
Good Sink        47.15      53.10    49.95         47.15      53.10    49.95
Good Source      0          0        0             0          0        0
Good Auxiliary   98.56      98.13    98.35         98.56      98.13    98.34
Bad              77.27      73.99    75.60         77.27      73.99    75.60
Bad Sink         0          0        0             0          0        0
Bad Source       95.74      99.51    96.54         93.75      99.60    96.59
Average          59.22      59.69    59.27         58.98      59.69    59.33
Accuracy (%)     87.44                             87.51
Time (s)         11.99                             7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested on real code from reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the function. The model can ensure the accuracy of vulnerability prediction while taking into account the change in data flow between functions in running programs. This is a new vulnerability prediction method based on data flow analysis in a running program. (2) In this paper, the data set contained two different languages, namely, C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction research. This paper provides a new way for software metrics to study


data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, the decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model is verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken for the data set collected using C/C++ was less, and the accuracy rate was 82.53%. The accuracy rate for the data set using Java was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities in different types of languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although the data are divided into 8 categories, they do not contain all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open-source software, without detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data, make up for the shortcomings of this model, and study a binary buffer overflow prediction model combined with dynamic symbolic execution technology to find increasingly complex types of buffer overflow vulnerabilities and also forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). The thirteenth page of the article has footnotes giving the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense, EC2ND '10, pp. 37–45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002–2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET '17, pp. 703–706, 2017.




[Figure 1: The framework for the BOVP method. The framework proceeds from the target software's source code through extraction of code modules and functions, static analysis and data mining to extract metric data in the form of software metrics, relevance-based index selection, and modeling, to decision tree classification of function data and prediction of vulnerability functions (extracting Bad, Bad Sink, and Bad Source).]

cells, and the predicted values are determined by these cells, for instance, predicting the output values that map to a given condition.

The CART model in this article consists of the following three steps:

(1) CART generation: generating a decision tree based on the training data set.

(2) CART pruning: pruning the decision tree according to certain constraints, such as the maximum depth of the tree, the minimum number of samples of the leaf nodes, the purity of the nodes, etc.

(3) Optimization of the decision tree: generating different CARTs for multiple parameter combinations (maximum depth of the tree (max_depth), minimum sample number of the leaf node (min_samples_leaf), node impurity (min_impurity_split)) and choosing the model with the best generalization ability.

We first chose to use a supervised learning algorithm. The next consideration was the data itself: for example, whether the eigenvalues are discrete or continuous variables, whether there are missing values in the eigenvalue vector, and whether there are abnormal values in the data. Therefore, after analysis, the CART tree classification algorithm was selected. CART has the following advantages over other classical classification and regression algorithms: (1) less data preparation, with no need to standardize data; (2) the ability to process continuous and discrete data; (3) the ability to handle multiclassification problems; (4) the prediction


process of the CART model can be easily explained by Boolean logic, compared with other models such as Naive Bayes and SVMs.

4.2.2. Training. In the classification prediction part of the BOVP method, first, the feature selection operation was performed on the attribute set A. Second, a specific classification algorithm was selected according to the data characteristics. This research adopts the decision tree algorithm [19] to establish the classification model, based on the characteristics of the integrated software and the characteristics of the classification algorithm.

First, the classification decision tree model is a tree structure that classifies instances based on features. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent the n attributes in A, including the number of lines of source code, the code paths that can be executed, and the cyclomatic complexity. The leaf nodes represent the types in X = {x1, x2, ..., x7}. Second, when the decision tree performs classification, one of the features of an instance is tested from the root node, and the instance is assigned to a child node according to the test result. Each child node corresponds to a value of the feature, so the instance recursively moves down until a leaf node is reached. Finally, the instance is assigned to the class of the leaf node. The decision tree can be converted into a set of if-then rules, or it can be considered a conditional probability distribution defined over the feature and class spaces. Compared with the naive Bayes classification method, the decision tree has the advantage that its construction process does not require any domain-specific knowledge or parameter settings. Therefore, the decision tree is more suitable for exploratory knowledge discovery in practical applications. The construction of the decision tree algorithm [19] is divided into three parts, feature selection, decision tree generation, and decision tree pruning, as follows.

(1) Feature selection: the features in this model are the an (n <= 22) attributes, which include the number of lines in the source code, the code paths that can be executed, and the cyclomatic complexity. Feature selection is applied to select the features that have the ability to classify the training data, in a way that can improve the efficiency of decision tree learning. The feature selection process uses the information gain, the information gain ratio, or the Gini index.

(2) Generation of the decision tree: the Gini index is used as the criterion of feature selection to measure the impurity of data partition D, such that the locally optimal feature is continuously selected and the training set is divided into feasible subsets.

(3) Decision tree pruning: the cost-complexity pruning algorithm is used to prune the tree; that is, the tree complexity is regarded as a function of the number of leaf nodes and the error rate of the tree, working up from the bottom of the tree. For each internal node N, the cost complexity of the subtree of N and the cost complexity of N after pruning its subtree are calculated; if pruning lowers the cost complexity, the subtree is clipped, otherwise the subtree is retained.

Combining the attribute set A = {a1, a2, ..., a22}, we can set the training data partition to D and construct the decision tree model through the three steps of feature selection, decision tree generation, and pruning. The basic strategy of the algorithm is to call the algorithm with three parameters: D, attribute_list, and Attribute_selection_method. D is a data partition, which is a collection of training tuples and corresponding category labels; attribute_list is a list describing the tuple attributes; Attribute_selection_method specifies the heuristic process used to select the "best" attribute to classify the tuples.

In Algorithm 1, Lines (1) to (6) specify the Gini minimization criterion used to obtain the feature attributes and the partition points of the split selection. When the feature selection is started, a tree is constructed step by step. Lines (7) to (8) describe the pruning process of the tree according to three indicators (γ, α, β). Once the stopping condition is satisfied, data set D is split into two subtrees, D1 and D2, as expressed in Lines (9) to (10). Lines (11) to (12) state that if the number of samples at a node is less than β, the node will be dropped. Lines (13) to (17) show that if the node does not satisfy all of the conditions, it will be converted into a leaf node whose output class is the most frequent class at the node. Finally, the process of prediction is conducted by using the TreeModel in Line (19). This method is highly complex, especially when searching for the segmentation point, as it is necessary to traverse all the possible values of all the current features. If there are F features, each having N values, and the generated decision tree has S internal nodes, then the time complexity of the algorithm is O(F∗N∗S).

The decision tree in the algorithm uses the Gini coefficient as the criterion for selecting the feature split point. That is, the training data set D is divided into two parts, D1 and D2, according to whether feature A takes a certain value a. Then, under the condition of feature A, the Gini index of set D is defined as

Gini(D, A) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)    (1)

The Gini index Gini(D) represents the uncertainty of set D, and the Gini index Gini(D, A) represents the uncertainty of set D after partitioning by A = a. The larger the Gini index is, the greater the uncertainty of the sample is. Compared to other classical classification and regression algorithms, decision trees have the advantage of generating easy-to-understand rules while requiring a relatively small number of computations. Decision trees are capable of processing continuous and categorical fields and can clearly display the more important fields. Therefore, by using the existing data set to train the decision tree model, the model can be trained to predict whether a program has any of the 7 buffer vulnerability classes by using metrics such as the number of lines containing source code, the number of times called, the code paths that can be executed, and the loop complexity.
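Equation (1) can be illustrated with a small sketch (the function names and toy labels are assumptions; the split separates samples by whether an attribute value equals a):

```python
from collections import Counter

def gini(D):
    """Gini(D): impurity of the class labels in partition D."""
    n = len(D)
    return 1.0 - sum((c / n) ** 2 for c in Counter(D).values())

def gini_index(D, A, a):
    """Gini(D, A) per Equation (1): weighted impurity after splitting on A = a."""
    D1 = [y for x, y in zip(A, D) if x == a]
    D2 = [y for x, y in zip(A, D) if x != a]
    return len(D1) / len(D) * gini(D1) + len(D2) / len(D) * gini(D2)

labels  = ["bad", "bad", "good", "good"]
feature = ["high", "high", "low", "low"]   # a perfectly separating attribute
print(gini(labels))                        # → 0.5
print(gini_index(labels, feature, "high")) # → 0.0
```

A split that drives Gini(D, A) to zero produces pure child partitions, which is why the algorithm searches for the minimum Gini value on Lines (1) to (5) of Algorithm 1.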

The specific method of generating the subtree sequence T0, T1, ..., Tn in pruning is to prune the branch in Ti that has the smallest increase in error on the training data set, starting from T0, to obtain Ti+1. For example, when a tree T is pruned at node t, the increase in error is R(t) − R(Tt), where R(t) is the error of node t after the subtree of node t is pruned, and R(Tt) is the error of the subtree Tt when the subtree of node t is not pruned. However, the number of leaves in T is


Input: D = {(X1, y1), (X2, y2), ..., (Xn, yn)}; X = {X1, X2, X3, ..., Xn} represents the feature attribute set; Y = {y1, y2, y3, ..., yn} represents the predicted attribute set.
Output: CART model sets.
(1) for xi in X
(2)   for Tj in xi
(3)     search min(Gini)
(4)   end for
(5) end for
(6) Build tree TreeModel from Tj and xi
(7) if H(D) <= γ && the current depth < α && |D| >= β    // |D|: sample number of D
(8)   Divide D into D1 and D2
(9)   DecisionTreeClassifier(D1, α, γ, β)
(10)  DecisionTreeClassifier(D2, α, γ, β)
(11) else if |D| < β
(12)   drop D
(13) else
(14)   D → leaf node    // convert to leaf node
(15)   predictive class = the class with the most samples at the node
(16)   break
(17) end else
(18) return TreeModel
(19) Prediction on TMS

Algorithm 1: BOVP based on Gini indexes.

reduced by L(Tt) − 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after the tree branches are pruned is determined by

α = (R(t) − R(Tt)) / (L(Tt) − 1)    (2)

where R(t) represents the error of node t after the subtree of node t is pruned, R(t) = r(t)∗p(t), r(t) is the error rate of node t, and p(t) is the ratio of the number of samples at node t to the number of samples in the training set. R(Tt) represents the error of the subtree Tt when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes of subtree Tt. Ti+1 is the pruned tree with the smallest α value in Ti.
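Equation (2) can be evaluated directly from the quantities defined above; the numbers in this sketch are hypothetical, chosen only to exercise the formula:

```python
def error_increase_rate(r_t, p_t, leaf_errors, n_leaves):
    """alpha per Equation (2): (R(t) - R(T_t)) / (L(T_t) - 1).

    R(t) = r(t) * p(t) is the error of node t if its subtree is pruned;
    R(T_t) is the sum of the errors of the subtree's leaves;
    L(T_t) is the number of leaves in the subtree."""
    R_t = r_t * p_t
    R_Tt = sum(leaf_errors)
    return (R_t - R_Tt) / (n_leaves - 1)

# Hypothetical node: error rate 0.4 over 10% of the training set, versus a
# 3-leaf subtree whose leaves contribute errors 0.01, 0.005, and 0.005.
alpha = error_increase_rate(r_t=0.4, p_t=0.10,
                            leaf_errors=[0.01, 0.005, 0.005], n_leaves=3)
print(round(alpha, 3))  # → 0.01
```

At each pruning round, the internal node with the smallest α is the cheapest branch to collapse, which yields the subtree sequence T0, T1, ..., Tn described above.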

4.2.3. Prediction. The input of the prediction section is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. Vulnerability data include stack buffer overflow, heap buffer overflow, buffer underwrite ("buffer underflow"), integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes to represent attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of the test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing a decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into different classes.

The prediction output is the class predicted by the decision tree algorithm, belonging to the X = {x1, x2, ..., x7} categories. Further analysis can be performed on the functions containing vulnerabilities based on the prediction results. The prediction part used the accuracy rate, recall rate, and F1 evaluation index to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier to examine the entire sample. The recall rate measures the proportion of the items that should be retrieved that are retrieved correctly, which reflects the proportion of positive data correctly identified by the decision tree algorithm relative to the total positive data. The F1 value, the harmonic mean of precision and recall, measures how well the decision tree algorithm classifies overall.
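These evaluation indexes have their standard definitions; a small Python sketch (ours, with made-up labels for illustration) computes them per class:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 (harmonic mean of the first two)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```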

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, together with the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program: (1) This article extracted code that contained only one type of defect, based on the buffer overflow vulnerability code fragments, and that did not contain code of other unrelated defect types. (2) Then, this paper statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 functional-level metrics were extracted. (3) Due to the diversity of the types of buffer overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data and delete duplicate information; twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attribute set A. (5) According to the 4:1 ratio between the training set and the test set, the Gini coefficient was selected as the splitting criterion for feature selection, and the CART decision tree was constructed with K-fold cross-validation.

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selects 163,737 samples of buffer overflow vulnerability types extracted from actual programs. The data set was produced by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows beyond the bounds of the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for the experiments.

The C/C++ language experimental data set had 111,366 samples covering stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. The summaries of the data sets are shown in Tables 1 and 2.

5.2. Metrics Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information is used to eliminate irrelevant metrics. Finally, the precision, recall, and accuracy indicators are used to measure the predicted results.

Software metric indexes provide a quantitative standard to evaluate code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost performance estimation. Following the development of software from structured and modular designs to object-oriented designs, metrics can be divided into traditional software metrics and object-oriented software metrics. According to traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we needed to filter out data that did not meet certain requirements; for example, the extracted data contained a large number of zero values and identical attribute values for some numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective vocabulary space but also remove noise features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates whether two variables X and Y have a relationship [20, 21], defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3).

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (3)

The mutual information between feature items and categories reflects the degree of correlation between them. When mutual information is used as the evaluation value for feature extraction, the features with the largest mutual information values are selected.
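Equation (3) can be computed from empirical counts; the following sketch (ours, for discrete feature and class values) returns I(X; Y) in nats:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))),
    estimated from the empirical joint and marginal distributions."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi
```

Independent variables give I(X; Y) = 0, while a feature that determines the class exactly gives the entropy of the class, which is why features are ranked by this value in Table 4.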

5.3. Feature Selection. Due to the limited number of samples, designing a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was applied to the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction, and then redundant and irrelevant features were deleted to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there was no correlation, the mutual information value was zero; conversely, the larger the mutual information value, the greater the correlation. For feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets were synthesized to select n <= 22 metrics, and 16 metrics were finally obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}. The metrics a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.

Although the global results before and after feature selection appear similar, applying feature selection achieves the same effect with fewer features. The obtained data features and the 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also


Table 2: Java language experimental data set specific data.

                  Good    Good Sink   Good Source   Good Auxiliary   Bad    Bad Sink   Bad Source   Total
Buffer overflow   8073    9522        828           21114            7866   4554       414          52371

Table 3: Metrics.

1. CountInput: number of calls; calls by the same method are counted only once, and calls by fields are not counted.
2. CountLine: the number of lines of code.
3. CountLineCode: the number of lines containing code.
4. CountLineCodeDecl: the number of declaration lines; the method name line is also counted here.
5. CountLineCodeExe: the number of lines of purely executable code.
6. CountLineComment: the number of comment lines.
7. CountOutput: the number of calls to other methods; multiple calls to the same method are counted as one call, and a return statement counts as a call.
8. CountPath: the number of code paths that can be executed; related to cyclomatic complexity.
9. CountPathLog: log10, truncated to an integer value, of the metric CountPath.
10. CountSemicolon: the number of semicolons.
11. CountStmt: the number of statements; multiple statements written on one line are counted multiple times.
12. CountStmtDecl: the number of declaration statements, including method declaration lines.
13. CountStmtExe: the number of executable statements.
14. Cyclomatic: cyclomatic complexity (standard calculation method).
15. CyclomaticModified: cyclomatic complexity (the second calculation method).
16. CyclomaticStrict: cyclomatic complexity (the third calculation method).
17. Essential: essential complexity (standard calculation method).
18. Knots: a measure of overlapping jumps.
19. MaxEssentialKnots: the maximum knots after the structured programming constructs have been removed.
20. MaxNesting: maximum nesting level; related to cyclomatic complexity.
21. MinEssentialKnots: the minimum knots after the structured programming constructs have been removed.
22. RatioCommentToCode: the ratio of comment lines to code lines.

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but poorly on data outside the training set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the hold-out method, K-fold cross-validation, and leave-one-out cross-validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts, k-1 of which are used for training and the remaining one for testing, repeating the process k times and thus obtaining k models with which to evaluate performance. In this experiment, we split data set D into two parts: 80% was used as training data and 20% as test data.

First, we group the data D into a training set to train the model and a test set to test its predictive performance; this process is repeated until all the data have been used. Next, the K value is chosen to be 10, so the data set is divided into 10 subsamples, 9 of which are used as training data and one of which is used as test data. Finally, the sensitivity of the model performance to data partitioning is reduced by averaging the results of the 10 groups [22].
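The 10-fold procedure can be sketched as follows (a simplified illustration; the paper does not give its partitioning code):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle n sample indices and partition them into k disjoint folds;
    each (train, test) pair uses one fold for testing, the rest for training."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    return [(sorted(set(indices) - set(fold)), sorted(fold)) for fold in folds]
```

Averaging the evaluation metric over the k test folds gives the cross-validated estimate used here with k = 10.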

5.4.1. Results in C/C++ Programs. In this experiment, programs written in the C/C++ language were used to conduct the experiments, and the decision tree algorithm was compared with other machine learning algorithms with respect to accuracy and running time. The results are shown in Table 5.

The experiments showed that the decision tree algorithm achieved the best results in both accuracy and running time, with an accuracy of 82.55%. Second, the data feature selection was verified by comparing the 22 metrics without feature selection with the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results.

Metrics               C/C++ data set           Java data set
                      MI value     Sort        MI value     Sort
CountInput            0.1892       11          0.6375       3
CountLine             0.5584       1           0.7996       1
CountLineCode         0.4902       2           0.6446       2
CountLineCodeDecl     0.4327       5           0.4373       11
CountLineCodeExe      0.3893       9           0.5          7
CountLineComment      0.3980       8           0.499        8
CountOutput           0.1625       12          0.3477       13
CountPath             NaN          13          0.3959       12
CountPathLog          NaN          13          0.2638       17
CountSemicolon        0.4307       7           0.5484       5
CountStmt             0.4465       4           0.5879       4
CountStmtDecl         0.4313       6           0.439        10
CountStmtExe          0.3434       10          0.5123       6
Cyclomatic            NaN          13          0.3122       14
CyclomaticModified    NaN          13          0.3122       15
CyclomaticStrict      NaN          13          0.2968       16
Essential             NaN          13          0.0099       20
Knots                 NaN          13          0.2508       18
MaxEssentialKnots     NaN          13          0.0099       21
MaxNesting            NaN          13          0.2236       19
MinEssentialKnots     NaN          13          0.0099       22
RatioCommentToCode    0.4603       3           0.4959       9

Table 5: C/C++ language data set machine learning algorithm results.

                     SVM      Bayes   Adaboost   Random forest   Decision tree
Accuracy (%)         82.06    37.69   35.47      82.54           82.55
Recall rate (%)      67.80    30.23   34.25      68.65           68.68
Precision rate (%)   71.06    31.46   33.75      73.59           73.62
Running time (s)     576.32   19.98   19.02      17.94           17.03

It was shown by the experiments that the running time of the buffer overflow vulnerability prediction algorithm when using the A = {a1, a2, ..., a16} metrics after feature selection is reduced from the previous 17.03 s to 15.94 s, while the accuracy is almost unchanged. Thus, feature selection not only helps reduce the running time but also yields comparable accuracy and improves the efficiency of vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source have a high probability of buffer overflow vulnerability; these three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerabilities into eight categories. When there is a vulnerability in real software, it does not necessarily involve all types of buffer vulnerabilities; there may be only one or several of them, and when we collect data from real software, the data that may be vulnerable are much rarer than the data without vulnerabilities, possibly reaching a ratio of 1:10,000. In the Java data set, because the experiment used a data set extracted from real software, the data volumes of Good Source and Bad Sink are very small, which results in extremely unbalanced data and leads to the precision and recall of both of these classes being 0. The precision and recall of the other, basically balanced, vulnerability classes show better results. To maintain the distribution of the original data, data sampling was not used to balance the data, which demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics after feature selection to predict buffer overflow vulnerabilities reduces the running time while maintaining high accuracy.


Table 6: Decision tree algorithm specific operation results (values in %).

                 Before feature selection         After feature selection
                 Precision   Recall   F1          Precision   Recall   F1
Good             88.09       83.56    85.76       88.03       83.47    85.69
Good Sink        61.27       52.91    56.78       61.27       52.91    56.78
Good Source      53.13       25.37    34.34       53.13       25.37    34.34
Good Auxiliary   77.16       96.34    85.69       77.16       96.34    85.69
Bad              77.08       49.23    60.09       77.08       49.23    60.09
Bad Sink         67.94       77.47    72.40       67.94       77.47    72.39
Bad Source       90.71       96.00    93.23       90.66       95.87    93.12
Average          73.63       68.70    69.76       73.61       68.67    71.05
Accuracy (%)     82.55                            82.53
Time (s)         17.03                            15.94

Table 7: Accuracy and time on the Java data set.

                     SVM      Bayes   Adaboost   Random forest   Decision tree
Accuracy (%)         87.16    57.95   49.40      87.41           87.44
Recall rate (%)      59.56    48.25   34.98      59.64           59.64
Precision rate (%)   58.74    44.28   33.11      58.91           58.93
Running time (s)     104.32   9.98    8.99       12.94           11.99

Table 8: Decision tree algorithm specific operation results (values in %).

                 Before feature selection         After feature selection
                 Precision   Recall   F1          Precision   Recall   F1
Good             95.85       93.04    94.42       96.11       93.04    94.55
Good Sink        47.15       53.10    49.95       47.15       53.10    49.95
Good Source      0           0        0           0           0        0
Good Auxiliary   98.56       98.13    98.35       98.56       98.13    98.34
Bad              77.27       73.99    75.60       77.27       73.99    75.60
Bad Sink         0           0        0           0           0        0
Bad Source       95.74       99.51    96.54       93.75       99.60    96.59
Average          59.22       59.69    59.27       58.98       59.69    59.33
Accuracy (%)     87.44                            87.51
Time (s)         11.99                            7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested using real code and reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimizes the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the functions. The model can ensure the accuracy of vulnerability prediction while taking into account the changes in data flow between functions in running programs. This is a new vulnerability prediction method based on data-flow analysis of a running program. (2) In this paper, the data set contained two different languages, namely, C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction research. This paper provides a new way for software metrics to study


data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs a method of dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, a decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model is verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken on the data set collected using C/C++ was shorter, and the accuracy rate was 82.53%. The accuracy rate on the Java data set was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities across different types of languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although they are divided into 8 categories here, these do not cover all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be made more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open-source software, with no detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data, make up for the shortcomings of this model, and study a binary buffer overflow prediction model combined with dynamic symbolic execution technology to find increasingly complex types of buffer overflow vulnerabilities and also forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). The thirteenth page of the article has a footnote giving the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense, EC2ND '10, pp. 37–45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002–2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET '17, pp. 703–706, 2017.



process of the CART model, which can be easily explained by Boolean logic, compared to other models such as Naive Bayes and SVMs.

4.2.2. Training. In the classification prediction part of the BOVP method, first, the feature selection operation was performed on the attribute set A. Second, a specific classification algorithm was selected according to the data characteristics. This research adopts the decision tree algorithm [19] to establish the classification model, based on the characteristics of the integrated software and the characteristics of the classification algorithm.

First, the classification decision tree model is a tree structure that classifies instances based on features. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent the n attributes in A, including the number of lines of source code, the code paths that can be executed, and the cyclomatic complexity. The leaf nodes represent the types X = {x1, x2, ..., x7}. Second, when the decision tree classifies an instance, one of its features is tested starting from the root node, and the instance is assigned to a child node according to the test result. Each child node corresponds to a value of the feature, so the process recursively moves down until a leaf node is reached. Finally, the instance is assigned to the class of the leaf node. The decision tree can be converted to a set of if-then rules, or it can be considered a conditional probability distribution defined over the feature and class spaces. Compared with the naive Bayes classification method, the decision tree has the advantage that its construction process does not require any domain-specific knowledge or parameter settings. Therefore, the decision tree is more suitable for exploratory knowledge discovery in practical applications. The construction of the decision tree algorithm [19] is divided into three parts, feature selection, decision tree generation, and decision tree pruning, as follows.

(1) Feature selection: the features in this model are the n (n <= 22) attributes, which include the number of lines in the source code, the code paths that can be executed, and the cyclomatic complexity. Feature selection is applied to select the features that are able to classify the training data in a way that improves the efficiency of decision tree learning. The feature selection process uses the information gain, the information gain ratio, or the Gini index.

(2) Generation of the decision tree: the Gini index is used as the criterion of feature selection to measure the impurity of a data partition D, such that the locally optimal feature is continuously selected and the training set is divided into feasible subsets.

(3) Decision tree pruning: the cost-complexity pruning algorithm is used to prune the tree; that is, the tree complexity is regarded as a function of the number of leaf nodes and the error rate of the tree. Starting from the bottom of the tree, for each internal node N, the cost complexity of the subtree of N and the cost complexity of N after pruning its subtree are calculated; if pruning lowers the cost complexity, the subtree is clipped, otherwise the subtree is retained.

Combining the attribute set A = {a1, a2, ..., a22}, we can set the training data partition to D and construct the decision tree model through the three steps of feature selection, decision tree generation, and pruning. The basic strategy of the algorithm is to call it with three parameters: D, attribute_list, and Attribute_selection_method. D is a data partition, which is a collection of training tuples and their corresponding category labels; attribute_list is a list describing the tuple attributes; Attribute_selection_method specifies the heuristic process used to select the "best" attribute for classifying the tuples.

In Algorithm 1, Lines (1) to (6) specify the Gini minimization criterion used to obtain the feature attributes and the partition points of the split selection. When feature selection starts, a tree is constructed step by step. Lines (7) to (8) describe the pruning process of the tree according to three indicators (γ, α, β). Once the stopping condition is satisfied, data set D is split into two subtrees, D1 and D2, as expressed in Lines (9) to (10). Lines (11) to (12) state that if the number of samples at a node is less than β, the node is dropped. Lines (13) to (17) show that if a node does not satisfy all of the conditions, it is converted into a leaf node whose output class is the most frequent class at that node. Finally, the process of prediction is conducted using the TreeModel in Line (19). This method is highly complex, especially when searching for the segmentation point, as it is necessary to traverse all the possible values of all the current features. If there are F features, each having N values, and the generated decision tree has S internal nodes, then the time complexity of the algorithm is O(F*N*S).
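The split search in Lines (1) to (6), which scans every observed value of every feature for the minimum weighted Gini index, can be sketched as follows (an illustrative Python version, not the authors' code):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold, weighted Gini) of the best binary
    split, scanning all features and all observed candidate thresholds."""
    best_feature, best_threshold, best_score = None, None, float('inf')
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # degenerate split: one side is empty
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_feature, best_threshold, best_score = f, threshold, score
    return best_feature, best_threshold, best_score
```

With F features of N values each, this scan is the O(F*N) work repeated at each of the S internal nodes, matching the O(F*N*S) complexity stated above.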

The decision tree in the algorithm uses the Gini coefficient as the criterion for selecting the splitting feature. That is, the training data set D is divided into two parts, D1 and D2, according to whether feature A takes a certain value a. Then, under the condition of feature A, the Gini index of set D is defined as

Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)    (1)

The Gini index Gini(D) represents the uncertainty of set D, and the Gini index Gini(D, A) represents the uncertainty of set D after partitioning by A = a. The larger the Gini index is, the greater the uncertainty of the sample. Compared to other classical classification and regression algorithms, decision trees have the advantage of generating easy-to-understand rules while requiring relatively few computations. Decision trees are capable of processing continuous and categorical fields and can clearly display the more important fields. Therefore, by training the decision tree model on the existing data set, the model can learn to predict whether a program has any of the 7 buffer vulnerabilities by using metrics such as the number of lines containing source code, the number of times a function is called, the code paths that can be executed, and the cyclomatic complexity.
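Equation (1) can be computed in a few lines of Python. This is an illustrative sketch, not the paper's implementation; the function names are ours:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum over classes of p_k squared."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    """Gini(D, A) of Equation (1): the impurity of D after it is divided
    into D1 and D2 on whether feature A takes value a, weighted by size."""
    n = len(labels_d1) + len(labels_d2)
    return (len(labels_d1) / n) * gini(labels_d1) + \
           (len(labels_d2) / n) * gini(labels_d2)

# A pure branch contributes nothing; the mixed branch dominates the score.
print(gini_split(["Bad", "Bad"], ["Good", "Good", "Good", "Bad"]))  # approximately 0.25
```

The split with the smallest value of `gini_split` over all candidate features and thresholds is the one the tree adopts at each node.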

The specific method of generating the subtree sequence T0, T1, ..., Tn in pruning is to prune the branch in Ti that has the smallest increase in error on the training data set, starting from T0, to obtain Ti+1. For example, when a tree T is pruned at node t, the increase in error is R(t) − R(Tt), where R(t) is the error of node t after the subtree of node t is pruned, and R(Tt) is the error of the subtree Tt when the subtree of node t is not pruned. However, the number of leaves in T is

Security and Communication Networks 7

input: D = {(X1, y1), (X2, y2), ..., (Xn, yn)}; X = {X1, X2, X3, ..., Xn} represents the feature property set;
       Y = {y1, y2, y3, ..., yn} represents the predicted attribute set
output: CART model sets
(1)  for xi in X
(2)    for Tj in xi
(3)      search min(Gini)
(4)    end for
(5)  end for
(6)  Build Trees TreeModel from Tj and xi
(7)  if H(D) ≤ γ && the current depth < α && |D| ≥ β    // |D|: sample number of D
(8)    Divide D into D1 and D2
(9)    DecisionTreeClassifier(D1, α, γ, β)
(10)   DecisionTreeClassifier(D2, α, γ, β)
(11) else if |D| < β
(12)   drop D
(13) else
(14)   D → leaf node    // convert to leaf node
(15)   predictive class = the most numerous of the classes / average of samples
(16)   break
(17) end else
(18) return TreeModel
(19) Prediction on TMS

Algorithm 1: BOVP based on Gini indexes.
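The construction loop of Algorithm 1 can be sketched in plain Python. This is an illustrative reimplementation, not the paper's code: the function names are ours, Line (7)'s condition H(D) ≤ γ is taken literally, and the "drop" branch for |D| < β is folded into the leaf case so that prediction always returns a class.

```python
import math
from collections import Counter

def entropy(y):
    """Shannon entropy H(D) of the class labels."""
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in Counter(y).values())

def gini(y):
    """Gini impurity of the class labels."""
    n = len(y)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y).values())

def split_gini(y, left, right):
    """Weighted Gini of a candidate split (Equation (1))."""
    n = len(y)
    return (len(left) / n) * gini([y[i] for i in left]) + \
           (len(right) / n) * gini([y[i] for i in right])

def build_tree(X, y, gamma, alpha, beta, depth=0):
    """Recursive CART construction following Algorithm 1: split only while
    H(D) <= gamma, depth < alpha, and |D| >= beta (Line (7)); otherwise
    emit a leaf predicting the majority class (Lines (13)-(16))."""
    if len(y) >= beta and entropy(y) <= gamma and depth < alpha:
        best = None
        for j in range(len(X[0])):                     # Lines (1)-(5):
            for t in sorted({row[j] for row in X}):    # search min(Gini)
                left = [i for i in range(len(y)) if X[i][j] <= t]
                right = [i for i in range(len(y)) if X[i][j] > t]
                if left and right:
                    g = split_gini(y, left, right)
                    if best is None or g < best[0]:
                        best = (g, j, t, left, right)
        if best is not None:                           # Lines (8)-(10)
            _, j, t, left, right = best
            return (j, t,
                    build_tree([X[i] for i in left], [y[i] for i in left],
                               gamma, alpha, beta, depth + 1),
                    build_tree([X[i] for i in right], [y[i] for i in right],
                               gamma, alpha, beta, depth + 1))
    return Counter(y).most_common(1)[0][0]             # leaf node

def predict(node, row):
    """Walk the tuple-encoded tree (feature, threshold, left, right)."""
    while isinstance(node, tuple):
        j, t, lo, hi = node
        node = lo if row[j] <= t else hi
    return node

# Toy example: one metric, four functions, two classes
tree = build_tree([[1], [2], [3], [4]], ["Bad", "Bad", "Good", "Good"],
                  gamma=2.0, alpha=3, beta=2)
print(predict(tree, [1.5]), predict(tree, [3.5]))  # Bad Good
```

With γ large enough, the literal condition reduces to the familiar maximum-depth (α) and minimum-sample (β) stopping rules.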

reduced by L(Tt) − 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after the tree branches are pruned is determined by

\alpha = \frac{R(t) - R(T_t)}{L(T_t) - 1}    (2)

where R(t) represents the error of node t after the subtree of node t is pruned, with R(t) = r(t)*p(t), where r(t) is the error rate of node t and p(t) is the ratio of the number of samples on node t to the number of samples in the training set. R(Tt) represents the error of the subtree Tt when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes of subtree Tt. Ti+1 is the pruned tree with the smallest α value in Ti.
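Equation (2) is straightforward to compute once the node and subtree errors are known. The following sketch is ours, not the paper's code, and the example numbers are invented for illustration:

```python
def error_increase_rate(r_t, p_t, leaf_errors):
    """Equation (2): error-increase rate alpha for pruning at node t.
    R(t) = r(t) * p(t) is the node error after pruning; R(Tt) is the sum
    of the leaf errors of the unpruned subtree; L(Tt) is its leaf count."""
    R_t = r_t * p_t
    R_Tt = sum(leaf_errors)
    return (R_t - R_Tt) / (len(leaf_errors) - 1)

# Node t misclassifies 20% of its samples, which make up 10% of the
# training set; its subtree has 3 leaves with errors 0.004, 0.002, 0.004.
print(error_increase_rate(0.2, 0.1, [0.004, 0.002, 0.004]))  # error increase per pruned leaf
```

At each pruning step, the internal node with the smallest α is the one whose subtree is clipped to obtain Ti+1.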

4.2.3. Prediction. The input of the prediction part is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. The vulnerability data include stack buffer overflow, heap buffer overflow, buffer underwriting ("buffer overflow"), integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes to represent attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of the test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing a decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into distinct classes.

The prediction output is the class predicted by the decision tree algorithm, belonging to the X = {x1, x2, ..., x7} categories. Further analysis can then be performed on the functions containing vulnerabilities based on the prediction results. The prediction part used the accuracy rate, recall rate, and F1 evaluation index to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier to examine the entire sample. The recall rate measures the proportion of relevant items that are correctly retrieved, which reflects the proportion of positive data correctly identified by the decision tree algorithm out of the total positive data. The F1 value is the harmonic mean of precision and recall and measures the overall quality of the decision tree classification.
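The three evaluation indexes can be computed per class as follows. This is a self-contained sketch; the toy labels are ours, not the paper's data:

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, cls):
    """Per-class precision, recall, and F1 (harmonic mean of the two)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels over three of the seven function classes
y_true = ["Bad", "Bad", "Good", "Good", "Bad Sink"]
y_pred = ["Bad", "Good", "Good", "Good", "Bad Sink"]
print(accuracy(y_true, y_pred))                    # 0.8
print(precision_recall_f1(y_true, y_pred, "Bad"))  # (1.0, 0.5, 0.666...)
```

Averaging the per-class triples over all seven classes yields the "Average" rows reported in Tables 6 and 8.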

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, as well as the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program. (1) This article extracted code that contained only one type of defect, based on the buffer overflow vulnerability code fragments, and that did not contain code of other, unrelated defect types. (2) Then, this paper statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 functional-level metrics were extracted. (3) Due to the diversity of the types of buffer


overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data and delete duplicate information. Twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attribute set A. (5) According to the 4:1 ratio between the training set and test set, the Gini coefficient was selected as the splitting criterion for feature selection, and the CART decision tree was constructed with the K-fold cross-validation test.

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selects 163,737 buffer overflow vulnerability samples extracted from actual programs. The data set was created by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows beyond the bounds of the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for the experiments.

The C/C++ language experimental data set had 111,366 samples covering stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. The summaries of the data sets are shown in Tables 1 and 2.

5.2. Metrics Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information was used to eliminate the irrelevant metrics. Finally, the accuracy, recall, and precision indicators were used to measure the predicted results.

Software metric indexes provide a quantitative standard to evaluate code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost performance estimation. Following the development of software from structured modularization to object-oriented design, metrics can be divided into traditional software metrics and object-oriented software metrics. According to traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we needed to filter data that did not meet certain requirements; for example, the extracted data contained a large number of zero values and duplicate attribute values for the same numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective feature space but can also remove noise features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates whether two variables X and Y have a relationship [20, 21], defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3):

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (3)

The mutual information of feature items and categories reflects the degree of correlation between them. When taking mutual information as the evaluation value for feature extraction, the features with the largest mutual information values are selected.
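Equation (3) can be evaluated directly from empirical counts. The sketch below is ours, not the paper's code; it shows that a metric perfectly correlated with the class attains MI = log 2 (nats) on a balanced binary sample, while an independent metric attains 0:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Equation (3): I(X; Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) p(y))).
    Zero when the empirical distributions are independent; larger values
    mean the feature carries more information about the class."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

classes = ["Good", "Good", "Bad", "Bad"]
# A metric that perfectly separates the classes: MI = log 2
print(mutual_information([0, 0, 1, 1], classes))
# An uninformative metric: MI = 0
print(mutual_information([0, 1, 0, 1], classes))
```

Ranking the 22 metrics by this value and keeping the largest ones corresponds to the selection procedure whose results appear in Table 4.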

5.3. Feature Selection. Due to the limited number of samples, designing a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was adopted for the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there was no correlation, the mutual information value was zero; conversely, the larger the mutual information value, the greater the correlation. During feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets were combined to select n <= 22 metrics, and 16 metrics were finally obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}. Among them, a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.

Although the global results before and after feature selection seem similar, we achieve the same effect with fewer features by applying feature selection. The obtained data features and the 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also


Table 2: Java language experimental data set specific data.

                  GOOD                                            BAD                              Total
                  Good   Good Sink  Good Source  Good Auxiliary   Bad    Bad Sink  Bad Source
Buffer overflow   8073   9522       828          21114            7866   4554      414           52371

Table 3: Metrics.

   Name                  Description
1  CountInput            Number of calls; calls by the same method are counted only once, and calls by fields are not counted
2  CountLine             The number of lines of code
3  CountLineCode         The number of lines containing code
4  CountLineCodeDecl     The number of lines of declaration code; the method name line is also recorded in this number
5  CountLineCodeExe      The number of lines of purely executing code
6  CountLineComment      The number of comment lines
7  CountOutput           The number of calls to other methods; multiple calls to the same method are counted as one call, and a return statement counts as a call
8  CountPath             Code paths that can be executed; related to cyclomatic complexity
9  CountPathLog          log10, truncated to an integer value, of the metric CountPath
10 CountSemicolon        The number of semicolons
11 CountStmt             The number of statements, counted multiple times even if multiple statements are written on one line
12 CountStmtDecl         The number of declaration statements, including the method declaration line
13 CountStmtExe          The number of executed statements
14 Cyclomatic            Cyclomatic complexity (standard calculation method)
15 CyclomaticModified    Cyclomatic complexity (the second calculation method)
16 CyclomaticStrict      Cyclomatic complexity (the third calculation method)
17 Essential             Essential complexity (standard calculation method)
18 Knots                 Measure of overlapping jumps
19 MaxEssentialKnots     The maximum knots after the structured programming structures have been deleted
20 MaxNesting            Maximum nesting level, related to cyclomatic complexity
21 MinEssentialKnots     The minimum knots after the structured programming structures have been deleted
22 RatioCommentToCode    Code comment rate

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but performs poorly on data outside of the training data set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the Hold-Out Method, K-fold Cross-Validation, and Leave-One-Out Cross-Validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts: k−1 are used for training, and the remaining part is used for testing. The process is repeated k times, thus obtaining k models whose results are used to evaluate the performance of the model. In this experiment, we divided data set D into two parts: 80% was used as training data, and 20% was used as test data.

First, we grouped the data D into a training set to train the model and a test set to test the predictive performance of the model; this process was repeated until all data had been used. Next, the K value was chosen to be 10, so that the data set was divided into 10 subsamples, 9 of which were used as training data and one of which was used as test data. Finally, the sensitivity of the model performance to data partitioning was reduced by averaging the results of the 10 groups [22].
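The 10-fold procedure described above can be sketched with the standard library alone. This is an illustrative sketch under our own naming; the paper does not publish its partitioning code:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition n sample indices into k folds. Each fold serves
    once as the test set while the remaining k-1 folds form the training
    set, as in the 10-fold procedure used in the experiment."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

held_out = []
for train, test in k_fold_indices(100, 10):
    # a real run would fit the decision tree on `train` and score it on
    # `test`; here we only record the held-out fold size as a placeholder
    held_out.append(len(test))
print(held_out)  # ten folds of 10 samples each
```

Averaging the 10 per-fold scores gives the single figure reported for each algorithm in Tables 5 and 7.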

5.4.1. Results in C/C++ Programs. In this experiment, programs written in the C/C++ language were used to conduct experiments, and the decision tree algorithm was compared with other machine learning algorithms with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, with an accuracy of 82.55% in the experiments. Second, the data feature selection was verified by comparing the 22 metrics without feature selection with the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results.

Metrics               C/C++ data set        Java data set
                      MI value   Rank       MI value   Rank
CountInput            0.1892     11         0.6375     3
CountLine             0.5584     1          0.7996     1
CountLineCode         0.4902     2          0.6446     2
CountLineCodeDecl     0.4327     5          0.4373     11
CountLineCodeExe      0.3893     9          0.5000     7
CountLineComment      0.3980     8          0.4990     8
CountOutput           0.1625     12         0.3477     13
CountPath             NaN        13         0.3959     12
CountPathLog          NaN        13         0.2638     17
CountSemicolon        0.4307     7          0.5484     5
CountStmt             0.4465     4          0.5879     4
CountStmtDecl         0.4313     6          0.4390     10
CountStmtExe          0.3434     10         0.5123     6
Cyclomatic            NaN        13         0.3122     14
CyclomaticModified    NaN        13         0.3122     15
CyclomaticStrict      NaN        13         0.2968     16
Essential             NaN        13         0.0099     20
Knots                 NaN        13         0.2508     18
MaxEssentialKnots     NaN        13         0.0099     21
MaxNesting            NaN        13         0.2236     19
MinEssentialKnots     NaN        13         0.0099     22
RatioCommentToCode    0.4603     3          0.4959     9

Table 5: C/C++ language data set machine learning algorithm results.

                     SVM      Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)         82.06    37.69   35.47     82.54          82.55
Recall rate (%)      67.80    30.23   34.25     68.65          68.68
Precision rate (%)   71.06    31.46   33.75     73.59          73.62
Running time (s)     576.32   19.98   19.02     17.94          17.03

It was proven by experiments that the running time of the buffer overflow vulnerability prediction algorithm when using the A = {a1, a2, ..., a16} metrics after feature selection is reduced from the previous 17.03 s to 15.94 s, while the accuracy is almost unchanged. Thus, feature selection not only helps to reduce the running time but also maintains high accuracy and improves the efficiency of the vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source have a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerabilities into eight categories. When there is a vulnerability in real software, it does not necessarily involve all types of buffer vulnerabilities; there may be only one or several of them. Moreover, when we collect data from real software, the data that may be vulnerable is much scarcer than the data without vulnerability, with the ratio possibly reaching 1:10000. In the Java data set, because the experiment used a data set extracted from real software, the data volumes of Good Source and Bad Sink are very small, which results in extremely unbalanced data. It also leads to the precision and recall rates of both of these types being 0. The precision and recall rates of the other, basically balanced vulnerability categories achieve better results. To maintain the distribution of the original data, data sampling was not used to balance the data, which demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics after feature selection to predict buffer overflow vulnerabilities reduces the running time and maintains high accuracy.


Table 6: Decision tree algorithm-specific operation results.

                 Before feature selection        After feature selection
                 Precision  Recall  F1           Precision  Recall  F1
Good             88.09      83.56   85.76        88.03      83.47   85.69
Good Sink        61.27      52.91   56.78        61.27      52.91   56.78
Good Source      53.13      25.37   34.34        53.13      25.37   34.34
Good Auxiliary   77.16      96.34   85.69        77.16      96.34   85.69
Bad              77.08      49.23   60.09        77.08      49.23   60.09
Bad Sink         67.94      77.47   72.40        67.94      77.47   72.39
Bad Source       90.71      96.00   93.23        90.66      95.87   93.12
Average          73.63      68.70   69.76        73.61      68.67   71.05
Accuracy (%)     82.55                           82.53
Time (s)         17.03                           15.94

Table 7: Accuracy and time for the Java data set.

                     SVM      Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)         87.16    57.95   49.40     87.41          87.44
Recall rate (%)      59.56    48.25   34.98     59.64          59.64
Precision rate (%)   58.74    44.28   33.11     58.91          58.93
Running time (s)     104.32   9.98    8.99      12.94          11.99

Table 8: Decision tree algorithm-specific operation results (%).

                 Before feature selection        After feature selection
                 Precision  Recall  F1           Precision  Recall  F1
Good             95.85      93.04   94.42        96.11      93.04   94.55
Good Sink        47.15      53.10   49.95        47.15      53.10   49.95
Good Source      0          0       0            0          0       0
Good Auxiliary   98.56      98.13   98.35        98.56      98.13   98.34
Bad              77.27      73.99   75.60        77.27      73.99   75.60
Bad Sink         0          0       0            0          0       0
Bad Source       95.74      99.51   96.54        93.75      99.60   96.59
Average          59.22      59.69   59.27        58.98      59.69   59.33
Accuracy         87.44                           87.51
Time (s)         11.99                           7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested on real code and reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimizes the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the functions. The model can ensure the accuracy of vulnerability prediction while taking into account the change in data flow between functions in running programs. This is a new vulnerability prediction method based on data flow analysis in a running program. (2) In this paper, the data set contained two different languages, namely, C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction research. This paper provides a new way for software metrics to study


data collection in software vulnerability prediction. However, this method is only suitable for source code and not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs the method of dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, the decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model is verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken for the data set collected using C/C++ was shorter, and the accuracy rate was 82.53%. The accuracy rate for the data set using Java was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities in different types of languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although they were divided into 8 categories here, these categories do not cover all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be made more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open-source software, with no detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data to make up for the shortcomings of this model and to study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, in order to find increasingly complex types of buffer overflow vulnerabilities and also to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700, the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341, and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.
[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.
[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.
[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.
[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.
[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.
[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.
[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense (EC2ND '10), pp. 37–45, IEEE, October 2010.
[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.
[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.
[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002–2009, 2016.
[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.
[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.
[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.
[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.
[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.
[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.
[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.
[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.
[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET '17), pp. 703–706, 2017.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 7: A Buffer Overflow Prediction Approach Based on Software ...downloads.hindawi.com/journals/scn/2019/8391425.pdfA Buffer Overflow Prediction Approach Based on Software Metrics and Machine

Security and Communication Networks 7

input D=(X1Y1)(X2y2) (Xn yn) X = X1X2X3 Xn represents a feature property set Y = y1 y2 y3 ynrepresents the predicted attribute setoutput CART model sets(1) for xi in X(2) for Tj in xi(3) search min(Gini)(4) end for(5) end for(6) Build Trees TreeModel from Tj and xi(7) if H(D) le 120574 ampamp the current depth lt 120572 ampamp |D| ge 120573 |D| sample number of D(8) Divide D to D1 and D2(9) DecisionTreeClassifier (D1 120572 120574 120573)(10) DecisionTreeClassifier (D2 120572 120574 120573)(11) else if |D| lt 120573(12) drop D(13) else(14) D 997888rarr leaf Node convert to leaf node(15) predictive class = the most number of classes average of samples(16) break(17) end else(18) return TreeModel(19) Prediction on TMS

Algorithm 1 BOVP based on Gini indexes

reduced by L(T_t) − 1 after pruning, so the complexity of T is reduced. Therefore, considering the complexity factor of the tree, the error increase rate after the tree branches are pruned is determined by

    α = (R(t) − R(T_t)) / (L(T_t) − 1)    (2)

where R(t) represents the error of node t after the subtree of node t is pruned, R(t) = r(t) × p(t), r(t) is the error rate of node t, and p(t) is the ratio of the number of samples on node t to the number of samples in the training set. R(T_t) represents the error of the subtree T_t when the subtree of node t is not pruned, that is, the sum of the errors of all the leaf nodes of subtree T_t. T_{i+1} is the pruning tree with the smallest α value in T_i.
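Equation (2) can be computed directly from the quantities just defined. The sketch below is illustrative (the function name and the numeric inputs are ours, not the paper's):

```python
def prune_alpha(r_t, p_t, leaf_errors, n_leaves):
    """Error increase rate of pruning node t, Equation (2):
    alpha = (R(t) - R(T_t)) / (L(T_t) - 1).

    r_t         -- r(t), the error rate of node t
    p_t         -- p(t), node t's share of the training samples
    leaf_errors -- errors of the leaves of the unpruned subtree T_t
    n_leaves    -- L(T_t), the number of leaves of T_t
    """
    R_t = r_t * p_t            # R(t) = r(t) * p(t)
    R_Tt = sum(leaf_errors)    # R(T_t) = sum of leaf errors
    return (R_t - R_Tt) / (n_leaves - 1)
```

Cost-complexity pruning then repeatedly removes the subtree with the smallest α, producing the nested sequence T_i, T_{i+1}, ... described above.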

4.2.3. Prediction. The input of the prediction section is the buffer overflow vulnerability data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}. The vulnerability data include stack buffer overflow, heap buffer overflow, buffer underwrite ("buffer underflow"), integer overflow, and integer underflow.

When constructing the decision tree in the prediction process, the feature-selected metrics A = {a1, a2, ..., a16} were used as internal nodes to represent attribute tests, and the Gini index was used to select the partition attribute. Each branch represents an output of the test, and the leaf nodes correspond to the function classes X = {x1, x2, ..., x7}. When constructing a decision tree, this paper uses attribute selection metrics to select the attributes that best divide the tuples into different classes.

The prediction output is the class predicted by the decision tree algorithm, belonging to the X = {x1, x2, ..., x7} categories. Further analysis can be performed on the functions containing vulnerabilities based on the prediction results. The prediction part used the accuracy rate, recall rate, and F1 evaluation index to measure the effect of the decision tree algorithm. Accuracy measures the ratio of the number of correctly classified samples to the total number of samples in the test data set; it reflects the ability of the decision tree classifier to examine the entire sample. The recall rate measures the proportion of samples that should be retrieved and are retrieved correctly, which reflects the proportion of positive data correctly identified by the decision tree algorithm relative to the total positive data. The F1 value is the harmonic average of precision and recall and measures how close the decision tree classification is to the ideal.
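The evaluation indexes described above reduce to simple counting. A minimal sketch (the label values are illustrative, and the per-class computation assumes a one-vs-rest view of the multiclass problem):

```python
def evaluate(y_true, y_pred, positive):
    """Overall accuracy, plus precision/recall/F1 for one class of interest."""
    assert len(y_true) == len(y_pred)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

Averaging the per-class precision, recall, and F1 over the seven function classes gives the macro-averaged rows reported in the result tables.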

5. Experiment

In this paper, the relationships of function nesting, iteration, and looping in the process of software execution, together with the dynamic behaviour and operations of the functions, were considered comprehensively. The vulnerability classification and prediction were carried out according to a unified pattern of software metrics. This paper carried out the following steps for the program. (1) This article extracted code that contained only one type of defect, based on the buffer overflow vulnerability code fragments, and that did not contain code of other, unrelated defect types. (2) Then, this paper statically analysed the size, complexity, and coupling of the source code for each code segment that could be generated. At the same time, software was used to extract the metrics from the program, and 58 function-level metrics were extracted. (3) Due to the diversity of the types of buffer overflow vulnerabilities involved in the program, there is data imbalance in the extracted static code indicators. Therefore, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data and delete duplicate information; twenty-two indicators were selected for research. (4) The method of feature selection based on mutual information values was used to select the software metric attribute set A. (5) With a 4:1 ratio between the training set and test set, the Gini coefficient was selected as the splitting criterion, and the CART decision tree was constructed and evaluated by K-fold cross-validation.
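The data-cleaning part of step (3) — dropping duplicate records and all-zero indicator columns before feature selection — can be sketched generically. The function name and the cleaning rule shown here are our illustration of the kind of cleaning described, not the paper's exact procedure:

```python
def clean(rows):
    """Drop duplicate metric rows, then drop columns that are zero in
    every remaining row; return the cleaned rows and the kept column
    indexes."""
    seen, deduped = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:       # delete duplicate information
            seen.add(key)
            deduped.append(row)
    # keep only indicator columns that carry any non-zero value
    keep = [j for j in range(len(deduped[0]))
            if any(row[j] != 0 for row in deduped)]
    return [[row[j] for j in keep] for row in deduped], keep
```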

5.1. Data Set Introduction. Using the Juliet data set of the NIST library (https://samate.nist.gov/SRD/view.php), this paper selects 163,737 buffer overflow vulnerability samples extracted from actual programs. The data set was produced by the National Security Agency (NSA) Center for Assured Software (CAS). All the programs selected in this article can be compiled and run on Microsoft Windows using Microsoft Visual Studio. This article extracts software metrics based on the source code of the Juliet data set cross-project test cases. A buffer overflow occurs when a write stream flows past the end of the buffer and the function return address is modified during program execution. In this study, source code written in C/C++ and Java was used for the experiments.

The C/C++ language experimental data set had 111,366 samples, covering stack-based buffer overflows, heap-based buffer overflows, buffer overflows, integer overflows, and integer underflows. A total of 52,371 buffer overflow samples were collected for the Java language experimental data set. The summaries of the data sets are shown in Tables 1 and 2.

5.2. Metrics Indexes. This paper extracted a large number of real software attributes through analysis of software source code. Twenty-two metrics are listed in Table 3. Because some attributes are irrelevant, feature selection based on mutual information is used to eliminate irrelevant metrics. Finally, the accuracy, recall, and precision indicators are used to measure the predicted results.

Software metric indexes provide a quantitative standard to evaluate code quality. They not only visually reflect the quality of software products but also provide a numerical basis for software performance evaluation, productivity model building, and cost performance estimation. Following the development of software from structured, modular design to object-oriented design, metrics can be divided into traditional software metrics and object-oriented software metrics. According to the traditional software measurement indexes, the attribute set of the software is first defined as A = {a1, a2, ..., an}. First, the Understand code analysis tool was used to extract the software data based on 58 metrics. Second, because the software source code is written in different languages and the coding conventions differ, it was necessary to perform data cleaning on the initially extracted data to reexamine and verify the data, delete duplicate information, and correct errors. To ensure data consistency, we need to filter data that does not meet certain requirements; for example, the extracted data contain a large number of zero values and identical attribute values for some numerical metrics. Finally, 22 indicators remained for study, so n = 22. The specific indicators are shown in Table 3.

Feature selection based on these 22 metrics can not only improve the efficiency of classifier training and application by reducing the effective feature space but also remove noisy features and improve classification accuracy. In this paper, the mutual information method is adopted to select features based on the correlation of attributes.

Mutual information indicates whether two variables X and Y have a relationship [20, 21], defined as follows: let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y); the mutual information I(X; Y) is the relative entropy of the joint distribution p(x, y) and the product distribution p(x)p(y), as shown in Equation (3):

    I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (3)

The mutual information of feature items and categories reflects the degree of correlation between them. When mutual information is taken as the evaluation value for feature extraction, the features with the largest mutual information values are selected.
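Equation (3) can be computed from empirical counts for discrete-valued data. A minimal sketch (we use the natural logarithm; the paper does not state the log base, so the scale of the values is an assumption):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) = sum_x sum_y p(x,y) * log(p(x,y) / (p(x) p(y))),
    with probabilities estimated from the paired samples (Equation (3))."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))          # joint counts
    p_x, p_y = Counter(xs), Counter(ys)  # marginal counts
    return sum(
        (c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )
```

Independent variables give a value near zero, while a feature that fully determines the class attains the entropy of the class variable, which is why larger values indicate stronger correlation.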

5.3. Feature Selection. Due to the limited number of samples, designing a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was adopted for the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there is no correlation, the mutual information value is zero; conversely, the larger the mutual information value, the greater the correlation. For feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets were synthesized to select n ≤ 22 metrics, and finally 16 metrics were obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}, where a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.
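The paper does not specify how the two per-dataset rankings in Table 4 are synthesized into one selection; summing the per-dataset ranks is one plausible sketch, shown below with hypothetical inputs (a NaN mutual information value is treated as no correlation, i.e. 0):

```python
def select_features(mi_c, mi_java, k):
    """Pick the k metrics with the best (lowest) combined rank across the
    two datasets; this rank-sum rule is an assumption, not the paper's
    stated procedure."""
    def ranks(mi):
        # NaN != NaN, so the conditional maps missing values to 0.0
        order = sorted(mi, key=lambda m: -(mi[m] if mi[m] == mi[m] else 0.0))
        return {m: r for r, m in enumerate(order)}
    rc, rj = ranks(mi_c), ranks(mi_java)
    # stable sort: alphabetical tiebreak, then combined rank
    return sorted(sorted(mi_c), key=lambda m: rc[m] + rj[m])[:k]
```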

Although the global results before and after feature selection appear similar, feature selection achieves the same effect with fewer features. The 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicator dimensions and the amount of overfitting. The process also enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

Table 2: Java language experimental data set specific data.

                  Good   Good Sink  Good Source  Good Auxiliary  Bad    Bad Sink  Bad Source  Total
Buffer overflow   8073   9522       828          21114           7866   4554      414         52371

Table 3: Metrics.

 1. CountInput           Number of calls; calls by the same method are counted only once, and calls by fields are not counted
 2. CountLine            The number of lines of code
 3. CountLineCode        The number of lines containing code
 4. CountLineCodeDecl    The number of declaration lines; the method name line is also recorded in this number
 5. CountLineCodeExe     The number of lines of purely executable code
 6. CountLineComment     The number of comment lines
 7. CountOutput          The number of calls to other methods; multiple calls to the same method are counted as one call; a return statement counts as a call
 8. CountPath            The number of executable code paths; related to cyclomatic complexity
 9. CountPathLog         log10, truncated to an integer value, of the metric CountPath
10. CountSemicolon       The number of semicolons
11. CountStmt            The number of statements; multiple statements written on one line are counted multiple times
12. CountStmtDecl        The number of declaration statements, including the method declaration line
13. CountStmtExe         The number of executable statements
14. Cyclomatic           Cyclomatic complexity (standard calculation method)
15. CyclomaticModified   Cyclomatic complexity (the second calculation method)
16. CyclomaticStrict     Cyclomatic complexity (the third calculation method)
17. Essential            Essential complexity (standard calculation method)
18. Knots                Measures overlapping jumps
19. MaxEssentialKnots    The maximum number of knots after structured programming constructs have been removed
20. MaxNesting           Maximum nesting level; related to cyclomatic complexity
21. MinEssentialKnots    The minimum number of knots after structured programming constructs have been removed
22. RatioCommentToCode   The ratio of comment lines to code lines

5.4. Experimental Results. In classification algorithms, there is often a phenomenon in which the model performs well on the training data but poorly on data outside the training set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially its performance on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the hold-out method, K-fold cross-validation, and leave-one-out cross-validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k subsets, using k − 1 of them for training and the remaining one for testing, and repeats the process k times, thus obtaining k models with which to evaluate the performance of the model. In this experiment, we divided the data into two parts: 80% of data set D was used as training data and 20% was used as test data.

First, we split the data D into a training set to train the model and a test set to test its predictive performance; this process is repeated until all data have been used. Next, the K value is chosen to be 10, so that the data set is divided into 10 subsamples, 9 of which are used as training data while one is used as test data. Finally, the sensitivity of the model performance to data partitioning is reduced by averaging the results of the 10 groups [22].
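The 10-fold protocol above can be sketched with a simple index splitter; this is a minimal stand-in for a library splitter such as scikit-learn's KFold, and the round-robin fold assignment is our illustrative choice:

```python
def k_fold_indices(n, k):
    """Split the indexes 0..n-1 into k folds; yield (train, test) index
    lists, one pair per fold, so each sample is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin folds
    for i in range(k):
        test = folds[i]
        train = [j for f in (folds[:i] + folds[i + 1:]) for j in f]
        yield train, test
```

Training the model on each `train` list and scoring it on the matching `test` list, then averaging the k scores, gives the cross-validated estimate used in the tables below.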

5.4.1. Results in C/C++ Programs. In this experiment, programs written in the C/C++ language were used to conduct experiments, and the decision tree algorithm was compared with other machine learning algorithms with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, with the accuracy reaching 82.55%. Second, the data feature selection was verified by comparing the 22 metrics without feature selection and the 16 metrics after feature selection, as shown in Table 6.


Table 4: Mutual information value calculation results.

                      C/C++ dataset          Java dataset
Metrics               MI value     Rank      MI value     Rank
CountInput            0.1892       11        0.6375       3
CountLine             0.5584       1         0.7996       1
CountLineCode         0.4902       2         0.6446       2
CountLineCodeDecl     0.4327       5         0.4373       11
CountLineCodeExe      0.3893       9         0.5000       7
CountLineComment      0.3980       8         0.4990       8
CountOutput           0.1625       12        0.3477       13
CountPath             NaN          13        0.3959       12
CountPathLog          NaN          13        0.2638       17
CountSemicolon        0.4307       7         0.5484       5
CountStmt             0.4465       4         0.5879       4
CountStmtDecl         0.4313       6         0.4390       10
CountStmtExe          0.3434       10        0.5123       6
Cyclomatic            NaN          13        0.3122       14
CyclomaticModified    NaN          13        0.3122       15
CyclomaticStrict      NaN          13        0.2968       16
Essential             NaN          13        0.0099       20
Knots                 NaN          13        0.2508       18
MaxEssentialKnots     NaN          13        0.0099       21
MaxNesting            NaN          13        0.2236       19
MinEssentialKnots     NaN          13        0.0099       22
RatioCommentToCode    0.4603       3         0.4959       9

Table 5: C/C++ language dataset machine algorithm results.

                    SVM      Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)        82.06    37.69   35.47     82.54          82.55
Recall rate (%)     67.80    30.23   34.25     68.65          68.68
Precision rate (%)  71.06    31.46   33.75     73.59          73.62
Running time (s)    576.32   19.98   19.02     17.94          17.03

It was proved by experiments that the running time of the buffer overflow vulnerability prediction algorithm when using the A = {a1, a2, ..., a16} metrics after feature selection is reduced from the previous 17.03 s to 15.94 s, while the accuracy is almost unchanged. Thus, feature selection not only reduces the running time but also maintains high accuracy, improving the efficiency of the vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source are those with a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided the buffer overflow vulnerability data into eight categories. When there is a vulnerability in real software, it does not necessarily contain all types of buffer vulnerabilities; there may be only one or several of them. Moreover, when we collect data from real software, the data that may be vulnerable are far scarcer than the data without vulnerabilities, with the ratio possibly even reaching 1:10,000. In the Java data set, because the experiment uses a data set extracted from real software, the data volumes of Good Source and Bad Sink are very small, which results in extremely unbalanced data; it also leads to the precision and recall of both of these classes being 0. The precision and recall of the other, basically balanced, vulnerability classes show better results. To maintain the distribution of the original data, data sampling was not used to balance the data, which demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics to predict buffer overflow vulnerabilities after feature selection reduces the running time while ensuring high accuracy.


Table 6: Decision tree algorithm-specific operation results (C/C++, %).

                 Before feature selection        After feature selection
                 Precision  Recall   F1          Precision  Recall   F1
Good             88.09      83.56    85.76       88.03      83.47    85.69
Good Sink        61.27      52.91    56.78       61.27      52.91    56.78
Good Source      53.13      25.37    34.34       53.13      25.37    34.34
Good Auxiliary   77.16      96.34    85.69       77.16      96.34    85.69
Bad              77.08      49.23    60.09       77.08      49.23    60.09
Bad Sink         67.94      77.47    72.40       67.94      77.47    72.39
Bad Source       90.71      96.00    93.23       90.66      95.87    93.12
Average          73.63      68.70    69.76       73.61      68.67    71.05
Accuracy         82.55                           82.53
Time (s)         17.03                           15.94

Table 7: Accuracy and time of the Java dataset.

                    SVM      Bayes   Adaboost  Random forest  Decision tree
Accuracy (%)        87.16    57.95   49.40     87.41          87.44
Recall rate (%)     59.56    48.25   34.98     59.64          59.64
Precision rate (%)  58.74    44.28   33.11     58.91          58.93
Running time (s)    104.32   9.98    8.99      12.94          11.99

Table 8: Decision tree algorithm-specific operation results (Java, %).

                 Before feature selection        After feature selection
                 Precision  Recall   F1          Precision  Recall   F1
Good             95.85      93.04    94.42       96.11      93.04    94.55
Good Sink        47.15      53.10    49.95       47.15      53.10    49.95
Good Source      0          0        0           0          0        0
Good Auxiliary   98.56      98.13    98.35       98.56      98.13    98.34
Bad              77.27      73.99    75.60       77.27      73.99    75.60
Bad Sink         0          0        0           0          0        0
Bad Source       95.74      99.51    96.54       93.75      99.60    96.59
Average          59.22      59.69    59.27       58.98      59.69    59.33
Accuracy         87.44                           87.51
Time (s)         11.99                           7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested using real code and reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the functions. The model can ensure the accuracy of vulnerability prediction while taking into account the change in data flow between functions in running programs. This is a new vulnerability prediction method based on data flow analysis in a running program. (2) In this paper, the data set contained two different languages, namely C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources for software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction research. This paper provides a new way for software metrics to support data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, the decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model is verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken on the C/C++ data set was lower, and the accuracy rate was 82.53%. The accuracy rate on the Java data set was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities across different languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although the data are divided into 8 categories, they do not contain all buffer overflow vulnerabilities; therefore, the classification of buffer overflow vulnerabilities should be made more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open source software, with no detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data to make up for the shortcomings of this model, and to study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, in order to find increasingly complex types of buffer overflow vulnerabilities and also to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). Footnotes on the thirteenth page of the article give the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47-50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294-313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112-126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243-254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20-34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense (EC2ND '10), pp. 37-45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277-288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1-12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002-2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622-627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23-33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1-21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15-24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395-407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET '17), pp. 703-706, 2017.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: A Buffer Overflow Prediction Approach Based on Software ...downloads.hindawi.com/journals/scn/2019/8391425.pdfA Buffer Overflow Prediction Approach Based on Software Metrics and Machine

8 Security and Communication Networks

overflow vulnerabilities involved in the program there is dataimbalance in the extracted static code indicators Thereforeit was necessary to perform data cleaning on the initiallyextracted data to reexamine and verify the data and deleteduplicate information Twenty-two indicators were selectedfor research (4) The method of feature selection based onmutual information values was used to select the softwaremetric attribute A (5) According to the 41 ratio between thetraining set and test set the Gini coefficient was selected asthe splitting point for feature selection and theCARTdecisiontree was constructed by the K-times crossover test

51 Data Set Introduction Using the Juliet data set of theNIST library (httpssamatenistgovSRDviewphp) thispaper selects 163737 new types of buffer overflow vulner-abilities in an actual program extraction The informationcontained in the data set is made up of the National SecurityAgency (NSA) Reliable Software (CAS) All the programsselected in this article can be compiled and run on MicrosoftWindows using Microsoft Visual Studio This article extractssoftware metrics based on the source code of the Juliet dataset cross-project test case A buffer overflow occurs when awrite stream flows through the buffer and the function returnaddress is modified during program execution In this studysource code written in C and Java were used for experiments

The CC++ language experimental data set had 111366stack-based buffer overflows heap-based buffer overflowsbuffer overflows integer overflows and integer underflowsA total of 52371 buffer overflows were collected from the Javalanguage experimental data set The summaries of the datasets are shown in Tables 1 and 2

52 Metrics Indexes This paper extracted a large numberof real software attributes through analysis from softwaresource code Twenty-twometrics are listed inTable 3 Becausesome attributes are irrelevant feature selection based onmutual information is used to eliminate irrelevant metricsFinally the accuracy recall and accuracy indicators are usedto measure the predicted results

Software metric indexes provide a quantitative standardto evaluate code quality It not only visually reflects the qualityof software products but also provides a numerical basisfor software performance evaluation productivity modelbuilding and cost performance estimation According to thedevelopment of software from structure modularization toobject-oriented design metrics can be divided into tradi-tional software metrics and object-oriented software metricsAccording to traditional software measurement indexesthe attribute set of the software is first defined as A =a1 a2 an First Understand a code analysis tool wasused to extract the software data based on 58metrics Secondbecause the software source code is written in differentlanguages and the coding conventions are different it isnecessary to perform data cleaning on the initially extracteddata to reexamine and verify the data delete duplicateinformation and correct errors To ensure data consistencywe need to filter data that does notmeet certain requirementsFor example the index of the extracted data contains a large

number of zero values and attribute values for the samenumerical metrics Finally 22 indicators remained for studyso n = 22 The specific indicators are shown in Table 3

Feature selection based on these 22 metrics can not onlyimprove the efficiency of classifier training and application byreducing the effective vocabulary space but also remove noisefeatures and improve classification accuracy In this paperthe mutual information method is adopted to select featuresbased on the correlation of attributes

Mutual information indicates whether two variables Xand Y have a relationship [20 21] defined as follows letthe joint distribution of two random variables (X Y) bep(x y) the marginal distributions are p(x) and p(y) themutual information I(XY) is the relative entropy of the jointdistribution p(x y) and the product distribution p(x)p(y) asshown in Equation (3)

I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}    (3)

The mutual information of feature items and categories reflects the degree of correlation between them. When taking mutual information as the evaluation value for feature extraction, the features with the largest mutual information will be selected.
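Equation (3) can be estimated directly from paired observations. The sketch below computes the empirical mutual information in bits; `mutual_information` is our own illustrative helper, not code from the paper:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical I(X; Y) in bits from paired observations, per Equation (3)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint distribution p(x, y)
    px, py = Counter(xs), Counter(ys)   # marginals p(x), p(y)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

x = [0, 0, 1, 1]
print(mutual_information(x, [0, 1, 0, 1]))  # independent variables -> 0.0
print(mutual_information(x, x))             # deterministic relation -> 1.0 bit
```

Uncorrelated variables score zero, and a fully determined variable scores its entire entropy, which is why the largest-valued features are the most informative ones to keep.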

5.3. Feature Selection. Due to the limited number of samples, designing a classifier with a large number of features is computationally expensive, and the classification performance is poor. Therefore, feature selection was adopted for the metrics set A [21]; that is, a sample of the high-dimensional space was converted into a low-dimensional space by means of mapping or transformation, achieving dimensionality reduction. Then, redundant and irrelevant features were deleted by feature selection to achieve further dimensionality reduction. In this experiment, the mutual information value was used to measure the correlation for feature selection. If there is no correlation, the mutual information value is zero; conversely, the larger the mutual information value, the greater the correlation. During feature selection, the mutual information value was calculated for both the C/C++ and Java data sets. The results are shown in Table 4.

The mutual information values of the two data sets are synthesized to select n <= 22; finally, 16 metrics are obtained after the mutual information calculation. Define the set of attributes as A = {a1, a2, ..., a16}, where a1, a2, ..., a16 are CountInput, CountLine, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountOutput, CountPath, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, Cyclomatic, CyclomaticModified, CyclomaticStrict, and RatioCommentToCode.
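The paper does not spell out how the two data sets' mutual information values are synthesized; the sketch below assumes a simple average, treating the NaN entries of Table 4 as zero correlation, purely for illustration. The helper `select_features` is hypothetical; the sample values are taken from Table 4:

```python
# Hedged sketch: average each metric's MI score across the C/C++ and Java
# data sets (NaN treated as 0.0), then keep the k highest-scoring metrics.

def select_features(mi_c, mi_java, k):
    def score(name):
        vals = [mi_c.get(name), mi_java.get(name)]
        vals = [0.0 if v is None else v for v in vals]  # None stands in for NaN
        return sum(vals) / len(vals)
    ranked = sorted(mi_c, key=score, reverse=True)
    return ranked[:k]

# Three metrics from Table 4 (None = NaN in the C/C++ column).
mi_c    = {"CountLine": 0.5584, "CountLineCode": 0.4902, "Essential": None}
mi_java = {"CountLine": 0.7996, "CountLineCode": 0.6446, "Essential": 0.0099}
print(select_features(mi_c, mi_java, k=2))  # -> ['CountLine', 'CountLineCode']
```

Under this assumed rule, low-scoring metrics such as Essential fall away, consistent with their exclusion from the final 16-metric set.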

Although the global results before and after feature selection seem similar, we achieve the same effect with fewer features by applying feature selection. The obtained data features and the 16 indicators adopted after feature selection could achieve high-efficiency cross-project software metrics, thereby reducing the number of indicators, the dimensionality, and the amount of overfitting. The process also

Security and Communication Networks 9

Table 2: Java language experimental data set specific data.

                   GOOD                                            BAD                               Total
                   Good    Good Sink   Good Source  Good Auxiliary Bad    Bad Sink   Bad Source
Buffer overflow    8073    9522        828          21114          7866   4554       414           52371

Table 3: Metrics.

     Name                  Description
1.   CountInput            Number of calls; calls by the same method are counted only once, and calls by fields are not counted.
2.   CountLine             The number of lines of code.
3.   CountLineCode         The number of lines containing code.
4.   CountLineCodeDecl     The number of declaration lines; the method name line is also counted in this number.
5.   CountLineCodeExe      The number of lines of purely executable code.
6.   CountLineComment      The number of comment lines.
7.   CountOutput           The number of calls to other methods; multiple calls to the same method are counted as one call. A return statement counts as a call.
8.   CountPath             The number of executable code paths; related to cyclomatic complexity.
9.   CountPathLog          log10, truncated to an integer value, of the metric CountPath.
10.  CountSemicolon        The number of semicolons.
11.  CountStmt             The number of statements; multiple statements written on one line are counted multiple times.
12.  CountStmtDecl         The number of declaration statements, including the method declaration line.
13.  CountStmtExe          The number of executable statements.
14.  Cyclomatic            Cyclomatic complexity (standard calculation method).
15.  CyclomaticModified    Cyclomatic complexity (the second calculation method).
16.  CyclomaticStrict      Cyclomatic complexity (the third calculation method).
17.  Essential             Essential complexity (standard calculation method).
18.  Knots                 A measure of overlapping jumps.
19.  MaxEssentialKnots     The maximum number of knots after structured programming constructs have been removed.
20.  MaxNesting            Maximum nesting level; related to cyclomatic complexity.
21.  MinEssentialKnots     The minimum number of knots after structured programming constructs have been removed.
22.  RatioCommentToCode    Ratio of comment lines to code lines.

enhanced the understanding of the relationship between the indicators' characteristics and the indicator feature values, thus increasing the ability of the model to generalize.

5.4. Experimental Results. In classification, there is often a phenomenon in which the model performs well on the training data but poorly on data outside of the training data set. In these cases, cross-validation can be used to evaluate the predictive performance of the model, especially its performance on new data, which can reduce overfitting to some extent. There are three general cross-validation methods: the Hold-Out Method, K-fold Cross-Validation, and Leave-One-Out Cross-Validation. In this experiment, the K-fold cross-validation method was selected. K-fold cross-validation randomly divides the training set into k parts, with k-1 parts used for training and the remaining part used for testing, repeating the process k times and thus obtaining k models to evaluate the performance of the model. In this experiment, we split data set D into two parts: 80% was used as training data and 20% was used as test data.

First, we group the data D into a training set to train the model and a test set to test its predictive performance. This process is repeated until all data have been used. Next, the K value is chosen to be 10, so that the data set is divided into 10 subsamples, 9 of which are used as training data while one is used as test data. Finally, the sensitivity of the model performance to data partitioning is reduced by averaging the results of the 10 groups [22].
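The 10-fold procedure described above can be sketched as follows. Here `evaluate` stands in for training and scoring any of the compared classifiers; the partitioning logic is a generic illustration rather than the authors' code:

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition n sample indices into k folds (here k = 10)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(evaluate, n, k=10):
    """evaluate(train_idx, test_idx) -> accuracy; returns the k-fold average."""
    folds = kfold_indices(n, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(evaluate(train, test))
    return sum(scores) / k

# A hypothetical classifier that is always right averages 1.0 across folds.
print(cross_validate(lambda train, test: 1.0, n=20))  # -> 1.0
```

Averaging over all k held-out folds is what reduces the sensitivity of the reported score to any single train/test split.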

5.4.1. Results in C/C++ Programs. In this experiment, programs written in the C/C++ language were used to conduct experiments, and the decision tree algorithm and other machine learning algorithms were compared with respect to accuracy and running time. The results are shown in Table 5.

Experiments showed that the decision tree algorithm achieved the best results in terms of both accuracy and running time, with accuracy reaching 82.55%. Second, the data feature selection was verified by comparing the 22 metrics without feature selection and the 16 metrics after feature selection, as shown in Table 6.
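The decision tree compared here is CART-style; as the paper's Conclusion notes, it was developed using the Gini index. The sketch below shows how the Gini impurity of a candidate split is computed, with toy labels and a hypothetical Cyclomatic feature; it is an illustration, not the authors' implementation:

```python
# Gini impurity and the weighted impurity of a threshold split, as used by
# CART-style decision trees to choose split points.

def gini(labels):
    """Gini impurity: 1 - sum over classes c of p_c^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels, threshold):
    """Weighted Gini impurity of splitting on `values <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["good", "good", "bad", "bad"]
cyclomatic = [1, 2, 9, 11]                 # hypothetical metric values
print(gini(labels))                        # -> 0.5 (maximally mixed)
print(split_gini(cyclomatic, labels, 2))   # -> 0.0 (perfect separation)
```

The tree greedily picks the feature/threshold pair minimizing this weighted impurity at each node, which is what makes informative metrics such as CountLine valuable split candidates.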


Table 4: Mutual information value calculation results.

Metrics               C/C++ dataset          Java dataset
                      MI value    Rank       MI value    Rank
CountInput            0.1892      11         0.6375      3
CountLine             0.5584      1          0.7996      1
CountLineCode         0.4902      2          0.6446      2
CountLineCodeDecl     0.4327      5          0.4373      11
CountLineCodeExe      0.3893      9          0.5         7
CountLineComment      0.3980      8          0.499       8
CountOutput           0.1625      12         0.3477      13
CountPath             NaN         13         0.3959      12
CountPathLog          NaN         13         0.2638      17
CountSemicolon        0.4307      7          0.5484      5
CountStmt             0.4465      4          0.5879      4
CountStmtDecl         0.4313      6          0.439       10
CountStmtExe          0.3434      10         0.5123      6
Cyclomatic            NaN         13         0.3122      14
CyclomaticModified    NaN         13         0.3122      15
CyclomaticStrict      NaN         13         0.2968      16
Essential             NaN         13         0.0099      20
Knots                 NaN         13         0.2508      18
MaxEssentialKnots     NaN         13         0.0099      21
MaxNesting            NaN         13         0.2236      19
MinEssentialKnots     NaN         13         0.0099      22
RatioCommentToCode    0.4603      3          0.4959      9

Table 5: C/C++ language dataset machine learning algorithm results.

                     SVM      Bayes    Adaboost   Random forest   Decision tree
Accuracy (%)         82.06    37.69    35.47      82.54           82.55
Recall rate (%)      67.8     30.23    34.25      68.65           68.68
Precision rate (%)   71.06    31.46    33.75      73.59           73.62
Running time (s)     576.32   19.98    19.02      17.94           17.03

It was proved by experiments that the running time of the buffer overflow vulnerability prediction algorithm, when using the A = {a1, a2, ..., a16} metrics after feature selection, is reduced from the previous 17.03 s to 15.94 s, while the accuracy is almost unchanged. Thus, feature selection not only helps to reduce the running time but also yields high accuracy and improves the efficiency of the vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source have a high probability of containing a buffer overflow vulnerability. These three types of data can be extracted according to the function name for further analysis.

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided buffer overflow vulnerabilities into eight categories. When there is a vulnerability in real software, it does not necessarily contain all types of buffer vulnerabilities; there may be only one or several of them. Moreover, when we collect data from real software, the data that may be vulnerable are much rarer than the data without vulnerabilities, possibly by a ratio as extreme as 1:10000. In the Java data set, because the experiment uses a data set extracted from real software, the data volume of Good Source and Bad Sink is very small, which results in extremely unbalanced data. It also leads to the precision and recall rates of both of these types being 0. The precision and recall rates for the other, basically balanced, vulnerability categories show better results. To maintain the distribution of the original data and the credibility of the results, data sampling is not used to balance the data, which demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. At the same time, the experiment proves that using the A = {a1, a2, ..., a16} metrics to predict buffer overflow vulnerabilities after feature selection reduces the running time while ensuring high accuracy.


Table 6: Detailed decision tree results on the C/C++ dataset (%).

                 Before feature selection         After feature selection
                 Precision   Recall   F1          Precision   Recall   F1
Good             88.09       83.56    85.76       88.03       83.47    85.69
Good Sink        61.27       52.91    56.78       61.27       52.91    56.78
Good Source      53.13       25.37    34.34       53.13       25.37    34.34
Good Auxiliary   77.16       96.34    85.69       77.16       96.34    85.69
Bad              77.08       49.23    60.09       77.08       49.23    60.09
Bad Sink         67.94       77.47    72.40       67.94       77.47    72.39
Bad Source       90.71       96.00    93.23       90.66       95.87    93.12
Average          73.63       68.70    69.76       73.61       68.67    71.05
Accuracy         82.55                            82.53
Time (s)         17.03                            15.94

Table 7: Accuracy and time of the Java dataset.

                     SVM      Bayes    Adaboost   Random forest   Decision tree
Accuracy (%)         87.16    57.95    49.40      87.41           87.44
Recall rate (%)      59.56    48.25    34.98      59.64           59.64
Precision rate (%)   58.74    44.28    33.11      58.91           58.93
Running time (s)     104.32   9.98     8.99       12.94           11.99

Table 8: Detailed decision tree results on the Java dataset (%).

                 Before feature selection         After feature selection
                 Precision   Recall   F1          Precision   Recall   F1
Good             95.85       93.04    94.42       96.11       93.04    94.55
Good Sink        47.15       53.10    49.95       47.15       53.10    49.95
Good Source      0           0        0           0           0        0
Good Auxiliary   98.56       98.13    98.35       98.56       98.13    98.34
Bad              77.27       73.99    75.60       77.27       73.99    75.60
Bad Sink         0           0        0           0           0        0
Bad Source       95.74       99.51    96.54       93.75       99.60    96.59
Average          59.22       59.69    59.27       58.98       59.69    59.33
Accuracy         87.44                            87.51
Time (s)         11.99                            7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested using real code and reliable software. Furthermore, the experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. This method has the following advantages. (1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics for source code are extracted by static analysis, and the dynamic metrics at the functional level are analysed according to the data flow of the function. The model can ensure the accuracy of vulnerability prediction while taking into account the change in data flow between functions in running programs. This is a new vulnerability prediction method based on data flow analysis in a running program. (2) In this paper, the data set contained two different languages, namely, C/C++ and Java, which fully validated the BOVP model and proved that the model does not depend on a specific language type. (3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. The most common and most difficult software security vulnerability is the buffer overflow vulnerability. The data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research. (4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, and it greatly improves the practicality of software vulnerability prediction research. This paper provides a new way for software metrics to study


data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then employs the method of dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and different types of buffer overflow vulnerability features, the decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model is verified by experiments using program source code written in different languages, and the prediction results are more accurate than those of other methods. The time taken for the data set collected using C/C++ was lower, and the accuracy rate was 82.53%. The accuracy rate for the data set using Java was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities in different types of languages.

Current buffer overflow vulnerabilities are very complicated and very common. Although they are divided into 8 categories here, these do not cover all buffer overflow vulnerabilities. Therefore, the classification of buffer overflow vulnerabilities should be more comprehensive in the future. There are still some shortcomings in the method. For example, there is no corresponding improvement for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research are mainly for open source software, with no detailed analysis of non-open-source software. In the future, we plan to solve the problem of unbalanced data, make up for the shortcomings of this model, and study a binary buffer overflow prediction model combined with dynamic symbolic execution technology to find increasingly complex types of buffer overflow vulnerabilities and also to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). The thirteenth page of the article has footnotes giving the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47-50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294-313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112-126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243-254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20-34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense (EC2ND '10), pp. 37-45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277-288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1-12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002-2009, 2016.

[12] M. Bozorgi, A Machine Learning Framework for Classifying Vulnerabilities and Predicting Exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622-627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23-33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1-21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15-24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395-407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET '17), pp. 703-706, 2017.



[4] I Chowdhury andM Zulkernine ldquoUsing complexity couplingand cohesion metrics as early indicators of vulnerabilitiesrdquoJournal of Systems Architecture vol 57 no 3 pp 294ndash313 2011

[5] M Frieb A Stegmeier J Mische et al ldquoLightweight hardwaresynchronization for avoiding buffer overflows in network-on-chipsrdquo in Proceedings of the International Conference onArchitecture of Computing Systems pp 112ndash126 Springer ChamSwitzerland 2018

[6] Z Wang X Ding C Pang J Guo J Zhu and B Mao ldquoTodetect stack buffer overflow with polymorphic canariesrdquo inProceedings of the 2018 48th Annual IEEEIFIP InternationalConference on Dependable Systems and Networks (DSN) pp243ndash254 IEEE June 2018

[7] J Duraes and H Madeira ldquoA methodology for the automatedidentification of buffer overflow vulnerabilities in executablesoftware without source-coderdquo in Dependable Computing pp20ndash34 Springer Berlin Germany 2005

[8] S Rawat and LMounier ldquoAn evolutionary computing approachfor hunting buffer overflow vulnerabilities a case of aimingin dim lightrdquo in Proceedings of the 6th European Conferenceon Computer Network Defense EC2ND rsquo10 pp 37ndash45 IEEEOctober 2010

[9] I A Dudina and A A Belevantsev ldquoUsing static symbolic exe-cution to detect buffer overflowsrdquo Programming and ComputerSoftware vol 43 no 5 pp 277ndash288 2017

[10] S Nashimoto N Homma Y I Hayashi et al ldquoBuffer overflowattack with multiple fault injection and a proven countermea-surerdquo Journal of Cryptographic Engineering vol 7 no 1 pp 1ndash122016

[11] J Rao Z He S Xu K Dai and X Zou ldquoBFWindow specula-tively checking data property consistency against buffer over-flow attacksrdquo IEICE Transaction on Information and Systemsvol E99D no 8 pp 2002ndash2009 2016

[12] MBozorgiAmachine learning framework for classifying vulner-abilities and predicting exploitability [PhD thesis] DissertationsandTheses - Gradworks 2009

[13] X Xu R Zhao and L Yan ldquoResearch and progress in softwaremetricsrdquo Journal of Information Engineering University vol 15no 5 pp 622ndash627 2014

[14] J Walden J Stuckman and R Scandariato ldquoPredicting vul-nerable components software metrics vs text miningrdquo in Pro-ceedings of the International Symposium on Software ReliabilityEngineering pp 23ndash33 IEEE 2014

[15] J Stuckman J Walden and R Scandariato ldquoThe effect ofdimensionality reduction on software vulnerability predictionmodelsrdquo IEEE Transactions on Reliability vol 99 no 1 pp 1ndash212017

[16] G R Choudhary S Kumar K Kumar A Mishra and CCatal ldquoEmpirical analysis of change metrics for software fault

Security and Communication Networks 13

predictionrdquo Computers and Electrical Engineering vol 67 pp15ndash24 2018

[17] S Rahimi and M Zargham ldquoVulnerability scrying method forsoftware vulnerability discovery prediction without a vulnera-bility databaserdquo IEEE Transactions on Reliability vol 62 no 2pp 395ndash407 2013

[18] L Breiman J H Friedman R Olshen et al ldquoClassification andregression treesrdquo Biometrics vol 40 no 3 article 356 1984

[19] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann Morgan Kaufmann 2001

[20] H Peng F Long and C Ding ldquoFeature selection basedon mutual information criteria of max-dependency max-relevance and min-redundancyrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 27 no 8 pp 1226ndash12382005

[21] P Viola and W M Wells III ldquoAlignment by maximization ofmutual informationrdquo IEEE Computer Society 1995

[22] P K Shamal K Rahamathulla and A Akbar ldquoA study onsoftware vulnerability prediction modelrdquo in Proceedings of the2nd IEEE International Conference on Wireless Communica-tions Signal Processing and Networking WiSPNET rsquo17 pp 703ndash706 2017

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: A Buffer Overflow Prediction Approach Based on Software ...downloads.hindawi.com/journals/scn/2019/8391425.pdfA Buffer Overflow Prediction Approach Based on Software Metrics and Machine

Security and Communication Networks

Table 4: Mutual information value calculation results.

Metric                    C/C++ dataset          Java dataset
                          MI value    Rank       MI value    Rank
CountInput                0.1892      11         0.6375      3
CountLine                 0.5584      1          0.7996      1
CountLineCode             0.4902      2          0.6446      2
CountLineCodeDecl         0.4327      5          0.4373      11
CountLineCodeExe          0.3893      9          0.5000      7
CountLineComment          0.3980      8          0.4990      8
CountOutput               0.1625      12         0.3477      13
CountPath                 NaN         13         0.3959      12
CountPathLog              NaN         13         0.2638      17
CountSemicolon            0.4307      7          0.5484      5
CountStmt                 0.4465      4          0.5879      4
CountStmtDecl             0.4313      6          0.4390      10
CountStmtExe              0.3434      10         0.5123      6
Cyclomatic                NaN         13         0.3122      14
CyclomaticModified        NaN         13         0.3122      15
CyclomaticStrict          NaN         13         0.2968      16
Essential                 NaN         13         0.0099      20
Knots                     NaN         13         0.2508      18
MaxEssentialKnots         NaN         13         0.0099      21
MaxNesting                NaN         13         0.2236      19
MinEssentialKnots         NaN         13         0.0099      22
RatioCommentToCode        0.4603      3          0.4959      9

Table 5: C/C++ dataset machine learning algorithm results.

                     SVM       Bayes     Adaboost   Random forest   Decision tree
Accuracy (%)         82.06     37.69     35.47      82.54           82.55
Recall rate (%)      67.80     30.23     34.25      68.65           68.68
Precision rate (%)   71.06     31.46     33.75      73.59           73.62
Running time (s)     576.32    19.98     19.02      17.94           17.03

Experiments showed that the running time of the buffer overflow vulnerability prediction algorithm, when using the metric set A = {a1, a2, ..., a16} obtained after feature selection, was reduced from 17.03 s to 15.94 s while the accuracy remained essentially unchanged. Feature selection therefore not only reduces the running time but also preserves accuracy, improving the efficiency of vulnerability prediction. In the final experimental results, the functions predicted as Bad, Bad Sink, and Bad Source are those with a high probability of containing a buffer overflow vulnerability; these three types of data can be extracted by function name for further analysis.
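The mutual-information ranking behind this feature selection (Table 4) can be sketched as follows. This is an assumed implementation using scikit-learn with synthetic, hypothetical metric columns, not the authors' code:

```python
# Sketch (assumed tooling, not the authors' implementation): rank software
# metrics by mutual information with the vulnerability label, as in Table 4,
# and keep only the top-ranked metrics before training the classifier.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)

# Two hypothetical metric columns: "count_line" correlates with the label,
# "count_output" is pure noise, mirroring the high/low MI values in Table 4.
count_line = labels * 10 + rng.normal(0, 1, size=n)    # informative metric
count_output = rng.normal(0, 1, size=n)                # uninformative metric
X = np.column_stack([count_line, count_output])

mi = mutual_info_classif(X, labels, random_state=0)
ranking = np.argsort(mi)[::-1]   # metric indices sorted by MI, descending
print(ranking[0])                # the informative metric ranks first
```

A top-k cut over such a ranking yields the reduced metric set A = {a1, a2, ..., a16} referred to above.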

5.4.2. Results in Java Programs. The experimental results for the data set extracted from programs written in the Java language are shown in Tables 7 and 8.

In this experiment, we divided buffer overflow vulnerabilities into eight categories. A vulnerable piece of real software does not necessarily contain all types of buffer vulnerabilities; there may be only one, or several, of them. Moreover, when we collect data from real software, potentially vulnerable data is far scarcer than non-vulnerable data, and the ratio may even reach 1:10,000. In the Java data set, because the data were extracted from real software, the volume of Good Source and Bad Sink samples is very small, which results in extremely unbalanced data and leads to both the precision and the recall rate of these two classes being 0. The precision and recall of the other, largely balanced vulnerability classes are better. To maintain the distribution of the original data and the credibility of the results, data sampling was not used to balance the data; this demonstrates the validity of the BOVP model for predicting most types of buffer overflow vulnerabilities in real software. The experiment also confirms that using the metric set A = {a1, a2, ..., a16} after feature selection reduces the running time while maintaining high accuracy.
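The per-class effect of this imbalance can be reproduced with a small sketch (an illustration on synthetic data, not the authors' code or data): a Gini-criterion decision tree trained on a heavily skewed two-class set tends to ignore the rare class, driving its precision and recall toward 0, as happened for Good Source and Bad Sink in Table 8.

```python
# Sketch: per-class precision/recall of a Gini decision tree on an
# imbalanced synthetic data set (990 majority vs 10 rare samples whose
# features overlap the majority class).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1, (990, 4)),   # majority class 0
               rng.normal(0.1, 1, (10, 4))])   # rare class 1, overlapping
y = np.array([0] * 990 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
clf.fit(X_tr, y_tr)

prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), labels=[0, 1], zero_division=0)
print(prec, rec)   # the rare class's scores are typically near 0
```

Resampling (over- or undersampling) would lift the rare class's scores, but as noted above it was deliberately avoided to keep the original class distribution.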


Table 6: Detailed results of the decision tree algorithm (C/C++ dataset).

                  Before feature selection         After feature selection
                  Precision   Recall   F1          Precision   Recall   F1
Good              88.09       83.56    85.76       88.03       83.47    85.69
Good Sink         61.27       52.91    56.78       61.27       52.91    56.78
Good Source       53.13       25.37    34.34       53.13       25.37    34.34
Good Auxiliary    77.16       96.34    85.69       77.16       96.34    85.69
Bad               77.08       49.23    60.09       77.08       49.23    60.09
Bad Sink          67.94       77.47    72.40       67.94       77.47    72.39
Bad Source        90.71       96.00    93.23       90.66       95.87    93.12
Average           73.63       68.70    69.76       73.61       68.67    71.05
Accuracy          82.55                            82.53
Time (s)          17.03                            15.94

Table 7: Accuracy and time on the Java dataset.

                     SVM       Bayes    Adaboost   Random forest   Decision tree
Accuracy (%)         87.16     57.95    49.40      87.41           87.44
Recall rate (%)      59.56     48.25    34.98      59.64           59.64
Precision rate (%)   58.74     44.28    33.11      58.91           58.93
Running time (s)     104.32    9.98     8.99       12.94           11.99

Table 8: Detailed results of the decision tree algorithm (Java dataset).

                  Before feature selection (%)     After feature selection (%)
                  Precision   Recall   F1          Precision   Recall   F1
Good              95.85       93.04    94.42       96.11       93.04    94.55
Good Sink         47.15       53.10    49.95       47.15       53.10    49.95
Good Source       0           0        0           0           0        0
Good Auxiliary    98.56       98.13    98.35       98.56       98.13    98.34
Bad               77.27       73.99    75.60       77.27       73.99    75.60
Bad Sink          0           0        0           0           0        0
Bad Source        95.74       99.51    96.54       93.75       99.60    96.59
Average           59.22       59.69    59.27       58.98       59.69    59.33
Accuracy          87.44                            87.51
Time (s)          11.99                            7.59

6. Discussion

The software vulnerability analysis method proposed in this paper has good capability and has been applied and tested on real, reliable software code. The experiments verified the effectiveness of the BOVP model and provided guidance for its effective use. The model minimized the probability of misclassification while finding more vulnerabilities. The method has the following advantages:

(1) Given the relative advantages of static and dynamic analysis of software vulnerabilities, the static metrics are extracted from the source code by static analysis, while the dynamic metrics at the functional level are analysed according to the data flow of each function. The model can therefore ensure the accuracy of vulnerability prediction while accounting for the change in data flow between functions in running programs. This is a new vulnerability prediction method based on data-flow analysis of a running program.

(2) The data set contained two different languages, namely C/C++ and Java, which fully validated the BOVP model and showed that it does not depend on a specific language type.

(3) The model can be applied to vulnerabilities caused by multiple types of buffer overflows. Buffer overflow is the most common and most difficult software security vulnerability, and the data set contains multiple types of buffer overflow vulnerabilities, providing valuable data for analysing different types of buffer overflows and offering a basis for future research.

(4) To some extent, this research saves investment costs, time, and human resources in software development. Applying feature selection without affecting the prediction results reduces the dimensionality and the experimental overhead, which greatly improves the practicality of software vulnerability prediction. This paper thus provides a new way of using software metrics for data collection in software vulnerability prediction. However, the method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then applies dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, a decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model was verified by experiments on program source code written in different languages, and its predictions are more accurate than those of other methods. On the C/C++ data set the running time was lower and the accuracy rate reached 82.53%; on the Java data set the accuracy rate was also high, reaching 87.51%. This fully demonstrates that the model is robust in predicting buffer overflow vulnerabilities across different languages.
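As a reminder of the splitting criterion named above, the Gini index of a node can be computed as follows (a minimal sketch; the paper does not give implementation details):

```python
# Sketch of the Gini impurity used by the decision tree to score splits:
# gini = 1 - sum_k p_k^2, where p_k is the proportion of class k at a node.
from collections import Counter

def gini(labels):
    """Gini impurity of a node containing the given class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Good"] * 5 + ["Bad"] * 5))  # 0.5: maximally impure for 2 classes
print(gini(["Good"] * 10))               # 0.0: pure node
```

At each split, the tree chooses the feature and threshold that most reduce the weighted Gini impurity of the child nodes.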

Buffer overflow vulnerabilities today are complicated and common. Although we divided them into eight categories, these categories do not cover all buffer overflow vulnerabilities, so the classification should be made more comprehensive in the future. The method also has some shortcomings: no corresponding improvement is made for the imbalance between Good Source and Bad Sink in the Java data set, and the research is aimed mainly at open-source software, with no detailed analysis of non-open-source software. In the future, we plan to address the problem of unbalanced data, make up for the shortcomings of this model, and study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, both to find increasingly complex types of buffer overflow vulnerabilities and to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the NIST Software Assurance Reference Dataset website (https://samate.nist.gov/SRD/testsuite.php).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense, EC2ND '10, pp. 37–45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99-D, no. 8, pp. 2002–2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses – Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET '17, pp. 703–706, 2017.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 11: A Buffer Overflow Prediction Approach Based on Software ...downloads.hindawi.com/journals/scn/2019/8391425.pdfA Buffer Overflow Prediction Approach Based on Software Metrics and Machine

Security and Communication Networks 11

Table 6 Decision tree algorithm-specific operation results

Before feature selection After feature selectionPrecision Recall F1 Precision Recall F1

Good 8809 8356 8576 8803 8347 8569Good Sink 6127 5291 5678 6127 5291 5678Good Source 5313 2537 3434 5313 2537 3434Good Auxiliary 7716 9634 8569 7716 9634 8569Bad 7708 4923 6009 7708 4923 6009Bad Sink 6794 7747 724 6794 7747 7239Bad Source 9071 96 9323 9066 9587 9312Average 7363 687 6976 7361 6867 7105Accuracy 8255 8253Time 1703 1594

Table 7 Accuracy and time of Java dataset

SVM Bayes Adaboost Random forest Decision treeAccuracy () 8716 5795 4940 8741 8744Recall rate () 5956 4825 3498 5964 5964Precision rate () 5874 4428 3311 5891 5893Running time (s) 10432 998 899 1294 1199

Table 8 Decision tree algorithm-specific operation results

Before feature selection () After feature selection ()Precision Recall F1 Precision Recall F1

Good 9585 9304 9442 9611 9304 9455Good Sink 4715 531 4995 4715 531 4995Good Source 0 0 0 0 0 0Good Auxiliary 9856 9813 9835 9856 9813 9834Bad 7727 7399 756 7727 7399 756Bad Sink 0 0 0 0 0 0Bad Source 9574 9951 9654 9375 996 9659Average 5922 5969 5927 5898 5969 5933Accuracy 8744 8751Time 1199 759

6 Discussion

The software vulnerability analysis method proposed in thispaper has good capability and has been applied and testedusing real code and reliable software Furthermore theexperiments verified the effectiveness of theBOVPmodel andprovided guidance for its effective useThemodel minimizedthe probability of misclassification while finding more vul-nerabilities This method has the following advantages (1)Given the relative advantages of static and dynamic analysisof software vulnerabilities the static metrics for source codeare extracted by static analysis and the dynamic metricsat the functional level are analysed according to the dataflow of the function The model can ensure the accuracy ofvulnerability prediction and take into account the change indata flow between functions in running programs This isa new vulnerability prediction method based on data flow

analysis in a running program (2) In this paper the dataset contained two different languages namely CC++ andJava and thus fully validated the validity of the BOVPmodeland proved that the model does not depend on a specificlanguage type (3)Themodel can be applied to vulnerabilitiescaused by multiple types of buffer overflows The mostcommon and most difficult software security vulnerability isbuffer overflow vulnerability The data set contains multipletypes of buffer overflow vulnerabilities providing valuabledata for analysing different types of buffer overflows andoffering a basis for future research (4) To some extent thisresearch saves investment costs time and human resourcesfor software development Applying feature selection withoutaffecting the prediction results reduces the dimension andthe experimental overhead and it greatly improves thepracticality of the software vulnerability prediction studyThis paper provides a new way for software metrics to study

12 Security and Communication Networks

data collection in software vulnerability prediction Howeverthis method is only suitable for source code not for nonopensource software

7 Conclusion

The BOVP model proposed in this paper preprocesses theprogram source code and then employs the method ofdynamic data stream analysis on the software functional levelBy studying the characteristics of practical software codeand different types of buffer overflow vulnerability featuresthe decision tree algorithm was developed through featureselection and the Gini index Finally a vulnerability predic-tion method based on software metrics was constructed Thecapability of theBOVPmodel is verified by experiments usingprogram source code written in different languages and theprediction results are more accurate than other methodsThetime taken for the data set collected using CC++ was lessand the accuracy rate was 8253 The accuracy rate of thedata set using Java was also high reaching 8751 which fullydemonstrated that this model is robust in predicting bufferoverflow vulnerability in different types of languages

The current buffer overflow vulnerability is very com-plicated and very common Although it is divided into 8categories it does not contain all of the buffer overflowvulnerabilities Therefore the classification of buffer over-flow vulnerabilities should be more comprehensive in thefuture There are still some shortcomings in the methodFor example there is no corresponding improvement inthe imbalance between Good Source and Bad Sink in theJava data set and the methods of this research are mainlyfor open source software It is believed that there is nodetailed analysis of nonopen source software In the futurewe plan to solve the problem of unbalanced data make upfor the shortcomings of this model and study the binarybuffer overflow prediction model combined with dynamicsymbolic execution technology to find increasingly complextypes of buffer overflow vulnerabilities and also forecastvulnerabilities for nonopen source software

Data Availability

The original data (Juliet Test Suite for CC++ and JulietTest Suite for Java) used to support the findings ofthis study can be download publicly from the website(httpssamatenistgovSRDtestsuitephp) The thirteenthpage of the articles has footnotes giving the link address

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by the National Key RampD Program ofChina under Grant no 2016YFB0800700 the National Natu-ral Science Foundation of China under Grants nos 6180233261807028 61772449 61772451 61572420 and 61472341 and

the Natural Science Foundation of Hebei Province Chinaunder Grant no F2016203330

References

[1] C Cowan P Wagle C Pu et al ldquoBuffer overflows attacks anddefenses for the vulnerability of the decaderdquo IEEE 2003

[2] SWu SoftwareVulnerabilityAnalysis Technology Science Press2014

[3] Y Shin and L Williams ldquoIs complexity really the enemy ofsoftware securityrdquo in Proceedings of the ACM Workshop onQuality of Protection pp 47ndash50 ACM 2008

[4] I Chowdhury andM Zulkernine ldquoUsing complexity couplingand cohesion metrics as early indicators of vulnerabilitiesrdquoJournal of Systems Architecture vol 57 no 3 pp 294ndash313 2011

[5] M Frieb A Stegmeier J Mische et al ldquoLightweight hardwaresynchronization for avoiding buffer overflows in network-on-chipsrdquo in Proceedings of the International Conference onArchitecture of Computing Systems pp 112ndash126 Springer ChamSwitzerland 2018

[6] Z Wang X Ding C Pang J Guo J Zhu and B Mao ldquoTodetect stack buffer overflow with polymorphic canariesrdquo inProceedings of the 2018 48th Annual IEEEIFIP InternationalConference on Dependable Systems and Networks (DSN) pp243ndash254 IEEE June 2018

[7] J Duraes and H Madeira ldquoA methodology for the automatedidentification of buffer overflow vulnerabilities in executablesoftware without source-coderdquo in Dependable Computing pp20ndash34 Springer Berlin Germany 2005

[8] S Rawat and LMounier ldquoAn evolutionary computing approachfor hunting buffer overflow vulnerabilities a case of aimingin dim lightrdquo in Proceedings of the 6th European Conferenceon Computer Network Defense EC2ND rsquo10 pp 37ndash45 IEEEOctober 2010

[9] I A Dudina and A A Belevantsev ldquoUsing static symbolic exe-cution to detect buffer overflowsrdquo Programming and ComputerSoftware vol 43 no 5 pp 277ndash288 2017

[10] S Nashimoto N Homma Y I Hayashi et al ldquoBuffer overflowattack with multiple fault injection and a proven countermea-surerdquo Journal of Cryptographic Engineering vol 7 no 1 pp 1ndash122016

[11] J Rao Z He S Xu K Dai and X Zou ldquoBFWindow specula-tively checking data property consistency against buffer over-flow attacksrdquo IEICE Transaction on Information and Systemsvol E99D no 8 pp 2002ndash2009 2016

[12] MBozorgiAmachine learning framework for classifying vulner-abilities and predicting exploitability [PhD thesis] DissertationsandTheses - Gradworks 2009

[13] X Xu R Zhao and L Yan ldquoResearch and progress in softwaremetricsrdquo Journal of Information Engineering University vol 15no 5 pp 622ndash627 2014

[14] J Walden J Stuckman and R Scandariato ldquoPredicting vul-nerable components software metrics vs text miningrdquo in Pro-ceedings of the International Symposium on Software ReliabilityEngineering pp 23ndash33 IEEE 2014

[15] J Stuckman J Walden and R Scandariato ldquoThe effect ofdimensionality reduction on software vulnerability predictionmodelsrdquo IEEE Transactions on Reliability vol 99 no 1 pp 1ndash212017

[16] G R Choudhary S Kumar K Kumar A Mishra and CCatal ldquoEmpirical analysis of change metrics for software fault

Security and Communication Networks 13

predictionrdquo Computers and Electrical Engineering vol 67 pp15ndash24 2018

[17] S Rahimi and M Zargham ldquoVulnerability scrying method forsoftware vulnerability discovery prediction without a vulnera-bility databaserdquo IEEE Transactions on Reliability vol 62 no 2pp 395ndash407 2013

[18] L Breiman J H Friedman R Olshen et al ldquoClassification andregression treesrdquo Biometrics vol 40 no 3 article 356 1984

[19] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann Morgan Kaufmann 2001

[20] H Peng F Long and C Ding ldquoFeature selection basedon mutual information criteria of max-dependency max-relevance and min-redundancyrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 27 no 8 pp 1226ndash12382005

[21] P Viola and W M Wells III ldquoAlignment by maximization ofmutual informationrdquo IEEE Computer Society 1995

[22] P K Shamal K Rahamathulla and A Akbar ldquoA study onsoftware vulnerability prediction modelrdquo in Proceedings of the2nd IEEE International Conference on Wireless Communica-tions Signal Processing and Networking WiSPNET rsquo17 pp 703ndash706 2017


12 Security and Communication Networks

data collection in software vulnerability prediction. However, this method is only suitable for source code, not for non-open-source software.

7. Conclusion

The BOVP model proposed in this paper preprocesses the program source code and then applies dynamic data stream analysis at the software functional level. By studying the characteristics of practical software code and the features of different types of buffer overflow vulnerabilities, a decision tree algorithm was developed through feature selection and the Gini index. Finally, a vulnerability prediction method based on software metrics was constructed. The capability of the BOVP model was verified by experiments on program source code written in different languages, and its predictions were more accurate than those of other methods. The time taken on the data set collected using C/C++ was lower, and the accuracy rate was 82.53%. The accuracy rate on the Java data set was also high, reaching 87.51%, which fully demonstrates that this model is robust in predicting buffer overflow vulnerabilities across different languages.
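The Gini-index-based split selection mentioned above can be illustrated with a short sketch. This is not the authors' implementation; it is a minimal CART-style example with hypothetical function-level metric features (cyclomatic complexity and lines of code) showing how a decision tree picks the threshold that minimizes weighted Gini impurity:

```python
# Illustrative sketch (assumed feature names, not the paper's actual metrics):
# choosing a decision-tree split on software-metric features by minimizing
# the weighted Gini index, as in CART.

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(samples, labels, feature_idx):
    """Best threshold for one numeric feature by weighted Gini impurity."""
    values = sorted({s[feature_idx] for s in samples})
    best_t, best_score = None, float("inf")
    for i in range(1, len(values)):
        t = (values[i - 1] + values[i]) / 2  # midpoint candidate threshold
        left = [y for s, y in zip(samples, labels) if s[feature_idx] <= t]
        right = [y for s, y in zip(samples, labels) if s[feature_idx] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical function-level metric vectors: (cyclomatic complexity, LOC),
# label 1 = buffer-overflow-prone, 0 = safe.
X = [(2, 10), (3, 12), (8, 40), (9, 55)]
y = [0, 0, 1, 1]
threshold, score = best_split(X, y, 0)
# Splitting on complexity separates the classes perfectly here (score 0.0).
```

A full tree recurses this split selection on each partition; the feature-selection step in the paper decides which metrics are offered as candidate `feature_idx` values.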

Buffer overflow vulnerabilities today are both very complicated and very common. Although they are divided into 8 categories here, these categories do not cover all buffer overflow vulnerabilities, so the classification should be made more comprehensive in the future. The method also has some shortcomings. For example, no corresponding improvement was made for the imbalance between Good Source and Bad Sink in the Java data set, and the methods of this research mainly target open-source software; non-open-source software was not analyzed in detail. In the future, we plan to address the problem of unbalanced data, make up for the shortcomings of this model, and study a binary buffer overflow prediction model combined with dynamic symbolic execution technology, in order to find increasingly complex types of buffer overflow vulnerabilities and also to forecast vulnerabilities in non-open-source software.

Data Availability

The original data (Juliet Test Suite for C/C++ and Juliet Test Suite for Java) used to support the findings of this study can be downloaded publicly from the website (https://samate.nist.gov/SRD/testsuite.php). Footnotes on the thirteenth page of the article give the link address.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China under Grant no. 2016YFB0800700; the National Natural Science Foundation of China under Grants nos. 61802332, 61807028, 61772449, 61772451, 61572420, and 61472341; and the Natural Science Foundation of Hebei Province, China, under Grant no. F2016203330.

References

[1] C. Cowan, P. Wagle, C. Pu et al., "Buffer overflows: attacks and defenses for the vulnerability of the decade," IEEE, 2003.

[2] S. Wu, Software Vulnerability Analysis Technology, Science Press, 2014.

[3] Y. Shin and L. Williams, "Is complexity really the enemy of software security?" in Proceedings of the ACM Workshop on Quality of Protection, pp. 47–50, ACM, 2008.

[4] I. Chowdhury and M. Zulkernine, "Using complexity, coupling and cohesion metrics as early indicators of vulnerabilities," Journal of Systems Architecture, vol. 57, no. 3, pp. 294–313, 2011.

[5] M. Frieb, A. Stegmeier, J. Mische et al., "Lightweight hardware synchronization for avoiding buffer overflows in network-on-chips," in Proceedings of the International Conference on Architecture of Computing Systems, pp. 112–126, Springer, Cham, Switzerland, 2018.

[6] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao, "To detect stack buffer overflow with polymorphic canaries," in Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 243–254, IEEE, June 2018.

[7] J. Duraes and H. Madeira, "A methodology for the automated identification of buffer overflow vulnerabilities in executable software without source-code," in Dependable Computing, pp. 20–34, Springer, Berlin, Germany, 2005.

[8] S. Rawat and L. Mounier, "An evolutionary computing approach for hunting buffer overflow vulnerabilities: a case of aiming in dim light," in Proceedings of the 6th European Conference on Computer Network Defense, EC2ND '10, pp. 37–45, IEEE, October 2010.

[9] I. A. Dudina and A. A. Belevantsev, "Using static symbolic execution to detect buffer overflows," Programming and Computer Software, vol. 43, no. 5, pp. 277–288, 2017.

[10] S. Nashimoto, N. Homma, Y. I. Hayashi et al., "Buffer overflow attack with multiple fault injection and a proven countermeasure," Journal of Cryptographic Engineering, vol. 7, no. 1, pp. 1–12, 2016.

[11] J. Rao, Z. He, S. Xu, K. Dai, and X. Zou, "BFWindow: speculatively checking data property consistency against buffer overflow attacks," IEICE Transactions on Information and Systems, vol. E99.D, no. 8, pp. 2002–2009, 2016.

[12] M. Bozorgi, A machine learning framework for classifying vulnerabilities and predicting exploitability [Ph.D. thesis], Dissertations and Theses - Gradworks, 2009.

[13] X. Xu, R. Zhao, and L. Yan, "Research and progress in software metrics," Journal of Information Engineering University, vol. 15, no. 5, pp. 622–627, 2014.

[14] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: software metrics vs text mining," in Proceedings of the International Symposium on Software Reliability Engineering, pp. 23–33, IEEE, 2014.

[15] J. Stuckman, J. Walden, and R. Scandariato, "The effect of dimensionality reduction on software vulnerability prediction models," IEEE Transactions on Reliability, vol. 99, no. 1, pp. 1–21, 2017.

[16] G. R. Choudhary, S. Kumar, K. Kumar, A. Mishra, and C. Catal, "Empirical analysis of change metrics for software fault prediction," Computers and Electrical Engineering, vol. 67, pp. 15–24, 2018.

[17] S. Rahimi and M. Zargham, "Vulnerability scrying method for software vulnerability discovery prediction without a vulnerability database," IEEE Transactions on Reliability, vol. 62, no. 2, pp. 395–407, 2013.

[18] L. Breiman, J. H. Friedman, R. Olshen et al., "Classification and regression trees," Biometrics, vol. 40, no. 3, article 356, 1984.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[21] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," IEEE Computer Society, 1995.

[22] P. K. Shamal, K. Rahamathulla, and A. Akbar, "A study on software vulnerability prediction model," in Proceedings of the 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET '17, pp. 703–706, 2017.
