
Research Article
Fault Localization Analysis Based on Deep Neural Network

Wei Zheng, Desheng Hu, and Jing Wang

College of Software & Microelectronics, Northwestern Polytechnical University, Xi'an, China

Correspondence should be addressed to Desheng Hu; hudeshengnpu@gmail.com

Received 24 December 2015; Accepted 31 March 2016

Academic Editor: Wen Chen

Copyright © 2016 Wei Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With software's increasing scale and complexity, software failure is inevitable. To date, although many kinds of software fault localization methods have been proposed, each with its own achievements, they also have limitations. In particular, for fault localization techniques based on machine learning, the models available in the literature are all shallow-architecture algorithms. Owing to shortcomings such as a restricted ability to express complex functions under a limited amount of sample data and restricted generalization ability for intricate problems, faults cannot be analyzed accurately via those methods. To that end, we propose a fault localization method based on a deep neural network (DNN). This approach is capable of approximating complex functions and attaining a distributed representation of the input data by learning a deep nonlinear network structure, and it shows a strong capability to learn representations from a small training dataset. Our DNN-based model is trained using the coverage data and the results of test cases as input, and we further locate the faults by testing the trained model with a virtual test suite. This paper conducts experiments on the Siemens suite and the Space program. The results demonstrate that our DNN-based fault localization technique outperforms other fault localization methods such as BPNN and Tarantula.

1. Introduction

Many efforts have been made to debug a generic program, especially in the stage of identifying where the bugs are, which is known as fault localization. Fault localization has proved to be one of the most expensive and time-consuming debugging activities. With software's increasing scale and complexity, debugging activities become more difficult to perform; thus, there is a high demand for automatic fault localization techniques that can guide programmers to the locations of faults [1]. Such techniques facilitate the software development process and reduce maintenance cost [2].

At present, many kinds of software fault localization methods have been proposed. The probabilistic program dependence graph (PPDG), presented by Baah et al. [3], computes the conditional probability of each node to locate a fault. Tarantula, proposed by Jones and Harrold [4], holds that a program entity executed by failed test cases should be suspected, and it uses different colors to represent degrees of suspiciousness. SOBER, proposed by Liu et al. [5], uses the differences in predicate truth values between successful and failed executions to guide the search for faults. CBT, namely, the crosstab-based technique, is proposed by Wong et al. [6] to calculate the suspiciousness of each executable statement as its examination priority. Meanwhile, other fault localization techniques, such as delta debugging and predicate switching, are presented in [7, 8]. By modifying variables or their values at a particular point during execution to change the program state, these techniques are able to identify the factor that triggers a program failure. Specifically, Zeller and Hildebrandt [7] propose delta debugging to reduce the failure-triggering factors to a small set of variables by comparing the program state difference between a failed test case and a successful one. Predicate switching, presented by Zhang et al. [8], alters the program state of a failed execution by changing a predicate; the predicate is labeled a critical predicate if switching it makes the program execute successfully.

Being robust and widely applicable, many machine learning and data mining techniques have been adopted to facilitate fault localization in recent years [9]. Wong and Qi propose a backpropagation (BP) neural network approach [10], which utilizes the coverage data of test cases (e.g., the coverage data with respect to which statements are executed by which test case)


and their corresponding execution results to train the network, and then inputs the coverage of a set of virtual test cases (e.g., each test case covers only one statement) into the trained network; the outputs are regarded as the suspiciousness (i.e., likelihood of containing the bug) of each executable statement. But the BP neural network is prone to problems such as local minima. Aiming to solve this problem, Wong et al. propose another fault localization method based on a radial basis function (RBF) network [11], which is less sensitive to those problems. Although machine learning techniques exhibit good performance in the field of fault localization, the models they use for fault localization all have shallow architectures. With shortcomings like a limited ability to express complex functions under a limited amount of sample data, as well as restricted generalization ability for intricate problems, the faults cannot be analyzed exactly via those methods.

This paper proposes a fault localization method based on a deep neural network (DNN). With the capability of estimating complicated functions by learning a deep nonlinear network structure and further attaining a distributed representation of the input data, this method exhibits a strong ability to learn representations from a small amount of sample data. Moreover, DNN is one of the deep learning models that have been successfully applied in many other areas of software engineering [12, 13]. For example, researchers at Microsoft Research adopted DNNs to decrease the error rate of speech recognition [12], one of the greatest breakthroughs in that field in the recent ten years. We use the Siemens suite and the Space program as platforms to evaluate and demonstrate the effectiveness of the DNN-based fault localization technique. The remainder of this paper is organized as follows. Section 2 provides an introduction to some related studies. In Section 3 we elaborate our DNN-based fault localization method and provide a toy example to help readers understand it. We conduct empirical studies on two suites (i.e., the Siemens suite and the Space program) in Section 4. Section 5 follows, where we report the results of the performance (effectiveness) comparison between our method and others and make a discussion. Section 6 lists possible threats to the validity of our approach. We present our conclusion and future work in Section 7.

2. Related Work

Recent years have witnessed the successful application of machine learning techniques in the field of fault localization. However, many models with shallow architectures, such as the BP neural network and the support vector machine, encounter drawbacks like a restricted ability to express complex functions under a limited amount of sample data, and their generalization ability for intricate problems is also restrained. With the rapid development of deep learning, many researchers have begun to adopt deep neural networks to tackle the limitations of shallow architectures. Before presenting our DNN-based fault localization technique, we introduce some related work that contributes to our novel approach.

2.1. Deep Neural Network Model. The deep neural network was first presented by Hinton et al. in 2006 [14]. The rationale of DNN is that the neural network model is first divided into a number of two-layer models before we learn the whole model; we then train these two-layer neural network models layer by layer and finally obtain the initial weights of the multilayer neural network by composing the trained two-layer networks, a process called layerwise pretraining [15]. The hidden layer of a neural network can extract features from the input layer due to its abstraction. Thus, neural networks with multiple hidden layers are better at network processing and network generalization and achieve a faster convergence rate. We elaborate here on the theory of deep neural networks, which is cited as the basis of our technique.

A DNN is a sort of feed-forward artificial neural network with multiple hidden layers, and each node in the same hidden layer uses the same nonlinear function to map the feature input from the layer below to the current node. The DNN structure is very flexible due to its multiple hidden layers and multiple hidden nodes, so a DNN demonstrates an excellent capacity to fit the highly complex nonlinear relationship between inputs and outputs.

Generally, a DNN model can be utilized for regression or classification. In this paper, the model is considered a classification model, but in order to output continuous suspiciousness values, we do not quantize the activation value of the DNN's last layer to an integer. The relationship between inputs and outputs in the DNN model can be interpreted as follows:

$$\begin{aligned} v^{0} &= \text{input}, \\ v^{l+1} &= \rho\left(z^{l}(v^{l})\right), \\ z^{l}(v^{l}) &= w^{l} v^{l} + b^{l}, \quad 0 \le l < L, \\ \text{output} &= v^{L}. \end{aligned} \tag{1}$$

According to the above formulas, we obtain the final output by transforming the feature vector of the first layer, $v^0$, into a processed feature vector $v^L$ through $L$ layers of nonlinear transformation. During the training process of the DNN model, we need to determine the weight matrix $w^l$ and offset vector $b^l$ of the $l$th layer. By utilizing the difference between the target outputs and the actual outputs to construct a cost function, we can train the DNN by the backpropagation (BP) algorithm.
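To make the forward process in (1) concrete, here is a minimal numpy sketch (not the authors' implementation; the 12-4-4-4-1 layer sizes and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(s):
    # the sigmoid transfer function rho(s) = 1 / (1 + e^(-s)), cf. Section 2.1.3
    return 1.0 / (1.0 + np.exp(-s))

def forward(v0, weights, biases):
    """Forward process of equation (1): v^(l+1) = rho(w^l v^l + b^l)."""
    v = v0
    for w, b in zip(weights, biases):
        v = sigmoid(w @ v + b)
    return v  # v^L, the network output

# Illustrative 12-4-4-4-1 network with small random initial parameters.
rng = np.random.default_rng(0)
sizes = [12, 4, 4, 4, 1]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
coverage_vector = rng.integers(0, 2, size=12).astype(float)
print(forward(coverage_vector, weights, biases))
```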

The design of a DNN model mainly includes steps such as choosing the number of network layers, the number of nodes in each layer, the transfer function between the layers, and so forth.

2.1.1. The Design of the Network Layers. Generally, a deep neural network consists of three parts: the input layer, the hidden layers, and the output layer. The study of the neural network layers mainly aims to identify the number of hidden layers, which determines the number of layers of the network. In a neural network, a hidden layer has an abstracting effect and can extract features from the input. At the same time, the number


of hidden layers directly determines the network's ability to extract features. If the number of hidden layers is too small, the network cannot represent and fit intricate problems with a limited amount of sample data, while the processing ability of the entire network improves as the number of hidden layers increases. However, too many hidden layers also produce adverse effects, such as increased computational complexity and local minima. Thus we conclude that too many or too few hidden layers are both unfavorable for network training; we must choose an appropriate number of network layers to suit the practical problem at hand.

2.1.2. The Design of the Number of Nodes. Usually, the numbers of input layer and output layer nodes can be determined directly once the training samples of the deep neural network are confirmed; thus, determining the number of nodes in the hidden layers is most critical. If we set up too many hidden nodes, training time increases and the neural network is trained excessively (i.e., it remembers unnecessary information), leading to problems like overfitting, whereas with too few hidden nodes the network cannot handle complex problems, as its ability to capture information becomes poorer. The number of hidden nodes has a direct relationship with the inputs and outputs, and it needs to be determined through several experiments.

2.1.3. The Design of the Transfer Function. The transfer function reflects the complex relationship between the input and the output. Different problems are suited to different transfer functions. In our work, we use the common sigmoid function as the nonlinear transfer function:

$$\rho(s) = \frac{1}{1 + e^{-s}}. \tag{2}$$

Here $s$ is the input and $\rho(s)$ is the output.

2.1.4. The Design of the Learning Rate and Impulse Factor. The learning rate and impulse factor have an important impact on the DNN as well. A smaller learning rate increases training time and slows down convergence; a larger one, on the other hand, causes the network to oscillate. For the impulse factor, we can use its "inertia" to damp such oscillations. We choose the appropriate learning rate and impulse factor according to the size of the sample.

2.2. Learning Algorithm of DNN. There exist various pretraining methods for DNN; mainly, they are divided into two categories, namely, unsupervised pretraining and supervised pretraining.

2.2.1. Unsupervised Pretraining. Unsupervised pretraining usually adopts the Restricted Boltzmann Machine (RBM) to initialize a DNN; a practical guide to training RBMs is given by Hinton in [16]. The structure of an RBM is depicted in Figure 1. The RBM consists of two layers: the first is the visible layer with visible units, and the second is the hidden layer with hidden units.

Figure 1: Architecture of a Restricted Boltzmann Machine (visible variables v, hidden variables h, and weights W in a bipartite structure).

Figure 2: Architecture of supervised pretraining (DNN-x, DNN-(x+1), DNN-(x+2)).

Once the unsupervised pretraining has been completed, the obtained parameters are used as the initial weights of the DNN.

2.2.2. Supervised Pretraining. In order to reduce the inaccuracy of unsupervised training, we adopt the supervised pretraining proposed in [17]. The general architecture is shown in Figure 2. It works as follows: the BP algorithm is used to train a neural network; after every round of pretraining, the topmost layer is replaced by a randomly initialized hidden layer and a new random topmost layer, and the network is trained again, repeatedly, until convergence or until the expected number of hidden layers is reached.
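The paper gives no code for this procedure; as an illustration only, the loop might be sketched as follows, using PyTorch as a stand-in (the bp_train helper, layer sizes, epochs, and learning rate are all assumptions):

```python
import torch
from torch import nn

def bp_train(model, X, y, epochs=500, lr=0.1):
    # plain BP training against a squared-error cost, as in the paper
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return model

def supervised_pretrain(n_in, n_hidden, target_depth, X, y):
    hidden = [nn.Linear(n_in, n_hidden), nn.Sigmoid()]   # one hidden layer to start
    while True:
        model = bp_train(nn.Sequential(*hidden, nn.Linear(n_hidden, 1)), X, y)
        if len(hidden) // 2 == target_depth:
            return model
        # replace the topmost layer: keep the trained stack, append a new
        # randomly initialized hidden layer (a fresh top is added next round)
        hidden = list(model)[:-1] + [nn.Linear(n_hidden, n_hidden), nn.Sigmoid()]

# toy usage: 12 coverage features, grown to 3 hidden layers of 4 nodes
X, y = torch.rand(10, 12), torch.rand(10, 1)
dnn = supervised_pretrain(12, 4, 3, X, y)
```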

In our work, the number of nodes in the input layer of the DNN model is equal to the dimension of the input feature vector, and the output layer has only one output node, namely, the suspiciousness value. After a distinctive pretraining, the BP algorithm is used to fine-tune the parameters of the model. Here $y_{1\ldots T}$ are the training samples, and the goal is to minimize the sum of squared errors between the training samples $y_{1\ldots T}$ and the labels $x_{1\ldots T}$. The objective function can be interpreted as follows:

$$E(W, b) = \frac{1}{2} \sum_{t} \left\| f(y_{t}; W, b) - x_{t} \right\|^{2}. \tag{3}$$

Suppose that the layer index is $0, 1, \ldots, L$; the node layers are denoted by $v^{0}, v^{1}, \ldots$; $w^{0}, w^{1}, \ldots, w^{L-1}$ denote the weight layers; and $k_{L}$ is the index of nodes in the $L$th layer. Then take the derivative with respect to the weight matrix $W^{L-1}$ and offset vector $b^{L-1}$:

$$\begin{aligned} E(W, b) &= \frac{1}{2} \sum_{t} \left\| \rho(z^{L-1}) - x_{t} \right\|^{2} = \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right)^{2} \\ &= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w^{L-1}_{k_{L} k_{L-1}} v^{L-1}_{k_{L-1}} + b^{L-1}_{k_{L}} \Big) - x_{t,k_{L}} \right)^{2}. \end{aligned} \tag{4}$$

Take the derivative for each component of $W^{L-1}$ and $b^{L-1}$:

$$\begin{aligned} \frac{\partial E(W, b)}{\partial W^{L-1}_{k_{L} k_{L-1}}} &= \sum_{t} \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right) \rho'(z^{L-1}_{k_{L}})\, v^{L-1}_{k_{L-1}}, \\ \frac{\partial E(W, b)}{\partial b^{L-1}_{k_{L}}} &= \sum_{t} \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right) \rho'(z^{L-1}_{k_{L}}). \end{aligned} \tag{5}$$

Then take the derivative with respect to the weight matrix $W^{L-2}$ and offset vector $b^{L-2}$:

$$\begin{aligned} E(W, b) &= \frac{1}{2} \sum_{t} \left\| \rho(z^{L-1}) - x_{t} \right\|^{2} = \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right)^{2} \\ &= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w^{L-1}_{k_{L} k_{L-1}} \rho(z^{L-2}_{k_{L-1}}) + b^{L-1}_{k_{L}} \Big) - x_{t,k_{L}} \right)^{2} \\ &= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w^{L-1}_{k_{L} k_{L-1}} \rho\Big( \sum_{k_{L-2}} W^{L-2}_{k_{L-1} k_{L-2}} v^{L-2}_{k_{L-2}} + b^{L-2}_{k_{L-1}} \Big) + b^{L-1}_{k_{L}} \Big) - x_{t,k_{L}} \right)^{2}. \end{aligned} \tag{6}$$

Take the derivative for each component of $W^{L-2}$ and $b^{L-2}$:

$$\begin{aligned} \frac{\partial E(W, b)}{\partial W^{L-2}_{k_{L-1} k_{L-2}}} &= \sum_{t} \sum_{k_{L}} \left[ \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right) \rho'(z^{L-1}_{k_{L}})\, W^{L-1}_{k_{L} k_{L-1}} \right] \rho'(z^{L-2}_{k_{L-1}})\, v^{L-2}_{k_{L-2}}, \\ \frac{\partial E(W, b)}{\partial b^{L-2}_{k_{L-1}}} &= \sum_{t} \sum_{k_{L}} \left[ \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right) \rho'(z^{L-1}_{k_{L}})\, W^{L-1}_{k_{L} k_{L-1}} \right] \rho'(z^{L-2}_{k_{L-1}}). \end{aligned} \tag{7}$$

The result is as follows:

$$\frac{\partial E(W, b)}{\partial W^{l}} = \sum_{t} e^{l+1}(t) \left( v^{l}(t) \right)^{T}, \qquad \frac{\partial E(W, b)}{\partial b^{l}} = \sum_{t} e^{l+1}(t). \tag{8}$$

Here

$$\begin{aligned} e^{L}_{k_{L}}(t) &= \left( \rho(z^{L-1}_{k_{L}}) - x_{t,k_{L}} \right) \rho'(z^{L-1}_{k_{L}}), \\ e^{L-1}_{k_{L-1}}(t) &= \sum_{k_{L}} e^{L}_{k_{L}}(t)\, W^{L-1}_{k_{L} k_{L-1}}\, \rho'(z^{L-2}_{k_{L-1}}). \end{aligned} \tag{9}$$

After vectorization:

$$\begin{aligned} e^{L}(t) &= \operatorname{diag}\!\left( \rho'(z^{L-1}) \right) \left( \rho(z^{L-1}) - x_{t} \right), \\ e^{L-1}(t) &= \operatorname{diag}\!\left( \rho'(z^{L-2}) \right) (W^{L-1})^{T} e^{L}(t). \end{aligned} \tag{10}$$

The recursive process is as follows:

$$\begin{aligned} e^{L}(t) &= Q^{L}(t) \left( f(y_{t}; W, b) - x_{t} \right), \\ e^{l}(t) &= Q^{l}(t) (W^{l})^{T} e^{l+1}(t), \\ Q^{l}(t) &= \operatorname{diag}\!\left( \rho'\!\left( z^{l-1}(v^{l-1}(t)) \right) \right), \\ \rho'(z) &= \rho(z) \cdot (1 - \rho(z)). \end{aligned} \tag{11}$$
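For readers who prefer code, the error recursion (9)-(11) might be rendered in numpy as follows (a sketch under the notation above; the toy 3-2-1 network at the end is an assumption for demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rho_prime(z):
    # rho'(z) = rho(z)(1 - rho(z)), as in equation (11)
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop_errors(weights, zs, x_t):
    """Errors e^l(t) of equations (9)-(11) for one sample t.

    weights = [w^0, ..., w^(L-1)]; zs = pre-activations [z^0, ..., z^(L-1)]
    recorded during the forward pass; x_t = target of sample t.
    """
    e = rho_prime(zs[-1]) * (sigmoid(zs[-1]) - x_t)        # e^L(t)
    errors = [e]
    for w, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        e = rho_prime(z) * (w.T @ e)                       # e^l = Q^l (w^l)^T e^(l+1)
        errors.append(e)
    return list(reversed(errors))                          # [e^1(t), ..., e^L(t)]

# toy check with a 3-2-1 network
rng = np.random.default_rng(1)
ws = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
v0 = rng.random(3)
zs = [ws[0] @ v0]
zs.append(ws[1] @ sigmoid(zs[0]))
print(backprop_errors(ws, zs, x_t=np.array([1.0])))
```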

In order to calculate the error between the actual output and the expected output, we feed the samples $y_{1\ldots T}$ into the DNN and execute the forward process, computing the outputs of all hidden layer nodes and output nodes; we can then compute the error $e^{L}(t)$. Next, the backpropagation procedure is executed, and the error of the nodes on each hidden layer is calculated iteratively from $e^{L}(t)$. The parameters of the DNN can then be updated layer by layer according to the following formula:

$$\begin{aligned} (W^{l}, b^{l})_{m+1} &= (W^{l}, b^{l})_{m} + \Delta(W^{l}, b^{l})_{m}, \quad 0 \le l \le L, \\ \Delta(W^{l}, b^{l})_{m} &= (1 - \alpha)\, \varepsilon \frac{\partial E}{\partial (W^{l}, b^{l})} + \alpha\, \Delta(W^{l}, b^{l})_{m-1}, \quad 0 \le l \le L. \end{aligned} \tag{12}$$

Here $\varepsilon$ is the learning rate, $\alpha$ is the impulse factor, and $m$ denotes the $m$th iteration. In the process of fault localization, we input the virtual test matrix $Y$ into the DNN model and then execute the forward process of the DNN; the final output is the suspiciousness value of each statement.
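A minimal numpy fragment for the layerwise update rule (12) might look as follows (a sketch only; the gradients are assumed to come from the backpropagation derived above, and note that for a descent on E the gradient term enters with a negative sign):

```python
import numpy as np

def update_parameters(params, grads, deltas, eps=0.01, alpha=0.9):
    """One application of equation (12) to every layer's (W^l, b^l).

    params, grads, deltas are parallel lists of numpy arrays; eps is the
    learning rate epsilon and alpha the impulse factor.
    """
    for l in range(len(params)):
        # delta_m = (1 - alpha) * eps * dE/d(W^l, b^l) + alpha * delta_(m-1)
        deltas[l] = (1.0 - alpha) * eps * grads[l] + alpha * deltas[l]
        params[l] = params[l] + deltas[l]
    return params, deltas

# usage with placeholder gradients (descent direction, hence the minus sign)
W = [np.zeros((4, 12)), np.zeros((1, 4))]
momenta = [np.zeros_like(w) for w in W]
grads = [-np.ones_like(w) for w in W]
W, momenta = update_parameters(W, grads, momenta)
```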

3. DNN-Based Fault Localization Technique

While the previous section provided an overview of the DNN model, this section presents the methodology of the DNN-based fault localization technique in detail, with a focus on how to build a DNN model for the fault localization problem. We also summarize the procedure for localizing faults using a DNN and provide a concrete example for demonstration at the end of this section.

3.1. Fault Localization Algorithm Based on Deep Neural Network. Suppose that there is a program P with $m$ executable statements and $n$ test cases to be executed. Let $t_i$ denote the $i$th test case, let the vectors $c_{t_i}$ and $r_{t_i}$ represent the corresponding coverage data and execution result, respectively, after executing test case $t_i$, and let $s_j$ be the $j$th executable statement of program P. Here $c_{t_i} = [(c_{t_i})_1, (c_{t_i})_2, \ldots, (c_{t_i})_m]$. If the executable statement $s_j$ is covered by $t_i$, we assign the value 1 to $(c_{t_i})_j$; otherwise we assign it 0. If the test case $t_i$ executes successfully, we assign the value 0 to $r_{t_i}$; otherwise it is assigned 1. We depict the coverage data and execution results of test cases in Table 1.

Table 1: The coverage data and execution results of test cases.

           s1   s2   s3   ...  s(m-1)  sm   R
  t1       1    0    1    ...  0       1    0
  t2       1    1    1    ...  0       1    0
  t3       1    0    1    ...  1       1    1
  ...
  t(n-1)   1    0    0    ...  1       0    1
  tn       1    1    1    ...  0       0    0

For instance, from Table 1 we can conclude that $s_2$ is not covered by the failed test case $t_3$, while $s_3$ is covered by it.

The trained deep neural network can reflect the complex nonlinear relationship between the coverage data and the execution result of a test case, and we can identify the suspicious code of a faulty version with the trained network. At the same time, it is necessary to construct a set of virtual test cases, as shown in the following equation.

The Virtual Test Set. Consider

$$\begin{bmatrix} c_{v_1} \\ c_{v_2} \\ \vdots \\ c_{v_m} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}, \tag{13}$$

where the vectors $c_{v_1}, c_{v_2}, \ldots, c_{v_m}$ denote the coverage data of test cases $v_1, v_2, \ldots, v_m$.

In the set of virtual test cases, test case $v_j$ covers only the executable statement $s_j$. Since each test case covers only one statement, that statement is highly suspicious if the corresponding test case fails. For example, $s_j$ is very likely to contain a bug if test case $v_j$ fails, which means we should preferentially check the statements covered by failed test cases. But such a set of test cases does not exist in reality, so the execution results of the test cases in the virtual test set are difficult to obtain in practice. Instead, we input $c_{v_j}$ into the trained DNN, and the output $r_{v_j}$ represents the probability that the test case fails. The value of $r_{v_j}$ is proportional to the suspiciousness of $s_j$ containing a bug.

The specific fault localization algorithm is as follows:

(1) Construct the DNN model with one input layer and one output layer, and identify the appropriate number of hidden layers according to the experiment scale. Suppose the number of input layer nodes is $m$, the number of hidden layer nodes is $n$, and the number of output layer nodes is 1; the adopted transfer function is the sigmoid function $\rho(s) = 1/(1 + e^{-s})$.

(2) Utilize the coverage data and execution results of the test cases as the training sample set, input it into the DNN model, and train the model to obtain the complex nonlinear mapping relationship between the coverage data $c_{t_i}$ and the execution result $r_{t_i}$.

(3) Input the coverage data vectors of the virtual test set, $c_{v_j}$ ($1 \le j \le m$), into the DNN model and get the outputs $r_{v_j}$ ($1 \le j \le m$).

(4) The output $r_{v_j}$ reflects the probability that executable statement $s_j$ contains a bug, that is, its suspiciousness value; rank the $r_{v_j}$ in descending order.

(5) Rank the $s_j$ ($1 \le j \le m$) according to their corresponding suspiciousness values and check the statements one by one from the most suspicious to the least until the fault is finally located (an end-to-end sketch of these steps follows).
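The five steps can be strung together in a few lines. The sketch below uses scikit-learn's MLPRegressor as a stand-in for the paper's pretrained DNN (the paper's experiments used MATLAB, and the hyperparameters here are assumptions); the coverage matrix and results are the toy values of Table 3 in Section 3.2, and the exact ranking a run produces will depend on training:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Coverage matrix (rows: test cases t1..t10, columns: statements s1..s12)
# and execution results r, taken from Table 3 (the Mid example below).
C = np.array([
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
])
r = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Steps (1)-(2): a sigmoid network trained on (coverage, result) pairs.
net = MLPRegressor(hidden_layer_sizes=(4, 4, 4), activation='logistic',
                   solver='sgd', learning_rate_init=0.01, momentum=0.9,
                   max_iter=5000, random_state=0).fit(C, r)

# Step (3): the virtual test suite is the identity matrix of equation (13).
virtual_tests = np.eye(C.shape[1])

# Steps (4)-(5): the outputs are suspiciousness values; rank them descending.
susp = net.predict(virtual_tests)
for s in np.argsort(-susp):
    print(f"s{s + 1}: suspiciousness {susp[s]:.4f}")
```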

3.2. An Example of the Fault Localization Algorithm Based on Deep Neural Network. Here we illustrate the application of fault localization analysis based on a deep neural network with a concrete example, depicted in Table 2.

Table 2 shows the program Mid(·), whose function is to get the middle number by comparing three integers. The program has twelve statements and ten test cases. There is one fault in the program, contained in statement (6). Here "∙" denotes that the statement is covered by the corresponding test case, and a blank denotes that it is not. P represents that the test case executed successfully, while F represents that it failed.

The coverage data and execution results of the test cases are shown in Table 2. Table 3 corresponds to Table 2: the number 1 replaces "∙" and indicates that the statement is covered by the corresponding test case, while the number 0 replaces the blank and represents that the statement is not covered. In the last column, the number 1 replaces F and indicates that the test case failed, while the number 0 replaces P and represents that the test case executed successfully.

Table 2: Coverage and execution results of the Mid function.

Mid(x, y, z), int m; test case inputs: t1 = (3, 3, 5), t2 = (1, 2, 3), t3 = (3, 2, 2), t4 = (5, 5, 5), t5 = (1, 1, 4), t6 = (5, 3, 4), t7 = (3, 2, 1), t8 = (5, 4, 2), t9 = (2, 1, 3), t10 = (5, 2, 6).

  Statement                 t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
  (1)  m = z;               ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙
  (2)  if (y < z)           ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙
  (3)    if (x < y)         ∙   ∙           ∙   ∙           ∙   ∙
  (4)      m = y;               ∙
  (5)    else if (x < z)    ∙               ∙   ∙           ∙   ∙
  (6)      m = y;  (bug)    ∙               ∙               ∙   ∙
  (7)  else                         ∙   ∙           ∙   ∙
  (8)    if (x > y)                 ∙   ∙           ∙   ∙
  (9)      m = y;                   ∙               ∙   ∙
  (10)   else if (x > z)                ∙
  (11)     m = x;
  (12) printf(m);           ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙   ∙
  Pass/fail                 P   P   P   P   P   P   P   P   F   F

Table 3: Table 2 in numerical form.

        s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12  r
  t1    1   1   1   0   1   1   0   0   0   0    0    1    0
  t2    1   1   1   1   0   0   0   0   0   0    0    1    0
  t3    1   1   0   0   0   0   1   1   1   0    0    1    0
  t4    1   1   0   0   0   0   1   1   0   1    0    1    0
  t5    1   1   1   0   1   1   0   0   0   0    0    1    0
  t6    1   1   1   0   1   0   0   0   0   0    0    1    0
  t7    1   1   0   0   0   0   1   1   1   0    0    1    0
  t8    1   1   0   0   0   0   1   1   1   0    0    1    0
  t9    1   1   1   0   1   1   0   0   0   0    0    1    1
  t10   1   1   1   0   1   1   0   0   0   0    0    1    1

The concrete fault localization process is as follows:

(1) Construct the DNN model with one input layer, one output layer, and three hidden layers. The number of input layer nodes is 12, the number of hidden layer nodes is simply set to 4, and the number of output layer nodes is 1; the transfer function adopted is the sigmoid function $\rho(s) = 1/(1 + e^{-s})$.

(2) Use the coverage data and execution results of the test cases as the training sample set for the DNN model. First we input the vector (1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1) and its execution result 0, then the second vector (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1) and its execution result 0, and so on, until the coverage data and execution results of all ten test cases have been input into the network. We then train the DNN model on the input training data, feeding the training dataset (i.e., the ten test cases' coverage data and execution results) iteratively to optimize the DNN model until we reach convergence. The final DNN model, obtained after several iterations, reveals the complex nonlinear mapping relationship between the coverage data and execution results.

(3) Construct the virtual test set with twelve test cases, ensuring that each test case covers only one executable statement. The set of virtual test cases is depicted in Table 4.

(4) Input the virtual test set into the trained DNN model, obtain the suspiciousness value of each corresponding executable statement, and rank the statements according to their suspiciousness values.

(5) Table 5 shows the statements in descending order of suspiciousness. The suspiciousness value of the sixth statement, which contains the bug in program Mid(·), is the greatest (i.e., it ranks first); thus we can locate the fault by checking only one statement (the snippet below reproduces this ranking).
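For instance, the rank column of Table 5 can be reproduced from its suspiciousness values in a few lines:

```python
# suspiciousness values reported in Table 5
suspiciousness = {1: 0.0448, 2: 0.1233, 3: 0.0879, 4: 0.0937,
                  5: 0.0747, 6: 0.1593, 7: 0.1191, 8: 0.0758,
                  9: 0.0661, 10: 0.0436, 11: 0.0650, 12: 0.0649}

# sort statements by descending suspiciousness; rank 1 is examined first
ordering = sorted(suspiciousness, key=suspiciousness.get, reverse=True)
ranks = {stmt: i + 1 for i, stmt in enumerate(ordering)}
assert ranks[6] == 1   # the faulty statement is examined first
print(ranks)
```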

4. Empirical Studies

The modeling procedure of our DNN-based fault localization technique was discussed in the previous section. In this section, we empirically compare the performance of this technique with that of the Tarantula localization technique [4], the PPDG localization technique [3], and the BPNN localization technique [10] on the Siemens suite and the Space program, to demonstrate the effectiveness of our DNN-based fault localization method.


Table 4: Virtual test suite.

        s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12
  t1    1   0   0   0   0   0   0   0   0   0    0    0
  t2    0   1   0   0   0   0   0   0   0   0    0    0
  t3    0   0   1   0   0   0   0   0   0   0    0    0
  t4    0   0   0   1   0   0   0   0   0   0    0    0
  t5    0   0   0   0   1   0   0   0   0   0    0    0
  t6    0   0   0   0   0   1   0   0   0   0    0    0
  t7    0   0   0   0   0   0   1   0   0   0    0    0
  t8    0   0   0   0   0   0   0   1   0   0    0    0
  t9    0   0   0   0   0   0   0   0   1   0    0    0
  t10   0   0   0   0   0   0   0   0   0   1    0    0
  t11   0   0   0   0   0   0   0   0   0   0    1    0
  t12   0   0   0   0   0   0   0   0   0   0    0    1

Table 5: Suspiciousness values and ranking of statements.

  Statement   Suspiciousness   Rank
  1           0.0448           11
  2           0.1233           2
  3           0.0879           5
  4           0.0937           4
  5           0.0747           7
  6 (fault)   0.1593           1
  7           0.1191           3
  8           0.0758           6
  9           0.0661           8
  10          0.0436           12
  11          0.0650           9
  12          0.0649           10

The Siemens suite and the Space program can be downloaded from [18]. The evaluation standard based on statement ranking was proposed by Jones and Harrold in [4] to compare the effectiveness of the Tarantula localization technique with other localization techniques. Tarantula produces a suspiciousness ranking of statements; the programmer examines the statements one by one from the first to the last until the fault is located, and the percentage of statements that need not be examined is defined as the score of the fault localization technique, that is, the EXAM score. Since the DNN-based fault localization technique, the Tarantula localization technique, and the PPDG localization technique all produce a ranked list of suspiciousness values for executable statements, we adopt the EXAM score as the evaluation criterion to measure effectiveness.
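In code, the EXAM score of a single faulty version might be computed as follows (a sketch; it assumes the ranked statement list and the faulty statement are known):

```python
def exam_score(ranked_statements, faulty_statement):
    """Percentage of statements NOT examined when the fault is found,
    walking the ranking from most to least suspicious."""
    examined = ranked_statements.index(faulty_statement) + 1
    return 100.0 * (1.0 - examined / len(ranked_statements))

# Mid example of Section 3.2: statement 6 ranks first among 12 statements
ranking = [6, 2, 7, 4, 3, 8, 5, 9, 11, 12, 1, 10]
print(exam_score(ranking, 6))   # ~91.67: 11 of 12 statements need not be examined
```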

4.1. Data Collection. The physical environment on which our experiments were carried out included a 3.40 GHz Intel Core i7-3770 CPU and 16 GB of physical memory. The operating systems were Windows 7 and Ubuntu 12.10, and our compiler was gcc 4.7.2. We conducted the experiments in MATLAB R2013a. To collect the coverage data of each statement, the tool Gcov was used. The test case execution results of the faulty program versions were obtained by the following method:

(1) We run the test cases on the fault-free program version to get the expected execution results.

(2) We run the test cases on the faulty program versions to get the actual execution results.

(3) We compare the results of the fault-free version with those of each faulty version. If they are equal, the test case passes; otherwise it fails (see the sketch below).
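A sketch of this labeling step (the output values below are hypothetical):

```python
def label_test_cases(expected_outputs, actual_outputs):
    """Compare fault-free and faulty outputs: 0 = passed, 1 = failed."""
    return [0 if expected == actual else 1
            for expected, actual in zip(expected_outputs, actual_outputs)]

# e.g., outputs captured from running the same test suite on both versions
print(label_test_cases(["5", "3", "2"], ["5", "3", "7"]))   # [0, 0, 1]
```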

4.2. Programs of the Test Suite. In our experiments, we conducted two studies, on the Siemens suite and the Space program, to demonstrate the effectiveness of the DNN-based fault localization technique.

4.2.1. The Siemens Suite. The Siemens suite has been considered a classical test sample and is widely employed in studies of fault localization techniques, so our experiments begin with it. The Siemens suite contains seven C programs with corresponding faulty versions and test cases. Each faulty version has only one fault, but the fault may span multiline statements or even multiple program functions. Table 6 provides a detailed introduction of the seven C programs in the Siemens suite, including the program name, the number of faulty versions, the number of statement lines, the number of executable statement lines, and the number of test cases.

The functions of the seven C programs differ. The Print tokens and Print tokens 2 programs are used for lexical analysis, and the function of the Replace program is pattern replacement. The Schedule and Schedule 2 programs are utilized for priority scheduling, while the function of Tcas is altitude separation, and the Tot info program is used for information measurement.

The Siemens suite contains 132 faulty versions, of which we chose 122 to perform the experiments. We omitted the following versions: versions 4 and 6 of Print tokens, version 9 of Schedule 2, version 12 of Replace, versions 13, 14, and 36 of Tcas, and versions 6, 9, and 21 of Tot info. We eliminated these versions for the following reasons:

(a) There is no syntactic difference between the correct version and the faulty version (e.g., the only difference is in a header file).

(b) The test cases never fail when executing the faulty version.

(c) The faulty versions have segmentation faults when executing the test cases.

(d) The difference between the correct version and the faulty version does not lie in the executable statements of the program and cannot be substituted. In addition, some faulty versions have missing statements, and we cannot locate missing statements directly; in this case, we can only select the associated statements to locate the faults.


Table 6: Summary of the Siemens suite.

  Program          Faulty versions   LOC (lines of code)   Executable statements   Test cases
  Print tokens     7                 565                   172                     4130
  Print tokens 2   10                510                   146                     4115
  Replace          32                563                   175                     5542
  Schedule         9                 412                   140                     2650
  Schedule 2       10                307                   115                     2710
  Tcas             41                173                   59                      1608
  Tot info         23                406                   100                     1052

4.2.2. The Space Program. The executable statements of the Siemens suite programs number only in the hundreds of lines, while the scale and complexity of software have increased gradually and a program may contain thousands or even tens of thousands of lines. We therefore verify the effectiveness of our DNN-based fault localization technique on a large-scale program set, namely, the Space program.

The Space program contains 38 faulty versions; each version has more than 9000 statements and 13585 test cases. In our experiment, for reasons similar to those described for the Siemens suite, we omitted 5 faulty versions and selected the other 33 faulty versions.

4.3. Experimental Process. We conducted experiments on the 122 faulty versions of the Siemens suite and the 33 faulty versions of the Space program. For the DNN modeling process, we adopted three hidden layers; that is, the structure of the network is composed of one input layer, three hidden layers, and one output layer, based on previous experience and a preliminary experiment. According to experience and the sample size, we estimate the number of hidden layer nodes (i.e., num) by the following formula:

$$\text{num} = \operatorname{round}\left( \frac{n}{30} \right) \times 10. \tag{14}$$

Here $n$ represents the number of input layer nodes. The impulse factor is set to 0.9, and the candidate learning rates are 0.01, 0.001, 0.0001, and 0.00001; we select the most appropriate learning rate as the final parameter according to the sample scale.
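Equation (14) and the parameter grid in code form (a small sketch; the Print tokens figure of 172 executable statements is taken from Table 6):

```python
def hidden_nodes(n_input):
    # num = round(n / 30) * 10, equation (14)
    return round(n_input / 30) * 10

LEARNING_RATES = [0.01, 0.001, 0.0001, 0.00001]   # candidates, chosen per sample scale
IMPULSE_FACTOR = 0.9

print(hidden_nodes(172))   # Print tokens: 172 executable statements -> 60 hidden nodes
```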

We now take the Print tokens program as an example and introduce the experimental process in detail; the other programs follow the same procedure. The Print tokens program includes seven faulty versions, of which we picked five to conduct the experiment. The experimental process is as follows:

(1) First, we compile the source program of Print tokens and run the test case set to get the expected execution results of the test cases.

(2) We compile the five faulty versions of the Print tokens program using GCC, similarly get the actual execution results of the test cases, and acquire the coverage data of the test cases using Gcov.

(3) We construct the sets of virtual test cases for the five faulty versions of the Print tokens program, respectively.

(4) We then compare the execution results of the test cases between the source program and the five faulty versions of Print tokens to get the final execution results (pass/fail) for the test cases of the five faulty versions.

(5) We integrate the coverage data and the final results of the test cases of the five faulty versions, respectively, as input vectors to train the deep neural network.

(6) Finally, we feed the set of virtual test cases to the trained deep neural network to acquire the suspiciousness of the corresponding statements and then rank the suspiciousness values.

5. Results and Analysis

Following the experimental process described in the previous section, we performed the fault localization experiments with the deep neural network and the BP neural network and statistically analyzed the experimental results. We then compared the experimental results with the Tarantula localization technique and the PPDG localization technique, whose experimental results were introduced in [4] and [3], respectively.

5.1. The Experimental Results on the Siemens Suite. Figure 3 shows the effectiveness of the DNN, BP neural network, Tarantula, and PPDG localization techniques on the Siemens suite.

In Figure 3, the x-axis represents the percentage of statements that need not be examined until the fault is located, that is, the EXAM score, while the y-axis represents the percentage of faulty versions whose faults have been located. For instance, a point on the curve with value (70, 80) means the technique is able to find 80 percent of the faults in the Siemens suite with 70 percent of statements left unexamined; that is, we can identify 80 percent of the faults by examining only 30 percent of the statements.

Table 7 offers further explanation of Figure 3.

Following the statistical approach of other fault localization studies, the EXAM score is divided into 11 segments, each 10% treated as one segment. The caveat is that programmers are not able to identify


Figure 3: Effectiveness comparison on the Siemens suite (x-axis: percentage of executable statements that need not be examined, i.e., EXAM score; y-axis: percentage of faulty versions; curves: Tarantula, PPDG (best), NN, and DNN).

Table 7: Effectiveness comparison of four techniques (% of faulty versions per EXAM score segment).

  EXAM score   DNN     BPNN    PPDG    Tarantula
  100%-99%     12.29   6.56    41.94   13.93
  99%-90%      61.48   43.44   31.45   41.80
  90%-80%      21.31   24.59   13.71   5.74
  80%-70%      4.10    11.48   2.42    9.84
  70%-60%      0.00    7.38    2.42    8.20
  60%-50%      0.82    3.28    5.65    7.38
  50%-40%      0.00    3.28    1.61    0.82
  40%-30%      0.00    0.00    0.00    0.82
  30%-20%      0.00    0.00    0.80    4.10
  20%-10%      0.00    0.00    0.00    7.38
  10%-0%       0.00    0.00    0.00    0.00

the faults without examining any statement; thus the abscissa can only be infinitely close to 100%. For that reason, we divide 100%-90% into two segments, namely, 100%-99% and 99%-90%. The data of each segment can be used to evaluate the effectiveness of a fault localization technique. For example, in the 99%-90% segment, DNN identifies 61.48% of the faults in the Siemens suite, while BPNN locates 43.44%. From Table 7 we find that for the DNN segments 50%-40%, 40%-30%, 30%-20%, 20%-10%, and 10%-0%, the percentage of faults identified is 0.00%, indicating that the DNN-based method has found all faulty versions after examining 50% of the statements.

We analyzed the above data and summarize the following points:

(1) Figure 3 reveals that the overall effectiveness of our DNN-based fault localization technique is better than that of the BP neural network technique, as the curve representing the DNN-based approach is always above the curve denoting the BPNN-based method. The DNN-based technique also identifies faults by examining fewer statements: for instance, by examining 10% of statements, the DNN-based method identifies 73.77% of the faulty versions, while the BP neural network technique finds only 50%.

(2) Compared with the Tarantula fault localization technique, the DNN-based approach dramatically improves the effectiveness of fault localization on the whole. For example, the DNN-based method finds 95.08% of the faulty versions after examining 20% of statements, while Tarantula identifies only 61.47% after examining the same number of statements. For the 100%-99% score segment, the effectiveness of the DNN-based technique is relatively similar to that of Tarantula: the DNN-based method finds 12.29% of the faulty versions with 99% of statements unexamined (i.e., abscissa value 99%), close to Tarantula's performance.

(3) Compared with the PPDG fault localization technique, the DNN-based technique also shows improved effectiveness over the 90%-0% score range. Moreover, the DNN-based technique finds all faulty versions by examining only 50% of statements, while the PPDG technique needs to examine 80%. For the 100%-99% score range, however, the performance of the DNN-based technique is inferior to PPDG.

In conclusion, the DNN-based fault localization technique considerably improves the overall effectiveness of localization. In particular, over the 90%-0% score range, the improvement is more evident compared with other fault localization techniques, as fewer statements need to be examined to identify the faults. The DNN-based technique finds all faulty versions by examining only 50% of statements, which is superior to the BPNN, Tarantula, and PPDG techniques, while for the 100%-99% score segment its performance is similar to Tarantula's and somewhat inferior to PPDG's.

5.2. The Experimental Results on the Space Program. As the executable statements of the Siemens suite programs are only about hundreds of lines, we further verify the effectiveness of the DNN-based fault localization technique on a large-scale dataset.

Figure 4 demonstrates the effectiveness of the DNN and BP neural network localization techniques on the Space program.

From Figure 4 we can clearly see that the effectiveness of our DNN-based method is markedly higher than that


Figure 4: Effectiveness comparison on the Space program (x-axis: percentage of executable statements that need not be examined, i.e., EXAM score; y-axis: percentage of faulty versions; curves: NN and DNN).

of the BP neural network fault localization technique. The DNN-based technique finds all faulty versions with 93% of statements unexamined, while the BP neural network identifies all faulty versions with 83% unexamined. This comparison indicates that the performance of our DNN-based approach is superior, as it reduces the number of statements that must be examined. The experimental results show that the DNN-based fault localization technique is also highly effective on large-scale datasets.

6. Threats to Validity

There may exist several threats to the validity of the technique presented in this paper; we discuss some of them in this section.

We determined the structure of the network, including the number of hidden layers and the number of hidden nodes, empirically, and we determined some important model parameters by grid search. So there probably exist better structures for the fault localization model. As the DNN model trained the first time is usually not optimal, we empirically modify the parameters of the network several times to try to obtain a better model.

In the field of fault localization we may encounter multiple-bug programs, and new models that can locate multiple bugs would need to be constructed. In addition, because the number of input nodes of the model equals the number of executable statements, the model becomes very large when the number of executable statements is very large. That limits fault localization for large-scale software, and we may need to control the scale of the model or improve the performance of the computers we use. Moreover, the virtual test set built to calculate suspiciousness, in which each test case covers only one statement, may not be suitably designed. When we use a virtual test case to describe a certain statement, we assume that if the test case is predicted as failed, then the statement covered by it contains a bug; however, whether this assumption is reasonable has not been confirmed. As this assumption is important for our fault localization approach, we may need to test this hypothesis or propose an alternative strategy.

7. Conclusion and Future Work

In this paper we propose a DNN-based fault localization technique, as a DNN is able to simulate the complex nonlinear relationship between inputs and outputs. We conducted an empirical study on the Siemens suite and the Space program and compared the effectiveness of our method with other techniques, namely, the BP neural network, Tarantula, and PPDG. The results show that our DNN-based fault localization technique performs best. For example, for the Space program, the DNN-based technique needs to examine only 10% of the statements to identify all faulty versions, while the BP neural network technique needs to examine 20%.

As software fault localization is one of the most expensive, tedious, and time-consuming activities in software testing, it is of great significance for researchers to automate the localization process [19]. By leveraging the strong feature-learning ability of a DNN, our approach helps to predict the likelihood that each statement in a program contains a fault, which can guide programmers to the locations of faults with minimal human intervention. As deep learning is widely applied and demonstrates good performance in image processing, speech recognition, and natural language processing, and in related industries as well [13], and as the scale and complexity of software keep increasing, we believe our DNN-based method has great potential to be applied in industry. It could help to improve the efficiency and effectiveness of the debugging process, boost the software development process, and reduce software maintenance cost.

In future work, we will apply deep neural networks to multifault localization to evaluate their effectiveness. Meanwhile, we will conduct further studies on the feasibility of applying other deep learning models (e.g., convolutional neural networks, recurrent neural networks) to software fault localization.

Competing Interests

The authors declare that they have no competing interests.

References

[1] M. A. Alipour, "Automated fault localization techniques: a survey," Tech. Rep., Oregon State University, Corvallis, Ore, USA, 2012.

[2] W. Eric Wong and V. Debroy, "A survey of software fault localization," Tech. Rep. UTDCS-45-09, Department of Computer Science, The University of Texas at Dallas, 2009.

[3] G. K. Baah, A. Podgurski, and M. J. Harrold, "The probabilistic program dependence graph and its application to fault diagnosis," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 528-545, 2010.

[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE '05), pp. 273-282, Long Beach, Calif, USA, November 2005.

[5] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, "Statistical debugging: a hypothesis testing-based approach," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 831-847, 2006.

[6] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, "A crosstab-based statistical method for effective fault localization," in Proceedings of the 1st International Conference on Software Testing, Verification and Validation (ICST '08), pp. 42-51, Lillehammer, Norway, April 2008.

[7] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183-200, 2002.

[8] X. Zhang, N. Gupta, and R. Gupta, "Locating faults through automated predicate switching," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pp. 272-281, Shanghai, China, May 2006.

[9] S. Ali, J. H. Andrews, T. Dhandapani, et al., "Evaluating the accuracy of fault localization techniques," in Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE '09), pp. 76-87, IEEE Computer Society, Auckland, New Zealand, November 2009.

[10] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573-597, 2009.

[11] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, "Effective software fault localization using an RBF neural network," IEEE Transactions on Reliability, vol. 61, no. 1, pp. 149-169, 2012.

[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 437-440, Florence, Italy, August 2011.

[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[15] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, Canada, December 2009.

[16] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto, 2010.

[17] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.

[18] http://sir.unl.edu/

[19] D. Binkley, "Source code analysis: a roadmap," in Proceedings of the Future of Software Engineering (FOSE '07), pp. 104-119, IEEE Computer Society, Minneapolis, Minn, USA, May 2007.


and their corresponding execution results to train the network, and then input the coverage of a set of virtual test cases (e.g., each test case covers only one statement) to the trained network; the outputs are regarded as the suspiciousness (i.e., likelihood of containing the bug) of each executable statement. But the BP neural network suffers from problems like local minima. Aiming to solve this problem, Wong et al. propose another fault localization method based on a radial basis function (RBF) network in [11], which is less sensitive to those problems. Although machine learning techniques exhibit good performance in the field of fault localization, the models they use for fault localization are all of shallow architecture. With shortcomings like the limited ability to express complex functions under a limited amount of sample data, as well as restricted generalization ability for intricate problems, the faults cannot be analyzed accurately via those methods.

This paper proposes a fault localization method based on deep neural network (DNN). With the capability of estimating complicated functions by learning a deep nonlinear network structure and further attaining a distributed representation of the input data, this method exhibits a strong ability to learn representation from a small amount of sample data. Moreover, DNN is one of the deep learning models that has been successfully applied in many other fields [12, 13]. For example, researchers at Microsoft Research adopted DNN to decrease the error rate of speech recognition in [12], one of the greatest breakthroughs in that field in the last ten years. We use the Siemens suite and the Space program as platforms to evaluate and demonstrate the effectiveness of the DNN-based fault localization technique. The remainder of this paper is organized as follows. Section 2 provides an introduction to some related studies. In Section 3 we elaborate our DNN-based fault localization method and provide a toy example to help readers understand it. We conduct empirical studies on two suites (i.e., the Siemens suite and the Space program) in Section 4. Section 5 follows, where we report the results of the performance (effectiveness) comparison between our method and others and make a discussion. Section 6 lists possible threats to the validity of our approach. We present our conclusion and future work in Section 7.

2. Related Work

Recent years have witnessed the successful application of machine learning techniques in the field of fault localization. However, many models with shallow architectures, such as the BP neural network and the support vector machine, encounter drawbacks like a restricted ability to express complex functions under a limited amount of sample data, and their generalization ability for intricate problems is also restrained. With the rapid development of deep learning, many researchers have gradually begun to adopt deep neural networks to tackle the limitations of shallow architectures. Before presenting our DNN-based fault localization technique, we introduce some related work that contributes to our novel approach.

2.1. Deep Neural Network Model. In 2006, the deep neural network was first presented by Hinton et al. in the journal Science [14]. The rationale of DNN is that the neural network model is first divided into a number of two-layer models before we learn the whole model; we then train these two-layer neural network models layer by layer and finally obtain the initial weights of the multilayer neural network by composing the trained two-layer networks, a process called layerwise pretraining [15]. The hidden layer of a neural network can extract features from the input layer due to its abstraction. Thus, neural networks with multiple hidden layers are better at network processing and network generalization and achieve a faster convergence rate. We elaborate here on the theory of deep neural networks, which is cited as the basis of our technique.

DNN is a sort of feed-forward artificial neural network with multiple hidden layers, and each node at the same hidden layer uses the same nonlinear function to map the feature input from the layer below to the current node. The DNN structure is very flexible due to the multiple hidden layers and multiple hidden nodes, so DNN demonstrates an excellent capacity to fit the highly complex nonlinear relationship between inputs and outputs.

Generally, a DNN model can be utilized for regression or classification. In this paper the model is considered to be a classification model, but in order to output continuous suspiciousness values, we do not quantize the activation value of the DNN's last layer into a discrete class label. The relationship between inputs and outputs in the DNN model can be interpreted as follows:

$$
\begin{aligned}
v^{0} &= \text{input}, \\
v^{l+1} &= \rho\left(z^{l}\left(v^{l}\right)\right), \\
z^{l}\left(v^{l}\right) &= W^{l} v^{l} + b^{l}, \quad 0 \le l < L, \\
\text{output} &= v^{L}.
\end{aligned}
\tag{1}
$$

According to the above formulas, we obtain the final output by transforming the feature vector of the first layer $v^{0}$ into a processed feature vector $v^{L}$ through $L$ layers of nonlinear transformation. During the training process of the DNN model, we need to determine the weight matrix $W^{l}$ and offset vector $b^{l}$ of the $l$th layer. By utilizing the difference between the target outputs and the actual outputs to construct a cost function, we can then train the DNN by the backpropagation (BP) algorithm.
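To make the forward process of (1) concrete, the following Python/NumPy sketch propagates a coverage vector through a small sigmoid network; the layer sizes and random weights here are illustrative placeholders, not trained parameters from our experiments.

```python
import numpy as np

def sigmoid(s):
    # Nonlinear transfer function rho(s) = 1 / (1 + e^(-s)); see (2).
    return 1.0 / (1.0 + np.exp(-s))

def forward(v0, weights, biases):
    # Implements (1): v^{l+1} = rho(W^l v^l + b^l); output = v^L.
    v = v0
    for W, b in zip(weights, biases):
        v = sigmoid(W @ v + b)
    return v

# Illustrative network: 12 inputs (one per executable statement),
# two hidden layers of 4 nodes, and 1 output (the suspiciousness value).
rng = np.random.default_rng(0)
sizes = [12, 4, 4, 1]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

coverage_vector = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1], dtype=float)
print(forward(coverage_vector, weights, biases))  # a value in (0, 1)
```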

Mainly, the design of a DNN model includes steps like choosing the number of network layers, the number of nodes in each layer, the transfer function between the layers, and so forth.

2.1.1. The Design of the Network Layers. Generally, a deep neural network consists of three parts: the input layer, the hidden layers, and the output layer. The study of the network layers mainly aims to identify the number of hidden layers, which determines the number of layers of the network. In neural networks, a hidden layer has an abstraction effect and can extract features from the input. At the same time, the number of hidden layers directly determines the network's ability to extract features. If the number of hidden layers is too small, the network cannot represent and fit intricate problems with a limited amount of sample data, while with an increased number of hidden layers the processing ability of the entire network improves. However, too many hidden layers also produce adverse effects, such as increased computational complexity and local minima. Thus we conclude that too many or too few hidden layers are both unfavorable for network training, and we must choose an appropriate number of network layers to adapt to different practical problems.

2.1.2. The Design of the Number of Nodes. Usually, the nodes of the input layer and output layer can be determined directly once the training samples of the deep neural network are confirmed; thus determining the number of nodes in the hidden layers is most critical. If we set up too many hidden nodes, training time increases and the network is trained excessively (i.e., it memorizes unnecessary information), leading to problems like overfitting, whereas with too few hidden nodes the network is not able to handle complex problems, as its ability to capture information gets poorer. The number of hidden nodes has a direct relationship with the inputs and outputs, and it needs to be determined through several experiments.

2.1.3. The Design of the Transfer Function. The transfer function reflects the complex relationship between the input and its output, and different problems are suited to different transfer functions. In our work, we use the common sigmoid function as the nonlinear transfer function:

$$
\rho(s) = \frac{1}{1 + e^{-s}}.
\tag{2}
$$

Here $s$ is the input and $\rho(s)$ is the output.

2.1.4. The Design of the Learning Rate and Impulse Factor. The learning rate and impulse factor have an important impact on DNN as well. A smaller learning rate increases training time and slows down convergence; on the other hand, a larger one generates network shocks. For the impulse factor, we can use its "inertia" to cushion the network shocks. We choose the appropriate learning rate and impulse factor according to the size of the sample.

2.2. Learning Algorithm of DNN. There exist various pretraining methods for DNN; mainly they are divided into two categories, namely, unsupervised pretraining and supervised pretraining.

2.2.1. Unsupervised Pretraining. Unsupervised pretraining usually adopts the Restricted Boltzmann Machine (RBM) to initialize a DNN; a practical training guide for RBMs was given by Hinton in 2010 [16]. The structure of an RBM is depicted in Figure 1. The RBM is divided into two layers: the first one is the visible layer with visible units, and the second one is the hidden layer with hidden units. Once the unsupervised pretraining has been completed, the obtained parameters are used as the initial weights of the DNN.

Figure 1: Architecture of a Restricted Boltzmann Machine (a bipartite structure between visible variables v and hidden variables h, connected by weights W).

Figure 2: Architecture of supervised pretraining (DNN-x, DNN-x+1, DNN-x+2).

2.2.2. Supervised Pretraining. In order to reduce the inaccuracy of unsupervised training, we adopt the supervised pretraining proposed in [17]. The general architecture is shown in Figure 2. It works as follows: the BP algorithm is utilized to train a neural network, and after every pretraining round the topmost layer is replaced by a randomly initialized hidden layer and a new random topmost layer. The network is repeatedly trained again until convergence or until the expected number of hidden layers is reached.
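The following sketch illustrates only the structure of this supervised pretraining loop under our reading of [17]; `train_bp` is a placeholder stub standing in for the BP training of Section 2.2, and the layer sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def new_layer(n_in, n_out):
    # Randomly initialized weight matrix and bias vector for one layer.
    return rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out)

def train_bp(layers, X, y):
    # Placeholder: a real implementation would run the BP fine-tuning of
    # Section 2.2 and update every (W, b) pair in `layers`.
    return layers

def supervised_pretrain(n_in, hidden_sizes, X, y):
    # Start with one hidden layer plus a topmost (output) layer.
    layers = [new_layer(n_in, hidden_sizes[0]), new_layer(hidden_sizes[0], 1)]
    layers = train_bp(layers, X, y)
    for prev, size in zip(hidden_sizes, hidden_sizes[1:]):
        # Replace the topmost layer with a randomly initialized hidden layer
        # and a new random topmost layer, then train again (Figure 2).
        layers = layers[:-1] + [new_layer(prev, size), new_layer(size, 1)]
        layers = train_bp(layers, X, y)
    return layers

# Example: grow a 12-input network to three hidden layers of 4 nodes each.
X = rng.integers(0, 2, size=(10, 12)).astype(float)
y = rng.integers(0, 2, size=10).astype(float)
model = supervised_pretrain(12, [4, 4, 4], X, y)
print([W.shape for W, b in model])  # [(4, 12), (4, 4), (4, 4), (1, 4)]
```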

In our work, the number of nodes in the input layer of the DNN model is equal to the dimension of the input feature vector, and the output layer has only one output node, that is, the suspiciousness value. After a distinctive pretraining, the BP algorithm is used to fine-tune the parameters of the model. Here $y_{1 \ldots T}$ are the training samples, and the goal is to minimize the squared error sum between the network outputs for the training samples $y_{1 \ldots T}$ and the labels $x_{1 \ldots T}$. The objective function can be interpreted as follows:

$$
E(W, b) = \frac{1}{2} \sum_{t} \left\| f\left(y_{t}; W, b\right) - x_{t} \right\|^{2}.
\tag{3}
$$

Suppose that the layer index is $0, 1, \ldots, L$; the node layers are denoted by $v^{0}, v^{1}, \ldots$; $w^{0}, w^{1}, \ldots, w^{L-1}$ denote the weight layers; and $k_{L}$ is the index of nodes in the $L$th layer. We first take the derivative with respect to the weight matrix $W^{L-1}$ and offset vector $b^{L-1}$:

$$
\begin{aligned}
E(W, b) &= \frac{1}{2} \sum_{t} \left\| \rho\left(z^{L-1}\right) - x_{t} \right\|^{2}
= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right)^{2} \\
&= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w_{k_{L} k_{L-1}}^{L-1} v_{k_{L-1}}^{L-1} + b_{k_{L}}^{L-1} \Big) - x_{t k_{L}} \right)^{2}.
\end{aligned}
\tag{4}
$$

Take the derivative for each component of $W^{L-1}$ and $b^{L-1}$:

$$
\begin{aligned}
\frac{\partial E(W, b)}{\partial W_{k_{L} k_{L-1}}^{L-1}} &= \sum_{t} \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right) \rho'\left(z_{k_{L}}^{L-1}\right) v_{k_{L-1}}^{L-1}, \\
\frac{\partial E(W, b)}{\partial b_{k_{L}}^{L-1}} &= \sum_{t} \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right) \rho'\left(z_{k_{L}}^{L-1}\right).
\end{aligned}
\tag{5}
$$

Then take the derivative with respect to the weight matrix $W^{L-2}$ and offset vector $b^{L-2}$:

$$
\begin{aligned}
E(W, b) &= \frac{1}{2} \sum_{t} \left\| \rho\left(z^{L-1}\right) - x_{t} \right\|^{2}
= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right)^{2} \\
&= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w_{k_{L} k_{L-1}}^{L-1} v_{k_{L-1}}^{L-1} + b_{k_{L}}^{L-1} \Big) - x_{t k_{L}} \right)^{2} \\
&= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w_{k_{L} k_{L-1}}^{L-1} \rho\left(z_{k_{L-1}}^{L-2}\right) + b_{k_{L}}^{L-1} \Big) - x_{t k_{L}} \right)^{2} \\
&= \frac{1}{2} \sum_{t} \sum_{k_{L}} \left( \rho\Big( \sum_{k_{L-1}} w_{k_{L} k_{L-1}}^{L-1} \rho\Big( \sum_{k_{L-2}} W_{k_{L-1} k_{L-2}}^{L-2} v_{k_{L-2}}^{L-2} + b_{k_{L-1}}^{L-2} \Big) + b_{k_{L}}^{L-1} \Big) - x_{t k_{L}} \right)^{2}.
\end{aligned}
\tag{6}
$$

Take the derivative for each component of $W^{L-2}$ and $b^{L-2}$:

$$
\begin{aligned}
\frac{\partial E(W, b)}{\partial W_{k_{L-1} k_{L-2}}^{L-2}} &= \sum_{t} \sum_{k_{L}} \left[ \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right) \rho'\left(z_{k_{L}}^{L-1}\right) W_{k_{L} k_{L-1}}^{L-1} \right] \rho'\left(z_{k_{L-1}}^{L-2}\right) v_{k_{L-2}}^{L-2}, \\
\frac{\partial E(W, b)}{\partial b_{k_{L-1}}^{L-2}} &= \sum_{t} \sum_{k_{L}} \left[ \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right) \rho'\left(z_{k_{L}}^{L-1}\right) W_{k_{L} k_{L-1}}^{L-1} \right] \rho'\left(z_{k_{L-1}}^{L-2}\right).
\end{aligned}
\tag{7}
$$

The result is as follows:

$$
\frac{\partial E(W, b)}{\partial W^{l}} = \sum_{t} e^{l+1}(t) \left( v^{l}(t) \right)^{T}, \qquad
\frac{\partial E(W, b)}{\partial b^{l}} = \sum_{t} e^{l+1}(t).
\tag{8}
$$

Here

$$
\begin{aligned}
e_{k_{L}}^{L}(t) &= \left( \rho\left(z_{k_{L}}^{L-1}\right) - x_{t k_{L}} \right) \rho'\left(z_{k_{L}}^{L-1}\right), \\
e_{k_{L-1}}^{L-1}(t) &= \sum_{k_{L}} e_{k_{L}}^{L}(t) \, W_{k_{L} k_{L-1}}^{L-1} \, \rho'\left(z_{k_{L-1}}^{L-2}\right).
\end{aligned}
\tag{9}
$$

After vectorization,

$$
\begin{aligned}
e^{L}(t) &= \operatorname{diag}\left( \rho'\left(z^{L-1}\right) \right) \left( \rho\left(z^{L-1}\right) - x_{t} \right), \\
e^{L-1}(t) &= \operatorname{diag}\left( \rho'\left(z^{L-2}\right) \right) \left( W^{L-1} \right)^{T} e^{L}(t).
\end{aligned}
\tag{10}
$$

The recursive process is as follows:

$$
\begin{aligned}
e^{L}(t) &= Q^{L}(t) \left( f\left(y_{t}; W, b\right) - x_{t} \right), \\
e^{l}(t) &= Q^{l}(t) \left( W^{l} \right)^{T} e^{l+1}(t), \\
Q^{l}(t) &= \operatorname{diag}\left( \rho'\left( z^{l-1}\left( v^{l-1}(t) \right) \right) \right), \\
\rho'(z) &= \rho(z) \cdot \left( 1 - \rho(z) \right).
\end{aligned}
\tag{11}
$$

In order to calculate the error between the actual output and the expected output, we feed the samples $y_{1 \ldots T}$ to the DNN and execute its forward process. Meanwhile, we calculate the outputs of all the hidden layer nodes and output nodes, from which we compute the error $e^{L}(t)$. Next, the backpropagation procedure is executed, and the error of the nodes on each hidden layer is calculated iteratively after obtaining $e^{L}(t)$. The parameters of the DNN can then be updated layer by layer according to the following formula:

$$
\begin{aligned}
\left( W^{l}, b^{l} \right)_{m+1} &= \left( W^{l}, b^{l} \right)_{m} + \Delta\left( W^{l}, b^{l} \right)_{m}, \quad 0 \le l \le L, \\
\Delta\left( W^{l}, b^{l} \right)_{m} &= (1 - \alpha) \, \varepsilon \left( \frac{\partial E}{\partial \left( W^{l}, b^{l} \right)} \right) + \alpha \, \Delta\left( W^{l}, b^{l} \right)_{m-1}, \quad 0 \le l \le L.
\end{aligned}
\tag{12}
$$


Table 1: The coverage data and execution results of test cases.

          s1   s2   s3   ...   s(m-1)   sm  |  R
t1         1    0    1   ...     0       1  |  0
t2         1    1    1   ...     0       1  |  0
t3         1    0    1   ...     1       1  |  1
...                      ...                |
t(n-1)     1    0    0   ...     1       0  |  1
tn         1    1    1   ...     0       0  |  0

Here $\varepsilon$ is the learning rate, $\alpha$ is the impulse factor, and $m$ denotes the $m$th iteration. In the process of fault localization, we input the virtual test matrix $Y$ into the DNN model and then execute the forward process of the DNN; the final output is the suspiciousness value of each statement.
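As a minimal sketch of how updates (8)-(12) fit together, the following NumPy routine trains a sigmoid DNN with the impulse (momentum) term of (12); for simplicity it applies the update per training sample rather than summing over all $t$, and all hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dnn(X, r, sizes, eps=0.01, alpha=0.9, epochs=1000, seed=0):
    # Trains a sigmoid DNN on the coverage matrix X (one row per test case)
    # and execution results r, using the update rule (12).
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes, sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]
    dWs = [np.zeros_like(W) for W in Ws]  # Delta(W^l)_{m-1} of (12)
    dbs = [np.zeros_like(b) for b in bs]
    for _ in range(epochs):
        for y_t, x_t in zip(X, r):
            # Forward process (1), keeping each layer's activation v^l.
            vs = [y_t]
            for W, b in zip(Ws, bs):
                vs.append(sigmoid(W @ vs[-1] + b))
            # Topmost error e^L(t); rho'(z) = rho(z)(1 - rho(z)); see (9), (11).
            e = (vs[-1] - x_t) * vs[-1] * (1.0 - vs[-1])
            for l in range(len(Ws) - 1, -1, -1):
                gW, gb = np.outer(e, vs[l]), e             # gradients, see (8)
                e = (Ws[l].T @ e) * vs[l] * (1.0 - vs[l])  # e^l(t), see (11)
                # Update rule (12); the gradient enters with a minus sign
                # so that the error E actually decreases.
                dWs[l] = (1 - alpha) * eps * (-gW) + alpha * dWs[l]
                dbs[l] = (1 - alpha) * eps * (-gb) + alpha * dbs[l]
                Ws[l] += dWs[l]
                bs[l] += dbs[l]
    return Ws, bs

# Usage sketch: for a program with m executable statements and hidden width num,
# Ws, bs = train_dnn(X, r, sizes=[m, num, num, num, 1])
```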

3. DNN-Based Fault Localization Technique

While the previous section provides an overview of the DNN model, this part presents the methodology of the DNN-based fault localization technique in detail, with a focus on how to build a DNN model for the fault localization problem. Meanwhile, we summarize the procedure of localizing faults using DNN and provide a concrete example for demonstration at the end of this part.

3.1. Fault Localization Algorithm Based on Deep Neural Network. Suppose that there is a program $P$ with $m$ executable statements and $n$ test cases to be executed; $t_{i}$ denotes the $i$th test case, the vectors $c_{t_i}$ and $r_{t_i}$ represent the corresponding coverage data and execution result after executing $t_{i}$, and $s_{j}$ is the $j$th executable statement of $P$. Here $c_{t_i} = [(c_{t_i})_1, (c_{t_i})_2, \ldots, (c_{t_i})_m]$. If the executable statement $s_{j}$ is covered by $t_{i}$, we assign the value 1 to $(c_{t_i})_j$; otherwise we assign 0. If the test case $t_{i}$ is executed successfully, we assign the value 0 to $r_{t_i}$; otherwise it is assigned 1. We depict the coverage data and execution results of the test cases in Table 1. For instance, from Table 1 we can conclude that $s_{2}$ is not covered by the failed test case $t_{3}$, while $s_{3}$ is covered by the failed test case $t_{3}$.

When the training of the deep neural network is completed, the network reflects the complex nonlinear relationship between the coverage data and execution results of the test cases, and we can identify the suspicious code of a faulty version with the trained network. At the same time, constructing a set of virtual test cases is necessary, as shown in the following equation.

The Virtual Test Set. Consider

$$
\begin{bmatrix} c_{v_1} \\ c_{v_2} \\ \vdots \\ c_{v_m} \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix},
\tag{13}
$$

where the vectors $c_{v_1}, c_{v_2}, \ldots, c_{v_m}$ denote the coverage data of test cases $v_{1}, v_{2}, \ldots, v_{m}$.

In the set of virtual test cases, the test case $v_{j}$ covers only the executable statement $s_{j}$. Since each test case covers exactly one statement, that statement is highly suspicious if the corresponding test case fails. For example, $s_{j}$ is very likely to contain bugs if the test case $v_{j}$ fails, which means we should preferentially check the statements covered by failed test cases. But such a set of test cases does not exist in reality; thus the execution results of the virtual test cases are difficult to obtain in practice. So we input $c_{v_j}$ into the trained DNN, and the output $r_{v_j}$ represents the probability that the test case fails. The value of $r_{v_j}$ is proportional to the suspiciousness of $s_{j}$ containing bugs.

The specific fault localization algorithm is as follows; a compact sketch of steps (3)-(5) in code is given after this list.

(1) Construct the DNN model with one input layer and one output layer, and identify the appropriate number of hidden layers according to the experiment scale. Suppose that the number of input layer nodes is $m$, the number of hidden layer nodes is $n$, and the number of output layer nodes is 1; the adopted transfer function is the sigmoid function $\rho(s) = 1/(1 + e^{-s})$.

(2) Utilize the coverage data and execution results of the test cases as the training sample set, which is input into the DNN model; then train the DNN model to obtain the complex nonlinear mapping relationship between the coverage data $c_{t_i}$ and the execution result $r_{t_i}$.

(3) Input the coverage data vectors of the virtual test set $c_{v_j}$ ($1 \le j \le m$) into the DNN model and get the outputs $r_{v_j}$ ($1 \le j \le m$).

(4) The output $r_{v_j}$ reflects the probability that the executable statement $s_{j}$ contains the bugs, that is, its suspiciousness value; then sort the $r_{v_j}$ in descending order.

(5) Rank the $s_{j}$ ($1 \le j \le m$) according to their corresponding suspiciousness values, and check the statements one by one from the most suspicious to the least until the faults are finally located.

3.2. An Example of the Fault Localization Algorithm Based on Deep Neural Network. Here we illustrate the application of fault localization based on deep neural network with a concrete example, depicted in Table 2.

Table 2 shows that the function of the program Mid(·) is to get the middle number by comparing three integers. The program has twelve statements and ten test cases. There is one fault in the program, contained in statement (6). Here "∙" denotes that the statement is covered by the corresponding test case, and a blank denotes that it is not covered. P represents that the test case is executed successfully, while F represents that the execution result of the test case is failed.

The coverage data and execution results of the test cases are shown in Table 2. Table 3 is the switched form of Table 2: the number 1 replaces "∙" and indicates that the statement is covered by the corresponding test case, while the number 0 replaces a blank and indicates that the statement is not covered. In the last column, 1 replaces F and indicates that the execution result of the test case is failed, while 0 replaces P and indicates that the test case is executed successfully.


Table 2: Coverage and execution results of the Mid function.

Mid(x, y, z) { int m;         t1     t2     t3     t4     t5     t6     t7     t8     t9     t10
   (inputs x, y, z)          3,3,5  1,2,3  3,2,2  5,5,5  1,1,4  5,3,4  3,2,1  5,4,2  2,1,3  5,2,6
(1)  m = z;                    ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙
(2)  if (y < z)                ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙
(3)    if (x < y)              ∙      ∙                    ∙      ∙                    ∙      ∙
(4)      m = y;                       ∙
(5)    else if (x < z)         ∙                           ∙      ∙                    ∙      ∙
(6)      m = y;   // bug       ∙                           ∙                           ∙      ∙
(7)  else                                    ∙      ∙                    ∙      ∙
(8)    if (x > y)                            ∙      ∙                    ∙      ∙
(9)      m = y;                              ∙                           ∙      ∙
(10)   else if (x > z)                              ∙
(11)     m = x;
(12) printf(m);                ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙      ∙
Pass/fail                      P      P      P      P      P      P      P      P      F      F

Table 3: Switched form of Table 2.

       s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12  |  r
t1      1   1   1   0   1   1   0   0   0    0    0    1  |  0
t2      1   1   1   1   0   0   0   0   0    0    0    1  |  0
t3      1   1   0   0   0   0   1   1   1    0    0    1  |  0
t4      1   1   0   0   0   0   1   1   0    1    0    1  |  0
t5      1   1   1   0   1   1   0   0   0    0    0    1  |  0
t6      1   1   1   0   1   0   0   0   0    0    0    1  |  0
t7      1   1   0   0   0   0   1   1   1    0    0    1  |  0
t8      1   1   0   0   0   0   1   1   1    0    0    1  |  0
t9      1   1   1   0   1   1   0   0   0    0    0    1  |  1
t10     1   1   1   0   1   1   0   0   0    0    0    1  |  1


The concrete fault localization process is as follows; a self-contained sketch of this toy example in code is given after the list.

(1) Construct the DNN model with one input layer, one output layer, and three hidden layers. The number of input layer nodes is 12, the number of hidden layer nodes is simply set to 4, and the number of output layer nodes is 1; the transfer function adopted is the sigmoid function $\rho(s) = 1/(1 + e^{-s})$.

(2) Use the coverage data and execution results of the test cases as the training sample set and input them into the DNN model. First we input the vector (1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1) and its execution result 0, then the second vector (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1) and its execution result 0, until the coverage data and execution results of the ten test cases are all input into the network. We then train the DNN model with the input training data, and we further input the training dataset (i.e., the ten test cases' coverage data and execution results) iteratively to optimize the DNN model until we reach the condition of convergence. The final DNN model, obtained after several iterations, reveals the complex nonlinear mapping relationship between the coverage data and execution results.

(3) Construct the virtual test set with twelve test cases and ensure that each test case covers only one executable statement. The set of virtual test cases is depicted in Table 4.

(4) Input the virtual test set into the trained DNN model, get the suspiciousness value of each corresponding executable statement, and then rank the statements according to their suspiciousness values.

(5) Table 5 shows the descending ranking of statements according to their suspiciousness values. According to the ranking list, the suspiciousness value of the sixth statement, which contains the bug in program Mid(·), is the greatest (i.e., it ranks first); thus we can locate the fault by checking only one statement.
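As a self-contained illustration of steps (1)-(5), the following sketch trains a small sigmoid network on the data of Table 3 and queries the virtual test suite of Table 4. The architecture, learning rate, and iteration count are illustrative choices; the resulting suspiciousness values will not reproduce Table 5 exactly, since they depend on initialization and training schedule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Coverage matrix of Table 3 (rows t1..t10, columns s1..s12) and results r.
X = np.array([
    [1,1,1,0,1,1,0,0,0,0,0,1],
    [1,1,1,1,0,0,0,0,0,0,0,1],
    [1,1,0,0,0,0,1,1,1,0,0,1],
    [1,1,0,0,0,0,1,1,0,1,0,1],
    [1,1,1,0,1,1,0,0,0,0,0,1],
    [1,1,1,0,1,0,0,0,0,0,0,1],
    [1,1,0,0,0,0,1,1,1,0,0,1],
    [1,1,0,0,0,0,1,1,1,0,0,1],
    [1,1,1,0,1,1,0,0,0,0,0,1],
    [1,1,1,0,1,1,0,0,0,0,0,1],
], dtype=float)
r = np.array([0,0,0,0,0,0,0,0,1,1], dtype=float)

# Small sigmoid network: 12 inputs, three hidden layers of 4 nodes, 1 output.
rng = np.random.default_rng(0)
sizes = [12, 4, 4, 4, 1]
Ws = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

eps = 0.5  # learning rate, chosen for this toy example only
for _ in range(5000):
    for y_t, x_t in zip(X, r):
        vs = [y_t]
        for W, b in zip(Ws, bs):                    # forward process (1)
            vs.append(sigmoid(W @ vs[-1] + b))
        e = (vs[-1] - x_t) * vs[-1] * (1 - vs[-1])  # topmost error, see (9)
        for l in range(len(Ws) - 1, -1, -1):
            gW, gb = np.outer(e, vs[l]), e          # gradients, see (8)
            e = (Ws[l].T @ e) * vs[l] * (1 - vs[l]) # backpropagated error (11)
            Ws[l] -= eps * gW
            bs[l] -= eps * gb

# Virtual test suite of Table 4: the 12-by-12 identity matrix.
for j, c_vj in enumerate(np.eye(12), start=1):
    v = c_vj
    for W, b in zip(Ws, bs):
        v = sigmoid(W @ v + b)
    print(f"statement {j}: suspiciousness {v[0]:.4f}")
```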

4. Empirical Studies

So far, the modeling procedures of our DNN-based fault localization technique have been discussed. In this section, we empirically compare the performance of this technique with that of the Tarantula localization technique [4], the PPDG localization technique [3], and the BPNN localization technique [10] on the Siemens suite and the Space program, to demonstrate the effectiveness of our DNN-based fault localization method. The Siemens suite and the Space program can be downloaded from [18].


Table 4: Virtual test suite.

       s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12
t1      1   0   0   0   0   0   0   0   0    0    0    0
t2      0   1   0   0   0   0   0   0   0    0    0    0
t3      0   0   1   0   0   0   0   0   0    0    0    0
t4      0   0   0   1   0   0   0   0   0    0    0    0
t5      0   0   0   0   1   0   0   0   0    0    0    0
t6      0   0   0   0   0   1   0   0   0    0    0    0
t7      0   0   0   0   0   0   1   0   0    0    0    0
t8      0   0   0   0   0   0   0   1   0    0    0    0
t9      0   0   0   0   0   0   0   0   1    0    0    0
t10     0   0   0   0   0   0   0   0   0    1    0    0
t11     0   0   0   0   0   0   0   0   0    0    1    0
t12     0   0   0   0   0   0   0   0   0    0    0    1

Table 5: Suspiciousness values and ranking of the statements.

Statement    Suspiciousness   Rank
1            0.0448           11
2            0.1233            2
3            0.0879            5
4            0.0937            4
5            0.0747            7
6 (fault)    0.1593            1
7            0.1191            3
8            0.0758            6
9            0.0661            8
10           0.0436           12
11           0.0650            9
12           0.0649           10

The evaluation standard based on statement ranking was proposed by Jones and Harrold in [4] to compare the effectiveness of the Tarantula localization technique with other localization techniques. Tarantula produces a suspiciousness ranking of statements; the programmers examine the statements one by one, from first to last, until the fault is located, and the percentage of statements that need not be examined is defined as the score of the fault localization technique, that is, the EXAM score. Since the DNN-based fault localization technique, Tarantula, and PPDG all produce a suspiciousness ranking list for executable statements, we adopt the EXAM score as the evaluation criterion to measure effectiveness.
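For clarity, the EXAM score of a single faulty version can be computed as below; this is a simplified sketch that ignores tie-breaking among equally suspicious statements.

```python
def exam_score(ranked_statements, faulty_statement, total_statements):
    # EXAM score: percentage of executable statements that need NOT be
    # examined when statements are checked in descending suspiciousness order.
    examined = ranked_statements.index(faulty_statement) + 1
    return 100.0 * (total_statements - examined) / total_statements

# For the Mid() example of Section 3.2: statement 6 ranks first among 12,
# so 11 of the 12 statements need not be examined.
ranking = [6, 2, 7, 4, 3, 8, 5, 9, 11, 12, 1, 10]  # from Table 5
print(exam_score(ranking, 6, 12))  # 91.66...
```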

4.1. Data Collection. The physical environment on which our experiments were carried out included a 3.40 GHz Intel Core i7-3770 CPU and 16 GB physical memory. The operating systems were Windows 7 and Ubuntu 12.10, and our compiler was gcc 4.7.2. We conducted the experiments in MATLAB R2013a. To collect the coverage data of each statement, the tool Gcov was used. The test case execution results of the faulty program versions were obtained by the following method.

(1) We run the test cases on the fault-free program version to get the expected execution results.

(2) We run the test cases on the faulty program versions to get the actual execution results.

(3) We compare the results of the fault-free version with those of the faulty versions: if they are equal, the test case is successful; otherwise, it is failed. A small sketch of this comparison follows.
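A minimal sketch of step (3), using the Mid(·) example outputs of Section 3.2 for illustration:

```python
def label_test_cases(expected_outputs, actual_outputs):
    # r_ti = 0 if the faulty version reproduces the fault-free output
    # (test passes), and 1 otherwise (test fails).
    return [0 if exp == act else 1
            for exp, act in zip(expected_outputs, actual_outputs)]

# Outputs of the fault-free and faulty Mid() for test cases t1..t10.
expected = ["3", "2", "2", "5", "1", "4", "2", "4", "2", "5"]
actual   = ["3", "2", "2", "5", "1", "4", "2", "4", "1", "2"]
print(label_test_cases(expected, actual))  # [0,0,0,0,0,0,0,0,1,1]
```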

4.2. Programs of the Test Suite. In our experiment, we conducted two studies, on the Siemens suite and on the Space program, to demonstrate the effectiveness of the DNN-based fault localization technique.

4.2.1. The Siemens Suite. The Siemens suite has been considered a classical test sample, widely employed in studies of fault localization techniques, and our experiments begin with it. The Siemens suite contains seven C programs with corresponding faulty versions and test cases. Each faulty version has only one fault, but the fault may cross multiline statements or even multiple program functions. Table 6 provides a detailed introduction of the seven C programs in the Siemens suite, including the program name, the number of faulty versions of each program, the number of statement lines, the number of executable statement lines, and the number of test cases.

The functions of the seven C programs differ. The Print tokens and Print tokens 2 programs are used for lexical analysis, and the function of the Replace program is pattern replacement. The Schedule and Schedule 2 programs are utilized for priority scheduling, while the function of Tcas is altitude separation, and the Tot info program is used for information measure.

The Siemens suite contains 132 faulty versions, of which we chose 122 to perform the experiments. We omitted the following versions: versions 4 and 6 of Print tokens, version 9 of Schedule 2, version 12 of Replace, versions 13, 14, and 36 of Tcas, and versions 6, 9, and 21 of Tot info. We eliminated these versions for the following reasons.

(a) There is no syntactic difference between the correct version and the faulty version (e.g., the only difference is in a header file).

(b) The test cases never fail when executing the faulty versions.

(c) The faulty versions have segmentation faults when executing the test cases.

(d) The difference between the correct version and the faulty version does not lie in the executable statements of the program and cannot be replaced. In addition, some faulty versions omit statements, and we cannot locate missing statements directly; in this case we can only select the associated statements to locate the faults.


Table 6: Summary of the Siemens suite.

Program          Faulty versions   LOC (lines of code)   Executable statements   Test cases
Print tokens            7                  565                    172                4130
Print tokens 2         10                  510                    146                4115
Replace                32                  563                    175                5542
Schedule                9                  412                    140                2650
Schedule 2             10                  307                    115                2710
Tcas                   41                  173                     59                1608
Tot info               23                  406                    100                1052

4.2.2. The Space Program. While the executable statements of the Siemens suite programs number only in the hundreds of lines, the scale and complexity of software have increased gradually, and a program may have thousands or tens of thousands of lines. So we also verify the effectiveness of our DNN-based fault localization technique on a large-scale program set, that is, the Space program.

The Space program contains 38 faulty versions; each version has more than 9000 statements and 13585 test cases. In our experiment, for reasons similar to those described for the Siemens suite, we omit 5 faulty versions and select the other 33 faulty versions.

4.3. Experimental Process. In this paper, we conducted experiments on the 122 faulty versions of the Siemens suite and the 33 faulty versions of the Space program. For the DNN modeling process we adopted three hidden layers; that is, the structure of the network is composed of one input layer, three hidden layers, and one output layer, based on previous experience and a preliminary experiment. According to experience and the sample size, we estimate the number of hidden layer nodes (i.e., num) by the following formula:

$$
\text{num} = \operatorname{round}\left( \frac{n}{30} \right) \times 10.
\tag{14}
$$

Here $n$ represents the number of input layer nodes. The impulse factor is set to 0.9, and the candidate learning rates are 0.01, 0.001, 0.0001, and 0.00001; we select the most appropriate learning rate as the final parameter according to the sample scale.
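In code, formula (14) and the learning-rate grid amount to the following; reading num as the hidden layer width is our interpretation of the formula.

```python
def num_hidden_nodes(n):
    # Formula (14): num = round(n / 30) * 10, where n is the number
    # of input layer nodes (i.e., executable statements).
    return round(n / 30) * 10

# Candidate learning rates screened in our setting; the final choice
# is made per program according to the sample scale.
LEARNING_RATES = [0.01, 0.001, 0.0001, 0.00001]

for n in [172, 146, 175, 140, 115, 59, 100]:  # executable statements, Table 6
    print(n, "->", num_hidden_nodes(n), "hidden nodes")
```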

Now we take the Print tokens program as an example and introduce the experimental process in detail; the other programs follow the same procedure. The Print tokens program includes seven faulty versions, and we picked out five versions to conduct the experiment. The experimental process is as follows.

(1) First, we compile the source program of Print tokens and run the test case set to get the expected execution results of the test cases.

(2) We compile the five faulty versions of the Print tokens program using GCC, similarly get the actual execution results of the test cases, and acquire the coverage data of the test cases using Gcov (a sketch of extracting per-statement coverage from Gcov output follows this list).

(3) We construct the set of virtual test cases for the five faulty versions of the Print tokens program, respectively.

(4) Then we compare the execution results of the test cases between the source program and the five faulty versions of Print tokens to get the final execution results of the test cases of the five faulty versions.

(5) We integrate the coverage data and the final results of the test cases of the five faulty versions, respectively, as input vectors to train the deep neural network.

(6) Finally, we utilize the set of virtual test cases to test the deep neural network, acquire the suspiciousness of the corresponding statements, and rank the suspiciousness value list.
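As referenced in step (2), the sketch below shows one way to turn a Gcov text report into a 0/1 coverage vector. It assumes the standard count:line:source layout of .gcov files, which should be checked against the Gcov version used; the file name is a hypothetical example.

```python
def coverage_vector_from_gcov(gcov_path):
    # Parse a .gcov report into a 0/1 coverage vector over executable lines.
    # In the usual format, '-' marks non-executable lines and '#####'
    # (or '=====') marks executable lines that were never run.
    vector = []
    with open(gcov_path) as f:
        for line in f:
            fields = line.split(":", 2)
            if len(fields) < 3:
                continue
            count = fields[0].strip()
            if count == "-":
                continue  # not an executable statement
            vector.append(0 if count in ("#####", "=====") else 1)
    return vector

# Usage sketch: compile with `gcc --coverage`, execute one test case,
# run `gcov`, then: c_ti = coverage_vector_from_gcov("print_tokens.c.gcov")
```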

5. Results and Analysis

According to the experimental process described in the previous section, we perform the fault localization experiments with the deep neural network and the BP neural network and statistically analyze the experimental results. We then compare the experimental results with the Tarantula and PPDG localization techniques, whose experimental results were introduced in [4] and [3], respectively.

5.1. The Experimental Results on the Siemens Suite. Figure 3 shows the effectiveness of the DNN, BP neural network, Tarantula, and PPDG localization techniques on the Siemens suite.

In Figure 3, the x-axis represents the percentage of statements that need not be examined until the fault is located, that is, the EXAM score, while the y-axis represents the percentage of faulty versions whose faults have been located. For instance, a point (70, 80) on a curve means the technique is able to find eighty percent of the faults in the Siemens suite with seventy percent of the statements not examined; that is, we can identify eighty percent of the faults by examining only thirty percent of the statements.

Table 7 offers further explanation of Figure 3. Following the approach used to report experimental statistics for other fault localization techniques, the EXAM score is divided into 11 segments. Each 10% is treated as one segment, with the caveat that programmers are not able to identify the faults without examining any statement; thus the abscissa can only be infinitely close to 100%.


Figure 3: Effectiveness comparison on the Siemens suite (x-axis: EXAM score, i.e., percentage of executable statements that need not be examined; y-axis: percentage of fault versions; curves: Tarantula, PPDG (best), BPNN, and DNN).

Table 7: Effectiveness comparison of the four techniques.

EXAM score    % of the faulty versions
              DNN      BPNN     PPDG     Tarantula
100%-99%      12.29    6.56     41.94    13.93
99%-90%       61.48    43.44    31.45    41.80
90%-80%       21.31    24.59    13.71    5.74
80%-70%       4.10     11.48    2.42     9.84
70%-60%       0.00     7.38     2.42     8.20
60%-50%       0.82     3.28     5.65     7.38
50%-40%       0.00     3.28     1.61     0.82
40%-30%       0.00     0.00     0.00     0.82
30%-20%       0.00     0.00     0.80     4.10
20%-10%       0.00     0.00     0.00     7.38
10%-0%        0.00     0.00     0.00     0.00

Due to that factor, we divide 100%-90% into two segments, that is, 100%-99% and 99%-90%. The data of every segment can be used to evaluate the effectiveness of a fault localization technique. For example, for the 99%-90% segment of DNN, the percentage of faults identified in the Siemens suite is 61.48%, while the percentage located is 43.44% for the BPNN technique. From Table 7 we find that for the segments 50%-40%, 40%-30%, 30%-20%, 20%-10%, and 10%-0% of DNN, the percentage of faults identified in each segment is 0.00%. That indicates that the DNN-based method has found all the faulty versions after examining 50% of the statements.

We analyzed the above data and summarize the following points.

(1) Figure 3 reveals that the overall effectiveness of our DNN-based fault localization technique is better than that of the BP neural network technique, as the curve representing the DNN-based approach is always above the curve denoting the BPNN-based method. The DNN-based technique is also able to identify the faults by examining fewer statements: for instance, by examining 10% of the statements, the DNN-based method identifies 73.77% of the faulty versions, while the BP neural network technique finds only 50% of them.

(2) Compared with the Tarantula fault localization technique, the DNN-based approach dramatically improves the effectiveness of fault localization on the whole. For example, the DNN-based method is capable of finding 95.08% of the faulty versions after examining 20% of the statements, while Tarantula identifies only 61.47% of the faulty versions after examining the same number of statements. For the score segment of 100%-99%, the effectiveness of the DNN-based technique is relatively similar to that of Tarantula: the DNN-based method can find 12.29% of the faulty versions with 99% of the statements not examined (i.e., the abscissa value is 99%), close to the performance of the Tarantula technique.

(3) Compared with the PPDG fault localization technique, the DNN-based technique also reflects improved effectiveness for the score range of 90%-0%. Moreover, the DNN-based technique is capable of finding all faulty versions by examining only 50% of the statements, while the PPDG technique needs to examine 80% of the statements. For the score range of 100%-99%, however, the performance of the DNN-based technique is inferior to that of the PPDG technique.

In conclusion, overall the DNN-based fault localization technique improves the effectiveness of localization considerably. In particular, for the score range of 90%-0%, the improvement is more obvious compared with the other fault localization techniques, as fewer statements need to be examined to identify the faults. The DNN-based technique is able to find all faulty versions by examining only 50% of the statements, which is superior to the BPNN, Tarantula, and PPDG techniques, while for the score segment of 100%-99% the performance of our DNN-based approach is similar to that of Tarantula and somewhat inferior to that of PPDG.

5.2. The Experimental Results on the Space Program. As the executable statements of the Siemens suite programs number only in the hundreds of lines, we further verify the effectiveness of the DNN-based fault localization technique on a large-scale dataset.

Figure 4 demonstrates the effectiveness of the DNN and BP neural network localization techniques on the Space program.

From Figure 4 we can clearly see that the effectiveness of our DNN-based method is obviously higher than that of the BP neural network fault localization technique.

Figure 4: Effectiveness comparison on the Space program (x-axis: EXAM score, i.e., percentage of executable statements that need not be examined; y-axis: percentage of fault versions; curves: BPNN and DNN).

The DNN-based fault localization technique is able to find all faulty versions without examining 93% of the statements, while the BP neural network identifies all faulty versions with 83% of the statements not examined. This comparison indicates that the performance of our DNN-based approach is superior, as it reduces the number of statements to be examined. The experimental results show that the DNN-based fault localization technique is also highly effective on large-scale datasets.

6. Threats to the Validity

There may exist several threats to the validity of the technique presented in this paper; we discuss some of them in this section.

We empirically determined the structure of the network, including the number of hidden layers and the number of hidden nodes, and we determined some important model parameters by grid search. So better structures for the fault localization model probably exist. As the DNN model we train for the first time usually is not the optimal one, we empirically need to modify the parameters of the network several times to obtain a better model.

In the field of fault localization we may encounter multiple-bug programs, and we would need to construct new models that can locate multiple bugs. In addition, because the number of input nodes of the model is equal to the number of executable statement lines, the model becomes very big when the number of executable statements is very large. That results in a limitation of fault localization for some large-scale software, and we may need to control the scale of the model or improve the performance of the computer we use. Moreover, the virtual test set built to calculate the suspiciousness, in which each test case covers only one statement, may not be suitably designed. When we use a virtual test case to describe a certain statement, we assume that if the test case is predicted as failed, then the statement covered by it contains a bug; however, whether this assumption is reasonable has not been confirmed. As this assumption is important for our fault localization approach, we may need to test this hypothesis or propose an alternative strategy.

7. Conclusion and Future Work

In this paper, we propose a DNN-based fault localization technique, as DNN is able to simulate the complex nonlinear relationship between inputs and outputs. We conduct an empirical study on the Siemens suite and the Space program, and we compare the effectiveness of our method with techniques based on the BP neural network, Tarantula, and PPDG. The results show that our DNN-based fault localization technique performs best. For example, for the Space program, the DNN-based technique needs to examine only 10% of the statements to fully identify all faulty versions, while the BP neural network technique needs to examine 20% of the statements.

As software fault localization is one of the most expensive, tedious, and time-consuming activities of the software testing process, it is of great significance for researchers to automate the localization process [19]. By leveraging the strong feature-learning ability of DNN, our approach helps to predict the likelihood of each statement in a program containing a fault, which can guide programmers to the locations of faults with minimal human intervention. As deep learning is widely applied and demonstrates good performance in image processing, speech recognition, and natural language processing, as well as in related industries [13], and with the ever-increasing scale and complexity of software, we believe that our DNN-based method has great potential to be applied in industry. It could help improve the efficiency and effectiveness of the debugging process, boost the software development process, reduce software maintenance cost, and so forth.

In our future work, we will apply deep neural networks to multifault localization to evaluate their effectiveness. Meanwhile, we will conduct further studies on the feasibility of applying other deep learning models (e.g., convolutional neural networks, recurrent neural networks) in the field of software fault localization.

Competing Interests

The authors declare that they have no competing interests.

References

[1] M. A. Alipour, "Automated fault localization techniques: a survey," Tech. Rep., Oregon State University, Corvallis, Ore, USA, 2012.

[2] W. Eric Wong and V. Debroy, "A survey of software fault localization," Tech. Rep. UTDCS-45-09, Department of Computer Science, The University of Texas at Dallas, 2009.

[3] G. K. Baah, A. Podgurski, and M. J. Harrold, "The probabilistic program dependence graph and its application to fault diagnosis," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 528-545, 2010.

[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE '05), pp. 273-282, Long Beach, Calif, USA, November 2005.

[5] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, "Statistical debugging: a hypothesis testing-based approach," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 831-847, 2006.

[6] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, "A crosstab-based statistical method for effective fault localization," in Proceedings of the 1st International Conference on Software Testing, Verification and Validation (ICST '08), pp. 42-51, Lillehammer, Norway, April 2008.

[7] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183-200, 2002.

[8] X. Zhang, N. Gupta, and R. Gupta, "Locating faults through automated predicate switching," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pp. 272-281, Shanghai, China, May 2006.

[9] S. Ali, J. H. Andrews, T. Dhandapani, et al., "Evaluating the accuracy of fault localization techniques," in Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE '09), pp. 76-87, IEEE Computer Society, Auckland, New Zealand, November 2009.

[10] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573-597, 2009.

[11] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, "Effective software fault localization using an RBF neural network," IEEE Transactions on Reliability, vol. 61, no. 1, pp. 149-169, 2012.

[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 437-440, Florence, Italy, August 2011.

[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[15] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, Canada, December 2009.

[16] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto, 2010.

[17] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.

[18] Software-artifact Infrastructure Repository, http://sir.unl.edu/.

[19] D. Binkley, "Source code analysis: a roadmap," in Proceedings of the Future of Software Engineering (FOSE '07), pp. 104-119, IEEE Computer Society, Minneapolis, Minn, USA, May 2007.


Page 3: Research Article Fault Localization Analysis Based on Deep ...downloads.hindawi.com/journals/mpe/2016/1820454.pdf · fault localization techniques that can guide programmers to the

Mathematical Problems in Engineering 3

of hidden layers directly determines the processing ability ofnetwork to extract feature If the number of hidden layers istoo small it cannot represent and fit the intricate problemswith limited amount of sample data while with increasednumber of hidden layers the processing ability of the entirenetwork will improve However too many hidden layers alsoproduce some adverse effects such as the increased com-plexity of calculation and local minimaThus we draw a con-clusion that too many or too few hidden layers are all unfav-orable for the network training We must choose the appro-priate number of network layers to adapt to the differentpractical problems

212 The Design of Number of Nodes Usually the nodes ofinput layer and output layer can be determined directly whenthe training samples of deep neural network are confirmedthus determining the number of nodes in the hidden layeris most critical If we set up too many hidden nodes itwill increase the training time and train the neural networkexcessively (ie remember some unnecessary information)and lead to problems like overfitting whereas the network isnot able to handle complex problem as its ability to obtaininformation gets poorer with too few hidden nodes Thenumber of hidden nodes has a direct relationship with inputsand outputs and it needs to be determined according toseveral experiments

213 The Design of Transfer Function Transfer functionreflects the complex relationship between the input and itsoutput Different problems are suited to different transferfunctions In our work we use the common sigmoid functionas a nonlinear transfer function

120588 (119904) =

1

1 + 119890

minus119904 (2)

Here 119904 is the input and 120588(119904) is the output

214 The Design of Learning Rate and Impulse Factor Thelearning rate and impulse factor have an important impactonDNN as well A smaller learning rate will increase trainingtime and slow down the speed of convergence on the otherhand it generates network shocks further For the choice ofimpulse factor we can use its ldquoinertiardquo to cushion the networkshocks We choose the appropriate learning rate and impulsefactor according to the size of the sample

22 Learning Algorithm of DNN There exist various pre-training methods for DNN and mainly they are divided intotwo categories namely the unsupervised pretraining andsupervised pretraining

221 Unsupervised Pretraining For the unsupervised pre-training it usually adopts Restricted Boltzmann Machine(RBM) to initialize a DNN RBM is proposed by Hinton in2010 [16] The structure of a RBM is depicted in Figure 1 TheRBM is divided into two layers the first one is visible layerwith visible units and the second one is hidden layer withhidden units Once the unsupervised pretraining has been

h

W

v

Image Visible variables

Hidden variables

Bipartitestructure

Figure 1 Architecture of a Restricted Boltzmann Machine

DNN-x DNN-x + 1 DNN-x + 2

Figure 2 Architecture of supervised pretraining

completed the obtained parameters will be used as the initialweights of the DNN

222 Supervised Pretraining In order to reduce the inac-curacy of unsupervised training we adopt the supervisedpretraining proposed in [17] The general architecture isshown in Figure 2 It works as follows the BP algorithm isutilized to train a neural network and the topmost layer isreplaced by a randomly initialized hidden layer and a newrandom topmost layer after every pretrainingThe network isunceasingly trained again until convergence or reaching theexpected number of hidden layers

In our work the number of nodes in the input layer ofDNN model is equal to the dimension of the input featurevector The output layer has only one output node that issuspiciousness value After a distinctive pretraining the BPalgorithm is used to fine-tune the parameters of the modelHere 119910

1119879is the training sample and the goal is to minimize

the squared error sum between the training samples 1199101119879

andlabeling 119909

1119879 The objective function can be interpreted as

follows

119864 (119882 119887) =

1

2

sum

119905

1003817100381710038171003817

119891 (119910119905119882 119887) minus 119909

119905

1003817100381710038171003817

2 (3)

Suppose that layer index is 0 1 119871 node layer isrepresented by V

0 V1 119908

0 1199081 119908

119871minus1denotes the weight

layer and 119896119871is the index of nodes in the 119871th layer and then

4 Mathematical Problems in Engineering

take the derivative with respect to the weight matrix $W^{L-1}$ and offset vector $b^{L-1}$:

\[
E(W, b) = \frac{1}{2} \sum_{t} \bigl\| \rho(z^{L-1}) - x_t \bigr\|^2
= \frac{1}{2} \sum_{t} \sum_{k_L} \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)^2
= \frac{1}{2} \sum_{t} \sum_{k_L} \Bigl( \rho\Bigl( \sum_{k_{L-1}} w^{L-1}_{k_L k_{L-1}} v^{L-1}_{k_{L-1}} + b^{L-1}_{k_L} \Bigr) - x_{t,k_L} \Bigr)^2 .
\tag{4}
\]

Take the derivative for each component of $W^{L-1}$ and $b^{L-1}$:

\[
\frac{\partial E(W,b)}{\partial W^{L-1}_{k_L k_{L-1}}} = \sum_{t} \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)\, \rho'(z^{L-1}_{k_L})\, v^{L-1}_{k_{L-1}} ,
\qquad
\frac{\partial E(W,b)}{\partial b^{L-1}_{k_L}} = \sum_{t} \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)\, \rho'(z^{L-1}_{k_L}) .
\tag{5}
\]

Then take the derivative with respect to the weight matrix $W^{L-2}$ and offset vector $b^{L-2}$:

\[
E(W, b) = \frac{1}{2} \sum_{t} \bigl\| \rho(z^{L-1}) - x_t \bigr\|^2
= \frac{1}{2} \sum_{t} \sum_{k_L} \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)^2
= \frac{1}{2} \sum_{t} \sum_{k_L} \Bigl( \rho\Bigl( \sum_{k_{L-1}} w^{L-1}_{k_L k_{L-1}} \rho\Bigl( \sum_{k_{L-2}} W^{L-2}_{k_{L-1} k_{L-2}} v^{L-2}_{k_{L-2}} + b^{L-2}_{k_{L-1}} \Bigr) + b^{L-1}_{k_L} \Bigr) - x_{t,k_L} \Bigr)^2 .
\tag{6}
\]

Take the derivative for each component of $W^{L-2}$ and $b^{L-2}$:

\[
\frac{\partial E(W,b)}{\partial W^{L-2}_{k_{L-1} k_{L-2}}} = \sum_{t} \sum_{k_L} \Bigl[ \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)\, \rho'(z^{L-1}_{k_L})\, W^{L-1}_{k_L k_{L-1}} \Bigr] \, \rho'(z^{L-2}_{k_{L-1}})\, v^{L-2}_{k_{L-2}} ,
\]
\[
\frac{\partial E(W,b)}{\partial b^{L-2}_{k_{L-1}}} = \sum_{t} \sum_{k_L} \Bigl[ \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)\, \rho'(z^{L-1}_{k_L})\, W^{L-1}_{k_L k_{L-1}} \Bigr] \, \rho'(z^{L-2}_{k_{L-1}}) .
\tag{7}
\]

The result is as follows:

\[
\frac{\partial E(W,b)}{\partial W^{l}} = \sum_{t} e^{l+1}(t) \bigl( v^{l}(t) \bigr)^{T} ,
\qquad
\frac{\partial E(W,b)}{\partial b^{l}} = \sum_{t} e^{l+1}(t) .
\tag{8}
\]

Here

\[
e^{L}_{k_L}(t) = \bigl( \rho(z^{L-1}_{k_L}) - x_{t,k_L} \bigr)\, \rho'(z^{L-1}_{k_L}) ,
\qquad
e^{L-1}_{k_{L-1}}(t) = \sum_{k_L} e^{L}_{k_L}(t)\, W^{L-1}_{k_L k_{L-1}}\, \rho'(z^{L-2}_{k_{L-1}}) .
\tag{9}
\]

After vectorization:

\[
e^{L}(t) = \operatorname{diag}\bigl( \rho'(z^{L-1}) \bigr) \bigl( \rho(z^{L-1}) - x_t \bigr) ,
\qquad
e^{L-1}(t) = \operatorname{diag}\bigl( \rho'(z^{L-2}) \bigr) \bigl( W^{L-1} \bigr)^{T} e^{L}(t) .
\tag{10}
\]

The recursive process is as follows:

\[
e^{L}(t) = Q^{L}(t) \bigl( f(y_t; W, b) - x_t \bigr) ,
\qquad
e^{l}(t) = Q^{l}(t) \bigl( W^{l} \bigr)^{T} e^{l+1}(t) ,
\]
\[
Q^{l}(t) = \operatorname{diag}\Bigl( \rho'\bigl( z^{l-1}(v^{l-1}(t)) \bigr) \Bigr) ,
\qquad
\rho'(z) = \rho(z) \bigl( 1 - \rho(z) \bigr) .
\tag{11}
\]

In order to calculate the error between the actual output and the expected output, we feed the samples $y_1, \ldots, y_T$ to the DNN and then execute the forward process of the DNN. Meanwhile, we calculate the outputs of all the hidden layer nodes and output nodes, so we can compute the error $e^L(t)$. Next, the backpropagation procedure is executed, and the error of the nodes on each hidden layer is calculated iteratively after $e^L(t)$ is obtained. The parameters of the DNN can then be updated layer by layer according to the following formula:

\[
\bigl( W^{l}, b^{l} \bigr)_{m+1} = \bigl( W^{l}, b^{l} \bigr)_{m} + \Delta \bigl( W^{l}, b^{l} \bigr)_{m} , \quad 0 \leq l \leq L ,
\]
\[
\Delta \bigl( W^{l}, b^{l} \bigr)_{m} = (1 - \alpha)\, \varepsilon \left( \frac{\partial E}{\partial (W^{l}, b^{l})} \right) + \alpha\, \Delta \bigl( W^{l}, b^{l} \bigr)_{m-1} , \quad 0 \leq l \leq L .
\tag{12}
\]


Table 1: The coverage data and execution results of test cases.

           s_1   s_2   s_3   ...   s_{m-1}   s_m   R
t_1         1     0     1    ...      0       1    0
t_2         1     1     1    ...      0       1    0
t_3         1     0     1    ...      1       1    1
...
t_{n-1}     1     0     0    ...      1       0    1
t_n         1     1     1    ...      0       0    0

Here $\varepsilon$ is the learning rate, $\alpha$ is the impulse factor, and $m$ denotes the $m$th iteration. In the process of fault localization, we input the virtual test matrix $Y$ into the DNN model and then execute the forward process of the DNN; the final output is the suspiciousness value of each statement.
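The per-layer update of (12) is a momentum-style rule: the new step blends the current gradient, scaled by the learning rate, with the previous step weighted by the impulse factor. A minimal sketch follows; it assumes the gradient term of (12) is applied in the descent direction, which (12) leaves implicit.

import numpy as np

def update_parameters(W, b, dW, db, prev_dW, prev_db, lr=0.01, alpha=0.9):
    """One application of (12) to every layer. W, b are lists of per-layer
    parameters; dW, db the current gradients; prev_dW, prev_db the previous
    steps Delta(W, b)_{m-1}; lr is epsilon and alpha the impulse factor."""
    new_dW = [-(1 - alpha) * lr * g + alpha * p for g, p in zip(dW, prev_dW)]
    new_db = [-(1 - alpha) * lr * g + alpha * p for g, p in zip(db, prev_db)]
    W = [w + d for w, d in zip(W, new_dW)]   # (W, b)_{m+1} = (W, b)_m + Delta_m
    b = [x + d for x, d in zip(b, new_db)]
    return W, b, new_dW, new_db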

3. DNN-Based Fault Localization Technique

While the previous section provides an overview of the DNN model, this part presents the methodology of the DNN-based fault localization technique in detail, with a focus on how to build a DNN model for the fault localization problem. Meanwhile, we summarize the procedure of localizing faults using DNN and provide a concrete example for demonstration at the end of this part.

3.1. Fault Localization Algorithm Based on Deep Neural Network. Suppose that there is a program P with $m$ executable statements and $n$ test cases to be executed. $t_i$ denotes the $i$th test case; the vectors $c_{t_i}$ and $r_{t_i}$ represent the corresponding coverage data and execution result after executing test case $t_i$; and $s_j$ is the $j$th executable statement of program P. Here $c_{t_i} = [(c_{t_i})_1, (c_{t_i})_2, \ldots, (c_{t_i})_m]$. If the executable statement $s_j$ is covered by $t_i$, we assign the value 1 to $(c_{t_i})_j$; otherwise we assign 0 to it. If the test case $t_i$ is executed successfully, we assign the value 0 to $r_{t_i}$; otherwise it is assigned 1. We depict the coverage data and execution results of test cases in Table 1. For instance, from Table 1 we can conclude that $s_2$ is not covered by the failed test case $t_3$, while $s_3$ is covered by it.

Once the training of the deep neural network is completed, the network can reflect the complex nonlinear relationship between the coverage data and the execution result of a test case, and we can identify the suspicious code of a faulty version with the trained network. At the same time, constructing a set of virtual test cases is necessary, as shown in the following equation.

The Virtual Test Set. Consider

\[
\begin{bmatrix} c_{v_1} \\ c_{v_2} \\ \vdots \\ c_{v_m} \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
  &   & \ddots &   \\
0 & 0 & \cdots & 1
\end{bmatrix} ,
\tag{13}
\]

where the vectors $c_{v_1}, c_{v_2}, \ldots, c_{v_m}$ denote the coverage data of the test cases $v_1, v_2, \ldots, v_m$.

In the set of virtual test cases, the test case $v_j$ covers only the executable statement $s_j$. Since each test case covers exactly one statement, that statement is highly suspicious if the corresponding test case fails. For example, $s_j$ is very likely to contain bugs if the test case $v_j$ fails, which means we should preferentially check the statements covered by failed test cases. But such a set of test cases does not exist in reality; thus the execution results of the test cases in the virtual test set are difficult to obtain in practice. So we input $c_{v_j}$ into the trained DNN, and the output $r_{v_j}$ represents the probability of the test case failing. The value of $r_{v_j}$ is proportional to the suspiciousness of $s_j$ containing bugs.

The specific fault localization algorithm is as follows (a minimal code sketch is given after the list):

(1) Construct the DNN model with one input layer and one output layer, and identify the appropriate number of hidden layers according to the experiment scale. Suppose the number of input layer nodes is $m$, the number of hidden layer nodes is $n$, and the number of output layer nodes is 1; the adopted transfer function is the sigmoid function $\rho(s) = 1/(1 + e^{-s})$.

(2) Utilize the coverage data and execution results of the test cases as the training sample set, input it into the DNN model, and then train the DNN model to obtain the complex nonlinear mapping relationship between the coverage data $c_{t_i}$ and the execution result $r_{t_i}$.

(3) Input the coverage data vectors of the virtual test set $c_{v_j}$ ($1 \leq j \leq m$) into the DNN model and get the outputs $r_{v_j}$ ($1 \leq j \leq m$).

(4) The output $r_{v_j}$ reflects the probability that the executable statement $s_j$ contains bugs, that is, its suspiciousness value; conduct a descending ranking of the $r_{v_j}$.

(5) Rank the statements $s_j$ ($1 \leq j \leq m$) according to their corresponding suspiciousness values and check them one by one from the most suspicious to the least until the faults are finally located.
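Putting the five steps together, a minimal end-to-end sketch looks as follows. The `model` object is a stand-in for any trained network exposing `fit` and `predict` (for example, one pretrained and fine-tuned as in Section 2; it is an assumption of this sketch, not an API from the paper), and the virtual test suite is simply the $m \times m$ identity matrix from (13).

import numpy as np

def localize_faults(coverage, results, model):
    """coverage: n x m 0/1 matrix (test cases x statements); results: length-n
    0/1 vector of execution outcomes. Returns suspiciousness values and the
    statements ranked from most to least suspicious (1-based indices)."""
    model.fit(coverage, results)                  # step (2): learn coverage -> result
    virtual_tests = np.eye(coverage.shape[1])     # step (3): test v_j covers only s_j
    susp = model.predict(virtual_tests).ravel()   # step (4): r_vj per statement
    ranking = np.argsort(-susp) + 1               # step (5): descending suspiciousness
    return susp, ranking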

3.2. An Example of the Fault Localization Algorithm Based on Deep Neural Network. Here we illustrate the application of fault localization analysis based on a deep neural network with a concrete example, depicted in Table 2.

Table 2 shows the program Mid(·), whose function is to get the middle number by comparing three integers. The program has twelve statements and ten test cases. There is one fault in the program, contained in statement (6). Here "∙" denotes that the statement is covered by the corresponding test case, and a blank denotes that it is not. P represents that the test case is executed successfully, while F represents that the execution result of the test case is failed.

The coverage data and execution results of the test cases are shown in Table 2. Table 3 corresponds to Table 2: the number 1 replaces "∙" and indicates that the statement is covered by the corresponding test case, and the number 0


Table 2: Coverage and execution results of the Mid function.

Mid(x, y, z) { int m;      t1      t2      t3      t4      t5      t6      t7      t8      t9      t10
  (x, y, z) =            3,3,5   1,2,3   3,2,2   5,5,5   1,1,4   5,3,4   3,2,1   5,4,2   2,1,3   5,2,6
(1)  m = z;                ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙
(2)  if (y < z)            ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙
(3)    if (x < y)          ∙       ∙                       ∙       ∙                       ∙       ∙
(4)      m = y;                    ∙
(5)    else if (x < z)     ∙                               ∙       ∙                       ∙       ∙
(6)      m = y; (bug)      ∙                               ∙                               ∙       ∙
(7)  else                                  ∙       ∙                       ∙       ∙
(8)    if (x > y)                          ∙       ∙                       ∙       ∙
(9)      m = y;                            ∙                               ∙       ∙
(10)   else if (x > z)                             ∙
(11)     m = x;
(12) printf(m);            ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙       ∙
Pass/fail                  P       P       P       P       P       P       P       P       F       F

Table 3: Switched form of Table 2.

        s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12  |  r
t1       1   1   1   0   1   1   0   0   0   0    0    1   |  0
t2       1   1   1   1   0   0   0   0   0   0    0    1   |  0
t3       1   1   0   0   0   0   1   1   1   0    0    1   |  0
t4       1   1   0   0   0   0   1   1   0   1    0    1   |  0
t5       1   1   1   0   1   1   0   0   0   0    0    1   |  0
t6       1   1   1   0   1   0   0   0   0   0    0    1   |  0
t7       1   1   0   0   0   0   1   1   1   0    0    1   |  0
t8       1   1   0   0   0   0   1   1   1   0    0    1   |  0
t9       1   1   1   0   1   1   0   0   0   0    0    1   |  1
t10      1   1   1   0   1   1   0   0   0   0    0    1   |  1

replaces the blank, representing that the statement is not covered by the corresponding test case. In the last column, the number 1 replaces F and indicates that the execution result of the test case is failed, while the number 0 replaces P and represents that the test case is executed successfully.

The concrete fault localization process is as follows:

(1) Construct the DNN model with one input layer, one output layer, and three hidden layers. The number of input layer nodes is 12, and the number of hidden layer nodes is simply set to 4. The number of output layer nodes is 1, and the transfer function adopted is the sigmoid function $\rho(s) = 1/(1 + e^{-s})$.

(2) Use the coverage data and execution results of the test cases as the training sample set and input it into the DNN model. First we input the vector (1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1) and its execution result 0, next the second vector (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1) and its execution result 0, and so on, until the coverage data and execution results of the ten test cases have all been input into the network. We then train the DNN model on the input training data, and further we input the next training dataset (i.e., another ten test cases' coverage data and execution results) iteratively to optimize the DNN model until we reach the condition of convergence. The final DNN model, obtained after several iterations, reveals the complex nonlinear mapping relationship between the coverage data and the execution result.

(3) Construct the virtual test set with twelve test cases, ensuring that each test case covers only one executable statement. The set of virtual test cases is depicted in Table 4.

(4) Input the virtual test set into the trained DNN model, get the suspiciousness value of each corresponding executable statement, and then rank the statements according to their suspiciousness values.

(5) Table 5 shows the descending ranking of the statements according to their suspiciousness values. From the ranking list we find that the suspiciousness value of the sixth statement, which contains the bug in program Mid(·), is the greatest (i.e., it ranks first); thus we can locate the fault by checking only one statement.

4. Empirical Studies

So far, the modeling procedure of our DNN-based fault localization technique has been discussed in the previous section. In this section we empirically compare the performance of this technique with that of the Tarantula localization technique [4], the PPDG localization technique [3], and the BPNN localization technique [10] on the Siemens suite and the Space program, to demonstrate the effectiveness of our DNN-based fault localization method.


Table 4: Virtual test suite.

        s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12
t1       1   0   0   0   0   0   0   0   0   0    0    0
t2       0   1   0   0   0   0   0   0   0   0    0    0
t3       0   0   1   0   0   0   0   0   0   0    0    0
t4       0   0   0   1   0   0   0   0   0   0    0    0
t5       0   0   0   0   1   0   0   0   0   0    0    0
t6       0   0   0   0   0   1   0   0   0   0    0    0
t7       0   0   0   0   0   0   1   0   0   0    0    0
t8       0   0   0   0   0   0   0   1   0   0    0    0
t9       0   0   0   0   0   0   0   0   1   0    0    0
t10      0   0   0   0   0   0   0   0   0   1    0    0
t11      0   0   0   0   0   0   0   0   0   0    1    0
t12      0   0   0   0   0   0   0   0   0   0    0    1

Table 5: Suspiciousness values and ranking of the statements.

Statement     Suspiciousness    Rank
1             0.0448            11
2             0.1233            2
3             0.0879            5
4             0.0937            4
5             0.0747            7
6 (fault)     0.1593            1
7             0.1191            3
8             0.0758            6
9             0.0661            8
10            0.0436            12
11            0.0650            9
12            0.0649            10

The Siemens suite and the Space program can be downloaded from [18]. The evaluation standard based on statement ranking was proposed by Jones and Harrold in [4] to compare the effectiveness of the Tarantula localization technique with other localization techniques. The Tarantula localization technique produces a suspiciousness ranking of statements; the programmers examine the statements one by one from the first to the last until the fault is located, and the percentage of statements that need not be examined is defined as the score of the fault localization technique, that is, the EXAM score. Since the DNN-based fault localization technique, the Tarantula localization technique, and the PPDG localization technique all produce a suspiciousness value ranking list for executable statements, we adopt the EXAM score as the evaluation criterion to measure effectiveness.

4.1. Data Collection. The physical environment on which our experiments were carried out included a 3.40 GHz Intel Core i7-3770 CPU and 16 GB of physical memory. The operating systems were Windows 7 and Ubuntu 12.10. Our compiler was gcc 4.7.2. We conducted the experiments in MATLAB R2013a. To collect the coverage data of each statement, the tool Gcov was used. The test case execution results of the faulty program versions were obtained by the following method (a small sketch follows the list):

(1) We run the test cases on the fault-free program version to get the expected execution results.

(2) We run the test cases on the faulty program versions to get the actual execution results.

(3) We compare the results of the fault-free program version with the results of the faulty program versions. If they are equal, the test case is successful; otherwise it is failed.
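As a concrete illustration of this three-step labeling, the sketch below compares saved program outputs file by file; the directory layout and naming (`oracle/t{i}.out`, `faulty/t{i}.out`) are assumptions made for the example, not part of the paper's setup.

from pathlib import Path

def label_test_results(n_tests, oracle_dir="oracle", faulty_dir="faulty"):
    """Return r_ti = 0 if the faulty version's output matches the fault-free
    (oracle) output for test case t_i, and 1 otherwise."""
    labels = []
    for i in range(1, n_tests + 1):
        expected = Path(oracle_dir, f"t{i}.out").read_bytes()
        actual = Path(faulty_dir, f"t{i}.out").read_bytes()
        labels.append(0 if expected == actual else 1)
    return labels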

4.2. Programs of the Test Suite. In our experiment we conducted two studies, on the Siemens suite and on the Space program, to demonstrate the effectiveness of the DNN-based fault localization technique.

4.2.1. The Siemens Suite. The Siemens suite has been considered a classical test sample and is widely employed in studies of fault localization techniques, so our experiments begin with it. The Siemens suite contains seven C programs with corresponding faulty versions and test cases. Each faulty version has only one fault, but the fault may span multiline statements or even multiple program functions. Table 6 provides a detailed introduction to the seven C programs in the Siemens suite, including the program name, the number of faulty versions of each program, the number of statement lines, the number of executable statement lines, and the number of test cases.

The functions of the seven C programs differ. The Print tokens and Print tokens 2 programs are used for lexical analysis, and the function of the Replace program is pattern replacement. The Schedule and Schedule 2 programs are utilized for priority scheduling, while the function of Tcas is altitude separation and the Tot info program is used for information measure.

The Siemens suite contains 132 faulty versions, of which we chose 122 to perform the experiments. We omitted the following versions: versions 4 and 6 of Print tokens, version 9 of Schedule 2, version 12 of Replace, versions 13, 14, and 36 of Tcas, and versions 6, 9, and 21 of Tot info. We eliminated these versions for the following reasons:

(a) There is no syntactic difference between the correct version and the faulty version (e.g., the only difference is in a header file).

(b) The test cases never fail when executing the programs of the faulty versions.

(c) The faulty versions have segmentation faults when executing the test cases.

(d) The difference between the correct version and the faulty version is not included in the executable statements of the program and cannot be replaced. In addition, some faulty versions have missing statements, and we cannot locate the missing statements directly; in this case we can only select the associated statements to locate the faults.


Table 6: Summary of the Siemens suite.

Program          Number of faulty versions   LOC (lines of code)   Number of executable statements   Number of test cases
Print tokens              7                         565                      172                            4130
Print tokens 2           10                         510                      146                            4115
Replace                  32                         563                      175                            5542
Schedule                  9                         412                      140                            2650
Schedule 2               10                         307                      115                            2710
Tcas                     41                         173                       59                            1608
Tot info                 23                         406                      100                            1052

4.2.2. The Space Program. While the executable statements of the Siemens suite programs number only in the hundreds of lines, the scale and complexity of software have increased gradually, and a program may contain thousands or tens of thousands of lines. So we also verify the effectiveness of our DNN-based fault localization technique on a large-scale program set, that is, the Space program.

The Space program contains 38 faulty versions; each version has more than 9000 statements and 13585 test cases. In our experiment, for reasons similar to those described for the Siemens suite, we omit 5 faulty versions and select the other 33 faulty versions.

4.3. Experimental Process. In this paper we conducted experiments on the 122 faulty versions of the Siemens suite and the 33 faulty versions of the Space program. For the DNN modeling process we adopted three hidden layers; that is, the structure of the network is composed of one input layer, three hidden layers, and one output layer, which is based on previous experience and a preliminary experiment. According to experience and the sample size, we estimate the number of hidden layer nodes (i.e., num) by the following formula:

\[
\text{num} = \operatorname{round}\left( \frac{n}{30} \right) \times 10 .
\tag{14}
\]

Here $n$ represents the number of input layer nodes; for example, for the 172 executable statements of Print tokens, num = round(172/30) × 10 = 60. The impulse factor is set to 0.9, and the candidate learning rates are 0.01, 0.001, 0.0001, and 0.00001; we select the most appropriate learning rate as the final parameter according to the sample scale.
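Formula (14) is a one-liner in code; the worked value below just restates the Print tokens example above.

def hidden_nodes(n):
    """Formula (14): num = round(n / 30) * 10, where n is the number of
    input layer nodes (= executable statements)."""
    return round(n / 30) * 10

# e.g., Print tokens has 172 executable statements: round(172/30) * 10 = 60
assert hidden_nodes(172) == 60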

Now we take the Print tokens program as an example and introduce the experimental process in detail; the other programs are handled analogously. The Print tokens program includes seven faulty versions, and we picked out five versions to conduct the experiment. The experimental process is as follows:

(1) First, we compile the source program of Print tokens and run the test case set to get the expected execution results of the test cases.

(2) We compile the five faulty versions of the Print tokens program using GCC, get the actual execution results of the test cases in the same way, and acquire the coverage data of the test cases using Gcov (see the sketch after this list).

(3) We construct the set of virtual test cases for each of the five faulty versions of the Print tokens program.

(4) Then we compare the execution results of the test cases between the source program and the five faulty versions of Print tokens to get the final execution results of the test cases of the five faulty versions.

(5) We integrate the coverage data and the final results of the test cases of the five faulty versions, respectively, as input vectors to train the deep neural network.

(6) Finally, we utilize the set of virtual test cases to test the deep neural network, acquire the suspiciousness of the corresponding statements, and then rank the suspiciousness values.
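Step (2) relies on Gcov. A minimal sketch of turning one .gcov report into a 0/1 coverage vector is given below; it assumes the standard .gcov line format `count:lineno:source`, where `-` marks non-executable lines and `#####` marks executable lines that were never run.

def coverage_vector(gcov_file):
    """Parse a .gcov report into a 0/1 coverage vector over executable lines:
    1 if the line was executed by the test run, 0 otherwise."""
    vector = []
    with open(gcov_file) as f:
        for line in f:
            count = line.split(":", 2)[0].strip()
            if count == "-":
                continue                              # not an executable statement
            vector.append(0 if count.startswith(("#", "=")) else 1)
    return vector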

5. Results and Analysis

According to the experimental process described in the previous section, we performed the fault localization experiments with the deep neural network and the BP neural network and statistically analyzed the experimental results. We then compare the experimental results with the Tarantula localization technique and the PPDG localization technique; the experimental results of Tarantula and PPDG were reported in [4] and [3], respectively.

5.1. The Experimental Results on the Siemens Suite. Figure 3 shows the effectiveness of the DNN, BP neural network, Tarantula, and PPDG localization techniques on the Siemens suite.

In Figure 3, the x-axis represents the percentage of statements that need not be examined until the fault is located, that is, the EXAM score, while the y-axis represents the percentage of faulty versions whose faults have been located. For instance, a point labeled (70, 80) on a curve means the technique is able to find eighty percent of the faults in the Siemens suite with seventy percent of the statements not examined; that is, we can identify eighty percent of the faults in the Siemens suite by examining only thirty percent of the statements.

Table 7 offers further explanation of Figure 3. Following the statistical approach used for other fault localization techniques, the EXAM score can be divided into 11 segments. Each 10% is treated as one segment, but the caveat is that the programmers are not able to identify the faults without examining any statement; thus the abscissa can only be infinitely close to 100%. Due to that factor, we divide the 100%-90% range into two segments, namely, 100%-99% and 99%-90%.

[Figure 3: Effectiveness comparison on the Siemens suite. The x-axis is the percentage of executable statements that need not be examined (i.e., the EXAM score, from 100 down to 0); the y-axis is the percentage of fault versions located (0-100). Curves: Tarantula, PPDG (best), NN (BPNN), and DNN.]

Table 7: Effectiveness comparison of four techniques.

EXAM score     % of the faulty versions
               DNN      BPNN     PPDG     Tarantula
100%-99%       12.29    6.56     41.94    13.93
99%-90%        61.48    43.44    31.45    41.80
90%-80%        21.31    24.59    13.71    5.74
80%-70%        4.10     11.48    2.42     9.84
70%-60%        0.00     7.38     2.42     8.20
60%-50%        0.82     3.28     5.65     7.38
50%-40%        0.00     3.28     1.61     0.82
40%-30%        0.00     0.00     0.00     0.82
30%-20%        0.00     0.00     0.80     4.10
20%-10%        0.00     0.00     0.00     7.38
10%-0%         0.00     0.00     0.00     0.00

The data of each segment can be used to evaluate the effectiveness of a fault localization technique. For example, in the 99%-90% segment, DNN identifies 61.48% of the faults in the Siemens suite, while the BPNN technique locates 43.44%. From Table 7 we can also see that, for the segments 50%-40%, 40%-30%, 30%-20%, 20%-10%, and 10%-0%, the percentage of faults identified by DNN is 0.00% in each; that indicates that the DNN-based method has found all the faulty versions after examining 50% of the statements.

We analyzed the above data and summarize the following points:

(1) Figure 3 reveals that the overall effectiveness of our DNN-based fault localization technique is better than that of the BP neural network fault localization technique, as the curve representing the DNN-based approach is always above the curve denoting the BPNN-based method. The DNN-based technique is also able to identify faults by examining fewer statements than the method based on the BP neural network: for instance, by examining 10% of the statements, the DNN-based method identifies 73.77% of the faulty versions, while the BP neural network fault localization technique finds only 50% of the faulty versions.

(2) Compared with the Tarantula fault localization technique, the DNN-based approach dramatically improves the effectiveness of fault localization on the whole. For example, the DNN-based method is capable of finding 95.08% of the faulty versions after examining 20% of the statements, while the Tarantula fault localization technique identifies only 61.47% of the faulty versions after examining the same number of statements. For the score segment of 100%-99%, the effectiveness of the DNN-based fault localization technique is relatively similar to that of Tarantula: the DNN-based method finds 12.29% of the faulty versions with 99% of the statements not examined (i.e., at abscissa 99), close to the performance of the Tarantula fault localization technique.

(3) Compared with the PPDG fault localization technique, the DNN-based fault localization technique shows improved effectiveness as well over the score range of 90%-0%. Moreover, the DNN-based fault localization technique is capable of finding all faulty versions by examining only 50% of the statements, while the PPDG fault localization technique needs to examine 80% of the statements. For the score range of 100%-99%, the performance of the DNN-based fault localization technique is inferior to the PPDG fault localization technique.

In conclusion, the DNN-based fault localization technique considerably improves the effectiveness of localization overall. In particular, over the score range of 90%-0%, the improvement is more obvious compared with the other fault localization techniques, as fewer statements need to be examined to identify the faults. The DNN-based fault localization technique is able to find all faulty versions by examining only 50% of the statements, which is superior to the BPNN, Tarantula, and PPDG fault localization techniques; for the score segment of 100%-99%, the performance of our DNN-based approach is similar to that of the Tarantula fault localization method, while it is a bit inferior to that of the PPDG fault localization technique.

5.2. The Experimental Results on the Space Program. As the executable statements of the Siemens suite programs number only in the hundreds of lines, we further verify the effectiveness of the DNN-based fault localization technique on a large-scale dataset.

Figure 4 demonstrates the effectiveness of the DNN and BP neural network localization techniques on the Space program.

[Figure 4: Effectiveness comparison on the Space program. The x-axis is the percentage of executable statements that need not be examined (EXAM score); the y-axis is the percentage of fault versions located. Curves: NN (BPNN) and DNN.]

From Figure 4 we can clearly see that the effectiveness of our DNN-based method is obviously higher than that of the BP neural network fault localization technique. The DNN-based fault localization technique is able to find all faulty versions with 93% of the statements not examined, while the BP neural network identifies all faulty versions with 83% of the statements not examined. This comparison indicates that the performance of our DNN-based approach is superior, as it reduces the number of statements to be examined. The experimental results show that the DNN-based fault localization technique is also highly effective on large-scale datasets.

6. Threats to the Validity

There may exist several threats to the validity of the technique presented in this paper; we discuss some of them in this section.

We empirically determined the structure of the network, including the number of hidden layers and the number of hidden nodes, and we determined some important model parameters by grid search. So there probably exist better structures for the fault localization model. As the DNN model trained for the first time is usually not the most optimal one, we empirically need to modify the parameters of the network several times to try to obtain a better model.

In the field of fault localization we may encounter multiple-bug programs, and we would need to construct new models that can locate multiple bugs. In addition, because the number of input nodes of the model is equal to the number of lines of executable statements, the model becomes very large when the number of executable statements is very large. That limits fault localization for some large-scale software, and we may need to control the scale of the model or improve the performance of the computer we use. Moreover, the virtual test set built to calculate suspiciousness, in which each test case covers only one statement, may not be suitably designed. When we use a virtual test case to describe a certain statement, we assume that if the test case is predicted as failed, then the statement covered by it contains a bug. However, whether this assumption is reasonable has not been confirmed. As this assumption is important for our fault localization approach, we may need to test this hypothesis or propose an alternative strategy.

7. Conclusion and Future Work

In this paper we propose a DNN-based fault localization technique, as a DNN is able to simulate the complex nonlinear relationship between its input and output. We conduct an empirical study on the Siemens suite and the Space program, and we compare the effectiveness of our method with other techniques, namely, the BP neural network, Tarantula, and PPDG. The results show that our DNN-based fault localization technique performs best. For example, for the Space program, the DNN-based fault localization technique needs to examine only 10% of the statements to fully identify all faulty versions, while the BP neural network fault localization technique needs to examine 20% of the statements to find all faulty versions.

As software fault localization is one of the most expensive, tedious, and time-consuming activities during the software testing process, it is of great significance for researchers to automate the localization process [19]. By leveraging the strong feature-learning ability of DNN, our approach helps to predict the likelihood of each statement in a program containing a fault, and that can guide programmers to the locations of faults with minimal human intervention. As deep learning is widely applied and demonstrates good performance in research fields such as image processing, speech recognition, and natural language processing, and in related industries as well [13], and with the ever-increasing scale and complexity of software, we believe that our DNN-based method has great potential to be applied in industry. It could help to improve the efficiency and effectiveness of the debugging process, boost the software development process, reduce software maintenance costs, and so forth.

In our future work we will apply deep neural networks to multifault localization to evaluate their effectiveness. Meanwhile, we will conduct further studies on the feasibility of applying other deep learning models (e.g., convolutional neural networks, recurrent neural networks) in the field of software fault localization.

Competing Interests

The authors declare that they have no competing interests

References

[1] M. A. Alipour, "Automated fault localization techniques: a survey," Tech. Rep., Oregon State University, Corvallis, Ore, USA, 2012.
[2] W. E. Wong and V. Debroy, "A survey of software fault localization," Tech. Rep. UTDCS-45-09, Department of Computer Science, The University of Texas at Dallas, 2009.
[3] G. K. Baah, A. Podgurski, and M. J. Harrold, "The probabilistic program dependence graph and its application to fault diagnosis," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 528-545, 2010.
[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE '05), pp. 273-282, Long Beach, Calif, USA, November 2005.
[5] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, "Statistical debugging: a hypothesis testing-based approach," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 831-847, 2006.
[6] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, "A crosstab-based statistical method for effective fault localization," in Proceedings of the 1st International Conference on Software Testing, Verification and Validation (ICST '08), pp. 42-51, Lillehammer, Norway, April 2008.
[7] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183-200, 2002.
[8] X. Zhang, N. Gupta, and R. Gupta, "Locating faults through automated predicate switching," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pp. 272-281, Shanghai, China, May 2006.
[9] S. Ali, J. H. Andrews, T. Dhandapani, et al., "Evaluating the accuracy of fault localization techniques," in Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE '09), pp. 76-87, IEEE Computer Society, Auckland, New Zealand, November 2009.
[10] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573-597, 2009.
[11] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, "Effective software fault localization using an RBF neural network," IEEE Transactions on Reliability, vol. 61, no. 1, pp. 149-169, 2012.
[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 437-440, Florence, Italy, August 2011.
[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[15] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, Canada, December 2009.
[16] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto, 2010.
[17] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
[18] http://sir.unl.edu/
[19] D. Binkley, "Source code analysis: a roadmap," in Proceedings of the Future of Software Engineering (FOSE '07), pp. 104-119, IEEE Computer Society, Minneapolis, Minn, USA, May 2007.


Page 4: Research Article Fault Localization Analysis Based on Deep ...downloads.hindawi.com/journals/mpe/2016/1820454.pdf · fault localization techniques that can guide programmers to the

4 Mathematical Problems in Engineering

take the derivative of theweightmatrix119882119871minus1 andoffset vector119887

119871minus1

119864 (119882 119887) =

1

2

sum

119905

10038171003817100381710038171003817

120588 (119911

119871minus1minus 119909119905)

10038171003817100381710038171003817

=

1

2

sum

119905

sum

119896119871

(120588 (119911

119871minus1

119896119871) minus 119909119905119896119871)

2

=

1

2

sum

119905

sum

119896119871

(120588(sum

119896119871minus1

119908

119871minus1

119896119871119896119871minus1V119871minus1119896119871minus1

+ 119887

119871minus1

119896119871) minus 119909

119905119896119871)

2

(4)

Take the derivative for each component of119882119871minus1 and 119887119871minus1

120597119864 (119882 119887)

120597119882

119871minus1

119896119871119896119871minus1

= sum

119905

(120588 (119911

119871minus1

119896119871) minus 119909119905119896119871) 120588

1015840(119911

119871minus1

119896119871) (V119871minus1119896119871minus1)

120597119864 (119882 119887)

120597119887

119871minus1

119896119871

= sum

119905

(120588 (119911

119871minus1

119896119871) minus 119909119905119896119871) 120588

1015840(119911

119871minus1

119896119871)

(5)

Then take the derivative of the weight matrix119882119871minus2 andoffset vector 119887119871minus2

119864 (119882 119887) =

1

2

sum

119905

10038171003817100381710038171003817

120588 (119911

119871minus1minus 119909119905)

10038171003817100381710038171003817

=

1

2

sum

119905

sum

119896119871

(120588 (119911

119871minus1

119896119871) minus 119909119905119896119871)

2

=

1

2

sum

119905

sum

119896119871

(120588(sum

119896119871minus1

119908

119871minus1

119896119871119896119871minus1V119871minus1119896119871minus1

+ 119887

119871minus1

119896119871) minus 119909

119905119896119871)

2

=

1

2

sum

119905

sum

119896119871

(120588(sum

119896119871minus1

119908

119871minus1

119896119871119896119871minus1120588 (119911

119871minus1

119896119871) + 119887

119871minus1

119896119871minus 119909119905119896119871)

2

)

=

1

2

sum

119905

sum

119896119871

(120588(sum

119896119871minus1

119908

119871minus1

119896119871119896119871minus1120588(sum

119896119871minus2

119882

119871minus2

119896119871minus1119896119871minus2V119871minus2119896119871minus2

+ 119887

119871minus2

119896119871minus1) + 119887

119871minus1

119896119871minus 119909119905119896119871)

2

)

(6)

Take the derivative for each component of119882119871minus2 and 119887119871minus2

120597119864 (119882 119887)

120597119882

119871minus2

119896119871minus1119896119871minus2

= sum

119905

sum

119896119871

[(120588 (119911

119871minus1

119896119871) minus 119909119905119896119871) 120588

1015840(119911

119871minus1

119896119871)119882

119871minus1

119896119871119896119871minus1]

sdot 120588

1015840(119911

119871minus2

119896119871minus1) V119871minus2119896119871minus2

120597119864 (119882 119887)

120597119887

119871minus2

119896119871minus1

= sum

119905

sum

119896119871

[(120588 (119911

119871minus1

119896119871) minus 119909119905119896119871) 120588

1015840(119911

119871minus1

119896119871)119882

119871minus1

119896119871119896119871minus1]

sdot 120588

1015840(119911

119871minus2

119896119871minus1)

(7)

The result is as follows

120597119864 (119882 119887)

120597119882

119897= sum

119905

119890

119897+1(119905) (V119897 (119905))

119879

120597119864 (119882 119887)

120597119887

119897= sum

119905

119890

119897+1(119905)

(8)

Here

119890

119871

119896119871(119905) = (120588 (119911

119871minus1

119896119871) minus 119909119905119896119871) 120588

1015840(119911

119871minus1

119896119871)

119890

119871minus1

119896119871minus1(119905) = sum

119896119871

119890

119871

119896119871(119905)119882

119871minus1

119896119871119896119871minus1120588

1015840(119911

119871minus2

119896119871minus1)

(9)

After vectorization

119890

119871(119905) = diag (1205881015840 (119911119871minus1

119896119871)) (120588 (119911

119871minus1) minus 119909119905)

119890

119871

119896(119905) = diag (1205881015840 (119911119871minus2

119896119871minus1)) (119882

119871minus1)

119879

119890

119871(119905)

(10)

The recursive process is as follows

119890

119871(119905) = 119876

119871(119905) (119891 (119910119905

(119882 119887)

0) minus 119909119905)

119890

119897(119905) = 119876

119871(119905) (119882

119897)

119879

119890

119897+1(119905)

119876

119897(119905) = diag (1205881015840 (119911119897minus1 (V119897minus1 (119905))))

120588

1015840(119911) = 120588 (119911) sdot (1 minus 120588 (119911))

(11)

In order to calculate the error between the actual outputand the expected output we output samples 119910

1119879to DNN

and then execute the forward process of DNN Meanwhilewe calculate the outputs of all the hidden layer nodes andoutput nodes and now we can compute to get the error 119890119871(119905)Next backpropagation procedure is executed and error ofthe nodes on each hidden layer is calculated iteratively afterobtaining 119890119871(119905)The parameters of DNN can be updated layerby layer according to the following formula

(119882

119897 119887

119897)

119898+1

= (119882

119897 119887

119897)

119898

+ Δ (119882

119897 119887

119897)

119898

0 ⩽ 119897 ⩽ 119871

Δ (119882

119897 119887

119897)

119898

= (1 minus 120572) 120576 (

120597119864

120597 (119882

119897 119887

119897)

)

+ 120572Δ (119882

119897 119887

119897)

119898minus1

0 ⩽ 119897 ⩽ 119871

(12)

Mathematical Problems in Engineering 5

Table 1 The coverage data and execution result of test case

1199041

1199042

1199043

sdot sdot sdot 119904119898minus1

119904119898

119877

1199051

1 0 1 sdot sdot sdot 0 1 01199052

1 1 1 sdot sdot sdot 0 1 01199053

1 0 1 sdot sdot sdot 1 1 1

d

119905119899minus1

1 0 0 sdot sdot sdot 1 0 1119905119899

1 1 1 sdot sdot sdot 0 0 0

Here 120576 is the learning rate 120572 is the impulse factor and119898denotes the119898th iteration In the process of fault localizationwe input the virtual test matrix 119884 into the DNN model andthen execute the forward process of DNN and finally theoutput is the suspiciousness value of each statement

3 DNN-Based Fault Localization Technique

While the previous section provides an overview of theDNN model this part will present the methodology ofDNN-based fault localization technique in detail With afocus on how to build a DNN model for fault localizationproblem meanwhile we will summarize the procedures oflocalizing faults using DNN and provide a concrete examplefor demonstration at the end of this part

31 Fault Localization Algorithm Based on Deep Neural Net-work Suppose that there is a program P with 119898 executablestatements and 119899 test cases to be executed 119905

119894denotes the 119894th

test case while vectors 119888119905119894and 119903119905119894represent the corresponding

coverage data and execution result respectively after exe-cuting the test case 119905

119894 and 119904

119895is 119895th executable statement

of program P Here 119888119905119894= [(119888

119905119894)1 (119888119905119894)2 (119888

119905119894)119898] If the

executable statement 119904119895is covered by 119905

119895 we assign a real

value 1 to (119888119905119894)119895 otherwise we assign 0 to it If the test case

119905119894is executed successfully we assign a real value 0 to 119903

119905119894

otherwise it is assigned with 1 We depict the coverage dataand execution result of test case through Table 1 For instancefrom Table 1 we can draw a conclusion that 119904

2is not covered

with failed test case 1199053 while 119904

3is covered with failed test case

1199052according to Table 1The network can reflect the complex nonlinear relation-

ship between the coverage data and execution result of testcase when the training of deep neural network is completedWe can identify the suspicious code of faulty version bytrained network At the same time constructing a set ofvirtual test cases is necessary as shown in the followingequation

The Virtual Test Set Consider

[

[

[

[

[

[

[

[

119888V1

119888V2

119888V119898

]

]

]

]

]

]

]

]

=

[

[

[

[

[

[

[

1 0 sdot sdot sdot 0

0 1 sdot sdot sdot 0

d

0 0 sdot sdot sdot 1

]

]

]

]

]

]

]

(13)

the vector 119888V1 119888V2 119888V119898 denotes the coverage data of testcases V

1 V2 V

119898

In the set of virtual test cases the test case V119895only covers

executable statement 119904119895 Since each test case only covers

one statement thus the statement is highly suspicious if thecorresponding test case is failed For example 119904

119895is very likely

to contain bugs with a failed test case V119895 and that means we

should preferentially check the statements which are coveredby failed test cases But such set of test cases does not existin reality thus the execution result of test case in the virtualtest set is difficult to get in practice So we input 119888V119895 to thetrained DNN and the output 119903V119895 represents the probability oftest case execution result The value of 119903V119895 is proportional tothe suspicious degree of containing bugs of 119904

119895

Specific fault location algorithm is as follows

(1) Construct the DNN model with one input layer andone output layer and identify the appropriate numberof hidden layers according to the experiment scaleSuppose the number of input layer nodes is 119898 thenumber of hidden layer nodes is 119899 the number ofoutput layer nodes is 1 and the adopted transferfunction is sigmoid function 120588(119904) = 11 + 119890minus119904

(2) Utilize the coverage data and execution result of testcase as training sample set which is inputted into theDNNmodel and then train theDNNmodel to obtainthe complex nonlinear mapping relationship betweenthe coverage data 119888

119905119894and execution result 119903

119905119894

(3) Input the coverage data vector of virtual test set119888V119895 (1 ⩽ 119895 ⩽ 119898) into the DNN model and get theoutput 119903V119895 (1 ⩽ 119895 ⩽ 119898)

(4) The output 119903V119895 reflects the probability that executablestatement 119904

119895contains the bugs that is suspiciousness

value and then conduct descending ranking for 119903V119895

(5) Rank 119904119895(1 ⩽ 119895 ⩽ 119898) according to their corresponding

suspiciousness value and check the statement one byone from the most suspicious to the least until thefaults are finally located

32 The Example of Fault Localization Algorithm Based onDeep Neural Network Here we illustrate the application offault localization analysis based on deep neural network by aconcrete example It is depicted in Table 2

Table 2 shows that the function of program Mid(sdot) is toget the middle number by comparing three integers Theprogram has twelve statements and ten test cases There isone fault in the program which is contained by statement(6) Here ldquo∙rdquo denotes that the statement is covered by thecorresponding test caseThe blank denotes that the statementis not covered by test case P represents that the test case isexecuted successfully while F represents that the executionresults of test case are failed

The coverage data and execution results of test casesare shown in Table 2 Table 3 correlates with Table 2 Herenumber 1 replaces ldquo∙rdquo and indicates that the statementis covered by the corresponding test case and number 0

6 Mathematical Problems in Engineering

Table 2 Coverage and execution result of Mid function

Function Test cases1199051

1199052

1199053

1199054

1199055

1199056

1199057

1199058

1199059

11990510

Mid(119909 119910 119911)int119898 3 3 5 1 2 3 3 2 2 5 5 5 1 1 4 5 3 4 3 2 1 5 4 2 2 1 3 5 2 6

(1)119898 = 119911 ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

(2) if (119910 lt 119911) ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

(3) if (119909 lt 119910) ∙ ∙ ∙ ∙ ∙ ∙

(4)119898 = 119910 ∙

(5) else if (119909 lt 119911) ∙ ∙ ∙ ∙ ∙

(6)119898 = 119910 bug ∙ ∙ ∙ ∙

(7) else ∙ ∙ ∙ ∙

(8) if (119909 gt 119910) ∙ ∙ ∙ ∙

(9)119898 = 119910 ∙ ∙ ∙

(10) else if (119909 gt 119911) ∙

(11)119898 = 119909(12) printf (119898) ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

Passfail P P P P P P P P F F

Table 3 Switched Table 2

1199041

1199042

1199043

1199044

1199045

1199046

1199047

1199048

1199049

11990410

11990411

11990412

119903

1199051

1 1 1 0 1 1 0 0 0 0 0 1 01199052

1 1 1 1 0 0 0 0 0 0 0 1 01199053

1 1 0 0 0 0 1 1 1 0 0 1 01199054

1 1 0 0 0 0 1 1 0 1 0 1 01199055

1 1 1 0 1 1 0 0 0 0 0 1 01199056

1 1 1 0 1 0 0 0 0 0 0 1 01199057

1 1 0 0 0 0 1 1 1 0 0 1 01199058

1 1 0 0 0 0 1 1 1 0 0 1 01199059

1 1 1 0 1 1 0 0 0 0 0 1 111990510

1 1 1 0 1 1 0 0 0 0 0 1 1

replaces the blank and represents that the statement is notcovered by the corresponding test case In the last columnnumber 1 replaces F and indicates that the execution results oftest case are failed while number 0 replaces P and representsthat the test case is executed successfully

The concrete fault localization process is as follows

(1) Construct the DNN model with one input layer oneoutput layer and three hidden layers The number ofinput layer nodes is 12 and the number of hiddenlayer nodes is simply set as 4 The number of outputlayer nodes is 1 and the transfer function adopted isthe sigmoid function 120588(119904) = 11 + 119890minus119904

(2) Use the coverage data and execution result of test caseas training sample set to input into the DNN modelFirst we input the vector (1 1 1 0 1 1 0 0 0 0 0 1)and its execution result 0 and next input the secondvector (1 1 1 1 0 0 0 0 0 0 0 1) and its executionresult 0 until the coverage data and execution resultsof the ten test cases are all inputted into the networkWe then train the DNN model utilizing the inputted

training data and further we input the next trainingdataset (ie another ten test casesrsquo coverage data andexecution results) iteratively to optimize the DNNmodel until we reach the condition of convergenceThe final DNNmodel we obtained after several timesrsquoiteration reveals the complex nonlinear mappingrelationship between the coverage data and executionresult

(3) Construct the virtual test set with twelve test cases andensure that each test case only covers one executablestatement The set of virtual test cases is depicted inTable 4

(4) Input the virtual test set into the trained DNNmodel and get the suspiciousness value of each cor-responding executable statement and then rank thestatements according to their suspiciousness values

(5) Table 5 shows the descending ranking of statementsaccording to their suspiciousness value According tothe table of ranking list we find that the suspicious-ness value of the sixth statement which contains thebug in programMid(sdot) is the greatest (ie rank first)thus we can locate the fault by checking only onestatement

4 Empirical Studies

So far the modeling procedures of our DNN-based faultlocalization technique have been discussed in the previoussection In this section we empirically compare the perfor-mance of this technique with that of Tarantula localizationtechnique [4] PPDG localization technique [3] and BPNNlocalization technique [10] on the Siemens suite and Spaceprogram to demonstrate the effectiveness of our DNN-based fault localization methodThe Siemens suite and Space

Mathematical Problems in Engineering 7

Table 4 Virtual test suite

1199041

1199042

1199043

1199044

1199045

1199046

1199047

1199048

1199049

11990410

11990411

11990412

1199051

1 0 0 0 0 0 0 0 0 0 0 01199052

0 1 0 0 0 0 0 0 0 0 0 01199053

0 0 1 0 0 0 0 0 0 0 0 01199054

0 0 0 1 0 0 0 0 0 0 0 01199055

0 0 0 0 1 0 0 0 0 0 0 01199056

0 0 0 0 0 1 0 0 0 0 0 01199057

0 0 0 0 0 0 1 0 0 0 0 01199058

0 0 0 0 0 0 0 1 0 0 0 01199059

0 0 0 0 0 0 0 0 1 0 0 011990510

0 0 0 0 0 0 0 0 0 1 0 011990511

0 0 0 0 0 0 0 0 0 0 1 011990512

0 0 0 0 0 0 0 0 0 0 0 1

Table 5

Statement Suspiciousness Rank1 00448 112 01233 23 00879 54 00937 45 00747 76 (fault) 01593 17 01191 38 00758 69 00661 810 00436 1211 00650 912 00649 10

program can be downloaded from [18] The evaluation stan-dard based on statements ranking is proposed by Jones andHarrold in [4] to compare the effectiveness of the Tarantulalocalization technique with other localization techniquesTarantula localization technique can produce the suspicious-ness ranking of statements The programmers examine thestatements one by one from the first to the last until thefault is located and the percentage of statements withoutbeing examined is defined as the score of fault localizationtechnique that is EXAM score Since the DNN-based faultlocalization technique Tarantula localization technique andPPDG localization technique all produce a suspiciousnessvalue ranking list for executable statements we adopt theEXAM score as the evaluation criterion to measure theeffectiveness

41 Data Collection The physical environment on whichour experiments were carried out included 340GHz IntelCore i7-3770CPU and 16GB physicalmemoryThe operatingsystems were Windows 7 and Ubuntu 1210 Our compilerwas gcc 472 We conducted experiments on the MATLABR2013a To collect the coverage data of each statementtoolbox software named Gcov was used here The test case

execution results of faulty program versions can be obtainedby the following method

(1) We run the test cases on fault-free program versionsto get the execution results

(2) We run the test cases on faulty program versions toget the execution results

(3) We compare the results of fault-free program versionsand results of faulty program versions If they areequal the test case is successful otherwise it is failed

42 Programs of the Test Suite In our experiment we con-ducted two studies on the Siemens suite and Space programto demonstrate the effectiveness of the DNN-based faultlocalization technique

4.2.1. The Siemens Suite. The Siemens suite is considered a classical test sample and is widely employed in studies of fault localization techniques, so our experiment begins with it. The suite contains seven C programs with corresponding faulty versions and test cases. Each faulty version has only one fault, but the fault may span multiline statements or even multiple functions of the program. Table 6 provides a detailed introduction to the seven C programs in the Siemens suite, including the program name, the number of faulty versions of each program, the number of statement lines, the number of executable statement lines, and the number of test cases.

The functions of the seven C programs differ. The Print tokens and Print tokens 2 programs are used for lexical analysis, and the function of the Replace program is pattern replacement. The Schedule and Schedule 2 programs are priority schedulers, the function of Tcas is altitude separation, and the Tot info program is used for information measure.

The Siemens suite contains 132 faulty versions, of which we chose 122 to perform the experiments. We omitted the following versions: versions 4 and 6 of Print tokens, version 9 of Schedule 2, version 12 of Replace, versions 13, 14, and 36 of Tcas, and versions 6, 9, and 21 of Tot info. We eliminated these versions for the following reasons:

(a) There is no syntactic difference between the correct version and the faulty version (e.g., the only difference is in a header file).

(b) The test cases never fail when executing the faulty versions.

(c) The faulty versions have segmentation faults when executing the test cases.

(d) The difference between the correct version and the faulty version does not lie in the executable statements of the program and cannot be mapped to them. In addition, some faulty versions omit statements, and we cannot locate missing statements directly; in such cases we can only select the associated statements to locate the faults.


Table 6: Summary of the Siemens suite.

Program          Number of faulty versions   LOC (lines of code)   Number of executable statements   Number of test cases
Print tokens     7                           565                   172                               4130
Print tokens 2   10                          510                   146                               4115
Replace          32                          563                   175                               5542
Schedule         9                           412                   140                               2650
Schedule 2       10                          307                   115                               2710
Tcas             41                          173                   59                                1608
Tot info         23                          406                   100                               1052

4.2.2. The Space Program. The executable statements of the Siemens suite programs number only in the hundreds of lines, while the scale and complexity of software have increased steadily and a program may contain thousands or even tens of thousands of lines. We therefore also verify the effectiveness of our DNN-based fault localization technique on a large-scale program set, namely, the Space program.

The Space program contains 38 faulty versions; each version has more than 9,000 statements, and there are 13,585 test cases. In our experiment, for reasons similar to those described for the Siemens suite, we omitted 5 faulty versions and selected the other 33 faulty versions.

4.3. Experimental Process. In this paper, we conducted experiments on the 122 faulty versions of the Siemens suite and the 33 faulty versions of the Space program. For the DNN modeling process, we adopted three hidden layers; that is, the structure of the network is composed of one input layer, three hidden layers, and one output layer, a choice based on previous experience and on a preliminary experiment. According to experience and the sample size, we estimate the number of hidden layer nodes (i.e., num) by the following formula:

$$\text{num} = \operatorname{round}\left(\frac{n}{30}\right) \times 10. \tag{14}$$

Here $n$ represents the number of input layer nodes. The impulse factor is set to 0.9, and the candidate learning rates are 0.01, 0.001, 0.0001, and 0.00001; we select the most appropriate learning rate as the final parameter according to the sample scale.
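The following sketch (hypothetical Python; the paper's experiments were actually run in MATLAB R2013a) shows how the network sizing and learning-rate candidates described above could be wired up, using scikit-learn's MLPClassifier as a stand-in for the paper's DNN. The builder mirrors the stated structure: three hidden layers of num nodes each, sigmoid activations, and momentum 0.9.

```python
from sklearn.neural_network import MLPClassifier

def hidden_nodes(n_inputs: int) -> int:
    # Formula (14): num = round(n / 30) * 10
    return int(round(n_inputs / 30)) * 10

def build_model(n_inputs: int, learning_rate: float) -> MLPClassifier:
    num = max(hidden_nodes(n_inputs), 1)  # guard for very small programs
    return MLPClassifier(
        hidden_layer_sizes=(num, num, num),  # three hidden layers
        activation="logistic",               # sigmoid transfer function
        solver="sgd",
        learning_rate_init=learning_rate,
        momentum=0.9,                        # the paper's impulse factor
        max_iter=2000,
    )

# Candidate learning rates, from which the final one is chosen
# according to the sample scale:
LEARNING_RATES = [0.01, 0.001, 0.0001, 0.00001]
```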

We now take the Print tokens program as an example and introduce the experimental process in detail; the other programs follow the same procedure. The Print tokens program includes seven faulty versions, of which we picked five to conduct the experiment. The experimental process is as follows (a condensed sketch of the resulting pipeline appears after the list):

(1) Firstly, we compile the source program of Print tokens and run the test suite to get the expected execution results of the test cases.

(2) We compile the five faulty versions of the Print tokens program using GCC, similarly get the actual execution results of the test cases, and acquire the coverage data of the test cases using Gcov.

(3) We construct the set of virtual test cases for each of the five faulty versions of the Print tokens program.

(4) Then we compare the execution results of the test cases between the source program and the five faulty versions of Print tokens to get the final (pass/fail) execution results of the test cases of the five faulty versions.

(5) We integrate the coverage data and the final results of the test cases of the five faulty versions, respectively, as input vectors to train the deep neural network.

(6) Finally, we utilize the set of virtual test cases to test the deep neural network, acquire the suspiciousness of the corresponding statements, and then rank the suspiciousness values.
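Putting steps (1)–(6) together, an end-to-end sketch for one faulty version might look as follows (hypothetical Python; the coverage matrix and result vector are assumed to have been collected with Gcov and the output comparison described in Section 4.1, and the network is again a scikit-learn stand-in for the paper's DNN):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def localize_faults(coverage, results, num=10, learning_rate=0.001):
    """Rank executable statements from most to least suspicious.

    coverage: (n_tests, n_stmts) 0/1 matrix collected with Gcov.
    results:  (n_tests,) 0/1 vector of test outcomes (1 = failed).
    """
    model = MLPClassifier(hidden_layer_sizes=(num, num, num),
                          activation="logistic", solver="sgd",
                          momentum=0.9, learning_rate_init=learning_rate,
                          max_iter=2000)
    model.fit(coverage, results)               # step (5): train on real tests
    virtual = np.eye(coverage.shape[1])        # step (3): virtual test suite
    susp = model.predict_proba(virtual)[:, 1]  # step (6): P(failed) per stmt
    return np.argsort(-susp) + 1               # most suspicious first
```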

5. Results and Analysis

According to the experimental process described in the previous section, we performed the fault localization experiments with the deep neural network and the BP neural network and statistically analyzed the results. We then compared them with the Tarantula and PPDG localization techniques, whose experimental results were reported in [4] and [3], respectively.

5.1. The Experimental Result of the Siemens Suite. Figure 3 shows the effectiveness of the DNN, BP neural network, Tarantula, and PPDG localization techniques on the Siemens suite.

In Figure 3, the x-axis represents the percentage of statements that need not be examined before the fault is located, that is, the EXAM score, while the y-axis represents the percentage of faulty versions whose faults have been located. For instance, a point on a curve with value (70%, 80%) means the technique is able to find eighty percent of the faults in the Siemens suite with seventy percent of the statements not examined; that is, we can identify eighty percent of the faults in the Siemens suite by examining only thirty percent of the statements.

Table 7 offers further explanation of Figure 3. Following the statistical approach used for other fault localization techniques, the EXAM score is divided into 11 segments. Each 10% is treated as one segment, but the caveat is that programmers cannot identify a fault without examining any statement, so the abscissa can only be infinitely close to 100%.


[Figure 3: Effectiveness comparison on the Siemens suite. x-axis: percentage of executable statements that need not be examined (i.e., EXAM score), from 100% down to 0%; y-axis: percentage of fault versions located, from 0% to 100%. Curves: Tarantula, PPDG (best), BPNN, and DNN.]

Table 7: Effectiveness comparison of four techniques.

EXAM score   % of the faulty versions
             DNN     BPNN    PPDG    Tarantula
100–99%      12.29   6.56    41.94   13.93
99–90%       61.48   43.44   31.45   41.80
90–80%       21.31   24.59   13.71   5.74
80–70%       4.10    11.48   2.42    9.84
70–60%       0.00    7.38    2.42    8.20
60–50%       0.82    3.28    5.65    7.38
50–40%       0.00    3.28    1.61    0.82
40–30%       0.00    0.00    0.00    0.82
30–20%       0.00    0.00    0.80    4.10
20–10%       0.00    0.00    0.00    7.38
10–0%        0.00    0.00    0.00    0.00

Due to that factor, we divide 100–90% into two segments, that is, 100–99% and 99–90%. The data of every segment can be used to evaluate the effectiveness of a fault localization technique. For example, in the 99–90% segment, DNN identifies 61.48% of the faults in the Siemens suite, while the BPNN technique locates 43.44%. From Table 7 we can find that, for the DNN segments 50–40%, 40–30%, 30–20%, 20–10%, and 10–0%, the percentage of faults identified in each segment is 0.00%. That indicates that the DNN-based method has found all the faulty versions after examining 50% of the statements.

We analyzed the above data and summarize the following points.

(1) Figure 3 reveals that the overall effectiveness of our DNN-based fault localization technique is better than that of the BP neural network fault localization technique, as the curve representing the DNN-based approach is always above the curve denoting the BPNN-based method. The DNN-based technique is also able to identify the faults by examining fewer statements than the method based on the BP neural network. For instance, by examining 10% of the statements, the DNN-based method identifies 73.77% of the faulty versions, while the BP neural network fault localization technique finds only 50% of the faulty versions.

(2) Compared with the Tarantula fault localization technique, the DNN-based approach dramatically improves the effectiveness of fault localization on the whole. For example, the DNN-based method finds 95.08% of the faulty versions after examining 20% of the statements, while Tarantula identifies only 61.47% of the faulty versions after examining the same number of statements. For the score segment of 100–99%, the effectiveness of the DNN-based fault localization technique is relatively similar to that of Tarantula; for instance, the DNN-based method finds 12.29% of the faulty versions with 99% of the statements not examined (i.e., at abscissa 99%), which is close to the performance of the Tarantula fault localization technique.

(3) Compared with the PPDG fault localization technique, the DNN-based fault localization technique also shows improved effectiveness over the score range of 90–0%. Moreover, the DNN-based fault localization technique finds all faulty versions by examining only 50% of the statements, while the PPDG fault localization technique needs to examine 80% of the statements. For the score range of 100–99%, however, the performance of the DNN-based fault localization technique is inferior to that of the PPDG fault localization technique.

In conclusion, the DNN-based fault localization technique improves the effectiveness of localization considerably overall. In particular, over the score range of 90–0%, the improvement is more obvious compared with the other fault localization techniques, as fewer statements need to be examined to identify the faults. The DNN-based fault localization technique finds all faulty versions by examining only 50% of the statements, which is superior to the BPNN, Tarantula, and PPDG fault localization techniques; for the score segment of 100–99%, the performance of our DNN-based approach is similar to that of the Tarantula fault localization method and a bit inferior to that of the PPDG fault localization technique.

5.2. The Experimental Result of the Space Program. As the executable statements of the Siemens suite programs number only in the hundreds of lines, we further verify the effectiveness of the DNN-based fault localization technique on a large-scale dataset.

Figure 4 demonstrates the effectiveness of the DNN and BP neural network localization techniques on the Space program. From Figure 4, we can clearly see that the effectiveness of our DNN-based method is markedly higher than that of the BP neural network fault localization technique.


[Figure 4: Effectiveness comparison on the Space program. x-axis: percentage of executable statements that need not be examined (i.e., EXAM score), from 100% down to 70%; y-axis: fraction of fault versions located, from 0.3 to 1.0. Curves: BPNN and DNN.]

The DNN-based fault localization technique finds all faulty versions without examining 93% of the statements, while the BP neural network identifies all faulty versions with 83% of the statements not examined. This comparison indicates that the performance of our DNN-based approach is superior, as it reduces the number of statements to be examined. The experimental results show that the DNN-based fault localization technique is also highly effective on large-scale datasets.

6. Threats to Validity

There may exist several threats to the validity of the technique presented in this paper; we discuss some of them in this section.

We empirically determined the structure of the network, including the number of hidden layers and the number of hidden nodes, and we determined some important model parameters by grid search. Hence there may exist better structures for the fault localization model. As the DNN model trained on the first attempt is usually not the optimal one, we typically need to modify the parameters of the network several times to obtain a better model.

In the field of fault localization, we may encounter multiple-bug programs, and we would need to construct new models that can locate multiple bugs. In addition, because the number of input nodes of the model equals the number of executable statements, the model becomes very large when the number of executable statements is very large. That limits fault localization for large-scale software, and we may need to control the scale of the model or improve the performance of the computers we use. Moreover, the virtual test set built to calculate suspiciousness, in which each test case covers exactly one statement, may not be suitably designed. When we use a virtual test case to describe a certain statement, we assume that if the test case is predicted as failed, then the statement covered by it contains a bug. Whether this assumption is reasonable has not been confirmed; as it is important to our fault localization approach, we may need to test this hypothesis or propose an alternative strategy.

7. Conclusion and Future Work

In this paper, we propose a DNN-based fault localization technique, as a DNN is able to model the complex nonlinear relationship between its input and output. We conducted an empirical study on the Siemens suite and the Space program and compared the effectiveness of our method with that of other techniques, namely, the BP neural network, Tarantula, and PPDG. The results show that our DNN-based fault localization technique performs best. For example, for the Space program, the DNN-based fault localization technique needs to examine only 10% of the statements to identify all faulty versions, while the BP neural network fault localization technique needs to examine 20% of the statements to find all faulty versions.

As software fault localization is one of the most expensive, tedious, and time-consuming activities during software testing, it is of great significance for researchers to automate the localization process [19]. By leveraging the strong feature-learning ability of a DNN, our approach helps predict the likelihood that each statement of a program contains a fault, which can guide programmers to the locations of faults with minimal human intervention. As deep learning is widely applied and performs well in image processing, speech recognition, and natural language processing, and in related industries as well [13], and given the ever-increasing scale and complexity of software, we believe that our DNN-based method has great potential to be applied in industry. It could help improve the efficiency and effectiveness of the debugging process, speed up software development, reduce software maintenance cost, and so forth.

In our future work, we will apply deep neural networks to multifault localization to evaluate their effectiveness. Meanwhile, we will conduct further studies on the feasibility of applying other deep learning models (e.g., convolutional neural networks, recurrent neural networks) to software fault localization.

Competing Interests

The authors declare that they have no competing interests.

References

[1] M. A. Alipour, "Automated fault localization techniques: a survey," Tech. Rep., Oregon State University, Corvallis, Ore, USA, 2012.

[2] W. E. Wong and V. Debroy, "A survey of software fault localization," Tech. Rep. UTDCS-45-09, Department of Computer Science, The University of Texas at Dallas, 2009.

[3] G. K. Baah, A. Podgurski, and M. J. Harrold, "The probabilistic program dependence graph and its application to fault diagnosis," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 528–545, 2010.

[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE '05), pp. 273–282, Long Beach, Calif, USA, November 2005.

[5] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, "Statistical debugging: a hypothesis testing-based approach," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 831–847, 2006.

[6] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, "A crosstab-based statistical method for effective fault localization," in Proceedings of the 1st International Conference on Software Testing, Verification and Validation (ICST '08), pp. 42–51, Lillehammer, Norway, April 2008.

[7] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.

[8] X. Zhang, N. Gupta, and R. Gupta, "Locating faults through automated predicate switching," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pp. 272–281, Shanghai, China, May 2006.

[9] S. Ali, J. H. Andrews, T. Dhandapani, et al., "Evaluating the accuracy of fault localization techniques," in Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE '09), pp. 76–87, IEEE Computer Society, Auckland, New Zealand, November 2009.

[10] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573–597, 2009.

[11] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, "Effective software fault localization using an RBF neural network," IEEE Transactions on Reliability, vol. 61, no. 1, pp. 149–169, 2012.

[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 437–440, Florence, Italy, August 2011.

[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[15] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, Canada, December 2009.

[16] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto, 2010.

[17] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.

[18] http://sir.unl.edu

[19] D. Binkley, "Source code analysis: a roadmap," in Proceedings of the Future of Software Engineering (FOSE '07), pp. 104–119, IEEE Computer Society, Minneapolis, Minn, USA, May 2007.


Page 5: Research Article Fault Localization Analysis Based on Deep ...downloads.hindawi.com/journals/mpe/2016/1820454.pdf · fault localization techniques that can guide programmers to the

Mathematical Problems in Engineering 5

Table 1 The coverage data and execution result of test case

1199041

1199042

1199043

sdot sdot sdot 119904119898minus1

119904119898

119877

1199051

1 0 1 sdot sdot sdot 0 1 01199052

1 1 1 sdot sdot sdot 0 1 01199053

1 0 1 sdot sdot sdot 1 1 1

d

119905119899minus1

1 0 0 sdot sdot sdot 1 0 1119905119899

1 1 1 sdot sdot sdot 0 0 0

Here 120576 is the learning rate 120572 is the impulse factor and119898denotes the119898th iteration In the process of fault localizationwe input the virtual test matrix 119884 into the DNN model andthen execute the forward process of DNN and finally theoutput is the suspiciousness value of each statement

3 DNN-Based Fault Localization Technique

While the previous section provides an overview of theDNN model this part will present the methodology ofDNN-based fault localization technique in detail With afocus on how to build a DNN model for fault localizationproblem meanwhile we will summarize the procedures oflocalizing faults using DNN and provide a concrete examplefor demonstration at the end of this part

31 Fault Localization Algorithm Based on Deep Neural Net-work Suppose that there is a program P with 119898 executablestatements and 119899 test cases to be executed 119905

119894denotes the 119894th

test case while vectors 119888119905119894and 119903119905119894represent the corresponding

coverage data and execution result respectively after exe-cuting the test case 119905

119894 and 119904

119895is 119895th executable statement

of program P Here 119888119905119894= [(119888

119905119894)1 (119888119905119894)2 (119888

119905119894)119898] If the

executable statement 119904119895is covered by 119905

119895 we assign a real

value 1 to (119888119905119894)119895 otherwise we assign 0 to it If the test case

119905119894is executed successfully we assign a real value 0 to 119903

119905119894

otherwise it is assigned with 1 We depict the coverage dataand execution result of test case through Table 1 For instancefrom Table 1 we can draw a conclusion that 119904

2is not covered

with failed test case 1199053 while 119904

3is covered with failed test case

1199052according to Table 1The network can reflect the complex nonlinear relation-

ship between the coverage data and execution result of testcase when the training of deep neural network is completedWe can identify the suspicious code of faulty version bytrained network At the same time constructing a set ofvirtual test cases is necessary as shown in the followingequation

The Virtual Test Set Consider

[

[

[

[

[

[

[

[

119888V1

119888V2

119888V119898

]

]

]

]

]

]

]

]

=

[

[

[

[

[

[

[

1 0 sdot sdot sdot 0

0 1 sdot sdot sdot 0

d

0 0 sdot sdot sdot 1

]

]

]

]

]

]

]

(13)

the vector 119888V1 119888V2 119888V119898 denotes the coverage data of testcases V

1 V2 V

119898

In the set of virtual test cases the test case V119895only covers

executable statement 119904119895 Since each test case only covers

one statement thus the statement is highly suspicious if thecorresponding test case is failed For example 119904

119895is very likely

to contain bugs with a failed test case V119895 and that means we

should preferentially check the statements which are coveredby failed test cases But such set of test cases does not existin reality thus the execution result of test case in the virtualtest set is difficult to get in practice So we input 119888V119895 to thetrained DNN and the output 119903V119895 represents the probability oftest case execution result The value of 119903V119895 is proportional tothe suspicious degree of containing bugs of 119904

119895

Specific fault location algorithm is as follows

(1) Construct the DNN model with one input layer andone output layer and identify the appropriate numberof hidden layers according to the experiment scaleSuppose the number of input layer nodes is 119898 thenumber of hidden layer nodes is 119899 the number ofoutput layer nodes is 1 and the adopted transferfunction is sigmoid function 120588(119904) = 11 + 119890minus119904

(2) Utilize the coverage data and execution result of testcase as training sample set which is inputted into theDNNmodel and then train theDNNmodel to obtainthe complex nonlinear mapping relationship betweenthe coverage data 119888

119905119894and execution result 119903

119905119894

(3) Input the coverage data vector of virtual test set119888V119895 (1 ⩽ 119895 ⩽ 119898) into the DNN model and get theoutput 119903V119895 (1 ⩽ 119895 ⩽ 119898)

(4) The output 119903V119895 reflects the probability that executablestatement 119904

119895contains the bugs that is suspiciousness

value and then conduct descending ranking for 119903V119895

(5) Rank 119904119895(1 ⩽ 119895 ⩽ 119898) according to their corresponding

suspiciousness value and check the statement one byone from the most suspicious to the least until thefaults are finally located

32 The Example of Fault Localization Algorithm Based onDeep Neural Network Here we illustrate the application offault localization analysis based on deep neural network by aconcrete example It is depicted in Table 2

Table 2 shows that the function of program Mid(sdot) is toget the middle number by comparing three integers Theprogram has twelve statements and ten test cases There isone fault in the program which is contained by statement(6) Here ldquo∙rdquo denotes that the statement is covered by thecorresponding test caseThe blank denotes that the statementis not covered by test case P represents that the test case isexecuted successfully while F represents that the executionresults of test case are failed

The coverage data and execution results of test casesare shown in Table 2 Table 3 correlates with Table 2 Herenumber 1 replaces ldquo∙rdquo and indicates that the statementis covered by the corresponding test case and number 0

6 Mathematical Problems in Engineering

Table 2 Coverage and execution result of Mid function

Function Test cases1199051

1199052

1199053

1199054

1199055

1199056

1199057

1199058

1199059

11990510

Mid(119909 119910 119911)int119898 3 3 5 1 2 3 3 2 2 5 5 5 1 1 4 5 3 4 3 2 1 5 4 2 2 1 3 5 2 6

(1)119898 = 119911 ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

(2) if (119910 lt 119911) ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

(3) if (119909 lt 119910) ∙ ∙ ∙ ∙ ∙ ∙

(4)119898 = 119910 ∙

(5) else if (119909 lt 119911) ∙ ∙ ∙ ∙ ∙

(6)119898 = 119910 bug ∙ ∙ ∙ ∙

(7) else ∙ ∙ ∙ ∙

(8) if (119909 gt 119910) ∙ ∙ ∙ ∙

(9)119898 = 119910 ∙ ∙ ∙

(10) else if (119909 gt 119911) ∙

(11)119898 = 119909(12) printf (119898) ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

Passfail P P P P P P P P F F

Table 3 Switched Table 2

1199041

1199042

1199043

1199044

1199045

1199046

1199047

1199048

1199049

11990410

11990411

11990412

119903

1199051

1 1 1 0 1 1 0 0 0 0 0 1 01199052

1 1 1 1 0 0 0 0 0 0 0 1 01199053

1 1 0 0 0 0 1 1 1 0 0 1 01199054

1 1 0 0 0 0 1 1 0 1 0 1 01199055

1 1 1 0 1 1 0 0 0 0 0 1 01199056

1 1 1 0 1 0 0 0 0 0 0 1 01199057

1 1 0 0 0 0 1 1 1 0 0 1 01199058

1 1 0 0 0 0 1 1 1 0 0 1 01199059

1 1 1 0 1 1 0 0 0 0 0 1 111990510

1 1 1 0 1 1 0 0 0 0 0 1 1

replaces the blank and represents that the statement is notcovered by the corresponding test case In the last columnnumber 1 replaces F and indicates that the execution results oftest case are failed while number 0 replaces P and representsthat the test case is executed successfully

The concrete fault localization process is as follows

(1) Construct the DNN model with one input layer oneoutput layer and three hidden layers The number ofinput layer nodes is 12 and the number of hiddenlayer nodes is simply set as 4 The number of outputlayer nodes is 1 and the transfer function adopted isthe sigmoid function 120588(119904) = 11 + 119890minus119904

(2) Use the coverage data and execution result of test caseas training sample set to input into the DNN modelFirst we input the vector (1 1 1 0 1 1 0 0 0 0 0 1)and its execution result 0 and next input the secondvector (1 1 1 1 0 0 0 0 0 0 0 1) and its executionresult 0 until the coverage data and execution resultsof the ten test cases are all inputted into the networkWe then train the DNN model utilizing the inputted

training data and further we input the next trainingdataset (ie another ten test casesrsquo coverage data andexecution results) iteratively to optimize the DNNmodel until we reach the condition of convergenceThe final DNNmodel we obtained after several timesrsquoiteration reveals the complex nonlinear mappingrelationship between the coverage data and executionresult

(3) Construct the virtual test set with twelve test cases andensure that each test case only covers one executablestatement The set of virtual test cases is depicted inTable 4

(4) Input the virtual test set into the trained DNNmodel and get the suspiciousness value of each cor-responding executable statement and then rank thestatements according to their suspiciousness values

(5) Table 5 shows the descending ranking of statementsaccording to their suspiciousness value According tothe table of ranking list we find that the suspicious-ness value of the sixth statement which contains thebug in programMid(sdot) is the greatest (ie rank first)thus we can locate the fault by checking only onestatement

4 Empirical Studies

So far the modeling procedures of our DNN-based faultlocalization technique have been discussed in the previoussection In this section we empirically compare the perfor-mance of this technique with that of Tarantula localizationtechnique [4] PPDG localization technique [3] and BPNNlocalization technique [10] on the Siemens suite and Spaceprogram to demonstrate the effectiveness of our DNN-based fault localization methodThe Siemens suite and Space

Mathematical Problems in Engineering 7

Table 4 Virtual test suite

1199041

1199042

1199043

1199044

1199045

1199046

1199047

1199048

1199049

11990410

11990411

11990412

1199051

1 0 0 0 0 0 0 0 0 0 0 01199052

0 1 0 0 0 0 0 0 0 0 0 01199053

0 0 1 0 0 0 0 0 0 0 0 01199054

0 0 0 1 0 0 0 0 0 0 0 01199055

0 0 0 0 1 0 0 0 0 0 0 01199056

0 0 0 0 0 1 0 0 0 0 0 01199057

0 0 0 0 0 0 1 0 0 0 0 01199058

0 0 0 0 0 0 0 1 0 0 0 01199059

0 0 0 0 0 0 0 0 1 0 0 011990510

0 0 0 0 0 0 0 0 0 1 0 011990511

0 0 0 0 0 0 0 0 0 0 1 011990512

0 0 0 0 0 0 0 0 0 0 0 1

Table 5

Statement Suspiciousness Rank1 00448 112 01233 23 00879 54 00937 45 00747 76 (fault) 01593 17 01191 38 00758 69 00661 810 00436 1211 00650 912 00649 10

program can be downloaded from [18] The evaluation stan-dard based on statements ranking is proposed by Jones andHarrold in [4] to compare the effectiveness of the Tarantulalocalization technique with other localization techniquesTarantula localization technique can produce the suspicious-ness ranking of statements The programmers examine thestatements one by one from the first to the last until thefault is located and the percentage of statements withoutbeing examined is defined as the score of fault localizationtechnique that is EXAM score Since the DNN-based faultlocalization technique Tarantula localization technique andPPDG localization technique all produce a suspiciousnessvalue ranking list for executable statements we adopt theEXAM score as the evaluation criterion to measure theeffectiveness

41 Data Collection The physical environment on whichour experiments were carried out included 340GHz IntelCore i7-3770CPU and 16GB physicalmemoryThe operatingsystems were Windows 7 and Ubuntu 1210 Our compilerwas gcc 472 We conducted experiments on the MATLABR2013a To collect the coverage data of each statementtoolbox software named Gcov was used here The test case

execution results of faulty program versions can be obtainedby the following method

(1) We run the test cases on fault-free program versionsto get the execution results

(2) We run the test cases on faulty program versions toget the execution results

(3) We compare the results of fault-free program versionsand results of faulty program versions If they areequal the test case is successful otherwise it is failed

42 Programs of the Test Suite In our experiment we con-ducted two studies on the Siemens suite and Space programto demonstrate the effectiveness of the DNN-based faultlocalization technique

421 The Siemens Suite The Siemens suite has been consid-ered as a classical test sample which is widely employed instudies of fault localization techniques And in this paper ourexperiment begins with it The Siemens suite contains sevenC programs with the corresponding faulty versions and testcases Each faulty version has only one fault but it may crossthe multiline statements or even multiple program functionsin the program Table 6 provides the detailed introductionof the seven C programs in Siemens suite including theprogram name number of faulty versions of each programnumber of statement lines number of executable statementslines and the number of test cases

The function of the seven C programs is different ThePrint tokens and Print tokens 2 programs are used for lexicalanalysis and the function of Replace program is patternreplacement The Schedule and Schedule 2 programs areutilized for priority scheduler while the function of Tcas isaltitude separation and the Tot info program is used forinformation measure

The Siemens suite contains 132 faulty versions while wechoose 122 faulty versions to perform experiments We omit-ted the following versions versions 4 and 6 of Print tokensversion 9 of Schedule 2 version 12 of Replace versions 1314 and 36 of Tcas and versions 6 9 and 21 of Tot info Weeliminated these versions because of the following

(a) There is no syntactic difference between the correctversion and the faulty versions (eg only differencein the header file)

(b) The test cases never fail when executing the programsof faulty versions

(c) The faulty versions have segmentation faults whenexecuting the test cases

(d) The difference between the correct versions andthe faulty versions is not included in executablestatements of program and cannot be replaced Inaddition there exists absence of statements in somefaulty versions and we cannot locate the missingstatements directly In this case we can only select theassociated statements to locate the faults

8 Mathematical Problems in Engineering

Table 6 Summary of the Siemens suite

Program Number of faultyversions

LOC (lines of code) Number of executablestatements

Number of test cases

Print tokens 7 565 172 4130Print tokens 2 10 510 146 4115Replace 32 563 175 5542Schedule 9 412 140 2650Schedule 2 10 307 115 2710Tcas 41 173 59 1608Tot info 23 406 100 1052

422 The Space Program As the executable statements ofSiemens suite programs are only about hundreds of lines thescale and complexity of the software have increased graduallyand the program may be of as many as thousands or tenthousands of lines Sowe verify the effectiveness of ourDNN-based fault localization technique on large-scale program setthat is Space program

The Space program contains 38 faulty versions eachversion has more than 9000 statements and 13585 test casesIn our experiment due to similar reasons which have beendepicted in Siemens suite we also omit 5 faulty versions andselect another 33 faulty versions

43 Experimental Process In this paper we conducted exper-iments on the 122 faulty versions of Siemens suite and 33faulty versions of Space program For the DNN modelingprocess we adopted three hidden layers that is the structureof the network is composed of one input layer three hiddenlayers and one output layer which is based on previousexperience and the preliminary experiment According tothe experience and sample size we estimate the number ofhidden layer nodes (ie num) by the following formula

num = round( 11989930

) lowast 10 (14)

Here 119899 represents the number of input layer nodes Theimpulse factor is set as 09 and the range of learning rate is001 0001 00001 000001 we select the most appropriatelearning rate as the final parameter according to the samplescale

Now we take Print tokens program as an example andintroduce the experimental process in detail The other pro-gram versions can refer to the Print tokensThe Print tokensprogram includes seven faulty versions and we pick out fiveversions to conduct the experiment The experiment processis as follows

(1) Firstly we compile the source program of Printtokens and run the test case set to get the expectedexecution results of test cases

(2) We compile the five faulty versions of Print tokensprogram using GCC and similarly get the actual exe-cution results of test cases and acquire the coveragedata of test cases by using Gcov technique

(3) We construct the set of virtual test cases for the fivefaulty versions of Print tokens program respectively

(4) Then we compare the execution results of test casesbetween the source program and the five faulty ver-sions of Print tokens to get the final execution resultsof the test cases of five faulty versions

(5) We integrate the coverage data and the final resultsof test cases of the five faulty versions respectively asinput vectors to train the deep neural network

(6) Finally we utilize the set of virtual test cases to testthe deep neural network to acquire the suspiciousnessof the corresponding statements and then rank thesuspiciousness value list

5 Results and Analysis

According to the experimental process described in the previ-ous section we perform the fault localization experiments ondeep neural network and BP neural network and statisticallyanalyze the experimental results We then compare theexperimental results with Tarantula localization techniqueand PPDG localization technique The experimental resultsof Tarantula and PPDG have been introduced in [3 4]respectively

51 The Experimental Result of Siemens Suite Figure 3 showsthe effectiveness of DNN BP neural network Tarantula andPPDG localization techniques on the Siemens suite

In Figure 3 the 119909-axis represents the percentage of state-ments without being examined until the fault is located thatis EXAM score while the 119910-axis represents the percentage offaulty versions whose faults have been located For instancethere is a point labeled on the curve whose value is (70 80)This means the technique is able to find out eighty percentof the faults in the Siemens suite with seventy percent ofstatements not being examined that is we can identify eightypercent of the faults in the Siemens suite by only examiningthirty percent of the statements

Table 7 offers further explanation for Figure 3According to the approach of experiment statistics from

other fault localization techniques the EXAM score can bedivided into 11 segments Each 10 is treated as one segmentbut the caveat is that the programmers are not able to identify

Mathematical Problems in Engineering 9

TarantulaPPDG (best)

NNDNN

0102030405060708090

100

of f

ault

vers

ions

0102030405060708090100

80 60 40 20 0100 of executable statements that need not be

examined (ie EXAM score)

Figure 3 Effectiveness comparison on the Siemens suite

Table 7 Effectiveness comparison of four techniques

EXAM score of the faulty versionsDNN BPNN PPDG Tarantula

100-99 1229 656 4194 139399ndash90 6148 4344 3145 418090ndash80 2131 2459 1371 57480ndash70 410 1148 242 98470ndash60 000 738 242 82060ndash50 082 328 565 73850ndash40 000 328 161 08240ndash30 000 000 000 08230ndash20 000 000 080 41020ndash10 000 000 000 73810ndash0 000 000 000 000

the faults without examining any statement thus the abscissacan only be infinitely close to 100 Due to that factor wedivide the 100ndash90 into two segments that is 100ndash99and 99ndash90 The data of every segment can be used toevaluate the effectiveness of fault localization technique Forexample for the 99ndash90 segment of DNN the percentageof faults identified in the Siemens suite is 6148 whilethe percentage of faults located is 4344 for the BPNNtechnique From Table 7 we can find that for the segments50ndash40 40ndash30 30ndash20 20ndash10 and 10ndash0 ofDNN the percentage of faults identified in each segment is000That indicates that the DNN-basedmethod has foundout all the faulty versions after examining 50 of statements

We analyzed the above data and summarized the follow-ing points

(1) Figure 3 reveals that the overall effectiveness of ourDNN-based fault localization technique is better thanthat of the BP neural network fault localization tech-nique as the curve representingDNN-based approachis always above the curve denoting BPNN-based

method Also the DNN-based technique is able toidentify the faults by examining fewer statementscompared with the method based on BP neural net-work For instance by examining 10 of statementsthe DNN-based method is able to identify 7377 ofthe faulty versions while the BP neural network faultlocalization technique only finds out 50 of the faultyversions

(2) Compared with the Tarantula fault localizationtechnique the DNN-based approach dramaticallyimproves the effectiveness of fault localization onthe whole For example the DNN-based methodis capable of finding out 9508 of the faulty ver-sions after examining 20 of statements while theTarantula fault localization technique only identifies6147 of the faulty versions after examining thesame amount of statements For the score segmentof 100ndash99 the effectiveness of the DNN-basedfault localization technique is relatively similar to thatof Tarantula for instance DNN-based method canfind out 1229 of the faulty versions with 99 ofstatements not examined (ie the abscissa values are99) and its performance is close to that of Tarantulafault localization technique

(3) Compared with the PPDG fault localization tech-nique the DNN-based fault localization techniquereflects improved effectiveness as well for the scorerange of 90ndash0 Moreover the DNN-based faultlocalization technique is capable of finding out allfaulty versions by only examining 50 of statementswhile PPDG fault localization technique needs toexamine 80 of statements For the score rangeof 100ndash99 the performance of DNN-based faultlocalization technique is inferior to PPDG fault local-ization technique

In conclusion overall the DNN-based fault localizationtechnique improves the effectiveness of localization con-siderably In particular for the score range of 90ndash0the improved effectiveness of localization is more obviouscompared with other fault localization techniques as it needsless statements to be examined to further identify the faultsThe DNN-based fault localization technique is able to findout all faulty versions by only examining 50 of statementswhich is superior to BPNN Tarantula and PPDG faultlocalization technique while for the score segment of 100ndash99 the performance of our DNN-based approach is similarto that of Tarantula fault localization method while it is a bitinferior to that of PPDG fault localization technique

52 The Experimental Result of Space Program As the exe-cutable statements of Siemens suite programs are only abouthundreds of lines we further verify the effectiveness of DNN-based fault localization technique in large-scale datasets

Figure 4 demonstrates the effectiveness of DNN andBP neural network localization techniques on the Spaceprogram

From Figure 4 we can clearly find that the effectivenessof our DNN-based method is obviously higher than that

10 Mathematical Problems in Engineering

of executable statements that need not beexamined (ie EXAM score)

707580859095100

NNDNN

03

04

05

06

07

08

09

10

03

04

05

06

07

08

09

10

of f

ault

vers

ions

Figure 4 Effectiveness comparison on the Space program

of the BP neural network fault localization technique TheDNN-based fault localization technique is able to find out allfaulty versions without examining 93 of statements whileBP neural network identifies all faulty versions with 83 ofthe statements not examinedThat comparison indicates thatthe performance of our DNN-based approach is superior asit reduces the number of statements to be examined Theexperimental results show that DNN-based fault localizationtechnique is also highly effective in large-scale datasets

6 Threats to the Validity

Theremay exist several threats to the validity of the techniquepresented in this paper We discuss some of them in thissection

We empirically determine the structure of the networkincluding the number of hidden layers and the number ofhidden nodes In addition we determine some importantmodel parameters by grid search method So probably thereexist some better structures for fault localization model Asthe DNN model we trained for the first time usually is notthe most optimal one thus empirically we need to modifythe parameters of the network several times to try to obtain amore optimal model

In the field of fault localization we may encountermultiple-bug programs and we need to construct some newmodels that can locatemultiple bugs In addition because thenumber of input nodes of the model is equal to the lines ofexecutable statements thus when the number of executablestatements is very large the model will become very big aswell That results in a limitation of fault localization for somebig-scale software and wemay need to control the scale of themodel or to improve the performance of the computerwe useMoreover virtual test sets built to calculate the suspiciousnessof each test case which only covers one statement may notbe suitably designed When we use the virtual test case todescribe certain statement we assume that if a test case ispredicated as failed then the statement covered by it willhave a bug However whether this assumption is reasonablehas not been confirmed As this assumption is important for

our fault localization approach thus we may need to test thishypothesis or to propose an alternative strategy

7 Conclusion and Future Work

In this paper we propose a DNN-based fault localizationtechnique as DNN is able to simulate the complex nonlinearrelationship between the input and the outputWe conduct anempirical study on Siemens suite and Space program Furtherwe compare the effectiveness of our method with othertechniques like BPneural network Tarantula andPPDGTheresults show that our DNN-based fault localization techniqueperforms best For example for Space program DNN-basedfault localization technique only needs to examine 10 ofstatements to fully identify all faulty versions while BP neuralnetwork fault localization technique needs to examine 20 ofstatements to find out all faulty versions

As software fault localization is one of themost expensivetedious and time-consuming activities during the softwaretesting process thus it is of great significance for researchersto automate the localization process [19] By leveraging thestrong feature-learning ability of DNN our approach helpsto predict the likelihood of containing fault of each statementin a certain program And that can guide programmers tothe location of statementsrsquo faults with minimal human inter-vention As deep learning is widely applied and demonstratesgood performance in research field of image processingspeech recognition and natural language processing and inrelated industries as well [13] with the ever-increasing scaleand complexity of software we believe that our DNN-basedmethod is of great potential to be applied in industry Itcould help to improve the efficiency and effectiveness of thedebugging process boost the software development processreduce the software maintenance cost and so forth

In our future work we will apply deep neural networkin multifaults localization to evaluate its effectiveness Mean-while we will conduct further studies about the feasibilityof applying other deep learning models (eg convolutionalneural network recurrent neural network) in the field ofsoftware fault localization

Competing Interests

The authors declare that they have no competing interests

References

[1] M A Alipour ldquoAutomated fault localization techniques a sur-veyrdquo Tech Rep Oregon State University Corvallis Ore USA2012

[2] W Eric Wong and V Debroy ldquoA survey of software fault local-izationrdquo Tech Rep UTDCS-45-09 Department of ComputerScience The University of Texas at Dallas 2009

[3] G K Baah A Podgurski and M J Harrold ldquoThe probabilisticprogram dependence graph and its application to fault diagno-sisrdquo IEEE Transactions on Software Engineering vol 36 no 4pp 528ndash545 2010

[4] J A Jones and M J Harrold ldquoEmpirical evaluation of thetarantula automatic fault-localization techniquerdquo in Proceedingsof the 20th IEEEACM International Conference on Automated

Mathematical Problems in Engineering 11

Software Engineering (ASE rsquo05) pp 273ndash282 Long Beach CalifUSA November 2005

[5] C Liu L Fei X Yan J Han and S PMidkiff ldquoStatistical debug-ging a hypothesis testing-based approachrdquo IEEE Transactionson Software Engineering vol 32 no 10 pp 831ndash847 2006

[6] W EWong TWei Y Qi and L Zhao ldquoA crosstab-based statis-ticalmethod for effective fault localizationrdquo in Proceedings of the1st International Conference on Software Testing Verification andValidation (ICST rsquo08) pp 42ndash51 Lillehammer Norway April2008

[7] A Zeller and R Hildebrandt ldquoSimplifying and isolating failure-inducing inputrdquo IEEETransactions on Software Engineering vol28 no 2 pp 183ndash200 2002

[8] X Zhang N Gupta and R Gupta ldquoLocating faults throughautomated predicate switchingrdquo in Proceedings of the 28th Inter-national Conference on Software Engineering (ICSE rsquo06) pp272ndash281 Shanghai China May 2006

[9] S Ali J H Andrews T Dhandapani et al ldquoEvaluating theaccuracy of fault localization techniquesrdquo in Proceedings of the24th IEEEACM International Conference on Automated Soft-ware Engineering (ASE rsquo09) pp 76ndash87 IEEE Computer SocietyAuckland New Zealand November 2009

[10] W E Wong and Y Qi ldquoBP neural network-based effective faultlocalizationrdquo International Journal of Software Engineering andKnowledge Engineering vol 19 no 4 pp 573ndash597 2009

[11] WEWongVDebroy RGoldenXXu andBThuraisinghamldquoEffective software fault localization using an RBF neuralnetworkrdquo IEEETransactions on Reliability vol 61 no 1 pp 149ndash169 2012

[12] F Seide G Li and D Yu ldquoConversational speech transcriptionusing context-dependent deep neural networksrdquo in Proceedingsof the 12th Annual Conference of the International Speech Com-munication Association (INTERSPEECH rsquo11) pp 437ndash440 Flo-rence Italy August 2011

[13] Y Lecun Y Bengio and G Hinton ldquoDeep learningrdquo Naturevol 521 no 7553 pp 436ndash444 2015

[14] G E Hinton S Osindero and Y-W Teh ldquoA fast learning algo-rithm for deep belief netsrdquo Neural Computation vol 18 no 7pp 1527ndash1554 2006

[15] A Mohamed G Dahl and G Hinton ldquoDeep belief networksfor phone recognitionrdquo in Proceedings of the NIPS Workshop onDeep Learning for Speech Recognition and Related ApplicationsWhistler Canada December 2009

[16] G EHinton ldquoApractical guide to training restricted boltzmannmachinesrdquo Tech Rep UTML TR 2010-003 Department ofComputer Science University of Toronto 2010

[17] G E Dahl D Yu L Deng and A Acero ldquoContext-dependentpre-trained deep neural networks for large-vocabulary speechrecognitionrdquo IEEE Transactions on Audio Speech and LanguageProcessing vol 20 no 1 pp 30ndash42 2012

[18] httpsirunledu[19] D Binldey ldquoSource code analysis a roadmaprdquo in Proceedings of

the Future of Software Engineering (FOSE rsquo07) pp 104ndash119 IEEECompurer Society Minneapolis Minn USA May 2007

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article Fault Localization Analysis Based on Deep ...downloads.hindawi.com/journals/mpe/2016/1820454.pdf · fault localization techniques that can guide programmers to the

6 Mathematical Problems in Engineering

Table 2 Coverage and execution result of Mid function

Function Test cases1199051

1199052

1199053

1199054

1199055

1199056

1199057

1199058

1199059

11990510

Mid(119909 119910 119911)int119898 3 3 5 1 2 3 3 2 2 5 5 5 1 1 4 5 3 4 3 2 1 5 4 2 2 1 3 5 2 6

(1)119898 = 119911 ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

(2) if (119910 lt 119911) ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

(3) if (119909 lt 119910) ∙ ∙ ∙ ∙ ∙ ∙

(4)119898 = 119910 ∙

(5) else if (119909 lt 119911) ∙ ∙ ∙ ∙ ∙

(6)119898 = 119910 bug ∙ ∙ ∙ ∙

(7) else ∙ ∙ ∙ ∙

(8) if (119909 gt 119910) ∙ ∙ ∙ ∙

(9)119898 = 119910 ∙ ∙ ∙

(10) else if (119909 gt 119911) ∙

(11)119898 = 119909(12) printf (119898) ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙

Passfail P P P P P P P P F F

Table 3 Switched Table 2

1199041

1199042

1199043

1199044

1199045

1199046

1199047

1199048

1199049

11990410

11990411

11990412

119903

1199051

1 1 1 0 1 1 0 0 0 0 0 1 01199052

1 1 1 1 0 0 0 0 0 0 0 1 01199053

1 1 0 0 0 0 1 1 1 0 0 1 01199054

1 1 0 0 0 0 1 1 0 1 0 1 01199055

1 1 1 0 1 1 0 0 0 0 0 1 01199056

1 1 1 0 1 0 0 0 0 0 0 1 01199057

1 1 0 0 0 0 1 1 1 0 0 1 01199058

1 1 0 0 0 0 1 1 1 0 0 1 01199059

1 1 1 0 1 1 0 0 0 0 0 1 111990510

1 1 1 0 1 1 0 0 0 0 0 1 1

replaces the blank and represents that the statement is notcovered by the corresponding test case In the last columnnumber 1 replaces F and indicates that the execution results oftest case are failed while number 0 replaces P and representsthat the test case is executed successfully

The concrete fault localization process is as follows

(1) Construct the DNN model with one input layer oneoutput layer and three hidden layers The number ofinput layer nodes is 12 and the number of hiddenlayer nodes is simply set as 4 The number of outputlayer nodes is 1 and the transfer function adopted isthe sigmoid function 120588(119904) = 11 + 119890minus119904

(2) Use the coverage data and execution results of the test cases as the training sample set and feed it into the DNN model. First we input the vector (1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1) and its execution result 0, next the second vector (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1) and its execution result 0, and so on, until the coverage data and execution results of all ten test cases have been fed into the network. We then train the DNN model on the inputted training data and iteratively input the next training dataset (i.e., another ten test cases' coverage data and execution results) to optimize the DNN model until it converges. The final DNN model obtained after several iterations captures the complex nonlinear mapping between coverage data and execution results.

(3) Construct the virtual test set with twelve test cases, ensuring that each test case covers exactly one executable statement. The set of virtual test cases is depicted in Table 4.

(4) Input the virtual test set into the trained DNN model to get the suspiciousness value of each corresponding executable statement, and then rank the statements by their suspiciousness values.

(5) Table 5 shows the statements ranked in descending order of suspiciousness. The ranking shows that the suspiciousness value of the sixth statement, which contains the bug in program Mid(·), is the greatest (i.e., it ranks first); thus we can locate the fault by checking only one statement. A compact end-to-end sketch of steps (1)-(5) appears below.
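The following is a minimal numpy sketch, not the authors' MATLAB implementation, of steps (1)-(5) on the Mid(·) example: it trains a 12-4-4-4-1 sigmoid network on the coverage matrix of Table 3 and then feeds it the virtual test suite of Table 4 to obtain per-statement suspiciousness. The training algorithm (full-batch gradient descent on a mean-squared-error loss), learning rate, epoch count, and weight initialization are illustrative assumptions, so the exact values of Table 5 will not be reproduced, only the general behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coverage matrix X (rows: test cases t1..t10, columns: statements s1..s12)
# and execution results r (1 = failed) from Table 3.
X = np.array([
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],   # t1
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],   # t2
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],   # t3
    [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1],   # t4
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],   # t5
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],   # t6
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],   # t7
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],   # t8
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],   # t9  (failed)
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],   # t10 (failed)
], dtype=float)
r = np.array([[0], [0], [0], [0], [0], [0], [0], [0], [1], [1]], dtype=float)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Step (1): one input layer (12 nodes), three hidden layers (4 nodes each),
# one output node, sigmoid transfer function throughout.
sizes = [12, 4, 4, 4, 1]
W = [rng.normal(0.0, 0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((1, n)) for n in sizes[1:]]

# Step (2): full-batch gradient descent with backpropagation (assumed setup).
lr = 0.5
for _ in range(20000):
    acts = [X]
    for Wi, bi in zip(W, b):
        acts.append(sigmoid(acts[-1] @ Wi + bi))
    delta = (acts[-1] - r) * acts[-1] * (1.0 - acts[-1])
    for i in reversed(range(len(W))):
        grad_W = acts[i].T @ delta / len(X)
        grad_b = delta.mean(axis=0, keepdims=True)
        if i > 0:  # propagate the error before updating this layer's weights
            delta = (delta @ W[i].T) * acts[i] * (1.0 - acts[i])
        W[i] -= lr * grad_W
        b[i] -= lr * grad_b

# Steps (3)-(5): the virtual test suite is the 12x12 identity matrix (Table 4);
# the network's output for row i is the suspiciousness of statement s(i+1).
out = np.eye(12)
for Wi, bi in zip(W, b):
    out = sigmoid(out @ Wi + bi)
susp = out.ravel()
for rank, s in enumerate(np.argsort(-susp), start=1):
    print(f"s{s + 1:<3} suspiciousness={susp[s]:.4f} rank={rank}")
```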

4. Empirical Studies

So far, the modeling procedure of our DNN-based fault localization technique has been discussed. In this section we empirically compare the performance of this technique with that of the Tarantula localization technique [4], the PPDG localization technique [3], and the BPNN localization technique [10] on the Siemens suite and the Space program to demonstrate the effectiveness of our DNN-based fault localization method. The Siemens suite and the Space program can be downloaded from [18].


Table 4: Virtual test suite. Each virtual test case ti covers only statement si, so the coverage matrix is the 12 x 12 identity matrix.

      s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12
t1    1   0   0   0   0   0   0   0   0   0    0    0
t2    0   1   0   0   0   0   0   0   0   0    0    0
t3    0   0   1   0   0   0   0   0   0   0    0    0
t4    0   0   0   1   0   0   0   0   0   0    0    0
t5    0   0   0   0   1   0   0   0   0   0    0    0
t6    0   0   0   0   0   1   0   0   0   0    0    0
t7    0   0   0   0   0   0   1   0   0   0    0    0
t8    0   0   0   0   0   0   0   1   0   0    0    0
t9    0   0   0   0   0   0   0   0   1   0    0    0
t10   0   0   0   0   0   0   0   0   0   1    0    0
t11   0   0   0   0   0   0   0   0   0   0    1    0
t12   0   0   0   0   0   0   0   0   0   0    0    1

Table 5: Suspiciousness values and ranking of the statements of Mid.

Statement     Suspiciousness   Rank
1             0.0448           11
2             0.1233           2
3             0.0879           5
4             0.0937           4
5             0.0747           7
6 (fault)     0.1593           1
7             0.1191           3
8             0.0758           6
9             0.0661           8
10            0.0436           12
11            0.0650           9
12            0.0649           10

The evaluation standard based on statement ranking was proposed by Jones and Harrold in [4] to compare the effectiveness of the Tarantula localization technique with other localization techniques. Tarantula produces a suspiciousness ranking of statements; the programmers examine the statements one by one, from first to last, until the fault is located, and the percentage of statements that need not be examined is defined as the score of the fault localization technique, that is, the EXAM score. Since the DNN-based, Tarantula, and PPDG localization techniques all produce a suspiciousness ranking of executable statements, we adopt the EXAM score as the evaluation criterion to measure effectiveness; a small illustrative sketch follows.
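Under this definition, the EXAM score of a single faulty version can be computed from the ranking alone. The helper below is an illustrative sketch that assumes a single fault and that ties are already broken by list order:

```python
# EXAM score: the percentage of executable statements that need NOT be
# examined before the fault is reached when statements are inspected in
# descending order of suspiciousness.
def exam_score(ranked_statements, faulty_statement):
    examined = ranked_statements.index(faulty_statement) + 1
    return 100.0 * (1.0 - examined / len(ranked_statements))

# Mid() example of Table 5: the fault (statement 6) ranks first among the
# 12 executable statements, so 11/12 of them never have to be examined.
ranking = [6, 2, 7, 4, 3, 8, 5, 9, 11, 12, 1, 10]
print(exam_score(ranking, 6))  # 91.66...
```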

4.1. Data Collection. Our experiments were carried out on a machine with a 3.40 GHz Intel Core i7-3770 CPU and 16 GB of physical memory. The operating systems were Windows 7 and Ubuntu 12.10, the compiler was gcc 4.7.2, and the experiments were conducted in MATLAB R2013a. To collect the per-statement coverage data, the tool Gcov was used. The test case execution results of the faulty program versions were obtained by the following method (a small sketch follows the list):

(1) We run the test cases on the fault-free program version to get the expected execution results.

(2) We run the test cases on the faulty program versions to get the actual execution results.

(3) We compare the results of the fault-free version and the faulty versions. If they are equal, the test case passes; otherwise, it fails.
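The three steps above amount to a simple differential oracle. A hedged sketch, with the hypothetical executables ./mid_correct and ./mid_faulty standing in for the fault-free and faulty program versions:

```python
import subprocess

def run(exe, args):
    """Run one program version on one test case and capture its output."""
    return subprocess.run([exe, *args], capture_output=True, text=True).stdout

def label(test_cases):
    results = []
    for args in test_cases:
        expected = run("./mid_correct", args)  # step (1): fault-free output
        actual = run("./mid_faulty", args)     # step (2): faulty output
        results.append(0 if actual == expected else 1)  # step (3): compare
    return results

print(label([["3", "3", "5"], ["2", "1", "3"]]))  # e.g. [0, 1]
```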

4.2. Programs of the Test Suite. In our experiment, we conducted two studies, on the Siemens suite and on the Space program, to demonstrate the effectiveness of the DNN-based fault localization technique.

4.2.1. The Siemens Suite. The Siemens suite has been considered a classical test sample and is widely employed in studies of fault localization techniques, so our experiment begins with it. The Siemens suite contains seven C programs with corresponding faulty versions and test cases. Each faulty version has only one fault, but the fault may span multiple statements or even multiple functions of the program. Table 6 gives a detailed introduction of the seven C programs in the Siemens suite, including the program name, the number of faulty versions of each program, the number of lines of code, the number of executable statements, and the number of test cases.

The seven C programs serve different functions: Print tokens and Print tokens 2 perform lexical analysis, Replace performs pattern replacement, Schedule and Schedule 2 are priority schedulers, Tcas computes altitude separation, and Tot info computes information measures.

The Siemens suite contains 132 faulty versions, of which we chose 122 for our experiments. We omitted the following versions: versions 4 and 6 of Print tokens, version 9 of Schedule 2, version 12 of Replace, versions 13, 14, and 36 of Tcas, and versions 6, 9, and 21 of Tot info. We eliminated these versions for the following reasons:

(a) There is no syntactic difference between the correct version and the faulty version (e.g., the only difference is in a header file).

(b) The test cases never fail when executing the faulty versions.

(c) The faulty versions produce segmentation faults when executing the test cases.

(d) The difference between the correct version and the faulty version does not lie in the executable statements of the program and cannot be mapped to them. In addition, some faulty versions omit statements, and we cannot locate missing statements directly; in this case, we can only select the associated statements to locate the faults.


Table 6: Summary of the Siemens suite.

Program          Faulty versions   LOC   Executable statements   Test cases
Print tokens     7                 565   172                     4130
Print tokens 2   10                510   146                     4115
Replace          32                563   175                     5542
Schedule         9                 412   140                     2650
Schedule 2       10                307   115                     2710
Tcas             41                173   59                      1608
Tot info         23                406   100                     1052

4.2.2. The Space Program. The executable statements of the Siemens suite programs number only in the hundreds of lines, while the scale and complexity of software have increased steadily and a program may contain thousands or tens of thousands of lines. We therefore also verify the effectiveness of our DNN-based fault localization technique on a large-scale program, namely, the Space program.

The Space program contains 38 faulty versions; each version has more than 9000 statements and 13585 test cases. In our experiment, for reasons similar to those given for the Siemens suite, we omit 5 faulty versions and select the other 33.

4.3. Experimental Process. We conducted experiments on the 122 faulty versions of the Siemens suite and the 33 faulty versions of the Space program. For the DNN modeling process we adopted three hidden layers; that is, the network is composed of one input layer, three hidden layers, and one output layer, a structure based on previous experience and a preliminary experiment. According to experience and the sample size, we estimate the number of hidden-layer nodes (i.e., num) by the following formula:

num = round(n/30) × 10.  (14)

Here n represents the number of input-layer nodes. The momentum (impulse) factor is set to 0.9, and the candidate learning rates are 0.01, 0.001, 0.0001, and 0.00001; we select the most appropriate learning rate as the final parameter according to the sample scale. A small illustration of formula (14) follows.
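As a direct transcription of formula (14), with n equal to the number of input-layer nodes (i.e., the number of executable statements of the program under test):

```python
def hidden_nodes(n):
    """Formula (14): num = round(n / 30) * 10."""
    return round(n / 30) * 10

print(hidden_nodes(172))  # Print tokens: round(5.73) * 10 = 60
print(hidden_nodes(59))   # Tcas:         round(1.97) * 10 = 20
```

Note that the toy Mid(·) walkthrough of the previous section simply used 4 hidden nodes per layer; formula (14) is applied to the realistically sized programs.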

We now take the Print tokens program as an example and introduce the experimental process in detail; the other programs are handled analogously. The Print tokens program includes seven faulty versions, of which we picked out five to conduct the experiment. The process is as follows:

(1) First, we compile the source program of Print tokens and run the test case set to get the expected execution results of the test cases.

(2) We compile the five faulty versions of the Print tokens program using GCC, similarly obtain the actual execution results of the test cases, and acquire the coverage data of the test cases using Gcov (see the sketch after this list).

(3) We construct the set of virtual test cases for each of the five faulty versions of the Print tokens program.

(4) We then compare the execution results of the test cases between the source program and the five faulty versions of Print tokens to get the final execution results (pass/fail labels) of the test cases of the five faulty versions.

(5) We integrate the coverage data and the final test results of the five faulty versions, respectively, as input vectors to train the deep neural network.

(6) Finally, we feed the set of virtual test cases into the trained deep neural network to acquire the suspiciousness of the corresponding statements and then rank the suspiciousness values.
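GCC and Gcov are real tools; the sketch below shows the coverage-collection step of item (2) with placeholder file and test names. Compiling with the coverage flags makes each run record per-line execution counts, which gcov writes into an annotated report; parsing which lines a test case executed yields one row of the coverage matrix:

```python
import subprocess

# Instrument the program for coverage, run one test case, and produce the
# annotated print_tokens.c.gcov report (file and test names are placeholders).
subprocess.run(["gcc", "-fprofile-arcs", "-ftest-coverage",
                "print_tokens.c", "-o", "print_tokens"], check=True)
subprocess.run(["./print_tokens", "tests/case1.txt"], check=True)
subprocess.run(["gcov", "print_tokens.c"], check=True)
```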

5. Results and Analysis

According to the experimental process described in the previous section, we perform the fault localization experiments with the deep neural network and the BP neural network and statistically analyze the results. We then compare these results with the Tarantula and PPDG localization techniques, whose experimental results were reported in [4] and [3], respectively.

5.1. The Experimental Result of the Siemens Suite. Figure 3 shows the effectiveness of the DNN, BP neural network, Tarantula, and PPDG localization techniques on the Siemens suite.

In Figure 3, the x-axis represents the percentage of statements that need not be examined before the fault is located, that is, the EXAM score, while the y-axis represents the percentage of faulty versions whose faults have been located. For instance, a point labeled (70, 80) on a curve means that the technique is able to find eighty percent of the faults in the Siemens suite with seventy percent of the statements unexamined; that is, we can identify eighty percent of the faults in the Siemens suite by examining only thirty percent of the statements.

Table 7 offers further explanation for Figure 3. Following the statistical approach used for other fault localization techniques, the EXAM score is divided into 11 segments, each 10% wide, with one caveat:


[Figure 3: Effectiveness comparison on the Siemens suite. x-axis: percentage of executable statements that need not be examined (i.e., EXAM score), from 100% down to 0%; y-axis: percentage of fault versions located; curves: DNN, NN (BPNN), Tarantula, and PPDG (best).]

Table 7: Effectiveness comparison of four techniques (% of faulty versions located per EXAM score segment).

EXAM score   DNN     BPNN    PPDG    Tarantula
100-99%      12.29   6.56    41.94   13.93
99-90%       61.48   43.44   31.45   41.80
90-80%       21.31   24.59   13.71   5.74
80-70%       4.10    11.48   2.42    9.84
70-60%       0.00    7.38    2.42    8.20
60-50%       0.82    3.28    5.65    7.38
50-40%       0.00    3.28    1.61    0.82
40-30%       0.00    0.00    0.00    0.82
30-20%       0.00    0.00    0.80    4.10
20-10%       0.00    0.00    0.00    7.38
10-0%        0.00    0.00    0.00    0.00

programmers are not able to identify the faults without examining any statement, so the abscissa can only be infinitely close to 100%. For that reason we divide the 100-90% range into two segments, 100-99% and 99-90%. The data in every segment can be used to evaluate the effectiveness of a fault localization technique. For example, in the 99-90% segment, DNN identifies 61.48% of the faults in the Siemens suite, while BPNN locates 43.44%. From Table 7 we also find that for the segments 50-40%, 40-30%, 30-20%, 20-10%, and 10-0%, the percentage identified by DNN is 0.00% in each; this indicates that the DNN-based method has found all the faulty versions after examining 50% of the statements.

We analyzed the above data and summarize the following points:

(1) Figure 3 reveals that the overall effectiveness of our DNN-based fault localization technique is better than that of the BP neural network technique, as the curve representing the DNN-based approach is always above the curve denoting the BPNN-based method. The DNN-based technique is also able to identify the faults by examining fewer statements than the BP neural network method. For instance, by examining 10% of statements, the DNN-based method identifies 73.77% of the faulty versions, while the BP neural network technique finds only 50% of them.

(2) Compared with the Tarantula fault localization technique, the DNN-based approach dramatically improves the overall effectiveness of fault localization. For example, the DNN-based method finds 95.08% of the faulty versions after examining 20% of statements, while Tarantula identifies only 61.47% of the faulty versions after examining the same number. For the score segment 100-99%, the effectiveness of the DNN-based technique is similar to that of Tarantula; for instance, the DNN-based method finds 12.29% of the faulty versions with 99% of statements unexamined (i.e., at abscissa 99%), close to Tarantula's performance.

(3) Compared with the PPDG fault localization technique, the DNN-based technique also shows improved effectiveness over the score range 90-0%. Moreover, the DNN-based technique finds all faulty versions after examining only 50% of statements, whereas PPDG needs to examine 80% of statements. For the score range 100-99%, however, the DNN-based technique is inferior to PPDG.

In conclusion, the DNN-based fault localization technique improves localization effectiveness considerably overall. In particular, over the score range 90-0% the improvement is most evident compared with the other techniques, as fewer statements need to be examined to identify the faults. The DNN-based technique finds all faulty versions by examining only 50% of statements, which is superior to the BPNN, Tarantula, and PPDG techniques, while for the score segment 100-99% its performance is similar to Tarantula's and somewhat inferior to PPDG's.

5.2. The Experimental Result of the Space Program. As the executable statements of the Siemens suite programs number only in the hundreds of lines, we further verify the effectiveness of the DNN-based fault localization technique on a large-scale dataset.

Figure 4 shows the effectiveness of the DNN and BP neural network localization techniques on the Space program.

From Figure 4 we can clearly see that the effectiveness of our DNN-based method is markedly higher than that of the BP neural network fault localization technique.


[Figure 4: Effectiveness comparison on the Space program. x-axis: percentage of executable statements that need not be examined (i.e., EXAM score), from 100% down to 70%; y-axis: percentage of fault versions located; curves: DNN and NN (BPNN).]

The DNN-based fault localization technique finds all faulty versions while leaving 93% of the statements unexamined, whereas the BP neural network identifies all faulty versions with only 83% of the statements unexamined. This comparison indicates that our DNN-based approach is superior, as it reduces the number of statements that must be examined. The experimental results show that the DNN-based fault localization technique is also highly effective on large-scale programs.

6. Threats to Validity

There may exist several threats to the validity of the technique presented in this paper; we discuss some of them in this section.

We determined the structure of the network, including the number of hidden layers and the number of hidden nodes, empirically, and we determined some important model parameters by grid search, so better structures for the fault localization model probably exist. Since the DNN model trained on the first attempt is usually not optimal, in practice we need to modify the parameters of the network several times to obtain a more nearly optimal model.

In the field of fault localization we may encounter multiple-bug programs, for which new models that can locate multiple bugs need to be constructed. In addition, because the number of input nodes of the model equals the number of executable statements, the model becomes very large when the number of executable statements is large. That limits fault localization for large-scale software, and we may need to control the scale of the model or improve the performance of the computers we use. Moreover, the virtual test sets built to calculate suspiciousness, in which each test case covers only one statement, may not be suitably designed: when we use a virtual test case to describe a certain statement, we assume that if the test case is predicted as failed, then the statement it covers contains a bug, and whether this assumption is reasonable has not been confirmed. As this assumption is important for our fault localization approach, we may need to test this hypothesis or propose an alternative strategy.

7. Conclusion and Future Work

In this paper we propose a DNN-based fault localization technique, as a DNN is able to model the complex nonlinear relationship between input and output. We conduct an empirical study on the Siemens suite and the Space program and compare the effectiveness of our method with other techniques, namely, the BP neural network, Tarantula, and PPDG. The results show that our DNN-based fault localization technique performs best. For example, for the Space program, the DNN-based technique needs to examine only 10% of statements to identify all faulty versions, while the BP neural network technique needs to examine 20% of statements to find all faulty versions.

As software fault localization is one of the most expensive, tedious, and time-consuming activities in the software testing process, it is of great significance for researchers to automate the localization process [19]. By leveraging the strong feature-learning ability of DNNs, our approach helps to predict the likelihood that each statement of a program contains a fault, which can guide programmers to the locations of faults with minimal human intervention. As deep learning is widely applied and demonstrates good performance in image processing, speech recognition, and natural language processing, as well as in related industries [13], and as the scale and complexity of software keep increasing, we believe our DNN-based method has great potential for industrial application. It could help to improve the efficiency and effectiveness of debugging, speed up the software development process, reduce software maintenance costs, and so forth.

In future work we will apply deep neural networks to multifault localization to evaluate their effectiveness. Meanwhile, we will conduct further studies on the feasibility of applying other deep learning models (e.g., convolutional neural networks and recurrent neural networks) to software fault localization.

Competing Interests

The authors declare that they have no competing interests.

References

[1] M. A. Alipour, "Automated fault localization techniques: a survey," Tech. Rep., Oregon State University, Corvallis, Ore, USA, 2012.

[2] W. E. Wong and V. Debroy, "A survey of software fault localization," Tech. Rep. UTDCS-45-09, Department of Computer Science, The University of Texas at Dallas, 2009.

[3] G. K. Baah, A. Podgurski, and M. J. Harrold, "The probabilistic program dependence graph and its application to fault diagnosis," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 528-545, 2010.

[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE '05), pp. 273-282, Long Beach, Calif, USA, November 2005.

[5] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, "Statistical debugging: a hypothesis testing-based approach," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 831-847, 2006.

[6] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, "A crosstab-based statistical method for effective fault localization," in Proceedings of the 1st International Conference on Software Testing, Verification and Validation (ICST '08), pp. 42-51, Lillehammer, Norway, April 2008.

[7] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183-200, 2002.

[8] X. Zhang, N. Gupta, and R. Gupta, "Locating faults through automated predicate switching," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pp. 272-281, Shanghai, China, May 2006.

[9] S. Ali, J. H. Andrews, T. Dhandapani, et al., "Evaluating the accuracy of fault localization techniques," in Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE '09), pp. 76-87, IEEE Computer Society, Auckland, New Zealand, November 2009.

[10] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573-597, 2009.

[11] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, "Effective software fault localization using an RBF neural network," IEEE Transactions on Reliability, vol. 61, no. 1, pp. 149-169, 2012.

[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 437-440, Florence, Italy, August 2011.

[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[15] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, Canada, December 2009.

[16] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto, 2010.

[17] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.

[18] http://sir.unl.edu

[19] D. Binkley, "Source code analysis: a roadmap," in Proceedings of the Future of Software Engineering (FOSE '07), pp. 104-119, IEEE Computer Society, Minneapolis, Minn, USA, May 2007.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article Fault Localization Analysis Based on Deep ...downloads.hindawi.com/journals/mpe/2016/1820454.pdf · fault localization techniques that can guide programmers to the

Mathematical Problems in Engineering 7

Table 4 Virtual test suite

1199041

1199042

1199043

1199044

1199045

1199046

1199047

1199048

1199049

11990410

11990411

11990412

1199051

1 0 0 0 0 0 0 0 0 0 0 01199052

0 1 0 0 0 0 0 0 0 0 0 01199053

0 0 1 0 0 0 0 0 0 0 0 01199054

0 0 0 1 0 0 0 0 0 0 0 01199055

0 0 0 0 1 0 0 0 0 0 0 01199056

0 0 0 0 0 1 0 0 0 0 0 01199057

0 0 0 0 0 0 1 0 0 0 0 01199058

0 0 0 0 0 0 0 1 0 0 0 01199059

0 0 0 0 0 0 0 0 1 0 0 011990510

0 0 0 0 0 0 0 0 0 1 0 011990511

0 0 0 0 0 0 0 0 0 0 1 011990512

0 0 0 0 0 0 0 0 0 0 0 1

Table 5

Statement Suspiciousness Rank1 00448 112 01233 23 00879 54 00937 45 00747 76 (fault) 01593 17 01191 38 00758 69 00661 810 00436 1211 00650 912 00649 10

program can be downloaded from [18] The evaluation stan-dard based on statements ranking is proposed by Jones andHarrold in [4] to compare the effectiveness of the Tarantulalocalization technique with other localization techniquesTarantula localization technique can produce the suspicious-ness ranking of statements The programmers examine thestatements one by one from the first to the last until thefault is located and the percentage of statements withoutbeing examined is defined as the score of fault localizationtechnique that is EXAM score Since the DNN-based faultlocalization technique Tarantula localization technique andPPDG localization technique all produce a suspiciousnessvalue ranking list for executable statements we adopt theEXAM score as the evaluation criterion to measure theeffectiveness

41 Data Collection The physical environment on whichour experiments were carried out included 340GHz IntelCore i7-3770CPU and 16GB physicalmemoryThe operatingsystems were Windows 7 and Ubuntu 1210 Our compilerwas gcc 472 We conducted experiments on the MATLABR2013a To collect the coverage data of each statementtoolbox software named Gcov was used here The test case

execution results of faulty program versions can be obtainedby the following method

(1) We run the test cases on fault-free program versionsto get the execution results

(2) We run the test cases on faulty program versions toget the execution results

(3) We compare the results of fault-free program versionsand results of faulty program versions If they areequal the test case is successful otherwise it is failed

42 Programs of the Test Suite In our experiment we con-ducted two studies on the Siemens suite and Space programto demonstrate the effectiveness of the DNN-based faultlocalization technique

421 The Siemens Suite The Siemens suite has been consid-ered as a classical test sample which is widely employed instudies of fault localization techniques And in this paper ourexperiment begins with it The Siemens suite contains sevenC programs with the corresponding faulty versions and testcases Each faulty version has only one fault but it may crossthe multiline statements or even multiple program functionsin the program Table 6 provides the detailed introductionof the seven C programs in Siemens suite including theprogram name number of faulty versions of each programnumber of statement lines number of executable statementslines and the number of test cases

The function of the seven C programs is different ThePrint tokens and Print tokens 2 programs are used for lexicalanalysis and the function of Replace program is patternreplacement The Schedule and Schedule 2 programs areutilized for priority scheduler while the function of Tcas isaltitude separation and the Tot info program is used forinformation measure

The Siemens suite contains 132 faulty versions while wechoose 122 faulty versions to perform experiments We omit-ted the following versions versions 4 and 6 of Print tokensversion 9 of Schedule 2 version 12 of Replace versions 1314 and 36 of Tcas and versions 6 9 and 21 of Tot info Weeliminated these versions because of the following

(a) There is no syntactic difference between the correctversion and the faulty versions (eg only differencein the header file)

(b) The test cases never fail when executing the programsof faulty versions

(c) The faulty versions have segmentation faults whenexecuting the test cases

(d) The difference between the correct versions andthe faulty versions is not included in executablestatements of program and cannot be replaced Inaddition there exists absence of statements in somefaulty versions and we cannot locate the missingstatements directly In this case we can only select theassociated statements to locate the faults

8 Mathematical Problems in Engineering

Table 6 Summary of the Siemens suite

Program Number of faultyversions

LOC (lines of code) Number of executablestatements

Number of test cases

Print tokens 7 565 172 4130Print tokens 2 10 510 146 4115Replace 32 563 175 5542Schedule 9 412 140 2650Schedule 2 10 307 115 2710Tcas 41 173 59 1608Tot info 23 406 100 1052

422 The Space Program As the executable statements ofSiemens suite programs are only about hundreds of lines thescale and complexity of the software have increased graduallyand the program may be of as many as thousands or tenthousands of lines Sowe verify the effectiveness of ourDNN-based fault localization technique on large-scale program setthat is Space program

The Space program contains 38 faulty versions eachversion has more than 9000 statements and 13585 test casesIn our experiment due to similar reasons which have beendepicted in Siemens suite we also omit 5 faulty versions andselect another 33 faulty versions

43 Experimental Process In this paper we conducted exper-iments on the 122 faulty versions of Siemens suite and 33faulty versions of Space program For the DNN modelingprocess we adopted three hidden layers that is the structureof the network is composed of one input layer three hiddenlayers and one output layer which is based on previousexperience and the preliminary experiment According tothe experience and sample size we estimate the number ofhidden layer nodes (ie num) by the following formula

num = round( 11989930

) lowast 10 (14)

Here 119899 represents the number of input layer nodes Theimpulse factor is set as 09 and the range of learning rate is001 0001 00001 000001 we select the most appropriatelearning rate as the final parameter according to the samplescale

Now we take Print tokens program as an example andintroduce the experimental process in detail The other pro-gram versions can refer to the Print tokensThe Print tokensprogram includes seven faulty versions and we pick out fiveversions to conduct the experiment The experiment processis as follows

(1) Firstly we compile the source program of Printtokens and run the test case set to get the expectedexecution results of test cases

(2) We compile the five faulty versions of Print tokensprogram using GCC and similarly get the actual exe-cution results of test cases and acquire the coveragedata of test cases by using Gcov technique

(3) We construct the set of virtual test cases for the fivefaulty versions of Print tokens program respectively

(4) Then we compare the execution results of test casesbetween the source program and the five faulty ver-sions of Print tokens to get the final execution resultsof the test cases of five faulty versions

(5) We integrate the coverage data and the final resultsof test cases of the five faulty versions respectively asinput vectors to train the deep neural network

(6) Finally we utilize the set of virtual test cases to testthe deep neural network to acquire the suspiciousnessof the corresponding statements and then rank thesuspiciousness value list

5 Results and Analysis

According to the experimental process described in the previ-ous section we perform the fault localization experiments ondeep neural network and BP neural network and statisticallyanalyze the experimental results We then compare theexperimental results with Tarantula localization techniqueand PPDG localization technique The experimental resultsof Tarantula and PPDG have been introduced in [3 4]respectively

51 The Experimental Result of Siemens Suite Figure 3 showsthe effectiveness of DNN BP neural network Tarantula andPPDG localization techniques on the Siemens suite

In Figure 3 the 119909-axis represents the percentage of state-ments without being examined until the fault is located thatis EXAM score while the 119910-axis represents the percentage offaulty versions whose faults have been located For instancethere is a point labeled on the curve whose value is (70 80)This means the technique is able to find out eighty percentof the faults in the Siemens suite with seventy percent ofstatements not being examined that is we can identify eightypercent of the faults in the Siemens suite by only examiningthirty percent of the statements

Table 7 offers further explanation for Figure 3According to the approach of experiment statistics from

other fault localization techniques the EXAM score can bedivided into 11 segments Each 10 is treated as one segmentbut the caveat is that the programmers are not able to identify

Mathematical Problems in Engineering 9

TarantulaPPDG (best)

NNDNN

0102030405060708090

100

of f

ault

vers

ions

0102030405060708090100

80 60 40 20 0100 of executable statements that need not be

examined (ie EXAM score)

Figure 3 Effectiveness comparison on the Siemens suite

Table 7 Effectiveness comparison of four techniques

EXAM score of the faulty versionsDNN BPNN PPDG Tarantula

100-99 1229 656 4194 139399ndash90 6148 4344 3145 418090ndash80 2131 2459 1371 57480ndash70 410 1148 242 98470ndash60 000 738 242 82060ndash50 082 328 565 73850ndash40 000 328 161 08240ndash30 000 000 000 08230ndash20 000 000 080 41020ndash10 000 000 000 73810ndash0 000 000 000 000

the faults without examining any statement thus the abscissacan only be infinitely close to 100 Due to that factor wedivide the 100ndash90 into two segments that is 100ndash99and 99ndash90 The data of every segment can be used toevaluate the effectiveness of fault localization technique Forexample for the 99ndash90 segment of DNN the percentageof faults identified in the Siemens suite is 6148 whilethe percentage of faults located is 4344 for the BPNNtechnique From Table 7 we can find that for the segments50ndash40 40ndash30 30ndash20 20ndash10 and 10ndash0 ofDNN the percentage of faults identified in each segment is000That indicates that the DNN-basedmethod has foundout all the faulty versions after examining 50 of statements

We analyzed the above data and summarized the follow-ing points

(1) Figure 3 reveals that the overall effectiveness of ourDNN-based fault localization technique is better thanthat of the BP neural network fault localization tech-nique as the curve representingDNN-based approachis always above the curve denoting BPNN-based

method Also the DNN-based technique is able toidentify the faults by examining fewer statementscompared with the method based on BP neural net-work For instance by examining 10 of statementsthe DNN-based method is able to identify 7377 ofthe faulty versions while the BP neural network faultlocalization technique only finds out 50 of the faultyversions

(2) Compared with the Tarantula fault localizationtechnique the DNN-based approach dramaticallyimproves the effectiveness of fault localization onthe whole For example the DNN-based methodis capable of finding out 9508 of the faulty ver-sions after examining 20 of statements while theTarantula fault localization technique only identifies6147 of the faulty versions after examining thesame amount of statements For the score segmentof 100ndash99 the effectiveness of the DNN-basedfault localization technique is relatively similar to thatof Tarantula for instance DNN-based method canfind out 1229 of the faulty versions with 99 ofstatements not examined (ie the abscissa values are99) and its performance is close to that of Tarantulafault localization technique

(3) Compared with the PPDG fault localization tech-nique the DNN-based fault localization techniquereflects improved effectiveness as well for the scorerange of 90ndash0 Moreover the DNN-based faultlocalization technique is capable of finding out allfaulty versions by only examining 50 of statementswhile PPDG fault localization technique needs toexamine 80 of statements For the score rangeof 100ndash99 the performance of DNN-based faultlocalization technique is inferior to PPDG fault local-ization technique

In conclusion overall the DNN-based fault localizationtechnique improves the effectiveness of localization con-siderably In particular for the score range of 90ndash0the improved effectiveness of localization is more obviouscompared with other fault localization techniques as it needsless statements to be examined to further identify the faultsThe DNN-based fault localization technique is able to findout all faulty versions by only examining 50 of statementswhich is superior to BPNN Tarantula and PPDG faultlocalization technique while for the score segment of 100ndash99 the performance of our DNN-based approach is similarto that of Tarantula fault localization method while it is a bitinferior to that of PPDG fault localization technique

52 The Experimental Result of Space Program As the exe-cutable statements of Siemens suite programs are only abouthundreds of lines we further verify the effectiveness of DNN-based fault localization technique in large-scale datasets

Figure 4 demonstrates the effectiveness of DNN andBP neural network localization techniques on the Spaceprogram

From Figure 4 we can clearly find that the effectivenessof our DNN-based method is obviously higher than that

10 Mathematical Problems in Engineering

of executable statements that need not beexamined (ie EXAM score)

707580859095100

NNDNN

03

04

05

06

07

08

09

10

03

04

05

06

07

08

09

10

of f

ault

vers

ions

Figure 4 Effectiveness comparison on the Space program

of the BP neural network fault localization technique TheDNN-based fault localization technique is able to find out allfaulty versions without examining 93 of statements whileBP neural network identifies all faulty versions with 83 ofthe statements not examinedThat comparison indicates thatthe performance of our DNN-based approach is superior asit reduces the number of statements to be examined Theexperimental results show that DNN-based fault localizationtechnique is also highly effective in large-scale datasets

6 Threats to the Validity

Theremay exist several threats to the validity of the techniquepresented in this paper We discuss some of them in thissection

We empirically determine the structure of the networkincluding the number of hidden layers and the number ofhidden nodes In addition we determine some importantmodel parameters by grid search method So probably thereexist some better structures for fault localization model Asthe DNN model we trained for the first time usually is notthe most optimal one thus empirically we need to modifythe parameters of the network several times to try to obtain amore optimal model

In the field of fault localization we may encountermultiple-bug programs and we need to construct some newmodels that can locatemultiple bugs In addition because thenumber of input nodes of the model is equal to the lines ofexecutable statements thus when the number of executablestatements is very large the model will become very big aswell That results in a limitation of fault localization for somebig-scale software and wemay need to control the scale of themodel or to improve the performance of the computerwe useMoreover virtual test sets built to calculate the suspiciousnessof each test case which only covers one statement may notbe suitably designed When we use the virtual test case todescribe certain statement we assume that if a test case ispredicated as failed then the statement covered by it willhave a bug However whether this assumption is reasonablehas not been confirmed As this assumption is important for

our fault localization approach thus we may need to test thishypothesis or to propose an alternative strategy

7 Conclusion and Future Work

In this paper we propose a DNN-based fault localizationtechnique as DNN is able to simulate the complex nonlinearrelationship between the input and the outputWe conduct anempirical study on Siemens suite and Space program Furtherwe compare the effectiveness of our method with othertechniques like BPneural network Tarantula andPPDGTheresults show that our DNN-based fault localization techniqueperforms best For example for Space program DNN-basedfault localization technique only needs to examine 10 ofstatements to fully identify all faulty versions while BP neuralnetwork fault localization technique needs to examine 20 ofstatements to find out all faulty versions

As software fault localization is one of themost expensivetedious and time-consuming activities during the softwaretesting process thus it is of great significance for researchersto automate the localization process [19] By leveraging thestrong feature-learning ability of DNN our approach helpsto predict the likelihood of containing fault of each statementin a certain program And that can guide programmers tothe location of statementsrsquo faults with minimal human inter-vention As deep learning is widely applied and demonstratesgood performance in research field of image processingspeech recognition and natural language processing and inrelated industries as well [13] with the ever-increasing scaleand complexity of software we believe that our DNN-basedmethod is of great potential to be applied in industry Itcould help to improve the efficiency and effectiveness of thedebugging process boost the software development processreduce the software maintenance cost and so forth

In our future work we will apply deep neural networkin multifaults localization to evaluate its effectiveness Mean-while we will conduct further studies about the feasibilityof applying other deep learning models (eg convolutionalneural network recurrent neural network) in the field ofsoftware fault localization

Competing Interests

The authors declare that they have no competing interests

References

[1] M A Alipour ldquoAutomated fault localization techniques a sur-veyrdquo Tech Rep Oregon State University Corvallis Ore USA2012

[2] W Eric Wong and V Debroy ldquoA survey of software fault local-izationrdquo Tech Rep UTDCS-45-09 Department of ComputerScience The University of Texas at Dallas 2009

[3] G K Baah A Podgurski and M J Harrold ldquoThe probabilisticprogram dependence graph and its application to fault diagno-sisrdquo IEEE Transactions on Software Engineering vol 36 no 4pp 528ndash545 2010

[4] J A Jones and M J Harrold ldquoEmpirical evaluation of thetarantula automatic fault-localization techniquerdquo in Proceedingsof the 20th IEEEACM International Conference on Automated

Mathematical Problems in Engineering 11

Software Engineering (ASE rsquo05) pp 273ndash282 Long Beach CalifUSA November 2005

[5] C Liu L Fei X Yan J Han and S PMidkiff ldquoStatistical debug-ging a hypothesis testing-based approachrdquo IEEE Transactionson Software Engineering vol 32 no 10 pp 831ndash847 2006

[6] W EWong TWei Y Qi and L Zhao ldquoA crosstab-based statis-ticalmethod for effective fault localizationrdquo in Proceedings of the1st International Conference on Software Testing Verification andValidation (ICST rsquo08) pp 42ndash51 Lillehammer Norway April2008

[7] A Zeller and R Hildebrandt ldquoSimplifying and isolating failure-inducing inputrdquo IEEETransactions on Software Engineering vol28 no 2 pp 183ndash200 2002

[8] X Zhang N Gupta and R Gupta ldquoLocating faults throughautomated predicate switchingrdquo in Proceedings of the 28th Inter-national Conference on Software Engineering (ICSE rsquo06) pp272ndash281 Shanghai China May 2006

[9] S Ali J H Andrews T Dhandapani et al ldquoEvaluating theaccuracy of fault localization techniquesrdquo in Proceedings of the24th IEEEACM International Conference on Automated Soft-ware Engineering (ASE rsquo09) pp 76ndash87 IEEE Computer SocietyAuckland New Zealand November 2009

[10] W E Wong and Y Qi ldquoBP neural network-based effective faultlocalizationrdquo International Journal of Software Engineering andKnowledge Engineering vol 19 no 4 pp 573ndash597 2009

[11] WEWongVDebroy RGoldenXXu andBThuraisinghamldquoEffective software fault localization using an RBF neuralnetworkrdquo IEEETransactions on Reliability vol 61 no 1 pp 149ndash169 2012

[12] F Seide G Li and D Yu ldquoConversational speech transcriptionusing context-dependent deep neural networksrdquo in Proceedingsof the 12th Annual Conference of the International Speech Com-munication Association (INTERSPEECH rsquo11) pp 437ndash440 Flo-rence Italy August 2011

[13] Y Lecun Y Bengio and G Hinton ldquoDeep learningrdquo Naturevol 521 no 7553 pp 436ndash444 2015

[14] G E Hinton S Osindero and Y-W Teh ldquoA fast learning algo-rithm for deep belief netsrdquo Neural Computation vol 18 no 7pp 1527ndash1554 2006

[15] A Mohamed G Dahl and G Hinton ldquoDeep belief networksfor phone recognitionrdquo in Proceedings of the NIPS Workshop onDeep Learning for Speech Recognition and Related ApplicationsWhistler Canada December 2009

[16] G EHinton ldquoApractical guide to training restricted boltzmannmachinesrdquo Tech Rep UTML TR 2010-003 Department ofComputer Science University of Toronto 2010

[17] G E Dahl D Yu L Deng and A Acero ldquoContext-dependentpre-trained deep neural networks for large-vocabulary speechrecognitionrdquo IEEE Transactions on Audio Speech and LanguageProcessing vol 20 no 1 pp 30ndash42 2012

[18] httpsirunledu[19] D Binldey ldquoSource code analysis a roadmaprdquo in Proceedings of

the Future of Software Engineering (FOSE rsquo07) pp 104ndash119 IEEECompurer Society Minneapolis Minn USA May 2007

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article Fault Localization Analysis Based on Deep ...downloads.hindawi.com/journals/mpe/2016/1820454.pdf · fault localization techniques that can guide programmers to the

8 Mathematical Problems in Engineering

Table 6 Summary of the Siemens suite

Program Number of faultyversions

LOC (lines of code) Number of executablestatements

Number of test cases

Print tokens 7 565 172 4130Print tokens 2 10 510 146 4115Replace 32 563 175 5542Schedule 9 412 140 2650Schedule 2 10 307 115 2710Tcas 41 173 59 1608Tot info 23 406 100 1052

422 The Space Program As the executable statements ofSiemens suite programs are only about hundreds of lines thescale and complexity of the software have increased graduallyand the program may be of as many as thousands or tenthousands of lines Sowe verify the effectiveness of ourDNN-based fault localization technique on large-scale program setthat is Space program

The Space program contains 38 faulty versions eachversion has more than 9000 statements and 13585 test casesIn our experiment due to similar reasons which have beendepicted in Siemens suite we also omit 5 faulty versions andselect another 33 faulty versions

43 Experimental Process In this paper we conducted exper-iments on the 122 faulty versions of Siemens suite and 33faulty versions of Space program For the DNN modelingprocess we adopted three hidden layers that is the structureof the network is composed of one input layer three hiddenlayers and one output layer which is based on previousexperience and the preliminary experiment According tothe experience and sample size we estimate the number ofhidden layer nodes (ie num) by the following formula

num = round( 11989930

) lowast 10 (14)

Here 119899 represents the number of input layer nodes Theimpulse factor is set as 09 and the range of learning rate is001 0001 00001 000001 we select the most appropriatelearning rate as the final parameter according to the samplescale

Now we take Print tokens program as an example andintroduce the experimental process in detail The other pro-gram versions can refer to the Print tokensThe Print tokensprogram includes seven faulty versions and we pick out fiveversions to conduct the experiment The experiment processis as follows

(1) Firstly we compile the source program of Printtokens and run the test case set to get the expectedexecution results of test cases

(2) We compile the five faulty versions of Print tokensprogram using GCC and similarly get the actual exe-cution results of test cases and acquire the coveragedata of test cases by using Gcov technique

(3) We construct the set of virtual test cases for the fivefaulty versions of Print tokens program respectively

(4) Then we compare the execution results of test casesbetween the source program and the five faulty ver-sions of Print tokens to get the final execution resultsof the test cases of five faulty versions

(5) We integrate the coverage data and the final resultsof test cases of the five faulty versions respectively asinput vectors to train the deep neural network

(6) Finally we utilize the set of virtual test cases to testthe deep neural network to acquire the suspiciousnessof the corresponding statements and then rank thesuspiciousness value list

5 Results and Analysis

According to the experimental process described in the previ-ous section we perform the fault localization experiments ondeep neural network and BP neural network and statisticallyanalyze the experimental results We then compare theexperimental results with Tarantula localization techniqueand PPDG localization technique The experimental resultsof Tarantula and PPDG have been introduced in [3 4]respectively

51 The Experimental Result of Siemens Suite Figure 3 showsthe effectiveness of DNN BP neural network Tarantula andPPDG localization techniques on the Siemens suite

In Figure 3 the 119909-axis represents the percentage of state-ments without being examined until the fault is located thatis EXAM score while the 119910-axis represents the percentage offaulty versions whose faults have been located For instancethere is a point labeled on the curve whose value is (70 80)This means the technique is able to find out eighty percentof the faults in the Siemens suite with seventy percent ofstatements not being examined that is we can identify eightypercent of the faults in the Siemens suite by only examiningthirty percent of the statements

Table 7 offers further explanation for Figure 3According to the approach of experiment statistics from

other fault localization techniques the EXAM score can bedivided into 11 segments Each 10 is treated as one segmentbut the caveat is that the programmers are not able to identify

Mathematical Problems in Engineering 9

TarantulaPPDG (best)

NNDNN

0102030405060708090

100

of f

ault

vers

ions

0102030405060708090100

80 60 40 20 0100 of executable statements that need not be

examined (ie EXAM score)

Figure 3 Effectiveness comparison on the Siemens suite

Table 7 Effectiveness comparison of four techniques

EXAM score of the faulty versionsDNN BPNN PPDG Tarantula

100-99 1229 656 4194 139399ndash90 6148 4344 3145 418090ndash80 2131 2459 1371 57480ndash70 410 1148 242 98470ndash60 000 738 242 82060ndash50 082 328 565 73850ndash40 000 328 161 08240ndash30 000 000 000 08230ndash20 000 000 080 41020ndash10 000 000 000 73810ndash0 000 000 000 000

the faults without examining any statement thus the abscissacan only be infinitely close to 100 Due to that factor wedivide the 100ndash90 into two segments that is 100ndash99and 99ndash90 The data of every segment can be used toevaluate the effectiveness of fault localization technique Forexample for the 99ndash90 segment of DNN the percentageof faults identified in the Siemens suite is 6148 whilethe percentage of faults located is 4344 for the BPNNtechnique From Table 7 we can find that for the segments50ndash40 40ndash30 30ndash20 20ndash10 and 10ndash0 ofDNN the percentage of faults identified in each segment is000That indicates that the DNN-basedmethod has foundout all the faulty versions after examining 50 of statements

We analyzed the above data and summarized the follow-ing points

(1) Figure 3 reveals that the overall effectiveness of ourDNN-based fault localization technique is better thanthat of the BP neural network fault localization tech-nique as the curve representingDNN-based approachis always above the curve denoting BPNN-based

method Also the DNN-based technique is able toidentify the faults by examining fewer statementscompared with the method based on BP neural net-work For instance by examining 10 of statementsthe DNN-based method is able to identify 7377 ofthe faulty versions while the BP neural network faultlocalization technique only finds out 50 of the faultyversions

(2) Compared with the Tarantula fault localizationtechnique the DNN-based approach dramaticallyimproves the effectiveness of fault localization onthe whole For example the DNN-based methodis capable of finding out 9508 of the faulty ver-sions after examining 20 of statements while theTarantula fault localization technique only identifies6147 of the faulty versions after examining thesame amount of statements For the score segmentof 100ndash99 the effectiveness of the DNN-basedfault localization technique is relatively similar to thatof Tarantula for instance DNN-based method canfind out 1229 of the faulty versions with 99 ofstatements not examined (ie the abscissa values are99) and its performance is close to that of Tarantulafault localization technique

(3) Compared with the PPDG fault localization technique, the DNN-based fault localization technique shows improved effectiveness as well for the score range of 90%–0%. Moreover, the DNN-based fault localization technique is capable of finding all faulty versions by examining only 50% of statements, while the PPDG fault localization technique needs to examine 80% of statements. For the score range of 100%–99%, the performance of the DNN-based fault localization technique is inferior to the PPDG fault localization technique.

In conclusion, the DNN-based fault localization technique improves the effectiveness of localization considerably overall. In particular, for the score range of 90%–0%, the improvement is more obvious compared with other fault localization techniques, as fewer statements need to be examined to identify the faults. The DNN-based fault localization technique is able to find all faulty versions by examining only 50% of statements, which is superior to the BPNN, Tarantula, and PPDG fault localization techniques, while for the score segment of 100%–99% the performance of our DNN-based approach is similar to that of the Tarantula fault localization method and a bit inferior to that of the PPDG fault localization technique.

5.2. The Experimental Result of the Space Program. As the executable statements of the Siemens suite programs amount to only a few hundred lines, we further verify the effectiveness of the DNN-based fault localization technique on large-scale datasets.

Figure 4 demonstrates the effectiveness of the DNN and BP neural network localization techniques on the Space program.

From Figure 4 we can clearly see that the effectiveness of our DNN-based method is markedly higher than that of the BP neural network fault localization technique.

[Figure 4: Effectiveness comparison on the Space program. x-axis: % of executable statements that need not be examined (i.e., EXAM score); y-axis: % of fault versions; curves: NN and DNN.]

The DNN-based fault localization technique is able to find all faulty versions without examining 93% of statements, while the BP neural network identifies all faulty versions with 83% of the statements not examined. That comparison indicates that the performance of our DNN-based approach is superior, as it reduces the number of statements to be examined. The experimental results show that the DNN-based fault localization technique is also highly effective on large-scale datasets.

6. Threats to Validity

There may exist several threats to the validity of the technique presented in this paper. We discuss some of them in this section.

We empirically determine the structure of the network, including the number of hidden layers and the number of hidden nodes. In addition, we determine some important model parameters by grid search. Thus, better structures for the fault localization model probably exist. As the DNN model trained for the first time is usually not the optimal one, we empirically need to modify the parameters of the network several times to obtain a better model.
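As an illustration of this kind of grid search, the sketch below uses scikit-learn's MLPClassifier as a stand-in for our DNN; the candidate grid and variable names are illustrative assumptions, not the actual configuration used in our experiments.

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Candidate structures (depth/width) and learning rates; illustrative only.
param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50), (100, 50, 25)],
    "learning_rate_init": [0.001, 0.01, 0.1],
}

search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
# X is the per-test coverage matrix, y the pass/fail labels of the test cases:
# search.fit(X, y); search.best_params_ then gives the selected structure.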

In the field of fault localization we may encounter multiple-bug programs, and we need to construct new models that can locate multiple bugs. In addition, because the number of input nodes of the model is equal to the number of executable statements, the model becomes very large when the number of executable statements is large. That limits fault localization for large-scale software, and we may need to control the scale of the model or improve the performance of the computer we use. Moreover, the virtual test suite built to calculate suspiciousness, in which each test case covers exactly one statement, may not be suitably designed. When we use a virtual test case to describe a certain statement, we assume that if the test case is predicted as failed, then the statement covered by it contains a bug. However, whether this assumption is reasonable has not been confirmed. As this assumption is important for our fault localization approach, we may need to test this hypothesis or propose an alternative strategy.
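To make the virtual-test-suite assumption concrete, the following sketch derives a suspiciousness ranking from one-hot coverage vectors. It assumes a trained binary classifier with a scikit-learn-style predict_proba interface; the function and variable names are ours, not from our implementation.

import numpy as np

def statement_suspiciousness(dnn, n_statements):
    # One virtual test case per statement: a coverage vector that covers
    # exactly that statement and nothing else.
    virtual_suite = np.eye(n_statements)
    # The predicted probability that such a test case fails is read as the
    # suspiciousness of the statement it covers.
    scores = dnn.predict_proba(virtual_suite)[:, 1]
    # Statements are then examined in descending order of suspiciousness.
    return np.argsort(-scores), scores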

7. Conclusion and Future Work

In this paper we propose a DNN-based fault localization technique, as a DNN is able to simulate the complex nonlinear relationship between the input and the output. We conduct an empirical study on the Siemens suite and the Space program. Further, we compare the effectiveness of our method with that of other techniques like BP neural network, Tarantula, and PPDG. The results show that our DNN-based fault localization technique performs best. For example, for the Space program the DNN-based fault localization technique needs to examine only 10% of statements to identify all faulty versions, while the BP neural network fault localization technique needs to examine 20% of statements to find all faulty versions.

As software fault localization is one of the most expensive, tedious, and time-consuming activities in the software testing process, it is of great significance for researchers to automate the localization process [19]. By leveraging the strong feature-learning ability of DNN, our approach helps to predict the likelihood that each statement in a program contains a fault, which can guide programmers to the locations of faults with minimal human intervention. As deep learning is widely applied and demonstrates good performance in the research fields of image processing, speech recognition, and natural language processing, as well as in related industries [13], and given the ever-increasing scale and complexity of software, we believe that our DNN-based method has great potential to be applied in industry. It could help to improve the efficiency and effectiveness of the debugging process, accelerate software development, reduce software maintenance cost, and so forth.

In our future work we will apply deep neural networks to multifault localization to evaluate their effectiveness. Meanwhile, we will conduct further studies on the feasibility of applying other deep learning models (e.g., convolutional neural networks, recurrent neural networks) to software fault localization.

Competing Interests

The authors declare that they have no competing interests

References

[1] M. A. Alipour, "Automated fault localization techniques: a survey," Tech. Rep., Oregon State University, Corvallis, Ore, USA, 2012.

[2] W. E. Wong and V. Debroy, "A survey of software fault localization," Tech. Rep. UTDCS-45-09, Department of Computer Science, The University of Texas at Dallas, 2009.

[3] G. K. Baah, A. Podgurski, and M. J. Harrold, "The probabilistic program dependence graph and its application to fault diagnosis," IEEE Transactions on Software Engineering, vol. 36, no. 4, pp. 528–545, 2010.

[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE '05), pp. 273–282, Long Beach, Calif, USA, November 2005.

[5] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, "Statistical debugging: a hypothesis testing-based approach," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 831–847, 2006.

[6] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, "A crosstab-based statistical method for effective fault localization," in Proceedings of the 1st International Conference on Software Testing, Verification and Validation (ICST '08), pp. 42–51, Lillehammer, Norway, April 2008.

[7] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.

[8] X. Zhang, N. Gupta, and R. Gupta, "Locating faults through automated predicate switching," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pp. 272–281, Shanghai, China, May 2006.

[9] S. Ali, J. H. Andrews, T. Dhandapani, et al., "Evaluating the accuracy of fault localization techniques," in Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE '09), pp. 76–87, IEEE Computer Society, Auckland, New Zealand, November 2009.

[10] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573–597, 2009.

[11] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, "Effective software fault localization using an RBF neural network," IEEE Transactions on Reliability, vol. 61, no. 1, pp. 149–169, 2012.

[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 437–440, Florence, Italy, August 2011.

[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[15] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, Canada, December 2009.

[16] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto, 2010.

[17] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.

[18] http://sir.unl.edu.

[19] D. Binkley, "Source code analysis: a roadmap," in Proceedings of the Future of Software Engineering (FOSE '07), pp. 104–119, IEEE Computer Society, Minneapolis, Minn, USA, May 2007.
