
Automated Bug Localization of Software Programs: A Survey Report 1

DEBJIT PAL 2 (Net ID: dpal2), University of Illinois at Urbana-Champaign

RIZWAN MOHIUDDIN 3 (Net ID: rmohiud2), University of Illinois at Urbana-Champaign

From simple handheld devices to the supercomputers driving space missions and other safety-critical avionic systems, software today is truly ubiquitous. To ensure good performance, reliability, safety, and correct operation of these systems, program developers take great care and effort in developing the underlying software. Correct operation demands that the software be bug free. Despite good programming practices and the use of many advanced technologies for bug detection, bug-free software remains a myth. Software bugs [Software Bugs 2009] are one of the primary causes of lost revenue in the software industry, and in any software development project a significant amount of resources is consumed by debugging. Devising effective bug localization techniques is therefore very important for realizing automated debugging of software. In the past ten years, researchers have developed many effective bug detection methods based on static analysis and statistical analysis, and their effectiveness has been established on several large test cases, including some real-world open-source software. The statistical methods apply parametric and non-parametric hypothesis testing to program features, and the proposed algorithms profile artifacts ranging from program predicates to program paths; the result is the detection of the most fault-relevant predicates or the most fault-relevant control paths of the program. The common objective of all of these algorithms is to present to the developer a comprehensive list of predicates and program paths relevant to a fault so as to ease debugging with minimal manual effort. In this paper, we survey some of the most important methods and results reported in the literature on software bug localization over the past decade.

Additional Key Words and Phrases: static techniques, statistical bug-localization, fault, null hypothesis, normal distribution, program path profiles

1. INTRODUCTION

Fault localization [Vessey 1985] is one of the most difficult activities in software debugging. Over the past decade, many fault-localization techniques based on static analysis and statistical analysis have been proposed in the literature. The main aim is to automate the process of isolating bugs: program paths and/or program predicates are profiled over several runs of the program, and statistical analysis is then used to pinpoint the likely causes of failure, relieving programmers from tedious debugging work. Static analysis can detect program defects by checking either a well-specified program model [Clarke et al. 1999] or the real code directly [Visser et al. 2003], [Musuvathi et al. 2002]. Dynamic analysis, on the other hand, contrasts the run-time behavior of passing and failing executions to isolate suspicious program segments [Harrold et al. 2000], [Renieris and Reiss 2003], [Zeller 2002], [Liblit et al. 2005]. Dynamic analysis often assumes no prior knowledge of program semantics aside from a labeling of program executions as either correct or incorrect.

Among the dynamic methods, statistical bug localization schemes such as program invariants [Brun and Ernst 2004] and statistical debugging [Liblit et al. 2003] have achieved success, as empirical results on standard benchmarks like the Siemens Suite [SIR 2005] show. In these methods, programs are first instrumented to collect statistics characterizing their runtime behavior, such as the evaluations of conditionals and function return values. The behavior can be recorded in the evaluation history of various program predicates. Post-execution analyses are then performed on the gathered histories to identify bug predictors that may point to actual bug locations.

1 This survey report has been prepared as a class project for the CS 512 Data Mining course in Spring 2013 at the University of Illinois at Urbana-Champaign.
2 Email: [email protected]
3 Email: [email protected]


Effective bug localization techniques can potentially save much developer time by not only pinpointing bug locations but also providing useful contextual information for understanding the causes of bugs. Statistical debugging is normally based on a low-overhead infrastructure such as CBI (Cooperative Bug Isolation) [Liblit 2007]. In this infrastructure, program predicates such as the number of times a branch condition is taken are recorded, and statistical models are then applied to rank the predicates by how closely they relate to the bugs. Developers can then inspect highly ranked predicates for actual bugs; these predicates are called bug predictors. Depending on how much overhead the user can afford, more involved and heavier instrumentation can be employed to collect more detailed predicate and path profiles of the program, which may lead to more precise bug predictors. For example, Tarantula [Jones and Harrold 2005], [Jones et al. 2002] instruments almost every statement of the program, and ranks and visualizes the statements according to their potential relation to bugs. In this survey, we summarize some of the major statistical debugging ideas that have advanced the state of the art in automated software bug localization over the past decade.

The survey is organized as follows. Section 2 contains a brief overview of the two main approaches to bug localization, namely static analysis and statistical analysis. In Section 3 we survey some of the major contributions in statistical-analysis-based software bug localization. The major frameworks we study are parametric methods such as CBI [Liblit et al. 2005], [Liblit 2007], SOBER [Liu et al. 2005], [Liu et al. 2006], and HOLMES [Chilimbi et al. 2009], and non-parametric hypothesis testing frameworks such as DES [Zhang et al. 2010], [Hu et al. 2008], [Zhang et al. 2009], [Zhang et al. 2011]. Section 4 concludes the survey report.

2. OVERVIEW OF DIFFERENT BUG-LOCALIZATION TECHNIQUES

In this section we briefly discuss the main concepts behind the two schools of software bug localization techniques: in Section 2.1 we introduce static analysis techniques, and in Section 2.2 we introduce dynamic-analysis and statistical-analysis-based software bug localization methodologies. In Section 3 we discuss the major statistical bug localization methods in detail.

2.1. Static Analysis

Static analysis of program source code is a kind of formal analysis in which all possible paths of the source code are explored. It strongly depends on the syntax and semantics of the underlying programming language and is performed without actually executing the program. Some of the prominent methods of static analysis are model checking [Clarke et al. 1999], data-flow analysis, and abstract interpretation [Cousot and Cousot 1977]. By a straightforward reduction from the Halting Problem [Halting Problem 1930] it can be proved that, for any Turing-complete language, finding all possible run-time errors in an arbitrary program (or, more generally, any violation of a specification on the final result of a program) is undecidable: no mechanical method can always answer truthfully whether a given program may or may not exhibit runtime errors. In spite of this inherent limitation, a few tools implement static-analysis-based software verification: BLAST [Henzinger et al. 2002], Clang [Lattner 2007], and Microsoft's SLAM Toolkit [Microsoft Research 2010] are among the most prominent and widely used. One of the most severe limitations of this approach is that it often suffers from capacity problems, as widely reported in the literature: static analysis tries to explore all possible executions of the program, and since the number of executions is infinite even for very simple and relatively small programs, static-analysis-based methods often do not scale. In this survey, static analysis and related methods are not our prime objective. In the next section we briefly introduce statistical-analysis-based bug localization.


2.2. Statistical Analysis

To alleviate these capacity issues, researchers have studied bug localization methodologies based on dynamic analysis. All variants of this methodology share a common framework: the program is executed on a large number of test cases and the execution traces are dumped; suitable machine learning and data mining algorithms are then applied to extract likely invariants and possible bug locations in the program. Clearly, even if a large number of test suites is used to generate the trace data, it is still impossible to exercise all possible paths in the program; however, this restriction is exactly what makes invariant generation and bug localization more tractable. It comes at a price: some of the generated invariants may be spurious, holding for those test cases but not for the program in general, and bug localization may produce false positives. Over the past decade, several such dynamic-trace-based bug localization techniques have been proposed. Some use a hybrid approach in which knowledge learned from dynamic traces prunes a significant portion of the state space explored by static analysis, making it more tractable; holmes [Chilimbi et al. 2009] is one such method. In other methods, state-of-the-art data mining algorithms along with rigorous statistical analyses are applied to the dynamic traces to locate possible bug locations. CBI [Liblit 2007], [Liblit et al. 2005], SOBER [Liu et al. 2005], [Liu et al. 2006], and DES [Hu et al. 2008] are some of the prominent statistical-analysis-based bug localization techniques. This survey is by no means as comprehensive as the vast and diverse body of work on software bug localization; rather, we present the main ideas and the mathematical background of the statistical-analysis-based approaches that have evolved over the past ten years.

3. STATISTICAL ANALYSIS

In this section we give a detailed overview of some well-known statistical bug localization frameworks.

3.1. Scalable Statistical Bug Isolation

3.1.1. Overview of the Proposed Method. In this paper [Liblit et al. 2005] the authors describe one of the most important and influential statistical bug isolation methods, which resulted in a major bug isolation tool called CBI (Cooperative Bug Isolation) [Liblit 2007]. If P is a predicate of a program and R is the feedback report on an execution of the program (success or failure), then R(P) = 1 if P is observed to be true at least once in R, and R(P) = 0 otherwise. A program predicate P is a bug predictor of a bug B if R(P) = 1 and R is a failure. Instead of locating predicates that are merely correlated with program failures, the proposed technique separates the effects of different bugs and identifies the predictors associated with individual bugs. The major advantage of the method is that it can cope with more than one bug being present in the code. Moreover, the identified predictors reveal both the circumstances under which the bugs occur and the frequencies of the failure modes, helping to prioritize debugging efforts.

To collect information about the predicates from program runs, the algorithm performs a source-to-source transformation that adds instrumentation code at designated sites, as listed below. The information is collected using a sparse random sampling technique which, while controlling the additional performance overhead of instrumentation, provides the necessary information about the run-time behavior of the predicates. The algorithm uses the following three kinds of instrumentation to collect data; the same kinds of instrumentation are also used later in the SOBER [Liu et al. 2005], [Liu et al. 2006] and holmes [Chilimbi et al. 2009] frameworks (see Sections 3.3 and 3.5). The instrumentations are:


(1) Branches: At each if-else statement, the algorithm tracks two predicates indicating whether the true branch (the if block) or the false branch (the else block) was ever taken.

(2) Returns: At each scalar-returning function call site, six predicates are tracked, recording whether the return value is ever < 0, ≤ 0, = 0, ≠ 0, > 0, or ≥ 0. This helps in debugging bugs related to success or failure signals conveyed through return values.

(3) Scalar-Pairs: To track bugs concerning boundary issues in the relationship between two variables (or a variable and a constant), at each assignment statement x = . . ., six predicates are tracked comparing the new value of x to each other in-scope variable or constant: <, ≤, =, ≠, >, ≥.
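To make these instrumentation sites concrete, the following short Python sketch derives the three kinds of CBI-style predicates from values observed at run time; the function names and the dictionary encoding are our own illustrative choices, not part of the CBI implementation.

# Illustrative sketch (not the CBI implementation): deriving the three kinds of
# CBI-style predicates from values observed at instrumentation sites.

def branch_predicates(cond_value):
    # Branches: two predicates per if-else site.
    return {"branch_true": cond_value, "branch_false": not cond_value}

def return_predicates(ret):
    # Returns: six predicates per scalar-returning call site.
    return {"ret<0": ret < 0, "ret<=0": ret <= 0, "ret==0": ret == 0,
            "ret!=0": ret != 0, "ret>0": ret > 0, "ret>=0": ret >= 0}

def scalar_pair_predicates(x_new, other):
    # Scalar-pairs: six predicates comparing the freshly assigned x to another
    # in-scope variable or constant.
    return {"x<y": x_new < other, "x<=y": x_new <= other, "x==y": x_new == other,
            "x!=y": x_new != other, "x>y": x_new > other, "x>=y": x_new >= other}

# Example: one instrumentation sample at a call site returning -1.
print(return_predicates(-1))   # ret<0, ret<=0, and ret!=0 are observed true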

3.1.2. Proposed Algorithm for Ranking Predicates. The algorithm has two basic steps:

(1) Identify the potential candidate predictors for the most important bug B.

(2) Emulate the condition as if bug B has been fixed, and proceed until no further bug is manifested.

The algorithm does not actually fix the bug in the second step; rather, it emulates the effect of the bug having been fixed. For this step, all the runs in which the top-ranked bug predictor obtained in the first step is observed true are removed, and the algorithm is then applied recursively to the remaining set of runs. Discarding those runs reduces the importance of other predictors of bug B, allowing predicates that predict other bugs to rise to the top in subsequent iterations.

The first step of the algorithm has two subparts: (a) removal of predicates that are not relevant to bug isolation, called pruning of predicates, using a score that captures how much difference it makes that the predicate is observed true versus merely reaching the line where the predicate is checked; this score is denoted Increase(P), and predicates with Increase(P) > 0 are retained for further inspection; and (b) ranking of the surviving predicates by an Importance metric computed from statistical information about the program runs. All predicates with Increase(P) ≤ 0 are discarded, since they do not help locate bugs effectively (examples are program invariants, unreachable predicates, and predicates control-dependent on the true cause). Another advantage of this pruning is that the bug is localized at the point where the condition that triggers it becomes true, rather than at the place where the bug manifests (the latter is normally available in the stack trace when a program crashes). Furthermore, a confidence interval based on the Increase(P) value is attached to each predicate, which helps to remove predicates that have high increase scores but, because of few observations, very low confidence about their contribution to the bug. The statistical interpretation of Increase(P) > 0 is a simplified likelihood-ratio hypothesis test: if a predicate causes many crashes, then the probability of its being observed true in failing runs should be much higher than in successful runs. The authors impose a normality assumption on the distribution of the random variable from which this probability comes; the same kind of assumption is also made in later algorithms such as SOBER [Liu et al. 2005].

Once a comprehensive list of candidate predicates is generated according to the above observation, a ranking algorithm is applied to indicate the most effective ones. For this, the authors take care of specificity and sensitivity: specificity of a predicate means that the predicate does not mis-predict failure in many successful runs, i.e., it does not raise false alarms; sensitivity measures the number of failed runs for which a predicate can be held responsible, i.e., how prominent the predicate is when a bug manifests. To incorporate both factors in the ranking scheme, the authors use a harmonic mean, which prefers high scores on both parameters.


To handle the problem of redundancy (a redundant predicate hides other bugs that appear in a small number of failing runs, or pushes them to the bottom of the ranked list), an iterative approach is used: after ranking all predicates by the Importance metric, the top-ranked predicate and all runs in which it evaluates true are removed. This process continues until the list of runs or the set of predicates becomes empty. Intuitively, this iterative elimination algorithm chooses at least one predicate predictive of each bug represented by the input set of predicates. The key observation is that two predicates are redundant if they are responsible for exactly or nearly the same set of failing runs; hence, removing the set of runs in which one predicate is true automatically reduces the importance of any related predicates in the correct proportion. Since the algorithm is iterative, at every step a single good predictor with high Importance suffices; eventually every predicate that covers a different set of failing runs than all higher-ranked predicates will be chosen.
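The following Python sketch illustrates the pruning-and-ranking step under our reading of [Liblit et al. 2005]: Increase(P) contrasts the probability of failure when P is observed true with the probability of failure when P is merely observed, and Importance(P) is the harmonic mean of that increase (specificity) and a normalized logarithmic count of the failing runs P covers (sensitivity). The run encoding, the smoothing constants, and the exact sensitivity term are simplifications for illustration, not the paper's implementation.

import math

# Each run: r["obs"] and r["true"] are sets of predicate ids (observed / observed
# true at least once); r["failed"] is a boolean label for the run.
def increase(pred, runs):
    f_true = sum(1 for r in runs if pred in r["true"] and r["failed"])
    s_true = sum(1 for r in runs if pred in r["true"] and not r["failed"])
    f_obs = sum(1 for r in runs if pred in r["obs"] and r["failed"])
    s_obs = sum(1 for r in runs if pred in r["obs"] and not r["failed"])
    if f_true + s_true == 0 or f_obs + s_obs == 0:
        return 0.0
    return f_true / (f_true + s_true) - f_obs / (f_obs + s_obs)

def importance(pred, runs, num_failing):
    inc = increase(pred, runs)
    f_true = sum(1 for r in runs if pred in r["true"] and r["failed"])
    if inc <= 0 or f_true == 0 or num_failing == 0:
        return 0.0
    sensitivity = math.log(f_true + 1) / math.log(num_failing + 1)
    return 2.0 / (1.0 / inc + 1.0 / sensitivity)   # harmonic mean

def iterative_elimination(preds, runs):
    # Repeatedly pick the top predictor, then drop all runs it explains.
    chosen, remaining = [], list(runs)
    num_failing = sum(1 for r in runs if r["failed"])
    while remaining and preds:
        best = max(preds, key=lambda p: importance(p, remaining, num_failing))
        if importance(best, remaining, num_failing) <= 0:
            break
        chosen.append(best)
        remaining = [r for r in remaining if best not in r["true"]]
        preds = [p for p in preds if p != best]
    return chosen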

3.1.3. Quantitative Evaluation of the Proposed Method. The effectiveness of the proposed algorithm has been shown on MOSS (a widely used tool for detecting software plagiarism), RHYTHMBOX (a Linux music player), BC (a Linux calculator), and EXIF (an open-source utility for image metadata). The results show that the algorithm can capture most bugs and their relevant predictors within a small number of test cases, ranging from 1K to 2K, whereas bugs that rarely manifest themselves may require as many as 20K sampled test cases. In the next few sections we describe other major algorithms that have used this framework of statistical bug localization in more sophisticated ways.

3.2. Tarantula: Another Automatic Fault-Localization Technique

This paper [Jones and Harrold 2005] overviews the Tarantula technique together with four other techniques, namely Set Union, Set Intersection, Nearest Neighbor, and Cause Transitions, and presents an empirical study comparing Tarantula against the other techniques. The results of the experiments, pertaining to fault localization ability, indicate that Tarantula consistently outperforms the other four techniques in terms of effectiveness and is comparable in efficiency to the least expensive of them [Jones and Harrold 2005]. In this review, however, we focus mainly on the Tarantula bug localization technique.

Many different bug localization techniques have been studied. To identify likely faulty statements, some techniques, such as "Fault Localization using Execution Slices and Dataflow Tests", use coverage information provided by test suites [Agrawal et al. 1995]. Other techniques, such as "Isolating Cause-Effect Chains from Computer Programs", perform a binary search of the memory state using one failing and one passing test case to find likely faulty statements [Zeller 2002]. Post-deployment remote monitoring and statistical sampling of programs is carried out by approaches like "Visualization of Program-Execution Data for Deployed Software" [Orso et al. 2003]. The effectiveness and efficiency of these methods have been evaluated in empirical studies; however, since the techniques differ across platforms, languages, programs, and test suites, very few empirical studies have reported comparisons of existing techniques.

The main principle of the Tarantula technique is that entities in a program that are primarily executed by failed test cases are more likely to be faulty than those that are primarily executed by passed test cases. However, Tarantula allows some tolerance for the fault to be occasionally executed by passed test cases. Large amounts of data about the software system under test are collected by readily available standard software testing tools. These data can be used to demonstrate the exhaustiveness of testing and to find areas of the source code not executed by the test suite, thus prompting the need for additional test cases, and they also provide useful information for fault localization.


The collected data includes pass/fail information for each test case; the statements, branches, and methods executed by each test case; and the source code of the program under test. In the Tarantula technique, a visualization tool is used to assign to each program entity a value representing its likelihood of being faulty [Jones and Harrold 2005]. Each statement is assigned a color according to its probability of being related to a fault. Red is used for statements that are executed primarily by failed test cases and are thus highly suspicious; green is used for statements executed primarily by passed test cases and thus not likely to be faulty; yellow is used for statements executed by a mixture of passed and failed test cases, which may or may not be faulty. The hue of a statement s is given by

hue(s) = (passed(s)/totalpassed) / (passed(s)/totalpassed + failed(s)/totalfailed),

where passed(s) is the number of passed test cases that executed statement s one or more times, failed(s) is the number of failed test cases that executed s one or more times, and totalpassed and totalfailed are the total numbers of test cases that pass and fail in the entire test suite. hue(s) expresses the likelihood that s is faulty, i.e., the suspiciousness of s, and varies between 0 (most suspicious) and 1 (least suspicious); this convention is reversed, with 1 being the most suspicious and 0 the least, to express results in a more intuitive manner. Entities with high suspiciousness values are ranked higher and are considered first by the programmer when looking for the fault. Each set of entities at the same ranking level is given a rank number equal to the greatest number of statements that would need to be examined if the fault were the last statement in that rank to be examined. The evaluation illustration is as follows: to the right of each line of code is a set of test case columns, with each test case's input shown at the top of its column, coverage shown by black dots, and pass/fail status shown at the bottom; to the right of the test case columns are two columns labeled suspiciousness and rank.
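The following Python sketch computes the hue defined above and the corresponding suspiciousness score (its complement) for each statement from per-statement coverage counts; the dictionary-based data layout is our own illustrative choice.

# Illustrative sketch of Tarantula's scoring, using the hue formula above.
def tarantula_scores(coverage, total_passed, total_failed):
    # coverage: {stmt: (passed_count, failed_count)} where counts are the number
    # of passed/failed test cases that executed the statement at least once.
    scores = {}
    for stmt, (p, f) in coverage.items():
        pr = p / total_passed if total_passed else 0.0
        fr = f / total_failed if total_failed else 0.0
        if pr + fr == 0:
            continue  # statement never executed: no evidence either way
        hue = pr / (pr + fr)
        scores[stmt] = 1.0 - hue   # suspiciousness: 1 = most suspicious
    # Most suspicious statements first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: statement 7 is covered mostly by failing tests and ranks first.
print(tarantula_scores({7: (1, 4), 12: (5, 1)}, total_passed=6, total_failed=5))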

The comparison of the five techniques discussed in the paper is based on effectiveness and efficiency. Effectiveness is evaluated by ranking the statements of a program as specified by each technique; an SDG-ranking technique is used to rank the Set-union, Set-intersection, Nearest-neighbor, and Cause-transitions techniques. To compare the effectiveness of the techniques, a graph is drawn in which the horizontal axis represents the score, i.e., the percentage of the subject program that would not need to be examined when following the order of program points specified by the technique, and the vertical axis represents the percentage of test runs found at the score given on the horizontal axis. Points and lines are drawn at each segment level to show the percentage of versions for which the fault is at the lower bound of that segment range or higher. The resulting plot shows that the Tarantula technique achieved the best result of all five techniques and was consistently more effective at guiding the programmer to the fault. Efficiency is measured in time, gathered for both computation and input/output, and the Tarantula technique was shown to be more efficient [Jones and Harrold 2005].

3.3. SOBER: A Statistical Model-Based and Hypothesis-Testing-Based Bug Localization Technique

3.3.1. Overview of the Proposed Method. To aid the tedious and time-consuming manual debugging of software programs, Liu et al. [Liu et al. 2005], [Liu et al. 2006] proposed an automatic fault localization technique based on statistical analysis called SOBER. It can localize software faults without any prior knowledge of program semantics. Unlike the previous state-of-the-art approach [Liblit et al. 2005], which selects predicates correlated with program failures, SOBER models the truth of predicate evaluations in both correct and incorrect executions and regards a predicate as fault-relevant if its evaluation pattern in incorrect executions diverges significantly from that in correct ones. SOBER starts by treating the evaluations of a predicate P as independent Bernoulli trials; each evaluation is either true or false. The probability of P being true in a given execution, called the evaluation bias π(P), is estimated as


π(P) = n_t / (n_t + n_f),

where n_t and n_f denote the number of times predicate P evaluates to true and false, respectively, in that execution. Although the evaluation bias may vary from one execution to another, the observed values from multiple executions constitute a random sample from a statistical model. The evaluation bias from a test case t can be treated as an observation from f_P(X|θ), where θ is either θ_p or θ_f depending on whether t is a passing or a failing execution. Given the statistical models for both passing and failing runs, the fault relevance of a predicate P for a hidden fault is defined through the following similarity function:

L(P) = Sim(f_P(X|θ_p), f_P(X|θ_f)).

A predicate P is relevant to a fault if its underlying model f_P(X|θ_f) diverges significantly from f_P(X|θ_p), where X is a random variable for the evaluation bias of P. The ranking score s(P) of a faulty predicate is calculated using a monotonically decreasing function such as g(x) = −log(x); the reason for choosing the logarithm is that it effectively measures relative magnitude even when the values of x are close to 0. Hence, the fault relevance score s(P) is defined as

s(P) = −log(L(P)).

Using this fault relevance score, all instrumented predicates can be ranked, and the top-ranked ones are regarded as the most fault-relevant predicates. Hence, the fault localization problem boils down to the following two subproblems:

(1) The choice of a suitable similarity function L(P).

(2) The computation of L(P), since the closed form of f_P(X|θ) is unknown.
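As a concrete illustration, the Python sketch below computes the evaluation bias of a predicate per run and ranks predicates by s(P) = −log(L(P)), where L(P) is approximated by the two-sided p-value of a simple z-style comparison between the failing-run biases and the passing-run model; this is a rough stand-in for SOBER's actual test statistic, and the data layout is our own.

import math

def evaluation_bias(n_true, n_false):
    # pi(P) for one run: fraction of evaluations of P that were true.
    return n_true / (n_true + n_false) if (n_true + n_false) > 0 else None

def sober_score(pass_biases, fail_biases):
    # Approximate L(P) by the two-sided p-value of a z-style statistic that
    # compares the mean failing-run bias against the passing-run bias model;
    # this is a simplification of SOBER's actual derivation.
    mu_p = sum(pass_biases) / len(pass_biases)
    var_p = sum((b - mu_p) ** 2 for b in pass_biases) / len(pass_biases)
    mu_f = sum(fail_biases) / len(fail_biases)
    m = len(fail_biases)
    if var_p == 0:
        var_p = 1e-9                      # avoid division by zero
    z = (mu_f - mu_p) / math.sqrt(var_p / m)
    similarity = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return -math.log(max(similarity, 1e-300))   # s(P) = -log(L(P))

# Example: biases of one predicate observed in passing vs. failing runs.
print(sober_score([0.1, 0.2, 0.15, 0.1], [0.8, 0.7, 0.9]))  # large score => fault-relevant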

3.3.2. Proposed Implementation of the Predicate Ranking Procedure. The main difficulties for the ranking procedure are:

(1) Lack of prior knowledge of f_P(X|θ).

(2) Unavailability of closed forms for f_P(X|θ_p) and f_P(X|θ_f).

(3) No model assumption, such as normality of f_P(X), can be made, as that may lead to misleading inferences.

To cope with these problems, the paper proposes an indirect method to calculate the difference between f_P(X|θ_p) and f_P(X|θ_f) without any model assumption. Initially, a null hypothesis H_0: f_P(X|θ_p) = f_P(X|θ_f) is posed, i.e., the two models are assumed to be identical. A statistic Y is then derived from the evaluation biases observed in the m failing cases, which under the null hypothesis H_0 follows a known distribution. If the realized statistic corresponds to an event that has a small likelihood of occurring, the null hypothesis is likely invalid and a significant difference exists between f_P(X|θ_p) and f_P(X|θ_f).

The main differences between the approach proposed in [Liblit et al. 2005] (hereafter called Lib05) and this work are as follows:

(1) From a methodological viewpoint, Lib05 estimates how much more likely an execution is to crash when predicate P is observed true, compared with the case when P is merely observed (either true or false). Thus Lib05 gives more weight to predicates whose true evaluation correlates with program crashes. SOBER, on the other hand, models the evaluation distribution of predicate P in passing (i.e., f_P(X|θ_p)) and failing (i.e., f_P(X|θ_f)) executions and treats predicates with large differences between f_P(X|θ_p) and f_P(X|θ_f) as fault relevant. In a nutshell, the ranking models of the Lib05 and SOBER frameworks are fundamentally different.


(2) SOBER exploits multiple evaluations of a predicate within one execution of the program, whereas Lib05 overlooks this information. For example, if a predicate P evaluates to true at least once in every execution, but has different likelihoods of being true in passing and failing executions, Lib05 simply overlooks P. SOBER, on the other hand, readily captures this evaluation divergence and thus has an edge over the Lib05 method.

3.3.3. Quantitative Evaluation of the Proposed Method. To establish its effectiveness, the authors compared the proposed method with two prominent fault localization techniques, namely Lib05 [Liblit et al. 2005] and Tarantula [Jones and Harrold 2005], taking the Siemens Suite [Hutchins et al. 1994] as the benchmark. For the 130 faults in the Siemens Suite, Lib05 catches 34 faults and SOBER catches 52 faults when the developer examines at most 1% of the code. If the developer examines up to 20% of the code, SOBER locates 96 faults, which was well ahead of the state-of-the-art technique proposed in Lib05. We do not report the comparison of SOBER with Tarantula here, as relative superiority was not established because the metric of comparison was ambiguous. From the computational complexity viewpoint, SOBER is as efficient as Lib05: the computational complexity of both is O((n + m)·k + k·log(k)), where n and m are the numbers of correct and incorrect executions, respectively, and k is the number of instrumented predicates. Regarding the evaluation of SOBER's performance, the authors observed the following:

(1) SOBER's effectiveness can be partly attributed to the reasonably adequate test suite relative to the program size. The availability of large test suites helped to collect enough statistical evidence about the faults, which enabled SOBER to produce better fault localization.

(2) Each execution of the program was precisely labeled as passing or failing, using the fault-free version as the test oracle.

In reality, due to potentially high cost, adequate test suites are often not available. It is also very hard to construct the test oracle; owing to the large variation in program functionality, this usually depends on the judgement of human developers. Hence, all the analyses and comparisons above were made in a "perfect world" where adequate test suites and test oracles are simultaneously available. The authors also tested the effectiveness of SOBER in an imperfect world and reported that, although its performance deteriorated with partially available information, it still outperformed other methods such as Lib05. Being a statistical inference technique, the proposed method suffers from threats to validity. The first threat lies in the selection of the benchmark program: since the Siemens suite contains small-scale subject programs, absolute measures of effectiveness on this benchmark do not scale to arbitrarily large programs; at the same time, because the Siemens suite contains bugs representative of the real world, the relative performance comparison is statistically significant and credible. Also, the metrics used for comparison are by no means comprehensively fair, although to date they are by far the most objective metrics for bug localization.

3.4. Context-aware Statistical Debugging: From Bug Predictors to Faulty Control Flow Paths

3.4.1. Overview of the Proposed Method. In more recent work [Jiang and Su 2007], the authors further extended the state of the art in bug localization by associating bug locations with the relevant control flow paths in the program. They argue that many bugs may not be directly associated with isolated predicates; hence, they propose an approach that generates faulty control paths linking many predicates together to reveal bugs, thereby giving more contextual information for discovering and understanding them. They use feature selection (to accurately select failure-related predicates as bug predictors), clustering (to group correlated predicates), and traversal of control flow graphs (CFGs) to generate those paths.


To aid debugging, the proposed method, called pathgen, provides the following contextual information to the developer:

(1) The set of predicates related to the manifested bugs.

(2) The control flow paths of the program that connect the bug predictors.

The proposed approach consists of the following three steps:

(1) Instrumentation of the program source code and collection of execution profiles of certain program predicates through the Cooperative Bug Isolation (CBI) [Liblit 2007] framework.

(2) Identification of the program predicates that are the most likely bug predictors, and clustering of predicates that are correlated in terms of similar evaluation histories. The former is obtained through feature selection using Support Vector Machines (SVM) [Mitchell 1997] and Random Forests (RF) [Breiman 2001]; the latter is done via the k-means clustering algorithm.

(3) Identification of faulty control flow paths that connect bug predictors and correlated predicates by traversing the CFGs of the program, guided by branch prediction using the information obtained in the previous step.

In the first step, CBI lightly instruments a program with a statically fixed set of n predicates. An execution trace is recorded as an n-dimensional vector whose i-th value counts the number of times the i-th predicate is observed to be true during the execution; the vector also carries an annotation denoting the success or failure of the execution. It has been observed that a large number of such vectors can help to effectively understand program behavior. The work considers the same three kinds of predicates instrumented by CBI [Liblit 2007] (see Section 3.1 for details).
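A minimal sketch of this profile representation and the subsequent feature-selection step, using scikit-learn's linear SVM as a stand-in learner; the paper does not prescribe this library, and the toy data and the weight-based scoring only illustrate the general idea of ranking predicates by the weight a linear classifier assigns them.

import numpy as np
from sklearn.svm import LinearSVC

# Execution profiles: one row per run, one column per instrumented predicate;
# entry (r, i) counts how often predicate i was observed true in run r.
X = np.array([[3, 0, 1, 0],
              [2, 0, 0, 0],
              [0, 4, 2, 1],
              [0, 5, 1, 1]], dtype=float)
y = np.array([0, 0, 1, 1])   # 0 = successful run, 1 = failed run

# Feature selection: predicates with the largest positive weight push runs
# toward the "failed" class and are taken as candidate bug predictors.
clf = LinearSVC(C=1.0).fit(X, y)
scores = clf.coef_.ravel()
top_predictors = np.argsort(scores)[::-1][:2]
print("candidate bug predictors (predicate indices):", top_predictors)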

3.4.2. Overview of the Proposed Algorithm. The main components of the proposed bug localization framework are summarized below:

(1) Feature Selection: In this step an importance score is assigned to each feature (critical program predicates that can be accurate bug predictors) based on how likely it is to reveal bugs; the top k features with the highest scores are chosen as bug predictors. With SVMs, a linear classification function is used to calculate the importance score of the i-th feature, so that the highest scores are assigned to the predicates that have the biggest positive impact on program execution (i.e., toward causing an execution to fail). RFs use a different heuristic: if a predicate has a big impact on the classification function, then the predicted class label should also change when the data values for that predicate change. After the classification function is established, RFs randomly permute the data values for each predicate and use the classification function to classify the permuted data; the difference between the classification accuracies before and after permutation is then used as the importance score.

(2) Clustering: Unlike previous work, in which the profile of one program execution is viewed as one unit of comparison for bug localization, this work tries to discover relations between predicates across different executions. The work proposes a new horizontal view of the execution profiles: the profile of each predicate across all executions is viewed as one unit, and predicate correlations can be discovered by looking for "similar" horizontal units. Predicate correlations discovered from the horizontal views are interesting because they:

(a) reveal more information about the state of the program when it fails;

(b) disclose a more accurate execution path the failure may take;


(c) provide additional contextual information to understand the bugs better.

k-means clustering is used to discover predicate correlations such that:

(a) the distance between any two predicates in the same cluster is less than ε;

(b) the distance between the mass centers of any two clusters is greater than ε.

The authors chose the normalized Manhattan distance to measure the distance between any two predicates. The benefits of predicate clustering are twofold:

(a) predicates in different clusters are responsible for different bugs, whereas predicates in the same cluster are more likely related to the same bug;

(b) additional low-ranked predicates can be included for path construction, helping to produce more informative faulty control paths.

(3) Branch Prediction and Faulty Control Path Construction: In this step the predicates guide the traversal of the CFGs. Much information about the branch directions taken in failed runs can be inferred from the bug predictors and the execution profiles; such knowledge helps the algorithm to efficiently prune unlikely faulty paths during CFG traversal. Based on the values of CT (the predicate for a branch condition being true) and CF (the predicate for a branch condition being false), the branch of C to be traversed is decided according to the following four cases:

(CT == false ∧ CF == false) ⇒ C == neither (1)
(CT == false ∧ CF ≠ false) ⇒ C == false (2)
(CT ≠ false ∧ CF == false) ⇒ C == true (3)
(CT ≠ false ∧ CF ≠ false) ⇒ C == both (4)

pathgen is essentially a depth-first search algorithm: it chooses the bug predictor closest to the main entry of the program, starts the traversal from the function containing this predicate, and continues until there is no next node for the path. After all the faulty control paths are generated, they are post-processed to remove unnecessary portions and are ordered for inspection.
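A small Python sketch of the four branch-direction rules above; the inputs (whether the CT and CF predicates were ever observed true in the failing runs) are an illustrative simplification of the profile information the algorithm actually consults.

# Decide which direction(s) of branch C to follow during CFG traversal,
# given whether the CT ("condition true") and CF ("condition false")
# predicates were ever observed true in the failing runs.
def branch_direction(ct_observed_true, cf_observed_true):
    if not ct_observed_true and not cf_observed_true:
        return "neither"   # case (1): prune both sides
    if not ct_observed_true and cf_observed_true:
        return "false"     # case (2): follow only the false branch
    if ct_observed_true and not cf_observed_true:
        return "true"      # case (3): follow only the true branch
    return "both"          # case (4): keep exploring both sides

print(branch_direction(True, False))   # -> "true"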

3.4.3. Quantitative Evaluation of the Proposed Method. The authors performed a quantitative evaluation of the proposed approach based on the following criteria:

(1) how many bugs the approach can localize, and

(2) how much manual inspection of the code is required to localize the bugs.

Table I shows some experimental results on the Siemens Suite as reported in the paper. The method localized more bugs in the suite within the 1% code limit (versus 17 for Tarantula [Jones and Harrold 2005] and 11 for SOBER [Liu et al. 2005]). On the other hand, Tarantula and SOBER localized more bugs within the 20% code limit (75 for Tarantula and 96 for SOBER). Several factors need to be considered in such a direct comparison.

(1) Tarantula and SOBER focus on finding only stand-alone bug predictors.

(2) Tarantula instruments almost every statement in the program, whereas SOBER instruments certain program predicates using a different implementation of CBI.

(3) Unlike the proposed approach, Tarantula and SOBER compute their quantitative measures by a BFS over the program dependence graph instead of following the faulty control paths, which provide more meaningful and contextual information for developers to understand the localized bugs.

Like SOBER (see Section 3.3), this method also relies heavily on stochastic analysis and hence suffers from threats to validity. The most crucial factor is what kind of predicates, and how many, should be instrumented, since different predicates have different effects on bug localization. A second factor concerns the nature of different types of bugs.


Table I. Summary of the results for the Siemens test suite. Each column shows how many bugs can be discovered by inspecting up to that much code in each version of the programs in the suite.

Code examined (lines, or % of code per version):  ≤ 1 line | ≤ 2 lines | ≤ 5 lines | ≤ 10 lines | ≤ 1% | ≤ 2% | ≤ 4% | ≤ 10% | ≤ 20% | Total
Number of bugs:                                       11   |    31     |    52     |     64     |  38  |  45  |  54  |   67  |   73  | 79 out of 132

Regarding the second factor, (1) certain program locations are not directly instrumentable, and (2) certain bugs involve many locations through implicit data flows and hence are not understandable if only explicit control flows are presented. The third factor concerns the adequacy of the data set and the number of failed runs, the lack of which can cripple the proposed approach's ability to identify bug predictors and to construct potential faulty control flow paths.

3.5. HOLMES: Effective Statistical Debugging via Efficient Path Profiling

3.5.1. Overview of the Proposed Method. In [Chilimbi et al. 2009] the authors extended the ideas of [Liblit et al. 2005] and [Liblit 2007] by adopting a richer form of program profiling, namely path profiles, for bug isolation, whereas the previously mentioned works used predicate profiling. The main idea is to isolate bugs by finding paths that correlate with failures. The main reasons for opting for path profiling are as follows:

(1) In [Ball and Larus 1996] the authors showed the effectiveness of path profiling over point profiling such as predicate profiling. Furthermore, paths in a program carry much more information than a simple predicate and are hence more contextually informative for locating bugs [Jiang and Su 2007], [Arumuga Nainar et al. 2007].

(2) Although predicates can pinpoint the location of a bug, paths can additionally provide more context on how the buggy code was exercised, which can make the debugging process much easier.

(3) In some systems, path profile information is routinely collected for performance optimization, and hence the statistical analysis can be done at no additional cost.

Instead of sampling predicates as advocated in [Liblit et al. 2005], holmes relies on the observation that only small portions of a program are relevant to a given bug. It proposes an iterative, bug-directed, adaptive path profiling algorithm to lower execution time and space overheads. In this approach, the program runs without any instrumentation until it starts generating bug reports due to failures. Once holmes has collected a sufficient number of bug reports, it combines this information with static analysis of the source code to locate the portions of the code that are most likely to contain the root causes of the bugs observed in the field. It then identifies the set of functions, branches, and paths that need to be profiled in subsequent runs, instruments those portions of the code, and re-deploys the instrumented code in the field in order to collect more detailed information about the bug. holmes uses statistical analysis to identify paths that are strong predictors of the reported bugs. Based on the scores assigned to the paths, holmes may either report the root cause of the reported bugs to the developer or iterate over other parts of the program to find the root cause, by collecting more detailed information with an expanded search and re-instrumented programs.

3.5.2. Proposed Implementation of the Predicate Ranking Procedure. The authors propose two versions of the holmes algorithm: Non-Adaptive Debugging (NAD) and Adaptive Debugging (AD). NAD is essentially the same algorithm proposed by Liblit [Liblit et al. 2005] (see Section 3.1) but using path profiles instead of predicate profiles: the program is instrumented and path profiles are collected during executions.


Fig. 1. The holmes framework [Chilimbi et al. 2009]: bug reports and profiles from the production environment (running myapp.exe) feed the holmes back end, where the holmes profiling tools combine static analysis and statistical analysis of the source (myapp.cpp) to report root causes.

This information is aggregated across multiple runs through feedback reports. The feedback report for a single program execution contains two bits for each path: one bit to indicate that the path was observed (the start of the path was visited but the path was not necessarily executed) and another bit to indicate that the path was executed in that run; there is one more bit per execution to indicate whether it failed or succeeded. In the next step, each path is assigned a numeric Importance score following [Liblit et al. 2005], based on sensitivity and specificity (see Section 3.1 for the definitions of these two terms), and the top results are selected and presented to the programmer as the root causes.

To cope with the loss of information due to random sampling during the collection of predicate profiles as advocated in [Liblit et al. 2005], the authors propose an adaptive algorithm which achieves very low or unnoticeable runtime overhead in the production environment without considerably increasing the size of the executables due to instrumentation. A pictorial representation of the adaptive holmes framework is reproduced in Figure 1. The major steps of this algorithm are summarized below:

(1) Monitoring of the un-instrumented program to collect sufficient failure and bug reports.

(2) Use of static analysis of the source code and the bug reports to identify the portions of the code most likely relevant to the bug, hence avoiding the need for sparse random sampling.

(3) Calculation of a statistical model of the program source code following the NAD method; the model consists of bug predictors along with their Importance scores.


(4) If the model identifies strong bug predictors and explains all failures, the predictors are reported to the developers; if the model is inconclusive, the search for strong bug predictors is iteratively expanded with the help of static analysis until all failures are accounted for.

The implementation of holmes consists of the following steps:

— Bootstrapping: To find the root causes of bugs, holmes looks for the functions that are present near the point of failure in the stack trace and that appear one or more times in the stack traces; these functions are instrumented. To control the number of instrumented functions, holmes computes an effective score for each function, which is high if the function appears often on the stack traces and/or appears close to the location of failure.

— Iterative Profiling: This is an iterative procedure in which the error model of the source code is successively refined through data collection, statistical analysis, and selection of the functions that need further profiling and investigation.
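A minimal Python sketch of such a bootstrapping score, assuming a hypothetical weighting that rewards functions appearing frequently in failure stack traces and close to the failure point; the exact weighting used by holmes is not specified here, so this is only illustrative.

from collections import Counter

def effective_scores(stack_traces, proximity_weight=0.5):
    # stack_traces: list of stack traces from failing runs; each trace is a list
    # of function names ordered from the failure point outward (index 0 = closest).
    # The score grows with how often a function appears and how close it is to
    # the failure; this weighting is our own illustrative choice, not HOLMES's.
    scores = Counter()
    for trace in stack_traces:
        for depth, fn in enumerate(trace):
            scores[fn] += 1.0 + proximity_weight / (depth + 1)
    return scores.most_common()

traces = [["parse_header", "read_file", "main"],
          ["parse_header", "load_config", "main"]]
print(effective_scores(traces))   # parse_header ranks first: frequent and near the failure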

3.5.3. Quantitative Evaluation of the Proposed Method. The quantitative evaluation of holmes shows that path profiling can locate bugs in large real-world applications. In some cases, however, the authors could not demonstrate the superiority of their proposed method and suggested that users may need to consider more than one profiling scheme to obtain run-time information about a program. For detailed results we refer the reader to the paper [Chilimbi et al. 2009].

3.6. Fault Localization with Non-Parametric Program Behavior Model

This paper [Hu et al. 2008] firstly discusses workings of statistical fault localization tech-niques like Tarantula [Jones and Harrold 2005], predicate-based statistical fault localizationtechniques like SOBER [Liu et al. 2005] and CBI [Liblit 2007]. Statistic fault localizationtechniques locate faults by analyzing the statistics of dynamic program behaviors. Thepaper points out that the parametric approach to fault localization has its shortcomings.Tarantula and CBI don’t distinguish the number of times that a particular program elementhas been executed in a run. This method was calculated to be empirically less accurate.Although SOBER considers distributions of evaluation biases assembled from successfuland failed runs, it uses the central limit theorem to measure the behavioral difference of apredicate from successful and failed runs. Also, the probability that a predicate is evaluatedto be true, is wrongly assumed to be normally distributed [Liu et al. 2005]. It was provenby necessary research carried out in this paper that most predicates don’t have any knowndistribution [Hu et al. 2008].The authors propose a non-parametric approach to measuring the similarity of the fea-

The proposed model is based on predicate-based fault localization. A general hypothesis testing model, the Mann-Whitney test, is used. The Mann-Whitney test is a non-parametric hypothesis test which is used to determine the degree of difference between the spectra of program elements for successful and failed runs. This degree of difference is used to compute a ranking score, with predicates having a high ranking score being more suspicious for errors. As a motivating study, a Siemens suite program was used. Histograms of evaluation biases (x-axis) vs. the number of (successful or failed) runs that share the same value of evaluation bias (y-axis) were plotted. These numbers of successful or failed runs were produced by running the program over all the test cases in the Siemens suite. Several observations were made from the histograms [Hu et al. 2008]. Firstly, evaluation biases are scattered and not always close to 0 or 1. This reveals that simply checking whether a predicate is evaluated to be true or false may lead to information loss and inaccurate analysis. Secondly, histograms that have different distributions over successful and failed runs can be good indicators of fault-relevant predicates. Thirdly, it is observed that assuming a normal/Gaussian distribution for predicate evaluation bias is unrealistic. Unlike SOBER, where the absence of a predicate's execution in a run is


assigned an evaluation bias of 0.5 [Liu et al. 2005], in this approach no data is captured in the distribution if the predicate is not executed in a run. This is based on the fact that not all evaluation bias values are evenly distributed between 0 and 1; assigning 0.5 would not be a fair assumption, since often a large percentage of evaluation biases never take the value 0.5 [Hu et al. 2008]. The paper concludes from the study that a parametric hypothesis testing technique or the central limit theorem is not suitable for non-parametric distributions with small samples [Hu et al. 2008]. A non-parametric hypothesis ranking model is therefore used.

There are two sets of possible successful runs (Ts) and failed runs (Tf). For a random test case t drawn from the successful or failed runs, the evaluation bias of predicate P for the program execution over t is expressed by the random variable X. The idea is that if a predicate is relevant to a fault, the difference between the probability density functions of the evaluation biases of that predicate over the whole sets of possible successful and failed runs should be large. The larger this difference, the more relevant that predicate is in relation to the fault. Hence, a ranking function is defined as R(P) = Diff(f(X | θs), f(X | θf)). Without prior knowledge of these probability density functions, their sample sets can only be estimated from the test suite attached to the program. To evaluate the ranking function R(P), due to the properties of the data sets in question inferred from standard statistics textbooks such as [Lowry 1998], the authors suggest that a parametric approach is not a good option. Instead, a non-parametric hypothesis testing technique, the Mann-Whitney test, is proposed to measure the differences in the sampled distributions of evaluation biases. The main idea of this test is to transform the two sample sets of evaluation biases for a predicate, obtained from a number of successful and failed runs, into two rank-value sets and then measure the distance between the two rank-value sets. In the problem setting, the authors consider m successful and n failed runs; Vs and Vf are the respective sample sets of evaluation biases for predicate P. Given the sets of evaluation biases for a predicate P, the first step is to construct the union of the passed and failed EB sets. Ranks are assigned to this union set, with the lowest EB getting rank 1. The rank values are then mapped back to the corresponding elements of Vs and Vf, thus constructing two new sets Rs and Rf. After this, the Mann-Whitney test considers all possible combinations of rank values of the same size as Rs drawn from the union set, and a ranking function R(P) is used to derive the ranking score for predicate P. K is defined as the number of rank-value combinations and S is the sum of the rank values of all elements in Rs. Two more parameters are defined:

Kl, the number of combinations whose sum of ranks is less than that of Rs, and Kh, the number of combinations whose sum of ranks is larger than that of Rs. Finally, the ranking function is calculated as R(P) = -min(Kl/K, Kh/K). The lower the minimum, the more divergent the two sampled distributions; the higher the ranking score, the more relevant P is in relation to the fault. The above explanation is adapted from [Hu et al. 2008].

In terms of performance comparison, on plots of percentage of faults located vs. percentage of code examined, the non-parametric model is comparable to, if not better than, both SOBER and CBI, finding more faults per percentage of examined code [Hu et al. 2008]. To maintain uniformity and a fair comparison, T-score was used as the measure, and comparisons were made with SOBER and CBI for the same reason. The paper points out the lack of information about the effect of the platform used and about performance under other measures; this can be a pointer for further research.
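To make the ranking scheme concrete, the following Python sketch assumes that per-predicate evaluation-bias samples from successful and failed runs are already available, and uses SciPy's mannwhitneyu as a stand-in for the explicit enumeration of rank-value combinations described above; the negated p-value plays the role of the ranking score R(P). The data and names here are illustrative, not taken from [Hu et al. 2008].

from scipy.stats import mannwhitneyu

def rank_predicates(eb_samples):
    """eb_samples: {predicate: (biases_in_successful_runs, biases_in_failed_runs)}.
    Smaller Mann-Whitney p-values mean more divergent spectra, i.e. more suspicious."""
    scores = {}
    for pred, (succ, fail) in eb_samples.items():
        if not succ or not fail:          # skip predicates never evaluated on one side
            continue
        _, p = mannwhitneyu(succ, fail, alternative="two-sided")
        scores[pred] = -p                 # analogous to R(P): higher score, more fault-relevant
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: predicate "p1" behaves very differently in failed runs.
samples = {"p1": ([0.1, 0.2, 0.15, 0.1], [0.9, 0.85, 0.95]),
           "p2": ([0.5, 0.6, 0.55, 0.5], [0.5, 0.6, 0.52])}
for pred, score in rank_predicates(samples):
    print(pred, round(score, 4))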

3.7. A Quest between Parametric and Non-parametric Statistical Bug Localization Techniques

In the paper [Zhang et al. 2009] the authors present an empirical comparison between the two schools of thought in software bug localization: one is the use of non-parametric analysis to pinpoint fault-relevant predicates as explained in [Hu et al. 2008] (see Section 3.6), and the other is the use of parametric analysis of certain program feature spectra, such as the evaluation bias of


Fig. 2. Evaluation Bias Distribution of predicate "(int *)(match + (best_len - 1)) = (int)scan_end1" from program Grep V15. [Figure: two histograms, "Plots for Successful Runs" and "Plots for Failure Runs"; x-axis: Range of Evaluation Bias (range points 1-12); y-axis: Fraction of Runs in each Range.]

each predicate as described in [Liblit et al. 2005], [Liblit 2007], [Liu et al. 2005], [Liu et al. 2006], [Jiang and Su 2007] and [Chilimbi et al. 2009] (see Sections 3.1, 3.3, 3.4 and 3.5). The three main questions examined in the paper are as follows:

(1) Can the feature spectra of program elements (especially evaluation biases of predicates) be safely considered normally distributed, so that parametric fault-localization techniques can be soundly and powerfully applied?
(2) To what extent can the program feature spectra of the most fault-relevant predicates be considered normally distributed?
(3) Can the effectiveness of non-parametric fault-localization techniques really be decoupled from the distribution shape of the program spectra?

To investigate these questions, the authors conducted a normality test on the evaluation bias of each predicate in each faulty version of the Siemens Test Suite [SIR 2005] to empirically validate the normality assumption. We independently calculated the Evaluation Bias (EB) of some predicates of the program Grep and of Replace V1 (a program in the Siemens suite). We plotted the distributions of the EBs as histograms in Figures 2, 3 and 4, and used a one-sample Kolmogorov-Smirnov test to check normality. We divided the range of EB (i.e., EB ∈ [0, 1]) into 12 discrete ranges denoted as range points 1, 2, . . ., 12. Each range is 0.1 wide except the two terminal range points, 1 and 12, which correspond to EB = 0.0 and EB = 1.0 respectively. As the figures show, the normality assumption of the parametric techniques does not hold.
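As an illustration of this normality check, the following Python sketch bins a vector of per-run evaluation biases into the 12 range points described above and applies a one-sample Kolmogorov-Smirnov test against a fitted normal distribution; the binning boundaries and names are our own choices, not prescribed by [Zhang et al. 2009].

import numpy as np
from scipy.stats import kstest

def eb_histogram(biases):
    """Bin evaluation biases into 12 range points: 1 -> EB == 0.0, 12 -> EB == 1.0,
    2..11 -> ten 0.1-wide intervals covering (0, 1)."""
    counts = np.zeros(12)
    for eb in biases:
        if eb == 0.0:
            counts[0] += 1
        elif eb == 1.0:
            counts[11] += 1
        else:
            counts[1 + int(eb * 10)] += 1
    return counts / len(biases)           # fraction of runs in each range

def is_normal(biases, alpha=0.05):
    """One-sample K-S test of the biases against a fitted normal distribution;
    returns True if normality cannot be rejected at level alpha."""
    biases = np.asarray(biases, dtype=float)
    mu, sigma = biases.mean(), biases.std(ddof=1)
    if sigma == 0:                        # degenerate sample: clearly not normal
        return False
    _, p = kstest(biases, "norm", args=(mu, sigma))
    return p > alpha

ebs = np.random.default_rng(0).uniform(0, 1, 200)   # hypothetical evaluation biases
print(eb_histogram(ebs).round(2), is_normal(ebs))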

To investigate Question 1 above, the authors conducted normality tests on the evaluation biases of the 5778 predicates from the 111 faulty versions of the Siemens programs. The authors


Fig. 3. Evaluation Bias Distribution of predicate "(int *)match = (int *)scan" from program Grep V16. [Figure: two histograms; x-axis: Range of Evaluation Bias (range points 1-12); y-axis: Fraction of Runs in each Range.]

Table II. Student’s t-test on different thresholds for H1. [Zhang et al. 2009]

Threshold Value 0.000-0.500 0.576 0.584 0.587 0.591 0.600-1.000(θ1)

p-Value 1.000 0.500 0.100 0.050 0.010 ≤ 0.0001

formulate the following null hypothesis:

H1: “The mean degree of normality for the tested predicates is greater than a given threshold θ1.”

where θ1 is the threshold for the significance level of acceptance of the null hypothesis. The authors used Student's t-test to calculate the p-value. The empirical result is reproduced in Table II. For a statistically significant (in rejecting the null hypothesis) and conservative estimate of the normality of the program spectra, the authors set θ1 > 0.6. Clearly, normal distributions are not common for the evaluation biases of predicates.
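The following Python sketch illustrates the flavor of this test under the assumption that a per-predicate degree-of-normality score has already been computed; scipy.stats.ttest_1samp with a one-sided alternative stands in for the Student's t-test used by the authors, and the threshold and data are purely illustrative.

from scipy.stats import ttest_1samp

def test_mean_normality(normality_degrees, theta=0.6, alpha=0.05):
    """One-sided test of whether the mean degree of normality exceeds theta.
    Returns the p-value and whether the claim is supported at level alpha."""
    _, p = ttest_1samp(normality_degrees, popmean=theta, alternative="greater")
    return p, p < alpha

# Hypothetical degrees of normality for a handful of predicates.
degrees = [0.31, 0.45, 0.52, 0.28, 0.60, 0.49]
print(test_mean_normality(degrees, theta=0.6))   # large p-value: the claim is not supported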

To investigate Question 2 above, the authors conducted normality tests on the evaluation biases of the most fault-relevant predicates from the 111 versions of the Siemens suite. The authors formulate the following null hypothesis:

H2: “The mean degree of normality for the most fault-relevant predicates under test is greater than a given threshold θ2.”

From Table III it is evident that a normal distribution is not common for the evaluation biases of the most fault-relevant predicates either. This experimental result poses a serious threat to the parametric analyses presented in [Liu et al. 2005], [Liu et al. 2006], [Chilimbi et al. 2009] and [Liblit et al. 2005]: since the precondition, i.e., the assumption of normally distributed spectra, is not satisfied, the conclusions that rest on that precondition cannot be established.


Fig. 4. Evaluation Bias Distribution of predicate "m ≥ 0" from program Replace V17. [Figure: two histograms, "Plots for Successful Runs" and "Plots for Failure Runs"; x-axis: Range of Evaluation Bias (range points 1-12); y-axis: Fraction of Runs in each Range.]

Table III. Student’s t-test on different thresholds for H2. [Zhang et al. 2009]

Threshold Value 0.000-0.400 0.561 0.621 0.638 0.671 0.700-1.000(θ2)

p-Value 1.000 0.500 0.100 0.050 0.010 ≤ 0.0001

To investigate Question 3, the authors conducted a correlation test between the p-values of the most fault-relevant predicates and the results of the non-parametric fault-localization technique [Hu et al. 2008]. They observed that the normality of the evaluation biases of the most fault-relevant predicate does not strongly correlate with the effectiveness of the non-parametric fault-localization technique in locating faults, rendering the non-parametric hypothesis testing model highly robust for fault localization. (This is one of the main reasons for us to choose non-parametric techniques in our project work.)
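A minimal sketch of such a correlation check follows, assuming that for each faulty version we have the normality p-value of its most fault-relevant predicate and the effectiveness achieved by the non-parametric technique on that version; Pearson correlation is used here only as one reasonable choice, since the exact correlation statistic is not named above. All data are hypothetical.

from scipy.stats import pearsonr

# Hypothetical per-version data: normality p-values of the most fault-relevant
# predicate vs. effectiveness of the non-parametric technique on that version.
normality_p = [0.01, 0.20, 0.03, 0.45, 0.08, 0.30]
effectiveness = [0.82, 0.79, 0.85, 0.80, 0.84, 0.81]

r, p = pearsonr(normality_p, effectiveness)
print(f"correlation r = {r:.2f} (p = {p:.2f})")
# A weak correlation indicates that the technique's effectiveness does not depend on
# whether the predicate's evaluation biases are normally distributed.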

3.8. Fault Localization through Evaluation Sequences

Predicate-based statistical fault-localization techniques treat predicates as atomic units, which masks out useful statistics on dynamic program behavior by ignoring information on evaluation sequences. Fault-relevant predicates are found by contrasting the statistics of the evaluation results of individual predicates between failed and successful runs [Zhang et al. 2010]. Certain dynamic features of program statements display high sensitivity to the difference between the sets of failed and successful runs. The key concepts behind this dynamic analysis approach are that (1) a set of program features is used to measure sensitivity, and (2) in order to compare the sensitivity values, there must be a function that ranks those values in order.


In this paper, the authors differentiate short-circuit evaluations of individual predicates on individual program statements, producing one set of evaluation sequences per predicate rather than treating the whole predicate as one unit. Predicates are semantically modeled as Boolean expressions. The effectiveness of using these sequences to locate faults is judged by comparing existing predicate-based techniques with and without evaluation sequences. Four of the UNIX utility programs and the Siemens program suite were used for the study [Zhang et al. 2010].

A study was conducted to show how the distribution of evaluation biases at the evaluation-sequence level could be used to pinpoint a fault-relevant predicate [Zhang et al. 2010]. The evaluation bias was calculated in the same way as in [Liu et al. 2005], as EB = nt / (nt + nf), where nt and nf have the same meaning as defined in Section 3.3. An experiment was conducted by the authors to find whether short-circuit rules can be useful for fault localization. The evaluation sequences of the original code fragments and of a faulty version of the code fragment (with an additional error-causing individual condition) were tested, and the count of each evaluation sequence for each test case was recorded. On plotting graphs to compare the distributions of evaluation biases for evaluation sequences, the results showed that for some of the evaluation sequences the distribution of biases on passed test cases was drastically different from that of the failure-causing ones [Zhang et al. 2010].
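To make the evaluation-sequence idea concrete, the following Python sketch considers a compound predicate a && b whose atomic conditions are logged at each evaluation; under short-circuit evaluation its possible sequences are "a false (b skipped)", "a true, b false" and "a true, b true", and each sequence is treated as its own predicate with a per-run evaluation bias EB = nt / (nt + nf). The encoding and names are ours, not from [Zhang et al. 2010].

from collections import Counter

def evaluation_sequence(a, b):
    """Short-circuit evaluation of 'a and b': record which atomic conditions
    were actually evaluated, and with what outcome."""
    return ("a=False",) if not a else ("a=True", f"b={b}")

def per_run_eb(evaluations):
    """Treat each evaluation sequence as its own predicate: within one run,
    EB(seq) = (#times seq occurred) / (#times the compound predicate was evaluated)."""
    counts = Counter(evaluation_sequence(a, b) for a, b in evaluations)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}

# One hypothetical run that evaluates the compound predicate four times.
run = [(True, True), (True, False), (False, None), (True, False)]
print(per_run_eb(run))

Collecting these per-run biases separately over passed and failed runs yields the fine-grained spectra that the base techniques then analyze.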

In the paper, the use of a predicate-based statistical fault-localization technique is referred to as the base technique, and the use of evaluation sequences with predicate execution counts is called the fine-grained version of the base technique. A comparison between predicate-based base techniques (like SOBER and CBI) and fine-grained versions of the base techniques (Debugging through Evaluation Sequences, DES) was made using T-score as a measure [Zhang et al. 2010]. The T-score measure objectively assesses the quality of the proposed ranking lists of fault-relevant predicates and the performance of fault-localization techniques; within the scope of this research paper, it is also used to compare different fault-localization approaches in a controlled experiment. To study the impact of short-circuit evaluations and evaluation sequences on statistical fault localization, the fine-grained view is incorporated into the base technique to provide execution statistics that may help statistical fault-localization techniques identify the fault location more accurately. Base techniques sample predicates in a subject program to collect run-time execution statistics and rank the fault relevance of predicates. T-score is used to assess the effectiveness (percentage of code examined to detect a fault) of the selected set of predicates [Liu et al. 2005], [Liblit 2007]. All potential evaluation sequences for each predicate applicable to a base technique are identified. Evaluation outcomes of atomic conditions in the predicates are collected by inserting probes at the predicate locations. Based on the evaluation outcomes of the atomic conditions, the evaluation sequences for each predicate are determined. The number of times each evaluation sequence is exercised by each test case is counted. In order to rank the fine-grained predicates, each evaluation sequence is treated as a distinct predicate in the base technique. The ranked evaluation sequences then need to be mapped back to their respective predicates, for easier identification of the program fault in the code. The fault-localization capabilities of the various evaluation sequences of the same Boolean expression are not identical: because of short-circuit evaluations of Boolean expressions during program execution, different evaluation sequences of a predicate may produce different resultant values. This fine-grained approach is called Debugging through Evaluation Sequences (DES) [Zhang et al. 2010]. The study was done using the small Siemens suite programs (126 versions), in which

faults are seeded manually, and medium-sized real-life UNIX utility programs (110 versions) with real and seeded faults as additional subjects to strengthen the external validity of the experiments. Statistics of the subject programs and test pools are reported, such as executable lines of code, number of faulty versions, number of test cases in the test pool, average number of Boolean


expressions, average percentage of Boolean expression statements with respect to all statements, and average percentage of compound Boolean expressions with respect to all Boolean expressions [Zhang et al. 2010]. Using DES, a set of instrumented versions of the subject programs, including both the

original and faulty versions, is produced. The execution counts for the evaluation sequences are calculated based on the instrumentation log. The number of faults successfully identified through the examined percentage of code is then computed at different T-scores, and the T-score results are plotted for each program as graphs of "% of code examined vs. % of faults located". DES-enabled SOBER and DES-enabled CBI both achieve better average fault-localization results per percentage of code examined than base SOBER and CBI for each program. It was observed that in 8 out of 11 programs, the mean effectiveness of the DES-enabled techniques outperforms that of the respective base techniques [Zhang et al. 2010].

3.9. A Crosstab-based Statistical Method for Effective Fault Localization

In the crosstab-based statistical method [Wong et al. 2008], the coverage information of each executable statement and the execution result are recorded with respect to each test case. A crosstab is constructed for each executable statement and a statistic is computed to determine the suspiciousness of the corresponding statement. The executable statements are then ranked in order of decreasing suspiciousness, with the more suspicious statements more likely to contain errors [Wong et al. 2008]. This method was tested on programs from the Siemens suite, the Space program and the Unix suite. A revised version of χSuds was used to collect a runtime trace of program execution to find the difference in execution between failed and successful runs. This information was used to find how many successful and failed tests cover each program statement. The success or failure of an execution was determined by comparing the outputs of the faulty version and the correct version of the same program [Wong et al. 2008].

A crosstab analysis is used to study the relationship between two or more categorical variables. The following is an understanding of the proposed crosstab analysis method, derived from [Wong et al. 2008]. A crosstab is constructed for each program statement, with two column-wise categories, "covered" and "not covered", and two row-wise categories, "successful" and "failed" (the notation is summarized in Tables IV and V). A statement in the program being debugged is denoted ω. A hypothesis test is conducted on each crosstab to check the dependency relationship, the null hypothesis being that the program execution result is independent of the coverage of statement ω. A Chi-square test is used to determine whether this hypothesis should be rejected. Under the null hypothesis, the statistic χ2(ω) has approximately a Chi-square distribution. The corresponding Chi-square critical value χ2σ is obtained from the Chi-square distribution table at the level of significance σ. The null hypothesis is rejected when χ2(ω) > χ2σ, meaning that the execution result is dependent on the coverage of ω. Dependency indicates a higher association among the variables, and independency a lower one. In order to quantify the degree of association between the execution result and statement coverage, instead of just having dependency or independency, a contingency coefficient M(ω) is used. It is defined

as M(ω) = χ2(ω) / (N √((row − 1)(col − 1))), where row and col denote the numbers of rows and columns of the crosstab and N is the total number of test cases. This coefficient takes values in [0, 1], where the lower limit 0 means complete independence and the upper limit 1 means complete association between the execution result and the statement coverage. In order to determine whether it is the failed or the successful execution result that is more associated with the coverage of a statement, the percentages of all failed and of all successful tests that execute the statement are computed for each statement. If the product of "number of failed test cases covering ω" and "number of successful test cases not covering ω" exceeds the product of "number of successful test cases covering ω" and "number of failed test cases not covering ω", then the coverage of statement ω is


Table IV. Used Notations

N         Total number of test cases
NF        Total number of failed test cases
NS        Total number of successful test cases
NC(ω)     Number of test cases covering ω
NCF(ω)    Number of failed test cases covering ω
NCS(ω)    Number of successful test cases covering ω
NU(ω)     Number of test cases not covering ω
NUF(ω)    Number of failed test cases not covering ω
NUS(ω)    Number of successful test cases not covering ω

Table V. Crosstab for Each Statement

                         ω is covered    ω is not covered    Σ
Successful Executions    NCS(ω)          NUS(ω)              NS
Failed Executions        NCF(ω)          NUF(ω)              NF
Σ                        NC(ω)           NU(ω)               N

positively associated with the failed execution [Wong et al. 2008]. This implies that if the percentage of failed tests covering ω is greater than the percentage of successful tests covering ω, then the association between the failed execution and the coverage of ω is higher than that between the successful execution and the coverage of ω. Another statistic, φ(ω), is defined as the ratio of these two percentages, i.e., φ(ω) = (NCF(ω)/NF) / (NCS(ω)/NS). When this value is equal to 1, the execution result is completely independent of the coverage of ω; in this case, the coverage of ω makes the same contribution to both the failed and the successful execution results. If φ(ω) > 1, the coverage of ω is more associated with the failed execution, otherwise with the successful execution. The program statements can be classified into five classes depending on the φ(ω) and χ2(ω) values [Wong et al. 2008]:

(1) Statements with φ > 1 and χ2 > χ2σ have a high degree of association between their coverage and the failed execution result.
(2) Statements with φ > 1 and χ2 ≤ χ2σ have a low degree of association between their coverage and the failed execution result.
(3) Statements with φ < 1 and χ2 > χ2σ have a high degree of association between their coverage and the successful execution result.
(4) Statements with φ < 1 and χ2 ≤ χ2σ have a low degree of association between their coverage and the successful execution result.
(5) Statements with φ = 1, whose coverage is independent of the execution result.

The classes, in decreasing order of likelihood of containing bugs, are ordered 1 > 2 > 5 > 4 > 3; a sketch of the per-statement computation is given below.
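The following Python sketch pulls the pieces of this section together for a single statement, assuming the four crosstab counts of Table V are available; it computes χ2(ω), the contingency coefficient M(ω) as reconstructed above, the ratio φ(ω), and the resulting class. The critical value is taken at significance level σ = 0.05, the names are ours, and the exact normalization of M(ω) should be checked against [Wong et al. 2008].

from scipy.stats import chi2

def crosstab_stats(ncf, ncs, nuf, nus, sigma=0.05):
    """ncf/ncs: failed/successful tests covering the statement;
    nuf/nus: failed/successful tests not covering it (assumes NF, NS > 0)."""
    nf, ns = ncf + nuf, ncs + nus
    nc, nu = ncf + ncs, nuf + nus
    n = nf + ns
    # Pearson chi-square statistic of the 2x2 crosstab (expected counts under independence).
    chi_sq = 0.0
    for observed, rtotal, ctotal in [(ncs, ns, nc), (nus, ns, nu),
                                     (ncf, nf, nc), (nuf, nf, nu)]:
        expected = rtotal * ctotal / n
        if expected > 0:
            chi_sq += (observed - expected) ** 2 / expected
    critical = chi2.ppf(1 - sigma, df=1)          # (row-1)(col-1) = 1 degree of freedom
    m = chi_sq / n                                # contingency coefficient for a 2x2 crosstab
    phi = (ncf / nf) / (ncs / ns) if ncs > 0 else float("inf")
    if phi > 1:
        cls = 1 if chi_sq > critical else 2
    elif phi < 1:
        cls = 3 if chi_sq > critical else 4
    else:
        cls = 5
    return chi_sq, m, phi, cls

# Hypothetical statement covered by most failed tests but few successful ones (class 1).
print(crosstab_stats(ncf=18, ncs=5, nuf=2, nus=75))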

Three studies were conducted using the Siemens suite, the Space program and the Unix suite to demonstrate the feasibility of the crosstab-based method [Wong et al. 2008]. Results of the crosstab method were compared to those of the Tarantula method in terms of effectiveness and efficiency, and it was shown that with the crosstab-based method a smaller percentage of code has to be examined before the first faulty statement is found (hence it is more effective) [Wong et al. 2008]. In terms of efficiency, there was not much difference between the two methods. The programs were instrumented, tests were rerun, coverage information was collected, the statements executed by each test were identified, and each execution was determined to be a success or a failure. The instrumentation was done using a revised version of χSuds, which collected a runtime trace correctly even if a program execution crashed due to a segmentation fault. The success or failure of an execution was determined by comparing the outputs of the faulty version and the correct version.


4. CONCLUSION

In this survey paper, we have discussed some of the most prominent frameworks for predicate-based and path-profile-based statistical debugging techniques. Some of these techniques, like CBI, use a sparse sampling method to collect predicate profiles from program traces. Some methods, like Holmes, use a statistical-analysis-guided static analysis to narrow down the set of predicates that are relevant to a particular fault. On the other hand, techniques like SOBER assume that the evaluation biases of passing and failing runs over program feature spectra come from an underlying normal distribution. In recent studies, researchers have shown with empirical results that this normality assumption does not hold for most program predicates. They advocate the use of a more robust non-parametric hypothesis test to find fault-relevant predicates, as it does not assume any underlying probability distribution. In this survey paper, we did not discuss invariant-based bug localization techniques like DySy, DAIKON and PRECIS / PREAMBL, as they do not fall within the scope of our present discussion. They use statistical analysis as part of the invariant generation procedure; the generated invariants are then deployed in the source code to detect run-time bugs. Interested readers may see the following for more information on these techniques: DySy [Csallner et al. 2008], DAIKON [Ernst et al. 2007] and PRECIS / PREAMBL [Sagdeo et al. 2011], [Sagdeo 2012]. In our work we use non-parametric analysis to find the most fault-relevant predicates. Most of the prior results were reported on the Siemens benchmark. We try to establish the robustness of the non-parametric method on other open-source programs such as grep, gzip (two well-known command-line utilities in Linux) and EXIF (an implementation of an image compression algorithm).

Acknowledgements

We express our gratitude to Ming Ji for pointing us to the SOBER framework, from which we started this project work. We also thank Prof. Jiawei Han for encouraging us to pursue this project. We are indebted to Shibamouli Lahiri for his numerous suggestions regarding this work. We are also thankful to Prof. Ben Liblit for generously sharing the predicate profile data from the CBI project. Finally, we express our gratitude to Dr. Manish Gupta for helping us to read the datasets.

Note: Sections 3.1, 3.3, 3.4, 3.5 and 3.7 are surveyed by Debjit Pal. Sections 3.2, 3.6, 3.8 and 3.9 are surveyed by Rizwan Mohiuddin.

REFERENCES

H. Agrawal, J.R. Horgan, S. London, and W.E. Wong. 1995. Fault localization using execution slices and dataflow tests. In Software Reliability Engineering, 1995. Proceedings., Sixth International Symposium on. 143–151. DOI:http://dx.doi.org/10.1109/ISSRE.1995.497652

Piramanayagam Arumuga Nainar, Ting Chen, Jake Rosin, and Ben Liblit. 2007. Statistical Debugging using Compound Boolean Predicates. In Proceedings of the 2007 international symposium on Software testing and analysis (ISSTA '07). ACM, New York, NY, USA, 5–15. DOI:http://dx.doi.org/10.1145/1273463.1273467

Thomas Ball and James R. Larus. 1996. Efficient Path Profiling. In Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture (MICRO 29). IEEE Computer Society, Washington, DC, USA, 46–57. http://dl.acm.org/citation.cfm?id=243846.243857

Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (Oct. 2001), 5–32. DOI:http://dx.doi.org/10.1023/A:1010933404324

Yuriy Brun and Michael D. Ernst. 2004. Finding Latent Code Errors via Machine Learning over Program Executions. In Proceedings of the 26th International Conference on Software Engineering (ICSE '04). IEEE Computer Society, Washington, DC, USA, 480–490. http://dl.acm.org/citation.cfm?id=998675.999452


Trishul M. Chilimbi, Ben Liblit, Krishna K. Mehra, Aditya V. Nori, and Kapil Vaswani. 2009. HOLMES: Effective Statistical Debugging via Efficient Path Profiling. In 31st International Conference on Software Engineering, ICSE, May 16-24, Vancouver, Canada, Proceedings.

Edmund M. Clarke, Jr., Orna Grumberg, and Doron A. Peled. 1999. Model Checking. MIT Press, Cambridge, MA, USA.

Patrick Cousot and Radhia Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January, Robert M. Graham, Michael A. Harrison, and Ravi Sethi (Eds.). ACM, 238–252.

Christoph Csallner, Nikolai Tillmann, and Yannis Smaragdakis. 2008. DySy: dynamic symbolic execution for invariant inference. In Proceedings of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 281–290. DOI:http://dx.doi.org/10.1145/1368088.1368127

Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. 2007. The Daikon system for dynamic detection of likely invariants. Sci. Comput. Program. 69, 1-3 (Dec. 2007), 35–45. DOI:http://dx.doi.org/10.1016/j.scico.2007.01.015

Halting Problem. 1930. Halting Problem. http://en.wikipedia.org/wiki/Halting_problem. (1930).

Mary Jean Harrold, Gregg Rothermel, Kent Sayre, Rui Wu, and Liu Yi. 2000. An Empirical Investigation of The Relationship Between Spectra Differences and Regression Faults. Software Testing, Verification and Reliability 10, 3 (2000), 171–194. DOI:http://dx.doi.org/10.1002/1099-1689(200009)10:3〈171::AID-STVR209〉3.0.CO;2-J

Thomas A. Henzinger, Ranjit Jhala, Rupak Majumder, and Kenneth L. McMillan. 2002. Blast. http://goto.ucsd.edu/∼rjhala/blast.html. (2002).

Peifeng Hu, Zhenyu Zhang, Wing Kwong Chan, and T. H. Tse. 2008. Fault Localization with Non-parametric Program Behavior Model. In Proceedings of the Eighth International Conference on Quality Software, QSIC 2008, 12-13 August, Oxford, UK, Hong Zhu (Ed.).

Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments of the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. In Proceedings of the 16th international conference on Software engineering (ICSE '94). IEEE Computer Society Press, Los Alamitos, CA, USA, 191–200. http://dl.acm.org/citation.cfm?id=257734.257766

Lingxiao Jiang and Zhendong Su. 2007. Context-aware Statistical Debugging: From Bug Predictors to Faulty Control Flow Paths. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering (ASE '07). ACM, New York, NY, USA, 184–193. DOI:http://dx.doi.org/10.1145/1321631.1321660

James A. Jones and Mary Jean Harrold. 2005. Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering (ASE '05). ACM, New York, NY, USA, 273–282. DOI:http://dx.doi.org/10.1145/1101908.1101949

James A. Jones, Mary Jean Harrold, and John Stasko. 2002. Visualization of Test Information to Assist Fault Localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE '02). ACM, New York, NY, USA, 467–477. DOI:http://dx.doi.org/10.1145/581339.581397

Chris Lattner. 2007. Clang: a C language family frontend for LLVM. http://clang.llvm.org/. (2007).

Ben Liblit. 2007. Cooperative Bug Isolation Project. http://research.cs.wisc.edu/cbi/. (2007).

Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. 2003. Bug Isolation via Remote Program Sampling. SIGPLAN Not. 38, 5 (May 2003), 141–154. DOI:http://dx.doi.org/10.1145/780822.781148

Ben Liblit, Mayur Naik, Alice X. Zheng, Alexander Aiken, and Michael I. Jordan. 2005. Scalable Statistical Bug Isolation. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15. ACM, 15–26.

Chao Liu, Long Fei, Xifeng Yan, Jiawei Han, and Samuel P. Midkiff. 2006. Statistical Debugging: A Hypothesis Testing-Based Approach. IEEE Transactions on Software Engineering 32, 10 (2006), 831–848.

Chao Liu, Xifeng Yan, Long Fei, Jiawei Han, and Samuel P. Midkiff. 2005. SOBER: Statistical Model-Based Bug Localization. In Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2005, Lisbon, Portugal, September 5-9, Michel Wermelinger and Harald Gall (Eds.). ACM, 286–295.

Richard Lowry. 1998. Concepts and applications of inferential statistics. R. Lowry.

Microsoft Research. 2010. SLAM Toolkit. http://research.microsoft.com/en-us/projects/slam/. (2010).

Thomas M. Mitchell. 1997. Machine Learning (1 ed.). McGraw-Hill, Inc., New York, NY, USA.


Madanlal Musuvathi, David Y. W. Park, Andy Chou, Dawson R. Engler, and David L. Dill. 2002. CMC: A Pragmatic Approach to Model Checking Real Code. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 75–88. DOI:http://dx.doi.org/10.1145/844128.844136

Alessandro Orso, James Jones, and Mary Jean Harrold. 2003. Visualization of program-execution data for deployed software. In Proceedings of the 2003 ACM symposium on Software visualization (SoftVis '03). ACM, New York, NY, USA, 67–ff. DOI:http://dx.doi.org/10.1145/774833.774843

Manos Renieris and Steven P. Reiss. 2003. Fault Localization With Nearest Neighbor Queries. In 18th IEEE International Conference on Automated Software Engineering, 6-10 October, Montreal, Canada. IEEE Computer Society, 30–39.

Parth Sagdeo, Viraj Athavale, Sumant Kowshik, and Shobha Vasudevan. 2011. PRECIS: Inferring Invariants using Program Path Guided Clustering. In 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lawrence, KS, USA, November 6-10, Perry Alexander, Corina S. Pasareanu, and John G. Hosking (Eds.). IEEE, 532–535.

Parth Vivek Sagdeo. 2012. PREAMBL. https://wiki.engr.illinois.edu/display/precis/. (2012).

SIR. 2005. Siemens Test Suite (Software Artifact Repository). http://sir.unl.edu/portal/index.php. (2005).

Software Bugs. 2009. A Collection of Well-Known Software Failures. http://www.cse.lehigh.edu/∼gtan/bug/softwarebug.html. (2009).

Iris Vessey. 1985. Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies 23, 5 (1985), 459–494. DOI:http://dx.doi.org/10.1016/S0020-7373(85)80054-7

Willem Visser, Klaus Havelund, Guillaume Brat, Seungjoon Park, and Flavio Lerda. 2003. Model Checking Programs. Automated Software Engg. 10, 2 (April 2003), 203–232. DOI:http://dx.doi.org/10.1023/A:1022920129859

W. Eric Wong, Tingting Wei, Yu Qi, and Lei Zhao. 2008. A Crosstab-based Statistical Method for Effective Fault Localization. In First International Conference on Software Testing, Verification, and Validation, ICST 2008, Lillehammer, Norway, April 9-11. IEEE Computer Society, 42–51.

Andreas Zeller. 2002. Isolating cause-effect chains from computer programs. In Proceedings of the 10th ACM SIGSOFT symposium on Foundations of software engineering (SIGSOFT '02/FSE-10). ACM, New York, NY, USA, 1–10. DOI:http://dx.doi.org/10.1145/587051.587053

Zhenyu Zhang, W. K. Chan, T. H. Tse, Peifeng Hu, and Xinming Wang. 2009. Is Non-parametric Hypothesis Testing Model Robust for Statistical Fault Localization? Information & Software Technology 51, 11 (2009), 1573–1585.

Zhenyu Zhang, W. K. Chan, T. H. Tse, Y. T. Yu, and Peifeng Hu. 2011. Non-parametric Statistical Fault Localization. Journal of Systems and Software 84, 6 (2011), 885–905.

Zhenyu Zhang, Bo Jiang, W. K. Chan, T. H. Tse, and Xinming Wang. 2010. Fault Localization Through Evaluation Sequences. Journal of Systems and Software 83, 2 (2010), 174–187.
