
DSpot: Test Amplification for Automatic Assessment of Computational Diversity

Benoit Baudry°, Simon Allier°, Marcelino Rodriguez-Cancio††,° and Martin Monperrus†,°

° Inria, France    †† University of Rennes 1, France    † University of Lille, France

contact: benoit.baudry@inria.fr

ABSTRACT

Context: Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security.

Objective: We aim at proposing an approach for automatically assessing the presence of computational diversity. In this work, computationally diverse variants are defined as (i) sharing the same API, (ii) behaving the same according to an input-output based specification (a test suite), and (iii) exhibiting observable differences when they run outside the specified input space.

Method: Our technique relies on test amplification. We propose source code transformations on test cases to explore the input domain and systematically sense the observation domain. We quantify computational diversity as the dissimilarity between observations on inputs that are outside the specified domain.

Results: We run our experiments on 472 variants of 7 classes taken from large, open-source, thoroughly tested Java projects. Our test amplification multiplies by ten the number of input points in the test suite and is effective at detecting software diversity.

Conclusion: The key insights of this study are: the systematic exploration of the observable output space of a class provides new insights about its degree of encapsulation; the behavioral diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.).

KEYWORDS: software diversity, software testing, test amplification, dynamic analysis.

1. INTRODUCTION

Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security [1, 5, 17, 6]. Consequently, it is of utmost importance to have systematic and efficient procedures to determine whether a set of programs are computationally diverse.

Many works have tried to tackle this challenge, using input generation [12], static analysis [13], or evolutionary testing [25] and [3] (concurrent with this work). Yet, having a reliable detection of computational diversity for large object-oriented programs is still a challenging endeavor.

In this paper, we propose an approach, called DSpot¹, for assessing the presence of computational diversity, i.e., for determining if a set of program variants exhibit different behaviors under certain conditions. DSpot takes as input a test suite and a set of n program variants. The n variants have the same application programming interface (API) and they all pass the same test suite (i.e., they comply with the same executable specification). DSpot consists of two steps: (i) automatically transforming the test suite; and (ii) running this larger test suite, which we call the "amplified test suite", on all variants to reveal visible differences in the computation.

¹ DSpot stands for "diversity spotter".

The first step of DSpot is an original technique of test amplification [23, 26, 20, 25]. Our key insight is to combine the automatic exploration of the input domain with the systematic sensing of the observation domain. The former is obtained by transforming the input values and method calls of the original tests. The latter is the result of the analysis and transformation of the original assertions of the test suite, in order to observe the program state from as many observation points visible from the public API as possible.

The second step of DSpot runs the augmented test suite on each variant. The observation points introduced during amplification generate new traces of the program state. If there exists a difference between the traces of a pair of variants, we say that these variants are computationally diverse. In other words, two variants are considered diverse if there exists at least one input outside the specified domain that triggers different behaviors on the variants, observable through the public API.

To evaluate the ability of DSpot to observe computational diversity, we consider 7 open-source software applications. For each of them, we create program variants (472 in total) and we manually check that they are computationally diverse; they form our ground truth. We then run DSpot on each program variant. Our experiments show that DSpot detects 100% of the 472 computationally diverse program variants.

In the literature, the technique that is the most similar to test amplification is by Yoo and Harman [25], called "test data regeneration" (TDR for short); we use it as a baseline. We show that test suites amplified with DSpot detect twice as many computationally diverse programs as TDR. In particular, we show that the new test input transformations that we propose bring a real added value with respect to TDR for spotting behavioral differences.

To sum up, our contributions are:

• an original set of test case transformations for the automatic amplification of an object-oriented test suite;
• a validation of the ability of amplified test suites to spot computational diversity in 472 variants of 7 open-source, large-scale programs;
• a comparative evaluation against the closest related work [25];
• original insights about the natural diversity of computation due to randomness and the variety of runtime environments;
• a publicly available implementation² and benchmark³.

² http://diversify-project.github.io/test-suite-amplification.html
³ http://diversify-project.eu/data


Figure 1: A High-level View of Software Diversity. (Diagram: software diversity comprises static code diversity, i.e., source diversity such as different variable names and binary diversity coming from different compilers and compilation options, and computational diversity, i.e., different executions, of which NVP-diversity, different outputs at the module interface, is one kind.)

public int subtract1(int a, int b) { return a - b; }

public int subtract2(int a, int b) throws OverFlowException {
    BigInteger bigA = BigInteger.valueOf(a);
    BigInteger bigB = BigInteger.valueOf(b);
    BigInteger result = bigA.subtract(bigB);
    if (result.lowerThan(Integer.MIN_VALUE)) {
        throw new DoNotFitIn32BitException();
    }
    // the API requires a 32-bit integer value
    return result.intValue();
}

Listing 1: Two subtraction functions. They are NVP-diverse: there exist some inputs for which the outputs are different.


The paper is organized as follows: Section 2 expands on the background and motivation for this work; Section 3 describes the core technical contribution of the paper, the automatic amplification of test suites; Section 4 presents our empirical findings about the amplification of 7 real-world test suites and the assessment of diversity among 472 program variants.

2. BACKGROUND

In this paper, we are interested in computational diversity. Computational diversity is one kind of software diversity. Figure 1 presents a high-level view of software diversity. Software diversity can be observed statically, either on source or binary code. Computational diversity is the one that happens at runtime. The computational diversity we target in this paper is NVP-diversity, which relates to N-version programming. It can be loosely defined as computational diversity that is visible at the module interface: different outputs for the same input.

Figure 2: Original and amplified points on the input and observation spaces of P. (Diagram: the program P, through its execution state, relates points of the input space to points of the observation space.)

2.1 N-version programming

In the Encyclopedia of Software Engineering, N-version programming is defined as "a software structuring technique designed to permit software to be fault-tolerant, i.e., able to operate and provide correct outputs despite the presence of faults" [14]. In N-version systems, N variants of the same module, written by different teams, are executed in parallel. The faults are defined as an output of one or more variants that differs from the majority's output. Let us consider a simple example with two programs p1 and p2: if one observes a difference in the output for an input x, i.e., p1(x) ≠ p2(x), then a fault is detected.

Let us consider the example of Listing 1. It shows two implementations of subtraction which have been developed by two different teams, a typical N-version setup. subtract1 simply uses the subtraction operator. subtract2 is more complex: it leverages BigInteger objects to handle potential overflows.

The specification given to the two teams states that the expected input domain is [−2^16, 2^16] × [−2^16, 2^16]. To that extent, both implementations are correct and equivalent. These two implementations are run in parallel in production, using an N-version architecture.

If a production input is outside the specified input domain, e.g., subtract1(2^32 + 1, 2), the behavior of both implementations is different and the overflow fault is detected.

2.2 NVP-Diversity

In this paper, we use the term NVP-diversity to refer to the concept of computational diversity in N-version programming.

Definition: Two programs are NVP-diverse if and only if there exists at least one input for which the output is different.

Note that, according to this definition, if two programs are equivalent on all inputs, they are not NVP-diverse.

In this work, we consider programs in mainstream object-oriented programming languages (our prototype handles Java software). In OO programs, there is no such thing as an "input" or an "output". This requires us to slightly modify our definition of NVP-diversity.

Following [10], we replace "input" by "stimuli" and "output" by "observation". A stimuli is a sequence of method calls and their parameters on an object under test. An observation is a sequence of calls to specific methods that query the state of an object (typically getter methods). The input space I of a class P is the set of all possible stimuli for P. The observation space O is the set of all sets of observations.

Now we can clearly define NVP-diversity for OO programs.

Definition: Two classes are NVP-diverse if and only if there exist two respective instances that produce different observations for the same stimuli.
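To make these notions concrete, here is a minimal sketch (our own illustration, using a standard java.util.ArrayList rather than a class from the benchmark) of a stimuli and of the observations that could be collected on the resulting state:

import java.util.ArrayList;
import java.util.List;

public class StimuliObservationSketch {
    public static void main(String[] args) {
        // Stimuli: a sequence of method calls and their parameters on the object under test.
        List<String> list = new ArrayList<>();
        list.add("a");
        list.remove("a");

        // Observations: calls to state-querying methods; their results are what
        // a DSpot-like approach compares across variants, instead of checking
        // them against hard-coded expected values.
        System.out.println(list.isEmpty());   // true
        System.out.println(list.size());      // 0
        System.out.println(list.toString());  // []
    }
}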

2.3 Graphical Explanation

The notion of NVP-diversity is directly related to the activity of software testing, as illustrated in Figure 2. The first part of a test case, including the creation of objects and method calls, constitutes the stimuli, i.e., a point in the program's input space (black diamonds in the figure). An oracle, in the form of an assertion, invokes one method and compares the result to an expected value; this constitutes an observation point on the program state that has been reached when running the program with a specific stimuli (the observation points of a test suite are black circles in the right-hand side of the figure). To this extent, we say that a test suite specifies a set of relations between points in the input and observation spaces.

2.4 Unspecified Input Space

In N-version programming, by definition, the differences that are observed at runtime happen for unspecified inputs, which we call the unspecified domain for short. In this paper, we consider that the points that are not exercised by a test suite form the unspecified domain. They are the orange diamonds in the left-hand side of the figure.

3. OUR APPROACH TO DETECT COMPUTATIONAL DIVERSITY

We present DSpot, our approach to detect computational diversity. This approach is based on test suite amplification through automated transformations of test case code.

3.1 Overview

The global flow of DSpot is illustrated in Figure 3.

Input: DSpot takes as inputs a set of program variants P1, ..., Pn, which all pass the same test suite TS. Conceptually, Px can be written in any programming language. There is no assumption on the correctness or complexity of Px; the only requirement is that they are all specified by the same test suite. In this paper, we consider unit tests; however, the approach can be straightforwardly extended to other kinds of tests, such as integration tests.

Output: The output of DSpot is an answer to the question: are P1, ..., Pn NVP-diverse?

Process: First, DSpot amplifies the test suite to explore the unspecified input and observation spaces (as defined in Section 2). As illustrated in Figure 2, amplification generates new inputs and observations in the neighbourhood of the original points (new points are orange diamonds and green circles). This cartesian product of the amplified set of input points and the complete set of observable points forms the amplified test suite, ATS.

Also, Figure 3 shows the step "observation point selection"; this step removes the naturally random observations. Indeed, as discussed in more detail further in the paper, some observation points produce diverse outputs between different runs of the same test case on the same program. This natural randomness comes from randomness in the computation and from specificities of the execution environment (addresses, file system, etc.).

Once DSpot has generated an amplified test suite, it runs it on a pair of program variants to compare their visible behavior, as captured by the observation points. If some points reveal different values on each variant, the variants are considered computationally diverse.

Figure 3: An overview of DSpot, a decision procedure for automatically assessing the presence of NVP-diversity. (Diagram: programs P1 and P2 both satisfy the input test suite; DSpot applies input space exploration and observation point synthesis to produce the amplified test suite, then observation point selection and a computational diversity check that outputs yes or no.)

3.2 Test Suite Transformations

Our approach for amplifying test suites systematically explores the neighbourhood of the input and observation points of the original test suite. In this section, we discuss the different transformations we perform for test suite amplification; Algorithm 1 summarizes the procedure.

Data: TS, an initial test suite
Result: TS', an amplified version of TS

 1  TStmp ← ∅
 2  foreach test ∈ TS do
 3      foreach statement ∈ test do
 4          test' ← clone(test)
 5          TStmp ← remove(statement, test')
 6          test'' ← clone(test)
 7          TStmp ← duplicate(statement, test'')
 8      end
 9      foreach literalValue ∈ test do
10          TStmp ← transform(literalValue, test)
11      end
12  end
13  TS' ← TStmp ∪ TS
    foreach test ∈ TS' do
14      removeAssertions(test)
15  end
16  foreach test ∈ TS' do
17      addObservationPoints(test)
18  end
19  foreach test ∈ TS' do
20      filterObservationPoints(test)
21  end

Algorithm 1: Amplification of test cases

3.2.1 Exploring the Input Space

Literals and statement manipulation. The first step of amplification consists in transforming all test cases in the test suite with the following test case transformations. Those transformations operate on literals and statements.

Transforming literals: given a test case tc, we run the following transformations for every literal value. A String value is transformed in three ways: remove a random character, add a random character, and replace a random character by another one. A numerical value i is transformed in four ways: i+1, i−1, i×2, i÷2. A boolean value is replaced by the opposite value. These transformations are performed at line 10 of Algorithm 1.

Transforming statements: given a test case tc, for every statement s in tc, we generate two test cases: one test case in which we remove s, and another one in which we duplicate s. These transformations are performed at line 2 of Algorithm 1.

Given the transformations described above, the transformation process has the following characteristics: (i) each time we transform a variable in the original test suite, we generate a new test case (i.e., we do not 'stack' the transformations on a single test case); (ii) the amplification process is exhaustive: given s the number of String values, n the number of numerical values, b the number of boolean values, and st the number of statements in an original test suite TS, DSpot produces an amplified test suite ATS of size |ATS| = s × 3 + n × 4 + b + st × 2.
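As an illustration of these transformations and of the size formula (a hedged sketch of our own, not the actual DSpot/Spoon implementation), the literal-level neighbors of a value could be enumerated as follows; for a test suite with, say, s = 2 String values, n = 3 numerical values, b = 1 boolean and st = 5 statements, this would yield |ATS| = 2 × 3 + 3 × 4 + 1 + 5 × 2 = 29 generated test cases.

import java.util.Arrays;
import java.util.List;
import java.util.Random;

class LiteralAmplificationSketch {
    static final Random RANDOM = new Random();

    // A numerical value i is transformed in four ways: i+1, i-1, i*2, i/2.
    static List<Integer> numericNeighbors(int i) {
        return Arrays.asList(i + 1, i - 1, i * 2, i / 2);
    }

    // A String value is transformed in three ways: remove, add and replace
    // a random character (this sketch assumes s is non-empty).
    static List<String> stringNeighbors(String s) {
        int p = RANDOM.nextInt(s.length());
        char c = (char) ('a' + RANDOM.nextInt(26));
        return Arrays.asList(
            s.substring(0, p) + s.substring(p + 1),       // remove a random character
            s.substring(0, p) + c + s.substring(p),       // add a random character
            s.substring(0, p) + c + s.substring(p + 1));  // replace a random character
    }

    // A boolean value is replaced by its opposite.
    static boolean booleanNeighbor(boolean b) {
        return !b;
    }
}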

These transformations, especially the ones on statements, can produce test cases that cannot be executed (e.g., removing a call to add before a remove on a list). In our experiments, this accounted for approximately 10% of the amplified test cases.

Assertion removal. The second step of amplification consists of removing all assertions from the test cases (line 14 of Algorithm 1). The rationale is that the original assertions are there to verify correctness, which is not the goal of the generated test cases; their goal is to assess computational differences. Indeed, assertions that were specified for a test case ts in the original test suite are most probably meaningless for a test case that is a variant of ts. When removing assertions, we are careful to keep method calls that are passed as parameters of an assert method. We analyze the code of the whole test suite to find all assertions, using the following heuristic: an assertion is a call to a method whose name contains either assert or fail and which is provided by the JUnit framework. If one parameter of the assertion is a method call, we extract it, then we remove the assertion. In the final amplified test suite, we keep the original test case but also remove its assertions.
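For instance, the rewriting of an assertion into an observation point could look as follows (our own before/after illustration of the heuristic, reusing the Logger.logAssertArgument helper that appears in the amplified tests shown later):

// Before amplification: the assertion encodes an oracle about correctness.
assertFalse("Entry should have been removed from the underlying map.",
    getMap().containsKey(sampleKeys[i]));

// After amplification: the assertion is removed, but the method call that was
// passed as its parameter is kept and its result is logged in the execution trace.
Logger.logAssertArgument(getMap().containsKey(sampleKeys[i]));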

Listing 2 illustrates the generation of two new test cases. The first test method, testEntrySetRemoveChangesMap(), is the original one, slightly simplified for the sake of presentation. The second one, testEntrySetRemoveChangesMap_Add, duplicates the statement entrySet.remove and does not contain the assertion anymore. The third test method, testEntrySetRemoveChangesMap_DataMutator, replaces the numerical value 0 by 1.

public void testEntrySetRemove() {                        // 1
    for (int i = 0; i < sampleKeys.length; i++) {
        entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
        assertFalse("Entry should have been removed from the underlying map.",
            getMap().containsKey(sampleKeys[i]));
    } // end for
}

public void testEntrySetRemove_Add() {                    // 2
    // call duplication
    entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
    entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
    getMap().containsKey(sampleKeys[i]);
}

public void testEntrySetRemove_Data() {                   // 3
    // integer increment: int i = 0 -> int i = 1
    for (int i = 1; i < (sampleKeys.length); i++) {
        entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
        getMap().containsKey(sampleKeys[i]);
    } // end for
}

Listing 2: A test case, testEntrySetRemoveChangesMap (1), that is amplified twice (2 and 3).

3.2.2 Adding Observation Points

Our goal is to observe different observable behaviors between a program and variants of this program. Consequently, we need observation points on the program state. We do this by enhancing all the test cases in ATS with observation points (line 17 of Algorithm 1). These points are responsible for collecting pieces of information about the program state during or after the execution of the test case. In this context, an observation point is a call to a public method whose result is logged in an execution trace.

For each object o in the original test case (o can be part of an assertion or a local variable of the test case), we do the following, as sketched in the example below:

• we look for all getter methods in the class of o (i.e., methods whose name starts with get, that take no parameter and whose return type is not void, as well as methods whose name starts with is and return a boolean value) and call each of them; we also collect the values of all public fields;

• if the toString method is redefined for the class of o, we call it (we ignore the hashcode that can be returned by toString);

• if the original assertion included a method call on o, we include this method call as an observation point.
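As a hedged illustration (the observed object and methods are taken from the TreeBidiMap case of Table 1, but the snippet itself is ours and not generated DSpot output), the observation points synthesized for a map under test could look like:

// Getter-like and is-methods found in the public API of the observed object.
Logger.logAssertArgument(map.size());
Logger.logAssertArgument(map.isEmpty());
// toString, if it is redefined by the class (hashcode-like defaults are ignored).
Logger.logAssertArgument(map.toString());
// A method call that appeared in the original assertion is kept as well.
Logger.logAssertArgument(map.containsKey(sampleKeys[0]));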

Filtering observation points. This introspective process provides a large number of observation points. Yet, we have noted in our pilot experiments that some of the values that we monitor change from one execution to another. For instance, the identifier of the current thread changes between two executions: in Java, Thread.currentThread().getId() is an observation point that always needs to be discarded.

If we kept those naturally varying observation points, DSpot would say that two variants are different while the observed difference would be due to randomness. This would produce spurious results that are irrelevant for computational diversity assessment. Consequently, we discard certain observation points as follows. We instrument the amplified tests ATS with all observation points. Then we run ATS 30 times on Px and repeat these 30 runs on three different machines. All observation points for which at least one value varies between at least two runs are filtered out (line 20 of Algorithm 1).
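A minimal sketch of this filtering step (our own code, assuming each run of the amplified test suite produces a map from an observation-point identifier to the logged value) could be:

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class ObservationFilterSketch {
    // Keep only the observation points whose logged value is identical in every
    // run, on every machine; the others are considered naturally random and discarded.
    static Set<String> stablePoints(List<Map<String, String>> runs) {
        Map<String, String> reference = runs.get(0);
        Set<String> stable = new HashSet<>(reference.keySet());
        for (Map<String, String> run : runs) {
            stable.removeIf(point -> !Objects.equals(reference.get(point), run.get(point)));
        }
        return stable;
    }
}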

To sum up, DSpot produces an amplified test suite ATS that contains more test cases than the original one, and in which we have injected observation points in all test cases.

Table 1: Descriptive Statistics about our Dataset

Project              | Purpose                                | Class         | LOC  | # tests | coverage | # variants
commons-codec        | Data encoding                          | Base64        | 255  | 72      | 98%      | 12
commons-collections  | Collection library                     | TreeBidiMap   | 1202 | 111     | 92%      | 133
commons-io           | Input/output helpers                   | FileUtils     | 1195 | 221     | 82%      | 44
commons-lang         | General purpose helpers (e.g., String) | StringUtils   | 2247 | 233     | 99%      | 22
guava                | Collection library                     | HashBiMap     | 525  | 35      | 91%      | 3
gson                 | JSON library                           | Gson          | 554  | 684     | 89%      | 145
JGit                 | Java implementation of Git             | CommitCommand | 433  | 138     | 81%      | 113

3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on pairs of program variants. Given P1 and P2, the number of observation points that have different values on each variant accounts for the visible computational diversity. When we compare a set of variants, we use the mean number of differences over all pairs of variants.
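Under the same trace representation as in the sketch above, a hedged sketch of this comparison step (our own code, not the DSpot implementation) is:

import java.util.Map;
import java.util.Objects;

class DiversityCheckSketch {
    // The visible computational diversity between two variants P1 and P2 is the
    // number of observation points whose logged values differ between their traces.
    static int divergences(Map<String, String> traceP1, Map<String, String> traceP2) {
        int count = 0;
        for (String point : traceP1.keySet()) {
            if (!Objects.equals(traceP1.get(point), traceP2.get(point))) {
                count++;
            }
        }
        return count;
    }
}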

3.4 Implementation

Our prototype implementation amplifies Java source code⁴. The test suites are expected to be written with the JUnit testing framework, which is the #1 testing framework for Java. The prototype uses Spoon [18] to manipulate the source code in order to create the amplified test cases. DSpot is able to amplify a test suite within minutes.

The main challenges for the implementation of DSpot were as follows: handling the many different situations that occur in real-world, large test suites (which use different versions of JUnit, modularize the code of the test suite itself, implement new types of assertions, etc.); handling large traces for the comparison of computations (as we will see in the next section, we collect hundreds of thousands of observations on each variant); and spotting the natural randomness in test case execution to prevent false positives in the assessment of computational diversity.

⁴ The prototype is available at http://diversify-project.github.io/test-suite-amplification.html

4. EVALUATION

To evaluate whether DSpot is capable of detecting computational diversity, we set up a novel empirical protocol and apply it to 7 large-scale Java programs. Our guiding research question is: is DSpot capable of identifying realistic, large-scale programs that are computationally diverse?

4.1 Protocol

First, we take large open-source Java programs that are equipped with good test suites. Second, we forge variants of those programs using a technique from our previous work [2]. We call the variants sosie programs⁵.

Definition 1. Sosie (noun): Given a program P, a test suite TS for P and a program transformation T, a variant P' = T(P) is a sosie of P if the two following conditions hold: 1) there is at least one test case in TS that executes the part of P that is modified by T; 2) all test cases in TS pass on P'.

⁵ The word sosie is a French word that literally means "lookalike".

Given an initial program, we synthesize sosies with source code transformations that are based on the modification of the abstract syntax tree (AST). As in previous work [16, 22], we consider three families of transformations that manipulate statement nodes of the AST: 1) remove a node in the AST (Delete); 2) add a node just after another one (Add); 3) replace a node by another one, e.g., a statement node is replaced by another statement (Replace). For "Add" and "Replace", the transplantation point refers to where a statement is inserted, and the transplant statement refers to the statement that is copied and inserted; both transplantation and transplant points are in the same AST (we do not synthesize new code, nor take code from other programs). We consider transplant statements that manipulate variables of the same type as the transplantation point, and we bind the names of variables in the transplant to names that are in the namespace of the transplantation point. We call these transformations Steroid transformations; more details are available in our previous work [2].

Once we have generated sosie programs, we manually select a set of sosies that indeed expose some computational diversity. Third, we amplify the original test suites using our approach, and also using a baseline technique by Yoo and Harman [25], presented in Section 4.3. Finally, we run both amplified test suites and measure the proportion of variants (sosies) that are detected as computationally different. We also collect additional metrics to further qualify the effectiveness of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our experiments. The inclusion criteria are the following: 1) the subject program must be real-world software; 2) the subject program must be written in Java; 3) the subject program's test suite must use the JUnit testing framework; 4) the subject program must have a good test suite (a statement coverage higher than 80%).

This results in Apache Commons Math, Apache Commons Lang, Apache Commons Collections, Apache Commons Codec, Google GSON and Guava. The dominance of Apache projects is due to the fact that they are among the very rare organizations with a very strong development discipline.

In addition, we aim at running the whole experiment in less than one day (24 hours). Consequently, we take a single class for each of those projects, as well as all the test cases that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset. It gives the subject program identifier, its purpose, the class we consider, the class' number of lines of code (LOC), the number of tests that execute at least once one method of the class under consideration, the statement coverage, and the total number of program variants we consider (excluding the original program). We see that this benchmark covers different domains, such as data encoding and collections, and is only composed of well-tested classes. In total, there are between 12 and 145 computationally diverse variants of each program to be detected. This variation comes from the relative difficulty of manually forging computationally diverse variants, depending on the project.

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but it can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, data shifting (λx.x+1 and λx.x−1) and data scaling (multiply or divide the value by 2), and on a hill-climbing algorithm driven by the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available. We removed the hill-climbing part, since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more, new transformation operators on test cases; DSpot considers a richer observation space based on arbitrary data types and sequences of method calls.

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Here we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR. The one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or from the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper into the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary, even on the same program. This phenomenon is due to the natural randomness of computation. To answer this question quantitatively, we count the number of discarded observation points; to answer it qualitatively, we discuss one case study.

RQ3b: what is the richness of computational diversity? Finally, we want to understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot to our dataset.

4.5.1 # of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go by pairs: one provides data for a subject program, and the following one provides the same data gathered for the test suite amplified by DSpot. Columns 2 to 5 are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let us first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive. We create between 4x and 12x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (# TC exec.) and column 5 the number of assertions, or observation points, executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of executed tests is impacted by amplification.

Table 2: The performance of DSpot on amplifying 7 Java test suites

                   |          Static          |                        Dynamic
                   | # TC       | # assert    | # TC   | # assert or | # disc. | branch | path
                   |            | or obs      | exec.  | obs exec.   | obs.    | cov.   | cov.
codec              | 72         | 509         | 72     | 3528        | -       | 124    | 1245
codec-DSpot        | 672 (×9)   | 10597 (×20) | 672    | 16920       | 12      | 126    | 12461
collections        | 111        | 433         | 768    | 7035        | -       | 223    | 376
collections-DSpot  | 1291 (×12) | 14772 (×34) | 9202   | 973096      | 0       | 224    | 465
io                 | 221        | 1330        | 262    | 1346        | -       | 366    | 246
io-DSpot           | 2518 (×11) | 20408 (×15) | 2661   | 209911      | 54313   | 373    | 287
lang               | 233        | 2206        | 233    | 2266        | -       | 1014   | 797
lang-DSpot         | 988 (×4)   | 12854 (×6)  | 12854  | 57856       | 18      | 1015   | 901
guava              | 35         | 84          | 14110  | 20190       | -       | 60     | 77
guava-DSpot        | 625 (×18)  | 6834 (×81)  | 624656 | 9464        | 0       | 60     | 77
gson               | 684        | 1125        | 671    | 1127        | -       | 106    | 84
gson-DSpot         | 4992 (×7)  | 26869 (×24) | 4772   | 167150      | 144     | 108    | 137
JGit               | 138        | 176         | 138    | 185         | -       | 75     | 1284
JGit-DSpot         | 2152 (×16) | 90828 (×516)| 2089   | 92856       | 13377   | 75     | 1735

Table 3: The effectiveness of computational diversity detection

                     | # variants detected | # variants detected | input space | observation  | mean # of
                     | by DSpot            | by TDR              | effect      | space effect | divergences
commons-codec        | 12/12               | 10/12               | 12/12       | 10/12        | 219
commons-collections  | 133/133             | 133/133             | 133/133     | 133/133      | 52079
commons-io           | 44/44               | 18/44               | 42/44       | 18/44        | 4055
commons-lang         | 22/22               | 0/22                | 10/22       | 0/22         | 229
guava                | 3/3                 | 0/3                 | 0/3         | 3/3          | 2
gson                 | 145/145             | 0/145               | 134/145     | 0/145        | 8015
jgit                 | 113/113             | 0/113               | 113/113     | 0/113        | 15654

For instance, for commons-collections, there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods. Consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last columns of Table 2 provide deeper insights about the synthesized test cases. Column 7 gives the branch coverage of the original test suites and of the amplified ones (lines with the -DSpot identifier). While the original test suites already have a very high branch coverage, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io/FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava/HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suites is also revealed in the last column of the table (path coverage): it gives the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Guava, the total number of different paths covered in the methods under test increases from 84 to 137. This means that, while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec⁶: the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method and to the fact that its behavior directly depends on the value of an array of bytes, which takes many new values in the amplified test suite.

⁶ Line 331 in the Base64 class: https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

The amplification process is massive and produces rich new input points: the number of declared and executed test cases, and the diversity of executions from test cases, increase.

4.5.2 # of Generated Observation Points

Now we focus on the observation points. For each original test suite, Table 2 gives the number of assertions: this corresponds to the number of locations where the tester specifies expected values about the state of the program execution. For each amplified test suite, it gives the number of observation points. We do not call them assertions, since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants, in order to assess their computational diversity.

As we can see, we observe the program state on many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, and hence reveal visible and exploitable computational diversity. However, this number also encompasses the observation points of the newly generated test cases.

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective at identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether computational diversity is revealed by new input points, by new observation points, or both; we will come back to them later.

As we can see, DSpot is capable of detecting all the computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec) or none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, including constructor calls, method invocations and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The results of this experiment are given in the columns "input space effect" and "observation space effect" of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation methods used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the results of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space and 10/12 are detected by exploring the observation space. This means that 10 of them are detected by either one exploration or the other. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing the columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems: if the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this produces a false positive for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.

Our experimental protocol enables us to quantify the number of discarded observation points. The 6th column of Table 2 gives this number. For instance, for commons-codec, DSpot detects 12 observation points that naturally vary.

void testCanonicalEmptyCollectionExists() {
    if (((supportsEmptyCollections()) && (isTestSerialization())) && (!(skipSerializedCanonicalTests()))) {
        Object object = makeObject();
        if (object instanceof Serializable) {
            String name = getCanonicalEmptyCollectionName(object);
            File f = new java.io.File(name);
            // observation on f
            Logger.logAssertArgument(f.getCanonicalPath());
            Logger.logAssertArgument(f.getAbsolutePath());
        }
    }
}

Listing 3: An amplified test case with observation points that naturally vary, and hence are discarded by DSpot.

This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories, which are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. For those programs, DSpot does not detect a single point that naturally varies when running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good property, as it shows that the encapsulation is good: beyond providing an intuitive API and a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to values that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = openOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = new FileOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io.

void testCopyDirectoryPreserveDates() {
    try {
        File sourceFile = new File(sourceDirectory, "hello*txt");
        FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
    } catch (Exception e) {
        DSpot.observe(e.getMessage());
    }
}

Listing 5: Amplified test case that reveals the computational diversity between the variants of Listing 4.

These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a character of the file name is changed into a star ("*"), which makes the file name invalid. Running this test on the variant results in a FileNotFoundException.

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indentation format. Each variant creates a JSON document with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on the instances of StringWriter that are modified by toJson().

// Original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ...
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ...
    writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON.

public void testWriteMixedStreamed_remove534() throws IOException {
    // ...
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6.

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break; an original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value of the encodedInt3 variable (the original String has an additional character, which is removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points; the single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmers: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how amplified test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified by the test suite, but still induce diversity: different checks on the input, different formatting, and different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

46 Threats to ValidityDSpot is able to effectively detect NVP-diversity using

test suite amplification Our experimental results are sub-ject to the following threats

First this experiment is highly computational a bug inour evaluation code may invalidate our findings Howeversince we have manually checked a sample of cases (the casestudies of Section 454 and Section 455) we have a highconfidence in our results Our implementation is publiclyavailable 7

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

7 http://diversify-project.github.io/test-suite-amplification.html

  // Original program
  void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
      switch (context.modulus) {
          case 0: // impossible, as excluded above
          case 1: // 6 bits - ignore entirely
              // not currently tested, perhaps it is impossible
              break;
          // ...
      }
  }

  // Variant
  void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
      switch (context.modulus) {
          case 0: // impossible, as excluded above
          case 1: // falls through (the break is removed)
          // ...
      }
  }

Listing 8: Two variants of decode in commons-codec

  @Test
  void testCodeInteger3_literalMutation222() {
      String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
          + "rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
      Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
  }

Listing 9: Amplified test case that reveals the computational diversity between variants of listing 8

This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than those of our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK
The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces, and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase input space coverage in order to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resources that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezzè et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION
In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations, for respectively exploring new points in the program's input space and exploring new observation points on the execution state reached after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Beyond the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests, so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.
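As a hypothetical illustration (not taken from our dataset), the following JUnit test is flaky in exactly this sense: its assertion observes a value that depends on the execution environment, so it sits right at the border of the naturally non-deterministic part of the execution.

  import static org.junit.Assert.assertTrue;

  import java.io.File;
  import org.junit.Test;

  public class FlakyObservationTest {
      @Test
      public void observesEnvironmentDependentValue() {
          File f = new File("foo.txt");
          // Flaky across machines: the absolute path depends on the working
          // directory and on the platform's path conventions.
          assertTrue(f.getAbsolutePath().startsWith("/home/user/project"));
      }
  }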

7. ACKNOWLEDGEMENTS
This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES
[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezzè. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions (CSDA '98), pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE '03), pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezzè, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.


These two variants behave differently outside the specified domain: when writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a "." is changed into a star "*", which makes the file name invalid. Running this test on the variant results in a FileNotFoundException.
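
As an illustration of the kind of String-literal transformation at play here, the sketch below implements the three mutations on String values described in the approach (remove, add, and replace a random character). The class and method names are ours, chosen for the example; DSpot applies equivalent transformations at the source-code level through Spoon.

import java.util.Random;

// Sketch of String-literal amplification: each original literal yields three
// mutated variants (one character removed, one added, one replaced).
// All methods assume a non-empty input string.
class StringAmplifier {

    private static final Random RANDOM = new Random();

    // "hello.txt" -> e.g. "hellotxt"
    static String removeRandomChar(String s) {
        int i = RANDOM.nextInt(s.length());
        return s.substring(0, i) + s.substring(i + 1);
    }

    // "hello.txt" -> e.g. "hel#lo.txt"
    static String addRandomChar(String s) {
        int i = RANDOM.nextInt(s.length() + 1);
        return s.substring(0, i) + randomPrintableChar() + s.substring(i);
    }

    // "hello.txt" -> e.g. "hello*txt", the invalid name that triggers the
    // divergence between the two variants of Listing 4
    static String replaceRandomChar(String s) {
        int i = RANDOM.nextInt(s.length());
        return s.substring(0, i) + randomPrintableChar() + s.substring(i + 1);
    }

    private static char randomPrintableChar() {
        return (char) ('!' + RANDOM.nextInt(94)); // printable ASCII range
    }
}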

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON document with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on instances of StringWriter that are modified by toJson().

// original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    ...
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    ...
    writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON.

public void testWriteMixedStreamed_remove534() throws IOException {
    ...
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among variants of Listing 6.

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that reaching it is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value of the encodedInt3 variable (the original String has an additional character that is removed by the "remove character" transformation). The amplification on the observation points adds multiple observation points. The single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmer: case 1 is possible, it can be triggered from the API.
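
The detection step behind these observations boils down to comparing the values logged on each variant. The sketch below is a simplified illustration of that comparison (the trace representation and names are ours, not DSpot's code): two variants are flagged as computationally diverse as soon as one observation point logs different values.

import java.util.Map;

// Sketch: compare the observation traces produced by running the same
// amplified test suite on two program variants.
class DiversityCheck {

    // Each trace maps an observation point identifier to the value it logged,
    // e.g., "Base64.decodeInteger(encodedInt3)" -> "1" on the original
    // program and "0" on the variant of Listing 8.
    static int divergences(Map<String, String> traceP1, Map<String, String> traceP2) {
        int count = 0;
        for (Map.Entry<String, String> e : traceP1.entrySet()) {
            String other = traceP2.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                count++; // this observation point distinguishes the two variants
            }
        }
        return count;
    }

    // NVP-diversity is reported as soon as at least one point diverges.
    static boolean computationallyDiverse(Map<String, String> traceP1, Map<String, String> traceP2) {
        return divergences(traceP1, traceP2) > 0;
    }
}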

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We have discussed how augmented test cases reveal this diversity (both with amplified inputs and observation points). They illustrate three categories of code variations that maintain the expected functionality as specified in the test suite but still induce diversity: different checks on input, different formatting, and different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available⁷.

Second, we have forged the computationally diverse program variants ourselves. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those

⁷ http://diversify-project.github.io/test-suite-amplification.html

// original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    ...
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1: // 6 bits - ignore entirely
            // not currently tested; perhaps it is impossible
            break;
        ...
    }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    ...
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1:
        ...
    }
}

Listing 8: Two variants of decode in commons-codec.

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
            + "rY+1LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between variants of Listing 8.

variants. This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK

The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent with respect to the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we use it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations for, respectively, exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Beyond the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491-1501, 1985.

[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149-159, 2014.

[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.

[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85-96. ACM, 2010.

[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions (CSDA '98), pages 171-, Washington, DC, USA, 1998. IEEE Computer Society.

[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7-16. ACM, 2010.

[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147-156. ACM, 2010.

[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238-255. Springer, 2008.

[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE '03), pages 60-71, Washington, DC, USA, 2003. IEEE Computer Society.

[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.

[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294-305. ACM, 2014.

[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81-92. ACM, 2009.

[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.

[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.

[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30-37. IEEE, 1999.

[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54-72, 2012.

[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121-131. ACM, 2004.

[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.

[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226-237. ACM, 2008.

[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11-20. IEEE, 2013.

[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).

[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1-32, 2013.

[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380-403. Springer, 2006.

[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150-159. IEEE, 2011.

[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171-201, 2012.

[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595-605. IEEE Press, 2012.

1 void testCanonicalEmptyCollectionExists () if ((( supportsEmptyCollections ()) ampamp (

isTestSerialization ())) ampamp ((skipSerializedCanonicalTests ())))

3 Object object = makeObject ()if (object instanceof Serializable)

5 String name = getCanonicalEmptyCollectionName(object)

File f = new javaioFile(name)7 observation on f

LoggerlogAssertArgument(fgetCanonicalPath ())9 LoggerlogAssertArgument(fgetAbsolutePath ())

11

Listing 3 An amplified test case with observation pointsthat naturally vary hence are discarded by DSpot

codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

455 Nature of Computational DiversityNow we want to understand more in depth the nature of

the NVP-diversity we are observing Let us discuss threecase studies

Listing 4 shows two variants of the writeStringToFile()

method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

original program2 void writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

OutputStream out = null4 out = openOutputStream(file append)

IOUtilswrite(data out encoding)6 outclose()

8 variantvoid writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

10 OutputStream out = nullout = new FileOutputStream(file append)

12 IOUtilswrite(data out encoding)outclose()

Listing 4 Two variants of writeStringToFile incommonsio

1 void testCopyDirectoryPreserveDates () try

3 File sourceFile = new File(sourceDirectory hellotxt)

FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

5 catch (Exception e) DSpotobserve(egetMessage ())

7

Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

FileOutputStream These two variants behave differentlyoutside the specified domain in case writeStringToFile()

is called with an invalid file name the original program han-dles it while the variant throws a FileNotFoundExceptionOur test transformation operator on String values producessuch a file name as shown in the test case of listing 5 aldquordquo is changed into a star ldquordquo This made the file name aninvalid one Running this test on the variant results in aFileNotFoundException

Let us now consider listing 6 which shows two variantsof the toJson() method from the Google Gson library Thelast statement of the original method is replaced by anotherone instead of setting the serialization format of the writer

it set the indent format Each variant creates a JSon withslightly different formats and none of these formatting deci-sions are part of the specified domain (and actually specify-ing the exact formatting of the JSon String could be consid-ered as over-specification) The diversity among variants isdetected by the test cases displayed in figure 7 which addsan observation point (a call to toString()) on instances ofStringWriter which are modified by toJson()

// original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  // ...
  writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  // ...
  writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON

public void testWriteMixedStreamed_remove534() throws IOException {
  // ...
  gson.toJson(RED_MIATA, Car.class, jsonWriter);
  jsonWriter.endArray();
  Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
  Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value of the encodedInt3 variable (the original String has an additional character that is removed by the "remove character" transformation). The amplification on the observation points adds multiple observation points; the single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the programmers' question: case 1 is possible, and it can be triggered from the API.
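To give an idea of how such a difference is detected, here is a minimal sketch under the assumption that running the amplified test suite on each variant yields an ordered list of (observation point, logged value) pairs; the number of positions where the values differ counts as visible computational diversity between the two variants. The class and record names below are illustrative, not DSpot's API.

import java.util.List;
import java.util.Objects;

class TraceComparator {
  record Observation(String point, String value) {}

  // Count the observation points whose logged value differs between the two variants.
  static long divergences(List<Observation> original, List<Observation> variant) {
    long count = 0;
    int n = Math.min(original.size(), variant.size());
    for (int i = 0; i < n; i++) {
      Observation a = original.get(i);
      Observation b = variant.get(i);
      if (a.point().equals(b.point()) && !Objects.equals(a.value(), b.value())) {
        count++; // e.g. Base64.decodeInteger(...) logged as 1 vs. 0
      }
    }
    return count;
  }
}

A pair of variants with at least one divergence would be reported as computationally diverse.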

These three case examples are meant to give the reader a better idea of how DSpot detects the variants. They show how amplified test cases reveal this diversity (both through amplified inputs and through observation points), and illustrate three categories of code variations that maintain the expected functionality as specified in the test suite but still induce diversity: different checks on inputs, different formatting, and different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of "forgiving region" proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available.7

Second, we have forged the computationally diverse program variants ourselves. As shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

7 http://diversify-project.github.io/test-suite-amplification.html

// original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
  // ...
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1: // 6 bits - ignore entirely
      // not currently tested; perhaps it is impossible
      break;
    // ...
  }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
  // ...
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1:
    // ...
  }
}

Listing 8: Two variants of decode in commons-codec

@Test
void testCodeInteger3_literalMutation222() {
  String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
      + "rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==";
  Logger.logAssertArgument(Base64.decodeInteger(
      encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between the variants of Listing 8

This is true for all self-made evaluations. This threat to the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively, an integer tuple and a returned value) are simpler and less powerful than those of our amplification technique.

Third, our experiments consider a single programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK

The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent to ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to obtain execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of inputs, and the outputs are analyzed to quantify the differences. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences of up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity, and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or whether two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent with respect to the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals, as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase input space coverage in order to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error-handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations, for respectively exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (tests that occasionally fail). In our view, the assertions of flaky tests lie at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests, so that they move farther from the limit of natural randomness and enter again into the good old and reassuring world of determinism.

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 project DIVERSIFY (No. 600654).

8. REFERENCES

[1] A. Avizienis. The n-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491-1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149-159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85-96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions, CSDA '98, pages 171-, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7-16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147-156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238-255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60-71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294-305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81-92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30-37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Trans. on Software Engineering, 38(1):54-72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121-131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226-237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11-20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1-32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380-403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150-159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171-201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595-605. IEEE Press, 2012.


This opens avenues for future work There is a relationbetween the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail) To usethe assertions of the flaky tests are at the border of the nat-ural undeterministic parts of the execution sometimes theyhit it sometimes they donrsquot With such a view we imag-ine an approach that characterizes this limit and proposesan automatic refactoring of the flaky tests so that they getfarther from the limit of the natural randomness and enteragain into the good old and reassuring world of determin-ism

7 ACKNOWLEDGEMENTSThis work is partially supported by the EU FP7-ICT-

2011-9 No 600654 DIVERSIFY project

8 REFERENCES[1] A Avizienis The n-version approach to fault-tolerant

software IEEE Transactions on Software Engineering(12)1491ndash1501 1985

[2] B Baudry S Allier and M Monperrus Tailoredsource code transformations to synthesizecomputationally diverse program variants In Proc of

Int Symp on Software Testing and Analysis (ISSTA)pages 149ndash159 2014

[3] A Carzaniga A Mattavelli and M Pezze Measuringsoftware redundancy In Proc of Int Conf onSoftware Engineering (ICSE) 2015

[4] V Dallmeier N Knopp C Mallon S Hack andA Zeller Generating test cases for specificationmining In Proceedings of the 19th internationalsymposium on Software testing and analysis pages85ndash96 ACM 2010

[5] Y Deswarte K Kanoun and J-C Laprie Diversityagainst accidental and deliberate faults In Proceedingsof the Conference on Computer SecurityDependability and Assurance From Needs toSolutions CSDA rsquo98 pages 171ndash Washington DCUSA 1998 IEEE Computer Society

[6] M Franz E unibus pluram massive-scale softwarediversity as a defense mechanism In Proc of theworkshop on New security paradigms pages 7ndash16ACM 2010

[7] M Gabel and Z Su A study of the uniqueness ofsource code In Proc of the Int Symp on Foundationsof Software Engineering (FSE) pages 147ndash156 ACM2010

[8] D Gao M K Reiter and D Song BinhuntAutomatically finding semantic differences in binaryprograms In Information and CommunicationsSecurity pages 238ndash255 Springer 2008

[9] M Harder J Mellen and M D Ernst Improvingtest suites via operational abstraction In Proc of theInt Conf on Software Engineering (ICSE) ICSE rsquo03pages 60ndash71 Washington DC USA 2003 IEEEComputer Society

[10] M Harman P McMinn M Shahbaz and S Yoo Acomprehensive survey of trends in oracles for softwaretesting Technical Report CS-13-01 2013

[11] Y Higo and S Kusumoto How should we measurefunctional sameness from program source code anexploratory study on java methods In Proc of theInt Symp on Foundations of Software Engineering(FSE) pages 294ndash305 ACM 2014

[12] L Jiang and Z Su Automatic mining of functionallyequivalent code fragments via random testing In Procof Int Symp on Software Testing and Analysis(ISSTA) pages 81ndash92 ACM 2009

[13] M Kawaguchi S K Lahiri and H RebeloConditional equivalence Technical ReportMSR-TR-2010-119 2010

[14] J C Knight N-version programming Encyclopedia of

Software Engineering 1990

[15] P Koopman and J DeVale Comparing the robustnessof posix operating systems In Proc Of Int Symp onFault-Tolerant Computing pages 30ndash37 IEEE 1999

[16] C Le Goues T Nguyen S Forrest and W WeimerGenprog A generic method for automatic softwarerepair IEEE Tran on Software Engineering38(1)54ndash72 2012

[17] A J OrsquoDonnell and H Sethu On achieving softwarediversity for improved network security usingdistributed coloring algorithms In Proceedings of the11th ACM Conference on Computer andCommunications Security pages 121ndash131 ACM 2004

[18] R Pawlak M Monperrus N Petitprez C Nogueraand L Seinturier Spoon v2 Large scale source codeanalysis and transformation for java Technical Reporthal-01078532 INRIA 2006

[19] S Person M B Dwyer S Elbaum and C SPasareanu Differential symbolic execution In Proc ofthe Int Symp on Foundations of softwareengineering pages 226ndash237 ACM 2008

[20] M Pezze K Rubinov and J Wuttke Generatingeffective integration test cases from unit ones In Procof Int Conf on Software Testing Verification andValidation (ICST) pages 11ndash20 IEEE 2013

[21] M C Rinard Obtaining and reasoning about goodenough software In Design Automation Conference(DAC)

[22] E Schulte Z P Fry E Fast W Weimer andS Forrest Software mutational robustness GeneticProgramming and Evolvable Machines pages 1ndash322013

[23] T Xie Augmenting automatically generated unit-testsuites with regression oracle checking In Proc ofEuro Conf on Object-Oriented Programming(ECOOP) pages 380ndash403 Springer 2006

[24] Z Xu Y Kim M Kim and G Rothermel A hybriddirected test suite augmentation technique In Proc ofInt Symp on Software Reliability Engineering(ISSRE) pages 150ndash159 IEEE 2011

[25] S Yoo and M Harman Test data regenerationgenerating new test data from existing test dataSoftware Testing Verification and Reliability22(3)171ndash201 2012

[26] P Zhang and S Elbaum Amplifying tests to validateexception handling code In Proc of Int Conf onSoftware Engineering (ICSE) pages 595ndash605 IEEEPress 2012

  • 1 Introduction
  • 2 Background
    • 21 N-version programming
    • 22 NVP-Diversity
    • 23 Graphical Explanation
    • 24 Unspecified Input Space
      • 3 Our Approach to Detect Computational Diversity
        • 31 Overview
        • 32 Test Suite Transformations
          • 321 Exploring the Input Space
          • 322 Adding Observation Points
            • 33 Detecting and Measuring the Visible Computational Diversity
            • 34 Implementation
              • 4 Evaluation
                • 41 Protocol
                • 42 Dataset
                • 43 Baseline
                • 44 Research Questions
                • 45 Empirical Results
                  • 451 of Generated Test Cases
                  • 452 of Generated Observation Points
                  • 453 Effectiveness
                  • 454 Natural Randomness of Computation
                  • 455 Nature of Computational Diversity
                    • 46 Threats to Validity
                      • 5 Related work
                      • 6 Conclusion
                      • 7 Acknowledgements
                      • 8 References
Page 5: DSpot: Test Amplification for Automatic Assessment of ... · DSpot: Test Amplification for Automatic Assessment of Computational Diversity Benoit Baudryo, Simon Alliero, Marcelino

Table 1: Descriptive statistics about our dataset.

Project              Purpose                                 Class          LOC    #tests  coverage  #variants
commons-codec        Data encoding                           Base64         255    72      98%       12
commons-collections  Collection library                      TreeBidiMap    1202   111     92%       133
commons-io           Input/output helpers                    FileUtils      1195   221     82%       44
commons-lang         General purpose helpers (e.g. String)   StringUtils    2247   233     99%       22
guava                Collection library                      HashBiMap      525    35      91%       3
gson                 JSON library                            Gson           554    684     89%       145
JGit                 Java implementation of Git              CommitCommand  433    138     81%       113

3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on pairs of program variants. Given P1 and P2, the number of observation points that have different values on each variant accounts for the visible computational diversity. When we compare a set of variants, we use the mean number of differences over all pairs of variants.
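To make this comparison step concrete, here is a minimal sketch of the divergence count, under the (hypothetical) assumption that each run of the amplified test suite is summarized as a map from observation-point identifiers to observed values; this is an illustration, not DSpot's actual code.

    import java.util.Map;
    import java.util.Objects;

    class DivergenceCount {
        // Counts the observation points whose values differ between two variants.
        static int countDivergences(Map<String, Object> observationsP1,
                                    Map<String, Object> observationsP2) {
            int divergences = 0;
            for (Map.Entry<String, Object> point : observationsP1.entrySet()) {
                // A point contributes to visible computational diversity when the
                // two variants produce different values for it.
                if (!Objects.equals(point.getValue(), observationsP2.get(point.getKey()))) {
                    divergences++;
                }
            }
            return divergences;
        }
    }

For a set of variants, the reported figure is then the mean of this count over all pairs of variants, as described above.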

3.4 Implementation

Our prototype implementation amplifies Java source code.4 The test suites are expected to be written using the JUnit testing framework, which is the #1 testing framework for Java. DSpot uses Spoon [18] to manipulate the source code in order to create the amplified test cases, and it is able to amplify a test suite within minutes.

The main challenges for the implementation of DSpot were as follows: handling the many different situations that occur in real-world, large test suites (use of different versions of JUnit, modularization of the test suite code itself, implementation of new types of assertions, etc.); handling large traces for the comparison of computations (as we will see in the next section, we collect hundreds of thousands of observations on each variant); and spotting the natural randomness in test case execution to prevent false positives in the assessment of computational diversity.
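As a rough illustration of the kind of source-code manipulation involved (a hedged sketch, not DSpot's actual implementation), the following snippet uses Spoon's public API to build a model of a test source folder and list its literal values, which are the kind of elements rewritten by the input-space transformations of Section 3.2; the input path is a placeholder.

    import spoon.Launcher;
    import spoon.reflect.CtModel;
    import spoon.reflect.code.CtLiteral;
    import spoon.reflect.visitor.filter.TypeFilter;

    class ListTestLiterals {
        public static void main(String[] args) {
            // Build a Spoon model of the test sources (placeholder path).
            Launcher launcher = new Launcher();
            launcher.addInputResource("src/test/java");
            CtModel model = launcher.buildModel();

            // Every literal (numbers, booleans, Strings, ...) found in the tests
            // is a candidate for the value transformations applied during amplification.
            for (CtLiteral<?> literal : model.getElements(new TypeFilter<CtLiteral<?>>(CtLiteral.class))) {
                System.out.println(literal.getPosition() + " : " + literal);
            }
        }
    }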

4. EVALUATION

To evaluate whether DSpot is capable of detecting computational diversity, we set up a novel empirical protocol and apply it to 7 large-scale Java programs. Our guiding research question is: is DSpot capable of identifying realistic, large-scale programs that are computationally diverse?

4.1 Protocol

First, we take large open-source Java programs that are equipped with good test suites. Second, we forge variants of those programs using a technique from our previous work [2]. We call the variants sosie programs.5

Definition 1 (Sosie, noun). Given a program P, a test suite TS for P and a program transformation T, a variant P' = T(P) is a sosie of P if the two following conditions hold: 1) there is at least one test case in TS that executes the part of P that is modified by T; 2) all test cases in TS pass on P'.
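Stated operationally, and with entirely hypothetical types (this is not DSpot's API), Definition 1 boils down to the following check:

    class SosieCheck {
        // Hypothetical interface standing for a test suite with coverage information.
        interface TestSuite {
            boolean covers(String modifiedLocation);   // at least one test executes this location
            boolean passesOn(Object programVariant);   // all test cases are green on the variant
        }

        // P' = T(P) is a sosie of P iff the transformed location is covered by TS
        // and the whole test suite still passes on P'.
        static boolean isSosie(Object variant, String modifiedLocation, TestSuite ts) {
            return ts.covers(modifiedLocation) && ts.passesOn(variant);
        }
    }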

4 The prototype is available at http://diversify-project.github.io/test-suite-amplification.html
5 The word sosie is a French word that literally means "lookalike".

Given an initial program, we synthesize sosies with source code transformations that are based on the modification of the abstract syntax tree (AST). As in previous work [16, 22], we consider three families of transformation that manipulate statement nodes of the AST: 1) remove a node in the AST (Delete); 2) add a node just after another one (Add); 3) replace a node by another one, e.g., a statement node is replaced by another statement (Replace). For "Add" and "Replace", the transplantation point refers to where a statement is inserted, the transplant statement refers to the statement that is copied and inserted, and both transplantation and transplant points are in the same AST (we do not synthesize new code nor take code from other programs). We consider transplant statements that manipulate variables of the same type as the transplantation point, and we bind the names of the variables in the transplant to names that are in the namespace of the transplantation point. We call these transformations Steroid transformations; more details are available in our previous work [2].
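As a purely illustrative, hypothetical example (not taken from our dataset), a "Delete" transformation removes one statement; the resulting variant can still pass a test suite that never exercises the deleted statement, while behaving differently outside the specified input space:

    class DeleteTransformationExample {
        // Original method: contains a defensive check.
        static String normalizeOriginal(String raw) {
            if (raw == null) {
                raw = "";          // defensive check on unspecified input
            }
            return raw.trim();
        }

        // Variant after "Delete": the defensive check was removed. A test suite that
        // only passes non-null strings still passes, but the variant throws a
        // NullPointerException on null input, i.e., outside the specified domain.
        static String normalizeVariant(String raw) {
            return raw.trim();
        }
    }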

Once we have generated sosie programs, we manually select a set of sosies that indeed expose some computational diversity. Third, we amplify the original test suites using our approach, and also using a baseline technique by Yoo and Harman [25], presented in Section 4.3. Finally, we run both amplified test suites and measure the proportion of variants (sosies) that are detected as computationally different. We also collect additional metrics to further qualify the effectiveness of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our experiments. The inclusion criteria are the following: 1) the subject program must be real-world software; 2) the subject program must be written in Java; 3) the subject program's test suite must use the JUnit testing framework; 4) the subject program must have a good test suite (a statement coverage higher than 80%).

This results in Apache Commons Math, Apache Commons Lang, Apache Commons Collections, Apache Commons Codec, as well as Google GSON and Guava. The dominance of Apache projects is due to the fact that they are among the very rare organizations with a very strong development discipline.

In addition, we aim at running the whole experiment in less than one day (24 hours). Consequently, we take a single class for each of those projects, as well as all the test cases that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset. It gives the subject program identifier, its purpose, the class we consider, the class' number of lines of code (LOC), the number of tests that execute at least one method of the class under consideration, the statement coverage, and the total number of program variants we consider (excluding the original program). We see that this benchmark covers different domains, such as data encoding and collections, and is only composed of well-tested classes. In total, there are between 12 and 145 computationally diverse variants of each program to be detected. This variation comes from the relative difficulty of manually forging computationally diverse variants, depending on the project.

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, namely data shifting (λx.x+1 and λx.x−1) and data scaling (multiply or divide the value by 2), together with a hill-climbing algorithm based on the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available; we removed the hill-climbing part since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more new transformation operators on test cases; DSpot considers a richer observation space based on arbitrary data types and sequences of method calls.
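A minimal sketch of these four value transformations (our reading of TDR for this reimplementation, not Yoo and Harman's actual code): each numerical literal x in a test case yields four neighbor values.

    class TdrNeighbors {
        // Data shifting (x+1, x-1) and data scaling (x*2, x/2) applied to a
        // numerical literal found in a test case.
        static int[] neighbors(int x) {
            return new int[] { x + 1, x - 1, x * 2, x / 2 };
        }
    }

DSpot additionally transforms boolean and String literals and whole statements, which is where the analytical difference between the two approaches lies.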

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR; the one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or from the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper into the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary, even on the same program. This phenomenon is due to the natural randomness of computation. To answer this question quantitatively, we count the number of discarded observation points; to answer it qualitatively, we discuss one case study.

RQ3b: what is the richness of computational diversity? Now we want to really understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot on our dataset.

4.5.1 # of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go in pairs: the first line of a pair provides data for one subject program, and the following one provides the same data gathered with the test suite amplified by DSpot. The columns are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular: some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let's first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive: we create between 4x and 12x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (TC exec) and column 5 the number of assertions executed, or the number of observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of test cases actually executed is impacted by amplification.

Table 2: The performance of DSpot on amplifying 7 Java test suites.

                     ------ Static ------      -------------------- Dynamic --------------------
                     #TC         #assert/obs   #TC exec   #assert/obs exec   #disc. obs   branch cov.   path cov.
codec                72          509           72         3528               -            124           1245
codec-DSpot          672 (×9)    10597 (×20)   672        16920              12           126           12461
collections          111         433           768        7035               -            223           376
collections-DSpot    1291 (×12)  14772 (×34)   9202       973096             0            224           465
io                   221         1330          262        1346               -            366           246
io-DSpot             2518 (×11)  20408 (×15)   2661       209911             54313        373           287
lang                 233         2206          233        2266               -            1014          797
lang-DSpot           988 (×4)    12854 (×6)    12854      57856              18           1015          901
guava                35          84            14110      20190              -            60            77
guava-DSpot          625 (×18)   6834 (×81)    624656     9464               0            60            77
gson                 684         1125          671        1127               -            106           84
gson-DSpot           4992 (×7)   26869 (×24)   4772       167150             144          108           137
JGit                 138         176           138        185                -            75            1284
JGit-DSpot           2152 (×16)  90828 (×516)  2089       92856              13377        75            1735

Table 3: The effectiveness of computational diversity detection.

                     variants detected   variants detected   input space   observation    mean # of
                     by DSpot            by TDR              effect        space effect   divergences
commons-codec        12/12               10/12               12/12         10/12          219
commons-collections  133/133             133/133             133/133       133/133        52079
commons-io           44/44               18/44               42/44         18/44          4055
commons-lang         22/22               0/22                10/22         0/22           229
guava                3/3                 0/3                 0/3           3/3            2
gson                 145/145             0/145               134/145       0/145          8015
jgit                 113/113             0/113               113/113       0/113          15654

For instance, for commons-collections, there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods; consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last two columns of Table 2 provide deeper insights about the synthesized test cases. The branch coverage column gives the branch coverage of the original test suites and of the amplified ones (lines with -DSpot identifiers). While the original test suites already have a very high branch coverage, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Gson, the total number of different paths covered in the methods under test increases from 84 to 137. This means that while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec:6 the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method and to the fact that its behavior directly depends on the value of an array of bytes that takes many new values in the amplified test suite.

6 Line 331 in the Base64 class: https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

The amplification process is massive and produces rich new input points: the number of declared and executed test cases and the diversity of executions from test cases increase.

4.5.2 # of Generated Observation Points

Now we focus on the observation points. The third column of Table 2 ("assert or obs") gives the number of assertions in the original test suites: this corresponds to the number of locations where the tester specifies expected values about the state of the program execution. For the amplified test suites, the same column gives the number of observation points. We do not call them assertions since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants in order to assess their computational diversity.

As we can see, we observe the program state on many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, hence they reveal visible and exploitable computational diversity. However, this number also encompasses the observation points in the newly generated test cases.

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether computational diversity is revealed by new input points, new observation points, or both; we will come back to them later.

As we can see, DSpot is capable of detecting all computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec), or it does not detect them at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, including constructor calls, method invocations, and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in the "input space effect" and "observation space effect" columns of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation methods used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space and 10/12 are detected by exploring the observation space. This means that 10 of them are detected by either kind of exploration. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.
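To illustrate why observation points are often correlated, here is a small, self-contained example in the spirit of the amplified tests (the logAssertArgument helper below is a stand-in for the observation logger used in the listings of this paper, not its real implementation):

    import java.util.ArrayList;
    import java.util.List;

    class CorrelatedObservations {
        // Stand-in for the observation logger used in the amplified tests.
        static void logAssertArgument(Object observedValue) {
            System.out.println("observed: " + observedValue);
        }

        public static void main(String[] args) {
            List<String> values = new ArrayList<>();
            values.add("a");
            // Two observation points that inspect the same facet of the state:
            // a variant that diverges on size() will usually diverge on isEmpty()
            // too, so divergence counts are high and not independent.
            logAssertArgument(values.isEmpty());
            logAssertArgument(values.size());
        }
    }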

The systematic exploration of the input and the observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems. If the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions of the JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this would produce a false positive for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.
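A minimal sketch of this filtering phase, under the same hypothetical trace representation as in Section 3.3 above (one map from observation-point identifiers to observed values per run): points whose value is not stable across repeated runs of the same program are discarded.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Objects;
    import java.util.Set;

    class NaturalRandomnessFilter {
        // Keeps only the observation points whose value is identical in every run
        // of the same program; the unstable ones would cause false positives.
        static Set<String> stablePoints(List<Map<String, Object>> runs) {
            Map<String, Object> reference = runs.get(0);
            Set<String> stable = new HashSet<>(reference.keySet());
            for (Map<String, Object> run : runs) {
                stable.removeIf(point -> !Objects.equals(reference.get(point), run.get(point)));
            }
            return stable;
        }
    }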

Our experimental protocol enables us to quantify the number of discarded observation points: the 6th column of Table 2 gives this number.

void testCanonicalEmptyCollectionExists() {
    if (((supportsEmptyCollections()) && (isTestSerialization())) && (!(skipSerializedCanonicalTests()))) {
        Object object = makeObject();
        if (object instanceof Serializable) {
            String name = getCanonicalEmptyCollectionName(object);
            File f = new java.io.File(name);
            // observation on f
            Logger.logAssertArgument(f.getCanonicalPath());
            Logger.logAssertArgument(f.getAbsolutePath());
        }
    }
}

Listing 3: An amplified test case with observation points that naturally vary, hence are discarded by DSpot.

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories that are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies when running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good sign: it shows that the encapsulation is good; more than providing an intuitive API, more than providing protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = openOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = new FileOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io.

void testCopyDirectoryPreserveDates() {
    try {
        File sourceFile = new File(sourceDirectory, "hello*txt");
        FileUtils.writeStringToFile(sourceFile, "HELLOWORLD", "UTF8");
    } catch (Exception e) {
        DSpot.observe(e.getMessage());
    }
}

Listing 5: Amplified test case that reveals computational diversity between the variants of Listing 4.

These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name: as shown in the test case of Listing 5, a "." is changed into a star "*". This makes the file name an invalid one. Running this test on the variant results in a FileNotFoundException.

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON string with slightly different formats, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON string could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on instances of StringWriter, which are modified by toJson().

// Original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ...
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ...
    writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON.

public void testWriteMixedStreamed_remove534() throws IOException {
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6.

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value of the encodedInt3 variable (one character of the original String is removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points. The single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmers: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified in the test suite but still induce diversity (different checks on inputs, different formatting, different handling of special cases).

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available.7

Second, we have forged the computationally diverse program variants. In the end, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

7 http://diversify-project.github.io/test-suite-amplification.html

// Original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1: // 6 bits - ignore entirely
            // not currently tested, perhaps it is impossible
            break;
        // ...
    }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1:
        // ...
    }
}

Listing 8: Two variants of decode in commons-codec.

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
        + "rY+1LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between the variants of Listing 8.

This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK

The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by Evosuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is the way they assess computational diversity: with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences of up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of the call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals, as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations for, respectively, exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Beyond the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of flaky tests are at the border of the naturally nondeterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions, CSDA '98, pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Trans. on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.

Page 6: DSpot: Test Amplification for Automatic Assessment of ... · DSpot: Test Amplification for Automatic Assessment of Computational Diversity Benoit Baudryo, Simon Alliero, Marcelino

the total number of program variants we consider (excludingthe original program) We see that this benchmark coversdifferent domains such as data encoding and collectionsand is only composed of well-tested classes In total thereare between 12 and 145 computationally diverse variants ofeach program to be detected This variation comes fromthe relative difficulty of manually forging computationallydiverse variants depending on the project

43 BaselineIn the area of test suite amplification the work by Yoo

and Harman [25] is the most closely related to our approachTheir technique is designed for augmenting input space cov-erage but can be directly applied to detecting computationaldiversity Their algorithm called test data regeneration ndashTDR for short ndash is based on four transformations on nu-merical values in test cases data shifting (λxx + 1 andλxx minus 1 ) and data scaling (multiply or divide the valueby 2) and a hill-climbing algorithm based on the number offitness function evaluations They consider that a test casecalls a single function their implementation deals only withnumerical functions and they consider the numerical outputof that function as the only observation point In our exper-iment we reimplemented the transformations on numericalvalues since the tool used by Yoo is not available We removethe hill-climbing part since it is not relevant in our case An-alytically the key differences between DSpot and TDR areTDR stacks mutliple transformations together DSpot hasmore new transformation operators on test cases DSpotconsiders a richer observation space based on arbitrary datatypes and sequences of method calls

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR. The one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper into the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary even on the same program. This phenomenon is due to the natural randomness of computation. To answer this question quantitatively, we count the number of discarded observation points; to answer it qualitatively, we discuss one case study.

RQ3b: what is the richness of computational diversity? Now we want to really understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot to our dataset.

4.5.1 Number of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go in pairs: one provides data for a subject program, and the following one provides the same data gathered with the test suite amplified by DSpot. Columns 2 to 5 are organized in two groups: the first group gives a static view of the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests, as illustrated by the sketch below.
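The following minimal sketch is our own illustration of such a generic test; it is not taken from the commons-collections test suite, and the class and method names are hypothetical.

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;

// Sketch of a "generic test": the contract is written once and executed
// for every implementation under test, so it is counted several times
// in the dynamic view of Table 2.
public class CollectionContractTest {

    // Each concrete suite provides the implementation to test.
    interface CollectionFactory {
        Collection<String> makeCollection();
    }

    // The generic test case: a contract shared by all collection implementations.
    static void testAddIncreasesSize(CollectionFactory factory) {
        Collection<String> c = factory.makeCollection();
        int before = c.size();
        c.add("element");
        assert c.size() == before + 1 : "add() must increase size by one";
    }

    public static void main(String[] args) {
        // The same generic test is executed once per implementation.
        testAddIncreasesSize(ArrayList::new);
        testAddIncreasesSize(LinkedList::new);
    }
}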

Let's first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive. We create between 4x and 12x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (TC exec) and column 5 the number of assertions executed or the number of observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of executed generated tests is impacted by amplification.

Table 2: The performance of DSpot on amplifying 7 Java test suites.
(TC = test cases; assert/obs = assertions or observations; disc. obs = discarded observation points; branch cov. and path cov. = branch and path coverage)

                     Static                     Dynamic
                     TC          assert/obs     TC exec   assert/obs exec  disc. obs  branch cov.  path cov.
codec                72          509            72        3528             -          124          1245
codec-DSpot          672 (×9)    10597 (×20)    672       16920            12         126          12461
collections          111         433            768       7035             -          223          376
collections-DSpot    1291 (×12)  14772 (×34)    9202      973096           0          224          465
io                   221         1330           262       1346             -          366          246
io-DSpot             2518 (×11)  20408 (×15)    2661      209911           54313      373          287
lang                 233         2206           233       2266             -          1014         797
lang-DSpot           988 (×4)    12854 (×6)     12854     57856            18         1015         901
guava                35          84             14110     20190            -          60           77
guava-DSpot          625 (×18)   6834 (×81)     624656    9464             0          60           77
gson                 684         1125           671       1127             -          106          84
gson-DSpot           4992 (×7)   26869 (×24)    4772      167150           144        108          137
JGit                 138         176            138       185              -          75           1284
JGit-DSpot           2152 (×16)  90828 (×516)   2089      92856            13377      75           1735

Table 3: The effectiveness of computational diversity detection.
(x/y = x variants detected out of y forged variants)

                     detected by DSpot  detected by TDR  input space effect  observation space effect  mean number of divergences
commons-codec        12/12              10/12            12/12               10/12                     219
commons-collections  133/133            133/133          133/133             133/133                   52079
commons-io           44/44              18/44            42/44               18/44                     4055
commons-lang         22/22              0/22             10/22               0/22                      229
guava                3/3                0/3              0/3                 3/3                       2
gson                 145/145            0/145            134/145             0/145                     8015
jgit                 113/113            0/113            113/113             0/113                     15654

For instance, for commons-collections, there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods. Consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last two columns of Table 2 provide deeper insights about the synthesized test cases. The branch coverage column gives the branch coverage of the original test suites and the amplified ones (lines with the -DSpot suffix). While the original test suites already have a very high branch coverage rate, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Guava, the total number of different paths covered in the methods under test increases from 84 to 137. This means that, while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec (line 331 in the Base64 class, https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java): the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method and to the fact that its behavior directly depends on the value of an array of bytes that takes many new values in the amplified test suite.

The amplification process is massive and produces rich new input points: the number of declared and executed test cases and the diversity of executions from test cases increase.

4.5.2 Number of Generated Observation Points

Now we focus on the observation points. The assert/obs column of Table 2 gives, for the original test suites, the number of assertions. This corresponds to the number of locations where the tester specifies expected values about the state of the program execution. For the amplified test suites, the same column gives the number of observation points. We do not call them assertions since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants in order to assess the computational diversity.

As we can see, we observe the program state at many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, hence they reveal visible and exploitable computational diversity. However, this number also encompasses the observation points in the newly generated test cases.
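To make the idea concrete, the sketch below derives observation points from the public API of an object by invoking its parameter-less, non-void methods and recording the returned values for later comparison across variants. This is our own rough illustration of the idea, not DSpot's implementation; the class name ObservationCollector is hypothetical.

import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: observation points as calls to "getter-like" methods of an object
// (public, no parameters, non-void return), whose values are logged.
public class ObservationCollector {

    static Map<String, Object> observe(Object target) {
        Map<String, Object> observations = new LinkedHashMap<>();
        for (Method m : target.getClass().getMethods()) {
            boolean getterLike = m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && Modifier.isPublic(m.getModifiers())
                    && !m.getDeclaringClass().equals(Object.class); // skip getClass(), wait(), etc.
            if (!getterLike) {
                continue;
            }
            try {
                observations.put(m.getName(), m.invoke(target));
            } catch (Exception e) {
                observations.put(m.getName(), "exception: " + e.getClass().getSimpleName());
            }
        }
        return observations;
    }

    public static void main(String[] args) {
        // Example: observing a list yields values for isEmpty(), size(), toString(), etc.
        ArrayList<String> list = new ArrayList<>();
        list.add("a");
        System.out.println(observe(list));
    }
}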

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore in more depth whether computational diversity is revealed by new input points, by new observation points, or by both; we will come back to them later.

As we can see, DSpot is capable of detecting all computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec), or none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, including constructor calls, method invocations, and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in the columns "input space effect" and "observation space effect" of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation method used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space and 10/12 are detected by exploring the observation space. This means that 10 of them can be detected by either one exploration or the other. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and the observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems: if the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this would produce a false positive for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.
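A minimal sketch of this filtering phase is given below. It is our own illustration of the idea described above, not DSpot's code: the observations are run several times on the same program, and every observation point whose logged value is not identical across runs is discarded.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Supplier;

// Sketch: detect "natural randomness" by running the same observations several
// times on the same program and discarding observation points whose value varies.
public class NaturalRandomnessFilter {

    // One run maps an observation-point id to the logged value.
    static Set<String> varyingPoints(Supplier<Map<String, Object>> oneRun, int runs) {
        Map<String, Object> reference = oneRun.get();
        Set<String> discarded = new HashSet<>();
        for (int i = 1; i < runs; i++) {
            Map<String, Object> current = oneRun.get();
            for (String point : reference.keySet()) {
                if (!Objects.equals(reference.get(point), current.get(point))) {
                    discarded.add(point); // naturally varying -> would cause false positives
                }
            }
        }
        return discarded;
    }

    public static void main(String[] args) {
        // Toy example: one stable observation and one that varies between runs.
        Supplier<Map<String, Object>> run = () -> {
            Map<String, Object> obs = new HashMap<>();
            obs.put("list.size()", 3);                             // stable
            obs.put("object.hashCode()", new Object().hashCode()); // varies (identity hash)
            return obs;
        };
        System.out.println("discarded: " + varyingPoints(run, 100));
    }
}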

Our experimental protocol enables us to quantify the number of discarded observation points. The 6th column of Table 2 gives this number.

void testCanonicalEmptyCollectionExists() {
    if (((supportsEmptyCollections()) && (isTestSerialization())) && (!(skipSerializedCanonicalTests()))) {
        Object object = makeObject();
        if (object instanceof Serializable) {
            String name = getCanonicalEmptyCollectionName(object);
            File f = new java.io.File(name);
            // observation on f
            Logger.logAssertArgument(f.getCanonicalPath());
            Logger.logAssertArgument(f.getAbsolutePath());
        }
    }
}

Listing 3: An amplified test case with observation points that naturally vary, hence are discarded by DSpot

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories, which are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies when running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good sign: it shows that the encapsulation is strong; beyond providing an intuitive API and a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand in more depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = openOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = new FileOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io

void testCopyDirectoryPreserveDates() {
    try {
        File sourceFile = new File(sourceDirectory, "hello*txt");
        FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
    } catch (Exception e) {
        DSpot.observe(e.getMessage());
    }
}

Listing 5: Amplified test case that reveals computational diversity between variants of Listing 4

These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a "." in the file name is changed into a star "*". This makes the file name an invalid one. Running this test on the variant results in a FileNotFoundException.
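For illustration, a String-literal transformation of this kind can be sketched as follows. This is an approximation of the literal mutation operators of Section 3.2, not DSpot's code; the class name, the fixed seed, and the choice of '*' as replacement character are our own.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of DSpot-style transformations on String literals found in test cases:
// replace one character or remove one character.
public class StringLiteralMutator {

    static final Random RANDOM = new Random(42); // fixed seed for reproducibility

    static String replaceOneChar(String s, char replacement) {
        int i = RANDOM.nextInt(s.length());
        return s.substring(0, i) + replacement + s.substring(i + 1);
    }

    static String removeOneChar(String s) {
        int i = RANDOM.nextInt(s.length());
        return s.substring(0, i) + s.substring(i + 1);
    }

    static List<String> neighbors(String literal) {
        List<String> result = new ArrayList<>();
        result.add(replaceOneChar(literal, '*'));
        result.add(removeOneChar(literal));
        return result;
    }

    public static void main(String[] args) {
        // "hello.txt" may become e.g. "hello*txt", an invalid file name on some
        // file systems, which is the kind of input that exposed the diversity
        // between the two writeStringToFile variants.
        System.out.println(neighbors("hello.txt"));
    }
}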

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON output with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on the instance of StringWriter that is modified by toJson().

// original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ...
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ...
    writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON

public void testWriteMixedStreamed_remove534() throws IOException {
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among variants of Listing 6

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value in the encodedInt3 variable (the original String has an additional character, removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points. The single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmer: case 1 is possible, it can be triggered from the API.

These three examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified in the test suite but still induce diversity: different checks on inputs, different formatting, and different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available at http://diversify-project.github.io/test-suite-amplification.html.

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

// original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1: // 6 bits - ignore entirely
            // not currently tested; perhaps it is impossible?
            break;
        // ...
    }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1:
            // ...
    }
}

Listing 8: Two variants of decode in commons-codec

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
            + "rY+1LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between variants of Listing 8

This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than in our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK

The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity, and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of this work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resources that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations for, respectively, exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). In our view, the assertions of the flaky tests are at the border of the naturally nondeterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests, so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions (CSDA '98), pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE '03), pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.

  • 1 Introduction
  • 2 Background
    • 21 N-version programming
    • 22 NVP-Diversity
    • 23 Graphical Explanation
    • 24 Unspecified Input Space
      • 3 Our Approach to Detect Computational Diversity
        • 31 Overview
        • 32 Test Suite Transformations
          • 321 Exploring the Input Space
          • 322 Adding Observation Points
            • 33 Detecting and Measuring the Visible Computational Diversity
            • 34 Implementation
              • 4 Evaluation
                • 41 Protocol
                • 42 Dataset
                • 43 Baseline
                • 44 Research Questions
                • 45 Empirical Results
                  • 451 of Generated Test Cases
                  • 452 of Generated Observation Points
                  • 453 Effectiveness
                  • 454 Natural Randomness of Computation
                  • 455 Nature of Computational Diversity
                    • 46 Threats to Validity
                      • 5 Related work
                      • 6 Conclusion
                      • 7 Acknowledgements
                      • 8 References
Page 7: DSpot: Test Amplification for Automatic Assessment of ... · DSpot: Test Amplification for Automatic Assessment of Computational Diversity Benoit Baudryo, Simon Alliero, Marcelino

Table 2 The performance of DSpot on amplifying 7 Java test suitesStatic Dynamic

TC assert orobs

TC exec assert orobs exec

disc obs branchcov

pathcov

codec 72 509 72 3528 124 1245codec-DSpot 672 (times9) 10597 (times20) 672 16920 12 126 12461collections 111 433 768 7035 223 376collections-DSpot 1291 (times12) 14772 (times34) 9202 973096 0 224 465io 221 1330 262 1346 366 246io-DSpot 2518 (times11) 20408 (times15) 2661 209911 54313 373 287lang 233 2206 233 2266 1014 797lang-DSpot 988 (times4) 12854 (times6) 12854 57856 18 1015 901guava 35 84 14110 20190 60 77guava-DSpot 625 (times18) 6834 (times81) 624656 9464 0 60 77gson 684 1125 671 1127 106 84gson-DSpot 4992 (times7) 26869 (times24) 4772 167150 144 108 137JGit 138 176 138 185 75 1284JGit-DSpot 2152 (times16) 90828 (times516) 2089 92856 13377 75 1735

Table 3 The effectiveness of computational diversity detectionvariants de-tected by DSpot

variants de-tected by TDR

input space effect observation spaceeffect

mean of diver-gences

commons-codec 1212 1012 1212 1012 219commons-collections 133133 133133 133133 133133 52079commons-io 4444 1844 4244 1844 4055commons-lang 2222 022 1022 022 229guava 33 03 03 33 2gson 145145 0145 134145 0145 8015jgit 113113 0113 113113 0113 15654

there are 1291 tests in the amplified test suite but alto-gether 9202 test cases are executed The reason is that wesynthesize new test cases that use other generic test meth-ods Consequently this increases the number of executedgeneric test methods which is included in our count

Our test case transformations yield a rich exploration ofthe input space Columns 7 to 11 of Table 2 provide deeperinsigths about the synthesized test cases Colum 7 gives thebranch coverage of the original test suites and the amplifiedones (lines with -DSPOT identifiers) While original testsuites have a very high branch coverage rate yet DSpot isstill able to generate new teststhat cover a few previouslyuncovered branches For instance the amplified test suitefor commons-ioFileUtils reaches 7 branches that were notexecuted by the original test suite Meanwhile the originaltest suite for guavaHashBiMap already covers 90 of thebranches and DSpot did not generate test cases that covernew branches

The richness of the amplified test suite is also revealed inthe last column of the table (path coverage) it provides thecumulative number of different paths executed by the testsuite in all methods under test The amplified test suitescover much more paths than the original ones which meansthat they trigger a much wider set of executions of the classunder test than the original test suites For instance forGuava the total number of different paths covered in themethods under test increases from 84 to 137 This meansthat while the amplified test suite does not cover many newbranches it executes the parts that were already coveredin many novel ways increasing the diversity of executionsthat are tested There is one extreme case in the encode

method of commons-codec6 the original test suite covers780 different paths in this method while the amplified testsuite covers 11356 different paths This phenomenon is dueto the complex control flow of the method and to the factthat its behavior directly depends on the value of an arrayof bytes that takes many new values in the amplified testsuite

The amplification process is massive and producesrich new input points the number of declared and ex-ecuted test cases and the diversity of executions fromtest cases increase

452 of Generated Observation PointsNow we focus on the observation points The fourth col-

umn of Table 2 gives the number of assertions in original testsuite This corresponds to the number of locations wherethe tester specifies expected values about the state of theprogram execution The fifth column gives the number ofobservation points in the amplified test suite We do not callthem assertions since they do not contain an expected valueie there is no oracle Recall that we use those observationpoints to compare the behavior of two program variants inorder to assess the computational diversity

As we can see we observe the program state on manymore observation points than the original assertions As dis-cussed in Section 22 those observations points use the API

6line 331 in the Base64 class httpsgithubcomapachecommons-codecblobca8968be63712c1dcce006a6d6ee9ddcef0e0a51srcmainjavaorgapachecommonscodecbinaryBase64java

of the program under consideration hence allow to revealvisible and exploitable computational diversity Howeverthis number also encompasses the observation points on thenew generated test cases

If we look at the dynamic perspective (second part of Ta-ble 2) one observes the same phenomenon as for test casesand assertions there are many more points actually ob-served during test execution than statically declared onesThe reasons are identical many observations points are ingeneric test methods that are executed several times or arewithin loops in test code

These results validate our initial intuition that a testsuite only covers a small portion of the observationspace It is possible to observe the program state frommany other observation points

453 EffectivenessWe want to assess whether our method is effective for iden-

tifying computationally diverse program variants As goldentruth we have the forged variants for which we know thatthey are NVP-diverse (see Section 41) their numbers aregiven in the descriptive Table 1 The benchmark is publiclyavailable at httpdiversify-projecteudata

We run DSpot and TDR to see whether those two tech-niques are able to detect the computationally diverse pro-grams Table 3 gives the results of this evaluation The firstcolumn contains the name of the subject program The sec-ond column gives the number of variants detected by DSpotThe third column gives the number of variants detected byTDR The last three columns explore more in depth whethercomputational diversity is reveales by new input points ornew observation points or both we will come back to themlater

As we can see DSpot is capable of detecting all computa-tionally diverse variants of our benchmark On the contrarythe baseline technique TDR is always worse Either it de-tects only a fraction of them (eg 1012 for commonscodec)or even not at all The reason is that TDR as originally pro-posed by Yoo and Harman focuses on simple programs withshallow input spaces (one single method with integer argu-ments) On the contrary DSpot is designed to handle richinput spaces incl constructor calls method invocationsand strings This has a direct impact on the effectiveness ofdetecting computational diversity in program variants

Our technique is based on two insights the amplificationof the input space and the amplification of the observationspace We now want to understand the impact of each ofthem To do so we disable one or the other kind of ampli-fication and measure the number of detected variants Theresult of this experiment is given in the last two columns ofTable 3 Column ldquoinput space effectrdquo gives the number ofvariants that are detected only by the exploration of the in-put space (ie by observing the program state only with theobservation method used in the original assertions) Columnldquoobservation space effectrdquo gives the number of variants thatare detected only by the exploration of the observation space(ie by observing the result of method calls on the objectsinvolved in the test) For instance for commons-codec allvariants (1212) are detected by exploring the input spaceand 1012 are detected by exploring the observation spaceThis means that 10 of them are detected are detected either

by one exploration or the other one On the contrary forguava only the exploration of the observation space enablesDSpot to detect the three computationally diverse variantsof our benchmark

By comparing columns ldquoinput space effectrdquo and ldquoobserva-tion space effectrdquo one sees that our two explorations are notmutually exclusive and are complementary Some variantsare detected by both kinds of exploration (as in the case ofcommons-codec) For some subjects only the explorationof the input space is effective (eg commons-lang) whilefor others (guava) this is the opposite Globally the explo-ration of the input space is more efficient most variants aredetected this way

Let us now consider the last column of Table 3 It givesthe mean number of observation points for which we observea difference between the original program and the variantto be detected For instance among the 12 variants forcommonscodec there is on average 219 observation pointsfor which there is a difference Those numbers are highshowing that the observation points are not independentMany of the methods we call to observe the program stateinspect a different facet of the same state For instance ina list the methods isEmpty() and size are semanticallycorrelated

The systematic exploration of the input and the ob-servation spaces is effective at detecting behavioral di-versity between program variants

454 Natural Randomness of ComputationWhen experimenting with DSpot on real programs we

noticed that some observation points naturally vary evenwhen running the same test case several times on the sameprogram For instance a hashcode that takes into accounta random salt can be different between two runs of the sametest case We call this effect the ldquonatural randomnessrdquo oftest case execution

We distinguish two kinds of natural variations in the ex-ecution of test suites First some observation points varyover time when the test case is executed several times on thesame environment (same machine OS etc) This is the casefor the hashcode example Second some observation pointsvary depending on the execution environment For instanceif one adds an observation point on a file name the pathname convention is different on Unix and Windows systemsIf method getAbsolutePath is an observation point it mayreturn tmpfootxt on Unix and Ctmpfootxt onWindows While this first example is pure randomness thesecond only refers to variations in the runtime environment

Interestingly this natural randomness is not problematicin the case of the original test suites because it remainsbelow the level of observation of the oracles (the test suiteassertions in JUnit test suites) However in our case if onekeeps an observation point that is impacted by some naturalrandomness this would produce a false positive for com-putational diversity detection Hence as explained in Sec-tion 3 one phase of DSpot consists in detecting the naturalrandomness first and discarding the impacting observationpoints

Our experimental protocol enables us to quantify the num-ber of discarded observation points The 6th column ofTable 2 gives this number For instance for commons-

1 void testCanonicalEmptyCollectionExists () if ((( supportsEmptyCollections ()) ampamp (

isTestSerialization ())) ampamp ((skipSerializedCanonicalTests ())))

3 Object object = makeObject ()if (object instanceof Serializable)

5 String name = getCanonicalEmptyCollectionName(object)

File f = new javaioFile(name)7 observation on f

LoggerlogAssertArgument(fgetCanonicalPath ())9 LoggerlogAssertArgument(fgetAbsolutePath ())

11

Listing 3 An amplified test case with observation pointsthat naturally vary hence are discarded by DSpot

codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

455 Nature of Computational DiversityNow we want to understand more in depth the nature of

the NVP-diversity we are observing Let us discuss threecase studies

Listing 4 shows two variants of the writeStringToFile()

method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

original program2 void writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

OutputStream out = null4 out = openOutputStream(file append)

IOUtilswrite(data out encoding)6 outclose()

8 variantvoid writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

10 OutputStream out = nullout = new FileOutputStream(file append)

12 IOUtilswrite(data out encoding)outclose()

Listing 4 Two variants of writeStringToFile incommonsio

1 void testCopyDirectoryPreserveDates () try

3 File sourceFile = new File(sourceDirectory hellotxt)

FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

5 catch (Exception e) DSpotobserve(egetMessage ())

7

Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

FileOutputStream These two variants behave differentlyoutside the specified domain in case writeStringToFile()

is called with an invalid file name the original program han-dles it while the variant throws a FileNotFoundExceptionOur test transformation operator on String values producessuch a file name as shown in the test case of listing 5 aldquordquo is changed into a star ldquordquo This made the file name aninvalid one Running this test on the variant results in aFileNotFoundException

Let us now consider listing 6 which shows two variantsof the toJson() method from the Google Gson library Thelast statement of the original method is replaced by anotherone instead of setting the serialization format of the writer

it set the indent format Each variant creates a JSon withslightly different formats and none of these formatting deci-sions are part of the specified domain (and actually specify-ing the exact formatting of the JSon String could be consid-ered as over-specification) The diversity among variants isdetected by the test cases displayed in figure 7 which addsan observation point (a call to toString()) on instances ofStringWriter which are modified by toJson()

Original program2 void toJson(Object src Type typeOfSrc JsonWriter

writer)writersetSerializeNulls(oldSerializeNulls)

4 variantvoid toJson(Object src Type typeOfSrc JsonWriter

writer)6 writersetIndent( )

Listing 6 Two variants of toJson in GSON

1 public void testWriteMixedStreamed_remove534 ()throws IOException

3 gsontoJson(RED_MIATA Carclass jsonWriter)

jsonWriterendArray ()5 LoggerlogAssertArgument(comgooglegson

MixedStreamTestCARS_JSON)LoggerlogAssertArgument(stringWritertoString ())

7

Listing 7 Amplified test detecting black-box diversityamong variants of listing 6

The next case study is in listing 8 two variants of themethod decode() in the Base64 class of the Apache Com-mons Codec library The original program has a switch-

case statement in which case 1 execute a break An originalcomment by the programmers indicates that it is probablyimpossible The test case in listing 9 amplifies one of theoriginal test case with a mutation on the String value in theencodedInt3 variable (the original String has an additionallsquorsquo character removed by the ldquoremove characterrdquo transfor-mation) The amplification on the observation points addsmultiple observations points The single observation pointshown in the listing is the one that detects computationaldiversity it calls the static decodeInteger() method whichreturns 1 on the original program and 0 on the variant Inaddition to validating our approach this example anecdo-tally answers the question of the programmer case 1 is pos-sible it can be triggered from the API

These three case examples are meant to give the readera better idea of how DSpot was able to detect the variantsWe discuss how augmented test cases reveal this diversity(both with amplified inputs and observation points) Weillustrate three categories of code variations that maintainthe expected functionality as specified in the test suite butstill induce diversity (different checks on input different for-matting different handling of special cases)

The diversity that we observe originates from areasof the code that are characterized by their flexibility(caching checking formatting etc) These areas arevery close to the concept of forgiving region proposedby Martin Rinard [21]

46 Threats to ValidityDSpot is able to effectively detect NVP-diversity using

test suite amplification Our experimental results are sub-ject to the following threats

First this experiment is highly computational a bug inour evaluation code may invalidate our findings Howeversince we have manually checked a sample of cases (the casestudies of Section 454 and Section 455) we have a highconfidence in our results Our implementation is publiclyavailable 7

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants. This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively, an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK

The work presented is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity, and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we use it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.
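For contrast with our String and statement transformations, a test-data-regeneration step in the style of TDR can be sketched as simple arithmetic modifications of the integer inputs of an existing test. The operator set below is a plausible illustration of such neighbourhood moves under an interaction level of 1; it is not Yoo and Harman's exact implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

public class TdrStyleRegeneration {

  // Illustrative neighbourhood operators on an integer test input; the interaction
  // level would control how many of them are applied simultaneously to one test case.
  static final List<IntUnaryOperator> OPERATORS = List.of(
      v -> v + 1,
      v -> v - 1,
      v -> v * 2,
      v -> v / 2
  );

  static List<int[]> regenerate(int[] originalInputs) {
    List<int[]> newInputs = new ArrayList<>();
    for (int i = 0; i < originalInputs.length; i++) {
      for (IntUnaryOperator op : OPERATORS) {
        int[] candidate = originalInputs.clone();
        candidate[i] = op.applyAsInt(originalInputs[i]); // interaction level 1: one change
        newInputs.add(candidate);
      }
    }
    return newInputs;
  }

  public static void main(String[] args) {
    // Starting from the integer arguments of one existing test case.
    for (int[] candidate : regenerate(new int[] {10, 4})) {
      System.out.println(java.util.Arrays.toString(candidate));
    }
  }
}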

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of this work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations, for respectively exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests, so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.
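As a rough illustration of how this limit could be characterized, the sketch below repeats a set of observations on the unmodified program and keeps only the observation points whose values are stable across runs, in the spirit of the filtering of naturally varying observation points used by DSpot. The names and the Supplier-based encoding of observation points are assumptions made for this sketch, not DSpot's API.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Supplier;

public class NaturalRandomnessFilter {

  // Run the same observations several times on the unmodified program and discard
  // any observation point whose value is not stable across runs.
  static Set<String> stablePoints(Map<String, Supplier<Object>> observations, int runs) {
    Map<String, Object> firstValues = new HashMap<>();
    Set<String> stable = new HashSet<>(observations.keySet());
    for (int run = 0; run < runs; run++) {
      for (Map.Entry<String, Supplier<Object>> point : observations.entrySet()) {
        Object value = point.getValue().get();
        if (run == 0) {
          firstValues.put(point.getKey(), value);
        } else if (!Objects.equals(firstValues.get(point.getKey()), value)) {
          stable.remove(point.getKey()); // naturally varying: unusable as an oracle
        }
      }
    }
    return stable;
  }

  public static void main(String[] args) {
    Map<String, Supplier<Object>> obs = new HashMap<>();
    obs.put("list.size", () -> 3);                              // deterministic
    obs.put("object.hashCode", () -> new Object().hashCode());  // naturally varies
    System.out.println(stablePoints(obs, 100));                 // keeps only "list.size"
  }
}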

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491-1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149-159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85-96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions, CSDA '98, pages 171-, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7-16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147-156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238-255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60-71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294-305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81-92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30-37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54-72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121-131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226-237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11-20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1-32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380-403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150-159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171-201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595-605. IEEE Press, 2012.

Page 8: DSpot: Test Amplification for Automatic Assessment of ... · DSpot: Test Amplification for Automatic Assessment of Computational Diversity Benoit Baudryo, Simon Alliero, Marcelino

of the program under consideration hence allow to revealvisible and exploitable computational diversity Howeverthis number also encompasses the observation points on thenew generated test cases

If we look at the dynamic perspective (second part of Ta-ble 2) one observes the same phenomenon as for test casesand assertions there are many more points actually ob-served during test execution than statically declared onesThe reasons are identical many observations points are ingeneric test methods that are executed several times or arewithin loops in test code

These results validate our initial intuition that a testsuite only covers a small portion of the observationspace It is possible to observe the program state frommany other observation points

453 EffectivenessWe want to assess whether our method is effective for iden-

tifying computationally diverse program variants As goldentruth we have the forged variants for which we know thatthey are NVP-diverse (see Section 41) their numbers aregiven in the descriptive Table 1 The benchmark is publiclyavailable at httpdiversify-projecteudata

We run DSpot and TDR to see whether those two tech-niques are able to detect the computationally diverse pro-grams Table 3 gives the results of this evaluation The firstcolumn contains the name of the subject program The sec-ond column gives the number of variants detected by DSpotThe third column gives the number of variants detected byTDR The last three columns explore more in depth whethercomputational diversity is reveales by new input points ornew observation points or both we will come back to themlater

As we can see DSpot is capable of detecting all computa-tionally diverse variants of our benchmark On the contrarythe baseline technique TDR is always worse Either it de-tects only a fraction of them (eg 1012 for commonscodec)or even not at all The reason is that TDR as originally pro-posed by Yoo and Harman focuses on simple programs withshallow input spaces (one single method with integer argu-ments) On the contrary DSpot is designed to handle richinput spaces incl constructor calls method invocationsand strings This has a direct impact on the effectiveness ofdetecting computational diversity in program variants

Our technique is based on two insights the amplificationof the input space and the amplification of the observationspace We now want to understand the impact of each ofthem To do so we disable one or the other kind of ampli-fication and measure the number of detected variants Theresult of this experiment is given in the last two columns ofTable 3 Column ldquoinput space effectrdquo gives the number ofvariants that are detected only by the exploration of the in-put space (ie by observing the program state only with theobservation method used in the original assertions) Columnldquoobservation space effectrdquo gives the number of variants thatare detected only by the exploration of the observation space(ie by observing the result of method calls on the objectsinvolved in the test) For instance for commons-codec allvariants (1212) are detected by exploring the input spaceand 1012 are detected by exploring the observation spaceThis means that 10 of them are detected are detected either

by one exploration or the other one On the contrary forguava only the exploration of the observation space enablesDSpot to detect the three computationally diverse variantsof our benchmark

By comparing columns ldquoinput space effectrdquo and ldquoobserva-tion space effectrdquo one sees that our two explorations are notmutually exclusive and are complementary Some variantsare detected by both kinds of exploration (as in the case ofcommons-codec) For some subjects only the explorationof the input space is effective (eg commons-lang) whilefor others (guava) this is the opposite Globally the explo-ration of the input space is more efficient most variants aredetected this way

Let us now consider the last column of Table 3 It givesthe mean number of observation points for which we observea difference between the original program and the variantto be detected For instance among the 12 variants forcommonscodec there is on average 219 observation pointsfor which there is a difference Those numbers are highshowing that the observation points are not independentMany of the methods we call to observe the program stateinspect a different facet of the same state For instance ina list the methods isEmpty() and size are semanticallycorrelated

The systematic exploration of the input and the ob-servation spaces is effective at detecting behavioral di-versity between program variants

454 Natural Randomness of ComputationWhen experimenting with DSpot on real programs we

noticed that some observation points naturally vary evenwhen running the same test case several times on the sameprogram For instance a hashcode that takes into accounta random salt can be different between two runs of the sametest case We call this effect the ldquonatural randomnessrdquo oftest case execution

We distinguish two kinds of natural variations in the ex-ecution of test suites First some observation points varyover time when the test case is executed several times on thesame environment (same machine OS etc) This is the casefor the hashcode example Second some observation pointsvary depending on the execution environment For instanceif one adds an observation point on a file name the pathname convention is different on Unix and Windows systemsIf method getAbsolutePath is an observation point it mayreturn tmpfootxt on Unix and Ctmpfootxt onWindows While this first example is pure randomness thesecond only refers to variations in the runtime environment

Interestingly this natural randomness is not problematicin the case of the original test suites because it remainsbelow the level of observation of the oracles (the test suiteassertions in JUnit test suites) However in our case if onekeeps an observation point that is impacted by some naturalrandomness this would produce a false positive for com-putational diversity detection Hence as explained in Sec-tion 3 one phase of DSpot consists in detecting the naturalrandomness first and discarding the impacting observationpoints

Our experimental protocol enables us to quantify the num-ber of discarded observation points The 6th column ofTable 2 gives this number For instance for commons-

1 void testCanonicalEmptyCollectionExists () if ((( supportsEmptyCollections ()) ampamp (

isTestSerialization ())) ampamp ((skipSerializedCanonicalTests ())))

3 Object object = makeObject ()if (object instanceof Serializable)

5 String name = getCanonicalEmptyCollectionName(object)

File f = new javaioFile(name)7 observation on f

LoggerlogAssertArgument(fgetCanonicalPath ())9 LoggerlogAssertArgument(fgetAbsolutePath ())

11

Listing 3 An amplified test case with observation pointsthat naturally vary hence are discarded by DSpot

codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

455 Nature of Computational DiversityNow we want to understand more in depth the nature of

the NVP-diversity we are observing Let us discuss threecase studies

Listing 4 shows two variants of the writeStringToFile()

method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

original program2 void writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

OutputStream out = null4 out = openOutputStream(file append)

IOUtilswrite(data out encoding)6 outclose()

8 variantvoid writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

10 OutputStream out = nullout = new FileOutputStream(file append)

12 IOUtilswrite(data out encoding)outclose()

Listing 4 Two variants of writeStringToFile incommonsio

1 void testCopyDirectoryPreserveDates () try

3 File sourceFile = new File(sourceDirectory hellotxt)

FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

5 catch (Exception e) DSpotobserve(egetMessage ())

7

Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

FileOutputStream These two variants behave differentlyoutside the specified domain in case writeStringToFile()

is called with an invalid file name the original program han-dles it while the variant throws a FileNotFoundExceptionOur test transformation operator on String values producessuch a file name as shown in the test case of listing 5 aldquordquo is changed into a star ldquordquo This made the file name aninvalid one Running this test on the variant results in aFileNotFoundException

Let us now consider listing 6 which shows two variantsof the toJson() method from the Google Gson library Thelast statement of the original method is replaced by anotherone instead of setting the serialization format of the writer

it set the indent format Each variant creates a JSon withslightly different formats and none of these formatting deci-sions are part of the specified domain (and actually specify-ing the exact formatting of the JSon String could be consid-ered as over-specification) The diversity among variants isdetected by the test cases displayed in figure 7 which addsan observation point (a call to toString()) on instances ofStringWriter which are modified by toJson()

Original program2 void toJson(Object src Type typeOfSrc JsonWriter

writer)writersetSerializeNulls(oldSerializeNulls)

4 variantvoid toJson(Object src Type typeOfSrc JsonWriter

writer)6 writersetIndent( )

Listing 6 Two variants of toJson in GSON

1 public void testWriteMixedStreamed_remove534 ()throws IOException

3 gsontoJson(RED_MIATA Carclass jsonWriter)

jsonWriterendArray ()5 LoggerlogAssertArgument(comgooglegson

MixedStreamTestCARS_JSON)LoggerlogAssertArgument(stringWritertoString ())

7

Listing 7 Amplified test detecting black-box diversityamong variants of listing 6

The next case study is in listing 8 two variants of themethod decode() in the Base64 class of the Apache Com-mons Codec library The original program has a switch-

case statement in which case 1 execute a break An originalcomment by the programmers indicates that it is probablyimpossible The test case in listing 9 amplifies one of theoriginal test case with a mutation on the String value in theencodedInt3 variable (the original String has an additionallsquorsquo character removed by the ldquoremove characterrdquo transfor-mation) The amplification on the observation points addsmultiple observations points The single observation pointshown in the listing is the one that detects computationaldiversity it calls the static decodeInteger() method whichreturns 1 on the original program and 0 on the variant Inaddition to validating our approach this example anecdo-tally answers the question of the programmer case 1 is pos-sible it can be triggered from the API

These three case examples are meant to give the readera better idea of how DSpot was able to detect the variantsWe discuss how augmented test cases reveal this diversity(both with amplified inputs and observation points) Weillustrate three categories of code variations that maintainthe expected functionality as specified in the test suite butstill induce diversity (different checks on input different for-matting different handling of special cases)

The diversity that we observe originates from areasof the code that are characterized by their flexibility(caching checking formatting etc) These areas arevery close to the concept of forgiving region proposedby Martin Rinard [21]

46 Threats to ValidityDSpot is able to effectively detect NVP-diversity using

test suite amplification Our experimental results are sub-ject to the following threats

First this experiment is highly computational a bug inour evaluation code may invalidate our findings Howeversince we have manually checked a sample of cases (the casestudies of Section 454 and Section 455) we have a highconfidence in our results Our implementation is publiclyavailable 7

Second we have forged the computationally diverse pro-gram variants Eventually as shown on Table 3 our tech-nique DSpot is able to detect them all The reason is thatwe had a bias towards our technique when forging those

7httpdiversify-projectgithubiotest-suite-amplificationhtml

Original program2 void decode(final byte[] in int inPos final int

inAvail final Context context) switch (contextmodulus)

4 case 0 impossible as excluded abovecase 1 6 bits - ignore entirely

6 not currently tested perhaps it isimpossiblebreak

8

10 variantvoid decode(final byte[] in int inPos final int

inAvail final Context context) 12 switch (contextmodulus)

case 0 impossible as excluded above14 case 1

Listing 8 Two variants of decode in commonscodec

1 Testvoid testCodeInteger3_literalMutation222 ()

3 String encodedInt3 =FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2

5 + rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==LoggerlogAssertArgument(Base64decodeInteger(

encodedInt3getBytes(CharsetsUTF_8)))7

Listing 9 Amplified test case that reveals thecomputational diversity between variants of listing 8

variants This is true for all self-made evaluations Thisthreat on the results of the comparative evaluation againstTDR is mitigated by the analytical comparison of the twoapproaches Both the input space and the output space ofTDR (respectively an integer tuple and a returned value) aresimpler and less powerful than our amplification technique

Third our experiments consider one programming lan-guage (Java) and 7 different application domains To furtherassess the external validity of our results new experimentsare required on different technologies and more applicationdomains

5 RELATED WORKThe work presented is related to two main areas the iden-

tification of similarities or diversity in source code and theautomatic augmentation of test suites

Computational diversity The recent work by Carzanigaet al [3] has a similar intent as ours automatically identify-ing dissimilarities in the execution of code fragments that arefunctionally similar They use random test cases generatedby Evosuite to get execution traces and log the internals ofthe execution (executed code and the readwrite operationson data) The main difference with our work is that theyassess computational diversity and with random testing in-stead of test amplification

Koopman and DeVale [15] aim at quantifying the diver-sity among a set of implementations of the POSIX operatingsystem with respect to their responses to exceptional con-ditions Diversity quantification in this context is used todetect which versions of POSIX provide the most differentfailure profiles and should thus be assembled to ensure faulttolerance Their approach relies on Ballista to generate mil-lions of input data and the outputs are analyzed to quantifythe difference This is an example of diversity assessment

with intensive fuzz testing and observation points on crash-ing states

Many other works look for semantic equivalence or diver-sity through static or dynamic analysis Gabel and Su [7] in-vestigate the level of granularity at which diversity emergesin source code Their main finding is that for sequencesup to 40 tokens there is a lot of redundancy Beyond this(of course fuzzy) threshold the diversity and uniquenessof source code appears Higo and Kusumoto [11] investi-gate the interplay between structural similarity vocabularysimilarity and method name similarity to assess functionalsimilarity between methods in Java programs They showthat many contextual factors influence the ability of thesesimilarity measures to spot functional similarity (eg thenumber of methods that share the same name or the factthat two methods with similar structure are in the sameclass or not) Jiang and Su [12] extract code fragments ofa given length and randomly generate input data for thesesnippets Then they identify the snippets that produce thesame output values (which are considered functionally equiv-alent wrt the set of random test inputs) They show thatthis method identifies redundancies that static clone detec-tion does not find Kawaguchi and colleagues [13] focus onthe introduction of changes that break the interface behav-ior They also use a notion of partial equivalence whereldquotwoversions of a program need only be semantically equivalentunder a subset of all inputsrdquo Gao and colleagues [8] pro-pose a graph-based analysis to identify semantic differencesin binary code This work is based on the extraction of callgraphs and control flow graphs of both variants and on com-parisons between these graphs in order to spot the semanticvariations Person and colleagues [19] developed differentialsymbolic execution which can be used to detect and char-acterize behavioral differences between program versions

Test suite amplification In the area of test suite am-plification the work by Yoo and Harman [25] is the mostclosely related to our approach and we used as the baselinefor computational diversity assessment They amplify testsuites only with transformations on integer values while wealso transform boolean and String literals as well as state-ments test cases Yoo and Harman also have two additionalparameters for test case transformation the interaction levelthat determines the number of simultaneous transformationon the same test case and the search radius that boundstheir search process when trying to improve the effectivenessof augmented test suites Their original intent is to increasethe input space coverage to improve test effectiveness Theydo not handle the oracle problem in that work

Xie [23] augments test suites for Java program with newtest cases that are automatically generated and he automat-ically generates assertions for these new test cases whichcan check for regression errors Harder et al [9] proposeto retrieve operational abstractions ie invariant propertiesthat hold for a set of test cases These abstractions are thenused to compute operational differences which detects di-versity among a set of test cases (and not among a set ofimplementations as in our case) While the authors mentionthat operational differencing can be used to augment a testsuite the generation of new test cases is out of this workrsquosscope Zhang and Elbaum [26] focus on test cases that verifyerror handling code Instead of directly amplifying the testcases as we propose they transform the program under testthey instrument the target program by mocking the exter-

nal resource that can throw exceptions which allow them toamplify the space of exceptional behaviors exposed to thetest cases Pezze et al [20] use the information providedin unit test cases about object creation and initializationto build composite test cases that focus on interactions be-tween classes Their main result is that the new test casesfind faults that could not be revealed by the unit test casesthat provided the basic material for the synthesis of compos-ite test cases Xu et al [24] refer toldquotest suite augmentationrdquoas the following process in case a program P evolves into Prsquoidentify the parts of Prsquo that need new test cases and gener-ate these tests They combine concolic and search-based testgeneration to automate this process This hybrid approachis more effective than each technique separately but with in-creased costs Dallmeier et al [4] automatically amplify testsuites by adding and removing method calls in JUnit testcases Their objective is to produce test cases that cover awider set of execution states than the original test suite inorder to improve the quality of models reverse engineeredfrom the code

6 CONCLUSIONIn this paper we have presented DSpot a novel technique

for detecting one kind of computational diversity between apair of programs This technique is based on test suite am-plification the automatic transformation of the original testsuite DSpot uses two kinds of transformations for respec-tively exploring new points in the programrsquos input space andexploring new observation points on the execution state af-ter execution with the given input points

Our evaluation on large open-source projects shows thattest suites amplified by DSpot are capable of assessing com-putational diversity and that our amplification strategy isbetter than the closest related work a technique called TDRby Yoo and Harman [25] We have also presented a deepqualitative analysis of our empirical findings Behind theperformance of DSpot our results shed an original light onthe specified and unspecified parts of real-world test suitesand the natural randomness of computation

This opens avenues for future work There is a relationbetween the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail) To usethe assertions of the flaky tests are at the border of the nat-ural undeterministic parts of the execution sometimes theyhit it sometimes they donrsquot With such a view we imag-ine an approach that characterizes this limit and proposesan automatic refactoring of the flaky tests so that they getfarther from the limit of the natural randomness and enteragain into the good old and reassuring world of determin-ism

7 ACKNOWLEDGEMENTSThis work is partially supported by the EU FP7-ICT-

2011-9 No 600654 DIVERSIFY project

8 REFERENCES[1] A Avizienis The n-version approach to fault-tolerant

software IEEE Transactions on Software Engineering(12)1491ndash1501 1985

[2] B Baudry S Allier and M Monperrus Tailoredsource code transformations to synthesizecomputationally diverse program variants In Proc of

Int Symp on Software Testing and Analysis (ISSTA)pages 149ndash159 2014

[3] A Carzaniga A Mattavelli and M Pezze Measuringsoftware redundancy In Proc of Int Conf onSoftware Engineering (ICSE) 2015

[4] V Dallmeier N Knopp C Mallon S Hack andA Zeller Generating test cases for specificationmining In Proceedings of the 19th internationalsymposium on Software testing and analysis pages85ndash96 ACM 2010

[5] Y Deswarte K Kanoun and J-C Laprie Diversityagainst accidental and deliberate faults In Proceedingsof the Conference on Computer SecurityDependability and Assurance From Needs toSolutions CSDA rsquo98 pages 171ndash Washington DCUSA 1998 IEEE Computer Society

[6] M Franz E unibus pluram massive-scale softwarediversity as a defense mechanism In Proc of theworkshop on New security paradigms pages 7ndash16ACM 2010

[7] M Gabel and Z Su A study of the uniqueness ofsource code In Proc of the Int Symp on Foundationsof Software Engineering (FSE) pages 147ndash156 ACM2010

[8] D Gao M K Reiter and D Song BinhuntAutomatically finding semantic differences in binaryprograms In Information and CommunicationsSecurity pages 238ndash255 Springer 2008

[9] M Harder J Mellen and M D Ernst Improvingtest suites via operational abstraction In Proc of theInt Conf on Software Engineering (ICSE) ICSE rsquo03pages 60ndash71 Washington DC USA 2003 IEEEComputer Society

[10] M Harman P McMinn M Shahbaz and S Yoo Acomprehensive survey of trends in oracles for softwaretesting Technical Report CS-13-01 2013

[11] Y Higo and S Kusumoto How should we measurefunctional sameness from program source code anexploratory study on java methods In Proc of theInt Symp on Foundations of Software Engineering(FSE) pages 294ndash305 ACM 2014

[12] L Jiang and Z Su Automatic mining of functionallyequivalent code fragments via random testing In Procof Int Symp on Software Testing and Analysis(ISSTA) pages 81ndash92 ACM 2009

[13] M Kawaguchi S K Lahiri and H RebeloConditional equivalence Technical ReportMSR-TR-2010-119 2010

[14] J C Knight N-version programming Encyclopedia of

Software Engineering 1990

[15] P Koopman and J DeVale Comparing the robustnessof posix operating systems In Proc Of Int Symp onFault-Tolerant Computing pages 30ndash37 IEEE 1999

[16] C Le Goues T Nguyen S Forrest and W WeimerGenprog A generic method for automatic softwarerepair IEEE Tran on Software Engineering38(1)54ndash72 2012

[17] A J OrsquoDonnell and H Sethu On achieving softwarediversity for improved network security usingdistributed coloring algorithms In Proceedings of the11th ACM Conference on Computer andCommunications Security pages 121ndash131 ACM 2004

[18] R Pawlak M Monperrus N Petitprez C Nogueraand L Seinturier Spoon v2 Large scale source codeanalysis and transformation for java Technical Reporthal-01078532 INRIA 2006

[19] S Person M B Dwyer S Elbaum and C SPasareanu Differential symbolic execution In Proc ofthe Int Symp on Foundations of softwareengineering pages 226ndash237 ACM 2008

[20] M Pezze K Rubinov and J Wuttke Generatingeffective integration test cases from unit ones In Procof Int Conf on Software Testing Verification andValidation (ICST) pages 11ndash20 IEEE 2013

[21] M C Rinard Obtaining and reasoning about goodenough software In Design Automation Conference(DAC)

[22] E Schulte Z P Fry E Fast W Weimer andS Forrest Software mutational robustness GeneticProgramming and Evolvable Machines pages 1ndash322013

[23] T Xie Augmenting automatically generated unit-testsuites with regression oracle checking In Proc ofEuro Conf on Object-Oriented Programming(ECOOP) pages 380ndash403 Springer 2006

[24] Z Xu Y Kim M Kim and G Rothermel A hybriddirected test suite augmentation technique In Proc ofInt Symp on Software Reliability Engineering(ISSRE) pages 150ndash159 IEEE 2011

[25] S Yoo and M Harman Test data regenerationgenerating new test data from existing test dataSoftware Testing Verification and Reliability22(3)171ndash201 2012

[26] P Zhang and S Elbaum Amplifying tests to validateexception handling code In Proc of Int Conf onSoftware Engineering (ICSE) pages 595ndash605 IEEEPress 2012

  • 1 Introduction
  • 2 Background
    • 21 N-version programming
    • 22 NVP-Diversity
    • 23 Graphical Explanation
    • 24 Unspecified Input Space
      • 3 Our Approach to Detect Computational Diversity
        • 31 Overview
        • 32 Test Suite Transformations
          • 321 Exploring the Input Space
          • 322 Adding Observation Points
            • 33 Detecting and Measuring the Visible Computational Diversity
            • 34 Implementation
              • 4 Evaluation
                • 41 Protocol
                • 42 Dataset
                • 43 Baseline
                • 44 Research Questions
                • 45 Empirical Results
                  • 451 of Generated Test Cases
                  • 452 of Generated Observation Points
                  • 453 Effectiveness
                  • 454 Natural Randomness of Computation
                  • 455 Nature of Computational Diversity
                    • 46 Threats to Validity
                      • 5 Related work
                      • 6 Conclusion
                      • 7 Acknowledgements
                      • 8 References
Page 9: DSpot: Test Amplification for Automatic Assessment of ... · DSpot: Test Amplification for Automatic Assessment of Computational Diversity Benoit Baudryo, Simon Alliero, Marcelino

1 void testCanonicalEmptyCollectionExists () if ((( supportsEmptyCollections ()) ampamp (

isTestSerialization ())) ampamp ((skipSerializedCanonicalTests ())))

3 Object object = makeObject ()if (object instanceof Serializable)

5 String name = getCanonicalEmptyCollectionName(object)

File f = new javaioFile(name)7 observation on f

LoggerlogAssertArgument(fgetCanonicalPath ())9 LoggerlogAssertArgument(fgetAbsolutePath ())

11

Listing 3 An amplified test case with observation pointsthat naturally vary hence are discarded by DSpot

codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

455 Nature of Computational DiversityNow we want to understand more in depth the nature of

the NVP-diversity we are observing Let us discuss threecase studies

Listing 4 shows two variants of the writeStringToFile()

method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

original program2 void writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

OutputStream out = null4 out = openOutputStream(file append)

IOUtilswrite(data out encoding)6 outclose()

8 variantvoid writeStringToFile(File file String data

Charset encoding boolean append) throwsIOException

10 OutputStream out = nullout = new FileOutputStream(file append)

12 IOUtilswrite(data out encoding)outclose()

Listing 4 Two variants of writeStringToFile incommonsio

1 void testCopyDirectoryPreserveDates () try

3 File sourceFile = new File(sourceDirectory hellotxt)

FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

5 catch (Exception e) DSpotobserve(egetMessage ())

7

Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

FileOutputStream These two variants behave differentlyoutside the specified domain in case writeStringToFile()

is called with an invalid file name the original program han-dles it while the variant throws a FileNotFoundExceptionOur test transformation operator on String values producessuch a file name as shown in the test case of listing 5 aldquordquo is changed into a star ldquordquo This made the file name aninvalid one Running this test on the variant results in aFileNotFoundException

Let us now consider listing 6 which shows two variantsof the toJson() method from the Google Gson library Thelast statement of the original method is replaced by anotherone instead of setting the serialization format of the writer

it set the indent format Each variant creates a JSon withslightly different formats and none of these formatting deci-sions are part of the specified domain (and actually specify-ing the exact formatting of the JSon String could be consid-ered as over-specification) The diversity among variants isdetected by the test cases displayed in figure 7 which addsan observation point (a call to toString()) on instances ofStringWriter which are modified by toJson()

Original program2 void toJson(Object src Type typeOfSrc JsonWriter

writer)writersetSerializeNulls(oldSerializeNulls)

4 variantvoid toJson(Object src Type typeOfSrc JsonWriter

writer)6 writersetIndent( )

Listing 6 Two variants of toJson in GSON

1 public void testWriteMixedStreamed_remove534 ()throws IOException

3 gsontoJson(RED_MIATA Carclass jsonWriter)

jsonWriterendArray ()5 LoggerlogAssertArgument(comgooglegson

MixedStreamTestCARS_JSON)LoggerlogAssertArgument(stringWritertoString ())

7

Listing 7 Amplified test detecting black-box diversityamong variants of listing 6

The next case study is in listing 8 two variants of themethod decode() in the Base64 class of the Apache Com-mons Codec library The original program has a switch-

case statement in which case 1 execute a break An originalcomment by the programmers indicates that it is probablyimpossible The test case in listing 9 amplifies one of theoriginal test case with a mutation on the String value in theencodedInt3 variable (the original String has an additionallsquorsquo character removed by the ldquoremove characterrdquo transfor-mation) The amplification on the observation points addsmultiple observations points The single observation pointshown in the listing is the one that detects computationaldiversity it calls the static decodeInteger() method whichreturns 1 on the original program and 0 on the variant Inaddition to validating our approach this example anecdo-tally answers the question of the programmer case 1 is pos-sible it can be triggered from the API

These three case examples are meant to give the readera better idea of how DSpot was able to detect the variantsWe discuss how augmented test cases reveal this diversity(both with amplified inputs and observation points) Weillustrate three categories of code variations that maintainthe expected functionality as specified in the test suite butstill induce diversity (different checks on input different for-matting different handling of special cases)

The diversity that we observe originates from areasof the code that are characterized by their flexibility(caching checking formatting etc) These areas arevery close to the concept of forgiving region proposedby Martin Rinard [21]

46 Threats to ValidityDSpot is able to effectively detect NVP-diversity using

test suite amplification Our experimental results are sub-ject to the following threats

First this experiment is highly computational a bug inour evaluation code may invalidate our findings Howeversince we have manually checked a sample of cases (the casestudies of Section 454 and Section 455) we have a highconfidence in our results Our implementation is publiclyavailable 7

Second we have forged the computationally diverse pro-gram variants Eventually as shown on Table 3 our tech-nique DSpot is able to detect them all The reason is thatwe had a bias towards our technique when forging those

7httpdiversify-projectgithubiotest-suite-amplificationhtml

Original program2 void decode(final byte[] in int inPos final int

inAvail final Context context) switch (contextmodulus)

4 case 0 impossible as excluded abovecase 1 6 bits - ignore entirely

6 not currently tested perhaps it isimpossiblebreak

8

10 variantvoid decode(final byte[] in int inPos final int

inAvail final Context context) 12 switch (contextmodulus)

case 0 impossible as excluded above14 case 1

Listing 8 Two variants of decode in commonscodec

1 Testvoid testCodeInteger3_literalMutation222 ()

3 String encodedInt3 =FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2

5 + rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==LoggerlogAssertArgument(Base64decodeInteger(

encodedInt3getBytes(CharsetsUTF_8)))7

Listing 9 Amplified test case that reveals thecomputational diversity between variants of listing 8

variants This is true for all self-made evaluations Thisthreat on the results of the comparative evaluation againstTDR is mitigated by the analytical comparison of the twoapproaches Both the input space and the output space ofTDR (respectively an integer tuple and a returned value) aresimpler and less powerful than our amplification technique

Third our experiments consider one programming lan-guage (Java) and 7 different application domains To furtherassess the external validity of our results new experimentsare required on different technologies and more applicationdomains

5 RELATED WORKThe work presented is related to two main areas the iden-

tification of similarities or diversity in source code and theautomatic augmentation of test suites

Computational diversity The recent work by Carzanigaet al [3] has a similar intent as ours automatically identify-ing dissimilarities in the execution of code fragments that arefunctionally similar They use random test cases generatedby Evosuite to get execution traces and log the internals ofthe execution (executed code and the readwrite operationson data) The main difference with our work is that theyassess computational diversity and with random testing in-stead of test amplification

Koopman and DeVale [15] aim at quantifying the diver-sity among a set of implementations of the POSIX operatingsystem with respect to their responses to exceptional con-ditions Diversity quantification in this context is used todetect which versions of POSIX provide the most differentfailure profiles and should thus be assembled to ensure faulttolerance Their approach relies on Ballista to generate mil-lions of input data and the outputs are analyzed to quantifythe difference This is an example of diversity assessment

with intensive fuzz testing and observation points on crash-ing states

Many other works look for semantic equivalence or diver-sity through static or dynamic analysis Gabel and Su [7] in-vestigate the level of granularity at which diversity emergesin source code Their main finding is that for sequencesup to 40 tokens there is a lot of redundancy Beyond this(of course fuzzy) threshold the diversity and uniquenessof source code appears Higo and Kusumoto [11] investi-gate the interplay between structural similarity vocabularysimilarity and method name similarity to assess functionalsimilarity between methods in Java programs They showthat many contextual factors influence the ability of thesesimilarity measures to spot functional similarity (eg thenumber of methods that share the same name or the factthat two methods with similar structure are in the sameclass or not) Jiang and Su [12] extract code fragments ofa given length and randomly generate input data for thesesnippets Then they identify the snippets that produce thesame output values (which are considered functionally equiv-alent wrt the set of random test inputs) They show thatthis method identifies redundancies that static clone detec-tion does not find Kawaguchi and colleagues [13] focus onthe introduction of changes that break the interface behav-ior They also use a notion of partial equivalence whereldquotwoversions of a program need only be semantically equivalentunder a subset of all inputsrdquo Gao and colleagues [8] pro-pose a graph-based analysis to identify semantic differencesin binary code This work is based on the extraction of callgraphs and control flow graphs of both variants and on com-parisons between these graphs in order to spot the semanticvariations Person and colleagues [19] developed differentialsymbolic execution which can be used to detect and char-acterize behavioral differences between program versions

Test suite amplification In the area of test suite am-plification the work by Yoo and Harman [25] is the mostclosely related to our approach and we used as the baselinefor computational diversity assessment They amplify testsuites only with transformations on integer values while wealso transform boolean and String literals as well as state-ments test cases Yoo and Harman also have two additionalparameters for test case transformation the interaction levelthat determines the number of simultaneous transformationon the same test case and the search radius that boundstheir search process when trying to improve the effectivenessof augmented test suites Their original intent is to increasethe input space coverage to improve test effectiveness Theydo not handle the oracle problem in that work

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of this work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases, as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.
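The kind of statement-level amplification described by Dallmeier et al., adding or removing method calls in a JUnit test, can be pictured with the following toy example (hypothetical test names, not code from [4]); the added call drives the object under test into a state that the original test never reaches.

import static org.junit.Assert.assertEquals;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;

public class CallAmplificationExample {

    // Original (hypothetical) test.
    @Test
    public void testListOriginal() {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        assertEquals(2, list.size());
    }

    // Amplified variant: an extra method call is inserted before the assertion,
    // so the test now exercises an additional execution state.
    @Test
    public void testListWithAddedCall() {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        list.remove("a"); // added call
        assertEquals(1, list.size());
    }
}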

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification, i.e., the automatic transformation of the original test suite. DSpot uses two kinds of transformations for, respectively, exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.
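For illustration, the following hypothetical sketch (with a made-up observation helper, in the spirit of the amplified tests discussed in the evaluation) combines the two kinds of transformations: an input literal is mutated, and several observation points record facets of the resulting state; when the same amplified test is run on two variants, a difference in the recorded values reveals computational diversity.

import java.util.ArrayList;
import java.util.List;
import org.junit.Test;

public class AmplifiedObservationExample {

    // Hypothetical observation helper: recorded values are later compared across
    // program variants instead of being asserted against a hard-coded expectation.
    static final List<Object> OBSERVATIONS = new ArrayList<>();

    static void observe(Object value) {
        OBSERVATIONS.add(value);
    }

    @Test
    public void testFormatAmplified() {
        // Input-space transformation: the original literal (say, 42) is replaced
        // by a value outside the originally specified input domain.
        String formatted = String.format("%05d", -42);

        // Observation-point transformations: sense several facets of the resulting
        // state, not just the value checked by the original assertion.
        observe(formatted);
        observe(formatted.length());
        observe(formatted.startsWith("-"));
    }
}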

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Beyond the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of flaky tests sit at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.
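As a toy illustration of such a refactoring (a hypothetical example, not taken from our dataset), consider a test whose assertion depends on a naturally non-deterministic quantity, the exact elapsed time, and a refactored version whose assertion moves away from that border.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class FlakyTestRefactoringExample {

    // Flaky: the exact elapsed time is naturally non-deterministic, so this
    // assertion sits at the border of the random part of the execution and
    // occasionally fails.
    @Test
    public void testComputationIsFastFlaky() throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(5);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        assertEquals(5, elapsedMillis);
    }

    // Refactored: the assertion checks a much looser property that holds
    // deterministically, far from the non-deterministic border.
    @Test
    public void testComputationIsFastRefactored() throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(5);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        assertTrue(elapsedMillis < 1000);
    }
}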

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions, CSDA '98, pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.

