DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017
Effectiveness of Inadequate Test Suites
A Case Study of Mutation Analysis
HIKARI WATANABE
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Effectiveness of Inadequate
Test Suites
A Case Study of Mutation Analysis
HIKARI WATANABE
DD221X, Degree project in Computer Science (30 ECTS credits)
Master’s program in Computer Science (120 ECTS credits)
KTH Royal Institute of Technology, Year 2017
Supervisor at CSC: Karl Meinke
Examiner at CSC: Cristian M Bogdan
Master thesis work carried out at NASDAQ Technology AB
Abstract
How can you tell whether your test suites are reliable? This is often done through
coverage criteria, which define a set of requirements that the test suites need to
fulfill in order to be considered reliable. The most widely used criteria are those
referred to as code coverage, where the degree to which the code base is covered
serves as a proxy for how good the test suites are. Achieving high coverage would
indicate an adequate test suite, i.e. reliable according to the standards of code
coverage. However, covering a line of code does not necessarily mean that it has
been tested. Thus, code coverage can only tell you what parts of the code base have
not been tested, as opposed to what has been tested.
Mutation testing, on the other hand, is an approach that evaluates the adequacy of
test suites through their fault detection ability, rather than through how much of
the code base they cover.
This thesis performs mutation analysis on a project with inadequate code coverage.
The present testing effort at the unit level is evaluated, and the costs and benefits
of adopting mutation testing as a testing method are explored.
Sammanfattning
How do you know when tests are reliable? Often, coverage criteria are used, which
define a set of requirements that tests must fulfill in order to be considered
reliable. The most widely used criteria are those known as code coverage, where the
degree to which the code base is covered is used as a measure of the reliability of
the tests. High coverage indicates adequate tests, i.e. reliable according to code
coverage. However, covering a line of code does not necessarily mean that it has
been tested. Code coverage can thus only show which parts of the code base have not
been tested, rather than what has been tested.
Mutation testing, on the other hand, is a way to evaluate the effectiveness of tests
through their fault detection ability, rather than how much of the code base they
cover.
This thesis performs mutation analysis on a project with insufficient code coverage.
The quality of the present tests at the unit level is evaluated, and the costs and
benefits of adopting mutation testing as a testing method are explored.
Keywords
Mutation testing, code coverage, regression analysis
Preface
I would like to thank my team at NASDAQ, who helped and supported me throughout the project.
Special thanks to:
Kjell Paulson at NASDAQ for having me in his team
Karl Meinke at KTH for taking me on as a thesis student and guiding me throughout
Cristian M Bogdan at KTH, for evaluating my work
Contents
Introduction 1
1.1 Objective 2
1.2 Delimitations 3
1.3 Related Work 3
Background 4
2.1 Mutation Testing 4
2.1.1 RIP Model 5
2.1.2 Mutation Score 5
2.1.4 Mutation Operators 5
2.1.5 Equivalent Mutants 6
2.1.6 Cost reduction 6
2.2 Theory behind Mutation Testing 7
2.3 Mutation System 9
Methodology 11
3.1 Codebase 11
3.2 Sample Space 12
3.3 Generating Unit Tests 13
3.4 Mutation System 13
3.5 Mutation Analysis 14
3.6 Coverage Metrics and Mutation Coverage 14
Results 16
4.1 The Sample Space 16
4.2 Mutation Analysis 17
4.2.1 Original Test Suites 17
4.2.2 Generated Test Suites 18
4.2.3 Performance 19
4.3 Linear Regression Analysis 20
Discussion 23
5.1 Quality of Unit Tests 23
5.2 Cost and Benefits of Mutation Testing 24
5.4 Reflection 25
5.5 Sustainability and Societal Aspects 25
5.6 Conclusions 25
5.7 Future Work 26
References 27
Introduction | 1
Chapter 1
Introduction
This chapter introduces the topic and objective of this thesis project along with the research
questions and relevance of this study.
Software testing remains one of the most important, and most expensive, aspects of ensuring high
quality. According to the Capgemini World Quality Report, in 2015 budgets for quality assurance and
testing had risen to an average of 35% of total IT spending, a significant 9% increase from 2014, with
a prediction that the average will reach 40% by the year 2018 [WQR].
At its core, software testing is an endeavor for higher quality, typically through the detection of
dormant faults. However, the growing size and complexity of software entail a practically infinite
input space, making it infeasible to completely test entire systems. Testing is thus always a trade-off
between the cost of testing and the potential cost of undiscovered faults. To overcome this fundamental
limitation of testing, developers need a structured way to assess the effectiveness, or quality, of test
suites in terms of their ability to detect faults.
Intuitively, the most logical measure of a test suite's fault detection ability is simply the number
of real faults it detects. Faults discovered during a product's lifetime can be used, in retrospect,
to assess the adequacy of its test suites. However, this approach does not lend itself well to a
development process. Thus, a method is required that predicts the quality of test suites based solely
on the suites themselves and the current build of the system under test (SUT). The most common such
approach is the use of coverage criteria [AO17]. Coverage criteria define the properties that a test
suite needs to fulfill; for example, statement coverage requires that every statement be executed and
branch coverage that every branch be traversed. The coverage measurements then serve as an indicator
of adequacy, e.g. a test suite with 80% statement coverage is considered higher quality than a test
suite with 70% statement coverage.
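As a small illustration (with hypothetical line numbers, not data from the SUT), a statement coverage score is simply the fraction of executable statements hit by at least one test:

```python
def statement_coverage(executable_lines, executed_lines):
    """Fraction of executable statements hit by at least one test."""
    covered = executable_lines & executed_lines
    return len(covered) / len(executable_lines)

# Hypothetical program with 10 executable statements; the tests execute 8.
executable = set(range(1, 11))
executed = {1, 2, 3, 4, 5, 6, 7, 9}
print(statement_coverage(executable, executed))  # 0.8
```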
Mutation testing is a fault-based testing technique that provides a coverage criterion called mutation
coverage. Mutation coverage differs from the other criteria in that it is based on fault detection
rather than on coverage of structural aspects of the code base.
Mutation analysis is the process of injecting small faults into the SUT through syntactic changes,
creating copies, or mutants, each containing one fault. Test suites that are able to detect the
injected faults are then considered adequate, i.e. reliable.
Mutation testing can be used for testing at both the unit level and the integration level [DM96]. It
has been applied to many programming languages, e.g. Fortran, Ada, AspectJ, Java and C. Besides its
use at the software implementation level, it has also been applied at the software design level to
test the specifications of the SUT [MR01].
The concept of mutation was first introduced in 1971 by Richard Lipton in a class term paper. It was
later developed and published by DeMillo et al. in 1978. Over four decades of history and a wide range
of studies have resulted in a large body of literature [JH11, OU00].
Mutation coverage subsumes many other coverage criteria [OV96], where subsumption is defined as
follows: a coverage criterion Ca subsumes Cb if and only if every test suite that satisfies Ca also
satisfies Cb. Mutation coverage has also been shown to predict actual fault detection ability better
than other criteria in some settings, and has never been shown to be worse [GJ14]. However, mutation
testing is computationally expensive and difficult to apply, and although there has been much research
[JH11], it is still regarded as academic and is not widely adopted in industry.
1.1 Objective
NASDAQ Technology AB is an American fin-tech company. NASDAQ is a leading provider of trading,
clearing, exchange technology, listing, information and public company services across six continents
[NH17]. The business-critical nature of the financial domain necessitates a solid testing effort with
reliable test suites.
In this thesis, the test suites of one of NASDAQ's software projects are evaluated at the unit level.
Historically, the project has lacked a set structure for testing, which has resulted in low coverage.
The team maintaining the project is taking steps to supplement the present testing efforts; however,
the abundance of legacy code with high interdependency has made it difficult to create unit tests.
Unit tests are widely recognized as an integral part of a development process. Among other benefits,
they serve as a safety net during the inevitable refactoring of old code, detecting undesired
behaviors and helping to localize the fault.
On investigating previous system failures, the project team discovered that only a handful of the
critical failures could have been prevented with unit tests. Thus, the team is doubtful of the gains
from further unit tests and reluctant to invest any resources. However, assessing the quality of the
present unit tests would determine the effectiveness of past approaches and could prove useful in
convincing the team otherwise.
Since the project is lacking in the number of unit tests, a new set of unit tests was generated with
an automatic test suite generator. Mutation testing was applied to both the original and the generated
unit tests. To assess the benefits of mutation testing, the ability of conventional coverage criteria
to predict mutation coverage was explored through regression analysis.
The research questions can be defined as: What is the quality of the present unit tests, and what are
the costs and benefits of adopting mutation testing?
1.2 Delimitations
The measurements used within this thesis are directly dependent on the metrics reported by the tools.
Although simple coverage measurements such as line coverage can easily be cross-validated because of
their prevalence, path coverage is far more difficult. The ability to validate measurements is thus
somewhat limited in this regard.
The performance comparison of mutation testing is limited by the impossibility of augmenting or
modifying the present test suites. The SUT is extremely large and complex; creating meaningful test
suites without the assistance of a developer from the project is therefore far too time-consuming.
Without a way to augment the test suites, it is practically impossible to measure the performance of
mutation testing at different degrees of testing effort.
1.3 Related Work
A study conducted by Simona et al. [NW11] attempted to assess the cost of applying mutation testing
to a real-world software system. The study applied three widely recognized mutation testing tools,
namely MuJava, Jumble and Javalanche, to the open source project Eclipse. It concluded that although
configuring and applying the tools is simple enough, special attention should be paid to the high
execution time.
A recent study conducted by Gopinath et al. [GJ14] investigated the correlation between mutation kill
ratio and widely used coverage criteria (statement, block, branch and path coverage). The study
considered hundreds of open source Java projects amassed from GitHub repositories. The authors
measured the coverage of, and performed mutation analysis on, the projects' test suites. The data was
then analyzed through regression analysis, measuring both τβ (the Kendall rank correlation
coefficient) and R² (the coefficient of determination). The same experiment was conducted on both the
projects' original test suites and suites automatically generated with the Randoop testing tool. The
study found a correlation between the widely used coverage criteria and the mutation kill ratio, with
statement coverage performing best at R² = 0.94 for original tests and 0.72 for generated tests. The
aim of Gopinath et al. was to measure the ability of coverage criteria as predictors of suite quality
from the perspective of non-researchers, and to present a possible alternative to computationally
expensive mutation testing.
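The two statistics can be sketched in a few lines of Python (illustrative data only; note that this is the plain Kendall τ that ignores ties, whereas the τβ used by Gopinath et al. additionally corrects for tied ranks):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def kendall_tau(xs, ys):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    n = len(xs)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfectly linear)
print(kendall_tau([1, 2, 3], [3, 2, 1]))      # -1.0 (perfectly discordant)
```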
The statistical approach adopted by Gopinath et al. is the same as that of Gligoric et al. [GG13],
who considered some of the same questions but from a research perspective. However, that study
considered only 15 Java programs and 11 C programs, and concluded that branch coverage performed
best. This thesis investigates one considerably larger project and applies a similar statistical
approach, arriving at results similar to those of Gopinath et al.
Chapter 2
Background
This chapter presents the background material to understand mutation testing and the tools used
throughout this thesis.
2.1 Mutation Testing
The 70s saw the rise of Van Halen. Like any other band, when Van Halen was hired to play at a venue
they provided the promoter with a contract rider. The rider included everything from sound and
lighting requirements to food and drinks. Listed among these was a big bowl of M&M's, but absolutely
no brown ones. This was not just superstition or some rock star ridiculousness; it served a very
specific purpose. They buried the odd request in the rider to make sure that the contract was
thoroughly read. Finding a brown M&M meant there might be other things that the promoter had missed
[VA01].
Van Halen made sure the rider was thoroughly read by hiding an odd item for the promoter to find.
In a similar way, mutation testing makes sure the SUT is thoroughly tested by introducing artificial
faults for the test suites to find. The process creates several copies of the code, each containing
one fault. Existing test cases are executed against the copies, with the objective of distinguishing
the original program from the faulty ones, thereby determining the adequacy of the existing test
suites.
Let P be a program that functions correctly on some test set T. The program is subjected to mutation
operators that introduce small artificial faults, thereby creating mutants (refer to figure 2.1) that
differ from the original program in very small ways. Note that each mutant contains exactly one fault.
Let these mutants be called P1, P2 … Pn. Running each mutant against T, there are two possible
outcomes:
1. Pi gives a different result than P
2. Pi gives the same result as P
In case (1) Pi is said to be killed and in case (2) Pi is said to be alive. If a mutant is killed,
the tests were able to distinguish P from the mutant. A mutant can be alive for one of two reasons:
either the tests were not sensitive enough to detect the introduced fault and must be augmented, or
Pi and P turn out to be functionally equivalent (henceforth denoted Pi ≡ P) [DL78, AB79].
Program P:
    …
    if (a ≤ b)
    …

Mutant Pi:
    …
    if (a ≥ b)
    …

Figure 2.1: Example of a mutant
2.1.1 RIP Model
Condition for a mutant to be considered killed, can be expressed more formally with three conditions,
together referred to as the RIP model [YH14, VM97, AO17].
Reachability: The location of the mutation must be reached by the test.
Infection: After the location is executed, the state of the program must be infected i.e. differ
from the corresponding state, of the original program.
Propagation: The infection must propagate through execution and result in an erroneous
output or final state.
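The three conditions can be seen in a small hypothetical example (an arithmetic-operator change, * replaced by +, purely for illustration). The test inputs below show each condition failing or holding in turn:

```python
def classify(x):
    y = x * 2              # original component
    return 1 if y > 10 else 0

def classify_mutant(x):
    y = x + 2              # mutated component: * replaced by +
    return 1 if y > 10 else 0

# x = 2: the mutation is reached, but there is no infection (2 * 2 == 2 + 2),
#        so the mutant survives this test.
# x = 3: reached and infected (y is 6 vs 5), but the infection does not
#        propagate: both values stay below the threshold, so the output is 0
#        in both cases and the mutant still survives.
# x = 6: reached, infected (12 vs 8) and propagated (output 1 vs 0): killed.
print(classify(2), classify_mutant(2))  # 0 0
print(classify(3), classify_mutant(3))  # 0 0
print(classify(6), classify_mutant(6))  # 1 0
```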
2.1.2 Mutation Score
As defined by DeMillo et al. [DL78], a test set that manages to kill all mutants, except those
equivalent to P, is adequate. In other words, a test set is adequate if it distinguishes the program
from the mutant programs.
The extent to which a coverage criterion is satisfied is measured as a coverage score, calculated in
terms of the imposed requirements. In the case of mutation testing it is referred to as the mutation
score [AO17, OU00]. Let M be the total number of mutants, D the number of killed mutants and E the
number of equivalent mutants [JH11, AB79, GO92]. The mutation score can then be defined as:
MS(T) = D / (M − E)    (2.1)
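Equation 2.1 translates directly into code; the counts below are hypothetical:

```python
def mutation_score(total_mutants, killed, equivalent):
    """MS(T) = D / (M - E): killed mutants over killable (non-equivalent) mutants."""
    return killed / (total_mutants - equivalent)

# Hypothetical run: 200 mutants generated, 150 killed, 8 marked equivalent.
print(mutation_score(200, 150, 8))  # 0.78125
```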
2.1.4 Mutation Operators
A mutation operator is a syntactic or semantic transformation rule applied to the SUT to create
mutants. Operators are created with one of two goals: to inject faults representative of common
mistakes that programmers tend to make, or to enforce testing heuristics, e.g. executing every branch.
Key to successful mutation testing are well-designed mutation operators. Syntactically illegal
mutants would be caught by the compiler and be of no value; these are called stillborn mutants and
should be discarded or not generated at all. A trivial mutant, in contrast, is one that can be killed
by any test.
The Mothra mutant operators are the first set of 22 formalized mutation operators, defined for the
Fortran programming language [JH11, AO17]. The operators were derived through studies of programmer
errors and implemented in the Mothra mutation system [KO91, DG88]. The full list and a detailed
description of each operator can be found elsewhere [KO91]. The operators were adapted to Java by
Ammann et al. [AO17]; one of them is:
Relational Operator Replacement - ROR
Replace each occurrence of one of the relational operators (<, ≤, >, ≥, ==, !=) by each of the
other operators, and by falseOp and trueOp, where falseOp always results in false and trueOp always
results in true. Applying the ROR operator to, for example, the program P shown in figure 2.1
generates seven possible mutants:
if(a < b), if(a > b), if(a ≥ b), if(a == b), if(a != b), if(false), if(true)
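A naive sketch of the ROR operator (string replacement on a single condition, purely for illustration; real mutation systems operate on the parsed program, not on text):

```python
REL_OPS = ["<", "<=", ">", ">=", "==", "!="]

def ror_mutants(condition, op):
    """Replace `op` by every other relational operator, plus falseOp/trueOp."""
    mutants = [condition.replace(op, other) for other in REL_OPS if other != op]
    mutants += ["false", "true"]  # falseOp and trueOp
    return mutants

print(ror_mutants("a <= b", "<="))
# ['a < b', 'a > b', 'a >= b', 'a == b', 'a != b', 'false', 'true']
```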
2.1.5 Equivalent Mutants
One of the biggest hurdles of mutation testing is the equivalent mutant problem. Some mutants turn
out to be semantically equal to the original program although they are syntactically different.
Without detecting all the equivalent mutants, the tester cannot have complete confidence in the test
data: there would simply be no way to know whether the tests are inadequate or the live mutants are
equivalent.
An equivalent mutant will always produce the same output as the original program and is thus
impossible to kill; refer to figure 2.2 for an example. Although they have two different conditions,
both program P and mutant Pi behave in exactly the same way, hence they are equivalent.
Detecting equivalence between two programs is an undecidable problem [BA82], i.e. there is no general
algorithmic solution. The situation, however, is somewhat different for the equivalent mutant
problem: we do not need to determine the equivalence of an arbitrary pair of programs, but of two
syntactically very similar programs. Although this has also been proven undecidable, it has been
suggested that it is possible in many specific cases [OP97, OC94].
Program P:
    …
    int a = 0;
    while ( 5 < a ) {
        a++;
    }
    …

Equivalent Mutant Pi:
    …
    int a = 0;
    while ( 5 != a ) {
        a++;
    }
    …

Figure 2.2: Example of an equivalent mutant
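The pair from figure 2.2 can be sketched in code, assuming, as the figure's ellipses suggest, that a is not used after the loop. The internal states differ (the mutant leaves a == 5, the original leaves a == 0), but the observable output is identical, so no test on the output can kill the mutant:

```python
def original():
    a = 0
    while 5 < a:        # false on entry: the loop body never runs, a stays 0
        a += 1
    return "done"       # a is not used after the loop

def mutant():
    a = 0
    while 5 != a:       # the loop now runs five times, leaving a == 5
        a += 1
    return "done"       # ...but a is dead here, so the output is unchanged

print(original() == mutant())  # True
```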
2.1.6 Cost reduction
One of the hindrances to mutation testing being widely adopted in industry is the high computational
cost of creating and running a vast number of mutant programs. The number of mutants generated for a
program is roughly proportional to the number of data references times the number of data objects
[OL96, OU00]. There are, for example, seven possible mutants for the single line of code shown in
figure 2.1, and the number increases drastically with every new line. As described in the previous
section, at least one, and potentially all, of the test cases must be run against each mutant, which
brings with it a large computational cost.
Several approaches have been proposed to reduce the computational cost of mutation testing. These
methods can be categorized as mutant reduction techniques and execution cost reduction techniques.
This section presents the most studied methods in each category, according to the survey by Jia et
al. [JH11].
2.1.6.1 Mutant Reduction Techniques
Mutant reduction techniques aim to reduce the number of generated mutants without a significant loss
of effectiveness. Let MST(M) denote the mutation score of a test set T applied to the mutants M. The
mutant reduction problem can then be defined as the problem of finding a subset M' of M such that
MST(M) ≈ MST(M') [JH11].
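One simple instance of mutant reduction is random sampling: score only a randomly chosen subset M' and use it as an estimate of MST(M). A minimal sketch with hypothetical kill outcomes:

```python
import random

def mutation_score(kills):
    """kills: one boolean (killed or not) per non-equivalent mutant."""
    return sum(kills) / len(kills)

def sampled_score(kills, fraction, seed=0):
    """Estimate the score from a random subset M' of the mutants M."""
    rng = random.Random(seed)
    subset = rng.sample(kills, max(1, int(len(kills) * fraction)))
    return mutation_score(subset)

kills = [True] * 80 + [False] * 20   # full analysis: MS_T(M) = 0.8
print(mutation_score(kills))          # 0.8
print(sampled_score(kills, 0.1))      # close to 0.8, at a tenth of the cost
```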
Mathur et al. proposed the idea of constrained mutation: applying mutation testing with only the
crucial mutation operators. The concept was later developed by Offutt et al. [OR93] as selective
mutation, an approximation technique that reduces the number of created mutants by reducing the
number of mutation operators used. Mutation operators generate varying numbers of mutants; some
operators have higher applicability and will generate many more mutants than others, which may turn
out to be redundant [JH11, OU00, OL96, MA91].
A study on selective mutation conducted by Offutt et al. [OL96] on 10 FORTRAN programs concluded
that 5 of the Mothra mutant operators are sufficient to effectively conduct mutation testing.
2.1.6.2 Execution Cost Reduction Technique
Another way to reduce the computational cost, other than reducing the number of generated mutants,
is to optimize the mutant execution process.
Traditional mutation testing is often referred to as strong mutation. In strong mutation, for a given
program P, a mutant Pi is said to be killed only if the original program P and the mutant Pi produce
different outputs.
Proposed by Howden [HO82], weak mutation is an approximation technique that optimizes the execution
of strong mutation by relaxing the definition of "killing a mutant". Weak mutation only requires that
the first two conditions of the RIP model (reachability and infection) be satisfied. A program P is
assumed to be constructed from components {c1, c2 … cn}. Let Pi be a mutant created by changing the
component ci; the mutant Pi is said to be killed if the internal state of Pi is incorrect after the
execution of the mutated component. As such, weak mutation trades test effectiveness for reduced
computational cost [JH11, AO17].
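The difference between the two notions of killing can be sketched with a hypothetical single-component program: the mutant below is weakly killed by any input that infects the internal state after the component, but strongly killed only when the infection propagates to the output:

```python
def run(x, mutated=False):
    """Returns (internal state after component c_i, final output)."""
    y = x + 2 if mutated else x * 2   # c_i: the (possibly mutated) component
    return y, (1 if y > 10 else 0)

def weak_kill(x):
    return run(x)[0] != run(x, mutated=True)[0]   # state differs after c_i

def strong_kill(x):
    return run(x)[1] != run(x, mutated=True)[1]   # final outputs differ

print(weak_kill(3), strong_kill(3))  # True False: infected, not propagated
print(weak_kill(6), strong_kill(6))  # True True: infected and propagated
```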
2.2 Theory behind Mutation Testing
Section 2.1 gave an overview of mutation testing. This section presents the theory that makes
mutation testing possible.
Mutation testing is grounded in two fundamental hypotheses, first introduced by DeMillo et al. in
1978 [DL78], stated as:
Competent programmer hypothesis: Programmers are usually competent and produce
code that is either correct or close to being correct.
Coupling effect: Tests that detect small errors are so sensitive that they implicitly detect
more complex errors.
Suppose we have a program P, which is meant to compute a function F over an input domain D. The
traditional approach to determining the correctness of P would be to find a subset T of D such that

if for all x in T, P(x) = F(x)    (2.2)
then for all x in D, P(x) = F(x)

where P(x) is the output actually computed by the program on input x. The subset T is then referred
to as a reliable test set, i.e. the set of input data needed to determine the correctness of P.
However, finding T requires exhaustive testing effort and the problem is undecidable [HO76] for any
non-trivial program.
Mutation testing, on the other hand, is a technique that attempts to draw a weaker conclusion: find
a subset T of D such that

if P is not pathological
and for all x in T, P(x) = F(x)    (2.3)
then for all x in D, P(x) = F(x)
A program P is not pathological if it was written by a competent programmer, i.e. if it follows the
competent programmer hypothesis. Mutation testing assumes that P is close to the correct program Pc;
hence, either P = Pc or some other program Q close to P is correct.
Figure 2.3: Neighborhood of P within the domain of all possible programs
Let Φ be the set of programs close to P, with the assumption that P or some other program Q within
Φ is correct. The approach of mutation testing to finding the subset T is to eliminate the
alternatives. We formulate the method as: find a subset T of D such that

for all x in T, P(x) = F(x)
and for all Q in Φ    (2.4)
either Q ≡ P
or for some x in T, Q(x) ≠ P(x)

If we can find a subset T that satisfies formula 2.4, then we say that P passes the Φ mutant test,
or that T differentiates P from all other programs in Φ. This can be explained as follows: given
that P performs correctly on test set T, each program Q in Φ should either be equivalent to P or
produce a different output than P on some test. Instead of having to exhaustively test P with a
practically infinite test set, we can focus on differentiating P from Φ. However, the problem
remains too large.
The coupling effect hypothesis states that there is often a strong coupling between the members of Φ
and a small subset μ (refer to figure 2.3). The subset μ can be thought of as a set of programs very
close to P, such that if P passes the μ mutant test with test data T, then P will also pass the Φ
mutant test with test data T. The subset μ is referred to as the mutants of P, and the task of
differentiating P from Φ is reduced to finding μ and differentiating P from μ [BD80].
2.3 Mutation System
Mutation testing is performed using a so-called mutation system, which implements the mutation
analysis process, i.e. generating the mutants and handling their execution.
Figure 2.4 shows a generic process for mutation analysis. Let P be a program and T a set of tests to
be evaluated. When P is submitted to a mutation system, the system first creates the mutants
P1, P2 … Pn. Next, T is run against P. If a test fails, a bug has been discovered in P and it needs
to be corrected; otherwise T is executed on the mutants P1, P2 … Pn. If the output of a mutant Pi
differs from the output of P, Pi is marked as killed. Once all the tests in T have been executed, the
mutation score is calculated. If there are still live mutants, the tester can augment T to target the
live mutants, and the process is repeated. Equivalent mutants are marked, either manually or through
some automated technique, and are not considered in the next iteration.
Although the augmented test set does not necessarily reveal any new faults, the mutation score gives
an approximate indication of the adequacy of the test set. The process is repeated until a mutation
score of 1 is achieved or a threshold set by the tester is met [AO17, OU00].
Figure 2.4: Generic mutation testing process [JH11, OU00]
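The loop of figure 2.4 can be sketched as follows, with programs, tests and mutants modelled abstractly (the function names and the toy SUT are hypothetical, not part of any real mutation system):

```python
def mutation_round(program, tests, mutants):
    """One iteration of the figure 2.4 loop: run T on P, then on live mutants."""
    # If P itself fails a test, a bug was found and P must be fixed first.
    assert all(test(program) for test in tests), "fix P"
    live = [m for m in mutants if not m["equivalent"]]
    for test in tests:
        live = [m for m in live if test(m["program"])]  # a failing test kills m
    non_equivalent = sum(not m["equivalent"] for m in mutants)
    score = (non_equivalent - len(live)) / non_equivalent
    return score, live  # if mutants survive, augment T and run another round

# Toy SUT and one test: each test calls a program and checks its output.
p = lambda x: x * 2
tests = [lambda prog: prog(6) == 12]
mutants = [
    {"program": lambda x: x + 2, "equivalent": False},  # output 8: killed
    {"program": lambda x: 2 * x, "equivalent": True},   # marked equivalent
]
score, live = mutation_round(p, tests, mutants)
print(score, len(live))  # 1.0 0
```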
The process of mutation analysis described above is based on the theory from section 2.2. Creating
the mutants P1, P2 … Pn using mutation operators is an attempt to find μ. The repeated process after
the creation of the mutants implements the method of formula 2.4, i.e. differentiating P from μ.
Chapter 3
Methodology
This chapter presents the methodology used throughout this thesis to answer the research questions,
including the process outline and a description of each step.
An empirical approach was adopted, since the problem statement of this thesis is directly reliant on
measured data. The experimental model consists of the following steps:
1. The codebase of the SUT is statistically analyzed
2. An appropriate sample space is chosen from the codebase
3. A second set of unit tests is generated with an automatic test suite generation tool, in order
to perform mutation analysis on it and compare performance
4. A mutation testing tool is chosen as the mutation system
5. Mutation analysis is performed on both the original and the generated suites
6. Common coverage criteria are compared to mutation coverage
7. The results are evaluated and the performance of mutation analysis is compared between the two
data sets
3.1 Codebase
The SUT is a multi-module project. Each module has its own test suites, build script, dependencies
and resources. To understand the data set, it was necessary to obtain certain statistical measurements
of each module. The project was therefore configured to use SonarQube, an open source
platform for development teams to continuously manage source code quality and reliability.
SonarQube provides code analysis and defect hunting as its core functionality and displays e.g. Code
Smells, Vulnerabilities and Duplicated Lines [SQ01]. For this thesis, the most relevant information is
Lines of Code (LOC), Cyclomatic Complexity (CC), Line Coverage (LC) and the number of
unit tests (#UT). LOC is the number of executable lines of code and CC is the number of independent
paths through the code.
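For illustration, the counting of CC can be sketched with a small hypothetical method (not taken from the SUT). The counting rule in the comments, one plus the number of decision points, is the convention commonly applied by analysis tools:

```java
// Hypothetical example illustrating cyclomatic complexity (CC):
// CC = 1 + number of decision points in the method.
public class CcExample {

    // Decision points: the null check (1), the loop condition (2),
    // the inner if (3) and the short-circuit && (4),
    // giving CC = 1 + 4 = 5 under the common counting rule.
    public static int countPositiveEvens(int[] values) {
        if (values == null) {
            return 0;
        }
        int count = 0;
        for (int v : values) {
            if (v > 0 && v % 2 == 0) {
                count++;
            }
        }
        return count;
    }
}
```

A method with no branches has CC = 1; every additional decision point adds one more independent path that a thorough test suite would need to exercise.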
3.2 Sample Space
The measurements presented here and the final sample space discussed in section 4.1 can to some
extent also be found in the work of Mishra [SM17], since the same codebase was evaluated there;
that work can be consulted for further information.
Table 3.1 contains data from the statistical measurement of the SUT. Immediately apparent is that
the first three modules are larger in terms of LOC. Unit tests are few in number and concentrated
in two modules.
Name       LOC    CC     LC     #UT
Kenny      81448  16132  5.6%   233
Mark       61248  11520  1.2%   7
Perry      37728  7005   28.5%  172
Sally      16757  2260   0.9%   33
Martin     7269   1197   0.0%   0
Conan      6074   1384   13.1%  74
Coral      5278   965    8.4%   34
Patrick    3285   565    14.4%  10
Derek      3076   598    2.4%   1
Tommy      1745   290    0.0%   0
Brad       1137   209    0.0%   0
Daniel     1132   243    29.8%  5
Emil       917    183    19.7%  2
Uther      831    178    27.8%  17
Danny      819    154    0.0%   0
Francine   585    126    0.0%   0
Sebastian  369    48     0.0%   0
Waldo      164    41     0.0%   0
Table 3.1: Measurements of modules in the SUT
Each module is built separately and tested with its own test suite, and can therefore be considered
individually. To quantify the overall test suite quality of the clearing engine, mutation analysis would
have to cover every module. Mutation analysis generates mutants for every mutable line of code,
regardless of the absence of tests. Analyzing the quality of the entire project would therefore
inevitably result in a very low mutation score, and the data would not fairly represent the quality of
the unit tests currently in place.
It was deemed appropriate to reduce the modules through a selection process. In the initial selection,
modules with no tests were treated as noise, since they served no practical purpose for the analysis.
Modules without tests increase the total LOC, in turn lowering the total coverage and increasing the
number of live mutants. The second and last selection aimed to further filter the project down to its
core. Intuitively, LC seems a logical indicator of a well-tested module; however, the modules with
high coverage turn out to be relatively small. After a discussion with the project team, it was agreed
that the test suites within the modules Kenny and Perry best represent the most recent testing efforts.
Furthermore, they are two of the largest and most complex modules, constituting a substantial
portion of the project. Hence, they were chosen as the sample space to be analyzed.
3.3 Generating Unit Tests
Performing mutation analysis on the current test suites is sufficient for evaluating the quality of the
present unit tests. However, to assess the cost of mutation analysis, it is necessary to obtain a second
set of measurements. Mutation analysis of test suites with a higher number of tests allows a
performance comparison. Differences in execution time can be observed and explained through
factors affecting the process, e.g. the number of test cases and the code coverage. The numbers of
mutants created, killed and never covered by tests can be compared to further understand the
differences in results.
Although theoretically possible, it was deemed impractical to create a second set of test suites by hand.
Instead, an automatic test suite generation tool called EvoSuite [EVO1] was used to create a second
set of test suites that was analyzed separately from the original suites.
While test cases can be generated automatically, verifying their correctness remains a problem.
Faults that cause exceptions and program crashes are easily detected, but testing only for such
obvious faults leads to tests of negligible value.
EvoSuite automates the creation of test suites and adopts a search-based approach with state-of-the-art
techniques to create tests with small assertions, i.e. tests for small faults that do not cause an
exception. EvoSuite also applies an approach that first generates test suites and later optimizes them
to achieve a high score on coverage criteria such as line, branch and weak mutation coverage, thus
generating test suites with high coverage [EVO1, FA11].
For further information on the inner workings of EvoSuite, the study of Fraser et al. [FA11] can be
consulted.
3.4 Mutation System
Mutation analysis can be defined as a two-step process: generate mutants, then check whether the
mutants are detected by the tests. Generating mutants is essentially done by creating copies of
the source or byte code with small changes. This process is very rarely done by hand and generally
uses a mutation system. Although there are several mutation systems available for Java, most are old
and come with certain usability issues, such as lacking support for popular build tools like
Maven or mocking frameworks like Mockito.
Pitest (PIT) was chosen for mutation analysis. PIT is the most recently developed of these systems
and is actively maintained with frequent releases. While other systems were built for research
purposes, PIT was meant for development environments [PIT1], more accurately fitting the objective
of this thesis project.
PIT applies a set of mutation operators to the byte code, generating a large number of mutant classes.
Before exercising the tests against the newly created mutants, PIT first measures the line coverage
(LC) of the code base. Employing the coverage information, PIT executes for each mutant only the
tests that cover the line containing the mutation. This optimization is significant for inadequate
test suites over a large codebase, such as the one examined in this thesis.
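This optimization can be sketched with a simplified model (not PIT's actual implementation; the coverage map and test names below are hypothetical):

```java
import java.util.List;
import java.util.Map;

// Simplified sketch of coverage-based test selection: given per-line
// coverage data, only the tests covering the mutated line are run
// against the corresponding mutant.
public class TestSelection {

    // Hypothetical coverage map: source line -> names of tests hitting it.
    static Map<Integer, List<String>> coverage = Map.of(
            10, List.of("testAdd", "testAddNegative"),
            11, List.of("testAdd"),
            20, List.of("testFormat"));

    // Tests to execute for a mutant placed on the given line.
    public static List<String> testsFor(int mutatedLine) {
        return coverage.getOrDefault(mutatedLine, List.of());
    }
}
```

A mutant on an uncovered line (say line 30) yields an empty list: it is reported as "no coverage" without executing any test, which is exactly why the optimization pays off for suites with low coverage.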
3.5 Mutation Analysis
The mutation analysis is performed using the most stable default mutation operators in PIT. They are
defined in the documentation as:
1. Conditionals Boundary Mutator (CBM)
Mutates the relational operators <, <=, > and >= to their boundary counterparts.
2. Increments Mutator (IM)
Mutates increments, decrements, assignment increments and assignment decrements of
local variables. For example, i++ would be mutated to i--.
3. Invert Negatives Mutator (INM)
Inverts the negation of integer and floating-point numbers, e.g. -i would be mutated to i.
4. Math Mutator (MM)
Replaces binary arithmetic operations, for either integer or floating-point arithmetic, with
another operation. For example, a + b would be mutated to a - b.
5. Negate Conditionals Mutator (NCM)
Mutates conditionals, i.e. ==, !=, <=, >=, < and >. This operator overlaps to some extent with
the Conditionals Boundary Mutator, but its mutants are easier to kill.
6. Return Values Mutator (RVM)
Mutates the return values of method calls. For example, in the case of a Boolean return
value, false would be mutated to true and vice versa.
7. Void Method Calls Mutator (VMCM)
Removes calls to methods with a void return type.
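As a hypothetical illustration of the Conditionals Boundary Mutator, consider a simple age check (the method names are invented for this example; the mutant is written out by hand here, whereas PIT produces it in byte code):

```java
// CBM example: the original uses >=, the mutant replaces it with its
// boundary counterpart >. Only a test at the boundary value can
// distinguish the two.
public class BoundaryExample {

    public static boolean isAdult(int age) {
        return age >= 18;   // original condition
    }

    public static boolean isAdultMutant(int age) {
        return age > 18;    // CBM mutant: >= mutated to >
    }
}
```

A test asserting that isAdult(18) is true kills this mutant, since the mutant returns false at 18; a suite that only exercises 17 and 25 leaves the mutant alive, because both versions agree on those inputs.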
3.6 Coverage Metrics and Mutation Coverage
An experiment similar to those of Gopinath et al. [GJ14] and Gligoric et al. [GG13] is used in this thesis.
The ability of Line Coverage (LC), Branch Coverage (BC) and Path Coverage (PC) to predict the
Mutation Score (MS) is evaluated through linear regression analysis.
Regression analysis is a predictive modelling technique that estimates the relationship between
a dependent variable (target) and independent variables (predictors) using a best-fit line, i.e. the
regression line.
The data set used for the regression analysis is on a per-class basis. Each class was measured using
the mentioned coverage criteria; each measurement was then combined with the MS and shown in a
scatter graph. The aim was to determine how well LC, BC and PC could serve as predictors of MS.
For that purpose, the coefficient of determination (R2) was calculated. R2 is a measure of how well
the regression line approximates the real data, i.e. a high R2 indicates that the independent variables
are good predictors.
The choice of coverage criteria is based on how likely they are to be used by development teams.
LC and BC were measured using JaCoCo [JC01], a free code coverage library for Java. PC was
measured using JMockit [JM01], a mocking framework meant for unit testing in Java and the only
free software that measures PC for Java.
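The regression itself was computed with statistical software; as a sketch of the underlying computation only, the following minimal least-squares fit returns the slope, intercept and R2 for a single predictor (this illustrates the method, not the tooling actually used in the thesis):

```java
// Minimal simple linear regression: fits y = slope * x + intercept by
// least squares and computes R^2 = 1 - SSres / SStot.
public class SimpleRegression {

    // Returns { slope, intercept, rSquared }.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        double meanY = sy / n, ssRes = 0, ssTot = 0;
        for (int i = 0; i < n; i++) {
            double pred = slope * x[i] + intercept;
            ssRes += (y[i] - pred) * (y[i] - pred);
            ssTot += (y[i] - meanY) * (y[i] - meanY);
        }
        return new double[] { slope, intercept, 1 - ssRes / ssTot };
    }
}
```

For data lying exactly on a line, R2 is 1; the further the points scatter around the regression line, the closer R2 falls toward 0, which is the sense in which a high R2 marks LC, BC or PC as a good predictor of MS.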
Chapter 4
Results
This chapter presents the empirical data from the experiments described in the method chapter. The
sample space is motivated, followed by the results from performing mutation analysis on both the
original and the generated test suites. Finally, the result of the regression analysis between common
coverage criteria and mutation score is presented.
4.1 The Sample Space
Table 4.1 gives an overview of the modules constituting the sample space. The selection process
drastically restricted the number of modules. Although the two modules combined make up half of
the SUT, it is not certain that they have similar distributions of the factors that can affect the
mutation analysis. A concern at this point is whether this has resulted in a skewed sample space that
could jeopardize the integrity of the analysis results.
Name   LOC    CC     LC     #UT
Kenny  81448  16132  5.6%   233
Perry  37728  7005   28.5%  172
Table 4.1: Measurements of modules in the sample space
Factors to consider are the LOC and CC of classes. High LOC indicates a large number of lines to
cover, implicitly reducing coverage. High CC indicates a complicated class with a large number of
paths, making it difficult to achieve high test quality.
The distribution of LOC and CC per class was measured and can be found in figure 4.1 as histograms.
It is apparent from the almost identical shapes of the lines in both graphs that the sample space has
kept the original distribution, lending support to the assumption that analysis of the sample space is
representative of the whole system.
Figure 4.1: Distribution of LOC and complexity per class, after the initial selection (Covered) and last selection (Core)
process.
4.2 Mutation Analysis
The analysis was performed on both the original test suites and the test suites generated through
EvoSuite. For each case, the test suites for Kenny and Perry were considered separately. The results
are presented in tables 4.2, 4.3, 4.5 and 4.6, where each row corresponds to one of the mutation
operators. The columns display, for each operator, the number of created mutants, how many of
those were killed, how many were left alive and how many were never reached due to the lack of
coverage.
4.2.1 Original Test Suites
Results from the mutation analysis of the original test suites can be found in tables 4.2 and 4.3.
The mutation score (MS) for both modules is very low, which was to be expected considering the
low LC.
It is immediately apparent that some operators, specifically NCM, RVM and VMCM, create most of
the mutants. Although this may be affected by the type of code being mutated, it is most likely due
to their more widely applicable nature. For example, NCM overlaps to some degree with CBM but
applies to far more situations.
The uneven number of mutants created for the two modules is explained by size: Kenny has more
than twice the LOC of Perry, hence far more mutants are created.
Operator  Created  Killed      Live  No coverage
CBM       2666     103 (4%)    84    2479
IM        1589     57 (4%)     36    1496
INM       5        1 (20%)     0     4
MM        330      31 (9%)     16    283
NCM       10091    564 (9%)    153   9374
RVM       5143     230 (4%)    48    4865
VMCM      6727     183 (3%)    184   6360
Total     26551    1169 (4%)   521   24861
Table 4.2: Analysis result of Kenny's test suites

Operator  Created  Killed      Live  No coverage
CBM       452      29 (6%)     24    399
IM        184      16 (9%)     2     166
INM       0        0 (0%)      0     0
MM        69       15 (22%)    3     51
NCM       2110     638 (20%)   122   1350
RVM       3017     595 (20%)   198   2224
VMCM      1607     41 (3%)     22    1544
Total     7439     1334 (18%)  371   5734
Table 4.3: Analysis result of Perry's test suites
An observation is that, although the MS is very low for both analyses, the ratio of killed to live
mutants overall leans toward killed. Table 4.4 contains the MS recalculated to consider only mutants
with coverage. The test suites are effective on the parts of the code base they cover; the MS is low
due to the low coverage and would most likely increase accordingly with higher coverage.
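The recalculated score is simply killed / (killed + live), ignoring mutants without coverage. As a small check against the totals of tables 4.2 and 4.3:

```java
// Mutation score restricted to covered mutants: killed / (killed + live).
public class CoveredMutationScore {

    public static double score(int killed, int live) {
        return (double) killed / (killed + live);
    }

    public static void main(String[] args) {
        // Totals from tables 4.2 and 4.3.
        System.out.printf("Kenny: %.0f%%%n", 100 * score(1169, 521)); // ~69%
        System.out.printf("Perry: %.0f%%%n", 100 * score(1334, 371)); // ~78%
    }
}
```

The results reproduce the Total rows of table 4.4: roughly 69% for Kenny and 78% for Perry.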
          Kenny                 Perry
Operator  Killed       Live    Killed       Live
CBM       103 (55%)    84      29 (55%)     24
IM        57 (61%)     36      16 (89%)     2
INM       1 (100%)     0       0 (0%)       0
MM        31 (66%)     16      15 (83%)     3
NCM       564 (79%)    153     638 (84%)    122
RVM       230 (83%)    48      595 (75%)    198
VMCM      183 (49%)    184     41 (65%)     22
Total     1169 (69%)   521     1334 (78%)   371
Table 4.4: Ratio between killed and live mutants for the analyses of Kenny and Perry
4.2.2 Generated Test Suites
The automatic generation of unit tests yielded new test suites with significantly more unit tests. The
test suites generated for Kenny contained 4923 unit tests with 27% LC, compared to the previous
5.6%. The suites generated for Perry contained 2037 unit tests with 40% LC, compared to the
previous 28.5%. Although this is a significant increase in coverage, it is still low considering that the
suites were generated with the goal of achieving a high coverage score. This can be attributed to the
complex codebase and is most likely difficult to remedy.
Results from the mutation analysis of the generated test suites can be found in tables 4.5 and 4.6.
The generated suites were analyzed in the same manner as the original suites. The increase in
coverage was reflected by a similar increase in MS, strengthening the explanation given in section
4.2.1.
It was a concern that, for automatically generated tests, the MS, i.e. the quality, might be significantly
lower than the coverage; this turned out not to be the case.
Operator  Created  Killed      Live  No coverage
CBM       2666     430 (16%)   241   1995
IM        1589     208 (13%)   188   1193
INM       5        0 (0%)      0     5
MM        330      27 (8%)     50    253
NCM       10091    1654 (16%)  780   7657
RVM       5143     1357 (26%)  397   3389
VMCM      6727     1078 (16%)  648   5001
Total     26551    4754 (19%)  2304  19493
Table 4.5: Analysis result of Kenny's generated test suites

Operator  Created  Killed      Live  No coverage
CBM       452      167 (37%)   17    268
IM        184      62 (34%)    6     116
INM       0        0 (0%)      0     0
MM        69       40 (58%)    2     27
NCM       2110     942 (45%)   105   1063
RVM       3017     768 (25%)   86    2163
VMCM      1607     500 (31%)   91    1016
Total     7439     2479 (33%)  307   4653
Table 4.6: Analysis result of Perry's generated test suites
Again, it can be observed that the ratio of killed to live mutants overall leans toward killed.
Table 4.7 contains the MS recalculated to consider only mutants with coverage.
          Kenny                 Perry
Operator  Killed       Live    Killed       Live
CBM       430 (64%)    241     167 (91%)    17
IM        208 (52%)    188     62 (91%)     6
INM       0 (0%)       0       0 (0%)       0
MM        27 (35%)     50      40 (95%)     2
NCM       1654 (68%)   780     942 (96%)    105
RVM       1357 (77%)   397     768 (90%)    86
VMCM      1078 (62%)   648     500 (85%)    91
Total     4754 (67%)   2304    2479 (90%)   307
Table 4.7: Ratio between killed and live mutants for the analyses of Kenny and Perry
4.2.3 Performance
During the two mutation analyses, performance data was gathered for both the original test suites
(OTS) and the generated test suites (GTS). Table 4.8 contains an overview with Line Coverage (LC),
number of unit tests (#UT), number of covered mutants (#CM), number of executed tests (#ET) and
the execution time.
           LC     #UT   #CM   #ET    Exec time
OTS Kenny  5.6%   233   1690  4532   4 min 36 sec
GTS Kenny  27%    4923  7058  71638  3 h 29 min 30 sec
OTS Perry  28.5%  172   1705  24854  1 h 21 min 21 sec
GTS Perry  40%    2037  2786  24510  50 min 56 sec
Table 4.8: Summary of mutation analysis performance
The computationally expensive nature of mutation analysis comes from executing entire test suites
against every mutant program. PIT, however, only executes the tests that cover a mutation. The
number of covered mutations and the number of unit tests directly determine the number of
executed tests, which is the most time-consuming part of mutation analysis.
Immediately visible is the enormous increase in execution time between analyzing OTS Kenny and
GTS Kenny. Although the LC increased moderately, that alone cannot explain the spike. The
increased LC presumably led to more covered mutations; combined with the drastic increase in the
number of unit tests, this increased the number of test executions, hence the spike in execution time.
The difference in execution time between analyzing OTS Kenny and OTS Perry is somewhat difficult
to explain. Although one test suite has higher coverage than the other, both cover almost the same
number of mutations. Combined with a similar number of unit tests, this should result in similar
execution times. The most likely explanation is that only a handful of tests cover any mutations in
OTS Kenny, resulting in fewer executed tests than for OTS Perry. This indicates that the number of
tests and the number of covered mutations together are sufficient predictors of execution time.
The reduction in execution time between analyzing OTS Perry and GTS Perry is unexpected. Judging
by the case of OTS Kenny and GTS Kenny, the increase in LC should increase the execution time.
GTS Perry has higher LC, more unit tests and more covered mutations, yet fewer executed tests,
resulting in a shorter execution time. The only explanation for this phenomenon is that, even with
over ten times more unit tests, fewer tests in GTS Perry cover any mutation compared to OTS Perry.
4.3 Linear Regression Analysis
The results presented in this section are shared with the thesis work of Mishra [SM17]. Although the
data are the same, they are incorporated differently into the two works.
Table 4.9 displays the measured Line Coverage (LC), Branch Coverage (BC) and Path Coverage (PC)
for both modules.
       LC     BC     PC
Kenny  5.6%   5%     2%
Perry  28.5%  27.7%  13%
Table 4.9: Line, branch and path coverage summary
For the same reasons as in section 4.1, both LOC and CC were considered in the regression analysis.
Table 4.10 contains the estimated coefficients for the saturated regression model, between the
dependent variable (target) MS and the independent variables (predictors) LC, LOC and CC. Each
row in the model represents a predictor. A p-value above 0.05 indicates that the variable is
statistically insignificant, and thus LOC and CC could safely be removed from the model.
               Estimate     Std Error  tStat    pValue
Lines of code  5.0995e-05   4.011e-05  1.2714   0.20383
Complexity     -0.00026416  0.0002234  -1.1825  0.23725
Line coverage  0.81205      0.0086981  93.36    0
Table 4.10: Estimated coefficients for the saturated regression model
Figure 4.2 is the scatter plot between MS and LC. Each data point corresponds to a class in either
module, with the size of the circle representing the class's LOC.
The coefficient of determination, R2, is displayed above the regression line and is perhaps the most
relevant information here. It indicates how well the regression line fits the data set, i.e. how well the
independent variable predicts the dependent variable, in this case how well LC predicts MS.
Figure 4.2: Scatter plot between MS and LC
Figures 4.3 and 4.4 show the results of performing the same process with BC and PC.
Figure 4.3: Scatter plot between MS and BC
Figure 4.4: Scatter plot between MS and PC
The result of the linear regression analysis indicates that LC is the most accurate predictor of MS.
All three comparisons resulted in moderately high R-squared values, which means that any one of
the three coverage criteria can be used as a predictor of MS with moderate to high accuracy.
This result agrees with that of Gopinath et al. [GJ14] and indicates that the same relationship found
when examining a few hundred projects also holds in this SUT.
Chapter 5
Discussion
This chapter discusses the observed results with regard to the research questions, reflects on the
presented findings and concludes the work.
5.1 Quality of Unit Tests
Mutation analysis applies a set of mutation operators to create a set of mutants. The test suites are
then executed against these mutants to measure how many mutants can be detected, the purpose
being to measure the quality of the test suites. Mutations in uncovered parts of the codebase are
never detected, directly lowering the mutation score (MS). Performing mutation analysis on the
whole SUT without moderate to high coverage will therefore always result in a low total MS.
It was assumed very early in the thesis, that the mutation score for the original suites would be low.
This was due to the low coverage and limited number of unit tests, and was shown to be true
immediately after the first mutation analysis.
As mentioned in sections 4.2.1 and 4.2.2, the results also support a different observation. When
measuring the numbers of killed, live and uncovered mutants, it was noted that the ratio between
killed and live mutants leaned toward killed. This became clearer when recalculating the mutation
score with only the covered mutants: when considering only the part of the code base with coverage,
the unit tests are surprisingly effective, with a mutation score of around 70%.
This observation can be explained as follows: the test suites are effective on the parts of the code
base they cover, and the mutation score is low due to the low coverage and would increase
accordingly with higher coverage. This of course only holds if any new test suites that are added
maintain the same level of quality.
The generated test suites display exactly the same behavior, with the MS being drastically higher
when only covered mutants are considered, adding to the plausibility of the above explanation.
It was a concern that, for automatically generated tests, the quality would be significantly lower than
the line coverage (LC). The concern was based on the fact that the tests would be generated
automatically and might not test the behavior of methods to the same degree as hand-written tests.
This turned out not to be the case for either Kenny or Perry, with both maintaining an MS close to
the corresponding LC. This observation indicates that automatically generated tests are of sufficient
quality that developers should consider them as a replacement for unit tests if the current coverage
is low, or use them to augment the current test suites. The moderate quality of the automatically
generated test suites should also remove any concern about the validity of comparisons between the
original test suites and the generated ones.
5.2 Cost and Benefits of Mutation Testing
Mutation testing subsumes many other coverage criteria [OV96] and has been shown to predict
actual fault detection ability better than other criteria in some settings, and never worse [GJ14].
Thus, it is difficult to deny the effectiveness of mutation testing. The practicality of mutation testing,
however, is very much up for debate.
As described in section 3.4, PIT applies a set of mutation operators to the byte code, generating a
large number of mutant classes, and uses line coverage information to execute, for each mutant, only
the tests that cover the line containing the mutation. This optimization is significant in reducing the
execution time; the longest execution time during this thesis was 3 hours 30 minutes.
The results of the mutation analysis displayed a drastic increase in execution time when the number
of unit tests covering any mutation and the number of covered mutations increased. Let us refer to
the situation where two unit tests cover the same mutation as overlapping. Overlapping, as discussed
in the results chapter, increases the number of test executions without an increase in killed mutants,
thus directly increasing execution time without increasing the mutation score (MS).
In a perfect world, there would be a handful of unit tests covering all mutations with no overlapping.
However, it is reasonable to assume that as coverage increases, so does the overlapping. Beyond a
threshold, the increase in execution time when supplementing the test suite will not be worth the
increase in mutation score.
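The cost of overlap can be sketched with a small model (the data is hypothetical, and this is not how PIT reports its results): the number of test executions is the sum over covered mutants of the tests covering each one, while the number of kills cannot exceed the number of covered mutants regardless of overlap.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of execution cost: each mutant is run against
// every test that covers it, so overlapping coverage multiplies
// executions without adding killable mutants.
public class OverlapCost {

    public static int executions(Map<String, List<String>> testsPerMutant) {
        int total = 0;
        for (List<String> tests : testsPerMutant.values()) {
            total += tests.size();
        }
        return total;
    }
}
```

With three mutants each covered by a single distinct test, three executions suffice; if the same three mutants are each covered by four overlapping tests, twelve executions are needed for at best the same three kills.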
The purpose of conducting the regression analysis was to assess the ability of common coverage
metrics to predict the mutation score, in order to determine whether mutation analysis is truly worth
the cost or whether other, cheaper coverage metrics could be used instead.
The results indicate that LC is an effective predictor of mutation score. Although LC is by no means
a replacement for mutation analysis, it can serve as an indicator in practice.
Developers can use LC as the everyday measure of test suite effectiveness and run mutation analysis
of the SUT on a schedule. Through mocking of dependencies and conscious decisions to minimize
the overlap in coverage between tests, the execution time should be manageable.
5.4 Reflection
The decision to perform mutation analysis on the two modules was due to technical limitations. This
approach can be criticized due to the risk of test suites for one module covering parts of the other
module. This could result in some lost coverage that could have increased the mutation score,
although most likely not in a meaningful way.
Generating a second data set for comparison did enable some comparison of performance. However,
the legacy code and high interdependency may have contributed to meaningless tests, e.g. test suites
that simply call class constructors to add to the LC. The measurements obtained from these test suites
might therefore not be representative.
Performing mutation analysis on the original tests and the generated tests resulted in some
interesting data. However, manually creating test suites to measure the performance at different
levels of LC and MS would have been more fruitful.
5.5 Sustainability and Societal Aspects
This thesis is a case study of a fault-based testing technique on a software project used within the
financial industry; as such, there are few ethical concerns. From a societal perspective, this thesis is
relevant not only for the project team providing the SUT, but also for other development setups with
similar projects and a need for higher-quality testing efforts.
From an economic sustainability perspective, studies in this field contribute to preventing software
failures with significant economic consequences [FT01]. This thesis can inspire anyone trying to
delve into the subject of mutation testing and higher-quality testing.
5.6 Conclusions
Performing mutation analysis, what is the quality of the present test suites? The mutation score for
the SUT is low, indicating that very few of the created mutants are detected. However, when
considering only the part of the code base with coverage, the unit tests are surprisingly effective,
with a mutation score of around 70%. Hence, it is reasonable to assume that the current unit tests
are of high quality, albeit only covering a small portion of the system.
What are the costs and benefits of adopting mutation testing? Mutation testing subsumes many
other coverage criteria [OV96] and has been shown to predict actual fault detection ability better
than other criteria in some settings, and never worse [GJ14]. Thus the benefits of mutation testing
are difficult to dispute. The concern is the practicality of the testing method, with execution time
being the most significant factor. This turns out to be reasonable when using a more modern
mutation system, such as the one used within this degree project. Following are two
recommendations on how mutation testing could be used.
The overlap in coverage between tests was shown to be a major contributor to the high execution
time of mutation analysis. Minimizing the number of tests, maximizing the number of covered
mutations and minimizing the overlap in coverage between tests should result in the best possible
execution time.
Performing regression analysis on the original test suites showed LC to perform best as a predictor.
Developers can thus use LC as the measure of test suite effectiveness in practice, and run mutation
analysis of the SUT on a schedule.
Another, unexpected conclusion of this thesis project concerns the automatically generated unit
tests. The generated test suites had significantly higher coverage than the original test suites, and
mutation analysis revealed that their mutation score was also higher than that of the original suites.
This means that the generated test suites not only cover more of the code base but are also effective
in doing so. It can be concluded that automatically generated suites can replace hand-written test
suites if the current coverage is low, or be used to augment the hand-written suites.
5.7 Future Work
Although this was a case study, further work can be done to extend the results gathered in this thesis.
It would be interesting to create test suites following the approach suggested here and to measure
the performance at different levels of LC. This would yield findings about scalability and confirm or
refute the conclusions drawn in this thesis.
References
[YH14] X. Yao, M. Harman, and Y. Jia, “A study of equivalent and stubborn mutation operators using
human analysis of equivalence,” International Conference on Software Engineering, pp. 919–930,
2014.
[VM97] J. Voas and G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. John
Wiley & Sons, 1997.
[OR93] A. J. Offutt, G. Rothermel, and C. Zapf, “An experimental evaluation of selective mutation,"
in Proceedings of the Fifteenth International Conference on Software Engineering, (Baltimore, MD),
pp. 100-107, IEEE Computer Society Press, May 1993.
[WD94] W. E. Wong, M. E. Delamaro, J. C. Maldonado, and A. P. Mathur, "Constrained mutation in
C programs," in Proceedings of the 8th Brazilian Symposium on Software Engineering, (Curitiba,
Brazil), pp. 439–452, October 1994.
[HO82] W. E. Howden, “Weak Mutation Testing and Completeness of Test Sets,” IEEE Transactions
on Software Engineering, vol. 8, no. 4, pp. 371–379, July 1982.
[DG88] R. A. DeMillo, D. S. Guindi, K. N. King, W. M. McCracken, and A. J. Offutt, “An Extended
Overview of the Mothra Software Testing Environment,” in Proceedings of the 2nd Workshop on
Software Testing, Verification, and Analysis (TVA’88). Banff Alberta,Canada: IEEE Computer society,
July 1988, pp. 142–151.
[MA91] A. P. Mathur, “Performance, Effectiveness, and Reliability Issues in Software Testing,” in
Proceedings of the 5th International Computer Software and Applications Conference
(COMPSAC’79), Tokyo, Japan, 11-13 September 1991, pp. 604–605.
[OL96] A. Jefferson Offutt, Ammei Lee, Gregg Rothermel, Roland Untch and Christian Zapf: An
Experimental Determination of Sufficient Mutation Operators, ACM Trans. on Software Engineering
& Methodology, Vol. 5, pp. 99–118, April 1996.
[KO91] K. N. King and A. J. Offutt. A Fortran language system for mutation-based software testing.
Software: Practice and Experience, 21(7):685–718, July 1991.
[OC94] A. Jefferson Offutt and W. Michael Craft: Using Compiler Optimization Techniques to Detect
Equivalent Mutants, The Journal of Software Testing, Verification and Reliability, 4(3):131–154,
September 1994.
[OP97] A. Jefferson Offutt and Jie Pan: Automatically Detecting Equivalent Mutants and Infeasible
Paths, The Journal of Software Testing, Verification, and Reliability, Vol 7, No. 3, pp. 165–192,
September 1997.
[BA82] T. A. Budd and D. Angluin. Two Notions of Correctness and Their Relation to Testing. Acta
Informatica, 18(1):31–45, March 1982.
[GO92] Robert Geist, A. Jefferson Offutt, and Frederick C. Harris, "Estimation and Enhancement
of Real-Time Software Reliability Through Mutation Analysis," IEEE Transactions on Computers,
41(5), May 1992.
[AO17] Paul Ammann and Jeff Offutt, Introduction to Software Testing, Second Edition, Cambridge
University Press, New York, NY, 2017.
[HO76] William E. Howden, “Reliability of the path analysis testing strategy.” IEEE Transactions on
Software Engineering SE-2(3):208-214, September 1976.
[JH11] Yue Jia and Mark Harman, "An Analysis and Survey of the Development of Mutation Testing,"
IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, September 2011.
[BD80] Timothy A. Budd, Richard A. DeMillo, Richard J. Lipton, and Frederick G. Sayward,
"Theoretical and Empirical Studies on Using Program Mutation to Test the Functional Correctness
of Programs," in Proceedings of the 7th ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, Las Vegas, Nevada, January 28–30, 1980, pp. 220–233.
[DL78] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on Test Data Selection: Help for the
Practicing Programmer," Computer, 11(4):34–41, April 1978.
[AB79] A. T. Acree, T. A. Budd, R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Mutation Analysis,"
Technical Report GIT-ICS-79/08, Georgia Institute of Technology, Atlanta, Georgia, 1979.
[OU00] A. Jefferson Offutt and Roland H. Untch, "Mutation 2000: Uniting the Orthogonal," in
Mutation 2000: Mutation Testing in the Twentieth and the Twenty-first Centuries, San Jose, CA,
October 2000, pp. 45–55.
[OV96] A. J. Offutt and J. M. Voas. Subsumption of condition coverage techniques by mutation
testing. Technical report, 1996.
[GJ14] Rahul Gopinath, Carlos Jensen, and Alex Groce, "Code Coverage for Suite Evaluation by
Developers," in ICSE, pages 72–82, 2014.
[NH17] Nasdaq.com. (2017). Retrieved May 16, 2017, from
http://www.nasdaq.com/about/about_nasdaq.aspx
[GG13] M. Gligoric, A. Groce, C. Zhang, R. Sharma, M. A. Alipour, and D. Marinov. Comparing non-
adequate test suites using coverage criteria. In ACM International Symposium on Software Testing
and Analysis. ACM, 2013.
[NW11] Nica, S. A., Ramler, R., & Wotawa, F. (2011). Is Mutation Testing Scalable for Real-World
Software Projects? In The Third International Conference on Advances in System Testing and
Validation Lifecycle.
[PIT1] Pitest.org. (2017). Retrieved May 12, 2017, from http://pitest.org/
[EVO1] Evosuite.org. (2017). Retrieved May 16, 2017, from
http://www.evosuite.org/evosuite/
[SQ01] Documentation - SonarQube Documentation. (n.d.). Retrieved May 16, 2017, from
https://docs.sonarqube.org/display/SONAR/Documentation
[WQR] Capgemini Releases World Quality Report 2016. (2016, September 21). Entertainment
Close-up.
[MR01] T. Murnane and K. Reed, "On the Effectiveness of Mutation Analysis as a Black Box Testing
Technique," in Proceedings of the 13th Australian Software Engineering Conference (ASWEC'01),
Canberra, Australia, August 27–28, 2001, p. 12.
[DM96] M. E. Delamaro, J. C. Maldonado, and A. P. Mathur, "Integration Testing Using Interface
Mutation," in Proceedings of the Seventh International Symposium on Software Reliability Engineering
(ISSRE'96), White Plains, NY, pp. 112–121, 1996.
[FT01] Financial Times. (n.d.). Retrieved May 24, 2017, from
https://www.ft.com/content/9657d306-4d7c-11e5-b558-8a9722977189
[JC01] JaCoCo Java Code Coverage Library. (2017, March 21). Retrieved June 04, 2017, from
http://www.eclemma.org/jacoco/
[JM01] JMockit An automated testing toolkit for Java. (n.d.). Retrieved June 04, 2017, from
http://jmockit.org/
[SM17] "Analysis of test coverage metrics in a business critical setup". MSc. KTH Royal Institute of
Technology, 2017. Print.
[FA11] Gordon Fraser and Andrea Arcuri, "EvoSuite: Automatic Test Suite Generation for
Object-Oriented Software," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th
European Conference on Foundations of Software Engineering, September 5–9, 2011, Szeged, Hungary.