Reconstructing traceability between bugs and test cases: an experimental study
Nilam Kaushik∗, Ladan Tahvildari∗, Mark Moore‡
∗ Department of Electrical and Computer Engineering, University of Waterloo
‡ Research In Motion, Canada
Abstract—In manual testing, testers typically follow the steps listed in the bug report to verify whether a bug has been fixed or not. Depending on time and availability of resources, a tester may execute some additional test cases to ensure test coverage. In the case of manual testing, the process of finding the most relevant manual test cases to run is largely manual and involves tester expertise. From a usability standpoint, the task of finding the most relevant test cases is tedious as the tester typically has to switch between the defect management tool and the test case management tool in order to search for test cases relevant to the bug at hand. In this paper, we use IR techniques to recover traceability between bugs and test cases with the aim of recommending test cases for bugs. We report on our experience of recovering traceability between bugs and test cases using techniques such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) through a small industrial case study.
Keywords-traceability, test case, bug, LSI
I. INTRODUCTION AND MOTIVATION
Even in organizations with mature software development
processes, software artifacts suffer from lack of traceability,
both due to human factors and the existence of heteroge-
neous tools and distributed teams [1]. In large organizations,
while test cases are created by dedicated testing teams,
bugs may be reported by testers, developers, management
personnel or from external sources. The majority of defects
are reported by a core group of people who have some
domain knowledge and familiarity with the functional be-
havior of the product. We retrieved some statistics about
the distribution of bugs according to the affiliation of the
reporter. We found that 87% of the bugs prior to the release
of a product Y at RIM were reported by developers and
testers. Based on this, we speculate that there is a common
vocabulary used among test case and bug artifacts which
can be exploited by IR techniques to recover traceability
between bugs and test cases.
In manual testing, testers typically follow the steps listed
in the bug report to verify whether a bug has been fixed
or not. Depending on time and availability of resources,
a tester may execute some additional test cases to ensure
test coverage. In the case of manual testing, the process
of finding the most relevant manual test cases to run is
largely manual and involves tester expertise. From a usability
standpoint, the task of finding the most relevant test cases
is tedious as the tester typically has to switch between the
defect management tool and the test case management tool
in order to search for test cases relevant to the bug at hand.
In this paper, we report on our experience of recovering
traceability among bugs and test cases using IR techniques
such as Latent Semantic Indexing (LSI) and Latent Dirichlet
Allocation (LDA) through a small industrial case study.
In industry, different tools are employed to manage the
various software artefacts throughout the software development
process. Seamless integration across tools is important to
obtain the data necessary for building good traceability links
[2]. The existing defect and test case management tools
being used in our project provide fields to allow testers to
explicitly link bugs to test cases by recording the bug id and
test case id manually. However, in our study, we found that
less than 5% of the bugs during a release were linked to any
test case. The availability of such linkage data can be useful
in prioritizing regression test cases as well.
A. Related work
Lormans et al. investigated the use of LSI to establish
traceability between requirements, design, and test cases [3].
Antoniol et al. use a Probabilistic IR model and a Vector
Space IR model to construct links between source code and
documentation [4]. Marcus et al. [5] use an LSI based solu-
tion on the same systems used by Antoniol et al. Bacchelli et
al. experiment with lightweight methods, involving capturing
program elements with regexes. They also use a Vector Space
IR model and LSI to recover traceability links between
emails and source code. They showed that lightweight
methods significantly outperform IR approaches [6]. IR
methods have also been used to recover traceability links
between requirements themselves for the purpose of man-
aging requirements prior to release. Sneed [7] reconstructs
links between test cases and code components through static
and dynamic analysis for selective regression testing. To the
best of our knowledge, no existing work uses IR techniques
to recommend manual test cases to testers in order to facilitate
bug verification activities.
II. METHODOLOGY
A. Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an information retrieval
technique that assumes a latent structure in the usage of
words across documents and uses it to identify topics [8]. LSI
overcomes two shortcomings of traditional Vector Space
Model approaches, synonymy and polysemy, by discover-
ing relationships between terms across multiple documents.
Given a term-document matrix, LSI outputs a reduction
through a Singular Value Decomposition (SVD). SVD reduces
the vector space model to fewer dimensions while preserving
information about the relationships between terms. The dimension
of the reduced matrix is equal to the number of topics considered,
k. Determining the optimal value of k for a problem is still an
open research question: if k is small, the topics are few and more
general, whereas if k is large, the topics tend to overlap
semantically.
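For reference (our own addition, using the standard formulation rather than anything stated explicitly in the paper), the rank-k approximation of the m × n term-document matrix A produced by the truncated SVD can be written as

\[ A \;\approx\; A_k \;=\; U_k \, \Sigma_k \, V_k^{T}, \]

where U_k and V_k hold the first k left and right singular vectors, Σ_k holds the k largest singular values, and documents and queries are subsequently compared (e.g., via cosine similarity) in this k-dimensional topic space.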
B. Latent Dirichlet Allocation
Probabilistic Latent Semantic Indexing (PLSI) is a proba-
bilistic version of LSI [9]. Following PLSI, a fully generative
Bayesian model called Latent Dirichlet Allocation was intro-
duced [10]. As with LSI, knowing the optimal number of
topics is a challenge. We experiment with a range of topics
for both approaches.
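As a minimal sketch (ours, not from the paper) of how such a topic-range sweep could be set up with the Gensim library used in Section III-B, assuming the test case descriptions have already been pre-processed into token lists:

```python
from gensim import corpora, models

# Hypothetical placeholder: pre-processed token lists of the test case
# descriptions (stop words and numeric tokens removed, terms stemmed).
texts = [["verifi", "setup", "procedur"],
         ["verifi", "log", "enabl"],
         ["verifi", "error", "restart"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

# Sweep a range of topic counts for both LSI and LDA.
for k in range(50, 301, 50):
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=k)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=k)
    # ... evaluate each model against the bug queries (see Section III-B)
```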
III. EXPERIMENT
A. Data set
For confidentiality reasons, we cannot report names of
the products or projects in our study. We anonymize such
information in the paper.
Test cases: We evaluated the approaches over a project X
of product Y. The manual test cases are managed by a test
management tool that allows testers to create, update, and store
execution status information, etc., about test cases. In practice,
when creating test cases, testers group similar test cases
together under a common test case folder. Typically, test
case folder names contain information about the hierarchical
structure of the test case areas. We extract data from test
cases across all the releases of product Y. From each test
case, we extract the unique test case ID and the test case
description that describes what the test case is supposed to
test.
Bugs: Our queries consisted of a set of 9 closed bug
reports from project X. Bugs are also managed in a bug
tracking management system. From each bug report, we
extract the Bug report ID and the high-level text description
of the bug that was populated when the bug was originally
created by the reporter.
Our corpus consisted of 13380 test cases with a total of
4100 terms and our queries consisted of 9 random bugs
picked from Project X. For the sake of illustration, we
present some hypothetical test case and bug data in Tables
I and II.
We convert the free-form text in the bug and test case
descriptions into feature vectors. We then apply standard
pre-processing steps: removing stop words and numeric
tokens, followed by stemming.
Testcase ID | Testcase description                  | Test case folder path
12345       | Verify setup procedures               | ProductY\ProjectX\Traceability-startup
12346       | Verify logging is enabled             | ProductY\ProjectX\Traceability-restart
12347       | Verify there are no errors on restart | ProductY\ProjectX\Traceability-restart

Table I: Hypothetical test case data

Bug ID | Bug description
1001   | NPE when setting up traceability module
1002   | Errors upon restarting traceability module

Table II: Hypothetical bug data
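A rough sketch of these pre-processing steps (our own illustration; the stop-word list is an illustrative subset and the choice of the Porter stemmer is an assumption, as the paper does not name one):

```python
import re
from nltk.stem import PorterStemmer  # assumed stemmer; the paper does not specify one

# Illustrative subset of a stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "no", "there", "when", "up"}
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words and purely numeric tokens, then stem."""
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
    return [stemmer.stem(t) for t in tokens]

# e.g., for the hypothetical bug 1001 from Table II:
print(preprocess("NPE when setting up traceability module"))
# -> roughly ['npe', 'set', 'traceabl', 'modul']
```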
B. Evaluation
To assess the suitability of IR methods in recommend-
ing test cases for bugs, we conducted a quantitative and
qualitative analysis of the retrieved links. We use a Python
framework called Gensim to extract the semantic topics
from documents[11]. We assess the effectiveness of LSI and
LDA through two well known IR metrics, namely precision
and recall [12]. Precision for a given query is the ratio of
the number of relevant documents retrieved over the total
number of retrieved documents. Meanwhile, recall is the
ratio of the number of relevant documents retrieved over
the total number of relevant documents for a given query.
\[ \mathit{precision}_i = \frac{|\mathit{correct}_i \cap \mathit{retrieved}_i|}{|\mathit{retrieved}_i|} \qquad (1) \]

\[ \mathit{recall}_i = \frac{|\mathit{correct}_i \cap \mathit{retrieved}_i|}{|\mathit{correct}_i|} \qquad (2) \]
Both metrics take values in [0, 1]. We also use the
F-measure, the harmonic mean of precision and recall, which
takes into account the trade-off between the two measures.
\[ F = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \qquad (3) \]
De Lucia et al. outline several strategies for selecting
traceability links such as constant threshold, variable thresh-
old, cut point and cut percentage [12]. We chose a commonly
used constant threshold of 0.7.
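To make the link selection and the metrics concrete, the following sketch (again ours; function and variable names are hypothetical) shows how a bug query could be ranked against the LSI index with Gensim, with the 0.7 similarity threshold and a result cut-off applied, and how per-query precision, recall, and F-measure would then be computed:

```python
from gensim import similarities

def build_index(lsi, bow_corpus):
    """Pre-compute an LSI similarity index over all test case vectors."""
    return similarities.MatrixSimilarity(lsi[bow_corpus])

def recommend(bug_tokens, lsi, dictionary, index, testcase_ids,
              cutoff=2, threshold=0.7):
    """Return up to `cutoff` test case ids whose cosine similarity to the
    bug in LSI space is at least `threshold`."""
    query_vec = lsi[dictionary.doc2bow(bug_tokens)]
    sims = sorted(enumerate(index[query_vec]), key=lambda x: x[1], reverse=True)
    return [testcase_ids[i] for i, score in sims[:cutoff] if score >= threshold]

def precision_recall_f1(retrieved, correct):
    """Per-query precision, recall, and F-measure, as in equations (1)-(3)."""
    hits = len(set(retrieved) & set(correct))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```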
In order to find the true bug-testcase linkages, we were
assisted by a tester who was familiar with the bugs and test
cases for Project X. For the 9 bugs that we were interested
in, there was a 1-1 mapping between bugs and test cases.
This may not always be true as some bugs may be linked
to more than one test case.
C. Scenarios
Scenario I: As noted above, we found that similar test
cases were grouped together under the same test case folder,
the names of which typically contain information about
the hierarchical structure of the test area.

Scenario | Bug ID | Relevant Test Cases | Relevance Criteria
I        | 1001   | 12345               | Match is relevant if the retrieved Test Case matches the Test Case in the true linkage set
I        | 1002   | 12347               |
II       | 1001   | 12345               | Match is relevant if the retrieved Test Case belongs to the same folder as the Test Case in the true linkage set
II       | 1002   | 12347, 12346        |

Table III: Set of true linkages for Scenarios I and II

In Table I, Test cases 12346 and 12347 test the same functionality and are
grouped under the folder “Traceability-restart”. In the first
experiment, we wanted to observe the effects of including
test case folder names with the test case descriptions on
the overall accuracy of our results. Therefore, we ran our
experiment with and without including the test case folder
names with the test case descriptions. In this scenario,
a match is considered relevant if the retrieved Test case
matches the Test case in the true linkage set.
For the hypothetical data in Tables I and II, we present the
true linkages for Scenario-I in Table III. Since each bug is
associated with exactly one test case, for any given query
there will be only one true link.
Scenario II: Based on the results from Scenario-I, it
was evident that the folder name indeed had some useful
information. To take this a step further, we changed our
criteria for relevance as follows: we consider a match to be
relevant if it belongs to the same folder as the Test case
in the true linkage set. We show the hypothetical true linkages
for Scenario-II in Table III. As each test case folder would
contain more than one test case, each bug may be associated
with more than one test case. For bug 1002, there are two
relevant test cases: 12347 (the exact match from the true
linkages in Scenario-I) and 12346, which belongs to the
same test case folder as Test case 12347.
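The two relevance criteria can be summarized as follows (our sketch; `folder_of` is a hypothetical mapping from a test case id to its folder path):

```python
def relevant_scenario_1(retrieved_id, true_ids):
    """Scenario I: the retrieved test case must itself be in the true linkage set."""
    return retrieved_id in true_ids

def relevant_scenario_2(retrieved_id, true_ids, folder_of):
    """Scenario II: it suffices that the retrieved test case shares its folder
    with some test case in the true linkage set."""
    return any(folder_of[retrieved_id] == folder_of[t] for t in true_ids)
```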
IV. RESULTS AND LESSONS LEARNED
The average F-measure for the optimal set of parameters
(cut-off = 2, similarity cut-off = 0.7) for Scenario-I using
LSI is shown in Figure 1. As expected, the overall
accuracy for the case where we include the test case folder
names in our corpus is higher. The F-measure plateaus at 0.33
between 50 and 250 topics, after which it drops to 0.22. In
the case where we do not include the test case folder
information, LSI did not retrieve any valid links until 150
topics, after which we noticed a constant F-measure. Upon
examining the links, we found that there was one particular
test case which was being retrieved correctly for a range of
topics.
The average F-measure values for Scenario-II are shown
in Figure 2. As we are interested in achieving high precision
Figure 1: Average F-measure vs. topics for Scenario I
Figure 2: Average F-measure vs. topics for Scenario II
for the lowest possible number of retrieved results, we
choose cut-offs of 2, 5, and 10 results. We anticipate
that testers would have the attention span for up to 10
recommended test cases per bug. With LSI, the maximum
average F-measure we were able to achieve was 0.44 for
a cut-off of 2 results. With LSI, we achieve best results
within a window of 150-200 topics for any cut-off value.
LDA performs quite inconsistently regardless of the cut-off
values.
In our qualitative assessment, we found that bug descrip-
tions were sufficiently descriptive in helping one understand
the functional impact of the bug. On the other hand, test
case descriptions ranged from being short and vague to
sufficiently descriptive. We also noted that very short test
case descriptions would result in a large number of false
positives. Therefore, we excluded test cases with fewer
than three terms from our corpus.
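A one-step sketch of this filtering (ours; `testcases` is a hypothetical mapping from test case id to description, and `preprocess` is the pre-processing routine sketched in Section III-A):

```python
def filter_short_testcases(testcases, preprocess, min_terms=3):
    """Drop test cases whose pre-processed descriptions contain fewer than
    `min_terms` terms."""
    return {tc_id: desc for tc_id, desc in testcases.items()
            if len(preprocess(desc)) >= min_terms}
```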
Hayes et al. assert that IR techniques must not substitute
the human decision-maker in the linking process, but should
be used to generate a list of candidate links [13]. We rely
on the tester’s judgment and domain knowledge to help us
find the true linkage between bugs and test cases. However,
there is a good chance that the tester only provides those
links that are obvious and misses the ones that are outside
their knowledge. Therefore, we manually inspect the results
with the help of the tester to verify the quality of our results.
It is worth mentioning that although at a higher cut-off we
observed a drop in accuracy, LSI retrieved links that
were not in the true linkage set but which the tester actually
considered insightful and relevant. In some cases, the links
belonged to other test areas outside the tester’s immediate
knowledge, but were still relevant to the bug. Although we
did not treat such links as true links, the significance of such
links as recommendations would be interesting to assess.
V. CONCLUSION AND FUTURE WORK
Our work opens up the possibility of building tools using IR
techniques to recommend test cases to assist testers in their
bug verification activities. A list of recommended test cases
for the bug at hand would enable testers to not only consider
executing test cases that they might have in mind but also
those that may be outside their immediate knowledge.
We found that including information about the hierarchical
structure of the test areas improved accuracy. LSI signifi-
cantly outperformed LDA for any cut-off and achieved best
results between 150 and 200 topics. Based on our qualitative as-
sessment, we found that recommendations can be improved
if testers make test case descriptions more detailed. As future
work, we plan to do a more comprehensive evaluation by
considering more strategies for selecting traceability links
as discussed in [12]. It would also be interesting to see the
effects on accuracy of including additional information in the
corpus, such as the test case steps and the steps to reproduce
a bug. So far, we have developed a prototype for testers
which incorporates a means for obtaining explicit feedback
by allowing the tester to rate the usefulness of a provided
recommendation. We plan to investigate how such tester
feedback can be fed back into the approach to provide better
recommendations for other testers.
A. Acknowledgment
We are grateful for the support and feedback of the testing
personnel from RIM who helped us with the qualitative
assessment. We are also grateful to Weining Liu and Shimin
Li for their feedback and insight.
REFERENCES
[1] P. Arkley, P. Mason, and S. Riddle, “Position paper: Enabling traceability,” in Proc. of the 1st International Workshop on Traceability in Emerging Forms of Software Engineering, 2002, pp. 61–65.
[2] N. Kaushik, M. Salehie, L. Tahvildari, S. Li, and M. Moore, “Dynamic prioritization in regression testing,” in Proceedings of the Software Testing, Verification and Validation Workshops (ICSTW), 2011, pp. 135–138.
[3] M. Lormans and A. van Deursen, “Can LSI help reconstructing requirements traceability in design and test?” in Proc. of the Conference on Software Maintenance and Reengineering, 2006, pp. 47–56.
[4] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, “Recovering traceability links between code and documentation,” IEEE Transactions on Software Engineering, vol. 28(10), 2002, pp. 970–983.
[5] A. Marcus and J. I. Maletic, “Recovering documentation-to-source-code traceability links using latent semantic indexing,” in ICSE’03, 2003, pp. 125–137.
[6] A. Bacchelli, M. Lanza, and R. Robbes, “Linking e-mails and source code artifacts,” in Proc. of the International Conference on Software Engineering (ICSE), 2010.
[7] H. M. Sneed, “Reverse engineering of test cases for selective regression testing,” in CSMR’04, 2004, pp. 69–74.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, 1990, pp. 391–407.
[9] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42(1), 2001, pp. 177–196.
[10] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, 2003, pp. 993–1022.
[11] R. Rehurek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[12] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora, “Enhancing an artefact management system with traceability recovery features,” in Proc. of the 20th IEEE Intl. Conf. on Software Maintenance, 2004, pp. 306–315.
[13] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram, “Advancing candidate link generation for requirements tracing: The study of methods,” IEEE Transactions on Software Engineering, vol. 32(1), 2006, pp. 4–19.