Reconstructing traceability between bugs and test cases: an experimental study
Nilam Kaushik∗, Ladan Tahvildari∗, Mark Moore‡
∗ Department of Electrical and Computer Engineering, University of Waterloo
‡ Research In Motion, Canada
Abstract—In manual testing, testers typically follow the steps listed in the bug report to verify whether a bug has been fixed or not. Depending on time and availability of resources, a tester may execute some additional test cases to ensure test coverage. In the case of manual testing, the process of finding the most relevant manual test cases to run is largely manual and involves tester expertise. From a usability standpoint, the task of finding the most relevant test cases is tedious as the tester typically has to switch between the defect management tool and the test case management tool in order to search for test cases relevant to the bug at hand. In this paper, we use IR techniques to recover traceability between bugs and test cases with the aim of recommending test cases for bugs. We report on our experience of recovering traceability between bugs and test cases using techniques such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) through a small industrial case study.
Keywords-traceability, test case, bug, LSI
I. INTRODUCTION AND MOTIVATION
Even in organizations with mature software development
processes, software artifacts suffer from lack of traceability,
both due to human factors and the existence of heteroge-
neous tools and distributed teams [1]. In large organizations,
while test cases are created by dedicated testing teams,
bugs may be reported by testers, developers, management
personnel or from external sources. The majority of defects
are reported by a core group of people who have some
domain knowledge and familiarity with the functional be-
havior of the product. We retrieved some statistics about
the distribution of bugs according to the affiliation of the
reporter. We found that 87% of the bugs prior to the release
of a product Y at RIM were reported by developers and
testers. Based on this, we speculate that there is a common
vocabulary used among test case and bug artifacts which
can be exploited by IR techniques to recover traceability
between bugs and test cases.
In manual testing, testers typically follow the steps listed
in the bug report to verify whether a bug has been fixed
or not. Depending on time and availability of resources,
a tester may execute some additional test cases to ensure
test coverage. In the case of manual testing, the process
of finding the most relevant manual test cases to run is
largely manual and involves tester expertise. From a usability
standpoint, the task of finding the most relevant test cases
is tedious as the tester typically has to switch between the
defect management tool and the test case management tool
in order to search for test cases relevant to the bug at hand.
In this paper, we report on our experience of recovering
traceability among bugs and test cases using IR techniques
such as Latent Semantic Indexing (LSI) and Latent Dirichlet
Allocation (LDA) through a small industrial case study.
In industry, different tools are employed to manage the
various software artefacts throughout the software development
process. Seamless integration across tools is important to
obtain the data necessary for building good traceability links
[2]. The existing defect and test case management tools
being used in our project provide fields to allow testers to
explicitly link bugs to test cases by recording the bug id and
test case id manually. However, in our study, we found that
less than 5% of the bugs during a release were linked to any
test case. The availability of such linkage data can be useful
in prioritizing regression test cases as well.
A. Related work
Lormans et al. investigated the use of LSI to establish
traceability between requirements, design, and test cases [3].
Antoniol et al. use a Probabilistic IR model and a Vector
Space IR model to construct links between source code and
documentation [4]. Marcus et al. [5] use an LSI based solu-
tion on the same systems used by Antoniol et al. Bacchelli et
al. experiment with lightweight methods, involving capturing
program elements with regexes. They also use a Vector Space
IR model and LSI to recover traceability links between
emails and source code. They showed that lightweight
methods significantly outperform IR approaches [6]. IR
methods have also been used to recover traceability links
between requirements themselves for the purpose of man-
aging requirements prior to release. Sneed [7] reconstructs
links between test cases and code components through static
and dynamic analysis for selective regression testing. To the
best of our knowledge, no existing work uses IR techniques
to recommend manual test cases to testers in order to facilitate
bug verification activities.
II. METHODOLOGY
A. Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an information retrieval
technique that assumes a latent structure in the usage of
words across documents and uses it to identify topics [8]. LSI
overcomes two shortcomings of traditional Vector Space
Model approaches, synonymy and polysemy, by discover-
ing relationships between terms across multiple documents.
Given a term-document matrix, LSI outputs a reduction
through a Singular Value Decomposition (SVD). SVD reduces
the vector space model to fewer dimensions while preserving
information about the relationships between terms. The dimension
of the reduced matrix is equal to the number of topics considered,
k. Determining the optimal value of k for a problem is still an
open research question: if k is small, the topics are few and more
general, whereas if k is large, the topics tend to overlap
semantically.
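For reference (our own addition, using the standard formulation rather than anything stated explicitly in the paper), the rank-k approximation of the m × n term-document matrix A produced by the truncated SVD can be written as

\[ A \;\approx\; A_k \;=\; U_k \, \Sigma_k \, V_k^{T}, \]

where U_k and V_k hold the first k left and right singular vectors, Σ_k holds the k largest singular values, and documents and queries are subsequently compared (e.g., via cosine similarity) in this k-dimensional topic space.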
B. Latent Dirichlet Allocation
Probabilistic Latent Semantic Indexing (PLSI) is a proba-
bilistic version of LSI [9]. Following PLSI, a fully generative
Bayesian model called Latent Dirichlet Allocation was intro-
duced [10]. As with LSI, knowing the optimal number of
topics is a challenge. We experiment with a range of topics
for both approaches.
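As a minimal sketch (ours, not from the paper) of how such a topic-range sweep could be set up with the Gensim library used in Section III-B, assuming the test case descriptions have already been pre-processed into token lists:

```python
from gensim import corpora, models

# Hypothetical placeholder: pre-processed token lists of the test case
# descriptions (stop words and numeric tokens removed, terms stemmed).
texts = [["verifi", "setup", "procedur"],
         ["verifi", "log", "enabl"],
         ["verifi", "error", "restart"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

# Sweep a range of topic counts for both LSI and LDA.
for k in range(50, 301, 50):
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=k)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=k)
    # ... evaluate each model against the bug queries (see Section III-B)
```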
III. EXPERIMENT
A. Data set
For confidentiality reasons, we cannot report names of
the products or projects in our study. We anonymize such
information in the paper.
Test cases: We evaluated the approaches over a project X
of product Y. The manual test cases are managed by a test
management tool that allows testers to create, update, and store
execution status information, etc., about test cases. In practice,
when creating test cases, testers group similar test cases
together under a common test case folder. Typically, test
case folder names contain information about the hierarchical
structure of the test case areas. We extract data from test
cases across all the releases of product Y. From each test
case, we extract the unique test case ID and the test case
description that describes what the test case is supposed to
test.
Bugs: Our queries consisted of a set of 9 closed bug
reports from project X. Bugs are also managed in a bug
tracking management system. From each bug report, we
extract the Bug report ID and the high-level text description
of the bug that was populated when the bug was originally
created by the reporter.
Our corpus consisted of 13380 test cases with a total of
4100 terms and our queries consisted of 9 random bugs
picked from Project X. For the sake of illustration, we
present some hypothetical test case and bug data in Tables
I and II.
We convert the free-form text in the bug and test case
descriptions into feature vectors. We then apply standard
pre-processing steps: removing stop words and numeric
tokens, followed by stemming.
Testcase ID | Testcase description                  | Test case folder path
12345       | Verify setup procedures               | ProductY\ProjectX\Traceability-startup
12346       | Verify logging is enabled             | ProductY\ProjectX\Traceability-restart
12347       | Verify there are no errors on restart | ProductY\ProjectX\Traceability-restart

Table I: Hypothetical test case data

Bug ID | Bug description
1001   | NPE when setting up traceability module
1002   | Errors upon restarting traceability module

Table II: Hypothetical bug data
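A rough sketch of these pre-processing steps (our own illustration; the stop-word list is an illustrative subset and the choice of the Porter stemmer is an assumption, as the paper does not name one):

```python
import re
from nltk.stem import PorterStemmer  # assumed stemmer; the paper does not specify one

# Illustrative subset of a stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "no", "there", "when", "up"}
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words and purely numeric tokens, then stem."""
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
    return [stemmer.stem(t) for t in tokens]

# e.g., for the hypothetical bug 1001 from Table II:
print(preprocess("NPE when setting up traceability module"))
# -> roughly ['npe', 'set', 'traceabl', 'modul']
```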
B. Evaluation
To assess the suitability of IR methods in recommend-
ing test cases for bugs, we conducted a quantitative and
qualitative analysis of the retrieved links. We use a Python
framework called Gensim to extract the semantic topics
from documents[11]. We assess the effectiveness of LSI and
LDA through two well known IR metrics, namely precision
and recall [12]. Precision for a given query is the ratio of
the number of relevant documents retrieved over the total
number of retrieved documents. Meanwhile, recall is the
ratio of the number of relevant documents retrieved over
the total number of relevant documents for a given query.
\[ \mathit{precision}_i = \frac{|\mathit{correct}_i \cap \mathit{retrieved}_i|}{|\mathit{retrieved}_i|} \qquad (1) \]

\[ \mathit{recall}_i = \frac{|\mathit{correct}_i \cap \mathit{retrieved}_i|}{|\mathit{correct}_i|} \qquad (2) \]
Both metrics take values in [0, 1]. We also use the
F-measure, the harmonic mean of precision and recall, which
takes into account the trade-off between the two measures.
\[ F = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \qquad (3) \]
De Lucia et al. outline several strategies for selecting
traceability links such as constant threshold, variable thresh-
old, cut point and cut percentage [12]. We chose a commonly
used constant threshold of 0.7.
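To make the link selection and the metrics concrete, the following sketch (again ours; function and variable names are hypothetical) shows how a bug query could be ranked against the LSI index with Gensim, with the 0.7 similarity threshold and a result cut-off applied, and how per-query precision, recall, and F-measure would then be computed:

```python
from gensim import similarities

def build_index(lsi, bow_corpus):
    """Pre-compute an LSI similarity index over all test case vectors."""
    return similarities.MatrixSimilarity(lsi[bow_corpus])

def recommend(bug_tokens, lsi, dictionary, index, testcase_ids,
              cutoff=2, threshold=0.7):
    """Return up to `cutoff` test case ids whose cosine similarity to the
    bug in LSI space is at least `threshold`."""
    query_vec = lsi[dictionary.doc2bow(bug_tokens)]
    sims = sorted(enumerate(index[query_vec]), key=lambda x: x[1], reverse=True)
    return [testcase_ids[i] for i, score in sims[:cutoff] if score >= threshold]

def precision_recall_f1(retrieved, correct):
    """Per-query precision, recall, and F-measure, as in equations (1)-(3)."""
    hits = len(set(retrieved) & set(correct))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```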
In order to find the true bug-testcase linkages, we were
assisted by a tester who was familiar with the bugs and test
cases for Project X. For the 9 bugs that we were interested
in, there was a 1-1 mapping between bugs and test cases.
This may not always be true as some bugs may be linked
to more than one test case.
C. Scenarios
Scenario I: As noted above, we found that similar test
cases were grouped together under the same test case folder,
the names of which typically contain information about
the hierarchical structure of the test area.

Scenario | Bug ID | Relevant Test Cases | Relevance Criteria
I        | 1001   | 12345               | Match is relevant if the retrieved Test Case matches the Test Case in the true linkage set
I        | 1002   | 12347               |
II       | 1001   | 12345               | Match is relevant if the retrieved Test Case belongs to the same folder as the Test Case in the true linkage set
II       | 1002   | 12347, 12346        |

Table III: Set of true linkages for Scenarios I and II

In Table I, Test cases 12346 and 12347 test the same functionality and are
grouped under the folder “Traceability-restart”. In the first
experiment, we wanted to observe the effects of including
test case folder names with the test case descriptions on
the overall accuracy of our results. Therefore, we ran our
experiment with and without including the test case folder
names with the test case descriptions. In this scenario,
a match is considered relevant if the retrieved Test case
matches the Test case in the true linkage set.
For the hypothetical data in Tables I and II, we present the
true linkages for Scenario-I in Table III. Since each bug is
associated with exactly one test case, for any given query
there will be only one true link.
Scenario II: Based on the results from Scenario-I, it
was evident that the folder name indeed had some useful
information. To take this a step further, we changed our
criteria for relevance as follows: we consider a match to be
relevant if it belongs to the same folder as the Test case
in the true linkage set. We show the hypothetical true linkages
for Scenario-II in Table III. As each test case folder would
contain more than one test case, each bug may be associated
with more than one test case. For bug 1002, there are two
relevant test cases: 12347 (the exact match from the true
linkages in Scenario-I) and 12346, which belongs to the
same test case folder as Test case 12347.
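The two relevance criteria can be summarized as follows (our sketch; `folder_of` is a hypothetical mapping from a test case id to its folder path):

```python
def relevant_scenario_1(retrieved_id, true_ids):
    """Scenario I: the retrieved test case must itself be in the true linkage set."""
    return retrieved_id in true_ids

def relevant_scenario_2(retrieved_id, true_ids, folder_of):
    """Scenario II: it suffices that the retrieved test case shares its folder
    with some test case in the true linkage set."""
    return any(folder_of[retrieved_id] == folder_of[t] for t in true_ids)
```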
IV. RESULTS AND LESSONS LEARNED
The average F-measure for the optimal set of parameters
(cut-off = 2, similarity cut-off = 0.7) for Scenario-I using
LSI is shown in Figure 1. As expected, the overall
accuracy for the case where we include the test case folder
names in our corpus is higher. The F-measure plateaus at 0.33
between 50 and 250 topics, after which it drops to 0.22. In
the case where we do not include the test case folder
information, LSI did not retrieve any valid links until 150
topics, after which we noticed a constant F-measure. Upon
examining the links, we found that there was one particular
test case which was being retrieved correctly for a range of
topics.
The average F-measure values for Scenario-II are shown
in Figure 2. As we are interested in achieving high precision
Figure 1: Average F-measure vs. topics for Scenario I
Figure 2: Average F-measure vs. topics for Scenario II
for the lowest possible number of retrieved results, we
choose cut-offs of 2, 5, and 10 results. We anticipate
that testers would have the attention span for up to 10
recommended test cases per bug. With LSI, the maximum
average F-measure we were able to achieve was 0.44 for
a cut-off of 2 results. With LSI, we achieve best results
within a window of 150-200 topics for any cut-off value.
LDA performs quite inconsistently regardless of the cut-off
values.
In our qualitative assessment, we found that bug descrip-
tions were sufficiently descriptive in helping one understand
the functional impact of the bug. On the other hand, test
case descriptions ranged from being short and vague to
sufficiently descriptive. We also noted that very short test
case descriptions would result in a large number of false
positives. Therefore, we excluded test cases with fewer
than three terms from our corpus.
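A one-step sketch of this filtering (ours; `testcases` is a hypothetical mapping from test case id to description, and `preprocess` is the pre-processing routine sketched in Section III-A):

```python
def filter_short_testcases(testcases, preprocess, min_terms=3):
    """Drop test cases whose pre-processed descriptions contain fewer than
    `min_terms` terms."""
    return {tc_id: desc for tc_id, desc in testcases.items()
            if len(preprocess(desc)) >= min_terms}
```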
Hayes et al. assert that IR techniques must not substitute
the human decision-maker in the linking process, but should
be used to generate a list of candidate links [13]. We rely
on the tester’s judgment and domain knowledge to help us
find the true linkage between bugs and test cases. However,
there is a good chance that the tester only provides those
links that are obvious and misses the ones that are outside
their knowledge. Therefore, we manually inspect the results
with the help of the tester to verify the quality of our results.
It is worth mentioning that although at a higher cut-off we
observed a drop in accuracy, LSI retrieved links that
were not in the true linkage set but which the tester actually
considered insightful and relevant. In some cases, the links
belonged to other test areas outside the tester’s immediate
knowledge, but were still relevant to the bug. Although we
did not treat such links as true links, the significance of such
links as recommendations would be interesting to assess.
V. CONCLUSION AND FUTURE WORK
Our work opens up the possibility of building tools using IR
techniques to recommend test cases to assist testers in their
bug verification activities. A list of recommended test cases
for the bug at hand would enable testers to not only consider
executing test cases that they might have in mind but also
those that may be outside their immediate knowledge.
We found that including information about the hierarchical
structure of the test areas improved accuracy. LSI signifi-
cantly outperformed LDA for any cut-off and achieved best
results between 150 and 200 topics. Based on our qualitative as-
sessment, we found that recommendations can be improved
if testers make test case descriptions more detailed. As future
work, we plan to do a more comprehensive evaluation by
considering more strategies for selecting traceability links
as discussed in [12]. It would also be interesting to see the
effects on accuracy of including additional information in the
corpus, such as the test case steps and the steps to reproduce
a bug. So far, we have developed a prototype for testers
which incorporates a means for obtaining explicit feedback
by allowing the tester to rate the usefulness of a provided
recommendation. We plan to investigate how such tester
feedback can be fed back into the approach to provide better
recommendations for other testers.
A. Acknowledgment
We are grateful for the support and feedback of the testing
personnel from RIM who helped us with the qualitative
assessment. We are also grateful to Weining Liu and Shimin
Li for their feedback and insight.
REFERENCES
[1] P. Arkley, P. Mason, and S. Riddle, “Position paper: Enabling traceability,” in Proc. of the 1st International Workshop on Traceability in Emerging Forms of Software Engineering, 2002, pp. 61–65.
[2] N. Kaushik, M. Salehie, L. Tahvildari, S. Li, and M. Moore, “Dynamic prioritization in regression testing,” in Proceedings of the Software Testing, Verification and Validation Workshops (ICSTW), 2011, pp. 135–138.
[3] M. Lormans and A. van Deursen, “Can LSI help reconstructing requirements traceability in design and test?” in Proc. of the Conference on Software Maintenance and Reengineering, 2006, pp. 47–56.
[4] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, “Recovering traceability links between code and documentation,” IEEE Transactions on Software Engineering, vol. 28(10), 2002, pp. 970–983.
[5] A. Marcus and J. I. Maletic, “Recovering documentation-to-source-code traceability links using latent semantic indexing,” in ICSE’03, 2003, pp. 125–137.
[6] A. Bacchelli, M. Lanza, and R. Robbes, “Linking e-mails and source code artifacts,” in Proc. of the International Conference on Software Engineering (ICSE), 2010.
[7] H. M. Sneed, “Reverse engineering of test cases for selective regression testing,” in CSMR’04, 2004, pp. 69–74.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, 1990, pp. 391–407.
[9] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42(1), 2001, pp. 177–196.
[10] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, 2003, pp. 993–1022.
[11] R. Rehurek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[12] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora, “Enhancing an artefact management system with traceability recovery features,” in Proc. of the 20th IEEE Intl. Conf. on Software Maintenance, 2004, pp. 306–315.
[13] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram, “Advancing candidate link generation for requirements tracing: The study of methods,” IEEE Transactions on Software Engineering, vol. 32(1), 2006, pp. 4–19.