

Estimation of Test Code Changes using Historical Release Data

Bart Van Rompaey and Serge Demeyer
Lab On Re-Engineering, University of Antwerp

{bart.vanrompaey2,serge.demeyer}@ua.ac.be

Abstract

In order to remain effective, test suites have to co-evolve alongside the production system. As such, quantifying the amount of changes in test code should be a part of effort estimation models for maintenance activities. In this paper, we verify to which extent (i) production code size, (ii) coverage measurements, and (iii) testability metrics predict the size of test code changes between two releases. For three Java systems and one C++ system, the size of production code changes appears to be the best predictor. We subsequently use this predictor to construct, calibrate and validate an estimation model using the historical release data. We demonstrate that it is feasible to obtain a reliable prediction model, provided that at least 5 to 10 releases are available.

1 Introduction

The effort spent on testing in a software project, i.e., specifying, implementing, executing and maintaining test cases, is considerable. Brooks estimates the total time devoted to testing at 50% of the total allocated time [5], while Kung et al. suggest that 40% to 80% of the development costs are spent in the testing phase [20]. Harrold reports that the cost of regression testing can be as high as one-third of the total project cost [16].

Even after the initial development phase, the test code must be adapted alongside the production code to remain effective. Elbaum et al. demonstrated how small changes to the system resulted in major coverage drops [10], while van Deursen et al. describe how refactorings can even invalidate tests [28]. As such, the larger the test suite, the more 'regressions' can be expected.

In the literature, several studies report on the contribution of test code to the overall system, which varies between 10% and 50% of the total size [31, 29, 2, 13, 24], depending on the kind of tests and the required coverage. Ellims et al., however, report on three industrial case studies where unit testing entailed only 5% to 10% of the project effort with near 100% branch and statement coverage [11], concluding that the perceived cost of unit testing may be exaggerated.

The cited numbers indicate that test code, and the corresponding test effort, are a substantial part of a software project, especially during system evolution and maintenance. Therefore, cost estimates for maintenance activities should probably include estimates for test modifications as well. Given that effort estimation typically requires size as an input factor (see for instance [3, 23, 18]), we conclude that we should complement predictions for the amount of changes to the production code with predictions for the amount of changes to the test code.

In this work, we explore the size of changes made to a test suite during a system's evolution, using the following research questions:

1. How much test writing do we observe in a sample set of evolving systems? How does the size of test suite changes compare to the corresponding production code changes at the release level?

2. Which factors influence the size of changes to test code? We consider factors for which we assume an impact based on the results of previous studies and on intuition: (i) the change size to the production system, (ii) coverage levels, and (iii) testability indicators.

3. How accurately can we predict the size of test changes for the next release, given that we know the size of changes in previous releases, the amount of change required in the production code, as well as the influencing factors under (2)?

Answering these questions, we aim at a better understanding of which factors influence the size of changes in software tests, and we evaluate an instrument, in the form of a prediction model, that allows software projects to make size estimations for the co-evolution of a software test suite.

We use carefully selected case studies to approach these questions. In particular, the systems we investigate have a long history of regular releases, with a strong emphasis on unit testing. All four systems are open source: three Java systems (Checkstyle, PMD and Cruisecontrol) come from the Java software development domain, while one C++ system (Poco) concerns a class library.


The remainder of this paper is structured as follows. Section 2 discusses related work on size estimation and on the co-evolution of software artifacts. Section 3 elaborates on the research questions and on the methodology we follow, first to quantify test co-evolution and then to compose and evaluate a prediction model for the size of test code co-evolution; the software systems under study and the tooling are briefly introduced as well. We report the results in Section 4. After considering threats to validity (Section 5), we interpret the results to answer the initial research questions in Section 6, before formulating a conclusion in Section 7.

2 Related Work

Many studies have focused on software size estimation, some using lines of code, many others using function points or derivatives [1]. Niessink and van Vliet use function points in a maintenance context [22]. They note that estimating the size of a change in terms of LOC added/deleted/updated may well be as difficult as estimating the size (in LOC) in development projects. The size of the component to be changed appears to have a much larger impact on effort than the size of the change itself.

To estimate the effort of coding and unit testing GUI systems, Lo et al. incorporate GUI widget characteristics in their estimation model. Unfortunately, they do not specify what unit testing means in this context, nor do they report on separate coding and testing effort [21].

Aiming to estimate one size measure (change size in test code) from another size measure (change size in production code), our work can be considered a kind of second-degree size estimation technique. Indeed, we already assume the presence of a production size estimate. As such, we identify studies on the co-evolution of software artifacts as related work. Hindle et al. [17] studied the release-time activities for a number of artifacts (source, test, build and documentation) of four open source systems by counting and comparing the number of revisions in the period before and after a release. The observed behavior is summarized in a condensed notation. Zaidman et al. introduce a visual approach to describe and monitor the co-evolution between production and test code [32]. Three views report on file (co-)changes, growth rate and coverage.

The following authors have investigated characteristics of (evolving) test code. Skoglund and Runeson compare three change strategies for an evolving system, observing that the corresponding changes in the test code vary from none to 1.4 lines of test code per changed line of production code [27]. Van Deursen and Moonen have shown that even though refactorings are behavior preserving, they can potentially invalidate tests [28]. Bruntink and van Deursen evaluate a set of metrics to assess the testability of Java classes [6]. They conclude that the design of the system under test has an influence on the test effort required to reach a certain coverage criterion, and that Fan Out, Lines of Code per Class (LOCC) and Response for Class (RFC) are good indicators. Elbaum et al. study the impact of software evolution on code coverage information [10]. They notice that small changes in a program can greatly affect the code coverage. In one case study, 1% of affected branches in a program resulted in a reduction of mean statement coverage by 16%. Moreover, they report that the impact of changes on code coverage may be hard to predict.

3 Methodology

3.1 Research Questions

In this work, we study three research questions:

RQ1: What is the ratio of written test code to the full source code (tRatio) during software evolution? With this exploratory question, we want to obtain quantitative data on the amount of test writing during software evolution, and evaluate different measurement approaches. Next, we compare this with (i) the amount of production code that is changed and (ii) the eventual size of production code and test code at release time.

RQ2: Which factors explain the delta in test code size (Δtloc) between two releases? We investigate three kinds of independent variables that, in our hypothesis, influence the amount of test code written between two releases:

• Code size. We assume a strong correlation between the amount of production code Δploc and the amount of test code Δtloc that is written between two releases, as (i) new code must be verified according to a project's testing policy and (ii) changes to existing code, including refactorings, impact and even invalidate existing test cases. We also hypothesize on the influence of the production source size ploc, as in larger code bases it may become harder to isolate and exercise specific components. Finally, Gaelli noticed that in large test suites there exists a considerable overlap in coverage between test cases in terms of called method signatures [13]. As such, we reason, the larger the test code size tloc, the higher the impact of changes on the test suites, as one change may require modifications in multiple test cases.

• Coverage. The higher the test coverage, we hypothesize, the more test cases have to be newly written and co-evolved to maintain this coverage. As such, we expect the factor test coverage cov to explain the amount of test code to be written. Under the assumption that stable test coverage is a desired quality, developers have to spend time maintaining, or even increasing, this coverage. Moreover, we recall the work of Elbaum et al., showing how small changes can have a large impact on the test coverage [10]. For these reasons, we expect the delta in test coverage Δcov between two releases to be an explanatory variable, especially when coverage is high.

• Testability. We verify how the testability indicators identified by Bruntink and van Deursen [6], i.e., FanOut, LOCC and RFC, influence the size of test changes. The reasoning is that the larger or the more coupled production classes are, the more test code is required to isolate, initialize and exercise the unit under test.

RQ3: How accurately can we predict the amount of test writing for a release given a set of historical data? The goal is to verify to which extent the amount of test code to be written in a particular release can be predicted from data that is available to an analyst before these tests are actually modified. We hereby assume that the amount of production code to be written can be estimated closely enough to serve as a reliable input factor. Secondly, we assume that the considered project has a well-defined coverage goal for the next release. Based upon this coverage goal and the coverage measurement of the previous release, Δcov can be calculated. As such, we use the factors of RQ2 to constitute this prediction model.

3.2 Measurement Approach

To quantify the size of test changes, and compare it to changes in production counterparts, we apply the following partitioning scheme to software revisions, similar to Hindle et al. [17]. Files belong to one of the categories production code, test code or other:

Production code – Source code developed by the project team that will end up in the released product.

Test code – Code developed by the project team to conduct developer tests. This includes test cases as well as test data that is defined either as in-line text strings or as separate, formatted data files (e.g., XML).

Other – Artifacts such as design documents, build system, images, etc. We ignore this category.

Note that we hereby ignore the possibility that files belong to more than one category. The sets of files per category are determined by investigating project-specific conventions (such as naming and directory structure) and adapting our analysis environment to them. We particularly introduce these categories with xUnit-style tests in mind [15].
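The sketch below illustrates, in Python, how such a convention-based partitioning could be implemented. The concrete conventions used here (a "test" directory, a "Test" class-name prefix/suffix, XML test data) are assumed examples, not the exact heuristics of the studied projects.

```python
# Illustrative sketch of convention-based file partitioning into
# production code, test code and other artifacts. The concrete
# conventions (a "test" directory, a "Test" prefix/suffix, XML test
# data) are assumptions for this example only.
from pathlib import Path

def classify(path: str) -> str:
    p = Path(path)
    dirs = {part.lower() for part in p.parts[:-1]}
    if p.suffix in {".java", ".cpp", ".h"}:
        if "test" in dirs or p.stem.startswith("Test") or p.stem.endswith("Test"):
            return "test"
        return "production"
    if p.suffix == ".xml" and "test" in dirs:
        return "test"          # external test data is counted as test code
    return "other"             # documentation, build system, images, ...

if __name__ == "__main__":
    for f in ["src/com/foo/Parser.java", "test/com/foo/ParserTest.java",
              "test/data/sample.xml", "build.xml"]:
        print(f, "->", classify(f))
```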

To answer RQ1, we quantify test change size at three levels of granularity. As a first measure, we count the number of file changes as logged by the versioning system. Fluri observed that such changes are an insufficient indicator to determine their significance [12]. Indeed, such a measurement assigns the same weight to an automated one-line change in a documentation string and to the encoding of a complex algorithm. We still incorporate this data because (i) to some extent every change contributes to the overall effort; and (ii) we want to confront the results with finer-grained yet more expensive measurements. Indeed, whereas reasoning at the file level merely requires the change log of the version control system (VCS), obtaining statement-level information requires source code parsing and fact extraction.

Secondly, we calculate the delta in line count between two subsequent revisions, by subtracting the per-file line counts (obtained with the Unix command-line utility wc -l) of those revisions. This measurement allows us to quantify the size of a change at the file level. With this approach we still assign the same weight to an n-line change in source comments and to an n-line change in a method implementation. Note that changes that happen within one line are not detected. We sum the aggregates for the categories test code and test data.
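As a minimal sketch of this line-level measurement, assuming two checked-out revision directories and the classify() helper from the previous sketch, the per-file newline counts of both revisions can be summed per category and subtracted:

```python
# Sketch of the line-level delta: per-file line counts (the equivalent of
# `wc -l`) are summed per category for two subsequent revisions, and the
# test-related totals are subtracted. Revision directories and the
# classify() helper are assumed for illustration.
from pathlib import Path

def line_count(path: Path) -> int:
    return path.read_bytes().count(b"\n")   # what `wc -l` reports

def loc_per_category(rev_dir: str, classify) -> dict:
    totals = {"production": 0, "test": 0, "other": 0}
    for f in Path(rev_dir).rglob("*"):
        if f.is_file():
            totals[classify(str(f))] += line_count(f)
    return totals

# old = loc_per_category("checkout/rev_100", classify)
# new = loc_per_category("checkout/rev_101", classify)
# delta_tloc_l = new["test"] - old["test"]
```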

Thirdly, we count increases in the number of lines with source statements. This can be considered the part of the source code that is hardest to write and as such takes most of the effort. Using the fact extraction tool chain Fetch1, we build a model from the source code of each revision, which we subsequently refine by identifying model entities such as classes and methods that belong to test code, following a formalism we introduced in earlier work [30]. This formalism identifies test concepts such as test cases, fixtures, setups, etc. following their xUnit definition [15]. Next, we calculate the number of source statement lines using JavaNCSS and pmccabe (for C++)2. JavaNCSS counts statements and all kinds of declarations (approximately equivalent to counting ';' and '{' characters). Pmccabe counts non-commented, non-blank lines per function. Note that preprocessor directives are not counted. With this measurement we ignore changes that do not add statement lines, changes that preserve the number of lines, moves between namespaces, often generated boilerplate code (class and function definitions), etc. We cannot apply this measurement level to the external test data.
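Purely as an illustration of what the statement-level measure captures, the following sketch approximates the JavaNCSS count by counting ';' and '{' characters, as described above; it ignores comments and string literals, which the real tools handle.

```python
# Rough approximation of a source statement count for Java, based on the
# remark that JavaNCSS is approximately equivalent to counting ';' and
# '{' characters. Not a replacement for the actual tools.
def approx_ncss(java_source: str) -> int:
    return sum(java_source.count(ch) for ch in (";", "{"))

snippet = """
class Foo {
    int bar(int x) {
        int y = x + 1;
        return y;
    }
}
"""
print(approx_ncss(snippet))   # 2 '{' + 2 ';' = 4
```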

As such, these calculations yield measurements for file-level, line-level (ploc_l and tloc_l) and statement-level (ploc_s and tloc_s) changes, as well as all delta measurements between subsequent releases. This data, calculated per revision3, is aggregated to results per release4, assuming that in the period between two releases at least one full cycle of the used software development methodology has been carried out. To find out what constitutes a release for the considered projects, we rely upon project-specific conventions.

1 http://lore.cmi.ua.ac.be/fetchWiki/
2 http://www.kclee.de/clemens/java/javancss/ and http://www.parisc-linux.org/~bame/pmccabe/
3 Version of the software system in the VCS as a result of any change to the system.
4 A particular revision that is distributed to the users.


In the remainder of this paper, we will use the terms (Δ)tloc and (Δ)ploc (without subscript) when we address these concepts without referring to a particular measurement level.

Calculating these size metrics for the releases of a project throughout its history, we can observe how these metrics compare within as well as across projects.

To study the assumed effect of the independent variables on Δtloc in RQ2, we compute Spearman's rank correlation coefficient (r_s) between Δtloc and each independent variable; a small computation sketch is given after the metric definitions below. Next to absolute and delta size metrics at release time, we record the following coverage and testability metrics. Coverage measurements are computed by building historical releases and capturing coverage during test suite execution. We used the Emma5 and Gcov6 coverage tools, recording statement coverage. From these coverage measurements, we derive Δcov_n = cov_n − cov_{n−1}. Furthermore, we verify the influence of the following testability metrics:

• The FanOut of a class c is quantified as the number of classes that either receive a method call or have a field referenced by a method of c.

• Lines Of Code per Class (LOCC) counts the lines of code per class by summing up the line count of all methods.

• Response For Class (RFC) of a class c counts the number of methods (from c as well as from other classes) that are invoked by the methods of c.

We aggregate these metrics at the system level and evaluate the average as a candidate predictor.
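The following sketch shows, on hypothetical per-release data, how this correlation analysis can be carried out with Spearman's rank correlation; the numbers are invented for illustration and scipy is assumed to be available.

```python
# Sketch of the correlation analysis: Spearman's rank correlation (r_s)
# and its p-value between delta-tloc and each candidate predictor.
# The per-release values below are hypothetical.
from scipy.stats import spearmanr

d_tloc = [120, 15, 300, 42, 510, 8, 95, 230]        # delta test LOC per release
d_ploc = [400, 30, 900, 160, 1500, 20, 310, 700]    # delta production LOC
cov    = [0.48, 0.52, 0.55, 0.60, 0.58, 0.61, 0.63, 0.65]  # statement coverage

for name, predictor in [("d_ploc", d_ploc), ("cov", cov)]:
    rs, p = spearmanr(d_tloc, predictor)
    print(f"{name}: rs = {rs:.2f}, p = {p:.3f}")
```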

As input for an estimation model for RQ3, we select those independent variables that appear to have a large effect on Δtloc, i.e., a strong correlation (r_s) in combination with a small p-value, indicating that there is only a small chance that the variation in Δtloc explained by this variable is coincidental. We rely upon Coolican to interpret the strength of the correlation [8]. These factors are fed into a least-squares regression analysis with a model of the form

Y = b_0 + b_1 X_1 + ... + b_n X_n,

with Y representing the response variable Δtloc, and X_1 to X_n the considered explanatory variables such as Δploc, cov or RFC. Such linear regression models are often used for effort estimation [4]. We compose several models using historical Δploc data of releases x_1 to x_{n−1} to predict the amount of test code changes for the next release x_n. This means that, initially, such a model is composed of a limited number of data points, equal to the number of preceding releases. Later on, as a project matures, a larger set of release-time data points becomes available.
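A minimal sketch of such an incrementally built model, fitted by ordinary least squares on hypothetical data, could look as follows (the paper's models are additionally fitted on log-transformed data, as described in Section 4):

```python
# Sketch of an incrementally built model Y = b0 + b1*X1: fit on the
# releases observed so far and predict delta-tloc for the next release.
# numpy stands in for the statistics package actually used; the data
# are hypothetical.
import numpy as np

def fit_and_predict(d_ploc_history, d_tloc_history, d_ploc_next):
    X = np.column_stack([np.ones(len(d_ploc_history)), d_ploc_history])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(d_tloc_history, dtype=float),
                                 rcond=None)
    b0, b1 = coeffs
    return b0 + b1 * d_ploc_next

# releases 1..n-1 are used to estimate release n
print(fit_and_predict([400, 30, 900, 160, 1500], [120, 15, 300, 42, 510], 700))
```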

We limit the number of factors to at most two, for two reasons. First, the size of the data set equals the number of releases of the considered software project, which is a couple of tens at most. Secondly, the fewer factors, the earlier in the life cycle of the project we can verify the accuracy of such a model.

5 http://emma.sourceforge.net
6 http://gcc.gnu.org/onlinedocs/gcc/Gcov.html

To evaluate the resulting models, we use three accuracy criteria derived from the magnitude of relative error measure [7]: Mean Magnitude of Relative Error (MMRE), Median Magnitude of Relative Error (MdMRE) and PRED(25). MMRE is computed as

\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i - \hat{Y}_i|}{Y_i},

with Y_i the actual Δtloc and Ŷ_i the predicted value. Similarly, MdMRE is the median of the |Y_i − Ŷ_i| / Y_i errors. PRED(25) is defined as the percentage of predictions within 25% of the actual (size) value. Next to calculating and comparing the accuracy per software system, we also verify whether the accuracy increases over time as more release data becomes available.
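The three criteria are straightforward to compute; the sketch below does so on assumed actual/predicted pairs (the helper names are ours, not from the paper's tooling).

```python
# Sketch of the accuracy criteria: MMRE, MdMRE and PRED(25), computed
# from actual versus predicted change sizes (hypothetical values).
import statistics

def relative_errors(actual, predicted):
    return [abs(a - p) / a for a, p in zip(actual, predicted)]

def mmre(actual, predicted):
    return statistics.mean(relative_errors(actual, predicted))

def mdmre(actual, predicted):
    return statistics.median(relative_errors(actual, predicted))

def pred25(actual, predicted):
    errs = relative_errors(actual, predicted)
    return 100.0 * sum(e <= 0.25 for e in errs) / len(errs)

actual    = [105, 240, 620, 1100, 55]
predicted = [110, 180, 700, 900, 20]
print(mmre(actual, predicted), mdmre(actual, predicted), pred25(actual, predicted))
```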

3.3 Case Studies

As case studies, we selected three Java software development tools as a set of projects from a similar application domain. To contrast, we consider a C++ class library as a fourth case study.

Checkstyle7 is a Java coding standard checker. Six developers made 2260 commits in the interval between June 2001 and March 2007. We observe 16 releases between versions 1.2 and 4.3, during which the statement count grows from 800 lines to 19 kSLOC. Statement coverage fluctuates between 48% and 86%.

PMD8 is a static analysis tool for problem detection in Java code, such as dead, duplicated or suboptimal code. Its history dates back to June 2002. Since then (up to March 2007), 3536 version commits were registered. Nineteen developers contributed to 33 releases during this period: from 0.1 (12 kSLOC) to 3.9 (33 kSLOC). The test suite is gradually reinforced from 36% to 63% statement coverage.

Cruisecontrol9 is a framework supporting a continuous build process. We study its evolution during 2942 revisions and 19 releases, from 2.0 (8 kSLOC and 59% statement coverage) until 2.7.1 (28 kSLOC and 60% statement coverage). Seventeen developers committed to the repository between February 2002 and September 2007.

Poco10 is a collection of open source C++ class libraries supporting network-centric, portable applications. We acquired the SVN archive between August 2006 and May 2008. This period spans 12 releases (from 1.2.0 until 1.3.2), contributed to by nine developers. We consider the standard package of libraries that ship in the releases, i.e., the components Foundation, Util, Net and XML. Poco grows from 2 kSLOC to 46 kSLOC in statement count, yet its coverage remains noticeably stable at around 61%.

7 http://checkstyle.sourceforge.net/
8 http://pmd.sourceforge.net/
9 http://cruisecontrol.sourceforge.net/
10 http://pocoproject.org

The first three projects use the JUnit testing framework11, while Poco uses an xUnit variant for C++, CppUnit12.

11 http://www.junit.org
12 http://cppunit.sourceforge.net

4 Results

This section addresses the case studies' results. First, we quantify tRatio at multiple levels of granularity. Next, we verify to which extent the size, coverage and coupling metrics of Section 3.2 correlate with test size changes between releases. Finally, we build estimation models using those factors that do exhibit a strong correlation.

Quantifying Test Changes. Figure 1 summarizes the share of test code in code changes at various levels of granularity. In general, we notice a considerable spread in the percentage of changes that are applied to test code, indicating that the kind of production changes heavily impacts the corresponding testing effort. Secondly, we notice how test suites that make use of external test data require considerably fewer changes to the test code. For all four projects, we observe a tRatio at the line level between 25% and 30%, yet this measurement also exposes the largest variance.

Predictors for test change size. Table 1 presents the results of the correlation analysis for the factors introduced in Section 3.2. First of all, we observe that Δploc appears as a very strong predictor for Checkstyle, Cruisecontrol and Poco, and as a strong one for PMD, for the line count as well as for the statement line count. By means of R², the coefficient of determination, we can state that the variation in change size of test code, at the source statement level, is explained by the change size of production code for 90% in the case of Checkstyle, 44% in the case of PMD, 76% in the case of Cruisecontrol and 79% for Poco. Figure 3 presents scatterplots of Δploc versus Δtloc.

Considering code size, we notice that ploc_l correlates fairly with Δtloc_l for PMD, meaning that the larger the production code becomes, the more changes in test code are required. In the Poco case, there even exists a strong correlation, which moreover encompasses the test code size tloc_s as well. Inspection reveals that the complexity density (the number of linearly independent circuits in a program control graph per line of code [14]) of Poco is increasing. This explains the increasing share of test code for Poco (from 8% to 38%) needed to maintain the same coverage.

Table 1. Rank correlation between changes in test size versus size, coverage and testability metrics (r_s, with the p-value between parentheses).

Δtloc_l    Checkstyle    PMD           Cruisectl     Poco
Δploc_l    .89 (<.001)   .69 (<.001)   .87 (<.001)   .92 (<.001)
ploc_l     .41 (.111)    .37 (.035)    .03 (.897)    .82 (.002)
tloc_l     .34 (.198)    .22 (.217)    .03 (.897)    .82 (.002)
cov        -.30 (.266)   .10 (.571)    -.17 (.487)   -.04 (.924)
Δcov       .46 (.075)    .04 (.844)    .35 (.159)    -.23 (.499)
LOCC       -.16 (.517)   -.31 (.080)   -.12 (.644)   .59 (.593)
FanOut     .28 (.285)    .18 (.314)    .14 (.583)    -.54 (.086)
RFC        -.34 (.191)   -.29 (.107)   -.26 (.294)   .06 (.862)

Δtloc_s    Checkstyle    PMD           Cruisectl     Poco
Δploc_s    .95 (<.001)   .66 (<.001)   .93 (<.001)   .89 (<.001)
ploc_s     .25 (.354)    .32 (.071)    .38 (.115)    .57 (.068)
tloc_s     .25 (.354)    .09 (.622)    .38 (.115)    .78 (.005)
cov        -.21 (.455)   .02 (.917)    -.05 (.844)   -.18 (.603)
Δcov       .29 (.284)    .09 (.629)    .33 (.182)    -.27 (.414)
LOCC       -.17 (.522)   -.23 (.209)   -.49 (.042)   .15 (.653)
FanOut     .36 (.167)    .17 (.363)    .35 (.154)    -.48 (.138)
RFC        -.31 (.246)   -.20 (.276)   -.60 (.011)   .26 (.441)

Surprisingly, we furthermore notice that neither coverage nor delta coverage influences the amount of test code changes at the 95% confidence level. For Cruisecontrol, the LOCC and RFC metrics show a moderate to strong negative correlation, indicating that the more lines of code classes count, or the more classes are coupled to other classes, the fewer changes to the test code can be expected. This is counter-intuitive to our initial assumption. As this phenomenon only occurs for Cruisecontrol, we attribute it to architectural characteristics of that project. In particular, we notice that the Cruisecontrol tests do not isolate the unit under test, hence our assumption that the coupling metric RFC should influence the amount of test changes does not apply. Secondly, an increasing size count for classes may be linked with a more stable code base that requires less drastic changes to the test suite.

Estimating Δtloc in the next release. Relying upon this single common predictor, we now verify the accuracy of regression models using the Δploc explanatory variable. We use Δploc data for historical releases (starting from three data points) to predict the test change size of the following release. This incremental approach results in 13 regression models for Checkstyle, 29 for PMD, 15 for Cruisecontrol and 8 for Poco.

Inspection of the data reveals that Δploc and Δtloc are not normally distributed. Therefore, we first apply a logarithmic transformation (log(x + 0.05)) to fulfill the normality requirement [9]. The Shapiro-Wilk test now indicates that the sample data is taken from a normal distribution [25].
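As an illustration, with invented Δploc values and scipy assumed available, the transformation and the normality check can be reproduced as follows:

```python
# Sketch of the normality check: apply the log(x + 0.05) transformation
# used in the paper and run the Shapiro-Wilk test. Data are hypothetical.
import numpy as np
from scipy.stats import shapiro

d_ploc = np.array([400.0, 30, 900, 160, 1500, 20, 310, 700, 0, 2600])
transformed = np.log(d_ploc + 0.05)

stat, p = shapiro(transformed)
print(f"W = {stat:.3f}, p = {p:.3f}")   # p > 0.05: no evidence against normality
```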

Figure 1. Ratio of test changes in terms of source files (F), lines (L) and statement lines (S), and ratio of test code at release time (SR), for 15 Checkstyle (Ch), 32 PMD, 18 Cruisecontrol (Cr) and 11 Poco (Po) releases.

Figure 2 shows the expected versus actual size of test (statement line) changes over the considered releases. The curve for Checkstyle is characterized by serious spikes: a release with a large number of changes is followed by one with few changes for a large part of the history. The estimated values follow this curve, but overestimate for releases with a large actual Δtloc. The PMD curve shows two major spikes, at releases 1.1 and 2.0. In contrast to the Checkstyle case, the model underestimates the larger changes. Similar to Checkstyle, the curve of Cruisecontrol is characterized by spikes every two or three releases. For Poco, we identify a couple of initial releases with very few changes (sometimes even no source code changes). The estimation models are not able to cope with that.

Model details for the statement-level measurements are presented in Table 4. For example, the PMD model mPMD5 is composed from the first five releases to predict the sixth release. The obtained model is 1.86 + 0.57 Δploc_s. This model predicts a change size of 110 statement lines versus the actual outcome of 105. Despite this accurate result, we do not trust this model due to the poor and counter-intuitive adjusted R² and high p-value.

To verify whether the model assumptions are met, we inspect R² values and the residuals. For PMD, we notice that some of the early models, based upon few data points, poorly explain the variability in the data set, as exemplified above. The other three case studies expose high R² and low p-values from a couple of data points on. Moreover, as the number of historical releases grows, R² tends to increase, as does the confidence. We furthermore verify (i) whether the residuals follow a normal distribution and (ii) whether the homoscedasticity criterion is met: there should be no discernible patterns in the scatter plot of residuals versus fitted values. We can indeed confirm that these criteria are approximately met for those models with more data points, yet cannot make such an observation for models with only a couple of data points.

Table 2. Accuracy measures for test size change prediction models: (i) MMRE (MM), MdMRE (Md) and PRED(25) (P(25)) for all predicted releases and (ii) delta accuracy between the first half and the second half of predicted releases.

data set        MM     Md     P(25)    ΔMM     ΔMd    ΔP(25)
Checkstyle_l    0.92   0.48   23.1%    -18%    +39%   -42%
Checkstyle_s    1.75   0.47   15.4%    +252%   -13%   +17%
PMD_l           0.72   0.70   20.7%    -38%    -32%   +7%
PMD_s           2.74   0.75   17.2%    -29%    -29%   +61%
Cruisectl_l     0.53   0.24   53.3%    +123%   -34%   +100%
Cruisectl_s     0.30   0.27   46.7%    -33%    -17%   +33%
Poco_l          0.40   0.32   25.0%    -23%    -28%   -100%
Poco_s          0.59   0.67   25.0%    -7%     -13%   +0%

Table 2 summarizes the model accuracy. The MMRE values are impacted by outliers in the change size data, in both directions. For releases with only few changes, the relative error quickly explodes. For releases with many changes, the Δtloc estimations are worse than average as well. Overall, the estimates are most accurate for Cruisecontrol, with a median magnitude of error of 26.5% and 46.7% of the estimates within 25% of the actual Δtloc_s. PMD exposes the weakest accuracy, despite its large release set. To verify whether the regression models become more accurate as a system matures (and more release data becomes available), we compute and compare the relative error measures for the first half of the data points with those for the latter half. Overall, the ΔMM and ΔMd values decrease by 7% to 38%. Some outliers in Checkstyle and Cruisecontrol, however, make up for increases of more than 100%. Considering P(25), 6 out of 8 data sets show a stable or increased accuracy.

Figure 2. Actual versus expected predictions for the linear regression models Δtloc_s = b_0 + b_1 Δploc_s: (a) Checkstyle, (b) PMD, (c) Cruisecontrol, (d) Poco.

Next, we verify whether an additional factor from Table 1 adds to the accuracy of the predictions. Of the other factors significant at the 0.05 level, Table 3 shows that only the addition of the factor RFC in Cruisecontrol results in a more accurate model.

Table 3. Accuracy of regression models with two factors.

Project     Model                                        MM     Md     P(25)
PMD         Δtloc_l = b_0 + b_1 Δploc_l + b_2 ploc_l     >10    0.74   14.3%
Poco        Δtloc_l = b_0 + b_1 Δploc_l + b_2 ploc_l     1.08   1.00   28.6%
Poco        Δtloc_l = b_0 + b_1 Δploc_l + b_2 tloc_l     >10    6.88   0.00%
Poco        Δtloc_s = b_0 + b_1 Δploc_s + b_2 tloc_s     >10    1.00   14.3%
Cruisectl   Δtloc_s = b_0 + b_1 Δploc_s + b_2 RFC        0.21   0.18   64.3%
Cruisectl   Δtloc_s = b_0 + b_1 Δploc_s + b_2 LOCC       0.50   0.26   42.9%

5 Threats to Validity

First, we discuss the following topics regarding construct validity: (i) the completeness of the composed software models; (ii) the structure of the version control repositories used in the case studies; and (iii) the heuristics to distinguish production code from test code.

To calculate the listed metrics, we composed an object-oriented model from the source code. Considering the programming language variants and compiler versions used throughout history, we relied on robust parsing [26]. As a consequence, the composed model is not necessarily complete (especially for C++). This incompleteness particularly impacts the testability metrics, as e.g. RFC and FanOut are based upon the call graph, one of the more difficult model relations to fully resolve. We argue, however, that this incompleteness does not invalidate our results, as trends in these metric distributions over history are likely to be preserved.

For the experiment, we focus on the main development branch, the trunk, in the VCS of the considered projects. Developers may branch from that trunk to try different development paths (e.g., for fixing bugs) that may be merged back later. By gathering data for the trunk, we associate all size changes of a branch with the upcoming release at merge time. Development in branches may, however, span several releases before being merged. In our case studies, branch use was limited to version archiving, experimentation and small features, hence we can safely ignore this factor.

The convention-based partitioning of VCS artifacts is likely to overestimate the size of production changes. Pieces of test code may be co-located in production files, and test interfaces or stubs are not recognized as such. Source code inspections of the four projects did not reveal such constructs, reducing this threat for the case studies presented here.

Considering conclusion validity, we are well aware of the potential over-fitting implications of constructing and evaluating a model based upon predictors that have been shown to be significant for the same data set. However, when composing the iterative models for subsequent releases, the eventual release to be estimated is not part of the data set used to create the model.

With respect to external validity, finally, we acknowledge the limited application scope that can be derived from four case studies. We did contrast the results of three case studies from a similar application domain with a fourth case study and noticed that Poco's results do not deviate. Beyond the scope of medium-sized, xUnit-tested open source systems, we can, however, not claim any generalizability of these estimation models.

6 Discussion

In this discussion section, we reconsider the initial research questions.

For the four considered projects, we found important differences in test change size between the line and the statement level in case there exists external test data, i.e., test data not embedded in the test case code. Both measurement levels can however be useful in an eventual effort estimation model: the file-level measurement includes separate test input data, thus being a more complete representation of the test artifacts. The statement-level measurement can be used in the absence of such data, and in cases of a proven difference in productivity between test data and test code writing. The ratio at the statement level is moreover comparable to the tRatio at release time. We confirm the test contribution numbers cited in the literature, with the average test change sizes quantified between 10% and 40% for the considered projects throughout their history. We did not observe a difference in change ratio between minor and major project releases for these projects.

Although we expected to find evidence of a correlation between the size of test changes and the measurements for coverage and testability, we are not able to empirically validate such hypotheses based on the sample data collected in this work. Our results for the size metrics are moreover opposite to the related work in [22]. Besides the possibility of an invalid theory, the relatively small sample size may impact the results. For this study, extending the sample size is not feasible, as we are bound to the number of releases of the systems under study. As an alternative, we tried the more powerful parametric alternative to the Spearman correlation coefficient, i.e., Pearson's product-moment coefficient, on the log-transformed data (fulfilling the normality requirement) that we used in the regression analysis. This approach did, however, not yield additional explanatory variables at the predetermined 0.05 significance level. Despite these limitations, we identified a single, strong predictor across the three projects for RQ2: the delta in production code changes between two releases, Δploc, at the line as well as at the statement-line level.

Furthermore, our reasoning about the impact of coverage on the size of changes may have been too simplistic. Kanstren notes that test suites typically contain tests with different roles, and proposes a quantitative way to measure the test coverage of the different parts [19]. Many tests in PMD (also studied in [19]) and Checkstyle take complete input samples to exercise the system. We can indeed assume that the characteristics of such I/O tests, in particular the number of covered methods, deviate from those of unit tests and hence have a different change sensitivity. Secondly, we did not take the efficiency (cov/tloc) of the test suite into account, which may drop, for example, due to code decay.

For RQ3, we selected Δploc as the explanatory variable to compose univariate models. Attempts to improve the accuracy by introducing an additional factor only appeared successful in a single case: the addition of the RFC testability metric to the estimation model for Cruisecontrol. The relatively weak MMRE outcome due to outliers indicates that we should not attempt to estimate releases with very little change, typically cases where estimation is not that important.

7 Conclusion

In this work, we made several noteworthy observations regarding the size of test changes over the lifetime of four representative software systems. First of all, we confirmed earlier reports on the considerable contribution of unit test suites with a fair to high coverage to the overall system code. Quantified as a 10% to 40% contribution, test code changes cannot be ignored as part of maintenance activities and should be included in cost estimation models.

Consequently, we verified which metrics may be used as predictors for the size of test changes: (i) the change size to the production system, (ii) test coverage, and (iii) the testability of classes (measured using size and coupling/cohesion). Quite surprisingly, neither the coverage level of the tests nor the testability of the classes could be used as a predictor. This can probably be explained by the testing strategy used in these projects (mainly full system (I/O) tests and few unit tests), but needs to be verified further in projects where other testing strategies are used.

Finally, we constructed and calibrated an estimation model (Δtloc_s = b_0 + b_1 Δploc_s) using linear regression and verified it on the available releases for the systems under study (i.e., we verified whether the amount of test code to be written in release n+1 could be predicted based on releases 1 to n). We observed how the prediction accuracy increases as more releases become available.

In the end, we conclude that estimates for maintenance activities on large and long-living software systems should take the test modifications into account, as these form a considerable part of the overall maintenance effort. It is feasible to construct an estimation model that reliably predicts the amount of test code that needs to be changed, provided that 5 to 10 releases are already available.

Acknowledgements – This work was executed in the context of the ITEA project if04032, entitled Software Evolution, Refactoring, Improvement of Operational & Usable Systems (SERIOUS), and has been sponsored by IWT, Flanders.

References

[1] A. Albrecht. Software function, source lines of code, and development effort prediction: A software science validation. IEEE Transactions on Software Engineering, 9(6):639–648, 1983.

[2] K. Beck. Test-Driven Development: By Example. Addison-Wesley, 2003.

[3] B. Boehm, C. Abts, A. Brown, S. Chulani, B. Clark, E. Horowitz, R. Madachy, D. Reifer, and B. Steece. Software Cost Estimation with Cocomo II. Prentice Hall, 2000.

[4] L. Briand, K. El Emam, and S. Morasca. Theoretical and empirical validation of software product measures. Technical Report ISERN-95-03, International Software Engineering Research Network, October 1995.

[5] F. Brooks. The Mythical Man-Month. Addison-Wesley, 1975.

[6] M. Bruntink and A. van Deursen. An empirical study into class testability. Journal of Systems and Software, 79(9):1219–1232, 2006.

[7] S. Conte, H. Dunsmore, and V. Shen. Software Engineering Metrics and Models. Benjamin Cummings, 1986.

[8] H. Coolican. Research Methods and Statistics in Psychology. Hodder Arnold, 2004.

[9] K. El Emam. A methodology for validating software product metrics. Technical Report NRC 44142, National Research Council Canada, 2000.

[10] S. Elbaum, D. Gable, and G. Rothermel. The impact of software evolution on code coverage information. In Proceedings of the 17th International Conference on Software Maintenance, pages 170–179, 2001.

[11] M. Ellims, J. Bridges, and D. Ince. Unit testing in practice. In Proceedings of the 15th International Symposium on Software Reliability Engineering, pages 3–13, 2004.

[12] B. Fluri and H. Gall. Classifying change types for qualifying change couplings. In Proceedings of the 14th International Conference on Program Comprehension, pages 35–45, June 2006.

[13] M. Gaelli, M. Lanza, O. Nierstrasz, and R. Wuyts. Ordering broken unit tests for focused debugging. In Proceedings of the 20th International Conference on Software Maintenance, pages 114–123, 2004.

[14] G. Gill and C. Kemerer. Cyclomatic complexity density and software maintenance productivity. IEEE Transactions on Software Engineering, 17(12):1284–1288, December 1991.

[15] P. Hamill. The xUnit Family of Unit Test Frameworks. O'Reilly, 2004.

[16] M. Harrold. Testing: a roadmap. In 22nd International Conference on Software Engineering, The Future of Software Engineering, pages 61–72, 2000.

[17] A. Hindle, M. Godfrey, and R. Holt. Release pattern discovery: A case study of database systems. In Proceedings of the 23rd International Conference on Software Maintenance, pages 285–294, 2007.

[18] M. Jørgensen. Experience with the accuracy of software maintenance task effort prediction models. IEEE Transactions on Software Engineering, 21(8):674–681, 1996.

[19] T. Kanstren. Towards a deeper understanding of test coverage. Journal of Software Maintenance and Evolution: Research and Practice, 20(1):59–76, January 2008.

[20] D. Kung, J. Gao, and C.-H. Kung. Testing Object-Oriented Software. IEEE, 1998.

[21] R. Lo, R. Webby, and R. Jeffery. Sizing and estimating the coding and unit testing effort for GUI systems. In Proceedings of the 3rd International Software Metrics Symposium, pages 166–173, 1996.

[22] F. Niessink and H. van Vliet. Predicting maintenance effort with function points. In Proceedings of the 13th International Conference on Software Maintenance, pages 32–39, October 1997.

[23] H. Rombach, B. Ulery, and J. Valett. Towards full life cycle control: Adding maintenance measurement to the SEL. Journal of Systems and Software, 18(2):125–138, May 1992.

[24] R. Sangwan and P. Laplante. Test-driven development in large projects. IT Pro, 8(5):25–29, 2006.

[25] S. Shapiro and M. Wilk. An analysis of variance test for normality. Biometrika, 52(3-4):591–611, 1965.

[26] S. Sim, R. Holt, and S. Easterbrook. On using a benchmark to evaluate C++ extractors. In Proceedings of the 10th International Workshop on Program Comprehension (IWPC 2002), pages 114–123. IEEE Computer Society, 2002.

[27] M. Skoglund and P. Runeson. A case study on regression test suite maintenance in system evolution. In Proceedings of the 20th International Conference on Software Maintenance, pages 438–442, 2004.

[28] A. van Deursen and L. Moonen. The video store revisited – thoughts on refactoring and testing. In Proceedings of the 2nd eXtreme Programming and Flexible Processes Conference, pages 71–76, 2002.

[29] A. van Deursen, L. Moonen, A. van den Bergh, and G. Kok. Refactoring test code. In Proceedings of the 1st eXtreme Programming and Flexible Processes Conference, pages 92–95, 2001.

[30] B. Van Rompaey, B. Du Bois, S. Demeyer, and M. Rieger. On the detection of test smells: A metrics-based approach for General Fixture and Eager Test. IEEE Transactions on Software Engineering, 33(12):800–817, December 2007.

[31] T. Yamaura. How to design practical test cases. IEEE Software, 15(6):30–36, 1998.

[32] A. Zaidman, B. Van Rompaey, S. Demeyer, and A. van Deursen. Mining software repositories to study co-evolution of production & test code. In Proceedings of the 1st International Conference on Software Testing, Verification and Validation, April 2008.

Figure 3. Scatterplots of Δploc versus Δtloc.

Table 4. Incrementally built regression models for the four case studies using Δploc_s as predictor, with b_0 and b_1 the model coefficients, the computed (comp.) versus predicted (pred.) size of test code changes, the adjusted R² (aR²) and confidence value p.

model     b_0       b_1       comp.   pred.   aR²     p
mCh3      -4.035    1.269     0.043   0.05    0.89    .15
mCh4      -3.857    1.239     78      128     0.99    <.001
mCh5      -3.927    1.273     157     270     0.99    <.001
mCh6      -4.018    1.307     1266    2379    0.99    <.001
mCh7      -4.205    1.355     558     309     0.99    <.001
mCh8      -4.123    1.329     5351    2785    0.98    <.001
mCh9      -3.937    1.289     50      14      0.98    <.001
mCh10     -4.113    1.296     4027    2699    0.97    <.001
mCh11     -4.020    1.277     135     199     0.97    <.001
mCh12     -4.000    1.279     4056    3240    0.97    <.001
mCh13     -3.956    1.270     0.80    0.050   0.97    <.001
mCh14     -4.982    1.395     18      92      0.94    <.001
mCh15     -4.750    1.376     35      46      0.93    <.001
mPMD3     -10.821   2.731     65      284     0.23    .425
mPMD4     -7.50     2.208     2.9     89      0.09    .374
mPMD5     1.862     0.571     110     105     -0.11   .494
mPMD6     1.835     0.575     189     9       -0.03   .411
mPMD7     4.116     0.066     98      618     -0.20   .939
mPMD8     1.800     0.524     651     1109    -0.03   .415
mPMD9     1.280     0.622     3178    5432    0.22    .114
mPMD10    0.934     0.684     207     118     0.51    .012
mPMD11    0.883     0.684     691     210     0.51    .008
mPMD12    1.100     0.636     108     37      0.48    .008
mPMD13    0.874     0.658     72      52      0.49    .005
mPMD14    0.791     0.667     460     31      0.51    .003
mPMD15    1.172     0.581     260     94      0.36    .010
mPMD16    1.244     0.560     165     40      0.34    .010
mPMD17    1.217     0.552     84      112     0.32    .011
mPMD18    1.261     0.548     1036    2510    0.32    .009
mPMD19    0.959     0.599     449     281     0.43    .001
mPMD20    1.021     0.587     115     113     0.43    .001
mPMD21    1.020     0.587     1.02    0.05    0.43    <.001
mPMD22    -0.947    0.840     495     261     0.46    <.001
mPMD23    -0.879    0.826     148     151     0.46    <.001
mPMD24    -0.879    0.826     158     332     0.46    <.001
mPMD25    -0.875    0.829     512     192     0.46    <.001
mPMD26    -0.773    0.809     108     132     0.46    <.001
mPMD27    -0.763    0.809     230     133     0.46    <.001
mPMD28    -0.748    0.804     840     289     0.46    <.001
mPMD29    -0.589    0.776     53      38      0.45    <.001
mPMD30    -0.627    0.779     122     23      0.46    <.001
mPMD31    -0.676    0.779     475     774     0.45    <.001
mCr3      -0.939    1.007     170     338     0.87    .161
mCr4      -0.716    0.999     761     602     0.86    .046
mCr5      -0.647    0.981     403     407     0.88    .011
mCr6      -0.647    0.981     815     924     0.89    .003
mCr7      -0.684    0.989     1037    497     0.90    .001
mCr8      -0.447    0.940     1912    2098    0.87    <.001
mCr9      -0.495    0.949     488     984     0.89    <.001
mCr10     -0.451    0.952     2022    2866    0.87    <.001
mCr11     -0.594    0.977     347     534     0.88    <.001
mCr12     -0.487    0.967     327     235     0.88    <.001
mCr13     -0.573    0.976     2965    3104    0.88    <.001
mCr14     -0.593    0.979     545     638     0.89    <.001
mCr15     -0.579    0.978     982     1004    0.89    <.001
mCr16     -0.584    0.980     2691    1766    0.89    <.001
mCr17     -0.438    0.956     1520    1298    0.89    <.001
mPo3      -2.996    1.5e-16   0.05    0.05    0.24    .424
mPo4      -2.996    0.000     0.05    16979   NA      NA
mPo5      -1.756    0.960     20      58      0.79    .028
mPo6      -1.627    0.984     18      79      0.82    .008
mPo7      -1.474    1.009     21      63      0.83    .003
mPo8      -1.372    1.024     120     148     0.84    <.001
mPo9      -1.362    1.028     22      130     0.85    <.001
mPo10     -1.230    1.043     95      221     0.84    <.001
