AN EMPIRICAL STUDY OF SOFTWARE METRICS FOR
ASSESSING THE PHASES OF AN AGILE PROJECT
GIULIO CONCAS*, MICHELE MARCHESI†, GIUSEPPE DESTEFANIS‡
and ROBERTO TONELLI§
Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, Cagliari, 09123, Italy
*[email protected] †[email protected]
‡[email protected] §[email protected]
http://www.diee.unica.it
Received 13 September 2011
Revised 7 November 2011
Accepted 25 January 2012
We present an analysis of the evolution of a Web application project developed with object-oriented technology and an agile process. During the development we systematically performed measurements on the source code, using software metrics that have been proved to be correlated with software quality, such as the Chidamber and Kemerer suite and Lines of Code metrics. We also computed metrics derived from the class dependency graph, including metrics derived from Social Network Analysis. The application development evolved through phases, characterized by a different level of adoption of some key agile practices, namely pair programming, test-based development and refactoring. The evolution of the metrics of the system, and their behavior related to the agile practices adoption level, is presented and discussed. We show that, in the reported case study, a few metrics are enough to characterize with high significance the various phases of the project. Consequently, software quality, as measured using these metrics, seems directly related to agile practices adoption.
Keywords: Software metrics; software evolution; agile methodologies; object-oriented metrics; SNA metrics applied to software.
1. Introduction
Software is an artifact that can be easily measured, being readily available and composed of unambiguous information. In fact, since the inception of software, many kinds of metrics have been proposed to measure software characteristics. The main goal of software metrics is to measure the effort needed to develop the software, or to measure its quality. Effort metrics are relatively simple and well understood. They cover the requirement phase, with metrics such as "Function Points" [2] and the like,
up to the design and coding phases, with metrics starting from the simple "Lines of Code" (LOC), up to more complex metrics like Cyclomatic Complexity [25]. While the effectiveness of effort metrics in predicting and measuring the actual costs of software development is still debated, in this paper we will not focus on this kind of metrics, but only on quality metrics.
Software quality metrics aim to measure how "good" a piece of software is, especially from the point of view of being error-free and easy to modify and maintain. Software quality metrics tend to measure whether software is well structured, not too simple and not too complex, with cohesive modules that minimize their coupling. Many quality metrics have been proposed for software, depending also on the paradigm and languages used: there are metrics for structured programming, object-oriented programming, aspect-oriented programming, and so on. In this paper, we will focus on object-oriented (OO) metrics, since the OO paradigm is nowadays by far the most popular among developers.^a
In dealing with software metrics, however, the main point is not to come up with new, sensible metrics able to measure software, but to empirically demonstrate their usefulness in practice. Empirical proofs of the value of metrics for assessing software quality are mainly based on finding correlations between specific metrics and the fault-proneness of software modules, that is, the number of faults that were found and fixed. Unfortunately, considering software quality as simply inversely related to the number of faults has its drawbacks. The first one is that the relationship between a fault and a software module is typically declared when the module is modified to fix the fault. However, a module is often modified as a consequence of an error located elsewhere, not because the module itself is wrong. Moreover, simply relating quality and (absence of) faults does not account for other characteristics that are very important in software development, such as ease of maintenance, but that are much more difficult to relate to software metrics.
In this work we present the possible use of OO metrics to indirectly assess the quality of the developed software, by showing significant changes over time as the development proceeds through different phases. In these phases, various specific "agile" development practices were used, or their use was discontinued. In this context, we assess the ability of some metrics to discriminate among the phases of the project, and therefore the usage of specific practices. We present results on an industrial case study, and discuss their implications and relationships with previous research. We understand that the presented evidence is anecdotal, but with real software projects it is very difficult to plan multi-project research of this kind. This is because
^a The relative diffusion of programming languages is continuously monitored by some Web sites. Among them, lang-index.sourceforge.net monitors the usage of languages in Sourceforge Open Source projects. Here, in November 2011, the share of OO languages was greater than 55%. Tiobe's monthly Programming Community Index (www.tiobe.com/index.php/content/paperinfo/tpci), published since 2001, shows the top 50 languages' ratings based on searching the Web with certain phrases that include language names and counting the numbers of hits returned. Here the combined rating of OO languages in November 2011 was 55.3%.
software houses tend to be very secretive about their projects. We hope that other
researchers will try to replicate the presented results on similar projects whose data
they can access.
The target of our research is the evolution of a software project consisting of the implementation of FLOSS-AR, a program to manage the Register of Research of universities and research institutes. FLOSS-AR was developed with a full object-oriented (OO) approach and released under the GPL v.2 open source license. It is a Web application, which has been implemented through a specialization of an open source software project, jAPS (Java Agile Portal System) [18], a Java framework for Web portal creation. Throughout the project we collected metrics about the software product under development. We used the Chidamber and Kemerer (CK) OO metrics suite [8], as well as complexity metrics computed from the class dependency graph [10]. The project was developed following an agile process [5, 6] with various adoption levels of some key agile practices, namely Pair Programming (PP), Test-Driven Development (TDD) and refactoring [5], that were recorded during the project.
We show how some metrics computed on the developed code seem to have the capability to discriminate, in a statistically significant way, among the various phases of the project, which in turn are characterized by the adoption, or non-adoption, of the above mentioned agile practices (PP, TDD, refactoring). In this way, the quality of an ongoing project might be controlled using these metrics.
This paper is organized as follows: in Sec. 2 we present the CK, graph-theoretical and SNA metrics computed on the software; in Sec. 3 we discuss prior literature on software metrics; in Sec. 4 we present the phases of the development; in Sec. 5 we present and discuss the results, relating software quality, as resulting from the metric measurements, with the adoption of agile practices; Sec. 6 deals with the threats to the validity of the paper, which is concluded in Sec. 7.
2. Software Metrics
In this section we briefly introduce all the metrics studied in our work, used as a starting point to choose the metrics subset best suited to discriminate between the various project phases. For a more detailed description, references with their definition and possible uses are given. The metrics we computed throughout the project are the OO metrics suite given by Chidamber and Kemerer [8], graph-theoretical metrics, and Social Network Analysis (SNA) metrics.
The Chidamber and Kemerer (CK) metrics suite is perhaps the most studied
among OO metrics suites, and its relationship with software fault-proneness has
already been validated by many researchers. The CK metrics are: Number Of
Children (NOC) and Depth of Inheritance Tree (DIT), related to inheritance;
Weighted Methods per Class (WMC) and Lack of Cohesion in Methods (LCOM),
pertaining to the internal class structure; Coupling Between Objects (CBO) and
Response For a Class (RFC), that are related to relationships among classes. Several
papers related CK metrics to software quality, not always agreeing on which metrics
are the most correlated with lack of faults and ease of maintenance; see Sec. 3 for a
survey of the related literature.
As presented and discussed in the next section, among the CK metrics, WMC and CBO are those that have been found to be most correlated with software quality. RFC and LCOM were sometimes, but not always, proved to be correlated with fault proneness or with the maintenance effort related to a class. DIT was sometimes found correlated, but was also often found not correlated, or exhibiting too low value variations. NOC is the CK metric least related to software quality. In general, the lower the value of the CK metrics, the better the quality of the system.
Note that a recent work on the evolution of the Eclipse Java system shows that the cohesion/coupling metrics do not behave as expected in some cases [3]. For instance, in the cited paper, cohesion metrics were found to decrease after restructurings that should have increased cohesion, and similar results were found regarding coupling. However, the work in [3] studies coupling and cohesion at the package and plugin level, while all our analysis is made at the class level.
The second kind of metrics we analyzed are derived from network theory applied to the software graph. In fact, it is possible to build a directed graph, called the class graph, from the source code of an OO system, the nodes of the graph being the classes (or the interfaces), and the graph edges being the dependencies between classes. In this graph, we can define the Fan-In (or in-degree) of a class as the number of edges directed toward the class; the in-degree is a measure of how much the class is used by other classes in the system. The Fan-Out (or out-degree) of a class is the number of edges directed from the class; it counts how many other classes of the system are used by the class. Fan-In and Fan-Out measure the number of different classes using, or used by, the target class. These metrics can also be weighted by the number of times another class uses, or is used by, the target class, thus yielding weighted Fan-In/Fan-Out. As an example, if class A uses class B three times (for instance, defining an instance variable of type B and two local variables of type B in two methods), A's Fan-Out is increased by one, while its weighted Fan-Out is increased by three. Fan-In and Fan-Out, weighted or not, are the graph-theoretical metrics we considered. They are related to complex network theory, because it is well known that in complex networks their distribution is fat-tailed, and often a power-law [28].
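As an illustration, the following minimal Python sketch (with hypothetical class names and dependency counts, not taken from FLOSS-AR) shows how Fan-In, Fan-Out and their weighted variants can be computed from a list of class dependencies, assuming each pair of classes appears at most once in the list:

    from collections import defaultdict

    # Hypothetical dependency list: (using_class, used_class, times_used).
    # An entry ("A", "B", 3) means class A references class B three times.
    dependencies = [
        ("A", "B", 3),   # e.g. one instance variable and two local variables of type B
        ("A", "C", 1),
        ("C", "B", 2),
    ]

    fan_in = defaultdict(int)     # FI:  No. of distinct classes using the key class
    fan_out = defaultdict(int)    # FO:  No. of distinct classes used by the key class
    w_fan_in = defaultdict(int)   # WFI: total No. of references to the key class
    w_fan_out = defaultdict(int)  # WFO: total No. of references made by the key class

    for user, used, count in dependencies:
        fan_out[user] += 1
        fan_in[used] += 1
        w_fan_out[user] += count
        w_fan_in[used] += count

    print(fan_out["A"], w_fan_out["A"])  # 2 4  (A uses two classes, four references in total)
    print(fan_in["B"], w_fan_in["B"])    # 2 5  (B is used by two classes, five references in total)

This mirrors the example in the text: class A using class B three times adds one to A's Fan-Out and three to its weighted Fan-Out.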
We also consider the class LOC metric, that is, the number of lines of code of the class. It is good OO programming practice to create small and cohesive classes, so the class LOC metric should also be kept reasonably low in a "good" system.
Graph-theoretical metrics can be related to the CK metrics pertaining to the relationships among classes. We know that the CK CBO metric, being the count of the number of other classes to which a given class is coupled, denotes class dependency on other classes in the system, and is therefore strictly related to the sum of Fan-In and Fan-Out of a class node in the class graph, because links represent dependencies between classes. Also, the CK RFC metric is computed as the sum of the number of
methods of a class and the number of external methods called by them. This latter quantity is strictly related to the weighted Fan-Out of the class node.
The third group of metrics we used are SNA metrics [30]. These metrics also come from complex network theory. They were introduced for sociological analysis, and have recently been used on software graphs as well. There are several variations of SNA metrics. We decided to restrict the analysis to SNA metrics that account for the directionality of edges, and that can be considered meaningful in a software engineering context. These metrics are: in- and out-Reach Efficiency, in- and out-Two-Step Reach, in- and out-number of weak components, and in- and out-Closeness. These and other metrics are fully explained in [12].
The studied SNA metrics have an interpretation from the OO software development point of view. Recall that the nodes of the network are classes or interfaces, while the directed edges represent a dependency between two classes: the class the edge comes from uses, in some way, the class the edge is directed to. High reach efficiency indicates that the primary contacts of a class are influential in the network. A high REI means that the classes using a given class are in turn used by many other classes. This is a measure of the degree of reuse of a class, not only directly but also in two steps. A high REO means that a class uses other classes, which in turn further use other classes. It is a measure of two-step dependence on the rest of the system. Both these metrics are related to coupling. They should be kept at relatively low values to minimize coupling among classes of the system.
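One possible reading of the Reach Efficiency definitions (see Table 1) is sketched below in Python; the adjacency map and the exact normalization are our assumptions, not code from the measurement tool used in the study:

    def reach_efficiency(adj, node, n_total):
        """Two-step reach efficiency of one class.

        adj maps a class to the set of classes reached in one step in the
        chosen direction: pass out-edges for REO, reversed edges for REI.
        Following our reading of Table 1, the percentage of nodes within
        two steps is divided by the number of nodes within one step.
        """
        one_step = set(adj.get(node, set()))
        two_step = set(one_step)
        for neighbour in one_step:
            two_step |= adj.get(neighbour, set())
        two_step.discard(node)
        if not one_step:
            return 0.0
        return (100.0 * len(two_step) / n_total) / len(one_step)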
Weak Components (WC) is a normalized measure of how many disjoint sets of other classes are coupled to a given class. In general, it is an indirect measure of coupling: the higher the WC, the lower the coupling among the classes coupled to a given class.
Closeness-In (CI) is a measure of how easy it is for a class to be reached, directly or indirectly, by other classes that need its services. Similarly, Closeness-Out (CO) is a measure of how many dependence steps are needed to reach all other (reachable) classes of the system. The two closeness measures are related to the "small-world" property of a software network. For a single class, the hypothesis is that the more central a class is, the more defects it will have. For ensemble measures over the whole system, such as the mean or a percentile of CI or CO, the hypothesis is that a smaller value of centrality denotes a smaller coupling among classes. Note that these measures can vary greatly for entire ensembles of classes if a link is added to a set of classes that were not previously connected, or if such a link is removed.
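The two closeness metrics can be illustrated with a breadth-first search over the class graph; the following is a minimal sketch of the Table 1 definitions, where the helper name and graph representation are ours:

    from collections import deque

    def closeness(adj, node):
        """Closeness of one class: pass out-edges for Closeness-Out and the
        reversed graph for Closeness-In.

        Farness is the sum of shortest-path lengths to all reachable classes
        divided by their number; closeness is its reciprocal (Table 1).
        """
        dist = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for nxt in adj.get(current, ()):
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        lengths = [d for n, d in dist.items() if n != node]
        if not lengths:
            return 0.0
        farness = sum(lengths) / len(lengths)
        return 1.0 / farness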
Table 1 summarizes the metrics we computed for the system under development.
Throughout the project, we computed and analyzed the evolution of a set of source
code metrics including the CK suite of quality metrics, the total number of classes,
the lines of code of classes (LOCs), and the above described metrics derived from the
analysis of the software graph.
All the cited metrics are measurements made on single classes, so there is a value
of the metrics for each class (and interface) of the system. However, we are mainly
interested in measures of the whole system, able to give a synthetic picture of its
quality. For this purpose, we computed statistics of the metric values over all the
classes of the system, during its development, and used some of these statistics as a
measure of the whole system. More about this in the section on results.
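For instance, the aggregation used later in the paper (mean and 90th percentile per weekly snapshot) can be computed along these lines; the sketch assumes NumPy and uses made-up values, not FLOSS-AR data:

    import numpy as np

    def snapshot_stats(per_class_values):
        """Whole-system statistics of one metric for one weekly snapshot."""
        values = np.asarray(per_class_values, dtype=float)
        return {
            "mean": values.mean(),
            "std": values.std(),
            "median": np.median(values),
            "p90": np.percentile(values, 90),   # the tail statistic retained in the analysis
        }

    # Hypothetical RFC values of the classes in one snapshot.
    print(snapshot_stats([3, 7, 12, 5, 48, 9]))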
3. Related Work
Several papers related CK metrics to software quality, not always agreeing on which
metrics are the most correlated with lack of faults and ease of maintenance. In a
study of two commercial systems, Li and Henry studied the link between CK metrics
and maintenance effort [22]. Basili et al. found that many of the CK metrics were associated with the fault-proneness of classes [4]. In another study on three industrial projects, Chidamber et al. reported that WMC, CBO and RFC look highly correlated with each other, and that higher values of the CK coupling and cohesion metrics (CBO and LCOM) were associated with reduced productivity and increased rework/design effort [9]. Subramanyam and Krishnan studied a large system written in C++ and Java, and found a good correlation between the number of defects and WMC, CBO and DIT [32].
Table 1. The metrics used to study the system.

Metric  Type   Description
NOC     CK     Number of Children: No. of immediate subclasses.
DIT     CK     Depth of Inheritance Tree: No. of superclasses, up to the root.
WMC     CK     Weighted Methods per Class: No. of methods of the class (weight = 1).
LCO     CK     Lack of Cohesion in Methods (LCOM): No. of method pairs not sharing any instance variable minus No. of pairs sharing at least one. Zero if negative.
CBO     CK     Coupling Between Objects: No. of other classes that depend on the given class, or on which the given class depends (excluding inheritance).
RFC     CK     Response For a Class: No. of methods plus No. of dependencies on other classes (excluding inheritance).
FI      Graph  Fan-In: No. of other classes that depend on the given class.
WFI     Graph  Weighted Fan-In: No. of times all other classes depend on the given class.
FO      Graph  Fan-Out: No. of other classes on which the given class depends.
WFO     Graph  Weighted Fan-Out: No. of times the given class depends on other classes.
REI     SNA    Reach Efficiency In: Percentage of nodes within two-step distance from a node, following arcs from head to tail, divided by the No. of nodes within one step.
REO     SNA    Reach Efficiency Out: Percentage of nodes within two-step distance from a node, following arcs along their direction, divided by the No. of nodes within one step.
WC      SNA    Weak Components: No. of disjoint sets of nodes within one step from a node, not considering the node itself, divided by the No. of nodes within one step.
CI      SNA    Closeness-In: Reciprocal of Farness-In, defined as the sum of the lengths of all shortest paths from the node to all other nodes, following arcs from head to tail, divided by the No. of reachable nodes.
CO      SNA    Closeness-Out: Reciprocal of Farness-Out, defined as the sum of the lengths of all shortest paths from the node to all other nodes, following arcs along their direction, divided by the No. of reachable nodes.
LOC     Dim.   Lines Of Code: No. of lines of code of the class, excluding comments and blank lines.
Gyimothy et al. systematically studied the open-source Mozilla system, finding that above all CBO, and then RFC, LCOM, WMC and DIT, show a fair correlation with defects [17].
Succi et al. reported a broad empirical exploration of the distributions of CK metrics across several Java and C++ projects, confirming that some metrics are fairly correlated, and that the NOC and DIT metrics generally exhibit a low variance, so they are less suitable for a systematic assessment based on metric computation [33].
Recently, some papers have been published on the use of OO metrics to assess the quality of software developed using agile methodologies. Giblin et al. presented a case study comparing the source code produced using agile methods with the source code produced for a similar type of application by the same team using a more traditional methodology. They made extensive use of specific OO metrics, and concluded that agile methods have guided the developers to produce better code in terms of both quality and maintainability [16]. Kunz et al. presented a methodological work discussing cost estimation approaches for agile software development, and a quality model making use of distinct metrics for quality management in agile software development [20]. Melis et al. used the software process simulation approach to assess the effect of the use of PP and TDD on effort, size, quality and released functionalities [26]. They found that increasing the usage of these practices significantly diminishes product defectiveness, and increases programming effort. Dyba and Dingsøyr reported a systematic review of other empirical studies of agile software development, including in Sec. 4.7 some other empirical evaluations of product quality [14]. These studies include a paper by Layman et al. [21] on an industrial project before and after the adoption of Extreme Programming, reporting a 65% decrease in pre-release defect rate and a 35% decrease in post-release defect rate after XP adoption; a paper by Macias et al. [24] comparing 20 student projects using Waterfall and XP methodologies, reporting no significant differences in external and internal quality factors; and a paper by Wellington et al. [34] comparing the development of 4 systems by 20 student teams using plan-driven and XP methodologies, reporting that XP code shows consistently better quality metrics, among which a decrease of 40% in the average WMC value.
One of the most studied agile development practices in the literature is TDD. Here we report papers studying the influence of TDD on software quality and OO metrics. Canfora et al. [7] studied a set of 28 professional developers, asked to develop a test project. They found that TDD improves unit testing but slows down the overall process. Nagappan et al. [27] studied industrial projects carried out in various contexts, using Java, C++ and .NET. The results indicated that the pre-release defect density decreased by between 40% and 90% compared to similar projects that did not use the TDD practice. The teams experienced a 15-35% increase in initial development time after adopting TDD. Janzen and Saiedian [19] studied various projects, industrial and academic, also analyzing the effect of the use of TDD on the OO metrics computed on the developed software. They found that test-first
programmers consistently produced classes with lower values of the WMC metric; CBO and Fan-Out of the studied classes did not show a significant difference between software developed with or without TDD; the LCOM* metric (a normalized LCOM, constrained to the [0, 1] interval) also showed no significant difference. Siniaalto and Abrahamsson [31] studied 5 small-scale case projects (5-9 KLOC each), mainly performed by students. They found that WMC, CBO, RFC, NOC and LCOM do not significantly differ between software developed with or without TDD; however, they also found significantly lower values of RFC in TDD software, as well as significantly higher values of DIT.
Concas et al. published a paper using the same empirical data as this paper, limiting their study to only the CK metrics and the LOC metrics (class LOC and method LOC), and describing in deeper detail the agile practices used in the project [11]. They found that all the considered metrics but LCOM are able to discriminate very well between the first two phases of the project (the initial "Agile" phase and the "cowboy coding" phase), while only a few metrics maintain the ability to discriminate between subsequent phases, and no metric is able to discriminate between all pairs of consecutive phases at a significance level greater than 95%.
A few papers have been published regarding the relationships of graph-theoretic and SNA metrics with software quality. Among these, Zimmermann and Nagappan [35] computed and studied many SNA metrics, on both the oriented and non-oriented software graph related to the binary modules of the Windows Server 2003 operating system and their dependencies. They found that some SNA metrics could identify 60% of the binaries that the Windows developers considered as critical, twice as many as those identified by complexity metrics (dimension, No. of functions, parameters and globals, Fan-In, Fan-Out).
Concas et al. [12] presented an extensive analysis of software metrics for 111 object-oriented systems written in Java, including SNA metrics, finding systematic non-normal behavior in their distributions, and studying the correlations among metrics. Concas et al. [13] studied the application of CK and SNA metrics to the Eclipse and Netbeans open source systems, and performed an analysis of their correlation with defects found in classes; they found that the metrics most correlated with defects are LOC, RFC and CBO.
4. Project Phases
Besides a first exploratory phase at the beginning of the project, in which the team studied the functionalities of the underlying open source Web portal management system (jAPS) and the way to extend it, without producing code, the project evolved through four main phases, each one characterized by an adoption level of the key agile practices of pair programming, TDD and refactoring. In particular:
- Pair Programming was one of the keys to the success of the project. All the development tasks were assigned to pairs and not to single programmers. Given a
task, each pair decided which part of it to develop together, and which part to develop separately. The integration was typically done working together. Sometimes, the developers paired with external programmers belonging to the jAPS development community, and this helped to quickly grasp the needed knowledge of the framework.
- Regarding TDD, developers had the requirement that all code must have automated unit tests and acceptance tests, and must pass all tests before it can be released. The choice of whether to write tests before or after the code was left to the programmers.
- Refactoring was practiced mainly to eliminate code duplications and improve hierarchies and abstractions. Unfortunately, data on specific refactorings were not recorded. The developers had a fair knowledge of Fowler's book [15], so several refactorings cited there were applied.
A full account of the agile practices used in the project, and preliminary results on the use of CK metrics for discriminating among phases, is reported in [11].
To give empirical evidence for these phases, we asked each of the five members of the development team to define, according to their judgment, the system evolution phases with respect to PP, TDD and refactoring usage, and the dates when these phases started and ended. Four out of five members cited the four phases. Only one proposed three phases, merging phases 3 and 4 into just one phase.
Regarding the dates defining the boundaries between phases, all agreed that week 17 marked the end of Phase 2, obviously related to the date of the public presentation of the system. The end of Phase 1 was attributed to weeks from 8 to 11, with median equal to, and mean close to, 10 weeks. The end of Phase 3 was attributed to weeks 20 and 21, with the majority saying 21. The resulting phases, which we will consider in the remainder of the paper, are summarized below:
- Phase 1 (Initial Agile): a phase characterized by the full adoption of all practices, including testing, refactoring and pair programming. It lasted ten weeks, leading to the implementation of a key set of the system features. In practice, specific classes to model and manage the domain of research organizations, roles, products, and subjects were added to the original classes managing the content management system, user roles, security, front end and basic system services. The new classes include service classes mapping the model classes to the database, and allowing their presentation and user interaction.
- Phase 2 (Cowboy Coding): this is a critical phase, characterized by a minimal adoption of pair programming, testing and refactoring, because a public presentation was approaching, and the system still lacked many of the features of competitors' products. So, the team rushed to implement them, compromising quality. This phase lasted seven weeks, and included the first release of the system after two weeks.
- Phase 3 (Refactoring): an important refactoring phase, characterized by the full adoption of the testing and refactoring practices and by the adoption of a rigorous pair
programming rotation strategy. The main refactorings performed were "Extract Superclass", to remove duplications and extract generalized features from the classes representing research products and the corresponding service classes, and "Extract Hierarchy", applied to a few "big" classes, such as an Action class that managed a large percentage of all the events occurring in the user interface. This phase was needed to fix the bugs and the bad design that resulted from the previous phase. It lasted four weeks and ended with the second release of the system.
- Phase 4 (Mature Agile): like Phase 1, this is a development phase characterized by the full adoption of the entire set of practices, until the final release, after eight weeks.
5. Results and Discussion
In this section we analyze the evolution of the FLOSS-AR source code metrics. At regular intervals of one week, the source code was checked out from the CVS repository and analyzed by a parser that calculated the metrics. The parser and the analyzer were developed by our research group as a plug-in for the Eclipse IDE. In this way we gathered 30 "snapshots" of the system, one for each development week.
5.1. Correlations of the metrics of a given system
To study how the 19 metrics, each computed for all classes of a given system, are correlated, we calculated the cross-correlation values of the various considered metrics for the last release of the system under study. We used Kendall's non-parametric measure of rank correlation [29] because Pearson's correlation coefficients were highly influenced by outliers, while Spearman's rank correlation coefficient computation suffered from the many equal values found in integer data. The results are reported in Table 2; we consider as highly correlated the pairs whose coefficient has an absolute value above 0.6. Correlation tests made on other snapshots of the system yield very similar results.
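The kind of computation involved can be sketched as follows, assuming SciPy is available; the per-class values below are hypothetical and only illustrate the call, they are not the paper's data:

    import numpy as np
    from scipy.stats import kendalltau

    # Hypothetical per-class values of two metrics in one snapshot.
    rfc = np.array([12, 3, 45, 7, 30, 5, 18])
    loc = np.array([150, 40, 700, 60, 420, 35, 220])

    # Rank-based correlation, robust to the outliers and fat tails of these metrics.
    tau, p_value = kendalltau(rfc, loc)
    print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")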
We found a high correlation between several pairs of metrics. RFC is fairly correlated with WMC, CBO and LOC, while LCO is correlated with WMC. FI and WFI are the most correlated metrics (τ = 0.90), but FI and WFI are also correlated with REI, CI and REO, the latter being strongly anti-correlated. FO is correlated with CBO, RFC, WFO, and REI, while WC is correlated with CBO. Finally, CI is correlated with WFO, which in turn is also correlated with LOC and RFC. Reach Efficiency, and in particular REO, tends to be anti-correlated with most other metrics.
These correlations do not mean that some metrics can be easily substituted by others. However, they can be a good starting point to reduce the number of metrics to study.
From the correlations studied, and from common knowledge on OO metrics, as specified below, the following metrics can be considered candidates to be overlooked, or substituted by other metrics:
Table 2. The Kendall rank cross-correlation coefficients of the considered metrics, computed on all classes of the last version of the FLOSS-AR system.

Metric   LOC    WMC    LCO    NOC    DIT    CBO    RFC    FI     WFI    FO     WFO    REI    REO    WC     CI     CO
LOC      1.00   0.57   0.39   0.00   0.06   0.41   0.69   0.02   0.05   0.48   0.60  -0.20   0.06   0.35   0.02   0.06
WMC      0.57   1.00   0.60   0.11  -0.09   0.37   0.68   0.24   0.28   0.20   0.30   0.04  -0.14   0.35   0.16   0.06
LCO      0.39   0.60   1.00   0.00  -0.17   0.28   0.44   0.22   0.25   0.10   0.14   0.08  -0.17   0.24   0.14   0.07
NOC      0.00   0.11   0.00   1.00  -0.06   0.08   0.06   0.36   0.35  -0.05  -0.03   0.21  -0.24   0.21   0.29   0.18
DIT      0.06  -0.09  -0.17  -0.06   1.00  -0.02   0.01  -0.20  -0.21   0.24   0.29  -0.17   0.29  -0.04  -0.16   0.14
CBO      0.41   0.37   0.28   0.08  -0.02   1.00   0.57   0.33   0.32   0.56   0.44  -0.13  -0.15   0.74   0.17   0.06
RFC      0.69   0.68   0.44   0.06   0.01   0.57   1.00   0.08   0.12   0.57   0.57  -0.22   0.04   0.46   0.07   0.06
FI       0.02   0.24   0.22   0.36  -0.20   0.33   0.08   1.00   0.91  -0.18  -0.13   0.58  -0.59   0.42   0.59   0.12
WFI      0.05   0.28   0.25   0.35  -0.21   0.32   0.12   0.91   1.00  -0.16  -0.11   0.56  -0.54   0.39   0.61   0.11
FO       0.48   0.20   0.10  -0.05   0.24   0.56   0.57  -0.18  -0.16   1.00   0.77  -0.52   0.28   0.39  -0.11   0.06
WFO      0.60   0.30   0.14  -0.03   0.29   0.44   0.57  -0.13  -0.11   0.77   1.00  -0.40   0.22   0.35  -0.08   0.10
REI     -0.20   0.04   0.08   0.21  -0.17  -0.13  -0.22   0.58   0.56  -0.52  -0.40   1.00  -0.40   0.00   0.37   0.12
REO      0.06  -0.14  -0.17  -0.24   0.29  -0.15   0.04  -0.59  -0.54   0.28   0.22  -0.40   1.00  -0.26  -0.34  -0.16
WC       0.35   0.35   0.24   0.21  -0.04   0.74   0.46   0.42   0.39   0.39   0.35   0.00  -0.26   1.00   0.22   0.09
CI       0.02   0.16   0.14   0.29  -0.16   0.17   0.07   0.59   0.61  -0.11  -0.08   0.37  -0.34   0.22   1.00   0.09
CO       0.06   0.06   0.07   0.18   0.14   0.06   0.06   0.12   0.11   0.06   0.10   0.12  -0.16   0.09   0.09   1.00
- NOC and DIT: it is well known that most authors consider these metrics the least correlated with faults [17]. In our case, we found that the mean and 90th percentile of the NOC and DIT metrics show small variations across the snapshots, and look less useful than other metrics for discriminating among the Phases. The only large variation in DIT is between weeks 17 and 18 (thus between Phase 2 and Phase 3), when a new abstract superclass, "EntityManager", was introduced to generalize a large part of the behavior of 18 existing classes. This led to a jump in DIT, and a corresponding drop in WMC, CBO, RFC, FI and FO, because many dependencies between each of the 18 subclasses and other classes were pushed up the hierarchy, to the new class. Overall, inheritance links contribute only about 4% of all links of the software graph. For this reason, despite the importance of inheritance in OO development, the NOC and DIT metrics were not considered for discriminating among the Phases of the presented case study.
- WMC: the information carried by this metric is also found in LOC (the more methods in a class, the more lines of code) and RFC (which includes WMC in its computation).
- CBO: it is well correlated with RFC, FI and FO, as known from the literature [10, 33], so we will not consider it.
- WFI: FI is an almost perfect substitute, because it is strongly correlated with WFI, and exhibits correlations very similar to those of WFI with all other metrics; moreover, it is simpler to compute.
- FO, WFO: these metrics are well represented by the RFC metric. Moreover, their averages over all the classes of the system are the same as the averages of FI and WFI, respectively. This is because their average is the average number of in-links and out-links over all system classes. Since each in-link corresponds to one out-link, their total numbers, and hence their averages, are the same. This is true for both weighted and non-weighted links (see the sketch after this list).
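A small check of this equality on a hypothetical edge list could look like this:

    from collections import defaultdict

    # Hypothetical class graph edges: (using_class, used_class).
    edges = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "A")]
    classes = {"A", "B", "C", "D"}

    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for user, used in edges:
        fan_out[user] += 1
        fan_in[used] += 1

    mean_fi = sum(fan_in[c] for c in classes) / len(classes)
    mean_fo = sum(fan_out[c] for c in classes) / len(classes)
    # Every edge adds exactly one in-link and one out-link, so the means coincide.
    assert mean_fi == mean_fo == len(edges) / len(classes)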
We decided to consider all the SNA metrics, because they are not yet well studied in the software field, so they deserve to be studied in more depth. Note that we also performed the analysis of variations in metric statistics reported in the following for the metrics considered substituted by others, confirming that their behavior is consistent with that of their substitute metrics. In this way, the paper is simpler, without losing information.
In the end, we analyze the behavior of the following nine metrics, as system development evolved: LCOM, RFC, FI, REI, REO, WC, CI, CO, LOC.
5.2. Metric statistics across system snapshots and their correlations
The total number of classes in the system (including abstract classes and interfaces),
which is a good indicator of its size, increases over time, though not linearly. The
project started with 362 classes, those of jAPS release 1.6. At the end of the
project, after 30 weeks, the system had grown to 514 classes, due to the development
of new features that constituted the specialized system. Figure 1 shows the evolution
of the number of classes during development, together with the four main phases of
development, and the weeks of the three releases of the system.
We computed key statistics of these metrics (mean, standard deviation, median, 90th percentile) for each of the 30 systems analyzed. Remember that these metrics are always positive, and none of them is normally distributed; they all follow a "fat tail" distribution, often a power-law [10, 23, 13], so the statistics must be focused mainly on the extreme tail. We found that the best statistics to account for the behavior of a metric in the whole system are the mean, which is anyway a rough measure of the overall behavior of the metric across all classes of the system, and the 90th percentile, which gives information on the tail. The standard deviation gives information only on how the values are spread, but not on the values themselves, while the median is skewed toward values that are too low, and tends to be fairly constant.
We computed the Kendall cross-correlation coefficients of the mean and 90th percentile of the metrics, on the 30 weekly snapshots of the FLOSS-AR system under study, to assess how these metrics were related across the development. We show these cross-correlations in Tables 3 and 4; we consider as high those whose absolute value is above 0.7. Note that the 90th percentile of the LCOM metric is constant across the snapshots, so we had to drop it from Table 4.
This correlation is different from the correlation computed class by class for a single snapshot of the system, shown in Table 2. A high positive value of the class-by-class cross-correlation between two metrics means that, when one is above (below) average for a class, the other is likely to be above (below) average as well for the same class. In Tables 3 and 4, we refer instead to the correlation among average and 90th
Fig. 1. Total no. of classes during the evolution of FLOSS-AR system.
percentile values of the metrics, respectively, measured at weekly time steps during
the development.
In this case, a high positive value of the cross-correlation means that, at a given development step, when one metric, averaged over all classes, is above (below) its average value over the whole development, the other is also likely to be above (below) its own average value by a similar percentage for the same time step. As can be seen in Tables 3 and 4, many metrics are fairly correlated with each other. The metrics most correlated with the others, for both means and 90th percentiles, are LOC, RFC, Fan-In, and Closeness-Out, the latter being anti-correlated with the other metrics. The least correlated metric is Closeness-In.
Regarding the 90th percentiles, the correlations substantially confirm those of the means, but are typically lower. These results often do not match those reported in Table 2, in the sense that if two metrics are fairly correlated (or not correlated at all) when computed class-by-class, this does not imply that their means or 90th percentiles are correlated (or not correlated) in the same way when computed across a sequence of snapshots of the system under development, and vice versa. In about 40% of the cases, we even observe an inversion of the sign of the correlation. This is quite counter-intuitive, but the two correlations have different meanings.
Table 3. The Kendall rank cross-correlation coefficients of the averages of the nine considered metrics, computed on the 30 weekly snapshots of the FLOSS-AR system.

Metric   LOC    LCO    RFC    FI     REI    REO    WC     CI     CO
LOC      1.00   0.64   0.87   0.71   0.52   0.40   0.63   0.06  -0.42
LCO      0.64   1.00   0.54   0.43   0.42   0.41   0.50  -0.23  -0.34
RFC      0.87   0.54   1.00   0.79   0.58   0.38   0.62   0.14  -0.48
FI       0.71   0.43   0.79   1.00   0.76   0.46   0.64   0.07  -0.69
REI      0.52   0.42   0.58   0.76   1.00   0.67   0.52  -0.13  -0.87
REO      0.40   0.41   0.38   0.46   0.67   1.00   0.41  -0.38  -0.57
WC       0.63   0.50   0.62   0.64   0.52   0.41   1.00   0.06  -0.44
CI       0.06  -0.23   0.14   0.07  -0.13  -0.38   0.06   1.00   0.21
CO      -0.42  -0.34  -0.48  -0.69  -0.87  -0.57  -0.44   0.21   1.00
Table 4. The Kendall rank cross-correlation coefficients of the 90th percentiles of the eight considered metrics, computed on the 30 weekly snapshots of the FLOSS-AR system. LCOM has been dropped because it is constant over all snapshots.

Metric   LOC    RFC    FI     REI    REO    WC     CI     CO
LOC      1.00   0.30   0.27   0.67   0.71   0.56  -0.05  -0.57
RFC      0.30   1.00   0.37   0.08   0.06   0.55   0.51  -0.04
FI       0.27   0.37   1.00   0.35   0.27   0.61   0.65  -0.38
REI      0.67   0.08   0.35   1.00   0.84   0.39  -0.02  -0.90
REO      0.71   0.06   0.27   0.84   1.00   0.38  -0.10  -0.82
WC       0.56   0.55   0.61   0.39   0.38   1.00   0.51  -0.40
CI      -0.05   0.51   0.65  -0.02  -0.10   0.51   1.00   0.00
CO      -0.57  -0.04  -0.38  -0.90  -0.82  -0.40   0.00   1.00
If the slopes of the regression between two correlated quantities, computed across the classes of the same snapshot, vary across different snapshots, the resulting correlation of means or 90th percentiles can be very different from the correlations obtained for a single snapshot.
5.3. Discriminating amongst development phases using aggregate metrics
As reported in Sec. 4, the development of the system evolved through four distinct phases. We know that what differentiates the various phases is the level of adoption of the agile practices, namely PP, TDD and refactoring. We also know that these agile practices were applied, or not applied, together; consequently, it is not possible to discriminate among them using the data reported for this case study. So, we talk of "key agile practices", considering them as applied together. In this subsection we show and discuss how aggregate statistics of OO and network metrics exhibit specific patterns of evolution as system development proceeds. In Fig. 2 we show the behavior of the mean values of the three metrics that seem to discriminate better than the others among development phases: Fan-In, Closeness-In and Closeness-Out. All the values are normalized to the maximum value reached by the metric. FI and CO look the best for discriminating between Phases 1 and 2, while CI is the best for discriminating between Phases 2 and 3. Phases 3 and 4 are less well discriminated, but this is reasonable, because Phase 3 is a refactoring phase, and Phase 4 is a subsequent development phase that continues on the same path, without aggressive refactoring.
In Fig. 3 we show the behavior of the mean values of other metrics which are still "good" at discriminating among phases. They are LCOM, RFC, REI and REO.
Fig. 2. The evolution of the mean value of FI, CI and CO metrics.
In particular, LCOM exhibits a strong growth in Phase 2, when good OO and agile practices were abandoned, which is only partially corrected in Phases 3 and 4.
Figure 4 shows the behavior of the 90th percentiles of the FI, WC and CI metrics, the best at discriminating between phases. Note the different behavior with respect to the means reported in Figs. 2 and 3. For the sake of brevity, we do not report the behavior of the other metrics, because they look less significant than the reported ones.
The evolution of most aggregate statistics of the studied metrics along the process phases shows significantly different values and trends that depend on the specific phase, as shown in Figs. 2-4. Our hypothesis is that this variability is due to the different level of adoption of the key agile practices. In fact, to our knowledge, the only external factors that might have had an impact on the project are precisely the differences among the phases, as reported in Sec. 4. Regarding internal factors, the only relevant factor at play was team experience, regarding both the application of the agile practices and knowledge of the system itself. The project duration was relatively short, so we estimate that the latter factor significantly affected only Phase 1.
We performed a Kolmogorov-Smirnov (KS) two-sample test to assess whether those measurements significantly differed from one phase to the next. The KS test determines whether two datasets belong to different distributions, making no assumption on the distribution of the data.^b For each computed metric, we compared the measurements
Fig. 3. The evolution of the mean value of LCOM, RFC, REI and REO metrics.
^b Since the metrics computed at a given weekly snapshot depend also on the state of the system in the previous snapshot, the assumption underlying the KS test that the samples are random and mutually independent can be challenged. However, we used the KS test to assess the difference between measurements in different phases as if they were independent sets of points, and we believe that to a first approximation the KS test result is still valid.
belonging to any pair of phases; we were of course most interested in the ability to
discriminate between subsequent phases.
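A minimal sketch of such a comparison with SciPy's two-sample KS test follows; the weekly values are invented for illustration, and reading the reported confidence level as 100*(1 - p) is our assumption about how the figures in Tables 5 and 6 were obtained:

    from scipy.stats import ks_2samp

    # Hypothetical weekly means of one metric in two adjacent phases.
    phase2_means = [4.1, 4.3, 4.6, 4.8, 5.0, 5.1, 5.2]   # one value per week of Phase 2
    phase3_means = [4.9, 4.5, 4.2, 4.0]                   # one value per week of Phase 3

    stat, p_value = ks_2samp(phase2_means, phase3_means)
    confidence = 100 * (1 - p_value)
    print(f"KS statistic = {stat:.2f}, confidence = {confidence:.1f}%")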
The results are shown in Tables 5 and 6 for the means and 90th percentiles, respectively. The cases with a significance level greater than 99% are marked with an asterisk.
Regarding the metric means (Table 5), Phase 1 metrics differ very significantly from every other phase in all cases except for LCOM between Phases 1 and 2, and even there the significance is higher than 90%. Phase 2 is less clearly differentiated from Phases 3 and 4. The REO and CI metrics appear to discriminate best, with a KS significance greater than 98%, with RFC and FI following suit at 95%. Phases 3 and 4 can be discriminated effectively by the FI, REI and CO metric means.
The 90th percentiles are slightly less able to discriminate among phases. Phase 1 is still well differentiated from the other phases, and especially from Phase 2, except for the RFC
Fig. 4. The evolution of the 90th percentile of FI, WC and CI metrics.
Table 5. Confidence level that the mean of the metrics taken in a pair of phases significantly differs, according to the K-S two-sample test. Cases whose significance is above 99% are marked with an asterisk.

Metric  Phases 1-2  Phases 1-3  Phases 1-4  Phases 2-3  Phases 2-4  Phases 3-4
LOC     99.990*     99.340*     99.985*     85.113      96.402      64.019
LCO     91.310      99.340*     99.985*     62.319      84.722      64.019
RFC     99.990*     99.340*     99.985*     95.250      99.386*     64.019
FI      99.990*     99.340*     99.985*     95.250      88.606      99.213*
REI     99.990*     99.340*     99.985*     62.319      82.414      99.213*
REO     99.990*     99.340*     99.985*     98.770      99.924*     35.531
WC      99.990*     99.800*     99.998*     33.939      56.976      59.580
CI      99.825*     99.340*     99.985*     98.770      99.924*     91.128
CO      99.825*     99.340*     99.985*     62.319      82.414      99.213*
metric. CI is able to discriminate Phase 1 from Phase 2 very well, but totally fails to discriminate Phases 3 and 4 from Phase 1. It looks like a very powerful indicator of Phase 2, when good agile practices were dropped by the developers. Phase 2 is discriminated from Phase 3 by the LOC, FI and REO metrics at almost the 99% significance level. The same metrics are able to discriminate Phase 2 from Phase 4 at an even higher level. Finally, Phases 3 and 4 are well discriminated by the REI and CO metrics, confirming the results of the means. On the contrary, FI, which was a good discriminator in the case of the mean, is totally unable to discriminate between Phases 3 and 4 when its 90th percentile is used.
These results in fact confirm the differences in trends and values of the various metrics in the various phases that are evident in Figs. 2-4.
5.4. Aggregate metrics behavior across development phases
During the development of the FLOSS-AR system, Phase 1 is characterized by a steady growth of the number of classes. All metrics but LCOM and CO are stable during the first five weeks of this phase; then, their means tend to grow, in particular for FI, REI, REO, WC and, to a lesser extent, RFC and LOC. The means of LCOM and CO, on the contrary, tend to increase during the first few weeks and then stabilize. The 90th percentiles of the metrics tend to be quite constant during Phase 1, except in the case of CO. This means that no significant addition to the tails of the distributions (classes with extreme values of the metrics) was made. Regarding the large variation of the CO 90th percentile, recall that CO for a class is related to the number of steps needed to reach all the other (reachable) classes, following edges along their direction. The lower the average number of these steps for a class, the higher its CO value. The large variations might be explained by the addition, or deletion, of links in such a way that some classes substantially increased or decreased their closeness to other classes in the system, a phenomenon clearly possible in a small-world network such as a software network.
The starting values of all these metrics are those of the original jAPS framework, constituted by 367 classes and evaluated by code inspection as a project with a fairly good OO architecture.
Table 6. Confidence level that the 90th percentile of the metrics taken in a pair of phases significantly differs, according to the K-S two-sample test. Cases whose significance is above 99% are marked with an asterisk.

Metric  Phases 1-2  Phases 1-3  Phases 1-4  Phases 2-3  Phases 2-4  Phases 3-4
LOC     99.947*     99.340*     99.985*     98.770      99.924*     64.019
RFC     86.416      91.964       0.000       0.000      84.722      91.128
FI      99.947*     91.964      99.985*     98.770      99.924*      0.483
REI     99.947*     99.340*     99.985*     95.250      99.924*     99.213*
REO     99.746*     99.340*     99.985*     98.770      99.924*     50.702
WC      99.529*     99.340*     97.032       0.000       0.118       8.198
CI      99.529*      0.000       0.000      95.250      99.386*      0.000
CO      99.529*     99.340*     99.985*     45.215      99.924*     99.213*
The increase of the RFC and FI means (and recall that FI is closely related to CBO) denotes a worsening of software quality. Note that Phase 1 is characterized by a rigorous adoption of the agile practices, but we should consider two factors:
(1) The knowledge of the original framework was initially quite low, so the first additions of new classes to it in the initial phase had a sub-optimal structure, and it took time to evolve towards an optimal configuration;
(2) Some agile practices require time to be mastered, and our developers were junior.
In general, we might conclude that in Phase 1 the team steadily added new features, and consequently new classes, to the system. In the first half of the phase, however, these classes substantially kept the structure of the original system they were added to. As the system grew, this structure was slowly impaired, due to the factors mentioned above.
Phase 2 is characterized by a strong push for releasing new functionalities and by giving up the use of pair programming, testing and refactoring. In this phase we observe a growth in all metric means but CO, and particularly in the metrics related to coupling and complexity, with an explosive growth of LCOM. This seems to confirm that in Phase 2 quality was compromised in order to add several new features. The 90th percentiles substantially confirm the behavior of the corresponding means. It is worth noting that the 90th percentiles of several metrics exhibit an even steeper change passing from Phase 1 to Phase 2. This happens for FI, LOC, WC, CI and CO, the latter with a steep decrease in value.
Phase 2 is followed by Phase 3, a phase in which the team, adopting a rigorous pair programming rotation strategy together with testing and refactoring, was able to refactor the system, increasing its cohesion and decreasing coupling, and thus reducing the values of several metrics known to be anti-correlated with quality, such as LCOM, RFC, FI and LOC. In this phase, no new features were added to the system. The number of classes increased during this phase, because refactoring required splitting classes that had grown too much, and refactoring hierarchies, adding abstract classes and interfaces. The transition from Phase 2 to Phase 3 is marked by a significant decrease of Fan-In and CI, evident in both the mean and 90th percentile behavior. After this decrease, the FI and CI means tend to increase again at the end of Phase 3. The CO mean has a trend opposite to CI, as also happens in Phase 2 (but not in Phases 1 and 4). REO has a behavior similar to CO, while RFC and LCOM were reduced, mainly at the end of the phase. There is also a slight decrease of LOC (not shown), mainly due to the addition of abstract classes to the hierarchies, which factor out common features and reduce the code of many classes. Note that the values of the metrics at the end of Phase 3 seem to reach an equilibrium.
Phase 4 is the last development phase. It is characterized by the adoption of all key agile practices, and by the creation of other classes associated with new features. In this phase most metrics do not change significantly, although, in the end, the
values of most of them are slightly lower than at the beginning of the phase ���maybe because the team became more e®ective in the adoption of the agile practices
compared to the initial Phase 1. Only REO tends to grow in the end of the whole
development.
Table 7 summarizes these observations, highlighting which metrics look best
suited to discriminate between the various phases.
In conclusion, Fan-In appears to be the only metric able to discriminate fairly well
between all the various phases, especially considering its mean. Other good
discriminators are CI (especially its 90th percentile) for the first phases, and the REI
and CO means for the last phases.
For this case study, a combination of the FI mean, the CI 90th percentile and the
REI mean would be able to discriminate among the various phases fairly well.
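To make the use of these summary statistics concrete, the following sketch shows one way the per-snapshot mean and 90th percentile of a metric could be computed and tracked across snapshots. It is a minimal illustration written in Python: the data layout, variable names and the 30% jump threshold are illustrative assumptions of ours, not values taken from the study.

```python
import numpy as np

def snapshot_stats(per_class_values):
    """Mean and 90th percentile of one metric over all classes in a snapshot."""
    v = np.asarray(per_class_values, dtype=float)
    return {"mean": float(v.mean()), "p90": float(np.percentile(v, 90))}

def flag_jumps(series, rel_threshold=0.3):
    """Flag snapshot-to-snapshot relative changes larger than the threshold.

    `series` is a list of statistic values, one per snapshot, in time order.
    The 30% threshold is purely illustrative, not a value from the paper.
    """
    flags = []
    for i in range(1, len(series)):
        prev, curr = series[i - 1], series[i]
        if prev != 0 and abs(curr - prev) / abs(prev) > rel_threshold:
            flags.append(i)
    return flags

# Hypothetical usage: fan_in_by_snapshot would be a list of lists, one list of
# per-class Fan-In values for each of the 30 snapshots.
# fi_means = [snapshot_stats(s)["mean"] for s in fan_in_by_snapshot]
# print(flag_jumps(fi_means))
```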
6. Threats to Validity
The presented work is based on a single, empirical case study. This fact yields several
obvious threats to its validity that we discuss in this section.
The first issue is that what we presented is just one anecdotal case study, since we
were not able to find other case studies with a comparable amount of source code data
and, above all, with information about the variations of agile practices adopted throughout the
development. From a single case study, it is clearly impossible to safely generalize to
other cases. We believe, however, that the case study is of great anecdotal interest,
and might be used by practitioners as a starting point to analyze the relationships
between software metric trends and practices used to improve software quality.
Table 7. The metrics and statistics best suited to discriminate between the various phases.

Phases | Metric (Statistic) | Discussion
1 → 2 | CI (90th perc.) | A steep increase of the 90th perc. of CI looks like a very good marker of a phase where "good" agile practices were abandoned.
1 → 2 | FI (mean) | An increase of the FI mean is also a good discriminator of Phase 2.
1 → 2 | LCOM (mean) | The LCOM mean starts low, but then increases very significantly during the middle of Phase 2.
1 → 2 | CO (mean) | The CO mean significantly decreases during Phase 2.
2 → 3 | CI (mean & 90th perc.) | When agile practices are resumed, we found an immediate, steep decline of CI (both mean and 90th perc.) that persisted in Phase 3.
2 → 3 | FI (mean & 90th perc.) | FI is confirmed to be another good marker able to discriminate between Phases 2 and 3, though to a lesser extent than CI.
2 → 3 | REO (mean) | The REO mean in Phase 3 is consistently and significantly greater than in the previous phase.
3 → 4 | REI (mean) | The REI mean steadily increases at the end of Phase 4, showing a good discrimination ability with respect to Phase 3.
3 → 4 | CO (mean) | The CO mean decreases at the end of Phase 3, and continues to decrease in Phase 4, showing a fair discrimination ability.
3 → 4 | FI (mean) | The FI mean increases at the end of Phase 3, and then remains almost constant in Phase 4, showing a mild discrimination ability.
Related to this issue, the information about the adoption of agile practices, used to
identify the phases of the project, comes from a survey among the developers. The
details of the actual adoption of agile practices (kinds of refactorings applied, exact
percentage of time spent in pair programming, etc.) were not explicitly recorded during the project. This
vagueness is another threat to the validity of the results.
Another threat to the validity of the presented results is that we studied a small-
to medium-sized project, whose results might be difficult to generalize to larger, more
critical projects. This issue is related to the previous one. However, modern development
processes tend to split large projects into a set of loosely coupled, smaller
developments, whose magnitude is not so different from the presented one. When this is
the case, this objection should no longer hold.
Another threat concerns the specific OO programming language (Java) and the programming
environment (Eclipse) used to develop the system. Again, the generalization to
other languages and programming styles is not guaranteed. We can observe that on one
hand we are interested in OO metrics and in software graphs built from an OO
architecture. The OO paradigm is currently the most used programming paradigm,
and we believe that focusing on it is not really limiting. On the other hand, many
popular OO languages, especially C++ and C#, are very similar to Java. In a
previous study, the distributions and correlations of CK metrics in 100 Java and
100 C++ projects were found fairly similar [33]. So, we believe that the presented
results can generalize to them. For other OO languages, like Python and
Ruby, this might not be true because the programming styles are very different
from Java.
The last threat, and perhaps the biggest, is that at least some of the findings
might have been obtained just by chance. The number of samples used in the
statistical analysis, one for each snapshot, is 30 per metric/statistic. The sample groups
pertaining to the four phases used to discriminate between metrics contain between 4
and 10 values. These numbers, compared to the total number of metrics and sta-
tistics tested to discriminate among phases (16 original metrics, and 4 statistics for
each of them), are small. So, the discrimination ability of some metric/statistic might
be due to statistical variations, and not be significant at all. In order to answer this
objection, we can highlight that:
(1) Regarding the statistics, we immediately found that the median and the standard
deviation were not able to discriminate anything, and dropped them. So, we were
left with two statistics. We chose 90 as the percentile value because it is an optimal
choice, not too close to the median, nor too close to the extreme portion of the tail.
(2) We did not use seven of the original metrics (DIT, NOC, WMC, etc.), following
information found in the literature about their inability to discriminate software
quality, and because of their strong correlations with other metrics. Having dropped
them is not a "selection of the fittest" in a statistical sense, but bears a specific
meaning.
(3) As shown in Tables 5 and 6, most of the remaining metrics actually used are able
to strongly discriminate among various pairs of phases. Only the pairs including
Phases 2 and 3, and Phases 3 and 4 are discriminated by just a few metrics/
statistics. This makes it very unlikely that this discrimination ability is due to
chance.
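As an illustration of how the discrimination ability of a single metric/statistic between two phases could be checked, the sketch below applies a non-parametric Mann-Whitney U test to the per-snapshot values of two phase groups. The choice of test and the sample values are purely illustrative assumptions; this is not necessarily the statistical procedure behind Tables 5 and 6.

```python
from scipy.stats import mannwhitneyu

def discriminates(phase_a, phase_b, alpha=0.05):
    """Check whether a metric statistic separates two phases.

    phase_a, phase_b: per-snapshot values of one statistic (e.g. the FI mean)
    for the snapshots belonging to each phase (4 to 10 values per group).
    Returns the p-value and a boolean flag.
    """
    _, p_value = mannwhitneyu(phase_a, phase_b, alternative="two-sided")
    return p_value, p_value < alpha

# Hypothetical per-snapshot FI means for Phases 1 and 2 (not data from the paper).
phase1_fi = [2.1, 2.3, 2.2, 2.4, 2.3]
phase2_fi = [3.0, 3.4, 3.6, 3.8]
print(discriminates(phase1_fi, phase2_fi))
```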
7. Conclusions
We presented a case study related to agile software development of a medium-sized
project using the Java OO programming language, matching the different use of key agile practices in the four
phases of the project with OO and graph-related metrics.
In Phase 1 we observed a deterioration of the quality metrics, which significantly
worsened during Phase 2; Phase 3 led to a signi¯cant improvement in quality, and
Phase 4 kept this improvement. The only external factors that changed during the
phases were the adoption of the pair programming, TDD and refactoring agile practices,
which were abandoned during Phase 2 and used again at their full power during
Phase 3, aiming to improve the system quality without adding new features, and
then in Phase 4. As regards internal factors, in Phase 1 the team was clearly less
skilled in the use of agile practices and in the knowledge of the original framework
than in subsequent phases.
We studied the aggregate variation of several source code metrics, specific to OO
systems and to the oriented software graph built from the OO software structure.
We found that an appropriate combination of a few metrics, namely the average
Fan-In, the 90th percentile of Closeness-In, and the average of Reach-Efficiency-In,
is able to discriminate among the various phases, and hence among the
development practices used to code the system. The adoption of "good" agile practices
is always associated with "better" values of these metrics: when pair programming,
TDD and refactoring are used, the quality metrics improve; when these
practices are discontinued, the metrics worsen significantly. We validated the use-
fulness of software metrics in monitoring the quality of the ongoing development, for
the empirical case study analyzed. This might be useful for software practitioners.
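For readers who want to compute metrics of this kind, the sketch below shows one possible way to obtain class-level Fan-In, Closeness-In and an in-reach measure from a class dependency graph using the networkx library. The graph, the class names and the normalization of the reach measure are illustrative assumptions; the exact definitions of Closeness-In and Reach-Efficiency-In used in this study are those given earlier in the paper and may differ from this sketch.

```python
import networkx as nx

# Hypothetical class dependency graph: an edge A -> B means class A depends on B.
g = nx.DiGraph()
g.add_edges_from([
    ("OrderController", "OrderService"),
    ("OrderService", "OrderRepository"),
    ("ReportJob", "OrderService"),
    ("OrderController", "OrderRepository"),
])

fan_in = dict(g.in_degree())               # number of classes depending on each class
closeness_in = nx.closeness_centrality(g)  # networkx uses incoming distances on digraphs

def reach_efficiency_in(graph, node):
    """Illustrative proxy: fraction of the other classes that can reach `node`."""
    reachable_from = nx.ancestors(graph, node)
    return len(reachable_from) / (graph.number_of_nodes() - 1)

for cls in g.nodes:
    print(cls, fan_in[cls], round(closeness_in[cls], 2),
          round(reach_efficiency_in(g, cls), 2))
```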
Clearly, it is not possible to draw definitive conclusions by observing a single,
medium-sized project. Unfortunately, it is not easy to find other case studies, because
they must include not only tracking of the source code produced during development,
a task easily accomplished with modern configuration management systems,
but also an accurate tracking of the development practices, and of possible
other external and internal factors, used throughout the project. We hope that this
paper might spur similar studies by researchers with access to proper data, able to
confirm or to disprove our findings.
Acknowledgments
This work was partially funded by Regione Autonoma della Sardegna (RAS),
Regional Law No. 7, 2007 on Promoting Scientific Research and Technological
Innovation in Sardinia, call 14/2/2009, and RAS Integrated Facilitation Program
(PIA) for Industry, Artisanship and Services, call 14/10/2008, project No. 265,
Advanced Technologies for Software Measuring and Integrated Management,
TAMIGIS.
References
1. Agile Manifesto, URL: www.agilemanifesto.org.
2. A. J. Albrecht, Measuring application development productivity, in Proc. of IBM Application Development Symposium, Monterey, CA, October 1979, pp. 83-92.
3. N. Anquetil and J. Laval, Legacy Software Restructuring: Analyzing a Concrete Case, in Proc. of the 15th European Conference on Software Maintenance and Reengineering (CSMR'11), Oldenburg, Germany, 2011.
4. V. R. Basili, L. C. Briand and W. L. Melo, A validation of object oriented design metrics as quality indicators, IEEE Trans. Software Eng. 22 (1996) 751-761.
5. K. Beck and C. Andres, Extreme Programming Explained: Embrace Change, Second Edition (Addison-Wesley, 2004).
6. B. Boehm and R. Turner, Balancing Agility and Discipline (Addison-Wesley Professional, 2003).
7. G. Canfora, A. Cimitile, F. Garcia, M. Piattini and C. A. Visaggio, Evaluating advantages of test driven development: A controlled experiment with professionals, in Proc. Int. Symposium on Empirical Software Engineering (ISESE'06), Rio de Janeiro, Brazil, 21-22 September 2006, pp. 364-371.
8. S. Chidamber and C. Kemerer, A metrics suite for object-oriented design, IEEE Trans. Software Eng. 20 (1994) 476-493.
9. S. Chidamber and C. Kemerer, Managerial use of metrics for object oriented software: An exploratory analysis, IEEE Trans. Software Eng. 24 (1998) 629-639.
10. G. Concas, M. Marchesi, S. Pinna and N. Serra, Power-laws in a large object-oriented software system, IEEE Trans. Software Eng. 33 (2007) 687-708.
11. G. Concas, M. Di Francesco, M. Marchesi, R. Quaresima and S. Pinna, Study of the evolution of an agile project featuring a web application using software metrics, in Proc. 9th Int. Conf. on Product Focused Software Process Improvement (PROFES'08), Frascati, Italy, 23-25 June 2008.
12. G. Concas, M. Marchesi, A. Murgia, S. Pinna and R. Tonelli, Assessing traditional and new metrics for object-oriented systems, in Proc. of the Workshop on Emerging Trends in Software Metrics (ICSE'10), Cape Town, South Africa, May 2010.
13. G. Concas, M. Marchesi, A. Murgia and R. Tonelli, An empirical study of social networks metrics in object-oriented software, Advances in Software Engineering, Vol. 2010, 2010.
14. T. Dybå and T. Dingsøyr, Empirical studies of agile software development: A systematic review, Information and Software Technology 50 (2008).
15. M. Fowler, Refactoring: Improving the Design of Existing Code (Addison-Wesley, 1999).
16. M. Giblin, P. Brennan and C. Exton, Introducing agile methods in a large software development team: The impact on the code, in Proc. 11th Int. Conf. on Agile Processes in Software Engineering and Extreme Programming (XP2010), Trondheim, Norway, June 2010, pp. 58-72.
17. T. Gyimothy, R. Ferenc and I. Siket, Empirical validation of object-oriented metrics on open source software for fault prediction, IEEE Trans. Software Eng. 31 (2005) 897-910.
18. JAPS: Java agile portal system, URL: http://www.japsportal.org.
19. D. S. Janzen and H. Saiedian, Does test-driven development really improve software design quality?, IEEE Software, March/April 2008, pp. 77-84.
20. M. Kunz, R. R. Dumke and A. Schmietendorf, How to measure agile software development, in Proc. Int. Conf. on Software Process and Product Measurement (IWSM-Mensura 2007), Palma de Mallorca, Spain, November 5-8, 2007, pp. 95-101.
21. L. Layman, L. Williams and L. Cunningham, Exploring extreme programming in context: An industrial case study, in Proc. of the Agile Development Conference (ADC'04), Salt Lake City, Utah, June 2004, pp. 32-41.
22. W. Li and S. Henry, Object oriented metrics that predict maintainability, J. Systems and Software 23 (1993) 111-122.
23. P. Louridas, D. Spinellis and V. Vlachos, Power laws in software, ACM Trans. Software Eng. and Methodology 18(1) (2008).
24. F. Macias, M. Holcombe and M. Gheorghe, A formal experiment comparing extreme programming with traditional software construction, in Proc. of the Fourth Mexican International Conference on Computer Science (ENC 2003), Tlaxcala, Mexico, September 2003.
25. T. J. McCabe, A complexity measure, IEEE Trans. Software Eng. 2 (1976) 308-320.
26. M. Melis, I. Turnu, A. Cau and G. Concas, Evaluating the impact of test-first programming and pair programming through software process simulation, Software Process Improvement and Practice 11 (2006) 345-360.
27. N. Nagappan, E. M. Maximilien, T. Bhat and L. Williams, Realizing quality improvement through test driven development: Results and experiences of four industrial teams, Empirical Software Engineering 13 (2008) 289-302.
28. M. E. J. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167-256.
29. A. V. Prokhorov, Kendall coefficient of rank correlation, in Encyclopaedia of Mathematics, ed. M. Hazewinkel (Springer Verlag, Heidelberg, 2001).
30. J. Scott, Social Network Analysis: A Handbook (SAGE Publications, London, UK, 2000).
31. M. Siniaalto and P. Abrahamsson, Does test-driven development improve the program code? Alarming results from a comparative case study, in Balancing Agility and Formalism in Software Engineering, B. Meyer, J. R. Nawrocky and B. Walter, eds., Lecture Notes in Computer Science, Vol. 5802 (Springer, 2008), pp. 143-156.
32. R. Subramanyam and M. S. Krishnan, Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects, IEEE Trans. Software Eng. 33 (2007) 687-708.
33. G. Succi, W. Pedrycz, S. Djokic, P. Zuliani and B. Russo, An empirical exploration of the distributions of the Chidamber and Kemerer object-oriented metrics suite, Empirical Software Engineering 10 (2005) 81-103.
34. C. A. Wellington, T. Briggs and C. D. Girard, Comparison of student experiences with plan-driven and agile methodologies, in Proc. of the 35th ASEE/IEEE Frontiers in Education Conference, Indianapolis, Indiana, 19-21 October 2005.
35. T. Zimmermann and N. Nagappan, Predicting defects using network analysis on dependency graphs, in Proc. 30th Int. Conf. on Software Engineering (ICSE'08), Leipzig, Germany, 10-18 May 2008, pp. 531-540.