AN EMPIRICAL STUDY OF SOFTWARE METRICS FOR
ASSESSING THE PHASES OF AN AGILE PROJECT
GIULIO CONCAS*, MICHELE MARCHESI†, GIUSEPPE DESTEFANIS‡
and ROBERTO TONELLI§
Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, Cagliari, 09123, Italy
*[email protected] †[email protected]
‡[email protected] §[email protected]
http://www.diee.unica.it
Received 13 September 2011
Revised 7 November 2011
Accepted 25 January 2012
We present an analysis of the evolution of a Web application project developed with object-oriented technology and an agile process. During the development we systematically performed measurements on the source code, using software metrics that have been proved to be correlated with software quality, such as the Chidamber and Kemerer suite and Lines of Code metrics. We also computed metrics derived from the class dependency graph, including metrics derived from Social Network Analysis. The application development evolved through phases, characterized by a different level of adoption of some key agile practices, namely pair programming, test-based development and refactoring. The evolution of the metrics of the system, and their behavior related to the agile practices adoption level, is presented and discussed. We show that, in the reported case study, a few metrics are enough to characterize with high significance the various phases of the project. Consequently, software quality, as measured using these metrics, seems directly related to agile practices adoption.
Keywords: Software metrics; software evolution; agile methodologies; object-oriented metrics; SNA metrics applied to software.
1. Introduction
Software is an artifact that can be easily measured, being readily available and composed of unambiguous information. In fact, since the inception of software, many kinds of metrics have been proposed to measure software characteristics. The main goal of software metrics is to measure the effort needed to develop the software, or to measure its quality. Effort metrics are relatively simple and well understood. They cover the requirement phase, with metrics such as "Function Points" [2] and the like,
up to the design and coding phases, with metrics starting from the simple "Lines of Code" (LOC), up to more complex metrics like Cyclomatic Complexity [25]. While the effectiveness of effort metrics in predicting and measuring the actual costs of software development is still debated, in this paper we will not focus on this kind of metrics, but only on quality metrics.
Software quality metrics aim to measure how "good" a piece of software is, especially from the point of view of being error-free and easy to modify and maintain. Software quality metrics tend to measure whether software is well structured, not too simple and not too complex, with cohesive modules that minimize their coupling. Many quality metrics have been proposed for software, depending also on the paradigm and languages used: there are metrics for structured programming, object-oriented programming, aspect-oriented programming, and so on. In this paper, we will focus on object-oriented (OO) metrics, since the OO paradigm is nowadays by far the most popular among developers.^a
In dealing with software metrics, however, the main point is not to come up with new, sensible metrics able to measure software, but to empirically demonstrate their usefulness in practice. Empirical proofs of the value of metrics for assessing software quality are mainly based on finding correlations between specific metrics and the fault-proneness of software modules, that is, the number of faults that were found and fixed. Unfortunately, considering software quality as simply inversely related to the number of faults has its drawbacks. The first one is that the relationship between a fault and a software module is typically declared when the module is modified to fix the fault. However, a module is often modified as a consequence of an error located elsewhere, not because the module itself is wrong. Moreover, simply relating quality and (absence of) faults does not account for other characteristics that are very important in software development, such as ease of maintenance, but that are much more difficult to relate to software metrics.
In this work we present the possible use of OO metrics to indirectly assess the quality of the developed software, by showing significant changes over time as the development proceeds through different phases. In these phases, various specific "agile" development practices were used, or their use was discontinued. In this context, we assess the ability of some metrics to discriminate among the phases of the project, and therefore the usage of specific practices. We present results on an industrial case study, and discuss their implications and relationships with previous research. We understand that the presented evidence is anecdotal, but with real software projects it is very difficult to plan multi-project research of this kind. This is because
^a The relative diffusion of programming languages is continuously monitored by some Web sites. Among them, lang-index.sourceforge.net monitors the usage of languages in Sourceforge Open Source projects. Here, in November 2011, the share of OO languages was greater than 55%. Tiobe's monthly Programming Community Index (www.tiobe.com/index.php/content/paperinfo/tpci), published since 2001, shows the top 50 languages' ratings based on searching the Web with certain phrases that include language names and counting the numbers of hits returned. Here the combined rating of OO languages in November 2011 was 55.3%.
software houses tend to be very secretive about their projects. We hope that other
researchers will try to replicate the presented results on similar projects whose data
they can access.
The target of our research is the evolution of a software project consisting of the implementation of FLOSS-AR, a program to manage the Register of Research of universities and research institutes. FLOSS-AR was developed with a full object-oriented (OO) approach and released under the GPL v.2 open source license. It is a Web application, which has been implemented through a specialization of an open source software project, jAPS (Java Agile Portal System) [18], a Java framework for Web portal creation. Throughout the project we collected metrics about the software product under development. We used the Chidamber and Kemerer (CK) OO metrics suite [8], as well as complexity metrics computed from the class dependency graph [10]. The project was developed following an agile process [5, 6] with various adoption levels of some key agile practices, namely Pair Programming (PP), Test-Driven Development (TDD) and refactoring [5], that were recorded during the project.
We show how some metrics computed on the developed code seem to have the capability to discriminate, in a statistically significant way, among the various phases of the project, which in turn are characterized by the adoption, or non-adoption, of the above mentioned agile practices (PP, TDD, refactoring). In this way, the quality of an ongoing project might be controlled using these metrics.
This paper is organized as follows: in Sec. 2 we present the CK, graph-theoretical and SNA metrics computed on the software; in Sec. 3 we discuss prior literature on software metrics; in Sec. 4 we present the phases of the development; in Sec. 5 we present and discuss the results, relating software quality, as resulting from the metric measurements, with the adoption of agile practices; Sec. 6 deals with the threats to the validity of the paper, which is concluded in Sec. 7.
2. Software Metrics
In this section we briefly introduce all the metrics studied in our work, used as a starting point to choose the metrics subset best suited to discriminate between the various project phases. For a more detailed description, references with their definition and possible uses are given. The metrics we computed throughout the project are the OO metrics suite given by Chidamber and Kemerer [8], graph-theoretical metrics, and Social Network Analysis (SNA) metrics.
The Chidamber and Kemerer (CK) metrics suite is perhaps the most studied
among OO metrics suites, and its relationship with software fault-proneness has
already been validated by many researchers. The CK metrics are: Number Of
Children (NOC) and Depth of Inheritance Tree (DIT), related to inheritance;
Weighted Methods per Class (WMC) and Lack of Cohesion in Methods (LCOM),
pertaining to the internal class structure; Coupling Between Objects (CBO) and
Response For a Class (RFC), that are related to relationships among classes. Several
papers related CK metrics to software quality, not always agreeing on which metrics
are the most correlated with lack of faults and ease of maintenance; see Sec. 3 for a
survey of the related literature.
As presented and discussed in the next section, among the CK metrics, WMC and CBO are those that have been found to be most correlated with software quality. RFC and LCOM were sometimes, but not always, proved to be correlated with fault proneness or with the maintenance effort related to a class. DIT was sometimes found correlated, but was also often found not correlated, or exhibiting too low value variations. NOC is the CK metric least related to software quality. In general, the lower the value of the CK metrics, the better the quality of the system.
Note that a recent work on the evolution of the Eclipse Java system shows that the cohesion/coupling metrics do not behave as expected in some cases [3]. For instance, in the cited paper, cohesion metrics were found to decrease after restructurings that should have increased cohesion, and similar results were found regarding coupling. However, the work in [3] studies coupling and cohesion at the package and plugin level, while all our analysis is made at the class level.
The second kind of metrics we analyzed are derived from network theory applied to the software graph. In fact, it is possible to build a directed graph, called the class graph, from the source code of an OO system, the nodes of the graph being the classes (or the interfaces), and the graph edges being the dependencies between classes. In this graph, we can define the Fan-In (or in-degree) of a class as the number of edges directed toward the class; the in-degree is a measure of how much the class is used by other classes in the system. The Fan-Out (or out-degree) of a class is the number of edges directed from the class; it counts how many other classes of the system are used by the class. Fan-In and Fan-Out measure the number of different classes using, or used by, the target class. These metrics can also be weighted by the number of times another class uses, or is used by, the target class, thus yielding weighted Fan-In/Fan-Out. As an example, if class A uses class B three times (for instance, defining an instance variable of type B and two local variables of type B in two methods), A's Fan-Out is increased by one, while its weighted Fan-Out is increased by three. Fan-In and Fan-Out, weighted or not, are the graph-theoretical metrics we considered. They are related to complex network theory, because it is well known that in complex networks their distribution is fat-tailed, and often a power-law [28].
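As an illustration, the following minimal Python sketch (with hypothetical class names and dependency counts, not taken from FLOSS-AR) shows how Fan-In, Fan-Out and their weighted variants can be computed from a list of class dependencies, assuming each pair of classes appears at most once in the list:

    from collections import defaultdict

    # Hypothetical dependency list: (using_class, used_class, times_used).
    # An entry ("A", "B", 3) means class A references class B three times.
    dependencies = [
        ("A", "B", 3),   # e.g. one instance variable and two local variables of type B
        ("A", "C", 1),
        ("C", "B", 2),
    ]

    fan_in = defaultdict(int)     # FI:  No. of distinct classes using the key class
    fan_out = defaultdict(int)    # FO:  No. of distinct classes used by the key class
    w_fan_in = defaultdict(int)   # WFI: total No. of references to the key class
    w_fan_out = defaultdict(int)  # WFO: total No. of references made by the key class

    for user, used, count in dependencies:
        fan_out[user] += 1
        fan_in[used] += 1
        w_fan_out[user] += count
        w_fan_in[used] += count

    print(fan_out["A"], w_fan_out["A"])  # 2 4  (A uses two classes, four references in total)
    print(fan_in["B"], w_fan_in["B"])    # 2 5  (B is used by two classes, five references in total)

This mirrors the example in the text: class A using class B three times adds one to A's Fan-Out and three to its weighted Fan-Out.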
We also consider the class LOC metric, that is, the number of lines of code of the class. It is good OO programming practice to create small and cohesive classes, so the class LOC metric should also be kept reasonably low in a "good" system.
Graph-theoretical metrics can be related to the CK metrics pertaining to the relationships among classes. We know that the CK CBO metric, being the count of the number of other classes to which a given class is coupled, denotes class dependency on other classes in the system, and is therefore strictly related to the sum of Fan-In and Fan-Out of a class node in the class graph, because links represent dependencies between classes. Also, the CK RFC metric is computed as the sum of the number of
methods of a class and the number of external methods called by them. This latter quantity is strictly related to the weighted Fan-Out of the class node.
The third group of metrics we used are SNA metrics [30]. These metrics also come from complex network theory. They were introduced for sociological analysis, and have recently been used on software graphs as well. There are several variations of SNA metrics. We decided to restrict the analysis to SNA metrics that account for the directionality of edges, and that can be considered meaningful in a software engineering context. These metrics are: in- and out-Reach Efficiency, in- and out-Two-Step Reach, in- and out-number of weak components, and in- and out-Closeness. These and other metrics are fully explained in [12].
The studied SNA metrics have an interpretation from the OO software development point of view. Recall that the nodes of the network are classes or interfaces, while the directed edges represent a dependency between two classes: the class the edge comes from uses, in some way, the class the edge is directed to. High reach efficiency indicates that the primary contacts of a class are influential in the network. A high REI means that the classes using a given class are in turn used by many other classes. This is a measure of the degree of reuse of a class, not only directly but also in two steps. A high REO means that a class uses other classes, which in turn further use other classes. It is a measure of two-step dependence on the rest of the system. Both these metrics are related to coupling. They should be kept at relatively low values to minimize coupling among classes of the system.
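One possible reading of the Reach Efficiency definitions (see Table 1) is sketched below in Python; the adjacency map and the exact normalization are our assumptions, not code from the measurement tool used in the study:

    def reach_efficiency(adj, node, n_total):
        """Two-step reach efficiency of one class.

        adj maps a class to the set of classes reached in one step in the
        chosen direction: pass out-edges for REO, reversed edges for REI.
        Following our reading of Table 1, the percentage of nodes within
        two steps is divided by the number of nodes within one step.
        """
        one_step = set(adj.get(node, set()))
        two_step = set(one_step)
        for neighbour in one_step:
            two_step |= adj.get(neighbour, set())
        two_step.discard(node)
        if not one_step:
            return 0.0
        return (100.0 * len(two_step) / n_total) / len(one_step)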
Weak Components (WC) is a normalized measure of how many disjoint sets of other classes are coupled to a given class. In general, it is an indirect measure of coupling: the higher the WC, the lower the coupling among the classes coupled to a given class.
Closeness-In (CI) is a measure of how easy it is for a class to be reached, directly or indirectly, by other classes that need its services. Similarly, Closeness-Out (CO) is a measure of how many dependence steps are needed to reach all other (reachable) classes of the system. The two closeness measures are related to the "small-world" property of a software network. For a single class, the hypothesis is that the more central a class is, the more defects it will have. For ensemble measures over the whole system, such as the mean or a percentile of CI or CO, the hypothesis is that a smaller value of centrality denotes a smaller coupling among classes. Note that these measures can vary greatly for entire ensembles of classes if a link is added to a set of classes that were not previously connected, or if such a link is removed.
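The two closeness metrics can be illustrated with a breadth-first search over the class graph; the following is a minimal sketch of the Table 1 definitions, where the helper name and graph representation are ours:

    from collections import deque

    def closeness(adj, node):
        """Closeness of one class: pass out-edges for Closeness-Out and the
        reversed graph for Closeness-In.

        Farness is the sum of shortest-path lengths to all reachable classes
        divided by their number; closeness is its reciprocal (Table 1).
        """
        dist = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for nxt in adj.get(current, ()):
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        lengths = [d for n, d in dist.items() if n != node]
        if not lengths:
            return 0.0
        farness = sum(lengths) / len(lengths)
        return 1.0 / farness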
Table 1 summarizes the metrics we computed for the system under development.
Throughout the project, we computed and analyzed the evolution of a set of source
code metrics including the CK suite of quality metrics, the total number of classes,
the lines of code of classes (LOCs), and the above described metrics derived from the
analysis of the software graph.
All the cited metrics are measurements made on single classes, so there is a value
of the metrics for each class (and interface) of the system. However, we are mainly
interested in measures of the whole system, able to give a synthetic picture of its
quality. For this purpose, we computed statistics of the metric values over all the
classes of the system, during its development, and used some of these statistics as a
measure of the whole system. More about this in the section on results.
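For instance, the aggregation used later in the paper (mean and 90th percentile per weekly snapshot) can be computed along these lines; the sketch assumes NumPy and uses made-up values, not FLOSS-AR data:

    import numpy as np

    def snapshot_stats(per_class_values):
        """Whole-system statistics of one metric for one weekly snapshot."""
        values = np.asarray(per_class_values, dtype=float)
        return {
            "mean": values.mean(),
            "std": values.std(),
            "median": np.median(values),
            "p90": np.percentile(values, 90),   # the tail statistic retained in the analysis
        }

    # Hypothetical RFC values of the classes in one snapshot.
    print(snapshot_stats([3, 7, 12, 5, 48, 9]))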
3. Related Work
Several papers related CK metrics to software quality, not always agreeing on which
metrics are the most correlated with lack of faults and ease of maintenance. In a
study of two commercial systems, Li and Henry studied the link between CK metrics
and maintenance effort [22]. Basili et al. found that many of the CK metrics were associated with the fault-proneness of classes [4]. In another study on three industrial projects, Chidamber et al. reported that WMC, CBO and RFC look highly correlated with each other, and that higher values of the CK coupling and cohesion metrics (CBO and LCOM) were associated with reduced productivity and increased rework/design effort [9]. Subramanyam and Krishnan studied a large system written in C++ and Java, and found a good correlation between the number of defects and WMC, CBO and DIT [32].
Table 1. The metrics used to study the system.

Metric  Type   Description
NOC     CK     Number of Children: No. of immediate subclasses.
DIT     CK     Depth of Inheritance Tree: No. of superclasses, up to the root.
WMC     CK     Weighted Methods per Class: No. of methods of the class (weight = 1).
LCO     CK     Lack of Cohesion in Methods (LCOM): No. of method pairs not sharing any instance variable minus No. of pairs sharing at least one. Zero if negative.
CBO     CK     Coupling Between Objects: No. of other classes that depend on the given class, or on which the given class depends (excluding inheritance).
RFC     CK     Response For a Class: No. of methods plus No. of dependencies on other classes (excluding inheritance).
FI      Graph  Fan-In: No. of other classes that depend on the given class.
WFI     Graph  Weighted Fan-In: No. of times all other classes depend on the given class.
FO      Graph  Fan-Out: No. of other classes on which the given class depends.
WFO     Graph  Weighted Fan-Out: No. of times the given class depends on other classes.
REI     SNA    Reach Efficiency In: Percentage of nodes within two-step distance from a node, following arcs from head to tail, divided by the No. of nodes within one step.
REO     SNA    Reach Efficiency Out: Percentage of nodes within two-step distance from a node, following arcs along their direction, divided by the No. of nodes within one step.
WC      SNA    Weak Components: No. of disjoint sets of nodes within one step from a node, not considering the node itself, divided by the No. of nodes within one step.
CI      SNA    Closeness-In: Reciprocal of Farness-In, defined as the sum of the lengths of all shortest paths from the node to all other nodes, following arcs from head to tail, divided by the No. of reachable nodes.
CO      SNA    Closeness-Out: Reciprocal of Farness-Out, defined as the sum of the lengths of all shortest paths from the node to all other nodes, following arcs along their direction, divided by the No. of reachable nodes.
LOC     Dim.   Lines Of Code: No. of lines of code of the class, excluding comments and blank lines.
Gyimothy et al. systematically studied the open-source Mozilla system, finding that above all CBO, and then RFC, LCOM, WMC and DIT, show a fair correlation with defects [17].
Succi et al. reported a broad empirical exploration of the distributions of CK metrics across several Java and C++ projects, confirming that some metrics are fairly correlated, and that the NOC and DIT metrics generally exhibit a low variance, so they are less suitable for a systematic assessment based on metric computation [33].
Recently, some papers have been published on the use of OO metrics to assess the quality of software developed using agile methodologies. Giblin et al. presented a case study comparing the source code produced using agile methods with the source code produced for a similar type of application by the same team using a more traditional methodology. They made extensive use of specific OO metrics, and concluded that agile methods have guided the developers to produce better code in terms of both quality and maintainability [16]. Kunz et al. presented a methodological work discussing cost estimation approaches for agile software development, and a quality model making use of distinct metrics for quality management in agile software development [20]. Melis et al. used the software process simulation approach to assess the effect of the use of PP and TDD on effort, size, quality and released functionalities [26]. They found that increasing the usage of these practices significantly diminishes product defectiveness, and increases programming effort. Dyba and Dingsøyr reported a systematic review of other empirical studies of agile software development, including in Sec. 4.7 some other empirical evaluations of product quality [14]. These studies include a paper by Layman et al. [21] on an industrial project before and after the adoption of Extreme Programming, reporting a 65% decrease in pre-release defect rate and a 35% decrease in post-release defect rate after XP adoption; a paper by Macias et al. [24] comparing 20 student projects using Waterfall and XP methodologies, reporting no significant differences in external and internal quality factors; and a paper by Wellington et al. [34] comparing the development of 4 systems by 20 student teams using plan-driven and XP methodologies, reporting that XP code shows consistently better quality metrics, among which a decrease of 40% in the average WMC value.
One of the most studied agile development practices in the literature is TDD. Here we report papers studying the influence of TDD on software quality and OO metrics. Canfora et al. [7] studied a set of 28 professional developers, asked to develop a test project. They found that TDD improves unit testing but slows down the overall process. Nagappan et al. [27] studied industrial projects carried out in various contexts, using Java, C++ and .NET. The results indicated that the pre-release defect density decreased by between 40% and 90% compared to similar projects that did not use the TDD practice. The teams experienced a 15-35% increase in initial development time after adopting TDD. Janzen and Saiedian [19] studied various projects, industrial and academic, also analyzing the effect of the use of TDD on the OO metrics computed on the developed software. They found that test-first
programmers consistently produced classes with lower values of the WMC metric; CBO and Fan-Out of the studied classes did not show a significant difference between software developed with or without TDD; the LCOM* metric (a normalized LCOM, constrained to the [0, 1] interval) also showed no significant difference. Siniaalto and Abrahamsson [31] studied 5 small-scale case projects (5-9 KLOC each), mainly performed by students. They found that WMC, CBO, RFC, NOC and LCOM do not significantly differ between software developed with or without TDD; however, they also found significantly lower values of RFC in TDD software, as well as significantly higher values of DIT.
Concas et al. published a paper using the same empirical data as this paper, limiting their study to only the CK metrics and the LOC metrics (class LOC and method LOC), and describing in deeper detail the agile practices used in the project [11]. They found that all the considered metrics but LCOM are able to discriminate very well between the first two phases of the project (the initial "Agile" phase and the "cowboy coding" phase), while only a few metrics maintain the ability to discriminate between subsequent phases, and no metric is able to discriminate between all pairs of consecutive phases at a significance level greater than 95%.
A few papers have been published regarding the relationships of graph-theoretic and SNA metrics with software quality. Among these, Zimmermann and Nagappan [35] computed and studied many SNA metrics, on both the oriented and non-oriented software graph related to the binary modules of the Windows Server 2003 operating system and their dependencies. They found that some SNA metrics could identify 60% of the binaries that the Windows developers considered as critical, twice as many as those identified by complexity metrics (dimension, No. of functions, parameters and globals, Fan-In, Fan-Out).
Concas et al. [12] presented an extensive analysis of software metrics for 111 object-oriented systems written in Java, including SNA metrics, finding systematic non-normal behavior in their distributions, and studying the correlations among metrics. Concas et al. [13] studied the application of CK and SNA metrics to the Eclipse and Netbeans open source systems, and performed an analysis of their correlation with defects found in classes; they found that the metrics most correlated with defects are LOC, RFC and CBO.
4. Project Phases
Besides a first exploratory phase at the beginning of the project, in which the team studied the functionalities of the underlying open source Web portal management system (jAPS) and the way to extend it, without producing code, the project evolved through four main phases, each one characterized by an adoption level of the key agile practices of pair programming, TDD and refactoring. In particular:
- Pair Programming was one of the keys to the success of the project. All the development tasks were assigned to pairs and not to single programmers. Given a
task, each pair decided which part of it to develop together, and which part to develop separately. The integration was typically done working together. Sometimes, the developers paired with external programmers belonging to the jAPS development community, and this helped to quickly grasp the needed knowledge of the framework.
- Regarding TDD, developers had the requirement that all code must have automated unit tests and acceptance tests, and must pass all tests before it can be released. The choice of whether to write tests before or after the code was left to the programmers.
- Refactoring was practiced mainly to eliminate code duplications and improve hierarchies and abstractions. Unfortunately, data on specific refactorings were not recorded. The developers had a fair knowledge of Fowler's book [15], so several refactorings cited there were applied.
A full account of the agile practices used in the project, and preliminary results on the use of CK metrics for discriminating among phases, is reported in [11].
To give empirical evidence for these phases, we asked each of the five members of the development team to define, according to their judgment, the system evolution phases with respect to PP, TDD and refactoring usage, and the dates when these phases started and ended. Four out of five members cited the four phases. Only one proposed three phases, merging phases 3 and 4 into just one phase.
Regarding the dates defining the boundaries between phases, all agreed that week 17 marked the end of Phase 2, obviously related to the date of the public presentation of the system. The end of Phase 1 was attributed to weeks from 8 to 11, with median equal to, and mean close to, 10 weeks. The end of Phase 3 was attributed to weeks 20 and 21, with the majority saying 21. The resulting phases, which we will consider in the remainder of the paper, are summarized below:
- Phase 1 (Initial Agile): a phase characterized by the full adoption of all practices, including testing, refactoring and pair programming. It lasted ten weeks, leading to the implementation of a key set of the system features. In practice, specific classes to model and manage the domain of research organizations, roles, products, and subjects were added to the original classes managing the content management system, user roles, security, front end and basic system services. The new classes include service classes mapping the model classes to the database, and allowing their presentation and user interaction.
- Phase 2 (Cowboy Coding): this is a critical phase, characterized by a minimal adoption of pair programming, testing and refactoring, because a public presentation was approaching, and the system still lacked many of the features of competitors' products. So, the team rushed to implement them, compromising quality. This phase lasted seven weeks, and included the first release of the system after two weeks.
- Phase 3 (Refactoring): an important refactoring phase, characterized by the full adoption of the testing and refactoring practices and by the adoption of a rigorous pair
programming rotation strategy. The main refactorings performed were "Extract Superclass", to remove duplications and extract generalized features from the classes representing research products and the corresponding service classes, and "Extract Hierarchy", applied to a few "big" classes, such as an Action class that managed a large percentage of all the events occurring in the user interface. This phase was needed to fix the bugs and the bad design that resulted from the previous phase. It lasted four weeks and ended with the second release of the system.
- Phase 4 (Mature Agile): like Phase 1, this is a development phase characterized by the full adoption of the entire set of practices, until the final release, after eight weeks.
5. Results and Discussion
In this section we analyze the evolution of the FLOSS-AR source code metrics. At regular intervals of one week, the source code was checked out from the CVS repository and analyzed by a parser that calculated the metrics. The parser and the analyzer were developed by our research group as a plug-in for the Eclipse IDE. In this way we gathered 30 "snapshots" of the system, one for each development week.
5.1. Correlations of the metrics of a given system
To study how the 19 metrics, each computed for all classes of a given system, are correlated, we calculated the cross-correlation values of the various considered metrics for the last release of the system under study. We used Kendall's non-parametric measure of rank correlation [29] because Pearson's correlation coefficients were highly influenced by outliers, while Spearman's rank correlation coefficient computation suffered from the many equal values found in integer data. The results are reported in Table 2; we consider as highly correlated the pairs whose coefficient has an absolute value above 0.6. Correlation tests made on other snapshots of the system yield very similar results.
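The kind of computation involved can be sketched as follows, assuming SciPy is available; the per-class values below are hypothetical and only illustrate the call, they are not the paper's data:

    import numpy as np
    from scipy.stats import kendalltau

    # Hypothetical per-class values of two metrics in one snapshot.
    rfc = np.array([12, 3, 45, 7, 30, 5, 18])
    loc = np.array([150, 40, 700, 60, 420, 35, 220])

    # Rank-based correlation, robust to the outliers and fat tails of these metrics.
    tau, p_value = kendalltau(rfc, loc)
    print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")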
We found a high correlation between several pairs of metrics. RFC is fairly correlated with WMC, CBO and LOC, while LCO is correlated with WMC. FI and WFI are the most correlated metrics (τ = 0.90), but FI and WFI are also correlated with REI, CI and REO, the latter being strongly anti-correlated. FO is correlated with CBO, RFC, WFO, and REI, while WC is correlated with CBO. Finally, CI is correlated with WFO, which in turn is also correlated with LOC and RFC. Reach Efficiency, and in particular REO, tends to be anti-correlated with most other metrics.
These correlations do not mean that some metrics can be easily substituted by others. However, they can be a good starting point to reduce the number of metrics to study.
From the correlations studied, and from common knowledge on OO metrics, as specified below, the following metrics can be considered candidates to be overlooked, or substituted by other metrics:
Table 2. The Kendall rank cross-correlation coefficients of the considered metrics, computed on all classes of the last version of the FLOSS-AR system.

Metric   LOC    WMC    LCO    NOC    DIT    CBO    RFC    FI     WFI    FO     WFO    REI    REO    WC     CI     CO
LOC      1.00   0.57   0.39   0.00   0.06   0.41   0.69   0.02   0.05   0.48   0.60  -0.20   0.06   0.35   0.02   0.06
WMC      0.57   1.00   0.60   0.11  -0.09   0.37   0.68   0.24   0.28   0.20   0.30   0.04  -0.14   0.35   0.16   0.06
LCO      0.39   0.60   1.00   0.00  -0.17   0.28   0.44   0.22   0.25   0.10   0.14   0.08  -0.17   0.24   0.14   0.07
NOC      0.00   0.11   0.00   1.00  -0.06   0.08   0.06   0.36   0.35  -0.05  -0.03   0.21  -0.24   0.21   0.29   0.18
DIT      0.06  -0.09  -0.17  -0.06   1.00  -0.02   0.01  -0.20  -0.21   0.24   0.29  -0.17   0.29  -0.04  -0.16   0.14
CBO      0.41   0.37   0.28   0.08  -0.02   1.00   0.57   0.33   0.32   0.56   0.44  -0.13  -0.15   0.74   0.17   0.06
RFC      0.69   0.68   0.44   0.06   0.01   0.57   1.00   0.08   0.12   0.57   0.57  -0.22   0.04   0.46   0.07   0.06
FI       0.02   0.24   0.22   0.36  -0.20   0.33   0.08   1.00   0.91  -0.18  -0.13   0.58  -0.59   0.42   0.59   0.12
WFI      0.05   0.28   0.25   0.35  -0.21   0.32   0.12   0.91   1.00  -0.16  -0.11   0.56  -0.54   0.39   0.61   0.11
FO       0.48   0.20   0.10  -0.05   0.24   0.56   0.57  -0.18  -0.16   1.00   0.77  -0.52   0.28   0.39  -0.11   0.06
WFO      0.60   0.30   0.14  -0.03   0.29   0.44   0.57  -0.13  -0.11   0.77   1.00  -0.40   0.22   0.35  -0.08   0.10
REI     -0.20   0.04   0.08   0.21  -0.17  -0.13  -0.22   0.58   0.56  -0.52  -0.40   1.00  -0.40   0.00   0.37   0.12
REO      0.06  -0.14  -0.17  -0.24   0.29  -0.15   0.04  -0.59  -0.54   0.28   0.22  -0.40   1.00  -0.26  -0.34  -0.16
WC       0.35   0.35   0.24   0.21  -0.04   0.74   0.46   0.42   0.39   0.39   0.35   0.00  -0.26   1.00   0.22   0.09
CI       0.02   0.16   0.14   0.29  -0.16   0.17   0.07   0.59   0.61  -0.11  -0.08   0.37  -0.34   0.22   1.00   0.09
CO       0.06   0.06   0.07   0.18   0.14   0.06   0.06   0.12   0.11   0.06   0.10   0.12  -0.16   0.09   0.09   1.00
- NOC and DIT: it is well known that most authors consider these metrics the least correlated with faults [17]. In our case, we found that the mean and 90th percentile of the NOC and DIT metrics show small variations across the snapshots, and look less useful than other metrics for discriminating among the Phases. The only large variation in DIT is between weeks 17 and 18 (thus between Phase 2 and Phase 3), when a new abstract superclass, "EntityManager", was introduced to generalize a large part of the behavior of 18 existing classes. This led to a jump in DIT, and a corresponding drop in WMC, CBO, RFC, FI and FO, because many dependencies between each of the 18 subclasses and other classes were pushed up the hierarchy, to the new class. Overall, inheritance links contribute only about 4% of all links of the software graph. For this reason, despite the importance of inheritance in OO development, the NOC and DIT metrics were not considered for discriminating among the Phases of the presented case study.
- WMC: the information carried by this metric is also found in LOC (the more methods in a class, the more lines of code) and RFC (which includes WMC in its computation).
- CBO: it is well correlated with RFC, FI and FO, as known from the literature [10, 33], so we will not consider it.
- WFI: FI is an almost perfect substitute, because it is strongly correlated with WFI, and exhibits correlations very similar to those of WFI with all other metrics; moreover, it is simpler to compute.
- FO, WFO: these metrics are well represented by the RFC metric. Moreover, their averages over all the classes of the system are the same as the averages of FI and WFI, respectively. This is because their average is the average number of in-links and out-links over all system classes. Since each in-link corresponds to one out-link, their total numbers, and hence their averages, are the same. This is true for both weighted and non-weighted links (see the sketch after this list).
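A small check of this equality on a hypothetical edge list could look like this:

    from collections import defaultdict

    # Hypothetical class graph edges: (using_class, used_class).
    edges = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "A")]
    classes = {"A", "B", "C", "D"}

    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for user, used in edges:
        fan_out[user] += 1
        fan_in[used] += 1

    mean_fi = sum(fan_in[c] for c in classes) / len(classes)
    mean_fo = sum(fan_out[c] for c in classes) / len(classes)
    # Every edge adds exactly one in-link and one out-link, so the means coincide.
    assert mean_fi == mean_fo == len(edges) / len(classes)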
We decided to consider all the SNA metrics, because they are not yet well studied in the software field, so they deserve to be studied in more depth. Note that we also performed the analysis of variations in metric statistics reported in the following for the metrics considered substituted by others, confirming that their behavior is consistent with that of their substitute metrics. In this way, the paper is simpler, without losing information.
In the end, we analyze the behavior of the following nine metrics, as system development evolved: LCOM, RFC, FI, REI, REO, WC, CI, CO, LOC.
5.2. Metric statistics across system snapshots and their correlations
The total number of classes in the system (including abstract classes and interfaces),
which is a good indicator of its size, increases over time, though not linearly. The
project started with 362 classes, those of jAPS release 1.6. At the end of the
project, after 30 weeks, the system had grown to 514 classes, due to the development
of new features that constituted the specialized system. Figure 1 shows the evolution
of the number of classes during development, together with the four main phases of
development, and the weeks of the three releases of the system.
We computed key statistics of these metrics (mean, standard deviation, median, 90th percentile) for each of the 30 systems analyzed. Remember that these metrics are always positive, and none of them is normally distributed; they all follow a "fat tail" distribution, often a power-law [10, 23, 13], so the statistics must be focused mainly on the extreme tail. We found that the best statistics to account for the behavior of a metric in the whole system are the mean, which is anyway a rough measure of the overall behavior of the metric across all classes of the system, and the 90th percentile, which gives information on the tail. The standard deviation gives information only on how the values are spread, but not on the values themselves, while the median is skewed toward values that are too low, and tends to be fairly constant.
We computed the Kendall cross-correlation coefficients of the mean and 90th percentile of the metrics, on the 30 weekly snapshots of the FLOSS-AR system under study, to assess how these metrics were related across the development. We show these cross-correlations in Tables 3 and 4; we consider as high those whose absolute value is above 0.7. Note that the 90th percentile of the LCOM metric is constant across the snapshots, so we had to drop it from Table 4.
This correlation is different from the correlation computed class by class for a single snapshot of the system, shown in Table 2. A high positive value of the class-by-class cross-correlation between two metrics means that, when one is above (below) average for a class, the other is likely to be above (below) average as well for the same class. In Tables 3 and 4, we refer instead to the correlation among average and 90th
Fig. 1. Total no. of classes during the evolution of FLOSS-AR system.
percentile values of the metrics, respectively, measured at weekly time steps during
the development.
In this case, a high positive value of the cross-correlation means that, at a given development step, when one metric, averaged over all classes, is above (below) its average value over the whole development, the other is also likely to be above (below) its own average value by a similar percentage for the same time step. As can be seen in Tables 3 and 4, many metrics are fairly correlated with each other. The metrics most correlated with the others, for both means and 90th percentiles, are LOC, RFC, Fan-In, and Closeness-Out, the latter being anti-correlated with the other metrics. The least correlated metric is Closeness-In.
Regarding the 90th percentiles, the correlations substantially confirm those of the means, but are typically lower. These results often do not match those reported in Table 2, in the sense that if two metrics are fairly correlated (or not correlated at all) when computed class-by-class, this does not imply that their means or 90th percentiles are correlated (or not correlated) in the same way when computed across a sequence of snapshots of the system under development, and vice versa. In about 40% of the cases, we even observe an inversion of the sign of the correlation. This is quite counter-intuitive, but the two correlations have different meanings.
Table 3. The Kendall rank cross-correlation coefficients of the averages of the nine considered metrics, computed on the 30 weekly snapshots of the FLOSS-AR system.

Metric   LOC    LCO    RFC    FI     REI    REO    WC     CI     CO
LOC      1.00   0.64   0.87   0.71   0.52   0.40   0.63   0.06  -0.42
LCO      0.64   1.00   0.54   0.43   0.42   0.41   0.50  -0.23  -0.34
RFC      0.87   0.54   1.00   0.79   0.58   0.38   0.62   0.14  -0.48
FI       0.71   0.43   0.79   1.00   0.76   0.46   0.64   0.07  -0.69
REI      0.52   0.42   0.58   0.76   1.00   0.67   0.52  -0.13  -0.87
REO      0.40   0.41   0.38   0.46   0.67   1.00   0.41  -0.38  -0.57
WC       0.63   0.50   0.62   0.64   0.52   0.41   1.00   0.06  -0.44
CI       0.06  -0.23   0.14   0.07  -0.13  -0.38   0.06   1.00   0.21
CO      -0.42  -0.34  -0.48  -0.69  -0.87  -0.57  -0.44   0.21   1.00
Table 4. The Kendall rank cross-correlation coefficients of the 90th percentiles of the eight considered metrics, computed on the 30 weekly snapshots of the FLOSS-AR system. LCOM has been dropped because it is constant over all snapshots.

Metric   LOC    RFC    FI     REI    REO    WC     CI     CO
LOC      1.00   0.30   0.27   0.67   0.71   0.56  -0.05  -0.57
RFC      0.30   1.00   0.37   0.08   0.06   0.55   0.51  -0.04
FI       0.27   0.37   1.00   0.35   0.27   0.61   0.65  -0.38
REI      0.67   0.08   0.35   1.00   0.84   0.39  -0.02  -0.90
REO      0.71   0.06   0.27   0.84   1.00   0.38  -0.10  -0.82
WC       0.56   0.55   0.61   0.39   0.38   1.00   0.51  -0.40
CI      -0.05   0.51   0.65  -0.02  -0.10   0.51   1.00   0.00
CO      -0.57  -0.04  -0.38  -0.90  -0.82  -0.40   0.00   1.00
If the slopes of the regression between two correlated quantities, computed across the classes of the same snapshot, vary across different snapshots, the resulting correlation of means or 90th percentiles can be very different from the correlations obtained for a single snapshot.
5.3. Discriminating amongst development phases using aggregate metrics
As reported in Sec. 4, the development of the system evolved through four distinct phases. We know that what differentiates the various phases is the level of adoption of the agile practices, namely PP, TDD and refactoring. We also know that these agile practices were applied, or not applied, together; consequently, it is not possible to discriminate among them using the data reported for this case study. So, we talk of "key agile practices", considering them as applied together. In this subsection we show and discuss how aggregate statistics of OO and network metrics exhibit specific patterns of evolution as system development proceeds. In Fig. 2 we show the behavior of the mean values of the three metrics that seem to discriminate better than the others among development phases: Fan-In, Closeness-In and Closeness-Out. All the values are normalized to the maximum value reached by the metric. FI and CO look the best for discriminating between Phases 1 and 2, while CI is the best for discriminating between Phases 2 and 3. Phases 3 and 4 are less well discriminated, but this is reasonable, because Phase 3 is a refactoring phase, and Phase 4 is a subsequent development phase that continues on the same path, without aggressive refactoring.
In Fig. 3 we show the behavior of the mean values of other metrics which are still "good" at discriminating among phases. They are LCOM, RFC, REI and REO.
Fig. 2. The evolution of the mean value of FI, CI and CO metrics.
In particular, LCOM exhibits a strong growth in Phase 2, when good OO and agile practices were abandoned, which is only partially corrected in Phases 3 and 4.
Figure 4 shows the behavior of the 90th percentiles of the FI, WC and CI metrics, the best at discriminating between phases. Note the different behavior with respect to the means reported in Figs. 2 and 3. For the sake of brevity, we do not report the behavior of the other metrics, because they look less significant than the reported ones.
The evolution of most aggregate statistics of the studied metrics along the process phases shows significantly different values and trends that depend on the specific phase, as shown in Figs. 2-4. Our hypothesis is that this variability is due to the different level of adoption of the key agile practices. In fact, to our knowledge, the only external factors that might have had an impact on the project are precisely the differences among the phases, as reported in Sec. 4. Regarding internal factors, the only relevant factor at play was team experience, regarding both the application of the agile practices and knowledge of the system itself. The project duration was relatively short, so we estimate that the latter factor significantly affected only Phase 1.
We performed a Kolmogorov-Smirnov (KS) two-sample test to assess whether those measurements significantly differed from one phase to the next. The KS test determines whether two datasets belong to different distributions, making no assumption on the distribution of the data.^b For each computed metric, we compared the measurements
Fig. 3. The evolution of the mean value of LCOM, RFC, REI and REO metrics.
^b Since the metrics computed at a given weekly snapshot depend also on the state of the system in the previous snapshot, the assumption underlying the KS test that the samples are random and mutually independent can be challenged. However, we used the KS test to assess the difference between measurements in different phases as if they were independent sets of points, and we believe that to a first approximation the KS test result is still valid.
belonging to any pair of phases; we were of course most interested in the ability to
discriminate between subsequent phases.
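A minimal sketch of such a comparison with SciPy's two-sample KS test follows; the weekly values are invented for illustration, and reading the reported confidence level as 100*(1 - p) is our assumption about how the figures in Tables 5 and 6 were obtained:

    from scipy.stats import ks_2samp

    # Hypothetical weekly means of one metric in two adjacent phases.
    phase2_means = [4.1, 4.3, 4.6, 4.8, 5.0, 5.1, 5.2]   # one value per week of Phase 2
    phase3_means = [4.9, 4.5, 4.2, 4.0]                   # one value per week of Phase 3

    stat, p_value = ks_2samp(phase2_means, phase3_means)
    confidence = 100 * (1 - p_value)
    print(f"KS statistic = {stat:.2f}, confidence = {confidence:.1f}%")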
The results are shown in Tables 5 and 6 for the means and 90th percentiles, respectively. The cases with a significance level greater than 99% are marked with an asterisk.
Regarding the metric means (Table 5), Phase 1 metrics differ very significantly from every other phase in all cases except for LCOM between Phases 1 and 2, and even there the significance is higher than 90%. Phase 2 is less clearly differentiated from Phases 3 and 4. The REO and CI metrics appear to discriminate best, with a KS significance greater than 98%, with RFC and FI following suit at 95%. Phases 3 and 4 can be discriminated effectively by the FI, REI and CO metric means.
The 90th percentiles are slightly less able to discriminate among phases. Phase 1 is still well differentiated from the other phases, and especially from Phase 2, except for the RFC
Fig. 4. The evolution of the 90th percentile of FI, WC and CI metrics.
Table 5. Confidence level that the mean of the metrics taken in a pair of phases significantly differs, according to the K-S two-sample test. Cases whose significance is above 99% are marked with an asterisk.

Metric  Phases 1-2  Phases 1-3  Phases 1-4  Phases 2-3  Phases 2-4  Phases 3-4
LOC     99.990*     99.340*     99.985*     85.113      96.402      64.019
LCO     91.310      99.340*     99.985*     62.319      84.722      64.019
RFC     99.990*     99.340*     99.985*     95.250      99.386*     64.019
FI      99.990*     99.340*     99.985*     95.250      88.606      99.213*
REI     99.990*     99.340*     99.985*     62.319      82.414      99.213*
REO     99.990*     99.340*     99.985*     98.770      99.924*     35.531
WC      99.990*     99.800*     99.998*     33.939      56.976      59.580
CI      99.825*     99.340*     99.985*     98.770      99.924*     91.128
CO      99.825*     99.340*     99.985*     62.319      82.414      99.213*
metric. CI is able to discriminate Phase 1 from Phase 2 very well, but totally fails to discriminate Phases 3 and 4 from Phase 1. It looks like a very powerful indicator of Phase 2, when good agile practices were dropped by the developers. Phase 2 is discriminated from Phase 3 by the LOC, FI and REO metrics at almost the 99% significance level. The same metrics are able to discriminate Phase 2 from Phase 4 at an even higher level. Finally, Phases 3 and 4 are well discriminated by the REI and CO metrics, confirming the results of the means. On the contrary, FI, which was a good discriminator in the case of the mean, is totally unable to discriminate between Phases 3 and 4 when its 90th percentile is used.
These results in fact confirm the differences in trends and values of the various metrics in the various phases that are evident in Figs. 2-4.
5.4. Aggregate metrics behavior across development phases
During the development of the FLOSS-AR system, Phase 1 is characterized by a steady growth of the number of classes. All metrics but LCOM and CO are stable during the first five weeks of this phase; then, their means tend to grow, in particular for FI, REI, REO, WC and, to a lesser extent, RFC and LOC. The means of LCOM and CO, on the contrary, tend to increase during the first few weeks and then stabilize. The 90th percentiles of the metrics tend to be quite constant during Phase 1, except in the case of CO. This means that no significant addition to the tails of the distributions (classes with extreme values of the metrics) was made. Regarding the large variation of the CO 90th percentile, recall that CO for a class is related to the number of steps needed to reach all the other (reachable) classes, following edges along their direction. The lower the average number of these steps for a class, the higher its CO value. The large variations might be explained by the addition, or deletion, of links in such a way that some classes substantially increased or decreased their closeness to other classes in the system, a phenomenon clearly possible in a small-world network such as a software network.
The starting values of all these metrics are those of the original jAPS framework, constituted by 367 classes and evaluated by code inspection as a project with a fairly good OO architecture.
Table 6. Confidence level that the 90th percentile of the metrics taken in a pair of phases significantly differs, according to the K-S two-sample test. Cases whose significance is above 99% are marked with an asterisk.

Metric  Phases 1-2  Phases 1-3  Phases 1-4  Phases 2-3  Phases 2-4  Phases 3-4
LOC     99.947*     99.340*     99.985*     98.770      99.924*     64.019
RFC     86.416      91.964       0.000       0.000      84.722      91.128
FI      99.947*     91.964      99.985*     98.770      99.924*      0.483
REI     99.947*     99.340*     99.985*     95.250      99.924*     99.213*
REO     99.746*     99.340*     99.985*     98.770      99.924*     50.702
WC      99.529*     99.340*     97.032       0.000       0.118       8.198
CI      99.529*      0.000       0.000      95.250      99.386*      0.000
CO      99.529*     99.340*     99.985*     45.215      99.924*     99.213*
The increase of the RFC and FI means (and recall that FI is closely related to CBO) denotes a worsening of software quality. Note that Phase 1 is characterized by a rigorous adoption of the agile practices, but we should consider two factors:
(1) The knowledge of the original framework was initially quite low, so the first additions of new classes to it in the initial phase had a sub-optimal structure, and it took time to evolve towards an optimal configuration;
(2) Some agile practices require time to be mastered, and our developers were junior.
In general, we might conclude that in Phase 1 the team steadily added new features, and consequently new classes, to the system. In the first half of the phase, however, these classes substantially kept the structure of the original system they were added to. As the system grew, this structure was slowly impaired, due to the factors mentioned above.
Phase 2 is characterized by a strong push for releasing new functionalities and by giving up the use of pair programming, testing and refactoring. In this phase we observe a growth in all metric means but CO, and particularly in the metrics related to coupling and complexity, with an explosive growth of LCOM. This seems to confirm that in Phase 2 quality was compromised in order to add several new features. The 90th percentiles substantially confirm the behavior of the corresponding means. It is worth noting that the 90th percentiles of several metrics exhibit an even steeper change passing from Phase 1 to Phase 2. This happens for FI, LOC, WC, CI and CO, the latter with a steep decrease in value.
Phase 2 is followed by Phase 3, a phase in which the team, adopting a rigorous pair programming rotation strategy together with testing and refactoring, was able to refactor the system, increasing its cohesion and decreasing coupling, and thus reducing the values of several metrics known to be anti-correlated with quality, such as LCOM, RFC, FI and LOC. In this phase, no new features were added to the system. The number of classes increased during this phase, because refactoring required splitting classes that had grown too much, and refactoring hierarchies, adding abstract classes and interfaces. The transition from Phase 2 to Phase 3 is marked by a significant decrease of Fan-In and CI, evident in both the mean and 90th percentile behavior. After this decrease, the FI and CI means tend to increase again at the end of Phase 3. The CO mean has a trend opposite to CI, as also happens in Phase 2 (but not in Phases 1 and 4). REO has a behavior similar to CO, while RFC and LCOM were reduced, mainly at the end of the phase. There is also a slight decrease of LOC (not shown), mainly due to the addition of abstract classes to the hierarchies, which factor out common features and reduce the code of many classes. Note that the values of the metrics at the end of Phase 3 seem to reach an equilibrium.
Phase 4 is the last development phase. It is characterized by the adoption of all key agile practices, and by the creation of other classes associated with new features. In this phase most metrics do not change significantly, although, in the end, the
values of most of them are slightly lower than at the beginning of the phase ���maybe because the team became more e®ective in the adoption of the agile practices
compared to the initial Phase 1. Only REO tends to grow in the end of the whole
development.
Table 7 summarizes these observations, highlighting which metrics look best
suited to discriminate between the various phases.
In conclusion, Fan-In appears to be the only metric able to discriminate fairly well
between all the various phases, especially considering its mean. Other good
discriminators are CI (especially its 90th percentile) for the first phases, and the REI
and CO means for the last phases.
For this case study, a combination of the FI mean, the CI 90th percentile and the
REI mean would be able to discriminate among the various phases fairly well.
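To make the use of these summary statistics concrete, the following sketch shows one way the per-snapshot mean and 90th percentile of a metric could be computed and tracked across snapshots. It is a minimal illustration written in Python: the data layout, variable names and the 30% jump threshold are illustrative assumptions of ours, not values taken from the study.

```python
import numpy as np

def snapshot_stats(per_class_values):
    """Mean and 90th percentile of one metric over all classes in a snapshot."""
    v = np.asarray(per_class_values, dtype=float)
    return {"mean": float(v.mean()), "p90": float(np.percentile(v, 90))}

def flag_jumps(series, rel_threshold=0.3):
    """Flag snapshot-to-snapshot relative changes larger than the threshold.

    `series` is a list of statistic values, one per snapshot, in time order.
    The 30% threshold is purely illustrative, not a value from the paper.
    """
    flags = []
    for i in range(1, len(series)):
        prev, curr = series[i - 1], series[i]
        if prev != 0 and abs(curr - prev) / abs(prev) > rel_threshold:
            flags.append(i)
    return flags

# Hypothetical usage: fan_in_by_snapshot would be a list of lists, one list of
# per-class Fan-In values for each of the 30 snapshots.
# fi_means = [snapshot_stats(s)["mean"] for s in fan_in_by_snapshot]
# print(flag_jumps(fi_means))
```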
6. Threats to Validity
The presented work is based on a single, empirical case study. This fact yields several
obvious threats to its validity that we discuss in this section.
The first issue is that what we presented is just one anecdotal case study, since we
were not able to find other case studies with a comparable amount of source code data
and, above all, with information about the variations of agile practices adopted throughout the
development. From a single case study, it is clearly impossible to safely generalize to
other cases. We believe, however, that the case study is of great anecdotal interest,
and might be used by practitioners as a starting point to analyze the relationships
between software metric trends and practices used to improve software quality.
Table 7. The metrics and statistics best suited to discriminate between the various phases.

Phases | Metric (Statistic) | Discussion
1 → 2 | CI (90th perc.) | A steep increase of the 90th perc. of CI looks like a very good marker of a phase where "good" agile practices were abandoned.
1 → 2 | FI (mean) | An increase of the FI mean is also a good discriminator of Phase 2.
1 → 2 | LCOM (mean) | The LCOM mean starts low, but then increases very significantly during the middle of Phase 2.
1 → 2 | CO (mean) | The CO mean significantly decreases during Phase 2.
2 → 3 | CI (mean & 90th perc.) | When agile practices are resumed, we found an immediate, steep decline of CI (both mean and 90th perc.) that persisted in Phase 3.
2 → 3 | FI (mean & 90th perc.) | FI is confirmed to be another good marker able to discriminate between Phases 2 and 3, though to a lesser extent than CI.
2 → 3 | REO (mean) | The REO mean in Phase 3 is consistently and significantly greater than in the previous phase.
3 → 4 | REI (mean) | The REI mean steadily increases at the end of Phase 4, showing a good discrimination ability with respect to Phase 3.
3 → 4 | CO (mean) | The CO mean decreases at the end of Phase 3, and continues to decrease in Phase 4, showing a fair discrimination ability.
3 → 4 | FI (mean) | The FI mean increases at the end of Phase 3, and then remains almost constant in Phase 4, showing a mild discrimination ability.
Related to this issue, the information about the adoption of agile practices, used to
identify the phases of the project, comes from a survey among the developers. The
details of the actual adoption of agile practices (kinds of refactorings applied, exact
percentage of time spent in pair programming, etc.) were not explicitly recorded during the project. This
vagueness is another threat to the validity of the results.
Another threat to the validity of the presented results is that we studied a small-
to medium-sized project, whose results might be difficult to generalize to larger, more
critical projects. This issue is related to the previous one. However, modern development
processes tend to split large projects into a set of loosely coupled, smaller
developments, whose magnitude is not so different from the presented one. When this is
the case, this objection should no longer hold.
Another threat concerns the specific OO programming language (Java) and the programming
environment (Eclipse) used to develop the system. Again, the generalization to
other languages and programming styles is not guaranteed. We can observe that on one
hand we are interested in OO metrics and in software graphs built from an OO
architecture. The OO paradigm is currently the most used programming paradigm,
and we believe that focusing on it is not really limiting. On the other hand, many
popular OO languages, especially C++ and C#, are very similar to Java. In a
previous study, the distributions and correlations of CK metrics in 100 Java and
100 C++ projects were found fairly similar [33]. So, we believe that the presented
results can generalize to them. For other OO languages, like Python and
Ruby, this might not be true because the programming styles are very different
from Java.
The last threat, and perhaps the biggest, is that at least some of the findings
might have been obtained just by chance. The number of samples used in the
statistical analysis, one for each snapshot, is 30 per metric/statistic. The sample groups
pertaining to the four phases used to discriminate between metrics contain between 4
and 10 values. These numbers, compared to the total number of metrics and sta-
tistics tested to discriminate among phases (16 original metrics, and 4 statistics for
each of them), are small. So, the discrimination ability of some metric/statistic might
be due to statistical variations, and not be significant at all. In order to answer this
objection, we can highlight that:
(1) Regarding the statistics, we immediately found that the median and the standard
deviation were not able to discriminate anything, and dropped them. So, we were
left with two statistics. We chose 90 as the percentile value because it is an optimal
choice, not too close to the median, nor too close to the extreme portion of the tail.
(2) We did not use seven of the original metrics (DIT, NOC, WMC, etc.), following
information found in the literature about their inability to discriminate software
quality, and because of their strong correlations with other metrics. Having dropped
them is not a "selection of the fittest" in a statistical sense, but bears a specific
meaning.
(3) As shown in Tables 5 and 6, most of the remaining metrics actually used are able
to strongly discriminate among various pairs of phases. Only the pairs including
Phases 2 and 3, and Phases 3 and 4 are discriminated by just a few metrics/
statistics. This makes it very unlikely that this discrimination ability is due to
chance.
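As an illustration of how the discrimination ability of a single metric/statistic between two phases could be checked, the sketch below applies a non-parametric Mann-Whitney U test to the per-snapshot values of two phase groups. The choice of test and the sample values are purely illustrative assumptions; this is not necessarily the statistical procedure behind Tables 5 and 6.

```python
from scipy.stats import mannwhitneyu

def discriminates(phase_a, phase_b, alpha=0.05):
    """Check whether a metric statistic separates two phases.

    phase_a, phase_b: per-snapshot values of one statistic (e.g. the FI mean)
    for the snapshots belonging to each phase (4 to 10 values per group).
    Returns the p-value and a boolean flag.
    """
    _, p_value = mannwhitneyu(phase_a, phase_b, alternative="two-sided")
    return p_value, p_value < alpha

# Hypothetical per-snapshot FI means for Phases 1 and 2 (not data from the paper).
phase1_fi = [2.1, 2.3, 2.2, 2.4, 2.3]
phase2_fi = [3.0, 3.4, 3.6, 3.8]
print(discriminates(phase1_fi, phase2_fi))
```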
7. Conclusions
We presented a case study related to agile software development of a medium-sized
project using the Java OO programming language, matching the different use of key agile practices in the four
phases of the project with OO and graph-related metrics.
In Phase 1 we observed a deterioration of the quality metrics, which significantly
worsened during Phase 2; Phase 3 led to a signi¯cant improvement in quality, and
Phase 4 kept this improvement. The only external factors that changed during the
phases were the adoption of the pair programming, TDD and refactoring agile practices,
which were abandoned during Phase 2 and used again at their full power during
Phase 3, aiming to improve the system quality without adding new features, and
then in Phase 4. As regards internal factors, in Phase 1 the team was clearly less
skilled in the use of agile practices and in the knowledge of the original framework
than in subsequent phases.
We studied the aggregate variation of several source code metrics, specific to OO
systems and to the oriented software graph built from the OO software structure.
We found that an appropriate combination of a few metrics, namely the average
Fan-In, the 90th percentile of Closeness-In, and the average of Reach-Efficiency-In,
is able to discriminate among the various phases, and hence among the
development practices used to code the system. The adoption of "good" agile practices
is always associated with "better" values of these metrics: when pair programming,
TDD and refactoring are used, the quality metrics improve; when these
practices are discontinued, the metrics worsen significantly. We validated the use-
fulness of software metrics in monitoring the quality of the ongoing development, for
the empirical case study analyzed. This might be useful for software practitioners.
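For readers who want to compute metrics of this kind, the sketch below shows one possible way to obtain class-level Fan-In, Closeness-In and an in-reach measure from a class dependency graph using the networkx library. The graph, the class names and the normalization of the reach measure are illustrative assumptions; the exact definitions of Closeness-In and Reach-Efficiency-In used in this study are those given earlier in the paper and may differ from this sketch.

```python
import networkx as nx

# Hypothetical class dependency graph: an edge A -> B means class A depends on B.
g = nx.DiGraph()
g.add_edges_from([
    ("OrderController", "OrderService"),
    ("OrderService", "OrderRepository"),
    ("ReportJob", "OrderService"),
    ("OrderController", "OrderRepository"),
])

fan_in = dict(g.in_degree())               # number of classes depending on each class
closeness_in = nx.closeness_centrality(g)  # networkx uses incoming distances on digraphs

def reach_efficiency_in(graph, node):
    """Illustrative proxy: fraction of the other classes that can reach `node`."""
    reachable_from = nx.ancestors(graph, node)
    return len(reachable_from) / (graph.number_of_nodes() - 1)

for cls in g.nodes:
    print(cls, fan_in[cls], round(closeness_in[cls], 2),
          round(reach_efficiency_in(g, cls), 2))
```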
Clearly, it is not possible to draw definitive conclusions by observing a single,
medium-sized project. Unfortunately, it is not easy to find other case studies, because
they must include not only tracking of the source code produced during development,
a task easily accomplished with modern configuration management systems,
but also an accurate tracking of the development practices, and of possible
other external and internal factors, used throughout the project. We hope that this
paper might spur similar studies by researchers with access to proper data, able to
confirm or to disprove our findings.
Acknowledgments
This work was partially funded by Regione Autonoma della Sardegna (RAS),
Regional Law No. 7, 2007 on Promoting Scientific Research and Technological
Innovation in Sardinia, call 14/2/2009, and RAS Integrated Facilitation Program
(PIA) for Industry, Artisanship and Services, call 14/10/2008, project No. 265,
Advanced Technologies for Software Measuring and Integrated Management,
TAMIGIS.
References
1. Agile Manifesto, URL: www.agilemanifesto.org.
2. A. J. Albrecht, Measuring application development productivity, in Proc. of IBM Application Development Symposium, Monterey, CA, October 1979, pp. 83-92.
3. N. Anquetil and J. Laval, Legacy Software Restructuring: Analyzing a Concrete Case, in Proc. of the 15th European Conference on Software Maintenance and Reengineering (CSMR'11), Oldenburg, Germany, 2011.
4. V. R. Basili, L. C. Briand and W. L. Melo, A validation of object oriented design metrics as quality indicators, IEEE Trans. Software Eng. 22 (1996) 751-761.
5. K. Beck and C. Andres, Extreme Programming Explained: Embrace Change, Second Edition (Addison-Wesley, 2004).
6. B. Boehm and R. Turner, Balancing Agility and Discipline (Addison-Wesley Professional, 2003).
7. G. Canfora, A. Cimitile, F. Garcia, M. Piattini and C. A. Visaggio, Evaluating advantages of test driven development: A controlled experiment with professionals, in Proc. Int. Symposium on Empirical Software Engineering (ISESE'06), Rio de Janeiro, Brazil, 21-22 September 2006, pp. 364-371.
8. S. Chidamber and C. Kemerer, A metrics suite for object-oriented design, IEEE Trans. Software Eng. 20 (1994) 476-493.
9. S. Chidamber and C. Kemerer, Managerial use of metrics for object oriented software: An exploratory analysis, IEEE Trans. Software Eng. 24 (1998) 629-639.
10. G. Concas, M. Marchesi, S. Pinna and N. Serra, Power-laws in a large object-oriented software system, IEEE Trans. Software Eng. 33 (2007) 687-708.
11. G. Concas, M. Di Francesco, M. Marchesi, R. Quaresima and S. Pinna, Study of the evolution of an agile project featuring a web application using software metrics, in Proc. 9th Int. Conf. on Product Focused Software Process Improvement (PROFES'08), Frascati, Italy, 23-25 June 2008.
12. G. Concas, M. Marchesi, A. Murgia, S. Pinna and R. Tonelli, Assessing traditional and new metrics for object-oriented systems, in Proc. of the Workshop on Emerging Trends in Software Metrics (ICSE'10), Cape Town, South Africa, May 2010.
13. G. Concas, M. Marchesi, A. Murgia and R. Tonelli, An empirical study of social networks metrics in object-oriented software, Advances in Software Engineering, Vol. 2010, 2010.
14. T. Dybå and T. Dingsøyr, Empirical studies of agile software development: A systematic review, Information and Software Technology 50 (2008).
15. M. Fowler, Refactoring: Improving the Design of Existing Code (Addison-Wesley, 1999).
16. M. Giblin, P. Brennan and C. Exton, Introducing agile methods in a large software development team: The impact on the code, in Proc. 11th Int. Conf. on Agile Processes in Software Engineering and Extreme Programming (XP2010), Trondheim, Norway, June 2010, pp. 58-72.
17. T. Gyimothy, R. Ferenc and I. Siket, Empirical validation of object-oriented metrics on open source software for fault prediction, IEEE Trans. Software Eng. 31 (2005) 897-910.
18. JAPS: Java agile portal system, URL: http://www.japsportal.org.
19. D. S. Janzen and H. Saiedian, Does test-driven development really improve software design quality?, IEEE Software, March/April 2008, pp. 77-84.
20. M. Kunz, R. R. Dumke and A. Schmietendorf, How to measure agile software development, in Proc. Int. Conf. on Software Process and Product Measurement (IWSM-Mensura 2007), Palma de Mallorca, Spain, November 5-8, 2007, pp. 95-101.
21. L. Layman, L. Williams and L. Cunningham, Exploring extreme programming in context: An industrial case study, in Proc. of the Agile Development Conference (ADC'04), Salt Lake City, Utah, June 2004, pp. 32-41.
22. W. Li and S. Henry, Object oriented metrics that predict maintainability, J. Systems and Software 23 (1993) 111-122.
23. P. Louridas, D. Spinellis and V. Vlachos, Power laws in software, ACM Trans. Software Eng. and Methodology 18(1) (2008).
24. F. Macias, M. Holcombe and M. Gheorghe, A formal experiment comparing extreme programming with traditional software construction, in Proc. of the Fourth Mexican International Conference on Computer Science (ENC 2003), Tlaxcala, Mexico, September 2003.
25. T. J. McCabe, A complexity measure, IEEE Trans. Software Eng. 2 (1976) 308-320.
26. M. Melis, I. Turnu, A. Cau and G. Concas, Evaluating the impact of test-first programming and pair programming through software process simulation, Software Process Improvement and Practice 11 (2006) 345-360.
27. N. Nagappan, E. M. Maximilien, T. Bhat and L. Williams, Realizing quality improvement through test driven development: Results and experiences of four industrial teams, Empirical Software Engineering 13 (2008) 289-302.
28. M. E. J. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167-256.
29. A. V. Prokhorov, Kendall coefficient of rank correlation, in Encyclopaedia of Mathematics, ed. M. Hazewinkel (Springer Verlag, Heidelberg, 2001).
30. J. Scott, Social Network Analysis: A Handbook (SAGE Publications, London, UK, 2000).
31. M. Siniaalto and P. Abrahamsson, Does test-driven development improve the program code? Alarming results from a comparative case study, in Balancing Agility and Formalism in Software Engineering, B. Meyer, J. R. Nawrocky and B. Walter, eds., Lecture Notes in Computer Science, Vol. 5802 (Springer, 2008), pp. 143-156.
32. R. Subramanyam and M. S. Krishnan, Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects, IEEE Trans. Software Eng. 33 (2007) 687-708.
33. G. Succi, W. Pedrycz, S. Djokic, P. Zuliani and B. Russo, An empirical exploration of the distributions of the Chidamber and Kemerer object-oriented metrics suite, Empirical Software Engineering 10 (2005) 81-103.
34. C. A. Wellington, T. Briggs and C. D. Girard, Comparison of student experiences with plan-driven and agile methodologies, in Proc. of the 35th ASEE/IEEE Frontiers in Education Conference, Indianapolis, Indiana, 19-21 October 2005.
35. T. Zimmermann and N. Nagappan, Predicting defects using network analysis on dependency graphs, in Proc. 30th Int. Conf. on Software Engineering (ICSE'08), Leipzig, Germany, 10-18 May 2008, pp. 531-540.