[IEEE 2011 18th Working Conference on Reverse Engineering (WCRE) - Limerick, Ireland...
Locating the Meaning of Terms in Source Code: Research on "Term Introduction"
Jan Nonnen, Daniel Speicher, Paul Imhoff
University of Bonn, Computer Science III
Bonn, Germany
{nonnen, dsp, imhoffj}@cs.uni-bonn.de
Abstract—Software developers are often facing the challenge of understanding a large code base. Program comprehension is not only achieved by looking at object interactions, but also by considering the meaning of the identifiers and the contained terms. Ideally, the source code should exemplify this meaning. We propose to call the source code locations that define the meaning of a term term introduction. We further derive a heuristic to determine the introduction location with the help of an explorative study. This study was performed on 8000 manually evaluated samples gained from 30 open source projects. To support reproducibility, all samples and classifications are also available online. The achieved results show a precision of 75% for the heuristic.
Keywords—identifier analysis; name meaning; empirical evaluation; explorative approach; program comprehension; source code analysis
I. INTRODUCTION
Beck emphasises in [1] the importance of intention revealing names, as an identifier is written once, but read multiple times. Moreover, Caprile and Tonella [2] stated that "identifier names are one of the most important sources of information about program entities".
As Figure 1 shows, snippet b) offers almost as much information to the developer as code snippet a). On the other hand, code snippet c) may be equivalent to a) from the perspective of the compiler, but conveys no intentions to the developer.
Therefore identifier analysis has become an active re-
search area [3], [4], [5]. A variety of research questions
is approached with different methodologies and different
algorithms.
a) void transfer(Account from, Account to, float amount) {
     from.debit(amount); to.credit(amount);
   }

b) transfer from to amount from debit amount to credit amount

c) void m1(T1 v1, T1 v2, float v3) {
     v1.m2(v3); v2.m3(v3);
   }

Figure 1. Different views on the source code: the normal code a), how the developers b) and the compiler c) see it.
For instance Information Retrieval techniques have been
applied to the analysis of identifiers, e.g. in automatically
retrieving the "topics" of the code [3]. While this approach is useful in many areas, in the case of complex control flow these "topics" may not be precise enough to understand the method in detail.
Deißenbock and Ratiu [4] use the external ontology WordNet [6] to identify whether class names are named concisely. As useful as an external ontology is in many cases,
it does not always represent the real semantics of a program
as developers have sometimes very unusual mental models.
For instance, with the word "cat" WordNet (and most people) will associate the animal, but a developer might associate the command to concatenate files. While any Unix developer will feel comfortable reading the command killCat in a program, any non-programmer will immediately fear for the health of her feline friends.
We agree with Binkley [5] that algorithms dedicated to naming analysis in source code should also be able to consider the code structure. Our work solely takes advantage of structural information, without referencing an external ontology.
As a possible application scenario, the presented analysis could be integrated into a development environment. A dictionary explaining the terms used inside a method, with references to their introduction locations, could be used during software maintenance. Current development environments (e.g. Eclipse) allow navigating from a reference in the source code to its declaration with just one click. In the same way, one could add the ability to navigate from a term in an identifier to its introduction location.
General Idea
Identifiers can be composed of different terms with special meaning. When trying to understand these terms, developers often have to follow their usage throughout the source code. In some situations the meaning of a term can be located at a source code location, denoted as the "introduction location". In this work we want to explore and find such introduction locations. Different meanings for the same term normally
2011 18th Working Conference on Reverse Engineering, 1095-1350/11 $26.00 © 2011 IEEE. DOI 10.1109/WCRE.2011.21
should also result in different introduction locations, each
for a different meaning.
As an example consider a Poker application with iden-
tifiers DrawCard, a type responsible for painting parts of
the user interface, and drawNewCard, a method to give a
player a new card. Figure 2 illustrates this situation. The term
”draw” has two different meanings, and thus two different
introduction locations.
Figure 2. A general idea of term introductions.
Example
Standard Information Retrieval techniques, e.g. word frequency, do not necessarily consider different meanings of the same word. Thus a reuse of the same word in different identifiers conveys the idea that both occurrences refer to the same meaning. The following case study presents a case in which the same word refers to different meanings.
The method signature in Figure 3 uses the term "context" in both parameter names: bundleContext contains the two terms "bundle" and "context", and contextObserver contains the two terms "context" and "observer".
public static ServiceTracker openTracker(BundleContext bundleContext, Observer contextObserver)

Figure 3. At a first glance the term "context" seems to be used with the same meaning in both parameters.
At first sight, both "contexts" in Figure 3 seem to have the same meaning. A deeper analysis of the concepts behind the identifiers yields that two meanings can be found, each in a different part of the code. Figure 2 illustrates the search for each meaning.
The case study was taken from our research project Context Sensitive Intelligence [7]. One meaning of "context" refers to the OSGi framework which the project uses, and the other is a physical location of a mobile device. The concrete paths used to identify both meanings can be seen in detail in Figure 4.
In case a term is used homonymously, it is normally good practice to clarify the meaning by choosing a different name for one of the meanings. But in our example the term "context" has already been set by the research project and by the framework, respectively. Therefore the parameter names are consistent and should not be renamed. In this situation any tooling that helps a developer be aware of the different meanings would be very useful. Also, an automatic detection of homonyms based on an automated identification of introduction locations could be very helpful for program understanding.

Figure 4. Types related to the type containing the method openTracker(bundleContext, contextObserver).
Outline
The remainder of this paper is structured as follows: In Section II we describe how identifiers are preprocessed to obtain the terms contained in them. These terms represent the foundation for our analysis and are used to define the concept of a term introduction in Section III. To find an approximation for the identification of term introduction locations, we performed an explorative study. We present the design of this study in Section IV and the results in Sections V and VI. The last section discusses possible future work.
II. TERM APPROACH
At first, all identifiers are chopped into a sequence of tokens, based on camel-case rules. The camel-case convention separates tokens by capitalizing the initial letter of each element. For example, createFisheyeFigures yields the sequence (create, fisheye, figures). Each token has a linguistic canonical form, referred to as its lemma. For example, in English "run", "runs", "ran" and "running" are forms of the same lemma, conventionally written as "run". These lemmata, obtained from identifiers, are denoted as terms in this work; see Figure 5 for an example.
Identifier: createFisheyeFigures
Tokens: (create, fisheye, figures)
Terms: ((create, v), (fisheye, n), (figure, np))

Figure 5. Preprocessing of the identifiers. We use the part of speech tags v=verb, n=noun, and np=noun in plural.
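The preprocessing step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the regular expression only handles camel-case boundaries, and the small lemma table stands in for a dictionary-based lemmatizer such as TreeTagger (its entries here are illustrative).

```python
import re

def split_camel_case(identifier):
    """Split an identifier into lowercase tokens at camel-case boundaries."""
    tokens = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', identifier)
    return [t.lower() for t in tokens]

# Toy lemma table: maps a token to its (lemma, part-of-speech) pair.
LEMMAS = {"figures": ("figure", "np"), "create": ("create", "v"),
          "fisheye": ("fisheye", "n")}

def terms(identifier):
    """Map tokens to (lemma, POS) pairs, preserving unknown tokens as nouns."""
    return [LEMMAS.get(tok, (tok, "n")) for tok in split_camel_case(identifier)]
```

For the example of Figure 5, `terms("createFisheyeFigures")` yields ((create, v), (fisheye, n), (figure, np)).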
Besides lemmatization, stemming is frequently used to calculate a canonical form. Stemming is a heuristic rule-driven process that chops off the end of tokens in order to obtain the lemma. A currently popular stemming algorithm for English is Porter's algorithm [8]. It represents a fast normalization approach without the need for a dictionary. Lemmatization uses a dictionary and performs a proper morphological analysis. It utilizes part of speech tagging to retrieve the lemma. A popular lemmatization tool is TreeTagger [9]. Lemmatization preserves tokens that are not part of the dictionary, whereas stemming simply applies its rules. Taking these considerations into account, we decided to use lemmatization as a preprocessing step. The token "saw" illustrates the difference between the two strategies. On the one hand, stemming might reduce the token to "s". On the other hand, lemmatization normalizes it to "see" if it is used as a verb, and to "saw" if it is used as a noun.
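The contrast between the two strategies can be sketched with a toy example. Both functions are deliberately simplistic stand-ins: the suffix rules are not Porter's real algorithm, and the mini-dictionary is not TreeTagger's output, but they show why blind rule application and dictionary lookup behave differently.

```python
def toy_stem(token):
    """Blind rule-driven suffix chopping, in the spirit of stemming."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token

# Toy lemma dictionary keyed by (token, part-of-speech).
LEMMA_DICT = {("saw", "v"): "see", ("saw", "n"): "saw", ("ran", "v"): "run"}

def toy_lemmatize(token, pos):
    """Dictionary lookup; tokens not in the dictionary are preserved."""
    return LEMMA_DICT.get((token, pos), token)
```

Here `toy_stem("running")` produces the non-word "runn", while `toy_lemmatize("saw", "v")` yields "see" and `toy_lemmatize("saw", "n")` keeps "saw", mirroring the distinction made above.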
After defining what terms are, a valid question is whether all terms are relevant. Text search in the domain of Information Retrieval acknowledges that some common words appear to be of little value; these are referred to as stop words. A strategy to deal with them is to create a stop list containing all irrelevant words. We found that the projects Hcii Paint and Zest Core (see Table I) contain terms that can be ignored in some situations, but not in others. Therefore a generic stop list is not feasible.
      1.           2.           3.            4.                5.
a.) object (89)  set (78)     panel (70)    get (62)          color (58)
b.) get (1657)   node (1087)  figure (784)  connection (634)  set (627)

      6.           7.           8.            9.                10.
a.) paint (57)   action (52)  point (49)    g (46)            add (42)
b.) this (494)   i (387)      layout (376)  graph (364)       y (355)

Table I. Top ten frequent terms in the projects Hcii Paint (painting library) a.) and Zest Core (graph library) b.); brackets show the frequency and significant terms are set in bold face.
Some terms in an identifier repeat information that can already be inferred from the static code structure. Examples are "i" as prefix of an interface name or "abstract" as prefix of an abstract class name. We found six patterns of insignificant terms in class names, presented in Table II. These patterns are used to ignore terms where appropriate: e.g. the interface name IBankAccount is only seen as BankAccount, whereas the terms of the class name IPhone are preserved. A traditional stop list would only allow ignoring "i" completely.
Insignificant term   Static information
Prefix "i"           Interface
Prefix "abstract"    Abstract class
Prefix "j"           Java interface or class
Prefix "default"     Default implementation
Postfix "impl"       Interface implementation
Postfix "test"       JUnit test class/case

Table II. Terms that are ignored if they reflect static information.
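The filtering described by Table II can be sketched as follows. The static facts (e.g. "interface") would come from an AST query in a real implementation; the fact names used here are hypothetical stand-ins.

```python
# Map each insignificant prefix/postfix to the static fact it must reflect.
PREFIX_RULES = {"i": "interface", "abstract": "abstract_class",
                "j": "java_type", "default": "default_impl"}
POSTFIX_RULES = {"impl": "interface_impl", "test": "junit_test"}

def significant_terms(terms, facts):
    """Drop a prefix/postfix term only if the type's static facts confirm it.

    terms: list of lowercase terms of a type name; facts: set of static facts.
    """
    result = list(terms)
    if result and result[0] in PREFIX_RULES and PREFIX_RULES[result[0]] in facts:
        result = result[1:]
    if result and result[-1] in POSTFIX_RULES and POSTFIX_RULES[result[-1]] in facts:
        result = result[:-1]
    return result
```

With this, the interface IBankAccount reduces to (bank, account), while the class IPhone keeps its "i" because the interface fact is absent, matching the example above.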
III. INTRODUCTION LOCATIONS
Definition. We define the tuple (t, c), with an interface or class c and a term t used in c, to be a term introduction if the meaning of t can be understood by reading the code in c.
For example the class Bundle in the OSGi framework
seen in Figure 4 is the introduction location of the term
”bundle” in this framework. In the graph framework Zest
the class GraphConnection is the introduction location
of the term ”connection” in this project.
Different developers normally have different experience and backgrounds. It is a challenge to find a common understanding of a clear introduction location for a term. For this purpose we started with heuristics from a single developer's perspective and plan to extend these in future work towards a heuristic fitting different opinions.
During our evaluation (Section VI-A) we further observed that not all types are good candidates for introduction locations. We used the idea of nano patterns, presented by Høst et al. in [11]. A nano pattern is a common small concept in a method, for example "returns void" or "contains a loop". We defined four such patterns (Table III) to find instructive locations or remove non-instructive locations. An instructive location is a location in code which is a potential introduction location for a heuristic. For example, a method named update with an empty body is not a candidate for an introduction location. Thus, on the one hand we use some of these patterns to filter non-instructive locations, and on the other hand we add additional instructive locations.
Meta-concept         Description                            Instructive
property             private field with public accessors    yes
public constant      public static final field              yes
null implementation  method without any statement           no
rejecter             method with only a throwing statement  no

Table III. Nano-patterns used to define instructive locations.
A. Heuristic
The term introduction heuristic is a combination of three simple heuristics. It was derived during our evaluation; the derivation can be seen later in our study presented in Sections V and VI. The heuristic algorithm is presented in Listing 1.
The three building blocks of this heuristic are the atomic, specialiser and compound heuristics. Furthermore, we also apply a reduction based on static code dependencies. In the following, we explain all four in detail.
1) Atomic: A type with a single term as its name (an atomic name) should represent this concept in its code. An example is the class named Bundle in Figure 4. Thus the atomic heuristic declares an introduction location for
 1 termIntroductionLocations():
 2   locations := atomicNameLocations()
 3   locations := locations ∪ specialiserTermLocations()
 4   locations := compoundTermLocations(locations)
 5   return rootIntroductions(locations)
 6
 7 atomicNameLocations():
 8   locations := {}
 9   foreach named type c ∈ P:
10     if significantTerms(c) = (t) then
11       locations := locations ∪ {(t, c)}
12     foreach e ∈ instructiveLocations(c):
13       if significantTerms(e) = (t) then
14         locations := locations ∪ {(t, c)}
15   return locations
16
17 specialiserTermLocations():
18   sorted := {}
19   foreach named type c ∈ P:
20     t := linguisticSort(significantTerms(c))
21     sorted := sorted ∪ {(t, c)}
22   repeat
23     prefixes := longestReoccuringPrefixes(sorted)
24     foreach ((p1, ..., pl), c) ∈ sorted:
25       if exists 1 ≤ m ≤ l : (p1, ..., pm) ∈ prefixes
26         sorted := sorted \ {((p1, ..., pl), c)}
27                   ∪ {((pm+1, ..., pl), c)}
28   until prefixes = {}
29   locations := {}
30   foreach ((p1, ..., pk), c) ∈ sorted:
31     locations := locations ∪ {(p1, c), ..., (pk, c)}
32   return locations
33
34 compoundTermLocations(locations):
35   repeat
36     found := false
37     foreach named type c ∈ P:
38       t := significantTerms(c)
39       if ti ∉ locations and
40          forall j ≠ i : tj ∈ locations then
41         locations := locations ∪ {(ti, c)}
42         found := true
43   until not found
44   return locations
45
46 rootIntroductions(locations):
47   introductions := {}
48   foreach (t, c) ∈ locations:
49     if noOutgoingDependencies((t, c)) or
50        partOfCycle((t, c)) then
51       introductions := introductions ∪ {(t, c)}
52   return introductions

Listing 1. Heuristic for introduction locations in a program P.
every type name that consists of a single term. Further, a class also expresses concepts in its public or protected interface. Unfortunately, we cannot automatically detect how much care went into choosing a method name. We take only public and protected methods into account. The final atomic heuristic can be seen in Listing 1, lines 7-15.
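The atomic heuristic could be sketched as follows. This is an illustrative rendering of Listing 1, lines 7-15, not the paper's implementation; the representation of a type as a dict with hypothetical fields `name`, `name_terms` (its significant terms) and `instructive` (term sequences of its instructive members) is an assumption made here for brevity.

```python
def atomic_name_locations(types):
    """Collect (term, type) introduction locations for atomic names."""
    locations = set()
    for c in types:
        # Type name consists of a single significant term.
        if len(c["name_terms"]) == 1:
            locations.add((c["name_terms"][0], c["name"]))
        # Instructive members (e.g. public/protected methods) with atomic names
        # also introduce their term in the enclosing type.
        for member_terms in c["instructive"]:
            if len(member_terms) == 1:
                locations.add((member_terms[0], c["name"]))
    return locations
```

For example, a class Bundle yields the introduction location ("bundle", Bundle), matching the description above.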
2) Specialiser: This heuristic analyses type names to locate the most specialised term used in each identifier and creates introduction locations from these. For this purpose it first sorts a copy of each term sequence by linguistic dominance. A word a dominates b if b renders a more precisely. For example, in the sequence "box figure", the word "box" specifies the figure type; therefore "figure" dominates "box". Falleri et al. presented in [10] a set of rules describing how to reorder terms by dominance based on their part of speech tags. The rules are also described in detail in the diploma thesis of Nonnen [12]. An example of this process can be seen in Figure 6.
Terms: ((create, v), (box, n), (figure, np))
Sorted: ((create, v), (figure, np), (box, n))

Figure 6. Linguistic dominance sorting. We use the part of speech tags v=verb, n=noun, and np=plural noun.
The linguistic dominance sorting reorders the original term sequence by generality: the most general term is on the left and the most specific term is on the right. The heuristic collects all reoccurring prefixes and removes them. This process is repeated until no common prefixes can be found. The terms which are left over are considered specific to their location. Therefore the corresponding type is considered as the introduction location for them. An example can be seen in Figure 7, and the algorithm in Listing 1, lines 17-32.
Identifiers: NearCompaniesTracking, IndexBuilderTracking
Sorted terms: ((tracking, n), (company, np), (near, a)), ((tracking, n), (builder, n), (index, n))
Specialiser: (company, np), (near, a), (builder, n), (index, n)

Figure 7. Example for the specialiser heuristic with the part of speech tags v=verb, n=noun, np=plural noun and a=adjective.
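The prefix-removal step could be sketched as follows on already dominance-sorted sequences. This is a simplified illustration of Listing 1, lines 17-32: longestReoccuringPrefixes is approximated here by "first terms shared between sequences", which is enough to reproduce the Figure 7 example but not a full implementation.

```python
def specialiser_term_locations(sorted_terms):
    """sorted_terms: dict mapping a type name to its dominance-sorted terms."""
    current = dict(sorted_terms)
    while True:
        # Collect first terms that reoccur across sequences (shared prefixes).
        firsts = [seq[0] for seq in current.values() if seq]
        shared = {t for t in firsts if firsts.count(t) > 1}
        if not shared:
            break
        # Strip the shared prefix term from every sequence that starts with it.
        current = {c: (seq[1:] if seq and seq[0] in shared else seq)
                   for c, seq in current.items()}
    # Leftover terms are specific to their type: introduction locations.
    return {(t, c) for c, seq in current.items() for t in seq}
```

On the Figure 7 input, the shared prefix "tracking" is removed and the leftover terms (company, near, builder, index) become introduction locations in their respective types.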
3) Compound: After applying the atomic and specialiser heuristics there may be type names where only a single term has no introduction location yet. Such a type thus seems to be a good introduction location for this term. This principle is applied iteratively by our compound heuristic. As an example, after the atomic heuristic introduced the term "bundle", the meaning of "context" can be understood by looking at the type BundleContext. The algorithm can be seen in Listing 1, lines 34-44.
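The iteration could be sketched as follows; this is an illustrative rendering of Listing 1, lines 34-44, assuming a simple dict-of-term-tuples representation of the program's types.

```python
def compound_term_locations(types, locations):
    """types: dict type_name -> term tuple; locations: set of (term, type)."""
    locations = set(locations)
    changed = True
    while changed:
        changed = False
        introduced = {t for t, _ in locations}
        for c, terms in types.items():
            # If exactly one term of the type name is not yet introduced,
            # this type becomes the introduction location of that term.
            unknown = [t for t in terms if t not in introduced]
            if len(unknown) == 1:
                locations.add((unknown[0], c))
                changed = True
    return locations
```

For the example above: given ("bundle", Bundle), the type BundleContext has only "context" uncovered, so ("context", BundleContext) is added; the loop repeats so newly introduced terms can unlock further compound names.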
4) Reduction: The application of all three heuristics discussed above may yield many introduction locations for the same term. Further, some of them are connected via static code dependencies. These dependencies can be modelled as a directed graph with introduction locations as nodes. This model can be employed to identify root introductions. Our algorithm preserves these and removes all of their children. If a method name is introduced in a super class, it is also introduced in the overriding sub class. Nevertheless, we are mostly interested in the super class and want to ignore the sub class location. Therefore only the introduction location in the super class is preserved.
In our evaluation we often encountered the situation that a root node could not be identified. Looking at these cases, we found that this was caused by cycles in the dependency graph. In a cycle, all contained classes are introduction locations for the given term. This suggests that the concept behind the term is distributed across these classes, so a developer may need to take a look at all of them. Therefore we decided to preserve cycles and consider them also as root introductions.
Further, we also applied our introduction rules to external code and incorporated these additional locations into the reduction described above. Libraries may use a different meaning for a term, and developers should be made aware of this situation; a solution might be a project dictionary in which the team describes the terms. A view library class that has a public method refresh is thus seen as an introduction location for "refresh". All project classes overriding this method have a static dependency on the library class; therefore our reduction preserves the library introduction location. The reduction algorithm can be seen in Listing 1, lines 46-52.
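The reduction could be sketched as follows. This is an illustrative rendering of rootIntroductions (Listing 1, lines 46-52); the adjacency-dict representation of the dependency graph between introduction locations is an assumption made here.

```python
def root_introductions(locations, deps):
    """Keep locations without outgoing dependencies, plus those on a cycle.

    deps: dict mapping a location to the locations it depends on.
    """
    def on_cycle(start):
        # Depth-first search for a path from `start` back to itself.
        stack, seen = list(deps.get(start, ())), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(deps.get(node, ()))
        return False

    return {loc for loc in locations
            if not deps.get(loc) or on_cycle(loc)}
```

In the overriding example above, the sub class location depends on the library location and is dropped, while the library location (no outgoing dependencies) survives; locations on a dependency cycle are all preserved.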
IV. STUDY DESIGN
The main goal of the case study was to find an accurate term introduction heuristic. It was designed as an explorative evaluation and performed in the context of the diploma thesis [12]. At first, in the exploration phase, the three heuristics (atomic, specialiser and compound) were defined and evaluated. For their evaluation, precision and recall were estimated on 20 projects (see Figure 8) with the help of over 6000 manually classified samples. We also considered the reduction as a possible improvement of the heuristics, thus all heuristics were evaluated with and without it. The goal of the exploration phase was to find the best heuristic and further improvements. As a result of the exploration phase we decided to combine the existing heuristics into the single final heuristic described before in Section III-A. This heuristic was afterwards evaluated on an independent validation set containing 10 new projects and 2000 randomly drawn samples.
A. Methodology
Project selection: Over 100 projects were initially collected as a base for the evaluation. The project set was gathered based on projects used in [11]. Further projects were collected from GitHub [14], SourceForge [15] and Google Code [16]. Our implementation makes use of the JTransformer [17] tool in version 2.9. JTransformer is an Eclipse plugin and provides a model of the Java abstract syntax tree in Prolog. A major limitation of this tool was that it did not fully support Java generics, which are widely used. All collected projects were tested for whether they work with this version of JTransformer. This resulted in 49 working projects for our evaluation. We decided to partition this set of projects into three groups: one for the exploration phase, one for the validation phase and a third set for future re-evaluation.
Definition. Let T be the set of all terms used in the program. We define the size of the source code vocabulary to be the metric SV = |T|. To measure how many new terms were added to the project as the lines of code (LoC) increase, we define the vocabulary density (VD) to be VD = SV/LoC. Let DT ⊂ T be the set of terms that are contained in a dictionary. We define the dictionary ratio to be |DT|/|T|.
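The three project metrics of the definition above can be computed directly; this small sketch assumes the term list comes from the extraction of Section II and uses a hypothetical word set in place of a real dictionary.

```python
def vocabulary_metrics(all_terms, loc, dictionary):
    """Compute SV, VD and the dictionary ratio for one project.

    all_terms: list of all term occurrences; loc: lines of code;
    dictionary: set of dictionary words.
    """
    vocabulary = set(all_terms)
    sv = len(vocabulary)              # SV = |T|
    vd = sv / loc                     # VD = SV / LoC
    dt = vocabulary & dictionary      # DT, the dictionary terms
    return sv, vd, len(dt) / sv       # dictionary ratio = |DT| / |T|
```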
The decision which of those projects were put in which set was made with the help of the lines of code metric, the vocabulary density and the percentage of dictionary vocabulary. An even distribution of the metric values between both sets was desired, as can be seen in Figure 9. All projects used in the exploration and the validation phase can be seen in Figure 8.
Figure 9. Cumulative distribution function for the exploration and the validation set. Vocabulary density and dictionary ratio are similarly distributed in both sets.
Analyses: Precision and recall are two well established metrics to evaluate correctness and effectiveness. Precision is the ratio between correctly reported term introductions and the total number of reported introductions. Similarly, recall is used to measure the effectiveness by calculating the ratio between correctly reported term introductions and the total number of correct introductions.
For the measurement of precision and recall a complete
classification of the sample space is needed. Considering
the size of our sample space, we decided to randomly draw
samples for each project and manually classify the samples.
The categories were true positive (tp), false positive (fp),
false negative (fn) and true negative (tn). We calculated
precision and recall for the samples and used the point
estimators, described below, to estimate a project precision
and recall.
Precision = tp / (tp + fp),  Recall = tp / (tp + fn)
A positive sample is a tuple (t, c), with t a term that
the heuristic defined as introduced in type c. Similarly for a
Exploration Set: iText (5.0.5), PlanetaMessenger (0.3), IBM WALA (1.3.1), GlazedLists (1.8), Jsch (0.1.44), JWNL (1.4r2), Smack (3.1), Zest (1.2), BCEL (5.2), Jaxen (1.1.3), Commons Logging (1.1.1), JVM Monitor (SVN Revision 30), jEdit (4.3.2), OpenCloud (0.2), QuickUML (2001), Workspacemechanic (2010-11-09), PMD (1.8), TreeTagger4Java (1.0.10), Lexi (0.1.1), Time&Money (0.5.1)

Validation Set: BC-Crypto (1.45), Cobertura (1.9.4.1), zxing (1.6), DDDSample (1.1.1.0), Rhino (CVS 2010-11-09), yGuard (2.3.0.1), BSF (3.1), edu.cmu.hcii.paint [13], NGramJ (1.2.2), Concept Explorer (1.3)

Figure 8. Projects analysed during the evaluation; the project version is noted in brackets.
negative sample it is a tuple (t, c) with a term t that is not introduced in type c.
Assume that the heuristic found hP positives (hN negatives). For each positive sample of size sP (negative sample of size sN) we manually verified that it contained tpS true positives (for negative samples, fnS false negatives). Given these definitions, the number of true positives tp and false negatives fn can be estimated by

tp = ⌊tpS · hP / sP⌋,  fn = ⌊fnS · hN / sN⌋.

With these we can estimate precision and recall by:

Precision = tp / hP,  Recall = tp / (tp + fn)
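The point estimators above can be sketched as a small function; the illustrative numbers in the usage below are invented for demonstration, not values from the study.

```python
from math import floor

def estimate(tp_s, s_p, fn_s, s_n, h_p, h_n):
    """Extrapolate manually classified sample counts to the full result sets.

    tp_s true positives among s_p sampled positives, fn_s false negatives
    among s_n sampled negatives; h_p / h_n are the heuristic's totals.
    """
    tp = floor(tp_s * h_p / s_p)   # estimated true positives
    fn = floor(fn_s * h_n / s_n)   # estimated false negatives
    precision = tp / h_p
    recall = tp / (tp + fn)
    return tp, fn, precision, recall
```

For instance, 75 verified true positives in a sample of 100 out of 400 reported positives, and 10 false negatives in a sample of 100 out of 200 negatives, give tp = 300, fn = 20, a precision of 0.75 and a recall of about 0.94.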
To find improvements, all false positives and false negatives were analysed in detail. The most prominent ones are presented later in Section VI-A in the context of the validation phase.
V. EXPLORATION PHASE
We used the three heuristics atomic, specialiser and compound in this phase. To assess whether a reduction improves precision and recall, we evaluated all three with and without the reduction described in Section III-A4.
Figure 10. Precision for the heuristics across projects w/o reduction of chains in the exploration phase.
Figure 10 presents a box-whisker plot for the precision of the heuristics. The specialiser heuristic showed the worst overall precision values and the highest dispersion. The best results were obtained by the compound heuristic. Only the atomic and reduced atomic heuristics contained projects with 100% precision; thus the majority of samples contained false positives.
Figure 11. Recall for the heuristics across projects w/o reduction of chains in the exploration phase.
Figure 11 presents a box-whisker plot for the recall of the heuristics. The recall values were low, but as we are mainly interested in finding correct locations this was expected. Still, there were samples without any false negatives. The best recall was achieved by the compound heuristic. Interestingly, applying the reduction to the compound heuristic reduced the recall.
VI. VALIDATION PHASE
A. Adjustments
Based on the results of the exploration phase, we combined the heuristics from the exploration phase (see Listing 1, lines 1-5) and addressed false positives and negatives. The result sets of the atomic heuristic and the specialiser heuristic are disjoint, as the atomic heuristic considers only atomic identifiers and the specialiser heuristic considers only compound identifiers. By combining both sets, the compound heuristic was operating on a larger initial set. First, we processed only significant terms. Second, we made more precise which locations we consider to be instructive, and finally we took external introductions into account. Our general
Figure 12. Precision (a) and recall (b) of the heuristics in the exploration phase (average) and the final heuristic in the validation phase.
strategy with our approach is to use only the knowledge given by the code itself.
After the exploration, we evolved the concept of significant terms (described in Section III). During the exploration phase we had seen that 11% of the false positive locations were due to insignificant terms. Additionally, 17% of the false positives were due to test class names repeating the name of the class under test. These are also considered to be insignificant terms.
Additionally, we derived and incorporated the concept of instructive locations into our approach. Besides the locations presented in Table III, we further ignore anonymous classes that only implement inherited methods; the original introduction of these locations should be in the super class or interface. Further, in our first implementation enumeration types were considered as normal classes, but they carry a special meaning. Thus our heuristics also analyse the enumeration constants as instructive locations. The function instructiveLocations(type) used in line 12 of Listing 1 enumerates these within a named type type.
External Introductions: During the exploration phase we considered only types found in the source code of the project. This led to 9% false positive introduction locations that had an external introduction, e.g. "values" in any enumeration type and "clone" in any class overriding the method from java.lang.Object. We extended our analysis to also calculate term introductions for external interfaces and classes. This information was then used to apply the integration of static dependencies as defined in Section III-A4. An introduction of a term in an external library should not gain additional introductions in the source code by inheriting or implementing this functionality. The most prominent example was the implementation of the equals method of java.lang.Object, which alone accounted for 5% of the false positives.
B. Results
The final introduction heuristic performed well in the validation phase and resolved many issues with the heuristics of the exploration phase. Figures 12(a) and 12(b) show box plots for precision and recall. The box on the left shows the result of the exploration phase and the one on the right the result of the final combined heuristic. The median of the precision improved to 75% and the median of the recall to 100%. Our heuristic found introduction locations for a third to a half of the terms in the vocabulary. With a recall of 100% we find all introduction locations for these terms.
We conjecture that the increase in recall, which in turn slowed down the increase in precision, was due to the compound algorithm of our heuristic. Corrections in the atomic and specialiser heuristics lead to a linear growth in recall, but the recursive nature of the compound heuristic leads to an even stronger growth. We expect our future research to verify this empirically.
We further analysed, for each introduced term, how many introduction locations there are on average, and called this metric introduction ambiguity. On average we receive 1.6 locations for each introduced term. In total, 31% of the introduced term vocabulary had multiple introduction locations. Further, we observed a linear correlation of 0.77 between the introduction ambiguity and the lines of code; thus a bigger project resulted in a higher introduction ambiguity. As we had already reduced the introduction locations with static code dependencies, those multiple introduction locations were independent of each other. A manual inspection of the terms with multiple introductions showed that in 15% of those cases, terms were used homonymously.
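The introduction-ambiguity metric can be sketched directly from a set of (term, type) locations; the location pairs in the usage below are invented for illustration.

```python
from collections import Counter

def introduction_ambiguity(locations):
    """Average number of introduction locations per introduced term,
    plus the share of terms with more than one location."""
    counts = Counter(term for term, _ in locations)
    avg = sum(counts.values()) / len(counts)
    multiple = sum(1 for n in counts.values() if n > 1) / len(counts)
    return avg, multiple
```

For example, with locations {("draw", DrawCard), ("draw", CardDealer), ("card", Card)} the metric yields an average of 1.5 locations per term, with half of the terms introduced in more than one place.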
C. Threats to Validity
Construct Threats: The meaning of a term is not always indisputable. Therefore introduction locations may not always explain a concept completely; they still provide a good starting point for its comprehension. To find a general definition of an introduction location, a consensus of multiple opinions and a verification of the current classification is needed. This is in line with the idea of [11] to approximate the collective understanding of a word's meaning by collecting individual ideas.
Statistical Conclusion Threats: Our point estimator heavily relies on the underlying classification. Using confidence intervals to estimate the true number of positives and negatives should provide a more robust model.
Internal Threats: Between the two evaluation phases the
heuristics were changed. To verify that these changes im-
proved the results, the classified samples of the first phase
were re-evaluated with the altered heuristic.
External Threats: One of the main threats is that only one of the authors of this work manually classified the samples. Thus the resulting heuristics depend on a single opinion only. To counter this subjectiveness we performed a small user study and document how the classification was performed (both described below). To support reproducibility and proper labeling of all samples, they are available online in the form of a ZIP file containing all data as Microsoft Office Excel 2007 files. It can be downloaded from our website1.
Preliminary Study: We performed a very small user study
with five students and ten samples (both positive and
negative), asking the participants to classify whether they
consider these to be introduction locations. For three of the
samples only a narrow majority (three to two votes) was
found. The seven others had a strong majority agreement,
with either four to one or five to zero votes. The underlying
classification performed in this work agrees with the user
study in nine out of ten cases; the only difference was found
for one of the narrow-majority samples. We plan to perform
a more detailed and in-depth user study to further validate
our approach and the manual classification.
Classification: The classification was not precisely defined
before the evaluation. Nevertheless, we used the following
rule of thumb to decide whether a term is introduced:
given a term and a class, we declared the class an intro-
duction location if we could understand the term by only
looking at the source code of this class. For this we used the
control and data flow as well as comments. In the special
case of an enum, we verified that the suggested concept was
used as intended. In the case of overridden methods, we also
considered the defining superclass; this was especially used
for the verification of true negatives.
VII. RELATED WORK
Takang et al. [18] compared abbreviated identifiers to full-
word identifiers and uncommented code to commented code.
They observed in their user study that commented code was
more understandable than uncommented code, and that source
code containing full-word identifiers was more understandable
than code with only abbreviated identifiers.
Lawrie et al. analysed in [19] and [20] how identifier qual-
ity and expressiveness affect code comprehension. They
observed that better comprehension is achieved when using
full-word identifiers, although in some cases good compre-
hension was also gained from abbreviated identifiers. They
thus suggest that well-chosen and well-known abbreviations
may be preferable, since identifiers with fewer terms are
easier to remember.
1http://sewiki.iai.uni-bonn.de/private/jannonnen/public/start
Marcus et al. used in [21] the Latent Semantic Indexing
(LSI) technique to locate domain concepts in source
code. They extract identifiers and comments from the code,
split them according to camel-case and underscore rules,
and partition the code into documents. From these they
compute a Singular Value Decomposition (SVD) to obtain
the basis of the LSI space, so that every document is
represented by a vector. The similarity between an input
vector and all document vectors is then used to locate
similar concepts in the code. In contrast to LSI, our approach
does not rely on term frequencies. The principle of similar
concepts is not yet handled in our approach; for this we plan
to incorporate WordNet information on meronym, holonym,
or hypernym relations for a term. The LSI technique was
also used by Kuhn et al. [3] to cluster documents with
similar vectors. These clusters were automatically labelled
with the terms contained in the eigenvectors of each cluster,
and each set of labels is defined to be the topic of the
cluster.
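The LSI pipeline described above can be sketched with numpy (a deliberately tiny illustration; the term-document matrix, its entries, and the rank k are hypothetical, and real LSI uses weighted term frequencies over a large corpus):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Entry (i, j) counts how often term i occurs in document j.
terms = ["account", "transfer", "credit", "parse", "token"]
A = np.array([
    [2, 1, 0],   # account
    [1, 2, 0],   # transfer
    [1, 1, 0],   # credit
    [0, 0, 2],   # parse
    [0, 0, 1],   # token
], dtype=float)

# SVD and rank-k truncation yield the latent (LSI) space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query is folded into the LSI space and compared to all documents.
query = np.array([1, 1, 0, 0, 0], dtype=float)       # "account transfer"
q = np.linalg.inv(np.diag(s[:k])) @ U[:, :k].T @ query
sims = [cosine(q, d) for d in doc_vectors]
print(sims)  # documents 0 and 1 score high, document 2 low
```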
Marcus et al. [22] also compared three search strategies
for concept location: LSI, a grep-based search, and a static
dependency search. They concluded that all three strategies
have advantages and weaknesses. In [22] they further
observed that object-oriented structuring did not provide any
advantage for concept location: some concepts were directly
represented by classes, but others were distributed across the
system.
In the case study performed in [23], Haiduc and Marcus
partitioned the terms into an identifier lexicon and a comment
lexicon to locate domain terms. They reported that 23% of
the domain terms were unique to the comment lexicon and
11% to the identifier lexicon.
The evolution of the term vocabulary size between fixed
project versions was considered by Abebe et al. [24]. They
concluded that the vocabulary grows in correlation with the
lines of code. We used this knowledge in our work to define
the vocabulary density metric.
Tonella and Abebe [25] analysed term vocabularies and their
frequencies within projects. They observed that domain
terms reach a high frequency count. This observation was
also confirmed by our evaluation and was used in the concept
of insignificant terms (see Table II).
Høst and Østvold [11] defined method naming patterns and
the corresponding method meanings in 30 so-called nano
patterns for Java. For every method name in 100 projects they
calculated which patterns are fulfilled. These results are then
combined, and for every method name the nano pattern
frequencies are calculated. From these, similarly to SVD, the
most significant patterns are taken as a naming rule. Further,
they abstracted from concrete method names to method rules
(e.g. the rule "get-<noun>"). We defined four additional
nano patterns to improve our heuristics, see Table III.
The creation of an internal ontology based only on the
identifiers of a project was done by Falleri et al. [10]. Since
developers may use a project-specific, non-general meaning
for a term, this is a major improvement over general-purpose
ontologies. Their algorithm for creating an internal ontology
is used in one of our heuristics; for further information see
the specialiser heuristic in Section III-A.
Deißenbock and Pizka formalised in [26] the idea that
within a given program a concept should always be re-
ferred to by the same name. They further argued that
naming conventions are needed to check this consistency.
The work by Ratiu and Deißenbock [4] identified concepts in
a program by mapping the program graph to ontological
graphs. The ontological graph was constructed from the
WordNet ontology [6].
VIII. CONCLUSION AND FUTURE WORK
In this paper we presented the novel concept of term
introductions and elaborated a heuristic to find them. We
carefully designed a study to validate and improve our
heuristic on a set of over 8000 samples from 30 open source
projects. The two-phase study allowed us to learn from a
broad set of samples. The heuristic achieved good precision
and recall; it is therefore a good first approximation for
handling introduction locations.
Dit et al. presented in [27] a survey and taxonomy of
feature location approaches. In this taxonomy our term
introduction technique can be classified as a static, textual
analysis. It operates on compilable Java source code and
provides its results at class-level granularity. The heuristic
was derived and validated through a quantitative academic
evaluation.
Further improvements to these heuristics could be derived
from a detailed user study. The preliminary study has
already shown that there are valid differing viewpoints on
this concept; such a study should analyse them and establish
a common understanding. It could also help in identifying
good visualizations of introduction locations and term use
dependencies. Further research is needed to determine
how far such visualizations improve code compre-
hension.
The concept of introduction locations can be refined into
two different notions. An introduction location can be
either temporal, a specific point in time when the term was
first used in a project, or spatial, a method or type expressing
the concept. Temporal introductions can for example be
analysed by mining a project history in git (a popular
decentralized source code management system). In this work
we focussed on spatial term introductions and derived a
heuristic for those. We also plan to integrate historical data
in the form of temporal introduction locations.
Abbreviations were found throughout the samples; still, the
presented algorithm did not treat abbreviations differently
from other terms. An expansion of abbreviations, for
example based on the techniques presented by Madani et al.
[28] or Lawrie et al. [29], also seems promising for
our analysis.
During the evaluation we often encountered situations in
which not a single term, but a compound of terms carried
a meaning. This would be an extension of our concept
of insignificant terms. For example, in one case the term
"rich" did not have a special meaning, but the compound
"rich-media" had a special (even domain-specific) meaning.
Terms that globally always co-occur are referred to as
collocations in the diploma thesis of Nonnen [12]. He also
observed that these are not fixed; they may change during
development. Enslen et al. proposed in [30] an improved
identifier splitting algorithm using local term frequency and
co-occurrence. Combined with the above collocation infor-
mation, this algorithm should improve our current splitting
algorithm.
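A basic camel-case and underscore splitter of the kind underlying such analyses can be sketched as follows (a simplified illustration; it handles neither digits between words nor the frequency-based refinements of [30]):

```python
import re

def split_identifier(identifier):
    """Split an identifier on underscores and camel-case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # Boundary 1: lower case or digit followed by upper case
        # (e.g. "termUse" -> "term Use").
        chunk = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", chunk)
        # Boundary 2: acronym followed by a word
        # (e.g. "XMLParser" -> "XML Parser").
        chunk = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", chunk)
        parts.extend(chunk.split())
    return [p.lower() for p in parts]

print(split_identifier("termIntroduction"))   # ['term', 'introduction']
print(split_identifier("XMLHttpRequest"))     # ['xml', 'http', 'request']
print(split_identifier("MAX_retry_Count"))    # ['max', 'retry', 'count']
```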
Nonnen further showed in [12] that multiple introduc-
tion locations for a term contain homonymous uses, and
that a majority of homonyms were located in method names.
The homonym analysis was performed manually; further
work is needed to detect homonyms automatically.
Nevertheless, our current algorithm informs a developer that
a term is introduced at different locations. For example, the
"context" problem in the case study presented in Section I
can be made explicit with the presented approach.
Currently we make use of only four nano patterns. We
share the opinion that there are more useful nano patterns
that could be incorporated to improve our introduction
locations.
A term use dependency is a dependency from a type (class
or interface) that contains a term (but does not introduce
it) to the term's introduction locations. For example, the
term "context" in the case study has two term use
dependencies, one into the OSGi framework and one into
a class of the project itself. This would be a valuable hint
for a developer to detect the homonymous use of the term
"context".
In the evaluation we also measured how many term
use dependencies are inconsistent with traditional static code
dependencies. For example, if class A introduces a term that
is used in B, and A has a static dependency to B, then
both dependencies together form a cycle. In the considered
projects we found on average three such cycles per
1,000 LoC. Further, 55% of these cycles were found inside
a single package. In future work a systematic classification
into good and bad cycles needs to be performed.
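Detecting these two-edge cycles amounts to intersecting the term use dependency relation with the reversed static dependency relation (a minimal sketch; the class names and dependency sets are hypothetical, and a real analysis would operate on the full extracted dependency graphs):

```python
def term_static_cycles(term_use_deps, static_deps):
    """Return pairs (A, B) where B has a term use dependency on A
    (A introduces a term used in B) while A statically depends on B,
    so the two dependencies together form a cycle.
    Both arguments are sets of (source, target) edges."""
    return sorted(
        (a, b)
        for (b, a) in term_use_deps      # B uses a term introduced in A
        if (a, b) in static_deps         # A statically depends on B
    )

# Hypothetical example: util.Context introduces "context", which is
# used in app.Handler, while util.Context also calls into app.Handler.
term_use = {("app.Handler", "util.Context"), ("app.Handler", "util.Log")}
static = {("util.Context", "app.Handler"), ("app.Main", "app.Handler")}
print(term_static_cycles(term_use, static))
# [('util.Context', 'app.Handler')]
```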
Bad cycles may arise from incomplete generalization
refactorings. If a developer extracts some functionality for
reuse into a lower layer, but does not change the names
of the extracted program elements, a term use dependency
from the lower layer to the upper layer might be created. The
static dependencies are (hopefully) from the upper layer to
the lower layer; therefore the dependencies form a cycle.
An explicit model of different conceptual domains, as de-
scribed by Evans in [31], can be defined using the concept of
term introductions. In this context the term use dependencies
help to locate terms used across different models. A clear
visualization should help a software architect by illustrating
model boundaries in existing software. An integration with,
e.g., a classical layered architecture could also reveal
violations of this architecture based solely on term usage.
ACKNOWLEDGMENT
The authors would like to thank Gunter Kniesel and the
anonymous reviewers for their valuable feedback and ideas.
REFERENCES
[1] K. Beck, Implementation Patterns. Addison-Wesley Professional, 2006.
[2] B. Caprile and P. Tonella, "Restructuring program identifier names," in Proc. International Conference on Software Maintenance (ICSM'00), 2000, pp. 97–107.
[3] A. Kuhn, S. Ducasse, and T. Gîrba, "Semantic clustering: Identifying topics in source code," Information and Software Technology, vol. 49, no. 3, pp. 230–243, 2007.
[4] D. Ratiu and F. Deissenbock, "How programs represent reality (and how they don't)," in Proc. of the 13th Working Conference on Reverse Engineering (WCRE'06), 2006.
[5] D. Binkley, "Source code analysis: A road map," in 2007 Future of Software Engineering. IEEE Computer Society, 2007, pp. 104–119.
[6] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge: MIT Press, 1999.
[7] H. Mügge, T. Rho, D. Speicher, P. Bihler, and A. Cremers, "Programming for context-based adaptability - lessons learned about OOP, SOA, and AOP," in Workshop Selbstorganisierende, Adaptive, Kontextsensitive verteilte Systeme, 2007.
[8] M. Porter, "An algorithm for suffix stripping," Program: Electronic Library and Information Systems, vol. 40, no. 3, pp. 211–218, 2006.
[9] H. Schmid, "TreeTagger - a language independent part-of-speech tagger," Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 1995.
[10] J. Falleri, M. Huchard, M. Lafourcade, C. Nebut, V. Prince, and M. Dao, "Automatic extraction of a WordNet-like identifier network from software," in Proc. of the 18th International Conference on Program Comprehension (ICPC'10). IEEE, 2010, pp. 4–13.
[11] E. Høst and B. Østvold, "Debugging method names," in Proc. 23rd ECOOP, 2009.
[12] J. Nonnen, "Naming consistency in source code identifiers," Diploma Thesis, University of Bonn, 2011.
[13] A. Ko, B. Myers, M. Coblenz, and H. Aung, "An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks," IEEE Transactions on Software Engineering, pp. 971–987, 2006.
[14] "Secure source code hosting and collaborative development - GitHub." [Online]. Available: https://github.com/
[15] "SourceForge.net: Find, create, and publish open source software for free." [Online]. Available: http://sourceforge.net/
[16] "Google Code." [Online]. Available: http://code.google.com/
[17] G. Kniesel, J. Hannemann, and T. Rho, "A comparison of logic-based infrastructures for concern detection and extraction," in Proc. of the 3rd Workshop on Linking Aspect Technology and Evolution. ACM, 2007, p. 6.
[18] A. A. Takang, P. A. Grubb, and R. D. Macredie, "The effects of comments and identifier names on program comprehensibility: An experimental investigation," J. Prog. Lang., vol. 4, no. 3, pp. 143–167, 1996.
[19] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "What's in a name? A study of identifiers," in Proc. of the 14th International Conference on Program Comprehension (ICPC'06), 2006.
[20] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "Effective identifier names for comprehension and memory," ISSE, vol. 3, no. 4, pp. 303–318, 2007.
[21] A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic, "An information retrieval approach to concept location in source code," in Proc. of the 11th Working Conference on Reverse Engineering (WCRE'04), 2004.
[22] A. Marcus, V. Rajlich, J. Buchta, M. Petrenko, and A. Sergeyev, "Static techniques for concept location in object-oriented code," in Proc. of the 13th International Workshop on Program Comprehension (IWPC'05). IEEE Computer Society, 2005, pp. 33–42.
[23] S. Haiduc and A. Marcus, "On the use of domain terms in source code," in Proc. of the 16th International Conference on Program Comprehension (ICPC'08). IEEE Computer Society, 2008, pp. 113–122.
[24] S. Abebe, S. Haiduc, A. Marcus, P. Tonella, and G. Antoniol, "Analyzing the evolution of the source code vocabulary," in Proc. of the 13th European Conference on Software Maintenance and Reengineering (CSMR'09), 2009, pp. 189–198.
[25] P. Tonella and S. Abebe, "Code quality from the programmer's perspective," in Proc. of XII Advanced Computing and Analysis Techniques in Physics Research, 2008.
[26] F. Deißenbock and M. Pizka, "Concise and consistent naming," in Proc. of the 13th International Workshop on Program Comprehension (IWPC'05). Washington, DC, USA: IEEE Computer Society, 2005, pp. 97–106.
[27] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk, "Feature location in source code: A taxonomy and survey," Journal of Software Maintenance and Evolution: Research and Practice (JSME), 2012, accepted.
[28] N. Madani, L. Guerrouj, M. Di Penta, Y. Guéhéneuc, and G. Antoniol, "Recognizing words from source code identifiers using speech recognition techniques," in Proc. of the 14th European Conference on Software Maintenance and Reengineering (CSMR'10), 2010.
[29] D. Lawrie, H. Feild, and D. Binkley, "Extracting meaning from abbreviated identifiers," in Proc. of the 2007 IEEE Workshop on Source Code Analysis and Manipulation (SCAM'07), 2007.
[30] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, "Mining source code to automatically split identifiers for software analysis," in Proc. of the 6th IEEE International Working Conference on Mining Software Repositories (MSR'09). Washington, DC, USA: IEEE Computer Society, 2009, pp. 71–80.
[31] E. Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2009.