[IEEE 2011 18th Working Conference on Reverse Engineering (WCRE) - Limerick, Ireland...
Locating the Meaning of Terms in Source Code: Research on "Term Introduction"
Jan Nonnen, Daniel Speicher, Paul Imhoff
University of Bonn, Computer Science III
Bonn, Germany
{nonnen, dsp, imhoffj}@cs.uni-bonn.de
Abstract—Software developers are often facing the challenge of understanding a large code base. Program comprehension is not only achieved by looking at object interactions, but also by considering the meaning of the identifiers and the contained terms. Ideally, the source code should exemplify this meaning. We propose to call the source code locations that define the meaning of a term term introduction. We further derive a heuristic to determine the introduction location with the help of an explorative study. This study was performed on 8000 manually evaluated samples gained from 30 open source projects. To support reproducibility, all samples and classifications are also available online. The achieved results show a precision of 75% for the heuristic.
Keywords—identifier analysis; name meaning; empirical evaluation; explorative approach; program comprehension; source code analysis
I. INTRODUCTION
Beck emphasises in [1] the importance of intention revealing names, as an identifier is written once, but read multiple times. Moreover, Caprile and Tonella [2] stated that "identifier names are one of the most important sources of information about program entities".
As Figure 1 shows, snippet b) offers almost as much information to the developer as code snippet a). On the other hand, code snippet c) may be equivalent to a) from the perspective of the compiler, but conveys no intentions to the developer.
Therefore identifier analysis has become an active re-
search area [3], [4], [5]. A variety of research questions
is approached with different methodologies and different
algorithms.
a) void transfer(Account from, Account to, float amount) {
     from.debit(amount); to.credit(amount);
   }

b) transfer from to amount from debit amount to credit amount

c) void m1(T1 v1, T1 v2, float v3) {
     v1.m2(v3); v2.m3(v3);
   }

Figure 1. Different views on the source code: the normal code a), how the developers b) and the compiler c) see it.
For instance Information Retrieval techniques have been
applied to the analysis of identifiers, e.g. in automatically
retrieving the "topics" of the code [3]. While this approach is useful in many areas, in the case of complex control flow these "topics" may not be precise enough to understand the method in detail.
Deißenbock and Ratiu [4] use the external ontology WordNet [6] to identify whether class names are named concisely. As useful as an external ontology is in many cases,
it does not always represent the real semantics of a program
as developers have sometimes very unusual mental models.
For instance, with the word "cat" WordNet (and most people) will associate the animal, but a developer might associate the command to concatenate files. While any Unix developer will feel comfortable reading the command killCat in a program, any non-programmer will immediately fear for the health of her feline friends.
We agree with Binkley [5] that algorithms dedicated to naming analysis in source code should also be able to consider the code structure. Our work solely takes advantage of structural information, without referencing an external ontology.
As a possible application scenario, the presented analysis could be integrated into a development environment. A dictionary explaining the terms used inside a method, with references to their introduction locations, could be used during software maintenance. Current development environments (e.g. Eclipse) allow navigating from a reference in the source code to its declaration with just one click. In the same way, one could add the ability to navigate from a term in an identifier to its introduction location.
General Idea
Identifiers can be composed of different terms with special meaning. When trying to understand these terms, developers often have to follow their usage throughout the source code. In some situations the meaning of a term can be located at a source code location, denoted as the "introduction location". In this work we want to explore and find such introduction locations. Different meanings for the same term normally
2011 18th Working Conference on Reverse Engineering, 1095-1350/11 $26.00 © 2011 IEEE. DOI 10.1109/WCRE.2011.21
should also result in different introduction locations, each
for a different meaning.
As an example consider a Poker application with iden-
tifiers DrawCard, a type responsible for painting parts of
the user interface, and drawNewCard, a method to give a
player a new card. Figure 2 illustrates this situation. The term
”draw” has two different meanings, and thus two different
introduction locations.
Figure 2. A general idea of term introductions.
Example
Standard Information Retrieval techniques, e.g. word frequency, do not necessarily consider different meanings of the same word. Thus a reuse of the same word in different identifiers conveys the idea that both occurrences refer to the same meaning. The following case study presents a case in which the same word refers to different meanings.
The method signature in Figure 3 uses the term "context" in both parameter names: bundleContext contains the two terms "bundle" and "context", and contextObserver contains the two terms "context" and "observer".
public static ServiceTracker openTracker(BundleContext bundleContext, Observer contextObserver)

Figure 3. At a first glance the term "context" seems to be used with the same meaning in both parameters.
At first sight, both "contexts" in Figure 3 seem to have the same meaning. A deeper analysis of the concepts behind the identifiers yields that two meanings can be found, each in a different part of the code. Figure 2 illustrates the search for each meaning.
The case study was taken from our research project Context Sensitive Intelligence [7]. One meaning of "context" refers to the OSGi framework which the project uses, and the other is a physical location of a mobile device. The concrete paths used to identify both meanings can be seen in detail in Figure 4.
In case a term is used homonymously, it is normally good practice to clarify the meaning by choosing a different name for one of the meanings. But in our example the term "context" has already been set by the research project and by the framework, respectively. Therefore the parameter names are consistent and should not be renamed. In this situation any tooling that helps a developer be aware of the different meanings would be very useful. Also, an automatic detection of homonyms based on an automated identification of introduction locations could be very helpful for program understanding.

Figure 4. Types related to the type containing the method openTracker(bundleContext, contextObserver).
Outline
The remainder of this paper is structured as follows: In Section II we describe how identifiers are preprocessed to obtain the terms contained in them. These terms represent the foundation for our analysis and are used to define the concept of a term introduction in Section III. To find an approximation for the identification of term introduction locations, we performed an explorative study. We present the design of this study in Section IV and the results in Sections V and VI. The last section discusses possible future work.
II. TERM APPROACH
At first, all identifiers are chopped into a sequence of tokens, based on camel-case rules. The camel-case convention separates tokens by capitalizing the initial letter of each element. For example, createFisheyeFigures yields the sequence (create, fisheye, figures). Each token has a linguistic canonical form, referred to as its lemma. For example, in English "run", "runs", "ran" and "running" are forms of the same lemma, conventionally written as "run". These lemmata, obtained from identifiers, are denoted as terms in this work; see Figure 5 for an example.
Identifier: createFisheyeFigures
Tokens: (create, fisheye, figures)
Terms: ((create, v), (fisheye, n), (figure, np))

Figure 5. Preprocessing of the identifiers. We use the part of speech tags v=verb, n=noun, and np=noun in plural.
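The preprocessing step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the regular expression only handles camel-case boundaries, and the small lemma table stands in for a dictionary-based lemmatizer such as TreeTagger (its entries here are illustrative).

```python
import re

def split_camel_case(identifier):
    """Split an identifier into lowercase tokens at camel-case boundaries."""
    tokens = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', identifier)
    return [t.lower() for t in tokens]

# Toy lemma table: maps a token to its (lemma, part-of-speech) pair.
LEMMAS = {"figures": ("figure", "np"), "create": ("create", "v"),
          "fisheye": ("fisheye", "n")}

def terms(identifier):
    """Map tokens to (lemma, POS) pairs, preserving unknown tokens as nouns."""
    return [LEMMAS.get(tok, (tok, "n")) for tok in split_camel_case(identifier)]
```

For the example of Figure 5, `terms("createFisheyeFigures")` yields ((create, v), (fisheye, n), (figure, np)).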
Besides lemmatization, stemming is frequently used to calculate a canonical form. Stemming is a heuristic rule-driven process that chops off the end of tokens in order to obtain the lemma. A currently popular stemming algorithm for English is Porter's algorithm [8]. It represents a fast normalization approach without the need for a dictionary. Lemmatization uses a dictionary and performs a proper morphological analysis. It utilizes part of speech tagging to retrieve the lemma. A popular lemmatization tool is TreeTagger [9]. Lemmatization preserves tokens that are not part of the dictionary, whereas stemming simply applies its rules. Taking these considerations into account, we decided to use lemmatization as a preprocessing step. The token "saw" illustrates the difference between the two strategies. On the one hand, stemming might reduce the token to "s". On the other hand, lemmatization normalizes it to "see" if it is used as a verb, and to "saw" if it is used as a noun.
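The contrast between the two strategies can be sketched with a toy example. Both functions are deliberately simplistic stand-ins: the suffix rules are not Porter's real algorithm, and the mini-dictionary is not TreeTagger's output, but they show why blind rule application and dictionary lookup behave differently.

```python
def toy_stem(token):
    """Blind rule-driven suffix chopping, in the spirit of stemming."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token

# Toy lemma dictionary keyed by (token, part-of-speech).
LEMMA_DICT = {("saw", "v"): "see", ("saw", "n"): "saw", ("ran", "v"): "run"}

def toy_lemmatize(token, pos):
    """Dictionary lookup; tokens not in the dictionary are preserved."""
    return LEMMA_DICT.get((token, pos), token)
```

Here `toy_stem("running")` produces the non-word "runn", while `toy_lemmatize("saw", "v")` yields "see" and `toy_lemmatize("saw", "n")` keeps "saw", mirroring the distinction made above.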
After defining what terms are, a valid question is whether all terms are relevant. Text search in the domain of Information Retrieval acknowledges that some common words appear to be of little value; these are referred to as stop words. A strategy to deal with them is to create a stop list containing all irrelevant words. We found that the projects Hcii Paint and Zest Core (see Table I) contain terms that can be ignored in some situations, but not in others. Therefore a generic stop list is not feasible.
      1.           2.           3.            4.                5.
a.) object (89)  set (78)     panel (70)    get (62)          color (58)
b.) get (1657)   node (1087)  figure (784)  connection (634)  set (627)

      6.           7.           8.            9.                10.
a.) paint (57)   action (52)  point (49)    g (46)            add (42)
b.) this (494)   i (387)      layout (376)  graph (364)       y (355)

Table I. Top ten frequent terms in the projects Hcii Paint (painting library) a.) and Zest Core (graph library) b.); brackets show the frequency and significant terms are set in bold face.
Some terms in an identifier repeat information that can already be inferred from the static code structure. Examples are "i" as prefix of an interface name or "abstract" as prefix of an abstract class name. We found six patterns of insignificant terms in class names, presented in Table II. These patterns are used to ignore terms where appropriate: e.g. the interface name IBankAccount is only seen as BankAccount, whereas the terms of the class name IPhone are preserved. A traditional stop list would only allow ignoring "i" completely.
Insignificant term   Static information
Prefix "i"           Interface
Prefix "abstract"    Abstract class
Prefix "j"           Java interface or class
Prefix "default"     Default implementation
Postfix "impl"       Interface implementation
Postfix "test"       JUnit test class/case

Table II. Terms that are ignored if they reflect static information.
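The filtering described by Table II can be sketched as follows. The static facts (e.g. "interface") would come from an AST query in a real implementation; the fact names used here are hypothetical stand-ins.

```python
# Map each insignificant prefix/postfix to the static fact it must reflect.
PREFIX_RULES = {"i": "interface", "abstract": "abstract_class",
                "j": "java_type", "default": "default_impl"}
POSTFIX_RULES = {"impl": "interface_impl", "test": "junit_test"}

def significant_terms(terms, facts):
    """Drop a prefix/postfix term only if the type's static facts confirm it.

    terms: list of lowercase terms of a type name; facts: set of static facts.
    """
    result = list(terms)
    if result and result[0] in PREFIX_RULES and PREFIX_RULES[result[0]] in facts:
        result = result[1:]
    if result and result[-1] in POSTFIX_RULES and POSTFIX_RULES[result[-1]] in facts:
        result = result[:-1]
    return result
```

With this, the interface IBankAccount reduces to (bank, account), while the class IPhone keeps its "i" because the interface fact is absent, matching the example above.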
III. INTRODUCTION LOCATIONS
Definition. We define the tuple (t, c), with an interface or class c and a term t used in c, to be a term introduction if the meaning of t can be understood by reading the code in c.
For example the class Bundle in the OSGi framework
seen in Figure 4 is the introduction location of the term
”bundle” in this framework. In the graph framework Zest
the class GraphConnection is the introduction location
of the term ”connection” in this project.
Different developers normally have different experience and backgrounds. It is a challenge to find a common understanding of a clear introduction location for a term. For this purpose we started with heuristics from a single developer's perspective and plan to extend these in future work towards a heuristic fitting different opinions.
During our evaluation (Section VI-A) we further observed that not all types are good candidates for introduction locations. We used the idea of nano patterns, presented by Høst et al. in [11]. A nano pattern is a common small concept in a method, for example "returns void" or "contains a loop". We defined four such patterns (Table III) to find instructive locations or remove non-instructive locations. An instructive location is a location in code which is a potential introduction location for a heuristic. For example, a method named update with an empty body is not a candidate for an introduction location. Thus, on the one hand we use some of these patterns to filter non-instructive locations, and on the other hand we add additional instructive locations.
Meta-concept         Description                            Instructive
property             private field with public accessors    yes
public constant      public static final field              yes
null implementation  method without any statement           no
rejecter             method with only a throwing statement  no

Table III. Nano-patterns used to define instructive locations.
A. Heuristic
The term introduction heuristic is a combination of three simple heuristics. It was derived during our evaluation; the derivation can be seen later in our study presented in Sections V and VI. The heuristic algorithm is presented in Listing 1.
The three building blocks of this heuristic are the atomic, specialiser and compound heuristics. Furthermore, we also apply a reduction based on static code dependencies. In the following, we explain all four in detail.
1) Atomic: A type with a single term as its name (an atomic name) should represent this concept in its code. An example is the class named Bundle in Figure 4. Thus the atomic heuristic declares an introduction location for
 1 termIntroductionLocations():
 2   locations := atomicNameLocations()
 3   locations := locations ∪ specialiserTermLocations()
 4   locations := compoundTermLocations(locations)
 5   return rootIntroductions(locations)
 6
 7 atomicNameLocations():
 8   locations := {}
 9   foreach named type c ∈ P:
10     if significantTerms(c) = (t) then
11       locations := locations ∪ {(t, c)}
12     foreach e ∈ instructiveLocations(c):
13       if significantTerms(e) = (t) then
14         locations := locations ∪ {(t, c)}
15   return locations
16
17 specialiserTermLocations():
18   sorted := {}
19   foreach named type c ∈ P:
20     t := linguisticSort(significantTerms(c))
21     sorted := sorted ∪ {(t, c)}
22   repeat
23     prefixes := longestReoccuringPrefixes(sorted)
24     foreach ((p1, ..., pl), c) ∈ sorted:
25       if exists 1 ≤ m ≤ l : (p1, ..., pm) ∈ prefixes
26         sorted := sorted \ {((p1, ..., pl), c)}
27                   ∪ {((pm+1, ..., pl), c)}
28   until prefixes = {}
29   locations := {}
30   foreach ((p1, ..., pk), c) ∈ sorted:
31     locations := locations ∪ {(p1, c), ..., (pk, c)}
32   return locations
33
34 compoundTermLocations(locations):
35   repeat
36     found := false
37     foreach named type c ∈ P:
38       t := significantTerms(c)
39       if ti ∉ locations and
40          forall j ≠ i : tj ∈ locations then
41         locations := locations ∪ {(ti, c)}
42         found := true
43   until not found
44   return locations
45
46 rootIntroductions(locations):
47   introductions := {}
48   foreach (t, c) ∈ locations:
49     if noOutgoingDependencies((t, c)) or
50        partOfCycle((t, c)) then
51       introductions := introductions ∪ {(t, c)}
52   return introductions

Listing 1. Heuristic for introduction locations in a program P.
every type name that consists of a single term. Further, a class also expresses concepts in its public or protected interface. Unfortunately, we cannot automatically detect how much care went into choosing a method name. We take only public and protected methods into account. The final atomic heuristic can be seen in Listing 1, lines 7-15.
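The atomic heuristic could be sketched as follows. This is an illustrative rendering of Listing 1, lines 7-15, not the paper's implementation; the representation of a type as a dict with hypothetical fields `name`, `name_terms` (its significant terms) and `instructive` (term sequences of its instructive members) is an assumption made here for brevity.

```python
def atomic_name_locations(types):
    """Collect (term, type) introduction locations for atomic names."""
    locations = set()
    for c in types:
        # Type name consists of a single significant term.
        if len(c["name_terms"]) == 1:
            locations.add((c["name_terms"][0], c["name"]))
        # Instructive members (e.g. public/protected methods) with atomic names
        # also introduce their term in the enclosing type.
        for member_terms in c["instructive"]:
            if len(member_terms) == 1:
                locations.add((member_terms[0], c["name"]))
    return locations
```

For example, a class Bundle yields the introduction location ("bundle", Bundle), matching the description above.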
2) Specialiser: This heuristic analyses type names to locate the most specialised term used in each identifier and creates introduction locations from these. For this purpose it first sorts a copy of each term sequence by linguistic dominance. A word a dominates b if b renders a more precisely. For example, in the sequence "box figure", the word "box" specifies the figure type; therefore "figure" dominates "box". Falleri et al. presented in [10] a set of rules describing how to reorder terms by dominance based on their part of speech tags. The rules are also described in detail in the diploma thesis of Nonnen [12]. An example of this process can be seen in Figure 6.
Terms: ((create, v), (box, n), (figure, np))
Sorted: ((create, v), (figure, np), (box, n))

Figure 6. Linguistic dominance sorting. We use the part of speech tags v=verb, n=noun, and np=plural noun.
The linguistic dominance sorting reorders the original term sequence by generality: the most general term is on the left and the most specific term is on the right. The heuristic collects all reoccurring prefixes and removes them. This process is repeated until no common prefixes can be found. The terms which are left over are considered specific to their location. Therefore the corresponding type is considered as the introduction location for them. An example can be seen in Figure 7, and the algorithm in Listing 1, lines 17-32.
Identifiers: NearCompaniesTracking, IndexBuilderTracking
Sorted terms: ((tracking, n), (company, np), (near, a)), ((tracking, n), (builder, n), (index, n))
Specialiser: (company, np), (near, a), (builder, n), (index, n)

Figure 7. Example for the specialiser heuristic with the part of speech tags v=verb, n=noun, np=plural noun and a=adjective.
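The prefix-removal step could be sketched as follows on already dominance-sorted sequences. This is a simplified illustration of Listing 1, lines 17-32: longestReoccuringPrefixes is approximated here by "first terms shared between sequences", which is enough to reproduce the Figure 7 example but not a full implementation.

```python
def specialiser_term_locations(sorted_terms):
    """sorted_terms: dict mapping a type name to its dominance-sorted terms."""
    current = dict(sorted_terms)
    while True:
        # Collect first terms that reoccur across sequences (shared prefixes).
        firsts = [seq[0] for seq in current.values() if seq]
        shared = {t for t in firsts if firsts.count(t) > 1}
        if not shared:
            break
        # Strip the shared prefix term from every sequence that starts with it.
        current = {c: (seq[1:] if seq and seq[0] in shared else seq)
                   for c, seq in current.items()}
    # Leftover terms are specific to their type: introduction locations.
    return {(t, c) for c, seq in current.items() for t in seq}
```

On the Figure 7 input, the shared prefix "tracking" is removed and the leftover terms (company, near, builder, index) become introduction locations in their respective types.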
3) Compound: After applying the atomic and specialiser heuristics there may be type names where only a single term has no introduction location yet. Such a type thus seems to be a good introduction location for this term. This principle is applied iteratively by our compound heuristic. As an example, after the atomic heuristic introduced the term "bundle", the meaning of "context" can be understood by looking at the type BundleContext. The algorithm can be seen in Listing 1, lines 34-44.
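The iteration could be sketched as follows; this is an illustrative rendering of Listing 1, lines 34-44, assuming a simple dict-of-term-tuples representation of the program's types.

```python
def compound_term_locations(types, locations):
    """types: dict type_name -> term tuple; locations: set of (term, type)."""
    locations = set(locations)
    changed = True
    while changed:
        changed = False
        introduced = {t for t, _ in locations}
        for c, terms in types.items():
            # If exactly one term of the type name is not yet introduced,
            # this type becomes the introduction location of that term.
            unknown = [t for t in terms if t not in introduced]
            if len(unknown) == 1:
                locations.add((unknown[0], c))
                changed = True
    return locations
```

For the example above: given ("bundle", Bundle), the type BundleContext has only "context" uncovered, so ("context", BundleContext) is added; the loop repeats so newly introduced terms can unlock further compound names.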
4) Reduction: The application of all three heuristics discussed above may yield many introduction locations for the same term. Further, some of them are connected via static code dependencies. These dependencies can be modelled as a directed graph with introduction locations as nodes. This model can be employed to identify root introductions. Our algorithm preserves these and removes all of their children. If a method name is introduced in a super class, it is also introduced in the overriding sub class. Nevertheless, we are mostly interested in the super class and want to ignore the sub class location. Therefore only the introduction location in the super class is preserved.
In our evaluation we often encountered the situation that a root node could not be identified. Looking at these cases, we found that this was caused by cycles in the dependency graph. In a cycle, all contained classes are introduction locations for the given term. This suggests that the concept behind the term is distributed across these classes, so a developer may need to take a look at all of them. Therefore we decided to preserve cycles and consider them also as root introductions.
Further, we also applied our introduction rules to external code and incorporated these additional locations into the reduction described above. Libraries may use a different meaning for a term, and developers should be made aware of this situation; a solution might be a project dictionary in which the team describes the terms. A view library class that has a public method refresh is thus seen as an introduction location for "refresh". All project classes overriding this method have a static dependency on the library class; therefore our reduction preserves the library introduction location. The reduction algorithm can be seen in Listing 1, lines 46-52.
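The reduction could be sketched as follows. This is an illustrative rendering of rootIntroductions (Listing 1, lines 46-52); the adjacency-dict representation of the dependency graph between introduction locations is an assumption made here.

```python
def root_introductions(locations, deps):
    """Keep locations without outgoing dependencies, plus those on a cycle.

    deps: dict mapping a location to the locations it depends on.
    """
    def on_cycle(start):
        # Depth-first search for a path from `start` back to itself.
        stack, seen = list(deps.get(start, ())), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(deps.get(node, ()))
        return False

    return {loc for loc in locations
            if not deps.get(loc) or on_cycle(loc)}
```

In the overriding example above, the sub class location depends on the library location and is dropped, while the library location (no outgoing dependencies) survives; locations on a dependency cycle are all preserved.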
IV. STUDY DESIGN
The main goal of the case study was to find an accurate term introduction heuristic. It was designed as an explorative evaluation and performed in the context of the diploma thesis [12]. At first, in the exploration phase, the three heuristics (atomic, specialiser and compound) were defined and evaluated. For their evaluation, precision and recall were estimated on 20 projects (see Figure 8) with the help of over 6000 manually classified samples. We also considered the reduction as a possible improvement of the heuristics, thus all heuristics were evaluated with and without it. The goal of the exploration phase was to find the best heuristic and further improvements. As a result of the exploration phase we decided to combine the existing heuristics into the single final heuristic described before in Section III-A. This heuristic was afterwards evaluated on an independent validation set containing 10 new projects and 2000 randomly drawn samples.
A. Methodology
Project selection: Over 100 projects were initially collected as a base for the evaluation. The project set was gathered based on projects used in [11]. Further projects were collected from GitHub [14], SourceForge [15] and Google Code [16]. Our implementation makes use of the JTransformer [17] tool in version 2.9. JTransformer is an Eclipse plugin and provides a model of the Java abstract syntax tree in Prolog. A major limitation of this tool was that it did not fully support Java generics, which are widely used. All collected projects were tested for whether they work with this version of JTransformer. This resulted in 49 working projects for our evaluation. We decided to partition this set of projects into three groups: one for the exploration phase, one for the validation phase and a third set for future re-evaluation.
Definition. Let T be the set of all terms used in the program. We define the size of the source code vocabulary to be the metric SV = |T|. To measure how many new terms were added to the project as the lines of code (LoC) increase, we define the vocabulary density (VD) to be VD = SV/LoC. Let DT ⊂ T be the set of terms that are contained in a dictionary. We define the dictionary ratio to be |DT|/|T|.
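The three project metrics of the definition above can be computed directly; this small sketch assumes the term list comes from the extraction of Section II and uses a hypothetical word set in place of a real dictionary.

```python
def vocabulary_metrics(all_terms, loc, dictionary):
    """Compute SV, VD and the dictionary ratio for one project.

    all_terms: list of all term occurrences; loc: lines of code;
    dictionary: set of dictionary words.
    """
    vocabulary = set(all_terms)
    sv = len(vocabulary)              # SV = |T|
    vd = sv / loc                     # VD = SV / LoC
    dt = vocabulary & dictionary      # DT, the dictionary terms
    return sv, vd, len(dt) / sv       # dictionary ratio = |DT| / |T|
```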
The decision which of those projects were put in which set was made with the help of the lines of code metric, the vocabulary density and the percentage of dictionary vocabulary. An even distribution of the metric values between both sets was desired, as can be seen in Figure 9. All projects used in the exploration and the validation phase can be seen in Figure 8.
Figure 9. Cumulative distribution function for the exploration and the validation set. Vocabulary density and dictionary ratio are similarly distributed in both sets.
Analyses: Precision and recall are two well established metrics to evaluate correctness and effectiveness. Precision is the ratio between correctly reported term introductions and the total number of reported introductions. Similarly, recall is used to measure the effectiveness by calculating the ratio between correctly reported term introductions and the total number of correct introductions.
For the measurement of precision and recall a complete
classification of the sample space is needed. Considering
the size of our sample space, we decided to randomly draw
samples for each project and manually classify the samples.
The categories were true positive (tp), false positive (fp),
false negative (fn) and true negative (tn). We calculated
precision and recall for the samples and used the point
estimators, described below, to estimate a project precision
and recall.
Precision = tp / (tp + fp),  Recall = tp / (tp + fn)
A positive sample is a tuple (t, c), with t a term that
the heuristic defined as introduced in type c. Similarly for a
Exploration Set: iText (5.0.5), PlanetaMessenger (0.3), IBM WALA (1.3.1), GlazedLists (1.8), Jsch (0.1.44), JWNL (1.4r2), Smack (3.1), Zest (1.2), BCEL (5.2), Jaxen (1.1.3), Commons Logging (1.1.1), JVM Monitor (SVN Revision 30), jEdit (4.3.2), OpenCloud (0.2), QuickUML (2001), Workspacemechanic (2010-11-09), PMD (1.8), TreeTagger4Java (1.0.10), Lexi (0.1.1), Time&Money (0.5.1)

Validation Set: BC-Crypto (1.45), Cobertura (1.9.4.1), zxing (1.6), DDDSample (1.1.1.0), Rhino (CVS 2010-11-09), yGuard (2.3.0.1), BSF (3.1), edu.cmu.hcii.paint [13], NGramJ (1.2.2), Concept Explorer (1.3)

Figure 8. Projects analysed during the evaluation; the project version is noted in brackets.
negative sample it is a tuple (t, c) with a term t that is not introduced in type c.
Assume that the heuristic found hP positives (hN negatives). For each positive sample of size sP (negative sample of size sN) we manually verified that it contained tpS true positives (for negative samples, fnS false negatives). Given these definitions, the number of true positives tp and false negatives fn can be estimated by

tp = ⌊tpS · hP / sP⌋,  fn = ⌊fnS · hN / sN⌋.

With these we can estimate precision and recall by:

Precision = tp / hP,  Recall = tp / (tp + fn)
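The point estimators above can be sketched as a small function; the illustrative numbers in the usage below are invented for demonstration, not values from the study.

```python
from math import floor

def estimate(tp_s, s_p, fn_s, s_n, h_p, h_n):
    """Extrapolate manually classified sample counts to the full result sets.

    tp_s true positives among s_p sampled positives, fn_s false negatives
    among s_n sampled negatives; h_p / h_n are the heuristic's totals.
    """
    tp = floor(tp_s * h_p / s_p)   # estimated true positives
    fn = floor(fn_s * h_n / s_n)   # estimated false negatives
    precision = tp / h_p
    recall = tp / (tp + fn)
    return tp, fn, precision, recall
```

For instance, 75 verified true positives in a sample of 100 out of 400 reported positives, and 10 false negatives in a sample of 100 out of 200 negatives, give tp = 300, fn = 20, a precision of 0.75 and a recall of about 0.94.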
To find improvements, all false positives and false negatives were analysed in detail. The most prominent ones are presented later in Section VI-A in the context of the validation phase.
V. EXPLORATION PHASE
We used the three heuristics atomic, specialiser and compound in this phase. To assess whether a reduction improves precision and recall, we evaluated all three with and without the reduction described in Section III-A4.
Figure 10. Precision for the heuristics across projects w/o reduction of chains in the exploration phase.
Figure 10 presents a box-whisker plot for the precision of the heuristics. The specialiser heuristic showed the worst overall precision values and the highest dispersion. The best results were obtained by the compound heuristic. Only the atomic and reduced atomic heuristics contained projects with 100% precision; thus the majority of samples contained false positives.
Figure 11. Recall for the heuristics across projects w/o reduction of chains in the exploration phase.
Figure 11 presents a box-whisker plot for the recall of the heuristics. The recall values were low, but as we are mainly interested in finding correct locations this was expected. Still, there were samples without any false negatives. The best recall was achieved by the compound heuristic. Interestingly, applying the reduction to the compound heuristic reduced the recall.
VI. VALIDATION PHASE
A. Adjustments
Based on the results of the exploration phase, we combined the heuristics from the exploration phase (see Listing 1, lines 1-5) and addressed false positives and negatives. The result sets of the atomic heuristic and the specialiser heuristic are disjoint, as the atomic heuristic considers only atomic identifiers and the specialiser heuristic considers only compound identifiers. By combining both sets, the compound heuristic was operating on a larger initial set. First, we processed only significant terms. Second, we made more precise which locations we consider to be instructive, and finally we took external introductions into account. Our general
Figure 12. Precision (a) and recall (b) of the heuristics in the exploration phase (average) and the final heuristic in the validation phase.
strategy with our approach is to use only the knowledge given by the code itself.
After the exploration, we evolved the concept of significant terms (described in Section III). During the exploration phase we had seen that 11% of the false positive locations were due to insignificant terms. Additionally, 17% of the false positives were due to test class names repeating the name of the class under test. These are also considered to be insignificant terms.
Additionally, we derived and incorporated the concept of instructive locations into our approach. Besides the locations presented in Table III, we further ignore anonymous classes that only implement inherited methods; the original introduction of these locations should be in the super class or interface. Further, in our first implementation enumeration types were considered as normal classes, but they carry a special meaning. Thus our heuristics also analyse the enumeration constants as instructive locations. The function instructiveLocations(type) used in line 12 of Listing 1 enumerates these within a named type type.
External Introductions: During the exploration phase we considered only types found in the source code of the project. This led to 9% false positive introduction locations that had an external introduction, e.g. "values" in any enumeration type and "clone" in any class overriding the method from java.lang.Object. We extended our analysis to also calculate term introductions for external interfaces and classes. This information was then used to apply the integration of static dependencies as defined in Section III-A4. An introduction of a term in an external library should not gain additional introductions in the source code by inheriting or implementing this functionality. The most prominent example was the implementation of the equals method of java.lang.Object, which alone accounted for 5% of the false positives.
B. Results
The final introduction heuristic performed well in the validation phase and resolved many issues with the heuristics of the exploration phase. Figures 12(a) and 12(b) show box plots for precision and recall. The box on the left shows the result of the exploration phase and the one on the right the result of the final combined heuristic. The median of the precision improved to 75% and the median of the recall to 100%. Our heuristic found introduction locations for a third to a half of the terms in the vocabulary. With a recall of 100% we find all introduction locations for these terms.
We conjecture that the increase in recall, which in turn slowed down the increase in precision, was due to the compound algorithm of our heuristic. Corrections in the atomic and specialiser heuristics lead to a linear growth in recall, but the recursive nature of the compound heuristic leads to an even stronger growth. We expect our future research to verify this empirically.
We further analysed, for each introduced term, how many introduction locations there are on average, and called this metric introduction ambiguity. On average we receive 1.6 locations for each introduced term. In total, 31% of the introduced term vocabulary had multiple introduction locations. Further, we observed a linear correlation of 0.77 between the introduction ambiguity and the lines of code; thus a bigger project resulted in a higher introduction ambiguity. As we had already reduced the introduction locations with static code dependencies, those multiple introduction locations were independent of each other. A manual inspection of the terms with multiple introductions showed that in 15% of those cases, terms were used homonymously.
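The introduction-ambiguity metric can be sketched directly from a set of (term, type) locations; the location pairs in the usage below are invented for illustration.

```python
from collections import Counter

def introduction_ambiguity(locations):
    """Average number of introduction locations per introduced term,
    plus the share of terms with more than one location."""
    counts = Counter(term for term, _ in locations)
    avg = sum(counts.values()) / len(counts)
    multiple = sum(1 for n in counts.values() if n > 1) / len(counts)
    return avg, multiple
```

For example, with locations {("draw", DrawCard), ("draw", CardDealer), ("card", Card)} the metric yields an average of 1.5 locations per term, with half of the terms introduced in more than one place.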
C. Threats to Validity
Construct Threats: The meaning of a term is not always indisputable. Therefore introduction locations may not always explain a concept completely; they still provide a good starting point for its comprehension. To find a general definition of an introduction location, a consensus of multiple opinions and a verification of the current classification is needed. This is in line with the idea of [11] to approximate the collective understanding of a word's meaning by collecting individual ideas.
Statistical Conclusion Threats: Our point estimator heavily relies on the underlying classification. Using confidence intervals to estimate the true number of positives and negatives should provide a more robust model.
Internal Threats: Between the two evaluation phases the
heuristics were changed. To verify that these changes im-
proved the results, the classified samples of the first phase
were re-evaluated with the altered heuristic.
External Threats: One of the main threats is that only one of the authors of this work manually classified the samples. Thus the resulting heuristics depend on a single opinion only. To counter this subjectiveness we performed a small user study and document how the classification was performed (both described below). To support reproducibility and proper labeling of all samples, they are available online in the form of a ZIP file containing all data as Microsoft Office Excel 2007 files. It can be downloaded from our website1.
Preliminary Study: We performed a very small user study
with five students and ten samples (both positive and
negative), asking the participants to classify whether they
consider these to be introduction locations. For three of the
samples only a narrow majority (three to two votes) was
found. The seven others had a strong majority agreement,
with either four to one or five to zero votes. The underlying
classification performed in this work agrees with the user
study in nine out of ten cases; the only difference was found
for one of the narrow-majority samples. We plan to perform
a more detailed and in-depth user study to further validate
our approach and the manual classification.
Classification: The classification was not precisely defined
before the evaluation. Nevertheless, we used the following
rule of thumb to decide whether a term is introduced:
given a term and a class, we declared the class an intro-
duction location if we could understand the term by only
looking at the source code of this class. For this we used the
control and data flow as well as comments. In the special
case of an enum, we verified that the suggested concept was
used as intended. In the case of overridden methods, we also
considered the defining superclass; this was especially used
for the verification of true negatives.
VII. RELATED WORK
Takang et al. [18] compared abbreviated identifiers to full-
word identifiers and uncommented code to commented code.
They observed in their user study that commented code was
more understandable than uncommented code, and that source
code containing full-word identifiers was more understandable
than code with only abbreviated identifiers.
Lawrie et al. analysed in [19] and [20] how identifier qual-
ity and expressiveness affect code comprehension. They
observed that better comprehension is achieved when using
full-word identifiers, although in some cases good compre-
hension was also gained from abbreviated identifiers. They
thus suggest that well-chosen and well-known abbreviations
may be preferable, since identifiers with fewer terms are
easier to remember.
1http://sewiki.iai.uni-bonn.de/private/jannonnen/public/start
Marcus et al. used in [21] the Latent Semantic Indexing
(LSI) technique to locate domain concepts in source
code. They extract identifiers and comments from the code,
split them according to camel-case and underscore rules,
and partition the code into documents. From these they
compute a Singular Value Decomposition (SVD) to obtain
the basis of the LSI space, so that every document is
represented by a vector. The similarity between an input
vector and all document vectors is then used to locate
similar concepts in the code. In contrast to LSI, our approach
does not rely on term frequencies. The principle of similar
concepts is not yet handled in our approach; for this we plan
to incorporate WordNet information on meronym, holonym,
or hypernym relations for a term. The LSI technique was
also used by Kuhn et al. [3] to cluster documents with
similar vectors. These clusters were automatically labelled
with the terms contained in the eigenvectors of each cluster,
and each set of labels is defined to be the topic of the
cluster.
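The LSI pipeline described above can be sketched with numpy (a deliberately tiny illustration; the term-document matrix, its entries, and the rank k are hypothetical, and real LSI uses weighted term frequencies over a large corpus):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Entry (i, j) counts how often term i occurs in document j.
terms = ["account", "transfer", "credit", "parse", "token"]
A = np.array([
    [2, 1, 0],   # account
    [1, 2, 0],   # transfer
    [1, 1, 0],   # credit
    [0, 0, 2],   # parse
    [0, 0, 1],   # token
], dtype=float)

# SVD and rank-k truncation yield the latent (LSI) space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query is folded into the LSI space and compared to all documents.
query = np.array([1, 1, 0, 0, 0], dtype=float)       # "account transfer"
q = np.linalg.inv(np.diag(s[:k])) @ U[:, :k].T @ query
sims = [cosine(q, d) for d in doc_vectors]
print(sims)  # documents 0 and 1 score high, document 2 low
```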
Marcus et al. [22] also compared three search strategies
for concept location: LSI, a grep-based search, and a static
dependency search. They concluded that all three strategies
have advantages and weaknesses. In [22] they further
observed that object-oriented structuring did not provide any
advantage for concept location: some concepts were directly
represented by classes, but others were distributed across the
system.
In the case study performed in [23], Haiduc and Marcus
partitioned the terms into an identifier lexicon and a comment
lexicon to locate domain terms. They reported that 23% of
the domain terms were unique to the comment lexicon and
11% to the identifier lexicon.
The evolution of the term vocabulary size between fixed
project versions was considered by Abebe et al. [24]. They
concluded that the vocabulary grows in correlation with the
lines of code. We used this knowledge in our work to define
the vocabulary density metric.
Tonella and Abebe [25] analysed term vocabularies and their
frequencies within projects. They observed that domain
terms reach a high frequency count. This observation was
also confirmed by our evaluation and was used in the concept
of insignificant terms (see Table II).
Høst and Østvold [11] defined method naming patterns and
the corresponding method meanings in 30 so-called nano
patterns for Java. For every method name in 100 projects they
calculated which patterns are fulfilled. These results are then
combined, and for every method name the nano pattern
frequencies are calculated. From these, similarly to SVD, the
most significant patterns are taken as a naming rule. Further,
they abstracted from concrete method names to method rules
(e.g. the rule "get-<noun>"). We defined four additional
nano patterns to improve our heuristics, see Table III.
The creation of an internal ontology based only on the
identifiers of a project was done by Falleri et al. [10]. Since
developers may use a project-specific, non-general meaning
for a term, this is a major improvement over general-purpose
ontologies. Their algorithm for creating an internal ontology
is used in one of our heuristics; for further information see
the specialiser heuristic in Section III-A.
Deißenbock and Pizka formalised in [26] the idea that
within a given program a concept should always be re-
ferred to by the same name. They further argued that
naming conventions are needed to check this consistency.
The work by Ratiu and Deißenbock [4] identified concepts in
a program by mapping the program graph to ontological
graphs. The ontological graph was constructed from the
WordNet ontology [6].
VIII. CONCLUSION AND FUTURE WORK
In this paper we presented the novel concept of term
introductions and elaborated a heuristic to find them. We
carefully designed a study to validate and improve our
heuristic on a set of over 8000 samples from 30 open source
projects. The two-phase study allowed us to learn from a
broad set of samples. The heuristic achieved good precision
and recall; it is therefore a good first approximation for
handling introduction locations.
Dit et al. presented in [27] a survey and taxonomy of
feature location approaches. In this taxonomy our term
introduction technique can be classified as a static, textual
analysis. It operates on compilable Java source code and
provides its results at class-level granularity. The heuristic
was derived and validated through a quantitative academic
evaluation.
Further improvements to these heuristics could be derived
from a detailed user study. The preliminary study has
already shown that there are valid differing viewpoints on
this concept; such a study should analyse them and establish
a common understanding. It could also help in identifying
good visualizations of introduction locations and term use
dependencies. Further research is needed to determine
how far such visualizations improve code compre-
hension.
The concept of introduction locations can be refined into
two different notions. An introduction location can be
either temporal, a specific point in time when the term was
first used in a project, or spatial, a method or type expressing
the concept. Temporal introductions can for example be
analysed by mining a project history in git (a popular
decentralized source code management system). In this work
we focussed on spatial term introductions and derived a
heuristic for those. We also plan to integrate historical data
in the form of temporal introduction locations.
Abbreviations were found throughout the samples; still, the
presented algorithm did not treat abbreviations differently
from other terms. An expansion of abbreviations, for
example based on the techniques presented by Madani et al.
[28] or Lawrie et al. [29], also seems promising for
our analysis.
During the evaluation we often encountered situations in
which not a single term, but a compound of terms carried
a meaning. This would be an extension of our concept
of insignificant terms. For example, in one case the term
"rich" did not have a special meaning, but the compound
"rich-media" had a special (even domain-specific) meaning.
Terms that globally always co-occur are referred to as
collocations in the diploma thesis of Nonnen [12]. He also
observed that these are not fixed; they may change during
development. Enslen et al. proposed in [30] an improved
identifier splitting algorithm using local term frequency and
co-occurrence. Combined with the above collocation infor-
mation, this algorithm should improve our current splitting
algorithm.
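A basic camel-case and underscore splitter of the kind underlying such analyses can be sketched as follows (a simplified illustration; it handles neither digits between words nor the frequency-based refinements of [30]):

```python
import re

def split_identifier(identifier):
    """Split an identifier on underscores and camel-case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # Boundary 1: lower case or digit followed by upper case
        # (e.g. "termUse" -> "term Use").
        chunk = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", chunk)
        # Boundary 2: acronym followed by a word
        # (e.g. "XMLParser" -> "XML Parser").
        chunk = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", chunk)
        parts.extend(chunk.split())
    return [p.lower() for p in parts]

print(split_identifier("termIntroduction"))   # ['term', 'introduction']
print(split_identifier("XMLHttpRequest"))     # ['xml', 'http', 'request']
print(split_identifier("MAX_retry_Count"))    # ['max', 'retry', 'count']
```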
Nonnen further showed in [12] that multiple introduc-
tion locations for a term contain homonymous uses, and
that a majority of homonyms were located in method names.
The homonym analysis was performed manually; further
work is needed to detect homonyms automatically.
Nevertheless, our current algorithm informs a developer that
a term is introduced at different locations. For example, the
"context" problem in the case study presented in Section I
can be made explicit with the presented approach.
Currently we make use of only four nano patterns. We
share the opinion that there are more useful nano patterns
that could be incorporated to improve our introduction
locations.
A term use dependency is a dependency from a type (class
or interface) that contains a term (but does not introduce
it) to the term's introduction locations. For example, the
term "context" in the case study has two term use
dependencies, one into the OSGi framework and one into
a class of the project itself. This would be a valuable hint
for a developer to detect the homonymous use of the term
"context".
In the evaluation we also measured how many term
use dependencies are inconsistent with traditional static code
dependencies. For example, if class A introduces a term that
is used in B, and A has a static dependency to B, then
both dependencies together form a cycle. In the considered
projects we found on average three such cycles per
1,000 LoC. Further, 55% of these cycles were found inside
a single package. In future work a systematic classification
into good and bad cycles needs to be performed.
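Detecting these two-edge cycles amounts to intersecting the term use dependency relation with the reversed static dependency relation (a minimal sketch; the class names and dependency sets are hypothetical, and a real analysis would operate on the full extracted dependency graphs):

```python
def term_static_cycles(term_use_deps, static_deps):
    """Return pairs (A, B) where B has a term use dependency on A
    (A introduces a term used in B) while A statically depends on B,
    so the two dependencies together form a cycle.
    Both arguments are sets of (source, target) edges."""
    return sorted(
        (a, b)
        for (b, a) in term_use_deps      # B uses a term introduced in A
        if (a, b) in static_deps         # A statically depends on B
    )

# Hypothetical example: util.Context introduces "context", which is
# used in app.Handler, while util.Context also calls into app.Handler.
term_use = {("app.Handler", "util.Context"), ("app.Handler", "util.Log")}
static = {("util.Context", "app.Handler"), ("app.Main", "app.Handler")}
print(term_static_cycles(term_use, static))
# [('util.Context', 'app.Handler')]
```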
Bad cycles may arise from incomplete generalization
refactorings. If a developer extracts some functionality for
reuse into a lower layer, but does not change the names
of the extracted program elements, a term use dependency
from the lower layer to the upper layer might be created. The
static dependencies are (hopefully) from the upper layer to
the lower layer; therefore the dependencies form a cycle.
An explicit model of different conceptual domains, as de-
scribed by Evans in [31], can be defined using the concept of
term introductions. In this context the term use dependencies
help to locate terms used across different models. A clear
visualization should help a software architect by illustrating
model boundaries in existing software. An integration with,
e.g., a classical layered architecture could also reveal
violations of this architecture based solely on term usage.
ACKNOWLEDGMENT
The authors would like to thank Gunter Kniesel and the
anonymous reviewers for their valuable feedback and ideas.
REFERENCES
[1] K. Beck, Implementation Patterns. Addison-Wesley Professional, 2006.
[2] B. Caprile and P. Tonella, "Restructuring program identifier names," in Proc. International Conference on Software Maintenance (ICSM'00), 2000, pp. 97–107.
[3] A. Kuhn, S. Ducasse, and T. Gîrba, "Semantic clustering: Identifying topics in source code," Information and Software Technology, vol. 49, no. 3, pp. 230–243, 2007.
[4] D. Ratiu and F. Deissenbock, "How programs represent reality (and how they don't)," in Proc. of the 13th Working Conference on Reverse Engineering (WCRE'06), 2006.
[5] D. Binkley, "Source code analysis: A road map," in 2007 Future of Software Engineering. IEEE Computer Society, 2007, pp. 104–119.
[6] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge: MIT Press, 1999.
[7] H. Mügge, T. Rho, D. Speicher, P. Bihler, and A. Cremers, "Programming for context-based adaptability - lessons learned about OOP, SOA, and AOP," in Workshop Selbstorganisierende, Adaptive, Kontextsensitive verteilte Systeme, 2007.
[8] M. Porter, "An algorithm for suffix stripping," Program: Electronic Library and Information Systems, vol. 40, no. 3, pp. 211–218, 2006.
[9] H. Schmid, "TreeTagger - a language independent part-of-speech tagger," Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 1995.
[10] J. Falleri, M. Huchard, M. Lafourcade, C. Nebut, V. Prince, and M. Dao, "Automatic extraction of a WordNet-like identifier network from software," in Proc. of the 18th International Conference on Program Comprehension (ICPC'10). IEEE, 2010, pp. 4–13.
[11] E. Høst and B. Østvold, "Debugging method names," in Proc. 23rd ECOOP, 2009.
[12] J. Nonnen, "Naming consistency in source code identifiers," Diploma Thesis, University of Bonn, 2011.
[13] A. Ko, B. Myers, M. Coblenz, and H. Aung, "An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks," IEEE Transactions on Software Engineering, pp. 971–987, 2006.
[14] "Secure source code hosting and collaborative development - GitHub." [Online]. Available: https://github.com/
[15] "SourceForge.net: Find, create, and publish open source software for free." [Online]. Available: http://sourceforge.net/
[16] "Google Code." [Online]. Available: http://code.google.com/
[17] G. Kniesel, J. Hannemann, and T. Rho, "A comparison of logic-based infrastructures for concern detection and extraction," in Proc. of the 3rd Workshop on Linking Aspect Technology and Evolution. ACM, 2007, p. 6.
[18] A. A. Takang, P. A. Grubb, and R. D. Macredie, "The effects of comments and identifier names on program comprehensibility: An experimental investigation," J. Prog. Lang., vol. 4, no. 3, pp. 143–167, 1996.
[19] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "What's in a name? A study of identifiers," in Proc. of the 14th International Conference on Program Comprehension (ICPC'06), 2006.
[20] D. Lawrie, C. Morrell, H. Feild, and D. Binkley, "Effective identifier names for comprehension and memory," ISSE, vol. 3, no. 4, pp. 303–318, 2007.
[21] A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic, "An information retrieval approach to concept location in source code," in Proc. of the 11th Working Conference on Reverse Engineering (WCRE'04), 2004.
[22] A. Marcus, V. Rajlich, J. Buchta, M. Petrenko, and A. Sergeyev, "Static techniques for concept location in object-oriented code," in Proc. of the 13th International Workshop on Program Comprehension (IWPC'05). IEEE Computer Society, 2005, pp. 33–42.
[23] S. Haiduc and A. Marcus, "On the use of domain terms in source code," in Proc. of the 16th International Conference on Program Comprehension (ICPC'08). IEEE Computer Society, 2008, pp. 113–122.
[24] S. Abebe, S. Haiduc, A. Marcus, P. Tonella, and G. Antoniol, "Analyzing the evolution of the source code vocabulary," in Proc. of the 13th European Conference on Software Maintenance and Reengineering (CSMR'09), 2009, pp. 189–198.
[25] P. Tonella and S. Abebe, "Code quality from the programmer's perspective," in Proc. of XII Advanced Computing and Analysis Techniques in Physics Research, 2008.
[26] F. Deißenbock and M. Pizka, "Concise and consistent naming," in Proc. of the 13th International Workshop on Program Comprehension (IWPC'05). Washington, DC, USA: IEEE Computer Society, 2005, pp. 97–106.
[27] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk, "Feature location in source code: A taxonomy and survey," Journal of Software Maintenance and Evolution: Research and Practice (JSME), 2012, accepted.
[28] N. Madani, L. Guerrouj, M. Di Penta, Y. Guéhéneuc, and G. Antoniol, "Recognizing words from source code identifiers using speech recognition techniques," in Proc. of the 14th European Conference on Software Maintenance and Reengineering (CSMR'10), 2010.
[29] D. Lawrie, H. Feild, and D. Binkley, "Extracting meaning from abbreviated identifiers," in Proc. of the 2007 IEEE Workshop on Source Code Analysis and Manipulation (SCAM'07), 2007.
[30] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, "Mining source code to automatically split identifiers for software analysis," in Proc. of the 6th IEEE International Working Conference on Mining Software Repositories (MSR'09). Washington, DC, USA: IEEE Computer Society, 2009, pp. 71–80.
[31] E. Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2009.