
    A Comparison of Identity Merge Algorithms for Software Repositories

    Mathieu Goeminne, Tom Mens

    Institut d'Informatique, Faculté des Sciences, Université de Mons

    Abstract

    Software repository mining research extracts and analyses data originating from multiple software repositories to understand the historical development of software systems, and to propose better ways to evolve such systems in the future. Of particular interest is the study of the activities and interactions between the persons involved in the software development process. The main challenge with such studies lies in the ability to determine the identities (e.g., logins or e-mail accounts) in software repositories that represent the same physical person. To achieve this, different identity merge algorithms have been proposed in the past. This article provides an objective comparison of identity merge algorithms, including some improvements over existing algorithms. The results are validated on a selection of large ongoing open source software projects.

    Keywords: software repository mining, empirical software engineering, identity merging, open source, software evolution, comparison

    1. Introduction

    Empirical software engineering research focuses on the use of empirical studies, experiments and statistical analysis in order to gain a better understanding of software products and processes [1]. An important branch of empirical research studies how software evolves over time and which processes are used to support this evolution. To achieve this, the principal data sources are software repositories of different kinds, such as source code repositories, bug tracking systems, and archived communications of the developer community (e.g., mailing lists, online forums and discussion boards). The research domain of software repository mining [2] uses these data sources to understand the historical development of software systems, and to build and empirically validate theories, models, processes and tools for these evolving systems. Many of these empirical studies focus on open source software (OSS) [3, 4, 5] and its communities [6, 7, 8, 9] because the data related to its evolution is widely available in so-called software forges (such as Launchpad, SourceForge, Google Code, GNU Savannah, and many more).

    Corr. author: Place du Parc 20, 7000 Mons, [email protected], +32 65 37 3453

    Preprint submitted to Elsevier November 28, 2011

    One of the challenges in software repository mining is how the information coming from different types of data sources can be combined in a coherent way. For many projects, the available data concerning persons involved in the software project is dispersed across different repositories and needs to be accessed using different tools:

    - centralised or distributed version control tools (e.g., CVS, Subversion, Git) are used to store the source code;

    - bug, defect or change tracking tools are used to store the bug reports and change requests;

    - mailing lists, development forums and discussion boards are used to keep track of the communication between developers and/or users of the software.

    When carrying out empirical studies on the evolution of software systems, the information coming from these different data sources needs to be merged and reconciled [10]. One of the main challenges in this process is to identify and merge the identities of the persons involved in the software process. The tools used to access the data sources often have different mechanisms to access and modify the data, and contributors to each data source may have different accounts, logins or e-mail addresses, that need to be mapped onto the persons to which they belong. Doing this manually is too error-prone and too time-consuming. Automated processes are not perfect either, as they may give rise to false positives and false negatives.

    We wish to take into account any person involved in a software's evolution, not only the developers, because the communities surrounding the software are composed of individuals that can play different roles during software development.

    Social networks have been analysed to understand how open source communities evolve over time [9]. However, these analyses can be biased by missing or incomplete data [11]. Having an efficient tool to detect real persons behind accounts in multiple repositories helps to obtain social networks that are more representative of the real relations between the stakeholders and helps to gain more accurate metrics.

    Therefore, this article reviews a number of existing and new identity merge algorithms, and systematically measures their effectiveness (in terms of precision and recall) based on a reference model and by applying them on some large ongoing OSS projects.

    The remainder of this article is structured as follows. Section 2 starts by introducing the terminology needed for understanding the concepts used in the article. Section 3 explains the followed methodology. Section 4 presents the identity merge algorithms that will be reviewed, and suggests improvements over these algorithms. Section 5 discusses the obtained results. Section 6 presents the threats to validity of our work. Section 7 addresses future work, and Section 8 concludes.

    Figure 1: Example of persons (bottom part) and their identities and associated labels in different repositories (top part).

    2. Terminology

    To avoid any confusion in terminology, this section defines the terms that will be used in the remainder of this article.

    Person: Any individual involved in the software project. The role and responsibility a person plays in the project may vary.

    The main characteristic we will use to distinguish different persons will be their full name.¹

    Repository: A data source containing information that is relevant to the software product or process, and that can be accessed and modified by different persons by using their identity.
    Example: In Figure 1, three different repositories are represented. We use code-repo to represent the code repository, mail-repo to represent the mail repository, and bug-repo to represent the bug repository. Through its identity, a person may have access to one or more of these repositories.

    ¹ Theoretically, other characteristics could be used as well, such as birthdate, place of birth, sex, but this type of information is typically not available in software repositories.


    Identity: The way a person identifies himself in a particular software repository. Each identity in a repository has a label that is unique within that repository. Depending on the repository and the naming conventions imposed, this label can take the form of a valid email address, a person's name, or a pseudonym (a.k.a. nickname). Within a single repository, the same person may have different identities (with a different label). For example, he may use two different e-mail addresses to contribute to a mailing list, or he may commit to a code repository using his real name or a pseudonym. In different repositories, the same person may have different identities. An identity in a given repository always uniquely identifies a single person, but if an identity with the same label occurs in a different repository, it may belong to a different person.
    Notation: In the remainder of this article, we will represent identities as pairs containing the label and the repository to which the identity belongs.

    Example: In Figure 1, person John Smith has two different code repository identities (johnny, code-repo) and (smith, code-repo), and one bug repository identity (john, bug-repo). John W. Doe has two code repository identities (john, code-repo) and (John, Doe, code-repo), and two mail repository identities ([email protected], mail-repo) and ([email protected], mail-repo).

    Label: A string characterising an identity in a given repository. Depending on the repository's nature, the label can be a real name, a nickname or an e-mail address.

    A label can often be split into a series of parts separated by special characters (such as space, comma, the @ symbol, or a dot). If the label is an e-mail address, it can be split into the e-mail prefix that precedes the @ symbol and the e-mail suffix that follows the @ symbol. If the label is a name, it can be split into two parts representing the first name and last name of the person, respectively. To facilitate analysis, labels are often normalised, by converting capitals into lower case, removing accents and trailing spaces, and so on.
    Example: In Figure 1, the label of identity (John, Doe, code-repo) can be split, after normalisation, into two parts john and doe. The same is true for the label of identity ([email protected], mail-repo) after normalisation and removal of the e-mail suffix.

    Identity merge: A nonempty set of identities that supposedly identify the same person.

    A false positive (type I error) refers to a pair of identities that are incorrectly contained in the same identity merge, because in reality they represent different persons.

    A false negative (type II error) refers to a pair of identities that are not in the same identity merge, even though in reality they represent the same person.
    Example: In Figure 1, a correct identity merge for person John Smith would be {(johnny, code-repo), (smith, code-repo), (john, bug-repo)}. An incorrect identity merge for the same person would be {(johnny, code-repo), (john, code-repo), (john, bug-repo)}. It contains a false positive due to the presence of (john, code-repo) as well as a false negative due to the absence of (smith, code-repo).

    Merge model: A set of identity merges such that each identity is contained in one and only one identity merge. That is, a merge model is a partition of the set of identities. An identity merge algorithm is an algorithm that produces a merge model by analysing a predefined set of repositories.

    The reference merge model is a merge model that is used as a reference against which to compare merge models that have been obtained by executing an identity merge algorithm.
    Example: The lines in Figure 1 represent a merge model of 4 different identity merges (1 for each person). This merge model represents the ideal case in which an identity merge algorithm would be able to correctly associate each identity to its corresponding person. In practice, this is not possible due to the presence of false positives and false negatives, as the analysis of the results in Section 5 will reveal.
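As a minimal sketch of the notation introduced in this section, identities can be modelled as (label, repository) pairs and a merge model as a collection of identity merges that partitions the set of identities. The helper name `is_merge_model` and the sample data are illustrative assumptions and are not taken from the authors' tooling.

```python
from itertools import chain

# An identity is a (label, repository) pair, following the notation above.
Identity = tuple[str, str]


def is_merge_model(merges: list[set[Identity]], identities: set[Identity]) -> bool:
    """Check that `merges` is a partition of `identities`."""
    if any(len(merge) == 0 for merge in merges):
        return False  # identity merges are nonempty by definition
    covered = list(chain.from_iterable(merges))
    # No identity may appear in two merges, and every identity must be covered.
    return len(covered) == len(set(covered)) and set(covered) == identities


# Sample identities inspired by the John Smith example of Figure 1.
john_smith = {("johnny", "code-repo"), ("smith", "code-repo"), ("john", "bug-repo")}
other = {("john", "code-repo")}
all_identities = john_smith | other

print(is_merge_model([john_smith, other], all_identities))
```

Note that ("john", "code-repo") and ("john", "bug-repo") are distinct identities here, which reflects the remark that the same label in two repositories may belong to different persons.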

    3. Methodology

    In order to determine how identity merge algorithms differ from one another, one needs to compare the merge models they create on a selection of software projects.

    3.1. Software project selection

    To select the software projects on which to carry out the comparison, we impose the following requirements.

    The software should be open source, since this facilitates accessibility of data related to the activity of persons involved in the software evolution process.

    To avoid sensitivity to variation of the identity merge algorithm, the software project community should be sufficiently large, in the order of hundreds of persons involved. Projects that are too large will be excluded from the selection since it would be too tedious to create the reference merge model, compute the merge results, and manually double-check the obtained results.

    The software must still be maintained and used today, in order to be representative of a system that is still under active development. This also guarantees that the analysed data will not be obsolete or irrelevant.

    The software project must have at least three different and freely accessible repositories: a code repository, a mail repository (mailing list) and a bug repository. To facilitate data analysis and comparison, data from these repositories will be extracted into a FLOSSMetrics² compliant database using the Libresoft tools CVSAnaly2³, MLStats⁴ and Bicho⁵. This choice has an impact on the selected software projects, since the repositories should have a format that can be processed by these tools. More specifically, the code repository must be based on Subversion, CVS or Git, which are the only version control systems supported by CVSAnaly2.

    The mail repository must be stored as mbox files, which is the only file format supported by MLStats. Finally, the bug repository must be a Bugzilla⁶ or a SourceForge⁷ bug tracker, the only bug repositories supported by Bicho.

    3.2. Data extraction

    Depending on its type and characteristics, a repository can provide different types of identity labels. The code repository and mail repository contain two types of labels, representing names and e-mail addresses, respectively. The bug repository contains labels representing a nickname that may or may not bear resemblance to the real name (depending on the naming conventions imposed or suggested by the project community).

    All these labels will be extracted from the FLOSSMetrics-compliant database for each considered software project, and will be provided as input to the identity merge algorithms for computing the merge model.

    3.3. Reference merge model

    In order to assess the quality of each of the considered identity merge algorithms, their output needs to be compared against a reference merge model representing the merge model that an ideal merge algorithm would propose. This reference merge model is built in an iterative way.

    The first iteration contains no identity merges and considers each identity to be independent of the others. In the second iteration, relations between some of the identities are identified on the basis of information found in text files in the last revision of the considered software project. These text files are typically created manually by the project's maintainers and are therefore incomplete: they do not contain a full list of all relations between all identities. We take into consideration information stored in the following files (if existing), stored in the root directory of the code repository: COMMITTERS, MAINTAINERS, AUTHORS, NEWS, and README. These files are semi-structured: no strict formatting convention is imposed. As such, the quality and quantity of valuable data that can be obtained from these files may significantly differ from one software project to another. Mining changelog files has been used in the past already to gain a better understanding of the open source software evolution and communities [12, 13, 14].

    ² http://www.flossmetrics.org/
    ³ https://projects.libresoft.es/projects/cvsanaly
    ⁴ https://projects.libresoft.es/projects/mlstats
    ⁵ https://projects.libresoft.es/projects/bicho
    ⁶ http://www.bugzilla.org/
    ⁷ http://sourceforge.net/

    In the third iteration, the reference model is completed with new identity merges that can be obtained by manual inspection of the set of identities. This inspection is carried out independently by 3 different persons who are not involved in the considered software projects but have experience in open source software communities. The manual completion of a reference merge model is a labour-intensive and error-prone process. Moreover, the reliability of the reference merge model depends on the practices used by the software project community for which the reference model is built: it turns out to be much harder to merge identities for a project allowing poorly structured user accounts than for a project that imposes more or less strict naming conventions to be used for logins and accounts.

    To improve the reference models we contacted the communities involved in each selected project, but they were not able to provide us with a complete, exact or objective reference model that could be used as a basis for the comparison. For privacy reasons, formal aggregating tools are not publicly accessible, if they exist, and are difficult to automatically exploit due to their diversity. Persons involved in a software project do not tend to give away their personal data spontaneously, even if this data is publicly available.

    3.4. Algorithm comparison

    Identity merge algorithms aim to offer a more precise view on the historical evolution of software projects by identifying the persons having contributed to multiple data sources. Unfortunately there is, as far as we know, no automated means to objectively compare the accuracy of these algorithms. This is needed since we wish to assess which of the proposed algorithms leads to more relevant results, with fewer false positives and false negatives.

    To perform such an objective comparison, we developed a tool that takes data from each considered repository as well as a reference merge model as input, runs all algorithms on the data, and compares the obtained identity merge model with the reference model. The quality of the identity merge algorithm will thus be assessed by how closely it approximates the results of the reference model.

    All of the considered identity merge algorithms are parameterised. Therefore, for each selected project, the comparison tool runs and computes the result of each merge algorithm for different values of its parameter. The outcome of each run is then compared to the corresponding reference merge model to determine its quality. This quality is determined by looking at all pairs of identities found in the result of the merge algorithm and the reference model. Four possibilities need to be distinguished, as summarised in Table 1:

    - We have a true negative (tn) if the merge algorithm does not propose to place both identities in the same identity merge, and the reference model agrees with this;

    - We obtain a false negative (fn) if the merge algorithm does not propose to place both identities in the same identity merge, while the reference model suggests they should;

    - We get a true positive (tp) if both the merge algorithm and the reference model agree that both identities need to belong to the same identity merge;

    - Finally, we have a false positive (fp) if the merge algorithm proposes to place both identities in the same identity merge, while the reference model says they should belong to different identity merges.

    Identity pair belongs to the same merge | Reference model: Yes | Reference model: No
    ----------------------------------------+----------------------+--------------------
    Merge algorithm result: Yes             | tp                   | fp
    Merge algorithm result: No              | fn                   | tn

    Table 1: The four possibilities to determine whether a given pair of identities correctly belongs to the same identity merge.

    For each considered parameter value of each merge algorithm on each selected project, the number of true and false positives and negatives is used to compute recall and precision (values between 0 and 1). Recall provides an answer to the question "What is the percentage of correct identity merges that have been found by the algorithm?" and can be computed using Equation (1). Because an algorithm can very easily achieve a recall of 1 (by determining that all identities need to belong to the same identity merge), we need the precision to take into account the number of returned false positives as well. Its definition is given in Equation (2).

    \[ \mathit{Recall} = \frac{tp}{tp + fn} \tag{1} \]

    \[ \mathit{Precision} = \frac{tp}{tp + fp} \tag{2} \]
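The pairwise comparison behind Equations (1) and (2) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, both merge models are assumed to cover the same identities, and returning 1.0 when a denominator is zero is a convention chosen here, not something the article specifies.

```python
from itertools import combinations


def merge_of(model):
    """Map each identity to the index of the identity merge containing it."""
    return {identity: k for k, merge in enumerate(model) for identity in merge}


def precision_recall(result, reference):
    """Compare two merge models over the same identities, pair by pair."""
    res, ref = merge_of(result), merge_of(reference)
    tp = fp = fn = 0
    for a, b in combinations(sorted(ref), 2):
        same_in_result = res[a] == res[b]
        same_in_reference = ref[a] == ref[b]
        if same_in_result and same_in_reference:
            tp += 1
        elif same_in_result:
            fp += 1
        elif same_in_reference:
            fn += 1
        # Otherwise the pair is a true negative, which neither ratio uses.
    recall = tp / (tp + fn) if tp + fn else 1.0        # Equation (1)
    precision = tp / (tp + fp) if tp + fp else 1.0     # Equation (2)
    return precision, recall


# The correct and incorrect John Smith merges from the terminology section.
reference = [
    {("johnny", "code-repo"), ("smith", "code-repo"), ("john", "bug-repo")},
    {("john", "code-repo")},
]
result = [
    {("johnny", "code-repo"), ("john", "code-repo"), ("john", "bug-repo")},
    {("smith", "code-repo")},
]
precision, recall = precision_recall(result, reference)
```

On this small example, one pair is a true positive, two pairs are false positives and two are false negatives, so both precision and recall equal 1/3.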

    In order to assess the efficiency of a merge algorithm, recall and precision are considered simultaneously to determine for which parameter the algorithm produces the best results. All variants of each algorithm (one for each parameter value) can be presented in a two-dimensional graph with the recall on the horizontal axis and the precision on the vertical axis (see, e.g., Figure 2a). The best variants of an algorithm are those that maximise either the recall, or the precision, or both.
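One possible reading of "best variants" is the set of (recall, precision) points that no other variant beats on both axes at once. The sketch below follows that interpretation, which is an assumption rather than the authors' stated procedure; the function name and the sample values are made up for illustration.

```python
def best_variants(points):
    """Keep the (recall, precision) points that no other point dominates."""
    best = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1]
            for q in points
        )
        if not dominated:
            best.append(p)
    return best


# Illustrative (recall, precision) values, one per parameter value.
variants = [(0.2, 0.9), (0.5, 0.7), (0.4, 0.6), (0.9, 0.3)]
print(best_variants(variants))
```

Here (0.4, 0.6) is dropped because (0.5, 0.7) is at least as good on both recall and precision.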

    4. Identity merge algorithms

    This section presents the different identity merge algorithms.

    Each identity merge algorithm follows the steps described in Algorithm 1. This algorithm explains, in procedural style, the input parameters required (a set of identities), the different steps followed, and the output produced (a merge model).

    Algorithm 1: mergeAlgorithm

    Input : Set of identities I = {i_1, ..., i_n}
            parameter p
    Output: A merge model for I

    identityMerges ← emptySet();
    while isNotEmpty(I) do
        r ← firstElementFrom(I);
        iMerge ← emptySet();
        insert(iMerge, r);
        remove(I, r);
        foreach i ∈ I do
            if shouldInclude(iMerge, i, p) then
                insert(iMerge, i);
                remove(I, i);
        insert(identityMerges, iMerge);
    return identityMerges;

    While any identity merge algorithm follows this same procedure, the auxiliary functions that are used vary from one algorithm to another:

    - shouldInclude(m, i, p) determines if an identity merge m should include a specific identity i, according to a given similarity threshold p (provided as a parameter of the algorithm). If an identity should be part of the identity merge, this function returns true.

    - normalise(w) is not directly used in Algorithm 1 but the different merge algorithms can exploit it to normalise an identity label. This function takes a character string w and converts it into a normalised form by removing accents, converting uppercase letters into lowercase, replacing a sequence of whitespace characters by a single space, and removing beginning and ending whitespace.
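The generic procedure and the normalise function described above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: `merge_algorithm` takes the `should_include` predicate as an argument, the remaining identities are rebuilt on each pass instead of being removed while iterating, and accent removal via Unicode decomposition is one possible way to realise the described normalisation.

```python
import re
import unicodedata


def normalise(w: str) -> str:
    """Remove accents, lowercase, collapse whitespace runs, strip both ends."""
    decomposed = unicodedata.normalize("NFKD", w)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents.lower()).strip()


def merge_algorithm(identities, should_include, p):
    """Greedily group identities into identity merges, as in Algorithm 1."""
    remaining = list(identities)
    identity_merges = []
    while remaining:
        merge = [remaining.pop(0)]      # r <- firstElementFrom(I)
        still_unmerged = []
        for identity in remaining:
            if should_include(merge, identity, p):
                merge.append(identity)
            else:
                still_unmerged.append(identity)
        remaining = still_unmerged
        identity_merges.append(merge)
    return identity_merges


def same_label(merge, identity, p):
    """Toy predicate (ignores p): merge identities with equal normalised labels."""
    return any(normalise(identity[0]) == normalise(label) for label, _ in merge)


sample = [("John", "code-repo"), ("smith", "code-repo"), ("john", "bug-repo")]
print(merge_algorithm(sample, same_label, None))
```

Because the result depends on which identity is picked first and on the order of comparison, different orderings of the input may yield different merge models for predicates that are not transitive.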

    4.1. Simple algorithm

    As its name suggests, the simplest identity merge algorithm we designed and implemented uses a very simple shouldInclude function to determine if a given identity i should be included in an identity merge iMerge for a given threshold p (which is a positive integer value in this case).

    The algorithm first creates a list l containing the normalised labels of all identities r already included in the identity merge iMerge. The function shouldInclude returns true, i.e., it decides that identity i should be included in iMerge, if at least one of its normalised labels belongs to the list l and has a length that is longer than the threshold p given as parameter to the algorithm. Only the e-mail address prefixes are used.

    Example: In Figure 1, let's suppose p = 3, and (johnny, code-repo), (smith, code-repo) and (john, bug-repo) belong to the identity merge called John Smith. In order to determine if ([email protected], mail-repo) should belong to the identity merge, the algorithm creates the list (johnny, smith, john) and transforms the label into john. john belongs to the created list and is four characters long, so the identity label should be added to the identity merge.
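The rule above can be sketched as a small predicate. The names `label_key` and `should_include_simple` are hypothetical, normalisation is reduced to lowercasing and trimming for brevity, and the address `john@example.org` is a placeholder because the original e-mail address is obscured in the source.

```python
def label_key(label: str) -> str:
    """Lowercase the label and keep only the e-mail prefix, if there is one."""
    return label.split("@", 1)[0].strip().lower()


def should_include_simple(merge, identity, p: int) -> bool:
    """True if the identity's key occurs in the merge and is longer than p."""
    known = {label_key(label) for label, _repo in merge}
    key = label_key(identity[0])
    return key in known and len(key) > p


john_smith = [("johnny", "code-repo"), ("smith", "code-repo"), ("john", "bug-repo")]
candidate = ("john@example.org", "mail-repo")  # placeholder address
print(should_include_simple(john_smith, candidate, 3))
```

With p = 3 the candidate is accepted, since its prefix john is in the list and has four characters; with p = 4 the same candidate would be rejected.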

    4.2. Bird's algorithm

    Bird [15] suggested another algorithm, designed specifically to detect identities belonging to committers in a code repository and mailers contributing to a mailing list. The algorithm requires as parameter a threshold t ranging from 0 to 1.

    In a preprocessing phase, all identity labels are normalised and cleaned. The cleaning removes punctuation, suffixes, prefixes, technical terms and frequently occurring words (see Appendix A). For identity labels representing e-mail addresses, only the e-mail prefix is considered. After this preprocessing, the label representing the name of a person is split into two parts: the first name (first) and the last name (last). To achieve this, Bird recommends the use of whitespaces and commas as separator characters.

    In order to determine if a given identity should belong to a given identity merge, the algorithm compares the identity with all identities belonging to the identity merge, and determines whether these are similar. This is the case if at least one of the following conditions is respected:

    - Both identities are found in the code repository and have similar name labels l1 and l2. More specifically, the algorithm uses the normalised Levenshtein similarity [16] similar(l1, l2, t) depending on threshold t provided as a parameter of the algorithm (Equation 3):

      \[ \mathit{similar}(l_1, l_2, t) = \begin{cases} \mathit{true}, & \text{if } 1 - \dfrac{\mathit{levenshteinDistance}(l_1, l_2)}{\max(\mathit{size}(l_1), \mathit{size}(l_2))} \geq t \\ \mathit{false}, & \text{otherwise} \end{cases} \tag{3} \]

      Example: In Figure 1, the identities (johnny, code-repo) and (john, code-repo) will belong to the same identity merge for threshold parameter t = 0.5 because

      \[ \mathit{similar}(\text{johnny}, \text{john}, 0.5) = \mathit{true} \quad \text{since} \quad 1 - \frac{2}{\max(6, 4)} = \frac{2}{3} \geq 0.5 \]


    - Both identities are found in the code repository and have name labels l1 and l2 composed of first1, last1 and first2, last2, respectively, such that first1 and first2 are similar, and last1 and last2 are similar (according to Equation 3).
      Example: In Figure 1, using commas and spaces as separators, the identities (John, Doe, code-repo) and (Doe, John, code-repo) belong to the same identity merge because both of them have the same name parts john and doe (corresponding to a similarity value 1, exceeding the threshold 0.5).

    - Both identities have e-mail labels containing a similar e-mail prefix of at least 3 characters. The similarity function is defined as before.
      Example: In Figure 1, the identities ([email protected], mail-repo) and (john [email protected], mail-repo) will belong to the same identity merge for parameter t = 0.5 because

      \[ \mathit{similar}(\text{john.doe}, \text{john\_doe}, 0.5) = \mathit{true} \quad \text{since} \quad 1 - \frac{1}{\max(8, 8)} = \frac{7}{8} \geq 0.5 \]

    - One of the identities has a name label, and the other identity has an e-mail label. The first one has first and last parts containing at least 2 characters and being included in the label of the other identity.
      Example: In Figure 1, using comma and spaces as separators, the identities (Doe, John, code-repo) and ([email protected], mail-repo) will belong to the same identity merge because the parts doe and john of the first identity match the parts john and doe of the second identity.

    - One of the identities has a name label having two parts first and last and the other identity has an e-mail label e. first and last contain at least 2 characters and are included in e, except for one part p. The first letter of p is included in e.
      Example: In Figure 1, the identities (John Doe, code-repo) and ([email protected], mail-repo) will belong to the same identity merge because normalisation of John Doe is split in two parts, john and doe. The first letter of the first part (j) is contained in jdoe as well as the entire second part.
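The thresholded similarity of Equation 3, on which several of the conditions above rely, can be sketched with a standard dynamic-programming Levenshtein distance. The function names are illustrative, and treating two empty labels as fully similar is a convention chosen here to avoid a division by zero, not something the article specifies.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(l1: str, l2: str) -> float:
    """Normalised Levenshtein similarity, a value between 0 and 1."""
    longest = max(len(l1), len(l2))
    if longest == 0:
        return 1.0  # convention chosen here for two empty labels
    return 1 - levenshtein_distance(l1, l2) / longest


def similar(l1: str, l2: str, t: float) -> bool:
    """Equation 3: the labels are similar if their similarity reaches t."""
    return similarity(l1, l2) >= t


print(similarity("johnny", "john"))        # the 2/3 of the first example
print(similarity("john.doe", "john_doe"))  # the 7/8 of the e-mail prefix example
```

Both worked examples of this section pass the threshold t = 0.5, whereas unrelated labels such as johnny and smith do not.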

    4.3. Bird's algorithm extended

    Bird's original algorithm has a number of limitations. First, it incorrectly assumes that a person's name contains only two parts, the first name and the last name. However, the proposed splitting processes can be extended easily to take into account an arbitrary number of name parts. Second, the algorithm ignores identities found in bug repositories. Therefore, the algorithm is likely to have worse results than an algorithm that does take such identities into account. Because of this, we decided to implement an extension of Bird's algorithm that considers identities in bug repositories as well. Just like a committer's identity name in the source code repository, the identity name in a bug tracker repository can be separated into parts.


    4.4. Robles's approach

    Robles et al. [17] suggested a more complex and high-level identity merge approach based on a set of rules. The first rule consists in the use of GPG⁸ key servers to determine coupled e-mail addresses. GPG keys are used to sign e-mails, and this is common practice in OSS communities. We can ask a GPG server, if available, to find the keys of people involved in a particular software project. Each GPG key can be associated to its owner's e-mail addresses. This association guarantees that two e-mail addresses belong to the same physical person.

    Robles also suggests to rely on specific conventions imposed by the software development process to confirm or reject an identity merge. For instance, the KDE project maintains a list of names, logins and e-mail addresses for each active developer. Using such a list can significantly improve the merge, by avoiding incorrectly merged or separated identities. This solution is not perfect either because it is very project-specific and thus not generally applicable. Moreover, it is not able to cover all involved identities.

    We will not include an implementation of Robles' algorithm in our comparison for several reasons. Firstly, a correct implementation requires significantly more details than what can be found in the article [17]. More importantly, the need to connect to a GPG server would significantly slow down the comparison process, making it unworkable in practice. In addition, many OSS projects do not make use of GPG servers. Thirdly, the selected projects contain files giving an incomplete list of relations between real names, nicknames and e-mails. In addition, each project uses a different textual structure to present this information, making a fully automated generic extraction process very difficult.

    Another specificity of Robles's approach is to match the e-mail prefix of an identity label with a list of likely prefix candidates, based on a person's name. These prefix candidates can have the form firstname.lastname, lastname.firstname, or a combination of the first name and the first letter of the last name, and so on. This rule differs from Bird's, which only considers a name split in two parts (whereas Robles's approach allows for an arbitrary number of parts). On the other hand, Robles requires name parts to be separated by a given character, while the only condition Bird imposes is the presence of each part in the e-mail prefix. In our improved algorithm described in Section 4.5, we decided to implement Robles' suggestion for matching e-mail prefixes.
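To illustrate the idea of prefix candidates described above, the following sketch generates likely e-mail prefixes from a person's name. It is an assumption-laden illustration rather than Robles's actual rule set: the function name, the separator set (including the empty string for plain concatenation) and the initial-based forms are choices made here.

```python
from itertools import permutations

# Assumed separator set; the empty string gives plain concatenation.
SEPARATORS = ("", ".", "_", "-")


def candidate_prefixes(name: str) -> set[str]:
    """Build likely e-mail prefixes from the parts of a person's name."""
    parts = name.lower().replace(",", " ").split()
    if not parts:
        return set()
    candidates = set()
    for order in permutations(parts):
        for separator in SEPARATORS:
            candidates.add(separator.join(order))
    # First name plus the first letter of the last name, and the mirrored
    # form (initial of the first name followed by the last name).
    if len(parts) >= 2:
        first, last = parts[0], parts[-1]
        candidates.add(first + last[0])
        candidates.add(first[0] + last)
    return candidates


print(sorted(candidate_prefixes("John Doe")))
```

The number of permutations grows factorially with the number of name parts, so a practical implementation would probably cap the number of parts considered.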

    4.5. Improved algorithm

    The third algorithm to be included in the comparison of identity merge algorithms combines ideas taken from Bird's and Robles's approaches, as well as ideas taken from our earlier experience with analysing open source software developer communities [18]. The proposed algorithm requires as parameter a threshold t ranging from 0 to 1, and natively takes into account code repositories, mail repositories and bug repositories.

    ⁸ GNU Privacy Guard, a free implementation of the OpenPGP standard for public key encryption.

    Similar to Bird's algorithm, identity labels are normalised by removing all insignificant words given in Appendix A. Likewise, the thresholded Levenshtein distance is used to determine the similarity of identities. More specifically, two identities are considered similar if at least one of the following criteria is satisfied:

    • One of the identities has a nickname or a name label, and the other identity has an e-mail label. The normalised label of the first identity and the e-mail prefix of the second identity must be similar, according to Equation 3 of Bird's algorithm, parameterised by threshold t.

    • One of the identities has a nickname or a name label, and the other identity has an e-mail label. Each part of one of the labels has at least 3 characters and is contained in the other label. The characters , , +, ., -, and are used as separators to split a label into parts.

    • Both identities have a nickname or a name label. These labels have at least 3 characters and are similar according to Equation 3, parameterised by threshold t.

    • One of the identities has a nickname or a name label having at least 3 characters, and the other identity has an e-mail label. The normalised e-mail prefix of the second identity label must be identical to at least one of the potential e-mail prefixes of the first identity. The list of potential e-mail prefixes is created as follows:

        1. Create a list of all normalised parts of the identity label;

        2. For each permutation of these parts, create a string concatenation of the parts separated by one of the characters , , +, ., -, or . These new strings are the potential e-mail prefixes.

      Example: If the first identity is (Doe, John, code-repo) and the second identity is (john doe, mail-repo), they will be considered similar because john doe belongs to the list of potential e-mail prefixes containing

      doejohn, johndoe, doe john, john doe, doe+john, john+doe, doe.john, john.doe, doe-john, john-doe, doe john, john doe.
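The construction of potential e-mail prefixes can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the comparison: the function names, the lower-casing used as normalisation, and in particular the separator set (empty string, space, "+", ".", "-" and "_") are assumptions made for the example.

```python
from itertools import permutations
import re

# Assumed separator set for this example; the empty string yields plain
# concatenations such as "johndoe".
SEPARATORS = ["", " ", "+", ".", "-", "_"]

def label_parts(label):
    """Split a name label into lower-cased parts on non-alphanumeric runs."""
    return [part for part in re.split(r"[^a-z0-9]+", label.lower()) if part]

def potential_prefixes(label):
    """Join every permutation of the label parts with every separator."""
    parts = label_parts(label)
    return {sep.join(perm) for perm in permutations(parts) for sep in SEPARATORS}

def prefix_matches(name_label, email):
    """The name label has at least 3 characters and the e-mail prefix is
    identical to one of the potential prefixes derived from the name."""
    prefix = email.split("@", 1)[0].lower()
    return len(name_label) >= 3 and prefix in potential_prefixes(name_label)
```

For a two-part name such as "Doe, John" the sketch yields twelve candidates (two permutations times six separators). The number of candidates grows factorially with the number of name parts, which is acceptable for typical names but worth keeping in mind for long labels.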

    4.6. Further Improvements

    Poncin et al. [19, 20] presented FRASR, a tool to transform labels into several match objects. For instance, if the label is a real name it forms a match object called raw name according to Poncin's terminology, whereas the normalised version of the name forms a match object called parsed name. Poncin also generates other variations of the name, combining the first character of a name's part with the other parts. These combinations are only added as match objects if the concatenation has at least a given number of characters (3 by default). FRASR computes a similarity value sim and uses this to determine if an identity i should belong to an identity merge iMerge. The algorithm proceeds as follows:

    1. Initialise a variable vsum to 0.

    2. Perform a pairwise comparison of each match object of the identity i with each match object of all identities belonging to the identity merge iMerge. For each pair of match objects $(o_{1k}, o_{2k})$, associate weights $w_{1k}$ and $w_{2k}$ to $o_{1k}$ and $o_{2k}$ (respectively) based on the types of the considered objects. The weight associated to each object type is a parameter of the algorithm.

    3. Compute the product of $w_{1k}$ and $w_{2k}$ and add this to vsum.

    4. The final similarity value sim is vsum divided by the number of comparisons done. If numMatch is the number of compared objects, sim can be expressed as follows:

    \[
    \mathit{sim} = \frac{\sum_{k=1}^{\mathit{numMatch}} (w_{1k} \cdot w_{2k})}{\text{number of comparisons}}
    \]

    5. If sim exceeds a given threshold, the identity i should belong to the identity merge iMerge.
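The steps above can be sketched in Python as follows. This is a loose illustration rather than FRASR's actual code: the weight table, the type names and the default threshold are hypothetical. It also assumes that a pair of match objects contributes to vsum only when the two values are equal, which the description above does not state explicitly.

```python
# Hypothetical weights per match-object type (parameters of the algorithm).
WEIGHTS = {"raw_name": 1.0, "parsed_name": 0.8, "email_prefix": 0.6}

def frasr_similarity(identity, merge, weights=WEIGHTS):
    """identity: list of (type, value) match objects.
    merge: list of identities, each a list of (type, value) match objects."""
    vsum = 0.0
    comparisons = 0
    for type1, value1 in identity:
        for other in merge:
            for type2, value2 in other:
                comparisons += 1
                # Assumption: only pairs with equal values add to the sum.
                if value1 == value2:
                    vsum += weights[type1] * weights[type2]
    return vsum / comparisons if comparisons else 0.0

def belongs_to_merge(identity, merge, threshold=0.1):
    """Step 5: the identity joins the merge when sim exceeds the threshold."""
    return frasr_similarity(identity, merge) > threshold
```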

    In our current merge algorithm comparison we have not taken the above algorithm into account. This represents a topic of future work.

    5. Experimental comparison

    5.1. Selected projects

    On the basis of the project selection criteria of Section 3.1, we selected the following software projects on which to carry out the comparison of identity merge algorithms. The characteristics of these projects are summarised in Table 2.

    • Evince⁹, a document viewer forming a part of the GNOME project¹⁰.

    • Brasero¹¹, a simple tool to burn CDs and DVDs. It is also a part of the GNOME project.

    • Subversion¹², a centralised versioning system commonly used in open and closed source software development.

    ⁹ http://projects.gnome.org/evince/
    ¹⁰ http://www.gnome.org/
    ¹¹ http://live.gnome.org/Brasero
    ¹² http://subversion.apache.org/

    Project               Brasero   Evince   Subversion
    versioning system     git       svn      svn
    age (years)           8         11       11
    size (KLOC)           107       580      422
    # commits             4,100     4,000    51,529
    # mails               460       1,800    24,673
    # bugs                250       950      3,493
    # commit accounts     206       204      162
    # e-mail accounts     102       610      1,690
    # bug accounts        421       964      1,340

    Table 2: Main characteristics of selected software projects. The reported values are those of the last version of each project.

    5.2. Reference merge models

    For Evince, the semi-structured files contain developer-related information providing relations between identities of as few as 3 persons. For Brasero, these files provide 45 unique relations between names and e-mail addresses. For Subversion, the files provide the most complete (in terms of quality as well as quantity) set of identity relations, with 156 sets containing the real name, the nickname and the e-mail address of involved persons. Note that some of the identities described in the files do not belong to any of the considered repositories. In that case, these identities and their relations are excluded from the reference model.

    Based on the reference merge models, we can better estimate the number of persons involved in each selected project. Taking the reference merge models into account reduces the number of persons involved in the evolution by 12% for Brasero, 9% for Evince, and 10% for Subversion.

    5.3. Algorithm comparison

    On each of the selected projects we ran four identity merge algorithms: the simple algorithm (Section 4.1), Bird's algorithm (Section 4.2), the extension of Bird's algorithm (Section 4.3) and our own improved algorithm (Section 4.5). We did this for a range of different parameter values and computed the precision and recall w.r.t. the reference model.
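As an illustration of how a computed merge model can be scored against a reference model, the Python sketch below counts pairs of identities that are merged together. The pairwise definition, the function names and the convention of returning 1.0 for an empty model are assumptions made for the example. The sketch is not meant to restate the exact definitions of precision and recall used in this article.

```python
from itertools import combinations

def merged_pairs(merge_model):
    """All unordered pairs of identities placed in the same merge."""
    pairs = set()
    for group in merge_model:
        pairs.update(frozenset(pair) for pair in combinations(sorted(group), 2))
    return pairs

def precision_recall(computed, reference):
    """Pairwise precision and recall of a computed merge model."""
    found, expected = merged_pairs(computed), merged_pairs(reference)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall
```

For instance, a model that merges three identities of which only two belong together produces three pairs, one of them correct, hence a precision of 1/3.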

    The simple algorithm is parameterised by the minimal length a string must have in order to be considered as being useful for computing its similarity (in this case, equality) with another string. The other considered merge algorithms are parameterised by a threshold value used to determine if a string is similar enough to another based on the Levenshtein distance between these strings.
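A thresholded Levenshtein comparison can be sketched in Python as shown below. The normalisation (one minus the distance divided by the length of the longer string) is an assumption made for the example and may differ from Equation 3. The function names are hypothetical as well.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]

def similar(a, b, t):
    """True when the normalised similarity of a and b reaches threshold t."""
    if t >= 1:
        # A threshold of 1 only accepts identical strings, so the distance
        # does not need to be computed at all.
        return a == b
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - levenshtein(a, b) / longest >= t
```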

    Figure 2 displays, on a two-dimensional graph, this precision and recall for each parameter of each algorithm. This figure shows that the simple algorithm always obtains a high precision. The recall, however, appears to vary (between 0 and a value around 0.8) depending on the value of the algorithm's parameter.

    When comparing Bird with Bird extended, we observe that the extended algorithm, which takes bug repositories into account, outperforms Bird on the recall scale: it invariably has a higher recall because it is able to take additional identities into account. Except for the Brasero project, the Bird and Bird extended algorithms suffer from a low precision, forcing the user to manually split incorrectly merged identities.

    Finally, the improved algorithm obtains good results, in particular for the recall, which is always high. The precision varies from 0 to 0.9 or better depending on the algorithm's parameter.

    The plots in Figure 2 reveal that the results of an algorithm vary a lot based on the parameter value. Therefore, we analysed for each algorithm the effect of its parameter value on the precision and recall in Figures 3, 4 and 5.

    For the simple algorithm the precision always remains close to 1, while the recall decreases with increasing values of the minimal word length parameter. Only for a low parameter value (≤ 3) do we get an acceptably high recall value.

    The improved algorithm has the best recall and precision for high parameter values. In fact, the best precision and recall are obtained with a Levenshtein threshold of 1, corresponding to a perfect match. This simplifies things a lot, as it means that it is not really necessary to compute the Levenshtein distance, leading to a dramatic improvement in running time.

    5.4. Discussion

    From the obtained results we can make several observations. The simple algorithm provides good precision, but recall strongly depends on the parameter value. It is possible to find parameter values that perform well in all cases. Bird's original algorithm does not perform well if bug repository data is to be taken into account. Its precision and recall are too low, regardless of the parameter value used. The extension of Bird's algorithm performs better, with very high recall values for some parameter values, but the precision is always very weak, so it is not usable in practice. The improved algorithm, borrowing features from Bird's and Robles' approaches and from the simple algorithm, has a recall that is acceptable (around 0.6) to good (0.7 or better). Precision varies a lot depending on the algorithm's parameter. As for Bird's algorithm, the best precision is reached for a parameter value corresponding to a Levenshtein threshold of 1.

    As such, we can conclude that the simple algorithm performs the best, closely followed by the improved algorithm. Based on this knowledge, we could combine the best features of both algorithms in the future to come to a better solution. In addition, as explained in Section 4.6, we can integrate the improvements suggested by Poncin et al.

    When comparing the results across the selected software projects we observe that the more noisy and complex the project data is, the worse the merge algorithms behave. The complexity of a project depends on the number of lines of code it contains and the number of persons involved in the project (these two metrics being correlated). The noise of a project is defined as the ratio of fanciful account names over the total number of accounts. For example, the merge algorithms need to consider more merges for Evince than for Brasero, and the merges are globally more subtle since more persons use an altered version of their names. For instance, a (fictitious) person called Robert Smith could have as login bobby.smith, and John Baker could use thebaker as pseudonym. We observed that, if a project is maintained by a small number of persons, these persons tend to use serious logins, probably because they organise their work using a relatively strict process. When the project is opened to anybody, there are more fanciful accounts created by persons that only occasionally participate in the project's evolution. It could be interesting to scientifically study if this behaviour is generally observed in open source projects.

    Bird's algorithm as well as its extension behave differently depending on the project on which they are applied. Taking Brasero as an example, we observe in Figures 3c and 3d that Bird's algorithm and its extension have an overall weak precision. Nevertheless, larger values of the Levenshtein threshold (in particular, values that exceed 0.5) lead to a higher precision. For the other two studied projects (Subversion and Evince), Bird's algorithm and its extension give a bad precision, independently of the Levenshtein threshold used. We suspect that the different behaviour observed for Brasero is due to the relatively strict naming conventions (i.e., less noise) used by the involved persons when choosing their login.

    In contrast to Bird, the simple and improved algorithms appear to be more robust to noise in the projects: they have roughly the same behaviour across the three studied projects (Figures 3, 4 and 5). The simple algorithm always has a very high precision, which increases with the algorithm's parameter value. The recall of the simple algorithm is high for small parameter values, and decreases rapidly as the parameter value increases. The best trade-off for ensuring a high precision and a high recall appears to be a parameter value of 3. The improved algorithm has a better precision and recall for increasing values of the Levenshtein threshold. In addition, the precision and recall values stabilise after a given threshold value that depends on the project under study. As we continue to apply our algorithms on more projects, we expect that the simple and improved algorithms will continue to produce good results if the right parameter values are chosen. However, bigger projects may suffer from a weak precision and recall, because a bigger data set implies more noise and more false positives and negatives.

    In addition, identity merge algorithms will not work well if the community does not impose some discipline on the naming conventions for identity labels used in the different repositories. We tried to apply the algorithms on Wine, another OSS project with a huge community (several thousands of involved persons). Because no naming conventions were imposed on the bug repository, the merge algorithms performed very badly, and it was not possible to create a decent reference merge model without considerable effort.

    In the presence of e-mail addresses, false negatives are sometimes encountered if the last name of a person belongs to the e-mail suffix. We illustrated this in Figure 1 where John W. Doe has [email protected] as e-mail address. This merge cannot be detected by any of the implemented algorithms. False negatives often occur because different variants of first names are used. For example, Robert may be the same as Bob or Rob or Bobby; William may be the same as Bill; and so on. A smarter algorithm may be able to detect that William Doe and Bill Doe are the same person. It is doubtful that this approach will work better in practice since, while reducing false negatives, it will increase false positives.
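One way such a smarter comparison could look is sketched below in Python. The lookup table is hypothetical and deliberately tiny, and the function names are illustrative. As noted above, a rule of this kind is likely to trade false negatives for false positives.

```python
# Hypothetical, deliberately small table of first-name variants.
CANONICAL_FIRST_NAME = {
    "bob": "robert",
    "rob": "robert",
    "bobby": "robert",
    "bill": "william",
}

def canonical_name(name):
    """Lower-case a full name and map its first word to a canonical form."""
    parts = name.lower().split()
    if parts:
        parts[0] = CANONICAL_FIRST_NAME.get(parts[0], parts[0])
    return " ".join(parts)

def same_person_candidate(name_a, name_b):
    """True when both names are identical after first-name canonicalisation."""
    return canonical_name(name_a) == canonical_name(name_b)
```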

    Most of the false positives are due to the fact that logins or e-mail prefixes only contain a first name. Persons with the same first name may accidentally be merged because their e-mail addresses and bug tracker accounts are considered similar. The algorithms can be adapted easily to avoid these unnecessary merges, at the expense of introducing more false negatives.

    6. Threats to validity

    The comparison carried out in this article suffers from a number of threats to validity. For those that are inherent to any empirical study involving the use of OSS projects we refer to Ramil et al. [3] and Stol and Babar [21]. We will only discuss the threats to validity that are specific to our experiment here.

    6.1. Internal validity

    The FLOSSMetrics databases that are used by the identity merge algorithms are not perfect. Due to some issues in the Libresoft tools that were used to create the database, the database contents have some encoding problems that needed to be solved manually. Sometimes, e-mail addresses were not valid (e.g., null.null, john.deamon@(none) or [email protected]), which is problematic for the algorithms that need to analyse their structure. To avoid errors and a waste of time, we would need to fix the external tools used or automate the database correction process.
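Part of the database correction could be automated with a simple structural check, sketched below in Python. The pattern is a deliberately coarse assumption (a prefix, an "@", and a dotted domain without spaces or parentheses), not a full validation of the e-mail address syntax. The function name is hypothetical.

```python
import re

# Coarse structural pattern: prefix@domain.tld, no spaces or parentheses.
EMAIL_PATTERN = re.compile(r"^[^@\s()]+@[^@\s()]+\.[^@\s()]+$")

def is_usable_email(address):
    """True when the address has a prefix and a plausible domain part."""
    return EMAIL_PATTERN.match(address.strip()) is not None
```

Addresses rejected by such a check could be set aside for manual inspection instead of being passed to the merge algorithms.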

    The reliability of data is also a potential problem: one can never be sure that the analysed repositories are correct and complete. The less data we have, the less useful the algorithms will be. As the repositories are not always centralised, it is hard to ensure that all related data are collected (e.g., there may be a hidden mailing list). Nowadays, with developer communities being geographically scattered over the world, distributed version control systems such as Git and Bazaar are becoming more and more widely used. Since distributed version control systems allow a clone repository to be outdated, there is a risk that we do not have a global view of the software code repository. Fortunately, OSS projects generally use a reference repository containing the whole official source code history.

    Although the reference merge models were constructed in an iterative way, using different sources of information, they remain partly and inherently subjective. Without any formal and unified data source containing all persons involved in the project, there is simply no way to obtain a perfect reference model. There will always exist a non-quantifiable bias that one can only try to minimise.

    There is also a threat to implementation validity. We implemented Bird's algorithm on the basis of a high-level textual description we found in [15]. We cannot guarantee that our implementation exactly matches Bird's ideas. Our own simple and improved algorithms, as well as the algorithm comparison tool itself, may also be subject to implementation errors. We tried to minimise this threat by validating the tool and the algorithms through extensive unit test suites.

    6.2. External validity

    In our study we objectively compared the quality of existing identity merge algorithms. Although we included all concrete merge algorithms we knew of, any new algorithm can easily be added to our comparison tool.

    The set of selected software projects may be biased and not necessarily representative. We cannot guarantee that the obtained results are generalisable to other projects, but the results we obtained for the studied projects are quite similar, increasing our confidence that they will remain valid for other projects as well. To facilitate replication, and to apply our comparison to other projects as well, our comparison tool is publicly available at its dedicated website¹³. Any researchers wishing to reproduce the results we obtained, or to apply the tool to other projects or other algorithms, can access the necessary information there.

    7. Future work

    Privacy issues can occur if a person involved in an open source project's evolution does not wish to be traced back from the repositories on which he or she has an account. A way to avoid this would consist in replacing identity labels by numbers, but this would come at the expense of the reproducibility of the study [17].
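Replacing identity labels by numbers can be sketched in Python as follows. The function name and the numbering scheme (consecutive integers in order of first appearance) are assumptions made for the example. Such a replacement would presumably be applied to the published results after merging, since the merge algorithms themselves rely on the textual content of the labels.

```python
def pseudonymise(labels):
    """Replace each distinct label by an integer, in order of first appearance.

    Returns the list of numbers and the label-to-number mapping, which has
    to be kept private for the replacement to protect anyone.
    """
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping) + 1)
    return [mapping[label] for label in labels], mapping
```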

    The identity merge tool presented in this article will be integrated in an encompassing framework for carrying out empirical studies on evolving OSS projects [22]. We aim to use this framework to study how persons involved in software projects organise themselves in structured groups (that may evolve over time), and how this influences the quality of the software project and process. The use of an identity merge algorithm will lead to more reliable and perhaps even different results when statistically analysing historical data of evolving software. Merging identities helps to identify the key actors in OSS communities, and how they interact. Because an identity merge algorithm reduces the number of false positives and negatives and intends to reveal the true person behind user accounts, it should be used before any empirical study involving software communities.

    Identity merge algorithms nearly always produce false positives, so a manual post-check is needed to refine merge results. We can try to reduce this post-treatment by taking into account project-specific or domain-specific rules and constraints. Another means to improve merge algorithms consists in taking timezones into account. Depending on the configuration of the e-mail provider or e-mail application, the mailer's timezone may be sent along with the mail content, which makes it possible to identify it.

    ¹³ http://forge.goeminne.eu/projects/herdsman
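When the timezone is transmitted, it typically appears as the UTC offset at the end of the Date header of a message. The Python sketch below reads that offset with the standard library; the function name is hypothetical, and how the offset would then be used as a merge criterion is left open.

```python
from email.utils import parsedate_to_datetime

def utc_offset_hours(date_header):
    """Return the UTC offset (in hours) announced in a mail Date header,
    or None when the header carries no usable offset (e.g. "-0000")."""
    moment = parsedate_to_datetime(date_header)
    offset = moment.utcoffset()
    return None if offset is None else offset.total_seconds() / 3600
```

Two accounts that consistently post from different offsets are less likely to belong to the same person, although travel and daylight saving time make this a weak signal at best.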

    Most of the false positives are due to the fact that logins or e-mail prefixes only contain a first name. Persons with the same first name may accidentally be merged because their e-mail addresses and bug tracker accounts are considered similar. We could adapt our algorithms to avoid these unnecessary merges.

    8. Conclusion

    Empirical research on evolving open source software systems requires the mining of software repositories (such as code repositories, mailing lists and bug repositories). To mine historical data about the development of software systems from these repositories, identity merge algorithms are needed. These algorithms make it possible to determine which identities in software repositories represent the same physical person. As such, it becomes possible to take into account the social dimension, by studying not only the evolution of the software artefacts themselves, but also the interaction with and between the persons producing, using and modifying these artefacts. This is an important prerequisite for studying how software project communities are structured, how they evolve over time, and how this may affect the quality of the project and product.

    This article presented different identity merge algorithms, applied them on different open source software projects, and compared their precision and recall. We can conclude that, if its parameter value is chosen well, the simple algorithm performs the best, closely followed by the improved algorithm. Our hypothesis is that members of developer communities tend to be conservative in the labels they use for their identities, making the simple approach fit well, and not requiring more sophisticated rules.

    While some algorithms clearly outperformed others, none of the studied algorithms obtained sufficiently high recall and precision. This is in line with the views of researchers who implemented some of these identity merge algorithms and stated that, even after fine-tuning the algorithm, a manual check is still needed to avoid false positives and negatives. Given this need for a manual postprocessing phase, high recall should be valued over high precision, since it is much more difficult to manually detect false negatives than to find false positives [15, 20].

    We also analysed how stable the identity merge algorithms are with varying values of their parameters. On the software projects studied, the accuracy of each algorithm may vary a lot depending on the parameter value, but a good choice of the parameter values makes it possible to optimise the precision and recall. We proposed an algorithm that improves upon the state-of-the-art by having a strong recall and being less sensitive to parameter variations. Nevertheless, further improvements remain necessary, as well as a validation on more software projects.

    Acknowledgments

    We thank Bram Adams and Alexander Serebrenik for the stimulating discussions and for pointing us to some very useful articles. We thank Sylvain Degrandsart and Nadège Thys for cross-checking our reference models. This research has been partially supported by FRFC research project 2.4515.09 financed by F.R.S-F.N.R.S, and ARC research project AUWB-08/12-UMH financed by the Ministère de la Communauté française - Direction générale de l'Enseignement non obligatoire et de la Recherche scientifique, Belgium.

    Vitae

    Mathieu Goeminne obtained the degree of Master in Computer Science in 2009. Since then he has been a Ph.D. student at the Software Engineering Lab of the University of Mons. He is involved in software evolution research and empirical studies on the evolution of open source software communities.

    Tom Mens obtained the degrees of Licentiate in Mathematics in 1992, Advanced Master in Computer Science in 1993 and PhD in Science in 1999 at the Vrije Universiteit Brussel. He was a postdoctoral fellow of the Fund for Scientific Research - Flanders (FWO) for three years. In 2003 he became a lecturer at the Université de Mons, where he founded and directs the Software Engineering laboratory. Since 2008 he has been a full professor.

    References

    [1] V. Basili, D. Rombach, K. Schneider, B. Kitchenham, D. Pfahl, R. Selby (Eds.), Empirical Software Engineering Issues. Critical Assessment and Future Directions, volume 4336 of Lecture Notes in Computer Science, Springer, 2007.

    [2] A. E. Hassan, A. Mockus, R. C. Holt, P. M. Johnson, Guest editors' introduction: Special issue on mining software repositories, IEEE Trans. Software Engineering 31 (2005) 426-428.

    [3] J. Fernandez-Ramil, A. Lozano, M. Wermelinger, A. Capiluppi, Empirical studies of open source evolution, in: T. Mens, S. Demeyer (Eds.), Software Evolution, Springer, 2008, pp. 263-288.

    [4] R. Milev, S. Muegge, M. Weiss, Design evolution of an open source project using an improved modularity metric, in: Proc. Open Source Ecosystems: Diverse Communities Interacting, 5th IFIP WG 2.13 International Conference on Open Source Systems, volume 299 of IFIP, Springer, 2009, pp. 20-33.

    [5] A. Mockus, R. T. Fielding, J. D. Herbsleb, Two case studies of open source software development: Apache and Mozilla, ACM Trans. Softw. Eng. Methodol. 11 (2002) 309-346.

    [6] G. Robles, J. M. Gonzalez-Barahona, I. Herraiz, Evolution of the core team of developers in libre software projects, in: Proc. IEEE International Working Conference on Mining Software Repositories (MSR), IEEE Computer Society, Washington, DC, USA, 2009, pp. 167-170.

    [7] M. Weiss, G. Moroiu, P. Zhao, Evolution of open source communities, in: E. Damiani, B. Fitzgerald, W. Scacchi, M. Scotto, G. Succi (Eds.), OSS, volume 203 of IFIP, Springer, 2006, pp. 21-32.

    [8] J. Gutsche, The Evolution of Open Source Communities: An Institutional Analysis, Technical Report, Technische Universitat Dresden, 2004.

    [9] J. Martinez-Romo, G. Robles, J. M. Gonzalez-Barahona, M. Ortuno-Perez, Using social network analysis techniques to study collaboration between a FLOSS community and a company, in: Open Source Development, Communities and Quality, volume 275, pp. 171-186.

    [10] M. F. Lungu, Towards reverse engineering software ecosystems, in: Proc. International Conference on Software Maintenance, IEEE, 2008, pp. 428-431.

    [11] R. Nia, C. Bird, P. T. Devanbu, V. Filkov, Validity of network analyses in open source projects, in: J. Whitehead, T. Zimmermann (Eds.), Proc. International Working Conference on Mining Software Repositories (MSR), IEEE, 2010, pp. 201-209.

    [12] A. Capiluppi, P. Lago, M. Morisio, Evidences in the evolution of OS projects through changelog analyses, in: Proc. 3rd Workshop on Open Source Software Engineering, pp. 19-24.

    [13] L. Yu, Mining change logs and release notes to understand software maintenance and evolution, CLEI Electron. J. 12 (2009).

    [14] K. Chen, S. R. Schach, L. Yu, J. Offutt, G. Z. Heller, Open-source change logs, Empirical Software Engineering 9 (2004) 197-210. doi:10.1023/B:EMSE.0000027779.70556.d0.

    [15] C. Bird, A. Gourley, P. T. Devanbu, M. Gertz, A. Swaminathan, Mining email social networks, in: S. Diehl, H. Gall, A. E. Hassan (Eds.), Proc. International Working Conference on Mining Software Repositories (MSR), ACM, 2006, pp. 137-143.

    [16] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys 33 (1999) 2001.

    [17] G. Robles, J. M. Gonzalez-Barahona, Developer identification methods for integrated data from various sources, in: Proc. International Working Conference on Mining Software Repositories (MSR), ACM, 2005, pp. 106-110.

    [18] F. Stephany, T. Mens, T. Gîrba, Maispion: a tool for analysing and visualising open source software developer communities, in: Proc. International Workshop on Smalltalk Technologies, ACM, New York, NY, USA, 2009, pp. 50-57.

    [19] W. Poncin, A. Serebrenik, M. van den Brand, Process mining software repositories, Proc. European Conference on Software Maintenance and Reengineering (2011) 5-14.

    [20] W. Poncin, Process mining software repositories, Master's thesis, Eindhoven University of Technology, 2010.

    [21] K.-J. Stol, M. A. Babar, Reporting empirical research in open source software: The state of practice, in: Proc. Open Source Ecosystems: Diverse Communities Interacting, 5th IFIP WG 2.13 International Conference on Open Source Systems, volume 299 of IFIP, Springer, 2009, pp. 156-169.

    [22] M. Goeminne, T. Mens, A framework for analysing and visualising open source software ecosystems, in: Proc. Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), ACM, New York, NY, USA, 2010, pp. 42-47.

    Appendix A. List of insignificant words

    Table A.3 contains a list of common words that are removed from identity labels (logins, names and e-mail addresses) during the normalisation process, as they do not offer interesting information to merge identities and give rise to false positives. This list has been obtained on the basis of common words in OSS development and a manual inspection of the analysed projects. Project-specific terms are shown in the fourth column. When applying a merge algorithm to yet another project, this list needs to be extended further.

    23

  • prefixes technical terms other terms project-specificmr. administrator spam evincemrs. admin. bug braseromiss support bugs bugzillams. development root gnomeprof. dev. mailing linuxpr. developer listdr. maint. contactir. maintainer projectrev. i18ning.jr.d.d.s.ph.d.capt.lt.

    Table A.3: Insignificant words occuring in identity labels
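The normalisation step that uses this list can be sketched in Python as follows. Only a subset of Table A.3 is included, and the tokenisation (lower-casing, then splitting on whitespace and commas) is an assumption made for the example. The function name is hypothetical.

```python
import re

# Subset of Table A.3, for illustration only.
INSIGNIFICANT = {
    "mr.", "mrs.", "dr.", "prof.", "ph.d.",
    "admin.", "dev.", "bugs", "list", "gnome",
}

def normalise(label, stop_words=INSIGNIFICANT):
    """Lower-case a label and drop the words listed as insignificant."""
    words = re.split(r"[\s,]+", label.lower())
    return " ".join(word for word in words if word and word not in stop_words)
```

A label that consists only of insignificant words normalises to the empty string, so callers may want to skip such labels rather than compare them.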

    24

  • (a) Brasero

    (b) Evince

    (c) Subversion

    Figure 2: Precision and recall for each parameter of each algorithm.

    25

  • (a) Simple (b) Improved

    (c) Bird (d) Bird extended

    Figure 3: Brasero Evolution of precision and recall with varying parametersof each algorithm. The higher the value of precision and recall, the better.

    26

  • (a) Simple (b) Improved

    (c) Bird (d) Bird extended

    Figure 4: Evince project

    27

  • (a) Simple (b) Improved

    (c) Bird (d) Bird extended

    Figure 5: Subversion project

    28